`turbo3` decode performance for LLM inference on Apple Silicon (M1, M2 Pro, M5 Max), specifically addressing the 'decode cliff' at increasing context depths.
GitHub Issue
Mar 26, 2026
## Problem
turbo3 decode performs 82M data-dependent constant memory accesses per token (the centroid LUT lookup). On M5 Max this costs ~9% of decode speed relative to q8_0. On M1 Max it causes a cliff from 84% to 39% of q8_0 speed as context grows: L2 cache pressure evicts centroid entries to main memory.
## Current decode ratio curve (turbo3/q8_0)
| Depth | M5 Max | M1 Max |
|-------|--------|--------|
| short | 0.91x | 0.84x |
| 4K | 0.89x | ~0.64x |
| 8K | 0.86x | ~0.54x |
| 16K | — | ~0.45x |
| 32K | — | ~0.39x |
## The Fix
Eliminate constant memory accesses from the flash-attention (FA) decode inner loop. Three approaches to try:
### Approach 1: half cn[8] registers (16 bytes, may not spill)
Previous float cn[8] (32 bytes) spilled on Metal. Half-precision halves register pressure.
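To make the size argument concrete, here is a minimal host-side C++ sketch (not Metal code) that checks the 32-byte versus 16-byte footprint and measures what a half round trip does to a set of centroids. The helpers `float_to_half` and `half_to_float`, the function `max_rel_error`, and the values in `kExampleCentroids` are hypothetical and exist only for illustration; a real kernel would use Metal's native `half` type. One thing the sketch makes visible: half keeps 11 significant bits, so dequantized values stay bit-identical only if the centroids happen to be exactly representable in half.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>

// Footprint of an 8-entry table: float vs. 16-bit half storage.
static_assert(8 * sizeof(float) == 32, "float cn[8] is 32 bytes");
static_assert(8 * sizeof(uint16_t) == 16, "half cn[8] is 16 bytes");

// Hypothetical software float -> binary16 conversion (round to nearest even).
// Sketch only: values below the half normal range flush to zero.
inline uint16_t float_to_half(float f) {
    uint32_t x;
    std::memcpy(&x, &f, sizeof x);
    const uint32_t sign = (x >> 16) & 0x8000u;
    const int32_t  exp  = static_cast<int32_t>((x >> 23) & 0xFFu) - 127 + 15;
    const uint32_t mant = x & 0x7FFFFFu;
    if (exp <= 0)  return static_cast<uint16_t>(sign);            // flush to zero
    if (exp >= 31) return static_cast<uint16_t>(sign | 0x7C00u);  // overflow -> inf
    uint32_t h = sign | (static_cast<uint32_t>(exp) << 10) | (mant >> 13);
    const uint32_t rem = mant & 0x1FFFu;                          // dropped bits
    if (rem > 0x1000u || (rem == 0x1000u && (h & 1u))) ++h;       // nearest even
    return static_cast<uint16_t>(h);
}

// Hypothetical binary16 -> float conversion (subnormals treated as zero).
inline float half_to_float(uint16_t h) {
    const uint32_t sign = static_cast<uint32_t>(h & 0x8000u) << 16;
    const uint32_t exp  = (h >> 10) & 0x1Fu;
    const uint32_t mant = h & 0x3FFu;
    uint32_t x = sign;
    if (exp == 31)     x = sign | 0x7F800000u | (mant << 13);
    else if (exp != 0) x = sign | ((exp + 112u) << 23) | (mant << 13);
    float f;
    std::memcpy(&f, &x, sizeof f);
    return f;
}

// Made-up centroid values; the real turbo3 codebook is not given in the issue.
const std::array<float, 8> kExampleCentroids =
    {{-2.1f, -1.3f, -0.7f, -0.2f, 0.2f, 0.7f, 1.3f, 2.1f}};

// Largest relative error introduced by storing each centroid as half.
inline float max_rel_error(const std::array<float, 8>& centroids) {
    float worst = 0.0f;
    for (float c : centroids) {
        if (c == 0.0f) continue;
        const float back = half_to_float(float_to_half(c));
        worst = std::max(worst, std::fabs(back - c) / std::fabs(c));
    }
    return worst;
}
```

For values inside the normal half range, `max_rel_error` is bounded by 2^-11, which is the number to weigh against the PPL criterion before committing to half storage.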
### Approach 2: Threadgroup centroid cache
Load 8 centroids to threadgroup memory once per threadgroup. Previous test was invalid (CPU fallback bug). Never tested on real Metal GPU.
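The load-once pattern can be modeled on the CPU to check its structure before porting it to a kernel. The sketch below is a sequential C++ stand-in, not Metal code: `run_threadgroup`, `kConstCentroids`, and the centroid values are hypothetical. On Metal, `tg_cache` would be declared in the `threadgroup` address space and the two phases would be separated by `threadgroup_barrier(mem_flags::mem_threadgroup)`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr std::size_t kNumCentroids = 8;  // 3-bit index -> 8 centroids

// Stand-in for the constant-memory LUT; values are made up for illustration.
static const float kConstCentroids[kNumCentroids] =
    {-2.1f, -1.3f, -0.7f, -0.2f, 0.2f, 0.7f, 1.3f, 2.1f};

// Sequential model of one threadgroup. `const_reads` counts how many times the
// constant LUT is touched, so the "once per threadgroup" claim can be checked.
std::vector<float> run_threadgroup(const std::vector<uint8_t>& idx,
                                   std::size_t threads, int* const_reads) {
    assert(threads >= kNumCentroids);  // need at least one loader per centroid
    float tg_cache[kNumCentroids];

    // Phase 1: the first 8 threads each copy one centroid into the cache.
    for (std::size_t tid = 0; tid < threads; ++tid) {
        if (tid < kNumCentroids) {
            tg_cache[tid] = kConstCentroids[tid];
            ++*const_reads;
        }
    }

    // (A threadgroup barrier would go here on the GPU.)

    // Phase 2: every thread dequantizes its strided elements from the cache only.
    std::vector<float> out(idx.size());
    for (std::size_t tid = 0; tid < threads; ++tid) {
        for (std::size_t i = tid; i < idx.size(); i += threads) {
            assert(idx[i] < kNumCentroids);
            out[i] = tg_cache[idx[i]];
        }
    }
    return out;
}
```

In this model the constant LUT is read exactly 8 times per threadgroup regardless of how many elements are dequantized, which is the property a real Metal test would need to confirm.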
### Approach 3: Per-block norm*centroid table
Precompute `cn_norm[c] = centroid[c] * norm` at block start. Inner loop becomes `score += cn_norm[idx] * q[j]`. Fresh 8-entry register array per block, maximally cache-friendly.
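A minimal C++ sketch of the precompute described above, placed next to the baseline it would replace. The block layout is an assumption: `Turbo3Block`, a block size of 32, and indices stored one per byte (rather than packed 3-bit) are all hypothetical simplifications, as are the function names.

```cpp
#include <cassert>
#include <cstdint>

constexpr int kNumCentroids = 8;   // 3-bit index -> 8 centroids
constexpr int kBlockSize    = 32;  // hypothetical block size

// Hypothetical unpacked block: one norm plus one centroid index per element.
struct Turbo3Block {
    float   norm;
    uint8_t idx[kBlockSize];
};

// Baseline: look up the shared centroid table for every element.
inline float dot_block_lut(const float* centroid, const Turbo3Block& b,
                           const float* q) {
    float score = 0.0f;
    for (int j = 0; j < kBlockSize; ++j) {
        assert(b.idx[j] < kNumCentroids);
        score += (centroid[b.idx[j]] * b.norm) * q[j];
    }
    return score;
}

// Approach 3: build cn_norm[c] = centroid[c] * norm once at block start,
// then the inner loop only touches the small local array.
inline float dot_block_cn_norm(const float* centroid, const Turbo3Block& b,
                               const float* q) {
    float cn_norm[kNumCentroids];
    for (int c = 0; c < kNumCentroids; ++c) {
        cn_norm[c] = centroid[c] * b.norm;
    }
    float score = 0.0f;
    for (int j = 0; j < kBlockSize; ++j) {
        assert(b.idx[j] < kNumCentroids);
        score += cn_norm[b.idx[j]] * q[j];
    }
    return score;
}
```

Both functions form the same product `centroid[c] * norm` before multiplying by `q[j]`, so in this sketch the dequantized values match the baseline. Whether the 8-entry array stays in registers on Metal is the open question the approach is meant to test.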
## Success criteria
- turbo3/q8_0 decode ratio stays flat across context depths (currently drops from 0.91x to 0.72x on M5)
- If flat at 0.90x+ across all depths, the fix works
- PPL unchanged (dequant values identical, just accessed differently)
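The ratio criteria above can be turned into a small check to run over benchmark output. This is a sketch only: `FlatnessReport` and `check_flat` are hypothetical names, and the sample vectors reuse the ratios from the decode ratio table (the M1 Max entries are approximate in the source).

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

struct FlatnessReport {
    float min_ratio;  // worst turbo3/q8_0 ratio across depths
    float max_ratio;  // best turbo3/q8_0 ratio across depths
    bool  passes;     // true if every depth is at or above the floor
};

// Applies the "flat at 0.90x+ across all depths" criterion to measured ratios.
inline FlatnessReport check_flat(const std::vector<float>& ratios,
                                 float floor = 0.90f) {
    assert(!ratios.empty());
    const auto mm = std::minmax_element(ratios.begin(), ratios.end());
    FlatnessReport r;
    r.min_ratio = *mm.first;
    r.max_ratio = *mm.second;
    r.passes    = r.min_ratio >= floor;
    return r;
}

// Current ratios from the table: short, 4K, 8K (and 16K, 32K on M1 Max).
const std::vector<float> kM5MaxRatios = {0.91f, 0.89f, 0.86f};
const std::vector<float> kM1MaxRatios = {0.84f, 0.64f, 0.54f, 0.45f, 0.39f};
```

With the current numbers neither device passes; a successful fix would make `check_flat` return `passes == true` for both.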
## Prior work
- buun's register LUT: 0.965x on CUDA, spilled to 0.879x on Metal (float[8])
- Split-LUT (2x4 half...