`turbo3` decode performance for LLM inference on Apple Silicon (M1, M2 Pro, M5 Max), specifically addressing the 'decode cliff' at increasing context depths.
Raw Developer Origin & Technical Request
GitHub Issue
Mar 26, 2026
## Problem
turbo3 decode has 82M data-dependent constant memory accesses per token (centroid LUT lookup). On M5 Max this costs ~9% speed. On M1 Max it causes a cliff from 84% to 39% of q8_0 as context grows — L2 cache pressure evicts centroid entries to main memory.
## Current decode ratio curve (turbo3/q8_0)
| Depth | M5 Max | M1 Max |
|-------|--------|--------|
| short | 0.91x | 0.84x |
| 4K | 0.89x | ~0.64x |
| 8K | 0.86x | ~0.54x |
| 16K | — | ~0.45x |
| 32K | — | ~0.39x |
## The Fix
Eliminate constant memory from the FA decode inner loop. Three approaches to try:
### Approach 1: half cn[8] registers (16 bytes, may not spill)
Previous float cn[8] (32 bytes) spilled on Metal. Half-precision halves register pressure.
### Approach 2: Threadgroup centroid cache
Load 8 centroids to threadgroup memory once per threadgroup. Previous test was invalid (CPU fallback bug). Never tested on real Metal GPU.
### Approach 3: Per-block norm*centroid table
Precompute `cn_norm[c] = centroid[c] * norm` at block start. Inner loop becomes `score += cn_norm[idx] * q[j]`. Fresh 8-entry register array per block, maximally cache-friendly.
## Success criteria
- turbo3/q8_0 decode ratio stays FLAT across context depths (currently drops 0.91x to 0.72x on M5)
- If flat at 0.90x+ across all depths, the fix works
- PPL unchanged (dequant values identical, just accessed differently)
## Prior work
- buun's register LUT: 0.965x on CUDA, spilled to 0.879x on Metal (float[8])
- Split-LUT (2x4 half...
Developer Debate & Comments
Adjacent Repository Pain Points
Other highly discussed features and pain points extracted from TheTom/turboquant_plus.
Frequently Asked Questions
Market intelligence mapped to `turbo3` decode performance for LLM inference on Apple Silicon (M1, M2 Pro, M5 Max), specifically addressing the 'decode cliff' at increasing context depths..
How is `turbo3` decode performance for LLM inference on Apple Silicon (M1, M2 Pro, M5 Max), specifically addressing the 'decode cliff' at increasing context depths. positioned in the market?
How is the developer community reacting to `turbo3` decode performance for LLM inference on Apple Silicon (M1, M2 Pro, M5 Max), specifically addressing the 'decode cliff' at increasing context depths.?
Which technical concepts are associated with `turbo3` decode performance for LLM inference on Apple Silicon (M1, M2 Pro, M5 Max), specifically addressing the 'decode cliff' at increasing context depths.?
Engagement Signals
Cross-Market Term Frequency
Quantifies the cross-market adoption of foundational terms like Metal and q8_0 by tracking occurrence frequency across active SaaS architectures and enterprise developer debates.
SaaS Metrics