Gemini Executive Synthesis

`turbo3` decode performance for LLM inference on Apple Silicon (M1, M2 Pro, M5 Max), specifically addressing the 'decode cliff' at increasing context depths.

Technical Positioning
Achieving flat, high-performance `turbo3` decode ratios (0.90x+ of `q8_0`) across all context depths on Apple Silicon, minimizing performance degradation from memory access patterns.
SaaS Insight & Market Implications
This analysis identifies a critical performance bottleneck in `turbo3` decode on Apple Silicon: a 'decode cliff' as context depth grows, particularly severe on M1/M2 and initially attributed to centroid LUT constant memory accesses. Profiling confirms the constant memory LUT is a significant factor, costing roughly twice as much on M2 as on M5, but the core issue is the cumulative cost of LUT lookups and related operations rather than the constant cache in isolation. Targeted fixes such as 'Batched Extract' and a '4-Entry Magnitude LUT + Branchless Sign' substantially improve M2 Pro decode performance, demonstrating that micro-optimizations tuned to specific hardware generations are crucial. For B2B SaaS, consistent performance across diverse hardware, especially for long-context LLM inference, is a key differentiator; this deep dive into hardware-specific bottlenecks underscores the necessity of low-level optimization to unlock full performance potential and maintain a competitive advantage.
Proprietary Technical Taxonomy
turbo3 decode, data-dependent constant memory accesses, centroid LUT lookup, L2 cache pressure, decode ratio curve, q8_0, context depths, half cn[8] registers

Raw Developer Origin & Technical Request

GitHub Issue · Mar 26, 2026
Repo: TheTom/turboquant_plus
Experiment: Fused Q·Centroid compressed attention for turbo3 decode

## Problem

turbo3 decode has 82M data-dependent constant memory accesses per token (centroid LUT lookup). On M5 Max this costs ~9% speed. On M1 Max it causes a cliff from 84% to 39% of q8_0 as context grows — L2 cache pressure evicts centroid entries to main memory.

## Current decode ratio curve (turbo3/q8_0)

| Depth | M5 Max | M1 Max |
|-------|--------|--------|
| short | 0.91x | 0.84x |
| 4K | 0.89x | ~0.64x |
| 8K | 0.86x | ~0.54x |
| 16K | — | ~0.45x |
| 32K | — | ~0.39x |

## The Fix

Eliminate constant memory from the FA decode inner loop. Three approaches to try:

### Approach 1: half cn[8] registers (16 bytes, may not spill)
Previous float cn[8] (32 bytes) spilled on Metal. Half-precision halves register pressure.

### Approach 2: Threadgroup centroid cache
Load 8 centroids to threadgroup memory once per threadgroup. Previous test was invalid (CPU fallback bug). Never tested on real Metal GPU.

### Approach 3: Per-block norm*centroid table
Precompute `cn_norm[c] = centroid[c] * norm` at block start. Inner loop becomes `score += cn_norm[idx] * q[j]`. Fresh 8-entry register array per block, maximally cache-friendly.

## Success criteria
- turbo3/q8_0 decode ratio stays FLAT across context depths (currently drops 0.91x to 0.72x on M5)
- If flat at 0.90x+ across all depths, the fix works
- PPL unchanged (dequant values identical, just accessed differently)

## Prior work
- buun's register LUT: 0.965x on CUDA, spilled to 0.879x on Metal (float[8])
- Split-LUT (2x4 half...

Developer Debate & Comments

TheTom • Mar 27, 2026
## M2 Pro Results: Bit-Arithmetic Dequant

**Hardware:** Apple M2 Pro, Apple8 (1008), has_tensor=false, 32GB
**Model:** Qwen2.5-7B-Instruct-Q4_K_M
**Build:** experiment/m1-m2-decode-comparison (auto-detected bit-arithmetic)

### Decode Speed (tok/s)

| Depth | q8_0 | turbo3 (bit-arith) | Ratio | turbo3 (const LUT, earlier diag) | Ratio |
|-------|------|-------------------|-------|----------------------------------|-------|
| short | 32.5 | 23.2 | 0.714x | 34.5 | 0.837x |
| 4K | 26.0 | 15.7 | 0.604x | 20.4 | 0.640x |
| 8K | 22.1 | 11.6 | 0.525x | 14.8 | 0.538x |
| 16K | 17.2 | 8.0 | 0.465x | 9.4 | 0.454x |

### Conclusion

**Bit-arithmetic did NOT fix the M2 decode cliff.** The ratio still degrades from 0.71x to 0.47x. Worse: bit-arithmetic is **slower than constant LUT at short context** (0.71x vs 0.84x) because the ALU cost exceeds M2's constant cache cost at low contention.

**Key finding: The M2 decode bottleneck is NOT the centroid LUT.** The constant cache is not the problem on ...
TheTom • Mar 27, 2026
## M2 Pro Results Update: Batched Extract IS a Win

True baseline comparison (same branch chain, same build):

| Depth | q8_0 | Main (const LUT) | Batched extract | Bit-arithmetic |
|-------|------|-----------------|-----------------|----------------|
| short | 32.5 | 22.9 | **23.7 (+3.5%)** | 23.2 |
| 8K | 22.1 | 10.95 | **13.7 (+25%)** | 11.6 |

Earlier diagnostic (34.5 short) was a different build/context allocation — not comparable.

**Batched extract gives +25% at 8K on M2 Pro.** The explicit bit-field pre-extraction helps Metal's compiler schedule device reads ahead of ALU on older hardware.

Next: profile where the remaining gap is (turbo3 0.62x vs q8_0 at 8K).
TheTom • Mar 27, 2026
## BREAKTHROUGH: Profiling isolation identifies exact bottleneck

Added TURBO_PROFILE_MODE (0-4) to strip away dequant layers one at a time.

### M5 Max vs M2 Pro at 8K context decode:

| Mode | What | M5 (% ceil) | M2 (% ceil) |
|------|------|------------|------------|
| 1 | No-op ceiling | 78.9 (100%) | 24.5 (100%) |
| 2 | + norm read | 75.1 (95%) | 22.1 (90%) |
| 4 | + all byte reads | 75.2 (95%) | 21.9 (89%) |
| 3 | + qs extraction + LUT | 64.9 (82%) | 16.4 (67%) |
| 0 | + signs + full LUT | 59.2 (75%) | 14.0 (57%) |
| q8_0 | baseline | 78.8 | 22.1 |

### Key findings:
1. **No-op turbo3 is FASTER than q8_0 on M2** (24.5 vs 22.1) — compressed cache = less bandwidth. The format is not the problem.
2. **Constant memory LUT is 2x worse on M2 than M5:**
   - Mode 4→3 (LUT cost): M5 loses 13.7%, M2 loses 25.1%
   - Mode 3→0 (signs + more LUT): M5 loses another 8.6%, M2 loses another 14.7%
3. **Byte reading is NOT the bottleneck** — Mode 4 (all reads, no LUT) only costs 10% on both.
4. ...
TheTom • Mar 27, 2026
## 4-Entry Magnitude LUT + Branchless Sign: BEST M2 RESULT

**Approach:** 4-entry constant half magnitude LUT (0.021-0.190) + XOR trick for reversed magnitude order + branchless sign multiply. Only 4 possible constant addresses per lookup instead of 8.

### M2 Pro decode improvement:

| Depth | q8_0 | Main (8-LUT) | 4-mag LUT | vs Main | vs q8_0 |
|-------|------|-------------|-----------|---------|---------|
| short | 32.5 | 22.9 | 23.8 | +3.9% | 0.732x |
| 8K | 21.9 | 10.95 | 15.1 | **+37.9%** | 0.689x |
| 16K | 17.2 | 8.0 | 11.6 | **+45.0%** | 0.674x |

### M5 Max (no regression):

| Depth | Main | 4-mag LUT | Delta |
|-------|------|-----------|-------|
| short | 77.4 | 75.7 | -2.2% |

PPL: 6.1756 (unchanged).

### Summary

+38-45% decode improvement on M2 Pro at long context. The ratio vs q8_0 improved from 0.45-0.50x to 0.67-0.73x. The cliff is much less severe. Minor regression on M5 (-2.2%) from the extra ALU (XOR + sign multiply). Could use the auto-detection to use 4-mag on ...
TheTom • Mar 27, 2026
## Final experiment results — dequant-level optimization ceiling reached

### Complete M2 Pro scoreboard (8K decode, q8_0 = 21.9 tok/s):

| # | Approach | tok/s | vs q8_0 | vs Main | Const addrs |
|---|----------|-------|---------|---------|-------------|
| — | No-op ceiling | 24.5 | 1.119x | — | 0 |
| **1** | **4-mag LUT + per-elem norm** | **15.1** | **0.689x** | **+38%** | **4** |
| 2 | Batched extract (8-LUT) | 13.7 | 0.626x | +25% | 8 |
| 3 | Deferred norm (4-mag) | 12.9 | 0.589x | +18% | 4 |
| 4 | 2-pair half2 LUT | 12.0 | 0.548x | +10% | 2 |
| 5 | Select chain (zero LUT) | 11.9 | 0.544x | +9% | 0 |
| 6 | Bit-arithmetic | 11.6 | 0.530x | +6% | 0 |
| — | Main (8-entry LUT) | 10.95 | 0.500x | baseline | 8 |
| 7 | Non-vec forced (nl=2) | 10.2 | 0.466x | -7% | 8 |

### Key insight: 4 constant addresses is the sweet spot on M2 Pro
- **0 addresses** (select chain, bit-arith): ALU cost exceeds constant cache savings
- **2 addresses** (half2 pairs): ternary overhead exceeds savings from ...

Adjacent Repository Pain Points

Other highly discussed features and pain points extracted from TheTom/turboquant_plus.

Extracted Positioning
turbo3 quantization for LLM KV cache compression
Achieving 4.6x compression with quality (perplexity, KL divergence, NIAH) comparable to q8_0 (within 2% PPL) and superior to q4_0, while maintaining high inference speed.
Top Replies
TheTom • Mar 25, 2026
## CRITICAL: Perplexity test reveals quality failure

| Cache | PPL | vs f16 |
|-------|-----|--------|
| f16 | 6.121 | baseline |
| q8_0 | 6.111 | -0.16% |
| q4_0 | 6.142 | +0.34% |
| **turbo3** | ...
TheTom • Mar 25, 2026
## Root causes found

### 1. V cache in rotated space

Python verification: dequant output has cosine=0.02 with input (garbage). After inverse rotation: cosine=0.987 (correct). V cache values MUST be...
TheTom • Mar 25, 2026
## QUALITY FIXED ✅

Perplexity with inverse rotation restored in dequant:

| Cache | PPL | vs q8_0 |
|-------|-----|---------|
| f16 | 6.121 | — |
| q8_0 | 6.111 | baseline |
| q4_0 | 6.142 | +0.5% ...
Extracted Positioning
TurboQuant (`-ctk turbo3 -ctv turbo3`) integration with Vulkan devices for LLM inference.
Achieving broad hardware compatibility for TurboQuant, specifically extending to Vulkan-enabled AMD GPUs.
Extracted Positioning
TurboQuant (turbo3 and turbo4) performance optimization for LLM inference, specifically on Apple M1 hardware.
Achieving superior LLM inference speed (tokens/sec) through TurboQuant optimizations on Apple Silicon (M1).
Extracted Positioning
TurboQuant's quantization strategy, specifically regarding K/V norm disparity, attention quantization methods (MSE vs. Prod), and outlier detection (dynamic vs. fixed).
Advancing TurboQuant's quantization efficacy to achieve lower perplexity (PPL) and higher compression (lower average bit rates) through refined techniques.
Extracted Positioning
turbo3 and turbo4 quantization implementation, specifically related to block size changes and kernel instantiation.
Ensuring correct and robust implementation of different quantization schemes (turbo3, turbo4) across varying block sizes and head dimensions, preventing data corruption and out-of-bounds access.

Engagement Signals

Replies: 6
Issue Status: open
Cross-Market Term Frequency

Quantifies the cross-market adoption of foundational terms like Metal and q8_0 by tracking occurrence frequency across active SaaS architectures and enterprise developer debates.