Gemini Executive Synthesis

`turbo3` decode performance for LLM inference on Apple Silicon (M1, M2 Pro, M5 Max), specifically addressing the 'decode cliff' at increasing context depths.

Technical Positioning
Achieving flat, high-performance `turbo3` decode ratios (0.90x+ of `q8_0`) across all context depths on Apple Silicon, minimizing performance degradation from memory access patterns.
SaaS Insight & Market Implications
This analysis identifies a critical performance bottleneck in `turbo3` decode on Apple Silicon: a 'decode cliff' as context depth grows, particularly severe on M1/M2 and initially attributed to centroid LUT constant memory accesses. Profiling confirms the constant memory LUT is a significant factor, costing roughly twice as much on M2 as on M5, but the core issue is the cumulative cost of LUT lookups and related operations rather than the constant cache in isolation. Targeted fixes such as 'Batched Extract' and a '4-Entry Magnitude LUT + Branchless Sign' substantially improve M2 Pro decode performance, demonstrating that micro-optimizations tuned to specific hardware generations are crucial. For B2B SaaS, consistent performance across diverse hardware, especially for long-context LLM inference, is a key differentiator; this deep dive into hardware-specific bottlenecks underscores the necessity of low-level optimization to unlock full performance potential and maintain a competitive advantage.
Proprietary Technical Taxonomy
turbo3 decode, data-dependent constant memory accesses, centroid LUT lookup, L2 cache pressure, decode ratio curve, q8_0, context depths, half cn[8] registers

Raw Developer Origin & Technical Request

GitHub Issue · Mar 26, 2026
Repo: TheTom/turboquant_plus
Experiment: Fused Q·Centroid compressed attention for turbo3 decode

## Problem

turbo3 decode has 82M data-dependent constant memory accesses per token (centroid LUT lookup). On M5 Max this costs ~9% speed. On M1 Max it causes a cliff from 84% to 39% of q8_0 as context grows — L2 cache pressure evicts centroid entries to main memory.

## Current decode ratio curve (turbo3/q8_0)

| Depth | M5 Max | M1 Max |
|-------|--------|--------|
| short | 0.91x | 0.84x |
| 4K | 0.89x | ~0.64x |
| 8K | 0.86x | ~0.54x |
| 16K | — | ~0.45x |
| 32K | — | ~0.39x |

## The Fix

Eliminate constant memory from the FA decode inner loop. Three approaches to try:

### Approach 1: half cn[8] registers (16 bytes, may not spill)
Previous float cn[8] (32 bytes) spilled on Metal. Half-precision halves register pressure.

### Approach 2: Threadgroup centroid cache
Load 8 centroids to threadgroup memory once per threadgroup. Previous test was invalid (CPU fallback bug). Never tested on real Metal GPU.

### Approach 3: Per-block norm*centroid table
Precompute `cn_norm[c] = centroid[c] * norm` at block start. Inner loop becomes `score += cn_norm[idx] * q[j]`. Fresh 8-entry register array per block, maximally cache-friendly.

## Success criteria
- turbo3/q8_0 decode ratio stays FLAT across context depths (currently drops 0.91x to 0.72x on M5)
- If flat at 0.90x+ across all depths, the fix works
- PPL unchanged (dequant values identical, just accessed differently)

## Prior work
- buun's register LUT: 0.965x on CUDA, spilled to 0.879x on Metal (float[8])
- Split-LUT (2x4 half...

Developer Debate & Comments

TheTom • Mar 27, 2026
## M2 Pro Results: Bit-Arithmetic Dequant

**Hardware:** Apple M2 Pro, Apple8 (1008), has_tensor=false, 32GB
**Model:** Qwen2.5-7B-Instruct-Q4_K_M
**Build:** experiment/m1-m2-decode-comparison (auto-detected bit-arithmetic)

### Decode Speed (tok/s)

| Depth | q8_0 | turbo3 (bit-arith) | Ratio | turbo3 (const LUT, earlier diag) | Ratio |
|-------|------|-------------------|-------|----------------------------------|-------|
| short | 32.5 | 23.2 | 0.714x | 34.5 | 0.837x |
| 4K | 26.0 | 15.7 | 0.604x | 20.4 | 0.640x |
| 8K | 22.1 | 11.6 | 0.525x | 14.8 | 0.538x |
| 16K | 17.2 | 8.0 | 0.465x | 9.4 | 0.454x |

### Conclusion

**Bit-arithmetic did NOT fix the M2 decode cliff.** The ratio still degrades from 0.71x to 0.47x. Worse: bit-arithmetic is **slower than constant LUT at short context** (0.71x vs 0.84x) because the ALU cost exceeds M2's constant cache cost at low contention.

**Key finding: The M2 decode bottleneck is NOT the centroid LUT.** The constant cache is not the problem on ...
TheTom • Mar 27, 2026
## M2 Pro Results Update: Batched Extract IS a Win

True baseline comparison (same branch chain, same build):

| Depth | q8_0 | Main (const LUT) | Batched extract | Bit-arithmetic |
|-------|------|-----------------|-----------------|----------------|
| short | 32.5 | 22.9 | **23.7 (+3.5%)** | 23.2 |
| 8K | 22.1 | 10.95 | **13.7 (+25%)** | 11.6 |

Earlier diagnostic (34.5 short) was a different build/context allocation — not comparable.

**Batched extract gives +25% at 8K on M2 Pro.** The explicit bit-field pre-extraction helps Metal's compiler schedule device reads ahead of ALU on older hardware.

Next: profile where the remaining gap is (turbo3 0.62x vs q8_0 at 8K).
TheTom • Mar 27, 2026
## BREAKTHROUGH: Profiling isolation identifies exact bottleneck

Added TURBO_PROFILE_MODE (0-4) to strip away dequant layers one at a time.

### M5 Max vs M2 Pro at 8K context decode:

| Mode | What | M5 (% ceil) | M2 (% ceil) |
|------|------|------------|------------|
| 1 | No-op ceiling | 78.9 (100%) | 24.5 (100%) |
| 2 | + norm read | 75.1 (95%) | 22.1 (90%) |
| 4 | + all byte reads | 75.2 (95%) | 21.9 (89%) |
| 3 | + qs extraction + LUT | 64.9 (82%) | 16.4 (67%) |
| 0 | + signs + full LUT | 59.2 (75%) | 14.0 (57%) |
| q8_0 | baseline | 78.8 | 22.1 |

### Key findings:
1. **No-op turbo3 is FASTER than q8_0 on M2** (24.5 vs 22.1) — compressed cache = less bandwidth. The format is not the problem.
2. **Constant memory LUT is 2x worse on M2 than M5:**
   - Mode 4→3 (LUT cost): M5 loses 13.7%, M2 loses 25.1%
   - Mode 3→0 (signs + more LUT): M5 loses another 8.6%, M2 loses another 14.7%
3. **Byte reading is NOT the bottleneck** — Mode 4 (all reads, no LUT) only costs 10% on both.
4. ...
TheTom • Mar 27, 2026
## 4-Entry Magnitude LUT + Branchless Sign: BEST M2 RESULT

**Approach:** 4-entry constant half magnitude LUT (0.021-0.190) + XOR trick for reversed magnitude order + branchless sign multiply. Only 4 possible constant addresses per lookup instead of 8.

### M2 Pro decode improvement:

| Depth | q8_0 | Main (8-LUT) | 4-mag LUT | vs Main | vs q8_0 |
|-------|------|-------------|-----------|---------|---------|
| short | 32.5 | 22.9 | 23.8 | +3.9% | 0.732x |
| 8K | 21.9 | 10.95 | 15.1 | **+37.9%** | 0.689x |
| 16K | 17.2 | 8.0 | 11.6 | **+45.0%** | 0.674x |

### M5 Max (no regression):

| Depth | Main | 4-mag LUT | Delta |
|-------|------|-----------|-------|
| short | 77.4 | 75.7 | -2.2% |

PPL: 6.1756 (unchanged).

### Summary

+38-45% decode improvement on M2 Pro at long context. The ratio vs q8_0 improved from 0.45-0.50x to 0.67-0.73x. The cliff is much less severe. Minor regression on M5 (-2.2%) from the extra ALU (XOR + sign multiply). Could use the auto-detection to use 4-mag on ...
TheTom • Mar 27, 2026
## Final experiment results — dequant-level optimization ceiling reached

### Complete M2 Pro scoreboard (8K decode, q8_0 = 21.9 tok/s):

| # | Approach | tok/s | vs q8_0 | vs Main | Const addrs |
|---|----------|-------|---------|---------|-------------|
| — | No-op ceiling | 24.5 | 1.119x | — | 0 |
| **1** | **4-mag LUT + per-elem norm** | **15.1** | **0.689x** | **+38%** | **4** |
| 2 | Batched extract (8-LUT) | 13.7 | 0.626x | +25% | 8 |
| 3 | Deferred norm (4-mag) | 12.9 | 0.589x | +18% | 4 |
| 4 | 2-pair half2 LUT | 12.0 | 0.548x | +10% | 2 |
| 5 | Select chain (zero LUT) | 11.9 | 0.544x | +9% | 0 |
| 6 | Bit-arithmetic | 11.6 | 0.530x | +6% | 0 |
| — | Main (8-entry LUT) | 10.95 | 0.500x | baseline | 8 |
| 7 | Non-vec forced (nl=2) | 10.2 | 0.466x | -7% | 8 |

### Key insight: 4 constant addresses is the sweet spot on M2 Pro
- **0 addresses** (select chain, bit-arith): ALU cost exceeds constant cache savings
- **2 addresses** (half2 pairs): ternary overhead exceeds savings from ...

Adjacent Repository Pain Points

Other highly discussed features and pain points extracted from TheTom/turboquant_plus.

Extracted Positioning
turbo3 quantization for LLM KV cache compression
Achieving 4.6x compression with quality (perplexity, KL divergence, NIAH) comparable to q8_0 (within 2% PPL) and superior to q4_0, while maintaining high inference speed.
Top Replies
TheTom • Mar 25, 2026
## CRITICAL: Perplexity test reveals quality failure

| Cache | PPL | vs f16 |
|-------|-----|--------|
| f16 | 6.121 | baseline |
| q8_0 | 6.111 | -0.16% |
| q4_0 | 6.142 | +0.34% |
| **turbo3** | ...
TheTom • Mar 25, 2026
## Root causes found

### 1. V cache in rotated space

Python verification: dequant output has cosine=0.02 with input (garbage). After inverse rotation: cosine=0.987 (correct). V cache values MUST be...
TheTom • Mar 25, 2026
## QUALITY FIXED ✅

Perplexity with inverse rotation restored in dequant:

| Cache | PPL | vs q8_0 |
|-------|-----|---------|
| f16 | 6.121 | — |
| q8_0 | 6.111 | baseline |
| q4_0 | 6.142 | +0.5% ...
Extracted Positioning
TurboQuant (`-ctk turbo3 -ctv turbo3`) integration with Vulkan devices for LLM inference.
Achieving broad hardware compatibility for TurboQuant, specifically extending to Vulkan-enabled AMD GPUs.
Extracted Positioning
TurboQuant (turbo3 and turbo4) performance optimization for LLM inference, specifically on Apple M1 hardware.
Achieving superior LLM inference speed (tokens/sec) through TurboQuant optimizations on Apple Silicon (M1).
Extracted Positioning
TurboQuant's quantization strategy, specifically regarding K/V norm disparity, attention quantization methods (MSE vs. Prod), and outlier detection (dynamic vs. fixed).
Advancing TurboQuant's quantization efficacy to achieve lower perplexity (PPL) and higher compression (lower average bit rates) through refined techniques.
Extracted Positioning
turbo3 and turbo4 quantization implementation, specifically related to block size changes and kernel instantiation.
Ensuring correct and robust implementation of different quantization schemes (turbo3, turbo4) across varying block sizes and head dimensions, preventing data corruption and out-of-bounds access.

Engagement Signals

Replies: 6
Issue Status: open
Cross-Market Term Frequency

Quantifies the cross-market adoption of foundational terms like Metal and q8_0 by tracking occurrence frequency across active SaaS architectures and enterprise developer debates.