Gemini Executive Synthesis

turbo3 quantization for LLM KV cache compression

Technical Positioning
Achieving 4.6× KV-cache compression with quality (perplexity, KL divergence, NIAH) comparable to q8_0 (within 2% PPL) and compression superior to q4_0, while maintaining high inference speed.
SaaS Insight & Market Implications
Initial speed claims for turbo3 quantization were invalid, as the model produced nonsensical output due to critical implementation bugs. Specifically, V cache values were not inverse-rotated, and `dynamic_cast` failures prevented Q/V rotations in MoE models, leading to garbage results despite fast processing. This highlights the critical need for robust quality validation (e.g., perplexity) to prevent misleading performance metrics. The fix restored quality, achieving 1.4% perplexity degradation versus q8_0 at 4.6x compression, meeting the 2% quality gate. However, this came at the cost of speed, necessitating re-optimization. The market implication is that raw speed metrics without rigorous quality benchmarks are meaningless; reliable performance requires balancing aggressive compression with validated output fidelity, especially for complex architectures like MoE. Further validation with NIAH is still required.
Proprietary Technical Taxonomy
perplexity · KL divergence · NIAH benchmarks · f16 · q8_0 · q4_0 · q4_1 · q5_0

Raw Developer Origin & Technical Request

GitHub Issue · Mar 25, 2026
Repo: TheTom/turboquant_plus
Quality validation: perplexity, KL divergence, and NIAH benchmarks

## Supersedes #24

We claim 4.6× compression at 91-97% speed. But we have ZERO quantitative quality data on the llama.cpp build.

## Required benchmarks (in priority order):

### 1. Perplexity (wikitext-2)
- f16, q8_0, q4_0, q4_1, q5_0, turbo3
- Target: turbo3 within 1% of q8_0
- If >2% worse: quality problem
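
The 1%/2% gate above is just a relative-PPL check. A minimal sketch (the figures are illustrative, taken from results posted later in this thread):

```python
def ppl_degradation_pct(ppl: float, baseline_ppl: float) -> float:
    """Relative perplexity increase versus a baseline cache type, in percent."""
    return (ppl - baseline_ppl) / baseline_ppl * 100.0

# Example: turbo3 vs q8_0, using figures reported later in this issue
deg = ppl_degradation_pct(6.194, 6.111)   # ~1.36%
print(f"turbo3 vs q8_0: +{deg:.1f}%")
print("quality gate:", "PASS" if deg <= 2.0 else "FAIL (>2% worse)")
```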

### 2. KL Divergence vs f16
- Required by llama.cpp CONTRIBUTING.md for new quant types
- Metrics: mean KLD, delta-p RMS, same-top-p %
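
All three metrics can be computed from per-token logits of the f16 reference run and the quantized run. A sketch of what the metrics mean (not llama.cpp's actual tooling; shapes and names are assumptions):

```python
import numpy as np

def softmax(logits: np.ndarray) -> np.ndarray:
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def kl_metrics(logits_f16: np.ndarray, logits_quant: np.ndarray):
    """Token-level KL metrics between a reference (f16) and a quantized run.

    Inputs have shape (n_tokens, vocab).
    Returns (mean KLD, delta-p RMS, same-top-p %).
    """
    p = softmax(logits_f16)
    q = softmax(logits_quant)
    # Per-token KL(p || q)
    kld = np.sum(p * (np.log(p) - np.log(q)), axis=-1)
    # Probability shift of the reference's top token
    top = p.argmax(axis=-1)
    idx = np.arange(len(top))
    delta_p = q[idx, top] - p[idx, top]
    # How often the quantized run picks the same top token
    same_top = (q.argmax(axis=-1) == top).mean() * 100.0
    return kld.mean(), np.sqrt((delta_p ** 2).mean()), same_top
```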

### 3. Passkey Retrieval (NIAH)
- At 1K, 2K, 4K, 8K context lengths
- Prince Canuma got 6/6 at all lengths
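
A minimal sketch of how such a passkey test is typically constructed (a hypothetical helper, not the project's actual harness; token count is approximated at ~4 characters per token):

```python
import random

def build_passkey_prompt(context_tokens: int, passkey: int, seed: int = 0) -> str:
    """Bury a passkey sentence at a random depth inside filler text."""
    rng = random.Random(seed)
    filler_unit = "The grass is green. The sky is blue. The sun is yellow. "
    needle = f"The pass key is {passkey}. Remember it. "
    n_chars = context_tokens * 4                  # rough chars-per-token estimate
    filler = (filler_unit * (n_chars // len(filler_unit) + 1))[:n_chars]
    pos = rng.randrange(len(filler))              # random needle depth
    return filler[:pos] + needle + filler[pos:] + "\nWhat is the pass key? The pass key is"
```

The model passes a given context length if its completion contains the passkey; running this at 1K/2K/4K/8K with several needle depths each gives the N/N score quoted above.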

### 4. Generation Quality (qualitative)
- Side-by-side comparison

## Tracking
Full plan and results in `docs/quality-benchmarks.md`

Developer Debate & Comments

TheTom • Mar 25, 2026
## CRITICAL: Perplexity test reveals quality failure

| Cache | PPL | vs f16 |
|-------|-----|--------|
| f16 | 6.121 | baseline |
| q8_0 | 6.111 | -0.16% |
| q4_0 | 6.142 | +0.34% |
| **turbo3** | **165.6** | **+2607%** ❌ |

turbo3 perplexity is 27× worse than f16. The speed benchmarks were measuring how fast the model produces wrong answers. Root cause investigation needed. DO NOT update the README with speed claims until quality is fixed.

Suspected causes:
1. Norm mismatch: quantize stores the full 128-element group norm, but dequant uses it as a per-32-block norm
2. Pre-rotate-queries rotation matrix mismatch with the quantize rotation
3. 3-bit packing bug at block size 32
TheTom • Mar 25, 2026
## Root causes found

### 1. V cache in rotated space
Python verification: dequant output has cosine=0.02 with the input (garbage). After inverse rotation: cosine=0.987 (correct). V cache values MUST be inverse-rotated after attention.

### 2. dynamic_cast fails for MoE models
The Qwen 3.5 MoE uses `llama_memory_hybrid_context`, not `llama_kv_cache_context`. Our `dynamic_cast` returns null → Q rotation and V inverse rotation NEVER execute. ALL speed benchmarks ran with unrotated Q against rotated-space K/V — garbage results at fast speed.

### Why "coherent text" was misleading
Without any rotation applied, the raw quantize/dequant produces plausible-looking grammar but wrong content. Short conversations hide this; perplexity caught it.

### Fix needed
Store the rotation tensors in `llm_graph_context` directly (not behind a KV cache `dynamic_cast`). Then both Q rotation and V inverse rotation will work for ALL memory types.
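
The cosine-similarity verification described in that comment can be reproduced in miniature. A sketch with a random orthogonal rotation standing in for the real rotation tensors (quantization itself is omitted, so recovery is exact rather than cosine≈0.987):

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Stand-in for the real pipeline: an orthogonal rotation R is applied to V
# before quantization, so the stored cache values live in rotated space.
rng = np.random.default_rng(0)
d = 128
R, _ = np.linalg.qr(rng.standard_normal((d, d)))  # random orthogonal matrix
v = rng.standard_normal(d)                         # one V-cache row

v_stored = R @ v                   # what a rotated-space cache holds
print(cosine(v_stored, v))         # near 0: rotated space vs. input space
print(cosine(R.T @ v_stored, v))   # ~1.0: inverse rotation recovers the input
```

Reading back `v_stored` without applying `R.T` is exactly the failure mode above: values that look statistically plausible but are nearly orthogonal to the true activations.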
TheTom • Mar 25, 2026
## QUALITY FIXED ✅

Perplexity with inverse rotation restored in dequant:

| Cache | PPL | vs q8_0 |
|-------|-----|---------|
| f16 | 6.121 | — |
| q8_0 | 6.111 | baseline |
| q4_0 | 6.142 | +0.5% |
| **turbo3** | **6.194** | **+1.4%** |

turbo3 is within 1.4% of q8_0 perplexity — quality target met. Speed is back to ~10.7 tok/s (pre-optimization level) because the inverse rotation now sits in the dequant hot path. The pre-rotate-queries optimization needs to be reimplemented to work with the GQA head layout (ne[0]=256 for concatenated heads) and hybrid memory types.
Rotatingxenomorph • Mar 26, 2026
How is turbo3 being worse than q4 quality target met?
TheTom • Mar 26, 2026
Good question. turbo3 at +1.4% vs q8_0 is worse than q4_0 (+0.5%) on raw PPL, but the comparison isn't apples-to-apples:

- **q4_0** compresses the KV cache to 4 bits → 4× compression
- **turbo3** compresses to ~3.5 bits → 4.6× compression
- **q8_0** is 8 bits → baseline

turbo3 gives more compression than q4_0 while staying within the 2% quality gate set at the top of this issue. The target was "within 1% of q8_0; if >2% worse it's a quality problem" — 1.4% is in that range.

Also worth noting: since this issue was opened, a community contributor found a [norm correction](https://github.com/spiritbuun/llama-cpp-turboquant-cuda/commit/6b821a9) that brings turbo3 PPL even closer to q8_0 (now +1.1% on our Metal build). On CUDA with the same fix, turbo3 actually *beats* q8_0 PPL by 0.09%.

The real quality validation we're still missing is NIAH (needle-in-a-haystack retrieval at long context). PPL passing doesn't guarantee retrieval works; that's the next benchmark to run.
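
The compression figures in that reply follow directly from bits per cached value relative to f16 (a sketch; real quant formats carry small per-block scale overheads that this ignores):

```python
def compression_vs_f16(bits_per_value: float) -> float:
    """KV-cache compression ratio relative to a 16-bit (f16) cache."""
    return 16.0 / bits_per_value

print(f"q8_0:   {compression_vs_f16(8):.1f}x")    # 2.0x
print(f"q4_0:   {compression_vs_f16(4):.1f}x")    # 4.0x
print(f"turbo3: {compression_vs_f16(3.5):.1f}x")  # ~4.6x
```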

Adjacent Repository Pain Points

Other highly discussed features and pain points extracted from TheTom/turboquant_plus.

Extracted Positioning
`turbo3` decode performance for LLM inference on Apple Silicon (M1, M2 Pro, M5 Max), specifically addressing the 'decode cliff' at increasing context depths.
Achieving flat, high-performance `turbo3` decode ratios (0.90x+ of `q8_0`) across all context depths on Apple Silicon, minimizing performance degradation from memory access patterns.
Top Replies
TheTom • Mar 27, 2026
## M2 Pro Results: Bit-Arithmetic Dequant **Hardware:** Apple M2 Pro, Apple8 (1008), has_tensor=false, 32GB **Model:** Qwen2.5-7B-Instruct-Q4_K_M **Build:** experiment/m1-m2-decode-comparison (auto...
TheTom • Mar 27, 2026
## M2 Pro Results Update: Batched Extract IS a Win True baseline comparison (same branch chain, same build): | Depth | q8_0 | Main (const LUT) | Batched extract | Bit-arithmetic | |-------|------|-...
TheTom • Mar 27, 2026
## BREAKTHROUGH: Profiling isolation identifies exact bottleneck Added TURBO_PROFILE_MODE (0-4) to strip away dequant layers one at a time. ### M5 Max vs M2 Pro at 8K context decode: | Mode | What ...
Extracted Positioning
TurboQuant (`-ctk turbo3 -ctv turbo3`) integration with Vulkan devices for LLM inference.
Achieving broad hardware compatibility for TurboQuant, specifically extending to Vulkan-enabled AMD GPUs.
Extracted Positioning
TurboQuant (turbo3 and turbo4) performance optimization for LLM inference, specifically on Apple M1 hardware.
Achieving superior LLM inference speed (tokens/sec) through TurboQuant optimizations on Apple Silicon (M1).
Extracted Positioning
TurboQuant's quantization strategy, specifically regarding K/V norm disparity, attention quantization methods (MSE vs. Prod), and outlier detection (dynamic vs. fixed).
Advancing TurboQuant's quantization efficacy to achieve lower perplexity (PPL) and higher compression (lower average bit rates) through refined techniques.
Extracted Positioning
turbo3 and turbo4 quantization implementation, specifically related to block size changes and kernel instantiation.
Ensuring correct and robust implementation of different quantization schemes (turbo3, turbo4) across varying block sizes and head dimensions, preventing data corruption and out-of-bounds access.

Engagement Signals

9 replies · Issue status: open

Cross-Market Term Frequency

Quantifies the cross-market adoption of foundational terms like turbo3 and CUDA by tracking occurrence frequency across active SaaS architectures and enterprise developer debates.