Gemini Executive Synthesis

turbo3 quantization for LLM KV cache compression

Technical Positioning
Achieving 4.6× KV-cache compression with quality (perplexity, KL divergence, NIAH) comparable to q8_0 (within 2% PPL) and compression superior to q4_0, while maintaining high inference speed.
SaaS Insight & Market Implications
Initial speed claims for turbo3 quantization were invalid, as the model produced nonsensical output due to critical implementation bugs. Specifically, V cache values were not inverse-rotated, and `dynamic_cast` failures prevented Q/V rotations in MoE models, leading to garbage results despite fast processing. This highlights the critical need for robust quality validation (e.g., perplexity) to prevent misleading performance metrics. The fix restored quality, achieving 1.4% perplexity degradation versus q8_0 at 4.6x compression, meeting the 2% quality gate. However, this came at the cost of speed, necessitating re-optimization. The market implication is that raw speed metrics without rigorous quality benchmarks are meaningless; reliable performance requires balancing aggressive compression with validated output fidelity, especially for complex architectures like MoE. Further validation with NIAH is still required.
Proprietary Technical Taxonomy
perplexity · KL divergence · NIAH benchmarks · f16 · q8_0 · q4_0 · q4_1 · q5_0

Raw Developer Origin & Technical Request

GitHub Issue · Mar 25, 2026
Repo: TheTom/turboquant_plus
Quality validation: perplexity, KL divergence, and NIAH benchmarks

## Supersedes #24

We claim 4.6× compression at 91-97% speed. But we have ZERO quantitative quality data on the llama.cpp build.

## Required benchmarks (in priority order):

### 1. Perplexity (wikitext-2)
- f16, q8_0, q4_0, q4_1, q5_0, turbo3
- Target: turbo3 within 1% of q8_0
- If >2% worse: quality problem
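
The 1%/2% gate above is just a relative-PPL check. A minimal sketch (the figures are illustrative, taken from results posted later in this thread):

```python
def ppl_degradation_pct(ppl: float, baseline_ppl: float) -> float:
    """Relative perplexity increase versus a baseline cache type, in percent."""
    return (ppl - baseline_ppl) / baseline_ppl * 100.0

# Example: turbo3 vs q8_0, using figures reported later in this issue
deg = ppl_degradation_pct(6.194, 6.111)   # ~1.36%
print(f"turbo3 vs q8_0: +{deg:.1f}%")
print("quality gate:", "PASS" if deg <= 2.0 else "FAIL (>2% worse)")
```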

### 2. KL Divergence vs f16
- Required by llama.cpp CONTRIBUTING.md for new quant types
- Metrics: mean KLD, delta-p RMS, same-top-p %
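
All three metrics can be computed from per-token logits of the f16 reference run and the quantized run. A sketch of what the metrics mean (not llama.cpp's actual tooling; shapes and names are assumptions):

```python
import numpy as np

def softmax(logits: np.ndarray) -> np.ndarray:
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def kl_metrics(logits_f16: np.ndarray, logits_quant: np.ndarray):
    """Token-level KL metrics between a reference (f16) and a quantized run.

    Inputs have shape (n_tokens, vocab).
    Returns (mean KLD, delta-p RMS, same-top-p %).
    """
    p = softmax(logits_f16)
    q = softmax(logits_quant)
    # Per-token KL(p || q)
    kld = np.sum(p * (np.log(p) - np.log(q)), axis=-1)
    # Probability shift of the reference's top token
    top = p.argmax(axis=-1)
    idx = np.arange(len(top))
    delta_p = q[idx, top] - p[idx, top]
    # How often the quantized run picks the same top token
    same_top = (q.argmax(axis=-1) == top).mean() * 100.0
    return kld.mean(), np.sqrt((delta_p ** 2).mean()), same_top
```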

### 3. Passkey Retrieval (NIAH)
- At 1K, 2K, 4K, 8K context lengths
- Prince Canuma got 6/6 at all lengths
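
A minimal sketch of how such a passkey test is typically constructed (a hypothetical helper, not the project's actual harness; token count is approximated at ~4 characters per token):

```python
import random

def build_passkey_prompt(context_tokens: int, passkey: int, seed: int = 0) -> str:
    """Bury a passkey sentence at a random depth inside filler text."""
    rng = random.Random(seed)
    filler_unit = "The grass is green. The sky is blue. The sun is yellow. "
    needle = f"The pass key is {passkey}. Remember it. "
    n_chars = context_tokens * 4                  # rough chars-per-token estimate
    filler = (filler_unit * (n_chars // len(filler_unit) + 1))[:n_chars]
    pos = rng.randrange(len(filler))              # random needle depth
    return filler[:pos] + needle + filler[pos:] + "\nWhat is the pass key? The pass key is"
```

The model passes a given context length if its completion contains the passkey; running this at 1K/2K/4K/8K with several needle depths each gives the N/N score quoted above.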

### 4. Generation Quality (qualitative)
- Side-by-side comparison

## Tracking
Full plan and results in `docs/quality-benchmarks.md`

Developer Debate & Comments

TheTom • Mar 25, 2026
## CRITICAL: Perplexity test reveals quality failure

| Cache | PPL | vs f16 |
|-------|-----|--------|
| f16 | 6.121 | baseline |
| q8_0 | 6.111 | -0.16% |
| q4_0 | 6.142 | +0.34% |
| **turbo3** | **165.6** | **+2607%** ❌ |

turbo3 perplexity is 27× worse than f16. The speed benchmarks were measuring how fast the model produces wrong answers. Root cause investigation needed. DO NOT update the README with speed claims until quality is fixed.

Suspected causes:
1. Norm mismatch: quantize stores the full 128-element group norm, but dequant uses it as a per-32-block norm
2. Pre-rotate-queries rotation matrix mismatch with the quantize rotation
3. 3-bit packing bug at block size 32
TheTom • Mar 25, 2026
## Root causes found

### 1. V cache in rotated space
Python verification: dequant output has cosine=0.02 with the input (garbage). After inverse rotation: cosine=0.987 (correct). V cache values MUST be inverse-rotated after attention.

### 2. dynamic_cast fails for MoE models
The Qwen 3.5 MoE uses `llama_memory_hybrid_context`, not `llama_kv_cache_context`. Our `dynamic_cast` returns null → Q rotation and V inverse rotation NEVER execute. ALL speed benchmarks ran with unrotated Q against rotated-space K/V — garbage results at fast speed.

### Why "coherent text" was misleading
Without any rotation applied, the raw quantize/dequant produces plausible-looking grammar but wrong content. Short conversations hide this; perplexity caught it.

### Fix needed
Store the rotation tensors in `llm_graph_context` directly (not behind a KV cache `dynamic_cast`). Then both Q rotation and V inverse rotation will work for ALL memory types.
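
The cosine-similarity verification described in that comment can be reproduced in miniature. A sketch with a random orthogonal rotation standing in for the real rotation tensors (quantization itself is omitted, so recovery is exact rather than cosine≈0.987):

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Stand-in for the real pipeline: an orthogonal rotation R is applied to V
# before quantization, so the stored cache values live in rotated space.
rng = np.random.default_rng(0)
d = 128
R, _ = np.linalg.qr(rng.standard_normal((d, d)))  # random orthogonal matrix
v = rng.standard_normal(d)                         # one V-cache row

v_stored = R @ v                   # what a rotated-space cache holds
print(cosine(v_stored, v))         # near 0: rotated space vs. input space
print(cosine(R.T @ v_stored, v))   # ~1.0: inverse rotation recovers the input
```

Reading back `v_stored` without applying `R.T` is exactly the failure mode above: values that look statistically plausible but are nearly orthogonal to the true activations.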
TheTom • Mar 25, 2026
## QUALITY FIXED ✅

Perplexity with inverse rotation restored in dequant:

| Cache | PPL | vs q8_0 |
|-------|-----|---------|
| f16 | 6.121 | — |
| q8_0 | 6.111 | baseline |
| q4_0 | 6.142 | +0.5% |
| **turbo3** | **6.194** | **+1.4%** |

turbo3 is within 1.4% of q8_0 perplexity — quality target met. Speed is back to ~10.7 tok/s (pre-optimization level) because the inverse rotation now sits in the dequant hot path. The pre-rotate-queries optimization needs to be reimplemented to work with the GQA head layout (ne[0]=256 for concatenated heads) and hybrid memory types.
Rotatingxenomorph • Mar 26, 2026
How is turbo3 being worse than q4 quality target met?
TheTom • Mar 26, 2026
Good question. turbo3 at +1.4% vs q8_0 is worse than q4_0 (+0.5%) on raw PPL, but the comparison isn't apples-to-apples:

- **q4_0** compresses the KV cache to 4 bits → 4× compression
- **turbo3** compresses to ~3.5 bits → 4.6× compression
- **q8_0** is 8 bits → baseline

turbo3 gives more compression than q4_0 while staying within the 2% quality gate set at the top of this issue. The target was "within 1% of q8_0; if >2% worse it's a quality problem" — 1.4% is in that range.

Also worth noting: since this issue was opened, a community contributor found a [norm correction](https://github.com/spiritbuun/llama-cpp-turboquant-cuda/commit/6b821a9) that brings turbo3 PPL even closer to q8_0 (now +1.1% on our Metal build). On CUDA with the same fix, turbo3 actually *beats* q8_0 PPL by 0.09%.

The real quality validation we're still missing is NIAH (needle-in-a-haystack retrieval at long context). PPL passing doesn't guarantee retrieval works; that's the next benchmark to run.
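
The compression figures in that reply follow directly from bits per cached value relative to f16 (a sketch; real quant formats carry small per-block scale overheads that this ignores):

```python
def compression_vs_f16(bits_per_value: float) -> float:
    """KV-cache compression ratio relative to a 16-bit (f16) cache."""
    return 16.0 / bits_per_value

print(f"q8_0:   {compression_vs_f16(8):.1f}x")    # 2.0x
print(f"q4_0:   {compression_vs_f16(4):.1f}x")    # 4.0x
print(f"turbo3: {compression_vs_f16(3.5):.1f}x")  # ~4.6x
```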

Adjacent Repository Pain Points

Other highly discussed features and pain points extracted from TheTom/turboquant_plus.

Extracted Positioning
`turbo3` decode performance for LLM inference on Apple Silicon (M1, M2 Pro, M5 Max), specifically addressing the 'decode cliff' at increasing context depths.
Achieving flat, high-performance `turbo3` decode ratios (0.90x+ of `q8_0`) across all context depths on Apple Silicon, minimizing performance degradation from memory access patterns.
Top Replies
TheTom • Mar 27, 2026
## M2 Pro Results: Bit-Arithmetic Dequant **Hardware:** Apple M2 Pro, Apple8 (1008), has_tensor=false, 32GB **Model:** Qwen2.5-7B-Instruct-Q4_K_M **Build:** experiment/m1-m2-decode-comparison (auto...
TheTom • Mar 27, 2026
## M2 Pro Results Update: Batched Extract IS a Win True baseline comparison (same branch chain, same build): | Depth | q8_0 | Main (const LUT) | Batched extract | Bit-arithmetic | |-------|------|-...
TheTom • Mar 27, 2026
## BREAKTHROUGH: Profiling isolation identifies exact bottleneck Added TURBO_PROFILE_MODE (0-4) to strip away dequant layers one at a time. ### M5 Max vs M2 Pro at 8K context decode: | Mode | What ...
Extracted Positioning
TurboQuant (`-ctk turbo3 -ctv turbo3`) integration with Vulkan devices for LLM inference.
Achieving broad hardware compatibility for TurboQuant, specifically extending to Vulkan-enabled AMD GPUs.
Extracted Positioning
TurboQuant (turbo3 and turbo4) performance optimization for LLM inference, specifically on Apple M1 hardware.
Achieving superior LLM inference speed (tokens/sec) through TurboQuant optimizations on Apple Silicon (M1).
Extracted Positioning
TurboQuant's quantization strategy, specifically regarding K/V norm disparity, attention quantization methods (MSE vs. Prod), and outlier detection (dynamic vs. fixed).
Advancing TurboQuant's quantization efficacy to achieve lower perplexity (PPL) and higher compression (lower average bit rates) through refined techniques.
Extracted Positioning
turbo3 and turbo4 quantization implementation, specifically related to block size changes and kernel instantiation.
Ensuring correct and robust implementation of different quantization schemes (turbo3, turbo4) across varying block sizes and head dimensions, preventing data corruption and out-of-bounds access.

Engagement Signals

9 replies · Issue status: open

Cross-Market Term Frequency

Quantifies the cross-market adoption of foundational terms like turbo3 and CUDA by tracking occurrence frequency across active SaaS architectures and enterprise developer debates.