GitHub Issue

Quality validation: perplexity, KL divergence, and NIAH benchmarks

Discovered On Mar 25, 2026
Primary Metric open
## Supersedes #24 We claim 4.6× compression at 91-97% speed. But we have ZERO quantitative quality data on the llama.cpp build. ## Required benchmarks (in priority order): ### 1. Perplexity (wikitext-2) - f16, q8_0, q4_0, q4_1, q5_0, turbo3 - Target: turbo3 within 1% of q8_0 - If >2% worse: quality problem ### 2. KL Divergence vs f16 - Required by llama.cpp CONTRIBUTING.md for new quant types - Metrics: mean KLD, delta-p RMS, same-top-p % ### 3. Passkey Retrieval (NIAH) - At 1K, 2K, 4K, 8K context lengths - Prince Canuma got 6/6 at all lengths ### 4. Generation Quality (qualitative) - Side-by-side comparison ## Tracking Full plan and results in docs/quality-benchmarks.md
View Raw Thread

Developer & User Discourse

TheTom • Mar 25, 2026
## CRITICAL: Perplexity test reveals quality failure

| Cache | PPL | vs f16 |
|-------|-----|--------|
| f16 | 6.121 | baseline |
| q8_0 | 6.111 | -0.16% |
| q4_0 | 6.142 | +0.34% |
| **turbo3** | **165.6** | **+2607%** ❌ |

turbo3 perplexity is 27× worse than f16. Speed benchmarks were measuring how fast the model produces wrong answers.

Root cause investigation needed. DO NOT update README with speed claims until quality is fixed.

Suspected causes:
1. Norm mismatch: quantize stores full 128-element group norm, dequant uses it as per-32-block norm
2. Pre-rotate-queries rotation matrix mismatch with quantize rotation
3. 3-bit packing bug in block size 32
TheTom • Mar 25, 2026
## Root causes found

### 1. V cache in rotated space
Python verification: dequant output has cosine=0.02 with input (garbage).
After inverse rotation: cosine=0.987 (correct).
V cache values MUST be inverse-rotated after attention.

### 2. dynamic_cast fails for MoE models
The Qwen 3.5 MoE uses `llama_memory_hybrid_context`, not `llama_kv_cache_context`.
Our `dynamic_cast` returns null → Q rotation and V inverse rotation NEVER execute.
ALL speed benchmarks were on unrotated Q with rotated-space K/V — garbage results with fast speed.

### Why "coherent text" was misleading
Without any rotation applied, the raw quantize/dequant produces plausible-looking grammar
but wrong content. Short conversations hide this. Perplexity caught it.

### Fix needed
Store rotation tensors in `llm_graph_context` directly (not behind a KV cache dynamic_cast).
Then both Q rotation and V inverse rotation will work for ALL memory types.
TheTom • Mar 25, 2026
## QUALITY FIXED ✅

Perplexity with inverse rotation restored in dequant:

| Cache | PPL | vs q8_0 |
|-------|-----|---------|
| f16 | 6.121 | — |
| q8_0 | 6.111 | baseline |
| q4_0 | 6.142 | +0.5% |
| **turbo3** | **6.194** | **+1.4%** |

turbo3 is within 1.4% of q8_0 perplexity. Quality target met.

Speed is back to ~10.7 tok/s (pre-optimization level) because the inverse rotation
is in the dequant hot path. The pre-rotate-queries optimization needs to be
reimplemented to work with GQA head layout (ne[0]=256 for concatenated heads)
and hybrid memory types.
Rotatingxenomorph • Mar 26, 2026
How is turbo3 being worse than q4 quality target met?
TheTom • Mar 26, 2026
Good question. turbo3 at +1.4% vs q8_0 is worse than q4_0 (+0.5%) on raw PPL, but the comparison isn't apples-to-apples:

- **q4_0** compresses the KV cache to 4 bits → 4× compression
- **turbo3** compresses to ~3.5 bits → 4.6× compression
- **q8_0** is 8 bits → baseline

turbo3 gives more compression than q4_0 while staying within the 2% quality gate we set at the top of this issue. The target was "within 1% of q8_0, if >2% worse it's a quality problem." 1.4% is in that range.

Also worth noting: since this issue was opened, a community contributor found a [norm correction](https://github.com/spiritbuun/llama-cpp-turboquant-cuda/commit/6b821a9) that brings turbo3 PPL even closer to q8_0 (now +1.1% on our Metal build). On CUDA with the same fix, turbo3 actually *beats* q8_0 PPL by 0.09%.

The real quality validation we're still missing is NIAH (needle-in-haystack retrieval at long context). PPL passing doesn't guarantee retrieval works. That's the next benchmark to run.