Comment on: Quality validation: perplexity, KL divergence, and NIAH benchmarks

Repo: TheTom/turboquant_plus by TheTom

Posted: Mar 26, 2026

Good question. On raw PPL, turbo3 at +1.4% vs q8_0 is worse than q4_0 (+0.5%), but the comparison isn't apples-to-apples:

- **q4_0** compresses the KV cache to 4 bits → 4× compression
- **turbo3** compresses to ~3.5 bits → 4.6× compression
- **q8_0** is 8 bits → baseline

The compression ratios are relative to a 16-bit cache; q8_0 is the baseline for the PPL deltas.

turbo3 gives more compression than q4_0 while staying within the 2% quality gate we set at the top of this issue. The target was "within 1% of q8_0, if >2% worse it's a quality problem." 1.4% misses the 1% target but stays under the 2% threshold.

Also worth noting: since this issue was opened, a community contributor found a [norm correction](https://github.com/spiritbuun/llama-cpp-turboquant-cuda/commit/6b821a9) that brings turbo3 PPL even closer to q8_0 (now +1.1% on our Metal build). On CUDA with the same fix, turbo3 actually *beats* q8_0 PPL by 0.09%.

The real quality validation we're still missing is NIAH (needle-in-haystack retrieval at long context). Passing on PPL doesn't guarantee that retrieval works. That's the next benchmark to run.
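For anyone who wants to check the arithmetic, here is a minimal sketch of the compression and quality-gate numbers quoted above. The function names are hypothetical and not part of this repo, the bit widths are the nominal ones from the list, and the 16-bit reference is an assumption inferred from the quoted ratios.

```python
# Hypothetical helpers for illustration only; not code from turboquant_plus.
REFERENCE_BITS = 16.0  # assumed 16-bit reference cache, inferred from the quoted ratios


def compression_ratio(bits_per_value, reference_bits=REFERENCE_BITS):
    """Compression factor of a KV cache format relative to the reference width."""
    return reference_bits / bits_per_value


def quality_verdict(ppl_delta_pct, target_pct=1.0, gate_pct=2.0):
    """Classify a PPL delta vs q8_0 (in percent) against the target and the gate."""
    if ppl_delta_pct <= target_pct:
        return "meets target"
    if ppl_delta_pct <= gate_pct:
        return "within gate"
    return "quality problem"


# (nominal bits per value, PPL delta vs q8_0 in percent), as quoted in the comment
cache_types = {
    "q4_0": (4.0, 0.5),
    "turbo3": (3.5, 1.4),
}

for name, (bits, delta) in cache_types.items():
    ratio = compression_ratio(bits)
    verdict = quality_verdict(delta)
    print(f"{name}: {ratio:.1f}x compression, PPL {delta:+.1f}% -> {verdict}")
```

The two thresholds are kept separate on purpose: a format can miss the 1% target and still pass the 2% gate, which is where turbo3 sat before the norm correction.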


State: Open • Comments: 9

Other Comments / Reviews

How is turbo3 being worse than q4 quality target met?

by Rotatingxenomorph Mar 26, 2026
## QUALITY FIXED ✅ Perplexity with inverse rotation res...

by TheTom Mar 25, 2026
## Root causes found ### 1. V cache in rotated space Pyt...

by TheTom Mar 25, 2026
## CRITICAL: Perplexity test reveals quality failure | C...

by TheTom Mar 25, 2026