Comment on: Quality validation: perplexity, KL divergence, and NIAH benchmarks
Repo: TheTom/turboquant_plus by TheTom
Good question. turbo3 at +1.4% vs q8_0 is worse than q4_0 (+0.5%) on raw PPL, but the comparison isn't apples-to-apples:
- **q4_0** compresses the KV cache to 4 bits → 4× compression
- **turbo3** compresses to ~3.5 bits → 4.6× compression
- **q8_0** is 8 bits → baseline
turbo3 gives more compression than q4_0 while staying within the 2% quality gate we set at the top of this issue. The target was "within 1% of q8_0, if >2% worse it's a quality problem." 1.4% is in that range.
Also worth noting: since this issue was opened, a community contributor found a [norm correction](https://github.com/spiritbuun/llama-cpp-turboquant-cuda/commit/6b821a9) that brings turbo3 PPL even closer to q8_0 (now +1.1% on our Metal build). On CUDA with the same fix, turbo3 actually *beats* q8_0 PPL by 0.09%.
The real quality validation we're still missing is NIAH (needle-in-haystack retrieval at long context). PPL passing doesn't guarantee retrieval works. That's the next benchmark to run.
GitHub Issue
Parent Entity
State: Open • Comments: 9
SaaS Metrics