Insight for: Quality validation: perplexity, KL divergence, and NIAH benchmarks

turbo3 quantization for LLM KV cache compression

Analyzed: Apr 1, 2026

Initial speed claims for turbo3 quantization were invalid: the model was producing nonsensical output because of two implementation bugs. First, V cache values were not inverse-rotated. Second, `dynamic_cast` failures prevented the Q/V rotations from being applied in MoE models. Processing was fast, but the output was garbage. This is why speed measurements need to be paired with quality validation (e.g., perplexity); without it, performance metrics are misleading. The fix restored quality: turbo3 now shows 1.4% perplexity degradation versus q8_0 at 4.6x compression, which meets the 2% quality gate. The fix also cost speed, so re-optimization is needed. The market implication is that raw speed metrics without rigorous quality benchmarks are meaningless; reliable performance requires balancing aggressive compression against validated output fidelity, especially for complex architectures like MoE. NIAH validation is still outstanding.
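The quality gate described above can be sketched in a few lines of Python. This is a minimal illustration, not the project's actual tooling: the function names (`perplexity`, `relative_degradation`, `kl_divergence`, `passes_quality_gate`) are hypothetical, the 2% threshold is taken from the summary, and the perplexity values in the usage note are made up so that they reproduce a 1.4% degradation.

```python
import math


def perplexity(token_logprobs):
    """Perplexity = exp of the mean negative log-likelihood (natural log)."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def relative_degradation(ppl_candidate, ppl_baseline):
    """Fractional perplexity increase of the candidate over the baseline."""
    return (ppl_candidate - ppl_baseline) / ppl_baseline


def _log_softmax(logits):
    # Subtract the max before exponentiating for numerical stability.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def kl_divergence(baseline_logits, candidate_logits):
    """KL(P_baseline || P_candidate) for one token position, in nats."""
    log_p = _log_softmax(baseline_logits)
    log_q = _log_softmax(candidate_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


def passes_quality_gate(ppl_candidate, ppl_baseline, max_degradation=0.02):
    """True if the candidate stays within the allowed perplexity degradation."""
    return relative_degradation(ppl_candidate, ppl_baseline) <= max_degradation


# Illustrative numbers only (assumed, not measured).
baseline_ppl = 6.0     # stand-in for a q8_0 run
candidate_ppl = 6.084  # stand-in for a turbo3 run, 1.4% higher
gate_ok = passes_quality_gate(candidate_ppl, baseline_ppl)
```

With these assumed values `gate_ok` is true, while a candidate perplexity of 6.2 (about 3.3% worse) would fail the gate. A broken rotation path of the kind described above would be expected to show up as a far larger perplexity gap and a large per-token KL divergence, which is the reason to run such a check before reporting token/s.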

perplexity KL divergence NIAH benchmarks f16 q8_0 q4_0 q4_1 q5_0 turbo3 KV cache llama.cpp build norm mismatch pre-rotate-queries 3-bit packing V cache in rotated space inverse rotation dynamic_cast MoE models llama_memory_hybrid_context llama_kv_cache_context llm_graph_context GQA head layout token/s Metal build CUDA norm correction needle-in-haystack retrieval

GitHub Issue

Parent Entity

Quality validation: perplexity, KL divergence, and NIAH benchmarks

State: Open