
Comment on: Quality validation: perplexity, KL divergence, and NIAH benchmarks

Repo: TheTom/turboquant_plus by TheTom
Posted: Mar 25, 2026
## Root causes found

### 1. V cache in rotated space

Python verification: the dequant output has cosine = 0.02 with the input (garbage). After inverse rotation: cosine = 0.987 (correct). V cache values MUST be inverse-rotated after attention.

### 2. dynamic_cast fails for MoE models

The Qwen 3.5 MoE uses `llama_memory_hybrid_context`, not `llama_kv_cache_context`. Our `dynamic_cast` returns null → Q rotation and V inverse rotation NEVER execute. ALL speed benchmarks ran with unrotated Q against rotated-space K/V: the speed was fast, but the results were garbage.

### Why "coherent text" was misleading

Without any rotation applied, the raw quantize/dequant path produces plausible-looking grammar but wrong content. Short conversations hide this. Perplexity caught it.

### Fix needed

Store the rotation tensors directly in `llm_graph_context` (not behind a KV cache `dynamic_cast`). Then both Q rotation and V inverse rotation will work for ALL memory types.
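The two rotation requirements from root causes 1 and 2 can be illustrated with a minimal pure-Python sketch. This is not the author's verification script and does not reproduce the 0.02 / 0.987 figures. It assumes a randomized Walsh-Hadamard rotation (R = H·D/√n) and a simple 4-bit symmetric per-vector quantizer, both hypothetical stand-ins that are not necessarily what turboquant_plus uses. Because R is orthogonal, (Rq)·(Rk) = q·k, so K stored in rotated space only gives correct scores if Q is rotated too. Attention output is linear in V, so an output built from rotated V equals R times the true output and must be multiplied by R^T.

```python
import math
import random


def fwht(x):
    """Unnormalized fast Walsh-Hadamard transform; len(x) must be a power of two."""
    x = list(x)
    n, h = len(x), 1
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                a, b = x[j], x[j + h]
                x[j], x[j + h] = a + b, a - b
        h *= 2
    return x


def rotate(x, signs):
    """Assumed rotation R x = H D x / sqrt(n), with D = diag(signs)."""
    scale = 1.0 / math.sqrt(len(x))
    return [v * scale for v in fwht([s * v for s, v in zip(signs, x)])]


def inverse_rotate(y, signs):
    """R^T y = D H y / sqrt(n), since H is symmetric and D is its own inverse."""
    scale = 1.0 / math.sqrt(len(y))
    return [s * v * scale for s, v in zip(signs, fwht(y))]


def quantize_dequantize(x, bits=4):
    """Hypothetical quantizer: symmetric uniform, one scale per vector."""
    levels = 2 ** (bits - 1) - 1
    scale = max(abs(v) for v in x) / levels or 1.0
    return [round(v / scale) * scale for v in x]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    return dot(a, b) / math.sqrt(dot(a, a) * dot(b, b))


def attention(q, keys, values):
    """Single-query softmax attention: sum_i softmax(q.k_i / sqrt(d))_i * v_i."""
    scores = [dot(q, k) / math.sqrt(len(q)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    out = [0.0] * len(values[0])
    for w, v in zip(weights, values):
        for d, vd in enumerate(v):
            out[d] += w / total * vd
    return out


random.seed(0)
n, n_tokens = 256, 8
signs = [random.choice((-1, 1)) for _ in range(n)]
q = [random.gauss(0, 1) for _ in range(n)]
keys = [[random.gauss(0, 1) for _ in range(n)] for _ in range(n_tokens)]
values = [[random.gauss(0, 1) for _ in range(n)] for _ in range(n_tokens)]

# The cache stores K and V quantized in the rotated space.
k_cache = [quantize_dequantize(rotate(k, signs)) for k in keys]
v_cache = [quantize_dequantize(rotate(v, signs)) for v in values]

reference = attention(q, keys, values)
# Both rotations skipped (the failed-cast case): unrotated Q, output left rotated.
no_rotation = attention(q, k_cache, v_cache)
# Q rotated, but the V-weighted output is not inverse-rotated (root cause 1).
no_v_inverse = attention(rotate(q, signs), k_cache, v_cache)
# Q rotated and the output rotated back into the original space.
fixed = inverse_rotate(no_v_inverse, signs)

print(f"dequant V vs V, no inverse: {cosine(v_cache[0], values[0]):.3f}")
print(f"dequant V vs V, inverse:    {cosine(inverse_rotate(v_cache[0], signs), values[0]):.3f}")
print(f"no rotation at all:         {cosine(reference, no_rotation):.3f}")
print(f"Q rotated, no V inverse:    {cosine(reference, no_v_inverse):.3f}")
print(f"Q rotated + V inverse:      {cosine(reference, fixed):.3f}")
```

With these settings, the runs that skip the inverse rotation should give cosines near zero, and the fully rotated path should stay close to 1. This matches the pattern described in the comment, though not its exact numbers.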
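The failure in root cause 2 and the proposed fix come down to control flow: a lookup gated on a cast silently does nothing when the memory context has a different type. The following is a Python analog of that pattern, with `isinstance` standing in for C++ `dynamic_cast`. The class names only mirror the names in the comment. `MemoryContext`, `KVCacheContext`, `HybridMemoryContext`, and `GraphContext` are hypothetical stand-ins, not llama.cpp's real class definitions.

```python
from typing import Optional


class MemoryContext:
    """Hypothetical base class for memory contexts."""


class KVCacheContext(MemoryContext):
    """Stand-in for llama_kv_cache_context; owns the rotation in the buggy design."""

    def __init__(self, rotation: str):
        self.rotation = rotation


class HybridMemoryContext(MemoryContext):
    """Stand-in for llama_memory_hybrid_context: a sibling type, not a KVCacheContext."""


def dynamic_cast(obj: MemoryContext, cls: type) -> Optional[MemoryContext]:
    """Python analog of C++ dynamic_cast<cls*>: obj if it is a cls, else None."""
    return obj if isinstance(obj, cls) else None


def rotation_before_fix(mctx: MemoryContext) -> Optional[str]:
    """Buggy lookup: a failed cast yields None, so rotation is silently skipped."""
    kv = dynamic_cast(mctx, KVCacheContext)
    return kv.rotation if kv is not None else None


class GraphContext:
    """Stand-in for llm_graph_context holding the rotation directly (the proposed fix)."""

    def __init__(self, mctx: MemoryContext, rotation: str):
        self.mctx = mctx
        self.rotation = rotation  # available regardless of the memory type


def rotation_after_fix(gctx: GraphContext) -> Optional[str]:
    return gctx.rotation


dense = KVCacheContext(rotation="R")
hybrid = HybridMemoryContext()

print(rotation_before_fix(dense))   # R
print(rotation_before_fix(hybrid))  # None: Q rotation and V inverse rotation skipped
print(rotation_after_fix(GraphContext(hybrid, rotation="R")))  # R
```

The bug raises no error: the `None` branch is indistinguishable from "no rotation configured." That is why the benchmarks kept running fast on garbage data.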
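The "coherent text" section notes that perplexity caught what short conversations hid. One reason is that perplexity is the exponential of the mean negative log-likelihood over every evaluated token. A model whose grammar is fine but whose content is wrong still assigns low probability to the reference content tokens, and that pulls the average up. Below is a minimal sketch of the definition. The per-token probabilities are made-up toy values for illustration, not measurements from this issue.

```python
import math


def perplexity(token_logprobs):
    """exp of the mean negative log-likelihood over the evaluated tokens."""
    nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(nll)


# Toy values: every reference token predicted with probability 0.5.
clean = [math.log(0.5)] * 4
# Toy values: same fluency on three tokens, but one content token gets 0.01.
corrupted = [math.log(0.5)] * 3 + [math.log(0.01)]

print(f"{perplexity(clean):.2f}")      # 2.00
print(f"{perplexity(corrupted):.2f}")  # 5.32
```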