Comment on: Quality validation: perplexity, KL divergence, and NIAH benchmarks
Repo: TheTom/turboquant_plus by TheTom
## Root causes found
### 1. V cache in rotated space
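Python verification: the dequantized V output has cosine similarity 0.02 with the input (garbage), because the cached values are still in rotated space.
After applying the inverse rotation, cosine similarity is 0.987 (correct).
V cache values MUST be inverse-rotated after attention.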
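
To illustrate why the comparison only makes sense after the inverse rotation, here is a minimal C++ sketch. Everything in it is an assumption for illustration: the rotation is a stand-in (random sign flips followed by a normalized Walsh-Hadamard transform), the quantizer is plain uniform rounding, and none of the function names come from the repo.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical stand-in rotation, NOT necessarily the one used in the repo:
// a normalized fast Walsh-Hadamard transform (orthogonal and self-inverse).
static void fwht_normalized(std::vector<float> & v) {
    const std::size_t n = v.size();
    assert(n > 0 && (n & (n - 1)) == 0); // length must be a power of two
    for (std::size_t len = 1; len < n; len <<= 1) {
        for (std::size_t i = 0; i < n; i += 2 * len) {
            for (std::size_t j = i; j < i + len; ++j) {
                const float a = v[j];
                const float b = v[j + len];
                v[j]       = a + b;
                v[j + len] = a - b;
            }
        }
    }
    const float scale = 1.0f / std::sqrt(static_cast<float>(n));
    for (float & x : v) x *= scale;
}

// Forward rotation: sign flips, then the transform.
std::vector<float> rotate(std::vector<float> v, const std::vector<float> & signs) {
    for (std::size_t i = 0; i < v.size(); ++i) v[i] *= signs[i];
    fwht_normalized(v);
    return v;
}

// Inverse rotation: the transform first, then the same sign flips.
std::vector<float> inverse_rotate(std::vector<float> v, const std::vector<float> & signs) {
    fwht_normalized(v);
    for (std::size_t i = 0; i < v.size(); ++i) v[i] *= signs[i];
    return v;
}

// Crude uniform quantize/dequantize round trip (illustrative only).
std::vector<float> quantize_dequantize(std::vector<float> v, float step) {
    for (float & x : v) x = std::round(x / step) * step;
    return v;
}

double cosine(const std::vector<float> & a, const std::vector<float> & b) {
    double dot = 0.0, na = 0.0, nb = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        dot += static_cast<double>(a[i]) * b[i];
        na  += static_cast<double>(a[i]) * a[i];
        nb  += static_cast<double>(b[i]) * b[i];
    }
    return dot / (std::sqrt(na) * std::sqrt(nb));
}
```

Comparing an input `x` against `quantize_dequantize(rotate(x, signs), step)` gives a low cosine similarity, while comparing `x` against `inverse_rotate(...)` of that same stored vector gives a value close to 1. The exact numbers depend on the input and the rotation, so they will not match the 0.02 and 0.987 reported above.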
### 2. dynamic_cast fails for MoE models
The Qwen 3.5 MoE uses `llama_memory_hybrid_context`, not `llama_kv_cache_context`.
Our `dynamic_cast` to `llama_kv_cache_context` returns null, so Q rotation and V inverse rotation NEVER execute.
ALL speed benchmarks ran with unrotated Q against rotated-space K/V: fast, but the output was garbage.
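
The failure mode can be reproduced in isolation. The sketch below uses hypothetical stand-in types (`memory_context_base`, `kv_cache_context_standin`, `hybrid_context_standin`), not the real llama.cpp classes, and assumes only that the two context types share a polymorphic base without one deriving from the other.

```cpp
#include <cassert>

// Hypothetical stand-ins for the memory context types; the real class
// hierarchy in llama.cpp is not reproduced here.
struct memory_context_base {
    virtual ~memory_context_base() = default;
};

struct kv_cache_context_standin : memory_context_base {
    bool has_rotation = true; // stand-in for "rotation tensors are available"
};

// Sibling type: shares the base class but is NOT a kv_cache_context_standin.
struct hybrid_context_standin : memory_context_base {};

// Mirrors the pattern described above: the rotation path is gated on a
// dynamic_cast, and a failed cast silently skips it.
bool rotation_applied(const memory_context_base * mctx) {
    assert(mctx != nullptr);
    const auto * kv = dynamic_cast<const kv_cache_context_standin *>(mctx);
    if (kv == nullptr) {
        return false; // no error, no log: the rotation simply never runs
    }
    return kv->has_rotation;
}
```

With a `kv_cache_context_standin` the function returns `true`; with a `hybrid_context_standin` it returns `false` without any diagnostic, which is how the skipped rotation went unnoticed.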
### Why "coherent text" was misleading
Without any rotation applied, the raw quantize/dequantize round trip produces text with plausible-looking grammar
but wrong content. Short conversations hide this. Perplexity caught it.
### Fix needed
Store rotation tensors in `llm_graph_context` directly (not behind a KV cache dynamic_cast).
Then both Q rotation and V inverse rotation will work for ALL memory types.
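
A rough sketch of that shape, under stated assumptions: `graph_context_standin` is a hypothetical stand-in for `llm_graph_context`, the rotation "tensors" are plain row-major matrices, and the helper names are invented for illustration. The point is only that the rotation lives on the graph context, so no cast on the memory type is needed.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-ins for the memory context types.
struct memory_context_base {
    virtual ~memory_context_base() = default;
};
struct kv_cache_context_standin : memory_context_base {};
struct hybrid_context_standin   : memory_context_base {};

// Hypothetical stand-in for the graph context. The rotation matrices are
// stored here directly instead of being reached through a cast on mctx.
struct graph_context_standin {
    const memory_context_base * mctx = nullptr;
    std::size_t n = 0;
    std::vector<float> rot;     // row-major n x n rotation
    std::vector<float> rot_inv; // row-major n x n inverse rotation

    // Row-major matrix-vector product.
    static std::vector<float> apply(const std::vector<float> & m,
                                    const std::vector<float> & x) {
        const std::size_t n = x.size();
        assert(m.size() == n * n);
        std::vector<float> y(n, 0.0f);
        for (std::size_t i = 0; i < n; ++i) {
            for (std::size_t j = 0; j < n; ++j) {
                y[i] += m[i * n + j] * x[j];
            }
        }
        return y;
    }

    // Both paths are independent of the concrete memory type behind mctx.
    std::vector<float> rotate_q(const std::vector<float> & q) const {
        return apply(rot, q);
    }
    std::vector<float> inverse_rotate_v(const std::vector<float> & v) const {
        return apply(rot_inv, v);
    }
};
```

Because neither helper inspects `mctx`, the same code path runs whether the context behind it is the KV cache stand-in or the hybrid stand-in.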
State: Open • Comments: 9