
Engineering findings: K/V norm disparity + MSE > Prod + outlier mixed precision

TheTom/turboquant_plus
Status: Open
Opened: Mar 28, 2026
Comments: 2
Hi! We independently implemented TurboQuant and ran systematic benchmarks across 8 models. Found some things that might be useful for your outlier.py implementation:

## K/V Norm Disparity

Modern models have dramatically different K vs V norms:

| Model | K norm | V norm | Ratio |
|-------|--------|--------|-------|
| GPT-2 | 11.8 | 2.0 | 6x |
| Phi-2 | 13.1 | 3.0 | 4x |
| Qwen2.5-3B | 172.1 | 3.3 | 52x |
| Qwen2.5-7B | 274.0 | 2.6 | 106x |
| Qwen2.5-1.5B | 778.6 | 4.3 | 182x |

This means K and V need very different bit budgets. K/V ratio > 100x (Qwen family) needs mixed precision for K: uniform quantization fails catastrophically.

## MSE beats Prod for Attention

Paper recommends TurboQuantProd (QJL) for Keys. We found MSE for both K and V works much better:

- GPT-2 b=3: MSE gives +7.6% PPL, Prod gives +300% PPL
- Reason: QJL variance is amplified by softmax

## Dynamic vs Fixed Outlier Detection

Your outlier.py uses the paper's fixed allocation (32 outlier / 96 regular for d=128). We tried dynamic detection (channels with RMS > 3x median = outlier):

- Layer 0 has ~20% outliers (RMS up to 272 vs median 1.7)
- Middle layers have only 4-6% outliers
- Per-layer dynamic detection may be more efficient than fixed allocation

## Result

With dynamic outlier detection (outliers at 8-bit, rest at 3-bit):

- Qwen2.5-1.5B: **3.6-bit avg, +2.1% PPL** (vs +78% with uniform 4.5-bit)

Our implementation + all benchmark data: https://github.com/scos-lab/turboquant

Great work on turboq...
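For reference, here is a minimal sketch of the dynamic detection rule described above (per-channel RMS > 3x the median counts as an outlier, outliers at 8-bit, rest at 3-bit). It is pure Python on toy data and is not the code from either repository. The function names `channel_rms`, `detect_outlier_channels`, and `average_bits` and the toy `keys` matrix are made up for illustration, and the average-bit formula assumes a plain per-channel mean with no overhead for storing outlier indices or scales.

```python
import math
import statistics


def channel_rms(rows):
    """Per-channel RMS of a (tokens x channels) list of lists."""
    n_tokens = len(rows)
    n_channels = len(rows[0])
    return [
        math.sqrt(sum(row[c] ** 2 for row in rows) / n_tokens)
        for c in range(n_channels)
    ]


def detect_outlier_channels(rms, threshold=3.0):
    """Indices of channels whose RMS exceeds threshold x the median RMS."""
    cutoff = threshold * statistics.median(rms)
    return [c for c, value in enumerate(rms) if value > cutoff]


def average_bits(n_channels, n_outliers, outlier_bits=8, regular_bits=3):
    """Mean bits per channel (assumed formula: simple per-channel average)."""
    n_regular = n_channels - n_outliers
    return (n_outliers * outlier_bits + n_regular * regular_bits) / n_channels


# Toy key activations (hypothetical): 2 tokens x 8 channels,
# with channel 0 deliberately made much larger than the rest.
keys = [
    [40.0, 1.0, -1.0, 0.5, 1.5, -0.5, 1.0, -1.0],
    [-60.0, -1.0, 1.0, 1.5, -0.5, 0.5, -1.0, 1.0],
]

rms = channel_rms(keys)
outliers = detect_outlier_channels(rms)
print(outliers)                               # [0]
print(average_bits(len(rms), len(outliers)))  # 3.625
```

Under the same per-channel-mean assumption, a 3.6-bit average with 8-bit outliers and 3-bit regular channels corresponds to roughly 12% outlier channels overall, since (3.6 - 3) / (8 - 3) = 0.12.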
Python