Gemini Executive Synthesis

An analysis of TurboQuant's quantization strategy: K/V norm disparity, attention quantization method (MSE vs. Prod), and outlier detection (dynamic vs. fixed allocation).

Technical Positioning
Advancing TurboQuant's quantization efficacy to achieve lower perplexity (PPL) and higher compression (lower average bit rates) through refined techniques.
SaaS Insight & Market Implications
This issue presents critical engineering findings for TurboQuant, revealing significant opportunities for optimization. The 'K/V norm disparity' necessitates mixed precision, as uniform quantization catastrophically fails for models like Qwen with high K/V ratios. Furthermore, MSE is empirically shown to outperform the paper's recommended Prod for Attention, yielding dramatically lower PPL. Dynamic outlier detection also offers efficiency gains over fixed allocation. For B2B SaaS, these findings are paramount: improved quantization directly translates to reduced memory footprint and potentially faster inference with minimal quality degradation. Adopting these refinements can significantly enhance the cost-efficiency and performance of LLM deployments, providing a competitive edge in resource-constrained environments.
Proprietary Technical Taxonomy
K/V norm disparity · bit budgets · mixed precision · uniform quantization · MSE · Prod · TurboQuantProd (QJL) · Attention

Raw Developer Origin & Technical Request

GitHub Issue · Mar 28, 2026
Repo: TheTom/turboquant_plus
Engineering findings: K/V norm disparity + MSE > Prod + outlier mixed precision

Hi! We independently implemented TurboQuant and ran systematic benchmarks across 8 models. Found some things that might be useful for your outlier.py implementation:

## K/V Norm Disparity

Modern models have dramatically different K vs V norms:

| Model | K norm | V norm | Ratio |
|-------|--------|--------|-------|
| GPT-2 | 11.8 | 2.0 | 6x |
| Phi-2 | 13.1 | 3.0 | 4x |
| Qwen2.5-3B | 172.1 | 3.3 | 52x |
| Qwen2.5-7B | 274.0 | 2.6 | 106x |
| Qwen2.5-1.5B | 778.6 | 4.3 | 182x |

This means K and V need very different bit budgets. When the K/V norm ratio exceeds 100x (the Qwen family), K needs mixed precision: uniform quantization fails catastrophically.
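The norm-driven budgeting above can be sketched as a simple policy. This is an illustrative helper, not TurboQuant's API: the function name, thresholds, and bit widths are assumptions chosen to match the numbers in the issue.

```python
import numpy as np

def kv_bit_budget(k: np.ndarray, v: np.ndarray,
                  base_bits: int = 3, high_bits: int = 8,
                  ratio_threshold: float = 100.0) -> tuple:
    """Pick per-tensor bit widths from the measured K/V norm ratio.

    Hypothetical policy: when K norms dwarf V norms (as in the Qwen
    family), widen K's budget instead of quantizing both uniformly.
    """
    k_norm = float(np.sqrt((k ** 2).mean()))   # RMS norm of K
    v_norm = float(np.sqrt((v ** 2).mean()))   # RMS norm of V
    ratio = k_norm / max(v_norm, 1e-12)
    k_bits = high_bits if ratio > ratio_threshold else base_bits
    return k_bits, base_bits
```

With Qwen-like norms (K around 500, V around 3) this returns (8, 3); with comparable norms it falls back to uniform (3, 3).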

## MSE beats Prod for Attention

The paper recommends TurboQuantProd (QJL) for Keys. We found MSE for both K and V works much better:
- GPT-2 b=3: MSE gives +7.6% PPL, Prod gives +300% PPL
- Reason: QJL variance is amplified by softmax
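One way to read the gap: a product/JL quantizer injects zero-mean noise whose variance the softmax amplifies into logit error, while an MSE-optimal scalar quantizer minimizes reconstruction error directly. A minimal sketch of the MSE objective, as a clipping-scale grid search over a uniform symmetric quantizer; this is illustrative only, not TurboQuant's implementation.

```python
import numpy as np

def quantize_mse(x: np.ndarray, bits: int = 3, n_grid: int = 64) -> np.ndarray:
    """Uniform symmetric quantizer whose clipping scale is grid-searched
    to minimize mean squared reconstruction error (illustrative sketch)."""
    half = 2 ** (bits - 1) - 1                  # e.g. 3 levels per side at b=3
    amax = float(np.abs(x).max())
    best, best_err = x, np.inf
    for frac in np.linspace(0.3, 1.0, n_grid):  # candidate clipping ranges
        scale = frac * amax / half
        xhat = np.clip(np.round(x / scale), -half, half) * scale
        err = float(((x - xhat) ** 2).mean())
        if err < best_err:
            best, best_err = xhat, err
    return best
```

Because plain absmax scaling (frac = 1.0) is one of the grid candidates, the search can never do worse than the naive quantizer; clipping the tails usually does strictly better on heavy-tailed activations.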

## Dynamic vs Fixed Outlier Detection

Your outlier.py uses the paper's fixed allocation (32 outlier / 96 regular for d=128). We tried dynamic detection (channels with RMS > 3x median = outlier):
- Layer 0 has ~20% outliers (RMS up to 272 vs median 1.7)
- Middle layers have only 4-6% outliers
- Per-layer dynamic detection may be more efficient than fixed allocation
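The dynamic rule is essentially a one-liner per layer. A sketch assuming `x` is a (tokens, d) slice of one layer's K or V cache; the 3x-median threshold is the one quoted above.

```python
import numpy as np

def detect_outlier_channels(x: np.ndarray, k: float = 3.0) -> np.ndarray:
    """Flag channels whose RMS exceeds k times the median channel RMS.

    Sketch of the per-layer dynamic rule from the issue; returns a
    boolean mask over the d channels of x, shaped (tokens, d).
    """
    rms = np.sqrt((x ** 2).mean(axis=0))   # per-channel RMS over tokens
    return rms > k * np.median(rms)        # outlier mask
```

Unlike the fixed 32/96 split, the mask size adapts per layer, matching the ~20% outliers seen in Layer 0 and the 4-6% in middle layers.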

## Result

With dynamic outlier detection (outliers at 8-bit, rest at 3-bit):
- Qwen2.5-1.5B: **3.6-bit avg, +2.1% PPL** (vs +78% with uniform 4.5-bit)
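As a sanity check on the average: under a single global outlier fraction f, the mix costs 8f + 3(1 - f) bits per value, so the reported 3.6-bit figure corresponds to f of about 12%, sitting between the Layer-0 (~20%) and middle-layer (4-6%) fractions above. The real allocation is per-layer; this is just the arithmetic.

```python
def avg_bits(outlier_frac: float, out_bits: float = 8, base_bits: float = 3) -> float:
    """Average bits per value when a fraction of channels stays at high precision."""
    return outlier_frac * out_bits + (1 - outlier_frac) * base_bits
```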

Our implementation + all benchmark data: github.com/scos-lab/turboqua...

Great work on turboq...

Developer Debate & Comments

No active discussions extracted for this entry yet.

Adjacent Repository Pain Points

Other highly discussed features and pain points extracted from TheTom/turboquant_plus.

Extracted Positioning
turbo3 quantization for LLM KV cache compression
Achieving 4.6x compression with quality (perplexity, KL divergence, NIAH) comparable to q8_0 (within 2% PPL) and superior to q4_0, while maintaining high inference speed.
Top Replies
TheTom • Mar 25, 2026
## CRITICAL: Perplexity test reveals quality failure

| Cache | PPL | vs f16 |
|-------|-----|--------|
| f16 | 6.121 | baseline |
| q8_0 | 6.111 | -0.16% |
| q4_0 | 6.142 | +0.34% |
| **turbo3** | ...
TheTom • Mar 25, 2026
## Root causes found

### 1. V cache in rotated space

Python verification: dequant output has cosine=0.02 with input (garbage). After inverse rotation: cosine=0.987 (correct). V cache values MUST be...
TheTom • Mar 25, 2026
## QUALITY FIXED ✅

Perplexity with inverse rotation restored in dequant:

| Cache | PPL | vs q8_0 |
|-------|-----|---------|
| f16 | 6.121 | — |
| q8_0 | 6.111 | baseline |
| q4_0 | 6.142 | +0.5% ...
Extracted Positioning
`turbo3` decode performance for LLM inference on Apple Silicon (M1, M2 Pro, M5 Max), specifically addressing the 'decode cliff' at increasing context depths.
Achieving flat, high-performance `turbo3` decode ratios (0.90x+ of `q8_0`) across all context depths on Apple Silicon, minimizing performance degradation from memory access patterns.
Top Replies
TheTom • Mar 27, 2026
## M2 Pro Results: Bit-Arithmetic Dequant **Hardware:** Apple M2 Pro, Apple8 (1008), has_tensor=false, 32GB **Model:** Qwen2.5-7B-Instruct-Q4_K_M **Build:** experiment/m1-m2-decode-comparison (auto...
TheTom • Mar 27, 2026
## M2 Pro Results Update: Batched Extract IS a Win True baseline comparison (same branch chain, same build): | Depth | q8_0 | Main (const LUT) | Batched extract | Bit-arithmetic | |-------|------|-...
TheTom • Mar 27, 2026
## BREAKTHROUGH: Profiling isolation identifies exact bottleneck Added TURBO_PROFILE_MODE (0-4) to strip away dequant layers one at a time. ### M5 Max vs M2 Pro at 8K context decode: | Mode | What ...
Extracted Positioning
TurboQuant (`-ctk turbo3 -ctv turbo3`) integration with Vulkan devices for LLM inference.
Achieving broad hardware compatibility for TurboQuant, specifically extending to Vulkan-enabled AMD GPUs.
Extracted Positioning
TurboQuant (turbo3 and turbo4) performance optimization for LLM inference, specifically on Apple M1 hardware.
Achieving superior LLM inference speed (tokens/sec) through TurboQuant optimizations on Apple Silicon (M1).
Extracted Positioning
turbo3 and turbo4 quantization implementation, specifically related to block size changes and kernel instantiation.
Ensuring correct and robust implementation of different quantization schemes (turbo3, turbo4) across varying block sizes and head dimensions, preventing data corruption and out-of-bounds access.

Engagement Signals

Replies: 2
Issue status: open

Cross-Market Term Frequency

Tracks how often foundational terms such as Prod and Attention occur across active SaaS architectures and enterprise developer discussions, as a proxy for cross-market adoption.