# turbo3 quantization for LLM KV cache compression
GitHub Issue · Mar 25, 2026
## Supersedes #24
We claim 4.6× compression at 91-97% speed, but we have ZERO quantitative quality data on the llama.cpp build.
## Required benchmarks (in priority order):
### 1. Perplexity (wikitext-2)
- f16, q8_0, q4_0, q4_1, q5_0, turbo3
- Target: turbo3 within 1% of q8_0
- If it comes in >2% worse than q8_0, treat it as a quality problem (runner sketch after this list)
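A minimal runner sketch, assuming upstream llama.cpp's `llama-perplexity` tool with its `-ctk`/`-ctv` cache-type and `-fa` flash-attention flags; the `turbo3` cache type is this fork's addition, the binary and file paths are placeholders, and the output-parsing regex may need adjusting per version:

```python
import re
import subprocess

# Assumptions: ./llama-perplexity is built from this fork, `turbo3` is
# exposed as a KV-cache type via -ctk/-ctv, and the model/data paths are
# placeholders. Upstream typically requires -fa for a quantized V cache.
BIN = "./llama-perplexity"
MODEL = "models/model.gguf"                # hypothetical model path
DATA = "wikitext-2-raw/wiki.test.raw"      # wikitext-2 raw test split
CACHE_TYPES = ["f16", "q8_0", "q4_0", "q4_1", "q5_0", "turbo3"]

results = {}
for ct in CACHE_TYPES:
    out = subprocess.run(
        [BIN, "-m", MODEL, "-f", DATA, "-fa", "-ctk", ct, "-ctv", ct],
        capture_output=True, text=True, check=True,
    )
    # llama-perplexity prints a final "PPL = <value>" estimate; the exact
    # output format varies between versions, so this regex is a guess.
    m = re.search(r"PPL\s*=\s*([0-9.]+)", out.stdout + out.stderr)
    results[ct] = float(m.group(1)) if m else None

base = results.get("q8_0")
for ct, ppl in results.items():
    if ppl is not None and base:
        # Target from this issue: turbo3 within 1% of q8_0; >2% is a problem.
        print(f"{ct}: PPL {ppl:.4f} ({100 * (ppl / base - 1):+.2f}% vs q8_0)")
```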
### 2. KL Divergence vs f16
- Required by llama.cpp CONTRIBUTING.md for new quant types
- Metrics: mean KLD, delta-p RMS, same-top-p % (metric sketch after this list)
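Upstream's perplexity tool can produce these statistics via its KL-divergence mode; for reference, here is a minimal NumPy sketch of one reasonable reading of the three metrics, given per-position logits dumped from an f16 run and a turbo3 run. The array shapes and the exact definition of delta-p are assumptions:

```python
import numpy as np

def kld_metrics(logits_base: np.ndarray, logits_quant: np.ndarray):
    """Compare per-token logits from the f16 baseline and a quantized run.

    logits_* : (n_tokens, vocab_size) raw logits for the same teacher-forced
    token positions. Returns mean KL divergence, RMS of the delta in the
    baseline top token's probability, and the same-top-token agreement rate.
    """
    def softmax(x):
        x = x - x.max(axis=-1, keepdims=True)   # numerical stability
        e = np.exp(x)
        return e / e.sum(axis=-1, keepdims=True)

    p = softmax(logits_base)    # reference distribution (f16)
    q = softmax(logits_quant)   # quantized distribution (turbo3)
    eps = 1e-10

    # Mean KL(p || q) over all token positions
    kld = (p * (np.log(p + eps) - np.log(q + eps))).sum(axis=-1)
    mean_kld = kld.mean()

    # RMS of the probability shift on the baseline's top token
    top = p.argmax(axis=-1)
    idx = np.arange(len(top))
    delta_p = p[idx, top] - q[idx, top]
    delta_p_rms = np.sqrt((delta_p ** 2).mean())

    # Fraction of positions where both runs pick the same top token
    same_top = (top == q.argmax(axis=-1)).mean()
    return mean_kld, delta_p_rms, same_top
```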
### 3. Passkey Retrieval (NIAH)
- At 1K, 2K, 4K, 8K context lengths
- Reference point: Prince Canuma reported 6/6 retrieval at all lengths (prompt-builder sketch after this list)
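A minimal prompt-builder sketch in the usual NIAH style; the filler sentence, the 5-digit key format, and the words-as-tokens length estimate are assumptions, so prompt lengths should be verified against the model's actual tokenizer:

```python
import random

FILLER = "The grass is green. The sky is blue. The sun is yellow. "
NEEDLE = "The pass key is {key}. Remember it. {key} is the pass key. "
QUESTION = "\nWhat is the pass key?"

def build_passkey_prompt(target_tokens: int, key: int) -> str:
    """Build a haystack roughly target_tokens long with the key buried at a
    random depth. Assumes ~12 tokens per filler sentence; swap in the real
    tokenizer for exact context lengths."""
    n_chunks = max(2, target_tokens // 12)
    chunks = [FILLER] * n_chunks
    chunks.insert(random.randrange(1, n_chunks), NEEDLE.format(key=key))
    return "".join(chunks) + QUESTION

# One prompt per context length in this issue; score pass/fail on whether
# the model's answer contains the key.
for n_ctx in (1024, 2048, 4096, 8192):
    key = random.randrange(10_000, 100_000)
    prompt = build_passkey_prompt(n_ctx, key)
    print(n_ctx, key, len(prompt.split()), "approx. words")
```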
### 4. Generation Quality (qualitative)
- Side-by-side comparison of f16 vs turbo3 output on identical prompts and seeds (harness sketch below)
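A minimal side-by-side harness sketch, assuming upstream's `llama-cli` with a fixed seed and greedy decoding so the only variable between runs is the KV-cache type; the prompt and paths are placeholders:

```python
import subprocess

BIN = "./llama-cli"
MODEL = "models/model.gguf"   # hypothetical model path
PROMPT = "Summarize the plot of Hamlet in three sentences."

def generate(cache_type: str) -> str:
    # --temp 0 for greedy decoding plus a fixed seed keeps runs comparable;
    # -fa may be required for a quantized V cache, as in the perplexity runner.
    out = subprocess.run(
        [BIN, "-m", MODEL, "-p", PROMPT, "-n", "256",
         "--seed", "42", "--temp", "0", "-fa",
         "-ctk", cache_type, "-ctv", cache_type],
        capture_output=True, text=True, check=True,
    )
    return out.stdout

for ct in ("f16", "turbo3"):
    print(f"=== {ct} ===\n{generate(ct)}\n")
```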
## Tracking
Full plan and results live in `docs/quality-benchmarks.md`.