Quality validation: perplexity, KL divergence, and NIAH benchmarks
TheTom/turboquant_plus
## Supersedes #24
We claim 4.6× compression at 91-97% speed. But we have ZERO quantitative quality data on the llama.cpp build.
## Required benchmarks (in priority order):
### 1. Perplexity (wikitext-2)
- f16, q8_0, q4_0, q4_1, q5_0, turbo3
- Target: turbo3 within 1% of q8_0
- If >2% worse: quality problem
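
A minimal sketch of how the two thresholds above could be checked once perplexity numbers exist. It assumes "within 1%" means turbo3 is at most 1% worse than q8_0 in relative terms, and it treats the 1-2% range, which the issue does not classify, as "borderline". The function names are hypothetical and not part of the repo.

```python
def relative_ppl_gap(ppl_candidate: float, ppl_reference: float) -> float:
    """Fraction by which the candidate's perplexity exceeds the reference's.

    Positive means the candidate (e.g. turbo3) is worse than the
    reference (e.g. q8_0); negative means it is better.
    """
    if ppl_reference <= 0:
        raise ValueError("reference perplexity must be positive")
    return (ppl_candidate - ppl_reference) / ppl_reference


def classify_gap(gap: float, target: float = 0.01, fail: float = 0.02) -> str:
    """Map a relative gap onto the thresholds listed in this issue.

    Assumption: gaps between `target` and `fail` are reported as
    "borderline", since the issue leaves that range unspecified.
    """
    if gap <= target:
        return "pass"
    if gap > fail:
        return "quality problem"
    return "borderline"
```

Usage would look like `classify_gap(relative_ppl_gap(ppl_turbo3, ppl_q8_0))`, with both values taken from runs on the same wikitext-2 split and context size.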
### 2. KL Divergence vs f16
- Required by llama.cpp CONTRIBUTING.md for new quant types
- Metrics: mean KLD, delta-p RMS, same-top-p %
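
For reference, a sketch of what the three metrics measure, computed from per-position logits of the f16 baseline and the quantized build. It assumes delta-p is the change in probability assigned to the correct next token and that same-top-p is the share of positions where both models pick the same most likely token. This is an illustration in plain Python, not the code llama.cpp runs, and `kld_metrics` is a hypothetical name.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one position's logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def kld_metrics(base_logits, quant_logits, target_ids):
    """Compare a quantized model against the f16 baseline.

    base_logits, quant_logits: one list of vocabulary logits per position.
    target_ids: index of the correct next token at each position.
    """
    if not base_logits:
        raise ValueError("need at least one position")
    klds, delta_ps, same_top = [], [], 0
    for b, q, t in zip(base_logits, quant_logits, target_ids):
        lp, lq = log_softmax(b), log_softmax(q)
        # KL(base || quant) = sum_i p_i * (log p_i - log q_i)
        klds.append(sum(math.exp(a) * (a - c) for a, c in zip(lp, lq)))
        # Assumed definition: change in probability of the correct token.
        delta_ps.append(math.exp(lq[t]) - math.exp(lp[t]))
        # Do both models agree on the most likely token?
        same_top += lp.index(max(lp)) == lq.index(max(lq))
    n = len(klds)
    return {
        "mean_kld": sum(klds) / n,
        "delta_p_rms": math.sqrt(sum(d * d for d in delta_ps) / n),
        "same_top_pct": 100.0 * same_top / n,
    }
```

Identical logits give a mean KLD of 0, a delta-p RMS of 0, and 100% same-top agreement, which makes a convenient sanity check before comparing real runs.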
### 3. Passkey Retrieval (NIAH)
- At 1K, 2K, 4K, 8K context lengths
- Prince Canuma got 6/6 at all lengths
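
A rough sketch of how a passkey prompt for this test could be built and scored. The filler text, prompt wording, and function names are all assumptions for illustration. Length is approximated by word count, so the real token count should be checked with the model's tokenizer before calling a run "4K" or "8K".

```python
FILLER = (
    "The grass is green. The sky is blue. The sun is yellow. "
    "Here we go. There and back again. "
)


def build_passkey_prompt(approx_words: int, passkey: int, depth: float = 0.5) -> str:
    """Hide a passkey inside repeated filler text.

    approx_words: rough haystack size in words (a proxy for tokens).
    depth: where to place the needle, 0.0 = start, 1.0 = end.
    """
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0 and 1")
    n_repeat = max(1, approx_words // len(FILLER.split()))
    chunks = [FILLER] * n_repeat
    needle = f"The pass key is {passkey}. Remember it. {passkey} is the pass key. "
    chunks.insert(int(depth * len(chunks)), needle)
    intro = "There is an important pass key hidden in the text below. Find it. "
    question = "What is the pass key? The pass key is"
    return intro + "".join(chunks) + question


def retrieved(completion: str, passkey: int) -> bool:
    """Count a trial as a hit if the passkey digits appear in the output."""
    return str(passkey) in completion
```

Running several trials per context length with different passkeys and depths, then reporting hits out of trials, gives a result in the same "n/n" form as the line above.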
### 4. Generation Quality (qualitative)
- Side-by-side comparison
## Tracking
Full plan and results in docs/quality-benchmarks.md