Quality validation: perplexity, KL divergence, and NIAH benchmarks

TheTom/turboquant_plus

Status: Open

Opened: Mar 25, 2026

Comments: 9

## Supersedes #24

We claim 4.6× compression at 91-97% speed. But we have ZERO quantitative quality data on the llama.cpp build.

## Required benchmarks (in priority order):

### 1. Perplexity (wikitext-2)
- f16, q8_0, q4_0, q4_1, q5_0, turbo3
- Target: turbo3 within 1% of q8_0
- If >2% worse: quality problem

### 2. KL Divergence vs f16
- Required by llama.cpp CONTRIBUTING.md for new quant types
- Metrics: mean KLD, delta-p RMS, same-top-p %

### 3. Passkey Retrieval (NIAH)
- At 1K, 2K, 4K, 8K context lengths
- Prince Canuma got 6/6 at all lengths

### 4. Generation Quality (qualitative)
- Side-by-side comparison

## Tracking

Full plan and results in docs/quality-benchmarks.md
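For reference, here is a minimal sketch of how the perplexity gap and the three KLD metrics listed above could be computed from per-token logits. It is not the llama.cpp implementation. The function names (`ppl_gap_pct`, `kld_metrics`) and the input layout (one list of logits per token position, plus the index of the correct next token) are assumptions made for illustration. The metric definitions follow my understanding of the usual convention: delta-p is the change in probability assigned to the correct token, and same-top-p is the share of positions where both models pick the same most likely token.

```python
import math


def ppl_gap_pct(ppl_candidate, ppl_reference):
    """Relative perplexity gap in percent (positive means the candidate is worse)."""
    return 100.0 * (ppl_candidate - ppl_reference) / ppl_reference


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kld_metrics(base_logits, quant_logits, targets):
    """Compare a quantized run against a baseline (e.g. f16) run.

    base_logits, quant_logits: one list of logits per token position.
    targets: index of the correct next token at each position.
    Assumes no probability underflows to exactly zero in the quantized run.
    """
    klds, delta_ps, same_top = [], [], 0
    for b, q, t in zip(base_logits, quant_logits, targets):
        p = softmax(b)  # baseline distribution
        r = softmax(q)  # quantized distribution
        # KL(p || r), summed over the vocabulary
        klds.append(sum(pi * math.log(pi / ri) for pi, ri in zip(p, r) if pi > 0))
        # change in probability assigned to the correct token
        delta_ps.append(r[t] - p[t])
        # do both runs agree on the most likely token?
        same_top += p.index(max(p)) == r.index(max(r))
    n = len(klds)
    return {
        "mean_kld": sum(klds) / n,
        "delta_p_rms": math.sqrt(sum(d * d for d in delta_ps) / n),
        "same_top_pct": 100.0 * same_top / n,
    }
```

With this, `ppl_gap_pct(ppl_turbo3, ppl_q8_0)` can be checked directly against the 1% target and the 2% failure threshold. The issue does not say how a gap between 1% and 2% should be treated, so the sketch leaves that decision to the caller.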

Python

Other Comments / Reviews

Good question. turbo3 at +1.4% vs q8_0 is worse than q4_0...

by TheTom Mar 26, 2026
How is the quality target met if turbo3 is worse than q4?

by Rotatingxenomorph Mar 26, 2026
## QUALITY FIXED ✅ Perplexity with inverse rotation res...

by TheTom Mar 25, 2026
## Root causes found ### 1. V cache in rotated space Pyt...

by TheTom Mar 25, 2026
## CRITICAL: Perplexity test reveals quality failure | C...

by TheTom Mar 25, 2026