TurboQuant's performance and quality across different GPU backends (CUDA vs. Metal).
GitHub Issue
Mar 27, 2026
## Context
@spiritbuun's CUDA fork is now the performance leader:
- **PPL: -1.17% vs q8_0** (beats baseline quality)
- **Prefill: 99.6%** of q8_0
- **Decode: 97.5%** of q8_0
- **128K context** on RTX 3090 24GB, Q6 Qwen3.5 27B
Repo: github.com/spiritbuun/llama-...
Our Metal implementation is at 99% prefill and +1.1% PPL, but only 88-90% decode (all relative to q8_0).
## Task
Go through buun's latest commits and identify optimizations we can port to Metal. Cherry-pick what's portable, document what's CUDA-only.
### Already ported
- [x] Norm correction (PPL +1.6% → +1.1%) — merged to main
- [x] Register centroid LUT — tested, spills on Metal (CUDA-only)
### To review
- [ ] Latest decode dequant optimizations (fattn-common.cuh)
- [ ] V dequant path (separate from K dot-product path)
- [ ] Batched uint8 loads for qs/signs (3 loads per 8 elements vs 16)
- [ ] turbo4 V_DOT2 half2 path — any Metal equivalent?
- [ ] AMD RDNA v_dot2_f32_f16 path — relevant for our AMD testers
- [ ] Any new norm correction refinements since our port
- [ ] FWHT rotation implementation differences
- [ ] Prefill dequant-then-attend (we're blocked on turbo3→f16 cast)
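For the batched-load item above, here is a minimal host-side C++ sketch of the idea, not code from the fork. It assumes a hypothetical layout in which 8 elements occupy 2 bytes of `qs` (a 2-bit magnitude index each) plus 1 byte of `signs` (1 bit each), which would account for "3 loads per 8 elements"; the per-element variant issues 2 loads per element, hence 16. The `kMagnitudes` table, the function names, and the bit order are all illustrative assumptions.

```cpp
#include <array>
#include <cstdint>

// Hypothetical magnitude centroids; the real table lives in the quantizer.
constexpr std::array<float, 4> kMagnitudes = {0.25f, 0.75f, 1.25f, 1.75f};

// Per-element variant: re-reads a qs byte and the signs byte for every
// element, i.e. 2 loads per element and 16 loads per group of 8.
inline void dequant8_naive(const uint8_t* qs, const uint8_t* signs,
                           float scale, float* out) {
    for (int i = 0; i < 8; ++i) {
        const uint8_t q = qs[i / 4];   // load 1 (4 elements per qs byte)
        const uint8_t s = signs[0];    // load 2 (8 elements per signs byte)
        const int idx = (q >> (2 * (i % 4))) & 0x3;
        const bool neg = ((s >> i) & 1) != 0;
        out[i] = scale * (neg ? -kMagnitudes[idx] : kMagnitudes[idx]);
    }
}

// Batched variant: 3 byte loads up front, then only shifts and masks on
// values that can stay in registers.
inline void dequant8_batched(const uint8_t* qs, const uint8_t* signs,
                             float scale, float* out) {
    const uint32_t q = uint32_t(qs[0]) | (uint32_t(qs[1]) << 8);  // loads 1-2
    const uint32_t s = signs[0];                                  // load 3
    for (int i = 0; i < 8; ++i) {
        const int idx = (q >> (2 * i)) & 0x3;
        const bool neg = ((s >> i) & 1) != 0;
        out[i] = scale * (neg ? -kMagnitudes[idx] : kMagnitudes[idx]);
    }
}
```

Both functions produce identical output, so the naive one can serve as a reference when validating a Metal port. Whether the batched form actually helps on Metal depends on how the compiler already coalesces the byte reads, which would need profiling.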
### Files to review
```
ggml/src/ggml-cuda/fattn-common.cuh # FA dequant (decode hot path)
ggml/src/ggml-cuda/turbo-quant-cuda.cuh # Quantize + norm correction
ggml/src/ggml-cuda/turbo-wht.cu # FWHT rotation
ggml/src/ggml-cuda/fattn-vec.cuh # Vec attention path
```
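As a reference point when comparing FWHT rotation implementations, the following is a textbook in-place fast Walsh-Hadamard transform in host-side C++. It is a sketch under stated assumptions, not the kernel in `turbo-wht.cu`: the name `fwht_inplace`, the orthonormal `1/sqrt(n)` scaling, and the absence of any sign-flip or permutation step are illustrative choices, and the fork may differ on each of them.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// In-place orthonormal fast Walsh-Hadamard transform (illustrative helper).
// Requires x.size() to be a power of two; returns false otherwise and
// leaves x untouched.
inline bool fwht_inplace(std::vector<float>& x) {
    const std::size_t n = x.size();
    if (n == 0 || (n & (n - 1)) != 0) {
        return false;
    }
    // log2(n) butterfly stages; each stage pairs elements h apart.
    for (std::size_t h = 1; h < n; h <<= 1) {
        for (std::size_t i = 0; i < n; i += 2 * h) {
            for (std::size_t j = i; j < i + h; ++j) {
                const float a = x[j];
                const float b = x[j + h];
                x[j] = a + b;
                x[j + h] = a - b;
            }
        }
    }
    // Scaling by 1/sqrt(n) makes the transform orthonormal, so applying it
    // twice returns the input (up to floating-point rounding).
    const float scale = 1.0f / std::sqrt(static_cast<float>(n));
    for (float& v : x) {
        v *= scale;
    }
    return true;
}
```

Because the scaled transform is its own inverse, a round trip through it is a cheap correctness check for a ported kernel: feed the same input to the CPU reference and the GPU path and compare outputs within a small tolerance.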
### Attribution
All ported optimizations must credit @spiri...