TurboQuant's performance and quality across different GPU backends (CUDA vs. Metal).
GitHub Issue
Mar 27, 2026
## Context
@spiritbuun's CUDA fork is now the performance leader:
- **PPL: -1.17% vs q8_0** (beats baseline quality)
- **Prefill: 99.6%** of q8_0
- **Decode: 97.5%** of q8_0
- **128K context** on RTX 3090 24GB, Q6 Qwen3.5 27B
Repo: github.com/spiritbuun/llama-...
Our Metal implementation, measured against the same q8_0 baseline: 99% prefill, +1.1% PPL, but only 88-90% decode.
## Task
Go through buun's latest commits and identify optimizations we can port to Metal. Cherry-pick what's portable, document what's CUDA-only.
### Already ported
- [x] Norm correction (PPL +1.6% → +1.1%) — merged to main
- [x] Register centroid LUT — tested, spills on Metal (CUDA-only)
### To review
- [ ] Latest decode dequant optimizations (fattn-common.cuh)
- [ ] V dequant path (separate from K dot-product path)
- [ ] Batched uint8 loads for qs/signs (3 loads per 8 elements vs 16)
- [ ] turbo4 V_DOT2 half2 path — any Metal equivalent?
- [ ] AMD RDNA v_dot2_f32_f16 path — relevant for our AMD testers
- [ ] Any new norm correction refinements since our port
- [ ] FWHT rotation implementation differences
- [ ] Prefill dequant-then-attend (we're blocked on turbo3→f16 cast)
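
To make the "batched uint8 loads" item concrete, here is a minimal CPU sketch of one possible reading of "3 loads per 8 elements vs 16". It assumes a hypothetical layout in which eight elements occupy two bytes of 2-bit centroid indices (`qs`) plus one byte of sign bits (`signs`). The layout, the function names, and the centroid values are all illustrative assumptions, not taken from fattn-common.cuh.

```cpp
#include <array>
#include <cstdint>

// Hypothetical block layout (an assumption for illustration only):
//   qs[0..1]  : eight 2-bit centroid indices, element 0 in the low bits
//   signs[0]  : eight sign bits, bit i set means element i is negative

// Naive path: every element re-reads its qs byte and its sign byte,
// which costs 16 byte loads for 8 elements.
inline float dequant1_naive(const std::uint8_t* qs, const std::uint8_t* signs,
                            const std::array<float, 4>& centroids,
                            float norm, int i) {
    const std::uint8_t qbyte = qs[i / 4];     // load 1 (per element)
    const std::uint8_t sbyte = signs[i / 8];  // load 2 (per element)
    const unsigned idx = (qbyte >> (2 * (i % 4))) & 0x3u;
    const float mag = centroids[idx] * norm;
    return ((sbyte >> (i % 8)) & 1u) ? -mag : mag;
}

// Batched path: three byte loads up front, then only shifts and masks.
inline std::array<float, 8> dequant8_batched(const std::uint8_t* qs,
                                             const std::uint8_t* signs,
                                             const std::array<float, 4>& centroids,
                                             float norm) {
    const std::uint32_t q = static_cast<std::uint32_t>(qs[0]) |
                            (static_cast<std::uint32_t>(qs[1]) << 8);  // loads 1-2
    const std::uint32_t s = signs[0];                                  // load 3
    std::array<float, 8> out{};
    for (int i = 0; i < 8; ++i) {
        const std::uint32_t idx = (q >> (2 * i)) & 0x3u;
        const float mag = centroids[idx] * norm;
        out[i] = ((s >> i) & 1u) ? -mag : mag;
    }
    return out;
}
```

Both paths should produce identical values. Whether the batched form is a win on Metal depends on how the compiler schedules the loads, so it would need to be measured rather than assumed.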
### Files to review
```
ggml/src/ggml-cuda/fattn-common.cuh # FA dequant (decode hot path)
ggml/src/ggml-cuda/turbo-quant-cuda.cuh # Quantize + norm correction
ggml/src/ggml-cuda/turbo-wht.cu # FWHT rotation
ggml/src/ggml-cuda/fattn-vec.cuh # Vec attention path
```
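
When comparing FWHT rotation implementations, a small CPU reference can help confirm that the CUDA and Metal kernels agree numerically. The sketch below is the textbook in-place fast Walsh-Hadamard transform with orthonormal scaling. It is not the kernel from turbo-wht.cu; the function name `fwht_orthonormal` is hypothetical, and it assumes the input length is a power of two.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Textbook in-place fast Walsh-Hadamard transform (CPU reference only).
// Assumes x.size() is a power of two. With the 1/sqrt(n) scaling the
// transform is orthonormal, so applying it twice returns the input.
inline void fwht_orthonormal(std::vector<float>& x) {
    const std::size_t n = x.size();
    if (n == 0) {
        return;
    }
    // Butterfly passes: combine pairs that are `len` apart, doubling `len`.
    for (std::size_t len = 1; len < n; len <<= 1) {
        for (std::size_t i = 0; i < n; i += 2 * len) {
            for (std::size_t j = i; j < i + len; ++j) {
                const float a = x[j];
                const float b = x[j + len];
                x[j] = a + b;
                x[j + len] = a - b;
            }
        }
    }
    // Orthonormal scaling keeps the L2 norm of the vector unchanged.
    const float scale = 1.0f / std::sqrt(static_cast<float>(n));
    for (float& v : x) {
        v *= scale;
    }
}
```

Because the scaled transform is its own inverse, a round-trip check (transform twice, compare with the input) is a cheap first test for any backend port before looking at element ordering or sign conventions.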
### Attribution
All ported optimizations must credit @spiri...