Review spiritbuun's CUDA fork for portable optimizations

TheTom/turboquant_plus

Status: Open

Opened: Mar 27, 2026

Comments: 1

## Context

@spiritbuun's CUDA fork is now the performance leader:

- **PPL: -1.17% vs q8_0** (beats baseline quality)
- **Prefill: 99.6%** of q8_0
- **Decode: 97.5%** of q8_0
- **128K context** on RTX 3090 24GB, Q6 Qwen3.5 27B

Repo: https://github.com/spiritbuun/llama-cpp-turboquant-cuda

Our Metal implementation: 99% prefill, +1.1% PPL, but only 88-90% decode.

## Task

Go through buun's latest commits and identify optimizations we can port to Metal. Cherry-pick what's portable, document what's CUDA-only.

### Already ported

- [x] Norm correction (PPL +1.6% → +1.1%), merged to main
- [x] Register centroid LUT: tested, spills on Metal (CUDA-only)

### To review

- [ ] Latest decode dequant optimizations (fattn-common.cuh)
- [ ] V dequant path (separate from K dot-product path)
- [ ] Batched uint8 loads for qs/signs (3 loads per 8 elements vs 16)
- [ ] turbo4 V_DOT2 half2 path: any Metal equivalent?
- [ ] AMD RDNA v_dot2_f32_f16 path: relevant for our AMD testers
- [ ] Any new norm correction refinements since our port
- [ ] FWHT rotation implementation differences
- [ ] Prefill dequant-then-attend (we're blocked on turbo3→f16 cast)

### Files to review

```
ggml/src/ggml-cuda/fattn-common.cuh      # FA dequant (decode hot path)
ggml/src/ggml-cuda/turbo-quant-cuda.cuh  # Quantize + norm correction
ggml/src/ggml-cuda/turbo-wht.cu          # FWHT rotation
ggml/src/ggml-cuda/fattn-vec.cuh         # Vec attention path
```

### Attribution

All ported optimizations must credit @spiri...
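
### Reference sketch for the FWHT review item

When comparing the CUDA and Metal FWHT rotations, a small CPU reference may help confirm that both produce the same values before looking at performance. The following is a minimal sketch of the textbook fast Walsh-Hadamard transform in Python. It is not taken from `turbo-wht.cu` or from our Metal kernels; the function names `fwht` and `fwht_orthonormal` are hypothetical, and the 1/sqrt(n) scaling is an assumption about how the rotation is normalized, so check the actual kernels before relying on it.

```python
import math


def fwht(values):
    """Return the unnormalized fast Walsh-Hadamard transform of `values`.

    Generic textbook butterfly, not the fork's kernel.
    The input length must be a power of two.
    """
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    out = list(values)
    h = 1
    while h < n:
        # Combine pairs that are h apart inside each block of 2*h elements.
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    return out


def fwht_orthonormal(values):
    """Scale by 1/sqrt(n) so the transform is its own inverse (assumed convention)."""
    scale = 1.0 / math.sqrt(len(values))
    return [v * scale for v in fwht(values)]
```

With the orthonormal scaling, applying the transform twice should return the input up to floating-point error, which gives a quick round-trip check to run against any kernel output dumped to the host.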
