Gemini Executive Synthesis

TurboQuant's performance and quality across different GPU backends (CUDA vs. Metal).

Technical Positioning
Achieving state-of-the-art performance (prefill, decode) and quality (PPL) for TurboQuant across diverse hardware platforms (NVIDIA CUDA, Apple Metal, AMD RDNA).
SaaS Insight & Market Implications
This issue outlines a competitive analysis and optimization strategy for TurboQuant. A CUDA fork has achieved better performance and quality (lower PPL, higher prefill/decode ratios) than the existing Metal implementation. The task is to systematically port these CUDA optimizations to Metal, identifying which techniques are portable and which are CUDA-specific. For B2B SaaS, cross-platform performance parity is crucial for market penetration: relying on a single hardware ecosystem limits the addressable market. This initiative demonstrates a commitment to maximizing efficiency across diverse customer infrastructures, directly improving cost-effectiveness and competitive positioning. Prioritizing such engineering work keeps the product performant and relevant across an evolving hardware landscape.
Proprietary Technical Taxonomy
CUDA fork · performance leader · PPL · q8_0 · Prefill · Decode · 128K context · RTX 3090

Raw Developer Origin & Technical Request

GitHub Issue · Mar 27, 2026
Repo: TheTom/turboquant_plus
Review spiritbuun's CUDA fork for portable optimizations

## Context

@spiritbuun's CUDA fork is now the performance leader:
- **PPL: -1.17% vs q8_0** (beats baseline quality)
- **Prefill: 99.6%** of q8_0
- **Decode: 97.5%** of q8_0
- **128K context** on RTX 3090 24GB, Q6 Qwen3.5 27B

Repo: github.com/spiritbuun/llama-...

Our Metal implementation: 99% prefill, +1.1% PPL, but only 88-90% decode.

## Task

Go through buun's latest commits and identify optimizations we can port to Metal. Cherry-pick what's portable, document what's CUDA-only.

### Already ported
- [x] Norm correction (PPL +1.6% → +1.1%) — merged to main
- [x] Register centroid LUT — tested, spills on Metal (CUDA-only)
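The already-ported norm correction can be illustrated with a minimal sketch: record each row's L2 norm at quantization time, then rescale the dequantized row to match it, canceling the norm drift that quantization introduces. The quantizer below is a naive symmetric codec for illustration only, not TurboQuant's actual turbo3 format.

```python
import math
import random

def quantize_row(x, bits=4):
    """Illustrative symmetric per-row quantization (NOT TurboQuant's
    actual codec): round onto a signed integer grid with one scale."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(v) for v in x) / qmax
    q = [max(-qmax - 1, min(qmax, round(v / scale))) for v in x]
    return q, scale

def dequant_norm_corrected(q, scale, stored_norm):
    """Dequantize, then rescale the row so its L2 norm matches the
    norm recorded at quantization time -- the 'norm correction' idea."""
    y = [v * scale for v in q]
    n = math.sqrt(sum(v * v for v in y))
    return [v * stored_norm / n for v in y] if n > 0 else y

random.seed(0)
x = [random.gauss(0.0, 1.0) for _ in range(64)]
q, s = quantize_row(x)
orig_norm = math.sqrt(sum(v * v for v in x))
y = dequant_norm_corrected(q, s, orig_norm)
# The corrected row reproduces the original L2 norm up to fp error.
```

The cost is one extra stored scalar and one multiply per element at dequant time, which is why it ported cleanly to Metal.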

### To review
- [ ] Latest decode dequant optimizations (fattn-common.cuh)
- [ ] V dequant path (separate from K dot-product path)
- [ ] Batched uint8 loads for qs/signs (3 loads per 8 elements vs 16)
- [ ] turbo4 V_DOT2 half2 path — any Metal equivalent?
- [ ] AMD RDNA v_dot2_f32_f16 path — relevant for our AMD testers
- [ ] Any new norm correction refinements since our port
- [ ] FWHT rotation implementation differences
- [ ] Prefill dequant-then-attend (we're blocked on turbo3→f16 cast)
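The "batched uint8 loads" item can be sketched as follows: instead of fetching each element's magnitude and sign bit individually (16 accesses per 8 elements), pack a block so that 3 byte loads recover everything. The exact bit layout below is a hypothetical example, not the fork's actual turbo3 layout.

```python
def pack_block(qs, signs):
    """Pack 8 elements' 2-bit magnitudes and sign bits into 3 bytes.
    Hypothetical layout for illustration; qs[i] in 0..3, signs[i] in {0,1}."""
    b0 = sum((qs[i] & 3) << (2 * i) for i in range(4))
    b1 = sum((qs[i + 4] & 3) << (2 * i) for i in range(4))
    b2 = sum((signs[i] & 1) << i for i in range(8))
    return bytes([b0, b1, b2])

def unpack_block(buf):
    """Three byte loads recover all 8 (magnitude, sign) pairs, versus
    one access per field (16 per 8 elements) in the naive path."""
    b0, b1, b2 = buf[0], buf[1], buf[2]      # the 3 loads
    qs = [(b0 >> (2 * i)) & 3 for i in range(4)] + \
         [(b1 >> (2 * i)) & 3 for i in range(4)]
    signs = [(b2 >> i) & 1 for i in range(8)]
    return qs, signs

qs = [3, 0, 2, 1, 1, 2, 0, 3]
signs = [1, 0, 0, 1, 1, 1, 0, 0]
assert unpack_block(pack_block(qs, signs)) == (qs, signs)
```

On a GPU the win comes from fewer memory transactions and more work done in registers with shifts and masks, which should translate to Metal if the layout itself is portable.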

### Files to review
```
ggml/src/ggml-cuda/fattn-common.cuh # FA dequant (decode hot path)
ggml/src/ggml-cuda/turbo-quant-cuda.cuh # Quantize + norm correction
ggml/src/ggml-cuda/turbo-wht.cu # FWHT rotation
ggml/src/ggml-cuda/fattn-vec.cuh # Vec attention path
```
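For reviewers unfamiliar with the FWHT rotation in `turbo-wht.cu`, here is a reference sketch of the transform itself (the standard in-place butterfly, not a port of the CUDA kernel). Its key property for this codebase is that it is its own inverse up to a factor of 1/n, so the inverse rotation in dequant costs the same as the forward one.

```python
def fwht(x):
    """In-place unnormalized fast Walsh-Hadamard transform.
    len(x) must be a power of two.  Applying it twice yields n * x."""
    h, n = 1, len(x)
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                a, b = x[j], x[j + h]
                x[j], x[j + h] = a + b, a - b
        h *= 2
    return x
```

Differences between the fork's implementation and ours are likely in blocking, normalization placement, and where the transform is fused into the quant/dequant kernels rather than in the butterfly itself.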

### Attribution
All ported optimizations must credit @spiri...

Developer Debate & Comments

No active discussions extracted for this entry yet.

Adjacent Repository Pain Points

Other highly discussed features and pain points extracted from TheTom/turboquant_plus.

Extracted Positioning
turbo3 quantization for LLM KV cache compression
Achieving 4.6x compression with quality (perplexity, KL divergence, NIAH) comparable to q8_0 (within 2% PPL) and superior to q4_0, while maintaining high inference speed.
Top Replies
TheTom • Mar 25, 2026
## CRITICAL: Perplexity test reveals quality failure

| Cache | PPL | vs f16 |
|-------|-----|--------|
| f16 | 6.121 | baseline |
| q8_0 | 6.111 | -0.16% |
| q4_0 | 6.142 | +0.34% |
| **turbo3** | ...
TheTom • Mar 25, 2026
## Root causes found

### 1. V cache in rotated space
Python verification: dequant output has cosine=0.02 with input (garbage). After inverse rotation: cosine=0.987 (correct). V cache values MUST be...
TheTom • Mar 25, 2026
## QUALITY FIXED ✅

Perplexity with inverse rotation restored in dequant:

| Cache | PPL | vs q8_0 |
|-------|-----|---------|
| f16 | 6.121 | — |
| q8_0 | 6.111 | baseline |
| q4_0 | 6.142 | +0.5% ...
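The root cause described in these replies (V cache stored in rotated space, so dequant must apply the inverse rotation) can be reproduced with a small self-contained sketch. The quantizer and dimensions below are illustrative, not the repo's actual code; the point is that skipping the inverse rotation yields near-zero cosine with the input, while applying it recovers the values.

```python
import math
import random

def fwht(x):
    """Unnormalized fast Walsh-Hadamard transform (self-inverse up to 1/n)."""
    x = list(x)
    h, n = 1, len(x)
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                a, b = x[j], x[j + h]
                x[j], x[j + h] = a + b, a - b
        h *= 2
    return x

def cosine(a, b):
    dot = sum(u * v for u, v in zip(a, b))
    na = math.sqrt(sum(u * u for u in a))
    nb = math.sqrt(sum(v * v for v in b))
    return dot / (na * nb)

def quantize_dequant(x, bits=3):
    """Naive symmetric round-trip quantization, for illustration only."""
    qmax = 2 ** (bits - 1) - 1
    s = max(abs(v) for v in x) / qmax
    return [max(-qmax - 1, min(qmax, round(v / s))) * s for v in x]

random.seed(0)
n = 64
v = [random.gauss(0.0, 1.0) for _ in range(n)]
rotated = fwht(v)                      # V stored in rotated space
deq = quantize_dequant(rotated)        # dequant WITHOUT inverse rotation
restored = [y / n for y in fwht(deq)]  # dequant WITH inverse rotation

print(round(cosine(v, deq), 3))       # small: rotated values, "garbage"
print(round(cosine(v, restored), 3))  # close to 1: values recovered
```

This mirrors the cosine=0.02 vs cosine=0.987 verification quoted in the reply above, though the exact numbers depend on the quantizer and seed.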
Extracted Positioning
`turbo3` decode performance for LLM inference on Apple Silicon (M1, M2 Pro, M5 Max), specifically addressing the 'decode cliff' at increasing context depths.
Achieving flat, high-performance `turbo3` decode ratios (0.90x+ of `q8_0`) across all context depths on Apple Silicon, minimizing performance degradation from memory access patterns.
Top Replies
TheTom • Mar 27, 2026
## M2 Pro Results: Bit-Arithmetic Dequant

**Hardware:** Apple M2 Pro, Apple8 (1008), has_tensor=false, 32GB
**Model:** Qwen2.5-7B-Instruct-Q4_K_M
**Build:** experiment/m1-m2-decode-comparison (auto...
TheTom • Mar 27, 2026
## M2 Pro Results Update: Batched Extract IS a Win

True baseline comparison (same branch chain, same build):

| Depth | q8_0 | Main (const LUT) | Batched extract | Bit-arithmetic |
|-------|------|-...
TheTom • Mar 27, 2026
## BREAKTHROUGH: Profiling isolation identifies exact bottleneck

Added TURBO_PROFILE_MODE (0-4) to strip away dequant layers one at a time.

### M5 Max vs M2 Pro at 8K context decode:

| Mode | What ...
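The TURBO_PROFILE_MODE technique described above is ablation profiling: run the dequant pipeline only up to stage k, and attribute the time difference between adjacent modes to exactly one stage. A minimal sketch of the pattern, with hypothetical stage names and stand-in work (the repo's actual stages and gating differ):

```python
import time

# Hypothetical dequant stages, ordered; mode k runs stages 0..k.
STAGES = ["load", "extract", "lut", "sign", "scale"]

def run_up_to(mode, data):
    """mode=0 runs only the raw load; mode=4 runs the full pipeline."""
    out = data
    for _stage in STAGES[: mode + 1]:
        out = [v + 1 for v in out]   # stand-in work per stage
    return out

def profile(data, reps=3):
    """Per-stage cost = time(mode k) - time(mode k-1)."""
    times = []
    for mode in range(len(STAGES)):
        t0 = time.perf_counter()
        for _ in range(reps):
            run_up_to(mode, data)
        times.append((time.perf_counter() - t0) / reps)
    return {STAGES[k]: times[k] - (times[k - 1] if k else 0.0)
            for k in range(len(STAGES))}
```

The same pattern applies on Metal by compiling mode-gated kernel variants, which is presumably how the 0-4 modes isolate the decode bottleneck per stage.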
Extracted Positioning
TurboQuant (`-ctk turbo3 -ctv turbo3`) integration with Vulkan devices for LLM inference.
Achieving broad hardware compatibility for TurboQuant, specifically extending to Vulkan-enabled AMD GPUs.
Extracted Positioning
TurboQuant (turbo3 and turbo4) performance optimization for LLM inference, specifically on Apple M1 hardware.
Achieving superior LLM inference speed (tokens/sec) through TurboQuant optimizations on Apple Silicon (M1).
Extracted Positioning
TurboQuant's quantization strategy, specifically regarding K/V norm disparity, attention quantization methods (MSE vs. Prod), and outlier detection (dynamic vs. fixed).
Advancing TurboQuant's quantization efficacy to achieve lower perplexity (PPL) and higher compression (lower average bit rates) through refined techniques.

Engagement Signals

Replies: 1
Issue status: open

Cross-Market Term Frequency

Quantifies the cross-market adoption of foundational terms like q8_0 and PPL by tracking occurrence frequency across active SaaS architectures and enterprise developer debates.

Macro Market Trends

Correlated public search velocity for adjacent technologies.

Apple · Apple-silicon · Applied Psychology