Insight for: Review spiritbuun's CUDA fork for portable optimizations

Compares TurboQuant's performance and quality across GPU backends (CUDA vs. Metal).

Analyzed: Apr 1, 2026

This issue proposes a competitive analysis and optimization plan for TurboQuant. A CUDA fork outperforms the existing Metal implementation on both quality and speed (lower PPL, higher prefill/decode ratios). The task is to review the fork's optimizations, separate the portable techniques from the CUDA-specific ones, and port the portable ones to Metal.

For a B2B SaaS product, cross-platform performance parity matters for market reach: relying on a single hardware ecosystem limits the addressable market. Closing the gap with the CUDA fork improves cost-effectiveness on customers' own infrastructure, strengthens competitive positioning, and keeps the product performant as hardware changes.

CUDA fork, performance leader, PPL, q8_0, Prefill, Decode, 128K context, RTX 3090, Qwen3.5 27B, Metal implementation, Norm correction, Register centroid LUT, decode dequant optimizations, fattn-common.cuh, V dequant path, Batched uint8 loads, turbo4, V_DOT2 half2 path, AMD RDNA, v_dot2_f32_f16 path, FWHT rotation, Prefill dequant-then-attend, turbo3→f16 cast, turbo-quant-cuda.cuh, turbo-wht.cu, fattn-vec.cuh
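One of the techniques named above, FWHT rotation, is backend-independent at the algorithm level, which makes it a plausible candidate for the "portable" bucket. The sketch below is a generic, orthonormal fast Walsh-Hadamard transform in plain Python, included only to illustrate the butterfly structure that a CUDA or Metal kernel would parallelize. It is an assumption-laden illustration: the function name `fwht` is hypothetical, and nothing here is taken from the fork's turbo-wht.cu or reflects how TurboQuant applies the rotation.

```python
import math


def fwht(values):
    """Orthonormal fast Walsh-Hadamard transform (illustrative sketch).

    `values` must have a power-of-two length. Returns a new list of floats.
    This is a generic textbook formulation, not the fork's implementation.
    """
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a non-zero power of two")

    out = [float(v) for v in values]
    h = 1
    while h < n:
        # Each stage combines elements that are h apart (one butterfly pass).
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2

    # Scaling by 1/sqrt(n) makes the transform norm-preserving
    # and its own inverse.
    scale = 1.0 / math.sqrt(n)
    return [v * scale for v in out]
```

Because the scaled transform is its own inverse, applying `fwht` twice returns the original vector, which gives a simple correctness check when comparing a Metal port against a CUDA reference.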

GitHub Issue

Parent Entity

Review spiritbuun's CUDA fork for portable optimizations

State: Open