A bug report on TurboQuant (turbo3/turbo4 KV-cache types) for LLM inference, specifically its compatibility with the new NVIDIA Blackwell GPUs.
Raw Developer Origin & Technical Request
GitHub Issue
Mar 31, 2026
## Environment
- OS: CachyOS Linux (kernel 6.19.10)
- GPU: NVIDIA GeForce RTX 5070 Laptop GPU
- VRAM: 8GB (7707 MiB)
- CUDA Version: 13.2
- Driver: 595.58.03
- Compute Capability: 12.0 (Blackwell)
- Build: 8665 (5364f8a1d) with GNU 15.2.1
## Model
- qwen2.5-coder:7b-instruct-q6_K (GGUF)
## Command
```
llama-server \
  -m qwen2.5-coder-7b-q6_K.gguf \
  -ngl 99 -c 32768 -fa on \
  --cache-type-k turbo3 \
  --cache-type-v turbo3 \
  --host 0.0.0.0 --port 8080
```
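Assuming this fork keeps upstream llama.cpp's `POST /completion` endpoint (an assumption, not stated in the report), the repro prompt can be replayed programmatically. This sketch only builds the request body, so the sweep over cache types can be scripted; the endpoint name and field names should be verified against this build:

```python
import json

def build_completion_request(prompt: str, n_predict: int = 128) -> dict:
    """Build the JSON body for llama-server's POST /completion endpoint
    (endpoint and field names assumed from upstream llama.cpp)."""
    return {
        "prompt": prompt,
        "n_predict": n_predict,
        "temperature": 0.0,  # greedy sampling makes the garbling reproducible
    }

payload = build_completion_request("Write a hello world in Python")
# Roughly equivalent to:
#   curl http://localhost:8080/completion -d '{"prompt": ..., "n_predict": 128, ...}'
print(json.dumps(payload))
```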
## Expected behavior
Coherent text output as reported in the paper on Apple Silicon.
## Actual behavior
Garbled, repetitive output. Examples:
- Prompt: "Write a hello world in Python"
- Response: `"Here is a simple simple simple Python program that world world:\n\nprint(\"Hello world\")"`
- turbo3/turbo4 on both K and V: broken output.
- K=turbo3 + V=q8_0: also broken (only 3 tokens generated).
- K=q8_0 + V=q8_0: works correctly.
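When sweeping cache-type combinations like the matrix above, the "simple simple simple" failure mode can be flagged automatically rather than by eye. A minimal sketch (the run-length threshold is an arbitrary assumption, not part of the report):

```python
def max_repeat_run(text: str) -> int:
    """Length of the longest run of consecutive identical words."""
    words = text.split()
    best = run = 1 if words else 0
    for prev, cur in zip(words, words[1:]):
        run = run + 1 if cur == prev else 1
        best = max(best, run)
    return best

def looks_garbled(text: str, max_run: int = 2) -> bool:
    """Flag degenerate repetition like the turbo3 output above."""
    return max_repeat_run(text) > max_run

print(looks_garbled("Here is a simple simple simple Python program that world world"))  # True
print(looks_garbled("Here is a simple Python program"))  # False
```

This catches word-level stuttering only; a broken dequant kernel can also produce non-repetitive gibberish, so perplexity against a known-good q8_0 run would be a stronger check.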
## Notes
This appears to be the first test of TurboQuant CUDA kernels on Blackwell (sm_120).
The CUDA build succeeded without errors (ARCHS=1200).
The issue is likely in the CUDA dequantization kernels not being validated for sm_120.
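If the dequantization kernels are the culprit, the standard diagnostic is a quantize→dequantize round trip compared against a CPU reference. The sketch below uses a simplified q8_0-style blockwise scheme as that reference (real GGUF q8_0 stores blocks of 32 values with an fp16 scale; this pure-Python version illustrates the error bound a correct sm_120 kernel should satisfy, not the exact on-disk format):

```python
# Simplified q8_0-style blockwise quantizer: one shared scale per block,
# values rounded to signed 8-bit. A CUDA dequant kernel producing output
# outside the round-trip error bound below is miscompiled or misindexed.
BLOCK = 32

def quantize_block(xs):
    amax = max(abs(x) for x in xs)
    scale = amax / 127.0 if amax else 1.0
    return scale, [round(x / scale) for x in xs]

def dequantize_block(scale, qs):
    return [scale * q for q in qs]

def roundtrip_max_error(xs):
    """Worst-case absolute reconstruction error over one block."""
    scale, qs = quantize_block(xs)
    ys = dequantize_block(scale, qs)
    return max(abs(x - y) for x, y in zip(xs, ys))

vals = [(-1) ** i * (i / BLOCK) for i in range(BLOCK)]
scale, _ = quantize_block(vals)
# Rounding error is bounded by half a quantization step.
assert roundtrip_max_error(vals) <= scale / 2 + 1e-12
```

Running the GPU dequant output through the same bound, block by block, would localize whether sm_120 breaks the scale load, the index math, or both.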
Developer Debate & Comments
No active discussions extracted for this entry yet.
Adjacent Repository Pain Points
Other highly discussed features and pain points extracted from TheTom/turboquant_plus.
Engagement Signals
Cross-Market Term Frequency
Tracks how frequently foundational terms such as turbo3 and turbo4 occur across active SaaS architectures and enterprise developer debates, as a proxy for cross-market adoption.
Market Trends