turbo3/turbo4 cache produces garbled output on NVIDIA Blackwell GPU (RTX 5070 Laptop, compute capability 12.0)

TheTom/turboquant_plus

Status: Open

Opened: Mar 31, 2026

Comments: 1

## Environment

- OS: CachyOS Linux (kernel 6.19.10)
- GPU: NVIDIA GeForce RTX 5070 Laptop GPU
- VRAM: 8GB (7707 MiB)
- CUDA Version: 13.2
- Driver: 595.58.03
- Compute Capability: 12.0 (Blackwell)
- Build: 8665 (5364f8a1d) with GNU 15.2.1

## Model

- qwen2.5-coder:7b-instruct-q6_K (GGUF)

## Command

`llama-server -m qwen2.5-coder-7b-q6_K.gguf -ngl 99 -c 32768 -fa on --cache-type-k turbo3 --cache-type-v turbo3 --host 0.0.0.0 --port 8080`

## Expected behavior

Coherent text output as reported in the paper on Apple Silicon.

## Actual behavior

Garbled, repetitive output. Examples:

- Prompt: "Write a hello world in Python"
- Response: `"Here is a simple simple simple Python program that world world:\n\nprint(\"Hello world\")"`

turbo3/turbo4 on both K and V produces broken output. K=turbo3 + V=q8_0 also produces broken output (only 3 tokens generated). K=q8_0 + V=q8_0 works correctly.

## Notes

This appears to be the first test of TurboQuant CUDA kernels on Blackwell (sm_120). The CUDA build succeeded without errors (ARCHS=1200). The issue is likely in the CUDA dequantization kernels not being validated for sm_120.
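To narrow down which K/V cache combinations fail, the configurations tried in this report could be enumerated and scored with a small script. The following is only a sketch: `build_command` and `adjacent_repeats` are hypothetical helper names, the command line reuses the flags from this report unchanged, and counting adjacent repeated words is an assumed heuristic for "garbled" output, not a check used by the project.

```python
import itertools
import re

# Cache types mentioned in this report.
CACHE_TYPES = ["q8_0", "turbo3", "turbo4"]


def build_command(k_type, v_type, model="qwen2.5-coder-7b-q6_K.gguf"):
    """Return the llama-server argv from this report for one K/V cache pair."""
    return [
        "llama-server", "-m", model,
        "-ngl", "99", "-c", "32768", "-fa", "on",
        "--cache-type-k", k_type,
        "--cache-type-v", v_type,
        "--host", "0.0.0.0", "--port", "8080",
    ]


def adjacent_repeats(text):
    """Count positions where a word is immediately followed by the same word.

    Assumed heuristic: coherent output rarely repeats a word back to back,
    while the broken output in this report does ("simple simple simple").
    """
    words = re.findall(r"[A-Za-z0-9_]+", text.lower())
    return sum(1 for a, b in zip(words, words[1:]) if a == b)


if __name__ == "__main__":
    # Print one command per K/V combination so each can be run by hand.
    for k, v in itertools.product(CACHE_TYPES, repeat=2):
        print(" ".join(build_command(k, v)))
```

Applied to the broken response quoted above, `adjacent_repeats` flags the repeated "simple" and "world" runs, while a clean response such as `print("Hello world")` scores zero. The score only catches back-to-back repetition, so truncated output (such as the 3-token case) would need a separate length check.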

