
GitHub Open Source TheTom/turboquant_plus


Traction Score: 5,654
Forks: 776
Launch Date: Mar 25, 2026

Product Positioning & Context

AI Executive Synthesis
Core challenge: ensuring correct, robust implementation of the different quantization schemes (turbo3, turbo4) across varying block sizes and head dimensions, without data corruption or out-of-bounds access.
A post-commit review identified critical bugs in the block-size-32 change: corrupted turbo4 cache writes and out-of-bounds array access in CPU paths. The `SET_ROWS` kernel, hardcoded for turbo3, was incorrectly instantiated for turbo4, and integer-division logic dropped tail blocks for non-128 head dimensions. This reveals significant fragility in low-level quantization code, where minor changes can introduce severe data-integrity issues across model configurations and quantization types; reliance on one specific head dimension (e.g., Qwen's 128) masked the broader problem. The market implication: rigorous automated code analysis and comprehensive testing across diverse model architectures and hardware are necessary to keep highly optimized, low-level inference components reliable and compatible.
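The tail-block bug described above can be sketched in a few lines; the function names and block size are illustrative, not the project's actual code:

```python
# Illustrative sketch: floor (integer) division drops the final partial block
# when the head dimension is not a multiple of the block size.
BLOCK_SIZE = 32

def n_blocks_buggy(head_dim: int) -> int:
    return head_dim // BLOCK_SIZE                      # floor: drops tail block

def n_blocks_fixed(head_dim: int) -> int:
    return (head_dim + BLOCK_SIZE - 1) // BLOCK_SIZE   # ceil: keeps tail block
```

For head_dim = 128 (Qwen) both versions return 4, which is why the bug stayed hidden; for a hypothetical head_dim = 80 the floor version returns 2 instead of 3, so the final 16 elements are never processed.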

Active Developer Issues (GitHub)

open Unsupported gpu architecture 'compute_120a'
Logged: Mar 31, 2026
open turbo3/turbo4 cache produces garbled output on NVIDIA Blackwell GPU (RTX 5070 Laptop, compute capability 12.0)
Logged: Mar 31, 2026
open No Difference in tokens/sec - Ministral3 8B Q5_K_M
Logged: Mar 29, 2026
open Don't work on vulkan device
Logged: Mar 28, 2026
open Engineering findings: K/V norm disparity + MSE > Prod + outlier mixed precision
Logged: Mar 28, 2026

Community Voice & Feedback

TheTom • Mar 27, 2026
## Final experiment results — dequant-level optimization ceiling reached

### Complete M2 Pro scoreboard (8K decode, q8_0 = 21.9 tok/s):

| # | Approach | tok/s | vs q8_0 | vs Main | Const addrs |
|---|----------|-------|---------|---------|-------------|
| — | No-op ceiling | 24.5 | 1.119x | — | 0 |
| **1** | **4-mag LUT + per-elem norm** | **15.1** | **0.689x** | **+38%** | **4** |
| 2 | Batched extract (8-LUT) | 13.7 | 0.626x | +25% | 8 |
| 3 | Deferred norm (4-mag) | 12.9 | 0.589x | +18% | 4 |
| 4 | 2-pair half2 LUT | 12.0 | 0.548x | +10% | 2 |
| 5 | Select chain (zero LUT) | 11.9 | 0.544x | +9% | 0 |
| 6 | Bit-arithmetic | 11.6 | 0.530x | +6% | 0 |
| — | Main (8-entry LUT) | 10.95 | 0.500x | baseline | 8 |
| 7 | Non-vec forced (nl=2) | 10.2 | 0.466x | -7% | 8 |

### Key insight: 4 constant addresses is the sweet spot on M2 Pro
- **0 addresses** (select chain, bit-arith): ALU cost exceeds constant cache savings
- **2 addresses** (half2 pairs): ternary overhead exceeds savings from ...
TheTom • Mar 27, 2026
## 4-Entry Magnitude LUT + Branchless Sign: BEST M2 RESULT

**Approach:** 4-entry constant half magnitude LUT (0.021-0.190) + XOR trick for reversed magnitude order + branchless sign multiply. Only 4 possible constant addresses per lookup instead of 8.
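The approach above can be sketched as follows; the code layout (sign in bit 2, 2-bit magnitude index in the low bits) and the intermediate LUT values are assumptions based on this comment, not turbo3's real format:

```python
# Hypothetical sketch of a 4-entry magnitude LUT with XOR index reversal and a
# branchless sign multiply. Only the endpoints (0.021, 0.190) come from the
# comment above; the middle values and bit layout are illustrative.
MAG_LUT = [0.021, 0.077, 0.134, 0.190]

def dequant_code(code: int) -> float:
    """Decode a 3-bit code: bit 2 = sign, bits 1..0 = magnitude index
    (stored in reversed order, undone by the XOR)."""
    mag = MAG_LUT[(code & 0b11) ^ 0b11]    # XOR trick: reversed magnitude order
    sign = 1.0 - 2.0 * ((code >> 2) & 1)   # branchless: +1.0 or -1.0
    return sign * mag
```

The point of the layout is that every lookup touches one of only 4 constant addresses, versus 8 for a full signed LUT.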

### M2 Pro decode improvement:

| Depth | q8_0 | Main (8-LUT) | 4-mag LUT | vs Main | vs q8_0 |
|-------|------|-------------|-----------|---------|---------|
| short | 32.5 | 22.9 | 23.8 | +3.9% | 0.732x |
| 8K | 21.9 | 10.95 | 15.1 | **+37.9%** | 0.689x |
| 16K | 17.2 | 8.0 | 11.6 | **+45.0%** | 0.674x |

### M5 Max (no regression):

| Depth | Main | 4-mag LUT | Delta |
|-------|------|-----------|-------|
| short | 77.4 | 75.7 | -2.2% |

PPL: 6.1756 (unchanged).

### Summary

+38-45% decode improvement on M2 Pro at long context. The ratio vs q8_0 improved from 0.45-0.50x to 0.67-0.73x. The cliff is much less severe.

Minor regression on M5 (-2.2%) from the extra ALU (XOR + sign multiply). Could use the auto-detection to use 4-mag on ...
TheTom • Mar 27, 2026
## BREAKTHROUGH: Profiling isolation identifies exact bottleneck

Added TURBO_PROFILE_MODE (0-4) to strip away dequant layers one at a time.
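A minimal sketch of what such staged gating might look like (the real kernels are Metal/CUDA; the block format, stage math, and LUT here are assumptions, only the mode-to-stage mapping follows the table below):

```python
# Each mode adds one dequant stage, so mode-to-mode timing deltas attribute
# cost to a single stage. Illustrative only.
MAGS = [0.021, 0.077, 0.134, 0.190]

def dequant_staged(block: dict, mode: int) -> list:
    n = len(block["codes"])
    if mode == 1:                                 # mode 1: no-op ceiling
        return [0.0] * n
    norm = block["norm"]                          # mode 2: + norm read
    if mode == 2:
        return [norm] * n
    codes = list(block["codes"])                  # mode 4: + byte reads, no decode
    if mode == 4:
        return [norm] * n
    mags = [MAGS[(c & 0b11) ^ 0b11] for c in codes]  # mode 3: + extraction + LUT
    if mode == 3:
        return [norm * m for m in mags]
    # mode 0: full path — + sign handling
    return [norm * (-m if c & 0b100 else m) for c, m in zip(codes, mags)]
```

Subtracting timings between adjacent modes then isolates the cost of the stage that differs, which is how the per-stage percentages in the table were obtained.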

### M5 Max vs M2 Pro at 8K context decode:

| Mode | What | M5 (% ceil) | M2 (% ceil) |
|------|------|------------|------------|
| 1 | No-op ceiling | 78.9 (100%) | 24.5 (100%) |
| 2 | + norm read | 75.1 (95%) | 22.1 (90%) |
| 4 | + all byte reads | 75.2 (95%) | 21.9 (89%) |
| 3 | + qs extraction + LUT | 64.9 (82%) | 16.4 (67%) |
| 0 | + signs + full LUT | 59.2 (75%) | 14.0 (57%) |
| q8_0 | baseline | 78.8 | 22.1 |

### Key findings:

1. **No-op turbo3 is FASTER than q8_0 on M2** (24.5 vs 22.1) — compressed cache = less bandwidth. The format is not the problem.

2. **Constant memory LUT is 2x worse on M2 than M5:**
- Mode 4→3 (LUT cost): M5 loses 13.7%, M2 loses 25.1%
- Mode 3→0 (signs+more LUT): M5 loses another 8.6%, M2 loses another 14.7%

3. **Byte reading is NOT the bottleneck** — Mode 4 (all reads, no LUT) only costs 10% on both.

4....
TheTom • Mar 27, 2026
## M2 Pro Results Update: Batched Extract IS a Win

True baseline comparison (same branch chain, same build):

| Depth | q8_0 | Main (const LUT) | Batched extract | Bit-arithmetic |
|-------|------|-----------------|-----------------|----------------|
| short | 32.5 | 22.9 | **23.7 (+3.5%)** | 23.2 |
| 8K | 22.1 | 10.95 | **13.7 (+25%)** | 11.6 |

Earlier diagnostic (34.5 short) was a different build/context allocation — not comparable.

**Batched extract gives +25% at 8K on M2 Pro.** The explicit bit field pre-extraction helps Metal's compiler schedule device reads ahead of ALU on older hardware.
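The batched-extract idea can be sketched in scalar form; the 3-bit packing shown (two codes per byte, top two bits unused) and the LUT values are stand-ins for illustration, not turbo3's actual layout:

```python
MAGS = [0.021, 0.077, 0.134, 0.190]   # illustrative magnitude LUT

def extract_batched(packed: bytes) -> list:
    """Phase 1: memory reads and bit-field extraction only — no LUT/ALU work —
    so the compiler can schedule the loads ahead of the math."""
    codes = []
    for b in packed:
        codes.append(b & 0b111)           # low 3-bit code
        codes.append((b >> 3) & 0b111)    # high 3-bit code
    return codes

def dequant_batched(packed: bytes, norm: float) -> list:
    codes = extract_batched(packed)       # all fields pre-extracted up front
    return [norm * (-MAGS[c & 0b11] if c & 0b100 else MAGS[c & 0b11])
            for c in codes]               # phase 2: LUT + sign ALU pass
```

Splitting the loop this way makes the load/ALU separation explicit instead of relying on the compiler to hoist reads out of an interleaved loop.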

Next: profile where the remaining gap is (turbo3 0.62x vs q8_0 at 8K).
TheTom • Mar 27, 2026
## M2 Pro Results: Bit-Arithmetic Dequant

**Hardware:** Apple M2 Pro, Apple8 (1008), has_tensor=false, 32GB
**Model:** Qwen2.5-7B-Instruct-Q4_K_M
**Build:** experiment/m1-m2-decode-comparison (auto-detected bit-arithmetic)

### Decode Speed (tok/s)

| Depth | q8_0 | turbo3 (bit-arith) | Ratio | turbo3 (const LUT, earlier diag) | Ratio |
|-------|------|-------------------|-------|----------------------------------|-------|
| short | 32.5 | 23.2 | 0.714x | 34.5 | 0.837x |
| 4K | 26.0 | 15.7 | 0.604x | 20.4 | 0.640x |
| 8K | 22.1 | 11.6 | 0.525x | 14.8 | 0.538x |
| 16K | 17.2 | 8.0 | 0.465x | 9.4 | 0.454x |

### Conclusion

**Bit-arithmetic did NOT fix the M2 decode cliff.** The ratio still degrades from 0.71x to 0.47x.

Worse: bit-arithmetic is **slower than constant LUT at short context** (0.71x vs 0.84x) because the ALU cost exceeds M2's constant cache cost at low contention.

**Key finding: The M2 decode bottleneck is NOT the centroid LUT.** The constant cache is not the problem on ...
TheTom • Mar 26, 2026
Good question. turbo3 at +1.4% vs q8_0 is worse than q4_0 (+0.5%) on raw PPL, but the comparison isn't apples-to-apples:

- **q4_0** compresses the KV cache to 4 bits → 4× compression
- **turbo3** compresses to ~3.5 bits → 4.6× compression
- **q8_0** is 8 bits → baseline

turbo3 gives more compression than q4_0 while staying within the 2% quality gate we set at the top of this issue. The target was "within 1% of q8_0, if >2% worse it's a quality problem." 1.4% is in that range.
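The arithmetic behind those bullets, measured against an uncompressed f16 (16-bit) cache, and ignoring any per-block scale overhead beyond the quoted effective bits:

```python
# Compression factor vs a 16-bit (f16) KV cache, from effective bits/element.
def compression_vs_f16(bits_per_elem: float) -> float:
    return 16.0 / bits_per_elem

q8_0   = compression_vs_f16(8.0)   # 2.0x  (the 8-bit baseline)
q4_0   = compression_vs_f16(4.0)   # 4.0x
turbo3 = compression_vs_f16(3.5)   # ~4.57x, i.e. the ~4.6x quoted above
```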

Also worth noting: since this issue was opened, a community contributor found a [norm correction](https://github.com/spiritbuun/llama-cpp-turboquant-cuda/commit/6b821a9) that brings turbo3 PPL even closer to q8_0 (now +1.1% on our Metal build). On CUDA with the same fix, turbo3 actually *beats* q8_0 PPL by 0.09%.

The real quality validation we're still missing is NIAH (needle-in-haystack retrieval at long context). PPL passing doesn't guarantee retrieval works. That's the next benchmark to run.
Rotatingxenomorph • Mar 26, 2026
How is turbo3 being worse than q4 quality target met?
TheTom • Mar 25, 2026
## QUALITY FIXED ✅

Perplexity with inverse rotation restored in dequant:

| Cache | PPL | vs q8_0 |
|-------|-----|---------|
| f16 | 6.121 | — |
| q8_0 | 6.111 | baseline |
| q4_0 | 6.142 | +0.5% |
| **turbo3** | **6.194** | **+1.4%** |

turbo3 is within 1.4% of q8_0 perplexity. Quality target met.

Speed is back to ~10.7 tok/s (pre-optimization level) because the inverse rotation
is in the dequant hot path. The pre-rotate-queries optimization needs to be
reimplemented to work with GQA head layout (ne[0]=256 for concatenated heads)
and hybrid memory types.
TheTom • Mar 25, 2026
## Root causes found

### 1. V cache in rotated space
Python verification: dequant output has cosine=0.02 with input (garbage).
After inverse rotation: cosine=0.987 (correct).
V cache values MUST be inverse-rotated after attention.
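The verification step can be reproduced in plain Python with a stand-in orthogonal transform built from random Givens rotations (the project's actual rotation matrices are not shown in this thread):

```python
import math
import random

def givens(v: list, i: int, j: int, theta: float) -> None:
    """Rotate coordinates (i, j) of v in place by angle theta."""
    c, s = math.cos(theta), math.sin(theta)
    vi, vj = v[i], v[j]
    v[i], v[j] = c * vi - s * vj, s * vi + c * vj

def apply_rotation(v: list, seed: int, inverse: bool = False) -> list:
    """Orthogonal transform = composition of random Givens rotations.
    The inverse replays them in reverse order with negated angles."""
    rng = random.Random(seed)
    plan = [(rng.randrange(len(v)), rng.randrange(len(v)), rng.uniform(0, math.pi))
            for _ in range(512)]
    out = list(v)
    for i, j, t in (reversed(plan) if inverse else plan):
        if i != j:
            givens(out, i, j, -t if inverse else t)
    return out

def cosine(a: list, b: list) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(x * x for x in b)))

rng = random.Random(0)
v = [rng.gauss(0.0, 1.0) for _ in range(128)]
rotated = apply_rotation(v, seed=42)                       # rotated-space cache
restored = apply_rotation(rotated, seed=42, inverse=True)  # inverse rotation
# Without the inverse, cosine(v, rotated) is far from 1 ("garbage");
# with it, cosine(v, restored) recovers ~1.0.
```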

### 2. dynamic_cast fails for MoE models
The Qwen 3.5 MoE uses `llama_memory_hybrid_context`, not `llama_kv_cache_context`.
Our `dynamic_cast` returns null → Q rotation and V inverse rotation NEVER execute.
ALL speed benchmarks were on unrotated Q with rotated-space K/V — garbage results with fast speed.

### Why "coherent text" was misleading
Without any rotation applied, the raw quantize/dequant produces plausible-looking grammar
but wrong content. Short conversations hide this. Perplexity caught it.

### Fix needed
Store rotation tensors in `llm_graph_context` directly (not behind a KV cache dynamic_cast).
Then both Q rotation and V inverse rotation will work for ALL memory types.
TheTom • Mar 25, 2026
## CRITICAL: Perplexity test reveals quality failure

| Cache | PPL | vs f16 |
|-------|-----|--------|
| f16 | 6.121 | baseline |
| q8_0 | 6.111 | -0.16% |
| q4_0 | 6.142 | +0.34% |
| **turbo3** | **165.6** | **+2607%** ❌ |

turbo3 perplexity is 27× worse than f16. Speed benchmarks were measuring how fast the model produces wrong answers.

Root cause investigation needed. DO NOT update README with speed claims until quality is fixed.

Suspected causes:
1. Norm mismatch: quantize stores full 128-element group norm, dequant uses it as per-32-block norm
2. Pre-rotate-queries rotation matrix mismatch with quantize rotation
3. 3-bit packing bug in block size 32
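Suspected cause 1 can be illustrated numerically; the values, block layout, and choice of an RMS as the norm are made up for this sketch:

```python
import math

def rms_norm(xs: list) -> float:
    """Stand-in for the stored norm (an RMS here, for illustration)."""
    return math.sqrt(sum(x * x for x in xs) / len(xs))

group = [float(i % 7 - 3) for i in range(128)]         # one 128-element group
stored_group_norm = rms_norm(group)                    # what quantize writes

blocks = [group[i:i + 32] for i in range(0, 128, 32)]  # four 32-element blocks
correct_block_norms = [rms_norm(b) for b in blocks]    # what dequant should use

# The suspected bug: dequant scales every 32-element block by the single
# group norm, so any block whose own norm differs is reconstructed at the
# wrong magnitude.
errors = [abs(bn - stored_group_norm) for bn in correct_block_norms]
```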

Related Early-Stage Discoveries

Discovery Source

GitHub Open Source

Aggregated via automated community intelligence tracking.

Tech Stack Dependencies

No direct open-source NPM package mentions detected in the product documentation.

Media Tractions & Mentions

No mainstream media stories specifically mentioning this product name have been intercepted yet.

Deep Research & Science

No direct peer-reviewed scientific literature matched with this product's architecture.