Product Positioning & Context
AI Executive Synthesis
The core engineering challenge: ensuring correct and robust implementation of the different quantization schemes (turbo3, turbo4) across varying block sizes and head dimensions, while preventing data corruption and out-of-bounds access.
A post-commit review identified critical bugs in the block-size-32 change: corrupted turbo4 cache writes and out-of-bounds array access in the CPU paths. The `SET_ROWS` kernel, hardcoded for turbo3, was incorrectly instantiated for turbo4, and integer-division logic dropped tail blocks for head dimensions other than 128. This reveals significant fragility in low-level quantization code, where minor changes can introduce severe data-integrity issues across model configurations and quantization types; reliance on one specific head dimension (Qwen's 128) masked the broader problem. The market implication is the need for rigorous, automated code analysis and comprehensive testing across diverse model architectures and hardware to keep highly optimized, low-level inference components reliable and compatible.
Active Developer Issues (GitHub)
Logged: Mar 31, 2026
Logged: Mar 31, 2026
Logged: Mar 29, 2026
Logged: Mar 28, 2026
Logged: Mar 28, 2026
Community Voice & Feedback
## Final experiment results — dequant-level optimization ceiling reached
### Complete M2 Pro scoreboard (8K decode, q8_0 = 21.9 tok/s):
| # | Approach | tok/s | vs q8_0 | vs Main | Const addrs |
|---|----------|-------|---------|---------|-------------|
| — | No-op ceiling | 24.5 | 1.119x | — | 0 |
| **1** | **4-mag LUT + per-elem norm** | **15.1** | **0.689x** | **+38%** | **4** |
| 2 | Batched extract (8-LUT) | 13.7 | 0.626x | +25% | 8 |
| 3 | Deferred norm (4-mag) | 12.9 | 0.589x | +18% | 4 |
| 4 | 2-pair half2 LUT | 12.0 | 0.548x | +10% | 2 |
| 5 | Select chain (zero LUT) | 11.9 | 0.544x | +9% | 0 |
| 6 | Bit-arithmetic | 11.6 | 0.530x | +6% | 0 |
| — | Main (8-entry LUT) | 10.95 | 0.500x | baseline | 8 |
| 7 | Non-vec forced (nl=2) | 10.2 | 0.466x | -7% | 8 |
### Key insight: 4 constant addresses is the sweet spot on M2 Pro
- **0 addresses** (select chain, bit-arith): ALU cost exceeds constant cache savings
- **2 addresses** (half2 pairs): ternary overhead exceeds savings from ...
## 4-Entry Magnitude LUT + Branchless Sign: BEST M2 RESULT
**Approach:** 4-entry constant half magnitude LUT (0.021-0.190) + XOR trick for reversed magnitude order + branchless sign multiply. Only 4 possible constant addresses per lookup instead of 8.
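A scalar C++ sketch of the decode path described above. Only the LUT endpoints (0.021, 0.190) come from this writeup; the two middle magnitudes and the exact 3-bit code layout (top bit = sign, low two bits = magnitude index) are illustrative assumptions, not the shipped Metal kernel.

```cpp
#include <cmath>
#include <cstdint>

// Sketch of the 4-entry magnitude LUT + branchless sign decode. The two
// middle magnitudes and the code layout are illustrative assumptions.
static const float kMag[4] = {0.021f, 0.065f, 0.120f, 0.190f};

// Assumed layout: bit 2 = sign, bits 0-1 = magnitude index. Negative codes
// store magnitudes in reversed order; XOR-ing the index with a sign-derived
// mask undoes the reversal without a branch (the "XOR trick").
inline float dequant3(uint8_t code, float norm) {
    uint8_t sign_bit = (code >> 2) & 1;
    uint8_t mag_idx  = (code & 0x3) ^ (uint8_t)(0x3 * sign_bit);
    float   sign     = 1.0f - 2.0f * (float)sign_bit; // branchless +1/-1
    return sign * kMag[mag_idx] * norm;               // 4 constant addrs, not 8
}
```

With this layout, codes `c` and `7-c` decode to opposite values, which is exactly the symmetry that lets an 8-entry signed LUT collapse to 4 magnitudes.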
### M2 Pro decode improvement:
| Depth | q8_0 | Main (8-LUT) | 4-mag LUT | vs Main | vs q8_0 |
|-------|------|-------------|-----------|---------|---------|
| short | 32.5 | 22.9 | 23.8 | +3.9% | 0.732x |
| 8K | 21.9 | 10.95 | 15.1 | **+37.9%** | 0.689x |
| 16K | 17.2 | 8.0 | 11.6 | **+45.0%** | 0.674x |
### M5 Max (no regression):
| Depth | Main | 4-mag LUT | Delta |
|-------|------|-----------|-------|
| short | 77.4 | 75.7 | -2.2% |
PPL: 6.1756 (unchanged).
### Summary
+38-45% decode improvement on M2 Pro at long context. The ratio vs q8_0 improved from 0.45-0.50x to 0.67-0.73x. The cliff is much less severe.
Minor regression on M5 (-2.2%) from the extra ALU (XOR + sign multiply). Could use the auto-detection to use 4-mag on ...
## BREAKTHROUGH: Profiling isolation identifies exact bottleneck
Added TURBO_PROFILE_MODE (0-4) to strip away dequant layers one at a time.
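A scalar sketch of the strip-away idea. The block layout, field names, and LUT contents below are hypothetical; only the mode semantics (each mode keeps one more pipeline stage, so a stage's cost is the tok/s delta between adjacent modes) follow the table.

```cpp
#include <cstdint>

// Sketch of TURBO_PROFILE_MODE: each mode keeps one more dequant stage.
// Block3's fields and kLut8's values are illustrative assumptions.
struct Block3 {
    float   norm;
    uint8_t qs[12];    // packed 3-bit codes
    uint8_t signs[4];  // packed sign bits
};

static const float kLut8[8] = {0.021f, 0.065f, 0.120f, 0.190f,
                               0.250f, 0.330f, 0.420f, 0.520f}; // illustrative

float profile_dequant(const Block3 &b, int i, int mode) {
    if (mode == 1) return 1.0f;                  // mode 1: no-op ceiling
    float norm = b.norm;                         // mode 2: + norm read
    if (mode == 2) return norm;
    uint8_t byte = b.qs[(i * 3) / 8];            // mode 4: + byte reads, no decode
    if (mode == 4) return norm + (float)byte;
    uint8_t q = byte & 0x7;                      // mode 3: + qs extraction + LUT
    if (mode == 3) return kLut8[q] * norm;       //   (sketch ignores straddled codes)
    uint8_t s = (b.signs[i / 8] >> (i % 8)) & 1; // mode 0: + signs + full decode
    return (s ? -1.0f : 1.0f) * kLut8[q] * norm;
}
```

The cost of a single stage on a given chip is then the delta between adjacent modes, e.g. the LUT alone costs (mode 4 tok/s) minus (mode 3 tok/s).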
### M5 Max vs M2 Pro at 8K context decode:
| Mode | What | M5 (% ceil) | M2 (% ceil) |
|------|------|------------|------------|
| 1 | No-op ceiling | 78.9 (100%) | 24.5 (100%) |
| 2 | + norm read | 75.1 (95%) | 22.1 (90%) |
| 4 | + all byte reads | 75.2 (95%) | 21.9 (89%) |
| 3 | + qs extraction + LUT | 64.9 (82%) | 16.4 (67%) |
| 0 | + signs + full LUT | 59.2 (75%) | 14.0 (57%) |
| q8_0 | baseline | 78.8 | 22.1 |
### Key findings:
1. **No-op turbo3 is FASTER than q8_0 on M2** (24.5 vs 22.1) — compressed cache = less bandwidth. The format is not the problem.
2. **Constant memory LUT is 2x worse on M2 than M5:**
- Mode 4→3 (LUT cost): M5 loses 13.7%, M2 loses 25.1%
- Mode 3→0 (signs+more LUT): M5 loses another 8.6%, M2 loses another 14.7%
3. **Byte reading is NOT the bottleneck** — Mode 4 (all reads, no LUT) only costs 10% on both.
4....
## M2 Pro Results Update: Batched Extract IS a Win
True baseline comparison (same branch chain, same build):
| Depth | q8_0 | Main (const LUT) | Batched extract | Bit-arithmetic |
|-------|------|-----------------|-----------------|----------------|
| short | 32.5 | 22.9 | **23.7 (+3.5%)** | 23.2 |
| 8K | 22.1 | 10.95 | **13.7 (+25%)** | 11.6 |
Earlier diagnostic (34.5 short) was a different build/context allocation — not comparable.
**Batched extract gives +25% at 8K on M2 Pro.** The explicit bit field pre-extraction helps Metal's compiler schedule device reads ahead of ALU on older hardware.
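The scheduling effect can be sketched in scalar C++ (the 8-codes-in-3-bytes layout is illustrative, and plain loads stand in for device-memory reads): pull every packed byte into locals first, then do all the bit extraction, instead of interleaving a load with each decode.

```cpp
#include <cstdint>

// Sketch of "batched extract": hoist all reads before the ALU work so the
// compiler can issue the loads back-to-back instead of interleaving them
// with dependent bit extraction. Layout (8 x 3-bit codes in 3 bytes) is
// an illustrative assumption.
void decode8_batched(const uint8_t qs[3], uint8_t out[8]) {
    // Phase 1: all reads, no decoding.
    uint32_t packed = (uint32_t)qs[0]
                    | ((uint32_t)qs[1] << 8)
                    | ((uint32_t)qs[2] << 16);
    // Phase 2: pure ALU bit-field extraction, no further memory traffic.
    for (int i = 0; i < 8; ++i)
        out[i] = (packed >> (3 * i)) & 0x7;
}
```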
Next: profile where the remaining gap is (turbo3 0.62x vs q8_0 at 8K).
## M2 Pro Results: Bit-Arithmetic Dequant
**Hardware:** Apple M2 Pro, Apple8 (1008), has_tensor=false, 32GB
**Model:** Qwen2.5-7B-Instruct-Q4_K_M
**Build:** experiment/m1-m2-decode-comparison (auto-detected bit-arithmetic)
### Decode Speed (tok/s)
| Depth | q8_0 | turbo3 (bit-arith) | Ratio | turbo3 (const LUT, earlier diag) | Ratio |
|-------|------|-------------------|-------|----------------------------------|-------|
| short | 32.5 | 23.2 | 0.714x | 34.5 | 0.837x |
| 4K | 26.0 | 15.7 | 0.604x | 20.4 | 0.640x |
| 8K | 22.1 | 11.6 | 0.525x | 14.8 | 0.538x |
| 16K | 17.2 | 8.0 | 0.465x | 9.4 | 0.454x |
### Conclusion
**Bit-arithmetic did NOT fix the M2 decode cliff.** The ratio still degrades from 0.71x to 0.47x.
Worse: bit-arithmetic is **slower than constant LUT at short context** (0.71x vs 0.84x) because the ALU cost exceeds M2's constant cache cost at low contention.
**Key finding: The M2 decode bottleneck is NOT the centroid LUT.** The constant cache is not the problem on ...
Good question. turbo3 at +1.4% vs q8_0 is worse than q4_0 (+0.5%) on raw PPL, but the comparison isn't apples-to-apples:
- **q4_0** compresses the KV cache to 4 bits per element → 4× compression vs f16
- **turbo3** compresses to ~3.5 bits → 4.6× compression vs f16
- **q8_0** is 8 bits → the quality baseline
turbo3 gives more compression than q4_0 while staying within the 2% quality gate set at the top of this issue. The stated target was "within 1% of q8_0; if >2% worse it's a quality problem." At +1.4%, turbo3 misses the 1% target but stays under the 2% failure threshold.
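For reference, the compression ratios follow directly from bits per cache element against the 16-bit f16 baseline; an illustrative one-liner makes the arithmetic explicit:

```cpp
// Compression ratio of a quantized KV-cache element vs the f16 baseline.
constexpr double compression_vs_f16(double bits_per_elem) {
    return 16.0 / bits_per_elem;
}
// q8_0: 16/8 = 2x, q4_0: 16/4 = 4x, turbo3: 16/3.5 ≈ 4.57x (the "4.6x").
```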
Also worth noting: since this issue was opened, a community contributor found a [norm correction](https://github.com/spiritbuun/llama-cpp-turboquant-cuda/commit/6b821a9) that brings turbo3 PPL even closer to q8_0 (now +1.1% on our Metal build). On CUDA with the same fix, turbo3 actually *beats* q8_0 PPL by 0.09%.
The real quality validation we're still missing is NIAH (needle-in-haystack retrieval at long context). PPL passing doesn't guarantee retrieval works. That's the next benchmark to run.
How is the quality target met when turbo3 is worse than q4_0?
## QUALITY FIXED ✅
Perplexity with inverse rotation restored in dequant:
| Cache | PPL | vs q8_0 |
|-------|-----|---------|
| f16 | 6.121 | — |
| q8_0 | 6.111 | baseline |
| q4_0 | 6.142 | +0.5% |
| **turbo3** | **6.194** | **+1.4%** |
turbo3 is within 1.4% of q8_0 perplexity. Quality target met.
Speed is back to ~10.7 tok/s (pre-optimization level) because the inverse rotation
is in the dequant hot path. The pre-rotate-queries optimization needs to be
reimplemented to work with GQA head layout (ne[0]=256 for concatenated heads)
and hybrid memory types.
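The linear algebra behind the pre-rotate-queries idea: for an orthogonal rotation R, dot(Rq, Rk) = dot(q, k), so rotating Q once per token lets attention score directly against rotated-space K without inverse-rotating K in the dequant hot path (the V path still needs its inverse rotation after attention). A 2-D rotation stands in for the real matrix in this sketch:

```cpp
#include <cmath>

// Rotations preserve dot products: dot(R q, R k) == dot(q, k). This is why
// pre-rotating queries can remove the inverse rotation from the K dequant
// hot path. A 2-D rotation stands in for the real per-head matrix.
static void rot2(double theta, const double v[2], double out[2]) {
    double c = std::cos(theta), s = std::sin(theta);
    out[0] = c * v[0] - s * v[1];
    out[1] = s * v[0] + c * v[1];
}
static double dot2(const double a[2], const double b[2]) {
    return a[0] * b[0] + a[1] * b[1];
}
```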
## Root causes found
### 1. V cache in rotated space
Python verification: dequant output has cosine=0.02 with input (garbage).
After inverse rotation: cosine=0.987 (correct).
V cache values MUST be inverse-rotated after attention.
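A toy reproduction of that verification, with a pairwise 2-D Givens rotation standing in for the real rotation matrix: the stored (rotated-space) values have low cosine similarity with the input, and applying the inverse rotation restores it.

```cpp
#include <cmath>
#include <vector>

// Toy version of the Python check: rotated-space values look like garbage
// against the input until the inverse rotation is applied. A pairwise 2-D
// Givens rotation stands in for the real rotation matrix.
static double cosine(const std::vector<double> &a, const std::vector<double> &b) {
    double dot = 0, na = 0, nb = 0;
    for (size_t i = 0; i < a.size(); ++i) {
        dot += a[i] * b[i]; na += a[i] * a[i]; nb += b[i] * b[i];
    }
    return dot / (std::sqrt(na) * std::sqrt(nb));
}

static std::vector<double> rotate(const std::vector<double> &v, double theta) {
    // Same Givens rotation applied to each consecutive element pair.
    std::vector<double> out(v.size());
    double c = std::cos(theta), s = std::sin(theta);
    for (size_t i = 0; i + 1 < v.size(); i += 2) {
        out[i]     = c * v[i] - s * v[i + 1];
        out[i + 1] = s * v[i] + c * v[i + 1];
    }
    return out;
}
```

Calling `rotate(v, theta)` to store and `rotate(stored, -theta)` to read back recovers the input; comparing the stored values directly against the input gives a low cosine, mirroring the 0.02-vs-0.987 observation.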
### 2. dynamic_cast fails for MoE models
The Qwen 3.5 MoE uses `llama_memory_hybrid_context`, not `llama_kv_cache_context`.
Our `dynamic_cast` returns null → Q rotation and V inverse rotation NEVER execute.
ALL speed benchmarks were on unrotated Q with rotated-space K/V — garbage results with fast speed.
### Why "coherent text" was misleading
Without any rotation applied, the raw quantize/dequant produces plausible-looking grammar
but wrong content. Short conversations hide this. Perplexity caught it.
### Fix needed
Store rotation tensors in `llm_graph_context` directly (not behind a KV cache dynamic_cast).
Then both Q rotation and V inverse rotation will work for ALL memory types.
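A minimal reproduction of the cast failure, with the class relationships reduced to stubs (the real types live in llama.cpp; everything else here is illustrative):

```cpp
// Minimal reproduction of root cause 2: the rotation hook only fires when a
// dynamic_cast to the unified KV-cache context succeeds, so hybrid-memory
// (MoE) contexts silently skip it. Class names follow the writeup; the
// stubs and the bool return are illustrative.
struct llama_memory_context { virtual ~llama_memory_context() = default; };
struct llama_kv_cache_context      : llama_memory_context { /* holds rotation tensors */ };
struct llama_memory_hybrid_context : llama_memory_context { /* Qwen 3.5 MoE path */ };

bool try_apply_rotation(llama_memory_context *mem) {
    // BUG pattern: rotation state lives behind the concrete KV-cache type.
    auto *kv = dynamic_cast<llama_kv_cache_context *>(mem);
    if (kv == nullptr)
        return false; // hybrid context -> cast fails -> Q/V rotation never runs
    /* ... apply Q rotation and V inverse rotation ... */
    return true;
}
```

The proposed fix moves the rotation tensors into `llm_graph_context`, so the hook no longer depends on which concrete memory type sits behind the pointer.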
## CRITICAL: Perplexity test reveals quality failure
| Cache | PPL | vs f16 |
|-------|-----|--------|
| f16 | 6.121 | baseline |
| q8_0 | 6.111 | -0.16% |
| q4_0 | 6.142 | +0.34% |
| **turbo3** | **165.6** | **+2607%** ❌ |
turbo3 perplexity is 27× worse than f16. Speed benchmarks were measuring how fast the model produces wrong answers.
Root cause investigation needed. DO NOT update README with speed claims until quality is fixed.
Suspected causes:
1. Norm mismatch: quantize stores full 128-element group norm, dequant uses it as per-32-block norm
2. Pre-rotate-queries rotation matrix mismatch with quantize rotation
3. 3-bit packing bug in block size 32
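Suspected cause 1 reduces to a scale mismatch: a value encoded against one norm but decoded against a different one comes back wrong by roughly the ratio of the two norms. A toy round-trip (3-level grid; this is not the actual turbo3 code) shows the failure mode:

```cpp
#include <cmath>

// Toy scale-mismatch round trip (illustrative, not the turbo3 code): encode
// x on a coarse symmetric grid using scale s_q, decode using scale s_d.
// Matched scales round-trip within one grid step; mismatched scales are
// off by roughly the factor s_d / s_q -- the shape of suspected cause 1.
static double roundtrip(double x, double s_q, double s_d) {
    int q = (int)std::lround(x / s_q * 3.0); // 3 levels per side (toy grid)
    return (double)q / 3.0 * s_d;
}
```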
Related Early-Stage Discoveries
Discovery source: GitHub open source, aggregated via automated community intelligence tracking.
Tech Stack Dependencies
No direct open-source NPM package mentions detected in the product documentation.
Media Traction & Mentions
No mainstream media stories specifically mentioning this product have been detected yet.
Deep Research & Science
No peer-reviewed scientific literature directly matching this product's architecture has been found.
Market Trends