← Back to Product Feed

GitHub Open Source TheTom/turboquant_plus

No tagline provided.

5,811
Traction Score
802
Forks
Mar 25, 2026
Launch Date
View Origin Link

Product Positioning & Context

AI Executive Synthesis
Ensuring correct and robust implementation of different quantization schemes (turbo3, turbo4) across varying block sizes and head dimensions, preventing data corruption and out-of-bounds access.
A post-commit review identified critical bugs in the block size 32 change, corrupting turbo4 cache writes and causing out-of-bounds array access in CPU paths. The `SET_ROWS` kernel, hardcoded for turbo3, was incorrectly instantiated for turbo4, and integer division logic dropped tail blocks for non-128 head dimensions. This reveals significant fragility in low-level quantization implementations, where minor changes can introduce severe data integrity issues across different model configurations and quantization types. The reliance on specific head dimensions (e.g., Qwen's 128) masked a broader problem. Market implications include the necessity for rigorous, automated code analysis and comprehensive testing across diverse model architectures and hardware to ensure the reliability and compatibility of highly optimized, low-level inference components.

Related Ecosystem & Alternatives

Discover adjacent products, open-source repositories, and developer tools sharing similar technical architecture.

Deep-Dive FAQs

What is TheTom/turboquant_plus?
TheTom/turboquant_plus is analyzed by our AI as: Ensuring correct and robust implementation of different quantization schemes (turbo3, turbo4) across varying block sizes and head dimensions, preventing data corruption and out-of-bounds access.. It focuses on A post-commit review identified critical bugs in the block size 32 change, corrupting turbo4 cache writes and causing out-of-bounds array access in...
Where did TheTom/turboquant_plus originate?
Data for TheTom/turboquant_plus was aggregated directly from the GitHub Open Source community ecosystem, representing raw developer and early-adopter sentiment.
When was TheTom/turboquant_plus publicly launched?
The initial public indexing or launch date for TheTom/turboquant_plus within our tracked developer communities was recorded on March 25, 2026.
How popular is TheTom/turboquant_plus?
TheTom/turboquant_plus has achieved measurable traction, logging over 5,811 traction score and facilitating 802 recorded discussions or engagements.
Are there active development issues for TheTom/turboquant_plus?
Yes, we are currently tracking open architectural debates and bug reports for this project on GitHub. There are currently 5 active high-priority issues logged recently.
What are some commercial alternatives to TheTom/turboquant_plus?
Our semantic intelligence engine identifies potential commercial alternatives in the SaaS space, such as Databerry, which offers overlapping value propositions.

Active Developer Issues (GitHub)

open Unsupported gpu architecture 'compute_120a'
Logged: Mar 31, 2026
open turbo3/turbo4 cache produces garbled output on NVIDIA Blackwell GPU (RTX 5070 Laptop, compute capability 12.0)
Logged: Mar 31, 2026
open No Difference in tokens/sec - Ministral3 8B Q5_K_M
Logged: Mar 29, 2026
open Don't work on vulkan device
Logged: Mar 28, 2026
open Engineering findings: K/V norm disparity + MSE > Prod + outlier mixed precision
Logged: Mar 28, 2026

Community Voice & Feedback

zekrom-vale • Mar 31, 2026
You can just use it like Ollama or LM studio without the slow development or wrapper overheads. I use llama cpp directly and use it with router mode and many llms configured with models.ini and interface it with Open Web UI you can find others too. It's nice, but it has a lot more overhead in setting it up.

Use `llama-server --host 0.0.0.0 --port 8080 --models-preset /path/to/ini/models.ini` and access it through an Open AI compatible UI like Open Web UI at `http://:8080`. Or 127.0.0.1 as the loop-back if not connecting to a different computer as I do.

Here is an example model ini file I use for my 5070ti.
[models.ini.txt](https://github.com/user-attachments/files/26391236/models.ini.txt)
zrlhk • Mar 31, 2026
这个只是对kv缓存压缩,所以只是提升了最大推理上下文的大小。对模型量化压缩和推理速度,是没有提升的。
原来10G显存,如果是一个9b模型,最大上下文128k可能就OOM了。现在压缩后,就可以支持128k上下文了。
TheTom • Mar 31, 2026
Yeah we really need some more rcom support from the community. i only have so many viable devices to play with at home
ogbinar • Mar 30, 2026
i hope the rocm issues get fixed i'm interested to try this out!
MrMuhannadObeidat • Mar 29, 2026
I missed the part where you highlight the fact that tokens/sec may actually degrade with the added compression of KV cache. I tried with turbo3, do not see noticeable degradation but certainly see major impact on memory consumption and the ability to use much bigger context windows.
Question I have now, is can you take advantage of this from within lm studio or ollama?
TheTom • Mar 28, 2026
turbo3 currently only supports Metal, CUDA, and ROCm/HIP backends. the Vulkan backend doesn't have a SET_ROWS kernel for the turbo3 quant type yet.

since you have an RX 7900 XTX, ROCm would be your best path — apollosenvy has a working ROCm PR here: https://github.com/TheTom/llama-cpp-turboquant/pull/5

that said, ROCm turbo3 still has some quality issues being worked out (head_dim=128 models especially). head_dim=256 models like Qwen3.5 should work.
TheTom • Mar 27, 2026
## Final experiment results — dequant-level optimization ceiling reached

### Complete M2 Pro scoreboard (8K decode, q8_0 = 21.9 tok/s):

| # | Approach | tok/s | vs q8_0 | vs Main | Const addrs |
|---|----------|-------|---------|---------|-------------|
| — | No-op ceiling | 24.5 | 1.119x | — | 0 |
| **1** | **4-mag LUT + per-elem norm** | **15.1** | **0.689x** | **+38%** | **4** |
| 2 | Batched extract (8-LUT) | 13.7 | 0.626x | +25% | 8 |
| 3 | Deferred norm (4-mag) | 12.9 | 0.589x | +18% | 4 |
| 4 | 2-pair half2 LUT | 12.0 | 0.548x | +10% | 2 |
| 5 | Select chain (zero LUT) | 11.9 | 0.544x | +9% | 0 |
| 6 | Bit-arithmetic | 11.6 | 0.530x | +6% | 0 |
| — | Main (8-entry LUT) | 10.95 | 0.500x | baseline | 8 |
| 7 | Non-vec forced (nl=2) | 10.2 | 0.466x | -7% | 8 |

### Key insight: 4 constant addresses is the sweet spot on M2 Pro
- **0 addresses** (select chain, bit-arith): ALU cost exceeds constant cache savings
- **2 addresses** (half2 pairs): ternary overhead exceeds savings from ...
TheTom • Mar 27, 2026
## 4-Entry Magnitude LUT + Branchless Sign: BEST M2 RESULT

**Approach:** 4-entry constant half magnitude LUT (0.021-0.190) + XOR trick for reversed magnitude order + branchless sign multiply. Only 4 possible constant addresses per lookup instead of 8.

### M2 Pro decode improvement:

| Depth | q8_0 | Main (8-LUT) | 4-mag LUT | vs Main | vs q8_0 |
|-------|------|-------------|-----------|---------|---------|
| short | 32.5 | 22.9 | 23.8 | +3.9% | 0.732x |
| 8K | 21.9 | 10.95 | 15.1 | **+37.9%** | 0.689x |
| 16K | 17.2 | 8.0 | 11.6 | **+45.0%** | 0.674x |

### M5 Max (no regression):

| Depth | Main | 4-mag LUT | Delta |
|-------|------|-----------|-------|
| short | 77.4 | 75.7 | -2.2% |

PPL: 6.1756 (unchanged).

### Summary

+38-45% decode improvement on M2 Pro at long context. The ratio vs q8_0 improved from 0.45-0.50x to 0.67-0.73x. The cliff is much less severe.

Minor regression on M5 (-2.2%) from the extra ALU (XOR + sign multiply). Could use the auto-detection to use 4-mag on ...
TheTom • Mar 27, 2026
## BREAKTHROUGH: Profiling isolation identifies exact bottleneck

Added TURBO_PROFILE_MODE (0-4) to strip away dequant layers one at a time.

### M5 Max vs M2 Pro at 8K context decode:

| Mode | What | M5 (% ceil) | M2 (% ceil) |
|------|------|------------|------------|
| 1 | No-op ceiling | 78.9 (100%) | 24.5 (100%) |
| 2 | + norm read | 75.1 (95%) | 22.1 (90%) |
| 4 | + all byte reads | 75.2 (95%) | 21.9 (89%) |
| 3 | + qs extraction + LUT | 64.9 (82%) | 16.4 (67%) |
| 0 | + signs + full LUT | 59.2 (75%) | 14.0 (57%) |
| q8_0 | baseline | 78.8 | 22.1 |

### Key findings:

1. **No-op turbo3 is FASTER than q8_0 on M2** (24.5 vs 22.1) — compressed cache = less bandwidth. The format is not the problem.

2. **Constant memory LUT is 2x worse on M2 than M5:**
- Mode 4→3 (LUT cost): M5 loses 13.7%, M2 loses 25.1%
- Mode 3→0 (signs+more LUT): M5 loses another 8.6%, M2 loses another 14.7%

3. **Byte reading is NOT the bottleneck** — Mode 4 (all reads, no LUT) only costs 10% on both.

4....
TheTom • Mar 27, 2026
## M2 Pro Results Update: Batched Extract IS a Win

True baseline comparison (same branch chain, same build):

| Depth | q8_0 | Main (const LUT) | Batched extract | Bit-arithmetic |
|-------|------|-----------------|-----------------|----------------|
| short | 32.5 | 22.9 | **23.7 (+3.5%)** | 23.2 |
| 8K | 22.1 | 10.95 | **13.7 (+25%)** | 11.6 |

Earlier diagnostic (34.5 short) was a different build/context allocation — not comparable.

**Batched extract gives +25% at 8K on M2 Pro.** The explicit bit field pre-extraction helps Metal's compiler schedule device reads ahead of ALU on older hardware.

Next: profile where the remaining gap is (turbo3 0.62x vs q8_0 at 8K).
TheTom • Mar 27, 2026
## M2 Pro Results: Bit-Arithmetic Dequant

**Hardware:** Apple M2 Pro, Apple8 (1008), has_tensor=false, 32GB
**Model:** Qwen2.5-7B-Instruct-Q4_K_M
**Build:** experiment/m1-m2-decode-comparison (auto-detected bit-arithmetic)

### Decode Speed (tok/s)

| Depth | q8_0 | turbo3 (bit-arith) | Ratio | turbo3 (const LUT, earlier diag) | Ratio |
|-------|------|-------------------|-------|----------------------------------|-------|
| short | 32.5 | 23.2 | 0.714x | 34.5 | 0.837x |
| 4K | 26.0 | 15.7 | 0.604x | 20.4 | 0.640x |
| 8K | 22.1 | 11.6 | 0.525x | 14.8 | 0.538x |
| 16K | 17.2 | 8.0 | 0.465x | 9.4 | 0.454x |

### Conclusion

**Bit-arithmetic did NOT fix the M2 decode cliff.** The ratio still degrades from 0.71x to 0.47x.

Worse: bit-arithmetic is **slower than constant LUT at short context** (0.71x vs 0.84x) because the ALU cost exceeds M2's constant cache cost at low contention.

**Key finding: The M2 decode bottleneck is NOT the centroid LUT.** The constant cache is not the problem on ...
TheTom • Mar 26, 2026
Good question. turbo3 at +1.4% vs q8_0 is worse than q4_0 (+0.5%) on raw PPL, but the comparison isn't apples-to-apples:

- **q4_0** compresses the KV cache to 4 bits → 4× compression
- **turbo3** compresses to ~3.5 bits → 4.6× compression
- **q8_0** is 8 bits → baseline

turbo3 gives more compression than q4_0 while staying within the 2% quality gate we set at the top of this issue. The target was "within 1% of q8_0, if >2% worse it's a quality problem." 1.4% is in that range.

Also worth noting: since this issue was opened, a community contributor found a [norm correction](https://github.com/spiritbuun/llama-cpp-turboquant-cuda/commit/6b821a9) that brings turbo3 PPL even closer to q8_0 (now +1.1% on our Metal build). On CUDA with the same fix, turbo3 actually *beats* q8_0 PPL by 0.09%.

The real quality validation we're still missing is NIAH (needle-in-haystack retrieval at long context). PPL passing doesn't guarantee retrieval works. That's the next benchmark to run.
Rotatingxenomorph • Mar 26, 2026
How is turbo3 being worse than q4 quality target met?
TheTom • Mar 25, 2026
## QUALITY FIXED ✅

Perplexity with inverse rotation restored in dequant:

| Cache | PPL | vs q8_0 |
|-------|-----|---------|
| f16 | 6.121 | — |
| q8_0 | 6.111 | baseline |
| q4_0 | 6.142 | +0.5% |
| **turbo3** | **6.194** | **+1.4%** |

turbo3 is within 1.4% of q8_0 perplexity. Quality target met.

Speed is back to ~10.7 tok/s (pre-optimization level) because the inverse rotation
is in the dequant hot path. The pre-rotate-queries optimization needs to be
reimplemented to work with GQA head layout (ne[0]=256 for concatenated heads)
and hybrid memory types.
TheTom • Mar 25, 2026
## Root causes found

### 1. V cache in rotated space
Python verification: dequant output has cosine=0.02 with input (garbage).
After inverse rotation: cosine=0.987 (correct).
V cache values MUST be inverse-rotated after attention.

### 2. dynamic_cast fails for MoE models
The Qwen 3.5 MoE uses `llama_memory_hybrid_context`, not `llama_kv_cache_context`.
Our `dynamic_cast` returns null → Q rotation and V inverse rotation NEVER execute.
ALL speed benchmarks were on unrotated Q with rotated-space K/V — garbage results with fast speed.

### Why "coherent text" was misleading
Without any rotation applied, the raw quantize/dequant produces plausible-looking grammar
but wrong content. Short conversations hide this. Perplexity caught it.

### Fix needed
Store rotation tensors in `llm_graph_context` directly (not behind a KV cache dynamic_cast).
Then both Q rotation and V inverse rotation will work for ALL memory types.

Discovery Source

GitHub Open Source GitHub Open Source

Aggregated via automated community intelligence tracking.

Tech Stack Dependencies

No direct open-source NPM package mentions detected in the product documentation.

Media Tractions & Mentions

No mainstream media stories specifically mentioning this product name have been intercepted yet.

Deep Research & Science

No direct peer-reviewed scientific literature matched with this product's architecture.