Product Positioning & Context
AI Executive Synthesis
The core engineering challenge: ensuring correct and robust implementation of the different quantization schemes (turbo3, turbo4) across varying block sizes and head dimensions, while preventing data corruption and out-of-bounds access.
A post-commit review identified critical bugs in the block-size-32 change: corrupted turbo4 cache writes and out-of-bounds array access in the CPU paths. The `SET_ROWS` kernel, hardcoded for turbo3, was incorrectly instantiated for turbo4, and integer-division logic dropped tail blocks for head dimensions other than 128. This reveals significant fragility in low-level quantization code, where minor changes can introduce severe data-integrity issues across model configurations and quantization types; reliance on one specific head dimension (Qwen's 128) masked the broader problem. The market implication is the need for rigorous, automated code analysis and comprehensive testing across diverse model architectures and hardware to keep highly optimized, low-level inference components reliable and compatible.
Active Developer Issues (GitHub)
Logged: Mar 31, 2026
Logged: Mar 31, 2026
Logged: Mar 29, 2026
Logged: Mar 28, 2026
Logged: Mar 28, 2026
Community Voice & Feedback
## Final experiment results — dequant-level optimization ceiling reached
### Complete M2 Pro scoreboard (8K decode, q8_0 = 21.9 tok/s):
| # | Approach | tok/s | vs q8_0 | vs Main | Const addrs |
|---|----------|-------|---------|---------|-------------|
| — | No-op ceiling | 24.5 | 1.119x | — | 0 |
| **1** | **4-mag LUT + per-elem norm** | **15.1** | **0.689x** | **+38%** | **4** |
| 2 | Batched extract (8-LUT) | 13.7 | 0.626x | +25% | 8 |
| 3 | Deferred norm (4-mag) | 12.9 | 0.589x | +18% | 4 |
| 4 | 2-pair half2 LUT | 12.0 | 0.548x | +10% | 2 |
| 5 | Select chain (zero LUT) | 11.9 | 0.544x | +9% | 0 |
| 6 | Bit-arithmetic | 11.6 | 0.530x | +6% | 0 |
| — | Main (8-entry LUT) | 10.95 | 0.500x | baseline | 8 |
| 7 | Non-vec forced (nl=2) | 10.2 | 0.466x | -7% | 8 |
### Key insight: 4 constant addresses is the sweet spot on M2 Pro
- **0 addresses** (select chain, bit-arith): ALU cost exceeds constant cache savings
- **2 addresses** (half2 pairs): ternary overhead exceeds savings from ...
## 4-Entry Magnitude LUT + Branchless Sign: BEST M2 RESULT
**Approach:** 4-entry constant half magnitude LUT (0.021-0.190) + XOR trick for reversed magnitude order + branchless sign multiply. Only 4 possible constant addresses per lookup instead of 8.
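A scalar C++ sketch of the decode path described above. Only the LUT endpoints (0.021, 0.190) come from this writeup; the two middle magnitudes and the exact 3-bit code layout (top bit = sign, low two bits = magnitude index) are illustrative assumptions, not the shipped Metal kernel.

```cpp
#include <cmath>
#include <cstdint>

// Sketch of the 4-entry magnitude LUT + branchless sign decode. The two
// middle magnitudes and the code layout are illustrative assumptions.
static const float kMag[4] = {0.021f, 0.065f, 0.120f, 0.190f};

// Assumed layout: bit 2 = sign, bits 0-1 = magnitude index. Negative codes
// store magnitudes in reversed order; XOR-ing the index with a sign-derived
// mask undoes the reversal without a branch (the "XOR trick").
inline float dequant3(uint8_t code, float norm) {
    uint8_t sign_bit = (code >> 2) & 1;
    uint8_t mag_idx  = (code & 0x3) ^ (uint8_t)(0x3 * sign_bit);
    float   sign     = 1.0f - 2.0f * (float)sign_bit; // branchless +1/-1
    return sign * kMag[mag_idx] * norm;               // 4 constant addrs, not 8
}
```

With this layout, codes `c` and `7-c` decode to opposite values, which is exactly the symmetry that lets an 8-entry signed LUT collapse to 4 magnitudes.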
### M2 Pro decode improvement:
| Depth | q8_0 | Main (8-LUT) | 4-mag LUT | vs Main | vs q8_0 |
|-------|------|-------------|-----------|---------|---------|
| short | 32.5 | 22.9 | 23.8 | +3.9% | 0.732x |
| 8K | 21.9 | 10.95 | 15.1 | **+37.9%** | 0.689x |
| 16K | 17.2 | 8.0 | 11.6 | **+45.0%** | 0.674x |
### M5 Max (no regression):
| Depth | Main | 4-mag LUT | Delta |
|-------|------|-----------|-------|
| short | 77.4 | 75.7 | -2.2% |
PPL: 6.1756 (unchanged).
### Summary
+38-45% decode improvement on M2 Pro at long context. The ratio vs q8_0 improved from 0.45-0.50x to 0.67-0.73x. The cliff is much less severe.
Minor regression on M5 (-2.2%) from the extra ALU (XOR + sign multiply). Could use the auto-detection to use 4-mag on ...
## BREAKTHROUGH: Profiling isolation identifies exact bottleneck
Added TURBO_PROFILE_MODE (0-4) to strip away dequant layers one at a time.
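A scalar sketch of the strip-away idea. The block layout, field names, and LUT contents below are hypothetical; only the mode semantics (each mode keeps one more pipeline stage, so a stage's cost is the tok/s delta between adjacent modes) follow the table.

```cpp
#include <cstdint>

// Sketch of TURBO_PROFILE_MODE: each mode keeps one more dequant stage.
// Block3's fields and kLut8's values are illustrative assumptions.
struct Block3 {
    float   norm;
    uint8_t qs[12];    // packed 3-bit codes
    uint8_t signs[4];  // packed sign bits
};

static const float kLut8[8] = {0.021f, 0.065f, 0.120f, 0.190f,
                               0.250f, 0.330f, 0.420f, 0.520f}; // illustrative

float profile_dequant(const Block3 &b, int i, int mode) {
    if (mode == 1) return 1.0f;                  // mode 1: no-op ceiling
    float norm = b.norm;                         // mode 2: + norm read
    if (mode == 2) return norm;
    uint8_t byte = b.qs[(i * 3) / 8];            // mode 4: + byte reads, no decode
    if (mode == 4) return norm + (float)byte;
    uint8_t q = byte & 0x7;                      // mode 3: + qs extraction + LUT
    if (mode == 3) return kLut8[q] * norm;       //   (sketch ignores straddled codes)
    uint8_t s = (b.signs[i / 8] >> (i % 8)) & 1; // mode 0: + signs + full decode
    return (s ? -1.0f : 1.0f) * kLut8[q] * norm;
}
```

The cost of a single stage on a given chip is then the delta between adjacent modes, e.g. the LUT alone costs (mode 4 tok/s) minus (mode 3 tok/s).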
### M5 Max vs M2 Pro at 8K context decode:
| Mode | What | M5 (% ceil) | M2 (% ceil) |
|------|------|------------|------------|
| 1 | No-op ceiling | 78.9 (100%) | 24.5 (100%) |
| 2 | + norm read | 75.1 (95%) | 22.1 (90%) |
| 4 | + all byte reads | 75.2 (95%) | 21.9 (89%) |
| 3 | + qs extraction + LUT | 64.9 (82%) | 16.4 (67%) |
| 0 | + signs + full LUT | 59.2 (75%) | 14.0 (57%) |
| q8_0 | baseline | 78.8 | 22.1 |
### Key findings:
1. **No-op turbo3 is FASTER than q8_0 on M2** (24.5 vs 22.1) — compressed cache = less bandwidth. The format is not the problem.
2. **Constant memory LUT is 2x worse on M2 than M5:**
- Mode 4→3 (LUT cost): M5 loses 13.7%, M2 loses 25.1%
- Mode 3→0 (signs+more LUT): M5 loses another 8.6%, M2 loses another 14.7%
3. **Byte reading is NOT the bottleneck** — Mode 4 (all reads, no LUT) only costs 10% on both.
4....
## M2 Pro Results Update: Batched Extract IS a Win
True baseline comparison (same branch chain, same build):
| Depth | q8_0 | Main (const LUT) | Batched extract | Bit-arithmetic |
|-------|------|-----------------|-----------------|----------------|
| short | 32.5 | 22.9 | **23.7 (+3.5%)** | 23.2 |
| 8K | 22.1 | 10.95 | **13.7 (+25%)** | 11.6 |
Earlier diagnostic (34.5 short) was a different build/context allocation — not comparable.
**Batched extract gives +25% at 8K on M2 Pro.** The explicit bit field pre-extraction helps Metal's compiler schedule device reads ahead of ALU on older hardware.
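The scheduling effect can be sketched in scalar C++ (the 8-codes-in-3-bytes layout is illustrative, and plain loads stand in for device-memory reads): pull every packed byte into locals first, then do all the bit extraction, instead of interleaving a load with each decode.

```cpp
#include <cstdint>

// Sketch of "batched extract": hoist all reads before the ALU work so the
// compiler can issue the loads back-to-back instead of interleaving them
// with dependent bit extraction. Layout (8 x 3-bit codes in 3 bytes) is
// an illustrative assumption.
void decode8_batched(const uint8_t qs[3], uint8_t out[8]) {
    // Phase 1: all reads, no decoding.
    uint32_t packed = (uint32_t)qs[0]
                    | ((uint32_t)qs[1] << 8)
                    | ((uint32_t)qs[2] << 16);
    // Phase 2: pure ALU bit-field extraction, no further memory traffic.
    for (int i = 0; i < 8; ++i)
        out[i] = (packed >> (3 * i)) & 0x7;
}
```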
Next: profile where the remaining gap is (turbo3 0.62x vs q8_0 at 8K).
## M2 Pro Results: Bit-Arithmetic Dequant
**Hardware:** Apple M2 Pro, Apple8 (1008), has_tensor=false, 32GB
**Model:** Qwen2.5-7B-Instruct-Q4_K_M
**Build:** experiment/m1-m2-decode-comparison (auto-detected bit-arithmetic)
### Decode Speed (tok/s)
| Depth | q8_0 | turbo3 (bit-arith) | Ratio | turbo3 (const LUT, earlier diag) | Ratio |
|-------|------|-------------------|-------|----------------------------------|-------|
| short | 32.5 | 23.2 | 0.714x | 34.5 | 0.837x |
| 4K | 26.0 | 15.7 | 0.604x | 20.4 | 0.640x |
| 8K | 22.1 | 11.6 | 0.525x | 14.8 | 0.538x |
| 16K | 17.2 | 8.0 | 0.465x | 9.4 | 0.454x |
### Conclusion
**Bit-arithmetic did NOT fix the M2 decode cliff.** The ratio still degrades from 0.71x to 0.47x.
Worse: bit-arithmetic is **slower than constant LUT at short context** (0.71x vs 0.84x) because the ALU cost exceeds M2's constant cache cost at low contention.
**Key finding: The M2 decode bottleneck is NOT the centroid LUT.** The constant cache is not the problem on ...
Good question. turbo3 at +1.4% vs q8_0 is worse than q4_0 (+0.5%) on raw PPL, but the comparison isn't apples-to-apples:
- **q4_0** compresses the KV cache to 4 bits per element → 4× compression vs f16
- **turbo3** compresses to ~3.5 bits → 4.6× compression vs f16
- **q8_0** is 8 bits → the quality baseline
turbo3 gives more compression than q4_0 while staying within the 2% quality gate set at the top of this issue. The stated target was "within 1% of q8_0; if >2% worse it's a quality problem." At +1.4%, turbo3 misses the 1% target but stays under the 2% failure threshold.
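For reference, the compression ratios follow directly from bits per cache element against the 16-bit f16 baseline; an illustrative one-liner makes the arithmetic explicit:

```cpp
// Compression ratio of a quantized KV-cache element vs the f16 baseline.
constexpr double compression_vs_f16(double bits_per_elem) {
    return 16.0 / bits_per_elem;
}
// q8_0: 16/8 = 2x, q4_0: 16/4 = 4x, turbo3: 16/3.5 ≈ 4.57x (the "4.6x").
```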
Also worth noting: since this issue was opened, a community contributor found a [norm correction](https://github.com/spiritbuun/llama-cpp-turboquant-cuda/commit/6b821a9) that brings turbo3 PPL even closer to q8_0 (now +1.1% on our Metal build). On CUDA with the same fix, turbo3 actually *beats* q8_0 PPL by 0.09%.
The real quality validation we're still missing is NIAH (needle-in-haystack retrieval at long context). PPL passing doesn't guarantee retrieval works. That's the next benchmark to run.
How is the quality target met when turbo3 is worse than q4_0?
## QUALITY FIXED ✅
Perplexity with inverse rotation restored in dequant:
| Cache | PPL | vs q8_0 |
|-------|-----|---------|
| f16 | 6.121 | — |
| q8_0 | 6.111 | baseline |
| q4_0 | 6.142 | +0.5% |
| **turbo3** | **6.194** | **+1.4%** |
turbo3 is within 1.4% of q8_0 perplexity. Quality target met.
Speed is back to ~10.7 tok/s (pre-optimization level) because the inverse rotation
is in the dequant hot path. The pre-rotate-queries optimization needs to be
reimplemented to work with GQA head layout (ne[0]=256 for concatenated heads)
and hybrid memory types.
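The linear algebra behind the pre-rotate-queries idea: for an orthogonal rotation R, dot(Rq, Rk) = dot(q, k), so rotating Q once per token lets attention score directly against rotated-space K without inverse-rotating K in the dequant hot path (the V path still needs its inverse rotation after attention). A 2-D rotation stands in for the real matrix in this sketch:

```cpp
#include <cmath>

// Rotations preserve dot products: dot(R q, R k) == dot(q, k). This is why
// pre-rotating queries can remove the inverse rotation from the K dequant
// hot path. A 2-D rotation stands in for the real per-head matrix.
static void rot2(double theta, const double v[2], double out[2]) {
    double c = std::cos(theta), s = std::sin(theta);
    out[0] = c * v[0] - s * v[1];
    out[1] = s * v[0] + c * v[1];
}
static double dot2(const double a[2], const double b[2]) {
    return a[0] * b[0] + a[1] * b[1];
}
```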
## Root causes found
### 1. V cache in rotated space
Python verification: dequant output has cosine=0.02 with input (garbage).
After inverse rotation: cosine=0.987 (correct).
V cache values MUST be inverse-rotated after attention.
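A toy reproduction of that verification, with a pairwise 2-D Givens rotation standing in for the real rotation matrix: the stored (rotated-space) values have low cosine similarity with the input, and applying the inverse rotation restores it.

```cpp
#include <cmath>
#include <vector>

// Toy version of the Python check: rotated-space values look like garbage
// against the input until the inverse rotation is applied. A pairwise 2-D
// Givens rotation stands in for the real rotation matrix.
static double cosine(const std::vector<double> &a, const std::vector<double> &b) {
    double dot = 0, na = 0, nb = 0;
    for (size_t i = 0; i < a.size(); ++i) {
        dot += a[i] * b[i]; na += a[i] * a[i]; nb += b[i] * b[i];
    }
    return dot / (std::sqrt(na) * std::sqrt(nb));
}

static std::vector<double> rotate(const std::vector<double> &v, double theta) {
    // Same Givens rotation applied to each consecutive element pair.
    std::vector<double> out(v.size());
    double c = std::cos(theta), s = std::sin(theta);
    for (size_t i = 0; i + 1 < v.size(); i += 2) {
        out[i]     = c * v[i] - s * v[i + 1];
        out[i + 1] = s * v[i] + c * v[i + 1];
    }
    return out;
}
```

Calling `rotate(v, theta)` to store and `rotate(stored, -theta)` to read back recovers the input; comparing the stored values directly against the input gives a low cosine, mirroring the 0.02-vs-0.987 observation.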
### 2. dynamic_cast fails for MoE models
The Qwen 3.5 MoE uses `llama_memory_hybrid_context`, not `llama_kv_cache_context`.
Our `dynamic_cast` returns null → Q rotation and V inverse rotation NEVER execute.
ALL speed benchmarks were on unrotated Q with rotated-space K/V — garbage results with fast speed.
### Why "coherent text" was misleading
Without any rotation applied, the raw quantize/dequant produces plausible-looking grammar
but wrong content. Short conversations hide this. Perplexity caught it.
### Fix needed
Store rotation tensors in `llm_graph_context` directly (not behind a KV cache dynamic_cast).
Then both Q rotation and V inverse rotation will work for ALL memory types.
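A minimal reproduction of the cast failure, with the class relationships reduced to stubs (the real types live in llama.cpp; everything else here is illustrative):

```cpp
// Minimal reproduction of root cause 2: the rotation hook only fires when a
// dynamic_cast to the unified KV-cache context succeeds, so hybrid-memory
// (MoE) contexts silently skip it. Class names follow the writeup; the
// stubs and the bool return are illustrative.
struct llama_memory_context { virtual ~llama_memory_context() = default; };
struct llama_kv_cache_context      : llama_memory_context { /* holds rotation tensors */ };
struct llama_memory_hybrid_context : llama_memory_context { /* Qwen 3.5 MoE path */ };

bool try_apply_rotation(llama_memory_context *mem) {
    // BUG pattern: rotation state lives behind the concrete KV-cache type.
    auto *kv = dynamic_cast<llama_kv_cache_context *>(mem);
    if (kv == nullptr)
        return false; // hybrid context -> cast fails -> Q/V rotation never runs
    /* ... apply Q rotation and V inverse rotation ... */
    return true;
}
```

The proposed fix moves the rotation tensors into `llm_graph_context`, so the hook no longer depends on which concrete memory type sits behind the pointer.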
## CRITICAL: Perplexity test reveals quality failure
| Cache | PPL | vs f16 |
|-------|-----|--------|
| f16 | 6.121 | baseline |
| q8_0 | 6.111 | -0.16% |
| q4_0 | 6.142 | +0.34% |
| **turbo3** | **165.6** | **+2607%** ❌ |
turbo3 perplexity is 27× worse than f16. Speed benchmarks were measuring how fast the model produces wrong answers.
Root cause investigation needed. DO NOT update README with speed claims until quality is fixed.
Suspected causes:
1. Norm mismatch: quantize stores full 128-element group norm, dequant uses it as per-32-block norm
2. Pre-rotate-queries rotation matrix mismatch with quantize rotation
3. 3-bit packing bug in block size 32
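Suspected cause 1 reduces to a scale mismatch: a value encoded against one norm but decoded against a different one comes back wrong by roughly the ratio of the two norms. A toy round-trip (3-level grid; this is not the actual turbo3 code) shows the failure mode:

```cpp
#include <cmath>

// Toy scale-mismatch round trip (illustrative, not the turbo3 code): encode
// x on a coarse symmetric grid using scale s_q, decode using scale s_d.
// Matched scales round-trip within one grid step; mismatched scales are
// off by roughly the factor s_d / s_q -- the shape of suspected cause 1.
static double roundtrip(double x, double s_q, double s_d) {
    int q = (int)std::lround(x / s_q * 3.0); // 3 levels per side (toy grid)
    return (double)q / 3.0 * s_d;
}
```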
Related Early-Stage Discoveries
Discovery source: GitHub open source, aggregated via automated community intelligence tracking.
Tech Stack Dependencies
No direct open-source NPM package mentions detected in the product documentation.
Media Traction & Mentions
No mainstream media stories specifically mentioning this product have been detected yet.
Deep Research & Science
No peer-reviewed scientific literature directly matching this product's architecture has been found.
Market Trends