GitHub Issue

Experiment: Fused Q·Centroid compressed attention for turbo3 decode

Discovered On Mar 26, 2026
Primary Metric open
## Problem

turbo3 decode has 82M data-dependent constant memory accesses per token (centroid LUT lookup). On M5 Max this costs ~9% speed. On M1 Max it causes a cliff from 84% to 39% of q8_0 as context grows — L2 cache pressure evicts centroid entries to main memory.

## Current decode ratio curve (turbo3/q8_0)

| Depth | M5 Max | M1 Max |
|-------|--------|--------|
| short | 0.91x | 0.84x |
| 4K | 0.89x | ~0.64x |
| 8K | 0.86x | ~0.54x |
| 16K | — | ~0.45x |
| 32K | — | ~0.39x |

## The Fix

Eliminate constant memory from the FA decode inner loop. Three approaches to try:

### Approach 1: half cn[8] registers (16 bytes, may not spill)

The previous float cn[8] (32 bytes) spilled on Metal. Half precision halves the register pressure.

### Approach 2: Threadgroup centroid cache

Load the 8 centroids into threadgroup memory once per threadgroup. The previous test was invalid (CPU fallback bug); this has never been tested on a real Metal GPU.

### Approach 3: Per-block norm*centroid table

Precompute `cn_norm[c] = centroid[c] * norm` at block start. The inner loop becomes `score += cn_norm[idx] * q[j]`. A fresh 8-entry register array per block, maximally cache-friendly.

## Success criteria

- turbo3/q8_0 decode ratio stays FLAT across context depths (currently drops from 0.91x to 0.72x on M5)
- If it stays flat at 0.90x+ across all depths, the fix works
- PPL unchanged (dequantized values are identical, just accessed differently)

## Prior work

- buun's register LUT: 0.965x on CUDA, spilled to 0.879x on Metal (float[8])
- Split-LUT (2x4 half...
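Approach 3 can be sketched host-side. A minimal sketch, assuming illustrative names and sizes (`NUM_CENTROIDS`, `BLOCK_SIZE`, `dot_block` are not the real kernel's identifiers — the actual code runs in a Metal shader, but the data flow is the same):

```c
/* Hypothetical host-side sketch of Approach 3 (per-block norm*centroid table).
 * The real kernel holds cn_norm in Metal registers; the structure is the same. */
#define NUM_CENTROIDS 8
#define BLOCK_SIZE 32

float dot_block(const float centroid[NUM_CENTROIDS], float norm,
                const unsigned char idx[BLOCK_SIZE], const float q[BLOCK_SIZE])
{
    /* One-time per-block precompute: fold the norm multiply into a fresh
     * 8-entry array, removing constant-memory reads from the inner loop. */
    float cn_norm[NUM_CENTROIDS];
    for (int c = 0; c < NUM_CENTROIDS; c++)
        cn_norm[c] = centroid[c] * norm;

    /* Inner loop: one small-array read + one FMA per element. */
    float score = 0.0f;
    for (int j = 0; j < BLOCK_SIZE; j++)
        score += cn_norm[idx[j]] * q[j];
    return score;
}
```

The point of the precompute is that the inner loop touches only an 8-entry array that lives entirely in registers (or at worst L1), regardless of how many blocks the context spans.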

Developer & User Discourse

TheTom • Mar 27, 2026
## M2 Pro Results: Bit-Arithmetic Dequant

**Hardware:** Apple M2 Pro, Apple8 (1008), has_tensor=false, 32GB
**Model:** Qwen2.5-7B-Instruct-Q4_K_M
**Build:** experiment/m1-m2-decode-comparison (auto-detected bit-arithmetic)

### Decode Speed (tok/s)

| Depth | q8_0 | turbo3 (bit-arith) | Ratio | turbo3 (const LUT, earlier diag) | Ratio |
|-------|------|-------------------|-------|----------------------------------|-------|
| short | 32.5 | 23.2 | 0.714x | 34.5 | 0.837x |
| 4K | 26.0 | 15.7 | 0.604x | 20.4 | 0.640x |
| 8K | 22.1 | 11.6 | 0.525x | 14.8 | 0.538x |
| 16K | 17.2 | 8.0 | 0.465x | 9.4 | 0.454x |

### Conclusion

**Bit-arithmetic did NOT fix the M2 decode cliff.** The ratio still degrades from 0.71x to 0.47x.

Worse: bit-arithmetic is **slower than constant LUT at short context** (0.71x vs 0.84x) because the ALU cost exceeds M2's constant cache cost at low contention.

**Key finding: The M2 decode bottleneck is NOT the centroid LUT.** The constant cache is not the problem on ...
TheTom • Mar 27, 2026
## M2 Pro Results Update: Batched Extract IS a Win

True baseline comparison (same branch chain, same build):

| Depth | q8_0 | Main (const LUT) | Batched extract | Bit-arithmetic |
|-------|------|-----------------|-----------------|----------------|
| short | 32.5 | 22.9 | **23.7 (+3.5%)** | 23.2 |
| 8K | 22.1 | 10.95 | **13.7 (+25%)** | 11.6 |

Earlier diagnostic (34.5 short) was a different build/context allocation — not comparable.

**Batched extract gives +25% at 8K on M2 Pro.** Explicit bit-field pre-extraction lets Metal's compiler schedule the device reads ahead of the ALU work on older hardware.
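As a sketch of the idea (the 3-bit packing, `ELEMS`, and `batched_extract` below are illustrative assumptions, not turbo3's actual layout): phase 1 is a run of pure loads and shifts with no dependent table lookups, which is what gives the compiler room to hoist the device reads; the LUT/FMA loop then runs over the extracted array.

```c
/* Hypothetical sketch of batched extract. Assumes 3-bit packed indices; the
 * caller must provide at least (ELEMS*3 + 7)/8 + 1 bytes in qs, since fields
 * that straddle a byte boundary are read via two adjacent bytes. */
#define ELEMS 16

void batched_extract(const unsigned char *qs, unsigned char out[ELEMS])
{
    for (int j = 0; j < ELEMS; j++) {
        int bit  = j * 3;       /* 3 bits per element (assumed layout) */
        int byte = bit >> 3;
        int off  = bit & 7;
        /* Read two adjacent bytes so boundary-straddling fields work. */
        unsigned v = (unsigned)qs[byte] | ((unsigned)qs[byte + 1] << 8);
        out[j] = (unsigned char)((v >> off) & 0x7u);
    }
}
```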

Next: profile where the remaining gap is (turbo3 0.62x vs q8_0 at 8K).
TheTom • Mar 27, 2026
## BREAKTHROUGH: Profiling isolation identifies exact bottleneck

Added TURBO_PROFILE_MODE (0-4) to strip away dequant layers one at a time.

### M5 Max vs M2 Pro at 8K context decode:

| Mode | What's measured | M5 Max tok/s (% of ceiling) | M2 Pro tok/s (% of ceiling) |
|------|------|------------|------------|
| 1 | No-op ceiling | 78.9 (100%) | 24.5 (100%) |
| 2 | + norm read | 75.1 (95%) | 22.1 (90%) |
| 4 | + all byte reads | 75.2 (95%) | 21.9 (89%) |
| 3 | + qs extraction + LUT | 64.9 (82%) | 16.4 (67%) |
| 0 | + signs + full LUT | 59.2 (75%) | 14.0 (57%) |
| q8_0 | baseline | 78.8 | 22.1 |
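One way the mode gating above could be structured — a hedged sketch with placeholder stage bodies (`dequant_elem` is illustrative; the real kernel is Metal, not C): each mode returns after one more dequant stage, so the timing delta between two adjacent modes isolates that stage's cost.

```c
/* Hypothetical sketch of TURBO_PROFILE_MODE layering (stage bodies are
 * placeholders, not the real kernel's dequant math). */
float dequant_elem(int mode, float norm, unsigned char qs_byte,
                   const float lut[8], unsigned sign_bit)
{
    if (mode == 1) return 1.0f;   /* mode 1: no-op ceiling                 */
    float v = norm;               /* mode 2: + norm read                   */
    if (mode == 2) return v;
    unsigned byte = qs_byte;      /* mode 4: + byte reads, still no LUT    */
    if (mode == 4) return v;
    unsigned idx = byte & 0x7u;   /* mode 3: + index extraction + LUT      */
    v *= lut[idx];
    if (mode == 3) return v;
    return sign_bit ? -v : v;     /* mode 0: + sign handling (full kernel) */
}
```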

### Key findings:

1. **No-op turbo3 is FASTER than q8_0 on M2** (24.5 vs 22.1) — compressed cache = less bandwidth. The format is not the problem.

2. **Constant memory LUT is 2x worse on M2 than M5:**
- Mode 4→3 (LUT cost): M5 loses 13.7%, M2 loses 25.1%
- Mode 3→0 (signs+more LUT): M5 loses another 8.6%, M2 loses another 14.7%

3. **Byte reading is NOT the bottleneck** — Mode 4 (all reads, no LUT) costs only ~5% on M5 and ~11% on M2 relative to the no-op ceiling.

4....
TheTom • Mar 27, 2026
## 4-Entry Magnitude LUT + Branchless Sign: BEST M2 RESULT

**Approach:** 4-entry constant half magnitude LUT (0.021-0.190) + XOR trick for reversed magnitude order + branchless sign multiply. Only 4 possible constant addresses per lookup instead of 8.
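A hedged sketch of the decode step, under an assumed bit layout (top bit = sign, low 2 bits = magnitude, with magnitude order reversed for negative codes, which the XOR undoes). The two intermediate magnitudes below are placeholders; only the 0.021–0.190 endpoints come from this thread, and `decode_code` is an illustrative name:

```c
/* Hypothetical sketch of 4-entry magnitude LUT + XOR trick + branchless sign.
 * Assumed 3-bit code: bit 2 = sign, bits 0-1 = magnitude index (reversed for
 * negative codes). Intermediate LUT values are placeholders. */
static const float mag_lut[4] = {0.021f, 0.080f, 0.140f, 0.190f};

float decode_code(unsigned code)
{
    unsigned sign = (code >> 2) & 1u;             /* top bit: 1 = negative     */
    unsigned mask = (unsigned)(-(int)sign) & 3u;  /* 0b11 iff negative         */
    unsigned mag  = (code & 3u) ^ mask;           /* XOR undoes reversed order */
    float v = mag_lut[mag];                       /* only 4 constant addresses */
    return v * (1.0f - 2.0f * (float)sign);       /* branchless sign multiply  */
}
```

With this layout, codes 0–3 decode to the four magnitudes ascending and codes 4–7 to their negations descending, so every lookup hits one of just 4 constant addresses.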

### M2 Pro decode improvement:

| Depth | q8_0 | Main (8-LUT) | 4-mag LUT | vs Main | vs q8_0 |
|-------|------|-------------|-----------|---------|---------|
| short | 32.5 | 22.9 | 23.8 | +3.9% | 0.732x |
| 8K | 21.9 | 10.95 | 15.1 | **+37.9%** | 0.689x |
| 16K | 17.2 | 8.0 | 11.6 | **+45.0%** | 0.674x |

### M5 Max (no regression):

| Depth | Main | 4-mag LUT | Delta |
|-------|------|-----------|-------|
| short | 77.4 | 75.7 | -2.2% |

PPL: 6.1756 (unchanged).

### Summary

+38-45% decode improvement on M2 Pro at long context. The ratio vs q8_0 improved from 0.45-0.50x to 0.67-0.73x. The cliff is much less severe.

Minor regression on M5 (-2.2%) from the extra ALU (XOR + sign multiply). Could use the auto-detection to use 4-mag on ...
TheTom • Mar 27, 2026
## Final experiment results — dequant-level optimization ceiling reached

### Complete M2 Pro scoreboard (8K decode, q8_0 = 21.9 tok/s):

| # | Approach | tok/s | vs q8_0 | vs Main | Const addrs |
|---|----------|-------|---------|---------|-------------|
| — | No-op ceiling | 24.5 | 1.119x | — | 0 |
| **1** | **4-mag LUT + per-elem norm** | **15.1** | **0.689x** | **+38%** | **4** |
| 2 | Batched extract (8-LUT) | 13.7 | 0.626x | +25% | 8 |
| 3 | Deferred norm (4-mag) | 12.9 | 0.589x | +18% | 4 |
| 4 | 2-pair half2 LUT | 12.0 | 0.548x | +10% | 2 |
| 5 | Select chain (zero LUT) | 11.9 | 0.544x | +9% | 0 |
| 6 | Bit-arithmetic | 11.6 | 0.530x | +6% | 0 |
| — | Main (8-entry LUT) | 10.95 | 0.500x | baseline | 8 |
| 7 | Non-vec forced (nl=2) | 10.2 | 0.466x | -7% | 8 |

### Key insight: 4 constant addresses is the sweet spot on M2 Pro
- **0 addresses** (select chain, bit-arith): ALU cost exceeds constant cache savings
- **2 addresses** (half2 pairs): ternary overhead exceeds savings from ...