GitHub Issue

Nonsensical output on Apple M4 Pro (Mac Mini 64GB) — 14.5 tok/s but garbage generation

Discovered: Mar 22, 2026
Status: open
**Environment**
- Machine: Mac Mini M4 Pro, 64GB unified memory
- macOS: Tahoe 26.3
- GPU: Apple M4 Pro (20-core GPU)

**What works**
- Compilation: clean build ✅
- Model loading: model_weights.bin (5.52 GB) mmap'd ✅
- Vocab: 248,077 tokens loaded ✅
- Metal shaders: compile in 1ms ✅
- Speed: 14.3-14.5 tok/s sustained, significantly faster than M3 Max (5.7 tok/s) ✅

**What's broken**
Generated tokens are nonsensical regardless of prompt or sampling strategy.

CLI mode (greedy):
```
./infer --prompt "What is prostate surgery?" --tokens 20 --k 4
```
Output: `The prostate surgery is a surgery that is a surgery that is...` (infinite loop)

- Server mode: generates 1 token (`#`) then immediate EOS.
- Chat template in CLI: echoes the prompt, then produces garbage tokens.
- With temperature sampling (0.7): same garbage, different random tokens.

**Hypothesis**
Metal shaders optimized for the M3 Max (40 GPU cores) may have compatibility issues with the M4 Pro (20 GPU cores): different SIMD group sizes or threadgroup configs could cause numerical errors in the dequant/matvec kernels. The fact that generation is FASTER while the output is garbage suggests the computation runs but produces incorrect results.

**Model**
mlx-community/Qwen3.5-397B-A17B-4bit (snapshot: 39159bd8)

Happy to run diagnostic builds or Metal profiling. Great project!

Developer & User Discourse

ccckblaze • Mar 23, 2026
https://github.com/danveloper/flash-moe/pull/1
Possibly related: vocab issues.
tamastoth-byborg • Mar 23, 2026
https://github.com/tamastoth-byborg/flash-moe/commit/203c78397e90954cc88a52bf1181839587dcd01b#diff-7d450f8500f4f66c2601cd6c2a73aff6aadd1b041a53c4e0b2ac8f9a7701e7e4R19 - try this generator; after adding the BPE decoding as well, it produced a nice response with `--token 1000`.

Run on a MacBook Pro with an M3 Pro (36GB); while running, it used 6GB+ of RAM and streamed from the SSD at 2.8 GB/s.
userFRM • Mar 23, 2026
Investigated this. The root cause is likely **mixed-precision quantization** in the MLX 4-bit model.

The MLX quantization config in `config.json` specifies per-tensor overrides:

```json
"quantization": {
  "group_size": 64, "bits": 4, "mode": "affine",
  "model.layers.0.mlp.gate": {"group_size": 64, "bits": 8},
  "model.layers.0.mlp.shared_expert_gate": {"group_size": 64, "bits": 8},
  ...
}
```

**Every `mlp.gate` (routing) and `mlp.shared_expert_gate` tensor is 8-bit, not 4-bit.** The inference engine treats all tensors uniformly as 4-bit, extracting 8 nibbles per uint32 — but these gate tensors pack 4 bytes per uint32 (8-bit). This corrupts the routing gate scores, selecting wrong experts every layer, producing garbage output.

Verification: gate weight shape is `[512, 512]` U32. At 4-bit that implies `in_dim = 512 * 8 = 4096`, but `hidden_size = 4096` for Qwen3.5-397B so it happens to work dimensionally. However the dequantized values are wrong because the nibble extracti...
userFRM • Mar 23, 2026
Correction to my previous comment: the 8-bit gate issue may be specific to Qwen3-Coder-Next, not Qwen3.5-397B. For the 397B model, the gate weight `[512, 512]` U32 at 4-bit gives `in_dim = 512*8 = 4096 = hidden_size` — dimensionally correct. The 397B quantization config may not have per-tensor 8-bit overrides.

The M4 Pro gibberish reported here likely has a different root cause — possibly M4-specific Metal GPU behavior (different SIMD group scheduling, threadgroup memory semantics, or register pressure behavior vs M3 Max).

Would be useful to know: does adding `--cpu-linear` (CPU delta-net fallback) change the output quality? That would isolate whether the issue is in the GPU linear attention kernels.
userFRM • Mar 23, 2026
Definitive finding: **the bug is in the GPU CMD2 pipeline, not in individual kernels or weight loading.**

Proof: with `g_metal = NULL` (forcing full CPU computation), the model produces **coherent text** with correct routing scores matching the MLX reference. With GPU enabled, the same model produces gibberish.

Tested on Qwen3-Coder-Next-4bit (M1 Pro, 32GB):
- **CPU path**: "Hello:\n\nI have a table with a field..." — coherent, gate scores match MLX
- **GPU path**: gibberish, gate scores have wrong magnitudes and signs

The GPU CMD2 pipeline (o_proj → residual_add → rms_norm → routing gate matvec) passes incorrect data through `buf_input` to the routing gate. The CPU dequant and CPU attention produce correct results, but somewhere in the fused GPU command buffer (8-12 encoders per CMD2), the buffer contents get corrupted.

This likely affects the original Qwen3.5-397B model on M4 Pro too (issue reporter's hardware), as the same GPU pipeline code is used.

The fix needs to audit every...