Gemini Executive Synthesis

A synthesis of a Flash-MoE inference engine issue on Apple M4 Pro: nonsensical output despite high token generation speed.

Technical Positioning
Achieving accurate and coherent LLM generation on Apple Silicon (M4 Pro) by resolving GPU pipeline data corruption, ensuring compatibility across GPU architectures, and handling mixed-precision quantization correctly.
SaaS Insight & Market Implications
The Flash-MoE engine on Apple M4 Pro produces nonsensical output despite high token generation speed, indicating a critical quality failure. Initial hypotheses pointed to M4-specific Metal shader incompatibility or mixed-precision quantization issues. The definitive finding reveals the bug resides in the GPU CMD2 pipeline, where data corruption occurs within the fused command buffer, specifically affecting the routing gate. This means the CPU path generates coherent text, but GPU acceleration introduces errors. This highlights a severe challenge in optimizing LLM inference for diverse hardware: achieving speed without compromising output fidelity. For B2B SaaS, this underscores the necessity of rigorous, hardware-specific validation and debugging at the low-level GPU pipeline stage. Performance metrics are irrelevant if the output is garbage; ensuring computational correctness across all target platforms is paramount for product viability.
Proprietary Technical Taxonomy
Nonsensical output · Apple M4 Pro · Mac Mini 64GB · 14.5 tok/s · garbage generation · Metal shaders · M3 Max · SIMD group sizes

Raw Developer Origin & Technical Request

GitHub Issue · Mar 22, 2026
Repo: danveloper/flash-moe
Nonsensical output on Apple M4 Pro (Mac Mini 64GB) — 14.5 tok/s but garbage generation

Environment

• Machine: Mac Mini M4 Pro, 64GB unified memory
• macOS: Tahoe 26.3
• GPU: Apple M4 Pro (20-core GPU)

What works

• Compilation: clean build ✅
• Model loading: model_weights.bin (5.52 GB) mmap'd ✅
• Vocab: 248,077 tokens loaded ✅
• Metal shaders: compile in 1ms ✅
• Speed: 14.3-14.5 tok/s sustained — significantly faster than M3 Max (5.7 tok/s) ✅

What's broken

Generated tokens are nonsensical regardless of prompt or sampling strategy.

CLI mode (greedy):
./infer --prompt "What is prostate surgery?" --tokens 20 --k 4
Output: The prostate surgery is a surgery that is a surgery that is... (infinite loop)

Server mode:
Generates 1 token (#) then immediate EOS.

Chat template in CLI:
Echoes prompt then produces garbage tokens.

With temperature sampling (0.7):
Same garbage, different random tokens.

Hypothesis

Metal shaders optimized for M3 Max (40 GPU cores) may have compatibility issues with M4 Pro (20 GPU cores) — different SIMD group sizes or threadgroup configs causing numerical errors in dequant/matvec kernels.
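To make the class of bug concrete, here is a minimal CPU-side sketch (illustrative, not the actual flash-moe kernel) of how a reduction written against a hard-coded SIMD-group width silently drops data when the runtime execution width differs. The function name and constant are assumptions for illustration.

```c
#include <assert.h>

/* Illustrative sketch: a partial dot-product reduction that hard-codes
 * an assumed SIMD-group width. If a kernel is tuned for one GPU's
 * execution width and the actual width is smaller, some lanes never
 * run -- the computation still completes (fast), but part of the data
 * is silently skipped and the result is wrong. */

#define ASSUMED_SIMD_WIDTH 32

float reduce_with_assumed_width(const float *x, int n, int actual_width) {
    float sum = 0.0f;
    /* The stride uses the *assumed* width, but only `actual_width`
     * lanes exist per group. */
    for (int lane = 0; lane < actual_width; lane++) {
        for (int i = lane; i < n; i += ASSUMED_SIMD_WIDTH) {
            sum += x[i];
        }
    }
    return sum;
}
```

With 64 ones as input, an actual width of 32 yields the correct sum (64), while an actual width of 16 yields 32: half the elements are never visited. This matches the symptom described above, where generation is fast but numerically wrong.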

Speed being FASTER while output is garbage suggests computation runs but produces incorrect results.

Model

mlx-community/Qwen3.5-397B-A17B-4bit (snapshot: 39159bd8)

Happy to run diagnostic builds or Metal profiling. Great project!

Developer Debate & Comments

ccckblaze • Mar 23, 2026
Possibly related to the vocab issues in https://github.com/danveloper/flash-moe/pull/1
tamastoth-byborg • Mar 23, 2026
Try this generator: https://github.com/tamastoth-byborg/flash-moe/commit/203c78397e90954cc88a52bf1181839587dcd01b#diff-7d450f8500f4f66c2601cd6c2a73aff6aadd1b041a53c4e0b2ac8f9a7701e7e4R19. After adding the BPE decoding as well, it produced a coherent response with `--token 1000`. Run on a MacBook Pro with M3 Pro, 36GB; while running it used 6GB+ RAM and streamed from the SSD at 2.8GB/s.
userFRM • Mar 23, 2026
Investigated this. The root cause is likely **mixed-precision quantization** in the MLX 4-bit model. The MLX quantization config in `config.json` specifies per-tensor overrides:

```json
"quantization": {
  "group_size": 64,
  "bits": 4,
  "mode": "affine",
  "model.layers.0.mlp.gate": {"group_size": 64, "bits": 8},
  "model.layers.0.mlp.shared_expert_gate": {"group_size": 64, "bits": 8},
  ...
}
```

**Every `mlp.gate` (routing) and `mlp.shared_expert_gate` tensor is 8-bit, not 4-bit.** The inference engine treats all tensors uniformly as 4-bit, extracting 8 nibbles per uint32, but these gate tensors pack 4 bytes per uint32 (8-bit). This corrupts the routing gate scores, selecting wrong experts at every layer and producing garbage output.

Verification: the gate weight shape is `[512, 512]` U32. At 4-bit that implies `in_dim = 512 * 8 = 4096`, and `hidden_size = 4096` for Qwen3.5-397B, so it happens to work dimensionally. However, the dequantized values are wrong because the nibble extraction...
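The packing mismatch described in the comment above can be sketched in a few lines of C (illustrative, not the engine's actual dequant code): an 8-bit-quantized word holds 4 codes per uint32, and reading it with a uniform 4-bit path splits each byte into two bogus nibble codes.

```c
#include <stdint.h>
#include <assert.h>

/* 4-bit path: 8 codes per uint32, each in 0..15 */
static inline uint8_t unpack4(uint32_t w, int i) {
    return (uint8_t)((w >> (4 * i)) & 0xF);
}

/* 8-bit path: 4 codes per uint32, each in 0..255 */
static inline uint8_t unpack8(uint32_t w, int i) {
    return (uint8_t)((w >> (8 * i)) & 0xFF);
}
```

For example, a word whose first 8-bit code is 200 (0xC8) yields the nibbles 8 and 12 when read by the 4-bit path. Both the code values and the effective element count (8 vs 4 per word) come out wrong, which is exactly how misreading an 8-bit gate tensor would corrupt routing scores.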
userFRM • Mar 23, 2026
Correction to my previous comment: the 8-bit gate issue may be specific to Qwen3-Coder-Next, not Qwen3.5-397B. For the 397B model, the gate weight `[512, 512]` U32 at 4-bit gives `in_dim = 512*8 = 4096 = hidden_size` — dimensionally correct. The 397B quantization config may not have per-tensor 8-bit overrides. The M4 Pro gibberish reported here likely has a different root cause — possibly M4-specific Metal GPU behavior (different SIMD group scheduling, threadgroup memory semantics, or register pressure behavior vs M3 Max). Would be useful to know: does adding `--cpu-linear` (CPU delta-net fallback) change the output quality? That would isolate whether the issue is in the GPU linear attention kernels.
userFRM • Mar 23, 2026
Definitive finding: **the bug is in the GPU CMD2 pipeline, not in individual kernels or weight loading.**

Proof: with `g_metal = NULL` (forcing full CPU computation), the model produces **coherent text** with correct routing scores matching the MLX reference. With GPU enabled, the same model produces gibberish.

Tested on Qwen3-Coder-Next-4bit (M1 Pro, 32GB):

- **CPU path**: "Hello:\n\nI have a table with a field..." — coherent, gate scores match MLX
- **GPU path**: gibberish, gate scores have wrong magnitudes and signs

The GPU CMD2 pipeline (o_proj → residual_add → rms_norm → routing gate matvec) passes incorrect data through `buf_input` to the routing gate. The CPU dequant and CPU attention produce correct results, but somewhere in the fused GPU command buffer (8-12 encoders per CMD2), the buffer contents get corrupted. This likely affects the original Qwen3.5-397B model on M4 Pro too (the issue reporter's hardware), as the same GPU pipeline code is used. The fix needs to audit every...

Adjacent Repository Pain Points

Other highly discussed features and pain points extracted from danveloper/flash-moe.

Extracted Positioning
`Flash-MoE` for running large MoE models (Qwen3.5-397B-A17B) locally on Apple Silicon Macs.
Enabling local, cloud-independent execution of massive MoE models on consumer-grade high-end hardware (Apple Silicon), achieving interactive performance.
Extracted Positioning
Model weight loading for the Flash-MoE inference engine.
Ensuring correct file path resolution and loading of model weights (`model_weights.bin`) for the Flash-MoE engine, particularly when models are sourced from Hugging Face caches.
Extracted Positioning
Vocab file generation (`vocab.bin`) for the C decoder in Flash-MoE.
Ensuring the availability and correct generation of the `vocab.bin` file, which maps token IDs to strings, by providing a robust Python script that searches common locations and Hugging Face caches for `tokenizer.json`.
Extracted Positioning
The `flash-moe` project, specifically the lack of an explicit `LICENSE` file.
Adherence to open-source best practices and legal clarity for project usage and contributions.
Extracted Positioning
Adaptability of flash-moe (running big models on small laptops) to other Qwen models.
Versatility and broad compatibility across different Qwen model variants.

Engagement Signals

Replies: 5
Issue status: open

Cross-Market Term Frequency

Tracks how often foundational terms from this issue (e.g., "routing", "nonsensical output") occur across active SaaS architecture discussions and enterprise developer debates, as a measure of cross-market adoption.