Gemini Executive Synthesis

A synthesis of a Flash-MoE inference engine issue on Apple M4 Pro: nonsensical output despite high token generation speed.

Technical Positioning
Achieving accurate and coherent LLM generation on Apple Silicon (M4 Pro) by resolving GPU pipeline data corruption, ensuring compatibility across GPU architectures, and handling mixed-precision quantization correctly.
SaaS Insight & Market Implications
The Flash-MoE engine on Apple M4 Pro produces nonsensical output despite high token generation speed, indicating a critical quality failure. Initial hypotheses pointed to M4-specific Metal shader incompatibility or mixed-precision quantization issues. The definitive finding reveals the bug resides in the GPU CMD2 pipeline, where data corruption occurs within the fused command buffer, specifically affecting the routing gate. This means the CPU path generates coherent text, but GPU acceleration introduces errors. This highlights a severe challenge in optimizing LLM inference for diverse hardware: achieving speed without compromising output fidelity. For B2B SaaS, this underscores the necessity of rigorous, hardware-specific validation and debugging at the low-level GPU pipeline stage. Performance metrics are irrelevant if the output is garbage; ensuring computational correctness across all target platforms is paramount for product viability.
Proprietary Technical Taxonomy
Nonsensical output · Apple M4 Pro · Mac Mini 64GB · 14.5 tok/s · garbage generation · Metal shaders · M3 Max · SIMD group sizes

Raw Developer Origin & Technical Request

GitHub Issue · Mar 22, 2026
Repo: danveloper/flash-moe
Nonsensical output on Apple M4 Pro (Mac Mini 64GB) — 14.5 tok/s but garbage generation

Environment

• Machine: Mac Mini M4 Pro, 64GB unified memory
• macOS: Tahoe 26.3
• GPU: Apple M4 Pro (20-core GPU)

What works

• Compilation: clean build ✅
• Model loading: model_weights.bin (5.52 GB) mmap'd ✅
• Vocab: 248,077 tokens loaded ✅
• Metal shaders: compile in 1ms ✅
• Speed: 14.3-14.5 tok/s sustained — significantly faster than M3 Max (5.7 tok/s) ✅

What's broken

Generated tokens are nonsensical regardless of prompt or sampling strategy.

CLI mode (greedy):
./infer --prompt "What is prostate surgery?" --tokens 20 --k 4
Output: The prostate surgery is a surgery that is a surgery that is... (infinite loop)

Server mode:
Generates 1 token (#) then immediate EOS.

Chat template in CLI:
Echoes prompt then produces garbage tokens.

With temperature sampling (0.7):
Same garbage, different random tokens.

Hypothesis

Metal shaders optimized for M3 Max (40 GPU cores) may have compatibility issues with M4 Pro (20 GPU cores) — different SIMD group sizes or threadgroup configs causing numerical errors in dequant/matvec kernels.
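To make the class of bug concrete, here is a minimal CPU-side sketch (illustrative, not the actual flash-moe kernel) of how a reduction written against a hard-coded SIMD-group width silently drops data when the runtime execution width differs. The function name and constant are assumptions for illustration.

```c
#include <assert.h>

/* Illustrative sketch: a partial dot-product reduction that hard-codes
 * an assumed SIMD-group width. If a kernel is tuned for one GPU's
 * execution width and the actual width is smaller, some lanes never
 * run -- the computation still completes (fast), but part of the data
 * is silently skipped and the result is wrong. */

#define ASSUMED_SIMD_WIDTH 32

float reduce_with_assumed_width(const float *x, int n, int actual_width) {
    float sum = 0.0f;
    /* The stride uses the *assumed* width, but only `actual_width`
     * lanes exist per group. */
    for (int lane = 0; lane < actual_width; lane++) {
        for (int i = lane; i < n; i += ASSUMED_SIMD_WIDTH) {
            sum += x[i];
        }
    }
    return sum;
}
```

With 64 ones as input, an actual width of 32 yields the correct sum (64), while an actual width of 16 yields 32: half the elements are never visited. This matches the symptom described above, where generation is fast but numerically wrong.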

Speed being FASTER while output is garbage suggests computation runs but produces incorrect results.

Model

mlx-community/Qwen3.5-397B-A17B-4bit (snapshot: 39159bd8)

Happy to run diagnostic builds or Metal profiling. Great project!

Developer Debate & Comments

ccckblaze • Mar 23, 2026
Possibly related to the vocab issues in https://github.com/danveloper/flash-moe/pull/1
tamastoth-byborg • Mar 23, 2026
Try this generator: https://github.com/tamastoth-byborg/flash-moe/commit/203c78397e90954cc88a52bf1181839587dcd01b#diff-7d450f8500f4f66c2601cd6c2a73aff6aadd1b041a53c4e0b2ac8f9a7701e7e4R19. After adding the BPE decoding as well, it produced a coherent response with `--token 1000`. Run on a MacBook Pro with M3 Pro, 36GB; while running it used 6GB+ RAM and streamed from the SSD at 2.8GB/s.
userFRM • Mar 23, 2026
Investigated this. The root cause is likely **mixed-precision quantization** in the MLX 4-bit model. The MLX quantization config in `config.json` specifies per-tensor overrides:

```json
"quantization": {
  "group_size": 64,
  "bits": 4,
  "mode": "affine",
  "model.layers.0.mlp.gate": {"group_size": 64, "bits": 8},
  "model.layers.0.mlp.shared_expert_gate": {"group_size": 64, "bits": 8},
  ...
}
```

**Every `mlp.gate` (routing) and `mlp.shared_expert_gate` tensor is 8-bit, not 4-bit.** The inference engine treats all tensors uniformly as 4-bit, extracting 8 nibbles per uint32, but these gate tensors pack 4 bytes per uint32 (8-bit). This corrupts the routing gate scores, selecting wrong experts at every layer and producing garbage output.

Verification: the gate weight shape is `[512, 512]` U32. At 4-bit that implies `in_dim = 512 * 8 = 4096`, and `hidden_size = 4096` for Qwen3.5-397B, so it happens to work dimensionally. However, the dequantized values are wrong because the nibble extraction...
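The packing mismatch described in the comment above can be sketched in a few lines of C (illustrative, not the engine's actual dequant code): an 8-bit-quantized word holds 4 codes per uint32, and reading it with a uniform 4-bit path splits each byte into two bogus nibble codes.

```c
#include <stdint.h>
#include <assert.h>

/* 4-bit path: 8 codes per uint32, each in 0..15 */
static inline uint8_t unpack4(uint32_t w, int i) {
    return (uint8_t)((w >> (4 * i)) & 0xF);
}

/* 8-bit path: 4 codes per uint32, each in 0..255 */
static inline uint8_t unpack8(uint32_t w, int i) {
    return (uint8_t)((w >> (8 * i)) & 0xFF);
}
```

For example, a word whose first 8-bit code is 200 (0xC8) yields the nibbles 8 and 12 when read by the 4-bit path. Both the code values and the effective element count (8 vs 4 per word) come out wrong, which is exactly how misreading an 8-bit gate tensor would corrupt routing scores.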
userFRM • Mar 23, 2026
Correction to my previous comment: the 8-bit gate issue may be specific to Qwen3-Coder-Next, not Qwen3.5-397B. For the 397B model, the gate weight `[512, 512]` U32 at 4-bit gives `in_dim = 512*8 = 4096 = hidden_size` — dimensionally correct. The 397B quantization config may not have per-tensor 8-bit overrides. The M4 Pro gibberish reported here likely has a different root cause — possibly M4-specific Metal GPU behavior (different SIMD group scheduling, threadgroup memory semantics, or register pressure behavior vs M3 Max). Would be useful to know: does adding `--cpu-linear` (CPU delta-net fallback) change the output quality? That would isolate whether the issue is in the GPU linear attention kernels.
userFRM • Mar 23, 2026
Definitive finding: **the bug is in the GPU CMD2 pipeline, not in individual kernels or weight loading.**

Proof: with `g_metal = NULL` (forcing full CPU computation), the model produces **coherent text** with correct routing scores matching the MLX reference. With GPU enabled, the same model produces gibberish.

Tested on Qwen3-Coder-Next-4bit (M1 Pro, 32GB):

- **CPU path**: "Hello:\n\nI have a table with a field..." — coherent, gate scores match MLX
- **GPU path**: gibberish, gate scores have wrong magnitudes and signs

The GPU CMD2 pipeline (o_proj → residual_add → rms_norm → routing gate matvec) passes incorrect data through `buf_input` to the routing gate. The CPU dequant and CPU attention produce correct results, but somewhere in the fused GPU command buffer (8-12 encoders per CMD2), the buffer contents get corrupted. This likely affects the original Qwen3.5-397B model on M4 Pro too (the issue reporter's hardware), as the same GPU pipeline code is used. The fix needs to audit every...

Adjacent Repository Pain Points

Other highly discussed features and pain points extracted from danveloper/flash-moe.

Extracted Positioning
`Flash-MoE` for running large MoE models (Qwen3.5-397B-A17B) locally on Apple Silicon Macs.
Enabling local, cloud-independent execution of massive MoE models on consumer-grade high-end hardware (Apple Silicon), achieving interactive performance.
Extracted Positioning
Model weight loading for the Flash-MoE inference engine.
Ensuring correct file path resolution and loading of model weights (`model_weights.bin`) for the Flash-MoE engine, particularly when models are sourced from Hugging Face caches.
Extracted Positioning
Vocab file generation (`vocab.bin`) for the C decoder in Flash-MoE.
Ensuring the availability and correct generation of the `vocab.bin` file, which maps token IDs to strings, by providing a robust Python script that searches common locations and Hugging Face caches for `tokenizer.json`.
Extracted Positioning
The `flash-moe` project, specifically the lack of an explicit `LICENSE` file.
Adherence to open-source best practices and legal clarity for project usage and contributions.
Extracted Positioning
Adaptability of flash-moe (running big models on small laptops) to other Qwen models.
Versatility and broad compatibility across different Qwen model variants.

Engagement Signals

Replies: 5
Issue status: open

Cross-Market Term Frequency

Tracks how often foundational terms from this issue (e.g., "routing", "nonsensical output") occur across active SaaS architecture discussions and enterprise developer debates, as a measure of cross-market adoption.