Gemini Executive Synthesis

`Flash-MoE` for running large MoE models (Qwen3.5-397B-A17B) locally on Apple Silicon Macs.

Technical Positioning
Enabling local, cloud-independent execution of massive MoE models on consumer-grade high-end hardware (Apple Silicon), achieving interactive performance.
SaaS Insight & Market Implications
This issue provides a critical 'gotcha' guide for `Flash-MoE`, highlighting the significant setup complexity for running massive MoE models locally on Apple Silicon. The primary pain point is the exorbitant temporary disk space requirement (~450GB) and the need for high-end unified memory (48GB+). For B2B SaaS, while 'zero cloud dependency' is a strong value proposition for data privacy and cost control, such demanding local setup requirements create a high barrier to entry. Enterprises seeking to deploy large models on edge devices or developer workstations need streamlined, less resource-intensive deployment processes. This indicates a market need for more efficient model packaging, automated resource management, and clearer, less painful onboarding to unlock the full potential of local LLM inference.
Proprietary Technical Taxonomy
Flash-MoE · Qwen3.5-397B-A17B · MoE model · Apple Silicon Mac · M4 Max 64GB MacBook Pro · ~5 tok/s interactive chat · OpenAI-compatible API server · Zero cloud dependency

Raw Developer Origin & Technical Request

GitHub Issue · Mar 28, 2026
Repo: danveloper/flash-moe
I did get it working, with a lot of pain. If you're interested, here's a README I had Claude crank out capturing the gotchas.

# Flash-MoE Setup Guide — The Real One

## What This Is

A step-by-step guide to running Qwen3.5-397B-A17B (a 397-billion-parameter MoE model) locally on an Apple Silicon Mac using [danveloper/flash-moe](github.com/danveloper/flash-...). Written from an actual setup on an M4 Max 64GB MacBook Pro, including every gotcha we hit.

**End result:** ~5 tok/s interactive chat + OpenAI-compatible API server. Zero cloud dependency.
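Once the server is up, the OpenAI-compatible endpoint can be exercised with nothing but the Python standard library. A minimal sketch — the host, port, URL path, and model name below are assumptions, not taken from the repo; check flash-moe's own docs for the real values:

```python
import json
import urllib.request

# Assumed endpoint for flash-moe's OpenAI-compatible server (illustrative).
API_URL = "http://localhost:8080/v1/chat/completions"

def build_chat_request(prompt: str, model: str = "qwen3.5-397b-a17b") -> dict:
    """Build an OpenAI-style chat-completion payload."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
    }

def chat(prompt: str) -> str:
    """POST the payload and return the assistant's reply text."""
    payload = json.dumps(build_chat_request(prompt)).encode()
    req = urllib.request.Request(
        API_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

Because the wire format is the standard chat-completions shape, any OpenAI-compatible client library should work against it as well.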

---

## Hardware Requirements

- Apple Silicon Mac (M3 Max, M4 Pro, M4 Max, or better)
- **Minimum 48GB unified memory** (64GB+ recommended for better page cache hit rates)
- **~450GB free disk space during setup** (drops to ~215GB after cleanup)
- 1TB+ SSD (smaller configurations cannot hold the ~450GB setup footprint)
- macOS 26.2+ (Darwin 25.2.0+)

### Disk Space Budget — Read This First

This is the #1 thing that will bite you. The setup has three phases of disk usage:

| Phase | Cumulative Disk Used | Notes |
|-------|---------------------|-------|
| Download MLX 4-bit model | ~210 GB | Source safetensors files |
| Git LFS cache (hidden) | ~420 GB | `.git/lfs/` holds a second copy |
| After `git lfs fetch --all` cleanup | ~210 GB | Delete `.git/lfs/` to reclaim |
| After `repack_experts.py` | ~420 GB | 210GB source + 209GB packed experts |
| After deleting source model | **~215 GB** | Final footprint |

**You need ~450GB free to start.** Plan your cleanup steps. On a 1TB drive, this means you need most of your disk empty.

**Critical cleanup commands** (safe to run at each s...
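The ~450GB figure from the table can be verified before downloading anything, which is cheaper than failing mid-download. A minimal pre-flight sketch (the path and threshold are illustrative, not part of flash-moe):

```python
import shutil

# Threshold from the disk-budget table above.
REQUIRED_FREE_GB = 450

def enough_disk(path: str = "/", required_gb: int = REQUIRED_FREE_GB) -> bool:
    """Return True if the volume holding `path` has at least
    `required_gb` gigabytes free."""
    free_gb = shutil.disk_usage(path).free / 1e9
    return free_gb >= required_gb

# Example: abort early instead of discovering the problem 200GB in.
# if not enough_disk():
#     raise SystemExit("Need ~450GB free before starting the model download.")
```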

Developer Debate & Comments

No active discussions extracted for this entry yet.

Adjacent Repository Pain Points

Other highly discussed features and pain points extracted from danveloper/flash-moe.

Extracted Positioning
Flash-MoE inference engine on Apple M4 Pro, specifically addressing nonsensical output despite high token generation speed.
Achieving accurate and coherent LLM generation on Apple Silicon (M4 Pro) by resolving GPU pipeline data corruption issues, ensuring compatibility across different GPU architectures and correct handling of mixed-precision quantization.
Top Replies
ccckblaze • Mar 23, 2026
https://github.com/danveloper/flash-moe/pull/1 vocab issues related
tamastoth-byborg • Mar 23, 2026
https://github.com/tamastoth-byborg/flash-moe/commit/203c78397e90954cc88a52bf1181839587dcd01b#diff-7d450f8500f4f66c2601cd6c2a73aff6aadd1b041a53c4e0b2ac8f9a7701e7e4R19 - try this generator, after ad...
userFRM • Mar 23, 2026
Investigated this. The root cause is likely **mixed-precision quantization** in the MLX 4-bit model. The MLX quantization config in `config.json` specifies per-tensor overrides: ```json "quantizati...
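The comment above points at per-tensor overrides inside the `quantization` section of the model's `config.json`. A sketch of how one might surface those overrides for inspection — the schema assumed here (global scalars like `bits`/`group_size` alongside per-tensor dicts) is an assumption based on the truncated quote, not verified against the model:

```python
import json

def quant_overrides(cfg: dict) -> dict:
    """Return entries of cfg["quantization"] that look like per-tensor
    overrides (nested dicts), skipping global scalar settings."""
    quant = cfg.get("quantization", {})
    return {k: v for k, v in quant.items() if isinstance(v, dict)}

# Usage with a config.json on disk:
# with open("config.json") as f:
#     print(quant_overrides(json.load(f)))
```

Listing the overrides this way makes it easy to spot which tensors run at a different precision than the global 4-bit setting — the mixed-precision suspect named in the comment.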
Extracted Positioning
Model weight loading for the Flash-MoE inference engine.
Ensuring correct file path resolution and loading of model weights (`model_weights.bin`) for the Flash-MoE engine, particularly when models are sourced from Hugging Face caches.
Extracted Positioning
Vocab file generation (`vocab.bin`) for the C decoder in Flash-MoE.
Ensuring the availability and correct generation of the `vocab.bin` file, which maps token IDs to strings, by providing a robust Python script that searches common locations and Hugging Face caches for `tokenizer.json`.
Extracted Positioning
The `flash-moe` project, specifically the lack of an explicit `LICENSE` file.
Adherence to open-source best practices and legal clarity for project usage and contributions.
Extracted Positioning
Adaptability of flash-moe (running big models on small laptops) to other Qwen models.
Versatility and broad compatibility across different Qwen model variants.

Engagement Signals

3 replies · Issue status: open

Cross-Market Term Frequency

Quantifies the cross-market adoption of foundational terms like Qwen3.5-397B-A17B and MoE model by tracking occurrence frequency across active SaaS architectures and enterprise developer debates.