danveloper/flash-moe

Name: danveloper/flash-moe
Rating: 4.5 (274 reviews)

Running a big model on a small laptop

Traction Score: 2,695
Forks: 274
Launch Date: Mar 18, 2026

Product Positioning & Context

AI Executive Synthesis

The main concern is ensuring that the `vocab.bin` file, which maps token IDs to strings, is available and correctly generated. A Python script addresses this by searching common locations and Hugging Face caches for `tokenizer.json`.

The `vocab.bin` file, which the C decoder needs for token-to-string mapping, is frequently missing, and this causes deployment issues for Flash-MoE. The provided Python script `export_vocab.py` addresses the problem by searching common locations and Hugging Face caches for `tokenizer.json` and generating the binary `vocab.bin` from it. This reflects a common developer pain point in LLM deployment: managing and generating auxiliary model files. For B2B SaaS, reliable tooling for asset generation and discovery is critical, because manual steps and implicit file locations introduce friction and errors. Automating the process, as attempted here, improves developer experience, reduces deployment overhead, and brings the model closer to running out of the box.
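The search step described above can be sketched as follows. This is a minimal illustration, not the repository's `export_vocab.py`: the function names are hypothetical, the `tokenizer.json` layout assumed is the BPE-style one with a `model.vocab` mapping plus an `added_tokens` list, and the sketch stops at an in-memory ID-to-string table because the on-disk layout of `vocab.bin` is not described here.

```python
import json
import os
from pathlib import Path


def candidate_paths(model_dir, hf_cache=None):
    """Yield possible tokenizer.json locations, most specific first."""
    yield Path(model_dir) / "tokenizer.json"
    # Default Hugging Face hub cache; HF_HOME overrides the base directory.
    if hf_cache is None:
        hf_home = os.environ.get("HF_HOME", str(Path.home() / ".cache" / "huggingface"))
        hf_cache = Path(hf_home) / "hub"
    # Cached repos are laid out as models--ORG--NAME/snapshots/REVISION/.
    yield from sorted(Path(hf_cache).glob("models--*/snapshots/*/tokenizer.json"))


def find_tokenizer_json(model_dir, hf_cache=None):
    """Return the first existing tokenizer.json, or None if none is found."""
    for path in candidate_paths(model_dir, hf_cache):
        if path.is_file():
            return path
    return None


def load_id_to_token(tokenizer_json):
    """Return a list where index i holds the string for token ID i."""
    data = json.loads(Path(tokenizer_json).read_text(encoding="utf-8"))
    vocab = dict(data["model"]["vocab"])          # token string -> ID (assumed layout)
    for added in data.get("added_tokens", []):    # special tokens are listed separately
        vocab[added["content"]] = added["id"]
    table = [""] * (max(vocab.values()) + 1)
    for token, token_id in vocab.items():
        table[token_id] = token
    return table
```

Returning `None` rather than raising lets the caller print a message that names every location that was tried, which is the kind of feedback a missing-file error otherwise lacks.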

Deep-Dive FAQs

What is danveloper/flash-moe?

Our AI analysis characterizes danveloper/flash-moe as ensuring the availability and correct generation of the `vocab.bin` file, which maps token IDs to strings, by providing a Python script that searches common locations and Hugging Face caches for `tokenizer.json`. It focuses on the following: the `vocab.bin` file, crucial for the C decoder's token-to-string mapping, is frequently missing, causing deployment issues for Flash-MoE. The prov...

Where did danveloper/flash-moe originate?

Data for danveloper/flash-moe was aggregated directly from the GitHub Open Source community ecosystem, representing raw developer and early-adopter sentiment.

When was danveloper/flash-moe publicly launched?

The initial public indexing or launch date for danveloper/flash-moe within our tracked developer communities was recorded on March 18, 2026.

How popular is danveloper/flash-moe?

danveloper/flash-moe has a traction score of 2,695 and 274 recorded discussions or engagements.

Are there active development issues for danveloper/flash-moe?

Yes, we are tracking open architectural debates and bug reports for this project on GitHub. Five active high-priority issues were logged recently.

Is danveloper/flash-moe recognized by media or academic researchers?

Yes. It has been covered by media outlets like Github.com. This indicates the concept has reached a level of mainstream or scientific viability beyond just developer forums.

What are some commercial alternatives to danveloper/flash-moe?

Our semantic intelligence engine identifies potential commercial alternatives in the SaaS space, such as Timbal AI, which offers overlapping value propositions.

How does the creator describe danveloper/flash-moe?

The original author or development team describes the product as follows: "Running a big model on a small laptop"

Active Developer Issues (GitHub)

open Other Qwen models

Logged: Apr 1, 2026

open Please add a license to this repo

Logged: Mar 30, 2026

open I did get it working, with a lot of pain, if your interested here's a readme I had claud crank out capturing the gotchas.

Logged: Mar 28, 2026

open Nonsensical output on Apple M4 Pro (Mac Mini 64GB) — 14.5 tok/s but garbage generation

Logged: Mar 22, 2026

open Cannot open model_weights.bin: No such file or directory

Logged: Mar 22, 2026

Community Voice & Feedback

Proryanator • Apr 14, 2026

Thank you for sharing this. I got it working on my 36GB M3 Max MacBook Pro following your instructions 👏

Shivox • Apr 8, 2026

Thanks for taking the time to write the detailed instructions.

A couple of things:

- The cleanup command `find ~/qwen35-397b-4bit -maxdepth 1 ! -name packed_experts ! -name . -exec rm -rf {} +` deleted the entire model directory on vanilla macOS zsh, which cost me an hour to redo the whole process.
I think it is missing a `-mindepth 1` to prevent the deletion of the parent directory.

- `expert_index.json` from this repo has model path hardcoded, so you might want to add an instruction to update the path.

- Maybe it would be better to export the model path in an environment variable, so it would be easier to copy & paste commands.
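
The cleanup pitfall above follows from how `find` matches names: the starting directory's own name is `qwen35-397b-4bit`, so `! -name .` does not exclude it and it is passed to `rm -rf` along with its children. A minimal sketch of a safer cleanup is shown below. It is not part of the repository; the function name and the `dry_run` parameter are hypothetical. It only ever iterates over direct children, so the model directory itself cannot be removed.

```python
import shutil
from pathlib import Path


def clean_model_dir(model_dir, keep=("packed_experts",), dry_run=True):
    """Remove every direct child of model_dir except the names in keep.

    Only children are visited, so model_dir itself always survives
    (the same effect as adding -mindepth 1 to the find command).
    Returns the list of paths that were, or would be, removed.
    """
    root = Path(model_dir).expanduser()
    removed = []
    for child in sorted(root.iterdir()):
        if child.name in keep:
            continue
        removed.append(child)
        if dry_run:
            continue  # report only, delete nothing
        if child.is_dir() and not child.is_symlink():
            shutil.rmtree(child)
        else:
            child.unlink()
    return removed
```

A cautious way to use it is to call `clean_model_dir("~/qwen35-397b-4bit")` first, inspect the returned paths, and only then repeat the call with `dry_run=False`. The model path could equally be read from an environment variable, as the last suggestion above proposes; any variable name for that would be the user's own choice.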

For performance figures: MBP Pro 16 with M4 Max/48GB/1TB, I'm getting around `5.5 tok/s`

rafaelkupper • Mar 31, 2026

Thanks! Got it running on a MBP M4 Pro 48GB at 3.1 tok/s.

HIGGS317 • Mar 31, 2026

Great experiment and write-up. Wanted to ask: can this method be adapted for other, smaller models in the 80-100B parameter range to run on MacBook Airs too?

aronson • Mar 30, 2026

This helped me a ton! Managed to get it running, and wanted to add to the numbers:

## Performance Notes

### Expected Performance by Hardware

| Machine | RAM | Bandwidth | Expected tok/s |
|---------|-----|-----------|---------------|
| M3 Max (reference) | 48 GB | ~400 GB/s | 4.4 |
| M4 Max | 64 GB | ~546 GB/s | 5.0-5.5+ |
| M1 Max | 64 GB | ~400 GB/s | 2.4-2.9+ |

Tested on a MBP 16" fully loaded M1 Max series

userFRM • Mar 23, 2026

Definitive finding: **the bug is in the GPU CMD2 pipeline, not in individual kernels or weight loading.**

Proof: with `g_metal = NULL` (forcing full CPU computation), the model produces **coherent text** with correct routing scores matching the MLX reference. With GPU enabled, the same model produces gibberish.

Tested on Qwen3-Coder-Next-4bit (M1 Pro, 32GB):
- **CPU path**: "Hello:\n\nI have a table with a field..." — coherent, gate scores match MLX
- **GPU path**: gibberish, gate scores have wrong magnitudes and signs
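
The CPU-versus-GPU comparison above can be sketched as a small check on two vectors of routing gate scores. This is an illustrative sketch only: the function name, the tolerance, and the `top_k` value are assumptions, and the score vectors would have to be dumped from the engine and from the MLX reference by whatever means are available.

```python
def compare_gate_scores(reference, candidate, top_k=2, tol=1e-2):
    """Compare routing gate scores from a trusted path against another path.

    Reports the largest absolute difference, the number of sign flips,
    and whether both paths would select the same top_k experts.
    """
    if len(reference) != len(candidate):
        raise ValueError("score vectors differ in length")
    max_abs_diff = max(abs(r - c) for r, c in zip(reference, candidate))
    sign_flips = sum(1 for r, c in zip(reference, candidate) if r * c < 0)

    def top(scores):
        order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
        return set(order[:top_k])

    return {
        "max_abs_diff": max_abs_diff,
        "sign_flips": sign_flips,
        "same_top_k": top(reference) == top(candidate),
        "ok": max_abs_diff <= tol and sign_flips == 0,
    }
```

Running such a check per layer would show at which layer the two paths first diverge, which narrows the search within the command buffer.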

The GPU CMD2 pipeline (o_proj → residual_add → rms_norm → routing gate matvec) passes incorrect data through `buf_input` to the routing gate. The CPU dequant and CPU attention produce correct results, but somewhere in the fused GPU command buffer (8-12 encoders per CMD2), the buffer contents get corrupted.

This likely affects the original Qwen3.5-397B model on M4 Pro too (issue reporter's hardware), as the same GPU pipeline code is used.

The fix needs to audit every...

userFRM • Mar 23, 2026

Correction to my previous comment: the 8-bit gate issue may be specific to Qwen3-Coder-Next, not Qwen3.5-397B. For the 397B model, the gate weight `[512, 512]` U32 at 4-bit gives `in_dim = 512*8 = 4096 = hidden_size` — dimensionally correct. The 397B quantization config may not have per-tensor 8-bit overrides.
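
The dimensional argument above is simple arithmetic on the packed shape. A minimal sketch, with a hypothetical helper name, assuming each `uint32` column holds `32 / bits` quantized values:

```python
def unpacked_in_dim(packed_cols, bits, word_bits=32):
    """Number of logical input features stored in packed_cols uint32 columns."""
    if word_bits % bits != 0:
        raise ValueError("bits must divide the word size")
    return packed_cols * (word_bits // bits)


print(unpacked_in_dim(512, 4))  # 4096, equal to hidden_size for the 397B model
print(unpacked_in_dim(512, 8))  # 2048, which would not equal a hidden_size of 4096
```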

The M4 Pro gibberish reported here likely has a different root cause — possibly M4-specific Metal GPU behavior (different SIMD group scheduling, threadgroup memory semantics, or register pressure behavior vs M3 Max).

Would be useful to know: does adding `--cpu-linear` (CPU delta-net fallback) change the output quality? That would isolate whether the issue is in the GPU linear attention kernels.

userFRM • Mar 23, 2026

Investigated this. The root cause is likely **mixed-precision quantization** in the MLX 4-bit model.

The MLX quantization config in `config.json` specifies per-tensor overrides:

```json
"quantization": {
"group_size": 64, "bits": 4, "mode": "affine",
"model.layers.0.mlp.gate": {"group_size": 64, "bits": 8},
"model.layers.0.mlp.shared_expert_gate": {"group_size": 64, "bits": 8},
...
}
```

**Every `mlp.gate` (routing) and `mlp.shared_expert_gate` tensor is 8-bit, not 4-bit.** The inference engine treats all tensors uniformly as 4-bit, extracting 8 nibbles per uint32 — but these gate tensors pack 4 bytes per uint32 (8-bit). This corrupts the routing gate scores, selecting wrong experts every layer, producing garbage output.

Verification: gate weight shape is `[512, 512]` U32. At 4-bit that implies `in_dim = 512 * 8 = 4096`, but `hidden_size = 4096` for Qwen3.5-397B so it happens to work dimensionally. However the dequantized values are wrong because the nibble extracti...
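
The difference between the two unpacking schemes can be made concrete with a short sketch. The helper names are hypothetical, the config dictionary is trimmed to the entries quoted above, and the low-to-high field order within each `uint32` is an assumption about the packing layout. Whether a given checkpoint actually carries such overrides has to be checked in its own `config.json`.

```python
def bits_for_tensor(quant_config, tensor_name):
    """Return the per-tensor bit width, falling back to the global default."""
    override = quant_config.get(tensor_name)
    if isinstance(override, dict) and "bits" in override:
        return override["bits"]
    return quant_config["bits"]


def unpack_word(word, bits):
    """Split one uint32 into 32 // bits unsigned fields, lowest bits first.

    The low-to-high field order is an assumption about the packing layout.
    """
    mask = (1 << bits) - 1
    return [(word >> (bits * i)) & mask for i in range(32 // bits)]


quant = {
    "group_size": 64, "bits": 4, "mode": "affine",
    "model.layers.0.mlp.gate": {"group_size": 64, "bits": 8},
}
word = 0x12345678
gate_bits = bits_for_tensor(quant, "model.layers.0.mlp.gate")
print(unpack_word(word, gate_bits))  # [120, 86, 52, 18]
print(unpack_word(word, 4))          # [8, 7, 6, 5, 4, 3, 2, 1]
```

The same 32 bits yield four values in one reading and eight in the other, so a decoder that applies the global default to an overridden tensor produces twice as many values, none of which correspond to the stored weights.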

tamastoth-byborg • Mar 23, 2026

https://github.com/tamastoth-byborg/flash-moe/commit/203c78397e90954cc88a52bf1181839587dcd01b#diff-7d450f8500f4f66c2601cd6c2a73aff6aadd1b041a53c4e0b2ac8f9a7701e7e4R19 - try this generator, after adding the bpe decoding as well it produced a nice response with --token 1000:

Run on a MacBook Pro with M3 Pro 36GB; while running, it used 6GB+ of RAM and streamed from SSD at 2.8GB/s.

tamastoth-byborg • Mar 23, 2026

Claude generated this one that works: https://github.com/tamastoth-byborg/flash-moe/commit/203c78397e90954cc88a52bf1181839587dcd01b#diff-4a3ca27fc198ca94f12561bf3591ef735cb0e8e5e98dad2f0f0e884ee6637a7a

ccckblaze • Mar 23, 2026

https://github.com/danveloper/flash-moe/pull/1
Related to the vocab issues.

existeundelta • Mar 22, 2026

Discovery Source

GitHub Open Source

Aggregated via automated community intelligence tracking.

Tech Stack Dependencies

No direct open-source NPM package mentions detected in the product documentation.

Media Tractions & Mentions

Flash-Moe: Running a 397B Parameter Model on a Mac with 48GB RAM
Github.com • Mar 22, 2026

Deep Research & Science

No direct peer-reviewed scientific literature matched with this product's architecture.