Product Positioning & Context
AI Executive Synthesis
Ensuring the availability and correct generation of the `vocab.bin` file, which maps token IDs to strings, by providing a robust Python script that searches common locations and Hugging Face caches for `tokenizer.json`.
The `vocab.bin` file, crucial for the C decoder's token-to-string mapping, is frequently missing, causing deployment issues for Flash-MoE. The provided Python script `export_vocab.py` addresses this by searching common locations and Hugging Face caches for `tokenizer.json` to generate the binary `vocab.bin`. This highlights a common developer pain point in LLM deployment: managing and generating auxiliary model files. For B2B SaaS, robust tooling for asset generation and discovery is critical. Relying on manual steps or implicit file locations introduces friction and errors. Automating this process, as attempted here, improves developer experience and reduces deployment overhead, ensuring models are runnable out-of-the-box.
Running a big model on a small laptop
Related Ecosystem & Alternatives
Discover adjacent products, open-source repositories, and developer tools sharing similar technical architecture.
Deep-Dive FAQs
What is danveloper/flash-moe?
danveloper/flash-moe is analyzed by our AI as: Ensuring the availability and correct generation of the `vocab.bin` file, which maps token IDs to strings, by providing a robust Python script that searches common locations and Hugging Face caches for `tokenizer.json`.. It focuses on The `vocab.bin` file, crucial for the C decoder's token-to-string mapping, is frequently missing, causing deployment issues for Flash-MoE. The prov...
Where did danveloper/flash-moe originate?
Data for danveloper/flash-moe was aggregated directly from the GitHub Open Source community ecosystem, representing raw developer and early-adopter sentiment.
When was danveloper/flash-moe publicly launched?
The initial public indexing or launch date for danveloper/flash-moe within our tracked developer communities was recorded on March 18, 2026.
How popular is danveloper/flash-moe?
danveloper/flash-moe has achieved measurable traction, logging over 2,695 traction score and facilitating 274 recorded discussions or engagements.
Are there active development issues for danveloper/flash-moe?
Yes, we are currently tracking open architectural debates and bug reports for this project on GitHub. There are currently 5 active high-priority issues logged recently.
Is danveloper/flash-moe recognized by media or academic researchers?
Yes. It has been covered by media outlets like Github.com. This indicates the concept has reached a level of mainstream or scientific viability beyond just developer forums.
What are some commercial alternatives to danveloper/flash-moe?
Our semantic intelligence engine identifies potential commercial alternatives in the SaaS space, such as Databerry, which offers overlapping value propositions.
How does the creator describe danveloper/flash-moe?
The original author or development team describes the product as follows: "Running a big model on a small laptop"
Active Developer Issues (GitHub)
Logged: Apr 1, 2026
Logged: Mar 30, 2026
Logged: Mar 28, 2026
Logged: Mar 22, 2026
Logged: Mar 22, 2026
Community Voice & Feedback
Thank you for sharing this, I got this working on my 36GB M3 Max macbook pro following your instructions 👏
Thanks for taking the time to write the detailed instructions.
A couple of things:
- Cleanup command `find ~/qwen35-397b-4bit -maxdepth 1 ! -name packed_experts ! -name . -exec rm -rf {} +` deleted the entire model directory on vanilla MacOS zsh ... one hour to redo the whole process.
I think it is missing a `-mindepth 1` to prevent the deletion of the parent directory.
- `expert_index.json` from this repo has model path hardcoded, so you might want to add an instruction to update the path.
- Maybe it would be better to export the model path in an environment variable, so it would be easier to copy & paste commands.
For performance figures: MBP Pro 16 with M4 Max/48GB/1TB, I'm getting around `5.5 tok/s`
A couple of things:
- Cleanup command `find ~/qwen35-397b-4bit -maxdepth 1 ! -name packed_experts ! -name . -exec rm -rf {} +` deleted the entire model directory on vanilla MacOS zsh ... one hour to redo the whole process.
I think it is missing a `-mindepth 1` to prevent the deletion of the parent directory.
- `expert_index.json` from this repo has model path hardcoded, so you might want to add an instruction to update the path.
- Maybe it would be better to export the model path in an environment variable, so it would be easier to copy & paste commands.
For performance figures: MBP Pro 16 with M4 Max/48GB/1TB, I'm getting around `5.5 tok/s`
Thanks! Got it running on a MBP M4 Pro 48GB at 3.1 tok/s.
Great experiment and write up. Wanted to ask, can this method be adopted for other small models in the 80-100B parameters to run on MacBook Airs too?
This helped me a ton! Managed to get it running, and wanted to add to the numbers:
## Performance Notes
### Expected Performance by Hardware
| Machine | RAM | Bandwidth | Expected tok/s |
|---------|-----|-----------|---------------|
| M3 Max (reference) | 48 GB | ~400 GB/s | 4.4 |
| M4 Max | 64 GB | ~546 GB/s | 5.0-5.5+ |
| M1 Max | 64 GB | ~400 GB/s | 2.4-2.9+ |
Tested on a MBP 16" fully loaded M1 Max series
## Performance Notes
### Expected Performance by Hardware
| Machine | RAM | Bandwidth | Expected tok/s |
|---------|-----|-----------|---------------|
| M3 Max (reference) | 48 GB | ~400 GB/s | 4.4 |
| M4 Max | 64 GB | ~546 GB/s | 5.0-5.5+ |
| M1 Max | 64 GB | ~400 GB/s | 2.4-2.9+ |
Tested on a MBP 16" fully loaded M1 Max series
Definitive finding: **the bug is in the GPU CMD2 pipeline, not in individual kernels or weight loading.**
Proof: with `g_metal = NULL` (forcing full CPU computation), the model produces **coherent text** with correct routing scores matching the MLX reference. With GPU enabled, the same model produces gibberish.
Tested on Qwen3-Coder-Next-4bit (M1 Pro, 32GB):
- **CPU path**: "Hello:\n\nI have a table with a field..." — coherent, gate scores match MLX
- **GPU path**: gibberish, gate scores have wrong magnitudes and signs
The GPU CMD2 pipeline (o_proj → residual_add → rms_norm → routing gate matvec) passes incorrect data through `buf_input` to the routing gate. The CPU dequant and CPU attention produce correct results, but somewhere in the fused GPU command buffer (8-12 encoders per CMD2), the buffer contents get corrupted.
This likely affects the original Qwen3.5-397B model on M4 Pro too (issue reporter's hardware), as the same GPU pipeline code is used.
The fix needs to audit every...
Proof: with `g_metal = NULL` (forcing full CPU computation), the model produces **coherent text** with correct routing scores matching the MLX reference. With GPU enabled, the same model produces gibberish.
Tested on Qwen3-Coder-Next-4bit (M1 Pro, 32GB):
- **CPU path**: "Hello:\n\nI have a table with a field..." — coherent, gate scores match MLX
- **GPU path**: gibberish, gate scores have wrong magnitudes and signs
The GPU CMD2 pipeline (o_proj → residual_add → rms_norm → routing gate matvec) passes incorrect data through `buf_input` to the routing gate. The CPU dequant and CPU attention produce correct results, but somewhere in the fused GPU command buffer (8-12 encoders per CMD2), the buffer contents get corrupted.
This likely affects the original Qwen3.5-397B model on M4 Pro too (issue reporter's hardware), as the same GPU pipeline code is used.
The fix needs to audit every...
Correction to my previous comment: the 8-bit gate issue may be specific to Qwen3-Coder-Next, not Qwen3.5-397B. For the 397B model, the gate weight `[512, 512]` U32 at 4-bit gives `in_dim = 512*8 = 4096 = hidden_size` — dimensionally correct. The 397B quantization config may not have per-tensor 8-bit overrides.
The M4 Pro gibberish reported here likely has a different root cause — possibly M4-specific Metal GPU behavior (different SIMD group scheduling, threadgroup memory semantics, or register pressure behavior vs M3 Max).
Would be useful to know: does adding `--cpu-linear` (CPU delta-net fallback) change the output quality? That would isolate whether the issue is in the GPU linear attention kernels.
The M4 Pro gibberish reported here likely has a different root cause — possibly M4-specific Metal GPU behavior (different SIMD group scheduling, threadgroup memory semantics, or register pressure behavior vs M3 Max).
Would be useful to know: does adding `--cpu-linear` (CPU delta-net fallback) change the output quality? That would isolate whether the issue is in the GPU linear attention kernels.
Investigated this. The root cause is likely **mixed-precision quantization** in the MLX 4-bit model.
The MLX quantization config in `config.json` specifies per-tensor overrides:
```json
"quantization": {
"group_size": 64, "bits": 4, "mode": "affine",
"model.layers.0.mlp.gate": {"group_size": 64, "bits": 8},
"model.layers.0.mlp.shared_expert_gate": {"group_size": 64, "bits": 8},
...
}
```
**Every `mlp.gate` (routing) and `mlp.shared_expert_gate` tensor is 8-bit, not 4-bit.** The inference engine treats all tensors uniformly as 4-bit, extracting 8 nibbles per uint32 — but these gate tensors pack 4 bytes per uint32 (8-bit). This corrupts the routing gate scores, selecting wrong experts every layer, producing garbage output.
Verification: gate weight shape is `[512, 512]` U32. At 4-bit that implies `in_dim = 512 * 8 = 4096`, but `hidden_size = 4096` for Qwen3.5-397B so it happens to work dimensionally. However the dequantized values are wrong because the nibble extracti...
The MLX quantization config in `config.json` specifies per-tensor overrides:
```json
"quantization": {
"group_size": 64, "bits": 4, "mode": "affine",
"model.layers.0.mlp.gate": {"group_size": 64, "bits": 8},
"model.layers.0.mlp.shared_expert_gate": {"group_size": 64, "bits": 8},
...
}
```
**Every `mlp.gate` (routing) and `mlp.shared_expert_gate` tensor is 8-bit, not 4-bit.** The inference engine treats all tensors uniformly as 4-bit, extracting 8 nibbles per uint32 — but these gate tensors pack 4 bytes per uint32 (8-bit). This corrupts the routing gate scores, selecting wrong experts every layer, producing garbage output.
Verification: gate weight shape is `[512, 512]` U32. At 4-bit that implies `in_dim = 512 * 8 = 4096`, but `hidden_size = 4096` for Qwen3.5-397B so it happens to work dimensionally. However the dequantized values are wrong because the nibble extracti...
https://github.com/tamastoth-byborg/flash-moe/commit/203c78397e90954cc88a52bf1181839587dcd01b#diff-7d450f8500f4f66c2601cd6c2a73aff6aadd1b041a53c4e0b2ac8f9a7701e7e4R19 - try this generator, after adding the bpe decoding as well it produced a nice response with --token 1000:
Run on Macbook Pro with M3 Pro 36GB; while it was running it used 6GB+ RAM and was streaming from SSD with 2.8GB/s.
Run on Macbook Pro with M3 Pro 36GB; while it was running it used 6GB+ RAM and was streaming from SSD with 2.8GB/s.
Claude generated this one that works: https://github.com/tamastoth-byborg/flash-moe/commit/203c78397e90954cc88a52bf1181839587dcd01b#diff-4a3ca27fc198ca94f12561bf3591ef735cb0e8e5e98dad2f0f0e884ee6637a7a
https://github.com/danveloper/flash-moe/pull/1
vocab issues related
vocab issues related
+1
Discovery Source
GitHub Open Source Aggregated via automated community intelligence tracking.
Tech Stack Dependencies
No direct open-source NPM package mentions detected in the product documentation.
Media Tractions & Mentions
Deep Research & Science
No direct peer-reviewed scientific literature matched with this product's architecture.
SaaS Metrics