Product Positioning & Context
AI Executive Synthesis
Ensuring the availability and correct generation of `vocab.bin`, the file that maps token IDs to strings, via a Python script that searches common locations and Hugging Face caches for `tokenizer.json`.
The `vocab.bin` file, crucial for the C decoder's token-to-string mapping, is frequently missing, causing deployment issues for Flash-MoE. The provided Python script `export_vocab.py` addresses this by searching common locations and Hugging Face caches for `tokenizer.json` to generate the binary `vocab.bin`. This highlights a common developer pain point in LLM deployment: managing and generating auxiliary model files. For B2B SaaS, robust tooling for asset generation and discovery is critical. Relying on manual steps or implicit file locations introduces friction and errors. Automating this process, as attempted here, improves developer experience and reduces deployment overhead, ensuring models are runnable out-of-the-box.
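The discovery step described above can be sketched in a few lines. This is a minimal illustration of the search order only, not the actual `export_vocab.py`; the function name, the `extra_dirs` parameter, and the exact candidate list are assumptions, though the Hugging Face hub cache layout (`~/.cache/huggingface/hub/models--org--name/snapshots/<rev>/`) is the standard one.

```python
from pathlib import Path

def find_tokenizer_json(model_name: str, extra_dirs=()):
    """Search user-supplied dirs, the working directory, and the Hugging Face
    hub cache for tokenizer.json. Sketch only; the real export_vocab.py may
    use a different search order."""
    candidates = [Path(d) / "tokenizer.json" for d in extra_dirs]
    candidates.append(Path.cwd() / "tokenizer.json")
    # HF hub cache layout: models--<org>--<name>/snapshots/<revision>/
    hub = Path.home() / ".cache" / "huggingface" / "hub"
    repo_dir = hub / ("models--" + model_name.replace("/", "--"))
    if repo_dir.is_dir():
        candidates.extend(sorted(repo_dir.glob("snapshots/*/tokenizer.json")))
    for c in candidates:
        if c.is_file():
            return c
    return None
```

From the found `tokenizer.json`, the vocabulary can then be serialized into the binary `vocab.bin` format the C decoder expects.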
Running a big model on a small laptop
Active Developer Issues (GitHub)
Logged: Apr 1, 2026
Logged: Mar 30, 2026
Logged: Mar 28, 2026
Logged: Mar 22, 2026
Logged: Mar 22, 2026
Community Voice & Feedback
Definitive finding: **the bug is in the GPU CMD2 pipeline, not in individual kernels or weight loading.**
Proof: with `g_metal = NULL` (forcing full CPU computation), the model produces **coherent text** with correct routing scores matching the MLX reference. With GPU enabled, the same model produces gibberish.
Tested on Qwen3-Coder-Next-4bit (M1 Pro, 32GB):
- **CPU path**: "Hello:\n\nI have a table with a field..." — coherent, gate scores match MLX
- **GPU path**: gibberish, gate scores have wrong magnitudes and signs
The GPU CMD2 pipeline (o_proj → residual_add → rms_norm → routing gate matvec) passes incorrect data through `buf_input` to the routing gate. The CPU dequant and CPU attention produce correct results, but somewhere in the fused GPU command buffer (8-12 encoders per CMD2), the buffer contents get corrupted.
This likely affects the original Qwen3.5-397B model on M4 Pro too (issue reporter's hardware), as the same GPU pipeline code is used.
The fix needs to audit every...
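One way to structure that audit is a stage-by-stage parity check against the known-good CPU path: feed each stage the trusted reference input on both backends so the first mismatch pinpoints the broken encoder rather than accumulated error. A minimal sketch, with hypothetical `run_ref`/`run_gpu` hooks and plain Python lists standing in for Metal buffers:

```python
def first_divergent_stage(stages, x, run_ref, run_gpu, atol=1e-3):
    """Return the name of the first pipeline stage whose GPU output
    diverges from the CPU reference, or None if all stages match.
    run_ref/run_gpu: (stage_name, input_vector) -> output_vector."""
    ref_in = list(x)
    for name in stages:
        ref_out = run_ref(name, ref_in)
        gpu_out = run_gpu(name, ref_in)  # same input isolates this stage
        if any(abs(a - b) > atol for a, b in zip(ref_out, gpu_out)):
            return name
        ref_in = ref_out
    return None

# Example with fake backends: the "GPU" corrupts the rms_norm stage.
ref = lambda name, v: [t + 1.0 for t in v]
gpu = lambda name, v: [t + 1.0 if name != "rms_norm" else t * 100.0 for t in v]
stages = ["o_proj", "residual_add", "rms_norm", "gate_matvec"]
```

In the real decoder this would mean reading `buf_input` back to host memory after each encoder in the CMD2 command buffer and comparing it against the CPU computation of the same step.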
Correction to my previous comment: the 8-bit gate issue may be specific to Qwen3-Coder-Next, not Qwen3.5-397B. For the 397B model, the gate weight `[512, 512]` U32 at 4-bit gives `in_dim = 512*8 = 4096 = hidden_size` — dimensionally correct. The 397B quantization config may not have per-tensor 8-bit overrides.
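The packing arithmetic behind that correction is easy to sanity-check: each uint32 of packed storage holds `32 // bits` quantized values, so the logical input dimension is the packed column count times that factor.

```python
def unpacked_in_dim(packed_cols: int, bits: int) -> int:
    """Logical weight columns recovered from uint32-packed storage:
    each uint32 holds 32 // bits values (8 at 4-bit, 4 at 8-bit)."""
    assert 32 % bits == 0
    return packed_cols * (32 // bits)

# Gate weight stored as [512, 512] uint32:
# read as 4-bit -> 512 * 8 = 4096 columns (matches hidden_size = 4096)
# read as 8-bit -> 512 * 4 = 2048 columns (dimension mismatch)
```

This is why a uniform 4-bit reader happens to be dimensionally consistent for the 397B gate even when the interpretation is wrong.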
The M4 Pro gibberish reported here likely has a different root cause — possibly M4-specific Metal GPU behavior (different SIMD group scheduling, threadgroup memory semantics, or register pressure behavior vs M3 Max).
Would be useful to know: does adding `--cpu-linear` (CPU delta-net fallback) change the output quality? That would isolate whether the issue is in the GPU linear attention kernels.
Investigated this. The root cause is likely **mixed-precision quantization** in the MLX 4-bit model.
The MLX quantization config in `config.json` specifies per-tensor overrides:
```json
"quantization": {
"group_size": 64, "bits": 4, "mode": "affine",
"model.layers.0.mlp.gate": {"group_size": 64, "bits": 8},
"model.layers.0.mlp.shared_expert_gate": {"group_size": 64, "bits": 8},
...
}
```
**Every `mlp.gate` (routing) and `mlp.shared_expert_gate` tensor is 8-bit, not 4-bit.** The inference engine treats all tensors uniformly as 4-bit, extracting 8 nibbles per uint32 — but these gate tensors pack 4 bytes per uint32 (8-bit). This corrupts the routing gate scores, selecting wrong experts every layer, producing garbage output.
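An engine loading such a checkpoint has to consult these per-tensor overrides instead of assuming the top-level `bits` everywhere. A minimal sketch of the lookup (hypothetical helper name; the config shape follows the MLX-style `quantization` dict shown above):

```python
def tensor_bits(quant_cfg: dict, tensor_name: str) -> int:
    """Bit-width for one tensor, honoring per-tensor overrides in an
    MLX-style quantization config; falls back to the global default."""
    override = quant_cfg.get(tensor_name)
    if isinstance(override, dict) and "bits" in override:
        return override["bits"]
    return quant_cfg["bits"]

cfg = {
    "group_size": 64, "bits": 4, "mode": "affine",
    "model.layers.0.mlp.gate": {"group_size": 64, "bits": 8},
}
tensor_bits(cfg, "model.layers.0.mlp.gate")     # 8
tensor_bits(cfg, "model.layers.0.mlp.up_proj")  # 4
```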
Verification: gate weight shape is `[512, 512]` U32. At 4-bit that implies `in_dim = 512 * 8 = 4096`, but `hidden_size = 4096` for Qwen3.5-397B so it happens to work dimensionally. However the dequantized values are wrong because the nibble extracti...
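The corruption itself can be reproduced in a few lines: pack values as 8-bit, then decode the same word the way a uniform 4-bit loader would. Affine scale/bias are omitted for clarity, and the bit order is shown low-bits-first for illustration (MLX's actual layout may differ), but the point stands either way: the nibble stream bears no relation to the original bytes.

```python
def pack_u32(values, bits):
    """Pack 32 // bits small unsigned ints into one uint32, low bits first."""
    per = 32 // bits
    assert len(values) == per and all(0 <= v < (1 << bits) for v in values)
    word = 0
    for i, v in enumerate(values):
        word |= v << (i * bits)
    return word

def unpack_u32(word, bits):
    """Extract 32 // bits values from one uint32, low bits first."""
    mask = (1 << bits) - 1
    return [(word >> (i * bits)) & mask for i in range(32 // bits)]

# Four 8-bit quantized values packed into one uint32 ...
word = pack_u32([200, 17, 5, 131], bits=8)
# ... decoded correctly as bytes:
unpack_u32(word, bits=8)  # [200, 17, 5, 131]
# ... but decoded as eight nibbles, as a uniform 4-bit reader would:
unpack_u32(word, bits=4)  # [8, 12, 1, 1, 5, 0, 3, 8] -- unrelated garbage
```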
https://github.com/tamastoth-byborg/flash-moe/commit/203c78397e90954cc88a52bf1181839587dcd01b#diff-7d450f8500f4f66c2601cd6c2a73aff6aadd1b041a53c4e0b2ac8f9a7701e7e4R19 - try this generator; after adding the BPE decoding as well, it produced a nice response with `--token 1000`:
Run on a MacBook Pro with M3 Pro, 36GB; while running it used 6GB+ of RAM and streamed from SSD at 2.8 GB/s.
https://github.com/danveloper/flash-moe/pull/1
vocab issues related
Related Early-Stage Discoveries
Discovery Source
GitHub (open source), aggregated via automated community intelligence tracking.
Tech Stack Dependencies
No direct open-source NPM package mentions detected in the product documentation.
Media Tractions & Mentions
No mainstream media stories specifically mentioning this product name have been intercepted yet.
Deep Research & Science
No direct peer-reviewed scientific literature matched with this product's architecture.
Market Trends