Gemini Executive Synthesis

Needle – A 26M parameter open-source function-calling (tool use) model, distilled from Gemini, designed for efficient execution on consumer devices using a Simple Attention Networks architecture.

Technical Positioning
The launch post, "We Distilled Gemini Tool Calling into a 26M Model," positions Needle as a lightweight, efficient option for agentic models on budget consumer devices. It argues that "massive models are overkill" for tool calling, which it describes as fundamentally "retrieval-and-assembly."
SaaS Insight & Market Implications
Needle targets a gap in on-device AI agents: tool calling on hardware where latency and compute are tight, such as phones and wearables. By distilling Gemini's tool-calling behavior into a 26M parameter model, it challenges the assumption that massive models are necessary for agentic behavior. Its "Simple Attention Networks" architecture drops MLPs entirely and relies on attention and gating, on the premise that tool calling is retrieval-and-assembly; the authors report 6000 tok/s prefill and 1200 tok/s decode on consumer devices. The scope is narrow, though: Needle is described as an experimental run for single-shot function calling, and the authors note that the larger models it beats on that task have more capacity and excel in conversational settings. If the results hold on developers' own tools, the approach could lower the cost of building agentic features for edge applications.
Proprietary Technical Taxonomy
Needle, 26M parameter, function-calling (tool use) model, distilled Gemini, 6000 tok/s prefill, 1200 tok/s decode, consumer devices, agentic models

Raw Developer Origin & Technical Request

Hacker News • May 13, 2026
Show HN: Needle: We Distilled Gemini Tool Calling into a 26M Model

Hey HN, Henry here from Cactus. We open-sourced Needle, a 26M parameter function-calling (tool use) model. It runs at 6000 tok/s prefill and 1200 tok/s decode on consumer devices.

We were always frustrated by the little effort made towards building agentic models that run on budget phones, so we conducted investigations that led to an observation: agentic experiences are built upon tool calling, and massive models are overkill for it. Tool calling is fundamentally retrieval-and-assembly (match query to tool name, extract argument values, emit JSON), not reasoning. Cross-attention is the right primitive for this, and FFN parameters are wasted at this scale.

Simple Attention Networks: the entire model is just attention and gating, no MLPs anywhere. Needle is an experimental run for single-shot function calling for consumer devices (phones, watches, glasses...).

Training:
- Pretrained on 200B tokens across 16 TPU v6e (27 hours)
- Post-trained on 2B tokens of synthesized function-calling data (45 minutes)
- Dataset synthesized via Gemini with 15 tool categories (timers, messaging, navigation, smart home, etc.)

You can test it right now and finetune on your Mac/PC: github.com/cactus-compute/ne...

Full writeup on the architecture is here: github.com/cactus-compute/ne...

We found that the "no FFN" finding generalizes beyond function calling to any task where the model has access to external structured knowledge (RAG, tool use, retrieval-augmented generation). The model doesn't need to memorize facts in FFN weights if the facts are provided in the input. Experimental results to be published.

While it beats FunctionGemma-270M, Qwen-0.6B, Granite-350M, LFM2.5-350M on single-shot function calling, those models have more scope/capacity and excel in conversational settings. We encourage you to test on your own tools via the playground and finetune accordingly.

This is part of our broader work on Cactus (github.com/cactus-compute/ca...), an inference engine built from scratch for mobile, wearables and custom hardware. We wrote about Cactus here previously: news.ycombinator.com/item [...] is MIT licensed.

Weights: huggingface.co/Cactus-Compute/ne...
GitHub: github.com/cactus-compute/ne...
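
The retrieval-and-assembly framing in the post (match query to tool name, extract argument values, emit JSON) can be illustrated with a toy, non-neural sketch. This is not Needle's implementation: Needle learns these steps with attention, whereas the word-overlap matcher, the regex extractors, the `set_timer` tool, and the `description` field below are all assumptions made for illustration. The schema shape follows the `get_weather` example quoted in the comments.

```python
import json
import re

# Hypothetical tool list; the name/parameters shape mirrors the example in the thread.
TOOLS = [
    {"name": "get_weather",
     "description": "weather forecast for a location",
     "parameters": {"location": "string"}},
    {"name": "set_timer",
     "description": "start a countdown timer for some minutes",
     "parameters": {"minutes": "integer"}},
]


def words(text):
    """Lowercased alphabetic words of a string, as a set."""
    return set(re.findall(r"[a-z]+", text.lower()))


def match_tool(query, tools):
    """Retrieval: choose the tool sharing the most words with the query."""
    q = words(query)

    def overlap(tool):
        label = tool["name"].replace("_", " ") + " " + tool["description"]
        return len(q & words(label))

    return max(tools, key=overlap)


def extract_args(query, tool):
    """Assembly: copy argument values out of the query using toy patterns."""
    args = {}
    for name, kind in tool["parameters"].items():
        if kind == "integer":
            m = re.search(r"\d+", query)
            if m:
                args[name] = int(m.group())
        else:
            # Assumed heuristic: a capitalized phrase after "in", "at" or "to".
            m = re.search(r"\b(?:in|at|to)\s+([A-Z][\w ]*)", query)
            if m:
                args[name] = m.group(1).strip()
    return args


def call(query, tools=TOOLS):
    """Emit the function call as a JSON string."""
    tool = match_tool(query, tools)
    return json.dumps({"name": tool["name"], "arguments": extract_args(query, tool)})


print(call("What is the weather in San Francisco"))
print(call("Set a timer for 10 minutes"))
```

Hand-written rules like these tend to break on paraphrase and ambiguity, which is the part a learned model is meant to handle; the sketch only shows that no step requires recalling facts that are absent from the input.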

Developer Debate & Comments

halyconWays • May 13, 2026
I assume this would only be useful as the second stage after a model like Whisper, as it can't understand speech where you'd want it, like on a phone or small device?
kgeist • May 13, 2026
> Experiments at Cactus showed that MLPs can be completely dropped from transformer networks, as long as the model relies on external knowledge source.
Heh, what a coincidence, just today one of my students presented research results which also confirmed this. He removed MLP from Qwen and the model still could do transformation tasks on input but lost knowledge.
nl • May 13, 2026
Do you have any examples or data on the discriminatory power of the model for tool use?
The examples are things like "What is the weather in San Francisco", where you are only passed a tool like tools='[{"name":"get_weather","parameters":{"location":"string"}}]'. I had a thing[1] over 10 years ago that could handle this kind of problem using SPARQL and knowledge graphs.
My question is how effective is it at handling ambiguity.
Can I send it something like a text message "lets catch up at coffee tomorrow 10:00" and a command like "save this" and have it choose a "add appointment" action from hundreds (or even tens) of possible tools?
[1] https://github.com/nlothian/Acuitra/wiki/About
brainless • May 12, 2026
Lovely to see the push for tiny models.
I have been building for small (20B or less) models for quite a while. Highly focused/constrained agents, many of them running together in some kind of task orchestration mode to achieve what feels like one "agent".
I build (privacy first) desktop apps this way and I want to get into mobile apps with similar ideas but tiny models.
exabrial • May 12, 2026
Dumb questions, from someone not in the field...
What is a distilled model?
Why doesn't Google do this (to make their models smaller)?
Seems like you could make a competitor to Gemini?
tomaskafka • May 12, 2026
Awesome! I just tried to set an alarm and add some groceries to the shopping list, and it outperformed Siri.
kristopolous • May 12, 2026
That M versus B is way too subtle. 0.026B is my suggestion
simonw • May 12, 2026
Suggestion: publish a live demo of the "needle playground". It's small enough that it should be pretty cheap to run this on a little VPS somewhere!
ilaksh • May 12, 2026
Hmm.. this might make it feasible to build something like a command line program where you can optionally just specify the arguments in natural language. Although I know people will object to including an extra 14 MB and the computation for "parsing" and it could be pretty bad if everyone started doing that.
But it's really interesting to me that that may be possible now. You can include a fine-tuned model that understands how to use your program.
E.g. `> toolcli what can you do` runs `toolcli --help summary`, `toolcli add tom to teamfutz group` = `toolcli --gadd teamfutz tom`
simonw • May 12, 2026
Looks like you need to open up access to https://huggingface.co/Cactus-Compute/datasets/needle-tokeni... - I get this error when trying to run the steps in your README:
> Repository Not Found for url: https://huggingface.co/api/datasets/Cactus-Compute/needle-tokenizer/revision/main.

Frequently Asked Questions

Market intelligence mapped to Needle – A 26M parameter open-source function-calling (tool use) model, distilled from Gemini, designed for efficient execution on consumer devices using a Simple Attention Networks architecture.

How is Needle positioned in the market?
Based on our AI analysis of the original developer post, Needle is positioned as a lightweight, efficient option for agentic models on budget consumer devices. The post argues that 'massive models are overkill' for tool calling, which it calls fundamentally 'retrieval-and-assembly.'
Are engineers actively discussing Needle?
Yes, we have tracked 115 direct responses and active debates regarding this specific topic originating from Hacker News.
Which technical concepts are associated with Needle?
Our proprietary extraction maps Needle to adjacent architectural concepts including Needle, 26M parameter, function-calling (tool use) model, distilled Gemini.
Is anyone launching products related to Needle?
Our tracker surfaces one related launch: a product named 'Gemini Robotics ER 1.6', described as Google's SOTA robotics model for visual & spatial reasoning. That description concerns robotics rather than on-device function calling, so the overlap appears to rest mainly on the shared 'Gemini' term.

Engagement Signals

Upvotes: 321
Comments: 115

Cross-Market Term Frequency

Quantifies the cross-market adoption of foundational terms like Gemini and architecture by tracking occurrence frequency across active SaaS architectures and enterprise developer debates.