Show HN: Needle: We Distilled Gemini Tool Calling into a 26M Model

Rating: 4.5 (115 reviews)

Needle distills Gemini tool calling into a 26M-parameter model. It positions itself as a lightweight, efficient option for agentic models on budget consumer devices, arguing that 'massive models are overkill' for tool calling, which it describes as fundamentally 'retrieval-and-assembly.'

Traction Score: 321

Discussions: 115

Launch Date: May 13, 2026

Product Positioning & Context

AI Executive Synthesis

Needle addresses a critical performance and accessibility gap for AI agents on consumer devices. By distilling tool-calling capabilities into a 26M parameter model, it challenges the assumption that massive models are necessary for effective agentic behavior. This product targets developers building applications for resource-constrained environments (phones, wearables), where latency and computational overhead are significant barriers. The "Simple Attention Networks" architecture, optimized for retrieval-and-assembly, represents a focused engineering effort to achieve high throughput with minimal parameters. This approach offers a compelling alternative to larger, more resource-intensive models, potentially democratizing agentic AI development for a broader range of edge computing applications and expanding the market for on-device AI.

Hey HN, Henry here from Cactus. We open-sourced Needle, a 26M parameter function-calling (tool use) model. It runs at 6000 tok/s prefill and 1200 tok/s decode on consumer devices.

We were always frustrated by the little effort made towards building agentic models that run on budget phones, so we conducted investigations that led to an observation: agentic experiences are built upon tool calling, and massive models are overkill for it. Tool calling is fundamentally retrieval-and-assembly (match query to tool name, extract argument values, emit JSON), not reasoning. Cross-attention is the right primitive for this, and FFN parameters are wasted at this scale.

Simple Attention Networks: the entire model is just attention and gating, no MLPs anywhere. Needle is an experimental run for single-shot function calling for consumer devices (phones, watches, glasses...).

Training:
- Pretrained on 200B tokens across 16 TPU v6e (27 hours)
- Post-trained on 2B tokens of synthesized function-calling data (45 minutes)
- Dataset synthesized via Gemini with 15 tool categories (timers, messaging, navigation, smart home, etc.)

You can test it right now and finetune on your Mac/PC: https://github.com/cactus-compute/needle

The full writeup on the architecture is here: https://github.com/cactus-compute/needle/blob/main/docs/simp...

We found that the "no FFN" finding generalizes beyond function calling to any task where the model has access to external structured knowledge (RAG, tool use, retrieval-augmented generation). The model doesn't need to memorize facts in FFN weights if the facts are provided in the input. Experimental results to be published.

While it beats FunctionGemma-270M, Qwen-0.6B, Granite-350M, LFM2.5-350M on single-shot function calling, those models have more scope/capacity and excel in conversational settings. We encourage you to test on your own tools via the playground and finetune accordingly.

This is part of our broader work on Cactus (https://github.com/cactus-compute/cactus), an inference engine built from scratch for mobile, wearables and custom hardware. We wrote about Cactus here previously: https://news.ycombinator.com/item?id=44524544

Everything is MIT licensed.

Weights: https://huggingface.co/Cactus-Compute/needle
GitHub: https://github.com/cactus-compute/needle
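
The three steps the post names (match query to tool name, extract argument values, emit JSON) can be made concrete with a toy, rule-based sketch. This is not how Needle works internally: the post says the model learns these steps with attention and gating, whereas the code below substitutes plain word overlap and regular expressions. The tool list, the `description` field, the `call_tool` helper, and the `{"name": ..., "arguments": ...}` output shape are all assumptions made for illustration.

```python
import json
import re

# Hypothetical tool list; names, descriptions, and parameter types are illustrative only.
TOOLS = [
    {"name": "get_weather", "description": "weather forecast by location",
     "parameters": {"location": "string"}},
    {"name": "set_timer", "description": "start countdown timer",
     "parameters": {"minutes": "number"}},
]

def words(text):
    """Lowercase alphabetic words; underscores in tool names act as separators."""
    return set(re.findall(r"[a-z]+", text.lower()))

def retrieve_tool(query, tools):
    """Step 1 (retrieval): pick the tool sharing the most words with the query."""
    q = words(query)
    return max(tools, key=lambda t: len(q & words(t["name"] + " " + t["description"])))

def extract_arguments(query, tool):
    """Step 2 (extraction): toy rules, one per parameter type."""
    args = {}
    for name, kind in tool["parameters"].items():
        if kind == "number":
            m = re.search(r"\d+", query)          # first integer in the query
            if m:
                args[name] = int(m.group())
        else:
            m = re.search(r"\bin\s+(.+?)[?.!]*$", query)  # text after the word "in"
            if m:
                args[name] = m.group(1)
    return args

def call_tool(query, tools):
    """Step 3 (assembly): emit the call as a JSON string."""
    tool = retrieve_tool(query, tools)
    return json.dumps({"name": tool["name"], "arguments": extract_arguments(query, tool)})

print(call_tool("What is the weather in San Francisco?", TOOLS))
print(call_tool("Set a timer for 10 minutes", TOOLS))
```

The hand-written rules break as soon as phrasing varies, which is the gap a learned model is meant to close; the sketch only shows why the task can be framed as lookup plus copying rather than open-ended reasoning.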


Deep-Dive FAQs

What is Needle: We Distilled Gemini Tool Calling into a 26M Model?

Our AI analysis summarizes Needle as a 26M-parameter model distilled from Gemini tool calling. It positions itself as a lightweight, efficient option for agentic models on budget consumer devices, arguing that 'massive models are overkill' for tool calling, which it describes as fundamentally 'retrieval-and-assembly.' It addresses a performance and accessibility gap for AI agents on consumer devices.

Where did Needle: We Distilled Gemini Tool Calling into a 26M Model originate?

Data for Needle: We Distilled Gemini Tool Calling into a 26M Model was aggregated directly from the Hacker News community ecosystem, representing raw developer and early-adopter sentiment.

When was Needle: We Distilled Gemini Tool Calling into a 26M Model publicly launched?

The initial public indexing or launch date for Needle: We Distilled Gemini Tool Calling into a 26M Model within our tracked developer communities was recorded on May 13, 2026.

How popular is Needle: We Distilled Gemini Tool Calling into a 26M Model?

Needle: We Distilled Gemini Tool Calling into a 26M Model has a traction score of 321 and 115 recorded discussions.

Which technical categories define Needle: We Distilled Gemini Tool Calling into a 26M Model?

Based on metadata extraction, Needle: We Distilled Gemini Tool Calling into a 26M Model is categorized under topics such as: Needle, 26M parameter, function-calling (tool use) model, distilled Gemini.

What are some commercial alternatives to Needle: We Distilled Gemini Tool Calling into a 26M Model?

Our semantic intelligence engine identifies potential commercial alternatives in the SaaS space, such as Gemini Robotics ER 1.6, which offers overlapping value propositions.

How does the creator describe Needle: We Distilled Gemini Tool Calling into a 26M Model?

The original author or development team describes the product as follows: "Hey HN, Henry here from Cactus. We open-sourced Needle, a 26M parameter function-calling (tool use) model. It runs at 6000 tok/s prefill and 1200 tok/s decode on consumer devices. We were always fru..."

Community Voice & Feedback

halyconWays • May 13, 2026

I assume this would only be useful as the second stage after a model like Whisper, as it can't understand speech where you'd want it, like on a phone or small device?

kgeist • May 13, 2026

> Experiments at Cactus showed that MLPs can be completely dropped from transformer networks, as long as the model relies on external knowledge source.

Heh, what a coincidence, just today one of my students presented research results which also confirmed this. He removed MLP from Qwen and the model still could do transformation tasks on input but lost knowledge.

nl • May 13, 2026

Do you have any examples or data on the discriminatory power of the model for tool use?

The examples are things like "What is the weather in San Francisco", where you are only passed a tool like tools='[{"name":"get_weather","parameters":{"location":"string"}}]',

I had a thing[1] over 10 years ago that could handle this kind of problem using SPARQL and knowledge graphs.

My question is how effective is it at handling ambiguity.

Can I send it something like a text message "lets catch up at coffee tomorrow 10:00" and a command like "save this" and have it choose a "add appointment" action from hundreds (or even tens) of possible tools?

[1] https://github.com/nlothian/Acuitra/wiki/About
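
One way to put numbers on the question above is to check each emitted call against the tool list and count how often the model picks a valid tool with well-typed arguments. The following is a minimal sketch of such a checker, assuming the tool list uses the `tools='[...]'` shape quoted in the comment and assuming the model emits a `{"name": ..., "arguments": {...}}` object. That output shape and the `validate_call` helper are assumptions, not something confirmed by the Needle README.

```python
import json

# Type names assumed from the quoted example ("location": "string"); others are guesses.
TYPE_CHECKS = {
    "string": lambda v: isinstance(v, str),
    "number": lambda v: isinstance(v, (int, float)) and not isinstance(v, bool),
    "boolean": lambda v: isinstance(v, bool),
}

def validate_call(raw_output, tools_json):
    """Return (ok, reason) for one model output checked against the tool list."""
    try:
        call = json.loads(raw_output)
    except json.JSONDecodeError:
        return False, "output is not valid JSON"
    if not isinstance(call, dict) or not isinstance(call.get("name"), str):
        return False, "output is not a tool-call object"
    tools = {t["name"]: t["parameters"] for t in json.loads(tools_json)}
    params = tools.get(call["name"])
    if params is None:
        return False, "unknown tool"
    args = call.get("arguments", {})
    if not isinstance(args, dict) or set(args) != set(params):
        return False, "argument names do not match"
    for key, kind in params.items():
        check = TYPE_CHECKS.get(kind)
        if check is not None and not check(args[key]):
            return False, f"argument {key!r} is not a {kind}"
    return True, "ok"

tools = '[{"name":"get_weather","parameters":{"location":"string"}}]'
print(validate_call('{"name":"get_weather","arguments":{"location":"San Francisco"}}', tools))
print(validate_call('{"name":"add_appointment","arguments":{}}', tools))
```

A structural check like this only catches malformed or hallucinated calls. Measuring whether the model chose the right tool among tens or hundreds still requires labeled queries with an expected tool name for each.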

brainless • May 12, 2026

Lovely to see the push for tiny models.

I have been building for small (20B or less) models for quite a while. Highly focused/constrained agents, many of them running together in some kind of task orchestration mode to achieve what feels like one "agent".

I build (privacy first) desktop apps this way and I want to get into mobile apps with similar ideas but tiny models.

exabrial • May 12, 2026

Dumb questions, from someone not in the field...

What is a distilled model?

Why doesn't Google do this (to make their models smaller)?

Seems like you could make a competitor to Gemini?

tomaskafka • May 12, 2026

Awesome! I just tried to set an alarm and add some groceries to the shopping list, and it outperformed Siri.

kristopolous • May 12, 2026

That M versus B is way too subtle. 0.026B is my suggestion

simonw • May 12, 2026

Suggestion: publish a live demo of the "needle playground". It's small enough that it should be pretty cheap to run this on a little VPS somewhere!

ilaksh • May 12, 2026

Hmm.. this might make it feasible to build something like a command line program where you can optionally just specify the arguments in natural language. Although I know people will object to including an extra 14 MB and the computation for "parsing" and it could be pretty bad if everyone started doing that.

But it's really interesting to me that that may be possible now. You can include a fine-tuned model that understands how to use your program.

E.g. `> toolcli what can you do` runs `toolcli --help summary`, `toolcli add tom to teamfutz group` = `toolcli --gadd teamfutz tom`
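
The idea in this comment can be sketched as a thin layer between the model and the program: the model turns the natural-language request into a structured call, and a fixed table turns that call into an argument vector. In the sketch below, `toolcli` and its flags come from the commenter's hypothetical example, while the tool names (`help_summary`, `group_add`), the `COMMANDS` table, and the `to_argv` helper are invented for illustration. The model call itself is left out, so the JSON strings stand in for its output.

```python
import json
import shlex

# Hypothetical mapping from tool names to argv templates for the imaginary `toolcli`.
COMMANDS = {
    "help_summary": ["toolcli", "--help", "summary"],
    "group_add": ["toolcli", "--gadd", "{group}", "{user}"],
}

def to_argv(raw_call):
    """Turn a model-emitted call (JSON string) into an argument vector."""
    call = json.loads(raw_call)
    template = COMMANDS[call["name"]]          # KeyError for tools outside the table
    args = call.get("arguments", {})
    # Only the fixed templates are format strings; argument values are inserted as data.
    return [part.format(**args) for part in template]

# Stand-ins for what a model might emit for the two requests in the comment.
argv = to_argv('{"name": "group_add", "arguments": {"group": "teamfutz", "user": "tom"}}')
print(shlex.join(argv))
print(shlex.join(to_argv('{"name": "help_summary", "arguments": {}}')))
```

Building an argv list instead of a shell string means a value such as a user name is always passed as a single argument, so the result can go to `subprocess.run(argv)` without shell quoting concerns. Restricting output to the `COMMANDS` table also keeps the model from inventing flags the program does not have.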

simonw • May 12, 2026

Looks like you need to open up access to https://huggingface.co/Cactus-Compute/datasets/needle-tokeni... - I get this error when trying to run the steps in your README:

> Repository Not Found for url: https://huggingface.co/api/datasets/Cactus-Compute/needle-tokenizer/revision/main.

Discovery Source

Hacker News

Aggregated via automated community intelligence tracking.
