Show HN: Forge – Guardrails take an 8B model from 53% to 99% on agentic tasks

Rating: 4.5 (119 reviews)

A framework that significantly improves the reliability and performance of local LLMs on consumer hardware for agentic tasks, outperforming frontier APIs without guardrails and reducing cloud costs. It addresses the "compounding math problem" of multi-step agentic workflows.
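
The "compounding math problem" can be made concrete with a short calculation. The sketch below assumes every step succeeds independently with the same probability, which is a simplification; the function name is illustrative and not part of Forge.

```python
def workflow_success_rate(per_step_accuracy: float, steps: int) -> float:
    """Probability that every step succeeds, assuming independent steps."""
    if not 0.0 <= per_step_accuracy <= 1.0:
        raise ValueError("per_step_accuracy must be between 0 and 1")
    if steps < 0:
        raise ValueError("steps must be non-negative")
    return per_step_accuracy ** steps


if __name__ == "__main__":
    # Show how a fixed 90% per-step accuracy decays as workflows get longer.
    for steps in (1, 5, 10, 20):
        rate = workflow_success_rate(0.90, steps)
        print(f"{steps:>2} steps at 90% per step: {rate:.1%} end-to-end success")
```

With the post's own example of 90% per-step accuracy over five steps, end-to-end success is about 59%, which is the roughly 40% failure rate the author cites.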

Traction Score: 321

Discussions: 119

Launch Date: May 20, 2026

Product Positioning & Context

AI Executive Synthesis

Forge directly addresses the critical reliability gap in self-hosted LLM agentic workflows, a major pain point for developers seeking to reduce cloud costs and leverage local hardware. By boosting an 8B model's performance from 53% to 99% on agentic tasks, Forge demonstrates a significant value proposition: enabling enterprise-grade reliability from commodity hardware. This disrupts the reliance on expensive frontier APIs for many use cases. The focus on guardrails, error recovery, and VRAM-aware context management highlights the maturity of challenges in deploying LLMs. The finding that serving backend significantly impacts accuracy underscores the complexity of LLM infrastructure. Forge positions itself as essential middleware for any B2B SaaS building agentic systems, democratizing advanced AI capabilities and driving down operational expenses.
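
The post describes VRAM-aware context management as querying nvidia-smi at startup and deriving a token budget. A minimal sketch of that idea follows. It is not Forge's implementation: the function names, the 512 MiB headroom, and the budget formula (free memory minus weights minus headroom, divided by KV-cache bytes per token) are assumptions for illustration.

```python
import subprocess


def free_vram_mib(gpu_index: int = 0) -> int:
    """Ask nvidia-smi for free memory on one GPU, in MiB (needs an NVIDIA driver)."""
    out = subprocess.run(
        ["nvidia-smi", f"--id={gpu_index}",
         "--query-gpu=memory.free", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return int(out.strip().splitlines()[0])


def kv_bytes_per_token(n_layers: int, n_kv_heads: int, head_dim: int,
                       bytes_per_value: int = 2) -> int:
    """Bytes for one token's key and value vectors across all layers."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value


def token_budget(free_mib: int, weights_mib: int, per_token_bytes: int,
                 headroom_mib: int = 512) -> int:
    """Tokens of KV cache that fit once weights are loaded; never negative."""
    available_bytes = (free_mib - weights_mib - headroom_mib) * 1024 * 1024
    return max(0, available_bytes // per_token_bytes)
```

As an illustrative example, a model with 32 layers, 8 KV heads, and a head dimension of 128 at 16-bit precision needs 131,072 bytes of cache per token, so 8,192 MiB free with 5,120 MiB of weights leaves room for 20,480 tokens under these assumptions. Capping context at that budget is one way to avoid the silent CPU fallback the post mentions.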

Hi HN, I'm Antoine Zambelli, AI Director at Texas Instruments.

I built Forge, an open-source reliability layer for self-hosted LLM tool-calling.

What it does:
- Adds domain-and-tool-agnostic guardrails (retry nudges, step enforcement, error recovery, VRAM-aware context management) to local models running on consumer hardware
- Takes an 8B model from ~53% to ~99% on multi-step agentic workflows without changing the model - just the system around it
- Ships with an eval harness and interactive dashboard so you can reproduce every number

I wanted to run a handful of always-on agentic systems for my portfolio, didn't want to pay cloud frontier costs, and immediately hit the compounding math problem on local models. 90% per-step accuracy sounds great, but with a 5-step workflow that's roughly a 40% failure rate (1 - 0.9^5 ≈ 0.41). No existing framework seemed to address this mechanical reliability issue - they all seemed tailor-made for cloud frontier.

Demo video: https://youtu.be/MzRgJoJAXGc (side-by-side: same model, same task, with and without Forge guardrails)

The paper (accepted to ACM CAIS '26, presenting May 26-29 in San Jose) covers the peer-reviewed findings across 97 model/backend configurations, 18 scenarios, 50 runs each. Key numbers:
- Ministral 8B with Forge: 99.3%. Claude Sonnet with Forge: 100%. The gap between a free local 8B model on a $600 GPU and a frontier API is less than 1 point.
- The same 8B local model with Forge (99.3%) outperforms Claude Sonnet without guardrails (87.2%) - an 8B model with framework support beats the best result you can get through frontier API alone.
- Error recovery scores 0% for every model tested - local and frontier - without the retry mechanism. Not a capability gap, an architectural absence.

I'm currently using this for my home assistant running on Ministral 14B-Reasoning, and for my locally hosted agentic coding harness (8B managed to contribute to the codebase!).

The guardrail stack has five layers, each independently toggleable. The two that carry the most weight (per ablation study with McNemar's test): retry nudges (24-49 point drops when disabled) and error recovery (~10 point drops, significant for every model tested). Step enforcement is situational - it only fires for models with weaker sequencing discipline. Rescue parsing and context compaction showed no significance in the eval but are retained for production workloads where they activate once in a while.

One thing I really didn't expect: the serving backend matters. Same Mistral-Nemo 12B weights produce 7% accuracy on llama-server with native function calling and 83% on Llamafile in prompt mode. A 75-point swing from infrastructure alone. I don't think anyone's published this because standard benchmarks don't control for serving backend.

Another surprise: there's no distinction in current LLM tool-calling between "the tool ran successfully and returned data" and "the tool ran successfully but found nothing." Both return a value, the orchestrator marks the step complete, and bad data cascades downstream. It's the equivalent of HTTP having 200 but no 404. Forge adds this as a new exception class (ToolResolutionError) - the model sees the error and can retry instead of silently passing garbage forward.

Biggest technical challenge was context compaction for memory-constrained hardware. Both Ollama and Llamafile silently fall back to CPU when the model exceeds VRAM - no warning, no error, just 10-100x slower inference. Forge queries nvidia-smi at startup and derives a token budget to prevent this.

How to try it:
- Clone the repo, run the eval harness on a model I haven't tested. If you get interesting results I'll add them to the dashboard.
- Try the proxy server mode - point any OpenAI-compatible client at Forge and it handles guardrails transparently. It's the newest mode and I'd love more eyes on it.
- Dogfooding led me to optimize model parameters in v0.6.0. The harder eval suite (26 scenarios) is designed to raise the ceiling so no one sits at 100%. Several that did on the original suite can't sweep it - including Opus 4.6. Curious if anyone finds scenarios that expose gaps I haven't thought of. Paper numbers are based on pre-v0.6.0 code.

Background: prior ML publication in unsupervised learning (83 citations). This paper was accepted to ACM CAIS '26 - presenting May 26-29.

Repo: https://github.com/antoinezambelli/forge
Paper: https://www.caisconf.org/program/2026/demos/forge-agentic-re... https://github.com/antoinezambelli/forge/blob/main/docs/forg...
Dashboard: https://github.com/antoinezambelli/forge/docs/results/dashbo...
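
The "HTTP 200 but no 404" point in the post can be sketched as follows. This is a minimal illustration under stated assumptions, not Forge's code: only the exception name ToolResolutionError comes from the post. The `lookup_contact` tool, the `run_step` loop, the `fake_model` stand-in, and the nudge wording are all hypothetical.

```python
from typing import Callable


class ToolResolutionError(Exception):
    """The tool ran without crashing but resolved nothing useful.

    Only the class name comes from the post; everything else is assumed.
    """


# Hypothetical data store and tool, for illustration only.
CONTACTS = {"alice": "alice@example.com"}


def lookup_contact(name: str) -> str:
    email = CONTACTS.get(name.lower())
    if email is None:
        # Without this raise, the tool would return an empty value, the step
        # would be marked complete, and the bad value would flow downstream.
        raise ToolResolutionError(f"no contact named {name!r}")
    return email


def run_step(choose_argument: Callable[[list[str]], str], max_attempts: int = 3) -> str:
    """Call the tool, feeding resolution errors back so the caller can retry."""
    feedback: list[str] = []
    for _ in range(max_attempts):
        argument = choose_argument(feedback)
        try:
            return lookup_contact(argument)
        except ToolResolutionError as exc:
            feedback.append(f"Tool found nothing: {exc}. Try a different argument.")
    raise RuntimeError(f"step failed after {max_attempts} attempts")


def fake_model(feedback: list[str]) -> str:
    """Stand-in for an LLM: guesses wrong first, corrects after feedback."""
    return "alice" if feedback else "alicia"


if __name__ == "__main__":
    print(run_step(fake_model))
```

The design choice worth noting is that "found nothing" is signalled by an exception rather than by a sentinel return value, so an orchestrator cannot mistake it for success.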

Deep-Dive FAQs

Where did Forge – Guardrails take an 8B model from 53% to 99% on agentic tasks originate?

Data for Forge – Guardrails take an 8B model from 53% to 99% on agentic tasks was aggregated directly from the Hacker News community ecosystem, representing raw developer and early-adopter sentiment.

When was Forge – Guardrails take an 8B model from 53% to 99% on agentic tasks publicly launched?

The initial public indexing or launch date for Forge – Guardrails take an 8B model from 53% to 99% on agentic tasks within our tracked developer communities was recorded on May 20, 2026.

How popular is Forge – Guardrails take an 8B model from 53% to 99% on agentic tasks?

Forge – Guardrails take an 8B model from 53% to 99% on agentic tasks has a traction score of 321 and 119 recorded discussions or engagements.

Which technical categories define Forge – Guardrails take an 8B model from 53% to 99% on agentic tasks?

Based on metadata extraction, Forge – Guardrails take an 8B model from 53% to 99% on agentic tasks is categorized under topics such as: open-source reliability layer, self-hosted LLM tool-calling, domain-and-tool-agnostic guardrails, retry nudges.

Is Forge – Guardrails take an 8B model from 53% to 99% on agentic tasks recognized by media or academic researchers?

Yes. It has been covered by media outlets like Github.com. This indicates the concept has reached a level of mainstream or scientific viability beyond just developer forums.

What are some commercial alternatives to Forge – Guardrails take an 8B model from 53% to 99% on agentic tasks?

Our semantic intelligence engine identifies potential commercial alternatives in the SaaS space, such as ElevenAgents Guardrails 2.0, which offers overlapping value propositions.

Community Voice & Feedback

lwansbrough • May 20, 2026

Had a couple thoughts in this realm, and am working them into my own harness. Curious to see what others think. I'm not sure if this is generalizable, as my harness is fairly specialized:
- Breaking down a problem into a planned execution, with executing agent providing the initial plan which includes explicit objectives such as which tools it calls and what it would consider to be a successful execution.
- The harness then executes the plan in order.
- Each step that involves a tool call will be executed by breaking down the tool call into component parts: the harness interrogates the agent for a valid parameter value for the current tool argument. The tool definition contains validators for each argument. If the validator fails, the harness rewinds the conversation and injects the failure reason into the next try.
- Once the agent produces a valid response for the argument, the harness proceeds to the next argument.
- Once all the arguments have been filled, the harness calls the tool. It passes the agent's initial expected value along with the actual value, along with any errors that may have been produced and asks the agent if it is satisfied with the result. If it isn't, the agent provides a reason and the harness then retries the tool call process from the beginning, rewinding the conversation and inserting the reasoning for the retry.
- The agent may request to re-plan if it discovers a flaw in its initial plan. The harness will also attempt to re-plan if the agent produces too many failures in a row.

This proves to be quite effective at reducing tool call failures. One benefit is that the sub-agent gets a perfect conversation history where it makes no mistakes. I'm not sure if it's actually better at completing tasks though, I haven't tried to benchmark it.
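
The per-argument loop described in this comment could look roughly like the sketch below. Everything in it is hypothetical, since the commenter shared no code: `ArgumentSpec`, `fill_arguments`, and the `ask_agent` callback are illustrative names, and "rewinding" is modeled simply by never recording failed attempts in the history.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# A validator returns None when the value is acceptable, else a failure reason.
Validator = Callable[[str], Optional[str]]
# The agent sees the clean history plus an optional hint about its last failure.
Agent = Callable[[list[str], Optional[str]], str]


@dataclass
class ArgumentSpec:
    name: str
    validate: Validator


def fill_arguments(specs: list[ArgumentSpec], ask_agent: Agent,
                   max_tries: int = 3) -> tuple[dict[str, str], list[str]]:
    """Fill tool arguments one at a time, keeping only valid answers in history."""
    history: list[str] = []
    values: dict[str, str] = {}
    for spec in specs:
        hint: Optional[str] = None
        for _ in range(max_tries):
            prompt = history + [f"Provide a value for {spec.name}"]
            answer = ask_agent(prompt, hint)
            hint = spec.validate(answer)
            if hint is None:
                # Only the successful exchange is recorded, so the history
                # the agent sees never contains its own mistakes.
                history.append(f"{spec.name} = {answer}")
                values[spec.name] = answer
                break
        else:
            # Too many failures in a row: signal that a re-plan is needed.
            raise RuntimeError(f"no valid value for {spec.name!r}")
    return values, history
```

A caller would pass one `ArgumentSpec` per tool argument and invoke the tool only after `fill_arguments` returns, which matches the "fill all arguments, then call" order in the comment.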

seemaze • May 20, 2026

> One thing I really didn't expect: the serving backend matters. Same Mistral-Nemo 12B weights produce 7% accuracy on llama-server with native function calling and 83% on Llamafile in prompt mode.

I thought Llamafile was just a model and llama.cpp bundled into a single binary - is this the difference between Llamafile injecting a default system prompt vs hitting the raw llama-server endpoint with no harness?

That seems like comparing apples to apple pie, there's some ingredients missing.

jonnyasmar • May 19, 2026

The tool-call ambiguity point: yeah, I hit that at frontier scale too. Running Claude Code, Codex, and Gemini CLI in parallel for daily dev, the most common failure mode I see is grep/find returning exit code 1 (no matches): the model reads it as "the tool failed" instead of "search ran, here's the negative space," then either bails or retries with slightly different syntax instead of broadening the search.

The retry-nudge layer maps almost 1:1 to what I do manually multiple times an hour: "no, that wasn't a tool failure, the file just doesn't contain that pattern, try X." Encoding it at the framework level is the right shape.

Have you looked at whether these guardrails close the smaller frontier-model gap on long-horizon tasks? My intuition is the 87→99 delta on Sonnet won't quite hold past ~50 steps, where context drift starts dominating more than retry semantics.
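
The exit-code ambiguity described in this comment can be handled before the result reaches the model. The sketch below is an illustration, not taken from Forge or any of the CLIs mentioned: the `SearchResult` type, function names, and message wording are hypothetical. It relies only on the standard grep convention of exit status 0 for matches, 1 for no matches, and greater than 1 for errors.

```python
import subprocess
from dataclasses import dataclass


@dataclass
class SearchResult:
    status: str   # "matches", "no_matches", or "error"
    message: str  # text intended for the model


def interpret_grep(returncode: int, stdout: str, stderr: str) -> SearchResult:
    """Map grep's exit status to an explicit outcome for the model."""
    if returncode == 0:
        return SearchResult("matches", stdout)
    if returncode == 1:
        # Not a failure: the search ran and the pattern is simply absent.
        return SearchResult(
            "no_matches",
            "Search ran successfully and found no matches. "
            "Consider broadening the pattern or searching another path.",
        )
    return SearchResult("error", f"grep failed (exit {returncode}): {stderr.strip()}")


def grep(pattern: str, path: str) -> SearchResult:
    """Run grep recursively and classify the outcome (requires grep on PATH)."""
    proc = subprocess.run(
        ["grep", "-rn", "--", pattern, path],
        capture_output=True, text=True, check=False,
    )
    return interpret_grep(proc.returncode, proc.stdout, proc.stderr)
```

Keeping the classification in a pure function separates the policy (what the model is told) from the subprocess call, so the wording of the nudge can be tested without a shell.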

88j88 • May 19, 2026

I was experimenting with something very similar but got different results that may interest you.

This was part of testing how well a tool of mine works (github.com/jsuppe/loom), which aims to extract requirements and specs and create tests. At first I had no intention of using it for code generation, but then I tried it out with some early success. I tried splitting the work by using the tool with different frontier models, and then handing work to a local ollama instance running one of several models. Not all local models had the same outcome, and not all coding languages had the same outcome. While nailing down the coding tasks, I wanted to set up positive and negative scenarios, which is where I found that setting guardrails can sometimes backfire with inversion. This essentially elaborates on previous work by Khan 2025 (https://arxiv.org/abs/2510.22251); the most interesting finding to me was that if you give guardrails with a rationale, it reduces compliance and may cause the inversion.

For coding tasks I found that the improvement was not only the ability to use a lower-cost model for these broken-down tasks: wall clock time also improved over using a frontier model alone, with equivalent outcomes.

Escapade5160 • May 19, 2026

I've been saying for a while that given a proper harness, small local models can perform incredibly well. When you have a system that can try everything, it will eventually get it right as long as you can prevent it from getting it wrong in the meantime.

_pdp_ • May 19, 2026

Maybe I am reading it wrong, but I don't think this does what it claims it does, or at least how it sounds.

Basically this is a tool auto-complete that has a workflow element to it, with certain steps that need to happen in a certain order. In other words, the order is defined in advance. Am I correct?

Basically: execute step 1 first, then step 2, and finally step 3, and this is the schema for each step. That is effectively the guardrail, and there is retry logic.

If that is the case, this is obviously useful, but only for a very specific set of problems where the solution is kind of known in advance. A workflow automation might work, but this is kind of N8N where each step is an LLM step.

Anyway, I might be wrong but I wanted to share a few thoughts.

azurewraith • May 19, 2026

Interestingly enough we have found the same net result -- structural guardrails are the unlock for smaller models. Our approach in particular layers three things: a parse rescue for malformed/incorrect tool calls (similar to your retry nudges), content-level intervention (diff size rejection, checkpoint forcing) and state machine enforcement on top (per-phase tool restriction, transition guards). On 13B models we saw completion of a selection of SWE-bench tasks go from ~20% to 100%. With frontier models we saw a reduction in API calls from reduced thrashing.

One of the most surprising findings was when a 9B model self-corrected through 4 tool parse failures within the guard rails. It tried to use a complex tool (patch_file), kept failing and eventually downshifted to a simpler tool (edit_line) that it could actually execute. The guardrails didn't make the model smarter, they just narrowed the execution space until it could find something that worked.

Brief: https://statewright.ai/research
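
The "state machine enforcement" layer mentioned in this comment (per-phase tool restriction plus transition guards) could be sketched as below. This is a guess at the general shape, not the commenter's implementation: the phase names, the transition table, and the `PhaseMachine` class are hypothetical, and only the tool names patch_file and edit_line come from the comment.

```python
from dataclasses import dataclass, field

# Hypothetical phases and the tools each phase permits.
ALLOWED_TOOLS = {
    "explore": {"read_file", "search"},
    "edit": {"edit_line", "patch_file"},
    "verify": {"run_tests"},
}
# Hypothetical legal transitions between phases.
TRANSITIONS = {
    "explore": {"edit"},
    "edit": {"verify"},
    "verify": {"edit", "done"},
}


@dataclass
class PhaseMachine:
    phase: str = "explore"
    calls: list[tuple[str, str]] = field(default_factory=list)

    def check_tool(self, tool: str) -> None:
        """Per-phase restriction: reject tools the current phase does not allow."""
        if tool not in ALLOWED_TOOLS.get(self.phase, set()):
            raise PermissionError(f"{tool!r} is not allowed in phase {self.phase!r}")
        self.calls.append((self.phase, tool))

    def advance(self, target: str) -> None:
        """Transition guard: only move along declared edges."""
        if target not in TRANSITIONS.get(self.phase, set()):
            raise ValueError(f"cannot go from {self.phase!r} to {target!r}")
        self.phase = target
```

A harness would call `check_tool` before dispatching each tool call and turn a `PermissionError` into feedback for the model, which is one way the execution space gets narrowed without changing the model.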

6r17 • May 19, 2026

Very cool work! I'm running a harness system myself and could measure a 2x to 10x improvement in token use on gsm8k just by running a math harness. I'm confident the future is bright for people who know how to sell tech that is appropriately scaled to one's needs. We absolutely do not need to run Claude 123 for most tasks, and we'd better prepare for the rug pull!

Aleesha_hacker • May 19, 2026

Impressive work, love seeing tools that boost local LLM reliability without touching the model itself

jf • May 19, 2026

Tangentially related: Since you are at Texas Instruments, I wonder if you could find out what the status is of the intellectual property for the TI Explorer lisp machines. I know who owns the IP for Genera, but wasn’t able to find out about TI’s lisp OS

Discovery Source

Hacker News

Aggregated via automated community intelligence tracking.

Media Traction & Mentions

Show HN: Forge – Guardrails take an 8B model from 53% to 99% on agentic tasks
Github.com • May 19, 2026
