Show HN: OSS Agent I built topped the TerminalBench on Gemini-3-flash-preview


An open-source CLI agent running Gemini-3-flash-preview scored higher on TerminalBench than Google's official result and than Junie CLI, the previous top closed-source entry, underscoring how much the harness matters.

Traction Score: 144

Discussions: 54

Launch Date: Apr 27, 2026

Product Positioning & Context

AI Executive Synthesis

The submission reports an open-source CLI agent scoring 65.2% on TerminalBench with Gemini-3-flash-preview, ahead of Google's official 47.8% and Junie CLI's 64.3%. The author explicitly denies using any cheating mechanism, which matters given recent reports of cheating on TerminalBench 2.0. The central observation is that "the harness matters": the tooling and context management wrapped around a model can move benchmark results as much as the choice of model does. The result also suggests that a well-engineered open-source agent can match or exceed closed-source offerings.

Scored 65.2% vs Google's official 47.8%, and the existing top closed source model Junie CLI's 64.3%.

Since there are a lot of reports of deliberate cheating on TerminalBench 2.0 lately (https://debugml.github.io/cheating-agents/), I would like to also clarify a few things:

1. Absolutely no {agents/skills}.md files were inserted at any point. No cheating mechanisms whatsoever.
2. The CLI agent was run in a leaderboard-compliant way (no modification of resources or timeouts).
3. The full terminal bench run was done using the fully open source version of the agent; there is no difference between what is on GitHub and what was run.

I was originally going to wait for it to land on the leaderboard, but it has been 8 days and the maintainers have unfortunately not responded (there is a large backlog of pull requests on their HF), so I decided to post anyway.

HF PR: https://huggingface.co/datasets/harborframework/terminal-ben...

It is astounding how much the harness matters, based on this and other experiments I have done.

Community Voice & Feedback

sally_glance • Apr 27, 2026

Great job and congrats! Working on my own harness has been one of my favorite side projects in the past couple of weeks, of course I never finish anything... But I'm very interested in your experience with the following:

1. Context management - specifically pruning old tool call responses, truncation of tool output and automatic compaction. Those have worked pretty great for me, benefits of reducing context greatly seem to outweigh gains from "remembering" everything. I always leave short summaries though.
2. "Subagents" - my latest attempts revolve around not exposing any tools for the main agent at all, except for a run_agent tool where the subagent has access to the classic search/execute/fetch tools. My theory is that if subagents return concise summaries this would automatically keep the parent agent context clean for much longer. Still experimenting though, writing prompts for subagents may also be too far outside of the current training sets.
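The pruning and truncation ideas in this comment can be sketched roughly as follows. This is a minimal illustration, not code from Dirac or from the commenter's harness: the `Message` type, the function names, and all thresholds are hypothetical, and the "summary" is just a prefix of the old output, where a real harness would more likely ask the model to summarize.

```python
from dataclasses import dataclass


@dataclass
class Message:
    role: str      # "user", "assistant", or "tool" (hypothetical schema)
    content: str


def truncate_output(text: str, max_chars: int = 200) -> str:
    """Keep the head and tail of a long tool output and drop the middle."""
    if len(text) <= max_chars:
        return text
    half = max_chars // 2
    dropped = len(text) - 2 * half
    return f"{text[:half]}\n[... {dropped} chars truncated ...]\n{text[-half:]}"


def prune_history(messages: list[Message], keep_recent: int = 2,
                  summary_chars: int = 60) -> list[Message]:
    """Replace all but the most recent tool outputs with a short stub."""
    tool_indices = [i for i, m in enumerate(messages) if m.role == "tool"]
    # Guard keep_recent == 0: a [-0:] slice would select every index.
    recent = set(tool_indices[-keep_recent:]) if keep_recent > 0 else set()
    pruned = []
    for i, m in enumerate(messages):
        if m.role == "tool" and i not in recent:
            # Naive stand-in for a summary: the first few characters.
            stub = m.content[:summary_chars].replace("\n", " ")
            pruned.append(Message("tool", f"[pruned tool output] {stub}"))
        else:
            pruned.append(m)
    return pruned
```

A harness might call `truncate_output` when a tool result first arrives and `prune_history` before each model call, so the newest results stay intact while older ones shrink to stubs.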

gobdovan • Apr 27, 2026

Very interesting, especially the harness point, how much of performance is in the wrapper tools (when I almost run out of credits, I change my model to a smaller one and try to give it more structured prompts; very often gpt-5.4-mini with structure works better than gpt-5.4 with vibes).

This inspired me to start a "skill distillery" [0] where I take good agent workflow ideas and turn them into small, inspectable/installable skills.

The first one is dirac-workflow, based on Dirac's structural code workflow. It's not a Dirac clone tho, it has no runtime, persistent AST index, hash-anchor editing engine, or benchmark harness. Just a small AST helper and the workflow discipline as a portable skill.

I also dogfooded it on the Dirac repo itself and included a short report.

Would appreciate feedback from the original author, if the prompts and tools [1] are representative.

[0] https://github.com/ouatu-ro/skill-distillery
[1] https://github.com/ouatu-ro/skill-distillery/blob/main/skill...

deaux • Apr 27, 2026

1. Would be good to benchmark at least one other model from a different family to see if it indeed generalizes. Minimax 2.7 seems a good candidate to keep it affordable. Until then we can't really tell if it's just overfit on Gemini 3 Flash.
2. Until then your landing page needs to mention all the numbers are just from running on Gemini 3 Flash. Currently there's no mention at all of Gemini.
3. Assuming that cheaper also means faster in this case where model is equal? If so, then why not add this to the benchmarks to highlight another advantage - time until completion of the tasks. If it's the opposite and it takes longer (seems unlikely), then it would be transparent to note this.
4. Would be good to note if it does or does not support skills, (nested) AGENTS.md, MCP and so on for people considering migrating.

nzoschke • Apr 27, 2026

I haven’t had great experiences with Gemini for coding yet. I’m doing reasonably simple full stack Go apps. Tried Gemini-CLI, Antigravity, Pi.

The problems I’ve experienced are that it is less adept at picking the right bash commands to build and test the Go app, and does not follow idiomatic Go or code base patterns for changes.

A skill hasn’t helped much.

Will need to try this and OpenCode next.

kha1n3vol3 • Apr 27, 2026

I am using dirac with Kimi 2.6 for refactoring a Rust codebase. I have a Clean Architecture design which is being reinforced.
The scope of work is laid out in a Beads epic with sub-issues.
The planning was done with gpt5.5, and gpt5.5 is checking that the work is complete.
I have found that dirac is more productive on large codebase refactoring than OpenCode, which actually trashed the .rs file so that I had to revert the code.

avereveard • Apr 27, 2026

"astounding how much the harness matters" is the right read and it should be the lasting one. the model is rentable, the prompts are rentable, the benchmark numbers are mostly a function of the harness around them. swapping Gemini for Sonnet underneath the same harness has a smaller bench delta than swapping the harness around the model. the cheating-agents post you linked is the same observation through a different lens: the harness is what's being measured, the model is just the substrate.

that said, context management seems to be solving today's model problems more than being a universal property, and will probably be obsoleted a few model generations down the road, as tools obsoleted RAG context injection from question embeddings.

mdasen • Apr 27, 2026

It's really interesting how much the AI harness seems to matter. Going from 48% via Google's official results to 65% is a huge jump. I feel like I'm constantly seeing results that compare models and rarely seeing results that compare harnesses.

Is there a leaderboard out there comparing harness results using the same models?

adyavanapalli • Apr 27, 2026

I haven't tried it, but I'm curious why you decided to implement a whole new harness over just writing extensions in pi. From whatever I've done with pi so far, the extension api is quite extensive. Hash anchored edits, for example, can definitely be implemented in pi. Anyhow, thank you for showing us your project and will be checking it out later. Cheers!

bryanhogan • Apr 27, 2026

If I understand correctly, this is a heavily improved Cline fork? Does that mean features such as plan and act mode are also still there?

GodelNumbering • Apr 27, 2026

Interesting things Dirac does:

1. Uses an optimized version of Hash-Anchored edits for file editing (https://dirac.run/posts/hash-anchors-myers-diff-single-token)
2. Utilizes the language's AST to decide what to fetch into context, entirely avoiding large code file reads
3. Batches all operations. Does a large number of reads/edits simultaneously (you can see a video demo for deepseek-v4-flash here https://www.reddit.com/r/LocalLLaMA/comments/1suhdki/tested_...)
4. Allows the model to execute code to analyze things on the fly, so the model can simply write a bash/python/perl script to accomplish things where appropriate
5. A lot of context curation and opportunistic context updates, i.e. putting into context anything that you are certain the model would ask for next
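To make the first two points concrete, here is a small sketch of the general ideas. It is not Dirac's implementation and does not attempt the optimizations described in the linked post. All function names are hypothetical, the anchor is assumed to be a truncated SHA-1 of the line's text, and the AST part uses Python's standard `ast` module on Python source only.

```python
import ast
import hashlib
from typing import Optional


def line_anchor(line: str, length: int = 8) -> str:
    """Short content hash used to address a line (assumed scheme)."""
    return hashlib.sha1(line.encode("utf-8")).hexdigest()[:length]


def render_with_anchors(text: str) -> str:
    """What the model would be shown: each line prefixed by its anchor."""
    return "\n".join(f"{line_anchor(l)}|{l}" for l in text.splitlines())


def apply_edit(text: str, anchor: str, new_line: str) -> str:
    """Replace the one line whose anchor matches; refuse a miss or ambiguity."""
    lines = text.splitlines()
    matches = [i for i, l in enumerate(lines) if line_anchor(l) == anchor]
    if len(matches) != 1:
        # A stale anchor (file changed) or a duplicated line lands here.
        raise ValueError(f"anchor {anchor!r} matched {len(matches)} lines")
    lines[matches[0]] = new_line
    return "\n".join(lines)


def fetch_symbol(source: str, name: str) -> Optional[str]:
    """Return only the source of the named function or class, if present."""
    kinds = (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, kinds) and node.name == name:
            return ast.get_source_segment(source, node)
    return None
```

The anchor makes an edit fail loudly when the file no longer contains the line the model saw, instead of silently patching the wrong place. `fetch_symbol` shows how a harness could put one definition into context rather than the whole file.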

Discovery Source

Hacker News

Aggregated via automated community intelligence tracking.
