Gemini Executive Synthesis

Open-source agent that topped TerminalBench on Gemini-3-flash-preview.

Technical Positioning

An open-source CLI agent outperforming Google's official model and a top closed-source model on TerminalBench, emphasizing the importance of the harness.

SaaS Insight & Market Implications

This submission demonstrates a significant achievement in agent performance, with an open-source CLI agent surpassing both proprietary and official models on a recognized benchmark. The explicit denial of cheating mechanisms reinforces the integrity of the results, crucial in competitive AI benchmarking. The observation that 'the harness matters' is a critical insight, indicating that the surrounding infrastructure and methodology for agent interaction are as vital as the underlying model. This highlights a market trend where open-source solutions, when properly engineered and integrated, can compete directly with, and even exceed, closed-source offerings, driving innovation and challenging established market leaders.


Raw Developer Origin & Technical Request

Hacker News Apr 27, 2026

Show HN: OSS Agent I built topped the TerminalBench on Gemini-3-flash-preview

Scored 65.2% vs Google's official 47.8%, and the existing top closed-source model Junie CLI's 64.3%.

Since there are a lot of reports of deliberate cheating on TerminalBench 2.0 lately (debugml.github.io/cheating-agents/), I would like to also clarify a few things:

1. Absolutely no {agents/skills}.md files were inserted at any point. No cheating mechanisms whatsoever.

2. The CLI agent was run in a leaderboard-compliant way (no modification of resources or timeouts).

3. The full TerminalBench run was done using the fully open-source version of the agent; there is no difference between what is on GitHub and what was run.

I was originally going to wait for it to land on the leaderboard, but it has been 8 days and the maintainers unfortunately have not responded (there is a large backlog of pull requests on their HF), so I decided to post anyway.

HF PR: huggingface.co/datasets/harborfr...

It is astounding how much the harness matters, based on this and other experiments I have done.


Developer Debate & Comments

sally_glance • Apr 27, 2026

Great job and congrats! Working on my own harness has been one of my favorite side projects in the past couple of weeks; of course I never finish anything... But I'm very interested in your experience with the following:

1. Context management - specifically pruning old tool call responses, truncation of tool output and automatic compaction. Those have worked pretty great for me; the benefits of reducing context seem to greatly outweigh the gains from "remembering" everything. I always leave short summaries though.

2. "Subagents" - my latest attempts revolve around not exposing any tools to the main agent at all, except for a run_agent tool where the subagent has access to the classic search/execute/fetch tools. My theory is that if subagents return concise summaries, this would automatically keep the parent agent's context clean for much longer. Still experimenting though; writing prompts for subagents may also be too far outside of the current training sets.
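A minimal sketch of the pruning and truncation described in this comment, under stated assumptions: the `Message` type, the `prune_context` and `truncate` helpers, and all default limits are hypothetical and are not taken from any harness mentioned in the thread. The "short summary" here is just the first line of the old tool output; a real harness might ask the model to write it instead.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Message:
    role: str      # "user", "assistant", or "tool" (hypothetical roles)
    content: str


def truncate(text: str, limit: int) -> str:
    """Keep the head and tail of long tool output and drop the middle."""
    if len(text) <= limit:
        return text
    half = limit // 2
    dropped = len(text) - 2 * half
    tail = text[len(text) - half:]
    return f"{text[:half]}\n[... {dropped} chars truncated ...]\n{tail}"


def prune_context(messages: List[Message], keep_recent: int = 2,
                  output_limit: int = 2000,
                  summary_chars: int = 80) -> List[Message]:
    """Return a new list: recent tool results truncated, older ones summarized."""
    tool_positions = [i for i, m in enumerate(messages) if m.role == "tool"]
    recent = set(tool_positions[-keep_recent:]) if keep_recent > 0 else set()
    pruned = []
    for i, m in enumerate(messages):
        if m.role != "tool":
            pruned.append(m)  # user/assistant turns are kept as they are
        elif i in recent:
            pruned.append(Message("tool", truncate(m.content, output_limit)))
        else:
            stripped = m.content.strip()
            first_line = stripped.splitlines()[0] if stripped else ""
            summary = first_line[:summary_chars]
            pruned.append(Message("tool", f"[pruned tool output] {summary}"))
    return pruned
```

Calling `prune_context(history)` before each model request would keep the newest tool results nearly intact while older ones shrink to a one-line stub, which is one way to trade recall for a smaller context.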

gobdovan • Apr 27, 2026

Very interesting, especially the harness point: how much of performance is in the wrapper tools. (When I almost run out of credits, I change my model to a smaller one and try to give it more structured prompts; very often gpt-5.4-mini with structure works better than gpt-5.4 with vibes.)

This inspired me to start a "skill distillery" [0] where I take good agent workflow ideas and turn them into small, inspectable/installable skills.

The first one is dirac-workflow, based on Dirac's structural code workflow. It's not a Dirac clone though: it has no runtime, persistent AST index, hash-anchor editing engine, or benchmark harness. Just a small AST helper and the workflow discipline as a portable skill.

I also dogfooded it on the Dirac repo itself and included a short report.

Would appreciate feedback from the original author on whether the prompts and tools [1] are representative.

[0] https://github.com/ouatu-ro/skill-distillery

[1] https://github.com/ouatu-ro/skill-distillery/blob/main/skill...
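To illustrate what "a small AST helper" might look like, here is a sketch using Python's standard `ast` module. It is an assumption for illustration only: the function names `outline` and `fetch_symbol` are hypothetical, and this is not the code in skill-distillery or Dirac. The idea is to show the agent a compact outline first and fetch a single definition on demand instead of reading the whole file.

```python
import ast
from typing import List, Optional

# Node types treated as "symbols" worth listing or fetching.
_DEFS = (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)


def outline(source: str) -> List[str]:
    """List top-level functions and classes with their line ranges."""
    tree = ast.parse(source)
    entries = []
    for node in tree.body:
        if isinstance(node, _DEFS):
            kind = "class" if isinstance(node, ast.ClassDef) else "def"
            entries.append(
                f"{kind} {node.name} (lines {node.lineno}-{node.end_lineno})"
            )
    return entries


def fetch_symbol(source: str, name: str) -> Optional[str]:
    """Return the source text of the first definition called `name`, if any."""
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, _DEFS) and node.name == name:
            return ast.get_source_segment(source, node)
    return None
```

An agent tool built on this would return `outline(...)` for a file and only pull a full body via `fetch_symbol(...)` when the model asks for it, which keeps large files out of the context.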

deaux • Apr 27, 2026

1. Would be good to benchmark at least one other model from a different family to see if it indeed generalizes. Minimax 2.7 seems a good candidate to keep it affordable. Until then we can't really tell if it's just overfit on Gemini 3 Flash.

2. Until then your landing page needs to mention that all the numbers are just from running on Gemini 3 Flash. Currently there's no mention at all of Gemini.

3. Assuming that cheaper also means faster in this case, where the model is equal? If so, then why not add this to the benchmarks to highlight another advantage: time until completion of the tasks. If it's the opposite and it takes longer (seems unlikely), then it would be transparent to note this.

4. Would be good to note whether it does or does not support skills, (nested) AGENTS.md, MCP and so on, for people considering migrating.

nzoschke • Apr 27, 2026

I haven’t had great experiences with Gemini for coding yet. I’m doing reasonably simple full stack Go apps. Tried Gemini-CLI, antigravity, Pi.

The problems I’ve experienced are that it is less adept at picking the right bash commands to build and test the Go app, and that it does not follow idiomatic Go or code base patterns for changes.

A skill hasn’t helped much.

Will need to try this and open code next.

kha1n3vol3 • Apr 27, 2026

I am using dirac with Kimi 2.6 for refactoring a Rust codebase. I have a Clean Architecture design which is being reinforced. The scope of work is laid out in a Beads epic with sub-issues. The planning was done with gpt5.5, and gpt5.5 is checking that the work is complete. I have found that dirac is more productive on large codebase refactoring than OpenCode, which actually trashed the .rs file so that I had to revert the code.

avereveard • Apr 27, 2026

"astounding how much the harness matters" is the right read and it should be the lasting one. the model is rentable, the prompts are rentable, the benchmark numbers are mostly a function of the harness around them. swapping Gemini for Sonnet underneath the same harness has a smaller bench delta than swapping the harness around the model. the cheating-agents post you linked is the same observation through a different lens: the harness is what's being measured, the model is just the substrate.

that said, context management seems to be solving today's model problems more than being a universal property, and will probably be obsoleted a few model generations down the road, as tools obsoleted RAG context injection from question embeddings.

mdasen • Apr 27, 2026

It's really interesting how much the AI harness seems to matter. Going from 48% via Google's official results to 65% is a huge jump. I feel like I'm constantly seeing results that compare models and rarely seeing results that compare harnesses.

Is there a leaderboard out there comparing harness results using the same models?
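To put numbers on that jump, the snippet below works out the absolute and relative gains from the unrounded scores quoted in the original post (65.2%, 47.8%, 64.3%). The `gain` helper and the dictionary labels are hypothetical names introduced for this illustration.

```python
# Scores as reported in the Show HN post (TerminalBench, Gemini-3-flash-preview).
SCORES = {
    "this agent": 65.2,
    "Google official": 47.8,
    "Junie CLI": 64.3,
}


def gain(new: float, base: float) -> tuple:
    """Return (absolute gain in points, relative gain in percent), rounded to 0.1."""
    delta = new - base
    return round(delta, 1), round(100 * delta / base, 1)


vs_google = gain(SCORES["this agent"], SCORES["Google official"])
vs_junie = gain(SCORES["this agent"], SCORES["Junie CLI"])
print(vs_google)  # (17.4, 36.4)
print(vs_junie)   # (0.9, 1.4)
```

So with the model held fixed, the harness change accounts for 17.4 points (about 36% relative), while the lead over Junie CLI is under one point.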

adyavanapalli • Apr 27, 2026

I haven't tried it, but I'm curious why you decided to implement a whole new harness over just writing extensions in pi. From whatever I've done with pi so far, the extension api is quite extensive. Hash anchored edits, for example, can definitely be implemented in pi. Anyhow, thank you for showing us your project and will be checking it out later. Cheers!

bryanhogan • Apr 27, 2026

If I understand correctly, this is a heavily improved Cline fork? Does that mean features such as plan and act mode are also still there?

GodelNumbering • Apr 27, 2026

Interesting things Dirac does:

1. Uses an optimized version of Hash-Anchored edits for file editing (https://dirac.run/posts/hash-anchors-myers-diff-single-token)

2. Utilizes the language's AST to decide what to fetch into context; entirely avoids large code file reads

3. Batches all operations. Does a large number of reads/edits simultaneously (you can see a video demo for deepseek-v4-flash here https://www.reddit.com/r/LocalLLaMA/comments/1suhdki/tested_...)

4. Allows the model to execute code to analyze things on the fly, so the model can simply write a bash/python/perl script to accomplish things where appropriate

5. A lot of context curation and opportunistic context updates, i.e. put into context anything that you are certain the model would ask for next
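As a rough illustration of the first item, here is one plausible reading of hash-anchored editing: each line is shown to the model with a short tag derived from its line number and content, and an edit is applied only if the tag still matches the file. This is a generic sketch under that assumption, not Dirac's optimized implementation described in the linked post; `line_hash`, `render_with_anchors`, `apply_edit`, and `StaleAnchorError` are hypothetical names.

```python
import hashlib
from typing import List


class StaleAnchorError(ValueError):
    """Raised when an anchor no longer matches the current file content."""


def line_hash(text: str) -> str:
    """Short content digest used as part of a line's anchor."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:8]


def render_with_anchors(lines: List[str]) -> List[str]:
    """Format lines as '<number>:<hash>|<text>' for display to the model."""
    return [f"{i}:{line_hash(t)}|{t}" for i, t in enumerate(lines, start=1)]


def apply_edit(lines: List[str], anchor: str, new_lines: List[str]) -> List[str]:
    """Replace the anchored line with new_lines; an empty list deletes it.

    The edit is rejected if the line number is out of range or the content
    at that line has changed since the anchor was issued.
    """
    number, _, digest = anchor.partition(":")
    index = int(number) - 1
    if not 0 <= index < len(lines) or line_hash(lines[index]) != digest:
        raise StaleAnchorError(f"anchor {anchor!r} does not match the file")
    return lines[:index] + list(new_lines) + lines[index + 1:]
```

The model only has to emit a short anchor rather than reproduce the original text, and a stale anchor fails loudly instead of silently editing the wrong line.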

Frequently Asked Questions

Market intelligence mapped to "Open-source agent that topped TerminalBench on Gemini-3-flash-preview".

What is the technical positioning of "Open-source agent that topped TerminalBench on Gemini-3-flash-preview"?

Based on our AI analysis of the original developer request, its primary technical positioning is: An open-source CLI agent outperforming Google's official model and a top closed-source model on TerminalBench, emphasizing the importance of the harness.

Are engineers actively discussing "Open-source agent that topped TerminalBench on Gemini-3-flash-preview"?

Yes, we have tracked 54 direct responses and active debates regarding this specific topic originating from Hacker News.

What architecture is tied to "Open-source agent that topped TerminalBench on Gemini-3-flash-preview"?

Our proprietary extraction maps "Open-source agent that topped TerminalBench on Gemini-3-flash-preview" to adjacent architectural concepts including OSS Agent, TerminalBench, Gemini-3-flash-preview, and CLI agent.

Engagement Signals

Upvotes: 144


Cross-Market Term Frequency

Quantifies the cross-market adoption of foundational terms like harness and CLI agent by tracking occurrence frequency across active SaaS architectures and enterprise developer debates.