Gemini Executive Synthesis

Structured Output Benchmark (SOB) for LLMs.

Technical Positioning
A new benchmark that measures LLM value accuracy in structured outputs across text, image, and audio modalities, addressing the problem of 'structured hallucinations' where schema is valid but values are incorrect.
SaaS Insight & Market Implications
The Structured Output Benchmark (SOB) targets a failure mode that is largely unmeasured in enterprise LLM applications: 'structured hallucinations.' Existing structured output benchmarks validate the schema and types but not the values inside the output. That gap matters for B2B SaaS workflows that rely on LLMs for deterministic data extraction and transformation, such as invoice processing or summarizing meetings into structured formats. By scoring value accuracy across text, image, and audio, SOB gives teams a way to evaluate and select LLMs for production applications. If adopted, it could push model providers to improve on business-critical extraction tasks and make automated enterprise data pipelines more trustworthy.
Proprietary Technical Taxonomy
LLMs, deterministic outputs, structured output, JSON schema, value accuracy, modalities (text, image, audio), ground-truth answer, structured hallucinations

Raw Developer Origin & Technical Request

Hacker News, Apr 30, 2026
Show HN: A new benchmark for testing LLMs for deterministic outputs

When building workflows that rely on LLMs, we commonly use structured output for programmatic use cases like converting an invoice into rows or meeting transcripts into tickets or even complex PDFs into database entries.

The model may return the schema you want, but with hallucinated values like `invoice_date` being off by 2 months or the transcript array ordered wrongly. The JSON is valid, but the values are not.

Structured output today is a big part of using LLMs, especially when building deterministic workflows.

Current structured output benchmarks (e.g., JSONSchemaBench) only validate the pass rate for JSON schema and types, and not the actual values within the produced JSON.

So we designed the Structured Output Benchmark (SOB), which fixes this by measuring the JSON schema pass rate, the types, and the value accuracy across all three modalities: text, image, and audio.

For our test set, every record is paired with a JSON Schema and a ground-truth answer that was verified against the source context manually by a human and an LLM cross-check, so a missing or hallucinated value will be considered to be wrong.

Open source is doing pretty well, with GLM-4.7 coming in at number 2 right after GPT-5.4.

We noticed the rankings shift across modalities: GLM-4.7 leads text, Gemma-4-31B leads images, Gemini-2.5-Flash leads audio. For example, GPT-5.4 ranks 3rd on text but 9th on images.

Model size is not a predictor, either: Qwen3.5-35B and GLM-4.7 beat GPT-5 and Claude-Sonnet-4.6 on Value Accuracy. Phi-4 (14B) beats GPT-5 and GPT-5-mini on text.

Structured hallucinations are the hardest bug. Such values are type-correct, schema-valid, and plausible, so they slip through most guardrails. For example, in one audio record, the ground truth is "target_market_age": "15 to 35 years", and a model returns "25 to 35". This is invisible without field-level checks.

Our goal is to be the best general model for deterministic tasks, and a key aspect of determinism is a controllable and consistent output structure. The first step to making structured output better is to measure it and hold ourselves against the best.
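To make the difference between a schema pass and value accuracy concrete, here is a minimal sketch of a field-level check. It is not SOB's actual scoring code, which the post does not describe: the helper names (`flatten`, `types_match`, `value_accuracy`, `mismatches`), the exact-match comparison, and the `product` fields are all assumptions made for illustration. Only the `target_market_age` values come from the post.

```python
from typing import Any


def flatten(obj: Any, prefix: str = "") -> dict[str, Any]:
    """Flatten nested dicts and lists into {path: leaf_value}."""
    if isinstance(obj, dict):
        items = [(f"{prefix}.{k}" if prefix else str(k), v) for k, v in obj.items()]
    elif isinstance(obj, list):
        items = [(f"{prefix}[{i}]", v) for i, v in enumerate(obj)]
    else:
        return {prefix: obj}
    flat: dict[str, Any] = {}
    for path, value in items:
        flat.update(flatten(value, path))
    return flat


def types_match(predicted: Any, truth: Any) -> bool:
    """Crude stand-in for schema validation: same leaf paths, same leaf types."""
    p, t = flatten(predicted), flatten(truth)
    return p.keys() == t.keys() and all(type(p[k]) is type(t[k]) for k in t)


def value_accuracy(predicted: Any, truth: Any) -> float:
    """Fraction of ground-truth leaves reproduced exactly (assumed metric)."""
    p, t = flatten(predicted), flatten(truth)
    if not t:
        return 1.0
    correct = sum(1 for path, value in t.items() if path in p and p[path] == value)
    return correct / len(t)


def mismatches(predicted: Any, truth: Any) -> dict[str, tuple]:
    """Map each wrong or missing path to (expected, got)."""
    p, t = flatten(predicted), flatten(truth)
    return {
        path: (value, p.get(path))
        for path, value in t.items()
        if path not in p or p[path] != value
    }


# Ground truth and model output; the "product" fields are hypothetical.
truth = {
    "target_market_age": "15 to 35 years",
    "product": {"name": "Example", "regions": ["EU", "US"]},
}
predicted = {
    "target_market_age": "25 to 35",
    "product": {"name": "Example", "regions": ["EU", "US"]},
}

structure_ok = types_match(predicted, truth)
accuracy = value_accuracy(predicted, truth)
wrong = mismatches(predicted, truth)
print(structure_ok, accuracy, wrong)
```

In this sketch the prediction passes the structural check because every field is present with the right type, yet one of the four leaf values is wrong. A structure-only check would report success, while the per-field comparison surfaces the bad `target_market_age`.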

Developer Debate & Comments

timxtokyo • Apr 29, 2026
Would it be possible to add LLM providers for glm5.1 and minimax2.1? Those latest models have changed their parameters significantly compared to the previous gen.
jadbox • Apr 29, 2026
Wow, Qwen3.5-35B is absolutely punching above its weight. Perhaps it's the best/cheapest model for just JSON operations?
jumploops • Apr 29, 2026
I have anecdotal experience here, but I've found more success when solving the task first, and then returning it as JSON in a separate LLM call[0].

Running a single non-reasoning LLM call from source data (text/image/audio in your diagram) to structured JSON seems fragile with the current state of LLMs.

You're essentially asking the model to do two tasks in one pass: parse the input and then format the output. It's amazing it works a lot of the time, but reasonable to assume it won't all of the time.

(As a human, when I'm filling out a complex form, I'll often jump around the document)

Curious how the benchmarks change when you add an intermediary representation, either via reasoning or an additional LLM call. I'd also love to see a comparison with BAML[1].

[0] In my experience we were using structured outputs as part of an agentic state machine, where the JSON contained code snippets (html/js/py/etc.). In the cases where we first prompted the model for the code, and then wrapped it in JSON, we saw much higher quality/success than asking for JSON straightaway.

[1] https://boundaryml.com/
maxdo • Apr 29, 2026
GPT 5.5 seems to be the recent leader overall. It makes sense to include it, just to see what you trade off for speed and open-source nature vs. the cutting-edge leader.
broyojo • Apr 29, 2026
hmm why can't structured decoding be used?
zihotki • Apr 29, 2026
I wonder if this benchmark brings any value. Models are already quite capable and reach high scores in it.
stared • Apr 29, 2026
Thank you for sharing the benchmark. However, the results are selective. Why no Opus 4.7? Why is Gemini 3.1 Pro missing? If there is some other criterion (e.g. models within a certain time or budget), great - just make it explicit. When I see "Top 5 at a glance" and it misses key frontier models, I am (at best) confused.
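The two-pass approach jumploops describes in the thread (solve the task first, then return it as JSON in a separate LLM call) could be sketched roughly as below. This is an illustration of the idea only, not the commenter's code and not part of SOB. The `complete` parameter is a hypothetical stand-in for whatever LLM client is in use, the prompts are invented, and `fake_complete` is a canned stub so the sketch runs without a model.

```python
import json
from typing import Callable


def extract_two_pass(
    source: str,
    schema: dict,
    complete: Callable[[str], str],  # hypothetical: prompt in, text out
) -> dict:
    """Pass 1 reads the source in free text; pass 2 only formats."""
    # Pass 1: let the model work through the source without format pressure.
    notes = complete(
        "Read the source and list every fact needed, in plain text.\n\n"
        f"SOURCE:\n{source}"
    )
    # Pass 2: a formatting-only call that sees the notes, not the raw source.
    raw = complete(
        "Convert the notes into JSON matching this JSON Schema. "
        "Use only facts from the notes.\n\n"
        f"SCHEMA:\n{json.dumps(schema)}\n\nNOTES:\n{notes}"
    )
    return json.loads(raw)


calls: list[str] = []


def fake_complete(prompt: str) -> str:
    """Canned stub standing in for a real model, for demonstration only."""
    calls.append(prompt)
    if prompt.startswith("Convert"):
        return '{"target_market_age": "15 to 35 years"}'
    return "Target market age: 15 to 35 years."


schema = {
    "type": "object",
    "properties": {"target_market_age": {"type": "string"}},
    "required": ["target_market_age"],
}
result = extract_two_pass(
    "Our target market is people aged 15 to 35 years.", schema, fake_complete
)
print(result)
```

Whether the extra call actually improves value accuracy is the open question raised in the comment. The sketch only shows where an intermediate representation would sit so that both variants can be scored against the same ground truth.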

Frequently Asked Questions

Market intelligence mapped to Structured Output Benchmark (SOB) for LLMs.

What problem does the Structured Output Benchmark (SOB) for LLMs solve?
Based on our AI analysis of the original developer request, its primary technical positioning is: A new benchmark that measures LLM value accuracy in structured outputs across text, image, and audio modalities, addressing the problem of 'structured hallucinations' where schema is valid but values are incorrect.
Are engineers actively discussing the Structured Output Benchmark (SOB) for LLMs?
Yes, we have tracked 21 direct responses and active debates regarding this specific topic originating from Hacker News.
What architecture is tied to the Structured Output Benchmark (SOB) for LLMs?
Our proprietary extraction maps the Structured Output Benchmark (SOB) for LLMs to adjacent architectural concepts including LLMs, deterministic outputs, structured output, and JSON schema.

Engagement Signals

Upvotes: 49
Comments: 21

Cross-Market Term Frequency

Quantifies the cross-market adoption of foundational terms like LLMs and structured output by tracking occurrence frequency across active SaaS architectures and enterprise developer debates.