Show HN: A new benchmark for testing LLMs for deterministic outputs
A new benchmark that measures LLM value accuracy in structured outputs across text, image, and audio modalities, addressing the problem of 'structured hallucinations' where schema is valid but values are incorrect.
Product Positioning & Context
AI Executive Synthesis
The Structured Output Benchmark (SOB) addresses a critical, unmeasured failure mode in enterprise LLM applications: 'structured hallucinations.' Current benchmarks validate only schema and types, not value accuracy. This directly impacts B2B SaaS workflows that rely on LLMs for deterministic data extraction and transformation, such as invoice processing or turning meeting transcripts into structured formats. The SOB's focus on value accuracy across modalities provides a tool for evaluating and selecting LLMs for production-grade, reliable applications. The benchmark could drive improvements in LLM performance on business-critical tasks and enable more trustworthy, automated data pipelines within enterprises.
When building workflows that rely on LLMs, we commonly use structured output for programmatic use cases such as converting an invoice into rows, meeting transcripts into tickets, or even complex PDFs into database entries.
The model may return the schema you want but with hallucinated values, such as an `invoice_date` that is off by two months or a transcript array in the wrong order. The JSON is valid, but the values are not.
Structured output is a big part of using LLMs today, especially when building deterministic workflows. Current structured output benchmarks (e.g., JSONSchemaBench) only validate the pass rate for JSON schema and types, not the actual values within the produced JSON.
So we designed the Structured Output Benchmark (SOB), which fixes this by measuring the JSON schema pass rate, types, and value accuracy across all three modalities: text, image, and audio.
For our test set, every record is paired with a JSON Schema and a ground-truth answer that was verified against the source context manually by a human and cross-checked by an LLM, so a missing or hallucinated value is counted as wrong.
Open source is doing pretty well, with GLM-4.7 coming in at number 2 right after GPT-5.4.
We noticed that the rankings shift across modalities: GLM-4.7 leads text, Gemma-4-31B leads images, and Gemini-2.5-Flash leads audio. For example, GPT-5.4 ranks 3rd on text but 9th on images.
Model size is not a predictor either: Qwen3.5-35B and GLM-4.7 beat GPT-5 and Claude-Sonnet-4.6 on Value Accuracy, and Phi-4 (14B) beats GPT-5 and GPT-5-mini on text.
Structured hallucinations are the hardest bug. Such values are type-correct, schema-valid, and plausible, so they slip through most guardrails. For example, in one audio record, the ground truth is "target_market_age": "15 to 35 years", and a model returns "25 to 35". This is invisible without field-level checks.
Our goal is to be the best general model for deterministic tasks, and a key aspect of determinism is a controllable and consistent output structure. The first step to making structured output better is to measure it and hold ourselves against the best.
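To make the idea of a field-level check concrete, here is a minimal Python sketch. It is not SOB's scoring code, which the post does not describe. The exact-match metric, the function names `flatten` and `value_accuracy`, and the `product_name` field are all assumptions made for illustration. Only the `target_market_age` values come from the post.

```python
from __future__ import annotations

from typing import Any


def flatten(obj: Any, prefix: str = "") -> dict[str, Any]:
    """Flatten nested dicts/lists into {path: leaf value}.

    List indices are part of the path, so a reordered array shows up
    as mismatched leaves. Empty dicts and lists contribute no leaves.
    """
    if isinstance(obj, dict):
        items = [(f"{prefix}.{k}" if prefix else str(k), v) for k, v in obj.items()]
    elif isinstance(obj, list):
        items = [(f"{prefix}[{i}]", v) for i, v in enumerate(obj)]
    else:
        return {prefix: obj}
    flat: dict[str, Any] = {}
    for path, value in items:
        flat.update(flatten(value, path))
    return flat


def value_accuracy(predicted: Any, truth: Any) -> tuple[float, list[str]]:
    """Hypothetical exact-match metric: share of leaf fields that are correct.

    Missing and wrong ground-truth fields count as errors, and so do
    extra (hallucinated) fields in the prediction.
    """
    truth_flat = flatten(truth)
    pred_flat = flatten(predicted)
    wrong = [p for p, v in truth_flat.items() if p not in pred_flat or pred_flat[p] != v]
    extra = [p for p in pred_flat if p not in truth_flat]
    total = len(truth_flat) + len(extra)
    correct = len(truth_flat) - len(wrong)
    score = correct / total if total else 1.0
    return score, wrong + extra


# "target_market_age" values are from the post; "product_name" is made up.
truth = {"target_market_age": "15 to 35 years", "product_name": "Acme"}
predicted = {"target_market_age": "25 to 35", "product_name": "Acme"}

score, mismatches = value_accuracy(predicted, truth)
print(score, mismatches)  # 0.5 ['target_market_age']
```

Both objects here would pass the same schema and type validation, since every field is a string, yet half of the leaf values are wrong. A real scorer would likely need looser matching than strict equality (for example, normalizing dates or number formats), which this sketch deliberately leaves out.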
LLMs
deterministic outputs
structured output
JSON schema
value accuracy
modalities (text, image, audio)
ground-truth answer
structured hallucinations
Deep-Dive FAQs
What is A new benchmark for testing LLMs for deterministic outputs?
Our AI analysis describes A new benchmark for testing LLMs for deterministic outputs as a benchmark that measures LLM value accuracy in structured outputs across text, image, and audio modalities, addressing the problem of 'structured hallucinations' where the schema is valid but the values are incorrect. The Structured Output Benchmark (SOB) addresses a critical, unmeasured failure mode in enterprise LLM applications: 'structured hallucinations.'
Where did A new benchmark for testing LLMs for deterministic outputs originate?
Data for A new benchmark for testing LLMs for deterministic outputs was aggregated directly from the Hacker News community ecosystem, representing raw developer and early-adopter sentiment.
When was A new benchmark for testing LLMs for deterministic outputs publicly launched?
The initial public indexing or launch date for A new benchmark for testing LLMs for deterministic outputs within our tracked developer communities was recorded on April 30, 2026.
How popular is A new benchmark for testing LLMs for deterministic outputs?
A new benchmark for testing LLMs for deterministic outputs has achieved measurable traction, with a traction score of over 49 and 21 recorded discussions or engagements.
Which technical categories define A new benchmark for testing LLMs for deterministic outputs?
Based on metadata extraction, A new benchmark for testing LLMs for deterministic outputs is categorized under topics such as: LLMs, deterministic outputs, structured output, JSON schema.
What are some commercial alternatives to A new benchmark for testing LLMs for deterministic outputs?
Our semantic intelligence engine identifies potential commercial alternatives in the SaaS space, such as Bluedot 2.1, which offers overlapping value propositions.
Community Voice & Feedback
Discovery Source

Hacker News
Aggregated via automated community intelligence tracking.