A benchmark for LLM agents fixing real-world security vulnerabilities (CVEs).
Raw Developer Origin & Technical Request
Hacker News
Jun 5, 2026
I built a benchmark with 20 real CVEs across 18 Python projects (Pillow, GitPython, yt-dlp, urllib3, etc.). I've run it over 5 LLM agents (3 OpenAI, 2 poolside) and 3 different prompts (full advisory, locate, diagnose), for a total of 300 runs. The agents are tasked with fixing security vulnerabilities in a sandboxed environment, and they are scored against hidden security tests taken from the maintainer's own fix.

The best solve rate was 50%. In the other 50%, some fixes are coherent and pass all regression tests, yet the vulnerability is still present.

The main differentiator I found between models is cost: gpt-5.5 is 12× more expensive than gpt-5.4-mini while producing statistically similar results. Within-family performance gaps are small, which suggests that the difference is likely due to model training data. I also did a power analysis, which puts the task count needed to detect a meaningful within-family edge at ~700.

Full write-up: giovannigatti.github.io/cve-bench
Code: github.com/GiovanniGatti/cve...
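To make the run count and the power-analysis claim above more concrete, here is a minimal sketch. The post does not say how its power analysis was done, so this is not the author's method: it assumes a standard two-sided, two-proportion z-test with independent samples, and the solve rates, `alpha`, `power`, and the helper name `tasks_needed` are all hypothetical choices for illustration. A paired design that runs both models on the same tasks would give different numbers.

```python
import math
from statistics import NormalDist

# Experiment grid as described in the post: every CVE is attempted by
# every agent under every prompt.
N_CVES, N_AGENTS, N_PROMPTS = 20, 5, 3
total_runs = N_CVES * N_AGENTS * N_PROMPTS  # 300


def tasks_needed(p_a: float, p_b: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Tasks per model needed to tell solve rate p_a from p_b.

    Assumption (not from the post): independent samples and the normal
    approximation for a two-sided two-proportion z-test.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for the test
    z_beta = NormalDist().inv_cdf(power)           # quantile for the desired power
    variance = p_a * (1 - p_a) + p_b * (1 - p_b)   # summed Bernoulli variances
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p_a - p_b) ** 2)


if __name__ == "__main__":
    print("total runs:", total_runs)
    # Hypothetical solve rates: a smaller gap between two models
    # requires many more tasks to detect reliably.
    for gap in (0.20, 0.10, 0.05):
        print(f"gap={gap:.2f} -> tasks per model: {tasks_needed(0.50, 0.50 + gap)}")
```

The required task count grows with the inverse square of the gap, which is why a 20-task benchmark can rank model families but struggles to separate close siblings within a family.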