Gemini Executive Synthesis

SNEWPAPERS is a historical newspaper archive with full-text extractions, high-accuracy OCR, a categorization taxonomy, and semantic/agentic search capabilities, processing over 600k pages (5TB) from the Chronicling America collection. It uses a multi-model pipeline (layout, OCR, LLM, VLLM) and stores data in OpenSearch/Postgres.

Technical Positioning
The first historical newspaper archive offering full-text extraction, near-perfect OCR, and advanced semantic/agentic search, solving the problem of noise and lack of context in traditional keyword-based archives.
SaaS Insight & Market Implications
Traditional historical archives are limited by keyword-only search and raw image results, which creates significant research friction. SNEWPAPERS addresses this by applying AI/ML techniques to turn unstructured historical scans into semantically searchable, contextualized text. The author's reported "nearly perfect OCR" across diverse historical layouts is the critical technical hurdle, since reliable full-text extraction depends on it. The product is relevant to academic research, historical analysis, and potentially legal or journalistic investigations, where accurate, contextualized access to historical documents matters. Semantic and agentic search move discovery beyond simple retrieval toward assisted querying and synthesis, and show how AI can unlock value from datasets that were previously hard to use.
Proprietary Technical Taxonomy
historical newspaper archive, full-text extractions, nearly perfect OCR, categorization taxonomy, semantic search, agentic search, multi-model pipeline, layout tech

Raw Developer Origin & Technical Request

Hacker News • May 2, 2026
Show HN: Large Scale Article Extract of Newspapers 1730s-1960s

Hello HN, over the past 7 months I've spent nearly 3,000 hours building SNEWPAPERS, the first historical newspaper archive with full-text extractions, nearly perfect OCR, a vast categorization taxonomy, and of course semantic and agentic search capabilities.

Problem:
I wanted to search through newspaper archives, but every service I tried only lets you search by keywords and dates, and gives you back raw images of the papers, too many of them and with no context. A sea of noise.

Solution:
I taught machines how to read the newspapers, and so far I've extracted the content from > 600k pages (about 5TB) of the Chronicling America collection. Problems I had to deal with were an infinite variety of layouts, font sizes, image scan qualities, resolutions, and aspect ratios, plus navigating around the images on the page. I also had to figure out how to get OCR to be nearly perfect so people wouldn't hate reading the extracts. I stitched together a multi-model pipeline (layout tech, ocr tech, llm, vllm) with heuristics to go from layout -> segmentation -> classification. I put it all in OpenSearch / Postgres, made it semantically searchable, and put an agentic search tool on top that knows how to use the API really well and helps you write queries to find what you're looking for. Happy to discuss AWS architecture and scaling as well, that was tough!

If you have five minutes and you just want to jump in and have your own personalized experience, what I would suggest is:
Before searching for anything, go to the Sleuth page
Ask it about anything from 1736 to 1963, maybe 1 or 2 follow-up questions
Then go to the search page so you can see the queries it wrote for you (bottom left "saved queries") and uncover more info on whatever it is you're interested in

If you think it's cool and you want to learn more, there are about 10 minutes of video guides on the various capabilities in "Guide" on the nav bar.

Some other people have also taken a crack at this, notably:
dell-research-harvard.github.io/resources/america... (very good attempt)
labs.loc.gov/work/experiments/... (focused on images)
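
The post says the extracted text lives in OpenSearch and is semantically searchable over material from 1736 to 1963. As a rough illustration of what a date-bounded semantic query could look like, here is a minimal sketch that builds an OpenSearch k-NN request body. It is not the SNEWPAPERS API: the field names `embedding` and `published_year`, the helper `build_semantic_query`, and the example vector are all assumptions made for illustration.

```python
from typing import Any


def build_semantic_query(
    query_vector: list[float],
    start_year: int,
    end_year: int,
    k: int = 10,
) -> dict[str, Any]:
    """Build an OpenSearch k-NN search body limited to a year range.

    The field names "embedding" and "published_year" are hypothetical;
    a real index would use whatever its mapping defines.
    """
    if start_year > end_year:
        raise ValueError("start_year must not exceed end_year")
    return {
        "size": k,
        "query": {
            "knn": {
                "embedding": {
                    "vector": query_vector,
                    "k": k,
                    # Filter clause nested inside the k-NN query so that
                    # only documents in the year range are candidates.
                    "filter": {
                        "bool": {
                            "must": [
                                {
                                    "range": {
                                        "published_year": {
                                            "gte": start_year,
                                            "lte": end_year,
                                        }
                                    }
                                }
                            ]
                        }
                    },
                }
            }
        },
    }


# Example: a toy 3-dimensional vector standing in for a real text embedding.
body = build_semantic_query([0.12, -0.03, 0.88], start_year=1736, end_year=1963, k=5)
```

The resulting dictionary would be sent as the body of a search request against an index whose `embedding` field is mapped as a k-NN vector. In practice the query vector would come from the same embedding model used at indexing time.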

Developer Debate & Comments

longplay • May 3, 2026
How well do you think your OCR solution would work on magazines? I found OCR very hit and miss with magazines, especially ones with text into background pictures etc.
brettnbutter • May 2, 2026
I'm opening up https://snewpapers.com/today-in-history to the world right now. Per @benwillis's advice below, I will figure out how to make a section of the data searchable for free as well, but this is the best I can do today!

Thank you everyone for taking the time to look at snewpapers today, enjoy!
nastrofa • May 2, 2026
It would be really cool to create different analyses across time:
- Each month's / year's top news headline
- Left / Right swings of publishers
zzleeper • May 2, 2026
Looks cool, congrats!

I've also worked with this data, but only for research purposes:
https://www.finhist.com/bank-runs/episodes/13895.html
https://www.finhist.com/bank-runs/index.html

Surprisingly, I found out that layout was the trickiest thing, as newspaper articles often had multiple layers of headers, spanned multiple columns, etc.

Do you have a preferred solution on that?
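
To make the layout problem raised here concrete, the following is a minimal sketch of one common baseline for recovering reading order from detected text blocks. It is not the approach used by SNEWPAPERS or by the commenter. Wide blocks (such as headlines spanning several columns) split the page into horizontal bands, and within each band the remaining blocks are grouped into columns by horizontal overlap and read top to bottom. The `Block` type, the `span_ratio` threshold, and the coordinates are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Block:
    text: str
    x0: float  # left edge
    x1: float  # right edge
    y0: float  # top edge (smaller means higher on the page)


def _columns(blocks: list[Block]) -> list[list[Block]]:
    """Group blocks into columns, left to right, by horizontal overlap."""
    columns: list[list[Block]] = []
    right_edge = float("-inf")
    for block in sorted(blocks, key=lambda b: b.x0):
        if columns and block.x0 < right_edge:
            # Overlaps the current column horizontally: same column.
            columns[-1].append(block)
            right_edge = max(right_edge, block.x1)
        else:
            columns.append([block])
            right_edge = block.x1
    return columns


def reading_order(
    blocks: list[Block], page_width: float, span_ratio: float = 0.6
) -> list[Block]:
    """Order blocks band by band; wide blocks separate the bands."""
    ordered: list[Block] = []
    band: list[Block] = []

    def flush() -> None:
        # Emit the pending band column by column, top to bottom.
        for column in _columns(band):
            ordered.extend(sorted(column, key=lambda b: b.y0))
        band.clear()

    for block in sorted(blocks, key=lambda b: b.y0):
        if (block.x1 - block.x0) >= span_ratio * page_width:
            flush()
            ordered.append(block)  # spanning header stands alone
        else:
            band.append(block)
    flush()
    return ordered


page = [
    Block("right column, bottom", 52, 100, 50),
    Block("left column, top", 0, 48, 10),
    Block("HEADLINE", 0, 100, 0),
    Block("right column, top", 52, 100, 10),
    Block("left column, bottom", 0, 48, 50),
]
order = [b.text for b in reading_order(page, page_width=100)]
```

A heuristic like this breaks down on exactly the cases the comment describes, such as nested sub-headers or articles that continue under an unrelated headline, which is why real pipelines tend to combine it with learned layout models.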
benwills • May 2, 2026
As someone who has done a lot of downloading/parsing, this is so awesome and impressive to see.

One thing to think about, which I also struggle with when it comes to large and complicated datasets, is the UI. Even being in the search industry for a long time, it's difficult for me to concretely see how I would use this.

I'd suggest taking a small sample of the dataset that might be reflective of how people would use it, then make that segment public and immediately searchable without registering. eg: One year of articles related to the Olympics.

What I've found is that it's hard for a lot of people to imagine how they would use something without actually using it. So giving people the actual experience of searching the archive and interacting with the results would go a long way.

Again, congrats on the work. This is really impressive work.
brettnbutter • May 2, 2026
A few examples you can click on without having to authenticate or click the free trial (no cc if you do though and I won't bother you or chase you with spam etc...)
https://snewpapers.com/components/b2d40c08-db63-40e8-890f-09...
https://snewpapers.com/components/0fabc8e4-a60b-4f31-9ad1-b0...
https://snewpapers.com/components/cdde790f-4e97-4f2d-a2c2-95...

Frequently Asked Questions

Market intelligence mapped to SNEWPAPERS.

What problem does SNEWPAPERS solve?
Based on our AI analysis of the original developer request, its primary technical positioning is: The first historical newspaper archive offering full-text extraction, near-perfect OCR, and advanced semantic/agentic search, solving the problem of noise and lack of context in traditional keyword-based archives.
What is the general sentiment around SNEWPAPERS?
We have tracked 14 direct responses and active debates regarding this specific topic originating from Hacker News.
What architecture is tied to SNEWPAPERS?
Our proprietary extraction maps SNEWPAPERS to adjacent architectural concepts including historical newspaper archive, full-text extractions, nearly perfect OCR, categorization taxonomy.

Engagement Signals

Upvotes: 27
Comments: 14

Cross-Market Term Frequency

Quantifies the cross-market adoption of foundational terms like llm and Postgres by tracking occurrence frequency across active SaaS architectures and enterprise developer debates.