Show HN: Large Scale Article Extract of Newspapers 1730s-1960s
The first historical newspaper archive offering full-text extraction, near-perfect OCR, and advanced semantic/agentic search, solving the problem of noise and lack of context in traditional keyword-based archives.
Product Positioning & Context
I wanted to search through newspaper archives, but every service I tried only lets you search by keyword and date, and gives you back raw images of the papers, too many of them and with no context. A sea of noise.

Solution:
I taught machines how to read the newspapers, and so far I've extracted the content from > 600k pages (about 5TB) from the Chronicling America collection. The problems I had to deal with were an infinite variety of layouts, font sizes, image scan qualities, resolutions, and aspect ratios, plus navigating around the images on the page. I also had to figure out how to get OCR to be nearly perfect so people wouldn't hate reading the extracts. I stitched together a multi-model pipeline (layout tech, OCR tech, LLM, vllm) with heuristics to go from layout -> segmentation -> classification. I put it all in OpenSearch / Postgres, made it semantically searchable, and put an agentic search tool on top that knows how to use the API really well and helps you write queries to find what you're looking for. Happy to discuss AWS architecture and scaling as well, that was tough!

If you have five minutes and just want to jump in and have your own personalized experience, here is what I would suggest:
1. Before searching for anything, go to the Sleuth page.
2. Ask it about anything from 1736 to 1963, maybe with 1 or 2 follow-up questions.
3. Then go to the search page so you can see the queries it wrote for you (bottom left, "saved queries") and uncover more info on whatever you're interested in.

If you think it's cool and want to learn more, there are about 10 minutes of video guides on the various capabilities under "Guide" on the nav bar.

Some other people have also taken a crack at this, notably:
https://dell-research-harvard.github.io/resources/americanst... (very good attempt)
https://labs.loc.gov/work/experiments/newspaper-navigator/ (focused on images)
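To make the "layout -> segmentation -> classification" flow described above a bit more concrete, here is a minimal toy sketch in Python. It is not the actual pipeline: the Block and Article types, the fixed column_width, the font-size headline rule, and the keyword classifier are all assumptions made up for illustration, standing in for the layout model, OCR, and LLM steps the post mentions.

```python
from dataclasses import dataclass, field
from statistics import median
from typing import List


@dataclass
class Block:
    """One text region on a page (hypothetical output of a layout + OCR step)."""
    x: float          # left edge in pixels
    y: float          # top edge in pixels
    font_size: float  # estimated font size
    text: str         # OCR text of the region


@dataclass
class Article:
    headline: str
    paragraphs: List[str] = field(default_factory=list)
    label: str = "unknown"


def reading_order(blocks: List[Block], column_width: float) -> List[Block]:
    # Assumption: columns have a fixed width, so a block's column is x // width.
    # Read each column top to bottom, columns left to right.
    return sorted(blocks, key=lambda b: (int(b.x // column_width), b.y))


def segment(blocks: List[Block], column_width: float) -> List[Article]:
    # Heuristic: a block much larger than the typical font size starts a new article.
    if not blocks:
        return []
    typical = median(b.font_size for b in blocks)
    articles: List[Article] = []
    for b in reading_order(blocks, column_width):
        if b.font_size >= 1.5 * typical:
            articles.append(Article(headline=b.text))
        else:
            if not articles:
                # Body text before any headline: keep it under an empty headline.
                articles.append(Article(headline=""))
            articles[-1].paragraphs.append(b.text)
    return articles


def classify(article: Article) -> Article:
    # Placeholder for a model call: a single keyword rule, for illustration only.
    text = (article.headline + " " + " ".join(article.paragraphs)).lower()
    article.label = "advertisement" if "for sale" in text else "article"
    return article


if __name__ == "__main__":
    page = [
        Block(0, 0, 24, "BANK RUN IN CITY"),
        Block(0, 30, 9, "Depositors gathered early."),
        Block(300, 0, 20, "FOR SALE"),
        Block(300, 30, 9, "Fine horses for sale, inquire within."),
        Block(0, 60, 9, "The doors closed at noon."),
    ]
    for art in map(classify, segment(page, column_width=300)):
        print(art.label, "|", art.headline, "|", " ".join(art.paragraphs))
```

A fixed column grid and a single font-size threshold would break quickly on real scans (headlines that span columns, mixed type sizes), which is presumably why a multi-model approach with heuristics was needed in the first place.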
Community Voice & Feedback
https://www.finhist.com/bank-runs/index.html
Surprisingly, I found out that layout was the trickiest thing, as newspaper articles often had multiple layers of headers, spanned multiple columns, etc.
Do you have a preferred solution on that?
Discovery Source
Hacker News. Aggregated via automated community intelligence tracking.