Academic Publication

AI models collapse when trained on recursively generated data

Citations: 599

Published Date: July 25, 2024

Research Abstract & Technology Focus

Abstract
Stable diffusion revolutionized image creation from descriptive text. GPT-2 (ref. 1), GPT-3(.5) (ref. 2) and GPT-4 (ref. 3) demonstrated high performance across a variety of language tasks. ChatGPT introduced such language models to the public. It is now clear that generative artificial intelligence (AI) such as large language models (LLMs) is here to stay and will substantially change the ecosystem of online text and images. Here we consider what may happen to GPT-{n} once LLMs contribute much of the text found online. We find that indiscriminate use of model-generated content in training causes irreversible defects in the resulting models, in which tails of the original content distribution disappear. We refer to this effect as ‘model collapse’ and show that it can occur in LLMs as well as in variational autoencoders (VAEs) and Gaussian mixture models (GMMs). We build theoretical intuition behind the phenomenon and portray its ubiquity among all learned generative models. We demonstrate that it must be taken seriously if we are to sustain the benefits of training from large-scale data scraped from the web. Indeed, the value of data collected about genuine human interactions with systems will be increasingly valuable in the presence of LLM-generated content in data crawled from the Internet.
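The loss of distribution tails described in the abstract can be illustrated with a toy simulation. What follows is a minimal sketch under stated assumptions, not the paper's experimental setup: it uses a single one-dimensional Gaussian rather than an LLM, VAE, or GMM, and the function name `simulate_recursive_fitting`, the sample size of 10, and the 200 generations are all illustrative choices. Each generation is fitted only to samples drawn from the previous generation's fit, so finite-sample estimation error compounds and the fitted standard deviation tends to shrink toward zero.

```python
import random
import statistics


def simulate_recursive_fitting(generations=200, sample_size=10, seed=0):
    """Repeatedly fit a Gaussian to samples drawn from the previous fit.

    Returns a list of (mean, std) pairs. Index 0 holds the assumed
    "real" data distribution N(0, 1); index t holds the fit at generation t.
    """
    rng = random.Random(seed)
    mean, std = 0.0, 1.0  # generation 0: stand-in for human-generated data
    history = [(mean, std)]
    for _ in range(generations):
        # The training set comes only from the current model's output.
        samples = [rng.gauss(mean, std) for _ in range(sample_size)]
        # Refit by maximum likelihood (population std, no bias correction).
        mean = statistics.fmean(samples)
        std = statistics.pstdev(samples)
        history.append((mean, std))
    return history


if __name__ == "__main__":
    history = simulate_recursive_fitting()
    for generation in (0, 1, 10, 50, 100, 200):
        mean, std = history[generation]
        print(f"generation {generation:3d}: mean={mean:+.4f} std={std:.3e}")
```

In this toy setting the shrinking standard deviation is the analogue of vanishing tails: later generations place almost no probability mass on values that were ordinary under the original distribution. Mixing some original data back into each generation's training set would be a natural variation to try, though this sketch does not model it.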

AI Semantic Synergy Context

Connecting this academic literature to real-world market discussions and products.

Reinforcement-learning

Technical advancements in AI focus on model efficiency, with LLM architectural optimizations addressing KV cache problems and TinyLoRA enabling reasoning with fewer parameters. Apple's development ...

Show HN: 30u30.fyi – Is your startup founder on Forbes' most fraudulent list?

The AI agent paradigm is promising but current implementations are fragile. The failure modes compound — each step in a chain has some probability of error, and multi-step chains amplify this expon...

Decoder only model AI making repetitive responses

I think I know what is causing the problem in your code. cross-attention to itself You build a transformerdecoder and passed memory=X, this makes the layer run cross-attention over memory=X. Becau...

Show HN: Output.ai - OSS framework we extracted from 500+ production AI agents

Interesting that this came out of 500 agents in production. The hardest part I've seen with agent tool calls is handling partial failures gracefully — the tool returns something but it's incomplete...

Frequently Asked Questions (FAQ)

Curated market intelligence mapped to this research.

Are there open-source GitHub repositories related to 'AI models collapse when trained on recursively generated data'?

Yes, open-source projects like alvinunreal/awesome-opensource-ai (Curated list of the best truly open-source AI projects, models, tools, and infrastructure.) are actively building upon these concepts.

Which startups are commercializing the technology behind 'AI models collapse when trained on recursively generated data'?

Products like Google Gemma 4 are bringing this to market. Their focus is: Google's most intelligent open models to date.

Are there commercial applications of 'AI models collapse when trained on recursively generated data' in market news publications?

Yes, highly correlated activity was mapped. An entry titled 'Reinforcement-learning' discusses this: Technical advancements in AI focus on model efficiency, with LLM architectural optimizations addressing KV cache problems and TinyLoRA enabling rea...

Cite this Market Intelligence Report

"Commercial Applications of AI models collapse when trained on recursively generated data." ROIpad Intelligence Index, 2026. Available at: https://roipad.com/saas-metrics/research/cr_MTAuMTAzOC9zNDE1ODYtMDI0LTA3NTY2LXk/ai-models-collapse-when-trained-on-recursively-generated-data

Commercial Realization

Startups and Open Source tools heavily associated with the concepts explored in this paper.

alvinunreal/awesome-opensource-ai (GitHub): Curated list of the best truly open-source AI projects, models, too...
Arthur-Ficial/apfel (GitHub): Apple Intelligence from the command line. On-device LLM via Foundat...
Google Gemma 4 (Product Hunt): Google's most intelligent open models to date
OpenRouter Model Fusion (Product Hunt): Run many models side by side and fuse the best answer

Associated Media Narrative

Osaurus, the Native macOS Harness for Local AI Models, Celebrates 7.3K GitHub Stars and #2 Product of the Day on Product Hunt
GlobeNewswire • Jul 20, 2026

The next DeepSeek? A surprise AI breakthrough in China is rattling US market heavyweights.
Business Insider • Jul 17, 2026

AI Chatbot Responses Often Mirror Government Censorship, Report Finds
CNET • Jul 17, 2026