After the reflection on cognitive extraction comes a question that looks technical but is in fact decisive: the massive use of synthetic data, generated by AI, to train AI itself. The work of Shumailov and colleagues (Oxford, Cambridge), published in Nature in July 2024, gave a name to the phenomenon that ensues: model collapse, the progressive degeneration of models trained recursively on data they have themselves generated. Their conclusion: the indiscriminate use of model-produced content causes irreversible defects, with the tails of the original distribution (rare cases, diversity) disappearing first.
The mechanism breaks down into four drifts. Loss of diversity: if synthetic data dominate, the same patterns reproduce endlessly; knowledge homogenizes, and creativity, which lives on the variety of viewpoints, is impoverished. Propagation of bias: far from being corrected, initial biases are amplified, because the machine trains on errors it produced itself. Degeneration proper: as successive generations train on the outputs of the previous ones, the model converges toward an impoverished distribution, ever further from reality; this is the study's central result. Erosion of the link with reality: an AI cut off from testimony, experiment and human interaction loses its acuity in reading the perceptible world.
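The degeneration drift can be illustrated with a toy simulation. The sketch below is not the experiment from the Nature study: it assumes a one-dimensional Gaussian as the "model", refits it at each generation on samples drawn from the previous fit, and tracks the fitted standard deviation. The function name `simulate_generations` and its parameters (`generations`, `sample_size`, `real_fraction`, `seed`) are hypothetical, chosen only for illustration.

```python
import random
import statistics


def simulate_generations(generations=1000, sample_size=50,
                         real_fraction=0.0, seed=0):
    """Toy model of recursive training (an illustrative assumption).

    The "real" distribution is N(0, 1). Each generation fits a Gaussian
    to a finite sample, and the next generation samples from that fit.
    `real_fraction` is the share of each sample drawn fresh from the
    real distribution instead of from the previous model.
    Returns the fitted standard deviation after each generation.
    """
    rng = random.Random(seed)
    mu, sigma = 0.0, 1.0          # generation 0 matches the real data
    history = [sigma]
    n_real = round(real_fraction * sample_size)
    for _ in range(generations):
        # Fresh "human" data from the original distribution.
        samples = [rng.gauss(0.0, 1.0) for _ in range(n_real)]
        # Synthetic data produced by the previous generation's model.
        samples += [rng.gauss(mu, sigma)
                    for _ in range(sample_size - n_real)]
        # Refit the model on the mixed sample (maximum-likelihood fit).
        mu = statistics.fmean(samples)
        sigma = statistics.pstdev(samples)
        history.append(sigma)
    return history


if __name__ == "__main__":
    pure = simulate_generations(real_fraction=0.0)
    mixed = simulate_generations(real_fraction=0.5)
    print("purely synthetic, final spread:", pure[-1])
    print("half real data,   final spread:", mixed[-1])
```

With `real_fraction=0.0`, the fitted spread tends to shrink toward zero over many generations, a miniature version of the tails disappearing first. Giving each generation a share of fresh draws from the original distribution tends to keep the spread close to its starting value. A toy Gaussian says nothing quantitative about large language models; it only makes the direction of the drift visible.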
AI may well feed itself — without safeguards, it condemns itself to intellectual autarky.
Why, then, the push toward the all-synthetic? Because the incentives are real: human data are costly to acquire and clean, their legal status grows complicated (GDPR, AI Act, copyright), and the synthetic offers massive, calibrated datasets free of litigation. For some applications — industrial simulation, autonomous driving — it is even the right answer. But for the creation and transmission of knowledge, the richness of human data remains irreplaceable.
The reasonable response is not to ban synthetic data but to balance it. Hybrid corpora, in which verified real data keep the majority. Curation that teaches systems to recognize their own output so as not to re-ingest it. Traceability of training-set composition: knowing when and in what proportion artificial data were injected, and being able to audit it. And finally a regulatory requirement of transparency on that composition, which is to AI what labelling is to food.
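Traceability of composition can be sketched as a small manifest that records, for every batch added to a training set, where it came from, how large it is and when it was injected. Everything below is a hypothetical illustration: the `Batch` record, the origin labels `"human_verified"` and `"synthetic"`, and the helper names are assumptions, not an existing standard or library.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Batch:
    """One entry in a hypothetical training-set manifest."""
    name: str
    origin: str       # assumed labels: "human_verified" or "synthetic"
    n_records: int
    added_on: date    # when the batch was injected into the corpus


def composition(batches):
    """Return the share of records per origin, as fractions of the total."""
    total = sum(b.n_records for b in batches)
    if total == 0:
        raise ValueError("manifest contains no records")
    counts = {}
    for b in batches:
        counts[b.origin] = counts.get(b.origin, 0) + b.n_records
    return {origin: n / total for origin, n in counts.items()}


def real_keeps_majority(batches):
    """Audit check: verified human data must exceed half of the corpus."""
    return composition(batches).get("human_verified", 0.0) > 0.5


if __name__ == "__main__":
    manifest = [
        Batch("archive", "human_verified", 700, date(2025, 1, 10)),
        Batch("generated", "synthetic", 300, date(2025, 2, 1)),
    ]
    print(composition(manifest))
    print("real data in the majority:", real_keeps_majority(manifest))
```

The design choice here is to keep the manifest append-only and dated, so that an auditor can reconstruct the proportion of artificial data at any point in time rather than only the current total.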
The result fuelled a lively debate: some later work qualifies it, showing that well-built hybrid corpora avoid collapse. But its overall direction commands broad agreement: an AI fed on its own production, without fresh human input, degenerates. I wrote this text in early 2025. Shortly afterwards, the conclusion was only reinforced: as the web fills with generated content, verified, contextualized, attributed human knowledge ceases to be an abundant raw material and becomes the scarce resource. Those who preserve it hold the antidote.
This is one of the foundations of the Preservation vs Extraction paradigm: preserve human reasoning at the source, verified and attributed, rather than let the loop close.