- Source: Dwarkesh Patel
- Runtime: 11:57
- Snippets: 19
The data black hole at the center of AI
§02
Snippets
-
So one definition of intelligence is sample efficiency. That is to say, how much data do you need in a given domain to operate fluently and competently? And it's actually not clear that we've made that much progress in training sample efficiency over the last few years.
Reframes intelligence as sample efficiency, raising the provocative question of whether AI progress is actually about better learning or just more data.
-
It seems more like we've just dramatically widened and improved the data distribution. The main way that AIs have been getting better is from adding more and better data, and scaling the compute required to develop that data in the first place.
Challenges the popular narrative that algorithmic breakthroughs are driving AI progress, attributing gains primarily to data quantity and quality.
-
Obviously, RL is the main way that this has happened. You can think of RL as basically a kind of synthetic data generation, where you dump a ton of compute against a verifier — or a rubric, if you have an LLM as a judge — in order to find out what the good data is in the first place. And then you train your model to predict these correct rollouts, much in the same way that you might train that model to predict the next word in internet text.
Reconceptualizes reinforcement learning as a compute-intensive data mining process rather than a fundamentally new learning paradigm.
-
For this process to work, the model must have at least some prior probability of anticipating the correct solution in the first place, which is why you need mind-stretching amounts of human expert trajectories in every single field and skill that you want the model to eventually be competent in.
Exposes a fundamental dependency: RL-based AI improvement is bottlenecked by the prior breadth of human expert demonstrations.
-
It's hard to overstate how task-specific and bespoke this human expert data is. If you want some intuition, I recommend checking out the job descriptions on Mercor or Surge's websites. There are listings for Word specialists who will convert legacy documents into polished Word files, and legal experts who will write realistic M&A diligence reports or securities filings, and management consultants who will write up template market research.
Makes the abstract concept of domain-specific training data concrete, revealing a sprawling hidden labor industry supporting AI capabilities.
-
Now imagine if it took a couple decades' worth of courses with hundreds of concurrent professors and millions of practice tasks for you to learn how to polish a Word file. Even the task-count difference here understates the gap, because the models have to grind through each of their far more numerous tasks far harder. Whereas a human student might practice a textbook problem once or twice, with GRPO, these models are generating hundreds to thousands of rollouts per task, and they need to do this to solve the credit assignment problem.
The GRPO analogy vividly illustrates the staggering inefficiency of current AI training compared to human learning, making the sample-efficiency gap viscerally real.
-
The correct way to think about these models is not like a human who has learned all these different skills that you see the models displaying. It's more like a Frankenstein's monster that has been built out of a billion grafts of carefully constructed examples, all sewn together.
The Frankenstein metaphor reframes AI capability as stitched-together pattern matching rather than unified understanding, with deep implications for reliability.
-
I think the reason it is relatively easy for open source and previous laggards to catch up to within months of the frontier is that data is the real driver of progress. And data can be easily distilled from public APIs, whereas hyperparameters, training tricks, and architectural optimizations cannot.
Offers a data-centric explanation for why frontier AI leads are so thin, with major implications for competitive dynamics and IP strategy.
-
It is easy to forget how much data these models are trained on, and how much more it is than what we humans see in our lifetimes. We see these AIs as a galaxy glittering with capabilities. But at their center, invisible to the naked eye, holding all the constellations together, is an unimaginably massive black hole of data.
A striking metaphor that reframes visible AI capability as a surface phenomenon sustained by an invisible, vast data substrate.
-
If a person sees and hears on average, let's say generously, 2,000 words an hour, then between the time they're born and the time they're an adult, they'll see about 200 million tokens. Now, by contrast, these frontier models are trained on somewhere between tens to hundreds of trillions of tokens. That is close to a millionfold difference.
A concrete, quantified comparison that makes the human-AI sample-efficiency gap impossible to dismiss as merely theoretical.
-
If you wanted to, you could learn to teleoperate any random humanoid or robot arm within hours. And if we could get AIs to learn just as fast, robotics would be a deca-trillion-dollar industry, and you'd have an endless army of Unitree G1s doing all kinds of useful work in the world. But the reason we can't do this is that our AIs learn much less efficiently than we do, and even with the millions of hours of demonstrations that we've collected, this is not enough to allow them to perform complex, open-ended tasks.
The robotics example grounds the sample-efficiency problem in trillion-dollar economic stakes, showing what's directly blocked by this unsolved challenge.
-
One thing people will say, and I think Karpathy said this when he came on my podcast, is that for humans, many billions of years of evolution had to go into basically pretraining us. And so we're being unfair when we're comparing how little data we see within our lifetimes to what these cold-started LLMs, which are just starting off with a totally random initialization, have to learn from. I think this is not the right way to think about it. Our genome is only three gigabytes, and only one to two percent of it is protein coding. There is simply not enough space to store the parameters of this network that evolution supposedly pretrained.
The genome-as-storage argument is a sharp rebuttal to the 'evolution pretrained us' defense, forcing a rethink of what biological learning actually inherits.
-
I think the closer analogy is that evolution found the right hyperparameters and the right loss functions, and that within our lifetime, we are still building up the connectome in our brain from scratch. That is to say, the thing analogous to the weights and parameters of the neural net itself. And even if you granted this comparison and said, 'Yes, the hundreds of trillions of tokens these models see to get pretrained is similar to just catching up to evolution,' that still doesn't explain why any new marginal capability that you want to give these models takes so much data.
The hyperparameters-vs-weights analogy reframes the evolution debate productively, but the killer point is the marginal capability problem that persists even after pretraining.
-
My response to this objection is simply that blind or deaf people, who are cut off from parts of this sensory stream, still have general intelligence. That suggests to me that all these billions of sensory tokens are not really the thing that is making humans smart. In fact, deaf people who communicate through sign language and reading, and not through hearing, are probably ingesting far less than the 200 million language tokens that we ballparked earlier, which suggests that even the millionfold difference that we calculated earlier might be an understatement.
The deaf/blind argument is a clever natural experiment that isolates the source of human intelligence and suggests the sample-efficiency gap may be even larger than calculated.
-
If you look at the way the scaling-law equations work, they tell you that the parameter and data terms are added to the loss independently. Suppose you have a model, and you've trained it compute-optimally, and you say, 'I want to be sample efficient. I want to use as little data as possible, and I'll throw in as many parameters as necessary to make that happen.' Take the constants from the Chinchilla scaling-law paper. Even if you increased the number of parameters by infinity, that would only decrease by a factor of ten the amount of data that you need in order to keep the same loss. Humans are somewhere between thousands to millions of times more sample efficient than these models. So scaling the size of current models simply can't make up for that discrepancy, and this really does suggest that humans are on a different scaling curve altogether.
A mathematically grounded argument that scaling model size cannot bridge the human-AI sample-efficiency gap, challenging a core assumption of the scaling hypothesis.
-
Okay, all these nerdy comparisons aside, you might ask: why do we even care about sample efficiency? Is this actually necessary for the labs to achieve the two overarching objectives they have, which are, one, to automate white-collar work, and two, to automate AI research itself? The bet that the labs are making with white-collar work is that the tasks that a software engineer or analyst or accountant needs to do are common, and as a result, you can bring them into the training distribution quite easily.
Sharpens the practical stakes by connecting the abstract sample-efficiency debate to the concrete economic bets AI labs are making on automation.
-
It might be more inefficient to train AIs to do these kinds of tasks than it is to train humans, but so what? Human lifespan simply does not allow for the quantity and the breadth of training that these models experience. If you, as a human, had some weird learning disability where you needed to read through every public repository on GitHub before you could be a competent software engineer, then it would simply not make sense to train you up. You'd be on Social Security by the early stages of your education. But AIs can learn these skills by firehosing gigawatts of training at a time, and what they learn can be amortized across billions of sessions at once.
Flips the framing: AI's sample inefficiency doesn't matter for economic competition because parallelism and amortization make inefficient training commercially viable.
-
Some jobs are so mechanical and predictable that we were able to automate them long before the modern era of AI, for example, bank tellers or travel agents. But there are other jobs that require dealing on a daily basis with problems that are quite distant from the data distribution. I think software engineering is probably one such job. This is the job that AIs are supposed to take first, but I would be willing to bet that there's overall more demand for human software engineers in 2028 than there is right now, largely due to the complementary input of AI.
A specific, falsifiable prediction that AI will increase rather than decrease demand for software engineers — a contrarian bet worth tracking.
-
The labs' plan for this latter category of jobs is first to automate AI research and then have the automated AI researchers solve the sample-efficiency problem. So then the question is: can AIs, which do not have human-level sample efficiency, nonetheless solve the remaining research problems that stand in the way of human-like intelligence and learning? I think that the way people currently think about an intelligence explosion is very clumsy, because either people dismiss the possibility of AIs speeding up AI progress altogether, or they assume that some kind of God pops out the other end. They don't reason carefully about what it looks like to have a period where AI progress is much faster than usual, but to have that happen on top of LLMs and the particular kind of intelligence that LLMs are.
Identifies a critical gap in intelligence-explosion reasoning: most discourse skips the messy intermediate phase where AI accelerates research but remains LLM-constrained.
§03
Synthesis
# The Data Black Hole: Why AI's Intelligence Is Built on Inefficiency
Intelligence, by one meaningful definition, is sample efficiency: how much data you need to operate competently in a domain. By this metric, modern AI has made surprisingly little progress. Instead of learning faster, AI systems have gotten better by consuming vastly more data. This fundamental asymmetry between human and machine learning reveals a central constraint on AI development that scaling alone is unlikely to overcome.
## The Data Engine Behind AI Progress
The narrative of AI advancement typically highlights architectural breakthroughs and algorithmic innovations. The reality is far simpler and more brute-force: AI gets better by ingesting more data and using reinforcement learning (RL) to synthesize it. RL functions as a kind of industrial data generation, where massive computational resources hunt for correct solutions against a verifier or scoring rubric, then train models to replicate those solutions—much like predicting the next word in text.
This process works, but it has a brutal prerequisite. Models require "at least some prior probability of anticipating the correct solution in the first place," which means they need enormous volumes of human expert demonstrations in every domain where they're expected to perform well. The job listings on platforms like Mercor and Surge illustrate the scale: specialists converting legacy Word documents, legal experts drafting M&A diligence reports, management consultants writing market research templates. These aren't generic skills. They're domain-specific, bespoke, and numerous. Each skill requires hundreds of human experts generating examples, writing scoring rubrics, and explaining their reasoning.
The quantity is staggering. While a human student might solve a textbook problem once or twice to master it, these models generate hundreds to thousands of rollouts per task using techniques like GRPO to solve credit assignment. A human learning to polish a Word document doesn't require decades of coursework; an AI does—or something functionally equivalent.
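The group-relative scoring that GRPO relies on can be sketched in a few lines. This is a minimal illustration, not any lab's training code: the function name `group_relative_advantages`, the binary verifier rewards, and the choice of population standard deviation are assumptions made for the example.

```python
import statistics


def group_relative_advantages(rewards):
    """Score each rollout relative to the other rollouts for the same task.

    Assumption: advantage = (reward - group mean) / group std, using the
    population standard deviation. Real implementations differ in details.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0:
        # Every rollout scored the same, so the group carries no signal.
        return [0.0] * len(rewards)
    return [(r - mean) / std for r in rewards]


# Hypothetical verifier results for 8 rollouts of one task (1 = pass, 0 = fail).
rewards = [1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0]
advantages = group_relative_advantages(rewards)
print([round(a, 3) for a in advantages])
```

Because the signal is purely relative within a group, a task on which every rollout fails (or every rollout passes) contributes nothing. That is one way to see why the model needs some prior probability of reaching a correct solution, and why so many rollouts per task are sampled.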
This is why the data industry has become a multibillion-dollar sector. And it explains why open-source models can catch up to frontier models within months: data is the real driver of progress, and it can be easily distilled from public APIs, whereas hyperparameters, training tricks, and architectural optimizations cannot.
## The Millionfold Sample Efficiency Gap
The disparity between human and machine learning is difficult to grasp. A person sees and hears roughly 2,000 words per hour. From birth to adulthood, that accumulates to about 200 million tokens. Frontier AI models train on tens to hundreds of trillions of tokens, a nearly millionfold difference.
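The arithmetic behind these figures is easy to reproduce. The sketch below is a back-of-the-envelope check, not a measurement: the 16 waking hours per day, the 18 years to adulthood, the one-word-per-token equivalence, and the two training-set sizes are assumptions chosen to match the ballpark in the text.

```python
WORDS_PER_HOUR = 2_000        # figure quoted in the talk
WAKING_HOURS_PER_DAY = 16     # assumption
YEARS_TO_ADULTHOOD = 18       # assumption

# Assumption: one word is counted as one token.
human_tokens = WORDS_PER_HOUR * WAKING_HOURS_PER_DAY * 365 * YEARS_TO_ADULTHOOD

# Illustrative endpoints for "tens to hundreds of trillions" of tokens.
model_tokens_low = 10e12
model_tokens_high = 100e12

low_ratio = model_tokens_low / human_tokens
high_ratio = model_tokens_high / human_tokens

print(f"human tokens by adulthood: {human_tokens:,}")
print(f"model-to-human ratio: {low_ratio:,.0f}x to {high_ratio:,.0f}x")
```

Under these assumptions the human total lands near 200 million tokens, and the ratio only approaches a millionfold at the upper end of the range, when the training set reaches hundreds of trillions of tokens.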
Consider other comparisons. A teenager learns to drive a car in about 20 hours of practice. Even counting the 16 years of growing up and building intuition beforehand, that is three to four orders of magnitude less data than Waymo and Tesla use to train autonomous vehicles. A human can learn to teleoperate a robot arm within hours. If AI achieved this efficiency, robotics would be a deca-trillion-dollar industry filled with useful autonomous agents. It isn't, because AI learns far less efficiently.
These comparisons provoke predictable objections. One common response invokes evolutionary pretraining: humans benefited from billions of years of evolution, so comparing lifelong learning to cold-start LLMs is unfair. This argument crumbles under inspection. The human genome is only three gigabytes, with just one to two percent coding for proteins. There's simply insufficient space to store the neural network parameters that evolution supposedly pretrained. Evolution likely discovered good hyperparameters and loss functions, but the brain's connectome—the actual "weights" of the biological network—still builds from scratch within a lifetime.
Another objection claims that multimodal sensory data (vision, hearing, touch) fills the gap, pushing humans' effective token count into the billions. Yet deaf people and blind people retain general intelligence despite missing entire sensory channels. Deaf people using sign language and reading may ingest far fewer than 200 million language tokens, suggesting the true gap is even wider than the millionfold estimate.
A third objection appeals to scaling laws: bigger models are more sample-efficient, so perhaps scaling frontier models one to two orders of magnitude (the brain has roughly 100 trillion synapses versus current models' five trillion parameters) would solve the problem. This misreads the mathematics. Chinchilla scaling laws treat parameter and data terms additively in the loss function. Even infinite parameter scaling decreases the data requirement by only a factor of ten. Humans are thousands to millions of times more sample-efficient. Current scaling curves simply cannot close that gap.
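The factor-of-ten claim can be checked against the parametric loss from the Chinchilla paper, L(N, D) = E + A/N^α + B/D^β. The sketch below uses the fitted constants reported by Hoffmann et al. (2022). The derivation itself is an assumption of this example rather than something spelled out in the talk: it takes compute to scale as N·D, so that a compute-optimal model satisfies αA/N^α = βB/D^β, and then asks how much data a model with unlimited parameters would need to reach the same loss. The function names are hypothetical.

```python
import math

# Fitted constants reported in the Chinchilla paper (Hoffmann et al., 2022)
# for the parametric loss L(N, D) = E + A / N**ALPHA + B / D**BETA.
# The irreducible term E cancels out of the comparison, so it is not needed.
A, B = 406.4, 410.7
ALPHA, BETA = 0.34, 0.28


def data_saving_factor(alpha, beta):
    """Closed form for D / D_inf, starting from a compute-optimal model.

    Assumption: compute scales as N * D, so at the optimum
    alpha * A / N**alpha == beta * B / D**beta.
    """
    return (1 + beta / alpha) ** (1 / beta)


def numeric_check(tokens):
    """Recompute the same ratio numerically for a given token budget."""
    data_term = B / tokens**BETA
    param_term = (BETA / ALPHA) * data_term  # compute-optimal balance
    # With infinitely many parameters the parameter term vanishes, so the
    # data term alone must match the original reducible loss.
    tokens_inf = (B / (param_term + data_term)) ** (1 / BETA)
    return tokens / tokens_inf


factor = data_saving_factor(ALPHA, BETA)
print(f"data reduction with unlimited parameters: about {factor:.1f}x")
```

Under these assumptions the saving is independent of the starting scale and comes out a little under ten, consistent with the order of magnitude quoted above and far short of the thousands-to-millions gap attributed to humans.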
## The Paradox of Inefficient Scaling
This raises an uncomfortable question: if AI is so sample-inefficient, does the inefficiency even matter? The labs are not betting on sample efficiency. They have two overarching objectives: automating white-collar work and automating AI research itself. For white-collar automation, the bet is that the everyday tasks of software engineers, analysts, and accountants are common enough to bring into the training distribution.
The inefficiency is irrelevant if the economics work. A human lifespan does not allow for the quantity and breadth of training these models receive. A human with a "learning disability" that required reading every public GitHub repository before becoming a competent software engineer would never be worth training. But an AI can absorb that data with gigawatts of training compute, then amortize the learned skills across billions of sessions. Ludicrously inefficient training becomes economically rational.
However, not all white-collar work is equally automatable. Some jobs, such as bank teller or travel agent, are mechanical enough that they were automated long before modern AI. Others, like software engineering, require dealing daily with out-of-distribution problems. This is precisely where AI struggles most. The irony is that software engineering is the job labs expect AI to automate first, yet Patel bets that demand for human software engineers will be higher in 2028 than it is today, largely because AI is a complementary input.
The labs' long-term solution is to automate AI research itself, then let automated AI researchers solve the sample-efficiency problem. Whether AI systems lacking human-level learning efficiency can nonetheless make breakthroughs in AI science is "a very complicated question," as Patel notes, one that requires careful thinking about what intelligence explosions actually look like when built atop the particular substrate of large language models rather than abstract superintelligence.
## The Hidden Architecture of Modern AI
The metaphor Patel settles on is stark: frontier models are not like a human who has learned diverse skills. Each is more like "a Frankenstein's monster that has been built out of a billion grafts of carefully constructed examples, all sewn together." At the center of the glittering galaxy of AI capabilities is an invisible black hole: an unimaginably massive concentration of data holding everything together.
This framing matters because it suggests that scaling, architectural innovation, and algorithmic breakthroughs matter far less than the relentless accumulation and curation of training data. It's unglamorous but foundational. Until that dynamic shifts—until we understand why humans learn so differently, or until we find a new scaling paradigm—the data black hole remains the dominant force in AI development.
§04
Fan-out
Questions raised
- 01 Is sample efficiency the best proxy for intelligence, or are there other dimensions that matter more?
- 02 If data distribution is the key driver, what happens when high-quality data becomes scarce or saturated?
- 03 If RL is just expensive data generation, does it offer any advantage over collecting the same data directly from humans?
- 04 Could AI systems generate their own bootstrapping data for domains where human expert trajectories don't exist?
- 05 How does the concentration of high-quality expert data among a few contractors create competitive moats in AI?
- 06 Could better credit assignment algorithms meaningfully close the sample-efficiency gap, or is the gap architectural?
- 07 If AI is a patchwork of examples rather than a coherent reasoner, where are the seams most likely to come apart?
- 08 If data is easily distilled but compute infrastructure is not, does hardware become the true long-term moat?
- 09 What happens to AI capabilities if the 'data black hole' stops growing — when high-quality human data is exhausted?
- 10 Are there ways to measure what portion of human learning comes from explicit linguistic tokens versus embodied sensorimotor experience?
- 11 Is the robotics sample-efficiency problem fundamentally the same as the language model problem, or does physical embodiment introduce unique challenges?
- 12 If the genome can't store network weights, what exactly does evolution encode that makes humans such efficient learners?
- 13 Could we encode better inductive biases into neural network architectures that would replicate what evolution gave humans, without the trillion-token pretraining cost?
- 14 If sensory richness doesn't explain human intelligence, what does — is it the structure of feedback loops, social learning, or something else?
- 15 If humans are on a different scaling curve, what architectural or algorithmic changes would be needed to reach that curve?
- 16 How do AI labs decide which tasks are 'common enough' to be worth collecting training data for versus leaving to humans?
- 17 At what point does the cost of inefficient AI training outweigh the value of amortizing across billions of sessions?
- 18 What evidence would confirm or refute the prediction of rising demand for human software engineers by 2028?
- 19 What would AI-accelerated research look like concretely if constrained to the particular failure modes of LLMs — what problems would it fail to solve?
Concepts to learn
- 01 Sample efficiency
- 02 Training sample efficiency
- 03 Data distribution
- 04 Reinforcement Learning from Human Feedback (RLHF)
- 05 LLM-as-a-judge
- 06 Prior probability in RL
- 07 Expert trajectory
- 08 Data annotation industry
- 09 GRPO (Group Relative Policy Optimization)
- 10 Credit assignment problem
- 11 Out-of-distribution generalization
- 12 Knowledge distillation from APIs
- 13 Data wall
- 14 Tokenization
- 15 Teleoperation learning
- 16 Genome information capacity
- 17 Connectome
- 18 Inductive bias
- 19 Multimodal AI training
- 20 Compute-optimal training
- 21 In-distribution vs. out-of-distribution tasks
- 22 Cost amortization in AI deployment
- 23 Jevons paradox in AI labor
- 24 Complementarity vs. substitution in automation
- 25 Intelligence explosion
References invoked
- 01 Scaling laws literature (Kaplan et al., 2020)
- 02 Mercor and Surge AI — platforms that source expert human data for AI training
- 03 Epoch AI report on open model lag behind frontier models
- 04 Unitree G1 humanoid robot
- 05 Andrej Karpathy interview on Dwarkesh Patel's podcast
- 06 Chinchilla scaling laws paper (Hoffmann et al., 2022)
- 07 I.J. Good's original intelligence explosion concept