On the Folk Theory of Scaling

In 2019, Rich Sutton argued that the most effective methods in AI have been, and will continue to be, those that leverage computation. Instead of trying to design better heuristics that more closely approximate human intelligence, Sutton thought progress would come from models that could improve with scale as greater computational resources became available. In other words, don’t build static systems that seem intelligent the way we are (ie expert systems) but dynamic ones that can get more intelligent with more resources, in ways we might not be able to predict.

The incredible progress in AI from scaling transformer models over the last few years makes Sutton look prophetic. However, it’s possible to overlearn Sutton’s “bitter lesson” and conclude that ever greater scale is sufficient for achieving all kinds of further AI progress (novel research / knowledge creation, reliable agents, etc) instead of merely seeing scalability as a necessary feature of promising approaches. This much stronger thesis isn’t Sutton’s. Let’s call it the folk theory of scaling. The folk theory might just be a strawman no one would defend, but I think it’s useful to articulate why it doesn’t seem right. Or at least gesture at some pro tanto evidence against it.

A huge part of the value of LLMs is that they not only compress and reproduce human knowledge effectively but are also queryable in natural language. ChatGPT has hundreds of millions of users because the chatbot interface is so intuitive. You can just talk to it like a friend or teacher.

But this conversationality doesn’t “come for free” with scale; rather, it required a breakthrough technique: instruction tuning. In fact, when OpenAI first trained an instruction-tuned model, they found that users preferred its outputs to those of a non-instruction-tuned model with more than 100x the parameters. It’s easy to assume that everything must change with a 100x or 1000x scale-up, but sometimes it simply doesn’t.

[Figure: human preference results from OpenAI’s InstructGPT paper]

In the years since that paper, instruction tuning has become standard practice because larger base models aren’t scaling their way to conversationality.

Pretrained LLMs optimize only for next-token prediction. Having one well-specified goal lets the models take advantage of large amounts of training data and compute, but it also means they are fundamentally good at one very precise thing. Compare the responses below from Meta’s latest 70-billion-parameter base model and Mistral’s 7-billion-parameter instruction-tuned model.
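
Here’s a rough sketch of how you could run that kind of comparison yourself with Hugging Face’s transformers library; the model IDs and prompt are stand-ins rather than the exact ones used for the responses above.

```python
# Rough sketch of the comparison with Hugging Face transformers. The model IDs
# and prompt are illustrative, and the 70B model needs serious hardware
# (or quantization) to run at all.
from transformers import AutoModelForCausalLM, AutoTokenizer

PROMPT = "What are three things I should see on a weekend trip to Paris?"

def complete(model_id: str, prompt: str, max_new_tokens: int = 100) -> str:
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Strip the prompt tokens so we only see what the model added.
    return tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)

# A base model tends to just continue the text (often with more questions or a
# list of unrelated prompts), because continuation is all it was trained for.
print(complete("meta-llama/Meta-Llama-3-70B", PROMPT))

# A much smaller instruction-tuned model treats the prompt as a request and answers it.
print(complete("mistralai/Mistral-7B-Instruct-v0.2", PROMPT))
```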

Instead of getting conversationality as a side-effect of greater scale, we needed an entirely new trick. We had to tack on an extra learning step to distort the model’s pretraining so it can serve two masters: next-token prediction and conversationality. The folk theory seems flawed.
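
To make that extra step concrete, here’s a minimal sketch of supervised instruction tuning on a single toy instruction-response pair (the model ID and data are placeholders). Notice that the loss is still next-token prediction; what changes is what the model is being asked to predict.

```python
# Minimal sketch of the extra learning step: supervised instruction tuning.
# The loss is still next-token prediction, but over an (instruction, response)
# pair, with the prompt tokens masked so only the response is penalized.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "gpt2"  # stand-in for a real pretrained base model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

prompt = "Instruction: Explain what a transformer is.\nResponse: "
response = "A transformer is a neural network architecture built around self-attention."

prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
response_ids = tokenizer(response, return_tensors="pt").input_ids
input_ids = torch.cat([prompt_ids, response_ids], dim=1)

# Labels of -100 are ignored by the cross-entropy loss, so gradients only
# come from predicting the response tokens.
labels = input_ids.clone()
labels[:, : prompt_ids.shape[1]] = -100

loss = model(input_ids=input_ids, labels=labels).loss  # same objective as pretraining
loss.backward()
optimizer.step()
optimizer.zero_grad()
```

In practice this runs over many thousands of curated examples, and OpenAI’s recipe also layered reinforcement learning from human feedback on top, but the core move is the same: keep the next-token objective and change what the model is predicting the next token of.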

If just scaling next-token prediction doesn’t get you everything, maybe scaling next-token prediction with instruction tuning could? Maybe this time really is different? One reason to think it might is that such a system is “natural language complete”, ie it can parse and produce anything any other natural language user can. And since everything an intelligence does can (maybe?) be expressed in natural language, mastery of the world might simply grow with mastery of natural language.

On the other hand, there’s probably a difference between knowing-that and knowing-how. Maybe mastery over natural language and facts about the world doesn’t get you tool use or knowledge of how to act in the world. Being able to specify physically what’s involved in throwing a baseball is different from knowing how to throw a baseball, which also requires eg awareness of and control over your limbs. Then again, since everything digital is expressible in language/code, which we’re assuming AIs will get arbitrarily good at, maybe there’s no “digital proprioception” or other additional capacity AI systems will need in order to reliably use digital tools (and of course digital tools can control physical tools).

Today, AI systems that can carry out complicated tasks usually look like the expert systems Sutton’s bitter lesson is supposed to steer us away from. They’re broadly defined by a series of if/then statements that narrowly incorporate calls to LLMs for assistance with parsing and creating natural language. The Generative Agents paper is an interesting example. LLMs are the duct tape that makes everything work, but the generative agents themselves were intelligently designed in the image of their creators. Their capacities to observe, reflect, and plan weren’t learned; they were architected.
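
Schematically, such a system looks something like the sketch below; it’s illustrative rather than the paper’s actual code, and call_llm is a hypothetical stand-in for whatever completion API you’d use.

```python
# Illustrative sketch of the pattern: the observe/reflect/plan control flow is
# hand-designed, and the LLM is only called to read and write natural language
# at each step. `call_llm` is a hypothetical wrapper, not a real API.
from typing import List

def call_llm(prompt: str) -> str:
    """Hypothetical helper: swap in a real chat/completions API call here."""
    return "(LLM output would go here)"

class Agent:
    def __init__(self, name: str):
        self.name = name
        self.memory: List[str] = []  # plain-text memory stream

    def observe(self, event: str) -> None:
        self.memory.append(event)

    def reflect(self) -> None:
        # *When* to reflect is an architectural decision (here: every 10 observations);
        # only the content of the reflection comes from the LLM.
        if self.memory and len(self.memory) % 10 == 0:
            insight = call_llm("What high-level insights follow from these memories?\n"
                               + "\n".join(self.memory))
            self.memory.append("Reflection: " + insight)

    def plan(self) -> str:
        return call_llm(f"{self.name} remembers:\n" + "\n".join(self.memory[-10:])
                        + "\nWhat should they do next?")
```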

The question, then, is whether scaled-up LLMs will generalize to handle more than just narrow natural language processing steps in intelligently designed expert systems, or whether we’ll need fundamentally new tricks to expand their capabilities.
