Learn AI Layer by Layer

Understanding by Predicting

A strange claim: if a machine can accurately predict the next word in a sentence, it must understand the sentence. That sounds wrong. Prediction seems too simple to require understanding. But by the end of this chapter, you'll see why it's true, and why this one idea is the foundation of modern AI.

The Game

Start with a game. I'll show you the beginning of a sentence, and you try to guess the next word. Ready?

Look at what just happened. To predict the next word of "The capital of France is ___," you needed to know that the capital of France is Paris. To predict what someone did after opening a letter, you needed to understand human emotions. To predict "two plus two equals ___," you needed arithmetic.

Key Insight

Predicting the next word requires grammar, facts, common sense, even emotional understanding. The better you predict, the more you must understand. Prediction REQUIRES understanding.

This is what we're going to teach a computer to do. And it turns out this single task, predict the next word, is the key to building AI systems that can write, reason, and hold conversations.

The Simplest Approach: Count What Comes Next

How would you build a next-word predictor? The most obvious strategy: read a huge pile of English text, and for every word, count what word usually follows it.

The word "of" is most often followed by "the." The word "he" is often followed by "was." The word "she" by "was." You get the idea.

This is called a bigram model. It predicts the next word based on just the one previous word. It's a giant lookup table: for each word, store a list of what typically follows it, ranked by frequency.

Each individual prediction is reasonable. "of the," "he was," "it is" all sound right. But string them together and you get a random walk through common word pairs. The model has no memory beyond one word. "The cat sat on" means nothing to it. All it sees is "on."
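Here's a minimal sketch of a bigram model in Python, using a toy corpus of a dozen words (a real model would count over billions of words, but the mechanics are identical):

```python
from collections import Counter, defaultdict

def build_bigram_model(text):
    """For each word, count which words follow it and how often."""
    words = text.lower().split()
    follows = defaultdict(Counter)
    for prev, nxt in zip(words, words[1:]):
        follows[prev][nxt] += 1
    return follows

def predict_next(model, word):
    """Return the most frequent follower of `word`, or None if unseen."""
    counts = model[word.lower()]
    return counts.most_common(1)[0][0] if counts else None

# A toy corpus; real models count over billions of words.
corpus = "the cat sat on the mat and the dog sat on the rug"
model = build_bigram_model(corpus)
print(predict_next(model, "sat"))   # "on" follows "sat" both times
```

Notice that the model is literally just a lookup table of counters, which is exactly why it hits the wall described below.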

How does the model choose between its options? It uses the frequencies as probabilities. If "the" follows "of" 33% of the time, the model has a 33% chance of picking it. This randomness is controlled by temperature: at low temperature, the model almost always picks the single most likely word. At high temperature, it picks more evenly from all options. We'll see temperature again when we get to real language models.
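One common way to implement temperature (a sketch, not the only formulation): divide the log of each count by the temperature before converting back to sampling weights. The counts below are invented for illustration:

```python
import math
import random

def sample_with_temperature(counts, temperature=1.0):
    """Sample a next word from follower counts.
    Low temperature -> almost always the most frequent word;
    high temperature -> spreads probability more evenly."""
    words = list(counts)
    # Scale log-counts by 1/T, then subtract the max for numerical stability.
    logits = [math.log(counts[w]) / temperature for w in words]
    m = max(logits)
    weights = [math.exp(x - m) for x in logits]
    return random.choices(words, weights=weights, k=1)[0]

# Hypothetical follower counts for the word "of".
counts = {"the": 33, "a": 10, "his": 5}
random.seed(0)
cold = [sample_with_temperature(counts, 0.1) for _ in range(100)]
hot = [sample_with_temperature(counts, 5.0) for _ in range(100)]
print(cold.count("the"))   # near 100: low temperature is almost greedy
print(hot.count("the"))    # far fewer: high temperature mixes in the alternatives
```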

The N-gram Wall

You can probably see where this is going. What about trigrams (3 words of context)? 4-grams? 10-grams? More context should help, right?

It does, but we hit a wall. The same wall we saw back in Everything Is Numbers with the lookup table explosion.

The problem: a 10-gram model needs to have seen every possible sequence of 10 words to make good predictions. But with a vocabulary of 50,000 words, there are 50,000^10 possible 10-word sequences. That's a number with 47 digits. There aren't enough books, websites, or documents in the world to cover even a tiny fraction of those combinations.
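You can verify the size of that number yourself:

```python
vocab_size = 50_000
context = 10
sequences = vocab_size ** context   # every possible 10-word sequence
print(len(str(sequences)))          # 47 digits
# Even a trillion-word training corpus covers a vanishing fraction of these.
```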

Most 10-word sequences the model encounters will be ones it has never seen before. It can't predict what comes after "The curious elephant carefully examined the ancient" if it's never seen that exact sequence in its training data.

We need a model that can generalize, that can make reasonable predictions for word combinations it's never encountered. Sound familiar? This is exactly the problem we solved with neural networks and embeddings.

Neural Networks to the Rescue

The idea: instead of a lookup table, use a neural network. Feed it the embeddings of the previous few words, pass them through the network's layers, and have it output a prediction for the next word.


Why would this work? Because embeddings capture meaning. "Cat" and "dog" have similar embeddings because they appear in similar contexts. So a neural network that learns "the cat ___" → "was" will automatically generalize to "the dog ___" → "was", because the inputs look almost the same!
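To make "similar embeddings" concrete, here's a sketch using cosine similarity on tiny 4-dimensional vectors. The values are invented for illustration; real embeddings have hundreds of dimensions and are learned from data:

```python
import math

def cosine(a, b):
    """Cosine similarity: 1.0 means same direction, near 0 means unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy embeddings with made-up values, just to show the geometry.
emb = {
    "cat": [0.9, 0.8, 0.1, 0.0],
    "dog": [0.85, 0.75, 0.15, 0.05],
    "car": [0.1, 0.0, 0.9, 0.8],
}
print(cosine(emb["cat"], emb["dog"]))   # close to 1.0: similar meanings
print(cosine(emb["cat"], emb["car"]))   # much smaller: different meanings
```

Because "cat" and "dog" are nearly the same point in this space, any function the network learns for one input will produce nearly the same output for the other.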

The network doesn't need to have seen every possible word. It just needs to learn patterns over the continuous space of embeddings. Below is a real neural network we trained on 50,000 children's stories. It looks at the last 3 tokens and predicts what comes next.
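As a sketch of what such a network looks like in code, here is a minimal, untrained version in PyTorch. The sizes (vocabulary, embedding dimension, hidden layer) are invented for illustration, not the ones used in the trained demo:

```python
import torch
import torch.nn as nn

class NextWordMLP(nn.Module):
    """Predict the next token from the last 3 tokens."""
    def __init__(self, vocab_size=1000, embed_dim=32, context=3, hidden=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.mlp = nn.Sequential(
            nn.Linear(context * embed_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, vocab_size),  # one score (logit) per vocabulary word
        )

    def forward(self, token_ids):           # token_ids: (batch, 3)
        e = self.embed(token_ids)           # (batch, 3, embed_dim)
        e = e.flatten(start_dim=1)          # concatenate the 3 embeddings
        return self.mlp(e)                  # (batch, vocab_size) logits

model = NextWordMLP()
context = torch.tensor([[5, 17, 42]])       # ids of the last 3 tokens
logits = model(context)
probs = torch.softmax(logits, dim=-1)       # a probability for every next word
print(probs.shape)                          # torch.Size([1, 1000])
```

Training would adjust the embeddings and the two linear layers together, so that the probability assigned to the actual next word in the training text goes up.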

Key Insight

Neural networks + embeddings solve the lookup table problem. Instead of memorizing every possible word sequence, the network learns patterns over meanings. Similar words produce similar predictions automatically, because they have similar embeddings.

More Context, Same Problem

Our neural network uses 3 tokens of context, a big improvement over bigrams. We could try 5, 10, or even 20 tokens. More context = better predictions.

But this approach has serious limitations:

Fixed window size. We have to decide in advance how many previous words to look at, but the meaning of a sentence often depends on text from far earlier. A novel might spend a chapter establishing that John is a disgraced surgeon trying to redeem himself. Thousands of words later, the text simply says "John picked up the scalpel." To predict what comes next, the model needs everything it learned about John, far beyond any fixed window. Or a thriller might spend pages laying out the plan, a careful heist with five precise steps. Later, "they began the plan" carries all that meaning, but only if the model can still see where the plan was described.

Size explosion. Longer windows mean more input dimensions, which means more parameters, which means harder training. A window of 10 words with 300-dimensional embeddings gives 3,000 input features. That's manageable, but we'd really like to look at thousands of words for context.

We need something different. Not a fixed window. Not a lookup table. Something that can dynamically choose which earlier words matter, based on what it's trying to predict.

"It Just Predicts the Next Word"

You'll sometimes hear people dismiss large language models by saying they "just predict the next word", as if that means they can't really be intelligent. By now, you should be suspicious of this argument. As we've seen, predicting the next word well requires grammar, facts, common sense, and even emotional understanding. It's a task that demands broad knowledge of the world.

But there's a deeper problem with the dismissal: modern LLMs don't just predict the next word. Next-word prediction is how AI models extract knowledge from the vast amount of text humans have written. However, once they have that knowledge, we use many other techniques (covered in future chapters) to teach them to apply it productively.

Next-word prediction is an important part of how an AI model learns, but it isn't all the model does. It's like how a soccer player builds ball control through simple drills, then applies that skill in a real game.

There's also a deeper irony in dismissing prediction as unintelligent. The neuroscientist Karl Friston's free energy principle proposes that the brain itself is a prediction machine. It builds an internal model of the world and constantly predicts what it will experience next, updating that model when predictions fail. If Friston is right, then learning by prediction isn't a cheap shortcut to intelligence. It may be how all intelligence works.

What's Next

What if the network could choose which earlier words to pay attention to? Instead of only looking at the last 5 words, what if it could look at every word that came before and decide: "For predicting this word, I need to focus on that word three sentences ago, and that word right before me, and ignore everything else"?

That's the idea behind attention, and it changed everything. In the next chapter, we'll see how it works, built from pieces you already know: embeddings, dot products, and neural networks.

Try it in PyTorch — Optional

Build a bigram model from word counts, see the n-gram explosion, train a neural network next-word predictor, visualize how similar words cluster in embedding space, and experiment with temperature sampling.

Open in Google Colab →

I'd love to hear from you.

I want every chapter to be easy for everyone to understand. Please send a message if anything was unclear, if you'd like something explained in more depth, or if there's something about this part of AI you wanted to understand that the chapter didn't cover. I'll get an email and reply when I can.