Learn AI Layer by Layer

Paying Attention

In the last chapter, we saw that predicting the next word requires understanding, but our models could only look at a fixed window of recent words. What if the model could choose which earlier words to focus on? That idea, called attention, is the breakthrough that makes modern AI work. And it's built entirely from things you already know.

Some Words Need Other Words

You often can't tell what a word means without looking at other words in the sentence. The word "bank" could be a place for money or the side of a river. You need to see the words around it to know which one. A pronoun like "it" makes no sense on its own. You have to look back and find what it refers to.

This is something our fixed-window model from the last chapter can't do well. It sees the last few words and has no way to reach back and pick out the specific earlier words that matter.

Every one of these examples requires the same ability: given a word, choose which other words in the sentence are relevant to understanding it. Not the nearest words. Not a fixed window. The right words, wherever they happen to be.

This is what attention does. It gives the model the ability to look at every word that came before and decide which ones matter for the word it's currently processing. We'll build up to how it works, step by step.

Match Scores

We need each token to be able to ask a question and find the other tokens that answer it. The token isn't really asking anything; the model does the asking on the token's behalf. But it's easier to follow the math if we picture it from the token's point of view.

Imagine a tiny vocabulary of just four tokens: cat, dog, blah (a filler word), and it (a pronoun). Our goal is to work out which noun "it" refers to.

Each token gets two vectors:

  • A query — what this token is looking for in other tokens.
  • A key — what this token is offering to other tokens.

Both vectors here are just one number. In a key, a 1 means "I am a noun"; in a query, a 1 means "I'm looking for a noun"; a 0 means neither. "cat" and "dog" offer themselves as nouns. "blah" offers nothing. "it" is looking for a noun.

Given a token, we can work out which other tokens it should be paying attention to by multiplying its query (what it is looking for) by the key of each of the other tokens. This is the dot product from the Vectors chapter. The bigger the score, the better the match.

What you offer isn't the same as what you're looking for. A pronoun needs to find nouns without being mistaken for one itself. That's why keys and queries are separate.
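
To make this concrete, here's a minimal sketch of the setup above in PyTorch (the same library used in the optional notebook at the end of the chapter). The 1-and-0 keys and queries mirror the example; a real model's numbers would be learned, not hand-written.

    import torch

    # One-number keys and queries from the example above.
    # A key is what a token offers; a query is what it's looking for.
    tokens  = ["cat", "dog", "blah", "it"]
    keys    = torch.tensor([[1.0], [1.0], [0.0], [0.0]])  # cat and dog offer "noun"
    queries = torch.tensor([[0.0], [0.0], [0.0], [1.0]])  # only "it" seeks a noun

    # Match score = dot product of one token's query with every token's key.
    scores = queries @ keys.T
    print(scores[3])  # "it" vs. cat, dog, blah, it: tensor([1., 1., 0., 0.])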

Dividing Your Attention

Match scores are useful, but raw scores can grow without limit, and the model can't just keep piling them up. Each token has a fixed budget of attention, 100%, and needs to divide that budget among all the candidates. So we need to turn raw scores into percentages that add up to 1. You'll find out later why attention needs to add up to 100%.

The function usually used for this is called softmax. Think of it as a competition: bigger numbers dominate, smaller numbers get squeezed, and the result is a set of percentages.

Now apply softmax to the dot products we computed above. With keys and queries at 1 and 0, "it" in the sentence "cat blah blah it" gets scores [1, 0, 0, 0], which softmax turns into roughly [47.5%, 17.5%, 17.5%, 17.5%]. That's the right shape (cat wins), but more than half the attention leaks to the filler tokens. Not great.

The fix is to crank up the query magnitude. The keys still describe what each token is; a larger query just makes the asker more decisive about the tokens that match best. With "it"'s query at 10 instead of 1, the scores become [10, 0, 0, 0], and softmax turns those into roughly [99.99%, ≈0%, ≈0%, ≈0%]. Nearly all the attention goes to the noun.
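
You can verify both of these distributions with a couple of lines; this sketch uses PyTorch's built-in softmax, which is exactly the competition described above.

    import torch

    # "it" in the sentence "cat blah blah it": score against each token.
    weak   = torch.tensor([1.0, 0.0, 0.0, 0.0])   # query magnitude 1
    strong = torch.tensor([10.0, 0.0, 0.0, 0.0])  # query magnitude 10

    print(torch.softmax(weak, dim=0))    # ~[0.475, 0.175, 0.175, 0.175]
    print(torch.softmax(strong, dim=0))  # ~[0.9999, 0.0000, 0.0000, 0.0000]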

But there's a bigger problem. Consider "blah". It doesn't have a noun-finding question; its query is just zero. So every score is zero, and softmax spreads attention evenly across all four tokens: "blah" pays 25% attention to itself, 25% to cat, 25% to the other blah, and 25% to it, which is meaningless. The same thing happens to "it" if there are no nouns in the sentence. Softmax has nothing to grab onto, so it spreads attention everywhere, with full confidence.

Where Does Attention Go When Nothing Matches?

Softmax always allocates 100% somewhere, even when no real match is available. We need a sensible place for the leftover attention to go.

The first fix is to add a sink token whose only job is to absorb leftover attention. Give it a key that's slightly nounish — say, [0.5] — so it scores higher than zero, but lower than a real noun like [1]. When a real noun is around, the noun wins by a wide margin (because of the cranked-up query magnitude). When no noun is around, the sink wins by default — 0.5 is the only score above zero.

That handles the "it can't find a noun" case. But there's still the other problem we saw above: tokens like "blah" and "cat" don't have a noun-finding query, so their scores are still all zero, and their attention still spreads evenly.

The second fix is to add a "none" dimension to every token's key and query, alongside the existing "noun" dimension. Tokens that aren't looking for a noun get queries pointing at the "none" dimension. The sink advertises in both dimensions: its key becomes [0.5, 0.5], slightly nounish, and the only key offering anything in the "none" dimension. So now off-task tokens have something to search for, and the sink is the only token that answers them.
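
Here's a sketch of both fixes at work, using two-dimensional keys and queries laid out as [noun, none]. The exact numbers (including the sink's own query, which the text above doesn't pin down) are illustrative choices, not learned values.

    import torch

    # Keys and queries now have two dimensions: [noun, none].
    tokens = ["cat", "blah", "blah", "it", "SINK"]
    keys = torch.tensor([
        [1.0, 0.0],  # cat: a real noun
        [0.0, 0.0],  # blah: offers nothing
        [0.0, 0.0],  # blah
        [0.0, 0.0],  # it: not offering itself as a noun
        [0.5, 0.5],  # sink: slightly nounish, the only "none" key
    ])
    queries = torch.tensor([
        [0.0, 10.0],  # cat: not looking for a noun, so it asks for "none"
        [0.0, 10.0],  # blah
        [0.0, 10.0],  # blah
        [10.0, 0.0],  # it: looking for a noun, cranked-up magnitude
        [0.0, 10.0],  # sink (an illustrative choice)
    ])

    attn = torch.softmax(queries @ keys.T, dim=-1)
    print(attn[3])  # "it": ~99% of its attention lands on cat
    print(attn[1])  # "blah": ~97% of its attention lands on the sink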

Real models don't always use a dedicated sink token. Sometimes the model learns to use a punctuation mark as a sink, or a start-of-sentence token, or simply has the token attend to itself. The idea is the same in every case: when this head doesn't have anything useful to do for a particular token, attention has to go somewhere harmless.

What Did You Find?

So far attention tells us where to look — which tokens are relevant. But we also need what to gather from those tokens.

Each token gets a third vector: a value. The query and key decide who talks to whom. The value is what gets said. For our pronoun example, the value encodes which animal the token represents: cat's value is [1, 0], dog's value is [0, 1], and filler tokens (including the sink) have nothing to share, so their value is [0, 0].

Attention then blends these values using the softmax weights as a recipe. If "it" pays nearly 100% attention to "cat" and a fraction of a percent everywhere else, the result is roughly [1, 0] — almost pure cat.

Two copies of the same noun produce a slightly more confident answer than one, not a doubled one. Softmax-weighted blending stays in the same value-space as the input, no matter how many matches there are. Match dog twice, you get dog. Match dog and cat, you get half-dog-half-cat. The mechanism never produces "double dog."

When attention falls on the sink (whose value is [0, 0]), the result is empty. That's a clean way to represent "this head had nothing useful to say."
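
The blending step is a single weighted sum. This sketch continues the example, with values laid out as [cat-ness, dog-ness] and hand-written weights standing in for softmax outputs:

    import torch

    # Values: [cat-ness, dog-ness]. Fillers and the sink have nothing to share.
    values = torch.tensor([
        [1.0, 0.0],  # cat
        [0.0, 0.0],  # blah
        [0.0, 0.0],  # blah
        [0.0, 0.0],  # it
        [0.0, 0.0],  # sink
    ])

    # Attention weights for "it", nearly all on cat (as computed above).
    weights = torch.tensor([0.99, 0.002, 0.002, 0.002, 0.004])
    print(weights @ values)  # ~[0.99, 0.00]: almost pure cat

    # Match dog twice at 50% each: still dog, never "double dog".
    print(torch.tensor([0.5, 0.5]) @ torch.tensor([[0.0, 1.0], [0.0, 1.0]]))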

Multi-Headed Attention

An attention head is one full attention computation — one set of queries, keys, and values, learning one pattern of which words attend to which.

Real models don't run just one head. They run many in parallel, each with its own queries, keys, and values, each free to learn a different pattern. This is called multi-headed attention. Each head produces its own new representation of every token; those per-head outputs are then combined into a single representation that gets passed on.

Why bother with more than one head? Understanding a sentence requires many kinds of questions at once: What does this pronoun refer to? What's the subject of this verb? What clause does this word belong to? Multi-headed attention lets the model ask all of these at the same time, with each head specializing in a different relationship.
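
As a sketch of the combining step: one common approach (the one standard transformers use) is to concatenate the per-head outputs side by side, then mix them with one more learned linear layer. The sizes below are made up for illustration.

    import torch
    import torch.nn as nn

    num_tokens, num_heads, head_dim = 4, 3, 2

    # Pretend each of the 3 heads already produced its own output per token.
    head_outputs = [torch.randn(num_tokens, head_dim) for _ in range(num_heads)]

    # Combine: concatenate, then mix with a learned layer.
    combined = torch.cat(head_outputs, dim=-1)  # shape: (4, 6)
    mix = nn.Linear(num_heads * head_dim, 6)    # learned combination
    print(mix(combined).shape)                  # torch.Size([4, 6])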

Key Insight

Multiple attention heads let the model ask many different questions in parallel. Each head has its own query, key, and value weights, so each learns to attend to a different kind of relationship — pronouns, grammar, structure, semantics. The model doesn't need to choose one kind of attention; it gets all of them at once.

In real models, every head also has its own sink behavior, because no single head can apply to every token in every sentence. A pronoun-resolving head will be on-task for pronouns and idle for everything else. That idle behavior is what the sink is for.

Where the Vectors Come From

We've been talking about query, key, and value vectors — but where do they actually come from?

The answer is the same one we've run into many times before: the model learns them. Gradient descent adjusts the weights to minimize the model's prediction error.

To be more precise, the model learns a simple one-layer neural network (without an activation function, so just a learned matrix multiplication) that transforms the token's current embedding into its query, key, and value vectors.

In a model with multiple attention heads, each head gets its own three sets of weights, so each head translates the same embedding into its own private Q, K, V vectors. That's how different heads end up looking for different things.
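
In PyTorch terms, a one-layer network without an activation function is just nn.Linear. Here's a sketch of one head's three projections, with made-up sizes:

    import torch
    import torch.nn as nn

    embed_dim, head_dim = 8, 4  # made-up sizes for illustration

    # Three learned weight matrices per head, no activation function.
    to_query = nn.Linear(embed_dim, head_dim, bias=False)
    to_key   = nn.Linear(embed_dim, head_dim, bias=False)
    to_value = nn.Linear(embed_dim, head_dim, bias=False)

    embeddings = torch.randn(5, embed_dim)  # 5 tokens, each an 8-number vector
    q, k, v = to_query(embeddings), to_key(embeddings), to_value(embeddings)
    print(q.shape, k.shape, v.shape)  # each: torch.Size([5, 4])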

What We've Built

Attention is:

  1. Match scores — Each query and key combine via dot product to score how relevant each token is to each other token.
  2. Softmax — Turn relevance scores into percentages that add up to 100%.
  3. Values — A third vector from each token, blended together using the softmax weights.
  4. The sink — A token that absorbs attention when nothing relevant is around, so off-task tokens have somewhere harmless to land.
  5. Multiple heads — Run many attention patterns in parallel, each specializing in a different relationship.
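
Put together, one head fits in a few lines. This is a minimal sketch of the three computational steps; real implementations also scale the scores before the softmax and add other details the chapter hasn't covered.

    import torch

    def attention_head(q, k, v):
        """One head: match scores -> softmax -> blended values."""
        scores = q @ k.T                         # 1. dot-product match scores
        weights = torch.softmax(scores, dim=-1)  # 2. budgets summing to 100%
        return weights @ v                       # 3. blend the values

    q, k, v = (torch.randn(5, 4) for _ in range(3))  # 5 tokens, 4-number vectors
    print(attention_head(q, k, v).shape)             # torch.Size([5, 4])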

But our attention mechanism has no idea where words are — only what they are. "The dog chased the cat" and "the cat chased the dog" produce identical attention scores. In the next chapter, we'll fix this with a trick involving rotation.

Try it in PyTorch — Optional

Compute dot-product attention from scratch, build query/key/value projections, visualize attention heatmaps on real sentences, implement multi-head attention, and see how scrambled word order doesn't change attention scores, a problem we'll solve in the next chapter.

Open in Google Colab →

I'd love to hear from you.

I want every chapter to be easy for everyone to understand. Please send a message if anything was unclear, if you'd like something explained in more depth, or if there's something about this part of AI you wanted to understand that the chapter didn't cover. I'll get an email and reply when I can.