Learn AI Layer by Layer

Where Am I?

The meaning of a sentence often depends on the order of the words, not just which words are there. But the attention system we built in the last chapter has no idea where any word is. Let's fix that.

Word Order Matters

Think about "The dog bit the man" versus "The man bit the dog." Same words, completely different meaning, just from changing the order. Who's biting whom flips entirely.

This isn't a weird edge case. Word order is how language works. Moving a single word can flip a sentence's meaning.

Attention Is Position-Blind

The attention system we built in the last chapter can't tell any of these apart.

Remember how attention works. Each word gets an embedding (a list of numbers representing its meaning), and then we compute how much each word should pay attention to every other word using dot products. But the word "cat" gets the same embedding whether it's the first word or the last word. Nothing about position goes into the calculation.

That means "The dog chased the cat" and "The cat chased the dog" produce exactly the same attention pattern. The model literally cannot tell them apart. It's like reading a sentence where all the words are floating in a bag with no order. You can see which words are present, but you have no idea which came first.

We need to give the model a sense of position. There are two main approaches used in practice, and the difference between them turns out to be important.

The Simplest Fix: Distance Penalties

The most obvious idea: just subtract a penalty from the attention score based on how far apart two tokens are. If tokens are 3 positions apart, subtract 3 times some slope m. If they're 10 apart, subtract 10 × m:

score = (query · key) − m × distance

This is called ALiBi (Attention with Linear Biases), and it's used by models like BLOOM and MPT. It's simple and works: nearby tokens automatically get higher attention scores than distant ones.

Different attention heads get different slopes m. Some heads have steep slopes and attend very locally (just the nearest few words), while others have gentle slopes and attend broadly across the whole context. This gives the model a range of distance sensitivities.
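
Here's a rough sketch of the bias ALiBi adds, using the slope schedule from the ALiBi paper (head h gets slope 2^(−8(h+1)/heads) when the head count is a power of two); real implementations also apply a causal mask, which this sketch skips:

```python
import torch

def alibi_bias(seq_len, num_heads):
    # Geometric slope schedule: head 0 is steepest, the last head is gentlest.
    slopes = torch.tensor([2.0 ** (-8.0 * (h + 1) / num_heads) for h in range(num_heads)])
    pos = torch.arange(seq_len)
    distance = (pos[:, None] - pos[None, :]).abs()   # |i - j| for every query/key pair
    return -slopes[:, None, None] * distance         # shape (num_heads, seq_len, seq_len)

scores = torch.randn(4, 8, 8)                        # pretend (heads, seq, seq) attention scores
weights = torch.softmax(scores + alibi_bias(8, 4), dim=-1)
print(alibi_bias(8, 4)[0, 7])                        # steepest head: penalty grows with distance
```

The steep-slope heads end up attending almost entirely to their neighbors; the gentle-slope heads barely notice distance at all.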

The Limits of Linear Distance Penalties

ALiBi has a limitation: the distance penalty is per-head, not per-query. Every query in a given head gets the same distance falloff. The model can't decide "this particular word needs to look far back in the text". The slope is baked into the head.

Consider: "The president visited France. He said the treaty would be honored." The pronoun "He" needs to look at words very close by to see what person "he" is referring to. But "treaty" might need to look many pages back in order to find the details of the treaty. If both happen to land in the same attention head, then they are forced to have the same distance penalty.

ALiBi also limits the distance penalty to a linear slope — the amount of attention each word gets always falls off in a straight line as words get farther away. But that's not the only useful shape. Different jobs want very different falloff patterns: a head resolving a pronoun wants a sharp, local focus, while a head pulling in a fact from much earlier in the text wants almost no falloff at all.

It turns out there is a different approach, RoPE, that overcomes these limits. Nearly every frontier model today (Llama, Mistral, Gemma, Qwen, DeepSeek) uses RoPE. Let's see why.

Rotation Encodes Distance

There's a neater way to give the model a sense of distance, leaning on a fact about dot products: the dot product of two unit vectors only depends on the angle between them. Picture two clock hands — if you rotate the whole clock by 30°, both hands turn together and the angle between them is unchanged, so the dot product is unchanged too. But if the two hands rotate by different amounts, the dot product changes by an amount that depends only on the gap.

Now imagine giving each token a position around a circle, like the numbers around a clock. Tokens close together have a small angle between them, so their dot product is high. Tokens far apart have a larger angle between them, so their dot product is low. The dot product depends only on the distance between the two token positions, not on where in the sequence either token is.

How quickly we rotate per position is called the speed. A higher speed means the pointers spread apart faster, so even a small gap produces a big angle difference and a low dot product. A lower speed means tokens have to be farther apart before their dot product drops significantly. Try changing the speed in the sketch below to see this tradeoff.
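
A minimal sketch of that fact, with a made-up 2D query and key and an arbitrary speed:

```python
import torch

def rotate(v, angle):
    # Rotate a 2D point counterclockwise by `angle` radians.
    angle = torch.tensor(angle)
    c, s = torch.cos(angle), torch.sin(angle)
    return torch.stack([c * v[0] - s * v[1], s * v[0] + c * v[1]])

q = torch.tensor([1.0, 0.3])    # made-up 2D query
k = torch.tensor([0.7, -0.2])   # made-up 2D key
speed = 0.5                     # radians of rotation per position (try 0.1 or 2.0)

# Same gap of 2 positions, different absolute positions -> identical dot product.
near_start = rotate(q, speed * 5) @ rotate(k, speed * 3)
far_along = rotate(q, speed * 12) @ rotate(k, speed * 10)
print(near_start.item(), far_along.item())   # equal: only the gap between positions matters

# A bigger gap changes the dot product.
print((rotate(q, speed * 9) @ rotate(k, speed * 3)).item())
```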

Applying Rotation to a Dimension

How do we connect this geometric idea to actual attention? Take one dimension of our token vectors (say "nouniness") and split it into two coordinates: noun-x and noun-y. Now rotate that 2D point by position × speed. The dot product between a rotated query and a rotated key automatically favors nearby tokens, just like the circle picture above.

This is the core of RoPE (Rotary Position Embeddings). Instead of adding a distance penalty after computing the dot product, we rotate each word's query and key vectors by an angle that depends on position. The dot product naturally encodes distance.

Multiple Rotation Speeds

In practice, RoPE always uses multiple dimension pairs rotating at different speeds. A 128-dimensional vector becomes 64 pairs, each rotating at its own rate. The first pair rotates fast (about one radian, roughly 57°, per position in the standard schedule) while later pairs rotate progressively slower, down to tiny fractions of a degree per position.
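
A quick sketch of that schedule, assuming the common base-10000 recipe from the original RoPE paper:

```python
import math
import torch

head_dim = 128                                                   # 64 dimension pairs
base = 10000.0
# Pair i rotates by base^(-2i/head_dim) radians per position.
speeds = base ** (-torch.arange(0, head_dim, 2).float() / head_dim)

degrees_per_position = speeds * 180 / math.pi
print(degrees_per_position[0].item())    # fastest pair: about 57 degrees per position
print(degrees_per_position[-1].item())   # slowest pair: a tiny fraction of a degree
```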

The power of RoPE is that each token can specify its own distance penalty curve by choosing how much weight to put in dimension pairs that rotate at different speeds: weight in the fast pairs makes the score highly sensitive to distance, while weight in the slow pairs makes it barely change across the whole context.

You Don't Need to Double the Dimensions

You might worry: splitting each dimension into x/y means we need twice as many dimensions. But in practice, models don't add new dimensions. They pair up adjacent existing dimensions. A 128-dimensional head vector becomes 64 pairs: dimensions 1 and 2 form the first pair, 3 and 4 the second, and so on. Each pair is treated as the x and y coordinates of a 2D point and rotated together.

This works because rotation preserves the length of each pair (√(x² + y²) stays the same) while changing its direction. No information is destroyed; it's just rearranged. There is a subtle cost, though: a content vector at one position could, after rotation, look like a completely different content vector at another position. The model has to learn representations that avoid these collisions, but with 64 dimension pairs all rotating at different speeds, two unrelated vectors would have to match across all pairs simultaneously, which is extremely unlikely. In practice, training finds good representations without difficulty.
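
To make this concrete, here's a minimal sketch of RoPE applied to a whole 128-dimensional head vector, pairing adjacent dimensions as described above (an illustration, not any particular library's implementation):

```python
import torch

def rope(x, position, base=10000.0):
    """Rotate adjacent dimension pairs of x by position-dependent angles."""
    dim = x.shape[-1]
    speeds = base ** (-torch.arange(0, dim, 2).float() / dim)   # one speed per pair
    angles = position * speeds
    cos, sin = torch.cos(angles), torch.sin(angles)
    x1, x2 = x[..., 0::2], x[..., 1::2]                         # the (x, y) of each pair
    return torch.stack([x1 * cos - x2 * sin,
                        x1 * sin + x2 * cos], dim=-1).flatten(-2)

torch.manual_seed(0)
q, k = torch.randn(128), torch.randn(128)

# Rotation only changes direction, not length: the norm is preserved.
print(q.norm().item(), rope(q, 7).norm().item())                 # same length

# The score depends only on the gap between positions, not where they sit.
print((rope(q, 10) @ rope(k, 7)).item(),
      (rope(q, 110) @ rope(k, 107)).item())                      # same value, up to float rounding
```

Real implementations differ in small details (some pair dimension i with dimension i + dim/2 instead of its neighbor), but the relative-position property is the same.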

Causal Masking: Only Look Backward

In the circle picture, the dot product looks the same whether the other word is 3 positions before or 3 positions after — rotation mostly tells the model how far apart two words are, not which one came first. The simple fix most models use is causal masking: each token is only allowed to attend to tokens that came before it. Future tokens are blocked out entirely, so the question of direction never comes up.
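
Here's a minimal sketch of that mask (the optional notebook below does the same thing with torch.triu):

```python
import torch

seq_len = 5
scores = torch.randn(seq_len, seq_len)                     # pretend attention scores

# True above the diagonal = positions in the future, which must be hidden.
future = torch.ones(seq_len, seq_len).triu(diagonal=1).bool()
masked = scores.masked_fill(future, float("-inf"))         # -inf becomes 0 after softmax

weights = torch.softmax(masked, dim=-1)
print(weights)                                             # upper triangle is all zeros
```

Each row of the result still sums to 1, but the attention is spread only over the current token and the ones before it.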

Words in Order

We started with a problem: attention couldn't tell "The dog bit the man" from "The man bit the dog." The words were floating in a bag with no order. Now, by rotating each word's vector based on its position, the model can tell how far apart two words are, and each word gets to decide for itself whether to focus on its close neighbors, scan the whole sentence, or anything in between.

Attention now knows three things: what each word means (from embeddings), which other words matter (from attention scores), and where each word is (from rotation). In the next chapter, we'll put all of these pieces together to build the transformer, the architecture behind ChatGPT, Claude, and most modern language models.

Try it in PyTorch — Optional

Demonstrate that toy attention is position-blind, build ALiBi from a linear distance penalty, derive RoPE from the rotation-invariance of the dot product, combine multiple rotation speeds for different attention-distance shapes, and apply causal masking with torch.triu.

Open in Google Colab →

I'd love to hear from you.

I want every chapter to be easy for everyone to understand. Please send a message if anything was unclear, if you'd like something explained in more depth, or if there's something about this part of AI you wanted to understand that the chapter didn't cover. I'll get an email and reply when I can.