Learn AI Layer by Layer

One Architecture to Rule Them All

The transformer is the architecture behind every major AI you've heard of: ChatGPT, Claude, Gemini, Llama. It's the wiring that turns the pieces we've been building (embeddings, attention, positional encoding) into something that can read a sentence and tell you what comes next.

A Transformer At a Glance

The core idea behind the transformer is simple. We pass a passage of text through a sequence of layers. Each layer creates a better representation of each token by combining what is already known about the token with information from other tokens. The transformer uses attention to identify which other tokens have relevant information.

The paper that introduced transformers was titled "Attention Is All You Need", and indeed attention does much of the work: for each token, it identifies which other tokens hold relevant information.
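If it helps to see that loop as code, here is a minimal sketch in PyTorch. The sizes are illustrative, and PyTorch's built-in TransformerEncoderLayer stands in for the layer we'll open up later in the chapter:

    import torch
    import torch.nn as nn

    class TinyTransformer(nn.Module):
        def __init__(self, vocab_size=1000, d_model=64, n_layers=5):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, d_model)   # token id -> vector
            self.layers = nn.ModuleList(
                nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
                for _ in range(n_layers)
            )
            self.unembed = nn.Linear(d_model, vocab_size)    # vector -> next-word scores

        def forward(self, token_ids):                        # (batch, seq_len)
            x = self.embed(token_ids)                        # position encoding omitted for brevity
            for layer in self.layers:
                x = layer(x)     # each layer refines every token's representation
            return self.unembed(x[:, -1])                    # predict from the last token's vector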

The picture below shows a very simple transformer doing its work on a tiny sentence. The final layer's embedding of the last token tells us enough to predict the next word.

The representation of each token gets more sophisticated at each layer by bringing in information from the tokens it paid attention to.

This looks a lot like a regular neural network: boxes in layers, connected by links whose weights control how strongly the signal flows. The big difference: in a normal neural network the connection strengths are learned once during training and then stay fixed for every input. Here, attention computes fresh weights for every input, deciding which information should flow where. And, as you'll see later, each node actually has its own little neural network combining the information flowing in with what it already knows.
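Here's a sketch of that difference for a single attention head. The matrices W_q, W_k, and W_v are the learned, fixed part; the attention weights are computed fresh from each input:

    import torch.nn.functional as F

    def attention(x, W_q, W_k, W_v):
        # x: (seq_len, d_model), one vector per token.
        # W_q, W_k, W_v are learned during training and then fixed.
        Q, K, V = x @ W_q, x @ W_k, x @ W_v
        scores = Q @ K.T / K.shape[-1] ** 0.5
        # Unlike a normal network's fixed wiring, these weights depend on
        # the input itself, so they're recomputed for every sentence.
        weights = F.softmax(scores, dim=-1)  # who listens to whom, and how much
        return weights @ V                   # blend in information from those tokens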

A Transformer In More Detail

The playground below shows a more detailed example with a longer sentence and more layers. Go through the layers one at a time and see how the representation of each token becomes more sophisticated as we bring in information from other tokens.

The model never sees "Earth" spelled out. But by the last layer, the representation of "blue" says something like "a blue thing belonging to the astronaut on Mars, seen in the Martian sky", and that lines up almost exactly with the embedding for "Earth" (a blue planet, home to humans, visible from other planets in our solar system). That's where the prediction comes from.
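How does "lining up" become a prediction? In many models, the final vector is scored against every token's embedding with a dot product, and the closest match gets the highest probability. A rough sketch, assuming the model reuses its input embeddings for output (not all models do):

    import torch

    h_blue = torch.randn(768)              # final-layer vector for "blue" (illustrative size)
    embeddings = torch.randn(50_000, 768)  # one row per vocabulary token

    logits = embeddings @ h_blue           # dot product against every token's embedding
    probs = torch.softmax(logits, dim=-1)
    next_token_id = probs.argmax()         # "Earth" wins if its row lines up best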

Large Language Models are Very Large

The playground example above has 5 layers, with one attention head on each layer. Real transformers are much, much, much bigger. Most frontier models don't disclose their exact dimensions, but GPT-3 had 96 layers with 96 attention heads per layer, and the latest models are believed to be significantly larger.

Modern LLMs also use embeddings with thousands of dimensions (12,288 for GPT-3). This provides enough space to encode complex things like "A blue thing that belongs to a human astronaut, seen by that astronaut in the sky of Mars, which is a planet in Earth's solar system".
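Those two published numbers (96 layers, 12,288 dimensions) are enough to roughly reconstruct GPT-3's headline parameter count. A back-of-the-envelope sketch, using the standard layer layout (exact totals vary with implementation details):

    d, n_layers, vocab = 12288, 96, 50257     # GPT-3's published dimensions

    attention = 4 * d * d                     # query, key, value, and output projections
    mlp = 2 * d * (4 * d)                     # up- and down-projection, hidden size 4*d
    per_layer = attention + mlp               # ~1.8 billion parameters per layer

    total = n_layers * per_layer + vocab * d  # plus the token embedding table
    print(f"~{total / 1e9:.0f}B parameters")  # ~175B, GPT-3's headline number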

Interpretability

The examples given above have their representations written by hand rather than being generated by a real transformer. One of the weird things about real transformers is that we usually don't actually know what the intermediate representations of tokens mean, or even what a particular attention head or particular layer is trying to do. When we train a transformer, it isn't trying to produce layers, heads, or representations that make sense to humans. It is just trying to come up with representations that best help it complete its task, such as predicting the next word.

Trying to understand the inner thought processes of transformers is an area of active research.

Inside One Transformer Layer

We said at the beginning that each layer of a transformer creates a new representation for each token by using attention to find relevant tokens, and then combining information from those tokens with what it already knows about the token.

The playground below gives more detail on how exactly a single layer of a transformer does that. Click on each box to see a description of what it does and how it does it.

(One transformer layer is sometimes also called a transformer block. The two names mean the same thing. We'll stick with "layer" because that's how people describe model size: "GPT-3 has 96 layers".)
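If you'd rather read the boxes as code, here is a sketch of one layer in PyTorch, in the common pre-norm arrangement (real models vary in details such as where normalization goes):

    import torch.nn as nn

    class TransformerLayer(nn.Module):
        """One layer: attend to other tokens, then let each token's own
        small neural network (the MLP) digest what flowed in."""
        def __init__(self, d_model=768, n_heads=12):
            super().__init__()
            self.norm1 = nn.LayerNorm(d_model)
            self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            self.norm2 = nn.LayerNorm(d_model)
            self.mlp = nn.Sequential(            # each token's private little network
                nn.Linear(d_model, 4 * d_model),
                nn.GELU(),
                nn.Linear(4 * d_model, d_model),
            )

        def forward(self, x):
            h = self.norm1(x)
            attended, _ = self.attn(h, h, h)     # gather info from relevant tokens
            x = x + attended                     # combine with what we already know
            x = x + self.mlp(self.norm2(x))      # refine each token independently
            return x

The two `x + ...` lines are the "combine with what we already know" step: they add the newly gathered information onto the token's existing representation rather than replacing it.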

The Big Picture

We've come a long way:

  1. Everything is numbers (Chapter 1): text, images, sound, all just lists of numbers.
  2. Small improvements add up (Chapter 2): gradient descent finds good solutions step by step.
  3. Neural networks learn patterns (Chapter 3): layers of neurons can learn anything.
  4. Vectors capture meaning (Chapter 4): similar things end up with similar numbers.
  5. Embeddings give words geometry (Chapter 5): words live in a space where closeness equals similarity.
  6. Prediction requires understanding (Chapter 6): to guess the next word, you have to get the sentence.
  7. Attention lets words talk (Chapter 7): each word chooses which others to listen to.
  8. Position encoding adds order (Chapter 8): so "dog bit man" isn't the same as "man bit dog".
  9. The transformer wires it all together (this chapter): and meaning emerges from prediction.

The same architecture, scaled up to billions or even trillions of parameters, is what powers ChatGPT, Claude, and Gemini.

Try it in PyTorch — Optional

Build a complete transformer from scratch in PyTorch. Train one on simple patterns, visualize what the attention heads learn, then train a bigger one on real stories and experiment with temperature and generation.

Open in Google Colab →

I'd love to hear from you.

I want every chapter to be easy for everyone to understand. Please send a message if anything was unclear, if you'd like something explained in more depth, or if there's something about this part of AI you wanted to understand that the chapter didn't cover. I'll get an email and reply when I can.