Glossary

Words this tutorial uses, in one place

Every technical word this tutorial uses, with a short definition and links to the chapters where it appears. Click any word with a dotted underline in a chapter to see its entry without leaving the page.

activation function

An activation function is the function applied at the end of a neuron to turn its weighted sum into a final output.

Without one, stacking neurons in layers would collapse mathematically into a single linear model — you'd get nothing extra from depth. The activation function is what makes deep networks more powerful than shallow ones.

Some common activation functions:

sigmoid — smoothly squashes any number into the range 0 to 1. Big positive inputs become close to 1, big negative inputs become close to 0, with a smooth slope in between. The classic activation, used in the neurons chapter's examples.
ReLU — max(0, x). Passes positive numbers through unchanged; turns negatives into 0. Simpler than sigmoid and very widely used.
Swish (also called SiLU) — x · sigmoid(x). Close to ReLU for large positive values, but smooth, with a small dip below zero on the negative side instead of ReLU's flat dead zone. Many recent large language models use Swish-based variants such as SwiGLU.
tanh — like sigmoid but ranges from −1 to 1 instead of 0 to 1.
softmax — applied at the very end of a classifier to turn raw scores into probabilities that sum to 1.

All activation functions share one critical property: they're smooth enough that training can use small parameter nudges to improve the model.
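
The functions listed above are short enough to write out directly. Here is a minimal sketch in plain Python; the function names are our own, and the softmax subtracts the largest score before exponentiating, a common numerical safeguard that the definitions above do not mention.

```python
import math

def sigmoid(x):
    """Squash any real number into the range 0 to 1."""
    return 1.0 / (1.0 + math.exp(-x))

def relu(x):
    """Pass positive numbers through unchanged; turn negatives into 0."""
    return max(0.0, x)

def swish(x):
    """x * sigmoid(x): close to ReLU for large positive x, small dip for negative x."""
    return x * sigmoid(x)

def softmax(scores):
    """Turn a list of raw scores into probabilities that sum to 1."""
    biggest = max(scores)  # subtracting the max avoids overflow in exp
    exps = [math.exp(s - biggest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

for x in (-5.0, -1.0, 0.0, 1.0, 5.0):
    print(f"x={x:+.1f}  sigmoid={sigmoid(x):.3f}  relu={relu(x):.1f}  "
          f"swish={swish(x):+.3f}  tanh={math.tanh(x):+.3f}")

print(softmax([2.0, 1.0, 0.1]))
```

Running it shows the shapes described above: sigmoid stays between 0 and 1, ReLU is flat at 0 for negative inputs, and Swish dips slightly below zero there.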

Introduced in: Neural Networks · Also appears in: PyTorch from Scratch, Attention, Embeddings, Vectors

algorithm

An algorithm is a step-by-step procedure you can follow to produce a result.

The key thing about an algorithm is that you don't need to be clever to follow it: just do the steps in order, and the right answer drops out. That's why algorithms are how computers do anything.

Some examples:

Long division — a sequence of steps that turns any division problem into the right answer, without needing to understand what division "really means" each time.
A cooking recipe — measure, mix, bake; the recipe doesn't care who's following it, it just produces the cake.
Sorting a list — many algorithms exist (bubble sort, quicksort, …), all producing the same sorted output from a fixed set of steps.
Incremental optimization — measure error, make a small change, keep it if it helped, repeat. The algorithm behind evolution, the scientific method, and AI training.
Gradient descent — the algorithm that makes training a neural network practical: at every step, calculate the direction that shrinks the loss fastest, step that way, repeat.

Every part of AI is one or more algorithms layered together. The cleverness is in designing the right algorithm; once you have it, the running is mechanical.
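
To make "the running is mechanical" concrete, here is a small sketch of the incremental optimization recipe from the list above: measure error, make a small change, keep it if it helped, repeat. The loss function, its target value of 3, the step size, and the fixed random seed are all made up for illustration.

```python
import random

def loss(w):
    """Hypothetical error measure: how far w is from the target value 3."""
    return (w - 3.0) ** 2

def incremental_optimize(w, steps=2000, step_size=0.1, seed=0):
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    best = loss(w)
    for _ in range(steps):
        candidate = w + rng.uniform(-step_size, step_size)  # make a small change
        candidate_loss = loss(candidate)                    # measure the error
        if candidate_loss < best:                           # keep it if it helped
            w, best = candidate, candidate_loss
    return w

w = incremental_optimize(w=0.0)
print(round(w, 2), loss(w))
```

Nothing in the loop needs to understand the problem. It only follows the steps, and the parameter drifts toward the value with the lowest error.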

Introduced in: Optimization · Also appears in: Computation, Neural Networks

attention

Attention is the mechanism that lets a token in one layer of a transformer know which tokens in the previous layer to get information from.

For each token, attention answers a single question: out of all the tokens in the previous layer, which ones have the information I need? It does this by giving every token three vectors:

A query — what kind of information this token is looking for (e.g. "a noun").
A key — a label describing the kind of information this token offers to others (e.g. "a noun"). The key is a tag, not the actual content.
A value — the actual content this token shares when others attend to it (e.g. the specific identity "cat").

Each token's query is matched against every other token's key via dot product; the highest-scoring matches contribute their values most strongly, and the token's new representation is the softmax-weighted blend of those values. In short: query asks, key advertises, value delivers.
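
The query, key, and value steps can be sketched for a single token in plain Python. The vectors below are invented toy numbers, and the division by the square root of the vector length is the scaling used in the transformer paper rather than something stated in this entry.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(scores):
    biggest = max(scores)
    exps = [math.exp(s - biggest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Blend the value vectors, weighted by how well each key matches the query."""
    scale = math.sqrt(len(query))                   # assumed scaling, see lead-in
    scores = [dot(query, k) / scale for k in keys]  # query asks, key advertises
    weights = softmax(scores)
    dim = len(values[0])
    blended = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return weights, blended                         # value delivers

# Toy 2-dimensional vectors, invented for illustration.
keys = [[4.0, 0.0], [0.0, 4.0], [0.0, 4.0]]    # only token 0 advertises "noun"
values = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.5]]  # the content each token shares
query = [4.0, 0.0]                             # "looking for a noun"

weights, blended = attend(query, keys, values)
print(weights)
print(blended)
```

Because the query lines up with the first key only, almost all of the weight lands on the first token and the blended result is very close to that token's value.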

Some examples:

A pronoun resolving itself — "she" has a query for "female noun"; "the woman" has a key advertising female-noun info and a value carrying who that woman is. The query matches the key, and "she" pulls in the value.
Disambiguating "bank" — in "she made a deposit at the bank", "bank" asks "is there a money-context word nearby?"; "deposit"'s key advertises that, so "bank" gathers in its value — the financial context — rather than river information.
A name resolved across paragraphs — a novel introduces "John" once at the start; thousands of words later, in "John picked up the scalpel", attention can gather information from the original introduction. A model with a small fixed window cannot do this.
Inside ChatGPT — every token in your prompt and the model's response uses attention to gather information from every previous token, in parallel, at every layer.

Attention is the single biggest reason modern AI works as well as it does — it's the A in Attention Is All You Need, the paper that introduced the transformer.

Introduced in: Attention · Also appears in: Computation, Neural Networks, Next-Word Prediction, Positional Encoding, Transformers, Vectors

attention head

An attention head is one full attention computation — its own set of query, key, and value weights, learning one specific pattern of which tokens attend to which.

A single attention head can only learn one kind of relationship — pronoun resolution, say, or subject–verb linking, or grouping related ideas. To handle the many different relationships in a sentence at once, transformers run many heads in parallel, each free to learn a different pattern. This is called multi-head attention.

Some examples:

A pronoun-resolution head — query, key, and value weights tuned to spot pronouns and connect them to the nouns they refer to.
A subject–verb linking head — different weights, tuned to bind verbs to their subjects.
A long-range head — weights and (with positional encoding) distance falloff tuned to pull in context from many sentences back.
GPT-3 — 96 heads per layer over 96 layers, so 96 × 96 = 9 216 attention heads in the whole model, each free to specialise in a different pattern.

The model doesn't need to choose one kind of attention; with many heads it gets all of them at once.

Introduced in: Attention · Also appears in: Positional Encoding, Transformers

backpropagation

Backpropagation is an efficient algorithm for calculating the gradient of every parameter in a multi-layer neural network.

To train a model with gradient descent, you need to know each parameter's gradient — how much to nudge it, and in which direction, to reduce the loss. Every parameter has a gradient, even one buried several layers deep: in principle you could find it by nudging that parameter up and down a tiny amount and watching how the loss changes. But doing that one parameter at a time would mean a full forward pass through the model for every single parameter — completely impractical when a network has billions of them.

This is really just calculus. If you have a function that depends on many variables (here the loss depends on all the model's parameters), you can differentiate it with respect to any one of those variables to get that variable's gradient. Backpropagation is just how that differentiation works out in a neural network.

The trick is to start at the output and work back through the layers. A parameter's gradient depends on everything between it and the output, so once you know the gradients of the layer just after a parameter, the chain rule of calculus gives you that parameter's gradient almost for free. Starting at the output and stepping one layer back at a time, each layer's gradients fall out of the gradients of the layer in front of it — one backward pass through the network and every parameter's gradient is computed.
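
Here is a minimal sketch of that backward pass, assuming a toy network with one hidden sigmoid neuron, one linear output, and a squared-error loss (all choices of ours, not the tutorial's). It also runs the slow "nudge each parameter" method so the two sets of numbers can be compared. The values computed are ordinary mathematical derivatives.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(params, x, target):
    """Two-layer toy network: x -> hidden neuron -> output -> squared error."""
    w1, b1, w2, b2 = params
    h = sigmoid(w1 * x + b1)       # hidden layer
    y = w2 * h + b2                # output layer
    loss = (y - target) ** 2
    return h, y, loss

def backprop(params, x, target):
    """One backward pass: every derivative via the chain rule, output first."""
    w1, b1, w2, b2 = params
    h, y, _ = forward(params, x, target)
    d_y = 2.0 * (y - target)       # d loss / d y
    d_w2 = d_y * h                 # output-layer parameters
    d_b2 = d_y
    d_h = d_y * w2                 # pass the signal back to the hidden layer
    d_z = d_h * h * (1.0 - h)      # back through the sigmoid
    d_w1 = d_z * x                 # hidden-layer parameters
    d_b1 = d_z
    return [d_w1, d_b1, d_w2, d_b2]

def nudge_gradients(params, x, target, eps=1e-6):
    """The slow alternative: nudge each parameter and re-run the forward pass."""
    grads = []
    for i in range(len(params)):
        up = list(params)
        up[i] += eps
        down = list(params)
        down[i] -= eps
        diff = forward(up, x, target)[2] - forward(down, x, target)[2]
        grads.append(diff / (2 * eps))
    return grads

params = [0.5, -0.3, 0.8, 0.1]     # arbitrary starting values
fast = backprop(params, x=1.5, target=1.0)
slow = nudge_gradients(params, x=1.5, target=1.0)
print(fast)
print(slow)
```

The two lists agree to several decimal places, but the nudging version needed two forward passes per parameter while backpropagation needed one forward and one backward pass in total.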

Introduced in: Neural Networks · Also appears in: PyTorch from Scratch

bias

A bias is a constant a neuron adds to its weighted sum to control how easily it fires.

A large positive bias makes the neuron easy to activate — it fires even with weak input. A negative bias makes it stubborn — it stays quiet unless the inputs strongly argue for firing.

Some examples:

bias = 0 — the neuron flips around the zero crossing of its weighted sum; no built-in preference for firing or staying quiet.
bias = +5 — the neuron leans toward firing; it activates unless the inputs strongly argue against it.
bias = −5 — the neuron leans against firing; it stays quiet unless the inputs strongly support firing.

Every neuron in a network has one bias of its own, so a model has as many biases as it has neurons. That is far fewer than its weights, since each neuron has one bias but one weight per input. Together with weights, biases are the parameters of a neural network; training adjusts all of them.
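
A quick way to see the effect is to hold the inputs and weights fixed and change only the bias. This sketch assumes a sigmoid neuron; the input and weight values are made up so that the weighted sum comes out at exactly 1.

```python
import math

def neuron(inputs, weights, bias):
    """Weighted sum plus bias, passed through a sigmoid activation."""
    weighted_sum = sum(x * w for x, w in zip(inputs, weights))
    return 1.0 / (1.0 + math.exp(-(weighted_sum + bias)))

inputs = [1.0, 0.5]
weights = [2.0, -2.0]   # weighted sum = 2.0 - 1.0 = 1.0

for bias in (0.0, 5.0, -5.0):
    print(f"bias={bias:+.0f}  output={neuron(inputs, weights, bias):.3f}")
```

With the same mildly positive input, the neuron with bias +5 is almost fully on, the one with bias −5 is almost fully off, and the one with bias 0 sits in between.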

Introduced in: Neural Networks · Also appears in: PyTorch from Scratch, Embeddings, Optimization, Positional Encoding, Vectors

causal masking

Causal masking is the rule that each token in a language model is only allowed to attend to tokens before it in the sequence — never to future tokens.

This serves two purposes. First, it matches the training task: a next-word predictor is supposed to predict the next word without seeing it, so it can't be allowed to peek ahead. Second, it removes any ambiguity about direction — positional encoding tells the model about distance, and causal masking handles the question of which side.

Some examples:

The first token — has nothing to attend to except itself; its representation can only be based on what it inherently is.
The fifth token — can attend to tokens 1 through 4, plus itself. Cannot see tokens 6 onward.
Generating text — when the model predicts the next word, causal masking guarantees the prediction depends only on what came before, not on what's about to come.

Causal masking is what makes a transformer a causal (left-to-right) language model. Other transformer variants — used for translation, classification, or filling in masked words — don't apply causal masking because they're allowed to see the whole input at once.
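
One common way to apply the rule is to blank out the disallowed positions before the softmax. The sketch below assumes that approach and uses identical raw scores so that the mask is the only thing shaping the result; note that the code counts tokens from 0, while the examples above count from 1.

```python
import math

def causal_mask(n):
    """mask[i][j] is True when token i may attend to token j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]

def masked_softmax(scores, allowed):
    """Softmax over one row of attention scores, ignoring disallowed positions."""
    kept = [s for s, ok in zip(scores, allowed) if ok]
    biggest = max(kept)
    exps = [math.exp(s - biggest) if ok else 0.0 for s, ok in zip(scores, allowed)]
    total = sum(exps)
    return [e / total for e in exps]

n = 5
mask = causal_mask(n)
scores = [[1.0] * n for _ in range(n)]   # identical raw scores, for illustration
weights = [masked_softmax(row, allowed) for row, allowed in zip(scores, mask)]
for row in weights:
    print([round(w, 2) for w in row])
```

The first row puts all of its weight on the first token, and each later row spreads its weight over itself and the tokens before it, with zeros for everything after.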

Introduced in: Positional Encoding

context

Context is the set of surrounding tokens a language model uses when making a prediction.

When a language model predicts the next word, it doesn't just look at the previous word — it looks at all the tokens leading up to that point. Those tokens are its context. The maximum number of tokens it can use at once is called the context window; bigger windows let the model use longer-range information.

Some examples:

A bigram model — context is one token: just the previous word.
An n-gram model — context is the previous n−1 tokens. A few words at most.
GPT-2 — context window of 1024 tokens. Enough for a few paragraphs.
GPT-4 Turbo — context window of 128 000 tokens. Long enough to hold a small book.

Bigger context windows are expensive: vanilla attention's cost grows with the square of sequence length, so doubling the context roughly quadruples the compute. Modern long-context models use techniques like sliding-window attention or linear-attention variants to avoid the full quadratic cost.

Introduced in: Next-Word Prediction · Also appears in: Positional Encoding, Transformers

cosine similarity

Cosine similarity is the dot product of two unit vectors — a single number from −1 to 1 measuring how similar the two vectors are.

Because both vectors are scaled to size 1, the dot product depends only on the direction they point: 1 if they point the same way, 0 if they're at right angles, −1 if they point in opposite directions. Vector sizes can't sneak in to distort the comparison.

Some examples:

A vector compared with itself — cosine similarity = 1. Same direction exactly.
Two vectors at right angles — cosine similarity = 0. They share nothing.
A bear vector and a dog vector — cosine similarity high (≈ 0.8) because they share many properties. A bear and a mouse score much lower.
Two word embeddings — "cat" and "dog" have high cosine similarity because they appear in similar contexts and the embedding has placed them nearby. "Cat" and "democracy" sit nearly perpendicular.

Cosine similarity is how AI models compare meaning at scale — search, recommendation, clustering, and retrieval all ultimately ask "how close is this vector to that one?" and answer with cosine similarity.

If you're familiar with trigonometry, it's called cosine similarity because it's the cosine of the angle between the two vectors.
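
In code, the definition is a dot product divided by the two vector sizes, which is the same as scaling both vectors to size 1 first. This is a minimal sketch with made-up vectors; it assumes neither vector is all zeros, since that would mean dividing by zero.

```python
import math

def cosine_similarity(a, b):
    """Scale both vectors to size 1, then take their dot product."""
    size_a = math.sqrt(sum(x * x for x in a))
    size_b = math.sqrt(sum(x * x for x in b))
    return sum(x * y for x, y in zip(a, b)) / (size_a * size_b)

print(cosine_similarity([1.0, 2.0], [1.0, 2.0]))    # same direction
print(cosine_similarity([1.0, 0.0], [0.0, 3.0]))    # right angles
print(cosine_similarity([1.0, 2.0], [-2.0, -4.0]))  # opposite directions
```

The three calls land at (or within rounding error of) 1, 0, and −1, and the third shows that making a vector longer does not change the answer.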

Introduced in: Vectors · Also appears in: Embeddings

dimension

A dimension is one of the properties a vector measures.

A colour vector like (1, 0.5, 0) has 3 dimensions — red, green, and blue — and the numbers (1, 0.5, 0) are the values along those dimensions. The total count of dimensions is also called the vector's dimension, so this is a "3-dimensional" vector. A 768-dimensional embedding has 768 dimensions, each capturing one (learned) aspect of meaning.

Some examples:

2 dimensions — a point on a map: how far east, how far north.
3 dimensions — a colour on a screen: how much red, green, blue.
6 dimensions — the animal vectors from the chapter: big, scary, hairy, cuddly, fast, fat.
300 dimensions — a GloVe word embedding, learned to capture meaning.
12 288 dimensions — one token inside GPT-3, capturing meaning and context.

More dimensions let a vector make finer distinctions, at the cost of more storage and computation. Modern AI models lean on very high-dimensional vectors precisely because that's where the meaning fits.

Introduced in: Vectors · Also appears in: PyTorch from Scratch, Attention, Embeddings, Next-Word Prediction, Optimization, Positional Encoding, Transformers

dot product

A dot product is a single number that measures how strongly two vectors point in the same direction.

You calculate the dot product by multiplying matching pairs of numbers from the two vectors and adding up the results. For two 2D vectors: (a, b) · (c, d) = a×c + b×d. For longer vectors, the pattern is the same — multiply matching numbers, then sum.
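
The multiply-and-sum rule is a one-liner in Python. This sketch uses our own function name and adds a length check, since the rule only makes sense for vectors with the same number of dimensions.

```python
def dot(a, b):
    """Multiply matching pairs of numbers, then add up the results."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    return sum(x * y for x, y in zip(a, b))

print(dot([5], [3]))              # 15: one dimension is plain multiplication
print(dot([1, 2], [3, 4]))        # 11: 1*3 + 2*4
print(dot([1, 0, 2], [4, 5, 6]))  # 16: 1*4 + 0*5 + 2*6
```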

A few illustrative cases:

For two 1-dimensional vectors (just numbers) — the dot product is plain multiplication: 5 · 3 = 15.
For two unit vectors (each scaled to size 1) — the dot product says how much the two vectors point in the same direction, or, equivalently, describe the same thing. The result lands between −1 and 1: 1 means identical, 0 unrelated, −1 opposite. This clean version is called cosine similarity.
For non-unit vectors — the dot product still rewards agreement on which dimensions matter, but it also grows with the magnitudes of the two vectors. A vector that scores high on every property would give a huge dot product with anything, even unrelated things; that's why we usually scale to unit length first when we want a fair similarity comparison.
Inside a neuron — the weighted sum is a dot product between the input vector and the weight vector. The weight vector's direction (its unit vector) says what the neuron is looking for; its magnitude says how sensitive the neuron is.

The dot product turns up everywhere in modern AI — it's how neurons recognise patterns, how attention decides which words to focus on, and how embeddings compare meanings.

Introduced in: Vectors · Also appears in: Attention, Neural Networks, Next-Word Prediction, Positional Encoding

embedding

An embedding is a vector of numbers that represents the meaning of an item (e.g. a word), where similar items have similar vectors.

Embeddings are how AI models turn things — words, tokens, images, sounds — into the kind of numerical data they can do math on.

Some examples:

A hand-picked animal vector — properties chosen by humans: big, scary, hairy, etc. The same shape as an embedding, but with axes humans decided rather than learned.
A 300-dimensional GloVe word embedding — learned from huge amounts of text so that "king" and "queen" end up close, and "king − man + woman ≈ queen" works as vector arithmetic.
A 768-dimensional GPT-2 token embedding — each token gets a vector that captures both its meaning and how it's being used.
A 12 288-dimensional GPT-3 token embedding — much more room to encode subtle distinctions.

Once items live in the same vector space, the dot product between two embeddings measures how related they are — and sometimes arithmetic on embeddings ("king − man + woman") corresponds to meaningful operations on the underlying meanings.

Introduced in: Embeddings · Also appears in: Attention, Next-Word Prediction, Positional Encoding, Transformers, Vectors

error

An error is a number that says how far off something is from perfect.

Some examples:

A student's test score — got 72 out of 100, target is 100. Error = 100 − 72 = 28 points.
A runner's time — finished a race in 14.2 seconds, target is 0 (impossible, but smaller is always better). Error = 14.2 − 0 = 14.2.
A photo classifier — the answer is "cat", but the model only gave "cat" 30% probability. Error = 100% − 30% = 70%.
A next-word predictor — the next word in "The cat sat on the ___" is "mat", but the model gave "mat" only 1% probability. The error formula −log(probability of the right word) turns this into ≈ 4.6 (very bad); a 99% guess would have given ≈ 0.01 (very good).

Averaged across many examples, the error becomes the loss — the number training tries to make small.
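
The next-word numbers above can be checked directly. This sketch assumes the natural logarithm, which is what reproduces the ≈ 4.6 and ≈ 0.01 figures; the third probability in the averaged list is made up.

```python
import math

def error(probability_of_right_answer):
    """-log(p): near 0 for a confident correct guess, large for a confident miss."""
    return -math.log(probability_of_right_answer)

def loss(probabilities):
    """Average the per-example errors into one number."""
    return sum(error(p) for p in probabilities) / len(probabilities)

print(round(error(0.01), 2))   # "mat" given only 1% probability
print(round(error(0.99), 2))   # "mat" given 99% probability
print(round(loss([0.01, 0.99, 0.5]), 2))
```

One badly wrong prediction dominates the average, which is part of why training pushes hardest on the examples the model gets most wrong.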

Introduced in: Optimization · Also appears in: PyTorch from Scratch, Attention, Computation, Introduction, Neural Networks

feature

A feature is a piece of information about the input that the model uses to make its prediction.

In classic machine learning, features were hand-engineered numbers — average pixel brightness, word count, day of the week. In modern neural networks, the network learns its own features inside its hidden layers: a deep layer might end up detecting "the presence of a wheel" or "the subject of the sentence" without anyone telling it to.

Some examples:

A player's height — for predicting basketball performance, height is a useful feature.
The pixel values of an image — the raw features a vision model starts with.
"Edge in the image" — a feature an early layer of a vision network learns to detect from raw pixels.
"This token refers to the subject of the sentence" — a feature a deep layer of a language model might learn to detect, without being told to.

A lot of what makes a deep neural network powerful is its ability to learn useful features automatically, instead of relying on humans to design them.

Introduced in: Neural Networks · Also appears in: PyTorch from Scratch, Next-Word Prediction, Vectors

frontier model

A frontier model is one of the most capable AI models in existence at a given time, at the leading edge of what AI can do.

The "frontier" is that leading edge, and frontier models are the biggest, most expensive systems built by the labs pushing it. They set the state of the art for what's possible. Crucially, the frontier keeps moving: a model that is frontier today is ordinary in a year or two, so naming a frontier model only makes sense relative to a point in time.

At the time of writing, the frontier models are the very latest releases from the leading labs, such as GPT-5.5, Claude Opus 4.8, and the newest Gemini. Note that these come in families: only the latest version of Claude or Gemini is frontier, not the older ones that share the name.

The frontier as it moved over time:

GPT-2 (2019) — frontier in its day, now small enough to run on a laptop.
GPT-4 (2023) — a few years later, far more capable, but since overtaken.
Today's latest releases — frontier now, and likely surpassed by the time you read this.

Frontier models advance partly through sheer scale (more layers, more parameters, more training data), but also through genuine innovations: new abilities like step-by-step reasoning, better ways to handle long contexts, and improved training methods. Frontier labs seldom disclose exactly what's inside, but each generation tends to be both bigger and cleverer than the last.

Introduced in: Neural Networks · Also appears in: Embeddings, Positional Encoding, Transformers

function

A function is anything that takes an input and produces an output.

Some examples, simple to complex:

"Double it" — takes 3, produces 6.
"Is this a cat?" — takes pixels, produces a number from 0 to 1.
"Translate to French" — takes English, produces French.
"Write a poem about rain" — takes a prompt, produces a poem.
ChatGPT — takes your message, produces a reply.

Functions are often written as f(input) = expression for output. For example, double(x) = x · 2 means a function called double that takes input x and gives back x · 2.

A model is a function whose behaviour you can adjust by tuning its parameters.

Introduced in: Computation · Also appears in: PyTorch from Scratch, Attention, Neural Networks, Optimization

generalisation

Generalisation is when a model makes good predictions on inputs it has never seen before, because it has learned the underlying patterns rather than just memorising its training data.

The thing that genuinely can't generalise is pure memorisation — a lookup table that needs the exact input it was trained on. Anything that extracts a pattern and reuses it can generalise to some degree; the interesting question is how it generalises and how far.

Different models generalise in different ways:

A bigram model — generalises by assuming only the previous word matters. It can score any sentence where the individual word-to-word transitions appeared in training, even if it has never seen the whole sentence. What it can't do is generalise across word similarity: it has no idea that "cat" and "kitten" are related, so seeing one teaches it nothing about the other.
A neural-network language model — generalises further through embeddings: similar words have similar vectors, so similar contexts produce similar predictions. It can answer "the kitten sat on the ___" sensibly even if its training only ever mentioned cats, because "kitten" and "cat" sit near each other in embedding space.
An image classifier — generalises by learning features (edges, textures, shapes) that apply to photos it has never seen; that's why a cat detector still works on a brand-new cat photo.
A model that overfits — has tuned itself so closely to the training set that it predicts well on training data but poorly on anything new. Memorisation has crowded out useful generalisation.

Generalisation is the whole point of training. A model that can only repeat its training data hasn't really learned anything; a model that generalises well is the one that's actually useful.

Introduced in: Next-Word Prediction · Also appears in: Embeddings

gradient

A gradient is a number for each parameter of a model, telling you which way to nudge that parameter to reduce the loss fastest from where the model currently is.

The "from where the model currently is" matters: a gradient isn't a fixed property of a parameter. It depends on the current values of every parameter, plus the training data being looked at. As soon as gradient descent takes a step and the parameters shift, all the gradients change. The gradient only says "right now, given this state, here's the direction to move"; once you've moved, the gradient has to be recomputed before the next step.

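Take the gradients of every parameter together and you can think of them as a single vector pointing in the direction of fastest loss reduction across the whole model. That vector is what gradient descent follows, for one step. (In standard mathematical usage the gradient points the opposite way, toward the fastest increase in loss, and gradient descent steps against it; the definition here builds in that sign flip.)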

Some examples:

A one-parameter model — the gradient is a single number: the slope of the loss curve at the parameter's current value. Move the parameter and the slope changes.
A two-parameter model — the gradient is two numbers, forming an arrow on a 2D loss landscape that points in the steepest downhill direction from where you are. Take a step and the arrow tilts.
A million-parameter network — a million numbers, one per parameter, computed efficiently by backpropagation. After every training step, they all get recomputed.
The gradient inside ChatGPT — hundreds of billions of numbers or more (the exact parameter count is not public), recomputed at every training step.

Training is, mechanically, the loop of compute the gradient, take a small step in its direction, repeat — and the reason it has to be a loop is that the gradient changes every time you take a step.
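
The two-parameter case is small enough to try by hand. This sketch uses a made-up loss and estimates each parameter's slope by nudging it. These are the ordinary mathematical slopes, which point uphill, so the step subtracts them to move downhill.

```python
def loss(a, b):
    """Hypothetical two-parameter loss with its lowest point at a=1, b=-2."""
    return (a - 1.0) ** 2 + 2.0 * (b + 2.0) ** 2

def slopes(a, b, eps=1e-6):
    """Estimate d loss / d a and d loss / d b by nudging each parameter."""
    d_a = (loss(a + eps, b) - loss(a - eps, b)) / (2 * eps)
    d_b = (loss(a, b + eps) - loss(a, b - eps)) / (2 * eps)
    return d_a, d_b

a, b = 3.0, 0.0
before = slopes(a, b)
# One small step against the slopes, i.e. downhill.
a, b = a - 0.1 * before[0], b - 0.1 * before[1]
after = slopes(a, b)
print(before)
print(after)
```

The slopes start near (4, 8) and, after a single step, are near (3.2, 4.8): same loss, same parameters, but a different answer because the model has moved.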

Introduced in: Optimization · Also appears in: PyTorch from Scratch, Embeddings, Neural Networks

gradient descent

Gradient descent is an efficient way of optimizing the parameters of a model during training, far faster than random search: nudge each parameter in the direction of its gradient, then repeat.

Random search (try a tiny change, keep it if it helped, repeat) works but is hopelessly slow when there are millions of parameters — you have no idea which of the millions to change first or in what direction. Gradient descent solves this by computing, for every parameter at once, the direction that shrinks the loss fastest. That direction is the gradient; you take a small step that way, recompute, and repeat. This only works because real models are built to be smooth — small parameter changes give small, predictable changes in the loss, so the gradient really does point the way.

Some examples:

A ball rolling down a hill — gravity pulls it in the steepest direction; it stops at the bottom of the valley. The mechanical analogue, and the picture used in the optimization chapter's widget.
Tuning a one-knob model — try a setting, see the loss; the slope tells you whether to nudge the knob up or down. Step that way, repeat.
Training a small neural network — use backpropagation to compute the gradient for every weight, then update each by a small fraction of its gradient. Repeat for millions of steps.
Training ChatGPT — the same algorithm, applied to hundreds of billions of parameters or more, for months on tens of thousands of GPUs.

Without gradient descent, you'd be guessing the direction of every parameter change — hopeless when there are millions of them. Gradient descent is what makes large-model training work at all.
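
For the one-knob case, the whole algorithm fits in a few lines. This is a sketch with a made-up loss whose lowest point is at 3; the learning rate and step count are arbitrary choices. The slope function returns the ordinary derivative, which points uphill, so each step subtracts it.

```python
def loss(w):
    """Hypothetical one-knob loss, lowest at w = 3."""
    return (w - 3.0) ** 2

def slope(w):
    """Derivative of the loss above: positive means the loss rises as w grows."""
    return 2.0 * (w - 3.0)

def gradient_descent(w, learning_rate=0.1, steps=100):
    for _ in range(steps):
        w -= learning_rate * slope(w)   # step downhill, then recompute
    return w

w = gradient_descent(w=0.0)
print(w, loss(w))
```

Every step moves in a direction that is known to help, which is the difference from trying random changes and keeping the lucky ones.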

Introduced in: Optimization · Also appears in: Attention, Embeddings, Neural Networks, Transformers

hidden neuron

A hidden neuron is a neuron in a middle layer of a network, one that neither reads the raw inputs directly nor produces the final output, but computes an intermediate result.

They're called "hidden" because their outputs aren't seen from outside the network. The network's inputs hold the data and its final layer gives the answer; everything in between is hidden. Hidden neurons are where the real work happens, each one computing some intermediate fact that later neurons build on.

Some examples:

In the XOR network — two hidden neurons compute OR and NAND, and the output neuron ANDs them together to make XOR.
In an image classifier — early hidden neurons detect edges, later ones detect eyes and ears, building up to "cat".
In a large language model — millions of hidden neurons, connected by billions of weights and spread across dozens of layers, gradually build up each word's meaning from the words around it.

A layer made of hidden neurons is called a hidden layer. Stacking hidden layers is what gives a network its depth, and depth is what lets it learn complex things.
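
The XOR network from the examples can be written out with hand-picked numbers. This sketch uses a hard on/off step instead of a smooth activation to keep the arithmetic obvious; the particular weights and biases are our own choice, and a trained network would find its own.

```python
def step(z):
    """Fire (1) when the weighted sum plus bias is positive, else stay quiet (0)."""
    return 1 if z > 0 else 0

def xor_network(a, b):
    # Hidden layer: hand-picked weights and biases, not learned ones.
    h_or = step(1.0 * a + 1.0 * b - 0.5)      # OR: fires if either input is on
    h_nand = step(-1.0 * a - 1.0 * b + 1.5)   # NAND: quiet only if both are on
    # Output neuron: AND of the two hidden neurons.
    return step(1.0 * h_or + 1.0 * h_nand - 1.5)

for a in (0, 1):
    for b in (0, 1):
        print(a, b, "->", xor_network(a, b))
```

Neither hidden neuron computes XOR on its own; the answer only appears when the output neuron combines their two intermediate facts.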

Introduced in: Neural Networks

label

A label is the correct answer attached to a training example, telling the model what the right output should be.

Every training example comes with a label — the "right answer" the model is supposed to produce. Training compares the model's prediction to the label, measures the error, and adjusts parameters to bring the two closer together. Without labels there's nothing to compare predictions against.

Some examples:

An image's class label — "cat" attached to a photo of a cat. The model is trained to output "cat" when shown that photo.
A next-word example — given "the cat sat on the", the label is the actual word that came next in the training text ("mat", say).
A sentiment label — given a movie review, the label is "positive" or "negative".
A bounding-box label — for object detection, the label is a rectangle around each object plus what kind of object it is.

Collecting labelled data — usually by paying humans to label examples one at a time — is often the most expensive and slowest part of training a new model.

Introduced in: Vectors · Also appears in: Embeddings

language model

A language model is a model that takes some text as input and predicts what text would come next.

Feed it "the cat sat on the ___" and a good language model assigns higher probability to "mat" than to "elephant". To generate longer text, pick one of the likely next tokens, add it to the input, and repeat. That's how ChatGPT writes paragraph-long answers one token at a time.

Some examples:

Autocomplete on your phone — a tiny language model trained to suggest the next word as you type.
A bigram model — counts how often each pair of words appears in training data, then predicts the next word from the previous one only.
GPT-2 — a transformer-based language model with 1.5 billion parameters; one of the first models to write convincing long passages.
Frontier chat LLMs (ChatGPT, Claude, Gemini) — transformer-based language models with hundreds of billions of parameters, trained on large amounts of text (including a large share of the public web), then tuned to behave like helpful assistants.

Every chatbot, every code-completion tool, every "AI that writes things" is some flavour of language model — they differ mostly in size and training.
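
The bigram model from the list is small enough to build in full, and it shows the predict, append, repeat loop in miniature. This sketch uses a tiny made-up corpus and always picks the single most likely next word, which is only one of several ways to choose.

```python
from collections import Counter, defaultdict

def train_bigram(text):
    """Count how often each word follows each other word."""
    counts = defaultdict(Counter)
    words = text.split()
    for prev, nxt in zip(words, words[1:]):
        counts[prev][nxt] += 1
    return counts

def next_word_probabilities(counts, prev):
    """Turn the counts for one previous word into probabilities."""
    total = sum(counts[prev].values())
    return {word: n / total for word, n in counts[prev].items()}

def generate(counts, start, length):
    """Repeatedly pick the most likely next word and append it."""
    words = [start]
    for _ in range(length):
        followers = counts.get(words[-1])
        if not followers:
            break                      # no known continuation
        words.append(followers.most_common(1)[0][0])
    return " ".join(words)

corpus = "the cat sat on the mat . the cat ate . the dog sat on the rug ."
counts = train_bigram(corpus)
print(next_word_probabilities(counts, "the"))
print(generate(counts, "the", 4))
```

Larger language models replace the counting table with a neural network, but the outer loop of predicting a token and feeding it back in is the same.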

Introduced in: Neural Networks · Also appears in: Embeddings, Next-Word Prediction, Positional Encoding

layer

A layer is a group of neurons that all process the same inputs at the same time, whose outputs become the inputs of the next layer.

Each layer combines the previous layer's outputs into something more complex. In an image classifier, for example, early layers detect simple patterns (edges, colours); deeper layers combine those into parts (eyes, ears); the deepest layers combine those into whole concepts (cat, dog). Depth is what lets neural networks learn complex things.

Some examples:

One layer — inputs → one row of neurons → output. Limited to simple problems; famously, a single layer can't solve XOR.
Two layers — the minimum needed to solve XOR. The first layer computes intermediate facts; the second combines them into the answer.
Modern image classifiers — typically 50 to 150 layers, each detecting progressively higher-level features.
GPT-3 — 96 layers; frontier language models have even more.

Backpropagation is what makes training every layer practical — it flows the error signal backward through every layer so every weight gets a gradient.
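The two-layer XOR example can be sketched as follows. The weights here are hand-picked rather than trained, and the activation is a hard step rather than a smooth function, so this is an illustration of the wiring only; the names step, neuron, and xor_network are invented for the sketch.

```python
def step(x):
    """Hard threshold activation: 1 if the weighted sum is positive, else 0."""
    return 1 if x > 0 else 0

def neuron(inputs, weights, bias):
    return step(sum(i * w for i, w in zip(inputs, weights)) + bias)

def xor_network(a, b):
    # First layer: two intermediate facts about the inputs.
    either = neuron([a, b], [1, 1], -0.5)   # fires if a OR b
    both = neuron([a, b], [1, 1], -1.5)     # fires if a AND b
    # Second layer: "either, but not both".
    return neuron([either, both], [1, -1], -0.5)

for a in (0, 1):
    for b in (0, 1):
        print(a, b, "->", xor_network(a, b))
```

The first layer computes the intermediate facts and the second combines them, which is exactly what a single layer has no room to do.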

Introduced in: Neural Networks·Also appears in: PyTorch from Scratch, Attention, Embeddings, Transformers

linearA linear function is one whose output is a weighted sum of its inputs, optionally plus a constant: no curves, no thresholds, no surprises.

A linear function is one whose output is a weighted sum of its inputs, optionally plus a constant: no curves, no thresholds, no surprises. Anything else is nonlinear.

Plotted on a graph, a one-input linear function is a straight line; with two inputs, a flat plane; with more inputs, the higher-dimensional equivalent.

Some examples:

f(x) = 3x + 2 — a straight line. Linear.
f(x, y) = 2x + 5y — a flat plane in 3D space. Linear.
A neural-network layer with no activation function — just a weighted sum of inputs, so linear.
A neuron's activation function (sigmoid, ReLU, Swish) — nonlinear. Sigmoid bends; ReLU has a kink at zero.

If you stacked linear layers without any nonlinear activation function between them, the whole stack would collapse mathematically into a single linear function — depth would buy you nothing. Activation functions exist to break that, so that deep networks can model curves and thresholds, not just straight lines.
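A small sketch of that collapse, using two made-up one-input "layers" (the numbers 3, 2, 2, and 5 are arbitrary choices for illustration):

```python
def layer1(x):
    return 3 * x + 2      # first linear layer

def layer2(h):
    return 2 * h + 5      # second linear layer

def collapsed(x):
    # 2 * (3x + 2) + 5 simplifies to 6x + 9: still a single straight line.
    return 6 * x + 9

def relu(x):
    return max(0, x)

for x in (-2, 0, 1.5, 10):
    assert layer2(layer1(x)) == collapsed(x)

print(layer2(layer1(-2)), collapsed(-2))   # -3 -3
print(layer2(relu(layer1(-2))))            # 5
```

With ReLU between the layers, the stack no longer matches any single straight line, which is the extra power the activation function buys.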

Introduced in: Optimization·Also appears in: PyTorch from Scratch, Neural Networks, Positional Encoding

LLMShort for large language model: a language model with billions of parameters, trained on very large amounts of text (including a large share of the public internet).

An LLM is short for large language model: a language model with billions of parameters, trained on very large amounts of text (including a large share of the public internet).

There's no exact cut-off for "large"; it's whatever the field considered enormous at the time. In practice today it means anything roughly in the GPT-3-and-up range. The term carries a connotation: an LLM is big enough to do general-purpose tasks (writing, coding, reasoning), not just complete the next word in a narrow domain.

Some examples:

GPT-2 — 1.5 billion parameters. The smallest model people sometimes still call an LLM.
GPT-3 — 175 billion parameters. The model that made "LLM" a household abbreviation.
Llama 3 (70B) — Meta's open-weights LLM with 70 billion parameters; anyone can download and run it.
GPT-4 / Claude / Gemini — frontier LLMs trained by OpenAI, Anthropic, and Google. Exact sizes are mostly secret but believed to be in the hundreds of billions.

The "L" is doing real work: a 50-million-parameter language model isn't called an LLM, even though it works the same way.

Introduced in: Next-Word Prediction·Also appears in: Transformers

lossA number that says how far a model is from perfect, measured across a set of examples (the training data).

A loss is a number that says how far a model is from perfect, measured across a set of examples (the training data).

Some examples:

A runner over 10 races — each race has its own error. The runner's loss is the average of those 10 errors: a verdict on the runner, not on any single race.
A photo classifier on 10 000 photos — each photo contributes one error. The classifier's loss is the average; a loss of 0.05 means the classifier is confidently right almost everywhere.
A next-word predictor on a billion sentences — each position contributes one error (−log(probability of the right word)). The predictor's loss is the average across every position. This is the number training shrinks.

The goal of training is to find parameter values that minimise the loss on the training data (the training loss), typically using gradient descent. Loss measured on data the model didn't train on is the test loss — the two often differ, since a model can memorise its training data and still flunk on fresh examples.
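As a rough sketch of the next-word case, the loss is the average of −log(probability of the right answer) over all examples. The probability lists below are invented for illustration, and average_loss is a hypothetical helper name.

```python
import math

def average_loss(probabilities_of_right_answer):
    """Mean of -log(p) over all examples; lower is better."""
    errors = [-math.log(p) for p in probabilities_of_right_answer]
    return sum(errors) / len(errors)

# Hypothetical probabilities two models gave to the correct answers.
confident = average_loss([0.9, 0.95, 0.99])
unsure = average_loss([0.5, 0.4, 0.6])
print(round(confident, 3), round(unsure, 3))
```

The model that gives the right answers higher probability ends up with the lower loss, and a model that was always certain and always right would score 0.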

Introduced in: Optimization·Also appears in: PyTorch from Scratch

magnitudeThe size (or length) of a vector.

A magnitude is the size (or length) of a vector. If you pictured the vector as an arrow, the magnitude is how long that arrow is — a vector twice as big in every direction has twice the magnitude.

You calculate it by squaring each number in the vector, adding them up, and taking the square root — the Pythagorean rule extended to any number of dimensions. For a 2D vector (a, b), the magnitude is √(a² + b²). For more dimensions, just keep adding the squared terms before taking the root.

Some examples:

A regular number (1-dimensional vector) — the same formula applied to one number. Magnitude of 5 = √(5²) = √25 = 5. Magnitude of −5 = √((−5)²) = √25 = 5. Squaring removes the sign, so the magnitude of a single number is just its absolute value.
(3, 4) — magnitude = √(9 + 16) = √25 = 5. The classic 3-4-5 right triangle.
(1, 0, 0) — magnitude = √1 = 1. A unit vector along one axis.
(1, 1, 1, 1) — magnitude = √4 = 2. Four equal components in 4D.
A neuron's weight vector — its magnitude says how sensitive the neuron is. Two neurons can look for the same pattern (same direction); the one with larger magnitude reacts more strongly to matching inputs.

Splitting a vector into a unit vector (its direction) plus a magnitude (its size) separates what a vector represents from how much of it there is — one of the most useful tricks in the vectors chapter.
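The magnitude formula and the direction-plus-size split can be sketched in a few lines (split is a hypothetical helper name, and it assumes the vector is not all zeros):

```python
import math

def magnitude(vector):
    """Square each number, add them up, take the square root."""
    return math.sqrt(sum(x * x for x in vector))

def split(vector):
    """Return (unit vector, magnitude); assumes the vector is not all zeros."""
    size = magnitude(vector)
    return [x / size for x in vector], size

print(magnitude([3, 4]))        # 5.0
print(magnitude([1, 1, 1, 1]))  # 2.0
direction, size = split([3, 4])
print(direction, size)          # [0.6, 0.8] 5.0
```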

Introduced in: Vectors·Also appears in: Attention

matrixA 2-dimensional grid of numbers — rows and columns of values.

A matrix is a 2-dimensional grid of numbers — rows and columns of values.

A vector is a list of numbers; a matrix is a rectangle. Matrices are how neural networks store the weights connecting one layer to another: each row holds the weights for one output neuron, each column corresponds to one input.

Some examples:

A 2×2 grid — [[1, 2], [3, 4]]. Two rows, two columns. The smallest interesting matrix.
A black-and-white photo — a matrix where each cell is the brightness of one pixel; a 1024×1024 photo is a 1024×1024 matrix.
A layer's weights — connecting 100 inputs to 50 outputs uses a 50×100 matrix. Multiplying the input vector by this matrix produces the layer's output.
A transformer's query/key/value weights — three matrices per attention head, each transforming a token's embedding into its query, key, or value vector.

Pretty much every weight in a neural network lives in some matrix.

matrix multiplicationAn operation that combines two matrices to produce a new matrix.

Matrix multiplication is an operation that combines two matrices to produce a new matrix.

Each cell in the result is the dot product of one row from the first matrix and one column from the second. Sounds fiddly, but it's the most common operation in a neural network — every layer's forward pass is mostly matrix multiplications.

Some examples:

A 2×2 times a 2×2 — four dot products, one for each cell of the resulting 2×2 matrix.
A neural-network layer — multiply the input vector (treated as a 1×N matrix) by the layer's N×M weight matrix to get the 1×M output.
An attention head — comparing every token's query with every other token's key is one big matrix multiplication.
Training GPT-3 — trillions of matrix multiplications, repeated for every forward pass and every backward pass during training.

GPUs are built to do huge numbers of multiplications and additions in parallel, which is exactly what matrix multiplication needs. That's why training and running large modern models leans on them so heavily.
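The row-times-column rule can be sketched in plain Python. This is a slow illustration rather than how libraries do it, and the layer weights at the end are made-up numbers.

```python
def matmul(a, b):
    """Multiply matrix a (n x k) by matrix b (k x m), given as lists of rows."""
    assert len(a[0]) == len(b), "inner dimensions must match"
    columns_of_b = list(zip(*b))
    # Each output cell is the dot product of one row of a and one column of b.
    return [[sum(x * y for x, y in zip(row, col)) for col in columns_of_b]
            for row in a]

print(matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))  # [[19, 22], [43, 50]]

# A layer as a matrix multiplication: 1x3 input times 3x2 weights -> 1x2 output.
x = [[1, 0, 2]]
weights = [[0.5, -1], [2, 3], [1, 1]]   # hypothetical weights
print(matmul(x, weights))               # [[2.5, 1]]
```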

Introduced in: PyTorch from Scratch

modelA function whose behaviour you can adjust by tuning its parameters.

A model is a function whose behaviour you can adjust by tuning its parameters.

Some examples, from one knob to a trillion:

bigger(x) = x · m — input x, parameter m. Set m to 2 and the model doubles things; set it to 10 and it makes them ten times bigger.
lineY(x) = m · x + c — input x, parameters m and c. Computes the y-coordinate of a straight line at x; different values of m and c draw different lines.
curveY(x) = a + b · x + c · x² + d · x³ — one input, four parameters that decide the curve's shape.
isCat(photo) = something complex — millions of parameters that decide whether it's a cat.
chatReply(messages) = something complex — like ChatGPT. Believed to have hundreds of billions to around a trillion parameters (the exact count is not public), tuned during training.

Training is the process of finding parameter values that make the model do the job you want.
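The three small models in the list can be written directly as functions, with the parameters passed in alongside the input. The snake_case names line_y and curve_y are just Python spellings of lineY and curveY.

```python
def bigger(x, m):
    return x * m

def line_y(x, m, c):
    return m * x + c

def curve_y(x, a, b, c, d):
    return a + b * x + c * x**2 + d * x**3

print(bigger(7, m=2))                   # 14
print(bigger(7, m=10))                  # 70
print(line_y(3, m=2, c=1))              # 7
print(curve_y(2, a=1, b=0, c=0, d=1))   # 9
```

The function body never changes; only the parameter values do, and that is what makes each one a model rather than a fixed formula.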

Introduced in: Computation·Also appears in: PyTorch from Scratch, Attention, Embeddings, Introduction, Next-Word Prediction, Optimization, Positional Encoding, Transformers

multi-head attentionRunning the attention computation several times in parallel at the same layer, each copy with its own query, key, and value weights, so the model can learn many different relationships between tokens at once.

Multi-head attention is running the attention computation several times in parallel at the same layer, each copy with its own query, key, and value weights, so the model can learn many different relationships between tokens at once.

Understanding a sentence requires asking several kinds of question at the same time: what does this pronoun refer to? What's the subject of this verb? Which clause does this word belong to? A single attention computation can only learn one of these patterns, so transformers run several side-by-side at each layer and combine the results. Each one of these parallel runs is called an attention head.

Some examples:

Two heads — one head specialises in pronoun resolution, the other in subject–verb linking. Both run on the same tokens at the same time, and their outputs are combined into the next layer's input.
GPT-3 — 96 heads per layer; every token gets 96 simultaneous "questions" asked about it at every layer.
Frontier models — even more heads, learning a wider variety of patterns.

Multi-head attention is what gives transformers their remarkable ability to handle complex grammar and reasoning — the model isn't picking one kind of attention, it's getting them all in parallel.

Introduced in: Attention

n-gramA sequence of n consecutive words (or tokens), used by simple next-word predictors that look up "what usually follows" for each such sequence.

An n-gram is a sequence of n consecutive words (or tokens), used by simple next-word predictors that look up "what usually follows" for each such sequence.

Early language models worked by counting n-grams in a giant pile of text and building a giant lookup table: given the last n−1 words, what's the most common next word? Simple and fast, but limited, because the number of possible n-grams explodes with n. A vocabulary of 50 000 words has about 10⁴⁷ possible 10-grams, vastly more than any collection of text could ever cover.

Some examples:

bigram (n = 2) — predict from one previous word: "of" → "the" 33% of the time, "he" → "was" frequently. Each pair sounds fine, but stringing them together produces nonsense.
trigram (n = 3) — predict from two previous words. More context, much bigger table.
5-gram — about 3 × 10²³ possible entries, far more than the number of words on the entire internet.
10-gram — about 10⁴⁷ possibilities. By 18-grams the count passes the roughly 10⁸⁰ atoms in the observable universe.

The fundamental problem with n-grams is that they can't generalise — they only predict for sequences they've literally seen before. Neural-network language models replaced them by working with continuous embeddings instead, so similar sentences automatically produce similar predictions.
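To get a feel for how fast the table grows, here is a quick count of possible n-grams for the 50 000-word vocabulary mentioned above (possible_ngrams is a hypothetical helper name):

```python
VOCABULARY_SIZE = 50_000   # the vocabulary size used in the text above

def possible_ngrams(n, vocabulary=VOCABULARY_SIZE):
    """Number of distinct n-word sequences: one free choice per position."""
    return vocabulary ** n

for n in (2, 3, 5, 10):
    print(f"{n}-grams: about {possible_ngrams(n):.1e}")
```

Only a tiny fraction of those sequences ever shows up in real text, so most of the table stays empty no matter how much data you count.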

Introduced in: Next-Word Prediction

neural networkA model built by connecting lots of neurons together into layers.

A neural network is a model built by connecting lots of neurons together into layers.

Each neuron takes a weighted sum of its inputs, runs the result through an activation function, and passes its output along. Layers are stacked so each layer's outputs become the next layer's inputs. Training adjusts all the weights so the whole stack ends up producing useful outputs.

Some examples:

A single neuron — technically a one-neuron neural network. The smallest possible case.
A small image classifier — three or four layers of neurons that take pixel values in and output a probability that the image is a cat.
A neural-network language model — many layers, each containing thousands of neurons; takes a sequence of tokens in and predicts the next one.
ChatGPT (GPT-4) — exact figures are not public, but it is believed to have on the order of a hundred or more layers and hundreds of billions of parameters; still just neurons wired together into a giant network.

Pretty much all modern AI is built on neural networks. The differences between models are mostly in how the neurons are wired (feed-forward layers, attention, and so on) and how big the network is.

Introduced in: Neural Networks·Also appears in: PyTorch from Scratch, Attention, Embeddings, Next-Word Prediction, Optimization, Transformers, Vectors

neuronA function that multiplies its inputs by weights, adds up the results plus a bias, and applies an activation function.

A neuron is a function that multiplies its inputs by weights, adds up the results plus a bias, and applies an activation function.

Neurons are loosely inspired by the brain, and they're typically wired up in layers — each neuron taking its inputs from the outputs of the previous layer.

There are several ways to think about a neuron:

As a simulation of a brain neuron — biological neurons receive electrical signals from other neurons through synapses, and fire if the total exceeds a threshold. An artificial neuron mirrors this loosely: weights stand in for synapse strengths, the weighted sum for the total incoming signal, and the activation function for the fire-or-not decision.
As a smooth logic gate — with the right weights, a single neuron can compute AND, OR, or NOT. Unlike a real logic gate (whose inputs and output are strictly 0 or 1), a neuron accepts any value and its output changes smoothly as the inputs change — which is what makes it trainable.
As a pattern detector — the weights describe what the neuron is looking for; the bias decides how much evidence it needs before firing. A neuron with bear-like weights, fed an animal vector as input, asks "is this bear-like?"
As math — output = activation(w₁·x₁ + w₂·x₂ + … + wₙ·xₙ + b). The weights w decide what the neuron looks for; the bias b shifts the threshold; the activation function squashes the result into a useful range.

The weights and bias are parameters; training adjusts all of them.
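The "as math" and "smooth logic gate" views can be combined in a short sketch. The weights and bias below are hand-picked to behave like a soft AND gate; they are an assumption for illustration, not values a trained network would necessarily find.

```python
import math

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

def neuron(inputs, weights, bias):
    """output = activation(w1*x1 + ... + wn*xn + b), with sigmoid as activation."""
    weighted_sum = sum(w * x for w, x in zip(weights, inputs)) + bias
    return sigmoid(weighted_sum)

# Hand-picked (not trained) weights that make the neuron act like a soft AND gate.
AND_WEIGHTS, AND_BIAS = [10, 10], -15

for a in (0, 1):
    for b in (0, 1):
        print(a, b, round(neuron([a, b], AND_WEIGHTS, AND_BIAS), 3))
```

Unlike a real AND gate, the neuron also accepts in-between inputs such as 0.5, and its output moves smoothly as they change.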

Introduced in: Neural Networks·Also appears in: PyTorch from Scratch, Attention, Embeddings, Optimization, Transformers, Vectors

one-hot encodingA vector with a single 1 in the slot for the item being represented, and 0s everywhere else.

A one-hot encoding is a vector with a single 1 in the slot for the item being represented, and 0s everywhere else.

For a vocabulary of, say, 50 000 tokens, every token gets a 50 000-dimensional one-hot vector — its own slot is 1, every other slot is 0. It's a way of representing discrete things (which item out of many) using numbers a neural network can read, without claiming anything about how the items relate to each other.

Some examples:

3-token vocabulary [cat, dog, fish] — cat is (1, 0, 0), dog is (0, 1, 0), fish is (0, 0, 1).
The alphabet — each letter is a 26-dimensional vector with a single 1.
A language model's input layer — each token is a one-hot vector with one slot per token, often tens of thousands of slots long. Feeding this through the embedding layer turns it into the token's learned embedding.

One-hot vectors are sparse and uninformative on their own (any two different one-hot vectors have dot product 0), so models almost always pass them through an embedding layer to get a denser, meaningful representation.
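A minimal sketch of one-hot encoding for the three-token vocabulary from the examples (one_hot and dot are hypothetical helper names):

```python
def one_hot(item, vocabulary):
    """Vector of 0s with a single 1 in the slot belonging to `item`."""
    return [1 if entry == item else 0 for entry in vocabulary]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

vocabulary = ["cat", "dog", "fish"]
cat, dog = one_hot("cat", vocabulary), one_hot("dog", vocabulary)
print(cat, dog)        # [1, 0, 0] [0, 1, 0]
print(dot(cat, dog))   # 0: the encoding says nothing about how cat and dog relate
print(dot(cat, cat))   # 1: a one-hot vector only matches itself
```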

Introduced in: Embeddings

optimizationThe process of making something better step by step — measure how far it is from perfect, make a small change, keep it if it helped, repeat.

Optimization is the process of making something better step by step — measure how far it is from perfect, make a small change, keep it if it helped, repeat.

This same loop appears almost anywhere complex things get built. The chapter calls it incremental optimization: nothing is designed perfectly in one shot; you reach a good solution by accumulating reliable tiny improvements.

Some examples:

Natural evolution — random changes, tested by survival, kept or eliminated. The original incremental optimizer.
The scientific method — propose a hypothesis, run an experiment, accept or reject, refine.
The Wright brothers' wings — built and tested hundreds of wing designs, measured which generated more lift, kept what worked. The airplane was iterated into existence.
Training a neural network — measure the loss, calculate the gradient (the direction of fastest improvement), nudge every parameter that way, repeat over billions of steps.

Training a model is just optimization applied to its parameters, with the loss as the number to minimize. The "small change" step uses gradient descent, which is what makes training fast enough to work in practice — without it, you'd be optimizing by random guesses.

Introduced in: Optimization·Also appears in: PyTorch from Scratch, Computation, Neural Networks, Transformers

overfittingWhen a model has learned its training data so precisely that it predicts well on examples it has seen, but poorly on anything new.

Overfitting is when a model has learned its training data so precisely that it predicts well on examples it has seen, but poorly on anything new.

A model that overfits has memorised quirks of the training set instead of learning the underlying patterns. It's the opposite of useful generalisation: training error keeps falling, but real-world performance gets worse.

Some examples:

A student who memorises past exam papers — aces those exact questions but gets stuck on anything slightly different. Same shape as overfitting.
A bigram model trained on a tiny corpus — assigns high probability only to the exact word pairs it saw, and is helpless when new sentences contain unseen pairs.
A neural network with too many parameters for too little training data — fits every training example exactly, including the noise, and generalises poorly.
A frontier model trained too long on a narrow slice of data — starts reproducing specific training examples verbatim rather than generalising from the patterns in them.

The opposite is underfitting — a model that hasn't even managed to fit the training data well. The sweet spot is in between: fit the patterns, ignore the noise.

Introduced in: Next-Word Prediction

parameterOne of the adjustable numbers inside a model.

A parameter is one of the adjustable numbers inside a model.

Each parameter is one number you can change; together they decide what the model does.

Some examples — the parameters in different models:

bigger(x) = x · m — m is the parameter.
lineY(x) = m · x + c — m and c are two parameters.
curveY(x) = a + b · x + c · x² + d · x³ — four parameters: a, b, c, d.
Every weight in a small neural network — typically thousands or millions of parameters.
The numbers inside ChatGPT — believed to be hundreds of billions to around a trillion parameters; the exact count is not public.

Training is the process of finding good values for all of them, automatically.

Introduced in: Computation·Also appears in: PyTorch from Scratch, Neural Networks, Next-Word Prediction, Optimization

positional encodingThe technique used to give an attention-based model a sense of where each token is in the sequence, since attention by itself is position-blind.

A positional encoding is the technique used to give an attention-based model a sense of where each token is in the sequence, since attention by itself is position-blind.

Without a positional encoding, "the dog bit the man" and "the man bit the dog" would produce identical attention patterns: the model would see which words are present but not which came first. Positional encoding fixes this so word order matters. Two approaches widely used in current models are ALiBi (linear distance penalties added to attention scores) and RoPE (rotating each token's query and key vectors by an angle proportional to position).

Some examples:

No positional encoding — attention treats the words as a bag with no order; "dog bit man" = "man bit dog". Broken for language.
ALiBi — subtract a small penalty from each attention score proportional to the distance between tokens. Used by BLOOM and MPT.
RoPE — rotate each token's query and key vectors by an angle proportional to position; the dot product then naturally falls off with distance. Used by Llama, Mistral, Gemma, Qwen, DeepSeek.
Combined with causal masking — positional encoding handles distance; causal masking handles direction (each token can only attend to earlier tokens).

Positional encoding is one of the under-appreciated pieces of modern transformers — without it, attention couldn't tell a sentence from a bag of words.
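The ALiBi idea of subtracting a distance penalty from each attention score can be sketched like this. The slope of 0.5 is a hypothetical value chosen for readability (real models fix a different slope for each head), and alibi_scores is an invented helper name.

```python
def alibi_scores(scores, slope):
    """Subtract slope * distance from each attention score.

    scores[i][j] is the raw score of token i attending to token j.
    `slope` is a hypothetical value; real models fix a different slope per head.
    """
    return [[score - slope * abs(i - j) for j, score in enumerate(row)]
            for i, row in enumerate(scores)]

# Three tokens with identical raw scores, so only distance tells them apart.
raw = [[1.0, 1.0, 1.0] for _ in range(3)]
for row in alibi_scores(raw, slope=0.5):
    print(row)
```

After the penalty, each token scores itself highest and more distant tokens progressively lower, even though the raw scores were all equal.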

Introduced in: Positional Encoding·Also appears in: Transformers

promptThe text you give a language model as input.

A prompt is the text you give a language model as input.

In a chat model like ChatGPT, the prompt is everything before the model starts replying — your current message, any earlier messages in the conversation, and any "system" instructions the developer set up to steer the model's behaviour. By default, what's in the prompt is the model's only working knowledge of the current conversation.

Some examples:

A simple question — "What's the capital of France?" The model reads this and predicts the next tokens, which happen to spell out "Paris".
A code completion — your half-written code is the prompt; the model predicts what comes next.
A multi-turn chat — every message in the history (yours and the model's) is concatenated into one big prompt for predicting the next response.
A system prompt — developers can put detailed instructions in front of every user message ("You are a helpful coding assistant…") so the model behaves consistently across users.

Most of what a language model "knows" about the current conversation lives in the prompt — change the prompt and the response changes.

Introduced in: Computation·Also appears in: Next-Word Prediction

ReLUA popular activation function: it outputs zero for negative inputs and passes positive inputs through unchanged.

ReLU is a popular activation function: it outputs zero for negative inputs and passes positive inputs through unchanged.

"ReLU" stands for Rectified Linear Unit. As math, relu(x) = max(0, x). Despite being almost laughably simple, it works extremely well in practice and was the default activation in deep networks for years before being supplanted by smoother variants like Swish and SwiGLU in most modern frontier models.

Some examples:

relu(3) = 3 — positive input, passes through unchanged.
relu(-7) = 0 — negative input, clipped to zero.
relu(0) = 0 — boundary case; the function has a kink right at zero.
Inside a deep image classifier — every neuron in every hidden layer uses ReLU. For any given input, about half the neurons output zero (they're "off") and the rest pass their values through.

ReLU's main virtues are that it's fast to compute and easy to train.
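In code, ReLU is a one-liner, which is a large part of why it is so cheap to compute:

```python
def relu(x):
    return max(0, x)

print([relu(x) for x in (3, -7, 0)])  # [3, 0, 0]
```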

Introduced in: PyTorch from Scratch

RoPERoPE (Rotary Position Embeddings) is a positional encoding that gives each token a sense of position by rotating its query and key vectors by an angle proportional to where it sits in the sequence.

RoPE (Rotary Position Embeddings) is a positional encoding that gives each token a sense of position by rotating its query and key vectors by an angle proportional to where it sits in the sequence.

The clever part: when two rotated vectors are matched by dot product, the rotations contribute only through the difference between their angles, which depends only on the tokens' distance apart, not on their absolute positions. For matching vectors, tokens close together produce a high dot product, and the match weakens as they move apart. The rotation speeds are fixed in advance, but each attention head learns which speeds to rely on, giving it control over how sharply attention falls off with distance.

Some examples:

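Speed 30° per position — neighbouring tokens have similar angles (high dot product), while a token 6 positions away has rotated 180° and points the opposite way, so its dot product is strongly negative. At 12 positions the rotation reaches a full 360° and the match returns, which is why a single speed is not enough.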
Multiple rotation speeds — RoPE uses many pairs of dimensions rotating at different rates; fast pairs handle short-range attention, slow pairs handle long-range, and each head can mix them however it likes.
Used in Llama, Mistral, Gemma, Qwen, DeepSeek — nearly every frontier open-weight model today uses RoPE.

RoPE encodes distance but not direction, so it's almost always paired with causal masking (each token attends only to tokens before it).
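A minimal sketch of the rotation idea, assuming a single 2D pair of dimensions and the 30°-per-position speed from the example above. Real RoPE applies this to many pairs at different speeds; rotate and rope_score are hypothetical helper names.

```python
import math

def rotate(vector, degrees):
    """Rotate a 2D vector counter-clockwise by the given angle."""
    angle = math.radians(degrees)
    x, y = vector
    return (x * math.cos(angle) - y * math.sin(angle),
            x * math.sin(angle) + y * math.cos(angle))

def rope_score(query, key, query_pos, key_pos, speed=30):
    """Dot product after rotating each vector by speed * its position."""
    q = rotate(query, speed * query_pos)
    k = rotate(key, speed * key_pos)
    return q[0] * k[0] + q[1] * k[1]

q, k = (1.0, 0.0), (1.0, 0.0)
# The same distance (1 apart) at different absolute positions gives the same score.
print(round(rope_score(q, k, 5, 4), 3), round(rope_score(q, k, 50, 49), 3))
# Growing the distance changes the score.
print([round(rope_score(q, k, d, 0), 3) for d in (0, 1, 2, 3)])
```

Shifting both tokens along the sequence leaves the score untouched, which is the "distance, not absolute position" property in action.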

Introduced in: Positional Encoding

sigmoidAn S-shaped function that smoothly squashes any number into the range 0 to 1.

A sigmoid is an S-shaped function that smoothly squashes any number into the range 0 to 1.

Large positive inputs become close to 1; large negative inputs become close to 0; everything in between gets a smooth, graded value — like a dimmer switch instead of a light switch. Sigmoid functions appear all over maths, biology, and economics, anywhere you need to map an unbounded number into a bounded range. In AI, the sigmoid is one of the classic activation functions used in neurons.

Some examples:

sigmoid(0) = 0.5 — exactly halfway; the input is on the fence.
sigmoid(10) ≈ 0.99995 — strongly positive input → nearly 1.
sigmoid(−10) ≈ 0.00005 — strongly negative input → nearly 0.
Inside a neuron — the sigmoid runs after the weighted sum, turning the raw score into a fire-strength between 0 and 1.

The sigmoid's smoothness is what makes it useful in neural networks — small changes to the inputs give small, predictable changes to the output, providing a gradient the optimizer can follow during training.
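The standard formula is sigmoid(x) = 1 / (1 + e^(−x)). A short sketch that tabulates it for a few inputs:

```python
import math

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

for x in (-10, -1, 0, 1, 10):
    print(x, round(sigmoid(x), 5))
```

Moving from −1 to 1 changes the output a lot, while moving from 10 to 12 barely changes it at all: the curve is steep in the middle and flat at the ends.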

Introduced in: Neural Networks·Also appears in: PyTorch from Scratch

softmaxA function that turns a list of numbers into a list of percentages that add up to 100%.

Softmax is a function that turns a list of numbers into a list of percentages that add up to 100%.

Think of it as a competition: bigger numbers dominate, smaller numbers get squeezed almost to zero, and the result is a probability distribution. Softmax cares about relative sizes, not absolute ones: adding the same amount to every score leaves the output unchanged, and a score only gains share by growing relative to the others.

Some examples:

Equal scores [1, 1, 1, 1] — softmax gives [25%, 25%, 25%, 25%]. No one stands out.
One clear winner [10, 0, 0, 0] — softmax gives roughly [99.99%, ≈0%, ≈0%, ≈0%]. The big number takes almost everything.
Moderate gap [2, 1, 0] — softmax gives roughly [67%, 24%, 9%]. The gaps between scores matter more than their absolute values.
Attention scores — raw match scores from the dot product are passed through softmax to decide how to divide each token's attention budget of 100%.
A classifier's output — raw scores for cat, dog, bird get softmaxed into class probabilities (e.g., 80% cat, 15% dog, 5% bird).

Softmax appears everywhere in modern AI — anywhere a model needs to turn "raw scores" into "percentages that compete for a fixed budget of 100%".
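A minimal sketch of softmax: exponentiate each score, then divide by the total. Subtracting the largest score first is a common numerical-safety trick and does not change the result.

```python
import math

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    largest = max(scores)                        # subtract the max for numerical safety
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

print(softmax([1, 1, 1, 1]))        # [0.25, 0.25, 0.25, 0.25]
# Only the gaps between scores matter: shifting every score by 100 changes nothing.
print(softmax([2, 1, 0]))
print(softmax([102, 101, 100]))
```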

Introduced in: Attention

temperatureA number that controls how random a language model's next-word predictions are.

A temperature is a number that controls how random a language model's next-word predictions are.

At low temperature, the model almost always picks the single most likely word — its output becomes safe, predictable, and often repetitive. At high temperature, the model picks more evenly from its options — its output becomes more varied and creative, but also more error-prone. Temperature is the dial you turn to trade reliability for creativity.

Some examples:

Temperature 0 — always pick the top prediction. The model becomes deterministic: same prompt → same output. Useful when you want consistent, factual answers.
Temperature 0.7 — a common default. The model usually picks one of the top few options, with mild randomness for variety.
Temperature 1 — pick from options in proportion to their probabilities, with no extra sharpening or smoothing.
Temperature 2 — heavily flattened distribution; even low-probability words get a real chance. Used for highly creative or wild generations.

Generating the same prompt twice at high temperature gives you different outputs each time — that's the temperature dial at work.
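One common way to implement the dial is to divide the model's raw scores by the temperature before applying softmax. The sketch below assumes that approach; the scores are made up and next_word_probabilities is a hypothetical name. Temperature 0 has to be handled separately as "always pick the top score", since dividing by zero is not possible.

```python
import math

def next_word_probabilities(scores, temperature):
    """Softmax over scores divided by the temperature (must be > 0)."""
    scaled = [s / temperature for s in scores]
    largest = max(scaled)
    exps = [math.exp(s - largest) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

scores = [2.0, 1.0, 0.0]   # hypothetical raw scores for three candidate words
for t in (0.2, 1.0, 2.0):
    print(t, [round(p, 3) for p in next_word_probabilities(scores, t)])
```

Low temperatures pile nearly all the probability onto the top word; high temperatures spread it more evenly across the options.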

Introduced in: Next-Word Prediction·Also appears in: Transformers

tensorA generalisation of vectors and matrices to arbitrarily shaped blocks of numbers.

A tensor is a generalisation of vectors and matrices to arbitrarily shaped blocks of numbers.

An N-dimensional vector has N numbers in it — its shape is (N,). An N × M matrix has N × M numbers laid out in a grid of rows and columns — shape (N, M). A tensor generalises both: its shape can have any number of values — (N, M, K), (N, M, K, J), and so on — describing how big the tensor is along each axis.

The shape notation comes from numpy and PyTorch: a vector might have shape (768,), a transformer's weight matrix might have shape (768, 768), and a batch of images might have shape (32, 3, 224, 224).
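To make the shape idea concrete without installing anything, here is a sketch that computes the shape of plain nested Python lists. Real tensors in numpy or PyTorch report this for you; the shape helper below is a hypothetical stand-in that assumes every sub-list at a given level has the same length.

```python
def shape(block):
    """Shape of a nested list, assuming every sub-list at a level has equal length."""
    if not isinstance(block, list):
        return ()                      # a single number has no axes
    if not block:
        return (0,)
    return (len(block),) + shape(block[0])

print(shape(5))                          # ()
print(shape([1, 2, 3]))                  # (3,)
print(shape([[1, 2, 3], [4, 5, 6]]))     # (2, 3)

# A tiny 2x2 "colour photo": height 2, width 2, 3 colour channels per pixel.
photo = [[[255, 0, 0], [0, 255, 0]],
         [[0, 0, 255], [255, 255, 255]]]
print(shape(photo))                      # (2, 2, 3)
print(shape([photo] * 32))               # (32, 2, 2, 3)
```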

In popular usage, "tensor" often just means "a matrix stored in GPU memory", because most tensors inside a neural network actually are matrices and, during training, usually live on the GPU. Strictly speaking, a tensor can sit in ordinary CPU memory too; PyTorch creates tensors there by default.

Some examples:

5 — shape (). A single number; the smallest possible tensor.
[1, 2, 3] — shape (3,). A 3-dimensional vector.
A black-and-white photo — shape (height, width). A matrix.
A colour photo — shape (height, width, 3). The extra axis holds the three colour channels.
A batch of 32 colour photos — shape (32, height, width, 3). 32 photos arranged so the model can process them all at once. (PyTorch usually puts the channel axis second instead, giving (32, 3, height, width).)

"Tensor" sounds fancy but in practice it just means "a block of numbers with a given shape", usually stored on the GPU.

Introduced in: Computation·Also appears in: PyTorch from Scratch

tokenThe smallest unit of text a language model works with — usually a whole word, a piece of a word, or a piece of punctuation.

A token is the smallest unit of text a language model works with — usually a whole word, a piece of a word, or a piece of punctuation.

Language models don't process text one character at a time, or one full word at a time. They split text into tokens of an in-between size that strikes a good balance: common words become their own token, while rare or invented words are assembled from smaller known pieces. That way the model can handle "treeishness" (split into tree, -ish, -ness) without ever having seen it before.

Some examples:

A whole word — cat, the, running. Common words usually get their own token.
A word fragment — -ing, super-, -ness. Less common words are built from pieces.
Punctuation and symbols — "!", ",", "?"; even spaces can be tokens.
An invented word — "qwertyflorp" gets broken into known subword pieces; the model can still process it.
Inside GPT-3 — each token is represented by a 12 288-dimensional embedding; the model sees only the tokens, never the original characters.

Splitting text into tokens (a process called tokenisation) is the first thing any language model does to its input, and the last thing it un-does to produce its output.
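The "assemble rare words from known pieces" idea can be sketched with a toy tokeniser that always grabs the longest known piece. This is an assumption-laden illustration: real tokenisers (byte-pair encoding, for example) learn their pieces from data and split text differently, and the vocabulary below is invented.

```python
def tokenise(text, vocabulary):
    """Greedy longest-match split; falls back to single characters."""
    tokens, i = [], 0
    while i < len(text):
        for end in range(len(text), i, -1):      # try the longest piece first
            if text[i:end] in vocabulary:
                tokens.append(text[i:end])
                i = end
                break
        else:                                    # nothing matched: emit one character
            tokens.append(text[i])
            i += 1
    return tokens

# Hypothetical vocabulary; real tokenisers learn tens of thousands of pieces.
vocabulary = {"tree", "ish", "ness", "cat", "run", "ning"}
print(tokenise("treeishness", vocabulary))   # ['tree', 'ish', 'ness']
print(tokenise("running", vocabulary))       # ['run', 'ning']
```

Because of the single-character fallback, even a word with no known pieces still gets split into something the model can process.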

Introduced in: Embeddings·Also appears in: Attention, Next-Word Prediction, Positional Encoding, Transformers

trainingThe process of adjusting a model's parameters so it makes the smallest error on the data we give it.

Training is the process of adjusting a model's parameters so it makes the smallest error on the data we give it.

Some examples, the same recipe at different scales:

Finding the best stride for a runner — try a stride length, time the race. A shorter time means a better stride. Adjust the stride toward whatever made the time go down. Repeat until you find the stride that gives the best time.
Fitting the curve Y(x) = a + b·x + c·x² + d·x³ to a set of points — pick values for a, b, c, d, measure how far the curve is from each point. A smaller total distance means a better fit. Adjust the parameters toward whatever shrinks the distance. Repeat until the curve is as close as possible.
Training a next-word predictor — show it real sentences. For each one, see how unlikely the model thought the actual next word was. A model is good if it gives the right words high probability. Nudge the parameters until the right words become less surprising. Repeat over billions of sentences.

In practice, training uses gradient descent: instead of guessing which parameter to nudge, it calculates which direction shrinks the error fastest, then takes a small step that way. This works only because the model is built to be smooth — small parameter changes give small, predictable changes in the error. Training is optimization applied to a model's parameters.
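
The gradient-descent recipe can be sketched in a few lines. This is a simplified illustration, not the tutorial's own code: it fits a straight line `a + b·x` rather than the cubic above, and the data points, the learning rate of 0.05, and the 2000 steps are all choices made for this example.

```python
# Made-up points that lie exactly on the line y = 1 + 2x.
points = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]


def loss(a, b):
    """Mean squared error of the line a + b*x over all points."""
    return sum((a + b * x - y) ** 2 for x, y in points) / len(points)


def gradient(a, b):
    """Partial derivatives of the loss with respect to a and b."""
    n = len(points)
    da = sum(2 * (a + b * x - y) for x, y in points) / n
    db = sum(2 * (a + b * x - y) * x for x, y in points) / n
    return da, db


a, b = 0.0, 0.0          # start from an arbitrary guess
learning_rate = 0.05     # size of each small step

for step in range(2000):
    da, db = gradient(a, b)
    # Step in the direction that shrinks the error fastest.
    a -= learning_rate * da
    b -= learning_rate * db

print(round(a, 3), round(b, 3))  # close to 1 and 2
print(loss(a, b))                # close to 0
```

The gradient tells the loop which way to nudge each parameter; the learning rate keeps each nudge small enough that the error keeps falling instead of overshooting.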

Introduced in: Optimization·Also appears in: PyTorch from Scratch, Computation, Embeddings, Neural Networks, Next-Word Prediction, Positional Encoding, Transformers

training dataThe set of examples a model is shown during training.

Training data is the set of examples a model is shown during training.

Each example is one input paired with its correct answer. The model makes a guess for the input, an error is computed against the correct answer, and the parameters are nudged to shrink the loss — the average error across the whole training set.

Some examples:

A handful of test questions and their answer keys — enough to fit a curve, not enough to train anything large.
Over a million labelled photos (like ImageNet) — used to train classifiers that recognise cats, dogs, planes, and so on.
Billions of sentences from books and the web — used to train next-word predictors like GPT-3, where each sentence offers a next-word prediction at every position.

Some examples are deliberately held back from the training data as test data, so we can check the model on inputs it has never seen. The loss on this held-out set is the test loss — a model that has memorised its training data will have low training loss but high test loss, which is how we catch the problem.
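
A small sketch shows how held-out data catches memorisation. Everything here is invented for the illustration: the dataset (inputs 0 to 19 with answer `2x + 1`), the 15/5 split, and the two hypothetical models, one that memorises its training examples and one that has learned the underlying rule.

```python
import random

random.seed(0)  # fixed seed so the split is repeatable

# Made-up examples: each is an input paired with its correct answer.
examples = [(x, 2 * x + 1) for x in range(20)]
random.shuffle(examples)
train, test = examples[:15], examples[15:]   # hold back 5 examples as test data

memory = dict(train)


def memoriser(x):
    """Looks the answer up; guesses 0 for anything it has not seen."""
    return memory.get(x, 0)


def rule_learner(x):
    """Stands in for a model that has learned the real pattern."""
    return 2 * x + 1


def loss(model, data):
    """Average squared error of the model over a set of examples."""
    return sum((model(x) - y) ** 2 for x, y in data) / len(data)


for name, model in [("memoriser", memoriser), ("rule learner", rule_learner)]:
    print(name, "train loss:", loss(model, train), "test loss:", loss(model, test))
```

The memoriser scores a perfect training loss but a large test loss, while the rule learner does well on both, which is exactly the gap the held-out set is there to expose.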

Introduced in: Optimization·Also appears in: Next-Word Prediction

transformerThe neural-network architecture behind every major modern AI — ChatGPT, Claude, Gemini, Llama — built by stacking many layers, each of which uses attention to let every token gather information from every other token.

A transformer is the neural-network architecture behind every major modern AI — ChatGPT, Claude, Gemini, Llama — built by stacking many layers, each of which uses attention to let every token gather information from every other token.

A passage of text enters as token embeddings. Each layer creates a richer representation of every token by combining what's already known about that token with information attention pulls in from other tokens. After many layers, the final representation of the last token has enough context to predict what should come next. The architecture's defining trick is attention: instead of fixed wiring between neurons, attention decides at runtime which token's information should flow where.

Some examples:

A toy 5-layer transformer — one attention head per layer, a few hundred parameters. Enough to demonstrate the wiring; not enough to do anything useful.
GPT-2 (smallest version) — 12 layers, 12 heads per layer, 768-dimensional embeddings, ~124 million parameters (first reported as 117 million). Generates short stories.
GPT-3 — 96 layers, 96 heads per layer, 12 288-dimensional embeddings, 175 billion parameters. Surprised everyone with its emergent abilities.
Frontier models like Claude or GPT-4 — exact dimensions undisclosed but believed to be substantially larger.
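
The parameter counts in this list can be roughly checked from the layer count and embedding size. The sketch below uses a common rule of thumb, about 12·d² weights per layer (4·d² for attention plus 8·d² for a feed-forward block four times wider than the embedding), plus one embedding row per token. It is an approximation under those assumptions: it ignores biases, layer norms, and positional embeddings, and the function name is made up for this example. The vocabulary size of 50 257 is the GPT-2 and GPT-3 figure.

```python
def estimate_parameters(layers, d_model, vocab_size):
    """Rough transformer parameter count (rule of thumb, not exact)."""
    attention = 4 * d_model ** 2      # query, key, value and output projections
    feed_forward = 8 * d_model ** 2   # assumes a hidden size of 4 * d_model
    embeddings = vocab_size * d_model
    return layers * (attention + feed_forward) + embeddings


gpt2_small = estimate_parameters(layers=12, d_model=768, vocab_size=50257)
gpt3 = estimate_parameters(layers=96, d_model=12288, vocab_size=50257)

print(f"{gpt2_small / 1e6:.0f} million")  # 124 million
print(f"{gpt3 / 1e9:.0f} billion")        # 175 billion
```

For the 12-layer, 768-dimensional configuration the estimate lands at about 124 million, in the same ballpark as the figure quoted for GPT-2, and for GPT-3 it lands at about 175 billion. Almost all of the growth comes from the d² term: making the embeddings 16 times wider makes each layer 256 times larger.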

The original paper introducing the architecture was called "Attention Is All You Need" — and the title turned out to be right. The same architecture, scaled up, is what powers nearly every conversation, translation, and code-completion tool you've used in the last few years.

Introduced in: Transformers·Also appears in: Embeddings, Introduction, Positional Encoding, Vectors

unit vectorA vector whose magnitude is exactly 1.

A unit vector is a vector whose magnitude is exactly 1.

You make a unit vector by dividing every number in the vector by its magnitude — the proportions between the dimensions stay the same, but the overall size is now 1. Two unit vectors can be compared fairly: their dot product depends only on the direction they point, not on how large either one is.

Some examples:

(1, 0) — already a unit vector; points along the x-axis.
(3, 4) → (0.6, 0.8) — divide each number by the magnitude (5). Same direction, size 1.
An animal vector scaled to size 1 — all the chapter's animal vectors are unit vectors, which is why their dot products land cleanly between −1 and 1 (cosine similarity).
A weight vector inside a neuron — its unit vector tells you what the neuron is looking for; its magnitude tells you how sensitive it is.

Unit vectors are the standard form for fair similarity comparisons — without them, a vector that's "big in everything" would falsely look similar to anything, because its sheer size inflates the dot product.
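
Here is a minimal sketch of the (3, 4) example and of why size distorts a raw dot product. The helper names are chosen for this illustration, and `normalise` assumes the vector is not all zeros (a zero vector has no direction to keep).

```python
import math


def magnitude(v):
    """Length of the vector: square root of the sum of squares."""
    return math.sqrt(sum(x * x for x in v))


def normalise(v):
    """Scale v to magnitude 1, keeping its direction (v must be non-zero)."""
    m = magnitude(v)
    return tuple(x / m for x in v)


def dot(a, b):
    """Multiply matching numbers and add up the results."""
    return sum(x * y for x, y in zip(a, b))


print(normalise((3, 4)))  # (0.6, 0.8)

# Two vectors pointing the same way, one ten times larger.
small, big = (3, 4), (30, 40)
east = (1, 0)

print(dot(small, east), dot(big, east))  # 3 30: size inflates the raw dot product
print(dot(normalise(small), normalise(east)),
      dot(normalise(big), normalise(east)))  # 0.6 0.6: same direction, same score
```

Once both vectors are normalised, the dot product is the cosine similarity, so only direction counts.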

Introduced in: Vectors·Also appears in: Positional Encoding

vectorA list of numbers, where each number describes a different aspect of one thing.

A vector is a list of numbers, where each number describes a different aspect of one thing.

Some things you can describe with a vector:

A point in 2D space — captured by two numbers: how far east and how far north. (0.8, 0.6) means 0.8 east, 0.6 north.
A colour on a screen — captured by three numbers from 0 (none) to 1 (full): how much red, green, and blue light to mix. (1, 0.5, 0) is full red, half green, no blue — orange.
An animal — captured by rating it from 0 to 1 on properties like big, scary, hairy, cuddly, fast, fat. A bear and a rabbit end up with very different vectors; a bear and a dog end up similar.
A word's meaning — captured by a vector of hundreds of numbers learned by a model, so that words with similar meanings end up with similar vectors.
A single token inside GPT-3 — captured by a vector of 12 288 numbers, each one carrying some learned aspect of the word's role.

The order matters: the first number means the same thing in every vector of a given kind. That's what makes vectors comparable — you can multiply matching numbers, add vectors together, and measure distances between them.
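
The three operations mentioned above can be sketched with the colour example, assuming each colour is a (red, green, blue) tuple. The function names are picked for this illustration.

```python
import math


def add(a, b):
    """Add matching numbers: position 0 with position 0, and so on."""
    return tuple(x + y for x, y in zip(a, b))


def dot(a, b):
    """Multiply matching numbers and sum the results."""
    return sum(x * y for x, y in zip(a, b))


def distance(a, b):
    """Straight-line (Euclidean) distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


red = (1.0, 0.0, 0.0)
blue = (0.0, 0.0, 1.0)
orange = (1.0, 0.5, 0.0)

print(add(red, (0.0, 0.5, 0.0)))  # (1.0, 0.5, 0.0): red plus half green is orange
print(distance(orange, red))      # 0.5
print(distance(orange, blue))     # 1.5
print(dot(orange, red))           # 1.0
```

Orange sits much closer to red than to blue, which only makes sense because the first number always means "red" in every colour vector.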

Introduced in: Vectors·Also appears in: Attention, Embeddings, Neural Networks, Positional Encoding, Transformers

vocabularyThe set of tokens a tokeniser can output.

A vocabulary is the set of tokens a tokeniser can output.

You might imagine the vocabulary as just all the words — but that doesn't work. There are too many words in English alone, and the model also needs to handle misspelled words, made-up names, French, Chinese, Python code, emoji, and ASCII art. Putting every possible word from every possible context into the vocabulary would mean millions of entries, and the model would need to learn an embedding for each one — there isn't enough training data to figure out what every rare word means.

So tokenisers break text into smaller pieces instead. Common words still get their own token, but rare or unfamiliar words get split into fragments (-ing, super-, individual characters) that the model already knows. Any word — even one the tokeniser has never seen — can be assembled from pieces.

This sets up a tradeoff:

A bigger vocabulary — common words and patterns get represented as single tokens, so sequences are shorter and the model processes text faster. But the embedding layer needs more parameters, and rare tokens may not appear in training data often enough for the model to learn good embeddings for them.
A smaller vocabulary — fewer parameters, and every token gets seen often during training. But common words get split into multiple tokens, so sequences are longer and the model has to do more work.

Some examples:

GPT-2 — 50 257 tokens, optimised for English text.
GPT-3 — 50 257 tokens (same tokeniser as GPT-2).
GPT-4 — about 100 000 tokens, with much better handling of code and non-English languages.
Llama 3 — about 128 000 tokens, optimised across many languages.

Every token in the vocabulary has its own row in the embedding layer's weight matrix — that's how the model turns a token into a vector it can do math on.
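
As a sketch of that lookup, assume a hypothetical five-token vocabulary and a made-up embedding matrix with three numbers per token (real models learn these values and use far more of both). The last lines show how vocabulary size drives the embedding layer's parameter count, using the GPT-3 figures quoted in this glossary.

```python
# Hypothetical tiny vocabulary; position in the list is the token's id.
vocabulary = ["the", "cat", "tree", "ish", "ness"]
token_to_id = {token: i for i, token in enumerate(vocabulary)}

# One row per token. These values are invented; a real model learns them.
embedding_matrix = [
    [0.1, 0.0, 0.2],   # "the"
    [0.9, 0.3, 0.1],   # "cat"
    [0.2, 0.8, 0.5],   # "tree"
    [0.0, 0.1, 0.7],   # "ish"
    [0.1, 0.2, 0.6],   # "ness"
]


def embed(token):
    """Turn a token into its vector by reading its row of the matrix."""
    return embedding_matrix[token_to_id[token]]


def embedding_parameters(vocab_size, embedding_dim):
    """Every token needs one row of embedding_dim numbers."""
    return vocab_size * embedding_dim


print(embed("cat"))                        # [0.9, 0.3, 0.1]
print(embedding_parameters(50257, 12288))  # 617558016
```

Doubling the vocabulary doubles the rows, so a bigger vocabulary buys shorter sequences at the price of a larger embedding layer.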

Introduced in: Embeddings·Also appears in: Attention, Next-Word Prediction

weightA number inside a neuron that controls how much one of its inputs matters.

A weight is a number inside a neuron that controls how much one of its inputs matters.

A neuron multiplies each input by its weight before adding them together — a large weight makes that input strongly affect the output; a weight near zero makes it almost ignored; a negative weight pushes the output the other way.

Some examples, from a single weight to billions:

One input — output = w · x + b. The single weight w controls how much x matters.
A small neuron with three inputs — weights (2, 0.1, −1) mean the neuron cares a lot about the first input, barely cares about the second, and gets pushed in the opposite direction by the third.
A neural-network layer — every connection from one layer to the next has its own weight. Stack a few of these and you have thousands of weights.
GPT-3 — about 175 billion weights, each one nudged during training until the model is good at predicting the next word.

Weights, together with biases, are the parameters of a model.
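
The three-input neuron from the examples can be sketched as follows. Only the weighted sum is shown, before any activation function is applied; the bias of 0 and the test inputs are choices made for this illustration.

```python
def weighted_sum(inputs, weights, bias=0.0):
    """Multiply each input by its weight, add them up, then add the bias."""
    return sum(w * x for w, x in zip(weights, inputs)) + bias


weights = (2, 0.1, -1)

# Switch on one input at a time to see how much each one matters.
print(weighted_sum((1, 0, 0), weights))  # 2.0: first input matters a lot
print(weighted_sum((0, 1, 0), weights))  # 0.1: second input barely matters
print(weighted_sum((0, 0, 1), weights))  # -1.0: third input pushes the other way
```

Training never changes the inputs; it only nudges the numbers in `weights` and `bias`.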

Introduced in: Neural Networks·Also appears in: Attention, Embeddings, Positional Encoding, Transformers, Vectors

XORXOR (short for "exclusive or") is a logic operation that is true when exactly one of its two inputs is true, but not both.

XOR (short for "exclusive or") is a logic operation that is true when exactly one of its two inputs is true, but not both.

It means "one or the other, but not both." XOR is famous in the history of AI because a single neuron can't compute it. You need at least two layers — hidden neurons that compute intermediate facts (like OR and NAND), and an output neuron that combines them.

Here's the full truth table:

0 XOR 0 = 0 — neither input on.
0 XOR 1 = 1 — exactly one on.
1 XOR 0 = 1 — exactly one on.
1 XOR 1 = 0 — both on, so not exclusive.
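
The two-layer construction can be sketched with three tiny neurons. The weights and biases below are hand-picked for this illustration rather than learned, and a hard step function is used as the activation to keep the arithmetic easy to follow (a trained network would use a smooth one).

```python
def step(z):
    """Fire (1) if the weighted sum is positive, otherwise stay off (0)."""
    return 1 if z > 0 else 0


def neuron(inputs, weights, bias):
    """Weighted sum of the inputs plus bias, passed through the step function."""
    return step(sum(w * x for w, x in zip(weights, inputs)) + bias)


def xor(a, b):
    # Hidden layer: two intermediate facts about the inputs.
    h_or = neuron((a, b), weights=(1, 1), bias=-0.5)      # at least one input on
    h_nand = neuron((a, b), weights=(-1, -1), bias=1.5)   # not both inputs on
    # Output layer: on only when both hidden facts are true (AND).
    return neuron((h_or, h_nand), weights=(1, 1), bias=-1.5)


for a in (0, 1):
    for b in (0, 1):
        print(a, "XOR", b, "=", xor(a, b))
```

Each hidden neuron draws one straight dividing line; it takes the output neuron combining the two to carve out the "exactly one" region.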

XOR's unsolvability by a single layer was the heart of the critique in Minsky and Papert's 1969 book Perceptrons, which helped trigger the first "AI winter." The fix, stacking layers, is the same idea that powers every deep network today.

Introduced in: Neural Networks·Also appears in: PyTorch from Scratch