PyTorch from Scratch
Every chapter in this tutorial has an optional "Try it in PyTorch" section at the bottom. This appendix gets you set up so you can actually run that code, and teaches you enough PyTorch to understand what it's doing.
What Is PyTorch?
PyTorch is the most popular framework for building AI. It's what researchers at Anthropic, OpenAI, Meta, Google DeepMind, and most universities use to train neural networks. If you read an AI paper, the code is almost certainly PyTorch.
At its core, PyTorch does two things:
- Tensor math: fast operations on big arrays of numbers, running on your GPU if you have one
- Automatic differentiation: it tracks every operation you do, so it can automatically compute gradients for training
That's it. Everything else (neural network layers, optimizers, data loaders) is built on top of those two primitives.
Google Colab: The Easy Way
You don't need to install anything on your own machine. Google Colab gives you a free Python environment in your browser with PyTorch already installed. It even gives you access to a GPU.
Every "Try it in PyTorch" section in this tutorial links directly to a Colab notebook. You click the link, and you're running code. No setup, no installation, no fighting with Python environments.
To use Colab:
- Click any "Open in Google Colab" button in this tutorial
- Sign in with a Google account (free)
- Click a code cell and press Shift+Enter to run it
- The notebook runs on Google's servers; your computer just shows the results
Colab is genuinely the easiest way to get started. If you just want to run the tutorial notebooks, you can skip the rest of this section entirely.
Installing PyTorch Locally
If you want PyTorch on your own machine (maybe you want to experiment beyond the notebooks, or you prefer working in your own editor), here's how.
First, you need Python (3.9 or newer). Then install PyTorch with pip:
pip install torch
Verify it worked:
import torch
print(torch.__version__) # e.g. "2.5.1"
That's the CPU-only version, which is fine for everything in this tutorial. But PyTorch can also use your GPU to run tensor math much faster:
- NVIDIA GPUs: PyTorch supports CUDA acceleration. See the official install guide for CUDA-specific instructions.
- Apple Silicon Macs: If you have an M1/M2/M3/M4 Mac, PyTorch can use the GPU via Apple's MPS (Metal Performance Shaders) backend. It works out of the box with the standard pip install torch, no extra setup needed. Move tensors to the GPU with x.to("mps").
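If you do have a GPU, a common pattern is to pick the best available device once and move your tensors (and later, your models) onto it. A minimal sketch, assuming a reasonably recent PyTorch build:

import torch

if torch.cuda.is_available():            # NVIDIA GPU with CUDA
    device = "cuda"
elif torch.backends.mps.is_available():  # Apple Silicon GPU
    device = "mps"
else:
    device = "cpu"

x = torch.ones(3).to(device)  # move a tensor to the chosen device
print(x.device)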
Tensors
A tensor is PyTorch's word for "a box of numbers." It's the basic building block. Every piece of data in PyTorch is a tensor.
A Quick Python Primer
Before we dive in, a few bits of Python syntax you'll see everywhere:
Decimal points make numbers precise. Writing 3.0 instead of 3 tells Python this is a decimal number (a "float"), not a whole number (an "integer"). AI math needs decimal precision (values like 0.7 or -1.234) so you'll see that trailing .0 a lot. Writing 1. is shorthand for 1.0.
Square brackets make lists. In Python, [1.0, 2.0, 3.0] is a list of three numbers. You can picture it as a row of boxes.
Lists can contain lists. [[1.0, 2.0], [3.0, 4.0]] is a list containing two lists, which gives you a grid.
This is how you build up dimensions: a list of numbers is a row, a list of rows is a grid, a list of grids is a cube, and so on. PyTorch tensors work the same way.
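A quick sketch of that nesting in plain Python, before any PyTorch:

row = [1.0, 2.0, 3.0]               # a row of 3 numbers
grid = [[1.0, 2.0], [3.0, 4.0]]     # 2 rows, each with 2 numbers

print(len(row))       # 3
print(len(grid))      # 2, the outer list holds 2 inner lists
print(len(grid[0]))   # 2, each inner list holds 2 numbers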
Creating Tensors
import torch
# A 1D tensor (a row of numbers)
x = torch.tensor([1.0, 2.0, 3.0])
This creates a tensor holding three numbers in a row, just like the list from the primer above. You can also create 2D tensors (grids):
# A 2D tensor (a table of numbers — 2 rows, 3 columns)
m = torch.tensor([[1.0, 2.0, 3.0],
[4.0, 5.0, 6.0]])
That's a 2-by-3 grid: two rows of three numbers.
PyTorch also has shortcuts for common patterns:
z = torch.zeros(3, 4) # 3x4 grid, all zeros
o = torch.ones(2, 2) # 2x2 grid, all ones
r = torch.randn(5, 3) # 5x3 grid, filled with random numbers
Shape
Every tensor has a shape: the size of each dimension. This tells you how the numbers are arranged:
x = torch.randn(3, 4)
print(x.shape) # torch.Size([3, 4]) — 3 rows, 4 columns
Shape matters a lot. A common source of bugs is tensors with the wrong shape. When something goes wrong, print(x.shape) is usually your first debugging move.
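For example, adding two tensors whose shapes don't line up fails immediately, and printing the shapes shows you why. A minimal sketch (the exact error message varies by version):

a = torch.randn(3, 4)
b = torch.randn(5, 4)
print(a.shape, b.shape)   # torch.Size([3, 4]) torch.Size([5, 4])
try:
    c = a + b             # 3 rows vs 5 rows: the shapes don't match
except RuntimeError as err:
    print(err)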
Basic Math
When you do math on two tensors of the same shape, it happens element by element. Each number pairs up with the number in the same position:
a = torch.tensor([1.0, 2.0, 3.0])
b = torch.tensor([4.0, 5.0, 6.0])
a + b # tensor([5., 7., 9.]) — 1+4, 2+5, 3+6
a * b # tensor([4., 10., 18.]) — 1×4, 2×5, 3×6
a ** 2 # tensor([1., 4., 9.]) — 1², 2², 3²
For matrix multiplication (the workhorse of neural networks; every layer of weighted sums in Chapter 3 is a matrix multiply under the hood), use @:
# Matrix multiply: (2x3) @ (3x4) -> (2x4)
A = torch.randn(2, 3)
B = torch.randn(3, 4)
C = A @ B
print(C.shape) # torch.Size([2, 4])
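To connect this back to Chapter 3, here's a layer of weighted sums written as one matrix multiply plus a bias. A minimal sketch; the numbers are made up for illustration:

x = torch.tensor([[1.0, 2.0, 3.0]])     # one input with 3 features (shape 1x3)
W = torch.tensor([[0.1, 0.2, 0.3],
                  [0.4, 0.5, 0.6]])     # 2 neurons, each with 3 weights (shape 2x3)
b = torch.tensor([0.5, -0.5])           # one bias per neuron

y = x @ W.T + b                         # both neurons' weighted sums at once
print(y)                                # tensor([[1.9000, 2.7000]])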
Automatic Differentiation
This is PyTorch's killer feature. Training a neural network means adjusting millions of parameters to reduce the error (see Chapter 2). To know which direction to adjust each parameter, you need to know: if I nudge this parameter a tiny bit, how much does the error change? That's called a gradient.
If you know calculus, a gradient is a derivative. If you don't, no problem. Just think of it as: "how sensitive is the output to a small change in this input?" If the gradient is large, a small nudge to the input causes a big change in the output. If it's near zero, that input barely matters.
Computing gradients by hand for millions of parameters would be completely impractical. PyTorch does it automatically.
How It Works
Tell PyTorch to track a tensor by setting requires_grad=True. Then do math with it. PyTorch silently records every operation. When you're done, call .backward() on the result, and PyTorch works backward through the chain of operations to figure out each input's gradient.
x = torch.tensor(3.0, requires_grad=True)
# Do some math
y = x ** 2 + 2 * x + 1
# Compute gradients
y.backward()
print(x.grad) # tensor(8.)
That 8.0 means: "if you increase x by a tiny amount, y will increase by about 8 times that amount." (You can verify: when x is 3.0, y is 16.0. When x is 3.01, y is 16.0801, an increase of 0.0801, which is roughly 8 × 0.01.)
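You can check that number without any calculus by nudging x and watching y. A quick numerical sketch in plain Python:

def f(x):
    return x ** 2 + 2 * x + 1

print(f(3.0))                      # 16.0
print(f(3.01))                     # about 16.0801
print((f(3.01) - f(3.0)) / 0.01)   # about 8.01, close to the gradient of 8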
.backward() Updates Everything at Once
When you call y.backward(), PyTorch computes the gradient of y with respect to every tracked tensor that contributed to y. Each of those tensors gets its gradient stored in .grad:
a = torch.tensor(2.0, requires_grad=True)
b = torch.tensor(5.0, requires_grad=True)
y = a * b + a ** 2 # y depends on both a and b
y.backward()
print(a.grad) # tensor(9.) — how much y changes when a changes
print(b.grad) # tensor(2.) — how much y changes when b changes
This is how neural networks learn: the "error" is one number at the end of the computation, and .backward() computes how sensitive that error is to every parameter in the network, all in one pass. This process is called backpropagation.
You can only call .backward() on a single value (not an array of values), because you're asking "how does this one number change when each input changes?" In a neural network, that single number is always the loss, the measure of how wrong the network is.
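If you try calling .backward() on a tensor that holds more than one value, PyTorch refuses. A quick sketch (the exact error wording may vary by version); in real code the loss function already reduces everything to a single number:

v = torch.tensor([1.0, 2.0], requires_grad=True)
try:
    (v * 3).backward()       # two values, not one: PyTorch raises an error
except RuntimeError as err:
    print(err)
(v * 3).sum().backward()     # reduce to a single number first, then it works
print(v.grad)                # tensor([3., 3.])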
In practice you call .backward() once per training step, on the loss, and let an optimizer use the stored gradients to update the parameters (you'll see this in the training loop below). You rarely need to read .grad yourself, but understanding that it exists, and that it works by tracking operations, helps you debug when things go wrong.
Building a Neural Network
PyTorch provides torch.nn, a library of common neural network building blocks.
nn.Linear
The most basic layer. It computes y = x @ W.T + b, a weighted sum of the inputs plus a bias, applied to a whole layer of neurons at once (see Chapter 3):
import torch.nn as nn
layer = nn.Linear(in_features=3, out_features=2)
x = torch.randn(1, 3) # one input with 3 features
y = layer(x) # output has 2 features
print(y.shape) # torch.Size([1, 2])
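You can peek inside the layer to confirm it really is the formula above: a weight matrix with one row per output neuron, plus a bias. A small follow-up sketch, reusing layer, x, and y from the snippet above:

print(layer.weight.shape)   # torch.Size([2, 3])
print(layer.bias.shape)     # torch.Size([2])

manual = x @ layer.weight.T + layer.bias
print(torch.allclose(manual, y))   # True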
nn.Sequential
Stack layers together into a network:
model = nn.Sequential(
nn.Linear(2, 16), # 2 inputs -> 16 hidden neurons
nn.ReLU(), # activation function — see below
nn.Linear(16, 16), # 16 -> 16
nn.ReLU(),
nn.Linear(16, 1), # 16 -> 1 output
)
This is a three-layer neural network. ReLU ("rectified linear unit") is just max(0, x). It passes positive values through unchanged and clamps negative values to zero. It plays the same role as the sigmoid in Chapter 3: a nonlinear activation function applied after each weighted sum. Most modern networks use ReLU instead of sigmoid because it's faster and easier to train. The activations between linear layers are essential. Without a nonlinearity in the middle, two stacked linear layers collapse mathematically into a single linear layer, no matter how many you stack.
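Here's a quick look at what ReLU does to a handful of values, plus a check that the stacked model maps 2 inputs to 1 output. A minimal sketch:

values = torch.tensor([-2.0, -0.5, 0.0, 0.5, 2.0])
print(torch.relu(values))   # tensor([0.0000, 0.0000, 0.0000, 0.5000, 2.0000])

x = torch.randn(1, 2)       # one input with 2 features
print(model(x).shape)       # torch.Size([1, 1]), a single output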
The Training Loop
Training a neural network follows the same pattern every time:
model = nn.Sequential(
nn.Linear(2, 16), nn.ReLU(),
nn.Linear(16, 1),
)
loss_fn = nn.MSELoss() # mean squared error
optimizer = torch.optim.SGD(model.parameters(), lr=0.01) # stochastic gradient descent
# (inputs and targets are tensors of training data; the XOR example below is a complete version)
for epoch in range(1000):
    predictions = model(inputs)             # 1. Forward pass
    loss = loss_fn(predictions, targets)    # 2. Compute loss
    optimizer.zero_grad()                   # 3. Clear old gradients
    loss.backward()                         # 4. Compute new gradients
    optimizer.step()                        # 5. Update parameters
Five lines, repeated thousands of times. That's the core of how every neural network is trained, from a toy example to GPT-4. The model, loss function, and optimizer change, but the loop stays the same.
To break it down:
- Forward pass: run inputs through the model to get predictions
- Compute loss: measure how wrong the predictions are (see Chapter 2)
- Zero gradients: clear gradients from the previous step; PyTorch accumulates them by default (see the sketch after this list)
- Backward pass: compute how much each parameter contributed to the error
- Update parameters: nudge each parameter in the direction that reduces the error
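Step 3 deserves a closer look. Gradients accumulate in .grad until you clear them, which is why the loop calls zero_grad() every time. A minimal sketch of what happens if you don't:

x = torch.tensor(2.0, requires_grad=True)

(x * 3).backward()
print(x.grad)       # tensor(3.)

(x * 3).backward()  # without clearing, the new gradient is added to the old one
print(x.grad)       # tensor(6.)

x.grad.zero_()      # this is what optimizer.zero_grad() does for every parameter
(x * 3).backward()
print(x.grad)       # tensor(3.) again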
Putting It Together: Learning XOR
Let's train a network to learn XOR, a problem that a single neuron can't solve (see Chapter 3) but a two-layer network can.
XOR returns 1 when its inputs differ, and 0 when they're the same:
| Input A | Input B | Output |
|---|---|---|
| 0 | 0 | 0 |
| 0 | 1 | 1 |
| 1 | 0 | 1 |
| 1 | 1 | 0 |
Here's the complete code:
import torch
import torch.nn as nn
# Training data
inputs = torch.tensor([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
targets = torch.tensor([[0.], [1.], [1.], [0.]])
# Model: 2 inputs -> 8 hidden neurons -> 1 output
model = nn.Sequential(
nn.Linear(2, 8),
nn.ReLU(),
nn.Linear(8, 1),
nn.Sigmoid(), # squash output to 0-1
)
loss_fn = nn.BCELoss() # binary cross-entropy for 0/1 classification
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
# Train
for epoch in range(2000):
    predictions = model(inputs)
    loss = loss_fn(predictions, targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if epoch % 500 == 0:
        print(f"Epoch {epoch}: loss = {loss.item():.4f}")
# Test
with torch.no_grad():
    results = model(inputs)

print("\nResults:")
for inp, out in zip(inputs, results):
    print(f" {inp.tolist()} -> {out.item():.3f}")
Run this and you'll see the loss drop over time, and the final outputs will be close to [0, 1, 1, 0]. The network learned XOR.
A few things about this code:
- nn.Sigmoid() squashes the output to the 0–1 range, which is what we want for a yes/no answer
- nn.BCELoss() (binary cross-entropy) is the right loss function for binary classification
- torch.no_grad() tells PyTorch not to track gradients during testing, which saves memory and speeds things up
- Adam instead of SGD: Adam is a smarter optimizer that adapts its learning rate per-parameter. It usually works better out of the box
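If you want a crisp yes/no answer instead of a probability, threshold the sigmoid outputs at 0.5. A small follow-up sketch, reusing the model and inputs from the example above:

with torch.no_grad():
    hard = (model(inputs) > 0.5).float()
print(hard.flatten().tolist())  # [0.0, 1.0, 1.0, 0.0] once training has converged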
Where to Go Next
You now know enough PyTorch to follow every notebook in this tutorial. A few directions from here:
- Run the chapter notebooks: each chapter has a "Try it in PyTorch" section with a link to a Colab notebook
- PyTorch official tutorials: well-written guides for going deeper
- PyTorch documentation: the reference for every function and module
- Experiment: change the XOR example above. Try different hidden sizes, learning rates, or optimizers. Break things. That's how you learn.
Try it in PyTorch — Optional
All the code from this appendix in a single runnable notebook. Create tensors, compute gradients, build a neural network, and train it to learn XOR, all in Google Colab, no setup required.