Learn AI Layer by Layer

Building a Brain

What's the simplest building block that can learn to compute anything? Your brain uses one. GPT and Claude use an artificial version of the same idea. You might expect it to be fearsomely complex, but it's actually remarkably simple.

A neural network is made of neurons, arranged in layers. Each connection has a weight, a number that controls how much attention one neuron pays to another.

So how big are these networks? Here's a comparison of biological brains and AI systems, measured by their number of connections:

There's something strange in this chart. Today's largest language models have roughly as many parameters as a mouse brain has synapses (though comparing artificial parameters to biological synapses is not an exact match). Yet they can write essays and translate languages, and mice generally can't. And elephants have more brain neurons than humans, yet aren't smarter than us. The number of connections isn't everything. What matters most is how good the weights are. That's what training is all about, and we'll get deeper into it in later chapters.

The Building Block

Every neuron takes numbers as inputs and produces a single number as output: it receives values from the previous layer, does a little computation, and passes the result on to the next layer. But what's happening inside each one?

A neuron does two things:

  1. Weighted sum. Each input gets multiplied by a weight, a number that controls how much that input matters. All the weighted inputs get added together, plus a constant called the bias. This is just multiplication and addition: nothing fancy yet.
  2. Activation function. The result gets passed through a function that determines how much to pass on to the next layer. This is called the activation function, and it's what makes neurons more than just arithmetic. We'll see exactly why it matters later in this chapter.

A neuron is a weighted sum followed by an activation function.

The structure of a neuron is just multiplication, addition, and passing the result through a function. Wire millions of them together and you have GPT, or the human brain. The hard part is finding the right values for all those weights and biases. That's the job of training, which we'll cover in later chapters.
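The two steps are small enough to sketch in a few lines of plain Python. This is a minimal sketch, not a library implementation: the activation used here is the sigmoid (introduced just below), and the specific weights and bias are arbitrary example values.

```python
import math

def sigmoid(z):
    # Squashes any number into the range 0 to 1
    return 1.0 / (1.0 + math.exp(-z))

def neuron(inputs, weights, bias):
    # Step 1: weighted sum — each input times its weight, plus the bias
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    # Step 2: activation function — squash the result
    return sigmoid(z)

# Two inputs, two (arbitrary) weights, one bias: output is between 0 and 1
print(neuron([1.0, 0.0], [0.5, -0.3], 0.1))
```

That's the whole building block: multiply, add, squash.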

Your First Neuron

Before we go further, let's get our hands on one. The widget below is a single neuron with two inputs. Every slider is yours to play with. There's no right answer here, just exploration.

The Sigmoid

The activation function we're using is called the sigmoid, an S-shaped curve that smoothly squashes any number into the range 0 to 1. Large positive inputs become close to 1. Large negative inputs become close to 0. Everything in between gets a smooth, graded value. Think of it as a dimmer switch: instead of snapping between off and on like a light switch, it slides smoothly between them.

Nature loves S-curves. An enormous number of real-world phenomena follow this shape: population growth, technology adoption, learning curves, the spread of diseases. And if a trend doesn't look S-shaped to you, it's probably because you're only seeing part of one. If it looks exponential, you're seeing the beginning of an S-curve. If it looks linear, you're seeing the middle. If it looks like diminishing returns, you're seeing the end.

The key property for neural networks: when you change a weight by a small amount, the output changes by a small amount. No sudden jumps, no unpredictable flips. That's exactly what Chapter 2 told us we need for optimization to work.

But you can make the sigmoid sharper or smoother by scaling the weights. Multiply all the weights and bias by a large number and the S-curve steepens toward a hard step. Scale them down and it flattens into a gentle slope. The boundary stays in the same place. Only the sharpness changes.
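A quick sketch makes the scaling effect concrete. The weight values here are illustrative choices: both neurons place their boundary (output 0.5) at the same input, but the scaled-up one is far more decisive near that boundary.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# One-input neuron: output = sigmoid(w*x + b).
# The boundary (output = 0.5) sits where w*x + b = 0, i.e. x = -b/w.
def neuron(x, w, b):
    return sigmoid(w * x + b)

# Same boundary at x = 0.5 in both cases; only the sharpness differs.
gentle = neuron(0.6, 2, -1)    # sigmoid(0.2): barely above 0.5
sharp = neuron(0.6, 20, -10)   # sigmoid(2.0): already close to 1
print(gentle, sharp)
```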

There's a catch, though. Very sharp neurons are harder to train. Remember from Chapter 2: optimization works by measuring how a small change in a weight affects the output. On the flat parts of a sharp sigmoid, the output barely changes at all. The slope is nearly zero. The optimizer gets stuck because it can't tell which direction to move. Real networks use various tricks to keep neurons in the smooth, trainable middle of the curve, but the basic intuition is enough for now: smoother neurons are easier to train.

Logic Gates

You may be familiar with basic logical operations like AND, OR, and NOT. These combine true-or-false values:

  • AND: "Are both things true?" You need an umbrella if it's raining AND you're going outside.
  • OR: "Is at least one thing true?" You'll get wet if it's raining OR someone sprays you with a hose.
  • NOT: Flips the answer. NOT true is false, NOT false is true.

These simple operations are the basis for every digital computer. Inside every chip is a network of logic gates, tiny components that compute AND, OR, NOT, and similar operations on electrical signals.

Wire enough gates together and you can compute anything. Real computer chips have billions of gates. Your phone, your laptop, every digital device works this way.

But logic gates have a problem. Their outputs snap between 0 and 1 with nothing in between. There's no smooth gradient to follow, which means you can't train a network of logic gates using the optimization techniques from Chapter 2. You can't ask "how should I adjust this gate to make the answer a little more correct?" It's all or nothing.

Neurons as Smooth Logic

One way to think of a neuron is as a smooth version of a logic gate. With the right weights, a single two-input neuron can compute AND, OR, NOT, and more, but unlike a logic gate, its output changes smoothly when you adjust the weights. That smooth change is what makes neurons trainable.

But there's something important here beyond just mimicking logic gates. Look at what happens when you slide an input to a value between 0 and 1, say 0.6. The output isn't forced to snap to "true" or "false." It can be somewhere in between: 0.73, or 0.4, or any other value. The neuron handles degrees naturally. "Probably true" rather than just "true" or "false."

This matters because real-world data is rarely cleanly true or false. Is this email spam? Probably. Is that shadow in the photo a cat? Maybe. A neuron can take uncertain inputs and produce a graded output.
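Here's a minimal sketch of that idea in plain Python. The weight values are one illustrative choice among many that work; with weights this large, the neuron's output rounds to the same answers a hard logic gate would give, yet the same neuron still accepts in-between inputs gracefully.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def neuron(x1, x2, w1, w2, b):
    return sigmoid(w1 * x1 + w2 * x2 + b)

# With large weights, a sigmoid neuron approximates a hard gate.
# (These particular values are illustrative, not unique.)
AND = lambda a, b: neuron(a, b, 20, 20, -30)
OR = lambda a, b: neuron(a, b, 20, 20, -10)

print(round(AND(1, 1)), round(AND(1, 0)))  # behaves like AND: 1, 0
print(round(OR(0, 0)), round(OR(0, 1)))    # behaves like OR: 0, 1

# Unlike a logic gate, it also handles degrees:
print(AND(0.6, 0.9))  # a graded "maybe", not a forced 0 or 1
```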

Key Insight

One way to think of a neuron is as a smooth, trainable logic gate. With the right weights, it can compute AND, OR, NOT, and more, and because small changes to the weights produce small changes in the output, we can find those weights using optimization. Neurons also handle in-between values naturally, making them building blocks for computation over uncertain, continuous data.

Three Neurons Solve XOR

There's one more operation worth knowing: XOR (short for "exclusive or"). It means "one or the other, but not both." A single neuron can't solve it. No matter what weights you try, you can't get all four input combinations right at once.

But what if we use two neurons feeding into a third? The two middle neurons are called hidden neurons. They don't produce the final answer, but instead compute intermediate results that the output neuron needs.

The trick:

  • Hidden Neuron 1 computes OR: "is at least one input on?"
  • Hidden Neuron 2 computes NAND (not-and): "are they not both on?"
  • Output neuron computes AND of those two results: "at least one is on, AND they're not both on."

That's XOR. Three neurons, two layers, problem solved.
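The three-neuron construction can be written out directly. The weights below are illustrative choices; any sufficiently large values with the same signs would work.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def neuron(x1, x2, w1, w2, b):
    return sigmoid(w1 * x1 + w2 * x2 + b)

def xor(a, b):
    h1 = neuron(a, b, 20, 20, -10)      # hidden neuron 1: OR
    h2 = neuron(a, b, -20, -20, 30)     # hidden neuron 2: NAND
    return neuron(h1, h2, 20, 20, -30)  # output neuron: AND of the two

for a in (0, 1):
    for b in (0, 1):
        print(a, b, round(xor(a, b)))  # 0 0 0 / 0 1 1 / 1 0 1 / 1 1 0
```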


The XOR Crisis

In 1969, Minsky and Papert published Perceptrons, proving that single-layer networks can't solve XOR. This was taken as proof that neural networks were a dead end, and it killed most neural network research for over a decade: the "AI winter." The fix was always sitting there, obvious in hindsight: use more than one layer.

What Hidden Layers Compute

To solve complex problems, you often need to compute intermediate things first. You can't jump straight from raw pixels to "that's a cat." You need stepping stones.

That's what hidden layers provide. Each layer takes the previous layer's outputs and combines them into something more complex. The first hidden layer might detect simple patterns: edges, color patches, textures. The next layer combines those into parts: eyes, ears, fur. A later layer combines those into whole objects: "cat" or "dog."

This is why depth matters. A single layer can only combine the raw inputs in simple ways. But with multiple layers, each one adds another level of abstraction, building up from simple features to complex concepts. The network learns what to compute at each layer during training. We don't have to tell it to look for edges or eyes. It figures that out on its own.

Deeper Networks

If three neurons in two layers can solve XOR, what happens when we keep adding more? Each neuron you add contributes another piece of the computation. Each layer lets the network combine results from the previous layer into something more complex.

The widget below lets you build networks of different sizes and explore them by hand. Try adjusting the architecture and clicking neurons to poke at their weights.

The point is: this would be impossible to do by hand for anything interesting. That's why we need to find values by training on real data, which we'll learn about in later chapters.

How big do real networks get? GPT-3 has 96 layers, and frontier models today have even more. Even AI models that run on your smartphone have 20+ layers. The architecture is simple: the same building block repeated millions of times. The hard part is finding the right weights.

Training Deep Networks

But how do we train the weights in all these hidden layers? It turns out to be surprisingly straightforward. Every weight in the network, even ones buried deep inside, contributes to the final output. And if it contributes to the output, it contributes to the error. That means every weight has a gradient (as we saw in the optimization chapter), telling us which direction to nudge it to reduce the error. So we can use gradient descent on every weight at once. The algorithm that makes this practical is called backpropagation. It efficiently computes the gradient for every weight by flowing the error signal backward through each layer.
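That backward flow can be sketched in plain Python for a tiny two-layer network learning XOR. The initial weights and learning rate here are illustrative choices; the point is only that every weight, hidden or not, gets a gradient and a nudge, and the error shrinks as a result.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# A 2-2-1 network: two inputs, two hidden neurons, one output neuron.
random.seed(0)
W1 = [[random.uniform(-1, 1) for _ in range(3)] for _ in range(2)]  # hidden: w1, w2, bias
W2 = [random.uniform(-1, 1) for _ in range(3)]                      # output: w1, w2, bias

data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]  # XOR

def forward(x):
    h = [sigmoid(w[0] * x[0] + w[1] * x[1] + w[2]) for w in W1]
    y = sigmoid(W2[0] * h[0] + W2[1] * h[1] + W2[2])
    return h, y

def loss():
    return sum((forward(x)[1] - t) ** 2 for x, t in data)

initial = loss()
lr = 0.5
for _ in range(5000):
    for x, t in data:
        h, y = forward(x)
        # Backward pass: the error signal flows from the output
        # back to the hidden weights.
        dy = 2 * (y - t) * y * (1 - y)                    # output pre-activation gradient
        dh = [dy * W2[j] * h[j] * (1 - h[j]) for j in range(2)]  # hidden gradients
        W2[0] -= lr * dy * h[0]
        W2[1] -= lr * dy * h[1]
        W2[2] -= lr * dy
        for j in range(2):
            W1[j][0] -= lr * dh[j] * x[0]
            W1[j][1] -= lr * dh[j] * x[1]
            W1[j][2] -= lr * dh[j]

print(initial, loss())  # the error drops as every weight is adjusted
```

Even the weights buried in the hidden layer get updated: their gradients arrive via the output neuron's error signal, exactly as backpropagation prescribes.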

Key Insight

Stacking neurons in layers lets you compute anything. Each layer combines the previous layer's outputs into something more complex. Backpropagation lets you train all the weights at once by flowing the error signal backward through every layer. A network with ten thousand weights is no harder to train in principle than one with ten. It just takes more computation.

What's Next

We've built up from a single neuron to deep networks. A neuron is just a weighted sum and an activation function. But wire enough of them together, and finding the right weights is what creates the intelligence. Backpropagation makes that possible, even for networks with billions of weights.

But what is that weighted sum, really? In the next chapter, we'll meet vectors, lists of numbers that describe things, and discover that a neuron's weighted sum is actually a geometric operation called a dot product. This connection reveals why a single neuron can only draw a straight line through input space, why XOR is impossible for one neuron, and why activation functions are the key to making depth work.

Try it in PyTorch — Optional

Build a neuron from scratch, implement AND/OR gates, watch a single neuron fail on XOR, then train a two-layer network that succeeds, and visualize its decision boundary.

Open in Google Colab →

I'd love to hear from you.

I want every chapter to be easy for everyone to understand. Please send a message if anything was unclear, if you'd like something explained in more depth, or if there's something about this part of AI you wanted to understand that the chapter didn't cover. I'll get an email and reply when I can.