Building a Brain
What's the simplest building block that can learn to compute anything? Your brain uses one. GPT and Claude use an artificial version of the same idea. You might expect it to be fearsomely complex, but it's actually remarkably simple.
A neural network is made of neurons, arranged in layers. Each connection has a weight, a number that controls how much attention one neuron pays to another.
So how big are these networks? Here's a comparison of biological brains and AI systems, measured by their number of connections:
There's something strange in this chart. Today's largest AI models have roughly as many parameters as a mouse brain has synapses (though comparing artificial parameters to biological synapses is not an exact match). Yet they can write essays and translate languages, and mice generally can't. And elephants have more brain neurons than humans, yet aren't smarter than us. The number of connections isn't everything. What matters most is how good the weights are. That's what training is all about, and we'll get deeper into it in later chapters.
The Building Block
Every neuron receives numbers as inputs and produces a number as output. Each neuron receives inputs from the previous layer, does a little computation, and passes a result to the next layer. But what's happening inside each one?
A neuron does two things:
- Weighted sum. Each input gets multiplied by a weight, a number that controls how much that input matters. All the weighted inputs get added together, plus a constant called the bias. This is just multiplication and addition: nothing fancy yet.
- Activation. The result gets passed through a function that determines how much to pass on to the next layer. This is called the activation function, and it's what makes neurons more than just arithmetic. We'll see exactly why it matters later in this chapter.
A neuron is a weighted sum followed by an activation function.
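Here is a minimal sketch of those two steps in plain Python. The function name `neuron` and the example numbers are illustrative assumptions, and the sigmoid (introduced later in this chapter) stands in for the activation function.

```python
import math

def sigmoid(z):
    """S-shaped activation: squashes any number into the range 0 to 1."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron(inputs, weights, bias):
    """A neuron: weighted sum of the inputs plus a bias, then an activation."""
    # Step 1: weighted sum (multiplication and addition only).
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    # Step 2: the activation function decides how much to pass on.
    return sigmoid(z)

# Hypothetical example values, chosen only for illustration.
output = neuron(inputs=[1.0, 0.5], weights=[2.0, -1.0], bias=0.25)
print(output)  # a number between 0 and 1
```

Changing any weight or the bias shifts the weighted sum, and the activation turns that shift into a new output between 0 and 1.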
The structure of a neuron is just multiplication, addition, and passing the result through a function. Wire millions of them together and you have GPT, or the human brain. The hard part is finding the right values for all those weights and biases. That's what training is for, and we'll get deeper into it in later chapters.
Your First Neuron
Before we go further, let's get our hands on one. The widget below is a single neuron with two inputs. Every slider is yours to play with. There's no right answer here, just exploration.
The Sigmoid
The activation function we're using is called the sigmoid, an S-shaped curve that smoothly squashes any number into the range 0 to 1. Large positive inputs become close to 1. Large negative inputs become close to 0. Everything in between gets a smooth, graded value. Think of it as a dimmer switch: instead of snapping between off and on like a light switch, it slides smoothly between them.
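To see the squashing in numbers, here is a small sketch. It assumes the sigmoid in question is the standard logistic function, 1 / (1 + e^(-z)), which is the usual meaning of the term.

```python
import math

def sigmoid(z):
    """Standard logistic sigmoid (assumed form): 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

# Large negative inputs land near 0, large positive inputs near 1,
# and an input of 0 lands exactly in the middle.
for z in [-6, -2, 0, 2, 6]:
    print(z, round(sigmoid(z), 3))
```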
Nature loves S-curves. An enormous number of real-world phenomena follow this shape: population growth, technology adoption, learning curves, the spread of diseases. And if a trend doesn't look S-shaped to you, it's probably because you're only seeing part of one. If it looks exponential, you're seeing the beginning of an S-curve. If it looks linear, you're seeing the middle. If it looks like diminishing returns, you're seeing the end.
The key property for neural networks: when you change a weight by a small amount, the output changes by a small amount. No sudden jumps, no unpredictable flips. That's exactly what Chapter 2 told us we need for optimization to work.
But you can make the sigmoid sharper or smoother by scaling the weights. Multiply all the weights and bias by a large number and the S-curve steepens toward a hard step. Scale them down and it flattens into a gentle slope. The boundary stays in the same place. Only the sharpness changes.
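A quick sketch of that scaling effect, using hypothetical base settings (weights 1 and 1, bias -1) that put the boundary where the two inputs sum to 1:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def scaled_neuron(x1, x2, scale):
    """Two-input neuron whose weights and bias are all multiplied by `scale`."""
    # Hypothetical base settings: weights 1 and 1, bias -1.
    w1, w2, b = 1.0 * scale, 1.0 * scale, -1.0 * scale
    return sigmoid(x1 * w1 + x2 * w2 + b)

scales = (0.5, 1.0, 10.0)
# A point exactly on the boundary (inputs sum to 1).
on_boundary = [scaled_neuron(0.5, 0.5, s) for s in scales]
# A point slightly past the boundary.
just_past = [scaled_neuron(0.6, 0.6, s) for s in scales]

print("on the boundary:", on_boundary)
print("just past it:   ", just_past)
```

The boundary point stays at 0.5 for every scale, while the point just past it is pushed closer to 1 as the scale grows.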
There's a catch, though. Very sharp neurons are harder to train. Remember from Chapter 2: optimization works by measuring how a small change in a weight affects the output. On the flat parts of a sharp sigmoid, the output barely changes at all. The slope is nearly zero. The optimizer gets stuck because it can't tell which direction to move. Real networks use various tricks to keep neurons in the smooth, trainable middle of the curve, but the basic intuition is enough for now: smoother neurons are easier to train.
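The "slope is nearly zero" point can be checked directly. This sketch uses the well-known formula for the slope of the standard logistic sigmoid, s * (1 - s), where s is the sigmoid's output; the helper name `sigmoid_slope` is made up for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_slope(z):
    """Slope of the sigmoid at z: how much the output moves per unit change in z."""
    s = sigmoid(z)
    return s * (1.0 - s)

# The slope is largest in the middle of the curve and shrinks toward
# zero on the flat parts, where the optimizer gets very little signal.
for z in [0, 2, 5, 10]:
    print(z, sigmoid_slope(z))
```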
Logic Gates
You may be familiar with basic logical operations like AND, OR, and NOT. These combine true-or-false values:
- AND: "Are both things true?" You need an umbrella if it's raining AND you're going outside.
- OR: "Is at least one thing true?" You'll get wet if it's raining OR someone sprays you with a hose.
- NOT: Flips the answer. NOT true is false, NOT false is true.
These simple operations are the basis for every digital computer. Inside every chip is a network of logic gates, tiny components that compute AND, OR, NOT, and similar operations on electrical signals.
Wire enough gates together and you can compute anything. Real computer chips have billions of gates. Your phone, your laptop, every digital device works this way.
But logic gates have a problem. Their outputs snap between 0 and 1 with nothing in between. There's no smooth gradient to follow, which means you can't train a network of logic gates using the optimization techniques from Chapter 2. You can't ask "how should I adjust this gate to make the answer a little more correct?" It's all or nothing.
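Here is a small sketch of why that is a problem. It models a gate as a hard step over a weighted sum (an illustrative assumption, as are the AND-like settings) and compares it with a sigmoid over the same sum when one weight is nudged slightly.

```python
import math

def step(z):
    """Hard gate-style output: snaps to 0 or 1."""
    return 1.0 if z > 0 else 0.0

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def weighted_sum(x1, x2, w1, w2, b):
    return x1 * w1 + x2 * w2 + b

# Hypothetical AND-like settings, with both inputs on.
x1, x2 = 1.0, 1.0
w1, w2, b = 1.0, 1.0, -1.5
nudge = 0.01

before = weighted_sum(x1, x2, w1, w2, b)
after = weighted_sum(x1, x2, w1 + nudge, w2, b)

step_change = step(after) - step(before)
smooth_change = sigmoid(after) - sigmoid(before)

print("hard gate output change:", step_change)
print("smooth output change:   ", smooth_change)
```

The hard gate reports no change at all, so it gives no hint about which way to adjust the weight. The smooth version moves a little, and that small movement is the signal optimization needs.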
Neurons as Smooth Logic
One way to think of a neuron is as a smooth version of a logic gate. With the right weights, a single two-input neuron can compute AND, OR, NOT, and more, but unlike a logic gate, its output changes smoothly when you adjust the weights. That smooth change is what makes neurons trainable.
But there's something important here beyond just mimicking logic gates. Look at what happens when you slide an input to a value between 0 and 1, say 0.6. The output isn't forced to snap to "true" or "false." It can be somewhere in between: 0.73, or 0.4, or any other value. The neuron handles degrees naturally. "Probably true" rather than just "true" or "false."
This matters because real-world data is rarely cleanly true or false. Is this email spam? Probably. Is that shadow in the photo a cat? Maybe. A neuron can take uncertain inputs and produce a graded output.
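The smooth gates described in this section can be sketched as follows. The specific weights and biases are one hand-picked choice among many that work, not values taken from the chapter.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Hand-picked (hypothetical) weights; many other choices also work.
def and_gate(a, b):
    return sigmoid(10 * a + 10 * b - 15)

def or_gate(a, b):
    return sigmoid(10 * a + 10 * b - 5)

def not_gate(a):
    return sigmoid(-10 * a + 5)

for a in (0, 1):
    for b in (0, 1):
        print(a, b, "AND:", round(and_gate(a, b), 3), "OR:", round(or_gate(a, b), 3))

# An in-between input produces an in-between output rather than snapping.
maybe = and_gate(0.6, 1.0)
print("AND(0.6, 1.0):", round(maybe, 2))
```

With clean 0 or 1 inputs the outputs sit very close to 0 or 1, matching the logic gates. With an uncertain input, the output lands somewhere in between.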
Key Insight
One way to think of a neuron is as a smooth, trainable logic gate. With the right weights, it can compute AND, OR, NOT, and more, and because small changes to the weights produce small changes in the output, we can find those weights using optimization. Neurons also handle in-between values naturally, making them building blocks for computation over uncertain, continuous data.
Three Neurons Solve XOR
There's one more operation worth knowing: XOR (short for "exclusive or"). It means "one or the other, but not both." A single neuron can't solve it. No matter what weights you try, you can't get all four input combinations right at once.
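You can try this yourself with a brute-force search. The sketch below tries every combination of whole-number weights and bias from -20 to 20 (an arbitrary range chosen for illustration) and counts how many settings get all four rows right, treating an output above 0.5 as "on". A search like this is an illustration, not a proof.

```python
import math
from itertools import product

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def neuron(x1, x2, w1, w2, b):
    return sigmoid(x1 * w1 + x2 * w2 + b)

XOR = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}
AND = {(0, 0): 0, (0, 1): 0, (1, 0): 0, (1, 1): 1}

def count_solutions(truth_table, values=range(-20, 21)):
    """Count (w1, w2, b) settings on the grid that match every row of the table."""
    count = 0
    for w1, w2, b in product(values, repeat=3):
        if all((neuron(x1, x2, w1, w2, b) > 0.5) == bool(target)
               for (x1, x2), target in truth_table.items()):
            count += 1
    return count

and_count = count_solutions(AND)
xor_count = count_solutions(XOR)
print("settings that compute AND:", and_count)
print("settings that compute XOR:", xor_count)
```

Plenty of settings compute AND. None compute XOR.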
But what if we use two neurons feeding into a third? The two middle neurons are called hidden neurons. They don't produce the final answer, but instead compute intermediate results that the output neuron needs.
The trick:
- Hidden Neuron 1 computes OR: "is at least one input on?"
- Hidden Neuron 2 computes NAND (not-and): "are they not both on?"
- Output neuron computes AND of those two results: "at least one is on, AND they're not both on."
That's XOR. Three neurons, two layers, problem solved.
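The three-neuron trick above can be sketched as follows. The weights are hand-picked for illustration; a trained network would find its own values.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def neuron(inputs, weights, bias):
    return sigmoid(sum(x * w for x, w in zip(inputs, weights)) + bias)

def xor_network(a, b):
    # Hidden neuron 1 computes OR: "is at least one input on?"
    h_or = neuron([a, b], [10, 10], -5)
    # Hidden neuron 2 computes NAND: "are they not both on?"
    h_nand = neuron([a, b], [-10, -10], 15)
    # Output neuron computes AND of the two hidden results.
    return neuron([h_or, h_nand], [10, 10], -15)

for a in (0, 1):
    for b in (0, 1):
        print(a, b, round(xor_network(a, b), 3))
```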
The XOR Crisis
In 1969, Minsky and Papert published Perceptrons, proving that single-layer networks can't solve XOR. This was taken as proof that neural networks were a dead end, and it killed most neural network research for over a decade: the "AI winter." The fix was always sitting there, obvious in hindsight: use more than one layer.
Layers of Abstraction
XOR needed a single hidden layer computing intermediate facts. Harder problems need many. You can't jump straight from raw pixels to "that's a cat," so you build up through stepping stones. The first hidden layer might detect simple patterns: edges, color patches, textures. The next combines those into parts: eyes, ears, fur. A later layer combines those into whole objects: "cat" or "dog." Real networks have lots of layers, allowing them to build up many levels of abstraction. GPT-3 has 96 layers, and today's largest models likely have more.
No human decides what new things each layer computes. The AI learns that as part of its training. So how do we train weights buried deep inside the network? Surprisingly straightforwardly. Every weight contributes to the final output, so it contributes to the error, which means every weight has a gradient (as we saw in the optimization chapter) telling us which direction to nudge it. We run gradient descent on every weight at once, using an algorithm called backpropagation that computes those gradients efficiently by flowing the error signal backward through each layer.
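Here is a hand-written sketch of that backward flow for the small two-layer XOR-style network (two inputs, two hidden neurons, one output). It assumes a squared-error loss and made-up starting weights and parameter names, and it checks one gradient against a direct "nudge and measure" estimate. Real frameworks such as PyTorch compute these gradients automatically.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(p, x1, x2):
    """2 inputs -> 2 hidden neurons -> 1 output neuron."""
    h1 = sigmoid(p["w11"] * x1 + p["w12"] * x2 + p["b1"])
    h2 = sigmoid(p["w21"] * x1 + p["w22"] * x2 + p["b2"])
    out = sigmoid(p["v1"] * h1 + p["v2"] * h2 + p["c"])
    return h1, h2, out

def loss(p, x1, x2, target):
    """Squared error (an assumed choice of loss)."""
    return (forward(p, x1, x2)[2] - target) ** 2

def backprop(p, x1, x2, target):
    h1, h2, out = forward(p, x1, x2)
    # Error signal at the output neuron (chain rule through the sigmoid).
    d_out = 2 * (out - target) * out * (1 - out)
    grads = {"v1": d_out * h1, "v2": d_out * h2, "c": d_out}
    # Flow the error signal backward to each hidden neuron.
    d_h1 = d_out * p["v1"] * h1 * (1 - h1)
    d_h2 = d_out * p["v2"] * h2 * (1 - h2)
    grads.update({"w11": d_h1 * x1, "w12": d_h1 * x2, "b1": d_h1,
                  "w21": d_h2 * x1, "w22": d_h2 * x2, "b2": d_h2})
    return grads

def numeric_gradient(p, name, x1, x2, target, eps=1e-6):
    """Estimate one gradient by nudging a single weight up and down."""
    up, down = dict(p), dict(p)
    up[name] += eps
    down[name] -= eps
    return (loss(up, x1, x2, target) - loss(down, x1, x2, target)) / (2 * eps)

# Made-up starting weights and one training example: inputs (1, 0), target 1.
params = {"w11": 0.5, "w12": -0.4, "b1": 0.1,
          "w21": -0.3, "w22": 0.8, "b2": -0.2,
          "v1": 0.7, "v2": -0.6, "c": 0.05}
x1, x2, target = 1.0, 0.0, 1.0

grads = backprop(params, x1, x2, target)
print("backprop w11:", grads["w11"])
print("numeric  w11:", numeric_gradient(params, "w11", x1, x2, target))

# One gradient descent step: nudge every weight at once, against its gradient.
learning_rate = 0.5
loss_before = loss(params, x1, x2, target)
stepped = {name: value - learning_rate * grads[name] for name, value in params.items()}
loss_after = loss(stepped, x1, x2, target)
print("loss before:", loss_before, "after:", loss_after)
```

The two gradient estimates should agree closely, but backpropagation gets all nine gradients from a single backward pass instead of nudging each weight separately. That saving is what makes training practical at scale.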
What's Next
We've built up from a single neuron to deep networks. A neuron is just a weighted sum and an activation function. But wire enough of them together, and finding the right weights is what creates the intelligence. Backpropagation makes that possible, even for networks with billions of weights.
But what is that weighted sum, really? In the next chapter, we'll meet vectors, lists of numbers that describe things, and discover that a neuron's weighted sum is actually a geometric operation called a dot product. This connection reveals why a single neuron can only draw a straight line through input space, why XOR is impossible for one neuron, and why activation functions are the key to making depth work.
Try it in PyTorch — Optional
Build a neuron from scratch, implement AND/OR gates, watch a single neuron fail on XOR, then train a two-layer network that succeeds, and visualize its decision boundary.