Universal Approximation

A single hidden layer with enough neurons can approximate any continuous function (Cybenko 1989, Hornik 1991). But “can” doesn’t mean “will.” The universal approximation theorem is existential, not constructive. It says the function exists in the hypothesis space, not that gradient descent will find it. The theorem guarantees a solution is representable — not that it is reachable by any particular training procedure.

Try the spiral dataset with one hidden layer of 16 neurons versus two layers of 8. Both architectures have similar parameter counts, but depth often matters more than width in practice. Deeper networks compose features hierarchically — the first layer might learn to detect angles, the second layer combines angles into curves. A single wide layer must learn everything at once. The approximation theorem doesn’t care, but gradient descent does.
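The depth-versus-width comparison can be made concrete by counting parameters. This is a minimal sketch, assuming the lab's 2-D inputs and a single output unit; `paramCount` is a hypothetical helper, not the lab's own code.

```javascript
// Parameter count for a fully connected net: each layer contributes
// (fanIn * fanOut) weights plus fanOut biases.
function paramCount(layerSizes) {
  let total = 0;
  for (let i = 1; i < layerSizes.length; i++) {
    total += layerSizes[i - 1] * layerSizes[i] + layerSizes[i];
  }
  return total;
}

console.log(paramCount([2, 16, 1]));    // one wide layer:    2*16+16 + 16*1+1  = 65
console.log(paramCount([2, 8, 8, 1]));  // two narrow layers: 2*8+8 + 8*8+8 + 8*1+1 = 105
```

The counts are the same order of magnitude, yet the two-layer network gets an extra level of feature composition for its parameters.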

Why Nonlinearity Matters

Without activation functions, a neural network is just a composition of linear transformations — which is itself linear. No matter how many layers you stack, the network can only learn linear decision boundaries. Matrix multiplication followed by matrix multiplication is just… matrix multiplication. A thousand layers deep, the network is still fitting a hyperplane. The activation function is what bends space.
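The collapse of stacked linear layers can be checked directly: W₂(W₁x) always equals (W₂W₁)x. A minimal 2×2 sketch:

```javascript
// Two linear "layers" with no activation collapse into one matrix.
function matVec(M, v) {
  return M.map(row => row.reduce((s, m, i) => s + m * v[i], 0));
}
function matMul(A, B) {
  return A.map(row =>
    B[0].map((_, j) => row.reduce((s, a, k) => s + a * B[k][j], 0))
  );
}

const W1 = [[1, 2], [3, 4]];
const W2 = [[0, 1], [1, 1]];
const x  = [5, -2];

const stacked = matVec(W2, matVec(W1, x)); // two layers applied in sequence
const single  = matVec(matMul(W2, W1), x); // one combined layer

console.log(stacked, single); // [ 7, 8 ] [ 7, 8 ] — identical
```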

Try the circle dataset with different activations. ReLU creates piecewise linear boundaries — the circle becomes a polygon, its edges the creases where rectified linear units switch on and off. Sigmoid creates smooth, flowing curves but can suffer from vanishing gradients in deeper networks — the derivative approaches zero for large inputs, starving early layers of learning signal. Tanh is similar to sigmoid but zero-centered, which often helps training converge faster because gradients don’t have a systematic positive bias.
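The vanishing-gradient claim is easy to verify numerically: sigmoid's derivative is σ(x)(1 − σ(x)), which peaks at 0.25 and collapses toward zero for large |x|, while ReLU passes gradient 1 for any positive input.

```javascript
const sigmoid      = x => 1 / (1 + Math.exp(-x));
const sigmoidDeriv = x => sigmoid(x) * (1 - sigmoid(x));
const reluDeriv    = x => (x > 0 ? 1 : 0);

console.log(sigmoidDeriv(0));   // 0.25 — the maximum possible
console.log(sigmoidDeriv(10));  // ~4.5e-5 — almost no learning signal
console.log(reluDeriv(10));     // 1 — gradient flows undiminished
```

Multiply a few of those sigmoid derivatives together across layers and the product, hence the gradient reaching early layers, shrinks geometrically.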

Backpropagation

The algorithm that makes deep learning possible. For each training example, the network computes its prediction (forward pass), measures the error against the true label, then propagates that error backward through every layer using the chain rule of calculus. Each weight learns how much it contributed to the mistake, and adjusts proportionally. The beauty of backpropagation is that it computes all gradients in a single backward sweep — the cost is linear in the number of parameters, not exponential.
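The forward pass that precedes the backward sweep can be sketched in the same style as the snippet below. The field names (`weights`, `biases`, `preActs`) are assumptions about the lab's internals, chosen to match the backward-pass code; the key detail is that each pre-activation is saved, because the backward pass needs it to evaluate the activation's derivative.

```javascript
// Forward pass for one dense layer (hypothetical field names).
function forward(layer, prevActivation, activation) {
  layer.preActs = [];
  const out = [];
  for (let j = 0; j < layer.neurons; j++) {
    let z = layer.biases[j];
    for (let i = 0; i < prevActivation.length; i++) {
      z += layer.weights[j][i] * prevActivation[i];
    }
    layer.preActs[j] = z;   // saved for the backward pass
    out[j] = activation(z);
  }
  return out;
}

// Example: a single ReLU neuron with weights [2, -1] and bias 0.5
const relu  = z => Math.max(0, z);
const layer = { neurons: 1, weights: [[2, -1]], biases: [0.5], preActs: [] };
console.log(forward(layer, [1, 3], relu)); // [ 0 ] since 2*1 - 1*3 + 0.5 = -0.5
```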

// Backward pass for one layer
for (let j = 0; j < layer.neurons; j++) {
  // gradient of loss w.r.t. pre-activation
  let delta = layer.deltas[j];
  for (let i = 0; i < prevActivation.length; i++) {
    // weight gradient = delta * input
    layer.weights[j][i] -= lr * delta * prevActivation[i];
  }
  layer.biases[j] -= lr * delta;
}

// propagate delta to the previous layer (using the pre-update weights)
for (let i = 0; i < prevLayer.neurons; i++) {
  let sum = 0;
  for (let j = 0; j < layer.neurons; j++) {
    sum += layer.weightsBeforeUpdate[j][i] * layer.deltas[j];
  }
  prevLayer.deltas[i] = sum * prevLayer.activationDeriv(prevLayer.preActs[i]);
}

Rumelhart, Hinton, and Williams published backpropagation for multilayer networks in 1986, though the idea had been discovered independently several times before. What made their paper revolutionary was the demonstration that it worked — that gradient descent through deep compositions of simple functions could learn meaningful internal representations, not just input-output mappings.

The Loss Landscape

Training a neural network is optimization over a high-dimensional surface. The loss function defines a landscape over weight-space, and gradient descent is a ball rolling downhill. But the landscape is not convex — it has saddle points, local minima, flat plateaus, and narrow ravines. For a network with 100 parameters, this surface exists in 100-dimensional space. Our spatial intuitions fail completely.

Click Reset Weights multiple times on the spiral dataset and watch. Different random initializations lead to different solutions. Some find the spiral beautifully; others get stuck in local minima, carving out crude approximations that capture the rough shape but miss the fine structure. The loss landscape is the same each time — only the starting point changes. This is why initialization matters, and why the same architecture can produce wildly different results on the same data.
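The same sensitivity to the starting point shows up even in one dimension. A minimal sketch, not the lab's code: gradient descent on a non-convex loss with two minima settles in whichever basin it starts in.

```javascript
// A 1-D non-convex loss with minima at w = -1 and w = +1.
const loss = w => (w * w - 1) ** 2;
const grad = w => 4 * w * (w * w - 1);

function descend(w, lr = 0.05, steps = 200) {
  for (let t = 0; t < steps; t++) w -= lr * grad(w);
  return w;
}

console.log(descend(-0.5)); // → near -1
console.log(descend(0.5));  // → near +1
```

Same landscape, same learning rate, different initialization, different solution.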

Overfitting and the Bias-Variance Tradeoff

A network with enough parameters can memorize any training set perfectly — just create one narrow region for each point. This is overfitting: the model learns the noise in the data, not the signal. Watch what happens with 3 layers of 16 neurons on just 10 data points. The boundary becomes wildly complex, threading between individual points with baroque precision. It has learned nothing about the underlying pattern — it has merely memorized the training examples.

The regularization slider adds a penalty for large weights, proportional to their squared magnitude. This encourages the network to find simpler, smoother decision boundaries — solutions that use many small weights rather than a few large ones. It is a formal expression of Occam’s razor: among models that fit the data, prefer the simplest. The tradeoff is fundamental. Too little capacity (or too much regularization) and the model underfits — it cannot capture the true pattern. Too much capacity (or too little regularization) and it overfits — it captures the pattern and the noise together. The art of machine learning is finding the boundary between these regimes.
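In update-rule form, the squared-magnitude penalty λw² contributes 2λw to each weight's gradient, so every step shrinks the weight toward zero ("weight decay"). A one-step sketch under those assumptions, not the lab's exact code:

```javascript
// Gradient step with an L2 penalty: d/dw [ loss + lambda * w^2 ]
//   = dataGrad + 2 * lambda * w
function updateWeight(w, dataGrad, lr, lambda) {
  return w - lr * (dataGrad + 2 * lambda * w);
}

// With zero data gradient, the weight just decays geometrically:
let w = 1.0;
for (let t = 0; t < 3; t++) w = updateWeight(w, 0, 0.1, 0.5);
console.log(w); // ≈ 0.729, i.e. 0.9^3
```

This is why regularized networks favor many small weights over a few large ones: large weights pay a quadratically growing toll on every update.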

From Perceptrons to Deep Learning

Rosenblatt’s perceptron (1957) was the first trainable neural network — a single layer of weights that could learn linear decision boundaries. It was a sensation. The New York Times reported that it was “the embryo of an electronic computer that [the Navy] expects will be able to walk, talk, see, write, reproduce itself, and be conscious of its existence.” Then in 1969, Minsky and Papert published Perceptrons, proving rigorously that a single-layer network could not even represent XOR — the simple exclusive-or function. The proof was correct and devastating. Funding dried up. The first AI winter began.

The XOR preset in this lab is the problem that almost ended neural network research. Try it with zero hidden layers and watch the network struggle with a problem it literally cannot solve — no linear boundary can separate the four quadrants correctly. Now add one hidden layer. The problem becomes trivial. The network learns in seconds what a single-layer perceptron could never learn at all. The difference between impossible and easy was one hidden layer and the backpropagation algorithm to train it. Rumelhart, Hinton, and Williams (1986) showed the way out. Thirty years later, networks with hundreds of layers would learn to see, speak, translate, and generate language. The history of artificial intelligence hinged on this classification boundary.
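How little it takes once that layer exists can be shown with hand-picked weights — the representation Minsky and Papert proved a single layer cannot have. A sketch using two ReLU hidden units (not the weights the lab would learn):

```javascript
// XOR with one hidden layer of two ReLU units:
//   xor(a, b) = relu(a + b) - 2 * relu(a + b - 1)
// The first unit fires when either input is on; the second fires only
// when both are, and its -2 output weight cancels the first.
const relu = z => Math.max(0, z);
const xor  = (a, b) => relu(a + b) - 2 * relu(a + b - 1);

console.log(xor(0, 0), xor(0, 1), xor(1, 0), xor(1, 1)); // 0 1 1 0
```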