In Part 1, we established that machine learning abandons hardcoded rules. Instead of telling a computer exactly how to solve a problem, we feed it data and let it find the pattern.
But how does a computer actually "find" a pattern? It is a mechanical process of trial, error, and calculus.
To understand how a model learns, we need to make things extremely concrete. Let us look inside the black box.
The Model is Just Math
At its core, a machine learning model is a mathematical equation. It takes an input, multiplies it by some numbers, adds some numbers, and spits out a prediction.
Imagine we want to predict the price of a house based on its square footage. The simplest model we can build is a straight line. The formula for a line is:
ŷ = w * x + b
Let us break down these terms:
- x is the input data. This is the square footage of the house.
- ŷ (pronounced "y-hat") is the model's prediction. This is the estimated price.
- w is the weight. It represents how much the price increases for every additional square foot.
- b is the bias. It represents the base price of a house before factoring in the size.
When you first initialize a model, the weight w and bias b are set to random numbers. Because they are random, the initial predictions will be completely wrong. If a 2,000 square foot house actually costs $400,000, our random model might guess $12.
To fix this, the model needs a way to measure exactly how wrong it is.
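To make the starting point concrete, here is a minimal sketch of the line model with randomly initialized parameters. The function name predict, the fixed seed, and the range of the random values are illustrative assumptions, not part of any particular library.

```python
import random


def predict(x, w, b):
    """Linear model from the formula above: y_hat = w * x + b."""
    return w * x + b


# Assumption: small random starting values; the seed only makes the run repeatable.
random.seed(0)
w = random.uniform(-1.0, 1.0)
b = random.uniform(-1.0, 1.0)

square_feet = 2000
# With untrained parameters, the guess is nowhere near a realistic house price.
print(predict(square_feet, w, b))
```

Whatever number this prints, it will be far from $400,000, which is exactly the problem the rest of the training process has to solve.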
The Loss Function: Measuring Stupidity
A model cannot improve unless it knows how badly it failed. We measure this failure using a Loss Function.
The loss function compares the model's prediction (ŷ) to the true answer (y). One of the most common loss functions for predicting numbers is Mean Squared Error (MSE).
L = (1 / N) * Σ (y - ŷ)²
This formula looks intimidating, but it is doing something very simple:
- Take the true price (y) and subtract the model's prediction (ŷ). This gives the error for a single house.
- Square that error. Squaring ensures that negative and positive errors do not cancel each other out. It also heavily penalizes massive mistakes.
- Add up all the squared errors for every house in the dataset (this is what the Σ symbol means).
- Divide by the total number of houses (N) to find the average.
The result is a single number called the Loss. A high loss means the model is performing terribly. A loss of zero means the model predicts every single house price perfectly.
The entire goal of training is to find the specific values of w and b that make the loss as small as possible. On real data, the loss rarely reaches exactly zero.
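The four steps of the MSE formula translate almost line for line into code. This is a small sketch in plain Python; the function name mse_loss and the three example prices are made up for illustration.

```python
def mse_loss(y_true, y_pred):
    """Mean Squared Error: the average of (y - y_hat)^2 over all examples."""
    n = len(y_true)
    squared_errors = [(y - y_hat) ** 2 for y, y_hat in zip(y_true, y_pred)]
    return sum(squared_errors) / n


# Hypothetical true prices and model predictions for three houses.
y_true = [400000.0, 250000.0, 310000.0]
y_pred = [390000.0, 260000.0, 310000.0]

# The first two predictions are each off by $10,000; the third is exact.
print(mse_loss(y_true, y_pred))
```

Because the errors are squared, the loss is expressed in squared dollars, which is why the printed number looks enormous even when the predictions are fairly close.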
Optimization: Walking Down the Hill
Now we know our model is wrong. How do we fix it? We use an optimization algorithm called Gradient Descent.
Imagine you are standing on a foggy mountain, and you need to reach the bottom of the valley. You cannot see the valley floor, but you can feel the slope of the ground beneath your feet. To get to the bottom, you simply feel which direction goes downhill, and you take a small step in that direction.
In this analogy, the mountain is the Loss Function. The valley floor is the lowest loss the model can reach. The slope of the ground is the gradient (the derivative of the loss function with respect to each parameter, w and b).
By calculating the derivative using calculus, the model figures out exactly which way to adjust the weight and bias to make the loss smaller.
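One way to see what "feeling the slope" means is to nudge the weight slightly in both directions and watch how the loss changes. The sketch below estimates the slope numerically instead of using the calculus formula. The tiny dataset and the helper names loss and numerical_slope_w are assumptions made for this example.

```python
def loss(w, b, xs, ys):
    """MSE of the line w * x + b on the data (xs, ys)."""
    n = len(xs)
    return sum((y - (w * x + b)) ** 2 for x, y in zip(xs, ys)) / n


def numerical_slope_w(w, b, xs, ys, eps=1e-6):
    """Estimate dL/dw by stepping a tiny amount to either side of w."""
    return (loss(w + eps, b, xs, ys) - loss(w - eps, b, xs, ys)) / (2 * eps)


# Toy data that lies exactly on the line y = 2x, so the ideal weight is 2.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

# At w = 0 the weight is too small: the slope is negative, so increasing w lowers the loss.
print(numerical_slope_w(0.0, 0.0, xs, ys))
# At w = 5 the weight is too large: the slope is positive, so decreasing w lowers the loss.
print(numerical_slope_w(5.0, 0.0, xs, ys))
```

The sign of the slope tells the model which way to move, and its size tells the model how steep the ground is at that point.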
The model updates its weight using this rule (the bias b is updated the same way, using ∂L / ∂b):
w_new = w_old - α * (∂L / ∂w)
- w_new is the updated, slightly smarter weight.
- w_old is the previous weight.
- ∂L / ∂w is the gradient. It tells us how steeply the loss rises as w increases. Subtracting it moves w in the downhill direction.
- α (alpha) is the learning rate. It dictates how big a step we take down the mountain. If α is too small, learning takes forever. If α is too big, the model might overshoot the valley and make things worse.
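Here is a sketch of a single update step for the line model with MSE loss. The gradient formulas in the comments come from differentiating the MSE formula above with respect to w and b. The function names and the toy data are illustrative assumptions.

```python
def mse(w, b, xs, ys):
    """MSE of the line w * x + b on the data (xs, ys)."""
    n = len(xs)
    return sum((y - (w * x + b)) ** 2 for x, y in zip(xs, ys)) / n


def gradients(w, b, xs, ys):
    """Gradients of the MSE for the line model.

    dL/dw = (1 / N) * sum(-2 * x * (y - y_hat))
    dL/db = (1 / N) * sum(-2 * (y - y_hat))
    """
    n = len(xs)
    dw = sum(-2 * x * (y - (w * x + b)) for x, y in zip(xs, ys)) / n
    db = sum(-2 * (y - (w * x + b)) for x, y in zip(xs, ys)) / n
    return dw, db


def gradient_step(w, b, xs, ys, alpha):
    """One gradient descent update: new = old - alpha * gradient."""
    dw, db = gradients(w, b, xs, ys)
    return w - alpha * dw, b - alpha * db


# Toy data on the line y = 2x, starting from w = 0, b = 0.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

start_loss = mse(0.0, 0.0, xs, ys)
small_w, small_b = gradient_step(0.0, 0.0, xs, ys, alpha=0.01)
big_w, big_b = gradient_step(0.0, 0.0, xs, ys, alpha=1.0)

print(start_loss)
print(mse(small_w, small_b, xs, ys))  # lower than start_loss: a careful step downhill
print(mse(big_w, big_b, xs, ys))      # higher than start_loss: the step overshot the valley
```

On this toy data, the step with α = 0.01 lowers the loss, while the step with α = 1.0 jumps past the minimum and lands somewhere worse than where it started.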
The Training Loop in Action
When you train a model, you are just running this process in a rapid loop.
- Forward Pass: The model takes inputs (x) and uses its current weights (w, b) to make predictions (ŷ).
- Calculate Loss: The loss function compares ŷ to the real answers (y) and outputs an error score.
- Backward Pass (Backpropagation): The model uses calculus to compute the gradient, figuring out how each weight contributed to the error.
- Update Weights: The model adjusts its weights and bias using gradient descent.
The model repeats this loop thousands or millions of times. With every single step, the weights shift slightly. The predictions get tighter. The loss shrinks.
Eventually, the loss stops shrinking. The model has reached the bottom of the valley. The weights are locked in, and the training is complete. If the training data was representative, you now have a mathematical function that can estimate the price of a house it has never seen before.
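Putting the four steps together, a complete training loop for the house price model might look like the sketch below. Everything specific here is an assumption made for illustration: the five houses are synthetic, generated from the line price = 1.5 * size + 1.0, with size measured in thousands of square feet and price in units of $100,000. Scaling the numbers this way keeps a simple learning rate like 0.05 stable. With raw square feet and dollars, a far smaller learning rate would be needed.

```python
import random


def train(xs, ys, alpha, steps):
    """Fit y_hat = w * x + b to the data with plain gradient descent on MSE."""
    n = len(xs)
    random.seed(0)  # repeatable random starting point
    w = random.uniform(-1.0, 1.0)
    b = random.uniform(-1.0, 1.0)

    for _ in range(steps):
        # 1. Forward pass: predict with the current parameters.
        preds = [w * x + b for x in xs]
        # 2. Backward pass: gradients of the MSE with respect to w and b.
        dw = sum(-2 * x * (y - p) for x, y, p in zip(xs, ys, preds)) / n
        db = sum(-2 * (y - p) for y, p in zip(ys, preds)) / n
        # 3. Update: step against the gradient.
        w -= alpha * dw
        b -= alpha * db

    # Loss after the final update, for reporting.
    final_loss = sum((y - (w * x + b)) ** 2 for x, y in zip(xs, ys)) / n
    return w, b, final_loss


# Synthetic data: size in thousands of square feet, price in units of $100,000.
xs = [1.0, 1.5, 2.0, 2.5, 3.0]
ys = [2.5, 3.25, 4.0, 4.75, 5.5]

w, b, final_loss = train(xs, ys, alpha=0.05, steps=5000)
print(w, b, final_loss)

# Estimate a house that is not in the training data: 2,200 square feet.
print((w * 2.2 + b) * 100000)
```

Because this data lies exactly on a line, the learned w and b should land very close to 1.5 and 1.0, and the loss should be close to zero. Real housing data is noisy, so the loss would level off at some value above zero instead.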
