Mihai Anton

I’ve been interested in AI ever since the first time I saw what it could do. I remember being in high school, attending Codecamp, where I saw someone teach a computer program to write very realistic poems. I was amazed, and many questions were running through my head. How can I do the same thing? What does it take?

Fast forward to 2023: I’m doing research in AI at Google, working on very complex problems at huge scale. It wasn’t easy to get here, and there was no straightforward path from watching someone have AI generate poems to working on large-scale data systems at a company that does not just play around with AI. One way of connecting those two dots was to understand, in very deep detail, the basic algorithms that power what we call machine learning today.

While the previous post in the series was about algorithmics, now it’s time to learn about backpropagation, one of the most important algorithms for training machine learning models. Understanding backpropagation was crucial for me in this process, and I’m sure it will give you a big advantage in the tech scene.

Backpropagation: The Heart of Learning in Neural Networks

Backpropagation is an algorithm used to train neural networks: it computes the gradient of the loss function with respect to the weights between the neurons, and those gradients are then used to adjust the weights in a procedure called gradient descent. The technique relies on the chain rule from calculus, a method for computing the derivative of composite functions. Sounds complicated already, right?

Imagine a game of bowling. You're at the end of the alley, the ball is in your hand, and your goal is to knock down as many pins as possible. However, after a few tries, you realize that your throw is not perfect and you need to improve. So, what do you do? You change your approach slightly. Maybe you take a step to the right, change your swing a bit, or even increase your speed.

This iterative process of making an attempt, observing the error (missed pins), and then adjusting your approach based on that error is very similar to how backpropagation works in neural networks. The goal is to understand how much each action you took contributed to the number of pins you failed to knock down. Was the speed too high, was the position not central enough, or did the spin play the biggest role? It’s like writing the following equation: f(speed, swing, spin, position) = "number of pins down". The goal of backpropagation is to find the influence of each of the four variables, adjust them iteratively, and reduce the number of missed pins to zero. The math behind it flows very nicely and lets us write precise equations for the gradient (the influence) of every variable.
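To make the analogy concrete, here is a tiny sketch (the pins_down function and all of its numbers are invented purely for illustration) that estimates the influence of each of the four variables by nudging it a little and seeing how the score changes. Backpropagation computes exactly this kind of influence, but analytically and efficiently, via calculus:

import numpy as np

# A made-up "bowling score" function: NOT a real physics model, just a smooth
# function of the four throw parameters, used to illustrate the idea.
def pins_down(speed, swing, spin, position):
    return (10 - (speed - 7.0) ** 2
               - 0.5 * (swing - 1.0) ** 2
               - 0.3 * (spin - 2.0) ** 2
               - position ** 2)

def numerical_gradient(f, params, eps=1e-5):
    """Estimate each parameter's influence with finite differences."""
    grads = []
    for i in range(len(params)):
        bumped = list(params)
        bumped[i] += eps
        grads.append((f(*bumped) - f(*params)) / eps)
    return np.array(grads)

throw = [5.0, 0.5, 1.0, 0.3]            # speed, swing, spin, position
print(numerical_gradient(pins_down, throw))  # how much each variable moves the score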

Firstly, it's important to understand what backpropagation is trying to achieve. The goal is to optimize the weights of a neural network so that the error of its predictions is minimized. This error is usually measured with a cost or loss function, which quantifies the difference between the network's output and the actual target values. Backpropagation computes, for each weight, the rate at which the error changes with respect to that weight; gradient descent, an optimization algorithm, then uses these rates to adjust the weights.

Let's understand this with a simple example. Consider a neural network with a single hidden layer. The output y from this network for a given input x is a function of the weights w and biases b of the network. Let's denote the error of the network as E, which measures the difference between the target output t and the actual output y. The goal of backpropagation is to find the set of weights and biases that minimize E.
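To make this concrete, here is one common way of writing such a network down (my own sketch, assuming sigmoid activations and a squared-error loss, which is also what the code later in this post uses):

    h = \sigma(W_1 x + b_1)
    y = \sigma(W_2 h + b_2)
    E = \tfrac{1}{2} (t - y)^2

Here W_1, b_1 are the weights and biases of the hidden layer, W_2, b_2 those of the output layer, and \sigma is the sigmoid function.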

Here is where the calculus comes into play. To find the minimum of E, we need to find where the derivative (or gradient) of E with respect to the weights and biases is zero. In other words, we need to solve the equation ∇E = 0, where ∇E is the gradient of E.

However, E is a complex function that involves the weights and biases of the network, the activation functions, and the inputs. So we can't solve this equation directly. Instead, we use the method of gradient descent, which iteratively adjusts the weights and biases in the direction of steepest descent of E.
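Before applying it to a full network, here is gradient descent in isolation, as a minimal sketch on a made-up one-parameter error function (the function is invented just to show the update rule w ← w − learning_rate · dE/dw in action):

# Gradient descent on a toy error function E(w) = (w - 3)^2.
# The function is made up; the point is the update rule itself.
def E(w):
    return (w - 3.0) ** 2

def dE_dw(w):
    return 2.0 * (w - 3.0)

w = 0.0                  # initial guess
learning_rate = 0.1
for step in range(50):
    w -= learning_rate * dE_dw(w)    # step in the direction of steepest descent

print(w)  # approaches 3.0, the minimum of E

Each step moves w a little against the gradient, which is exactly what we will do with every weight and bias in the network.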

Now, how do we calculate these gradients? This is where the "backpropagation" part comes in. We use the chain rule of calculus to "propagate" the error backwards through the network, from the output layer to the input layer. This allows us to express the gradient of E with respect to each weight and bias in terms of the errors at each layer of the network.

The chain rule states:

If we have a variable z depending on y, and y depending on x, then the derivative of z with respect to x is the derivative of z with respect to y times the derivative of y with respect to x.
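For instance (a worked example I'm adding for illustration), take z = sin(y) and y = x². The chain rule gives dz/dx = cos(y) · 2x. The snippet below checks that analytic value against a simple finite-difference estimate:

import numpy as np

def y_of_x(x):
    return x ** 2

def z_of_y(y):
    return np.sin(y)

x = 1.3

# Chain rule: dz/dx = dz/dy * dy/dx = cos(y) * 2x
analytic = np.cos(y_of_x(x)) * 2 * x

# Finite-difference estimate of the same derivative
eps = 1e-6
numeric = (z_of_y(y_of_x(x + eps)) - z_of_y(y_of_x(x))) / eps

print(analytic, numeric)  # the two values agree closely

Backpropagation is this same idea applied layer by layer: each layer is one link in a long chain of functions.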

Intuitively, you can think of backpropagation as a method for attributing the "blame" for the error. If the network's output is far off from the target, backpropagation determines how much each weight and bias in the network contributed to that error. Those that contributed a lot are adjusted more than those that contributed a little.

This is why backpropagation is often explained in terms of "signals" or "messages" being passed backwards through the network. The error at the output is like a message that gets passed back to the weights and biases, telling them how much they need to change.

And that, in essence, is the math and intuition behind backpropagation. It's a powerful yet beautiful algorithm that combines calculus and computation to optimize neural networks.

Let's put this all into practice with some Python code:

import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_derivative(x):
    # Expects x to already be a sigmoid output, so sigma'(z) = sigma(z) * (1 - sigma(z)) = x * (1 - x)
    return x * (1 - x)

class NeuralNetwork:
    def __init__(self, x, y):
        self.input = x
        self.weights1 = np.random.rand(self.input.shape[1], 4)  # input -> hidden (4 units)
        self.weights2 = np.random.rand(4, 1)                    # hidden -> output
        self.y = y
        self.output = np.zeros(y.shape)

    def feedforward(self):
        self.layer1 = sigmoid(np.dot(self.input, self.weights1))
        self.output = sigmoid(np.dot(self.layer1, self.weights2))

    def backprop(self):
        # Chain rule, applied from the output layer back to the input layer,
        # for a squared-error loss sum((y - output)^2).
        d_weights2 = np.dot(self.layer1.T,
                            2 * (self.y - self.output) * sigmoid_derivative(self.output))
        d_weights1 = np.dot(self.input.T,
                            np.dot(2 * (self.y - self.output) * sigmoid_derivative(self.output),
                                   self.weights2.T) * sigmoid_derivative(self.layer1))
        # Nudge the weights in the direction that reduces the error.
        self.weights1 += d_weights1
        self.weights2 += d_weights2

This simple snippet demonstrates the process of backpropagation in a neural network with one hidden layer. We start by defining our activation function (sigmoid) and its derivative. The NeuralNetwork class is then defined with __init__, feedforward, and backprop methods.

In the feedforward method, we pass the input data through the network, applying our weights and the sigmoid function to get the output.

The backprop method is where the magic happens. Here, we're computing the derivatives of the error with respect to our weights, which tells us how much to adjust our weights by in order to reduce the error. These derivatives are calculated using the chain rule, exactly as discussed above. After obtaining the derivatives (or gradients), we update our weights, nudging them in the direction that will decrease our error.

This process of feedforward and backpropagation continues over many iterations or epochs, gradually reducing the error and improving the network's performance.
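To see that loop in action, here is a short usage sketch (my addition, not part of the original snippet) that trains the network above on the classic XOR problem; it reuses the numpy import and the NeuralNetwork class defined earlier:

# Inputs with a constant third column, and XOR of the first two columns as target.
X = np.array([[0, 0, 1],
              [0, 1, 1],
              [1, 0, 1],
              [1, 1, 1]])
y = np.array([[0], [1], [1], [0]])

nn = NeuralNetwork(X, y)
for epoch in range(1500):
    nn.feedforward()   # forward pass: compute predictions
    nn.backprop()      # backward pass: adjust the weights

print(nn.output)  # should end up close to [0, 1, 1, 0]

After enough epochs the predictions settle near the true XOR values, which is the whole point: repeated small weight adjustments driven by the gradients.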

And that's it! That's the essence of backpropagation, the fundamental building block that underpins much of the modern AI revolution. This simple yet powerful concept is at the heart of deep learning, and understanding it was a small but significant part of my journey to Google.

The underlying principle of backpropagation and gradient descent carries over even to complex architectures like transformers and large language models such as GPT-3 or GPT-4.

Transformers, introduced in the paper "Attention is All You Need," are a type of model architecture that makes use of self-attention mechanisms instead of recurrent layers, making them highly parallelizable and effective at capturing long-range dependencies in data. Despite the novel architecture, the training of transformers still relies on backpropagation, at its very base level, to learn from data.

When a transformer model is trained, an input sequence is passed through the model and a sequence of output vectors is produced. The discrepancy between the output sequence and the target sequence is measured using a loss function, such as cross-entropy loss for language modeling tasks.
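As a rough illustration (a toy with made-up numbers, not how a real transformer is evaluated), here is the cross-entropy loss for a single predicted token distribution:

import numpy as np

# The model's predicted probabilities over a tiny 4-token vocabulary (invented numbers).
predicted_probs = np.array([0.1, 0.2, 0.6, 0.1])
target_token = 2  # index of the correct next token

# Cross-entropy for this single prediction: -log(probability assigned to the correct token)
loss = -np.log(predicted_probs[target_token])
print(loss)  # smaller when the model puts more probability on the right token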

The goal is to adjust the parameters (weights) of the transformer model to minimize this loss. To accomplish this, the gradient of the loss with respect to the model parameters is calculated using backpropagation.
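In practice, frameworks such as PyTorch run backpropagation automatically: calling loss.backward() propagates gradients through whatever computation graph the model defines. Below is a minimal sketch of a single training step, using a plain linear layer as a stand-in for a full transformer, purely for illustration:

import torch
import torch.nn as nn

# Stand-in model: a single linear layer instead of a real transformer; the
# backward pass works the same way for both.
vocab_size, hidden = 100, 32
model = nn.Linear(hidden, vocab_size)
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# Made-up batch: 8 hidden vectors and their target token ids.
inputs = torch.randn(8, hidden)
targets = torch.randint(0, vocab_size, (8,))

logits = model(inputs)             # forward pass
loss = loss_fn(logits, targets)    # cross-entropy against the target tokens
optimizer.zero_grad()
loss.backward()                    # backpropagation: gradients for every parameter
optimizer.step()                   # gradient descent update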

In the case of transformers, the gradients are passed back through the self-attention layers, and the weights of these layers are updated to better model the data. The use of backpropagation in this context allows the model to learn intricate patterns in the data and produce highly accurate results.

Despite the increased complexity in model architecture and size, the fundamental concept remains the same - propagate the error back through the model and adjust the weights accordingly to minimize the loss.

It's worth mentioning that while backpropagation is the backbone of training these models, a lot of engineering and research goes into efficiently implementing and scaling this process for large models and datasets. Nevertheless, the underlying logic remains rooted in this key algorithm.

Understanding the core algorithm of backpropagation is fundamental to developing complex machine learning systems. This is due to several reasons:

  1. Flexibility: Regardless of the complexity of the model architecture, whether it is a basic feed-forward neural network, a convolutional neural network, a recurrent neural network, or the most advanced transformer-based models like GPT-4, backpropagation forms the backbone of the training process. Understanding it ensures you can work with a variety of machine learning models and approaches.
  2. Debugging and Optimization: Understanding backpropagation can help diagnose and troubleshoot model performance issues. You might encounter situations where your model isn't learning, and this could be due to a variety of reasons such as vanishing or exploding gradients, improper weight initialization, or inadequate learning rate. Knowledge of backpropagation helps you pinpoint these issues and take the necessary steps to resolve them.
  3. Customization: When designing novel architectures or custom layers, an understanding of backpropagation is essential. This is because when you create a new type of layer or an entirely new architecture, you'll need to define how gradients are computed and how they propagate through your model, which requires a clear understanding of the backpropagation algorithm.
  4. Innovation: The most transformative ideas in deep learning, such as Residual Networks, Transformers, and advanced optimization algorithms, have all built upon and made modifications to the basic concept of backpropagation. Therefore, to contribute to the cutting-edge research in this field, an understanding of backpropagation is indispensable.

To sum up, backpropagation is a cornerstone of modern machine learning. Understanding it opens up the entire world of deep learning and AI: from applying pre-existing architectures to creating new ones, and from debugging and improving models to contributing to the forefront of AI research. It's a simple yet powerful concept, and truly, the better we understand it, the better we can harness the power of machine learning.

Of course, mastering backpropagation and the math behind it will not guarantee that you’ll get into one of the big tech companies, but it will give you a lot of momentum to build upon. Once you know this, you can start applying it in various contexts, building nice models and even AI-backed products.

In the next post we’ll take a closer look at some core concepts of large language models, which are certainly at the center of AI at the moment. Stay tuned!