Warning! Your browser does not support this Website: Try Google-Chrome or Firefox!


by Elvira Siegel
(Published: Sat Aug 17, 2019)

or backward propagation of errors

is another supervised learning optimization algorithm. The main task of the backpropagation algorithm is to find optimal weights in a neural network by implementing gradient descent optimization technique.

The context of Backpropagation

What is the role of the Backpropagation algorithm and when do we need it? We use Backpropagation in the context of Neural Networks in he following algorithmic procedure:

  1. Forward Propagation
  2. Computing the Loss / Cost Function
  3. Applying Backpropagation

And still why do we need backpropagation specifically? Why not just use a Feed Forward approach? We will first need some background knowledge to answer this question.

Feed Forward

Feed Forward Neural Network is the basic model of an artificial neural network. It is often used for regression and classification tasks. There is no looping in the feed forward model, that is why a feed forward neural network is called so. There is only a forward flow of the information. The model takes in some input on the input layer level. Then it "feeds" this input to the hidden layers (there may be many of them). Subsequently, the calculation is forwarded to the output layer. All in all, data flows in only one direction from input through hidden layers to the output.

The data flow happens because each layer in a neural network is connected with each other by weights. Each neuron in a feed forward neural network receives a set of inputs, performs a dot product, and then applies an activation function to it. Then we just repeat that process to output a prediction.

After that we compare the prediction with the desired output. It can be seen then, that our prediction is incorrect, which means that the weight values are not optimally calculated. Consequently, we want to find the best weight values in a way that, given any input, the correct output will be calculated with the help of the optimally calculated weights. Our main goal is to minimize the found error and we can do it with the help of the algorithm called gradient descent. More on gradient descent here.

We use gradient descent to increase the accuracy of the model's prediction. Backpropagation has its name because we propagate the derivatives back through the network and at the same time we update the weights accordingly.

Small Recap on Calculus

a derivative (d) is a slope of some tangent line which is on a curve at a specific point. A derivative measures the function's instantaneous rate of change:


a partial derivative is a function with several variables:


A partial derivative of a function with several variables is its derivative with respect to one of those variables with the others kept constant.


Backpropagation is an algorithm for computing partial derivatives (the gradient) of the loss function.

Gradient (∇) is a vector of partial derivatives. In other words, we organize partial derivatives in this special vector called gradient.

Example of a gradient :

-∇C (w, b, …) = [0.17, 0.20, 0.03, 0.52, 1.72, ...]

Each value in the vector represents sensitivity of the connections (weights).

The concept of differentiation (derivative / partial derivative) is widely used in other mathematical idea used in Backpropagation, called the Chain Rule. Backpropagation is basically an algorithm that computes the chain rule.

The Chain Rule

A neural network is in its essence a big composite function. If we come back to the feed forward model, we can see that each layer of a single neural network can be represented as a function.

We can also compute derivatives of composite functions. A composite function is a function, which consists of another functions. That means, we may have a complex function that is composed of multiple nested functions.

The chain rule is used to compute partial derivatives that are stored in the gradient of the error function with respect to each weight.


The Chain Rule: algorithmic procedure

  1. take the derivative of a "sub-function"
  2. take the derivative of the entire function


Back to Backpropagation

The purpose of backpropagation is to figure out the partial derivatives of the error function with respect to the weights in order to use those in gradient descent.

That means, we backpropagate to update weights with the help of gradient descent.

Backpropagation utilizes the Chain Rule to increase efficiency. Firstly, we go forward through the whole network once. Secondly, we move through the network backwards once. By passing the network forwards and backwards we can compute all the partial derivatives we need to calculate the error. The nice part is: computationally, backpropagation takes approximately the same time as if we complete two forward passes through the network.

After we determined the error for the final output layer, we calculate the error again but this time not for each separate neuron in the hidden layers. We can accomplish this by moving backwards layer by layer in the neural network.

This error for a neuron in the hidden layers looks like this:


We can apply those errors to compute the weights variation:


This operation is repeated for all the inputs and deltas (Δ) are getting gathered and stored:


Finally, when we land at the end of one learning iteration (weights update), the actual (old) weights are changed accordingly to the newly accumulated weights (the change in weights, delta Δ) for all training samples:

Backpropagation: The Algorithmic Thinking
  1. apply feed forward (derivatives of the activation function are stored in each neuron)
  2. backpropagate to the output (accumulate all the multiplicative terms, define the backpropagated error)
  3. backpropagate to the hidden layer (compute the backpropagated error)
  4. update weights (the weights are updated in the negative direction of the gradient)

An important note is : make the weights adjustments only after the backpropagated error computation. The computation of the backpropagated error must occur for all neurons in the neural network. If you adjust weights before the backpropagated error was computed, the weights adjustment gets twisted with the backpropagation of the error. As a result the calculated adjustments do not correlate with the direction of the negative gradient.

Backpropagation Through Time

You will sometimes hear about "backpropagation through time" or BPTT for short. It is broadly speaking the same as the "normal" backpropagation only with some additional property.

In order to calculate the gradient at the current time step, BPTT will need all the previous gradient calculations. For example, we want to calculate the gradient at the time step 4, so t=4. To do that we need to compute all the gradients from the time steps t=1, t=2, t=3 and sum them up.

As we can see from the example above, to use BPTT is computationally expensive. For example, if we have 10.000 time steps on total, we have to calculate 10.000 derivatives for a single weight update, which might lead to another problem: vanishing/exploding gradients.

Both BPTT and backpropagation apply the chain rule to calculate gradients of some loss function. Backpropagation is used when computing gradients in a multi-layer network. On the other hand, BPTT is applied to the sequential data (time series). BPTT is also widely used in models like Recurrent Neural Networks (RNNs).


To sum it up, the gradient computation proceeds backwards through the entire model. Firstly, the gradient of weights in the last layer is computed. Lastly, the gradient of weights in the first layer is computed. Partial calculations of the gradient in one layer are reused in the calculation of the gradient in the previous layer. Such backwards error flow enables a highly efficient gradient computation at each layer. On the other hand, calculating the gradient of each layer separately has been proven to be less efficient.

To conclude, the main advantage of the backpropagation algorithm in comparison to the feed forward model is that with backpropagation we can calculate all the partial derivatives at the same time utilizing just one forward and one backward pass through the neural network.

Further recommended readings:

Neural Networks - Introduction

The Backpropagation Algorithm


It's AI Against Corona

2019-nCoV There has been a lot of talking about the new corona virus going around the world. Let's clear up some things about it first and then we will see how data science and ai can help us fight 2019-nCoV. ...

Activation Functions

What are activation functions in Neural Networks? First of all let's clear some terminology you need in order to understand the concept of an activation function. ...


The Convolutional Neural Network (CNN) architecture is widely used in the field of computer vision. Because we have a massive amount of data in image files, the usage of traditional neural networks wouldn't give much efficiency as the computational time would expl...

Early Stopping

In this article we will introduce you to the concept of Early Stopping and its implementation including code samples. ...


Generative Adversarial Networks (GANs) are a type of unsupervised neural networks. The network exists since 2014 and was developed by and colleges. ...

Gradient Descent

Hiking Down a Mountain Gradient Descent is a popular optimization technique in machine learning. It is aimed to find the minimum value of a function. ...

Introduction to Statistics

Part III In this third and last part of the series "Introduction to Statistics" we will cover questions as what is probability and what are its types, as well as the three probability axioms on top of which the entire probability theory is constructed. ...

Introduction to Statistics

Part I In the following three parts we will cover basic terminology as well as the core concepts from statistics. In this Part I you are going to learn about measures of central tendency (mean, median and mode). In the Part II you will read about measures of variabili...

Introduction to Statistics

Part II In this part we will continue our talk about descriptive statistics and the measures of variability such as range, standard deviation and variance as well as different types of distributions. Feel free to read the Part I of these series to deepen your knowle...

Logistic Regression

Logit Regression Logit regression is another shortened name derived from logistic unit. Logistic regression is a popular statistical model that generates probabilities for binary classification tasks. It produces discrete values and its span lies in the range of [...

Loss Functions

When training a neural network, we try to optimize the algorithm, so it gives the best possible output. This optimization needs a loss function to compute the error/loss of the model. In this article we will gain a general picture of Squared Error, Mean Sq...

The Magic Behind Tensorflow

Getting started In this article we will delve into the magic behind one of the most popular Deep Learning frameworks - Tensorflow. We will look at the crucial terminology and some core computation principles we need to grasp the real power of Tensorflow. ...

Classification with Naive Bayes

The Bayes' Theorem describes the probability of some event, based on some conditions that might be related to that event. ...

Neural Networks

Neural Networks - Introduction In Neural Networks (NNs) we try to create a program which is able to learn from experience with respect to some task. This program should cons...


Principal component analysis or PCA is a technique for taking out relevant data points (variables also called components or sometimes features) from a larger data set. From this high dimensional data set, PCA tries extracting low dimensional data points. The idea...

Introduction to reinforcement learning

Part IV: Policy Gradient In the previous articles from this series on Reinforcement Learning (RL) we discussed Model-Based and Model-Free RL. In model-free RL we talked about Value Function Approximation (VFA). In this Part we are going to learn about Policy Based R...

Introduction to Reinforcement Learning

Part I : Model-Based Reinforcement Learning Welcome to the series "Introduction to Reinforcement Learning" which will give you a broad understanding about basic (and not only :) ) techniques in the field of Reinforcement Learning. The article series assumes you have s...

Introduction to Reinforcement Learning

Part II : Model-Free Reinforcement Learning In this Part II we're going to deal with Model-Free approaches in Reinforcement Learning (RL). See what model-free prediction and control mean and get to know some useful algorithms like Monte Carlo (MC) and Temporal Differ...

Recurrent Neural Networks

RNNs A Recurrent Neural Network (RNN) is a type of neural network where an output from the previous step is given as an input to the current step. RNNs are designed to take an input series with no size limits. RNNs remember the past states and are influenced by them...


Support Vector Machines If you happened to have a classification, a regression or an outlier detection task, you might want to consider using Support Vector Machines (SVMs), a supervised learning model, that builds a line (hyperplane) to separate data into groups....

Singular Value Decomposition

Matrix factorization: Singular Value Decomposition Matrix decomposition is another name for matrix factorization. This method is a nice representation for applied linear algebra in machine learning and similar algorithms. ...

Partial Derivatives and the Jacobian Matrix

A Jacobian Matrix is a special kind of matrix that consists of first order partial derivatives for some vector function. The form of the Jacobian matrix can vary. That means, the number of rows and columns can be equal or not, denoting that in one case it is a squa...

Introduction to Reinforcement Learning

Part III: Value Function Approximation In the previous Part I and Part II of this series we described model-based and model-free reinforcement learning as well as some well known algorithms. In this Part III we are going to talk about Value Function Approximation: w...

Weight Initialization

How does Weight Initialization work? As a general rule, weights and biases are normally initialized with some random numbers. Weights and biases are extremely important model's parameters and play a pivot role in every neural network training. Therefore, one should ...

Word Embeddings

Part 1: Introduction to Word2Vec Word embedding is a popular vocabulary representation model. Such model is able to capture contexts and semantics of a word in a document. So what is it exactly? ...

Word Embeddings

Part 2: Word2Vec (Skip Gram)In the second part of Word Embeddings we will talk about what are the downsides of the Word2Vec model (Skip Gram...


T-Distributed Stochastic Neighbor Embedding If you do data analysis, machine learning or some other data driven research you will prob...
Copyright © 2024 by Richard Siegel at siegel.work Donate Contact & Privacy Policy