Recurrent Neural Networks

by Elvira Siegel
(Published: Mon Sep 23, 2019)


A Recurrent Neural Network (RNN) is a type of neural network in which the output from the previous step is fed as an input to the current step. RNNs are designed to process input sequences of arbitrary length. They remember past states and are influenced by them, which makes this type of neural network well suited for time series data.

RNN vs. Feed Forward

Feed Forward networks date back to the mid-1950s. The network is fully connected, and each hidden layer has its own set of weights and biases, so the parameters are not shared between layers. In an RNN, independent activations are converted into dependent ones: the same set of weights and biases is applied at every step. In other words, the parameters are shared across all time steps of an RNN. As a consequence, the gradient depends not only on the current calculations but on the previous calculations as well.

In an RNN, the output of each step serves as an input to the hidden layer at the next time step.


RNN Model


Feed Forward Model

Normally Feed Forward neural networks are trained with the help of backpropagation.

Recurrent Neural Network

RNNs are mainly used to operate on sequential data such as speech, video, stock market prices, etc. The network analyzes one element at a time while keeping a "memory" of what came earlier in the sequence.

Recurrent means the output at the present (current) time step becomes the input to the next time step. The model considers not just the current input, but also what it remembers about the preceding elements at each time step in the sequence.
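
The recurrence described above can be sketched in a few lines of plain Python. This is only a minimal illustration, assuming a scalar hidden state and a tanh activation; the names rnn_step, rnn_forward, w_xh, w_hh and b_h, as well as the weight values, are made up for this example and do not come from any library.

```python
import math

def rnn_step(x, h_prev, w_xh, w_hh, b_h):
    """One time step of a scalar vanilla RNN (illustrative only)."""
    # The new state mixes the current input with the previous state.
    return math.tanh(w_xh * x + w_hh * h_prev + b_h)

def rnn_forward(xs, w_xh=0.5, w_hh=0.8, b_h=0.0, h0=0.0):
    """Apply the SAME weights at every time step and collect the hidden states."""
    h = h0
    states = []
    for x in xs:
        h = rnn_step(x, h, w_xh, w_hh, b_h)
        states.append(h)
    return states

# Only the first element is non-zero, yet the later states are non-zero too:
# the hidden state carries a "memory" of the earlier input.
states = rnn_forward([1.0, 0.0, 0.0])
print(states)
```

Note that the loop reuses w_xh, w_hh and b_h at every step, which is the parameter sharing that distinguishes an RNN from a feed forward network with separate weights per layer.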


Unfolded Recurrent Neural Network

In principle, this memory structure enables the model to learn long-term dependencies in a sequence. That is, the model considers the preceding context when predicting the next element in a sequence. RNNs were developed to imitate the way humans analyze sequences: we don't analyze words separately from their context but take the entire sentence into account to produce an adequate response.

Backpropagation and the Chain Rule

We always want the network to assign a greater "confidence" to the desired target. In supervised machine learning we have labels on our data, and the result we want is that the predicted label matches the desired "gold" label.

An RNN consists entirely of differentiable operations, so we can run backpropagation on it. Backpropagation is a recursive application of the chain rule from calculus. We use it to figure out in which direction each of the cell's weights should be adjusted to increase the score of the desired label. At each iteration we also perform a parameter update, that is, we nudge every weight by a small amount in the direction indicated by the gradient, so that the correct label ends up with a higher score. As said before, an RNN differs from a standard neural network in that it shares its parameters internally: a vanilla RNN reuses the same weights at every time step, so the number of parameters does not grow with the length of the sequence.


By performing the feed forward pass we compute the loss of the model.

By performing the backward pass we compute the gradient and try to minimize the loss.
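
The two passes can be illustrated with a toy model that has a single weight. This is a sketch under simplifying assumptions, not an RNN: the model is p = sigmoid(w * x) with a cross-entropy loss, and the names forward, backward and lr, as well as the numbers, are chosen only for this example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(w, x, y):
    """Forward pass: compute the prediction and the loss."""
    p = sigmoid(w * x)
    loss = -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return p, loss

def backward(p, x, y):
    """Backward pass: gradient of the loss with respect to w."""
    # For a sigmoid output with cross-entropy loss, dL/dw = (p - y) * x.
    return (p - y) * x

w, x, y, lr = 0.0, 2.0, 1.0, 0.5
losses = []
for _ in range(5):
    p, loss = forward(w, x, y)
    losses.append(loss)
    # Parameter update: step against the gradient to reduce the loss.
    w -= lr * backward(p, x, y)

print(losses)
```

Each iteration lowers the loss, which means the score assigned to the correct label (y = 1) goes up.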

Chain Rule

The chain rule is used for backpropagation in Recurrent Neural Networks. It lets us find the derivative of a complicated expression, e.g. a function inside a larger function, which is itself inside a larger function, and so on.

We normally apply the chain rule when we deal with composite functions, also called nested functions.


Summing up: the chain rule is:

  1. take the derivative of the outside function and leave the inside part as it is.
  2. multiply it by ...
  3. ... the derivative of the inside function

If we want a derivative with respect to x, we have to keep taking derivatives of the nested functions until we reach the derivative with respect to x itself.

That means we can have many functions nested inside each other, e.g. a function g inside a function f: y = f(g(x)). The depth of such nesting is unlimited, because we can also get something like y = n(k(q(f(g(x))))). In such an example the derivative is a chain of smaller derivatives multiplied together to give the overall derivative.
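
As a small worked example of y = f(g(x)), take g(x) = x^2 and f(u) = sin(u). These two functions are picked only for illustration. The chain rule gives dy/dx = cos(x^2) * 2x, which the sketch below compares against a numerical estimate.

```python
import math

def g(x):
    return x ** 2          # inside function

def f(u):
    return math.sin(u)     # outside function

def dg(x):
    return 2 * x           # derivative of the inside function

def df(u):
    return math.cos(u)     # derivative of the outside function

def chain_rule_derivative(x):
    # Derivative of the outside function (inside left as it is)
    # multiplied by the derivative of the inside function.
    return df(g(x)) * dg(x)

def numerical_derivative(func, x, eps=1e-6):
    # Central finite difference, used here only as a sanity check.
    return (func(x + eps) - func(x - eps)) / (2 * eps)

x = 1.5
analytic = chain_rule_derivative(x)
numeric = numerical_derivative(lambda t: f(g(t)), x)
print(analytic, numeric)
```

The two values should agree up to a small numerical error. For deeper nesting, the same pattern simply adds one more factor per nested function.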

Vanishing Gradient Problem

RNNs are normally trained with backpropagation, more specifically backpropagation through time (BPTT), which applies the chain rule recursively across the time steps. Unfortunately, vanilla RNNs, which have no complex recurrent cells, suffer from the so-called vanishing gradient problem.

The main problem is that, as the error derivatives are backpropagated through many time steps, they tend either to shrink towards zero or, sometimes, to explode towards infinity. When they vanish, the "vanilla" RNN stops learning: the weight updates become negligible, and the calculations from the earlier steps contribute almost nothing to the update. Once the gradient becomes small, it drags the whole computation towards small values, which become even smaller with every further step.

Generally speaking, vanishing gradient is not intrinsically a problem of neural networks by themselves but rather an issue of gradient based methods of computations used in neural networks.

Vanishing gradients arise from the use of saturating, sigmoid-type activation functions (logistic or tanh). The logistic function (sigmoid) maps its input to values between 0 and 1, and its derivative is at most 0.25; tanh maps to values between -1 and 1, with a derivative of at most 1. During backpropagation these small derivatives are multiplied together once per time step, so the product shrinks quickly.

That is why other types of recurrent cells are often used in RNN architectures to lessen the effect of vanishing or exploding gradients. The vanishing gradient problem in particular can be mitigated by Long Short-Term Memory, or LSTM for short, which is able to learn long-term dependencies.
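
The effect can be made concrete with a deliberately simplified calculation. The sketch assumes a scalar RNN with a sigmoid activation whose pre-activation stays at z = 0 at every step, so each backward step multiplies the gradient by w_hh * sigmoid'(0) = w_hh * 0.25. The function backprop_factor and the weight values are hypothetical and serve only as an illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1 - s)     # maximum value 0.25, reached at z = 0

def backprop_factor(steps, w_hh, z=0.0):
    """Product of the per-step factors the gradient picks up over `steps` time steps."""
    factor = 1.0
    for _ in range(steps):
        factor *= w_hh * sigmoid_grad(z)
    return factor

vanishing = backprop_factor(20, w_hh=1.0)   # each step multiplies by 0.25
exploding = backprop_factor(20, w_hh=8.0)   # each step multiplies by 2.0
print(vanishing, exploding)
```

With a per-step factor below 1 the product collapses towards zero after a few dozen steps, while a factor above 1 makes it grow without bound. These are the vanishing and exploding cases respectively.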

Long Short-Term Memory or LSTM

LSTM is a type of Recurrent Neural Network that uses a memory cell. A plain RNN can process sequential information by remembering, let's say, the 10 previous elements of the sequence. That is fine when the gap between the relevant information in the context and the current state is small. But what if we need much more context? As the gap between the state we want to predict and the relevant place in the context grows, the RNN becomes unable to learn how to connect information from far back in the context to the current state. Fortunately, there is LSTM. LSTM cells allow remembering many more elements from the past.

An LSTM network consists of a chain of LSTM units. Each LSTM unit consists of a cell, an input gate, an output gate and a forget gate. All those components together are also called a memory cell.

The memory cell is an essential part of the LSTM network. Its gates control how information is processed: what is added and what is removed. In other words, the gates regulate the information flow in a unit. Each gate implements a sigmoid layer, which outputs a number between 0 and 1, where 0 means "0% of the information is coming through" and 1 means "100% of the information is coming through".


c_t is the memory cell of the LSTM neuron. i_t, o_t and f_t in the picture are the input, output and forget gates; they control the state of c_t. Each gate has its own weights and activation function.

An LSTM cell allows past information to be used at a later time. LSTMs are explicitly designed to avoid the long-term dependency problem, i.e. not being able to reach far enough back in the context from the state we want to predict. Remembering information over long periods is practically their default behavior, not something they struggle to learn.
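
How the gates act on the cell state can be sketched with a scalar LSTM step. The sketch assumes the commonly used LSTM formulation, with a tanh candidate value g in addition to the three gates. The function lstm_step, the parameter names in the dictionary and all numeric values are hypothetical and chosen only for this illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, p):
    """One scalar LSTM step; p holds illustrative weights and biases."""
    i = sigmoid(p["w_i"] * x + p["u_i"] * h_prev + p["b_i"])    # input gate
    f = sigmoid(p["w_f"] * x + p["u_f"] * h_prev + p["b_f"])    # forget gate
    o = sigmoid(p["w_o"] * x + p["u_o"] * h_prev + p["b_o"])    # output gate
    g = math.tanh(p["w_g"] * x + p["u_g"] * h_prev + p["b_g"])  # candidate value
    c = f * c_prev + i * g      # keep part of the old state, add part of the new
    h = o * math.tanh(c)        # expose part of the cell state as output
    return h, c, (i, f, o)

# Hypothetical parameters: forget gate almost fully open, input gate almost closed.
params = {
    "w_i": 0.0, "u_i": 0.0, "b_i": -10.0,
    "w_f": 0.0, "u_f": 0.0, "b_f": 10.0,
    "w_o": 0.0, "u_o": 0.0, "b_o": 0.0,
    "w_g": 1.0, "u_g": 0.0, "b_g": 0.0,
}

h, c, gates = lstm_step(x=1.0, h_prev=0.0, c_prev=0.7, p=params)
print(h, c, gates)
```

With the forget gate near 1 and the input gate near 0, the cell state stays close to its previous value of 0.7 even though a new input arrives. This is the mechanism that lets an LSTM carry information across many time steps.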


Depending on the task, we use a plain RNN or a more powerful type of RNN, the LSTM. An LSTM includes a "memory cell" that is able to hold information for a long time with the help of its "gates". LSTMs are suitable when we need to preserve a wide range of long-term dependencies in a sequence, while plain RNNs are good when we don't need information from long ago, so the range of "past dependencies" is relatively small.

A recurrent neural network is a powerful pattern recognition system for sequential data. RNNs make their predictions based on the ordering of the elements in a sequence.

Further recommended readings:

Visualizing memorization in RNNs

To learn more about derivatives, check out this article on Gradient Descent and this video on Derivative formulas through geometry

Understanding LSTM Networks


Copyright © 2024 by Richard Siegel at siegel.work