Warning! Your browser does not support this Website: Try Google-Chrome or Firefox!

Introduction to Reinforcement Learning

by Elvira Siegel
(Published: Sun Mar 06, 2020)

Part I : Model-Based Reinforcement Learning

Welcome to the series "Introduction to Reinforcement Learning" which will give you a broad understanding about basic (and not only :) ) techniques in the field of Reinforcement Learning. The article series assumes you have some basic knowledge about Neural Networks and Machine Learning in general.

In Part I of these series we will learn about model-based Reinforcement Learning methods, including Dynamic Programming : policy and value iteration algorithms, Markov Processes, the Bellman Equation and much more ... ;)

Part II

Part III

Part IV

What is Reinforcement Learning?

Reinforcement Learning (RL) belongs to the family of Machine Learning techniques that are more close to the field of Artificial Intelligence. First of all let's go through the first (and not the last :D ) portion of some important terminology in RL:

agent - a model which moves in an environment and tries to perform the best action in order to maximize the reward.

environment - the "world" where the agent interacts; environments can be provided with a model (model-based) or be model-free.

actions (A) - transition steps the agent takes to move within an environment. Every action returns a reward.

reward (R) - a response value to an action. The goal is to maximize the total reward the agent receives in one episode (all the states in between from the initial to the terminal state, sometimes also called trajectory).

policy (π) - describes how the agent acts. A policy is a function that returns the probability of taking an action a in a state s, so it is a probability distribution over actions given states. This is the strategy of the agent.

The main goal in Reinforcement Learning (RL) is to learn an optimal policy. An optimal policy describes how to act in order to maximize the reward in every state. Once more, policy π is a special mapping (=function) from states to optimal actions in these states: π(s) ∊ A(s) so that the agent always knows what to do.

The policy basically tells us how much reward the agent can get in each separate state.


We will first start from some broader concepts to get a bigger picture and afterwards will narrow down the idea to more and more specific methods.

Modelfree vs. Modelbased Approaches

There are two main methods to work with RL models:

  1. model-based methods: algorithmic methods which apply the transition probability table and the reward function from some Markov Decision Process (MDP). In RL a Markov Decision Process (MDP) represents the problem we seek to solve.
  2. model-free methods: we will discuss the model-free methods in Part II of these series but for now, a model-free approach uses no transition probability table and no reward function. The transition table and the reward function are sometimes called the "model" of the environment (MDP), so the name "model-free".

Markov Decision/Reward Process (MDP/MRP)

To fully understand Markov Decision Process (MDP) let's look at the so called Markov Property first:

The Markov Property is the memoryless property of some random process. All right and what does it mean?

Some random process incorporates the Markov property if the conditional probability of future states depends solely on the present state, hence the name "memoryless property". There is no past ... so essentially the Markov property omits all the events from the past.

P(future | present, past )


A process which satisfies this property is called a Markov process.

In the world of RL (and ML in general) it is easy to get overwhelmed with a lot of different terminology and the lingo. So let's differentiate between the following terms:

Markov Process is a tuple of the form (S, P) wherre S - a finite set of States and P - state transition Probability matrix.

Markov Reward Process (MRP) is a tuple (S, P, R, γ) where R - Reward function, γ - discounting factor (more on γ later on). MRP describes no environment (as e.x. a MDP does). A MRP tells us how much reward the agent could accumulate after passing an episode (=a sequence of states). The Reward function R is defined as:

Rstate = E[Rnext state | Scurrent stae ]

Markov Decision Process (MDP) is a tuple (S, P, R, γ, A) where a ∊ A - Action taken in a state s ∊ S.

The dependency between MDP and MRP is the following: in RL a MRP is needed when we try to fix a policy π for a MDP.

From now on we will primarily concentrate more on MDPs and on MRPs.

Markov Decision Process (MDP) is a discrete time, random control process. It describes an environment for RL, which is fully observable, hence model-based. A fully observable environment always provides the agent with all the existing states, so the agent has full knowledge of the environment. Control in RL means the approximation of the optimal policy π, also called policy improvement (more details later on).

In essence we can think of an MDP as of a mathematical formulation of an RL problem we are trying to find solution for.

A quick summary : an MDP sets up an environment for a RL problem. In an MDP the environment is fully observable. Practically all RL problems can be presented as MDPs.

As we talked about the reward function, let's generalize this idea to the so called return Gt where the subscript t is the current time step:


where R is our Reward function from above and γ is the discount factor.

The role of γ

The γ (=gamma discount factor) controls how much the agent should care about immediate or future rewards. The γ factor lies in the range [0;1] where if for example γ = 1, then the agent always considers the rewards from far away in the sequence (=the agent is "far-sighted"). If γ = 0, then the agent only cares about the first reward (=the agent is "short-sighted").

γ is called "discount factor" because it discounts the value of a state as the agent goes away from the goal. The agent follows the path (=sequence of states) with maximum values it can get in a state.

We can also think about the γ factor as when it has a low value (near to 0), the agent has short term thinking, whereas if γ is high (near to 1), the agent has long term thinking.

Value Function

The value function is a useful tool to gain the value of being in some state s. What is this value? This value shows possible future rewards that could be received by the agent from being in this state s.

The value function can be seen as cashing of useful information we can use later on and comes in two flavors:

  1. the "normal" state value function and
  2. the state-action value function.

For a given policy π the state value function is defined as:


For a given policy π the state-action value function is defined as:


In the above equation we focus on a particular action and its return in the state s.

Both equations give the expected discounted return G t starting from the state s following the policy π .

Bellman Equation

The backbone of many RL algorithms lies in the Bellman Equation. Let's recall once more: the goal of RL is to choose the optimal action which will maximize the long-term expected reward given the current state the agent is in.

The Bellman Equation helps us in finding the answer to the question: given the state the agent is currently in, if the agent takes the best possible action at the current and each further step, what long-term reward will the agent get?


We can also write this equation down as a linear equation system in matrix notation :


Using the Bellman Equation we can easily find the value of a state.

Consider an example MRP :

Imagine that we have an agent that tries to explore the environment around it and to gather maximum reward for its actions.

We have the following scenario : the agent is on a small detached lonely island which is the start state. Because nothing ever happens on this island the agent "gets bored" and receives the negative reward for being in this state (-2). The goal of the agent is to explore the environment and land on the mainland which is the terminal state with the reward (0). The agent builds a boat and tries to reach the mainland - so we have a state "boat" with the reward (-1).


Let's say that there is a 0.5 probability that the agent will still stay on the island though it's bored, 0.4 probability that the agent gets on the boat and starts the dangerous journey and 0.1. probability that the agent might be directly taken by some magical accident to the mainland without doing anything. Once the agent decides to get on the boat with 0.5 probability the agent stays on the boat, with 0.2 probability the agent returns back to the island and with 0.3 probability the agent successfully reaches the mainland and goes to a party :D .


Now we build a transition probability table for the states i - island, b - boat, m - mainland:

Transition Probability Table ( = P )
i b m
from i 0.5 0.4 0.1
b 0.2 0.5 0.3
m 0 0 0

Here we can apply the Bellman Equation written in the matrix notation we already presented above:

(I - γ P) * V = R where we assume γ = 1 :


Solving this system of linear equations will give us the values for each of our three states (island, boat, mainland).

Dynamic Programming: What Is It?

Dynamic Programming (DP) is an approach to solve complex problems consisting of multiple components. The approach breaks a bigger problem in some smaller subproblems. Then we try to solve those smaller subproblems and in the end we generate an overall solution.

In the context of DP we have to clear up some terminology (again :D):

  1. prediction: or policy evaluation, at this step we compute the state-value function Vπ for a policy π . The input of a prediction step might be an MDP or also an MRP. The output is a value function Vπ .
  2. control: policy improvement, here we try to approximate optimal policies. The input is an MDP and the output is the optimal policy π* and the optimal value function V* .

Both control and prediction fall into the category of Planning in DP.

In order to be able to use DP, our problem has to satisfy the following two criteria:

  1. it is possible to decompose the problem into smaller parts
  2. we can solve subproblems recursively by caching the information

MDPs meet both these properties as the Bellman equation gives a recursive decomposition and the value function stores solutions.

Dynamic Programming (DP) belongs to the family of model-based approaches. DP assumes complete knowledge of the MDP, so we have a transition probability table and the reward function. We will now go through two DP algorithms that help to find the optimal policy, namely: policy iteration and value iteration. With these two algorithms we are able to find the best policy π .

Policy Iteration

Let's start with an example and return to our already mentioned island agent. The agent can now perform two certain actions, namely:

  1. change position
  2. learn to fly

Moreover, the agent changes its position with probability 0.8 if it is on the island or on the boat, else it stays where it was with probability 0.2. With a very small probability 0.1 the agent learns how to fly and flies to the goal mainland directly from the island or from the boat. With a higher probability 0.9 the agent doesn't learn to fly and stays put on the island or on the boat.

change position

learn to fly


Now we will apply policy iteration to find out the optimal policy π and the state values V(s) of: island - i , boat - b and mainland - m. For simplicity we assume that the initial policy π0 has the action learn to fly in both states.

Step 1 policy evaluation:

For the policy evaluation step we will use the simplified Bellman equation for a fixed policy πfly .


where s is the current state and s' is the new state the agent moves in. We take γ = 1, so we take all the rewards into account and basically have no discounting of rewards.

Consulting the MDP from above we conclude:


Analogously we compute:

Vb = -1 + 0.1*Vm + 0.9*Vb

Vm = 0

Now we can plug in Vm = 0 into the first equations and get Vb = -10 and Vi = -20

Step 2 policy improvement:

In this step we use the Bellman Optimality Equation for finding the action that maximizes the value function for every state.


Because R and γ are independent of the action the agent takes and therefore they don't influence the maximization problem, we can simplify the above equation as follows:


We simply neglected the R and gamma γ terms.

Let's denote the Σ s' P(s'| s,a)*V(s') part as T(state, action) and we get:


Now we have to choose the maximum value. Because T(i, fly) < T(i, change) = -18 < -12 the action change is preferred in state island. However in the state boat, T(b, change) < T(b, fly) = -18 < -9, so the action fly is preferred in state boat.

We repeat the steps 1 and 2 until the new proposed actions stay unchanged.

Again let's summarize the whole algorithm systematically:

  1. start with initializing a random policy.
  2. policy evaluation step (=prediction): find the value function V(s) of this policy by applying the Bellman equation.
  3. policy improvement step (=control): find a new improved policy, take the action that maximizes the value function V(s) for each state.
  4. repeat the step 1 and 2 until the variable unchanged is True.

Value Iteration

With value iteration we can go further and join both steps. We update the Bellman Optimality Equation directly.

The algorithm updates the policy based simply on the Bellman optimality equation (= the control step only):

  1. start with some random Value function V(s).
  2. policy improvement step (=control): find a new improved optimal value function V*.
  3. repeat the step 2 until reaching the optimal value function.

Policy Iteration vs. Value Iteration

Down below you will find the pseudocode for the two algorithms. See for yourself and compare:


policy iteration


value iteration

Conclusion on Policy/Value Iteration

The policy iteration algorithm helps us to look for the optimal policy. The exact state values are not that relevant if one action is clearly the optimal one. Principally, there are two major steps: 1. policy evaluation and 2. policy improvement. Within the step 2 (policy improvement) the algorithm acts "greedily" because it takes the maximum over all the sate values.

With value iteration we can directly jump to the control step and update the value function using the Bellman optimality equation. It saves us some memory and backs up on more recent versions of the value function

The prior goal of both policy iteration and value iteration is to figure out the optimal policy for the agent to follow. The new policy has to be a strict improvement over the old one. Both policy iteration and value iteration belong to the family of Dynamic Programming methods.

Further recommended readings:

Part II : Model-Free Reinforcement Learning

Part III : Value Function Approximation

Reinforcement Learning: An Introduction, Richard S. Sutton and Andrew G. Barto

Python code for Artificial Intelligence: Foundations of Computational Agents


It's AI Against Corona

2019-nCoV There has been a lot of talking about the new corona virus going around the world. Let's clear up some things about it first and then we will see how data science and ai can help us fight 2019-nCoV. ...

Activation Functions

What are activation functions in Neural Networks? First of all let's clear some terminology you need in order to understand the concept of an activation function. ...


or backward propagation of errorsis another supervised learning optimization algorithm. The main task of the backpropagation algorithm is to find optimal weights in a by implementing optimization technique. ...


The Convolutional Neural Network (CNN) architecture is widely used in the field of computer vision. Because we have a massive amount of data in image files, the usage of traditional neural networks wouldn't give much efficiency as the computational time would expl...

Early Stopping

In this article we will introduce you to the concept of Early Stopping and its implementation including code samples. ...


Generative Adversarial Networks (GANs) are a type of unsupervised neural networks. The network exists since 2014 and was developed by and colleges. ...

Gradient Descent

Hiking Down a Mountain Gradient Descent is a popular optimization technique in machine learning. It is aimed to find the minimum value of a function. ...

Introduction to Statistics

Part III In this third and last part of the series "Introduction to Statistics" we will cover questions as what is probability and what are its types, as well as the three probability axioms on top of which the entire probability theory is constructed. ...

Introduction to Statistics

Part I In the following three parts we will cover basic terminology as well as the core concepts from statistics. In this Part I you are going to learn about measures of central tendency (mean, median and mode). In the Part II you will read about measures of variabili...

Introduction to Statistics

Part II In this part we will continue our talk about descriptive statistics and the measures of variability such as range, standard deviation and variance as well as different types of distributions. Feel free to read the Part I of these series to deepen your knowle...

Logistic Regression

Logit Regression Logit regression is another shortened name derived from logistic unit. Logistic regression is a popular statistical model that generates probabilities for binary classification tasks. It produces discrete values and its span lies in the range of [...

Loss Functions

When training a neural network, we try to optimize the algorithm, so it gives the best possible output. This optimization needs a loss function to compute the error/loss of the model. In this article we will gain a general picture of Squared Error, Mean Sq...

The Magic Behind Tensorflow

Getting started In this article we will delve into the magic behind one of the most popular Deep Learning frameworks - Tensorflow. We will look at the crucial terminology and some core computation principles we need to grasp the real power of Tensorflow. ...

Classification with Naive Bayes

The Bayes' Theorem describes the probability of some event, based on some conditions that might be related to that event. ...

Neural Networks

Neural Networks - Introduction In Neural Networks (NNs) we try to create a program which is able to learn from experience with respect to some task. This program should cons...


Principal component analysis or PCA is a technique for taking out relevant data points (variables also called components or sometimes features) from a larger data set. From this high dimensional data set, PCA tries extracting low dimensional data points. The idea...

Introduction to reinforcement learning

Part IV: Policy Gradient In the previous articles from this series on Reinforcement Learning (RL) we discussed Model-Based and Model-Free RL. In model-free RL we talked about Value Function Approximation (VFA). In this Part we are going to learn about Policy Based R...

Introduction to Reinforcement Learning

Part II : Model-Free Reinforcement Learning In this Part II we're going to deal with Model-Free approaches in Reinforcement Learning (RL). See what model-free prediction and control mean and get to know some useful algorithms like Monte Carlo (MC) and Temporal Differ...

Recurrent Neural Networks

RNNs A Recurrent Neural Network (RNN) is a type of neural network where an output from the previous step is given as an input to the current step. RNNs are designed to take an input series with no size limits. RNNs remember the past states and are influenced by them...


Support Vector Machines If you happened to have a classification, a regression or an outlier detection task, you might want to consider using Support Vector Machines (SVMs), a supervised learning model, that builds a line (hyperplane) to separate data into groups....

Singular Value Decomposition

Matrix factorization: Singular Value Decomposition Matrix decomposition is another name for matrix factorization. This method is a nice representation for applied linear algebra in machine learning and similar algorithms. ...

Partial Derivatives and the Jacobian Matrix

A Jacobian Matrix is a special kind of matrix that consists of first order partial derivatives for some vector function. The form of the Jacobian matrix can vary. That means, the number of rows and columns can be equal or not, denoting that in one case it is a squa...

Introduction to Reinforcement Learning

Part III: Value Function Approximation In the previous Part I and Part II of this series we described model-based and model-free reinforcement learning as well as some well known algorithms. In this Part III we are going to talk about Value Function Approximation: w...

Weight Initialization

How does Weight Initialization work? As a general rule, weights and biases are normally initialized with some random numbers. Weights and biases are extremely important model's parameters and play a pivot role in every neural network training. Therefore, one should ...

Word Embeddings

Part 1: Introduction to Word2Vec Word embedding is a popular vocabulary representation model. Such model is able to capture contexts and semantics of a word in a document. So what is it exactly? ...

Word Embeddings

Part 2: Word2Vec (Skip Gram)In the second part of Word Embeddings we will talk about what are the downsides of the Word2Vec model (Skip Gram...


T-Distributed Stochastic Neighbor Embedding If you do data analysis, machine learning or some other data driven research you will prob...
Copyright © 2024 by Richard Siegel at siegel.work Donate Contact & Privacy Policy