Warning! Your browser does not support this Website: Try Google-Chrome or Firefox!

Introduction to Reinforcement Learning

by Elvira Siegel
(Published: Thu Apr 02, 2020)

Part III: Value Function Approximation

In the previous Part I and Part II of this series we described model-based and model-free reinforcement learning as well as some well known algorithms. In this Part III we are going to talk about Value Function Approximation: what it is and how it can be helpful in the world of Reinforcement Learning (RL).

Part I

Part II

Part IV

In RL our goal is to find the state-action value in an environment, so that the agent follows the most optimal path and collects the maximum reward.

As we already mentioned in Part I and Part II you can read more about model-based and model-free approaches in RL. The problem with those methods is that they work well on discrete state spaces but perform poorly on continuous ones. As we discussed in Part I, a policy π has a table of the form {(s1, a1), ... , (sn,an)}, so if our Markov Decision Process (MDP) becomes too large because we have a continuous space, we encounter the following problems:

  1. We have too many states and actions to store in memory.
  2. To learn the value of each state individually would be too slow.
  3. It is not possible to represent a continuous state space by a value table.
  4. What about states that we didn't see yet? We have no way to generalize according to those states.

In which scenarios do we have such continuity? For example in some games like Backgammon, Go, the game of Chess or when we steer an RC Helicopter we may potentially have infinitely many moves, so we are in a continuous state space. We need scalable model-free methods to deal with such continuous environments.

With some games we might encounter an almost infinite amount of states, so if we use a table to store an infinite amount of some values, it quickly becomes inefficient.

Generally speaking, in RL we want to be able to cope with non-stationary targets. Those non-stationary targets are functions that change over time. In control methods, for example, which are based on generalized policy iteration, we often want to learn the qπ function while the policy π changes. The target values of training data samples are non-stationary, even if the policy π remains the same. The targets are often non-stationary if they are generated by bootstrapping methods like DP and TD learning.

Approaches which cannot handle non-stationary values are in general less desirable for effective RL.

So if we have a huge value table with all our states and actions (MDP), we can estimate the value function and perform function approximation:

for the state value function:


for the state action value function:


We take an approximate value function which is not a table anymore but a function with a weight vector w. So w in the two equations above w stays for the weight vector. The weight parameter w is updated by applying Monte Carlo (MC) or Temporal Difference (TD) Learning. More on MC and TD methods in Part II.

Value function approximation can be ordered to the model-free control. Model-free control is needed when we want to know how to learn a good policy from experience. With control we can approximate optimal policies.

With value function approximation we can generalize the value estimation. Estimation here means that we don't find the true value but a "close to the true" value, hence we approximate the true value. Intuitively an estimation approach might seem to be inconvenient but it is still computationally faster and generalization works well.


What are typical function approximators?

If we want to approximate some function, we need to choose an approximator. Typical function approximators are e.g. neural networks, Fourier bases, etc...

In this article we will use neural networks as a differentiable function approximator.

Function Approximation with Gradient Descent

One of the most popular neural network optimization techniques is called Gradient Descent. With gradient descent we can find the minimum of a function by iteratively moving in the direction of steepest descent as defined by the negative of the gradient. We normally use the gradient descent optimization algorithm to update the parameters of a machine learning model.

To learn more about Gradient Descent consult the article here.

Let's pick some arbitrary function F(w). This F(w) will be our differentiable function of the parameter w.

Now we define the gradient of the function F(w).


Now in order to find a local minimum of F(w) we have to adjust the weight vector parameter w in the direction of the negative gradient:


Where α is the learning rate.


Some Examples

We have an approximation function v(S, w) which gives some predicted (approximated) value and the true value vπ(S). How does the error that we want to minimize look like? Well, it basically can be represented with the following equation:


We can try to minimize the Mean Squared Value Error (VE) between the predicted (approximated) value v(S, w) and the true value vπ(S) which builds our objective function.

As a reminder, the formula of MSE is as follows:


Where Yi is the true value and YÌ‚i is the prediction.

So using our objective function, we get:


Where vπ(S) is the true value and v(S,w) is the approximation (=prediction).

There are more states than weights. Making one state estimate more accurate means making other state estimates less accurate. That is why we have to decide which states are more important for us. For this we can define the state distribution μ(s) >= 0, Σs μ(s) = 1 . This depicts how much we care about the error in each state s.


By applying the chain rule, we get the gradient of the objective with respect to its parameter w :


If we use gradient descent, we adjust the parameter w in the direction of the negative gradient by applying the update rule w ← w + Δw where:


The update rule for Stochastic Gradient Descent (SGD) of a single sample S (without a weighting) will look the following way:


In most cases we do not have the magical true value function vπ(S) ... Nevertheless we have rewards and can use the targets from Monte Carlo or Temporal Difference Learning predictions:

The Monte Carlo target is the return Gt :


So we have:


Analogously we have the Temporal Difference Learning target Rt+1 + γ*vÌ‚ (St+1,w):



What is bootstrapping in the context of RL? Well, bootstrapping methods update estimates of states based on estimates of future states. In other words, such methods update estimates based on other estimates. This concept is called bootstrapping.

With bootstrapping we update targets with their present estimates (like in Dynamic Programming or Temporal Difference Learning) and we do not rely on actual rewards and complete returns (as e.g. in Monte Carlo methods).

With bootstrapping we can achieve faster learning for our model. Faster learning with bootstrapping happens because with bootstrapping the model can learn to recognize a state by repeatedly returning to it. One of possible problems with bootstrapping is that it can make learning less efficient for problems where the state representation is not good enough which results in poor generalization.

The Deadly Triad

There are many possibilities of how our system can output unsatisfying results. Instability of results comes along with the usage of a particular configuration with the three following features:

  1. Function Approximation
  2. Bootstrapping
  3. Off-Policy Training

Such combination within one model is called the deadly triad.

How do we deal with the deadly triad? Well, one possibility would be to find out whether there is one of the three deadly triad components that can be omitted.

What if only two deadly triad elements are present? If there are not all three of them, then instability can be avoided.

Moreover, if all three elements are present, the instability problems appears already with a simple prediction task. So, your RL task or your ML model doesn't have to be complex to suffer from the deadly triad.

Which of the three could be omitted?

It is definitely not advisable to omit function approximation. We still need scalable methods for larger problems with continuous spaces. It could be possible to try Linear Function Approximation instead, but linear function approximation has too many features and parameters, which in turn has its own negative sides.

All right, but what if we replace function approximation with some non-parametric method? Well, the problem here is: the time complexity of non-parametric methods grows with data. Besides non-parametric methods are in general too weak or too expensive.

What about Least Squares Methods? There are some, for example Least Squares Temporal Difference (LSTD). The problem with those methods is that they all have quadratic time complexity and are therefore too computationally expensive for larger problems.

As a result, it's better to keep Function Approximation.

Omitting bootstrapping may be possible. But then we encounter losses in computational efficiency. We will probably try to keep it.

Off-policy learning. Can we do without it good enough? On-policy methods are often suitable as well, so often we can replace an off-policy with an on-policy learning.

As an example, in case of model-free reinforcement learning, we can often apply SARSA (on-policy), rather than Q-learning (off-policy).


Function approximation is a scalable method for applications with continuous state spaces which are much larger than the memory and computational resources we might have.

Further recommended readings:

Part I : Model-Based Reinforcement Learning

Part II : Model-Free Reinforcement Learning

Reinforcement Learning: An Introduction, Richard S. Sutton and Andrew G. Barto


It's AI Against Corona

2019-nCoV There has been a lot of talking about the new corona virus going around the world. Let's clear up some things about it first and then we will see how data science and ai can help us fight 2019-nCoV. ...

Activation Functions

What are activation functions in Neural Networks? First of all let's clear some terminology you need in order to understand the concept of an activation function. ...


or backward propagation of errorsis another supervised learning optimization algorithm. The main task of the backpropagation algorithm is to find optimal weights in a by implementing optimization technique. ...


The Convolutional Neural Network (CNN) architecture is widely used in the field of computer vision. Because we have a massive amount of data in image files, the usage of traditional neural networks wouldn't give much efficiency as the computational time would expl...

Early Stopping

In this article we will introduce you to the concept of Early Stopping and its implementation including code samples. ...


Generative Adversarial Networks (GANs) are a type of unsupervised neural networks. The network exists since 2014 and was developed by and colleges. ...

Gradient Descent

Hiking Down a Mountain Gradient Descent is a popular optimization technique in machine learning. It is aimed to find the minimum value of a function. ...

Introduction to Statistics

Part III In this third and last part of the series "Introduction to Statistics" we will cover questions as what is probability and what are its types, as well as the three probability axioms on top of which the entire probability theory is constructed. ...

Introduction to Statistics

Part I In the following three parts we will cover basic terminology as well as the core concepts from statistics. In this Part I you are going to learn about measures of central tendency (mean, median and mode). In the Part II you will read about measures of variabili...

Introduction to Statistics

Part II In this part we will continue our talk about descriptive statistics and the measures of variability such as range, standard deviation and variance as well as different types of distributions. Feel free to read the Part I of these series to deepen your knowle...

Logistic Regression

Logit Regression Logit regression is another shortened name derived from logistic unit. Logistic regression is a popular statistical model that generates probabilities for binary classification tasks. It produces discrete values and its span lies in the range of [...

Loss Functions

When training a neural network, we try to optimize the algorithm, so it gives the best possible output. This optimization needs a loss function to compute the error/loss of the model. In this article we will gain a general picture of Squared Error, Mean Sq...

The Magic Behind Tensorflow

Getting started In this article we will delve into the magic behind one of the most popular Deep Learning frameworks - Tensorflow. We will look at the crucial terminology and some core computation principles we need to grasp the real power of Tensorflow. ...

Classification with Naive Bayes

The Bayes' Theorem describes the probability of some event, based on some conditions that might be related to that event. ...

Neural Networks

Neural Networks - Introduction In Neural Networks (NNs) we try to create a program which is able to learn from experience with respect to some task. This program should cons...


Principal component analysis or PCA is a technique for taking out relevant data points (variables also called components or sometimes features) from a larger data set. From this high dimensional data set, PCA tries extracting low dimensional data points. The idea...

Introduction to reinforcement learning

Part IV: Policy Gradient In the previous articles from this series on Reinforcement Learning (RL) we discussed Model-Based and Model-Free RL. In model-free RL we talked about Value Function Approximation (VFA). In this Part we are going to learn about Policy Based R...

Introduction to Reinforcement Learning

Part I : Model-Based Reinforcement Learning Welcome to the series "Introduction to Reinforcement Learning" which will give you a broad understanding about basic (and not only :) ) techniques in the field of Reinforcement Learning. The article series assumes you have s...

Introduction to Reinforcement Learning

Part II : Model-Free Reinforcement Learning In this Part II we're going to deal with Model-Free approaches in Reinforcement Learning (RL). See what model-free prediction and control mean and get to know some useful algorithms like Monte Carlo (MC) and Temporal Differ...

Recurrent Neural Networks

RNNs A Recurrent Neural Network (RNN) is a type of neural network where an output from the previous step is given as an input to the current step. RNNs are designed to take an input series with no size limits. RNNs remember the past states and are influenced by them...


Support Vector Machines If you happened to have a classification, a regression or an outlier detection task, you might want to consider using Support Vector Machines (SVMs), a supervised learning model, that builds a line (hyperplane) to separate data into groups....

Singular Value Decomposition

Matrix factorization: Singular Value Decomposition Matrix decomposition is another name for matrix factorization. This method is a nice representation for applied linear algebra in machine learning and similar algorithms. ...

Partial Derivatives and the Jacobian Matrix

A Jacobian Matrix is a special kind of matrix that consists of first order partial derivatives for some vector function. The form of the Jacobian matrix can vary. That means, the number of rows and columns can be equal or not, denoting that in one case it is a squa...

Weight Initialization

How does Weight Initialization work? As a general rule, weights and biases are normally initialized with some random numbers. Weights and biases are extremely important model's parameters and play a pivot role in every neural network training. Therefore, one should ...

Word Embeddings

Part 1: Introduction to Word2Vec Word embedding is a popular vocabulary representation model. Such model is able to capture contexts and semantics of a word in a document. So what is it exactly? ...

Word Embeddings

Part 2: Word2Vec (Skip Gram)In the second part of Word Embeddings we will talk about what are the downsides of the Word2Vec model (Skip Gram...


T-Distributed Stochastic Neighbor Embedding If you do data analysis, machine learning or some other data driven research you will prob...
Copyright © 2021 by Richard Siegel at siegel.work Donate Contact & Privacy Policy