

by Elvira Siegel
(Published: Thu Jan 06, 2020)

The Convolutional Neural Network

The Convolutional Neural Network (CNN) architecture is widely used in the field of computer vision. Image files contain a massive amount of data, so traditional fully connected neural networks would be inefficient: the number of weights, and with it the computational time, would explode.

This article assumes you have basic knowledge of neural networks and matrix manipulation. In case you need a refresher on these topics, check out the following links:

Neural Networks Introduction

Partial Derivatives and the Jacobian Matrix

The Vocabulary of CNNs

What is a convolution?

A convolution is an operation of "convolving" an input matrix (input image) with a filter matrix also called kernel or just filter:


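We slide the filter matrix across the input matrix. At each position we multiply the overlapping values element-wise and sum up the products, which gives one value of the output: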


convolution operation: (left) input - 5x5 matrix + filter; (right) output - 3x3 feature map
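The sliding-window computation can be sketched in a few lines of plain Python. This is a minimal illustration, assuming a square input, a square filter, stride 1 and no padding; the function name `conv2d_valid` and the pixel values are made up for this example and are not taken from the picture above.

```python
def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` (stride 1, no padding) and return the feature map."""
    m, f = len(image), len(kernel)
    out_size = m - f + 1
    feature_map = []
    for i in range(out_size):
        row = []
        for j in range(out_size):
            # Multiply the current window element-wise with the kernel and sum the products.
            total = sum(
                image[i + a][j + b] * kernel[a][b]
                for a in range(f)
                for b in range(f)
            )
            row.append(total)
        feature_map.append(row)
    return feature_map


# Illustrative 5x5 input and 3x3 filter (values chosen for this sketch only).
image = [
    [1, 1, 1, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 0],
    [0, 1, 1, 0, 0],
]
kernel = [
    [1, 0, 1],
    [0, 1, 0],
    [1, 0, 1],
]

feature_map = conv2d_valid(image, kernel)
for row in feature_map:
    print(row)
# [4, 3, 4]
# [2, 4, 3]
# [2, 3, 4]
```

A 5x5 input and a 3x3 filter give a 3x3 feature map, as in the picture.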


To understand what's going on in the convolution layer in depth, the idea of a filter is crucial:

A filter normally has a matrix form and consists of randomly initialized (or predefined) numbers. One filter detects one specific pattern, be it edges, corners, circles, etc. We can have many different filters, depending on the complexity of the shapes we want to detect, and each of these filters has the task of detecting one particular feature. For example, we can have an edge detector filter, a corner detector filter, a circle detector filter and so on.

One of the key concepts behind CNNs is that filters are parameters that can be learned over time. The model learns to distinguish between diverse shapes in the image by changing the parameters of its filters so that they suit the desired output.

Note: sometimes filter is called kernel or convolutional matrix. All these terms can be used interchangeably.

Types of Filters

Normally in CNNs you train (learn) your own filters, which will suit exactly your type of task. But you can also choose predefined filters. Let's look at a short introduction to some edge detection filters:

Sobel Filter

This type of filter is based on gradient calculation. From the original image the filter computes the first order derivatives for x and y axes respectively. As image data is not continuous data, those derivatives we get are only approximations. In order to be able to approximate the derivatives, we use the Sobel filter.

The Sobel filter consists of two filters: one filter detects horizontal edges and the other vertical ones.


detects horizontal edges


detects vertical edges

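The filter that detects horizontal edges approximates the derivative along the y-axis (intensity changes from top to bottom), and the filter that detects vertical edges approximates the derivative along the x-axis (intensity changes from left to right).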
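To make the two Sobel filters concrete, here is a small sketch in plain Python. It assumes the common textbook form of the kernels (sign and naming conventions vary between libraries); the helper `slide` and the test image are illustrative assumptions, not part of the original article.

```python
def slide(image, kernel):
    """Valid sliding-window sum of products (stride 1), as used in CNN layers."""
    f = len(kernel)
    rows, cols = len(image) - f + 1, len(image[0]) - f + 1
    return [
        [
            sum(image[i + a][j + b] * kernel[a][b] for a in range(f) for b in range(f))
            for j in range(cols)
        ]
        for i in range(rows)
    ]


# Common textbook form of the Sobel kernels (an assumption; signs may differ elsewhere).
SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]    # derivative along x: responds to vertical edges
SOBEL_Y = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]    # derivative along y: responds to horizontal edges

# Illustrative 4x4 image: dark left half, bright right half, i.e. one vertical edge.
image = [
    [0, 0, 10, 10],
    [0, 0, 10, 10],
    [0, 0, 10, 10],
    [0, 0, 10, 10],
]

gx = slide(image, SOBEL_X)
gy = slide(image, SOBEL_Y)
print(gx)  # [[40, 40], [40, 40]] -> strong response to the vertical edge
print(gy)  # [[0, 0], [0, 0]]     -> no horizontal edge in this image
```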

Laplace Filter

The Laplace filter is another edge detection filter which, unlike the Sobel filter, uses only one kernel. The core concept of this filter is that it computes second order derivatives in one pass. It approximates the Laplace operator (or discrete Laplacian), which is the sum of both second order partial derivatives:
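Written out, the Laplace operator of an image intensity function f(x, y) is the standard expression below. The second line shows one commonly used 3x3 discrete approximation obtained from finite differences; other variants (for example ones that include the diagonals) exist, so treat the kernel as an illustrative assumption rather than the only possible choice.

```latex
\nabla^2 f = \frac{\partial^2 f}{\partial x^2} + \frac{\partial^2 f}{\partial y^2}
\qquad\Longrightarrow\qquad
K_{\text{Laplace}} =
\begin{pmatrix}
0 & 1 & 0 \\
1 & -4 & 1 \\
0 & 1 & 0
\end{pmatrix}
```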


In the sense of the Laplace filter we see an edge as a curve. The gradient along this curve is always pointing in the direction of the normal. A normal is a vector which is always perpendicular to the surface.

Second order derivative filters are noise-sensitive, so you should reduce noise in the input image first and only then apply the Laplace filter. Noise can be reduced with a smoothing filter, for example a Gaussian filter. It is also possible to combine the smoothing filter and the Laplace filter into a single filter and apply it in one step.


The step with which we shift the filter is called stride. It is the number of pixels by which we move the filter across the input matrix. If stride = 1, we move one pixel per computation (see the picture "convolution operation" above); if stride = 2, we move two pixels. The bigger the stride, the quicker the computation and the smaller the output, which leads to information loss.


If we look at the picture "convolutional operation" above once more, we'll notice that the output matrix (the output image) we receive at the end is smaller than the input matrix. So if we have a 5x5 input matrix and a 3x3 filter, we'll get a 3x3 output. In general, we can describe this procedure with the following formula:

input size: m x m

filter size: f x f

output size: (m-f+1) x (m-f+1)

Every time we apply a convolutional operation, the size of the image shrinks.

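In the formula above, the +1 accounts for the starting position of the filter: the filter can be shifted m-f times, which gives m-f+1 positions in total.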

After applying convolution without padding we always get a smaller output image. The pixels in the corners and along the borders of our input image are also covered by the filter far less often than the pixels in the center. Both effects result in information loss.

But what if we pad the input image with extra pixels? We add pixels around the image, creating a frame around the input image. The padding pixels normally have the value 0.
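Zero padding itself is easy to sketch in plain Python. The helper name `zero_pad` and the tiny 2x2 image are assumptions made for this illustration.

```python
def zero_pad(image, p):
    """Return a copy of `image` framed by `p` rows and columns of zeros on every side."""
    width = len(image[0]) + 2 * p
    padded = [[0] * width for _ in range(p)]           # top frame
    for row in image:
        padded.append([0] * p + list(row) + [0] * p)   # left and right frame
    padded.extend([0] * width for _ in range(p))       # bottom frame
    return padded


image = [[1, 2], [3, 4]]
padded = zero_pad(image, 1)
for row in padded:
    print(row)
# [0, 0, 0, 0]
# [0, 1, 2, 0]
# [0, 3, 4, 0]
# [0, 0, 0, 0]
```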


Types of padding:


If we have "valid" padding, we actually apply no padding at all. The output size follows the formula above: input size - filter size + 1.


Let's take an example again: we first pad a 5x5 input image with one ring of zeros and get a 7x7 input image. Then we apply convolution by shifting the 3x3 filter and we get a 5x5 output matrix, which is exactly the size of the original input matrix. This type of padding, where the output size is equal to the input size, is called same padding.

calculate padding:

padding: p = (f-1) / 2

if same padding is applied then the size of the output = size of the input

m+2p-f+1 = m

where m is the input size
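The size formulas can be checked with a short helper. The sketch below assumes the usual generalization to a stride s, namely floor((m + 2p - f) / s) + 1, which is not spelled out in the text above; the function names are hypothetical.

```python
def conv_output_size(m, f, p=0, s=1):
    """Output side length for an m x m input, an f x f filter, padding p and stride s."""
    return (m + 2 * p - f) // s + 1


def same_padding(f):
    """Padding that keeps the output size equal to the input size (stride 1, odd f)."""
    return (f - 1) // 2


print(conv_output_size(5, 3))                     # 3: no padding, m - f + 1
print(conv_output_size(5, 3, p=same_padding(3)))  # 5: same padding, m + 2p - f + 1 = m
print(conv_output_size(5, 3, s=2))                # 2: a bigger stride gives a smaller output
```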


Let's say the size of our filter is 3x3 (we can shorten it as 3). Then we can pad our input image of the size 6x6 according to the formula: padding = filter size - 1. So we pad our image with the maximum of 2 zeros on each side, so that every input pixel contributes to the output at every filter position. This is often called full padding. The output size will be: input size + filter size - 1, which is 6 + 3 - 1 = 8 here.

Let's present our input image as an array of pixel values:


then we pad the image with 2 zeros on both sides:


and we get a new input image size: 10x10

Afterwards we apply a filter of the size 3x3 with stride 1:


We get the value C0 by using the following formula which represents the sliding of the filter across the image:


Let's use this formula:


Now we shift the filter further and apply the formula again:



Why Use Convolutions?

Why not just use normal fully connected layers? There are some advantages CNNs give us that fully connected neural networks don't.

Parameter Sharing

Imagine that we have a fully connected layer instead of a convolution layer. The total number of parameters would explode: every input value is connected to every output unit, so the parameter count is the product of the input size and the output size. For images this quickly becomes a huge number, which leads to massively expensive computation.

In case of a convolution layer, the number of parameters is independent of the image size. It depends only on the filter size and the number of filters. If we have a filter of size 3x3x3, the number of weights for every filter will be 3*3*3 = 27. We also add one bias for each filter: 27 + 1 = 28. If we have 8 filters in total, we will get 28*8 = 224 parameters for the whole layer.
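The arithmetic can be reproduced with two small helpers. The comparison with a fully connected layer assumes a 32x32x3 input image and an output of the same spatial size with 8 channels; these sizes are chosen only for illustration.

```python
def conv_params(filter_h, filter_w, in_channels, n_filters):
    """Weights per filter plus one bias per filter, times the number of filters."""
    return (filter_h * filter_w * in_channels + 1) * n_filters


def dense_params(n_inputs, n_outputs):
    """Every input is connected to every output unit, plus one bias per output unit."""
    return (n_inputs + 1) * n_outputs


print(conv_params(3, 3, 3, 8))                   # 224, independent of the image size
print(dense_params(32 * 32 * 3, 32 * 32 * 8))    # 25174016 for the assumed 32x32x3 image
```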

While convolving over the input in a convolution layer, the layer parameters are shared. What does that mean? It means that one and the same filter is convolved over the whole input. Why is it useful? A filter that is helpful for detecting certain features in one part of the image might be helpful in a different part of the image as well.

Sparsity of Connections

Every layer in a convolutional network has sparse connections. That means, every value in the output is determined by a certain small number of inputs. We don't take into account all the inputs at one computational time.

As we could see, one of the major advantages of a convolution layer is that it reduces the number of parameters, which also helps to speed up the computations in the training phase. The sparse connections also help us to keep the neural network smaller.

Going deeper

Mathematically speaking, a convolution is an operation on two functions that generates a third function, which shows how the first function is changed by the second.

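The operation that CNNs call convolution is known in mathematics as cross-correlation. It is basically the same as a mathematical convolution, with one minor difference: in a mathematical convolution the filter (kernel) is flipped over before it is slid across the input, while in cross-correlation (and in a CNN) it is not. Because the filter values are learned anyway, this difference does not matter in practice.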
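The difference between the two operations is easiest to see in one dimension. The following sketch uses made-up function names and values; it only illustrates that a mathematical convolution equals a cross-correlation with the flipped kernel, and that both agree for a symmetric kernel.

```python
def cross_correlate(signal, kernel):
    """Slide the kernel as-is over the signal (what a CNN layer computes)."""
    f = len(kernel)
    return [
        sum(signal[i + a] * kernel[a] for a in range(f))
        for i in range(len(signal) - f + 1)
    ]


def convolve(signal, kernel):
    """Mathematical convolution: flip the kernel first, then slide it."""
    return cross_correlate(signal, kernel[::-1])


signal = [1, 2, 3, 4]
kernel = [1, 0, -1]

print(cross_correlate(signal, kernel))  # [-2, -2]
print(convolve(signal, kernel))         # [2, 2]

# For a symmetric kernel flipping changes nothing, so both results agree.
print(cross_correlate(signal, [1, 2, 1]))  # [8, 12]
print(convolve(signal, [1, 2, 1]))         # [8, 12]
```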

A convolutional neural network (CNN) implements a rather extended version of a neural network. Every standard CNN consists of the following layers:

  1. Convolution Layer
  2. Activation
  3. Pooling Layer
  4. Flatten Layer
  5. Fully Connected Layer
  6. Output Fully Connected Layer: application of softmax

Let's analyze each of them in depth.

1. Convolution Layer

The main step in the convolution layer, the convolution operation, was already mentioned above. To recap: we randomly initialize some filters to detect patterns and take our initial input image. Then we convolve the input image with each filter by shifting the filter over the entire image, multiplying each covered value of the image with the corresponding value of the filter, and summing everything up. Each sum is one entry in the new output matrix.

This new output matrix is the result of convolving the input image with the filter and is called a feature map. For each shape we want to detect with a filter, we receive a separate feature map. We can then pass these feature maps to the next layer.

Suppose we have the following input matrix:


and the following filter (=kernel):


The convolution operation would look like this:


left: input matrix + filter; right: output 3x3 feature map

Take a look at a CNN model function which demonstrates how convolution layers are used in TensorFlow:

import tensorflow as tf
from tensorflow import keras

def cnn_1():

    input = tf.keras.layers.Input(shape=(28, 28, 1))
    conv1 = tf.keras.layers.Conv2D(filters=8, kernel_size=5, strides=1, padding='same', 
                                   activation='relu', use_bias=True)(input)
    conv2 = tf.keras.layers.Conv2D(filters=8, kernel_size=3, strides=1, padding='same', 
                                   activation='relu', use_bias=True)(conv1)
    hidden = tf.keras.layers.Flatten()(conv2)
    output = tf.keras.layers.Dense(10)(hidden) # our fully connected layer
                                               # softmax will be done by setting 'from_logits' as True in 'train' function (see further code below)
    model = tf.keras.Model(inputs=input, outputs=output)
    return model   

You might have noticed that we're using the 2D convolution ( tf.keras.layers.Conv2D ). That means, the filter is moved in 2 dimensions - across the x- and y-axes. In a 3D convolution for example the 3rd dimension denotes the depth, so the filter is moved across the "depth-dimension" too. Then we have x-,y- and z-axes accordingly.

2. Activation

After applying convolution, we activate the output of the convolution layer with an activation function. For a CNN architecture ReLU is often used.

Read more on activation functions here.

We can either choose an activation in the parameter list of a convolutional layer (as with tf.keras.layers in the code above) or add it as a separate layer (here with a Sequential model from tensorflow.keras):

from tensorflow.keras import Sequential
from tensorflow.keras.layers import Activation, Conv2D

def cnn_2():
    model = Sequential()
    model.add(Conv2D(filters=8, kernel_size=(5, 5), strides=1, padding='same'))
    model.add(Activation('relu'))  # activation as a separate layer
    model.add(Conv2D(filters=8, kernel_size=(3, 3), strides=1, padding='same'))
    model.add(Activation('relu'))
    return model

3. Pooling

Pooling is useful when we want to downsample the feature maps. We basically want to reduce their size, which reduces the number of values (and the number of parameters in the following layers) and hence speeds up computation.

In most cases we apply the so-called MaxPooling.

With MaxPooling we choose the maximum value from the "window" we apply on the feature map (the output of the convolution layer). Let's say we have a 4x4 feature map and we apply a MaxPooling filter of size 2x2 and the stride 2 on it. Then we get:
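As a worked example, here is max pooling with a 2x2 window and stride 2 in plain Python. The function name `max_pool2d` and the values of the 4x4 feature map are assumptions made for this sketch.

```python
def max_pool2d(feature_map, size=2, stride=2):
    """Keep the maximum of every size x size window, moving by `stride`."""
    rows = (len(feature_map) - size) // stride + 1
    cols = (len(feature_map[0]) - size) // stride + 1
    return [
        [
            max(
                feature_map[i * stride + a][j * stride + b]
                for a in range(size)
                for b in range(size)
            )
            for j in range(cols)
        ]
        for i in range(rows)
    ]


# Illustrative 4x4 feature map.
feature_map = [
    [1, 3, 2, 4],
    [5, 6, 1, 2],
    [7, 2, 9, 0],
    [1, 8, 3, 4],
]

pooled = max_pool2d(feature_map)
print(pooled)  # [[6, 4], [8, 9]]
```

The 4x4 feature map shrinks to 2x2, and each output value is the largest value of its window.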


Let's extend our cnn_1 model with pooling layers:

def cnn_1_with_pooling():

    input = tf.keras.layers.Input(shape=(28, 28, 1))
    conv1 = tf.keras.layers.Conv2D(filters=8, kernel_size=5, strides=1, padding='same', 
                                   activation='relu', use_bias=True)(input)
    pool1  = tf.keras.layers.MaxPool2D(pool_size=2, strides=2, padding = 'same')(conv1)
    conv2  = tf.keras.layers.Conv2D(filters=8, kernel_size=3, strides=1, padding='same', 
                                   activation='relu', use_bias=True)(pool1)
    pool2  = tf.keras.layers.MaxPool2D(pool_size=2, strides=2, padding = 'same')(conv2)
    hidden = tf.keras.layers.Flatten()(pool2)
    output = tf.keras.layers.Dense(10)(hidden)

    model = tf.keras.Model(inputs=input, outputs=output)
    return model   

There is also another pooling type called Average Pooling, where the average of the values in the "filter-window" is computed.


In Keras code an AveragePooling layer looks like this:
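A minimal sketch of such a layer, assuming the same tf.keras API as in cnn_1_with_pooling and the same window size and stride as the MaxPool2D layers there. The 4x4 input tensor is made up for the demonstration.

```python
import tensorflow as tf

# 2x2 average pooling with stride 2, mirroring the MaxPool2D settings used above.
avg_pool = tf.keras.layers.AveragePooling2D(pool_size=2, strides=2, padding='same')

# Illustrative input: a batch with one 4x4 single-channel feature map holding 0..15.
x = tf.reshape(tf.range(16, dtype=tf.float32), (1, 4, 4, 1))
y = avg_pool(x)

print(y.shape)                  # (1, 2, 2, 1)
print(tf.reshape(y, (2, 2)))    # averages of the four 2x2 windows
```

In a model, this layer can be used in place of MaxPool2D, for example `pool1 = tf.keras.layers.AveragePooling2D(pool_size=2, strides=2, padding='same')(conv1)`.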


We can repeat the Convolution + Activation + Pooling part as long as we want.

4. Flattening

At the very end we flatten the output of the last convolution (or pooling) layer, which means we lay all channels of our multi-layered image down in one single vector. We then pass this vector to the first fully connected layer. After that we pass the output of the first fully connected layer to the final fully connected output layer. We can apply multiple fully connected layers; normally we use one or two of them.


flattening example
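Flattening can be sketched in plain Python as follows. The example assumes a channels-last layout (height x width x channels), which is the default in Keras; the helper name and the values are illustrative.

```python
def flatten(volume):
    """Lay a height x width x channels nested list out as one flat vector."""
    return [value for row in volume for pixel in row for value in pixel]


# Illustrative 2x2 feature volume with 2 channels per pixel.
volume = [
    [[1, 5], [2, 6]],
    [[3, 7], [4, 8]],
]

vector = flatten(volume)
print(vector)       # [1, 5, 2, 6, 3, 7, 4, 8]
print(len(vector))  # 8 = 2 * 2 * 2
```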

5. Fully Connected Layer

In the fully connected layer the network looks at the output of the previous layer, which is our feature map - the result of a convolution operation. Then it looks at the number of classes we want to predict for an initial image. Then the fully connected layer tries to determine which high-level features correlate with which class.

The input to a fully connected layer must be flattened, otherwise the layer won't be able to process a multi layered input. The first fully connected layer assigns necessary weights to predict labels for each input.

The fully connected output layer gives the final probability for each label with the help of the softmax function.
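For reference, softmax turns the raw scores (logits) of the output layer into probabilities that sum to 1. A minimal plain-Python sketch, with made-up logits:

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    shift = max(logits)  # subtracting the maximum avoids overflow in exp()
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([2.0, 1.0, 0.1])
print(probs)       # the largest logit gets the largest probability
print(sum(probs))  # approximately 1.0
```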

Don't forget that we need an optimizer and a loss function for the training phase of our model. Let's combine those parameters into one function called train:

def train(x, y, model, epochs=5, batch_size=128):
    model.compile(
        optimizer=tf.keras.optimizers.RMSprop(),
        metrics=['accuracy'],
        # 'from_logits=True' applies softmax inside the loss, so the output layer needs none
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))
    # loss and metric (accuracy) will also be evaluated on validation data (not trained on)
    fitting = model.fit(x, y, batch_size=batch_size, epochs=epochs, validation_split=0.2)

    return fitting

CNNs' Parameters

Let's mention parameters of a CNN one more time:

The total number of parameters in a single layer is the number of values we can learn for all of its filters. More parameters mean that more computational time is needed. The general formula for the parameter calculation is the following:

number of parameters = (filter width * filter height * number of input channels + 1) * number of filters

where the 1 stands for the bias of each filter.

CNN for text?

We previously said that CNNs are primarily used for image classification and object recognition. But what if we apply CNN to a text? It turns out, there are some tasks in NLP we can successfully apply CNNs to.

Generally speaking, if the global understanding across the entire sequence is required and the length of the sequence is also important, you'd better use an RNN architecture (specifically LSTM) instead of CNNs.

But in what cases might we use a CNN architecture for language processing? As we learned, the convolution layer of a CNN tries to detect patterns. Patterns in the case of a text might be n-grams (bigrams, trigrams, etc.) such as "I love", "I hate", "I admire", "very nice", etc. A CNN can also find these patterns independent of their position in the text. So, for example, if you want to perform sentiment analysis or spam detection, where you are only interested in negative patterns or typical spam words, you might consider using a CNN for these tasks as well.

In sentiment analysis CNN might find the negation pattern, e.g. "not great", "don't like", "poorly produced", etc. If such negations are somewhere in the text and we apply a filter to all the regions of the text, the output of the convolution layer will give us a large number for that region in the text where negations appear. The negation was detected.

When applying the pooling layer after the convolution layer, we lose information about the precise place in the text where the negation is located. Nevertheless, we still preserve the information whether the negative pattern (or some other pattern) appeared in the text or not.
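The idea can be illustrated with a toy example in plain Python. Everything here is a hand-made assumption for demonstration: the 2-dimensional word vectors, the bigram filter and the sentences. In a real model both the embeddings and the filter weights would be learned.

```python
# Hypothetical 2-d word vectors: [is_negation, is_positive_word]. Unknown words map to [0, 0].
EMBEDDINGS = {"not": [1, 0], "don't": [1, 0], "great": [0, 1], "like": [0, 1]}

# Bigram filter (width 2): the first row matches a negation, the second a positive word.
NEGATION_FILTER = [[1, 0], [0, 1]]


def embed(tokens):
    return [EMBEDDINGS.get(token, [0, 0]) for token in tokens]


def conv1d(vectors, kernel):
    """Slide the filter along the sequence of word vectors (stride 1, no padding)."""
    width = len(kernel)
    return [
        sum(
            vectors[i + a][d] * kernel[a][d]
            for a in range(width)
            for d in range(len(kernel[a]))
        )
        for i in range(len(vectors) - width + 1)
    ]


def detect(sentence):
    responses = conv1d(embed(sentence.split()), NEGATION_FILTER)
    return responses, max(responses)  # max pooling over the whole sequence


print(detect("the movie was not great"))  # ([0, 0, 0, 2], 2)
print(detect("not great was the movie"))  # ([2, 0, 0, 0], 2)
```

The convolution output peaks at the position of "not great", while the pooled value is 2 in both sentences: pooling keeps the fact that the pattern occurred and discards where it occurred.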

In conclusion, if you want to find patterns that are position independent, you may use CNNs. For cases where you have to consider long term dependencies, use RNNs.


There are many different types of Convolutional Neural Networks (CNNs) in the family of neural networks. CNNs are often used in image analysis. Because the same filter parameters are shared across all positions of the input, CNNs are also called space invariant artificial neural networks. CNNs find their application in image recognition/classification, medical image analysis, recommender systems and, as we said above, in natural language processing.

Further recommended readings:

Neural Networks - Introduction

Activation Functions

Loss Functions

Introduction to Padding


Copyright © 2024 by Richard Siegel at siegel.work