
Neural Networks

by Elvira Siegel
(Published: Sat Aug 17, 2019)

Neural Networks - Introduction

In Neural Networks (NNs) we try to create a program that is able to learn from experience with respect to some task. We say the program learns if its performance on that task, as judged by some performance measure, improves with experience.

To start with, Neural Networks give us one way to do machine learning (ML). A Neural Network is a program that learns to perform a task by analyzing training data. There are many types of machine learning models and algorithms, e.g. Linear or Logistic Regression, the Perceptron, Feedforward Networks, LSTMs, Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), Support Vector Machines (SVMs), and many more.

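Neural Networks are often called universal function approximators. That means that a sufficiently large network can, in principle, approximate a very wide class of functions (for example, any continuous function on a closed and bounded input range) as closely as we like. So, if we can reduce a problem (task) to some function, we can try to approximate it with the help of Neural Networks. Whether training actually finds a good approximation is a separate question.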

Typical tasks for NNs include: forecasting, recommendation systems, risk management, anomaly detection, time series predictions and natural language understanding.

Is Machine Learning the Same as Neural Networks?

Machine Learning (ML) is a broader term than Neural Networks. A neural network (sometimes also "artificial neural network") is one type of machine learning model.

Generally speaking, a machine learning project goes through the following stages:

  1. explore and clean data
  2. preprocess data (e.g. make it numerical)
  3. define the model (=algorithm) architecture (set up the layers)
  4. compile the model (set up the optimizer, the loss function and metrics)
  5. train the model (also "fitting the model" or "estimating parameters")
  6. test the model (also "evaluating")

Machine Learning Lingo:

You should get used to the fact that one concept in ML might have as many as five different names. That is why we list some frequently used ML terms here:

  1. label - the desired outcome we want to predict. In simple linear regression it corresponds to the y-variable.
  2. feature (also "attribute" or "dimension") - the input data, corresponds to the x-variable. Examples of features for spam classification: words in an e-mail, number of exclamation marks, sender's address, etc...
  3. model (algorithm) - defines the relationship between features and labels.
  4. weight - a coefficient applied to an input value. The main idea of training a model is to find good weights for every feature. In a fully connected network, weights connect each neuron in one layer to every neuron in the next layer.
  5. layer - a structure that receives weighted input, transforms it with a (usually non-linear) function and then passes the result as output to the next layer. A network has an input layer, one or more hidden layers and an output layer.

Supervised vs. Unsupervised Learning

In supervised learning the data we give ("feed into") the neural network has already been labeled. That is, the model learns on data with a label, which is the right "answer" we want to predict in the end. So, the model learns from labeled examples.

In contrast, in unsupervised learning unlabeled data is used. The model tries to make sense of this data by extracting patterns on its own. So we don't need e.g. a human annotator for this type of machine learning.

Both supervised and unsupervised learning can work with discrete or with continuous variables. Discrete means the variable takes one of a fixed set of categories: yes or no, cat or dog, red or green or blue. Continuous means the variable can take any value within a range, like age, height, a stock price or the price of a house.

An example of discrete supervised learning would be a classification or categorization problem. An example of continuous supervised learning would be a regression problem, e.g. Linear Regression.

An example of discrete unsupervised learning would be clustering, where we sort the data into particular groups or "clusters". An example of continuous unsupervised learning would be Dimensionality Reduction.

Data Splitting

When building a machine learning model, we usually split our data into two kinds of sets:

  1. Training set
  2. Test set

A training set contains a "right" output (label). The model learns on this data to be able to generalize on unseen data.

A test set is used to estimate the accuracy we can expect on new, previously unseen data. Because the model has never seen the test set during training, the result is unbiased and tells us how well the model generalizes.

It is even advisable to split the data into three sets:

  1. Training set
  2. Development set (sometimes also "validation set")
  3. Test set

A development set is used to evaluate the model with different hyperparameters. We can improve our model's performance by tweaking its hyperparameters (learning rate, number of iterations, batch size, loss function, etc...).

We normally have to specify a model's hyperparameters manually. A model's parameters (e.g. its weights), in contrast, are not set manually but are learned automatically during training.

In an ideal situation we would develop a model using a training set. Then we tweak the model's hyperparameters so that it performs well on a development set. In the end, we check the model's performance on a test set, which is brand new and has never been seen by our model, so we get the most unbiased judgment.
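The three-way split described above can be sketched in a few lines of plain Python. This is only an illustration: the function name split_dataset, the 80/10/10 proportions and the fixed seed are assumptions made for this example, not something the article prescribes.

```python
import random

def split_dataset(examples, train_frac=0.8, dev_frac=0.1, seed=0):
    """Shuffle the examples and cut them into training, development and test sets."""
    shuffled = list(examples)              # copy, so the caller's data stays untouched
    random.Random(seed).shuffle(shuffled)  # fixed seed makes the split reproducible
    n_train = round(len(shuffled) * train_frac)
    n_dev = round(len(shuffled) * dev_frac)
    train = shuffled[:n_train]
    dev = shuffled[n_train:n_train + n_dev]
    test = shuffled[n_train + n_dev:]      # whatever is left becomes the test set
    return train, dev, test

data = list(range(100))  # stand-in for 100 labeled examples
train, dev, test = split_dataset(data)
print(len(train), len(dev), len(test))
```

Shuffling before cutting matters: if the data is sorted (e.g. by label or by date), an unshuffled split would give the three sets very different contents.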

Overfitting vs. Underfitting

Overfitting happens when our trained model cannot generalize well to previously unseen data. An "overfitted" model will be extremely precise on the training data but will show very poor accuracy on new data. With overfitting we can't generalize the output and can't reason about other data, which is our primary goal. Overfitting often happens in complex models with a lot of features. That also means: the more hidden neurons, the higher the risk of overfitting. In this case, the system didn't learn to generalize but stores/memorizes the data patterns and any noise contained in them. Overfitting may also occur due to homogeneous data. That means we should keep our data diverse.

On the contrary, underfitting happens when the model is too simple to capture the patterns in the data, so its accuracy is poor even on the training set (and, as a consequence, on the development set too). Just like an overfitted model, an underfitted model cannot be generalized to novel data. Underfitting often occurs with very simple models that lack predictors (also called "independent variables"). It may also happen when we try to fit (train) a linear model (e.g. linear regression) to data that is not linear. Such a model will produce systematically faulty predictions.

In general, we can detect overfitting and underfitting by splitting our data into train/validation/test sets and comparing the model's performance on them. To reduce overfitting we can add more data to our training set. Regularization techniques such as Dropout and L1/L2 Regularization also help to reduce overfitting. To reduce underfitting we can use a more complex model or add more informative features.
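To make the linear-model case concrete, here is a small sketch in plain Python. It fits a straight line to data generated from y = x^2 and measures the error on the training data itself. The toy data, the helper names fit_line and mse, and the idea of adding x^2 as an extra feature are assumptions chosen for this illustration.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = a*x + b for a single feature (closed form)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    a = cov_xy / var_x
    b = mean_y - a * mean_x
    return a, b

def mse(xs, ys, a, b):
    """Mean squared error of the line a*x + b on the given data."""
    return sum((a * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x ** 2 for x in xs]          # the true relationship is not linear

# A straight line in x cannot follow the curve: the error stays high
# even on the data the model was trained on (underfitting).
a, b = fit_line(xs, ys)
linear_error = mse(xs, ys, a, b)

# Giving the same linear model the feature x^2 removes the problem.
squared = [x ** 2 for x in xs]
a2, b2 = fit_line(squared, ys)
quadratic_error = mse(squared, ys, a2, b2)

print(linear_error, quadratic_error)
```

The important detail is that the first error is measured on the training data. A model that is already bad there has not learned the pattern at all, which is what distinguishes underfitting from overfitting.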

Overfitting Reduction: Regularization

To overcome overfitting we can use regularization techniques:


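Dropout

When we use dropout, we randomly switch off ("drop") some of the neurons (their outputs are also called "activations") during training. Then, during testing, we use all neurons, but we scale their outputs down by the probability with which a neuron was kept during training, so that the expected input to the next layer stays the same as in training.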


Srivastava, Nitish, et al. "Dropout: a simple way to prevent neural networks from overfitting", JMLR 2014

Feel free to read the original paper.

The problem with dropout is that it discards information during training, and with too high a dropout rate it may even result in underfitting.
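As a rough sketch, dropout on a list of activations could look like the code below. Note that this is the "inverted dropout" variant, which is an assumption of this example: instead of scaling at test time, it divides the surviving activations by the keep probability during training, so nothing has to be changed at test time. The function name dropout and the parameter keep_prob are made up for the illustration.

```python
import random

def dropout(activations, keep_prob=0.8, training=True, rng=None):
    """Inverted dropout: drop units at random and rescale the survivors."""
    if not training:
        return list(activations)      # at test time every neuron is used as-is
    if rng is None:
        rng = random.Random()
    out = []
    for a in activations:
        if rng.random() < keep_prob:
            out.append(a / keep_prob)  # rescale so the expected value is unchanged
        else:
            out.append(0.0)            # this neuron is switched off for this step
    return out

layer_output = [0.2, 1.5, -0.7, 0.9]
print(dropout(layer_output, keep_prob=0.5, rng=random.Random(42)))
print(dropout(layer_output, training=False))
```

Both variants have the same goal: the next layer should see, on average, inputs of the same size during training and during testing.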

Regularization assumes that smaller weights create a simpler model, helping to prevent overfitting.

L1 Regularization

The L1 regularization adds a penalty proportional to the sum of the absolute values of the weights, so the algorithm is penalized for keeping weights that are not zero. With L1 regularization we try to zero out weights that are not useful, so we don't have to deal with unnecessary feature crosses. "Feature crosses" means: when we multiply (or "cross") two or more features, we get vectors with more and more dimensions. More dimensions in the data mean that more RAM is needed, more computation time is required and the model's complexity grows. Zeroing out some features saves RAM and reduces noise in the model. By deactivating non-informative features, we can make our network less complex, which means less overfitting.

The problem with L1: if the penalty is too strong, it zeroes out useful weights as well and produces models that are too simple to learn more complex patterns.

L2 Regularization

The L2 regularization penalizes the sum of the squared weights. It makes the weights very small but does not force them to become exactly 0 as L1 does. So L2 has no feature selection built in and cannot switch off non-informative features.
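The difference between the two penalties can be seen in a short sketch. The penalty functions follow the usual definitions (sum of absolute values for L1, sum of squares for L2). The two shrink functions are simplified update steps picked for this illustration (soft-thresholding for L1, multiplicative weight decay for L2), and the names and the strength lam are assumptions, not part of the article.

```python
def l1_penalty(weights, lam):
    """L1 term added to the loss: lam * sum of |w|."""
    return lam * sum(abs(w) for w in weights)

def l2_penalty(weights, lam):
    """L2 term added to the loss: lam * sum of w^2."""
    return lam * sum(w * w for w in weights)

def l1_shrink(weights, step):
    """Soft-thresholding: move every weight toward 0 by `step`, stopping at exactly 0."""
    shrunk = []
    for w in weights:
        magnitude = max(abs(w) - step, 0.0)
        if magnitude == 0.0:
            shrunk.append(0.0)         # small weights are removed completely
        else:
            shrunk.append(magnitude if w > 0 else -magnitude)
    return shrunk

def l2_shrink(weights, factor):
    """Weight decay: scale every weight by a factor between 0 and 1."""
    return [w * factor for w in weights]

weights = [0.5, -0.05, 2.0]
print(l1_penalty(weights, lam=0.1), l2_penalty(weights, lam=0.1))
print(l1_shrink(weights, step=0.1))    # the small middle weight becomes exactly 0
print(l2_shrink(weights, factor=0.9))  # all weights get smaller, none becomes 0
```

This is why L1 acts as a kind of feature selection: a weight that ends up at exactly 0 removes its feature from the model, while a weight that is only scaled down keeps contributing a little.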

Feel free to do some exercises here to strengthen your knowledge.


Gradient Descent

The actual "learning" happens when the model's parameters are updated. Those parameters can be e.g. weights in a NN, support vectors in an SVM, or coefficients in a Linear / Logistic Regression. The parameters get adjusted in a way that improves the model's performance. There are many optimization algorithms for ML. One of them is called Gradient Descent. We use Gradient Descent to minimize a function (also called cost function or loss function) by repeatedly moving in the direction of steepest descent. This direction is given by the negative of the gradient, and the size of each step is controlled by the learning rate.
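A minimal sketch of gradient descent on a function of one variable is shown below. The loss (w - 3)^2, the learning rate and the number of steps are arbitrary choices for this illustration.

```python
def gradient_descent(grad, start, learning_rate=0.1, steps=100):
    """Repeatedly step against the gradient, starting from `start`."""
    w = start
    for _ in range(steps):
        w = w - learning_rate * grad(w)  # move in the direction of the negative gradient
    return w

def loss(w):
    return (w - 3.0) ** 2                # toy loss with its minimum at w = 3

def loss_grad(w):
    return 2.0 * (w - 3.0)               # derivative of the toy loss

w_min = gradient_descent(loss_grad, start=0.0)
print(w_min, loss(w_min))
```

With a learning rate that is too large the steps overshoot the minimum and the values can diverge, with one that is too small the descent needs many more steps. For this toy loss any learning rate below 1.0 converges.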


Perceptron

The Perceptron is one of the first neural network models. The perceptron algorithm creates "associations" between input stimulations and the necessary response at the output. The perceptron sums up the weighted input values, which normally come in the form of a vector. Then we "activate" this sum with a function and, in the end, pass the result on as the output.
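The forward pass of a single perceptron can be sketched as follows. The step activation, the bias term and the hand-picked weights (which make the perceptron behave like a logical AND) are assumptions for this example. The weights are set by hand here and not learned.

```python
def perceptron(inputs, weights, bias):
    """Weighted sum of the inputs followed by a step activation."""
    weighted_sum = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1 if weighted_sum > 0 else 0   # "fire" only if the sum is positive

# Hand-picked parameters: the output is 1 only when both inputs are 1.
and_weights = [1.0, 1.0]
and_bias = -1.5

for a in (0, 1):
    for b in (0, 1):
        print(a, b, perceptron([a, b], and_weights, and_bias))
```

Training a perceptron means finding such weights automatically from labeled examples instead of choosing them by hand.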



Feedforward Neural Networks

Feedforward Neural Networks are in essence multilayer perceptrons. The aim is, as said before, to approximate some function.

The model is called "feedforward" because information flows in one direction only: from the input, through the intermediate calculations, and lastly to the output. There are no feedback links in which outputs of the model are fed back into the model again. If feedforward neural networks are extended with feedback connections, they become Recurrent Neural Networks.

Layers (each layer receives one input vector):

In such a network, the x values are the initial input, the w values are the weights, N are the neurons and σ is the sigmoid function. Notice how the output of the first layer (y) becomes the input for the second one, and so on. At the end of the network we have an s value, which in this case is a scalar.

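To calculate the output of each neuron in a layer, we use the following formula: activation(input vector * weight vector), where * is the dot product.

To compute the y-vector, every neuron applies the sigmoid to the same input x but uses its own weight vector:

y1 = sigma(x*w1)

y2 = sigma(x*w2)

y3 = sigma(x*w3)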

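We can also stack the weight vectors w1, w2 and w3 together as the columns of a weight matrix W:

W = [w1 w2 w3]

Now that we have a weight matrix, we can compute the whole layer in a single matrix operation, y = sigma(x*W), which is much faster than handling every neuron separately.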
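The layer computation with a weight matrix can be sketched in plain Python like this. The concrete numbers, the layer sizes (2 inputs, 3 hidden neurons, 1 output) and the function names are assumptions made for the illustration, and the matrix product is written out with loops to keep the example free of external libraries.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def layer(x, W):
    """Compute sigma(x * W). W has one row per input and one column per neuron."""
    n_out = len(W[0])
    return [
        sigmoid(sum(x[i] * W[i][j] for i in range(len(x))))
        for j in range(n_out)
    ]

x = [1.0, 0.5]                 # input vector with two features

W1 = [[0.1, -0.2, 0.3],        # column j holds the weight vector of neuron j
      [0.4, 0.5, -0.6]]
W2 = [[0.7],                   # second layer: three inputs, one output neuron
      [-0.8],
      [0.9]]

y = layer(x, W1)               # output of the first layer, a vector of length 3
s = layer(y, W2)[0]            # output of the network, a single scalar
print(y, s)
```

In practice a numerical library would perform the matrix product in one optimized call, which is where the speed-up from using a weight matrix comes from.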


The initial aim of the neural network idea was to solve problems in a manner similar to how a human brain would. Over time, however, NN models were increasingly designed to perform specific tasks, leading to deviations from biology.

Further recommended readings:

Deep Learning: Basic Mathematics for Deep Learning


Copyright © 2024 by Richard Siegel at siegel.work