
Introduction to Statistics

by Elvira Siegel
(Published: Sun Nov 03, 2019 )

Part II

In this part we will continue our talk about descriptive statistics and cover the measures of variability such as range, standard deviation and variance as well as different types of distributions. Feel free to read Part I of this series to deepen your knowledge about the measures of central tendency:

Go to Part I

Go to Part III

Measures of Variability (Spread, Dispersion)

The second group we use in descriptive statistics to describe data is the measures of variability, sometimes also called measures of spread or just dispersion. As the name reveals, here our goal is to describe the variability within the data, or the extent of stretching or squeezing in the data. In other words, when we use measures of variability, we want to know how spread out the data is.


Distribution number 1 has the lowest variability.


The range is one of the measures of variability that is intuitively easy to understand. It's the difference between the largest and the smallest value in the data set.
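As a minimal sketch in Python, the range can be computed like this. The data values and the function name `value_range` are made up for illustration.

```python
# Hypothetical example data, chosen only for illustration.
data = [4, 9, 1, 7, 12, 5]

def value_range(values):
    """Return the difference between the largest and the smallest value."""
    return max(values) - min(values)

print(value_range(data))  # 12 - 1 = 11
```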


Interquartile Range (IQR)

First of all, what is a quartile? In simple words, quartiles describe how the data is split. Quartile comes from "quarter": the three quartiles Q1, Q2 and Q3 divide the sorted data (namely, the distribution) into four equally sized parts.

  1. Q1 (the lower quartile) is the first quartile and is essentially the middle number between the smallest number in the data and the median. About 25 % of the values lie below it.
  2. Q2 is the second quartile and is in fact the median of the data set: 50 % of the values lie below it and the other 50 % above.
  3. Q3 (the upper quartile) is the third quartile and is the number in the middle between the median and the highest value of the data. About 75 % of the values lie below it.


Source wiki

The IQR shows how wide the interval is in which the middle 50 % of all the values lie.

A general procedure to find IQR would be:

  1. sort the data
  2. find the median of the whole data set
  3. find the difference between the middle of the first half and the middle of the second half. In other words, find the difference between the medians from the first and the second parts respectively.
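The procedure above can be sketched in Python as follows. This is only one possible reading of the steps: we assume that for an odd number of values the overall median is left out of both halves, and other conventions can give slightly different results. The function name `iqr_by_halves` and the sample data are hypothetical.

```python
from statistics import median

def iqr_by_halves(values):
    """IQR as the difference between the medians of the two halves.

    Assumption: for an odd number of values the overall median is left
    out of both halves.
    """
    ordered = sorted(values)        # step 1: sort the data
    half = len(ordered) // 2        # step 2: position where the median splits the data
    if half == 0:
        raise ValueError("need at least two values")
    lower = ordered[:half]          # first half (below the median position)
    upper = ordered[-half:]         # second half (above the median position)
    # step 3: difference between the middle of each half
    return median(upper) - median(lower)

# Hypothetical data, not taken from the article.
sample = [7, 1, 3, 9, 5, 11, 13]
print(iqr_by_halves(sample))  # Q1 = 3, Q3 = 11, so the IQR is 8
```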

Let's look at an example:

Our data set can look like this:


Step 1: Sort the data set:


Step 2: Determine the median of the data:


Step 3: Find the difference between the middle of the second part and the middle of the first part


Pay attention:

Quartile is often confused with quantile. They are similar but not the same. Look at the short note below to see the difference:

Note about quartile vs. quantile :

Quantile (from "quantity") is a point at which a distribution is divided into equal parts. Strictly speaking the median is a quantile because it (the median) divides the complete data set into two equal groups: the first one is lower than the median (the first 50%) and the second one is above the median (the second 50%).

A quartile is one type of quantile. Quartiles divide the data set (the distribution) into four equally sized parts. So, for example, the first quartile is equal to the 0.25 quantile.
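To make the relationship concrete, here is a small sketch using `statistics.quantiles` from Python's standard library (available from Python 3.8). The sample data is made up, and the exact cut points depend on the interpolation method the library uses by default.

```python
from statistics import median, quantiles

# Hypothetical data, not taken from the article.
sample = [7, 1, 3, 9, 5, 11, 13]

# n=4 asks for quartiles: three cut points that split the data into four parts.
q1, q2, q3 = quantiles(sample, n=4)

# n=10 asks for deciles: nine cut points that split the data into ten parts.
deciles = quantiles(sample, n=10)

print(q1, q2, q3)
print(q2 == median(sample))   # the second quartile is the median
print(len(deciles))           # nine cut points
```

Quartiles and deciles are both quantiles; only the number of parts differs.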

Standard Deviation (SD)

The standard deviation (SD) describes how far the data is spread out from the mean. The SD is a measure of the dispersion of a data set and is closely tied to the variance.

If the data points are close to the mean, then we observe a low SD:


If the data points have a wide spread, then we observe a high SD:


The formula for the SD is:
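One common way to write it is the population form below. This is an assumption on our side: the sample form divides by N - 1 instead of N. The symbols are illustrative: x_i are the N data values and \mu is their mean.

```latex
\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(x_i - \mu\right)^2},
\qquad
\mu = \frac{1}{N}\sum_{i=1}^{N} x_i
```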



Variance

If we look at the definition of variance from Wikipedia, we read:

"it measures how far a set of (random) numbers are spread out from their average value", Wikipedia (Variance)

And what's the difference from the standard deviation? Mathematically, the two are closely related: the standard deviation is the square root of the variance, so both carry the same information about the spread. The practical difference is the unit: the variance is expressed in squared units of the data, while the SD is in the same units as the data, which makes it easier to interpret. The variance gives us a measure of how far the values are spread out from the mean, and we can use the SD for the same task.

We can present the variance as the square of the standard deviation (equivalently, the standard deviation is the square root of the variance):
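In symbols, again assuming the population form with N data values x_i and mean \mu (the notation is illustrative):

```latex
\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}\left(x_i - \mu\right)^2,
\qquad
\sigma = \sqrt{\sigma^2}
```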


Variance, SD and interquartile range are all used to measure statistical dispersion. Both variance and SD tell us how the data set is spread around the mean.
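A short sketch with Python's standard library shows how the two measures relate. We use the population versions `pvariance` and `pstdev` here as an assumption; `variance` and `stdev` would give the sample versions. The data is made up.

```python
from math import isclose, sqrt
from statistics import pstdev, pvariance

# Hypothetical data, not taken from the article. Its mean is 5.
data = [2, 4, 4, 4, 5, 5, 7, 9]

variance = pvariance(data)   # mean squared distance from the mean
sd = pstdev(data)            # square root of the variance

# For this data the variance is 4 and the SD is 2.
print(variance, sd)
print(isclose(sd, sqrt(variance)))   # the SD is the square root of the variance
```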


We will cover the topic of probability in more detail in Part III. But for now, a probability distribution summarizes all possible outcomes of an experiment together with their probabilities.

Normal Distribution

The normal distribution is often called Gaussian or Laplace-Gauss Distribution. It is a continuous probability distribution.

Continuous values are, for example, a person's age, weight or height, stock prices, etc.


Look closely at the normal distribution graph above. Did you notice that the distribution's "tails" never touch the x-axis? The x-axis is a horizontal asymptote of the curve. An asymptote is a line that a curve approaches ever more closely without reaching it: the distance between the two shrinks toward zero as we move toward infinity.

The curve approaches the x-axis but never reaches it, because even values very far from the mean remain possible, only extremely unlikely. If we measure the area under the curve over some interval far out in a tail, we see that it is very small but still greater than zero. In other words, the area under the curve over an interval is the probability that a value falls within that interval.
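We can check this with `statistics.NormalDist` from Python's standard library (Python 3.8 or newer). The sketch assumes a standard normal distribution with mean 0 and standard deviation 1; the helper `prob_between` is a hypothetical name.

```python
from statistics import NormalDist

# Standard normal distribution, chosen only for illustration.
dist = NormalDist(mu=0, sigma=1)

def prob_between(a, b):
    """Probability that a value falls between a and b (area under the curve there)."""
    return dist.cdf(b) - dist.cdf(a)

print(prob_between(-1, 1))       # a large area near the center
print(prob_between(4, 5))        # a tiny but still positive area far out in the tail
print(prob_between(-1e9, 1e9))   # practically the whole area, which is 1
```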

As mentioned above, the measures of central tendency (mean, median and mode) are all equal for the normal distribution. They coincide because the curve is symmetric around a single peak at its center. The total area under the curve is equal to 1, as required of any probability distribution.


Data Dispersion


Source wikiwand

Inspect the image above. We can see that approx. 68 % of the whole data set lies within one standard deviation of the mean and approx. 95 % within two. Within three standard deviations we cover approx. 99.7 %, almost 100 % of the whole data set.
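These percentages can be reproduced with the error function: for a normal distribution, the share of values within k standard deviations of the mean is erf(k / sqrt(2)). A minimal sketch (the function name `share_within` is our own):

```python
from math import erf, sqrt

def share_within(k):
    """Share of a normal distribution lying within k standard deviations of the mean."""
    return erf(k / sqrt(2))

for k in (1, 2, 3):
    print(k, round(share_within(k) * 100, 1))
# 1 68.3
# 2 95.4
# 3 99.7
```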

Laplace Distribution

Another continuous probability distribution is the Laplace distribution. It is frequently associated with the exponential distribution (and is also called the double exponential distribution) because we can view it as two exponential distributions placed back to back around a common center.
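The link to the exponential distribution can be sketched in a few lines of Python. The function names and the parameters `loc` and `scale` are our own choices for illustration; the check confirms that each side of the Laplace density is half of an exponential density mirrored around the center.

```python
from math import exp, isclose

def exponential_pdf(t, scale):
    """Density of an exponential distribution with the given scale, for t >= 0."""
    return exp(-t / scale) / scale if t >= 0 else 0.0

def laplace_pdf(x, loc=0.0, scale=1.0):
    """Density of a Laplace distribution centered at loc."""
    return exp(-abs(x - loc) / scale) / (2 * scale)

# On both sides of the center, the Laplace density equals half of the
# exponential density evaluated at the distance from the center.
for x in (-2.0, -0.5, 0.0, 1.5):
    assert isclose(laplace_pdf(x), 0.5 * exponential_pdf(abs(x), 1.0))
```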


Binomial Distribution

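The binomial distribution is a discrete probability distribution. Discrete values are countable and can be placed in distinct categories, for example the gender of a person, income (low, medium, high), grades (bad, satisfactory, good, excellent), civil status (single, married, divorced), etc. The binomial distribution itself describes a count: the number of successes in a fixed number of independent trials, each with the same probability of success.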


In contrast to the normal distribution, the binomial distribution takes only a limited number of values; hence, it is discrete.
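A small sketch makes the "limited number of values" visible. It assumes a hypothetical experiment of 10 fair coin flips, which can only produce 0 to 10 heads; `binomial_pmf` is our own helper name and uses `math.comb` (Python 3.8 or newer).

```python
from math import comb

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials with success probability p."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Hypothetical experiment: 10 fair coin flips.
n, p = 10, 0.5
probabilities = [binomial_pmf(k, n, p) for k in range(n + 1)]

print(len(probabilities))   # only 11 possible outcomes: 0, 1, ..., 10 heads
print(probabilities[5])     # 0.24609375, the chance of exactly 5 heads
print(sum(probabilities))   # all outcomes together have probability 1
```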


Binomial Distribution: discrete probability distribution


Normal Distribution: continuous probability distribution

Uniform Distribution


The uniform distribution (also called the rectangular distribution) is a distribution with constant probability density: within its range, every interval of the same length is equally probable. The uniform distribution is symmetric.
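As a minimal sketch, assuming a continuous uniform distribution on a hypothetical interval [0, 10], we can see that intervals of the same length receive the same probability. The function names are made up for illustration.

```python
def uniform_pdf(x, a, b):
    """Constant density 1 / (b - a) inside [a, b], zero outside."""
    return 1.0 / (b - a) if a <= x <= b else 0.0

def uniform_prob(lo, hi, a, b):
    """Probability that a value from the uniform distribution on [a, b] falls in [lo, hi]."""
    lo, hi = max(lo, a), min(hi, b)      # clip the interval to the range [a, b]
    return max(hi - lo, 0.0) / (b - a)

# Two intervals of length 2 at different positions get the same probability.
print(uniform_prob(1, 3, 0, 10))   # 0.2
print(uniform_prob(6, 8, 0, 10))   # 0.2
```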

Conclusion for Part I and Part II

In these two parts we learned about two major branches of statistics: descriptive and inferential. We've seen what the measures of central tendency and the measures of variability are and got to know some types of probability distributions.

In the third part of this article, we're going to learn more about the three axioms of probability theory and different probability types.

Read Part III next :)

Further recommended readings:

Introduction to Statistics Part I

Introduction to Statistics Part III

Classification with Naive Bayes


Copyright © 2024 by Richard Siegel at siegel.work