
Introduction to Statistics

by Elvira Siegel
(Published: Thu Oct 24, 2019)

Part I

In this three-part series we will cover basic terminology as well as the core concepts of statistics. In Part I you are going to learn about measures of central tendency (mean, median and mode). In Part II you will read about measures of variability and different distributions. Finally, in Part III you will get to know types of probability such as marginal, joint and conditional. Enjoy!


Basic Terminology

In order to carry out a statistical analysis, we first have to gather data from some source. In statistics, this source of data is called the population. Don't assume that it has to consist of people! A population in statistics can represent anything, for example: measurements of forest fires, stock market prices, sun activity and, among others, people.

What if we want to refer only to a smaller part of a population? Then we call it a sample. A sample is normally a subset of manageable size which is extracted from the population.

Descriptive vs. inferential

We can divide the study of statistics into two general branches: descriptive and inferential.

Inferential statistics is based on probability theory: its conclusions are themselves expressed as probabilities. With inferential statistics we infer patterns and trends about a population. We construct those patterns based on an analysis of a sample drawn from the population. Inferential statistics helps us to generalize from the sample to the population.

Could we analyze every member of the population individually? Well, it's quite a tedious task ... What we can do to make our life easier is to choose a sample that represents the population as closely as possible and infer the characteristics of the population from it. Whatever we conclude about the population is then built on this representative sample.

As an example, imagine we want to analyze the hours spent studying by all students in Germany. Trying to measure the study hours of so many people is extremely challenging. By the time we are done with our analysis, some students may have graduated and new students with different habits may have enrolled. In any case, we end up with a lot of calculations if we work with the whole data set (the population). Moreover, gathering so much information is rather difficult and time-consuming. Even if we had the perfect data set, it wouldn't ease the task much.

What we can do is take a single sample consisting of, let's say, 10,000 students chosen at random. Now we can focus on studying these particular people, who will represent the whole "student population" of Germany, which is approx. 2.7M. We can then make inferences from our sample and afterwards generalize those inferences to the whole set of students.
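The idea of standing in for a population with a random sample can be tried out in a few lines of Python. This is only a sketch under stated assumptions: the population is synthetic (200,000 made-up values instead of 2.7M real students), and the normal shape, the average of 20 hours and the spread of 6 hours are invented for illustration.

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch is reproducible

# Hypothetical population: weekly study hours of 200,000 students.
# Shape, mean (20) and spread (6) are assumptions, not real data.
population = [max(0.0, random.gauss(20, 6)) for _ in range(200_000)]

# Simple random sample of 10,000 students, drawn without replacement.
sample = random.sample(population, 10_000)

population_mean = statistics.fmean(population)
sample_mean = statistics.fmean(sample)

print(f"population mean: {population_mean:.2f}")
print(f"sample mean:     {sample_mean:.2f}")
```

The two printed means should lie close together, which is what allows us to generalize from the sample to the population.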

Descriptive statistics is what its name says: it describes the data which comes from a sample and filters out information that is not meaningful. From now on, we will talk about descriptive statistics in more detail.

Descriptive Statistics: diving into the terminology.

In descriptive statistics we have different metric groups to describe data. One metric group is called "measures of central tendency" and the other group "measures of variability".

Measures of Central Tendency

Let's start with the measures of central tendency. The three measures of central tendency are: mean, median and mode. Each of these measures is normally a single number and is mainly used to describe the center of the data.


The mean value is the average value of the data: the sum of all values divided by the number of values. It summarizes the entire data set in a single number.

The mean of a population is denoted by the Greek letter μ (mu, pronounced "mew"). We can compute the mean over the whole population or over a sample.
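In standard notation, the mean is the sum of all values divided by their count. The symbols below are assumptions of this sketch rather than the article's own notation: N is the population size, n the sample size, x_i the i-th value, and x̄ ("x-bar") the sample mean.

```latex
\mu = \frac{1}{N} \sum_{i=1}^{N} x_i
\qquad
\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i
```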



Note: The mean changes whenever a single data value in the data set changes.

The problem with the mean is that it is sensitive to outliers. Outliers are values that are unexpectedly different from the rest. They are an anomaly. Those values are normally either extremely small or extremely large. Such outliers cause skewed distributions and drag the mean towards the outlier, so the mean is no longer representative of the middle of the data set.

For example, if we look at the yearly incomes of the workers of company B, we get the following mean:


It seems to represent the average salary just right. But what if we add an outlier ...


Now the mean is too high and no longer represents a typical value of the data. The outlier drags the mean in its direction.

The mean also doesn't work well with skewed data. When the data is skewed, the mean doesn't give us an accurate location of the central tendency. As we said before, this happens because the skewness pulls the mean away from the representative values in the middle.
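The effect of a single outlier on the mean is easy to reproduce. The incomes below are hypothetical (in thousands per year) and are not the figures from the company B example; they only illustrate the mechanism.

```python
import statistics

# Hypothetical yearly incomes in thousands (made up for illustration).
incomes = [40, 42, 45, 47, 50]
mean_without_outlier = statistics.mean(incomes)  # 224 / 5 = 44.8

# Add one very large income as an outlier.
incomes_with_outlier = incomes + [500]
mean_with_outlier = statistics.mean(incomes_with_outlier)  # 724 / 6, about 120.67

print(mean_without_outlier)
print(round(mean_with_outlier, 2))
```

One extra value is enough to move the mean far above what five of the six workers actually earn.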


Skewness shows the degree of asymmetry of a distribution: it can be negative (-), positive (+) or zero (0, symmetric).
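The sign of the skewness can be checked numerically. The sketch below uses the moment-based definition (third central moment divided by the cubed population standard deviation). This is one common convention and an assumption here, since the article does not give a formula; the function name and the data sets are made up for illustration.

```python
import statistics

def skewness(values):
    """Moment-based skewness: third central moment / (standard deviation ** 3)."""
    mean = statistics.fmean(values)
    n = len(values)
    second_moment = sum((x - mean) ** 2 for x in values) / n
    third_moment = sum((x - mean) ** 3 for x in values) / n
    return third_moment / second_moment ** 1.5

symmetric = [1, 2, 3, 4, 5]                # mirror image around 3
right_skewed = [1, 2, 2, 3, 3, 3, 4, 10]   # long tail to the right
left_skewed = [-x for x in right_skewed]   # same data mirrored, tail to the left

print(skewness(symmetric))         # zero for symmetric data
print(skewness(right_skewed) > 0)  # positive skew
print(skewness(left_skewed) < 0)   # negative skew
```

In the right-skewed data set the mean (3.5) sits above the median (3), which is the pull towards the tail described above.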

We can overcome these issues with the mean by paying more attention to its close colleague: the median.


The median is the middle number in the data. It divides the data set into two equal parts. There are some rules to follow:

  1. If the number of values in the data set is odd, then the median is the middle value.

  2. If the number of values is even, then the median is the average of the two values in the middle.

Note: the median works only with ordered data. So sort first, then take the median. Because the data is ordered, one half of it lies below the median and the other half above.

In contrast to the mean, the median is more robust to outliers. We can see it in our previous example with yearly income:
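The two rules, together with the "sort first" note, can be written down as a short function. This is a minimal sketch; the name `median_by_rules` and the example values are made up for illustration.

```python
def median_by_rules(values):
    """Return the median: sort first, then apply the odd/even rule."""
    ordered = sorted(values)
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        # Odd number of values: the single middle value.
        return ordered[middle]
    # Even number of values: average of the two middle values.
    return (ordered[middle - 1] + ordered[middle]) / 2

print(median_by_rules([7, 1, 5]))     # sorted: 1, 5, 7
print(median_by_rules([7, 1, 5, 3]))  # sorted: 1, 3, 5, 7
```

Python's standard library offers the same computation as `statistics.median`.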



The median stays stable even if we have a large outlier.
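To see this stability in numbers, we can compare mean and median before and after adding an outlier. The incomes are hypothetical (in thousands per year), not the original figures.

```python
import statistics

# Hypothetical yearly incomes in thousands (made up for illustration).
incomes = [40, 42, 45, 47, 50]
incomes_with_outlier = incomes + [500]

median_before = statistics.median(incomes)              # middle value: 45
median_after = statistics.median(incomes_with_outlier)  # (45 + 47) / 2 = 46.0

mean_before = statistics.mean(incomes)                  # 44.8
mean_after = statistics.mean(incomes_with_outlier)      # about 120.67

print(median_before, median_after)
print(mean_before, round(mean_after, 2))
```

The median moves by one unit, while the mean almost triples.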

We should also use the median and not rely on the mean in case our data is skewed:



The mode denotes the most frequently appearing value in the data set. If we look at the previous histogram, we can easily identify the mode by looking at the tallest bar:


In the case of a normal distribution (where the data is distributed symmetrically around a single peak), the mean, the median and the mode are all the same. The measures of central tendency (mean, median, mode) display the most representative value in the data.


It's important to note that we can use the mean and the median only on numerical data. The mode, on the other hand, can be used even if we have nominal data in our data set. Note: nominal data consists of categories without a natural order, so it is a kind of categorical data. Usual examples of nominal data are gender (male, female), eye or hair color, and nationality (American, German, etc.). Each of these has a limited number of groups.
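Finding the mode of nominal data can be sketched with Python's `statistics` module. The eye colors and the other values below are made up for illustration.

```python
import statistics

# Hypothetical nominal data: eye colors of six people.
eye_colors = ["brown", "blue", "brown", "green", "blue", "brown"]
most_common = statistics.mode(eye_colors)

# If several values share the highest count, multimode returns all of them.
tied = statistics.multimode(["red", "blue", "red", "blue", "green"])

print(most_common)  # brown
print(tied)         # ['red', 'blue']
```

Computing a mean or median of these labels would make no sense, but counting which label occurs most often works fine.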


We previously named three major measures of central tendency: mean, median and mode. One more measure of central tendency worth mentioning is the mid-range. The mid-range is the average of the smallest value and the greatest value of the data. The formula is rather simple: we take the mean of these two values:

mid-range = (min_val + max_val) / 2
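As a quick sketch, the mid-range takes one line of Python. The function name `mid_range` and the values are hypothetical. Note the parentheses: the sum has to be formed before dividing by 2.

```python
def mid_range(values):
    """Average of the smallest and the greatest value."""
    return (min(values) + max(values)) / 2

# Hypothetical yearly incomes in thousands (made up for illustration).
incomes = [40, 42, 45, 47, 50]

print(mid_range(incomes))          # (40 + 50) / 2 = 45.0
print(mid_range(incomes + [500]))  # (40 + 500) / 2 = 270.0
```

Because it depends only on the two extreme values, the mid-range reacts even more strongly to an outlier than the mean does.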

Summarizing mean, median and mode

Use the mean in cases where:

  1. the data is numerical
  2. the data has no outliers
  3. the goal is to find the most typical value of the data set

Use the median in cases where:

  1. the data can be ordered
  2. the distribution (data) has outliers or is skewed
  3. the goal is to find the central value of the data set

Use the mode in cases where:

  1. the data is numerical or categorical
  2. the goal is to find the most frequent value of the data set

Read Part II and III next :)

Further recommended readings:

Introduction to Statistics Part II

Introduction to Statistics Part III

Classification with Naive Bayes


Copyright © 2024 by Richard Siegel at siegel.work