
Introduction to Statistics

by Elvira Siegel
(Published: Sun Nov 10, 2019)

Part III

In this third and last part of the series "Introduction to Statistics" we will cover what probability is and which types of probability exist, as well as the three probability axioms on which the whole of probability theory is built.

Go to Part I

Go to Part II

Understanding the Terminology

Let's start with probability. In simple terms, probability is a measure that tells us the confidence that an event will happen. Or, in other words, it shows us how probable an occurrence of an event is.

We used the word event. In probability theory an event denotes much the same thing as in everyday life, namely something that happens (or has already happened). More precisely, an event is a set of one or more outcomes of an experiment. A quantity whose value depends on the outcome is called a random variable.

A random variable takes its value at random; it is a numerical output of some random experiment. If I toss a coin, I get heads or tails at random. If I record 1 for heads and 0 for tails, that recorded number is a random variable in my experiment. Random variables can be of two types: discrete and continuous.

  1. a discrete random variable has values which are countable (a finite or countably infinite set), e.g.: number of ill patients, number of times getting a 6 when rolling a die seven times, etc.
  2. a continuous random variable has values which come from an interval containing infinitely many numbers, e.g.: temperature, height of a tree, a cat's weight or the time it takes to get to work, etc. In all these examples there are infinitely many intermediate values the variable can take on. That means that the probability that a continuous random variable X takes one specific value is equal to 0.

Note: consult the Probability Density Function (PDF) in context of continuous random variables.
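The discrete example above (counting sixes in seven rolls) can be worked out explicitly. The following is a minimal sketch, assuming a fair die and independent rolls, in which case the count follows a binomial distribution; the function name `prob_sixes` is made up for illustration.

```python
from fractions import Fraction
from math import comb

def prob_sixes(k, rolls=7):
    """P(X = k), where X counts the sixes in `rolls` rolls of a fair die."""
    p_six = Fraction(1, 6)
    return comb(rolls, k) * p_six**k * (1 - p_six)**(rolls - k)

# X can only take the countable values 0, 1, ..., 7.
pmf = {k: prob_sixes(k) for k in range(8)}

# The probabilities of all possible values add up to exactly 1.
assert sum(pmf.values()) == 1

for k, p in pmf.items():
    print(k, round(float(p), 4))
```

Because the variable is discrete, every single value has its own (non-zero) probability, which is not the case for a continuous variable.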

Another frequently used term is experiment. An experiment in probability theory is a random trial that can, in principle, be repeated any number of times under the same conditions. It has a defined set of possible outcomes called the sample space.

Consider some descriptive examples: you toss a coin and get heads. Tossing a coin is the experiment, getting heads is an event and the sample space of this experiment is {heads, tails}. You roll a fair die and you get 6. Rolling a fair die is the experiment, getting the 6 is an event and the sample space of the experiment is {1,2,3,4,5,6}. A sample space is often denoted by the Greek letter Ω (omega).
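One way to make these terms concrete is to model the sample space and events as sets. This is only a sketch: the helper `probability` is a made-up name, and it assumes that every outcome in the sample space is equally likely.

```python
from fractions import Fraction

# Sample space (omega) of one roll of a fair six-sided die.
omega = {1, 2, 3, 4, 5, 6}

def probability(event, sample_space):
    """Probability of an event (a subset of the sample space),
    assuming all outcomes are equally likely."""
    favourable = event & sample_space
    return Fraction(len(favourable), len(sample_space))

rolling_a_six = {6}           # an event with a single outcome
rolling_even = {2, 4, 6}      # an event made of several outcomes

print(probability(rolling_a_six, omega))  # 1/6
print(probability(rolling_even, omega))   # 1/2
```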

Probability Axioms

An axiom is a statement that is accepted as true without proof. It is applied as a premise on which we can build our further argumentation.

In probability theory there are three axioms.

Axiom 1.

The first axiom says: the probability of any event is a non-negative real number. Together with the second axiom this means that every probability lies between 0 and 1, where 0 denotes a 0% probability that an event happens and 1 denotes a 100% probability that it happens.


Axiom 2.

The second axiom says: the probability of the entire sample space is 1. In other words, if we sum up the probabilities of all outcomes in the sample space, we get 1. The axiom also means that some outcome from the sample space will happen with 100% probability. If we toss a coin, we will definitely get heads or tails. We can't know for sure whether it will be heads or tails, but we know for certain that one of them will happen.

Note: remember, all possible outcomes together form the sample space.


Axiom 3.

This one sounds complicated when you first read it: if two events are mutually exclusive, then the probability that either one of them happens is the sum of their individual probabilities.

Mutually exclusive means that the two events cannot occur simultaneously. Because they never overlap, nothing is counted twice, so to find the probability that one or the other happens we simply add the two probabilities.


Example: if I roll a fair die, the probability of getting 1 OR 6 is:

number of all possible outcomes when rolling a fair die: 6

probability of rolling a 1: 1/6

probability of rolling a 6: 1/6

1/6 + 1/6 = 2/6 = 1/3 for rolling 1 OR 6
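The three axioms can be checked on this die example with a few lines of code. This is a minimal sketch, assuming a fair die where every face has probability 1/6; the names `p` and `prob` are chosen for illustration only.

```python
from fractions import Fraction

omega = {1, 2, 3, 4, 5, 6}
# Assumption: a fair die, so each outcome has probability 1/6.
p = {outcome: Fraction(1, 6) for outcome in omega}

def prob(event):
    """Probability of an event as the sum of its outcomes' probabilities."""
    return sum((p[o] for o in event), Fraction(0))

# Axiom 1: no probability is negative.
assert all(prob({o}) >= 0 for o in omega)

# Axiom 2: the whole sample space has probability 1.
assert prob(omega) == 1

# Axiom 3: for mutually exclusive events the probabilities add up.
one, six = {1}, {6}
assert one.isdisjoint(six)
assert prob(one | six) == prob(one) + prob(six)

print(prob(one | six))  # 1/3
```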

These three axioms are universal in probability theory. We use them to derive or prove other concepts with the help of logical reasoning.

Nevertheless, these three axioms don't give us all the answers. Any function that conforms to all three axioms is called a probability function, and many different functions do. The axioms alone cannot tell us WHICH function we should use for a given problem. We only know that the function of our choice must satisfy the three axioms.
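To illustrate the point, here are two different probability functions for the same die. Both are hypothetical: the weights of the loaded die are invented for this sketch. The helper checks the first two axioms; the third holds automatically here if the probability of an event is defined as the sum over its outcomes.

```python
from fractions import Fraction

# A fair die and an invented loaded die that favours the 6.
fair = {face: Fraction(1, 6) for face in range(1, 7)}
loaded = {1: Fraction(1, 10), 2: Fraction(1, 10), 3: Fraction(1, 10),
          4: Fraction(1, 10), 5: Fraction(1, 10), 6: Fraction(1, 2)}

def satisfies_axioms(p):
    """Check non-negativity (axiom 1) and total probability 1 (axiom 2)."""
    non_negative = all(value >= 0 for value in p.values())
    total_is_one = sum(p.values()) == 1
    return non_negative and total_is_one

print(satisfies_axioms(fair))    # True
print(satisfies_axioms(loaded))  # True
```

Both functions pass, yet they describe very different dice. Which one is appropriate has to come from knowledge about the experiment or from data, not from the axioms.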

Types of probability

In statistics, when dealing with real-world problems, we often have to calculate probabilities that take multiple random variables into account. To do so, we consider the following types of probability, which form the basis of statistics:

Joint Probability

The joint probability describes events that happen at one and the same time. Let's picture the joint probability using two sets, set1 and set2.

The joint probability (as the name already suggests) of the two sets is calculated by taking an event A from set1 AND an event B from set2.


An example would be: what is the probability of pulling a card from a card deck which is a Queen AND black?

Note: Sometimes in the context of joint probability you will see the symbol ∩, which comes from set theory and is called intersection.

p(A ∩ B) = p(Queen AND black)


So, the joint probability is the probability that an event A and an event B occur simultaneously. For example, the probability that a card I pull from a 52 card deck is a Queen and black is 2/52 = 1/26, because the deck contains 2 black Queens.


Statisticians apply joint probability when they want to measure two or more events happening simultaneously. As an example: what is the probability that the Dow Jones Index drops and Amazon shares drop at the same time, p(DJIA drop AND Amazon drop)?
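The card example can be reproduced by enumerating a deck and counting the cards that satisfy both conditions. This is a sketch assuming a standard 52 card deck in which every card is equally likely to be drawn; the variable names are illustrative.

```python
from fractions import Fraction
from itertools import product

RANKS = ["2", "3", "4", "5", "6", "7", "8", "9", "10",
         "Jack", "Queen", "King", "Ace"]
SUIT_COLORS = {"clubs": "black", "spades": "black",
               "hearts": "red", "diamonds": "red"}

# Every (rank, suit) combination: 13 * 4 = 52 cards.
deck = list(product(RANKS, SUIT_COLORS))

# Cards that are a Queen AND black at the same time.
queen_and_black = [(rank, suit) for rank, suit in deck
                   if rank == "Queen" and SUIT_COLORS[suit] == "black"]

p_queen_and_black = Fraction(len(queen_and_black), len(deck))
print(p_queen_and_black)  # 1/26
```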

Conditional Probability

This type gives us the probability that an event A will happen, given (the condition) that an event B has already happened. This relationship is represented as:

p(A | B)

Let's look at an example: again we have a deck of 52 cards. You pull a card and it's a black one. This is the only information you have. What you want to know is: what's the probability that this black card is a Queen, p(Queen | black)?

We know that 26 of the 52 cards are black and that 2 of these black cards are Queens:

p(Queen | black) = 2 / 26 = 1/13
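Conditioning on "black" amounts to shrinking the sample space to the black cards and counting Queens only among them. A minimal sketch, again assuming a standard 52 card deck with equally likely cards:

```python
from fractions import Fraction

RANKS = ["2", "3", "4", "5", "6", "7", "8", "9", "10",
         "Jack", "Queen", "King", "Ace"]
SUIT_COLORS = {"clubs": "black", "spades": "black",
               "hearts": "red", "diamonds": "red"}
deck = [(rank, suit) for rank in RANKS for suit in SUIT_COLORS]

# Condition: we already know the card is black, so only these cards remain.
black_cards = [(rank, suit) for rank, suit in deck
               if SUIT_COLORS[suit] == "black"]

# Within the reduced sample space, count the Queens.
black_queens = [(rank, suit) for rank, suit in black_cards
                if rank == "Queen"]

p_queen_given_black = Fraction(len(black_queens), len(black_cards))
print(p_queen_given_black)  # 1/13
```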

Marginal Probability

This type of probability can be thought of as "unconditional" probability. It is just the probability of some event A happening: p(A). So, there is only one event we want to analyze. A nice example of marginal probability would be: if you take a card from a deck of 52 cards, what's the probability that this card is black, p(black)? Well, we know that we have 52 cards in total and that 26 of them are red, so the other 26 are black. That means 26/52 = 1/2 or 0.5 is the marginal probability of pulling a black card.

Connecting the probability types together:

The conditional probability can be represented using the joint and the marginal probabilities:

p(A | B) = p(A ∩ B) / p(B)

We can also rearrange the above equation to build the joint probability:

p(A ∩ B) = p(A | B) * p(B)
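Both forms of the relationship can be verified on the card example, with A = "Queen" and B = "black". The sketch below assumes a standard 52 card deck; `prob`, `is_queen` and `is_black` are helper names invented for this illustration.

```python
from fractions import Fraction

RANKS = ["2", "3", "4", "5", "6", "7", "8", "9", "10",
         "Jack", "Queen", "King", "Ace"]
SUIT_COLORS = {"clubs": "black", "spades": "black",
               "hearts": "red", "diamonds": "red"}
deck = [(rank, suit) for rank in RANKS for suit in SUIT_COLORS]

def prob(predicate, cards=deck):
    """Share of `cards` for which predicate(rank, suit) is true."""
    hits = sum(1 for rank, suit in cards if predicate(rank, suit))
    return Fraction(hits, len(cards))

def is_queen(rank, suit):
    return rank == "Queen"

def is_black(rank, suit):
    return SUIT_COLORS[suit] == "black"

# Marginal probability p(B).
p_black = prob(is_black)
# Joint probability p(A ∩ B).
p_queen_and_black = prob(lambda r, s: is_queen(r, s) and is_black(r, s))
# Conditional probability p(A | B), computed directly on the black cards.
black_cards = [card for card in deck if is_black(*card)]
p_queen_given_black = prob(is_queen, black_cards)

# p(A | B) = p(A ∩ B) / p(B)
assert p_queen_given_black == p_queen_and_black / p_black
# p(A ∩ B) = p(A | B) * p(B)
assert p_queen_and_black == p_queen_given_black * p_black

print(p_black, p_queen_and_black, p_queen_given_black)  # 1/2 1/26 1/13
```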

Probability vs. Likelihood

Probability and likelihood are often used as synonyms by non-statisticians. Strictly speaking, they are not synonyms. Let's clear up the difference between probability and likelihood.

When we talk about probability, we normally mean some area under a distribution curve.

It's basically the probability of data given some distribution (curve). We can think of applying a probability function as taking an interval from the distribution and finding the probability within that interval. We do so because the probability of one very exact measurement is vanishingly small. For example, if a friend of mine randomly picks a number between 0 and 1, what is the probability that she picks exactly 0.3? She could just as well pick 0.5 or 0.7 or 0.3459 ... There are too many possible values to assign a meaningful probability to a single one, so we pick an interval (that is, the area under the distribution over that interval) and find its probability.

We should see this interval in the context of continuous random variables (which we already mentioned above). Because the probability of any single exact value of a continuous random variable X is 0, we instead find the probability that X falls in an interval between a and b: P(a < X < b)
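For the friend who picks a number between 0 and 1, the interval probability is easy to compute if we assume that every number is equally likely (a uniform distribution on [0, 1], which the text does not state explicitly). Under that assumption P(a < X < b) is simply the length of the interval. The function name below is illustrative.

```python
from fractions import Fraction

def uniform_interval_probability(a, b):
    """P(a < X < b) for X uniformly distributed on [0, 1]."""
    lo = max(a, Fraction(0))
    hi = min(b, Fraction(1))
    return max(hi - lo, Fraction(0))

center = Fraction(3, 10)  # the value 0.3 from the example
for half_width in (Fraction(1, 10), Fraction(1, 100),
                   Fraction(1, 1000), Fraction(0)):
    p = uniform_interval_probability(center - half_width,
                                     center + half_width)
    print(f"half-width {half_width}: P = {p}")
```

As the interval around 0.3 shrinks, the probability shrinks with it and reaches 0 for the single point 0.3.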

To deepen your knowledge about probability distribution, read more about PDF (Probability Density Function) or watch this nice explanation video.

In the context of likelihood, we want to find the distribution (curve) given some data:

Likelihood(distribution | data)


We can summarize likelihood as a measure of how well a particular distribution explains data that has already been observed.

Likelihood refers to a past event with known outcomes and asks which distribution fits them, while probability starts from a known distribution and refers to the occurrence of future events.
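A small coin example may help to see the difference. The data here is invented for illustration (7 heads in 10 tosses), and the sketch assumes independent tosses, so the count of heads follows a binomial distribution. The data stays fixed while we vary the candidate distribution, i.e. the coin's probability of heads.

```python
from fractions import Fraction
from math import comb

# Hypothetical observed data: 7 heads in 10 tosses.
heads, tosses = 7, 10

def likelihood(p_heads):
    """Likelihood(distribution | data): how well a coin with
    P(heads) = p_heads explains the observed tosses."""
    return comb(tosses, heads) * p_heads**heads * (1 - p_heads)**(tosses - heads)

# Candidate distributions: coins with P(heads) = 0.0, 0.1, ..., 1.0.
candidates = [Fraction(k, 10) for k in range(11)]
best = max(candidates, key=likelihood)

print(best)                       # 7/10
print(float(likelihood(best)))
print(float(likelihood(Fraction(1, 2))))
```

Among the candidates, the coin with P(heads) = 7/10 explains the observed data best. A fair coin is still possible, it just has a lower likelihood.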


In these three parts of Introduction to Statistics, you've learned about measures of central tendency, measures of variability, different distributions and types of probability, as well as the three fundamental axioms of probability theory. Feel free to read other related articles on the blog!

Further recommended readings:

Introduction to Statistics Part I

Introduction to Statistics Part II

Classification with Naive Bayes


Copyright © 2024 by Richard Siegel at siegel.work