Warning! Your browser does not support this Website: Try Google-Chrome or Firefox!


by Elvira Siegel
(Published: Wed Oct 02, 2019)

Principal component analysis

or PCA is a technique for taking out relevant data points (variables also called components or sometimes features) from a larger data set. From this high dimensional data set, PCA tries extracting low dimensional data points. The idea is that with PCA we would be able to reproduce as much information as feasible in a low-dimensional space. If we have less variables, the data visualization becomes easier to understand. PCA is often used in problems, where we have to work with higher dimensions in data.

PCA is a statistical method which utilizes an orthogonal transformation. The lengths of vectors and the angles between those are defined through their inner product. An orthogonal transformation maintains vectors' lengths and angles between them.


( Qe1, e1 ) = ( TQe1, Te1 ) -> ( Qe2, e2 )

An orthogonal transformation (which is a linear transformation) is used to transform some correlated variables into linearly uncorrelated variables. Those are called principal components. They are called uncorrelated because of the fact that e.g. the first principal component is uncorrelated to the second one and vice versa.

Each newly gained principal component is a linear combination, which we calculated from the original variables that exist in higher dimensions. Principal components show the direction of the most variation in data. Generally speaking, principal components are the lines of the largest variance. The main target of dimensionality reduction is to project the initial data on a line (or plane) of the largest variance.


The dots' spread is along the diagonal line. The maximum variance of the data lies between two endpoints of the horizontal line (1) as well as along the vertical line (2).


Let's turn it in a specific way ... We got new x, y axes. These new axis helps us to see the variation easier. Our data varies a lot more from left to right as up and down.


S1 and S2 stand for "sample 1" and "sample 2".

These two new axis that describe data variance are principal components.

This method allows us to gather principal components in a specific order:

  1. the first, and the most important principal component, demonstrates the maximum variance in the data.
  2. the second principal component describes the remaining variance in the data and is uncorrelated to the first principal component.
  3. the third principal component describes the remaining variance which was not described by the first two principal components and so on.

Because PCA transforms data into linearly uncorrelated variables (components), the variables (components) are uncorrelated between each other.

Maximum Variance: Sum of Squared Distances

As we shew before, our goal is to find principal components in a high dimensional data in order to reduce dimensions in this data. To find principal components, we have to find the maximum variance across the data. To find maximum variance in the data, we use the formula of the sum of squared distances (here SSD for short). To understand how SSD works, let's observe the PCA technique step by step:

For starters, lets plot some data on a graph and calculate the average measurement for the sample 1 and the sample 2 (here S1 and S2 accordingly)


With these average values, we can calculate the center of our data and shift the data in a way, so the center of the data is placed on the origin of the graph (0,0).


Now we will try to fit a line on this data: we start by drawing a random line that intersects the graph right in its origin. Then we rotate the line until the line fits the data as good as it can with the condition that the line still have to go through the origin. In the end, some line will fit best. Note: each data point on the graph is fixed as well as its distance from the graph's origin. That means, the distance from the point to the origin doesn't change, when the line tries to fit the data aka rotates.


A good fit?

How do we know in the end that this final separating line fits the data the best way possible?

First of all, we already know that PCA projects data points onto the separating line. So, we can:

  1. either measure the distance from each data point to the separating line and find the line that minimizes these distances or ...
  2. ... PCA can find the line that maximizes the distances from the projected points to the origin


minimizing distances


maximizing distances

It is normally easier to calculate the second variant through a line that maximizes the distances from the projected points to the origin because PCA maximizes the sum of the squared distances from the projected points to the origin.


d12 + d22 + d32 + d42 + d52 + d62 = sum of squared distances ( SSD )

Note: We always want a line with the largest SSD .

We normalize our existing vectors in a way, so their length is one unit long. Such vectors are called singular vectors or eigenvectors. Eigenvectors have a characteristic that no matter what, they always stay at the origin. SSD is actually a scaling value or, in other words, it is a representation of eigenvalues. The root of a SSD is a singular value.

What will happen if you don't rotate the components?

The orthogonal rotation is essential for the correct PCA calculation. The rotation part maximizes the difference between variations captured by the principal components. The main objective of PCA is to chose less components than variables (data points). Those components can then describe the maximum variance in the data the best. Rotation only changes the coordinates of the points, the relative location of the components, on the other hand, doesn't undergo a change.

We will have to explain variance in the data by choosing a higher number of components (axis), if we don't apply the rotation of the components (axis). Therefore, the major PCA effect of dimensionality reduction diminishes.

PCA is realized on a covariance matrix. A covariance matrix is a matrix with some random elements in some position i and j. This position represents the covariance between the elements. Covariance is measured by the joint probability of two random elements.

Joint probability operates on dependent variables and calculates the probability of two events occurring together simultaneously. It's the probability of an event A occurring at the same time when an event B occurs.


If you've already read our article on T-Distributed Stochastic Neighbor Embedding also called t-SNE, you could ask a question: What are advantages and disadvantages of both methods? What are some differences between those?

First of all, we'll notice that both PCA and t-SNE are used for dimensionality reduction.

  1. To start with differences, t-SNE is a probabilistic technique, whereas PCA is not based on probabilities but more on pure math.
  2. PCA gives us a linear transformation for dimensionality reduction. We cannot say the same about t-SNE. T-SNE minimizes distance between the dataset in a non-linear way, for example by gradient descent.
  3. PCA is computed iteratively. That means, if we've already computed some n principle components and then decided to add a n+1 dimension, that will take only some more additional computation. Whereas, t-SNE computes the first n principal components and only then maps these n-dimensions to a 2D space, which means, we have to recompute everything each time we add a new dimension.
  4. PCA was initially invented in 1901 by Karl Pearson, as a counterpart to the principal axis theorem in mechanics. Later, in the 1930s the PCA method was independently developed further by Harold Hotelling. On the other hand, t-SNE is a relatively new method and was developed in 2008. You can take a look at the original paper: Visualizing Data using t-SNE

The Mathematics behind PCA


By adding constants and variables to the equation, like here: cy and dx, we can move our object (in this case a plane) in a 3D space. The same is also possible in multiple dimensions.


Suddenly, we got an ellipse and went from 3 dimensions to only 2 dimensions! Why does it happen?

Let's take another function x2 which is a parabola and looks like this:


Now, let's write it in this way: x2 = 1 which will be: x2 - 1 = 0 and we get:


To mark it exactly, we have a number line now, aka we reduced the number of dimensions to only one:


In the end the two solutions for this function are: x = - 1 and x = 1. Putting everything not relevant aside, we can present the number line in the following way:


The same principles are applied to our equation above:


Note: an eigenvector is a special type of vector which stays on the original line and only gets scaled by some factor (scalar or also called eigenvalue)

In the last two steps we give our graph (object) some form and then we center it:

Step 5: give a form: a(x - x move)2 - b(y - ymove)2 - 1 = 0

Step 6: center it: a(x)2 - b(y)2 - 1 = 0


PCA is a linear dimensionality reduction method with the main goal to maximize variance and keep large paired distances: data points, which are more different, are placed far apart from each other. In PCA such procedure may cause poor visualization. It is especially difficult in cases, where we have to work with non-linear geometric structures, like for example ball, curve or cylinder. In such cases it is better to use t-SNE. T-SNE keeps small paired distances and PCA maintains large paired distances to maximize variance.

In PCA we can chose the components that would describe the maximum variance in our data set. In general, PCA tries to detect correlated dimensions in some set of variables (like e.g. BOW vectors) in a high dimensional data set and then merge some dimensions into one.

Further recommended readings:

t-SNE: Distributed Stochastic Neighbor Embedding

SVD: Singular Value Decomposition


It's AI Against Corona

2019-nCoV There has been a lot of talking about the new corona virus going around the world. Let's clear up some things about it first and then we will see how data science and ai can help us fight 2019-nCoV. ...

Activation Functions

What are activation functions in Neural Networks? First of all let's clear some terminology you need in order to understand the concept of an activation function. ...


or backward propagation of errorsis another supervised learning optimization algorithm. The main task of the backpropagation algorithm is to find optimal weights in a by implementing optimization technique. ...


The Convolutional Neural Network (CNN) architecture is widely used in the field of computer vision. Because we have a massive amount of data in image files, the usage of traditional neural networks wouldn't give much efficiency as the computational time would expl...

Early Stopping

In this article we will introduce you to the concept of Early Stopping and its implementation including code samples. ...


Generative Adversarial Networks (GANs) are a type of unsupervised neural networks. The network exists since 2014 and was developed by and colleges. ...

Gradient Descent

Hiking Down a Mountain Gradient Descent is a popular optimization technique in machine learning. It is aimed to find the minimum value of a function. ...

Introduction to Statistics

Part III In this third and last part of the series "Introduction to Statistics" we will cover questions as what is probability and what are its types, as well as the three probability axioms on top of which the entire probability theory is constructed. ...

Introduction to Statistics

Part I In the following three parts we will cover basic terminology as well as the core concepts from statistics. In this Part I you are going to learn about measures of central tendency (mean, median and mode). In the Part II you will read about measures of variabili...

Introduction to Statistics

Part II In this part we will continue our talk about descriptive statistics and the measures of variability such as range, standard deviation and variance as well as different types of distributions. Feel free to read the Part I of these series to deepen your knowle...

Logistic Regression

Logit Regression Logit regression is another shortened name derived from logistic unit. Logistic regression is a popular statistical model that generates probabilities for binary classification tasks. It produces discrete values and its span lies in the range of [...

Loss Functions

When training a neural network, we try to optimize the algorithm, so it gives the best possible output. This optimization needs a loss function to compute the error/loss of the model. In this article we will gain a general picture of Squared Error, Mean Sq...

The Magic Behind Tensorflow

Getting started In this article we will delve into the magic behind one of the most popular Deep Learning frameworks - Tensorflow. We will look at the crucial terminology and some core computation principles we need to grasp the real power of Tensorflow. ...

Classification with Naive Bayes

The Bayes' Theorem describes the probability of some event, based on some conditions that might be related to that event. ...

Neural Networks

Neural Networks - Introduction In Neural Networks (NNs) we try to create a program which is able to learn from experience with respect to some task. This program should cons...

Introduction to reinforcement learning

Part IV: Policy Gradient In the previous articles from this series on Reinforcement Learning (RL) we discussed Model-Based and Model-Free RL. In model-free RL we talked about Value Function Approximation (VFA). In this Part we are going to learn about Policy Based R...

Introduction to Reinforcement Learning

Part I : Model-Based Reinforcement Learning Welcome to the series "Introduction to Reinforcement Learning" which will give you a broad understanding about basic (and not only :) ) techniques in the field of Reinforcement Learning. The article series assumes you have s...

Introduction to Reinforcement Learning

Part II : Model-Free Reinforcement Learning In this Part II we're going to deal with Model-Free approaches in Reinforcement Learning (RL). See what model-free prediction and control mean and get to know some useful algorithms like Monte Carlo (MC) and Temporal Differ...

Recurrent Neural Networks

RNNs A Recurrent Neural Network (RNN) is a type of neural network where an output from the previous step is given as an input to the current step. RNNs are designed to take an input series with no size limits. RNNs remember the past states and are influenced by them...


Support Vector Machines If you happened to have a classification, a regression or an outlier detection task, you might want to consider using Support Vector Machines (SVMs), a supervised learning model, that builds a line (hyperplane) to separate data into groups....

Singular Value Decomposition

Matrix factorization: Singular Value Decomposition Matrix decomposition is another name for matrix factorization. This method is a nice representation for applied linear algebra in machine learning and similar algorithms. ...

Partial Derivatives and the Jacobian Matrix

A Jacobian Matrix is a special kind of matrix that consists of first order partial derivatives for some vector function. The form of the Jacobian matrix can vary. That means, the number of rows and columns can be equal or not, denoting that in one case it is a squa...

Introduction to Reinforcement Learning

Part III: Value Function Approximation In the previous Part I and Part II of this series we described model-based and model-free reinforcement learning as well as some well known algorithms. In this Part III we are going to talk about Value Function Approximation: w...

Weight Initialization

How does Weight Initialization work? As a general rule, weights and biases are normally initialized with some random numbers. Weights and biases are extremely important model's parameters and play a pivot role in every neural network training. Therefore, one should ...

Word Embeddings

Part 1: Introduction to Word2Vec Word embedding is a popular vocabulary representation model. Such model is able to capture contexts and semantics of a word in a document. So what is it exactly? ...

Word Embeddings

Part 2: Word2Vec (Skip Gram)In the second part of Word Embeddings we will talk about what are the downsides of the Word2Vec model (Skip Gram...


T-Distributed Stochastic Neighbor Embedding If you do data analysis, machine learning or some other data driven research you will prob...
Copyright © 2024 by Richard Siegel at siegel.work Donate Contact & Privacy Policy