Warning! Your browser does not support this Website: Try Google-Chrome or Firefox!

Singular Value Decomposition

by Elvira Siegel
(Published: Fri Aug 02, 2019 )

Matrix factorization: Singular Value Decomposition

Matrix decomposition is another name for matrix factorization. This method is a nice representation for applied linear algebra in machine learning and similar algorithms.

Matrix Factorization: Definition

A matrix factorization is a mathematical method that "breaks down" (decomposes) a matrix into its separate integral components.

It is a process that can make more complex matrix operations easier by "splitting up" an initial matrix into some consistent pieces, so that those operations can be performed on the decomposed matrix (those "pieces") and not on the original one, more complex, matrix.


Consider the factoring of a natural number e.g. 20 into 2 * 10. That means, that 20 is a composite number and we can decompose (factorize) it into some pieces (2 and 10), which multiplied still represent the initial number 20. So you can think of a matrix factorization as "dividing" a big composite matrix into some smaller ones, which are less complex intrinsically.

There exist different methods to factorize (decompose) a matrix. For now, we will focus on one of them called SVD or Singular Value Decomposition.

Singular Value Decomposition (SVD): Introduction

SVD stores only important information in order to have a dense vector with a low number of dimensions. That means, SVD is one of many dimensionality reduction techniques.

Dimensionality Reduction

We need dimensionality reduction to overcome the so called Curse of Dimensionality. If the number of dimensions in data increases, the distances between the data points increase as well, which leads to a higher sparsity in the data set. Even the closest neighboring points are too far away from each other. That means, with more dimensions, we need more data. More data means, computation time increases as well as more storage capacities are required. The problem becomes computationally expensive.

One further problem with a large number of dimensions is, it is difficult (nearly impossible) to visualize let's say, 1000 dimensions in a way that a human being is able to understand it. Humans can only see up to 3 dimensions. Fortunately, with SVD we can present our data in lower dimensions and visualize it much easier.

Some Dimensionality Reduction Techniques include: Random Forest, Principal Component Analysis (PCA), t-Distributed Stochastic Neighbor Embedding (t-SNE), Singular Value Decomposition (SVD) and many more.

Visualization of SVD:


watch the visualization here.

Singular Value Decomposition (SVD): Definition

For a start, we have a matrix A which we want to decompose. A is our initial data matrix. When we decompose the A matrix, we get the following three matrices: U, Σ and VT. With SVD we decompose the matrix A into a product of three matrices. (Note: VT means the transpose of a matrix. A transposed matrix is a new matrix, where the rows are the columns of the original matrix).

decomposing the A matrix :


the transpose of a matrix :

Singular Value Decomposition (SVD): Geometric way of thinking

UΣVT as a result of the origin data matrix A is possible due to linear algebra and linear transformation. U, Σ and VT are a representation of some "movement" (transformation) in a linear space. This "movement" is mostly done by addition and scalar multiplication and can be: scaling, rotation or reflection.

Let's transform our three matrices (UΣVT) into its original one A

VT : rotation and reflection

We multiply some matrix B by VT :


BVT and we get a new matrix.

Σ : scaling

Σ * (VTB)

because Σ has values (singular values) on the diagonal, it only scales the matrix. Scaling means an operation which stretches or squishes the matrix (or a vector).


U : rotation and reflection

U * (Σ * ( VT*B) )

Finally, we rotate and reflect the Σ*(VT*B) matrix with the help of U and we get our original matrix A


if you'd like to know more about Linear Algebra or refresh you knowledge, you can watch this playlist from the YouTube channel called 3Blue1Brown

Properties of U, Σ and VT

The initial A matrix is still close to one of the three newly created matrices (U, Σ or VT). That means, when we find a low rank matrix, it can serve as a good approximation of the initial decomposed data matrix (in this case A).

Clarifying U, Σ and VT

U, Σ and VT are also unique matrices, meaning one cannot be a replica of another.

U, V are orthogonal, which means, if we take two columns of U (or V) and multiply them together, we get 0. And they are also column orthonormal. Orthonormal means, the columns' length equals 1 (unit length\unit vector). Unit vector is sometimes called "direction vector" because it describes only the direction of a vector, not its magnitude.


Σ is a diagonal matrix: the entries are on the diagonal line and are sorted in decreasing order. The entries are also only positive numbers.


What do the numbers in U, Σ and VT matrices actually represent in the real-world data?

As an example, we can take our A matrix and say that it is of the form [m x n]. Now we can specify that, m (rows) represents documents in a corpus and n (columns) represents words from those documents. That is one of possibilities, what the matrix A could stand for in the real-world data representation.

what about the rest?

If we want to bring in our data, which is a bunch of words from some documents (the A matrix) and we categorize the words, we can say that:

  1. U represents the relation between a word a category (could be e.g. word embeddings)
  2. Σ represents the strength of the categories
  3. V represents the relation between a document and a category

With the help of SVD, we can also build a recommender system to suggest movies based on users' ratings. Our matrices would be the following representations:

  1. U represents the relation between a user and a concept (Note: a concept here means a movie genre: comedy, drama, thriller...) (could be e.g. word embeddings)
  2. Σ represents the same as before strength of the concepts
  3. V represents the relation between a movie and a concept

Dimensionality Reduction with SVD

The following procedure decreases the complexity of the data:

set the smallest singular values in Σ to 0. That means, take the column from U, which corresponds to the column in Σ with the smallest singular value and set this column from U to 0. Take the corresponding row of VT and set it to 0.

SVD and Eigenvalue Decomposition

Singular value decomposition can decompose a matrix of any dimension. SVD can be applied to any m x n or m x m matrix. On the other hand, eigenvalue decomposition can be applied only to matrices with a clear diagonal line, that means, the matrix must be square (m x m).

Column vectors in one matrix from Eigenvalue Decomposition are not necessarily orthogonal, which means it doesn't always represent rotations/reflections. U and V in SVD are always orthogonal and always represent rotations and reflections.

The singular values in Σ are all real and non-negative numbers. The eigenvalues in Eigenvalue Decomposition can be complex and even have imaginary numbers.

Summing everything up ...


SVD is one of the most widely used techniques for dimensionality reduction, recommender systems, object recognition, risk modeling and many more other models. SVD combines the main linear algebra concepts, namely: matrix transformations, projections, change of basis, symmetric matrices, orthogonality and factorization.

The problem with SVD is that for an n x m matrix the time complexity will be quadratic: O(mn2)

Further recommended readings/courses:

Principal Component Analysis (PCA)

Singular Value Decomposition (SVD)

MIT Open Courses "Linear Algebra"


It's AI Against Corona

2019-nCoV There has been a lot of talking about the new corona virus going around the world. Let's clear up some things about it first and then we will see how data science and ai can help us fight 2019-nCoV. ...

Activation Functions

What are activation functions in Neural Networks? First of all let's clear some terminology you need in order to understand the concept of an activation function. ...


or backward propagation of errorsis another supervised learning optimization algorithm. The main task of the backpropagation algorithm is to find optimal weights in a by implementing optimization technique. ...


The Convolutional Neural Network (CNN) architecture is widely used in the field of computer vision. Because we have a massive amount of data in image files, the usage of traditional neural networks wouldn't give much efficiency as the computational time would expl...

Early Stopping

In this article we will introduce you to the concept of Early Stopping and its implementation including code samples. ...


Generative Adversarial Networks (GANs) are a type of unsupervised neural networks. The network exists since 2014 and was developed by and colleges. ...

Gradient Descent

Hiking Down a Mountain Gradient Descent is a popular optimization technique in machine learning. It is aimed to find the minimum value of a function. ...

Introduction to Statistics

Part III In this third and last part of the series "Introduction to Statistics" we will cover questions as what is probability and what are its types, as well as the three probability axioms on top of which the entire probability theory is constructed. ...

Introduction to Statistics

Part I In the following three parts we will cover basic terminology as well as the core concepts from statistics. In this Part I you are going to learn about measures of central tendency (mean, median and mode). In the Part II you will read about measures of variabili...

Introduction to Statistics

Part II In this part we will continue our talk about descriptive statistics and the measures of variability such as range, standard deviation and variance as well as different types of distributions. Feel free to read the Part I of these series to deepen your knowle...

Logistic Regression

Logit Regression Logit regression is another shortened name derived from logistic unit. Logistic regression is a popular statistical model that generates probabilities for binary classification tasks. It produces discrete values and its span lies in the range of [...

Loss Functions

When training a neural network, we try to optimize the algorithm, so it gives the best possible output. This optimization needs a loss function to compute the error/loss of the model. In this article we will gain a general picture of Squared Error, Mean Sq...

The Magic Behind Tensorflow

Getting started In this article we will delve into the magic behind one of the most popular Deep Learning frameworks - Tensorflow. We will look at the crucial terminology and some core computation principles we need to grasp the real power of Tensorflow. ...

Classification with Naive Bayes

The Bayes' Theorem describes the probability of some event, based on some conditions that might be related to that event. ...

Neural Networks

Neural Networks - Introduction In Neural Networks (NNs) we try to create a program which is able to learn from experience with respect to some task. This program should cons...


Principal component analysis or PCA is a technique for taking out relevant data points (variables also called components or sometimes features) from a larger data set. From this high dimensional data set, PCA tries extracting low dimensional data points. The idea...

Introduction to reinforcement learning

Part IV: Policy Gradient In the previous articles from this series on Reinforcement Learning (RL) we discussed Model-Based and Model-Free RL. In model-free RL we talked about Value Function Approximation (VFA). In this Part we are going to learn about Policy Based R...

Introduction to Reinforcement Learning

Part I : Model-Based Reinforcement Learning Welcome to the series "Introduction to Reinforcement Learning" which will give you a broad understanding about basic (and not only :) ) techniques in the field of Reinforcement Learning. The article series assumes you have s...

Introduction to Reinforcement Learning

Part II : Model-Free Reinforcement Learning In this Part II we're going to deal with Model-Free approaches in Reinforcement Learning (RL). See what model-free prediction and control mean and get to know some useful algorithms like Monte Carlo (MC) and Temporal Differ...

Recurrent Neural Networks

RNNs A Recurrent Neural Network (RNN) is a type of neural network where an output from the previous step is given as an input to the current step. RNNs are designed to take an input series with no size limits. RNNs remember the past states and are influenced by them...


Support Vector Machines If you happened to have a classification, a regression or an outlier detection task, you might want to consider using Support Vector Machines (SVMs), a supervised learning model, that builds a line (hyperplane) to separate data into groups....

Partial Derivatives and the Jacobian Matrix

A Jacobian Matrix is a special kind of matrix that consists of first order partial derivatives for some vector function. The form of the Jacobian matrix can vary. That means, the number of rows and columns can be equal or not, denoting that in one case it is a squa...

Introduction to Reinforcement Learning

Part III: Value Function Approximation In the previous Part I and Part II of this series we described model-based and model-free reinforcement learning as well as some well known algorithms. In this Part III we are going to talk about Value Function Approximation: w...

Weight Initialization

How does Weight Initialization work? As a general rule, weights and biases are normally initialized with some random numbers. Weights and biases are extremely important model's parameters and play a pivot role in every neural network training. Therefore, one should ...

Word Embeddings

Part 1: Introduction to Word2Vec Word embedding is a popular vocabulary representation model. Such model is able to capture contexts and semantics of a word in a document. So what is it exactly? ...

Word Embeddings

Part 2: Word2Vec (Skip Gram)In the second part of Word Embeddings we will talk about what are the downsides of the Word2Vec model (Skip Gram...


T-Distributed Stochastic Neighbor Embedding If you do data analysis, machine learning or some other data driven research you will prob...
Copyright © 2024 by Richard Siegel at siegel.work Donate Contact & Privacy Policy