by Elvira Siegel
(Published: Sat Sep 28, 2019)

T-Distributed Stochastic Neighbor Embedding

If you do data analysis, machine learning or some other data-driven research, you will probably encounter high-dimensional data. High dimensionality means that the data set has a large number of features (sometimes also called input variables). Our goal is to find good features and use them in our model training.

In this article, we will talk about a dimensionality reduction technique known as T-Distributed Stochastic Neighbor Embedding or t-SNE for short.

Why reduce dimensions in data?

As a rule of thumb, the higher the number of dimensions, the more data is needed to train a model effectively. As the number of dimensions increases, the feature space becomes sparser (emptier): even the closest neighbor of a data point ends up far away from it, so we need more and more data to cover the space.
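The sparsity effect can be made visible with a small experiment. The following is only a sketch under simple assumptions: points are drawn uniformly at random from the unit hypercube, and the function name `mean_nearest_neighbor_distance` is made up for this illustration.

```python
import math
import random

def mean_nearest_neighbor_distance(n_points, n_dims, seed=0):
    """Average distance from each random point to its closest neighbor."""
    rng = random.Random(seed)
    points = [[rng.random() for _ in range(n_dims)] for _ in range(n_points)]
    total = 0.0
    for i, p in enumerate(points):
        # Distance to the closest *other* point.
        total += min(math.dist(p, q) for j, q in enumerate(points) if j != i)
    return total / n_points

# Same number of points, increasing number of dimensions.
for n_dims in (2, 10, 50):
    print(n_dims, round(mean_nearest_neighbor_distance(200, n_dims), 3))
```

With the number of points held fixed, the average nearest-neighbor distance grows as dimensions are added, which is the "emptier" feature space described above.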

Dimensionality reduction is the process of reducing the number of features (dimensions) in a data set, ideally by getting rid of the less useful ones. By eliminating dimensions in the data set, we will have fewer relationships between features to observe. The advantages of dimensionality reduction are that data can be analyzed and visualized more easily and that model overfitting becomes less likely.

Here you can see an example of reducing a 2D plot to a 1D plot.


We normally use dimensionality reduction techniques like t-SNE or PCA in cases where we have many dimensions (maybe thousands). Because it is difficult to visualize, say, 10,000-dimensional data, we can use t-SNE to reduce it to a number of dimensions we can easily comprehend, for example 2 or 3.

There are some methods to reduce dimensionality by reducing the number of features in the data set:

Eliminating features:

We just drop some features from the data set. The problem is that once we have eliminated a feature, the information it carried is lost as well.

Selecting features:

Not all features are equally useful. Some of them are less necessary than others. Again, reducing the number of features always comes with some information loss.

There are some criteria for choosing a good feature:

  1. a feature appears frequently in the data set, which means that the model will probably see it in different settings.
  2. a feature has an obvious meaning and is not ambiguous
  3. a feature is uncorrelated to other features
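The third criterion can be checked mechanically. Below is a minimal sketch that flags pairs of strongly correlated features using the Pearson correlation coefficient. The feature names, the toy values and the 0.9 threshold are assumptions chosen for illustration only.

```python
import math
from itertools import combinations

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long value lists."""
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (norm_x * norm_y)

def highly_correlated_pairs(features, threshold=0.9):
    """Feature pairs whose absolute correlation exceeds the threshold."""
    return [(a, b) for a, b in combinations(features, 2)
            if abs(pearson(features[a], features[b])) > threshold]

# Hypothetical toy data: the same height stored in two different units.
features = {
    "height_cm": [150, 160, 170, 180, 190],
    "height_in": [h / 2.54 for h in (150, 160, 170, 180, 190)],
    "age": [30, 22, 41, 25, 36],
}
print(highly_correlated_pairs(features))  # [('height_cm', 'height_in')]
```

One feature of each flagged pair is a candidate for removal, because the other one already carries almost the same information.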


t-SNE, an unsupervised, non-linear technique, takes the original data set, which is represented in many dimensions, and reduces it to a low-dimensional representation that preserves as much of the original neighborhood structure as possible. The main goal of t-SNE is to project data into a low-dimensional space in such a way that the clustering in the high-dimensional space is maintained. The method is not just a projection onto one of the axes; it is a technique that places each data point close to the group it belongs to.

If we only project the data onto one of the axes, we will get a mess, e.g.:


As we can see, the clustering was not kept and we just threw everything into one basket.
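The same effect can be reproduced with a few numbers. The sketch below uses two made-up clusters that lie on parallel diagonal lines: they are clearly separated in 2D, yet their projections onto either axis overlap. The helper names `ranges_overlap` and `min_gap` are invented for this example.

```python
import math

# Two hypothetical clusters that run along parallel diagonals.
cluster_a = [(t, t + 1) for t in range(5)]  # points on the line y = x + 1
cluster_b = [(t, t - 1) for t in range(5)]  # points on the line y = x - 1

def ranges_overlap(values_1, values_2):
    """True if the 1D intervals spanned by the two value lists intersect."""
    return min(values_1) <= max(values_2) and min(values_2) <= max(values_1)

def min_gap(points_1, points_2):
    """Smallest distance between a point of one cluster and a point of the other."""
    return min(math.dist(p, q) for p in points_1 for q in points_2)

# In 2D the clusters never touch.
print("2D gap:", round(min_gap(cluster_a, cluster_b), 3))

# Projecting onto an axis simply drops the other coordinate.
xs_a, xs_b = [x for x, _ in cluster_a], [x for x, _ in cluster_b]
ys_a, ys_b = [y for _, y in cluster_a], [y for _, y in cluster_b]
print("overlap on x-axis:", ranges_overlap(xs_a, xs_b))
print("overlap on y-axis:", ranges_overlap(ys_a, ys_b))
```

Both projections report an overlap although the 2D gap is positive, so a plain axis projection cannot tell the two groups apart.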

Let's understand how the t-SNE calculation works:

First of all, we determine how similar each point is to the others in the original high-dimensional space. Then, we express these similarities as probabilities.

We transform high-dimensional distances between data points into conditional probabilities. Those conditional probabilities represent the similarity of some data point x1 to some other data point x2 in the n-dimensional space.

We measure the distance between two data points and then turn it into a similarity score:


We repeat the procedure for every other neighboring point.

Because we use a normal distribution (a bell-shaped curve) centered on the point we start from, data points that are far away from it get low similarity scores. On the other hand, close data points get high similarity values:


We proceed with the measurement and apply the technique to each neighboring data point.
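The measurement described above can be sketched in a few lines. The snippet computes unnormalized Gaussian similarities between one point and three neighbors. The coordinates and the fixed curve width `sigma=1.0` are assumptions for illustration; t-SNE itself picks a width per point.

```python
import math

def gaussian_similarity(point, neighbor, sigma=1.0):
    """Unnormalized similarity: height of a bell curve centered on `point`."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(point, neighbor))
    return math.exp(-squared_distance / (2 * sigma ** 2))

# Hypothetical 2D example: one starting point and three neighbors.
center = (0.0, 0.0)
neighbors = {"near": (0.5, 0.0), "medium": (1.0, 1.0), "far": (3.0, 2.0)}

scores = {name: gaussian_similarity(center, p) for name, p in neighbors.items()}
for name, score in scores.items():
    print(f"{name}: {score:.4f}")
```

The closer the neighbor, the closer its score is to 1; distant neighbors get scores near 0.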


Now we have to normalize the similarity scores we read off the curve. We normalize them so that they do not depend on the density of data points within the cluster. For example:


Data density around the starting point is high


Data density around the starting point is low

We can conclude that the width of the bell-shaped curve depends on the data density near the point we calculate distances from. It is important to scale the similarity scores so that we treat data points independently of the local density, i.e. of the curve width. When the similarities are properly scaled, they sum to 1.
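A minimal sketch of this scaling step follows. It assumes the curve width `sigma` is simply given; in t-SNE the width is found per point by a search that matches a user-chosen perplexity, which is not shown here. The two neighborhoods are made up so that the sparse one is a tenfold stretched copy of the dense one.

```python
import math

def conditional_probabilities(distances, sigma):
    """Scale Gaussian similarities of one point's neighbors so they sum to 1."""
    weights = [math.exp(-d ** 2 / (2 * sigma ** 2)) for d in distances]
    total = sum(weights)
    return [w / total for w in weights]

# Dense neighborhood: small distances, narrow curve.
dense = conditional_probabilities([0.1, 0.2, 0.3], sigma=0.2)
# Sparse neighborhood: the same layout stretched by 10, wide curve.
sparse = conditional_probabilities([1.0, 2.0, 3.0], sigma=2.0)

print([round(p, 3) for p in dense], round(sum(dense), 3))
print([round(p, 3) for p in sparse], round(sum(sparse), 3))
```

Both neighborhoods end up with the same scaled similarities, and each list sums to 1, so a point in a sparse region is treated the same way as a point in a dense one.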


As we now know, t-SNE is at its core a probabilistic technique that tries to bring two distributions together. One distribution measures the similarities of pairs of data points in the input space, and the other distribution measures the similarities of the corresponding pairs of points in the low-dimensional embedding space.
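For readers who prefer formulas, the two distributions and the quantity that measures their mismatch can be written as follows. The notation is the one commonly used for t-SNE (as in the paper "Visualizing Data using t-SNE" listed in the further readings), not notation introduced in this article: x_i are the original points, y_i their low-dimensional counterparts, n the number of points and sigma_i the curve width for point i.

```latex
% Similarities in the original space (Gaussian, scaled per point, then symmetrized)
p_{j|i} = \frac{\exp\left(-\lVert x_i - x_j \rVert^2 / 2\sigma_i^2\right)}
               {\sum_{k \neq i} \exp\left(-\lVert x_i - x_k \rVert^2 / 2\sigma_i^2\right)},
\qquad
p_{ij} = \frac{p_{j|i} + p_{i|j}}{2n}

% Similarities in the embedding space (Student t-distribution with one degree of freedom)
q_{ij} = \frac{\left(1 + \lVert y_i - y_j \rVert^2\right)^{-1}}
              {\sum_{k \neq l} \left(1 + \lVert y_k - y_l \rVert^2\right)^{-1}}

% Mismatch between the two distributions (Kullback-Leibler divergence)
C = \mathrm{KL}(P \,\Vert\, Q) = \sum_{i} \sum_{j} p_{ij} \log \frac{p_{ij}}{q_{ij}}
```

The heavy-tailed Student t-distribution in the embedding space is where the "t" in t-SNE comes from.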

Fundamentally, this means that the t-SNE algorithm analyzes the initial data and then tries to represent it in fewer dimensions. t-SNE accomplishes this by minimizing the divergence between the two distributions. This minimization is computationally expensive, which leads to some significant restrictions. For example, it is not always recommended to apply t-SNE directly. With very high-dimensional data, we might need to apply another dimensionality reduction technique first (like PCA in the case of dense data or TruncatedSVD if we deal with sparse data) and only then use t-SNE.
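The two-step approach could look like the following sketch. It assumes scikit-learn is installed and uses its `PCA` and `TSNE` classes on synthetic data from `make_blobs`. The sizes (100 features, 30 intermediate dimensions) and `perplexity=30` are arbitrary choices for illustration, not recommendations.

```python
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# Synthetic stand-in for real data: 300 samples, 100 features, 3 clusters.
X, labels = make_blobs(n_samples=300, n_features=100, centers=3, random_state=0)

# Step 1: a cheap linear reduction (the data is dense, so PCA) to 30 dimensions.
X_reduced = PCA(n_components=30, random_state=0).fit_transform(X)

# Step 2: t-SNE on the reduced data, down to 2 dimensions for plotting.
embedding = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X_reduced)

print(X.shape, "->", X_reduced.shape, "->", embedding.shape)
```

For sparse input, `TruncatedSVD` from `sklearn.decomposition` could take the place of `PCA` in step 1. The resulting `embedding` array has one 2D coordinate per sample and can be passed to any scatter plot, colored by `labels`.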

Conclusion on t-SNE

t-Distributed Stochastic Neighbor Embedding (t-SNE) is one of the methods for dimensionality reduction. This non-linear technique is well suited to visualizing high-dimensional data sets. t-SNE is widely used in fields like natural language processing, speech processing and image processing. There are also other techniques to reduce dimensionality in data, including Singular Value Decomposition (SVD) and Principal Component Analysis (PCA).

Feel free to experiment with PCA and t-SNE here.

Further recommended readings:

Visualizing Data using t-SNE

How to Use t-SNE Effectively

Copyright © 2020 by Richard Siegel at siegel.work