
Word Embeddings

by Elvira Siegel
(Published: Fri Aug 16, 2019)

Part 2: Word2Vec (Skip Gram)

In the second part of Word Embeddings we will look at the downsides of the Word2Vec model (Skip Gram in particular) and at the techniques used to overcome them.

Read Part 1: Introduction to Word2Vec

The Skip Gram model has some downsides:
  1. The model tends to become a very large neural network. The larger the network, the slower each gradient descent update is, so training is slow.
  2. A big neural network has a large number of weights, so more data is needed to tune them and to avoid overfitting.
  3. With millions of weights (k) and billions of training samples (n), the total update work grows on the order of k*n, which is a huge number of computations.
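
To get a feeling for the sizes in the list above, here is a rough back-of-the-envelope calculation. The vocabulary size, embedding dimension and number of training pairs are assumed example values, not figures from this article.

```python
# Assumed example values (not taken from the article).
vocab_size = 10_000               # number of distinct words
embedding_dim = 300               # size of the hidden layer
training_pairs = 1_000_000_000    # (input word, context word) samples, n

# Skip Gram has one weight matrix from input to hidden layer
# and one from hidden layer to output.
weights_per_matrix = vocab_size * embedding_dim
total_weights = 2 * weights_per_matrix          # k

# If every training sample touched every weight, this would be
# the total number of single-weight updates (k * n).
naive_updates = total_weights * training_pairs

print(f"weights (k): {total_weights:,}")
print(f"naive weight updates (k * n): {naive_updates:,}")
```

Even with this fairly small assumed vocabulary the model already has millions of weights, which is why the techniques below try to touch only a fraction of them per sample.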

How can we overcome those obstacles?

We can:

  1. treat common word pairs/phrases as a "single word", so we reduce the number of samples
  2. use subsampling for frequent words
  3. use negative sampling
  4. use hierarchical softmax

Subsampling

In subsampling we thin out the most frequent words in the training text. The probability that these very frequent words are stop words (words like "the", "a", "for", "and") is very high. Such words appear in almost every sentence and contribute little to the meaning, so dropping them reduces the number of unhelpful training samples. In the original Word2Vec formulation the words are not deleted outright: each occurrence is discarded at random, with a probability that grows with the word's frequency.
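
The subsampling idea can be sketched in a few lines of Python. This is a minimal sketch that assumes the discard formula from the Word2Vec paper, P(drop) = 1 - sqrt(t / f(w)), where f(w) is the relative frequency of the word and t is a threshold. The function names and the toy sentence are made up for illustration, and the toy threshold is far larger than the value of about 1e-5 usually quoted for large corpora.

```python
import math
import random
from collections import Counter

def discard_probability(word_freq, threshold=1e-5):
    """Chance of dropping one occurrence of a word with relative frequency word_freq."""
    return max(0.0, 1.0 - math.sqrt(threshold / word_freq))

def subsample(tokens, threshold=1e-5, seed=0):
    """Return the tokens that survive frequency-based subsampling."""
    rng = random.Random(seed)
    counts = Counter(tokens)
    total = len(tokens)
    kept = []
    for token in tokens:
        p_drop = discard_probability(counts[token] / total, threshold)
        if rng.random() >= p_drop:      # keep the token with probability 1 - p_drop
            kept.append(token)
    return kept

tokens = "the cat sat on the mat and the dog sat on the rug".split()
# Toy threshold (assumption): the corpus is tiny, so every word is "frequent".
print(subsample(tokens, threshold=0.05))
```

Words whose relative frequency is below the threshold get a discard probability of 0 and are always kept, so only the frequent words are thinned out.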

Negative Sampling

As we said before, neural networks tend to have a large number of weights. All of those weights would need to be updated for each of the millions of training samples, which makes the computation time-consuming. Negative sampling is a technique that modifies only a small percentage of the weights per training sample instead of all of them, which makes the weight update much cheaper.

For each training pair, some "negative words" are chosen; these words are called "negative samples". The negative words are picked at random from the vocabulary, so they are unlikely to appear together with the input word. We want our network to output 0 for these unlikely word pairs and 1 for the true context word. Only the output weights that belong to the true context word and to the few negative samples are updated, together with the embedding of the input word.
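
As a minimal sketch of this idea, the following NumPy code performs one gradient step for a single (input word, context word) pair with two negative samples. The function name, the matrix names `W_in` and `W_out`, the learning rate and the word ids are assumptions made for this illustration, and the negative ids are assumed to be distinct from each other and from the context word.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def negative_sampling_step(W_in, W_out, center, context, negatives, lr=0.05):
    """One SGD step for one training pair; returns the loss before the update."""
    h = W_in[center].copy()                    # hidden vector = embedding of the input word
    ids = np.array([context] + list(negatives))
    labels = np.zeros(len(ids))
    labels[0] = 1.0                            # 1 for the true context word, 0 for negatives
    scores = sigmoid(W_out[ids] @ h)
    loss = -np.log(scores[0]) - np.sum(np.log(1.0 - scores[1:]))
    errors = scores - labels                   # gradient w.r.t. the pre-sigmoid scores
    grad_h = errors @ W_out[ids]               # computed before W_out is changed
    W_out[ids] -= lr * np.outer(errors, h)     # only len(negatives) + 1 output rows change
    W_in[center] -= lr * grad_h                # only one input row changes
    return loss

rng = np.random.default_rng(0)
vocab_size, dim = 10, 4
W_in = rng.normal(scale=0.1, size=(vocab_size, dim))
W_out = rng.normal(scale=0.1, size=(vocab_size, dim))

before = W_out.copy()
loss_first = negative_sampling_step(W_in, W_out, center=1, context=2, negatives=[5, 7])
changed_rows = np.where(np.any(W_out != before, axis=1))[0]
print(changed_rows)   # rows 2, 5 and 7
```

Out of the ten output rows in this toy setup only three are modified, which is the whole point of the technique: the cost of an update depends on the number of negative samples, not on the vocabulary size.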

The negative samples are usually drawn from a unigram distribution, where the probability of choosing a word as a negative sample is related to its frequency: more frequent words are more likely to be chosen. More frequent words tend to contribute less meaning, so it is reasonable for the network to learn to output 0 for them.
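
Drawing negative samples from a unigram distribution could look roughly like this. The sketch assumes the smoothing used in the Word2Vec paper, where the word counts are raised to the power 3/4 before normalizing; with `power=1.0` it becomes the plain unigram distribution. The function names and the toy sentence are hypothetical.

```python
import numpy as np
from collections import Counter

def negative_sampling_distribution(tokens, power=0.75):
    """Return the vocabulary and the probability of drawing each word as a negative."""
    counts = Counter(tokens)
    vocab = sorted(counts)
    weights = np.array([counts[w] for w in vocab], dtype=float) ** power
    return vocab, weights / weights.sum()

def draw_negatives(vocab, probs, exclude, k, rng):
    """Draw k negative words, skipping words in exclude (e.g. the true context word)."""
    negatives = []
    while len(negatives) < k:
        word = vocab[rng.choice(len(vocab), p=probs)]
        if word not in exclude:
            negatives.append(word)
    return negatives

tokens = "the cat sat on the mat and the dog sat on the rug".split()
vocab, probs = negative_sampling_distribution(tokens)
rng = np.random.default_rng(0)
print(dict(zip(vocab, probs.round(3))))
print(draw_negatives(vocab, probs, exclude={"cat"}, k=3, rng=rng))
```

The exponent below 1 flattens the distribution a little, so very frequent words such as "the" are still the most likely negatives, but rarer words get picked more often than their raw counts would suggest.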

Hierarchical Softmax

First of all: What is softmax?

Softmax is a popular activation function. It is suited to multi-class classification problems (multinomial logistic regression) and also covers binary classification as the two-class case. The softmax output is a probability distribution over the classes (a categorical distribution). The output vector shows which class is most likely to be the correct one. The individual probabilities in the softmax output vector must sum to 1.
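
A minimal NumPy sketch of softmax is shown here. The input scores are made-up example values, and subtracting the maximum score is a common numerical-stability trick that does not change the result.

```python
import numpy as np

def softmax(scores):
    """Turn a vector of raw scores into a probability distribution."""
    shifted = scores - np.max(scores)   # avoids overflow in exp, result is unchanged
    exps = np.exp(shifted)
    return exps / np.sum(exps)

probs = softmax(np.array([2.0, 1.0, 0.1]))
print(probs)         # roughly [0.659 0.242 0.099]
print(probs.sum())   # the probabilities sum to 1 (up to rounding)
```

The class with the highest score gets the highest probability, and every score has to be exponentiated and summed to normalize the result.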


In the Skip Gram model there is no activation function on the hidden layer neurons, and the output neurons use the softmax activation function.

An advantage of softmax at the output layer is that we can easily interpret the output of the neural network (a probability value between 0 and 1 for each class).

The problem with softmax is that it is computationally expensive: the normalization sums over all classes, which for Word2Vec means every word in the vocabulary.

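The formula looks like this:

$$\mathrm{softmax}(z)_i = \frac{e^{z_i}}{\sum_{j=1}^{V} e^{z_j}}$$

Here z is the vector of output scores and V is the number of classes (for Word2Vec, the size of the vocabulary).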

Now we know what a softmax function is. So what, then, is hierarchical softmax?

Hierarchical softmax is an optimization of the normal softmax. As we said before, the standard softmax has a time complexity problem: it sums over all words in the vocabulary, which makes every training step expensive. In hierarchical softmax we only need to update the parameters that lie on the path to the word of the present training example. Fewer updates mean that the next training sample can be processed sooner, so the overall processing time is reduced as well.

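Hierarchical softmax works differently: it splits the output probability into a chain of conditional distributions, each of them a simple binary (left or right) decision. The number of parameters involved in a single prediction is proportional to the logarithm of the vocabulary size. Hierarchical softmax is implemented using a binary tree, with the words of the vocabulary at the leaves.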

Hierarchical softmax reduces the cost per training example from O(V) to O(log V), where V is the size of the vocabulary.
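
The tree idea can be illustrated with a small NumPy sketch. It assumes a balanced binary tree stored in heap order (inner nodes 1 to V-1, leaves V to 2V-1) and random example vectors; Word2Vec itself builds a Huffman tree so that frequent words get shorter paths. All names here are made up for the illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def word_probability(word_id, hidden, node_vectors, vocab_size):
    """P(word | hidden) as a product of left/right decisions on the path to the word's leaf.

    Returns the probability and the number of decisions that were needed.
    """
    node = vocab_size + word_id            # leaves are numbered vocab_size .. 2*vocab_size - 1
    prob, steps = 1.0, 0
    while node > 1:                        # walk from the leaf up to the root (node 1)
        parent = node // 2
        p_right = sigmoid(node_vectors[parent] @ hidden)
        prob *= p_right if node % 2 == 1 else 1.0 - p_right
        node = parent
        steps += 1
    return prob, steps

rng = np.random.default_rng(0)
vocab_size, dim = 8, 4
hidden = rng.normal(size=dim)
node_vectors = rng.normal(size=(vocab_size, dim))   # row 0 is unused, rows 1..7 are inner nodes

results = [word_probability(w, hidden, node_vectors, vocab_size) for w in range(vocab_size)]
print(sum(p for p, _ in results))   # the leaf probabilities sum to 1 (up to rounding)
print(results[0][1])                # 3 decisions instead of 8 output scores
```

Because each inner node only splits its probability mass between its two children, the result is a valid distribution over all words without ever summing over the whole vocabulary.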


In this two-part article you've learned about representing words as vectors, word embeddings, n-grams and the Word2Vec model. In Word2Vec there are two submodels called Continuous Bag-of-Words (CBOW) and Skip Gram (SG). Now you know which optimization techniques exist for SG (Subsampling, Negative Sampling, Hierarchical Softmax) and what their concepts are.


Copyright © 2024 by Richard Siegel at siegel.work