
Word Embeddings

by Elvira Siegel
(Published: Fri Aug 16, 2019)

Part 1: Introduction to Word2Vec

Word embedding is a popular vocabulary representation model. Such a model is able to capture the context and semantics of a word in a document. So what is it exactly?


Word embeddings are vector representations of a particular word in a document or a corpus. Embeddings are dense and low-dimensional: they use far fewer dimensions than the vocabulary has words and can still accurately represent categories in the new space.

First of all, let's make some extra terminology clear.

We call two words semantically similar if they have a similar meaning, for example: "happy" - "cheerful", "proceed" - "begin", "flower" - "blossom", etc. We call two words semantically related if they have related meanings. Such words don't have to possess a common meaning, but they relate to some aspect of that meaning. So these words are not synonymous, but they are connected to each other through some common sense. For example: "car" - "highway", "spoon" - "fork", "jump" - "change", etc.

Two words that are semantically related probably tend to be written near each other in a text, while two words are semantically similar if they tend to be used synonymously. We can also say that words that appear in the same context often share the same semantic meaning.


Generally speaking, an approach to semantics that is based on the contexts of words in large corpora is called Distributional Semantics.

Representing words as vectors

Look at these sentences: "What a wonderful day" and "What a marvelous day". They actually have the same meaning. If we construct a vocabulary (we just call it V) for these two sentences we get:

V = {What, a, wonderful, marvelous, day}

We could try one-hot encoding: we use x dimensions for a vocabulary of x words and fill each vector with 0s everywhere except for a single 1 at the position of the word we want to represent.

Let's take our vocabulary V = {What, a, wonderful, marvelous, day} and represent the word "wonderful" as:

wonderful = [0, 0, 1, 0, 0]


The big problem with one-hot encoded representations is that all words are independent of each other and are seen without any context, and therefore without any semantic correlation. This model will not understand that words like "cat" and "dog" belong to the same category "animals/pets". That means that one-hot encoding doesn't place similar words closer to one another in the vector space.

Moreover, the size of such a representation grows with the number of words. As a result, it doesn't scale to a larger corpus: with a vocabulary of 100 million words, each word vector would hold 100 million numerical values, all but one of them zero.
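
To make the one-hot idea concrete, here is a minimal Python sketch. It assumes the vocabulary order given in the article; the helper names one_hot and dot are made up for this illustration. The dot product of two different one-hot vectors is always 0, which is exactly the "no semantic correlation" problem described above.

```python
# Vocabulary from the two example sentences, in the order used in the article.
vocabulary = ["What", "a", "wonderful", "marvelous", "day"]

def one_hot(word, vocab):
    """Return a list with a single 1 at the word's position in vocab."""
    vector = [0] * len(vocab)
    vector[vocab.index(word)] = 1
    return vector

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

wonderful = one_hot("wonderful", vocabulary)
marvelous = one_hot("marvelous", vocabulary)

print(wonderful)                  # [0, 0, 1, 0, 0]
print(marvelous)                  # [0, 0, 0, 1, 0]
print(dot(wonderful, marvelous))  # 0: the two synonyms share no component
print(dot(wonderful, wonderful))  # 1: a word only matches itself
```

Note that the vector length equals the vocabulary size, so every new word adds one more dimension to all vectors.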

The same problems appear with the Bag-Of-Words (BOW) model.

We use the Bag-Of-Words model to create a feature vector (word vector) when the number of features (words) is not known in advance, on the premise that the order of the features (words) is not important. Each word in this model is represented as a one-hot vector: a sparse vector of the vocabulary size, with a "1" in the entry that stands for the word and zeros in all other entries. The BOW feature vector of a text is the sum of the one-hot vectors of all its words, so it has a non-zero value for every word that occurred.
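
A small sketch of such a feature vector in Python, using the two example sentences from above (lowercased for simplicity). The function name bag_of_words is an assumption made for this example. Counting words with a Counter gives the same result as summing the one-hot vectors of all words in the text.

```python
from collections import Counter

def bag_of_words(tokens, vocab):
    """Count how often each vocabulary word occurs in tokens (order is ignored)."""
    counts = Counter(tokens)
    return [counts[word] for word in vocab]

sentence_a = "what a wonderful day".split()
sentence_b = "what a marvelous day".split()

# Build the vocabulary in order of first appearance.
vocab = list(dict.fromkeys(sentence_a + sentence_b))

print(vocab)                            # ['what', 'a', 'wonderful', 'day', 'marvelous']
print(bag_of_words(sentence_a, vocab))  # [1, 1, 1, 1, 0]
print(bag_of_words(sentence_b, vocab))  # [1, 1, 0, 1, 1]

# Shuffling the words of a sentence leaves its vector unchanged.
print(bag_of_words(["day", "a", "what", "wonderful"], vocab))  # [1, 1, 1, 1, 0]
```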


A classifier built on this model receives a list of labeled texts and computes word counts for each text, that is, how often each word occurs in the text or document. We then apply Bayes' Theorem to a new (unlabeled) text and its counts to estimate the probability of each label the text could belong to, based on those word frequencies.
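
The following Python sketch shows one way this can look: a tiny classifier that counts words per label and applies Bayes' Theorem to a new text. The training texts, the labels "spam" and "ham", and the function names are invented for illustration. The add-one smoothing is an extra assumption not mentioned above; it keeps unseen words from producing a probability of zero.

```python
import math
from collections import Counter, defaultdict

def train(labeled_texts):
    """Collect per-label word counts and label counts from (text, label) pairs."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    for text, label in labeled_texts:
        label_counts[label] += 1
        word_counts[label].update(text.lower().split())
    return word_counts, label_counts

def predict(text, word_counts, label_counts):
    """Return the label with the highest log-posterior for the given text."""
    vocab = {w for counts in word_counts.values() for w in counts}
    total_docs = sum(label_counts.values())
    scores = {}
    for label in label_counts:
        # Log prior: P(label)
        score = math.log(label_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            # Log likelihood P(word | label) with add-one smoothing (an assumption)
            score += math.log((word_counts[label][word] + 1) / (total_words + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

# Invented toy data, for illustration only.
training_data = [
    ("win money now", "spam"),
    ("cheap money offer", "spam"),
    ("meeting at noon", "ham"),
    ("lunch meeting tomorrow", "ham"),
]
word_counts, label_counts = train(training_data)

print(predict("money offer now", word_counts, label_counts))   # spam
print(predict("meeting tomorrow", word_counts, label_counts))  # ham
```

The classifier only ever sees counts: it treats every word as independent of the others, which is the limitation discussed next.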

The downside is that the BOW model doesn't consider the semantics of a word. Words like "car" and "automobile" are frequently used in the same context, which the model ignores. The context is lost because every word is treated as independent of the occurrence of any other word.

Encodings like one-hot and BOW can still be quite useful and are often used for tasks like email spam classification.

We can take another approach and use different numerical values for our vectors representing words.

It is possible to use multiple dimensions, where each dimension represents some meaning. The word's numerical weight on a dimension expresses how closely the word is related to that meaning. Therefore, the semantics of the word are embedded across the dimensions of the vector.
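
As a rough illustration, the Python sketch below uses hand-made 3-dimensional vectors. The dimension meanings and all numbers are invented for this example and are not taken from a trained model. Cosine similarity, a common way to compare such vectors, comes out higher for related words than for unrelated ones.

```python
import math

# Hypothetical embeddings; the dimensions loosely stand for
# ("animal", "vehicle", "positive feeling"). All values are invented.
embeddings = {
    "cat":        [0.9, 0.0, 0.3],
    "dog":        [0.8, 0.1, 0.4],
    "car":        [0.0, 0.9, 0.1],
    "automobile": [0.1, 0.9, 0.1],
}

def cosine_similarity(u, v):
    """Cosine of the angle between u and v: close to 1 means similar direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(cosine_similarity(embeddings["cat"], embeddings["dog"]))         # high
print(cosine_similarity(embeddings["car"], embeddings["automobile"]))  # high
print(cosine_similarity(embeddings["cat"], embeddings["car"]))         # low
```

In contrast to one-hot vectors, different words can now have a non-zero similarity.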


Word vectors are extremely powerful because they enable us to spot similarities across different words by plotting them in a vector space.

A very good question is: how are word vectors generated?

Some common ways of obtaining the numerical values for the vectors are:

  1. Word counts
  2. Context co-occurrences
  3. Predictions based on words from context (e.g. with the help of Word2Vec)

The main problem with the one-hot encoding and the BOW models is that they are not able to "learn": their vectors are fixed by the vocabulary and the counts, and no training adjusts them. With embeddings it's different: embeddings can be improved with the help of neural networks. The embeddings are parameters of the network, called weights. Those weights are adjusted to minimize the loss function on some given task. The resulting embedding vectors are category representations, where similar categories are placed near one another in the vector space.

One of the most interesting things about embeddings is that they can easily be used to visualize data, such as the relation of words to one another in a vector space. Because embeddings may have hundreds or even thousands of dimensions, a dimensionality reduction technique is required to bring them down to 2 or 3 dimensions so we can actually plot the data. A popular technique for dimensionality reduction is t-Distributed Stochastic Neighbor Embedding (t-SNE for short).


The Word2Vec model takes a text as input and returns word vectors as output.

As a first step, the model builds a vocabulary from the text, which is also our training data. After that, it learns vector representations of the words. The resulting word vectors represent words and can be used as features for many machine learning tasks. The model was originally developed by Tomas Mikolov and colleagues in 2013.

Word2Vec uses a technique often applied in machine learning: train a neural network with a single hidden layer. This neural network is trained to perform some task (in general this could be Sentiment Analysis, Sequence Tagging, Sequence Prediction, Named Entity Recognition (NER), etc.; in Word2Vec it is predicting words from their context). But we will not use that neural network for the task we trained it on. Instead, we want to learn the hidden layer weights. These weights are in fact the "word vectors" that our algorithm is trying to learn.
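
Why the hidden layer weights are the word vectors can be seen in a small Python sketch. It assumes a one-hot input and an embedding size of 3, and it uses random numbers in place of trained weights; all names here are made up for the example. Multiplying a one-hot vector by the weight matrix simply selects the row that belongs to the word.

```python
import random

vocabulary = ["What", "a", "wonderful", "marvelous", "day"]
embedding_size = 3  # assumed for the example; real models use far more dimensions

# Hidden-layer weight matrix: one row per vocabulary word.
# Random values stand in for weights that training would normally adjust.
random.seed(0)
hidden_weights = [[random.uniform(-1, 1) for _ in range(embedding_size)]
                  for _ in vocabulary]

def one_hot(word):
    """One-hot input vector for a word of the vocabulary."""
    return [1 if w == word else 0 for w in vocabulary]

def hidden_layer(input_vector):
    """Multiply a 1 x V input vector by the V x N weight matrix."""
    return [sum(x * row[j] for x, row in zip(input_vector, hidden_weights))
            for j in range(embedding_size)]

word = "wonderful"
activation = hidden_layer(one_hot(word))
row = hidden_weights[vocabulary.index(word)]

print(activation == row)  # True: the product just selects the word's row
```

So after training, looking up a word's vector is nothing more than reading the corresponding row of the weight matrix.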

There are two basic models in Word2Vec: CBOW (Continuous Bag-of-Words) and SG (Skip Gram).

In simple terms, CBOW tries to predict a word from a given context and SG tries to predict context from a given word.

CBOW predicts the most probable word. The problem with CBOW is that it overlooks rare words because their statistical occurrence in a text is small. If you have the context "Yesterday we had a [...] time together!", the CBOW model may suggest that the most probable words are "nice" or "good", but words like "marvelous" will be ignored because they don't appear often overall. The rare word is smoothed over by the many more examples of frequent words.

On the other hand Skip Gram predicts the context of a word. Given the word "marvelous" the model will analyze the context in a certain range and give us: "Yesterday we had a [...] time together!"


We can encode a word into a vector and look at the words before and after it within a certain range. These ranges are so-called "n-grams": an n-gram is a sequence of n items. There are different kinds of n-grams: unigrams, bigrams, trigrams, four-grams or even five-grams. A skip-gram relaxes an n-gram by dropping (skipping) parts of it.
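
A short Python sketch of n-grams over a tokenized sentence. The function name ngrams and the example sentence are chosen for illustration only.

```python
def ngrams(tokens, n):
    """Return all contiguous sequences of n tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "I drove on the highway".split()

print(ngrams(tokens, 1))  # unigrams: [('I',), ('drove',), ('on',), ('the',), ('highway',)]
print(ngrams(tokens, 2))  # bigrams: [('I', 'drove'), ('drove', 'on'), ('on', 'the'), ('the', 'highway')]
print(ngrams(tokens, 3))  # trigrams: [('I', 'drove', 'on'), ('drove', 'on', 'the'), ('on', 'the', 'highway')]
```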

In the Skip Gram Model this range is called the "window". The window size is a number: if our window size is 2, the algorithm looks at 2 words behind and 2 words ahead of the current word, 4 words in total.
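
The window can be sketched in Python as follows. The function name context_window is made up for this example, and the sentence is the one used earlier with "marvelous" filled in. Near the beginning or end of a sentence the window simply contains fewer words.

```python
def context_window(tokens, position, window_size):
    """Return up to window_size words before and after tokens[position]."""
    start = max(0, position - window_size)
    before = tokens[start:position]
    after = tokens[position + 1:position + 1 + window_size]
    return before + after

tokens = "Yesterday we had a marvelous time together".split()
center = tokens.index("marvelous")

print(context_window(tokens, center, 2))  # ['had', 'a', 'time', 'together']
print(context_window(tokens, 0, 2))       # ['we', 'had'] (nothing behind the first word)
```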


The Skip Gram Model learns statistics from the number of pair occurrences in the context. Take the sentence "I drove on the highway". CBOW receives a single task: predict 'drove' from the average of the context ['I', 'on', 'the', 'highway']. The Skip Gram network is instead presented with one example per word pair: predict 'I' from 'drove', predict 'on' from 'drove', predict 'the' from 'drove', predict 'highway' from 'drove'.
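
The word pairs for such an example can be enumerated with a few lines of Python. This is a sketch that assumes the example sentence is "I drove on the highway" and a window large enough to cover it. The function name skip_gram_pairs is made up, and the pairs are stored as (center word, context word); which element is the input and which the target depends on the direction of prediction.

```python
def skip_gram_pairs(tokens, window_size):
    """Return one (center, context) pair for every context word inside the window."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window_size)
        hi = min(len(tokens), i + window_size + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs

tokens = "I drove on the highway".split()
pairs = skip_gram_pairs(tokens, window_size=4)

print([p for p in pairs if p[0] == "drove"])
# [('drove', 'I'), ('drove', 'on'), ('drove', 'the'), ('drove', 'highway')]
```

One sentence thus yields many small training examples instead of one averaged example.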

The model learns from the word co-occurrence statistics. The network is going to gather more examples of e.g. ("bread", "butter") than of ("bread", "metal").
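
To see such statistics emerge, the Python sketch below counts word pairs within a window over a tiny corpus. The four sentences are invented for this illustration, and the function name pair_counts is an assumption as well.

```python
from collections import Counter

def pair_counts(sentences, window_size):
    """Count how often each (center, context) pair occurs within the window."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, center in enumerate(tokens):
            lo = max(0, i - window_size)
            hi = min(len(tokens), i + window_size + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[(center, tokens[j])] += 1
    return counts

# Invented mini corpus, for illustration only.
corpus = [
    "she spread butter on the bread",
    "bread and butter for breakfast",
    "fresh bread with butter",
    "the bridge is made of metal",
]
counts = pair_counts(corpus, window_size=4)

print(counts[("bread", "butter")])  # 3
print(counts[("bread", "metal")])   # 0
```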

The idea of the Word2Vec Model is easy to understand: it transforms the unlabeled raw data into labeled data. Then the model learns the word representation for an objective.

However, there are some downsides to the model, in particular to the Skip Gram ...

If you want to know more about Word2Vec, read Part 2 of this article: Word2Vec (Skip Gram).

Credits to McCormick, C. Word2Vec Tutorial - The Skip-Gram Model.

Copyright © 2020 by Richard Siegel at siegel.work