Word Embedding and Vector Space Models

Jiaqi (Karen) Fang · Published in Analytics Vidhya · Jun 13, 2021 · 10 min read

Vector space models capture semantic meaning and relationships between words. In this post, I’m going to talk about how to create word vectors that capture dependencies between words, then visualize their relationships in two dimensions using PCA.

Specifically, I’m going to cover:

  • An introduction to vector space models
  • Word-by-word and word-by-document designs
  • Euclidean distance
  • Cosine similarity
  • Visualization and PCA

Disclaimer: This post is based on week 3 of the Natural Language Processing with Classification and Vector Spaces course on Coursera. Credit for most of the figures below goes to the course and its copyright holders.

Check out my final project here: Click Here

1. Vector Space Models

There are many reasons why we want to learn vector space models. For one, they can be used to identify similarity for question answering, paraphrasing, and summarization. In the example below, the two questions on the left have different meanings, while the two questions on the right have similar meanings.

image from week 3 of Natural Language Processing with Classification and Vector Spaces course

Vector space models will also allow you to capture dependencies between words. In the following two examples, you can see the word “cereal” and the word “bowl” are related. Similarly, you can see “sell” depends on “buy”. With vector space models, you will be able to capture this and many other types of relationships among different sets of words.

  • You eat cereal from a bowl
  • You buy something and someone else sells it

Vector space models have many applications, such as:

  • Information Extraction — to answer questions in the style of who, what, where, how, etc.
  • Machine Translation
  • Chatbot programming
  • and more
image from week 3 of Natural Language Processing with Classification and Vector Spaces course

When using vector space models, representations are built by identifying the context around each word in the text. In summary, vector space models allow you to represent words and documents as vectors that capture their relative meaning.

image from week 3 of Natural Language Processing with Classification and Vector Spaces course

2. Word by Word and Word by Document

a. Word by Word Design

Concept: The co-occurrence of two different words is the number of times that they appear together in the corpus within a certain word distance k.

For example, suppose we have two sentences in our corpus:

  • I like simple data
  • I prefer simple raw data

Let’s build a co-occurrence matrix for words that occur together within a certain distance k = 2. The row for “data” in the co-occurrence matrix would then be:

image from week 3 of Natural Language Processing with Classification and Vector Spaces course

With a word-by-word design, each word gets a vector representation with n entries, where n is between one and the size of the entire vocabulary.
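
To make this concrete, here is a minimal Python sketch (not from the course) that builds these co-occurrence counts for the two example sentences with a window of k = 2; the tokenization and variable names are my own.

```python
from collections import defaultdict

corpus = ["I like simple data", "I prefer simple raw data"]
k = 2  # window size

# co_occurrence[(w1, w2)] = times w2 appears within k words of w1
co_occurrence = defaultdict(int)
for sentence in corpus:
    tokens = sentence.lower().split()
    for i, word in enumerate(tokens):
        window = tokens[max(0, i - k):i] + tokens[i + 1:i + 1 + k]
        for neighbor in window:
            co_occurrence[(word, neighbor)] += 1

# Row for "data": {'like': 1, 'simple': 2, 'raw': 1} ("i" never falls inside the window)
print({w2: c for (w1, w2), c in co_occurrence.items() if w1 == "data"})
```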

b. Word by Document Design

Concept: Count the number of times that words from our vocabulary appear in documents that belong to specific categories.

For example,

  1. We could have a corpus consisting of documents on different topics, such as entertainment, economy, and machine learning.
  2. We then count the number of times that each word appears in the documents belonging to each of the three categories.
image from week 3 of Natural Language Processing with Classification and Vector Spaces course
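
As a sketch of this design (the documents, categories, and counts below are made up for illustration), you could tally how often each vocabulary word appears in each category’s documents:

```python
# Hypothetical labeled corpus: (category, document text)
documents = [
    ("entertainment", "the film was fun and the film score was great"),
    ("economy", "market data and more data on trade"),
    ("machine learning", "training data and test data for a film classifier"),
]
vocabulary = ["data", "film"]

# counts[category][word] = occurrences of the word in that category
counts = {category: {word: 0 for word in vocabulary} for category, _ in documents}
for category, text in documents:
    for token in text.lower().split():
        if token in counts[category]:
            counts[category][token] += 1

print(counts)
# {'entertainment': {'data': 0, 'film': 2},
#  'economy': {'data': 2, 'film': 0},
#  'machine learning': {'data': 2, 'film': 1}}
```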

c. Vector Space

Once we’ve constructed the representations for multiple sets of documents or words using either word by word or word by document design, we’ll get our vector space.

image from week 3 of Natural Language Processing with Classification and Vector Spaces course

After we construct the vector space, it’s easy to see that the economy and machine learning documents are much more similar to each other than either is to the entertainment category.

We can make comparisons between vector representations using the cosine similarity and the Euclidean distance in order to get the angle and distance between them.

3. Euclidean Distance

Concept: The Euclidean distance between two points (or vectors) is the length of the straight line segment connecting them.

image from week 3 of Natural Language Processing with Classification and Vector Spaces course

Euclidean Distance for N-Dimensional Vectors

The approach to calculating the Euclidean distance for n-dimensional vectors is similar to the two-dimensional calculation shown in the example above.

  • First, get the difference between the vectors in each dimension.
  • Square the differences, sum them up, then take the square root of the result.
  • The formula is just the norm of the difference between the vectors we are comparing: d(v, w) = ‖v − w‖ = √(Σᵢ (vᵢ − wᵢ)²) (see the sketch below).
image from week 3 of Natural Language Processing with Classification and Vector Spaces course
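
Here is a quick NumPy sketch of that computation; the two vectors are made-up examples.

```python
import numpy as np

# Hypothetical n-dimensional vectors (e.g., word counts)
v = np.array([1, 6, 8])
w = np.array([0, 4, 6])

# Difference in each dimension, squared, summed, then square-rooted
distance = np.sqrt(np.sum((v - w) ** 2))

# Equivalently, the norm of the difference vector
assert np.isclose(distance, np.linalg.norm(v - w))
print(distance)  # 3.0
```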

In summary:

  • The Euclidean distance is a straight line between points
  • To calculate the euclidean distance, we calculate the norm of the difference between the vectors that we are comparing.
  • By using this metric, we can get a sense of how similar two documents or words are.

4. Cosine Similarity

One issue with Euclidean distance is that it is sensitive to the overall size of the representations being compared, so it is not always the similarity metric we want. For example, when comparing large documents to smaller ones with Euclidean distance, we can get misleading results. In the following example, the word totals of the corpora differ: the agriculture and history corpora have a similar number of words, while the food corpus has relatively few.

  • If we use Euclidean distance as the metric to find similarity, we would say agriculture is similar to history, but not to food.
  • Intuitively, agriculture and food are more similar than agriculture and history. However, the food corpus is much smaller than the agriculture corpus, so even though the history and agriculture corpora cover different topics, they end up with a smaller Euclidean distance.
image from week 3 of Natural Language Processing with Classification and Vector Spaces course

To solve this problem, we look at the cosine between the vectors. The main advantage of the cosine similarity metric over the euclidean distance is that it isn’t biased by the size difference between the representations.

Cosine Similarity Formula

Before we get into cosine similarity, there are two quantities we want to review:

  • The norm of a vector is defined as: ‖v‖ = √(Σᵢ vᵢ²)
  • The dot product is defined as: v · w = Σᵢ vᵢ wᵢ
  • The cosine similarity equation is then: cos(β) = (v · w) / (‖v‖ ‖w‖)
image from week 3 of Natural Language Processing with Classification and Vector Spaces course
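
Here is a minimal NumPy sketch of these three quantities; the corpus vectors are hypothetical counts chosen to mirror the agriculture/food/history example above.

```python
import numpy as np

def cosine_similarity(v, w):
    """cos(beta) = dot(v, w) / (||v|| * ||w||)"""
    return np.dot(v, w) / (np.linalg.norm(v) * np.linalg.norm(w))

# Hypothetical word counts on two dimensions for each corpus
agriculture = np.array([20, 40])
food = np.array([5, 15])
history = np.array([30, 20])

print(cosine_similarity(agriculture, food))     # ~0.99: very similar direction
print(cosine_similarity(agriculture, history))  # ~0.87: directions differ more
```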

Cosine Similarity Interpretation:

  • The more similar two vectors are, the closer the cosine is to 1
  • If two vectors are orthogonal, the cosine is 0
image from week 3 of Natural Language Processing with Classification and Vector Spaces course

In summary:

  • The cosine similarity metric depends on the angle between the vectors you are comparing, not on their sizes.
  • For vectors with non-negative entries, such as word counts, the cosine similarity takes values between 0 and 1.

Manipulating Words in Vector Spaces

So far we’ve talked about how to structure vector spaces to represent text and which metrics we can use to measure text similarity. Now let’s use word vectors to extract patterns and identify structure in our text.

Example: Suppose we have a vector space with countries and their capital cities. We know the capital of the United States is Washington DC, but we don’t know the capital of Russia. We’d like to use the known relationship between Washington DC and the USA to figure it out.

image from week 3 of Natural Language Processing with Classification and Vector Spaces course

Step 1: Set up the vector space.

For this example, we are in a hypothetical two-dimensional vector space that has different representations for different countries and capitals.

image from week 3 of Natural Language Processing with Classification and Vector Spaces course

Step 2: Find the relationship between the Washington DC and USA vector representations.

In other words, what vector connects the USA to Washington DC? To find it, take the difference between the two vectors. Its values tell us how many units we should move along each dimension to get from a country to its capital in this vector space.

image from week 3 of Natural Language Processing with Classification and Vector Spaces course

Step 3: Add that difference to Russia’s vector to predict the location of its capital. In this example, the predicted vector is [10, 4].

image from week 3 of Natural Language Processing with Classification and Vector Spaces course

Step 4: Find the vector in our data that is most similar to the vector we deduced in step 3.

We can do so by comparing the deduced vector to each vector in our data using Euclidean distance or cosine similarity. In this example, the most similar vector is Moscow.

image from week 3 of Natural Language Processing with Classification and Vector Spaces course

The only catch here is that you need a vector space where the representations capture the relative meaning of words.
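
Here is a minimal sketch of the four steps with hypothetical 2-D coordinates (made up to match the walkthrough above, not taken from a real trained embedding):

```python
import numpy as np

# Hypothetical 2-D word vectors
vectors = {
    "USA": np.array([5.0, 6.0]),
    "Washington": np.array([10.0, 5.0]),
    "Russia": np.array([5.0, 5.0]),
    "Moscow": np.array([9.0, 3.0]),
    "Japan": np.array([4.0, 3.0]),
    "Tokyo": np.array([8.5, 2.0]),
}

# Step 2: relationship between a known capital and its country
relation = vectors["Washington"] - vectors["USA"]   # [5, -1]

# Step 3: apply the same shift to Russia to predict where its capital should be
prediction = vectors["Russia"] + relation           # [10, 4]

# Step 4: find the closest known vector, excluding the query words themselves
# (standard practice in analogy search)
candidates = {w: v for w, v in vectors.items()
              if w not in ("USA", "Washington", "Russia")}
closest = min(candidates, key=lambda w: np.linalg.norm(candidates[w] - prediction))
print(closest)  # Moscow
```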

5. Visualization and PCA

Intuition: It’s often the case that we end up with vectors in very high dimensions. If we want to reduce these vectors to two dimensions so we can plot them on an XY plane, Principal Component Analysis (PCA) can help us do so.

Definition: Principal component analysis is an unsupervised learning algorithm that can be used to reduce the dimension of our data, which in turn allows us to visualize it. It does so by finding uncorrelated directions that retain as much of the data’s variance as possible.

Visualization of word vectors

  • Let’s say we have the representation of our words in the vector space below.
  • The dimension of the vector space is higher than two.
  • We know that the words oil and gas are related, as are city and town, and we want to see whether those relationships are captured by the representation of our words.

PCA is a perfect choice for this task.

image from week 3 of Natural Language Processing with Classification and Vector Spaces course

When we have a representation of our words in a high-dimensional space, we can use an algorithm like PCA to get a representation in a vector space with fewer dimensions. If we want to visualize our data, we can get a reduced representation with three or fewer features.

image from week 3 of Natural Language Processing with Classification and Vector Spaces course

For example, after we reduce the dimension to d = 2 and plot the words, we might get a plot like the one below. We can see that city and town are clustered together, while gas and oil form another cluster. We may even find other relationships among our words that we didn’t expect.

Note that words with similar part-of-speech (POS) tags end up next to one another. This is because many training algorithms learn word representations from the neighboring words, so words with similar POS tags tend to appear in similar locations. An interesting insight is that synonyms and antonyms also tend to be found near each other in the plot.

image from week 3 of Natural Language Processing with Classification and Vector Spaces course

In Summary:

  • PCA is an algorithm used for dimensionality reduction that can find uncorrelated features for your data.
  • It’s very helpful for visualizing our data to check if our representation is capturing relationships among words.
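
As a sketch of that workflow with scikit-learn (the words and their high-dimensional vectors here are random stand-ins rather than real trained embeddings, so don’t expect meaningful clusters from this exact data):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Stand-in embeddings; in practice these come from a trained model
words = ["oil", "gas", "city", "town", "happy", "sad"]
embeddings = np.random.default_rng(0).normal(size=(len(words), 50))

# Reduce the 50-dimensional vectors to 2 dimensions for plotting
pca = PCA(n_components=2)
reduced = pca.fit_transform(embeddings)

plt.scatter(reduced[:, 0], reduced[:, 1])
for word, (x, y) in zip(words, reduced):
    plt.annotate(word, (x, y))
plt.show()
```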

PCA Algorithm

Principal Component Analysis — High Level Summary

At a high level, let’s say we have a two-dimensional vector space and we want to represent it with one feature instead. Using PCA:

  • Begin with our original vector space
  • Find a set of uncorrelated features
  • Project the data onto the desired number of features while retaining as much information as possible.
image from week 3 of Natural Language Processing with Classification and Vector Spaces course

Before we talk about the PCA algorithm in detail, we need to introduce two definitions:

  • Eigenvectors — the uncorrelated features of our data. The eigenvectors of the covariance matrix of our data give the directions of these uncorrelated features.
  • Eigenvalues — the amount of information retained by each feature. Each eigenvalue is the variance of our data along the corresponding new feature.

PCA Algorithm in Details

Step 1: Get a set of uncorrelated features.

  • Mean-normalize the data
  • Compute the covariance matrix
  • Perform a singular value decomposition (SVD) to get a set of three matrices. The first matrix contains the eigenvectors and the second contains the eigenvalues
image from week 3 of Natural Language Processing with Classification and Vector Spaces course

Step 2: Project the data onto a new set of features. To do that, we use the eigenvectors (U) and eigenvalues (S).

  • Take the dot product between the matrix containing the word embeddings and the first n columns of the eigenvector matrix U, where n is the number of dimensions we want to keep. The result is our data in the new n-dimensional vector space.
  • Compute the percentage of variance retained in the new vector space (see the sketch below).

As an important side note, the eigenvectors and eigenvalues should be organized according to the eigenvalues in descending order. This condition will ensure that we retain as much information as possible from our original embedding.

image from week 3 of Natural Language Processing with Classification and Vector Spaces course
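
Here is a minimal NumPy sketch of both steps (mean-normalize, covariance, SVD, project onto the first n components, and report the variance retained); the input matrix is random, just to show the mechanics:

```python
import numpy as np

def pca(X, n_components=2):
    """Reduce X (rows = word vectors) to n_components uncorrelated features."""
    # Step 1: mean-normalize the data and get the covariance matrix
    X_centered = X - X.mean(axis=0)
    covariance = np.cov(X_centered, rowvar=False)

    # SVD of the covariance matrix: U holds the eigenvectors and
    # S the eigenvalues, already sorted in descending order
    U, S, _ = np.linalg.svd(covariance)

    # Step 2: project onto the first n_components eigenvectors
    X_reduced = np.dot(X_centered, U[:, :n_components])

    # Percentage of variance retained in the new vector space
    retained = S[:n_components].sum() / S.sum()
    return X_reduced, retained

X = np.random.default_rng(0).normal(size=(10, 5))   # 10 word vectors, 5 dimensions
X_2d, variance_retained = pca(X, n_components=2)
print(X_2d.shape, round(variance_retained, 3))
```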

In Summary:

  • Eigenvectors from the covariance matrix of our normalized data give the directions of uncorrelated features.
  • The eigenvalues associated with those eigenvectors tell us the variance of our data on those features.
  • The dot products between our word embeddings and the matrix of eigenvectors will project our data onto a new vector space of the dimension we choose.

That’s it! We’ve covered how vector space models capture semantic meaning and relationships between words. You also learned how to create word vectors that capture dependencies between words, and how to use PCA to visualize word relationships in lower dimensions.

As always, please feel free to check out my final project on how to put all of this together in code (Click Link).

Hope you enjoyed the read! :)
