Parts-of-Speech (POS) and Viterbi Algorithm

Jiaqi (Karen) Fang · Published in Analytics Vidhya · Oct 4, 2020 · 10 min read


Language is built on grammar. Parts of speech matter because they show us how words relate to each other. Knowing whether a word is a noun or a verb tells us a lot about its likely neighbors and about the syntactic structure around it. Parts-of-speech tagging is therefore an important building block in NLP applications.

In this post, I’m going to talk about:

  • What part-of-speech (POS) tagging is
  • What Markov chain models are
  • Markov chain models and POS tagging
  • Hidden Markov models and how to calculate transition and emission probabilities
  • How to use the Viterbi algorithm with a hidden Markov model to create part-of-speech tags for a given text corpus

Disclaimer: This post is based on week 2 of the Natural Language Processing with Probabilistic Models course by deeplearning.ai on Coursera. Most of the images used in this post are from the course material.

Check out my final project here: Click Link

Part 1: What is part-of-speech (POS) tagging?

Part of speech refers to the category of words, or lexical terms, in a language. Examples of these lexical terms in English are noun, verb, adjective, adverb, pronoun, and preposition.

The process of assigning these tags to the words of a sentence or corpus is referred to as part-of-speech tagging, or POS tagging for short. Because POS tags describe the characteristic structure of lexical terms in a sentence or text, you can use them to make assumptions about the semantics.
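To make this concrete, here is a quick illustration that is not part of the original post: tagging a short sentence with NLTK's off-the-shelf tagger. It assumes nltk is installed and that its tokenizer and tagger resources can be downloaded.

```python
# Quick POS tagging illustration with NLTK (assumes nltk is installed).
import nltk

nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

tokens = nltk.word_tokenize("Why not learn something new today?")
print(nltk.pos_tag(tokens))
# e.g. [('Why', 'WRB'), ('not', 'RB'), ('learn', 'VB'), ('something', 'NN'),
#       ('new', 'JJ'), ('today', 'NN'), ('?', '.')]
```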

Part 2: What are Markov Chains?

In POS tagging, the idea is that the part-of-speech tag of the next word in a sentence tends to depend on the part-of-speech tag of the previous word. These are known as part-of-speech dependencies.

image from week 2 of Natural Language Processing with Probabilistic Models course

Because of these part-of-speech dependencies, we can apply a probability model to estimate the POS tag of the next word given the previous one. One such model is the Markov chain.

  • They’re a type of stochastic model that describes a sequence of possible events.
  • To get the probability of each event, it needs only the state of the previous event.
  • The word stochastic just means random or randomness, so a stochastic model describes processes that have a random component to them.
  • A Markov chain can be depicted as a directed graph.
  • The circles of the graph represent the states of our model. A state refers to a certain condition of the present moment.
image from week 2 of Natural Language Processing with Probabilistic Models course

Part 3: Markov Chain Models and POS tagging

In NLP, we can think of POS tags as states in the Markov chain model. The edges of the graph have weights, or transition probabilities, associated with them, which define the probability of going from one state to another.

  • The Markov property states that the probability of the next event depends only on the current event.
  • The Markov property keeps the model simple: all you need to determine the next state is the current state. It doesn’t need information from any of the previous states.

We can use a matrix to represent a Markov chain model in NLP. This matrix is called the transition matrix.

  • A transition matrix is an equivalent but more compact table representation of a Markov chain model.
  • It stores the states and transition probabilities.
  • A transition matrix is an (n x n) matrix, with n being the number of states in the graph.
  • The rows represent the current state (POS tag); the columns represent the possible states that might come next.
  • Each row in the matrix holds the transition probabilities from one state to all other states.
  • For a given state, the sum of these transition probabilities should always be one. In other words, all of the transition probabilities in each row of a transition matrix should add up to one (a small sketch after the figure below illustrates this).
image from week 2 of Natural Language Processing with Probabilistic Models course
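As a minimal sketch (not from the course material), here is a three-state transition matrix with made-up tags and probabilities: each row sums to one, and the next tag is sampled using only the current tag, as the Markov property requires.

```python
# A toy 3-state transition matrix; rows are current tags, columns are next tags.
import numpy as np

tags = ["NN", "VB", "O"]
A = np.array([
    [0.2, 0.2, 0.6],   # P(next tag | current = NN)
    [0.4, 0.3, 0.3],   # P(next tag | current = VB)
    [0.2, 0.3, 0.5],   # P(next tag | current = O)
])

assert np.allclose(A.sum(axis=1), 1.0)  # every row of a transition matrix sums to 1

# Sample the next tag given only the current one (the Markov property):
rng = np.random.default_rng(0)
current = tags.index("NN")
print(rng.choice(tags, p=A[current]))
```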

In the current setup, the model doesn’t handle the case where there is no previous word, as at the beginning of a sentence. To handle this, we can include an initial state, after which the transition matrix has dimensions (n+1, n). For example, see below:

image from week 2 of Natural Language Processing with Probabilistic Models course

In Summary:

  • A Markov chain consists of a set of n states, from q1 all the way to qn.
  • The transition matrix has dimensions (n+1, n), with the initial probabilities in the first row (a short sketch below illustrates this).
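Continuing the same made-up example, the initial probabilities can be stacked on top of the transition matrix as an extra row, which gives the (n+1, n) matrix described in the summary.

```python
# Sketch: add an initial-probability row on top of the n x n transition matrix.
import numpy as np

A = np.array([
    [0.2, 0.2, 0.6],
    [0.4, 0.3, 0.3],
    [0.2, 0.3, 0.5],
])
initial = np.array([[0.4, 0.1, 0.5]])   # P(first tag | start of sentence), made-up numbers
A_with_start = np.vstack([initial, A])  # shape (n + 1, n)
print(A_with_start.shape)               # (4, 3)
```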

Part 4: Hidden Markov Models

A hidden Markov model implies that the states are hidden, or not directly observable. In this case, the hidden states are the parts of speech of the words.

In addition to the transition probabilities of a Markov chain model, a hidden Markov model has additional probabilities known as emission probabilities.

  • Emission probabilities describe the probability of going from a hidden part-of-speech state to a word of your corpus.
  • Each row is designated for one of the hidden states. A column is designated for each of the observables, or words.
  • Each row sums up to 1.
image from week 2 of Natural Language Processing with Probabilistic Models course

At a high level, a hidden Markov model consists of a set of N states. There are two important matrices:

  • The transition matrix A has dimensions N by N.
  • The emission matrix B has dimensions N by V.
  • N stands for the number of tags, and V stands for the number of words in the vocabulary (a short sketch of the shapes follows).
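As a tiny sketch of the shapes involved (illustrative values only, not from the course material):

```python
# Shapes of the two HMM matrices for N tags and a vocabulary of V words.
import numpy as np

N, V = 3, 5
A = np.full((N, N), 1.0 / N)   # transition matrix: tag -> tag
B = np.full((N, V), 1.0 / V)   # emission matrix:   tag -> word
print(A.shape, B.shape)        # (3, 3) (3, 5)
```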

Part 5: Calculate the transition and emission probabilities in Hidden Markov Models

In order to calculate the correct transition probabilities, we need to do some preparation of the corpus, such as (a short code sketch follows this list):

  • Add a start token to each line or sentence, so that the initial probabilities can be calculated.
  • Transform all words in the corpus to lowercase, so the model becomes case insensitive.
  • Take out punctuation and special characters that don’t carry meaning, remove stop words, etc.
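Here is a minimal preprocessing sketch under these assumptions. The start token "--s--" follows the course's convention, but the helper function and the exact cleaning rules are mine, not necessarily the course's implementation.

```python
# Minimal corpus preparation: start token, lowercasing, light punctuation removal.
import re

def preprocess(sentences):
    processed = []
    for sentence in sentences:
        tokens = ["--s--"]  # start token, used later for the initial probabilities
        for word in sentence.split():
            word = re.sub(r"[^\w']+", "", word.lower())  # drop punctuation/special characters
            if word:
                tokens.append(word)
        processed.append(tokens)
    return processed

print(preprocess(["Why not learn something new today?"]))
# [['--s--', 'why', 'not', 'learn', 'something', 'new', 'today']]
```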

Transition Probabilities:

  • To calculate the transition probabilities, you actually only use the part-of-speech tags from your training corpus.
  • transition probability = # of (previous tag, current tag) pairs / # of pairs starting with the previous tag
  • For example, to calculate the transition probability of the blue tag being followed by the purple tag, we count 2 occurrences of (blue, purple) and 3 pairs starting with the blue tag. The transition probability of blue followed by purple is therefore 2/3 (a code sketch follows the figure below).
image from week 2 of Natural Language Processing with Probabilistic Models course
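The counting logic can be sketched as follows. The toy tag sequence is mine, chosen so that the blue-to-purple probability comes out to 2/3 as in the figure.

```python
# Count (previous tag, current tag) pairs and turn them into a transition probability.
from collections import defaultdict

tag_sequence = ["blue", "purple", "blue", "purple", "other", "blue", "other"]

pair_counts = defaultdict(int)   # counts of (previous tag, current tag)
prev_counts = defaultdict(int)   # counts of pairs starting with a given previous tag
for prev_tag, tag in zip(tag_sequence, tag_sequence[1:]):
    pair_counts[(prev_tag, tag)] += 1
    prev_counts[prev_tag] += 1

print(pair_counts[("blue", "purple")] / prev_counts["blue"])  # 0.666... = 2/3
```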

Notes about the transition matrix:

  • The rows in the matrix represent the current states.
  • The columns represent the next states.
  • The values are the transition probabilities of going from the current state to the next state. The states for this use case are the part-of-speech tags.
image from week 2 of Natural Language Processing with Probabilistic Models course

Caution: there are two problems to address before we apply the formula to calculate the transition matrix:

  • One, shown in the previous image, is that the row sum of the VB tag is zero, which would lead to a division by zero in this formula.
  • The other is that a lot of entries in the transition matrix are zero, meaning those transitions will have probability zero. This won’t work if you want the model to generalize to other corpora, which might actually contain those verbs.

To handle this, we can apply smoothing: add a small value epsilon to each of the counts in the numerator, and add N times epsilon to the denominator, so that each row still sums to one.

image from week 2 of Natural Language Processing with Probabilistic Models course
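A small sketch of the smoothing step with made-up counts; the all-zero row mirrors the VB row mentioned above.

```python
# Smoothing: add epsilon to every count and N * epsilon to each row total,
# so rows still sum to one and no transition probability is exactly zero.
import numpy as np

counts = np.array([
    [6.0, 2.0, 0.0],
    [0.0, 0.0, 0.0],   # an all-zero row (like the VB row) would otherwise divide by zero
    [1.0, 3.0, 4.0],
])
N = counts.shape[1]    # number of tags
epsilon = 0.001

A = (counts + epsilon) / (counts.sum(axis=1, keepdims=True) + N * epsilon)
print(A.sum(axis=1))   # every row sums to 1
```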

However, in a real-world example, you might not want to apply smoothing to the initial probabilities in the first row of the transition matrix. If you apply smoothing to that row by adding a small value to possibly zero-valued entries, you effectively allow a sentence to start with any part-of-speech tag, including punctuation.

In summary, below are the steps to calculate the transition probabilities:

  • Count the occurrences of each tag pair: C(t_{i-1}, t_i)
  • Calculate the probabilities from the counts (with smoothing): P(t_i | t_{i-1}) = (C(t_{i-1}, t_i) + ε) / (Σ_j C(t_{i-1}, t_j) + N·ε)

Emission Probabilities

  • Emission probabilities describe the probability of going from a hidden part-of-speech state to a word of your corpus. In other words, from a POS tag to a word.
  • Emission probabilities with smoothing: P(w_i | t_i) = (C(t_i, w_i) + ε) / (C(t_i) + V·ε), where C(t_i, w_i) is how often tag t_i emits word w_i, C(t_i) is the total count of tag t_i, and V is the vocabulary size (a short sketch follows the figure below).
image from week 2 of Natural Language Processing with Probabilistic Models course
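A minimal sketch of the emission matrix with the same smoothing idea, assuming you already have (tag, word) counts from a tagged corpus; the tags, words, and counts are made up.

```python
# Emission matrix with smoothing: rows are tags, columns are words in the vocabulary.
import numpy as np

tags = ["NN", "VB"]
vocab = ["learn", "something", "back"]

emission_counts = np.array([
    [0.0, 8.0, 3.0],   # how often NN emits each word
    [5.0, 0.0, 1.0],   # how often VB emits each word
])
V = len(vocab)         # vocabulary size
epsilon = 0.001

B = (emission_counts + epsilon) / (emission_counts.sum(axis=1, keepdims=True) + V * epsilon)
print(B.sum(axis=1))   # each row sums to 1
```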

Part 6: The Viterbi Algorithm

Given a sentence, we can use the Viterbi algorithm to compute the most likely sequence of part-of-speech tags.

Viterbi Algorithm Overview

Starting with a leading start token, you want to find the sequence of hidden states, or part-of-speech tags, that has the highest probability for the given sequence of words.

image from week 2 of Natural Language Processing with Probabilistic Models course

The Viterbi algorithm evaluates all the possible paths for a given sentence in order to find the most likely sequence of hidden states. It uses the matrix representation of the hidden Markov model. The algorithm can be split into three steps:

  • Initialization step
  • Forward pass
  • Backward pass

It uses the transition and emission probabilities from the hidden Markov model to calculate two matrices. The matrix C (best_probs) holds the intermediate optimal probabilities, and the matrix D (best_paths) holds the indices of the visited states.

  • These two matrices have n rows, where n is the number of part-of-speech tags, or hidden states, in the model.
  • They have K columns, where K is the number of words in the given sequence.
image from week 2 of Natural Language Processing with Probabilistic Models course

Viterbi Initialization

In the initialization step, the first column of each of the C and D matrices is populated (a code sketch for the whole step follows at the end of this subsection).

First column in C:

  • The first column of C represents the probability of transitioning from the start state to the first tag t_i and generating the first word w1. In other words, we are trying to go from tag i to the word w1.
  • Formula: C[i, 1] = a_(1, i) * b_(i, cindex(w1))
  • where a_(1, i) is the transition probability from the start state to tag i, and b_(i, cindex(w1)) is the emission probability from tag i to word w1.
image from week 2 of Natural Language Processing with Probabilistic Models course

First column in D matrix:

  • In the D matrix, you store the labels that represent the different states you’re traversing when finding the most likely sequence of part-of-speech tags for the given sequence of words, w1 all the way to wK.
  • In the first column, you simply set all entries to zero, as there are no preceding part-of-speech tags we have traversed yet.
image from week 2 of Natural Language Processing with Probabilistic Models course
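Putting the initialization step together, here is a sketch with made-up numbers. A includes the start-state row at index 0, B is the emission matrix, and vocab maps words to column indices of B; all names are mine, not the course's.

```python
# Viterbi initialization: fill the first column of C and D.
import numpy as np

tags = ["NN", "VB", "O"]
words = ["why", "not", "learn"]
vocab = {w: i for i, w in enumerate(words)}

A = np.array([             # (n + 1, n) transition matrix; row 0 is the start state
    [0.4, 0.1, 0.5],
    [0.2, 0.2, 0.6],
    [0.4, 0.3, 0.3],
    [0.2, 0.3, 0.5],
])
B = np.array([             # (n, V) emission matrix
    [0.1, 0.3, 0.6],
    [0.5, 0.2, 0.3],
    [0.4, 0.4, 0.2],
])

n, K = len(tags), len(words)
C = np.zeros((n, K))                       # best_probs
D = np.zeros((n, K), dtype=int)            # best_paths

C[:, 0] = A[0, :] * B[:, vocab[words[0]]]  # start-state transition times first-word emission
D[:, 0] = 0                                # no preceding tags yet
print(C[:, 0])
```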

Viterbi Forward Pass

After the matrices C and D have been initialized, all their remaining entries are populated column by column during the forward pass.

C matrix formula:

C[i, j] = max_k ( C[k, j-1] * a_(k, i) * b_(i, cindex(wj)) )

where the first element is the probability of the best preceding path you’ve traversed, the second element is the transition probability from tag k to tag i, and the last element is the emission probability from tag i to word j. We then choose the k that maximizes the entire expression.

D matrix formula:

D[i, j] = argmax_k ( C[k, j-1] * a_(k, i) * b_(i, cindex(wj)) )

which simply saves the k that maximized the corresponding entry C[i, j] (a code sketch follows the figure below).

image from week 2 of Natural Language Processing with Probabilistic Models course
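A sketch of the forward pass, reusing the A, B, C, D, words, and vocab from the initialization sketch; the function name is mine.

```python
# Forward pass: fill the remaining columns of C and D, column by column.
import numpy as np

def viterbi_forward(A, B, C, D, words, vocab):
    n, K = C.shape
    for j in range(1, K):                  # for every word after the first
        for i in range(n):                 # for every current tag i
            # best preceding path * transition k -> i * emission of word j from tag i
            scores = C[:, j - 1] * A[1:, i] * B[i, vocab[words[j]]]
            C[i, j] = np.max(scores)
            D[i, j] = np.argmax(scores)    # remember which previous tag k was best
    return C, D
```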

Viterbi Backward Pass

The backward pass uses the C and D matrices from the forward pass to create a path, so that we can assign a part-of-speech tag to every word.

The D matrix encodes the sequence of hidden states that most likely generated our sequence, word one all the way to word K. The backward pass retrieves the most likely sequence of part-of-speech tags for the given sequence of words.

Steps:

  • Calculate the index of the entry C[i, K] with the highest probability in the last column of C. The probability at this index is the probability of the most likely sequence of hidden states generating the given sequence of words.
  • Then use this index s to traverse backwards through the matrix D to reconstruct the sequence of part-of-speech tags.

Example:

  • Let’s say that in the last column of matrix C below, the highest probability is in the row of tag t1.
image from week 2 of Natural Language Processing with Probabilistic Models course
  • Then we go to matrix D and follow the best path backwards until we arrive at the start token. The path we recover from the backward pass is the sequence of part-of-speech tags with the highest probability (a code sketch follows the figure below).
image from week 2 of Natural Language Processing with Probabilistic Models course
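A sketch of the backward pass, again reusing the C, D, and tags from the sketches above.

```python
# Backward pass: start from the best entry in the last column of C,
# then walk backwards through D to recover the tag sequence.
import numpy as np

def viterbi_backward(C, D, tags):
    n, K = C.shape
    best = [0] * K
    best[K - 1] = int(np.argmax(C[:, K - 1]))   # index of the highest probability in the last column
    for j in range(K - 1, 0, -1):
        best[j - 1] = int(D[best[j], j])        # which tag at j-1 led to the best tag at j
    return [tags[i] for i in best]
```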

Some notes:

  • Be careful with the indices in the matrices.
  • Use log probabilities instead of multiplying probabilities directly: multiplying many very small numbers leads to numerical underflow. Using the log probabilities shown below yields better results (see the sketch after the figure).
image from week 2 of Natural Language Processing with Probabilistic Models course
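For completeness, here is the forward-pass update rewritten in log space (a sketch, same names as before): sums of log probabilities replace the products, which avoids underflow.

```python
# Log-space forward pass: replace products with sums of logs.
# The first column would be initialized as np.log(A[0, :]) + np.log(B[:, vocab[words[0]]]).
import numpy as np

def viterbi_forward_log(A, B, C_log, D, words, vocab):
    n, K = C_log.shape
    for j in range(1, K):
        for i in range(n):
            scores = C_log[:, j - 1] + np.log(A[1:, i]) + np.log(B[i, vocab[words[j]]])
            C_log[i, j] = np.max(scores)
            D[i, j] = np.argmax(scores)
    return C_log, D
```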

Originally published at https://github.com.
