Sentiment Analysis with Logistic Regression (Part 2)

Jiaqi (Karen) Fang
Published in Analytics Vidhya
Dec 23, 2020 · 6 min read

In Sentiment Analysis with Logistic Regression (Part 1), we talked about the overall approach to doing sentiment analysis with Logistic Regression. In this post, we are going to review what Logistic Regression is.

This post is Part 2 of “Sentiment Analysis with Logistic Regression”. In this post, I’m going to talk about:

  • Overview of Logistic Regression
  • Training Logistic Regression
  • Testing Logistic Regression
  • Logistic Regression Cost Function
  • Advanced topic: math derivation of the cost function

Disclaimer: This post is based on week 1 of the Natural Language Processing with Classification and Vector Spaces course on Coursera. Credit for most of the figures below goes to the course, and they remain its copyright.

Check out my final project here: Click Link

Part 2: Logistic Regression Review

1. Logistic Regression Overview

Previously (see Part 1), we learnt how to preprocess our data and extract features for sentiment analysis, and we can use logistic regression to predict the outcome. So what is logistic regression? At a high level, logistic regression makes use of a sigmoid function, which outputs a probability between zero and one.

  • The function used for classification in logistic regression, H, is the sigmoid function. It depends on the parameters Theta and the feature vector Xi, where i denotes the i-th observation or data point.
  • Visually, the sigmoid function approaches 0 as the dot product Theta transpose X approaches minus infinity, and approaches 1 as the dot product approaches plus infinity.
  • For classification, a threshold is needed. Usually it is set to 0.5, and this value corresponds to a dot product Theta transpose X equal to zero.
image from week 1 of Natural Language Processing with Classification and Vector Spaces course

Note that as the dot product Theta transpose X:

  • Gets closer and closer to negative infinity, the denominator of the sigmoid function gets larger and larger.
  • As a result, the sigmoid gets closer to 0.

On the other hand, as the dot product Theta transpose X:

  • Gets closer and closer to positive infinity, the denominator of the sigmoid function gets closer to 1.
  • As a result, the sigmoid also gets closer to 1.
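
To make the limits above concrete, here is a minimal NumPy sketch of the sigmoid (the function name and sample values are mine, not from the course code):

```python
import numpy as np

def sigmoid(z):
    """Sigmoid function: maps any real number to the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# As theta^T x -> -infinity the sigmoid -> 0; as it -> +infinity the sigmoid -> 1.
print(sigmoid(-10))  # ~0.000045
print(sigmoid(0))    # 0.5  (the usual decision threshold)
print(sigmoid(10))   # ~0.999955
```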

Example: Now given a tweet, we can transform it into a vector and run it through the sigmoid function to get a prediction as follows:

image from week 1 of Natural Language Processing with Classification and Vector Spaces course
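
As a rough sketch of that prediction step, assume the feature layout from Part 1, [bias, sum of positive word frequencies, sum of negative word frequencies]; the theta and x values below are made up for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical feature vector: [bias, sum of positive freqs, sum of negative freqs]
x = np.array([1.0, 3476.0, 245.0])
# Hypothetical trained parameters
theta = np.array([0.00003, 0.00150, -0.00120])

h = sigmoid(np.dot(theta, x))   # probability that the tweet is positive
print(h)
print("positive" if h >= 0.5 else "negative")
```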

2. Logistic Regression: Training

To train a logistic classifier, we can use gradient descent to iterate until we find the set of parameters theta that minimizes the cost function.

image from week 1 of Natural Language Processing with Classification and Vector Spaces course

To train the logistic classifier, we use gradient descent (see the sketch after the figure below):

  • First, we initialize the parameter vector theta.
  • Then we use the sigmoid (logistic) function to get a prediction for each observation.
  • Then we calculate the gradient of the cost function and update the parameters.
  • Lastly, we compute the cost J and determine whether more iterations are needed, according to a stopping criterion or a maximum number of iterations.
image from week 1 of Natural Language Processing with Classification and Vector Spaces course
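
Below is a minimal sketch of that training loop, assuming a feature matrix X of shape (m, n) whose first column is the bias term and labels y in {0, 1}; the variable names and learning rate are illustrative, not the course’s:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_descent(X, y, alpha=0.01, num_iters=1000):
    """Train logistic regression with batch gradient descent.

    X: (m, n) feature matrix, first column = bias term.
    y: (m, 1) labels in {0, 1}.
    """
    m = X.shape[0]
    theta = np.zeros((X.shape[1], 1))      # 1. initialize the parameter vector
    for _ in range(num_iters):
        h = sigmoid(X @ theta)             # 2. sigmoid value for each observation
        grad = (X.T @ (h - y)) / m         # 3. gradient of the cost function ...
        theta = theta - alpha * grad       #    ... and parameter update
        # 4. cost J; in practice you would stop early once it no longer decreases
        J = -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))
    return J, theta
```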

3. Logistic Regression: Testing

To test our model, we run a subset of our data, known as the validation set, through the model to get predictions and compare them with the true labels to calculate the model’s accuracy.

  • First, we calculate the predictions, which are the outputs of the sigmoid function.
  • Then we compare the output with a threshold. Usually we set the threshold = 0.5. If the output is >= 0.5, we would assign the prediction to a positive class. Otherwise, we would assign it to a negative class.
  • At the end, we will have a vector populated with zeros and ones indicating predicted negative and positive examples, respectively.
image from week 1 of Natural Language Processing with Classification and Vector Spaces course
  • Lastly, we can compute the accuracy of the model over the validation set. The accuracy is the number of times the model’s predictions match the true labels, divided by the number of labels in the validation set. This metric gives an estimate of how often your logistic regression will predict correctly on unseen data.
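
A short sketch of that testing step, reusing the sigmoid from above (the function names and threshold default are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict(X_val, theta, threshold=0.5):
    """Return 0/1 predictions for the validation set."""
    probs = sigmoid(X_val @ theta)           # outputs of the sigmoid
    return (probs >= threshold).astype(int)  # 1 = positive, 0 = negative

def accuracy(X_val, y_val, theta):
    """Fraction of predictions that match the true labels."""
    return np.mean(predict(X_val, theta) == y_val)
```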

4. Logistic Regression: Cost Function

The logistic regression cost function is defined as:

image from week 1 of Natural Language Processing with Classification and Vector Spaces course
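
For reference, here is the formula from the figure written out as LaTeX (the standard binary cross-entropy cost):

```latex
J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\Big[\, y^{(i)}\log h\big(x^{(i)},\theta\big) + \big(1-y^{(i)}\big)\log\big(1-h\big(x^{(i)},\theta\big)\big) \Big]
```
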
  • If y = 1 and we predict something close to 0, the cost approaches infinity.
  • If y = 0 and we predict something close to 1, the cost approaches infinity as well.
  • On the other hand, if the prediction is equal to the label, the cost is close to zero.
  • We are trying to minimize the cost function so that the prediction gets as close to the label as possible.
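
A quick numeric check of those bullet points (the probability values are chosen purely for illustration):

```python
import numpy as np

def example_cost(y, h):
    """Per-example logistic regression cost."""
    return -(y * np.log(h) + (1 - y) * np.log(1 - h))

print(example_cost(1, 0.01))  # y = 1, prediction near 0 -> cost ~4.6 (blows up)
print(example_cost(0, 0.99))  # y = 0, prediction near 1 -> cost ~4.6 (blows up)
print(example_cost(1, 0.99))  # prediction matches label -> cost ~0.01
```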

5. Advanced Topic: Math Derivation for the Logistic Regression Cost Function

Let’s write up a single expression that compresses the two cases (1 and 0) into one.
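
The figure showing that expression is reconstructed below; it is the standard Bernoulli form for a single example:

```latex
P\big(y^{(i)} \mid x^{(i)}, \theta\big) = h\big(x^{(i)},\theta\big)^{\,y^{(i)}}\,\big(1 - h\big(x^{(i)},\theta\big)\big)^{\,1 - y^{(i)}}
```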

From the above, we can see that when y = 1, we get the sigmoid function h(x(i), theta), and when y = 0, we get (1-sigmoid).

  • This makes sense, since the two probabilities sum to 1 (i.e., for binary classification, the label is either 1 or 0).
  • In either case, we want the corresponding term, h(x(i), theta) when y = 1 or (1 − h(x(i), theta)) when y = 0, to be as close to 1 as possible.

Now we want to find a way to model the entire data set and not just the one example above. To do so, we will define the likelihood as follows:
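
A reconstruction of that likelihood over all m training examples:

```latex
L(\theta) = \prod_{i=1}^{m} h\big(x^{(i)},\theta\big)^{\,y^{(i)}}\,\big(1 - h\big(x^{(i)},\theta\big)\big)^{\,1 - y^{(i)}}
```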

The ∏ symbol tells us that we are multiplying the terms together rather than adding them. Note that if we mess up the classification of a single example, we drag down the overall likelihood score, which is exactly what we intend: we want to fit a model to the entire dataset, where every data point contributes to the score.

One issue is that as m gets larger, L(θ) tends to zero, because both h(x(i), θ) and (1 − h(x(i), θ)) are bounded between 0 and 1, and multiplying many numbers smaller than 1 drives the product toward zero.

  • Since we are trying to maximize L(θ), and the log is a monotonically increasing function, we can introduce the log and maximize the log of the likelihood instead.
  • Introducing the log also allows us to write the log of a product as the sum of the logs of the factors.

We can rewrite the equation as follows:

image from week 1 of Natural Language Processing with Classification and Vector Spaces course
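
Written out, the log turns the product into a sum (a LaTeX reconstruction of the figure above):

```latex
\log L(\theta) = \sum_{i=1}^{m}\Big[\, y^{(i)}\log h\big(x^{(i)},\theta\big) + \big(1-y^{(i)}\big)\log\big(1-h\big(x^{(i)},\theta\big)\big) \Big]
```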

We then divide by m because we want the average cost per training example.

Remember we are maximizing the log-likelihood in the equation above. It turns out that maximizing a function is the same as minimizing its negative, so we add a negative sign and end up minimizing the cost function, which is exactly the cost J(θ) from section 4.

image from week 1 of Natural Language Processing with Classification and Vector Spaces course

Originally published at https://github.com.
