Sentiment Analysis with Logistic Regression (Part 1)

Jiaqi (Karen) Fang
Published in Analytics Vidhya
Dec 22, 2020 · 10 min read


Sentiment analysis is extremely useful these days, as it allows us to gain an overview of the wider opinion behind certain topics. For example, analyzing customer reviews can help us see how positively or negatively customers feel about our product. A human can understand the sentiment of a text quite easily. However, relying on humans to manually classify the sentiment of a large body of text is clearly inefficient. Instead, we can apply NLP techniques to do sentiment analysis at scale.

For this topic, I’m going to talk about:

  • Part 1: Sentiment Analysis with Logistic Regression
  • Part 2: Logistic Regression Review

This post is Part 1: “Sentiment Analysis with Logistic Regression”.

Disclaimer: This post is based on week 1 of the Natural Language Processing with Classification and Vector Spaces course on Coursera. Credit for most of the figures below goes to the course.

Check out my final project here: Click Link

Part 1: Sentiment Analysis with Logistic Regression

In Part 1, I’m going to walk through the process of using Logistic Regression on tweets to do sentiment analysis, i.e., identify positive vs. negative tweets.

At a high level, we can follow the steps below to use Logistic Regression to perform sentiment analysis:

  • Preprocess the text to make it clean and readable
  • Create a dictionary mapping to represent text as numeric vectors
  • Extract useful features to represent a given text
  • Perform logistic regression on the features we created to predict the given text’s sentiment

1. Supervised ML (Training)

In supervised machine learning, we have input features X and a set of labels Y. To make sure we’re getting the most accurate predictions from our data, our goal is to minimize the error rate, or cost, as much as possible.

  • We feed the features X into our prediction function.
  • The prediction function takes in the parameters θ and maps the features X to the output Ŷ.
  • The best mapping from features to labels is achieved when the difference between the expected values Y and the predicted values Ŷ is minimized. The cost function measures this by comparing how close the output Ŷ is to the target Y.
  • We update the parameters θ and repeat the whole process until the cost is minimized.
image from week 1 of Natural Language Processing with Classification and Vector Spaces course
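
As a minimal sketch of this training loop in Python with NumPy (the function and variable names are my own, not the course’s):

```python
import numpy as np

def sigmoid(z):
    # Squash raw scores into probabilities in (0, 1).
    return 1 / (1 + np.exp(-z))

def train_logistic_regression(X, y, alpha=0.01, num_iters=1000):
    """X: (m, n) feature matrix including a bias column; y: (m,) labels in {0, 1}."""
    m, n = X.shape
    theta = np.zeros(n)               # initialize the parameters
    for _ in range(num_iters):
        y_hat = sigmoid(X @ theta)    # prediction: map features X to Y hat
        grad = X.T @ (y_hat - y) / m  # gradient of the cross-entropy cost
        theta -= alpha * grad         # update parameters to reduce the cost
    return theta
```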

2. Sentiment Analysis

According to Wikipedia, “Sentiment analysis (also known as opinion mining or emotion AI) refers to the use of natural language processing, text analysis, computational linguistics, and biometrics to systematically identify, extract, quantify, and study affective states and subjective information.”

A simple example is:

  • Given a tweet: “I am happy because I am learning NLP.”
  • The goal is to predict whether this tweet is positive or negative.

The question is: how do we predict whether the tweet is positive or negative?

  • Approach: predicting positive vs. negative looks like a binary classification problem. In machine learning, we don’t always need to use fancy deep learning techniques. Instead, if we can extract simple but useful information from the text using NLP techniques, we can build a simple logistic regression model that identifies positive and negative tweets fairly accurately.
image from week 1 of Natural Language Processing with Classification and Vector Spaces course

Steps we can take are:

  • First, we will process the raw tweets in the training set to extract useful features.
  • Then we will train the logistic regression classifier while minimizing the cost.
  • Finally we’ll be able to make the predictions.
image from week 1 of Natural Language Processing with Classification and Vector Spaces course

3. Vocabulary & Feature Extraction

Motivation: computers work with numbers, and text is not numbers. So before we jump into extracting features from tweets, we need to think about how to represent a text in a computer.

To represent a text as a vector, we first have to build a vocabulary, which will allow us to encode any text or tweet as an array of numbers.

Vocabulary

Definition: Given a list of texts, the vocabulary V is the list of unique words across those texts.

  • To build that list, one naive approach is to go through all the words in all the texts and save every new word that appears.
image from week 1 of Natural Language Processing with Classification and Vector Spaces course
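
As a rough sketch of that naive approach (the helper name and sample texts here are illustrative, not from the course):

```python
def build_vocabulary(texts):
    """Collect every unique word across a list of texts."""
    vocabulary = []
    seen = set()
    for text in texts:
        for word in text.lower().split():
            if word not in seen:   # save every new word that appears
                seen.add(word)
                vocabulary.append(word)
    return vocabulary

vocab = build_vocabulary(["I am happy because I am learning NLP",
                          "I am sad"])
# ['i', 'am', 'happy', 'because', 'learning', 'nlp', 'sad']
```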

Feature Extraction

To extract features using the vocabulary, we check whether every word from the vocabulary appears in the text.

  • If it does, then we would assign a value of 1 to that feature.
  • If it doesn’t, then assign a value of 0.

This type of representation, with a relatively small number of non-zero values, is called a sparse representation.

For example, the representation of our sample text would have a couple of ones and many zeros. The zeros correspond to every unique word in the vocabulary that isn’t in the tweet.

image from week 1 of Natural Language Processing with Classification and Vector Spaces course
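
A quick sketch of this sparse representation, reusing the vocab list built in the previous sketch:

```python
def sparse_features(text, vocabulary):
    """Assign 1 if the vocabulary word appears in the text, 0 otherwise."""
    words = set(text.lower().split())
    return [1 if word in words else 0 for word in vocabulary]

sparse_features("I am happy", vocab)
# [1, 1, 1, 0, 0, 0, 0] -- a couple of ones, zeros for every other vocabulary word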

Problem:
With this approach, we immediately see some problems:

  • This representation has a number of features equal to the size of the entire vocabulary.
  • Most of these features are equal to 0 for every tweet.
  • With the sparse representation, a logistic regression model would have to learn n + 1 parameters, where n is the size of the vocabulary. We can imagine that for large vocabulary sizes, this would be problematic.
image from week 1 of Natural Language Processing with Classification and Vector Spaces course

As we can see above, as V gets larger, the vector becomes sparser. Furthermore, we end up with many more features and have to train on the order of V parameters. This results in longer training time and longer prediction time. In summary:

  • It would take an excessive amount of time to train the model.
  • It would take much more time than necessary to make predictions.

Is there a better way to extract useful features from text?

4. Negative and Positive Frequencies

There are many creative ways to create features from text. One approach is to get the positive and negative frequencies.

Positive and Negative counts

Purpose: generate word counts that we can feed as features into the logistic regression classifier.

  • Specifically, given a word, we want to keep track of the number of times it shows up in the positive class and in the negative class.
  • Using both of those counts, we can extract features and feed them into the logistic regression classifier.

The steps to get the counts:

  • Start with a corpus of tweets in the training dataset. Associated with that corpus, we have a set of unique words — the vocabulary.
  • For sentiment analysis, say we have two classes in this case — positive and negative.
  • To get the frequency of any word in our vocabulary within each class, we count the number of times it appears in that class. Note that we do not count it over the whole training set.

Here is an example of positive-tweet word frequencies:

image from week 1 of Natural Language Processing with Classification and Vector Spaces course

Here is an example of negative-tweet word frequencies:

image from week 1 of Natural Language Processing with Classification and Vector Spaces course

In summary: below is what the word frequencies look like in each class. In practice, this table is a dictionary that maps a (word, class) pair to its frequency; that is, it maps a word and its corresponding class to the number of times that word showed up in that class.

image from week 1 of Natural Language Processing with Classification and Vector Spaces course
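
In code, the frequency table could be sketched as a dictionary keyed by (word, class) pairs; the build_freqs name and the 1/0 label encoding are my own choices, not necessarily the course’s:

```python
def build_freqs(tweets, labels):
    """Map each (word, sentiment) pair to its count within that class."""
    freqs = {}
    for tweet, label in zip(tweets, labels):      # label: 1 = positive, 0 = negative
        for word in tweet.lower().split():
            pair = (word, label)
            freqs[pair] = freqs.get(pair, 0) + 1  # count within that class only
    return freqs

freqs = build_freqs(["I am happy", "I am sad"], [1, 0])
# {('i', 1): 1, ('am', 1): 1, ('happy', 1): 1, ('i', 0): 1, ('am', 0): 1, ('sad', 0): 1}
```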

5. Feature Extraction with Frequencies

Previously, we mentioned extracting features based on the vocabulary V, which means our logistic regression model would need to learn V features. This becomes problematic as V gets large.

We can reduce the dimension by representing a tweet as a vector of dimension 3 using the frequency table we created. In doing so, the logistic regression classifier becomes much faster, because instead of learning V features, it only has to learn 3.

Feature Extraction with a 3-Dimensional Vector

We can extract features to represent a tweet as:

  • A bias term
  • The sum of positive frequencies of the words from the vocabulary that appear in the tweet
  • The sum of negative frequencies of the words from the vocabulary that appear in the tweet

So to extract the features for this representation, we only have to sum the frequencies of words.

image from week 1 of Natural Language Processing with Classification and Vector Spaces course

Let’s see this in an example: “I am sad, I am not learning NLP”

  • Get the sum of positive frequencies over the tweet’s unique words = 8
image from week 1 of Natural Language Processing with Classification and Vector Spaces course
  • Get the sum of negative frequencies over the tweet’s unique words = 11
image from week 1 of Natural Language Processing with Classification and Vector Spaces course
  • Therefore, in this example, the tweet can be represented as:
image from week 1 of Natural Language Processing with Classification and Vector Spaces course
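
Using the freqs dictionary sketched in the previous section, feature extraction for a single tweet could look like this. It is a minimal sketch: it assumes the tweet is already a list of preprocessed words and counts each unique word’s class frequency once, which matches the sums of 8 and 11 above.

```python
def extract_features(tweet_words, freqs):
    """Represent a preprocessed tweet as [bias, positive sum, negative sum]."""
    unique_words = set(tweet_words)   # count each word's class frequency once
    pos_sum = sum(freqs.get((w, 1), 0) for w in unique_words)
    neg_sum = sum(freqs.get((w, 0), 0) for w in unique_words)
    return [1, pos_sum, neg_sum]      # the leading 1 is the bias term

# With the counts shown in the figures above, the example tweet yields [1, 8, 11].
```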

6. Preprocessing

Motivation: before we start to build any models for NLP projects, the vast majority of the time goes into preprocessing the text to make it clean, so that we can extract valuable features from it.

There are some common steps we take to clean the text.

A. Preprocessing: Stop Words and Punctuation

Remove all the words that don’t add significant meaning to the tweet, i.e., stop words, along with punctuation marks. In practice, we compare our tweet against two lists — stop words and punctuation. In real applications, these lists are usually much larger.

  • The overall meaning of the sentence can still be inferred without any effort after stop-word removal.
  • Note that in some contexts we won’t want to eliminate punctuation, so we should think carefully about whether punctuation adds important information to the specific NLP task.
image from week 1 of Natural Language Processing with Classification and Vector Spaces course

B. Preprocessing: Handles and URLs

Tweets and other types of text often contain handles and URLs, but these don’t add any value for the task of sentiment analysis, so we can eliminate them as well.

image from week 1 of Natural Language Processing with Classification and Vector Spaces course

C. Preprocessing: Stemming and lowercasing

  • Stemming: stemming in NLP is simply transforming a word into its base stem, which we can define as the set of characters used to construct the word and its derivatives. The vocabulary is significantly reduced when we perform this process for every word in the corpus.
  • Lowercasing: to reduce the vocabulary even further without losing valuable information, we lowercase every word.

After this, our tweet would be preprocessed into a list of words such as [tun, great, ai, model]. Below we can see how we eliminated handles, tokenized the text into words, removed stop words, performed stemming, and converted everything to lowercase.

image from week 1 of Natural Language Processing with Classification and Vector Spaces course

In summary, when preprocessing, we usually perform the following:

  • Eliminate handles and URLs
  • Tokenize the string into words.
  • Remove stop words like “and, is, a, on, etc.”
  • Stemming, i.e., converting every word to its stem. For example, dancer, dancing, and danced all become “danc”. You can use the Porter stemmer for this.
  • Convert all words to lowercase.
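
Here is a minimal sketch of this pipeline using NLTK; it assumes the stopwords corpus has been downloaded, and the sample tweet and helper name are illustrative:

```python
import re
import string
from nltk.corpus import stopwords        # requires nltk.download('stopwords')
from nltk.stem import PorterStemmer
from nltk.tokenize import TweetTokenizer

def preprocess_tweet(tweet):
    tweet = re.sub(r'https?://\S+', '', tweet)  # eliminate URLs
    tweet = re.sub(r'@\w+', '', tweet)          # eliminate handles
    tokens = TweetTokenizer(preserve_case=False).tokenize(tweet)  # tokenize + lowercase
    stop_words = set(stopwords.words('english'))
    stemmer = PorterStemmer()
    return [stemmer.stem(tok) for tok in tokens
            if tok not in stop_words and tok not in string.punctuation]

preprocess_tweet("@AndrewYNg tuning a GREAT AI model https://example.com")
# -> something like ['tune', 'great', 'ai', 'model']
```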

7. Putting it all together

Previously, we discussed:

  • The approach (logistic regression) to perform sentiment analysis
  • How to extract useful features to represent tweets
  • Common steps to preprocess the text before building the model

In this section, we summarize them and put them all together.

  • Preprocess: for each tweet, preprocess it into a list of words that contains all the relevant information.
  • Create a dictionary: with that list of words, we can get a compact representation using a frequency dictionary mapping.
  • Extract features: finally, get a vector with a bias unit and two additional features: the sum of the number of times each word in the processed tweet appears in positive tweets, and the corresponding sum for negative tweets.
image from week 1 of Natural Language Processing with Classification and Vector Spaces course

In practice, we would have to perform this process on a set of m tweets.

image from week 1 of Natural Language Processing with Classification and Vector Spaces course

At the end, we would have a matrix X with m rows and three columns, where every row contains the features for one of the tweets.

image from week 1 of Natural Language Processing with Classification and Vector Spaces course
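
Putting the earlier sketches together, building the feature matrix could look like this (train_tweets and train_labels are hypothetical placeholders for the training data):

```python
import numpy as np

def build_feature_matrix(tweets, freqs):
    """Stack one [bias, pos_sum, neg_sum] row per tweet into an (m, 3) matrix."""
    return np.array([extract_features(preprocess_tweet(t), freqs)
                     for t in tweets])

X = build_feature_matrix(train_tweets, freqs)                 # shape (m, 3)
theta = train_logistic_regression(X, np.array(train_labels))  # from section 1
```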

That’s it! We’ve covered how to use Logistic Regression to predict positive and negative sentiment for tweets. In the next post, I’m going to review Logistic Regression.

Originally published at https://github.com.

Jiaqi (Karen) Fang

Machine learning data scientist, blogger, and course facilitator who helps organizations unpack the value of data in business.