Mar 17, 2021 · 4 min read

Sentiment Analysis using Azure ML Text Analytics

Sentiment analysis is a natural language processing (NLP) technique used to determine whether a piece of text is positive, negative, or neutral. It helps businesses quickly understand the overall opinions of their customers and users. Whether the input is a product or movie review, customer feedback, a social media conversation, or a tweet, analyzing the sentiment behind it supports faster and more accurate decisions.

Sentiment analysis models can be implemented with different algorithms. The choice depends on the business need and on how much data has to be analyzed.

Sentiment analysis algorithms fall into one of these categories:

Rule-based: performs sentiment analysis based on a set of manually defined rules.
Automatic: uses machine learning techniques to learn from data and perform sentiment analysis.
Hybrid: combines the rule-based and automatic approaches.

In this blog, I will explain how to create and train a sentiment analysis model using Azure ML. I will use the “Eradicate Diabetes” Telegram group chat history to analyze, based on the chat messages, whether the members are satisfied or not.

A short introduction to “Eradicate Diabetes” (ED): ED is a community chat group that brings people together to combat diabetes using the power of crowdsourced healthcare. The group helps people reverse their Type 2 diabetes by providing information and support. Beyond reversing Type 2 diabetes, the group also believes in the holistic treatment of organs such as the liver, kidneys, and heart that have been damaged over years of abuse. They advise on herbal-based treatments combined with dietary and lifestyle modifications that have been proven to successfully reverse diabetes.

I strongly recommend going through my earlier blogs for a better understanding of Telegram chat history analysis and Azure ML Studio:

Sentiment Analysis of telegram chat history using Decision Tree Classifier model - uses the Python scikit-learn and pandas libraries to create a model.

Exploring Azure ML Studio (Classic) - covers the basic steps to get started with Azure ML Studio.

How to Use Azure ML Studio to Evaluate Regression Models - step-by-step instructions to create a model in Azure ML Studio.

Let’s get started with the Sentiment Analysis project in Azure ML.

Step 1: Data from Telegram Chat

The chat history is downloaded as a JSON file, converted to a CSV file, and uploaded as a dataset in Azure ML Studio, as described in my earlier blogs.
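The JSON to CSV conversion can be sketched roughly as follows. This is a minimal sketch, not the script from the earlier blogs: it assumes the export has a top-level "messages" list whose entries carry "type", "date", "from", and "text" keys, and the function names and output columns are made up for illustration.

```python
import csv
import json


def flatten_text(text):
    """Return the message text as one string.

    Assumption: "text" is either a plain string or a list mixing strings
    and dicts that carry their own "text" key (links, mentions, and so on).
    """
    if isinstance(text, str):
        return text
    parts = []
    for piece in text:
        parts.append(piece if isinstance(piece, str) else piece.get("text", ""))
    return "".join(parts)


def chat_json_to_csv(json_path, csv_path):
    """Write one CSV row per chat message and return the number of rows."""
    with open(json_path, encoding="utf-8") as f:
        export = json.load(f)

    rows = 0
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["date", "from", "text"])
        for msg in export.get("messages", []):
            if msg.get("type") != "message":
                continue  # skip service entries such as joins and pins
            writer.writerow(
                [msg.get("date", ""), msg.get("from", ""), flatten_text(msg.get("text", ""))]
            )
            rows += 1
    return rows
```

Calling chat_json_to_csv("result.json", "chat.csv") would then produce a file that can be uploaded as a dataset; both file names are placeholders.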

These are the steps that we add to the experiment canvas:

Step 2: Data Preprocessing

In the experiment canvas, add all the required modules to read data from the dataset and do data cleaning as follows:

Data Cleaning steps in ML experiment - Image by Author

Output of the dataset after cleaning - Image by Author

Since we are trying to find out the emotions of the users, we want to eliminate the messages from the admins and the chatbot. Here, I have used:

An Apply SQL Transformation module to eliminate the messages from the admins and the chatbot, using the SQL query shown in the figure.
A CSV file (Remove words.csv) containing all the words that are not relevant for the analysis.
A Preprocess Text module from Text Analytics to convert the text to lower case, remove emoticons, etc.
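The filtering step can be tried locally before wiring it into the canvas. The sketch below is not the query from the figure: the sender names and column names are hypothetical, and it assumes the module refers to its first input as t1 and accepts SQLite syntax, which Python's sqlite3 module can emulate.

```python
import sqlite3

# Assumption: the module exposes its first input as table t1 and accepts
# SQLite syntax. The column names and sender names here are hypothetical.
FILTER_QUERY = """
SELECT sender, text
FROM t1
WHERE sender NOT IN ('Admin One', 'ED Bot')
"""


def filter_member_messages(rows):
    """Run FILTER_QUERY over (sender, text) rows using in-memory SQLite."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE t1 (sender TEXT, text TEXT)")
        conn.executemany("INSERT INTO t1 VALUES (?, ?)", rows)
        return conn.execute(FILTER_QUERY).fetchall()
    finally:
        conn.close()
```

Only the text of FILTER_QUERY would go into the module; the Python wrapper exists just to check the query against a few sample rows.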

The output of the SQL module and “Remove words.csv” are the inputs to the Preprocess Text module. In the properties section of the Preprocess Text module, if the “Remove stop words” checkbox is checked and a word list is provided as input, the words in the CSV are removed from the text column of the dataset. If no CSV is provided, common stop words such as “the”, “in”, and “where” are removed instead. The remaining checkboxes can be checked or unchecked as needed.
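To make the stop-word behaviour concrete, here is a rough Python approximation of the preprocessing described above. It is a sketch only: the Preprocess Text module has more options and its own built-in stop-word list, while the tiny default list and the function name below are hypothetical.

```python
import re

# Hypothetical fallback list; the module ships its own default stop words.
DEFAULT_STOP_WORDS = {"the", "in", "where", "a", "an", "is", "to"}


def preprocess_text(text, custom_stop_words=None):
    """Lowercase, strip non-letter characters, and drop stop words.

    When custom_stop_words is given it replaces the default list, which
    mirrors how the optional CSV input is described in this step.
    """
    stop_words = DEFAULT_STOP_WORDS if custom_stop_words is None else set(custom_stop_words)
    lowered = text.lower()
    # Keep letters and whitespace only, which also removes emoticons and digits.
    letters_only = re.sub(r"[^a-z\s]", " ", lowered)
    return " ".join(w for w in letters_only.split() if w not in stop_words)
```

Passing the words from a file such as Remove words.csv as custom_stop_words corresponds to connecting the CSV to the module; leaving it out falls back to the default list.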

The settings and the output after preprocessing are shown below. We can see that the options that I selected have been applied to the text column and a new column “Preprocessed Text” has been created with the processed text.

Preprocessing the text - Image by Author

Step 3: Classifying the messages and Feature Hashing

Once the text is preprocessed, we need to classify it into “Satisfied” and “UnSatisfied” messages. To do this, I added another column, “emotion”, to the dataset and used a predefined list of words to classify the preprocessed text as satisfied or unsatisfied. If the message is positive, the emotion is 1; otherwise it is 0.

The next step is “Feature Hashing”. Feature Hashing is a Text Analytics module that splits the text into words or phrases based on the settings provided and adds extra columns to the dataset, each holding either 1 or 0.

Feature Hashing and its properties - Image by Author

I used a Python script in the “Execute Python Script” module to define the emotion of each message.
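The script itself is not reproduced here, so the following is only a sketch of what such a labelling script could look like. The azureml_main entry point is the one that Execute Python Script in Studio (classic) calls; the word list, the label_emotion helper, and the assumption that the input column is called "Preprocessed Text" are illustrative.

```python
# Hypothetical word list; the list actually used for the article is not shown.
SATISFIED_WORDS = {"thanks", "thank", "good", "great", "improved", "better"}


def label_emotion(text):
    """Return 1 if any word of the text is in SATISFIED_WORDS, else 0."""
    return int(any(word in SATISFIED_WORDS for word in str(text).split()))


def azureml_main(dataframe1=None, dataframe2=None):
    """Entry point expected by Execute Python Script in Studio (classic).

    Assumption: dataframe1 is a pandas DataFrame with a
    "Preprocessed Text" column, as produced in Step 2.
    """
    dataframe1["emotion"] = dataframe1["Preprocessed Text"].apply(label_emotion)
    # The module expects a sequence of DataFrames, hence the one-element tuple.
    return dataframe1,
```

A single keyword match is a deliberately simple rule; a real list would need both positive and negative terms reviewed against the chat vocabulary.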

The output after executing the python scripts is as follows:

The output after “Feature Hashing” is as follows. Here we can see that it has created 1024 extra columns, and each cell has either 1 or 0 depending on whether the corresponding word or phrase appears in the message.
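The idea behind those 1024 columns can be illustrated with a few lines of Python. This is a sketch of the general hashing trick, not the module's implementation: MD5 stands in for whatever hash the module uses, only single words are hashed (no phrases), and the 10-bit size is an assumption chosen because 2^10 = 1024.

```python
import hashlib

HASH_BITS = 10  # 2**10 = 1024 columns, matching the count in the article


def hash_features(text, bits=HASH_BITS):
    """Map each word of text to one of 2**bits columns and mark it with 1."""
    n_columns = 2 ** bits
    row = [0] * n_columns
    for word in text.split():
        # MD5 is only a stand-in for a stable hash function.
        digest = hashlib.md5(word.encode("utf-8")).hexdigest()
        row[int(digest, 16) % n_columns] = 1
    return row
```

Because the number of columns is fixed, two different words can land in the same column; a larger bit size reduces such collisions at the cost of a wider dataset.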

Step 4: Training the model

Since a text feature cannot be used directly to train a model, we use the 1024 newly created columns as features because they hold numeric values, and the “emotion” column as the label. I tried several models, as shown below:

Training various models - Image by Author

Since this is a classification problem, I trained a few binary classification models on the data. The trained models and their metrics are given below:

The confusion matrices for all 4 models I trained are given below:

Confusion matrix of 4 models - Image by Author
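For reference, the usual metrics can be derived from the four cells of a binary confusion matrix as sketched below. The function name is hypothetical and the counts in the usage line are invented for illustration; they are not the values from the figure.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall, and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Made-up counts, only to show the call.
metrics = classification_metrics(tp=40, fp=10, fn=10, tn=40)
```

Comparing models on precision and recall as well as accuracy matters when one class is much rarer than the other, since accuracy alone can look good for a model that mostly predicts the majority class.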

From the metrics, we can clearly see that the Two-Class Boosted Decision Tree is the winning model, both in accuracy and in overall performance.

Step 5: Making actual predictions

I split the 476 records into 333 records for training and 143 for validation. I used the winning Two-Class Boosted Decision Tree model to make predictions on the validation data. The predicted outcomes were very close to the actual values on the validation subset, and the accuracy was 0.916, the same as I got earlier for the train and test split. A sample of the scored values against the actual values for the validation set is shown in the figure.
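The split sizes can be reproduced in plain Python as sketched below. The 0.7 training fraction is an assumption inferred from 333 out of 476 records, and the seed and function name are arbitrary.

```python
import random


def split_rows(rows, train_fraction=0.7, seed=42):
    """Shuffle rows reproducibly and split them into train and validation lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Row indices stand in for the 476 labelled messages.
train, validate = split_rows(range(476))
print(len(train), len(validate))  # 333 143
```

Fixing the seed keeps the same rows in each subset across runs, which makes accuracy figures comparable between models.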

As we can see, the predictions are fairly close to the actuals.

For the Telegram chat sentiment analysis, the Two-Class Boosted Decision Tree model gave more accurate predictions than the other models that I trained on the same data.

I hope this will be helpful for everyone exploring Azure ML to implement sentiment analysis of product reviews, feedback, tweets, etc.

Happy Analyzing!