Sentiment analysis is a field with wide scope and many applications in recommendation systems.
Be it movie reviews, the stock market, products, or groups, sentiments play a huge role in analyzing the trend and future of a product or service.
Different fields may follow totally different rules to analyze sentiment: for reviews, the polarity of the words decides the sentiment, while for the stock market a different set of rules, such as the day's hike or dip in stock prices, decides the sentiment of a news item.
Here I will take the chat of a Telegram group as an example, to find out whether people are happy with the recommendations given in the group.
I am part of a Telegram group, 'Eradicate Diabetes', which has a few experts who recommend diets such as LFV or LCHF depending on preference, along with exercises and meditations to control diabetes.
Rule-Based Sentiment Analysis:
Here we can define the happiness or unhappiness of a member by the messages they send. If someone writes words like 'awesome experience', 'awesome', 'very good', 'excellent', 'tremendous', 'great', or 'thanks', we can safely say the member is happy.
Determining unhappiness was a challenge, as there were no words with clear negative polarity like 'bad' or 'worst'; instead, people who were not happy with the diet tended to write things like 'tiredness', 'low energy', 'difficult', 'cannot', 'switch', or 'change'.
Before doing any sentiment analysis, export the Telegram chat in JSON format.
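For reference, the exported file looks roughly like the snippet below. The exact fields vary with the Telegram version, and the values here are made up purely for illustration; the part we care about is the 'messages' list.

{
  "name": "Eradicate Diabetes",
  "type": "private_supergroup",
  "id": 123456789,
  "messages": [
    {"id": 1, "type": "message", "from": "Tim", "text": "Welcome to the group!"},
    {"id": 2, "type": "message", "from": "Member", "reply_to_message_id": 1, "text": "Thanks, awesome experience so far"}
  ]
}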
Step 1: Write the code below (the steps are explained in comments) to build a data frame of positive responses, setting the target value to 1 (positive sentiment):
import json
import pandas as pd

# Step 1: read the downloaded JSON file
with open(r'../input/chatjson/result2.json', encoding='utf8') as f:
    data = json.load(f)

# Step 2: extract only the list of messages from the JSON
msgs = data['messages']

# Step 3: create a DataFrame of the messages and keep the useful columns
df = pd.DataFrame(msgs)
df_new = df.filter(items=['id', 'from', 'reply_to_message_id', 'text'])
df_new = df_new.dropna(subset=['text'])  # dropna returns a copy, so assign it back

# Step 4: extract all the messages not written by the owners of the group
# (Tim, Raj, the Dia chatbot, Trupti) that contain a word from Exc_list;
# na=False skips entries whose 'text' is not a plain string
Exc_list = ['awesome experience', 'awesome', 'very good', 'excellent',
            'tremendous', 'great', 'thanks']
df_st = df_new.loc[~df_new['from'].isin(['Tim', 'Raj', 'Dia', 'Trupti'])
                   & df_new['text'].str.lower().str.contains('|'.join(Exc_list), na=False)].copy()
df_st['emotion'] = 1
Step 2: Write the code below (again, the steps mirror the ones above) to build a data frame of negative responses, setting the target value to -1 (negative sentiment):
ds_list = ['tiredness', 'low energy', 'difficult', 'cannot', 'switch', 'change']
df_ust = df_new.loc[~df_new['from'].isin(['Tim', 'Raj', 'Dia', 'Trupti'])
                    & df_new['text'].str.lower().str.contains('|'.join(ds_list), na=False)].copy()
df_ust['emotion'] = -1
Step 3: Now concatenate the two data frames.
frames = [df_st, df_ust]
emo_pd = pd.concat(frames)
Percentage of positive and negative sentiments
### Checking the distribution of sentiments ###
import matplotlib.pyplot as plt
%matplotlib inline

print('Percentage per sentiment\n')
print(round(emo_pd['emotion'].value_counts(normalize=True)*100, 2))
round(emo_pd['emotion'].value_counts(normalize=True)*100, 2).plot(kind='bar')
plt.title('Percentage distribution by review type')
plt.show()
For the sake of comparison I am using the same (keyword-filtered) dataset; otherwise we could repeat Step 1 without applying the keyword lists.
Remove the columns that are no longer needed:
#Removing columns
emo_pd.drop(columns = ['id','from','reply_to_message_id'], inplace = True)
Data Cleaning
Let’s do some text cleaning first:
import re
import string

# This function converts to lower case, removes text in square brackets,
# and removes numbers and punctuation
def text_clean_1(text):
    text = text.lower()
    text = re.sub(r'\[.*?\]', '', text)
    text = re.sub('[%s]' % re.escape(string.punctuation), ' ', text)
    text = re.sub(r'\w*\d\w*', '', text)
    return text

cleaned1 = lambda x: text_clean_1(x)
# Let's take a look at the updated text
emo_pd['cleaned_description'] = emo_pd.text.apply(cleaned1)
emo_pd.head(10)
The second level of cleaning:
# Apply a second round of cleaning: remove curly quotes and ellipses,
# and replace newlines with spaces so words don't get glued together
def text_clean_2(text):
    text = re.sub('[‘’“”…]', '', text)
    text = re.sub('\n', ' ', text)
    return text

cleaned2 = lambda x: text_clean_2(x)
# Let's take a look at the updated text
emo_pd['cleaned_description2'] = emo_pd.cleaned_description.apply(cleaned2)
emo_pd.head(10)
Next, remove stop words while retaining a few words like 'not', 'again', and 'once' that may carry sentiment value (this step is not strictly necessary):
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download('stopwords')
nltk.download('punkt')  # tokenizer models used by word_tokenize

# Keep 'again', 'once', and 'not' out of the stop-word set,
# since they can carry sentiment
stop_words = set(stopwords.words('english')) - set(['again', 'once', 'not'])

def clean_stopwords(text):
    word_tokens = word_tokenize(text)
    filtered_sentence = [w for w in word_tokens if w not in stop_words]
    return ' '.join(filtered_sentence)

Cleanest_text = lambda x: clean_stopwords(x)

# Let's take a look at the updated text
emo_pd['Cleanest_text'] = emo_pd.cleaned_description2.apply(Cleanest_text)
Sentiment Analysis through TextBlob:
TextBlob is a text sentiment analysis library whose default analyzer is pattern-based (lexicon-driven), with an optional Naive Bayes analyzer. It offers a simple API for natural language processing (NLP) tasks such as POS tagging, noun-phrase extraction, sentiment analysis, translation, classification, and more. Apply TextBlob to calculate sentiments for the text:
# Apply the TextBlob analyzer: map polarity to 1 / -1 / 0
from textblob import TextBlob

def analyzer(text):
    analysis = TextBlob(text)
    if analysis.sentiment.polarity > 0:
        return 1
    elif analysis.sentiment.polarity < 0:
        return -1
    else:
        return 0

analysis = lambda x: analyzer(x)

# Let's take a look at the updated text
emo_pd['Analysis'] = emo_pd.cleaned_description2.apply(analysis)
emo_pd
An example of the working of TextBlob is as follows.
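As an illustrative check on two hand-written messages (the exact polarity values depend on TextBlob's pattern lexicon):

from textblob import TextBlob
print(TextBlob('awesome experience thanks').sentiment.polarity)
# positive (> 0): 'awesome' has a strong positive score in the lexicon
print(TextBlob('i want to switch the diet').sentiment.polarity)
# typically 0.0: none of these words carry polarity in the lexicon,
# even though our rule-based system labels 'switch' as negative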
Check the distribution of positive and negative sentiments:
### Checking the distribution of sentiments ###
import matplotlib.pyplot as plt
%matplotlib inline

print('Percentage per sentiment\n')
print(round(emo_pd['Analysis'].value_counts(normalize=True)*100, 2))
round(emo_pd['Analysis'].value_counts(normalize=True)*100, 2).plot(kind='bar')
plt.title('Percentage distribution by review type')
plt.show()
Sentiment Analysis through VADER:
VADER (Valence Aware Dictionary and sEntiment Reasoner) is used for sentiment analysis of unlabelled text. It is sensitive to both the polarity (positive/negative) and the intensity (strength) of the emotion shown, and it is available both in the NLTK package and as the standalone vaderSentiment library.
VADER's sentiment analysis relies on a dictionary that maps lexical features to emotion intensities, better known as sentiment scores. The sentiment score of a text is obtained by summing up the intensity of each word and normalizing the result into the range -1 to 1 (the 'compound' score).
Apply VADER sentiment analysis to the text:
# Apply the VADER analyzer: map the compound score to 1 / -1 / 0
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

sia = SentimentIntensityAnalyzer()

def analyzer2(text):
    vs = sia.polarity_scores(text)
    if vs['compound'] > 0:
        return 1
    elif vs['compound'] < 0:
        return -1
    return 0

analysis2 = lambda x: analyzer2(x)

# Let's take a look at the updated text
emo_pd['Analysis2'] = emo_pd.cleaned_description2.apply(analysis2)
An example of the working of VADER is as follows.
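Running the analyzer on the same two sample messages as before (the comments are indicative; the exact numbers come from VADER's lexicon):

from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
sia = SentimentIntensityAnalyzer()
print(sia.polarity_scores('awesome experience thanks'))
# compound well above 0: both 'awesome' and 'thanks' are in VADER's lexicon
print(sia.polarity_scores('i want to switch the diet'))
# compound 0.0: as with TextBlob, these words are not in the lexicon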
Let’s see the new distribution of sentiments now:
### Checking the distribution of sentiments ###
import matplotlib.pyplot as plt
%matplotlib inline

print('Percentage per sentiment\n')
print(round(emo_pd['Analysis2'].value_counts(normalize=True)*100, 2))
round(emo_pd['Analysis2'].value_counts(normalize=True)*100, 2).plot(kind='bar')
plt.title('Percentage distribution by review type')
plt.show()
As we can see, there is only a slight difference between the results of TextBlob and VADER, while the rule-based system differs from both noticeably.
On closely examining the data, it turns out that some messages labelled positive or negative by the rule-based analyzer are actually neutral according to TextBlob or VADER.
Domain knowledge always gives a hand-crafted rule set an edge over pre-built libraries, but these libraries are trained on vast amounts of data that a simple rule-based system cannot match.
Once the data is labelled with sentiments, the text can be vectorized, trained with various algorithms, and used to make predictions.
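As a rough sketch of that last step (an illustrative choice of TF-IDF features and a logistic-regression classifier from scikit-learn, not the only possible pipeline):

# A minimal sketch: train a classifier on the cleaned text,
# using the rule-based labels from above as the target
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

X = emo_pd['cleaned_description2']   # cleaned message text
y = emo_pd['emotion']                # rule-based labels: 1 / -1
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

vec = TfidfVectorizer()
clf = LogisticRegression(max_iter=1000)
clf.fit(vec.fit_transform(X_train), y_train)
print(classification_report(y_test, clf.predict(vec.transform(X_test))))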
I hope the idea of sentiment analysis, and how to carry it out, is much clearer after reading this blog.
Thanks for reading!