Sentiment Analysis
This project is part of my university coursework. Here, I perform sentiment analysis, using word embeddings (Word2Vec) and a machine-learning classifier to predict whether a customer is satisfied with the product. I work with the Amazon Fine Food Reviews dataset, which provides attributes such as the review text, summary, score, and profile name. From these, I select only some attributes and split the data into a training set and a testing set. At the end, I report the accuracy of the model and discuss some challenges for improving it in the future.
- Import the data set
- Reshape and Explore Data
- Data Preprocessing
- Saving the file (df1.to_pickle("df1_clean.pkl"))
- Word Cloud
- Transform Text to Vector
- Cosine Similarity
- Heatmap
- Get a sample set
- Splitting into Train and Test Sets
- Word2Vec
- Core Process of Word2Vec
- Evaluate the model
- Confusion Matrix
#the dataset: Amazon Fine Food Reviews
import pandas as pd
import numpy as np
import seaborn as sns
from wordcloud import WordCloud
import matplotlib.pyplot as plt
import datetime
import nltk
# nltk.download('stopwords')   # needed once for the stopword list used below
# nltk.download('wordnet')     # needed once for the WordNetLemmatizer
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
import re
df = pd.read_csv('Reviews.csv')
df.sample(100000) # preview a random sample of 100,000 rows (display only, not assigned)
Initially, we need to explore the landscape of our data and decide which attributes are essential. We can also visualize the data to understand it better.
print([col for col in df])
# Disclaimer: in this project, I use only one-fifth of the dataset
df.shape
df0 = df.sample(frac = 0.20) # take a 20% sample of the dataset
df0 = df0[['Id','ProfileName','Score', 'Time', 'Summary', 'Text']] # keep only selected columns
df0.head()
id = np.arange(0,df0.shape[0])
id.shape
df0['id'] = id # insert the newly created sequential id
df0.set_index("id", inplace = True) # set it as the index column
df0.pop('Id') # drop the original Id column
df0
df1 = df0[['Time', 'ProfileName', 'Summary', 'Text', 'Score']]
df1.head(20)
df1.shape
# Goal: visualize the proportion of reviews categorized by score
score_prop = df1.groupby('Score')['Text'].count()/len(df1.Score)*100
round(score_prop)
# declaring data
data = score_prop.to_list()
keys = ['Score 1', 'Score 2', 'Score 3', 'Score 4', 'Score 5']
# define Seaborn color palette to use
palette_color = sns.color_palette('RdBu')
# plotting data on chart
plt.pie(data, labels=keys, colors=palette_color, autopct='%.0f%%')
# displaying chart
plt.show()
# NOTE: The chart is dominated by reviews with Score 5, which could lead to an imbalanced classification problem.
# displaying the full text of reviews
with pd.option_context('display.max_colwidth', None):
    display(df1)
df1['Time'] = df1['Time'].apply(lambda x : datetime.datetime.fromtimestamp(x))
df1.head()
df1.info()
# Create Sentiment Class
# Score 1-3: not satisfied
# Score 4-5: satisfied
df1['Satisfied'] = pd.cut(df1['Score'], bins =[0,3, float('inf')], labels =['not satisfied', 'satisfied'])
df1.iloc[::1000]
ax = df1['Satisfied'].value_counts().plot(kind='bar',
                                          figsize=(8,8),
                                          title="Customer Sentiment Extracted from Amazon Fine Food Reviews")
ax.set_xlabel("Sentiment of Customer")
ax.set_ylabel("Frequency")
plt.show()
Now, we will perform some pre-processing on the data before converting it into vectors and passing it to the machine learning model.
Objective: to reduce noise, which affects the accuracy of the model's predictions, and to make the text simpler for the model to classify.
Method (a small worked example follows this list):
1) use a regular expression to remove any characters that are not letters of the alphabet
2) convert the strings to lowercase
3) remove stopwords such as 'the', 'an', 'to'; these are considered noise that can make the model less precise
4) lemmatization: map different forms of a word to a common base form, e.g. chips -> chip
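As a quick illustration of these four steps, here is a minimal sketch on a made-up sentence (assuming the NLTK stopword and WordNet data have already been downloaded); it is not part of the pipeline itself.
import re
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

sample = "The chips were working well for parties, 100% recommended!"  # hypothetical sentence, not from the dataset
lm = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))

step1 = re.sub('[^a-zA-Z]', ' ', sample)                    # 1) keep alphabetic characters only
step2 = step1.lower()                                       # 2) lowercase
step3 = [w for w in step2.split() if w not in stop_words]   # 3) drop English stopwords
step4 = [lm.lemmatize(w) for w in step3]                    # 4) lemmatize each remaining token
print(' '.join(step4))  # roughly: "chip working well party recommended"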
#object of WordNetLemmatizer
#processing time: around 40 min
lm = WordNetLemmatizer()
def text_transformation(df_col):
    stop_words = set(stopwords.words('english')) # build the stopword set once instead of rebuilding it for every word
    corpus = []
    for item in df_col:
        new_item = re.sub('[^a-zA-Z]',' ',str(item)) # replace any non-alphabetic character with whitespace
        new_item = new_item.lower() # convert everything to lowercase
        new_item = new_item.split() # split each string on whitespace into a list of words
        # lemmatize the words and keep only those that are not English stopwords
        new_item = [lm.lemmatize(word) for word in new_item if word not in stop_words]
        corpus.append(' '.join(str(x) for x in new_item))
    return corpus
corpus = text_transformation(df1['Text'])
df1['text_clean'] = corpus # store the cleaned text as a new column (used below)
# Note: after cleaning the text, some unwanted elements remain (e.g. <br> tags),
# so we use regular expressions to get rid of them
pattern0 = r'<br />'
clean = []
for i in df1.text_clean:
    a = re.sub(pattern0, ' ', i)
    clean.append(a)
pattern1 = r'<br>'
clean1 = []
for i in clean:
    b = re.sub(pattern1, ' ', i)
    clean1.append(b)
pattern2 = r'\s(br)\s'
clean2 = []
for i in clean1:
    c = re.sub(pattern2, ' ', i)
    clean2.append(c)
df1['text_clean'] = clean2 # put the fully cleaned text back into the data frame
df1.to_pickle("df1_clean.pkl") # save the cleaned data frame (it is loaded again below)
print(stopwords.words('english'))
df1 = pd.read_pickle("df1_clean.pkl") # reading pkl file
tmp =df1.iloc[::10000, [3, 6]]
with pd.option_context('display.max_colwidth', None):
    display(tmp)
# preparing data for the word cloud visualization: join every cleaned review into one long string
word = df1['text_clean']
comment_words = " ".join(word) # single string containing all reviews, separated by spaces
len(comment_words)
type(comment_words)
comment_words[0:1000]
wordcloud = WordCloud(width = 1500, height = 1500,background_color ='white',min_font_size = 10).generate(comment_words)
plt.figure(figsize=(15, 10))
plt.imshow(wordcloud)
plt.title('High Frequency of Words Found in Customer Reviews')
#plt.savefig('wordcloud.png') # uncomment to save the figure as a PNG file
plt.show()
Next, we visualize the data with a heatmap. Assumption: reviews with different scores should occupy different positions in vector space, so we use a heatmap of pairwise cosine similarities to check whether reviews in different score ranges really are different in vector space.
A short note on what word embedding is
Word embeddings are texts converted into numbers, and the same text can have different numerical representations. In short, to build any machine learning or deep learning model, the input data ultimately has to be numerical, because models do not understand text or image data directly the way humans do. Vectorization, or word embedding, is therefore the process of converting text data into numerical vectors, and those vectors are then used to build machine learning models. In this way we extract features from text for natural language processing models. Broadly, word embeddings can be classified into two categories: frequency-based (statistical) word embeddings and prediction-based word embeddings.
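As a toy illustration of the frequency-based approach (the one used just below with CountVectorizer), two made-up reviews can be turned into count vectors like this; it is a sketch, not part of the pipeline.
from sklearn.feature_extraction.text import CountVectorizer

toy_corpus = ["great chip great flavor", "bad chip stale flavor"]  # hypothetical mini-reviews
vect = CountVectorizer()
toy_matrix = vect.fit_transform(toy_corpus)   # sparse document-term count matrix
print(vect.get_feature_names_out())           # vocabulary: ['bad' 'chip' 'flavor' 'great' 'stale']
print(toy_matrix.toarray())                   # [[0 1 1 2 0]
                                              #  [1 1 1 0 1]]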
Categorize the reviews into two groups: score 5, and score 3 or lower.
filter0 = df1['Score'] == 5
score_5 = df1[filter0]
# filter only the cleaned text of reviews with score < 4
filter1 = df1['Score'] < 4
score_1to3 = df1[filter1]
score_5 = score_5[['Time','ProfileName','text_clean','Score']].iloc[0:500] # keep the first 500 rows so both groups have the same number of rows
score_1to3 = score_1to3[['Time','ProfileName','text_clean','Score']].iloc[0:500] # keep the first 500 rows so both groups have the same number of rows
score_5
score_1to3
from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer()
tf_score5 = count_vect.fit_transform(score_5['text_clean'])
tf_score5
tf_score1to3 = count_vect.fit_transform(score_1to3['text_clean'])
tf_score1to3
tf_score1to3.shape
After transforming the text into vectors this way, the two sparse matrices have different shapes (each CountVectorizer was fitted on its own group, so the vocabularies differ), and we need to align them to the same shape before passing them to the cosine_similarity function; fitting a single vectorizer on both groups would be a cleaner alternative. Cosine similarity is a way to measure how close two data points are in vector space, and in our case we compute it between reviews and visualize the result with a heatmap.
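For reference, the cosine similarity of two vectors a and b is their dot product divided by the product of their norms. A quick sanity check on two small hypothetical count vectors (a sketch, not part of the pipeline):
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

a = np.array([[0, 1, 1, 2, 0]])   # hypothetical count vector of one review
b = np.array([[1, 1, 1, 0, 1]])   # hypothetical count vector of another review
# dot(a, b) = 2, ||a|| = sqrt(6), ||b|| = 2, so the similarity is 2 / (2 * sqrt(6)) ≈ 0.41
print(cosine_similarity(a, b))    # [[0.408...]]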
tf_score5=tf_score5[0:500, 0:3511].toarray() # truncate to 500 x 3511 and convert the sparse matrix to a dense array
tf_score1to3=tf_score1to3[0:500, 0:3511].toarray() # truncate to 500 x 3511 and convert the sparse matrix to a dense array
tf_score1to3
from sklearn.metrics.pairwise import cosine_similarity
cosinescore = cosine_similarity(tf_score5 ,tf_score1to3)
cosinescore
plot_z = cosinescore[0:40, 0:40]
import seaborn as sns
df_todraw = pd.DataFrame(plot_z)
plt.subplots(figsize=(20, 15))
ax = sns.heatmap(df_todraw,
cmap="YlGnBu",
vmin=0, vmax=1, annot=True, fmt='.1f')
plt.show()
# Note: the heatmap shows little similarity between these two groups of reviews, which is what we expect,
# because their scores fall in clearly different ranges.
Compare score-5 reviews with other score-5 reviews.
cosinescore5 = cosine_similarity(tf_score5 ,tf_score5)
cosinescore5
plot_zz = cosinescore5[0:40, 41:81]
plot_x = list(range(41,81))
import seaborn as sns
df_todraw2 = pd.DataFrame(plot_zz, columns = plot_x)
plt.subplots(figsize=(20, 15))
ax = sns.heatmap(df_todraw2,
cmap="YlGnBu",
vmin=0, vmax=1, annot=True, fmt='.1f')
plt.show()
# Note: when comparing score-5 reviews with other score-5 reviews, the similarity values are noticeably higher.
Pull out some pairs of reviews with high cosine similarity and see how they are similar.
score_5.iloc[3,2]
score_5.iloc[60,2]
score_5.iloc[33,2]
score_5.iloc[78,2]
score_5.iloc[36,2]
score_5.iloc[37,2] # score 0.0, this review is about cereal
Result: the pairs with cosine similarity around 0.4 and 0.3 are all positive reviews about chips.
Showing the proportion of reviews categorized by the 'satisfied' and 'not satisfied' labels.
ax = df1['Satisfied'].value_counts().plot(kind='bar',
                                          figsize=(8,8),
                                          title="Customer Sentiment Extracted from Amazon Fine Food Reviews")
ax.set_xlabel("Sentiment of Customer")
ax.set_ylabel("Frequency")
plt.show()
Because the customer reviews contain significantly more 'satisfied' than 'not satisfied' examples, we have an imbalanced sentiment class, which could bias the model. Dealing with that issue in depth is out of scope for this report, so we simply take an equal number of samples from each group.
# goal: eliminate the class imbalance by taking an equal number of rows per class
def get_top_data(top_n = 20000):
    top_data_df_positive = df1[df1['Satisfied'] == 'satisfied'].head(top_n)
    top_data_df_negative = df1[df1['Satisfied'] == 'not satisfied'].head(top_n)
    top_data_df_small = pd.concat([top_data_df_positive, top_data_df_negative])
    return top_data_df_small
df2 = get_top_data(top_n=20000)
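Note that get_top_data takes the first top_n rows of each class rather than a random sample. A random per-class draw, as described above, could look like the sketch below (the function name is hypothetical, and it assumes each class has at least n_per_class rows).
# sketch of random per-class sampling instead of taking the first rows
def get_balanced_sample(df, n_per_class=20000, seed=15):
    pos = df[df['Satisfied'] == 'satisfied'].sample(n=n_per_class, random_state=seed)
    neg = df[df['Satisfied'] == 'not satisfied'].sample(n=n_per_class, random_state=seed)
    return pd.concat([pos, neg])
# df2 = get_balanced_sample(df1)  # would replace the get_top_data call above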
ax = df2['Satisfied'].value_counts().plot(kind='bar',
                                          figsize=(8,8),
                                          title="Customer Sentiment Extracted from Amazon Fine Food Reviews")
ax.set_xlabel("Sentiment of Customer")
ax.set_ylabel("Frequency")
plt.show()
# the class-imbalance problem is gone
# separate the text into individual words (tokens); this helps when transforming text into numeric values
from gensim.utils import simple_preprocess
# Tokenize the text column to get the new column 'tokenized_text'
df2['tokenized_text'] = [simple_preprocess(line, deacc=True) for line in df2['text_clean']]
print(df2['tokenized_text'].head(10))
[col for col in df2]
from gensim.parsing.porter import PorterStemmer
porter_stemmer = PorterStemmer()
# Get the stemmed_tokens
df2['stemmed_tokens'] = [[porter_stemmer.stem(word) for word in tokens] for tokens in df2['tokenized_text'] ]
df2['stemmed_tokens'].head(10)
tmp = df2.iloc[::2000, [6, 7]]
with pd.option_context('display.max_colwidth', None):
    display(tmp)
df2
from sklearn.model_selection import train_test_split
# Train Test Split Function
def split_train_test(df2, test_size=0.3, shuffle_state=True):
    X_train, X_test, Y_train, Y_test = train_test_split(df2[['stemmed_tokens']],
                                                        df2['Satisfied'],
                                                        shuffle=shuffle_state,
                                                        test_size=test_size,
                                                        random_state=15)
    print("Value counts for Train sentiment")
    print(Y_train.value_counts())
    print('\n')
    print("Value counts for Test sentiments")
    print(Y_test.value_counts())
    print('\n')
    print(type(X_train))
    print(type(Y_train))
    print('\n')
    X_train = X_train.reset_index()
    X_test = X_test.reset_index()
    Y_train = Y_train.to_frame()
    Y_train = Y_train.reset_index()
    Y_test = Y_test.to_frame()
    Y_test = Y_test.reset_index()
    print(X_train.head())
    return X_train, X_test, Y_train, Y_test
X_train, X_test, Y_train, Y_test = split_train_test(df2)
X_train: the training part of the features (X)
X_test: the test part of the features (X)
Y_train: the training part of the labels (Y)
Y_test: the test part of the labels (Y)
More detail on splitting training and testing sets: https://realpython.com/train-test-split-python-data/#:~:text=x_train%20%3A%20The%20training%20part%20of,of%20the%20second%20sequence%20(%20y%20)
X_train
X_test
Y_train
Y_test
Feature Extraction: we will train a Word2Vec model on our corpus and use its word vectors as features for the classifier.
from gensim.models import Word2Vec
import time
# Skip-gram model (sg = 1)
vector_size=1000
window = 5
min_count = 1
workers = 3
sg = 1
word2vec_model_file = 'word2vec_' + str(vector_size) + '.model'
start_time = time.time()
stemmed_tokens = pd.Series(df2['stemmed_tokens']).values
# Train the Word2Vec Model
w2v_model = Word2Vec(stemmed_tokens, min_count = min_count,vector_size=vector_size ,workers = workers, window = window, sg = sg)
print("Time taken to train word2vec model: " + str(time.time() - start_time))
# Because this process can also take a long time, save the trained model to 'word2vec_model_file'
w2v_model.save(word2vec_model_file)
# the trained model can now capture relationships between words
# Load the model from the model file
w2v_model = Word2Vec.load(word2vec_model_file)
# Most Similar word
print(w2v_model.wv.most_similar('well'))
# The model now represents words with similar meanings by similar vectors
w2v_model.wv.similarity('good', 'worthwhil')
w2v_model.wv.doesnt_match(['good', 'charm', 'amazingli','bad','well'])
w2v_model.wv.similarity('bad', 'bitter')
w2v_model.wv.similarity('bad', 'good') # need to fix this
w2v_model.wv.most_similar(positive="bad")
w2v_model.wv.most_similar(positive="chip")
From here on, we work with the training set to fit the classifier before making predictions. We loop through X_train and X_test, which were split earlier, take the mean of the word vectors in each review, and use that mean vector as a representative of the review's overall tone.
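As a compact illustration of that averaging step (a sketch, assuming the trained w2v_model and the X_train frame from above), the document vector of a single review is just the element-wise mean of its word vectors:
import numpy as np

tokens = X_train.loc[0, 'stemmed_tokens']                         # stemmed tokens of the first training review
vectors = [w2v_model.wv[t] for t in tokens if t in w2v_model.wv]  # look up each token's learned vector
doc_vector = np.mean(vectors, axis=0) if vectors else np.zeros(w2v_model.vector_size)  # mean vector, or zeros for an empty review
print(doc_vector.shape)                                           # (1000,), matching vector_size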
# find the mean of the word vectors in each review and use it as a representative of that review's tone,
# then write the resulting vectors to a CSV file
word2vec_filename = 'train_review_word2vec.csv'
with open(word2vec_filename, 'w+') as word2vec_file:
    for index, row in X_train.iterrows():
        model_vector = (np.mean([w2v_model.wv[token] for token in row['stemmed_tokens']], axis=0)).tolist()
        if index == 0:
            header = ",".join(str(ele) for ele in range(1000))
            word2vec_file.write(header)
            word2vec_file.write("\n")
        # if the mean is a valid vector, write it; otherwise write a vector of zeros
        if type(model_vector) is list:
            line1 = ",".join([str(vector_element) for vector_element in model_vector])
        else:
            line1 = ",".join([str(0) for i in range(1000)])
        word2vec_file.write(line1)
        word2vec_file.write('\n')
# do the same for the test set: compute the mean vector of each review and write it to a CSV file
word2vec_filename = 'test_review_word2vec.csv'
with open(word2vec_filename, 'w+') as word2vec_file:
    for index, row in X_test.iterrows(): # iterrows() loops over each review so we can compute its mean vector
        model_vector = (np.mean([w2v_model.wv[token] for token in row['stemmed_tokens']], axis=0)).tolist()
        if index == 0:
            header = ",".join(str(ele) for ele in range(1000))
            word2vec_file.write(header)
            word2vec_file.write("\n")
        # if the mean is a valid vector, write it; otherwise write a vector of zeros
        if type(model_vector) is list:
            line1 = ",".join([str(vector_element) for vector_element in model_vector])
        else:
            line1 = ",".join([str(0) for i in range(1000)])
        word2vec_file.write(line1)
        word2vec_file.write('\n')
import time
#import RandomForestClassifier, this is the algorithm that will be used for classification
from sklearn.ensemble import RandomForestClassifier
# Load from the filename
trainvec = pd.read_csv('train_review_word2vec.csv') # training
testvec = pd.read_csv('test_review_word2vec.csv') # testing
#Initialize the model
forest_word2vec = RandomForestClassifier(n_estimators = 100)
start_time = time.time()
# Fit the model
forest_word2vec.fit(trainvec, Y_train['Satisfied']) # fit the random forest on the training vectors and labels
print("Time taken to fit the model with word2vec vectors: " + str(time.time() - start_time))
# the prediction for each review in the test set is either 'satisfied' or 'not satisfied'
result = forest_word2vec.predict(testvec)
result.shape
result[::10]
Y_test['Predict'] = result
Y_test['review'] = X_test['stemmed_tokens']
Y_test[::500]
from sklearn.metrics import classification_report, confusion_matrix
print(classification_report(Y_test['Satisfied'],result, zero_division=0))
cf_matrix = confusion_matrix(Y_test['Satisfied'], result)
cf_matrix
import seaborn as sns
ax = sns.heatmap(cf_matrix, annot=True, cmap='YlGn', fmt='.1f')
ax.set_title('Seaborn Confusion Matrix with labels\n\n');
ax.set_xlabel('\nPredicted Values')
ax.set_ylabel('Actual Values ');
## Tick labels - the list must be in alphabetical order, matching the class order used by confusion_matrix
ax.xaxis.set_ticklabels(['not satisfied','satisfied'])
ax.yaxis.set_ticklabels(['not satisfied','satisfied'])
## Display the visualization of the Confusion Matrix.
plt.show()
Special thanks to https://medium.com/swlh/sentiment-classification-using-word-embeddings-word2vec-aedf28fbb8ca and all the related helpful posts on Stack Overflow.