Most restaurants ask their customers for reviews, and the restaurant can use those reviews to improve.
The aim of this project is to predict whether a review is positive or negative. The project is implemented with Natural Language Processing and Naive Bayes in Python.
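The dataset consists of 1000 rows and 2 columns. The Review column holds the text of each review, and the second column holds the label that the model will learn to predict.
Let’s get our environment ready with the libraries we’ll need and then import the data!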
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
Check out the Data
# quoting=3 (csv.QUOTE_NONE) makes pandas treat quote characters as ordinary text
df = pd.read_csv('~/DataSet GitHub/NLP/Restaurant_Reviews.tsv', delimiter='\t', quoting=3)
df.head(10)
Let’s clean the text of the first review in our dataset with NLP. The first step is to keep only the letters and replace every other character with a space.
import re
# Select the first review with [0]; re.sub expects a single string, not the whole column
review = re.sub('[^a-zA-Z]', ' ', df['Review'][0])
review
The second step for cleaning the text is to put all the letters of the review in lowercase.
review = review.lower()
review
The third step is to split the review into individual words.
review = review.split()
review
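To make the first three steps concrete, here is a minimal sketch that runs them on one made-up sentence. The sample text is an assumption for illustration only and is not read from the dataset.

```python
import re

# Made-up sample sentence, used only for illustration
sample = "Wow... Loved this place."

# Step 1: replace everything that is not a letter with a space
letters_only = re.sub('[^a-zA-Z]', ' ', sample)

# Step 2: put all the letters in lowercase
lowered = letters_only.lower()

# Step 3: split the text into a list of words
tokens = lowered.split()

print(tokens)  # ['wow', 'loved', 'this', 'place']
```

Note that split() with no arguments also swallows the extra spaces left behind by the punctuation.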
The fourth step is to remove all the insignificant words (stop words), which are not relevant for predicting whether the review is positive or negative, and then apply stemming to the words that remain.
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
ps = PorterStemmer()
# Build the stop word set once instead of once per word
stop_words = set(stopwords.words('english'))
review = [ps.stem(word) for word in review if word not in stop_words]
review
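The stop word half of this step can be sketched without NLTK. The tiny STOP_WORDS set and the remove_stop_words helper below are hypothetical stand-ins (NLTK's English list is much longer), and stemming is left out because it needs the real PorterStemmer.

```python
# Hypothetical, hand-picked stop words; NLTK's English list is much longer
STOP_WORDS = {"this", "the", "is", "a", "and", "was"}

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Keep only the tokens that are not in the stop word set."""
    return [token for token in tokens if token not in stop_words]

tokens = ["wow", "loved", "this", "place"]
filtered = remove_stop_words(tokens)
print(filtered)  # ['wow', 'loved', 'place']
```

Using a set for the stop words keeps each membership check fast, which matters once the same filter runs over every review.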
In the fifth step, we will convert the list we created back into a string by joining all the words together.
review = ' '.join(review)
review
So far we have cleaned the first review of the dataset. Let’s apply the same steps to all the customers’ reviews.
corpus = []
ps = PorterStemmer()
# Build the stop word set once, outside the loop
stop_words = set(stopwords.words('english'))
for i in range(0, 1000):
    review = re.sub('[^a-zA-Z]', ' ', df['Review'][i])
    review = review.lower()
    review = review.split()
    review = [ps.stem(word) for word in review if word not in stop_words]
    review = ' '.join(review)
    corpus.append(review)
The next step is to create a Bag of Words model to prepare our data for predicting whether a review is positive or negative.
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(max_features=1500)
X = cv.fit_transform(corpus).toarray()
y = df.iloc[:, 1].values
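To show what a Bag of Words matrix looks like, here is a rough pure-Python sketch. The bag_of_words helper and the three-review toy corpus are assumptions made for illustration; this is not how CountVectorizer is implemented, and details such as tokenization and tie-breaking under max_features may differ.

```python
from collections import Counter

def bag_of_words(corpus, max_features=None):
    """Return (vocabulary, matrix) of word counts for already-cleaned reviews."""
    # Count how often each word occurs across the whole corpus
    totals = Counter(word for doc in corpus for word in doc.split())
    # Keep the most frequent words, then order the columns alphabetically
    kept = [word for word, _ in totals.most_common(max_features)]
    vocabulary = sorted(kept)
    matrix = []
    for doc in corpus:
        counts = Counter(doc.split())
        # One row per review, one column per vocabulary word
        matrix.append([counts[word] for word in vocabulary])
    return vocabulary, matrix

# Toy corpus of cleaned reviews, made up for illustration
toy_corpus = ["wow love place", "crust good", "love food good place"]
vocabulary, toy_X = bag_of_words(toy_corpus)
print(vocabulary)
for row in toy_X:
    print(row)
```

Each row is one review and each column is one word, so most entries are zero. That is why a cap such as max_features=1500 is useful for keeping the matrix manageable.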
Training a Naive Bayes Model
Now let’s split the data into a training set and a test set. We will train our model on the training set and then use the test set to evaluate the model.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=0)
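For intuition about what an 80/20 split means for 1000 reviews, here is a small standard-library sketch. The split_indices helper is hypothetical, and its shuffle will not reproduce the exact rows that train_test_split picks with random_state = 0; only the sizes are comparable.

```python
import random

def split_indices(n_rows, test_size=0.20, seed=0):
    """Shuffle the row indices and cut them into train and test parts."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_test = int(round(n_rows * test_size))
    # First n_test shuffled indices become the test set, the rest the training set
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = split_indices(1000)
print(len(train_idx), len(test_idx))  # 800 200
```

So the model is trained on 800 reviews and evaluated on the remaining 200.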
This step fits the Naive Bayes classifier to the training set.
from sklearn.naive_bayes import GaussianNB
classifier = GaussianNB()
classifier.fit(X_train, y_train)
Predictions and Evaluations
Let’s predict the labels for the test set.
y_pred = classifier.predict(X_test)
Making the Confusion Matrix
The confusion matrix contains the correct predictions that our model made on the test set as well as the incorrect predictions.
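To see how such a matrix is filled in, here is a minimal sketch for two classes. The confusion_counts helper and the toy labels are made up for illustration; the layout (rows for actual labels, columns for predicted labels) follows the convention that scikit-learn's confusion_matrix uses.

```python
def confusion_counts(y_true, y_pred):
    """Build a 2x2 matrix: rows are actual labels, columns are predicted labels."""
    matrix = [[0, 0], [0, 0]]
    for actual, predicted in zip(y_true, y_pred):
        matrix[actual][predicted] += 1
    return matrix

# Toy labels, made up for illustration (0 and 1 are the two classes)
toy_true = [0, 0, 1, 1, 1, 0]
toy_pred = [0, 1, 1, 1, 0, 0]
toy_cm = confusion_counts(toy_true, toy_pred)
print(toy_cm)  # [[2, 1], [1, 2]]
```

The diagonal holds the correct predictions, and the off-diagonal cells hold the incorrect ones.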
from sklearn.metrics import confusion_matrix, classification_report
cm = confusion_matrix(y_test, y_pred)
class_names = [0, 1]  # name of classes
fig, ax = plt.subplots()
tick_marks = np.arange(len(class_names))
plt.xticks(tick_marks, class_names)
plt.yticks(tick_marks, class_names)
# create heatmap
sns.heatmap(pd.DataFrame(cm), annot=True, cmap="BuPu", fmt='g')
ax.xaxis.set_label_position("top")
plt.tight_layout()
plt.title('Confusion matrix', y=1.1)
plt.ylabel('Actual label')
plt.xlabel('Predicted label')
So this confusion matrix gathers all the correct and incorrect predictions for the reviews in the test set.
55 and 91 are the counts of correct predictions, and 12 and 42 are the counts of incorrect predictions. So we can see that we have 146 correct predictions and 54 incorrect predictions out of 200 test reviews.
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))
The accuracy of the model is 73% (146 correct predictions out of 200 test reviews).
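As a sanity check, the accuracy can be recomputed directly from the four counts read off the confusion matrix (55 and 91 correct, 12 and 42 incorrect). This is plain arithmetic on the numbers quoted in this post; the variable names are only illustrative.

```python
# Counts quoted from the confusion matrix in this post
correct = [55, 91]
incorrect = [12, 42]

total = sum(correct) + sum(incorrect)
accuracy = sum(correct) / total

print(total)              # 200
print(f"{accuracy:.2%}")  # 73.00%
```

The total of 200 matches the 20% test split of the 1000 reviews.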