Prediction of Spam Messages

In this project, we aim to predict whether a text message is spam or ham (legitimate). We use natural language processing, TF-IDF, and a Random Forest classifier in Python.

The data contains the following columns:

• Message: text message
• Category: Spam or Ham
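To make the layout concrete, here is a minimal sketch of what such a file looks like. The rows are invented for illustration and are not taken from the dataset, and the standard-library `csv` module is used only so the snippet runs without the data file.

```python
import csv
import io

# Hypothetical sample rows, invented for illustration (not from the real dataset).
SAMPLE_CSV = """Category,Message
ham,"Are we still meeting for lunch today?"
spam,"WINNER! Claim your free prize now, call 0800 000 000"
ham,"Ok, see you at 6"
"""

# Each row becomes a dict keyed by the two column names.
rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
for row in rows:
    print(f"{row['Category']:>4} | {row['Message']}")
```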


Let’s get our environment ready with the libraries we’ll need and then import the data!

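``````import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import nltk
from sklearn.metrics import confusion_matrix

%matplotlib inline
# On Matplotlib 3.6 and later this style is named 'seaborn-v0_8-deep'
plt.style.use('seaborn-deep')
# The stop-word list used later has to be downloaded once
nltk.download('stopwords')``````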

Check out the Data

``````df = pd.read_csv('/kaggle/input/spam-text-message-classification/SPAM text message 20170820 - Data.csv')
df.info()``````


Exploratory Data Analysis

Let’s use describe grouped by Category. This way we can begin to think about the features that separate ham and spam!

``df.groupby('Category').describe()``

Let’s make a new column that records how long each text message is:

``````df['Length'] = df['Message'].apply(len)``````

Let’s see the percentage of ham and spam in our dataset:

``````counts = df['Category'].value_counts()
explode = (0.1, 0)
fig1, ax1 = plt.subplots(figsize=(12, 7))
# Take the labels from the counts so they always match the slices
ax1.pie(counts, explode=explode, labels=counts.index, autopct='%1.1f%%')
# Equal aspect ratio ensures that pie is drawn as a circle
ax1.axis('equal')
plt.tight_layout()
plt.legend()
plt.show()``````
``````plt.figure(figsize=(10,6))
df['Length'].plot.hist(bins = 150)``````
``df['Length'].describe()``
``df[df['Length'] == 910]['Message'].iloc[0]``


Text Cleaning

Let’s clean the text of the messages in our dataset with NLP.

``````import string
from nltk.corpus import stopwords``````

Let’s create a function that removes all punctuation, removes all stopwords, and returns a list of the cleaned words:

``````# Build the stop-word set once; a set lookup is much faster than
# calling stopwords.words('english') for every word
STOP_WORDS = set(stopwords.words('english'))

def text_process(mess):
    """
    Takes in a string of text, then performs the following:
    1. Remove all punctuation
    2. Remove all stopwords
    3. Returns a list of the cleaned text
    """
    # Check characters to see if they are in punctuation
    nopunc = [char for char in mess if char not in string.punctuation]

    # Join the characters again to form the string.
    nopunc = ''.join(nopunc)

    # Now just remove any stopwords
    return [word for word in nopunc.split() if word.lower() not in STOP_WORDS]``````

Check to make sure it’s working:

``df['Message'].head(10).apply(text_process)``
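To see the two cleaning steps on a single string without downloading anything, here is a small stand-alone sketch. `demo_text_process` and `DEMO_STOP_WORDS` are hypothetical names, and the tiny stop-word set is an assumption standing in for NLTK’s much longer English list.

```python
import string

# Assumption: a tiny hand-picked stop-word set standing in for NLTK's English list.
DEMO_STOP_WORDS = {"a", "the", "is", "to", "you", "your", "for", "and"}

def demo_text_process(message):
    """Strip punctuation, then drop stop words (case-insensitive)."""
    no_punct = "".join(ch for ch in message if ch not in string.punctuation)
    return [w for w in no_punct.split() if w.lower() not in DEMO_STOP_WORDS]

tokens = demo_text_process("Congratulations! You won a FREE ticket to the finals.")
print(tokens)
```

Note that the original casing of the kept words is preserved; only the stop-word comparison is done in lower case.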


Vectorization

Now we have the messages as lists of tokens, and we need to convert each message into a vector that scikit-learn’s models can work with.

``````from sklearn.feature_extraction.text import CountVectorizer
bow_transformer = CountVectorizer(analyzer=text_process).fit(df['Message'])``````

Print the total number of vocabulary words:

``print(len(bow_transformer.vocabulary_))``

Let’s take one text message and get its bag-of-words counts as a vector, putting to use our new bow_transformer

``````message4 = df['Message'][3]
print(message4)``````

Now let’s see its vector representation

``````bow4 = bow_transformer.transform([message4])
print(bow4)
print(bow4.shape)``````
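The printed vector lists only the non-zero entries, as column index and count pairs. Here is a rough plain-Python sketch of the same idea; it is not how CountVectorizer is implemented, and the tokenized messages and the `bag_of_words` helper are invented for illustration.

```python
from collections import Counter

# Invented, already-tokenized messages (hypothetical data).
docs = [
    ["free", "prize", "call", "now"],
    ["call", "me", "later"],
    ["free", "free", "entry", "now"],
]

# Map each distinct word to a column index (sorted so the result is reproducible).
vocabulary = {word: idx for idx, word in enumerate(sorted({w for d in docs for w in d}))}

def bag_of_words(tokens):
    """Return a sparse {column_index: count} mapping for one message."""
    counts = Counter(tokens)
    return {vocabulary[w]: c for w, c in sorted(counts.items()) if w in vocabulary}

print(vocabulary)
print(bag_of_words(docs[2]))
```

Words that never occurred in the fitted vocabulary are simply ignored, which mirrors what happens when a fitted vectorizer meets an unseen word.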

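Let’s see which words appear twice in this message. The two indices below come from the vector printed for this dataset and will differ with another vocabulary:

``````# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
feature_names = bow_transformer.get_feature_names_out()
print(feature_names[4066])
print(feature_names[9551])``````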

Now let’s transform the entire DataFrame of messages and create a sparse matrix:

``````messages_bow = bow_transformer.transform(df['Message'])
print('Shape of Sparse Matrix: ', messages_bow.shape)
print('Amount of Non-Zero occurrences: ', messages_bow.nnz)``````
``````# Percentage of matrix cells that are non-zero
sparsity = (100.0 * messages_bow.nnz / (messages_bow.shape[0] * messages_bow.shape[1]))
print('sparsity: {}'.format(sparsity))``````


TF-IDF

Now let’s compute term weighting and do normalisation with TF-IDF:

``````from sklearn.feature_extraction.text import TfidfTransformer
tfidf_transformer = TfidfTransformer().fit(messages_bow)
# Despite its name, tfidf4 holds the TF-IDF matrix for all messages
tfidf4 = tfidf_transformer.transform(messages_bow)
print(tfidf4)``````
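To get a feel for what the weighting does, here is a small plain-Python sketch on invented counts. It assumes the smoothed formula documented for TfidfTransformer’s default settings, idf = ln((1 + n) / (1 + df)) + 1, followed by scaling each row to unit L2 norm. The data and the `tfidf_row` helper are hypothetical.

```python
import math

# Invented term counts: rows are messages, columns are words (hypothetical data).
words = ["free", "call", "now"]
counts = [
    [2, 0, 1],
    [0, 1, 1],
    [0, 1, 0],
]

n_docs = len(counts)
# Document frequency: in how many messages each word appears.
doc_freq = [sum(1 for row in counts if row[j] > 0) for j in range(len(words))]
# Smoothed IDF, assuming the default smooth_idf=True behaviour.
idf = [math.log((1 + n_docs) / (1 + df_j)) + 1 for df_j in doc_freq]

def tfidf_row(row):
    """Weight raw counts by IDF, then scale the row to unit L2 norm."""
    weighted = [tf * w for tf, w in zip(row, idf)]
    norm = math.sqrt(sum(v * v for v in weighted))
    return [v / norm if norm else 0.0 for v in weighted]

tfidf = [tfidf_row(row) for row in counts]
for row in tfidf:
    print([round(v, 3) for v in row])
```

The word that occurs in only one message (“free”) gets a larger IDF than the words that occur in two, so it counts for more in the final vector.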


Training a Random Forest model

``````from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier(n_estimators=10, criterion='entropy',random_state=0)
classifier.fit(tfidf4, df['Category'])``````

Let’s try classifying a single message (the one at index 3) and checking how we do:

``````# predict() returns one label per row, so pick row 3 to match df.Category[3]
print('predicted:', classifier.predict(tfidf4)[3])
print('expected:', df.Category[3])``````


Model Evaluation

Let’s check out the accuracy of our model on the entire dataset:

``````# The classifier was trained on TF-IDF features, so predict on those as well
all_predictions = classifier.predict(tfidf4)
print(all_predictions)``````

Let’s create a classification report:

``````from sklearn.metrics import classification_report
print(classification_report(df['Category'], all_predictions))``````

In the above evaluation, we measured accuracy on the same data we used for training. You should never evaluate a model on the dataset it was trained on, because the score says little about how it handles unseen messages. The proper way is to split the data into a training set and a test set.
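A toy illustration of why this matters: the hypothetical `Memorizer` below simply remembers every training label. The messages and their random labels are invented, so there is nothing real to learn, yet the training score still looks perfect.

```python
import random
from collections import Counter

random.seed(0)
# Invented messages with random labels (hypothetical data, nothing real to learn).
data = [(f"message {i}", random.choice(["ham", "spam"])) for i in range(200)]
train, test = data[:160], data[160:]

class Memorizer:
    """Toy model: remembers training labels, guesses the majority class otherwise."""

    def fit(self, pairs):
        self.lookup = dict(pairs)
        self.default = Counter(label for _, label in pairs).most_common(1)[0][0]
        return self

    def predict(self, message):
        return self.lookup.get(message, self.default)

def accuracy(model, pairs):
    """Fraction of (message, label) pairs the model labels correctly."""
    return sum(model.predict(m) == y for m, y in pairs) / len(pairs)

model = Memorizer().fit(train)
print("train accuracy:", accuracy(model, train))
print("test accuracy:", accuracy(model, test))
```

Only the score on the held-out messages tells us how the model behaves on text it has not seen.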



Train Test Split

``````from sklearn.model_selection import train_test_split

msg_train, msg_test, label_train, label_test = \
train_test_split(df['Message'], df['Category'], test_size=0.2)

print(len(msg_train), len(msg_test), len(msg_train) + len(msg_test))``````


Creating a Data Pipeline

Let’s run our model again and then predict on the test set. We will create and use a pipeline for this purpose:

``````from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ('bow', CountVectorizer(analyzer=text_process)),  # strings to token integer counts
    ('tfidf', TfidfTransformer()),  # integer counts to weighted TF-IDF scores
    ('classifier', RandomForestClassifier()),  # train on TF-IDF vectors w/ Random Forest
])``````
``pipeline.fit(msg_train,label_train)``
``predictions = pipeline.predict(msg_test)``
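A nice property of the pipeline is that raw strings go in and labels come out, so new messages can be classified in one call. Here is a minimal self-contained sketch; the training messages are invented and far too few for a real model, and it assumes scikit-learn’s default word tokenizer instead of our text_process function.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline

# Invented training messages (hypothetical data, far too small for a real model).
train_messages = [
    "Free prize! Call now to claim your reward",
    "Win cash now, free entry, text WIN",
    "Are we still meeting for lunch today?",
    "Can you pick up the kids after school?",
]
train_labels = ["spam", "spam", "ham", "ham"]

demo_pipeline = Pipeline([
    ("bow", CountVectorizer()),      # default tokenizer, not text_process
    ("tfidf", TfidfTransformer()),   # counts to TF-IDF weights
    ("classifier", RandomForestClassifier(n_estimators=10, random_state=0)),
])
demo_pipeline.fit(train_messages, train_labels)

# Raw strings are vectorized, weighted and classified in a single call.
new_messages = ["Claim your free cash prize now", "See you at lunch"]
demo_predictions = demo_pipeline.predict(new_messages)
print(list(zip(new_messages, demo_predictions)))
```

Because the vectorizer and the TF-IDF step are fitted inside the pipeline, they only ever see the training messages, which keeps the test set clean.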


Making Confusion Matrix

The confusion matrix contains the counts of correct and incorrect predictions that our model made on the test set, broken down by class.

``````from sklearn.metrics import confusion_matrix, classification_report
class_names = sorted(df['Category'].unique())  # names of the classes
cm = confusion_matrix(label_test, predictions, labels=class_names)
fig, ax = plt.subplots()
# create heatmap; rows are actual labels, columns are predicted labels
sns.heatmap(pd.DataFrame(cm, index=class_names, columns=class_names),
            annot=True, cmap="BuPu", fmt='g', ax=ax)
ax.xaxis.set_label_position("top")
plt.tight_layout()
plt.title('Confusion matrix', y=1.1)
plt.ylabel('Actual label')
plt.xlabel('Predicted label')``````

Create the classification report. Note that the true labels come first and the predictions second:

``print(classification_report(label_test, predictions))``
``````from sklearn import metrics
print("Accuracy:", metrics.accuracy_score(label_test, predictions))``````
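The numbers in the report follow directly from the four cells of the confusion matrix. Here is a small plain-Python sketch with invented labels, treating spam as the positive class; `confusion_counts` is a hypothetical helper, not a scikit-learn function.

```python
# Invented labels for six messages (hypothetical data).
actual = ["ham", "ham", "spam", "spam", "ham", "spam"]
predicted = ["ham", "spam", "spam", "ham", "ham", "spam"]

def confusion_counts(y_true, y_pred, positive="spam"):
    """Count true/false positives and negatives for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    return tp, fp, fn, tn

tp, fp, fn, tn = confusion_counts(actual, predicted)
precision = tp / (tp + fp)  # of the messages flagged as spam, how many were spam
recall = tp / (tp + fn)     # of the real spam messages, how many were caught
accuracy = (tp + tn) / len(actual)
print(tp, fp, fn, tn)
print(precision, recall, accuracy)
```

For a spam filter, a false positive (a real message sent to the spam folder) is usually the more costly mistake, so precision on the spam class deserves a close look.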