In this project, we aim to predict whether a text message is spam or ham. We implement natural language processing with TF-IDF features and a Random Forest classifier in Python.
The data contains the following columns:
- Message: text message
- Category: Spam or Ham
Let’s get our environment ready with the libraries we’ll need, and then import the data!
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
plt.style.use('seaborn-deep')
from sklearn.metrics import confusion_matrix
import nltk
Check out the Data
df = pd.read_csv('/kaggle/input/spam-text-message-classification/SPAM text message 20170820 - Data.csv')
df.head()

df.info()

Exploratory Data Analysis
Let’s use describe() grouped by Category; this way we can begin to think about the features that separate ham from spam!
df.groupby('Category').describe()

Let’s make a new column recording how long each text message is
df['Length'] = df['Message'].apply(len)
df.head()

Let’s see the percentage of ham and spam in our dataset
explode = (0.1,0)
fig1, ax1 = plt.subplots(figsize=(12,7))
ax1.pie(df['Category'].value_counts(), explode=explode, labels=['ham', 'spam'],
        autopct='%1.1f%%', shadow=True)
# Equal aspect ratio ensures that pie is drawn as a circle
ax1.axis('equal')
plt.tight_layout()
plt.legend()
plt.show()
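For the exact numbers behind the pie chart, value_counts() gives the raw class counts:
print(df['Category'].value_counts())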

plt.figure(figsize=(10,6))
df['Length'].plot.hist(bins = 150)

df['Length'].describe()

Let’s read the longest message in full
df[df['Length'] == 910]['Message'].iloc[0]  # 910 is the max length reported by describe() above

Text Cleaning
Let’s clean the text of the messages in our dataset using NLTK.
import string
from nltk.corpus import stopwords
nltk.download('stopwords')  # download the stopword list once (no-op if already present)
Let’s create a function that removes all punctuation, removes all stopwords, and returns a list of the cleaned words
def text_process(mess):
    """
    Takes in a string of text, then performs the following:
    1. Remove all punctuation
    2. Remove all stopwords
    3. Returns a list of the cleaned text
    """
    # Check characters to see if they are in punctuation
    nopunc = [char for char in mess if char not in string.punctuation]
    # Join the characters again to form the string
    nopunc = ''.join(nopunc)
    # Now just remove any stopwords
    return [word for word in nopunc.split() if word.lower() not in stopwords.words('english')]
Check to make sure it’s working
df['Message'].head(10).apply(text_process)

Vectorization
Now we have the messages as lists of tokens, and we need to convert each of those messages into a vector that scikit-learn’s models can work with.
from sklearn.feature_extraction.text import CountVectorizer
bow_transformer = CountVectorizer(analyzer=text_process).fit(df['Message'])
Print the total number of words in the vocabulary
print(len(bow_transformer.vocabulary_))
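We can also look up the column index assigned to any particular token. The word 'free' below is just an illustrative choice; .get() returns None if the token is absent:
print(bow_transformer.vocabulary_.get('free'))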

Let’s take one text message and get its bag-of-words counts as a vector, putting our new bow_transformer to use
message4 = df['Message'][3]
print(message4)

Now let’s see its vector representation
bow4 = bow_transformer.transform([message4])
print(bow4)
print(bow4.shape)

Two of the tokens appear twice in this message; let’s look them up by their indices
# Note: get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() is the current API
print(bow_transformer.get_feature_names_out()[4066])
print(bow_transformer.get_feature_names_out()[9551])

Now let’s transform the entire DataFrame of messages and create a sparse matrix
messages_bow = bow_transformer.transform(df['Message'])
print('Shape of Sparse Matrix: ', messages_bow.shape)
print('Amount of Non-Zero occurences: ', messages_bow.nnz)

sparsity = (100.0 * messages_bow.nnz / (messages_bow.shape[0] * messages_bow.shape[1]))
print('sparsity: {:.2f}%'.format(sparsity))

TF-IDF
Now let’s apply TF-IDF term weighting and normalisation
from sklearn.feature_extraction.text import TfidfTransformer
tfidf_transformer = TfidfTransformer().fit(messages_bow)
messages_tfidf = tfidf_transformer.transform(messages_bow)
print(messages_tfidf.shape)
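As a quick sanity check on the learned weights, we can inspect the IDF of a single token. The token 'u' is assumed to be in the SMS vocabulary here; any other word would do:
print(tfidf_transformer.idf_[bow_transformer.vocabulary_['u']])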

Training a Random Forest model
from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier(n_estimators=10, criterion='entropy', random_state=0)
classifier.fit(messages_tfidf, df['Category'])

Let’s try classifying our single message from earlier and check how we do:
print('predicted:', classifier.predict(tfidf_transformer.transform(bow4))[0])
print('expected:', df['Category'][3])

Model Evaluation
Let’s check the accuracy of our model on the entire dataset
all_predictions = classifier.predict(messages_tfidf)
print(all_predictions)
Let’s create a classification report
from sklearn.metrics import classification_report
print(classification_report(df['Category'], all_predictions))

In the evaluation above, we measured accuracy on the same data we used for training. You should never actually evaluate on the same dataset you train on! The proper way is to split the data into a training set and a test set.
Train Test Split
from sklearn.model_selection import train_test_split
msg_train, msg_test, label_train, label_test = \
train_test_split(df['Message'], df['Category'], test_size=0.2)
print(len(msg_train), len(msg_test), len(msg_train) + len(msg_test))
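Note that train_test_split shuffles at random, so the numbers above change on every run. A reproducible, class-balanced variant would pass random_state and stratify (both are optional additions, not part of the original run):
msg_train, msg_test, label_train, label_test = train_test_split(
    df['Message'], df['Category'],
    test_size=0.2, random_state=42, stratify=df['Category'])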

Creating a Data Pipeline
Let’s train our model again and then predict on the test set. We’ll create and use a pipeline for this purpose.
from sklearn.pipeline import Pipeline
pipeline = Pipeline([
    ('bow', CountVectorizer(analyzer=text_process)),  # strings to token integer counts
    ('tfidf', TfidfTransformer()),  # integer counts to weighted TF-IDF scores
    ('classifier', RandomForestClassifier()),  # train a Random Forest on the TF-IDF vectors
])
pipeline.fit(msg_train, label_train)
predictions = pipeline.predict(msg_test)
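Because the pipeline bundles the vectorizer, it accepts raw strings directly. A quick sketch with a made-up message (the text below is hypothetical):
sample = ["WINNER!! You have been selected to receive a free prize, call now"]
print(pipeline.predict(sample))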
Making Confusion Matrix
The confusion matrix contains the correct predictions our model made on the test set as well as the incorrect ones.
from sklearn.metrics import confusion_matrix, classification_report
cm = confusion_matrix(label_test, predictions)
class_names = ['ham', 'spam']  # confusion_matrix orders the labels alphabetically
fig, ax = plt.subplots()
# create heatmap, labelling both axes with the class names
sns.heatmap(pd.DataFrame(cm, index=class_names, columns=class_names), annot=True, cmap="BuPu", fmt='g')
ax.xaxis.set_label_position("top")
plt.tight_layout()
plt.title('Confusion matrix', y=1.1)
plt.ylabel('Actual label')
plt.xlabel('Predicted label')

Create the classification report
print(classification_report(label_test, predictions))

from sklearn import metrics
print("Accuracy:", metrics.accuracy_score(label_test, predictions))

