Prediction of Online Shopper’s Intention

In this project, we aim to predict the intention of online shoppers, that is, whether a browsing session will end in a purchase. Using Python, we apply Linear Discriminant Analysis to reduce the dimensionality of the dataset and Logistic Regression to build the classification model.

The data contains the following columns:

  • Administrative: Number of administrative pages visited in the session
  • Administrative_Duration: Time spent on administrative pages
  • Informational: Number of informational pages visited in the session
  • Informational_Duration: Time spent on informational pages
  • ProductRelated: Number of product-related pages visited in the session
  • ProductRelated_Duration: Time spent on product-related pages
  • BounceRates: Bounce Rates of a web page
  • ExitRates: Exit rate of a web page
  • PageValues: Page values of each web page
  • SpecialDay: Closeness of the visit to a special day (e.g. Valentine’s Day)
  • OperatingSystems: Operating system used
  • Browser: Browser used
  • Month: Month of the year
  • Region: Region of the user
  • VisitorType: Types of Visitor
  • Weekend: Weekend or not
  • Revenue: Whether the session generated revenue (the target variable)


Let’s get our environment ready with the libraries we’ll need and then import the data!

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

Check out the Data

df = pd.read_csv('~/DataSet GitHub/LDA/online_shoppers_intention.csv')

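Let’s visualise the missing data in the columns!

import missingno as msno
msno.matrix(df)  # assumed call (the original plotting line is missing): gaps in a column mark missing values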

The next step is to drop the rows that contain null values.

df = df.dropna()  # drop every row that contains a NaN value
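Before dropping rows, it can help to check how many values are missing and how many rows the drop will cost. The sketch below uses a small made-up frame (`toy` and its values are illustrative, not taken from the dataset) to show the idea.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real dataset (values are made up).
toy = pd.DataFrame({
    "BounceRates": [0.20, np.nan, 0.00, 0.05],
    "ExitRates": [0.20, 0.10, np.nan, 0.05],
    "Revenue": [False, True, False, True],
})

missing_per_column = toy.isnull().sum()  # NaN count for each column
rows_before = len(toy)
cleaned = toy.dropna()                   # drop every row with at least one NaN
rows_dropped = rows_before - len(cleaned)

print(missing_per_column)
print("Rows dropped:", rows_dropped)
```

If the number of dropped rows is small relative to the dataset, dropping is usually a reasonable choice; otherwise imputation may be worth considering.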

Next, convert the boolean columns: switch True/False to 1/0 in the Weekend and Revenue columns.

df.Weekend = df.Weekend.astype(int)
df.Revenue = df.Revenue.astype(int)



Let’s visualise the data summary

plt.title("Data summary")

Let’s see the frequency of revenue in our dataset.

print('No:', len(df[df.Revenue == 0]))
print('Yes:', len(df[df.Revenue == 1]))
y = len(df[df.Revenue == 0]), len(df[df.Revenue == 1])
x = ['No', 'Yes']
plt.bar(x, y, color = 'hotpink')  # bar chart of sessions without / with revenue

Let’s see the percentage of different visitors in the dataset!

plt.rcParams['figure.figsize'] = (20, 10)
size = [10551, 1694, 85]
colors = ['mediumseagreen', 'coral', 'yellow']
labels = "Returning Visitor", "New_Visitor", "Others"
explode = [0, 0, 0.1]
plt.subplot(1, 2, 1)
plt.pie(size, colors = colors, labels = labels, explode = explode, shadow = True, autopct = '%.2f%%')
plt.title('Different Visitors', fontsize = 20)
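The pie chart above uses hard-coded counts. As a hedged alternative, the sizes and labels could be derived from the data with `value_counts()`, so they stay correct if the dataset changes. The series below is a made-up stand-in for `df['VisitorType']`, and the category names in it are assumptions.

```python
import pandas as pd

# Made-up stand-in for df['VisitorType']; category names are assumed.
visitors = pd.Series(
    ["Returning_Visitor"] * 5 + ["New_Visitor"] * 2 + ["Other"],
    name="VisitorType",
)

visitor_counts = visitors.value_counts()  # sorted from most to least frequent
size = visitor_counts.tolist()            # wedge sizes for plt.pie
labels = visitor_counts.index.tolist()    # matching wedge labels

print(size)
print(labels)
```

The resulting `size` and `labels` can be passed to `plt.pie` in place of the literal lists.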

Let’s visualise weekend vs Revenue!

weekend_revenue = pd.crosstab(df['Weekend'], df['Revenue'])  # separate name so df is not overwritten
weekend_revenue.div(weekend_revenue.sum(axis = 1).astype(float), axis = 0).plot(kind = 'bar', stacked = True, figsize = (16, 9), color = ['orangered', 'mediumaquamarine'])
plt.title('Weekend vs Revenue', fontsize = 15)
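Dividing each row of the cross-tabulation by its row total turns counts into per-row shares. A minimal sketch on made-up data (the `toy` frame is illustrative) shows that this matches `pd.crosstab(..., normalize='index')`, which does the same thing in one call.

```python
import numpy as np
import pandas as pd

# Made-up sessions: four on weekdays (one purchase), two on weekends (one purchase).
toy = pd.DataFrame({
    "Weekend": [0, 0, 0, 0, 1, 1],
    "Revenue": [0, 0, 0, 1, 0, 1],
})

counts = pd.crosstab(toy["Weekend"], toy["Revenue"])
manual = counts.div(counts.sum(axis=1).astype(float), axis=0)  # row-wise shares
shares = pd.crosstab(toy["Weekend"], toy["Revenue"], normalize="index")

print(shares)
print(np.allclose(manual.values, shares.values))
```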

The next step is to one-hot encode the remaining categorical columns to prepare the data for machine learning.

categorical = ['VisitorType','Month']
df = pd.get_dummies(df,columns = categorical,drop_first=True)
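To see what `drop_first=True` does, here is a small sketch on a made-up frame (the category names are assumptions). Each category becomes its own indicator column, and the first category in sorted order is dropped because it is implied when all the other indicators are zero.

```python
import pandas as pd

# Made-up frame; the category names are assumed for illustration.
toy = pd.DataFrame({
    "VisitorType": ["Returning_Visitor", "New_Visitor", "Other", "New_Visitor"],
    "PageValues": [0.0, 12.5, 3.1, 0.0],
})

encoded = pd.get_dummies(toy, columns=["VisitorType"], drop_first=True)

# 'New_Visitor' comes first alphabetically, so its column is the one dropped.
print(list(encoded.columns))
```

Dropping one level avoids perfectly collinear indicator columns, which matters for linear models such as Logistic Regression and LDA.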


Training Logistic Regression

Let’s now begin to train the Logistic Regression model! We will need to first split up our data into an X array that contains the features to train on, and a y array with the target variable, in this case, the Revenue column.

X = df.drop('Revenue',axis=1)
y = df['Revenue']

Train Test Split

Now let’s split the data into a training set and a testing set. We will train our model on the training set and then use the test set to evaluate the model.

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size = 0.2, random_state = 0)
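Because far fewer sessions end in revenue than not, one optional variation is a stratified split, which keeps the class ratio the same in both sets. This is not what the code above does; the sketch below uses made-up arrays and the `stratify` parameter of `train_test_split` purely as an illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 100 samples, 20 of them positive (an imbalanced target).
X = np.arange(200).reshape(100, 2)
y = np.array([0] * 80 + [1] * 20)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

print(X_train.shape, X_test.shape)
print("Positives in test set:", y_test.sum())
```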

Feature Scaling

from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
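Note that the scaler is fitted on the training set only and then reused on the test set, so no information from the test data leaks into training. A tiny sketch with made-up numbers shows the effect: the test value is scaled with the training mean and standard deviation.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up one-feature data.
X_train = np.array([[1.0], [2.0], [3.0]])
X_test = np.array([[4.0]])

sc = StandardScaler()
X_train_scaled = sc.fit_transform(X_train)  # learns mean and std from the training data
X_test_scaled = sc.transform(X_test)        # reuses the training mean and std

print("Training mean:", sc.mean_[0])
print("Scaled test value:", X_test_scaled[0, 0])
```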

Linear Discriminant Analysis

Let’s use LDA to reduce the dimensionality of the data, which can improve accuracy.

from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
lda = LinearDiscriminantAnalysis(n_components = 1)  # with two classes, LDA yields at most n_classes - 1 = 1 component
X_train = lda.fit_transform(X_train, y_train)
X_test = lda.transform(X_test)
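LDA projects the data onto at most `n_classes - 1` directions, so a binary target such as Revenue leaves a single component. The sketch below checks this on synthetic data (the two Gaussian clusters are made up for illustration).

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Synthetic two-class data with four features (illustrative only).
rng = np.random.default_rng(0)
X0 = rng.normal(0.0, 1.0, size=(50, 4))
X1 = rng.normal(2.0, 1.0, size=(50, 4))
X = np.vstack([X0, X1])
y = np.array([0] * 50 + [1] * 50)

lda = LinearDiscriminantAnalysis(n_components=1)
X_reduced = lda.fit_transform(X, y)  # supervised projection: uses the labels y

print(X_reduced.shape)  # (100, 1)
```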

Building Logistic Regression model

from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression(random_state = 0)
classifier.fit(X_train, y_train)  # train on the LDA-reduced training data
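The scaling, LDA and Logistic Regression steps can also be chained in a scikit-learn `Pipeline`, which guarantees each step is fitted on the training data only. This is a sketch of an alternative arrangement, not the code used in this post, and it runs on synthetic data from `make_classification` rather than the shoppers dataset.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary classification data (stand-in for the real features).
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model = Pipeline([
    ("scale", StandardScaler()),
    ("lda", LinearDiscriminantAnalysis(n_components=1)),
    ("clf", LogisticRegression(random_state=0)),
])
model.fit(X_train, y_train)          # fits every step on the training set only

y_pred = model.predict(X_test)
accuracy = model.score(X_test, y_test)
print("Accuracy:", accuracy)
```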


Predictions and Evaluations

Now predict values for the testing data.

# Predicting the test set result
y_pred = classifier.predict(X_test)

Making Confusion Matrix

The confusion matrix shows how many predictions on the test set our model got right and how many it got wrong.

from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test,y_pred)
class_names=[0,1] # name  of classes
fig, ax = plt.subplots()
tick_marks = np.arange(len(class_names))
plt.xticks(tick_marks, class_names)
plt.yticks(tick_marks, class_names)
# create heatmap
sns.heatmap(pd.DataFrame(cm), annot=True, cmap="Set3" ,fmt='g')
plt.title('Confusion matrix', y=1.1)
plt.ylabel('Actual label')
plt.xlabel('Predicted label')

2042 and 151 are the correct predictions, while 46 and 225 are the incorrect predictions. So we can see that we have quite a lot of correct predictions.

Correct Predictions : 2042+151 = 2193

Incorrect Predictions: 46+225 = 271
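These counts are enough to work out the accuracy by hand. The sketch below does that, and also derives precision and recall for the revenue class under the assumption that the matrix is laid out as scikit-learn returns it, [[TN, FP], [FN, TP]] = [[2042, 46], [225, 151]]. That layout is inferred from the numbers quoted above, not confirmed by the plot.

```python
# Counts quoted above; the TN/FP/FN/TP assignment is an assumed layout.
tn, fp, fn, tp = 2042, 46, 225, 151

correct = tn + tp
incorrect = fp + fn
total = correct + incorrect

accuracy = correct / total
precision = tp / (tp + fp)  # of the sessions predicted as revenue, how many were
recall = tp / (tp + fn)     # of the actual revenue sessions, how many were found

print("Total:", total)
print("Accuracy:", round(accuracy, 4))
print("Precision:", precision)
print("Recall:", recall)
```

Under that assumed layout, accuracy is about 0.89 but recall for the revenue class is much lower, which is common when one class dominates the data.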

Create a classification report for the model.

from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))  # precision, recall and F1 per class

For the final step, let’s see the accuracy of the model.

from sklearn import metrics
print("Accuracy:",metrics.accuracy_score(y_test, y_pred))