In this project, we implement customer segmentation based on credit card usage behavior using two different approaches: K-means and hierarchical clustering.
The data contains the following columns:
CUST_ID : Identification of credit card holder (Categorical)
BALANCE : Balance amount left in their account to make purchases
BALANCE_FREQUENCY : How frequently the balance is updated, score between 0 and 1 (1 = frequently updated, 0 = not frequently updated)
PURCHASES : Amount of purchases made from account
ONEOFF_PURCHASES : Maximum purchase amount done in one go
INSTALLMENTS_PURCHASES : Amount of purchases done in installments
CASH_ADVANCE : Cash in advance given by the user
PURCHASES_FREQUENCY : How frequently purchases are being made, score between 0 and 1 (1 = frequently purchased, 0 = not frequently purchased)
ONEOFF_PURCHASES_FREQUENCY : How frequently purchases are happening in one go (1 = frequently purchased, 0 = not frequently purchased)
PURCHASES_INSTALLMENTS_FREQUENCY : How frequently purchases in installments are being done (1 = frequently done, 0 = not frequently done)
CASH_ADVANCE_FREQUENCY : How frequently cash in advance is being paid
CASH_ADVANCE_TRX : Number of transactions made with "Cash in Advance"
PURCHASES_TRX : Number of purchase transactions made
CREDIT_LIMIT : Credit card limit for user
PAYMENTS : Amount of payments made by user
MINIMUM_PAYMENTS : Minimum amount of payments made by user
PRC_FULL_PAYMENT : Percent of full payment paid by user
TENURE : Tenure of credit card service for user
Let’s get our environment ready with the libraries we’ll need and then import the data!
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
plt.style.use('ggplot')
Check out the Data
df = pd.read_csv('~/DataSet GitHub/K-Means/CC GENERAL.csv')
df.head()

df.info()

Visualise the missing values
import missingno as msno
msno.matrix(df)

Fill the NaN values with the mean of each numeric column
# numeric_only=True skips the text column CUST_ID, which has no mean
df = df.fillna(df.mean(numeric_only=True))
df.info()
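As a minimal sketch of what mean imputation does, the example below applies the same `fillna` call to a small hypothetical frame. The toy values are assumptions made for illustration and are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the credit card data
toy = pd.DataFrame({
    "CUST_ID": ["C1", "C2", "C3", "C4"],
    "CREDIT_LIMIT": [1000.0, np.nan, 3000.0, 2000.0],
    "MINIMUM_PAYMENTS": [100.0, 300.0, np.nan, np.nan],
})

# Column means are computed over the non-missing values only
col_means = toy.mean(numeric_only=True)

# Passing a Series to fillna fills each column with its own mean
toy_filled = toy.fillna(col_means)

print(toy_filled)
```

Mean imputation keeps every row but shrinks the variance of the imputed columns, so it is worth checking how many values were filled before relying on those columns.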

We don’t need the customer ID column for clustering.
df.drop(["CUST_ID"], axis = 1, inplace = True)
df.head()

Let’s visualise the Correlation Map
f,ax = plt.subplots(figsize=(15, 15))
sns.heatmap(df.corr(), annot=True, linewidths=.5, fmt= '.1f',ax=ax)

Feature Scaling
from sklearn.preprocessing import StandardScaler
standardscaler = StandardScaler()
X = standardscaler.fit_transform(df)
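To illustrate why scaling matters before distance-based clustering, here is a small sketch on a hypothetical two-feature array whose columns differ in scale by several orders of magnitude. The numbers are made up for the example.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical sample: one column in the thousands, one between 0 and 1
raw = np.array([[1000.0, 0.1],
                [2000.0, 0.5],
                [3000.0, 0.9]])

scaled = StandardScaler().fit_transform(raw)

# After standardisation each column has mean 0 and unit (population) standard deviation
print(scaled.mean(axis=0))
print(scaled.std(axis=0))
```

Without this step, Euclidean distances would be dominated by the large-valued column, and the frequency-style features would have almost no influence on the clusters.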
K-Means Clustering
Let’s use the elbow method to find the optimal number of clusters
plt.figure(figsize=(10,6))
from sklearn.cluster import KMeans
wcss = []
for i in range(1, 15):
    # Fit K-means for each candidate k and record the within-cluster sum of squares
    kmeans = KMeans(n_clusters = i, init = 'k-means++', random_state = 42)
    kmeans.fit(X)
    wcss.append(kmeans.inertia_)
plt.plot(range(1, 15), wcss)
plt.title('The Elbow Method')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')
plt.show()
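The WCSS plotted above is what scikit-learn exposes as `inertia_`. As a rough sketch of what that number means, the example below recomputes it by hand on synthetic data; the blob centres and sample size are assumptions chosen only for the demonstration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in data with four well-separated groups (assumed for illustration)
X_demo, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [10, 10], [-10, 10], [10, -10]],
    cluster_std=0.6,
    random_state=42,
)

km = KMeans(n_clusters=4, init="k-means++", n_init=10, random_state=42).fit(X_demo)

# WCSS: squared distance from each point to its assigned cluster centre, summed over all points
assigned_centres = km.cluster_centers_[km.labels_]
wcss_manual = ((X_demo - assigned_centres) ** 2).sum()

print(wcss_manual, km.inertia_)
```

Because WCSS always decreases as k grows, the elbow method looks for the point where the decrease flattens out rather than for a minimum.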

Fitting K-Means to the dataset
The curve starts to flatten at around 8 clusters, so we take k = 8 as the elbow point.
kmeans = KMeans(n_clusters = 8, init = 'k-means++', random_state = 42)
y_kmeans = kmeans.fit_predict(X)
kmeans.cluster_centers_
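Because K-means was fitted on standardised data, the cluster centres are expressed in standard-deviation units. One way to read them in the original units is to pass them back through the scaler, sketched below on hypothetical two-column data (the column values are randomly generated, not drawn from the dataset).

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical data: a low-usage group and a high-usage group
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "BALANCE": np.concatenate([rng.normal(500, 50, 50), rng.normal(5000, 300, 50)]),
    "PURCHASES": np.concatenate([rng.normal(200, 20, 50), rng.normal(3000, 200, 50)]),
})

scaler = StandardScaler()
X_demo = scaler.fit_transform(demo)
km = KMeans(n_clusters=2, n_init=10, random_state=42).fit(X_demo)

# inverse_transform maps the centres from scaled space back to the original units
centres = pd.DataFrame(
    scaler.inverse_transform(km.cluster_centers_),
    columns=demo.columns,
)
print(centres.round(1))
```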

Let’s see the cluster for each customer
y_kmeans = kmeans.predict(X)
df["cluster"] = y_kmeans
df.head()
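Once each customer carries a cluster label, a quick way to interpret the segments is to look at cluster sizes and per-cluster averages. The sketch below does this on a hypothetical five-row frame with hand-assigned labels.

```python
import pandas as pd

# Hypothetical labelled frame; the values and labels are made up for illustration
toy = pd.DataFrame({
    "BALANCE": [100.0, 200.0, 5000.0, 6000.0, 150.0],
    "PURCHASES": [50.0, 70.0, 900.0, 1100.0, 60.0],
    "cluster": [0, 0, 1, 1, 0],
})

# Number of customers in each cluster
sizes = toy["cluster"].value_counts().sort_index()

# Average feature values per cluster, one row per segment
profile = toy.groupby("cluster").mean()

print(sizes)
print(profile)
```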

Plot data after k = 8 clustering on important columns
best_cols = ["BALANCE", "PURCHASES", "CASH_ADVANCE", "CREDIT_LIMIT", "PAYMENTS", "MINIMUM_PAYMENTS"]
kmeans = KMeans(n_clusters=8, init="k-means++", n_init=10, max_iter=300, random_state=42)
# Cluster on every selected column (values are in their original, unscaled units)
best_vals = df[best_cols].values
y_pred = kmeans.fit_predict(best_vals)
df["cluster"] = y_pred
# Pass the label column alongside the features so pairplot can colour by cluster
sns.pairplot(df[best_cols + ["cluster"]], hue="cluster")

Hierarchical Clustering
Let’s use the dendrogram to find the optimal number of clusters
plt.figure(figsize=(10,6))
import scipy.cluster.hierarchy as sch
dendrogram = sch.dendrogram(sch.linkage(X, method = 'ward'))
plt.title('Dendrogram')
plt.xlabel('Customers')
plt.ylabel('Euclidean distances')
plt.show()
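Reading a cluster count off a dendrogram amounts to cutting the tree at some height. As a sketch of that idea, SciPy's `fcluster` can cut the same Ward linkage into a requested number of flat clusters; the synthetic blobs below are an assumption used in place of the real data.

```python
import numpy as np
import scipy.cluster.hierarchy as sch
from sklearn.datasets import make_blobs

# Synthetic stand-in data with four well-separated groups (assumed for illustration)
X_demo, _ = make_blobs(
    n_samples=200,
    centers=[[0, 0], [10, 10], [-10, 10], [10, -10]],
    cluster_std=0.6,
    random_state=42,
)

# Build the Ward linkage matrix, the same structure the dendrogram is drawn from
Z = sch.linkage(X_demo, method="ward")

# Cut the tree so that at most 4 flat clusters remain; labels start at 1
labels = sch.fcluster(Z, t=4, criterion="maxclust")

print(np.unique(labels, return_counts=True))
```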

Fitting Hierarchical Clustering to the credit card usage dataset
from sklearn.cluster import AgglomerativeClustering
# Ward linkage always uses Euclidean distance, so no distance argument is needed
hc = AgglomerativeClustering(n_clusters = 4, linkage = 'ward')
Let’s see the cluster for each customer
y_hc = hc.fit_predict(X)
df["cluster"] = y_hc
df.head()
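Since both approaches assign a label to every customer, it can be useful to check how far the two partitions agree. The sketch below uses the adjusted Rand index, which is a technique added here for illustration and not part of the original walkthrough, on synthetic data with the same number of clusters for both models.

```python
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

# Synthetic stand-in data with four well-separated groups (assumed for illustration)
X_demo, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [10, 10], [-10, 10], [10, -10]],
    cluster_std=0.6,
    random_state=42,
)

labels_km = KMeans(n_clusters=4, n_init=10, random_state=42).fit_predict(X_demo)
labels_hc = AgglomerativeClustering(n_clusters=4, linkage="ward").fit_predict(X_demo)

# The adjusted Rand index ignores how clusters are numbered:
# 1.0 means identical partitions, values near 0 mean chance-level agreement
ari = adjusted_rand_score(labels_km, labels_hc)
print(ari)
```

On the credit card data the two models above use different cluster counts (8 and 4), so a lower agreement score would be expected there; a cross-tabulation with `pd.crosstab` is another option for seeing how the segments nest.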


You may have heard the world is made up of atoms and molecules, but it’s really made up of stories. When you sit with an individual that’s been here, you can give quantitative data a qualitative overlay.