In this post, we'll explore a breast cancer diagnostic dataset, do some exploratory analysis, cluster the samples with K-Means, and finally train a Random Forest classifier to predict the diagnosis.
The data contains the following columns:
- id: ID number
- diagnosis: The diagnosis of breast tissues (M = malignant, B = benign)
- radius_mean: mean of distances from center to points on the perimeter
- texture_mean: standard deviation of gray-scale values
- perimeter_mean: mean size of the core tumor
- area_mean: mean area size of the tumor
- smoothness_mean: mean of local variation in radius lengths
- compactness_mean: mean of perimeter^2 / area – 1.0
- concavity_mean: mean of severity of concave portions of the contour
- concave points_mean: mean for number of concave portions of the contour
- fractal_dimension_mean: mean for “coastline approximation” – 1
- radius_se: standard error for the mean of distances from center to points on the perimeter
- texture_se: standard error for standard deviation of gray-scale values
- smoothness_se: standard error for local variation in radius lengths
- compactness_se: standard error for perimeter^2 / area – 1.0
- concavity_se: standard error for severity of concave portions of the contour
- fractal_dimension_se: standard error for “coastline approximation” – 1
- texture_worst: “worst” or largest mean value for standard deviation of gray-scale values
- smoothness_worst: “worst” or largest mean value for local variation in radius lengths
- compactness_worst: “worst” or largest mean value for perimeter^2 / area – 1.0
- concavity_worst: “worst” or largest mean value for severity of concave portions of the contour
- concave points_worst: “worst” or largest mean value for number of concave portions of the contour
- fractal_dimension_worst: “worst” or largest mean value for “coastline approximation” – 1
Let’s get our environment ready with the libraries we’ll need, and then import the data!
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
Check out the Data
df = pd.read_csv('~/DataSet GitHub/k-means/data-3.csv')
df.head()

df.info()

Let’s drop the id column and the empty “Unnamed: 32” column (it contains only NaN values) from the dataset.
# We don't need the id column or the all-NaN "Unnamed: 32" column.
df.drop(["Unnamed: 32", "id"], axis = 1, inplace = True)
df.head()
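As a quick sanity check (an optional extra, not in the original notebook), we can confirm that no missing values remain after the drop:
# Count remaining missing values across the whole frame; we expect 0
print(df.isnull().sum().sum())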

Exploratory Data Analysis
Let’s visualise the frequency of each diagnosis in the dataset.
ax = sns.countplot(x = 'diagnosis', data = df) # M = 212, B = 357
B, M = df['diagnosis'].value_counts()
print('Number of Benign: ',B)
print('Number of Malignant : ',M)

#correlation map
f,ax = plt.subplots(figsize=(20, 20))
sns.heatmap(df.corr(), annot=True, linewidths=.5, fmt= '.1f',ax=ax)
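The heatmap is dense, so as an optional extra (not part of the original analysis) we can also list the most strongly correlated feature pairs programmatically; the 0.9 threshold below is an arbitrary choice:
# List feature pairs with correlation above 0.9 (diagnosis is non-numeric, so drop it first)
corr = df.drop('diagnosis', axis = 1).corr()
upper = corr.where(np.triu(np.ones(corr.shape), k = 1).astype(bool))  # keep each pair once
high_corr = upper.stack().sort_values(ascending = False)
print(high_corr[high_corr > 0.9])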

The size and shape of the nucleus should be a good predictor for whether or not a sample is cancerous.
sns.set_style("whitegrid")
plotOne = sns.FacetGrid(df, hue="diagnosis",aspect=2.5)
plotOne.map(sns.kdeplot,'area_mean',shade=True)
plotOne.set(xlim=(0, df['area_mean'].max()))
plotOne.add_legend()
plotOne.set_axis_labels('mean area', 'Proportion')
plotOne.fig.suptitle('Area vs Diagnosis (Blue = Malignant; Orange = Benign)')
plt.show()
sns.set_style("whitegrid")
plotTwo = sns.FacetGrid(df, hue="diagnosis",aspect=2.5)
plotTwo.map(sns.kdeplot,'concave points_mean',shade= True)
plotTwo.set(xlim=(0, df['concave points_mean'].max()))
plotTwo.add_legend()
plotTwo.set_axis_labels('concave points_mean', 'Proportion')
plotTwo.fig.suptitle('# of Concave Points vs Diagnosis (Blue = Malignant; Orange = Benign)')
plt.show()


sns.set(style="white")
# Use a separate frame for the "worst" size features so df stays intact for later steps
df_worst = df.loc[:,['radius_worst','perimeter_worst','area_worst']]
g = sns.PairGrid(df_worst, diag_sharey=False)
g.map_lower(sns.kdeplot, cmap="BuPu")
g.map_upper(plt.scatter)
g.map_diag(sns.kdeplot, lw=3)

The radius_mean and texture_mean features will be used for clustering. Before clustering, let’s check how the data looks.
sns.pairplot(df.loc[:,['radius_mean','texture_mean', 'diagnosis']], hue = "diagnosis", height = 4)
plt.show()

K-Means Clustering
For clustering we do not need the labels, since the algorithm will discover the groups on its own.
dwt = df.drop(["diagnosis"], axis = 1)
dwt.head()
Without the diagnosis label, our data looks like the plot below.
plt.figure(figsize = (10, 10))
plt.scatter(dwt["radius_mean"], dwt["texture_mean"])
plt.title('without clustering')
plt.xlabel('radius_mean')
plt.ylabel('texture_mean')
plt.show()

WCSS (within-cluster sum of squares) is the metric we’ll use to choose k: we compute it for a range of k values and then apply the elbow rule.
from sklearn.cluster import KMeans
wcss = [] # within cluster sum of squares
for k in range(1, 15):
    kmeansForLoop = KMeans(n_clusters = k)
    kmeansForLoop.fit(dwt)
    wcss.append(kmeansForLoop.inertia_)
plt.figure(figsize = (10, 10))
plt.plot(range(1, 15), wcss)
plt.xlabel("K value")
plt.ylabel("WCSS")
plt.show()
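For reference, KMeans’ inertia_ attribute is exactly the WCSS: the sum of squared distances from each sample to its assigned cluster centre. A small sketch (not in the original post; k = 3 is an arbitrary choice) verifying that:
# inertia_ should equal the sum of squared distances to the assigned centroids
km = KMeans(n_clusters = 3).fit(dwt)
assigned_centers = km.cluster_centers_[km.labels_]
manual_wcss = ((dwt.values - assigned_centers) ** 2).sum()
print(manual_wcss, km.inertia_)  # the two values should match closely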

The elbow appears at k = 2, so let’s cluster with two clusters.
dwt = df.loc[:,['radius_mean','texture_mean']]
kmeans = KMeans(n_clusters = 2)
clusters = kmeans.fit_predict(dwt)
dwt["type"] = clusters
dwt["type"].unique()
Plot data after k = 2 clustering
plt.figure(figsize = (15, 10))
plt.scatter(dwt["radius_mean"][dwt["type"] == 0], dwt["texture_mean"][dwt["type"] == 0], color = "red")
plt.scatter(dwt["radius_mean"][dwt["type"] == 1], dwt["texture_mean"][dwt["type"] == 1], color = "green")
plt.title('with clustering')
plt.xlabel('radius_mean')
plt.ylabel('texture_mean')
plt.show()

Let’s add the cluster centroids to the plot.
# Plot the cluster centroids on top of the clustered points
plt.figure(figsize = (15, 10))
plt.scatter(dwt["radius_mean"], dwt["texture_mean"], c = clusters, alpha = 0.5,cmap='jet')
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], color = "green", alpha = 1)
plt.xlabel('radius_mean')
plt.ylabel('texture_mean')
plt.show()
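Since the true diagnosis is actually available, an optional check (not part of the original post) is to cross-tabulate the discovered clusters against it and see how well the unsupervised grouping lines up with malignant vs. benign:
# Compare the unsupervised cluster labels with the true diagnosis
print(pd.crosstab(df['diagnosis'], dwt['type']))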

Training a Random Forest Model
Let’s switch B/M to 0/1 in the diagnosis column.
df['diagnosis'] = df['diagnosis'].replace({'B':0,'M':1})
df.head()
Let’s now begin to train the random forest model! We will need to first split up our data into an X array that contains the features to train on, and a y array with the target variable, in this case, the diagnosis column.
X = df.drop('diagnosis',axis=1)
y = df['diagnosis']
Train Test Split
Now let’s split the data into a training set and a testing set. We will train our model on the training set and then use the test set to evaluate the model.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y, random_state = 0,test_size = 0.3)
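A quick optional check of the resulting split sizes (roughly 70% / 30% of the 569 samples):
# Shapes of the training and test feature matrices
print(X_train.shape, X_test.shape)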
Training the Model
#fitting random forest classification into training set
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators=10, criterion='entropy',random_state=0)
rf.fit(X_train, y_train)
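As an optional extra (not in the original post), we can peek at which features the forest relies on most via its feature importances:
# Rank features by importance according to the trained forest
importances = pd.Series(rf.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False).head(10))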

Predictions and Evaluations
Now predict values for the testing data.
y_pred = rf.predict(X_test)
Making Confusion Matrix
The confusion matrix contains the correct predictions our model made on the test set as well as the incorrect ones.
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test,y_pred)
class_names=[0,1] # name of classes
fig, ax = plt.subplots()
tick_marks = np.arange(len(class_names))
plt.xticks(tick_marks, class_names)
plt.yticks(tick_marks, class_names)
# create heatmap
sns.heatmap(pd.DataFrame(cm), annot=True, cmap="RdGy" ,fmt='g')
ax.xaxis.set_label_position("top")
plt.tight_layout()
plt.title('Confusion matrix', y=1.1)
plt.ylabel('Actual label')
plt.xlabel('Predicted label')

107 and 58 are the correct predictions; 5 and 1 are the incorrect predictions.
Correct Predictions : 107+58 = 165
Incorrect Predictions: 5+1 = 6
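To turn those counts into an accuracy score (a small addition to double-check the figure quoted below):
# Accuracy = correct predictions / all predictions
from sklearn.metrics import accuracy_score
print(accuracy_score(y_test, y_pred))  # from the raw predictions
print(cm.trace() / cm.sum())           # same value computed from the confusion matrix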
Create a classification report for the model.
from sklearn.metrics import classification_report
print(classification_report(y_test,y_pred))

The model gets 165 of the 171 test samples right, an accuracy of roughly 96.5%!

You may have heard the world is made up of atoms and molecules, but it’s really made up of stories.