Prediction of Breast Cancer Diagnosis

In this project we aim to predict whether a tumor is benign or malignant. We train a Random Forest classifier on the diagnosis target and, separately, explore the data with unsupervised K-Means clustering.

The data contains the following columns:

  • id: ID number
  • diagnosis: The diagnosis of breast tissues (M = malignant, B = benign)
  • radius_mean: mean of distances from center to points on the perimeter
  • texture_mean: standard deviation of gray-scale values
  • perimeter_mean: mean size of the core tumor
  • area_mean: mean area size of the tumor
  • smoothness_mean: mean of local variation in radius lengths
  • compactness_mean: mean of perimeter^2 / area – 1.0
  • concavity_mean: mean of severity of concave portions of the contour
  • concave points_mean: mean for number of concave portions of the contour
  • fractal_dimension_mean: mean for “coastline approximation” – 1
  • radius_se: standard error for the mean of distances from center to points on the perimeter
  • texture_se: standard error for standard deviation of gray-scale values
  • smoothness_se: standard error for local variation in radius lengths
  • compactness_se: standard error for perimeter^2 / area – 1.0
  • concavity_se: standard error for severity of concave portions of the contour
  • fractal_dimension_se: standard error for “coastline approximation” – 1
  • texture_worst: “worst” or largest mean value for standard deviation of gray-scale values
  • smoothness_worst: “worst” or largest mean value for local variation in radius lengths
  • compactness_worst: “worst” or largest mean value for perimeter^2 / area – 1.0
  • concavity_worst: “worst” or largest mean value for severity of concave portions of the contour
  • concave points_worst: “worst” or largest mean value for number of concave portions of the contour
  • fractal_dimension_worst: “worst” or largest mean value for “coastline approximation” – 1


Let’s get our environment ready with the libraries we’ll need and then import the data!

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

Check out the Data

df = pd.read_csv('~/DataSet GitHub/k-means/data-3.csv')

Let’s drop ID and NaN data from dataset.

# We don't need id and NaN data.
df.drop(["Unnamed: 32", "id"], axis = 1, inplace = True)
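The "Unnamed: 32" column is dropped by name above. As a minimal sketch of a more general approach, the snippet below drops every column that holds only NaN values. The toy frame and its values are made up for illustration and are not read from the real CSV.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the CSV (hypothetical values):
# an id, one real feature, and an all-NaN column.
toy = pd.DataFrame({
    "id": [101, 102, 103],
    "radius_mean": [17.99, 20.57, 19.69],
    "Unnamed: 32": [np.nan, np.nan, np.nan],
})

# Drop every column that contains only NaN, then the identifier column.
cleaned = toy.dropna(axis=1, how="all").drop(columns=["id"])

print(cleaned.columns.tolist())  # ['radius_mean']
```

This avoids hard-coding the name of the empty column, which can differ depending on how the file was exported.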


Exploratory Data Analysis

Let’s visualise the frequency of each diagnosis class in the dataset.

ax = sns.countplot(x='diagnosis', data=df)       # M = 212, B = 357
B, M = df['diagnosis'].value_counts()   # sorted by frequency, so benign (357) comes first
print('Number of Benign: ',B)
print('Number of Malignant : ',M)
# correlation map (numeric features only; diagnosis is still a string column here)
f,ax = plt.subplots(figsize=(20, 20))
sns.heatmap(df.drop(columns='diagnosis').corr(), annot=True, linewidths=.5, fmt= '.1f',ax=ax)
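A 30-feature heatmap can be hard to read, so it may help to list the strongly correlated pairs as text. The sketch below does this on a small synthetic frame. The column names, the random data, and the 0.9 threshold are all assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
radius = rng.uniform(5, 25, size=200)

# Synthetic stand-in data: perimeter is fully determined by radius,
# texture is drawn independently.
toy = pd.DataFrame({
    "radius": radius,
    "perimeter": 2 * np.pi * radius,
    "texture": rng.normal(20, 4, size=200),
})

corr = toy.corr().abs()

# Keep only the upper triangle so each pair appears once and the diagonal is skipped.
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(upper).stack().sort_values(ascending=False)

# Pairs above the (arbitrary) 0.9 threshold.
strong = pairs[pairs > 0.9]
print(strong)
```

On the real data the same pattern would be applied to the numeric feature columns only.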

The size and shape of the nucleus should be a good predictor for whether or not a sample is cancerous.

plotOne = sns.FacetGrid(df, hue="diagnosis", aspect=2.5).map(sns.kdeplot, 'area_mean', fill=True)  # fill replaces the deprecated shade argument
plotOne.set(xlim=(0, df['area_mean'].max()))
plotOne.set_axis_labels('mean area', 'Proportion')
plotOne.fig.suptitle('Area vs Diagnosis (Blue = Malignant; Orange = Benign)')

plotTwo = sns.FacetGrid(df, hue="diagnosis", aspect=2.5).map(sns.kdeplot, 'concave points_mean', fill=True)  # fill replaces the deprecated shade argument
plotTwo.set(xlim=(0, df['concave points_mean'].max()))
plotTwo.set_axis_labels('concave points_mean', 'Proportion')
plotTwo.fig.suptitle('# of Concave Points vs Diagnosis (Blue = Malignant; Orange = Benign)')
# Use a separate variable so df keeps all of its columns for the later steps
worst = df.loc[:,['radius_worst','perimeter_worst','area_worst']]
g = sns.PairGrid(worst, diag_sharey=False)
g.map_lower(sns.kdeplot, cmap="BuPu")
g.map_diag(sns.kdeplot, lw=3)

radius_mean and texture_mean features will be used for clustering. Before the clustering process let’s check how our data looks.

sns.pairplot(df.loc[:,['radius_mean','texture_mean', 'diagnosis']], hue = "diagnosis", height = 4)


K-Means Clustering

Clustering does not need labels, because the algorithm assigns the cluster labels itself.

dwt = df.drop(["diagnosis"], axis = 1)

Without the diagnosis label, our data looks like the plot below.

plt.figure(figsize = (10, 10))
plt.scatter(dwt["radius_mean"], dwt["texture_mean"])
plt.title('without clustering')

WCSS (within-cluster sum of squares) is the metric used to select the value of k. After computing it for a range of k values, the elbow rule is used to pick k.

from sklearn.cluster import KMeans
wcss = [] # within cluster sum of squares

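for k in range(1, 15):
    kmeansForLoop = KMeans(n_clusters = k)
    kmeansForLoop.fit(dwt)                   # fit on the unlabelled features
    wcss.append(kmeansForLoop.inertia_)      # inertia_ is the WCSS for this k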

plt.figure(figsize = (10, 10))
plt.plot(range(1, 15), wcss)
plt.xlabel("K value")
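To make the WCSS metric concrete, here is a minimal sketch that computes it by hand and compares it with the value scikit-learn reports as inertia_. The two synthetic blobs and their centres are made up for illustration, and n_init and random_state are set only to make the run repeatable.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Two synthetic 2-D blobs (hypothetical data, not the tumor measurements).
points = np.vstack([
    rng.normal(loc=(12, 18), scale=1.0, size=(50, 2)),
    rng.normal(loc=(20, 22), scale=1.0, size=(50, 2)),
])

model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)

# WCSS: squared distance from each point to the centroid of its own cluster, summed.
assigned_centroids = model.cluster_centers_[model.labels_]
manual_wcss = np.sum((points - assigned_centroids) ** 2)

print(manual_wcss, model.inertia_)
```

The two printed numbers should agree up to floating-point rounding, which is why the loop above can read the WCSS straight from inertia_.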

The elbow starts at k = 2, so we cluster the data into two groups.

dwt = df.loc[:,['radius_mean','texture_mean']].copy()   # copy so adding a column does not touch df
kmeans = KMeans(n_clusters = 2)
clusters = kmeans.fit_predict(dwt)
dwt["type"] = clusters

Plot data after k = 2 clustering

plt.figure(figsize = (15, 10))
plt.scatter(dwt["radius_mean"][dwt["type"] == 0], dwt["texture_mean"][dwt["type"] == 0], color = "red")
plt.scatter(dwt["radius_mean"][dwt["type"] == 1], dwt["texture_mean"][dwt["type"] == 1], color = "green")
plt.title('with clustering')

Let’s add the centroids to our plot.

# Centroids sit in the middle of each cluster of points

plt.figure(figsize = (15, 10))
plt.scatter(dwt["radius_mean"], dwt["texture_mean"], c = clusters, alpha = 0.5,cmap='jet')
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], color = "green", alpha = 1)
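One way to see how well the two clusters line up with the actual diagnosis is a cross-tabulation. The sketch below is an assumption-laden stand-in: it uses scikit-learn's bundled copy of the Wisconsin breast cancer data instead of the CSV, so the column names differ ("mean radius" rather than radius_mean) and the target is encoded 0 = malignant, 1 = benign.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import load_breast_cancer

# Bundled copy of the dataset (assumption: it stands in for the CSV used in this post).
data = load_breast_cancer(as_frame=True)
features = data.data[["mean radius", "mean texture"]]

# scikit-learn encodes 0 = malignant, 1 = benign; map back to the M/B letters.
diagnosis = data.target.map({0: "M", 1: "B"})

clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(features)

# Rows: true diagnosis. Columns: cluster id assigned by K-Means.
table = pd.crosstab(diagnosis, clusters, rownames=["diagnosis"], colnames=["cluster"])
print(table)
```

Cluster ids are arbitrary, so cluster 0 is not guaranteed to correspond to either class. What matters is how cleanly each row concentrates in one column.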


Training a Random Forest Model

Switch B/M to 0/1 in our diagnosis column.

df['diagnosis'] = df['diagnosis'].map({'B':0,'M':1})   # assign back instead of an in-place replace on a column

Let’s now begin to train the random forest model! We will need to first split up our data into an X array that contains the features to train on, and a y array with the target variable, in this case, the diagnosis column.

X = df.drop('diagnosis',axis=1)
y = df['diagnosis']

Train Test Split

Now let’s split the data into a training set and a testing set. We will train our model on the training set and then use the test set to evaluate the model.

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y, random_state = 0,test_size = 0.3)
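As a quick check of what test_size = 0.3 means for a dataset of this size, the sketch below splits placeholder arrays with 569 rows (357 + 212, the class counts noted earlier). The arrays are dummies, and only their lengths matter.

```python
import numpy as np
from sklearn.model_selection import train_test_split

n_samples = 569  # 357 benign + 212 malignant

# Placeholder features and labels with the same number of rows as the dataset.
X_demo = np.arange(n_samples * 2).reshape(n_samples, 2)
y_demo = np.array([0] * 357 + [1] * 212)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0, test_size=0.3)

print(len(X_tr), len(X_te))  # 398 171
```

The test set has 171 rows because scikit-learn rounds 0.3 × 569 = 170.7 up.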

Training the Model

# Fit a random forest classifier to the training set
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators=10, criterion='entropy',random_state=0)
rf.fit(X_train, y_train)
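A fitted random forest also exposes feature_importances_, which gives a rough idea of which measurements drive its predictions. The sketch below is not the post's exact pipeline: it assumes scikit-learn's bundled copy of the dataset (different column names, target encoded 0 = malignant, 1 = benign) and reuses the same hyperparameters as above.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Bundled copy of the dataset (assumption: it stands in for the CSV used in this post).
data = load_breast_cancer(as_frame=True)
X_demo, y_demo = data.data, data.target

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0, test_size=0.3)

forest = RandomForestClassifier(n_estimators=10, criterion="entropy", random_state=0)
forest.fit(X_tr, y_tr)

# One importance value per feature; the values sum to 1.
importances = pd.Series(forest.feature_importances_, index=X_demo.columns)
importances = importances.sort_values(ascending=False)

print(importances.head(5))
```

With only 10 trees the ranking can shift between random seeds, so treat it as a hint and not a definitive ordering.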

Predictions and Evaluations

Now predict values for the testing data.

y_pred = rf.predict(X_test)

Making Confusion Matrix

The confusion matrix contains the counts of correct and incorrect predictions that our model made on the test set.

from sklearn.metrics import confusion_matrix
from sklearn import metrics
cm = confusion_matrix(y_test,y_pred)
class_names=[0,1] # name  of classes
fig, ax = plt.subplots()
tick_marks = np.arange(len(class_names))
plt.xticks(tick_marks, class_names)
plt.yticks(tick_marks, class_names)
# create heatmap
sns.heatmap(pd.DataFrame(cm), annot=True, cmap="RdGy" ,fmt='g')
plt.title('Confusion matrix', y=1.1)
plt.ylabel('Actual label')
plt.xlabel('Predicted label')

107 and 58 are the correct predictions, while 5 and 1 are the incorrect predictions, so the model gets most of the test samples right.


Correct Predictions : 107+58 = 165

Incorrect Predictions: 5+1 = 6

Create a classification report for the model.

from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))

The accuracy of the model is about 96.5% (165 of 171 test predictions), or roughly 97%!
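The accuracy figure can be checked directly from the confusion-matrix counts. In the sketch below, which off-diagonal cell holds the 5 and which holds the 1 is an assumption, since only the counts are stated. Accuracy does not depend on that choice.

```python
import numpy as np

# Counts as reported: 107 and 58 correct, 5 and 1 incorrect.
# The placement of the 5 and the 1 in the off-diagonal cells is assumed.
cm_demo = np.array([[107, 1],
                    [5, 58]])

correct = np.trace(cm_demo)   # diagonal = correct predictions
total = cm_demo.sum()         # all test samples
accuracy = correct / total

print(correct, total, round(accuracy, 4))  # 165 171 0.9649
```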
