In this **Diagnosis.**

The data contains the following columns:

- id: ID number
- diagnosis: The diagnosis of breast tissues (M = malignant, B = benign)
- radius_mean: mean of distances from center to points on the perimeter
- texture_mean: standard deviation of gray-scale values
- perimeter_mean: mean size of the core tumor
- area_mean: mean area size of the tumor
- smoothness_mean: mean of local variation in radius lengths
- compactness_mean: mean of perimeter^2 / area – 1.0
- concavity_mean: mean of severity of concave portions of the contour
- concave points_mean: mean for number of concave portions of the contour
- fractal_dimension_mean: mean for “coastline approximation” – 1
- radius_se: standard error for the mean of distances from center to points on the perimeter
- texture_se: standard error for standard deviation of gray-scale values
- smoothness_se: standard error for local variation in radius lengths
- compactness_se: standard error for perimeter^2 / area – 1.0
- concavity_se: standard error for severity of concave portions of the contour
- fractal_dimension_se: standard error for “coastline approximation” – 1
- texture_worst: “worst” or largest mean value for standard deviation of gray-scale values
- smoothness_worst: “worst” or largest mean value for local variation in radius lengths
- compactness_worst: “worst” or largest mean value for perimeter^2 / area – 1.0
- concavity_worst: “worst” or largest mean value for severity of concave portions of the contour
- concave points_worst: “worst” or largest mean value for number of concave portions of the contour
- fractal_dimension_worst: “worst” or largest mean value for “coastline approximation” – 1


Let’s get our environment ready with the libraries we’ll need and then import the data!

```
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
```

### Check out the Data

```
df = pd.read_csv('~/DataSet GitHub/k-means/data-3.csv')
df.head()
```

`df.info()`

Let’s drop the id column and the empty "Unnamed: 32" column (it holds only NaN values) from the dataset.

```
# We don't need the id column or the all-NaN "Unnamed: 32" column.
df.drop(["Unnamed: 32", "id"], axis = 1, inplace = True)
df.head()
```


### Exploratory Data Analysis

Let’s visualise the frequency of each diagnosis class (benign vs. malignant) in the dataset.

```
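# Pass the column by keyword; recent seaborn versions no longer accept it positionally
ax = sns.countplot(x='diagnosis', data=df)  # M = 212, B = 357
# Look the counts up by label instead of relying on the sort order of value_counts()
counts = df['diagnosis'].value_counts()
B, M = counts['B'], counts['M']
print('Number of Benign: ', B)
print('Number of Malignant: ', M)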
```

```
# correlation map (diagnosis is still a text column here, so leave it out)
f, ax = plt.subplots(figsize=(20, 20))
sns.heatmap(df.drop(columns='diagnosis').corr(), annot=True, linewidths=.5, fmt='.1f', ax=ax)
```
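`DataFrame.corr()` computes the Pearson coefficient by default. As a rough sketch of what each cell of the heatmap holds, here is the same calculation in plain Python. The `pearson` helper and the toy values are made up for illustration and are not taken from the dataset.

```python
import math

def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Co-variation of the two columns around their means
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Toy values (made up): perimeter grows almost linearly with radius,
# so the coefficient lands very close to 1.
radius = [10.0, 12.0, 14.0, 18.0, 20.0]
perimeter = [63.0, 76.0, 88.0, 114.0, 126.0]
r = pearson(radius, perimeter)
print(round(r, 3))
```

Cells near 1 or -1 in the heatmap mark pairs of features that carry almost the same information, which is why radius, perimeter and area tend to light up together.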

The size and shape of the nucleus should be a good predictor for whether or not a sample is cancerous.

```
sns.set_style("whitegrid")
# Density of area_mean, one curve per diagnosis
plotOne = sns.FacetGrid(df, hue="diagnosis", aspect=2.5)
plotOne.map(sns.kdeplot, 'area_mean', fill=True)  # fill replaces the deprecated shade argument
plotOne.set(xlim=(0, df['area_mean'].max()))
plotOne.add_legend()
plotOne.set_axis_labels('mean area', 'Density')
plotOne.fig.suptitle('Area vs Diagnosis (Blue = Malignant; Orange = Benign)')
plt.show()
# Density of concave points_mean, one curve per diagnosis
plotTwo = sns.FacetGrid(df, hue="diagnosis", aspect=2.5)
plotTwo.map(sns.kdeplot, 'concave points_mean', fill=True)
plotTwo.set(xlim=(0, df['concave points_mean'].max()))
plotTwo.add_legend()
plotTwo.set_axis_labels('concave points_mean', 'Density')
plotTwo.fig.suptitle('# of Concave Points vs Diagnosis (Blue = Malignant; Orange = Benign)')
plt.show()
```

```
sns.set(style="white")
# Use a separate variable so the full df stays available for the later steps
worst = df.loc[:, ['radius_worst', 'perimeter_worst', 'area_worst']]
g = sns.PairGrid(worst, diag_sharey=False)
g.map_lower(sns.kdeplot, cmap="BuPu")
g.map_upper(plt.scatter)
g.map_diag(sns.kdeplot, lw=3)
```

We will use the radius_mean and texture_mean features for clustering. Before clustering, let’s check how the data looks.

```
sns.pairplot(df.loc[:,['radius_mean','texture_mean', 'diagnosis']], hue = "diagnosis", height = 4)
plt.show()
```


### K-Means Clustering

Clustering does not need labels, because the algorithm assigns the cluster labels itself.

```
dwt = df.drop(["diagnosis"], axis = 1)
dwt.head()
```

Without the diagnosis label, our data looks like the plot below.

```
plt.figure(figsize = (10, 10))
plt.scatter(dwt["radius_mean"], dwt["texture_mean"])
plt.title('without clustering')
plt.xlabel('radius_mean')
plt.ylabel('texture_mean')
plt.show()
```

WCSS (within-cluster sum of squares) is the metric used to select the value of k. We compute it for a range of k values and then apply the elbow rule to choose k.

```
from sklearn.cluster import KMeans

wcss = []  # within-cluster sum of squares, one value per k
for k in range(1, 15):
    kmeansForLoop = KMeans(n_clusters=k)
    kmeansForLoop.fit(dwt)
    wcss.append(kmeansForLoop.inertia_)  # inertia_ is the WCSS of the fitted model

plt.figure(figsize=(10, 10))
plt.plot(range(1, 15), wcss)
plt.xlabel("K value")
plt.ylabel("WCSS")
plt.show()
```
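To make the numbers collected in `wcss` more concrete, here is a small sketch of the quantity that `inertia_` reports: the sum of squared distances from each sample to its cluster centre. The helper below is a hypothetical one of my own (named `within_cluster_ss` so it does not clash with the `wcss` list), and the four points are made-up toy data.

```python
def within_cluster_ss(points, labels, centroids):
    """Sum of squared Euclidean distances from each point to its cluster centroid."""
    total = 0.0
    for point, label in zip(points, labels):
        centroid = centroids[label]
        total += sum((p - c) ** 2 for p, c in zip(point, centroid))
    return total

# Toy 2-D data (made up): two tight groups far apart from each other.
points = [(1.0, 1.0), (1.0, 3.0), (9.0, 1.0), (9.0, 3.0)]

# k = 2: every point sits 1 unit away from its own centroid.
two_clusters = within_cluster_ss(points, [0, 0, 1, 1], [(1.0, 2.0), (9.0, 2.0)])

# k = 1: a single centroid in the middle is far from every point.
one_cluster = within_cluster_ss(points, [0, 0, 0, 0], [(5.0, 2.0)])

print(one_cluster, two_clusters)
```

WCSS always shrinks as k grows, so the raw value cannot pick k on its own. That is why the curve is read for the point where the drop flattens out.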

The elbow of the curve appears at k = 2, so we cluster the data into two groups.
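Reading the elbow off a plot is a judgement call. One possible heuristic, which is not used in this post and is only sketched here as an assumption, is to pick the k where the WCSS curve bends most sharply, i.e. where the second difference is largest. The `elbow_k` helper and the WCSS values are hypothetical.

```python
def elbow_k(wcss_values, k_start=1):
    """Return the k at which the WCSS curve bends most (largest second difference)."""
    best_k, best_bend = None, float("-inf")
    # The first and last k have no neighbour on one side, so they are skipped.
    for i in range(1, len(wcss_values) - 1):
        bend = wcss_values[i - 1] - 2 * wcss_values[i] + wcss_values[i + 1]
        if bend > best_bend:
            best_k, best_bend = k_start + i, bend
    return best_k

# Made-up WCSS values for k = 1..5: a steep drop up to k = 2, then a flat tail.
toy_wcss = [1000.0, 400.0, 300.0, 250.0, 220.0]
k = elbow_k(toy_wcss)
print(k)
```

A heuristic like this is best treated as a sanity check on the plot, since noisy curves can bend in more than one place.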

```
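# .copy() gives an independent frame, so adding a column does not trigger pandas copy warnings
dwt = df.loc[:, ['radius_mean', 'texture_mean']].copy()
kmeans = KMeans(n_clusters=2)
clusters = kmeans.fit_predict(dwt)
dwt["type"] = clusters
dwt["type"].unique()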
```

Plot the data after clustering with k = 2.

```
plt.figure(figsize = (15, 10))
plt.scatter(dwt["radius_mean"][dwt["type"] == 0], dwt["texture_mean"][dwt["type"] == 0], color = "red")
plt.scatter(dwt["radius_mean"][dwt["type"] == 1], dwt["texture_mean"][dwt["type"] == 1], color = "green")
plt.title('with clustering')
plt.xlabel('radius_mean')
plt.ylabel('texture_mean')
plt.show()
```

Let’s mark the cluster centroids on our plot.

```
# Each centroid sits in the middle of the points assigned to its cluster
plt.figure(figsize = (15, 10))
plt.scatter(dwt["radius_mean"], dwt["texture_mean"], c = clusters, alpha = 0.5,cmap='jet')
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], color = "green", alpha = 1)
plt.xlabel('radius_mean')
plt.ylabel('texture_mean')
plt.show()
```
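The centroids plotted above are simply the means of the points in each cluster. As a bare-bones sketch of the assign-then-update idea behind k-means (not scikit-learn's implementation, which adds smarter initialisation and several restarts), the two steps can be written in plain Python. The function names and toy points are hypothetical, and the sketch assumes every cluster keeps at least one point.

```python
def assign(points, centroids):
    """Label each point with the index of its nearest centroid."""
    labels = []
    for point in points:
        distances = [
            sum((p - c) ** 2 for p, c in zip(point, centroid))
            for centroid in centroids
        ]
        labels.append(distances.index(min(distances)))
    return labels

def update(points, labels, k):
    """Move each centroid to the mean of the points assigned to it."""
    centroids = []
    for cluster in range(k):
        members = [p for p, label in zip(points, labels) if label == cluster]
        # Assumption: members is never empty for this toy data.
        centroids.append(tuple(sum(axis) / len(members) for axis in zip(*members)))
    return centroids

# Toy 2-D points (made up) and arbitrary starting centroids.
points = [(1.0, 1.0), (2.0, 1.0), (8.0, 9.0), (9.0, 9.0)]
centroids = [(0.0, 0.0), (10.0, 10.0)]

for _ in range(5):  # a handful of rounds is plenty for four points
    labels = assign(points, centroids)
    centroids = update(points, labels, k=2)

print(labels, centroids)
```

Once the labels stop changing between rounds, the centroids stop moving too, and those final positions are what `cluster_centers_` exposes on a fitted model.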


### Training a Random Forest Model

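Switch B/M to 0/1 in our diagnosis column.

```
# Assign the mapped column back; calling replace(..., inplace=True) on a selected column is unreliable in recent pandas
df['diagnosis'] = df['diagnosis'].map({'B': 0, 'M': 1})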
df.head()
```

Let’s now begin to train the random forest model! We will need to first split up our data into an X array that contains the features to train on, and a y array with the target variable, in this case, the diagnosis column.

```
X = df.drop('diagnosis',axis=1)
y = df['diagnosis']
```

### Train Test Split

Now let’s split the data into a training set and a testing set. We will train our model on the training set and then use the test set to evaluate the model.

```
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y, random_state = 0,test_size = 0.3)
```
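For a sense of how many rows end up on each side of the split, here is a quick back-of-the-envelope check. It assumes the test share is rounded up to a whole number of rows, which matches the 171 test predictions counted in the confusion matrix further down. The variable names are my own.

```python
import math

n_samples = 357 + 212   # benign + malignant rows counted earlier
test_size = 0.3

# Assumption: the test share is rounded up and the training set gets the rest.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)
```

Fixing `random_state` only controls which rows land in each set, not how many, so these sizes stay the same for any seed.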

### Training the Model

```
#fitting random forest classification into training set
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators=10, criterion='entropy',random_state=0)
rf.fit(X_train, y_train)
```
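With `criterion='entropy'`, each tree scores candidate splits by how much they reduce the Shannon entropy of the class labels. A minimal sketch of that impurity measure, using a hypothetical `entropy` helper that reports bits:

```python
import math

def entropy(class_counts):
    """Shannon entropy (in bits) of a node holding the given class counts."""
    total = sum(class_counts)
    result = 0.0
    for count in class_counts:
        if count:  # classes with no samples contribute nothing
            p = count / total
            result -= p * math.log2(p)
    return result

whole_dataset = entropy([357, 212])  # benign / malignant counts from the data
pure_node = entropy([100, 0])        # only one class present
even_node = entropy([50, 50])        # classes perfectly mixed

print(round(whole_dataset, 3), pure_node, even_node)
```

A split is attractive when the child nodes have lower weighted entropy than the parent, so the trees keep splitting towards nodes that contain mostly one diagnosis.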

### Predictions and Evaluations

Now predict values for the testing data.

`y_pred = rf.predict(X_test)`

### Making Confusion Matrix

The confusion matrix shows how many predictions on the test set were correct and how many were incorrect, broken down by class.

```
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
class_names = [0, 1]  # 0 = benign, 1 = malignant
fig, ax = plt.subplots()
# create heatmap: rows are actual labels, columns are predicted labels
sns.heatmap(pd.DataFrame(cm, index=class_names, columns=class_names), annot=True, cmap="RdGy", fmt='g', ax=ax)
ax.xaxis.set_label_position("top")
plt.tight_layout()
plt.title('Confusion matrix', y=1.1)
plt.ylabel('Actual label')
plt.xlabel('Predicted label')
```
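To see where the four cells come from, here is a small sketch that counts them by hand. It follows the same layout as `confusion_matrix` (rows are actual labels, columns are predicted labels). The `confusion_counts` helper and the toy labels are made up for illustration.

```python
def confusion_counts(y_true, y_pred):
    """Return [[TN, FP], [FN, TP]] for 0/1 labels (rows = actual, columns = predicted)."""
    matrix = [[0, 0], [0, 0]]
    for actual, predicted in zip(y_true, y_pred):
        matrix[actual][predicted] += 1
    return matrix

# Toy labels (made up): one benign sample flagged as malignant, one malignant missed.
y_true_toy = [0, 0, 0, 1, 1, 1]
y_pred_toy = [0, 0, 1, 1, 1, 0]
cm_toy = confusion_counts(y_true_toy, y_pred_toy)

print(cm_toy)
```

The diagonal holds the correct predictions and the off-diagonal cells hold the two kinds of error.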

107 and 58 are the correct predictions. In addition, 5 and 1 are the incorrect predictions.

Correct Predictions : 107+58 = 165

Incorrect Predictions: 5+1 = 6
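These counts give the accuracy directly, as a quick check in plain Python (the variable names are my own):

```python
correct = 107 + 58    # diagonal of the confusion matrix
incorrect = 5 + 1     # off-diagonal cells

accuracy = correct / (correct + incorrect)
print(f"accuracy = {accuracy:.4f}")
```

Accuracy treats both kinds of error the same, which is why the per-class precision and recall are worth reading as well.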

Create a classification report for the model.

```
from sklearn.metrics import classification_report
print(classification_report(y_test,y_pred))
```

The accuracy of the model is about 96.5% (165 correct predictions out of 171)!
