In this project, we aim to predict whether a tumor is benign or malignant. We implemented a k-nearest neighbors (KNN) classifier in Python.
The data contains the following columns:
- BI_RADS_assessment: from definitely benign (1) to highly suggestive of malignancy (5)
- Age: patient’s age in years
- Shape: mass shape: round=1, oval=2, lobular=3, irregular=4 (nominal)
- Margin: mass margin: circumscribed=1, microlobulated=2, obscured=3, ill-defined=4, spiculated=5 (nominal)
- Density: mass density: high=1, iso=2, low=3, fat-containing=4 (ordinal)
- Severity: target class: benign=0 or malignant=1
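Since most columns are stored as integer codes, it can help to map them back to labels when reading tables or plots. The snippet below is a minimal sketch: the mappings are copied from the list above, but the three sample rows and the capitalized column names are made up for illustration and are not taken from the real file.

```python
import pandas as pd

# Code-to-label mappings copied from the column descriptions above.
SHAPE_LABELS = {1: "round", 2: "oval", 3: "lobular", 4: "irregular"}
MARGIN_LABELS = {
    1: "circumscribed",
    2: "microlobulated",
    3: "obscured",
    4: "ill-defined",
    5: "spiculated",
}
SEVERITY_LABELS = {0: "benign", 1: "malignant"}

# Hypothetical rows for illustration only; not real patient data.
sample = pd.DataFrame(
    {"Shape": [1, 4, 2], "Margin": [1, 5, 3], "Severity": [0, 1, 0]}
)

# Replace each integer code with its human-readable label.
decoded = sample.assign(
    Shape=sample["Shape"].map(SHAPE_LABELS),
    Margin=sample["Margin"].map(MARGIN_LABELS),
    Severity=sample["Severity"].map(SEVERITY_LABELS),
)
print(decoded)
```

The decoded frame is only for reading; the model itself still needs the numeric codes.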
Let’s get our environment ready with the libraries we’ll need and then import the data!
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
from sklearn.metrics import confusion_matrix
from sklearn import metrics
Check out the Data!
df = pd.read_csv('~/DataSet GitHub/KNN/mammogram_weka_dataset.csv')
df.head()

df.info()

df.corr()

EDA
Let’s create some simple plots to check out the data!
# Correlation matrix
cor_mat = df.corr()
# Mask the cells above the diagonal so each pair of columns is shown once
mask = np.triu(np.ones_like(cor_mat, dtype=bool), k=1)
fig = plt.gcf()
fig.set_size_inches(30, 12)
sns.heatmap(data=cor_mat, mask=mask, square=True, annot=True, cbar=True)

Standardize the Variables
Because the KNN classifier predicts the class of a given test observation by identifying the observations that are nearest to it, the scale of the variables matters. Any variable on a large scale will have a much larger effect on the distance between observations, and hence on the KNN classifier, than a variable on a small scale.
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(df.drop('severity',axis=1))
scaled_features = scaler.transform(df.drop('severity',axis=1))
df_feat = pd.DataFrame(scaled_features,columns=df.columns[:-1])
df_feat.head()
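To make the effect of scale concrete, here is a small sketch with made-up numbers (not rows from the dataset). Each hypothetical patient has an age in years and a margin code from 1 to 5. On the raw values the age difference dominates the Euclidean distance; after standardization the margin difference carries comparable weight, and the nearest neighbor of patient A changes.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical patients: [age in years, margin code 1-5]. Values are made up.
X = np.array([
    [40.0, 1.0],  # patient A
    [42.0, 5.0],  # patient B: close in age, very different margin
    [60.0, 1.0],  # patient C: far in age, same margin
    [75.0, 3.0],
    [30.0, 2.0],
])


def dist(a, b):
    """Euclidean distance between two feature vectors."""
    return float(np.linalg.norm(a - b))


# Distances from patient A on the raw scale.
raw_ab, raw_ac = dist(X[0], X[1]), dist(X[0], X[2])

# Distances from patient A after standardizing each column.
X_scaled = StandardScaler().fit_transform(X)
scaled_ab, scaled_ac = dist(X_scaled[0], X_scaled[1]), dist(X_scaled[0], X_scaled[2])

print("raw:    A-B =", round(raw_ab, 2), " A-C =", round(raw_ac, 2))
print("scaled: A-B =", round(scaled_ab, 2), " A-C =", round(scaled_ac, 2))
```

On the raw data B looks closer to A than C does, simply because a 20-year age gap outweighs the largest possible margin gap. After scaling, C becomes the closer neighbor.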

Training KNN Model
Let’s now begin to train our classification model! We first need to split our data into an X array that contains the features to train on, and a y array with the target variable.
We then split the data into training and test sets with train_test_split, holding out 30% of the rows for testing.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(scaled_features, df['severity'], test_size=0.3)
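The split above is random, so the numbers reported later will change from run to run. One possible variation, shown here on synthetic stand-in data rather than the mammogram file, is to pass random_state for reproducibility and stratify so that both splits keep the same class balance. The seed value 42 and the 40/60 class mix are arbitrary choices for the sketch.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 100 rows, 5 features, 40% positive class.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = np.array([1] * 40 + [0] * 60)

# random_state fixes the shuffle; stratify=y keeps the class ratio in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

print(X_train.shape, X_test.shape)
print("positive share (train, test):", y_train.mean(), y_test.mean())
```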
Creating and Training the Model
Remember that we are trying to come up with a model to predict whether the tumor will be benign or malignant. We’ll start with k=1.
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X_train,y_train)

Predictions and Evaluations
Now predict values for the testing data.
pred = knn.predict(X_test)
Making Confusion Matrix
The confusion matrix contains the counts of correct and incorrect predictions that our model made on the test set.
from sklearn.metrics import classification_report,confusion_matrix
cm = confusion_matrix(y_test,pred)
class_names=[0,1] # name of classes
fig, ax = plt.subplots()
tick_marks = np.arange(len(class_names))
plt.xticks(tick_marks, class_names)
plt.yticks(tick_marks, class_names)
# create heatmap
sns.heatmap(pd.DataFrame(cm), annot=True, cmap="YlGnBu" ,fmt='g')
ax.xaxis.set_label_position("top")
plt.tight_layout()
plt.title('Confusion matrix', y=1.1)
plt.ylabel('Actual label')
plt.xlabel('Predicted label')

print(classification_report(y_test,pred))
135 and 91 are the correct predictions. In addition, 34 and 29 are the incorrect predictions.
Correct Predictions : 135+91 = 226
Incorrect Predictions: 34+29 = 63
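The accuracy follows directly from these counts: correct predictions sit on the diagonal of the confusion matrix, so accuracy is the diagonal sum divided by the total. The sketch below uses the four counts reported above; which off-diagonal cell holds 34 and which holds 29 is an assumption, and it does not affect the accuracy.

```python
import numpy as np

# Counts reported above. The placement of 34 and 29 in the off-diagonal
# cells is an assumption; swapping them does not change the accuracy.
cm = np.array([[135, 34],
               [29, 91]])

correct = int(np.trace(cm))   # diagonal cells: 135 + 91
total = int(cm.sum())         # all test observations
incorrect = total - correct   # off-diagonal cells: 34 + 29
accuracy = correct / total

print(correct, incorrect, round(accuracy, 3))  # 226 63 0.782
```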
Create a classification report for the model.

The accuracy of the model is 78%.
Choosing a K Value
Let’s go ahead and use the elbow method to pick a good K Value:
error_rate = []
# Will take some time
for i in range(1,40):
    knn = KNeighborsClassifier(n_neighbors=i)
    knn.fit(X_train,y_train)
    pred_i = knn.predict(X_test)
    error_rate.append(np.mean(pred_i != y_test))
plt.figure(figsize=(10,6))
plt.plot(range(1,40),error_rate,color='blue', linestyle='dashed', marker='o',
         markerfacecolor='red', markersize=10)
plt.title('Error Rate vs. K Value')
plt.xlabel('K')
plt.ylabel('Error Rate')
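The loop above measures the error on the test set, so the chosen K is tuned to the same data used to report accuracy. A common variation, which is not the method used in this post, is to estimate the error with cross-validation on the training data only and keep the test set for the final check. The sketch below uses synthetic data from make_classification as a stand-in; the 5 folds and the random seeds are arbitrary choices.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the scaled features and the severity labels.
X, y = make_classification(
    n_samples=300, n_features=5, n_informative=3, n_redundant=0, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Estimate the error for each K with 5-fold cross-validation on the training set.
k_values = list(range(1, 40))
cv_error = []
for k in k_values:
    knn = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(knn, X_train, y_train, cv=5, scoring="accuracy")
    cv_error.append(1 - scores.mean())

# Pick the K with the lowest cross-validated error, then evaluate once on the test set.
best_k = k_values[int(np.argmin(cv_error))]
final_model = KNeighborsClassifier(n_neighbors=best_k).fit(X_train, y_train)
test_accuracy = final_model.score(X_test, y_test)

print("best K:", best_k, "test accuracy:", round(test_accuracy, 3))
```

The cv_error list can be plotted against k_values in the same way as error_rate above.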


With k=5, the accuracy of the model is 82%.
