# Prediction Of Tumor Severity

In this project, we aim to predict whether a tumor is benign or malignant. We implement a K-nearest neighbors (KNN) classifier in Python.

The data contains the following columns:

• BI_RADS_assessment: definitely benign (1) to highly suggestive of malignancy (5)
• Age: patient’s age in years
• Shape: mass shape: round=1, oval=2, lobular=3, irregular=4 (nominal)
• Margin: mass margin: circumscribed=1, microlobulated=2, obscured=3, ill-defined=4, spiculated=5 (nominal)
• Density: mass density: high=1, iso=2, low=3, fat-containing=4 (ordinal)
• Severity: target class: benign=0 or malignant=1


Let’s get our environment ready with the libraries we’ll need and then import the data!

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
from sklearn.metrics import confusion_matrix
from sklearn import metrics
```

Check out the data!

```python
df = pd.read_csv('~/DataSet GitHub/KNN/mammogram_weka_dataset.csv')
df.info()
df.corr()
```

### EDA

Let’s create some simple plots to check out the data!

```python
# Correlation matrix
cor_mat = df.corr()
fig = plt.gcf()
fig.set_size_inches(30, 12)
# Draw the matrix as an annotated heatmap
sns.heatmap(data=cor_mat, annot=True)
```


### Standardize the Variables

Because the KNN classifier predicts the class of a given test observation by identifying the observations that are nearest to it, the scale of the variables matters. Any variables that are on a large scale will have a much larger effect on the distance between the observations, and hence on the KNN classifier, than variables that are on a small scale.

```python
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
# Fit on the feature columns only; 'severity' is the target
scaler.fit(df.drop('severity', axis=1))
scaled_features = scaler.transform(df.drop('severity', axis=1))
# Assumes 'severity' is the last column of df
df_feat = pd.DataFrame(scaled_features, columns=df.columns[:-1])
```
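To make the effect of scale concrete, here is a minimal sketch in plain Python. The four (age, density) rows are made up for illustration and are not taken from the dataset. Each column is standardized to zero mean and unit variance using the population standard deviation, which is the same z-score that `StandardScaler` applies by default.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length rows."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def standardize(rows):
    """Z-score each column: (value - column mean) / column population std."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    stds = [
        math.sqrt(sum((v - m) ** 2 for v in c) / len(c))
        for c, m in zip(cols, means)
    ]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]

# Hypothetical (age, density) rows, invented for illustration only.
rows = [[30, 1], [32, 4], [60, 1], [62, 4]]

# Before scaling, the age difference dominates the distance.
raw_age_gap = euclidean(rows[0], rows[2])      # same density, 30 years apart
raw_density_gap = euclidean(rows[0], rows[1])  # 2 years apart, opposite density

# After scaling, both differences contribute on a comparable scale.
scaled = standardize(rows)
scaled_age_gap = euclidean(scaled[0], scaled[2])
scaled_density_gap = euclidean(scaled[0], scaled[1])

print(raw_age_gap, raw_density_gap)
print(scaled_age_gap, scaled_density_gap)
```

On the raw values the age gap is several times larger than the density gap, so density would barely influence which neighbors are nearest. After standardization the two gaps are nearly equal.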

### Training KNN Model

Let’s now begin to train the classification model! We first need to split our data into an X array that contains the features to train on and a y array with the target variable.

We split the data into training and test sets with scikit-learn’s train_test_split, keeping 70% for training and holding out 30% for testing.

```python
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(scaled_features, df['severity'], test_size=0.3)
```
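Because the split is random, the numbers reported later can change from run to run unless the shuffle is seeded (`train_test_split` accepts a `random_state` argument for this purpose). The sketch below illustrates the idea in plain Python. The helper `split_indices` and the seed value are hypothetical, and it is not meant to reproduce scikit-learn's exact split.

```python
import random

def split_indices(n_samples, test_size=0.3, seed=42):
    """Shuffle row indices with a fixed seed and split them into train/test."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # a private RNG keeps the shuffle repeatable
    n_test = round(n_samples * test_size)
    return indices[n_test:], indices[:n_test]

# Ten hypothetical rows: 7 go to training and 3 to testing.
train_idx, test_idx = split_indices(10, test_size=0.3, seed=42)
print(train_idx, test_idx)
```

Calling the helper again with the same seed returns the same split, which makes the downstream metrics reproducible.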

### Creating and Training the Model

Remember that we are trying to come up with a model to predict whether the tumor will be benign or malignant. We’ll start with k=1.

```python
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X_train, y_train)
```
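To show what the classifier does conceptually, here is a minimal from-scratch sketch of KNN prediction in plain Python. It is not scikit-learn's implementation. The function `knn_predict` and the sample points are hypothetical and invented for illustration.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=1):
    """Label `query` by majority vote among its k nearest training points."""
    by_distance = sorted(
        zip(train_points, train_labels),
        key=lambda pair: math.dist(pair[0], query),  # Euclidean distance
    )
    nearest_labels = [label for _, label in by_distance[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Hypothetical standardized points (0 = benign, 1 = malignant).
# The last point is a malignant case sitting inside the benign cluster.
train_points = [(-1.0, -1.0), (-0.8, -1.2), (1.0, 1.0), (1.2, 0.9), (-0.6, -0.7)]
train_labels = [0, 0, 1, 1, 1]

query = (-0.7, -0.8)
label_k1 = knn_predict(train_points, train_labels, query, k=1)
label_k3 = knn_predict(train_points, train_labels, query, k=3)
print(label_k1, label_k3)
```

With k=1 the prediction follows the single closest point, so one unusual neighbor decides the label. With k=3 the two benign neighbors outvote it. This sensitivity is one reason to try several values of k.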

### Predictions and Evaluations

Now predict values for the testing data.

```python
pred = knn.predict(X_test)
```

### Making Confusion Matrix

The confusion matrix shows how many test-set predictions the model got right and how many it got wrong, broken down by class.

```python
from sklearn.metrics import classification_report, confusion_matrix

cm = confusion_matrix(y_test, pred)
class_names = [0, 1]  # names of the classes
fig, ax = plt.subplots()
tick_marks = np.arange(len(class_names))
plt.xticks(tick_marks, class_names)
plt.yticks(tick_marks, class_names)
# create heatmap
sns.heatmap(pd.DataFrame(cm), annot=True, cmap="YlGnBu", fmt='g')
ax.xaxis.set_label_position("top")
plt.tight_layout()
plt.title('Confusion matrix', y=1.1)
plt.ylabel('Actual label')
plt.xlabel('Predicted label')
```
In the confusion matrix, 135 and 91 are the correct predictions, while 34 and 29 are the incorrect ones, so the model gets most of the test samples right.

Correct predictions: 135 + 91 = 226

Incorrect predictions: 34 + 29 = 63

Create a classification report for the model.

```python
print(classification_report(y_test, pred))
```

The accuracy of the model is 78%.
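The accuracy figure follows directly from the counts above. As a quick check, this short snippet redoes the arithmetic using only the four numbers read off the confusion matrix:

```python
correct = 135 + 91    # predictions that match the actual label
incorrect = 34 + 29   # predictions that do not match
total = correct + incorrect

accuracy = correct / total
error_rate = incorrect / total
print(f"accuracy = {accuracy:.3f}, error rate = {error_rate:.3f}")
```

Here 226 of the 289 test samples are classified correctly, which rounds to 78%.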

### Choosing a K Value

Let’s go ahead and use the elbow method to pick a good K Value:

```python
error_rate = []

# Will take some time
for i in range(1, 40):
    knn = KNeighborsClassifier(n_neighbors=i)
    knn.fit(X_train, y_train)
    pred_i = knn.predict(X_test)
    # Fraction of test samples that were misclassified
    error_rate.append(np.mean(pred_i != y_test))
```

```python
plt.figure(figsize=(10, 6))
plt.plot(range(1, 40), error_rate, color='blue', linestyle='dashed', marker='o',
         markerfacecolor='red', markersize=10)
plt.title('Error Rate vs. K Value')
plt.xlabel('K')
plt.ylabel('Error Rate')
```
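Besides reading the plot, the lowest point can also be picked programmatically. The sketch below assumes a list of error rates indexed from k=1, like the one built in the loop. The helper `best_k` and the example numbers are hypothetical and invented for illustration, not results from this dataset.

```python
def best_k(error_rate, first_k=1):
    """Return the k with the lowest error; ties go to the smaller k."""
    lowest = min(error_rate)
    return first_k + error_rate.index(lowest)

# Hypothetical error rates for k = 1..8 (not measured on the mammogram data).
example_errors = [0.22, 0.24, 0.20, 0.19, 0.18, 0.19, 0.18, 0.20]
chosen_k = best_k(example_errors)
print(chosen_k)
```

One caveat: choosing k by its error on the test set and then reporting accuracy on that same test set tends to give an optimistic estimate. Selecting k with cross-validation on the training data is a common alternative.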

With k=5, the accuracy of the model is 82%.