Prediction of Diabetes Occurrence​

In this project, we aim to predict the occurrence of diabetes in the Pima Native American group. We implement the Decision Tree algorithm in Python using scikit-learn.

The data contains the following columns:

  • times_pregnant: Number of times pregnant
  • plasma_glucose: Plasma glucose concentration at 2 hours in an oral glucose tolerance test
  • diastolic_blood_pressure: Measured in mmHg
  • tricep_skin_fold_thickness: Measured in mm
  • serum_insulin: 2-hour serum insulin concentration (mu U/ml)
  • body_mass_index: Weight in kg divided by the square of height in m (kg/m^2)
  • diabetes_pedigree_function: A function that scores the likelihood of diabetes based on family history
  • age: Years
  • class: Target variable: 0 corresponds to no diabetes and 1 to diabetes

Let’s get our environment ready with the libraries we’ll need, and then import the data!

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
from sklearn.metrics import confusion_matrix
from sklearn import metrics

Check out the Data!

#importing the dataset
df = pd.read_csv('~/DataSet GitHub/Decision Tree/pima_native_american_diabetes_weka_dataset.csv')
df.head()
df.info()

Let’s check out the data summary!

plt.figure(figsize=(12,8))
sns.heatmap(df.describe()[1:].transpose(),
            annot=True,linecolor="w",
            linewidth=2,cmap=sns.color_palette("Set1"))
plt.title("Data summary")
plt.show()


Exploratory Data Analysis

Let’s check out the correlation between variables.

correlation = df.corr()
plt.figure(figsize=(10,8))
sns.heatmap(correlation,annot=True,
            cmap=sns.color_palette("magma"),
            linewidth=2,edgecolor="k")
plt.title("CORRELATION BETWEEN VARIABLES")
plt.show()
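If you prefer the numbers directly, the same information can be read off by printing each feature’s correlation with the target. This is a quick sketch reusing the correlation DataFrame computed above:

#correlation of each feature with the target, strongest first
print(correlation["class"].drop("class").sort_values(ascending=False))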

Let’s check out the proportion of the target variable in the dataset!

plt.figure(figsize=(12,6))
plt.pie(df["class"].value_counts().values,
        labels=["no diabets","diabets"],
        autopct="%1.0f%%",wedgeprops={"linewidth":2,"edgecolor":"white"})
my_circ = plt.Circle((0,0),.7,color = "white")
plt.gca().add_artist(my_circ)
plt.subplots_adjust(wspace = .2)
plt.title("Proportion of target variable in dataset")
plt.show()
plt.figure(figsize=(12,6))
sns.scatterplot(data=df,x='age',y='times_pregnant',hue='class',palette="Set2")
plt.legend(title='class',loc='upper right', labels=['no diabetes', 'diabetes'])

Looking at the mean values for the diabetic class, patients who developed diabetes averaged times_pregnant ≈ 4.87, plasma_glucose ≈ 141.25, and diastolic_blood_pressure ≈ 70.82. Scores above these averages are associated with a higher likelihood of diabetes.

df[(df['class'] ==1)].mean().reset_index()
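To see how these averages differ between the two groups, here is a small optional sketch using groupby to compare the feature means side by side:

#compare feature means of the non-diabetic and diabetic classes
class_means = df.groupby("class").mean().transpose()
class_means.columns = ["no diabetes", "diabetes"]
print(class_means)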

Train Test Split

Split the data into a training set and a testing set

X = df.iloc[:,:-1]
Y = df.iloc[:,8]
#Splitting the data into training set and test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,Y, test_size = 0.25, random_state = 0)

Train a Model

Now it’s time to train a Decision Tree Classifier. 

#fitting classifier to the training set
from sklearn.tree import DecisionTreeClassifier
classifier = DecisionTreeClassifier(criterion='entropy', random_state = 0)
classifier.fit(X_train,y_train)
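Before evaluating, it can be useful to see which features the tree relies on most. Here is a short optional sketch that reads the fitted classifier’s feature_importances_ attribute:

#feature importances learned by the fitted tree, largest first
importances = pd.Series(classifier.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))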

Model Evaluation

Now get predictions from the model and create a confusion matrix and a classification report.

y_pred = classifier.predict(X_test)
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test,y_pred)
class_names = [0,1] # names of the classes
fig, ax = plt.subplots()
# create heatmap of the confusion matrix with class names on the axes
sns.heatmap(pd.DataFrame(cm), annot=True, cmap="BuPu", fmt='g',
            xticklabels=class_names, yticklabels=class_names)
ax.xaxis.set_label_position("top")
plt.tight_layout()
plt.title('Confusion matrix', y=1.1)
plt.ylabel('Actual label')
plt.xlabel('Predicted label')
plt.show()

Here, 105 and 44 (the diagonal entries) are the correct predictions, while 18 and 25 are the incorrect ones, so we can see that most of the predictions are correct.


Correct Predictions : 105+44 = 149

Incorrect Predictions: 18+25 = 43
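The same accuracy can also be computed directly, either from the confusion matrix or with the metrics module imported earlier. This is a quick check using the y_test and y_pred arrays defined above:

#accuracy from the confusion matrix: correct predictions / total predictions
print((cm[0, 0] + cm[1, 1]) / cm.sum())
#equivalent calculation with sklearn
print(metrics.accuracy_score(y_test, y_pred))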

Create a classification report for the model.

from sklearn.metrics import classification_report
print(classification_report(y_test,y_pred))

The accuracy of the model is 77% (149 correct out of 192 test samples)!


Tree Visualisation

scikit-learn has built-in support for exporting decision trees to Graphviz. You won’t use this often, and it requires the pydot library to be installed, but here is an example of what it looks like and the code to execute it:

from IPython.display import Image  
from io import StringIO
from sklearn.tree import export_graphviz
import pydot 

features = list(df.columns[:-1])
features
dot_data = StringIO()  
export_graphviz(classifier, out_file=dot_data,feature_names=features,filled=True,rounded=True)

graph = pydot.graph_from_dot_data(dot_data.getvalue())  
Image(graph[0].create_png())  
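If installing pydot and Graphviz is inconvenient, newer versions of scikit-learn (0.21+) can also draw the tree directly with matplotlib via sklearn.tree.plot_tree. This is a minimal alternative sketch, limited to a few levels so the plot stays readable:

from sklearn.tree import plot_tree

plt.figure(figsize=(20,10))
#max_depth limits how many levels are drawn so the plot stays legible
plot_tree(classifier, feature_names=features,
          class_names=["no diabetes","diabetes"],
          filled=True, rounded=True, max_depth=3)
plt.show()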