Prediction of Diabetes Occurrence

In this project, we aim to predict the occurrence of diabetes within the Pima Native American group. We implement a Decision Tree classifier in Python.

The data contains the following columns:

  • times_pregnant: Number of times pregnant
  • plasma_glucose: Plasma glucose concentration in a 2-hour oral glucose tolerance test
  • diastolic_blood_pressure: Measured in mmHg
  • tricep_skin_fold_thickness: Measured in mm
  • serum_insulin: 2-hour serum insulin concentration, measured in mu U/ml
  • body_mass_index: Weight in kg / (height in m)^2
  • diabetes_pedigree_function: A score that estimates the likelihood of diabetes based on family history
  • age: Years
  • class: Target variable: 0 means no diabetes, 1 means diabetes

Let’s get our environment ready with the libraries we’ll need and then import the data!

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
from sklearn.metrics import confusion_matrix
from sklearn import metrics

Check out the Data!

#importing the dataset
df = pd.read_csv('~/DataSet GitHub/Decision Tree/pima_native_american_diabetes_weka_dataset.csv')
df.head()
df.info()

Let’s check out the data summary!

plt.figure(figsize=(12,8))
sns.heatmap(df.describe()[1:].transpose(),
            annot=True,linecolor="w",
            linewidth=2,cmap=sns.color_palette("Set1"))
plt.title("Data summary")
plt.show()


Exploratory Data Analysis

Let’s check out the correlation between variables.

correlation = df.corr()
plt.figure(figsize=(10,8))
sns.heatmap(correlation,annot=True,
            cmap=sns.color_palette("magma"),
            linewidth=2,edgecolor="k")
plt.title("CORRELATION BETWEEN VARIABLES")
plt.show()
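To read the heatmap more directly, it can help to rank the features by the strength of their correlation with the target. The sketch below uses a small synthetic frame (the `toy` name and all values are made up for illustration, not taken from the dataset); on the real data the same expression would be applied to `df`.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the diabetes data (all values are made up).
rng = np.random.default_rng(0)
n = 200
target = rng.integers(0, 2, size=n)
toy = pd.DataFrame({
    "plasma_glucose": 100 + 30 * target + rng.normal(0, 10, size=n),  # strongly tied to class
    "age": 30 + 5 * target + rng.normal(0, 10, size=n),               # weakly tied to class
    "noise": rng.normal(0, 1, size=n),                                # unrelated to class
    "class": target,
})

# Absolute correlation of every feature with the target, strongest first.
target_corr = toy.corr()["class"].drop("class").abs().sort_values(ascending=False)
print(target_corr)
```

Taking the absolute value keeps strongly negative correlations near the top of the ranking as well.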

Let’s check out the proportion of each target class in the dataset!

plt.figure(figsize=(12,6))
plt.pie(df["class"].value_counts().values,
        labels=["no diabetes","diabetes"],
        autopct="%1.0f%%",wedgeprops={"linewidth":2,"edgecolor":"white"})
my_circ = plt.Circle((0,0),.7,color = "white")
plt.gca().add_artist(my_circ)
plt.subplots_adjust(wspace = .2)
plt.title("Proportion of target variable in dataset")
plt.show()
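plt.figure(figsize=(12,6))
# scatterplot takes `palette` (not `cmap`) to colour the hue levels
ax = sns.scatterplot(data=df,x='age',y='times_pregnant',hue='class',palette="Set2")
# reuse seaborn's legend handles so each label stays attached to the right class
handles, _ = ax.get_legend_handles_labels()
ax.legend(handles, ['no diabetes', 'diabetes'], title='class', loc='upper right')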
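The percentages in the pie chart can also be checked numerically. This is a minimal sketch on a toy label column standing in for `df["class"]` (the values are made up). Since `value_counts` sorts classes by frequency, renaming the index keeps each label attached to the right class instead of relying on the order of a hard-coded list.

```python
import pandas as pd

# Toy label column (made up): 0 = no diabetes, 1 = diabetes.
labels = pd.Series([0, 0, 0, 1, 0, 1, 0, 1, 0, 0], name="class")

counts = labels.value_counts()                # absolute count per class
shares = labels.value_counts(normalize=True)  # fractions that sum to 1

# Attach readable names to the class codes before plotting or printing.
named_shares = shares.rename(index={0: "no diabetes", 1: "diabetes"})
print(counts)
print(named_shares.round(2))
```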

Among the patients with diabetes (class = 1), the mean values are times_pregnant = 4.87, plasma_glucose = 141.25 and diastolic_blood_pressure = 70.82. These are group averages rather than diagnostic thresholds, but values above them are more typical of the diabetic group.

df[(df['class'] ==1)].mean().reset_index()
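Comparing both groups side by side makes these averages easier to interpret than looking at one class alone. Here is a minimal sketch on made-up numbers (the `toy` frame is hypothetical); on the real data, `df.groupby("class").mean()` would give the equivalent table.

```python
import pandas as pd

# Hypothetical miniature dataset with the same kind of columns.
toy = pd.DataFrame({
    "plasma_glucose": [90, 110, 100, 150, 140, 160],
    "age": [25, 30, 35, 45, 50, 40],
    "class": [0, 0, 0, 1, 1, 1],
})

# One row of feature means per class.
group_means = toy.groupby("class").mean()
print(group_means)

# How much higher each feature is, on average, in the diabetic group.
mean_gap = group_means.loc[1] - group_means.loc[0]
print(mean_gap)
```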

Train Test Split

Split the data into a training set and a testing set.

X = df.iloc[:,:-1]
Y = df.iloc[:,8]
#Splitting the data into training set and test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,Y, test_size = 0.25, random_state = 0)
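If the two classes are not equally frequent, one option is to stratify the split so that the training and test sets keep similar class proportions. The following is a sketch of that variant on toy arrays, not the split used above; `stratify` is a standard argument of `train_test_split`, while the `X_toy` and `y_toy` names are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 100 samples with two features, 30% positives.
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.array([1] * 30 + [0] * 70)

# stratify=y_toy asks for roughly the same positive rate in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=0, stratify=y_toy
)

print("positive rate in train:", y_tr.mean())
print("positive rate in test: ", y_te.mean())
```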

Train a Model

Now it’s time to train a Decision Tree Classifier. 

#fitting classifier to the training set
from sklearn.tree import DecisionTreeClassifier
classifier = DecisionTreeClassifier(criterion='entropy', random_state = 0)
classifier.fit(X_train,y_train)
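With `criterion='entropy'`, the tree prefers splits that reduce the entropy of the class labels the most, a quantity usually called information gain. The pure-Python sketch below illustrates the idea on a toy split; the helper names `entropy` and `information_gain` are introduced here for illustration and are not part of scikit-learn.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a non-empty list of class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def information_gain(parent, left, right):
    """Entropy of the parent minus the size-weighted entropy of the two children."""
    n = len(parent)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted

parent = [0, 0, 0, 0, 1, 1, 1, 1]  # perfectly mixed node
left = [0, 0, 0, 1]                # mostly class 0 after the split
right = [0, 1, 1, 1]               # mostly class 1 after the split

parent_entropy = entropy(parent)
gain = information_gain(parent, left, right)
print(round(parent_entropy, 3), round(gain, 3))
```

A split that separated the classes perfectly would bring both child entropies to zero, so its gain would equal the parent entropy.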

Model Evaluation

Now get predictions from the model and create a confusion matrix and a classification report.

y_pred = classifier.predict(X_test)
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test,y_pred)
class_names=[0,1] # name  of classes
fig, ax = plt.subplots()
tick_marks = np.arange(len(class_names))
plt.xticks(tick_marks, class_names)
plt.yticks(tick_marks, class_names)
# create heatmap
sns.heatmap(pd.DataFrame(cm), annot=True, cmap="BuPu" ,fmt='g')
ax.xaxis.set_label_position("top")
plt.tight_layout()
plt.title('Confusion matrix', y=1.1)
plt.ylabel('Actual label')
plt.xlabel('Predicted label')

105 and 44 are the correct predictions, while 18 and 25 are the incorrect predictions. So the model gets quite a lot of predictions right.


Correct Predictions: 105+44 = 149

Incorrect Predictions: 18+25 = 43
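The same four counts are enough to work out the headline metrics by hand. In the sketch below, accuracy depends only on the totals reported above. Which of the two error counts is the false positives and which is the false negatives is an assumption made for illustration, so the precision and recall values hold only under that assumed layout.

```python
# Counts taken from the confusion matrix discussed above.
# ASSUMPTION: 25 are false positives and 18 are false negatives.
tn, fp = 105, 25
fn, tp = 18, 44

total = tn + fp + fn + tp
accuracy = (tn + tp) / total  # share of all predictions that are correct
precision = tp / (tp + fp)    # share of predicted positives that are right
recall = tp / (tp + fn)       # share of actual positives that are found

print(f"total     = {total}")
print(f"accuracy  = {accuracy:.3f}")
print(f"precision = {precision:.3f}")
print(f"recall    = {recall:.3f}")
```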

Create a classification report for the model.

from sklearn.metrics import classification_report
print(classification_report(y_test,y_pred))

The accuracy of the model is about 77.6% (149 of 192 test samples)!


Tree Visualisation

scikit-learn has some built-in visualization capabilities for decision trees. You won’t use them often, and this approach requires the pydot library (which in turn needs Graphviz installed), but here is the code that renders the fitted tree:

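from io import StringIO  # sklearn.externals.six was removed from recent scikit-learn releases
from IPython.display import Image
from sklearn.tree import export_graphviz
import pydot

# feature names are all columns except the target
features = list(df.columns[:-1])
features
# write the fitted tree in Graphviz DOT format into an in-memory buffer
dot_data = StringIO()
export_graphviz(classifier, out_file=dot_data,feature_names=features,filled=True,rounded=True)

# graph_from_dot_data returns a list of graphs; render the first one as PNG
graph = pydot.graph_from_dot_data(dot_data.getvalue())
Image(graph[0].create_png())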
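If pydot or Graphviz is not available, a text rendering of the tree can serve as a lightweight alternative. The sketch below is not the method used above: it fits a shallow tree on synthetic data from `make_classification` (the feature names `f0` to `f3` are placeholders) and prints its rules with `sklearn.tree.export_text`.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in data; the feature names are placeholders.
X_toy, y_toy = make_classification(n_samples=200, n_features=4, random_state=0)
feature_names = ["f0", "f1", "f2", "f3"]

# A shallow tree keeps the printed rules short enough to read.
tree = DecisionTreeClassifier(criterion="entropy", max_depth=2, random_state=0)
tree.fit(X_toy, y_toy)

# Plain-text view of the split rules and the class predicted at each leaf.
rules = export_text(tree, feature_names=feature_names)
print(rules)
```

On the real model, the equivalent call would pass `classifier` and the `features` list built above.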