Prediction of Medical Insurance Cost


In this project, we predict medical insurance costs using a Random Forest Regression model implemented in Python.

The data was downloaded from the Kaggle website (medical insurance cost dataset).

The data contains the following columns:

  • age: age of the beneficiary
  • sex: gender of the insurance contractor (female, male)
  • bmi: body mass index, an objective index of body weight relative to height (kg / m ^ 2, weight divided by height squared) that shows whether a weight is relatively high or low for a given height; ideally 18.5 to 24.9
  • children: number of children covered by health insurance / number of dependents
  • smoker: whether the person smokes
  • region: the beneficiary’s residential area in the US (northeast, southeast, southwest, northwest)
  • charges: individual medical costs billed by health insurance


Let’s get our environment ready with the libraries we’ll need and then import the data!

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
plt.style.use('ggplot')

Check out the Data

df = pd.read_csv('~/DataSet GitHub/Regression/insurance.csv')

Let’s see the unique values in the region column
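The code for this step is missing from the page, so the following is only a minimal sketch of how the unique values could be listed. The small data frame df_demo is a hypothetical stand-in for insurance.csv with made-up values; on the real data the same two calls would be applied to df.

```python
import pandas as pd

# Hypothetical stand-in for insurance.csv (all values are made up)
df_demo = pd.DataFrame({
    "region": ["southwest", "southeast", "northwest", "northeast", "southeast"],
    "charges": [1000.0, 2000.0, 3000.0, 4000.0, 5000.0],
})

# Distinct region labels, in order of first appearance
regions = df_demo["region"].unique()
print(regions)

# Number of rows per region
region_counts = df_demo["region"].value_counts()
print(region_counts)
```

unique() only lists the labels, while value_counts() also shows how evenly the rows are spread across them.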



Encoding Features

Our data frame contains categorical data. We need to convert it to numeric values before fitting the Random Forest Regression model on it.

from sklearn.preprocessing import LabelEncoder
labelencoder_smoker = LabelEncoder()
df.smoker = labelencoder_smoker.fit_transform(df.smoker)
labelencoder_sex = LabelEncoder()
df.sex = labelencoder_sex.fit_transform(df.sex)
labelencoder_region = LabelEncoder()
df.region = labelencoder_region.fit_transform(df.region)
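To make the encoding less of a black box, here is a small sketch of how the integer codes can be inspected. LabelEncoder sorts the labels and numbers them from 0, so the codes are alphabetical and carry no other meaning. The data frame df_demo and the names encoder, codes and mapping are hypothetical and used only for illustration.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical stand-in for the region column (values are made up)
df_demo = pd.DataFrame({
    "region": ["southwest", "northeast", "southeast", "northwest"],
})

encoder = LabelEncoder()
codes = encoder.fit_transform(df_demo["region"])

# classes_ holds the original labels; a label's position is its integer code
mapping = {label: code for code, label in enumerate(encoder.classes_)}
print(mapping)
print(codes)

# inverse_transform recovers the original labels from the codes
restored = encoder.inverse_transform(codes)
```

Keeping the fitted encoder around, as the code above does with one encoder per column, is what makes it possible to translate the numbers back to labels later.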


Data Summary

plt.title("Data summary")
[Figure: data summary]


Correlation Matrix

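# Correlation matrix of all (now numeric) columns
cor_mat = df.corr()
# Copy the matrix and zero out its lower triangle (including the diagonal)
mask = np.array(cor_mat)
mask[np.tril_indices_from(mask)] = False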
[Figure: correlation matrix]
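The plotting call that used this mask is missing from the page. Assuming the mask was passed to the mask argument of seaborn's heatmap (which hides the cells where the mask is True), the sketch below shows what the mask looks like once it is read as booleans. The matrix cor_demo is hypothetical and its values are made up.

```python
import numpy as np

# Hypothetical 3x3 correlation matrix (values are made up)
cor_demo = np.array([
    [1.0, 0.3, 0.1],
    [0.3, 1.0, 0.2],
    [0.1, 0.2, 1.0],
])

mask_demo = cor_demo.copy()
# Zero out the lower triangle, including the diagonal
mask_demo[np.tril_indices_from(mask_demo)] = False

# Read as booleans, only the non-zero upper triangle is True
bool_mask = mask_demo.astype(bool)
print(bool_mask)
```

Under that assumption the heatmap would hide the upper triangle and show each pair of columns only once, together with the diagonal.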



In the next analysis step, let’s look at the percentage of smokers and non-smokers in our data.

explode = (0.1,0)  
fig1, ax1 = plt.subplots(figsize=(12,7))
ax1.pie(df['smoker'].value_counts(), explode=explode, labels=['No','Yes'], autopct='%1.1f%%')
# Equal aspect ratio ensures that pie is drawn as a circle
ax1.axis('equal')
[Figure: percentage of smokers and non-smokers]
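The same percentages can also be computed directly, which is a handy cross-check for the pie chart. Note that value_counts() orders its result by frequency, so the labels ['No','Yes'] in the chart only line up if non-smokers are the larger group. The sketch below uses a hypothetical, made-up smoker column (0 = non-smoker, 1 = smoker, as produced by the label encoding) and sorts by code to make the order explicit.

```python
import pandas as pd

# Hypothetical encoded smoker column: 0 = non-smoker, 1 = smoker (made-up values)
smoker_demo = pd.Series([0, 0, 0, 1, 0, 1, 0, 0], name="smoker")

# Share of each class in percent, ordered by code rather than by frequency
shares = smoker_demo.value_counts(normalize=True).sort_index() * 100
print(shares)
```

With sort_index() the first entry always belongs to code 0, so the labels no longer depend on which group happens to be larger.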

The next visualisation step is to find the number of males and females in the data.

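print('Female:', len(df[df.sex == 0]))
print('Male:', len(df[df.sex == 1]))
y = len(df[df.sex == 0]), len(df[df.sex == 1])
x = ['Female', 'Male']
# bar chart assumed here; the original plotting call is truncated
plt.bar(x, y, color='coral')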
[Figure: number of males vs. females]

Now let’s see the overall distribution of charges comparing smokers and non-smokers

plt.title("Overall distribution of charges comparing smokers and non-smokers")
[Figure: overall distribution of charges comparing smokers and non-smokers]

As the figure above shows, smokers spent much more on medical care than non-smokers.
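A quick way to put numbers behind such a comparison is to group the charges by smoker status. This is only a sketch: df_demo is a hypothetical data frame with made-up values, where smoker is already encoded as 0 (non-smoker) and 1 (smoker).

```python
import pandas as pd

# Hypothetical encoded data (values are made up): 1 = smoker, 0 = non-smoker
df_demo = pd.DataFrame({
    "smoker": [1, 0, 0, 1, 0, 1],
    "charges": [30000.0, 4000.0, 6000.0, 40000.0, 5000.0, 35000.0],
})

# Mean, median and group size of charges per smoker status
summary = df_demo.groupby("smoker")["charges"].agg(["mean", "median", "count"])
print(summary)
```

Running the same groupby on the real df would show how large the gap between the two groups actually is.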


In the next step of EDA, let’s visualise a scatter plot of charges vs. age, split by smoker status.

plt.title('Scatter plot of Charges vs age')
[Figure: scatter plot of charges vs. age]

In the final stage of the EDA phase, let’s see the distribution of charges in each region.

plt.title("Distribution of Charges by Region")
for i in df['region'].unique():
    # kdeplot replaces the deprecated distplot(hist=False, kde=True)
    sns.kdeplot(x=df[df['region'] == i]['charges'], label=i)
plt.legend()
[Figure: distribution of charges by region]


Training a Random Forest Regression Model

Let’s now begin to train the regression model! We first need to split our data into an X array that contains the features to train on and a y array with the target variable, in this case the charges column. The categorical columns were already converted to numbers in the encoding step, so no column has to be dropped apart from the target.

X = df.drop('charges',axis=1)
y = df['charges']

Train Test Split

Now let’s split the data into a training set and a testing set. We will train our model on the training set and then use the test set to evaluate the model.

# separate training (80%) and test (20%) sets
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=20)
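To see what the split does, here is a small sketch on toy arrays. X_demo and y_demo are hypothetical stand-ins for X and y; the test_size and random_state values are the ones used above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-ins for X and y: 10 rows, 2 features
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=20
)
print(X_tr.shape, X_te.shape)  # (8, 2) (2, 2)

# Repeating the call with the same random_state reproduces the same split
_, _, _, y_te_again = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=20
)
same_split = np.array_equal(y_te, y_te_again)
print(same_split)
```

Fixing random_state is what makes the reported metrics reproducible from run to run.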

Creating and Training the Model

from sklearn.ensemble import RandomForestRegressor
randomforest = RandomForestRegressor(n_estimators=100, random_state=20)
randomforest.fit(x_train, y_train)
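For readers who want to try the training step without the insurance file, here is a self-contained sketch on synthetic data. The data, the names X_demo, y_demo and forest, and the formula used to generate the target are all assumptions made for illustration; only the hyperparameters are taken from the code above.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic data (not the insurance data): the target depends on two of three features
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(300, 3))
y_demo = 5 * X_demo[:, 0] + 2 * X_demo[:, 1] + rng.normal(0, 1, size=300)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=20
)

# Same hyperparameters as in the text
forest = RandomForestRegressor(n_estimators=100, random_state=20)
forest.fit(X_tr, y_tr)

# Predict on the held-out part and score it
demo_predictions = forest.predict(X_te)
demo_r2 = r2_score(y_te, demo_predictions)
print("R2 on held-out synthetic data:", round(demo_r2, 2))
```

n_estimators sets the number of trees in the forest, and random_state fixes the randomness of the bootstrap samples so the fitted model is reproducible.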

Model Evaluation

In the model evaluation step, we get predictions for x_test and then visualise the result.

predictions = randomforest.predict(x_test)
[Figure: model evaluation, random forest regression]
from sklearn import metrics
from sklearn.metrics import r2_score
print('MAE:', metrics.mean_absolute_error(y_test, predictions))
print('MSE:', metrics.mean_squared_error(y_test, predictions))
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, predictions)))
print('R2 test_data: ', round(r2_score(y_test,predictions), 2))
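To make clear what these four numbers measure, the sketch below computes them by hand with NumPy on a tiny made-up example. The arrays y_true and y_pred are hypothetical; they are not results from the insurance model.

```python
import numpy as np

# Made-up true values and predictions
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_pred = np.array([110.0, 190.0, 320.0, 380.0])

errors = y_true - y_pred

mae = np.mean(np.abs(errors))   # mean absolute error
mse = np.mean(errors ** 2)      # mean squared error
rmse = np.sqrt(mse)             # root of the MSE, in the units of the target

# R2 = 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print("MAE:", mae)
print("MSE:", mse)
print("RMSE:", rmse)
print("R2:", round(r2, 2))
```

MAE and RMSE are expressed in the same units as charges, while R2 is unitless and compares the model against always predicting the mean.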

As we can see, the model reaches an R2 score of 0.89 on the test data, i.e. it explains about 89% of the variance in charges, which is quite good.

Visualising the Random Forest Regression Result

plt.scatter(y_test, predictions, edgecolors=(0,0,0))
plt.xlabel("Measured Charges")
plt.ylabel("Predicted Charges")
plt.title("Predicted vs. measured charges (test set)")
[Figure: random forest regression result]