In this project, we are going to predict medical insurance costs. We implemented Random Forest Regression using Python.
The data was downloaded from the Kaggle website (medical insurance cost dataset).
The data contains the following columns:
- age: age of the primary beneficiary
- sex: gender of the insurance contractor (female, male)
- bmi: body mass index, an objective index of body weight relative to height, computed as weight divided by height squared (kg/m^2); it indicates whether weight is relatively high or low for a given height, with an ideal range of 18.5 to 24.9
- children: number of children covered by health insurance / number of dependents
- smoker: whether the beneficiary smokes
- region: the beneficiary’s residential area in the US (northeast, southeast, southwest, northwest)
- charges: individual medical costs billed by health insurance
.
Let’s get our environment ready with the libraries we’ll need and then import the data!
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
plt.style.use('ggplot')
Check out the Data
df = pd.read_csv('~/DataSet GitHub/Regression/insurance.csv')
df.head()

df.info()

Let’s see the unique values in the region column
df.region.unique()

.
Encoding Features
Our data frame contains categorical columns (sex, smoker and region). We need to convert them to numeric values before fitting the Random Forest Regression model.
from sklearn.preprocessing import LabelEncoder
#smoker
labelencoder_smoker = LabelEncoder()
df.smoker = labelencoder_smoker.fit_transform(df.smoker)
#sex
labelencoder_sex = LabelEncoder()
df.sex = labelencoder_sex.fit_transform(df.sex)
#region
labelencoder_region = LabelEncoder()
df.region = labelencoder_region.fit_transform(df.region)
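To see which integer each category receives, the fitted encoder's `classes_` attribute can be inspected. The sketch below uses a few hand-written sample values rather than the real data frame; the helper name `encoding_map` is made up for illustration, and the `yes`/`no` values for smoker are an assumption about how that column is stored. `LabelEncoder` sorts the labels, so the codes follow alphabetical order.

```python
from sklearn.preprocessing import LabelEncoder

def encoding_map(values):
    """Return {original label: integer code} for a freshly fitted LabelEncoder."""
    encoder = LabelEncoder().fit(values)
    # classes_ holds the sorted unique labels; a label's position is its code
    return {str(label): code for code, label in enumerate(encoder.classes_)}

# Sample values standing in for the real columns (smoker values are assumed)
sex_map = encoding_map(["male", "female", "female"])
smoker_map = encoding_map(["yes", "no", "no"])
region_map = encoding_map(["southwest", "southeast", "northwest", "northeast"])

print(sex_map)     # {'female': 0, 'male': 1}
print(smoker_map)  # {'no': 0, 'yes': 1}
print(region_map)  # {'northeast': 0, 'northwest': 1, 'southeast': 2, 'southwest': 3}
```

Under these assumptions, `sex == 0` corresponds to female and `smoker == 1` to smokers in the plots that follow.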

.
Data Summary
plt.figure(figsize=(12,8))
sns.heatmap(df.describe()[1:].transpose(),
annot=True,linecolor="w",
linewidth=2,cmap=sns.color_palette("tab20"))
plt.title("Data summary")
plt.show()

.
Correlation Matrix
cor_mat= df[:].corr()
mask = np.array(cor_mat)
mask[np.tril_indices_from(mask)] = False
fig=plt.gcf()
fig.set_size_inches(30,12)
sns.heatmap(data=cor_mat,mask=mask,square=True,annot=True,cbar=True)
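The mask decides which cells of the heatmap are hidden: seaborn skips every cell where the mask is truthy. A boolean mask can state this more directly. The sketch below builds one with `np.triu` on a small made-up frame (the values are illustrative, not taken from the dataset) and hides the strict upper triangle so that each pair of columns appears once.

```python
import numpy as np
import pandas as pd

# Tiny illustrative frame; the numbers are made up, not from insurance.csv
toy = pd.DataFrame({
    "age":     [19, 33, 45, 52, 61],
    "bmi":     [27.9, 22.7, 30.1, 25.3, 29.0],
    "charges": [1800.0, 4500.0, 8200.0, 11000.0, 14000.0],
})
toy_cor = toy.corr()

# True marks a cell to hide; k=1 keeps the diagonal visible
toy_mask = np.triu(np.ones(toy_cor.shape, dtype=bool), k=1)

# The same mask could be passed as sns.heatmap(toy_cor, mask=toy_mask, annot=True)
visible = toy_cor.where(~toy_mask)
print(visible.round(2))
```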

.
EDA
In the next data analysis step, let’s see the percentage of smokers and non-smokers in our data
explode = (0.1,0)
fig1, ax1 = plt.subplots(figsize=(12,7))
ax1.pie(df['smoker'].value_counts(), explode=explode,labels=['No','Yes'], autopct='%1.1f%%',
shadow=True)
# Equal aspect ratio ensures that pie is drawn as a circle
ax1.axis('equal')
plt.tight_layout()
plt.legend()
plt.show()

The next data visualisation step is to find out the number of males and females in the data
plt.figure(figsize=(12,7))
print('Female:',len(df[df.sex == 0]))
print('Male:',len(df[df.sex == 1]))
y = len(df[df.sex == 0]),len(df[df.sex == 1])
x = ['Female','Male']
plt.bar(x,y,color = 'coral')
plt.show()

Now let’s see the overall distribution of charges comparing smokers and non-smokers
plt.figure(figsize=(12,9))
sns.boxplot(x='smoker',y='charges',data=df)
plt.title("Overall distribution of charges comparing smokers and non-smokers")

As we can see in the figure above, smokers were charged much more for medical care than non-smokers
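A group-wise summary can put numbers on what the boxplot shows. The rows below are made up purely for illustration (they are not from the dataset); on the real data the same `groupby` call would be applied to `df`.

```python
import pandas as pd

# Made-up rows for illustration only; smoker is label-encoded (0 = no, 1 = yes)
toy = pd.DataFrame({
    "smoker":  [0, 0, 0, 1, 1],
    "charges": [2000.0, 4000.0, 6000.0, 20000.0, 30000.0],
})

# Mean, median and group size of charges per smoking status
summary = toy.groupby("smoker")["charges"].agg(["mean", "median", "count"])
print(summary)

# Ratio of average charges, smokers relative to non-smokers
ratio = summary.loc[1, "mean"] / summary.loc[0, "mean"]
print(f"Mean charges for smokers are {ratio:.2f}x those of non-smokers in this toy sample")
```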
In the next step of EDA, let’s plot charges against age, colouring the points by smoking status
plt.figure(figsize=(12,8))
sns.scatterplot(x='age',y='charges',data=df,palette='viridis',hue='smoker')
plt.title('Scatter plot of Charges vs age')

In the final stage of the EDA phase, let’s see the distribution of charges in each region
plt.figure(figsize=(12,8))
plt.title("Distribution of Charges by Region")
for i in df['region'].unique():
    # kdeplot replaces the deprecated distplot(hist=False, kde=True)
    sns.kdeplot(df[df['region'] == i]['charges'], label=i)
plt.legend()

.
Training a Random Forest Regression Model
Let’s now begin to train the regression model! We first need to split our data into an X array that contains the features to train on, and a y array with the target variable, in this case the charges column. Every other column is numeric (the categorical ones were label-encoded earlier), so we keep all of them as features.
X = df.drop('charges',axis=1)
y = df['charges']
Train Test Split
Now let’s split the data into a training set and a testing set. We will train our model on the training set and then use the test set to evaluate the model.
# separate training (80%) and test (20%) sets
from sklearn.model_selection import train_test_split
x_train, x_test,y_train, y_test = train_test_split(X,y,test_size = 0.20,random_state = 20)
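As a quick sanity check on what `test_size` and `random_state` do, the sketch below splits ten made-up samples with the same settings. The array names `X_demo` and `y_demo` are placeholders, not part of the project.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Ten made-up samples with two features each
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

# Same settings as above: 20% held out, fixed seed
split_a = train_test_split(X_demo, y_demo, test_size=0.20, random_state=20)
split_b = train_test_split(X_demo, y_demo, test_size=0.20, random_state=20)

x_tr, x_te, y_tr, y_te = split_a
print(len(x_tr), len(x_te))  # 8 2

# Identical seeds give identical test rows, so results are reproducible
same_split = np.array_equal(x_te, split_b[1])
print(same_split)  # True
```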
Creating and Training the Model
from sklearn.ensemble import RandomForestRegressor
randomforest = RandomForestRegressor(n_estimators = 100, random_state = 20)
randomforest.fit(x_train,y_train)
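Once a forest is fitted, its `feature_importances_` attribute gives a rough view of which features drive the predictions. The sketch below uses synthetic data (the column names `signal` and `noise` are invented for the example); for the model in this post, the same idea would pair `randomforest.feature_importances_` with `X.columns`.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Synthetic data: the target depends strongly on "signal" and not on "noise"
demo = pd.DataFrame({
    "signal": rng.uniform(0, 10, size=300),
    "noise":  rng.uniform(0, 10, size=300),
})
target = 5.0 * demo["signal"] + rng.normal(0, 0.5, size=300)

forest = RandomForestRegressor(n_estimators=100, random_state=20)
forest.fit(demo, target)

# Impurity-based importances, one per feature, normalised to sum to 1
importances = pd.Series(forest.feature_importances_, index=demo.columns)
print(importances.sort_values(ascending=False))
```

Impurity-based importances are only a heuristic, so they are best read as a ranking rather than as exact effect sizes.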
Model Evaluation
In the model evaluation step, we need to get predictions for x_test and then visualise our results
predictions = randomforest.predict(x_test)
plt.figure(figsize=(12,8))
plt.scatter(y_test,predictions)

from sklearn import metrics
from sklearn.metrics import r2_score
print('MAE:', metrics.mean_absolute_error(y_test, predictions))
print('MSE:', metrics.mean_squared_error(y_test, predictions))
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, predictions)))
print('R2 test_data: ', round(r2_score(y_test,predictions), 2))
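For reference, the four metrics printed above can be written out by hand. The sketch below computes them with NumPy on a small made-up example that is unrelated to the insurance data, to show what each number measures.

```python
import numpy as np

# Small made-up example, unrelated to the insurance data
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

errors = y_true - y_pred

mae = np.mean(np.abs(errors))   # mean absolute error
mse = np.mean(errors ** 2)      # mean squared error
rmse = np.sqrt(mse)             # root mean squared error, in the target's units

# R^2 = 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot

print(mae, mse, round(rmse, 4), round(r2, 4))  # 0.5 0.375 0.6124 0.9486
```

MAE and RMSE are in the same units as charges, while R² is unitless and compares the model against always predicting the mean.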

As we can see, the R² score on the test set is about 0.89, meaning the model explains roughly 89% of the variance in charges, which is quite good
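A score from a single train/test split can be somewhat lucky or unlucky. One common way to check its stability is k-fold cross-validation; the sketch below is an illustration on synthetic data (`X_demo` and `y_demo` are placeholders for `X` and `y`), not a step the original analysis performed.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# Synthetic regression data standing in for X and y
X_demo = rng.uniform(0, 10, size=(200, 3))
y_demo = 4.0 * X_demo[:, 0] + 2.0 * X_demo[:, 1] + rng.normal(0, 1.0, size=200)

model = RandomForestRegressor(n_estimators=50, random_state=20)

# Five-fold cross-validation; each score is the R^2 on one held-out fold
scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring="r2")
print(scores.round(3))
print("mean R^2:", round(scores.mean(), 3), "std:", round(scores.std(), 3))
```

A small spread across folds suggests that the reported score does not hinge on one particular split.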
Visualising the Random Forest Regression Result
plt.figure(figsize=(12,8))
plt.scatter(y_test, predictions, edgecolors=(0,0,0))
plt.plot([y.min(),y.max()],[y.min(),y.max()],'k--',lw=4)
plt.xlabel("Measured Charges")
plt.ylabel("Predicted Charges")
plt.title("Predicted vs. measured charges on the test set")

