# Prediction of Medical Insurance Cost

In this project, we predict medical insurance costs using Random Forest Regression implemented in Python.

The data was downloaded from Kaggle (the medical insurance cost dataset).

The data contains the following columns:

• age: age of the primary beneficiary
• sex: gender of the insurance contractor (female, male)
• bmi: body mass index, an objective measure of body weight relative to height, computed as weight divided by height squared (kg/m²); the ideal range is 18.5 to 24.9
• children: number of children covered by health insurance (number of dependents)
• smoker: whether the beneficiary smokes (yes, no)
• region: the beneficiary’s residential area in the US (northeast, southeast, southwest, northwest)
• charges: individual medical costs billed by health insurance
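
The bmi column follows the usual definition: weight in kilograms divided by the square of height in metres. As a small illustration, the sketch below computes it for one made-up person; the helper names `bmi` and `in_ideal_range` are hypothetical and not part of the dataset or of any library.

```python
def bmi(weight_kg: float, height_m: float) -> float:
    """Body mass index: weight (kg) divided by height (m) squared."""
    return weight_kg / height_m ** 2


def in_ideal_range(value: float) -> bool:
    """True if the BMI falls in the 18.5 to 24.9 range quoted above."""
    return 18.5 <= value <= 24.9


# Made-up example values, for illustration only
example = bmi(70, 1.75)
print(round(example, 2), in_ideal_range(example))
```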

Let’s get our environment ready with the libraries we’ll need and then import the data!

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# IPython/Jupyter magic: render plots inline in the notebook
%matplotlib inline
plt.style.use('ggplot')
```

Check out the Data

```python
df = pd.read_csv('~/DataSet GitHub/Regression/insurance.csv')
df.info()
```

Let’s see the unique values in the region column:

```python
df.region.unique()
```

### Encoding Features

Our data frame contains categorical columns (sex, smoker and region). We need to convert them to numeric values before fitting a Random Forest Regression model on the data.

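```python
from sklearn.preprocessing import LabelEncoder

# LabelEncoder assigns integer codes in sorted order of the labels
# smoker: no -> 0, yes -> 1
labelencoder_smoker = LabelEncoder()
df.smoker = labelencoder_smoker.fit_transform(df.smoker)
# sex: female -> 0, male -> 1
labelencoder_sex = LabelEncoder()
df.sex = labelencoder_sex.fit_transform(df.sex)
# region: northeast -> 0, northwest -> 1, southeast -> 2, southwest -> 3
labelencoder_region = LabelEncoder()
df.region = labelencoder_region.fit_transform(df.region)
```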
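
To see which integer each category receives, it can help to inspect a fitted encoder. The sketch below uses a short made-up list of region names rather than the real data frame; `regions` and `mapping` are illustrative names only.

```python
from sklearn.preprocessing import LabelEncoder

# Made-up sample of region labels (stand-in for df.region before encoding)
regions = ['southwest', 'southeast', 'northwest', 'northeast', 'southeast']

encoder = LabelEncoder()
codes = encoder.fit_transform(regions)

# classes_ holds the labels in sorted order; the index of a label is its code
mapping = {label: code for code, label in enumerate(encoder.classes_)}
print(mapping)
print(list(codes))

# inverse_transform recovers the original labels from the codes
print(list(encoder.inverse_transform(codes)))
```

Keeping the fitted encoders around, as the code above does with `labelencoder_region`, makes it possible to translate codes back to names later, for example when labelling plots.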

### Data Summary

```python
plt.figure(figsize=(12,8))
# Skip the 'count' row of describe() and plot the remaining statistics per column
sns.heatmap(df.describe()[1:].transpose(),
            annot=True, linecolor="w",
            linewidths=2, cmap=sns.color_palette("tab20"))
plt.title("Data summary")
plt.show()
```

### Correlation Matrix

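```python
cor_mat = df.corr()
fig = plt.gcf()
fig.set_size_inches(30,12)
# Assumed completion: the original snippet is cut off here; a heatmap is one way to display cor_mat
sns.heatmap(cor_mat, annot=True)
plt.show()
```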

### EDA

In the next data analysis step, let’s see the percentage of smokers and non-smokers in our data:

```python
explode = (0.1,0)
fig1, ax1 = plt.subplots(figsize=(12,7))
# value_counts() sorts by frequency, so the labels assume non-smokers are the larger group
ax1.pie(df['smoker'].value_counts(), explode=explode, labels=['No','Yes'], autopct='%1.1f%%')
# Equal aspect ratio ensures that pie is drawn as a circle
ax1.axis('equal')
plt.tight_layout()
plt.legend()
plt.show()
```

The next visualisation step is to find the number of males and females in the data:

```python
plt.figure(figsize=(12,7))
print('Female:',len(df[df.sex == 0]))
print('Male:',len(df[df.sex == 1]))
y = len(df[df.sex == 0]),len(df[df.sex == 1])
x = ['Female','Male']
plt.bar(x,y,color = 'coral')
plt.show()
```

Now let’s see the overall distribution of charges comparing smokers and non-smokers

```python
plt.figure(figsize=(12,9))
sns.boxplot(x='smoker',y='charges',data=df)
plt.title("Overall distribution of charges comparing smokers and non-smokers")
```

As the figure above shows, smokers are billed considerably more in medical charges than non-smokers.

In the next step of EDA, let’s visualise a scatter plot of charges against age, coloured by smoker status:

```python
plt.figure(figsize=(12,8))
sns.scatterplot(x='age',y='charges',data=df,palette='viridis',hue='smoker')
plt.title('Scatter plot of Charges vs age')
```

In the final stage of the EDA phase, let’s see the distribution of charges in each region:

```python
plt.figure(figsize=(12,8))
plt.title("Distribution of Charges by Region")
# sns.distplot is deprecated; sns.kdeplot draws the same KDE curve without a histogram
for i in df['region'].unique():
    sns.kdeplot(df[df['region']==i]['charges'], label=i)
plt.legend()
```

### Training a Random Forest Regression Model

Let’s now begin to train the regression model! We first need to split our data into an X array that contains the features to train on and a y array with the target variable, in this case the charges column. Because the categorical columns were label-encoded above, every remaining column can be used as a feature.

```python
X = df.drop('charges',axis=1)
y = df['charges']
```

### Train Test Split

Now let’s split the data into a training set and a testing set. We will train our model on the training set and then use the test set to evaluate the model.

```python
# separate training (80%) and test (20%) sets
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=20)
```
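
As a quick sanity check on what the split produces, the sketch below runs the same call on a tiny made-up array (10 rows, 2 features) instead of the insurance data; `X_demo` and `y_demo` are illustrative names only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Tiny made-up dataset: 10 rows, 2 feature columns
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

x_train, x_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=20
)

# 80% of the rows go to training, 20% to testing, with no overlap
print(x_train.shape, x_test.shape)
print(sorted(y_test))
```

Fixing `random_state` makes the shuffle reproducible, so the same rows land in the test set on every run.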

### Creating and Training the Model

```python
from sklearn.ensemble import RandomForestRegressor
randomforest = RandomForestRegressor(n_estimators = 100, random_state = 20)
randomforest.fit(x_train,y_train)
```

### Model Evaluation

In the model evaluation step, we get predictions for x_test and then visualise the result.

```python
predictions = randomforest.predict(x_test)
plt.figure(figsize=(12,8))
plt.scatter(y_test,predictions)
```

```python
from sklearn import metrics
from sklearn.metrics import r2_score
print('MAE:', metrics.mean_absolute_error(y_test, predictions))
print('MSE:', metrics.mean_squared_error(y_test, predictions))
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, predictions)))
print('R2 test_data: ', round(r2_score(y_test,predictions), 2))
```

As we can see, the model’s R² score on the test set is about 0.89, meaning it explains roughly 89% of the variance in charges, which is quite good.
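
To make the four metrics concrete, here is a minimal sketch that computes them by hand with NumPy on a few made-up numbers (not the model’s actual predictions). Note that R² measures the share of variance explained, which is not the same thing as a classification accuracy.

```python
import numpy as np

# Made-up true values and predictions, for illustration only
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_pred = np.array([110.0, 190.0, 330.0, 370.0])

errors = y_true - y_pred

mae = np.mean(np.abs(errors))    # mean absolute error
mse = np.mean(errors ** 2)       # mean squared error
rmse = np.sqrt(mse)              # root mean squared error, in the units of y

# R^2 = 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print('MAE:', mae)
print('MSE:', mse)
print('RMSE:', rmse)
print('R2:', r2)
```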

### Visualising the Random Forest Regression Result

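```python
plt.figure(figsize=(12,8))
plt.scatter(y_test, predictions, edgecolors=(0,0,0))
# Dashed diagonal marks perfect predictions (predicted == measured)
plt.plot([y.min(),y.max()],[y.min(),y.max()],'k--',lw=4)
plt.xlabel("Measured Charges")
plt.ylabel("Predicted Charges")
# No cross-validation is performed above, so the title refers to the test set
plt.title("Predicted vs. measured charges (test set)")
```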
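
The evaluation above relies on a single 80/20 split. For an estimate that depends less on one particular split, k-fold cross-validation is an option. The sketch below is only an illustration under stated assumptions: it uses synthetic data from `make_regression` as a stand-in for the insurance features, and the names `X_demo`, `y_demo` and `model` are hypothetical.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the six insurance features and the charges target (assumption)
X_demo, y_demo = make_regression(n_samples=200, n_features=6, noise=10.0, random_state=20)

model = RandomForestRegressor(n_estimators=50, random_state=20)

# 5-fold cross-validation: each fold is held out once and scored with R^2
scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring='r2')

print('R2 per fold:', scores.round(2))
print('Mean R2:', scores.mean().round(2))
```

With the real data, passing `X` and `y` from the sections above in place of the synthetic arrays would give one R² value per fold.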