In this project, we aim to predict the profit of 50 startups. We implement Multiple Linear Regression in Python.

The data contains the following columns:

- R&D Spend
- Administration
- Marketing Spend
- State
- Profit

Let’s get our environment ready with the libraries we’ll need and then import the data!

```
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
```

### Check out the Data

```
df = pd.read_csv('~/DataSet GitHub/Regression/50_startups.csv')
df.head(5)
```

`df.info()`

`df.describe()`

### EDA

Let’s create some simple plots to check out the data!

`sns.pairplot(df)`

`sns.histplot(df['Profit'], kde=True)`

`sns.heatmap(df.corr(numeric_only=True))`

### Training a Linear Regression Model

Let’s now begin to train the regression model! We first need to split our data into an X array that contains the features to train on, and a y array with the target variable, in this case the Profit column. We will drop the State column because it holds categorical text, which the linear regression model can’t use without first being encoded as numbers.

```
X = df[['R&D Spend', 'Administration', 'Marketing Spend']]
y = df['Profit']
```
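Dropping State is the simplest option, but it is not the only one. As a rough sketch of the alternative, the text column could be one-hot encoded with `pd.get_dummies`. The mini-frame below is hypothetical (its values and state names are made up for illustration, not taken from the dataset), and `drop_first=True` is one common choice for avoiding perfectly collinear dummy columns.

```python
import pandas as pd

# Hypothetical mini-frame mimicking two of the dataset's columns; values are made up.
toy = pd.DataFrame({
    'R&D Spend': [100.0, 200.0, 300.0],
    'State': ['New York', 'California', 'Florida'],
})

# One-hot encode State. drop_first=True drops the first category (alphabetically)
# so the remaining dummy columns are not perfectly collinear with the intercept.
encoded = pd.get_dummies(toy, columns=['State'], drop_first=True)

print(encoded.columns.tolist())
```

The encoded frame could then be used as X in place of the three numeric columns; whether that improves the model would have to be checked on the test set.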

### Train Test Split

Now let’s split the data into a training set and a testing set. We will train our model on the training set and then use the test set to evaluate the model.

```
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=101)
```
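To make the effect of `test_size=0.4` concrete, here is a small sketch using placeholder arrays with 50 rows (standing in for the 50 startups; the values themselves are meaningless). It only illustrates how many rows end up in each set.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays: 50 rows and 3 feature columns, not the real data.
X_demo = np.arange(150).reshape(50, 3)
y_demo = np.arange(50)

# Same arguments as in the split above.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.4, random_state=101
)

# 40% of 50 rows go to the test set, the remaining 60% to the training set.
print(X_tr.shape, X_te.shape)
```

With only 50 rows, this leaves 30 rows for training and 20 for testing, so the evaluation metrics can vary noticeably with the choice of `random_state`.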

### Creating and Training the Model

```
from sklearn.linear_model import LinearRegression
lm = LinearRegression()
lm.fit(X_train,y_train)
```

### Model Evaluation

Let’s evaluate the model by checking out its coefficients and how we can interpret them.

```
# print the intercept
print(lm.intercept_)
```

```
coeff_df = pd.DataFrame(lm.coef_,X.columns,columns=['Coefficient'])
coeff_df
```

Interpreting the coefficients:

- Holding all other features fixed, a 1-unit increase in R&D Spend is associated with an *increase of $0.81* in Profit.
- Holding all other features fixed, a 1-unit increase in Administration is associated with an *increase of $0.01* in Profit.
- Holding all other features fixed, a 1-unit increase in Marketing Spend is associated with an *increase of $0.03* in Profit.
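The "holding all other features fixed" reading can be checked directly. The sketch below uses synthetic data (an assumption for illustration, not the startup dataset) where the target is built as exactly `5 + 2*x0 + 3*x1`, then shows that a prediction is the intercept plus the coefficient-weighted sum of the features, and that raising one feature by 1 shifts the prediction by that feature's coefficient.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data (illustrative only): y = 5 + 2*x0 + 3*x1, with no noise.
rng = np.random.default_rng(0)
X_demo = rng.random((20, 2))
y_demo = 5 + 2 * X_demo[:, 0] + 3 * X_demo[:, 1]

demo = LinearRegression().fit(X_demo, y_demo)

# A prediction is the intercept plus the coefficient-weighted sum of the features.
manual = demo.intercept_ + X_demo @ demo.coef_

# Increase the first feature by 1 while leaving the second unchanged.
x = X_demo[:1]
x_plus = x.copy()
x_plus[0, 0] += 1
delta = demo.predict(x_plus) - demo.predict(x)

print(demo.coef_, delta)
```

Here `delta` matches the first coefficient, which is the same statement the bullets above make about R&D Spend, Administration, and Marketing Spend.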

### Predictions from our Model

Let’s grab predictions off our test set and see how well it did!

```
predictions = lm.predict(X_test)
plt.scatter(y_test,predictions)
```

### Regression Evaluation Metrics

```
from sklearn import metrics
print('MAE:', metrics.mean_absolute_error(y_test, predictions))
print('MSE:', metrics.mean_squared_error(y_test, predictions))
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, predictions)))
```

Comparing these metrics:

- **MAE** is the easiest to understand, because it’s the average absolute error.
- **MSE** is more popular than MAE, because MSE “punishes” larger errors, which tends to be useful in the real world.
- **RMSE** is even more popular than MSE, because RMSE is interpretable in the “y” units.
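To see how the three metrics relate, they can be computed by hand with NumPy. The numbers below are made up for illustration and are not results from the model above; note how the single large error of 40 pulls RMSE above MAE.

```python
import numpy as np

# Made-up true values and predictions, for illustration only.
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_pred = np.array([110.0, 190.0, 300.0, 440.0])

errors = y_true - y_pred          # [-10, 10, 0, -40]

mae = np.mean(np.abs(errors))     # average absolute error
mse = np.mean(errors ** 2)        # average squared error, dominated by the 40
rmse = np.sqrt(mse)               # back in the same units as y

print('MAE:', mae)
print('MSE:', mse)
print('RMSE:', rmse)
```

MAE comes out at 15 and MSE at 450, giving an RMSE of about 21.2. The gap between MAE and RMSE is a quick hint that a few predictions are much worse than the rest.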

You may have heard the world is made up of atoms and molecules, but it’s really made up of stories. When you sit with an individual that’s been here, you can give quantitative data a qualitative overlay.