In this project, we aim to predict the profit of 50 startups. We implemented multiple linear regression in Python.
The data contains the following columns:
- R&D Spend
- Administration
- Marketing Spend
- State
- Profit
Let’s get our environment ready with the libraries we’ll need and then import the data!
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
Check out the Data
df = pd.read_csv('~/DataSet GitHub/Regression/50_startups.csv')
df.head(5)
Let’s create some simple plots to check out the data!
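As a minimal sketch of what such plots might look like, the snippet below draws one scatter plot per numeric feature against Profit and prints the correlations with Profit. It runs on a synthetic stand-in DataFrame (`demo_df`, with made-up values) so that it is self-contained; the names `demo_df`, `features` and `corr` are illustrative and not part of the original notebook. With the real data you would use `df` instead.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic stand-in for the startups data (all values are made up).
rng = np.random.default_rng(0)
demo_df = pd.DataFrame({
    'R&D Spend': rng.uniform(0, 160_000, 50),
    'Administration': rng.uniform(50_000, 180_000, 50),
    'Marketing Spend': rng.uniform(0, 470_000, 50),
})
demo_df['Profit'] = (
    50_000 + 0.8 * demo_df['R&D Spend'] + rng.normal(0, 5_000, 50)
)

features = ['R&D Spend', 'Administration', 'Marketing Spend']

# One scatter plot per feature, all sharing the Profit axis.
fig, axes = plt.subplots(1, len(features), figsize=(15, 4), sharey=True)
for ax, col in zip(axes, features):
    ax.scatter(demo_df[col], demo_df['Profit'])
    ax.set_xlabel(col)
axes[0].set_ylabel('Profit')
fig.tight_layout()

# Pairwise correlations summarize the same relationships numerically.
corr = demo_df.corr()
print(corr['Profit'].round(2))
```

A feature that lines up along a diagonal in its scatter plot (and has a correlation close to 1 or -1) is likely to carry most of the weight in the regression.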
Training a Linear Regression Model
Let’s now begin to train the regression model! We first need to split our data into an X array that contains the features to train on and a y array with the target variable, in this case the Profit column. We will drop the State column because it contains only text, which the linear regression model can’t use directly.
X = df[['R&D Spend', 'Administration', 'Marketing Spend']]
y = df['Profit']
Train Test Split
Now let’s split the data into a training set and a testing set. We will train our model on the training set and then use the test set to evaluate the model.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=101)
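To make the effect of `test_size=0.4` concrete, here is a small sketch using placeholder arrays with 50 rows (`X_demo` and `y_demo` are illustrative names, not the real data). With 50 rows, 40% goes to the test set, leaving 30 rows for training and 20 for testing, and fixing `random_state` makes the split repeatable.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with 50 rows, standing in for X and y.
X_demo = np.arange(150).reshape(50, 3)
y_demo = np.arange(50)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.4, random_state=101
)
print(X_tr.shape, X_te.shape)

# The same random_state reproduces exactly the same split.
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(
    X_demo, y_demo, test_size=0.4, random_state=101
)
same_split = np.array_equal(y_te, y_te2)
print(same_split)
```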
Creating and Training the Model
from sklearn.linear_model import LinearRegression
lm = LinearRegression()
lm.fit(X_train, y_train)
Let’s evaluate the model by checking out its coefficients and how we can interpret them.
# print the intercept
print(lm.intercept_)
coeff_df = pd.DataFrame(lm.coef_, X.columns, columns=['Coefficient'])
coeff_df
Interpreting the coefficients:
- Holding all other features fixed, a 1 unit increase in R&D Spend is associated with an *increase of $0.81* in Profit.
- Holding all other features fixed, a 1 unit increase in Administration is associated with an *increase of $0.01* in Profit.
- Holding all other features fixed, a 1 unit increase in Marketing Spend is associated with an *increase of $0.03* in Profit.
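The "holding all other features fixed" reading can be checked directly: raise one feature by 1 for a single row, leave the others unchanged, and the prediction should move by exactly that feature's coefficient. The sketch below does this on synthetic data (the values, and the names `X_demo`, `y_demo`, `model`, `base` and `bumped`, are made up for illustration and do not come from the startups dataset).

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic data with a known linear relationship (values are made up).
rng = np.random.default_rng(1)
X_demo = pd.DataFrame({
    'R&D Spend': rng.uniform(0, 100, 40),
    'Administration': rng.uniform(0, 100, 40),
    'Marketing Spend': rng.uniform(0, 100, 40),
})
y_demo = (
    10
    + 0.8 * X_demo['R&D Spend']
    + 0.03 * X_demo['Marketing Spend']
    + rng.normal(0, 1, 40)
)

model = LinearRegression().fit(X_demo, y_demo)

# Take one row and raise only R&D Spend by 1 unit.
base = X_demo.iloc[[0]].copy()
bumped = base.copy()
bumped['R&D Spend'] += 1.0

change = model.predict(bumped)[0] - model.predict(base)[0]
rd_coef = model.coef_[0]  # coefficient for 'R&D Spend' (first column)
print(change, rd_coef)

# A prediction is the intercept plus the coefficient-weighted features.
manual = model.intercept_ + base.to_numpy()[0] @ model.coef_
print(manual, model.predict(base)[0])
```

The two printed pairs should agree up to floating-point rounding, which is all the coefficient table is saying.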
Predictions from our Model
Let’s grab predictions off our test set and see how well it did!
predictions = lm.predict(X_test)
plt.scatter(y_test, predictions)
Regression Evaluation Metrics
from sklearn import metrics
print('MAE:', metrics.mean_absolute_error(y_test, predictions))
print('MSE:', metrics.mean_squared_error(y_test, predictions))
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, predictions)))
Comparing these metrics:
- MAE is the easiest to understand, because it’s the average absolute error.
- MSE is more popular than MAE, because MSE “punishes” larger errors, which tends to be useful in the real world.
- RMSE is even more popular than MSE, because RMSE is interpretable in the “y” units.
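To see how the three metrics relate, here is a small sketch that computes them by hand with NumPy on three made-up values (`y_true` and `y_pred` are illustrative, not results from the model above). MAE averages the absolute errors, MSE averages the squared errors, and RMSE is the square root of MSE.

```python
import numpy as np

# Made-up targets and predictions, only to illustrate the formulas.
y_true = np.array([100.0, 200.0, 300.0])
y_pred = np.array([110.0, 190.0, 330.0])

errors = y_true - y_pred          # -10, 10, -30

mae = np.mean(np.abs(errors))     # average absolute error
mse = np.mean(errors ** 2)        # average squared error
rmse = np.sqrt(mse)               # back in the units of y

print('MAE:', mae)
print('MSE:', mse)
print('RMSE:', rmse)
```

The single error of 30 dominates MSE (900 of the 1100 total squared error), which is why RMSE comes out larger than MAE here.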