In this project, we aim to predict whether it will rain tomorrow by training a Random Forest classification model on the target RainTomorrow. The dataset contains daily weather observations from numerous Australian weather stations.
The data contains the following columns:
- Date: The date of observation
- Location: The common name of the location of the weather station
- MinTemp: The minimum temperature in degrees celsius
- MaxTemp: The maximum temperature in degrees celsius
- Rainfall: The amount of rainfall recorded for the day in mm
- Evaporation: The so-called Class A pan evaporation (mm) in the 24 hours to 9am
- Sunshine: The number of hours of bright sunshine in the day
- WindGustDir: The direction of the strongest wind gust in the 24 hours to midnight
- WindGustSpeed: The speed (km/h) of the strongest wind gust in the 24 hours to midnight
- WindDir9am: Direction of the wind at 9am
- WindDir3pm: Direction of the wind at 3pm
- WindSpeed9am: Wind speed (km/hr) averaged over 10 minutes prior to 9am
- WindSpeed3pm: Wind speed (km/hr) averaged over 10 minutes prior to 3pm
- Humidity9am: Humidity (percent) at 9am
- Pressure9am: Atmospheric pressure (hpa) reduced to mean sea level at 9am
- Cloud9am: Fraction of sky obscured by cloud at 9am. This is measured in “oktas”, which are a unit of eighths. It records how many eighths of the sky are obscured by cloud. A measure of 0 indicates a completely clear sky, whilst an 8 indicates that it is completely overcast.
- Temp9am: Temperature (degrees C) at 9am
- Temp3pm: Temperature (degrees C) at 3pm
- RainToday: Boolean: 1 if precipitation (mm) in the 24 hours to 9am exceeds 1mm, otherwise 0
- RISK_MM: The amount of rain recorded for the next day, in mm. Used to create the response variable RainTomorrow. A kind of measure of the “risk”.
- RainTomorrow: The target variable. Did it rain tomorrow?
Let’s get our environment ready with the libraries we’ll need and then import the data!
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
from sklearn.metrics import confusion_matrix
from sklearn import metrics
Check out the Data
#importing the dataset
df = pd.read_csv('~/DataSet GitHub/Random Forest/weatherAUS.csv')
df.head()

df.info()

Let’s remove the columns that have the most null values in them. We don’t need the Date and Location columns either, because we only want to predict whether it will rain tomorrow, so they are not important for us. In addition, we need to remove RISK_MM as well: it records the next day’s rainfall and is used to create the target, so keeping it would leak the answer into our features.
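To see that leak concretely, the target can be reconstructed from RISK_MM using the “exceeds 1 mm” rule from the column descriptions above. A quick sanity-check sketch (run it before the drop below; note that RainTomorrow is still stored as Yes/No strings at this point, and the exact threshold convention may vary by dataset version):
# RainTomorrow should match "RISK_MM exceeds 1 mm" almost row for row (check only, not part of the pipeline)
derived = (df['RISK_MM'] > 1.0).map({True: 'Yes', False: 'No'})
print((derived == df['RainTomorrow']).mean())  # expected to be (very close to) 1.0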
df.drop(["Date","Location","Evaporation","Sunshine","Cloud3pm","Cloud9am","RISK_MM"],axis=1,inplace=True)
#Drop processing
Visualising the missing data in the columns!
import missingno as msno
msno.matrix(df)
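If missingno isn’t installed, a plain pandas count gives the same picture (a quick alternative sketch):
# number of missing values per remaining column, worst offenders first
print(df.isnull().sum().sort_values(ascending=False))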

The next step is to drop the rows that still contain null values!
df=df.dropna() #We drop all NaN values.
df.info()

The next step is to sort out the categorical columns: switch Yes/No to 1/0 in our RainToday and RainTomorrow columns, then one-hot encode the wind direction columns.
df['RainToday'].replace({'No':0,'Yes':1},inplace = True)
df['RainTomorrow'].replace({'No':0,'Yes':1},inplace = True)
categorical = ['WindGustDir','WindDir9am','WindDir3pm']
df = pd.get_dummies(df,columns = categorical,drop_first=True)
df.head()
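After these two steps every remaining column should be numeric. A quick sanity check (a small sketch; it should print an empty list):
# any leftover object-dtype columns would mean a categorical column was missed
print(df.select_dtypes(include='object').columns.tolist())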
EDA
Percentage of non-rainy and rainy days in the dataset!
explode = (0.1,0)
fig1, ax1 = plt.subplots(figsize=(12,7))
ax1.pie(df['RainTomorrow'].value_counts(), explode=explode, labels=['No Rain','Rainy'], autopct='%1.1f%%',
        shadow=True)
# Equal aspect ratio ensures that pie is drawn as a circle
ax1.axis('equal')
plt.tight_layout()
plt.legend()
plt.show()

Frequency of non-rainy and rainy days in the dataset!
plt.figure(figsize=(10,6))
print('No Rain:',len(df[df.RainTomorrow == 0]))
print('Rainy:',len(df[df.RainTomorrow == 1]))
y = len(df[df.RainTomorrow == 0]),len(df[df.RainTomorrow == 1])
x = ['No Rain','Rainy']
plt.bar(x,y)
plt.show()

Training a Random Forest Model
Let’s now begin to train the random forest model! We will need to first split up our data into an X array that contains the features to train on, and a y array with the target variable, in this case, the RainTomorrow column.
X = df.drop('RainTomorrow',axis=1)
y = df['RainTomorrow']
Train Test Split
Now let’s split the data into a training set and a testing set. We will train our model on the training set and then use the test set to evaluate the model.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size = 0.30, random_state = 0)
Training the Model
#fitting a random forest classifier to the training set
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators=10, criterion='entropy',random_state=0)
rf.fit(X_train, y_train)

Predictions and Evaluations
Now predict values for the testing data.
#predicting the test set results
y_pred = rf.predict(X_test)
Making the Confusion Matrix
The confusion matrix contains the correct predictions that our model made on the test set as well as the incorrect predictions.
cm = confusion_matrix(y_test,y_pred)
class_names=[0,1] # name of classes
fig, ax = plt.subplots()
tick_marks = np.arange(len(class_names))
plt.xticks(tick_marks, class_names)
plt.yticks(tick_marks, class_names)
# create heatmap
sns.heatmap(pd.DataFrame(cm), annot=True, cmap="RdGy" ,fmt='g')
ax.xaxis.set_label_position("top")
plt.tight_layout()
plt.title('Confusion matrix', y=1.1)
plt.ylabel('Actual label')
plt.xlabel('Predicted label')

25213 and 3425 are the correct predictions. In addition, 4005 and 1235 are the incorrect predictions.
Correct Predictions : 25213+3425 = 28638
Incorrect Predictions: 4005+1235 = 5240
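These totals can be verified by unpacking the matrix directly; for labels 0/1, scikit-learn lays the confusion matrix out as [[TN, FP], [FN, TP]] (a quick check; the exact figures depend on your data version and split):
# unpack the 2x2 confusion matrix and recompute the totals quoted above
tn, fp, fn, tp = cm.ravel()
print('Correct predictions  :', tn + tp)
print('Incorrect predictions:', fp + fn)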
Create a classification report for the model.
from sklearn.metrics import classification_report
print(classification_report(y_test,y_pred))

The accuracy of the model is about 84%!
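That figure can be reproduced with the metrics module we imported at the start:
# overall accuracy on the held-out test set
print(metrics.accuracy_score(y_test, y_pred))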

You may have heard the world is made up of atoms and molecules, but it’s really made up of stories.