In this project, we aim to predict whether a person's income exceeds $50K/yr based on census data. The data was downloaded from the UCI Machine Learning Repository (Adult dataset). We implemented an artificial neural network (ANN) in Python to solve this problem.
The data contains the following columns:
- Age: continuous.
- Workclass: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked.
- fnlwgt: continuous.
- Education: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool.
- Education-num: continuous.
- Marital-status: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse.
- Occupation: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces.
- Relationship: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.
- Race: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.
- Sex: Female, Male.
- Capital-gain: continuous.
- Capital-loss: continuous.
- Hours-per-week: continuous.
- Native-country: United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands.
- Income: >50K, <=50K.
Let’s get our environment ready with the libraries we’ll need and then import the data!
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
plt.style.use('ggplot')
Check out the Data!
df = pd.read_csv('~/DataSet GitHub/ANN/adult-2.csv')
df.head()

df.info()

Checking for missing values in the dataset
df.isnull().sum()

Removing rows that contain the ‘?’ placeholder, which this dataset uses for unknown values
df = df[(df != '?').all(axis=1)]
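To see why the `isnull()` check on its own is not enough, here is a minimal sketch on a made-up three-row frame (the values are illustrative, not taken from the dataset). The `?` entries are ordinary strings, so pandas does not count them as missing; it is the row filter that actually drops them.

```python
import pandas as pd

# Toy frame (hypothetical values) with '?' standing in for unknown entries
toy = pd.DataFrame({
    "workclass": ["Private", "?", "State-gov"],
    "occupation": ["Sales", "Tech-support", "?"],
    "age": [39, 50, 38],
})

# '?' is a plain string, so isnull() reports no missing values
missing_per_column = toy.isnull().sum()

# Keep only the rows in which no column equals '?'
clean = toy[(toy != "?").all(axis=1)]
print(len(toy), "rows before,", len(clean), "row after filtering")
```

Comparing `len(df)` before and after the filter on the real data is a quick way to see how many records were removed.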
Exploratory Data Analysis
Let’s check the proportion of each class of the target variable in the dataset!
explode = (0.1,0)
fig1, ax1 = plt.subplots(figsize=(12,7))
ax1.pie(df['income'].value_counts(), explode=explode,labels=['<=50K','>50K'], autopct='%1.1f%%',
shadow=True)
# Equal aspect ratio ensures that pie is drawn as a circle
ax1.axis('equal')
plt.tight_layout()
plt.legend()
plt.show()
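The percentages shown on the pie chart can also be computed directly. The sketch below uses a made-up target column (not the real class balance) to show how `value_counts(normalize=True)` returns the class shares.

```python
import pandas as pd

# Toy target column (hypothetical values, not the real class balance)
income = pd.Series(["<=50K", "<=50K", "<=50K", ">50K"], name="income")

# normalize=True returns fractions instead of raw counts
proportions = income.value_counts(normalize=True)
percentages = (proportions * 100).round(1)
print(percentages)
```

Because `value_counts()` orders the classes by frequency, the hard-coded pie labels only match as long as `<=50K` is the majority class. Passing the index of the counts as the labels would be one way to avoid relying on that order.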

Now let’s look at the number of records in each workclass category
# Recent seaborn versions use `height` instead of the older `size` argument
sns.catplot(x="workclass", kind="count", palette="ch:.26", data=df, height=9)

Let’s visualise the occupation vs income in the dataset
plt.figure(figsize=(25,15))
sns.countplot(x='occupation',data=df,hue='income',palette='viridis')
# To relocate the legend
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)

The next step is to explore marital status vs. income
plt.figure(figsize=(25,15))
sns.countplot(x='marital.status',data=df,hue='income',palette='viridis')
# To relocate the legend
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)

Now let’s explore race vs. income in the dataset
plt.figure(figsize=(18,10))
sns.countplot(x='race',data=df,hue='income',palette='viridis')
# To relocate the legend
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)

For the next data analysis step, let’s explore sex vs. income in the dataset
plt.figure(figsize=(18,10))
sns.countplot(x='sex',data=df,hue='income',palette='viridis')
# To relocate the legend
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
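The count plots above can be backed up with numbers. As a minimal sketch on a made-up frame (the values are illustrative only), `pd.crosstab` gives the same counts as a table, and `normalize="index"` turns them into the share of each income class within a category.

```python
import pandas as pd

# Toy data (hypothetical values)
toy = pd.DataFrame({
    "sex": ["Male", "Male", "Female", "Female", "Male"],
    "income": ["<=50K", ">50K", "<=50K", "<=50K", ">50K"],
})

# Rows: category, columns: income class, cells: number of records
counts = pd.crosstab(toy["sex"], toy["income"])

# Share of each income class within a category (each row sums to 1)
shares = pd.crosstab(toy["sex"], toy["income"], normalize="index")
print(counts)
print(shares.round(2))
```

The same two calls work for any of the categorical columns plotted above, for example `occupation` or `race`.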

Now let’s visualise the scatter plot of hours per week vs age in the dataset
plt.figure(figsize=(18,10))
sns.scatterplot(x='hours.per.week',y='age',data=df,palette='inferno', hue = 'income')
plt.title('Scatter plot of hours per week vs age')

Exploring the overall distribution of age comparing income
plt.figure(figsize=(12,9))
sns.boxplot(x='income',y='age',data=df)
plt.title("Overall distribution of age comparing income")

Feature Encoding
Labelling the income classes as 0 and 1 so that they can be fed to the ANN model
df['income']=df['income'].map({'<=50K': 0, '>50K': 1})
Encoding the workclass, education, occupation, race and sex features
#Encoding the features
from sklearn.preprocessing import LabelEncoder
#workclass
labelencoder_workclass = LabelEncoder()
df.workclass = labelencoder_workclass.fit_transform(df.workclass)
#education
labelencoder_education = LabelEncoder()
df.education = labelencoder_education.fit_transform(df.education)
#occupation
labelencoder_occupation = LabelEncoder()
df.occupation = labelencoder_occupation.fit_transform(df.occupation)
#race
labelencoder_race = LabelEncoder()
df.race = labelencoder_race.fit_transform(df.race)
#sex
labelencoder_sex = LabelEncoder()
df.sex = labelencoder_sex.fit_transform(df.sex)
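To make it clear what `LabelEncoder` produces, here is a small sketch on a few made-up workclass values. The encoder sorts the unique labels and uses each label's position in that sorted list as its code.

```python
from sklearn.preprocessing import LabelEncoder

workclass = ["Private", "State-gov", "Private", "Local-gov"]  # toy values

encoder = LabelEncoder()
codes = encoder.fit_transform(workclass)

# classes_ holds the sorted unique labels; a label's code is its position there
print(list(encoder.classes_))   # ['Local-gov', 'Private', 'State-gov']
print(list(codes))              # [1, 2, 1, 0]

# inverse_transform recovers the original strings from the codes
restored = list(encoder.inverse_transform(codes))
```

Note that the integer codes imply an ordering (0 < 1 < 2) that has no real meaning for nominal features such as race or workclass. One-hot encoding, for example with `pd.get_dummies`, is a common alternative, although it is not what this article uses.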
The next categorical column is marital.status, which we reduce to a binary feature (Married vs. Single)
df["marital.status"] = df["marital.status"].replace(['Married-civ-spouse','Married-spouse-absent','Married-AF-spouse'], 'Married')
df["marital.status"] = df["marital.status"].replace(['Never-married','Divorced','Separated','Widowed'], 'Single')
df["marital.status"] = df["marital.status"].map({"Married":0, "Single":1})
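The same binary coding can be sketched in a single step with `isin()`. This is an alternative to the replace-then-map approach above, shown on made-up values, and it assumes that every status outside the married list should count as Single (with the original code, an unexpected value would become NaN instead).

```python
import pandas as pd

# Toy column (hypothetical values)
status = pd.Series(
    ["Married-civ-spouse", "Divorced", "Never-married", "Married-AF-spouse"]
)

married = ["Married-civ-spouse", "Married-spouse-absent", "Married-AF-spouse"]

# isin() is True for married statuses; the article codes Married as 0 and Single as 1
binary = (~status.isin(married)).astype(int)
print(binary.tolist())
```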
We need to assign a number to each country in the native.country column to prepare the data for modelling
df['native.country'] = df['native.country'].map({'Puerto-Rico':0,'Haiti':1,'Cuba':2, 'Iran':3,
'Honduras':4, 'Jamaica':5, 'Vietnam':6, 'Mexico':7, 'Dominican-Republic':8,
'Laos':9, 'Ecuador':10, 'El-Salvador':11, 'Cambodia':12, 'Columbia':13,
'Guatemala':14, 'South':15, 'India':16, 'Nicaragua':17, 'Yugoslavia':18,
'Philippines':19, 'Thailand':20, 'Trinadad&Tobago':21, 'Peru':22, 'Poland':23,
'China':24, 'Hungary':25, 'Greece':26, 'Taiwan':27, 'Italy':28, 'Portugal':29,
'France':30, 'Hong':31, 'England':32, 'Scotland':33, 'Ireland':34,
'Holand-Netherlands':35, 'Canada':36, 'Germany':37, 'Japan':38,
'Outlying-US(Guam-USVI-etc)':39, 'United-States':40
})
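With a hand-written dictionary this long, it is worth checking that every value in the column has an entry, because `map()` silently turns unmapped values into NaN. The sketch below uses a short excerpt of the dictionary and a made-up country name to show the check.

```python
import pandas as pd

# Toy column; 'Atlantis' is a made-up value with no dictionary entry
country = pd.Series(["United-States", "Mexico", "Canada", "Atlantis"])
mapping = {"Mexico": 7, "Canada": 36, "United-States": 40}  # excerpt of the dictionary above

# Values present in the column but absent from the dictionary
unmapped = set(country.unique()) - set(mapping)

encoded = country.map(mapping)
# map() leaves NaN wherever a value had no dictionary entry
n_nan = int(encoded.isna().sum())
print("unmapped values:", unmapped, "| NaN after mapping:", n_nan)
```

On the real data, an empty `unmapped` set before the mapping (or a NaN count of zero afterwards) confirms that no record was lost to a typo in the dictionary.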
Finally, let’s remove the relationship column, which we will not use as a feature
df = df.drop('relationship',axis=1)
df.head()

Visualising Correlation Matrix
cor_mat= df[:].corr()
mask = np.array(cor_mat)
mask[np.tril_indices_from(mask)] = False
fig=plt.gcf()
fig.set_size_inches(30,12)
sns.heatmap(data=cor_mat,mask=mask,square=True,annot=True,cbar=True)
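The mask above works because `sns.heatmap` hides every cell whose mask value is truthy, so the non-zero correlations left in the upper triangle are hidden. A more explicit variant, sketched here on random toy data, builds a boolean mask with `np.triu`; unlike the version above, it does not depend on the correlation values being non-zero.

```python
import numpy as np
import pandas as pd

# Random toy data with three hypothetical columns
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.normal(size=(50, 3)), columns=["a", "b", "c"])
corr = toy.corr()

# True marks cells to hide: everything strictly above the diagonal
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
print(mask)
```

The resulting array can be passed as `mask=mask` to `sns.heatmap` in the same way as before.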

Train Test Split
Split the data into a training set and a testing set.
X = df.drop('income',axis=1).values
y = df['income'].values
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size = 0.2, random_state = 0)
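Here is a small sketch of what the split does, using 100 made-up samples. It also passes `stratify=y`, which is an addition not used in the article: with an imbalanced target like income, it keeps the class ratio the same in both sets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(200).reshape(100, 2)       # 100 toy samples, 2 features
y = np.array([0] * 75 + [1] * 25)        # imbalanced toy labels

# stratify=y is an optional extra, not part of the article's code
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)
print(X_train.shape, X_test.shape)
print("positive share:", y_train.mean(), y_test.mean())
```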
Feature Scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
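The scaler is fitted on the training set only, and the test set is transformed with the training mean and standard deviation. A minimal sketch with made-up numbers shows the effect: the scaled training columns have mean 0 and standard deviation 1, while the test rows are simply expressed relative to the training statistics.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy features (hypothetical values) on very different scales
X_train = np.array([[20.0, 1000.0], [40.0, 3000.0], [60.0, 5000.0]])
X_test = np.array([[40.0, 3000.0], [80.0, 7000.0]])

sc = StandardScaler()
X_train_scaled = sc.fit_transform(X_train)   # learns mean/std from the training data
X_test_scaled = sc.transform(X_test)         # reuses the training mean/std

print(sc.mean_)          # per-column training means
print(X_test_scaled)     # first test row equals the training mean, so it maps to zeros
```

Fitting the scaler on the full data instead would let information from the test set leak into training.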
Let’s make the ANN
First step is importing the keras libraries and packages
import keras
from keras.models import Sequential
from keras.layers import Dense
Initialising the ANN.
classifier = Sequential()
Adding the input layer and the first hidden layer (the 13 inputs are the feature columns left after dropping relationship and income)
classifier.add(Dense(units = 8, kernel_initializer = 'uniform', activation = 'relu', input_dim = 13))
Adding the second hidden layer
classifier.add(Dense(units = 8, kernel_initializer = 'uniform', activation = 'relu'))
Now let’s add the output layer
classifier.add(Dense(units = 1, kernel_initializer = 'uniform', activation = 'sigmoid'))
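To illustrate what this 13-8-8-1 architecture computes, here is a plain NumPy sketch of the forward pass and the parameter count. It is not how Keras is implemented; the weights are random placeholders, and the small uniform range is an assumption made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
layer_sizes = [13, 8, 8, 1]   # inputs, hidden 1, hidden 2, output

# One weight matrix and one bias vector per Dense layer (random placeholders)
weights = [rng.uniform(-0.05, 0.05, size=(n_in, n_out))
           for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]
biases = [np.zeros(n_out) for n_out in layer_sizes[1:]]

# Trainable parameters: (13*8 + 8) + (8*8 + 8) + (8*1 + 1)
n_params = sum(w.size + b.size for w, b in zip(weights, biases))

def forward(x):
    """Forward pass: ReLU in the hidden layers, sigmoid at the output."""
    a = x
    for w, b in zip(weights[:-1], biases[:-1]):
        a = np.maximum(0.0, a @ w + b)
    z = a @ weights[-1] + biases[-1]
    return 1.0 / (1.0 + np.exp(-z))

probs = forward(rng.normal(size=(4, 13)))   # 4 toy samples with 13 features
print(n_params, probs.shape)
```

The parameter total should match what `classifier.summary()` reports for the Keras model, which makes it a handy sanity check on the layer sizes.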
Compiling the ANN
classifier.compile(optimizer = 'adam', loss = 'binary_crossentropy', metrics = ['accuracy'])
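The `binary_crossentropy` loss penalises confident wrong predictions much more heavily than confident right ones. The sketch below is a hand-written NumPy version of the formula for illustration (the `eps` clipping value is an assumption), applied to made-up probabilities.

```python
import numpy as np

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy; eps keeps log() away from 0."""
    y_prob = np.clip(y_prob, eps, 1 - eps)
    return float(-np.mean(y_true * np.log(y_prob)
                          + (1 - y_true) * np.log(1 - y_prob)))

y_true = np.array([1, 0, 1, 0])
confident_right = binary_crossentropy(y_true, np.array([0.9, 0.1, 0.9, 0.1]))
confident_wrong = binary_crossentropy(y_true, np.array([0.1, 0.9, 0.1, 0.9]))
print(round(confident_right, 4), round(confident_wrong, 4))
```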
Fitting the ANN to the training set
classifier.fit(X_train, y_train, batch_size = 32, epochs = 50, verbose = 1)

Prediction and Evaluation
Let’s predict the test set result to see the performance of the model
y_pred = classifier.predict(X_test)
y_pred = (y_pred > 0.5)
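The sigmoid output is a probability per sample, and the 0.5 threshold turns it into a class decision. A short sketch with made-up probabilities shows the shapes involved; note that a value of exactly 0.5 falls into class 0 with a strict `>` comparison.

```python
import numpy as np

# Sigmoid outputs shaped like the result of predict(): (n_samples, 1)
y_prob = np.array([[0.12], [0.50], [0.73], [0.98]])

y_pred = (y_prob > 0.5)                # boolean array, still shape (4, 1)
y_label = y_pred.astype(int).ravel()   # 0/1 labels as a flat vector
print(y_label)
```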
Making the Confusion Matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test,y_pred)
class_names=[0,1] # name of classes
fig, ax = plt.subplots()
tick_marks = np.arange(len(class_names))
plt.xticks(tick_marks, class_names)
plt.yticks(tick_marks, class_names)
# create heatmap
sns.heatmap(pd.DataFrame(cm), annot=True, cmap="BuPu" ,fmt='g')
ax.xaxis.set_label_position("top")
plt.tight_layout()
plt.title('Confusion matrix', y=1.1)
plt.ylabel('Actual label')
plt.xlabel('Predicted label')

4264 and 820 are the correct predictions, while 684 and 265 are the incorrect ones. So we have:
Correct predictions: 4264 + 820 = 5084
Incorrect predictions: 684 + 265 = 949
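The accuracy follows directly from these four counts, as the short calculation below shows.

```python
correct = 4264 + 820      # predictions on the diagonal of the confusion matrix
incorrect = 684 + 265     # off-diagonal predictions
total = correct + incorrect

accuracy = correct / total
print(f"{correct} of {total} test samples correct -> accuracy {accuracy:.4f}")
```

This is the same quantity that `metrics.accuracy_score` computes from `y_test` and `y_pred`.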
Creating a classification report for the model.
from sklearn.metrics import classification_report
print(classification_report(y_test,y_pred))

from sklearn import metrics
print("Accuracy:",metrics.accuracy_score(y_test, y_pred))

