Adult Census Income Analysis and Prediction

prediction of income

In this project, we aim to predict whether a person's income exceeds $50K/yr based on census data. The data was downloaded from the UCI Machine Learning Repository (the Adult dataset). We implemented an Artificial Neural Network (ANN) in Python to solve this problem.

The data contains the following columns:

  • Age: continuous. 
  • Workclass: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked. 
  • fnlwgt: continuous. 
  • Education: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool. 
  • Education-num: continuous. 
  • Marital-status: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse. 
  • Occupation: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces. 
  • Relationship: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried. 
  • Race: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black. 
  • Sex: Female, Male. 
  • Capital-gain: continuous. 
  • Capital-loss: continuous. 
  • Hours-per-week: continuous. 
  • Native-country: United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands.
  • Income: >50K, <=50K. 


Let’s get our environment ready with the libraries we’ll need and then import the data!

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# Jupyter magic: render plots inline (remove this line when running as a plain script)
%matplotlib inline
plt.style.use('ggplot')

Check out the Data!

df = pd.read_csv('~/DataSet GitHub/ANN/adult-2.csv')
df.head()
df.info()

Checking for missing values in the dataset

df.isnull().sum()

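Removing ‘?’ values from the dataset. Missing entries are marked with ‘?’ rather than NaN, so isnull() above does not count them; here we drop every row that contains one.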

df = df[(df != '?').all(axis=1)]
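
To make the point about ‘?’ concrete, here is a minimal sketch on a made-up three-row frame (the `toy` data is an assumption, not the census file). It shows that `isnull()` finds nothing, and that converting the placeholder to NaN and calling `dropna()` keeps the same rows as the boolean filter used above.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the census data; the values are made up.
toy = pd.DataFrame({
    "workclass": ["Private", "?", "State-gov"],
    "occupation": ["Sales", "Tech-support", "?"],
    "age": [39, 50, 28],
})

# '?' is an ordinary string, so isnull() reports no missing values.
n_null = int(toy.isnull().sum().sum())

# Option 1 (as in the article): keep rows where no cell equals '?'.
filtered = toy[(toy != "?").all(axis=1)]

# Option 2: turn the placeholder into NaN, then drop incomplete rows.
cleaned = toy.replace("?", np.nan).dropna()

print(n_null, len(filtered), len(cleaned))
```

The second option is handy if you would rather impute the missing values than drop the rows, because the NaN markers stay visible to pandas.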


Exploratory Data Analysis

Let’s check out the proportion of each class of the target variable in the dataset!

explode = (0.1,0)  
fig1, ax1 = plt.subplots(figsize=(12,7))
ax1.pie(df['income'].value_counts(), explode=explode,labels=['<=50K','>50K'], autopct='%1.1f%%',
        shadow=True)
# Equal aspect ratio ensures that pie is drawn as a circle
ax1.axis('equal')  
plt.tight_layout()
plt.legend()
plt.show()
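
The same proportions can be read off numerically. The sketch below uses a made-up `income` series (an assumption for illustration) and `value_counts(normalize=True)`. Taking the labels from the result's index, rather than hard-coding them, keeps labels and slices in the same order.

```python
import pandas as pd

# Made-up labels standing in for df['income'].
income = pd.Series(["<=50K", "<=50K", ">50K", "<=50K"])

# value_counts sorts by frequency, most common class first.
shares = income.value_counts(normalize=True)

labels = shares.index.tolist()   # safe to pass as labels= to ax.pie
print(labels, shares.tolist())
```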

Now let’s look at how many records fall into each workclass category

# `height` replaces the `size` argument, which was renamed in seaborn 0.9
sns.catplot(x="workclass", kind="count", palette="ch:.26", data=df, height = 9)

Let’s visualise occupation vs. income in the dataset

plt.figure(figsize=(25,15))
sns.countplot(x='occupation',data=df,hue='income',palette='viridis')

# To relocate the legend
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)

The next step is to explore marital status vs. income

plt.figure(figsize=(25,15))
sns.countplot(x='marital.status',data=df,hue='income',palette='viridis')

# To relocate the legend
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)

Now let’s explore race vs. income in the dataset

plt.figure(figsize=(18,10))
sns.countplot(x='race',data=df,hue='income',palette='viridis')

# To relocate the legend
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)

For the next data analysis step, let’s explore sex vs. income in the dataset

plt.figure(figsize=(18,10))
sns.countplot(x='sex',data=df,hue='income',palette='viridis')

# To relocate the legend
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)

Now let’s draw a scatter plot of hours per week vs. age, coloured by income

plt.figure(figsize=(18,10))
sns.scatterplot(x='hours.per.week',y='age',data=df,palette='inferno', hue = 'income')
plt.title('Scatter plot of hours per week vs age')

Comparing the distribution of age across the two income groups

plt.figure(figsize=(12,9))
sns.boxplot(x='income',y='age',data=df)
plt.title("Overall distribution of age comparing income")


Feature Encoding

Labelling the income classes as 0 and 1 so that the ANN model can be fitted to them

df['income']=df['income'].map({'<=50K': 0, '>50K': 1})

Encoding the workclass, education, occupation, race and sex features

#Encoding the features
from sklearn.preprocessing import LabelEncoder
#workclass
labelencoder_workclass = LabelEncoder()
df.workclass = labelencoder_workclass.fit_transform(df.workclass)
#education
labelencoder_education = LabelEncoder()
df.education = labelencoder_education.fit_transform(df.education)
#occupation
labelencoder_occupation = LabelEncoder()
df.occupation = labelencoder_occupation.fit_transform(df.occupation)
#race
labelencoder_race = LabelEncoder()
df.race = labelencoder_race.fit_transform(df.race)
#sex
labelencoder_sex = LabelEncoder()
df.sex = labelencoder_sex.fit_transform(df.sex)
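
To see what a fitted `LabelEncoder` actually assigns, a small sketch on a made-up list of values (not the census column) is shown below. The encoder sorts the distinct labels and numbers them from 0, and `classes_` and `inverse_transform` let you recover the original strings.

```python
from sklearn.preprocessing import LabelEncoder

# Made-up values standing in for df.sex.
encoder = LabelEncoder()
codes = encoder.fit_transform(["Male", "Female", "Female", "Male"])

# classes_ holds the sorted labels; a label's position is its integer code.
mapping = {label: code for code, label in enumerate(encoder.classes_)}

# inverse_transform maps the integer codes back to the original strings.
decoded = encoder.inverse_transform(codes)

print(mapping, codes.tolist(), decoded.tolist())
```

Keep in mind that these integers impose an order (alphabetical) on categories that have none, which a neural network may treat as meaningful. One-hot encoding is a common alternative for nominal features.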

The next categorical column is marital.status, which we reduce to a binary feature (Married vs. Single)

df["marital.status"] = df["marital.status"].replace(['Married-civ-spouse','Married-spouse-absent','Married-AF-spouse'], 'Married')
df["marital.status"] = df["marital.status"].replace(['Never-married','Divorced','Separated','Widowed'], 'Single')
df["marital.status"] = df["marital.status"].map({"Married":0, "Single":1})

We need to assign a number to each country in the native.country column to prepare the data for modelling

df['native.country'] = df['native.country'].map({'Puerto-Rico':0,'Haiti':1,'Cuba':2, 'Iran':3,
                                      'Honduras':4, 'Jamaica':5, 'Vietnam':6, 'Mexico':7, 'Dominican-Republic':8,
                                       'Laos':9, 'Ecuador':10, 'El-Salvador':11, 'Cambodia':12, 'Columbia':13,
                                         'Guatemala':14, 'South':15, 'India':16, 'Nicaragua':17, 'Yugoslavia':18, 
                                         'Philippines':19, 'Thailand':20, 'Trinadad&Tobago':21, 'Peru':22, 'Poland':23, 
                                         'China':24, 'Hungary':25, 'Greece':26, 'Taiwan':27, 'Italy':28, 'Portugal':29, 
                                         'France':30, 'Hong':31, 'England':32, 'Scotland':33, 'Ireland':34, 
                                         'Holand-Netherlands':35, 'Canada':36, 'Germany':37, 'Japan':38, 
                                         'Outlying-US(Guam-USVI-etc)':39, 'United-States':40
                                        })
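
One thing to watch with a hand-written dictionary: `Series.map` returns NaN for any value that is missing from the dictionary. The sketch below, on a made-up series with a deliberately unmapped value ("Atlantis" is hypothetical), shows a quick check you could run after the mapping.

```python
import pandas as pd

# Made-up values; '?' and 'Atlantis' are deliberately absent from the mapping.
country = pd.Series(["United-States", "Mexico", "?", "Atlantis"])
codes = country.map({"Mexico": 7, "United-States": 40})

# Values without a dictionary entry come back as NaN.
unmapped = country[codes.isna()].unique().tolist()
print(unmapped)
```

If `unmapped` is not empty on the real data, either extend the dictionary or drop those rows before scaling, since NaN values would otherwise propagate into the model inputs.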

For the final stage, let’s remove the relationship column, which we do not use as a feature

df = df.drop('relationship',axis=1)
df.head()

Visualising Correlation Matrix

cor_mat= df[:].corr()
mask = np.array(cor_mat)
mask[np.tril_indices_from(mask)] = False
fig=plt.gcf()
fig.set_size_inches(30,12)
sns.heatmap(data=cor_mat,mask=mask,square=True,annot=True,cbar=True)
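
The mask above works because non-zero correlations left in the upper triangle count as "true" and get hidden. A more explicit way to build the same kind of mask is a boolean upper-triangle array. This is a sketch on random made-up data, not the census frame, and the plotting call is left as a comment.

```python
import numpy as np
import pandas as pd

# Random made-up data standing in for the encoded census frame.
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.normal(size=(50, 4)), columns=list("abcd"))
corr = toy.corr()

# True above the diagonal (k=1), False on and below it.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)

# sns.heatmap(corr, mask=mask, square=True, annot=True) would then
# draw only the diagonal and the lower triangle.
print(mask)
```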


Train Test Split

Split the data into a training set and a testing set.

X = df.drop('income',axis=1).values
y = df['income'].values
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size = 0.2, random_state = 0)
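
Because the two income classes are imbalanced, you may want the split to preserve the class ratio. The article's split does not do this; the sketch below is an optional variation using the `stratify` argument of `train_test_split` on made-up arrays.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 100 samples, 25% of them in the positive class.
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.array([0] * 75 + [1] * 25)

# stratify=y_toy keeps the 75/25 ratio in both parts of the split.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=0, stratify=y_toy
)

print(y_tr.mean(), y_te.mean())
```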

Feature Scaling

from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
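
Note that the scaler is fitted on the training set only and then reused on the test set. A NumPy sketch of the same idea, on made-up numbers, shows what that means: the mean and standard deviation come from the training rows, and the test rows are transformed with those same statistics.

```python
import numpy as np

# Made-up feature matrices (e.g. age and a money-like column).
X_train_toy = np.array([[20.0, 1000.0], [40.0, 3000.0], [60.0, 5000.0]])
X_test_toy = np.array([[40.0, 1000.0]])

# Statistics are learned from the training data only.
mean = X_train_toy.mean(axis=0)
std = X_train_toy.std(axis=0)  # population std (ddof=0), as StandardScaler uses

X_train_scaled = (X_train_toy - mean) / std
X_test_scaled = (X_test_toy - mean) / std  # reuse the training statistics

print(X_train_scaled.mean(axis=0), X_train_scaled.std(axis=0))
print(X_test_scaled)
```

Fitting the scaler on the full data instead would leak information about the test set into training.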


Let’s make the ANN

The first step is importing the Keras libraries and packages

import keras
from keras.models import Sequential
from keras.layers import Dense

Initialising the ANN.

classifier = Sequential()

Adding the input layer and the first hidden layer

classifier.add(Dense(units = 8, kernel_initializer = 'uniform', activation = 'relu', input_dim = 13))

Adding the second hidden layer

classifier.add(Dense(units = 8, kernel_initializer = 'uniform', activation = 'relu'))

Now let’s add the output layer

classifier.add(Dense(units = 1, kernel_initializer = 'uniform', activation = 'sigmoid'))
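
To get a feel for the size of this 13-8-8-1 network, the sketch below counts its trainable parameters and runs a forward pass in plain NumPy. This is an illustration rather than the Keras model itself: the weights are random and untrained, and the uniform range of ±0.05 is an assumption.

```python
import numpy as np

layer_sizes = [13, 8, 8, 1]  # inputs, two hidden layers, one output
pairs = list(zip(layer_sizes[:-1], layer_sizes[1:]))

# Each Dense layer has n_in * n_out weights plus n_out biases.
n_params = sum(n_in * n_out + n_out for n_in, n_out in pairs)

rng = np.random.default_rng(0)
weights = [rng.uniform(-0.05, 0.05, size=(n_in, n_out)) for n_in, n_out in pairs]
biases = [np.zeros(n_out) for _, n_out in pairs]

def forward(x):
    """ReLU on the hidden layers, sigmoid on the output layer."""
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):
        h = np.maximum(0.0, h @ W + b)
    logits = h @ weights[-1] + biases[-1]
    return 1.0 / (1.0 + np.exp(-logits))

probs = forward(rng.normal(size=(5, 13)))  # 5 made-up, already-scaled samples
print(n_params, probs.shape)
```

The count works out to (13·8 + 8) + (8·8 + 8) + (8·1 + 1) = 112 + 72 + 9 = 193 parameters.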

Compiling the ANN

classifier.compile(optimizer = 'adam', loss = 'binary_crossentropy', metrics = ['accuracy'])
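
The `binary_crossentropy` loss rewards confident correct probabilities and penalises confident wrong ones. Here is a rough NumPy sketch of the formula on made-up predictions; the function name and the clipping constant `eps` are illustrative choices, not Keras internals.

```python
import numpy as np

def binary_crossentropy(y_true, p_pred, eps=1e-7):
    """Mean of -[y*log(p) + (1-y)*log(1-p)], with p clipped away from 0 and 1."""
    p = np.clip(p_pred, eps, 1 - eps)
    return float(np.mean(-(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))))

y_true = np.array([1, 0, 1, 0])

# Predictions that lean the right way give a low loss...
confident = binary_crossentropy(y_true, np.array([0.9, 0.1, 0.8, 0.2]))
# ...while always answering 0.5 gives log(2), about 0.693.
uncertain = binary_crossentropy(y_true, np.array([0.5, 0.5, 0.5, 0.5]))

print(confident, uncertain)
```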

Fitting the ANN to the Training set

classifier.fit(X_train, y_train, batch_size = 32, epochs = 50, verbose = 1)


Prediction and Evaluation

Let’s predict the test set result to see the performance of the model

y_pred = classifier.predict(X_test)
y_pred = (y_pred > 0.5)
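
The sigmoid output is a probability per sample, so the comparison with 0.5 is what turns it into a class label. A small sketch with made-up probabilities, shaped (n, 1) like a single-unit output layer:

```python
import numpy as np

# Made-up probabilities with shape (n, 1).
proba = np.array([[0.91], [0.08], [0.50], [0.63]])

# Strictly greater than 0.5 counts as class 1; flatten to a 1-D label vector.
labels = (proba > 0.5).astype(int).ravel()
print(labels)
```

A probability of exactly 0.5 falls into class 0 here because the comparison is strict. The threshold itself is a choice, and moving it trades false positives against false negatives.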

Making the Confusion Matrix

from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test,y_pred)
class_names=[0,1] # name  of classes
fig, ax = plt.subplots()
tick_marks = np.arange(len(class_names))
plt.xticks(tick_marks, class_names)
plt.yticks(tick_marks, class_names)
# create heatmap
sns.heatmap(pd.DataFrame(cm), annot=True, cmap="BuPu" ,fmt='g')
ax.xaxis.set_label_position("top")
plt.tight_layout()
plt.title('Confusion matrix', y=1.1)
plt.ylabel('Actual label')
plt.xlabel('Predicted label')

The diagonal of the matrix holds the correct predictions (4264 and 820), and the off-diagonal cells hold the incorrect ones (684 and 265). So we can see that the model gets most of the test set right.


Correct Predictions : 4264+820 = 5084

Incorrect Predictions: 684+265 = 949
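
From these two totals we can work out the accuracy by hand, as a quick check against the value printed by `accuracy_score`. The sketch uses only the four counts quoted above.

```python
# Counts taken from the confusion matrix reported above.
correct = 4264 + 820
incorrect = 684 + 265
total = correct + incorrect

accuracy = correct / total
print(f"{correct} of {total} correct, accuracy = {accuracy:.4f}")
```

That is 5084 correct out of 6033 test samples, or roughly 84% accuracy.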

Creating a classification report for the model.

from sklearn.metrics import classification_report
print(classification_report(y_test,y_pred))
from sklearn import metrics
print("Accuracy:",metrics.accuracy_score(y_test, y_pred))
