Gender Recognition By Voice

Logistic Regression is a Machine Learning classification algorithm that is used to predict the probability of a categorical dependent variable.
In logistic regression, the dependent variable is a binary variable that contains data coded as 1 (yes, success, etc.) or 0 (no, failure, etc.).
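To make the "probability of a binary outcome" idea concrete, here is a minimal sketch of the logistic (sigmoid) function that turns a linear score into a probability. The weights, bias and sample values below are hypothetical and chosen only for illustration; they are not fitted to the voice data.

```python
import math

def sigmoid(z):
    """Map any real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_probability(features, weights, bias):
    """Probability of class 1 for one sample: sigmoid of the linear score."""
    score = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(score)

# Hypothetical weights and features, chosen only for illustration.
weights = [1.5, -2.0]
bias = 0.5
sample = [1.0, 1.0]  # linear score = 0.5 + 1.5 - 2.0 = 0.0

probability = predict_probability(sample, weights, bias)
# A common convention is to predict class 1 when the probability is at least 0.5.
predicted_class = 1 if probability >= 0.5 else 0
print(probability, predicted_class)
```

A score of zero sits exactly on the decision boundary, so the probability is 0.5; larger positive scores push the probability toward 1 and larger negative scores push it toward 0.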

Logistic regression can be used, for example, to predict whether an email is spam (1) or not (0), whether a tumor is malignant (1) or not (0), or whether a voice or face belongs to a man or a woman.

The data contains the following columns:

  • meanfreq: mean frequency (in kHz)
  • sd: standard deviation of frequency
  • median: median frequency (in kHz)
  • Q25: first quartile (in kHz)
  • Q75: third quartile (in kHz)
  • IQR: interquartile range (in kHz)
  • mode: mode frequency
  • centroid: frequency centroid (see specprop)
  • meanfun: average of fundamental frequency measured across acoustic signal
  • minfun: minimum fundamental frequency measured across acoustic signal
  • maxfun: maximum fundamental frequency measured across acoustic signal
  • meandom: average of dominant frequency measured across acoustic signal
  • mindom: minimum of dominant frequency measured across acoustic signal
  • maxdom: maximum of dominant frequency measured across acoustic signal
  • dfrange: range of dominant frequency measured across acoustic signal
  • modindx: modulation index. Calculated as the accumulated absolute difference between adjacent measurements of fundamental frequencies divided by the frequency range
  • label: target class, male or female
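The modindx definition in the list above can be sketched in a few lines. This is only an illustration of the wording in the list: it assumes "the frequency range" means the maximum minus the minimum of the same fundamental-frequency measurements, and the sample values are made up.

```python
def modulation_index(fund_freqs):
    """Accumulated absolute difference between adjacent measurements,
    divided by the frequency range (assumed here: max - min of the series)."""
    accumulated = sum(abs(b - a) for a, b in zip(fund_freqs, fund_freqs[1:]))
    freq_range = max(fund_freqs) - min(fund_freqs)
    # A flat series has no range; return 0.0 rather than dividing by zero.
    return accumulated / freq_range if freq_range else 0.0

# Hypothetical fundamental-frequency measurements (arbitrary units).
measurements = [100, 140, 120, 160]
index = modulation_index(measurements)  # (40 + 20 + 40) / (160 - 100)
print(index)
```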


Let’s get our environment ready with the libraries we’ll need and then import the data!

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
# For confusion matrices
from sklearn.metrics import confusion_matrix
from sklearn import metrics

Check out the Data!

df = pd.read_csv('~/DataSet GitHub/Logostic Regression/gender_voice_weka_dataset.csv')
df.head()
df.info()
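# label is still a string column at this point, so restrict the correlation to numeric columns
df.corr(numeric_only=True)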

The “label” column holds binary data: “male” or “female”. That makes logistic regression a natural fit here. However, the target can’t be an object (string) column; it must be an integer or category type, so we convert the label column to integers.

df.label = [1 if each == "female" else 0 for each in df.label]
#We assign 1 to female, 0 to male.
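One thing to keep in mind with this encoding is that any value other than "female" (a typo, a missing value) is silently mapped to 0. A minimal, stricter sketch on a plain Python list is shown below; the helper name `encode_labels` and the `LABEL_CODES` mapping are made up for illustration.

```python
# Same coding as above: 1 for female, 0 for male.
LABEL_CODES = {"female": 1, "male": 0}

def encode_labels(labels):
    """Convert string labels to integers, failing loudly on unexpected values."""
    unknown = sorted(set(labels) - set(LABEL_CODES))
    if unknown:
        raise ValueError(f"Unexpected label values: {unknown}")
    return [LABEL_CODES[label] for label in labels]

encoded = encode_labels(["male", "female", "female"])
print(encoded)
```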


EDA

Let’s create some simple plots to check out the data!

# Correlation matrix
cor_mat = df.corr()
# Hide the upper triangle (above the diagonal), which mirrors the lower one
mask = np.triu(np.ones_like(cor_mat, dtype=bool), k=1)
fig = plt.gcf()
fig.set_size_inches(30, 12)
sns.heatmap(data=cor_mat, mask=mask, square=True, annot=True, cbar=True)
# Pairwise scatter plots of selected features, colored by label
g = sns.PairGrid(df[['meanfreq','sd','median','Q25','IQR','sp.ent','sfm','meanfun','label']], hue="label")
g = g.map(plt.scatter).add_legend()


Training a Logistic Regression Model

Let’s now begin to train the regression model! We first need to split our data into an X array that contains the features to train on and a y array with the target variable.


We then split the data into a training set and a test set using the sklearn library: 25% of the rows are held out to test the model and 75% are used to train it.

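from sklearn.model_selection import train_test_split
# X and y are not defined in the original listing; assumed here: every column except label, and label itself
X = df.drop('label', axis=1)
y = df['label']
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0, test_size=0.25)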
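As a rough illustration of what a shuffled 75/25 split does, the sketch below splits row indices with the standard library only. It is not sklearn's implementation and will not reproduce the same rows; the helper name is hypothetical, and the row count of 3168 is an assumption inferred from the 792 test predictions reported in the confusion matrix later in this post (792 / 0.25).

```python
import math
import random

def split_indices(n_rows, test_size=0.25, seed=0):
    """Shuffle row indices and hold out a fraction of them for testing."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # fixed seed keeps the split reproducible
    n_test = math.ceil(n_rows * test_size)
    return indices[n_test:], indices[:n_test]

# 3168 rows is an assumption (see the note above).
train_idx, test_idx = split_indices(3168, test_size=0.25, seed=0)
print(len(train_idx), len(test_idx))
```

Fixing the seed plays the same role as random_state: the same rows end up in the test set on every run, so results can be compared between runs.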

Creating and Training the Model

from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression(random_state=0)
classifier.fit(X_train, y_train)
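The scaling step above learns the mean and standard deviation from the training data only and then reuses them on the test data, so no information from the test set leaks into training. A minimal single-column sketch of that idea, using only the standard library and made-up numbers (the helper names are hypothetical):

```python
import statistics

def fit_scaler(column):
    """Learn the mean and population standard deviation from training values."""
    return statistics.mean(column), statistics.pstdev(column)

def transform(column, mean, std):
    """Standardize values using statistics learned elsewhere."""
    return [(x - mean) / std for x in column]

# Hypothetical feature values, for illustration only.
train_column = [1.0, 2.0, 3.0]
test_column = [4.0]

mean, std = fit_scaler(train_column)                 # fit on the training data only
train_scaled = transform(train_column, mean, std)
test_scaled = transform(test_column, mean, std)      # reuse the training statistics
print(train_scaled, test_scaled)
```

After scaling, the training column has mean 0 and unit variance; the test value is expressed in the same units, which is why it can fall well outside the range of the training values.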


Predictions and Evaluations

Now predict values for the testing data.

y_pred = classifier.predict(X_test)

Making Confusion Matrix

The confusion matrix contains the counts of correct and incorrect predictions that our model made on the test set.

cm = confusion_matrix(y_test,y_pred)
class_names=[0,1] # name  of classes
fig, ax = plt.subplots()
tick_marks = np.arange(len(class_names))
plt.xticks(tick_marks, class_names)
plt.yticks(tick_marks, class_names)
# create heatmap
sns.heatmap(pd.DataFrame(cm), annot=True, cmap="YlGnBu" ,fmt='g')
ax.xaxis.set_label_position("top")
plt.tight_layout()
plt.title('Confusion matrix', y=1.1)
plt.ylabel('Actual label')
plt.xlabel('Predicted label')

414 and 360 are the correct predictions, while 5 and 13 are the incorrect ones, so the model gets the large majority of the test samples right.

Correct Predictions : 414+360 = 774

Incorrect Predictions: 5+13 = 18

Create a classification report for the model.

from sklearn.metrics import classification_report
print(classification_report(y_test,y_pred))

The accuracy of the model is 98%.
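The accuracy figure can be checked directly from the confusion-matrix counts given earlier. The short calculation below uses only those four numbers; it does not assume which incorrect count is a false positive and which is a false negative, since accuracy does not depend on that.

```python
# Counts taken from the confusion matrix above.
correct = 414 + 360
incorrect = 5 + 13
total = correct + incorrect

accuracy = correct / total
error_rate = incorrect / total
print(f"{total} test samples, accuracy = {accuracy:.4f}, error rate = {error_rate:.4f}")
```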

ROC Curve

The Receiver Operating Characteristic (ROC) curve is a plot of the true positive rate against the false positive rate. It shows the tradeoff between sensitivity and specificity.

y_pred_proba = classifier.predict_proba(X_test)[::,1]
fpr, tpr, _ = metrics.roc_curve(y_test,  y_pred_proba)
auc = metrics.roc_auc_score(y_test, y_pred_proba)
plt.plot(fpr,tpr,label="data 1, auc="+str(auc))
plt.legend(loc=4)
plt.show()

The AUC score for this case is 0.99. An AUC of 1 represents a perfect classifier, and 0.5 represents a classifier that does no better than random guessing.
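To show where the curve and the AUC number come from, here is a small standard-library sketch that sweeps the decision threshold from high to low and integrates the resulting curve with the trapezoid rule. It is an illustration of the idea, not sklearn's implementation; the function names, labels and scores are made up.

```python
def roc_points(y_true, scores):
    """Return (fpr, tpr) lists obtained by lowering the threshold step by step."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    # Rank samples from the highest to the lowest predicted score.
    ranked = sorted(zip(scores, y_true), key=lambda pair: -pair[0])
    fpr, tpr = [0.0], [0.0]
    tp = fp = 0
    for i, (score, label) in enumerate(ranked):
        if label == 1:
            tp += 1
        else:
            fp += 1
        # Emit a point only after the last sample sharing this score (handles ties).
        if i == len(ranked) - 1 or ranked[i + 1][0] != score:
            fpr.append(fp / negatives)
            tpr.append(tp / positives)
    return fpr, tpr

def auc_trapezoid(fpr, tpr):
    """Area under the ROC curve using the trapezoid rule."""
    return sum((fpr[i] - fpr[i - 1]) * (tpr[i] + tpr[i - 1]) / 2
               for i in range(1, len(fpr)))

# Hypothetical labels and predicted probabilities for four samples.
labels = [0, 0, 1, 1]
fpr, tpr = roc_points(labels, [0.1, 0.4, 0.35, 0.8])
auc_mixed = auc_trapezoid(fpr, tpr)

# Scores that rank every positive above every negative.
fpr_p, tpr_p = roc_points(labels, [0.1, 0.2, 0.8, 0.9])
auc_perfect = auc_trapezoid(fpr_p, tpr_p)
print(auc_mixed, auc_perfect)
```

In the first case one negative sample is scored above a positive one, which pulls the area below 1; in the second case the ranking is perfect and the curve hugs the top-left corner.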
