Logistic Regression is a Machine Learning classification algorithm that is used to predict the probability of a categorical dependent variable.
In logistic regression, the dependent variable is a binary variable that contains data coded as 1 (yes, success, etc.) or 0 (no, failure, etc.).
We can use logistic regression to predict whether an email is spam (1) or not (0), whether a tumor is malignant (1) or not (0), or whether a voice or face belongs to a man (1) or a woman (0).
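To make the "predict the probability" idea concrete, here is a minimal sketch of how a logistic model turns a weighted sum of features into a probability and then into a 0/1 label. The weights, bias and sample values are made up for illustration; they are not fitted to any data in this article.

```python
import math

def sigmoid(z):
    """Map any real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(features, weights, bias):
    """Probability of class 1 for one sample: sigmoid of the weighted sum."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return sigmoid(z)

def predict_label(features, weights, bias, threshold=0.5):
    """Turn the probability into a hard 0/1 label."""
    return 1 if predict_proba(features, weights, bias) >= threshold else 0

# Hypothetical weights and a hypothetical two-feature sample
weights, bias = [2.0, -1.0], 0.5
sample = [1.0, 0.5]

p = predict_proba(sample, weights, bias)          # sigmoid(2.0)
label = predict_label(sample, weights, bias)      # 1, because p >= 0.5
print(round(p, 4), label)
```

A score of 0 maps to a probability of exactly 0.5, which is why 0.5 is the usual default threshold.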
The data contains the following columns:
- meanfreq: mean frequency (in kHz)
- sd: standard deviation of frequency
- median: median frequency (in kHz)
- Q25: first quartile (in kHz)
- Q75: third quartile (in kHz)
- IQR: interquartile range (in kHz)
- mode: mode frequency
- centroid: frequency centroid (see specprop)
- meanfun: average of fundamental frequency measured across acoustic signal
- minfun: minimum fundamental frequency measured across acoustic signal
- maxfun: maximum fundamental frequency measured across acoustic signal
- meandom: average of dominant frequency measured across acoustic signal
- mindom: minimum of dominant frequency measured across acoustic signal
- maxdom: maximum of dominant frequency measured across acoustic signal
- dfrange: range of dominant frequency measured across acoustic signal
- modindx: modulation index. Calculated as the accumulated absolute difference between adjacent measurements of fundamental frequencies divided by the frequency range
- label: target class, male or female
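Most of these columns are plain summary statistics, but modindx is worth spelling out. The sketch below is a literal reading of the description above (sum of absolute differences between adjacent fundamental-frequency measurements, divided by the frequency range). It is not the code that produced the dataset, and the sample track and the zero-range fallback are assumptions.

```python
def modulation_index(fundamental_freqs):
    """Accumulated absolute difference between adjacent measurements,
    divided by the frequency range (max - min)."""
    if len(fundamental_freqs) < 2:
        raise ValueError("need at least two measurements")
    freq_range = max(fundamental_freqs) - min(fundamental_freqs)
    if freq_range == 0:
        return 0.0  # assumption: a perfectly flat signal has no modulation
    total_change = sum(
        abs(b - a) for a, b in zip(fundamental_freqs, fundamental_freqs[1:])
    )
    return total_change / freq_range

# Hypothetical fundamental-frequency track in kHz
track = [0.10, 0.12, 0.11, 0.15]
print(modulation_index(track))  # (0.02 + 0.01 + 0.04) / 0.05, roughly 1.4
```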
Let’s get our environment ready with the libraries we’ll need and then import the data!
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
# For confusion matrices
from sklearn.metrics import confusion_matrix
from sklearn import metrics
Check out the Data!
df = pd.read_csv('~/DataSet GitHub/Logostic Regression/gender_voice_weka_dataset.csv')
df.head()

df.info()

df.corr(numeric_only=True)  # "label" is still text here, so restrict to numeric columns (pandas 1.5+)

The “label” column is binary: it contains only “male” and “female”, so we can use logistic regression.
df.label = [1 if each == "female" else 0 for each in df.label]
#We assign 1 to female, 0 to male.
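One thing to keep in mind with the one-liner above: anything that is not exactly "female" becomes 0, including typos or missing values. A slightly stricter variant, sketched here on a plain Python list with made-up values, maps through a dictionary and fails loudly on unexpected labels. The names LABEL_CODES and encode_labels are hypothetical helpers, not part of pandas.

```python
LABEL_CODES = {"female": 1, "male": 0}

def encode_labels(labels):
    """Map text labels to 0/1 and raise on anything unexpected."""
    encoded = []
    for position, label in enumerate(labels):
        if label not in LABEL_CODES:
            raise ValueError(f"unexpected label {label!r} at position {position}")
        encoded.append(LABEL_CODES[label])
    return encoded

raw = ["male", "female", "female", "male"]  # hypothetical values
encoded = encode_labels(raw)
print(encoded)
```

With a DataFrame, the same idea could be written with the column's map method and the same dictionary, followed by a check for missing values.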
EDA
Let’s create some simple plots to check out the data!
# Correlation matrix
cor_mat = df.corr()
# Mask the strict upper triangle so each feature pair appears only once
mask = np.triu(np.ones_like(cor_mat, dtype=bool), k=1)
fig = plt.gcf()
fig.set_size_inches(30, 12)
sns.heatmap(data=cor_mat, mask=mask, square=True, annot=True, cbar=True)
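Each cell in the heatmap is a Pearson correlation coefficient between two columns. As a rough sketch of what that number means, here is the formula written out in plain Python on two made-up columns. The data is hypothetical and the function does not handle constant columns (zero variance).

```python
import math

def pearson(xs, ys):
    """Pearson correlation: covariance divided by the product of spreads."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equally long columns with at least 2 values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical columns: the second is exactly twice the first
col_a = [1.0, 2.0, 3.0, 4.0]
col_b = [2.0, 4.0, 6.0, 8.0]

print(pearson(col_a, col_b))        # perfectly correlated
print(pearson(col_a, col_b[::-1]))  # perfectly anti-correlated
```

Values near 1 or -1 between two features suggest they carry largely the same information.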

g = sns.PairGrid(df[['meanfreq','sd','median','Q25','IQR','sp.ent','sfm','meanfun','label']], hue = "label")
g = g.map(plt.scatter).add_legend()

Training a Logistic Regression Model
Let’s now begin to train the model! We first need to split our data into an X array that contains the features to train on and a y array with the target variable.
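We then split the data into training and test sets with train_test_split, holding out 25% of the rows for testing.
from sklearn.model_selection import train_test_split
X = df.drop(columns="label")  # assumption: every column except "label" is used as a feature
Y = df["label"]
X_train, X_test, y_train, y_test = train_test_split(X, Y, random_state=0, test_size=0.25)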
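If you are curious what a split like this does conceptually, here is a small sketch in plain Python: shuffle the row indices with a fixed seed, then hold out a fraction for testing. It illustrates the idea only and will not reproduce scikit-learn's exact shuffling. The function name simple_train_test_split and the toy rows are hypothetical.

```python
import math
import random

def simple_train_test_split(rows, targets, test_size=0.25, seed=0):
    """Shuffle indices reproducibly and hold out a fraction for testing."""
    if len(rows) != len(targets):
        raise ValueError("rows and targets must have the same length")
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)          # fixed seed: same split every run
    n_test = math.ceil(len(rows) * test_size)     # test set size, rounded up
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    X_train = [rows[i] for i in train_idx]
    X_test = [rows[i] for i in test_idx]
    y_train = [targets[i] for i in train_idx]
    y_test = [targets[i] for i in test_idx]
    return X_train, X_test, y_train, y_test

# Hypothetical data: 8 rows, target is the parity of the first feature
rows = [[i, i * 10] for i in range(8)]
targets = [i % 2 for i in range(8)]

X_train, X_test, y_train, y_test = simple_train_test_split(rows, targets)
print(len(X_train), len(X_test))
```

The important properties are that features and targets stay aligned and that no row lands in both sets.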
Creating and Training the Model
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression(random_state=0)
classifier.fit(X_train, y_train)
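Note the pattern in the scaling step: the scaler is fitted on the training data only, and the test data is transformed with those same statistics. A minimal single-column sketch of that idea in plain Python follows. The helper names and values are hypothetical, and falling back to a scale of 1.0 for a constant column is an assumption made here to avoid dividing by zero.

```python
import statistics

def fit_scaler(column):
    """Learn the mean and population standard deviation from training values."""
    mean = statistics.fmean(column)
    std = statistics.pstdev(column)
    if std == 0:
        std = 1.0  # assumption: leave constant columns unscaled
    return mean, std

def transform(column, mean, std):
    """Apply previously learned statistics to any column."""
    return [(value - mean) / std for value in column]

# Hypothetical single feature
train = [2.0, 4.0, 6.0, 8.0]
test = [5.0, 10.0]

mean, std = fit_scaler(train)                 # statistics come from train only
train_scaled = transform(train, mean, std)
test_scaled = transform(test, mean, std)      # test reuses the train statistics
print(train_scaled, test_scaled)
```

Fitting on the training set alone keeps information about the test set from leaking into the model.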

Predictions and Evaluations
Now predict values for the testing data.
y_pred = classifier.predict(X_test)
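Making the Confusion Matrix
A confusion matrix tabulates the actual classes against the predicted classes, so the diagonal holds the correct predictions and the off-diagonal cells hold the errors.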
cm = confusion_matrix(y_test,y_pred)
class_names=[0,1] # name of classes
fig, ax = plt.subplots()
tick_marks = np.arange(len(class_names))
plt.xticks(tick_marks, class_names)
plt.yticks(tick_marks, class_names)
# create heatmap
sns.heatmap(pd.DataFrame(cm), annot=True, cmap="YlGnBu" ,fmt='g')
ax.xaxis.set_label_position("top")
plt.tight_layout()
plt.title('Confusion matrix', y=1.1)
plt.ylabel('Actual label')
plt.xlabel('Predicted label')

414 and 360 are the correct predictions, and 5 and 13 are the incorrect ones, so the model gets quite a lot of predictions right.
Correct Predictions : 414+360 = 774
Incorrect Predictions: 5+13 = 18
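The arithmetic above can be sketched in a few lines. The first part builds a 2x2 confusion table from toy labels (hypothetical values, rows are actual classes and columns are predicted classes). The second part turns the counts reported above into an accuracy without assuming which off-diagonal cell holds the 5 and which holds the 13.

```python
def confusion_counts(y_true, y_pred):
    """Return a 2x2 table: rows are actual classes, columns are predicted."""
    table = [[0, 0], [0, 0]]
    for actual, predicted in zip(y_true, y_pred):
        table[actual][predicted] += 1
    return table

# Hypothetical toy labels
toy_true = [0, 0, 1, 1, 1, 0]
toy_pred = [0, 1, 1, 1, 0, 0]
toy_table = confusion_counts(toy_true, toy_pred)
print(toy_table)  # diagonal = correct, off-diagonal = errors

# Counts reported in the text above
correct = 414 + 360
incorrect = 5 + 13
accuracy = correct / (correct + incorrect)
print(correct, incorrect, round(accuracy, 4))
```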
Create a classification report for the model.
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))

The accuracy of the model is about 98% (774 correct out of 792 test samples).
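Besides accuracy, the classification report lists precision, recall and F1-score per class. As a rough sketch of how those are defined, here they are computed from true positive, false positive and false negative counts. The counts below are hypothetical and are not taken from this model's confusion matrix.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 for one class from raw counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0   # of predicted positives, how many were right
    recall = tp / (tp + fn) if (tp + fn) else 0.0      # of actual positives, how many were found
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
    return precision, recall, f1

# Hypothetical counts for one class
precision, recall, f1 = precision_recall_f1(tp=90, fp=10, fn=30)
print(round(precision, 3), round(recall, 3), round(f1, 3))
```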
ROC Curve
The Receiver Operating Characteristic (ROC) curve is a plot of the true positive rate against the false positive rate at different classification thresholds.
y_pred_proba = classifier.predict_proba(X_test)[:, 1]  # predicted probability of class 1 (female)
fpr, tpr, _ = metrics.roc_curve(y_test, y_pred_proba)
auc = metrics.roc_auc_score(y_test, y_pred_proba)
plt.plot(fpr,tpr,label="data 1, auc="+str(auc))
plt.legend(loc=4)
plt.show()

The AUC score for this case is 0.99. An AUC of 1 represents a perfect classifier, and 0.5 represents a worthless one (no better than random guessing).
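One way to build intuition for these numbers: the AUC equals the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, with ties counting as half. The sketch below computes it that way by comparing every positive/negative pair. The labels and scores are toy values, and the pairwise loop is only practical for small inputs.

```python
def auc_by_ranking(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly."""
    positives = [s for y, s in zip(y_true, scores) if y == 1]
    negatives = [s for y, s in zip(y_true, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one sample of each class")
    wins = 0.0
    for pos_score in positives:
        for neg_score in negatives:
            if pos_score > neg_score:
                wins += 1.0       # pair ranked correctly
            elif pos_score == neg_score:
                wins += 0.5       # ties count as half
    return wins / (len(positives) * len(negatives))

# Toy labels and scores
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_auc = auc_by_ranking(toy_labels, toy_scores)
print(toy_auc)  # 3 of the 4 pairs are ranked correctly
```

A classifier that gives every sample the same score lands at 0.5 under this definition, which matches the "worthless classifier" baseline.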
