Logistic Regression is a Machine Learning classification algorithm that is used to predict the probability of a categorical dependent variable.
In logistic regression, the dependent variable is a binary variable that contains data coded as 1 (yes, success, etc.) or 0 (no, failure, etc.).
Typical uses of logistic regression include predicting whether an email is spam (1) or not (0), whether a tumor is malignant (1) or not (0), and whether a voice belongs to a man or a woman, which is the case worked through here.
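To make "predicting a probability" concrete, here is a minimal sketch of the idea behind logistic regression: a linear score is passed through the sigmoid function and then thresholded at 0.5. The weights, bias, and sample below are hypothetical values chosen for illustration, not values learned from the voice data.

```python
import math

def sigmoid(z):
    """Map any real-valued score z to a probability in (0, 1)."""
    # Two branches avoid overflow in math.exp for large |z|.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def predict_proba(features, weights, bias):
    """Probability of class 1 for one sample: sigmoid of the linear score."""
    score = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(score)

# Hypothetical weights and a single two-feature sample, for illustration only.
weights = [0.8, -1.5]
bias = 0.2
sample = [1.0, 0.5]

p = predict_proba(sample, weights, bias)
label = 1 if p >= 0.5 else 0  # threshold the probability to get a class
print(round(p, 3), label)
```

A fitted model does the same thing, except that the weights and bias are estimated from the training data.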
The data contains the following columns:
- meanfreq: mean frequency (in kHz)
- sd: standard deviation of frequency
- median: median frequency (in kHz)
- Q25: first quartile (in kHz)
- Q75: third quartile (in kHz)
- IQR: interquartile range (in kHz)
- mode: mode frequency
- centroid: frequency centroid (see specprop)
- meanfun: average of fundamental frequency measured across acoustic signal
- minfun: minimum fundamental frequency measured across acoustic signal
- maxfun: maximum fundamental frequency measured across acoustic signal
- meandom: average of dominant frequency measured across acoustic signal
- mindom: minimum of dominant frequency measured across acoustic signal
- maxdom: maximum of dominant frequency measured across acoustic signal
- dfrange: range of dominant frequency measured across acoustic signal
- modindx: modulation index. Calculated as the accumulated absolute difference between adjacent measurements of fundamental frequencies divided by the frequency range
- label: target class, male or female
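The modindx description above is easier to follow with a small worked example. This is a sketch of one reading of that definition, assuming the "frequency range" is the maximum minus the minimum of the fundamental-frequency measurements; the function name and the sample track are made up for illustration and are not taken from the dataset.

```python
def modulation_index(fundamental_freqs):
    """Accumulated absolute difference between adjacent measurements,
    divided by the frequency range (assumed here to be max - min)."""
    diffs = [abs(b - a) for a, b in zip(fundamental_freqs, fundamental_freqs[1:])]
    freq_range = max(fundamental_freqs) - min(fundamental_freqs)
    if freq_range == 0:
        return 0.0  # flat signal: guard against division by zero
    return sum(diffs) / freq_range

# Made-up fundamental-frequency track in kHz, not taken from the dataset.
track = [0.10, 0.12, 0.11, 0.15, 0.13]
print(modulation_index(track))
```

Here the adjacent differences add up to 0.09 and the range is 0.05, so the index is about 1.8.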
Let’s get our environment ready with the libraries we’ll need, and then import the data!
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
# For confusion matrices
from sklearn.metrics import confusion_matrix
from sklearn import metrics
Check out the Data!
df = pd.read_csv('~/DataSet GitHub/Logostic Regression/gender_voice_weka_dataset.csv')
The “label” column is binary: it holds either “male” or “female”. So, we can use logistic regression.
df.label = [1 if each == "female" else 0 for each in df.label] #We assign 1 to female, 0 to male.
Let’s create some simple plots to check out the data!
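# Correlation matrix
cor_mat = df[:].corr()
# Mask built from the matrix itself: the lower triangle (diagonal included)
# is zeroed so it stays visible, and the upper triangle is hidden.
mask = np.array(cor_mat)
mask[np.tril_indices_from(mask)] = False
fig = plt.gcf()
fig.set_size_inches(30, 12)
sns.heatmap(data=cor_mat, mask=mask, square=True, annot=True, cbar=True)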
g = sns.PairGrid(
    df[['meanfreq', 'sd', 'median', 'Q25', 'IQR', 'sp.ent', 'sfm', 'meanfun', 'label']],
    hue="label",
)
g = g.map(plt.scatter).add_legend()
Training a Logistic Regression Model
Let’s now begin to train the regression model! We first need to split our data into an X array that contains the features to train on, and a y array with the target variable.
We then split the data into a training set and a test set. We use train_test_split from scikit-learn, holding out 25% of the rows for testing.
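from sklearn.model_selection import train_test_split

# Assumption: X is every column except the label and Y is the label column;
# the original definitions of X and Y are not shown.
X = df.drop('label', axis=1)
Y = df['label']
X_train, X_test, y_train, y_test = train_test_split(X, Y, random_state=0, test_size=0.25)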
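To show what a 75/25 split does, here is a rough, dependency-free sketch: shuffle the rows with a fixed seed, then cut off a quarter for testing. The helper name simple_train_test_split is hypothetical, and it will not reproduce the exact rows scikit-learn selects, since the shuffling differs.

```python
import math
import random

def simple_train_test_split(rows, test_size=0.25, seed=0):
    """Shuffle rows reproducibly, then hold out a fraction for testing."""
    shuffled = list(rows)                   # copy so the input is untouched
    random.Random(seed).shuffle(shuffled)   # fixed seed gives a repeatable split
    n_test = math.ceil(len(shuffled) * test_size)
    return shuffled[n_test:], shuffled[:n_test]

# Toy data: 100 row indices standing in for the rows of a data frame.
rows = list(range(100))
train_rows, test_rows = simple_train_test_split(rows)
print(len(train_rows), len(test_rows))
```

Fixing the seed plays the same role as random_state = 0 above: rerunning the code gives the same split, so results stay comparable between runs.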
Creating and Training the Model
from sklearn.preprocessing import StandardScaler

# Fit the scaler on the training data only, then apply it to both sets
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)

from sklearn.linear_model import LogisticRegression

classifier = LogisticRegression(random_state=0)
classifier.fit(X_train, y_train)
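The scaling step above is worth a closer look: the mean and standard deviation come from the training data only and are then reused on the test data, so no information from the test set leaks into training. The sketch below illustrates this for a single column in plain Python; the helper names and numbers are made up, and it stands in for the idea rather than for StandardScaler itself.

```python
import statistics

def fit_scaler(column):
    """Learn the mean and population standard deviation of one column."""
    mean = statistics.mean(column)
    std = statistics.pstdev(column)
    return mean, (std if std > 0 else 1.0)  # guard for constant columns

def transform(column, mean, std):
    """Standardize values with previously learned statistics."""
    return [(x - mean) / std for x in column]

# Made-up values for one feature, split into train and test parts.
train_col = [0.10, 0.20, 0.30]
test_col = [0.20, 0.40]

mean, std = fit_scaler(train_col)                # statistics from train only
train_scaled = transform(train_col, mean, std)
test_scaled = transform(test_col, mean, std)     # reuse them on test
print(train_scaled, test_scaled)
```

After scaling, the training column has mean 0 and standard deviation 1, while the test column generally does not, which is expected.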
Predictions and Evaluations
Now predict values for the testing data.
y_pred = classifier.predict(X_test)
Making Confusion Matrix
cm = confusion_matrix(y_test, y_pred)
class_names = [0, 1]  # name of classes
fig, ax = plt.subplots()
tick_marks = np.arange(len(class_names))
plt.xticks(tick_marks, class_names)
plt.yticks(tick_marks, class_names)
# create heatmap
sns.heatmap(pd.DataFrame(cm), annot=True, cmap="YlGnBu", fmt='g')
ax.xaxis.set_label_position("top")
plt.tight_layout()
plt.title('Confusion matrix', y=1.1)
plt.ylabel('Actual label')
plt.xlabel('Predicted label')
414 and 360 are the correct predictions, and 5 and 13 are the incorrect ones. So the model gets the large majority of the test samples right.
Correct Predictions : 414+360 = 774
Incorrect Predictions: 5+13 = 18
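The counts above translate directly into an accuracy figure. The short sketch below only redoes that arithmetic from the four numbers reported; it does not depend on which cell of the matrix each count sits in.

```python
# Counts read off the confusion matrix above.
correct = [414, 360]
incorrect = [5, 13]

total = sum(correct) + sum(incorrect)   # 792 test samples
accuracy = sum(correct) / total         # 774 / 792
error_rate = sum(incorrect) / total     # 18 / 792

print(f"accuracy = {accuracy:.3f}, error rate = {error_rate:.3f}")
# prints: accuracy = 0.977, error rate = 0.023
```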
Create a classification report for the model.
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))
The accuracy of the model is about 98% (774 of 792 test samples).
The Receiver Operating Characteristic (ROC) curve is a plot of the true positive rate against the false positive rate at different classification thresholds.
# Probability of class 1 for each test sample
y_pred_proba = classifier.predict_proba(X_test)[:, 1]
fpr, tpr, _ = metrics.roc_curve(y_test, y_pred_proba)
auc = metrics.roc_auc_score(y_test, y_pred_proba)
plt.plot(fpr, tpr, label="data 1, auc=" + str(auc))
plt.legend(loc=4)
plt.show()
The AUC score for this case is 0.99. An AUC of 1 represents a perfect classifier, and 0.5 represents a worthless classifier that does no better than random guessing.
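One way to build intuition for these numbers: the AUC equals the probability that a randomly chosen positive sample gets a higher score than a randomly chosen negative one, with ties counted as half. The sketch below computes it that way on a tiny made-up example. The function name pairwise_auc is hypothetical, and the pairwise loop is only practical for small inputs.

```python
def pairwise_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # a tie counts as half
    return wins / (len(pos) * len(neg))

# Made-up labels and predicted probabilities, for illustration only.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(pairwise_auc(y_true, scores))  # 3 of 4 pairs are ordered correctly
```

A classifier that gives every sample the same score wins half of every pair, which is where the 0.5 baseline comes from.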