Human beings are not the only ones who get confused; machine learning models do too! After all, artificial intelligence mimics the human brain, doesn't it? (Pun intended.) Imagine you are a machine learning engineer and you have just trained a classification model. You check the accuracy and it is 93.0%. Wow, you are so happy!
You want to move the model to production as soon as possible, but as standard practice your code must first be reviewed by the lead data scientist. No problem. You send the code for review and go home happily, thinking about how you will improve the accuracy further.
The next day you come to the office, all set to raise a production deployment request for your model, but to your surprise there is a mail from the code reviewer saying the model is not appropriate for production deployment. So sad. So bad.
Are you confused? No problem. That means it's high time to discuss a concept in machine learning called the confusion matrix.
What is a Confusion Matrix?
The confusion matrix is a table that describes to what extent, and in what way, a classification model is confused while making predictions. For binary classification, it is a 2 x 2 matrix as shown below.
In the above diagram you can see four quadrants: True Negative (438), True Positive (27), False Positive (13) and False Negative (22).
Here the model is confused about False Positive (13) + False Negative (22) = 35 predictions out of 500.
To understand it better, let us implement a simple classification model and discuss its outcome.
Terms related to the confusion matrix
- True Positive or TP : The predicted value is positive and the actual value is also positive.
- False Positive or FP : The predicted value is positive but the actual value is negative.
- True Negative or TN : The predicted value is negative and the actual value is also negative.
- False Negative or FN : The predicted value is negative but the actual value is positive.
- True Positive Rate or TPR = Number of correct positive predictions divided by the total number of actual positives = TP/(TP+FN)
- False Positive Rate or FPR = Number of incorrect positive predictions divided by the total number of actual negatives = FP/(TN+FP)
- Accuracy = Number of correct predictions divided by the total number of predictions = (TP+TN)/(TP+TN+FP+FN)
- Precision = Number of correct positive predictions divided by the total number of positive predictions = TP/(TP+FP)
- Recall = Number of correct positive predictions divided by the total number of actual positives = TP/(TP+FN) = TPR
- Sensitivity = Same as recall.
- Specificity = True negative rate = TN/(TN+FP) = 1 - FPR
- F1 Score = Harmonic mean of Precision and Recall = (2*Precision*Recall)/(Precision + Recall)
- ROC AUC (Receiver Operating Characteristic, Area Under the Curve): The ROC curve is a graph that plots the true positive rate on the y-axis against the false positive rate on the x-axis as the classification threshold is varied. The AUC is the area under that curve. It summarizes the performance of the classifier in a single number between 0 and 1, where 0.5 corresponds to random guessing.
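To make the formulas above concrete, here is a minimal sketch that plugs in the four counts quoted earlier in this article (TP = 27, TN = 438, FP = 13, FN = 22). The variable names are only illustrative.

```python
# Counts taken from the confusion matrix discussed in this article.
tp, tn, fp, fn = 27, 438, 13, 22

total = tp + tn + fp + fn
accuracy = (tp + tn) / total
precision = tp / (tp + fp)
recall = tp / (tp + fn)        # same as TPR / sensitivity
specificity = tn / (tn + fp)   # true negative rate
fpr = fp / (tn + fp)           # false positive rate = 1 - specificity
f1 = 2 * precision * recall / (precision + recall)

# Print each metric rounded to three decimals.
for name, value in [
    ("Accuracy", accuracy),
    ("Precision", precision),
    ("Recall", recall),
    ("Specificity", specificity),
    ("FPR", fpr),
    ("F1 score", f1),
]:
    print(f"{name:<12}{value:.3f}")
```

Notice how far apart accuracy and recall are for the same set of predictions.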
Get the data
We are going to use a dataset hosted on GitHub.
This dataset has two features, X1 and X2, and one label named Y. The label has two classes, represented by 0 and 1.
The features X1 and X2 represent medical characteristics of patients. Class 1 indicates people who have lung disease and class 0 indicates people who do not.
First things first: let us import the required libraries.
import pandas as pd
import numpy as np
from sklearn import metrics
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
import seaborn as sns
from sklearn.metrics import confusion_matrix
Now let us load the data and look at the first 5 rows.
Let us check whether the data is imbalanced or not.
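One common way to check for imbalance is to count how many rows fall into each class. The sketch below uses a tiny made-up label column (90 zeros and 10 ones) as a stand-in, since the real dataset is not reproduced here; only the column name Y comes from the article.

```python
import pandas as pd

# Hypothetical stand-in for the article's DataFrame: only the label column Y,
# filled with made-up values (90 negatives, 10 positives).
df = pd.DataFrame({"Y": [0] * 90 + [1] * 10})

# Absolute number of rows per class.
counts = df["Y"].value_counts()
print(counts)

# Share of each class, which makes the imbalance easier to read.
shares = df["Y"].value_counts(normalize=True)
print(shares)
```

If one class takes up the vast majority of the rows, as class 0 does in this toy example, the dataset is imbalanced.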
This is clearly an imbalanced dataset. Class 1 indicates people who have lung disease and class 0 indicates people who do not, so the imbalance is natural: the number of people with a disease is generally far lower than the number without it.
Next, we assign all the features to a variable x and the label to a variable y.
x = df.drop('Y', axis=1)
y = df.Y
Then we use the train_test_split function to split the data into training and test sets.
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=40)
After that, we define the logistic regression model.
log_reg = LogisticRegression()
Now it is time to train the model using the fit method.
log_reg.fit(x_train, y_train)
Finally, we predict the labels for the test features as below.
y_pred = log_reg.predict(x_test)
After this we are able to print the confusion matrix as below.
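Here is a minimal end-to-end sketch of these steps. Because the article's dataset is not reproduced here, it uses synthetic data from scikit-learn's make_classification as a stand-in (an assumption), so the numbers it prints will differ from the ones discussed in this post.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption, NOT the article's dataset):
# two features and roughly 10% positive labels.
x, y = make_classification(
    n_samples=2000, n_features=2, n_informative=2, n_redundant=0,
    n_clusters_per_class=1, weights=[0.9, 0.1], random_state=40,
)

x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=40)

log_reg = LogisticRegression()
log_reg.fit(x_train, y_train)
y_pred = log_reg.predict(x_test)

# scikit-learn puts actual classes in rows and predicted classes in columns:
# [[TN, FP],
#  [FN, TP]]
cm = confusion_matrix(y_test, y_pred, labels=[0, 1])
print(cm)

# Unpack the four cells in row-major order.
tn, fp, fn, tp = cm.ravel()
print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
```

Note the layout: the top-left cell is True Negative and the bottom-right cell is True Positive, which is easy to mix up when reading the raw array.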
We can also print the confusion matrix in a formatted way as below.
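One possible way to do this, sketched here with the counts reported in this article, is to wrap the matrix in a pandas DataFrame with row and column labels. The label strings are my own choice, not part of any library.

```python
import pandas as pd

# Counts reported in this article, in scikit-learn's layout:
# rows = actual class, columns = predicted class.
cm = [[438, 13],
      [22, 27]]

cm_df = pd.DataFrame(
    cm,
    index=["Actual 0 (no disease)", "Actual 1 (disease)"],
    columns=["Predicted 0", "Predicted 1"],
)
print(cm_df)
```

With the labels in place, you can see at a glance that 22 people who actually have the disease were predicted as healthy.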
It means that the model is confused about False Positive (13) + False Negative (22) = 35 predictions.
Accuracy = (True Positive + True Negative)/Total = (27 + 438)/(27 + 438 + 13 + 22) = 465/500 = 0.93 = 93%
However, accuracy is not the right way to evaluate the model in this case. What we really care about is how many of the people who actually have the disease are predicted correctly. This is measured by a metric called recall.
Recall is defined as the number of true positives divided by the number of true positives plus the number of false negatives.
Recall = True Positive/(True Positive + False Negative) = 27/(27+22) = 0.55
This means our model identifies only 55% of the people who have the disease, so it's not a good model to use in production.
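To see why a 93% accuracy can mislead here, consider a hypothetical "lazy" model that predicts class 0 (no disease) for every patient. The sketch below scores it on the same test set composition as above (49 actual positives, 451 actual negatives); the lazy model itself is only an illustration, not something from the original code.

```python
# Test set composition, derived from the counts in this article.
positives = 27 + 22    # TP + FN: people who actually have the disease
negatives = 438 + 13   # TN + FP: people who do not

# Hypothetical lazy model: predicts class 0 for everyone.
lazy_tp, lazy_fn = 0, positives
lazy_tn, lazy_fp = negatives, 0

lazy_accuracy = (lazy_tp + lazy_tn) / (positives + negatives)
lazy_recall = lazy_tp / (lazy_tp + lazy_fn)

print(f"Lazy model accuracy: {lazy_accuracy:.3f}")
print(f"Lazy model recall:   {lazy_recall:.3f}")
```

The lazy model reaches about 90% accuracy without finding a single sick patient, so on an imbalanced dataset most of the accuracy comes from the majority class alone.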
To summarize, we calculated the True Positive, True Negative, False Positive and False Negative counts for the given data. We also calculated the recall and found that it is just above 50%.
You can also visualize the confusion matrix and print it in a better format.
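A heatmap is one common choice for this. The following is a rough sketch using seaborn (already imported above) together with matplotlib, with the counts from this article typed in by hand; the colors, labels and figure size are arbitrary choices.

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Counts reported in this article: rows = actual, columns = predicted.
cm = [[438, 13],
      [22, 27]]

fig, ax = plt.subplots(figsize=(4, 3))

# annot=True writes the count inside each cell; fmt="d" formats it as an integer.
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues", cbar=False, ax=ax,
            xticklabels=["0", "1"], yticklabels=["0", "1"])

ax.set_xlabel("Predicted label")
ax.set_ylabel("Actual label")
ax.set_title("Confusion matrix")
fig.tight_layout()
```

In a notebook the figure is displayed automatically; in a script, call plt.show() or fig.savefig(...) afterwards.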
After reading this article, you should be able to help your machine learning model do away with its confusion as well. The key takeaway is that we should not assume accuracy is always a good measure of a classification model.
In the next post we will discuss what we can do to make this model better.
Happy Coding !!