Confusion Matrix

Previously we saw a logistic regression model that can predict grape variety from various measurements. The question arose of what kind of mistakes it makes, if any. We can figure that out using a confusion matrix.

Let’s run the program again, this time generating a confusion matrix, which shows how many samples of each class were classified correctly and incorrectly.

This program uses grapes.csv, which contains diameter, weight, length and colour for a collection of grapes.

from sklearn.linear_model import LogisticRegression
import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
import matplotlib.pyplot as plt

df = pd.read_csv('grapes.csv')

# 67 is an outlier
df.drop(67, inplace=True)

X = df[['weight', 'length', 'diameter']]
y = df['color']

X = StandardScaler().fit_transform(X)

model = LogisticRegression()
model.fit(X, y)

y_predicted = model.predict(X)

score = accuracy_score(y, y_predicted)

print('Score:', score)

labels = df['color'].unique()
cm = confusion_matrix(y, y_predicted, labels=labels)
cm_display = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=labels)
cm_display.plot()

plt.show()
Score: 0.9927536231884058

Some points:

  • We use ConfusionMatrixDisplay to display the confusion matrix as a labelled chart. On its own, confusion_matrix returns just an array of counts.
  • It can be hard to keep the labels (classes) straight, because if we don’t supply display labels they appear only as 0, 1 … in the resulting chart. As far as I can tell, if we first obtain the labels using unique() and then pass the same labels to both confusion_matrix and ConfusionMatrixDisplay, the rows, columns and tick labels stay in the same order and the labels are correctly preserved.
  • We call plt.show() from Matplotlib to display the plot and keep the window open until we close it.
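
To see what “literally just a matrix” means, here is a minimal sketch that builds the same table by hand, without scikit-learn. The two lists are stand-in data made up to match the counts our grape model produces (30 purple and 108 green grapes, with one purple grape predicted as green); they are not read from grapes.csv, and the variable names are only illustrative.

```python
from collections import Counter

# Stand-in data (an assumption, not loaded from grapes.csv):
# 30 purple and 108 green grapes, one purple grape predicted as green.
y_true = ['purple'] * 30 + ['green'] * 108
y_pred = ['purple'] * 29 + ['green'] * 1 + ['green'] * 108

# Fixing the label order decides which row and column each class gets.
labels = ['purple', 'green']

# Count how often each (true, predicted) pair occurs.
pair_counts = Counter(zip(y_true, y_pred))

# Rows are true labels, columns are predicted labels.
matrix = [[pair_counts[(t, p)] for p in labels] for t in labels]

for label, row in zip(labels, matrix):
    print(f'{label:>6}', row)
```

Each row is a true class and each column a predicted class, which is the layout scikit-learn’s confusion_matrix uses, so correct predictions land on the diagonal.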

If we read along the diagonal of this chart from top left to bottom right, we can see the grapes that were correctly classified. 29 purple grapes were correctly classified as purple, and 108 green grapes were correctly classified as green.

In the bottom left we see that 0 green grapes were classified wrongly as purple.

In the top right we see that 1 purple grape was incorrectly classified as green. By putting the true and predicted labels in a dataframe and filtering it with a boolean comparison, we can find out which one.

from sklearn.linear_model import LogisticRegression
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.read_csv('grapes.csv')

# 67 is an outlier
df.drop(67, inplace=True)

X = df[['weight', 'length', 'diameter']]
y = df['color']

X = StandardScaler().fit_transform(X)

model = LogisticRegression()
model.fit(X, y)

y_predicted = model.predict(X)

df_results = pd.DataFrame()
df_results['True'] = y
df_results['Predicted'] = y_predicted

print(df_results[df_results['True'] != df_results['Predicted']])
     True Predicted
111  purple     green

If we consult the original data, we find that the grape at row 111 in the dataframe is a purple grape with an unusually low weight. Usually the purple grapes are larger than the green grapes, but not in this case.

When Logistic Regression Goes Bad

Suppose we add a column to the dataframe labelled “heavy” and fill in True or False for each grape. A grape is labelled ‘heavy’ only if it weighs more than 10 grams.

Finally we train a logistic regression model to recognise heavy grapes, and we print an accuracy score for the results.

from sklearn.linear_model import LogisticRegression
import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler

df = pd.read_csv('grapes.csv')

# 67 is an outlier
df.drop(67, inplace=True)

df['heavy'] = df['weight'] > 10

X = df[['weight', 'length', 'diameter']]
y = df['heavy']

X = StandardScaler().fit_transform(X)

model = LogisticRegression()
model.fit(X, y)

y_predicted = model.predict(X)

print(accuracy_score(y, y_predicted))
0.9855072463768116

The score comes out at about 0.986, so we might think we’ve done pretty well in classifying the grapes. However, if we add a confusion matrix, a major problem becomes apparent.

from sklearn.linear_model import LogisticRegression
import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
import matplotlib.pyplot as plt

df = pd.read_csv('grapes.csv')

# 67 is an outlier
df.drop(67, inplace=True)

df['heavy'] = df['weight'] > 10

X = df[['weight', 'length', 'diameter']]
y = df['heavy']

X = StandardScaler().fit_transform(X)

model = LogisticRegression()
model.fit(X, y)

y_predicted = model.predict(X)

print(accuracy_score(y, y_predicted))

labels = df['heavy'].unique()
cm = confusion_matrix(y, y_predicted, labels=labels)
cm_display = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=labels)
cm_display.plot()

plt.show()
0.9855072463768116

Nearly all of this model’s success comes from classifying non-heavy grapes as not heavy, and these grapes make up the vast majority of the set. Of the four heavy grapes, two were classified incorrectly, so the model recognises only half of the grapes it was trained to find. It isn’t nearly as successful as the accuracy score suggests.
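
To put numbers on this, here is a minimal sketch that works from counts inferred from the figures quoted in this post rather than from grapes.csv itself: 138 grapes in total (the 137 correct plus 1 wrong in the colour model), 4 of them heavy, and 2 errors overall, both on heavy grapes. The variable names are illustrative. It compares the model’s accuracy with its recall on the heavy class, and with the accuracy of a baseline that always predicts “not heavy”.

```python
# Counts inferred from the figures in this post (assumptions, not read from grapes.csv).
total = 138                     # 29 + 108 + 1 + 0 grapes in the colour confusion matrix
heavy = 4                       # heavy grapes in the set
heavy_correct = 2               # heavy grapes the model recognised as heavy
not_heavy = total - heavy
not_heavy_correct = not_heavy   # accuracy of 136/138 leaves no errors on non-heavy grapes

# Overall accuracy: all correct predictions over all grapes.
accuracy = (heavy_correct + not_heavy_correct) / total

# Recall on the heavy class: the share of heavy grapes the model found.
recall_heavy = heavy_correct / heavy

# Baseline: always predict "not heavy", so only non-heavy grapes are correct.
baseline_accuracy = not_heavy / total

print(f'accuracy:          {accuracy:.4f}')
print(f'recall (heavy):    {recall_heavy:.4f}')
print(f'baseline accuracy: {baseline_accuracy:.4f}')
```

Under these assumptions the model beats the always-“not heavy” baseline by only two grapes, which is why accuracy on its own says so little when one class dominates.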
