Logistic regression is used when the output is categorical rather than continuous. Typically we’ll have two possible outcomes, or classes, and we’re trying to determine which of the two each sample belongs to, based on some predictor variables. (The Scikit-learn logistic regression model can also handle more than two target classes.)
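Before turning to the grapes, here is a minimal sketch of the two-class case using a tiny synthetic dataset (made up for illustration, not the grapes data): one predictor, and two classes that tend to take low and high values respectively.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic one-column data: class 0 clusters around 0, class 1 around 3.
X = np.concatenate([rng.normal(0, 1, 50), rng.normal(3, 1, 50)]).reshape(-1, 1)
y = np.array([0] * 50 + [1] * 50)

model = LogisticRegression()
model.fit(X, y)

# predict() returns the most likely class for each sample.
print(model.predict([[-1.0], [4.0]]))
```

A sample well below the boundary is assigned class 0, and one well above it class 1; the same `fit`/`predict` pattern carries over directly to the grapes example below.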
As an example we’ll use grapes.csv. This records the length, diameter, weight and colour of a pile of grapes. We’ll try to predict the type of each grape (as visibly demonstrated by its colour) from its weight, diameter and length.
We’ll use accuracy_score to figure out what fraction of grapes were correctly classified as green or purple.
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
import pandas as pd

df = pd.read_csv('grapes.csv')
# 67 is an outlier
df.drop(67, inplace=True)

X = df[['weight', 'length', 'diameter']]
y = df['color']

model = LogisticRegression()
model.fit(X, y)

predicted_color = model.predict(X)
score = accuracy_score(y, predicted_color)
print('Score:', score)
This results in a model that can figure out the colour of over 99% of these grapes based on their measurements.
This exercise raises several questions. How well would the model do if we trained it on only some of the data and then used it to predict the colour of grapes it hasn’t been trained on? Is it legitimate to feed weight and length into the model when the two are measured on entirely different scales? What kind of mistakes does the model make — is it more often mistakenly classifying purple grapes as green, or green as purple?
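As a taste of that last question, scikit-learn’s confusion_matrix breaks the predictions down by true and predicted class, so the two kinds of mistakes show up in separate cells. A minimal sketch with made-up labels (not the grapes data):

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted labels, for illustration only.
y_true = ['green', 'green', 'purple', 'purple', 'purple']
y_pred = ['green', 'purple', 'purple', 'purple', 'green']

# Rows are true classes, columns are predicted classes,
# with labels in sorted order: green, then purple.
cm = confusion_matrix(y_true, y_pred)
print(cm)
```

Here the off-diagonal cells count the two error types: the top-right cell is green grapes mislabelled as purple, and the bottom-left is purple mislabelled as green.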
We’ll address these questions in subsequent posts.