In the last post we looked at logistic regression, and I commented that we were using data measured on different scales. For example, in our grapes.csv data, weight is measured in grams while diameter is measured in millimetres.
This raises the question of whether this is valid. Can machine learning models cope with data series measured on entirely different scales?
With logistic regression on this particular dataset, the answer is that it appears to make little difference. The measurements are all broadly similar; the weight in grams is of the same order as the diameter in mm, and the diameter of grapes is fairly close to their length.
However, data series measured on different scales (whether different units or just wildly different sizes) can definitely cause problems with machine learning models, and that’s why we typically normalize data.
Let’s first see what happens if we measure weight in kilograms and length in micrometres. To change to these units we’ll do this after loading the data:
df['weight'] /= 1000
df['length'] *= 1000
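To get a feel for what this unit change does to the numbers, here is a minimal sketch using a small made-up DataFrame. The three rows below are hypothetical values chosen for illustration, not rows from grapes.csv, and the `ratio` variable is just a name introduced here.

```python
import pandas as pd

# Hypothetical grape measurements (made up; not taken from grapes.csv).
df = pd.DataFrame({
    'weight': [4.2, 5.1, 6.3],       # grams
    'length': [21.0, 23.5, 25.2],    # millimetres
    'diameter': [17.0, 18.4, 19.9],  # millimetres
})

df['weight'] /= 1000  # grams -> kilograms
df['length'] *= 1000  # millimetres -> micrometres

# Compare the typical size of each column after the conversion.
print(df.mean())

# How many times larger is a typical length than a typical weight?
ratio = df['length'].mean() / df['weight'].mean()
print('length/weight ratio:', ratio)
```

With these made-up values, the length column ends up more than a million times larger than the weight column, even though the grapes themselves haven't changed.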
Previously we obtained an accuracy score of about 0.99; let’s take a look at the accuracy score now. Here’s the complete program:
from sklearn.linear_model import LogisticRegression
import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler
df = pd.read_csv('grapes.csv')
# 67 is an outlier
df.drop(67, inplace=True)
df['weight'] /= 1000
df['length'] *= 1000
print(df)
X = df[['weight', 'length', 'diameter']]
y = df['color']
#X = StandardScaler().fit_transform(X)
model = LogisticRegression()
model.fit(X, y)
predicted_color = model.predict(X)
score = accuracy_score(y, predicted_color)
print('Score:', score)
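Score: 0.8768115942028986
The accuracy score is now much lower, at about 0.88. Our model’s predictive ability has significantly declined. A likely reason is that scikit-learn’s LogisticRegression applies L2 regularization by default, which penalizes coefficients unevenly when the features have very different magnitudes; the solver can also struggle to converge on badly scaled data.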
We can fix this by normalizing the data (or normalising, if you’re British).
To do this we use some kind of scaler. StandardScaler by default rescales each feature (each column) so that it has a mean of 0 and a standard deviation of 1; strictly speaking this is called standardization. There are also other scalers you can try, like MinMaxScaler, which makes each feature fit into a range that you can specify (0 to 1 by default).
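As a quick illustration of the difference between the two scalers, here is a small sketch on a made-up array. The numbers are hypothetical (one column roughly in kilograms, one roughly in micrometres) and the variable names are only for this example.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical measurements: column 0 in kilograms, column 1 in micrometres.
X = np.array([
    [0.004, 21000.0],
    [0.005, 23000.0],
    [0.006, 25000.0],
    [0.007, 27000.0],
])

# StandardScaler: each column ends up with mean 0 and standard deviation 1.
standardized = StandardScaler().fit_transform(X)

# MinMaxScaler: each column is squeezed into the given range.
min_maxed = MinMaxScaler(feature_range=(0, 1)).fit_transform(X)

print('Standardized means:', standardized.mean(axis=0))
print('Standardized stds: ', standardized.std(axis=0))
print('Min-max minimums:  ', min_maxed.min(axis=0))
print('Min-max maximums:  ', min_maxed.max(axis=0))
```

Either way, both columns end up on a comparable scale regardless of the units they started in.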
from sklearn.linear_model import LogisticRegression
import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler
df = pd.read_csv('grapes.csv')
# 67 is an outlier
df.drop(67, inplace=True)
df['weight'] /= 1000
df['length'] *= 1000
X = df[['weight', 'length', 'diameter']]
y = df['color']
X = StandardScaler().fit_transform(X)
model = LogisticRegression()
model.fit(X, y)
predicted_color = model.predict(X)
score = accuracy_score(y, predicted_color)
print('Score:', score)
Score: 0.9927536231884058
By adding this line:
X = StandardScaler().fit_transform(X)
We’ve regained an accuracy score of 0.99.
Train/Test Splitting and Normalization
If we have a train-test split in our data, we have to be careful about how we do normalization. If we just normalize all the data at the start, technically we’re leaking information about the test segment into the training segment, since the mean and standard deviation used for scaling were computed from the whole dataset, test segment included.
Perhaps the best way to deal with this is the following.
To transform the data so that it has a mean of 0 and a standard deviation of 1, StandardScaler must collect some information from your data. It does this when we use the fit or fit_transform methods. We want to avoid it collecting any information about the test segment, because we don’t want the model to have seen the test segment in any way at all when we come to use it to verify the model’s performance.
Therefore we use fit_transform on only the training segment, then later we can use only transform on the test segment, to ensure the test segment is transformed in the same way as the training segment.
So the scaler only collects information from the training segment, not the test segment. It then uses this information to scale (normalize) all the data that we use.
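To make the idea concrete, here is a tiny sketch with made-up numbers showing what the scaler actually stores. The arrays are hypothetical; `mean_` and `scale_` are the attributes StandardScaler sets when it is fitted.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical single-feature data: three training values and one test value.
X_train = np.array([[1.0], [2.0], [3.0]])
X_test = np.array([[10.0]])

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learns mean and std from training data only
X_test_scaled = scaler.transform(X_test)        # reuses the training mean and std

print('Learned mean:', scaler.mean_)
print('Learned std: ', scaler.scale_)
print('Scaled test value:', X_test_scaled)
```

The learned mean is the mean of the training values alone; the test value of 10 plays no part in it, and is simply shifted and scaled using the training statistics.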
from sklearn.linear_model import LogisticRegression
import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
df = pd.read_csv('grapes.csv')
# 67 is an outlier
df.drop(67, inplace=True)
df['weight'] /= 1000
df['length'] *= 1000
X = df[['weight', 'length', 'diameter']]
y = df['color']
X_train, X_test, y_train, y_test = train_test_split(X, y, shuffle=True, train_size=0.7)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
model = LogisticRegression()
model.fit(X_train, y_train)
predicted_color = model.predict(X_test)
score = accuracy_score(y_test, predicted_color)
print('Score:', score)
Score: 1.0
Because the split is shuffled randomly, the score varies from run to run, but often we see a score as high as 1.0 here, meaning the model was able to successfully predict the colours of all the grapes in the test segment (which is 30% of all the data) on the basis of their weight, diameter and length.
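One optional way to make the "fit on the training segment only" rule harder to get wrong is to bundle the scaler and the model together using scikit-learn’s make_pipeline. The sketch below is not the program from this post: it uses synthetic data generated with NumPy as a stand-in for grapes.csv, and a fixed random_state so the result is repeatable.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the grape data (NOT the real grapes.csv):
# two classes, two features on wildly different scales.
rng = np.random.default_rng(0)
n = 100
small_feature = np.concatenate([rng.normal(0.004, 0.0005, n),
                                rng.normal(0.007, 0.0005, n)])
large_feature = np.concatenate([rng.normal(20000, 1500, n),
                                rng.normal(26000, 1500, n)])
X = np.column_stack([small_feature, large_feature])
y = np.array([0] * n + [1] * n)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, shuffle=True, train_size=0.7, random_state=0
)

# The pipeline fits the scaler on whatever data is passed to fit(),
# then applies transform() only when predicting or scoring.
pipeline = make_pipeline(StandardScaler(), LogisticRegression())
pipeline.fit(X_train, y_train)

score = pipeline.score(X_test, y_test)  # mean accuracy on the test segment
print('Score:', score)

# The scaler inside the pipeline has only seen the training segment.
fitted_scaler = pipeline.named_steps['standardscaler']
print('Scaler means:', fitted_scaler.mean_)
```

The design choice here is simply convenience: since the pipeline is only ever fitted on the training segment, there is no separate fit_transform/transform pair to keep track of.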