If we fit a line to some data, one way to measure the “goodness of fit” is R squared. However, R squared doesn’t tell the full story, so it’s important to use other techniques as well.
For example, if your model diverges from the data at one end, and that’s the bit you intend to use for predictions, R squared won’t alert you to that.
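To make this concrete, here is a minimal sketch using made-up numbers (not the cooling data). A hypothetical model matches the data exactly except over the last five points, where it is off by 15 units. The data, the model and all variable names are assumptions chosen purely for illustration.

```python
# Made-up data: y equals x for the first 95 points, then jumps up by 15.
xs = list(range(100))
y_actual = [x + (15 if x >= 95 else 0) for x in xs]

# Hypothetical model: the line y = x, which ignores the jump at the end.
y_model = [float(x) for x in xs]

# R squared as 1 - (sum of squared residuals) / (total sum of squares).
mean_y = sum(y_actual) / len(y_actual)
ss_res = sum((ya - ym) ** 2 for ya, ym in zip(y_actual, y_model))
ss_tot = sum((ya - mean_y) ** 2 for ya in y_actual)
r2 = 1 - ss_res / ss_tot

# Average absolute error over the last five points only.
tail_error = sum(abs(ya - ym) for ya, ym in zip(y_actual[-5:], y_model[-5:])) / 5

print("R2 over all points:", round(r2, 3))
print("Mean error over last 5 points:", tail_error)
```

R squared comes out at about 0.988, which looks excellent, yet every prediction in the final stretch is wrong by 15. Plotting the residuals, or scoring only the region you care about, would reveal the problem.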
- R squared typically varies from 0 to 1, where 0 indicates a very poor fit (no better than always predicting the average value), while 1 is a perfect fit.
- R squared is also known as the coefficient of determination.
- It is possible for R squared to be negative. This indicates that your model’s predictions are worse than simply predicting the average value every time.
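The three cases in the list above can be checked on a tiny made-up dataset. This is only a sketch: `r_squared` is a hypothetical helper written for illustration, using the sum-of-squares definition of R squared, and the numbers are invented.

```python
def r_squared(y_true, y_pred):
    """Hypothetical helper: 1 - (squared error of model) / (squared error of the mean)."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred))
    ss_tot = sum((yt - mean_y) ** 2 for yt in y_true)
    return 1 - ss_res / ss_tot

y_true = [1.0, 2.0, 3.0]

perfect = r_squared(y_true, [1.0, 2.0, 3.0])       # predictions match exactly
mean_only = r_squared(y_true, [2.0, 2.0, 2.0])     # always predict the average
reversed_fit = r_squared(y_true, [3.0, 2.0, 1.0])  # trend is backwards

print(perfect, mean_only, reversed_fit)  # 1.0 0.0 -3.0
```

A perfect fit scores 1, always predicting the average scores 0, and a model that gets the trend backwards scores well below 0.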
Here’s an example that uses the cooling.csv data that we saw last time. This data forms almost a straight line if we plot the inverse square of temperature against time. We then fit a scikit-learn linear regression model to approximate the data.
Finally we calculate an R squared score.
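```python
import pandas as pd
from sklearn.metrics import r2_score
from sklearn.linear_model import LinearRegression

df = pd.read_csv('cooling.csv')

# scikit-learn expects a 2D array of features, hence the reshape
X = df['minute'].values.reshape(-1, 1)
# Inverse square of temperature, which is almost linear in time
y = 1.0/df['temperature']**2

model = LinearRegression()
model.fit(X, y)

y_predicted = model.predict(X)

# r2_score takes the true values first, then the predictions
score = r2_score(y, y_predicted)
print("R2 score:", score)
```

This prints an R2 score of approximately 0.99956.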
R squared here is almost 1, indicating a very good fit with the data.
Calculating R Squared
How is R squared actually calculated? Here we add on code that calculates it “from scratch”.
```python
import pandas as pd
import numpy as np
from sklearn.metrics import r2_score
from sklearn.linear_model import LinearRegression

df = pd.read_csv('cooling.csv')

X = df['minute'].values.reshape(-1, 1)
y = 1.0/df['temperature']**2

model = LinearRegression()
model.fit(X, y)

y_predicted = model.predict(X)

# r2_score takes the true values first, then the predictions
score = r2_score(y, y_predicted)
print("R2 score:", score)

# The same calculation "from scratch"
variance_of_residuals = np.var(y_predicted - y)
total_variance = np.var(y)
r2 = 1 - (variance_of_residuals/total_variance)
print("Calculated R2:", r2)
```

Both lines print approximately 0.99956. The from-scratch value is 0.9995634571940355, and r2_score agrees with it up to floating-point rounding.
R squared can be thought of as telling us how much of the variance in the data can be explained by the model.
First we calculate the variance of the data. This is a measure of how widely scattered the y-values are: the average squared distance of the values from their average value.
Now we calculate the residuals. These are the differences between the actual y-values and the y-values predicted by the model.
We then find the variance of the residuals. We expect this to be smaller than the total variance; we expect the y-values to be on average much closer to the values predicted by the model than they are to the average y-value.
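Dividing the variance of the residuals by the total variance gives us the fraction of the total variance that the model does not predict. We then subtract from 1 to get the fraction of the total variance that the model does predict. Strictly speaking, R squared is defined using the sum of the squared residuals rather than their variance; the two give the same answer here because the residuals of a least-squares line with an intercept average to zero.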
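The steps above can be traced on a small made-up dataset, using only Python's standard library so that it runs without cooling.csv. The five data points and the line y = 0.8x + 1.4 (which is the least-squares line for these points) are assumptions for illustration, not part of the cooling example.

```python
from statistics import pvariance  # population variance (divides by n)

# Made-up data and the least-squares line for it: y = 0.8x + 1.4
x = [0, 1, 2, 3, 4]
y = [1.0, 3.0, 2.0, 5.0, 4.0]
y_predicted = [0.8 * xi + 1.4 for xi in x]

# Step 1: total variance of the y-values around their average
total_variance = pvariance(y)

# Step 2: residuals, the signed differences between predicted and actual values
residuals = [yp - ya for yp, ya in zip(y_predicted, y)]

# Step 3: variance of the residuals
variance_of_residuals = pvariance(residuals)

# Step 4: fraction of the variance the model does not explain, then R squared
unexplained = variance_of_residuals / total_variance
r2 = 1 - unexplained

print("Total variance:", total_variance)
print("Variance of residuals:", round(variance_of_residuals, 4))
print("R squared:", round(r2, 4))
```

Here the total variance is 2 and the variance of the residuals is about 0.72, so the model leaves about 36% of the variance unexplained and R squared is about 0.64. `pvariance` divides by the number of points, which matches the default behaviour of `np.var`.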
The only way R squared can be negative is if the predictions made by the model are actually worse than if we simply estimated the values using their average. This would obviously indicate a ‘goodness of fit’ that’s worse than useless.