You may know that a polynomial can be used to fit curves. A polynomial equation looks like this, for example:

y = 3 + 4x + 5x² − 2x³
The equation contains powers of x; the first term, 3, is a constant and can be thought of as the coefficient of x to the power of zero. Then we have coefficients for x, x squared, x cubed and whatever further powers of x we'd like to add. In this case the coefficients are 4, 5 and −2.
By adjusting the coefficients and the constant we can make this fit a wide variety of curves. We can include as many powers of x as we like; this example is a "polynomial of order 3" or a "third-degree polynomial", but we can of course include higher powers of x. The higher the degree of the polynomial, the more complex a curve it can fit.
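As a concrete check, the example polynomial above can be evaluated directly in plain Python (a minimal sketch; the constant 3 and coefficients 4, 5 and −2 come from the example equation):

```python
# Evaluate y = 3 + 4x + 5x^2 - 2x^3, the example polynomial above
def poly(x):
    return 3 + 4 * x + 5 * x**2 - 2 * x**3

print(poly(0))  # 3 -- only the constant term survives at x = 0
print(poly(2))  # 3 + 8 + 20 - 16 = 15
```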
To do polynomial regression, we first transform our x variable into however many powers of x we want. Then we can do ordinary multiple linear regression on these new variables (except that they're really just powers of the one original variable).
Let's try to fit a third-degree polynomial to our cooling.csv data.
First we transform the data so that instead of having a single independent variable, labelled ‘minute’, for each row we will now have 3 independent variables.
So instead of [0, 1, 2, 3 ….] we have [[0, 0, 0], [1, 1, 1], [2, 4, 8], [3, 9, 27] ….]
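The transform can also be written by hand as a list comprehension, which makes it clear that each row is just [x, x², x³] (a sketch equivalent to what PolynomialFeatures does below):

```python
minutes = [0, 1, 2, 3]

# Each row becomes [x, x**2, x**3] -- the same three columns that
# PolynomialFeatures(degree=3, include_bias=False) produces
features = [[x, x**2, x**3] for x in minutes]
print(features)  # [[0, 0, 0], [1, 1, 1], [2, 4, 8], [3, 9, 27]]
```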
import pandas as pd
from sklearn.preprocessing import PolynomialFeatures

df = pd.read_csv('cooling.csv')
print(df)

# Reshape the single 'minute' column into a 2D array for sklearn
X = df['minute'].values.reshape(-1, 1)
y = df['temperature']

# Generate x, x^2 and x^3 columns (no constant column: the
# regression will fit its own intercept)
poly = PolynomialFeatures(degree=3, include_bias=False)
poly_features = poly.fit_transform(X)
print(poly_features)
    minute  temperature
0        0         94.1
1        1         89.6
2        2         87.0
3        3         84.4
4        4         81.8
..     ...          ...
65      65         37.9
66      66         37.8
67      67         37.5
68      68         37.1
69      73         36.0

[70 rows x 2 columns]
[[0.00000e+00 0.00000e+00 0.00000e+00]
 [1.00000e+00 1.00000e+00 1.00000e+00]
 [2.00000e+00 4.00000e+00 8.00000e+00]
 [3.00000e+00 9.00000e+00 2.70000e+01]
 [4.00000e+00 1.60000e+01 6.40000e+01]
 [5.00000e+00 2.50000e+01 1.25000e+02]
 [6.00000e+00 3.60000e+01 2.16000e+02]
 [7.00000e+00 4.90000e+01 3.43000e+02]
 [8.00000e+00 6.40000e+01 5.12000e+02]
 [9.00000e+00 8.10000e+01 7.29000e+02]
 [1.00000e+01 1.00000e+02 1.00000e+03]
 [1.10000e+01 1.21000e+02 1.33100e+03]
 [1.20000e+01 1.44000e+02 1.72800e+03]
 [1.30000e+01 1.69000e+02 2.19700e+03]
 [1.40000e+01 1.96000e+02 2.74400e+03]
 [1.50000e+01 2.25000e+02 3.37500e+03]
 [1.60000e+01 2.56000e+02 4.09600e+03]
 [1.70000e+01 2.89000e+02 4.91300e+03]
 [1.80000e+01 3.24000e+02 5.83200e+03]
 [1.90000e+01 3.61000e+02 6.85900e+03]
 [2.00000e+01 4.00000e+02 8.00000e+03]
 [2.10000e+01 4.41000e+02 9.26100e+03]
 [2.20000e+01 4.84000e+02 1.06480e+04]
 [2.30000e+01 5.29000e+02 1.21670e+04]
 [2.40000e+01 5.76000e+02 1.38240e+04]
 [2.50000e+01 6.25000e+02 1.56250e+04]
 [2.60000e+01 6.76000e+02 1.75760e+04]
 [2.70000e+01 7.29000e+02 1.96830e+04]
 [2.80000e+01 7.84000e+02 2.19520e+04]
 [2.90000e+01 8.41000e+02 2.43890e+04]
 [3.00000e+01 9.00000e+02 2.70000e+04]
 ...
Now we can simply run multiple linear regression on this data. Then we'll calculate an R squared score and plot the original data (the green dots) with the predicted values overlaid (the blue line).
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.preprocessing import PolynomialFeatures

df = pd.read_csv('cooling.csv')
X = df['minute'].values.reshape(-1, 1)
y = df['temperature']

# Transform 'minute' into x, x^2 and x^3 columns, as before
poly = PolynomialFeatures(degree=3, include_bias=False)
poly_features = poly.fit_transform(X)

# Ordinary multiple linear regression on the polynomial features
model = LinearRegression()
model.fit(poly_features, y)
y_pred = model.predict(poly_features)

# Plot the original data (green dots) and the fitted curve (blue line)
fig = plt.figure(figsize=(16, 9))
ax = fig.add_subplot()
ax.scatter(X, y, color='green')
ax.plot(X, y_pred, color='blue')
print(r2_score(y, y_pred))
plt.show()
The R squared score is very high, but if you look closely you can see that the predictions become less accurate near the end of the graph, which is where we're most likely to actually use them. We could use a polynomial of higher degree to improve the fit.
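To see how the fit improves with degree, we can compare training R squared scores across several degrees. This is a sketch only: since cooling.csv isn't reproduced here, it uses synthetic cooling-style data (an exponential decay towards room temperature, a hypothetical stand-in) rather than the real file. Because each higher-degree model nests the lower-degree ones, the training R squared can only go up:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.preprocessing import PolynomialFeatures

# Synthetic stand-in for cooling.csv: temperature decaying
# exponentially towards a room temperature of 35 degrees
minutes = np.arange(70).reshape(-1, 1)
temps = 35 + 60 * np.exp(-0.04 * minutes.ravel())

scores = {}
for degree in (1, 2, 3, 4, 5):
    # Transform 'minute' into powers up to the given degree, then fit
    features = PolynomialFeatures(degree=degree,
                                  include_bias=False).fit_transform(minutes)
    model = LinearRegression().fit(features, temps)
    scores[degree] = r2_score(temps, model.predict(features))
    print(degree, round(scores[degree], 5))
```

Note that a higher training R squared is not the whole story: high-degree polynomials can oscillate wildly between and beyond the data points, so the fit should always be inspected visually as we did above.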