# Cave of Python


## Polynomial Regression

You may know that a polynomial can be used to fit curves. A polynomial equation looks like this, for example: $y = 3 + 4x + 5x^2 - 2x^3$

The equation contains powers of x; the first term, 3, is a constant and can be thought of as the coefficient of x to the power of zero. Then we have coefficients for x, x squared, x cubed and whatever further powers of x we'd like to add. In this case the coefficients are 4, 5 and -2.

By adjusting the coefficients and the constant we can make this fit a wide variety of curves. We can have as many powers of x as we like; this example is a “polynomial of order 3” or a “third-degree polynomial”, but we can of course include higher powers of x. The higher the degree of the polynomial, the more complex a curve it can fit.
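As a quick sanity check, here is one way to evaluate this example polynomial in Python (a minimal sketch using numpy; note that `numpy.polyval` takes the coefficients from the highest power down):

```python
import numpy as np

# y = 3 + 4x + 5x^2 - 2x^3, coefficients listed highest power first
coeffs = [-2, 5, 4, 3]

x = np.array([0, 1, 2])
print(np.polyval(coeffs, x))  # [ 3 10 15]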

To do polynomial regression, we first transform our x variable into however many powers of x we want. Then we can do ordinary multiple linear regression on these new variables (even though they're really just powers of the one original variable).

Let's try to fit a third-degree polynomial to our `cooling.csv` data.

First we transform the data so that instead of having a single independent variable, labelled ‘minute’, each row now has three independent variables: minute, minute squared and minute cubed.

So instead of `[0, 1, 2, 3, …]` we have `[[0, 0, 0], [1, 1, 1], [2, 4, 8], [3, 9, 27], …]`
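To make the transformation concrete, here is a hand-rolled version using plain numpy (just a sketch; the sklearn helper below does the same job):

```python
import numpy as np

x = np.array([0, 1, 2, 3])

# Stack x, x squared and x cubed as three columns
X_poly = np.column_stack([x, x**2, x**3])
print(X_poly)
# [[ 0  0  0]
#  [ 1  1  1]
#  [ 2  4  8]
#  [ 3  9 27]]
```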

```python
import pandas as pd
from sklearn.preprocessing import PolynomialFeatures

# Load the cooling data: one temperature reading per minute
df = pd.read_csv('cooling.csv')
print(df)

# sklearn expects X as a 2-D array of shape (n_samples, n_features)
X = df['minute'].values.reshape(-1, 1)
y = df['temperature']

# Expand x into the columns [x, x^2, x^3]; include_bias=False omits the x^0 column
poly = PolynomialFeatures(degree=3, include_bias=False)
poly_features = poly.fit_transform(X)

print(poly_features)
```
```
    minute  temperature
0        0         94.1
1        1         89.6
2        2         87.0
3        3         84.4
4        4         81.8
..     ...          ...
65      65         37.9
66      66         37.8
67      67         37.5
68      68         37.1
69      69         36.0

[70 rows x 2 columns]
[[0.00000e+00 0.00000e+00 0.00000e+00]
[1.00000e+00 1.00000e+00 1.00000e+00]
[2.00000e+00 4.00000e+00 8.00000e+00]
[3.00000e+00 9.00000e+00 2.70000e+01]
[4.00000e+00 1.60000e+01 6.40000e+01]
[5.00000e+00 2.50000e+01 1.25000e+02]
[6.00000e+00 3.60000e+01 2.16000e+02]
[7.00000e+00 4.90000e+01 3.43000e+02]
[8.00000e+00 6.40000e+01 5.12000e+02]
[9.00000e+00 8.10000e+01 7.29000e+02]
[1.00000e+01 1.00000e+02 1.00000e+03]
[1.10000e+01 1.21000e+02 1.33100e+03]
[1.20000e+01 1.44000e+02 1.72800e+03]
[1.30000e+01 1.69000e+02 2.19700e+03]
[1.40000e+01 1.96000e+02 2.74400e+03]
[1.50000e+01 2.25000e+02 3.37500e+03]
[1.60000e+01 2.56000e+02 4.09600e+03]
[1.70000e+01 2.89000e+02 4.91300e+03]
[1.80000e+01 3.24000e+02 5.83200e+03]
[1.90000e+01 3.61000e+02 6.85900e+03]
[2.00000e+01 4.00000e+02 8.00000e+03]
[2.10000e+01 4.41000e+02 9.26100e+03]
[2.20000e+01 4.84000e+02 1.06480e+04]
[2.30000e+01 5.29000e+02 1.21670e+04]
[2.40000e+01 5.76000e+02 1.38240e+04]
[2.50000e+01 6.25000e+02 1.56250e+04]
[2.60000e+01 6.76000e+02 1.75760e+04]
[2.70000e+01 7.29000e+02 1.96830e+04]
[2.80000e+01 7.84000e+02 2.19520e+04]
[2.90000e+01 8.41000e+02 2.43890e+04]
[3.00000e+01 9.00000e+02 2.70000e+04]
...
```
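If you want to confirm which column holds which power, the fitted transformer can report its generated feature names (assuming a reasonably recent version of scikit-learn, which provides `get_feature_names_out`):

```python
print(poly.get_feature_names_out(['minute']))
# ['minute' 'minute^2' 'minute^3']
```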

Now we can simply run multiple linear regression on this transformed data. Then we'll calculate an R squared score and plot the original data (the blue line) with the predicted values overlaid (the green dots).

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.preprocessing import PolynomialFeatures
import matplotlib.pyplot as plt

df = pd.read_csv('cooling.csv')

X = df['minute'].values.reshape(-1, 1)
y = df['temperature']

poly = PolynomialFeatures(degree=3, include_bias=False)
poly_features = poly.fit_transform(X)

# Fit an ordinary multiple linear regression on the polynomial features
model = LinearRegression()
model.fit(poly_features, y)

y_pred = model.predict(poly_features)
print(r2_score(y, y_pred))

fig = plt.figure(figsize=(16, 9))
plt.plot(df['minute'], y)                         # original data: blue line
plt.scatter(df['minute'], y_pred, color='green')  # predictions: green dots
plt.show()
```
```
0.9994483787227559
```
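
Once fitted, the model can also predict temperatures at times that weren't in the original data; the only catch is that new inputs must go through the same polynomial transform first (a sketch reusing the `poly` and `model` objects from above; the input times are just illustrative):

```python
import numpy as np

# New times must be transformed exactly like the training data
new_minutes = np.array([[10.5], [35.0]])
print(model.predict(poly.transform(new_minutes)))
```

Bear in mind that a fitted polynomial is only really trustworthy within the range of minutes it was trained on; extrapolating far beyond that range can produce wildly unrealistic values.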