Regression basically means fitting a line or curve to a set of data points. We can also fit surfaces to higher-dimensional data.
By doing this, we end up with a simplified model of our data. This can be useful for making predictions about future data, or for discerning the mathematical laws that govern how the data was generated.
In this post we’ll take a look at linear regression: the business of fitting a straight line to some data points.
First, let’s get some data. I poured hot water into an open-topped cylindrical jug — actually the beaker of a French press for making coffee — and measured the temperature at regular intervals. The cooling turns out to follow a surprisingly simple law. Let’s plot it with some Python code.
Note that to plot data held in a dataframe, we can pass the dataframe via the data parameter and then supply the names of the columns to plot where we’d otherwise supply the x and y series.
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv('cooling.csv')
fig, ax = plt.subplots()
fig.suptitle("Cooling of a Jug of Water")
ax.set_xlabel("Time (minutes)")
ax.set_ylabel("Temperature (°C)")
ax.plot('minute', 'temperature', data=df)
plt.show()

This clearly isn’t a straight line. The situation improves if we take the reciprocal of the temperature, though.
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv('cooling.csv')
fig, ax = plt.subplots()
fig.suptitle("Cooling of a Jug of Water")
ax.set_xlabel("Time (minutes)")
ax.set_ylabel("1/Temperature (1/°C)")
df['temperature'] = 1/df['temperature']
ax.plot('minute', 'temperature', data=df)
plt.show()

If we square the temperature before taking the reciprocal, plotting 1/T² against time, we actually get a straight line. Cooling here seems to follow an inverse square law. I, for one, am surprised.
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv('cooling.csv')
fig, ax = plt.subplots()
fig.suptitle("Cooling of a Jug of Water")
ax.set_xlabel("Time (minutes)")
ax.set_ylabel("1/Temperature² (1/°C²)")
df['temperature'] = 1/df['temperature']**2
ax.plot('minute', 'temperature', data=df)
plt.show()

Now let’s use linear regression to find a line of best fit.
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

df = pd.read_csv('cooling.csv')
X = df['minute'].values.reshape(-1, 1)
y = 1.0/df['temperature']**2

model = LinearRegression()
model.fit(X, y)

y_predicted = model.predict(X)

fig, ax = plt.subplots()
fig.suptitle("Cooling of a Jug of Water")
ax.set_xlabel("Time (minutes)")
ax.set_ylabel("1/Temperature² (1/°C²)")
ax.plot(X, y, '--', color='green')
ax.plot(X, y_predicted, color='#FF000088')
plt.show()

Let’s examine the code line by line.
- 1-3: Import pandas, Matplotlib and scikit-learn. You may need to install the latter with pip install scikit-learn.
- 5-7: Load the data and then form the X and y variables that we’re going to run the model on.
While y is just a data series of the inverse square of the temperatures, X requires a bit of explanation.
Regression can be performed on multiple features at once, so scikit-learn expects a list of feature values for every data sample. We only have one x-value per sample, but we must still wrap each one in a list, so we end up with a matrix (2D array) with a single column.
To accomplish this we use reshape. The first argument, -1, says “use however many rows are necessary”. The second argument, 1, says “we want one column in the 2D array”.
We then use an uppercase letter for X, to indicate that it’s a matrix, not a simple 1D array.
- 9-10: Create a linear regression model and fit it to the data.
- 12: We can now get predictions from the model. In this case we’ve already fed it all the data we have, so I’ll simply get it to predict the y-value for all the existing x-values.
- 14 onwards: plot the original data and the predictions.
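To see what reshape(-1, 1) does in isolation, here’s a tiny NumPy example (NumPy is installed alongside scikit-learn):

```python
import numpy as np

# A 1D array of x-values, one per data sample.
minutes = np.array([0, 1, 2, 3])

# reshape(-1, 1) turns it into a column matrix: one row per
# sample, one column per feature. -1 means "infer the row count".
X = minutes.reshape(-1, 1)

print(minutes.shape)  # (4,)
print(X.shape)        # (4, 1)
print(X.tolist())     # [[0], [1], [2], [3]]
```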
You can see the “predicted” values match the actual values so closely that we can barely distinguish the two plots.
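If you want to put a number on that closeness, scikit-learn models have a score method that returns the coefficient of determination, R², where 1.0 means a perfect fit. Here’s a minimal sketch using made-up, nearly linear data, since the values in cooling.csv aren’t reproduced in this post:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data that is almost, but not quite, on a straight line.
X = np.array([[0], [1], [2], [3], [4]])
y = np.array([1.0, 3.1, 4.9, 7.0, 9.1])

model = LinearRegression().fit(X, y)

# R² close to 1.0 means the fitted line explains almost all of
# the variation in y.
print(model.score(X, y))
```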
Making Predictions
This is all very well, but how can we use this to make predictions about future temperatures? What does this model predict the temperature of the water would be at 120 minutes?
We can easily get a prediction from the model using our line of best fit. Then, since we obtained y from our temperature measurements by taking the reciprocal of the squared temperature, we need to invert that transformation, taking the reciprocal of the square root, to get back to an actual temperature.
Note that taking the square root of a number is the same as raising it to the power of 0.5.
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
df = pd.read_csv('cooling.csv')
X = df['minute'].values.reshape(-1, 1)
y = 1.0/df['temperature']**2
model = LinearRegression()
model.fit(X, y)
y_predicted = model.predict([[120]])
predicted_temperature = 1.0/y_predicted**0.5
print(predicted_temperature)
[29.02447672]
Our model predicts the temperature of the water will be 29 °C after 120 minutes, which doesn’t seem unreasonable. In the next post we’ll take a look at how to assess the accuracy of linear regression models more systematically.
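As an aside, the fitted slope and intercept give us the cooling law in closed form: if 1/T² = m·t + c, then T(t) = (m·t + c)^(-1/2). Here’s a sketch of that inversion using synthetic data that follows an exact inverse square law (the values of m_true and c_true are made up, since the original cooling.csv isn’t included here):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data following an exact inverse square cooling law:
# 1/T**2 = m*t + c, with made-up m and c (c chosen so T(0) = 80 °C).
m_true, c_true = 1e-5, 1 / 80**2
t = np.arange(0, 60, 5)
temperature = 1.0 / np.sqrt(m_true * t + c_true)

# Fit a straight line to 1/T**2 versus time, as in the post.
X = t.reshape(-1, 1)
y = 1.0 / temperature**2
model = LinearRegression().fit(X, y)

# Invert the transformation: T(t) = (m*t + c) ** -0.5.
def predict_temperature(minutes):
    return (model.coef_[0] * minutes + model.intercept_) ** -0.5

print(predict_temperature(0))  # recovers the starting temperature, 80 °C
```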