Linear Regression with Scikit-Learn

Regression basically means fitting lines or curves to data points. We can also fit surfaces to higher-dimensional data.

By doing this, we end up with a simplified model of our data. This can be useful for making predictions about future data, or for discerning the mathematical laws that govern how the data was generated.

In this post we’ll take a look at linear regression: the business of fitting a straight line to some data points.

Let’s start with some data.

I poured some hot water into an open-topped cylindrical jug (actually the jug from a French press for making coffee) and measured the temperature at regular intervals. The cooling turns out to follow a surprisingly simple law.

Here’s the data.

First let’s plot it with some Python code.

Note that to plot data held in a dataframe, we can pass the dataframe via the data parameter and then give the names of the columns to plot where we’d otherwise supply the x and y series.

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('cooling.csv')

fig, ax = plt.subplots()

fig.suptitle("Cooling of a Jug of Water")
ax.set_xlabel("Time (minutes)")
ax.set_ylabel("Temperature (°C)")

ax.plot('minute', 'temperature', data=df)

plt.show()
Temperature vs. time for an open jug of hot water

This clearly isn’t a straight line. The situation improves if we take the reciprocal of temperature though.

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('cooling.csv')

fig, ax = plt.subplots()

fig.suptitle("Cooling of a Jug of Water")
ax.set_xlabel("Time (minutes)")
ax.set_ylabel("1 / Temperature (1/°C)")

df['temperature'] = 1/df['temperature']

ax.plot('minute', 'temperature', data=df)

plt.show()
Reciprocal of temperature vs. time for a jug of hot water

If we instead take the reciprocal of the square of the temperature, we get what looks like a straight line. Cooling here seems to follow an inverse square law. I, for one, am surprised.

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('cooling.csv')

fig, ax = plt.subplots()

fig.suptitle("Cooling of a Jug of Water")
ax.set_xlabel("Time (minutes)")
ax.set_ylabel("1 / Temperature² (1/°C²)")

df['temperature'] = 1/df['temperature']**2

ax.plot('minute', 'temperature', data=df)

plt.show()
Reciprocal of the square of temperature vs time for an open jug of hot water

Now let’s use linear regression to find a line of best fit.

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

df = pd.read_csv('cooling.csv')
X = df['minute'].values.reshape(-1, 1)
y = 1.0/df['temperature']**2

model = LinearRegression()
model.fit(X, y)

y_predicted = model.predict(X)

fig, ax = plt.subplots()

fig.suptitle("Cooling of a Jug of Water")
ax.set_xlabel("Time (minutes)")
ax.set_ylabel("1 / Temperature² (1/°C²)")

ax.plot(X, y, '--', color='green')
ax.plot(X, y_predicted, color='#FF000088')

plt.show()
Time (minutes) vs. squared reciprocal of temperature (°C) for a jug of water, plus line of best fit.

Let’s examine the code line by line.

  • 1-3: Import Pandas, Matplotlib and scikit-learn. You may need to install the latter with pip install scikit-learn.
  • 5-7: Load the data and then form the X and y variables that we’re going to run the model on.

    While y is just a data series of the inverse square of the temperatures, X requires a bit of explanation.

    Regression can be performed with several input variables at once, so scikit-learn expects a list of x-values for every data sample. We only have one x-value per sample, but we must still supply a list for each one, so we end up with a matrix (2D array) with a single column.

    To accomplish this we use reshape. The first argument, -1, says “use however many rows are necessary”. The second argument, 1, says “we want one column in the 2D array”.

    By convention we use an uppercase letter for X, to indicate that it’s a matrix, not a simple 1D array.
  • 9-10: Create a linear regression model and fit it to the data.
  • 12: We can now get predictions from the model. In this case we’ve already fed it all the data we have, so I’ll simply get it to predict the y-value for all the existing x-values.
  • 14 onwards: plot the original data and the predictions.
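
To see what reshape(-1, 1) does in isolation, here is a minimal sketch using NumPy directly. The time values are made up for illustration and are not the measured data.

```python
import numpy as np

# Hypothetical time values in minutes (illustrative only, not the real data).
minutes = np.array([0, 5, 10, 15])

# -1 lets NumPy work out the number of rows; 1 asks for a single column.
X = minutes.reshape(-1, 1)

print(minutes.shape)  # (4,)
print(X.shape)        # (4, 1)
print(X.tolist())     # [[0], [5], [10], [15]]
```

Each sample is now its own one-element row, which is the shape scikit-learn expects for X.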

You can see the “predicted” values match the actual values so closely that we can barely distinguish the two plots.
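
If you want the line itself rather than just its predictions, a fitted LinearRegression exposes the slope in coef_ and the intercept in intercept_. The sketch below uses synthetic points that lie exactly on a known line (an assumption for illustration, not the cooling data), so we can check the model recovers it.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data lying exactly on y = 0.5 * x + 2.0 (made-up values).
X = np.arange(10).reshape(-1, 1)
y = 0.5 * X.ravel() + 2.0

model = LinearRegression()
model.fit(X, y)

# coef_ holds one slope per input column; we only have one column.
slope = model.coef_[0]
intercept = model.intercept_

print(slope, intercept)
```

With the real cooling data, the same two attributes would give the slope and intercept of the line of best fit through the inverse-square temperatures.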

Making Predictions

This is all very well, but how can we use this to make predictions about future temperatures? What does this model predict the temperature of the water would be at 120 minutes?

We can easily get a prediction from the model using our line of best fit. Then, since we obtained y from our temperature measurements by taking the inverse square of temperature, we need to take the reciprocal of the square root of the predicted value to get back to an actual temperature.

Note that taking the square root of a number is the same as raising it to the power of 0.5.
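
In symbols, the inversion looks like this. The names are my own shorthand rather than anything in the code: T for temperature, t for time in minutes, a for the fitted slope and b for the fitted intercept.

```latex
\begin{align*}
y &= \frac{1}{T^2} = a\,t + b \\
T &= \frac{1}{\sqrt{a\,t + b}} = (a\,t + b)^{-1/2}
\end{align*}
```

This only makes sense while a t + b is positive, since we can’t take the square root of a negative number here.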

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

df = pd.read_csv('cooling.csv')
X = df['minute'].values.reshape(-1, 1)
y = 1.0/df['temperature']**2

model = LinearRegression()
model.fit(X, y)

y_predicted = model.predict([[120]])

predicted_temperature = 1.0/y_predicted**0.5

print(predicted_temperature)
[29.02447672]
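
If we want predictions for several times at once, the same steps can be wrapped in a small helper. This is only a sketch: predict_temperature is a hypothetical name of my own, and the cooling curve below is synthetic, generated from made-up values of a and b so that it follows the inverse square law exactly.

```python
import numpy as np
from sklearn.linear_model import LinearRegression


def predict_temperature(model, minutes):
    """Predict temperatures from a model fitted on y = 1 / temperature**2."""
    X_new = np.asarray(minutes, dtype=float).reshape(-1, 1)
    y_new = model.predict(X_new)
    # Undo the transform: temperature = 1 / sqrt(y).
    return 1.0 / np.sqrt(y_new)


# Synthetic, made-up curve that satisfies 1 / T**2 = a * t + b exactly.
a, b = 1e-5, 1.5e-4
t = np.arange(0, 65, 5)
temperature = (a * t + b) ** -0.5

model = LinearRegression()
model.fit(t.reshape(-1, 1), 1.0 / temperature**2)

# Extrapolate beyond the fitted range and compare with the known curve.
predicted = predict_temperature(model, [90, 120])
expected = (a * np.array([90, 120]) + b) ** -0.5

print(predicted)
print(expected)
```

Because the synthetic data has no noise, the two printed arrays should agree very closely. Real measurements won’t be this tidy.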

Our model predicts the temperature of the water will be 29 °C after 120 minutes, which doesn’t seem unreasonable. In the next post we’ll take a look at how to assess the accuracy of linear regression models more systematically.
