Train/Test Splitting

Previously we saw a simple example of linear regression using scikit-learn. In that example we trained our model on all of our data, then examined how closely the “predictions” made by the model fit the actual data.

However, what we’d really like to know is how well our model predicts data it hasn’t been trained on. A model that does this well is a simple form of artificial intelligence: it can make predictions about data it has never seen.

To find out, we use a train/test split. We split our data into a training segment and a test segment. The training segment is used exclusively for training the model; we then test the model’s predictions against the test segment.

Splitting data is easily accomplished using NumPy or pandas slice syntax, but scikit-learn also has a convenient function that we can use, train_test_split.
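For comparison, here is a minimal sketch of the slicing approach. The arrays are made-up stand-in data (not the cooling data), and the 70% cut-off is just the same proportion we use later with train_test_split.

```python
import numpy as np

# Hypothetical stand-in data: 10 samples with a single feature column
X = np.arange(10).reshape(-1, 1)
y = np.arange(10) * 2.0

# Index where the training segment ends (70% of the samples)
split = int(round(0.7 * len(X)))

# Everything before the index is training data, the rest is test data
X_train, X_test = X[:split], X[split:]
y_train, y_test = y[:split], y[split:]

print(len(X_train), len(X_test))
```

Slicing like this keeps the samples in their original order, which is what train_test_split does when shuffling is turned off.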

Again we’ll use the cooling.csv data, showing the cooling curve of an open jug of water.

import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Load the data
df = pd.read_csv('cooling.csv')

# Reshape to get a matrix with one column
X = df['minute'].values.reshape(-1, 1)

# This particular data approximates a straight
# line if we take the inverse square of temperature
y = 1.0/df['temperature']**2

# Split the data into train and test segments
# We'll use 70% of the data for training and
# the remaining 30% for testing.
X_train, X_test, y_train, y_test = train_test_split(X, y, shuffle=False, train_size=0.7)

# Fit the model using the training data
model = LinearRegression()
model.fit(X_train, y_train)

# Make predictions for the test data
y_predicted = model.predict(X_test)

# Now check the r-squared score.
score = r2_score(y_test, y_predicted)

print("R2 score:", score)

R2 score: 0.9699183977543862

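The R squared score is about 0.97, close to the maximum of 1, so the model is making good predictions on the test data. Note that the score measures the fit to the inverse-square values we trained on, not to the temperatures themselves.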
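To see what the score actually measures, here is a small sketch that computes R squared by hand as 1 minus the ratio of the residual sum of squares to the total sum of squares, and compares it with r2_score. The numbers are invented purely for illustration.

```python
import numpy as np
from sklearn.metrics import r2_score

# Hypothetical actual and predicted values
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 9.5])

# Sum of squared differences between actual and predicted values
ss_res = np.sum((y_true - y_pred) ** 2)

# Sum of squared differences between actual values and their mean
ss_tot = np.sum((y_true - y_true.mean()) ** 2)

r2_manual = 1.0 - ss_res / ss_tot

print(r2_manual)
print(r2_score(y_true, y_pred))
```

A score of 1 means the predictions match exactly. On test data the score can even be negative, if the model does worse than simply predicting the mean of the test values.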

Notice that the model had not seen any of the test data when it was asked to make predictions about it. It was trained on only the training data.

Here we used 70% of the data for training and 30% for testing, which is fairly typical.

We could shuffle the data with the shuffle parameter (shuffling is on by default), which assigns samples to the two segments at random. Here, though, we might be interested in predicting temperatures after a certain point in time from temperatures measured before that point, so I’ve turned shuffling off.
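The difference is easy to see on a small made-up array. This is only a sketch: the data is hypothetical, and random_state=0 is an arbitrary choice that just makes the shuffled split repeatable.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: the values 0 to 9 in order
X = np.arange(10).reshape(-1, 1)
y = np.arange(10)

# shuffle=False: the first 70% is training data, the last 30% is test data
X_tr, X_te, y_tr, y_te = train_test_split(X, y, shuffle=False, train_size=0.7)
print(X_te.ravel())

# shuffle=True: samples are assigned to the two segments at random
X_tr_s, X_te_s, y_tr_s, y_te_s = train_test_split(
    X, y, shuffle=True, train_size=0.7, random_state=0
)
print(X_te_s.ravel())
```

With shuffling off, the test segment is always the final three samples. With shuffling on, the test samples come from anywhere in the series, so the model would be partly tested on times that fall between its training points.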

Visualising the Results

In this particular case we had to take the inverse square of temperatures to get an approximately straight line to which we could apply linear regression.

Let’s convert the predicted and actual values back to temperatures and plot them together on the same axes.

To do this we’ll need to take the inverse square root, which is the same as raising to the power of -0.5.
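As a quick sanity check, this sketch applies the inverse square to a few invented temperatures and then undoes it with the power of -0.5. The temperature values are hypothetical, chosen only to show the round trip.

```python
import numpy as np

# Hypothetical temperatures
temps = np.array([80.0, 60.0, 40.0, 25.0])

# Forward transform: inverse square, as used for the regression target
y = 1.0 / temps ** 2

# Reverse transform: inverse square root recovers the temperatures
recovered = y ** (-0.5)

print(np.allclose(recovered, temps))  # True
```

The reverse transform only works for positive values. If a linear model ever predicted zero or a negative number, raising it to the power of -0.5 would not give a usable temperature.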

import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt

# Load the data
df = pd.read_csv('cooling.csv')

# Reshape to get a matrix with one column
X = df['minute'].values.reshape(-1, 1)

# This particular data approximates a straight
# line if we take the inverse square of temperature
y = 1.0/df['temperature']**2

# Split the data into train and test segments
# We'll use 70% of the data for training and
# the remaining 30% for testing.
X_train, X_test, y_train, y_test = train_test_split(X, y, shuffle=False, train_size=0.7)

# Fit the model using the training data
model = LinearRegression()
model.fit(X_train, y_train)

# Make predictions for the test data
y_predicted = model.predict(X_test)

# Now check the r-squared score.
score = r2_score(y_test, y_predicted)

print("R2 score:", score)

# Convert the inverse-square values back to temperatures
# by raising them to the power of -0.5
actual_temps = y_test ** (-0.5)
predicted_temps = y_predicted ** (-0.5)

fig, ax = plt.subplots()

fig.suptitle("Predicted vs Actual Temperatures for a Cooling Jug of Water")

ax.set_xlabel("Time (Minutes)")
ax.set_ylabel("Temperature (°C)")

ax.plot(X_test, actual_temps, '--o', label="Actual", color='green')
ax.plot(X_test, predicted_temps, '--', label="Predicted", color='blue')
ax.legend()

plt.show()

R2 score: 0.9699183977543862

Predicted temperatures are fairly close to the measured temperatures.
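If we wanted to put a number on “fairly close”, one option is to measure the error in degrees. The sketch below uses scikit-learn’s mean_absolute_error on invented values; the names actual_temps and predicted_temps mirror the script above, but the numbers here are hypothetical and are not results from the cooling data.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Hypothetical temperatures, standing in for the converted test values
actual_temps = np.array([30.0, 29.0, 28.5, 28.0])
predicted_temps = np.array([30.5, 29.5, 28.0, 27.0])

# Average size of the error, in the same units as the temperatures
mae = mean_absolute_error(actual_temps, predicted_temps)

# Largest single error
max_err = np.max(np.abs(actual_temps - predicted_temps))

print("Mean absolute error:", mae)
print("Largest error:", max_err)
```

Unlike the R squared score, these figures are in the same units as the data, so they are easier to interpret directly.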
