Multiple Linear Regression

We’ve seen examples of using linear regression to fit a straight line to data points, but we can also use linear regression to fit a flat surface (a plane) to multi-dimensional data.

We’re still trying to predict or approximate the value of one particular variable, but we use multiple variables to make the prediction.

An example would be predicting height above sea level from latitude and longitude, for an area of land that's inclined but relatively flat, like the side of a hill.
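As a minimal sketch of that idea, with an entirely made-up hillside (the coordinates, slopes and noise level below are invented purely for illustration), we could generate synthetic survey points, fit a plane of the form height ≈ b0 + b1 × latitude + b2 × longitude, and check that the fitted coefficients come back out:

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# 200 made-up survey points: latitude and longitude in arbitrary units
lat_lon = rng.uniform(0.0, 1.0, size=(200, 2))

# A hypothetical hillside: height rises with latitude, falls with longitude,
# plus a little measurement noise
height = 120 + 30 * lat_lon[:, 0] - 15 * lat_lon[:, 1]
height = height + rng.normal(0.0, 0.5, size=200)

# Fitting a plane with two predictors should recover roughly those numbers
model = LinearRegression()
model.fit(lat_lon, height)

print("Intercept:", model.intercept_)  # close to 120
print("Slopes:", model.coef_)          # close to [30, -15]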

For this example we’ll use grapes.csv. I collected together some grapes of two different types and for each grape I measured the diameter, length and weight.

First let’s try to visualise the data. Since we have three dimensions here (three measurements for each grape), we can conveniently do this using a 3-dimensional plot. To create one, we pass projection="3d" when adding a subplot, and then supply our three data series to scatter to create a scatter plot.

import pandas as pd
import matplotlib.pyplot as plt

# Load the data
df = pd.read_csv('grapes.csv')

fig = plt.figure()

ax = fig.add_subplot(projection="3d")

ax.set_title("Grape Measurements")
ax.set_xlabel("Diameter (mm)")
ax.set_ylabel("Length (mm)")
ax.set_zlabel("Weight (g)")

ax.scatter(df['diameter'], df['length'], df['weight'])

plt.show()

We can rotate this graph with the mouse (when it’s shown in an interactive matplotlib window). By tilting it carefully, it becomes apparent that although these points do approximately fit on a plane (so are a good candidate for linear regression), there is a slight curve in them.

We can also see that there is one point that lies far outside the plane; it’s an outlier, probably resulting from a mistake in measurement or recording the data (I did get very bored measuring all those grapes!) and it probably should be removed.

The reason for the curve is easy to imagine. The weight of a grape is likely proportional to its volume, and its volume is proportional not to the length or diameter, but to the cube of these values. This suggests that we might obtain a flatter plane if we take the cube root of weight. This is the same as raising to the power of 1/3.

import pandas as pd
import matplotlib.pyplot as plt

# Load the data
df = pd.read_csv('grapes.csv')

fig = plt.figure()

ax = fig.add_subplot(projection="3d")

ax.set_title("Grape Measurements")
ax.set_xlabel("Diameter (mm)")
ax.set_ylabel("Length (mm)")
ax.set_zlabel("Cube Root of Weight")

x = df['diameter']
y = df['length']
z = df['weight'] ** (1/3)  # cube root of weight, to flatten the curve

ax.scatter(x, y, z)

plt.show()

Indeed, this does look flatter.

The Linear Regression Model

Now we’ll add in a linear regression model, as we did previously. As before, we need a “predictor” matrix of values that we use to make predictions, and a “target” value that we seek to predict. We can form the predictor matrix from two columns of the dataframe (diameter and length), and we’ll use the cube root of weight as the target.

Finally we’ll calculate an R squared score.

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
import numpy as np
from sklearn.metrics import r2_score

# Load the data
df = pd.read_csv('grapes.csv')

x = df['diameter']
y = df['length']
z = df['weight'] ** (1/3)

# Predictor matrix: diameter and length
X = df[['diameter', 'length']]

model = LinearRegression()
model.fit(X, z)

# Predictions of the cube root of weight
weight_predicted = model.predict(X)

# r2_score expects (true values, predicted values)
r2 = r2_score(z, weight_predicted)

print("R2 Score:", r2)
R2 Score: 0.9709411609003791

This gives an R squared score of about 0.97: the model explains roughly 97% of the variance in the cube root of weight, which is pretty good.
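For a sense of what that number means: R squared compares the model’s squared errors to the spread of the target values around their mean, so 1 means perfect predictions and 0 means the model does no better than always predicting the mean. We can check the score by hand, reusing z and weight_predicted from the script above:

ss_res = ((z - weight_predicted) ** 2).sum()  # residual sum of squares
ss_tot = ((z - z.mean()) ** 2).sum()          # total sum of squares

print("R2 by hand:", 1 - ss_res / ss_tot)     # same value as r2_score(z, weight_predicted)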

Removing Outliers

We saw earlier that there is a clear outlier in the data; a sample that does not fit on the plane along with the other samples. We might get a better fit with our model if we remove outliers, especially if there are several of them.

There are lots of ways we can find outliers. Here we’ll take the following approach.

  • Add a new column to our dataframe containing the predicted weights. Since we took the cube root of weight, we must take the cube of our predictions to get back an actual predicted weight.
  • Add another column containing the absolute difference between weight and predicted weight. We could also consider using the square of the difference; either way we need to ensure we don’t mix positive and negative values since we’re only interested in how “far” our predictions are from actual values.
  • Sort the dataframe so that the rows with the largest differences between actual and predicted values come first.

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
import numpy as np
from sklearn.metrics import r2_score

# Load the data
df = pd.read_csv('grapes.csv')

x = df['diameter']
y = df['length']
z = df['weight'] ** (1/3)

X = df[['diameter', 'length']]

model = LinearRegression()
model.fit(X, z)

weight_predicted = model.predict(X)

r2 = r2_score(z, weight_predicted)

print("R2 Score:", r2)

# Cube the predictions to get back to actual predicted weights
df['predicted weight'] = weight_predicted**3

# Absolute difference between actual and predicted weight
df['difference'] = (df['predicted weight'] - df['weight']).abs()

# Largest differences (the likely outliers) first
df.sort_values(by='difference', inplace=True, ascending=False)

print(df)
R2 Score: 0.9709411609003791
     weight  length  diameter   color  predicted weight  difference
67     4.47   23.49     23.34   green          7.803108    3.333108
0      7.92   27.32     22.84  purple          8.658565    0.738565
57     8.57   27.40     21.63  purple          7.842359    0.727641
136   10.18   27.09     24.09  purple          9.508218    0.671782
64     7.59   27.19     20.39   green          6.978375    0.611625
..      ...     ...       ...     ...               ...         ...
89     3.52   20.40     16.61   green          3.524795    0.004795
53     3.50   18.67     17.34   green          3.504132    0.004132
49     4.17   25.28     15.94   green          4.166844    0.003156
101    1.54   14.63     12.77   green          1.543107    0.003107
37     4.11   22.48     17.07   green          4.106948    0.003052

[139 rows x 6 columns]

You can easily see that the first row is much further from its predicted value than any of the others; it lies far from the plane that the other points approximately lie on. Since we now have its index label in the dataframe (67), we can drop it before fitting the model again.

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
import numpy as np
from sklearn.metrics import r2_score

# Load the data
df = pd.read_csv('grapes.csv')

# Drop the outlier identified above (index label 67)
df.drop(67, inplace=True)

x = df['diameter']
y = df['length']
z = df['weight'] ** (1/3)

X = df[['diameter', 'length']]

model = LinearRegression()
model.fit(X, z)

weight_predicted = model.predict(X)

r2 = r2_score(z, weight_predicted)

print("R2 Score:", r2)
R2 Score: 0.9895569704497154

In this case we’ve removed only one outlier, but we still see a small improvement in the model’s predictive accuracy, as measured by the R squared score (up from about 0.97 to about 0.99).
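If the data contained several outliers rather than one, dropping rows by individual index labels would quickly become tedious. One alternative, sketched below with an arbitrary cut-off chosen purely for illustration, is to keep only the rows whose difference between actual and predicted weight stays under some threshold, and then refit on the remaining rows (this reuses the dataframe with the difference column from the earlier script):

# Keep only rows whose prediction error is below a threshold
# (the 1 gram cut-off here is arbitrary and just for illustration)
df_clean = df[df['difference'] < 1.0]

X_clean = df_clean[['diameter', 'length']]
z_clean = df_clean['weight'] ** (1/3)

model = LinearRegression()
model.fit(X_clean, z_clean)

print("R2 Score:", r2_score(z_clean, model.predict(X_clean)))

As with any outlier removal based on the model’s own errors, it’s worth checking that the rows being dropped really are bad measurements, and not just points the model happens to fit poorly.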
