Linear regression from scratch: Predicting F1 results
Before we start¶
This notebook can be browsed interactively on Kaggle: https://www.kaggle.com/dorin131/f1-predictions-blog
Prologue¶
The other day I learned how to implement linear regression and have been itching to make good use of this newly acquired knowledge. It's funny how ML can be applied to almost anything, but when it came to actually picking one thing, I didn't know what. What would be fun? What is something that I'm interested in? So I started going through the Kaggle dataset catalogue... and, BINGO! F1 results from 1950 to 2020! What can I do with this? What else but try to predict who's going to win the next race? I suspect it may not be the best fit for a linear regression model, but it will be fun.
The plan¶
So, the plan is to combine the tables provided in this dataset and end up with a few features that I think may have the biggest effect on the finish position of a driver. Let's say...
- driver standing
- constructor standing
- driver grid position
If only it were that simple to predict the position, right? Well, it's not, and our predictions won't be all that accurate, but who cares!
Below you'll see me load the data, do some transformations, joins, filtering, conversions, etc. Once the boring part is done, we'll get into implementing the model and then training it!
# Importing some libraries we're going to need
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import os
from pandas.plotting import scatter_matrix
Getting the data¶
# Looking at what files we've got and getting their paths
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
/kaggle/input/formula-1-world-championship-1950-2020/races.csv
/kaggle/input/formula-1-world-championship-1950-2020/constructor_results.csv
/kaggle/input/formula-1-world-championship-1950-2020/drivers.csv
/kaggle/input/formula-1-world-championship-1950-2020/constructors.csv
/kaggle/input/formula-1-world-championship-1950-2020/lap_times.csv
/kaggle/input/formula-1-world-championship-1950-2020/status.csv
/kaggle/input/formula-1-world-championship-1950-2020/driver_standings.csv
/kaggle/input/formula-1-world-championship-1950-2020/seasons.csv
/kaggle/input/formula-1-world-championship-1950-2020/pit_stops.csv
/kaggle/input/formula-1-world-championship-1950-2020/sprint_results.csv
/kaggle/input/formula-1-world-championship-1950-2020/constructor_standings.csv
/kaggle/input/formula-1-world-championship-1950-2020/results.csv
/kaggle/input/formula-1-world-championship-1950-2020/circuits.csv
/kaggle/input/formula-1-world-championship-1950-2020/qualifying.csv
results = pd.read_csv('/kaggle/input/formula-1-world-championship-1950-2020/results.csv')
driver_standings = pd.read_csv('/kaggle/input/formula-1-world-championship-1950-2020/driver_standings.csv')
constructor_standings = pd.read_csv('/kaggle/input/formula-1-world-championship-1950-2020/constructor_standings.csv')
Quick look at the data¶
We're just going to print the first 5 rows of each table and check the column types.
results.head()
| resultId | raceId | driverId | constructorId | number | grid | position | positionText | positionOrder | points | laps | time | milliseconds | fastestLap | rank | fastestLapTime | fastestLapSpeed | statusId
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 18 | 1 | 1 | 22 | 1 | 1 | 1 | 1 | 10.0 | 58 | 1:34:50.616 | 5690616 | 39 | 2 | 1:27.452 | 218.300 | 1 |
1 | 2 | 18 | 2 | 2 | 3 | 5 | 2 | 2 | 2 | 8.0 | 58 | +5.478 | 5696094 | 41 | 3 | 1:27.739 | 217.586 | 1 |
2 | 3 | 18 | 3 | 3 | 7 | 7 | 3 | 3 | 3 | 6.0 | 58 | +8.163 | 5698779 | 41 | 5 | 1:28.090 | 216.719 | 1 |
3 | 4 | 18 | 4 | 4 | 5 | 11 | 4 | 4 | 4 | 5.0 | 58 | +17.181 | 5707797 | 58 | 7 | 1:28.603 | 215.464 | 1 |
4 | 5 | 18 | 5 | 1 | 23 | 3 | 5 | 5 | 5 | 4.0 | 58 | +18.014 | 5708630 | 43 | 1 | 1:27.418 | 218.385 | 1 |
results.dtypes
resultId             int64
raceId               int64
driverId             int64
constructorId        int64
number              object
grid                 int64
position            object
positionText        object
positionOrder        int64
points             float64
laps                 int64
time                object
milliseconds        object
fastestLap          object
rank                object
fastestLapTime      object
fastestLapSpeed     object
statusId             int64
dtype: object
driver_standings.head()
| driverStandingsId | raceId | driverId | points | position | positionText | wins
---|---|---|---|---|---|---|---|
0 | 1 | 18 | 1 | 10.0 | 1 | 1 | 1 |
1 | 2 | 18 | 2 | 8.0 | 2 | 2 | 0 |
2 | 3 | 18 | 3 | 6.0 | 3 | 3 | 0 |
3 | 4 | 18 | 4 | 5.0 | 4 | 4 | 0 |
4 | 5 | 18 | 5 | 4.0 | 5 | 5 | 0 |
driver_standings.dtypes
driverStandingsId      int64
raceId                 int64
driverId               int64
points               float64
position               int64
positionText          object
wins                   int64
dtype: object
constructor_standings.head()
| constructorStandingsId | raceId | constructorId | points | position | positionText | wins
---|---|---|---|---|---|---|---|
0 | 1 | 18 | 1 | 14.0 | 1 | 1 | 1 |
1 | 2 | 18 | 2 | 8.0 | 3 | 3 | 0 |
2 | 3 | 18 | 3 | 9.0 | 2 | 2 | 0 |
3 | 4 | 18 | 4 | 5.0 | 4 | 4 | 0 |
4 | 5 | 18 | 5 | 2.0 | 5 | 5 | 0 |
constructor_standings.dtypes
constructorStandingsId      int64
raceId                      int64
constructorId               int64
points                    float64
position                    int64
positionText               object
wins                        int64
dtype: object
Dropping the columns we don't need¶
# We only need a few columns, which we're getting here
results = results[["raceId", "driverId", "constructorId", "grid", "position"]]
results.head()
| raceId | driverId | constructorId | grid | position
---|---|---|---|---|---|
0 | 18 | 1 | 1 | 1 | 1 |
1 | 18 | 2 | 2 | 5 | 2 |
2 | 18 | 3 | 3 | 7 | 3 |
3 | 18 | 4 | 4 | 11 | 4 |
4 | 18 | 5 | 1 | 3 | 5 |
driver_standings = driver_standings[["raceId", "driverId", "position"]]
# Rename the "position" column to avoid conflict with the "position" column from results.csv
driver_standings = driver_standings.rename(columns={"position": "driverStanding"})
# Use current driver standings for the next race (the standings after race N are what a driver carries into race N+1)
driver_standings["raceId"] += 1
driver_standings.head()
| raceId | driverId | driverStanding
---|---|---|---|
0 | 19 | 1 | 1 |
1 | 19 | 2 | 2 |
2 | 19 | 3 | 3 |
3 | 19 | 4 | 4 |
4 | 19 | 5 | 5 |
# Again, picking the columns we need and renaming "position"
constructor_standings = constructor_standings[["raceId", "constructorId", "position"]]
constructor_standings = constructor_standings.rename(columns={"position": "constructorStanding"})
# Use current constructor standings for the next race (same shift as for the drivers above)
constructor_standings["raceId"] += 1
constructor_standings.head()
| raceId | constructorId | constructorStanding
---|---|---|---|
0 | 19 | 1 | 1 |
1 | 19 | 2 | 3 |
2 | 19 | 3 | 2 |
3 | 19 | 4 | 4 |
4 | 19 | 5 | 5 |
Joining the data¶
# Joining results with driver standings. This will add the "driverStanding" column to our results
results_driver_standings = pd.merge(results, driver_standings, on=["raceId", "driverId"], how="inner")
results_driver_standings.head()
| raceId | driverId | constructorId | grid | position | driverStanding
---|---|---|---|---|---|---|
0 | 18 | 1 | 1 | 1 | 1 | 5 |
1 | 18 | 2 | 2 | 5 | 2 | 13 |
2 | 18 | 3 | 3 | 7 | 3 | 7 |
3 | 18 | 4 | 4 | 11 | 4 | 9 |
4 | 18 | 5 | 1 | 3 | 5 | 12 |
# Now we join the constructor standings and we end up with everything we need in one place
joined_data = pd.merge(results_driver_standings, constructor_standings, on=["raceId", "constructorId"], how="inner")
joined_data.head()
| raceId | driverId | constructorId | grid | position | driverStanding | constructorStanding
---|---|---|---|---|---|---|---|
0 | 18 | 1 | 1 | 1 | 1 | 5 | 3 |
1 | 18 | 5 | 1 | 3 | 5 | 12 | 3 |
2 | 18 | 2 | 2 | 5 | 2 | 13 | 6 |
3 | 18 | 9 | 2 | 2 | \N | 14 | 6 |
4 | 18 | 3 | 3 | 7 | 3 | 7 | 7 |
Sense checking the data¶
Things look good so far, but I've got no idea whether the numbers are actually correct or whether I've messed something up. To verify, I'm going to print out the results for the last couple of races in 2022 and compare them with the F1 results on Wikipedia.
joined_data.sort_values(by='raceId', ascending=False).head(60)
| raceId | driverId | constructorId | grid | position | driverStanding | constructorStanding
---|---|---|---|---|---|---|---|
22157 | 1096 | 825 | 210 | 16 | 17 | 13 | 8 |
22147 | 1096 | 4 | 214 | 10 | \N | 9 | 4 |
22138 | 1096 | 830 | 9 | 1 | 1 | 1 | 1 |
22139 | 1096 | 815 | 9 | 2 | 3 | 3 | 1 |
22140 | 1096 | 844 | 6 | 3 | 2 | 2 | 2 |
22141 | 1096 | 832 | 6 | 4 | 4 | 6 | 2 |
22142 | 1096 | 847 | 131 | 6 | 5 | 4 | 3 |
22144 | 1096 | 846 | 1 | 7 | 6 | 7 | 5 |
22145 | 1096 | 817 | 1 | 13 | 9 | 12 | 5 |
22146 | 1096 | 839 | 214 | 8 | 7 | 8 | 4 |
22143 | 1096 | 1 | 131 | 5 | 18 | 5 | 3 |
22148 | 1096 | 840 | 117 | 14 | 8 | 15 | 7 |
22153 | 1096 | 822 | 51 | 18 | 15 | 10 | 6 |
22149 | 1096 | 20 | 117 | 9 | 10 | 11 | 7 |
22155 | 1096 | 849 | 3 | 20 | 19 | 20 | 10 |
22154 | 1096 | 848 | 3 | 19 | 13 | 19 | 10 |
22156 | 1096 | 854 | 210 | 12 | 16 | 16 | 8 |
22152 | 1096 | 855 | 51 | 15 | 12 | 18 | 6 |
22151 | 1096 | 842 | 213 | 17 | 14 | 14 | 9 |
22150 | 1096 | 852 | 213 | 11 | 11 | 17 | 9 |
22127 | 1095 | 855 | 51 | 13 | 12 | 18 | 6 |
22118 | 1095 | 847 | 131 | 1 | 1 | 4 | 3 |
22119 | 1095 | 1 | 131 | 2 | 2 | 5 | 3 |
22120 | 1095 | 832 | 6 | 7 | 3 | 6 | 2 |
22121 | 1095 | 844 | 6 | 5 | 4 | 3 | 2 |
22122 | 1095 | 4 | 214 | 17 | 5 | 9 | 4 |
22123 | 1095 | 839 | 214 | 16 | 8 | 8 | 4 |
22124 | 1095 | 830 | 9 | 3 | 6 | 1 | 1 |
22125 | 1095 | 815 | 9 | 4 | 7 | 2 | 1 |
22126 | 1095 | 822 | 51 | 14 | 9 | 10 | 6 |
22133 | 1095 | 852 | 213 | 0 | 17 | 17 | 9 |
22128 | 1095 | 840 | 117 | 15 | 10 | 15 | 7 |
22134 | 1095 | 848 | 3 | 19 | 15 | 19 | 10 |
22129 | 1095 | 20 | 117 | 9 | 11 | 11 | 7 |
22136 | 1095 | 846 | 1 | 6 | \N | 7 | 5 |
22135 | 1095 | 849 | 3 | 18 | 16 | 20 | 10 |
22137 | 1095 | 817 | 1 | 11 | \N | 12 | 5 |
22132 | 1095 | 842 | 213 | 10 | 14 | 14 | 9 |
22131 | 1095 | 825 | 210 | 8 | \N | 13 | 8 |
22130 | 1095 | 854 | 210 | 12 | 13 | 16 | 8 |
22107 | 1094 | 4 | 214 | 9 | 19 | 9 | 4 |
22098 | 1094 | 830 | 9 | 1 | 1 | 1 | 1 |
22099 | 1094 | 815 | 9 | 4 | 3 | 3 | 1 |
22100 | 1094 | 1 | 131 | 3 | 2 | 6 | 3 |
22101 | 1094 | 847 | 131 | 2 | 4 | 4 | 3 |
22102 | 1094 | 832 | 6 | 5 | 5 | 5 | 2 |
22103 | 1094 | 844 | 6 | 7 | 6 | 2 | 2 |
22104 | 1094 | 817 | 1 | 11 | 7 | 12 | 5 |
22105 | 1094 | 846 | 1 | 8 | 9 | 7 | 5 |
22106 | 1094 | 839 | 214 | 10 | 8 | 8 | 4 |
22113 | 1094 | 849 | 3 | 18 | 18 | 20 | 10 |
22108 | 1094 | 822 | 51 | 6 | 10 | 10 | 6 |
22114 | 1094 | 20 | 117 | 16 | 14 | 11 | 7 |
22109 | 1094 | 855 | 51 | 12 | 13 | 18 | 6 |
22116 | 1094 | 854 | 210 | 15 | 16 | 16 | 8 |
22115 | 1094 | 840 | 117 | 20 | 15 | 15 | 7 |
22117 | 1094 | 825 | 210 | 19 | 17 | 13 | 8 |
22112 | 1094 | 848 | 3 | 17 | 12 | 19 | 10 |
22111 | 1094 | 852 | 213 | 13 | \N | 17 | 9 |
22110 | 1094 | 842 | 213 | 14 | 11 | 14 | 9 |
I picked George Russell as my point of reference. Why him, you ask? Maybe because I walked past him on the street once, which makes him the F1 driver I've come closest to. Good enough reason for me.
We can see in the table above that in the Abu Dhabi 2022 race (1096), George Russell (847) came 5th, starting 6th on the grid. That looks right.
(https://en.wikipedia.org/wiki/2022_Abu_Dhabi_Grand_Prix)
Now, to check that the driver and constructor standings are correct, we have to look at the previous race. At the end of the São Paulo race, he was 4th in the driver standings and Mercedes was 3rd in the constructors'.
(https://en.wikipedia.org/wiki/2022_S%C3%A3o_Paulo_Grand_Prix)
Great! It all checks out. We can continue.
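If you'd rather not scan the whole table, a quicker, more targeted check is also possible. Here is a minimal sketch that filters the joined data directly, using the IDs read off the table above (847 for Russell, 1096 for Abu Dhabi 2022):
# Optional targeted check: Russell (driverId 847) at Abu Dhabi 2022 (raceId 1096)
joined_data[(joined_data["driverId"] == 847) & (joined_data["raceId"] == 1096)]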
# Let's see how many examples we ended up with
len(joined_data)
22158
Quick plot¶
joined_data[["grid", "driverStanding", "constructorStanding", "position"]].hist(figsize=(12, 8))
plt.show()
joined_data[["grid", "driverStanding", "constructorStanding", "position"]].agg(['min', 'max'])
| grid | driverStanding | constructorStanding | position
---|---|---|---|---|
min | 0 | 1 | 1 | 1 |
max | 31 | 82 | 20 | \N |
Something looks fishy here.
- Why do we have grid positions over 20, when no more than 20 cars line up for a modern race?
- Why is the minimum grid position 0?
- Why is "\N" the maximum value for position?
In the 1st case, I've got a hunch that it has to do with the really old data in our dataset. We'll look back only about 10 years and see if that makes any difference. As for the 2nd and 3rd points, we'll deal with those when we clean the data.
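Before doing anything, a quick count of those suspicious rows gives a sense of the scale of the problem. This is just a sketch of the checks described above:
# Rough counts of the oddities listed above ("\N" is the literal marker used in the CSV)
print("grid > 20:", (joined_data["grid"] > 20).sum())
print("grid == 0:", (joined_data["grid"] == 0).sum())
print("position == '\\N':", (joined_data["position"] == "\\N").sum())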
# Finding out the first race ID of the year 2013
# Doing this so that we can get rid of races older than 10 years from our dataset
races = pd.read_csv('/kaggle/input/formula-1-world-championship-1950-2020/races.csv')
races.loc[races['year'] == 2013].sort_values(by="raceId", ascending=True).head()
# The answer is 880
| raceId | year | round | circuitId | name | date | time | url | fp1_date | fp1_time | fp2_date | fp2_time | fp3_date | fp3_time | quali_date | quali_time | sprint_date | sprint_time
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
878 | 880 | 2013 | 1 | 1 | Australian Grand Prix | 2013-03-17 | 06:00:00 | http://en.wikipedia.org/wiki/2013_Australian_G... | \N | \N | \N | \N | \N | \N | \N | \N | \N | \N |
879 | 881 | 2013 | 2 | 2 | Malaysian Grand Prix | 2013-03-24 | 08:00:00 | http://en.wikipedia.org/wiki/2013_Malaysian_Gr... | \N | \N | \N | \N | \N | \N | \N | \N | \N | \N |
880 | 882 | 2013 | 3 | 17 | Chinese Grand Prix | 2013-04-14 | 07:00:00 | http://en.wikipedia.org/wiki/2013_Chinese_Gran... | \N | \N | \N | \N | \N | \N | \N | \N | \N | \N |
881 | 883 | 2013 | 4 | 3 | Bahrain Grand Prix | 2013-04-21 | 12:00:00 | http://en.wikipedia.org/wiki/2013_Bahrain_Gran... | \N | \N | \N | \N | \N | \N | \N | \N | \N | \N |
882 | 884 | 2013 | 5 | 4 | Spanish Grand Prix | 2013-05-12 | 12:00:00 | http://en.wikipedia.org/wiki/2013_Spanish_Gran... | \N | \N | \N | \N | \N | \N | \N | \N | \N | \N |
# Keeping only races with raceId greater than 880, i.e. roughly the last 10 years of data
joined_data = joined_data[joined_data["raceId"] > 880]
joined_data.head()
| raceId | driverId | constructorId | grid | position | driverStanding | constructorStanding
---|---|---|---|---|---|---|---|
18241 | 881 | 20 | 9 | 1 | 1 | 3 | 3 |
18242 | 881 | 17 | 9 | 5 | 2 | 6 | 3 |
18243 | 881 | 1 | 131 | 4 | 3 | 5 | 4 |
18244 | 881 | 3 | 131 | 6 | 4 | 20 | 4 |
18245 | 881 | 13 | 6 | 2 | 5 | 4 | 1 |
joined_data[["grid", "driverStanding", "constructorStanding", "position"]].hist(figsize=(12, 8))
plt.show()
Things look slightly better now. We can carry on with data cleaning.
Cleaning the data¶
# Getting rid of a few more columns that we don't need anymore
dataset = joined_data[["grid", "driverStanding", "constructorStanding", "position"]]
joined_data.head()
| raceId | driverId | constructorId | grid | position | driverStanding | constructorStanding
---|---|---|---|---|---|---|---|
18241 | 881 | 20 | 9 | 1 | 1 | 3 | 3 |
18242 | 881 | 17 | 9 | 5 | 2 | 6 | 3 |
18243 | 881 | 1 | 131 | 4 | 3 | 5 | 4 |
18244 | 881 | 3 | 131 | 6 | 4 | 20 | 4 |
18245 | 881 | 13 | 6 | 2 | 5 | 4 | 1 |
dataset.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 3916 entries, 18241 to 22157
Data columns (total 4 columns):
 #   Column               Non-Null Count  Dtype
---  ------               --------------  -----
 0   grid                 3916 non-null   int64
 1   driverStanding       3916 non-null   int64
 2   constructorStanding  3916 non-null   int64
 3   position             3916 non-null   object
dtypes: int64(3), object(1)
memory usage: 153.0+ KB
# Filter out rows where the position is not numeric (remember the "\N" before?)
dataset = dataset[dataset.position.apply(lambda x: x.isnumeric())]
# Filter out rows where grid is 0
dataset = dataset[dataset.grid.apply(lambda x: x > 0)]
# Change type for position values to integers
dataset.position = dataset.position.astype('int')
dataset
| grid | driverStanding | constructorStanding | position
---|---|---|---|---|
18241 | 1 | 3 | 3 | 1 |
18242 | 5 | 6 | 3 | 2 |
18243 | 4 | 5 | 4 | 3 |
18244 | 6 | 20 | 4 | 4 |
18245 | 2 | 4 | 1 | 5 |
... | ... | ... | ... | ... |
22153 | 18 | 10 | 6 | 15 |
22154 | 19 | 19 | 10 | 13 |
22155 | 20 | 20 | 10 | 19 |
22156 | 12 | 16 | 8 | 16 |
22157 | 16 | 13 | 8 | 17 |
3258 rows × 4 columns
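As a quick sanity check on the cleaning (a small sketch, nothing more), we can assert that the two problems we just filtered out are really gone:
# Sanity check: position is numeric now and there are no grid-0 rows left
assert dataset["position"].dtype.kind == "i"
assert (dataset["grid"] > 0).all()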
Looking at correlations¶
Using corr(), we can get the standard correlation coefficient between our label and each attribute. The closer to 1, the stronger the correlation.
dataset.corr()["position"]
grid                   0.750109
driverStanding         0.742595
constructorStanding    0.750760
position               1.000000
Name: position, dtype: float64
We can also visualise the correlations using scatter_matrix()
scatter_matrix(dataset, figsize=(12,8))
plt.show()
Another nice way to visualise correlations is with two-dimensional histograms.
fig,ax = plt.subplots(1, 3, figsize=(12, 3))
# https://stackoverflow.com/a/20105673/3015186
max_grid = dataset.grid.max()
max_position = dataset.position.max()
max_d_position = dataset.driverStanding.max()
max_c_position = dataset.constructorStanding.max()
print(f"max_grid = {max_grid}; max_position = {max_position}; max_d_position = {max_d_position}; max_c_position = {max_c_position}")
ax[0].hist2d(dataset.grid, dataset.position, (max_grid, max_position), cmap='plasma', cmin=1)
ax[0].set_xlabel("Grid")
ax[0].set_ylabel("Final position")
ax[1].hist2d(dataset.driverStanding, dataset.position, (max_d_position, max_position), cmap='plasma', cmin=1)
ax[1].set_xlabel("Driver standing")
ax[2].hist2d(dataset.constructorStanding, dataset.position, (max_c_position, max_position), cmap='plasma', cmin=1)
ax[2].set_xlabel("Constructor standing")
plt.show()
max_grid = 22; max_position = 22; max_d_position = 24; max_c_position = 11
What we see above is that there is a roughly linear relationship between each of our features and the final result, which is good, but the data points are also very dispersed, meaning that our predictions will probably not be very accurate.
Getting ready for training¶
It's time to split our data into features and targets.
I'm not going to have a separate validation set, just to keep things simple.
One other thing I want to do is convert the Pandas DataFrames into NumPy arrays. I just find them easier to deal with once the data is ready.
# Split the dataset into training X and y
x_train = dataset[['grid', 'driverStanding', 'constructorStanding']].values
y_train = dataset[['position']].values.reshape(-1) # Transforming to a 1D array
print(f'{x_train.shape}; {y_train.shape}')
print(x_train)
print(y_train)
(3258, 3); (3258,)
[[ 1  3  3]
 [ 5  6  3]
 [ 4  5  4]
 ...
 [20 20 10]
 [12 16  8]
 [16 13  8]]
[ 1  2  3 ... 19 16 17]
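As mentioned above, I'm not keeping a separate validation set. If you did want one, a minimal sketch of a simple 80/20 shuffle-and-split (my own assumption, not something used in the rest of this notebook) could look like this:
# Hypothetical 80/20 train/validation split -- not used below
rng = np.random.default_rng(42)
idx = rng.permutation(x_train.shape[0])
split = int(0.8 * len(idx))
x_tr, x_val = x_train[idx[:split]], x_train[idx[split:]]
y_tr, y_val = y_train[idx[:split]], y_train[idx[split:]]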
Next we want to implement the cost function $J(\mathbf{w},b)$: $$J(\mathbf{w},b) = \frac{1}{2m} \sum\limits_{i = 0}^{m-1} (f_{\mathbf{w},b}(\mathbf{x}^{(i)}) - y^{(i)})^2 $$ where: $$ f_{\mathbf{w},b}(\mathbf{x}^{(i)}) = \mathbf{w} \cdot \mathbf{x}^{(i)} + b $$
def compute_cost(x, y, w, b):
    # Get number of examples
    m = x.shape[0]
    # Initialise cost
    cost = 0
    for i in range(m):
        f_wb_i = x[i].dot(w) + b
        cost += (f_wb_i - y[i]) ** 2
    return cost / (2 * m)
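As a side note, the same cost can be computed without the Python loop, using NumPy broadcasting. This is just a sketch that's equivalent to the function above, not something used later on:
# Vectorised version of the same cost (for reference only)
def compute_cost_vectorised(x, y, w, b):
    errors = x.dot(np.asarray(w)) + b - y  # f_wb - y for all examples at once
    return np.sum(errors ** 2) / (2 * x.shape[0])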
Let's try out the cost function with all weights and the bias initialised to 0.
w_init = [0, 0, 0]
b_init = 0
# Calculate cost for initial w and b
cost = compute_cost(x_train, y_train, w_init, b_init)
print(f"Cost = {cost}")
Cost = 55.43278084714549
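That number is easy to sanity-check: with the weights and bias all at zero, every prediction is 0, so the cost reduces to half the mean of the squared finishing positions. The one-liner below should print the same number as above:
# With w = 0 and b = 0, J = (1/2m) * sum(y^2)
print((y_train ** 2).mean() / 2)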
Spoiler alert!!!
Now we'll run the cost function again, but with the trained weights and bias (obtained after around 25k iterations, once the algorithm had converged).
# Spoiler: Here are the trained weights and bias
cost = compute_cost(x_train, y_train, [0.35008161, 0.04341071, 0.27482694], 2.0513445619234765)
print(f"Cost = {cost}")
Cost = 6.637608134053078
See how the cost went from over 55 down to around 6.
Now we want to implement the gradient descent function for multiple variables:
$$\begin{align*} \text{repeat}&\text{ until convergence:} \; \lbrace \newline\; & w_j = w_j - \alpha \frac{\partial J(\mathbf{w},b)}{\partial w_j} \; & \text{for j = 0..n-1}\newline &b\ \ = b - \alpha \frac{\partial J(\mathbf{w},b)}{\partial b} \newline \rbrace \end{align*}$$
where $n$ is the number of features, the parameters $w_j$, $b$ are updated simultaneously, and
$$ \begin{align} \frac{\partial J(\mathbf{w},b)}{\partial w_j} &= \frac{1}{m} \sum\limits_{i = 0}^{m-1} (f_{\mathbf{w},b}(\mathbf{x}^{(i)}) - y^{(i)})x_{j}^{(i)} \\ \frac{\partial J(\mathbf{w},b)}{\partial b} &= \frac{1}{m} \sum\limits_{i = 0}^{m-1} (f_{\mathbf{w},b}(\mathbf{x}^{(i)}) - y^{(i)}) \end{align} $$
Here $m$ is the number of training examples in the data set, $f_{\mathbf{w},b}(\mathbf{x}^{(i)})$ is the model's prediction, and $y^{(i)}$ is the target value.
First we implement a function to calculate the gradient, and then another to run the gradient descent.
def compute_gradient(x, y, w, b):
    # Get number of examples and features
    m, n = x.shape
    # Initialise gradient of the cost w.r.t the parameters w
    dj_dw = np.zeros((n,))
    # Initialise gradient of the cost w.r.t the parameter b
    dj_db = 0.
    for i in range(m):
        loss = (x[i].dot(w) + b) - y[i]  # Loss is the same for both derivatives
        for j in range(n):
            dj_dw[j] += loss * x[i][j]
        dj_db += loss
    dj_dw = dj_dw / m
    dj_db = dj_db / m
    return dj_dw, dj_db
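For reference, these gradients can also be written in matrix form, which avoids the double loop. Again, just a sketch that's equivalent to the function above, not something used later:
# Vectorised gradients (for reference only): dj_dw = X^T (Xw + b - y) / m
def compute_gradient_vectorised(x, y, w, b):
    m = x.shape[0]
    errors = x.dot(np.asarray(w)) + b - y
    return x.T.dot(errors) / m, errors.sum() / m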
# Let's try to compute the gradient for our initial weights and bias
compute_gradient(x_train, y_train, w_init, b_init)
(array([-118.62400246, -120.44229589, -61.77931246]), -9.169429097605892)
def gradient_descent(X, y, w_in, b_in, cost_function, gradient_function, alpha, num_iters):
    # Store history of costs (J). This will be used when we print out the progress.
    J_history = []
    w = copy.deepcopy(w_in)
    b = b_in
    # Apply gradient descent num_iters times
    for i in range(num_iters):
        dj_dw, dj_db = gradient_function(X, y, w, b)
        # Update w and b based on the gradient, at the same time (this is important)
        w = w - alpha * dj_dw
        b = b - alpha * dj_db
        J_history.append(cost_function(X, y, w, b))
        # Print the cost 10 times over the run (or every iteration if num_iters < 10)
        if i % math.ceil(num_iters / 10) == 0:
            print(f"Iteration {i:4d}: Cost {J_history[-1]:8.8f} ")
    return w, b, J_history
import copy, math
# Initial values taken from a previous descent, so as not to start from 0
initial_w = np.array([0.37888087, 0.05009108, 0.32510837])
initial_b = 2.7796157637505052
iterations = 10000
alpha = 3.5e-3
print("Running gradient descent...")
w_final, b_final, J_hist = gradient_descent(x_train, y_train, initial_w, initial_b, compute_cost, compute_gradient, alpha, iterations)
print(f"b = {b_final}; w = {w_final} ")
# Expecting something close to: b = 1.1346913859860654; w = [0.34919898 0.16862326 0.47748792]
Running gradient descent...
Iteration    0: Cost 4.98900555
Iteration 1000: Cost 4.68054531
Iteration 2000: Cost 4.62995299
Iteration 3000: Cost 4.61832210
Iteration 4000: Cost 4.61564821
Iteration 5000: Cost 4.61503350
Iteration 6000: Cost 4.61489218
Iteration 7000: Cost 4.61485969
Iteration 8000: Cost 4.61485222
Iteration 9000: Cost 4.61485050
b = 1.1346913859860654; w = [0.34919898 0.16862326 0.47748792]
Above we see that over time the cost keeps decreasing and eventually starts to converge (it is decreasing very slowly). It's not very obvious here because the initial weights I used were already trained. If you run this starting from all zeroes, for more iterations and maybe with a slightly larger learning rate (alpha), you'll observe it much better.
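For the curious, that experiment would look something like the sketch below. The iteration count is the rough figure mentioned earlier (around 25k); everything else is unchanged:
# A sketch of the from-zero run (not executed here -- it takes a while)
# w0, b0, J0 = gradient_descent(x_train, y_train, np.zeros(3), 0., compute_cost, compute_gradient, 3.5e-3, 25000)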
Below I'm iterating over a few examples and generating some predictions, just to get an idea of how well we did.
for i in range(10):
    f = x_train[i+100].dot(w_final) + b_final
    prediction = np.round(f).astype(int)
    actual = y_train[i+100]
    print(f"Prediction: {prediction:3d}, Actual position: {actual:3d}, Accuracy: {100 - (abs(prediction - actual) / actual) * 100.0:3.0f}%")
Prediction: 5, Actual position: 5, Accuracy: 100%
Prediction: 9, Actual position: 6, Accuracy: 50%
Prediction: 11, Actual position: 15, Accuracy: 73%
Prediction: 11, Actual position: 7, Accuracy: 43%
Prediction: 8, Actual position: 10, Accuracy: 80%
Prediction: 6, Actual position: 9, Accuracy: 67%
Prediction: 12, Actual position: 13, Accuracy: 92%
Prediction: 10, Actual position: 11, Accuracy: 91%
Prediction: 11, Actual position: 12, Accuracy: 92%
Prediction: 9, Actual position: 14, Accuracy: 64%
Now let's see what the average difference is between our prediction and the actual result. In other words, by how many positions we are off, on average, when making predictions.
differences = np.zeros(x_train.shape[0])
for i in range(x_train.shape[0]):
    f = x_train[i].dot(w_final) + b_final
    prediction = np.round(f).astype(int)
    actual = y_train[i]
    differences[i] = abs(prediction - actual)
print(f"Average difference: {np.average(differences):2.0f}")
Average difference: 2
The Model¶
So what's our model? Well, here it is:
def predict(grid, driver_standing, constructor_standing):
    prediction = np.array([grid, driver_standing, constructor_standing]).dot([0.34919898, 0.16862326, 0.47748792]) + 1.1346913859860654
    return np.round(prediction).astype(int)
Let's take it for a spin then! Given a driver who starts 3rd on the grid, is 5th in the driver standings, and whose team is 2nd in the constructor standings, we predict that he's going to finish... :drumroll:
predict(3, 5, 2)
4
And the answer is 4!
Epilogue¶
Well, that's about it. I'd say we didn't do too badly. But the real test is predicting an actual upcoming race! The problem is that we can't do that until we've got the grid positions for that race, so check back soon for an update to this post with the real test.
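In the meantime, here's roughly how that test would go once qualifying is over. The names, grid slots and standings below are made up, purely to show the shape of the call:
# Hypothetical inputs -- replace with the real grid and standings once known
for name, grid, d_std, c_std in [("Driver A", 1, 2, 1), ("Driver B", 4, 1, 1)]:
    print(name, "->", predict(grid, d_std, c_std))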