Linear regression from scratch: Predicting F1 results
Before we start
This notebook can be browsed interactively on Kaggle: https://www.kaggle.com/dorin131/f1-predictions-blog
Prologue
The other day I learned how to implement linear regression and have been itching to put this newly acquired knowledge to good use. It's funny how ML can be applied to almost anything, but when it came to actually picking one thing, I didn't know what. What would be fun? What is something that I'm interested in? So I started going through the Kaggle dataset catalogue... and, BINGO! F1 results from 1950 to 2020! What can I do with this? What else but try to predict who's going to win the next race? I suspect it may not be the best fit for a linear regression model, but it will be fun.
The plan
So, the plan is to combine the tables provided in this dataset and end up with a few features that I think may have the biggest effect on the finish position of a driver. Let's say...
- driver standing
- constructor standing
- driver grid position
If only it was that simple to predict the position, right? Well, it's not that simple and our predictions won't be all that accurate, but who cares!
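Written out, the plan amounts to fitting something like the equation below. The weights $\theta_0, \dots, \theta_3$ are notation introduced here for illustration, not anything defined in the dataset, and the exact form the model takes later may differ.

```latex
\[
\widehat{\text{position}} = \theta_0
  + \theta_1 \cdot \text{driverStanding}
  + \theta_2 \cdot \text{constructorStanding}
  + \theta_3 \cdot \text{grid}
\]
```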
Below you'll see me load the data, do some transformations, joins, filtering, conversions, etc. Once all of this boring part is done, we'll get into implementing the model and then training it!
# Importing some libraries we're going to need
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import os
from pandas.plotting import scatter_matrix
Getting the data
# Looking at what files we've got and getting their paths
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
/kaggle/input/formula-1-world-championship-1950-2020/races.csv
/kaggle/input/formula-1-world-championship-1950-2020/constructor_results.csv
/kaggle/input/formula-1-world-championship-1950-2020/drivers.csv
/kaggle/input/formula-1-world-championship-1950-2020/constructors.csv
/kaggle/input/formula-1-world-championship-1950-2020/lap_times.csv
/kaggle/input/formula-1-world-championship-1950-2020/status.csv
/kaggle/input/formula-1-world-championship-1950-2020/driver_standings.csv
/kaggle/input/formula-1-world-championship-1950-2020/seasons.csv
/kaggle/input/formula-1-world-championship-1950-2020/pit_stops.csv
/kaggle/input/formula-1-world-championship-1950-2020/sprint_results.csv
/kaggle/input/formula-1-world-championship-1950-2020/constructor_standings.csv
/kaggle/input/formula-1-world-championship-1950-2020/results.csv
/kaggle/input/formula-1-world-championship-1950-2020/circuits.csv
/kaggle/input/formula-1-world-championship-1950-2020/qualifying.csv
results = pd.read_csv('/kaggle/input/formula-1-world-championship-1950-2020/results.csv')
driver_standings = pd.read_csv('/kaggle/input/formula-1-world-championship-1950-2020/driver_standings.csv')
constructor_standings = pd.read_csv('/kaggle/input/formula-1-world-championship-1950-2020/constructor_standings.csv')
Quick look at the data
We're just going to print the first 5 rows of each dataset and check the types.
results.head()
| | resultId | raceId | driverId | constructorId | number | grid | position | positionText | positionOrder | points | laps | time | milliseconds | fastestLap | rank | fastestLapTime | fastestLapSpeed | statusId |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 18 | 1 | 1 | 22 | 1 | 1 | 1 | 1 | 10.0 | 58 | 1:34:50.616 | 5690616 | 39 | 2 | 1:27.452 | 218.300 | 1 |
1 | 2 | 18 | 2 | 2 | 3 | 5 | 2 | 2 | 2 | 8.0 | 58 | +5.478 | 5696094 | 41 | 3 | 1:27.739 | 217.586 | 1 |
2 | 3 | 18 | 3 | 3 | 7 | 7 | 3 | 3 | 3 | 6.0 | 58 | +8.163 | 5698779 | 41 | 5 | 1:28.090 | 216.719 | 1 |
3 | 4 | 18 | 4 | 4 | 5 | 11 | 4 | 4 | 4 | 5.0 | 58 | +17.181 | 5707797 | 58 | 7 | 1:28.603 | 215.464 | 1 |
4 | 5 | 18 | 5 | 1 | 23 | 3 | 5 | 5 | 5 | 4.0 | 58 | +18.014 | 5708630 | 43 | 1 | 1:27.418 | 218.385 | 1 |
results.dtypes
resultId             int64
raceId               int64
driverId             int64
constructorId        int64
number              object
grid                 int64
position            object
positionText        object
positionOrder        int64
points             float64
laps                 int64
time                object
milliseconds        object
fastestLap          object
rank                object
fastestLapTime      object
fastestLapSpeed     object
statusId             int64
dtype: object
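Several columns that look numeric, `position` included, come back as `object`. A likely cause is the `\N` placeholder the dataset uses for missing values (it shows up in the joined table further down). Here is a minimal sketch of one way to turn such a column into numbers. It uses a tiny made-up frame rather than the real file, so `toy` and its values are purely illustrative.

```python
import pandas as pd

# Made-up stand-in for a slice of results.csv; "\\N" (backslash + N) marks a missing finish
toy = pd.DataFrame({"grid": [1, 5, 2], "position": ["1", "2", "\\N"]})
print(toy["position"].dtype)  # a text dtype, not a number

# errors="coerce" turns anything that cannot be parsed as a number (here "\N") into NaN
toy["position"] = pd.to_numeric(toy["position"], errors="coerce")
print(toy["position"].dtype)         # float64
print(toy["position"].isna().sum())  # 1
```

Another option, if you would rather deal with it at load time, is to pass `na_values="\\N"` to `pd.read_csv`.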
driver_standings.head()
| | driverStandingsId | raceId | driverId | points | position | positionText | wins |
---|---|---|---|---|---|---|---|
0 | 1 | 18 | 1 | 10.0 | 1 | 1 | 1 |
1 | 2 | 18 | 2 | 8.0 | 2 | 2 | 0 |
2 | 3 | 18 | 3 | 6.0 | 3 | 3 | 0 |
3 | 4 | 18 | 4 | 5.0 | 4 | 4 | 0 |
4 | 5 | 18 | 5 | 4.0 | 5 | 5 | 0 |
driver_standings.dtypes
driverStandingsId      int64
raceId                 int64
driverId               int64
points               float64
position               int64
positionText          object
wins                   int64
dtype: object
constructor_standings.head()
| | constructorStandingsId | raceId | constructorId | points | position | positionText | wins |
---|---|---|---|---|---|---|---|
0 | 1 | 18 | 1 | 14.0 | 1 | 1 | 1 |
1 | 2 | 18 | 2 | 8.0 | 3 | 3 | 0 |
2 | 3 | 18 | 3 | 9.0 | 2 | 2 | 0 |
3 | 4 | 18 | 4 | 5.0 | 4 | 4 | 0 |
4 | 5 | 18 | 5 | 2.0 | 5 | 5 | 0 |
constructor_standings.dtypes
constructorStandingsId      int64
raceId                      int64
constructorId               int64
points                    float64
position                    int64
positionText               object
wins                        int64
dtype: object
Dropping the columns we don't need
# We only need a few columns, which we're getting here
results = results[["raceId", "driverId", "constructorId", "grid", "position"]]
results.head()
| | raceId | driverId | constructorId | grid | position |
---|---|---|---|---|---|
0 | 18 | 1 | 1 | 1 | 1 |
1 | 18 | 2 | 2 | 5 | 2 |
2 | 18 | 3 | 3 | 7 | 3 |
3 | 18 | 4 | 4 | 11 | 4 |
4 | 18 | 5 | 1 | 3 | 5 |
driver_standings = driver_standings[["raceId", "driverId", "position"]]
# Rename the "position" column to avoid a conflict with the "position" column from results.csv
driver_standings = driver_standings.rename(columns={"position": "driverStanding"})
# Use current driver standings for the next race
driver_standings["raceId"] += 1
driver_standings.head()
| | raceId | driverId | driverStanding |
---|---|---|---|
0 | 19 | 1 | 1 |
1 | 19 | 2 | 2 |
2 | 19 | 3 | 3 |
3 | 19 | 4 | 4 |
4 | 19 | 5 | 5 |
# Again, picking the columns we need and renaming "position"
constructor_standings = constructor_standings[["raceId", "constructorId", "position"]]
constructor_standings = constructor_standings.rename(columns={"position": "constructorStanding"})
# Use current constructor standings for the next race
constructor_standings["raceId"] += 1
constructor_standings.head()
| | raceId | constructorId | constructorStanding |
---|---|---|---|
0 | 19 | 1 | 1 |
1 | 19 | 2 | 3 |
2 | 19 | 3 | 2 |
3 | 19 | 4 | 4 |
4 | 19 | 5 | 5 |
Joining the data
# Joining results with driver standings. This will add the "driverStanding" column to our results
results_driver_standings = pd.merge(results, driver_standings, on=["raceId", "driverId"], how="inner")
results_driver_standings.head()
| | raceId | driverId | constructorId | grid | position | driverStanding |
---|---|---|---|---|---|---|
0 | 18 | 1 | 1 | 1 | 1 | 5 |
1 | 18 | 2 | 2 | 5 | 2 | 13 |
2 | 18 | 3 | 3 | 7 | 3 | 7 |
3 | 18 | 4 | 4 | 11 | 4 | 9 |
4 | 18 | 5 | 1 | 3 | 5 | 12 |
# Now we join the constructor standings and we end up with everything we need in one place
joined_data = pd.merge(results_driver_standings, constructor_standings, on=["raceId", "constructorId"], how="inner")
joined_data.head()
| | raceId | driverId | constructorId | grid | position | driverStanding | constructorStanding |
---|---|---|---|---|---|---|---|
0 | 18 | 1 | 1 | 1 | 1 | 5 | 3 |
1 | 18 | 5 | 1 | 3 | 5 | 12 | 3 |
2 | 18 | 2 | 2 | 5 | 2 | 13 | 6 |
3 | 18 | 9 | 2 | 2 | \N | 14 | 6 |
4 | 18 | 3 | 3 | 7 | 3 | 7 | 7 |
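One thing to keep in mind: an inner join keeps only rows whose keys appear in both frames, so any result without a matching standings row disappears without warning. A small sketch for counting how many rows would be lost, using toy frames (`results_toy`, `standings_toy` and `left_check` are made-up names):

```python
import pandas as pd

results_toy = pd.DataFrame({"raceId": [2, 2, 2], "driverId": [10, 20, 30], "position": [1, 2, 3]})
standings_toy = pd.DataFrame({"raceId": [2, 2], "driverId": [10, 20], "driverStanding": [1, 2]})

# A left join keeps every result; indicator=True adds a "_merge" column
# saying whether a match was found ("both") or not ("left_only")
left_check = results_toy.merge(
    standings_toy, on=["raceId", "driverId"], how="left", indicator=True
)
unmatched = (left_check["_merge"] == "left_only").sum()
print(f"{unmatched} of {len(left_check)} result rows have no standings")
```

Running the same check on the real frames before switching back to `how="inner"` would show how much data the join throws away.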
Sense checking the data
Things look good so far, but I've got no idea whether the numbers are correct or whether I've messed something up. To verify things, I'm going to print out the results for the last few races in the dataset (which, despite its name, runs into 2022) and compare them with the F1 results on Wikipedia.
joined_data.sort_values(by='raceId', ascending=False).head(60)
| | raceId | driverId | constructorId | grid | position | driverStanding | constructorStanding |
---|---|---|---|---|---|---|---|
22157 | 1096 | 825 | 210 | 16 | 17 | 13 | 8 |
22147 | 1096 | 4 | 214 | 10 | \N | 9 | 4 |
22138 | 1096 | 830 | 9 | 1 | 1 | 1 | 1 |
22139 | 1096 | 815 | 9 | 2 | 3 | 3 | 1 |
22140 | 1096 | 844 | 6 | 3 | 2 | 2 | 2 |
22141 | 1096 | 832 | 6 | 4 | 4 | 6 | 2 |
22142 | 1096 | 847 | 131 | 6 | 5 | 4 | 3 |
22144 | 1096 | 846 | 1 | 7 | 6 | 7 | 5 |
22145 | 1096 | 817 | 1 | 13 | 9 | 12 | 5 |
22146 | 1096 | 839 | 214 | 8 | 7 | 8 | 4 |
22143 | 1096 | 1 | 131 | 5 | 18 | 5 | 3 |
22148 | 1096 | 840 | 117 | 14 | 8 | 15 | 7 |
22153 | 1096 | 822 | 51 | 18 | 15 | 10 | 6 |
22149 | 1096 | 20 | 117 | 9 | 10 | 11 | 7 |
22155 | 1096 | 849 | 3 | 20 | 19 | 20 | 10 |
22154 | 1096 | 848 | 3 | 19 | 13 | 19 | 10 |
22156 | 1096 | 854 | 210 | 12 | 16 | 16 | 8 |
22152 | 1096 | 855 | 51 | 15 | 12 | 18 | 6 |
22151 | 1096 | 842 | 213 | 17 | 14 | 14 | 9 |
22150 | 1096 | 852 | 213 | 11 | 11 | 17 | 9 |
22127 | 1095 | 855 | 51 | 13 | 12 | 18 | 6 |
22118 | 1095 | 847 | 131 | 1 | 1 | 4 | 3 |
22119 | 1095 | 1 | 131 | 2 | 2 | 5 | 3 |
22120 | 1095 | 832 | 6 | 7 | 3 | 6 | 2 |
22121 | 1095 | 844 | 6 | 5 | 4 | 3 | 2 |
22122 | 1095 | 4 | 214 | 17 | 5 | 9 | 4 |
22123 | 1095 | 839 | 214 | 16 | 8 | 8 | 4 |
22124 | 1095 | 830 | 9 | 3 | 6 | 1 | 1 |
22125 | 1095 | 815 | 9 | 4 | 7 | 2 | 1 |
22126 | 1095 | 822 | 51 | 14 | 9 | 10 | 6 |
22133 | 1095 | 852 | 213 | 0 | 17 | 17 | 9 |
22128 | 1095 | 840 | 117 | 15 | 10 | 15 | 7 |
22134 | 1095 | 848 | 3 | 19 | 15 | 19 | 10 |
22129 | 1095 | 20 | 117 | 9 | 11 | 11 | 7 |
22136 | 1095 | 846 | 1 | 6 | \N | 7 | 5 |
22135 | 1095 | 849 | 3 | 18 | 16 | 20 | 10 |
22137 | 1095 | 817 | 1 | 11 | \N | 12 | 5 |
22132 | 1095 | 842 | 213 | 10 | 14 | 14 | 9 |
22131 | 1095 | 825 | 210 | 8 | \N | 13 | 8 |
22130 | 1095 | 854 | 210 | 12 | 13 | 16 | 8 |
22107 | 1094 | 4 | 214 | 9 | 19 | 9 | 4 |
22098 | 1094 | 830 | 9 | 1 | 1 | 1 | 1 |
22099 | 1094 | 815 | 9 | 4 | 3 | 3 | 1 |
22100 | 1094 | 1 | 131 | 3 | 2 | 6 | 3 |
22101 | 1094 | 847 | 131 | 2 | 4 | 4 | 3 |
22102 | 1094 | 832 | 6 | 5 | 5 | 5 | 2 |
22103 | 1094 | 844 | 6 | 7 | 6 | 2 | 2 |
22104 | 1094 | 817 | 1 | 11 | 7 | 12 | 5 |
22105 | 1094 | 846 | 1 | 8 | 9 | 7 | 5 |
22106 | 1094 | 839 | 214 | 10 | 8 | 8 | 4 |
22113 | 1094 | 849 | 3 | 18 | 18 | 20 | 10 |
22108 | 1094 | 822 | 51 | 6 | 10 | 10 | 6 |
22114 | 1094 | 20 | 117 | 16 | 14 | 11 | 7 |
22109 | 1094 | 855 | 51 | 12 | 13 | 18 | 6 |
22116 | 1094 | 854 | 210 | 15 | 16 | 16 | 8 |
22115 | 1094 | 840 | 117 | 20 | 15 | 15 | 7 |
22117 | 1094 | 825 | 210 | 19 | 17 | 13 | 8 |
22112 | 1094 | 848 | 3 | 17 | 12 | 19 | 10 |
22111 | 1094 | 852 | 213 | 13 | \N | 17 | 9 |
22110 | 1094 | 842 | 213 | 14 | 11 | 14 | 9 |
I picked George Russell as my point of reference. Why him, you ask? Maybe because I walked past him on the street once, which makes him the F1 driver I've come closest to. Good enough reason for me.
We can see above that in the Abu Dhabi 2022 race (1096), George Russell (847) came 5th, starting 6th on the grid. That looks right.
(https://en.wikipedia.org/wiki/2022_Abu_Dhabi_Grand_Prix)
Now, to check that the driver and constructor standings are correct, we have to look at the previous race. At the end of the São Paulo race, he was 4th in the drivers' standings and Mercedes was 3rd in the constructors'.
(https://en.wikipedia.org/wiki/2022_S%C3%A3o_Paulo_Grand_Prix)
Great! It all checks out. We can continue.
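The same lookup can be done in code instead of scrolling through 60 rows. This is a sketch on a three-row stand-in copied from the output above (`sample` is a made-up name); on the real data you would apply the same filter to `joined_data`.

```python
import pandas as pd

# Three rows copied from the joined output above (position is still text at this point)
sample = pd.DataFrame({
    "raceId": [1096, 1096, 1095],
    "driverId": [847, 1, 847],
    "constructorId": [131, 131, 131],
    "grid": [6, 5, 1],
    "position": ["5", "18", "1"],
    "driverStanding": [4, 5, 4],
    "constructorStanding": [3, 3, 3],
})

# George Russell (driverId 847) at Abu Dhabi 2022 (raceId 1096)
russell = sample[(sample["raceId"] == 1096) & (sample["driverId"] == 847)]
print(russell[["grid", "position", "driverStanding", "constructorStanding"]])
```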
# Let's see how many examples we ended up with
len(joined_data)
22158
Quick plot
joined_data[["grid", "driverStanding", "constructorStanding", "position"]].hist(figsize=(12, 8))
plt.show()