Lecture 15 - Aggregating Data
23 October, 2024
R
and JavaScript
code, as well as write LaTeX
and Markdown
documents.agg()
method.agg()
with .groupby()
to aggregate data by groups.query()
together with those methods to filter and summarise data in one stepField
, Type
, Key
, and Description
Field
is the name of the columnType
is the data type of the column
int
)varchar
- “variable character”)float
)Key
is the primary key of the table, also know as the identifier
Description
is a brief description of the columnresultId | raceId | driverId | constructorId | number | grid | position | positionText | positionOrder | points | laps | time | milliseconds | fastestLap | rank | fastestLapTime | fastestLapSpeed | statusId | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 18 | 1 | 1 | 22 | 1 | 1 | 1 | 1 | 10.0 | 58 | 1:34:50.616 | 5690616 | 39 | 2 | 1:27.452 | 218.300 | 1 |
1 | 2 | 18 | 2 | 2 | 3 | 5 | 2 | 2 | 2 | 8.0 | 58 | +5.478 | 5696094 | 41 | 3 | 1:27.739 | 217.586 | 1 |
2 | 3 | 18 | 3 | 3 | 7 | 7 | 3 | 3 | 3 | 6.0 | 58 | +8.163 | 5698779 | 41 | 5 | 1:28.090 | 216.719 | 1 |
3 | 4 | 18 | 4 | 4 | 5 | 11 | 4 | 4 | 4 | 5.0 | 58 | +17.181 | 5707797 | 58 | 7 | 1:28.603 | 215.464 | 1 |
4 | 5 | 18 | 5 | 1 | 23 | 3 | 5 | 5 | 5 | 4.0 | 58 | +18.014 | 5708630 | 43 | 1 | 1:27.418 | 218.385 | 1 |
sample
resultId | raceId | driverId | constructorId | number | grid | position | positionText | positionOrder | points | laps | time | milliseconds | fastestLap | rank | fastestLapTime | fastestLapSpeed | statusId | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
5224 | 5225 | 260 | 77 | 6 | 28 | 3 | 3 | 3 | 3 | 4.0 | 78 | +1:16.824 | 6672124 | \N | \N | \N | \N | 1 |
7979 | 7980 | 358 | 160 | 45 | 18 | 0 | \N | F | 30 | 0.0 | 0 | \N | \N | \N | \N | \N | \N | 97 |
10260 | 10261 | 437 | 176 | 21 | 17 | 23 | 9 | 9 | 9 | 0.0 | 71 | \N | \N | \N | \N | \N | \N | 14 |
16589 | 16590 | 677 | 321 | 6 | 7 | 15 | \N | R | 19 | 0.0 | 14 | \N | \N | \N | \N | \N | \N | 5 |
23454 | 23459 | 973 | 1 | 131 | 44 | 1 | 1 | 1 | 1 | 25.0 | 66 | 1:35:56.497 | 5756497 | 64 | 1 | 1:23.593 | 200.471 | 1 |
dtypes
resultId int64
raceId int64
driverId int64
constructorId int64
number object
grid int64
position object
positionText object
positionOrder int64
points float64
laps int64
time object
milliseconds object
fastestLap object
rank object
fastestLapTime object
fastestLapSpeed object
statusId int64
dtype: object
columns
resultId
?raceId
?driverId
?len()
and the pd.nunique()
functions (click on their names to see the documentation)()
count 25840.000000
mean 1.877053
std 4.169849
min 0.000000
25% 0.000000
50% 0.000000
75% 2.000000
max 50.000000
Name: points, dtype: float64
.agg()
.agg()
subfunction computes aggregate statisticscolumn_name
, function_name
)# The functions in quotes are pandas functions. len is not in quotes because it is a Python function
# If you were to use 'len' (a string), pandas would look for a method named len in the DataFrame,
# which does not exist.
results_agg = results.agg(mean_points = ('points','mean'),
sd_points = ('points','std'),
min_points = ('points','min'),
max_points = ('points','max'),
count_obs = ('points', len))
display(results_agg.round(2)) # round to 2 decimal places
points | |
---|---|
mean_points | 1.88 |
sd_points | 4.17 |
min_points | 0.00 |
max_points | 50.00 |
count_obs | 25840.00 |
.groupby()
.groupby()
method groups the data by a column.agg()
method can be used to aggregate the data by groupdf.groupby('column_name').agg(...)
mean_points | sd_points | min_points | max_points | count_obs | |
---|---|---|---|---|---|
driverId | |||||
1 | 14.182258 | 9.224098 | 0.0 | 50.0 | 310 |
2 | 1.407609 | 2.372923 | 0.0 | 15.0 | 184 |
3 | 7.740291 | 8.672456 | 0.0 | 25.0 | 206 |
4 | 5.756983 | 6.330721 | 0.0 | 25.0 | 358 |
5 | 0.937500 | 1.969503 | 0.0 | 10.0 | 112 |
... | ... | ... | ... | ... | ... |
852 | 1.000000 | 2.477808 | 0.0 | 12.0 | 44 |
853 | 0.000000 | 0.000000 | 0.0 | 0.0 | 22 |
854 | 0.272727 | 1.335798 | 0.0 | 8.0 | 44 |
855 | 0.272727 | 0.882735 | 0.0 | 4.0 | 22 |
856 | 2.000000 | NaN | 2.0 | 2.0 | 1 |
855 rows × 5 columns
.groupby()
driverId
points
.groupby()
- multiple groups.groupby()
methoddf.groupby(['column1', 'column2']).agg(...)
constructorId
is the team identifiermean_points | sd_points | min_points | max_points | count_obs | ||
---|---|---|---|---|---|---|
raceId | constructorId | |||||
1 | 1 | 0.0 | 0.00 | 0.0 | 0.0 | 2 |
2 | 0.0 | 0.00 | 0.0 | 0.0 | 2 | |
3 | 1.5 | 2.12 | 0.0 | 3.0 | 2 | |
4 | 2.0 | 2.83 | 0.0 | 4.0 | 2 | |
5 | 1.5 | 0.71 | 1.0 | 2.0 | 2 | |
... | ... | ... | ... | ... | ... | ... |
1096 | 117 | 2.5 | 2.12 | 1.0 | 4.0 | 2 |
131 | 5.0 | 7.07 | 0.0 | 10.0 | 2 | |
210 | 0.0 | 0.00 | 0.0 | 0.0 | 2 | |
213 | 0.0 | 0.00 | 0.0 | 0.0 | 2 | |
214 | 3.0 | 4.24 | 0.0 | 6.0 | 2 |
12568 rows × 5 columns
.query()
.query()
method filters the data, as you know well by now.query()
with .groupby()
and .agg()
to filter, group, and aggregate data in one stepmean_points | sd_points | min_points | max_points | count_obs | ||
---|---|---|---|---|---|---|
raceId | constructorId | |||||
500 | 1 | 0.0 | 0.00 | 0.0 | 0.0 | 2 |
3 | 1.0 | 1.41 | 0.0 | 2.0 | 2 | |
4 | 4.5 | 6.36 | 0.0 | 9.0 | 2 | |
6 | 0.0 | 0.00 | 0.0 | 0.0 | 2 | |
21 | 0.5 | 0.71 | 0.0 | 1.0 | 2 | |
... | ... | ... | ... | ... | ... | ... |
1096 | 117 | 2.5 | 2.12 | 1.0 | 4.0 | 2 |
131 | 5.0 | 7.07 | 0.0 | 10.0 | 2 | |
210 | 0.0 | 0.00 | 0.0 | 0.0 | 2 | |
213 | 0.0 | 0.00 | 0.0 | 0.0 | 2 | |
214 | 3.0 | 4.24 | 0.0 | 6.0 | 2 |
6055 rows × 5 columns
constructorId
(the team) then compute the average number of points
.sort_values(...,ascending = False)
to sort by team points in descending order['function1', 'function2']
df.agg({'column1': 'function1', 'column2': 'function2'})
.merge()
method to merge the datadrivers_agg
data frame we created before and merge it with the original results
data framemean_points | sd_points | min_points | max_points | count_obs | |
---|---|---|---|---|---|
driverId | |||||
1 | 14.18 | 9.22 | 0.0 | 50.0 | 310 |
2 | 1.41 | 2.37 | 0.0 | 15.0 | 184 |
3 | 7.74 | 8.67 | 0.0 | 25.0 | 206 |
4 | 5.76 | 6.33 | 0.0 | 25.0 | 358 |
5 | 0.94 | 1.97 | 0.0 | 10.0 | 112 |
... | ... | ... | ... | ... | ... |
852 | 1.00 | 2.48 | 0.0 | 12.0 | 44 |
853 | 0.00 | 0.00 | 0.0 | 0.0 | 22 |
854 | 0.27 | 1.34 | 0.0 | 8.0 | 44 |
855 | 0.27 | 0.88 | 0.0 | 4.0 | 22 |
856 | 2.00 | NaN | 2.0 | 2.0 | 1 |
855 rows × 5 columns
results
data frameresultId | raceId | driverId | constructorId | number | grid | position | positionText | positionOrder | points | laps | time | milliseconds | fastestLap | rank | fastestLapTime | fastestLapSpeed | statusId | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 18 | 1 | 1 | 22 | 1 | 1 | 1 | 1 | 10.0 | 58 | 1:34:50.616 | 5690616 | 39 | 2 | 1:27.452 | 218.300 | 1 |
1 | 2 | 18 | 2 | 2 | 3 | 5 | 2 | 2 | 2 | 8.0 | 58 | +5.478 | 5696094 | 41 | 3 | 1:27.739 | 217.586 | 1 |
2 | 3 | 18 | 3 | 3 | 7 | 7 | 3 | 3 | 3 | 6.0 | 58 | +8.163 | 5698779 | 41 | 5 | 1:28.090 | 216.719 | 1 |
3 | 4 | 18 | 4 | 4 | 5 | 11 | 4 | 4 | 4 | 5.0 | 58 | +17.181 | 5707797 | 58 | 7 | 1:28.603 | 215.464 | 1 |
4 | 5 | 18 | 5 | 1 | 23 | 3 | 5 | 5 | 5 | 4.0 | 58 | +18.014 | 5708630 | 43 | 1 | 1:27.418 | 218.385 | 1 |
driverId
columnon='driverId'
? Because driverId
is the column that identifies the driver in both data frameshow = "left"
argument keeps all rows from the results
data frame and adds the statistics from drivers_agg
to itresultId | raceId | driverId | constructorId | number | grid | position | positionText | positionOrder | points | ... | fastestLap | rank | fastestLapTime | fastestLapSpeed | statusId | mean_points | sd_points | min_points | max_points | count_obs | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 18 | 1 | 1 | 22 | 1 | 1 | 1 | 1 | 10.0 | ... | 39 | 2 | 1:27.452 | 218.300 | 1 | 14.182258 | 9.224098 | 0.0 | 50.0 | 310 |
1 | 2 | 18 | 2 | 2 | 3 | 5 | 2 | 2 | 2 | 8.0 | ... | 41 | 3 | 1:27.739 | 217.586 | 1 | 1.407609 | 2.372923 | 0.0 | 15.0 | 184 |
2 | 3 | 18 | 3 | 3 | 7 | 7 | 3 | 3 | 3 | 6.0 | ... | 41 | 5 | 1:28.090 | 216.719 | 1 | 7.740291 | 8.672456 | 0.0 | 25.0 | 206 |
3 | 4 | 18 | 4 | 4 | 5 | 11 | 4 | 4 | 4 | 5.0 | ... | 58 | 7 | 1:28.603 | 215.464 | 1 | 5.756983 | 6.330721 | 0.0 | 25.0 | 358 |
4 | 5 | 18 | 5 | 1 | 23 | 3 | 5 | 5 | 5 | 4.0 | ... | 43 | 1 | 1:27.418 | 218.385 | 1 | 0.937500 | 1.969503 | 0.0 | 10.0 | 112 |
5 rows × 23 columns
mean_points
, sd_points
, min_points
, max_points
, and count_obs
were added to the data frameIndex(['resultId', 'raceId', 'driverId', 'constructorId', 'number', 'grid',
'position', 'positionText', 'positionOrder', 'points', 'laps', 'time',
'milliseconds', 'fastestLap', 'rank', 'fastestLapTime',
'fastestLapSpeed', 'statusId', 'mean_points', 'sd_points', 'min_points',
'max_points', 'count_obs'],
dtype='object')
points
(y-axis) vs mean_points
(x-axis)teamrace_agg
data into results
on = ["raceId","constructorId"]
value_counts()
to get the same informationresultId
1 1
17224 1
17234 1
17233 1
17232 1
..
8611 1
8610 1
8609 1
8608 1
25845 1
Name: count, Length: 25840, dtype: int64
raceId
800 55
809 47
371 39
359 39
366 39
..
657 14
668 14
827 13
660 13
765 10
Name: count, Length: 1079, dtype: int64
driverId
4 358
8 352
22 326
1 310
18 309
...
545 1
331 1
492 1
546 1
856 1
Name: count, Length: 855, dtype: int64
constructorId
and computing the average number of points
mean_points
in descending ordersort_values()
methodmean_points | |
---|---|
constructorId | |
131 | 12.36 |
9 | 9.12 |
23 | 5.06 |
208 | 4.58 |
6 | 4.32 |
... | ... |
76 | 0.00 |
77 | 0.00 |
78 | 0.00 |
130 | 0.00 |
112 | 0.00 |
210 rows × 1 columns
teamrace_agg
data into results
on = ["raceId","constructorId"]
raceId >= 500
), we will see some missing cases in the merged data frameresults_merge = pd.merge(results,
teamrace_agg,
on = ["raceId","constructorId"],
how = "left")
results_merge.round(2)
resultId | raceId | driverId | constructorId | number | grid | position | positionText | positionOrder | points | ... | fastestLap | rank | fastestLapTime | fastestLapSpeed | statusId | mean_points | sd_points | min_points | max_points | count_obs | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 18 | 1 | 1 | 22 | 1 | 1 | 1 | 1 | 10.0 | ... | 39 | 2 | 1:27.452 | 218.300 | 1 | NaN | NaN | NaN | NaN | NaN |
1 | 2 | 18 | 2 | 2 | 3 | 5 | 2 | 2 | 2 | 8.0 | ... | 41 | 3 | 1:27.739 | 217.586 | 1 | NaN | NaN | NaN | NaN | NaN |
2 | 3 | 18 | 3 | 3 | 7 | 7 | 3 | 3 | 3 | 6.0 | ... | 41 | 5 | 1:28.090 | 216.719 | 1 | NaN | NaN | NaN | NaN | NaN |
3 | 4 | 18 | 4 | 4 | 5 | 11 | 4 | 4 | 4 | 5.0 | ... | 58 | 7 | 1:28.603 | 215.464 | 1 | NaN | NaN | NaN | NaN | NaN |
4 | 5 | 18 | 5 | 1 | 23 | 3 | 5 | 5 | 5 | 4.0 | ... | 43 | 1 | 1:27.418 | 218.385 | 1 | NaN | NaN | NaN | NaN | NaN |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
25835 | 25841 | 1096 | 854 | 210 | 47 | 12 | 16 | 16 | 16 | 0.0 | ... | 39 | 12 | 1:29.833 | 211.632 | 11 | 0.0 | 0.00 | 0.0 | 0.0 | 2.0 |
25836 | 25842 | 1096 | 825 | 210 | 20 | 16 | 17 | 17 | 17 | 0.0 | ... | 40 | 20 | 1:31.158 | 208.556 | 11 | 0.0 | 0.00 | 0.0 | 0.0 | 2.0 |
25837 | 25843 | 1096 | 1 | 131 | 44 | 5 | 18 | 18 | 18 | 0.0 | ... | 42 | 11 | 1:29.788 | 211.738 | 9 | 5.0 | 7.07 | 0.0 | 10.0 | 2.0 |
25838 | 25844 | 1096 | 849 | 3 | 6 | 20 | 19 | 19 | 19 | 0.0 | ... | 45 | 14 | 1:30.309 | 210.517 | 130 | 0.0 | 0.00 | 0.0 | 0.0 | 2.0 |
25839 | 25845 | 1096 | 4 | 214 | 14 | 10 | \N | R | 20 | 0.0 | ... | 24 | 17 | 1:30.579 | 209.889 | 47 | 3.0 | 4.24 | 0.0 | 6.0 | 2.0 |
25840 rows × 23 columns
dropna()
methodresults_merge = pd.merge(results,
teamrace_agg,
on = ["raceId","constructorId"],
how = "left")
results_merge.round(2).dropna()
resultId | raceId | driverId | constructorId | number | grid | position | positionText | positionOrder | points | ... | fastestLap | rank | fastestLapTime | fastestLapSpeed | statusId | mean_points | sd_points | min_points | max_points | count_obs | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
12049 | 12050 | 500 | 163 | 4 | 16 | 2 | 1 | 1 | 1 | 9.0 | ... | \N | \N | \N | \N | 1 | 4.5 | 6.36 | 0.0 | 9.0 | 2.0 |
12050 | 12051 | 500 | 172 | 27 | 26 | 4 | 2 | 2 | 2 | 6.0 | ... | \N | \N | \N | \N | 1 | 5.0 | 1.41 | 4.0 | 6.0 | 2.0 |
12051 | 12052 | 500 | 202 | 27 | 25 | 5 | 3 | 3 | 3 | 4.0 | ... | \N | \N | \N | \N | 1 | 5.0 | 1.41 | 4.0 | 6.0 | 2.0 |
12052 | 12053 | 500 | 137 | 34 | 5 | 3 | 4 | 4 | 4 | 3.0 | ... | \N | \N | \N | \N | 1 | 1.5 | 2.12 | 0.0 | 3.0 | 2.0 |
12053 | 12054 | 500 | 199 | 3 | 28 | 6 | 5 | 5 | 5 | 2.0 | ... | \N | \N | \N | \N | 11 | 1.0 | 1.41 | 0.0 | 2.0 | 2.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
25835 | 25841 | 1096 | 854 | 210 | 47 | 12 | 16 | 16 | 16 | 0.0 | ... | 39 | 12 | 1:29.833 | 211.632 | 11 | 0.0 | 0.00 | 0.0 | 0.0 | 2.0 |
25836 | 25842 | 1096 | 825 | 210 | 20 | 16 | 17 | 17 | 17 | 0.0 | ... | 40 | 20 | 1:31.158 | 208.556 | 11 | 0.0 | 0.00 | 0.0 | 0.0 | 2.0 |
25837 | 25843 | 1096 | 1 | 131 | 44 | 5 | 18 | 18 | 18 | 0.0 | ... | 42 | 11 | 1:29.788 | 211.738 | 9 | 5.0 | 7.07 | 0.0 | 10.0 | 2.0 |
25838 | 25844 | 1096 | 849 | 3 | 6 | 20 | 19 | 19 | 19 | 0.0 | ... | 45 | 14 | 1:30.309 | 210.517 | 130 | 0.0 | 0.00 | 0.0 | 0.0 | 2.0 |
25839 | 25845 | 1096 | 4 | 214 | 14 | 10 | \N | R | 20 | 0.0 | ... | 24 | 17 | 1:30.579 | 209.889 | 47 | 3.0 | 4.24 | 0.0 | 6.0 | 2.0 |
12267 rows × 23 columns