Lecture 15 - Aggregating Data
R and JavaScript code, as well as write LaTeX and Markdown documents.agg() method.agg() with .groupby() to aggregate data by groups.query() together with those methods to filter and summarise data in one stepField, Type, Key, and DescriptionField is the name of the columnType is the data type of the column
int)varchar - “variable character”)float)Key is the primary key of the table, also know as the identifier
Description is a brief description of the column| resultId | raceId | driverId | constructorId | number | grid | position | positionText | positionOrder | points | laps | time | milliseconds | fastestLap | rank | fastestLapTime | fastestLapSpeed | statusId | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 18 | 1 | 1 | 22 | 1 | 1 | 1 | 1 | 10.0 | 58 | 1:34:50.616 | 5690616 | 39 | 2 | 1:27.452 | 218.300 | 1 |
| 1 | 2 | 18 | 2 | 2 | 3 | 5 | 2 | 2 | 2 | 8.0 | 58 | +5.478 | 5696094 | 41 | 3 | 1:27.739 | 217.586 | 1 |
| 2 | 3 | 18 | 3 | 3 | 7 | 7 | 3 | 3 | 3 | 6.0 | 58 | +8.163 | 5698779 | 41 | 5 | 1:28.090 | 216.719 | 1 |
| 3 | 4 | 18 | 4 | 4 | 5 | 11 | 4 | 4 | 4 | 5.0 | 58 | +17.181 | 5707797 | 58 | 7 | 1:28.603 | 215.464 | 1 |
| 4 | 5 | 18 | 5 | 1 | 23 | 3 | 5 | 5 | 5 | 4.0 | 58 | +18.014 | 5708630 | 43 | 1 | 1:27.418 | 218.385 | 1 |
sample| resultId | raceId | driverId | constructorId | number | grid | position | positionText | positionOrder | points | laps | time | milliseconds | fastestLap | rank | fastestLapTime | fastestLapSpeed | statusId | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 672 | 673 | 49 | 25 | 3 | 17 | 15 | \N | R | 19 | 0.0 | 34 | \N | \N | 18 | 18 | 1:51.270 | 226.605 | 32 |
| 9990 | 9991 | 426 | 94 | 18 | 29 | 25 | \N | R | 20 | 0.0 | 19 | \N | \N | \N | \N | \N | \N | 3 |
| 16701 | 16702 | 684 | 278 | 6 | 8 | 6 | 3 | 3 | 3 | 4.0 | 80 | +16.6 | 7182200 | \N | \N | \N | \N | 1 |
| 21452 | 21455 | 869 | 17 | 9 | 2 | 8 | 8 | 8 | 8 | 4.0 | 67 | +46.941 | 5512803 | 59 | 14 | 1:19.794 | 206.361 | 1 |
| 24384 | 24390 | 1019 | 830 | 9 | 33 | 4 | 5 | 5 | 5 | 10.0 | 52 | +39.458 | 4907910 | 45 | 4 | 1:29.272 | 237.561 | 1 |
dtypesresultId int64
raceId int64
driverId int64
constructorId int64
number object
grid int64
position object
positionText object
positionOrder int64
points float64
laps int64
time object
milliseconds object
fastestLap object
rank object
fastestLapTime object
fastestLapSpeed object
statusId int64
dtype: object
columnsresultId?raceId?driverId?len() and the pd.nunique() functions (click on their names to see the documentation)()count 25840.000000
mean 1.877053
std 4.169849
min 0.000000
25% 0.000000
50% 0.000000
75% 2.000000
max 50.000000
Name: points, dtype: float64
.agg().agg() subfunction computes aggregate statisticscolumn_name, function_name)# The functions in quotes are pandas functions. len is not in quotes because it is a Python function
# If you were to use 'len' (a string), pandas would look for a method named len in the DataFrame,
# which does not exist.
results_agg = results.agg(mean_points = ('points','mean'),
sd_points = ('points','std'),
min_points = ('points','min'),
max_points = ('points','max'),
count_obs = ('points', len))
display(results_agg.round(2)) # round to 2 decimal places| points | |
|---|---|
| mean_points | 1.88 |
| sd_points | 4.17 |
| min_points | 0.00 |
| max_points | 50.00 |
| count_obs | 25840.00 |
.groupby().groupby() method groups the data by a column.agg() method can be used to aggregate the data by groupdf.groupby('column_name').agg(...)| mean_points | sd_points | min_points | max_points | count_obs | |
|---|---|---|---|---|---|
| driverId | |||||
| 1 | 14.182258 | 9.224098 | 0.0 | 50.0 | 310 |
| 2 | 1.407609 | 2.372923 | 0.0 | 15.0 | 184 |
| 3 | 7.740291 | 8.672456 | 0.0 | 25.0 | 206 |
| 4 | 5.756983 | 6.330721 | 0.0 | 25.0 | 358 |
| 5 | 0.937500 | 1.969503 | 0.0 | 10.0 | 112 |
| ... | ... | ... | ... | ... | ... |
| 852 | 1.000000 | 2.477808 | 0.0 | 12.0 | 44 |
| 853 | 0.000000 | 0.000000 | 0.0 | 0.0 | 22 |
| 854 | 0.272727 | 1.335798 | 0.0 | 8.0 | 44 |
| 855 | 0.272727 | 0.882735 | 0.0 | 4.0 | 22 |
| 856 | 2.000000 | NaN | 2.0 | 2.0 | 1 |
855 rows × 5 columns
.groupby()driverIdpoints.groupby() - multiple groups.groupby() methoddf.groupby(['column1', 'column2']).agg(...)constructorId is the team identifier| mean_points | sd_points | min_points | max_points | count_obs | ||
|---|---|---|---|---|---|---|
| raceId | constructorId | |||||
| 1 | 1 | 0.0 | 0.00 | 0.0 | 0.0 | 2 |
| 2 | 0.0 | 0.00 | 0.0 | 0.0 | 2 | |
| 3 | 1.5 | 2.12 | 0.0 | 3.0 | 2 | |
| 4 | 2.0 | 2.83 | 0.0 | 4.0 | 2 | |
| 5 | 1.5 | 0.71 | 1.0 | 2.0 | 2 | |
| ... | ... | ... | ... | ... | ... | ... |
| 1096 | 117 | 2.5 | 2.12 | 1.0 | 4.0 | 2 |
| 131 | 5.0 | 7.07 | 0.0 | 10.0 | 2 | |
| 210 | 0.0 | 0.00 | 0.0 | 0.0 | 2 | |
| 213 | 0.0 | 0.00 | 0.0 | 0.0 | 2 | |
| 214 | 3.0 | 4.24 | 0.0 | 6.0 | 2 |
12568 rows × 5 columns
.query().query() method filters the data, as you know well by now.query() with .groupby() and .agg() to filter, group, and aggregate data in one step| mean_points | sd_points | min_points | max_points | count_obs | ||
|---|---|---|---|---|---|---|
| raceId | constructorId | |||||
| 500 | 1 | 0.0 | 0.00 | 0.0 | 0.0 | 2 |
| 3 | 1.0 | 1.41 | 0.0 | 2.0 | 2 | |
| 4 | 4.5 | 6.36 | 0.0 | 9.0 | 2 | |
| 6 | 0.0 | 0.00 | 0.0 | 0.0 | 2 | |
| 21 | 0.5 | 0.71 | 0.0 | 1.0 | 2 | |
| ... | ... | ... | ... | ... | ... | ... |
| 1096 | 117 | 2.5 | 2.12 | 1.0 | 4.0 | 2 |
| 131 | 5.0 | 7.07 | 0.0 | 10.0 | 2 | |
| 210 | 0.0 | 0.00 | 0.0 | 0.0 | 2 | |
| 213 | 0.0 | 0.00 | 0.0 | 0.0 | 2 | |
| 214 | 3.0 | 4.24 | 0.0 | 6.0 | 2 |
6055 rows × 5 columns
constructorId (the team) then compute the average number of points.sort_values(...,ascending = False) to sort by team points in descending order['function1', 'function2']df.agg({'column1': 'function1', 'column2': 'function2'}).merge() method to merge the datadrivers_agg data frame we created before and merge it with the original results data frame| mean_points | sd_points | min_points | max_points | count_obs | |
|---|---|---|---|---|---|
| driverId | |||||
| 1 | 14.18 | 9.22 | 0.0 | 50.0 | 310 |
| 2 | 1.41 | 2.37 | 0.0 | 15.0 | 184 |
| 3 | 7.74 | 8.67 | 0.0 | 25.0 | 206 |
| 4 | 5.76 | 6.33 | 0.0 | 25.0 | 358 |
| 5 | 0.94 | 1.97 | 0.0 | 10.0 | 112 |
| ... | ... | ... | ... | ... | ... |
| 852 | 1.00 | 2.48 | 0.0 | 12.0 | 44 |
| 853 | 0.00 | 0.00 | 0.0 | 0.0 | 22 |
| 854 | 0.27 | 1.34 | 0.0 | 8.0 | 44 |
| 855 | 0.27 | 0.88 | 0.0 | 4.0 | 22 |
| 856 | 2.00 | NaN | 2.0 | 2.0 | 1 |
855 rows × 5 columns
results data frame| resultId | raceId | driverId | constructorId | number | grid | position | positionText | positionOrder | points | laps | time | milliseconds | fastestLap | rank | fastestLapTime | fastestLapSpeed | statusId | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 18 | 1 | 1 | 22 | 1 | 1 | 1 | 1 | 10.0 | 58 | 1:34:50.616 | 5690616 | 39 | 2 | 1:27.452 | 218.300 | 1 |
| 1 | 2 | 18 | 2 | 2 | 3 | 5 | 2 | 2 | 2 | 8.0 | 58 | +5.478 | 5696094 | 41 | 3 | 1:27.739 | 217.586 | 1 |
| 2 | 3 | 18 | 3 | 3 | 7 | 7 | 3 | 3 | 3 | 6.0 | 58 | +8.163 | 5698779 | 41 | 5 | 1:28.090 | 216.719 | 1 |
| 3 | 4 | 18 | 4 | 4 | 5 | 11 | 4 | 4 | 4 | 5.0 | 58 | +17.181 | 5707797 | 58 | 7 | 1:28.603 | 215.464 | 1 |
| 4 | 5 | 18 | 5 | 1 | 23 | 3 | 5 | 5 | 5 | 4.0 | 58 | +18.014 | 5708630 | 43 | 1 | 1:27.418 | 218.385 | 1 |
driverId columnon='driverId'? Because driverId is the column that identifies the driver in both data frameshow = "left" argument keeps all rows from the results data frame and adds the statistics from drivers_agg to it| resultId | raceId | driverId | constructorId | number | grid | position | positionText | positionOrder | points | ... | fastestLap | rank | fastestLapTime | fastestLapSpeed | statusId | mean_points | sd_points | min_points | max_points | count_obs | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 18 | 1 | 1 | 22 | 1 | 1 | 1 | 1 | 10.0 | ... | 39 | 2 | 1:27.452 | 218.300 | 1 | 14.182258 | 9.224098 | 0.0 | 50.0 | 310 |
| 1 | 2 | 18 | 2 | 2 | 3 | 5 | 2 | 2 | 2 | 8.0 | ... | 41 | 3 | 1:27.739 | 217.586 | 1 | 1.407609 | 2.372923 | 0.0 | 15.0 | 184 |
| 2 | 3 | 18 | 3 | 3 | 7 | 7 | 3 | 3 | 3 | 6.0 | ... | 41 | 5 | 1:28.090 | 216.719 | 1 | 7.740291 | 8.672456 | 0.0 | 25.0 | 206 |
| 3 | 4 | 18 | 4 | 4 | 5 | 11 | 4 | 4 | 4 | 5.0 | ... | 58 | 7 | 1:28.603 | 215.464 | 1 | 5.756983 | 6.330721 | 0.0 | 25.0 | 358 |
| 4 | 5 | 18 | 5 | 1 | 23 | 3 | 5 | 5 | 5 | 4.0 | ... | 43 | 1 | 1:27.418 | 218.385 | 1 | 0.937500 | 1.969503 | 0.0 | 10.0 | 112 |
5 rows × 23 columns
mean_points, sd_points, min_points, max_points, and count_obs were added to the data frameIndex(['resultId', 'raceId', 'driverId', 'constructorId', 'number', 'grid',
'position', 'positionText', 'positionOrder', 'points', 'laps', 'time',
'milliseconds', 'fastestLap', 'rank', 'fastestLapTime',
'fastestLapSpeed', 'statusId', 'mean_points', 'sd_points', 'min_points',
'max_points', 'count_obs'],
dtype='object')
points (y-axis) vs mean_points (x-axis)teamrace_agg data into resultson = ["raceId","constructorId"]value_counts() to get the same informationresultId
1 1
17224 1
17234 1
17233 1
17232 1
..
8611 1
8610 1
8609 1
8608 1
25845 1
Name: count, Length: 25840, dtype: int64
raceId
800 55
809 47
371 39
359 39
366 39
..
657 14
668 14
827 13
660 13
765 10
Name: count, Length: 1079, dtype: int64
driverId
4 358
8 352
22 326
1 310
18 309
...
545 1
331 1
492 1
546 1
856 1
Name: count, Length: 855, dtype: int64
constructorId and computing the average number of pointsmean_points in descending ordersort_values() method| mean_points | |
|---|---|
| constructorId | |
| 131 | 12.36 |
| 9 | 9.12 |
| 23 | 5.06 |
| 208 | 4.58 |
| 6 | 4.32 |
| ... | ... |
| 76 | 0.00 |
| 77 | 0.00 |
| 78 | 0.00 |
| 130 | 0.00 |
| 112 | 0.00 |
210 rows × 1 columns
teamrace_agg data into resultson = ["raceId","constructorId"]raceId >= 500), we will see some missing cases in the merged data frameresults_merge = pd.merge(results,
teamrace_agg,
on = ["raceId","constructorId"],
how = "left")
results_merge.round(2)| resultId | raceId | driverId | constructorId | number | grid | position | positionText | positionOrder | points | ... | fastestLap | rank | fastestLapTime | fastestLapSpeed | statusId | mean_points | sd_points | min_points | max_points | count_obs | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 18 | 1 | 1 | 22 | 1 | 1 | 1 | 1 | 10.0 | ... | 39 | 2 | 1:27.452 | 218.300 | 1 | NaN | NaN | NaN | NaN | NaN |
| 1 | 2 | 18 | 2 | 2 | 3 | 5 | 2 | 2 | 2 | 8.0 | ... | 41 | 3 | 1:27.739 | 217.586 | 1 | NaN | NaN | NaN | NaN | NaN |
| 2 | 3 | 18 | 3 | 3 | 7 | 7 | 3 | 3 | 3 | 6.0 | ... | 41 | 5 | 1:28.090 | 216.719 | 1 | NaN | NaN | NaN | NaN | NaN |
| 3 | 4 | 18 | 4 | 4 | 5 | 11 | 4 | 4 | 4 | 5.0 | ... | 58 | 7 | 1:28.603 | 215.464 | 1 | NaN | NaN | NaN | NaN | NaN |
| 4 | 5 | 18 | 5 | 1 | 23 | 3 | 5 | 5 | 5 | 4.0 | ... | 43 | 1 | 1:27.418 | 218.385 | 1 | NaN | NaN | NaN | NaN | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 25835 | 25841 | 1096 | 854 | 210 | 47 | 12 | 16 | 16 | 16 | 0.0 | ... | 39 | 12 | 1:29.833 | 211.632 | 11 | 0.0 | 0.00 | 0.0 | 0.0 | 2.0 |
| 25836 | 25842 | 1096 | 825 | 210 | 20 | 16 | 17 | 17 | 17 | 0.0 | ... | 40 | 20 | 1:31.158 | 208.556 | 11 | 0.0 | 0.00 | 0.0 | 0.0 | 2.0 |
| 25837 | 25843 | 1096 | 1 | 131 | 44 | 5 | 18 | 18 | 18 | 0.0 | ... | 42 | 11 | 1:29.788 | 211.738 | 9 | 5.0 | 7.07 | 0.0 | 10.0 | 2.0 |
| 25838 | 25844 | 1096 | 849 | 3 | 6 | 20 | 19 | 19 | 19 | 0.0 | ... | 45 | 14 | 1:30.309 | 210.517 | 130 | 0.0 | 0.00 | 0.0 | 0.0 | 2.0 |
| 25839 | 25845 | 1096 | 4 | 214 | 14 | 10 | \N | R | 20 | 0.0 | ... | 24 | 17 | 1:30.579 | 209.889 | 47 | 3.0 | 4.24 | 0.0 | 6.0 | 2.0 |
25840 rows × 23 columns
dropna() methodresults_merge = pd.merge(results,
teamrace_agg,
on = ["raceId","constructorId"],
how = "left")
results_merge.round(2).dropna()| resultId | raceId | driverId | constructorId | number | grid | position | positionText | positionOrder | points | ... | fastestLap | rank | fastestLapTime | fastestLapSpeed | statusId | mean_points | sd_points | min_points | max_points | count_obs | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 12049 | 12050 | 500 | 163 | 4 | 16 | 2 | 1 | 1 | 1 | 9.0 | ... | \N | \N | \N | \N | 1 | 4.5 | 6.36 | 0.0 | 9.0 | 2.0 |
| 12050 | 12051 | 500 | 172 | 27 | 26 | 4 | 2 | 2 | 2 | 6.0 | ... | \N | \N | \N | \N | 1 | 5.0 | 1.41 | 4.0 | 6.0 | 2.0 |
| 12051 | 12052 | 500 | 202 | 27 | 25 | 5 | 3 | 3 | 3 | 4.0 | ... | \N | \N | \N | \N | 1 | 5.0 | 1.41 | 4.0 | 6.0 | 2.0 |
| 12052 | 12053 | 500 | 137 | 34 | 5 | 3 | 4 | 4 | 4 | 3.0 | ... | \N | \N | \N | \N | 1 | 1.5 | 2.12 | 0.0 | 3.0 | 2.0 |
| 12053 | 12054 | 500 | 199 | 3 | 28 | 6 | 5 | 5 | 5 | 2.0 | ... | \N | \N | \N | \N | 11 | 1.0 | 1.41 | 0.0 | 2.0 | 2.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 25835 | 25841 | 1096 | 854 | 210 | 47 | 12 | 16 | 16 | 16 | 0.0 | ... | 39 | 12 | 1:29.833 | 211.632 | 11 | 0.0 | 0.00 | 0.0 | 0.0 | 2.0 |
| 25836 | 25842 | 1096 | 825 | 210 | 20 | 16 | 17 | 17 | 17 | 0.0 | ... | 40 | 20 | 1:31.158 | 208.556 | 11 | 0.0 | 0.00 | 0.0 | 0.0 | 2.0 |
| 25837 | 25843 | 1096 | 1 | 131 | 44 | 5 | 18 | 18 | 18 | 0.0 | ... | 42 | 11 | 1:29.788 | 211.738 | 9 | 5.0 | 7.07 | 0.0 | 10.0 | 2.0 |
| 25838 | 25844 | 1096 | 849 | 3 | 6 | 20 | 19 | 19 | 19 | 0.0 | ... | 45 | 14 | 1:30.309 | 210.517 | 130 | 0.0 | 0.00 | 0.0 | 0.0 | 2.0 |
| 25839 | 25845 | 1096 | 4 | 214 | 14 | 10 | \N | R | 20 | 0.0 | ... | 24 | 17 | 1:30.579 | 209.889 | 47 | 3.0 | 4.24 | 0.0 | 6.0 | 2.0 |
12267 rows × 23 columns