QTM 151 - Intro to Stats Computing II

Lecture 15 - Aggregating Data

Danilo Freire

Emory University

23 October, 2024

Hello, everyone! 👋
Nice to see you all again 😁

Two quick things before we start 😉

Some comments on quiz 02

  • The quiz has been graded and the grades are in Canvas
  • You guys did really well! Congratulations! 🎉
  • What do you think about the quiz? Was it too easy? Too hard? Just right?
  • If you have any questions about the quiz, please let us know
  • Please do not forget that the assignment is due today 🤓

Jupyter Notebooks online 🌐

  • The online notebook already comes with all the packages we need for this class, such as NumPy, Pandas, Matplotlib, and Seaborn
  • You can install many other packages too! 📦
  • Not all Python packages work, but many do. Install them with
%pip install package-name
  • You can also use it to run R and JavaScript code, as well as write LaTeX and Markdown documents
  • Please download your files with the right-click menu before closing the browser!
  • Let me know if you find any bugs! 🐞

Today’s plan 📅

Aggregating data

Today we will…

  • Learn about different ways to gather dataset characteristics
  • Learn how to aggregate data in Pandas using the .agg() method
  • Combine .agg() with .groupby() to aggregate data by groups
  • Use .query() together with those methods to filter and summarise data in one step

Import libraries and load data

  • As usual, let’s start by importing the libraries we need
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
  • And load the dataset we will use today
results = pd.read_csv("data_raw/results.csv")

  • The dataset’s codebook contains four main pieces of information: Field, Type, Key, and Description
  • Field is the name of the column
  • Type is the data type of the column
    • integer (int)
    • string (varchar - “variable character”)
    • float (float)
    • The number in parenthesis is the maximum number of characters/digits
  • Key is the primary key of the table, also known as the identifier
    • It is unique for each row and can be used to join tables (more on that later; see the quick check after this list)
  • Description is a brief description of the column
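  • As a quick check (a minimal sketch, assuming the results data frame loaded above), we can confirm that resultId behaves like a primary key, i.e., that it is unique for each row
# resultId should have no duplicates if it is the primary key
results["resultId"].is_unique           # True when every value appears exactly once
results["resultId"].duplicated().sum()  # expected to be 0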

Display the first few rows

results.head()
resultId raceId driverId constructorId number grid position positionText positionOrder points laps time milliseconds fastestLap rank fastestLapTime fastestLapSpeed statusId
0 1 18 1 1 22 1 1 1 1 10.0 58 1:34:50.616 5690616 39 2 1:27.452 218.300 1
1 2 18 2 2 3 5 2 2 2 8.0 58 +5.478 5696094 41 3 1:27.739 217.586 1
2 3 18 3 3 7 7 3 3 3 6.0 58 +8.163 5698779 41 5 1:28.090 216.719 1
3 4 18 4 4 5 11 4 4 4 5.0 58 +17.181 5707797 58 7 1:28.603 215.464 1
4 5 18 5 1 23 3 5 5 5 4.0 58 +18.014 5708630 43 1 1:27.418 218.385 1
  • See a random sample of the data using sample
results.sample(5).sort_values("resultId")
resultId raceId driverId constructorId number grid position positionText positionOrder points laps time milliseconds fastestLap rank fastestLapTime fastestLapSpeed statusId
5224 5225 260 77 6 28 3 3 3 3 4.0 78 +1:16.824 6672124 \N \N \N \N 1
7979 7980 358 160 45 18 0 \N F 30 0.0 0 \N \N \N \N \N \N 97
10260 10261 437 176 21 17 23 9 9 9 0.0 71 \N \N \N \N \N \N 14
16589 16590 677 321 6 7 15 \N R 19 0.0 14 \N \N \N \N \N \N 5
23454 23459 973 1 131 44 1 1 1 1 25.0 66 1:35:56.497 5756497 64 1 1:23.593 200.471 1

Get the column names and types

  • We can get the column names and types using dtypes
results.dtypes
resultId             int64
raceId               int64
driverId             int64
constructorId        int64
number              object
grid                 int64
position            object
positionText        object
positionOrder        int64
points             float64
laps                 int64
time                object
milliseconds        object
fastestLap          object
rank                object
fastestLapTime      object
fastestLapSpeed     object
statusId             int64
dtype: object
  • We can also get the column names using columns
results.columns
Index(['resultId', 'raceId', 'driverId', 'constructorId', 'number', 'grid',
       'position', 'positionText', 'positionOrder', 'points', 'laps', 'time',
       'milliseconds', 'fastestLap', 'rank', 'fastestLapTime',
       'fastestLapSpeed', 'statusId'],
      dtype='object')

Try it yourself! 🤓

  • How many rows does the dataset have?
  • How many unique values are there for the columns:
    • resultId?
    • raceId?
    • driverId?
  • Hint: Use the len() function and the .nunique() method (click on their names to see the documentation)
  • Appendix 01

Group by and aggregate 📊

Multi-line code

  • You can split your code into multiple lines to make it easier to read
  • You just need to wrap the code in parentheses ()
descriptives_multiline = (results["points"]
                          .describe())
descriptives_multiline
count    25840.000000
mean         1.877053
std          4.169849
min          0.000000
25%          0.000000
50%          0.000000
75%          2.000000
max         50.000000
Name: points, dtype: float64
  • This is the same as writing
descriptives_singleline = results["points"].describe()
descriptives_singleline
count    25840.000000
mean         1.877053
std          4.169849
min          0.000000
25%          0.000000
50%          0.000000
75%          2.000000
max         50.000000
Name: points, dtype: float64

Aggregate data using .agg()

  • The .agg() method computes aggregate statistics
  • Each new column is defined by a tuple: new_name = (column_name, function_name)
  • The first element is the name of the column to aggregate
  • The second element is the function to apply to it
  • The function name works with single quotes ‘…’ or double quotes “…”
# The functions in quotes are pandas function names; len is not in quotes because it is
# the built-in Python function. If you passed the string 'len' instead, pandas would look
# for a method named len on the column, which does not exist.
results_agg = results.agg(mean_points = ('points','mean'),
                          sd_points =   ('points','std'),
                          min_points =  ('points','min'),
                          max_points =  ('points','max'),
                          count_obs   = ('points', len))

display(results_agg.round(2)) # round to 2 decimal places
points
mean_points 1.88
sd_points 4.17
min_points 0.00
max_points 50.00
count_obs 25840.00

Group and aggregate data using .groupby()

  • The .groupby() method groups the data by a column
  • The .agg() method can be used to aggregate the data by group
  • The syntax is df.groupby('column_name').agg(...)
  • Let’s see one example
drivers_agg = (results.groupby("driverId")
                      .agg(mean_points = ('points','mean'),
                           sd_points =   ('points','std'),
                           min_points =  ('points','min'),
                           max_points =  ('points','max'),
                           count_obs   = ('points',len)))
drivers_agg
mean_points sd_points min_points max_points count_obs
driverId
1 14.182258 9.224098 0.0 50.0 310
2 1.407609 2.372923 0.0 15.0 184
3 7.740291 8.672456 0.0 25.0 206
4 5.756983 6.330721 0.0 25.0 358
5 0.937500 1.969503 0.0 10.0 112
... ... ... ... ... ...
852 1.000000 2.477808 0.0 12.0 44
853 0.000000 0.000000 0.0 0.0 22
854 0.272727 1.335798 0.0 8.0 44
855 0.272727 0.882735 0.0 4.0 22
856 2.000000 NaN 2.0 2.0 1

855 rows × 5 columns
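  • Note that driverId is now the index of drivers_agg, not a regular column
  • Two handy follow-ups (a small sketch, not from the original slides): look up a single driver with .loc, or turn the index back into a column with .reset_index()
drivers_agg.loc[1]          # statistics for driverId 1
drivers_agg.reset_index()   # driverId becomes a regular column again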

Group and aggregate data using .groupby()

  • Split-apply-combine strategy again
  • Split: Split the dataset into groups, here by driverId
  • Apply: Apply a function to each group, here the mean, standard deviation, minimum, maximum, and count of points
  • Combine: Combine the results into a single data frame
  • The result is a data frame with the statistics for each driver, as sketched below
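  • A minimal sketch of the three steps done “by hand” for a single group (driverId 1 is used only as an illustration)
grouped = results.groupby("driverId")   # split: one group per driver
one_driver = grouped.get_group(1)       # inspect a single group
one_driver["points"].mean()             # apply: compute the statistic for that group
# .agg() applies the functions to every group and combines the results into one data frame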

Group and aggregate data using .groupby() - multiple groups

  • The syntax we used before works for one group, but it is easy to extend it to multiple groups
  • You can pass a list of columns to the .groupby() method
  • df.groupby(['column1', 'column2']).agg(...)
  • constructorId is the team identifier
teamrace_agg = (results.groupby(["raceId","constructorId"])
                       .agg(mean_points = ('points','mean'),
                            sd_points =   ('points','std'),
                            min_points =  ('points','min'),
                            max_points =  ('points','max'),
                            count_obs   = ('points',len)))
teamrace_agg.round(2)
mean_points sd_points min_points max_points count_obs
raceId constructorId
1 1 0.0 0.00 0.0 0.0 2
2 0.0 0.00 0.0 0.0 2
3 1.5 2.12 0.0 3.0 2
4 2.0 2.83 0.0 4.0 2
5 1.5 0.71 1.0 2.0 2
... ... ... ... ... ... ...
1096 117 2.5 2.12 1.0 4.0 2
131 5.0 7.07 0.0 10.0 2
210 0.0 0.00 0.0 0.0 2
213 0.0 0.00 0.0 0.0 2
214 3.0 4.24 0.0 6.0 2

12568 rows × 5 columns
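  • With two grouping columns, the result has a MultiIndex (raceId, constructorId)
  • A minimal sketch of how to look up one group; the pair (18, 1) appears in the head() output earlier and refers to the unfiltered teamrace_agg defined just above
teamrace_agg.loc[(18, 1)]   # statistics for constructorId 1 in raceId 18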

Filtering +
Grouping +
Aggregating 🤓

Filtering data using .query()

  • The .query() method filters the data, as you know well by now
  • We can combine .query() with .groupby() and .agg() to filter, group, and aggregate data in one step
teamrace_agg = (results.query("raceId >= 500")
                       .groupby(["raceId","constructorId"])
                        .agg(mean_points = ('points','mean'),
                             sd_points =   ('points','std'),
                             min_points =  ('points','min'),
                             max_points =  ('points','max'),
                             count_obs   = ('points',len)))
teamrace_agg.round(2)
mean_points sd_points min_points max_points count_obs
raceId constructorId
500 1 0.0 0.00 0.0 0.0 2
3 1.0 1.41 0.0 2.0 2
4 4.5 6.36 0.0 9.0 2
6 0.0 0.00 0.0 0.0 2
21 0.5 0.71 0.0 1.0 2
... ... ... ... ... ... ...
1096 117 2.5 2.12 1.0 4.0 2
131 5.0 7.07 0.0 10.0 2
210 0.0 0.00 0.0 0.0 2
213 0.0 0.00 0.0 0.0 2
214 3.0 4.24 0.0 6.0 2

6055 rows × 5 columns

Try it yourself! 🤓

  • Create a new dataset by grouping by constructorId (the team) and computing the average number of points
  • Then chain .sort_values(..., ascending = False) to sort teams by average points in descending order
  • Appendix 02

Different functions for different columns

  • You can use different functions for different columns too!
  • You specify the column name and the function for each column in a dictionary
  • If you want more than one function for a column, you can use a list, e.g. ['function1', 'function2']
  • The syntax is df.agg({'column1': 'function1', 'column2': 'function2'})
  • Statistics that were not requested for a column appear as NaN in the combined output
results.agg({'laps': ['min', 'max', len]}).round(2) # one variable
laps
min 0
max 200
len 25840
results.agg({'points': 'mean',
             'positionOrder': 'max',
             'laps': ['min', 'max', len]}).round(2) # multiple variables
points positionOrder laps
mean 1.88 NaN NaN
max NaN 39.0 200.0
min NaN NaN 0.0
len NaN NaN 25840.0

Relative statistics within group 📊

Relative statistics within group

  • We can compute relative statistics within groups and add them to the original data frame
  • When we combine two data frames using a common column (a key), we call this merging the data
  • We can use the pd.merge() function to merge the data

  • Let’s see an example. We’ll use the drivers_agg data frame we created before and merge it with the original results data frame
drivers_agg.round(2)
mean_points sd_points min_points max_points count_obs
driverId
1 14.18 9.22 0.0 50.0 310
2 1.41 2.37 0.0 15.0 184
3 7.74 8.67 0.0 25.0 206
4 5.76 6.33 0.0 25.0 358
5 0.94 1.97 0.0 10.0 112
... ... ... ... ... ...
852 1.00 2.48 0.0 12.0 44
853 0.00 0.00 0.0 0.0 22
854 0.27 1.34 0.0 8.0 44
855 0.27 0.88 0.0 4.0 22
856 2.00 NaN 2.0 2.0 1

855 rows × 5 columns

Relative statistics within group

  • And here is the results data frame
results.head()
resultId raceId driverId constructorId number grid position positionText positionOrder points laps time milliseconds fastestLap rank fastestLapTime fastestLapSpeed statusId
0 1 18 1 1 22 1 1 1 1 10.0 58 1:34:50.616 5690616 39 2 1:27.452 218.300 1
1 2 18 2 2 3 5 2 2 2 8.0 58 +5.478 5696094 41 3 1:27.739 217.586 1
2 3 18 3 3 7 7 3 3 3 6.0 58 +8.163 5698779 41 5 1:28.090 216.719 1
3 4 18 4 4 5 11 4 4 4 5.0 58 +17.181 5707797 58 7 1:28.603 215.464 1
4 5 18 5 1 23 3 5 5 5 4.0 58 +18.014 5708630 43 1 1:27.418 218.385 1
  • Now we can merge the two data frames using the driverId column
  • Why do we use on='driverId'? Because driverId identifies the driver in both data frames (it is a column in results and the index of drivers_agg)
results_merge = pd.merge(results,
                         drivers_agg,
                         on = "driverId",
                         how = "left")
  • The how = "left" argument keeps all rows from the results data frame and adds the statistics from drivers_agg to it
  • The result is a new data frame with the statistics for each driver

Relative statistics within group

  • Let’s see the first few rows of the merged data frame
results_merge.head()
resultId raceId driverId constructorId number grid position positionText positionOrder points ... fastestLap rank fastestLapTime fastestLapSpeed statusId mean_points sd_points min_points max_points count_obs
0 1 18 1 1 22 1 1 1 1 10.0 ... 39 2 1:27.452 218.300 1 14.182258 9.224098 0.0 50.0 310
1 2 18 2 2 3 5 2 2 2 8.0 ... 41 3 1:27.739 217.586 1 1.407609 2.372923 0.0 15.0 184
2 3 18 3 3 7 7 3 3 3 6.0 ... 41 5 1:28.090 216.719 1 7.740291 8.672456 0.0 25.0 206
3 4 18 4 4 5 11 4 4 4 5.0 ... 58 7 1:28.603 215.464 1 5.756983 6.330721 0.0 25.0 358
4 5 18 5 1 23 3 5 5 5 4.0 ... 43 1 1:27.418 218.385 1 0.937500 1.969503 0.0 10.0 112

5 rows × 23 columns

  • We can see that the new columns mean_points, sd_points, min_points, max_points, and count_obs were added to the data frame
results_merge.columns
Index(['resultId', 'raceId', 'driverId', 'constructorId', 'number', 'grid',
       'position', 'positionText', 'positionOrder', 'points', 'laps', 'time',
       'milliseconds', 'fastestLap', 'rank', 'fastestLapTime',
       'fastestLapSpeed', 'statusId', 'mean_points', 'sd_points', 'min_points',
       'max_points', 'count_obs'],
      dtype='object')
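  • With the driver-level statistics merged in, we can compute a relative statistic within each group
  • A minimal sketch (the column name points_above_avg is just for illustration)
# How far each race result is above (or below) the driver's own average
results_merge["points_above_avg"] = (results_merge["points"]
                                     - results_merge["mean_points"])
results_merge[["driverId", "points", "mean_points", "points_above_avg"]].head()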

Try it yourself! 🤓

  • Create a scatter plot of points (y-axis) against mean_points (x-axis)
  • Note: This plot shows how much a driver’s performance in individual races deviates from their overall average
  • Appendix 03

Try it yourself! 🤓

  • Merge the teamrace_agg data into results
  • This time use the option: on = ["raceId","constructorId"]
  • Appendix 04

Appendix 01

  • Let’s start by counting the number of rows in the dataset
len(results)
25840
  • Now let’s count the number of unique values for the columns resultId, raceId, and driverId
results.resultId.nunique()
25840
results.raceId.nunique()
1079
results.driverId.nunique()
855

Appendix 01

  • You could also use value_counts() to get the same information
results.resultId.value_counts()
resultId
1        1
17224    1
17234    1
17233    1
17232    1
        ..
8611     1
8610     1
8609     1
8608     1
25845    1
Name: count, Length: 25840, dtype: int64
results.raceId.value_counts()
raceId
800    55
809    47
371    39
359    39
366    39
       ..
657    14
668    14
827    13
660    13
765    10
Name: count, Length: 1079, dtype: int64
results.driverId.value_counts()
driverId
4      358
8      352
22     326
1      310
18     309
      ... 
545      1
331      1
492      1
546      1
856      1
Name: count, Length: 855, dtype: int64
  • This function returns the number of times each unique value appears in the column

Back to exercise

Appendix 02

  • Let’s start by grouping the data by constructorId and computing the average number of points
team_agg = (results.groupby("constructorId")
                   .agg(mean_points = ('points','mean')))
  • Now let’s sort the data by mean_points in descending order
  • We can use the sort_values() method
team_agg.sort_values("mean_points", ascending = False).round(2)
mean_points
constructorId
131 12.36
9 9.12
23 5.06
208 4.58
6 4.32
... ...
76 0.00
77 0.00
78 0.00
130 0.00
112 0.00

210 rows × 1 columns

Back to exercise

Appendix 03

plt.figure(figsize=(10,6))
plt.scatter(results_merge["mean_points"], results_merge["points"])
plt.xlabel("Mean Points")
plt.ylabel("Points")
plt.title("Points vs Mean Points")
plt.show()

  • This plot shows how much a driver’s performance in individual races deviates from their overall average

Back to exercise

Appendix 04

  • Let’s merge the teamrace_agg data into results
  • This time use the option: on = ["raceId","constructorId"]
  • Since we filtered the data before (raceId >= 500), we will see some missing cases in the merged data frame
results_merge = pd.merge(results,
                         teamrace_agg,
                         on = ["raceId","constructorId"],
                         how = "left")
results_merge.round(2)
resultId raceId driverId constructorId number grid position positionText positionOrder points ... fastestLap rank fastestLapTime fastestLapSpeed statusId mean_points sd_points min_points max_points count_obs
0 1 18 1 1 22 1 1 1 1 10.0 ... 39 2 1:27.452 218.300 1 NaN NaN NaN NaN NaN
1 2 18 2 2 3 5 2 2 2 8.0 ... 41 3 1:27.739 217.586 1 NaN NaN NaN NaN NaN
2 3 18 3 3 7 7 3 3 3 6.0 ... 41 5 1:28.090 216.719 1 NaN NaN NaN NaN NaN
3 4 18 4 4 5 11 4 4 4 5.0 ... 58 7 1:28.603 215.464 1 NaN NaN NaN NaN NaN
4 5 18 5 1 23 3 5 5 5 4.0 ... 43 1 1:27.418 218.385 1 NaN NaN NaN NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
25835 25841 1096 854 210 47 12 16 16 16 0.0 ... 39 12 1:29.833 211.632 11 0.0 0.00 0.0 0.0 2.0
25836 25842 1096 825 210 20 16 17 17 17 0.0 ... 40 20 1:31.158 208.556 11 0.0 0.00 0.0 0.0 2.0
25837 25843 1096 1 131 44 5 18 18 18 0.0 ... 42 11 1:29.788 211.738 9 5.0 7.07 0.0 10.0 2.0
25838 25844 1096 849 3 6 20 19 19 19 0.0 ... 45 14 1:30.309 210.517 130 0.0 0.00 0.0 0.0 2.0
25839 25845 1096 4 214 14 10 \N R 20 0.0 ... 24 17 1:30.579 209.889 47 3.0 4.24 0.0 6.0 2.0

25840 rows × 23 columns

Appendix 04

  • We can remove the missing cases using the dropna() method
results_merge = pd.merge(results,
                         teamrace_agg,
                         on = ["raceId","constructorId"],
                         how = "left")
results_merge.round(2).dropna()
resultId raceId driverId constructorId number grid position positionText positionOrder points ... fastestLap rank fastestLapTime fastestLapSpeed statusId mean_points sd_points min_points max_points count_obs
12049 12050 500 163 4 16 2 1 1 1 9.0 ... \N \N \N \N 1 4.5 6.36 0.0 9.0 2.0
12050 12051 500 172 27 26 4 2 2 2 6.0 ... \N \N \N \N 1 5.0 1.41 4.0 6.0 2.0
12051 12052 500 202 27 25 5 3 3 3 4.0 ... \N \N \N \N 1 5.0 1.41 4.0 6.0 2.0
12052 12053 500 137 34 5 3 4 4 4 3.0 ... \N \N \N \N 1 1.5 2.12 0.0 3.0 2.0
12053 12054 500 199 3 28 6 5 5 5 2.0 ... \N \N \N \N 11 1.0 1.41 0.0 2.0 2.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
25835 25841 1096 854 210 47 12 16 16 16 0.0 ... 39 12 1:29.833 211.632 11 0.0 0.00 0.0 0.0 2.0
25836 25842 1096 825 210 20 16 17 17 17 0.0 ... 40 20 1:31.158 208.556 11 0.0 0.00 0.0 0.0 2.0
25837 25843 1096 1 131 44 5 18 18 18 0.0 ... 42 11 1:29.788 211.738 9 5.0 7.07 0.0 10.0 2.0
25838 25844 1096 849 3 6 20 19 19 19 0.0 ... 45 14 1:30.309 210.517 130 0.0 0.00 0.0 0.0 2.0
25839 25845 1096 4 214 14 10 \N R 20 0.0 ... 24 17 1:30.579 209.889 47 3.0 4.24 0.0 6.0 2.0

12267 rows × 23 columns
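  • A roughly equivalent alternative (a sketch, assuming the same data frames): use how = "inner" to keep only the rows that have a match in teamrace_agg
  • The row count may differ slightly from dropna() if other columns contain missing values
results_inner = pd.merge(results,
                         teamrace_agg,
                         on = ["raceId","constructorId"],
                         how = "inner")   # keeps only races with raceId >= 500, which have group statistics
results_inner.round(2)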

Back to exercise