QTM 151 - Intro to Stats Computing II

Lecture 15 - Aggregating Data

Danilo Freire

Emory University

23 October, 2024

Hello, everyone! 👋
Nice to see you all again 😁

Two quick things before we start 😉

Some comments on quiz 02

  • The quiz has been graded and the grades are in Canvas
  • You guys did really well! Congratulations! 🎉
  • What do you think about the quiz? Was it too easy? Too hard? Just right?
  • If you have any questions about the quiz, please let us know
  • Please do not forget that the assignment is due today 🤓

Jupyter Notebooks online 🌐

  • The online notebook already comes with all the packages we need for this class, such as NumPy, Pandas, Matplotlib, and Seaborn
  • You can install many other packages too! 📦
  • Not all Python packages work, but many do. Install them with
%pip install package-name
  • You can also use it to run R and JavaScript code, as well as write LaTeX and Markdown documents
  • Please download your files with the right-click menu before closing the browser!
  • Let me know if you find any bugs! 🐞

Today’s plan 📅

Aggregating data

Today we will…

  • Learn about different ways to gather dataset characteristics
  • Learn how to aggregate data in Pandas using the .agg() method
  • Combine .agg() with .groupby() to aggregate data by groups
  • Use .query() together with those methods to filter and summarise data in one step

Import libraries and load data

  • As usual, let’s start by importing the libraries we need
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
  • And load the dataset we will use today
results = pd.read_csv("data_raw/results.csv")

  • The dataset’s codebook contains four main pieces of information: Field, Type, Key, and Description
  • Field is the name of the column
  • Type is the data type of the column
    • integer (int)
    • string (varchar - “variable character”)
    • float (float)
    • The number in parenthesis is the maximum number of characters/digits
  • Key is the primary key of the table, also known as the identifier
    • It is unique for each row and can be used to join tables (more on that later; see the quick check after this list)
  • Description is a brief description of the column
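  • As a quick check (a minimal sketch, assuming the results data frame loaded above), we can confirm that resultId behaves like a primary key, i.e., that it is unique for each row
# resultId should have no duplicates if it is the primary key
results["resultId"].is_unique           # True when every value appears exactly once
results["resultId"].duplicated().sum()  # expected to be 0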

Display the first few rows

results.head()
resultId raceId driverId constructorId number grid position positionText positionOrder points laps time milliseconds fastestLap rank fastestLapTime fastestLapSpeed statusId
0 1 18 1 1 22 1 1 1 1 10.0 58 1:34:50.616 5690616 39 2 1:27.452 218.300 1
1 2 18 2 2 3 5 2 2 2 8.0 58 +5.478 5696094 41 3 1:27.739 217.586 1
2 3 18 3 3 7 7 3 3 3 6.0 58 +8.163 5698779 41 5 1:28.090 216.719 1
3 4 18 4 4 5 11 4 4 4 5.0 58 +17.181 5707797 58 7 1:28.603 215.464 1
4 5 18 5 1 23 3 5 5 5 4.0 58 +18.014 5708630 43 1 1:27.418 218.385 1
  • See a random sample of the data using sample
results.sample(5).sort_values("resultId")
resultId raceId driverId constructorId number grid position positionText positionOrder points laps time milliseconds fastestLap rank fastestLapTime fastestLapSpeed statusId
5224 5225 260 77 6 28 3 3 3 3 4.0 78 +1:16.824 6672124 \N \N \N \N 1
7979 7980 358 160 45 18 0 \N F 30 0.0 0 \N \N \N \N \N \N 97
10260 10261 437 176 21 17 23 9 9 9 0.0 71 \N \N \N \N \N \N 14
16589 16590 677 321 6 7 15 \N R 19 0.0 14 \N \N \N \N \N \N 5
23454 23459 973 1 131 44 1 1 1 1 25.0 66 1:35:56.497 5756497 64 1 1:23.593 200.471 1

Get the column names and types

  • We can get the column names and types using dtypes
results.dtypes
resultId             int64
raceId               int64
driverId             int64
constructorId        int64
number              object
grid                 int64
position            object
positionText        object
positionOrder        int64
points             float64
laps                 int64
time                object
milliseconds        object
fastestLap          object
rank                object
fastestLapTime      object
fastestLapSpeed     object
statusId             int64
dtype: object
  • We can also get the column names using columns
results.columns
Index(['resultId', 'raceId', 'driverId', 'constructorId', 'number', 'grid',
       'position', 'positionText', 'positionOrder', 'points', 'laps', 'time',
       'milliseconds', 'fastestLap', 'rank', 'fastestLapTime',
       'fastestLapSpeed', 'statusId'],
      dtype='object')

Try it yourself! 🤓

  • How many rows does the dataset have?
  • How many unique values are there for the columns:
    • resultId?
    • raceId?
    • driverId?
  • Hint: Use the len() function and the .nunique() method (click on their names to see the documentation)
  • Appendix 01

Group by and aggregate 📊

Multi-line code

  • You can split your code into multiple lines to make it easier to read
  • You just need to wrap the code in parentheses ()
descriptives_multiline = (results["points"]
                          .describe())
descriptives_multiline
count    25840.000000
mean         1.877053
std          4.169849
min          0.000000
25%          0.000000
50%          0.000000
75%          2.000000
max         50.000000
Name: points, dtype: float64
  • This is the same as writing
descriptives_singleline = results["points"].describe()
descriptives_singleline
count    25840.000000
mean         1.877053
std          4.169849
min          0.000000
25%          0.000000
50%          0.000000
75%          2.000000
max         50.000000
Name: points, dtype: float64

Aggregate data using .agg()

  • The .agg() method computes aggregate statistics
  • Each new column is defined by a tuple: new_name = (column_name, function_name)
  • The first element is the name of the column to aggregate
  • The second element is the function to apply to it
  • The function name works with single quotes ‘…’ or double quotes “…”
# The functions in quotes are pandas function names; len is not in quotes because it is
# the built-in Python function. If you passed the string 'len' instead, pandas would look
# for a method named len on the column, which does not exist.
results_agg = results.agg(mean_points = ('points','mean'),
                          sd_points =   ('points','std'),
                          min_points =  ('points','min'),
                          max_points =  ('points','max'),
                          count_obs   = ('points', len))

display(results_agg.round(2)) # round to 2 decimal places
points
mean_points 1.88
sd_points 4.17
min_points 0.00
max_points 50.00
count_obs 25840.00

Group and aggregate data using .groupby()

  • The .groupby() method groups the data by a column
  • The .agg() method can be used to aggregate the data by group
  • The syntax is df.groupby('column_name').agg(...)
  • Let’s see one example
drivers_agg = (results.groupby("driverId")
                      .agg(mean_points = ('points','mean'),
                           sd_points =   ('points','std'),
                           min_points =  ('points','min'),
                           max_points =  ('points','max'),
                           count_obs   = ('points',len)))
drivers_agg
mean_points sd_points min_points max_points count_obs
driverId
1 14.182258 9.224098 0.0 50.0 310
2 1.407609 2.372923 0.0 15.0 184
3 7.740291 8.672456 0.0 25.0 206
4 5.756983 6.330721 0.0 25.0 358
5 0.937500 1.969503 0.0 10.0 112
... ... ... ... ... ...
852 1.000000 2.477808 0.0 12.0 44
853 0.000000 0.000000 0.0 0.0 22
854 0.272727 1.335798 0.0 8.0 44
855 0.272727 0.882735 0.0 4.0 22
856 2.000000 NaN 2.0 2.0 1

855 rows × 5 columns
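  • Note that driverId is now the index of drivers_agg, not a regular column
  • Two handy follow-ups (a small sketch, not from the original slides): look up a single driver with .loc, or turn the index back into a column with .reset_index()
drivers_agg.loc[1]          # statistics for driverId 1
drivers_agg.reset_index()   # driverId becomes a regular column again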

Group and aggregate data using .groupby()

  • Split-apply-combine strategy again
  • Split: Split the dataset into groups, here by driverId
  • Apply: Apply a function to each group, here the mean, standard deviation, minimum, maximum, and count of points
  • Combine: Combine the results into a single data frame
  • The result is a data frame with the statistics for each driver, as sketched below
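  • A minimal sketch of the three steps done “by hand” for a single group (driverId 1 is used only as an illustration)
grouped = results.groupby("driverId")   # split: one group per driver
one_driver = grouped.get_group(1)       # inspect a single group
one_driver["points"].mean()             # apply: compute the statistic for that group
# .agg() applies the functions to every group and combines the results into one data frame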

Group and aggregate data using .groupby() - multiple groups

  • The syntax we used before works for one group, but it is easy to extend it to multiple groups
  • You can pass a list of columns to the .groupby() method
  • df.groupby(['column1', 'column2']).agg(...)
  • constructorId is the team identifier
teamrace_agg = (results.groupby(["raceId","constructorId"])
                       .agg(mean_points = ('points','mean'),
                            sd_points =   ('points','std'),
                            min_points =  ('points','min'),
                            max_points =  ('points','max'),
                            count_obs   = ('points',len)))
teamrace_agg.round(2)
mean_points sd_points min_points max_points count_obs
raceId constructorId
1 1 0.0 0.00 0.0 0.0 2
2 0.0 0.00 0.0 0.0 2
3 1.5 2.12 0.0 3.0 2
4 2.0 2.83 0.0 4.0 2
5 1.5 0.71 1.0 2.0 2
... ... ... ... ... ... ...
1096 117 2.5 2.12 1.0 4.0 2
131 5.0 7.07 0.0 10.0 2
210 0.0 0.00 0.0 0.0 2
213 0.0 0.00 0.0 0.0 2
214 3.0 4.24 0.0 6.0 2

12568 rows × 5 columns
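  • With two grouping columns, the result has a MultiIndex (raceId, constructorId)
  • A minimal sketch of how to look up one group; the pair (18, 1) appears in the head() output earlier and refers to the unfiltered teamrace_agg defined just above
teamrace_agg.loc[(18, 1)]   # statistics for constructorId 1 in raceId 18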

Filtering +
Grouping +
Aggregating 🤓

Filtering data using .query()

  • The .query() method filters the data, as you know well by now
  • We can combine .query() with .groupby() and .agg() to filter, group, and aggregate data in one step
teamrace_agg = (results.query("raceId >= 500")
                       .groupby(["raceId","constructorId"])
                        .agg(mean_points = ('points','mean'),
                             sd_points =   ('points','std'),
                             min_points =  ('points','min'),
                             max_points =  ('points','max'),
                             count_obs   = ('points',len)))
teamrace_agg.round(2)
mean_points sd_points min_points max_points count_obs
raceId constructorId
500 1 0.0 0.00 0.0 0.0 2
3 1.0 1.41 0.0 2.0 2
4 4.5 6.36 0.0 9.0 2
6 0.0 0.00 0.0 0.0 2
21 0.5 0.71 0.0 1.0 2
... ... ... ... ... ... ...
1096 117 2.5 2.12 1.0 4.0 2
131 5.0 7.07 0.0 10.0 2
210 0.0 0.00 0.0 0.0 2
213 0.0 0.00 0.0 0.0 2
214 3.0 4.24 0.0 6.0 2

6055 rows × 5 columns

Try it yourself! 🤓

  • Create a new dataset by grouping by constructorId (the team) and computing the average number of points
  • Then chain .sort_values(..., ascending = False) to sort teams by average points in descending order
  • Appendix 02

Different functions for different columns

  • You can use different functions for different columns too!
  • You specify the column name and the function for each column in a dictionary
  • If you want more than one function for a column, you can use a list, e.g. ['function1', 'function2']
  • The syntax is df.agg({'column1': 'function1', 'column2': 'function2'})
  • Statistics that were not requested for a column appear as NaN in the combined output
results.agg({'laps': ['min', 'max', len]}).round(2) # one variable
laps
min 0
max 200
len 25840
results.agg({'points': 'mean',
             'positionOrder': 'max',
             'laps': ['min', 'max', len]}).round(2) # multiple variables
points positionOrder laps
mean 1.88 NaN NaN
max NaN 39.0 200.0
min NaN NaN 0.0
len NaN NaN 25840.0

Relative statistics within group 📊

Relative statistics within group

  • We can compute relative statistics within groups and add them to the original data frame
  • When we combine two data frames using a common column (a key), we call this merging the data
  • We can use the pd.merge() function to merge the data

  • Let’s see an example. We’ll use the drivers_agg data frame we created before and merge it with the original results data frame
drivers_agg.round(2)
mean_points sd_points min_points max_points count_obs
driverId
1 14.18 9.22 0.0 50.0 310
2 1.41 2.37 0.0 15.0 184
3 7.74 8.67 0.0 25.0 206
4 5.76 6.33 0.0 25.0 358
5 0.94 1.97 0.0 10.0 112
... ... ... ... ... ...
852 1.00 2.48 0.0 12.0 44
853 0.00 0.00 0.0 0.0 22
854 0.27 1.34 0.0 8.0 44
855 0.27 0.88 0.0 4.0 22
856 2.00 NaN 2.0 2.0 1

855 rows × 5 columns

Relative statistics within group

  • And here is the results data frame
results.head()
resultId raceId driverId constructorId number grid position positionText positionOrder points laps time milliseconds fastestLap rank fastestLapTime fastestLapSpeed statusId
0 1 18 1 1 22 1 1 1 1 10.0 58 1:34:50.616 5690616 39 2 1:27.452 218.300 1
1 2 18 2 2 3 5 2 2 2 8.0 58 +5.478 5696094 41 3 1:27.739 217.586 1
2 3 18 3 3 7 7 3 3 3 6.0 58 +8.163 5698779 41 5 1:28.090 216.719 1
3 4 18 4 4 5 11 4 4 4 5.0 58 +17.181 5707797 58 7 1:28.603 215.464 1
4 5 18 5 1 23 3 5 5 5 4.0 58 +18.014 5708630 43 1 1:27.418 218.385 1
  • Now we can merge the two data frames using the driverId column
  • Why do we use on='driverId'? Because driverId identifies the driver in both data frames (it is a column in results and the index of drivers_agg)
results_merge = pd.merge(results,
                         drivers_agg,
                         on = "driverId",
                         how = "left")
  • The how = "left" argument keeps all rows from the results data frame and adds the statistics from drivers_agg to it
  • The result is a new data frame with the statistics for each driver

Relative statistics within group

  • Let’s see the first few rows of the merged data frame
results_merge.head()
resultId raceId driverId constructorId number grid position positionText positionOrder points ... fastestLap rank fastestLapTime fastestLapSpeed statusId mean_points sd_points min_points max_points count_obs
0 1 18 1 1 22 1 1 1 1 10.0 ... 39 2 1:27.452 218.300 1 14.182258 9.224098 0.0 50.0 310
1 2 18 2 2 3 5 2 2 2 8.0 ... 41 3 1:27.739 217.586 1 1.407609 2.372923 0.0 15.0 184
2 3 18 3 3 7 7 3 3 3 6.0 ... 41 5 1:28.090 216.719 1 7.740291 8.672456 0.0 25.0 206
3 4 18 4 4 5 11 4 4 4 5.0 ... 58 7 1:28.603 215.464 1 5.756983 6.330721 0.0 25.0 358
4 5 18 5 1 23 3 5 5 5 4.0 ... 43 1 1:27.418 218.385 1 0.937500 1.969503 0.0 10.0 112

5 rows × 23 columns

  • We can see that the new columns mean_points, sd_points, min_points, max_points, and count_obs were added to the data frame
results_merge.columns
Index(['resultId', 'raceId', 'driverId', 'constructorId', 'number', 'grid',
       'position', 'positionText', 'positionOrder', 'points', 'laps', 'time',
       'milliseconds', 'fastestLap', 'rank', 'fastestLapTime',
       'fastestLapSpeed', 'statusId', 'mean_points', 'sd_points', 'min_points',
       'max_points', 'count_obs'],
      dtype='object')
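  • With the driver-level statistics merged in, we can compute a relative statistic within each group
  • A minimal sketch (the column name points_above_avg is just for illustration)
# How far each race result is above (or below) the driver's own average
results_merge["points_above_avg"] = (results_merge["points"]
                                     - results_merge["mean_points"])
results_merge[["driverId", "points", "mean_points", "points_above_avg"]].head()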

Try it yourself! 🤓

  • Create a scatter plot of points (y-axis) against mean_points (x-axis)
  • Note: This plot shows how much a driver’s performance in individual races deviates from their overall average
  • Appendix 03

Try it yourself! 🤓

  • Merge the teamrace_agg data into results
  • This time use the option: on = ["raceId","constructorId"]
  • Appendix 04

Appendix 01

  • Let’s start by counting the number of rows in the dataset
len(results)
25840
  • Now let’s count the number of unique values for the columns resultId, raceId, and driverId
results.resultId.nunique()
25840
results.raceId.nunique()
1079
results.driverId.nunique()
855

Appendix 01

  • You could also use value_counts() to get the same information
results.resultId.value_counts()
resultId
1        1
17224    1
17234    1
17233    1
17232    1
        ..
8611     1
8610     1
8609     1
8608     1
25845    1
Name: count, Length: 25840, dtype: int64
results.raceId.value_counts()
raceId
800    55
809    47
371    39
359    39
366    39
       ..
657    14
668    14
827    13
660    13
765    10
Name: count, Length: 1079, dtype: int64
results.driverId.value_counts()
driverId
4      358
8      352
22     326
1      310
18     309
      ... 
545      1
331      1
492      1
546      1
856      1
Name: count, Length: 855, dtype: int64
  • This function returns the number of times each unique value appears in the column

Back to exercise

Appendix 02

  • Let’s start by grouping the data by constructorId and computing the average number of points
team_agg = (results.groupby("constructorId")
                   .agg(mean_points = ('points','mean')))
  • Now let’s sort the data by mean_points in descending order
  • We can use the sort_values() method
team_agg.sort_values("mean_points", ascending = False).round(2)
mean_points
constructorId
131 12.36
9 9.12
23 5.06
208 4.58
6 4.32
... ...
76 0.00
77 0.00
78 0.00
130 0.00
112 0.00

210 rows × 1 columns

Back to exercise

Appendix 03

plt.figure(figsize=(10,6))
plt.scatter(results_merge["mean_points"], results_merge["points"])
plt.xlabel("Mean Points")
plt.ylabel("Points")
plt.title("Points vs Mean Points")
plt.show()

  • This plot shows how much a driver’s performance in individual races deviates from their overall average

Back to exercise

Appendix 04

  • Let’s merge the teamrace_agg data into results
  • This time use the option: on = ["raceId","constructorId"]
  • Since we filtered the data before (raceId >= 500), we will see some missing cases in the merged data frame
results_merge = pd.merge(results,
                         teamrace_agg,
                         on = ["raceId","constructorId"],
                         how = "left")
results_merge.round(2)
resultId raceId driverId constructorId number grid position positionText positionOrder points ... fastestLap rank fastestLapTime fastestLapSpeed statusId mean_points sd_points min_points max_points count_obs
0 1 18 1 1 22 1 1 1 1 10.0 ... 39 2 1:27.452 218.300 1 NaN NaN NaN NaN NaN
1 2 18 2 2 3 5 2 2 2 8.0 ... 41 3 1:27.739 217.586 1 NaN NaN NaN NaN NaN
2 3 18 3 3 7 7 3 3 3 6.0 ... 41 5 1:28.090 216.719 1 NaN NaN NaN NaN NaN
3 4 18 4 4 5 11 4 4 4 5.0 ... 58 7 1:28.603 215.464 1 NaN NaN NaN NaN NaN
4 5 18 5 1 23 3 5 5 5 4.0 ... 43 1 1:27.418 218.385 1 NaN NaN NaN NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
25835 25841 1096 854 210 47 12 16 16 16 0.0 ... 39 12 1:29.833 211.632 11 0.0 0.00 0.0 0.0 2.0
25836 25842 1096 825 210 20 16 17 17 17 0.0 ... 40 20 1:31.158 208.556 11 0.0 0.00 0.0 0.0 2.0
25837 25843 1096 1 131 44 5 18 18 18 0.0 ... 42 11 1:29.788 211.738 9 5.0 7.07 0.0 10.0 2.0
25838 25844 1096 849 3 6 20 19 19 19 0.0 ... 45 14 1:30.309 210.517 130 0.0 0.00 0.0 0.0 2.0
25839 25845 1096 4 214 14 10 \N R 20 0.0 ... 24 17 1:30.579 209.889 47 3.0 4.24 0.0 6.0 2.0

25840 rows × 23 columns

Appendix 04

  • We can remove the missing cases using the dropna() method
results_merge = pd.merge(results,
                         teamrace_agg,
                         on = ["raceId","constructorId"],
                         how = "left")
results_merge.round(2).dropna()
resultId raceId driverId constructorId number grid position positionText positionOrder points ... fastestLap rank fastestLapTime fastestLapSpeed statusId mean_points sd_points min_points max_points count_obs
12049 12050 500 163 4 16 2 1 1 1 9.0 ... \N \N \N \N 1 4.5 6.36 0.0 9.0 2.0
12050 12051 500 172 27 26 4 2 2 2 6.0 ... \N \N \N \N 1 5.0 1.41 4.0 6.0 2.0
12051 12052 500 202 27 25 5 3 3 3 4.0 ... \N \N \N \N 1 5.0 1.41 4.0 6.0 2.0
12052 12053 500 137 34 5 3 4 4 4 3.0 ... \N \N \N \N 1 1.5 2.12 0.0 3.0 2.0
12053 12054 500 199 3 28 6 5 5 5 2.0 ... \N \N \N \N 11 1.0 1.41 0.0 2.0 2.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
25835 25841 1096 854 210 47 12 16 16 16 0.0 ... 39 12 1:29.833 211.632 11 0.0 0.00 0.0 0.0 2.0
25836 25842 1096 825 210 20 16 17 17 17 0.0 ... 40 20 1:31.158 208.556 11 0.0 0.00 0.0 0.0 2.0
25837 25843 1096 1 131 44 5 18 18 18 0.0 ... 42 11 1:29.788 211.738 9 5.0 7.07 0.0 10.0 2.0
25838 25844 1096 849 3 6 20 19 19 19 0.0 ... 45 14 1:30.309 210.517 130 0.0 0.00 0.0 0.0 2.0
25839 25845 1096 4 214 14 10 \N R 20 0.0 ... 24 17 1:30.579 209.889 47 3.0 4.24 0.0 6.0 2.0

12267 rows × 23 columns
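  • A roughly equivalent alternative (a sketch, assuming the same data frames): use how = "inner" to keep only the rows that have a match in teamrace_agg
  • The row count may differ slightly from dropna() if other columns contain missing values
results_inner = pd.merge(results,
                         teamrace_agg,
                         on = ["raceId","constructorId"],
                         how = "inner")   # keeps only races with raceId >= 500, which have group statistics
results_inner.round(2)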

Back to exercise