QTM 151 - Introduction to Stats Computing II

Lecture 16 - Merging Data

Danilo Freire

danilo.freire@emory.edu

Emory University

Hope your week is off to a good start 😊

Brief recap 📚

Last week we saw:

How to aggregate data using Pandas using the .agg() method
How to combine .agg() with .groupby()
Use .query() to filter and summarise data
Note: assignment 07 has been marked and feedback provided
Assignment 08 is due on Friday

Today’s plan 📅

What we will cover today:

We will dive a little deeper into data manipulation
Specifically, we will see other ways to combine data
We will learn how to concatenate and merge data using Pandas
Explain what dictonaries are and how to use them in Python
We will see the importance of cleaning data before merging

Let’s get started 🚀

Entity relationship diagrams

First, let’s load the libraries and data we will use today

import numpy as np
import pandas as pd

results_raw  = pd.read_csv("data_raw/results.csv")
races_raw    = pd.read_csv("data_raw/races.csv")
circuits_raw = pd.read_csv("data_raw/circuits.csv")

Multi-file datasets can be visualised with an Entity Relationship Diagram
An entity represents people, concepts, or events about which data are stored
An attribute is a property or characteristic of an entity, such as a name or date
A relationship is an association between entities, which can be one-to-one, one-to-many, or many-to-many

Type	Description	Example
One-to-One	One entity relates to exactly one other entity	One student, one mailing address
One-to-Many	One entity relates to multiple instances of another entity	One store, many departments
Many-to-Many	Multiple instances of both entities relate to each other	Students and courses

Entity relationship diagrams

Let’s have a look at the data 📊

results_raw.head(10)

	resultId	raceId	driverId	constructorId	number	grid	position	positionText	positionOrder	points	laps	time	milliseconds	fastestLap	rank	fastestLapTime	fastestLapSpeed	statusId
0	1	18	1	1	22	1	1	1	1	10.0	58	1:34:50.616	5690616	39	2	1:27.452	218.300	1
1	2	18	2	2	3	5	2	2	2	8.0	58	+5.478	5696094	41	3	1:27.739	217.586	1
2	3	18	3	3	7	7	3	3	3	6.0	58	+8.163	5698779	41	5	1:28.090	216.719	1
3	4	18	4	4	5	11	4	4	4	5.0	58	+17.181	5707797	58	7	1:28.603	215.464	1
4	5	18	5	1	23	3	5	5	5	4.0	58	+18.014	5708630	43	1	1:27.418	218.385	1
5	6	18	6	3	8	13	6	6	6	3.0	57	\N	\N	50	14	1:29.639	212.974	11
6	7	18	7	5	14	17	7	7	7	2.0	55	\N	\N	22	12	1:29.534	213.224	5
7	8	18	8	6	1	15	8	8	8	1.0	53	\N	\N	20	4	1:27.903	217.180	5
8	9	18	9	2	4	2	\N	R	9	0.0	47	\N	\N	15	9	1:28.753	215.100	4
9	10	18	10	7	12	18	\N	R	10	0.0	43	\N	\N	23	13	1:29.558	213.166	3

Let’s have a look at the data 📊

races_raw.head(10)

	raceId	year	round	circuitId	name	date	time	url	fp1_date	fp1_time	fp2_date	fp2_time	fp3_date	fp3_time	quali_date	quali_time	sprint_date	sprint_time
0	1	2009	1	1	Australian Grand Prix	2009-03-29	06:00:00	http://en.wikipedia.org/wiki/2009_Australian_G...	\N	\N	\N	\N	\N	\N	\N	\N	\N	\N
1	2	2009	2	2	Malaysian Grand Prix	2009-04-05	09:00:00	http://en.wikipedia.org/wiki/2009_Malaysian_Gr...	\N	\N	\N	\N	\N	\N	\N	\N	\N	\N
2	3	2009	3	17	Chinese Grand Prix	2009-04-19	07:00:00	http://en.wikipedia.org/wiki/2009_Chinese_Gran...	\N	\N	\N	\N	\N	\N	\N	\N	\N	\N
3	4	2009	4	3	Bahrain Grand Prix	2009-04-26	12:00:00	http://en.wikipedia.org/wiki/2009_Bahrain_Gran...	\N	\N	\N	\N	\N	\N	\N	\N	\N	\N
4	5	2009	5	4	Spanish Grand Prix	2009-05-10	12:00:00	http://en.wikipedia.org/wiki/2009_Spanish_Gran...	\N	\N	\N	\N	\N	\N	\N	\N	\N	\N
5	6	2009	6	6	Monaco Grand Prix	2009-05-24	12:00:00	http://en.wikipedia.org/wiki/2009_Monaco_Grand...	\N	\N	\N	\N	\N	\N	\N	\N	\N	\N
6	7	2009	7	5	Turkish Grand Prix	2009-06-07	12:00:00	http://en.wikipedia.org/wiki/2009_Turkish_Gran...	\N	\N	\N	\N	\N	\N	\N	\N	\N	\N
7	8	2009	8	9	British Grand Prix	2009-06-21	12:00:00	http://en.wikipedia.org/wiki/2009_British_Gran...	\N	\N	\N	\N	\N	\N	\N	\N	\N	\N
8	9	2009	9	20	German Grand Prix	2009-07-12	12:00:00	http://en.wikipedia.org/wiki/2009_German_Grand...	\N	\N	\N	\N	\N	\N	\N	\N	\N	\N
9	10	2009	10	11	Hungarian Grand Prix	2009-07-26	12:00:00	http://en.wikipedia.org/wiki/2009_Hungarian_Gr...	\N	\N	\N	\N	\N	\N	\N	\N	\N	\N

Let’s have a look at the data 📊

circuits_raw.head(10)

	circuitId	circuitRef	name	location	country	lat	lng	alt	url
0	1	albert_park	Albert Park Grand Prix Circuit	Melbourne	Australia	-37.84970	144.96800	10	http://en.wikipedia.org/wiki/Melbourne_Grand_P...
1	2	sepang	Sepang International Circuit	Kuala Lumpur	Malaysia	2.76083	101.73800	18	http://en.wikipedia.org/wiki/Sepang_Internatio...
2	3	bahrain	Bahrain International Circuit	Sakhir	Bahrain	26.03250	50.51060	7	http://en.wikipedia.org/wiki/Bahrain_Internati...
3	4	catalunya	Circuit de Barcelona-Catalunya	Montmeló	Spain	41.57000	2.26111	109	http://en.wikipedia.org/wiki/Circuit_de_Barcel...
4	5	istanbul	Istanbul Park	Istanbul	Turkey	40.95170	29.40500	130	http://en.wikipedia.org/wiki/Istanbul_Park
5	6	monaco	Circuit de Monaco	Monte-Carlo	Monaco	43.73470	7.42056	7	http://en.wikipedia.org/wiki/Circuit_de_Monaco
6	7	villeneuve	Circuit Gilles Villeneuve	Montreal	Canada	45.50000	-73.52280	13	http://en.wikipedia.org/wiki/Circuit_Gilles_Vi...
7	8	magny_cours	Circuit de Nevers Magny-Cours	Magny Cours	France	46.86420	3.16361	228	http://en.wikipedia.org/wiki/Circuit_de_Nevers...
8	9	silverstone	Silverstone Circuit	Silverstone	UK	52.07860	-1.01694	153	http://en.wikipedia.org/wiki/Silverstone_Circuit
9	10	hockenheimring	Hockenheimring	Hockenheim	Germany	49.32780	8.56583	103	http://en.wikipedia.org/wiki/Hockenheimring

Let’s have a look at the data 📊

We extract all the unique values in races_raw["name"] and circuits_raw["name"]
We use sort_values() to make it easier to compare the variables

unique_data_races = pd.unique(races_raw["name"].sort_values())
print(unique_data_races)

['70th Anniversary Grand Prix' 'Abu Dhabi Grand Prix'
 'Argentine Grand Prix' 'Australian Grand Prix' 'Austrian Grand Prix'
 'Azerbaijan Grand Prix' 'Bahrain Grand Prix' 'Belgian Grand Prix'
 'Brazilian Grand Prix' 'British Grand Prix' 'Caesars Palace Grand Prix'
 'Canadian Grand Prix' 'Chinese Grand Prix' 'Dallas Grand Prix'
 'Detroit Grand Prix' 'Dutch Grand Prix' 'Eifel Grand Prix'
 'Emilia Romagna Grand Prix' 'European Grand Prix' 'French Grand Prix'
 'German Grand Prix' 'Hungarian Grand Prix' 'Indian Grand Prix'
 'Indianapolis 500' 'Italian Grand Prix' 'Japanese Grand Prix'
 'Korean Grand Prix' 'Las Vegas Grand Prix' 'Luxembourg Grand Prix'
 'Malaysian Grand Prix' 'Mexican Grand Prix' 'Mexico City Grand Prix'
 'Miami Grand Prix' 'Monaco Grand Prix' 'Moroccan Grand Prix'
 'Pacific Grand Prix' 'Pescara Grand Prix' 'Portuguese Grand Prix'
 'Qatar Grand Prix' 'Russian Grand Prix' 'Sakhir Grand Prix'
 'San Marino Grand Prix' 'Saudi Arabian Grand Prix' 'Singapore Grand Prix'
 'South African Grand Prix' 'Spanish Grand Prix' 'Styrian Grand Prix'
 'Swedish Grand Prix' 'Swiss Grand Prix' 'São Paulo Grand Prix'
 'Turkish Grand Prix' 'Tuscan Grand Prix' 'United States Grand Prix'
 'United States Grand Prix West']

Let’s have a look at the data 📊

We do the same for the circuits_raw["name"] variable

unique_data_circuits = pd.unique(circuits_raw["name"].sort_values())
print(unique_data_circuits)

['AVUS' 'Adelaide Street Circuit' 'Ain Diab' 'Aintree'
 'Albert Park Grand Prix Circuit' 'Autodromo Enzo e Dino Ferrari'
 'Autodromo Internazionale del Mugello' 'Autodromo Nazionale di Monza'
 'Autódromo Hermanos Rodríguez' 'Autódromo Internacional Nelson Piquet'
 'Autódromo Internacional do Algarve' 'Autódromo José Carlos Pace'
 'Autódromo Juan y Oscar Gálvez' 'Autódromo do Estoril'
 'Bahrain International Circuit' 'Baku City Circuit' 'Brands Hatch'
 'Buddh International Circuit' 'Charade Circuit' 'Circuit Bremgarten'
 'Circuit Gilles Villeneuve' 'Circuit Mont-Tremblant'
 'Circuit Park Zandvoort' 'Circuit Paul Ricard'
 'Circuit de Barcelona-Catalunya' 'Circuit de Monaco'
 'Circuit de Nevers Magny-Cours' 'Circuit de Pedralbes'
 'Circuit de Spa-Francorchamps' 'Circuit of the Americas'
 'Circuito da Boavista' 'Circuito de Jerez' 'Detroit Street Circuit'
 'Dijon-Prenois' 'Donington Park' 'Fair Park' 'Fuji Speedway'
 'Hockenheimring' 'Hungaroring' 'Indianapolis Motor Speedway'
 'Istanbul Park' 'Jarama' 'Jeddah Corniche Circuit'
 'Korean International Circuit' 'Kyalami' 'Las Vegas Street Circuit'
 'Las Vegas Strip Street Circuit' 'Le Mans' 'Long Beach'
 'Losail International Circuit' 'Marina Bay Street Circuit'
 'Miami International Autodrome' 'Monsanto Park Circuit' 'Montjuïc'
 'Mosport International Raceway' 'Nivelles-Baulers' 'Nürburgring'
 'Okayama International Circuit' 'Pescara Circuit'
 'Phoenix street circuit' 'Prince George Circuit' 'Red Bull Ring'
 'Reims-Gueux' 'Riverside International Raceway' 'Rouen-Les-Essarts'
 'Scandinavian Raceway' 'Sebring International Raceway'
 'Sepang International Circuit' 'Shanghai International Circuit'
 'Silverstone Circuit' 'Sochi Autodrom' 'Suzuka Circuit'
 'Valencia Street Circuit' 'Watkins Glen' 'Yas Marina Circuit' 'Zeltweg'
 'Zolder']

Hmmm, the columns are not the same 🤔

We can see that the names in the races_raw and circuits_raw datasets are different
We should rename the columns to ensure we’re not merging the wrong data or creating duplicates
We can use the rename() method to do this
But first we need to understand what a dictionary is and how it can be used in Python

Dictionaries in Python 📚

What is a dictionary?

A dictionary is a collection of key-value pairs
Each key in a dictionary is unique and maps to a value
You can access values by referencing their keys using square brackets ([])
Dictionaries are defined using curly braces {} with key-value pairs separated by colons (:)
Fields are separated by commas (,)
Example of a dictionary: my_dict = {"name": "Danilo", "course": "QTM151", "city": "Atlanta"}
You can add, modify, or delete key-value pairs in a dictionary
Dictionaries are mutable, meaning they can be changed after creation

Dictionaries + Pandas

Let’s create a dictionary in Python

car_dictionary = {"car_model": ["Ferrari","Tesla","BMW"],
                  "year": ["2018","2023","2022"]}
car_dictionary

{'car_model': ['Ferrari', 'Tesla', 'BMW'], 'year': ['2018', '2023', '2022']}

Dictionaries can have any data type as values, including lists or arrays (even other dictionaries!)

matrix_dict = {'A':np.array([[1,2,3], [2,4,5]]), 'string': 'ABC'}
matrix_dict

{'A': array([[1, 2, 3],
        [2, 4, 5]]),
 'string': 'ABC'}

We can subset a dictionary using the key

car_dictionary['car_model']

['Ferrari', 'Tesla', 'BMW']

We can also add new key-value pairs to a dictionary

car_dictionary['colour'] = ["red","blue","black"]
car_dictionary

{'car_model': ['Ferrari', 'Tesla', 'BMW'],
 'year': ['2018', '2023', '2022'],
 'colour': ['red', 'blue', 'black']}

We can delete key-value pairs using the del keyword

del car_dictionary['colour']
car_dictionary.keys()

dict_keys(['car_model', 'year'])

To create a Pandas DataFrame from a dictionary, we use the pd.DataFrame() function

car_dictionary = {"car_model": ["Ferrari","Tesla","BMW","Something"],
                  "year": ["2018","2023","2022", "1993"]}

df = pd.DataFrame(car_dictionary)
df

	car_model	year
0	Ferrari	2018
1	Tesla	2023
2	BMW	2022
3	Something	1993

Let’s rename the columns 🔄

We rename columns with a dictionary using the {old_name: new_name} format
We pass the dictionary to the rename() method using the columns argument

circuits_raw.rename(columns={'name':'circuit_name'})

	circuitId	circuitRef	circuit_name	location	country	lat	lng	alt	url
0	1	albert_park	Albert Park Grand Prix Circuit	Melbourne	Australia	-37.84970	144.96800	10	http://en.wikipedia.org/wiki/Melbourne_Grand_P...
1	2	sepang	Sepang International Circuit	Kuala Lumpur	Malaysia	2.76083	101.73800	18	http://en.wikipedia.org/wiki/Sepang_Internatio...
2	3	bahrain	Bahrain International Circuit	Sakhir	Bahrain	26.03250	50.51060	7	http://en.wikipedia.org/wiki/Bahrain_Internati...
3	4	catalunya	Circuit de Barcelona-Catalunya	Montmeló	Spain	41.57000	2.26111	109	http://en.wikipedia.org/wiki/Circuit_de_Barcel...
4	5	istanbul	Istanbul Park	Istanbul	Turkey	40.95170	29.40500	130	http://en.wikipedia.org/wiki/Istanbul_Park
...	...	...	...	...	...	...	...	...	...
72	75	portimao	Autódromo Internacional do Algarve	Portimão	Portugal	37.22700	-8.62670	108	http://en.wikipedia.org/wiki/Algarve_Internati...
73	76	mugello	Autodromo Internazionale del Mugello	Mugello	Italy	43.99750	11.37190	255	http://en.wikipedia.org/wiki/Mugello_Circuit
74	77	jeddah	Jeddah Corniche Circuit	Jeddah	Saudi Arabia	21.63190	39.10440	15	http://en.wikipedia.org/wiki/Jeddah_Street_Cir...
75	78	losail	Losail International Circuit	Al Daayen	Qatar	25.49000	51.45420	\N	http://en.wikipedia.org/wiki/Losail_Internatio...
76	79	miami	Miami International Autodrome	Miami	USA	25.95810	-80.23890	\N	http://en.wikipedia.org/wiki/Miami_Internation...

77 rows × 9 columns

Let’s rename the columns 🔄

Let’s save the changes to a new circuits DataFrame

dict_rename_circuits = {"name": "circuit_name"}
circuits = circuits_raw.rename(columns = dict_rename_circuits)
circuits.head()

	circuitId	circuitRef	circuit_name	location	country	lat	lng	alt	url
0	1	albert_park	Albert Park Grand Prix Circuit	Melbourne	Australia	-37.84970	144.96800	10	http://en.wikipedia.org/wiki/Melbourne_Grand_P...
1	2	sepang	Sepang International Circuit	Kuala Lumpur	Malaysia	2.76083	101.73800	18	http://en.wikipedia.org/wiki/Sepang_Internatio...
2	3	bahrain	Bahrain International Circuit	Sakhir	Bahrain	26.03250	50.51060	7	http://en.wikipedia.org/wiki/Bahrain_Internati...
3	4	catalunya	Circuit de Barcelona-Catalunya	Montmeló	Spain	41.57000	2.26111	109	http://en.wikipedia.org/wiki/Circuit_de_Barcel...
4	5	istanbul	Istanbul Park	Istanbul	Turkey	40.95170	29.40500	130	http://en.wikipedia.org/wiki/Istanbul_Park

Let’s see if the columns have been renamed

print(circuits_raw.columns.values)

['circuitId' 'circuitRef' 'name' 'location' 'country' 'lat' 'lng' 'alt'
 'url']

print(circuits.columns.values)

['circuitId' 'circuitRef' 'circuit_name' 'location' 'country' 'lat' 'lng'
 'alt' 'url']

It has worked! 🎉

Try it yourself! 🧠

Create a dictionary to rename name to race_name
Rename this column in the races_raw dataset
Store the output in a new dataset called races
Appendix 01

Merge data in Pandas 🔄

Using `pd.merge()` to combine data

We can use the pd.merge() function to combine data
First, we need to extract the columns we want to merge and the type of join we want to perform
The on argument specifies the column to merge on
The how argument specifies the type of join to perform

circuits[["circuitId","circuit_name"]].head()

	circuitId	circuit_name
0	1	Albert Park Grand Prix Circuit
1	2	Sepang International Circuit
2	3	Bahrain International Circuit
3	4	Circuit de Barcelona-Catalunya
4	5	Istanbul Park

circuits.shape[0] # rows

Using `pd.merge()` to combine data

pd.merge(data1, data2, on, how)
Strive to merge only specific columns of data2
Avoid merging all columns
Keeping it simple gives you more control over the output

# The "pd.merge()" command combines the information from both datasets
# The first argument is the "primary" datasets
# The second argument is the "secondary" dataset (must include the "on" column)
# The "on" is the common variable that is used for merging
# how = "left" tells Python that the left dataset is the primary one

races_merge = pd.merge(races_raw[['raceId', 'year', 'circuitId']],
                       circuits[["circuitId","circuit_name", "location"]],
                       on = "circuitId",
                       how = "left")

Using `pd.merge()` to combine data

Let’s see the results of the merge

races_merge

	raceId	year	circuitId	circuit_name	location
0	1	2009	1	Albert Park Grand Prix Circuit	Melbourne
1	2	2009	2	Sepang International Circuit	Kuala Lumpur
2	3	2009	17	Shanghai International Circuit	Shanghai
3	4	2009	3	Bahrain International Circuit	Sakhir
4	5	2009	4	Circuit de Barcelona-Catalunya	Montmeló
...	...	...	...	...	...
1097	1116	2023	69	Circuit of the Americas	Austin
1098	1117	2023	32	Autódromo Hermanos Rodríguez	Mexico City
1099	1118	2023	18	Autódromo José Carlos Pace	São Paulo
1100	1119	2023	80	Las Vegas Strip Street Circuit	Las Vegas
1101	1120	2023	24	Yas Marina Circuit	Abu Dhabi

1102 rows × 5 columns

Using `pd.merge()` to combine data

We can extract the columns we want to see

races_merge[["raceId",  "circuitId","circuit_name"]].sort_values(by = "circuit_name")

	raceId	circuitId	circuit_name
760	761	61	AVUS
434	435	29	Adelaide Street Circuit
335	336	29	Adelaide Street Circuit
319	320	29	Adelaide Street Circuit
370	371	29	Adelaide Street Circuit
...	...	...	...
501	502	40	Zolder
486	487	40	Zolder
470	471	40	Zolder
516	517	40	Zolder
608	609	40	Zolder

1102 rows × 3 columns

Another example of a merge

results_merge = pd.merge(results_raw,
                         races_raw[["raceId","date"]],
                         on = "raceId",
                         how = "left")

results_merge.head()

	resultId	raceId	driverId	constructorId	number	grid	position	positionText	positionOrder	points	laps	time	milliseconds	fastestLap	rank	fastestLapTime	fastestLapSpeed	statusId	date
0	1	18	1	1	22	1	1	1	1	10.0	58	1:34:50.616	5690616	39	2	1:27.452	218.300	1	2008-03-16
1	2	18	2	2	3	5	2	2	2	8.0	58	+5.478	5696094	41	3	1:27.739	217.586	1	2008-03-16
2	3	18	3	3	7	7	3	3	3	6.0	58	+8.163	5698779	41	5	1:28.090	216.719	1	2008-03-16
3	4	18	4	4	5	11	4	4	4	5.0	58	+17.181	5707797	58	7	1:28.603	215.464	1	2008-03-16
4	5	18	5	1	23	3	5	5	5	4.0	58	+18.014	5708630	43	1	1:27.418	218.385	1	2008-03-16

Common pitfall: What happens if you don’t rename the columns?

If you don’t rename the columns, you may end up with duplicate columns
This can make it difficult to understand the data
It can also lead to errors when merging data
The following code merges the raw data which has the name column in races_raw and circuits_raw

races_merge_pitfall = pd.merge(races_raw,
                               circuits_raw[["circuitId","name"]],
                               on = "circuitId",
                               how = "left")

Let’s see the results of the merge

print(races_merge_pitfall.columns.values)

['raceId' 'year' 'round' 'circuitId' 'name_x' 'date' 'time' 'url'
 'fp1_date' 'fp1_time' 'fp2_date' 'fp2_time' 'fp3_date' 'fp3_time'
 'quali_date' 'quali_time' 'sprint_date' 'sprint_time' 'name_y']

We can see that the name column has been duplicated
Python appends _x and _y to the column names to differentiate them

Try it yourself! 🧠

Rename the columns name_x and name_y in the dataset races_merge_pitfall to race_name and circuit_name
HINT: Create a dictionary and use .rename()
Appendix 02

Try it yourself! 🧠

Merge the column alt, lng, and lat into the races data using pd.merge()
HINT: Use the circuitId as the key
Appendix 03

Concatenating data in Pandas

`pd.concat()`

We can use the pd.concat() function to concatenate data
We can concatenate data along the rows or columns
The method works best if columns are identical
There are also other advanced options if they are not, please check the documentation for more information
Let’s use .query() to split the data into different parts

circuits['country'].unique()

array(['Australia', 'Malaysia', 'Bahrain', 'Spain', 'Turkey', 'Monaco',
       'Canada', 'France', 'UK', 'Germany', 'Hungary', 'Belgium', 'Italy',
       'Singapore', 'Japan', 'China', 'Brazil', 'USA', 'United States',
       'UAE', 'Argentina', 'Portugal', 'South Africa', 'Mexico', 'Korea',
       'Netherlands', 'Sweden', 'Austria', 'Morocco', 'Switzerland',
       'India', 'Russia', 'Azerbaijan', 'Saudi Arabia', 'Qatar'],
      dtype=object)

circuits_spain = circuits.query('country == "Spain"')
circuits_usa = circuits.query('country == "United States" | country == "USA"')
circuits_malaysia = circuits.query('country == "Malaysia"')

Concatenating data in Pandas

circuits_concat = pd.concat([circuits_spain, circuits_usa, circuits_malaysia])
circuits_concat.head()

	circuitId	circuitRef	circuit_name	location	country	lat	lng	alt	url
3	4	catalunya	Circuit de Barcelona-Catalunya	Montmeló	Spain	41.5700	2.261110	109	http://en.wikipedia.org/wiki/Circuit_de_Barcel...
11	12	valencia	Valencia Street Circuit	Valencia	Spain	39.4589	-0.331667	4	http://en.wikipedia.org/wiki/Valencia_Street_C...
25	26	jerez	Circuito de Jerez	Jerez de la Frontera	Spain	36.7083	-6.034170	37	http://en.wikipedia.org/wiki/Circuito_Permanen...
44	45	jarama	Jarama	Madrid	Spain	40.6171	-3.585580	609	http://en.wikipedia.org/wiki/Circuito_Permanen...
48	49	montjuic	Montjuïc	Barcelona	Spain	41.3664	2.151670	79	http://en.wikipedia.org/wiki/Montju%C3%AFc_cir...

Concatenating data in Pandas

If we drop observations from the original dataset, they will be removed from the concatenated dataset (NaN values will be added)

circuits_spain_drop = circuits_spain.drop(columns=['circuitRef', 'location'])
pd.concat([circuits_spain_drop, circuits_usa])

	circuitId	circuit_name	country	lat	lng	alt	url	circuitRef	location
3	4	Circuit de Barcelona-Catalunya	Spain	41.5700	2.261110	109	http://en.wikipedia.org/wiki/Circuit_de_Barcel...	NaN	NaN
11	12	Valencia Street Circuit	Spain	39.4589	-0.331667	4	http://en.wikipedia.org/wiki/Valencia_Street_C...	NaN	NaN
25	26	Circuito de Jerez	Spain	36.7083	-6.034170	37	http://en.wikipedia.org/wiki/Circuito_Permanen...	NaN	NaN
44	45	Jarama	Spain	40.6171	-3.585580	609	http://en.wikipedia.org/wiki/Circuito_Permanen...	NaN	NaN
48	49	Montjuïc	Spain	41.3664	2.151670	79	http://en.wikipedia.org/wiki/Montju%C3%AFc_cir...	NaN	NaN
66	67	Circuit de Pedralbes	Spain	41.3903	2.116670	85	http://en.wikipedia.org/wiki/Pedralbes_Circuit	NaN	NaN
18	19	Indianapolis Motor Speedway	USA	39.7950	-86.234700	223	http://en.wikipedia.org/wiki/Indianapolis_Moto...	indianapolis	Indianapolis
22	80	Las Vegas Strip Street Circuit	United States	36.1147	-115.173000	\N	https://en.wikipedia.org/wiki/Las_Vegas_Grand_...	vegas	Las Vegas
32	33	Phoenix street circuit	USA	33.4479	-112.075000	345	http://en.wikipedia.org/wiki/Phoenix_street_ci...	phoenix	Phoenix
36	37	Detroit Street Circuit	USA	42.3298	-83.040100	177	http://en.wikipedia.org/wiki/Detroit_street_ci...	detroit	Detroit
41	42	Fair Park	USA	32.7774	-96.758700	139	http://en.wikipedia.org/wiki/Fair_Park	dallas	Dallas
42	43	Long Beach	USA	33.7651	-118.189000	12	http://en.wikipedia.org/wiki/Long_Beach,_Calif...	long_beach	California
43	44	Las Vegas Street Circuit	USA	36.1162	-115.174000	639	http://en.wikipedia.org/wiki/Las_Vegas_Street_...	las_vegas	Nevada
45	46	Watkins Glen	USA	42.3369	-76.927200	485	http://en.wikipedia.org/wiki/Watkins_Glen_Inte...	watkins_glen	New York State
59	60	Riverside International Raceway	USA	33.9370	-117.273000	470	http://en.wikipedia.org/wiki/Riverside_Interna...	riverside	California
62	63	Sebring International Raceway	USA	27.4547	-81.348300	18	http://en.wikipedia.org/wiki/Sebring_Raceway	sebring	Florida
68	69	Circuit of the Americas	USA	30.1328	-97.641100	161	http://en.wikipedia.org/wiki/Circuit_of_the_Am...	americas	Austin
76	79	Miami International Autodrome	USA	25.9581	-80.238900	\N	http://en.wikipedia.org/wiki/Miami_Internation...	miami	Miami

Try it yourself! 🧠

Concatenate the USA and Malaysia datasets
Appendix 04

And that’s it for today! 🎉

Have a very nice week! 😊

Appendix 01

dict_rename = {"name": "race_name"}
races_raw.rename(columns = dict_rename).head()

	raceId	year	round	circuitId	race_name	date	time	url	fp1_date	fp1_time	fp2_date	fp2_time	fp3_date	fp3_time	quali_date	quali_time	sprint_date	sprint_time
0	1	2009	1	1	Australian Grand Prix	2009-03-29	06:00:00	http://en.wikipedia.org/wiki/2009_Australian_G...	\N	\N	\N	\N	\N	\N	\N	\N	\N	\N
1	2	2009	2	2	Malaysian Grand Prix	2009-04-05	09:00:00	http://en.wikipedia.org/wiki/2009_Malaysian_Gr...	\N	\N	\N	\N	\N	\N	\N	\N	\N	\N
2	3	2009	3	17	Chinese Grand Prix	2009-04-19	07:00:00	http://en.wikipedia.org/wiki/2009_Chinese_Gran...	\N	\N	\N	\N	\N	\N	\N	\N	\N	\N
3	4	2009	4	3	Bahrain Grand Prix	2009-04-26	12:00:00	http://en.wikipedia.org/wiki/2009_Bahrain_Gran...	\N	\N	\N	\N	\N	\N	\N	\N	\N	\N
4	5	2009	5	4	Spanish Grand Prix	2009-05-10	12:00:00	http://en.wikipedia.org/wiki/2009_Spanish_Gran...	\N	\N	\N	\N	\N	\N	\N	\N	\N	\N

races = races_raw.rename(columns = dict_rename)
races.columns

Index(['raceId', 'year', 'round', 'circuitId', 'race_name', 'date', 'time',
       'url', 'fp1_date', 'fp1_time', 'fp2_date', 'fp2_time', 'fp3_date',
       'fp3_time', 'quali_date', 'quali_time', 'sprint_date', 'sprint_time'],
      dtype='object')

Back to Exercise 01

Appendix 02

rename = {"name_x": "race_name", "name_y": "circuit_name"}
races_merge_pitfall.rename(columns = rename).info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1102 entries, 0 to 1101
Data columns (total 19 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   raceId        1102 non-null   int64 
 1   year          1102 non-null   int64 
 2   round         1102 non-null   int64 
 3   circuitId     1102 non-null   int64 
 4   race_name     1102 non-null   object
 5   date          1102 non-null   object
 6   time          1102 non-null   object
 7   url           1102 non-null   object
 8   fp1_date      1102 non-null   object
 9   fp1_time      1102 non-null   object
 10  fp2_date      1102 non-null   object
 11  fp2_time      1102 non-null   object
 12  fp3_date      1102 non-null   object
 13  fp3_time      1102 non-null   object
 14  quali_date    1102 non-null   object
 15  quali_time    1102 non-null   object
 16  sprint_date   1102 non-null   object
 17  sprint_time   1102 non-null   object
 18  circuit_name  1102 non-null   object
dtypes: int64(4), object(15)
memory usage: 163.7+ KB

Back to Exercise 02

Appendix 03

races02 = pd.merge(races,
                   circuits[["circuitId","alt","lng","lat"]],
                   on = "circuitId",
                   how = "left")
races02.head()

	raceId	year	round	circuitId	race_name	date	time	url	fp1_date	fp1_time	...	fp2_time	fp3_date	fp3_time	quali_date	quali_time	sprint_date	sprint_time	alt	lng	lat
0	1	2009	1	1	Australian Grand Prix	2009-03-29	06:00:00	http://en.wikipedia.org/wiki/2009_Australian_G...	\N	\N	...	\N	\N	\N	\N	\N	\N	\N	10	144.96800	-37.84970
1	2	2009	2	2	Malaysian Grand Prix	2009-04-05	09:00:00	http://en.wikipedia.org/wiki/2009_Malaysian_Gr...	\N	\N	...	\N	\N	\N	\N	\N	\N	\N	18	101.73800	2.76083
2	3	2009	3	17	Chinese Grand Prix	2009-04-19	07:00:00	http://en.wikipedia.org/wiki/2009_Chinese_Gran...	\N	\N	...	\N	\N	\N	\N	\N	\N	\N	5	121.22000	31.33890
3	4	2009	4	3	Bahrain Grand Prix	2009-04-26	12:00:00	http://en.wikipedia.org/wiki/2009_Bahrain_Gran...	\N	\N	...	\N	\N	\N	\N	\N	\N	\N	7	50.51060	26.03250
4	5	2009	5	4	Spanish Grand Prix	2009-05-10	12:00:00	http://en.wikipedia.org/wiki/2009_Spanish_Gran...	\N	\N	...	\N	\N	\N	\N	\N	\N	\N	109	2.26111	41.57000

5 rows × 21 columns

Back to Exercise 03

Appendix 04

circuits_usa_malaysia = pd.concat([circuits_usa, circuits_malaysia])
circuits_usa_malaysia.head()

	circuitId	circuitRef	circuit_name	location	country	lat	lng	alt	url
18	19	indianapolis	Indianapolis Motor Speedway	Indianapolis	USA	39.7950	-86.2347	223	http://en.wikipedia.org/wiki/Indianapolis_Moto...
22	80	vegas	Las Vegas Strip Street Circuit	Las Vegas	United States	36.1147	-115.1730	\N	https://en.wikipedia.org/wiki/Las_Vegas_Grand_...
32	33	phoenix	Phoenix street circuit	Phoenix	USA	33.4479	-112.0750	345	http://en.wikipedia.org/wiki/Phoenix_street_ci...
36	37	detroit	Detroit Street Circuit	Detroit	USA	42.3298	-83.0401	177	http://en.wikipedia.org/wiki/Detroit_street_ci...
41	42	dallas	Fair Park	Dallas	USA	32.7774	-96.7587	139	http://en.wikipedia.org/wiki/Fair_Park

Back to Exercise 04

QTM 151 - Introduction to Stats Computing II

Hope your week is off to a good start 😊

Brief recap 📚

Last week we saw:

Today’s plan 📅

What we will cover today:

Let’s get started 🚀

Entity relationship diagrams

Entity relationship diagrams

Let’s have a look at the data 📊

Let’s have a look at the data 📊

Let’s have a look at the data 📊

Let’s have a look at the data 📊

Let’s have a look at the data 📊

Hmmm, the columns are not the same 🤔

Dictionaries in Python 📚

What is a dictionary?

Dictionaries + Pandas

Let’s rename the columns 🔄

Let’s rename the columns 🔄

Try it yourself! 🧠

Merge data in Pandas 🔄

Using pd.merge() to combine data

Using pd.merge() to combine data

Using pd.merge() to combine data

Using pd.merge() to combine data

Common pitfall: What happens if you don’t rename the columns?

Try it yourself! 🧠

Try it yourself! 🧠

Concatenating data in Pandas

pd.concat()

Concatenating data in Pandas

Concatenating data in Pandas

Try it yourself! 🧠

And that’s it for today! 🎉

Have a very nice week! 😊

Appendix 01

Appendix 02

Appendix 03

Appendix 04

Using `pd.merge()` to combine data

Using `pd.merge()` to combine data

Using `pd.merge()` to combine data

Using `pd.merge()` to combine data

`pd.concat()`