QTM 151 - Introduction to Stats Computing II

Lecture 16 - Merging Data

Danilo Freire

Emory University

28 October, 2024

Hope your week is off to a good start 😊

Brief recap 📚

Last week we saw:

  • How to aggregate data using Pandas using the .agg() method
  • How to combine .agg() with .groupby()
  • Use .query() to filter and summarise data
  • Note: assignment 07 has been marked and feedback provided
  • Assignment 08 is due on Friday

Today’s plan 📅

What we will cover today:

  • We will dive a little deeper into data manipulation
  • Specifically, we will see other ways to combine data
  • We will learn how to concatenate and merge data using Pandas
  • Explain what dictonaries are and how to use them in Python
  • We will see the importance of cleaning data before merging

Let’s get started 🚀

Entity relationship diagrams

  • First, let’s load the libraries and data we will use today
import numpy as np
import pandas as pd

results_raw  = pd.read_csv("data_raw/results.csv")
races_raw    = pd.read_csv("data_raw/races.csv")
circuits_raw = pd.read_csv("data_raw/circuits.csv")
  • Multi-file datasets can be visualised with an Entity Relationship Diagram
  • An entity represents people, concepts, or events about which data are stored
  • An attribute is a property or characteristic of an entity, such as a name or date
  • A relationship is an association between entities, which can be one-to-one, one-to-many, or many-to-many
Type Description Example
One-to-One One entity relates to exactly one other entity One student, one mailing address
One-to-Many One entity relates to multiple instances of another entity One store, many departments
Many-to-Many Multiple instances of both entities relate to each other Students and courses

Entity relationship diagrams

Let’s have a look at the data 📊

results_raw.head(10)
resultId raceId driverId constructorId number grid position positionText positionOrder points laps time milliseconds fastestLap rank fastestLapTime fastestLapSpeed statusId
0 1 18 1 1 22 1 1 1 1 10.0 58 1:34:50.616 5690616 39 2 1:27.452 218.300 1
1 2 18 2 2 3 5 2 2 2 8.0 58 +5.478 5696094 41 3 1:27.739 217.586 1
2 3 18 3 3 7 7 3 3 3 6.0 58 +8.163 5698779 41 5 1:28.090 216.719 1
3 4 18 4 4 5 11 4 4 4 5.0 58 +17.181 5707797 58 7 1:28.603 215.464 1
4 5 18 5 1 23 3 5 5 5 4.0 58 +18.014 5708630 43 1 1:27.418 218.385 1
5 6 18 6 3 8 13 6 6 6 3.0 57 \N \N 50 14 1:29.639 212.974 11
6 7 18 7 5 14 17 7 7 7 2.0 55 \N \N 22 12 1:29.534 213.224 5
7 8 18 8 6 1 15 8 8 8 1.0 53 \N \N 20 4 1:27.903 217.180 5
8 9 18 9 2 4 2 \N R 9 0.0 47 \N \N 15 9 1:28.753 215.100 4
9 10 18 10 7 12 18 \N R 10 0.0 43 \N \N 23 13 1:29.558 213.166 3

Let’s have a look at the data 📊

races_raw.head(10) 
raceId year round circuitId name date time url fp1_date fp1_time fp2_date fp2_time fp3_date fp3_time quali_date quali_time sprint_date sprint_time
0 1 2009 1 1 Australian Grand Prix 2009-03-29 06:00:00 http://en.wikipedia.org/wiki/2009_Australian_G... \N \N \N \N \N \N \N \N \N \N
1 2 2009 2 2 Malaysian Grand Prix 2009-04-05 09:00:00 http://en.wikipedia.org/wiki/2009_Malaysian_Gr... \N \N \N \N \N \N \N \N \N \N
2 3 2009 3 17 Chinese Grand Prix 2009-04-19 07:00:00 http://en.wikipedia.org/wiki/2009_Chinese_Gran... \N \N \N \N \N \N \N \N \N \N
3 4 2009 4 3 Bahrain Grand Prix 2009-04-26 12:00:00 http://en.wikipedia.org/wiki/2009_Bahrain_Gran... \N \N \N \N \N \N \N \N \N \N
4 5 2009 5 4 Spanish Grand Prix 2009-05-10 12:00:00 http://en.wikipedia.org/wiki/2009_Spanish_Gran... \N \N \N \N \N \N \N \N \N \N
5 6 2009 6 6 Monaco Grand Prix 2009-05-24 12:00:00 http://en.wikipedia.org/wiki/2009_Monaco_Grand... \N \N \N \N \N \N \N \N \N \N
6 7 2009 7 5 Turkish Grand Prix 2009-06-07 12:00:00 http://en.wikipedia.org/wiki/2009_Turkish_Gran... \N \N \N \N \N \N \N \N \N \N
7 8 2009 8 9 British Grand Prix 2009-06-21 12:00:00 http://en.wikipedia.org/wiki/2009_British_Gran... \N \N \N \N \N \N \N \N \N \N
8 9 2009 9 20 German Grand Prix 2009-07-12 12:00:00 http://en.wikipedia.org/wiki/2009_German_Grand... \N \N \N \N \N \N \N \N \N \N
9 10 2009 10 11 Hungarian Grand Prix 2009-07-26 12:00:00 http://en.wikipedia.org/wiki/2009_Hungarian_Gr... \N \N \N \N \N \N \N \N \N \N

Let’s have a look at the data 📊

circuits_raw.head(10) 
circuitId circuitRef name location country lat lng alt url
0 1 albert_park Albert Park Grand Prix Circuit Melbourne Australia -37.84970 144.96800 10 http://en.wikipedia.org/wiki/Melbourne_Grand_P...
1 2 sepang Sepang International Circuit Kuala Lumpur Malaysia 2.76083 101.73800 18 http://en.wikipedia.org/wiki/Sepang_Internatio...
2 3 bahrain Bahrain International Circuit Sakhir Bahrain 26.03250 50.51060 7 http://en.wikipedia.org/wiki/Bahrain_Internati...
3 4 catalunya Circuit de Barcelona-Catalunya Montmeló Spain 41.57000 2.26111 109 http://en.wikipedia.org/wiki/Circuit_de_Barcel...
4 5 istanbul Istanbul Park Istanbul Turkey 40.95170 29.40500 130 http://en.wikipedia.org/wiki/Istanbul_Park
5 6 monaco Circuit de Monaco Monte-Carlo Monaco 43.73470 7.42056 7 http://en.wikipedia.org/wiki/Circuit_de_Monaco
6 7 villeneuve Circuit Gilles Villeneuve Montreal Canada 45.50000 -73.52280 13 http://en.wikipedia.org/wiki/Circuit_Gilles_Vi...
7 8 magny_cours Circuit de Nevers Magny-Cours Magny Cours France 46.86420 3.16361 228 http://en.wikipedia.org/wiki/Circuit_de_Nevers...
8 9 silverstone Silverstone Circuit Silverstone UK 52.07860 -1.01694 153 http://en.wikipedia.org/wiki/Silverstone_Circuit
9 10 hockenheimring Hockenheimring Hockenheim Germany 49.32780 8.56583 103 http://en.wikipedia.org/wiki/Hockenheimring

Let’s have a look at the data 📊

  • We extract all the unique values in races_raw["name"] and circuits_raw["name"]
  • We use sort_values() to make it easier to compare the variables
unique_data_races = pd.unique(races_raw["name"].sort_values())
print(unique_data_races)
['70th Anniversary Grand Prix' 'Abu Dhabi Grand Prix'
 'Argentine Grand Prix' 'Australian Grand Prix' 'Austrian Grand Prix'
 'Azerbaijan Grand Prix' 'Bahrain Grand Prix' 'Belgian Grand Prix'
 'Brazilian Grand Prix' 'British Grand Prix' 'Caesars Palace Grand Prix'
 'Canadian Grand Prix' 'Chinese Grand Prix' 'Dallas Grand Prix'
 'Detroit Grand Prix' 'Dutch Grand Prix' 'Eifel Grand Prix'
 'Emilia Romagna Grand Prix' 'European Grand Prix' 'French Grand Prix'
 'German Grand Prix' 'Hungarian Grand Prix' 'Indian Grand Prix'
 'Indianapolis 500' 'Italian Grand Prix' 'Japanese Grand Prix'
 'Korean Grand Prix' 'Las Vegas Grand Prix' 'Luxembourg Grand Prix'
 'Malaysian Grand Prix' 'Mexican Grand Prix' 'Mexico City Grand Prix'
 'Miami Grand Prix' 'Monaco Grand Prix' 'Moroccan Grand Prix'
 'Pacific Grand Prix' 'Pescara Grand Prix' 'Portuguese Grand Prix'
 'Qatar Grand Prix' 'Russian Grand Prix' 'Sakhir Grand Prix'
 'San Marino Grand Prix' 'Saudi Arabian Grand Prix' 'Singapore Grand Prix'
 'South African Grand Prix' 'Spanish Grand Prix' 'Styrian Grand Prix'
 'Swedish Grand Prix' 'Swiss Grand Prix' 'São Paulo Grand Prix'
 'Turkish Grand Prix' 'Tuscan Grand Prix' 'United States Grand Prix'
 'United States Grand Prix West']

Let’s have a look at the data 📊

  • We do the same for the circuits_raw["name"] variable
unique_data_circuits = pd.unique(circuits_raw["name"].sort_values())
print(unique_data_circuits)
['AVUS' 'Adelaide Street Circuit' 'Ain Diab' 'Aintree'
 'Albert Park Grand Prix Circuit' 'Autodromo Enzo e Dino Ferrari'
 'Autodromo Internazionale del Mugello' 'Autodromo Nazionale di Monza'
 'Autódromo Hermanos Rodríguez' 'Autódromo Internacional Nelson Piquet'
 'Autódromo Internacional do Algarve' 'Autódromo José Carlos Pace'
 'Autódromo Juan y Oscar Gálvez' 'Autódromo do Estoril'
 'Bahrain International Circuit' 'Baku City Circuit' 'Brands Hatch'
 'Buddh International Circuit' 'Charade Circuit' 'Circuit Bremgarten'
 'Circuit Gilles Villeneuve' 'Circuit Mont-Tremblant'
 'Circuit Park Zandvoort' 'Circuit Paul Ricard'
 'Circuit de Barcelona-Catalunya' 'Circuit de Monaco'
 'Circuit de Nevers Magny-Cours' 'Circuit de Pedralbes'
 'Circuit de Spa-Francorchamps' 'Circuit of the Americas'
 'Circuito da Boavista' 'Circuito de Jerez' 'Detroit Street Circuit'
 'Dijon-Prenois' 'Donington Park' 'Fair Park' 'Fuji Speedway'
 'Hockenheimring' 'Hungaroring' 'Indianapolis Motor Speedway'
 'Istanbul Park' 'Jarama' 'Jeddah Corniche Circuit'
 'Korean International Circuit' 'Kyalami' 'Las Vegas Street Circuit'
 'Las Vegas Strip Street Circuit' 'Le Mans' 'Long Beach'
 'Losail International Circuit' 'Marina Bay Street Circuit'
 'Miami International Autodrome' 'Monsanto Park Circuit' 'Montjuïc'
 'Mosport International Raceway' 'Nivelles-Baulers' 'Nürburgring'
 'Okayama International Circuit' 'Pescara Circuit'
 'Phoenix street circuit' 'Prince George Circuit' 'Red Bull Ring'
 'Reims-Gueux' 'Riverside International Raceway' 'Rouen-Les-Essarts'
 'Scandinavian Raceway' 'Sebring International Raceway'
 'Sepang International Circuit' 'Shanghai International Circuit'
 'Silverstone Circuit' 'Sochi Autodrom' 'Suzuka Circuit'
 'Valencia Street Circuit' 'Watkins Glen' 'Yas Marina Circuit' 'Zeltweg'
 'Zolder']

Hmmm, the columns are not the same 🤔

  • We can see that the names in the races_raw and circuits_raw datasets are different
  • We should rename the columns to ensure we’re not merging the wrong data or creating duplicates
  • We can use the rename() method to do this
  • But first we need to understand what a dictionary is and how it can be used in Python

Dictionaries in Python 📚

What is a dictionary?

  • A dictionary is a collection of key-value pairs
  • Each key in a dictionary is unique and maps to a value
  • You can access values by referencing their keys using square brackets ([])
  • Dictionaries are defined using curly braces {} with key-value pairs separated by colons (:)
  • Fields are separated by commas (,)
  • Example of a dictionary: my_dict = {"name": "Danilo", "course": "QTM151", "city": "Atlanta"}
  • You can add, modify, or delete key-value pairs in a dictionary
  • Dictionaries are mutable, meaning they can be changed after creation

Dictionaries + Pandas

  • Let’s create a dictionary in Python
car_dictionary = {"car_model": ["Ferrari","Tesla","BMW"],
                  "year": ["2018","2023","2022"]}
car_dictionary
{'car_model': ['Ferrari', 'Tesla', 'BMW'], 'year': ['2018', '2023', '2022']}
  • Dictionaries can have any data type as values, including lists or arrays (even other dictionaries!)
matrix_dict = {'A':np.array([[1,2,3], [2,4,5]]), 'string': 'ABC'}
matrix_dict
{'A': array([[1, 2, 3],
        [2, 4, 5]]),
 'string': 'ABC'}
  • We can subset a dictionary using the key
car_dictionary['car_model']
['Ferrari', 'Tesla', 'BMW']
  • We can also add new key-value pairs to a dictionary
car_dictionary['colour'] = ["red","blue","black"]
car_dictionary
{'car_model': ['Ferrari', 'Tesla', 'BMW'],
 'year': ['2018', '2023', '2022'],
 'colour': ['red', 'blue', 'black']}
  • We can delete key-value pairs using the del keyword
del car_dictionary['colour']
car_dictionary.keys()
dict_keys(['car_model', 'year'])
  • To create a Pandas DataFrame from a dictionary, we use the pd.DataFrame() function
car_dictionary = {"car_model": ["Ferrari","Tesla","BMW","Something"],
                  "year": ["2018","2023","2022", "1993"]}

df = pd.DataFrame(car_dictionary)
df
car_model year
0 Ferrari 2018
1 Tesla 2023
2 BMW 2022
3 Something 1993

Let’s rename the columns 🔄

  • We rename columns with a dictionary using the {old_name: new_name} format
  • We pass the dictionary to the rename() method using the columns argument
circuits_raw.rename(columns={'name':'circuit_name'})
circuitId circuitRef circuit_name location country lat lng alt url
0 1 albert_park Albert Park Grand Prix Circuit Melbourne Australia -37.84970 144.96800 10 http://en.wikipedia.org/wiki/Melbourne_Grand_P...
1 2 sepang Sepang International Circuit Kuala Lumpur Malaysia 2.76083 101.73800 18 http://en.wikipedia.org/wiki/Sepang_Internatio...
2 3 bahrain Bahrain International Circuit Sakhir Bahrain 26.03250 50.51060 7 http://en.wikipedia.org/wiki/Bahrain_Internati...
3 4 catalunya Circuit de Barcelona-Catalunya Montmeló Spain 41.57000 2.26111 109 http://en.wikipedia.org/wiki/Circuit_de_Barcel...
4 5 istanbul Istanbul Park Istanbul Turkey 40.95170 29.40500 130 http://en.wikipedia.org/wiki/Istanbul_Park
... ... ... ... ... ... ... ... ... ...
72 75 portimao Autódromo Internacional do Algarve Portimão Portugal 37.22700 -8.62670 108 http://en.wikipedia.org/wiki/Algarve_Internati...
73 76 mugello Autodromo Internazionale del Mugello Mugello Italy 43.99750 11.37190 255 http://en.wikipedia.org/wiki/Mugello_Circuit
74 77 jeddah Jeddah Corniche Circuit Jeddah Saudi Arabia 21.63190 39.10440 15 http://en.wikipedia.org/wiki/Jeddah_Street_Cir...
75 78 losail Losail International Circuit Al Daayen Qatar 25.49000 51.45420 \N http://en.wikipedia.org/wiki/Losail_Internatio...
76 79 miami Miami International Autodrome Miami USA 25.95810 -80.23890 \N http://en.wikipedia.org/wiki/Miami_Internation...

77 rows × 9 columns

Let’s rename the columns 🔄

  • Let’s save the changes to a new circuits DataFrame
dict_rename_circuits = {"name": "circuit_name"}
circuits = circuits_raw.rename(columns = dict_rename_circuits)
circuits.head()
circuitId circuitRef circuit_name location country lat lng alt url
0 1 albert_park Albert Park Grand Prix Circuit Melbourne Australia -37.84970 144.96800 10 http://en.wikipedia.org/wiki/Melbourne_Grand_P...
1 2 sepang Sepang International Circuit Kuala Lumpur Malaysia 2.76083 101.73800 18 http://en.wikipedia.org/wiki/Sepang_Internatio...
2 3 bahrain Bahrain International Circuit Sakhir Bahrain 26.03250 50.51060 7 http://en.wikipedia.org/wiki/Bahrain_Internati...
3 4 catalunya Circuit de Barcelona-Catalunya Montmeló Spain 41.57000 2.26111 109 http://en.wikipedia.org/wiki/Circuit_de_Barcel...
4 5 istanbul Istanbul Park Istanbul Turkey 40.95170 29.40500 130 http://en.wikipedia.org/wiki/Istanbul_Park
  • Let’s see if the columns have been renamed
print(circuits_raw.columns.values)
['circuitId' 'circuitRef' 'name' 'location' 'country' 'lat' 'lng' 'alt'
 'url']
print(circuits.columns.values)
['circuitId' 'circuitRef' 'circuit_name' 'location' 'country' 'lat' 'lng'
 'alt' 'url']
  • It has worked! 🎉

Try it yourself! 🧠

  • Create a dictionary to rename name to race_name
  • Rename this column in the races_raw dataset
  • Store the output in a new dataset called races
  • Appendix 01

Merge data in Pandas 🔄

Using pd.merge() to combine data

  • We can use the pd.merge() function to combine data
  • First, we need to extract the columns we want to merge and the type of join we want to perform
  • The on argument specifies the column to merge on
  • The how argument specifies the type of join to perform
circuits[["circuitId","circuit_name"]].head()
circuitId circuit_name
0 1 Albert Park Grand Prix Circuit
1 2 Sepang International Circuit
2 3 Bahrain International Circuit
3 4 Circuit de Barcelona-Catalunya
4 5 Istanbul Park
circuits.shape[0] # rows
77

Using pd.merge() to combine data

  • pd.merge(data1, data2, on, how)
  • Strive to merge only specific columns of data2
  • Avoid merging all columns
  • Keeping it simple gives you more control over the output
# The "pd.merge()" command combines the information from both datasets
# The first argument is the "primary" datasets
# The second argument is the "secondary" dataset (must include the "on" column)
# The "on" is the common variable that is used for merging
# how = "left" tells Python that the left dataset is the primary one

races_merge = pd.merge(races_raw[['raceId', 'year', 'circuitId']],
                       circuits[["circuitId","circuit_name", "location"]],
                       on = "circuitId",
                       how = "left")

Using pd.merge() to combine data

  • Let’s see the results of the merge
races_merge
raceId year circuitId circuit_name location
0 1 2009 1 Albert Park Grand Prix Circuit Melbourne
1 2 2009 2 Sepang International Circuit Kuala Lumpur
2 3 2009 17 Shanghai International Circuit Shanghai
3 4 2009 3 Bahrain International Circuit Sakhir
4 5 2009 4 Circuit de Barcelona-Catalunya Montmeló
... ... ... ... ... ...
1097 1116 2023 69 Circuit of the Americas Austin
1098 1117 2023 32 Autódromo Hermanos Rodríguez Mexico City
1099 1118 2023 18 Autódromo José Carlos Pace São Paulo
1100 1119 2023 80 Las Vegas Strip Street Circuit Las Vegas
1101 1120 2023 24 Yas Marina Circuit Abu Dhabi

1102 rows × 5 columns

Using pd.merge() to combine data

  • We can extract the columns we want to see
races_merge[["raceId",  "circuitId","circuit_name"]].sort_values(by = "circuit_name")
raceId circuitId circuit_name
760 761 61 AVUS
434 435 29 Adelaide Street Circuit
335 336 29 Adelaide Street Circuit
319 320 29 Adelaide Street Circuit
370 371 29 Adelaide Street Circuit
... ... ... ...
501 502 40 Zolder
486 487 40 Zolder
470 471 40 Zolder
516 517 40 Zolder
608 609 40 Zolder

1102 rows × 3 columns

  • Another example of a merge
results_merge = pd.merge(results_raw,
                         races_raw[["raceId","date"]],
                         on = "raceId",
                         how = "left")

results_merge.head()
resultId raceId driverId constructorId number grid position positionText positionOrder points laps time milliseconds fastestLap rank fastestLapTime fastestLapSpeed statusId date
0 1 18 1 1 22 1 1 1 1 10.0 58 1:34:50.616 5690616 39 2 1:27.452 218.300 1 2008-03-16
1 2 18 2 2 3 5 2 2 2 8.0 58 +5.478 5696094 41 3 1:27.739 217.586 1 2008-03-16
2 3 18 3 3 7 7 3 3 3 6.0 58 +8.163 5698779 41 5 1:28.090 216.719 1 2008-03-16
3 4 18 4 4 5 11 4 4 4 5.0 58 +17.181 5707797 58 7 1:28.603 215.464 1 2008-03-16
4 5 18 5 1 23 3 5 5 5 4.0 58 +18.014 5708630 43 1 1:27.418 218.385 1 2008-03-16

Common pitfall: What happens if you don’t rename the columns?

  • If you don’t rename the columns, you may end up with duplicate columns
  • This can make it difficult to understand the data
  • It can also lead to errors when merging data
  • The following code merges the raw data which has the name column in races_raw and circuits_raw
races_merge_pitfall = pd.merge(races_raw,
                               circuits_raw[["circuitId","name"]],
                               on = "circuitId",
                               how = "left")
  • Let’s see the results of the merge
print(races_merge_pitfall.columns.values)
['raceId' 'year' 'round' 'circuitId' 'name_x' 'date' 'time' 'url'
 'fp1_date' 'fp1_time' 'fp2_date' 'fp2_time' 'fp3_date' 'fp3_time'
 'quali_date' 'quali_time' 'sprint_date' 'sprint_time' 'name_y']
  • We can see that the name column has been duplicated
  • Python appends _x and _y to the column names to differentiate them

Try it yourself! 🧠

  • Rename the columns name_x and name_y in the dataset races_merge_pitfall to race_name and circuit_name
  • HINT: Create a dictionary and use .rename()
  • Appendix 02

Try it yourself! 🧠

  • Merge the column alt, lng, and lat into the races data using pd.merge()
  • HINT: Use the circuitId as the key
  • Appendix 03

Concatenating data in Pandas

pd.concat()

  • We can use the pd.concat() function to concatenate data

  • We can concatenate data along the rows or columns

  • The method works best if columns are identical

  • There are also other advanced options if they are not, please check the documentation for more information

  • Let’s use .query() to split the data into different parts

circuits['country'].unique()
array(['Australia', 'Malaysia', 'Bahrain', 'Spain', 'Turkey', 'Monaco',
       'Canada', 'France', 'UK', 'Germany', 'Hungary', 'Belgium', 'Italy',
       'Singapore', 'Japan', 'China', 'Brazil', 'USA', 'United States',
       'UAE', 'Argentina', 'Portugal', 'South Africa', 'Mexico', 'Korea',
       'Netherlands', 'Sweden', 'Austria', 'Morocco', 'Switzerland',
       'India', 'Russia', 'Azerbaijan', 'Saudi Arabia', 'Qatar'],
      dtype=object)
circuits_spain = circuits.query('country == "Spain"')
circuits_usa = circuits.query('country == "United States" | country == "USA"')
circuits_malaysia = circuits.query('country == "Malaysia"')

Concatenating data in Pandas

circuits_concat = pd.concat([circuits_spain, circuits_usa, circuits_malaysia])
circuits_concat.head()
circuitId circuitRef circuit_name location country lat lng alt url
3 4 catalunya Circuit de Barcelona-Catalunya Montmeló Spain 41.5700 2.261110 109 http://en.wikipedia.org/wiki/Circuit_de_Barcel...
11 12 valencia Valencia Street Circuit Valencia Spain 39.4589 -0.331667 4 http://en.wikipedia.org/wiki/Valencia_Street_C...
25 26 jerez Circuito de Jerez Jerez de la Frontera Spain 36.7083 -6.034170 37 http://en.wikipedia.org/wiki/Circuito_Permanen...
44 45 jarama Jarama Madrid Spain 40.6171 -3.585580 609 http://en.wikipedia.org/wiki/Circuito_Permanen...
48 49 montjuic Montjuïc Barcelona Spain 41.3664 2.151670 79 http://en.wikipedia.org/wiki/Montju%C3%AFc_cir...

Concatenating data in Pandas

  • If we drop observations from the original dataset, they will be removed from the concatenated dataset (NaN values will be added)
circuits_spain_drop = circuits_spain.drop(columns=['circuitRef', 'location'])
pd.concat([circuits_spain_drop, circuits_usa])
circuitId circuit_name country lat lng alt url circuitRef location
3 4 Circuit de Barcelona-Catalunya Spain 41.5700 2.261110 109 http://en.wikipedia.org/wiki/Circuit_de_Barcel... NaN NaN
11 12 Valencia Street Circuit Spain 39.4589 -0.331667 4 http://en.wikipedia.org/wiki/Valencia_Street_C... NaN NaN
25 26 Circuito de Jerez Spain 36.7083 -6.034170 37 http://en.wikipedia.org/wiki/Circuito_Permanen... NaN NaN
44 45 Jarama Spain 40.6171 -3.585580 609 http://en.wikipedia.org/wiki/Circuito_Permanen... NaN NaN
48 49 Montjuïc Spain 41.3664 2.151670 79 http://en.wikipedia.org/wiki/Montju%C3%AFc_cir... NaN NaN
66 67 Circuit de Pedralbes Spain 41.3903 2.116670 85 http://en.wikipedia.org/wiki/Pedralbes_Circuit NaN NaN
18 19 Indianapolis Motor Speedway USA 39.7950 -86.234700 223 http://en.wikipedia.org/wiki/Indianapolis_Moto... indianapolis Indianapolis
22 80 Las Vegas Strip Street Circuit United States 36.1147 -115.173000 \N https://en.wikipedia.org/wiki/Las_Vegas_Grand_... vegas Las Vegas
32 33 Phoenix street circuit USA 33.4479 -112.075000 345 http://en.wikipedia.org/wiki/Phoenix_street_ci... phoenix Phoenix
36 37 Detroit Street Circuit USA 42.3298 -83.040100 177 http://en.wikipedia.org/wiki/Detroit_street_ci... detroit Detroit
41 42 Fair Park USA 32.7774 -96.758700 139 http://en.wikipedia.org/wiki/Fair_Park dallas Dallas
42 43 Long Beach USA 33.7651 -118.189000 12 http://en.wikipedia.org/wiki/Long_Beach,_Calif... long_beach California
43 44 Las Vegas Street Circuit USA 36.1162 -115.174000 639 http://en.wikipedia.org/wiki/Las_Vegas_Street_... las_vegas Nevada
45 46 Watkins Glen USA 42.3369 -76.927200 485 http://en.wikipedia.org/wiki/Watkins_Glen_Inte... watkins_glen New York State
59 60 Riverside International Raceway USA 33.9370 -117.273000 470 http://en.wikipedia.org/wiki/Riverside_Interna... riverside California
62 63 Sebring International Raceway USA 27.4547 -81.348300 18 http://en.wikipedia.org/wiki/Sebring_Raceway sebring Florida
68 69 Circuit of the Americas USA 30.1328 -97.641100 161 http://en.wikipedia.org/wiki/Circuit_of_the_Am... americas Austin
76 79 Miami International Autodrome USA 25.9581 -80.238900 \N http://en.wikipedia.org/wiki/Miami_Internation... miami Miami

Try it yourself! 🧠

  • Concatenate the USA and Malaysia datasets
  • Appendix 04

And that’s it for today! 🎉

Have a very nice week! 😊

Appendix 01

dict_rename = {"name": "race_name"}
races_raw.rename(columns = dict_rename).head()
raceId year round circuitId race_name date time url fp1_date fp1_time fp2_date fp2_time fp3_date fp3_time quali_date quali_time sprint_date sprint_time
0 1 2009 1 1 Australian Grand Prix 2009-03-29 06:00:00 http://en.wikipedia.org/wiki/2009_Australian_G... \N \N \N \N \N \N \N \N \N \N
1 2 2009 2 2 Malaysian Grand Prix 2009-04-05 09:00:00 http://en.wikipedia.org/wiki/2009_Malaysian_Gr... \N \N \N \N \N \N \N \N \N \N
2 3 2009 3 17 Chinese Grand Prix 2009-04-19 07:00:00 http://en.wikipedia.org/wiki/2009_Chinese_Gran... \N \N \N \N \N \N \N \N \N \N
3 4 2009 4 3 Bahrain Grand Prix 2009-04-26 12:00:00 http://en.wikipedia.org/wiki/2009_Bahrain_Gran... \N \N \N \N \N \N \N \N \N \N
4 5 2009 5 4 Spanish Grand Prix 2009-05-10 12:00:00 http://en.wikipedia.org/wiki/2009_Spanish_Gran... \N \N \N \N \N \N \N \N \N \N
races = races_raw.rename(columns = dict_rename)
races.columns
Index(['raceId', 'year', 'round', 'circuitId', 'race_name', 'date', 'time',
       'url', 'fp1_date', 'fp1_time', 'fp2_date', 'fp2_time', 'fp3_date',
       'fp3_time', 'quali_date', 'quali_time', 'sprint_date', 'sprint_time'],
      dtype='object')

Back to Exercise 01

Appendix 02

rename = {"name_x": "race_name", "name_y": "circuit_name"}
races_merge_pitfall.rename(columns = rename).info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1102 entries, 0 to 1101
Data columns (total 19 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   raceId        1102 non-null   int64 
 1   year          1102 non-null   int64 
 2   round         1102 non-null   int64 
 3   circuitId     1102 non-null   int64 
 4   race_name     1102 non-null   object
 5   date          1102 non-null   object
 6   time          1102 non-null   object
 7   url           1102 non-null   object
 8   fp1_date      1102 non-null   object
 9   fp1_time      1102 non-null   object
 10  fp2_date      1102 non-null   object
 11  fp2_time      1102 non-null   object
 12  fp3_date      1102 non-null   object
 13  fp3_time      1102 non-null   object
 14  quali_date    1102 non-null   object
 15  quali_time    1102 non-null   object
 16  sprint_date   1102 non-null   object
 17  sprint_time   1102 non-null   object
 18  circuit_name  1102 non-null   object
dtypes: int64(4), object(15)
memory usage: 163.7+ KB

Back to Exercise 02

Appendix 03

races02 = pd.merge(races,
                   circuits[["circuitId","alt","lng","lat"]],
                   on = "circuitId",
                   how = "left")
races02.head()
raceId year round circuitId race_name date time url fp1_date fp1_time ... fp2_time fp3_date fp3_time quali_date quali_time sprint_date sprint_time alt lng lat
0 1 2009 1 1 Australian Grand Prix 2009-03-29 06:00:00 http://en.wikipedia.org/wiki/2009_Australian_G... \N \N ... \N \N \N \N \N \N \N 10 144.96800 -37.84970
1 2 2009 2 2 Malaysian Grand Prix 2009-04-05 09:00:00 http://en.wikipedia.org/wiki/2009_Malaysian_Gr... \N \N ... \N \N \N \N \N \N \N 18 101.73800 2.76083
2 3 2009 3 17 Chinese Grand Prix 2009-04-19 07:00:00 http://en.wikipedia.org/wiki/2009_Chinese_Gran... \N \N ... \N \N \N \N \N \N \N 5 121.22000 31.33890
3 4 2009 4 3 Bahrain Grand Prix 2009-04-26 12:00:00 http://en.wikipedia.org/wiki/2009_Bahrain_Gran... \N \N ... \N \N \N \N \N \N \N 7 50.51060 26.03250
4 5 2009 5 4 Spanish Grand Prix 2009-05-10 12:00:00 http://en.wikipedia.org/wiki/2009_Spanish_Gran... \N \N ... \N \N \N \N \N \N \N 109 2.26111 41.57000

5 rows × 21 columns

Back to Exercise 03

Appendix 04

circuits_usa_malaysia = pd.concat([circuits_usa, circuits_malaysia])
circuits_usa_malaysia.head()
circuitId circuitRef circuit_name location country lat lng alt url
18 19 indianapolis Indianapolis Motor Speedway Indianapolis USA 39.7950 -86.2347 223 http://en.wikipedia.org/wiki/Indianapolis_Moto...
22 80 vegas Las Vegas Strip Street Circuit Las Vegas United States 36.1147 -115.1730 \N https://en.wikipedia.org/wiki/Las_Vegas_Grand_...
32 33 phoenix Phoenix street circuit Phoenix USA 33.4479 -112.0750 345 http://en.wikipedia.org/wiki/Phoenix_street_ci...
36 37 detroit Detroit Street Circuit Detroit USA 42.3298 -83.0401 177 http://en.wikipedia.org/wiki/Detroit_street_ci...
41 42 dallas Fair Park Dallas USA 32.7774 -96.7587 139 http://en.wikipedia.org/wiki/Fair_Park

Back to Exercise 04