QTM 151 - Intro to Statistical Programming II

Lecture 13 - Replacing and Recoding Variables

Danilo Freire

Emory University

14 October, 2024

Nice to see you again! How was your break? 🍂

Today’s plan 📅

Creating and replacing variables

Today we will…

  • Learn how to recode and replace variables in a dataset
  • Specifically focus on replacing NaN values (“Not a Number” - missing data)
  • Cover how to convert variables from one type to another
  • Learn how to create new variables based on existing ones
  • We will use our friends pandas and numpy to do this 🐼

Loading packages and dataset 📦

Our dataset: Formula 1 World Championships 🏁🏎️

  • First, we will load the packages we need
import pandas as pd
import numpy as np
  • Then, we will load the dataset
circuits = pd.read_csv("https://github.com/danilofreire/qtm151/raw/refs/heads/main/lectures/lecture-13/data_raw/circuits.csv")

# Or open it from the local file
# circuits = pd.read_csv("data_raw/circuits.csv")

display(circuits.head(2))
circuitId circuitRef name location country lat lng alt url
0 1 albert_park Albert Park Grand Prix Circuit Melbourne Australia -37.84970 144.968 10 http://en.wikipedia.org/wiki/Melbourne_Grand_P...
1 2 sepang Sepang International Circuit Kuala Lumpur Malaysia 2.76083 101.738 18 http://en.wikipedia.org/wiki/Sepang_Internatio...
  • The dataset contains information about F1 circuits, such as its name, location, latitude, longitude, and more
  • You can find more information about the dataset here
  • The dataset is available in the course’s GitHub repository here
    • Or you can download it using the command above
  • Let’s see how the codebook looks like
  • More information about Formula 1 here

Codebook 📚

  • Field - Name of the variable
  • Type - Type of the variable
    • Integer (int), string (str - varchart), and float (float)
  • Description - Label with a description of the variable
  • Quick discussion: What does varchart(255) mean?
  • The dataset has 9 columns (variables) and 77 rows (observations)
  • The columns are:
    • circuitId: Unique identifier for the circuit
    • circuitRef: Unique reference for the circuit
    • name: Name of the circuit
    • location: Location
    • country: Country where the circuit is located
    • lat: Latitude
    • lng: Longitude
    • alt: Altitude
    • url: URL of the circuit’s Wikipedia page

NaN values 🚫

What is a NaN value?

  • NaN stands for “Not a Number”
  • It is a special value in Python that represents missing data
  • NaN values can be found in datasets for various reasons
    • Data entry errors
    • Data cleaning and processing errors
    • Data collection errors
    • Data transformation errors
  • We (often) need to handle NaN values before we can analyse the data
  • NaN values can be found in different types of variables
    • Numeric variables
    • Categorical variables
    • Date variables
    • Text variables
  • We will focus on numeric variables today
  • pandas and numpy have functions to handle NaN values
    • Note: they handle NaN values differently!

Operations with NaN values

  • NaN is a special number, available in numpy
np.nan
nan
  • Often, we cannot perform operations with NaN values
  • Thus, we need to handle them before we can analyse the data
  • Let’s see some examples. We start with numpy arrays
# Create two array with and without "NaNs"
# The "np.array()" functions converts 
# a list to an array

vec_without_nans = np.array([1,1,1])
vec_with_nans    = np.array([np.nan,4,5])

# When you add the vectors
# you will produce a NaN 
# on any entries with "NaNs"
print(vec_without_nans * vec_with_nans)
print(vec_without_nans / vec_with_nans)
print(vec_without_nans + vec_with_nans)
print(vec_without_nans - vec_with_nans)
[nan  4.  5.]
[ nan 0.25 0.2 ]
[nan  5.  6.]
[nan -3. -4.]

Summary statistics with NaN values

Arrays

  • Some summary statistics functions will not work with NaN values
  • For example, the mean() function
print(np.mean(vec_with_nans))
nan
  • The mean() function will return NaN if there are NaN values in the array
  • To calculate the mean without NaN values, we can use the nanmean() function
print(np.nanmean(vec_with_nans))
4.5
  • The nanmean() function will ignore NaN values and calculate the mean with the remaining values

Summary statistics with NaN values

Pandas DataFrames

  • Let’s create an empty DataFrame and create a new column x with NaN values
dataset = pd.DataFrame()
dataset["x"] = vec_with_nans
dataset
x
0 NaN
1 4.0
2 5.0
  • You will see that pandas will handle NaN values differently: it will ignore them
print(dataset["x"].mean())
4.5
  • For R users: This is the same as na.rm = TRUE in R. pandas does that by default

Data Cleaning 🧹🧽

Data cleaning

  • Data cleaning is the process of preparing data for analysis
  • It involves identifying and handling missing data, outliers, and other data quality issues
  • You guys have no idea how much time you will spend cleaning data in your life 😅
  • According to a Forbes survey, data scientists spend 60% of their time cleaning and preparing data, and 57% say it’s the least enjoyable part of their work
    • I can really relate to that 😂
  • But remember that clean data are good data 🥳
  • Let’s get the data types of the columns in the circuits dataset
  • We use the command dtypes for that
  • object means that the variable is a string or a variable with mixed types (e.g., numbers and strings)
circuits.dtypes
circuitId       int64
circuitRef     object
name           object
location       object
country        object
lat           float64
lng           float64
alt            object
url            object
dtype: object

Check rows with numeric values

  • Here we will use the .str.isnumeric() function
  • This function actually combines two functions: .str and .isnumeric()
  • The .str function is used to check if the variable is a string
  • The .isnumeric() part is used to check if the string is numeric
  • Why do we need both? Because DataFrame columns often contain mixed data types (e.g., numbers and strings), and we need to check if the variable is a string before we can check if it is numeric
  • If we used only .isnumeric(), we would get an error (trust me, I tried 😅)
  • The two dots between the functions are called method chaining
  • It is a way to call multiple functions in a single line of code
  • If you use R, this is similar to the %>% operator in dplyr
  • Let’s see how it works
# Check if the variable "alt" is numeric
circuits["alt"].str.isnumeric()
0      True
1      True
2      True
3      True
4      True
      ...  
72     True
73     True
74     True
75    False
76    False
Name: alt, Length: 77, dtype: bool

Other examples of chaining methods

# Check if the variable "circuitRef" is numeric
circuits["circuitRef"].str.isnumeric()
0     False
1     False
2     False
3     False
4     False
      ...  
72    False
73    False
74    False
75    False
76    False
Name: circuitRef, Length: 77, dtype: bool
# Convert the variable `location` to lowercase
circuits["location"].str.lower()
0        melbourne
1     kuala lumpur
2           sakhir
3         montmeló
4         istanbul
          ...     
72        portimão
73         mugello
74          jeddah
75       al daayen
76           miami
Name: location, Length: 77, dtype: object

Extract list of non-numeric values

  • Remember .query()? We can use it to reference subattributes of a variable
  • Here we will combine it with pd.unique() to extract a list of non-numeric values
  • The pd.unique() function will return a list of unique values in a variable
# Extract a list of non-numeric values
# The pd.unique() function extracts unique values from a list
# Check each value in the alt column to see if it is not numeric
# True if it is not numeric, False if it is numeric
subset = circuits.query("alt.str.isnumeric() == False")
list_unique = pd.unique(subset["alt"])
print(list_unique)
['\\N' '-7']

Replace certain values

  • The replace function is used to replace values in a variable
  • The syntax is dataframe["variable"].replace(list_old, list_new)
  • More information about the function can be found here
# "list_old" encodes values we want to change
# From the list_unique, we see that the values we want to change are '\N' and '-7'
# "list_new" encodes the values that will replace the old
list_old = ['\\N','-7']
list_new = [np.nan, -7]

# This command replaces the values of the "alt" column
circuits["alt"] = circuits["alt"].replace(list_old, list_new)
  • After the cleaning process is done, you may want to store the dataset again
  • It’s strongly recommended to do this in a separate file from the original
  • Use to_csv() to save the dataset as a .csv file
circuits.to_csv("data_clean/circuits_clean.csv", index=False)

Try it yourself! 🧠

  • Use .replace() with the “country” column
  • Replace “UK” with “United Kingdom”
  • Appendix 01

Try it yourself! 🧠

  • What is the column type of “lat” or “lng”?
  • Does it have any string variables?
  • Can we use str.isnumeric() here?

Recoding Numeric Variables 🔄

Recoding numeric variables

  • Recoding is the process of changing the values of a variable
  • We can recode variables for various reasons
    • To create new variables
    • To standardise variables
    • To simplify the analysis
  • Please remember to convert the variable to the correct type before recoding
# Check the data type of the "alt" column
circuits["alt"].dtype
dtype('O')
  • pd.to_numeric() is used to convert a variable to a numeric type
# pd.to_numeric() converts 
# a column to numeric
# Before you use this option, 
# make sure to "clean" the variable
# as we did before by checking what
# the non-numeric values are
circuits["alt_numeric"] = pd.to_numeric(circuits["alt"])
print(circuits["alt_numeric"].mean())
248.1891891891892
print(circuits["alt_numeric"].min())
print(circuits["alt_numeric"].max())
-7.0
2227.0

Recode variables based on an interval

  • Imagine that we want to recode the alt variable into an interval

\[x_{bin} = \begin{cases} "A" &\text{ if } x_1 < x \le x_2 \\ "B" &\text{ if } x_2 < x \le x_3 \end{cases} \]

  • We can use the pd.cut() function to do this
  • The syntax is df["new_variable"] = pd.cut(df["variable"], bins = [x1, x2, x3], labels = ["A", "B"])
  • Where bins are the intervals and labels are the new values
  • More information about the function can be found here
# Recode the "alt" variable into an interval
bins_x = [0, 2500, 5000]
labels_x = ["Between 0 and 2500",
            "Between 2500 and 5000"]

circuits["bins_alt"] = pd.cut(circuits["alt_numeric"],
                              bins = bins_x,
                              right = True,
                              labels = labels_x)
np.random.seed(2014)
display(circuits.sample(5))
circuitId circuitRef name location country lat lng alt url alt_numeric bins_alt
75 78 losail Losail International Circuit Al Daayen Qatar 25.4900 51.45420 NaN http://en.wikipedia.org/wiki/Losail_Internatio... NaN NaN
52 53 essarts Rouen-Les-Essarts Rouen France 49.3306 1.00458 81 http://en.wikipedia.org/wiki/Rouen-Les-Essarts 81.0 Between 0 and 2500
36 37 detroit Detroit Street Circuit Detroit USA 42.3298 -83.04010 177 http://en.wikipedia.org/wiki/Detroit_street_ci... 177.0 Between 0 and 2500
76 79 miami Miami International Autodrome Miami USA 25.9581 -80.23890 NaN http://en.wikipedia.org/wiki/Miami_Internation... NaN NaN
70 71 sochi Sochi Autodrom Sochi Russia 43.4057 39.95780 2 http://en.wikipedia.org/wiki/Sochi_Autodrom 2.0 Between 0 and 2500

Try it yourself! 🧠

  • Create a new variable “hemisphere”
  • Encode lattitude in (-90 and 0] as “south”
  • Encode lattitude in (0 and 90] as “north”
  • Appendix 03

And that’s it for today! 🎉

See you next time! 🚀

Appendix 01

# Replace "UK" with "United Kingdom"
circuits["country"] = circuits["country"].replace("UK", "United Kingdom")

# Check the rows where the replacement was made
uk = circuits.query("country == 'United Kingdom'")
display(uk.head(5))
circuitId circuitRef name location country lat lng alt url alt_numeric bins_alt
8 9 silverstone Silverstone Circuit Silverstone United Kingdom 52.0786 -1.016940 153 http://en.wikipedia.org/wiki/Silverstone_Circuit 153.0 Between 0 and 2500
30 31 donington Donington Park Castle Donington United Kingdom 52.8306 -1.375280 88 http://en.wikipedia.org/wiki/Donington_Park 88.0 Between 0 and 2500
37 38 brands_hatch Brands Hatch Kent United Kingdom 51.3569 0.263056 145 http://en.wikipedia.org/wiki/Brands_Hatch 145.0 Between 0 and 2500
57 58 aintree Aintree Liverpool United Kingdom 53.4769 -2.940560 20 http://en.wikipedia.org/wiki/Aintree_Motor_Rac... 20.0 Between 0 and 2500

Back to exercise

Appendix 02

# Check the data type of the "lat" column
print(circuits["lat"].dtype)
float64
# Check if the "lat" column has any string values
# print(circuits["lat"].str.isnumeric())
# No, it gives an error!
 
# This one doesn't work either
# print(circuits.query("lat.str.isnumeric() == False"))

# We can use the following command to check if the column has any string values
print(circuits["lat"].dtype == 'string')
False

Back to exercise

Appendix 03

# Create a new variable "hemisphere"
# Encode lattitude in (-90 and 0] as "south"
# Encode lattitude in (0 and 90] as "north"
bins_lat = [-90, 0, 90]
labels_lat = ["south", "north"]

circuits["hemisphere"] = pd.cut(circuits["lat"],
                                bins = bins_lat,
                                right = True,
                                labels = labels_lat)
np.random.seed(151)
display(circuits.sample(5))
circuitId circuitRef name location country lat lng alt url alt_numeric bins_alt hemisphere
22 80 vegas Las Vegas Strip Street Circuit Las Vegas United States 36.11470 -115.17300 NaN https://en.wikipedia.org/wiki/Las_Vegas_Grand_... NaN NaN north
61 62 monsanto Monsanto Park Circuit Lisbon Portugal 38.71970 -9.20306 158 http://en.wikipedia.org/wiki/Monsanto_Park_Cir... 158.0 Between 0 and 2500 north
7 8 magny_cours Circuit de Nevers Magny-Cours Magny Cours France 46.86420 3.16361 228 http://en.wikipedia.org/wiki/Circuit_de_Nevers... 228.0 Between 0 and 2500 north
1 2 sepang Sepang International Circuit Kuala Lumpur Malaysia 2.76083 101.73800 18 http://en.wikipedia.org/wiki/Sepang_Internatio... 18.0 Between 0 and 2500 north
28 29 adelaide Adelaide Street Circuit Adelaide Australia -34.92720 138.61700 58 http://en.wikipedia.org/wiki/Adelaide_Street_C... 58.0 Between 0 and 2500 south

Back to exercise