QTM 151 - Intro to Statistical Programming II

Lecture 13 - Replacing and Recoding Variables

Danilo Freire

danilo.freire@emory.edu

Emory University

Nice to see you again! How was your break? 🍂

Today’s plan 📅

Creating and replacing variables

Today we will…

Learn how to recode and replace variables in a dataset
Specifically focus on replacing NaN values (“Not a Number” - missing data)
Cover how to convert variables from one type to another
Learn how to create new variables based on existing ones
We will use our friends pandas and numpy to do this 🐼

Loading packages and dataset 📦

Our dataset: Formula 1 World Championships 🏁🏎️

First, we will load the packages we need

import pandas as pd
import numpy as np

Then, we will load the dataset

circuits = pd.read_csv("https://github.com/danilofreire/qtm151/raw/refs/heads/main/lectures/lecture-13/data_raw/circuits.csv")

# Or open it from the local file
# circuits = pd.read_csv("data_raw/circuits.csv")

display(circuits.head(2))

	circuitId	circuitRef	name	location	country	lat	lng	alt	url
0	1	albert_park	Albert Park Grand Prix Circuit	Melbourne	Australia	-37.84970	144.968	10	http://en.wikipedia.org/wiki/Melbourne_Grand_P...
1	2	sepang	Sepang International Circuit	Kuala Lumpur	Malaysia	2.76083	101.738	18	http://en.wikipedia.org/wiki/Sepang_Internatio...

The dataset contains information about F1 circuits, such as its name, location, latitude, longitude, and more
You can find more information about the dataset here
The dataset is available in the course’s GitHub repository here
- Or you can download it using the command above
Let’s see how the codebook looks like
More information about Formula 1 here

Codebook 📚

Field - Name of the variable
Type - Type of the variable
- Integer (int), string (str - varchart), and float (float)
Description - Label with a description of the variable
Quick discussion: What does varchart(255) mean?

The dataset has 9 columns (variables) and 77 rows (observations)
The columns are:
- circuitId: Unique identifier for the circuit
- circuitRef: Unique reference for the circuit
- name: Name of the circuit
- location: Location
- country: Country where the circuit is located
- lat: Latitude
- lng: Longitude
- alt: Altitude
- url: URL of the circuit’s Wikipedia page

NaN values 🚫

What is a `NaN` value?

NaN stands for “Not a Number”
It is a special value in Python that represents missing data
NaN values can be found in datasets for various reasons
- Data entry errors
- Data cleaning and processing errors
- Data collection errors
- Data transformation errors
We (often) need to handle NaN values before we can analyse the data

NaN values can be found in different types of variables
- Numeric variables
- Categorical variables
- Date variables
- Text variables
We will focus on numeric variables today
pandas and numpy have functions to handle NaN values
- Note: they handle NaN values differently!

Operations with `NaN` values

NaN is a special number, available in numpy

np.nan

nan

Often, we cannot perform operations with NaN values
Thus, we need to handle them before we can analyse the data

Let’s see some examples. We start with numpy arrays

# Create two array with and without "NaNs"
# The "np.array()" functions converts 
# a list to an array

vec_without_nans = np.array([1,1,1])
vec_with_nans    = np.array([np.nan,4,5])

# When you add the vectors
# you will produce a NaN 
# on any entries with "NaNs"
print(vec_without_nans * vec_with_nans)
print(vec_without_nans / vec_with_nans)
print(vec_without_nans + vec_with_nans)
print(vec_without_nans - vec_with_nans)

[nan  4.  5.]
[ nan 0.25 0.2 ]
[nan  5.  6.]
[nan -3. -4.]

Summary statistics with `NaN` values

Arrays

Some summary statistics functions will not work with NaN values
For example, the mean() function

print(np.mean(vec_with_nans))

nan

The mean() function will return NaN if there are NaN values in the array

To calculate the mean without NaN values, we can use the nanmean() function

print(np.nanmean(vec_with_nans))

4.5

The nanmean() function will ignore NaN values and calculate the mean with the remaining values

Summary statistics with `NaN` values

Pandas DataFrames

Let’s create an empty DataFrame and create a new column x with NaN values

dataset = pd.DataFrame()
dataset["x"] = vec_with_nans
dataset

	x
0	NaN
1	4.0
2	5.0

You will see that pandas will handle NaN values differently: it will ignore them

print(dataset["x"].mean())

4.5

For R users: This is the same as na.rm = TRUE in R. pandas does that by default

Data Cleaning 🧹🧽

Data cleaning

Data cleaning is the process of preparing data for analysis
It involves identifying and handling missing data, outliers, and other data quality issues
You guys have no idea how much time you will spend cleaning data in your life 😅
According to a Forbes survey, data scientists spend 60% of their time cleaning and preparing data, and 57% say it’s the least enjoyable part of their work
- I can really relate to that 😂
But remember that clean data are good data 🥳

Let’s get the data types of the columns in the circuits dataset
We use the command dtypes for that
object means that the variable is a string or a variable with mixed types (e.g., numbers and strings)

circuits.dtypes

circuitId       int64
circuitRef     object
name           object
location       object
country        object
lat           float64
lng           float64
alt            object
url            object
dtype: object

Check rows with numeric values

Here we will use the .str.isnumeric() function
This function actually combines two functions: .str and .isnumeric()
The .str function is used to check if the variable is a string
The .isnumeric() part is used to check if the string is numeric
Why do we need both? Because DataFrame columns often contain mixed data types (e.g., numbers and strings), and we need to check if the variable is a string before we can check if it is numeric
If we used only .isnumeric(), we would get an error (trust me, I tried 😅)

The two dots between the functions are called method chaining
It is a way to call multiple functions in a single line of code
If you use R, this is similar to the %>% operator in dplyr
Let’s see how it works

# Check if the variable "alt" is numeric
circuits["alt"].str.isnumeric()

0      True
1      True
2      True
3      True
4      True
      ...  
72     True
73     True
74     True
75    False
76    False
Name: alt, Length: 77, dtype: bool

Other examples of chaining methods

# Check if the variable "circuitRef" is numeric
circuits["circuitRef"].str.isnumeric()

0     False
1     False
2     False
3     False
4     False
      ...  
72    False
73    False
74    False
75    False
76    False
Name: circuitRef, Length: 77, dtype: bool

# Convert the variable `location` to lowercase
circuits["location"].str.lower()

0        melbourne
1     kuala lumpur
2           sakhir
3         montmeló
4         istanbul
          ...     
72        portimão
73         mugello
74          jeddah
75       al daayen
76           miami
Name: location, Length: 77, dtype: object

Extract list of non-numeric values

Remember .query()? We can use it to reference subattributes of a variable
Here we will combine it with pd.unique() to extract a list of non-numeric values
The pd.unique() function will return a list of unique values in a variable

# Extract a list of non-numeric values
# The pd.unique() function extracts unique values from a list
# Check each value in the alt column to see if it is not numeric
# True if it is not numeric, False if it is numeric
subset = circuits.query("alt.str.isnumeric() == False")
list_unique = pd.unique(subset["alt"])
print(list_unique)

['\\N' '-7']

Replace certain values

The replace function is used to replace values in a variable
The syntax is dataframe["variable"].replace(list_old, list_new)
More information about the function can be found here

# "list_old" encodes values we want to change
# From the list_unique, we see that the values we want to change are '\N' and '-7'
# "list_new" encodes the values that will replace the old
list_old = ['\\N','-7']
list_new = [np.nan, -7]

# This command replaces the values of the "alt" column
circuits["alt"] = circuits["alt"].replace(list_old, list_new)

After the cleaning process is done, you may want to store the dataset again
It’s strongly recommended to do this in a separate file from the original
Use to_csv() to save the dataset as a .csv file

circuits.to_csv("data_clean/circuits_clean.csv", index=False)

Try it yourself! 🧠

Use .replace() with the “country” column
Replace “UK” with “United Kingdom”
Appendix 01

Try it yourself! 🧠

What is the column type of “lat” or “lng”?
Does it have any string variables?
Can we use str.isnumeric() here?

Recoding Numeric Variables 🔄

Recoding numeric variables

Recoding is the process of changing the values of a variable
We can recode variables for various reasons
- To create new variables
- To standardise variables
- To simplify the analysis
Please remember to convert the variable to the correct type before recoding

# Check the data type of the "alt" column
circuits["alt"].dtype

dtype('O')

pd.to_numeric() is used to convert a variable to a numeric type

# pd.to_numeric() converts 
# a column to numeric
# Before you use this option, 
# make sure to "clean" the variable
# as we did before by checking what
# the non-numeric values are
circuits["alt_numeric"] = pd.to_numeric(circuits["alt"])
print(circuits["alt_numeric"].mean())

248.1891891891892

print(circuits["alt_numeric"].min())
print(circuits["alt_numeric"].max())

-7.0
2227.0

Recode variables based on an interval

Imagine that we want to recode the alt variable into an interval

\[x_{bin} = \begin{cases} "A" &\text{ if } x_1 < x \le x_2 \\ "B" &\text{ if } x_2 < x \le x_3 \end{cases} \]

We can use the pd.cut() function to do this
The syntax is df["new_variable"] = pd.cut(df["variable"], bins = [x1, x2, x3], labels = ["A", "B"])
Where bins are the intervals and labels are the new values
More information about the function can be found here

# Recode the "alt" variable into an interval
bins_x = [0, 2500, 5000]
labels_x = ["Between 0 and 2500",
            "Between 2500 and 5000"]

circuits["bins_alt"] = pd.cut(circuits["alt_numeric"],
                              bins = bins_x,
                              right = True,
                              labels = labels_x)
np.random.seed(2014)
display(circuits.sample(5))

	circuitId	circuitRef	name	location	country	lat	lng	alt	url	alt_numeric	bins_alt
75	78	losail	Losail International Circuit	Al Daayen	Qatar	25.4900	51.45420	NaN	http://en.wikipedia.org/wiki/Losail_Internatio...	NaN	NaN
52	53	essarts	Rouen-Les-Essarts	Rouen	France	49.3306	1.00458	81	http://en.wikipedia.org/wiki/Rouen-Les-Essarts	81.0	Between 0 and 2500
36	37	detroit	Detroit Street Circuit	Detroit	USA	42.3298	-83.04010	177	http://en.wikipedia.org/wiki/Detroit_street_ci...	177.0	Between 0 and 2500
76	79	miami	Miami International Autodrome	Miami	USA	25.9581	-80.23890	NaN	http://en.wikipedia.org/wiki/Miami_Internation...	NaN	NaN
70	71	sochi	Sochi Autodrom	Sochi	Russia	43.4057	39.95780	2	http://en.wikipedia.org/wiki/Sochi_Autodrom	2.0	Between 0 and 2500

Try it yourself! 🧠

Create a new variable “hemisphere”
Encode lattitude in (-90 and 0] as “south”
Encode lattitude in (0 and 90] as “north”
Appendix 03

And that’s it for today! 🎉

See you next time! 🚀

Appendix 01

# Replace "UK" with "United Kingdom"
circuits["country"] = circuits["country"].replace("UK", "United Kingdom")

# Check the rows where the replacement was made
uk = circuits.query("country == 'United Kingdom'")
display(uk.head(5))

	circuitId	circuitRef	name	location	country	lat	lng	alt	url	alt_numeric	bins_alt
8	9	silverstone	Silverstone Circuit	Silverstone	United Kingdom	52.0786	-1.016940	153	http://en.wikipedia.org/wiki/Silverstone_Circuit	153.0	Between 0 and 2500
30	31	donington	Donington Park	Castle Donington	United Kingdom	52.8306	-1.375280	88	http://en.wikipedia.org/wiki/Donington_Park	88.0	Between 0 and 2500
37	38	brands_hatch	Brands Hatch	Kent	United Kingdom	51.3569	0.263056	145	http://en.wikipedia.org/wiki/Brands_Hatch	145.0	Between 0 and 2500
57	58	aintree	Aintree	Liverpool	United Kingdom	53.4769	-2.940560	20	http://en.wikipedia.org/wiki/Aintree_Motor_Rac...	20.0	Between 0 and 2500

Back to exercise

Appendix 02

# Check the data type of the "lat" column
print(circuits["lat"].dtype)

float64

# Check if the "lat" column has any string values
# print(circuits["lat"].str.isnumeric())
# No, it gives an error!
 
# This one doesn't work either
# print(circuits.query("lat.str.isnumeric() == False"))

# We can use the following command to check if the column has any string values
print(circuits["lat"].dtype == 'string')

False

Back to exercise

Appendix 03

# Create a new variable "hemisphere"
# Encode lattitude in (-90 and 0] as "south"
# Encode lattitude in (0 and 90] as "north"
bins_lat = [-90, 0, 90]
labels_lat = ["south", "north"]

circuits["hemisphere"] = pd.cut(circuits["lat"],
                                bins = bins_lat,
                                right = True,
                                labels = labels_lat)
np.random.seed(151)
display(circuits.sample(5))

	circuitId	circuitRef	name	location	country	lat	lng	alt	url	alt_numeric	bins_alt	hemisphere
22	80	vegas	Las Vegas Strip Street Circuit	Las Vegas	United States	36.11470	-115.17300	NaN	https://en.wikipedia.org/wiki/Las_Vegas_Grand_...	NaN	NaN	north
61	62	monsanto	Monsanto Park Circuit	Lisbon	Portugal	38.71970	-9.20306	158	http://en.wikipedia.org/wiki/Monsanto_Park_Cir...	158.0	Between 0 and 2500	north
7	8	magny_cours	Circuit de Nevers Magny-Cours	Magny Cours	France	46.86420	3.16361	228	http://en.wikipedia.org/wiki/Circuit_de_Nevers...	228.0	Between 0 and 2500	north
1	2	sepang	Sepang International Circuit	Kuala Lumpur	Malaysia	2.76083	101.73800	18	http://en.wikipedia.org/wiki/Sepang_Internatio...	18.0	Between 0 and 2500	north
28	29	adelaide	Adelaide Street Circuit	Adelaide	Australia	-34.92720	138.61700	58	http://en.wikipedia.org/wiki/Adelaide_Street_C...	58.0	Between 0 and 2500	south

Back to exercise

QTM 151 - Intro to Statistical Programming II

Nice to see you again! How was your break? 🍂

Today’s plan 📅

Creating and replacing variables

Today we will…

Loading packages and dataset 📦

Our dataset: Formula 1 World Championships 🏁🏎️

Codebook 📚

NaN values 🚫

What is a NaN value?

Operations with NaN values

Summary statistics with NaN values

Arrays

Summary statistics with NaN values

Pandas DataFrames

Data Cleaning 🧹🧽

Data cleaning

Check rows with numeric values

Other examples of chaining methods

Extract list of non-numeric values

Replace certain values

Try it yourself! 🧠

Try it yourself! 🧠

Recoding Numeric Variables 🔄

Recoding numeric variables

Recode variables based on an interval

Try it yourself! 🧠

And that’s it for today! 🎉

See you next time! 🚀

Appendix 01

Appendix 02

Appendix 03

What is a `NaN` value?

Operations with `NaN` values

Summary statistics with `NaN` values

Summary statistics with `NaN` values