DATASCI 151 - Introduction to Statistical Computing II

Lecture 08 - Data Wrangling with Pandas

Danilo Freire

danilo.freire@emory.edu

Emory University

Hello again! 🥳

Recap of last class 📚

In our last class, we learned

How to write functions with def and return
What parameters, arguments, and return values are
How to combine functions with if statements
How to use lambda to create quick, throwaway functions

Today’s plan 📅

Introduction to pandas, the main library for data manipulation in Python
Learn how to apply functions to many variables at once
How to use the apply and map functions
Learn how to recode and replace variables in a dataset
Specifically focus on replacing NaN values (“Not a Number” - missing data)
Cover how to convert variables from one type to another
Learn how to create new variables based on existing ones
Finally, we will learn about .py files and how to import them as modules

Operations over many variables using Pandas 🐼

Pandas 🐼

pandas is the main library for data manipulation in Python 🐼
We will use it a lot in this course (and in your life as a data scientist!)
It is built on top of numpy and matplotlib, and has a gazillion functions to work with data 😁
If you use R already, think about it as the dplyr of Python
- A list of equivalences between dplyr and pandas
We will learn more about it in the next classes

Applying functions to a dataset

The apply function is used to apply a function to a dataset
- (This course is full of surprises, isn’t it? 😄)
It is a method of a pandas DataFrame
It can be used with built-in functions, custom functions, or lambda functions
- df.apply(function)
You can apply functions to rows or columns
- df.apply(function, axis=0) applies the function to each column (default)
- df.apply(function, axis=1) applies the function to each row

Applying functions to a dataset

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, 3],
    'B': [4, 5, 6],
    'C': [7, 8, 9]
})

print(df.apply(np.sqrt))

          A         B         C
0  1.000000  2.000000  2.645751
1  1.414214  2.236068  2.828427
2  1.732051  2.449490  3.000000

print(df.apply(np.sum, axis=1))

0    12
1    15
2    18
dtype: int64

print(df.apply(lambda x: x**2))

   A   B   C
0  1  16  49
1  4  25  64
2  9  36  81

Applying functions to a dataset

Let’s do a quick exercise

# Create an empty DataFrame
data = pd.DataFrame()

# Add variables
data["age"] = [18,29,15,32,6]
data["num_underage_siblings"] = [0,0,1,1,0]
data["num_adult_siblings"] = [1,0,0,1,0]

display(data)

	age	num_underage_siblings	num_adult_siblings
0	18	0	1
1	29	0	0
2	15	1	0
3	32	1	1
4	6	0	0

Applying functions to a dataset

Now let’s define some functions

# The first two functions return True/False depending on age constraints
# The third function returns the sum of two numbers
# The fourth function returns a string with the age bracket

fn_iseligible_vote = lambda age: age >= 18

fn_istwenties = lambda age: (age >= 20) & (age < 30)

fn_sum = lambda x,y: x + y

def fn_agebracket(age):
    if (age >= 18):
        status = "Adult"
    elif (age >= 10) & (age < 18):
        status = "Adolescent"
    else:
        status = "Child"
    return(status)

Applying functions to a dataset

Now let’s apply the functions to the data["age"] column

data["can_vote"]    = data["age"].apply(fn_iseligible_vote)
data["in_twenties"] = data["age"].apply(fn_istwenties)
data["age_bracket"] = data["age"].apply(fn_agebracket)

display(data)

	age	num_underage_siblings	num_adult_siblings	can_vote	in_twenties	age_bracket
0	18	0	1	True	False	Adult
1	29	0	0	True	True	Adult
2	15	1	0	False	False	Adolescent
3	32	1	1	True	False	Adult
4	6	0	0	False	False	Child

Creating a new variable

You can also create a new variable using the apply function

# Creating a new variable
data["new_var"] = data["age"].apply(lambda age: age >= 18)

display(data)

	age	num_underage_siblings	num_adult_siblings	can_vote	in_twenties	age_bracket	new_var
0	18	0	1	True	False	Adult	True
1	29	0	0	True	True	Adult	True
2	15	1	0	False	False	Adolescent	False
3	32	1	1	True	False	Adult	True
4	6	0	0	False	False	Child	False

Deleting a variable

You can also delete a variable using the drop function

data = data.drop(columns = ["new_var"])

display(data)

	age	num_underage_siblings	num_adult_siblings	can_vote	in_twenties	age_bracket
0	18	0	1	True	False	Adult
1	29	0	0	True	True	Adult
2	15	1	0	False	False	Adolescent
3	32	1	1	True	False	Adult
4	6	0	0	False	False	Child

Mapping functions to a list, array, or series

The map function is used to apply a function to a list, an array, or a series
- A series is a single column of a pandas DataFrame
In pandas, map works very similarly to the apply function, and they are interchangeable when working with series
map can be faster than apply for simple functions, but apply is more flexible as it can be used with DataFrames (many columns)
However, if you are using regular lists (e.g., list01 = [1,2,3]), you should use map instead of apply
- apply is not a built-in Python function

data["age_bracket01"] = data["age"].map(fn_agebracket)

display(data[["age","age_bracket01"]])

	age	age_bracket01
0	18	Adult
1	29	Adult
2	15	Adolescent
3	32	Adult
4	6	Child

data["age_bracket02"] = data["age"].apply(fn_agebracket)

display(data[["age","age_bracket02"]])

	age	age_bracket02
0	18	Adult
1	29	Adult
2	15	Adolescent
3	32	Adult
4	6	Child

Mapping functions to a list, array, or series

Using map with a list and an array

# Create a list
list01 = [1,2,3,4,5]

# Map a function to the list
list02 = list(map(lambda x: x**2, list01))

print(list02)

[1, 4, 9, 16, 25]

# Create a numpy array
array01 = np.array([1,2,3,4,5])

# Map a function to the array
array02 = np.array(list(map(lambda x: x**2, array01)))

print(array02)

[ 1  4  9 16 25]

Trying to use apply with a list or an array will raise an error

# Create a list
list01 = [1,2,3,4,5]

# Apply a function to the list
list02 = list(apply(lambda x: x**2, list01))

print(list02)

---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[168], line 5
      2 list01 = [1,2,3,4,5]
      4 # Apply a function to the list
----> 5 list02 = list(apply(lambda x: x**2, list01))
      7 print(list02)

NameError: name 'apply' is not defined

Try it yourself! 🚀

Write a lambda function checking whether num_siblings \(\ge 1\)
Add a variable to the dataset called has_siblings
Assign True/False to this variable using apply()
Appendix 02

Importing modules 📦

Importing modules

What is a module?

While .ipynb files are great for learning and teaching, they are not the best for sharing code
When you write a lot of functions, you should save them in a .py file, which is a Python script
A Python script, or module, is just a file containing Python code
This code can be functions, classes, or variables
A folder containing Python scripts is called a package
You can import modules to use their code in your own code

We can import functions into the working environment from a file

# Import the folder `scripts` as a package
# And the file `example_functions.py` as `ef`
import scripts.example_functions as ef

print(ef.fn_quadratic(2))
print(ef.fn_cubic(3))

ef.message_hello("Juan")

4
27

'hi Juan'

Importing modules

Importing variables

You can also import variables from a module
However, it is not recommended to import variables
It is better to import functions and use them to create variables
This is because variables can be changed in the module, leading to unexpected results

Example:

import scripts.example_variables as ev

# When we run this code
# the value of alpha will be overwritten

alpha = 1
print(alpha)
print(ev.alpha)

from scripts.example_variables import *

print(alpha)
print(beta)
print(gamma)
print(delta)

Loading packages and dataset 📦

Our dataset: Formula 1 World Championships 🏁🏎️

First, we will load the packages we need

import pandas as pd
import numpy as np

Then, we will load the dataset

circuits = pd.read_csv("data_raw/circuits.csv")

# Or download it from the internet
# circuits = pd.read_csv("https://raw.githubusercontent.com/danilofreire/datasci151-summer/main/lectures/lecture-08/data_raw/circuits.csv")
display(circuits.head(2))

	circuitId	circuitRef	name	location	country	lat	lng	alt	url
0	1	albert_park	Albert Park Grand Prix Circuit	Melbourne	Australia	-37.84970	144.968	10	http://en.wikipedia.org/wiki/Melbourne_Grand_P...
1	2	sepang	Sepang International Circuit	Kuala Lumpur	Malaysia	2.76083	101.738	18	http://en.wikipedia.org/wiki/Sepang_Internatio...

Our dataset: Formula 1 World Championships 🏁🏎️

The dataset contains information about F1 circuits, such as its name, location, latitude, longitude, and more
You can find more information about the dataset here
The dataset is available in the course’s GitHub repository here
- Or you can download it using the command above
Let’s see how the codebook looks like
More information about Formula 1 here

Codebook 📚

Field - Name of the variable
Type - Type of the variable
- Integer (int), string (str - varchart), and float (float)
Description - Label with a description of the variable
Quick discussion: What does varchart(255) mean?

The dataset has 9 columns (variables) and 77 rows (observations)
The columns are:
- circuitId: Unique identifier for the circuit
- circuitRef: Unique reference for the circuit
- name: Name of the circuit
- location: Location
- country: Country where the circuit is located
- lat: Latitude
- lng: Longitude
- alt: Altitude
- url: URL of the circuit’s Wikipedia page

NaN values 🚫

What is a `NaN` value?

NaN stands for “Not a Number”
It is a special value in Python that represents missing data
NaN values can be found in datasets for various reasons
- Data entry errors
- Data cleaning and processing errors
- Data collection errors
- Data transformation errors
We (often) need to handle NaN values before we can analyse the data

NaN values can be found in different types of variables
- Numeric variables
- Categorical variables
- Date variables
- Text variables
We will focus on numeric variables today
pandas and numpy have functions to handle NaN values
- Note: they handle NaN values differently!

Operations with `NaN` values

NaN is a special number, available in numpy

np.nan

nan

Often, we cannot perform operations with NaN values
Thus, we need to handle them before we can analyse the data

Let’s see some examples. We start with numpy arrays

# Create two array with and without "NaNs"
# The "np.array()" functions converts 
# a list to an array

vec_without_nans = np.array([1,1,1])
vec_with_nans    = np.array([np.nan,4,5])

# When you add the vectors
# you will produce a NaN 
# on any entries with "NaNs"
print(vec_without_nans * vec_with_nans)
print(vec_without_nans / vec_with_nans)
print(vec_without_nans + vec_with_nans)
print(vec_without_nans - vec_with_nans)

[nan  4.  5.]
[ nan 0.25 0.2 ]
[nan  5.  6.]
[nan -3. -4.]

Summary statistics with `NaN` values

Arrays

Some summary statistics functions will not work with NaN values
For example, the mean() function

print(np.mean(vec_with_nans))

nan

The mean() function will return NaN if there are NaN values in the array

To calculate the mean without NaN values, we can use the nanmean() function

print(np.nanmean(vec_with_nans))

4.5

The nanmean() function will ignore NaN values and calculate the mean with the remaining values

Summary statistics with `NaN` values

Pandas DataFrames

Let’s create an empty DataFrame and create a new column x with NaN values

dataset = pd.DataFrame()
dataset["x"] = vec_with_nans
dataset

	x
0	NaN
1	4.0
2	5.0

You will see that pandas will handle NaN values differently: it will ignore them

print(dataset["x"].mean())

4.5

For R users: This is the same as na.rm = TRUE in R. pandas does that by default

Data Cleaning 🧹🧽

Data cleaning

Data cleaning is the process of preparing data for analysis
It involves identifying and handling missing data, outliers, and other data quality issues
You guys have no idea how much time you will spend cleaning data in your life 😅
According to a Forbes survey, data scientists spend 60% of their time cleaning and preparing data, and 57% say it’s the least enjoyable part of their work
- I can really relate to that 😂
But remember that clean data are good data 🥳

Let’s get the data types of the columns in the circuits dataset
We use the command dtypes for that
object means that the variable is a string or a variable with mixed types (e.g., numbers and strings)

circuits.dtypes

circuitId       int64
circuitRef     object
name           object
location       object
country        object
lat           float64
lng           float64
alt            object
url            object
dtype: object

Check rows with numeric values

Here we will use the .str.isnumeric() function
This function actually combines two functions: .str and .isnumeric()
The .str function is used to check if the variable is a string
The .isnumeric() part is used to check if the string is numeric
Why do we need both? Because DataFrame columns often contain mixed data types (e.g., numbers and strings), and we need to check if the variable is a string before we can check if it is numeric
If we used only .isnumeric(), we would get an error (trust me, I tried 😅)

The two dots between the functions are called method chaining
It is a way to call multiple functions in a single line of code
If you use R, this is similar to the %>% operator in dplyr
Let’s see how it works

# Check if the variable "alt" is numeric
circuits["alt"].str.isnumeric()

0      True
1      True
2      True
3      True
4      True
      ...  
72     True
73     True
74     True
75    False
76    False
Name: alt, Length: 77, dtype: bool

Other examples of chaining methods

# Check if the variable 
# "circuitRef" is numeric
circuits["circuitRef"].str.isnumeric()

0     False
1     False
2     False
3     False
4     False
      ...  
72    False
73    False
74    False
75    False
76    False
Name: circuitRef, Length: 77, dtype: bool

# Convert the variable 
# `location` to lowercase
circuits["location"].str.lower()

0        melbourne
1     kuala lumpur
2           sakhir
3         montmeló
4         istanbul
          ...     
72        portimão
73         mugello
74          jeddah
75       al daayen
76           miami
Name: location, Length: 77, dtype: object

Extract list of non-numeric values

We can use the function query() to reference subattributes of a variable
- query() is a method of a pandas DataFrame and it has many useful functions!
- We will use it more in the future!
Here we will combine query() with pd.unique() to extract a list of non-numeric values
The pd.unique() function will return a list of unique values in a variable

# Extract a list of non-numeric values
# The pd.unique() function extracts unique values from a list
# Check each value in the alt column to see if it is not numeric
# True if it is not numeric, False if it is numeric
subset = circuits.query("alt.str.isnumeric() == False")
list_unique = pd.unique(subset["alt"])
print(list_unique)

['\\N' '-7']

Replace certain values

The replace function is used to replace values in a variable
The syntax is dataframe["variable"].replace(list_old, list_new)
More information about the function can be found here

# "list_old" encodes values we want to change
# From the list_unique, we see that the values we want to change are '\N' and '-7'
# "list_new" encodes the values that will replace the old
list_old = ['\\N','-7']
list_new = [np.nan, -7]

# This command replaces the values of the "alt" column
circuits["alt"] = circuits["alt"].replace(list_old, list_new)

After the cleaning process is done, you may want to store the dataset again
It’s strongly recommended to do this in a separate file from the original
Use to_csv() to save the dataset as a .csv file

circuits.to_csv("data_clean/circuits_clean.csv", index=False)

Try it yourself! 🧠

Use .replace() with the “country” column
Replace “UK” with “United Kingdom”
Appendix 04

Try it yourself! 🧠

What is the column type of “lat” or “lng”?
Does it have any string variables?
Can we use str.isnumeric() here?
Appendix 05

Recoding Numeric Variables 🔄

Recoding numeric variables

Recoding is the process of changing the values of a variable
We can recode variables for various reasons
- To create new variables
- To standardise variables
- To simplify the analysis
Please remember to convert the variable to the correct type before recoding

# Check the data type of the "alt" column
circuits["alt"].dtype

dtype('O')

pd.to_numeric() is used to convert a variable to a numeric type

# pd.to_numeric() converts 
# a column to numeric
# Before you use this option, 
# make sure to "clean" the variable
# as we did before by checking what
# the non-numeric values are
circuits["alt_numeric"] = pd.to_numeric(circuits["alt"])
print(circuits["alt_numeric"].mean())

248.1891891891892

print(circuits["alt_numeric"].min())
print(circuits["alt_numeric"].max())

-7.0
2227.0

Recode variables based on an interval

Imagine that we want to recode the alt variable into an interval

\[x_{bin} = \begin{cases} "A" &\text{ if } x_1 < x \le x_2 \\ "B" &\text{ if } x_2 < x \le x_3 \end{cases} \]

We can use the pd.cut() function to do this
The syntax is df["new_variable"] = pd.cut(df["variable"], bins = [x1, x2, x3], labels = ["A", "B"])
Where bins are the intervals and labels are the new values
More information about the function can be found here

# Recode the "alt" variable into an interval
bins_x = [0, 2500, 5000]
labels_x = ["Between 0 and 2500",
            "Between 2500 and 5000"]

circuits["bins_alt"] = pd.cut(circuits["alt_numeric"],
                              bins = bins_x,
                              right = True,
                              labels = labels_x)
np.random.seed(1)
display(circuits.sample(5))

	circuitId	circuitRef	name	location	country	lat	lng	alt	url	alt_numeric	bins_alt
31	32	rodriguez	Autódromo Hermanos Rodríguez	Mexico City	Mexico	19.4042	-99.09070	2227	http://en.wikipedia.org/wiki/Aut%C3%B3dromo_He...	2227.0	Between 0 and 2500
43	44	las_vegas	Las Vegas Street Circuit	Nevada	USA	36.1162	-115.17400	639	http://en.wikipedia.org/wiki/Las_Vegas_Street_...	639.0	Between 0 and 2500
26	27	estoril	Autódromo do Estoril	Estoril	Portugal	38.7506	-9.39417	130	http://en.wikipedia.org/wiki/Aut%C3%B3dromo_do...	130.0	Between 0 and 2500
74	77	jeddah	Jeddah Corniche Circuit	Jeddah	Saudi Arabia	21.6319	39.10440	15	http://en.wikipedia.org/wiki/Jeddah_Street_Cir...	15.0	Between 0 and 2500
58	59	boavista	Circuito da Boavista	Oporto	Portugal	41.1705	-8.67325	28	http://en.wikipedia.org/wiki/Circuito_da_Boavista	28.0	Between 0 and 2500

And that’s it for today! 🎉

Thanks very much! 😊

Appendix 01

def modify_x():
    global x
    x = x + 5

x = 1

# Now, running the function 
# will permanently increase x by 5.

modify_x()
print(x)
modify_x()
print(x)

6
11

def fn_square(x):
    global y
    y = x**2
    return(y)

x = 5
y = -5

print("Example 1:")
print(fn_square(x = 10))
print(x)
print(y)

Example 1:
100
5
100

Back to exercise 01

Appendix 02

Write a lambda function checking whether num_siblings \(\ge 1\)
Add a variable to the dataset called has_siblings
Assign True/False to this variable using apply()

fn_has_siblings = lambda num_siblings: num_siblings >= 1

data["has_siblings"] = data["num_adult_siblings"].apply(fn_has_siblings)

display(data[["num_adult_siblings","has_siblings"]])

	num_adult_siblings	has_siblings
0	1	True
1	0	False
2	0	False
3	1	True
4	0	False

Back to exercise 02

Appendix 03

Read the car dataset data_raw/features.csv
Create a function that tests whether mpg \(\ge\) 29
Add a variable mpg_above_29 which is True/False if mpg \(\ge\) 29
Store the new dataset to data_clean/features.csv

data_raw = pd.read_csv("data_raw/features.csv")

data_raw["mpg_above_29"] = data_raw["mpg"].apply(lambda mpg: mpg >= 29)

display(data_raw[["mpg","mpg_above_29"]])

data_raw.to_csv("data_clean/features.csv", index = False)

	mpg	mpg_above_29
0	18.0	False
1	15.0	False
2	18.0	False
3	16.0	False
4	17.0	False
...	...	...
393	27.0	False
394	44.0	True
395	32.0	True
396	28.0	False
397	31.0	True

398 rows × 2 columns

Back to exercise 03

Appendix 04

Use .replace() with the “country” column
Replace “UK” with “United Kingdom”

# Replace "UK" with "United Kingdom"
circuits["country"] = circuits["country"].replace("UK", "United Kingdom")
display(circuits[["country"]])

	country
0	Australia
1	Malaysia
2	Bahrain
3	Spain
4	Turkey
...	...
72	Portugal
73	Italy
74	Saudi Arabia
75	Qatar
76	USA

77 rows × 1 columns

Back to exercise 04

Appendix 05

# Check the data type of the "lat" column
print(circuits["lat"].dtype)

# Check if the variable "lat" is numeric
# print(circuits["lat"].str.isnumeric())
# It will raise an error because "lat" is a float

# Check if the variable "lng" is numeric
# print(circuits["lng"].str.isnumeric())
# It will also raise an error

float64

Back to exercise 05

DATASCI 151 - Introduction to Statistical Computing II

Hello again! 🥳

Recap of last class 📚

In our last class, we learned

Today’s plan 📅

Operations over many variables using Pandas 🐼

Pandas 🐼

Applying functions to a dataset

Applying functions to a dataset

Applying functions to a dataset

Applying functions to a dataset

Applying functions to a dataset

Creating a new variable

Deleting a variable

Mapping functions to a list, array, or series

Mapping functions to a list, array, or series

Try it yourself! 🚀

Importing modules 📦

Importing modules

What is a module?

Importing modules

Importing variables

Loading packages and dataset 📦

Our dataset: Formula 1 World Championships 🏁🏎️

Our dataset: Formula 1 World Championships 🏁🏎️

Codebook 📚

NaN values 🚫

What is a NaN value?

Operations with NaN values

Summary statistics with NaN values

Arrays

Summary statistics with NaN values

Pandas DataFrames

Data Cleaning 🧹🧽

Data cleaning

Check rows with numeric values

Other examples of chaining methods

Extract list of non-numeric values

Replace certain values

Try it yourself! 🧠

Try it yourself! 🧠

Recoding Numeric Variables 🔄

Recoding numeric variables

Recode variables based on an interval

And that’s it for today! 🎉

Thanks very much! 😊

Appendix 01

Appendix 02

Appendix 03

Appendix 04

Appendix 05

What is a `NaN` value?

Operations with `NaN` values

Summary statistics with `NaN` values

Summary statistics with `NaN` values