QTM 151 - Introduction to Statistical Computing II

Lecture 08 - Data Wrangling with Pandas

Danilo Freire

Emory University

Hello again! 🥳

Recap of last class 📚

In our last class, we learned

  • How to write functions with def and return
  • What parameters, arguments, and return values are
  • How to combine functions with if statements
  • How to use lambda to create quick, throwaway functions

Today’s plan 📅

  • Introduction to pandas, the main library for data manipulation in Python
  • Learn how to apply functions to many variables at once
  • How to use the apply and map functions
  • Learn how to recode and replace variables in a dataset
  • Specifically focus on replacing NaN values (“Not a Number” - missing data)
  • Cover how to convert variables from one type to another
  • Learn how to create new variables based on existing ones
  • Finally, we will learn about .py files and how to import them as modules

Operations over many variables using Pandas 🐼

Pandas 🐼

  • pandas is the main library for data manipulation in Python 🐼
  • We will use it a lot in this course (and in your life as a data scientist!)
  • It is built on top of numpy and matplotlib, and has a gazillion functions to work with data 😁
  • If you use R already, think of it as the dplyr of Python
  • We will learn more about it in the next classes

Applying functions to a dataset

  • The apply function is used to apply a function to a dataset
    • (This course is full of surprises, isn’t it? 😄)
  • It is a method of a pandas DataFrame
  • It can be used with built-in functions, custom functions, or lambda functions
    • df.apply(function)
  • You can apply functions to rows or columns
    • df.apply(function, axis=0) applies the function to each column (default)
    • df.apply(function, axis=1) applies the function to each row

Applying functions to a dataset

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, 3],
    'B': [4, 5, 6],
    'C': [7, 8, 9]
})

print(df.apply(np.sqrt))
          A         B         C
0  1.000000  2.000000  2.645751
1  1.414214  2.236068  2.828427
2  1.732051  2.449490  3.000000
print(df.apply(np.sum, axis=1))
0    12
1    15
2    18
dtype: int64
print(df.apply(lambda x: x**2))
   A   B   C
0  1  16  49
1  4  25  64
2  9  36  81

Applying functions to a dataset

  • Let’s do a quick exercise
# Create an empty DataFrame
data = pd.DataFrame()

# Add variables
data["age"] = [18,29,15,32,6]
data["num_underage_siblings"] = [0,0,1,1,0]
data["num_adult_siblings"] = [1,0,0,1,0]

display(data)
   age  num_underage_siblings  num_adult_siblings
0   18                      0                   1
1   29                      0                   0
2   15                      1                   0
3   32                      1                   1
4    6                      0                   0

Applying functions to a dataset

  • Now let’s define some functions
# The first two functions return True/False depending on age constraints
# The third function returns the sum of two numbers
# The fourth function returns a string with the age bracket

fn_iseligible_vote = lambda age: age >= 18

fn_istwenties = lambda age: (age >= 20) & (age < 30)

fn_sum = lambda x,y: x + y

def fn_agebracket(age):
    if age >= 18:
        status = "Adult"
    elif 10 <= age < 18:
        status = "Adolescent"
    else:
        status = "Child"
    return status

Applying functions to a dataset

  • Now let’s apply the functions to the data["age"] column
data["can_vote"]    = data["age"].apply(fn_iseligible_vote)
data["in_twenties"] = data["age"].apply(fn_istwenties)
data["age_bracket"] = data["age"].apply(fn_agebracket)

display(data)
   age  num_underage_siblings  num_adult_siblings  can_vote  in_twenties age_bracket
0   18                      0                   1      True        False       Adult
1   29                      0                   0      True         True       Adult
2   15                      1                   0     False        False  Adolescent
3   32                      1                   1      True        False       Adult
4    6                      0                   0     False        False       Child

Creating a new variable

  • You can also create a new variable using the apply function
# Creating a new variable
data["new_var"] = data["age"].apply(lambda age: age >= 18)

display(data)
   age  num_underage_siblings  num_adult_siblings  can_vote  in_twenties age_bracket  new_var
0   18                      0                   1      True        False       Adult     True
1   29                      0                   0      True         True       Adult     True
2   15                      1                   0     False        False  Adolescent    False
3   32                      1                   1      True        False       Adult     True
4    6                      0                   0     False        False       Child    False

Deleting a variable

  • You can also delete a variable using the drop function
data = data.drop(columns = ["new_var"])

display(data)
   age  num_underage_siblings  num_adult_siblings  can_vote  in_twenties age_bracket
0   18                      0                   1      True        False       Adult
1   29                      0                   0      True         True       Adult
2   15                      1                   0     False        False  Adolescent
3   32                      1                   1      True        False       Adult
4    6                      0                   0     False        False       Child

Mapping functions to a list, array, or series

  • The map function is used to apply a function to a list, an array, or a series
    • A series is a single column of a pandas DataFrame
  • In pandas, map works very similarly to apply, and the two are largely interchangeable when working with Series
  • map can be faster than apply for simple functions, but apply is more flexible, as it also works with DataFrames (many columns)
  • However, for regular Python lists (e.g., list01 = [1,2,3]), you must use the built-in map function
    • apply is not a built-in Python function, so it does not exist for plain lists
data["age_bracket01"] = data["age"].map(fn_agebracket)

display(data[["age","age_bracket01"]])
   age age_bracket01
0   18         Adult
1   29         Adult
2   15    Adolescent
3   32         Adult
4    6         Child
data["age_bracket02"] = data["age"].apply(fn_agebracket)

display(data[["age","age_bracket02"]])
   age age_bracket02
0   18         Adult
1   29         Adult
2   15    Adolescent
3   32         Adult
4    6         Child

Mapping functions to a list, array, or series

  • Using map with a list and an array
# Create a list
list01 = [1,2,3,4,5]

# Map a function to the list
list02 = list(map(lambda x: x**2, list01))

print(list02)
[1, 4, 9, 16, 25]
# Create a numpy array
array01 = np.array([1,2,3,4,5])

# Map a function to the array
array02 = np.array(list(map(lambda x: x**2, array01)))

print(array02)
[ 1  4  9 16 25]
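  • As an aside, with numpy arrays you often do not need map at all: arithmetic operations are vectorised and work element-wise directly (a quick sketch, not part of the example above)

```python
import numpy as np

array01 = np.array([1, 2, 3, 4, 5])

# Vectorised arithmetic applies element-wise; no map needed
array02 = array01 ** 2

print(array02)  # [ 1  4  9 16 25]
```

  • Vectorised operations are usually faster and more readable than map for numerical work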
  • Trying to use apply with a list or an array will raise an error
# Create a list
list01 = [1,2,3,4,5]

# Apply a function to the list
list02 = list(apply(lambda x: x**2, list01))

print(list02)
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[168], line 5
      2 list01 = [1,2,3,4,5]
      4 # Apply a function to the list
----> 5 list02 = list(apply(lambda x: x**2, list01))
      7 print(list02)

NameError: name 'apply' is not defined

Try it yourself! 🚀

  • Write a lambda function checking whether num_siblings \(\ge 1\)
  • Add a variable to the dataset called has_siblings
  • Assign True/False to this variable using apply()
  • Appendix 02

Importing modules 📦

Importing modules

What is a module?

  • While .ipynb files are great for learning and teaching, they are not the best for sharing code
  • When you write a lot of functions, you should save them in a .py file, which is a Python script
  • A Python script, or module, is just a file containing Python code
  • This code can be functions, classes, or variables
  • A folder containing Python scripts is called a package
  • You can import modules to use their code in your own code
  • We can import functions into the working environment from a file
# Import the folder `scripts` as a package
# And the file `example_functions.py` as `ef`
import scripts.example_functions as ef

print(ef.fn_quadratic(2))
print(ef.fn_cubic(3))

ef.message_hello("Juan")
4
9
'hi Juan'

Importing modules

Importing variables

  • You can also import variables from a module
  • However, importing variables directly is not recommended
  • It is better to import functions and use them to create variables
  • This is because imported variables can overwrite names already defined in your environment (especially with import *), leading to unexpected results
  • Example:
import scripts.example_variables as ev

# When we run this code
# the value of alpha will be overwritten

alpha = 1
print(alpha)
print(ev.alpha)

from scripts.example_variables import *

print(alpha)
print(beta)
print(gamma)
print(delta)
1
5
5
10
20
100

Loading packages and dataset 📦

Our dataset: Formula 1 World Championships 🏁🏎️

  • First, we will load the packages we need
import pandas as pd
import numpy as np
  • Then, we will load the dataset
circuits = pd.read_csv("data_raw/circuits.csv")

# Or download it from the internet
# circuits = pd.read_csv("https://raw.githubusercontent.com/danilofreire/qtm151-summer/main/lectures/lecture-08/data_raw/circuits.csv")
display(circuits.head(2))
circuitId circuitRef name location country lat lng alt url
0 1 albert_park Albert Park Grand Prix Circuit Melbourne Australia -37.84970 144.968 10 http://en.wikipedia.org/wiki/Melbourne_Grand_P...
1 2 sepang Sepang International Circuit Kuala Lumpur Malaysia 2.76083 101.738 18 http://en.wikipedia.org/wiki/Sepang_Internatio...

Our dataset: Formula 1 World Championships 🏁🏎️

  • The dataset contains information about F1 circuits, such as their names, locations, latitudes, longitudes, and more
  • You can find more information about the dataset here
  • The dataset is available in the course’s GitHub repository here
    • Or you can download it using the command above
  • Let’s see what the codebook looks like
  • More information about Formula 1 here

Codebook 📚

  • Field - Name of the variable
  • Type - Type of the variable
    • Integer (int), string (str - varchar), and float (float)
  • Description - Label with a description of the variable
  • Quick discussion: What does varchar(255) mean?
  • The dataset has 9 columns (variables) and 77 rows (observations)
  • The columns are:
    • circuitId: Unique identifier for the circuit
    • circuitRef: Unique reference for the circuit
    • name: Name of the circuit
    • location: Location
    • country: Country where the circuit is located
    • lat: Latitude
    • lng: Longitude
    • alt: Altitude
    • url: URL of the circuit’s Wikipedia page

NaN values 🚫

What is a NaN value?

  • NaN stands for “Not a Number”
  • It is a special value in Python that represents missing data
  • NaN values can be found in datasets for various reasons
    • Data entry errors
    • Data cleaning and processing errors
    • Data collection errors
    • Data transformation errors
  • We (often) need to handle NaN values before we can analyse the data
  • NaN values can be found in different types of variables
    • Numeric variables
    • Categorical variables
    • Date variables
    • Text variables
  • We will focus on numeric variables today
  • pandas and numpy have functions to handle NaN values
    • Note: they handle NaN values differently!
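  • Before handling NaN values, we usually need to detect them first
  • A quick sketch using the standard detection functions (not shown in the slides above): pandas uses .isna(), while np.isnan() only accepts numeric input

```python
import numpy as np
import pandas as pd

vec = pd.Series([np.nan, 4, 5])

# .isna() flags missing entries element-wise
print(vec.isna())        # True for the missing entry only
print(vec.isna().sum())  # number of missing values

# np.isnan() works too, but only on numeric arrays
print(np.isnan(vec.to_numpy()))
```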

Operations with NaN values

  • NaN is a special number, available in numpy
np.nan
nan
  • Often, we cannot perform operations with NaN values
  • Thus, we need to handle them before we can analyse the data
  • Let’s see some examples. We start with numpy arrays
# Create two arrays, with and without NaNs
# The np.array() function converts
# a list to an array

vec_without_nans = np.array([1,1,1])
vec_with_nans    = np.array([np.nan,4,5])

# Any operation involving a NaN entry
# produces a NaN in the result
print(vec_without_nans * vec_with_nans)
print(vec_without_nans / vec_with_nans)
print(vec_without_nans + vec_with_nans)
print(vec_without_nans - vec_with_nans)
[nan  4.  5.]
[ nan 0.25 0.2 ]
[nan  5.  6.]
[nan -3. -4.]

Summary statistics with NaN values

Arrays

  • Some summary statistics functions will not work with NaN values
  • For example, the mean() function
print(np.mean(vec_with_nans))
nan
  • The mean() function will return NaN if there are NaN values in the array
  • To calculate the mean without NaN values, we can use the nanmean() function
print(np.nanmean(vec_with_nans))
4.5
  • The nanmean() function will ignore NaN values and calculate the mean with the remaining values
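  • numpy has nan-aware counterparts for other summary statistics as well, e.g. np.nansum(), np.nanmax(), and np.nanmin(); a quick sketch:

```python
import numpy as np

vec_with_nans = np.array([np.nan, 4, 5])

# All of these skip NaN entries
print(np.nansum(vec_with_nans))  # 9.0
print(np.nanmax(vec_with_nans))  # 5.0
print(np.nanmin(vec_with_nans))  # 4.0
```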

Summary statistics with NaN values

Pandas DataFrames

  • Let’s create an empty DataFrame and create a new column x with NaN values
dataset = pd.DataFrame()
dataset["x"] = vec_with_nans
dataset
     x
0  NaN
1  4.0
2  5.0
  • You will see that pandas will handle NaN values differently: it will ignore them
print(dataset["x"].mean())
4.5
  • For R users: This is the same as na.rm = TRUE in R. pandas does that by default
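  • If you ever want pandas to propagate NaNs the way numpy does, the skipna argument of .mean() (and similar methods) controls this behaviour; a small sketch:

```python
import numpy as np
import pandas as pd

dataset = pd.DataFrame({"x": [np.nan, 4, 5]})

# Default behaviour: NaNs are skipped
print(dataset["x"].mean())              # 4.5

# skipna=False mimics numpy: one NaN makes the result NaN
print(dataset["x"].mean(skipna=False))  # nan
```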

Data Cleaning 🧹🧽

Data cleaning

  • Data cleaning is the process of preparing data for analysis
  • It involves identifying and handling missing data, outliers, and other data quality issues
  • You guys have no idea how much time you will spend cleaning data in your life 😅
  • According to a Forbes survey, data scientists spend 60% of their time cleaning and preparing data, and 57% say it’s the least enjoyable part of their work
    • I can really relate to that 😂
  • But remember that clean data are good data 🥳
  • Let’s get the data types of the columns in the circuits dataset
  • We use the command dtypes for that
  • object means that the variable is a string or a variable with mixed types (e.g., numbers and strings)
circuits.dtypes
circuitId       int64
circuitRef     object
name           object
location       object
country        object
lat           float64
lng           float64
alt            object
url            object
dtype: object

Check rows with numeric values

  • Here we will use the .str.isnumeric() method
  • This actually combines two pieces: the .str accessor and the .isnumeric() method
  • The .str accessor lets us treat each value in the column as a string and use string methods on it
  • The .isnumeric() part then checks whether each string contains only numeric characters
  • Why do we need both? Because object columns can mix data types (e.g., numbers and strings), so we must access the values as strings before testing whether they are numeric
  • If we used only .isnumeric(), we would get an error (trust me, I tried 😅)
  • Calling several methods in a row with dots like this is known as method chaining
  • It is a way to call multiple functions in a single line of code
  • If you use R, this is similar to the %>% operator in dplyr
  • Let’s see how it works
# Check if the variable "alt" is numeric
circuits["alt"].str.isnumeric()
0      True
1      True
2      True
3      True
4      True
      ...  
72     True
73     True
74     True
75    False
76    False
Name: alt, Length: 77, dtype: bool

Other examples of chaining methods

# Check if the variable 
# "circuitRef" is numeric
circuits["circuitRef"].str.isnumeric()
0     False
1     False
2     False
3     False
4     False
      ...  
72    False
73    False
74    False
75    False
76    False
Name: circuitRef, Length: 77, dtype: bool
# Convert the variable 
# `location` to lowercase
circuits["location"].str.lower()
0        melbourne
1     kuala lumpur
2           sakhir
3         montmeló
4         istanbul
          ...     
72        portimão
73         mugello
74          jeddah
75       al daayen
76           miami
Name: location, Length: 77, dtype: object

Extract list of non-numeric values

  • We can use the query() method to filter rows using a boolean expression
    • query() is a method of a pandas DataFrame and it has many useful features!
    • We will use it more in the future!
  • Here we will combine query() with pd.unique() to extract a list of non-numeric values
  • The pd.unique() function returns the unique values found in a variable
# Extract a list of non-numeric values
# The pd.unique() function extracts unique values from a list
# Check each value in the alt column to see if it is not numeric
# True if it is not numeric, False if it is numeric
subset = circuits.query("alt.str.isnumeric() == False")
list_unique = pd.unique(subset["alt"])
print(list_unique)
['\\N' '-7']

Replace certain values

  • The replace function is used to replace values in a variable
  • The syntax is dataframe["variable"].replace(list_old, list_new)
  • More information about the function can be found here
# "list_old" encodes values we want to change
# From the list_unique, we see that the values we want to change are '\N' and '-7'
# "list_new" encodes the values that will replace the old
list_old = ['\\N','-7']
list_new = [np.nan, -7]

# This command replaces the values of the "alt" column
circuits["alt"] = circuits["alt"].replace(list_old, list_new)
  • After the cleaning process is done, you may want to store the dataset again
  • It’s strongly recommended to do this in a separate file from the original
  • Use to_csv() to save the dataset as a .csv file
circuits.to_csv("data_clean/circuits_clean.csv", index=False)
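  • One caveat: to_csv() raises an error if the target folder does not exist, so create it first; a sketch (the folder and file names here are just examples):

```python
import os
import pandas as pd

# Create the output folder if needed (no error if it already exists)
os.makedirs("data_clean", exist_ok=True)

# Example DataFrame, saved without the row index
df = pd.DataFrame({"a": [1, 2]})
df.to_csv("data_clean/example.csv", index=False)

print(os.path.exists("data_clean/example.csv"))  # True
```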

Try it yourself! 🧠

  • Use .replace() with the “country” column
  • Replace “UK” with “United Kingdom”
  • Appendix 04

Try it yourself! 🧠

  • What is the column type of “lat” or “lng”?
  • Does it have any string variables?
  • Can we use str.isnumeric() here?
  • Appendix 05

Recoding Numeric Variables 🔄

Recoding numeric variables

  • Recoding is the process of changing the values of a variable
  • We can recode variables for various reasons
    • To create new variables
    • To standardise variables
    • To simplify the analysis
  • Please remember to convert the variable to the correct type before recoding
# Check the data type of the "alt" column
circuits["alt"].dtype
dtype('O')
  • pd.to_numeric() is used to convert a variable to a numeric type
# pd.to_numeric() converts 
# a column to numeric
# Before you use this option, 
# make sure to "clean" the variable
# as we did before by checking what
# the non-numeric values are
circuits["alt_numeric"] = pd.to_numeric(circuits["alt"])
print(circuits["alt_numeric"].mean())
248.1891891891892
print(circuits["alt_numeric"].min())
print(circuits["alt_numeric"].max())
-7.0
2227.0
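  • If a column still contains stray non-numeric strings, pd.to_numeric() raises an error by default; the errors="coerce" option instead turns unparseable values into NaN
  • A sketch with a toy column (not the actual circuits data):

```python
import numpy as np
import pandas as pd

alt_raw = pd.Series(["10", "\\N", "-7"])

# errors="coerce" converts anything unparseable into NaN
alt_numeric = pd.to_numeric(alt_raw, errors="coerce")
print(alt_numeric)

# NaN is then skipped by summary statistics: (10 - 7) / 2
print(alt_numeric.mean())  # 1.5
```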

Recode variables based on an interval

  • Imagine that we want to recode the alt variable into an interval

\[x_{\text{bin}} = \begin{cases} \text{"A"} & \text{if } x_1 < x \le x_2 \\ \text{"B"} & \text{if } x_2 < x \le x_3 \end{cases}\]

  • We can use the pd.cut() function to do this
  • The syntax is df["new_variable"] = pd.cut(df["variable"], bins = [x1, x2, x3], labels = ["A", "B"])
  • Where bins are the intervals and labels are the new values
  • More information about the function can be found here
# Recode the "alt" variable into an interval
bins_x = [0, 2500, 5000]
labels_x = ["Between 0 and 2500",
            "Between 2500 and 5000"]

circuits["bins_alt"] = pd.cut(circuits["alt_numeric"],
                              bins = bins_x,
                              right = True,
                              labels = labels_x)
np.random.seed(2014)
display(circuits.sample(5))
circuitId circuitRef name location country lat lng alt url alt_numeric bins_alt
75 78 losail Losail International Circuit Al Daayen Qatar 25.4900 51.45420 NaN http://en.wikipedia.org/wiki/Losail_Internatio... NaN NaN
52 53 essarts Rouen-Les-Essarts Rouen France 49.3306 1.00458 81 http://en.wikipedia.org/wiki/Rouen-Les-Essarts 81.0 Between 0 and 2500
36 37 detroit Detroit Street Circuit Detroit USA 42.3298 -83.04010 177 http://en.wikipedia.org/wiki/Detroit_street_ci... 177.0 Between 0 and 2500
76 79 miami Miami International Autodrome Miami USA 25.9581 -80.23890 NaN http://en.wikipedia.org/wiki/Miami_Internation... NaN NaN
70 71 sochi Sochi Autodrom Sochi Russia 43.4057 39.95780 2 http://en.wikipedia.org/wiki/Sochi_Autodrom 2.0 Between 0 and 2500
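  • A quick way to check how the observations fell into each bin is value_counts(); a sketch with toy altitude values (not the circuits data):

```python
import pandas as pd

alt = pd.Series([2, 81, 177, 3000])

bins_alt = pd.cut(alt,
                  bins=[0, 2500, 5000],
                  right=True,
                  labels=["Between 0 and 2500", "Between 2500 and 5000"])

# Count how many observations landed in each bin
print(bins_alt.value_counts())
```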

And that’s it for today! 🎉

Thanks very much! 😊

Appendix 01

def modify_x():
    global x
    x = x + 5

x = 1

# Now, running the function 
# will permanently increase x by 5.

modify_x()
print(x)
modify_x()
print(x)
6
11
def fn_square(x):
    global y
    y = x**2
    return(y)

x = 5
y = -5

print("Example 1:")
print(fn_square(x = 10))
print(x)
print(y)
Example 1:
100
5
100

Back to exercise 01

Appendix 02

  • Write a lambda function checking whether num_siblings \(\ge 1\)
  • Add a variable to the dataset called has_siblings
  • Assign True/False to this variable using apply()
fn_has_siblings = lambda num_siblings: num_siblings >= 1

data["has_siblings"] = data["num_adult_siblings"].apply(fn_has_siblings)

display(data[["num_adult_siblings","has_siblings"]])
   num_adult_siblings  has_siblings
0                   1          True
1                   0         False
2                   0         False
3                   1          True
4                   0         False

Back to exercise 02

Appendix 03

  • Read the car dataset data_raw/features.csv
  • Create a function that tests whether mpg \(\ge\) 29
  • Add a variable mpg_above_29 which is True/False if mpg \(\ge\) 29
  • Store the new dataset to data_clean/features.csv
data_raw = pd.read_csv("data_raw/features.csv")

data_raw["mpg_above_29"] = data_raw["mpg"].apply(lambda mpg: mpg >= 29)

display(data_raw[["mpg","mpg_above_29"]])

data_raw.to_csv("data_clean/features.csv", index = False)
      mpg  mpg_above_29
0    18.0         False
1    15.0         False
2    18.0         False
3    16.0         False
4    17.0         False
..    ...           ...
393  27.0         False
394  44.0          True
395  32.0          True
396  28.0         False
397  31.0          True

398 rows × 2 columns

Back to exercise 03

Appendix 04

  • Use .replace() with the “country” column
  • Replace “UK” with “United Kingdom”
# Replace "UK" with "United Kingdom"
circuits["country"] = circuits["country"].replace("UK", "United Kingdom")
display(circuits[["country"]])
         country
0      Australia
1       Malaysia
2        Bahrain
3          Spain
4         Turkey
..           ...
72      Portugal
73         Italy
74  Saudi Arabia
75         Qatar
76           USA

77 rows × 1 columns

Back to exercise 04

Appendix 05

# Check the data type of the "lat" column
print(circuits["lat"].dtype)

# Check if the variable "lat" is numeric
# print(circuits["lat"].str.isnumeric())
# It will raise an error because "lat" is a float

# Check if the variable "lng" is numeric
# print(circuits["lng"].str.isnumeric())
# It will also raise an error
float64

Back to exercise 05