Lecture 13 - Replacing and Recoding Variables
NaN values (“Not a Number” - missing data)pandas and numpy to do this 🐼circuits = pd.read_csv("https://github.com/danilofreire/qtm151/raw/refs/heads/main/lectures/lecture-13/data_raw/circuits.csv")
# Or open it from the local file
# circuits = pd.read_csv("data_raw/circuits.csv")
display(circuits.head(2))| circuitId | circuitRef | name | location | country | lat | lng | alt | url | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | albert_park | Albert Park Grand Prix Circuit | Melbourne | Australia | -37.84970 | 144.968 | 10 | http://en.wikipedia.org/wiki/Melbourne_Grand_P... |
| 1 | 2 | sepang | Sepang International Circuit | Kuala Lumpur | Malaysia | 2.76083 | 101.738 | 18 | http://en.wikipedia.org/wiki/Sepang_Internatio... |
Field - Name of the variableType - Type of the variable
int), string (str - varchart), and float (float)Description - Label with a description of the variablevarchart(255) mean?circuitId: Unique identifier for the circuitcircuitRef: Unique reference for the circuitname: Name of the circuitlocation: Locationcountry: Country where the circuit is locatedlat: Latitudelng: Longitudealt: Altitudeurl: URL of the circuit’s Wikipedia pageNaN value?NaN stands for “Not a Number”NaN values can be found in datasets for various reasons
NaN values before we can analyse the dataNaN values can be found in different types of variables
pandas and numpy have functions to handle NaN values
NaN values differently!NaN valuesNaN is a special number, available in numpyNaN valuesnumpy arrays# Create two array with and without "NaNs"
# The "np.array()" functions converts
# a list to an array
vec_without_nans = np.array([1,1,1])
vec_with_nans = np.array([np.nan,4,5])
# When you add the vectors
# you will produce a NaN
# on any entries with "NaNs"
print(vec_without_nans * vec_with_nans)
print(vec_without_nans / vec_with_nans)
print(vec_without_nans + vec_with_nans)
print(vec_without_nans - vec_with_nans)[nan 4. 5.]
[ nan 0.25 0.2 ]
[nan 5. 6.]
[nan -3. -4.]
NaN valuesNaN valuesmean() functionmean() function will return NaN if there are NaN values in the arrayNaN valuesx with NaN valuescircuits datasetdtypes for thatobject means that the variable is a string or a variable with mixed types (e.g., numbers and strings).str.isnumeric() function.str and .isnumeric().str function is used to check if the variable is a string.isnumeric() part is used to check if the string is numeric.isnumeric(), we would get an error (trust me, I tried 😅)R, this is similar to the %>% operator in dplyr.query()? We can use it to reference subattributes of a variablepd.unique() to extract a list of non-numeric valuespd.unique() function will return a list of unique values in a variable# Extract a list of non-numeric values
# The pd.unique() function extracts unique values from a list
# Check each value in the alt column to see if it is not numeric
# True if it is not numeric, False if it is numeric
subset = circuits.query("alt.str.isnumeric() == False")
list_unique = pd.unique(subset["alt"])
print(list_unique)['\\N' '-7']
replace function is used to replace values in a variabledataframe["variable"].replace(list_old, list_new)# "list_old" encodes values we want to change
# From the list_unique, we see that the values we want to change are '\N' and '-7'
# "list_new" encodes the values that will replace the old
list_old = ['\\N','-7']
list_new = [np.nan, -7]
# This command replaces the values of the "alt" column
circuits["alt"] = circuits["alt"].replace(list_old, list_new)to_csv() to save the dataset as a .csv file.replace() with the “country” columnstr.isnumeric() here?pd.to_numeric() is used to convert a variable to a numeric typealt variable into an interval\[x_{bin} = \begin{cases} "A" &\text{ if } x_1 < x \le x_2 \\ "B" &\text{ if } x_2 < x \le x_3 \end{cases} \]
pd.cut() function to do thisdf["new_variable"] = pd.cut(df["variable"], bins = [x1, x2, x3], labels = ["A", "B"])bins are the intervals and labels are the new values# Recode the "alt" variable into an interval
bins_x = [0, 2500, 5000]
labels_x = ["Between 0 and 2500",
"Between 2500 and 5000"]
circuits["bins_alt"] = pd.cut(circuits["alt_numeric"],
bins = bins_x,
right = True,
labels = labels_x)
np.random.seed(2014)
display(circuits.sample(5))| circuitId | circuitRef | name | location | country | lat | lng | alt | url | alt_numeric | bins_alt | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 75 | 78 | losail | Losail International Circuit | Al Daayen | Qatar | 25.4900 | 51.45420 | NaN | http://en.wikipedia.org/wiki/Losail_Internatio... | NaN | NaN |
| 52 | 53 | essarts | Rouen-Les-Essarts | Rouen | France | 49.3306 | 1.00458 | 81 | http://en.wikipedia.org/wiki/Rouen-Les-Essarts | 81.0 | Between 0 and 2500 |
| 36 | 37 | detroit | Detroit Street Circuit | Detroit | USA | 42.3298 | -83.04010 | 177 | http://en.wikipedia.org/wiki/Detroit_street_ci... | 177.0 | Between 0 and 2500 |
| 76 | 79 | miami | Miami International Autodrome | Miami | USA | 25.9581 | -80.23890 | NaN | http://en.wikipedia.org/wiki/Miami_Internation... | NaN | NaN |
| 70 | 71 | sochi | Sochi Autodrom | Sochi | Russia | 43.4057 | 39.95780 | 2 | http://en.wikipedia.org/wiki/Sochi_Autodrom | 2.0 | Between 0 and 2500 |
# Replace "UK" with "United Kingdom"
circuits["country"] = circuits["country"].replace("UK", "United Kingdom")
# Check the rows where the replacement was made
uk = circuits.query("country == 'United Kingdom'")
display(uk.head(5))| circuitId | circuitRef | name | location | country | lat | lng | alt | url | alt_numeric | bins_alt | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 8 | 9 | silverstone | Silverstone Circuit | Silverstone | United Kingdom | 52.0786 | -1.016940 | 153 | http://en.wikipedia.org/wiki/Silverstone_Circuit | 153.0 | Between 0 and 2500 |
| 30 | 31 | donington | Donington Park | Castle Donington | United Kingdom | 52.8306 | -1.375280 | 88 | http://en.wikipedia.org/wiki/Donington_Park | 88.0 | Between 0 and 2500 |
| 37 | 38 | brands_hatch | Brands Hatch | Kent | United Kingdom | 51.3569 | 0.263056 | 145 | http://en.wikipedia.org/wiki/Brands_Hatch | 145.0 | Between 0 and 2500 |
| 57 | 58 | aintree | Aintree | Liverpool | United Kingdom | 53.4769 | -2.940560 | 20 | http://en.wikipedia.org/wiki/Aintree_Motor_Rac... | 20.0 | Between 0 and 2500 |
# Check if the "lat" column has any string values
# print(circuits["lat"].str.isnumeric())
# No, it gives an error!
# This one doesn't work either
# print(circuits.query("lat.str.isnumeric() == False"))
# We can use the following command to check if the column has any string values
print(circuits["lat"].dtype == 'string')False
# Create a new variable "hemisphere"
# Encode lattitude in (-90 and 0] as "south"
# Encode lattitude in (0 and 90] as "north"
bins_lat = [-90, 0, 90]
labels_lat = ["south", "north"]
circuits["hemisphere"] = pd.cut(circuits["lat"],
bins = bins_lat,
right = True,
labels = labels_lat)
np.random.seed(151)
display(circuits.sample(5))| circuitId | circuitRef | name | location | country | lat | lng | alt | url | alt_numeric | bins_alt | hemisphere | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 22 | 80 | vegas | Las Vegas Strip Street Circuit | Las Vegas | United States | 36.11470 | -115.17300 | NaN | https://en.wikipedia.org/wiki/Las_Vegas_Grand_... | NaN | NaN | north |
| 61 | 62 | monsanto | Monsanto Park Circuit | Lisbon | Portugal | 38.71970 | -9.20306 | 158 | http://en.wikipedia.org/wiki/Monsanto_Park_Cir... | 158.0 | Between 0 and 2500 | north |
| 7 | 8 | magny_cours | Circuit de Nevers Magny-Cours | Magny Cours | France | 46.86420 | 3.16361 | 228 | http://en.wikipedia.org/wiki/Circuit_de_Nevers... | 228.0 | Between 0 and 2500 | north |
| 1 | 2 | sepang | Sepang International Circuit | Kuala Lumpur | Malaysia | 2.76083 | 101.73800 | 18 | http://en.wikipedia.org/wiki/Sepang_Internatio... | 18.0 | Between 0 and 2500 | north |
| 28 | 29 | adelaide | Adelaide Street Circuit | Adelaide | Australia | -34.92720 | 138.61700 | 58 | http://en.wikipedia.org/wiki/Adelaide_Street_C... | 58.0 | Between 0 and 2500 | south |