Lecture 13 - Replacing and Recoding Variables
14 October, 2024
NaN
values (“Not a Number” - missing data)pandas
and numpy
to do this 🐼circuits = pd.read_csv("https://github.com/danilofreire/qtm151/raw/refs/heads/main/lectures/lecture-13/data_raw/circuits.csv")
# Or open it from the local file
# circuits = pd.read_csv("data_raw/circuits.csv")
display(circuits.head(2))
circuitId | circuitRef | name | location | country | lat | lng | alt | url | |
---|---|---|---|---|---|---|---|---|---|
0 | 1 | albert_park | Albert Park Grand Prix Circuit | Melbourne | Australia | -37.84970 | 144.968 | 10 | http://en.wikipedia.org/wiki/Melbourne_Grand_P... |
1 | 2 | sepang | Sepang International Circuit | Kuala Lumpur | Malaysia | 2.76083 | 101.738 | 18 | http://en.wikipedia.org/wiki/Sepang_Internatio... |
Field
- Name of the variableType
- Type of the variable
int
), string (str
- varchart
), and float (float
)Description
- Label with a description of the variablevarchart(255)
mean?circuitId
: Unique identifier for the circuitcircuitRef
: Unique reference for the circuitname
: Name of the circuitlocation
: Locationcountry
: Country where the circuit is locatedlat
: Latitudelng
: Longitudealt
: Altitudeurl
: URL of the circuit’s Wikipedia pageNaN
value?NaN
stands for “Not a Number”NaN
values can be found in datasets for various reasons
NaN
values before we can analyse the dataNaN
values can be found in different types of variables
pandas
and numpy
have functions to handle NaN
values
NaN
values differently!NaN
valuesNaN
is a special number, available in numpy
NaN
valuesnumpy
arrays# Create two array with and without "NaNs"
# The "np.array()" functions converts
# a list to an array
vec_without_nans = np.array([1,1,1])
vec_with_nans = np.array([np.nan,4,5])
# When you add the vectors
# you will produce a NaN
# on any entries with "NaNs"
print(vec_without_nans * vec_with_nans)
print(vec_without_nans / vec_with_nans)
print(vec_without_nans + vec_with_nans)
print(vec_without_nans - vec_with_nans)
[nan 4. 5.]
[ nan 0.25 0.2 ]
[nan 5. 6.]
[nan -3. -4.]
NaN
valuesNaN
valuesmean()
functionmean()
function will return NaN
if there are NaN
values in the arrayNaN
valuesx
with NaN
valuescircuits
datasetdtypes
for thatobject
means that the variable is a string or a variable with mixed types (e.g., numbers and strings).str.isnumeric()
function.str
and .isnumeric()
.str
function is used to check if the variable is a string.isnumeric()
part is used to check if the string is numeric.isnumeric()
, we would get an error (trust me, I tried 😅)R
, this is similar to the %>%
operator in dplyr
.query()
? We can use it to reference subattributes of a variablepd.unique()
to extract a list of non-numeric valuespd.unique()
function will return a list of unique values in a variable# Extract a list of non-numeric values
# The pd.unique() function extracts unique values from a list
# Check each value in the alt column to see if it is not numeric
# True if it is not numeric, False if it is numeric
subset = circuits.query("alt.str.isnumeric() == False")
list_unique = pd.unique(subset["alt"])
print(list_unique)
['\\N' '-7']
replace
function is used to replace values in a variabledataframe["variable"].replace(list_old, list_new)
# "list_old" encodes values we want to change
# From the list_unique, we see that the values we want to change are '\N' and '-7'
# "list_new" encodes the values that will replace the old
list_old = ['\\N','-7']
list_new = [np.nan, -7]
# This command replaces the values of the "alt" column
circuits["alt"] = circuits["alt"].replace(list_old, list_new)
to_csv()
to save the dataset as a .csv
file.replace()
with the “country” columnstr.isnumeric()
here?pd.to_numeric()
is used to convert a variable to a numeric typealt
variable into an interval\[x_{bin} = \begin{cases} "A" &\text{ if } x_1 < x \le x_2 \\ "B" &\text{ if } x_2 < x \le x_3 \end{cases} \]
pd.cut()
function to do thisdf["new_variable"] = pd.cut(df["variable"], bins = [x1, x2, x3], labels = ["A", "B"])
bins
are the intervals and labels
are the new values# Recode the "alt" variable into an interval
bins_x = [0, 2500, 5000]
labels_x = ["Between 0 and 2500",
"Between 2500 and 5000"]
circuits["bins_alt"] = pd.cut(circuits["alt_numeric"],
bins = bins_x,
right = True,
labels = labels_x)
np.random.seed(2014)
display(circuits.sample(5))
circuitId | circuitRef | name | location | country | lat | lng | alt | url | alt_numeric | bins_alt | |
---|---|---|---|---|---|---|---|---|---|---|---|
75 | 78 | losail | Losail International Circuit | Al Daayen | Qatar | 25.4900 | 51.45420 | NaN | http://en.wikipedia.org/wiki/Losail_Internatio... | NaN | NaN |
52 | 53 | essarts | Rouen-Les-Essarts | Rouen | France | 49.3306 | 1.00458 | 81 | http://en.wikipedia.org/wiki/Rouen-Les-Essarts | 81.0 | Between 0 and 2500 |
36 | 37 | detroit | Detroit Street Circuit | Detroit | USA | 42.3298 | -83.04010 | 177 | http://en.wikipedia.org/wiki/Detroit_street_ci... | 177.0 | Between 0 and 2500 |
76 | 79 | miami | Miami International Autodrome | Miami | USA | 25.9581 | -80.23890 | NaN | http://en.wikipedia.org/wiki/Miami_Internation... | NaN | NaN |
70 | 71 | sochi | Sochi Autodrom | Sochi | Russia | 43.4057 | 39.95780 | 2 | http://en.wikipedia.org/wiki/Sochi_Autodrom | 2.0 | Between 0 and 2500 |
# Replace "UK" with "United Kingdom"
circuits["country"] = circuits["country"].replace("UK", "United Kingdom")
# Check the rows where the replacement was made
uk = circuits.query("country == 'United Kingdom'")
display(uk.head(5))
circuitId | circuitRef | name | location | country | lat | lng | alt | url | alt_numeric | bins_alt | |
---|---|---|---|---|---|---|---|---|---|---|---|
8 | 9 | silverstone | Silverstone Circuit | Silverstone | United Kingdom | 52.0786 | -1.016940 | 153 | http://en.wikipedia.org/wiki/Silverstone_Circuit | 153.0 | Between 0 and 2500 |
30 | 31 | donington | Donington Park | Castle Donington | United Kingdom | 52.8306 | -1.375280 | 88 | http://en.wikipedia.org/wiki/Donington_Park | 88.0 | Between 0 and 2500 |
37 | 38 | brands_hatch | Brands Hatch | Kent | United Kingdom | 51.3569 | 0.263056 | 145 | http://en.wikipedia.org/wiki/Brands_Hatch | 145.0 | Between 0 and 2500 |
57 | 58 | aintree | Aintree | Liverpool | United Kingdom | 53.4769 | -2.940560 | 20 | http://en.wikipedia.org/wiki/Aintree_Motor_Rac... | 20.0 | Between 0 and 2500 |
# Check if the "lat" column has any string values
# print(circuits["lat"].str.isnumeric())
# No, it gives an error!
# This one doesn't work either
# print(circuits.query("lat.str.isnumeric() == False"))
# We can use the following command to check if the column has any string values
print(circuits["lat"].dtype == 'string')
False
# Create a new variable "hemisphere"
# Encode lattitude in (-90 and 0] as "south"
# Encode lattitude in (0 and 90] as "north"
bins_lat = [-90, 0, 90]
labels_lat = ["south", "north"]
circuits["hemisphere"] = pd.cut(circuits["lat"],
bins = bins_lat,
right = True,
labels = labels_lat)
np.random.seed(151)
display(circuits.sample(5))
circuitId | circuitRef | name | location | country | lat | lng | alt | url | alt_numeric | bins_alt | hemisphere | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
22 | 80 | vegas | Las Vegas Strip Street Circuit | Las Vegas | United States | 36.11470 | -115.17300 | NaN | https://en.wikipedia.org/wiki/Las_Vegas_Grand_... | NaN | NaN | north |
61 | 62 | monsanto | Monsanto Park Circuit | Lisbon | Portugal | 38.71970 | -9.20306 | 158 | http://en.wikipedia.org/wiki/Monsanto_Park_Cir... | 158.0 | Between 0 and 2500 | north |
7 | 8 | magny_cours | Circuit de Nevers Magny-Cours | Magny Cours | France | 46.86420 | 3.16361 | 228 | http://en.wikipedia.org/wiki/Circuit_de_Nevers... | 228.0 | Between 0 and 2500 | north |
1 | 2 | sepang | Sepang International Circuit | Kuala Lumpur | Malaysia | 2.76083 | 101.73800 | 18 | http://en.wikipedia.org/wiki/Sepang_Internatio... | 18.0 | Between 0 and 2500 | north |
28 | 29 | adelaide | Adelaide Street Circuit | Adelaide | Australia | -34.92720 | 138.61700 | 58 | http://en.wikipedia.org/wiki/Adelaide_Street_C... | 58.0 | Between 0 and 2500 | south |