A B C
0 1.000000 2.000000 2.645751
1 1.414214 2.236068 2.828427
2 1.732051 2.449490 3.000000
Lecture 08 - Data Wrangling with Pandas
def and returnif statementspandas, the main library for data manipulation in Pythonapply and map functionsNaN values (“Not a Number” - missing data).py files and how to import them as modulespandas is the main library for data manipulation in Python 🐼numpy and matplotlib, and has a gazillion functions to work with data 😁R already, think about it as the dplyr of Python
dplyr and pandasapply function is used to apply a function to a dataset
df.apply(function)df.apply(function, axis=0) applies the function to each column (default)df.apply(function, axis=1) applies the function to each row A B C
0 1.000000 2.000000 2.645751
1 1.414214 2.236068 2.828427
2 1.732051 2.449490 3.000000
# The first two functions return True/False depending on age constraints
# The third function returns the sum of two numbers
# The fourth function returns a string with the age bracket
fn_iseligible_vote = lambda age: age >= 18
fn_istwenties = lambda age: (age >= 20) & (age < 30)
fn_sum = lambda x,y: x + y
def fn_agebracket(age):
if (age >= 18):
status = "Adult"
elif (age >= 10) & (age < 18):
status = "Adolescent"
else:
status = "Child"
return(status)data["age"] column| age | num_underage_siblings | num_adult_siblings | can_vote | in_twenties | age_bracket | |
|---|---|---|---|---|---|---|
| 0 | 18 | 0 | 1 | True | False | Adult |
| 1 | 29 | 0 | 0 | True | True | Adult |
| 2 | 15 | 1 | 0 | False | False | Adolescent |
| 3 | 32 | 1 | 1 | True | False | Adult |
| 4 | 6 | 0 | 0 | False | False | Child |
apply function| age | num_underage_siblings | num_adult_siblings | can_vote | in_twenties | age_bracket | new_var | |
|---|---|---|---|---|---|---|---|
| 0 | 18 | 0 | 1 | True | False | Adult | True |
| 1 | 29 | 0 | 0 | True | True | Adult | True |
| 2 | 15 | 1 | 0 | False | False | Adolescent | False |
| 3 | 32 | 1 | 1 | True | False | Adult | True |
| 4 | 6 | 0 | 0 | False | False | Child | False |
drop functionmap function is used to apply a function to a list, an array, or a series
map works very similarly to the apply function, and they are interchangeable when working with seriesmap can be faster than apply for simple functions, but apply is more flexible as it can be used with DataFrames (many columns)list01 = [1,2,3]), you should use map instead of apply
apply is not a built-in Python functionmap with a list and an array[1, 4, 9, 16, 25]
apply with a list or an array will raise an error---------------------------------------------------------------------------
NameError Traceback (most recent call last)
Cell In[168], line 5
2 list01 = [1,2,3,4,5]
4 # Apply a function to the list
----> 5 list02 = list(apply(lambda x: x**2, list01))
7 print(list02)
NameError: name 'apply' is not definednum_siblings \(\ge 1\)has_siblingsapply().ipynb files are great for learning and teaching, they are not the best for sharing code.py file, which is a Python script| circuitId | circuitRef | name | location | country | lat | lng | alt | url | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | albert_park | Albert Park Grand Prix Circuit | Melbourne | Australia | -37.84970 | 144.968 | 10 | http://en.wikipedia.org/wiki/Melbourne_Grand_P... |
| 1 | 2 | sepang | Sepang International Circuit | Kuala Lumpur | Malaysia | 2.76083 | 101.738 | 18 | http://en.wikipedia.org/wiki/Sepang_Internatio... |
Field - Name of the variableType - Type of the variable
int), string (str - varchart), and float (float)Description - Label with a description of the variablevarchart(255) mean?circuitId: Unique identifier for the circuitcircuitRef: Unique reference for the circuitname: Name of the circuitlocation: Locationcountry: Country where the circuit is locatedlat: Latitudelng: Longitudealt: Altitudeurl: URL of the circuit’s Wikipedia pageNaN value?NaN stands for “Not a Number”NaN values can be found in datasets for various reasons
NaN values before we can analyse the dataNaN values can be found in different types of variables
pandas and numpy have functions to handle NaN values
NaN values differently!NaN valuesNaN is a special number, available in numpyNaN valuesnumpy arrays# Create two array with and without "NaNs"
# The "np.array()" functions converts
# a list to an array
vec_without_nans = np.array([1,1,1])
vec_with_nans = np.array([np.nan,4,5])
# When you add the vectors
# you will produce a NaN
# on any entries with "NaNs"
print(vec_without_nans * vec_with_nans)
print(vec_without_nans / vec_with_nans)
print(vec_without_nans + vec_with_nans)
print(vec_without_nans - vec_with_nans)[nan 4. 5.]
[ nan 0.25 0.2 ]
[nan 5. 6.]
[nan -3. -4.]
NaN valuesNaN valuesmean() functionmean() function will return NaN if there are NaN values in the arrayNaN valuesx with NaN valuescircuits datasetdtypes for thatobject means that the variable is a string or a variable with mixed types (e.g., numbers and strings).str.isnumeric() function.str and .isnumeric().str function is used to check if the variable is a string.isnumeric() part is used to check if the string is numeric.isnumeric(), we would get an error (trust me, I tried 😅)R, this is similar to the %>% operator in dplyrquery() to reference subattributes of a variable
query() is a method of a pandas DataFrame and it has many useful functions!query() with pd.unique() to extract a list of non-numeric valuespd.unique() function will return a list of unique values in a variable# Extract a list of non-numeric values
# The pd.unique() function extracts unique values from a list
# Check each value in the alt column to see if it is not numeric
# True if it is not numeric, False if it is numeric
subset = circuits.query("alt.str.isnumeric() == False")
list_unique = pd.unique(subset["alt"])
print(list_unique)['\\N' '-7']
replace function is used to replace values in a variabledataframe["variable"].replace(list_old, list_new)# "list_old" encodes values we want to change
# From the list_unique, we see that the values we want to change are '\N' and '-7'
# "list_new" encodes the values that will replace the old
list_old = ['\\N','-7']
list_new = [np.nan, -7]
# This command replaces the values of the "alt" column
circuits["alt"] = circuits["alt"].replace(list_old, list_new)to_csv() to save the dataset as a .csv file.replace() with the “country” columnstr.isnumeric() here?pd.to_numeric() is used to convert a variable to a numeric type248.1891891891892
alt variable into an interval\[x_{bin} = \begin{cases} "A" &\text{ if } x_1 < x \le x_2 \\ "B" &\text{ if } x_2 < x \le x_3 \end{cases} \]
pd.cut() function to do thisdf["new_variable"] = pd.cut(df["variable"], bins = [x1, x2, x3], labels = ["A", "B"])bins are the intervals and labels are the new values| circuitId | circuitRef | name | location | country | lat | lng | alt | url | alt_numeric | bins_alt | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 31 | 32 | rodriguez | Autódromo Hermanos Rodríguez | Mexico City | Mexico | 19.4042 | -99.09070 | 2227 | http://en.wikipedia.org/wiki/Aut%C3%B3dromo_He... | 2227.0 | Between 0 and 2500 |
| 43 | 44 | las_vegas | Las Vegas Street Circuit | Nevada | USA | 36.1162 | -115.17400 | 639 | http://en.wikipedia.org/wiki/Las_Vegas_Street_... | 639.0 | Between 0 and 2500 |
| 26 | 27 | estoril | Autódromo do Estoril | Estoril | Portugal | 38.7506 | -9.39417 | 130 | http://en.wikipedia.org/wiki/Aut%C3%B3dromo_do... | 130.0 | Between 0 and 2500 |
| 74 | 77 | jeddah | Jeddah Corniche Circuit | Jeddah | Saudi Arabia | 21.6319 | 39.10440 | 15 | http://en.wikipedia.org/wiki/Jeddah_Street_Cir... | 15.0 | Between 0 and 2500 |
| 58 | 59 | boavista | Circuito da Boavista | Oporto | Portugal | 41.1705 | -8.67325 | 28 | http://en.wikipedia.org/wiki/Circuito_da_Boavista | 28.0 | Between 0 and 2500 |
num_siblings \(\ge 1\)has_siblingsapply()| num_adult_siblings | has_siblings | |
|---|---|---|
| 0 | 1 | True |
| 1 | 0 | False |
| 2 | 0 | False |
| 3 | 1 | True |
| 4 | 0 | False |
data_raw/features.csvmpg \(\ge\) 29mpg_above_29 which is True/False if mpg \(\ge\) 29data_clean/features.csv| mpg | mpg_above_29 | |
|---|---|---|
| 0 | 18.0 | False |
| 1 | 15.0 | False |
| 2 | 18.0 | False |
| 3 | 16.0 | False |
| 4 | 17.0 | False |
| ... | ... | ... |
| 393 | 27.0 | False |
| 394 | 44.0 | True |
| 395 | 32.0 | True |
| 396 | 28.0 | False |
| 397 | 31.0 | True |
398 rows × 2 columns
.replace() with the “country” column| country | |
|---|---|
| 0 | Australia |
| 1 | Malaysia |
| 2 | Bahrain |
| 3 | Spain |
| 4 | Turkey |
| ... | ... |
| 72 | Portugal |
| 73 | Italy |
| 74 | Saudi Arabia |
| 75 | Qatar |
| 76 | USA |
77 rows × 1 columns
# Check the data type of the "lat" column
print(circuits["lat"].dtype)
# Check if the variable "lat" is numeric
# print(circuits["lat"].str.isnumeric())
# It will raise an error because "lat" is a float
# Check if the variable "lng" is numeric
# print(circuits["lng"].str.isnumeric())
# It will also raise an errorfloat64