QTM 151 - Introduction to Statistical Computing II

Lecture 11 - Subsetting Data

Danilo Freire

Emory University

07 October, 2024

I hope you all had a very nice weekend! 😊

Today’s lecture

  • Today we will dive a little deeper into the pandas library
  • More specifically, we will learn how to subset data
  • This is a very important skill to have when working with data
  • We will look into four ways to subset data
  • We will also learn how to sort data with .sort_values()

Subsetting data with Data Wrangler

  • You can visualise data in VS Code by using the Data Wrangler extension
  • It allows you to see the data in a table format and gives you some basic statistics
  • You can also filter the data, sort it, fill missing values, and more
  • This new extension is super convienent when you are working with data you are not familiar with
  • It also has the advantage of using a sandboxed environment, so you can interact with the data without changing the original file (unless you explicitly save them)
  • Install the extension by going to the Extensions view (Ctrl+Shift+X) and searching for Data Wrangler, or click here
  • Full documentation

  • The extension is automatically integrated with the Jupyter extension, so you can open a .csv file and start wrangling the data
  • Or you can open a Jupyter notebook, load your dataset with pandas, and start wrangling the data from there

Subsetting data with Data Wrangler

  • Let’s see a quick example!
  • The quickest way to open the Data Wrangler is by right-clicking on a .csv file and selecting Open in Data Wrangler
  • Let’s visualise the features.csv dataset. It is located in the data_raw folder
  • The dataset is about cars and has the following variables:
    • mpg: Miles per gallon - Fuel efficiency of the car
    • cylinders: Number of cylinders
    • displacement: Volume of all the cylinders in the engine
    • horsepower: Power of the engine in horsepower
    • weight: Weight in pounds
    • acceleration: Acceleration from 0 to 60 mph
    • vehicle_id: Unique identifier for each car

Subsetting data with Data Wrangler

Subsetting data with Data Wrangler

Subsetting data with Data Wrangler

Any questions so far? 🤔

Subsetting data with Data Wrangler

  • When you click on Filter, you will see a menu like this:

  • Select the column you want to filter, and click on Add Filter
  • Then you can select the condition you want to filter by and click on Apply

  • You can also sort the data by clicking on Sort and selecting the column you want to sort by

  • Data Wrangler will also show the Python code that corresponds to the operations you are doing!
  • This is a great way to learn how to use pandas! 🐼
  • You can then Export to notebook and continue working on your data in a Jupyter notebook, Export as file, or Copy all code and paste it in your Python script

Let’s practice!

  • Please open the features.csv dataset in the Data Wrangler

  • We will filter the data to show only cars with 6 or more cylinders

  • And sort the data by mpg in descending order

  • Finally, we will export the code to a Jupyter notebook

  • Let’s do it together!

  • Download the dataset if you don’t have it (click on Download Raw File)

  • Open the dataset in the Data Wrangler and do the following:

    • Click on Filter
    • Select cylinders in Choose column
    • Select Greater than or equal to in condition
    • Type 6 in Value
    • Click on Apply

Try it yourself! 🚗

  • Now, let’s add two more conditions to the filter
  • We want to find the fastest, less fuel-efficient cars with 6 or more cylinders
  • Add a new condition to the filter:
    • Select acceleration in Choose column
    • Select Less than or equal to in condition
    • Type 15 in Value
  • Then, add the second condition:
    • Select mpg in Choose column
    • Select Less than or equal to in condition
    • Type 12 in Value
    • Click on Apply
  • How many cars are left? Appendix 01

Subsetting data with pandas

The [] operator

  • We saw how pandas can be used to subset data in the Data Wrangler
  • Now, let’s see how we can do the same in a Jupyter notebook or directly in a Python script
  • Let’s start with the [] operator
  • This is the most common way to subset data in pandas, and we can use it to select columns by name, or to select rows by index
  • To select rows by name, we type df['column']
  • But what if we don’t know the name of the column? We can use the .columns attribute to see all the columns in the dataset
  • Let’s load the dataset and see the columns
  • We can do so by typing the name of the dataset and use a period (.) to extract the attribute “columns”
  • If you want to add many columns, it is often a good idea to store the column names in a separate variable, as we will do below
import pandas as pd
carfeatures = pd.read_csv('data_raw/features.csv')
car_colnames = carfeatures.columns
print(car_colnames)
Index(['mpg', 'cylinders', 'displacement', 'horsepower', 'weight',
       'acceleration', 'vehicle_id'],
      dtype='object')

Subsetting data with pandas

The [] operator

  • We can either subset the data by typing the column name or the column index

  • With the index (note the brackets):

carfeatures[car_colnames[0]]
0      18.0
1      15.0
2      18.0
3      16.0
4      17.0
       ... 
393    27.0
394    44.0
395    32.0
396    28.0
397    31.0
Name: mpg, Length: 398, dtype: float64
  • With the name:
carfeatures['mpg']
0      18.0
1      15.0
2      18.0
3      16.0
4      17.0
       ... 
393    27.0
394    44.0
395    32.0
396    28.0
397    31.0
Name: mpg, Length: 398, dtype: float64

Selecting multiple columns

  • We can select multiple columns by passing a list of column names
  • Let’s say we want to select the columns mpg and weight
  • We can do so by typing:
carfeatures[['mpg', 'weight']]
mpg weight
0 18.0 3504
1 15.0 3693
2 18.0 3436
3 16.0 3433
4 17.0 3449
... ... ...
393 27.0 2790
394 44.0 2130
395 32.0 2295
396 28.0 2625
397 31.0 2720

398 rows × 2 columns

  • We can also store the columns in a variable and pass it to the [] operator
list_subsetcols = ["mpg", "weight"]
subcols_carfeatures = carfeatures[list_subsetcols]
subcols_carfeatures
mpg weight
0 18.0 3504
1 15.0 3693
2 18.0 3436
3 16.0 3433
4 17.0 3449
... ... ...
393 27.0 2790
394 44.0 2130
395 32.0 2295
396 28.0 2625
397 31.0 2720

398 rows × 2 columns

Subsetting data with pandas

  • Note that pandas uses double brackets when selecting multiple columns
  • This is because pandas interprets a single bracket as a series (a single column)
  • So we need two brackets to pass a list of columns
  • An explanation for this behavior is that pandas is trying to avoid ambiguity
  • Using df['col1', 'col2'] would be interpreted as trying to access a single column with the name ('col1', 'col2')
  • Yes, it’s a bizarre name, but it could happen if you, say, merged two columns and named it like that (which is not a good idea!)
  • More on this here and here

Try it yourself! 🚗

  • Extract the weight and acceleration columns
  • Appendix 02

Subsetting by row/column position

The .iloc[] (integer location) method

  • We use .iloc[] to select data by row and column position
  • .iloc[] can be used to select a single row
    • df.iloc[0] first row
  • Or multiple rows
    • df.iloc[[0, 1, 2]] first three rows
  • Multiple rows and columns
    • df.iloc[[0, 1, 2], [0, 1, 2]] first three rows and columns
  • The : is used to select all rows or columns, or a range of rows or columns
    • df.iloc[:, 0] all rows, first column
  • Or multiple columns
    • df.iloc[:, 0:3] all rows, first three columns
  • A range of rows and columns
    • df.iloc[0:3, 0:3] first three rows and columns
  • Remember that Python is zero-indexed and the last index is not included in lists

Subsetting by row/column position

The .iloc[] method

  • Let’s see some examples!
  • We can combine .iloc[] with sort_values(by = 'variable') to sort the data
carsorted = carfeatures.sort_values(by = "mpg", ascending = False)
print(carsorted.head())
      mpg  cylinders  displacement horsepower  weight  acceleration vehicle_id
322  46.6          4            86         65    2110          17.9  M-1693322
329  44.6          4            91         67    1850          13.8  H-1693399
325  44.3          4            90         48    2085          21.7  V-1693355
394  44.0          4            97         52    2130          24.6  V-1694114
326  43.4          4            90         48    2335          23.7  V-1693366
  • For example, to select the first three rows and the first three columns:
# Select the first three rows 
print(carsorted.iloc[[0, 1, 2]])
      mpg  cylinders  displacement horsepower  weight  acceleration vehicle_id
322  46.6          4            86         65    2110          17.9  M-1693322
329  44.6          4            91         67    1850          13.8  H-1693399
325  44.3          4            90         48    2085          21.7  V-1693355
# Select the first three rows and columns
print(carsorted.iloc[0:3, 0:3])
      mpg  cylinders  displacement
322  46.6          4            86
329  44.6          4            91
325  44.3          4            90
# Compare with the original dataset
print(carfeatures.iloc[0:3, 0:3])
    mpg  cylinders  displacement
0  18.0          8           307
1  15.0          8           350
2  18.0          8           318

The .iloc[] method

  • The following command extracts all columns for row zero
  • In this first example, we will show the car with the highest mpg value
# Select the first row, all columns
display(carsorted.iloc[0,:])
mpg                  46.6
cylinders               4
displacement           86
horsepower             65
weight               2110
acceleration         17.9
vehicle_id      M-1693322
Name: 322, dtype: object
# Select the first three rows, all columns
display(carsorted.iloc[[0,1,2],:])
mpg cylinders displacement horsepower weight acceleration vehicle_id
322 46.6 4 86 65 2110 17.9 M-1693322
329 44.6 4 91 67 1850 13.8 H-1693399
325 44.3 4 90 48 2085 21.7 V-1693355
# The `:` can be omitted when selecting all columns
display(carsorted.iloc[[0,1,2]])
mpg cylinders displacement horsepower weight acceleration vehicle_id
322 46.6 4 86 65 2110 17.9 M-1693322
329 44.6 4 91 67 1850 13.8 H-1693399
325 44.3 4 90 48 2085 21.7 V-1693355

Subset blocks of data

# Extract rows 0 to 5
display(carfeatures.iloc[0:5,:])
mpg cylinders displacement horsepower weight acceleration vehicle_id
0 18.0 8 307 130 3504 12.0 C-1689780
1 15.0 8 350 165 3693 11.5 B-1689791
2 18.0 8 318 150 3436 11.0 P-1689802
3 16.0 8 304 150 3433 12.0 A-1689813
4 17.0 8 302 140 3449 10.5 F-1689824
# Extract rows 8 onwards
display(carfeatures.iloc[:8,:])
mpg cylinders displacement horsepower weight acceleration vehicle_id
0 18.0 8 307 130 3504 12.0 C-1689780
1 15.0 8 350 165 3693 11.5 B-1689791
2 18.0 8 318 150 3436 11.0 P-1689802
3 16.0 8 304 150 3433 12.0 A-1689813
4 17.0 8 302 140 3449 10.5 F-1689824
5 15.0 8 429 198 4341 10.0 F-1689835
6 14.0 8 454 220 4354 9.0 C-1689846
7 14.0 8 440 215 4312 8.5 P-1689857

Try it yourself! 🚗

  • Create a new datate called car_ascendingmpg which sorts cars from lowest to highest mpg
  • Subset the data of 5 cars with the lowest “mpg”
  • Appendix 03

The .query() method

  • The .query() method is a powerful way to subset data in pandas
  • It allows you to filter data using a string expression, similar to natural language and SQL
  • The syntax is df.query('expression')
  • The expression is written as if you were writing a sentence
  • For example, to filter cars with more than 6 cylinders, we would write:
    • df.query('cylinders > 6')
    • df.query('cylinders > 6 & acceleration < 15 & mpg < 12')
carfeatures.query("mpg >= 25")
mpg cylinders displacement horsepower weight acceleration vehicle_id
18 27.0 4 97 88 2130 14.5 D-1689978
19 26.0 4 97 46 1835 20.5 V-1689989
20 25.0 4 110 87 2672 17.5 P-1690000
22 25.0 4 104 95 2375 17.5 S-1690022
23 26.0 4 121 113 2234 12.5 B-1690033
... ... ... ... ... ... ... ...
393 27.0 4 140 86 2790 15.6 F-1694103
394 44.0 4 97 52 2130 24.6 V-1694114
395 32.0 4 135 84 2295 11.6 D-1694125
396 28.0 4 120 79 2625 18.6 F-1694136
397 31.0 4 119 82 2720 19.4 C-1694147

169 rows × 7 columns

Let’s see more examples!

  • Combine multiple conditions with and, &, or, |
carfeatures.query("(acceleration >= 10) and (acceleration < 18)")
mpg cylinders displacement horsepower weight acceleration vehicle_id
0 18.0 8 307 130 3504 12.0 C-1689780
1 15.0 8 350 165 3693 11.5 B-1689791
2 18.0 8 318 150 3436 11.0 P-1689802
3 16.0 8 304 150 3433 12.0 A-1689813
4 17.0 8 302 140 3449 10.5 F-1689824
... ... ... ... ... ... ... ...
390 32.0 4 144 96 2665 13.9 T-1694070
391 36.0 4 135 84 2370 13.0 D-1694081
392 27.0 4 151 90 2950 17.3 C-1694092
393 27.0 4 140 86 2790 15.6 F-1694103
395 32.0 4 135 84 2295 11.6 D-1694125

315 rows × 7 columns

Let’s see more examples!

  • Use .query() and sort_values together
carfeatures.query("(cylinders == 8) & (mpg <= 15) & (acceleration <= 12)").sort_values(by = "mpg")
mpg cylinders displacement horsepower weight acceleration vehicle_id
124 11.0 8 350 180 3664 11.0 O-1691144
67 11.0 8 429 208 4633 11.0 M-1690517
95 12.0 8 455 225 4951 11.0 B-1690825
90 12.0 8 429 198 4952 11.5 M-1690770
42 12.0 8 383 180 4955 11.5 D-1690242
43 13.0 8 400 170 4746 12.0 F-1690253
96 13.0 8 360 175 3821 11.0 A-1690836
94 13.0 8 440 215 4735 11.0 C-1690814
91 13.0 8 400 150 4464 12.0 C-1690781
62 13.0 8 350 165 4274 12.0 C-1690462
44 13.0 8 400 175 5140 12.0 P-1690264
166 13.0 8 302 129 3169 12.0 F-1691606
213 13.0 8 350 145 4055 12.0 C-1692123
38 14.0 8 350 165 4209 12.0 C-1690198
13 14.0 8 455 225 3086 10.0 B-1689923
63 14.0 8 400 175 4385 12.0 P-1690473
11 14.0 8 340 160 3609 8.0 P-1689901
86 14.0 8 304 150 3672 11.5 A-1690726
8 14.0 8 455 225 4425 10.0 P-1689868
7 14.0 8 440 215 4312 8.5 P-1689857
6 14.0 8 454 220 4354 9.0 C-1689846
39 14.0 8 400 175 4464 11.5 P-1690209
12 15.0 8 400 150 3761 9.5 C-1689912
10 15.0 8 383 170 3563 10.0 D-1689890
9 15.0 8 390 190 3850 8.5 A-1689879
121 15.0 8 318 150 3399 11.0 D-1691111
5 15.0 8 429 198 4341 10.0 F-1689835
1 15.0 8 350 165 3693 11.5 B-1689791

Let’s see more examples!

  • Use .query() and select columns
  • Note that we are using the [[]] operator to select the columns
  • Also note that we don’t need parentheses when using & and | in the expression, but it is a good practice to use them
carfeatures.query("(cylinders == 8) & (mpg <= 15) and acceleration <= 12")[["mpg", "cylinders", "acceleration"]]
mpg cylinders acceleration
1 15.0 8 11.5
5 15.0 8 10.0
6 14.0 8 9.0
7 14.0 8 8.5
8 14.0 8 10.0
9 15.0 8 8.5
10 15.0 8 10.0
11 14.0 8 8.0
12 15.0 8 9.5
13 14.0 8 10.0
38 14.0 8 12.0
39 14.0 8 11.5
42 12.0 8 11.5
43 13.0 8 12.0
44 13.0 8 12.0
62 13.0 8 12.0
63 14.0 8 12.0
67 11.0 8 11.0
86 14.0 8 11.5
90 12.0 8 11.5
91 13.0 8 12.0
94 13.0 8 11.0
95 12.0 8 11.0
96 13.0 8 11.0
121 15.0 8 11.0
124 11.0 8 11.0
166 13.0 8 12.0
213 13.0 8 12.0

Let’s see more examples!

  • Combine .query() with global variables
threshold = 25
data_varthreshold_mpg = carfeatures.query("mpg >= @threshold")
data_varthreshold_mpg
mpg cylinders displacement horsepower weight acceleration vehicle_id
18 27.0 4 97 88 2130 14.5 D-1689978
19 26.0 4 97 46 1835 20.5 V-1689989
20 25.0 4 110 87 2672 17.5 P-1690000
22 25.0 4 104 95 2375 17.5 S-1690022
23 26.0 4 121 113 2234 12.5 B-1690033
... ... ... ... ... ... ... ...
393 27.0 4 140 86 2790 15.6 F-1694103
394 44.0 4 97 52 2130 24.6 V-1694114
395 32.0 4 135 84 2295 11.6 D-1694125
396 28.0 4 120 79 2625 18.6 F-1694136
397 31.0 4 119 82 2720 19.4 C-1694147

169 rows × 7 columns

Let’s see more examples!

  • Expression with variable names that contain spaces
  • Use backticks (` `) to refer to the variable name
carfeatures["new variable"] = carfeatures["mpg"]
data_spacesthreshold_mpg = carfeatures.query("`new variable` >= 25")
data_spacesthreshold_mpg
mpg cylinders displacement horsepower weight acceleration vehicle_id new variable
18 27.0 4 97 88 2130 14.5 D-1689978 27.0
19 26.0 4 97 46 1835 20.5 V-1689989 26.0
20 25.0 4 110 87 2672 17.5 P-1690000 25.0
22 25.0 4 104 95 2375 17.5 S-1690022 25.0
23 26.0 4 121 113 2234 12.5 B-1690033 26.0
... ... ... ... ... ... ... ... ...
393 27.0 4 140 86 2790 15.6 F-1694103 27.0
394 44.0 4 97 52 2130 24.6 V-1694114 44.0
395 32.0 4 135 84 2295 11.6 D-1694125 32.0
396 28.0 4 120 79 2625 18.6 F-1694136 28.0
397 31.0 4 119 82 2720 19.4 C-1694147 31.0

169 rows × 8 columns

Try it yourself! 🚗

  • Subset the data with mpg \(\ge\) 25 and cylinders == 8
  • Select the columns mpg and cylinders
  • Appendix 04

Plotting Subsets

  • matplotlib can also be used to plot subsets of data
  • The syntax is similar to what we have seen before
  • For example, you can just add other plt.scatter() or plt.hist() commands to the same cell
  • Or you can create a for loop to plot multiple subsets
  • Let’s see an example using cylynders
  • First, we need to import matplotlib and use pd.unique() to extract a list with the unique elements in that column
import matplotlib.pyplot as plt

list_unique_cylinders = pd.unique(carfeatures["cylinders"])
print(list_unique_cylinders)
[8 4 6 3 5]

Plotting Subsets

# If we call plt.scatter() twice, it will display both plots on the same graph
# We also include include plt.show() at the very end.
df_8 = carfeatures.query("cylinders == 8")
df_4 = carfeatures.query("cylinders == 4")

plt.scatter(x = df_8["weight"],y = df_8["acceleration"])
plt.scatter(x = df_4["weight"],y = df_4["acceleration"])
plt.legend(labels = ["8","4"],
           title  = "Cylinders")

plt.show()

# Note: If we put plt.show() in between the plots, then the results will
# be shown on separate graphs instead.

Plotting Subsets

  • Using a for loop to plot multiple subsets
# Compute number of unique categories
list_unique_cylinders = pd.unique(carfeatures["cylinders"])

# Use a for loop to plot a scatter plot between "weight" and "acceleration"
# for each category. Each plot  will have a different color

for category in list_unique_cylinders:
    df   = carfeatures.query("cylinders == @category")
    plt.scatter(x = df["weight"],y = df["acceleration"])
    
# Add labels and a legends    
plt.xlabel("Weight")
plt.ylabel("Acceleration")
plt.legend(labels = list_unique_cylinders,
           title  = "Cylinders")
plt.show()

Try it yourself! 🚗

  • Compute a histogram of “mpg” by cylinder count
  • Make the histograms transparent by adjusting alpha in plt.hist(x = ..., alpha = 0.5)
  • Appendix 05

And that’s it for today! 🎉

See you next time! 🚀

Appendix 01

import pandas as pd
df = pd.read_csv('data_raw/features.csv')
df_filtered = df[(df['cylinders'] >= 6) & (df['acceleration'] <= 15) & (df['mpg'] <= 12)]
print(df_filtered.shape)
df_filtered
(12, 7)
mpg cylinders displacement horsepower weight acceleration vehicle_id
25 10.0 8 360 215 4615 14.0 F-1690055
26 10.0 8 307 200 4376 15.0 C-1690066
27 11.0 8 318 210 4382 13.5 D-1690077
42 12.0 8 383 180 4955 11.5 D-1690242
67 11.0 8 429 208 4633 11.0 M-1690517
69 12.0 8 350 160 4456 13.5 O-1690539
90 12.0 8 429 198 4952 11.5 M-1690770
95 12.0 8 455 225 4951 11.0 B-1690825
103 11.0 8 400 150 4997 14.0 C-1690913
104 12.0 8 400 167 4906 12.5 F-1690924
106 12.0 8 350 180 4499 12.5 O-1690946
124 11.0 8 350 180 3664 11.0 O-1691144
  • There are 12 cars left (number of rows)

Back to exercise

Appendix 02

import pandas as pd
df = pd.read_csv('data_raw/features.csv')
df[['weight', 'acceleration']]
weight acceleration
0 3504 12.0
1 3693 11.5
2 3436 11.0
3 3433 12.0
4 3449 10.5
... ... ...
393 2790 15.6
394 2130 24.6
395 2295 11.6
396 2625 18.6
397 2720 19.4

398 rows × 2 columns

Back to exercise

Appendix 03

import pandas as pd
df = pd.read_csv('data_raw/features.csv')
car_ascendingmpg = df.sort_values(by = "mpg")
car_ascendingmpg.iloc[0:5]
mpg cylinders displacement horsepower weight acceleration vehicle_id
28 9.0 8 304 193 4732 18.5 H-1690088
25 10.0 8 360 215 4615 14.0 F-1690055
26 10.0 8 307 200 4376 15.0 C-1690066
103 11.0 8 400 150 4997 14.0 C-1690913
124 11.0 8 350 180 3664 11.0 O-1691144

Back to exercise

Appendix 04

import pandas as pd
df = pd.read_csv('data_raw/features.csv')
df.query("(mpg >= 25) & (cylinders == 8)")[["mpg", "cylinders"]]
mpg cylinders
364 26.6 8

Back to exercise

Appendix 05

import matplotlib.pyplot as plt
import pandas as pd

df = pd.read_csv('data_raw/features.csv')
list_unique_cylinders = pd.unique(df["cylinders"])

for category in list_unique_cylinders:
    df_category = df.query("cylinders == @category")
    plt.hist(x = df_category["mpg"], alpha = 0.5, bins = 10)

plt.xlabel("Miles per gallon")
plt.ylabel("Frequency")

plt.legend(labels = list_unique_cylinders,
           title  = "Cylinders")
plt.show()

Back to exercise