QTM 151 - Introduction to Statistical Computing II

Lecture 11 - Subsetting Data

Danilo Freire

danilo.freire@emory.edu

Emory University

I hope you all had a very nice weekend! 😊

Today’s lecture

Today we will dive a little deeper into the pandas library
More specifically, we will learn how to subset data
This is a very important skill to have when working with data
We will look into four ways to subset data
- Using VS Code’s Data Wrangler extension
- Using []
- Using .iloc[]
- Using .query()
We will also learn how to sort data with .sort_values()

Subsetting data with `Data Wrangler`

You can visualise data in VS Code by using the Data Wrangler extension
It allows you to see the data in a table format and gives you some basic statistics
You can also filter the data, sort it, fill missing values, and more
This new extension is super convienent when you are working with data you are not familiar with
It also has the advantage of using a sandboxed environment, so you can interact with the data without changing the original file (unless you explicitly save them)
Install the extension by going to the Extensions view (Ctrl+Shift+X) and searching for Data Wrangler, or click here
Full documentation

The extension is automatically integrated with the Jupyter extension, so you can open a .csv file and start wrangling the data
Or you can open a Jupyter notebook, load your dataset with pandas, and start wrangling the data from there

Subsetting data with `Data Wrangler`

Let’s see a quick example!
The quickest way to open the Data Wrangler is by right-clicking on a .csv file and selecting Open in Data Wrangler
Let’s visualise the features.csv dataset. It is located in the data_raw folder
The dataset is about cars and has the following variables:
- mpg: Miles per gallon - Fuel efficiency of the car
- cylinders: Number of cylinders
- displacement: Volume of all the cylinders in the engine
- horsepower: Power of the engine in horsepower
- weight: Weight in pounds
- acceleration: Acceleration from 0 to 60 mph
- vehicle_id: Unique identifier for each car

Subsetting data with `Data Wrangler`

Any questions so far? 🤔

Subsetting data with `Data Wrangler`

When you click on Filter, you will see a menu like this:

Select the column you want to filter, and click on Add Filter
Then you can select the condition you want to filter by and click on Apply

You can also sort the data by clicking on Sort and selecting the column you want to sort by

Data Wrangler will also show the Python code that corresponds to the operations you are doing!
This is a great way to learn how to use pandas! 🐼
You can then Export to notebook and continue working on your data in a Jupyter notebook, Export as file, or Copy all code and paste it in your Python script

Let’s practice!

Please open the features.csv dataset in the Data Wrangler
We will filter the data to show only cars with 6 or more cylinders
And sort the data by mpg in descending order
Finally, we will export the code to a Jupyter notebook
Let’s do it together!
Download the dataset if you don’t have it (click on Download Raw File)
Open the dataset in the Data Wrangler and do the following:
- Click on Filter
- Select cylinders in Choose column
- Select Greater than or equal to in condition
- Type 6 in Value
- Click on Apply

Try it yourself! 🚗

Now, let’s add two more conditions to the filter
We want to find the fastest, less fuel-efficient cars with 6 or more cylinders
Add a new condition to the filter:
- Select acceleration in Choose column
- Select Less than or equal to in condition
- Type 15 in Value
Then, add the second condition:
- Select mpg in Choose column
- Select Less than or equal to in condition
- Type 12 in Value
- Click on Apply
How many cars are left? Appendix 01

Subsetting data with `pandas`

The `[]` operator

We saw how pandas can be used to subset data in the Data Wrangler
Now, let’s see how we can do the same in a Jupyter notebook or directly in a Python script
Let’s start with the [] operator
This is the most common way to subset data in pandas, and we can use it to select columns by name, or to select rows by index
To select rows by name, we type df['column']
But what if we don’t know the name of the column? We can use the .columns attribute to see all the columns in the dataset

Let’s load the dataset and see the columns
We can do so by typing the name of the dataset and use a period (.) to extract the attribute “columns”
If you want to add many columns, it is often a good idea to store the column names in a separate variable, as we will do below

import pandas as pd
carfeatures = pd.read_csv('data_raw/features.csv')
car_colnames = carfeatures.columns
print(car_colnames)

Index(['mpg', 'cylinders', 'displacement', 'horsepower', 'weight',
       'acceleration', 'vehicle_id'],
      dtype='object')

Subsetting data with `pandas`

The `[]` operator

We can either subset the data by typing the column name or the column index
With the index (note the brackets):

carfeatures[car_colnames[0]]

0      18.0
1      15.0
2      18.0
3      16.0
4      17.0
       ... 
393    27.0
394    44.0
395    32.0
396    28.0
397    31.0
Name: mpg, Length: 398, dtype: float64

With the name:

carfeatures['mpg']

0      18.0
1      15.0
2      18.0
3      16.0
4      17.0
       ... 
393    27.0
394    44.0
395    32.0
396    28.0
397    31.0
Name: mpg, Length: 398, dtype: float64

Selecting multiple columns

We can select multiple columns by passing a list of column names
Let’s say we want to select the columns mpg and weight
We can do so by typing:

carfeatures[['mpg', 'weight']]

	mpg	weight
0	18.0	3504
1	15.0	3693
2	18.0	3436
3	16.0	3433
4	17.0	3449
...	...	...
393	27.0	2790
394	44.0	2130
395	32.0	2295
396	28.0	2625
397	31.0	2720

398 rows × 2 columns

We can also store the columns in a variable and pass it to the [] operator

list_subsetcols = ["mpg", "weight"]
subcols_carfeatures = carfeatures[list_subsetcols]
subcols_carfeatures

	mpg	weight
0	18.0	3504
1	15.0	3693
2	18.0	3436
3	16.0	3433
4	17.0	3449
...	...	...
393	27.0	2790
394	44.0	2130
395	32.0	2295
396	28.0	2625
397	31.0	2720

398 rows × 2 columns

Subsetting data with `pandas`

Note that pandas uses double brackets when selecting multiple columns
This is because pandas interprets a single bracket as a series (a single column)
So we need two brackets to pass a list of columns
An explanation for this behavior is that pandas is trying to avoid ambiguity
Using df['col1', 'col2'] would be interpreted as trying to access a single column with the name ('col1', 'col2')
Yes, it’s a bizarre name, but it could happen if you, say, merged two columns and named it like that (which is not a good idea!)
More on this here and here

Try it yourself! 🚗

Extract the weight and acceleration columns
Appendix 02

Subsetting by row/column position

The `.iloc[]` (integer location) method

We use .iloc[] to select data by row and column position
.iloc[] can be used to select a single row
- df.iloc[0] first row
Or multiple rows
- df.iloc[[0, 1, 2]] first three rows
Multiple rows and columns
- df.iloc[[0, 1, 2], [0, 1, 2]] first three rows and columns
The : is used to select all rows or columns, or a range of rows or columns
- df.iloc[:, 0] all rows, first column
Or multiple columns
- df.iloc[:, 0:3] all rows, first three columns
A range of rows and columns
- df.iloc[0:3, 0:3] first three rows and columns
Remember that Python is zero-indexed and the last index is not included in lists

Subsetting by row/column position

The `.iloc[]` method

Let’s see some examples!
We can combine .iloc[] with sort_values(by = 'variable') to sort the data

carsorted = carfeatures.sort_values(by = "mpg", ascending = False)
print(carsorted.head())

      mpg  cylinders  displacement horsepower  weight  acceleration vehicle_id
322  46.6          4            86         65    2110          17.9  M-1693322
329  44.6          4            91         67    1850          13.8  H-1693399
325  44.3          4            90         48    2085          21.7  V-1693355
394  44.0          4            97         52    2130          24.6  V-1694114
326  43.4          4            90         48    2335          23.7  V-1693366

For example, to select the first three rows and the first three columns:

# Select the first three rows 
print(carsorted.iloc[[0, 1, 2]])

      mpg  cylinders  displacement horsepower  weight  acceleration vehicle_id
322  46.6          4            86         65    2110          17.9  M-1693322
329  44.6          4            91         67    1850          13.8  H-1693399
325  44.3          4            90         48    2085          21.7  V-1693355

# Select the first three rows and columns
print(carsorted.iloc[0:3, 0:3])

      mpg  cylinders  displacement
322  46.6          4            86
329  44.6          4            91
325  44.3          4            90

# Compare with the original dataset
print(carfeatures.iloc[0:3, 0:3])

    mpg  cylinders  displacement
0  18.0          8           307
1  15.0          8           350
2  18.0          8           318

The `.iloc[]` method

The following command extracts all columns for row zero
In this first example, we will show the car with the highest mpg value

# Select the first row, all columns
display(carsorted.iloc[0,:])

mpg                  46.6
cylinders               4
displacement           86
horsepower             65
weight               2110
acceleration         17.9
vehicle_id      M-1693322
Name: 322, dtype: object

# Select the first three rows, all columns
display(carsorted.iloc[[0,1,2],:])

	mpg	cylinders	displacement	horsepower	weight	acceleration	vehicle_id
322	46.6	4	86	65	2110	17.9	M-1693322
329	44.6	4	91	67	1850	13.8	H-1693399
325	44.3	4	90	48	2085	21.7	V-1693355

# The `:` can be omitted when selecting all columns
display(carsorted.iloc[[0,1,2]])

	mpg	cylinders	displacement	horsepower	weight	acceleration	vehicle_id
322	46.6	4	86	65	2110	17.9	M-1693322
329	44.6	4	91	67	1850	13.8	H-1693399
325	44.3	4	90	48	2085	21.7	V-1693355

Subset blocks of data

# Extract rows 0 to 5
display(carfeatures.iloc[0:5,:])

	mpg	cylinders	displacement	horsepower	weight	acceleration	vehicle_id
0	18.0	8	307	130	3504	12.0	C-1689780
1	15.0	8	350	165	3693	11.5	B-1689791
2	18.0	8	318	150	3436	11.0	P-1689802
3	16.0	8	304	150	3433	12.0	A-1689813
4	17.0	8	302	140	3449	10.5	F-1689824

# Extract rows 8 onwards
display(carfeatures.iloc[:8,:])

	mpg	cylinders	displacement	horsepower	weight	acceleration	vehicle_id
0	18.0	8	307	130	3504	12.0	C-1689780
1	15.0	8	350	165	3693	11.5	B-1689791
2	18.0	8	318	150	3436	11.0	P-1689802
3	16.0	8	304	150	3433	12.0	A-1689813
4	17.0	8	302	140	3449	10.5	F-1689824
5	15.0	8	429	198	4341	10.0	F-1689835
6	14.0	8	454	220	4354	9.0	C-1689846
7	14.0	8	440	215	4312	8.5	P-1689857

Try it yourself! 🚗

Create a new datate called car_ascendingmpg which sorts cars from lowest to highest mpg
Subset the data of 5 cars with the lowest “mpg”
Appendix 03

The `.query()` method

The .query() method is a powerful way to subset data in pandas
It allows you to filter data using a string expression, similar to natural language and SQL
The syntax is df.query('expression')
The expression is written as if you were writing a sentence
For example, to filter cars with more than 6 cylinders, we would write:
- df.query('cylinders > 6')
- df.query('cylinders > 6 & acceleration < 15 & mpg < 12')

carfeatures.query("mpg >= 25")

	mpg	cylinders	displacement	horsepower	weight	acceleration	vehicle_id
18	27.0	4	97	88	2130	14.5	D-1689978
19	26.0	4	97	46	1835	20.5	V-1689989
20	25.0	4	110	87	2672	17.5	P-1690000
22	25.0	4	104	95	2375	17.5	S-1690022
23	26.0	4	121	113	2234	12.5	B-1690033
...	...	...	...	...	...	...	...
393	27.0	4	140	86	2790	15.6	F-1694103
394	44.0	4	97	52	2130	24.6	V-1694114
395	32.0	4	135	84	2295	11.6	D-1694125
396	28.0	4	120	79	2625	18.6	F-1694136
397	31.0	4	119	82	2720	19.4	C-1694147

169 rows × 7 columns

Let’s see more examples!

Combine multiple conditions with and, &, or, |

carfeatures.query("(acceleration >= 10) and (acceleration < 18)")

	mpg	cylinders	displacement	horsepower	weight	acceleration	vehicle_id
0	18.0	8	307	130	3504	12.0	C-1689780
1	15.0	8	350	165	3693	11.5	B-1689791
2	18.0	8	318	150	3436	11.0	P-1689802
3	16.0	8	304	150	3433	12.0	A-1689813
4	17.0	8	302	140	3449	10.5	F-1689824
...	...	...	...	...	...	...	...
390	32.0	4	144	96	2665	13.9	T-1694070
391	36.0	4	135	84	2370	13.0	D-1694081
392	27.0	4	151	90	2950	17.3	C-1694092
393	27.0	4	140	86	2790	15.6	F-1694103
395	32.0	4	135	84	2295	11.6	D-1694125

315 rows × 7 columns

Let’s see more examples!

Use .query() and sort_values together

carfeatures.query("(cylinders == 8) & (mpg <= 15) & (acceleration <= 12)").sort_values(by = "mpg")

	mpg	cylinders	displacement	horsepower	weight	acceleration	vehicle_id
124	11.0	8	350	180	3664	11.0	O-1691144
67	11.0	8	429	208	4633	11.0	M-1690517
95	12.0	8	455	225	4951	11.0	B-1690825
90	12.0	8	429	198	4952	11.5	M-1690770
42	12.0	8	383	180	4955	11.5	D-1690242
43	13.0	8	400	170	4746	12.0	F-1690253
96	13.0	8	360	175	3821	11.0	A-1690836
94	13.0	8	440	215	4735	11.0	C-1690814
91	13.0	8	400	150	4464	12.0	C-1690781
62	13.0	8	350	165	4274	12.0	C-1690462
44	13.0	8	400	175	5140	12.0	P-1690264
166	13.0	8	302	129	3169	12.0	F-1691606
213	13.0	8	350	145	4055	12.0	C-1692123
38	14.0	8	350	165	4209	12.0	C-1690198
13	14.0	8	455	225	3086	10.0	B-1689923
63	14.0	8	400	175	4385	12.0	P-1690473
11	14.0	8	340	160	3609	8.0	P-1689901
86	14.0	8	304	150	3672	11.5	A-1690726
8	14.0	8	455	225	4425	10.0	P-1689868
7	14.0	8	440	215	4312	8.5	P-1689857
6	14.0	8	454	220	4354	9.0	C-1689846
39	14.0	8	400	175	4464	11.5	P-1690209
12	15.0	8	400	150	3761	9.5	C-1689912
10	15.0	8	383	170	3563	10.0	D-1689890
9	15.0	8	390	190	3850	8.5	A-1689879
121	15.0	8	318	150	3399	11.0	D-1691111
5	15.0	8	429	198	4341	10.0	F-1689835
1	15.0	8	350	165	3693	11.5	B-1689791

Let’s see more examples!

Use .query() and select columns
Note that we are using the [[]] operator to select the columns
Also note that we don’t need parentheses when using & and | in the expression, but it is a good practice to use them

carfeatures.query("(cylinders == 8) & (mpg <= 15) and acceleration <= 12")[["mpg", "cylinders", "acceleration"]]

	mpg	cylinders	acceleration
1	15.0	8	11.5
5	15.0	8	10.0
6	14.0	8	9.0
7	14.0	8	8.5
8	14.0	8	10.0
9	15.0	8	8.5
10	15.0	8	10.0
11	14.0	8	8.0
12	15.0	8	9.5
13	14.0	8	10.0
38	14.0	8	12.0
39	14.0	8	11.5
42	12.0	8	11.5
43	13.0	8	12.0
44	13.0	8	12.0
62	13.0	8	12.0
63	14.0	8	12.0
67	11.0	8	11.0
86	14.0	8	11.5
90	12.0	8	11.5
91	13.0	8	12.0
94	13.0	8	11.0
95	12.0	8	11.0
96	13.0	8	11.0
121	15.0	8	11.0
124	11.0	8	11.0
166	13.0	8	12.0
213	13.0	8	12.0

Let’s see more examples!

Combine .query() with global variables

threshold = 25
data_varthreshold_mpg = carfeatures.query("mpg >= @threshold")
data_varthreshold_mpg

	mpg	cylinders	displacement	horsepower	weight	acceleration	vehicle_id
18	27.0	4	97	88	2130	14.5	D-1689978
19	26.0	4	97	46	1835	20.5	V-1689989
20	25.0	4	110	87	2672	17.5	P-1690000
22	25.0	4	104	95	2375	17.5	S-1690022
23	26.0	4	121	113	2234	12.5	B-1690033
...	...	...	...	...	...	...	...
393	27.0	4	140	86	2790	15.6	F-1694103
394	44.0	4	97	52	2130	24.6	V-1694114
395	32.0	4	135	84	2295	11.6	D-1694125
396	28.0	4	120	79	2625	18.6	F-1694136
397	31.0	4	119	82	2720	19.4	C-1694147

169 rows × 7 columns

Let’s see more examples!

Expression with variable names that contain spaces
Use backticks (` `) to refer to the variable name

carfeatures["new variable"] = carfeatures["mpg"]
data_spacesthreshold_mpg = carfeatures.query("`new variable` >= 25")
data_spacesthreshold_mpg

	mpg	cylinders	displacement	horsepower	weight	acceleration	vehicle_id	new variable
18	27.0	4	97	88	2130	14.5	D-1689978	27.0
19	26.0	4	97	46	1835	20.5	V-1689989	26.0
20	25.0	4	110	87	2672	17.5	P-1690000	25.0
22	25.0	4	104	95	2375	17.5	S-1690022	25.0
23	26.0	4	121	113	2234	12.5	B-1690033	26.0
...	...	...	...	...	...	...	...	...
393	27.0	4	140	86	2790	15.6	F-1694103	27.0
394	44.0	4	97	52	2130	24.6	V-1694114	44.0
395	32.0	4	135	84	2295	11.6	D-1694125	32.0
396	28.0	4	120	79	2625	18.6	F-1694136	28.0
397	31.0	4	119	82	2720	19.4	C-1694147	31.0

169 rows × 8 columns

Try it yourself! 🚗

Subset the data with mpg \(\ge\) 25 and cylinders == 8
Select the columns mpg and cylinders
Appendix 04

Plotting Subsets

matplotlib can also be used to plot subsets of data
The syntax is similar to what we have seen before
For example, you can just add other plt.scatter() or plt.hist() commands to the same cell
Or you can create a for loop to plot multiple subsets
Let’s see an example using cylynders

First, we need to import matplotlib and use pd.unique() to extract a list with the unique elements in that column

import matplotlib.pyplot as plt

list_unique_cylinders = pd.unique(carfeatures["cylinders"])
print(list_unique_cylinders)

[8 4 6 3 5]

Plotting Subsets

# If we call plt.scatter() twice, it will display both plots on the same graph
# We also include include plt.show() at the very end.
df_8 = carfeatures.query("cylinders == 8")
df_4 = carfeatures.query("cylinders == 4")

plt.scatter(x = df_8["weight"],y = df_8["acceleration"])
plt.scatter(x = df_4["weight"],y = df_4["acceleration"])
plt.legend(labels = ["8","4"],
           title  = "Cylinders")

plt.show()

# Note: If we put plt.show() in between the plots, then the results will
# be shown on separate graphs instead.

Plotting Subsets

Using a for loop to plot multiple subsets

# Compute number of unique categories
list_unique_cylinders = pd.unique(carfeatures["cylinders"])

# Use a for loop to plot a scatter plot between "weight" and "acceleration"
# for each category. Each plot  will have a different color

for category in list_unique_cylinders:
    df   = carfeatures.query("cylinders == @category")
    plt.scatter(x = df["weight"],y = df["acceleration"])
    
# Add labels and a legends    
plt.xlabel("Weight")
plt.ylabel("Acceleration")
plt.legend(labels = list_unique_cylinders,
           title  = "Cylinders")
plt.show()

Try it yourself! 🚗

Compute a histogram of “mpg” by cylinder count
Make the histograms transparent by adjusting alpha in plt.hist(x = ..., alpha = 0.5)
Appendix 05

And that’s it for today! 🎉

See you next time! 🚀

Appendix 01

import pandas as pd
df = pd.read_csv('data_raw/features.csv')
df_filtered = df[(df['cylinders'] >= 6) & (df['acceleration'] <= 15) & (df['mpg'] <= 12)]
print(df_filtered.shape)
df_filtered

(12, 7)

	mpg	cylinders	displacement	horsepower	weight	acceleration	vehicle_id
25	10.0	8	360	215	4615	14.0	F-1690055
26	10.0	8	307	200	4376	15.0	C-1690066
27	11.0	8	318	210	4382	13.5	D-1690077
42	12.0	8	383	180	4955	11.5	D-1690242
67	11.0	8	429	208	4633	11.0	M-1690517
69	12.0	8	350	160	4456	13.5	O-1690539
90	12.0	8	429	198	4952	11.5	M-1690770
95	12.0	8	455	225	4951	11.0	B-1690825
103	11.0	8	400	150	4997	14.0	C-1690913
104	12.0	8	400	167	4906	12.5	F-1690924
106	12.0	8	350	180	4499	12.5	O-1690946
124	11.0	8	350	180	3664	11.0	O-1691144

There are 12 cars left (number of rows)

Back to exercise

Appendix 02

import pandas as pd
df = pd.read_csv('data_raw/features.csv')
df[['weight', 'acceleration']]

	weight	acceleration
0	3504	12.0
1	3693	11.5
2	3436	11.0
3	3433	12.0
4	3449	10.5
...	...	...
393	2790	15.6
394	2130	24.6
395	2295	11.6
396	2625	18.6
397	2720	19.4

398 rows × 2 columns

Back to exercise

Appendix 03

import pandas as pd
df = pd.read_csv('data_raw/features.csv')
car_ascendingmpg = df.sort_values(by = "mpg")
car_ascendingmpg.iloc[0:5]

	mpg	cylinders	displacement	horsepower	weight	acceleration	vehicle_id
28	9.0	8	304	193	4732	18.5	H-1690088
25	10.0	8	360	215	4615	14.0	F-1690055
26	10.0	8	307	200	4376	15.0	C-1690066
103	11.0	8	400	150	4997	14.0	C-1690913
124	11.0	8	350	180	3664	11.0	O-1691144

Back to exercise

Appendix 04

import pandas as pd
df = pd.read_csv('data_raw/features.csv')
df.query("(mpg >= 25) & (cylinders == 8)")[["mpg", "cylinders"]]

	mpg	cylinders
364	26.6	8

Back to exercise

Appendix 05

import matplotlib.pyplot as plt
import pandas as pd

df = pd.read_csv('data_raw/features.csv')
list_unique_cylinders = pd.unique(df["cylinders"])

for category in list_unique_cylinders:
    df_category = df.query("cylinders == @category")
    plt.hist(x = df_category["mpg"], alpha = 0.5, bins = 10)

plt.xlabel("Miles per gallon")
plt.ylabel("Frequency")

plt.legend(labels = list_unique_cylinders,
           title  = "Cylinders")
plt.show()

Back to exercise

QTM 151 - Introduction to Statistical Computing II

I hope you all had a very nice weekend! 😊

Today’s lecture

Subsetting data with Data Wrangler

Subsetting data with Data Wrangler

Subsetting data with Data Wrangler

Subsetting data with Data Wrangler

Subsetting data with Data Wrangler

Any questions so far? 🤔

Subsetting data with Data Wrangler

Let’s practice!

Try it yourself! 🚗

Subsetting data with pandas

The [] operator

Subsetting data with pandas

The [] operator

Selecting multiple columns

Subsetting data with pandas

Try it yourself! 🚗

Subsetting by row/column position

The .iloc[] (integer location) method

Subsetting by row/column position

The .iloc[] method

The .iloc[] method

Subset blocks of data

Try it yourself! 🚗

The .query() method

Let’s see more examples!

Let’s see more examples!

Let’s see more examples!

Let’s see more examples!

Let’s see more examples!

Try it yourself! 🚗

Plotting Subsets

Plotting Subsets

Plotting Subsets

Try it yourself! 🚗

And that’s it for today! 🎉

See you next time! 🚀

Appendix 01

Appendix 02

Appendix 03

Appendix 04

Appendix 05

Subsetting data with `Data Wrangler`

Subsetting data with `Data Wrangler`

Subsetting data with `Data Wrangler`

Subsetting data with `Data Wrangler`

Subsetting data with `Data Wrangler`

Subsetting data with `Data Wrangler`

Subsetting data with `pandas`

The `[]` operator

Subsetting data with `pandas`

The `[]` operator

Subsetting data with `pandas`

The `.iloc[]` (integer location) method

The `.iloc[]` method

The `.iloc[]` method

The `.query()` method