QTM 350 - Data Science Computing

Lecture 15 - Data Wrangling and Aggregating

Danilo Freire

Emory University

23 October, 2024

Nice to see you all again! 😊

Brief recap 🤓

Last time we learned about:

  • Numpy arrays and basic element-wise operations (addition, exponentiation, etc.)
  • Broadcasting arrays to perform operations on arrays of different shapes
  • Indexing and slicing arrays
  • Basics of Pandas Series and DataFrames
    • Creating Series and DataFrames
    • Indexing and slicing
    • Filtering
    • Sorting
  • Reading and writing data from/to files


Today’s plan 📅

Today we will discuss how to:

  • Reshape datasets using melt() and pivot()
  • Aggregate data using groupby()
  • Apply functions to groups of data with apply()
  • Combine data from different sources with merge()
  • Handle missing data with dropna() and fillna()
  • Plot data with pandas’ built-in plotting functions

Reshaping data 🔄

Tidy data

  • Analysts often say that 80% of data analysis is spent on cleaning and preparing data
  • Tidy data is a standard way of mapping the meaning of a dataset to its structure
  • Formalised by Hadley Wickham in 2014 (of R fame), it has three principles:
    1. Each variable forms a column
    2. Each observation forms a row
    3. Each type of observational unit forms a table
  • Often you’ll need to reshape a dataframe to make it tidy (or for some other purpose)

  • Real-world datasets frequently break the three principles in many ways (looking at you, Excel users! 😅)
  • However, most messy datasets can be cleaned up using three simple techniques: melting, string splitting, and casting (string splitting is sketched right after this list)
  • Here we will focus on the melt() and pivot() functions
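  • Since string splitting won’t appear in the examples below, here is a minimal sketch with made-up data (the demo dataframe and its column names are hypothetical), where sex and age group share one column:
import pandas as pd

# Hypothetical messy column: sex and age group stored together,
# breaking principle 1 (each variable forms a column)
demo = pd.DataFrame({"sex_age": ["m_0-14", "f_15-24"], "cases": [10, 12]})
# Split on "_" into two proper variables, then drop the messy column
demo[["sex", "age"]] = demo["sex_age"].str.split("_", n=1, expand=True)
demo = demo.drop(columns="sex_age")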

melt() and pivot()

  • Pandas .melt(), .pivot() and .pivot_table() can help reshape dataframes

  • .melt(): make wide data long

  • .pivot(): make long data wide

  • .pivot_table(): like .pivot(), but it can aggregate duplicate entries and handle multiple indexes (see the sketch after this list)

  • “Wide” data has many columns and few rows

  • “Long” data has many rows and few columns

  • The choice of wide vs. long format depends on the analysis you want to do, but long format is often more flexible
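  • A quick sketch of .pivot_table() with made-up data (the dup dataframe is hypothetical): unlike .pivot(), it aggregates rows that share the same index/column pair instead of raising an error:
import pandas as pd

# Hypothetical long data with a duplicate (Name, Year) pair
dup = pd.DataFrame({"Name": ["Tom", "Tom", "Mike"],
                    "Year": [2019, 2019, 2019],
                    "Courses": [2, 4, 3]})
# .pivot() would raise an error here; .pivot_table() aggregates instead (mean by default)
dup.pivot_table(index="Name", columns="Year", values="Courses", aggfunc="mean")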

Example: melt()

  • The data below shows how many courses different instructors taught across different years
  • If the question you want to answer is something like: “Does the number of courses taught vary depending on year?”, then the data would not be considered tidy
  • Why? Because there are multiple observations of courses taught in a year per row (i.e., there is data for 2018, 2019 and 2020 in a single row)
import numpy as np
import pandas as pd

df = pd.DataFrame({"Name": ["Tom", "Mike", "Tiffany", "Varada", "Joel"],
                   "2018": [1, 3, 4, 5, 3],
                   "2019": [2, 4, 3, 2, 1],
                   "2020": [5, 2, 4, 4, 3]})
df
Name 2018 2019 2020
0 Tom 1 2 5
1 Mike 3 4 2
2 Tiffany 4 3 4
3 Varada 5 2 4
4 Joel 3 1 3
  • Let’s make it tidy with .melt()
  • .melt() takes a few arguments; the most important is id_vars, which indicates which column(s) to use as the identifier
df_melt = df.melt(id_vars="Name",       # identifier
                  var_name="Year",      # new column for the years
                  value_name="Courses") # new column for the courses
df_melt
Name Year Courses
0 Tom 2018 1
1 Mike 2018 3
2 Tiffany 2018 4
3 Varada 2018 5
4 Joel 2018 3
5 Tom 2019 2
6 Mike 2019 4
7 Tiffany 2019 3
8 Varada 2019 2
9 Joel 2019 1
10 Tom 2020 5
11 Mike 2020 2
12 Tiffany 2020 4
13 Varada 2020 4
14 Joel 2020 3

Example: melt()

  • The value_vars argument allows us to select which specific variables we want to “melt”
  • If you don’t specify value_vars (as in the previous example), all non-identifier columns will be melted
  • For example, below I’m omitting the 2018 column:
df_melt = df.melt(id_vars="Name",       # identifier
                  var_name="Year",      # new column for the years
                  value_name="Courses", # new column for the courses
                  value_vars=["2019", "2020"]) # only 2019 and 2020
df_melt
Name Year Courses
0 Tom 2019 2
1 Mike 2019 4
2 Tiffany 2019 3
3 Varada 2019 2
4 Joel 2019 1
5 Tom 2020 5
6 Mike 2020 2
7 Tiffany 2020 4
8 Varada 2020 4
9 Joel 2020 3

Example: pivot()

  • Sometimes, you may want to make long data wide
  • We can do that with .pivot()
  • When using .pivot() we need to specify the index to pivot on, and the columns that will be used to make the new columns of the wider dataframe
df_pivot = df_melt.pivot(index="Name",
                         columns="Year",
                         values="Courses")
display(df_pivot)
Year 2019 2020
Name
Joel 1 3
Mike 4 2
Tiffany 3 4
Tom 2 5
Varada 2 4
  • You’ll notice that Pandas set our specified index as the index of the new dataframe
  • It also preserved the label of the columns
  • We can reset the index with .reset_index() to make our dataframe look more like it originally did
df_pivot = df_pivot.reset_index() 
df_pivot
Year Name 2019 2020
0 Joel 1 3
1 Mike 4 2
2 Tiffany 3 4
3 Tom 2 5
4 Varada 2 4
  • Hmmm, what about Year? 🤔
  • When pivoting, pandas sets the columns index name to match the columns parameter in .pivot(), which was Year
df_pivot.columns.name = None
df_pivot
Name 2019 2020
0 Joel 1 3
1 Mike 4 2
2 Tiffany 3 4
3 Tom 2 5
4 Varada 2 4

Working with Multiple DataFrames 🤝

Combining data with .concat()

  • Often you’ll work with multiple dataframes that you want to stick together or merge
  • pd.merge() and pd.concat() are all you need to know for combining dataframes
  • The Pandas documentation is very helpful for these functions, but they are pretty easy to grasp
  • pd.concat() is used to stick dataframes together, either by rows or columns
  • df.merge() is used to combine dataframes based on a common column
    • It’s like a SQL join for those familiar with SQL (if you’re not, don’t worry, we’ll learn about it soon 😊)
  • You can use pd.concat() to stick dataframes together:
  • Vertically: if they have the same columns, OR
  • Horizontally: if they have the same rows
df1 = pd.DataFrame({'A': [1, 3, 5],
                    'B': [2, 4, 6]})
df2 = pd.DataFrame({'A': [7, 9, 11],
                    'B': [8, 10, 12]})
df1
A B
0 1 2
1 3 4
2 5 6
df2
A B
0 7 8
1 9 10
2 11 12

Example: pd.concat()

  • Let’s stick these dataframes together vertically
  • Just pass a list of dataframes to pd.concat()
df_concat = pd.concat([df1, df2])
df_concat
A B
0 1 2
1 3 4
2 5 6
0 7 8
1 9 10
2 11 12
  • Notice that the indexes were simply joined together?
  • This may or may not be what you want
  • To reset the index, you can specify the argument ignore_index=True:
df_concat = pd.concat([df1, df2], ignore_index=True)
df_concat
A B
0 1 2
1 3 4
2 5 6
3 7 8
4 9 10
5 11 12

More options of pd.concat()

  • You can also concatenate dataframes horizontally
  • To do this, set the axis argument to 1 (the default is 0)
df_concat = pd.concat([df1, df2], axis=1)
df_concat
A B A B
0 1 2 7 8
1 3 4 9 10
2 5 6 11 12
  • You are not limited to just two dataframes; you can concatenate as many as you want
pd.concat([df1, df2, df1, df2], axis=0, ignore_index=True)
A B
0 1 2
1 3 4
2 5 6
3 7 8
4 9 10
5 11 12
6 1 2
7 3 4
8 5 6
9 7 8
10 9 10
11 11 12

Combining data with .merge()

  • pd.merge() gives you the ability to “join” dataframes using different rules
    • Again, just like with SQL if you’re familiar with it
  • You can use df.merge() to join dataframes based on shared key columns
  • Methods include:
    • inner join: only keep rows where the key exists in both dataframes
    • outer join: keep all rows from both dataframes
    • left join: keep all rows from the left dataframe
    • right join: keep all rows from the right dataframe

Example: pd.merge()

  • Let’s say we have two dataframes, one with information about superheroes and another with information about the publishers
df1 = pd.DataFrame({"name": ['Magneto', 'Storm', 'Mystique', 'Batman', 'Joker', 'Catwoman', 'Hellboy'],
                    'alignment': ['bad', 'good', 'bad', 'good', 'bad', 'bad', 'good'],
                    'gender': ['male', 'female', 'female', 'male', 'male', 'female', 'male'],
                    'publisher': ['Marvel', 'Marvel', 'Marvel', 'DC', 'DC', 'DC', 'Dark Horse Comics']})
df2 = pd.DataFrame({'publisher': ['DC', 'Marvel', 'Image'],
                    'year_founded': [1934, 1939, 1992]})

Inner join

  • An inner join will return all rows of df1 where matching values for publisher are found in df2
pd.merge(df1, df2, how="inner", on="publisher")
name alignment gender publisher year_founded
0 Magneto bad male Marvel 1939
1 Storm good female Marvel 1939
2 Mystique bad female Marvel 1939
3 Batman good male DC 1934
4 Joker bad male DC 1934
5 Catwoman bad female DC 1934

Outer join

  • An outer join will return all rows of df1 and df2, placing NaNs where information is unavailable
pd.merge(df1, df2, how="outer", on="publisher")
name alignment gender publisher year_founded
0 Batman good male DC 1934.0
1 Joker bad male DC 1934.0
2 Catwoman bad female DC 1934.0
3 Hellboy good male Dark Horse Comics NaN
4 NaN NaN NaN Image 1992.0
5 Magneto bad male Marvel 1939.0
6 Storm good female Marvel 1939.0
7 Mystique bad female Marvel 1939.0

Left join

  • A left join will return all rows of df1, and any matching rows from df2
pd.merge(df1, df2, how="left", on="publisher")
name alignment gender publisher year_founded
0 Magneto bad male Marvel 1939.0
1 Storm good female Marvel 1939.0
2 Mystique bad female Marvel 1939.0
3 Batman good male DC 1934.0
4 Joker bad male DC 1934.0
5 Catwoman bad female DC 1934.0
6 Hellboy good male Dark Horse Comics NaN

Right join

  • A right join, as expected, will return all rows of df2, and any matching rows from df1
pd.merge(df1, df2, how="right", on="publisher")
name alignment gender publisher year_founded
0 Batman good male DC 1934
1 Joker bad male DC 1934
2 Catwoman bad female DC 1934
3 Magneto bad male Marvel 1939
4 Storm good female Marvel 1939
5 Mystique bad female Marvel 1939
6 NaN NaN NaN Image 1992
  • Cool, isn’t it? 😊

The indicator argument

  • The indicator argument can be used to show where the rows came from
  • It can be set to True to create a new column named _merge, or to a string to use as the column name (both shown below)
  • Let’s see how it works
pd.merge(df1, df2, how="outer", on="publisher", indicator=True)
name alignment gender publisher year_founded _merge
0 Batman good male DC 1934.0 both
1 Joker bad male DC 1934.0 both
2 Catwoman bad female DC 1934.0 both
3 Hellboy good male Dark Horse Comics NaN left_only
4 NaN NaN NaN Image 1992.0 right_only
5 Magneto bad male Marvel 1939.0 both
6 Storm good female Marvel 1939.0 both
7 Mystique bad female Marvel 1939.0 both
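  • Since indicator also accepts a string, here is a one-line sketch naming the column source instead of the default _merge (the name source is just an example):
pd.merge(df1, df2, how="outer", on="publisher", indicator="source")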

More DataFrame operations 🛠️

Applying custom functions to DataFrames 🧮

Applying custom functions

  • There will be times when you want to apply a function that is not built-in to Pandas
  • For this, we also have methods:
    • df.apply(), applies a function column-wise or row-wise across a dataframe (the function must be able to accept/return an array)
    • df.applymap(), applies a function element-wise (for functions that accept/return single values at a time); it was renamed to df.map() in pandas 2.1+, and a sketch appears at the end of this section
    • series.apply()/series.map(), same as above but for Pandas series
  • Let’s go back to the dataset we saw before (about cycling) and see how we can apply a numpy function to it
df = pd.read_csv("data/cycling_data.csv")
df[['Time', 'Distance']].apply(np.sin)
Time Distance
0 -0.901866 0.053604
1 -0.901697 0.447197
2 -0.035549 -0.046354
3 -0.739059 0.270228
4 -0.236515 -0.086263
5 -0.587941 -0.116108
6 0.079729 -0.116108
7 -0.873400 0.033623
8 0.167593 0.907781
9 0.663499 -0.086263
10 0.941147 -0.086263
11 -0.745234 0.023627
12 0.998803 0.241226
13 -0.846108 0.113385
14 0.558639 -0.096222
15 0.594763 0.113385
16 -0.000151 -0.449647
17 0.279589 0.083532
18 0.176224 -0.046354
19 0.088579 0.336907
20 0.167207 -0.096222
21 0.536726 0.143136
22 -0.831858 NaN
23 -0.779579 0.250919
24 0.149728 NaN
25 -0.756921 0.473825
26 0.670117 -0.046354
27 0.986598 0.113385
28 -0.683372 0.063586
29 0.150056 0.133232
30 0.026702 0.023627
31 -0.008640 0.221770
32 0.897861 -0.700695
  • Or you may want to apply your own custom function
def seconds_to_hours(x):
    return x / 3600

df[['Time']].apply(seconds_to_hours)
Time
0 0.578889
1 0.703056
2 0.517500
3 0.608889
4 0.525278
5 0.631111
6 0.548056
7 0.634722
8 0.806389
9 0.583611
10 13.350556
11 0.580556
12 0.822500
13 0.683889
14 0.576667
15 0.644722
16 0.493056
17 0.590000
18 0.516667
19 0.652778
20 0.475556
21 0.588333
22 0.481111
23 0.617222
24 0.487778
25 0.592778
26 0.478889
27 0.606111
28 0.519444
29 0.596944
30 0.511389
31 0.684167
32 0.511944
  • However, this could be done more easily with a lambda function
df[['Time']].apply(lambda x: x / 3600)
Time
0 0.578889
1 0.703056
2 0.517500
3 0.608889
4 0.525278
5 0.631111
6 0.548056
7 0.634722
8 0.806389
9 0.583611
10 13.350556
11 0.580556
12 0.822500
13 0.683889
14 0.576667
15 0.644722
16 0.493056
17 0.590000
18 0.516667
19 0.652778
20 0.475556
21 0.588333
22 0.481111
23 0.617222
24 0.487778
25 0.592778
26 0.478889
27 0.606111
28 0.519444
29 0.596944
30 0.511389
31 0.684167
32 0.511944
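  • For completeness, here is a minimal sketch of element-wise application with .applymap() (the demo dataframe is made up, since the lecture data above only uses .apply()):
# Round every value in a small numeric frame to one decimal place
demo = pd.DataFrame({"a": [1.23, 4.56], "b": [7.89, 0.12]})
demo.applymap(lambda v: round(v, 1))  # use demo.map(...) in pandas >= 2.1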

Applying custom functions

  • You can even use functions that require additional arguments. Just specify the arguments in .apply():
def convert_seconds(x, to="hours"):
    if to == "hours":
        return x / 3600
    elif to == "minutes":
        return x / 60

df[['Time']].apply(convert_seconds, to="minutes")
Time
0 34.733333
1 42.183333
2 31.050000
3 36.533333
4 31.516667
5 37.866667
6 32.883333
7 38.083333
8 48.383333
9 35.016667
10 801.033333
11 34.833333
12 49.350000
13 41.033333
14 34.600000
15 38.683333
16 29.583333
17 35.400000
18 31.000000
19 39.166667
20 28.533333
21 35.300000
22 28.866667
23 37.033333
24 29.266667
25 35.566667
26 28.733333
27 36.366667
28 31.166667
29 35.816667
30 30.683333
31 41.050000
32 30.716667

Grouping and Aggregating 📊

Grouping data

  • Often we are interested in examining specific groups in our data
  • df.groupby() allows us to group our data based on one or more variables (grouping by multiple columns is sketched a bit later)
df = pd.read_csv('data/cycling_data.csv')
df.head()
Date Name Type Time Distance Comments
0 10 Sep 2019, 00:13:04 Afternoon Ride Ride 2084 12.62 Rain
1 10 Sep 2019, 13:52:18 Morning Ride Ride 2531 13.03 rain
2 11 Sep 2019, 00:23:50 Afternoon Ride Ride 1863 12.52 Wet road but nice weather
3 11 Sep 2019, 14:06:19 Morning Ride Ride 2192 12.84 Stopped for photo of sunrise
4 12 Sep 2019, 00:28:05 Afternoon Ride Ride 1891 12.48 Tired by the end of the week
  • Let’s group this dataframe on the column Name:
dfg = df.groupby(by='Name')
dfg
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x1453bf9b0>
  • What is a DataFrameGroupBy object? It contains information about the groups of the dataframe
  • You can access the groups using the .groups attribute
  • This will return a dictionary where the keys are the group names and the values are the indices of the rows in the original dataframe that belong to that group
dfg.groups
{'Afternoon Ride': [0, 2, 4, 6, 9, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32], 'Morning Ride': [1, 3, 5, 7, 8, 11, 13, 15, 17, 19, 21, 23, 25, 27, 29, 31]}

Grouping data

  • We can also access a group using the .get_group() method:
dfg.get_group('Afternoon Ride').head()
Date Name Type Time Distance Comments
0 10 Sep 2019, 00:13:04 Afternoon Ride Ride 2084 12.62 Rain
2 11 Sep 2019, 00:23:50 Afternoon Ride Ride 1863 12.52 Wet road but nice weather
4 12 Sep 2019, 00:28:05 Afternoon Ride Ride 1891 12.48 Tired by the end of the week
6 17 Sep 2019, 00:15:47 Afternoon Ride Ride 1973 12.45 Legs feeling strong!
9 18 Sep 2019, 00:15:52 Afternoon Ride Ride 2101 12.48 Pumped up tires
dfg.get_group('Morning Ride').head()
Date Name Type Time Distance Comments
1 10 Sep 2019, 13:52:18 Morning Ride Ride 2531 13.03 rain
3 11 Sep 2019, 14:06:19 Morning Ride Ride 2192 12.84 Stopped for photo of sunrise
5 16 Sep 2019, 13:57:48 Morning Ride Ride 2272 12.45 Rested after the weekend!
7 17 Sep 2019, 13:43:34 Morning Ride Ride 2285 12.60 Raining
8 18 Sep 2019, 13:49:53 Morning Ride Ride 2903 14.57 Raining today
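  • You can also pass a list of columns to .groupby() to group by more than one variable; a minimal sketch with the same cycling data:
# One group per (Name, Type) combination
df.groupby(['Name', 'Type'])['Distance'].mean()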

Aggregating data

  • The usual thing to do, however, is to apply aggregate functions to the groupby object

Split-apply-combine strategy

Aggregating data

Functions

  • You can pass many functions to groupby objects, such as:
    • mean()
    • sum()
    • count()
    • std()
    • min()
    • max()
    • median()
    • describe()
    • apply()
    • transform()
    • filter() (transform() and filter() are sketched after the example below)
dfg['Distance'].mean()
Name
Afternoon Ride    12.462
Morning Ride      12.860
Name: Distance, dtype: float64
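  • .transform() and .filter() behave a bit differently from the reducers above, so here is a minimal sketch on the same grouped data (the 12.5 km threshold is arbitrary):
# transform broadcasts the group result back to every row of the original frame
dfg['Distance'].transform('mean').head(3)

# filter keeps only the rows of groups satisfying a condition
dfg.filter(lambda g: g['Distance'].mean() > 12.5).head(3)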

Aggregating multiple columns

  • You can also pass many functions at the same time
  • The .agg() method is used for this
  • It takes a dictionary where the keys are the columns and the values are the functions to apply
  • Multiple functions can be passed as a list
dfg.agg({'Distance': ['mean', 'sum', 'count'], 
         'Time': ['mean', 'sum', 'count']})
Distance Time
mean sum count mean sum count
Name
Afternoon Ride 12.462 186.93 15 4654.352941 79124 17
Morning Ride 12.860 205.76 16 2299.875000 36798 16

Aggregating data

  • And even apply different functions to different columns
def num_range(x):
    return x.max() - x.min()

dfg.agg({"Time": ['max', 'min', 'mean', num_range], 
         "Distance": ['sum']})
Time Distance
max min mean num_range sum
Name
Afternoon Ride 48062 1712 4654.352941 46350 186.93
Morning Ride 2903 2090 2299.875000 813 205.76
  • By the way, you can use .agg() for non-grouped dataframes too
  • This is pretty much what df.describe() does under the hood:
# Select only numeric values
numeric_df = df.select_dtypes(include='number')

# Apply the function
numeric_df.agg(['count', 'mean', 'median', 'std', 'min', 'max'])
Time Distance
count 33.000000 31.000000
mean 3512.787879 12.667419
median 2118.000000 12.620000
std 8003.309233 0.428618
min 1712.000000 11.790000
max 48062.000000 14.570000
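  • For comparison, you can call the built-in version on the same numeric columns:
numeric_df.describe()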

Dealing with Missing Values

Missing data

  • Missing data is a common problem in data analysis
  • Pandas has a few methods to deal with missing data
  • The most common are:
    • isnull(): returns a boolean mask where True indicates missing values
    • notnull(): opposite of isnull()
    • dropna(): removes missing values
    • fillna(): fills missing values with a specified value
    • interpolate(): fills missing values with interpolated values
  • We can use df.isnull() to find missing values in a dataframe
df.isnull().head(3)
Date Name Type Time Distance Comments
0 False False False False False False
1 False False False False False False
2 False False False False False False
  • df.info() is often more useful
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 33 entries, 0 to 32
Data columns (total 6 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Date      33 non-null     object 
 1   Name      33 non-null     object 
 2   Type      33 non-null     object 
 3   Time      33 non-null     int64  
 4   Distance  31 non-null     float64
 5   Comments  33 non-null     object 
dtypes: float64(1), int64(1), object(4)
memory usage: 1.7+ KB

Dropping missing values

  • The simplest way to deal with missing values is to drop them
  • You can use the .dropna() method to do this
df.dropna().head()
Date Name Type Time Distance Comments
0 10 Sep 2019, 00:13:04 Afternoon Ride Ride 2084 12.62 Rain
1 10 Sep 2019, 13:52:18 Morning Ride Ride 2531 13.03 rain
2 11 Sep 2019, 00:23:50 Afternoon Ride Ride 1863 12.52 Wet road but nice weather
3 11 Sep 2019, 14:06:19 Morning Ride Ride 2192 12.84 Stopped for photo of sunrise
4 12 Sep 2019, 00:28:05 Afternoon Ride Ride 1891 12.48 Tired by the end of the week
  • Or you can impute (“fill”) them using .fillna()
  • This method has various options for filling
  • You can use a fixed value, the mean of the column, the previous non-NaN value, and so on (forward filling and interpolation are sketched at the end of this section)
df = pd.DataFrame([[np.nan, 2, np.nan, 0],
                   [3, 4, np.nan, 1],
                   [np.nan, np.nan, np.nan, 5],
                   [np.nan, 3, np.nan, 4]],
                  columns=list('ABCD'))
df
A B C D
0 NaN 2.0 NaN 0
1 3.0 4.0 NaN 1
2 NaN NaN NaN 5
3 NaN 3.0 NaN 4

Filling missing values

df.fillna(0)  # fill with 0
A B C D
0 0.0 2.0 0.0 0
1 3.0 4.0 0.0 1
2 0.0 0.0 0.0 5
3 0.0 3.0 0.0 4
df.fillna(df.mean())  # fill with the mean
A B C D
0 3.0 2.0 NaN 0
1 3.0 4.0 NaN 1
2 3.0 3.0 NaN 5
3 3.0 3.0 NaN 4
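  • The remaining strategies can be sketched on the same dataframe: .ffill() carries the previous non-NaN value forward, and .interpolate() fills gaps linearly between surrounding values (column C has no values at all, so it stays NaN in both cases):
df.ffill()        # propagate the previous non-NaN value down each column
df.interpolate()  # linear interpolation between surrounding values, column-wise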

Visualising DataFrames 📈

Plotting data with Pandas

  • Pandas has some built-in plotting functions that are very useful
  • You can use them by calling .plot() on a dataframe or series
df = pd.read_csv('data/cycling_data.csv', index_col=0, parse_dates=True).dropna()
  • Let’s plot the Distance column
df['Distance'].plot()

Plotting data with Pandas

  • Cumulative sum of the Distance column
df['Distance'].cumsum().plot.line()

  • You can also plot multiple columns at once
df[['Distance', 'Time']].plot()

Plotting data with Pandas

  • There are many configuration options for these plots, which build on the matplotlib library
df['Distance'].cumsum().plot.line(fontsize=14, linewidth=2, color='r', ylabel="km")

  • You can even use custom themes with the style argument
import matplotlib.pyplot as plt
import mplcyberpunk
plt.style.use("cyberpunk")

df['Distance'].plot.line(ylabel="km")
mplcyberpunk.add_glow_effects()

That’s all for today! 🎉

Have a great day! 😊

Appendix: Cheat Sheets 📜
