QTM 350 - Data Science Computing

Lecture 15 - Data Wrangling and Aggregating

Danilo Freire

Emory University

23 October, 2024

Nice to see you all again! 😊

Brief recap 🤓

Last time we learned about:

  • Numpy arrays and basic element-wise operations (addition, exponentiation, etc.)
  • Broadcasting arrays to perform operations on arrays of different shapes
  • Indexing and slicing arrays
  • Basics of Pandas Series and DataFrames
    • Creating Series and DataFrames
    • Indexing and slicing
    • Filtering
    • Sorting
  • Reading and writing data from/to files


Today’s plan 📅

Today we will discuss how to:

  • Reshape datasets using melt() and pivot()
  • Aggregate data using groupby()
  • Apply functions to groups of data with apply()
  • Combine data from different sources with merge()
  • Handle missing data with dropna() and fillna()
  • Plot data with pandas’ built-in plotting functions

Reshaping data 🔄

Tidy data

  • Analysts often say that 80% of data analysis is spent on cleaning and preparing data
  • Tidy data is a standard way of mapping the meaning of a dataset to its structure
  • Formalised by Hadley Wickham in 2014 (of R fame), it has three principles:
    1. Each variable forms a column
    2. Each observation forms a row
    3. Each type of observational unit forms a table
  • Often you’ll need to reshape a dataframe to make it tidy (or for some other purpose)

  • Real-world datasets frequently break the three principles in many ways (looking at you, Excel users! 😅)
  • However, most messy datasets can be cleaned up using three simple techniques: melting, string splitting, and casting (string splitting is sketched right after this list)
  • Here we will focus on the melt() and pivot() functions
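  • Since string splitting won’t appear in the examples below, here is a minimal sketch with made-up data (the demo dataframe and its column names are hypothetical), where sex and age group share one column:
import pandas as pd

# Hypothetical messy column: sex and age group stored together,
# breaking principle 1 (each variable forms a column)
demo = pd.DataFrame({"sex_age": ["m_0-14", "f_15-24"], "cases": [10, 12]})
# Split on "_" into two proper variables, then drop the messy column
demo[["sex", "age"]] = demo["sex_age"].str.split("_", n=1, expand=True)
demo = demo.drop(columns="sex_age")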

melt() and pivot()

  • Pandas .melt(), .pivot() and .pivot_table() can help reshape dataframes

  • .melt(): make wide data long

  • .pivot(): make long data wide

  • .pivot_table(): like .pivot(), but it can aggregate duplicate entries and handle multiple indexes (see the sketch after this list)

  • “Wide” data has many columns and few rows

  • “Long” data has many rows and few columns

  • The choice of wide vs. long format depends on the analysis you want to do, but long format is often more flexible
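  • A quick sketch of .pivot_table() with made-up data (the dup dataframe is hypothetical): unlike .pivot(), it aggregates rows that share the same index/column pair instead of raising an error:
import pandas as pd

# Hypothetical long data with a duplicate (Name, Year) pair
dup = pd.DataFrame({"Name": ["Tom", "Tom", "Mike"],
                    "Year": [2019, 2019, 2019],
                    "Courses": [2, 4, 3]})
# .pivot() would raise an error here; .pivot_table() aggregates instead (mean by default)
dup.pivot_table(index="Name", columns="Year", values="Courses", aggfunc="mean")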

Example: melt()

  • The data below shows how many courses different instructors taught across different years
  • If the question you want to answer is something like: “Does the number of courses taught vary depending on year?”, then the data would not be considered tidy
  • Why? Because there are multiple observations of courses taught in a year per row (i.e., there is data for 2018, 2019 and 2020 in a single row)
import numpy as np
import pandas as pd

df = pd.DataFrame({"Name": ["Tom", "Mike", "Tiffany", "Varada", "Joel"],
                   "2018": [1, 3, 4, 5, 3],
                   "2019": [2, 4, 3, 2, 1],
                   "2020": [5, 2, 4, 4, 3]})
df
Name 2018 2019 2020
0 Tom 1 2 5
1 Mike 3 4 2
2 Tiffany 4 3 4
3 Varada 5 2 4
4 Joel 3 1 3
  • Let’s make it tidy with .melt()
  • .melt() takes a few arguments; the most important is id_vars, which indicates which column(s) to use as the identifier
df_melt = df.melt(id_vars="Name",       # identifier
                  var_name="Year",      # new column for the years
                  value_name="Courses") # new column for the courses
df_melt
Name Year Courses
0 Tom 2018 1
1 Mike 2018 3
2 Tiffany 2018 4
3 Varada 2018 5
4 Joel 2018 3
5 Tom 2019 2
6 Mike 2019 4
7 Tiffany 2019 3
8 Varada 2019 2
9 Joel 2019 1
10 Tom 2020 5
11 Mike 2020 2
12 Tiffany 2020 4
13 Varada 2020 4
14 Joel 2020 3

Example: melt()

  • The value_vars argument allows us to select which specific variables we want to “melt”
  • If you don’t specify value_vars (as in the previous example), all non-identifier columns will be melted
  • For example, below I’m omitting the 2018 column:
df_melt = df.melt(id_vars="Name",       # identifier
                  var_name="Year",      # new column for the years
                  value_name="Courses", # new column for the courses
                  value_vars=["2019", "2020"]) # only 2019 and 2020
df_melt
Name Year Courses
0 Tom 2019 2
1 Mike 2019 4
2 Tiffany 2019 3
3 Varada 2019 2
4 Joel 2019 1
5 Tom 2020 5
6 Mike 2020 2
7 Tiffany 2020 4
8 Varada 2020 4
9 Joel 2020 3

Example: pivot()

  • Sometimes, you may want to make long data wide
  • We can do that with .pivot()
  • When using .pivot() we need to specify the index to pivot on, and the columns that will be used to make the new columns of the wider dataframe
df_pivot = df_melt.pivot(index="Name",
                         columns="Year",
                         values="Courses")
display(df_pivot)
Year 2019 2020
Name
Joel 1 3
Mike 4 2
Tiffany 3 4
Tom 2 5
Varada 2 4
  • You’ll notice that Pandas set our specified index as the index of the new dataframe
  • It also preserved the label of the columns
  • We can reset the index with .reset_index() to make our dataframe look more like it originally did
df_pivot = df_pivot.reset_index() 
df_pivot
Year Name 2019 2020
0 Joel 1 3
1 Mike 4 2
2 Tiffany 3 4
3 Tom 2 5
4 Varada 2 4
  • Hmmm, what about Year? 🤔
  • When pivoting, pandas sets the columns index name to match the columns parameter in .pivot(), which was Year
df_pivot.columns.name = None
df_pivot
Name 2019 2020
0 Joel 1 3
1 Mike 4 2
2 Tiffany 3 4
3 Tom 2 5
4 Varada 2 4

Working with Multiple DataFrames 🤝

Combining data with .concat()

  • Often you’ll work with multiple dataframes that you want to stick together or merge
  • pd.merge() and pd.concat() are all you need to know for combining dataframes
  • The Pandas documentation is very helpful for these functions, but they are pretty easy to grasp
  • pd.concat() is used to stick dataframes together, either by rows or columns
  • df.merge() is used to combine dataframes based on a common column
    • It’s like a SQL join for those familiar with SQL (if you’re not, don’t worry, we’ll learn about it soon 😊)
  • You can use pd.concat() to stick dataframes together:
  • Vertically: if they have the same columns, OR
  • Horizontally: if they have the same rows
df1 = pd.DataFrame({'A': [1, 3, 5],
                    'B': [2, 4, 6]})
df2 = pd.DataFrame({'A': [7, 9, 11],
                    'B': [8, 10, 12]})
df1
A B
0 1 2
1 3 4
2 5 6
df2
A B
0 7 8
1 9 10
2 11 12

Example: pd.concat()

  • Let’s stick these dataframes together vertically
  • Just pass a list of dataframes to pd.concat()
df_concat = pd.concat([df1, df2])
df_concat
A B
0 1 2
1 3 4
2 5 6
0 7 8
1 9 10
2 11 12
  • Notice that the indexes were simply joined together?
  • This may or may not be what you want
  • To reset the index, you can specify the argument ignore_index=True:
df_concat = pd.concat([df1, df2], ignore_index=True)
df_concat
A B
0 1 2
1 3 4
2 5 6
3 7 8
4 9 10
5 11 12

More options of pd.concat()

  • You can also concatenate dataframes horizontally
  • To do this, set the axis argument to 1 (the default is 0)
df_concat = pd.concat([df1, df2], axis=1)
df_concat
A B A B
0 1 2 7 8
1 3 4 9 10
2 5 6 11 12
  • You are not limited to just two dataframes; you can concatenate as many as you want
pd.concat([df1, df2, df1, df2], axis=0, ignore_index=True)
A B
0 1 2
1 3 4
2 5 6
3 7 8
4 9 10
5 11 12
6 1 2
7 3 4
8 5 6
9 7 8
10 9 10
11 11 12

Combining data with .merge()

  • pd.merge() gives you the ability to “join” dataframes using different rules
    • Again, just like with SQL if you’re familiar with it
  • You can use df.merge() to join dataframes based on shared key columns
  • Methods include:
    • inner join: only keep rows where the key exists in both dataframes
    • outer join: keep all rows from both dataframes
    • left join: keep all rows from the left dataframe
    • right join: keep all rows from the right dataframe

Example: pd.merge()

  • Let’s say we have two dataframes, one with information about superheroes and another with information about the publishers
df1 = pd.DataFrame({"name": ['Magneto', 'Storm', 'Mystique', 'Batman', 'Joker', 'Catwoman', 'Hellboy'],
                    'alignment': ['bad', 'good', 'bad', 'good', 'bad', 'bad', 'good'],
                    'gender': ['male', 'female', 'female', 'male', 'male', 'female', 'male'],
                    'publisher': ['Marvel', 'Marvel', 'Marvel', 'DC', 'DC', 'DC', 'Dark Horse Comics']})
df2 = pd.DataFrame({'publisher': ['DC', 'Marvel', 'Image'],
                    'year_founded': [1934, 1939, 1992]})

Inner join

  • An inner join will return all rows of df1 where matching values for publisher are found in df2
pd.merge(df1, df2, how="inner", on="publisher")
name alignment gender publisher year_founded
0 Magneto bad male Marvel 1939
1 Storm good female Marvel 1939
2 Mystique bad female Marvel 1939
3 Batman good male DC 1934
4 Joker bad male DC 1934
5 Catwoman bad female DC 1934

Outer join

  • An outer join will return all rows of df1 and df2, placing NaNs where information is unavailable
pd.merge(df1, df2, how="outer", on="publisher")
name alignment gender publisher year_founded
0 Batman good male DC 1934.0
1 Joker bad male DC 1934.0
2 Catwoman bad female DC 1934.0
3 Hellboy good male Dark Horse Comics NaN
4 NaN NaN NaN Image 1992.0
5 Magneto bad male Marvel 1939.0
6 Storm good female Marvel 1939.0
7 Mystique bad female Marvel 1939.0

Left join

  • A left join will return all rows of df1, and any matching rows from df2
pd.merge(df1, df2, how="left", on="publisher")
name alignment gender publisher year_founded
0 Magneto bad male Marvel 1939.0
1 Storm good female Marvel 1939.0
2 Mystique bad female Marvel 1939.0
3 Batman good male DC 1934.0
4 Joker bad male DC 1934.0
5 Catwoman bad female DC 1934.0
6 Hellboy good male Dark Horse Comics NaN

Right join

  • A right join, as expected, will return all rows of df2, and any matching rows from df1
pd.merge(df1, df2, how="right", on="publisher")
name alignment gender publisher year_founded
0 Batman good male DC 1934
1 Joker bad male DC 1934
2 Catwoman bad female DC 1934
3 Magneto bad male Marvel 1939
4 Storm good female Marvel 1939
5 Mystique bad female Marvel 1939
6 NaN NaN NaN Image 1992
  • Cool, isn’t it? 😊

The indicator argument

  • The indicator argument can be used to show where the rows came from
  • It can be set to True to create a new column named _merge, or to a string to use as the column name (both shown below)
  • Let’s see how it works
pd.merge(df1, df2, how="outer", on="publisher", indicator=True)
name alignment gender publisher year_founded _merge
0 Batman good male DC 1934.0 both
1 Joker bad male DC 1934.0 both
2 Catwoman bad female DC 1934.0 both
3 Hellboy good male Dark Horse Comics NaN left_only
4 NaN NaN NaN Image 1992.0 right_only
5 Magneto bad male Marvel 1939.0 both
6 Storm good female Marvel 1939.0 both
7 Mystique bad female Marvel 1939.0 both
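  • Since indicator also accepts a string, here is a one-line sketch naming the column source instead of the default _merge (the name source is just an example):
pd.merge(df1, df2, how="outer", on="publisher", indicator="source")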

More DataFrame operations 🛠️

Applying custom functions to DataFrames 🧮

Applying custom functions

  • There will be times when you want to apply a function that is not built-in to Pandas
  • For this, we also have methods:
    • df.apply(), applies a function column-wise or row-wise across a dataframe (the function must be able to accept/return an array)
    • df.applymap(), applies a function element-wise (for functions that accept/return single values at a time); it was renamed to df.map() in pandas 2.1+, and a sketch appears at the end of this section
    • series.apply()/series.map(), same as above but for Pandas series
  • Let’s go back to the dataset we saw before (about cycling) and see how we can apply a numpy function to it
df = pd.read_csv("data/cycling_data.csv")
df[['Time', 'Distance']].apply(np.sin)
Time Distance
0 -0.901866 0.053604
1 -0.901697 0.447197
2 -0.035549 -0.046354
3 -0.739059 0.270228
4 -0.236515 -0.086263
5 -0.587941 -0.116108
6 0.079729 -0.116108
7 -0.873400 0.033623
8 0.167593 0.907781
9 0.663499 -0.086263
10 0.941147 -0.086263
11 -0.745234 0.023627
12 0.998803 0.241226
13 -0.846108 0.113385
14 0.558639 -0.096222
15 0.594763 0.113385
16 -0.000151 -0.449647
17 0.279589 0.083532
18 0.176224 -0.046354
19 0.088579 0.336907
20 0.167207 -0.096222
21 0.536726 0.143136
22 -0.831858 NaN
23 -0.779579 0.250919
24 0.149728 NaN
25 -0.756921 0.473825
26 0.670117 -0.046354
27 0.986598 0.113385
28 -0.683372 0.063586
29 0.150056 0.133232
30 0.026702 0.023627
31 -0.008640 0.221770
32 0.897861 -0.700695
  • Or you may want to apply your own custom function
def seconds_to_hours(x):
    return x / 3600

df[['Time']].apply(seconds_to_hours)
Time
0 0.578889
1 0.703056
2 0.517500
3 0.608889
4 0.525278
5 0.631111
6 0.548056
7 0.634722
8 0.806389
9 0.583611
10 13.350556
11 0.580556
12 0.822500
13 0.683889
14 0.576667
15 0.644722
16 0.493056
17 0.590000
18 0.516667
19 0.652778
20 0.475556
21 0.588333
22 0.481111
23 0.617222
24 0.487778
25 0.592778
26 0.478889
27 0.606111
28 0.519444
29 0.596944
30 0.511389
31 0.684167
32 0.511944
  • However, this could be done more easily with a lambda function
df[['Time']].apply(lambda x: x / 3600)
Time
0 0.578889
1 0.703056
2 0.517500
3 0.608889
4 0.525278
5 0.631111
6 0.548056
7 0.634722
8 0.806389
9 0.583611
10 13.350556
11 0.580556
12 0.822500
13 0.683889
14 0.576667
15 0.644722
16 0.493056
17 0.590000
18 0.516667
19 0.652778
20 0.475556
21 0.588333
22 0.481111
23 0.617222
24 0.487778
25 0.592778
26 0.478889
27 0.606111
28 0.519444
29 0.596944
30 0.511389
31 0.684167
32 0.511944
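  • For completeness, here is a minimal sketch of element-wise application with .applymap() (the demo dataframe is made up, since the lecture data above only uses .apply()):
# Round every value in a small numeric frame to one decimal place
demo = pd.DataFrame({"a": [1.23, 4.56], "b": [7.89, 0.12]})
demo.applymap(lambda v: round(v, 1))  # use demo.map(...) in pandas >= 2.1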

Applying custom functions

  • You can even use functions that require additional arguments. Just specify the arguments in .apply():
def convert_seconds(x, to="hours"):
    if to == "hours":
        return x / 3600
    elif to == "minutes":
        return x / 60

df[['Time']].apply(convert_seconds, to="minutes")
Time
0 34.733333
1 42.183333
2 31.050000
3 36.533333
4 31.516667
5 37.866667
6 32.883333
7 38.083333
8 48.383333
9 35.016667
10 801.033333
11 34.833333
12 49.350000
13 41.033333
14 34.600000
15 38.683333
16 29.583333
17 35.400000
18 31.000000
19 39.166667
20 28.533333
21 35.300000
22 28.866667
23 37.033333
24 29.266667
25 35.566667
26 28.733333
27 36.366667
28 31.166667
29 35.816667
30 30.683333
31 41.050000
32 30.716667

Grouping and Aggregating 📊

Grouping data

  • Often we are interested in examining specific groups in our data
  • df.groupby() allows us to group our data based on one or more variables (grouping by multiple columns is sketched a bit later)
df = pd.read_csv('data/cycling_data.csv')
df.head()
Date Name Type Time Distance Comments
0 10 Sep 2019, 00:13:04 Afternoon Ride Ride 2084 12.62 Rain
1 10 Sep 2019, 13:52:18 Morning Ride Ride 2531 13.03 rain
2 11 Sep 2019, 00:23:50 Afternoon Ride Ride 1863 12.52 Wet road but nice weather
3 11 Sep 2019, 14:06:19 Morning Ride Ride 2192 12.84 Stopped for photo of sunrise
4 12 Sep 2019, 00:28:05 Afternoon Ride Ride 1891 12.48 Tired by the end of the week
  • Let’s group this dataframe on the column Name:
dfg = df.groupby(by='Name')
dfg
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x1453bf9b0>
  • What is a DataFrameGroupBy object? It contains information about the groups of the dataframe
  • You can access the groups using the .groups attribute
  • This will return a dictionary where the keys are the group names and the values are the indices of the rows in the original dataframe that belong to that group
dfg.groups
{'Afternoon Ride': [0, 2, 4, 6, 9, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32], 'Morning Ride': [1, 3, 5, 7, 8, 11, 13, 15, 17, 19, 21, 23, 25, 27, 29, 31]}

Grouping data

  • We can also access a group using the .get_group() method:
dfg.get_group('Afternoon Ride').head()
Date Name Type Time Distance Comments
0 10 Sep 2019, 00:13:04 Afternoon Ride Ride 2084 12.62 Rain
2 11 Sep 2019, 00:23:50 Afternoon Ride Ride 1863 12.52 Wet road but nice weather
4 12 Sep 2019, 00:28:05 Afternoon Ride Ride 1891 12.48 Tired by the end of the week
6 17 Sep 2019, 00:15:47 Afternoon Ride Ride 1973 12.45 Legs feeling strong!
9 18 Sep 2019, 00:15:52 Afternoon Ride Ride 2101 12.48 Pumped up tires
dfg.get_group('Morning Ride').head()
Date Name Type Time Distance Comments
1 10 Sep 2019, 13:52:18 Morning Ride Ride 2531 13.03 rain
3 11 Sep 2019, 14:06:19 Morning Ride Ride 2192 12.84 Stopped for photo of sunrise
5 16 Sep 2019, 13:57:48 Morning Ride Ride 2272 12.45 Rested after the weekend!
7 17 Sep 2019, 13:43:34 Morning Ride Ride 2285 12.60 Raining
8 18 Sep 2019, 13:49:53 Morning Ride Ride 2903 14.57 Raining today
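  • You can also pass a list of columns to .groupby() to group by more than one variable; a minimal sketch with the same cycling data:
# One group per (Name, Type) combination
df.groupby(['Name', 'Type'])['Distance'].mean()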

Aggregating data

  • The usual thing to do, however, is to apply aggregate functions to the groupby object

Split-apply-combine strategy

Aggregating data

Functions

  • You can pass many functions to groupby objects, such as:
    • mean()
    • sum()
    • count()
    • std()
    • min()
    • max()
    • median()
    • describe()
    • apply()
    • transform()
    • filter() (transform() and filter() are sketched after the example below)
dfg['Distance'].mean()
Name
Afternoon Ride    12.462
Morning Ride      12.860
Name: Distance, dtype: float64
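  • .transform() and .filter() behave a bit differently from the reducers above, so here is a minimal sketch on the same grouped data (the 12.5 km threshold is arbitrary):
# transform broadcasts the group result back to every row of the original frame
dfg['Distance'].transform('mean').head(3)

# filter keeps only the rows of groups satisfying a condition
dfg.filter(lambda g: g['Distance'].mean() > 12.5).head(3)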

Aggregating multiple columns

  • You can also pass many functions at the same time
  • The .agg() method is used for this
  • It takes a dictionary where the keys are the columns and the values are the functions to apply
  • Multiple functions can be passed as a list
dfg.agg({'Distance': ['mean', 'sum', 'count'], 
         'Time': ['mean', 'sum', 'count']})
Distance Time
mean sum count mean sum count
Name
Afternoon Ride 12.462 186.93 15 4654.352941 79124 17
Morning Ride 12.860 205.76 16 2299.875000 36798 16

Aggregating data

  • And even apply different functions to different columns
def num_range(x):
    return x.max() - x.min()

dfg.agg({"Time": ['max', 'min', 'mean', num_range], 
         "Distance": ['sum']})
Time Distance
max min mean num_range sum
Name
Afternoon Ride 48062 1712 4654.352941 46350 186.93
Morning Ride 2903 2090 2299.875000 813 205.76
  • By the way, you can use .agg() for non-grouped dataframes too
  • This is pretty much what df.describe() does under the hood:
# Select only numeric values
numeric_df = df.select_dtypes(include='number')

# Apply the function
numeric_df.agg(['count', 'mean', 'median', 'std', 'min', 'max'])
Time Distance
count 33.000000 31.000000
mean 3512.787879 12.667419
median 2118.000000 12.620000
std 8003.309233 0.428618
min 1712.000000 11.790000
max 48062.000000 14.570000
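  • For comparison, you can call the built-in version on the same numeric columns:
numeric_df.describe()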

Dealing with Missing Values

Missing data

  • Missing data is a common problem in data analysis
  • Pandas has a few methods to deal with missing data
  • The most common are:
    • isnull(): returns a boolean mask where True indicates missing values
    • notnull(): opposite of isnull()
    • dropna(): removes missing values
    • fillna(): fills missing values with a specified value
    • interpolate(): fills missing values with interpolated values
  • We can use df.isnull() to find missing values in a dataframe
df.isnull().head(3)
Date Name Type Time Distance Comments
0 False False False False False False
1 False False False False False False
2 False False False False False False
  • df.info() is often more useful
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 33 entries, 0 to 32
Data columns (total 6 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Date      33 non-null     object 
 1   Name      33 non-null     object 
 2   Type      33 non-null     object 
 3   Time      33 non-null     int64  
 4   Distance  31 non-null     float64
 5   Comments  33 non-null     object 
dtypes: float64(1), int64(1), object(4)
memory usage: 1.7+ KB

Dropping missing values

  • The simplest way to deal with missing values is to drop them
  • You can use the .dropna() method to do this
df.dropna().head()
Date Name Type Time Distance Comments
0 10 Sep 2019, 00:13:04 Afternoon Ride Ride 2084 12.62 Rain
1 10 Sep 2019, 13:52:18 Morning Ride Ride 2531 13.03 rain
2 11 Sep 2019, 00:23:50 Afternoon Ride Ride 1863 12.52 Wet road but nice weather
3 11 Sep 2019, 14:06:19 Morning Ride Ride 2192 12.84 Stopped for photo of sunrise
4 12 Sep 2019, 00:28:05 Afternoon Ride Ride 1891 12.48 Tired by the end of the week
  • Or you can impute (“fill”) them using .fillna()
  • This method has various options for filling
  • You can use a fixed value, the mean of the column, the previous non-NaN value, and so on (forward filling and interpolation are sketched at the end of this section)
df = pd.DataFrame([[np.nan, 2, np.nan, 0],
                   [3, 4, np.nan, 1],
                   [np.nan, np.nan, np.nan, 5],
                   [np.nan, 3, np.nan, 4]],
                  columns=list('ABCD'))
df
A B C D
0 NaN 2.0 NaN 0
1 3.0 4.0 NaN 1
2 NaN NaN NaN 5
3 NaN 3.0 NaN 4

Filling missing values

df.fillna(0)  # fill with 0
A B C D
0 0.0 2.0 0.0 0
1 3.0 4.0 0.0 1
2 0.0 0.0 0.0 5
3 0.0 3.0 0.0 4
df.fillna(df.mean())  # fill with the mean
A B C D
0 3.0 2.0 NaN 0
1 3.0 4.0 NaN 1
2 3.0 3.0 NaN 5
3 3.0 3.0 NaN 4
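  • The remaining strategies can be sketched on the same dataframe: .ffill() carries the previous non-NaN value forward, and .interpolate() fills gaps linearly between surrounding values (column C has no values at all, so it stays NaN in both cases):
df.ffill()        # propagate the previous non-NaN value down each column
df.interpolate()  # linear interpolation between surrounding values, column-wise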

Visualising DataFrames 📈

Plotting data with Pandas

  • Pandas has some built-in plotting functions that are very useful
  • You can use them by calling .plot() on a dataframe or series
df = pd.read_csv('data/cycling_data.csv', index_col=0, parse_dates=True).dropna()
  • Let’s plot the Distance column
df['Distance'].plot()

Plotting data with Pandas

  • Cumulative sum of the Distance column
df['Distance'].cumsum().plot.line()

  • You can also plot multiple columns at once
df[['Distance', 'Time']].plot()

Plotting data with Pandas

  • There are many configuration options for these plots, which build on the matplotlib library
df['Distance'].cumsum().plot.line(fontsize=14, linewidth=2, color='r', ylabel="km")

  • You can even use custom themes with the style argument
import matplotlib.pyplot as plt
import mplcyberpunk
plt.style.use("cyberpunk")

df['Distance'].plot.line(ylabel="km")
mplcyberpunk.add_glow_effects()

That’s all for today! 🎉

Have a great day! 😊

Appendix: Cheat Sheets 📜
