QTM 151 - Introduction to Statistical Computing II

Lecture 23 - Pivot Tables

Danilo Freire

danilo.freire@emory.edu

Emory University

Hello, friends! 😊
How’s everything?

QTM 285 - Intro to Text as Data

Brief recap 📚

Last time we talked about…

How to work with time series with Python using the datetime module
The S&P 500 and the Dow Jones Industrial Average stock market indices
How to visualise the data using matplotlib
How to convert time series to strings using wildcards (it’s a bit tricky!). List here
And finally, we saw how to calculate aggregated statistics

William Playfair - Export & Import to and from Denmark & Norway from 1700 to 1780 (Commercial and Political Atlas, 1786)

Today’s plan 📊

Pivot Tables

Today we will learn more about plotting data
We will see how to plot multiple variables at once
We will also learn how to estimate and plot growth rates
Lastly, we will see how to create pivot tables and change the data structure

Import libraries and data 📦

Let’s start by importing the necessary libraries
They are the same we used last time (and many other times!)

import pandas as pd
import matplotlib.pyplot as plt

We will also import the data we used last time

financial = pd.read_csv("data_raw/financial.csv")

Let’s take a look at the data

financial.head()

	date_str	sp500	djia	date_ex1	date_ex2	date_ex3
0	4/4/2018	2644.69	24264.30	April 04 2018	Wednesday, 2018-04-04	4-Apr-18
1	4/5/2018	2662.84	24505.22	April 05 2018	Thursday, 2018-04-05	5-Apr-18
2	4/6/2018	2604.47	23932.76	April 06 2018	Friday, 2018-04-06	6-Apr-18
3	4/9/2018	2613.16	23979.10	April 09 2018	Monday, 2018-04-09	9-Apr-18
4	4/10/2018	2656.87	24408.00	April 10 2018	Tuesday, 2018-04-10	10-Apr-18

financial.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1305 entries, 0 to 1304
Data columns (total 6 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   date_str  1305 non-null   object 
 1   sp500     1260 non-null   float64
 2   djia      1260 non-null   float64
 3   date_ex1  1305 non-null   object 
 4   date_ex2  1305 non-null   object 
 5   date_ex3  1305 non-null   object 
dtypes: float64(2), object(4)
memory usage: 61.3+ KB

Parse and clean data 🧹

We will parse the date column
What is parsing, you may ask?
It is the process of converting a string into a date
Our column of interest is date_str
We will convert it to a datetime object
We will also sort the data by date

financial["date"] = pd.to_datetime(financial["date_str"])
financial = financial.sort_values("date", ascending = True) 

financial.dtypes

date_str            object
sp500              float64
djia               float64
date_ex1            object
date_ex2            object
date_ex3            object
date        datetime64[ns]
dtype: object

What does [ns] mean in the datetime64[ns] column type?
It means nanoseconds, which is the smallest unit of time in Python
It indicates the precision of the date column
We don’t actually need nanosecond precision, but… well, it doesn’t hurt either! 😅
We can always change it later if we want to, such as using pd.to_datetime(financial["date"], unit='s') to convert it to seconds, or unit='m' to convert it to minutes

Plot multiple variables 📈

Plotting multiple variables

Last time, we saw how to plot multiple variables, but let us refresh our memory 😉
We will plot the sp500 and dji columns just using pandas (wait, what? 😲)

financial[["date","sp500","djia"]].set_index("date").plot()

Plotting multiple variables

As we also saw last time, the two trends are not immediately comparable because they have different scales
We can fix this by normalising the data (as we did last time), or by transforming one or both columns
Last time we saw how to normalise the data by dividing them by the first value

financial["sp500_normalised"] = financial["sp500"] / financial["sp500"].iloc[0]
financial["djia_normalised"] = financial["djia"] / financial["djia"].iloc[0]

# Plot the normalized data
plt.plot("date", "sp500_normalised", data = financial, label = "S&P 500 (Normalised)")
plt.plot("date", "djia_normalised", data = financial, label = "Dow Jones Industrial Average (Normalised)")
plt.xlabel("Time")
plt.ylabel("Normalised Index")
plt.title("S&P 500 vs. Dow Jones Industrial Average (Normalised)")
plt.legend()
plt.show()

Plotting multiple variables

Let’s see the results

Plotting multiple variables

We can also use simple transformations if we have some understanding of the data (which we always should! 😂)
For example, the djia is usually higher than the sp500, so we can simply divide the djia by 10 to bring the two series closer together

financial['djia_rescale'] = financial['djia']/10

financial[["date","sp500","djia_rescale"]].set_index("date").plot()
plt.xlabel("Time")
plt.ylabel("Value of Index Funds")
plt.legend(["S&P 500","Dow Jones"])
plt.title("S&P 500 vs. Dow Jones Industrial Average")
plt.show()

Time difference/growth rates 📈

Time difference/growth rates: `diff()`

We can also calculate the difference between two consecutive values, such as the daily fluctuations of the stock market
We can do this by using the diff() method
The method simply calculates the difference between the current value and the previous value in the column
The first value will always be NaN because there is no previous value to compare it to
We must always ensure that the data are sorted before using the diff() method, otherwise the results will be incorrect

financial["diff_sp500"] = financial["sp500"].diff()
financial.head()

	date_str	sp500	djia	date_ex1	date_ex2	date_ex3	date	sp500_normalised	djia_normalised	djia_rescale	diff_sp500
0	4/4/2018	2644.69	24264.30	April 04 2018	Wednesday, 2018-04-04	4-Apr-18	2018-04-04	1.000000	1.000000	2426.430	NaN
1	4/5/2018	2662.84	24505.22	April 05 2018	Thursday, 2018-04-05	5-Apr-18	2018-04-05	1.006863	1.009929	2450.522	18.15
2	4/6/2018	2604.47	23932.76	April 06 2018	Friday, 2018-04-06	6-Apr-18	2018-04-06	0.984792	0.986336	2393.276	-58.37
3	4/9/2018	2613.16	23979.10	April 09 2018	Monday, 2018-04-09	9-Apr-18	2018-04-09	0.988078	0.988246	2397.910	8.69
4	4/10/2018	2656.87	24408.00	April 10 2018	Tuesday, 2018-04-10	10-Apr-18	2018-04-10	1.004605	1.005922	2440.800	43.71

Time difference/growth rates

We can plot the differences in the sp500 column to see the daily fluctuations

plt.plot("date", "diff_sp500", data = financial, label = "S&P 500")
plt.xlabel("Time")
plt.ylabel("Daily Fluctuations")
plt.legend()
plt.show()

Hmmm, that’s still not very informative… 🤔

Compute lagged differences and growth rates

The plot is still a little hard to understand because it is calculated in points
We can fix this using lagged differences
We can calculate the growth rate by dividing the difference by the previous value
First, we will create a new column with the lagged values, then calculate the growth rate

financial["lag_sp500"] = financial["sp500"].shift(1)

# Convert into percentage
financial["growth_sp500"] = (financial["diff_sp500"]/financial["lag_sp500"]) * 100

financial.head()

	date_str	sp500	djia	date_ex1	date_ex2	date_ex3	date	sp500_normalised	djia_normalised	djia_rescale	diff_sp500	lag_sp500	growth_sp500
0	4/4/2018	2644.69	24264.30	April 04 2018	Wednesday, 2018-04-04	4-Apr-18	2018-04-04	1.000000	1.000000	2426.430	NaN	NaN	NaN
1	4/5/2018	2662.84	24505.22	April 05 2018	Thursday, 2018-04-05	5-Apr-18	2018-04-05	1.006863	1.009929	2450.522	18.15	2644.69	0.686281
2	4/6/2018	2604.47	23932.76	April 06 2018	Friday, 2018-04-06	6-Apr-18	2018-04-06	0.984792	0.986336	2393.276	-58.37	2662.84	-2.192021
3	4/9/2018	2613.16	23979.10	April 09 2018	Monday, 2018-04-09	9-Apr-18	2018-04-09	0.988078	0.988246	2397.910	8.69	2604.47	0.333657
4	4/10/2018	2656.87	24408.00	April 10 2018	Tuesday, 2018-04-10	10-Apr-18	2018-04-10	1.004605	1.005922	2440.800	43.71	2613.16	1.672687

Let’s plot the graph

That is easier to understand! 😊
We can see the daily fluctuations in the S&P 500 index in percentage terms, which is indeed more informative

plt.plot("date", "growth_sp500",
          data = financial)
plt.xlabel("Time")
plt.ylabel("Daily percentage change ")
plt.title("Change in the S&P 500 Index")
plt.show()

Try it yourself! 🧠

Now it’s your turn!
Compute a column with the growth of the Dow Jones
- Use diff() to calculate the difference and shift() to create the lagged values
Plot the growth of the S&P 500 and Dow Jones in a single plot
Appendix 01

Subsetting with time data 📉

Subsetting with time data

Do you remember our old friend query()?
We can use it again here! 😊
Since the date column has a time format, Python will interpret 2019-01-01 as a date inside the query command
We can use boolean operators to subset the data
Note: remember that you have to use single quotations for .query()

subset_before  = financial.query('date <= "2019-01-01"')
subset_after   = financial.query('date >= "2020-01-01"')

max(subset_before['date'])

Timestamp('2019-01-01 00:00:00')

max(subset_after['date'])

Timestamp('2023-04-04 00:00:00')

Subsetting with time data

We can also subset between two dates

subset_between = financial.query('date.between("2020-03-02","2020-05-02")')
subset_between.iloc[:, :6].head(3)

	date_str	sp500	djia	date_ex1	date_ex2	date_ex3
498	3/2/2020	3090.23	26703.32	March 02 2020	Monday, 2020-03-02	2-Mar-20
499	3/3/2020	3003.37	25917.41	March 03 2020	Tuesday, 2020-03-03	3-Mar-20
500	3/4/2020	3130.12	27090.86	March 04 2020	Wednesday, 2020-03-04	4-Mar-20

subset_between.iloc[:, :6].tail(3)

	date_str	sp500	djia	date_ex1	date_ex2	date_ex3
540	4/29/2020	2939.51	24633.86	April 29 2020	Wednesday, 2020-04-29	29-Apr-20
541	4/30/2020	2912.43	24345.72	April 30 2020	Thursday, 2020-04-30	30-Apr-20
542	5/1/2020	2830.71	23723.69	May 01 2020	Friday, 2020-05-01	1-May-20

Why don’t we have data for the 2nd of May? Was it a weekend? 🤔

from datetime import datetime
date_str = '2020-05-02'
date_obj = datetime.strptime(date_str, '%Y-%m-%d')

# Get the day of the week
day_of_week = date_obj.strftime('%A')
day_of_week

'Saturday'

Subsetting with time data

We can then flag some observations that meet a certain condition
For instance, we can flag the observations where the growth of the S&P 500 index is greater than 5% and plot them

financial["bool_period"]  = financial["date"].between("2020-03-01","2020-05-01")
financial["bool_example"] = financial["growth_sp500"] > 5

# Create a line plot
plt.plot("date", "growth_sp500", data = financial)
plt.xlabel("Time")
plt.ylabel("Daily percentage change ")
plt.title("The S&P 500 during the start of COVID")

# Add a shaded region with a rectangle with "fill_between()"
# "x" is the x-coordinate, "y1" and "y2" are the lower
# and upper bounds of the rectangle, respectively
# We can set this to be the minimum and maximum of the outcome
# We use "where" to test a logical condition

vec_y = financial["growth_sp500"]
plt.fill_between(x = financial["date"],
                 y1 = vec_y.min(),
                 y2 = vec_y.max(),
                 where = financial["bool_period"],
                 alpha = 0.2, color = "red")

plt.show()

Let’s see the graph

Try it yourself! 🧠

Now it’s your turn again!
Generate a plot of the daily growth of the Dow Jones
Highlight regions where there was growth higher than 4% or below -4% (both of them! Use | to include both conditions)
Appendix 02

Wide and long data formats 📊

Wide and long data formats

Data can come in a wide variety of formats
Few rows, multiple columns (wide)
Stacked rows, few columns (long)
The information is the same, but the structure is different
The long format is usually preferred for data analysis, while the wide format is better for presentation

We can convert data from wide to long format using the melt() function
The melt() function takes the following arguments:
- id_vars: the columns to keep as is
- value_vars: the columns to melt
- var_name: the name of the new column with the variable names
- value_name: the name of the new column with the values

Wide and long data formats

Let’s see an example
How the data looked before the melt

financial[["date", "sp500", "djia"]].head()

	date	sp500	djia
0	2018-04-04	2644.69	24264.30
1	2018-04-05	2662.84	24505.22
2	2018-04-06	2604.47	23932.76
3	2018-04-09	2613.16	23979.10
4	2018-04-10	2656.87	24408.00

financial_long = pd.melt(financial,
                         id_vars = 'date', # keep the date column
                         value_vars = ['sp500','djia'], # melt the sp500 and djia columns
                         var_name = "portfolio_type", # variable name
                         value_name = "portfolio_value") # value name

And now we have the data in long format:

financial_long.sort_values(by='date').head()

	date	portfolio_type	portfolio_value
0	2018-04-04	sp500	2644.69
1305	2018-04-04	djia	24264.30
1306	2018-04-05	djia	24505.22
1	2018-04-05	sp500	2662.84
2	2018-04-06	sp500	2604.47

Long data format

We can also convert the data back to wide format using the pd.pivot() function
The pd.pivot() function takes the following arguments:
- index: the column to keep as is
- columns: the column to pivot
- values: the column with the values

financial_wide = pd.pivot(financial_long,
                           index = 'date',
                           columns = 'portfolio_type',
                           values =  'portfolio_value')

financial_wide.head()

portfolio_type	djia	sp500
date
2018-04-04	24264.30	2644.69
2018-04-05	24505.22	2662.84
2018-04-06	23932.76	2604.47
2018-04-09	23979.10	2613.16
2018-04-10	24408.00	2656.87

If you also want the index to be part of the dataset add .reset_index() to the end of the previous command

Try it yourself! 🧠

Convert the growth_sp500 and growth_djia to long format
Appendix 03

That’s all for today! 🎉

Thank you for your attention
and see you soon! 🙏

Appendix 01: Solution

financial["diff_djia"] = financial["djia"].diff()
financial["lag_djia"]  = financial["djia"].shift(1)
financial["growth_djia"] = (financial["diff_djia"] / financial["lag_djia"])*100

# Method 1: Adding multiple lines with "plt.plot()"
# you can make a plot transparent by adding an alpha parameter (between 0 and 1)

plt.plot("date","growth_djia",data= financial)
plt.plot("date","growth_sp500",data= financial,alpha = 0.6)
plt.xlabel("Time")
plt.ylabel("Daily percentage change")
plt.title("Change in the Dow Jones Index")
plt.show()

Back to Exercise 01

Appendix 01: Solution

# Method 2: A single line with ".plot()" but setting the "x-axis"
# with ".set_index()"
financial[["date","growth_sp500","growth_djia"]].set_index("date").plot()
plt.xlabel("Time")
plt.ylabel("Daily percentage change")
plt.title("Changes in the Stock Market Indecies")
plt.legend(["S&P 500", "Dow Jones Industrial Index"])
plt.show()

Back to Exercise 01

Appendix 02: Solution

plt.plot("date","growth_djia",data= financial)
plt.xlabel("Time")
plt.ylabel("Daily percentage change")
plt.title("Change in the Dow Jones Index")

financial["bool_djia"] = (financial["growth_djia"] > 4) | (financial["growth_djia"] < -4)
vec_y = financial["growth_djia"]
plt.fill_between(x = financial["date"],
                 y1 = vec_y.min(),
                 y2 = vec_y.max(),
                 where = financial["bool_djia"],
                 alpha = 0.7,color="pink")

Back to Exercise 02

Appendix 03: Solution

financial_long = pd.melt(financial,
                         var_name = "portfolio_type",
                         value_name = "portfolio_value",
                         id_vars='date',
                         value_vars=['growth_sp500','growth_djia'])

financial_long.sort_values(by='date').head()

	date	portfolio_type	portfolio_value
0	2018-04-04	growth_sp500	NaN
1305	2018-04-04	growth_djia	NaN
1306	2018-04-05	growth_djia	0.992899
1	2018-04-05	growth_sp500	0.686281
2	2018-04-06	growth_sp500	-2.192021

Back to Exercise 03

QTM 151 - Introduction to Statistical Computing II

Hello, friends! 😊 How’s everything?

QTM 285 - Intro to Text as Data

Brief recap 📚

Last time we talked about…

Today’s plan 📊

Pivot Tables

Import libraries and data 📦

Parse and clean data 🧹

Plot multiple variables 📈

Plotting multiple variables

Plotting multiple variables

Plotting multiple variables

Plotting multiple variables

Time difference/growth rates 📈

Time difference/growth rates: diff()

Time difference/growth rates

Hmmm, that’s still not very informative… 🤔

Compute lagged differences and growth rates

Let’s plot the graph

Try it yourself! 🧠

Subsetting with time data 📉

Subsetting with time data

Subsetting with time data

Subsetting with time data

Let’s see the graph

Try it yourself! 🧠

Wide and long data formats 📊

Wide and long data formats

Wide and long data formats

Long data format

Try it yourself! 🧠

That’s all for today! 🎉

Thank you for your attention and see you soon! 🙏

Appendix 01: Solution

Appendix 01: Solution

Appendix 02: Solution

Appendix 03: Solution

Hello, friends! 😊
How’s everything?

Time difference/growth rates: `diff()`

Thank you for your attention
and see you soon! 🙏