QTM 151 - Introduction to Statistical Computing II

Lecture 23 - Pivot Tables

Danilo Freire

Emory University

20 November, 2024

Hello, friends! 😊
How’s everything?

QTM 285 - Intro to Text as Data

Brief recap 📚

Last time we talked about…

  • How to work with time series with Python using the datetime module
  • The S&P 500 and the Dow Jones Industrial Average stock market indices
  • How to visualise the data using matplotlib
  • How to convert time series to strings using wildcards (it’s a bit tricky!). List here
  • And finally, we saw how to calculate aggregated statistics

William Playfair - Export & Import to and from Denmark & Norway from 1700 to 1780 (Commercial and Political Atlas, 1786)

Today’s plan 📊

Pivot Tables

  • Today we will learn more about plotting data
  • We will see how to plot multiple variables at once
  • We will also learn how to estimate and plot growth rates
  • Lastly, we will see how to create pivot tables and change the data structure

Import libraries and data 📦

  • Let’s start by importing the necessary libraries
  • They are the same we used last time (and many other times!)
import pandas as pd
import matplotlib.pyplot as plt
  • We will also import the data we used last time
financial = pd.read_csv("data_raw/financial.csv")
  • Let’s take a look at the data
financial.head()
date_str sp500 djia date_ex1 date_ex2 date_ex3
0 4/4/2018 2644.69 24264.30 April 04 2018 Wednesday, 2018-04-04 4-Apr-18
1 4/5/2018 2662.84 24505.22 April 05 2018 Thursday, 2018-04-05 5-Apr-18
2 4/6/2018 2604.47 23932.76 April 06 2018 Friday, 2018-04-06 6-Apr-18
3 4/9/2018 2613.16 23979.10 April 09 2018 Monday, 2018-04-09 9-Apr-18
4 4/10/2018 2656.87 24408.00 April 10 2018 Tuesday, 2018-04-10 10-Apr-18
financial.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1305 entries, 0 to 1304
Data columns (total 6 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   date_str  1305 non-null   object 
 1   sp500     1260 non-null   float64
 2   djia      1260 non-null   float64
 3   date_ex1  1305 non-null   object 
 4   date_ex2  1305 non-null   object 
 5   date_ex3  1305 non-null   object 
dtypes: float64(2), object(4)
memory usage: 61.3+ KB

Parse and clean data 🧹

  • We will parse the date column
  • What is parsing, you may ask?
  • It is the process of converting a string into a date
  • Our column of interest is date_str
  • We will convert it to a datetime object
  • We will also sort the data by date
financial["date"] = pd.to_datetime(financial["date_str"])
financial = financial.sort_values("date", ascending = True) 

financial.dtypes
date_str            object
sp500              float64
djia               float64
date_ex1            object
date_ex2            object
date_ex3            object
date        datetime64[ns]
dtype: object
  • What does [ns] mean in the datetime64[ns] column type?
  • It means nanoseconds, which is the smallest unit of time in Python
  • It indicates the precision of the date column
  • We don’t actually need nanosecond precision, but… well, it doesn’t hurt either! 😅
  • We can always change it later if we want to, such as using pd.to_datetime(financial["date"], unit='s') to convert it to seconds, or unit='m' to convert it to minutes

Plot multiple variables 📈

Plotting multiple variables

  • Last time, we saw how to plot multiple variables, but let us refresh our memory 😉
  • We will plot the sp500 and dji columns just using pandas (wait, what? 😲)
financial[["date","sp500","djia"]].set_index("date").plot()

Plotting multiple variables

  • As we also saw last time, the two trends are not immediately comparable because they have different scales
  • We can fix this by normalising the data (as we did last time), or by transforming one or both columns
  • Last time we saw how to normalise the data by dividing them by the first value
financial["sp500_normalised"] = financial["sp500"] / financial["sp500"].iloc[0]
financial["djia_normalised"] = financial["djia"] / financial["djia"].iloc[0]

# Plot the normalized data
plt.plot("date", "sp500_normalised", data = financial, label = "S&P 500 (Normalised)")
plt.plot("date", "djia_normalised", data = financial, label = "Dow Jones Industrial Average (Normalised)")
plt.xlabel("Time")
plt.ylabel("Normalised Index")
plt.title("S&P 500 vs. Dow Jones Industrial Average (Normalised)")
plt.legend()
plt.show()

Plotting multiple variables

  • Let’s see the results

Plotting multiple variables

  • We can also use simple transformations if we have some understanding of the data (which we always should! 😂)
  • For example, the djia is usually higher than the sp500, so we can simply divide the djia by 10 to bring the two series closer together
financial['djia_rescale'] = financial['djia']/10

financial[["date","sp500","djia_rescale"]].set_index("date").plot()
plt.xlabel("Time")
plt.ylabel("Value of Index Funds")
plt.legend(["S&P 500","Dow Jones"])
plt.title("S&P 500 vs. Dow Jones Industrial Average")
plt.show()

Time difference/growth rates 📈

Time difference/growth rates: diff()

  • We can also calculate the difference between two consecutive values, such as the daily fluctuations of the stock market
  • We can do this by using the diff() method
  • The method simply calculates the difference between the current value and the previous value in the column
  • The first value will always be NaN because there is no previous value to compare it to
  • We must always ensure that the data are sorted before using the diff() method, otherwise the results will be incorrect
financial["diff_sp500"] = financial["sp500"].diff()
financial.head() 
date_str sp500 djia date_ex1 date_ex2 date_ex3 date sp500_normalised djia_normalised djia_rescale diff_sp500
0 4/4/2018 2644.69 24264.30 April 04 2018 Wednesday, 2018-04-04 4-Apr-18 2018-04-04 1.000000 1.000000 2426.430 NaN
1 4/5/2018 2662.84 24505.22 April 05 2018 Thursday, 2018-04-05 5-Apr-18 2018-04-05 1.006863 1.009929 2450.522 18.15
2 4/6/2018 2604.47 23932.76 April 06 2018 Friday, 2018-04-06 6-Apr-18 2018-04-06 0.984792 0.986336 2393.276 -58.37
3 4/9/2018 2613.16 23979.10 April 09 2018 Monday, 2018-04-09 9-Apr-18 2018-04-09 0.988078 0.988246 2397.910 8.69
4 4/10/2018 2656.87 24408.00 April 10 2018 Tuesday, 2018-04-10 10-Apr-18 2018-04-10 1.004605 1.005922 2440.800 43.71

Time difference/growth rates

  • We can plot the differences in the sp500 column to see the daily fluctuations
plt.plot("date", "diff_sp500", data = financial, label = "S&P 500")
plt.xlabel("Time")
plt.ylabel("Daily Fluctuations")
plt.legend()
plt.show()

Hmmm, that’s still not very informative… 🤔

Compute lagged differences and growth rates

  • The plot is still a little hard to understand because it is calculated in points
  • We can fix this using lagged differences
  • We can calculate the growth rate by dividing the difference by the previous value
  • First, we will create a new column with the lagged values, then calculate the growth rate
financial["lag_sp500"] = financial["sp500"].shift(1)

# Convert into percentage
financial["growth_sp500"] = (financial["diff_sp500"]/financial["lag_sp500"]) * 100

financial.head()
date_str sp500 djia date_ex1 date_ex2 date_ex3 date sp500_normalised djia_normalised djia_rescale diff_sp500 lag_sp500 growth_sp500
0 4/4/2018 2644.69 24264.30 April 04 2018 Wednesday, 2018-04-04 4-Apr-18 2018-04-04 1.000000 1.000000 2426.430 NaN NaN NaN
1 4/5/2018 2662.84 24505.22 April 05 2018 Thursday, 2018-04-05 5-Apr-18 2018-04-05 1.006863 1.009929 2450.522 18.15 2644.69 0.686281
2 4/6/2018 2604.47 23932.76 April 06 2018 Friday, 2018-04-06 6-Apr-18 2018-04-06 0.984792 0.986336 2393.276 -58.37 2662.84 -2.192021
3 4/9/2018 2613.16 23979.10 April 09 2018 Monday, 2018-04-09 9-Apr-18 2018-04-09 0.988078 0.988246 2397.910 8.69 2604.47 0.333657
4 4/10/2018 2656.87 24408.00 April 10 2018 Tuesday, 2018-04-10 10-Apr-18 2018-04-10 1.004605 1.005922 2440.800 43.71 2613.16 1.672687

Let’s plot the graph

  • That is easier to understand! 😊
  • We can see the daily fluctuations in the S&P 500 index in percentage terms, which is indeed more informative
plt.plot("date", "growth_sp500",
          data = financial)
plt.xlabel("Time")
plt.ylabel("Daily percentage change ")
plt.title("Change in the S&P 500 Index")
plt.show()

Try it yourself! 🧠

  • Now it’s your turn!
  • Compute a column with the growth of the Dow Jones
    • Use diff() to calculate the difference and shift() to create the lagged values
  • Plot the growth of the S&P 500 and Dow Jones in a single plot
  • Appendix 01

Subsetting with time data 📉

Subsetting with time data

  • Do you remember our old friend query()?
  • We can use it again here! 😊
  • Since the date column has a time format, Python will interpret 2019-01-01 as a date inside the query command
  • We can use boolean operators to subset the data
  • Note: remember that you have to use single quotations for .query()
subset_before  = financial.query('date <= "2019-01-01"')
subset_after   = financial.query('date >= "2020-01-01"')

max(subset_before['date'])
Timestamp('2019-01-01 00:00:00')
max(subset_after['date'])
Timestamp('2023-04-04 00:00:00')

Subsetting with time data

  • We can also subset between two dates
subset_between = financial.query('date.between("2020-03-02","2020-05-02")')
subset_between.iloc[:, :6].head(3)
date_str sp500 djia date_ex1 date_ex2 date_ex3
498 3/2/2020 3090.23 26703.32 March 02 2020 Monday, 2020-03-02 2-Mar-20
499 3/3/2020 3003.37 25917.41 March 03 2020 Tuesday, 2020-03-03 3-Mar-20
500 3/4/2020 3130.12 27090.86 March 04 2020 Wednesday, 2020-03-04 4-Mar-20
subset_between.iloc[:, :6].tail(3)
date_str sp500 djia date_ex1 date_ex2 date_ex3
540 4/29/2020 2939.51 24633.86 April 29 2020 Wednesday, 2020-04-29 29-Apr-20
541 4/30/2020 2912.43 24345.72 April 30 2020 Thursday, 2020-04-30 30-Apr-20
542 5/1/2020 2830.71 23723.69 May 01 2020 Friday, 2020-05-01 1-May-20
  • Why don’t we have data for the 2nd of May? Was it a weekend? 🤔
from datetime import datetime
date_str = '2020-05-02'
date_obj = datetime.strptime(date_str, '%Y-%m-%d')

# Get the day of the week
day_of_week = date_obj.strftime('%A')
day_of_week
'Saturday'

Subsetting with time data

  • We can then flag some observations that meet a certain condition
  • For instance, we can flag the observations where the growth of the S&P 500 index is greater than 5% and plot them
financial["bool_period"]  = financial["date"].between("2020-03-01","2020-05-01")
financial["bool_example"] = financial["growth_sp500"] > 5

# Create a line plot
plt.plot("date", "growth_sp500", data = financial)
plt.xlabel("Time")
plt.ylabel("Daily percentage change ")
plt.title("The S&P 500 during the start of COVID")

# Add a shaded region with a rectangle with "fill_between()"
# "x" is the x-coordinate, "y1" and "y2" are the lower
# and upper bounds of the rectangle, respectively
# We can set this to be the minimum and maximum of the outcome
# We use "where" to test a logical condition

vec_y = financial["growth_sp500"]
plt.fill_between(x = financial["date"],
                 y1 = vec_y.min(),
                 y2 = vec_y.max(),
                 where = financial["bool_period"],
                 alpha = 0.2, color = "red")

plt.show()

Let’s see the graph

Try it yourself! 🧠

  • Now it’s your turn again!
  • Generate a plot of the daily growth of the Dow Jones
  • Highlight regions where there was growth higher than 4% or below -4% (both of them! Use | to include both conditions)
  • Appendix 02

Wide and long data formats 📊

Wide and long data formats

  • Data can come in a wide variety of formats
  • Few rows, multiple columns (wide)
  • Stacked rows, few columns (long)
  • The information is the same, but the structure is different
  • The long format is usually preferred for data analysis, while the wide format is better for presentation

  • We can convert data from wide to long format using the melt() function
  • The melt() function takes the following arguments:
    • id_vars: the columns to keep as is
    • value_vars: the columns to melt
    • var_name: the name of the new column with the variable names
    • value_name: the name of the new column with the values

Wide and long data formats

  • Let’s see an example
  • How the data looked before the melt
financial[["date", "sp500", "djia"]].head()
date sp500 djia
0 2018-04-04 2644.69 24264.30
1 2018-04-05 2662.84 24505.22
2 2018-04-06 2604.47 23932.76
3 2018-04-09 2613.16 23979.10
4 2018-04-10 2656.87 24408.00
financial_long = pd.melt(financial,
                         id_vars = 'date', # keep the date column
                         value_vars = ['sp500','djia'], # melt the sp500 and djia columns
                         var_name = "portfolio_type", # variable name
                         value_name = "portfolio_value") # value name
  • And now we have the data in long format:
financial_long.sort_values(by='date').head()
date portfolio_type portfolio_value
0 2018-04-04 sp500 2644.69
1305 2018-04-04 djia 24264.30
1306 2018-04-05 djia 24505.22
1 2018-04-05 sp500 2662.84
2 2018-04-06 sp500 2604.47

Long data format

  • We can also convert the data back to wide format using the pd.pivot() function
  • The pd.pivot() function takes the following arguments:
    • index: the column to keep as is
    • columns: the column to pivot
    • values: the column with the values

financial_wide = pd.pivot(financial_long,
                           index = 'date',
                           columns = 'portfolio_type',
                           values =  'portfolio_value')

financial_wide.head()
portfolio_type djia sp500
date
2018-04-04 24264.30 2644.69
2018-04-05 24505.22 2662.84
2018-04-06 23932.76 2604.47
2018-04-09 23979.10 2613.16
2018-04-10 24408.00 2656.87
  • If you also want the index to be part of the dataset add .reset_index() to the end of the previous command

Try it yourself! 🧠

  • Convert the growth_sp500 and growth_djia to long format
  • Appendix 03

That’s all for today! 🎉

Thank you for your attention
and see you soon! 🙏

Appendix 01: Solution

financial["diff_djia"] = financial["djia"].diff()
financial["lag_djia"]  = financial["djia"].shift(1)
financial["growth_djia"] = (financial["diff_djia"] / financial["lag_djia"])*100

# Method 1: Adding multiple lines with "plt.plot()"
# you can make a plot transparent by adding an alpha parameter (between 0 and 1)

plt.plot("date","growth_djia",data= financial)
plt.plot("date","growth_sp500",data= financial,alpha = 0.6)
plt.xlabel("Time")
plt.ylabel("Daily percentage change")
plt.title("Change in the Dow Jones Index")
plt.show()

Back to Exercise 01

Appendix 01: Solution

# Method 2: A single line with ".plot()" but setting the "x-axis"
# with ".set_index()"
financial[["date","growth_sp500","growth_djia"]].set_index("date").plot()
plt.xlabel("Time")
plt.ylabel("Daily percentage change")
plt.title("Changes in the Stock Market Indecies")
plt.legend(["S&P 500", "Dow Jones Industrial Index"])
plt.show()

Back to Exercise 01

Appendix 02: Solution

plt.plot("date","growth_djia",data= financial)
plt.xlabel("Time")
plt.ylabel("Daily percentage change")
plt.title("Change in the Dow Jones Index")

financial["bool_djia"] = (financial["growth_djia"] > 4) | (financial["growth_djia"] < -4)
vec_y = financial["growth_djia"]
plt.fill_between(x = financial["date"],
                 y1 = vec_y.min(),
                 y2 = vec_y.max(),
                 where = financial["bool_djia"],
                 alpha = 0.7,color="pink")

Back to Exercise 02

Appendix 03: Solution

financial_long = pd.melt(financial,
                         var_name = "portfolio_type",
                         value_name = "portfolio_value",
                         id_vars='date',
                         value_vars=['growth_sp500','growth_djia'])

financial_long.sort_values(by='date').head()
date portfolio_type portfolio_value
0 2018-04-04 growth_sp500 NaN
1305 2018-04-04 growth_djia NaN
1306 2018-04-05 growth_djia 0.992899
1 2018-04-05 growth_sp500 0.686281
2 2018-04-06 growth_sp500 -2.192021

Back to Exercise 03