QTM 151 - Introduction to Statistical Computing II

Lecture 21 - Time Series

Danilo Freire

danilo.freire@emory.edu

Emory University

Hello, friends! 😊
Having a good day?

Brief recap 📝

Last class we saw…

How to use the JOIN statement to combine data from different tables
Different types of JOIN statements, such as INNER JOIN, LEFT JOIN, RIGHT JOIN, and FULL JOIN
What Entity-Relationship Diagrams are and how to use them to understand the relationships between tables
How to use pd.to_sql() to write data to a SQL database from a DataFrame

Today’s plan 📅

Time Series

Today we will learn about time series data
We will see how to work with time series data in Python
We will learn how to use the datetime module to work with dates and times
We will also visualise time series data using matplotlib
Finally, we will learn how to convert time series to strings and calculate aggregates

Time Series 🕰️

What is time series data?

Time series data is a sequence of data points collected at constant time intervals
These data are used in many fields, such as economics, finance, and meteorology
They are used to predict future values based on past observations
We usually decompose time series data into trend, seasonality, and noise
Trend: long-term increase or decrease in the data
Seasonality: repeating patterns in the data
Noise: random fluctuations in the data

Time Series in Python 🐍

Working with time series data in Python

In Python, we can use the datetime module to work with dates and times
It provides some useful functions to manipulate time series data, and it works well with pandas and matplotlib
Let’s load the packages we need

# "pandas" processes datasets
# "maplotlib.pyplot" generates graphs
# "maplotlib.dates" has options for plotting dates
# "datetime" handles date information

import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
from datetime import date, time, datetime

Now let’s import the data we will use

# Load the data
financial = pd.read_csv("data_raw/financial.csv")
financial.head()

	date_str	sp500	djia	date_ex1	date_ex2	date_ex3
0	4/4/2018	2644.69	24264.30	April 04 2018	Wednesday, 2018-04-04	4-Apr-18
1	4/5/2018	2662.84	24505.22	April 05 2018	Thursday, 2018-04-05	5-Apr-18
2	4/6/2018	2604.47	23932.76	April 06 2018	Friday, 2018-04-06	6-Apr-18
3	4/9/2018	2613.16	23979.10	April 09 2018	Monday, 2018-04-09	9-Apr-18
4	4/10/2018	2656.87	24408.00	April 10 2018	Tuesday, 2018-04-10	10-Apr-18

sp500 is the S&P 500 index
djia is the Dow Jones Industrial Average

Parsing dates in Python

Date columns are strings by default
We can convert them to datetime objects using the pd.to_datetime() function
This function can handle different date formats
If the conversion is simple, we can use the parse_dates argument in pd.read_csv()
- financial = pd.read_csv("data_raw/financial.csv",parse_dates = ["date"])
Let’s parse the first column, date_str, to a datetime object

# Convert the date column to datetime
financial["date"] = pd.to_datetime(financial["date_str"])
financial.head()

	date_str	sp500	djia	date_ex1	date_ex2	date_ex3	date
0	4/4/2018	2644.69	24264.30	April 04 2018	Wednesday, 2018-04-04	4-Apr-18	2018-04-04
1	4/5/2018	2662.84	24505.22	April 05 2018	Thursday, 2018-04-05	5-Apr-18	2018-04-05
2	4/6/2018	2604.47	23932.76	April 06 2018	Friday, 2018-04-06	6-Apr-18	2018-04-06
3	4/9/2018	2613.16	23979.10	April 09 2018	Monday, 2018-04-09	9-Apr-18	2018-04-09
4	4/10/2018	2656.87	24408.00	April 10 2018	Tuesday, 2018-04-10	10-Apr-18	2018-04-10

We can check the types of the columns using financial.dtypes

financial.dtypes

date_str            object
sp500              float64
djia               float64
date_ex1            object
date_ex2            object
date_ex3            object
date        datetime64[ns]
dtype: object

Visualise time series data 📊

Plotting time series data

First, let’s see where our data begins and ends
You can do it in many ways

financial['date'][0]

Timestamp('2018-04-04 00:00:00')

financial['date'][len(financial) - 1] # why -1 here?

Timestamp('2023-04-04 00:00:00')

Or you can use the min() and max() functions

print("Start date:", financial["date"].min())

Start date: 2018-04-04 00:00:00

print("End date:", financial["date"].max())

End date: 2023-04-04 00:00:00

Or just use head() and tail() as we did before

We can now plot the data
As you may remember, plt.plot() is used to create line plots
The first two arguments are column names for the (x,y) data
The third argument is the data
Make sure that the data is sorted, since the lines will be connected based on the order of the dataset
We can sort the data using financial.sort_values("date")
Let’s plot the data 🤓

Plotting time series data

# Sort the data
financial = financial.sort_values("date")

# Plot the data
plt.plot("date", "sp500", data = financial)
plt.xlabel("Time")
plt.ylabel("S&P 500 Index")
plt.title("The evolution of the stock market")
plt.show()

Try it yourself! 🧠

Now it’s your turn! 🚀
Plot the djia column on the y axis and the date column on the x axis
Make sure to sort the data first
Appendix 01

S&P 500 vs. Dow Jones Industrial Average

As you may have guessed, although the indices are calculated in different ways, they are highly correlated
The S&P 500 is a broader index, while the Dow Jones is more focused on industrial companies
The S&P 500 is considered a better representation of the overall stock market
Let’s see how the two indices compare

# Correlation between the two indices
financial[["sp500", "djia"]].corr()

	sp500	djia
sp500	1.000000	0.981448
djia	0.981448	1.000000

Let’s plot the two indices together

plt.plot("date", "sp500", data = financial, label = "S&P 500")
plt.plot("date", "djia", data = financial, label = "Dow Jones Industrial Average")
plt.xlabel("Time")
plt.ylabel("Index")
plt.title("S&P 500 vs. Dow Jones Industrial Average")
plt.legend()
plt.show()

Oops, that looks a bit messy 🤔

The indices are calculated differently, so the values are not directly comparable
We can normalise the indices to start from the same point
We do so by dividing each index by its initial value (why?)

# Normalise the indices to start from the same point
financial["sp500_normalised"] = financial["sp500"] / financial["sp500"].iloc[0]
financial["djia_normalised"] = financial["djia"] / financial["djia"].iloc[0]

# Plot the normalized data
plt.plot("date", "sp500_normalised", data = financial, label = "S&P 500 (Normalised)")
plt.plot("date", "djia_normalised", data = financial, label = "Dow Jones Industrial Average (Normalised)")
plt.xlabel("Time")
plt.ylabel("Normalised Index")
plt.title("S&P 500 vs. Dow Jones Industrial Average (Normalised)")
plt.legend()
plt.show()

Dates + Wild Cards = Fun! 🎉

Converting time series to strings

We can convert datetime objects to strings using the strftime() method
The first argument needs to be a datetime type
The second argument is the format you want to use
A wildcard % is used to denote date formats
For example, %Y is the year, %m is the month, and %d is the day
%H is the hour, %M is the minute, and %S is the second

financial["month_str"] = financial["date"].dt.strftime("%m")
print(financial["month_str"].head(3))

0    04
1    04
2    04
Name: month_str, dtype: object

financial["week_str"]  = financial["date"].dt.strftime("%W")
print(financial["week_str"].head(3))

0    14
1    14
2    14
Name: week_str, dtype: object

Converting time series to strings

We can also convert dates to names

financial["monthname"]   =  financial["date"].dt.strftime("%B")
print(financial["monthname"].head(3))

0    April
1    April
2    April
Name: monthname, dtype: object

financial["weekdayname"] =  financial["date"].dt.strftime("%A")
print(financial["weekdayname"].head(3))

0    Wednesday
1     Thursday
2       Friday
Name: weekdayname, dtype: object

Personalise your dates 🎨

We can also customise the date format
For example, we can use the following format to display the date as YYYY-MM-DD

financial["date_str"] = financial["date"].dt.strftime("%Y-%m-%d")
print(financial["date_str"].head(3))

0    2018-04-04
1    2018-04-05
2    2018-04-06
Name: date_str, dtype: object

message_monthname =  financial["date"].dt.strftime("This is the month of %B")
print(message_monthname.head(3))

0    This is the month of April
1    This is the month of April
2    This is the month of April
Name: date, dtype: object

message_monthday  =  financial["date"].dt.strftime("The day of the week is %A")
print(message_monthday.head(3))

0    The day of the week is Wednesday
1     The day of the week is Thursday
2       The day of the week is Friday
Name: date, dtype: object

message_yearmonth =  financial["date"].dt.strftime("%Y-%m")
print(message_yearmonth.head(3))

0    2018-04
1    2018-04
2    2018-04
Name: date, dtype: object

You can use any format you like!
Here is a list of all the wildcards you can use
Have fun! 🎉

Try it yourself! 🧠

Now it’s your turn (again)! 🚀
Create a new column called date_test which has the format using .dt.strftime()
The format should be Weekday, Month Day, Year
- For example, Monday, December 31, 2023
Appendix 02

Data operations 📊

Parsing + wildcards

We can also use the datetime module to perform operations on dates
For example, we can calculate the difference between two dates

# Calculate the difference between two dates
date1 = datetime(2023, 12, 31)
date2 = datetime(2023, 12, 25)
difference = date1 - date2
print(difference)

6 days, 0:00:00

We can also use the timedelta function to add or subtract days

# Add or subtract days using timedelta
date4 = date1 + pd.Timedelta(days = 7)
print(date4)

2024-01-07 00:00:00

financial['date_ex1_dt'] = pd.to_datetime(financial["date_ex1"], format = "%B %d %Y")
print(financial['date_ex1_dt'].head(3))

0   2018-04-04
1   2018-04-05
2   2018-04-06
Name: date_ex1_dt, dtype: datetime64[ns]

financial['date_ex2_dt'] = pd.to_datetime(financial["date_ex2"], format = "%A, %Y-%m-%d")
print(financial['date_ex2_dt'].head())

0   2018-04-04
1   2018-04-05
2   2018-04-06
3   2018-04-09
4   2018-04-10
Name: date_ex2_dt, dtype: datetime64[ns]

Parsing + wildcards

Does the following line of code work?
pd.to_datetime(financial["date_ex1"], format = "%A, %Y-%m-%d")

No, it doesn’t work
Why not?

ValueError: time data "April 04 2018" doesn't match format "%A, %Y-%m-%d", at position 0. You might want to try:
    - passing `format` if your strings have a consistent format;
    - passing `format='ISO8601'` if your strings are all ISO8601 but not necessarily in exactly the same format;
    - passing `format='mixed'`, and the format will be inferred for each element individually. You might want to use `dayfirst` alongside this.

Period grouping

We can group time series data by periods
For example, we can group data by month or year
First, you should specify the period you want to group by
We use pd.Grouper() to group the data, then we use the agg() function to calculate the aggregate

# Group the data by year
year_group = pd.Grouper(key='date', freq='YE') # 'YE' stands for year end
financial.groupby(year_group).agg(sp500_mean = ("sp500","mean")).reset_index()

	date	sp500_mean
0	2018-12-31	2751.921862
1	2019-12-31	2913.360556
2	2020-12-31	3217.856482
3	2021-12-31	4273.406627
4	2022-12-31	4098.514741
5	2023-12-31	4003.579844

You can also group by month, week, day, quarter, etc.
More information can be found here

Plot the aggregated data

plt.plot("date","sp500_mean",
          data = financial.groupby(year_group).agg(sp500_mean = ("sp500","mean")).reset_index())
plt.xlabel("Time")
plt.ylabel("S&P 500")
plt.title("Average stock market performance by year")
plt.show()

Period grouping

Another example

month_group = pd.Grouper(key='date', freq='ME')

monthly_data = (financial
               .groupby(month_group) 
               .agg(sp500_mean = ("sp500", "mean"))
               .reset_index())

monthly_data

	date	sp500_mean
0	2018-04-30	2659.463158
1	2018-05-31	2701.493636
2	2018-06-30	2754.352857
3	2018-07-31	2793.643333
4	2018-08-31	2857.820435
...	...	...
56	2022-12-31	3912.380952
57	2023-01-31	3960.656500
58	2023-02-28	4079.684737
59	2023-03-31	3968.559130
60	2023-04-30	4112.555000

61 rows × 2 columns

Plot the aggregated data

plt.plot("date","sp500_mean",
          data = monthly_data)
plt.xlabel("Time")
plt.ylabel("S&P 500")
plt.title("Monthly average stock market performance")
plt.show()

Try it yourself! 🧠

Practice pd.to_datetime()
Parse the column data_ex3 to a datetime object
Take a close look at the formatting
HINT: Refer to the table of wildcards in the previous section

Then, compute an aggregate dataset which calculates the standard deviation of the S&P 500 at the weekly level
Generate a line plot with your results
Appendix 03

That’s all for today! 🎉

See you next time! 👋

Appendix 01

Here is the solution to the exercise

# Sort the data
financial = financial.sort_values("date")

# Plot the data
plt.plot("date", "djia", data = financial)
plt.xlabel("Time")
plt.ylabel("Dow Jones Industrial Average")
plt.title("The evolution of the stock market")
plt.show()

Back to exercise

Appendix 02

We will use the following format: Weekday, Month Day, Year

financial["date_test"] = financial["date"].dt.strftime("%A, %B %d, %Y")
print(financial["date_test"].head())

0    Wednesday, April 04, 2018
1     Thursday, April 05, 2018
2       Friday, April 06, 2018
3       Monday, April 09, 2018
4      Tuesday, April 10, 2018
Name: date_test, dtype: object

Back to exercise

Appendix 03

financial['date_ex3_dt'] = pd.to_datetime(financial["date_ex3"], format = "%d-%b-%y")

week_group = pd.Grouper(key='date_ex3_dt', freq='W')

weekly_data = (financial
               .groupby(week_group) 
               .agg(sp500_std = ("sp500", "std"))
               .reset_index())

plt.plot("date_ex3_dt","sp500_std",
          data = weekly_data)
plt.xlabel("Time")
plt.ylabel("S&P 500")
plt.title("Weekly standard deviation of the stock market performance")
plt.show()

Back to exercise

QTM 151 - Introduction to Statistical Computing II

Hello, friends! 😊 Having a good day?

Brief recap 📝

Last class we saw…

Today’s plan 📅

Time Series

Time Series 🕰️

What is time series data?

Time Series in Python 🐍

Working with time series data in Python

Parsing dates in Python

Visualise time series data 📊

Plotting time series data

Plotting time series data

Try it yourself! 🧠

S&P 500 vs. Dow Jones Industrial Average

Oops, that looks a bit messy 🤔

Dates + Wild Cards = Fun! 🎉

Converting time series to strings

Converting time series to strings

Personalise your dates 🎨

Try it yourself! 🧠

Data operations 📊

Parsing + wildcards

Parsing + wildcards

Period grouping

Plot the aggregated data

Period grouping

Another example

Plot the aggregated data

Try it yourself! 🧠

That’s all for today! 🎉

See you next time! 👋

Appendix 01

Appendix 02

Appendix 03

Hello, friends! 😊
Having a good day?