QTM 151 - Introduction to Statistical Computing II

Lecture 21 - Time Series

Danilo Freire

Emory University

13 November, 2024

Hello, friends! 😊
Having a good day?

Brief recap 📝

Last class we saw…

  • How to use the JOIN statement to combine data from different tables
  • Different types of JOIN statements, such as INNER JOIN, LEFT JOIN, RIGHT JOIN, and FULL JOIN
  • What Entity-Relationship Diagrams are and how to use them to understand the relationships between tables
  • How to use pd.to_sql() to write data to a SQL database from a DataFrame

Today’s plan 📅

Time Series

  • Today we will learn about time series data
  • We will see how to work with time series data in Python
  • We will learn how to use the datetime module to work with dates and times
  • We will also visualise time series data using matplotlib
  • Finally, we will learn how to convert time series to strings and calculate aggregates

Time Series 🕰️

What is time series data?

  • Time series data is a sequence of data points collected at constant time intervals
  • These data are used in many fields, such as economics, finance, and meteorology
  • They are used to predict future values based on past observations
  • We usually decompose time series data into trend, seasonality, and noise
  • Trend: long-term increase or decrease in the data
  • Seasonality: repeating patterns in the data
  • Noise: random fluctuations in the data

Time Series in Python 🐍

Working with time series data in Python

  • In Python, we can use the datetime module to work with dates and times
  • It provides some useful functions to manipulate time series data, and it works well with pandas and matplotlib
  • Let’s load the packages we need
# "pandas" processes datasets
# "maplotlib.pyplot" generates graphs
# "maplotlib.dates" has options for plotting dates
# "datetime" handles date information

import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
from datetime import date, time, datetime
  • Now let’s import the data we will use
# Load the data
financial = pd.read_csv("data_raw/financial.csv")
financial.head()
date_str sp500 djia date_ex1 date_ex2 date_ex3
0 4/4/2018 2644.69 24264.30 April 04 2018 Wednesday, 2018-04-04 4-Apr-18
1 4/5/2018 2662.84 24505.22 April 05 2018 Thursday, 2018-04-05 5-Apr-18
2 4/6/2018 2604.47 23932.76 April 06 2018 Friday, 2018-04-06 6-Apr-18
3 4/9/2018 2613.16 23979.10 April 09 2018 Monday, 2018-04-09 9-Apr-18
4 4/10/2018 2656.87 24408.00 April 10 2018 Tuesday, 2018-04-10 10-Apr-18

Parsing dates in Python

  • Date columns are strings by default
  • We can convert them to datetime objects using the pd.to_datetime() function
  • This function can handle different date formats
  • If the conversion is simple, we can use the parse_dates argument in pd.read_csv()
    • financial = pd.read_csv("data_raw/financial.csv",parse_dates = ["date"])
  • Let’s parse the first column, date_str, to a datetime object
# Convert the date column to datetime
financial["date"] = pd.to_datetime(financial["date_str"])
financial.head()
date_str sp500 djia date_ex1 date_ex2 date_ex3 date
0 4/4/2018 2644.69 24264.30 April 04 2018 Wednesday, 2018-04-04 4-Apr-18 2018-04-04
1 4/5/2018 2662.84 24505.22 April 05 2018 Thursday, 2018-04-05 5-Apr-18 2018-04-05
2 4/6/2018 2604.47 23932.76 April 06 2018 Friday, 2018-04-06 6-Apr-18 2018-04-06
3 4/9/2018 2613.16 23979.10 April 09 2018 Monday, 2018-04-09 9-Apr-18 2018-04-09
4 4/10/2018 2656.87 24408.00 April 10 2018 Tuesday, 2018-04-10 10-Apr-18 2018-04-10
  • We can check the types of the columns using financial.dtypes
financial.dtypes
date_str            object
sp500              float64
djia               float64
date_ex1            object
date_ex2            object
date_ex3            object
date        datetime64[ns]
dtype: object

Visualise time series data 📊

Plotting time series data

  • First, let’s see where our data begins and ends
  • You can do it in many ways
financial['date'][0]
Timestamp('2018-04-04 00:00:00')
financial['date'][len(financial) - 1] # why -1 here?
Timestamp('2023-04-04 00:00:00')
  • Or you can use the min() and max() functions
print("Start date:", financial["date"].min())
Start date: 2018-04-04 00:00:00
print("End date:", financial["date"].max())
End date: 2023-04-04 00:00:00
  • Or just use head() and tail() as we did before
  • We can now plot the data
  • As you may remember, plt.plot() is used to create line plots
  • The first two arguments are column names for the (x,y) data
  • The third argument is the data
  • Make sure that the data is sorted, since the lines will be connected based on the order of the dataset
  • We can sort the data using financial.sort_values("date")
  • Let’s plot the data 🤓

Plotting time series data

# Sort the data
financial = financial.sort_values("date")

# Plot the data
plt.plot("date", "sp500", data = financial)
plt.xlabel("Time")
plt.ylabel("S&P 500 Index")
plt.title("The evolution of the stock market")
plt.show()

Try it yourself! 🧠

  • Now it’s your turn! 🚀
  • Plot the djia column on the y axis and the date column on the x axis
  • Make sure to sort the data first
  • Appendix 01

S&P 500 vs. Dow Jones Industrial Average

  • As you may have guessed, although the indices are calculated in different ways, they are highly correlated
  • The S&P 500 is a broader index, while the Dow Jones is more focused on industrial companies
  • The S&P 500 is considered a better representation of the overall stock market
  • Let’s see how the two indices compare
# Correlation between the two indices
financial[["sp500", "djia"]].corr()
sp500 djia
sp500 1.000000 0.981448
djia 0.981448 1.000000
  • Let’s plot the two indices together
plt.plot("date", "sp500", data = financial, label = "S&P 500")
plt.plot("date", "djia", data = financial, label = "Dow Jones Industrial Average")
plt.xlabel("Time")
plt.ylabel("Index")
plt.title("S&P 500 vs. Dow Jones Industrial Average")
plt.legend()
plt.show()

Oops, that looks a bit messy 🤔

  • The indices are calculated differently, so the values are not directly comparable
  • We can normalise the indices to start from the same point
  • We do so by dividing each index by its initial value (why?)
# Normalise the indices to start from the same point
financial["sp500_normalised"] = financial["sp500"] / financial["sp500"].iloc[0]
financial["djia_normalised"] = financial["djia"] / financial["djia"].iloc[0]

# Plot the normalized data
plt.plot("date", "sp500_normalised", data = financial, label = "S&P 500 (Normalised)")
plt.plot("date", "djia_normalised", data = financial, label = "Dow Jones Industrial Average (Normalised)")
plt.xlabel("Time")
plt.ylabel("Normalised Index")
plt.title("S&P 500 vs. Dow Jones Industrial Average (Normalised)")
plt.legend()
plt.show()

Dates + Wild Cards = Fun! 🎉

Converting time series to strings

  • We can convert datetime objects to strings using the strftime() method
  • The first argument needs to be a datetime type
  • The second argument is the format you want to use
  • A wildcard % is used to denote date formats
  • For example, %Y is the year, %m is the month, and %d is the day
  • %H is the hour, %M is the minute, and %S is the second
financial["month_str"] = financial["date"].dt.strftime("%m")
print(financial["month_str"].head(3))
0    04
1    04
2    04
Name: month_str, dtype: object

financial["week_str"]  = financial["date"].dt.strftime("%W")
print(financial["week_str"].head(3))
0    14
1    14
2    14
Name: week_str, dtype: object

Converting time series to strings

  • We can also convert dates to names

financial["monthname"]   =  financial["date"].dt.strftime("%B")
print(financial["monthname"].head(3))
0    April
1    April
2    April
Name: monthname, dtype: object
financial["weekdayname"] =  financial["date"].dt.strftime("%A")
print(financial["weekdayname"].head(3))
0    Wednesday
1     Thursday
2       Friday
Name: weekdayname, dtype: object

Personalise your dates 🎨

  • We can also customise the date format
  • For example, we can use the following format to display the date as YYYY-MM-DD
financial["date_str"] = financial["date"].dt.strftime("%Y-%m-%d")
print(financial["date_str"].head(3))
0    2018-04-04
1    2018-04-05
2    2018-04-06
Name: date_str, dtype: object
message_monthname =  financial["date"].dt.strftime("This is the month of %B")
print(message_monthname.head(3))
0    This is the month of April
1    This is the month of April
2    This is the month of April
Name: date, dtype: object
message_monthday  =  financial["date"].dt.strftime("The day of the week is %A")
print(message_monthday.head(3))
0    The day of the week is Wednesday
1     The day of the week is Thursday
2       The day of the week is Friday
Name: date, dtype: object
message_yearmonth =  financial["date"].dt.strftime("%Y-%m")
print(message_yearmonth.head(3))
0    2018-04
1    2018-04
2    2018-04
Name: date, dtype: object

Try it yourself! 🧠

  • Now it’s your turn (again)! 🚀
  • Create a new column called date_test which has the format using .dt.strftime()
  • The format should be Weekday, Month Day, Year
    • For example, Monday, December 31, 2023
  • Appendix 02

Data operations 📊

Parsing + wildcards

  • We can also use the datetime module to perform operations on dates
  • For example, we can calculate the difference between two dates
# Calculate the difference between two dates
date1 = datetime(2023, 12, 31)
date2 = datetime(2023, 12, 25)
difference = date1 - date2
print(difference)
6 days, 0:00:00
  • We can also use the timedelta function to add or subtract days
# Add or subtract days using timedelta
date4 = date1 + pd.Timedelta(days = 7)
print(date4)
2024-01-07 00:00:00

financial['date_ex1_dt'] = pd.to_datetime(financial["date_ex1"], format = "%B %d %Y")
print(financial['date_ex1_dt'].head(3))
0   2018-04-04
1   2018-04-05
2   2018-04-06
Name: date_ex1_dt, dtype: datetime64[ns]
financial['date_ex2_dt'] = pd.to_datetime(financial["date_ex2"], format = "%A, %Y-%m-%d")
print(financial['date_ex2_dt'].head())
0   2018-04-04
1   2018-04-05
2   2018-04-06
3   2018-04-09
4   2018-04-10
Name: date_ex2_dt, dtype: datetime64[ns]

Parsing + wildcards

  • Does the following line of code work?
  • pd.to_datetime(financial["date_ex1"], format = "%A, %Y-%m-%d")


  • No, it doesn’t work
  • Why not?
ValueError: time data "April 04 2018" doesn't match format "%A, %Y-%m-%d", at position 0. You might want to try:
    - passing `format` if your strings have a consistent format;
    - passing `format='ISO8601'` if your strings are all ISO8601 but not necessarily in exactly the same format;
    - passing `format='mixed'`, and the format will be inferred for each element individually. You might want to use `dayfirst` alongside this.

Period grouping

  • We can group time series data by periods
  • For example, we can group data by month or year
  • First, you should specify the period you want to group by
  • We use pd.Grouper() to group the data, then we use the agg() function to calculate the aggregate
# Group the data by year
year_group = pd.Grouper(key='date', freq='YE') # 'YE' stands for year end
financial.groupby(year_group).agg(sp500_mean = ("sp500","mean")).reset_index()
date sp500_mean
0 2018-12-31 2751.921862
1 2019-12-31 2913.360556
2 2020-12-31 3217.856482
3 2021-12-31 4273.406627
4 2022-12-31 4098.514741
5 2023-12-31 4003.579844
  • You can also group by month, week, day, quarter, etc.
  • More information can be found here

Plot the aggregated data

plt.plot("date","sp500_mean",
          data = financial.groupby(year_group).agg(sp500_mean = ("sp500","mean")).reset_index())
plt.xlabel("Time")
plt.ylabel("S&P 500")
plt.title("Average stock market performance by year")
plt.show()

Period grouping

Another example

month_group = pd.Grouper(key='date', freq='ME')

monthly_data = (financial
               .groupby(month_group) 
               .agg(sp500_mean = ("sp500", "mean"))
               .reset_index())

monthly_data
date sp500_mean
0 2018-04-30 2659.463158
1 2018-05-31 2701.493636
2 2018-06-30 2754.352857
3 2018-07-31 2793.643333
4 2018-08-31 2857.820435
... ... ...
56 2022-12-31 3912.380952
57 2023-01-31 3960.656500
58 2023-02-28 4079.684737
59 2023-03-31 3968.559130
60 2023-04-30 4112.555000

61 rows × 2 columns

Plot the aggregated data

plt.plot("date","sp500_mean",
          data = monthly_data)
plt.xlabel("Time")
plt.ylabel("S&P 500")
plt.title("Monthly average stock market performance")
plt.show()

Try it yourself! 🧠

  • Practice pd.to_datetime()
  • Parse the column data_ex3 to a datetime object
  • Take a close look at the formatting
  • HINT: Refer to the table of wildcards in the previous section


  • Then, compute an aggregate dataset which calculates the standard deviation of the S&P 500 at the weekly level
  • Generate a line plot with your results
  • Appendix 03

That’s all for today! 🎉

See you next time! 👋

Appendix 01

  • Here is the solution to the exercise
# Sort the data
financial = financial.sort_values("date")

# Plot the data
plt.plot("date", "djia", data = financial)
plt.xlabel("Time")
plt.ylabel("Dow Jones Industrial Average")
plt.title("The evolution of the stock market")
plt.show()

Back to exercise

Appendix 02

  • We will use the following format: Weekday, Month Day, Year
financial["date_test"] = financial["date"].dt.strftime("%A, %B %d, %Y")
print(financial["date_test"].head())
0    Wednesday, April 04, 2018
1     Thursday, April 05, 2018
2       Friday, April 06, 2018
3       Monday, April 09, 2018
4      Tuesday, April 10, 2018
Name: date_test, dtype: object

Back to exercise

Appendix 03

financial['date_ex3_dt'] = pd.to_datetime(financial["date_ex3"], format = "%d-%b-%y")

week_group = pd.Grouper(key='date_ex3_dt', freq='W')

weekly_data = (financial
               .groupby(week_group) 
               .agg(sp500_std = ("sp500", "std"))
               .reset_index())

plt.plot("date_ex3_dt","sp500_std",
          data = weekly_data)
plt.xlabel("Time")
plt.ylabel("S&P 500")
plt.title("Weekly standard deviation of the stock market performance")
plt.show()

Back to exercise