Lecture 23 - Pivot Tables
20 November, 2024
datetime
modulematplotlib
William Playfair - Export & Import to and from Denmark & Norway from 1700 to 1780 (Commercial and Political Atlas, 1786)
date_str | sp500 | djia | date_ex1 | date_ex2 | date_ex3 | |
---|---|---|---|---|---|---|
0 | 4/4/2018 | 2644.69 | 24264.30 | April 04 2018 | Wednesday, 2018-04-04 | 4-Apr-18 |
1 | 4/5/2018 | 2662.84 | 24505.22 | April 05 2018 | Thursday, 2018-04-05 | 5-Apr-18 |
2 | 4/6/2018 | 2604.47 | 23932.76 | April 06 2018 | Friday, 2018-04-06 | 6-Apr-18 |
3 | 4/9/2018 | 2613.16 | 23979.10 | April 09 2018 | Monday, 2018-04-09 | 9-Apr-18 |
4 | 4/10/2018 | 2656.87 | 24408.00 | April 10 2018 | Tuesday, 2018-04-10 | 10-Apr-18 |
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1305 entries, 0 to 1304
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 date_str 1305 non-null object
1 sp500 1260 non-null float64
2 djia 1260 non-null float64
3 date_ex1 1305 non-null object
4 date_ex2 1305 non-null object
5 date_ex3 1305 non-null object
dtypes: float64(2), object(4)
memory usage: 61.3+ KB
date_str
[ns]
mean in the datetime64[ns]
column type?pd.to_datetime(financial["date"], unit='s')
to convert it to seconds, or unit='m'
to convert it to minutesfinancial["sp500_normalised"] = financial["sp500"] / financial["sp500"].iloc[0]
financial["djia_normalised"] = financial["djia"] / financial["djia"].iloc[0]
# Plot the normalized data
plt.plot("date", "sp500_normalised", data = financial, label = "S&P 500 (Normalised)")
plt.plot("date", "djia_normalised", data = financial, label = "Dow Jones Industrial Average (Normalised)")
plt.xlabel("Time")
plt.ylabel("Normalised Index")
plt.title("S&P 500 vs. Dow Jones Industrial Average (Normalised)")
plt.legend()
plt.show()
djia
is usually higher than the sp500
, so we can simply divide the djia
by 10 to bring the two series closer togetherdiff()
diff()
methodNaN
because there is no previous value to compare it todiff()
method, otherwise the results will be incorrectdate_str | sp500 | djia | date_ex1 | date_ex2 | date_ex3 | date | sp500_normalised | djia_normalised | djia_rescale | diff_sp500 | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 4/4/2018 | 2644.69 | 24264.30 | April 04 2018 | Wednesday, 2018-04-04 | 4-Apr-18 | 2018-04-04 | 1.000000 | 1.000000 | 2426.430 | NaN |
1 | 4/5/2018 | 2662.84 | 24505.22 | April 05 2018 | Thursday, 2018-04-05 | 5-Apr-18 | 2018-04-05 | 1.006863 | 1.009929 | 2450.522 | 18.15 |
2 | 4/6/2018 | 2604.47 | 23932.76 | April 06 2018 | Friday, 2018-04-06 | 6-Apr-18 | 2018-04-06 | 0.984792 | 0.986336 | 2393.276 | -58.37 |
3 | 4/9/2018 | 2613.16 | 23979.10 | April 09 2018 | Monday, 2018-04-09 | 9-Apr-18 | 2018-04-09 | 0.988078 | 0.988246 | 2397.910 | 8.69 |
4 | 4/10/2018 | 2656.87 | 24408.00 | April 10 2018 | Tuesday, 2018-04-10 | 10-Apr-18 | 2018-04-10 | 1.004605 | 1.005922 | 2440.800 | 43.71 |
financial["lag_sp500"] = financial["sp500"].shift(1)
# Convert into percentage
financial["growth_sp500"] = (financial["diff_sp500"]/financial["lag_sp500"]) * 100
financial.head()
date_str | sp500 | djia | date_ex1 | date_ex2 | date_ex3 | date | sp500_normalised | djia_normalised | djia_rescale | diff_sp500 | lag_sp500 | growth_sp500 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 4/4/2018 | 2644.69 | 24264.30 | April 04 2018 | Wednesday, 2018-04-04 | 4-Apr-18 | 2018-04-04 | 1.000000 | 1.000000 | 2426.430 | NaN | NaN | NaN |
1 | 4/5/2018 | 2662.84 | 24505.22 | April 05 2018 | Thursday, 2018-04-05 | 5-Apr-18 | 2018-04-05 | 1.006863 | 1.009929 | 2450.522 | 18.15 | 2644.69 | 0.686281 |
2 | 4/6/2018 | 2604.47 | 23932.76 | April 06 2018 | Friday, 2018-04-06 | 6-Apr-18 | 2018-04-06 | 0.984792 | 0.986336 | 2393.276 | -58.37 | 2662.84 | -2.192021 |
3 | 4/9/2018 | 2613.16 | 23979.10 | April 09 2018 | Monday, 2018-04-09 | 9-Apr-18 | 2018-04-09 | 0.988078 | 0.988246 | 2397.910 | 8.69 | 2604.47 | 0.333657 |
4 | 4/10/2018 | 2656.87 | 24408.00 | April 10 2018 | Tuesday, 2018-04-10 | 10-Apr-18 | 2018-04-10 | 1.004605 | 1.005922 | 2440.800 | 43.71 | 2613.16 | 1.672687 |
diff()
to calculate the difference and shift()
to create the lagged valuesquery()
?date
column has a time format, Python will interpret 2019-01-01
as a date inside the query command.query()
subset_between = financial.query('date.between("2020-03-02","2020-05-02")')
subset_between.iloc[:, :6].head(3)
date_str | sp500 | djia | date_ex1 | date_ex2 | date_ex3 | |
---|---|---|---|---|---|---|
498 | 3/2/2020 | 3090.23 | 26703.32 | March 02 2020 | Monday, 2020-03-02 | 2-Mar-20 |
499 | 3/3/2020 | 3003.37 | 25917.41 | March 03 2020 | Tuesday, 2020-03-03 | 3-Mar-20 |
500 | 3/4/2020 | 3130.12 | 27090.86 | March 04 2020 | Wednesday, 2020-03-04 | 4-Mar-20 |
date_str | sp500 | djia | date_ex1 | date_ex2 | date_ex3 | |
---|---|---|---|---|---|---|
540 | 4/29/2020 | 2939.51 | 24633.86 | April 29 2020 | Wednesday, 2020-04-29 | 29-Apr-20 |
541 | 4/30/2020 | 2912.43 | 24345.72 | April 30 2020 | Thursday, 2020-04-30 | 30-Apr-20 |
542 | 5/1/2020 | 2830.71 | 23723.69 | May 01 2020 | Friday, 2020-05-01 | 1-May-20 |
financial["bool_period"] = financial["date"].between("2020-03-01","2020-05-01")
financial["bool_example"] = financial["growth_sp500"] > 5
# Create a line plot
plt.plot("date", "growth_sp500", data = financial)
plt.xlabel("Time")
plt.ylabel("Daily percentage change ")
plt.title("The S&P 500 during the start of COVID")
# Add a shaded region with a rectangle with "fill_between()"
# "x" is the x-coordinate, "y1" and "y2" are the lower
# and upper bounds of the rectangle, respectively
# We can set this to be the minimum and maximum of the outcome
# We use "where" to test a logical condition
vec_y = financial["growth_sp500"]
plt.fill_between(x = financial["date"],
y1 = vec_y.min(),
y2 = vec_y.max(),
where = financial["bool_period"],
alpha = 0.2, color = "red")
plt.show()
|
to include both conditions)melt()
functionmelt()
function takes the following arguments:
id_vars
: the columns to keep as isvalue_vars
: the columns to meltvar_name
: the name of the new column with the variable namesvalue_name
: the name of the new column with the valuesdate | sp500 | djia | |
---|---|---|---|
0 | 2018-04-04 | 2644.69 | 24264.30 |
1 | 2018-04-05 | 2662.84 | 24505.22 |
2 | 2018-04-06 | 2604.47 | 23932.76 |
3 | 2018-04-09 | 2613.16 | 23979.10 |
4 | 2018-04-10 | 2656.87 | 24408.00 |
pd.pivot()
functionpd.pivot()
function takes the following arguments:
index
: the column to keep as iscolumns
: the column to pivotvalues
: the column with the valuesfinancial_wide = pd.pivot(financial_long,
index = 'date',
columns = 'portfolio_type',
values = 'portfolio_value')
financial_wide.head()
portfolio_type | djia | sp500 |
---|---|---|
date | ||
2018-04-04 | 24264.30 | 2644.69 |
2018-04-05 | 24505.22 | 2662.84 |
2018-04-06 | 23932.76 | 2604.47 |
2018-04-09 | 23979.10 | 2613.16 |
2018-04-10 | 24408.00 | 2656.87 |
.reset_index()
to the end of the previous commandgrowth_sp500
and growth_djia
to long formatfinancial["diff_djia"] = financial["djia"].diff()
financial["lag_djia"] = financial["djia"].shift(1)
financial["growth_djia"] = (financial["diff_djia"] / financial["lag_djia"])*100
# Method 1: Adding multiple lines with "plt.plot()"
# you can make a plot transparent by adding an alpha parameter (between 0 and 1)
plt.plot("date","growth_djia",data= financial)
plt.plot("date","growth_sp500",data= financial,alpha = 0.6)
plt.xlabel("Time")
plt.ylabel("Daily percentage change")
plt.title("Change in the Dow Jones Index")
plt.show()
# Method 2: A single line with ".plot()" but setting the "x-axis"
# with ".set_index()"
financial[["date","growth_sp500","growth_djia"]].set_index("date").plot()
plt.xlabel("Time")
plt.ylabel("Daily percentage change")
plt.title("Changes in the Stock Market Indecies")
plt.legend(["S&P 500", "Dow Jones Industrial Index"])
plt.show()
plt.plot("date","growth_djia",data= financial)
plt.xlabel("Time")
plt.ylabel("Daily percentage change")
plt.title("Change in the Dow Jones Index")
financial["bool_djia"] = (financial["growth_djia"] > 4) | (financial["growth_djia"] < -4)
vec_y = financial["growth_djia"]
plt.fill_between(x = financial["date"],
y1 = vec_y.min(),
y2 = vec_y.max(),
where = financial["bool_djia"],
alpha = 0.7,color="pink")
financial_long = pd.melt(financial,
var_name = "portfolio_type",
value_name = "portfolio_value",
id_vars='date',
value_vars=['growth_sp500','growth_djia'])
financial_long.sort_values(by='date').head()
date | portfolio_type | portfolio_value | |
---|---|---|---|
0 | 2018-04-04 | growth_sp500 | NaN |
1305 | 2018-04-04 | growth_djia | NaN |
1306 | 2018-04-05 | growth_djia | 0.992899 |
1 | 2018-04-05 | growth_sp500 | 0.686281 |
2 | 2018-04-06 | growth_sp500 | -2.192021 |