Load Forecasting during the COVID Period
Introduction
The main difficulty in this project is that the modeling and the expected predictions take place during a very peculiar period: the COVID crisis. No historical data could have helped encapsulate the sudden variations in electricity load we faced. We thus divided our approach into two phases: the first was to learn from historical data in “regular periods”, thus modeling the usual electricity load; once that was done, we explored various techniques (online learning, ARIMA, expert aggregation, adaptive models…) to readjust the model to the abrupt change in trend observed in these extraordinary conditions.
In this project, the objective has been to get the best performance through a good expert aggregation. To do so, we have tried to diversify our experts as much as possible.
We have used diverse datasets to construct our models, drawing on methods such as Random Forests where each expert is built not on the whole dataset but on subsets of it. We have also tried rolling-window methods with different window sizes, differently weighted training examples, and so on. Hence, we created training diversity through different partitionings and subsets.
Moreover, we have diversified our loss functions. Instead of only using the L2 loss to train the methods, as may seem intuitive (since in the end our objective is to minimize this cost function over the test set), we have used the quantile loss (for example through qgam). To choose the quantile at which to minimize this loss, we observed that during the first weeks of COVID the load dropped sharply. As our test set is focused on the COVID period, we tried to minimize the quantile loss at quantiles between 0.3 and 0.5 to improve our performance. Furthermore, we also modified the weights inside the loss function to give more importance to selected instances, which can be seen as a change of loss. Finally, we also introduced an L1 loss for variable selection.
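For reference, the quantile (pinball) loss at level \(\alpha\) that these methods minimize can be written as
\[
\ell_\alpha(y, \hat{y}) \;=\; \left(\alpha - \mathbb{1}_{\{y < \hat{y}\}}\right)\,(y - \hat{y}),
\]
so that with \(\alpha = 0.3\), over-prediction costs 0.7 per unit against 0.3 for under-prediction, pulling the fit towards the lower load levels observed during COVID.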
Finally, we also combined multiple methods to obtain algorithmic variety. With all these methods we tried to diversify the explanatory variables, as is done in Random Forests. Each model may therefore add some information to the global aggregation.
I. Feature Engineering
Before any model conception, we implemented some basic but necessary transformations on the train set and the test set. Indeed, the features were of different kinds and we needed to clean the datasets in order to build efficient models.
The main transformations concerned categorical variables: BH, DLS, Christmas_break, Summer_break and WeekDays were recoded as factors. Regarding WeekDays, we considered two formatting perspectives on top of the initial one:
- recode a categorical variable WeekDays2 regrouping ‘Tuesday’, ‘Wednesday’ and ‘Thursday’ into a single factor level ‘Workday’
- mix WeekDays2 and BH: five factor levels for ‘Monday’, ‘Workday’, ‘Friday’, ‘Saturday’, ‘Sunday’ and one factor level ‘BH’ taking priority over the levels of WeekDays2
However, model performance led us to retain only the original WeekDays as a factor, without any simplification. This choice also allowed us to train daily models (cf. II.3).
One last transformation was applied to the initial datasets, recoding the variable Date into a numeric version called Time.
Finally, we prepared the division of the train set into two independent subsets, sel_a and sel_b, which are used to distinguish the COVID period within the train set.
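A minimal sketch of these transformations, assuming Data0 and Data1 are the train and test sets; the COVID cutoff date is our assumption, to be set where the rupture occurs:

for (col in c("BH", "DLS", "Christmas_break", "Summer_break", "WeekDays")) {
  Data0[[col]] <- as.factor(Data0[[col]])
  Data1[[col]] <- as.factor(Data1[[col]])
}
# WeekDays2: merge 'Tuesday', 'Wednesday' and 'Thursday' into a single 'Workday' level
Data0$WeekDays2 <- Data0$WeekDays
levels(Data0$WeekDays2)[levels(Data0$WeekDays2) %in%
                          c("Tuesday", "Wednesday", "Thursday")] <- "Workday"
# Numeric version of Date
Data0$Time <- as.numeric(Data0$Date)
# sel_a / sel_b: pre-COVID and COVID subsets of the train set
sel_a <- which(Data0$Date < as.Date("2020-03-01"))
sel_b <- which(Data0$Date >= as.Date("2020-03-01"))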
II. Models
A recurrent idea throughout our approaches is the rolling window for model training.
Models trained on large datasets are usually better, as they can extract more information and avoid overfitting. In contrast, when dealing with time series, we should adapt to changes as fast as possible. If we train a model on a large dataset in which only the latest data underwent a trend change, the algorithm will treat this latest data as outliers, since it is not representative of the rest of the data. That is the case for our dataset, where COVID clearly affected the load trend. Therefore, to be able to learn from this information, we added a parameter reducing the training dataset to a window small enough to cover this period. Moreover, this mechanism allowed us to introduce variables specific to this period, such as the Government Response Index, further diversifying the models for the aggregation. A sketch of such a rolling-window refit is given below.
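Here rolling_window_fit is a hypothetical helper and the window width an illustrative choice:

# Refit a model on a trailing window of `width` observations only, so that
# post-COVID data dominates the fit instead of being treated as outliers.
rolling_window_fit <- function(data, width, fit_fun) {
  n <- nrow(data)
  fit_fun(data[max(1, n - width + 1):n, ])
}
# e.g. a linear model refitted on the last 60 days:
fit <- rolling_window_fit(Data0, width = 60,
                          fit_fun = function(d) lm(Load ~ Temp + WeekDays, data = d))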
1. Linear model with adaptive ARIMA correction
For each of these approaches, the models are trained on the subset of the train set anterior to COVID. The idea is to build models efficiently capturing the load variation throughout the year, independently of any COVID effect.
We then adjust ARIMA models on the obtained residuals in order to encapsulate the previously ignored COVID effect.
We do not observe any obvious trend: the idea of training only on the observations anterior to 2020 seems reasonable.
a. AIC penalty on polynomial combinations
On the one hand, we decided to combine all the variables by multiplying each variable with the others to make the model more flexible. On the other hand, this combination made the model extremely overparameterized. In order to select the variables to use in this linear model, we applied stepAIC to minimize the AIC criterion, a penalized criterion designed to avoid overfitting.
# lm_AIC <- linear_model_AIC(Data0[sel_a,])
# The formula below was retained by stepAIC over all pairwise interactions:
lm_AIC <- lm(Load ~ Load.1 + Load.7 + Temp + Temp_s95 + Temp_s95_max + Temp_s99_max +
toy + WeekDays + BH + DLS + Summer_break + Christmas_break +
Time + Load.1:Temp + Load.1:Temp_s95 + Load.1:Temp_s95_max +
Load.1:Temp_s99_max + Load.1:toy + Load.1:WeekDays + Load.1:DLS +
Load.1:Summer_break + Load.1:Christmas_break + Load.7:Temp_s95_max +
Load.7:Temp_s99_max + Load.7:DLS + Load.7:Summer_break +
Load.7:Christmas_break + Temp:Temp_s95 + Temp:Temp_s99_max +
Temp:WeekDays + Temp:DLS + Temp:Christmas_break +
Temp_s95:Temp_s95_max + Temp_s95:Temp_s99_max +
Temp_s95:toy + Temp_s95:WeekDays +
Temp_s95:DLS + Temp_s95:Summer_break + Temp_s95:Christmas_break +
Temp_s95:Time + Temp_s95_max:Temp_s99_max + Temp_s95_max:toy +
Temp_s95_max:Christmas_break + Temp_s95_max:Time +
Temp_s99_max:toy +Temp_s99_max:WeekDays + Temp_s99_max:BH +
Temp_s99_max:DLS + toy:WeekDays + toy:BH + toy:DLS +
toy:Summer_break + toy:Christmas_break +
WeekDays:BH + WeekDays:DLS + WeekDays:Summer_break +
WeekDays:Christmas_break +
WeekDays:Time + BH:DLS + BH:Summer_break + BH:Christmas_break +
BH:Time + Summer_break:Time, data= Data0[sel_a,])
summary(lm_AIC)$adj.r.squared
## [1] 0.9949275
##
## studentized Breusch-Pagan test
##
## data: lm_AIC
## BP = 570.22, df = 118, p-value < 2.2e-16
After fitting this linear model, we noticed that the homoscedasticity hypothesis on the residuals does not hold, which is expected since we are dealing with a time series; the heteroscedasticity is confirmed by the studentized Breusch-Pagan test on the fitted values reported above. To account for it, we introduced an ARIMA correction on the residuals.
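The test reported above can be reproduced along these lines (assuming the lmtest package):

library(lmtest)
bptest(lm_AIC)  # studentized Breusch-Pagan test on the fitted linear model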
plot(Data0[sel_a,]$Date, lm_cvpred(Data0[sel_a,],eq = eq_lm_AIC)-Data0[sel_a,]$Load, xlab = "Date", ylab = "Fitted residuals")
plot(Data0[sel_b,]$Date, predict(lm_AIC, newdata = Data0[sel_b,])-Data0[sel_b,]$Load, xlab = "Date", ylab = "Covid residuals")
We clearly see a rupture in the residuals around March 2020, corresponding to the beginning of the COVID crisis and the first lockdown.
We tried to find out whether the residuals of the model exhibited any seasonality or trend. To do so, we plotted the empirical autocorrelations and partial autocorrelations, and we also tried differencing to correct for seasonality and trend. Unfortunately, all the autoregressive and moving-average specifications suggested by our reading of these plots were not as good as the ARIMAs computed by auto.arima through AIC minimization. Therefore, we simply let the method learn from auto.arima, as sketched after the diagnostics below.
covid_residuals = ts(predict(lm_AIC, newdata = Data0[sel_b,])-Data0[sel_b,]$Load,
frequency = 7)
par(mfrow=c(2,2))
acf(covid_residuals)
pacf(covid_residuals)
acf(diff(covid_residuals))
pacf(diff(covid_residuals))
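A sketch of the adaptive ARIMA correction with the forecast package; the one-step update shown here is illustrative of the daily refitting we perform:

library(forecast)
arima_fit <- auto.arima(covid_residuals, ic = "aic")  # order selected by AIC minimization
# One-day-ahead correction: since covid_residuals = prediction - observed load,
# we subtract the forecast residual from the linear-model prediction.
next_residual <- as.numeric(forecast(arima_fit, h = 1)$mean)
corrected_pred <- predict(lm_AIC, newdata = Data1[1, ]) - next_residual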
b. AIC penalty on polynomial combinations with forgetting factor for adaptive model selection
We noticed that the linear model's AIC combines a negative log-likelihood (under the Gaussian assumption) with a penalty on the number of variables. This log-likelihood gives the same weight to all observations. Nevertheless, as we are working with time series, we should adapt the model to new observations as fast as possible: the latest observations should weigh more than the earliest ones. To integrate this idea into our modeling, we introduced a sigmoid weighting, which gives more importance to the latest observations. This also tells us more about the variables, since the set of selected variables can be updated over time.
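A minimal sketch of this weighting, where the steepness k and the midpoint are hypothetical tuning choices:

n <- length(sel_a)
k <- 0.01                                    # steepness of the sigmoid
sigmoid_weights <- 1 / (1 + exp(-k * (seq_len(n) - n / 2)))  # near 0 for old rows, near 1 for recent ones
adaptive_lm <- lm(eq_lm_AIC, data = Data0[sel_a, ], weights = sigmoid_weights)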
c. AIC penalty on polynomial combinations for weighted linear model
We used once more the linear model selected with the AIC penalty, and retrained it as a weighted linear model attributing more weight to the observations with lower variance.
# Inverse-variance weights, estimated by regressing |residuals| on the fitted values
wt = 1 / lm(abs(lm_AIC$residuals) ~ lm_AIC$fitted.values)$fitted.values^2
weighted_lm_AIC <- lm(eq_lm_AIC, Data0[sel_a,], weights = wt)
summary(weighted_lm_AIC)$adj.r.squared
## [1] 0.9942834
Considering this strategy, we could also imagine a weighted regression where more weight is attributed to the training examples with the largest prediction error, in order to force a correction on the model.
This idea will be pursued with weighted GAMs and weighted Boosting models.
2. Monthly model with adaptive ARIMA correction
Pursuing the idea of leveraging dataset diversity to improve the final expert aggregation, we trained linear models on specific time windows.
The monthly model is a combination of 12 different linear models, each one fitted specifically on one month. The prediction is then made by calling the model proper to the month considered, as sketched below.
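A sketch of the dispatch mechanism; the per-month formula shown here is illustrative, not the exact one we retained:

monthly_models <- lapply(1:12, function(m) {
  lm(Load ~ Load.1 + Load.7 + Temp + WeekDays, data = subset(Data0[sel_a, ], Month == m))
})
predict_monthly <- function(newdata) {
  sapply(seq_len(nrow(newdata)), function(i) {
    predict(monthly_models[[newdata$Month[i]]], newdata = newdata[i, ])  # call the proper model
  })
}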
3. Daily model with adaptive ARIMA correction
The same idea of time-specific model as previously detailed for monthly models has been applied here, training 7 different models, one for each weekday.
4. GAM model with adaptive ARIMA correction
We use a double-penalty approach when fitting our GAMs, allowing us to penalize functions in both the null and range spaces. The number of knots in the GAMs was tuned by inspecting the effective degrees of freedom with the command gam.check. The spline bases were compared using cross-validation on the subset of the training set prior to COVID.
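A minimal sketch of the double-penalty approach in mgcv: select = TRUE adds a second penalty on the null space of each smooth, so that useless terms can be shrunk away entirely. The smooths and knot numbers below are illustrative, to be checked with gam.check:

library(mgcv)
gam_fit <- gam(Load ~ s(Temp, k = 10) + s(toy, k = 30) + s(Load.1) + WeekDays,
               data = Data0[sel_a, ], select = TRUE)   # double penalty
gam.check(gam_fit)   # compare effective degrees of freedom against k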
a. Tuned GAM with Random Forest for variable selection
Still following the idea of diversifying our experts, we tried to diversify the explanatory variables by tuning a GAM whose variable selection is based on the order of importance obtained from Random Forest constructions.
variable_importance <- rf_variable_selection(dataset = Data0[sel_a,],
var_allowed = 13,
trees = 200)
variable_importance
## Load.1 Temp_s99_max Load.7
## 43338390.21 16627189.86 10504929.38
## WeekDays Temp_s99 Temp_s95_max
## 6243034.42 5453948.53 4372147.61
## Temp_s95 Temp_s99_min Temp
## 3821492.52 2452084.17 2040508.87
## toy BH Temp_s95_min
## 1114337.52 1000584.52 975020.42
## Summer_break Month DLS
## 221705.69 206722.59 179890.53
## Time Christmas_break Date
## 60312.15 50526.52 44532.51
## Year GovernmentResponseIndex
## 13982.81 0.00
We used the anova command to perform Chi-squared tests on nested models, looking for interactions between features.
## [1] NA 0.01150633
## [1] NA 1.202322e-11
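The pairs above read as [NA, p-value], anova() returning NA for the reference model. A sketch of such a comparison, with illustrative formulas:

gam_small <- gam(Load ~ s(Temp) + s(Load.1), data = Data0[sel_a, ])
gam_large <- gam(Load ~ s(Temp) + s(Load.1) + ti(Temp, Load.1), data = Data0[sel_a, ])
anova(gam_small, gam_large, test = "Chisq")   # is the interaction term worth keeping?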
# tuned_gam_model <- gam_tuned_model(dataset = Data0[sel_a,])
# saveRDS(tuned_gam_model, "/Users/halvardbariller/Desktop/M1 MATH&IA/S2/Predictive modelling/Saved_data/tuned_gam_model.RDS")
tuned_gam_model <- readRDS(file = "/Users/halvardbariller/Desktop/M1 MATH&IA/S2/Predictive modelling/Saved_data/tuned_gam_model.RDS")
#gam.check(tuned_gam_model)
sqrt(tuned_gam_model$gcv.ubre)
## GCV.Cp
## 874.699
b. GAM with linear model features
In this scenario, we fitted a GAM using the features obtained with the AIC-penalized linear model. We also added an argument allowing us to consider exponentially decreasing weights on the observations.
# gam_linear_model <- gam_lm_model(dataset = Data0[sel_a,], weighted = T)
# saveRDS(gam_linear_model, "/Users/halvardbariller/Desktop/M1 MATH&IA/S2/Predictive modelling/Saved_data/gam_linear_model.RDS")
gam_linear_model <- readRDS(file = "/Users/halvardbariller/Desktop/M1 MATH&IA/S2/Predictive modelling/Saved_data/gam_linear_model.RDS")
#gam.check(gam_linear_model)
sqrt(gam_linear_model$gcv.ubre)
## GCV.Cp
## 1766.294
The GCV/UBRE score is less encouraging than that of the previous GAM but, just as for the linear model, the ARIMA correction worked very well on this model.
5. QGAM model with adaptive ARIMA correction
# naive_qgam_model <- naive_qgam_model(dataset = Data0[sel_a,],
# quantile = 0.3)
#
# saveRDS(naive_qgam_model, "/Users/halvardbariller/Desktop/M1 MATH&IA/S2/Predictive modelling/Saved_data/naive_qgam_model.RDS")
naive_qgam_model <- readRDS(file = "/Users/halvardbariller/Desktop/M1 MATH&IA/S2/Predictive modelling/Saved_data/naive_qgam_model.RDS")
#plot(naive_qgam_model, scale = F, page = 1)
#check(naive_qgam_model)
# check(naive_qgam_model$calibr)
residus_quantile_naive <- naive_qgam_residual_variance(dataset = Data0[sel_a,],
model = naive_qgam_model)
When fitting this kind of model, we assume homoscedastic residuals. However, plotting boxplots of the residuals grouped by temperature intervals shows that the variance is not constant. This heteroscedasticity forces us to readjust the QGAM: to handle it, we add a parameter allowing the learning rate to vary with Temperature, so as to absorb these variance jumps.
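A sketch of this refinement with the qgam package: passing a list of two formulas lets the log learning rate vary smoothly with Temp, which absorbs the variance jumps across temperature intervals. The smooth terms are illustrative:

library(qgam)
refined_qgam <- qgam(list(Load ~ s(Temp) + s(toy) + s(Load.1) + WeekDays,
                          ~ s(Temp)),        # learning rate as a function of Temp
                     data = Data0[sel_a, ], qu = 0.3)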
# temp_refined_qgam_model <- refined_qgam_model(dataset = Data0[sel_a,], quantile = 0.3)
#
# saveRDS(temp_refined_qgam_model, "/Users/halvardbariller/Desktop/M1 MATH&IA/S2/Predictive modelling/Saved_data/temp_refined_qgam_model.RDS")
temp_refined_qgam_model <- readRDS(file = "/Users/halvardbariller/Desktop/M1 MATH&IA/S2/Predictive modelling/Saved_data/temp_refined_qgam_model.RDS")
#plot(temp_refined_qgam_model, scale = F, page = 1)
#check(temp_refined_qgam_model)
# check(temp_refined_qgam_model$calibr)
pred <- predict(temp_refined_qgam_model, newdata = Data0[sel_b,], se=TRUE)
plot(Data0[sel_b,]$Time, Data0[sel_b,]$Load)
lines(Data0[sel_b,]$Time, pred$fit, lwd = 1)
lines(Data0[sel_b,]$Time, pred$fit + 2*pred$se.fit, lwd = 1, col = 2)
lines(Data0[sel_b,]$Time, pred$fit - 2*pred$se.fit, lwd = 1, col = 2)
As already explained when choosing the quantile to minimize, the load shrank sharply with COVID. Thus, even though we chose a quantile between 0.3 and 0.5, the step at the start of COVID remains clearly visible. Hence, we again have to train another model to learn from the residuals.
6. Adaptive Random Forest model
Applying the same rolling-window idea, we construct a random forest able to incorporate the variables that matter most during COVID.
As the plot shows, prediction accuracy drops sharply when COVID appears, but with this method, and by including this information, we are able to adapt the model and shrink the OOB error. A sketch is given below.
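A sketch of this adaptive forest with the randomForest package; the window size is a hypothetical tuning choice:

library(randomForest)
window <- tail(seq_len(nrow(Data0)), 365)    # trailing window covering the COVID regime
adaptive_rf <- randomForest(Load ~ Load.1 + Load.7 + Temp + WeekDays + toy +
                              GovernmentResponseIndex,
                            data = Data0[window, ], ntree = 500)
tail(adaptive_rf$mse, 1)                     # OOB error after the last tree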
7. Adaptive Boosting model
a. GBM algorithms
We used Generalized Boosted Model (GBM) algorithms, for which we tested several distributional assumptions affecting the loss function. The GB models use CART trees as weak learners.
Distribution: this determines the loss function. As our objective is to diversify the models before aggregating them at the end, we tried multiple losses: the squared error (“gaussian”), the absolute loss (“laplace”) and the quantile loss at level \(\alpha = 0.3\).
Weights: we decided to change the weights of the training examples so as to give slightly more importance to the COVID examples. To do so, we used the Government Response Index as reference.
Number of trees: we used a small learning rate (shrinkage of 0.05) and increased the number of iterations to obtain better results. To decide the number of trees (which can be seen as the number of boosting iterations), we relied on the OOB error. A sketch follows.
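A sketch of one of these experts with the gbm package; the weighting scheme built from the Government Response Index is our assumption:

library(gbm)
w <- 1 + Data0$GovernmentResponseIndex / 100   # up-weight COVID observations
laplacian_gbm <- gbm(Load ~ Load.1 + Load.7 + Temp + WeekDays + toy,
                     data = Data0, distribution = "laplace",   # absolute loss
                     weights = w, n.trees = 3000, shrinkage = 0.05)
best_iter <- gbm.perf(laplacian_gbm, method = "OOB")   # number of trees chosen by OOB error
# For the quantile expert: distribution = list(name = "quantile", alpha = 0.3)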
b. XGBoost algorithms
We used the XGBoost package to try to obtain better performance, as well as to try a linear model as weak learner.
To decide the number of iterations, we used cross-validation with an early-stopping criterion triggered after 20 iterations without improvement, as sketched below.
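A sketch with the xgboost package, using the linear booster as weak learner; the feature-matrix construction is illustrative:

library(xgboost)
X <- model.matrix(Load ~ . - Date, data = Data0)[, -1]   # dummy-code factors, drop intercept
dtrain <- xgb.DMatrix(X, label = Data0$Load)
params <- list(booster = "gblinear", objective = "reg:squarederror")
cv <- xgb.cv(params = params, data = dtrain, nrounds = 1000, nfold = 5,
             early_stopping_rounds = 20, verbose = 0)   # stop after 20 rounds without gain
xgb_linear <- xgb.train(params = params, data = dtrain, nrounds = cv$best_iteration)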
III. Experts’ aggregation
Strongest learners’ aggregation
Our first expert-aggregation submission used only the two strongest learners we had initially: the best models obtained among the linear and generalized additive models. We tested several aggregation rules, and the results were very encouraging.
strongest_experts <- cbind(lm_AIC.forecast,
tuned_gam_model.forecast)
agg_OGD_strong <- mixture(Y = Data1$Load.1[2:275], experts = strongest_experts[1:274,], model = "OGD", loss.gradient=T)
agg_OGD_strong
## Aggregation rule: OGD
## Loss function: squareloss
## Gradient trick: TRUE
## Coefficients:
## lm_AIC.forecast tuned_gam_model.forecast
## 0.507 0.493
##
## Number of experts: 2
## Number of observations: 274
## Dimension of the data: 1
##
## rmse mape
## OGD 501 0.00686
## Uniform 501 0.00686
agg_MLewa_strong <- mixture(Y = Data1$Load.1[2:275], experts = strongest_experts[1:274,], model = "MLewa", loss.gradient = T)
agg_MLewa_strong
## Aggregation rule: MLewa
## Loss function: squareloss
## Gradient trick: TRUE
## Coefficients:
## lm_AIC.forecast tuned_gam_model.forecast
## 0.709 0.291
##
## Number of experts: 2
## Number of observations: 274
## Dimension of the data: 1
##
## rmse mape
## MLewa 516 0.00713
## Uniform 501 0.00686
Exhaustive experts’ aggregation
The second and last expert-aggregation submission used all of the predictors built through our research process. Once again, we tested several aggregation rules, and the results improved further.
experts_exhaustive <- cbind(lm_AIC.forecast,
# adaptative_lm_AIC.forecast,
weighted_lm_AIC.forecast,
monthly_model.forecast,
daily_model.forecast,
tuned_gam_model.forecast,
gam_linear_model.forecast,
temp_refined_qgam_model.forecast,
adaptative_rf_model.forecast,
gaussian_gbm_model.forecast,
laplacian_gbm_model.forecast,
quantile_gbm_model.forecast,
xgb_tree_model.forecast,
xgb_linear_model.forecast
)
agg_FTRL_exhaustive <- mixture(Y = Data1$Load.1[2:275], experts = experts_exhaustive[1:274,], model = "FTRL", loss.gradient=T)
plot(agg_FTRL_exhaustive)
## Aggregation rule: FTRL
## Loss function: squareloss
## Gradient trick: TRUE
## Coefficients:
## lm_AIC.forecast weighted_lm_AIC.forecast monthly_model.forecast
## 0.0776 0.0773 0.0777
## daily_model.forecast tuned_gam_model.forecast gam_linear_model.forecast
## 0.0781 0.0772 0.0771
## temp_refined_qgam_model.forecast adaptative_rf_model.forecast
## 0.0774 0.0766
## gaussian_gbm_model.forecast laplacian_gbm_model.forecast
## 0.0761 0.0766
## quantile_gbm_model.forecast xgb_tree_model.forecast xgb_linear_model.forecast
## 0.0768 0.0757 0.0757
##
## Number of experts: 13
## Number of observations: 274
## Dimension of the data: 1
##
## rmse mape
## FTRL 450 0.00622
## Uniform 451 0.00622
agg_MLprod_exhaustive <- mixture(Y = Data1$Load.1[2:275], experts = experts_exhaustive[1:274,], model = "MLprod", loss.gradient=T)
plot(agg_MLprod_exhaustive)
## Aggregation rule: MLprod
## Loss function: squareloss
## Gradient trick: TRUE
## Coefficients:
## lm_AIC.forecast weighted_lm_AIC.forecast monthly_model.forecast
## 0.134 0.091 0.0782
## daily_model.forecast tuned_gam_model.forecast gam_linear_model.forecast
## 0.206 0.0506 0.169
## temp_refined_qgam_model.forecast adaptative_rf_model.forecast
## 0.0504 0.0369
## gaussian_gbm_model.forecast laplacian_gbm_model.forecast
## 0.0273 0.0542
## quantile_gbm_model.forecast xgb_tree_model.forecast xgb_linear_model.forecast
## 0.0773 0.0126 0.0126
##
## Number of experts: 13
## Number of observations: 274
## Dimension of the data: 1
##
## rmse mape
## MLprod 460 0.00611
## Uniform 451 0.00622
Findings
To conclude, we approached this problem from as many different angles as we could, relying strongly on the belief that the only way to properly address this modeling problem was to efficiently capture the variations over time.
Indeed, far from simple seasonal cycles, the COVID period strongly interfered with the electricity load, and previously learned models therefore needed to be readjusted.
Our first breakthrough in terms of performance came when we started using the adaptive ARIMA correction, updating the ARIMA model as days went by. This proved so efficient that almost all of our predictors are corrected with this function in this notebook.
When we realised how important online learning would be for this problem, we quickly began to wonder how to use experts’ advice to improve our forecasts. Building on that, we tried different models, different loss functions and different techniques so as to explain as much variance as we could, not with one model but with more than ten.
Eventually, this proved efficient considering the results we obtained. Moreover, our approach has been fruitful: we observe a nearly uniform distribution of the weights among the 13 experts, showing how much model diversity mattered.
Discussion
During our research process, we considered many techniques and ideas to model the data as well as we could. Unfortunately, we did not manage to carry each of them through, and hence mention some of them here for further consideration.
Regarding feature engineering, additional data was collected from Google’s France mobility reports during COVID. This data was reformatted and missing values were imputed by random forests. However, we faced an unidentifiable model: we had roughly 75 features for only 60 training examples. We tried to perform a LASSO regression with the glmnet library so as to solve this dimensionality problem, but the findings were not conclusive enough to appear here.
Secondly, we mentioned tuning a GAM based on variable selection obtained with Random Forests. Another idea for this kind of explanatory-variable selection would be to select based on the order of importance when performing PCA. Moreover, the problem with fitting a GAM over all the variables, apart from the computational cost, is the risk of overfitting. We could instead reduce the dimension of the explanatory variables to capture the most important information and relations between variables, and fit a GAM over these main PCA dimensions.
Thirdly, for computational reasons, we did not apply a Kalman filter to the tuned GAM, the latter being very heavy. It would nevertheless be interesting to assess the relevance of this technique.