1Fordham University, 2University of British Columbia, 3University of Cambridge, 4Federal Reserve Bank of Chicago, 5Georgia Institute of Technology, 6Forecasting Research Institute
The Forecasting Proficiency Test (FPT; Himmelstein et al., 2024) was developed to measure forecasting proficiency. It uses quantile forecast items:
- Quantile forecast items are designed to elicit an individual's subjective cumulative distribution function (CDF) for a future continuous outcome.
- Each individual provides 5 monotonically increasing responses.
- Responses are unbounded.
- Forecast accuracy is the measure of interest.
GOAL: in the spirit of item response theory (IRT), model forecast accuracy with a statistical model that accounts for both person and item features.
Responses to FPT quantile forecast items are on very different scales (e.g., dollars/gallon, thousands of dollars, percentages, …). We define the outcome measure, historically scaled accuracy, as
\[ Y_i = \frac{\hat{Y}_i - Y_{\mathrm{res},i}}{SD_{Y_{\mathrm{hist},i}}} \]
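\(Y_i\) is the distance of the forecast \(\hat{Y}_i\) from the resolution \(Y_{\mathrm{res},i}\), expressed in units of the historical standard deviation \(SD_{Y_{\mathrm{hist},i}}\).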
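As a minimal sketch of the scaling above, assuming the historical standard deviation of each item is already available. The function name and all numbers are hypothetical illustrations, not FPT data.

```python
def scaled_accuracy(forecast, resolution, hist_sd):
    """Signed distance between a forecast and the resolution, in historical SD units."""
    if hist_sd <= 0:
        raise ValueError("hist_sd must be positive")
    return (forecast - resolution) / hist_sd


# Two hypothetical items on very different raw scales.
gas = scaled_accuracy(forecast=3.5, resolution=3.0, hist_sd=0.25)  # dollars/gallon
income = scaled_accuracy(forecast=52000.0, resolution=50000.0, hist_sd=4000.0)  # dollars

print(gas, income)
```

After scaling, both values are in SD units, so errors on differently scaled items can be compared directly.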
We model \(Y_{jiq}\), the accuracy of person \(j\) on item \(i\) at quantile \(q\).
\[
\begin{aligned}
Y_{jiq} &\sim \text{Student-}t(\mu_{iq}, \sigma_{ji}, \mathrm{df}_i) \\
\mu_{iq} &= b_i + Q_q \times d_i \\
\sigma_{ji} &= \frac{\sigma_i}{\exp(a_i \times \theta_j)}
\end{aligned}
\]
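The sketch below evaluates the location and scale terms of the model for one item, to show how \(\theta_j\) enters only through the scale. The coding of \(Q_q\) is not given here, so the values in `Q` are an assumption, as are the item parameter values and function names.

```python
import math


def location(b_i, d_i, Q_q):
    """Expected scaled accuracy at quantile q: mu_iq = b_i + Q_q * d_i."""
    return b_i + Q_q * d_i


def scale(sigma_i, a_i, theta_j):
    """Person-specific scale: sigma_ji = sigma_i / exp(a_i * theta_j)."""
    return sigma_i / math.exp(a_i * theta_j)


# ASSUMPTION: a centered integer coding for the 5 quantiles (hypothetical).
Q = [-2, -1, 0, 1, 2]
# Hypothetical item parameters, chosen only for illustration.
item = {"b": 0.1, "d": 0.5, "sigma": 1.0, "a": 0.8}

mus = [location(item["b"], item["d"], q) for q in Q]
scales = {theta: scale(item["sigma"], item["a"], theta) for theta in (-1.0, 0.0, 1.0)}

print(mus)
print(scales)
```

With \(a_i > 0\), a higher \(\theta_j\) shrinks the scale, so the responses of a more proficient forecaster are expected to fall closer to \(\mu_{iq}\).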
Item forecasts were collected in five waves of a seven-wave study.
All models were estimated in PyMC (Abril-Pla et al., 2023) using Markov chain Monte Carlo (MCMC) sampling (warmup = 1000, draws = 5000, approximately 40 minutes). All \(\hat{R}\) values were \(\leq 1.01\).
Distribution of \(\theta\) for the 1194 forecasters (better forecasters have higher \(\theta\) values).
Forecasters whose responses are consistently close to the expected forecasts are rewarded with higher \(\theta\) values.
One advantage of the \(\theta\) metric is that it allows for the calculation of expected item information, \(\mathrm{EI}(\theta)\) :
\[
\begin{aligned}
Y_{jiq} &\sim \text{Student-}t(\mu_{iq}, \sigma_{ji}, \mathrm{df}_i) \\
\mu_{iq} &= b_i + Q_q \times d_i \\
\sigma_{ji} &= \frac{\sigma_i}{\exp(a_i \times \theta_j)}
\end{aligned}
\]
Given the complexity of the FPT items, item parameters are likely to change depending on many factors. Still, stability appears reasonable over the month separating Wave 1 and Wave 7 (test-retest):
Negative log-likelihood function of \(\theta\) given item parameters and participant response:
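A minimal sketch of such a function, built directly from the Student-t model with item parameters treated as known. The function names, the coding of `Q`, the item values, and the responses are all hypothetical. The grid search is a simple stand-in for illustration, not necessarily the optimizer used for the FPT.

```python
import math


def student_t_logpdf(y, mu, sigma, df):
    """Log density of a location-scale Student-t distribution."""
    z = (y - mu) / sigma
    return (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
            - 0.5 * math.log(df * math.pi) - math.log(sigma)
            - (df + 1) / 2 * math.log1p(z * z / df))


def neg_log_lik(theta, responses, items, Q):
    """Negative log-likelihood of theta; responses[i][q] is one person's scaled accuracy."""
    total = 0.0
    for y_i, item in zip(responses, items):
        sigma = item["sigma"] / math.exp(item["a"] * theta)  # sigma_ji
        for y, q in zip(y_i, Q):
            mu = item["b"] + q * item["d"]  # mu_iq
            total -= student_t_logpdf(y, mu, sigma, item["df"])
    return total


def estimate_theta(responses, items, Q, lo=-4.0, hi=4.0, steps=801):
    """Grid-search minimizer of the negative log-likelihood (illustrative only)."""
    grid = [lo + k * (hi - lo) / (steps - 1) for k in range(steps)]
    return min(grid, key=lambda t: neg_log_lik(t, responses, items, Q))


# ASSUMPTIONS: hypothetical quantile coding and item parameters.
Q = [-2, -1, 0, 1, 2]
items = [{"b": 0.0, "d": 0.5, "sigma": 1.0, "a": 1.0, "df": 5.0} for _ in range(3)]
expected = [[it["b"] + q * it["d"] for q in Q] for it in items]

# Two hypothetical forecasters: responses 1 SD and 2 SD away from the expected forecasts.
near = [[m + 1.0 for m in row] for row in expected]
far = [[m + 2.0 for m in row] for row in expected]

theta_near = estimate_theta(near, items, Q)
theta_far = estimate_theta(far, items, Q)
print(theta_near, theta_far)
```

In this toy setup, the forecaster whose responses sit farther from the expected forecasts receives the lower \(\theta\) estimate, matching the interpretation that better forecasters have higher \(\theta\) values.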
| | a | b | d | df | sigma |
|---|---|---|---|---|---|
| a | 1.00 | -0.17 | 0.00 | -0.57 | -0.04 |
| b | -0.17 | 1.00 | -0.33 | -0.18 | -0.51 |
| d | 0.00 | -0.33 | 1.00 | 0.35 | 0.86 |
| df | -0.57 | -0.18 | 0.35 | 1.00 | 0.52 |
| sigma | -0.04 | -0.51 | 0.86 | 0.52 | 1.00 |
IMPS 2025