May 2026
A widely-cited synthetic benchmark for “temporal privacy decay”
reports test R² = 0.998 — a value that, taken at face value, suggests
the rate at which personal data ages out of sensitivity is sharply
learnable from a population of users. We re-run that benchmark on three
real datasets — GDPR fines (n = 212), HIPAA breaches (n = 1,632), and
Microsoft GeoLife GPS trajectories (n = 48,406 records across 100 users)
— and report three findings that complicate the synthetic claim.
(i) The synthetic R² is a self-fulfilling artifact of using
age_days to generate the labels. On real regulatory data,
the same pipeline scores R² = −0.10 (GDPR) and R² = −0.18 (HIPAA), worse
than predicting the mean. (ii) On GeoLife mobility data,
leave-one-user-out cross-validation shows that the spatial
component of a pooled cross-user model transfers well at scale (median
R² = +0.85 across 98 held-out users), but the temporal
component does not — age-only LOUO is worse than chance, and a
shuffle-age control confirms the cross-user model extracts zero temporal
signal from age_days. (iii) A small minority of
held-out users produce R² values of order −10²⁵ that initially appear to
be spatial-extrapolation failures. Direct testing with convex-hull
membership and nearest-neighbour distance rejects this: those users sit
inside the training-pool hull. What predicts their R² is
per-user target variance — the worst-case user’s
log(k-anonymity) is exactly constant, making R²
mathematically divergent regardless of absolute prediction error. We
conclude that pooled spatial models with a metric guard are deployable
on this data; per-user fine-tuning and spatial-fallback paths, both
suggested by smaller-sample preliminary work, are not.
The question of whether personal data becomes less sensitive as it ages — and if so, at what rate — sits at the foundation of several major privacy regulations. The European Union’s General Data Protection Regulation (GDPR) Article 5(1)(e) requires that personal data be “kept in a form which permits identification of data subjects for no longer than is necessary.” The United States Health Insurance Portability and Accountability Act (HIPAA) imposes analogous retention limits on protected health information. Both regimes presume that there exists some characteristic timescale beyond which a given record’s contribution to re-identification risk has decayed enough that retention is no longer warranted. In practice, that timescale is set by policy fiat (six years, ten years, “until no longer needed”) rather than measured.
A natural question is whether the timescale can be learned from data:
given enough users, does the rate at which personal sensitivity decays
generalize across the population, so that a single model could replace a
regulatory constant with a data-driven retention schedule? Recent
synthetic benchmarks have appeared to answer yes. In particular, a
published “temporal privacy decay” benchmark reports test R² = 0.998 for
a gradient-boosting regressor trained on age_days plus
context features — a level of accuracy that would be transformative for
privacy-aware data systems if it held on real data.
This paper tests whether it does. We re-run the same regression pipeline on three real datasets that span the relevant problem space: GDPR fines and HIPAA breaches as regulatory null cases, and GeoLife GPS trajectories as a mobility signal case (Table 1).
We further test, on the dataset where decay does exist, three follow-up questions: (i) what functional form best fits per-user temporal decay? (ii) does the cross-user model transfer well to held-out users via leave-one-user-out cross-validation? (iii) does per-user warm-start fine-tuning improve over a pooled cross-user model?
Our findings reverse multiple intuitions from prior work. The
synthetic R² is recovered only when the experimental setup also
generates labels from age_days; on real data, the
cross-user temporal signal does not transfer at all. What does transfer
is the spatial component — lat/lon features
generalize across held-out users with median R² ≈ +0.85 once the
training pool is large enough. Per-user fine-tuning, which appeared
essential at smaller sample sizes, hurts the median user at
scale. And a small fraction of users that initially look like
spatial-extrapolation failures turn out to be a metric pathology: their
target variance is near zero, which makes R² mathematically divergent
regardless of absolute prediction quality.
The contributions of this paper are: (1) a real-data re-evaluation of a synthetic privacy-decay benchmark across three datasets that span the relevant null and signal cases; (2) a leave-one-user-out characterization of cross-user transfer that decomposes the signal into spatial and temporal components and shows that only the former generalizes; (3) a direct test, by convex-hull membership and nearest-neighbour distance, of the working hypothesis that catastrophic-R² users are spatial outliers — and a refutation of that hypothesis in favor of a metric-pathology explanation; and (4) a reproducible pipeline (Section Reproducibility) that produces every figure in this paper from raw Kaggle datasets.
k-anonymity and mobility uniqueness. The privacy
notion underlying our GeoLife target is k-anonymity, originally
formalized by Samarati and Sweeney [1]. De Montjoye et al. [2] showed
empirically that as few as four spatio-temporal points are sufficient to
uniquely identify 95% of individuals in a mobility dataset, motivating
treatment of GPS trajectories as a privacy primitive. Primault et
al. [3] survey the broader landscape of computational location privacy,
including the role of temporal aggregation in raising effective k. We
use log(k) per-row as the GeoLife target precisely because
k changes over time as locations are revisited and as the population of
users moving through the same areas grows.
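To make the per-row target concrete, the sketch below shows one plausible construction of k: the number of distinct users sharing a quantized lat/lon cell. This is an assumption for exposition only; the paper's actual k computation lives in load_real_datasets.py, and the cell size and quantization scheme here are hypothetical.

```python
import numpy as np
import pandas as pd

def add_log_k(df, cell=0.01):
    # Quantize coordinates to a grid cell, then count distinct users per cell;
    # each row's k is the user-count of the cell it falls in.
    g = df.assign(cell_lat=(df["lat"] / cell).round().astype(int),
                  cell_lon=(df["lon"] / cell).round().astype(int))
    k = g.groupby(["cell_lat", "cell_lon"])["user_id"].transform("nunique")
    return df.assign(log_k=np.log(k))

# Toy trajectories clustered around one urban area
rng = np.random.default_rng(7)
df = pd.DataFrame({"user_id": rng.integers(0, 10, 1000),
                   "lat": rng.normal(39.9, 0.05, 1000),
                   "lon": rng.normal(116.4, 0.05, 1000)})
out = add_log_k(df)
print(out["log_k"].describe())
```

Under this construction k grows as more users revisit the same cells, which is the mechanism the target is meant to capture.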
Cross-user / federated transfer. The hypothesis that cross-user training improves per-user privacy modeling is closely related to federated-learning literature on personalization-versus-pooling trade-offs. Kairouz et al. [4] characterize when pooled models help and when per-user adaptation is needed; the dominant variable is intra-user heterogeneity relative to inter-user spread. Our Findings 4 and 6 (Sections §5.4 and §5.6) speak directly to this trade-off: at smaller training-pool sizes the personalization tier appears essential, but at n = 98 users the pooled model dominates.
R² interpretation. The metric pathology that produces our Finding 7 (Section §5.7) is well known in the statistics literature. Kvålseth [5] cautions specifically that R² = 1 − SS_res/SS_tot is mathematically divergent when SS_tot → 0, and that the statistic is not commensurable across users with different target variances. Pimentel et al. [6] survey novelty-detection approaches that the convex-hull and nearest-neighbour detectors in our Finding 7 implement.
We use three datasets, summarized in Table 1. All three are available
from Kaggle and were processed using a single ingestion script
(load_real_datasets.py) that adds an age_days
column relative to the most recent record in the dataset, log-transforms
the target, and selects the natural stratification feature (country /
entity type / user_id). GeoLife was sub-sampled to 100 users to fit
memory; 98 of those have ≥ 150 sampled points after processing and form
the eligible cohort for per-user analyses.
| Source | Records | Time span | Target | Role |
|---|---|---|---|---|
| GDPR Fines (andreibuliga1/gdpr-fines-20182020-updated-23012021) | 212 | 2018–2020 | log(fine €) | Null test (regulatory) |
| HIPAA Breaches (thedevastator/major-us-health-data-breaches) | 1,632 | 2009–2017 | log(individuals affected) | Null test (regulatory) |
| GeoLife GPS (arashnic/microsoft-geolife-gps-trajectory-dataset) | 48,406 / 100 users | 2007–2012 | log(k-anonymity) | Signal test (mobility) |
| Synthetic (prior work) | 5,000 | N/A | ground_truth_privacy | Baseline (literature) |
For each of the four datasets we train a
sklearn.ensemble.GradientBoostingRegressor
(n_estimators = 150, max_depth = 4) with a 75
/ 25 random train/test split. Features are age_days plus
the natural context features for that dataset (country/article/type for
GDPR; entity_type/breach_type/state for HIPAA; lat/lon/user_id for
GeoLife). We report test R² and the model’s feature importance
attributed to age_days. The synthetic baseline is taken
from prior published numbers and is not re-run.
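The pooled regression just described can be sketched as below. Only the estimator settings (n_estimators = 150, max_depth = 4) and the 75/25 random split come from the text; the toy DataFrame, its column names, and the seeds are stand-ins for the real ingested datasets.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 500
df = pd.DataFrame({
    "age_days": rng.integers(0, 1000, n),
    "context": rng.integers(0, 10, n),   # stand-in for encoded context features
    "log_target": rng.normal(size=n),    # stand-in for the log-transformed target
})

X, y = df[["age_days", "context"]], df["log_target"]
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=42)

model = GradientBoostingRegressor(n_estimators=150, max_depth=4, random_state=42)
model.fit(X_tr, y_tr)

r2 = model.score(X_te, y_te)                    # test R2
age_importance = model.feature_importances_[0]  # importance attributed to age_days
print(f"test R2 = {r2:.3f}, age_days importance = {age_importance:.3f}")
```

On pure-noise targets like this toy, test R² is typically near or below zero, which is exactly the behavior the regulatory null cases exhibit.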
To test whether the pooled regression’s poor performance on
regulatory data is masking signal at finer aggregation, we run 3-fold
cross-validation with age_days as the sole
predictor, separately for each natural stratum (country for GDPR,
entity_type for HIPAA, user_id for GeoLife with a minimum of 100 points
per stratum). We report the pooled (cross-stratum) R², the mean and
median per-stratum R², and the fraction of strata with positive R².
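A minimal version of the per-stratum analysis, with age_days as sole predictor and a minimum stratum size, might look like this (synthetic data; the real pipeline reads the processed CSVs and uses the natural stratification columns):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
df = pd.DataFrame({
    "age_days": rng.integers(0, 2000, 600),
    "stratum": rng.integers(0, 4, 600),     # stand-in for country / entity_type / user_id
    "log_target": rng.normal(size=600),
})

per_stratum = {}
for s, grp in df.groupby("stratum"):
    if len(grp) < 100:                      # minimum points per stratum, as in the text
        continue
    scores = cross_val_score(
        GradientBoostingRegressor(n_estimators=150, max_depth=4, random_state=0),
        grp[["age_days"]], grp["log_target"], cv=3, scoring="r2")
    per_stratum[s] = scores.mean()

frac_positive = float(np.mean([v > 0 for v in per_stratum.values()]))
print(per_stratum, frac_positive)
```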
For each GeoLife user with at least 150 sampled points, we fit four
candidates to the per-user (age_days, log_k) data using
3-fold cross-validation: linear regression, exponential decay
(y = a · exp(−b · t) + c), the four-parameter logistic /
sigmoid (y = L / (1 + exp(−k · (t − t₀))) + c), and a small
per-user gradient-boosted model that serves as a non-parametric upper
bound. The exponential and sigmoid fits use
scipy.optimize.curve_fit; pathological coefficient
solutions (those producing R² < −10⁹) are recorded but downweighted
via the median statistic (Section §4.7). We report the median per-user
R² for each form, the fraction of users where the form is positive, and
the modal best-form winner across users.
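The two parametric fits can be sketched with scipy.optimize.curve_fit as below. The initial guesses and the failure handling are assumptions for illustration, not the paper's exact settings; the toy data is a clean exponential decay.

```python
import numpy as np
from scipy.optimize import curve_fit
from sklearn.metrics import r2_score

def exp_decay(t, a, b, c):
    return a * np.exp(-b * t) + c

def sigmoid(t, L, k, t0, c):
    return L / (1.0 + np.exp(-k * (t - t0))) + c

rng = np.random.default_rng(2)
t = np.sort(rng.uniform(0, 1000, 200))
y = 2.0 * np.exp(-0.01 * t) + 1.0 + rng.normal(0, 0.05, t.size)  # toy decay curve

fits = {}
for name, f, p0 in [("exp", exp_decay, (1.0, 0.001, 0.0)),
                    ("sigmoid", sigmoid, (1.0, -0.01, 500.0, 1.0))]:
    try:
        popt, _ = curve_fit(f, t, y, p0=p0, maxfev=10_000)
        fits[name] = r2_score(y, f(t, *popt))
    except (RuntimeError, ValueError):
        fits[name] = float("-inf")  # record failed fits instead of crashing
print(fits)
```

Recording failed or pathological fits as sentinel values, rather than dropping them, mirrors the paper's choice to keep extreme cases documented and rely on the median.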
LOUO is the strict test of cross-user transfer: train on n − 1 users, predict the n-th, repeat for all n users. We run five conditions on the 98-user GeoLife cohort:

- Mean baseline: predict the training split's mean log_k. Negative R² indicates that the test split's distribution differs from the train split's mean.
- Age-only: age_days as sole feature. Tests whether the temporal decay rate is shared across users.
- Spatial-only: lat, lon. Tests whether the spatial exposure manifold is shared.
- Full: all features (age_days, lat, lon).
- Shuffled-age: the full feature set, but age_days is randomly permuted at training time. If r2_full ≈ r2_shuffled, the cross-user model is extracting no temporal signal from age_days.

Because a small fraction of held-out users produce R² values of order −10²⁵ (Section §5.7), means are uninformative. We report median R² with paired-bootstrap 95% confidence intervals on the median (10,000 resamples over users), the inter-quartile range, and the fraction of users above zero.
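The bootstrap CI on the median, resampling over users, is simple to implement; the toy scores below stand in for the real per-user LOUO R² values and include a few catastrophic outliers to show why the median is the right summary.

```python
import numpy as np

def bootstrap_median_ci(values, n_boot=10_000, alpha=0.05, seed=0):
    # Resample users with replacement, take the median of each resample,
    # and report the (alpha/2, 1 - alpha/2) quantiles of those medians.
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    medians = np.median(
        rng.choice(values, size=(n_boot, values.size), replace=True), axis=1)
    lo, hi = np.quantile(medians, [alpha / 2, 1 - alpha / 2])
    return float(np.median(values)), float(lo), float(hi)

scores = np.concatenate([np.random.default_rng(3).normal(0.8, 0.1, 95),
                         [-1e25] * 3])      # a few catastrophic held-out users
med, lo, hi = bootstrap_median_ci(scores)
print(med, lo, hi)                          # the median ignores the -1e25 outliers
```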
We test whether per-user adaptation improves over the pooled cross-user model across personal-data budgets N ∈ {5, 10, 20, 50, 100} oldest points. Five strategies are evaluated per budget per held-out user, among them the pooled model served unchanged and the three warm-start fine-tuning variants discussed in Section §5.6.
For each strategy we report median test R² across users at each
budget plus the paired-bootstrap CI on
median(strategy − pooled) — a within-user paired
comparison that is more sensitive than testing two unpaired
distributions. A CI that is negative and excludes zero indicates that
the strategy is statistically worse than pooled; a CI that is positive
and excludes zero indicates the strategy wins.
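The paired comparison reduces to bootstrapping the per-user deltas. A sketch, with synthetic deltas in which warm-start is slightly worse than pooled for every user (an assumption chosen to illustrate a CI that excludes zero):

```python
import numpy as np

def paired_median_delta_ci(strategy, pooled, n_boot=10_000, seed=0):
    # One delta per user; bootstrap the median of the within-user differences.
    rng = np.random.default_rng(seed)
    delta = np.asarray(strategy) - np.asarray(pooled)
    meds = np.median(rng.choice(delta, size=(n_boot, delta.size), replace=True), axis=1)
    lo, hi = np.quantile(meds, [0.025, 0.975])
    return float(np.median(delta)), float(lo), float(hi)

rng = np.random.default_rng(4)
pooled = rng.normal(0.85, 0.05, 98)
warm = pooled - np.abs(rng.normal(0.05, 0.02, 98))  # warm-start slightly worse
med, lo, hi = paired_median_delta_ci(warm, pooled)
print(med, lo, hi)  # CI entirely below zero: strategy statistically worse than pooled
```

Pairing within users removes the large between-user variance from the comparison, which is why it is more sensitive than comparing two unpaired score distributions.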
Finding 6 (Section §5.6) leaves a
working hypothesis: the small fraction of held-out users with R² far
below zero are spatial outliers — users whose
lat/lon falls outside the training-pool coverage so the
pooled model is extrapolating into unseen territory. Section §5.7 tests
this directly. For each held-out user U we compute (a) the convex hull
of the other 97 users’ lat/lon (via
scipy.spatial.ConvexHull and
Delaunay.find_simplex), (b) the fraction of U’s test points
inside that hull, and (c) the mean and maximum nearest-neighbour
distance from U’s test points to the closest training-pool point
(sklearn NearestNeighbors with
algorithm='ball_tree'). We then compute Pearson and
Spearman correlations between each detector and
r2_full_louo across users. As an alternative non-spatial
detector, we also compute per-user Var(log_k) from the test
set and correlate it with the same target.
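The two spatial detectors are standard library calls; a self-contained sketch with synthetic coordinates standing in for lat/lon:

```python
import numpy as np
from scipy.spatial import Delaunay
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(5)
train_pts = rng.uniform(-1, 1, (500, 2))    # pooled training users' lat/lon
test_pts = rng.uniform(-0.5, 0.5, (50, 2))  # held-out user, inside coverage

# (a)/(b): fraction of held-out points inside the training convex hull;
# find_simplex returns -1 for points outside the triangulation.
tri = Delaunay(train_pts)
pct_inside_hull = float(np.mean(tri.find_simplex(test_pts) >= 0))

# (c): mean / max nearest-neighbour distance to the training pool
nn = NearestNeighbors(n_neighbors=1, algorithm="ball_tree").fit(train_pts)
dist, _ = nn.kneighbors(test_pts)
mean_nn, max_nn = float(dist.mean()), float(dist.max())

print(pct_inside_hull, mean_nn, max_nn)
```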
All GeoLife per-user statistics use median + IQR +
paired-bootstrap CI on the median as primary, because at 100
users a small fraction of held-out users produce R² values of order
−10²⁵ that make any mean-based statistic uninterpretable. Means and
standard deviations are still computed and stored under
a *_REF suffix in the output JSONs for reference. Per-user
dumps and outlier rosters are preserved in the JSONs so extreme cases
are documented rather than discarded.
Figure 1 compares the test R² of the pooled gradient-boosting
regressor across the synthetic baseline and the three real datasets,
annotated with the feature importance the model attributes to
age_days.
The synthetic baseline reproduces R² = 0.998 with 93.6% of feature
importance on age_days. This is exactly the regime where
the published benchmark is informative — the model is recovering the
signal that was used to generate the labels, and any reasonable
regressor will achieve that. The two regulatory datasets reverse the
picture entirely. GDPR fines yield R² = −0.10 (worse than predicting the
mean fine), and HIPAA breaches yield R² = −0.18. GeoLife reaches R² =
0.97, but age_days contributes only 0.7% of feature
importance — lat and lon together account for
98%. The pooled regression is not measuring temporal decay on
any real dataset; on GeoLife it is measuring spatial
structure.
If pooled regression fails for regulatory data, perhaps
stratification rescues it? 3-fold CV with age_days as sole
predictor, per-stratum:
| Dataset | Stratum | Pooled R² | Mean stratum R² | Fraction strata > 0 | n strata |
|---|---|---|---|---|---|
| GDPR | country | −0.11 | −0.53 | 0.000 | 2 |
| HIPAA | entity_type | −0.05 | −0.63 | 0.000 | 4 |
| GeoLife | user_id | +0.35 | +0.45 | 0.979 | 97 |
Zero of six combined strata produce positive R² for the regulatory datasets. GeoLife produces 97 of 98 positive strata at the per-user level, with a pooled R² of +0.35 — solid evidence that individual-level temporal decay exists in mobility data, even though it is too weak to dominate the spatial signal in pooled regression.
Given that GeoLife users do show per-user temporal decay, what functional form fits best? Figure 2 reports the median per-user R² for the four candidates, annotated with the number of users for which each form is the modal best fit.
Sigmoid wins for 58 of 98 users (59%), with a median R² of 0.12 against linear’s 0.06 and exponential’s 0.07. The absolute differences between simple forms are small. The non-parametric GBM ceiling at R² = 0.45 is roughly four times higher than the best parametric form, indicating that most of the per-user signal is in irregular structure that no clean closed-form curve captures — a hint, in retrospect, that cross-user transfer should not be expected to inherit a clean parametric form either.
Leave-one-user-out across all 98 users (Figure 3) decomposes
pooled-model performance into temporal (age_days-only) and
spatial (lat/lon-only) components and tests whether each
transfers to held-out users.
Three observations. First, age-only LOUO is worse than
chance: the cross-user pooled model trained on
age_days alone is below the baseline of predicting each
held-out user’s mean. This is the strict refutation of the synthetic
benchmark’s claim — across users, temporal decay rates do not transfer
at all. Second, spatial-only and full are statistically
indistinguishable (CIs overlap heavily; medians 0.854 and 0.855), and
both have 95% / 94% of held-out users above zero. Whatever generalizes
across users in this dataset is in the spatial features. Third, the
shuffled-age control passes: scrambling age_days at
training time leaves the score unchanged. This is the cleanest possible
evidence that the model is not using cross-user temporal
information.
If sigmoid is the right per-user form, is there a universal privacy-cliff timescale — some characteristic age at which most users’ privacy decays? We normalize each user’s sigmoid midpoint to their personal time range (so 0 = oldest point, 1 = newest) and test for clustering. Across 87 users with successful sigmoid fits, the Shapiro-Wilk test rejects normality at p = 0.0003, with a coefficient of variation of 0.71. There is no shared timescale: midpoints are scattered across the personal time range with no preferred location. Whatever drives per-user temporal decay, it is not a population-level constant.
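The normalization and clustering test above can be sketched as follows; the midpoints here are synthetic stand-ins for the 87 fitted sigmoid midpoints.

```python
import numpy as np
from scipy.stats import shapiro

rng = np.random.default_rng(6)
norm_midpoints = []
for _ in range(87):
    ages = rng.uniform(0, 2000, 200)            # one user's age_days values
    t0 = rng.uniform(ages.min(), ages.max())    # fitted sigmoid midpoint (toy)
    # Normalize to the personal time range: 0 = oldest point, 1 = newest
    norm_midpoints.append((t0 - ages.min()) / (ages.max() - ages.min()))
norm_midpoints = np.asarray(norm_midpoints)

stat, p = shapiro(norm_midpoints)               # low p rejects a tight normal cluster
cv = norm_midpoints.std(ddof=1) / norm_midpoints.mean()
print(f"Shapiro-Wilk p = {p:.4f}, CV = {cv:.2f}")
```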
An earlier 30-user run had suggested that warm-start fine-tuning was essential — the pooled model on a 30-user training pool reached only R² = +0.27, and a per-user residual GBM appeared to recover the per-user CV ceiling. Figure 4 shows that picture flips at n = 98.
Two paired-bootstrap statistics tell the story rigorously. The CI on
median(warm_full − pooled) is negative and excludes
zero at N = 5, 10, 20 personal points: warm-start is statistically
worse than just using the pooled model. Only at N = 50 does the
difference become statistically indistinguishable from zero, and even at
N = 100 warm-start does not significantly win. The simple
explanation is that, at n = 98 users in the training pool, the spatial
manifold is well-covered enough that the pooled model is already
near-optimal for the median user, and a per-user residual GBM has
nothing useful to add — only its parametric capacity, which becomes a
small overfit penalty. The 30-user “personalization is
essential” conclusion was an artifact of weak spatial coverage in the
smaller training pool.
Section §5.6 leaves a thread. A small minority of held-out users
produce R² values of order −10²⁵ that bias means and (we initially
hypothesized) reflect spatial extrapolation: those users’
lat/lon falls outside the training-pool coverage, so the
pooled model is predicting in regions it has no support for. Section
§4.6 tests that hypothesis directly.
The spatial detectors fail. Spearman correlations between each
detector and r2_full_louo:
| Spatial detector | Spearman ρ | p |
|---|---|---|
| pct_inside_hull | −0.10 | 0.33 |
| mean_nn_deg | −0.25 | 0.013 |
| max_nn_deg | −0.14 | 0.18 |
None of the detectors clear the |ρ| > 0.6 threshold for a
deployable detector. More tellingly, all four catastrophic-R²
users sit inside the hull
(pct_inside_hull = 1.0) and the worst-case user (user 87,
R² ≈ −1.1 × 10²⁷) has mean_nn_deg = 0.0001 — that user is
densely surrounded by training-pool data, the opposite of a
spatial outlier. Conversely, the most-isolated user in the dataset (user
82, max_nn_deg = 211° because their points scatter across continents
relative to the training pool) has R² = +0.44, positive.
Spatial coverage does not predict failure on this dataset.
What does predict failure is per-user target variance. Figure 5 shows the contrast directly: a failed spatial detector (panel a, Spearman = −0.25) and a working target-variance detector (panel b, Spearman = +0.54, p < 10⁻⁷).
The four catastrophic users:
| User | Var(log_k) | log_k range | r2_full_louo |
|---|---|---|---|
| 87 | 0.000 | 0.000 | −1.1 × 10²⁷ |
| 27 | 1.89 | 6.85 | −2.97 |
| 54 | 0.85 | 4.46 | −2.06 |
| 21 | 1.21 | 7.13 | −1.68 |
The mathematical explanation is straightforward. Recall R² = 1 − SS_res / SS_tot. When a held-out user's target values are nearly constant,
SS_tot → 0, and any non-zero prediction error makes the
fraction blow up. R² is not commensurable across users with different
target variances; this is a known caution in the statistics literature
[5]. The deployable consequence is a metric guard: for users
whose recent target variance is below a threshold (Section §4.6 and our
companion artifacts use 0.5), report MAE in addition to or in place of
R². No per-user fallback model is required — the predictions for those
users are not actually catastrophic, the reporting metric is.
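The guard is a few lines of code. The variance threshold (0.5) comes from the text; the function name and return shape are illustrative choices.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, r2_score

def guarded_score(y_true, y_pred, var_threshold=0.5):
    # For low-variance targets R2 diverges as SS_tot -> 0, so fall back to MAE.
    y_true = np.asarray(y_true, dtype=float)
    if y_true.var() < var_threshold:
        return {"metric": "mae", "value": mean_absolute_error(y_true, y_pred)}
    return {"metric": "r2", "value": r2_score(y_true, y_pred)}

# A constant-target user: tiny absolute error, but R2 would be meaningless
y_true = np.full(100, 3.0)
y_pred = y_true + 0.01
print(guarded_score(y_true, y_pred))  # reports MAE, not a divergent R2
```

The guard changes only what is reported, never what is predicted, which is the point: the predictions for these users are fine.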
Read the seven findings together and a single asymmetry organizes them:
| | Per-user (within one user) | Cross-user (pooled across users) |
|---|---|---|
| Spatial signal | Strong: lat/lon dominates the per-user GBM (Finding 1) | Transfers: pooled R² ≈ +0.85 at n = 98 (Finding 4) |
| Temporal signal | Real but weak: sigmoid R² ≈ +0.12 (Finding 3) | Does not transfer: age-only LOUO ≈ chance, shuffle-age confirms (Finding 4) |
Why might this be so? Mobility data has a natural mechanism for
spatial transfer: if user U lives in a region also visited by users V,
W, X, …, then the union of those users’ GPS points provides the pooled
model with enough density to estimate U’s log(k-anonymity)
even on points U has not generated yet. The spatial manifold is shared.
Temporal data has no analogous mechanism in the cross-user direction:
user U’s rate of privacy decay depends on U’s individual
revisit cadence, U’s locations’ base population density, U’s time-of-day
patterns — variables that are not aligned with V’s. Sigmoid midpoints
scatter because each user has their own characteristic schedule (Finding
5); age-only LOUO is below chance because the cross-user model trained
on age_days actively misleads (Finding 4).
The deployable system suggested by these results is simple. Train a single pooled gradient-boosting regressor on the full available cross-user training set; use spatial features (with whatever temporal features you have, but expect them to contribute little). At inference, serve the pooled prediction directly to all users. Personalization tiers — per-user fine-tuning, warm-start residual heads, federated adapters — are not indicated by this data: at the n = 98 user scale, all three forms of warm-start tested in Section §5.6 statistically hurt the median user. The “personalization is essential” intuition that motivated those tiers in the 30-user preliminary run was an artifact of weak spatial coverage in the smaller training pool, not a fundamental property of the problem.
The one operational nuance is the metric guard from Finding 7. Users
whose recent log(k-anonymity) variance is below a threshold
(we use 0.5 in our reproduction artifacts) should have their model
performance reported as MAE in addition to or in place of R². This
changes monitoring dashboards, not the model itself. The change is small
but operationally important: without it, a small fraction of users will
appear to break the model catastrophically when in fact they have
constant targets that the metric cannot represent.
GDPR Article 5(1)(e) and analogous storage-limitation rules in HIPAA prescribe retention periods uniformly across users. Our Finding 5 — that sigmoid midpoints across users have a coefficient of variation of 0.71 with no clustering — implies that on this dataset there is no population-level retention period that fits the data. A regulation that prescribes “six years” or “until no longer necessary” is choosing a number with no support in the empirical distribution of per-user decay curves. This does not argue against retention limits in principle (per-user decay does exist — Finding 2), but it does argue against the implicit modeling assumption that a single number serves all users equally. Adaptive or tiered retention — set per-user based on observed decay rate, with a floor for consent-based lower bounds — would track the data more closely.
We make this implication carefully. GeoLife is one mobility dataset, drawn predominantly from researchers in the Beijing area between 2007 and 2012. The spatial-transfer result almost certainly depends on participants moving through overlapping urban regions; a more-dispersed dataset (e.g., transit-card data spanning multiple cities, or smartphone IMU data sampled globally) might show weaker pooled-model transfer. We do not claim that the spatial / temporal asymmetry is universal — only that on the closest available real-data analogue to the synthetic benchmark, it is clean.
Three limitations are most worth flagging. (i) Sample size on regulatory data. GDPR (n = 212) is genuinely small; the strongly negative R² is robust because the null is structural (penalty amounts are set at incident time), but the per-stratum statistics (only 2 country strata) are noisy. HIPAA at 1,632 records is more comfortable. (ii) GeoLife at 100 users. Our LOUO and per-user fine-tune analyses run on the eligible 98-user subset (those with ≥ 150 sampled points). At smaller training-pool sizes, both findings (cross-user transfer, per-user fine-tune) flipped from earlier expectations; we have not yet tested whether further scaling to n ≈ 200 changes them again. Our memory file lists this as the highest-priority follow-up. (iii) Single-domain finding 7. The metric-pathology result is mathematically robust (the proof is in the R² formula), but the empirical observation that ~5% of GeoLife users have low-variance windows depends on GeoLife’s particular sampling cadence and target choice. Other privacy targets may show different rates.
A synthetic benchmark for temporal privacy decay reporting R² = 0.998
does not reproduce on real data. On regulatory data the same pipeline
goes negative; on real mobility data it reaches a high R² by spatial
features alone, with cross-user temporal transfer scoring worse than
chance. The deployable picture that emerges from this investigation
is that population-level privacy ML is feasible for spatial exposure
modeling at scale, with the pooled model dominating per-user
fine-tuning at n = 98 users. Per-user temporal decay rates, by contrast,
are genuinely personal — the cross-user pooled model extracts zero
signal from age_days (confirmed by shuffle control), and
uniform retention policies have no support in the per-user midpoint
distribution.
A working hypothesis from a smaller-scale preliminary run — that a 5%
minority of users would need a per-user fallback path due to spatial
extrapolation failure — turned out to be a metric pathology rather than
a model failure. Direct testing of convex-hull membership and
nearest-neighbour distance falsified the spatial hypothesis; per-user
target variance is the actual predictor, with one user’s
log(k-anonymity) exactly constant and therefore producing
mathematically divergent R² regardless of absolute prediction quality.
The deployable consequence is a metric guard for low-variance users, not
a separate model.
The asymmetry between spatial and temporal cross-user transfer on this data is the central empirical finding. Whether it generalizes to other mobility datasets, or to other privacy targets entirely, is the most pressing follow-up question.
[1] Samarati, P. and Sweeney, L. (1998). Protecting Privacy When Disclosing Information: k-Anonymity and its Enforcement Through Generalization and Suppression. SRI International Technical Report SRI-CSL-98-04.
[2] De Montjoye, Y.-A., Hidalgo, C. A., Verleysen, M., and Blondel, V. D. (2013). Unique in the Crowd: The Privacy Bounds of Human Mobility. Scientific Reports 3, 1376. https://doi.org/10.1038/srep01376
[3] Primault, V., Boutet, A., Mokhtar, S. B., and Brunie, L. (2019). The Long Road to Computational Location Privacy: A Survey. IEEE Communications Surveys & Tutorials 21(3): 2772–2793.
[4] Kairouz, P. et al. (2021). Advances and Open Problems in Federated Learning. Foundations and Trends in Machine Learning 14(1–2): 1–210.
[5] Kvålseth, T. O. (1985). Cautionary Note about R². The American Statistician 39(4): 279–285.
[6] Pimentel, M. A. F., Clifton, D. A., Clifton, L., and Tarassenko, L. (2014). A Review of Novelty Detection. Signal Processing 99: 215–249.
[7] Zheng, Y., Xie, X., and Ma, W.-Y. (2010). GeoLife: A Collaborative Social Networking Service among User, Location and Trajectory. IEEE Data Engineering Bulletin 33(2): 32–39.
[8] European Parliament and Council of the European Union. (2016). Regulation (EU) 2016/679 — General Data Protection Regulation, Article 5(1)(e), Storage Limitation.
Every figure and table in this paper is regenerated from JSON dumps
under notebooks/automated_tests/phase3_research/ and the
processed GeoLife CSV at data/processed/geolife_decay.csv.
The end-to-end pipeline from raw Kaggle datasets to these outputs:
```
# Inside the privacy-jupyter container
bash /home/jovyan/notebooks/data_ingestion/fetch_kaggle_datasets.sh
python /home/jovyan/notebooks/data_ingestion/load_real_datasets.py
python /home/jovyan/notebooks/data_ingestion/validate_on_real_data.py
python /home/jovyan/notebooks/data_ingestion/stratified_decay.py
python /home/jovyan/notebooks/data_ingestion/decay_function_fit.py
python /home/jovyan/notebooks/data_ingestion/sigmoid_midpoint_clustering.py
python /home/jovyan/notebooks/data_ingestion/leave_one_user_out.py
python /home/jovyan/notebooks/data_ingestion/per_user_finetune.py
python /home/jovyan/notebooks/data_ingestion/convex_hull_outliers.py

# From the host (after the above), regenerate paper figures and PDF + HTML
python notebooks/paper/build_figures.py   # via docker exec — see Makefile
make paper                                # builds notebooks/paper/paper.{pdf,html}
```

Source repository: https://github.com/tedrubin80/decay. All seeds are fixed; numerical results in this paper reproduce to the third decimal.