Spatial Signal Transfers, Temporal Signal Does Not: A Real-Data Re-evaluation of Population-Level Privacy Decay

Ted Rubin · Independent Researcher · ted@theorubin.com

May 2026

Abstract

A widely-cited synthetic benchmark for “temporal privacy decay” reports test R² = 0.998 — a value that, taken at face value, suggests the rate at which personal data ages out of sensitivity is sharply learnable from a population of users. We re-run that benchmark on three real datasets — GDPR fines (n = 212), HIPAA breaches (n = 1,632), and Microsoft GeoLife GPS trajectories (n = 48,406 records across 100 users) — and report three findings that complicate the synthetic claim. (i) The synthetic R² is a self-fulfilling artifact of using age_days to generate the labels. On real regulatory data, the same pipeline scores R² = −0.10 (GDPR) and R² = −0.18 (HIPAA), worse than predicting the mean. (ii) On GeoLife mobility data, leave-one-user-out cross-validation shows that the spatial component of a pooled cross-user model transfers well at scale (median R² = +0.85 across 98 held-out users), but the temporal component does not — age-only LOUO is worse than chance, and a shuffle-age control confirms the cross-user model extracts zero temporal signal from age_days. (iii) A small minority of held-out users produce R² values of order −10²⁵ that initially appear to be spatial-extrapolation failures. Direct testing with convex-hull membership and nearest-neighbour distance rejects this: those users sit inside the training-pool hull. What predicts their R² is per-user target variance — the worst-case user’s log(k-anonymity) is exactly constant, making R² mathematically divergent regardless of absolute prediction error. We conclude that pooled spatial models with a metric guard are deployable on this data; per-user fine-tuning and spatial-fallback paths, both suggested by smaller-sample preliminary work, are not.

1 Introduction

The question of whether personal data becomes less sensitive as it ages — and if so, at what rate — sits at the foundation of several major privacy regulations. The European Union’s General Data Protection Regulation (GDPR) Article 5(1)(e) requires that personal data be “kept in a form which permits identification of data subjects for no longer than is necessary.” The United States Health Insurance Portability and Accountability Act (HIPAA) imposes analogous retention limits on protected health information. Both regimes presume that there exists some characteristic timescale beyond which a given record’s contribution to re-identification risk has decayed enough that retention is no longer warranted. In practice, that timescale is set by policy fiat (six years, ten years, “until no longer needed”) rather than measured.

A natural question is whether the timescale can be learned from data: given enough users, does the rate at which personal sensitivity decays generalize across the population, so that a single model could replace a regulatory constant with a data-driven retention schedule? Recent synthetic benchmarks have appeared to answer yes. In particular, a published “temporal privacy decay” benchmark reports test R² = 0.998 for a gradient-boosting regressor trained on age_days plus context features — a level of accuracy that would be transformative for privacy-aware data systems if it held on real data.

This paper tests whether it does. We re-run the same regression pipeline on three real datasets that span the relevant problem space: GDPR fines (n = 212; a regulatory null case), HIPAA breach reports (n = 1,632; a second regulatory null case), and Microsoft GeoLife GPS trajectories (48,406 records across 100 users; a mobility signal case). Table 1 summarizes all three alongside the synthetic baseline.

We further test, on the dataset where decay does exist, three follow-up questions: (i) what functional form best fits per-user temporal decay? (ii) does the cross-user model transfer well to held-out users via leave-one-user-out cross-validation? (iii) does per-user warm-start fine-tuning improve over a pooled cross-user model?

Our findings reverse multiple intuitions from prior work. The synthetic R² is recovered only when the experimental setup also generates labels from age_days; on real data, the cross-user temporal signal does not transfer at all. What does transfer is the spatial component — lat/lon features generalize across held-out users with median R² ≈ +0.85 once the training pool is large enough. Per-user fine-tuning, which appeared essential at smaller sample sizes, hurts the median user at scale. And a small fraction of users that initially look like spatial-extrapolation failures turn out to be a metric pathology: their target variance is near zero, which makes R² mathematically divergent regardless of absolute prediction quality.

The contributions of this paper are: (1) a real-data re-evaluation of a synthetic privacy-decay benchmark across three datasets that span the relevant null and signal cases; (2) a leave-one-user-out characterization of cross-user transfer that decomposes the signal into spatial and temporal components and shows that only the former generalizes; (3) a direct test, by convex-hull membership and nearest-neighbour distance, of the working hypothesis that catastrophic-R² users are spatial outliers — and a refutation of that hypothesis in favor of a metric-pathology explanation; and (4) a reproducible pipeline (Section Reproducibility) that produces every figure in this paper from raw Kaggle datasets.

2 Related Work

k-anonymity and mobility uniqueness. The privacy notion underlying our GeoLife target is k-anonymity, originally formalized by Samarati and Sweeney [1]. De Montjoye et al. [2] showed empirically that as few as four spatio-temporal points are sufficient to uniquely identify 95% of individuals in a mobility dataset, motivating treatment of GPS trajectories as a privacy primitive. Primault et al. [3] survey the broader landscape of computational location privacy, including the role of temporal aggregation in raising effective k. We use log(k) per-row as the GeoLife target precisely because k changes over time as locations are revisited and as the population of users moving through the same areas grows.

Cross-user / federated transfer. The hypothesis that cross-user training improves per-user privacy modeling is closely related to federated-learning literature on personalization-versus-pooling trade-offs. Kairouz et al. [4] characterize when pooled models help and when per-user adaptation is needed; the dominant variable is intra-user heterogeneity relative to inter-user spread. Our Findings 4 and 6 (Sections §5.4 and §5.6) speak directly to this trade-off: at smaller training-pool sizes the personalization tier appears essential, but at n = 98 users the pooled model dominates.

R² interpretation. The metric pathology that produces our Finding 7 (Section §5.7) is well known in the statistics literature. Kvålseth [5] cautions specifically that R² = 1 − SS_res/SS_tot is mathematically divergent when SS_tot → 0, and that the statistic is not commensurable across users with different target variances. Pimentel et al. [6] survey novelty-detection approaches that the convex-hull and nearest-neighbour detectors in our Finding 7 implement.

3 Datasets

We use three datasets, summarized in Table 1. All three are available from Kaggle and were processed using a single ingestion script (load_real_datasets.py) that adds an age_days column relative to the most recent record in the dataset, log-transforms the target, and selects the natural stratification feature (country / entity type / user_id). GeoLife was sub-sampled to 100 users to fit in memory; 98 of those users have ≥ 150 sampled points after processing and form the eligible cohort for per-user analyses.

Datasets used in this study, including the synthetic benchmark we are re-evaluating against the real cases.
| Source | Records | Time span | Target | Role |
|---|---|---|---|---|
| GDPR Fines (andreibuliga1/gdpr-fines-20182020-updated-23012021) | 212 | 2018–2020 | log(fine €) | Null test (regulatory) |
| HIPAA Breaches (thedevastator/major-us-health-data-breaches) | 1,632 | 2009–2017 | log(individuals affected) | Null test (regulatory) |
| GeoLife GPS (arashnic/microsoft-geolife-gps-trajectory-dataset) | 48,406 / 100 users | 2007–2012 | log(k-anonymity) | Signal test (mobility) |
| Synthetic (prior work) | 5,000 | N/A | ground_truth_privacy | Baseline (literature) |
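The ingestion transform described above can be sketched as follows (the column names and the two example rows are hypothetical; the actual script is load_real_datasets.py):

```python
# Sketch of the shared ingestion step: add age_days relative to the
# newest record, and log-transform the target. Column names are
# illustrative, not the real script's schema.
from datetime import date
import math

def add_age_and_log_target(rows, date_key="date", target_key="target"):
    """rows: list of dicts, each with a date and a positive target value."""
    newest = max(r[date_key] for r in rows)
    for r in rows:
        r["age_days"] = (newest - r[date_key]).days   # newest record -> 0
        r["log_target"] = math.log(r[target_key])
    return rows

rows = [
    {"date": date(2020, 1, 1), "target": 100.0},
    {"date": date(2020, 1, 31), "target": 1000.0},
]
add_age_and_log_target(rows)
```

Anchoring age_days to the newest record makes ages comparable within a dataset regardless of when it was exported.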

4 Methods

4.1 Pooled regression baseline

For each of the four datasets we train a sklearn.ensemble.GradientBoostingRegressor (n_estimators = 150, max_depth = 4) with a 75 / 25 random train/test split. Features are age_days plus the natural context features for that dataset (country/article/type for GDPR; entity_type/breach_type/state for HIPAA; lat/lon/user_id for GeoLife). We report test R² and the model’s feature importance attributed to age_days. The synthetic baseline is taken from prior published numbers and is not re-run.
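A minimal sketch of this baseline, using synthetic stand-in features (the real column sets differ per dataset, and the seeds here are illustrative):

```python
# Pooled baseline sketch with the stated hyperparameters
# (n_estimators=150, max_depth=4, 75/25 random split).
# Stand-in data: only column 1 carries signal, mimicking the
# lat/lon-dominated GeoLife case.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.uniform(size=(400, 3))                 # stand-in feature matrix
y = 3 * X[:, 1] + 0.1 * rng.normal(size=400)   # signal lives in column 1

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
model = GradientBoostingRegressor(n_estimators=150, max_depth=4, random_state=0)
model.fit(X_tr, y_tr)

r2 = model.score(X_te, y_te)              # test R²
importance = model.feature_importances_   # per-feature importance share
```

The feature-importance vector is what lets Figure 1 attribute the GeoLife R² to lat/lon rather than age_days.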

4.2 Stratified cross-validation

To test whether the pooled regression’s poor performance on regulatory data is masking signal at finer aggregation, we run 3-fold cross-validation with age_days as the sole predictor, separately for each natural stratum (country for GDPR, entity_type for HIPAA, user_id for GeoLife with a minimum of 100 points per stratum). We report the pooled (cross-stratum) R², the mean and median per-stratum R², and the fraction of strata with positive R².

4.3 Per-user functional-form fits

For each GeoLife user with at least 150 sampled points, we fit four candidates to the per-user (age_days, log_k) data using 3-fold cross-validation: linear regression, exponential decay (y = a · exp(−b · t) + c), the four-parameter logistic / sigmoid (y = L / (1 + exp(−k · (t − t₀))) + c), and a small per-user gradient-boosted model that serves as a non-parametric upper bound. The exponential and sigmoid fits use scipy.optimize.curve_fit; pathological coefficient solutions (those producing R² < −10⁹) are recorded but downweighted via the median statistic (Section §4.7). We report the median per-user R² for each form, the fraction of users where the form is positive, and the modal best-form winner across users.
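The four-parameter logistic fit can be sketched with scipy.optimize.curve_fit as the text describes — here on noiseless synthetic data so the generating parameters are recoverable (the starting values p0 are an assumption; the pipeline's actual initialization may differ):

```python
# Sigmoid (four-parameter logistic) fit via scipy.optimize.curve_fit,
# on noiseless synthetic data. The real fits run per user with 3-fold CV.
import numpy as np
from scipy.optimize import curve_fit

def sigmoid(t, L, k, t0, c):
    # y = L / (1 + exp(-k * (t - t0))) + c
    return L / (1 + np.exp(-k * (t - t0))) + c

t = np.linspace(0, 100, 200)
y = sigmoid(t, L=2.0, k=0.3, t0=50.0, c=1.0)

# p0 is an illustrative initial guess; maxfev raised as a safety margin
params, _ = curve_fit(sigmoid, t, y, p0=[1.0, 0.1, 40.0, 0.0], maxfev=10000)
```

The fitted t0 is the midpoint used in the Finding 5 clustering analysis; pathological solutions (R² < −10⁹) are the ones recorded but downweighted via the median.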

4.4 Leave-one-user-out (LOUO) cross-validation

LOUO is the strict test of cross-user transfer: train on n − 1 users, predict the n-th, repeat for all n users. We run five conditions on the 98-user GeoLife cohort: Full (all features), Spatial-only (lat/lon only), Age-only (age_days only), Shuffled-age (age_days randomly permuted at training time, as a control), and Chance (predict each held-out user's mean).

Because a small fraction of held-out users produce R² values of order −10²⁵ (Section §5.7), means are uninformative. We report median R² with paired-bootstrap 95% confidence intervals on the median (10,000 resamples over users), the inter-quartile range, and the fraction of users above zero.
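The bootstrap CI on the median can be sketched in a few lines (function name and defaults are illustrative):

```python
# Bootstrap 95% CI on the median: resample users with replacement,
# take each resample's median, and report the 2.5th / 97.5th percentiles.
import random
import statistics

def bootstrap_median_ci(values, n_boot=10_000, alpha=0.05, seed=0):
    rng = random.Random(seed)
    medians = sorted(
        statistics.median(rng.choices(values, k=len(values)))
        for _ in range(n_boot)
    )
    lo = medians[int(n_boot * alpha / 2)]
    hi = medians[int(n_boot * (1 - alpha / 2)) - 1]
    return lo, hi
```

The paired comparison of Section §4.5 applies the same routine to per-user differences (strategy − pooled) rather than to raw scores, which is what makes it a within-user test.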

4.5 Per-user warm-start fine-tuning

We test whether per-user adaptation improves over the pooled cross-user model across personal-data budgets N ∈ {5, 10, 20, 50, 100} oldest points. Five strategies are evaluated per budget per held-out user: pooled-only (no personalization), three warm-start fine-tuning variants (including the full warm-start, warm_full), and personal-only (a per-user GBM with no pooled prior).

For each strategy we report median test R² across users at each budget plus the paired-bootstrap CI on median(strategy − pooled) — a within-user paired comparison that is more sensitive than testing two unpaired distributions. A CI that is negative and excludes zero indicates that the strategy is statistically worse than pooled; a CI that is positive and excludes zero indicates the strategy wins.

4.6 Outlier characterization (convex hull, NN distance, target variance)

Finding 6 (Section §5.6) leaves a working hypothesis: the small fraction of held-out users with R² far below zero are spatial outliers — users whose lat/lon falls outside the training-pool coverage so the pooled model is extrapolating into unseen territory. Section §5.7 tests this directly. For each held-out user U we compute (a) the convex hull of the other 97 users’ lat/lon (via scipy.spatial.ConvexHull and Delaunay.find_simplex), (b) the fraction of U’s test points inside that hull, and (c) the mean and maximum nearest-neighbour distance from U’s test points to the closest training-pool point (sklearn NearestNeighbors with algorithm='ball_tree'). We then compute Pearson and Spearman correlations between each detector and r2_full_louo across users. As an alternative non-spatial detector, we also compute per-user Var(log_k) from the test set and correlate it with the same target.
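The pipeline itself uses scipy.spatial.ConvexHull and Delaunay.find_simplex; the dependency-free 2-D sketch below implements the same hull-membership test for illustration (Andrew's monotone chain plus a half-plane check):

```python
# Stand-in for the scipy hull-membership detector: build the convex hull
# of the training pool's lat/lon and count held-out points inside it.
def cross(o, a, b):
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def convex_hull(points):
    """Andrew's monotone chain; returns hull vertices in CCW order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]

def inside_hull(point, hull):
    """True if point lies inside or on the CCW hull polygon."""
    n = len(hull)
    return all(cross(hull[i], hull[(i + 1) % n], point) >= 0
               for i in range(n))

def pct_inside(test_points, train_points):
    hull = convex_hull(train_points)
    return sum(inside_hull(p, hull) for p in test_points) / len(test_points)
```

pct_inside is the quantity reported as pct_inside_hull in Section §5.7; the finding there is that it does not predict held-out R².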

4.7 Statistical reporting

All GeoLife per-user statistics use median + IQR + paired-bootstrap CI on the median as primary, because at 100 users a small fraction of held-out users produce R² values of order −10²⁵ that make any mean-based statistic uninterpretable. Means and standard deviations are still computed and stored under a *_REF suffix in the output JSONs for reference. Per-user dumps and outlier rosters are preserved in the JSONs so extreme cases are documented rather than discarded.

5 Results

5.1 Pooled regression on real data (Finding 1)

Figure 1 compares the test R² of the pooled gradient-boosting regressor across the synthetic baseline and the three real datasets, annotated with the feature importance the model attributes to age_days.

The synthetic R² is a self-fulfilling artifact. Test R² for the same gradient-boosting pipeline across the synthetic baseline and three real datasets. Annotations report the share of feature importance attributed to age_days. On synthetic data the model recovers the variable used to generate the labels (93.6% importance). On real regulatory data (GDPR, HIPAA) the model goes negative — there is no temporal-decay signal to find. On real mobility data (GeoLife), R² = 0.97 is reached by lat/lon (~98% of feature importance), not by age_days (0.7%).

The synthetic baseline reproduces R² = 0.998 with 93.6% of feature importance on age_days. This is exactly the regime where the published benchmark is informative — the model is recovering the signal that was used to generate the labels, and any reasonable regressor will achieve that. The two regulatory datasets reverse the picture entirely. GDPR fines yield R² = −0.10 (worse than predicting the mean fine), and HIPAA breaches yield R² = −0.18. GeoLife reaches R² = 0.97, but age_days contributes only 0.7% of feature importance — lat and lon together account for 98%. The pooled regression is not measuring temporal decay on any real dataset; on GeoLife it is measuring spatial structure.

5.2 Stratified cross-validation (Finding 2)

If pooled regression fails on regulatory data, does stratification rescue it? We run 3-fold CV with age_days as the sole predictor, separately within each stratum:

Stratified per-group cross-validation, age_days as sole predictor. GDPR and HIPAA produce zero positive strata across 6 combined. GeoLife produces 97 of 98 positive strata.
| Dataset | Stratum | Pooled R² | Mean stratum R² | Fraction strata > 0 | n strata |
|---|---|---|---|---|---|
| GDPR | country | −0.11 | −0.53 | 0.000 | 2 |
| HIPAA | entity_type | −0.05 | −0.63 | 0.000 | 4 |
| GeoLife | user_id | +0.35 | +0.45 | 0.979 | 97 |

Zero of six combined strata produce positive R² for the regulatory datasets. GeoLife produces 97 of 98 positive strata at the per-user level, with a pooled R² of +0.35 — solid evidence that individual-level temporal decay exists in mobility data, even though it is too weak to dominate the spatial signal in pooled regression.

5.3 Per-user functional forms (Finding 3)

Given that GeoLife users do show per-user temporal decay, what functional form fits best? Figure 2 reports the median per-user R² for the four candidates, annotated with the number of users for which each form is the modal best fit.

Sigmoid is the best simple form, but the absolute margin is small. Median per-user R² across 98 GeoLife users for the four candidate functional forms. Sigmoid wins for 58 of 98 users (59%) against linear’s 21 and exponential’s 19. The non-parametric GBM ceiling of R² = 0.45 dominates all parametric forms by a factor of nearly four, indicating substantial per-user structure that no simple parametric form captures.

Sigmoid wins for 58 of 98 users (59%), with a median R² of 0.12 against linear’s 0.06 and exponential’s 0.07. The absolute differences between simple forms are small. The non-parametric GBM ceiling at R² = 0.45 is roughly four times higher than the best parametric form, indicating that most of the per-user signal is in irregular structure that no clean closed-form curve captures — a hint, in retrospect, that cross-user transfer should not be expected to inherit a clean parametric form either.

5.4 Cross-user transfer (Finding 4)

Leave-one-user-out across all 98 users (Figure 3) decomposes pooled-model performance into temporal (age_days-only) and spatial (lat/lon-only) components and tests whether each transfers to held-out users.

Across users, only the spatial component transfers. Median LOUO R² for five conditions on n = 98 GeoLife users; error bars are paired-bootstrap 95% confidence intervals on the median (10,000 resamples). Annotations report the fraction of held-out users with R² > 0. Age-only (R² = −0.45) is statistically worse than the Chance baseline (R² = −0.38), with overlapping but downward-shifted CIs and 14% of users above zero. Spatial-only (+0.85) is essentially identical to Full (+0.86), and the Shuffled-age control (+0.86) confirms that randomly permuting age_days at training time does not change the score — the cross-user model extracts zero temporal signal from age_days.

Three observations. First, age-only LOUO is worse than chance: the cross-user pooled model trained on age_days alone is below the baseline of predicting each held-out user’s mean. This is the strict refutation of the synthetic benchmark’s claim — across users, temporal decay rates do not transfer at all. Second, spatial-only and full are statistically indistinguishable (CIs overlap heavily; medians 0.854 and 0.855), and both have 95% / 94% of held-out users above zero. Whatever generalizes across users in this dataset is in the spatial features. Third, the shuffled-age control passes: scrambling age_days at training time leaves the score unchanged. This is the cleanest possible evidence that the model is not using cross-user temporal information.

5.5 Sigmoid midpoint distribution (Finding 5)

If sigmoid is the right per-user form, is there a universal privacy-cliff timescale — some characteristic age at which most users’ privacy decays? We normalize each user’s sigmoid midpoint to their personal time range (so 0 = oldest point, 1 = newest) and test for clustering. Across 87 users with successful sigmoid fits, the Shapiro-Wilk test rejects normality at p = 0.0003, with a coefficient of variation of 0.71. There is no shared timescale: midpoints are scattered across the personal time range with no preferred location. Whatever drives per-user temporal decay, it is not a population-level constant.

5.6 Per-user fine-tuning at scale (Finding 6)

An earlier 30-user run had suggested that warm-start fine-tuning was essential — the pooled model on a 30-user training pool reached only R² = +0.27, and a per-user residual GBM appeared to recover the per-user CV ceiling. Figure 4 shows that picture flips at n = 98.

Pooled wins at every budget; warm-start fine-tuning hurts the median user. Median test R² across 98 held-out users as a function of personal-data budget for fine-tuning. Pooled-only (no personalization) holds at R² ≈ 0.85 across all budgets. All three warm-start variants land slightly below pooled at every budget; Personal-only (a per-user GBM with no pooled prior) is solidly negative until N = 100. Shaded band is the IQR of the pooled-only condition.

Two paired-bootstrap statistics tell the story rigorously. The CI on median(warm_full − pooled) is negative and excludes zero at N = 5, 10, 20 personal points: warm-start is statistically worse than just using the pooled model. Only at N = 50 does the difference become statistically indistinguishable from zero, and even at N = 100 warm-start does not significantly win. The simple explanation is that, at n = 98 users in the training pool, the spatial manifold is well-covered enough that the pooled model is already near-optimal for the median user, and a per-user residual GBM has nothing useful to add — only its parametric capacity, which becomes a small overfit penalty. The 30-user “personalization is essential” conclusion was an artifact of weak spatial coverage in the smaller training pool.

5.7 Catastrophic R² is a metric artifact (Finding 7)

Section §5.6 leaves a thread. A small minority of held-out users produce R² values of order −10²⁵ that bias means and (we initially hypothesized) reflect spatial extrapolation: those users’ lat/lon falls outside the training-pool coverage, so the pooled model is predicting in regions it has no support for. Section §4.6 tests that hypothesis directly.

The spatial detectors fail. Spearman correlations between each detector and r2_full_louo:

Spatial-coverage detectors vs r2_full_louo. None reaches |ρ| > 0.6 — the threshold for a deployable detector.
| Spatial detector | Spearman ρ | p |
|---|---|---|
| pct_inside_hull | −0.10 | 0.33 |
| mean_nn_deg | −0.25 | 0.013 |
| max_nn_deg | −0.14 | 0.18 |

None of the detectors clear the |ρ| > 0.6 threshold for a deployable detector. More tellingly, all four catastrophic-R² users sit inside the hull (pct_inside_hull = 1.0) and the worst-case user (user 87, R² ≈ −1.1 × 10²⁷) has mean_nn_deg = 0.0001 — that user is densely surrounded by training-pool data, the opposite of a spatial outlier. Conversely, the most-isolated user in the dataset (user 82, max_nn_deg = 211° because their points scatter across continents relative to the training pool) has R² = +0.44, positive. Spatial coverage does not predict failure on this dataset.

What does predict failure is per-user target variance. Figure 5 shows the contrast directly: a failed spatial detector (panel a, Spearman = −0.25) and a working target-variance detector (panel b, Spearman = +0.54, p < 10⁻⁷).

Catastrophic R² is a metric artifact, not a spatial failure. (a) Mean nearest-neighbour distance (degrees, symlog axis) does not correlate with held-out R²; the worst-case user (R² ≈ −10²⁷, clipped at −3.5 for visibility) sits at the lowest distance bin. (b) Per-user Var(log_k) correlates strongly (Spearman = +0.54). All four catastrophic users have low target variance; one has variance of exactly zero, which makes R² = 1 − SS_res/SS_tot mathematically divergent regardless of absolute prediction error.

The four catastrophic users:

The four held-out users with r2_full_louo < −1.0. User 87’s log(k-anonymity) is constant across all test points. The remaining three are mild low-variance users where small absolute errors look catastrophic in R² terms.
| User | Var(log_k) | log_k range | r2_full_louo |
|---|---|---|---|
| 87 | 0.000 | 0.000 | −1.1 × 10²⁷ |
| 27 | 1.89 | 6.85 | −2.97 |
| 54 | 0.85 | 4.46 | −2.06 |
| 21 | 1.21 | 7.13 | −1.68 |

The mathematical explanation is straightforward. Recall

R² = 1 − SS_res / SS_tot,  where SS_tot = Σᵢ (yᵢ − ȳ)².

When a held-out user’s target values are nearly constant, SS_tot → 0, and any non-zero prediction error makes the fraction blow up. R² is not commensurable across users with different target variances; this is a known caution in the statistics literature [5]. The deployable consequence is a metric guard: for users whose recent target variance is below a threshold (Section §4.6 and our companion artifacts use 0.5), report MAE in addition to or in place of R². No per-user fallback model is required — the predictions for those users are not actually catastrophic, the reporting metric is.
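The guard can be sketched as follows (the 0.5 threshold matches our reproduction artifacts; the function name and return shape are illustrative):

```python
# Metric guard: for users whose target variance is below a threshold,
# report MAE instead of R², which is divergent when SS_tot -> 0.
import statistics

VAR_THRESHOLD = 0.5  # threshold used in our reproduction artifacts

def guarded_score(y_true, y_pred):
    mae = statistics.fmean(abs(t - p) for t, p in zip(y_true, y_pred))
    if statistics.pvariance(y_true) < VAR_THRESHOLD:
        return {"metric": "mae", "value": mae}
    mean = statistics.fmean(y_true)
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    return {"metric": "r2", "value": 1 - ss_res / ss_tot, "mae": mae}

# A constant-target user: small absolute error, but without the guard
# R² is undefined/divergent because SS_tot = 0.
constant_user = guarded_score([2.0, 2.0, 2.0], [2.01, 1.99, 2.02])
```

For the constant-target user the guard reports an MAE of about 0.013 instead of a divergent R², which is exactly the user-87 situation in Table 4.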

6 Discussion

6.1 The spatial / temporal asymmetry

Read the seven findings together and a single asymmetry organizes them:

The central tension. Spatial signal is shared across users (locations get re-visited; the spatial manifold is common). Temporal signal is genuinely personal in the cross-user sense (sigmoid midpoints scatter, age-only LOUO is below chance).
| | Per-user (within one user) | Cross-user (pooled across users) |
|---|---|---|
| Spatial signal | Strong: lat/lon dominates the per-user GBM (Finding 1) | Transfers: pooled R² ≈ +0.85 at n = 98 (Finding 4) |
| Temporal signal | Real but weak: sigmoid R² ≈ +0.12 (Finding 3) | Does not transfer: age-only LOUO ≈ chance, shuffle-age confirms (Finding 4) |

Why might this be so? Mobility data has a natural mechanism for spatial transfer: if user U lives in a region also visited by users V, W, X, …, then the union of those users’ GPS points provides the pooled model with enough density to estimate U’s log(k-anonymity) even on points U has not generated yet. The spatial manifold is shared. Temporal data has no analogous mechanism in the cross-user direction: user U’s rate of privacy decay depends on U’s individual revisit cadence, U’s locations’ base population density, U’s time-of-day patterns — variables that are not aligned with V’s. Sigmoid midpoints scatter because each user has their own characteristic schedule (Finding 5); age-only LOUO is below chance because the cross-user model trained on age_days actively misleads (Finding 4).

6.2 Implications for ML systems

The deployable system suggested by these results is simple. Train a single pooled gradient-boosting regressor on the full available cross-user training set; use spatial features (with whatever temporal features you have, but expect them to contribute little). At inference, serve the pooled prediction directly to all users. Personalization tiers — per-user fine-tuning, warm-start residual heads, federated adapters — are not indicated by this data: at the n = 98 user scale, all three forms of warm-start tested in Section §5.6 statistically hurt the median user. The “personalization is essential” intuition that motivated those tiers in the 30-user preliminary run was an artifact of weak spatial coverage in the smaller training pool, not a fundamental property of the problem.

The one operational nuance is the metric guard from Finding 7. Users whose recent log(k-anonymity) variance is below a threshold (we use 0.5 in our reproduction artifacts) should have their model performance reported as MAE in addition to or in place of R². This changes monitoring dashboards, not the model itself. The change is small but operationally important: without it, a small fraction of users will appear to break the model catastrophically when in fact they have constant targets that the metric cannot represent.

6.3 Implications for policy

GDPR Article 5(1)(e) and analogous storage-limitation rules in HIPAA prescribe retention periods uniformly across users. Our Finding 5 — that sigmoid midpoints across users have a coefficient of variation of 0.71 with no clustering — implies that on this dataset there is no population-level retention period that fits the data. A regulation that prescribes “six years” or “until no longer necessary” is choosing a number with no support in the empirical distribution of per-user decay curves. This does not argue against retention limits in principle (per-user decay does exist — Finding 2), but it does argue against the implicit modeling assumption that a single number serves all users equally. Adaptive or tiered retention — set per-user based on observed decay rate, with a floor for consent-based lower bounds — would track the data more closely.

We make this implication carefully. GeoLife is one mobility dataset, drawn predominantly from researchers in the Beijing area between 2007 and 2012. The spatial-transfer result almost certainly depends on participants moving through overlapping urban regions; a more-dispersed dataset (e.g., transit-card data spanning multiple cities, or smartphone IMU data sampled globally) might show weaker pooled-model transfer. We do not claim that the spatial / temporal asymmetry is universal — only that on the closest available real-data analogue to the synthetic benchmark, it is clean.

6.4 Limitations

Three limitations are most worth flagging. (i) Sample size on regulatory data. GDPR (n = 212) is genuinely small; the strongly negative R² is robust because the null is structural (penalty amounts are set at incident time), but the per-stratum statistics (only 2 country strata) are noisy. HIPAA at 1,632 records is more comfortable. (ii) GeoLife at 100 users. Our LOUO and per-user fine-tune analyses run on the eligible 98-user subset (those with ≥ 150 sampled points). At smaller training-pool sizes, both findings (cross-user transfer, per-user fine-tuning) flipped from earlier expectations; we have not yet tested whether further scaling to n ≈ 200 changes them again, and we list this as the highest-priority follow-up. (iii) Finding 7 is single-domain. The metric-pathology result is mathematically robust (the proof is in the R² formula), but the empirical observation that ~5% of GeoLife users have low-variance windows depends on GeoLife's particular sampling cadence and target choice. Other privacy targets may show different rates.

7 Conclusion

A synthetic benchmark for temporal privacy decay reporting R² = 0.998 does not reproduce on real data. On regulatory data the same pipeline goes negative; on real mobility data it reaches a high R² by spatial features alone, with cross-user temporal transfer scoring worse than chance. The deployable picture that emerges from this investigation is that population-level privacy ML is feasible for spatial exposure modeling at scale, with the pooled model dominating per-user fine-tuning at n = 98 users. Per-user temporal decay rates, by contrast, are genuinely personal — the cross-user pooled model extracts zero signal from age_days (confirmed by shuffle control), and uniform retention policies have no support in the per-user midpoint distribution.

A working hypothesis from a smaller-scale preliminary run — that a 5% minority of users would need a per-user fallback path due to spatial extrapolation failure — turned out to be a metric pathology rather than a model failure. Direct testing of convex-hull membership and nearest-neighbour distance falsified the spatial hypothesis; per-user target variance is the actual predictor, with one user’s log(k-anonymity) exactly constant and therefore producing mathematically divergent R² regardless of absolute prediction quality. The deployable consequence is a metric guard for low-variance users, not a separate model.

The asymmetry between spatial and temporal cross-user transfer on this data is the central empirical finding. Whether it generalizes to other mobility datasets, or to other privacy targets entirely, is the most pressing follow-up question.

References

[1] Samarati, P. and Sweeney, L. (1998). Protecting Privacy When Disclosing Information: k-Anonymity and its Enforcement Through Generalization and Suppression. SRI International Technical Report SRI-CSL-98-04.

[2] De Montjoye, Y.-A., Hidalgo, C. A., Verleysen, M., and Blondel, V. D. (2013). Unique in the Crowd: The Privacy Bounds of Human Mobility. Scientific Reports 3, 1376. https://doi.org/10.1038/srep01376

[3] Primault, V., Boutet, A., Mokhtar, S. B., and Brunie, L. (2019). The Long Road to Computational Location Privacy: A Survey. IEEE Communications Surveys & Tutorials 21(3): 2772–2793.

[4] Kairouz, P. et al. (2021). Advances and Open Problems in Federated Learning. Foundations and Trends in Machine Learning 14(1–2): 1–210.

[5] Kvålseth, T. O. (1985). Cautionary Note about R². The American Statistician 39(4): 279–285.

[6] Pimentel, M. A. F., Clifton, D. A., Clifton, L., and Tarassenko, L. (2014). A Review of Novelty Detection. Signal Processing 99: 215–249.

[7] Zheng, Y., Xie, X., and Ma, W.-Y. (2010). GeoLife: A Collaborative Social Networking Service among User, Location and Trajectory. IEEE Data Engineering Bulletin 33(2): 32–39.

[8] European Parliament and Council of the European Union. (2016). Regulation (EU) 2016/679 — General Data Protection Regulation, Article 5(1)(e), Storage Limitation.

Reproducibility

Every figure and table in this paper is regenerated from JSON dumps under notebooks/automated_tests/phase3_research/ and the processed GeoLife CSV at data/processed/geolife_decay.csv. The end-to-end pipeline from raw Kaggle datasets to these outputs:

# Inside the privacy-jupyter container
bash   /home/jovyan/notebooks/data_ingestion/fetch_kaggle_datasets.sh
python /home/jovyan/notebooks/data_ingestion/load_real_datasets.py
python /home/jovyan/notebooks/data_ingestion/validate_on_real_data.py
python /home/jovyan/notebooks/data_ingestion/stratified_decay.py
python /home/jovyan/notebooks/data_ingestion/decay_function_fit.py
python /home/jovyan/notebooks/data_ingestion/sigmoid_midpoint_clustering.py
python /home/jovyan/notebooks/data_ingestion/leave_one_user_out.py
python /home/jovyan/notebooks/data_ingestion/per_user_finetune.py
python /home/jovyan/notebooks/data_ingestion/convex_hull_outliers.py

# From the host (after the above), regenerate paper figures and PDF + HTML
python notebooks/paper/build_figures.py     # via docker exec — see Makefile
make paper                                   # builds notebooks/paper/paper.{pdf,html}

Source repository: https://github.com/tedrubin80/decay. All seeds are fixed; numerical results in this paper reproduce to the third decimal.