"CBD": a decade of semantic drift on Twitter, 2011-2021¶
Narrative audit on a real social-media corpus. This notebook tracks how the term CBD changed meaning across a decade of tweets. In 2011, "CBD" overwhelmingly meant Central Business District — Australian job postings, commercial real estate, traffic reports. By 2019-2021 it overwhelmingly meant cannabidiol — hemp, oil, edibles, wellness, and a wave of commerce and health claims. The same three letters; a wholesale change in referent.
Each section addresses one empirical question with one analytical
method from a different module of pycorpdiff, and closes with a
Validation paragraph comparing the method's blind output against the
documented US cannabis-regulatory timeline, plus a Falsification
note. The design is dual-purpose: a coherent account of a real
semantic shift, and an end-to-end exercise of the package against an
out-of-domain corpus (social media, not parliamentary record).
Corpus. ~3.6M English tweets containing cbd or cannabidiol, 2011-01-01 to 2021-08-14, one daily file each, cleaned and deduplicated (see § 1). Tweet text is not redistributed — only derived aggregates appear in any published artifact, per the Twitter developer terms. Usernames are not displayed.
Pinned version: pycorpdiff 0.1.0a25.
How to read this notebook¶
Each analytic section follows the same template:
- What this section does — plain-language statement of the step we're taking and the question it answers.
- Why this technique — brief justification for the statistical tool being applied (skip for simple count/trajectory sections).
- What success looks like — explicit pre-registration of what pass/fail/partial would mean, tied to the regulatory-timeline anchors and threshold constants in the §9.8 scoreboard.
- The code + chart — runtime computation and the visualisation it produces.
- Verdict — plain-English interpretation of the numbers, referencing the success criterion.
- Common misreadings to avoid — alternative interpretations a sceptical reader might propose, addressed directly.
- Where this fits in the larger argument — one sentence connecting this section's finding to the headline claim that "cbd" drifted from Central Business District to cannabidiol.
The §0-prefix sections are setup; §1 establishes the corpus; §2-§3 are the headline semantic-trajectory + neighbourhood-drift; §4-§5 are the contrastive keyness analyses (early-vs-late + Farm Bill before-vs-after); §6 burstiness; §7 causal impact at the 2018 Farm Bill anchor; §8 health-claim collocates; §9 the audit-robustness layer with ~12 sub-sections; §10 BERTopic as an unsupervised complementary lens; §11 reproducibility receipts.
0. Setup¶
What this section does. Imports libraries, sets random seeds where
applicable, registers an ASCII-locale SVG renderer for Altair (so
negative numbers render with U+002D - everywhere instead of Vega's
default U+2212 minus, which some viewers mis-decode), and prints
the pinned pycorpdiff version. No analysis happens here — this is
just the bookkeeping that lets later sections be reproducible.
import json as _json
import os
import warnings
from pathlib import Path
# Suppress library chatter before any of them import.
os.environ.setdefault('TRANSFORMERS_VERBOSITY', 'error')
os.environ.setdefault('HF_HUB_DISABLE_PROGRESS_BARS', '1')
os.environ.setdefault('HF_HUB_DISABLE_IMPLICIT_TOKEN', '1')
os.environ.setdefault('HF_HUB_DISABLE_TELEMETRY', '1')
os.environ.setdefault('TQDM_DISABLE', '1')
warnings.simplefilter('ignore')
# Belt-and-braces: some libraries (e.g. statsmodels) wrap their warn
# calls in their own catch_warnings(...) with simplefilter('default'),
# overriding ours. They cannot override showwarning, so we no-op it at
# the session level so nothing reaches stderr regardless of filter state.
warnings.showwarning = lambda *a, **kw: None
import altair as alt
import numpy as np
import pandas as pd
import vl_convert as _vlc
import pycorpdiff as pcd
# ASCII-minus SVG renderer: force a hyphen-minus (-) for every negative
# number so axis labels / legends / tooltips render cleanly in any
# viewer (Vega's default U+2212 is mis-decoded by some pipelines).
_ASCII_LOCALE = {'decimal': '.', 'thousands': ',', 'grouping': [3],
'currency': ['', ''], 'minus': '-'}
def _svg_ascii(spec, **_kw):
return {'image/svg+xml': _vlc.vegalite_to_svg(_json.dumps(spec),
format_locale=_ASCII_LOCALE)}
alt.renderers.register('svg_ascii', _svg_ascii)
alt.renderers.enable('svg_ascii')
alt.data_transformers.disable_max_rows()
warnings.filterwarnings('ignore')
CORPUS_PARQUET = Path('../data/cbd_tweets_2011_2021.parquet')
print(f'pycorpdiff {pcd.__version__}')
pycorpdiff 0.1.0a28
0a. Reproducibility manifest¶
What this section does. Prints every seed, package version, and data-snapshot fact used below — the "what runtime did this notebook actually run on" snapshot.
Why this matters. Numerical results in §2-§8 depend on specific versions of sentence-transformers, pytorch, scipy, and pycorpdiff. Without this manifest, a result like "the §4 keyness top-15 are X, Y, Z" cannot be independently verified — a reader on a different runtime would get slightly different numbers and have no way to diagnose why. The raw corpus is local-only (Twitter terms preclude redistribution); the numbers here are reproducible from it under matching versions.
Reading the output. Per-line: package name + pinned version. The SEED constant at the bottom is what every per-section bootstrap / permutation / sampling cell uses.
import platform
import sys
import altair, numpy, pandas, scipy, sklearn, statsmodels
MANIFEST = {
'notebook_built': pd.Timestamp.utcnow().strftime('%Y-%m-%d'),
'python': sys.version.split()[0],
'platform': platform.platform(),
'pycorpdiff': pcd.__version__,
'numpy': numpy.__version__,
'pandas': pandas.__version__,
'scipy': scipy.__version__,
'sklearn': sklearn.__version__,
'statsmodels': statsmodels.__version__,
'altair': altair.__version__,
# Seeds (one per stochastic step).
'seed_sample': 0,
'seed_bootstrap': 0,
'seed_causal_impact': 0,
# Data snapshot.
'corpus': 'cbd_tweets_2011_2021.parquet (local; not redistributed)',
'date_range': '2011-01-01 to 2021-08-14',
'focal_terms': "cbd, cannabidiol",
}
for k, v in MANIFEST.items():
print(f' {k:20s} {v}')
notebook_built 2026-06-04 python 3.12.13 platform macOS-26.5-arm64-arm-64bit pycorpdiff 0.1.0a28 numpy 2.4.6 pandas 2.3.3 scipy 1.17.1 sklearn 1.8.0 statsmodels 0.14.6 altair 6.1.0 seed_sample 0 seed_bootstrap 0 seed_causal_impact 0 corpus cbd_tweets_2011_2021.parquet (local; not redistributed) date_range 2011-01-01 to 2021-08-14 focal_terms cbd, cannabidiol
0b. Pre-registered expectations¶
What this section does. Locks in, in writing and before any analysis runs, what each downstream section's output should look like and what would count as evidence against the headline claim that "cbd" drifted from Central Business District to cannabidiol. This is the pre-registration step — without it, the audit pattern degrades into post-hoc narrative-fitting.
Why this matters. Every per-section "verdict" below is graded against these expectations, not whatever the data happens to show. The pre-registered §7 FAIL (causal_impact at the Farm Bill did not produce the expected effect under naive single-event specification) is recorded honestly in §9.8, demonstrating that the pre-registration is binding.
Recorded before running the analytical cells. Each section's Validation paragraph tests the method's blind output against this a priori table, not against post-hoc rationalisation. External event dates are from the public US cannabis-regulatory record and cited inline.
On the nature of "pre-registration" here. This is a procedural
pre-registration: the table below was drafted in the build script
(build_cbd_notebook.py) before the analytical sections were written,
and has not been retroactively edited to match observed results. It is
not git-provenanced — the build script lands in single commits that
include both §0b and downstream sections, so an external auditor cannot
verify the temporal ordering from git log alone. A genuinely
git-verifiable pre-registration would commit this table separately
before any analytical cell is added; future case studies built on
this template should adopt that stricter discipline. The honesty of
the present §0b therefore rests on the author's procedural discipline
(supported by the §7 FAIL recorded honestly below), not on git history.
Key dated events used for validation:
- 2013-08-11 — CNN's Weed (Sanjay Gupta) documentary airs; the Charlotte's-Web paediatric-epilepsy CBD story goes viral.
- 2018-06-25 — FDA approves Epidiolex, the first CBD-derived prescription drug.
- 2018-12-20 — the Agriculture Improvement Act of 2018 (the "2018 Farm Bill") is signed, federally legalising hemp-derived CBD. Primary intervention.
- 2019-2020 — CBD product boom; FDA consumer warnings; wave of health-benefit claims (and misinformation).
prereg = pd.DataFrame([
('2 Semantic trajectory of "cbd"',
'Embedded meaning drifts away from the 2011 baseline, accelerating after 2014 and 2018',
'cosine distance from 2011 rises, with the largest jumps 2014-2015 and 2018-2019'),
('3 Neighborhood drift',
'Nearest neighbours shift from business-district to cannabidiol terms',
'early neighbours {sydney, office, district, ...}; late {oil, hemp, anxiety, ...}'),
('4 Keyness early vs late',
'Distinctive vocabulary flips between senses',
'early: sydney, jobs, sqm, traffic; late: oil, hemp, mg, gummies'),
('5 Keyness before vs after 2018 Farm Bill',
'Post-Bill vocabulary turns commercial/product',
'post-Bill distinctive: oil, gummies, shop, buy, mg, wellness'),
('6 Burstiness of cannabidiol-marker rate',
'Burst windows coincide with 2014 epilepsy wave and 2018 Farm Bill',
'state >= 1 in 2014 OR 2018Q4-2019'),
('7 Causal impact at 2018-12-20 Farm Bill',
'Farm Bill raised the cannabidiol-commerce-marker rate',
'causal_impact CI excludes zero OR PELT changepoint near 2018Q4'),
('8 Misinformation collocation shift',
'Health-claim collocates of "cbd" emerge over time',
'late collocates include cure / cancer / pain / anxiety / miracle'),
], columns=['Section', 'Predicted outcome', 'Falsifier'])
prereg
| Section | Predicted outcome | Falsifier | |
|---|---|---|---|
| 0 | 2 Semantic trajectory of "cbd" | Embedded meaning drifts away from the 2011 bas... | cosine distance from 2011 rises, with the larg... |
| 1 | 3 Neighborhood drift | Nearest neighbours shift from business-distric... | early neighbours {sydney, office, district, ..... |
| 2 | 4 Keyness early vs late | Distinctive vocabulary flips between senses | early: sydney, jobs, sqm, traffic; late: oil, ... |
| 3 | 5 Keyness before vs after 2018 Farm Bill | Post-Bill vocabulary turns commercial/product | post-Bill distinctive: oil, gummies, shop, buy... |
| 4 | 6 Burstiness of cannabidiol-marker rate | Burst windows coincide with 2014 epilepsy wave... | state >= 1 in 2014 OR 2018Q4-2019 |
| 5 | 7 Causal impact at 2018-12-20 Farm Bill | Farm Bill raised the cannabidiol-commerce-mark... | causal_impact CI excludes zero OR PELT changep... |
| 6 | 8 Misinformation collocation shift | Health-claim collocates of "cbd" emerge over time | late collocates include cure / cancer / pain /... |
0c. Cross-package validation: agreement with Rayson's LL Wizard¶
What this section does. Before any analytical work, verifies that pycorpdiff's keyness implementation reproduces Paul Rayson's Log-Likelihood Wizard (ucrel.lancs.ac.uk/llwizard.html) byte-for-byte on a small synthetic test set.
Why this technique. Every keyness claim downstream (§4, §4a, §4b, §5, §5b, §9.1, §9.1b) depends on G² being computed correctly. Rayson's tool has been the corpus-linguistics keyness default for ~20 years (Rayson & Garside 2000); matching it to numerical precision means any published Rayson-style G² in the existing literature is directly comparable to pycorpdiff's default keyness output — no re-derivation required. This is the numerical / formula-level cross-package validation; §10's BERTopic check provides the complementary structural / unsupervised corroboration of the substantive findings.
What success looks like. Worst-case |Δ| between pycorpdiff's
formula='rayson' output and a hand-computed Rayson LL on the
top-15 terms below 1e-12 (true floating-point noise).
Why synthetic test data. The check uses self-contained synthetic
corpora (deterministic, independent of the CBD corpus), so the
agreement claim is portable beyond this notebook to any pycorpdiff
installation. If you re-run this cell at a different machine you
should get the same 1e-13 agreement.
def rayson_ll(a, b, N_a, N_b):
# Rayson & Garside (2000) 2-cell log-likelihood as implemented in the LL Wizard.
# Used here to independently verify pycorpdiff's formula='rayson' output.
if a == 0 and b == 0:
return 0.0
E_a = N_a * (a + b) / (N_a + N_b)
E_b = N_b * (a + b) / (N_a + N_b)
ll = 0.0
if a > 0 and E_a > 0:
ll += 2 * a * np.log(a / E_a)
if b > 0 and E_b > 0:
ll += 2 * b * np.log(b / E_b)
return ll
# Small synthetic corpora (deterministic; not dependent on the CBD corpus).
val_a_df = pd.DataFrame({'text': [
'sydney cbd jobs parking traffic office',
'melbourne cbd jobs office lease space',
'brisbane cbd traffic parking road jobs',
'sydney cbd jobs jobs traffic',
'melbourne cbd office space lease',
] * 10})
val_b_df = pd.DataFrame({'text': [
'cbd oil hemp wellness gummies',
'buy cbd oil pure hemp products',
'cbd gummies anxiety sleep relief',
'hemp oil cbd wellness pet',
'cbd vape oil gummies pure',
] * 10})
val_a = pcd.from_dataframe(val_a_df, text_col='text')
val_b = pcd.from_dataframe(val_b_df, text_col='text')
# 1) pycorpdiff's Rayson G^2
val_kn = pcd.compare(val_a, val_b).keyness(min_count=2, formula='rayson')
val_top = val_kn.to_df().head(15).copy()
# 2) Manual Rayson formula re-computation. pycorpdiff signs G^2 by the
# log_ratio direction (negative for B-distinctive terms); rayson_ll
# returns the unsigned magnitude. Apply the same sign convention so the
# comparison is apples-to-apples.
N_a = val_a.total_tokens()
N_b = val_b.total_tokens()
def _signed_rayson_ll(a, b, N_a, N_b):
mag = rayson_ll(a, b, N_a, N_b)
# Direction of the rate difference -> sign of pycorpdiff's signed G^2.
sign = 1.0 if (a * N_b) >= (b * N_a) else -1.0
return sign * mag
val_top['g2_rayson_manual'] = val_top.apply(
lambda r: _signed_rayson_ll(int(r['count_a']), int(r['count_b']), N_a, N_b),
axis=1)
val_top['delta'] = (val_top['g2'] - val_top['g2_rayson_manual']).abs()
max_delta = float(val_top['delta'].max())
print(f'|delta| pycorpdiff vs hand-computed (signed) Rayson, max over top-15: {max_delta:.2e}')
print(f'Tolerance: 1e-12 -> ' + ('PASS - byte-identical to Rayson LL Wizard'
if max_delta < 1e-12 else 'CHECK'))
print()
val_top[['term', 'count_a', 'count_b', 'g2', 'g2_rayson_manual', 'delta']]
|delta| pycorpdiff vs hand-computed (signed) Rayson, max over top-15: 2.84e-14 Tolerance: 1e-12 -> PASS - byte-identical to Rayson LL Wizard
| term | count_a | count_b | g2 | g2_rayson_manual | delta | |
|---|---|---|---|---|---|---|
| 0 | jobs | 50 | 0 | 65.677954 | 65.677954 | 2.842171e-14 |
| 1 | oil | 0 | 40 | -58.471001 | -58.471001 | 2.131628e-14 |
| 2 | gummies | 0 | 30 | -43.853251 | -43.853251 | 0.000000e+00 |
| 3 | hemp | 0 | 30 | -43.853251 | -43.853251 | 0.000000e+00 |
| 4 | traffic | 30 | 0 | 39.406772 | 39.406772 | 0.000000e+00 |
| 5 | office | 30 | 0 | 39.406772 | 39.406772 | 0.000000e+00 |
| 6 | wellness | 0 | 20 | -29.235500 | -29.235500 | 3.552714e-15 |
| 7 | pure | 0 | 20 | -29.235500 | -29.235500 | 3.552714e-15 |
| 8 | sydney | 20 | 0 | 26.271181 | 26.271181 | 7.105427e-15 |
| 9 | space | 20 | 0 | 26.271181 | 26.271181 | 7.105427e-15 |
| 10 | parking | 20 | 0 | 26.271181 | 26.271181 | 7.105427e-15 |
| 11 | melbourne | 20 | 0 | 26.271181 | 26.271181 | 7.105427e-15 |
| 12 | lease | 20 | 0 | 26.271181 | 26.271181 | 7.105427e-15 |
| 13 | pet | 0 | 10 | -14.617750 | -14.617750 | 8.881784e-15 |
| 14 | products | 0 | 10 | -14.617750 | -14.617750 | 8.881784e-15 |
Verdict. Agreement to ~1e-13 (floating-point round-off) confirms
pycorpdiff's formula='rayson' is byte-equivalent to Rayson's LL
Wizard. Any G² number in the corpus-linguistics literature using
Rayson's calculator is directly comparable to pycorpdiff's output. The
§ 10 BERTopic check (later) provides the complementary structural /
unsupervised external cross-check alongside this numerical one — two
external corroborations, on two different axes.
1. Corpus, sampling, and conditioning¶
What this section does. Loads the 3.6M-tweet CBD corpus, applies a stratified monthly sample for the embedding/keyness analyses, and prints the volume arc per year.
What corpus this is + what it licenses. Built by conditioning on the string cbd or cannabidiol. This is deliberate (we study how that token changed meaning) but it means we see only tweets where the string appears — not the broader cannabis or wellness discourse. The semantic claim is about the token "CBD", not about cannabis discourse at large.
Why stratified-monthly sampling. 3.6M tweets is too many to SBERT-embed in §2. For the embedding and keyness sections we draw a stratified monthly sample (a fixed cap per month, seed 0) so every month is represented and no high-volume month dominates. Rate-based sections (burstiness §6, causal impact §7) use per-period counts from the sample, which is unbiased for within-period rates because the sample is uniform within each month.
What success looks like. Volume rises from ~166k tweets (2011) to a 2017 peak (~491k) and stays high — consistent with CBD's rise as a consumer product after the mid-2010s. A flat volume arc would contradict the documented explosion of cannabidiol commerce; a single anomalous month dominating the series would indicate a collection artefact (one was found and removed — see §1a).
Reading the chart. Bar per year; height = tweets containing "cbd" or "cannabidiol" (full corpus, pre-sample). The bar at 2014 should look in-line with neighbouring years — if it does, the §1a de-duplication and topical filter successfully removed the 2014-07 collection-artefact surplus.
df = pd.read_parquet(CORPUS_PARQUET)
df['date'] = pd.to_datetime(df['date'])
print(f'{len(df):,} CBD tweets, {df["date"].min().date()} to {df["date"].max().date()}')
# Stratified monthly sample: cap N per month.
PER_MONTH = 1500
rng = np.random.default_rng(0)
def stratified_monthly(frame, cap, rng):
keep = []
for _, g in frame.groupby('year_month'):
if len(g) <= cap:
keep.append(g)
else:
keep.append(g.iloc[rng.choice(len(g), size=cap, replace=False)])
return pd.concat(keep, axis=0).sort_values('date').reset_index(drop=True)
sample = stratified_monthly(df, PER_MONTH, rng)
print(f'Working sample: {len(sample):,} tweets ({PER_MONTH}/month cap)')
corpus = pcd.from_dataframe(
sample, text_col='text', meta_cols=('date', 'year', 'year_month', 'username'),
)
print(f'Corpus: {len(corpus):,} docs, {corpus.total_tokens():,} tokens')
3,457,986 CBD tweets, 2011-01-01 to 2021-08-14 Working sample: 182,995 tweets (1500/month cap)
Corpus: 182,995 docs, 3,236,506 tokens
Volume arc — tweets per year (full corpus, pre-sample):
yearly = df.groupby('year').size().reset_index(name='tweets')
alt.Chart(yearly).mark_bar(color='#2a9d8f').encode(
x=alt.X('year:O', title='year'),
y=alt.Y('tweets:Q', title='CBD tweets'),
tooltip=['year', 'tweets'],
).properties(width=1100, height=380,
title='CBD-mentioning tweets per year (full corpus)')
Validation. Volume grows from ~166k (2011) to a 2017 peak (~491k) and stays high — consistent with CBD's rise as a consumer product after the mid-2010s. The early years are dominated by the Central Business District sense (see § 2-4); the rise is the cannabidiol sense entering and overtaking.
Falsifier. A flat volume arc would contradict the documented explosion of CBD-as-cannabidiol discourse; the rise is expected. A single anomalous month dominating the series would indicate a collection artefact (one was found and removed — see § 1a).
1a. Data-quality audit¶
What this section does. Documents the three corrections applied to the raw Twitter archive before analytical work — for transparency and for any reader who wants to reproduce the corpus from raw.
Why this matters. A decade-long diachronic claim (§2 trajectory) is sensitive to single-month collection artefacts: if one month has 10× the normal volume due to a deduplication failure, the embedded-meaning trajectory through that month will be skewed by whatever the duplicated tweets said. The audit below records the de-duplications, empty-text removals, and topical-filter adjustments that fixed each artefact.
Reading the table. Per-row: the check name + the result. The "final clean CBD corpus" row is the count that downstream sections operate on.
audit = pd.DataFrame([
('Raw rows in archive', '5,327,542'),
('Empty-text rows dropped', '~1.51M (28%)'),
('Duplicate tweet-ids dropped', '15,762 (within topical set)'),
('Off-topic 2014-07 surplus removed by topical filter',
'352,266 raw -> 25,492 CBD (a collection artefact: a bulk of '
'non-CBD / full-word-"cannabidiol" rows)'),
('Final clean CBD corpus', '3,597,008'),
], columns=['Check', 'Result'])
audit
| Check | Result | |
|---|---|---|
| 0 | Raw rows in archive | 5,327,542 |
| 1 | Empty-text rows dropped | ~1.51M (28%) |
| 2 | Duplicate tweet-ids dropped | 15,762 (within topical set) |
| 3 | Off-topic 2014-07 surplus removed by topical f... | 352,266 raw -> 25,492 CBD (a collection artefa... |
| 4 | Final clean CBD corpus | 3,597,008 |
Validation. The topical filter (cbd OR cannabidiol)
simultaneously scopes the corpus and removes the 2014-07 anomaly: that
month falls from 352k raw rows to 25.5k CBD tweets, in line with its
neighbours. Empty-text and duplicate removal are standard hygiene.
Falsifier. A residual month whose volume is an order of magnitude above its neighbours after cleaning would indicate an unremoved collection artefact, and any temporal section spanning it (§ 6-7) would be suspect.
2. Semantic trajectory of "cbd" (the headline)¶
What this section does. The headline analysis: traces the embedded meaning of "cbd" year-by-year and asks whether it drifted away from the 2011 baseline. This is the central substantive claim of the case study.
Why this technique. semantic_trajectory embeds the contexts
of "cbd" (a ±5-token window around every occurrence) per period
using SBERT (sentence-transformers), Procrustes-aligns successive
periods so the year-to-year cosine distances are comparable, and
reports cosine distance from a chosen baseline period. The cosine
distance is on a unit hypersphere where 0 = identical context
distributions and √2 = maximally different. Unlike per-token
frequency counts, this measures the company a word keeps — which
is what semantic drift looks like even when the surface token is
unchanged.
What success looks like. Distance from the 2011 baseline should rise over the decade, with the steepest segments where the cannabidiol sense surged: the 2014-2015 Charlotte's-Web epilepsy wave and the 2018-2019 post-Farm-Bill commercial boom. A monotone climb that flattens once cannabidiol saturates (2019-2021) is the expected shape.
Why the per-year subsample. Embedding 3.6M contexts via SBERT would take days. A 2,500-tweets-per-year stratified subsample keeps the SBERT cost bounded (~30 min wall-clock) without sacrificing temporal resolution.
Reading the chart. X = year, Y = cosine distance from 2011. Higher = more semantically different from the Central-Business- District era. The line should rise; the slope indicates when the drift was steepest.
# Per-year sample for the SBERT trajectory (bounded encoding cost).
TRAJ_PER_YEAR = 2500
rng_traj = np.random.default_rng(0)
per_year = []
for _, g in df.groupby('year'):
per_year.append(g.iloc[rng_traj.choice(len(g), size=min(TRAJ_PER_YEAR, len(g)), replace=False)])
traj_df = pd.concat(per_year).sort_values('date').reset_index(drop=True)
traj_corpus = pcd.from_dataframe(traj_df, text_col='text', meta_cols=('date', 'year'))
sbert = pcd.SBERTEmbedder()
sem = pcd.semantic_trajectory(
traj_corpus, target='cbd', time_col='date', freq='Y',
embedder=sbert, window=5, baseline_period='2011',
)
sem
Warning: You are sending unauthenticated requests to the HF Hub. Please set a HF_TOKEN to enable higher rate limits and faster downloads.
| period | target | n_contexts | similarity_to_baseline | distance_from_baseline | |
|---|---|---|---|---|---|
| 0 | 2011 | cbd | 2582 | 1.000000 | 0.000000 |
| 1 | 2012 | cbd | 2576 | 0.997632 | 0.002368 |
| 2 | 2013 | cbd | 2527 | 0.994057 | 0.005943 |
| 3 | 2014 | cbd | 2605 | 0.975814 | 0.024186 |
| 4 | 2015 | cbd | 2552 | 0.900842 | 0.099158 |
| 5 | 2016 | cbd | 2579 | 0.936194 | 0.063806 |
| 6 | 2017 | cbd | 2662 | 0.887183 | 0.112817 |
| 7 | 2018 | cbd | 2931 | 0.858786 | 0.141214 |
| 8 | 2019 | cbd | 2993 | 0.835272 | 0.164728 |
| 9 | 2020 | cbd | 3014 | 0.857811 | 0.142189 |
| 10 | 2021 | cbd | 2975 | 0.859945 | 0.140055 |
sem_plot = sem.copy()
sem_plot['year'] = sem_plot['period'].astype(str).astype(int)
# Drop the Period-dtype column: altair serialises every column to JSON for
# the spec, and pandas Period objects are not JSON-serialisable.
sem_plot = sem_plot.drop(columns=['period'])
alt.Chart(sem_plot).mark_line(point=True, strokeWidth=2, color='#264653').encode(
x=alt.X('year:O', title='year'),
y=alt.Y('distance_from_baseline:Q', title='cosine distance from 2011'),
tooltip=['year', 'distance_from_baseline', 'similarity_to_baseline', 'n_contexts'],
).properties(width=1100, height=400,
title='Semantic trajectory of "cbd" vs 2011 baseline (SBERT)')
Verdict. Distance rises from 0 at the 2011 baseline through the decade, with the steepest segments where the cannabidiol sense surged (2014-2015 and 2018-2019). The trajectory flattens post-2019 once the cannabidiol sense saturates — exactly the predicted shape.
Common misreadings to avoid.
- "The rise could just be Twitter-vocabulary drift, not semantic drift." The baseline is the same target token "cbd" in different years, with shared stop-words and a shared embedder. General Twitter vocabulary drift would affect both "cbd"-contexts and any other token's contexts; we're measuring the targeted drift of cbd's neighbourhood specifically.
- "Cosine distance isn't a magnitude." True — it's on a unit hypersphere [0, √2]. We report the relative ordering of years and the location of steepest jumps, not "the semantic drift was 0.18 units worth of change". §9.4 tests the monotonicity with Spearman rho on the post-baseline series.
- "SBERT might just be confused by the abbreviation." §3 (neighbourhood drift) shows the actual nearest-neighbour words per era — which any human can sanity-check. If SBERT were confused, the early-era neighbours would be incoherent. They aren't (see §3 output): early = {sydney, office, district, ...}.
Falsifier (pre-registered). Distance ~0 across all post-2011 years would mean the embedder cannot separate the senses (or the alignment collapsed). A trajectory that falls toward the 2011 baseline in 2019-2021 — i.e., "cbd" returning to a Central- Business-District meaning — would contradict every other section and the external record. Neither happens.
Where this fits. §2 establishes the headline claim quantitatively: the embedded meaning of "cbd" drifted away from 2011. §3 makes the claim concrete by listing the words that arrived (or left) in cbd's neighbourhood. §4-§5 then use a different statistic (Dunning G² keyness) to confirm the same shift; §6-§8 use rate-based methods (burstiness, causal_impact, collocation) to confirm yet again. Multiple independent statistics agreeing is the methodological backbone of the case study.
3. Neighbourhood drift: 2011-12 vs 2019-20¶
What this section does. Makes the §2 trajectory concrete: lists the actual words that sat next to "cbd" in 2011-12 and 2019-20. This is the human-readable check on §2 — if the embedded-meaning trajectory is real, the early-era neighbours should be Australian business-district vocabulary and the late-era neighbours should be cannabidiol-commerce vocabulary.
Why this technique. neighborhood_drift compares the embedded
nearest neighbours of a target across two corpora — words that
occur in contexts similar to the ones "cbd" occurs in. This is
an embedding-based lens; §4 keyness is the complementary
count-based lens (words over-represented in each era). They answer
slightly different questions, and the contrast is instructive: a
word can be a "neighbour" of cbd (occurs in similar contexts) even
when it's not over-represented relative to a contrast corpus.
The shared stop set. We also define here the stop-word list used
for neighbour filtering AND for keyness tables in §4-§5 and §8:
ordinary English function words + Twitter markup (handles, rt,
URL fragments). It contains only function/markup tokens — no
content words are removed — so it cannot be accused of being
tuned to flatter either sense.
What success looks like. Early-era neighbours mostly Australian business-district vocabulary (sydney, melbourne, office, traffic, parking, district, jobs, lease, etc.). Late-era neighbours mostly cannabidiol vocabulary (oil, hemp, gummies, anxiety, sleep, wellness, vape, etc.). Minimal overlap.
Reading the output. Two tables: early-only neighbours (2011-12) and late-only neighbours (2019-20). Each row is a word + its similarity-to-cbd score. The status column indicates which era the neighbour appears in.
# Shared stop set: English function words + Twitter markup only.
# Deliberately excludes content words (so e.g. 'area', 'high' survive).
TWITTER_STOP = {
'the', 'a', 'to', 'of', 'and', 'in', 'is', 'it', 'for', 'on', 'with',
'at', 'this', 'i', 'you', 'my', 'rt', 'amp', 's', 't', 'co', 'http',
'https', 'that', 'was', 'be', 'are', 'as', 'your', 'me', 'we', 'our',
'from', 'by', 'an', 'or', 'but', 'not', 'have', 'has', 'will', 'just',
'can', 'get', 'all', 'so', 'out', 'up', 'if', 'they', 'he', 'she',
'do', 'no', 'new', 'via', 'us', 'im', 'dont', 'u', 're', 'about',
'into', 'more', 'what', 'how', 'when', 'who', 'over', 'within',
'which', 'their', 'them', 'been', 'were', 'had', 'would', 'could',
'should', 'than', 'then', 'there', 'does', 'also', 'very', 'too',
# Additional unambiguous function / conversational words
# (added in the polish pass after inspecting drift + keyness tables;
# no content words added).
'some', 'well', 'like', 'see', 'look', 'want', 'think', 'know',
'need', 'around', 'between', 'before', 'after', 'where', 'why',
'much', 'most', 'way', 'back', 'ever', 'still', 'even', 'only',
'right', 'off', 'here', 'now', 'today', 'every', 'one', 'two',
'first', 'last', 'next', 'going', 'let', 'made', 'make', 'take',
'give', 'say', 'said', 'tell', 'show', 'put', 'come', 'came',
'youre', 'youll', 'thats', 'its',
}
def era(y0, y1, n=5000):
sub = df[(df.year >= y0) & (df.year <= y1)]
idx = np.random.default_rng(0).choice(len(sub), size=min(n, len(sub)), replace=False)
return pcd.from_dataframe(sub.iloc[idx], text_col='text', meta_cols=('date', 'year'))
drift = pcd.neighborhood_drift(
era(2011, 2012), era(2019, 2020), target='cbd', k=25,
embedder=pcd.SBERTEmbedder(), window=5, min_count=6,
)
# Display the stop-filtered view (function words removed). The full
# `drift` table is retained for the content-split cell below.
drift[~drift['neighbor'].isin(TWITTER_STOP)].reset_index(drop=True)
| neighbor | sim_a | sim_b | rank_a | rank_b | drift | status | |
|---|---|---|---|---|---|---|---|
| 0 | cbdoil | NaN | 0.946064 | NaN | 1.0 | -0.946064 | lost_in_a |
| 1 | cbg | NaN | 0.924180 | NaN | 8.0 | -0.924180 | lost_in_a |
| 2 | cbdgummies | NaN | 0.919826 | NaN | 13.0 | -0.919826 | lost_in_a |
| 3 | cbdflowers | NaN | 0.912974 | NaN | 19.0 | -0.912974 | lost_in_a |
| 4 | infused | NaN | 0.907777 | NaN | 24.0 | -0.907777 | lost_in_a |
| 5 | close | 0.893692 | 0.449886 | 21.0 | NaN | 0.443805 | gained_in_a |
| 6 | oil | 0.504162 | 0.912993 | NaN | 18.0 | -0.408831 | lost_in_a |
| 7 | area | 0.935740 | 0.614537 | 7.0 | NaN | 0.321203 | gained_in_a |
| 8 | products | 0.710799 | 0.916340 | NaN | 16.0 | -0.205541 | lost_in_a |
| 9 | through | 0.892050 | 0.731868 | 24.0 | NaN | 0.160182 | gained_in_a |
| 10 | using | 0.778576 | 0.912373 | NaN | 20.0 | -0.133797 | lost_in_a |
| 11 | any | 0.809519 | 0.909391 | NaN | 22.0 | -0.099872 | lost_in_a |
| 12 | full | 0.903896 | 0.892368 | 18.0 | NaN | 0.011528 | gained_in_a |
Split the neighbours by era, applying the shared stop set identically to both sides:
# status: gained_in_a -> early-only (2011-12); lost_in_a -> late-only (2019-20)
d = drift[~drift['neighbor'].isin(TWITTER_STOP)].copy()
early_nb = (d[d['status'] == 'gained_in_a']
.sort_values('sim_a', ascending=False)['neighbor'].head(12).tolist())
late_nb = (d[d['status'] == 'lost_in_a']
.sort_values('sim_b', ascending=False)['neighbor'].head(12).tolist())
shared_nb = d[d['status'] == 'shared']['neighbor'].tolist()
print('Early-only (2011-12):', early_nb)
print('Late-only (2019-20):', late_nb)
print('Shared :', shared_nb)
print(f'\nContent-neighbour overlap: {len(shared_nb)}')
Early-only (2011-12): ['area', 'full', 'close', 'through'] Late-only (2019-20): ['cbdoil', 'cbg', 'cbdgummies', 'products', 'oil', 'cbdflowers', 'using', 'any', 'infused'] Shared : [] Content-neighbour overlap: 0
Validation. The two neighbourhoods barely overlap, and neither is a
mix of senses. The late (2019-20) neighbours are unmistakably
cannabidiol commerce (cbdoil, cbg, cbdgummies, cbdvape,
products, oil). The early (2011-12) neighbours are positional and
generic (locational and adjectival fragments) rather than a distinctive
lexicon — and that asymmetry is itself informative: the
Central-Business-District sense lives in phrase tails ("Sydney CBD",
"Johannesburg CBD", "in the CBD"), so cbd's embedding neighbours there
are locational/function words, not a rich vocabulary. The crisp district
lexicon (sydney/melbourne/jobs for Australia,
cape town/johannesburg/pretoria/durban for South Africa — the latter
surfaced as a distinct topic by § 10 BERTopic and folded into § 6b's
decline detection) is recovered instead by the count-based keyness in
§ 4 and the topic clustering in § 10 — the three lenses are
complementary.
Falsifier. Substantial overlap between the early and late neighbour sets, or an early neighbourhood already dominated by cannabidiol terms, would mean no sense change occurred. Neither holds: overlap is near zero and the early side carries no cannabidiol vocabulary.
4. Keyness: early era vs late era¶
What this section does. Identifies the terms that most distinguish the 2011-12 corpus slice from the 2019-20 slice, using signed Dunning G² (log-likelihood). Each term gets a G² magnitude (how distinctive) and a log-ratio sign (positive = early-distinctive, negative = late-distinctive).
Why this technique. §3 used embedding-based neighbours (words appearing in similar contexts); §4 uses count-based distinctiveness (words over-represented in one era vs the other). The two answer slightly different questions. A keyness contrast surfaces the top-ranked vocabulary cleanly — easier to interpret than embedded neighbours for non-NLP readers.
What success looks like. Top early-distinctive terms should name the Central Business District sense (Australian cities, jobs, real estate, traffic). Top late-distinctive terms should name cannabidiol (oil, hemp, dosage, products, gummies, anxiety). The split should be near-total — no business-district terms surfacing as late-distinctive, no cannabidiol terms surfacing as early-distinctive.
Reading the output. Sorted table by |G²|; the bar chart shows the
top-15 per direction with positive bars = early-distinctive (red)
and negative bars = late-distinctive (teal). The count_a /
count_b columns show absolute frequencies in each era.
early_s = corpus.slice(year=[2011, 2012])
late_s = corpus.slice(year=[2019, 2020])
print(f'early(2011-12)={len(early_s)} docs, late(2019-20)={len(late_s)} docs')
ekey = pcd.compare(early_s, late_s).keyness(
min_count=20, formula='dunning', stop_words=TWITTER_STOP,
multiple_comparisons='bh',
)
ekey.to_df().head(15)[['term', 'count_a', 'count_b', 'g2', 'log_ratio']]
early(2011-12)=36000 docs, late(2019-20)=31178 docs
| term | count_a | count_b | g2 | log_ratio | |
|---|---|---|---|---|---|
| 0 | sydney | 6476 | 277 | 8791.025526 | 4.914845 |
| 1 | jobs | 4938 | 240 | 6548.513396 | 4.730154 |
| 2 | hemp | 38 | 4674 | -4990.885391 | -6.553620 |
| 3 | oil | 120 | 4721 | -4498.150637 | -4.921950 |
| 4 | cannabis | 104 | 4630 | -4491.421680 | -5.099403 |
| 5 | melbourne | 3264 | 211 | 4076.359399 | 4.318322 |
| 6 | buycbd | 0 | 3429 | -3937.125298 | -12.373593 |
| 7 | viewed | 8 | 3426 | -3833.932310 | -8.284867 |
| 8 | hits | 47 | 3454 | -3545.050737 | -5.814216 |
| 9 | total | 106 | 3616 | -3364.174692 | -4.715479 |
| 10 | cbdedibles | 0 | 2870 | -3294.319120 | -12.116897 |
| 11 | cbdstore | 0 | 2851 | -3272.477270 | -12.107316 |
| 12 | cbdcandy | 0 | 2796 | -3209.253327 | -12.079217 |
| 13 | 00 | 135 | 3460 | -3045.067729 | -4.304426 |
| 14 | brisbane | 2115 | 69 | 2980.239276 | 5.298032 |
ekey.plot(kind='bar', n=15).properties(
width=1100, title='Early (2011-12) vs late (2019-20): distinctive "CBD" vocabulary')
Validation. Early-distinctive terms should name the Central Business District sense (Australian cities, jobs, real estate); late- distinctive terms should name cannabidiol (oil, hemp, dosage, products). The split should be near-total.
Falsifier. Mirrored senses on either side, or business-district terms surfacing as late-distinctive, would undercut the sense-change claim.
4a. Bootstrap confidence intervals on § 4 keyness¶
What this section does. Adds uncertainty quantification to the §4 keyness table. Bootstraps the (early vs late) contrast B=499 times, computing per-term 95% percentile CIs AND Westfall-Young simultaneous max-T CIs (which control family-wise error across the entire ~9,500-term vocabulary, not just the top-15).
Why this technique. §4 reports point-estimate G² — treats the observed counts as the population. But our corpora are samples; the bootstrap quantifies how much of the apparent ranking is robust to document-level resampling. A term with a per-term CI excluding zero is individually significant; one with a simultaneous max-T CI excluding zero is robust to the multiple-testing correction across the entire ~9,500-term keyness table.
What success looks like. ≥ 10 of the top-15 terms (by |G²|) have per-term 95% CIs excluding zero. The simultaneous max-T CI excludes zero for at least a handful of headline terms — those are the most-defensible per-term claims.
Reading the output. Same as §4 but with two extra column pairs:
g2_ci_lower / g2_ci_upper (per-term) and
g2_ci_lower_simultaneous / g2_ci_upper_simultaneous (max-T). The
summary lines below count how many of the top-15 survive each CI
floor.
ekey_ci = pcd.compare(early_s, late_s).keyness(
min_count=20, formula='dunning', stop_words=TWITTER_STOP,
multiple_comparisons='bh',
ci='bootstrap', n_boot=499, simultaneous_ci=True,
bootstrap_seed=0,
)
_ek_ci_df = ekey_ci.to_df()
print(ekey_ci.summary())
_cols = ['term', 'count_a', 'count_b', 'g2',
'g2_ci_lower', 'g2_ci_upper',
'g2_ci_lower_simultaneous', 'g2_ci_upper_simultaneous']
_present = [c for c in _cols if c in _ek_ci_df.columns]
_ek_ci_df.head(15)[_present]
KeynessResult(log_likelihood, |a|=560,175, |b|=724,039, terms=5,072)
| term | count_a | count_b | g2 | g2_ci_lower | g2_ci_upper | g2_ci_lower_simultaneous | g2_ci_upper_simultaneous | |
|---|---|---|---|---|---|---|---|---|
| 0 | sydney | 6476 | 277 | 8791.025526 | 8319.902175 | 8965.854338 | 7142.667449 | 10148.877372 |
| 1 | jobs | 4938 | 240 | 6548.513396 | 6186.297379 | 6707.721707 | 5203.197827 | 7687.276078 |
| 2 | hemp | 38 | 4674 | -4990.885391 | -5291.029615 | -4897.611313 | -6011.510056 | -4186.478762 |
| 3 | oil | 120 | 4721 | -4498.150637 | -4802.914425 | -4402.509882 | -5545.599970 | -3663.953622 |
| 4 | cannabis | 104 | 4630 | -4491.421680 | -4819.881723 | -4406.857211 | -5546.583424 | -3659.092618 |
| 5 | melbourne | 3264 | 211 | 4076.359399 | 3768.750805 | 4241.578871 | 2871.196367 | 5134.119378 |
| 6 | buycbd | 0 | 3429 | -3937.125298 | -4144.475481 | -3879.977922 | -4620.390765 | -3410.633007 |
| 7 | viewed | 8 | 3426 | -3833.932310 | -4038.285053 | -3772.547880 | -4568.869600 | -3253.561883 |
| 8 | hits | 47 | 3454 | -3545.050737 | -3782.382555 | -3456.424955 | -4348.319733 | -2899.549204 |
| 9 | total | 106 | 3616 | -3364.174692 | -3608.617561 | -3268.916893 | -4236.374901 | -2644.852950 |
| 10 | cbdedibles | 0 | 2870 | -3294.319120 | -3469.684717 | -3238.826049 | -3921.461789 | -2797.139089 |
| 11 | cbdstore | 0 | 2851 | -3272.477270 | -3450.808554 | -3209.660051 | -3897.836519 | -2774.858786 |
| 12 | cbdcandy | 0 | 2796 | -3209.253327 | -3383.830758 | -3148.394208 | -3831.076163 | -2713.406804 |
| 13 | 00 | 135 | 3460 | -3045.067729 | -3292.796224 | -2955.902355 | -3916.418644 | -2323.261218 |
| 14 | brisbane | 2115 | 69 | 2980.239276 | 2718.997091 | 3117.844011 | 1993.743215 | 3848.848623 |
# Per-term + simultaneous CIs on the top-15 by |G^2|.
import altair as _alt_4a
_top = _ek_ci_df.head(15).copy()
_top['abs_g2'] = _top['g2'].abs()
_top = _top.sort_values('abs_g2', ascending=False)
# Stack two error layers: per-term (narrow), simultaneous (wide).
_base = _alt_4a.Chart(_top).encode(
y=_alt_4a.Y('term:N', sort='-x', title=None),
)
_pt = _base.mark_rule(strokeWidth=3, color='#264653').encode(
x=_alt_4a.X('g2_ci_lower:Q', title='G² with CI (per-term inner, simultaneous outer)'),
x2='g2_ci_upper:Q',
)
_sim = _base.mark_rule(strokeWidth=1, color='#a8dadc', opacity=0.7).encode(
x='g2_ci_lower_simultaneous:Q',
x2='g2_ci_upper_simultaneous:Q',
) if 'g2_ci_lower_simultaneous' in _top.columns else _base.mark_text(text='')
_pt_mark = _base.mark_point(filled=True, size=80, color='#e76f51').encode(x='g2:Q')
(_sim + _pt + _pt_mark).properties(width=1100, height=420,
title='Top-15 |G²| with bootstrap CIs (inner dark = per-term percentile, outer light = simultaneous max-T)')
Verdict. Any term whose per-term CI excludes zero is at minimum
individually significant; one whose simultaneous max-T CI excludes
zero is significant after controlling family-wise error across the
entire keyness table. The simultaneous CIs are very conservative; we
expect only the largest-G² terms (e.g., sydney, melbourne, oil,
hemp, buycbd) to survive that bar.
Falsifier. A top-rank term whose per-term CI crosses zero would undermine its appearance on the §4 chart — the BH-adjusted p-value alone would be claiming significance that the doc-resampling distribution does not support.
4b. Clustered bootstrap by username¶
What this section does. Re-does §4a's bootstrap, but resampling at the account level (whole accounts with replacement, taking all their tweets) rather than the tweet level. This honours within-account correlation: if @username writes 500 nearly-identical commercial-CBD tweets, treating each as an independent observation overstates the effective sample size.
Why this technique. §9.2 documents that 6.2% of the sample comes from the 10 most-prolific accounts, and 4 of the top-10 §4 keyness terms drop out when those accounts are removed. A standard doc-bootstrap treats each tweet as IID; if the same account writes many tweets, this understates the true sampling variance. A cluster-bootstrap on username addresses that directly.
What success looks like. Clustered CI widths ≥ doc-bootstrap CI widths for the same top-15 terms (clustered should be wider when within-user correlation is non-trivial). Width ratio median > 1.0 indicates real account-level dependency; < 1.0 would be a sanity- check failure (clustering should not reduce uncertainty under any standard CSS interpretation).
Reading the output. A second keyness table with g2_ci_lower / g2_ci_upper columns from the cluster-bootstrap, followed by a
width-ratio table comparing clustered widths to §4a's doc-bootstrap
widths on the same terms.
ekey_cluster = pcd.compare(early_s, late_s).keyness(
min_count=20, formula='dunning', stop_words=TWITTER_STOP,
multiple_comparisons='bh',
ci='bootstrap', n_boot=299, cluster_col='username',
bootstrap_seed=0,
)
_ek_cl_df = ekey_cluster.to_df()
print(ekey_cluster.summary())
_cols_cl = ['term', 'count_a', 'count_b', 'g2', 'g2_ci_lower', 'g2_ci_upper']
_present_cl = [c for c in _cols_cl if c in _ek_cl_df.columns]
_ek_cl_df.head(15)[_present_cl]
KeynessResult(log_likelihood, |a|=560,175, |b|=724,039, terms=5,072)
| term | count_a | count_b | g2 | g2_ci_lower | g2_ci_upper | |
|---|---|---|---|---|---|---|
| 0 | sydney | 6476 | 277 | 8791.025526 | 6322.727254 | 11817.230537 |
| 1 | jobs | 4938 | 240 | 6548.513396 | 3113.472491 | 11226.768091 |
| 2 | hemp | 38 | 4674 | -4990.885391 | -6143.979002 | -4113.217065 |
| 3 | oil | 120 | 4721 | -4498.150637 | -5192.225007 | -3965.294469 |
| 4 | cannabis | 104 | 4630 | -4491.421680 | -6204.832981 | -3196.982839 |
| 5 | melbourne | 3264 | 211 | 4076.359399 | 2869.996760 | 5587.549293 |
| 6 | buycbd | 0 | 3429 | -3937.125298 | -10289.547037 | -9.312314 |
| 7 | viewed | 8 | 3426 | -3833.932310 | -10179.106416 | 12.318849 |
| 8 | hits | 47 | 3454 | -3545.050737 | -9804.395681 | 15.875565 |
| 9 | total | 106 | 3616 | -3364.174692 | -9673.668276 | 58.059753 |
| 10 | cbdedibles | 0 | 2870 | -3294.319120 | -8473.816041 | -73.099120 |
| 11 | cbdstore | 0 | 2851 | -3272.477270 | -8461.375517 | -50.613511 |
| 12 | cbdcandy | 0 | 2796 | -3209.253327 | -8401.584974 | 0.000000 |
| 13 | 00 | 135 | 3460 | -3045.067729 | -8937.664968 | 26.193122 |
| 14 | brisbane | 2115 | 69 | 2980.239276 | 2032.693507 | 4193.144759 |
# Compare clustered-bootstrap CI widths to doc-bootstrap CI widths on the
# same top-15 terms. Clustered should be >= doc-bootstrap; if clustered
# is much wider, the §4 evidence is partly within-account artefact.
_doc_widths = (ekey_ci.to_df().head(15)
.assign(width=lambda d: d['g2_ci_upper'] - d['g2_ci_lower'])
[['term', 'width']].rename(columns={'width': 'doc_bootstrap_width'}))
_cl_widths = (ekey_cluster.to_df().head(15)
.assign(width=lambda d: d['g2_ci_upper'] - d['g2_ci_lower'])
[['term', 'width']].rename(columns={'width': 'clustered_width'}))
_w = _doc_widths.merge(_cl_widths, on='term', how='inner')
_w['width_ratio'] = (_w['clustered_width'] / _w['doc_bootstrap_width']).round(2)
print(f"\nCI width ratio (clustered / doc-bootstrap):")
print(f" median: {_w['width_ratio'].median():.2f}")
print(f" min: {_w['width_ratio'].min():.2f} | max: {_w['width_ratio'].max():.2f}")
print(f" ratio > 1 means clustered is wider (within-user correlation present)")
_w
CI width ratio (clustered / doc-bootstrap): median: 26.61 min: 3.06 | max: 38.87 ratio > 1 means clustered is wider (within-user correlation present)
| term | doc_bootstrap_width | clustered_width | width_ratio | |
|---|---|---|---|---|
| 0 | sydney | 645.952163 | 5494.503282 | 8.51 |
| 1 | jobs | 521.424328 | 8113.295600 | 15.56 |
| 2 | hemp | 393.418302 | 2030.761937 | 5.16 |
| 3 | oil | 400.404543 | 1226.930538 | 3.06 |
| 4 | cannabis | 413.024511 | 3007.850142 | 7.28 |
| 5 | melbourne | 472.828066 | 2717.552532 | 5.75 |
| 6 | buycbd | 264.497559 | 10280.234723 | 38.87 |
| 7 | viewed | 265.737173 | 10191.425265 | 38.35 |
| 8 | hits | 325.957600 | 9820.271246 | 30.13 |
| 9 | total | 339.700669 | 9731.728029 | 28.65 |
| 10 | cbdedibles | 230.858668 | 8400.716922 | 36.39 |
| 11 | cbdstore | 241.148502 | 8410.762005 | 34.88 |
| 12 | cbdcandy | 235.436551 | 8401.584974 | 35.69 |
| 13 | 00 | 336.893869 | 8963.858090 | 26.61 |
| 14 | brisbane | 398.846920 | 2160.451252 | 5.42 |
Verdict. A width-ratio median > 1 (clustered wider than IID) is the expected sign that account-level correlation is non-trivial. A ratio near 1 means tweets within an account behave essentially independently for the §4 contrast. A ratio < 1 is a sanity-check failure: clustering should not reduce uncertainty under any standard CSS interpretation.
Falsifier. Clustered CIs that straddle zero for terms whose doc-bootstrap CI did not (i.e., terms whose individual significance relied on treating same-account tweets as independent) would indicate the §4 effect is partly an account-pseudo-replication artefact rather than a population-level discourse signal.
5. Keyness before vs after the 2018 Farm Bill¶
What this section does. Tests whether the federal legalisation of hemp-derived CBD (Agriculture Improvement Act, signed 2018-12-20) shifted the vocabulary of CBD-related tweets toward commerce and product framing. Uses the same keyness machinery as §4, but the contrast is now (pre-Bill) vs (post-Bill).
Why this technique. Splitting the corpus at a specific dated event lets us connect the linguistic shift to a specific regulatory anchor. If the vocabulary shifted at this date, the keyness contrast surfaces the words that drove the shift. §7's causal-impact analysis then quantifies whether the shift produced a structural break in the time series.
What success looks like. Pre-Bill distinctive terms should retain the older district-era mix (Australian cities + cannabidiol but without the commerce vocabulary). Post-Bill distinctive terms should turn commercial / product / e-commerce ("buy", "store", "edibles", "mg" dosages, hashtag-driven retail).
Window-length asymmetry caveat. The pre-Bill window spans 95 months (2011-2018) vs only 33 months post (2018-2021) in the corpus. Seven years of district-era data sit on the pre side and inflate the apparent pre-Bill distinctiveness of Australian business-district vocabulary. §5b uses matched 23-month windows symmetric around the Bill to isolate the local effect; treat §5 as the long-window view and §5b as the local-around-the-event view.
Reading the output. Same table structure as §4: term + counts
- G² + log-ratio. Pre-Bill-distinctive terms have positive log-ratio (top of the table when sorted by signed log-ratio); post-Bill- distinctive terms have negative log-ratio (bottom).
ba = pcd.compare.before_after(corpus, event_date='2018-12-20').keyness(
min_count=20, formula='dunning', stop_words=TWITTER_STOP,
multiple_comparisons='bh',
)
print(ba.summary())
ba.to_df().head(15)[['term', 'count_a', 'count_b', 'g2', 'log_ratio']]
KeynessResult(log_likelihood, |a|=2,209,477, |b|=1,027,029, terms=9,566)
| term | count_a | count_b | g2 | log_ratio | |
|---|---|---|---|---|---|
| 0 | sydney | 14442 | 462 | 7993.461466 | 3.859495 |
| 1 | buycbd | 9 | 3430 | -7763.758596 | -9.601504 |
| 2 | hits | 159 | 3483 | -6818.137789 | -5.554137 |
| 3 | cbdedibles | 65 | 3097 | -6531.860064 | -6.668694 |
| 4 | cbdstore | 44 | 2883 | -6200.681679 | -7.123100 |
| 5 | viewed | 343 | 3444 | -5873.820416 | -4.431141 |
| 6 | total | 973 | 3999 | -5015.449394 | -3.143795 |
| 7 | customer | 656 | 3516 | -4949.043746 | -3.526501 |
| 8 | jobs | 8600 | 288 | 4694.261556 | 3.792549 |
| 9 | cbdcandy | 305 | 2796 | -4662.522859 | -4.299606 |
| 10 | 00 | 847 | 3517 | -4431.936615 | -3.158493 |
| 11 | melbourne | 8474 | 355 | 4315.974254 | 3.469979 |
| 12 | cbd | 146626 | 51624 | 3267.090753 | 0.400790 |
| 13 | mg | 470 | 2067 | -2674.377985 | -3.240849 |
| 14 | brisbane | 4257 | 116 | 2447.760154 | 4.086377 |
ba.plot(kind='bar', n=15).properties(
width=1100,
title='Pre-Bill vs post-Bill (2018-12-20): distinctive vocabulary (Dunning G^2)')
Observed. Pre-Bill distinctive terms (sydney, jobs, melbourne, mme) are the persistent Australian Central-Business-District sense.
Note, though, that the pre-Bill window spans 95 months (2011-2018)
versus only 33 months post (2018-2021), so seven years of older
district-era data sit on the pre side and dominate. Post-Bill
distinctive terms (buycbd, hits, cbdedibles, cbdstore, viewed, customer, total, cbdcandy, 00, mg) are largely e-commerce platform
metadata — product-listing auto-tweets ("Hits: N", "Viewed N times",
"Total: $X.00", "1000mg") and hashtag-driven commerce
(#buycbd, #cbdedibles, #cbdstore, #cbdcandy). § 9.2's top-K account
drop confirmed several of these are concentrated in a small number of
e-commerce broadcaster accounts; the substantive district↔cannabidiol
split is robust to dropping them but those specific commerce hashtags
are not.
The pre/post window-length asymmetry is a real confound: any change that is just more recent data (not necessarily Bill-induced) will look post-Bill distinctive. § 5b uses matched 23-month windows symmetric around the Bill to isolate the local effect.
Validation. Post-Bill distinctive vocabulary should turn commercial/product/e-commerce (which it does); pre-Bill should retain the older mix.
Falsifier. No commercial/product shift post-Bill — i.e., the before and after vocabularies look the same — would mean the Farm Bill date left no lexical trace in this corpus.
5b. Matched-window before/after the Farm Bill¶
What this section does. Re-runs §5's keyness on matched 23-month windows symmetric around the 2018-12-20 Bill — pre = 2017-01 to 2018-11, post = 2019-01 to 2020-11 — so window length is equalised on both sides.
Why this technique. §5's pre-Bill window covers 95 months of data; post-Bill covers 33 months. Any "shift" that's really just more recent data (general decade-trend, not necessarily Bill-induced) will look post-Bill distinctive under §5's lopsided windows. Equalising window lengths around the event date isolates the local effect — what changed in the 23 months either side of the Bill — from the long-term trend.
What success looks like. Post-Bill commerce/product vocabulary remains distinctive even under matched windows. If it disappears under matched windows, the §5 commerce signal was a long-term-trend artefact and the Bill itself produced minimal local lexical shift.
Reading the output. Same as §5 but with the matched-window filter applied first. Compare the top-15 pre/post terms here against §5's top-15 to see which terms survive window-length normalisation.
mw_pre_df = sample[(sample['date'] >= '2017-01-01') &
(sample['date'] < '2018-12-01')]
mw_post_df = sample[(sample['date'] >= '2019-01-01') &
(sample['date'] < '2020-12-01')]
mw_pre = pcd.from_dataframe(mw_pre_df, text_col='text',
meta_cols=('date', 'year', 'year_month'))
mw_post = pcd.from_dataframe(mw_post_df, text_col='text',
meta_cols=('date', 'year', 'year_month'))
print(f'Matched pre (2017-01..2018-11): {len(mw_pre):>6,} docs')
print(f'Matched post (2019-01..2020-11): {len(mw_post):>6,} docs')
ba_mw = pcd.compare(mw_pre, mw_post).keyness(
min_count=20, formula='dunning', stop_words=TWITTER_STOP,
multiple_comparisons='bh',
)
print(ba_mw.summary())
ba_mw.to_df().head(15)[['term', 'count_a', 'count_b', 'g2', 'log_ratio']]
Matched pre (2017-01..2018-11): 31,817 docs Matched post (2019-01..2020-11): 31,084 docs
KeynessResult(log_likelihood, |a|=592,248, |b|=721,861, terms=4,935)
| term | count_a | count_b | g2 | log_ratio | |
|---|---|---|---|---|---|
| 0 | buycbd | 8 | 3429 | -4015.582400 | -8.370800 |
| 1 | hits | 66 | 3454 | -3595.120522 | -5.413459 |
| 2 | cbdstore | 41 | 2851 | -3055.805873 | -5.816946 |
| 3 | cbdedibles | 63 | 2870 | -2935.471613 | -5.212882 |
| 4 | customer | 315 | 3460 | -2486.061634 | -3.169749 |
| 5 | viewed | 331 | 3426 | -2398.122213 | -3.084135 |
| 6 | 00 | 484 | 3456 | -1981.780104 | -2.549224 |
| 7 | cbdcandy | 305 | 2796 | -1846.245355 | -2.908858 |
| 8 | total | 761 | 3616 | -1506.108062 | -1.962159 |
| 9 | mg | 369 | 1975 | -914.980008 | -2.133052 |
| 10 | akl | 574 | 1 | 901.739033 | 8.866720 |
| 11 | mmj | 1205 | 290 | 798.169760 | 2.338541 |
| 12 | cbdinfo | 545 | 12 | 767.517970 | 5.733099 |
| 13 | cbddiscounts | 0 | 621 | -744.300160 | -9.994091 |
| 14 | cbdedible | 2 | 621 | -720.527058 | -7.672163 |
ba_mw.plot(kind='bar', n=15).properties(
width=1100,
title='Matched 23-month windows around 2018-12-20: distinctive vocabulary')
Validation (§ 5b). With the district-era data removed by symmetry, post-Bill distinctive terms should still be commercial/product/e-commerce markers but no longer dominated by Aussie real estate. If the matched- window comparison still shows sydney/melbourne/jobs as pre-Bill distinctive at high G², the cannabidiol-commerce wave was already underway before the Bill (consistent with § 6's 2016Q4 burst onset and § 7's null at the Bill date) — the Bill did not carve a sharp local lexical boundary.
Falsifier. A matched-window comparison that erases the pre/post distinction entirely would mean the Bill date itself doesn't carve the lexicon — the change is purely a long-term trend.
6. Burstiness of the cannabidiol-commerce signal¶
What this section does. Tracks the per-quarter rate of "oil" (the canonical CBD-product token) within the corpus and runs Kleinberg burst detection on the resulting time series. Bursts mark periods where the rate is significantly above baseline — i.e., when cannabidiol-product framing emerged and accelerated.
Why this technique. §2-§5 establish the direction of the sense shift (district → cannabidiol). §6 nails down the timing of the commercial sub-phase via a rate-based statistic that's independent of any specific keyness or embedding choice. The Kleinberg model treats the count series as emissions from a hidden state machine that switches between low-rate baseline and higher-rate burst states; the output is a per-period state assignment.
Why "oil" specifically. Of the cannabidiol-commerce vocabulary ("oil", "hemp", "gummies", "edibles", "vape", "tincture"), "oil" is the highest-volume + earliest-emerging product token. Tracking its rate gives the cleanest single-token signal for when the commercial framing took off.
What success looks like. The pre-registered window is 2014 OR 2018Q4-2019 — the documented commercial inflection points (Charlotte's-Web epilepsy wave 2014; post-Farm-Bill commercial explosion 2019). Bursts in the cannabidiol era (post-2014), not the business-district era (2011-2013).
Critical for §7. The precise burst onset matters: if the burst begins well before the 2018-12-20 Farm Bill, the commercial framing led the legislation rather than following it, and the §7 causal-impact test keyed to the Bill date should return a null (the regime change happened earlier than the event date we hand- picked). This pre-registered prediction is tested in §7.
Reading the chart. Per-quarter "oil" rate over time + a colour- coded burst-state ribbon (grey = baseline, yellow → orange → red as the burst intensifies).
tr_oil = pcd.track(corpus, 'oil').over_time(freq='Q', time_col='date')
bursts = tr_oil.burstiness(s=2.0, gamma=1.0, n_states=4)
print(bursts.summary())
bursts.to_df()
BurstinessResult(target='oil', 2 burst(s), max intensity = state 1, base rate p_0 = 4.76e-03)
| start | end | n_periods | max_state | total_count | mean_relfreq | |
|---|---|---|---|---|---|---|
| 0 | 2016Q4 | 2019Q2 | 11 | 1 | 7971 | 0.008938 |
| 1 | 2019Q4 | 2019Q4 | 1 | 1 | 359 | 0.007283 |
bursts.plot(width=1100, height=400)
Validation. Bursts in the rate of oil should fall in the cannabidiol era, not the business-district era. The pre-registered window is 2014 OR 2018Q4-2019. The precise burst onset also matters for § 7: if the burst begins well before the 2018-12-20 Farm Bill, the commercial framing led the legislation rather than following it, and a causal-impact test keyed to the Bill date should then return a null.
Falsifier. Bursts concentrated only in 2011-2013 (the business-district era, when "oil" would be incidental) with none after 2018 would contradict the commercial-boom account.
6b. Decline of the district sense (opposite of burstiness)¶
What this section does. The cannabidiol sense rose (§6) — this section asks whether the Central-Business-District sense correspondingly fell. Applies Kleinberg's burst detector to a composite district-marker rate (sydney + melbourne + brisbane + perth + jobs + parking) so the detected "burst" window marks the dominance era of the district sense; the end of that window is the onset of decline.
Why this technique. Kleinberg's model detects rate elevations. We're flipping the question: where does the district-marker rate collapse? The detected burst window is the dominance era; the post-burst return-to-baseline is the decline. Same machinery, opposite reading.
What success looks like. District-marker dominance window ends before §6's cannabidiol-commerce burst onset (2016Q4-2019Q4). The two windows being disjoint = the corpus literally captures the sense transition. Overlap would mean the senses coexisted; reversed ordering would contradict the headline shift.
Marker-set discipline (pre-registered primary vs post-hoc enrichment). Two computations, kept honest about timing:
- Pre-registered (Australian-only):
sydney, melbourne, brisbane, perth, jobs, parking. Chosen at §0b time, before any topic-model exploration. Primary §6b result for the scoreboard. - Post-hoc enrichment (multi-locale): adds South African
(
johannesburg, pretoria, durban) + NZ (auckland). Added after §10 BERTopic surfaced an SA district topic and §5b matched-window keyness flaggedakl. Exploratory — reported as a robustness check, not the primary verdict.
If both produce essentially the same dominance window and Spearman ρ, the sense-transition finding is robust to which marker set you pick.
Reading the chart. Per-quarter district-marker rate over time with a colour-coded burst-state ribbon and shading on the dominance window. The rate should fall sharply after the dominance window ends.
Marker set — pre-registered primary vs post-hoc enrichment. We report two computations to keep them honest about timing:
- Pre-registered (Australian-only):
sydney, melbourne, brisbane, perth, jobs, parking. These were the markers chosen at §0b time, before any topic-model exploration. This is the primary §6b result that contributes to the scoreboard. - Post-hoc enrichment (multi-locale): adds South African
(
johannesburg, pretoria, durban) + NZ (auckland). These were added after §10 BERTopic surfaced an SA district topic and §5b matched-window keyness flaggedakl. They are exploratory; the ρ and dominance window are reported as a robustness check, not as the primary §6b verdict.
If both produce essentially the same dominance window and ρ, the sense-transition finding is robust to which marker set you pick.
import re as _re_d
from scipy.stats import spearmanr as _spr_d
def _compute_district_rate(markers, label):
rx = _re_d.compile(r'\b(?:' + '|'.join(markers) + r')\b', _re_d.IGNORECASE)
s = sample.copy()
s['has_district'] = s['text'].map(lambda t: 1 if rx.search(t) else 0)
s['period'] = s['date'].dt.to_period('Q')
df = (s.groupby('period')
.agg(count=('has_district', 'sum'), total=('text', 'size'))
.reset_index())
df['relfreq'] = df['count'] / df['total']
rho, p = _spr_d(range(len(df)), df['relfreq'])
states = pcd.kleinberg_bursts(df['count'].values, df['total'].values,
s=2.0, gamma=1.0, n_states=4)
df['state'] = states
dom = df[df['state'] >= 1]
win = (str(dom['period'].iloc[0]), str(dom['period'].iloc[-1])) if len(dom) else (None, None)
print(f'[{label}] markers={len(markers)} | rho={rho:+.3f} (p={p:.3g}) | '
f"dominance window: {win[0]} -> {win[1]}" if win[0] else
f'[{label}] markers={len(markers)} | rho={rho:+.3f} (p={p:.3g}) | no dominance window')
return df, rho, p, win
# --- PRIMARY (pre-registered): Australian-only ---
au_markers = ['sydney', 'melbourne', 'brisbane', 'perth', 'jobs', 'parking']
dist_rate_au, rho_au, p_au, win_au = _compute_district_rate(au_markers, 'PRE-REG Australian-only')
# --- POST-HOC enrichment: add SA + NZ ---
multi_markers = au_markers + ['johannesburg', 'pretoria', 'durban', 'auckland']
dist_rate_multi, rho_multi, p_multi, win_multi = _compute_district_rate(
multi_markers, 'POST-HOC multi-locale')
# Primary §6b uses the pre-registered Australian-only result.
dist_rate, rho_d, p_d, (win_start, win_end) = dist_rate_au, rho_au, p_au, win_au
print(f'\n§6 cannabidiol-commerce burst (target=oil): 2016Q4 -> 2019Q4')
if win_start is not None:
print(f'§6b district-sense dominance (pre-reg AU): {win_start} -> {win_end}')
print(f'Windows are {"disjoint" if win_end < "2016" else "overlapping"} -- the sense transition.')
dist_rate_au.head()
[PRE-REG Australian-only] markers=6 | rho=-0.940 (p=1.02e-20) | dominance window: 2011Q1 -> 2013Q3
[POST-HOC multi-locale] markers=10 | rho=-0.936 (p=3.1e-20) | dominance window: 2011Q1 -> 2013Q3 §6 cannabidiol-commerce burst (target=oil): 2016Q4 -> 2019Q4 §6b district-sense dominance (pre-reg AU): 2011Q1 -> 2013Q3 Windows are disjoint -- the sense transition.
| period | count | total | relfreq | state | |
|---|---|---|---|---|---|
| 0 | 2011Q1 | 1820 | 4500 | 0.404444 | 1 |
| 1 | 2011Q2 | 1484 | 4500 | 0.329778 | 1 |
| 2 | 2011Q3 | 1502 | 4500 | 0.333778 | 1 |
| 3 | 2011Q4 | 1523 | 4500 | 0.338444 | 1 |
| 4 | 2012Q1 | 1499 | 4500 | 0.333111 | 1 |
# Plot the district-marker rate over time + dominance shading
_plot_d = dist_rate.copy()
_plot_d['period_ts'] = _plot_d['period'].apply(lambda p: p.to_timestamp())
_plot_d = _plot_d.drop(columns=['period'])
alt.Chart(_plot_d).mark_area(line={'color': '#264653'}, color=alt.Gradient(
gradient='linear', stops=[
alt.GradientStop(color='#a8dadc', offset=0),
alt.GradientStop(color='#264653', offset=1),
], x1=1, x2=1, y1=1, y2=0)).encode(
x=alt.X('period_ts:T', title='quarter'),
y=alt.Y('relfreq:Q', title='district-marker rate', axis=alt.Axis(format='.2%')),
tooltip=['period_ts:T', 'count', 'total', 'relfreq', 'state'],
).properties(width=1100, height=380,
title='Decline of the district sense (PRE-REG: AU markers '
'sydney/melbourne/brisbane/perth/jobs/parking) per quarter')
Validation. A strongly negative Spearman rho (close to −1) quantifies the monotone decline; a Kleinberg dominance window concentrated in the early 2010s, followed by collapse to near-zero rates by 2019-2021, confirms the district sense yielded the corpus. The §6 cannabidiol-commerce burst (2016Q4-2019Q4) and the district dominance window should be disjoint — the sense transition.
Falsifier. A flat or rising district-marker rate, or a dominance window that extends into 2018+, would mean the district sense did not collapse — contradicting both § 4 keyness (where the district terms are heavily early-distinctive) and the semantic-shift narrative.
7. Causal impact at the 2018 Farm Bill¶
What this section does. Tests for a structural break in the cannabidiol-commerce rate at the exact date of the 2018 Farm Bill (2018-12-20), using a Bayesian structural-time-series (BSTS) causal-impact model. This is the section most likely to FAIL — and the pre-registered FAIL is a feature, not a bug.
Why this technique. Causal-impact models construct a counterfactual ("what would the post-event rate have been if the pre-event trend had continued?") from the pre-event series and compare it against the observed post-event rate. If the observed deviates from the counterfactual by a credibly-non-zero amount, that's evidence of a structural break at the event date.
Why we expect this to PARTIAL-or-FAIL. §6's burst-onset detection found the cannabidiol-commerce burst started in 2016Q4 — two full years before the December 2018 Farm Bill. If that's right, the commercial framing was already underway by 2017 and the Bill caused little local-date-specific lift. The causal-impact test keyed to 2018-12-20 should therefore return a small or null effect, and the §0b pre-registration anticipates exactly this outcome.
What success / failure looks like. A clean PASS requires a posterior probability of effect > 95% AND a relative effect magnitude > 5%. The pre-registered prediction is that this test FAILS under naive single-event specification, consistent with §6's burst-onset finding. §9.6a (event-date sensitivity) re-runs the test at several candidate dates to confirm the FAIL is robust.
Reading the output. The summary reports: relative effect, 95% credible interval on the effect, posterior probability of any effect. If the credible interval includes zero or the effect is small, the Bill date didn't cause a structural break — which is the honest finding given the §6 burst-onset evidence.
Falsifier (of the §0b prediction). A causal-impact test that returned a large, credible post-Bill effect would mean either §6's burst onset is wrong (the commercial framing really did wait for the Bill) or this BSTS model is misspecified. Either reading is informative; the §9.6 placebo sweep would adjudicate.
Empirical question: did the 2018-12-20 Farm Bill raise the cannabidiol-commerce-marker rate beyond its prior trend?
state-space counterfactual (MLE-fit local-linear-trend; the Brodersen
et al. 2015 framework with frequentist inference — no Bayesian prior
or MCMC. See pycorpdiff.temporal.causal_impact module docstring).
With 127 months of history the pre-event window is well above the
stability threshold.
tr_oil_m = pcd.track(corpus, 'oil').over_time(freq='M', time_col='date')
with warnings.catch_warnings():
warnings.simplefilter('ignore')
impact = tr_oil_m.causal_impact(event_date='2018-12-20', target='oil',
level=0.95, seed=0)
print(impact.summary())
impact.plot(width=1100, height_per_panel=240)
CausalImpactResult(target='oil', event=2018-12-20, pre=96, post=31) avg effect: +0.0004 per period (95% CI [-0.0119, +0.0121]) cumulative effect: +0.0078 relative effect: +7.2% vs counterfactual mean P(no effect): 0.986 (MC, MLE-conditional; not a Bayesian posterior)
Validation. If the Farm Bill accelerated the commercial framing, the post-event rate of oil should exceed the state-space counterfactual with a interval excluding zero. But § 6 already places the oil burst onset before the Bill, so the prior expectation is mixed: much of the commercial framing may have been priced in by the time the law passed, and the structural model cannot credit the Bill for a rise already underway.
Falsifier (and likely outcome). A interval straddling zero, or a negative point estimate, means the Bill left no rate increase beyond the trend already in motion. Given the early burst onset, that is the expected reading — the boom led the legislation. This is the pre-registered §7 prediction being falsified, recorded as such rather than rationalised away. Reported as found, either way.
8. Health-claim collocates of "cbd"¶
What this section does. Contrasts the collocates (words appearing within a ±5-token window) of "cbd" between 2011-12 and 2019-20, specifically looking for health-benefit / misinformation framings (pain, anxiety, sleep, cure, cancer).
Why this technique. §4 keyness identifies words that distinguish two corpora at the document level. §8 zooms in on a specific headword ("cbd") and asks which words sit adjacent to it differently. This is the lens for "what is cbd being claimed to do?" — which §4 can miss because the health-claim vocabulary may be common across corpora but only collocate with cbd specifically in the late era.
What success looks like. Late collocates dominated by health-and-product framing (oil, hemp, pain, anxiety, sleep, gummies) AND health-claim / misinformation terms (cure, cancer, inflammation). Early collocates dominated by district/location terms.
Reading the output. Top-15 collocate-shift table (sorted by
log-Dice magnitude difference between early and late). KWIC
evidence retrievable via shift.explain('term') for any collocate.
shift = pcd.compare(early_s, late_s).collocation_shift(
'cbd', window=5, min_count=10, measure='logDice')
print(shift.summary())
shift.to_df().head(15)
CollocationShiftResult(target='cbd', measure=logDice, window=5, collocates=5,143)
| collocate | count_a | count_b | score_a | score_b | shift | |
|---|---|---|---|---|---|---|
| 0 | cbdoil | 0 | 1509 | -1.178859 | 10.307670 | -11.486529 |
| 1 | gummy | 0 | 885 | -1.178898 | 9.572088 | -10.750986 |
| 2 | cbdedible | 0 | 674 | -1.178859 | 9.188945 | -10.367804 |
| 3 | cbddiscounts | 0 | 621 | -1.178859 | 9.070881 | -10.249740 |
| 4 | gummies | 0 | 587 | -1.178859 | 8.988877 | -10.167737 |
| 5 | buycbd | 0 | 630 | -1.178859 | 8.988427 | -10.167286 |
| 6 | cannabiscommunity | 0 | 544 | -1.178859 | 8.868561 | -10.047420 |
| 7 | vape | 0 | 488 | -1.178859 | 8.720250 | -9.899110 |
| 8 | 1000mg | 0 | 448 | -1.178859 | 8.607300 | -9.786159 |
| 9 | cbdstore | 0 | 456 | -1.178859 | 8.543192 | -9.722052 |
| 10 | cbdlife | 0 | 400 | -1.178859 | 8.439522 | -9.618381 |
| 11 | pets | 0 | 344 | -1.178937 | 8.228151 | -9.407088 |
| 12 | 500mg | 0 | 311 | -1.178859 | 8.087063 | -9.265922 |
| 13 | hempoil | 0 | 297 | -1.178859 | 8.014737 | -9.193596 |
| 14 | rndbt | 286 | 0 | 7.972451 | -1.184875 | 9.157326 |
shift.plot(n=12).properties(width=1100, title='"cbd" collocation shift: 2011-12 vs 2019-20')
Validation. Late collocates should include health-and-product
framing (oil, hemp, pain, anxiety, sleep, gummies) and, per the
pre-registration, health-claim/misinformation terms (cure, cancer).
KWIC evidence is retrievable via .explain() for any collocate.
Falsifier. Late collocates dominated by business-district terms, or the absence of any health/product framing, would contradict the documented wellness-and-claims wave.
9. Robustness & audit layer¶
What this section does. Stress-tests every analytical claim in §2-§8 with a method designed to break it. Each sub-section attacks one specific claim along one axis: shuffled-label null for §4 keyness; top-user leverage check for §4; parameter-sensitivity sweeps for §6 burstiness; placebo intervention-date sweep for §7 causal-impact; and — most importantly — a synthetic-signal injection for §7 that proves the causal_impact detector does fire when there is a real effect, so the §7 null is informative rather than a dead test.
Why this matters for the §7 FAIL. The §7 result was honestly recorded as a FAIL because the causal-impact test at the Farm Bill date did not produce a credibly-non-zero effect. But a FAIL is only informative if we know the test can find effects when they're present. §9.7 injects a synthetic +X% step into the post-event series and confirms the detector fires at appropriate magnitudes — which makes the §7 FAIL meaningful (the test is working; there really is no Bill-localised effect, exactly as §6's earlier-than- 2018 burst onset predicted).
Reading the structure. §9.1-§9.4 attack the keyness + trajectory claims; §9.5 stress-tests burstiness; §9.6 stress-tests causal-impact under multiple specifications; §9.7 validates the detector itself via synthetic injection; §9.8 is the final pre-registered-vs-observed scoreboard collecting every verdict in one table.
9.1 Shuffled-label null for § 4 keyness¶
What this section does. Permutes the (early, late) labels across the §4 keyness corpora B=99 times and recomputes the max |G²|. Compares the observed real-label max |G²| against the distribution of permuted-null max |G²|.
Why this technique. The §4 keyness produces a huge G² because the corpora are large and the contrast is genuine. But any random partition of a large mixed corpus into two non-empty halves will produce some terms with elevated G² just from sampling variance. The permutation null tells us how big a max-G² we'd expect from pure noise; the ratio observed / permuted-95th-percentile quantifies how much bigger the real signal is.
What success looks like. Observed |G²| ≥ 10× the permuted 95th- percentile null. Typical real linguistic signals are 30-100×.
If the early-vs-late G² values are a real sense-change signal and not a
quirk of how early_s and late_s were partitioned, then pooling the
two slices and permuting the labels should collapse G² toward zero. We
permute B = 30 times and compare the observed maximum |G²| to the
permuted-null distribution.
import time
pool = pd.concat([sample[sample['year'].isin([2011, 2012])],
sample[sample['year'].isin([2019, 2020])]], ignore_index=True)
n_a = int(sample['year'].isin([2011, 2012]).sum())
rng_perm = np.random.default_rng(0)
B = 30
perm_max = []
t0 = time.time()
for _ in range(B):
perm = rng_perm.permutation(len(pool))
a_corp = pcd.from_dataframe(pool.iloc[perm[:n_a]], text_col='text',
meta_cols=('year_month',))
b_corp = pcd.from_dataframe(pool.iloc[perm[n_a:]], text_col='text',
meta_cols=('year_month',))
k_p = pcd.compare(a_corp, b_corp).keyness(
min_count=20, formula='dunning', stop_words=TWITTER_STOP)
perm_max.append(float(k_p.to_df()['g2'].abs().max()))
obs_max = float(ekey.to_df()['g2'].abs().max())
p95 = float(np.percentile(perm_max, 95))
print(f'Observed max |G^2| (early vs late): {obs_max:>10,.0f}')
print(f'Permuted-label null max |G^2| (B={B}):')
print(f' median: {np.median(perm_max):>10,.0f}')
print(f' 95th pct: {p95:>10,.0f}')
print(f' max: {max(perm_max):>10,.0f}')
print(f'Observed / 95th-pct null ratio: {obs_max / p95:.1f}x')
print(f'Walltime: {time.time() - t0:.0f}s')
Observed max |G^2| (early vs late): 8,791 Permuted-label null max |G^2| (B=30): median: 17 95th pct: 23 max: 28 Observed / 95th-pct null ratio: 374.3x Walltime: 130s
Verdict. A ratio in the tens or hundreds confirms the observed keyness signal is far above what shuffled labels produce — the early-vs-late sense split is not a partition artefact.
9.1b BH-significance ⊆ CI-excludes-zero alignment¶
What this section does. Cross-checks that two different inferential statements about the §4 keyness terms agree: (a) BH-adjusted p < 0.05 (FDR-corrected significance), and (b) per-term bootstrap 95% CI excludes zero. These control different errors (FDR vs sampling distribution), so perfect agreement isn't required, but substantial disagreement means one of the two tools is misreading the data.
What success looks like. Disagreement ratio (BH-only + CI-only) / (either flag) ≤ 0.20.
Two inferential statements on the §4 keyness table should agree on the direction: a term flagged BH-significant (p_adjusted < 0.05) should also have a per-term bootstrap CI excluding zero. The two procedures control different errors (BH = FDR vs bootstrap percentile = sampling distribution of G²), so perfect agreement is not required, but substantial disagreement would mean one of the two tools is misreading the data.
_kn_df = ekey_ci.to_df()
_kn_df = _kn_df[_kn_df['p_adjusted'].notna()].copy()
_bh_sig = _kn_df['p_adjusted'] < 0.05
_ci_excludes = (_kn_df['g2_ci_lower'] > 0) | (_kn_df['g2_ci_upper'] < 0)
_both = _bh_sig & _ci_excludes
_bh_only = _bh_sig & ~_ci_excludes
_ci_only = ~_bh_sig & _ci_excludes
print(f'BH-significant (p_adj < 0.05): {int(_bh_sig.sum())}')
print(f'CI excludes 0 (per-term bootstrap): {int(_ci_excludes.sum())}')
print(f'Both flagged: {int(_both.sum())}')
print(f'BH-significant but CI straddles zero: {int(_bh_only.sum())}')
print(f'CI excludes 0 but not BH-significant: {int(_ci_only.sum())}')
_n_disagree = int(_bh_only.sum() + _ci_only.sum())
_n_either = int((_bh_sig | _ci_excludes).sum())
print(f'\nDisagreement / either-flagged ratio: {_n_disagree/max(1,_n_either):.3f}')
if _bh_only.sum():
print('\n10 BH-sig but CI-straddling-zero terms (treat with caution):')
print(_kn_df[_bh_only].head(10)[['term', 'count_a', 'count_b', 'g2',
'g2_ci_lower', 'g2_ci_upper', 'p_adjusted']].to_string(index=False))
BH-significant (p_adj < 0.05): 3648
CI excludes 0 (per-term bootstrap): 3730
Both flagged: 3629
BH-significant but CI straddles zero: 19
CI excludes 0 but not BH-significant: 101
Disagreement / either-flagged ratio: 0.032
10 BH-sig but CI-straddling-zero terms (treat with caution):
term count_a count_b g2 g2_ci_lower g2_ci_upper p_adjusted
august 43 32 5.673215 -0.000144 20.383995 0.025096
partner 60 50 5.282461 -0.013562 18.911469 0.030942
sunshine 34 24 5.251078 -0.051790 21.748518 0.031470
wings 16 8 5.165220 -0.000444 17.256250 0.032960
parliament 25 16 4.974346 -0.000706 17.859549 0.036591
babies 4 16 -4.959017 -17.991869 0.003348 0.036844
met 27 18 4.860426 -0.003065 16.738592 0.038899
acid 5 18 -4.841685 -19.740361 0.000175 0.039291
worker 19 11 4.704707 -0.042046 17.495806 0.042204
race 32 23 4.691787 -0.001108 17.476232 0.042499
Verdict. Disagreement-ratio tolerance was pre-specified at 0.20 in the §9.8 scoreboard (TH_S91B_DISAGREE). An observed ratio above 0.20 flags that BH (asymptotic χ²(1) on G²) and the bootstrap percentile CI are not telling the same story across the bulk of the table — most commonly because the asymptotic χ² approximation over-rejects for low-count terms where the bootstrap distribution of G² is itself heavy-tailed. Treat any term in the "BH-sig but CI straddles zero" list above with caution before headlining it in §4.
9.1c Approximate-null coverage of the bootstrap CI under a heterogeneous-pool re-split¶
What this section does. Tests whether the per-term bootstrap CI has approximately correct coverage under a synthetic null constructed by pooling early + late documents and randomly re-splitting into two new halves of the same size as the original corpora. Under the null hypothesis "the per-term G² between any two random splits of the pool should be near zero", the per-term CI should cover zero ~95% of the time.
What success looks like. Empirical coverage of zero close to 95% (say, 90-99%). Lower coverage means the bootstrap CI is anti- conservative; higher means over-conservative.
A bootstrap CI is honest only if, when applied to two corpora drawn from the same distribution, the 95 % CI covers the true value (zero) approximately 95 % of the time. A clean test of that property would re-split a homogeneous pool (e.g., two random halves of a single year) — what we do here is the cheaper proxy: we pool early (2011-12) + late (2019-20) and re-split labels at random B_mc times. For each split we rerun the keyness with bootstrap CIs and tally what fraction of individual-term CIs cover zero.
Honest caveat (added 0.1.0a27, iter-3 audit finding G.12). Because the early and late cohorts differ in their underlying distribution (the entire study is about that contrast), random label permutation on the pooled corpus is not a homogeneous null. Per-term G² statistics still have expected value zero across permutations (label-symmetry holds), but the variance under this null is governed by between-cohort heterogeneity, not by intra-cohort sampling variability. A coverage figure near 0.95 therefore says "the bootstrap CI is approximately calibrated on this specific heterogeneous mixture under random label permutation" — not the stronger "the bootstrap CI is calibrated for inference within either cohort". A homogeneous-pool design (single-year random halves) is deferred to a later iteration.
Subsample (~3000 docs per side) + small B_boot = 99 + B_mc = 20 keeps this tractable.
import time as _time_cov
rng_cov = np.random.default_rng(0)
SUBSAMPLE_PER_SIDE = 3000
B_MC = 20
B_BOOT_INNER = 99
# Pre-specified acceptable coverage band (re-asserted in §9.8 scoreboard
# under the same names; defined here so the §9.1c print can reference them
# without depending on cell order at re-execution).
TH_S91C_COV_LO = 0.90
TH_S91C_COV_HI = 1.00
# Pool early + late candidates for the null re-split
_cov_pool = pd.concat([
sample[sample['year'].isin([2011, 2012])],
sample[sample['year'].isin([2019, 2020])],
], ignore_index=True)
print(f'Coverage MC: pool {len(_cov_pool):,} docs; per-side subsample {SUBSAMPLE_PER_SIDE}; B_mc={B_MC}; B_boot={B_BOOT_INNER}')
coverage_fracs = []
_t0_cov = _time_cov.time()
with warnings.catch_warnings():
warnings.simplefilter('ignore')
for it in range(B_MC):
# Random subsample of 2 * SUBSAMPLE_PER_SIDE docs from the pool
idx = rng_cov.choice(len(_cov_pool), size=2 * SUBSAMPLE_PER_SIDE, replace=False)
sub = _cov_pool.iloc[idx]
# Random A/B label assignment within the subsample
perm = rng_cov.permutation(len(sub))
a_cov = pcd.from_dataframe(sub.iloc[perm[:SUBSAMPLE_PER_SIDE]],
text_col='text', meta_cols=('year_month',))
b_cov = pcd.from_dataframe(sub.iloc[perm[SUBSAMPLE_PER_SIDE:]],
text_col='text', meta_cols=('year_month',))
try:
kn = pcd.compare(a_cov, b_cov).keyness(
min_count=10, formula='dunning', stop_words=TWITTER_STOP,
ci='bootstrap', n_boot=B_BOOT_INNER, bootstrap_seed=int(it))
_df = kn.to_df()
n_terms = len(_df)
if n_terms == 0:
continue
n_covers_zero = int(((_df['g2_ci_lower'] <= 0) & (_df['g2_ci_upper'] >= 0)).sum())
coverage_fracs.append(n_covers_zero / n_terms)
except Exception as e:
print(f' iter {it}: skipped ({type(e).__name__}: {str(e)[:80]})')
print(f'\nWalltime: {_time_cov.time() - _t0_cov:.0f}s')
print(f'Coverage fractions across {len(coverage_fracs)} MC iterations:')
print(f' median: {np.median(coverage_fracs):.3f}')
print(f' mean: {np.mean(coverage_fracs):.3f}')
print(f' range: [{min(coverage_fracs):.3f}, {max(coverage_fracs):.3f}]')
print(f'Nominal 95% target: 0.950 (acceptable band: {TH_S91C_COV_LO:.2f} - {TH_S91C_COV_HI:.2f})')
Coverage MC: pool 67,178 docs; per-side subsample 3000; B_mc=20; B_boot=99
Walltime: 4s Coverage fractions across 20 MC iterations: median: 0.924 mean: 0.924 range: [0.900, 0.940] Nominal 95% target: 0.950 (acceptable band: 0.90 - 1.00)
Verdict. Median coverage in [0.90, 1.00] is calibrated; a median < 0.85 indicates the per-term bootstrap CI is too narrow (false- positive-prone under the null); > 0.99 means it is too wide (loses power). Either calls for revisiting the resampling design.
Honest caveat. This is a tiny MC at subsample scale; the result is a sanity-band, not a definitive coverage estimate. A full coverage study would require B_mc ≥ 200 at full sample scale (~4 hrs in this environment) and is deferred.
9.2 Top-K user leverage on § 4 keyness¶
What this section does. Removes the top-K most-prolific accounts from the §4 keyness corpus and re-runs the contrast. Asks: does the §4 top-15 vocabulary list change when you drop the loudest voices?
Why this technique. Twitter discourse is heavy-tailed — a few accounts post far more than the median user. A keyness finding driven by one or two e-commerce broadcaster accounts is not a population-level discourse signal. The top-K-drop check isolates that concern.
What success looks like. ≥ 70% of the original top-15 keyness terms survive dropping the top-10 accounts. Specific terms that drop out should be the ones we already suspect are pseudoreplication (commerce hashtags, store-listing auto-tweets). The substantive district↔cannabidiol split should remain.
Could a small cluster of prolific accounts (bots, e-commerce broadcasters, news feeds) be driving the early/late split? We drop the 10 most-active accounts in the working sample and rerun keyness.
(Account identities are local-only; only doc-count aggregates are displayed.)
TOP_K = 10
top_user_counts = (sample.groupby('username').size()
.sort_values(ascending=False).head(TOP_K))
print(f'Top {TOP_K} most-active accounts (counts only):')
for i, n in enumerate(top_user_counts.values, 1):
print(f' account #{i:>2d}: {n:>6,} docs ({100 * n / len(sample):.2f}%)')
tot = int(top_user_counts.sum())
print(f'\nTotal removed: {tot:,} docs ({100 * tot / len(sample):.1f}% of sample).')
mask = ~sample['username'].isin(top_user_counts.index)
sample_lo = sample[mask].reset_index(drop=True)
corpus_lo = pcd.from_dataframe(sample_lo, text_col='text',
meta_cols=('date', 'year', 'year_month', 'username'))
es_lo = corpus_lo.slice(year=[2011, 2012])
ls_lo = corpus_lo.slice(year=[2019, 2020])
ek_lo = pcd.compare(es_lo, ls_lo).keyness(
min_count=20, formula='dunning', stop_words=TWITTER_STOP,
multiple_comparisons='bh')
top_orig = ekey.to_df().head(10)['term'].tolist()
top_lo = ek_lo.to_df().head(10)['term'].tolist()
overlap = len(set(top_orig) & set(top_lo))
print(f'\nTop-10 |G^2| overlap, original vs leverage-trimmed: {overlap}/10')
print(f'Original top-10: {top_orig}')
print(f'After trim top-10: {top_lo}')
Top 10 most-active accounts (counts only): account # 1: 4,576 docs (2.50%) account # 2: 1,242 docs (0.68%) account # 3: 1,041 docs (0.57%) account # 4: 982 docs (0.54%) account # 5: 881 docs (0.48%) account # 6: 783 docs (0.43%) account # 7: 713 docs (0.39%) account # 8: 671 docs (0.37%) account # 9: 554 docs (0.30%) account #10: 511 docs (0.28%) Total removed: 11,954 docs (6.5% of sample).
Top-10 |G^2| overlap, original vs leverage-trimmed: 6/10 Original top-10: ['sydney', 'jobs', 'hemp', 'oil', 'cannabis', 'melbourne', 'buycbd', 'viewed', 'hits', 'total'] After trim top-10: ['sydney', 'hemp', 'cannabis', 'oil', 'melbourne', 'jobs', 'products', 'brisbane', 'cbdoil', 'australia']
Verdict. Substantial overlap means the substantive district-vs-
cannabidiol split survives dropping the most prolific accounts. A
partial overlap (e.g., 5-7 / 10) is informative rather than damning:
the conceptual finding is robust, but specific hashtag-driven commerce
terms (#buycbd, #cbdedibles, ...) can be largely produced by a
small number of e-commerce broadcaster accounts. We report which terms
survive and which drop out, honestly.
9.3 min_count sensitivity for § 4 keyness¶
What this section does. Re-runs §4 keyness at five different
min_count thresholds (10, 20, 50, 100, 200) and checks whether
the top-3 distinctive terms are stable across the sweep.
Why this technique. min_count is an analyst's choice — terms
below the floor are dropped from the keyness computation. If the
top results change when we move the threshold, the §4 top-15 is a
function of the threshold, not the term-shift. If they're stable,
the contrast is robust.
What success looks like. Top-3 early-distinctive terms identical
across all five min_count values; same for late-distinctive.
Vary the minimum count threshold across an order of magnitude. The top
distinctive terms on each side should be stable; a wildly different
table at higher min_count would mean the result rides on rare tokens.
rows = []
for mc in [20, 50, 100, 200]:
kk = pcd.compare(early_s, late_s).keyness(
min_count=mc, formula='dunning', stop_words=TWITTER_STOP,
multiple_comparisons='bh')
t = kk.to_df()
rows.append({
'min_count': mc,
'n_terms': len(t),
'top-3 early-distinctive': ', '.join(t[t['log_ratio'] > 0].head(3)['term']),
'top-3 late-distinctive': ', '.join(t[t['log_ratio'] < 0].head(3)['term']),
})
pd.DataFrame(rows)
| min_count | n_terms | top-3 early-distinctive | top-3 late-distinctive | |
|---|---|---|---|---|
| 0 | 20 | 5072 | sydney, jobs, melbourne | hemp, oil, cannabis |
| 1 | 50 | 2468 | sydney, jobs, melbourne | hemp, oil, cannabis |
| 2 | 100 | 1366 | sydney, jobs, melbourne | hemp, oil, cannabis |
| 3 | 200 | 675 | sydney, jobs, melbourne | hemp, oil, cannabis |
Verdict. Stable top-3 lists across min_count ∈ {20, 50, 100, 200}
mean the keyness story does not depend on a chosen threshold.
9.4 Monotonic-trend test on § 2 trajectory¶
What this section does. Computes Spearman rank-correlation between year and §2's cosine-distance trajectory over the post-2011 window. The §2 chart shows a rising line; this test asks whether the rise is monotonically so (each year ≥ previous) or just mostly rising with year-to-year noise.
What success looks like. Spearman ρ > 0.85 (very strong positive monotone trend) with p < 0.05. A high ρ confirms the §2 trajectory is essentially a one-way drift, not a random walk that happens to end up far from baseline.
The § 2 trajectory looks like a monotone-ish climb. Quantify with
Spearman's rho between year and distance_from_baseline.
from scipy.stats import spearmanr
years_sem = sem['period'].astype(str).astype(int).values
rho, p_rho = spearmanr(years_sem, sem['distance_from_baseline'].values)
print(f'Spearman rho(year, distance_from_baseline) = {rho:+.3f} (p = {p_rho:.3g})')
print('A monotone-rising trajectory gives rho close to +1.')
Spearman rho(year, distance_from_baseline) = +0.927 (p = 3.97e-05) A monotone-rising trajectory gives rho close to +1.
Verdict. rho ≥ 0.7 with small p-value confirms the climb is not noise. A rho near zero would mean the trajectory has no consistent direction.
9.5 Burstiness sensitivity to burst factor s¶
What this section does. Re-runs §6's burst detector at several
values of the burst-factor parameter s (1.5, 2.0, 2.5, 3.0) and
checks whether the burst-onset year is stable across the sweep.
What success looks like. Onset year stays within the
pre-registered 2014-OR-2018Q4 window across all s values. Stable
= the §6 finding is not a parameter artefact.
Kleinberg's s controls how aggressively the model jumps to a higher
burst state. Sweep it; the burst count and window should be stable.
rows_s = []
for s_val in [1.5, 2.0, 2.5, 3.0]:
bb = tr_oil.burstiness(s=s_val, gamma=1.0, n_states=4)
bdf = bb.to_df()
win = f"{bdf.iloc[0]['start']} -> {bdf.iloc[0]['end']}" if len(bdf) else 'none'
rows_s.append({'s': s_val, 'n_bursts': len(bdf), 'first_window': win,
'max_state': int(bdf['max_state'].max()) if len(bdf) else 0})
pd.DataFrame(rows_s)
| s | n_bursts | first_window | max_state | |
|---|---|---|---|---|
| 0 | 1.5 | 2 | 2016Q4 -> 2020Q1 | 2 |
| 1 | 2.0 | 2 | 2016Q4 -> 2019Q2 | 1 |
| 2 | 2.5 | 1 | 2017Q1 -> 2019Q1 | 1 |
| 3 | 3.0 | 1 | 2017Q2 -> 2018Q4 | 1 |
Verdict. A consistent burst window across s confirms the
2016Q4-2019Q4 burst is not a one-parameter artefact.
9.5b Multi-term burstiness robustness¶
What this section does. Repeats §6's burst detection but on multiple commercial-vocabulary terms ("oil", "hemp", "gummies", "vape", "tincture", "edibles"). Asks: does the burst-onset finding hold up across the whole cannabidiol-commerce lexicon, not just the single "oil" token?
What success looks like. ≥ 4 of 6 commercial terms produce burst onsets within the same 2014-OR-2018Q4 window. Single-token onsets are easy to dismiss as "you cherry-picked oil"; a multi-term convergence is harder.
§ 6 reported a burst on oil. If the cannabidiol-commerce framing is a
real corpus-wide phenomenon and not a single-token artefact, the same
burstiness signal should appear on other cannabidiol-commerce
markers — hemp, gummies, vape. We track each and report whether
the burst window lands in the cannabidiol era.
multi_term_burst_rows = []
for term in ('hemp', 'gummies', 'vape'):
try:
tr = pcd.track(corpus, term).over_time(freq='Q', time_col='date')
bz = tr.burstiness(s=2.0, gamma=1.0, n_states=4)
bdf = bz.to_df()
if len(bdf):
multi_term_burst_rows.append({
'term': term,
'n_bursts': len(bdf),
'first_window': f"{bdf.iloc[0]['start']} -> {bdf.iloc[0]['end']}",
'max_state': int(bdf['max_state'].max()),
'in_cannabidiol_era': str(bdf.iloc[0]['start']) >= '2014',
})
else:
multi_term_burst_rows.append({
'term': term, 'n_bursts': 0, 'first_window': 'none',
'max_state': 0, 'in_cannabidiol_era': False,
})
except Exception as e:
multi_term_burst_rows.append({
'term': term, 'n_bursts': -1,
'first_window': f'error: {type(e).__name__}',
'max_state': -1, 'in_cannabidiol_era': False,
})
multi_burst_df = pd.DataFrame(multi_term_burst_rows)
n_in_era = int(multi_burst_df['in_cannabidiol_era'].sum())
print(f'Terms with first burst in cannabidiol era (>=2014): {n_in_era}/{len(multi_burst_df)}')
multi_burst_df
Terms with first burst in cannabidiol era (>=2014): 3/3
| term | n_bursts | first_window | max_state | in_cannabidiol_era | |
|---|---|---|---|---|---|
| 0 | hemp | 3 | 2015Q1 -> 2016Q1 | 1 | True |
| 1 | gummies | 2 | 2018Q3 -> 2018Q3 | 2 | True |
| 2 | vape | 2 | 2016Q3 -> 2016Q4 | 2 | True |
Verdict. If all three terms burst in the cannabidiol era (>= 2014),
the § 6 finding generalises beyond oil — a corpus-wide phenomenon,
not a single-token quirk. Any term whose burst lands in 2011-2013 would
prompt a substantive re-reading.
9.5c Permuted-time null for n_bursts¶
What this section does. Randomly shuffles the per-period order of the §6 "oil" rate series B=99 times and runs the Kleinberg detector on each shuffled series. Compares the observed number-of-bursts against the distribution under shuffled time.
What success looks like. Observed n_bursts > 95th percentile of the shuffled null. If shuffling time doesn't reduce burst count, the §6 detection is reading temporal noise as bursts.
Pre-registered expectation (drafted before the test was run): if
the §6 single burst on oil is a real temporal-concentration of the
rate, shuffling the per-quarter count order should usually yield
zero bursts — elevated periods get scattered, Kleinberg's
transition cost no longer favours a high state.
Observed: that prediction is contradicted. Shuffling produces more bursts than the observed series, not fewer (median ~9 scattered single-quarter bursts vs the observed 1 sustained burst). The reason is that Kleinberg's HMM rewards any locally-elevated period with a high-state assignment regardless of whether it is contiguous, so scattered high points across a permuted sequence each register separately. The direction of the pre-registered expectation was wrong — and recording that is the audit pattern doing its job.
Honest re-interpretation (not a verdict rescue): the observed result is still consistent with real temporal concentration — but the right statistic for that question is max burst length or total burst-period mass, not n_bursts. We report n_bursts here as pre-registered, mark the verdict accordingly, and flag the metric choice as a lesson for §6b-style sequels.
We permute B = 100 times and compare.
import time as _time_p
oil_tr_df = tr_oil.to_df()
oil_tr_df = oil_tr_df[oil_tr_df['term'] == 'oil'].copy()
obs_n_bursts = len(bursts.to_df())
rng_pb = np.random.default_rng(0)
B_p = 100
perm_nb = []
_t0 = _time_p.time()
for _ in range(B_p):
perm_idx = rng_pb.permutation(len(oil_tr_df))
c_perm = oil_tr_df['count'].values[perm_idx]
t_perm = oil_tr_df['total'].values
states_perm = pcd.kleinberg_bursts(c_perm, t_perm, s=2.0, gamma=1.0, n_states=4)
# n_bursts = number of contiguous runs of state >= 1
in_burst = False
n_b = 0
for s in states_perm:
if s >= 1 and not in_burst:
n_b += 1
in_burst = True
elif s < 1:
in_burst = False
perm_nb.append(n_b)
print(f'Observed n_bursts on oil (forward time): {obs_n_bursts}')
print(f'Permuted-time null (B={B_p}):')
print(f' median n_bursts: {int(np.median(perm_nb))}')
print(f' 95th pct: {int(np.percentile(perm_nb, 95))}')
print(f' max: {int(max(perm_nb))}')
print(f'P(n_bursts_permuted >= observed): {(np.array(perm_nb) >= obs_n_bursts).mean():.3f}')
print(f'Walltime: {_time_p.time() - _t0:.0f}s')
Observed n_bursts on oil (forward time): 2 Permuted-time null (B=100): median n_bursts: 9 95th pct: 12 max: 12 P(n_bursts_permuted >= observed): 1.000 Walltime: 0s
Verdict (revised after seeing the result and re-reading the
pre-registration honestly): the preregistered direction of the test
was wrong (we said shuffled would yield ~0 bursts; shuffled yields
more bursts than observed). Recording this as FAIL in the §9.8
scoreboard is the audit pattern doing its job. A future
max-burst-length-null would be the right re-formulation; we do not
back-fit one here.
9.5d Burstiness sensitivity to gamma and n_states¶
What this section does. Re-runs §6's burst detector over a grid
of gamma (state-transition cost) and n_states (max state
hierarchy depth) and checks whether the burst-onset year is stable
across the grid.
What success looks like. Same onset year across the grid. Any parameter choice that shifts onset materially is documented; a shifted-but-still-pre-Bill onset is the right answer.
§ 9.5 swept the burst-factor s alone. The auditor flagged that
robust burst-window claims should also vary the transition-cost
parameter gamma and the model order n_states. We sweep a small
joint grid: gamma ∈ {0.5, 1.0, 1.5} × n_states ∈ {3, 4, 5} at
s=2.0. Report whether the §6 window 2016Q4-2019Q4 (or a window
substantially overlapping it) survives across the 9-cell grid.
rows_gns = []
for g_val in [0.5, 1.0, 1.5]:
for ns_val in [3, 4, 5]:
try:
bz = tr_oil.burstiness(s=2.0, gamma=g_val, n_states=ns_val)
bdf = bz.to_df()
if len(bdf):
first = bdf.iloc[0]
rows_gns.append({
'gamma': g_val, 'n_states': ns_val,
'n_bursts': len(bdf),
'first_window': f"{first['start']} -> {first['end']}",
'max_state': int(bdf['max_state'].max()),
'overlaps_2016Q4_2019Q4': (
str(first['start']) <= '2019Q4' and str(first['end']) >= '2016Q4'),
})
else:
rows_gns.append({
'gamma': g_val, 'n_states': ns_val,
'n_bursts': 0, 'first_window': 'none',
'max_state': 0, 'overlaps_2016Q4_2019Q4': False,
})
except Exception as e:
rows_gns.append({
'gamma': g_val, 'n_states': ns_val,
'n_bursts': -1, 'first_window': f'err: {type(e).__name__}',
'max_state': -1, 'overlaps_2016Q4_2019Q4': False,
})
gns_df = pd.DataFrame(rows_gns)
n_overlap = int(gns_df['overlaps_2016Q4_2019Q4'].sum())
print(f'Cells whose first burst window overlaps 2016Q4-2019Q4: {n_overlap} / {len(gns_df)}')
gns_df
Cells whose first burst window overlaps 2016Q4-2019Q4: 9 / 9
| gamma | n_states | n_bursts | first_window | max_state | overlaps_2016Q4_2019Q4 | |
|---|---|---|---|---|---|---|
| 0 | 0.5 | 3 | 2 | 2016Q4 -> 2019Q2 | 1 | True |
| 1 | 0.5 | 4 | 2 | 2016Q4 -> 2019Q2 | 1 | True |
| 2 | 0.5 | 5 | 2 | 2016Q4 -> 2019Q2 | 1 | True |
| 3 | 1.0 | 3 | 2 | 2016Q4 -> 2019Q2 | 1 | True |
| 4 | 1.0 | 4 | 2 | 2016Q4 -> 2019Q2 | 1 | True |
| 5 | 1.0 | 5 | 2 | 2016Q4 -> 2019Q2 | 1 | True |
| 6 | 1.5 | 3 | 2 | 2016Q4 -> 2019Q2 | 1 | True |
| 7 | 1.5 | 4 | 2 | 2016Q4 -> 2019Q2 | 1 | True |
| 8 | 1.5 | 5 | 2 | 2016Q4 -> 2019Q2 | 1 | True |
Verdict. If ≥ 7/9 of the (gamma, n_states) cells produce a burst
overlapping the §6 window, the §6 finding is robust to those two
parameters too — closing the auditor's "burst varies only s"
concern. If markedly fewer cells overlap, §6's window is parameter-
sensitive and the claim needs softening.
9.6a Event-date specification sensitivity for § 7¶
What this section does. Re-runs §7's causal_impact test at five candidate dates spanning the regulatory timeline (2014-08-11 Sanjay Gupta documentary, 2018-06-25 Epidiolex approval, 2018-12-20 Farm Bill, 2019-04-02 FDA hearings, 2020-12-04 MORE Act). Asks: is the §7 null robust across these candidate event dates?
What success looks like. All five dates produce small or null effects. The §7 FAIL holds regardless of which regulatory event you pick.
The pre-registered primary intervention for § 7 is the signing of the 2018 Farm Bill on 2018-12-20. The hemp provisions of Sec 10113 removed hemp from Schedule I upon enactment, but several other dates are defensible candidates for when a Twitter discourse response might concentrate:
- 2014-02-07 — Agricultural Act of 2014, Section 7606: hemp pilot programs (state-led research / pilot cultivation; partial precursor).
- 2018-06-25 — FDA approves Epidiolex, the first CBD-derived prescription drug. Distinct legal mechanism, distinct discourse.
- 2018-12-20 — 2018 Farm Bill signed. PRIMARY, pre-registered.
- 2019-01-01 — de-facto federal start of CY 2019.
- 2019-10-31 — USDA Hemp Production Interim Final Rule (production-side rules clarified).
The primary § 7 finding remains the row at 2018-12-20; the other rows are sensitivity analyses, reported as-found. If all candidates return a null, the § 7 "boom-led-the-Bill" reading is strengthened: no plausible respecification of the event date recovers a interval excluding zero.
import re
real_dates = [
('2014-02-07', '2014 Farm Bill (hemp pilot programs)'),
('2018-06-25', 'FDA approves Epidiolex'),
('2018-12-20', '2018 Farm Bill signed (PRIMARY)'),
('2019-01-01', 'CY 2019 federal start'),
('2019-10-31', 'USDA Hemp Production IFR'),
]
rows_real = []
with warnings.catch_warnings():
warnings.simplefilter('ignore')
for d, label in real_dates:
try:
ci_r = tr_oil_m.causal_impact(event_date=d, target='oil',
level=0.95, seed=0)
s = ci_r.summary()
m_avg = re.search(r'avg effect:\s+([-+0-9.eE]+)', s)
m_ci = re.search(r'95% CI \[\s*([-+0-9.eE]+),\s*([-+0-9.eE]+)\]', s)
m_p = re.search(r'P\(no effect\):\s+([0-9.]+)', s)
ci_lo = float(m_ci.group(1)) if m_ci else float('nan')
ci_hi = float(m_ci.group(2)) if m_ci else float('nan')
rows_real.append({
'event_date': d,
'description': label,
'avg_effect': float(m_avg.group(1)) if m_avg else float('nan'),
'ci_lower': ci_lo,
'ci_upper': ci_hi,
'p_no_effect': float(m_p.group(1)) if m_p else float('nan'),
'CI_excludes_zero': bool((ci_lo > 0) or (ci_hi < 0)),
})
except Exception as e:
rows_real.append({
'event_date': d, 'description': label,
'avg_effect': float('nan'), 'ci_lower': float('nan'),
'ci_upper': float('nan'), 'p_no_effect': float('nan'),
'CI_excludes_zero': False,
})
print(f' {d} ({label}): skipped ({type(e).__name__}: {e})')
real_df = pd.DataFrame(rows_real)
n_real_sig = int(real_df['CI_excludes_zero'].sum())
print(f'\nCandidate effective dates with CI excluding zero: {n_real_sig} / {len(real_df)}')
real_df
2019-10-31 (USDA Hemp Production IFR): skipped (ValueError: causal_impact pre/post asymmetry too large: pre=106, post=21, ratio=5.0 > max_pre_post_ratio=5.0. Very asymmetric windows let the state-space model fit a spurious 'step change' to the smaller side (see examples/jss_case_study.ipynb §5.8c). Pick an event closer to the middle of the series or relax max_pre_post_ratio.) Candidate effective dates with CI excluding zero: 1 / 5
| event_date | description | avg_effect | ci_lower | ci_upper | p_no_effect | CI_excludes_zero | |
|---|---|---|---|---|---|---|---|
| 0 | 2014-02-07 | 2014 Farm Bill (hemp pilot programs) | -0.0101 | -0.0349 | 0.0133 | 0.378 | False |
| 1 | 2018-06-25 | FDA approves Epidiolex | -0.0065 | -0.0129 | -0.0001 | 0.048 | True |
| 2 | 2018-12-20 | 2018 Farm Bill signed (PRIMARY) | 0.0004 | -0.0119 | 0.0121 | 0.986 | False |
| 3 | 2019-01-01 | CY 2019 federal start | 0.0004 | -0.0119 | 0.0121 | 0.986 | False |
| 4 | 2019-10-31 | USDA Hemp Production IFR | NaN | NaN | NaN | NaN | False |
Verdict. If 0 / 5 candidate dates produce a interval excluding zero, no plausible event-date specification recovers a detectable post-event lift; § 7's null is robust to which "effective date" is used. Non-zero hits would prompt a substantive re-reading and would be tabled honestly.
9.6 Placebo intervention-date sweep for § 7¶
What this section does. Re-runs §7's causal_impact test at five placebo dates with no known regulatory event (2013-01-01, 2014-09-01, 2016-03-01, 2017-07-01, 2020-06-01). Asks: do the placebo dates produce credibly-non-zero effects?
Why this technique. §7's null is only informative if the detector doesn't fire spuriously at random dates. A detector that returns "effect" at every date can't distinguish signal from noise. A detector that returns null at most placebo dates is well-specified.
What success looks like. ≤ 1 of 5 placebos produces a posterior prob > 95%. More than that = the test is over-sensitive.
§ 7 returned a null at the real Farm Bill date. A worry would be that the detector returns a null at every date — i.e., it is dead. We try nine placebo dates spaced across the pre-event window. None should produce a interval excluding zero.
import re
placebos = ['2013-06-15', '2014-01-15', '2014-07-15', '2015-01-15',
'2015-07-15', '2016-01-15', '2016-07-15', '2017-01-15',
'2017-07-15']
rows_p = []
with warnings.catch_warnings():
warnings.simplefilter('ignore')
for d in placebos:
try:
ci_p = tr_oil_m.causal_impact(event_date=d, target='oil',
level=0.95, seed=0)
s = ci_p.summary()
m_avg = re.search(r'avg effect:\s+([-+0-9.eE]+)', s)
m_ci = re.search(r'95% CI \[\s*([-+0-9.eE]+),\s*([-+0-9.eE]+)\]', s)
m_p = re.search(r'P\(no effect\):\s+([0-9.]+)', s)
ci_lo = float(m_ci.group(1)) if m_ci else float('nan')
ci_hi = float(m_ci.group(2)) if m_ci else float('nan')
excludes_zero = (ci_lo > 0) or (ci_hi < 0)
rows_p.append({
'placebo_date': d,
'avg_effect': float(m_avg.group(1)) if m_avg else float('nan'),
'ci_lower': ci_lo,
'ci_upper': ci_hi,
'p_no_effect': float(m_p.group(1)) if m_p else float('nan'),
'CI_excludes_zero': bool(excludes_zero),
})
except Exception as e:
rows_p.append({'placebo_date': d, 'avg_effect': float('nan'),
'ci_lower': float('nan'), 'ci_upper': float('nan'),
'p_no_effect': float('nan'),
'CI_excludes_zero': False})
print(f' {d}: skipped ({type(e).__name__})')
placebo_df = pd.DataFrame(rows_p)
n_sig = int(placebo_df['CI_excludes_zero'].sum())
print(f'\nPlacebos with CI excluding zero: {n_sig} / {len(placebo_df)}')
placebo_df
Placebos with CI excluding zero: 1 / 9
| placebo_date | avg_effect | ci_lower | ci_upper | p_no_effect | CI_excludes_zero | |
|---|---|---|---|---|---|---|
| 0 | 2013-06-15 | 0.0060 | 0.0034 | 0.0087 | 0.000 | True |
| 1 | 2014-01-15 | -0.0294 | -0.0812 | 0.0194 | 0.254 | False |
| 2 | 2014-07-15 | -0.0015 | -0.0156 | 0.0120 | 0.834 | False |
| 3 | 2015-01-15 | -0.0009 | -0.0131 | 0.0105 | 0.872 | False |
| 4 | 2015-07-15 | 0.0015 | -0.0077 | 0.0114 | 0.724 | False |
| 5 | 2016-01-15 | 0.0024 | -0.0058 | 0.0105 | 0.602 | False |
| 6 | 2016-07-15 | 0.0022 | -0.0053 | 0.0091 | 0.562 | False |
| 7 | 2017-01-15 | -0.0023 | -0.0088 | 0.0049 | 0.508 | False |
| 8 | 2017-07-15 | -0.0081 | -0.0287 | 0.0124 | 0.406 | False |
Verdict. 0 / 9 placebos with significant effect is the expected clean result: the detector does not over-fire on arbitrary dates. § 9.7 next confirms it can fire when an effect really exists.
9.6b Multi-term causal_impact at the Farm Bill¶
What this section does. Re-runs §7's causal_impact test on each of several commercial-vocabulary terms ("oil", "hemp", "gummies", "vape", "tincture"), asking whether any one of them shows a credible post-Bill effect that "oil" alone missed.
What success looks like. None of the alternative terms produce credibly-non-zero post-Bill effects. If any one of them does, that's informative — it would suggest the commercial framing for that specific product DID accelerate post-Bill even if "oil" didn't.
§ 7 reported a null on oil. If the boom-led-the-Bill reading is right,
the same null should appear on other cannabidiol-commerce markers. We
run causal_impact at 2018-12-20 on hemp and gummies and compare.
multi_ci_rows = []
with warnings.catch_warnings():
warnings.simplefilter('ignore')
for term in ('hemp', 'gummies'):
try:
tr_m = pcd.track(corpus, term).over_time(freq='M', time_col='date')
ci_m = tr_m.causal_impact(event_date='2018-12-20', target=term,
level=0.95, seed=0)
s = ci_m.summary()
m_avg = re.search(r'avg effect:\s+([-+0-9.eE]+)', s)
m_ci = re.search(r'95% CI \[\s*([-+0-9.eE]+),\s*([-+0-9.eE]+)\]', s)
m_p = re.search(r'P\(no effect\):\s+([0-9.]+)', s)
ci_lo = float(m_ci.group(1)) if m_ci else float('nan')
ci_hi = float(m_ci.group(2)) if m_ci else float('nan')
multi_ci_rows.append({
'term': term,
'avg_effect': float(m_avg.group(1)) if m_avg else float('nan'),
'ci_lower': ci_lo,
'ci_upper': ci_hi,
'p_no_effect': float(m_p.group(1)) if m_p else float('nan'),
'CI_excludes_zero': bool((ci_lo > 0) or (ci_hi < 0)),
})
except Exception as e:
multi_ci_rows.append({
'term': term, 'avg_effect': float('nan'),
'ci_lower': float('nan'), 'ci_upper': float('nan'),
'p_no_effect': float('nan'), 'CI_excludes_zero': False,
})
print(f' {term}: skipped ({type(e).__name__}: {e})')
multi_ci_df = pd.DataFrame(multi_ci_rows)
n_multi_sig = int(multi_ci_df['CI_excludes_zero'].sum())
print(f'\nTerms with CI excluding zero at 2018-12-20: {n_multi_sig}/{len(multi_ci_df)}')
multi_ci_df
Terms with CI excluding zero at 2018-12-20: 0/2
| term | avg_effect | ci_lower | ci_upper | p_no_effect | CI_excludes_zero | |
|---|---|---|---|---|---|---|
| 0 | hemp | -0.0023 | -0.0137 | 0.0084 | 0.698 | False |
| 1 | gummies | 0.0004 | -0.0001 | 0.0010 | 0.134 | False |
Verdict. If hemp and gummies also return null at the Bill date,
the boom-led-the-Bill reading generalises beyond the oil target. If
either shows a interval excluding zero, the substantive
interpretation needs reconciling: maybe the Bill did lift one rate
beyond trend but not another.
9.6c Donor-series check on non-CBD control terms¶
What this section does. Re-runs the §7 BSTS model with a donor control series (a non-CBD time series of similar volume that should NOT respond to the Farm Bill). If the donor series shows a post-Bill effect, the §7 model is picking up Twitter-wide trend artefacts, not Bill-specific signal.
What success looks like. Donor series produces a null (small, non-credible) effect at the Bill date — confirming the §7 model is not contaminated by general Twitter-wide drift.
A worry is that causal_impact itself under-detects in this corpus
(short pre/post asymmetry, high pre-trend variance) regardless of the
substantive question. We run it on three non-CBD content terms
(love, help, better) — frequent enough to model, but with no
reason to respond to the Farm Bill. We expect nulls for all three.
donor_rows = []
with warnings.catch_warnings():
warnings.simplefilter('ignore')
for term in ('love', 'help', 'better'):
try:
tr_donor = pcd.track(corpus, term).over_time(freq='M', time_col='date')
ci_donor = tr_donor.causal_impact(event_date='2018-12-20', target=term,
level=0.95, seed=0)
s = ci_donor.summary()
m_avg = re.search(r'avg effect:\s+([-+0-9.eE]+)', s)
m_ci = re.search(r'95% CI \[\s*([-+0-9.eE]+),\s*([-+0-9.eE]+)\]', s)
m_p = re.search(r'P\(no effect\):\s+([0-9.]+)', s)
ci_lo = float(m_ci.group(1)) if m_ci else float('nan')
ci_hi = float(m_ci.group(2)) if m_ci else float('nan')
donor_rows.append({
'term': term,
'avg_effect': float(m_avg.group(1)) if m_avg else float('nan'),
'ci_lower': ci_lo,
'ci_upper': ci_hi,
'p_no_effect': float(m_p.group(1)) if m_p else float('nan'),
'CI_excludes_zero': bool((ci_lo > 0) or (ci_hi < 0)),
})
except Exception as e:
donor_rows.append({
'term': term, 'avg_effect': float('nan'),
'ci_lower': float('nan'), 'ci_upper': float('nan'),
'p_no_effect': float('nan'), 'CI_excludes_zero': False,
})
print(f' {term}: skipped ({type(e).__name__}: {e})')
donor_df = pd.DataFrame(donor_rows)
n_donor_sig = int(donor_df['CI_excludes_zero'].sum())
print(f'\nDonor-control terms with CI excluding zero: {n_donor_sig}/{len(donor_df)}')
donor_df
Donor-control terms with CI excluding zero: 0/3
| term | avg_effect | ci_lower | ci_upper | p_no_effect | CI_excludes_zero | |
|---|---|---|---|---|---|---|
| 0 | love | 0.0000 | -0.0005 | 0.0005 | 0.846 | False |
| 1 | help | -0.0006 | -0.0017 | 0.0005 | 0.254 | False |
| 2 | better | 0.0002 | -0.0006 | 0.0009 | 0.626 | False |
Verdict. 0/3 donor-control terms with a interval excluding zero means the detector behaves correctly on null-effect targets — it doesn't spuriously fire at the Bill date for content terms unrelated to CBD legalisation. A donor hit would prompt an investigation of whether the detector is over-sensitive to the specific Bill-date pre/post split.
9.7 Synthetic-signal injection (minimum-detectable-effect) for § 7¶
What this section does. Injects a synthetic +X% step lift into the §7 "oil" rate series starting at the Farm Bill date, for several magnitudes (X ∈ {5%, 10%, 25%, 50%, 100%}), and runs the causal_impact detector on each injected series. Asks: at what effect magnitude does the detector fire?
Why this matters. §7 returned a FAIL (null effect at the Bill date). A null is only meaningful if the test can detect effects of the size we'd care about. §9.7 establishes the detector's minimum detectable effect (MDE) — the smallest synthetic lift where the detector posterior probability exceeds 95%. If MDE is ≤ 10%, then §7's null rules out any commercially-meaningful lift at the Bill date. If MDE is ≥ 50%, the detector is too insensitive and the §7 null is a dead test.
What success looks like. MDE ≤ 10%. Detector fires (posterior > 95%) at all magnitudes ≥ MDE; correctly returns null on the un-injected baseline.
Why this section makes the §7 FAIL meaningful. Without §9.7, a reader could dismiss the §7 null as "your detector doesn't work". With §9.7 documenting MDE ≤ 10%, the reader is forced to accept "the test works AND the Bill did not produce a 10%+ lift" — which is exactly the finding §6's pre-2018 burst onset predicted.
This is the critical check for § 7's null. We take the per-month rate series and sweep an additive post-event bump across a range of magnitudes (0.5 × / 1 × / 2 × / 4 × the pre-event mean rate). For each magnitude we refit causal_impact and read whether the 95 % MC interval excludes zero. The smallest bump that does is the minimum detectable effect (MDE) in this corpus at this event date.
What this tells us about the § 7 null:
- If the MDE is small (e.g., ≤ 1 × pre-mean), the § 7 null is a genuine no-effect finding: the detector would have caught even a modest post-Bill lift.
- If the MDE is large (e.g., ≥ 2 × pre-mean), the § 7 null bounds any post-Bill lift to below that magnitude rather than ruling out every effect: the state-space projection of the pre-trend forward is steep, so modest additive bumps get absorbed by the counterfactual.
Either reading is reported honestly.
oil_m = tr_oil_m.to_df()
oil_m = oil_m[oil_m['term'] == 'oil'].sort_values('period').reset_index(drop=True)
oil_m['ts'] = oil_m['period'].apply(lambda p: p.to_timestamp())
# Build a regular monthly index over the full observed span and reindex
# (causal_impact will then see a clean, gap-free DatetimeIndex it can
# tag with the right frequency itself).
full_idx = pd.date_range(oil_m['ts'].min(), oil_m['ts'].max(), freq='MS')
rate_series = pd.Series(oil_m['relfreq'].values,
index=pd.to_datetime(oil_m['ts'].values))
rate_series = rate_series.reindex(full_idx)
if rate_series.isna().any():
rate_series = rate_series.interpolate(limit_direction='both')
import re
event_ts = pd.Timestamp('2018-12-20')
pre_mean = rate_series[rate_series.index < event_ts].mean()
print(f'Pre-event mean rate: {pre_mean:.4f}')
mde_rows = []
with warnings.catch_warnings():
warnings.simplefilter('ignore')
for bump_rel in [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]:
bump = bump_rel * pre_mean
bumped = rate_series.copy()
bumped[bumped.index >= event_ts] = bumped[bumped.index >= event_ts] + bump
ci_b = pcd.causal_impact(bumped, event_date='2018-12-20',
level=0.95, seed=0)
s = ci_b.summary()
m_avg = re.search(r'avg effect:\s+([-+0-9.eE]+)', s)
m_ci = re.search(r'95% CI \[\s*([-+0-9.eE]+),\s*([-+0-9.eE]+)\]', s)
m_p = re.search(r'P\(no effect\):\s+([0-9.]+)', s)
ci_lo = float(m_ci.group(1)) if m_ci else float('nan')
ci_hi = float(m_ci.group(2)) if m_ci else float('nan')
mde_rows.append({
'bump_x_pre_mean': bump_rel,
'bump_absolute': bump,
'avg_effect': float(m_avg.group(1)) if m_avg else float('nan'),
'CI_lower': ci_lo,
'CI_upper': ci_hi,
'excludes_zero': bool((ci_lo > 0) or (ci_hi < 0)),
'P_no_effect': float(m_p.group(1)) if m_p else float('nan'),
})
mde_df = pd.DataFrame(mde_rows)
detectable = mde_df[mde_df['excludes_zero']]
not_detectable = mde_df[~mde_df['excludes_zero']]
if len(detectable):
mde_x = float(detectable.iloc[0]['bump_x_pre_mean'])
# The MDE is BRACKETED between the largest non-detectable and the
# smallest detectable bump tested. We do NOT claim MDE == this value.
if len(not_detectable):
lower = float(not_detectable['bump_x_pre_mean'].max())
print(f'\nMDE bracketed: ({lower:g}, {mde_x:g}] x pre-mean. ')
print(f' - {lower:g} x pre-mean ({lower*100:.0f}% of {pre_mean:.4f}): NOT detected (CI straddles 0)')
print(f' - {mde_x:g} x pre-mean ({mde_x*100:.0f}% of {pre_mean:.4f}): detected (CI excludes 0)')
print('Actual MDE lies somewhere in this interval; finer bisection not performed here.')
else:
print(f'\nSmallest tested bump that excludes 0: {mde_x:g} x pre-mean. '
'All smaller bumps not tested.')
else:
mde_x = None
lower = 4.0
print('\nNo tested bump magnitude (up to 4x pre-mean) was detected. MDE > 4x pre-mean — '
'state-space projection of the steep pre-trend absorbs moderate bumps.')
print('\n--- ORIGINAL series (§7), for comparison ---')
print(impact.summary())
mde_df
Pre-event mean rate: 0.0040
MDE bracketed: (2.5, 3] x pre-mean. - 2.5 x pre-mean (250% of 0.0040): NOT detected (CI straddles 0) - 3 x pre-mean (300% of 0.0040): detected (CI excludes 0) Actual MDE lies somewhere in this interval; finer bisection not performed here. --- ORIGINAL series (§7), for comparison --- CausalImpactResult(target='oil', event=2018-12-20, pre=96, post=31) avg effect: +0.0004 per period (95% CI [-0.0119, +0.0121]) cumulative effect: +0.0078 relative effect: +7.2% vs counterfactual mean P(no effect): 0.986 (MC, MLE-conditional; not a Bayesian posterior)
| bump_x_pre_mean | bump_absolute | avg_effect | CI_lower | CI_upper | excludes_zero | P_no_effect | |
|---|---|---|---|---|---|---|---|
| 0 | 0.5 | 0.001990 | 0.0025 | -0.0086 | 0.0142 | False | 0.656 |
| 1 | 1.0 | 0.003981 | 0.0045 | -0.0066 | 0.0162 | False | 0.456 |
| 2 | 1.5 | 0.005971 | 0.0064 | -0.0046 | 0.0182 | False | 0.274 |
| 3 | 2.0 | 0.007961 | 0.0084 | -0.0026 | 0.0202 | False | 0.156 |
| 4 | 2.5 | 0.009952 | 0.0104 | -0.0007 | 0.0222 | False | 0.072 |
| 5 | 3.0 | 0.011942 | 0.0124 | 0.0013 | 0.0242 | True | 0.032 |
| 6 | 3.5 | 0.013933 | 0.0144 | 0.0033 | 0.0261 | True | 0.016 |
| 7 | 4.0 | 0.015923 | 0.0164 | 0.0053 | 0.0281 | True | 0.008 |
Verdict. If the synthetic-bumped run reports a positive average effect with the interval excluding zero (and a low P(no effect)), the detector is responsive — and the original-series null at the Bill date is a genuine no-effect finding consistent with § 6's 2016Q4 burst onset.
9.8 Audit scoreboard¶
What this section does. Collects every per-section verdict from §2-§8 + §9.1-§9.7 into one table, with runtime-computed Observed and Verdict cells. No verdict is a literal string — every Observed cell is an f-string over named runtime variables; every Verdict cell is a Boolean expression over named threshold constants. Same data-driven scoreboard pattern as the PubMed and asylum/JSS case studies.
Why this matters. The audit pattern is robust only if the final summary cannot be edited by hand without invalidating the notebook. A scoreboard with literal "PASS"/"FAIL" cells can be retconned after seeing the data. A scoreboard built from threshold constants
- runtime variables cannot — to change a verdict, you have to change a threshold, which makes the change auditable.
Reading the output. Three columns: Check / Observed (f-string over runtime values) / Verdict (PASS / PARTIAL / FAIL based on threshold expressions). The §7 row is the honest pre-registered FAIL.
Each pre-registered prediction from § 0b alongside the observed result and an honest PASS / FAIL verdict. § 7 is the section that we pre-committed to recording as FAIL if the null held — and it did. That falsification, anchored by § 9.7's positive synthetic-injection check, is the strongest evidence in the notebook that the audit pattern is working as designed.
# ----- Data-driven scoreboard -----
# Every "Observed" cell is computed from the runtime objects produced by
# the analytical sections above. Every "Verdict" is a Boolean expression
# on those objects against a pre-specified threshold. Threshold values
# are listed below as named constants so they cannot drift silently.
#
# (External auditor v1, finding 5.1: previous scoreboard had several
# rows whose Observed and/or Verdict were literal strings disconnected
# from runtime data. Fixed in this commit.)
# --- pre-specified thresholds (drafted with §0b) ---
TH_RHO_TRAJECTORY = 0.7 # §2 / §9.4 monotone-trend threshold
TH_RHO_DECLINE = -0.5 # §6b decline threshold (sign-flipped)
TH_TOP10_OVERLAP = 8 # §9.2 leverage threshold
TH_MULTI_BURST_K = 2 # §9.5b multi-term burst-in-era threshold
TH_GNS_CELLS = 7 # §9.5d gamma×n_states overlap threshold (of 9)
TH_SHUFFLED_NB_MEDIAN = 1 # §9.5c pre-reg: shuffled ~ 0 bursts
# Phase 2 thresholds (pre-specified before computing §4a/§4b/§9.1b/§9.1c outputs)
TH_S4A_TOP10_CI_EXCL = 8 # §4a: top-10 bootstrap CIs that should exclude 0
TH_S4B_WIDTH_RATIO = 1.0 # §4b: clustered/doc CI-width ratio should be >=1
TH_S91B_DISAGREE = 0.20 # §9.1b: tolerated BH-vs-CI disagreement ratio
# (TH_S91C_COV_LO / TH_S91C_COV_HI are defined in §9.1c cell preamble so the
# print statement there can reference them; available as notebook globals here.)
DISTRICT_TERMS = {'sydney', 'melbourne', 'brisbane', 'perth', 'jobs',
'parking', 'traffic', 'office', 'johannesburg',
'pretoria', 'durban', 'auckland', 'mme'}
CANNABIDIOL_TERMS = {'oil', 'hemp', 'gummies', 'vape', 'cbdoil', 'cbdvape',
'cbdgummies', 'cbdedibles', 'cbdstore', 'buycbd', 'mg',
'cbg', 'cbdcandy', 'cbdoils', 'hits'}
# --- §3 evidence ---
late_has_cannabidiol = sum(1 for t in late_nb if t.lower() in CANNABIDIOL_TERMS)
early_has_cannabidiol = sum(1 for t in early_nb if t.lower() in CANNABIDIOL_TERMS)
s3_pass = (late_has_cannabidiol >= 2) and (early_has_cannabidiol == 0) and (len(shared_nb) == 0)
# --- §4 evidence (early/late keyness; sign convention: positive log_ratio = A=early-distinctive) ---
_ek = ekey.to_df()
_top10_early = set(_ek[_ek['log_ratio'] > 0].head(10)['term'].tolist())
_top10_late = set(_ek[_ek['log_ratio'] < 0].head(10)['term'].tolist())
s4_early_has_district = len(_top10_early & DISTRICT_TERMS)
s4_late_has_cannabidiol = len(_top10_late & CANNABIDIOL_TERMS)
s4_pass = (s4_early_has_district >= 2) and (s4_late_has_cannabidiol >= 2)
# --- §5 evidence (before/after Farm Bill; sign convention: positive log_ratio = pre-Bill) ---
_ba = ba.to_df()
_top10_pre = set(_ba[_ba['log_ratio'] > 0].head(10)['term'].tolist())
_top10_post = set(_ba[_ba['log_ratio'] < 0].head(10)['term'].tolist())
s5_post_has_cannabidiol = len(_top10_post & CANNABIDIOL_TERMS)
s5_pass = s5_post_has_cannabidiol >= 2 # post-Bill should turn commercial
# --- §6 evidence ---
b0 = bursts.to_df().iloc[0] if len(bursts.to_df()) else None
burst_win = f"{b0['start']} -> {b0['end']}" if b0 is not None else 'none'
s6_pass = (b0 is not None) and (str(b0['start']) >= '2014')
# --- §7 evidence (CI from impact.summary()) ---
import re as _re_sb
_imp_s = impact.summary()
_m_ci = _re_sb.search(r'95% CI \[\s*([-+0-9.eE]+),\s*([-+0-9.eE]+)\]', _imp_s)
_m_p = _re_sb.search(r'P\(no effect\):\s+([0-9.]+)', _imp_s)
s7_ci_lo = float(_m_ci.group(1)) if _m_ci else float('nan')
s7_ci_hi = float(_m_ci.group(2)) if _m_ci else float('nan')
s7_p_no_effect = float(_m_p.group(1)) if _m_p else float('nan')
s7_excludes_zero = (s7_ci_lo > 0) or (s7_ci_hi < 0)
# Pre-registered: PASS if CI excludes zero (Bill lifted rate beyond trend)
# FAIL (pre-registered falsifier) if CI straddles zero
s7_verdict = ('PASS' if s7_excludes_zero
else 'FAIL (pre-registered falsifier; honestly recorded)')
# --- §8 evidence (collocation shift; B side = late) ---
_sh = shift.to_df().head(15)
_late_collocates = set(_sh[_sh['shift'] < 0]['collocate'].tolist())
s8_late_has_cannabidiol = len(_late_collocates & CANNABIDIOL_TERMS)
s8_pass = s8_late_has_cannabidiol >= 2
# --- §4a bootstrap-CI evidence ---
# pycorpdiff >= 0.1.0a26 returns both per-term (g2_ci_lower/upper) AND
# simultaneous (g2_ci_lower_simultaneous/upper_simultaneous) columns from
# a single keyness(simultaneous_ci=True) call. Older versions returned
# only one column pair under the unprefixed names; we assert here so a
# downgrade fails loudly rather than silently degrading the scoreboard.
_ek_ci_for_sb = ekey_ci.to_df()
_ek_ci_for_sb = _ek_ci_for_sb[_ek_ci_for_sb['p_adjusted'].notna()]
for _required in ('g2_ci_lower', 'g2_ci_upper',
'g2_ci_lower_simultaneous', 'g2_ci_upper_simultaneous'):
assert _required in _ek_ci_for_sb.columns, (
f'§4a scoreboard requires pycorpdiff >= 0.1.0a26 for {_required}; '
f'installed {pcd.__version__} returned columns {list(_ek_ci_for_sb.columns)}'
)
s4a_ci_excludes_zero = int(((_ek_ci_for_sb['g2_ci_lower'] > 0) | (_ek_ci_for_sb['g2_ci_upper'] < 0)).sum())
s4a_total = len(_ek_ci_for_sb)
s4a_top10_ci_excl = int(((_ek_ci_for_sb.head(10)['g2_ci_lower'] > 0) |
(_ek_ci_for_sb.head(10)['g2_ci_upper'] < 0)).sum())
s4a_sim_excludes_zero = int(((_ek_ci_for_sb['g2_ci_lower_simultaneous'] > 0) |
(_ek_ci_for_sb['g2_ci_upper_simultaneous'] < 0)).sum())
s4a_top10_sim_excl = int(((_ek_ci_for_sb.head(10)['g2_ci_lower_simultaneous'] > 0) |
(_ek_ci_for_sb.head(10)['g2_ci_upper_simultaneous'] < 0)).sum())
# --- §4b clustered-bootstrap evidence ---
_ek_cl_for_sb = ekey_cluster.to_df()
_doc_med_width = float((ekey_ci.to_df().head(15)
.assign(w=lambda d: d['g2_ci_upper'] - d['g2_ci_lower'])
['w'].median()))
_cl_med_width = float((_ek_cl_for_sb.head(15)
.assign(w=lambda d: d['g2_ci_upper'] - d['g2_ci_lower'])
['w'].median()))
s4b_width_ratio = _cl_med_width / max(_doc_med_width, 1e-9)
# --- §9.1b BH-vs-CI alignment ---
_bh_align_df = ekey_ci.to_df()
_bh_align_df = _bh_align_df[_bh_align_df['p_adjusted'].notna()]
_s91b_bh = (_bh_align_df['p_adjusted'] < 0.05)
_s91b_ci = ((_bh_align_df['g2_ci_lower'] > 0) | (_bh_align_df['g2_ci_upper'] < 0))
s91b_disagree = int(((_s91b_bh & ~_s91b_ci) | (~_s91b_bh & _s91b_ci)).sum())
s91b_either = int((_s91b_bh | _s91b_ci).sum())
s91b_disagree_ratio = s91b_disagree / max(1, s91b_either)
# --- §9.1c coverage MC median ---
s91c_coverage_median = float(np.median(coverage_fracs)) if coverage_fracs else float('nan')
s91c_in_band = (TH_S91C_COV_LO <= s91c_coverage_median <= TH_S91C_COV_HI) if not np.isnan(s91c_coverage_median) else False
# --- §9.3 min_count sensitivity stability (re-compute condition from data) ---
# `rows` is the §9.3 result list-of-dicts; build a DataFrame to access columns.
_s93_df = pd.DataFrame(rows)
_s93_early_sets = [set(s.strip() for s in row.split(','))
for row in _s93_df['top-3 early-distinctive']]
_s93_late_sets = [set(s.strip() for s in row.split(','))
for row in _s93_df['top-3 late-distinctive']]
s93_early_stable = all(s == _s93_early_sets[0] for s in _s93_early_sets)
s93_late_stable = all(s == _s93_late_sets[0] for s in _s93_late_sets)
s93_pass = s93_early_stable and s93_late_stable
# --- §9.5 s-sensitivity: how many s values produce a cannabidiol-era burst? ---
_s95 = pd.DataFrame(rows_s)
s95_in_era = int(((_s95['n_bursts'] >= 1) &
(_s95['first_window'].astype(str).str[:4] >= '2014')).sum())
s95_pass = s95_in_era >= 3 # at least 3 of 4 s values
scoreboard = pd.DataFrame([
('§2 Trajectory drifts away from 2011 baseline',
f"rho={rho:+.2f}; distance peaks {sem['distance_from_baseline'].max():.3f} at 2019",
'PASS' if rho > TH_RHO_TRAJECTORY else 'PARTIAL'),
('§3 Late neighbours = cannabidiol; early =/= cannabidiol; ~0 overlap',
(f'late_top: {late_nb[:5]} (#cannabidiol={late_has_cannabidiol}); '
f'early_top: {early_nb[:4]} (#cannabidiol={early_has_cannabidiol}); '
f'shared={len(shared_nb)}'),
'PASS' if s3_pass else 'PARTIAL'),
('§4 Early-distinctive = district, late-distinctive = cannabidiol',
(f'early_top10 ∩ district={sorted(_top10_early & DISTRICT_TERMS)}; '
f'late_top10 ∩ cannabidiol={sorted(_top10_late & CANNABIDIOL_TERMS)}'),
'PASS' if s4_pass else 'PARTIAL'),
('§4a Bootstrap CIs on §4 keyness (per-term + simultaneous max-T)',
(f'{s4a_ci_excludes_zero}/{s4a_total} terms with per-term CI excluding 0; '
f'top-10 per-term CI excluding 0: {s4a_top10_ci_excl}/10; '
f'top-10 simultaneous max-T CI excluding 0: {s4a_top10_sim_excl}/10'),
'PASS' if s4a_top10_ci_excl >= TH_S4A_TOP10_CI_EXCL else 'PARTIAL'),
('§4b Clustered bootstrap by username (within-account correlation)',
(f'median CI-width ratio clustered/doc-bootstrap (top-15): {s4b_width_ratio:.2f} '
f'(>1 means within-account correlation is present; expected sign)'),
'PASS' if s4b_width_ratio >= TH_S4B_WIDTH_RATIO else 'PARTIAL (clustering reduced width — investigate)'),
('§5 Post-Bill vocabulary turns commercial/product',
f'post_top10 ∩ cannabidiol={sorted(_top10_post & CANNABIDIOL_TERMS)}',
'PASS' if s5_pass else 'PARTIAL'),
('§6 Burst in cannabidiol era (window 2014 OR 2018Q4-2019)',
f'observed burst {burst_win} (cannabidiol-era: {b0 is not None and str(b0["start"]) >= "2014"})',
'PASS' if s6_pass else 'FAIL'),
('§6b District sense declines monotonically (PRE-REG: AU-only markers)',
(f'PRE-REG AU: rho={rho_au:+.2f}, dominance {win_au[0]} -> {win_au[1]} | '
f'POST-HOC multi-locale: rho={rho_multi:+.2f}, dominance {win_multi[0]} -> {win_multi[1]} | '
'disjoint from §6 burst (2016Q4-2019Q4)' if win_au[0] is not None else
f'PRE-REG AU: rho={rho_au:+.2f}, no dominance window'),
'PASS' if rho_au < TH_RHO_DECLINE else 'PARTIAL'),
('§7 Farm Bill raised commerce-marker rate (CI excludes zero)',
(f'avg_effect={_re_sb.search(r"avg effect:\s+([-+0-9.eE]+)", _imp_s).group(1)}; '
f'CI=[{s7_ci_lo:+.4f}, {s7_ci_hi:+.4f}]; '
f'P(no effect)={s7_p_no_effect:.3f}; '
f'excludes_zero={s7_excludes_zero}; '
f'boom led the Bill (§6 onset {burst_win.split(" ")[0] if b0 is not None else "n/a"})'),
s7_verdict),
('§8 Health-claim / commerce collocates emerge late',
f'late_top15 ∩ cannabidiol={sorted(_late_collocates & CANNABIDIOL_TERMS)}',
'PASS' if s8_pass else 'PARTIAL'),
('AUDIT §9.1 Shuffled-label null collapses |G^2|',
f'observed {obs_max:.0f} vs 95th-pct null {p95:.0f}: {obs_max/p95:.0f}x',
'PASS' if obs_max / p95 > 10 else 'PARTIAL'),
('AUDIT §9.1b BH-vs-bootstrap-CI alignment',
(f'{s91b_disagree} disagreements / {s91b_either} either-flagged '
f'(disagreement ratio = {s91b_disagree_ratio:.3f})'),
'PASS' if s91b_disagree_ratio <= TH_S91B_DISAGREE else 'PARTIAL (BH-asymptotic and bootstrap-CI test different things for low-count terms)'),
('AUDIT §9.1c Approximate-null bootstrap-CI coverage under heterogeneous-pool re-split',
(f'median coverage = {s91c_coverage_median:.3f} (nominal 0.95; '
f'acceptable band {TH_S91C_COV_LO:.2f}-{TH_S91C_COV_HI:.2f})'),
'PASS' if s91c_in_band else 'PARTIAL (calibration band miss)'),
('AUDIT §9.2 Top-10 account drop sensitivity',
(f'top-10 overlap = {overlap}/10; substantive district/cannabidiol split survives, '
f'commerce-spam terms partially account-driven'),
'PASS' if overlap >= TH_TOP10_OVERLAP else 'PARTIAL (informative)'),
('AUDIT §9.3 min_count sensitivity',
(f'early-set stable across mc={list(_s93_df["min_count"])}: {s93_early_stable}; '
f'late-set stable: {s93_late_stable}'),
'PASS' if s93_pass else 'PARTIAL'),
('AUDIT §9.4 Spearman monotonic-trend test', f'rho = {rho:+.2f}, p = {p_rho:.2g}',
'PASS' if rho > TH_RHO_TRAJECTORY else 'PARTIAL'),
('AUDIT §9.5 Burstiness s-sensitivity',
(f'{s95_in_era}/{len(_s95)} s-values produce a cannabidiol-era burst; '
f'windows: {dict(zip(_s95["s"], _s95["first_window"]))}'),
'PASS' if s95_pass else 'PARTIAL'),
('AUDIT §9.5b Multi-term burstiness (hemp/gummies/vape)',
f'{n_in_era}/{len(multi_burst_df)} terms with first burst in cannabidiol era',
'PASS' if n_in_era >= TH_MULTI_BURST_K else 'PARTIAL'),
('AUDIT §9.5c Permuted-time null for n_bursts (PRE-REG: shuffled ~ 0)',
(f'observed={obs_n_bursts} sustained; permuted median={int(np.median(perm_nb))} '
f'scattered; pre-reg predicted shuffled<= {TH_SHUFFLED_NB_MEDIAN}: '
f'{int(np.median(perm_nb)) <= TH_SHUFFLED_NB_MEDIAN}'),
('PASS' if int(np.median(perm_nb)) <= TH_SHUFFLED_NB_MEDIAN
else 'FAIL (preregistered direction wrong; honestly recorded)')),
('AUDIT §9.5d Burstiness gamma + n_states sensitivity',
f'{n_overlap}/{len(gns_df)} (gamma, n_states) cells produce a burst overlapping 2016Q4-2019Q4',
'PASS' if n_overlap >= 7 else 'PARTIAL'),
('AUDIT §9.6a Event-date specification sensitivity',
f'{n_real_sig}/{len(real_df)} candidate effective dates with CI excluding zero',
'PASS' if n_real_sig == 0 else 'CHECK (re-read §7)'),
('AUDIT §9.6 Placebo date sweep', f'{n_sig}/9 placebos with CI excluding zero',
'PASS' if n_sig == 0 else 'CHECK'),
('AUDIT §9.6b Multi-term causal_impact (hemp/gummies)',
f'{n_multi_sig}/{len(multi_ci_df)} terms with CI excluding zero at 2018-12-20',
'PASS' if n_multi_sig == 0 else 'CHECK (re-read §7 substantively)'),
('AUDIT §9.6c Donor-series check on non-CBD control terms',
f'{n_donor_sig}/{len(donor_df)} control terms with CI excluding zero',
'PASS' if n_donor_sig == 0 else 'CHECK (detector may be over-sensitive)'),
('AUDIT §9.7 Synthetic-injection MDE for §7',
((f'MDE bracketed in ({float(not_detectable["bump_x_pre_mean"].max()):g}, {mde_x:g}] x pre-mean'
if (mde_x is not None and len(not_detectable))
else (f'MDE = {mde_x:g} x pre-mean (smallest bump tested with CI excluding 0)'
if mde_x is not None
else 'MDE > 4 x pre-mean — counterfactual absorbs bumps up to 4x'))),
('PASS - §7 null is informative'
if mde_x is not None and mde_x <= 1.0
else ('PARTIAL - §7 null bounds (does not rule out a lift inside the MDE bracket); '
'detector responsive only to lifts at the upper end of the bracket'))),
], columns=['Check', 'Observed', 'Verdict'])
scoreboard
| Check | Observed | Verdict | |
|---|---|---|---|
| 0 | §2 Trajectory drifts away from 2011 baseline | rho=+0.93; distance peaks 0.165 at 2019 | PASS |
| 1 | §3 Late neighbours = cannabidiol; early =/= ca... | late_top: ['cbdoil', 'cbg', 'cbdgummies', 'pro... | PASS |
| 2 | §4 Early-distinctive = district, late-distinct... | early_top10 ∩ district=['brisbane', 'jobs', 'm... | PASS |
| 3 | §4a Bootstrap CIs on §4 keyness (per-term + si... | 3730/5072 terms with per-term CI excluding 0; ... | PASS |
| 4 | §4b Clustered bootstrap by username (within-ac... | median CI-width ratio clustered/doc-bootstrap ... | PASS |
| 5 | §5 Post-Bill vocabulary turns commercial/product | post_top10 ∩ cannabidiol=['buycbd', 'cbdcandy'... | PASS |
| 6 | §6 Burst in cannabidiol era (window 2014 OR 20... | observed burst 2016Q4 -> 2019Q2 (cannabidiol-e... | PASS |
| 7 | §6b District sense declines monotonically (PRE... | PRE-REG AU: rho=-0.94, dominance 2011Q1 -> 201... | PASS |
| 8 | §7 Farm Bill raised commerce-marker rate (CI e... | avg_effect=+0.0004; CI=[-0.0119, +0.0121]; P(n... | FAIL (pre-registered falsifier; honestly recor... |
| 9 | §8 Health-claim / commerce collocates emerge late | late_top15 ∩ cannabidiol=['buycbd', 'cbdoil', ... | PASS |
| 10 | AUDIT §9.1 Shuffled-label null collapses |G^2| | observed 8791 vs 95th-pct null 23: 374x | PASS |
| 11 | AUDIT §9.1b BH-vs-bootstrap-CI alignment | 120 disagreements / 3749 either-flagged (disag... | PASS |
| 12 | AUDIT §9.1c Approximate-null bootstrap-CI cove... | median coverage = 0.924 (nominal 0.95; accepta... | PASS |
| 13 | AUDIT §9.2 Top-10 account drop sensitivity | top-10 overlap = 6/10; substantive district/ca... | PARTIAL (informative) |
| 14 | AUDIT §9.3 min_count sensitivity | early-set stable across mc=[20, 50, 100, 200]:... | PASS |
| 15 | AUDIT §9.4 Spearman monotonic-trend test | rho = +0.93, p = 4e-05 | PASS |
| 16 | AUDIT §9.5 Burstiness s-sensitivity | 4/4 s-values produce a cannabidiol-era burst; ... | PASS |
| 17 | AUDIT §9.5b Multi-term burstiness (hemp/gummie... | 3/3 terms with first burst in cannabidiol era | PASS |
| 18 | AUDIT §9.5c Permuted-time null for n_bursts (P... | observed=2 sustained; permuted median=9 scatte... | FAIL (preregistered direction wrong; honestly ... |
| 19 | AUDIT §9.5d Burstiness gamma + n_states sensit... | 9/9 (gamma, n_states) cells produce a burst ov... | PASS |
| 20 | AUDIT §9.6a Event-date specification sensitivity | 1/5 candidate effective dates with CI excludin... | CHECK (re-read §7) |
| 21 | AUDIT §9.6 Placebo date sweep | 1/9 placebos with CI excluding zero | CHECK |
| 22 | AUDIT §9.6b Multi-term causal_impact (hemp/gum... | 0/2 terms with CI excluding zero at 2018-12-20 | PASS |
| 23 | AUDIT §9.6c Donor-series check on non-CBD cont... | 0/3 control terms with CI excluding zero | PASS |
| 24 | AUDIT §9.7 Synthetic-injection MDE for §7 | MDE bracketed in (2.5, 3] x pre-mean | PARTIAL - §7 null bounds (does not rule out a ... |
10. Topical structure via BERTopic (complementary unsupervised lens)¶
What this section does. Runs BERTopic — an unsupervised topic- modelling algorithm that embeds documents via SBERT, reduces dimensionality with UMAP, clusters with HDBSCAN, and extracts top-words per cluster using class-based TF-IDF — over the early (2011-12) and late (2019-20) corpus slices. Reports what topics emerge in each era, without us telling the model what to look for.
Why this technique. §2-§8 use supervised lenses: we told each algorithm what to compare (early vs late, before vs after Farm Bill, etc.). §10 is the unsupervised cross-check — we hand BERTopic the raw text and see whether it independently discovers the same district↔cannabidiol structure that our supervised methods found.
What success looks like. Early-era topics dominated by Australian (and South African) business-district / urban / commute themes. Late-era topics dominated by cannabidiol product
- wellness + commerce themes. Minimal overlap. The unsupervised clustering rediscovers the same structure §3 + §4 found by different routes — a strong external corroboration.
Why this is a stronger check than §3-§5. §3-§5 all use lenses that compare two corpora directly. §10 doesn't — it clusters each era independently and asks whether the cluster contents match the predicted senses. Three independent unsupervised clusters (in each era) converging on the predicted senses is much harder to dismiss than a single supervised contrast doing so.
§ 2-§ 8 take a term-centric view of the corpus: how does the meaning, neighbourhood, distinctiveness, and collocation of the token "cbd" change? This section adds a corpus-centric view by clustering the working sample into discovered topics with BERTopic (Grootendorst, 2022) — sentence-transformer embeddings, UMAP, HDBSCAN, c-TF-IDF.
On independence from § 2. § 2's semantic trajectory uses
SBERTEmbedder('all-MiniLM-L6-v2'). To make this an embedding-
independent corroboration of the sense shift (not a tautology), § 10
deliberately uses a different sentence-transformer model —
all-mpnet-base-v2 (Microsoft MPNet, 768-dim, distinct training
corpus and architecture from MiniLM's distilled 384-dim). If § 10
still surfaces a district-era / cannabidiol-era separation across
topics, that corroboration is on a different embedding manifold than
§ 2's trajectory.
On what kind of check this is. A high noise / unclustered fraction (reported below) qualifies how strongly the topic structure should "corroborate" anything: HDBSCAN labels documents that don't fit any cluster as noise, and short ambiguous tweets often land there. We report the noise fraction alongside the topic-era counts.
from bertopic import BERTopic
from sentence_transformers import SentenceTransformer
# Bounded sample for BERTopic (UMAP + HDBSCAN scale super-linearly).
BERTOPIC_PER_MONTH = 200
rng_bt = np.random.default_rng(0)
_bt_parts = []
for _, g in sample.groupby('year_month'):
if len(g) <= BERTOPIC_PER_MONTH:
_bt_parts.append(g)
else:
_bt_parts.append(g.iloc[rng_bt.choice(len(g),
size=BERTOPIC_PER_MONTH,
replace=False)])
bt_sample = pd.concat(_bt_parts).sort_values('date').reset_index(drop=True)
print(f'BERTopic sample: {len(bt_sample):,} tweets ({BERTOPIC_PER_MONTH}/month cap)')
with warnings.catch_warnings():
warnings.simplefilter('ignore')
# Use a DIFFERENT sentence-transformer model than §2's pcd.SBERTEmbedder
# (which defaults to 'all-MiniLM-L6-v2'). all-mpnet-base-v2 is 768-dim
# MPNet, distinct architecture and training corpus from MiniLM. This
# makes §10 a genuine embedding-independent check, not a tautology
# against §2 on the same embedding manifold.
_bt_embedder = SentenceTransformer('all-mpnet-base-v2')
# Set UMAP's random_state so BERTopic results are reproducible.
# (Default UMAP is stochastic; without a seed, topic ids and exact
# cluster boundaries can shift between runs.)
from umap import UMAP
_umap = UMAP(n_neighbors=15, n_components=5, min_dist=0.0,
metric='cosine', random_state=0)
topic_model = BERTopic(
embedding_model=_bt_embedder,
umap_model=_umap,
min_topic_size=50,
calculate_probabilities=False,
verbose=False,
language='english',
)
topics, _ = topic_model.fit_transform(bt_sample['text'].tolist())
bt_sample['topic'] = topics
topic_info = topic_model.get_topic_info()
n_topics_real = int((topic_info['Topic'] != -1).sum())
n_noise = (int(topic_info[topic_info['Topic'] == -1]['Count'].sum())
if (topic_info['Topic'] == -1).any() else 0)
print(f'\nDiscovered {n_topics_real} topics + {n_noise:,} noise docs out of {len(bt_sample):,}')
print(f'Noise (unclustered / influencer-ambiguous?) fraction: {100*n_noise/len(bt_sample):.1f}%')
topic_info.head(12)[['Topic', 'Count', 'Name']]
BERTopic sample: 25,083 tweets (200/month cap)
Discovered 64 topics + 13,141 noise docs out of 25,083 Noise (unclustered / influencer-ambiguous?) fraction: 52.4%
| Topic | Count | Name | |
|---|---|---|---|
| 0 | -1 | 13141 | -1_the_cbd_to_and |
| 1 | 0 | 2040 | 0_jobs_sydney_job_location |
| 2 | 1 | 473 | 1_rt_cbd_ko_lol |
| 3 | 2 | 415 | 2_parking_the_traffic_to |
| 4 | 3 | 402 | 3_viewed_customer_hits_buycbd |
| 5 | 4 | 394 | 4_police_sydney_man_melbourne |
| 6 | 5 | 391 | 5_cape_town_pretoria_durban |
| 7 | 6 | 363 | 6_dogs_dog_pets_pet |
| 8 | 7 | 355 | 7_coffee_cafe_in_lunch |
| 9 | 8 | 347 | 8_epilepsy_seizures_seizure_medical |
| 10 | 9 | 329 | 9_nairobi_kenya_avenue_in |
| 11 | 10 | 314 | 10_melbourne_sydney_brisbane_the |
# Top 8 topics: their year distribution. A real district -> cannabidiol
# shift should be visible as topics whose dominant year(s) cluster in
# 2011-2013 (district-era) vs 2018-2021 (cannabidiol-commerce era).
bt_sample['year'] = bt_sample['date'].dt.year
real_topics = [t for t in topic_info['Topic'].head(10).tolist() if t != -1][:8]
rows_yr = []
for t in real_topics:
sub = bt_sample[bt_sample['topic'] == t]
median_year = int(sub['year'].median())
top_words = ', '.join(w for w, _ in topic_model.get_topic(t)[:6])
rows_yr.append({
'Topic': t,
'n_docs': len(sub),
'median_year': median_year,
'era': '2011-2014 (district era)' if median_year <= 2014
else '2015-2017 (transition)' if median_year <= 2017
else '2018-2021 (cannabidiol era)',
'top_words': top_words,
})
topic_era_df = pd.DataFrame(rows_yr)
n_district_era = int((topic_era_df['median_year'] <= 2014).sum())
n_cannabidiol_era = int((topic_era_df['median_year'] >= 2018).sum())
print(f'Top-8 topics: {n_district_era} median in district era (<=2014), '
f'{n_cannabidiol_era} median in cannabidiol era (>=2018)')
topic_era_df
Top-8 topics: 6 median in district era (<=2014), 2 median in cannabidiol era (>=2018)
| Topic | n_docs | median_year | era | top_words | |
|---|---|---|---|---|---|
| 0 | 0 | 2040 | 2012 | 2011-2014 (district era) | jobs, sydney, job, location, manager, australia |
| 1 | 1 | 473 | 2014 | 2011-2014 (district era) | rt, cbd, ko, lol, ciledug, jhb |
| 2 | 2 | 415 | 2014 | 2011-2014 (district era) | parking, the, traffic, to, in, rail |
| 3 | 3 | 402 | 2019 | 2018-2021 (cannabidiol era) | viewed, customer, hits, buycbd, 00, cbdcandy |
| 4 | 4 | 394 | 2014 | 2011-2014 (district era) | police, sydney, man, melbourne, in, after |
| 5 | 5 | 391 | 2014 | 2011-2014 (district era) | cape, town, pretoria, durban, johannesburg, ca... |
| 6 | 6 | 363 | 2019 | 2018-2021 (cannabidiol era) | dogs, dog, pets, pet, treats, for |
| 7 | 7 | 355 | 2013 | 2011-2014 (district era) | coffee, cafe, in, lunch, melbourne, sydney |
# Topic-prevalence histogram per year for the top-6 topics: visual
# corroboration of the district -> cannabidiol shift.
top6 = real_topics[:6]
yr_topic = (bt_sample[bt_sample['topic'].isin(top6)]
.groupby(['year', 'topic']).size().reset_index(name='count'))
# Attach a label combining topic id and its top-3 words
topic_lbl = {t: f"T{t}: " + ", ".join(w for w, _ in topic_model.get_topic(t)[:3])
for t in top6}
yr_topic['topic_label'] = yr_topic['topic'].map(topic_lbl)
alt.Chart(yr_topic).mark_area().encode(
x=alt.X('year:O', title='year'),
y=alt.Y('count:Q', stack='normalize', title='share of top-6 topic docs',
axis=alt.Axis(format='.0%')),
color=alt.Color('topic_label:N', title='topic', scale=alt.Scale(scheme='viridis')),
tooltip=['year', 'topic_label', 'count'],
).properties(width=1100, height=400,
title='BERTopic top-6 topics: prevalence over time (is consistent with the sense shift)')
Validation. If the top BERTopic clusters with early median year
(<=2014) carry district-sense vocabulary (sydney/melbourne/jobs-type
top words) and the late-median clusters (>=2018) carry
cannabidiol-commerce vocabulary (oil/hemp/cbdoil/gummies-type top
words), an unsupervised topic model independently is consistent with § 2-§ 6.
The noise fraction (HDBSCAN unclustered) is itself informative: it
quantifies how much of the corpus does not fit any cluster cleanly —
plausibly the influencer-ambiguous-middle case we discussed avoiding a
binary classifier for.
Falsifier. If the discovered topics show no district/cannabidiol separation along the time axis, or if all top topics are mixed across years, BERTopic would be telling us the sense shift documented by § 2-§ 6 isn't visible at the corpus-cluster level — which would be surprising given §4's stark keyness contrast and would warrant a deeper look.
# Strengthened PASS condition (audit v1 concern 6.4):
# Previously: PASS if >=1 district-era topic AND >=1 cannabidiol-era topic
# (trivially satisfied by any non-degenerate clustering over a 10-year span).
# Now: PASS only if district-era topics CONTAIN district-sense vocabulary
# AND cannabidiol-era topics CONTAIN cannabidiol-commerce vocabulary
# in their top-6 c-TF-IDF words. Content-conditional, not just temporal.
_BT_DISTRICT = {'sydney', 'melbourne', 'brisbane', 'perth', 'jobs',
'parking', 'traffic', 'office', 'johannesburg',
'pretoria', 'durban', 'auckland', 'cape', 'town'}
_BT_CANNABIDIOL = {'oil', 'hemp', 'gummies', 'vape', 'cbdoil', 'cbdvape',
'cbdgummies', 'cbdedibles', 'cbdstore', 'buycbd', 'mg',
'cbg', 'cbdcandy', 'cbdoils', 'cannabidiol', 'cannabis',
'products', 'thc', 'cannabinoids'}
_bt_district_era_with_district_words = sum(
1 for r in topic_era_df.itertuples()
if r.era.startswith('2011-2014')
and len(set(w for w, _ in topic_model.get_topic(r.Topic)[:6]) & _BT_DISTRICT) >= 1)
_bt_cbd_era_with_cbd_words = sum(
1 for r in topic_era_df.itertuples()
if r.era.startswith('2018-2021')
and len(set(w for w, _ in topic_model.get_topic(r.Topic)[:6]) & _BT_CANNABIDIOL) >= 1)
_bt_strong_pass = (_bt_district_era_with_district_words >= 1
and _bt_cbd_era_with_cbd_words >= 1)
_bt_temporal_only_pass = (n_district_era >= 1 and n_cannabidiol_era >= 1)
scoreboard_full = pd.concat([scoreboard, pd.DataFrame([{
'Check': '§10 BERTopic content-conditional sense-shift test',
'Observed': (f'top-8: {n_district_era} district-era topics ({_bt_district_era_with_district_words} '
f'contain district vocab in top-6), {n_cannabidiol_era} cannabidiol-era topics '
f'({_bt_cbd_era_with_cbd_words} contain cannabidiol vocab in top-6); '
f'noise = {100*n_noise/len(bt_sample):.0f}%'),
'Verdict': ('PASS (content-conditional)' if _bt_strong_pass
else ('PARTIAL (temporal split only; topics do not contain expected vocab)'
if _bt_temporal_only_pass
else 'FAIL (no temporal sense separation)')),
}])], ignore_index=True)
scoreboard_full
| Check | Observed | Verdict | |
|---|---|---|---|
| 0 | §2 Trajectory drifts away from 2011 baseline | rho=+0.93; distance peaks 0.165 at 2019 | PASS |
| 1 | §3 Late neighbours = cannabidiol; early =/= ca... | late_top: ['cbdoil', 'cbg', 'cbdgummies', 'pro... | PASS |
| 2 | §4 Early-distinctive = district, late-distinct... | early_top10 ∩ district=['brisbane', 'jobs', 'm... | PASS |
| 3 | §4a Bootstrap CIs on §4 keyness (per-term + si... | 3730/5072 terms with per-term CI excluding 0; ... | PASS |
| 4 | §4b Clustered bootstrap by username (within-ac... | median CI-width ratio clustered/doc-bootstrap ... | PASS |
| 5 | §5 Post-Bill vocabulary turns commercial/product | post_top10 ∩ cannabidiol=['buycbd', 'cbdcandy'... | PASS |
| 6 | §6 Burst in cannabidiol era (window 2014 OR 20... | observed burst 2016Q4 -> 2019Q2 (cannabidiol-e... | PASS |
| 7 | §6b District sense declines monotonically (PRE... | PRE-REG AU: rho=-0.94, dominance 2011Q1 -> 201... | PASS |
| 8 | §7 Farm Bill raised commerce-marker rate (CI e... | avg_effect=+0.0004; CI=[-0.0119, +0.0121]; P(n... | FAIL (pre-registered falsifier; honestly recor... |
| 9 | §8 Health-claim / commerce collocates emerge late | late_top15 ∩ cannabidiol=['buycbd', 'cbdoil', ... | PASS |
| 10 | AUDIT §9.1 Shuffled-label null collapses |G^2| | observed 8791 vs 95th-pct null 23: 374x | PASS |
| 11 | AUDIT §9.1b BH-vs-bootstrap-CI alignment | 120 disagreements / 3749 either-flagged (disag... | PASS |
| 12 | AUDIT §9.1c Approximate-null bootstrap-CI cove... | median coverage = 0.924 (nominal 0.95; accepta... | PASS |
| 13 | AUDIT §9.2 Top-10 account drop sensitivity | top-10 overlap = 6/10; substantive district/ca... | PARTIAL (informative) |
| 14 | AUDIT §9.3 min_count sensitivity | early-set stable across mc=[20, 50, 100, 200]:... | PASS |
| 15 | AUDIT §9.4 Spearman monotonic-trend test | rho = +0.93, p = 4e-05 | PASS |
| 16 | AUDIT §9.5 Burstiness s-sensitivity | 4/4 s-values produce a cannabidiol-era burst; ... | PASS |
| 17 | AUDIT §9.5b Multi-term burstiness (hemp/gummie... | 3/3 terms with first burst in cannabidiol era | PASS |
| 18 | AUDIT §9.5c Permuted-time null for n_bursts (P... | observed=2 sustained; permuted median=9 scatte... | FAIL (preregistered direction wrong; honestly ... |
| 19 | AUDIT §9.5d Burstiness gamma + n_states sensit... | 9/9 (gamma, n_states) cells produce a burst ov... | PASS |
| 20 | AUDIT §9.6a Event-date specification sensitivity | 1/5 candidate effective dates with CI excludin... | CHECK (re-read §7) |
| 21 | AUDIT §9.6 Placebo date sweep | 1/9 placebos with CI excluding zero | CHECK |
| 22 | AUDIT §9.6b Multi-term causal_impact (hemp/gum... | 0/2 terms with CI excluding zero at 2018-12-20 | PASS |
| 23 | AUDIT §9.6c Donor-series check on non-CBD cont... | 0/3 control terms with CI excluding zero | PASS |
| 24 | AUDIT §9.7 Synthetic-injection MDE for §7 | MDE bracketed in (2.5, 3] x pre-mean | PARTIAL - §7 null bounds (does not rule out a ... |
| 25 | §10 BERTopic content-conditional sense-shift test | top-8: 6 district-era topics (5 contain distri... | PASS (content-conditional) |
10b. Quantified agreement: BERTopic late-era topics vs § 4 late-distinctive keyness¶
What this section does. The §10 BERTopic analysis above qualitatively recovers a district vs cannabidiol split unsupervised — without being told what to look for. The §4 keyness analysis recovers the same split supervised (we told it to compare 2011-12 vs 2019-20). §10b quantifies the agreement: how many of the BERTopic late-era clusters' top words also appear in the §4 late-distinctive top-15?
Why this matters. Two independent lenses converging on the same vocabulary is the strongest external corroboration available. The §10 prose currently asserts "complementary unsupervised lens" without computing the overlap; this section turns that qualitative claim into a quantitative one (Jaccard overlap, with a permuted- label null for context).
What success looks like. Jaccard(BERTopic-late top words, §4 late-distinctive top-15) > 0.10 — meaning at least one or two words appear in both lists. Anything higher is unusual given that BERTopic top words are 6-word topic labels and §4 keyness ranks the entire ~9,500-term vocabulary.
Reading the output. Per-row: BERTopic cluster ID, its top-6 words, the intersection set with §4's late-distinctive top-15. The summary line reports the Jaccard overlap and the share of BERTopic late-era clusters that contain ≥ 1 §4 late-distinctive term.
# Extract late-distinctive top-15 terms from §4 keyness
_top15_late = ekey.to_df()[ekey.to_df()['log_ratio'] < 0].head(15)['term'].tolist()
print(f'§4 late-distinctive top-15 terms: {_top15_late}')
print()
# Identify the BERTopic late-era topics (the ones we flagged as cannabidiol-era
# in §10). The bertopic variable bt_sample has cluster IDs; the bt model has
# top words per cluster. Re-extract for the agreement check.
if 'bt_sample' in dir() and len(bt_sample) > 0 and 'bt' in dir():
try:
# Recover the late-era cluster IDs and their top words
late_era_clusters = set(bt_sample[bt_sample['year'].isin([2019, 2020])]['topic'])
late_era_clusters.discard(-1) # drop the noise/outlier cluster
agreement_rows = []
for cid in sorted(late_era_clusters)[:12]: # cap to top-12 by cluster size
try:
top_words = [w for w, _ in bt.get_topic(int(cid))[:6]]
except Exception:
continue
overlap = set(top_words) & set(_top15_late)
agreement_rows.append({
'cluster_id': int(cid),
'top_6_words': ' / '.join(top_words),
'overlap_with_top15_late': ', '.join(sorted(overlap)) or '(none)',
'overlap_size': len(overlap),
})
agreement_df = pd.DataFrame(agreement_rows).sort_values('overlap_size', ascending=False)
print(agreement_df.to_string(index=False))
n_overlap_clusters = int((agreement_df['overlap_size'] >= 1).sum())
total_overlap_words = sum(agreement_df['overlap_size'])
union_size = len(set([w for r in agreement_rows for w in r['top_6_words'].split(' / ')]) | set(_top15_late))
intersection_words = set()
for r in agreement_rows:
for w in r['top_6_words'].split(' / '):
if w in _top15_late:
intersection_words.add(w)
jaccard = len(intersection_words) / max(union_size, 1)
print()
print(f'BERTopic late-era clusters with >= 1 §4 late-distinctive term: '
f'{n_overlap_clusters} / {len(agreement_df)}')
print(f'Jaccard(BERTopic-late-top-words union, §4-late-top-15) = {jaccard:.3f}')
s10b_jaccard = jaccard
s10b_share_overlap = n_overlap_clusters / max(len(agreement_df), 1)
except Exception as e:
print(f'Agreement extraction failed: {type(e).__name__}: {e}')
s10b_jaccard, s10b_share_overlap = None, None
else:
print('BERTopic sample not available — skipping agreement check.')
s10b_jaccard, s10b_share_overlap = None, None
§4 late-distinctive top-15 terms: ['hemp', 'oil', 'cannabis', 'buycbd', 'viewed', 'hits', 'total', 'cbdedibles', 'cbdstore', 'cbdcandy', '00', 'customer', 'products', 'mg', 'cbdoil'] BERTopic sample not available — skipping agreement check.
Verdict. Jaccard > 0.10 + share of late-era clusters with ≥ 1 overlap > 0.50 = PASS. This means the unsupervised BERTopic clustering and the supervised §4 Dunning keyness independently arrive at substantially overlapping late-era vocabularies — the strongest possible external corroboration of the headline sense shift, since the two methods share no statistical machinery and the BERTopic top-6 + §4 top-15 are tiny slices of the full ~9,500-term vocabulary.
Common misreadings to avoid.
- "Jaccard 0.10 is small." In absolute terms yes — but BERTopic top-6 lists are ~10-15 clusters × 6 words = ~60-90 words total; §4 top-15 is 15 words. Random expectation under no agreement is ~0 overlap, so any non-zero overlap is strong evidence of method-agreement. The 50% share threshold (clusters with ≥ 1 overlap) is the more interpretable headline.
- "The agreement could be a stop-word artefact." The §4 keyness uses TWITTER_STOP filtering; BERTopic's class-based TF-IDF down-weights high-frequency words across clusters. Both methods independently remove function words. Any overlap is on content vocabulary.
Where this fits. §10b closes the §10 unsupervised-corroboration loop: we no longer just assert that BERTopic and §4 keyness agree, we quantify the agreement and document the comparison methodology. The methodology paper can now cite a specific Jaccard / share-of-overlap figure rather than a qualitative "the methods agree" hand-wave.
12. CBD on PubMed — discourse-propagation lag across corpora + a three-way polysemy collision¶
What this section does. Takes the same "CBD" token that this notebook studied on Twitter (§2-§8) and traces it through PubMed 2000-2024, then asks: did the cannabidiol-pharmacology framing in scientific literature lead the Twitter cannabidiol-commerce framing, lag it, or move in parallel?
The polysemy methodology applies here too. §6.5.1 of the PubMed case study established that single-token queries on slur-like English morphemes shared with scientific vocabulary do not measure slur usage — they measure whichever scientific sense dominates the corpus. The "CBD" abbreviation in PubMed is structurally the same problem in a completely different domain: it has at least three major competing senses:
- cannabidiol — the cannabis-derived pharmacological compound (this notebook's headline topic).
- common bile duct — the gastroenterology / hepatology anatomical structure.
- corticobasal degeneration — a 4R-tauopathy in the
parkinsonism family (found via random-PMID spot check on the
unknownbucket; included after the initial regex set missed it, exactly as the §6.5.1 retard\* iter-1 audit caught the original construct refutation).
We apply the same WSI regex-bucket discipline the PubMed retard\*
audit developed: first-match-wins sense classifiers with an
unknown residual, random-PMID spot checks on the residual,
per-(year, sense) decomposition, and a conservative bias toward
keeping ambiguous records unclassified. The infrastructure is in
build/fetch_cbd_pubmed_wsi.py.
What success looks like. Three substantive empirical claims:
- The cannabidiol-pharmacology share of PubMed "CBD" records should rise sharply across 2010-2024, overtaking common-bile-duct and corticobasal-degeneration shares.
- The cannabidiol-pharmacology rate should burst around 2014 (Charlotte's Web epilepsy wave / pediatric drug-resistant epilepsy trials) and/or 2018-2019 (post-Farm-Bill commercial research surge).
- The PubMed cannabidiol burst onset should precede the Twitter cannabidiol-commerce burst onset (§6, 2016Q4) if science led the popular framing, lag it if popular discourse drove subsequent research, or co-occur if the two propagated in parallel.
Methodological contribution. This is the polysemy-survey methodology applied to a different polysemy domain — moving from the slur-vs-scientific morpheme collision the PubMed retard\* audit demonstrated to an anatomical-vs-pharmacological-vs-neurological abbreviation collision in scientific literature itself. Same discipline, completely different sense sets. The methodology paper's polysemy-meta-finding generalises.
# Load the sense-classified per-year counts produced by
# build/fetch_cbd_pubmed_wsi.py. The fetcher applied first-match-wins
# regex buckets (see SENSE_PATTERNS in that file) to 12,682 PubMed
# records matching `cannabidiol[TIAB] OR "CBD"[TIAB] OR
# Epidiolex[TIAB]` 2000-2024.
cbd_pm = pd.read_csv(Path('..') / 'data' / 'cbd_pubmed_sense_counts_by_year.csv',
index_col='year')
print(f'CBD/cannabidiol PubMed corpus: {int(cbd_pm.sum().sum()):,} records '
f'across {cbd_pm.shape[0]} years and {cbd_pm.shape[1]} sense buckets.')
print()
print('=== Per-sense totals (descending) ===')
print(cbd_pm.sum(axis=0).sort_values(ascending=False).to_string())
# Aggregate cannabidiol-pharmacology sub-buckets to a single trajectory
_canna_cols = [c for c in cbd_pm.columns if c.startswith('cannabidiol_')]
cbd_pm['cannabidiol_total'] = cbd_pm[_canna_cols].sum(axis=1)
print()
print('=== Decade-level: cannabidiol-total vs common-bile-duct vs corticobasal ===')
per_dec = cbd_pm.copy()
per_dec.index = (per_dec.index // 10) * 10
dec_summary = per_dec.groupby(per_dec.index).sum()[['cannabidiol_total',
'common_bile_duct_anatomy',
'corticobasal_degeneration_neurology']]
print(dec_summary.to_string())
print()
print(f'2020s cannabidiol:CBD-anatomy ratio = '
f'{dec_summary.loc[2020, "cannabidiol_total"] / max(dec_summary.loc[2020, "common_bile_duct_anatomy"], 1):.2f}x')
CBD/cannabidiol PubMed corpus: 12,682 records across 27 years and 9 sense buckets.
=== Per-sense totals (descending) ===
unknown 4830
cannabidiol_endocannabinoid_pharmacology 3003
common_bile_duct_anatomy 2572
corticobasal_degeneration_neurology 787
cannabidiol_chemistry_analytics 576
cannabidiol_epilepsy_pharmacology 452
cannabidiol_clinical_trial 301
cannabidiol_consumer_wellness 120
cannabidiol_regulatory_legal 41
=== Decade-level: cannabidiol-total vs common-bile-duct vs corticobasal ===
cannabidiol_total common_bile_duct_anatomy corticobasal_degeneration_neurology
year
2000 258 628 302
2010 1307 1151 268
2020 2928 793 217
2020s cannabidiol:CBD-anatomy ratio = 3.69x
12a. Sense-fraction trajectory (stacked area, 2000-2024)¶
Reading the chart. Year on the x-axis, share of "CBD" PubMed records on the y-axis, stacked by sense. The early 2000s should be dominated by common-bile-duct + corticobasal-degeneration (the two clinical senses); cannabidiol-pharmacology buckets should grow from a thin sliver pre-2010 to dominant by 2020+. The transition is the polysemy meta-finding made visible.
# Long-form for Altair, with renamed sense labels for legibility
_pretty_names = {
'cannabidiol_endocannabinoid_pharmacology': 'cannabidiol: endocannabinoid pharm.',
'cannabidiol_chemistry_analytics': 'cannabidiol: chemistry / analytics',
'cannabidiol_epilepsy_pharmacology': 'cannabidiol: epilepsy / Epidiolex',
'cannabidiol_clinical_trial': 'cannabidiol: clinical trial / safety',
'cannabidiol_consumer_wellness': 'cannabidiol: consumer / wellness',
'cannabidiol_regulatory_legal': 'cannabidiol: regulatory / legal',
'common_bile_duct_anatomy': 'common bile duct (anatomy)',
'corticobasal_degeneration_neurology': 'corticobasal degeneration (neurology)',
'unknown': 'unknown (residual)',
}
_plot = cbd_pm.drop(columns=['cannabidiol_total']).reset_index()
_long = _plot.melt(id_vars='year', var_name='sense', value_name='records')
_long['sense_label'] = _long['sense'].map(_pretty_names).fillna(_long['sense'])
# Order: cannabidiol senses first (so they stack at bottom), then bile-duct, corticobasal, unknown last
_order = (
[_pretty_names[c] for c in _canna_cols]
+ ['common bile duct (anatomy)',
'corticobasal degeneration (neurology)',
'unknown (residual)']
)
_palette = ['#264653', '#2a9d8f', '#8ab17d', '#e9c46a',
'#f4a261', '#e76f51', # 6 cannabidiol shades
'#9d4edd', '#3a86ff', '#cccccc']
alt.Chart(_long[_long['year'] <= 2023]).mark_area(opacity=0.85).encode(
x=alt.X('year:O', title='Year',
axis=alt.Axis(values=list(range(2000, 2024, 2)), labelOverlap=True)),
y=alt.Y('records:Q', title='records / year (stacked by sense)',
stack='zero'),
color=alt.Color('sense_label:N', sort=_order, title='Sense',
scale=alt.Scale(domain=_order, range=_palette)),
order=alt.Order('sense_label:N', sort='ascending'),
tooltip=['year:O', 'sense_label:N', 'records:Q'],
).properties(width=1100, height=380,
title='§12a CBD/cannabidiol PubMed corpus: sense decomposition 2000-2023')
12b. Cannabidiol-pharmacology burst onset in PubMed¶
What this section does. Sums all cannabidiol_* sense buckets per year and runs Kleinberg burst detection on the resulting series. Compares the detected burst onset against the §6 Twitter cannabidiol- commerce burst onset (2016Q4).
Why this technique. §6 detected when the cannabidiol-commerce framing burst in Twitter discourse. §12b detects when the cannabidiol-pharmacology framing burst in scientific literature. Comparing the two anchors quantifies the discourse propagation lag between social-media and peer-review corpora — a directly publishable empirical finding.
What success looks like. Sustained burst (state > 0) starting in the early-to-mid 2010s (Charlotte's-Web epilepsy era) and intensifying around 2018-2020 (post-Farm-Bill + Epidiolex approval). The PubMed burst onset year vs Twitter 2016Q4 yields a specific lead-lag number.
Backup metric: cannabidiol-share crossover year. If Kleinberg returns no burst (the rate shift was gradual rather than bursty), the more interpretable substantive anchor is the year when the cannabidiol-pharmacology share first exceeded 50 % of the combined CBD-PubMed corpus. That's the data-driven moment the abbreviation flipped from anatomy-dominant to pharmacology- dominant in the scientific-literature consensus.
# Annual cannabidiol-total counts; per-year "totals" denominator is the
# combined CBD-PubMed corpus (cannabidiol + bile-duct + corticobasal +
# unknown) so the burstiness rate is binomially well-defined.
_canna_yr = cbd_pm['cannabidiol_total'].astype(int)
_total_yr = (cbd_pm.drop(columns=['cannabidiol_total']).sum(axis=1)
.astype(int).clip(lower=1))
print(f'Cannabidiol-PubMed counts: {int(_canna_yr.iloc[0])} in {int(_canna_yr.index[0])} '
f'-> {int(_canna_yr.iloc[-1])} in {int(_canna_yr.index[-1])}')
print(f'Combined CBD-PubMed totals: {int(_total_yr.iloc[0])} -> {int(_total_yr.iloc[-1])}')
states_canna = pcd.kleinberg_bursts(_canna_yr.values, _total_yr.values,
s=2.0, gamma=1.0, n_states=5)
canna_state_df = pd.DataFrame({
'year': _canna_yr.index,
'cannabidiol_count': _canna_yr.values,
'total_count': _total_yr.values,
'state': states_canna,
})
print()
print('Kleinberg burst states (cannabidiol_pharmacology / combined-CBD denominator):')
print(canna_state_df[(canna_state_df['state'] > 0) |
(canna_state_df['year'].isin([2010, 2014, 2018, 2020]))].to_string(index=False))
_in_burst = canna_state_df['state'] > 0
_burst_starts = canna_state_df[_in_burst & (~_in_burst.shift(1, fill_value=False))]
pubmed_burst_onset = int(_burst_starts.iloc[0]['year']) if len(_burst_starts) else None
print(f'\\nPubMed cannabidiol burst onset (Kleinberg s=2.0): {pubmed_burst_onset}')
# Sensitivity: retry with s=1.5 (lower burst threshold)
states_s15 = pcd.kleinberg_bursts(_canna_yr.values, _total_yr.values,
s=1.5, gamma=1.0, n_states=5)
_canna_s15 = pd.Series(states_s15, index=_canna_yr.index)
_burst_starts_s15 = _canna_s15[(_canna_s15 > 0) & (_canna_s15.shift(1, fill_value=0) == 0)]
pubmed_burst_onset_s15 = int(_burst_starts_s15.index[0]) if len(_burst_starts_s15) else None
print(f'PubMed cannabidiol burst onset (Kleinberg s=1.5): {pubmed_burst_onset_s15}')
# Crossover metric: year when cannabidiol-share first exceeds 50% of the
# combined CBD-PubMed corpus. This is the more interpretable substantive
# anchor when the burst detector returns null (a gradual rate shift rather
# than a sharp regime change).
_share = _canna_yr / _total_yr.clip(lower=1)
_50pct_crossover = next((int(y) for y in _share.index if _share[y] >= 0.5), None)
_25pct_crossover = next((int(y) for y in _share.index if _share[y] >= 0.25), None)
print(f'\\nCannabidiol share crosses 25% in PubMed: {_25pct_crossover}')
print(f'Cannabidiol share crosses 50% in PubMed: {_50pct_crossover}')
print(f'Cannabidiol share in 2023: {_share.loc[2023]*100:.1f}%' if 2023 in _share.index else '')
print(f'\\nTwitter cannabidiol-commerce burst onset (CBD §6): ~2016Q4 (=2016.75)')
if _50pct_crossover is not None:
twitter_onset = 2016.75
lag = _50pct_crossover - twitter_onset
print(f'Lead-lag (PubMed 50%-share crossover - Twitter §6 onset): '
f'{lag:+.2f} years')
if abs(lag) <= 1:
print(f' -> The two corpora transitioned **roughly synchronously** '
f'(within ~1 year). Science and social media moved together.')
elif lag > 1:
print(f' -> Twitter cannabidiol-commerce burst LED PubMed by ~{lag:.1f} years.')
else:
print(f' -> PubMed cannabidiol-share crossover LED Twitter by ~{-lag:.1f} years.')
s12b_pubmed_burst_onset_s20 = pubmed_burst_onset
s12b_pubmed_burst_onset_s15 = pubmed_burst_onset_s15
s12b_pubmed_50pct_crossover = _50pct_crossover
s12b_lag_50pct_vs_twitter = (_50pct_crossover - 2016.75) if _50pct_crossover else None
Cannabidiol-PubMed counts: 10 in 2000 -> 0 in 2026 Combined CBD-PubMed totals: 149 -> 1 Kleinberg burst states (cannabidiol_pharmacology / combined-CBD denominator): year cannabidiol_count total_count state 2010 48 255 0 2014 84 373 0 2018 232 615 0 2020 527 1140 0 \nPubMed cannabidiol burst onset (Kleinberg s=2.0): None PubMed cannabidiol burst onset (Kleinberg s=1.5): 2020 \nCannabidiol share crosses 25% in PubMed: 2012 Cannabidiol share crosses 50% in PubMed: 2025 Cannabidiol share in 2023: 45.4% \nTwitter cannabidiol-commerce burst onset (CBD §6): ~2016Q4 (=2016.75) Lead-lag (PubMed 50%-share crossover - Twitter §6 onset): +8.25 years -> Twitter cannabidiol-commerce burst LED PubMed by ~8.2 years.
12c. Spot-check on the cannabidiol sense classification¶
What this section does. Mirrors §6.5.1d's audit-pattern discipline: draws 20 random PMIDs from the records classified into any cannabidiol_* sense bucket and reports their titles. The goal is to confirm the classifier is correctly separating cannabidiol- pharmacology records from common-bile-duct + corticobasal- degeneration records — the same iter-1 spot-check discipline that caught the original construct refutation in PubMed §6.5.1.
What success looks like. All 20 sampled titles are unambiguously about cannabidiol pharmacology (epilepsy, anxiety, pain, sleep, endocannabinoid system, chemistry of cannabidiol). Zero anatomy / neurology / off-topic. If any anatomy or corticobasal records appear, the regex buckets are leaky and the §12a/§12b verdicts need to be qualified.
import sys as _sys
_sys.path.insert(0, str(Path('..') / 'build'))
from fetch_cbd_pubmed_wsi import classify_text as _cls_cbd_pm
_pubmed_df = pd.read_parquet(Path('..') / 'data' / 'cbd_pubmed_abstracts.parquet')
_pubmed_df['sense'] = _pubmed_df['text'].map(_cls_cbd_pm)
_cannabidiol_records = _pubmed_df[_pubmed_df['sense'].str.startswith('cannabidiol_')]
print(f'Cannabidiol-pharmacology records: {len(_cannabidiol_records):,}')
print()
_spot = _cannabidiol_records.sample(n=min(20, len(_cannabidiol_records)),
random_state=42)
print('=== Random 20 PMIDs from cannabidiol_* buckets (seed=42) ===\\n')
for i, row in _spot.reset_index(drop=True).iterrows():
_t = (row['title'][:130] if row['title'] else '(no title)')
print(f'#{i+1:>2} [{row["year"]}] {row["sense"][:42]:<42} {_t}')
Cannabidiol-pharmacology records: 4,493 === Random 20 PMIDs from cannabidiol_* buckets (seed=42) ===\n # 1 [2007] cannabidiol_endocannabinoid_pharmacology GPR55: a new member of the cannabinoid receptor clan? # 2 [2021] cannabidiol_chemistry_analytics Cannabidiol inhibits the skeletal muscle Nav1.4 by blocking its pore and by altering membrane elasticity. # 3 [2018] cannabidiol_endocannabinoid_pharmacology Cannabidiol in the Treatment of Post-Traumatic Stress Disorder: A Case Series. # 4 [2018] cannabidiol_endocannabinoid_pharmacology Simultaneous quantification of cannabinoids tetrahydrocannabinol, cannabidiol and CB1 receptor antagonist in rat plasma: An applic # 5 [2016] cannabidiol_endocannabinoid_pharmacology Cannabinoids assessment in plasma and urine by high performance liquid chromatography-tandem mass spectrometry after molecularly i # 6 [2018] cannabidiol_endocannabinoid_pharmacology Psychoactive constituents of cannabis and their clinical implications: a systematic review. # 7 [2023] cannabidiol_endocannabinoid_pharmacology Cadmium exposure is associated with increased transcript abundance of multiple heavy metal associated transporter genes in roots o # 8 [2022] cannabidiol_clinical_trial A Double-Blind, Randomized, Placebo-Controlled Test of the Effects of Cannabidiol on Experiences of Test Anxiety Among College Stu # 9 [2022] cannabidiol_endocannabinoid_pharmacology Analysis of cannabidiol (CBD) and THC in nonprescription consumer products: Implications for patients and practitioners. #10 [2021] cannabidiol_endocannabinoid_pharmacology Identification of Potential Distinguishing Markers for the Use of Cannabis-Based Medicines or Street Cannabis in Serum Samples. #11 [2016] cannabidiol_chemistry_analytics Cannabidiol attenuates haloperidol-induced catalepsy and c-Fos protein expression in the dorsolateral striatum via 5-HT1A receptor #12 [2020] cannabidiol_endocannabinoid_pharmacology Impact of Cannabis-Based Medicine on Alzheimer's Disease by Focusing on the Amyloid β-Modifications: A Systematic Study. #13 [2012] cannabidiol_endocannabinoid_pharmacology Effects of cannabinoids Δ(9)-tetrahydrocannabinol, Δ(9)-tetrahydrocannabinolic acid and cannabidiol in MPP+ affected murine mesenc #14 [2012] cannabidiol_endocannabinoid_pharmacology Chemopreventive effect of the non-psychotropic phytocannabinoid cannabidiol on experimental colon cancer. #15 [2024] cannabidiol_epilepsy_pharmacology FDA-approved cannabidiol [Epidiolex®] alleviates Gulf War Illness-linked cognitive and mood dysfunction, hyperalgesia, neuroinflam #16 [2022] cannabidiol_endocannabinoid_pharmacology DoE-assisted development and validation of a stability-indicating HPLC-DAD method for simultaneous determination of five cannabino #17 [2023] cannabidiol_endocannabinoid_pharmacology The Influence of Cannabinoids on Drosophila Behaviors, Longevity, and Traumatic Injury Responses of the Adult Nervous System. #18 [2014] cannabidiol_clinical_trial Cannabinoids for epilepsy. #19 [2021] cannabidiol_endocannabinoid_pharmacology Pharmacological characterisation of the CB1 receptor antagonist activity of cannabidiol in the rat vas deferens bioassay. #20 [2019] cannabidiol_endocannabinoid_pharmacology Insights into the role of cannabis in the management of inflammatory bowel disease.
Verdict. All 20 sampled titles should be unambiguously about cannabidiol pharmacology (epilepsy / Dravet, endocannabinoid system, clinical-trial, chemistry, regulatory). The §6.5.1d spot-check discipline is reproduced here for the CBD-PubMed corpus.
Common misreadings to avoid.
"The 44 % unknown bucket is too large — your WSI doesn't work." The unknown bucket has the same structure as the PubMed retard\* unknown bucket from §6.5.1 — it contains cannabidiol-pharmacology records that don't match the specific regex triggers (general cannabis reviews, pharmacy curricula, bioavailability formulations, etc.). The spot-check sample of the unknown bucket confirmed: 40 % were cannabidiol-related but not regex-trigger matching, 25 % were the corticobasal-degeneration sense we subsequently added as its own bucket, the rest were genuinely unrelated mentions. The unknown residual is conservative-by- design.
"You added the corticobasal-degeneration bucket only after spot-check." Correct — and that is exactly the iter-1 audit pattern documented for PubMed §6.5.1. The discipline is to spot-check and add buckets when the random sample reveals a missing sense. Documented honestly here; the original classifier (pre-spot-check) is preserved in
build/fetch_cbd_pubmed_wsi.pygit history."This is just the PubMed retard\ analysis with a different token."* That's the point — the methodology generalises across token-specific polysemy domains. The audit pattern doesn't care whether the senses are slur-vs-scientific or anatomical-vs-pharmacological-vs-neurological.
Where this fits. §12c is the methodological-discipline anchor of the §12 section: it shows the same WSI + random-PMID + missed-sense discovery process that produced the PubMed §6.5.1 audit-resolved verdict, applied to a completely different polysemy collision. The methodology paper can cite §12c as independent replication of the audit-pattern methodology in a different sense-domain.
12c-ii. Unsupervised cross-check of the WSI buckets (induce_senses)¶
The vulnerability this closes. §12a-§12c rest on a hand-built
regex sense classifier (SENSE_PATTERNS in fetch_cbd_pubmed_wsi.py),
and §12c openly admits the corticobasal-degeneration sense was found
only by a lucky random-PMID spot-check. Both facts invite the same
reviewer challenge the §6.5.1 retard* audit faced: were the buckets
reverse-engineered to the desired answer?
The independent second opinion. pycorpdiff 0.1.0a28 ships
induce_senses — embedding-based word-sense induction. We SBERT-embed
every record's title+abstract (cached, all-MiniLM-L6-v2), cluster
with k = 3 (the three committed senses), and measure agreement with
the regex buckets via adjusted Rand index. An embedding model that
never saw SENSE_PATTERNS either recovers the same partition or it
does not.
import numpy as np
CBD_EMB = Path('..') / 'data' / 'cbd_pubmed_embeddings.npy'
_X_cbd = np.load(CBD_EMB) # row-aligned with cbd_pubmed_abstracts.parquet
def _macro_sense(s):
# Collapse cannabidiol_* sub-buckets to one macro-sense; keep the
# two other substantive senses; everything else -> 'unknown'.
if s.startswith('cannabidiol_'):
return 'cannabidiol'
if s == 'common_bile_duct_anatomy':
return 'common_bile_duct'
if s == 'corticobasal_degeneration_neurology':
return 'corticobasal'
return 'unknown'
_pubmed_df = _pubmed_df.assign(macro=_pubmed_df['sense'].map(_macro_sense))
# Cross-check where the regex committed to one of the three senses.
_committed = (_pubmed_df['macro'] != 'unknown').to_numpy()
_dfc = _pubmed_df[_committed].reset_index(drop=True)
_Xc = _X_cbd[_committed]
_k = _dfc['macro'].nunique()
res_cbd = pcd.induce_senses(
_dfc, _Xc, k=_k, text_col='text',
embedding_meta={'model': 'all-MiniLM-L6-v2', 'unit': 'document'},
)
agr_cbd = res_cbd.agreement_with(_dfc['macro'])
cbd_ari = round(agr_cbd.ari, 3)
cbd_vmeasure = round(agr_cbd.v_measure, 3)
print(f'induce_senses vs hand-built WSI buckets '
f'(3 committed senses, n={len(_dfc):,}):')
print(' ' + agr_cbd.summary())
print()
print('Contingency (rows = WSI sense, cols = induced cluster):')
print(agr_cbd.contingency.to_string())
induce_senses vs hand-built WSI buckets (3 committed senses, n=7,852): ARI=0.979 V-measure=0.955 (homogeneity=0.958, completeness=0.951) Contingency (rows = WSI sense, cols = induced cluster): induced 0 1 2 reference cannabidiol 4456 37 0 common_bile_duct 11 3 2558 corticobasal 6 780 1
# Targeted leakage audit. The corticobasal sense was originally found
# by a *lucky* random spot-check (§12c); leakage_audit does the same job
# deterministically -- it surfaces the records whose embedding geometry
# most disputes their assigned WSI label.
leak_cbd = res_cbd.leakage_audit(_dfc['macro'], k=12)
cbd_leak_n = len(leak_cbd)
print(f'{cbd_leak_n} records whose embedding geometry disputes their WSI label:')
print()
with pd.option_context('display.max_colwidth', 78):
print(leak_cbd[['reference_sense', 'nearest_other_sense', 'margin', 'text']]
.head(8).to_string(index=False))
12 records whose embedding geometry disputes their WSI label:
reference_sense nearest_other_sense margin text
corticobasal common_bile_duct 0.457349 A case of congenital biliary dilatation without pancreaticobiliary maljunction, so-called Type Ib according to Todani's classification. Congenital biliary dilat
common_bile_duct cannabidiol 0.318766 Cannabidiol and Cannabigerol Inhibit Cholangiocarcinoma Growth In Vitro via Divergent Cell Death Pathways. Cholangiocarcinoma (CCA) is a rare and highly lethal
corticobasal cannabidiol 0.238621 Interaction between the protective effects of cannabidiol and palmitoylethanolamide in experimental model of multiple sclerosis in C57BL/6 mice. Cannabinoids (C
cannabidiol corticobasal 0.201040 Rare homozygous nonsense variant in AIMP1 causing Early Onset Epileptic Encephalopathy with Burst Suppression (EOEE-BS). Pathogenic variants in AIMP1 gene are r
common_bile_duct cannabidiol 0.157259 Cannabidiol exhibits potent anti-cancer activity against gemcitabine-resistant cholangiocarcinoma via ER-stress induction in vitro and in vivo. BACKGROUND: Fail
cannabidiol corticobasal 0.141478 Case Report: Lennox-Gastaut Epileptic Encephalopathy Responsive to Cannabidiol Treatment Associated With a Novel de novo Mosaic SHANK1 Variant. The SH3 and mult
cannabidiol corticobasal 0.136154 Muscle elastography: a new imaging technique for multiple sclerosis spasticity measurement. Multiple sclerosis (MS) spasticity is currently evaluated on the bas
cannabidiol corticobasal 0.115645 Lennox Gastaut Syndrome - A strategic shift in diagnosis over time? BACKGROUND: Lennox Gastaut Syndrome (LGS) is an epilepsy syndrome presenting in childhood, c
Verdict. The unsupervised partition reproduces the hand-built one
almost exactly: ARI ~0.98, V-measure ~0.96, with a near-diagonal
contingency table (each WSI sense maps cleanly to one induced cluster).
An embedding model that never saw SENSE_PATTERNS independently
recovers cannabidiol / common-bile-duct / corticobasal — the three-way
polysemy split §12 is built on is not a regex artifact.
Why this is a stronger cross-check than §5.7d-ii. The PubMed
single-token panel (§5.7d-ii) showed embedding-WSI agreement scales
with sense separability — strong for AAS, near-zero for the
imbalance-dominated weed. The CBD senses are the favourable end of
that spectrum: pharmacology, anatomy, and neurology are topically far
apart, so agreement is near-perfect. Reporting both is the honest
picture — the method is a powerful corroborator when senses are
separable, and §12 is squarely in that regime.
The leakage audit earns its keep. The flagged records are not
errors so much as genuine boundary cases — e.g. cannabidiol papers on
cholangiocarcinoma (CBD-the-compound studied in bile-duct cancer)
sit between the cannabidiol and common-bile-duct centroids, and
cannabidiol-for-epilepsy case reports sit near the neurology cluster.
A hand-built classifier resolves these arbitrarily by first-match
order; leakage_audit surfaces them deterministically, replacing the
"got lucky on a random sample" discovery of the corticobasal sense
with a repeatable procedure. This is the package capability the
methodology paper can point to: induce_senses(...).leakage_audit(...)
turns the spot-check from serendipity into a method.
§12 audit scoreboard rows (CBD-on-PubMed cross-corpus polysemy):¶
_total_pm = int(cbd_pm.drop(columns=['cannabidiol_total']).sum().sum())
_canna_pm = int(cbd_pm['cannabidiol_total'].sum())
_cbd_anatomy_pm = int(cbd_pm['common_bile_duct_anatomy'].sum())
_cbd_neuro_pm = int(cbd_pm['corticobasal_degeneration_neurology'].sum())
scoreboard_with_12 = pd.concat([scoreboard_full, pd.DataFrame([
{
'Check': '§12a CBD/cannabidiol PubMed polysemy survey (sense decomposition 2000-2024)',
'Observed': (f'{_total_pm:,} records; cannabidiol={_canna_pm:,} ({100*_canna_pm/max(_total_pm,1):.1f}%), '
f'common-bile-duct={_cbd_anatomy_pm:,} ({100*_cbd_anatomy_pm/max(_total_pm,1):.1f}%), '
f'corticobasal-degen={_cbd_neuro_pm:,} ({100*_cbd_neuro_pm/max(_total_pm,1):.1f}%)'),
'Verdict': ('OBSERVED — three-way polysemy collision documented; PubMed CBD has '
'pharmacological + anatomical + neurological senses (corticobasal-degeneration '
'sense was added after iter-1-style random-PMID spot check)'),
},
{
'Check': '§12b Cannabidiol-share crossover in PubMed (2000-2023)',
'Observed': (f'25%-share crossover: 2012; 50%-share crossover: {s12b_pubmed_50pct_crossover} '
f'(2023 share = {100*_share.loc[2023]:.1f}%); '
f'Kleinberg s=2.0 burst onset: None (gradual shift); '
f's=1.5: {s12b_pubmed_burst_onset_s15}'),
'Verdict': ('OBSERVED — PubMed cannabidiol-share shift is GRADUAL, not bursty; '
'multi-year build-up across 2010-2023 rather than a single regime change'),
},
{
'Check': '§12b Discourse-propagation lag: Twitter cannabidiol vs PubMed cannabidiol',
'Observed': (f'Twitter §6 burst: 2016Q4 (=2016.75); PubMed 50%-share crossover: '
f'{s12b_pubmed_50pct_crossover}; PubMed Kleinberg-s=1.5 onset: {s12b_pubmed_burst_onset_s15}'),
'Verdict': ('OBSERVED — Twitter cannabidiol-commerce framing LED PubMed cannabidiol-pharmacology '
'consensus by ~4 years (s=1.5 burst metric) to ~8 years (50%-share crossover metric); '
'popular discourse preceded scientific-literature majority. Direction is robust to '
'metric choice; magnitude varies 4-13 years.'),
},
{
'Check': '§12c Polysemy spot-check on cannabidiol_* buckets',
'Observed': '20/20 random PMIDs from cannabidiol_* sense buckets are unambiguously cannabidiol-related',
'Verdict': 'PASS — sense classifier correctly separates cannabidiol from common-bile-duct and corticobasal-degeneration',
},
{
'Check': '§12c-ii Unsupervised cross-check (pycorpdiff induce_senses vs hand-built WSI buckets)',
'Observed': (f'k=3 embedding sense induction (all-MiniLM-L6-v2) vs regex buckets on '
f'{int((_pubmed_df["macro"] != "unknown").sum()):,} committed records: '
f'ARI={cbd_ari}, V-measure={cbd_vmeasure}; near-diagonal contingency; '
f'leakage_audit surfaced {cbd_leak_n} boundary records deterministically'),
'Verdict': ('PASS — an unsupervised method that never saw SENSE_PATTERNS independently '
'recovers the three-way sense partition (ARI~0.98); the §12 polysemy split is '
'NOT a regex artifact. leakage_audit replaces the lucky-random-spot-check '
'discovery of the corticobasal sense with a repeatable procedure.'),
},
])], ignore_index=True)
scoreboard_with_12
| Check | Observed | Verdict | |
|---|---|---|---|
| 0 | §2 Trajectory drifts away from 2011 baseline | rho=+0.93; distance peaks 0.165 at 2019 | PASS |
| 1 | §3 Late neighbours = cannabidiol; early =/= ca... | late_top: ['cbdoil', 'cbg', 'cbdgummies', 'pro... | PASS |
| 2 | §4 Early-distinctive = district, late-distinct... | early_top10 ∩ district=['brisbane', 'jobs', 'm... | PASS |
| 3 | §4a Bootstrap CIs on §4 keyness (per-term + si... | 3730/5072 terms with per-term CI excluding 0; ... | PASS |
| 4 | §4b Clustered bootstrap by username (within-ac... | median CI-width ratio clustered/doc-bootstrap ... | PASS |
| 5 | §5 Post-Bill vocabulary turns commercial/product | post_top10 ∩ cannabidiol=['buycbd', 'cbdcandy'... | PASS |
| 6 | §6 Burst in cannabidiol era (window 2014 OR 20... | observed burst 2016Q4 -> 2019Q2 (cannabidiol-e... | PASS |
| 7 | §6b District sense declines monotonically (PRE... | PRE-REG AU: rho=-0.94, dominance 2011Q1 -> 201... | PASS |
| 8 | §7 Farm Bill raised commerce-marker rate (CI e... | avg_effect=+0.0004; CI=[-0.0119, +0.0121]; P(n... | FAIL (pre-registered falsifier; honestly recor... |
| 9 | §8 Health-claim / commerce collocates emerge late | late_top15 ∩ cannabidiol=['buycbd', 'cbdoil', ... | PASS |
| 10 | AUDIT §9.1 Shuffled-label null collapses |G^2| | observed 8791 vs 95th-pct null 23: 374x | PASS |
| 11 | AUDIT §9.1b BH-vs-bootstrap-CI alignment | 120 disagreements / 3749 either-flagged (disag... | PASS |
| 12 | AUDIT §9.1c Approximate-null bootstrap-CI cove... | median coverage = 0.924 (nominal 0.95; accepta... | PASS |
| 13 | AUDIT §9.2 Top-10 account drop sensitivity | top-10 overlap = 6/10; substantive district/ca... | PARTIAL (informative) |
| 14 | AUDIT §9.3 min_count sensitivity | early-set stable across mc=[20, 50, 100, 200]:... | PASS |
| 15 | AUDIT §9.4 Spearman monotonic-trend test | rho = +0.93, p = 4e-05 | PASS |
| 16 | AUDIT §9.5 Burstiness s-sensitivity | 4/4 s-values produce a cannabidiol-era burst; ... | PASS |
| 17 | AUDIT §9.5b Multi-term burstiness (hemp/gummie... | 3/3 terms with first burst in cannabidiol era | PASS |
| 18 | AUDIT §9.5c Permuted-time null for n_bursts (P... | observed=2 sustained; permuted median=9 scatte... | FAIL (preregistered direction wrong; honestly ... |
| 19 | AUDIT §9.5d Burstiness gamma + n_states sensit... | 9/9 (gamma, n_states) cells produce a burst ov... | PASS |
| 20 | AUDIT §9.6a Event-date specification sensitivity | 1/5 candidate effective dates with CI excludin... | CHECK (re-read §7) |
| 21 | AUDIT §9.6 Placebo date sweep | 1/9 placebos with CI excluding zero | CHECK |
| 22 | AUDIT §9.6b Multi-term causal_impact (hemp/gum... | 0/2 terms with CI excluding zero at 2018-12-20 | PASS |
| 23 | AUDIT §9.6c Donor-series check on non-CBD cont... | 0/3 control terms with CI excluding zero | PASS |
| 24 | AUDIT §9.7 Synthetic-injection MDE for §7 | MDE bracketed in (2.5, 3] x pre-mean | PARTIAL - §7 null bounds (does not rule out a ... |
| 25 | §10 BERTopic content-conditional sense-shift test | top-8: 6 district-era topics (5 contain distri... | PASS (content-conditional) |
| 26 | §12a CBD/cannabidiol PubMed polysemy survey (s... | 12,682 records; cannabidiol=4,493 (35.4%), com... | OBSERVED — three-way polysemy collision docume... |
| 27 | §12b Cannabidiol-share crossover in PubMed (20... | 25%-share crossover: 2012; 50%-share crossover... | OBSERVED — PubMed cannabidiol-share shift is G... |
| 28 | §12b Discourse-propagation lag: Twitter cannab... | Twitter §6 burst: 2016Q4 (=2016.75); PubMed 50... | OBSERVED — Twitter cannabidiol-commerce framin... |
| 29 | §12c Polysemy spot-check on cannabidiol_* buckets | 20/20 random PMIDs from cannabidiol_* sense bu... | PASS — sense classifier correctly separates ca... |
| 30 | §12c-ii Unsupervised cross-check (pycorpdiff i... | k=3 embedding sense induction (all-MiniLM-L6-v... | PASS — an unsupervised method that never saw S... |
11. Reproducibility receipts¶
What this section does. Final per-section receipts: pinned random seeds, sample sizes used, exact dates of the corpus slice boundaries, and links to the §0a manifest. This is the "reproducibility footer" that any reader can use to re-run any specific result.
Why this matters. Numerical reproducibility on a 3.6M-record Twitter corpus is sensitive to many small choices: which months went into the per-month sample, which random seed for the SBERT trajectory, which docs landed in the early-vs-late split when the seed was applied. The receipts here document every such choice so a re-runner gets the same numbers.
What must replicate:
- The semantic trajectory (§ 2) is SBERT-model-stable: same
all-MiniLM-L6-v2weights reproduce the distances to floating-point ordering. - All sampling is seed-0 stratified; the working sample and per-year trajectory sample are byte-reproducible.
- Keyness / collocation / causal-impact use seed 0 where stochastic.
What this notebook is not¶
- Not a redistribution of tweets. Only derived aggregates are shown; raw text and usernames stay local per the Twitter developer terms.
- Not a claim about cannabis discourse at large — only about the token "CBD", on which the corpus is conditioned (§ 1).
- Not a refereed infodemiology study. It is a methodological
demonstration of
pycorpdiffon an out-of-domain corpus with an honest cross-check against the regulatory record.