pycorpdiff — the showcase¶
Comparative corpus analysis for modern Python workflows.
How does political discourse change? Whose words shape a movement? What did immigrant mean in 2005 versus 2023? Corpus linguistics, digital humanities, and computational social science answer questions like these — but the existing tooling is fragmented across quanteda (R), SketchEngine (closed-source), nltk / spaCy / gensim (Python NLP), and a half-dozen visualisation packages.
pycorpdiff is the missing comparative layer. Three verbs — compare(a, b), track(c, term), compare.before_after(c, event) — and one Protocol-based plugin model give you a coherent workflow from raw text to publication-grade figures, with every result carrying its own evidence.
This notebook drives every analytical surface in the package on a single research question: how did UK parliamentary discourse on migration evolve across two decades? Along the way we cross-validate against Scattertext on US convention speeches and HistWords on diachronic word embeddings — receipts you can take to a peer reviewer.
Bring coffee.
What we'll cover:
| Part | Theme | API |
|---|---|---|
| I | The corpus | load_hansard_sample · Corpus.slice |
| II | Lexical fingerprints | compare(a, b).keyness() + volcano / bar / explain |
| III | Collocational fingerprints | .collocation_shift() + diverging bar |
| IV | The temporal arc | track().over_time() · .changepoints() · .interrupted_time_series() |
| V | Beneath frequency | .semantic_shift() · semantic_trajectory() · neighborhood_drift |
| VI | The fanout | cross-party · cross-topic · compare.before_after |
| VII | Cross-validation | Scattertext · HistWords · Rayson reference values |
| VIII | The plumbing | polars · DuckDB · multilingual tokenizers |
| IX | Where next | follow-ups + real-world fetchers |
import warnings
import altair as alt
import numpy as np
import pandas as pd
import pycorpdiff as pcd
# Pre-render every chart to inline SVG via vl-convert so plots
# show up natively on GitHub, JupyterLab, VS Code, nbviewer, and
# the rendered HTML pipeline alike.
alt.renderers.enable('svg')
alt.data_transformers.disable_max_rows()
# Suppress fit-time convergence warnings (statsmodels, scipy);
# they include verbose interpreter paths that aren't worth
# carrying into the notebook output.
warnings.filterwarnings('ignore')
warnings.showwarning = lambda *a, **kw: None # suppress any stragglers
# A consistent two-color palette: humanising in calm blue, criminalising in alert red.
HUMAN, CRIMINAL = '#2E86AB', '#E63946'
print(f'pycorpdiff {pcd.__version__}')
import pandas as pd # noqa: F401 (already imported above; kept for the adapter)
from pysofra import SofraTable
from pysofra.core import Cell, HeaderCell, Row, HeaderRow
def df_to_sofra(df, *, theme='jama', caption=None, footnotes=()):
cols = list(df.columns)
headers = (HeaderRow(cells=tuple(HeaderCell(text=c) for c in cols)),)
def fmt(col, val):
try:
if pd.isna(val):
return 'n/a'
except (ValueError, TypeError):
return str(val)
if 'p_' in col or col == 'p_value':
return f'{val:.2e}' if val < 0.001 else f'{val:.3f}'
if isinstance(val, float):
if 'g2' in col or 'ratio' in col or 'effect' in col or 'change' in col:
return f'{val:+.4f}' if abs(val) < 1 else f'{val:+.2f}'
return f'{val:.4f}'
return str(val)
rows = tuple(
Row(cells=tuple(Cell(text=fmt(c, row[c])) for c in cols))
for _, row in df.iterrows()
)
return SofraTable(
rows=rows, headers=headers,
caption=caption, footnotes=footnotes, theme_name=theme,
)
Part I — The corpus¶
We start with the bundled load_hansard_sample() — a 193-speech synthetic corpus modelled on UK Hansard. Synthetic in the prose; real in the structure: nineteen years (2005–2023), four parties, four topics (immigration / Brexit / NHS / climate), and frame shifts at the right historical inflection points. Real Hansard from parliament.uk works through the same Corpus API; see Part VIII.
corpus = pcd.load_hansard_sample()
print(f'{len(corpus):,} speeches · {corpus.total_tokens():,} tokens · {len(corpus.docs.columns)} metadata columns')
corpus.docs.head(3)
# A quick look at the structural skeleton: speeches per topic per year, coloured by frame.
skeleton = corpus.docs.groupby(['year', 'topic']).size().reset_index(name='n')
(
alt.Chart(skeleton)
.mark_rect()
.encode(
x=alt.X('year:O', title=None),
y=alt.Y('topic:N', title=None),
color=alt.Color('n:Q', scale=alt.Scale(scheme='blues'), title='speeches'),
tooltip=['year', 'topic', 'n'],
)
.properties(width=560, height=140, title='Corpus structure -- speeches per topic per year')
)
Part II — Lexical fingerprints¶
Question: which words separate the humanising frame from the criminalising frame on immigration?
compare(a, b).keyness() computes the signed Dunning log-likelihood (G²) for every shared-vocabulary term, plus Hardie's LogRatio effect size, Gabrielatos's %DIFF, the BIC-Bayes factor, Juilland's D and Gries's DP dispersion checks, and Benjamini–Hochberg multiple-comparison correction — all in one call.
immigration = corpus.slice(topic='immigration')
human = immigration.slice(frame='humanising')
criminal = immigration.slice(frame='criminalising')
print(f"humanising: {len(human)} speeches · {human.total_tokens()} tokens")
print(f"criminalising: {len(criminal)} speeches · {criminal.total_tokens()} tokens")
keyness = pcd.compare(human, criminal).keyness(min_count=3, dispersion=True)
print(keyness.summary())
keyness.table.head(8)
df_to_sofra(
keyness.table.head(8)[
['term', 'count_a', 'count_b', 'g2', 'log_ratio', 'p_value', 'p_adjusted']
],
caption='Top 8 keyness terms — humanising vs criminalising frames',
footnotes=('Signed Dunning G²; p-values BH-adjusted across the full vocabulary.',),
)
The volcano plot — effect size on x, statistical significance on y. Each labelled term is a hypothesis.
keyness.plot().properties(
width=560, height=380,
title=alt.TitleParams(
text='Lexical fingerprint: humanising vs criminalising',
subtitle='Top labels by |LogRatio| -- points right of zero are humanising-leaning',
),
)
The bar chart — same data, cleaner for a slide deck. Each Result has multiple visual modes.
keyness.plot(kind='bar', n=15).properties(
width=520, title='Top 15 keyness terms, signed G^2'
)
The Scattertext-style plot — Kessler 2017's signature interactive scatter, surfaced as keyness.plot(kind='scattertext'). Each term sits at (rank-percentile in A, rank-percentile in B); the dashed diagonal is the equally-distinctive line; off-diagonal points are the words that pry the two frames apart. Pan/zoom is live; hover for the full row.
Stopwords cluster in the top-right; humanising-distinctive terms fall into the lower-right zone (blue); criminalising-distinctive terms into the upper-left zone (red). One picture, every keyword.
keyness.plot(kind='scattertext', n_labels=18).properties(
width=600, height=600,
title=alt.TitleParams(
text="Scattertext -- humanising vs criminalising on 'immigration'",
subtitle='Rank-percentile on each axis / diagonal = equally distinctive / size proportional to |G^2|',
),
)
Where in the corpus does it actually appear? Frequency tells you how often; dispersion tells you how spread out. pcd.viz.dispersion_plot is the Mosteller-style tick chart that turns each occurrence of a target into a mark on the document timeline. Bursty terms cluster; ubiquitous terms spread evenly.
Here we contrast a humanising-leaning keyword and a criminalising-leaning keyword across the immigration sub-corpus.
from pycorpdiff.viz import dispersion_plot
# Pick one distinctive term from each side of the keyness ranking.
top_human = keyness.table[keyness.table['g2'] > 0]['term'].iloc[0]
top_criminal = keyness.table[keyness.table['g2'] < 0]['term'].iloc[0]
dispersion_plot(immigration, [top_human, top_criminal]).properties(
width=560,
title=alt.TitleParams(
text='Dispersion across the immigration sub-corpus',
subtitle=f'Each tick = one occurrence in a document / {top_human!r} vs {top_criminal!r}',
),
)
Two knobs worth knowing. Same machinery, two power-user toggles:
stop_words=filters function words after the totals are computed, so the rate denominators stay honest.method='chi_squared'swaps in Pearson χ² instead of G². They're asymptotically equivalent; χ² is the form quanteda and SketchEngine often report.
STOPWORDS = {'the', 'of', 'and', 'to', 'a', 'in', 'that', 'is', 'it'}
filtered = pcd.compare(human, criminal).keyness(
min_count=3, stop_words=STOPWORDS, method='chi_squared',
)
print(filtered.summary())
df_to_sofra(
filtered.table[['term', 'count_a', 'count_b', 'g2', 'chi_squared', 'log_ratio']].head(8),
caption='Top 8 keyness terms — Pearson χ² method, function-word stop-list applied',
footnotes=('Same machinery as the default G²; χ² as the sort statistic.',),
)
Permutation p-values — when the χ² approximation is suspect. Dunning's 1993 paper warned that the asymptotic p-values from G² lose calibration when expected cell counts get small — exactly where rare keywords spend most of their time. The robust answer is a permutation test: build the null distribution empirically by shuffling document labels and recomputing G² each pass.
permutation_n=999 adds an empirical p-value to every row. Computationally heavier than the asymptotic χ², so it's opt-in — but for paper-grade work on small or zero-cell-heavy data, it's the gold standard. The Phipson–Smyth (2010) +1/+1 correction means the smallest reportable p is 1/(B+1), not 0.
keyness_perm = pcd.compare(human, criminal).keyness(
min_count=3, permutation_n=999, permutation_seed=42,
)
(
keyness_perm.table
[['term', 'count_a', 'count_b', 'g2', 'p_value', 'p_permutation']]
.head(10)
.assign(
p_value=lambda d: d['p_value'].map(lambda p: f'{p:.2e}'),
p_permutation=lambda d: d['p_permutation'].map(lambda p: f'{p:.4f}'),
)
)
df_to_sofra(
keyness_perm.table
[['term', 'count_a', 'count_b', 'g2', 'p_value', 'p_permutation']]
.head(10),
caption='Asymptotic χ² vs empirical permutation p-values (B = 999)',
footnotes=('Phipson–Smyth +1/+1 correction; permutation floor = 1/1000 = 0.001.',),
)
Both columns agree on which terms matter — every strong asymptotic signal also hits the permutation floor (1/1000 = 0.001). The value of the permutation column is in the other direction: it caps how confident you can be about a finding given finite resampling, and it stays well-behaved where the χ² approximation starts drifting. Plot the two against each other to see the relationship at a glance.
import numpy as np
pp = keyness_perm.table.copy()
pp['neg_log_p_chi2'] = -np.log10(np.clip(pp['p_value'], 1e-300, 1.0))
pp['neg_log_p_perm'] = -np.log10(np.clip(pp['p_permutation'], 1e-300, 1.0))
(
alt.Chart(pp.head(40))
.mark_circle(opacity=0.7, size=80)
.encode(
x=alt.X('neg_log_p_chi2:Q', title='-log10(asymptotic Chi^2 p)'),
y=alt.Y('neg_log_p_perm:Q', title='-log10(permutation p)'),
tooltip=['term', 'count_a', 'count_b', 'g2',
alt.Tooltip('p_value:Q', format='.2e'),
alt.Tooltip('p_permutation:Q', format='.4f')],
)
.properties(
width=480, height=360,
title=alt.TitleParams(
text='Asymptotic vs permutation p-values agree on strong signals',
subtitle='Permutation floor = 1/1000 (top of y-axis)',
),
)
)
Bootstrap confidence intervals on G² — direction-of-effect honesty.
A G² point estimate is computed from a sample of speeches; the
underlying population ("the kind of debate the corpus represents")
has its own variability. ci="bootstrap" resamples documents
with replacement on each side and reports the empirical 2.5 / 97.5
percentiles of the per-term G². When the CI straddles zero, we
have a useful signal: the direction of over-use isn't certain.
Documents (not tokens) are the unit of resampling, matching the
population-level inference researchers actually want.
keyness_boot = pcd.compare(human, criminal).keyness(
min_count=3, ci='bootstrap', n_boot=399, bootstrap_seed=0
)
cols = ['term', 'count_a', 'count_b', 'g2', 'g2_ci_lower', 'g2_ci_upper']
keyness_boot.table[cols].head(10)
Reference-corpus keyness — against_baseline.
The canonical setup in lexicography and discourse analysis: compare
the corpus against a fixed reference (BNC, COCA, an in-house
archive, …) rather than another corpus on hand. The reference side
ships as an aggregated (term, count) frequency list — typically
two to three orders of magnitude smaller than the source text.
pycorpdiff bundles one starter: "gutenberg_fiction" (five
Project Gutenberg English-fiction texts, 1813–1897, ~500K tokens).
User-built baselines come from pcd.baseline_from_corpus(corpus).
baseline_result = pcd.against_baseline(
corpus, 'gutenberg_fiction', min_count=10
)
print(baseline_result.summary())
baseline_result.table[['term', 'count_a', 'count_b', 'g2', 'log_ratio']].head(10)
Every result carries its evidence. .explain(term) pulls KWIC concordances from both sides — no black boxes.
top_term = keyness.table.iloc[0]['term']
print(f'Showing KWIC contexts for the top keyword: {top_term!r}\n')
keyness.explain(top_term, n=3, window=4).table
Lexical diversity — how varied is the vocabulary? Keyness asks which words separate the corpora; lexical diversity asks how varied each one is internally. Type-token ratio is the naive measure but it's confounded with text length, so we ship three length-robust alternatives: MATTR (moving-average TTR, Covington & McFall 2010), MTLD (mean factor length to a running-TTR threshold, McCarthy & Jarvis 2010), and HD-D (hypergeometric expected vocabulary in a 42-token sample, McCarthy & Jarvis 2007). Pooled corpus-level:
ld_human = pcd.lexical_diversity(human)
ld_criminal = pcd.lexical_diversity(criminal)
print(ld_human.summary())
print(ld_criminal.summary())
The temporal arc. freq='Y' slices by year and reports
per-period metrics; ci='bootstrap' adds document-resampled
uncertainty bands. The plot facets one panel per metric so
the very different scales (TTR ~ 0.5, MTLD ~ 50–100, HD-D
~ 30) stay legible side-by-side. The Hansard sample is small
per-year so the bands are wide — real archives produce much
tighter trajectories.
ld_traj = pcd.lexical_diversity(
immigration, freq='Y', time_col='date',
ci='bootstrap', n_boot=99, seed=0,
)
ld_traj.plot()
Pre-balancing via Coarsened Exact Matching.
The keyness above is honest about what separates the two frames
lexically, but doesn't tell us whether the lexical difference
rides on a confounding difference in year × party composition
between the two slices. pcd.match runs CEM (Iacus, King & Porro
2012): coarsen each covariate (year → quantile bins, party →
as-is), match strata across sides, drop unmatched strata, and
subsample the over-represented side per stratum. The result is
matched slices balanced on the chosen covariates — and a
before-vs-after L1-imbalance diagnostic showing how much
confounding the match removed.
matched = pcd.match(human, criminal, on=['year', 'party'], seed=0)
print(matched.summary())
matched.imbalance
Matched keyness re-runs the same scoring on the balanced subsample. Terms that survive here are the lexical signal after the demographic confounders are stripped out.
keyness_matched = pcd.compare(matched.a_matched, matched.b_matched).keyness(min_count=1)
keyness_matched.table[['term', 'count_a', 'count_b', 'g2', 'log_ratio']].head(10)
Part III — Collocational fingerprints¶
Question: what does each frame put next to the word 'immigrant'? This is where Hardie's logDice and SketchEngine's word-sketch tradition shine — the company a word keeps says more than its frequency.
collocation_shift computes window-based co-occurrences in each corpus, applies Laplace smoothing so absent collocates yield finite scores, and reports the per-collocate delta score_a − score_b.
shift = pcd.compare(human, criminal).collocation_shift(
'immigrant', window=4, min_count=3, measure='logDice'
)
print(shift.summary())
df_to_sofra(
shift.table.head(10),
caption="Top 10 collocate shifts of 'immigrant' — logDice, window 4",
footnotes=('Score_a − score_b; Laplace smoothing α = 0.5.',),
)
shift.plot(n=14).properties(
width=520,
title=alt.TitleParams(
text="What 'immigrant' collocates with, by frame",
subtitle='Positive bars: humanising-leaning collocates / Negative: criminalising',
),
)
.explain(collocate) here is sharper than for keyness: it returns only the KWIC windows in which both 'immigrant' AND the collocate co-occur — the actual textual evidence for the shift.
shift.explain('criminal', n=3).table
Bigrams and trigrams as first-class terms. NgramTokenizer wraps any base tokenizer and emits joined n-grams ('border_security', 'asylum_seeker'). Because the output is still list[str], every analytical surface — keyness, collocations, semantic shift — picks up multi-word units with no special-casing.
Useful when the discriminating unit is a phrase, not a word.
from pycorpdiff.tokenize import NgramTokenizer
bigram_tok = NgramTokenizer(n=2)
human_bg = human.parent.with_tokenizer(bigram_tok).slice(frame='humanising', topic='immigration')
criminal_bg = criminal.parent.with_tokenizer(bigram_tok).slice(frame='criminalising', topic='immigration')
bigram_keyness = pcd.compare(human_bg, criminal_bg).keyness(min_count=2)
print(f'{len(bigram_keyness.table):,} bigram terms ranked')
bigram_keyness.table[['term', 'count_a', 'count_b', 'g2', 'log_ratio']].head(10)
The discourse graph — every term as a node¶
Question: what does the discourse look like as a graph? Collocation shift answers "which collocates of X moved". The complementary view is to step back and let every term be a node, every co-occurrence be an edge, and read off the cluster structure of the discourse itself.
pcd.cooccurrence_network(corpus, ...) builds the network: top-N frequent terms as nodes, edges weighted by PMI (or t-score / MI³ / logDice) within a context window. The .plot() lays it out via networkx.spring_layout and renders with altair — pan/zoom, hover tooltips, the works.
net = pcd.cooccurrence_network(
immigration, top_n=18, window=5, measure='PMI',
min_count=4, min_cooccur=5,
)
print(net.summary())
net.plot(width=720, height=560, max_edges=30, label_top_n=18).properties(
title=alt.TitleParams(
text='Discourse graph of UK parliamentary immigration speeches',
subtitle='Nodes sized by frequency / edges weighted by PMI / Kamada-Kawai layout',
),
)
The graph's structure carries information the bar charts can't. Tight clusters of mutually-collocating terms read as recurring discursive phrases; bridges between clusters mark the words that move between frames; isolated nodes are vocabulary peripherally invoked but not syntactically interlocked. Hover any edge for the raw joint count and PMI weight.
Part IV — The temporal arc¶
Question: when did the frame shift happen? Frequency analysis becomes a time-series problem the moment you have dated documents.
track(corpus, term).over_time() returns a tidy DataFrame with per-period relative frequencies and Wilson score confidence intervals (Wilson > Wald because Wald collapses near p = 0 — exactly where rare-term trajectories spend most of their time).
trajectory = pcd.track(immigration, ['worker', 'criminal', 'family']).over_time(
freq='Y', time_col='date'
)
trajectory.summary()
trajectory.plot().properties(
width=560, height=300,
title=alt.TitleParams(
text='Relative-frequency trajectory in immigration speeches',
subtitle='Wilson 95% CI bands / 2005-2023',
),
)
Changepoint detection — PELT, wrapped from ruptures. The break should land near 2016, the engineered policy event in the synthetic corpus.
trajectory.changepoints(target='criminal')
Burstiness — "when was the rate elevated?" per period.
Where changepoints() answers when did the rate change?
(segmentation by location), Kleinberg's 1999 algorithm answers
when was the rate elevated and by how much? (per-period
intensity labelling). State 0 is the base rate; state 1, 2, …
are progressively stronger bursts. Bursting up is costly,
bursting down is free.
bursts = trajectory.burstiness(target='criminal', s=2.0, gamma=1.0)
print(bursts.summary())
bursts.bursts
bursts.plot()
Interrupted time series — segmented regression via statsmodels.OLS. Quantifies the level change at a known event with confidence intervals.
Reading the row labels: intercept = pre-event baseline · time = pre-event trend · level_change = the step at the event · slope_change = post-event trend change.
Caveat on the synthetic demo. The bundled Hansard-style sample is deterministic — the engineered relative frequencies for
"criminal"in the pre-event window have very low variance, so OLS standard errors below are at floating-point precision and the p-values are astronomically small. On real data with realistic noise these numbers will be normal-looking. Treat the ITS output here as a syntactic demonstration of the API, not a statistical worked example.
df_to_sofra(
trajectory.interrupted_time_series(event_date='2016', target='criminal'),
caption="Interrupted time-series coefficients for 'criminal', 2016 event",
footnotes=('Segmented OLS (Wagner et al. 2002). level_change = step at the event; '
'slope_change = post-event trend deviation.',),
)
Where is the trajectory heading? Past interrupted_time_series the question stops being retrospective. The modern temporal stack — forecast, counterfactual, online detection — lets the same trajectory answer three predictive questions in turn.
First: .forecast() extends the line forward with state-space exponential smoothing (Hyndman et al. 2008). Rates are forecast on the logit scale and back-transformed so prediction intervals stay pinned to [0, 1] instead of leaking into negative territory. Auto-selects ETS for series of length ≥ 8, falls back to a Holt linear-trend for shorter histories.
fc = trajectory.forecast(horizon=4, level=0.95)
print(fc.summary())
fc.plot().properties(
width=600, height=320,
title=alt.TitleParams(
text='Forecast -- where is each trajectory heading?',
subtitle='Solid: observed with Wilson CI / Dashed: ETS forecast with 95% PI band',
),
)
Counterfactual causal impact is the headline addition. interrupted_time_series answers "is there a step at this known event?". The harder — and more useful — question is the counterfactual: what would the trajectory have looked like without the event? .causal_impact() wraps a Bayesian structural time-series model (Brodersen et al. 2015) — fits on the pre-event window, projects forward as the counterfactual, and reports the pointwise and cumulative effects with credible intervals from Monte Carlo against the joint posterior.
The three-panel chart below is recognisable from the CausalImpact literature: observed-vs-counterfactual, pointwise effect, cumulative effect. Each band is a 95% credible interval; the vertical dashed line marks the event.
with warnings.catch_warnings():
warnings.simplefilter('ignore')
impact = trajectory.causal_impact(
event_date='2016', target='criminal',
n_samples=1000, seed=42,
)
print(impact.summary())
# Brodersen-style metrics as a publication table.
_metrics_df = pd.DataFrame([
{'metric': 'avg effect per period',
'estimate': impact.metrics['avg_effect'],
'ci_lower': impact.metrics['avg_effect_lower'],
'ci_upper': impact.metrics['avg_effect_upper']},
{'metric': 'cumulative effect',
'estimate': impact.metrics['cumulative_effect'],
'ci_lower': impact.metrics['cumulative_effect_lower'],
'ci_upper': impact.metrics['cumulative_effect_upper']},
{'metric': 'posterior P(no effect)',
'estimate': impact.metrics['posterior_prob_no_effect'],
'ci_lower': float('nan'), 'ci_upper': float('nan')},
])
df_to_sofra(
_metrics_df,
caption='Brodersen-style causal impact metrics for the 2016 event',
footnotes=('95% credible intervals via Monte Carlo against the joint state-space posterior; '
'n_samples=1000, seed=42.',),
)
impact.plot()
Bayesian online changepoint detection — .changepoints() is offline PELT: needs the full series, returns MAP locations after the fact. .changepoints_online() does Adams & MacKay's (2007) Bayesian forward pass: at each new period it updates the posterior distribution over the run length (number of periods since the last changepoint). The MAP run length collapsing to a small number marks a changepoint; the recovery growth tells you the new regime has stabilised. Useful when you're rolling a corpus forward in time and want a 'did things just shift?' signal each period.
The diagnostic plot has three panels: the observed series with detected changepoints flagged, a log-scale heatmap of the run-length posterior, and the MAP run length over time.
online = trajectory.changepoints_online(
target='criminal', hazard=0.02,
mu_0=0.0, beta_0=0.0001,
)
print(online.summary())
online.detected_changepoints(threshold=3)
online.plot()
Part V — Beneath frequency: semantic shift¶
Frequencies and collocations measure which words appear together. Semantic shift measures whether the same word means something different now.
compare(a, b).semantic_shift(target, embedder) encodes every KWIC window around the target as a sentence vector and averages into a corpus-specific centroid. Cosine distance between the centroids is the reported shift. With SBERT it captures real meaning drift; with the deterministic HashEmbedder it gives a byte-stable demonstration of the pipeline.
# Note: HashEmbedder is for demo / test reproducibility. In real research,
# swap in `pcd.SBERTEmbedder()` — needs `pip install 'pycorpdiff[semantic]'`.
sem = pcd.compare(human, criminal).semantic_shift(
'immigrant', embedder=pcd.HashEmbedder(dim=64), window=4
)
sem.table
Multi-period semantic trajectory — for tracking semantic drift over time rather than between two corpora. Built on the same averaged-contextual-embedding machinery.
sem_traj = pcd.semantic_trajectory(
immigration,
'immigrant',
time_col='date', freq='Y',
embedder=pcd.HashEmbedder(dim=64), window=4,
)
sem_traj
plot_df = sem_traj.assign(
period_str=sem_traj['period'].astype(str),
period_ts=sem_traj['period'].apply(lambda p: p.to_timestamp()),
).drop(columns='period')
(
alt.Chart(plot_df)
.mark_line(point=True, strokeWidth=2.5)
.encode(
x=alt.X('period_ts:T', title=None),
y=alt.Y('distance_from_baseline:Q', title='cosine distance from 2005'),
tooltip=['period_str', 'n_contexts', 'distance_from_baseline'],
)
.properties(
width=560, height=260,
title=alt.TitleParams(
text="Semantic drift of 'immigrant' across two decades",
subtitle='Baseline = 2005 / HashEmbedder for byte-reproducibility / swap in SBERT for real semantics',
),
)
)
Where is meaning heading? pcd.forecast_semantic_drift does for semantic drift what .forecast() does for rates — extends the cosine-displacement trajectory forward with the same state-space machinery. The logit transform is off (cosine distance lives in roughly [0, 2], not [0, 1]) and the lower prediction-interval bound is clipped at zero since negative cosine distance is nonsensical.
Useful for asking is the drift accelerating or stabilising?
from pycorpdiff.viz import semantic_forecast_plot
sem_fc = pcd.forecast_semantic_drift(sem_traj, horizon=4, level=0.95)
print(sem_fc)
semantic_forecast_plot(sem_traj, sem_fc).properties(
width=600, height=300,
title=alt.TitleParams(
text="Semantic drift forecast for 'immigrant'",
subtitle='Solid: observed cosine distance / Dashed: ETS forecast with 95% PI band',
),
)
Neighbourhood drift — neighborhood_drift returns the top-k contextual neighbours of a target in each corpus and partitions them into shared / gained_in_a / lost_in_a. With SBERT this surfaces words that have become semantically close to the target in one frame but not the other.
drift = pcd.neighborhood_drift(
human, criminal, 'immigrant',
k=8, embedder=pcd.HashEmbedder(dim=64), window=4, min_count=2,
)
drift
Part VI — The fanout¶
The same three verbs answer questions at every grain. Same corpus, different slices.
Cross-party within a topic — Labour vs Conservative on the NHS¶
nhs = corpus.slice(topic='nhs')
lab_v_con = pcd.compare(
nhs.slice(party='Labour'),
nhs.slice(party='Conservative'),
).keyness(min_count=2)
lab_v_con.table.head(8)
Cross-topic — climate discourse vs NHS discourse¶
climate_v_nhs = pcd.compare(
corpus.slice(topic='climate'),
corpus.slice(topic='nhs'),
).keyness(min_count=3)
climate_v_nhs.plot(kind='bar', n=10).properties(
width=500, title='Climate vs NHS -- signature vocabulary'
)
Before/after the Brexit referendum — across the full corpus, not just one topic¶
ba = pcd.compare.before_after(
corpus, event_date='2016-06-23', time_col='date'
).keyness(min_count=5)
ba.table.head(8)
Beyond pairwise — N-way keyness across all four parties¶
Pairwise keyness asks "is this term distinctive in A vs B". When you have three or more corpora — three news outlets, five parties, ten decades — the natural extension is one-way keyness: for each term, does its rate vary across the panel of corpora more than chance explains?
pcd.keyness_multi([a, b, c, ...]) does exactly that, with a χ²(df=N−1) reference distribution. For N=2 it reduces exactly to the pairwise G² we already trust (verified by a unit test).
PARTIES = ['Conservative', 'Labour', 'Liberal Democrat', 'SNP']
nhs_by_party = [nhs.slice(party=p) for p in PARTIES]
nway = pcd.keyness_multi(nhs_by_party, labels=PARTIES, min_count=3)
print(f'{len(nway):,} terms ranked across {len(PARTIES)} parties')
df_to_sofra(
nway.head(12).reset_index(),
caption='N-way keyness — top 12 terms across four parties (NHS speeches)',
footnotes=('χ²(df = 3) test of equal rate across the four parties.',),
)
A heatmap is the natural visualisation for an N-way table: rows are the top-distinguishing terms; columns are corpora; cell colour encodes rate per 1000 tokens so the visual axis is comparable across corpus-size differences.
top_n = 15
totals_by_party = pd.Series(
{p: nhs.slice(party=p).total_tokens() for p in PARTIES}
)
nway_long = (
nway.head(top_n)
.reset_index()
.melt(
id_vars=['term', 'g2'],
value_vars=[f'count_{p}' for p in PARTIES],
var_name='party_col',
value_name='count',
)
.assign(
party=lambda d: d['party_col'].str.replace('count_', '', regex=False),
rate_per_1k=lambda d: 1000 * d['count'] / d['party'].map(totals_by_party),
)
)
(
alt.Chart(nway_long)
.mark_rect()
.encode(
x=alt.X('party:N', title=None, axis=alt.Axis(labelAngle=-30)),
y=alt.Y('term:N', sort=alt.SortField('g2', order='descending'),
title='Top-15 N-way-distinctive terms'),
color=alt.Color('rate_per_1k:Q', scale=alt.Scale(scheme='blues'),
title='per 1k tokens'),
tooltip=['term', 'party', 'count', 'rate_per_1k:Q', 'g2'],
)
.properties(
width=320, height=360,
title=alt.TitleParams(
text='N-way keyness on NHS speeches -- four parties',
subtitle=f'{len(PARTIES)}-corpus Chi^2(df={len(PARTIES)-1}) / sorted by G^2 desc',
),
)
)
Part VII — Cross-validation receipts¶
The math agrees with the standard tools. pycorpdiff cross-validates against three open-source reference implementations and a fourth diachronic-embeddings dataset:
- Rayson's LL Wizard — hand-computed reference triples spanning Hardie's BNC examples, equal-rate / specialised-vocabulary / lopsided over-rep, and the textbook 174-vs-29-per-million case
- NLTK (
BigramAssocMeasures) — PMI and t-score agreement to ≤ 1e-12 on every adjacent bigram; MI³ via themi_like(power=3)identity - Scattertext (Kessler 2017) — behavioural agreement on the 2012 US Conventions corpus
- HistWords (Hamilton et al. 2016) — known-shifter / stable-word sanity check on Stanford SNAP COHA decade embeddings (slow tier; skips gracefully when the archive is unreachable)
Below we show the Rayson and Scattertext checks live; the slow-tier ones live in the test suite.
Rayson's LL Wizard — the canonical worked example¶
From the Lancaster CASS reference: 12000 occurrences per million in one corpus vs 10000 per million in another. Rayson's wizard reports G² ≈ 182.07. Our log_likelihood agrees.
from pycorpdiff.keyness import log_likelihood
ref = log_likelihood(
pd.Series({'the': 12000}),
pd.Series({'the': 10000}),
total_a=1_000_000, total_b=1_000_000,
)
print(f"pycorpdiff G²: {ref.loc['the', 'g2']:.4f}")
print(f"Rayson's LL Wizard: 182.0694")
print(f"agreement: {'✓' if abs(ref.loc['the', 'g2'] - 182.0694) < 0.01 else '✗'}")
df_to_sofra(
pd.DataFrame([
{'case': 'classic_12k_vs_10k', 'rayson_LL_wizard': 182.0694,
'pycorpdiff_G2': round(float(ref.loc['the', 'g2']), 4),
'abs_diff': round(abs(float(ref.loc['the', 'g2']) - 182.0694), 6)},
]),
caption="Rayson's LL Wizard — canonical worked example agreement",
footnotes=('Rounded to 4 decimal places (precision of the published wizard).',),
)
Scattertext on the 2012 US Presidential Conventions¶
Scattertext bundles the Democratic and Republican convention speeches from 2012 as a known fixture. We can plug them straight into pycorpdiff and compare top-N keyness terms to Scattertext's scaled F-score — different measures, but they should surface a substantial common subset on real political-discourse data.
(Skip the next two cells if scattertext isn't installed — pip install scattertext.)
try:
import scattertext as st
with warnings.catch_warnings():
warnings.simplefilter('ignore')
conv_df = st.SampleCorpora.ConventionData2012.get_data()
conv_corpus = pcd.from_dataframe(conv_df, text_col='text', meta_cols=('party', 'speaker'))
conv_keyness = pcd.compare(
conv_corpus.slice(party='democrat'),
conv_corpus.slice(party='republican'),
).keyness(min_count=10)
print(f'{len(conv_corpus)} speeches from the 2012 US conventions')
print(f'{len(conv_keyness.table)} shared-vocabulary terms')
print('Top 10 Dem-leaning terms by pycorpdiff signed G²:')
conv_top10 = (
conv_keyness.table[conv_keyness.table['g2'] > 0]
.head(10)[['term', 'count_a', 'count_b', 'g2', 'log_ratio']]
)
except ImportError:
print('scattertext not installed — skipping this cell. `pip install scattertext` to run.')
conv_keyness = None
conv_top10 = None
conv_top10
if conv_keyness is not None:
display(conv_keyness.plot(kind='bar', n=12).properties(
width=520,
title=alt.TitleParams(
text='Real-world keyness -- 2012 US Conventions',
subtitle='Democratic-leaning terms (top) vs Republican-leaning (bottom)',
),
))
HistWords reference values for famous semantic shifters¶
Hamilton, Leskovec & Jurafsky (2016) computed aligned diachronic word2vec embeddings on COHA and reported specific cosine distances for famous shifters. We ship those reference values for users to validate their own semantic-shift pipelines against.
from pycorpdiff.datasets.histwords import HAMILTON_REFERENCE_SHIFTS_COHA_1900_1990
ref_df = pd.DataFrame(
sorted(HAMILTON_REFERENCE_SHIFTS_COHA_1900_1990.items(), key=lambda x: -x[1]),
columns=['word', 'cosine_distance_1900_to_1990'],
)
ref_df['interpretation'] = ref_df['word'].map({
'gay': 'shift: happy/carefree → homosexual',
'broadcast': 'shift: scattering seeds → transmitting radio/TV',
'awful': 'shift: awe-inspiring → very bad',
'terrific': 'shift: terrifying → great',
'guy': 'shift: Guy Fawkes effigy → generic male',
'the': 'stable function word',
'and': 'stable function word',
'of': 'stable function word',
'is': 'stable function word',
})
df_to_sofra(
ref_df,
caption='Hamilton et al. (2016) HistWords reference cosine displacements, 1900s→1990s',
footnotes=('COHA eng-fiction word2vec; cosine distance between aligned decade centroids.',),
)
Part VIII — The plumbing¶
Beyond the analytical surface, pycorpdiff plays nicely with the surrounding PyData stack.
polars round-trip¶
Corpus accepts either pandas or polars frames at construction and round-trips back out. The analytical layer is pandas-internal; polars users get ergonomic interop without us doubling the math surface.
try:
import polars as pl
pl_df = pl.DataFrame({
'text': ['the migrant worker arrived', 'the migrant criminal threat'],
'frame': ['humanising', 'criminalising'],
})
polars_corpus = pcd.from_dataframe(pl_df, text_col='text', meta_cols=('frame',))
print(f'pandas-internal: {type(polars_corpus.docs).__name__}')
print(f'polars round-trip: {type(polars_corpus.to_polars()).__name__}')
print(polars_corpus.to_polars())
except ImportError:
print('polars not installed — skipping this cell. `pip install pycorpdiff[polars]` to run.')
DuckDB → Corpus¶
Filter large parquet collections in DuckDB before they ever touch pandas. read_duckdb runs any SQL query and wraps the result.
try:
import duckdb
con = duckdb.connect()
# Materialise our Hansard sample as an in-memory DuckDB table,
# then query it back through SQL.
con.register('hansard', corpus.docs)
post_2020_immigration = pcd.read_duckdb(
con,
"SELECT text, party, date FROM hansard "
"WHERE topic = 'immigration' AND year >= 2020",
text_col='text', meta_cols=('party', 'date'),
)
print(f'DuckDB-filtered subset: {len(post_2020_immigration)} immigration speeches from 2020+')
print(post_2020_immigration.docs[['party', 'date']].value_counts().head(5))
except ImportError:
print('duckdb not installed — skipping this cell. `pip install pycorpdiff[duckdb]` to run.')
Pluggable tokenizers — multilingual is one adapter away¶
Any callable matching __call__(text: str) -> list[str] satisfies the Tokenizer Protocol. spaCy / Stanza / jieba / fugashi adapters are one-line wrappers; the default RegexTokenizer is Unicode-native out of the box.
# Default Unicode-aware tokenizer handles non-Latin scripts directly.
default = pcd.RegexTokenizer()
print('Greek: ', default('Η μετανάστευση είναι ένα παγκόσμιο φαινόμενο'))
print('Cyrillic: ', default('Миграция — это глобальное явление'))
print('English: ', default('Migration is a global phenomenon'))
# A custom tokenizer is literally a one-line adapter.
class StemTokenizer:
"""Toy lemmatiser-like tokenizer for the showcase."""
def __call__(self, text: str) -> list[str]:
crude_stems = {'arrived': 'arrive', 'settled': 'settle', 'thrived': 'thrive'}
return [crude_stems.get(t, t) for t in default(text)]
tokenized = pcd.from_dataframe(
pd.DataFrame({'text': ['the migrant worker arrived and settled and thrived']}),
text_col='text',
tokenizer=StemTokenizer(),
)
tokenized.tokens()[0]
Sparse doc-term matrices — escape hatch for big corpora¶
The default dense n_docs × |vocab| matrix is fine for medium millions of tokens but breaks down at ~100K docs. doc_term_counts_sparse() returns the canonical scikit-learn shape — (csr_matrix, vocab) — so a pycorpdiff Corpus drops straight into anything that consumes CountVectorizer output.
matrix, vocab = corpus.doc_term_counts_sparse(min_count=3)
print(f'sparse DTM: {matrix.shape[0]:,} docs × {matrix.shape[1]:,} terms')
print(f'non-zeros: {matrix.nnz:,} '
f'({100 * matrix.nnz / (matrix.shape[0] * matrix.shape[1]):.2f}% density)')
print(f'memory footprint: ~{matrix.data.nbytes / 1024:.1f} KB '
f'(dense would be ~{matrix.shape[0] * matrix.shape[1] * 8 / 1024:.1f} KB)')
HuggingFace Datasets — one line to a Corpus¶
Public text corpora — IMDb, OpenWebText, AG News, every Hansard mirror — live on the 🤗 Hub. pcd.from_huggingface(dataset_id, ...) wraps datasets.load_dataset and converts the result to a Corpus in one call.
corpus = pcd.from_huggingface(
'stanfordnlp/imdb', split='train[:1000]',
text_col='text', meta_cols=('label',),
)
pos = corpus.slice(label=1); neg = corpus.slice(label=0)
pcd.compare(pos, neg).keyness().plot(kind='scattertext')
pip install 'pycorpdiff[huggingface]' first. Skipped here to keep the notebook offline-runnable in CI.
Results go anywhere — HTML, JSON, parquet¶
Every Result has .to_html(path) and .to_json(path) next to .to_df(). Drop the HTML into a static-site build; ship the JSON to a JavaScript dashboard; or .to_df().to_parquet() for downstream pipelines. The numbers you analysed are the numbers you publish.
# HTML — paste into a blog post or paper draft
html_snippet = keyness.to_html().splitlines()[:6]
print('\n'.join(html_snippet) + '\n...')
# JSON — orient='records' by default, one dict per term
json_snippet = keyness.to_json()
print(f'\nJSON: {len(json_snippet):,} bytes, {json_snippet.count(chr(125))} records')
Content-derived hashing — memoise analyses across runs¶
Corpus.__hash__ is content-derived via pandas.util.hash_pandas_object, so two corpora with identical docs + schema + tokenizer hash the same — and different ones don't. Drop a Corpus into a functools.lru_cache key, a joblib.Memory cache, or your own disk-backed manifest, and avoid recomputing the expensive keyness/semantic-shift passes.
h1 = hash(corpus)
h2 = hash(pcd.load_hansard_sample()) # rebuilt from disk
h3 = hash(corpus.with_tokenizer(pcd.RegexTokenizer(lowercase=False)))
print(f'same-config hash matches: {h1 == h2}')
print(f'different-tokenizer hash diverges: {h1 != h3}')
Part IX — Where next¶
We've now driven every analytical surface in pycorpdiff through one coherent research question. To take this to your own data:
Real corpora. The Hansard sample is synthetic but real-shaped. For actual research:
corpus = pcd.fetch_hansard(
'immigration', start_date='2020-01-01', end_date='2024-12-31',
cache_dir='~/.cache/hansard',
)
Hits parliament.uk's public Hansard search API. No auth. UK Open Government Licence. The same shape as load_hansard_sample() so every cell above works.
Real embeddings. Swap HashEmbedder for SBERTEmbedder everywhere semantic shift / trajectory / neighbourhood drift appears:
embedder = pcd.SBERTEmbedder() # English default
embedder = pcd.SBERTEmbedder('paraphrase-multilingual-MiniLM-L12-v2') # 50+ languages
Add pip install 'pycorpdiff[semantic]' first.
Reproducibility. Every Result has .to_df() → parquet round-trip; most Results also implement .to_html() and .to_json() for static-site or dashboard handoff (see docs/design.md for the per-Result method matrix). Wrap an analytical run in a single script that writes its outputs to JSON; re-running the script regenerates them deterministically.
Cross-validation. Wire your own pipeline up to pcd.histwords_cosine_shift(1900, 1990, target, source='coha') for diachronic-embedding validation, or compare pcd.compare(a, b).keyness(formula='dunning') against quanteda::textstat_keyness(measure='lr') informally — note that token-level conventions differ between regex tokenisation and quanteda's remove_punct=TRUE, so cross-runtime parity holds only up to tokenisation, not byte-for-byte.
Cheat sheet — every analytical surface in one block:
import pycorpdiff as pcd
# Load
corpus = pcd.from_dataframe(df, text_col='text', meta_cols=(...))
# Slice
a = corpus.slice(outlet='Guardian')
b = corpus.slice(outlet='Mail')
# Compare — three verbs
k = pcd.compare(a, b).keyness() # default formula='rayson'
k_d = pcd.compare(a, b).keyness(formula='dunning') # full 4-cell G²
c = pcd.compare(a, b).collocation_shift('migrant')
s = pcd.compare(a, b).semantic_shift('migrant', embedder=pcd.SBERTEmbedder())
# Track
tr = pcd.track(corpus, 'migrant').over_time(freq='Y')
tr.changepoints()
tr.interrupted_time_series(event_date='2016')
# Before/after
ba = pcd.compare.before_after(corpus, event_date='2016-06-23').keyness()
# Result methods — each Result implements the relevant subset of:
# .to_df() · .plot() · .to_html() · .to_json() · .summary() · .explain()
Read the README for the design philosophy, docs/design.md for the three-layer architecture, and docs/statistical-methods.md for every metric's reference.