Hansard demo¶
A worked example using the bundled Hansard-style sample corpus. The sample is synthetic but its structure mirrors real UK parliamentary discourse:
- 193 speeches across 19 years (2005–2023)
- Four topics: immigration, Brexit, NHS, climate
- Four parties: Labour, Conservative, Liberal Democrat, SNP
- Frame shifts at real historical inflection points: immigration moves from humanising to criminalising around the 2016 Brexit referendum; the NHS comes under austerity (2010–14) and COVID (2020–22) pressure; climate framing sharpens from scientific to policy to crisis after 2019
For an actual research project, see pycorpdiff.datasets.hansard for canonical sources of real Hansard data (TheyWorkForYou, parliament.uk, HuggingFace).
This notebook exercises every analytical surface in the package on a single corpus, showing what a real research workflow looks like end to end.
import altair as alt
import pycorpdiff as pcd
# Render charts as static SVG (the renderer enabled below) so they
# display on GitHub, in JupyterLab, VS Code, and nbviewer, including
# frontends that do not honour the application/vnd.vegalite mime type.
alt.renderers.enable('svg')
corpus = pcd.load_hansard_sample()
print(f'{len(corpus):,} speeches · {corpus.total_tokens():,} tokens')
corpus.docs.head(3)
1. Cross-party frame contrast on immigration¶
Slice the corpus to immigration speeches, then split by the engineered frame (humanising vs criminalising). Keyness reveals the lexical fingerprints of each frame.
immigration = corpus.slice(topic='immigration')
h = immigration.slice(frame='humanising')
c = immigration.slice(frame='criminalising')
print(f'humanising: {len(h)} speeches · {h.total_tokens()} tokens')
print(f'criminalising: {len(c)} speeches · {c.total_tokens()} tokens')
keyness = pcd.compare(h, c).keyness(min_count=3, dispersion=True)
print(keyness.summary())
keyness.table.head(10)
keyness.plot()
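To make the idea of a keyness score concrete, here is a minimal stdlib sketch of one common statistic: log-likelihood (G2) in the two-cell form widely used in corpus linguistics. The source does not say which statistic pycorpdiff computes, so the function names, the sign convention, and the toy token lists below are all assumptions made for illustration.

```python
import math
from collections import Counter


def log_likelihood(a: int, b: int, n1: int, n2: int) -> float:
    """G2 for a term seen a times in n1 tokens and b times in n2 tokens."""
    total = a + b
    e1 = n1 * total / (n1 + n2)  # expected count in corpus 1
    e2 = n2 * total / (n1 + n2)  # expected count in corpus 2
    g2 = 0.0
    for observed, expected in ((a, e1), (b, e2)):
        if observed > 0:  # 0 * log(0) is taken as 0
            g2 += observed * math.log(observed / expected)
    return 2.0 * g2


def keyness(tokens_a, tokens_b, min_count=1):
    """Rank terms by signed G2; positive means over-represented in tokens_a."""
    fa, fb = Counter(tokens_a), Counter(tokens_b)
    n1, n2 = len(tokens_a), len(tokens_b)
    rows = []
    for term in set(fa) | set(fb):
        a, b = fa[term], fb[term]
        if a + b < min_count:
            continue
        sign = 1.0 if a / n1 >= b / n2 else -1.0
        rows.append((term, sign * log_likelihood(a, b, n1, n2)))
    return sorted(rows, key=lambda row: row[1], reverse=True)


# Toy token lists standing in for the two frames (invented for illustration).
humanising = "families seek safety and families seek work".split()
criminalising = "illegal entry and illegal gangs exploit the border".split()
for term, score in keyness(humanising, criminalising)[:3]:
    print(f"{term:10s} {score:+.2f}")
```

Terms at the top of the ranking are characteristic of the first corpus and terms at the bottom of the second, which is the reading the keyness table above invites.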
Explain a top result¶
Actual KWIC (keyword-in-context) lines from both frames for the highest-keyness term.
top_term = keyness.table.iloc[0]['term']
keyness.explain(top_term, n=3, window=4).table
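For readers new to KWIC, the sketch below shows roughly what a concordance lookup does: find each hit and return a fixed window of tokens on either side. It is a hypothetical stand-alone helper, not the code behind explain(); the parameter names simply echo the n and window arguments used above.

```python
def kwic(tokens, term, window=4, n=3):
    """Return up to n (left, keyword, right) context triples for term."""
    hits = []
    for i, tok in enumerate(tokens):
        if tok == term:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            hits.append((left, tok, right))
            if len(hits) == n:
                break
    return hits


# Invented sentence, used only to show the output shape.
speech = (
    "the minister said every immigrant family deserves a fair hearing today"
).split()
for left, keyword, right in kwic(speech, "immigrant", window=4):
    print(f"{left:>30} [{keyword}] {right}")
```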
2. What does each frame put next to 'immigrant'?¶
collocation_shift measures the window-based co-occurrence of every collocate around a target and reports the cross-corpus difference. Because the humanising slice is passed first, strongly negative shifts here mark collocates associated mainly with the criminalising frame.
shift = pcd.compare(h, c).collocation_shift(
    'immigrant', window=4, min_count=3, measure='logDice'
)
print(shift.summary())
shift.table.head(10)
shift.plot(n=12)
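As a rough illustration of what a logDice-based shift involves, the sketch below computes logDice = 14 + log2(2·f_xy / (f_x + f_y)) for every collocate in a window and then subtracts one corpus's scores from the other's. The helper names, the "first minus second" sign convention, and the choice to score a missing collocate as 0 are assumptions for this example and may differ from what pycorpdiff does internally.

```python
import math
from collections import Counter


def logdice_table(tokens, node, window=4):
    """logDice for every collocate within +/- window tokens of node."""
    freq = Counter(tokens)
    cooc = Counter()
    for i, tok in enumerate(tokens):
        if tok != node:
            continue
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                cooc[tokens[j]] += 1
    return {
        w: 14 + math.log2(2 * f_xy / (freq[node] + freq[w]))
        for w, f_xy in cooc.items()
    }


def collocation_shift(tokens_a, tokens_b, node, window=4):
    """Per-collocate logDice difference (a minus b).

    Assumption: a collocate absent from one corpus scores 0 there.
    """
    a = logdice_table(tokens_a, node, window)
    b = logdice_table(tokens_b, node, window)
    return {w: a.get(w, 0.0) - b.get(w, 0.0) for w in set(a) | set(b)}


# Invented token lists standing in for the two frames.
hum = "the immigrant worker and the immigrant family".split()
crim = "the illegal immigrant and the criminal immigrant".split()
toy_shift = collocation_shift(hum, crim, "immigrant", window=1)
for word, delta in sorted(toy_shift.items(), key=lambda kv: kv[1]):
    print(f"{word:10s} {delta:+.2f}")
```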
3. Temporal trajectory of the criminalising frame¶
How does 'criminal' track over time in immigration speeches, compared with 'worker' and 'family'?
trajectory = pcd.track(immigration, ['criminal', 'worker', 'family']).over_time(
    freq='Y', time_col='date'
)
trajectory.table.head(10)
trajectory.plot()
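A trajectory of this kind boils down to a per-period relative frequency. The sketch below shows one plausible way to compute it by calendar year; the function name, the (date, tokens) input shape, and the per-1,000-token scaling are assumptions for illustration rather than the package's actual implementation.

```python
from collections import defaultdict
from datetime import date


def yearly_relative_frequency(docs, terms, per=1_000):
    """docs: iterable of (date, tokens). Returns {year: {term: hits per `per` tokens}}."""
    hits = defaultdict(lambda: defaultdict(int))
    totals = defaultdict(int)
    for day, tokens in docs:
        totals[day.year] += len(tokens)
        for tok in tokens:
            if tok in terms:
                hits[day.year][tok] += 1
    return {
        year: {t: per * hits[year][t] / totals[year] for t in terms}
        for year in sorted(totals)
        if totals[year] > 0  # skip years that contain only empty documents
    }


# Invented mini-corpus: three dated speeches.
toy_docs = [
    (date(2015, 3, 1), "the worker and the family".split()),
    (date(2016, 9, 1), "the criminal and the criminal gang".split()),
    (date(2016, 11, 1), "a worker".split()),
]
toy_traj = yearly_relative_frequency(toy_docs, ["criminal", "worker", "family"])
for year, row in toy_traj.items():
    print(year, row)
```

Normalising by each year's token total is what keeps years with many speeches comparable to years with few.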
Changepoint detection¶
Where does the discourse around 'criminal' break? PELT should locate the engineered 2016 referendum shift.
trajectory.changepoints(target='criminal')
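For intuition about what PELT does, here is a compact stdlib sketch of the algorithm with a squared-error (change-in-mean) cost. The cost function, the penalty value, and the toy series are assumptions chosen for this example; the source does not state which cost or penalty changepoints() uses.

```python
def pelt(values, penalty):
    """PELT with a squared-error cost; returns sorted changepoint indices."""
    n = len(values)
    # Prefix sums let each segment cost be computed in O(1).
    s1 = [0.0] * (n + 1)
    s2 = [0.0] * (n + 1)
    for i, v in enumerate(values):
        s1[i + 1] = s1[i] + v
        s2[i + 1] = s2[i] + v * v

    def cost(a, b):
        """Sum of squared deviations from the mean of values[a:b]."""
        seg_sum = s1[b] - s1[a]
        return (s2[b] - s2[a]) - seg_sum * seg_sum / (b - a)

    best = [0.0] * (n + 1)  # best[t]: optimal penalised cost of values[:t]
    best[0] = -penalty
    last = [0] * (n + 1)    # last[t]: start of the final segment in that optimum
    candidates = [0]
    for t in range(1, n + 1):
        scored = [(best[s] + cost(s, t) + penalty, s) for s in candidates]
        best[t], last[t] = min(scored)
        # Pruning: drop segment starts that can never be optimal again.
        candidates = [s for total, s in scored if total - penalty <= best[t]]
        candidates.append(t)

    # Walk back through the stored segment starts.
    breaks, t = [], n
    while last[t] > 0:
        breaks.append(last[t])
        t = last[t]
    return sorted(breaks)


# Invented yearly series with a level shift in 2016.
years = list(range(2005, 2024))
series = [1.0 if y < 2016 else 5.0 for y in years]
found = pelt(series, penalty=1.0)
print([years[i] for i in found])
```

Each returned index is the first position of a new segment, so mapping it back through the list of years gives the first year of the new regime.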
Interrupted time series at the 2016 referendum¶
Quantify the post-event level change with a segmented regression.
Caveat on the synthetic demo. The bundled Hansard-style sample is deterministic: the engineered relative frequencies for "criminal" in the pre-event window have zero variance, so the OLS standard errors below sit at floating-point precision and the p-values are astronomically small. On real data with realistic noise, the standard errors and p-values will take ordinary magnitudes. Treat the ITS output here as a syntactic demonstration of the API, not a statistical worked example.
trajectory.interrupted_time_series(event_date='2016', target='criminal')
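The segmented regression behind an interrupted time series can be sketched in a few lines. The version below fits y = b0 + b1·(t − event) + b2·post + b3·post·(t − event) by ordinary least squares, where post is 1 from the event onwards; b2 is the level change and b3 the slope change. This model specification, the helper names, and the noise-free toy series are assumptions for illustration, not necessarily what interrupted_time_series() fits.

```python
def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def interrupted_time_series(times, values, event):
    """Least-squares segmented regression; returns (coefficients, residual SSE)."""
    rows = []
    for t in times:
        post = 1.0 if t >= event else 0.0
        # Time is centred on the event to keep the normal equations well conditioned.
        rows.append([1.0, float(t - event), post, post * (t - event)])
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, values)) for i in range(k)]
    coef = solve(xtx, xty)
    fitted = [sum(c * x for c, x in zip(coef, r)) for r in rows]
    sse = sum((y - f) ** 2 for y, f in zip(values, fitted))
    names = ("pre_level_at_event", "pre_trend", "level_change", "slope_change")
    return dict(zip(names, coef)), sse


# Noise-free toy series: gentle upward trend plus a jump of 3.0 in 2016.
its_years = list(range(2005, 2024))
rates = [2.0 + 0.1 * (y - 2016) + (3.0 if y >= 2016 else 0.0) for y in its_years]
fit, sse = interrupted_time_series(its_years, rates, event=2016)
print(fit)
print(f"residual SSE: {sse:.3e}")
```

Because the toy series has no noise, the residual sum of squares is essentially zero, which is the same situation the caveat above describes for the bundled sample.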
4. Cross-topic comparison¶
How does climate discourse differ from NHS discourse? Different topical fields entirely — keyness should surface each topic's signature vocabulary.
climate_v_nhs = pcd.compare(
    corpus.slice(topic='climate'),
    corpus.slice(topic='nhs'),
).keyness(min_count=3)
climate_v_nhs.table.head(10)
5. Cross-party comparison¶
Slice the same topic by party — Labour vs Conservative speeches on the NHS.
nhs = corpus.slice(topic='nhs')
lab_v_con = pcd.compare(
    nhs.slice(party='Labour'),
    nhs.slice(party='Conservative'),
).keyness(min_count=2)
print(lab_v_con.summary())
lab_v_con.table.head(8)
6. Before/after the 2016 referendum¶
The full corpus split chronologically — what changed across the whole parliamentary discourse?
ba = pcd.compare.before_after(
    corpus, event_date='2016-06-23', time_col='date'
).keyness(min_count=5)
ba.table.head(10)
7. Semantic shift around 'immigrant'¶
Did the meaning of 'immigrant' shift across frames, beyond just frequency? Averaged contextual embeddings give us a cosine distance between the corpus-specific centroids.
This tutorial uses HashEmbedder for byte-for-byte reproducibility. Swap in pcd.SBERTEmbedder() for real semantic content (requires pip install 'pycorpdiff[semantic]').
sem = pcd.compare(h, c).semantic_shift(
    'immigrant', embedder=pcd.HashEmbedder(dim=64), window=4
)
sem.table
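The centroid comparison described above can be sketched with nothing but the standard library. Below, a hypothetical hashing embedder maps each token to a deterministic ±1 vector, the context vectors around each occurrence of the target are averaged into a centroid per corpus, and the two centroids are compared by cosine distance. None of this is taken from pycorpdiff's HashEmbedder; the hashing scheme, helper names, and toy sentences are assumptions made to show the mechanics.

```python
import hashlib
import math


def hash_vector(token, dim=64):
    """Deterministic pseudo-embedding: +/-1 entries from a SHA-256 digest (dim <= 256)."""
    digest = hashlib.sha256(token.encode("utf-8")).digest()
    bits = int.from_bytes(digest, "big")
    return [1.0 if (bits >> i) & 1 else -1.0 for i in range(dim)]


def context_centroid(tokens, target, window=4, dim=64):
    """Average the hashed vectors of tokens within +/- window of each target hit."""
    total, count = [0.0] * dim, 0
    for i, tok in enumerate(tokens):
        if tok != target:
            continue
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                vec = hash_vector(tokens[j], dim)
                total = [a + b for a, b in zip(total, vec)]
                count += 1
    if count == 0:
        raise ValueError(f"{target!r} has no context tokens")
    return [x / count for x in total]


def cosine_distance(u, v):
    """1 minus the cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm


# Invented sentences standing in for the two frames.
hum_tokens = "every immigrant family deserves support".split()
crim_tokens = "every illegal immigrant faces removal".split()
centroid_h = context_centroid(hum_tokens, "immigrant")
centroid_c = context_centroid(crim_tokens, "immigrant")
print(f"cosine distance: {cosine_distance(centroid_h, centroid_c):.3f}")
```

Hashed vectors are reproducible but carry no meaning, so the distance here reflects only which context words differ; a learned embedder is needed before the number says anything about semantics.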
Where next?¶
- Try this on real Hansard data fetched from parliament.uk's API.
- Swap HashEmbedder for SBERTEmbedder() in §7 to surface genuine semantic drift.
- The corpus has a party column; try Labour vs Conservative collocations of 'immigrant' as a follow-up.
- For other public corpora (CORD-19, COHA, Reddit Pushshift), the same compare(a, b).keyness() / track() / compare.before_after() workflow applies; only the loader changes.