class: center, inverse, middle <style type="text/css"> .pull-left { float: left; width: 44%; } .pull-right { float: right; width: 44%; } .pull-right ~ p { clear: both; } .pull-left-wide { float: left; width: 66%; } .pull-right-wide { float: right; width: 66%; } .pull-right-wide ~ p { clear: both; } .pull-left-narrow { float: left; width: 30%; } .pull-right-narrow { float: right; width: 30%; } .pull-right-extra-narrow { float: right; width: 20%; } .pull-center { margin-left: 28%; width: 44%; } .pull-center-wide { margin-left: 17%; width: 66%; } .tiny123 { font-size: 0.40em; } .small123 { font-size: 0.80em; } .large123 { font-size: 2em; } .red { color: red } .chaosred { color: #b33d3d } .orange { color: orange } .green { color: green } .blue { color: blue } </style> # CHAOS ## Converting Historical Accounts to Occupational Scores ### Matt Curtis, Torben Johansen, Julius Koschnick, **Christian Vedel**, ### University of Soutern Denmark ### Email: [christian-vs@sam.sdu.dk](christian-vs@sam.sdu.dk) ### Updated 2025-07-30 --- .pull-left[ # CHAOS - This paper introduces a method for: + **C**onverting (using maths) + **H**historical **A**ccounts (e.g. wage tables) + **O**ccupational **S**scores (something like IPUMS occscore [but better]) - **Still preliminary** ] -- .pull-right[ ### Four introductions .panelset[ .panel[.panel-name[1 Straight forward] - If I give you this: .pull-right-wide[  ] .pull-left-wide[ - Could you then tell me the approximate income of: + Joiner, + Cowherd, + Bricklayer? - **CHAOS** tells you the appropriate weighted mean ] ] .panel[.panel-name[2 Technical] Given classifier (of `\(h_j\)`): $$ d_i \mapsto Pr(h_j\mid d_i) $$ And a source $$ d_i \mapsto y_i \qquad\text{e.g. income} $$ Can we combine them into a function, `\(f\)`, to estimate new outcomes: $$ f:\qquad d_k\mapsto\mathbb{E}(y_k \mid d_k) $$ ] .panel[.panel-name[3 Casual] > "Yo! I've got this table of people's income from 1825. Do you think we could somehow sort of apply it to incomes for occupations across this census data?"  ] .panel[.panel-name[4 Collegial] *We are trying to replace IPUMS occscore* ] ] ] --- class: middle # How it is usefull .pull-left[ - History is ripe with: + Occupational data + Scattered but unsystematic accounts of occupational outcomes (e.g. wages tables) - This is a systematic, out-of-the-box, framework for dealing with this - Not a black box. Every step is interpretable. #### Uses - What happens to income / occupational structure when new technologies emerge? - Do certain institutions promote economic development? - Settling old debates in Economic History - .red[Constructing measures of regional living standards using better sources than random wikipedia scrapes and Maddison] - etc. ] -- .pull-right[ .pull-left-narrow[] .pull-right-wide[ This is not: - A magic bullet - An excuse to not care about source criticism - (In fact a key econometric result 'relevance criterion') ] ] --- # Sneakpeak: Where we are getting to .pull-center-wide[  ] --- .pull-left-narrow[ ## Step by step 1. You find a **source** of occupational outcomes (e.g. wages) 2. You classify them into a **standardized system** (e.g. HISCO) 3. You calculate **average** (e.g. how they taught you in school) 4. You assign these averages to new **target observations** a. Classify occupations in target (e.g. into HISCO) b. Apply estimates from step 3 ] -- .pull-right-wide[ .panelset[ .panel[.panel-name[Steps 1-3] .pull-right-wide[  ] ] .panel[.panel-name[Step 4] .pull-right-wide[  ] ] ] ] --- # Another way of thinking about what we do ### *"It could easily accomplished with a computer"* - Dr. Strangelove .pull-left-wide[ Imagine having some data: | Occupation | Income | |-------------|--------| | Carpenter | $2.00 | | Butcher | $2.00 | | Farmer | $2.30 | From this imagine constructing a function which takes a text string and outputs an estimated income, years of education, social status, etc. for a given place `\(i\)` and time `\(t\)`: $$ f_{it}(\textit{"He has some land"}) = 1.71844 $$ We suggest that you can build such a function based on classifiers ] --- class: middle # OccCANINE .pull-left[ .small123[ w. Christian Møller Dahl & Torben Johansen https://arxiv.org/abs/2408.00885 ] - We train a language model on ~18.5 million observations spanning 13 different language from 29 different sources - Open source, user friendly, fast and highly accurate - **Importantly:** Gives us a probability of each HISCO code given a textual description. $$ \text{OccCANINE}: \qquad d_i \mapsto Pr(h_j\mid d_i) $$ - Using basic tricks of probability we can tease out an estimator from this! ] .pull-right[  *Figure 1: Conceptual architecture* .small123[ #### Some numbers: - 10k occupational descriptions: 27 seconds - 100k occupational descriptions: 5 min. - 1 million occupational descriptions: 45 min. ] ] --- ## Mostly harmless maths .pull-left[ #### Estimating HISCO wages `$$\mathbb{E}(y_i\mid h_j)\;=\; \sum_{i\in \mathcal{S}} y_iw_{ij} \;=\; \sum_{i \in \mathcal{S}} y_i \;\frac{\overset{OccCANINE}{\overbrace{Pr(h_j\mid d_i)}}\;\overset{Prior}{\overbrace{Pr(d_i)}}}{\underset{Evidence}{\underbrace{Pr(h_j)}}}$$` #### Example We estimate that HISCO 83110 "Blacksmith" earns $2.52 a day since this is the sum: `$$2.49\times 0.8878 + 3.96\times 0.0568 + ...$$` ] -- .pull-right[ ### Applying the estimator `$$\mathbb{E}(y_k\mid d_k) \;=\; \sum_{j\in \mathcal{H}} \hat{y}_j w_{jk} \;=\; \sum_{j \in \mathcal{H}} {\underset{\text{from above}}{\underbrace{\mathbb{E}(y_i\mid h_j)}}} \overset{OccCANINE}{\overbrace{Pr(h_j\mid d_k)}}$$` #### Example: `"mainly blacksmith (also farmer)"` gets a wage `\(\hat{y}_k\)` assigned from both the HISCO codes for blacksmith (0.95) but also farmer (0.02) etc. ] .footnote[ .small123[ `\(i\)`: Index of source; `\(j\)`: Index in HISCO table; `\(k\)`: Index of target table ] ] --- # Not so harmless maths .pull-left[ #### Assumptions - **Relevance:** the source needs to be relevant for the target #### Debiasing & scaling .small123[ - Bias and scaling problems `\(\mathbb{E}(\hat{y}_k)\neq\mathbb{E}(y_k)\)` and `\(\mathbb{V}(\hat{y}_k)\neq\mathbb{V}(y_k)\)` - But we can observe this problem if we are willing to assume: $$ `\begin{split} \mathbb{E}( y_k-\hat{y}_k) = \mathbb{E}(y_i-\hat{y}_i) \\ \mathbb{V}( y_k-\hat{y}_k) = \mathbb{V}(y_i-\hat{y}_i) \end{split}` $$ (Conceptually similar to *relevance* - details still need to ironed out) ] .chaosred[**Implication**]: Classification does not need to be perfect - we can correct for bias ] -- .pull-right[ .small123[ ### More than one source? $$ \mathbb{E}(y_i\mid h_j)=\sum_i \left[ y_i \sum_s \left( \frac{\overset{OccCANINE}{\overbrace{Pr(h_j\mid y_i, S_s)}}\overset{Prior}{\overbrace{Pr(y_i\mid S_s)}} \overset{\textit{Source weight}}{\overbrace{Pr(S_s)}}}{\underset{Normalization}{\underbrace{Pr(h_j)}}} \right) \right] $$ ### What if we input gibberish? Then we get: $$ `\begin{split} Classifier:\qquad &Pr(h_j\mid d_k) \;=\; Pr(h_j), \qquad \text{(bd cond. prob.)} \\ CHAOS:\qquad &\mathbb{E}(y_k|d_k) \;= \mathbb{E}(y_i) \end{split}` $$ I.e. we just get the mean from the source ] ] --- # Demonstration [Notebook] --- # Application: Setup .pull-left-wide[ #### 1. We apply this to a table of wages for 84000 occupations (US Commissioner of Labor) + Across 140 places (states and countries) + In the period 1725-1900 #### 2. We combine this with measure for skills (routine, mechanical, cognitive) obtained from the *London's Trademan* (1747) + Using embedding distance + Please ask about details #### 3. We turn everything into a HISCO-level panel using .chaosred[**CHAOS**] + **Result:** Occupation level panel of incomes and skills ] .pull-right-narrow[ .panelset[ .panel[.panel-name[Wages 1]  ] .panel[.panel-name[Wages 2]  ] .panel[.panel-name[Skills]  ] .panel[.panel-name[Skills]  ] ] ] --- # Application: Result .pull-left-narrow[ ### What is good advice in 1800? - Which 18th skills would benefit in the 19th century? - VERY preliminary ] .pull-right-wide[  ] --- # Conclusion .pull-left[ - Here's a new tool (on its way) - We hope its useful - More work to be done on: + Automatic debiasing and how to use these estimates in regressions + Application #### Feel free to contact me **Email:** <christian-vs@sam.sdu.dk>; **Twitter/X:** [@ChristianVedel](https://twitter.com/ChristianVedel); **BlueSky:** [@christianvedel.bsky.social](https://bsky.app/profile/christianvedel.bsky.social); ] -- .pull-right[ .pull-left-wide[ ### The entire pipeline of CHAOS  ] ] --- class: middle, center # Appendix --- class: middle # Skills from LTM .pull-left-wide[ ### Extracting skills - We ask `gpt-4o` to return a JSON of each occupations skills ### Measuring skills - First, we map the list skills in an occupation, `\(S = \{s_1, \dots, s_N\}\)`, into a 384-dimensional vector space using a a pre-trained Sentence-BERT model (all-MiniLM-L6-v2). This yields embedding vectors `\(\mathbf{e}_{s_i} \in \mathbb{R}^d\)`. - We then construct a index of skill based on cosine similarity `$$\text{routine proportion}_i = \frac{\bar{\sigma}_{i,\text{routine}}}{\bar{\sigma}_{i,\text{routine}} + \bar{\sigma}_{i,\text{non routine}}}$$` Where `\(\bar{\sigma}_{i,routine}\)` is the cosine similarity between skills$_i$ and the description of `\(\text{routine}\)`. ] .pull-right-narrow[ .panelset[ .panel[.panel-name[Goldsmith skills] .tiny123[ `"making all manner of utensils in gold or silver for ornament or use"`, `"performing work in the mould or beating into figure by hammer or other engine"`, `"casting works with raised figures in moulds"`, `"polishing and finishing cast works"`, `"beating plates or dishes of silver from thin flat plates"`, `"forming tankards and other vessels from thin plates soldered together"`, `"beating mouldings for vessels"`, `"using flatting-mills to reduce metal to required thinness"`, `"making all moulds for their work"` ] ] .panel[.panel-name[Routine] .small123[ #### Routine repetitive task, following explicit rules, highly structured sequence #### Non-routine problem solving, creative thinking, requires judgment ] ] .panel[.panel-name[Cognitive] .small123[ #### Cognitive abstract reasoning, analytical thinking, information processing #### Non-cognitive physical effort, manual labor, emotional labor ] ] .panel[.panel-name[Cognitive] .small123[ #### Mechanical operate machinery, assemble parts, machine operation #### Non-mechanical software use, writing, communication ] ] ] ]