class: center, inverse

<style type="text/css">
.pull-left { float: left; width: 44%; }
.pull-right { float: right; width: 44%; }
.pull-right ~ p { clear: both; }
.pull-left-wide { float: left; width: 66%; }
.pull-right-wide { float: right; width: 66%; }
.pull-right-wide ~ p { clear: both; }
.pull-left-narrow { float: left; width: 33%; }
.pull-right-narrow { float: right; width: 33%; }
.small123 { font-size: 0.80em; }
.large123 { font-size: 2em; }
.red { color: red; }
.xsmall123 { font-size: 0.60em; }
</style>

# Breaking the HISCO Barrier: AI and Occupational Data Standardization
## Christian Møller-Dahl
## Christian Vedel
### University of Southern Denmark, HEDG
#### Twitter: @ChristianVedel
#### Email: christian-vs@sam.sdu.dk
#### Updated 2024-01-18

---
class: center, middle

---
# Why do we want a finetuned model?

.pull-left-narrow[
- The model learns to understand occupational structures
- Vastly improves performance*
- Natural starting point for other general tasks related to occupations

*This is evident from the embeddings: 768 dimensions reduced to 3* `\(\rightarrow\)` *(a sketch of the reduction follows the panels)*
]

.pull-right-wide[
.panelset[
.panel[.panel-name[Unaligned CANINE]
.center[
]
]
.panel[.panel-name[Finetuned CANINE]

]
]
]
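The 768-to-3 reduction behind these panels can be reproduced along the following lines. This is a minimal sketch, assuming t-SNE as the reduction method and using a random matrix as a stand-in for the real CANINE embeddings:

```r
# Sketch of a 768 -> 3 embedding reduction (assumed method: t-SNE via Rtsne;
# `emb` is a random stand-in for the real CANINE embedding matrix)
library(Rtsne)

set.seed(20240118)
emb <- matrix(rnorm(1000 * 768), nrow = 1000)  # one 768-dim row per description

red <- Rtsne(emb, dims = 3, perplexity = 30, check_duplicates = FALSE)
head(red$Y)  # 3-D coordinates, one row per occupational description
```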
---
# Introduction

.pull-left[
### Motivation
- Manual labelling of data is tedious, expensive, and error-prone
- We collectively pour thousands of hours into similar work
- 4 mil. `\(\rightarrow\)` 96k `\(\rightarrow\)` 605 days of work

### This presentation
- **Automatic tool to make HISCO labels**
- 98 percent precision
- How it is done `\(\rightarrow\)` ongoing updates
]

.pull-right[
*William Bell Scott (1861) "Iron and Coal" (Wikimedia)*
]

---
# HISCO codes

- Derived from ISCO (International Standard Classification of Occupations)
- Invented for the sake of international comparability
- Introduced by van Leeuwen, Maas, Miles (2002) based on ISCO68
- Hierarchical structure well suited for analysis
- Common application: convert to a ranking (HISCAM, HISCLASS, etc.)
- Usually labelled with 'smart' manual methods

---
# Data example
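A minimal sketch of what a labelled observation looks like. The descriptions and codes are taken from examples used elsewhere in this deck; these are illustrative rows, not actual records from the training data:

```r
# Illustrative labelled rows (descriptions and HISCO codes borrowed from
# examples elsewhere in this deck, not from the actual training files)
data.frame(
  occ_description = c("farm servant", "a farmer", "tailor", "the train's fireman"),
  hisco           = c("62120", "61110", "79100", "98330")
)
```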
---
.pull-left[
## The naive solution
- List all unique occupational descriptions and label them, e.g. "farm servant: 62120"
- Label via lookup table

## The challenge
- 17,865 different occupational descriptions fit "farm servant" in DK censuses ("in service", "servant girl", "servant boy", "servant woman", "servant karl")
- HISCO has ~2,000 occupational classes, each with similar complexity in writing
- Spelling mistakes, **negations**, and different spelling conventions
]

--

.pull-right[
## The solution

- Use machine learning!
- Multi-label, multi-class classification
- A trivial machine learning problem
- Apply tricks from the machine learning literature to improve performance on unseen examples
]

---
.pull-left[
## The tricks
- Unseen test and validation data
- Recurrent LSTM: reads with memory, e.g. "lives of fishing and farm work" (Hochreiter, Schmidhuber, 1997)
- Overparameterized neural network: more parameters than data (Allen-Zhu, Li, Liang, 2018)
- Regularization via dropout: some neurons are randomly disabled in training (Srivastava et al., 2014)
- Embedding: represent language in high-dimensional space (Mandelbaum, 2016)
- Data augmentation (Morris et al., 2020; see the sketch below):
  + "farmer": {"farmtr", "fermer", "yellow farmer"}
]

.pull-right[
**Overfitting**

(Wikimedia)

### Metrics
- Accuracy
- Precision
- Recall
- F1
- Micro- and macro-level
- In training: binary cross-entropy (differentiable)
]
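A minimal sketch of the character-level augmentation described above; this is an assumed implementation, not the authors' code. Random substitutions and deletions yield "farmtr"-style corruptions:

```r
# Hypothetical character-level augmentation: randomly substitute or drop
# characters so the model also trains on misspelled variants
augment <- function(x, p = 0.1) {
  chars <- strsplit(x, "")[[1]]
  subst <- runif(length(chars)) < p
  chars[subst] <- sample(letters, sum(subst), replace = TRUE)  # substitutions
  keep <- runif(length(chars)) > p                             # deletions
  paste(chars[keep], collapse = "")
}

set.seed(2)
replicate(5, augment("farmer"))  # e.g. "farmtr", "fermer", ...
```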
---
# Data sources

### Used today:

| Source                   | Lang | Observations | Reference                     |
|:-------------------------|:-----|-------------:|:------------------------------|
| Danish census            | Da   |    4,673,892 | Clausen (2015)                |
| UK Marriage certificates | En   |    4,046,387 | Clark, Cummins, Curtis (2022) |
| DK Ørsted                | Da   |       36,608 | Ford (2023)                   |
| HSN                      | Nl   |       13,495 | Mandemakers et al. (2020)     |
| SE Chalmers              | Sw   |       14,426 | Ford (2023)                   |

- But also much more in the pipeline: ~70 GB

--

- *Adequate performance down to 10,000 labels*

--

- **Amazing performance with millions of observations**

---
# What is other data?

.pull-left[
### Other data
- Swedish census data
- Norwegian census data
- Dutch family history data
- Labelled biographies
- IPUMS
- Barcelona Historical Marriage Database
- ~70 GB
]

.pull-right[
### Call for data!
- If you have something with HISCO labels on it, please send it to christian-vs@sam.sdu.dk.
I owe you HISCO codes
]

---
# The architecture

### Training
- The model starts at a random state
- Continuously tweaked towards more correct predictions

### Ongoing work
- Replacing the brains with a multilingual transformer model (XLM-RoBERTa), *akin to a ChatGPT Frankenstein*

---
# Training

### Stopping condition
- Training stops when the validation loss has not improved for 10 epochs

---
# The product

```r
# Example prompts
string_to_hisco("A farmer")
string_to_hisco("Tailor of beautiful dresses")
string_to_hisco("The train's fireman")
```

--

```
##                        string hisco                    description
## 1                    A farmer 61110                 General Farmer
## 2 Tailor of beautiful dresses 79100 Tailor, Specialisation Unknown
## 3         The train's fireman 98330    Railway SteamEngine Fireman
```

---
# The performance (1/5)
*DK data based on 300,000 validation observations*

### Macro-level performance

|level |    recall| precision|        f1|
|:-----|---------:|---------:|---------:|
|macro | 0.9707673| 0.9805047| 0.9755535|

--

### Micro-level performance

|level |    recall| precision|        f1|
|:-----|---------:|---------:|---------:|
|micro | 0.8228573| 0.9100653| 0.8989688|
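For reference, the difference between the two tables: macro averages per-class scores (rare HISCO classes weigh as much as common ones), while micro pools the counts before computing one score. A minimal sketch with hypothetical per-class counts, not the actual evaluation code:

```r
# Hypothetical per-class counts for three HISCO classes (illustrative only)
tp <- c(900, 50, 5)   # true positives
fp <- c(10, 5, 5)     # false positives
fn <- c(20, 10, 10)   # false negatives

# Macro: compute per class, then average across classes
prec <- tp / (tp + fp)
rec  <- tp / (tp + fn)
macro_f1 <- mean(2 * prec * rec / (prec + rec))

# Micro: pool counts across classes, then compute once
p_mu <- sum(tp) / sum(tp + fp)
r_mu <- sum(tp) / sum(tp + fn)
micro_f1 <- 2 * p_mu * r_mu / (p_mu + r_mu)

c(macro = macro_f1, micro = micro_f1)
```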
---
# The performance (2/5)

---
# The performance (2/5)

---
# The performance (2/5)

---
# The performance (3/5): Confusion matrix

---
# The performance (3/5): Confusion matrix

---
# The performance (3/5): Confusion matrix

---
# The performance (3/5): Confusion matrix

---
# The performance (3/5): Confusion matrix

---
# Performance (5/5)

**Conversion to status scores as validation** `\(\Rightarrow\)` **Higher than human accuracy**

---
# Transformers

- The architecture that fuels ChatGPT
- Lives inside ChatGPT
- It is not ChatGPT (that would be expensive)
- Instead: XLM-RoBERTa with 279M parameters
- Trained on 2.5 TB of website data covering 100 languages
- Finetuned for our purpose

--

### Results
- Still training
- 'Light' version: 94.7 percent accurate

---
# Transformer results

---
class: center
# Trained embeddings
<img src="Figures/tSNE_KMeans_Clustering_with_lang.png" width="650px" />

---
class: center
# Untrained embeddings
<img src="Figures/tSNE_KMeans_Clustering_with_lang_raw.png" width="650px" />

---
#### An application

---
# Empirical strategy

`$$\log(y_{it}) = Affected_i \times Year_t \cdot \beta_t + FE + \varepsilon_{it}$$`
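A hedged sketch of estimating the `\(\beta_t\)` path in this specification with the `fixest` package. The stand-in data frame, the variable names, and the 1900 reference year are all assumptions for illustration; only the equation itself comes from the slide:

```r
# Event-study sketch: log(y_it) = Affected_i x Year_t * beta_t + FE + e_it
library(fixest)

set.seed(1)
# Hypothetical panel of parishes observed over years
df <- data.frame(
  parish   = rep(1:100, each = 20),
  year     = rep(1891:1910, times = 100),
  affected = rep(rbinom(100, 1, 0.5), each = 20)
)
df$y <- exp(rnorm(nrow(df)) + 0.2 * df$affected * (df$year > 1900))

est <- feols(log(y) ~ i(year, affected, ref = 1900) | parish + year, data = df)
iplot(est)  # estimated beta_t path around the assumed reference year
```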
---
### Breach `\(\rightarrow\)` Parishes with fishermen

---
# Conclusion

.pull-left[
### Steps ahead
- More data, more training
- A generalized understanding of occupational descriptions
- Application-oriented validation
]

.pull-right[
**Please let me know**

- How do we make this a tool?
- <p style="color:red;">If you have data with HISCO codes, please send it to me (christian-vs@sam.sdu.dk)</p>
- <p style="color:red;">In return, I owe you HISCO codes for your project!</p>
]