class: center, inverse <style type="text/css"> .pull-left { float: left; width: 48%; } .pull-right { float: right; width: 48%; } .pull-right ~ p { clear: both; } .pull-left-wide { float: left; width: 66%; } .pull-right-wide { float: right; width: 66%; } .pull-right-wide ~ p { clear: both; } .pull-left-narrow { float: left; width: 30%; } .pull-right-narrow { float: right; width: 30%; } .tiny123 { font-size: 0.40em; } .small123 { font-size: 0.80em; } .large123 { font-size: 2em; } .huge123 { font-size: 4em; } .red { color: red; } .highlight { background-color: yellow; } </style> # Breaking the HISCO Barrier: ## *AI and Occupational Data Standardization* ### Christian Møller-Dahl ### Torben S. D. Johansen ### Christian Vedel ### University of Southern Denmark, HEDG #### Updated 2024-04-05 --- class: middle, Large .pull-left[ # What - New tool to make research using historical occupations easier - Language model, OccCANINE: + **Input:** "En" + "He is a farmer" + **Output:** HISCO code 61110 (probability=0.989589) - Everything is free, open source, wrapped up in a simple python library and available at [github.com/christianvedels/OccCANINE](https://github.com/christianvedels/OccCANINE) ] --- # Team .pull-left[ <img src="https://portal.findresearcher.sdu.dk/files-asset/200947586/1b5a4dd1-bfba-4b37-a685-854958b1dd4e.jpg?w=160&f=webp" width="150" /> Christian Møller Dahl, Professor <img src="https://portal.findresearcher.sdu.dk/files-asset/234467387/4353be14-7b5e-4a74-949c-144e6478a7da.jpg?w=160&f=webp" width="150" /> Christian Vedel, Assistant Professor ] .pull-right[ <img src="https://media.licdn.com/dms/image/C4D03AQHyT9bJKlGKiA/profile-displayphoto-shrink_800_800/0/1581022934019?e=1716422400&v=beta&t=M00riaQqwb3DTRb_KD80_B1-POZINjpYq2VexqQ_kpE" width="150" /> Torben S. D. Johansen, Assistant Professor *All at the university of Southern Denmark, Department of Economics* ] --- # Data - thanks to... .pull-left[ .tiny123[ | Shorthand name | Language | Source | |------------------|----------|-----------------------------------------------------------| | DK_census | da | Clausen (2015); The Danish National Archives | | EN_marr_cert | en | G. Clark et al. (2022) | | EN_uk_ipums | en | MPS (2020); Office of National Statistics | | SE_swedpop | se | SwedPop Team (2022) | | JIW_database | nl | Moor and van Weeren (“2021”) | | EN_ca_ipums | unk | MPS (2020); Statistics Canada | | CA_bcn | ca | Pujades Mora and Valls (2017) | | HISCO_website | mult | HISCO database (2023) | | HSN_database | nl | Mandemakers et al (2020) | | NO_ipums | no | MPS (2020) | | FR_desc | fr | HISCO database (2023) | | EN_us_ipums | en | MPS (2020); Bureau of the Census | | EN_parish | en | de Pleijt Nuvolari and Weisdorf (2019) | | DK_cedar | da | Ford (2023) | | SE_cedar | se | Edvinsson and Westberg (2016) | | DK_orsted | da | Ford (2023) | | EN_oclack | en | Zijdeman (2023) | | EN_loc | en | Mooney (2016) | | IS_ipums | is | MPS (2020) | | SE_chalmers | se | Ford (2023) | | DE_ipums | ge | MPS (2020); German Federal Statistical Office | | IT_fm | it | Fornasin and Marzona (2016) | ] ] --- # Data examples
--- class: middle .pull-left[ # The old solution - Spend weeks, months or years manually reading and categorising - 17865 different occupational descriptions fit with "farm servant" in DK censuses ("in service", "servant girl", "servant boy", "servant woman", "servant karl") - Spelling mistakes, negations, and different spelling conventions ] .pull-right[ # Our solution  .center[ ### OccCANINE .red[ ~10k HISCO codes in 27 seconds ~100k HISCO codes in 5 min. ~1 million HISCO codes in 45 min. `\(\rightarrow\)` **All with high precision and recall** ] ] ] --- class: middle # Test data performance .pull-left-narrow[ | Statistic | Lang. info. | Value | |-----------|-------------|-------| | Accuracy | Yes | 0.935 | | F1 score | Yes | 0.960 | | Precision | Yes | 0.955 | | Recall | Yes | 0.987 | ] .footnote[ .small123[ *Based on 1 million test observations* ] ] --- class: middle # Multilingual performance .pull-left[ <img src="Figures/Performance_by_language.png" width="450" /> ] --- class: middle # Out of Distribution .pull-left[ | Dataset | N checked | Accuracy | |----------------------------|-----------|----------| | Copenhagen Burial Records | 200 | 0.950 | | Training Ship Data | 200 | 0.985 | | Swedish Strikes | 200 | 0.945 | | Dutch Familiegeld* | 200 | 0.925 | .footnote[ .small123[ `\(^*\)` *Thanks to Bram Hilkens* See more details in the paper ] ] ] --- class: middle # Read more ### [https://arxiv.org/abs/2402.13604](https://arxiv.org/abs/2402.13604) .pull-right-narrow[  ]