class: title-slide

<br><br><br>

# Lecture 19

## Introduction to Machine Learning

### Tyler Ransom

### ECON 6343, University of Oklahoma

---

# Attribution

Today's material is based on the two main Statistical Learning textbooks:

.pull-left[
.center[
<img src="islr_cover.png" width="146" height="90%" style="display: block; margin: auto;" />

.small[James, Witten, Hastie, and Tibshirani (2013)]
]
]

.pull-right[
.center[
<img src="eslr_cover.png" width="149" height="90%" style="display: block; margin: auto;" />

.small[Hastie, Tibshirani, and Friedman (2009)]
]
]

---

# Plan for the Day

1. What is Machine Learning (ML)?

2. Broad overview

3. How to do ML in Julia

Next time, we'll discuss how ML is being used with econometrics

---

# Machine Learning and Artificial Intelligence (AI)

- .hi[Machine learning (ML):] Allowing computers to learn for themselves without explicitly being programmed

    - USPS: computers that read handwriting on envelopes

    - Google: AlphaGo, a computer program that defeated the world-champion Go player

    - Apple/Amazon/Microsoft: Siri, Alexa, and Cortana voice assistants

- .hi[Artificial intelligence (AI):] Constructing machines (robots, computers) to think and act like human beings

- ML is a subset of AI

---

# Prediction vs. Inference

- Prediction and inference are the two main reasons for analyzing data

- .hi[Prediction:] We simply want to obtain `\(\hat{Y}\)` given the `\(X\)`'s

- .hi[Inference:] Understanding how changing the `\(X\)`'s will change `\(Y\)`

- Three types of inference, according to [Andrew Gelman](https://andrewgelman.com/2018/08/18/fallacy-excluded-middle-statistical-philosophy-edition/)

    1. Generalizing from sample to population (statistical inference)

    2. Generalizing from control state to treatment state (causal inference)

    3. Generalizing from observed measurements to underlying constructs of interest

- Philosophically, these can each be framed as prediction problems

---

# Inference as Prediction

- How can each type of inference be framed as a prediction problem?

- Statistical inference:

    - Predict what `\(Y\)` (or `\(\beta\)`) would be in a different sample

- Causal inference:

    - Predict what `\(Y\)` would be if we switched each person's treatment status

- Measurement quality:

    - Predict what `\(Y\)` would be if we could better measure it (or the `\(X\)`'s)

    - e.g. personal rating in Harvard admissions (what does it measure?)

---

# Vocabulary

Econometrics and Machine Learning use different words for the same objects

.pull-left[
.hi[Econometrics]

- Dependent variable

- Covariate

- Observation

- Objective function

- Estimation
]

.pull-right[
.hi[Machine Learning]

- Target variable

- Feature

- Example (Instance)

- Cost function (Loss function)

- Training (Learning)
]

---

# The goal of machine learning

Machine learning is all about automating two hand-in-hand processes:

1. Model selection

    - What should the specification be?

2. Model validation

    - Does the model generalize to other (similar) contexts?

- We want to automate these processes to maximize predictive accuracy

    - As defined by some cost function

- This is .hi[different] from the goal of econometrics (causal inference)!
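---

# Prediction vs. inference in code

Putting the earlier prediction/inference distinction in code: a minimal sketch in Julia (simulated data; the variable names and data-generating process are made up for this slide). The same fitted model delivers both objects: `\(\hat{\beta}\)` is the object of inference, `\(\hat{Y}\)` is the object of prediction

``` julia
using Random

Random.seed!(1234)
N = 500
X = [ones(N) randn(N, 3)]       # covariates ("features")
b = [1.0, 0.5, -0.25, 2.0]      # true parameters (unknown in practice)
y = X*b + randn(N)              # dependent variable ("target")

bhat = X \ y                    # inference: how do the X's map into Y?
yhat = X*bhat                   # prediction: fitted values of Y given the X's
```

- Econometrics usually stops at interpreting `bhat`; ML cares about how well `yhat` predicts `\(Y\)` for observations the model has never seen — which is where the data split on the next slide comes in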
---

# Training, validation and test data

- In econometrics, we typically use the entire data set for estimation

- In ML, we assess out-of-sample performance, so we should hold out some data

- Some held-out data is used for validating the model, and some to test the model

- Data used in estimation is referred to as .hi[training data] (60%-70% of sample)

- Data we use to test performance is called .hi[test data] (10%-20%)

- Data we use to cross-validate our model is called .hi[validation data] (10%-20%)

- Division of training/validation/test sets should be .hi[random]

---

# Model complexity and the bias/variance tradeoff

- There is a trade-off between bias and variance

`\begin{align*} \underbrace{\mathbb{E}\left(y-\hat{f}(x)\right)^2}_{\text{Expected }MSE\text{ in test set}} &= \underbrace{\mathbb{V}\left(\hat{f}(x)\right)}_{\text{Variance}} + \underbrace{\left[\text{Bias}\left(\hat{f}(x)\right)\right]^2}_{\text{Squared Bias}} + \underbrace{\mathbb{V}\left(\epsilon\right)}_{\text{Irreducible Error}} \end{align*}`

- A model with high .hi[bias] is a poor approximation of reality

- A model with high .hi[variance] is one that does not generalize well to a new data set

- A model is .hi[overfit] if it has low bias and high variance

- A model is .hi[underfit] if it has high bias and low variance

---

# Visualizing the bias/variance tradeoff

.center[*(Figure omitted: test error (red) and training error plotted against model complexity)*]

- Optimal model complexity is at the minimum of the red line

- Irreducible error means that the red line can't go below a certain level

    - maybe at 0.3 in this picture?

- Image source: Hastie, Tibshirani, and Friedman (2009)

---

# Continuous vs. categorical target variables

- When the target variable is continuous, we use `\(MSE\)` to measure fit

- When it is categorical, we use .hi[accuracy] or similar measures

    - or some combination of specificity and sensitivity

    - goal is to not have a good "fit" by randomly guessing

    - so each potential metric penalizes random guessing

---

# Different types of learning algorithms

- Just like in econometrics, the most basic "learning algorithm" is OLS

    - Or, if `\(Y\)` is categorical, logistic regression

- But, we know there are many other ways to estimate models

    - e.g. non-parametric, semi-parametric, Bayesian, ...

---

# Supervised and unsupervised learning

- .hi[Supervised learning:] we predict `\(Y\)` from a set of `\(X\)`'s

- .hi[Unsupervised learning:] we try to group observations by their `\(X\)`'s

- Most of econometrics is about supervised learning (i.e. estimate `\(\hat{\beta}\)`)

- But there are some elements of unsupervised learning

    - Particularly with regards to detecting unobserved heterogeneity types

    - e.g. factor analysis (detect types based on a set of measurements)

    - e.g. the EM algorithm (detect types based on serial correlation of residuals)
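---

# The bias/variance tradeoff in a small simulation

Before turning to specific algorithms, here is a quick sketch in Julia (made-up data; the polynomial degrees and the 70/30 split are arbitrary choices for illustration). Training MSE keeps falling as the model becomes more flexible, but test MSE eventually turns back up — the overfitting pattern from the bias/variance figure earlier

``` julia
using Random, Statistics

Random.seed!(1)
n = 200
x = rand(n)
y = sin.(2π .* x) .+ 0.3 .* randn(n)        # true f(x) = sin(2πx) plus noise

idx   = shuffle(1:n)                        # random 70/30 train/test split
train = idx[1:140]
test  = idx[141:end]

poly(v, d) = [vi^j for vi in v, j in 0:d]   # polynomial design matrix of degree d

for d in (1, 3, 9, 15)                      # increasing model complexity
    b = poly(x[train], d) \ y[train]        # least squares fit on training data
    mse_tr = mean((y[train] .- poly(x[train], d)*b).^2)
    mse_te = mean((y[test]  .- poly(x[test],  d)*b).^2)
    println("degree $d: train MSE = $(round(mse_tr, digits=3)), test MSE = $(round(mse_te, digits=3))")
end
```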
---

# Supervised learning algorithms

Examples of supervised learning algorithms:

- Tree models

    - Basically a fully non-parametric bin estimator

    - Can generalize to "forests" that average over many "trees"

- Neural networks

    - Model the human brain's system of axons and dendrites

    - "Input layer" is the `\(X\)`'s, "Output layer" is `\(Y\)`

    - "Hidden layers" nonlinearly map the input and output layers

    - Mapping ends up looking like a logit of logit of logits

---

# Supervised learning algorithms (continued)

- Bayesian models

    - Bayes' rule can be thought of as a learning algorithm

    - Use it to update one's prior

- Support Vector Machine (SVM)

    - Originally developed for classification

    - Tries to divide 0s and 1s by as large a margin as possible

    - Based on representing examples as points in space

    - Generalization of the maximal margin classifier

---

# Unsupervised learning algorithms

- We covered the EM algorithm and PCA in previous lectures

- `\(k\)`-means clustering

    - Attempts to group observations together based on the `\(X\)`'s

    - Choose cluster labels to minimize difference in `\(X\)`'s among labeled observations

`\begin{align*} \min_{C_1,\ldots,C_K} \sum_{k=1}^K \frac{1}{N_k}\sum_{i\in C_k}\sum_{\ell=1}^L\left(x_{i\ell}-\overline{x}_{k\ell}\right)^2 \end{align*}`

`\(N_k\)` is the number of observations in cluster `\(k\)`, `\(L\)` is the number of `\(X\)`'s, and `\(\overline{x}_{k\ell}\)` is the mean of the `\(\ell\)`th `\(X\)` within cluster `\(k\)`

- Can choose other metrics besides Euclidean distance

---

# Active learning algorithms

- .hi[Active learning:] the algorithm chooses the next example it wants a label for

- Balances "exploration" and "exploitation"

- Two common examples of active learning:

1. .hi[Reinforcement learning] powers the world-beating chess engines

    - These algorithms use dynamic programming methods

    - Use Conditional Choice Probabilities for computational gains

2. .hi[Recommender systems] power social networks, streaming services, etc.

---

# Back to the bias-variance tradeoff

- The Bias-Variance Tradeoff applies to supervised learning

- How do we ensure that our model is not overly complex (i.e. overfit)?

- The answer is to penalize complexity

- .hi[Regularization] is the way we penalize complexity

- .hi[Cross-validation] is the way that we choose the optimal level of regularization

---

# How cross-validation works (Adams, 2018)

.center[*(Figure omitted: the sample is divided into folds — blue boxes are the data used for estimation, green boxes are the held-out portions)*]

- Blue is the data that we use to estimate the model's parameters

- We randomly hold out `\(K\)` portions of this data one-at-a-time (Green boxes)

- We assess the performance of the model in the Green data

- This tells us the optimal complexity (indexed by "hyperparameters" if CV is automated)
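---

# Cross-validation by hand: a sketch

In practice a library does this for us, but the mechanics are simple enough to hand-roll — a minimal sketch (simulated data; the function name, fold count, and OLS fit are illustrative choices, not a recipe from the readings):

``` julia
using Random, Statistics

# average held-out MSE of an OLS fit across K randomly assigned folds
function kfold_mse(X, y; K = 5)
    n    = length(y)
    perm = shuffle(1:n)
    mses = Float64[]
    for k in 1:K
        test  = perm[k:K:n]                   # every K-th shuffled observation
        train = setdiff(1:n, test)
        b     = X[train, :] \ y[train]        # fit on the remaining folds
        push!(mses, mean((y[test] .- X[test, :]*b).^2))
    end
    return mean(mses)
end

Random.seed!(2)
X = [ones(300) randn(300, 4)]
y = X*[1.0, 0.5, -0.5, 0.25, 0.0] + randn(300)
kfold_mse(X, y; K = 6)
```

- To tune a model, we would repeat this for each candidate level of complexity (e.g. each value of a penalty parameter) and keep the one with the lowest cross-validated MSE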
---

# Types of regularization

- Regularization is algorithm-specific

    - in tree models, complexity is the number of "leaves" on the tree

    - in linear models, complexity is the number of covariates

    - in neural networks, complexity is the number/mapping of hidden layers

    - in Bayesian approaches, priors act as regularization

- Whatever our algorithm, we can tune the complexity parameters using CV

---

# Regularization of linear-in-parameters models

There are three main types of regularization for linear-in-parameters models:

1. `\(L0\)` regularization (Subset selection)

2. `\(L1\)` regularization (LASSO)

3. `\(L2\)` regularization (Ridge)

---

# `\(L0\)` regularization

- Suppose you have `\(L\)` `\(X\)`'s you may want to include in your model

- Subset selection automatically chooses which ones to include

- This is an automated version of what is traditionally done in econometrics

- Can use Adjusted `\(R^2\)` to penalize complexity

    - or AIC, BIC, or a penalized SSR

- Algorithm either starts from 0 `\(X\)`'s and moves forward

- Or it starts from the full set of `\(X\)`'s and works backward

    - But backward selection won't work if `\(L>N\)`! (i.e. there are more `\(X\)`'s than observations)

---

# `\(L1\)` and `\(L2\)` regularization

- Consider two different penalized versions of the OLS model:

`\begin{align*} &\min_{\beta} \sum_i \left(y_i - x_i'\beta\right)^2 + \lambda\sum_k \vert\beta_k\vert & \text{(LASSO)} \\ &\min_{\beta} \sum_i \left(y_i - x_i'\beta\right)^2 + \lambda\sum_k \beta_k^2 & \text{(Ridge)} \end{align*}`

- .hi[LASSO:] Least Absolute Shrinkage and Selection Operator

    - sets some `\(\beta\)`'s to be 0, others to be attenuated in magnitude

- .hi[Ridge:]

    - sets each `\(\beta\)` to be attenuated in magnitude

---

# `\(L1\)` and `\(L2\)` regularization (continued)

- We want to choose `\(\lambda\)` to optimize the bias-variance tradeoff

- We choose `\(\lambda\)` by `\(k\)`-fold Cross Validation

- We can also employ a weighted average of `\(L1\)` and `\(L2\)`, known as .hi[elastic net]

`\begin{align*} &\min_{\beta} \sum_i \left(y_i - x_i'\beta\right)^2 + \lambda_1\sum_k \vert\beta_k\vert + \lambda_2\sum_k \beta_k^2 \end{align*}`

where we choose `\((\lambda_1,\lambda_2)\)` by cross-validation

- `\(L1\)` and `\(L2\)` are excellent for problems where `\(L>N\)` (more `\(X\)`'s than observations)

- We can apply `\(L1\)` and `\(L2\)` to other problems (logit, neural network, etc.)
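---

# Ridge by hand: a sketch

Ridge has a closed form, which makes the shrinkage easy to see in a few lines of Julia (an illustrative sketch with simulated data, not how production libraries implement it; LASSO has no closed form and needs an iterative solver such as coordinate descent):

``` julia
using LinearAlgebra, Random

Random.seed!(3)
X = randn(200, 10)
y = X[:, 1] .- 0.5 .* X[:, 2] .+ randn(200)    # only two of the ten X's matter

# ridge estimator: argmin_b Σᵢ(yᵢ - xᵢ'b)² + λ Σₖ bₖ²
ridge(X, y, λ) = (X'X + λ*I) \ (X'y)

# λ = 0 is OLS; as λ grows, every coefficient is attenuated toward zero
for λ in (0.0, 1.0, 10.0, 100.0)
    println("λ = ", λ, ":  ", round.(ridge(X, y, λ), digits = 2))
end
```

- In applied work `\(\lambda\)` would be chosen by `\(k\)`-fold cross-validation, and the columns of `\(X\)` would typically be standardized before penalizing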
---

# How to estimate ML models

- R, Python and Julia all have excellent ML libraries

- Each language also has a "meta" ML library

    - [`mlr3`](https://mlr3.mlr-org.com/) (R), [`scikit-learn`](https://scikit-learn.org/stable/) (Python), [`MLJ.jl`](https://alan-turing-institute.github.io/MLJ.jl/dev/) (Julia)

- In these libraries, the user specifies `\(Y\)` and `\(X\)`

- With only a slight change in code, can estimate a completely different ML model

    - e.g. go from a logit to a tree model with minimal code changes

    - e.g. choose values of tuning parameters by `\(k\)`-fold CV

- I'll go through a quick example with `MLJ.jl`

---

# `MLJ` example

After installing the required packages:

``` julia
using MLJ, Tables, DataFrames, MLJDecisionTreeInterface, MLJLinearModels
models()
```

will list all of the models that can interface with `MLJ`:

``` julia
151-element Array{NamedTuple{(:name, :package_name, :is_supervised, :docstring, :hyperparameter_ranges, :hyperparameter_types, :hyperparameters, :implemented_methods, :is_pure_julia, :is_wrapper, :load_path, :package_license, :package_url, :package_uuid, :prediction_type, :supports_online, :supports_weights, :input_scitype, :target_scitype, :output_scitype),T} where T<:Tuple,1}:
 (name = ARDRegressor, package_name = ScikitLearn, ... )
 (name = AdaBoostClassifier, package_name = ScikitLearn, ... )
 (name = AdaBoostRegressor, package_name = ScikitLearn, ... )
 ⋮
 (name = XGBoostClassifier, package_name = XGBoost, ... )
 (name = XGBoostCount, package_name = XGBoost, ... )
 (name = XGBoostRegressor, package_name = XGBoost, ... )
```

---

# `MLJ` example (continued)

.scroll-box-18[

``` julia
# use house price data from US Census Bureau
df = OpenML.load(574) |> DataFrame
X = df[:,[:P1,:P5p1,:P6p2,:P11p4,:P14p9,:P15p1,:P15p3,:P16p2,:P18p2,:P27p4,:H2p2,:H8p2,:H10p1,:H13p1,:H18pA,:H40p4]]
X = Tables.rowtable(X)
y = log.(df.price)
models(matching(X,y))

# declare a tree and a lasso model
tree_model  = @load DecisionTreeRegressor pkg=DecisionTree
lasso_model = @load LassoRegressor pkg=MLJLinearModels

# initialize "machines" where results can be reported
tree = machine(tree_model, X, y)
lass = machine(lasso_model, X, y)

# split into training and testing data
train, test = partition(eachindex(y), 0.7, shuffle=true)

# train the models
MLJ.fit!(tree, rows=train)
MLJ.fit!(lass, rows=train)

# predict in test set (separate objects so the tree predictions aren't overwritten)
yhat_tree  = MLJ.predict(tree, rows=test);
yhat_lasso = MLJ.predict(lass, rows=test);

# get RMSE across validation folds
MLJ.evaluate(tree_model,  X, y, resampling=CV(nfolds=6, shuffle=true), measure=rmse)
MLJ.evaluate(lasso_model, X, y, resampling=CV(nfolds=6, shuffle=true), measure=rmse)
```
]

---

# References

Aakvik, A., J. J. Heckman, and E. J. Vytlacil (2005). "Estimating Treatment Effects for Discrete Outcomes When Responses to Treatment Vary: An Application to Norwegian Vocational Rehabilitation Programs". In: _Journal of Econometrics_ 125.1, pp. 15-51. DOI: [10.1016/j.jeconom.2004.04.002](https://doi.org/10.1016%2Fj.jeconom.2004.04.002).

Ackerberg, D. A. (2003). "Advertising, Learning, and Consumer Choice in Experience Good Markets: An Empirical Examination". In: _International Economic Review_ 44.3, pp. 1007-1040. DOI: [10.1111/1468-2354.t01-2-00098](https://doi.org/10.1111%2F1468-2354.t01-2-00098).

Adams, R. P. (2018). _Model Selection and Cross Validation_. Lecture Notes. Princeton University. URL: [https://www.cs.princeton.edu/courses/archive/fall18/cos324/files/model-selection.pdf](https://www.cs.princeton.edu/courses/archive/fall18/cos324/files/model-selection.pdf).

Ahlfeldt, G. M., S. J. Redding, D. M. Sturm, et al. (2015). "The Economics of Density: Evidence From the Berlin Wall". In: _Econometrica_ 83.6, pp. 2127-2189. DOI: [10.3982/ECTA10876](https://doi.org/10.3982%2FECTA10876).

Altonji, J. G., T. E. Elder, and C. R. Taber (2005). "Selection on Observed and Unobserved Variables: Assessing the Effectiveness of Catholic Schools". In: _Journal of Political Economy_ 113.1, pp. 151-184. DOI: [10.1086/426036](https://doi.org/10.1086%2F426036).

Altonji, J. G. and C. R. Pierret (2001). "Employer Learning and Statistical Discrimination". In: _Quarterly Journal of Economics_ 116.1, pp. 313-350. DOI: [10.1162/003355301556329](https://doi.org/10.1162%2F003355301556329).

Angrist, J. D. and A. B. Krueger (1991). "Does Compulsory School Attendance Affect Schooling and Earnings?" In: _Quarterly Journal of Economics_ 106.4, pp. 979-1014. DOI: [10.2307/2937954](https://doi.org/10.2307%2F2937954).

Angrist, J. D. and J. Pischke (2009). _Mostly Harmless Econometrics: An Empiricist's Companion_. Princeton University Press. ISBN: 0691120358.

Arcidiacono, P. (2004). "Ability Sorting and the Returns to College Major". In: _Journal of Econometrics_ 121, pp. 343-375. DOI: [10.1016/j.jeconom.2003.10.010](https://doi.org/10.1016%2Fj.jeconom.2003.10.010).

Arcidiacono, P., E. Aucejo, A. Maurel, et al. (2016). _College Attrition and the Dynamics of Information Revelation_. Working Paper. Duke University. URL: [https://tyleransom.github.io/research/CollegeDropout2016May31.pdf](https://tyleransom.github.io/research/CollegeDropout2016May31.pdf).
Arcidiacono, P., E. Aucejo, A. Maurel, et al. (2025). "College Attrition and the Dynamics of Information Revelation". In: _Journal of Political Economy_ 133.1. DOI: [10.1086/732526](https://doi.org/10.1086%2F732526). Arcidiacono, P. and J. B. Jones (2003). "Finite Mixture Distributions, Sequential Likelihood and the EM Algorithm". In: _Econometrica_ 71.3, pp. 933-946. DOI: [10.1111/1468-0262.00431](https://doi.org/10.1111%2F1468-0262.00431). Arcidiacono, P., J. Kinsler, and T. Ransom (2022b). "Asian American Discrimination in Harvard Admissions". In: _European Economic Review_ 144, p. 104079. DOI: [10.1016/j.euroecorev.2022.104079](https://doi.org/10.1016%2Fj.euroecorev.2022.104079). Arcidiacono, P., J. Kinsler, and T. Ransom (2022a). "Legacy and Athlete Preferences at Harvard". In: _Journal of Labor Economics_ 40.1, pp. 133-156. DOI: [10.1086/713744](https://doi.org/10.1086%2F713744). Arcidiacono, P. and R. A. Miller (2011). "Conditional Choice Probability Estimation of Dynamic Discrete Choice Models With Unobserved Heterogeneity". In: _Econometrica_ 79.6, pp. 1823-1867. DOI: [10.3982/ECTA7743](https://doi.org/10.3982%2FECTA7743). Arroyo Marioli, F., F. Bullano, S. Kucinskas, et al. (2020). _Tracking R of COVID-19: A New Real-Time Estimation Using the Kalman Filter_. Working Paper. medRxiv. DOI: [10.1101/2020.04.19.20071886](https://doi.org/10.1101%2F2020.04.19.20071886). Ashworth, J., V. J. Hotz, A. Maurel, et al. (2021). "Changes across Cohorts in Wage Returns to Schooling and Early Work Experiences". In: _Journal of Labor Economics_ 39.4, pp. 931-964. DOI: [10.1086/711851](https://doi.org/10.1086%2F711851). Attanasio, O. P., C. Meghir, and A. Santiago (2011). "Education Choices in Mexico: Using a Structural Model and a Randomized Experiment to Evaluate PROGRESA". In: _Review of Economic Studies_ 79.1, pp. 37-66. DOI: [10.1093/restud/rdr015](https://doi.org/10.1093%2Frestud%2Frdr015). Aucejo, E. M. and J. James (2019). "Catching Up to Girls: Understanding the Gender Imbalance in Educational Attainment Within Race". In: _Journal of Applied Econometrics_ 34.4, pp. 502-525. DOI: [10.1002/jae.2699](https://doi.org/10.1002%2Fjae.2699). Baragatti, M., A. Grimaud, and D. Pommeret (2013). "Likelihood-free Parallel Tempering". In: _Statistics and Computing_ 23.4, pp. 535-549. DOI: [ 10.1007/s11222-012-9328-6](https://doi.org/%2010.1007%2Fs11222-012-9328-6). Bayer, P., R. McMillan, A. Murphy, et al. (2016). "A Dynamic Model of Demand for Houses and Neighborhoods". In: _Econometrica_ 84.3, pp. 893-942. DOI: [10.3982/ECTA10170](https://doi.org/10.3982%2FECTA10170). Begg, C. B. and R. Gray (1984). "Calculation of Polychotomous Logistic Regression Parameters Using Individualized Regressions". In: _Biometrika_ 71.1, pp. 11-18. DOI: [10.1093/biomet/71.1.11](https://doi.org/10.1093%2Fbiomet%2F71.1.11). Beggs, S. D., N. S. Cardell, and J. Hausman (1981). "Assessing the Potential Demand for Electric Cars". In: _Journal of Econometrics_ 17.1, pp. 1-19. DOI: [10.1016/0304-4076(81)90056-7](https://doi.org/10.1016%2F0304-4076%2881%2990056-7). Berry, S., J. Levinsohn, and A. Pakes (1995). "Automobile Prices in Market Equilibrium". In: _Econometrica_ 63.4, pp. 841-890. URL: [http://www.jstor.org/stable/2171802](http://www.jstor.org/stable/2171802). Bjorklund, A. and R. Moffitt (1987). "The Estimation of Wage Gains and Welfare Gains in Self-Selection Models". In: _Review of Economics and Statistics_ 69.1, pp. 42-49. DOI: [10.2307/1937899](https://doi.org/10.2307%2F1937899). Blass, A. A., S. Lach, and C. F. 
Manski (2010). "Using Elicited Choice Probabilities to Estimate Random Utility Models: Preferences for Electricity Reliability". In: _International Economic Review_ 51.2, pp. 421-440. DOI: [10.1111/j.1468-2354.2010.00586.x](https://doi.org/10.1111%2Fj.1468-2354.2010.00586.x). Blundell, R. (2010). "Comments on: ``Structural vs. Atheoretic Approaches to Econometrics'' by Michael Keane". In: _Journal of Econometrics_ 156.1, pp. 25-26. DOI: [10.1016/j.jeconom.2009.09.005](https://doi.org/10.1016%2Fj.jeconom.2009.09.005). Bonhomme, S. and J. Robin (2009). "Consistent Noisy Independent Component Analysis". In: _Journal of Econometrics_ 149.1, pp. 12-25. DOI: [10.1016/j.jeconom.2008.12.019](https://doi.org/10.1016%2Fj.jeconom.2008.12.019). Bonhomme, S. and J. Robin (2010). "Generalized Non-Parametric Deconvolution with an Application to Earnings Dynamics". In: _Review of Economic Studies_ 77.2, pp. 491-533. DOI: [10.1111/j.1467-937X.2009.00577.x](https://doi.org/10.1111%2Fj.1467-937X.2009.00577.x). Bresnahan, T. F., S. Stern, and M. Trajtenberg (1997). "Market Segmentation and the Sources of Rents from Innovation: Personal Computers in the Late 1980s". In: _The RAND Journal of Economics_ 28.0, pp. S17-S44. DOI: [10.2307/3087454](https://doi.org/10.2307%2F3087454). Brien, M. J., L. A. Lillard, and S. Stern (2006). "Cohabitation, Marriage, and Divorce in a Model of Match Quality". In: _International Economic Review_ 47.2, pp. 451-494. DOI: [10.1111/j.1468-2354.2006.00385.x](https://doi.org/10.1111%2Fj.1468-2354.2006.00385.x). Brinch, C. N., M. Mogstad, and M. Wiswall (2017). "Beyond LATE with a Discrete Instrument". In: _Journal of Political Economy_ 125.4, pp. 985-1039. DOI: [10.1086/692712](https://doi.org/10.1086%2F692712). Card, D. (1995). "Using Geographic Variation in College Proximity to Estimate the Return to Schooling". In: _Aspects of Labor Market Behaviour: Essays in Honour of John Vanderkamp_. Ed. by L. N. Christofides, E. K. Grant and R. Swidinsky. Toronto: University of Toronto Press. Cardell, N. S. (1997). "Variance Components Structures for the Extreme-Value and Logistic Distributions with Application to Models of Heterogeneity". In: _Econometric Theory_ 13.2, pp. 185-213. URL: [https://www.jstor.org/stable/3532724](https://www.jstor.org/stable/3532724). Carneiro, P., K. T. Hansen, and J. J. Heckman (2003). "Estimating Distributions of Treatment Effects with an Application to the Returns to Schooling and Measurement of the Effects of Uncertainty on College Choice". In: _International Economic Review_ 44.2, pp. 361-422. DOI: [10.1111/1468-2354.t01-1-00074](https://doi.org/10.1111%2F1468-2354.t01-1-00074). Carneiro, P., J. J. Heckman, and E. Vytlacil (2010). "Evaluating Marginal Policy Changes and the Average Effect of Treatment for Individuals at the Margin". In: _Econometrica_ 78.1, pp. 377-394. DOI: [10.3982/ECTA7089](https://doi.org/10.3982%2FECTA7089). Carneiro, P., J. J. Heckman, and E. J. Vytlacil (2011). "Estimating Marginal Returns to Education". In: _American Economic Review_ 101.6, pp. 2754-2781. DOI: [10.1257/aer.101.6.2754](https://doi.org/10.1257%2Faer.101.6.2754). Caucutt, E. M., L. Lochner, J. Mullins, et al. (2020). _Child Skill Production: Accounting for Parental and Market-Based Time and Goods Investments_. Working Paper 27838. National Bureau of Economic Research. DOI: [10.3386/w27838](https://doi.org/10.3386%2Fw27838). Chen, X., H. Hong, and D. Nekipelov (2011). "Nonlinear Models of Measurement Errors". In: _Journal of Economic Literature_ 49.4, pp. 901-937. 
DOI: [10.1257/jel.49.4.901](https://doi.org/10.1257%2Fjel.49.4.901). Chintagunta, P. K. (1992). "Estimating a Multinomial Probit Model of Brand Choice Using the Method of Simulated Moments". In: _Marketing Science_ 11.4, pp. 386-407. DOI: [10.1287/mksc.11.4.386](https://doi.org/10.1287%2Fmksc.11.4.386). Cinelli, C. and C. Hazlett (2020). "Making Sense of Sensitivity: Extending Omitted Variable Bias". In: _Journal of the Royal Statistical Society: Series B (Statistical Methodology)_ 82.1, pp. 39-67. DOI: [10.1111/rssb.12348](https://doi.org/10.1111%2Frssb.12348). Coate, P. and K. Mangum (2019). _Fast Locations and Slowing Labor Mobility_. Working Paper 19-49. Federal Reserve Bank of Philadelphia. Cunha, F. and J. Heckman (2007). "The Technology of Skill Formation". In: _American Economic Review_ 97.2, pp. 31-47. DOI: [10.1257/aer.97.2.31](https://doi.org/10.1257%2Faer.97.2.31). Cunha, F., J. J. Heckman, and S. M. Schennach (2010). "Estimating the Technology of Cognitive and Noncognitive Skill Formation". In: _Econometrica_ 78.3, pp. 883-931. DOI: [10.3982/ECTA6551](https://doi.org/10.3982%2FECTA6551). Cunningham, S. (2021). _Causal Inference: The Mixtape_. Yale University Press. URL: [https://www.scunning.com/causalinference_norap.pdf](https://www.scunning.com/causalinference_norap.pdf). Delavande, A. and C. F. Manski (2015). "Using Elicited Choice Probabilities in Hypothetical Elections to Study Decisions to Vote". In: _Electoral Studies_ 38, pp. 28-37. DOI: [10.1016/j.electstud.2015.01.006](https://doi.org/10.1016%2Fj.electstud.2015.01.006). Delavande, A. and B. Zafar (2019). "University Choice: The Role of Expected Earnings, Nonpecuniary Outcomes, and Financial Constraints". In: _Journal of Political Economy_ 127.5, pp. 2343-2393. DOI: [10.1086/701808](https://doi.org/10.1086%2F701808). Diegert, P., M. A. Masten, and A. Poirier (2025). _Assessing Omitted Variable Bias when the Controls are Endogenous_. arXiv. DOI: [10.48550/ARXIV.2206.02303](https://doi.org/10.48550%2FARXIV.2206.02303). Erdem, T. and M. P. Keane (1996). "Decision-Making under Uncertainty: Capturing Dynamic Brand Choice Processes in Turbulent Consumer Goods Markets". In: _Marketing Science_ 15.1, pp. 1-20. DOI: [10.1287/mksc.15.1.1](https://doi.org/10.1287%2Fmksc.15.1.1). Evans, R. W. (2018). _Simulated Method of Moments (SMM) Estimation_. QuantEcon Note. University of Chicago. URL: [https://notes.quantecon.org/submission/5b3db2ceb9eab00015b89f93](https://notes.quantecon.org/submission/5b3db2ceb9eab00015b89f93). Farber, H. S. and R. Gibbons (1996). "Learning and Wage Dynamics". In: _Quarterly Journal of Economics_ 111.4, pp. 1007-1047. DOI: [10.2307/2946706](https://doi.org/10.2307%2F2946706). Fu, C., N. Grau, and J. Rivera (2020). _Wandering Astray: Teenagers' Choices of Schooling and Crime_. Working Paper. University of Wisconsin-Madison. URL: [https://www.ssc.wisc.edu/~cfu/wander.pdf](https://www.ssc.wisc.edu/~cfu/wander.pdf). Geary, R. C. (1942). "Inherent Relations between Random Variables". In: _Proceedings of the Royal Irish Academy. Section A: Mathematical and Physical Sciences_ 47, pp. 63-76. URL: [http://www.jstor.org/stable/20488436](http://www.jstor.org/stable/20488436). Gillingham, K., F. Iskhakov, A. Munk-Nielsen, et al. (2022). "Equilibrium Trade in Automobiles". In: _Journal of Political Economy_. DOI: [10.1086/720463](https://doi.org/10.1086%2F720463). Haile, P. (2019). _``Structural vs. Reduced Form'' Language and Models in Empirical Economics_. Lecture Slides. Yale University. 
URL: [http://www.econ.yale.edu/~pah29/intro.pdf](http://www.econ.yale.edu/~pah29/intro.pdf). Haile, P. (2024). _Models, Measurement, and the Language of Empirical Economics_. Lecture Slides. Yale University. URL: [https://www.dropbox.com/s/8kwtwn30dyac18s/intro.pdf](https://www.dropbox.com/s/8kwtwn30dyac18s/intro.pdf). Hastie, T., R. Tibshirani, and J. Friedman (2009). _The Elements of Statistical Learning: Data Mining, Inference, Prediction_. 2nd. New York: Springer. URL: [https://web.stanford.edu/~hastie/Papers/ESLII.pdf](https://web.stanford.edu/~hastie/Papers/ESLII.pdf). Heckman, J. J. and J. A. Smith (1993). "Assessing the Case for Randomized Evaluation of Social Programs". In: _Measuring Labor Market Measures: Evaluating the Effects of Active Labour Market Policies_. Ed. by K. Jensen and P. K. Madsen. Copenhagen: Danish Ministry of Labor, pp. 35-96. Heckman, J. J., J. Smith, and N. Clements (1997). "Making the Most Out of Programme Evaluations and Social Experiments: Accounting for Heterogeneity in Programme Impacts". In: _Review of Economic Studies_ 64.4, pp. 487-535. URL: [http://www.jstor.org/stable/2971729](http://www.jstor.org/stable/2971729). Heckman, J. J., J. Stixrud, and S. Urzua (2006). "The Effects of Cognitive and Noncognitive Abilities on Labor Market Outcomes and Social Behavior". In: _Journal of Labor Economics_ 24.3, pp. 411-482. DOI: [10.1086/504455](https://doi.org/10.1086%2F504455). Heckman, J. J. and E. Vytlacil (2005). "Structural Equations, Treatment Effects, and Econometric Policy Evaluation1". In: _Econometrica_ 73.3, pp. 669-738. DOI: [10.1111/j.1468-0262.2005.00594.x](https://doi.org/10.1111%2Fj.1468-0262.2005.00594.x). Hotz, V. J. and R. A. Miller (1993). "Conditional Choice Probabilities and the Estimation of Dynamic Models". In: _The Review of Economic Studies_ 60.3, pp. 497-529. DOI: [10.2307/2298122](https://doi.org/10.2307%2F2298122). Huenermund, P. and E. Bareinboim (2019). _Causal Inference and Data-Fusion in Econometrics_. Working Paper. arXiv. URL: [https://arxiv.org/abs/1912.09104](https://arxiv.org/abs/1912.09104). Hurwicz, L. (1950). "Generalization of the Concept of Identification". In: _Statistical Inference in Dynamic Economic Models_. Hoboken, NJ: John Wiley and Sons, pp. 245-257. Imbens, G. W. and J. D. Angrist (1994). "Identification and Estimation of Local Average Treatment Effects". In: _Econometrica_ 62.2, pp. 467-475. DOI: [10.2307/2951620](https://doi.org/10.2307%2F2951620). Ishimaru, S. (2022). _Geographic Mobility of Youth and Spatial Gaps in Local College and Labor Market Opportunities_. Working Paper. Hitotsubashi University. James, G., D. Witten, T. Hastie, et al. (2013). _An Introduction to Statistical Learning with Applications in R_. New York: Springer. DOI: [10.1007/978-1-4614-7138-7](https://doi.org/10.1007%2F978-1-4614-7138-7). URL: [https://faculty.marshall.usc.edu/gareth-james/ISL/ISLR_Seventh_Printing.pdf](https://faculty.marshall.usc.edu/gareth-james/ISL/ISLR_Seventh_Printing.pdf). James, J. (2011). _Ability Matching and Occupational Choice_. Working Paper 11-25. Federal Reserve Bank of Cleveland. James, J. (2017). "MM Algorithm for General Mixed Multinomial Logit Models". In: _Journal of Applied Econometrics_ 32.4, pp. 841-857. DOI: [10.1002/jae.2532](https://doi.org/10.1002%2Fjae.2532). Jin, H. and H. Shen (2020). "Foreign Asset Accumulation Among Emerging Market Economies: A Case for Coordination". In: _Review of Economic Dynamics_ 35.1, pp. 54-73. 
DOI: [10.1016/j.red.2019.04.006](https://doi.org/10.1016%2Fj.red.2019.04.006). Keane, M. P. (2010). "Structural vs. Atheoretic Approaches to Econometrics". In: _Journal of Econometrics_ 156.1, pp. 3-20. DOI: [10.1016/j.jeconom.2009.09.003](https://doi.org/10.1016%2Fj.jeconom.2009.09.003). Keane, M. P. and K. I. Wolpin (1997). "The Career Decisions of Young Men". In: _Journal of Political Economy_ 105.3, pp. 473-522. DOI: [10.1086/262080](https://doi.org/10.1086%2F262080). Koopmans, T. C. and O. Reiersol (1950). "The Identification of Structural Characteristics". In: _The Annals of Mathematical Statistics_ 21.2, pp. 165-181. URL: [http://www.jstor.org/stable/2236899](http://www.jstor.org/stable/2236899). Kosar, G., T. Ransom, and W. van der Klaauw (2022). "Understanding Migration Aversion Using Elicited Counterfactual Choice Probabilities". In: _Journal of Econometrics_ 231.1, pp. 123-147. DOI: [10.1016/j.jeconom.2020.07.056](https://doi.org/10.1016%2Fj.jeconom.2020.07.056). Kotlarski, I. (1967). "On Characterizing the Gamma and the Normal Distribution". In: _Pacific Journal of Mathematics_ 20, pp. 69-76. Krauth, B. (2016). "Bounding a Linear Causal Effect Using Relative Correlation Restrictions". In: _Journal of Econometric Methods_ 5.1, pp. 117-141. DOI: [10.1515/jem-2013-0013](https://doi.org/10.1515%2Fjem-2013-0013). Lang, K. and M. D. Palacios (2018). _The Determinants of Teachers' Occupational Choice_. Working Paper 24883. National Bureau of Economic Research. DOI: [10.3386/w24883](https://doi.org/10.3386%2Fw24883). Lee, D. S., J. McCrary, M. J. Moreira, et al. (2020). _Valid t-ratio Inference for IV_. Working Paper. arXiv. URL: [https://arxiv.org/abs/2010.05058](https://arxiv.org/abs/2010.05058). Lewbel, A. (2019). "The Identification Zoo: Meanings of Identification in Econometrics". In: _Journal of Economic Literature_ 57.4, pp. 835-903. DOI: [10.1257/jel.20181361](https://doi.org/10.1257%2Fjel.20181361). Mahoney, N. (2022). "Principles for Combining Descriptive and Model-Based Analysis in Applied Microeconomics Research". In: _Journal of Economic Perspectives_ 36.3, pp. 211-22. DOI: [10.1257/jep.36.3.211](https://doi.org/10.1257%2Fjep.36.3.211). Mardia, K. V. (1970). "Measures of Multivariate Skewness and Kurtosis with Applications". In: _Biometrika_ 57.3, pp. 519-530. URL: [http://www.jstor.org/stable/2334770](http://www.jstor.org/stable/2334770). McFadden, D. (1978). "Modelling the Choice of Residential Location". In: _Spatial Interaction Theory and Planning Models_. Ed. by A. Karlqvist, L. Lundqvist, F. Snickers and J. W. Weibull. Amsterdam: North Holland, pp. 75-96. McFadden, D. (1989). "A Method of Simulated Moments for Estimation of Discrete Response Models Without Numerical Integration". In: _Econometrica_ 57.5, pp. 995-1026. DOI: [10.2307/1913621](https://doi.org/10.2307%2F1913621). URL: [http://www.jstor.org/stable/1913621](http://www.jstor.org/stable/1913621). Mellon, J. (2020). _Rain, Rain, Go Away: 137 Potential Exclusion-Restriction Violations for Studies Using Weather as an Instrumental Variable_. Working Paper. University of Manchester. URL: [https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3715610](https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3715610). Miller, R. A. (1984). "Job Matching and Occupational Choice". In: _Journal of Political Economy_ 92.6, pp. 1086-1120. DOI: [10.1086/261276](https://doi.org/10.1086%2F261276). Mincer, J. (1974). _Schooling, Experience and Earnings_. New York: Columbia University Press for National Bureau of Economic Research. 
Ost, B., W. Pan, and D. Webber (2018). "The Returns to College Persistence for Marginal Students: Regression Discontinuity Evidence from University Dismissal Policies". In: _Journal of Labor Economics_ 36.3, pp. 779-805. DOI: [10.1086/696204](https://doi.org/10.1086%2F696204). Oster, E. (2019). "Unobservable Selection and Coefficient Stability: Theory and Evidence". In: _Journal of Business & Economic Statistics_ 37.2, pp. 187-204. DOI: [10.1080/07350015.2016.1227711](https://doi.org/10.1080%2F07350015.2016.1227711). Pearl, J. (2012). "The Do-Calculus Revisited". In: _Proceedings of the Twenty-Eighth Conference on Uncertainty in Artificial Intelligence_. Ed. by N. de Freitas and K. Murphy. Corvallis, OR: AUAI Press, pp. 4-11. Pischke, S. (2007). _Lecture Notes on Measurement Error_. Lecture Notes. London School of Economics. URL: [http://econ.lse.ac.uk/staff/spischke/ec524/Merr_new.pdf](http://econ.lse.ac.uk/staff/spischke/ec524/Merr_new.pdf). Ransom, M. R. and T. Ransom (2018). "Do High School Sports Build or Reveal Character? Bounding Causal Estimates of Sports Participation". In: _Economics of Education Review_ 64, pp. 75-89. DOI: [10.1016/j.econedurev.2018.04.002](https://doi.org/10.1016%2Fj.econedurev.2018.04.002). Ransom, T. (2022). "Labor Market Frictions and Moving Costs of the Employed and Unemployed". In: _Journal of Human Resources_ 57.S, pp. S137-S166. DOI: [10.3368/jhr.monopsony.0219-10013R2](https://doi.org/10.3368%2Fjhr.monopsony.0219-10013R2). Reiersol, O. (1950). "Identifiability of a Linear Relation between Variables Which Are Subject to Error". In: _Econometrica_ 18.4, pp. 375-389. URL: [http://www.jstor.org/stable/1907835](http://www.jstor.org/stable/1907835). Robins, J. M. (1997). "Causal Inference from Complex Longitudinal Data". In: _Latent Variable Modeling and Applications to Causality_. Ed. by M. Berkane. New York: Springer, pp. 69-117. Robinson, P. M. (1988). "Root-N-Consistent Semiparametric Regression". In: _Econometrica_ 56.4, pp. 931-954. URL: [http://www.jstor.org/stable/1912705](http://www.jstor.org/stable/1912705). Rudik, I. (2020). "Optimal Climate Policy When Damages Are Unknown". In: _American Economic Journal: Economic Policy_ 12.2, pp. 340-373. DOI: [10.1257/pol.20160541](https://doi.org/10.1257%2Fpol.20160541). Rueschendorf, L. (1981). "Sharpness of Frechet-bounds". In: _Zeitschrift fur Wahrscheinlichkeitstheorie und Verwandte Gebiete_ 57.2, pp. 293-302. DOI: [10.1007/BF00535495](https://doi.org/10.1007%2FBF00535495). Rust, J. (1987). "Optimal Replacement of GMC Bus Engines: An Empirical Model of Harold Zurcher". In: _Econometrica_ 55.5, pp. 999-1033. URL: [http://www.jstor.org/stable/1911259](http://www.jstor.org/stable/1911259). Shalizi, C. R. (2019). _Advanced Data Analysis from an Elementary Point of View_. Cambridge University Press. URL: [http://www.stat.cmu.edu/~cshalizi/ADAfaEPoV/ADAfaEPoV.pdf](http://www.stat.cmu.edu/~cshalizi/ADAfaEPoV/ADAfaEPoV.pdf). Smith Jr., A. A. (2008). "Indirect Inference". In: _The New Palgrave Dictionary of Economics_. Ed. by S. N. Durlauf and L. E. Blume. Vol. 1-8. London: Palgrave Macmillan. DOI: [10.1007/978-1-349-58802-2](https://doi.org/10.1007%2F978-1-349-58802-2). URL: [http://www.econ.yale.edu/smith/palgrave7.pdf](http://www.econ.yale.edu/smith/palgrave7.pdf). Stinebrickner, R. and T. Stinebrickner (2014a). "Academic Performance and College Dropout: Using Longitudinal Expectations Data to Estimate a Learning Model". In: _Journal of Labor Economics_ 32.3, pp. 601-644. 
DOI: [10.1086/675308](https://doi.org/10.1086%2F675308). Stinebrickner, R. and T. R. Stinebrickner (2014b). "A Major in Science? Initial Beliefs and Final Outcomes for College Major and Dropout". In: _Review of Economic Studies_ 81.1, pp. 426-472. DOI: [10.1093/restud/rdt025](https://doi.org/10.1093%2Frestud%2Frdt025). Su, C. and K. L. Judd (2012). "Constrained Optimization Approaches to Estimation of Structural Models". In: _Econometrica_ 80.5, pp. 2213-2230. DOI: [10.3982/ECTA7925](https://doi.org/10.3982%2FECTA7925). Train, K. (2009). _Discrete Choice Methods with Simulation_. 2nd ed. Cambridge; New York: Cambridge University Press. ISBN: 9780521766555. Vytlacil, E. (2002). "Independence, Monotonicity, and Latent Index Models: An Equivalence Result". In: _Econometrica_ 70.1, pp. 331-341. DOI: [10.1111/1468-0262.00277](https://doi.org/10.1111%2F1468-0262.00277). Wiswall, M. and B. Zafar (2018). "Preference for the Workplace, Investment in Human Capital, and Gender". In: _Quarterly Journal of Economics_ 133.1, pp. 457-507. DOI: [10.1093/qje/qjx035](https://doi.org/10.1093%2Fqje%2Fqjx035). Young, A. (2020). _Consistency without Inference: Instrumental Variables in Practical Application_. Working Paper. London School of Economics.