class: title-slide

<br><br><br>

# Lecture 6

## Preference Heterogeneity with Mixture Distributions

### Tyler Ransom

### ECON 6343, University of Oklahoma

---

# Attribution

Many of these slides are based on slides written by Peter Arcidiacono. I use them with his permission.

These slides also heavily follow Chapters 6 and 14 of Train (2009)

---

# Plan for the day

1. Preference Heterogeneity

2. Mixed Logit

3. Finite Mixture Models

4. The EM algorithm

---

# Preference Heterogeneity

- So far, we have only looked at models where all agents have identical preferences

- Mathematically, `\(\beta_{RedBus}\)` does not vary across agents

- Implies everyone has the same price elasticity, etc.

- But in real life, we know people have different values, interests, and preferences

- Failure to account for this heterogeneity will result in a misleading model

- e.g. lowering a product's price likely won't induce purchasing from some customers

---

# Observable preference heterogeneity

- One solution to the homogeneity problem is to add interaction terms

- Suppose we have a 2-option transportation model:

`\begin{align*}
u_{i,bus}&=\beta_1 X_i + \gamma Z_1\\
u_{i,car}&=\beta_2 X_i + \gamma Z_2
\end{align*}`

- We could introduce heterogeneity in `\(\gamma\)` by interacting `\(Z_j\)` with `\(X_i\)`:

`\begin{align*}
u_{i,bus}&=\beta_1 X_i + \widetilde{\gamma} Z_1 X_i\\
u_{i,car}&=\beta_2 X_i + \widetilde{\gamma} Z_2 X_i
\end{align*}`

- Now a change in `\(Z_j\)` will have a heterogeneous impact on utility depending on `\(X_i\)`

- e.g. those w/diff. income `\((X_i)\)` may be more/less sensitive to changes in price `\((Z_j)\)`

---

# Unobservable preference heterogeneity

- Observable preference heterogeneity can be useful

- But many dimensions of preferences are likely unobserved

- In this case, we need to "interact" `\(Z\)` with something unobserved

- One way to do this is to assume that `\(\beta\)` or `\(\gamma\)` varies across people

- Assume some distribution (e.g. Normal), called the .hi[mixing distribution]

- Then integrate this out of the likelihood function

---

# Mixed Logit likelihood function

- Assume, e.g. `\(\gamma_i \sim F\)` with pdf `\(f\)` and distributional parameters `\(\mu\)` and `\(\sigma\)`

- Then the logit choice probabilities become

.smaller[
`\begin{align*}
P_{ij}\left(X,Z;\beta,\mu,\sigma\right)&= \int\frac{\exp\left(X_{i}\left(\beta_{j}-\beta_{J}\right)+\gamma_i\left(Z_{ij}-Z_{iJ}\right)\right)}{\sum_k \exp\left(X_{i}\left(\beta_{k}-\beta_{J}\right)+\gamma_i\left(Z_{ik}-Z_{iJ}\right)\right)}f\left(\gamma_i;\mu,\sigma\right)d\gamma_i
\end{align*}`
]

- Note: this is just like the expected value of a function of a random variable `\(W\)`:

.smaller[
`\begin{align*}
\mathbb{E}[g(W)]&= \int g(W) f\left(W;\mu,\sigma\right)dW
\end{align*}`
]

- Annoyance: the log likelihood now has an integral inside the log!

.smaller[
`\begin{align*}
\ell\left(X,Z;\beta,\mu,\sigma\right)&=\sum_{i=1}^N \log\left\{\int\prod_{j}\left[\frac{\exp\left(X_{i}\left(\beta_{j}-\beta_{J}\right)+\gamma\left(Z_{ij}-Z_{iJ}\right)\right)}{\sum_k \exp\left(X_{i}\left(\beta_{k}-\beta_{J}\right)+\gamma\left(Z_{ik}-Z_{iJ}\right)\right)}\right]^{d_{ij}}f\left(\gamma;\mu,\sigma\right)d\gamma\right\}
\end{align*}`
]

---

# Common mixing distributions

- Normal

- Log-normal

- Uniform

- Triangular

- Can also go crazy and specify a multivariate normal

- This would allow, e.g. heterogeneity in `\(\gamma\)` to be correlated with `\(\beta\)`

---

# Mixed Logit estimation

- With the integral inside the log, estimation of the mixed logit is intensive

- To estimate the likelihood function, we need to numerically approximate the integral

- The most common way of doing this is .hi[quadrature]

- Another common way of doing this is by .hi[simulation] (Monte Carlo integration); see the sketch on the next slide

- I'll walk you through how to do this in this week's problem set
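---

# Sketch: simulating the choice probability

- A minimal sketch of the Monte Carlo approximation of `\(P_{ij}\)`, assuming `\(\gamma_i \sim N(\mu,\sigma^2)\)`; the function and array names (`X`, `Z`, `beta`) are placeholders, not a full estimation routine

```python
import numpy as np

def simulate_mixed_logit_probs(X, Z, beta, mu, sigma, R=1000, seed=1):
    """Approximate P_ij by averaging the logit formula over R draws of gamma.
    X: (N, K) covariates, Z: (N, J) alternative-specific vars, beta: (K, J).
    Utilities enter in levels; the softmax is unchanged by differencing off
    option J, so this matches the differenced formula on the earlier slide."""
    rng = np.random.default_rng(seed)
    N, J = Z.shape
    probs = np.zeros((N, J))
    for r in range(R):
        gamma = rng.normal(mu, sigma, size=(N, 1))  # one draw of gamma_i per person
        v = X @ beta + gamma * Z                    # (N, J) systematic utilities
        v -= v.max(axis=1, keepdims=True)           # guard against overflow
        expv = np.exp(v)
        probs += expv / expv.sum(axis=1, keepdims=True)
    return probs / R                                # simulated choice probabilities
```

- Plugging the simulated `\(\hat{P}_{ij}\)` into `\(\sum_i\sum_j d_{ij}\log \hat{P}_{ij}\)` gives the simulated log likelihood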
---

# Finite Mixture Distributions

- An alternative to the mixed logit is to assume that the mixing distribution is discrete

- We assume there is a missing variable that has finite support and is independent of the other variables

- Let `\(\pi_s\)` denote the probability of being in the `\(s\)`th unobserved group

- Integrating out over the unobserved groups then yields the following log likelihood:

`\begin{align*}
\ell\left(X,Z;\beta,\gamma,\pi\right)=&\sum_{i=1}^N \log\left\{\sum_{s}\pi_s\prod_{j}\left[\frac{\exp\left(X_{i}\left(\beta_{j}-\beta_{J}\right)+\gamma_{s}\left(Z_{ij}-Z_{iJ}\right)\right)}{\sum_k \exp\left(X_{i}\left(\beta_{k}-\beta_{J}\right)+\gamma_{s}\left(Z_{ik}-Z_{iJ}\right)\right)}\right]^{d_{ij}}\right\}
\end{align*}`

---

# Mixture Distributions and Panel Data

- With panel data, mixture dist. allows for .hi[permanent unobserved heterogeneity]

- Here the unobs. variable is fixed over time and indep. of the covariates at `\(t=1\)`

- The log likelihood function for the finite mixture case is then as follows (see the code sketch on the next slide):

`\begin{align*}
\ell\left(X,Z;\beta,\gamma,\pi\right)=&\sum_{i=1}^N \log\left\{\sum_{s}\pi_s\prod_{t}\prod_{j}\left[\frac{\exp\left(X_{it}\left(\beta_{j}-\beta_{J}\right)+\gamma_{s}\left(Z_{ijt}-Z_{iJt}\right)\right)}{\sum_k \exp\left(X_{it}\left(\beta_{k}-\beta_{J}\right)+\gamma_{s}\left(Z_{ikt}-Z_{iJt}\right)\right)}\right]^{d_{ijt}}\right\}
\end{align*}`

- And for the mixed logit case is:

.smaller[
`\begin{align*}
\ell\left(X,Z;\beta,\mu,\sigma\right)=&\sum_{i=1}^N \log\left\{\int\prod_{t}\prod_{j}\left[\frac{\exp\left(X_{it}\left(\beta_{j}-\beta_{J}\right)+\gamma\left(Z_{ijt}-Z_{iJt}\right)\right)}{\sum_k \exp\left(X_{it}\left(\beta_{k}-\beta_{J}\right)+\gamma\left(Z_{ikt}-Z_{iJt}\right)\right)}\right]^{d_{ijt}}f\left(\gamma;\mu,\sigma\right)d\gamma\right\}
\end{align*}`
]
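---

# Sketch: finite mixture log likelihood

- A minimal sketch of evaluating the panel finite-mixture log likelihood above; the arrays `X`, `Z`, `d` and the parameter shapes are hypothetical placeholders (`\(S\)` types, `\(J\)` options, `\(T\)` periods)

```python
import numpy as np

def finite_mixture_loglik(X, Z, d, beta, gamma, pi):
    """X: (N, T, K); Z: (N, T, J); d: (N, T, J) one-hot choices;
    beta: (K, J); gamma: (S,) type-specific coefficients; pi: (S,) type shares."""
    N, T, J = d.shape
    ll = 0.0
    for i in range(N):
        type_lik = np.zeros(len(pi))             # L_i(s): likelihood of i's history, by type
        for s, g in enumerate(gamma):
            v = X[i] @ beta + g * Z[i]           # (T, J) utilities with type-s coefficient
            v -= v.max(axis=1, keepdims=True)    # numerical stability
            p = np.exp(v) / np.exp(v).sum(axis=1, keepdims=True)
            type_lik[s] = np.prod(p[d[i] == 1])  # prod over t and j of P_ijt^{d_ijt}
        ll += np.log(pi @ type_lik)              # log of sum over s of pi_s * L_i(s)
    return ll
```

- In practice you would work in logs (a log-sum-exp over types) to avoid underflow when `\(T\)` is large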
---

# Dynamic Selection

- Often, we want to link the choices to other outcomes:

  - labor force participation and earnings

  - market entry and profits

- If individuals choose to participate in the labor market based upon unobserved wages, our estimates of the returns to participating will be biased

- Mixture distributions provide an alternative way of controlling for selection

- .hi[Assumption:] no selection problem once we control for the unobserved variable

---

# Dynamic Selection

- Let `\(Y_{2t}\)` denote the choice and `\(Y_{1t}\)` denote the outcome

- The assumption on the previous slide means the joint likelihood is separable:

`\begin{align*}
\mathcal{L}(Y_{1t},Y_{2t}|X_{1t},X_{2t},\alpha_1,\alpha_2,s)&=\mathcal{L}(Y_{1t}|Y_{2t},X_{1t},\alpha_1,s)\mathcal{L}(Y_{2t}|X_{2t},\alpha_2,s)\\
&=\mathcal{L}(Y_{1t}|X_{1t},\alpha_1,s)\mathcal{L}(Y_{2t}|X_{2t},\alpha_2,s)
\end{align*}`

where `\(s\)` is the unobserved type

---

# Estimation in Stages

- Suppose `\(s\)` was observed

- There'd be no selection problem as long as we could condition on `\(s\)` and `\(X_{1t}\)`

- The log likelihood function is:

`\begin{align*}
\ell=&\sum_{i}\sum_t \ell_1(Y_{1t}|X_{1t},\alpha_1,s)+\ell_2(Y_{2t}|X_{2t},\alpha_2,s)
\end{align*}`

- Estimation could proceed in stages:

  1. Estimate `\(\alpha_2\)` using only `\(\ell_2\)`

  2. Taking the estimate of `\(\alpha_2\)` as given, estimate `\(\alpha_1\)` using `\(\ell_1\)`

---

# Non-separable means no stages

- When `\(s\)` is unobserved, however, the log likelihood function is not additively separable:

`\begin{align*}
\ell=&\sum_i\log\left(\sum_s\pi_s\prod_t\mathcal{L}(Y_{1t}|X_{1t},\alpha_1,s)\mathcal{L}(Y_{2t}|X_{2t},\alpha_2,s)\right)
\end{align*}`

where `\(\mathcal{L}\)` is a likelihood function

- Makes sense: if there is a selection problem, we can't estimate one part of the problem without considering what is happening in the other part

---

# The EM Algorithm

- We can get additive separability of the finite mixture model with the .hi[EM algorithm]

- EM stands for "Expectation-Maximization"

- The algorithm iterates on two steps:

  - E-step: estimate the parameters of the mixing distribution (i.e. the `\(\pi\)`'s)

  - M-step: pretend you observe the unobserved variable and estimate the remaining parameters

- The EM algorithm is used in other applications to fill in missing data

- In this case, the missing data is the permanent unobserved heterogeneity

---

# The EM Algorithm (Continued)

- With the EM algorithm, the non-separable likelihood function

`\begin{align*}
\ell=&\sum_i\log\left(\sum_s\pi_s\prod_t\mathcal{L}(Y_{1t}|X_{1t},\alpha_1,s)\mathcal{L}(Y_{2t}|X_{2t},\alpha_2,s)\right)
\end{align*}`

can be written in a form that is separable:

`\begin{align*}
\ell=&\sum_i\sum_s q_{is}\sum_t\left[\ell_1\left(Y_{1t}|X_{1t},\alpha_1,s\right)+\ell_2\left(Y_{2t}|X_{2t},\alpha_2,s\right)\right]
\end{align*}`

where `\(q_{is}\)` is the probability that `\(i\)` belongs to group `\(s\)`

- `\(q_{is}\)` satisfies `\(\pi_s = \frac{1}{N}\sum_{i}q_{is}\)`

---

# Estimation in stages again

- We can now estimate the model in stages because of the restoration of separability

- The only twist is that we need to .hi[weight] by the `\(q\)`'s in each estimation stage

- Stage 1 of M-step: estimate `\(\ell_1(Y_{1t}|X_{1t},\alpha_1,s)\)` weighting by the `\(q\)`'s

- Stage 2 of M-step: estimate `\(\ell_2(Y_{2t}|X_{2t},\alpha_2,s)\)` weighting by the `\(q\)`'s

- E-step: update the `\(q\)`'s by calculating

`\begin{align*}
q_{is}=&\frac{\pi_s\prod_t\mathcal{L}(Y_{1t}|X_{1t},\alpha_1,s)\mathcal{L}(Y_{2t}|X_{2t},\alpha_2,s)}{\sum_m\pi_m\prod_t\mathcal{L}(Y_{1t}|X_{1t},\alpha_1,m)\mathcal{L}(Y_{2t}|X_{2t},\alpha_2,m)}
\end{align*}`

- Iterate on E and M steps until the `\(q\)`'s converge (Arcidiacono and Jones, 2003); one iteration is sketched on the next slide
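---

# Sketch: one EM iteration

- A minimal sketch of the E-step and weighted M-step for `\(S\)` discrete types; `loglik1`, `loglik2`, `wmle1`, and `wmle2` are hypothetical stand-ins for model-specific log likelihood and weighted-MLE routines

```python
import numpy as np

def em_iteration(alpha1, alpha2, pi, data, loglik1, loglik2, wmle1, wmle2):
    """One EM iteration for an S-type finite mixture.
    loglik1(alpha1, data) and loglik2(alpha2, data) each return an (N, S)
    array of log-likelihood contributions (summed over t) by type."""
    # E-step: posterior type probabilities q_is proportional to pi_s * L1_i(s) * L2_i(s)
    logL = loglik1(alpha1, data) + loglik2(alpha2, data) + np.log(pi)  # (N, S)
    logL -= logL.max(axis=1, keepdims=True)      # stabilize before exponentiating
    q = np.exp(logL)
    q /= q.sum(axis=1, keepdims=True)

    # Update the type shares: pi_s = (1/N) * sum_i q_is
    pi_new = q.mean(axis=0)

    # M-step: two separate maximum-likelihood problems, each weighted by the q's
    alpha1_new = wmle1(data, weights=q)          # outcome-equation parameters
    alpha2_new = wmle2(data, weights=q)          # choice-equation parameters
    return alpha1_new, alpha2_new, pi_new, q
```

- Iterating this map until the `\(q\)`'s (or parameters) stop changing implements the algorithm above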
---

# Other notes on estimation in stages

- With permanent unobserved heterogeneity, we no longer have .hi[global concavity]

- This means that if we provide different starting values, we'll get different estimates

- Another thing to note is .hi[standard errors]

- With stages, each stage introduces estimation error into the following stages

- i.e. we take the estimate as given, but it actually is subject to sampling error

- The easiest way to resolve this is with bootstrapping

- Both of these issues (local optima and estimation error) are problem-specific

- You need to understand your specific case

---

# Minorization-Maximization (MM) Algorithm

- See James (2017)

- Alternative to the EM algorithm for estimating mixed logit models

- Addresses a key limitation of EM: repeated numerical optimization

- Uses a .hi[globally concave] surrogate function with a closed-form solution

- Significantly faster than EM in many cases

- Allows for continuously distributed unobserved heterogeneity

---

# Advantages of MM Algorithm

- Simple implementation

- Low cost per iteration

- No need to store the entire simulated dataset

- Easily parallelizable

- Effective for models with many fixed coefficients or repeated observations

- Can be 5-8 times faster than EM and competitive with quasi-Newton methods

---

# To Recap

- Why are we doing all of this difficult work?

- Because preference heterogeneity allows for a more credible structural model

- e.g. Gillingham, Iskhakov, Munk-Nielsen, Rust, and Schjerning (2022)

- But introducing preference heterogeneity can make the model intractable

- Discretizing the distribution of heterogeneity and using the EM algorithm can help

- Or use the MM algorithm

- We also need to be mindful of how to compute standard errors of the estimates

- As well as be aware that the objective function is likely no longer globally concave

---

# References

.smaller[
Arcidiacono, P. and J. B. Jones (2003). "Finite Mixture Distributions, Sequential Likelihood and the EM Algorithm". In: _Econometrica_ 71.3, pp. 933-946. DOI: [10.1111/1468-0262.00431](https://doi.org/10.1111%2F1468-0262.00431).

Gillingham, K., F. Iskhakov, A. Munk-Nielsen, et al. (2022). "Equilibrium Trade in Automobiles". In: _Journal of Political Economy_. DOI: [10.1086/720463](https://doi.org/10.1086%2F720463).

James, J. (2017). "MM Algorithm for General Mixed Multinomial Logit Models". In: _Journal of Applied Econometrics_ 32.4, pp. 841-857. DOI: [10.1002/jae.2532](https://doi.org/10.1002%2Fjae.2532).

Train, K. (2009). _Discrete Choice Methods with Simulation_. 2nd ed. Cambridge; New York: Cambridge University Press. ISBN: 9780521766555.
]