class: center, middle, inverse, title-slide

.title[
# Lectures 25–26
]
.subtitle[
## Structural Models of Utility Maximization
]
.author[
### Tyler Ransom
]
.date[
### ECON 5253, University of Oklahoma
]

---
# Today's plan

1. Describe static discrete choice models
2. Discuss how they fit in with the other data science models we've covered in this class
3. Derive logit/probit probabilities from intermediate microeconomic theory
4. Go through examples of how to estimate these models
5. Show how discrete choice models relate to sample selection bias

Note: These slides are based on the introductory lecture of a PhD course taught at Duke University by Peter Arcidiacono, and are used with permission. That course is based on Kenneth Train's book *Discrete Choice Methods with Simulation*, which is freely available [here](https://eml.berkeley.edu/books/train1201.pdf) (PDF)

---
# What are discrete choice models?

- Discrete choice models are one of the workhorses of structural economics
- Deeply tied to economic theory:
    - utility maximization
    - revealed preference
- Used to model "utility" (broadly defined), for example:
    - consumer product purchase decisions
    - firm market entry decisions
    - investment decisions

---
# Why use discrete choice models?

- Provides a link between human optimization behavior and economic theory
- Parameters of these models map directly to economic theory
- Parameter values can quantify the effects of a particular policy
- Can be used to form counterfactual predictions (e.g. by adjusting certain parameter values)
- Allows a researcher to quantify "tastes"

---
# Why **not** use discrete choice models?

- They're not the best predictive models
    - Trade-off between out-of-sample prediction and counterfactual prediction
- You don't want to form counterfactual predictions; you just want to be able to predict handwritten digits
- You aren't interested in economic theory
- The math really scares you
- You don't like (explicitly) making assumptions
    - e.g. that decision-makers are rational

---
# Example of a discrete choice model

- Cities in the Bay Area are interested in how the introduction of rideshare services will impact ridership on Bay Area Rapid Transit (BART)
- Questions that cities need to know the answers to:
    - Is rideshare a substitute for public transit or a complement?
    - How inelastic is demand for BART? Should fares be `\(\uparrow\)` or `\(\downarrow\)`?
    - Should BART services be scaled up to compete with rideshares?
    - Will the influx of rideshare vehicles increase traffic congestion / pollution?
- Each of these questions requires making a counterfactual prediction
- In particular, we need a way to make such a prediction clearly and confidently

---
# Properties of discrete choice models

1. Agents choose from among a .hi[finite] set of alternatives (called the *choice set*)
2. Alternatives in the choice set are .hi[mutually exclusive]
3. The choice set is .hi[exhaustive]

---
# Example illustrating these properties

- In San Francisco, people can commute to work by the following (and *only* the following) methods:
    - Drive a personal vehicle (incl. motorcycle)
    - Carpool in a personal vehicle
    - Use a taxi/rideshare service (incl. Uber, Lyft, UberPool, LyftLine, etc.)
    - BART (bus, train, or both)
    - Bicycle
    - Walk

---
# Mathematically representing utility

Let `\(d_i\)` indicate the choice individual (or decision-maker) `\(i\)` makes, where `\(d_i\in\{1,\cdots, J\}\)`

Individuals choose `\(d\)` to maximize their utility, `\(U\)`

`\(U\)` is generally written as:

`\begin{align*}
U_{ij}&=u_{ij}+\varepsilon_{ij}
\end{align*}`

where:

1. `\(u_{ij}\)` relates observed factors to the utility individual `\(i\)` receives from choosing option `\(j\)`
2. `\(\varepsilon_{ij}\)` are unobserved to the researcher but observed to the individual
3. `\(d_{ij}=1\)` if `\(u_{ij}+\varepsilon_{ij}>u_{ij'}+\varepsilon_{ij'}\)` for all `\(j'\neq j\)`
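---
# Simulating the choice process (a sketch)

To make the notation above concrete, here is a minimal simulation sketch (not part of the original example): pick some values for the `\(u_{ij}\)`'s, draw Type 1 extreme value `\(\varepsilon_{ij}\)`'s, and let each simulated person choose the option with the highest total utility. The utility values below are made-up numbers.

```r
set.seed(5253)
N <- 100000                      # simulated decision-makers
u <- c(1.0, 0.5, 0.0)            # assumed observable utilities for J = 3 options
J <- length(u)

# Type 1 extreme value (Gumbel) draws via the inverse-CDF transform
eps <- matrix(-log(-log(runif(N*J))), N, J)

# each person picks the option whose total utility u_j + eps_j is largest
U <- sweep(eps, 2, u, "+")
d <- max.col(U, ties.method = "first")

prop.table(table(d))             # simulated choice shares
exp(u)/sum(exp(u))               # shares implied by the logit formula derived later
```

Options with higher `\(u_{ij}\)` are chosen more often, but never with certainty; this is the "probabilistic choice" idea coming up on the next slides.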
---
# Breakdown of the assumptions

- Examples of what's in `\(\varepsilon\)`:
    - Person's mental state when making the decision
    - Choices of friends or relatives (maybe, depending on the data)
    - `\(\vdots\)`
    - Anything else about the person that is not in our data
- Is it reasonable to assume additive separability?
    - This is a big assumption: that there are no interactive effects between unobservable and observable factors
    - This results in linear separation regions and may be too restrictive
    - For now, go with it, and remember that there are no free lunches

---
# Probabilistic choice

With the `\(\varepsilon\)`'s unobserved, we must consider choices as probabilistic instead of certain

The probability that `\(i\)` chooses alternative `\(j\)` is:

`\begin{align*}
P_{ij}&=\Pr(u_{ij}+\varepsilon_{ij}>u_{ij'}+\varepsilon_{ij'} \,\,\forall\,\, j'\neq j)\\
&=\Pr(\varepsilon_{ij'}-\varepsilon_{ij}<u_{ij}-u_{ij'} \,\,\forall\,\, j'\neq j)\\
&=\int_{\varepsilon}I(\varepsilon_{ij'}-\varepsilon_{ij}<u_{ij}-u_{ij'} \,\,\forall\,\, j'\neq j)f(\varepsilon)d\varepsilon
\end{align*}`

---
# Transformations of utility

Note that, regardless of what distributional assumptions are made on the `\(\varepsilon\)`'s, the probability of choosing a particular option does not change when we:

1. Add a constant to the utility of all options (utility is relative to one of the options; only differences in utility matter)
2. Multiply by a positive number (we need to scale something, generally the variance of the `\(\varepsilon\)`'s)

This is just like in consumer choice theory: utility is ordinal, and so is invariant to the above two transformations

---
# Variables

Suppose we have:

`\begin{align*}
u_{i1}&=\alpha Male_i+\beta_1 X_i + \gamma Z_1\\
u_{i2}&=\alpha Male_i+\beta_2 X_i+\gamma Z_2
\end{align*}`

Since only differences in utility matter:

`\begin{align*}
u_{i1}-u_{i2}&=(\beta_1-\beta_2)X_i+\gamma (Z_1-Z_2)
\end{align*}`

- Thus, we cannot tell whether men are happier than women, but we can tell whether men have a preference for a particular option over another
- We can only obtain .hi[differenced] coefficient estimates on the `\(X\)`'s, and we can obtain an estimate of a coefficient that is constant across choices only if the variable it multiplies varies by choice

---
# Number of error terms

Similar to socio-demographic characteristics, there are restrictions on the number of error terms

Recall that the probability `\(i\)` will choose `\(j\)` is given by:

`\begin{align*}
P_{ij}&=\Pr(u_{ij}+\varepsilon_{ij}>u_{ij'}+\varepsilon_{ij'} \,\,\forall\,\, j'\neq j)\\
&=\Pr(\varepsilon_{ij'}-\varepsilon_{ij}<u_{ij}-u_{ij'} \,\,\forall\,\, j'\neq j)\\
&=\int_{\varepsilon}I(\varepsilon_{ij'}-\varepsilon_{ij}<u_{ij}-u_{ij'} \,\,\forall\,\, j'\neq j)f(\varepsilon)d\varepsilon
\end{align*}`

where the integral is `\(J\)`-dimensional

---
# Number of error terms (cont'd)

But we can rewrite the last line as a `\((J-1)\)`-dimensional integral over the differenced `\(\varepsilon\)`'s:

`\begin{align*}
P_{ij}&=\int_{\tilde{\varepsilon}}I(\tilde{\varepsilon}_{ij'}<\tilde{u}_{ij'} \,\,\forall\,\, j'\neq j)g(\tilde{\varepsilon})d\tilde{\varepsilon}
\end{align*}`

Note that this means one dimension of `\(f(\varepsilon)\)` is not identified and must therefore be normalized
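---
# Simulation check: only differences matter

Continuing the simulation sketch from a few slides back, here is a quick check (again with made-up utility values) of the two invariance results: adding the same constant to every option's utility, or rescaling total utility by a positive number, leaves every simulated choice unchanged.

```r
set.seed(5253)
N   <- 100000
u   <- c(1.0, 0.5, 0.0)                       # made-up observable utilities
eps <- matrix(-log(-log(runif(N*3))), N, 3)   # Type 1 extreme value draws

choice <- function(total) max.col(total, ties.method = "first")

d0 <- choice(sweep(eps, 2, u, "+"))           # baseline choices
d1 <- choice(sweep(eps, 2, u + 5, "+"))       # add the same constant to every option
d2 <- choice(2*sweep(eps, 2, u, "+"))         # multiply total utility by a positive number

identical(d0, d1)   # TRUE: level shifts don't change choices
identical(d0, d2)   # TRUE: positive rescaling doesn't change choices
```

This is why the location (one option's utility) and the scale of the `\(\varepsilon\)`'s have to be normalized before estimation.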
---
# Derivation of Logit Probability

Consider the case when the choice set is `\(\{1,2\}\)`. The Type 1 extreme value cdf for `\(\varepsilon_2\)` is:

`\begin{align*}
F(\varepsilon_2)=e^{-e^{-\varepsilon_2}}
\end{align*}`

To get the probability of choosing `\(1\)`, substitute in for `\(\varepsilon_2\)` with `\(\varepsilon_1+u_1-u_2\)`:

`\begin{align*}
\Pr(d_1=1|\varepsilon_1)=e^{-e^{-(\varepsilon_1+u_1-u_2)}}
\end{align*}`

But `\(\varepsilon_1\)` is unobserved so we need to integrate it out (see the Appendix to these slides if you want the math steps)

---
# Derivation of Logit Probability

In the end, we can show that, for any model with two choice alternatives where `\(\varepsilon\)` is drawn from the Type 1 extreme value distribution,

`\begin{align*}
P_{i1}=\frac{\exp(u_{i1}-u_{i2})}{1+\exp(u_{i1}-u_{i2})},\quad P_{i2}=\frac{1}{1+\exp(u_{i1}-u_{i2})}
\end{align*}`

Suppose we have a data set with `\(N\)` observations. The log likelihood function we maximize is then:

`\begin{align*}
\ell(\beta,\gamma)=\sum_{i=1}^N(d_{i1}=1)(u_{i1}-u_{i2})-\ln\left(1+\exp(u_{i1}-u_{i2})\right)
\end{align*}`

---
# Derivation of Probit Probability

In the probit model, we assume that `\(\varepsilon\)` is Normally distributed. So for a binary choice we have:

`\begin{align*}
P_{i1}=\Phi\left(u_{i1}-u_{i2}\right),\quad P_{i2}=1-\Phi\left(u_{i1}-u_{i2}\right)
\end{align*}`

where `\(\Phi\left(\cdot\right)\)` is the standard normal cdf

The log likelihood function we maximize is then:

`\begin{align*}
\ell(\beta,\gamma)=\sum_{i=1}^N(d_{i1}=1)\ln\left(\Phi\left(u_{i1}-u_{i2}\right)\right)+(d_{i2}=1)\ln\left(1-\Phi\left(u_{i1}-u_{i2}\right)\right)
\end{align*}`

---
# Pros & Cons of Logit & Probit

Logit model:

- Has a much simpler objective function
- Is by far the most popular
- ... but has more restrictive assumptions about how people substitute across choices
    - (this is known as the Independence of Irrelevant Alternatives, or IIA, assumption)

Probit model:

- Much more difficult to estimate
- ... but can accommodate more realistic choice patterns

---
# Estimation in `R`

The R function `glm` is the easiest way to estimate a binomial logit or probit model:

```r
library(tidyverse)
library(magrittr)
library(mlogit)
data(Heating) # load data on residential heating choice in CA
levels(Heating$depvar) <- c("gas","gas","elec","elec","elec")
estim <- glm(depvar ~ as.factor(income)+agehed+rooms+region,
             family=binomial(link='logit'), data=Heating)
estim %>% summary %>% print
```

---
# Interpreting the coefficients

Estimated coefficients using the code in the previous slide:

```r
Coefficients:
                     Estimate Std. Error z value Pr(>|z|)  
(Intercept)        -1.031346   0.430835  -2.394   0.0167 *
as.factor(income)3  0.379461   0.301553   1.258   0.2083  
as.factor(income)4 -0.087949   0.314739  -0.279   0.7799  
as.factor(income)5  0.287393   0.291833   0.985   0.3247  
as.factor(income)6  0.197383   0.299092   0.660   0.5093  
as.factor(income)7  0.251675   0.293767   0.857   0.3916  
agehed             -0.011238   0.005807  -1.935   0.0530 .
rooms               0.042742   0.046687   0.915   0.3599  
regionscostl       -0.089907   0.217331  -0.414   0.6791  
regionmountn        0.057200   0.289125   0.198   0.8432  
regionncostl       -0.386541   0.239918  -1.611   0.1071  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```
---
# Interpreting the coefficients

- Positive coefficients `\(\Rightarrow\)` household more likely to choose the non-baseline alternative (in this case: electric)
    - Whatever the first `level` of the factor dependent variable is will be the "baseline" alternative
- Negative coefficients imply the reverse
- Coefficients are .hi[not] linked to changes in the probability of choosing the alternative (since the probability is a nonlinear function of `\(X\)`)

---
# Forming predictions

To get predicted probabilities for each observation in the data:

```r
Heating %<>% mutate(predLogit = predict(estim, newdata = Heating, type = "response"))
Heating %>% `$`(predLogit) %>% summary %>% print
```

---
# Estimating a probit model

For the probit model, we repeat the same code, except we change the "link" function from "logit" to "probit"

```r
estim2 <- glm(depvar ~ as.factor(income)+agehed+rooms+region,
              family=binomial(link='probit'), data=Heating)
estim2 %>% summary %>% print
Heating %<>% mutate(predProbit = predict(estim2, newdata = Heating, type = "response"))
Heating %>% `$`(predProbit) %>% summary %>% print
```

---
# A simple counterfactual simulation

- We talked a lot about doing counterfactual comparisons, but how do we *actually* do it?
- Let's show how to do this on a previous example. Suppose that we introduce a policy that makes richer people more likely to use electric heating.
- Mathematically, what does this look like?
    - It would correspond to an increase in the parameter in front of *income* in our regression

---
# A simple counterfactual simulation

- Suppose the coefficient increased by a factor of 4 for the three highest income categories. What would be the new share of gas vs. electricity usage?

```r
estim$coefficients["as.factor(income)5"] <- 4*estim$coefficients["as.factor(income)5"]
estim$coefficients["as.factor(income)6"] <- 4*estim$coefficients["as.factor(income)6"]
estim$coefficients["as.factor(income)7"] <- 4*estim$coefficients["as.factor(income)7"]
Heating %<>% mutate(predLogitCfl = predict(estim, newdata = Heating, type = "response"))
Heating %>% `$`(predLogitCfl) %>% summary %>% print
```

This policy would increase electric usage by 8 percentage points (from 22\% to 30\%)

---
# Discrete choice models and sample selection bias

- Discrete choice models are common tools used to evaluate sample selection bias
- Why? Because variables that are MNAR can be thought of as following a utility-maximizing process
- Examples:
    - Suppose you want to know what the returns to schooling are, but you only observe wages for those who currently hold jobs
    - As a result, your estimate of the returns to schooling might be invalidated by the non-randomness of the sample of people who are currently working
- How to get around this? Use a discrete choice model (This was the problem we ran into in PS7, if you recall)

---
# Heckman selection correction

The Heckman selection model specifies two equations:

`\begin{align*}
u_{i}&= \beta x_{i} + \nu_{i} \\
y_{i} &= \gamma z_{i} + \varepsilon_{i}
\end{align*}`

- The first equation is a utility maximization problem, determining whether the person is in the labor force. Can think of `\(\nu_{i}\)` as "desire to work"
    - `\(x_{i}\)` may include: number of children in the household
- The second equation is the log wage equation, where `\(y_i\)` is only observed for people who are in the labor force
- To solve the model, one needs to use the so-called "Heckit" model, which involves adding a correction term to the wage equation that accounts for the fact that workers are not randomly selected
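---
# The correction term by hand (a sketch)

Before turning to the packaged version on the next slide, here is a by-hand sketch of the two-step ("Heckit") correction just described. It uses the `Mroz87` data from the `sampleSelection` package, with the same (illustrative) variable choices as the next slide.

```r
library(sampleSelection)   # only needed here for the Mroz87 data
data('Mroz87')
Mroz87$kids <- (Mroz87$kids5 + Mroz87$kids618) > 0

# Step 1: probit for labor force participation (the selection equation)
step1 <- glm(lfp ~ age + I(age^2) + faminc + kids + educ,
             family = binomial(link = 'probit'), data = Mroz87)

# Step 2: form the inverse Mills ratio from the probit index and add it to the
# wage equation (estimated on workers only) as the selection-correction term
xb         <- predict(step1, type = "link")
Mroz87$imr <- dnorm(xb)/pnorm(xb)
step2      <- lm(wage ~ exper + imr, data = Mroz87, subset = lfp == 1)
summary(step2)
```

The coefficient on `imr` plays the role of the `invMillsRatio` term that `sampleSelection` reports on the slides that follow.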
---
# Estimating Heckman selection in `R`

`R` has a package called `sampleSelection` which incorporates the Heckman selection model

.footnote[
[1] This code is taken from Garrett Glasgow's [website](http://www.polsci.ucsb.edu/faculty/glasgow/ps207/ps207_class6.r)
]

```r
library(modelsummary)
library(sampleSelection)
data('Mroz87')
Mroz87$kids <- (Mroz87$kids5 + Mroz87$kids618) > 0

# Comparison of linear regression and selection model
outcome1 <- lm(wage ~ exper, data = Mroz87)
summary(outcome1)
selection1 <- selection(selection = lfp ~ age + I(age^2) + faminc + kids + educ,
                        outcome = wage ~ exper, data = Mroz87, method = '2step')
summary(selection1)
```

---
# Estimation output

Output from a regression of wage on experience:

```r
Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 0.991989   0.065674  15.105  < 2e-16 ***
exper       0.015201   0.004287   3.546 0.000434 ***
---
```

Output from the Heckman selection model: (edited for length)

.scroll-box-8[
```r
Probit selection equation:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -4.157e+00  1.402e+00  -2.965 0.003126 ** 
kidsTRUE    -4.490e-01  1.309e-01  -3.430 0.000638 ***
Outcome equation:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 1.775794   0.181695   9.773  < 2e-16 ***
exper       0.016899   0.004478   3.774 0.000174 ***
Multiple R-Squared:0.1257, Adjusted R-Squared:0.1216
Error terms:
              Estimate Std. Error t value Pr(>|t|)    
invMillsRatio  -1.2270     0.2504  -4.901 1.17e-06 ***
sigma           1.1480         NA      NA       NA    
rho            -1.0688         NA      NA       NA    
```
]

---
# Reading the output

- Because there are two equations, there are now more parameters
- Using just the regression on workers led us to believe the returns to experience were `\(\approx 1.5\%\)`
- Taking into account the selectivity of labor force participants leads us to conclude that the returns to experience are slightly higher, `\(\approx 1.7\%\)`
- Viability of the model depends on the assumption that's made: in this case, that having children only affects labor supply preferences and doesn't affect wages
    - Wage discrimination against mothers in the labor market would invalidate this
- Back to the idea that to get causal inference we have to impose more assumptions

---
# The optimal stopping problem

- Much of life is concerned with knowing when to stop:
    - How many people to date before making/accepting a marriage proposal
    - How much to study for upcoming exams
    - How long to "hodl" an asset
- All of the above cases involve forming expectations about:
    1. The long-run value of making a particular choice
    2. ... relative to the long-run value of alternatives
- Expectations about the future imply that we need to think "dynamically" (i.e. think over the long term)
- Today we'll go through the math on how to do this

---
# Relation to reinforcement learning

- Reinforcement learning is closely related to the optimal stopping problem
- At each state `\(X\)` (e.g. game board configuration), observe reward `\(y\)` (e.g. win probability)
- In each period (i.e. gameplay turn), choose the decision that maximizes the (present value of) expected reward
- With structural models, the "reward" is utility

---
# Dynamic discrete choice models

With .hi[dynamic] models, we need a way to quantify the present value of utility

Individual `\(i\)`'s .hi[flow utility] for option `\(j\)` at time `\(t\)` is:

`\begin{align*}
U_{ijt}&=u_{ijt}+\varepsilon_{ijt}\\
&=X_{it}\alpha_j+\varepsilon_{ijt}
\end{align*}`

---
# Dynamic discrete choice models

Individual chooses `\(d_{it}\)` to maximize .hi[expected lifetime utility]

`\begin{align*}
\max_{d_{it}} V &= \mathbb{E}\left\{\sum_{\tau=t}^T\sum_j\beta^{\tau-t}(d_{it}=j)U_{ijt}\right\}
\end{align*}`

where

- `\(V\)` is the .hi[value function]
- `\(\beta\in\left(0,1\right)\)` is the .hi[discount factor]
- `\(T\)` is the .hi[time horizon]

---
# Expectations

- Expectations are taken over future states `\(X\)` .hi[and] errors `\(\varepsilon\)`
- `\(\varepsilon\)`'s are iid over time
- Future states are not affected by `\(\varepsilon\)`'s except through current and past choices:

`\begin{align*}
\mathbb{E}\left(X_{t+1}|d_t,...,d_1,\varepsilon_t,...,\varepsilon_{1}\right)&=\mathbb{E}\left(X_{t+1}|d_t,...,d_1\right)
\end{align*}`

---
# Human behavior vs. reinforcement learning

- In reinforcement learning, we don't have `\(\varepsilon\)`, unless we allow for "curiosity"
- Transitions in `\(X\)` are a much more dominant factor (e.g. if I move here, opponent will move there, ...)
- Real-life example of uncertainty in the `\(\varepsilon\)`'s:
    - "My significant other might take a job in another city next year, so if I want to move with him/her, I may not want to take this job offer today."
- Real-life example of uncertainty in the `\(X\)`'s:
    - "I might get laid off next year, which will influence my ability to pay off my car loan, so I might not want to buy this Mercedes today, since my (expected) permanent income might be lower than my current income."

---
# Dynamic programming & the Bellman equation

- The value function `\(V\)` defines the optimization problem
- It's helpful to write the value function as a recursive expression, where we separate out today's decision from all future decisions (this is called the *Bellman equation*, or the *dynamic programming problem*)
- The payoff from choosing `\(j\)` today is the *flow utility* `\(= u_{ijt}\)` above
- The payoff from choosing alternative `\(j\)` in the future is the expected future utility conditional on choosing `\(j\)` today

How do we solve the Bellman equation?

- It requires solving backwards, just like in a dynamic game (cf. subgame perfect Nash equilibrium)

---
# Two Period Example

Consider the utility of choice `\(j\)` in the last period:

`\begin{align*}
U_{ijT}&=u_{ijT}+\varepsilon_{ijT}\\
&=X_{iT}\alpha_j+\varepsilon_{ijT}
\end{align*}`

Define the .hi[conditional valuation function] for choice `\(j\)` as the flow utility of `\(j\)` minus the associated `\(\varepsilon\)` plus the expected value of future utility conditional on `\(j\)`:

`\begin{align*}
v_{ijT-1}&=u_{ijT-1}+\beta \mathbb{E}\max_{k\in J}\left\{u_{ikT}+\varepsilon_{ikT}|d_{iT-1}=j\right\}
\end{align*}`

where `\(\beta\)` is the discount factor

Suppose `\(X_{iT}\)` is deterministic given `\(X_{iT-1}\)` and `\(d_{T-1}\)`, and the `\(\varepsilon\)`'s are Type 1 extreme value. What would the `\(\mathbb{E}\max\)` expression be?

`\(\left[\ln\sum_{k}\exp\left(u_{ikT}\right)\right]\)` (plus Euler's constant `\(\gamma\approx 0.577\)`; the estimation code later includes this via `-digamma(1)`)
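---
# Checking the `\(\mathbb{E}\max\)` formula

Here is a quick simulation sketch (with made-up utility values) confirming the closed form: under Type 1 extreme value errors, the average of `\(\max_k\{u_{k}+\varepsilon_{k}\}\)` equals the log-sum-exp of the `\(u_k\)`'s plus Euler's constant.

```r
set.seed(5253)
u   <- c(0.4, 1.2, -0.3)                       # made-up flow utilities
R   <- 100000
eps <- matrix(-log(-log(runif(R*length(u)))), R, length(u))

mean(apply(sweep(eps, 2, u, "+"), 1, max))     # simulated E[max_k {u_k + eps_k}]
log(sum(exp(u))) - digamma(1)                  # log-sum-exp plus Euler's constant
```

The same `log(...) - digamma(1)` expression shows up in the estimation code a few slides ahead.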
---
# Two Period Example (cont'd)

For `\(J=2\)` the log likelihood would then look like:

`\begin{align*}
\ell(\alpha)=\sum_{i=1}^N\sum_{t=1}^T(d_{i1t}=1)(v_{i1t}-v_{i2t})-\ln\left(1+\exp(v_{i1t}-v_{i2t})\right)
\end{align*}`

where

`\begin{align*}
v_{ijt}=u_{ijt}+\beta \mathbb{E}\max_{k\in J}\left\{v_{ikt+1}+\varepsilon_{ikt+1}|d_{it}=j\right\}
\end{align*}`

and where

`\begin{align*}
u_{ijt}=X_{it}\alpha_{j}
\end{align*}`

Note: if `\(T=2\)` then `\(v_{ikt+1} = u_{ikT}\)`

---
# Estimating a dynamic discrete choice model in `R`

- Because we have to loop backwards through time, we can't simply use `glm()`
- Requires us to write a custom likelihood function
    - This is because the flow utility parameters ($\alpha_j$) appear in the flow utility function in .hi[each] period
- Side note: We don't typically estimate the discount factor ($\beta$) but instead assume a fixed value (most common: 0.90 or 0.95)
- To do this, write down an objective function (i.e. log likelihood function) and use `nloptr` to estimate the `\(\alpha\)`'s

---
# Counterfactuals

- Once you have the `\(\alpha\)`'s you can do counterfactual simulations
- These simulations are likely to be more realistic because the model has incorporated forward-looking behavior

---
# Objective function and estimation

.scroll-box-18[
```r
library(nloptr)

objfun <- function(alpha, Choice, age) {
  J <- 2
  a <- alpha[3]*(1 - diag(J))   # switching term: enters utility only when the current choice differs from last period's

  # flow utilities in each period (alternative 1 is the normalized baseline)
  u1 <- matrix(0, N, T)
  u2 <- matrix(0, N, T)
  for (t in 1:T) {
    u1[ ,t] <- 0*age[ ,t]
    u2[ ,t] <- alpha[1] + alpha[2]*age[ ,t]
  }

  # future value array (N, T, beta, and LY -- the lagged choice -- are assumed to exist in the environment)
  fv <- array(0, dim = c(N, J, T+1))

  Like <- 0
  # loop backwards through time, building future values and likelihood contributions
  for (t in T:1) {
    for (j in 1:J) {
      # Generate FV
      dem <- exp(u1[ ,t] + a[1,j] + fv[ ,1,t+1]) + exp(u2[ ,t] + a[2,j] + fv[ ,2,t+1])
      fv[ ,j,t] <- beta*(log(dem) - digamma(1))
      p1 <- exp(u1[ ,t] + a[1,j] + fv[ ,1,t+1])/dem
      p2 <- exp(u2[ ,t] + a[2,j] + fv[ ,2,t+1])/dem
      Like <- Like - (LY[ ,t]==j)*((Choice[ ,t]==1)*log(p1) + (Choice[ ,t]==2)*log(p2))
    }
  }
  return( sum(Like) )
}

## initial values
theta0 <- runif(3) # start at uniform random numbers equal to number of coefficients

## Algorithm parameters
options <- list("algorithm"="NLOPT_LN_NELDERMEAD","xtol_rel"=1.0e-6,"maxeval"=1e4)

## Optimize!
result <- nloptr(x0=theta0, eval_f=objfun, opts=options, Choice=Choice, age=age)
print(result)
```
]

---
# Derivation of Logit Probability

`\begin{align*}
Pr(d_1=1)&=\int_{-\infty}^{\infty}\left(e^{-e^{-(\varepsilon_1+u_1-u_2)}}\right)f(\varepsilon_1)d\varepsilon_1\\
&=\int_{-\infty}^{\infty}\left(e^{-e^{-(\varepsilon_1+u_1-u_2)}}\right)e^{-\varepsilon_1}e^{-e^{-\varepsilon_1}}d\varepsilon_1\\
&=\int_{-\infty}^{\infty}\exp\left(-e^{-\varepsilon_1}-e^{-(\varepsilon_1+u_1-u_2)}\right)e^{-\varepsilon_1}d\varepsilon_1\\
&=\int_{-\infty}^{\infty}\exp\left(-e^{-\varepsilon_1}\left[1+e^{u_2-u_1}\right]\right)e^{-\varepsilon_1}d\varepsilon_1
\end{align*}`

---
# Derivation of Logit Probability

Now we need to apply the substitution rule, where `\(t=\exp(-\varepsilon_1)\)` and `\(dt=-\exp(-\varepsilon_1)d\varepsilon_1\)`

Note that we need to apply the same transformation to the bounds of integration as we do to `\(\varepsilon_1\)` to get `\(t\)`.
Namely, as `\(\varepsilon_1\rightarrow\infty\)`, `\(t=\exp(-\varepsilon_1)\rightarrow 0\)`, and as `\(\varepsilon_1\rightarrow-\infty\)`, `\(t\rightarrow\infty\)`

---
# Derivation of Logit Probability

Substituting in then yields:

`\begin{align*}
Pr(d_1=1)&=\int_{\infty}^0\exp\left(-t\left[1+e^{(u_2-u_1)}\right]\right)(-dt)\\
&=\int_0^{\infty}\exp\left(-t\left[1+e^{(u_2-u_1)}\right]\right)dt\\
&=\begin{array}{c|}\frac{\exp\left(-t\left[1+e^{(u_2-u_1)}\right]\right)}{-\left[1+e^{(u_2-u_1)}\right]}\end{array}^{\infty}_{0}\\
&=0-\frac{1}{-\left[1+e^{(u_2-u_1)}\right]}=\frac{\exp(u_1)}{\exp(u_1)+\exp(u_2)}
\end{align*}`
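---
# Appendix: numerical check

As a sanity check on the derivation above, this sketch evaluates the integral from the first appendix slide numerically and compares it to the closed-form logit probability (the values of `\(u_1\)` and `\(u_2\)` are made up).

```r
u1 <- 0.7
u2 <- -0.2

# integrand: Pr(d1 = 1 | eps1) times the Type 1 extreme value density of eps1
integrand <- function(e1) exp(-exp(-(e1 + u1 - u2))) * exp(-e1) * exp(-exp(-e1))

integrate(integrand, lower = -Inf, upper = Inf)$value   # numerical integral
exp(u1)/(exp(u1) + exp(u2))                             # closed-form logit probability
```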