Lecture 10

.title[
# Lecture 10
]
.subtitle[
## Travel cost method
]
.author[
### Ivan Rudik
]
.date[
### AEM 4510
]

---

```r
if (!require("pacman")) install.packages("pacman")
pacman::p_load( 
  tidyverse, magrittr, xaringanExtra, rlang, patchwork, broom, viridis, fixest, maps, tigris, sf, vembedr
)
set.seed(1)
options(htmltools.dir.version = FALSE)
knitr::opts_hooks$set(fig.callout = function(options) {
  if (options$fig.callout) {
    options$echo <- FALSE
  }
  options
})
# Set chunk defaults once, outside the hook
knitr::opts_chunk$set(echo = TRUE, fig.align = "center")
red_pink <- "#e64173"
blue <- "#3C93DC"
red <- "#ff0000"
```


```r
red_pink <- "#e64173"
# A blank theme for ggplot
theme_empty <- theme_minimal() +
  theme(
    legend.position = "none",
    title = element_text(size = 24),
    axis.text.x = element_text(size = 24), axis.text.y = element_text(size = 24, color = "#ffffff"),
    axis.title.x = element_text(size = 24), axis.title.y = element_text(size = 24),
    panel.grid.minor.x = element_blank(), panel.grid.major.y = element_blank(),
    panel.grid.minor.y = element_blank(), panel.grid.major.x = element_blank(),
    panel.background = element_rect(fill = "#ffffff", colour = NA),
    plot.background = element_rect(fill = "#ffffff", colour = NA),
    axis.line = element_line(colour = "black"), axis.ticks = element_line(),
  )
theme_blank <- theme_minimal() +
  theme(
    legend.position = "none",
    title = element_text(size = 24),
    axis.text.x = element_blank(), axis.text.y = element_blank(),
    axis.title.x = element_blank(), axis.title.y = element_blank(),
    panel.grid.minor.x = element_blank(), panel.grid.major.y = element_blank(),
    panel.grid.minor.y = element_blank(), panel.grid.major.x = element_blank(),
    panel.background = element_rect(fill = "#ffffff", colour = NA),
    plot.background = element_rect(fill = "#ffffff", colour = NA),
    axis.line = element_blank(), axis.ticks = element_blank(),
  ) 
theme_regular <- 
  theme_minimal() +
  theme(
    legend.position = "none",
    title = element_text(size = 14),
    axis.text.x = element_text(size = 24), axis.text.y = element_text(size = 24),
    axis.title.x = element_text(size = 24), axis.title.y = element_text(size = 24),
    panel.grid.minor.x = element_blank(), panel.grid.minor.y = element_blank(),
    panel.grid.major.x = element_blank(), axis.ticks = element_line(),  axis.line = element_line(),
    panel.background = element_rect(fill = "#ffffff", colour = NA),
    plot.background = element_rect(fill = "#ffffff", colour = NA)
  ) 
```

```r
# initialize seed and data
set.seed(1000)
num_trips <- 100
trip_data <- expand.grid(
  house_num = seq(from = 1, to = num_trips),
  site = 1:26
)

# make fake dataset
trip_data <- trip_data %>%
  as_tibble() %>%
  mutate(
    trips = sample(0:8, num_trips * 26, replace = TRUE),
    income = exp(rnorm(num_trips * 26)) * 3000 + 50000 + rnorm(num_trips * 2) * 15000,
    income = ifelse(income < 10000, 10000, income),
    travel_cost = exp(rnorm(num_trips * 26)) * 30,
    travel_cost_other = exp(rnorm(num_trips * 26)) * 35,
    trips = round(-.01 * travel_cost + .01 * travel_cost_other + 3 / 100000 * income),
    water_clarity = runif(num_trips * 26)
  ) %>%
  group_by(site) %>%
  mutate(water_clarity = mean(water_clarity)) %>%
  ungroup() %>%
  mutate(trips = case_when(
    trips <= 0 ~ 0,
    TRUE ~ trips + 3 * round(water_clarity)
  ))
```

---

# Roadmap

- How do we estimate the value of recreational goods?

---

# Background

---

# Should we separate the Great Lakes and Mississippi?

Benefits from barriers accrue to anglers in the Great Lakes, both commercial and recreational

Costs come from building and maintaining the barriers, plus reduced shipping (if any), plus any other costs associated with the barriers

How do we figure out the benefits from recreational anglers?

---

# Why do we need travel cost?

Recreational areas .hi[have value]

Their quality also has value

Not placing a value on recreation is essentially giving it a value of .hi-purple[zero]

This is likely inappropriate

If someone dumped toxic waste in Taughannock, would that have zero cost?

---

# What is the travel cost method?

The travel cost method uses observable data on recreation visitation to infer the recreational value of environmental amenities

The central idea is that the time and travel expenses people incur to visit a site represent the .hi[price] of access to the site

This means that people's WTP to visit can be estimated based on the number of visits they make to sites of different prices

This gives us a demand curve for sites/amenities, so we can value changes in these environmental amenities

---

# Hotelling

After WWII, the U.S. National Park Service solicited advice from economists on methods for quantifying the value of specific park properties

Would the total entrance fees that people pay measure the value?

Harold Hotelling proposed the first indirect method for measuring the demand for a non-market good in 1947

---

# Hotelling

> Let concentric zones be defined around each park so that the cost of travel to the park from all points in one of these zones is approximately constant. The persons entering the park in a year, or a suitably chosen sample of them, are to be listed according to the zone from which they came. The fact that they come means that the service of the park is at least worth the cost, and this cost can probably be estimated with fair accuracy.

---

# Hotelling

> A comparison of the cost of coming from a zone with the number of people who do come from it, together with a count of the population of the zone, enables us to plot one point for each zone on a demand curve for the service of the park. By a judicious process of fitting, it should be possible to get a good enough approximation to this demand curve to provide, through integration, a measure of consumers’ surplus...

About twelve years later, Trice and Wood (1958) and Clawson (1959) independently implemented the methodology

---

# Theoretical foundation

How can we use observed data to tell us something about willingness to pay?

Consider a single consumer and a single recreation site

The consumer has:

- Total number of recreation trips: x, to site of quality: q
- Total budget of time: T
- Working time: H
- Non-recreation, non-work time: l
- Hourly wage: w
- Money cost of reaching the site: c
- Travel time to reach the site (per trip): t
- Expenditures on other market goods: z

---

# Theoretical foundation

This lets us write down the consumer's utility maximization problem:
`$$\max_{x,z,l} U(x,z,l,q) \,\,\,\, \text{subject to: }\,\,\, \underbrace{wH = cx + z}_{\text{money budget}}, \,\, \underbrace{T = H + l + tx}_{\text{time budget}}$$`
--

Multiply the time budget by `$w$` and substitute the money budget in:

`$$\max_{x,z,l} U(x,z,l,q) \,\,\,\, \text{subject to: }\,\,\, \underbrace{wT = (c+wt)x +z + wl}_{\text{combined money/time budget}}$$`

We now have a single constraint, expressed in terms of the dollar value of time
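
Spelling out that substitution step by step, using only the symbols already defined on this slide:

`$$wT = wH + wl + wtx = (cx + z) + wl + wtx = (c + wt)x + z + wl$$`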

---

# Theoretical foundation

`$$\max_{x,z,l} U(x,z,l,q) \,\,\,\, \text{subject to: }\,\,\, \underbrace{wT = (c+wt)x + z + wl}_{\text{combined money/time budget}}$$`
`$wT$` is the consumer's .hi[full income], their money value of total time budget

`$c+wt$` is the consumer's .hi[full price], their total cost to reach the site

`$z$` is their consumption of other goods

`$wl$` is the opportunity cost of non-recreation leisure time

---

# Theoretical foundation

Let `$Y = wT$` be the consumer's .hi[full income], their money value of total time budget

Let `$p = c+wt$` be the consumer's .hi[full price], their total cost to reach the site

Then we can write the problem as:

--
`$$\max_{x,z,l} U(x,z,l,q) \,\,\,\, \text{subject to: }\,\,\, \underbrace{Y = z + px + wl}_{\text{combined budget}}$$`

Solve the constraint for `$z$` and substitute into the utility function...

---

# Theoretical foundation

`$$\max_{x,l} U\left(x,Y-px-wl,l,q\right)$$`

The consumer chooses trips `$x$` and leisure `$l$`; the budget constraint then pins down the money left over for other goods `$z$`

This has first-order conditions:
`$$[x] \,\,\, U_x - pU_z = 0 \rightarrow \frac{U_x}{U_z} = p$$`

and
`$$[l] \,\,\, -wU_z + U_l = 0 \rightarrow \frac{U_l}{U_z} = w$$`

---

# Theoretical foundation

`$\frac{U_x}{U_z} = p$` tells us the consumer sets the marginal rate of substitution between recreational trips and consumption equal to the full price of a recreational trip

What does this mean?

.hi[The value of the marginal recreational trip to the consumer, in dollar terms, is revealed by the full price p]

---

# Theoretical foundation

`$$U_x - pU_z = 0 \qquad -wU_z + U_l = 0$$`

The FOCs above are two equations in two unknowns, the consumer's two choices (x,l)

If we know the functional form of `$U$` we can use the FOCs to solve for x (and l) as a function of the parameters (p,Y,q):

`$$x = f(p,Y,q)$$`

This is simply the consumer's .hi-blue[demand curve] for recreation as a function of the full price p, full budget Y, and quality q
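
As an illustrative sketch only (the log utility below is an assumed functional form, not one taken from the lecture), suppose

`$$U = \alpha(q)\ln x + \beta \ln z + \gamma \ln l, \qquad \alpha'(q) > 0$$`

The two FOCs then become

`$$\frac{U_x}{U_z} = \frac{\alpha(q)\, z}{\beta\, x} = p \qquad \frac{U_l}{U_z} = \frac{\gamma\, z}{\beta\, l} = w$$`

so `$z = \beta p x / \alpha(q)$` and `$wl = \gamma p x / \alpha(q)$`. Substituting both into the budget `$Y = z + px + wl$` gives

`$$x = \frac{\alpha(q)}{\alpha(q) + \beta + \gamma}\,\frac{Y}{p}$$`

Under this assumed form, trips fall with the full price p, rise with full income Y, and rise with quality q.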

---

# Theoretical foundation

`$$x = f(p,Y,q)$$`

If we observe consumers going to sites of different full prices `$p_1,p_2,\dots,p_n$`, we are moving up and down their recreation demand curve

This lets us trace out the demand curve

Changing Y or q shifts the demand curve in or out: these are income and quasi-price effects

Once we have it, we can compute surplus!

---

# Zonal (single-site) model

Here's the most basic travel cost model to start

- Construct distance zones (i) as concentric circles emanating from the recreation site
  - Travel costs from all points within each zone to the site are sufficiently close in magnitude to justify neglecting the differences
--

- From a sample of visitors `$(v_i)$` at the recreation site, determine zones of origin and their populations `$(n_i)$`
--

- Calculate the per capita visitation rates for each zone of origin `$(t_i = (v_i/n_i))$`

---

# Zonal (single-site) model

- Construct a travel cost measure `$(tc_i)$` that reflects the round-trip costs of travel from the zone of origin to the recreation site (time and gas), plus an entry fee `$(fee)$` which may be zero and does not vary across zones

- Collect relevant socioeconomic data `$(s_i)$` such as income and education for each distance zone

- Use statistical methods to estimate the trip demand curve: the relationship between per-capita visitation rates, cost per visit, [and travel costs to other sites `$(tc_{si})$`] controlling for socioeconomic differences

- `$t_i = g(tc_i + fee; tc_{si}, s_i) + \varepsilon_i$` where `$g$` can be linear

---

# Zonal (single-site) model

Here's a simple example of a set of zones 1-5:
<img src="11-slides-travel-cost_files/figure-html/sample2 scatter-1.svg" style="display: block; margin: auto;" />

---

# Zonal (single-site) model

Suppose we have the following data:

```
## # A tibble: 5 × 5
##   zone   dist   pop  cost   vpp
##   <chr> <dbl> <dbl> <dbl> <dbl>
## 1 A         2 10000    20    15
## 2 B        30 10000    30    13
## 3 C        90 20000    65     6
## 4 D       140 10000    80     3
## 5 E       150 10000    90     1
```

If we plot cost against visits per person, we have a measure of the demand curve...

---

# Zonal (single-site) model

.pull-left[
This is a very simple example in which the points happen to lie exactly on a straight line; real data will rarely be this clean

The line is simply: `$$t_i = \beta_0 + \beta_1 tc_i$$`

where `$\beta_0$` is the intercept and `$\beta_1$` is the slope
]

.pull-right[
<img src="11-slides-travel-cost_files/figure-html/zonal example-1.svg" style="display: block; margin: auto;" />
]
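
A minimal sketch of fitting that line in R, using the five zones from the table on the earlier slide. The variable names (`zones`, `zonal_fit`, `choke_price`) are introduced here for illustration, and base `lm()` is an assumed choice rather than the code behind the figure.

```r
# Zone-level data copied from the table on the earlier slide
zones <- data.frame(
  zone = c("A", "B", "C", "D", "E"),
  pop  = c(10000, 10000, 20000, 10000, 10000),
  cost = c(20, 30, 65, 80, 90),   # travel cost per visit
  vpp  = c(15, 13, 6, 3, 1)       # visits per person
)

# Fit t_i = beta_0 + beta_1 * tc_i by OLS
zonal_fit <- lm(vpp ~ cost, data = zones)
beta <- unname(coef(zonal_fit))
beta_0 <- beta[1]   # intercept: visits per person at zero cost
beta_1 <- beta[2]   # slope: change in visits per person per unit of cost

# Choke price: the cost at which predicted visits fall to zero
choke_price <- -beta_0 / beta_1

print(round(c(beta_0 = beta_0, beta_1 = beta_1, choke_price = choke_price), 3))
```

Because these points are exactly collinear, the fit recovers an intercept of 19 and a slope of -0.2, so predicted visits reach zero at a cost of 95.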

---

# Zonal (single-site) model

.pull-left[
This simple line ignores things like income, other sites, and other household characteristics

For now, we'd continue by fitting a line through the points (OLS/regression)
]

.pull-right[
<img src="11-slides-travel-cost_files/figure-html/zonal example 2-1.svg" style="display: block; margin: auto;" />
]

---

# Zonal (single-site) model

Based on the estimated model coefficients, construct the (inverse) demand curve

Entry fee on the y-axis (price), and the number of predicted total visits on the x-axis (quantity)

The demand curve differs across zones because their socioeconomic variables differ

The (use) value of the park/site to each zone is given by the area underneath the corresponding demand curve
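
The area calculation can be sketched as follows for the five-zone example. This is a rough illustration under assumptions: demand is the straight line through the table's points (intercept 19, slope -0.2), there are no socioeconomic controls, and the helper `visits_at_fee()` is a hypothetical name introduced here.

```r
# Zone data from the earlier table and the line through its five points
zones <- data.frame(
  zone = c("A", "B", "C", "D", "E"),
  pop  = c(10000, 10000, 20000, 10000, 10000),
  cost = c(20, 30, 65, 80, 90)
)
beta_0 <- 19
beta_1 <- -0.2

# Predicted total visits from one zone at a hypothetical entry fee
# (visits per person cannot go below zero)
visits_at_fee <- function(fee, cost, pop) {
  pop * pmax(0, beta_0 + beta_1 * (cost + fee))
}

# Use value of a zone = area under its demand curve from fee = 0 upward.
# With linear demand this is a triangle: base = visits at a zero fee,
# height = the fee that chokes off all visits from that zone.
zones$visits_now <- visits_at_fee(0, zones$cost, zones$pop)
zones$choke_fee  <- -beta_0 / beta_1 - zones$cost
zones$use_value  <- 0.5 * zones$visits_now * zones$choke_fee

# Cross-check zone A's triangle against numerical integration
check_A <- integrate(visits_at_fee, lower = 0, upper = zones$choke_fee[1],
                     cost = zones$cost[1], pop = zones$pop[1])$value

print(zones)
total_use_value <- sum(zones$use_value)
```

Zones closer to the site have both more visits and a higher choke fee, so they account for most of the total use value.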

---

# Issues with the single-site model

What are some potential issues and concerns with this approach?

It ignores non-use value (automatically zero for non-users)

What are the right zones to choose?

What is the right functional form for demand?

How do we measure the opportunity cost of time?

How do we treat multi-purpose trips?

How do we value particular site attributes? Can't disentangle them at a single site

---

# Multi-site model

To value particular site attributes we need to have multiple sites (with different attributes!)

We can answer questions like:

What is the benefit of a fish restocking program?
  - Need to know the value of fish catch rate for visitors

What is the benefit of water clarity?

What is the benefit of tree replanting?

---

# Multi-site model

Suppose we have a dataset with a large number of individuals and sites

Individuals are given by `$i=1,\dots,N$` and sites are given by `$j=1,\dots,J$`

We observe the number of times each individual visited each site

The multi-site model works as follows

---

# Multi-site model

.hi[Step 1:] Do the single-site estimation for each site:
`$$t_{ij} = \beta_{0j} + \beta_{1j} tc_{ij} + \beta_{2j} tc_{sij} + \beta_{3j} s_i + \varepsilon_{ij}$$`

.hi[Step 2:] Recover all the `$\beta$`s from each step 1 regression so that we have a set of J `$\beta_{0j}$`s for `$j=1,\dots,J$`, `$\beta_{1j}$`s for `$j=1,\dots,J$`, etc.

These `$\beta$`s tell us the slope `$(\beta_{1j})$` and intercept `$(\beta_{0j}, \beta_{2j}, \beta_{3j})$`

`$\beta_{2j}, \beta_{3j}$` capture how the cost of substitute sites and household characteristics matter: they shift demand up and down

---

# Multi-site model

.hi[Step 3:] Take each set of `$J$` coefficient estimates and use them as the dependent variable in a regression on site attributes `$z$`:
`$$\hat{\beta}_{0j} = \alpha_{00} + \alpha_{01}z_j + \epsilon_{0j}$$`
`$$\hat{\beta}_{1j} = \alpha_{10} + \alpha_{11}z_j + \epsilon_{1j}$$`
`$$\hat{\beta}_{2j} = \alpha_{20} + \alpha_{21}z_j + \epsilon_{2j}$$`
`$$\hat{\beta}_{3j} = \alpha_{30} + \alpha_{31}z_j + \epsilon_{3j}$$`

The `$\alpha_{k1}$` coefficients tell us how the demand curve shifts `$(\alpha_{01}, \alpha_{21}, \alpha_{31})$` or rotates `$(\alpha_{11})$` as we change site attribute `$z$`
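
The three steps can be sketched in base R as below. Everything here is a simulated stand-in: the data frame `demo`, its column names, and the assumed "true" demand parameters are hypothetical, and `lm()` is used for simplicity rather than any particular estimation package.

```r
set.seed(42)

# Hypothetical stand-in data: 200 households x 12 sites (all numbers simulated)
n_house <- 200
n_site  <- 12
clarity <- runif(n_site)                      # one attribute z_j per site
demo <- expand.grid(house = 1:n_house, site = 1:n_site)
demo$clarity           <- clarity[demo$site]
demo$travel_cost       <- exp(rnorm(nrow(demo))) * 30
demo$travel_cost_other <- exp(rnorm(nrow(demo))) * 35
demo$income            <- 50000 + rnorm(nrow(demo)) * 15000

# Assumed true demand: clearer water raises the intercept by 3 trips
demo$trips <- 2 + 3 * demo$clarity -
  0.01 * demo$travel_cost + 0.01 * demo$travel_cost_other +
  3e-5 * demo$income + rnorm(nrow(demo), sd = 0.5)

# Step 1: one single-site regression per site
first_stage <- lapply(split(demo, demo$site), function(d) {
  coef(lm(trips ~ travel_cost + travel_cost_other + income, data = d))
})

# Step 2: stack the J coefficient vectors into a J x 4 table
betas <- as.data.frame(do.call(rbind, first_stage))
names(betas) <- c("intercept", "own_price", "cross_price", "income")
betas$clarity <- clarity

# Step 3: regress each first-stage coefficient on the site attribute
# and keep the slope on clarity (the alpha_k1 terms)
second_stage <- sapply(
  c("intercept", "own_price", "cross_price", "income"),
  function(k) unname(coef(lm(betas[[k]] ~ betas$clarity))[2])
)
print(round(second_stage, 4))
```

In this simulation only the intercept depends on clarity, so the second-stage slope for `intercept` should land near 3 while the other three should be close to zero.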

---

# Valuing attributes with a multi-site model

.pull-left[
If we improve the quality of a site from z<sub>1</sub> to z<sub>2</sub>, demand for that site shifts up

The gain in CS, holding the cost fixed, is given by the blue area

Once we estimate demand curves, we can see how welfare changes when we alter quality characteristics!

]

.pull-right[
<img src="11-slides-travel-cost_files/figure-html/multi site example-1.svg" style="display: block; margin: auto;" />
]
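
For a linear demand curve the shaded gain has a closed form, sketched below. The function `cs_linear()` and all the numbers are hypothetical, chosen only to make the arithmetic easy to follow; the sketch also assumes the quality change shifts the intercept and leaves the slope unchanged.

```r
# Consumer surplus under linear demand x = a + b * p (with b < 0) at price p:
# the triangle between the demand curve and the price line
cs_linear <- function(a, b, p) {
  x <- pmax(0, a + b * p)   # trips demanded at price p
  x^2 / (-2 * b)
}

# Hypothetical numbers: a quality improvement shifts the intercept from 4 to 5
b <- -0.02        # slope, assumed unchanged by the quality change
p <- 50           # full price of a trip, held fixed
cs_before <- cs_linear(a = 4, b = b, p = p)
cs_after  <- cs_linear(a = 5, b = b, p = p)
cs_gain   <- cs_after - cs_before

print(c(before = cs_before, after = cs_after, gain = cs_gain))
```

With these numbers, trips rise from 3 to 4 and consumer surplus rises from 225 to 400, a gain of 175 per person.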

---

# Multi-site example

```r
trip_data
```

```
## # A tibble: 2,600 × 7
##    house_num  site trips income travel_cost travel_cost_other water_clarity
##        <int> <int> <dbl>  <dbl>       <dbl>             <dbl>         <dbl>
##  1         1     1     4 40450.       38.9              16.4          0.506
##  2         2     1     5 60304.       29.8              37.5          0.506
##  3         3     1     5 66681.       42.2              67.2          0.506
##  4         4     1     5 52886.       11.0              51.3          0.506
##  5         5     1     5 69282.       15.7               7.72         0.506
##  6         6     1     5 36948.        4.30             48.0          0.506
##  7         7     1     6 60866.        5.31             91.0          0.506
##  8         8     1     5 35557.       65.0             161.           0.506
##  9         9     1     5 64880.       14.5              24.3          0.506
## 10        10     1     4 38491.       13.6              26.5          0.506
## # … with 2,590 more rows
```

---

# First stage estimation

```
## # A tibble: 26 × 5
##    intercept own_price cross_price    income  site
##        <dbl>     <dbl>       <dbl>     <dbl> <dbl>
##  1     2.99   -0.0161      0.0106  0.0000321     1
##  2     2.45   -0.0117      0.0101  0.0000397     2
##  3     2.37   -0.0197      0.0111  0.0000450     3
##  4     2.33   -0.0187      0.0119  0.0000438     4
##  5     2.05   -0.0143      0.0139  0.0000450     5
##  6    -0.236  -0.00668     0.00972 0.0000321     6
##  7     2.67   -0.0210      0.0118  0.0000395     7
##  8    -0.346  -0.00395     0.00987 0.0000324     8
##  9     2.98   -0.0133      0.0107  0.0000315     9
## 10    -0.103  -0.00943     0.0105  0.0000302    10
## # … with 16 more rows
```

---

# Second stage

```
## Joining, by = "site"
```

```
## # A tibble: 4 × 3
##   term           estimate coeff      
##   <chr>             <dbl> <chr>      
## 1 water_clarity 48.0      intercept  
## 2 water_clarity -0.171    own_price  
## 3 water_clarity  0.0241   cross_price
## 4 water_clarity  0.000165 income
```

The `estimate` column tells us how a change in water clarity (from 0 to 100%) shifts or rotates our demand curve

Clearer water `$\rightarrow$` more demand, more responsive to price, attracts higher-income people more

---

# Real world data: Central Park

Standard travel cost method is costly

Need to survey households

This takes time and money

What alternatives do we have?

---

# Mobility data from cell phones

We can use cell phone mobility data to implement the travel cost method

These are the same data used by the NYT, WaPo, etc. for COVID analyses of restaurants and other locations

Here we will be looking at visits to Central Park

---

# Mobility data from cell phones

```
## # A tibble: 22,972 × 13
##    visitor_cbgs  year month location_name               latitude longitude scaled_visits visits travel_distance_km travel_time_minutes visits_per_person median_age median…¹
##           <dbl> <dbl> <dbl> <chr>                          <dbl>     <dbl>         <dbl>  <dbl>              <dbl>               <dbl>             <dbl>      <dbl>    <dbl>
##  1 340030032003  2018     8 Harlem Meer                     40.8     -74.0          34.8      4               22.9                29.3            0.0286       44.9       NA
##  2 340030032003  2018     8 Harlem Meer                     40.8     -74.0          69.5      8               22.9                29.3            0.0571       44.9       NA
##  3 340030032003  2018     8 Harlem Meer                     40.8     -74.0          34.8      4               22.9                29.3            0.0286       44.9       NA
##  4 340030034011  2018    11 Diana Ross Playground           40.8     -74.0          59.8      5               25.3                29.4            0.0510       39.9   107353
##  5 340030034011  2019     8 Diana Ross Playground           40.8     -74.0          46        4               25.3                29.4            0.0392       39.9   107353
##  6 340030034011  2019    11 Central Park                    40.8     -74.0          92.9      8               25.3                29.4            0.0792       39.9   107353
##  7 340030034023  2018     9 East 72nd Street Playground     40.8     -74.0         257.      16               22.4                25.9            0.143        31.5   138068
##  8 340030035002  2018     3 East 72nd Street Playground     40.8     -74.0         184.      20               22.2                28.4            0.0873       35.1    42500
##  9 340030035002  2019     5 Cherry Hill Fountain            40.8     -74.0          38.4      4               22.2                28.4            0.0183       35.1    42500
## 10 340030040022  2018     1 Central Park                    40.8     -74.0         110.       8               21.1                23.6            0.0748       37.1    83257
## # … with 22,962 more rows, and abbreviated variable name ¹median_income
```

---

# Real world data: Central Park

The data tell us, for each .hi[census block group (CBG)] (areas of roughly 600-3,000 people):
  - Visits per month to a particular location in Central Park by all cell phones in the CBG
  - How far the CBG is from the Central Park location (time and distance)
  - The median income of the CBG
  - The median age of the CBG

---

# Visits by where people live

---

# Travel cost estimation with cell data

We don't have the exact cost households incur to reach Central Park, but travel distance and travel time are good proxies

Regression: `$\log(visits) = \beta_0 + \beta_1 \log(travel\_distance\_km)$`

```
## NOTE: 237 observations removed because of infinite values (RHS: 237).
```

```r
central_park_demand
```

```
## # A tibble: 2 × 2
##   term                    estimate
##   <chr>                      <dbl>
## 1 (Intercept)               2.10  
## 2 log(travel_distance_km)  -0.0593
```

What do the estimates mean?

---

# Visualizing the relationship

The slope is the elasticity of visits with respect to distance (-0.0593)

A 1 percent increase in distance is associated with a 0.0593 percent decrease in visits
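
To see what an elasticity of this size implies for larger changes in distance, here is a small sketch. The helper `visit_ratio()` is a hypothetical name; the only input taken from the slides is the estimated slope.

```r
# Slope from the log-log regression reported on the earlier slide
elasticity <- -0.0593

# In a log-log model, scaling distance by a factor k scales predicted
# visits by k^elasticity
visit_ratio <- function(k, e = elasticity) k^e

pct_change_1pct   <- 100 * (visit_ratio(1.01) - 1)  # distance up 1 percent
pct_change_double <- 100 * (visit_ratio(2) - 1)     # distance doubles

print(round(c(one_percent = pct_change_1pct, doubling = pct_change_double), 3))
```

Even doubling the distance lowers predicted visits by only about 4 percent, which shows how flat this estimated relationship is.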

---

# The elasticity and omitted variables

Other things probably affect both how far someone lives from Central Park and how often they visit it

Ideas?

New regression controlling for these factors:

`$$\log(visits) = \beta_0 + \beta_1 \log(travel\_distance\_km) + \\ \beta_2 \log(median\_income) + \beta_3 \log(median\_age)$$`

---

# The elasticity and omitted variables

```
## NOTE: 2,036 observations removed because of NA and infinite values (RHS: 2,036).
```

```
## # A tibble: 4 × 2
##   term                    estimate
##   <chr>                      <dbl>
## 1 (Intercept)               0.578 
*## 2 log(travel_distance_km)  -0.0252 versus -0.0593
## 3 log(median_income)        0.0858
## 4 log(median_age)           0.134
```

The elasticity fell by more than half!

Why?

---

# The elasticity and omitted variables

Richer people go to Central Park more than poorer people

Older people go to Central Park more than younger people

Where do richer older people tend to live?

---

# The elasticity and omitted variables

`$$\log(travel\_distance\_km) = \beta_0 + \beta_1 \log(median\_income)$$`

```
## # A tibble: 2 × 5
##   term               estimate std.error statistic p.value
##   <chr>                 <dbl>     <dbl>     <dbl>   <dbl>
## 1 (Intercept)           7.65    0.0942       81.2       0
## 2 log(median_income)   -0.520   0.00831     -62.6       0
```

`$$\log(travel\_distance\_km) = \beta_0 + \beta_1 \log(median\_age)$$`

```
## # A tibble: 2 × 5
##   term            estimate std.error statistic p.value
##   <chr>              <dbl>     <dbl>     <dbl>   <dbl>
## 1 (Intercept)         5.95    0.0913      65.1       0
## 2 log(median_age)    -1.15    0.0250     -46.2       0
```

Richer and older people live closer to Central Park

---

# The elasticity and omitted variables

Why does this matter?

Rich people can afford to live in Manhattan and they also like parks a lot

Ignoring this makes it seem like the average person visits a lot less if they live farther away

But part of that gap just reflects that poorer households tend to live in the outer boroughs of New York and likely cannot afford as many trips as richer households
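
The mechanism can be illustrated with simulated data. This is a sketch under assumed parameters, not the Central Park data: every coefficient in the data-generating process below is made up, and only income is omitted, for simplicity.

```r
set.seed(7)
n <- 5000

# Hypothetical data-generating process (all numbers are made up):
# richer areas are closer to the park AND visit more, holding distance fixed
log_income   <- rnorm(n, mean = 11, sd = 0.5)
log_distance <- 8 - 0.5 * log_income + rnorm(n, sd = 0.5)
log_visits   <- 1 - 0.03 * log_distance + 0.10 * log_income + rnorm(n, sd = 0.3)

short_fit <- lm(log_visits ~ log_distance)               # omits income
long_fit  <- lm(log_visits ~ log_distance + log_income)  # controls for income
aux_fit   <- lm(log_income ~ log_distance)               # income on distance

b_short <- unname(coef(short_fit)["log_distance"])
b_long  <- unname(coef(long_fit)["log_distance"])
g_inc   <- unname(coef(long_fit)["log_income"])
delta   <- unname(coef(aux_fit)["log_distance"])

# Omitted-variable-bias identity: short = long + (income effect) x (aux slope)
bias <- g_inc * delta
print(round(c(short = b_short, long = b_long, bias = bias), 4))
```

Because income raises visits and falls with distance, the bias term is negative, so the regression that omits income overstates how strongly distance reduces visits.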