Problem set 1 solutions
EC607
Due Monday, 14 April 2025 by 11:59p Pacific.
Part 0: Take a minute
0.1 (5 points) Take 15 minutes to think about (1) your research interests and (2) how you are doing. No screens. No paper. No nothing. Just sit somewhere (ideally not PLC) quietly and think.
To submit: Where and when did you complete this task?
Answer: Completion.
Part 1: Fun with functions
1.1 (15 points) Write a function that uses OLS to estimate the coefficients of a linear regression model.
The function should

- accept two arguments: `y` and `X` (both matrices)
- output the estimated coefficients as a matrix

Your function should only use matrix operations (e.g., `%*%`) and basic summary statistics (e.g., `sum()`, `nrow()`). Do not use more complex functions (like `lm()`).
Hint: Remember that you can create a column matrix of ones using `matrix(1, nrow = n, ncol = 1)` (where `n` is the number of rows you want). You can also use `cbind()` to combine matrices by columns or `rbind()` to combine them by rows. There's an example of `cbind()` below in 1.2.
Hint: `function()` allows you to write your own functions. For example, the code below defines a function called `so_fun` that takes two arguments, `a` and `b`, and returns their sum. You can call the function by typing `so_fun(1, 2)`.
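so_fun = function(a, b) {
  # Return the sum of the two arguments
  a + b
}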
Answer: Here’s my function…
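A sketch of one way to write it, consistent with how `ols_fun` is called later in the answers (it takes `y` and `X` as matrices, adds the intercept column itself, and applies \(b = (X'X)^{-1}X'y\)):

# Estimate OLS coefficients using only matrix operations
# Arguments: y (n-by-1 matrix) and X (n-by-k matrix of regressors, without an intercept column)
ols_fun = function(y, X) {
  # Add a column of ones for the intercept
  X = cbind(matrix(1, nrow = nrow(X), ncol = 1), X)
  # Apply the OLS formula: b = (X'X)^(-1) X'y
  b = solve(t(X) %*% X) %*% t(X) %*% y
  # Return the estimated coefficients as a matrix
  return(b)
}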
1.2 (5 points) Test your function using the following data, comparing it to the results of `lm()`.
# Define the parameters
beta = matrix(c(3, 3, 3), nrow = 3, ncol = 1)
# Define the X matrix
X = matrix(data = rnorm(200), nrow = 100, ncol = 2)
# Define the disturbance
u = rnorm(100)
# Create a column of ones
ones = matrix(1, nrow = 100, ncol = 1)
# Define the y matrix
y = cbind(ones, X) %*% beta + u
# Run lm
test_lm = lm(y ~ X)
summary(test_lm)
# TODO Now run your function
Answer: Filling in the blanks…
# Define the parameters
beta = matrix(c(3, 3, 3), nrow = 3, ncol = 1)
# Define the X matrix
X = matrix(data = rnorm(200), nrow = 100, ncol = 2)
# Define the disturbance
u = rnorm(100)
# Create a column of ones
ones = matrix(1, nrow = 100, ncol = 1)
# Define the y matrix
y = cbind(ones, X) %*% beta + u
# Run lm
test_lm = lm(y ~ X)
summary(test_lm)
Call:
lm(formula = y ~ X)
Residuals:
Min 1Q Median 3Q Max
-2.13168 -0.61821 0.06255 0.66475 2.21161
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.00061 0.09244 32.46 <2e-16 ***
X1 3.07456 0.08715 35.28 <2e-16 ***
X2 3.00563 0.09394 31.99 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.9212 on 97 degrees of freedom
Multiple R-squared: 0.9541, Adjusted R-squared: 0.9532
F-statistic: 1008 on 2 and 97 DF, p-value: < 2.2e-16
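# Now run my function on the same data
ols_fun(y = y, X = X)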
[,1]
[1,] 3.000610
[2,] 3.074559
[3,] 3.005635
Looks good!
Part 2: A simulation
Let’s push your function a little harder… in a simulation!
2.1 (10 points) When coding a simulation, it usually helps to start by coding up a single iteration (after defining the population DGP).
Start by drawing 100 observations from the following population model and then applying your function to estimate the coefficients.
\[ y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \beta_3 x_{3i} + \epsilon_i \]
where
- \(\beta_0 = 0\), \(\beta_1 = 1\), \(\beta_2 = 2\), and \(\beta_3 = 3\)
- \(x_{1i} \sim N(0, 1)\)
- \(x_{2i} \sim N(0, 1)\)
- \(x_{3i} = u_i + w_i\)
- \(\epsilon_i = v_i + w_i\)
- and \(w_i \sim N(0, 1)\), \(u_i \sim N(0, 1)\), and \(v_i \sim N(0, 1)\)
I’ll get you started.
# Load the `tidyverse` (using `pacman`)
library(pacman)
p_load(tidyverse)
# Set a seed
set.seed(12)
# Set the number of observations desired
n = 100
# Generate the population (starting with the "unobserved" disturbances)
sample_df = tibble(
id = 1:n,
u = rnorm(n),
v = rnorm(n),
w = rnorm(n),
epsilon = v + w,
x1 = rnorm(n),
x2 = rnorm(n),
x3 = u + w,
y = 0 + 1 * x1 + 2 * x2 + 3 * x3 + epsilon
)
# TODO Now run your function
Note: Using a `tibble` (from the `tidyverse`) here is helpful because it allows you to reference columns that you generated in the same step, e.g., `tibble(a = 1, b = a + 2)`. If you were using `data.frame`, you couldn't do it all in one step; you'd need to use multiple steps and/or something like `mutate()`.
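For example:

# Works: tibble() can use a column defined earlier in the same call
tibble(a = 1, b = a + 2)
# Errors: data.frame(a = 1, b = a + 2) fails because `a` is not yet defined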
Answer: Filling in the remaining lines…
# Load the `tidyverse` (using `pacman`)
library(pacman)
p_load(tidyverse, purrr, furrr, future)
# Set a seed
set.seed(12)
# Set the number of observations desired
n = 100
# Generate the population (starting with the "unobserved" disturbances)
sample_df = tibble(
id = 1:n,
u = rnorm(n),
v = rnorm(n),
w = rnorm(n),
epsilon = v + w,
x1 = rnorm(n),
x2 = rnorm(n),
x3 = u + w,
y = 0 + 1 * x1 + 2 * x2 + 3 * x3 + epsilon
)
# Now run our function
ols_fun(
X = sample_df |> select(x1, x2, x3) |> as.matrix(),
y = sample_df |> select(y) |> as.matrix()
)
y
0.008545703
x1 1.066598141
x2 1.859886876
x3 3.527533696
2.2 (5 points) Now it’s time for a simulation. For the simulation, you basically need to run the code you wrote in 2.1 above a large number of times (e.g., 10,000 times).
Here are some steps to get to the full simulation:
- Copy lines 6–20 above (plus the lines you added to run your regression function).
- Update those lines (which define a single iteration of the simulation) to define a function with a single argument `it` for the iteration number.
- At the end of this iteration function, output the estimated coefficients.
- Use a loop, map, or apply function to run that iteration function many times. You've got lots of options here (`for`, `lapply`, `map`, etc.). You might want to check out the `map` functions from the `purrr` package in the `tidyverse`.
- After the loop, combine the results into a single data frame (e.g., using `bind_rows()` from the `dplyr` package).
Here’s a start…
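# A sketch of a starting point: an iteration function to fill in
iter_fun = function(it) {
  # TODO Draw a new sample from the population (as in 2.1)
  # TODO Estimate the coefficients with your OLS function
  # TODO Return the estimates, tagged with the iteration number `it`
}
# TODO Run iter_fun() many times and combine the results into one data frame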
Now run the simulation with at least 1,000 iterations.
Answer: Simulation time!
# Set the number of iterations
n_iter = 1e4
# Set the seed
set.seed(12)
# Define the iteration function
iter_fun = function(it, n = 100) {
# Draw n observations from the population
sample_df = tibble(
id = 1:n,
u = rnorm(n),
v = rnorm(n),
w = rnorm(n),
epsilon = v + w,
x1 = rnorm(n),
x2 = rnorm(n),
x3 = u + w,
y = 0 + 1 * x1 + 2 * x2 + 3 * x3 + epsilon
)
# Now run our function
b = ols_fun(
X = sample_df |> select(x1, x2, x3) |> as.matrix(),
y = sample_df |> select(y) |> as.matrix()
)
# Format to a data frame
data.frame(
est = b |> as.vector(),
param = c('beta_0', 'beta_1', 'beta_2', 'beta_3'),
iteration = it
)
}
# Run the simulation n_iter times (in parallel)
plan(multisession, workers = 10)
sim_list = future_map(
1:n_iter,
iter_fun,
.options = furrr_options(seed = TRUE)
)
# Combine the results into a single data frame
sim_df = bind_rows(sim_list)
2.3 (10 points) Summarize the results of the simulation with (1) a numerical summary (the mean coefficient estimate for each of the four parameters) and (2) a histogram or density plot of the coefficient estimates (separated by parameter).
Hint: `ggplot2` is probably a good idea here. You can use `geom_histogram()` or `geom_density()` to plot the distributions of the coefficient estimates.
Answer: First, a simple numerical summary…
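# Mean and standard deviation of the estimates for each parameter
sim_df |>
  group_by(param) |>
  summarize(mean_est = mean(est), sd_est = sd(est))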
# A tibble: 4 × 3
param mean_est sd_est
<chr> <dbl> <dbl>
1 beta_0 -0.000762 0.122
2 beta_1 1.00 0.125
3 beta_2 2.00 0.124
4 beta_3 3.50 0.0880
Density estimates…
# Load ggplot2
pacman::p_load(ggplot2, viridis)
# Density plots
ggplot(sim_df, aes(x = est)) +
geom_density(
aes(fill = param, color = param),
alpha = 0.7
) +
geom_vline(
data = data.frame(
param = c('beta_0', 'beta_1', 'beta_2', 'beta_3'),
true_val = c(0, 1, 2, 3)
),
aes(xintercept = true_val, color = param),
linetype = 'dashed',
linewidth = .3
) +
geom_hline(yintercept = 0) +
labs(
x = 'Coefficient estimate',
y = 'Density'
) +
scale_fill_manual(
'',
values = magma(4, end = .9),
labels = c(expression(beta[0]), expression(beta[1]), expression(beta[2]), expression(beta[3]))
) +
scale_color_manual(
'',
values = magma(4, end = .9),
labels = c(expression(beta[0]), expression(beta[1]), expression(beta[2]), expression(beta[3]))
) +
theme_minimal(base_family = 'Fira Sans Condensed', base_size = 12) +
theme(legend.position = 'bottom')
2.4 (5 points) Now that you have the results of the simulation, compare them to the true values of the parameters. How well is OLS working for each of the parameters?
Answer: OLS is working well for \(\beta_0\), \(\beta_1\), and \(\beta_2\). The estimators' distributions are centered on the true values, and the estimates are also typically close to the truth.
OLS is not working well for \(\beta_3\); the estimator is clearly biased. Its distribution is centered on 3.5 (rather than 3), and none of the estimates got very close to 3.
2.5 (10 points) Why is OLS biased for \(\beta_3\)? Explain the intuition and support it with simple math. Bonus if you can show it with a graph too.
Answer: OLS is biased for \(\beta_3\) because \(x_3\) is correlated with the error term \(\epsilon\). This correlation arises because
- \(x_3 = u + w\),
- \(\epsilon = v + w\).
Thus \(x_3\) and the disturbance (\(\epsilon\)) are correlated through their dependence on \(w\)—violating exogeneity.
We can see this graphically by plotting the disturbance \(\epsilon\) against \(x_3\) (using the data from 2.1). Under exogeneity, \(\epsilon\) should be centered on 0 at any value of \(x_3\), which is clearly not the case:
# Plot
ggplot(data = sample_df, aes(x = x3, y = epsilon)) +
geom_hline(yintercept = 0) +
geom_point() +
geom_smooth(se = FALSE, color = magma(4)[3]) +
labs(
x = expression(x[3]),
y = expression(epsilon)
) +
theme_minimal(base_family = 'Fira Sans Condensed', base_size = 12)
2.6 (5 points) How biased is OLS for \(\beta_3\)? Answer with math (not with the data from the simulation). I’m fine with you using expectations or probability limits. Invoke assumptions if you need them; just make sure you are clear about the assumptions you make.
Answer: Recall that the probability limit of the OLS estimator can be written
\[ \text{plim} \; \hat{\beta}_3 = \beta_3 + \frac{\text{Cov}(\tilde{x}_3, \epsilon)}{\text{Var}(\tilde{x}_3)} \]

where \(\tilde{x}_3\) is the residual of \(x_3\) after regressing it on the other regressors. However, \(\tilde{x}_3 = x_3\) in our case, as \(x_3\) is independent of the other regressors. Thus, we can write

\[ \text{plim} \; \hat{\beta}_3 = \beta_3 + \frac{\text{Cov}(x_3, \epsilon)}{\text{Var}(x_3)} \]

Now consider each term in the bias separately.
The first term (the numerator) \[ \begin{align*} \text{Cov}(x_3, \epsilon) &= \text{Cov}(u + w, v + w) \\ &= \text{Cov}(u, v) + \text{Cov}(u, w) + \text{Cov}(w, v) + \text{Cov}(w, w) \\ &= 0 + 0 + 0 + \text{Var}(w) \\ &= \text{Var}(w) \\ &= 1 \end{align*} \]
The second term (the denominator) \[ \begin{align*} \text{Var}(x_3) &= \text{Var}(u + w) \\ &= \text{Var}(u) + \text{Var}(w) + 2\text{Cov}(u, w) \\ &= 1 + 1 + 0 \\ &= 2 \end{align*} \] Thus, we can write the probability limit of the OLS estimator as \[ \text{plim} \; \hat{\beta}_3 = \beta_3 + \frac{1}{2} = 3.5 \] This means that the OLS estimator is asymptotically biased (inconsistent) for \(\beta_3\) by 0.5 (matching our simulated results).