Problem set 1 solutions
EC607
Due Monday, 14 April 2025 by 11:59p Pacific.
Part 0: Take a minute
0.1 (5 points) Take 15 minutes to think about (1) your research interests and (2) how you are doing. No screens. No paper. No nothing. Just sit somewhere (ideally not PLC) quietly and think.
To submit: Where and when did you complete this task?
Answer: Completion.
Part 1: Fun with functions
1.1 (15 points) Write a function that uses OLS to estimate the coefficients of a linear regression model.
The function should

- accept two arguments: `y` and `X` (both matrices)
- output the estimated coefficients as a matrix

Your function should only use matrix operations (e.g., `%*%`) and basic summary statistics (e.g., `sum()`, `nrow()`). Do not use more complex functions (like `lm()`).
Hint: Remember that you can create a column matrix of ones using `matrix(1, nrow = n, ncol = 1)` (where `n` is the number of rows you want). You can also use `cbind()` to combine matrices by columns or `rbind()` to combine them by rows. There's an example of `cbind()` below in 1.2.
Hint: `function()` allows you to write your own functions. For example, the code below defines a function called `so_fun` that takes two arguments, `a` and `b`, and returns their sum. You can call the function by typing `so_fun(1, 2)`.
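so_fun = function(a, b) {
  # Return the sum of the two arguments
  a + b
}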
Answer: Here’s my function…
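A sketch of one way to write it, consistent with how `ols_fun` is called later in the answers (it takes `y` and `X` as matrices, adds the intercept column itself, and applies \(b = (X'X)^{-1}X'y\)):

# Estimate OLS coefficients using only matrix operations
# Arguments: y (n-by-1 matrix) and X (n-by-k matrix of regressors, without an intercept column)
ols_fun = function(y, X) {
  # Add a column of ones for the intercept
  X = cbind(matrix(1, nrow = nrow(X), ncol = 1), X)
  # Apply the OLS formula: b = (X'X)^(-1) X'y
  b = solve(t(X) %*% X) %*% t(X) %*% y
  # Return the estimated coefficients as a matrix
  return(b)
}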
1.2 (5 points) Test your function using the following data, comparing it to the results of `lm()`.
# Define the parameters
beta = matrix(c(3, 3, 3), nrow = 3, ncol = 1)
# Define the X matrix
X = matrix(data = rnorm(200), nrow = 100, ncol = 2)
# Define the disturbance
u = rnorm(100)
# Create a column of ones
ones = matrix(1, nrow = 100, ncol = 1)
# Define the y matrix
y = cbind(ones, X) %*% beta + u
# Run lm
test_lm = lm(y ~ X)
summary(test_lm)
# TODO Now run your function
Answer: Filling in the blanks…
# Define the parameters
beta = matrix(c(3, 3, 3), nrow = 3, ncol = 1)
# Define the X matrix
X = matrix(data = rnorm(200), nrow = 100, ncol = 2)
# Define the disturbance
u = rnorm(100)
# Create a column of ones
ones = matrix(1, nrow = 100, ncol = 1)
# Define the y matrix
y = cbind(ones, X) %*% beta + u
# Run lm
test_lm = lm(y ~ X)
summary(test_lm)
Call:
lm(formula = y ~ X)
Residuals:
Min 1Q Median 3Q Max
-2.13168 -0.61821 0.06255 0.66475 2.21161
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.00061 0.09244 32.46 <2e-16 ***
X1 3.07456 0.08715 35.28 <2e-16 ***
X2 3.00563 0.09394 31.99 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.9212 on 97 degrees of freedom
Multiple R-squared: 0.9541, Adjusted R-squared: 0.9532
F-statistic: 1008 on 2 and 97 DF, p-value: < 2.2e-16
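# Now run my function on the same data
ols_fun(y = y, X = X)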
[,1]
[1,] 3.000610
[2,] 3.074559
[3,] 3.005635
Looks good!
Part 2: A simulation
Let’s push your function a little harder… in a simulation!
2.1 (10 points) When coding a simulation, it usually helps to start by coding up a single iteration (after defining the population DGP).
Start by drawing 100 observations from the following population model and then applying your function to estimate the coefficients.
\[ y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \beta_3 x_{3i} + \epsilon_i \]
where
- \(\beta_0 = 0\), \(\beta_1 = 1\), \(\beta_2 = 2\), and \(\beta_3 = 3\)
- \(x_{1i} \sim N(0, 1)\)
- \(x_{2i} \sim N(0, 1)\)
- \(x_{3i} = u_i + w_i\)
- \(\epsilon_i = v_i + w_i\)
- and \(w_i \sim N(0, 1)\), \(u_i \sim N(0, 1)\), and \(v_i \sim N(0, 1)\)
I’ll get you started.
# Load the `tidyverse` (using `pacman`)
library(pacman)
p_load(tidyverse)
# Set a seed
set.seed(12)
# Set the number of observations desired
n = 100
# Generate the population (starting with the "unobserved" disturbances)
sample_df = tibble(
id = 1:n,
u = rnorm(n),
v = rnorm(n),
w = rnorm(n),
epsilon = v + w,
x1 = rnorm(n),
x2 = rnorm(n),
x3 = u + w,
y = 0 + 1 * x1 + 2 * x2 + 3 * x3 + epsilon
)
# TODO Now run your function
Note: Using a `tibble` (from the `tidyverse`) here is helpful because it allows you to reference columns that you generated in the same step, e.g., `tibble(a = 1, b = a + 2)`. If you were using `data.frame`, you couldn't do it all in one step; you'd need to use multiple steps and/or something like `mutate()`.
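For example:

# Works: tibble() can use a column defined earlier in the same call
tibble(a = 1, b = a + 2)
# Errors: data.frame(a = 1, b = a + 2) fails because `a` is not yet defined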
Answer: Filling in the remaining lines…
# Load the `tidyverse` (using `pacman`)
library(pacman)
p_load(tidyverse, purrr, furrr, future)
# Set a seed
set.seed(12)
# Set the number of observations desired
n = 100
# Generate the population (starting with the "unobserved" disturbances)
sample_df = tibble(
id = 1:n,
u = rnorm(n),
v = rnorm(n),
w = rnorm(n),
epsilon = v + w,
x1 = rnorm(n),
x2 = rnorm(n),
x3 = u + w,
y = 0 + 1 * x1 + 2 * x2 + 3 * x3 + epsilon
)
# Now run our function
ols_fun(
X = sample_df |> select(x1, x2, x3) |> as.matrix(),
y = sample_df |> select(y) |> as.matrix()
)
y
0.008545703
x1 1.066598141
x2 1.859886876
x3 3.527533696
2.2 (5 points) Now it’s time for a simulation. For the simulation, you basically need to run the code you wrote in 2.1 above a large number of times (e.g., 10,000 times).
Here are some steps to get to the full simulation:
- Copy lines 6–20 above (plus the lines you added to run your regression function).
- Update those lines (which define a single iteration of the simulation) to define a function with a single argument `it` for the iteration number.
- At the end of this iteration function, output the estimated coefficients.
- Use a loop, map, or apply function to run that iteration function many times. You've got lots of options here (`for`, `lapply`, `map`, etc.). You might want to check out the `map` functions from the `purrr` package in the `tidyverse`.
- After the loop, combine the results into a single data frame (e.g., using `bind_rows()` from the `dplyr` package).
Here’s a start…
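# A sketch of a starting point: an iteration function to fill in
iter_fun = function(it) {
  # TODO Draw a new sample from the population (as in 2.1)
  # TODO Estimate the coefficients with your OLS function
  # TODO Return the estimates, tagged with the iteration number `it`
}
# TODO Run iter_fun() many times and combine the results into one data frame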
Now run the simulation with at least 1,000 iterations.
Answer: Simulation time!
# Set the number of iterations
n_iter = 1e4
# Set the seed
set.seed(12)
# Define the iteration function
iter_fun = function(it, n = 100) {
# Draw n observations from the population
sample_df = tibble(
id = 1:n,
u = rnorm(n),
v = rnorm(n),
w = rnorm(n),
epsilon = v + w,
x1 = rnorm(n),
x2 = rnorm(n),
x3 = u + w,
y = 0 + 1 * x1 + 2 * x2 + 3 * x3 + epsilon
)
# Now run our function
b = ols_fun(
X = sample_df |> select(x1, x2, x3) |> as.matrix(),
y = sample_df |> select(y) |> as.matrix()
)
# Format to a data frame
data.frame(
est = b |> as.vector(),
param = c('beta_0', 'beta_1', 'beta_2', 'beta_3'),
iteration = it
)
}
# Run the simulation n_iter times (in parallel)
plan(multisession, workers = 10)
sim_list = future_map(
1:n_iter,
iter_fun,
.options = furrr_options(seed = TRUE)
)
# Combine the results into a single data frame
sim_df = bind_rows(sim_list)
2.3 (10 points) Summarize the results of the simulation with (1) a numerical summary (the mean coefficient estimate for each of the four parameters) and (2) a histogram or density plot of the coefficient estimates (separated by parameter).
Hint: `ggplot2` is probably a good idea here. You can use `geom_histogram()` or `geom_density()` to plot the distributions of the coefficient estimates.
Answer: First, a simple numerical summary…
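# Mean and standard deviation of the estimates for each parameter
sim_df |>
  group_by(param) |>
  summarize(mean_est = mean(est), sd_est = sd(est))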
# A tibble: 4 × 3
param mean_est sd_est
<chr> <dbl> <dbl>
1 beta_0 -0.000762 0.122
2 beta_1 1.00 0.125
3 beta_2 2.00 0.124
4 beta_3 3.50 0.0880
Density estimates…
# Load ggplot2
pacman::p_load(ggplot2, viridis)
# Density plots
ggplot(sim_df, aes(x = est)) +
geom_density(
aes(fill = param, color = param),
alpha = 0.7
) +
geom_vline(
data = data.frame(
param = c('beta_0', 'beta_1', 'beta_2', 'beta_3'),
true_val = c(0, 1, 2, 3)
),
aes(xintercept = true_val, color = param),
linetype = 'dashed',
linewidth = .3
) +
geom_hline(yintercept = 0) +
labs(
x = 'Coefficient estimate',
y = 'Density'
) +
scale_fill_manual(
'',
values = magma(4, end = .9),
labels = c(expression(beta[0]), expression(beta[1]), expression(beta[2]), expression(beta[3]))
) +
scale_color_manual(
'',
values = magma(4, end = .9),
labels = c(expression(beta[0]), expression(beta[1]), expression(beta[2]), expression(beta[3]))
) +
theme_minimal(base_family = 'Fira Sans Condensed', base_size = 12) +
theme(legend.position = 'bottom')
2.4 (5 points) Now that you have the results of the simulation, compare them to the true values of the parameters. How well is OLS working for each of the parameters?
Answer: OLS is working well for \(\beta_0\), \(\beta_1\), and \(\beta_2\). The estimators' distributions are centered on the true values, and the estimates are also typically close to the truth.
OLS is not working well for \(\beta_3\); the estimator is clearly biased. Its distribution is centered on 3.5 (rather than 3), and none of the estimates got very close to 3.
2.5 (10 points) Why is OLS biased for \(\beta_3\)? Explain the intuition and support it with simple math. Bonus if you can show it with a graph too.
Answer: OLS is biased for \(\beta_3\) because \(x_3\) is correlated with the error term \(\epsilon\). This correlation arises because
- \(x_3 = u + w\),
- \(\epsilon = v + w\).
Thus \(x_3\) and the disturbance (\(\epsilon\)) are correlated through their dependence on \(w\)—violating exogeneity.
We can see this graphically by plotting the disturbance \(\epsilon\) against \(x_3\) (using the data from 2.1). Under exogeneity, \(\epsilon\) should be centered on 0 at any value of \(x_3\), which is clearly not the case:
# Plot
ggplot(data = sample_df, aes(x = x3, y = epsilon)) +
geom_hline(yintercept = 0) +
geom_point() +
geom_smooth(se = FALSE, color = magma(4)[3]) +
labs(
x = expression(x[3]),
y = expression(epsilon)
) +
theme_minimal(base_family = 'Fira Sans Condensed', base_size = 12)
2.6 (5 points) How biased is OLS for \(\beta_3\)? Answer with math (not with the data from the simulation). I’m fine with you using expectations or probability limits. Invoke assumptions if you need them; just make sure you are clear about the assumptions you make.
Answer: Recall that the probability limit of the OLS estimator can be written
\[ \text{plim} \; \hat{\beta}_3 = \beta_3 + \frac{\text{Cov}(\tilde{x}_3, \epsilon)}{\text{Var}(\tilde{x}_3)} \]

where \(\tilde{x}_3\) is the residual of \(x_3\) after regressing it on the other regressors. However, \(\tilde{x}_3 = x_3\) in our case, as \(x_3\) is independent of the other regressors. Thus, we can write

\[ \text{plim} \; \hat{\beta}_3 = \beta_3 + \frac{\text{Cov}(x_3, \epsilon)}{\text{Var}(x_3)} \]

Now consider each term in the bias separately.
The first term (the numerator) \[ \begin{align*} \text{Cov}(x_3, \epsilon) &= \text{Cov}(u + w, v + w) \\ &= \text{Cov}(u, v) + \text{Cov}(u, w) + \text{Cov}(w, v) + \text{Cov}(w, w) \\ &= 0 + 0 + 0 + \text{Var}(w) \\ &= \text{Var}(w) \\ &= 1 \end{align*} \]
The second term (the denominator) \[ \begin{align*} \text{Var}(x_3) &= \text{Var}(u + w) \\ &= \text{Var}(u) + \text{Var}(w) + 2\text{Cov}(u, w) \\ &= 1 + 1 + 0 \\ &= 2 \end{align*} \] Thus, we can write the probability limit of the OLS estimator as \[ \text{plim} \; \hat{\beta}_3 = \beta_3 + \frac{1}{2} = 3.5 \] This means that the OLS estimator is asymptotically biased (inconsistent) for \(\beta_3\) by 0.5 (matching our simulated results).