Problem set 1

EC607

Due Sunday, 19 April 2026 by 11:59p Pacific.

Part 0: Take a minute

0.1 (5 points) Take 15 minutes to think about (1) your research interests and (2) how you are doing. No screens. No paper. No nothing. Just sit somewhere (ideally not PLC) quietly and think.

To submit: Where and when did you complete this task?

Part 1: Fun with functions

Important: For Parts 1 and 2, you should try to write the code yourself. I know that AI can write an OLS function. The point of getting a PhD is to learn what’s going on under the hood (and to learn how to learn what’s going on under the hood). So… do your own work here.

1.1 (15 points) Write a function that uses OLS to estimate the coefficients of a linear regression model.

The function should

accept two arguments: y and X (both matrices)
output estimated coefficients as a matrix

Your function should only use matrix operations (e.g., %*%) and basic summary statistics (e.g., sum(), nrow()). Do not use more complex functions (like lm()). Bonus if you can make it work only with matrix operations (i.e., no sum() or nrow(), etc.).

Hint: Remember that you can create a column matrix of ones using matrix(1, nrow = n, ncol = 1) (where n is the number of rows you want). You can also use cbind() to combine matrices by columns (cbind) or rows (rbind). There’s an example of cbind() below in 1.2.

Recall: The function named function() allows you to write a function. For example,

# A very fancy function
so_fun = function(a, b) {
  a + b
}

defines a function called so_fun that takes two arguments, a and b, and returns their sum. You can call the function by typing so_fun(1, 2). Remember that you can also define default values in the function.

1.2 (5 points) Test your function using the following data, comparing it to the results of lm().

# Set a seed
set.seed(12)
# Define the parameters
beta = matrix(c(3, 3, 3), nrow = 3, ncol = 1)
# Define the X matrix
X = matrix(data = rnorm(200), nrow = 100, ncol = 2)
# Define the disturbance
u = runif(100, min = -1, max = 1)
# Create a column of ones
ones = matrix(1, nrow = 100, ncol = 1)
# Define the y matrix
y = cbind(ones, X) %*% beta + u
# Run lm
test_lm = lm(y ~ X)
summary(test_lm)
# TODO Now run your function

Part 2: A simulation

Let’s push your function a little harder… in a simulation!

2.1 (10 points) When coding a simulation, it usually helps to start by coding up a single iteration (after defining the population DGP).

Start by drawing 50 observations from the following population model and then applying your function to estimate the coefficients.

\[ y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \beta_3 x_{3i} + \epsilon_i \]

where

\(\beta_0 = 0\), \(\beta_1 = 1\), \(\beta_2 = 2\), and \(\beta_3 = 3\)
\(x_{1i} \sim \text{Unif}(-1, 1)\)
\(x_{2i} \sim \text{Unif}(-1, 1)\)
\(x_{3i} = u_i + w_i\)
\(\epsilon_i = v_i + w_i\)
and \(w_i \sim \text{Unif}(-1, 1)\), \(u_i \sim \text{Unif}(-1, 1)\), and \(v_i \sim \text{Unif}(-1, 1)\)

I’ll get you started.

# Load the `tidyverse` (using `pacman`)
library(pacman)
p_load(tidyverse)
# Set a seed
set.seed(12)
# Set the number of observations desired
n = 50
# Generate the population (starting with the "unobserved" disturbances)
sample_df = tibble(
  id = 1:n,
  u = runif(n, min = -1, max = 1),
  v = runif(n, min = -1, max = 1),
  w = runif(n, min = -1, max = 1),
  epsilon = v + w,
  x1 = runif(n, min = -1, max = 1),
  x2 = runif(n, min = -1, max = 1),
  x3 = u + w,
  y = 0 + 1 * x1 + 2 * x2 + 3 * x3 + epsilon
)
# TODO Now run your function

Note: As mentioned in class, using a tibble (from the tidyverse) here is helpful because it allows you to reference columns that you generated in the same step, e.g., tibble(a = 1, b = a + 2). If you were using data.frame (or even data.table), you couldn’t do it all in one step—you’d need to use multiple steps and/or something like mutate().

2.2 (5 points) Now it’s time for a simulation. For the simulation, you basically need to run the code you wrote in 2.1 above a large number of times (e.g., 10,000 times).

Here are some steps to get to the full simulation:

Copy the code above (plus the lines you added to run your regression function).
Update those lines (that define a single iteration of the simulation) to define a function with a single argument it for the iteration number.
At the end of this iteration function, output the estimated coefficients.
Use a loop, map, or apply function to run that iteration function many times. You’ve got lots of options here (for, lapply, map, etc.). You might want to check out the map functions from the purrr package from the tidyverse.
After the loop, combine the results into a single data frame (e.g., using bind_rows() from the dplyr package).

Here’s a start…

# Set the number of iterations
n_iter = 1e4
# Set the seed
set.seed(12)
# Define the iteration function
iter_fun = function(it) {
  # TODO Your iteration magic here...
}
# Run the simulation n_iter times
sim_list = map(1:n_iter, iter_fun)
# TODO Combine the results into a single data.frame

Now run the simulation with at least 1,000 iterations.

Remember: If this simulation takes a long time to run, you can parallelize with the future and furrr packages.

2.3 (10 points) Summarize the results of the simulation with (1) a numerical summary (the mean coefficient estimate for each of the four parameters) and (2) a histogram or density plot of the coefficient estimates (separated by parameter).

Hint: ggplot2 is probably a good idea here. You can use geom_histogram() or geom_density() to plot the distributions of the coefficient estimates.

2.4 (5 points) Now that you have the results of the simulation, compare them to the true values of the parameters. How well is OLS working for each of the parameters?

2.5 (10 points) Why is OLS biased for \(\beta_3\)? Explain the intuition and with simple math. Bonus if you can show it with a graph too.

2.6 (5 points) How biased is OLS for \(\beta_3\)? Answer with math (not with the data from the simulation). I’m fine with you using expectations or probability limits. Invoke assumptions if you need them; just make sure you are clear about the assumptions you make.

2.7 (5 points) Provide an example of a real-world research question where this kind of bias might be a problem. Explain the intuition behind the bias in this context and how it might affect the conclusions of the research.

2.8 (5 points) Re-run the simulation with 500 observations per iteration. Does the bias in \(\hat{\beta}_3\) get any better or worse? Explain the intuition behind the results.

Part 3: The part with AI

3.1 (10 points) Now that you’ve done all the work above, ask an AI agent (preferably Claude Code or ChatGPT/Codex) to write the code for (1) the OLS function and (2) the simulation. (Also note which AI you used and the exact prompt you used to get the code.)

3.2 (5 points) Compare the AI-generated code to your code: What’s different? Does the AI’s approach seem better/worse in any way? Do the AI-based results differ?

3.3 (5 points) What does the AI-comparison exercise above lead you to think about the role of AI in research?