EC421, Set 03
Prologue
We reviewed the fundamentals of statistics and econometrics.
We review more of the main/basic results in metrics.
We will post the first assignment (focused on review) soon.
First we need to finish more (of this) review.
Multiple regression
We’re moving from simple linear regression
(one outcome variable and one explanatory variable)
\[ \textcolor{#e64173}{y_i} = \beta_0 + \beta_1 \textcolor{#6A5ACD}{x_i} + u_i \]
to the land of multiple linear regression
(one outcome variable and multiple explanatory variables)
\[ \textcolor{#e64173}{y_i} = \beta_0 + \beta_1 \textcolor{#6A5ACD}{x_{1i}} + \beta_2 \textcolor{#6A5ACD}{x_{2i}} + \cdots + \beta_k \textcolor{#6A5ACD}{x_{ki}} + u_i \]
Why?
We can better explain variation in \(y\), improve predictions, avoid omitted-variable bias, …
\(y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + u_i \quad\) \(x_1\) is continuous \(\quad x_2\) is categorical
The intercept and categorical variable \(x_2\) control for the groups’ means.
With groups’ means removed:
\(\hat{\beta}_1\) estimates the relationship between \(y\) and \(x_1\) after controlling for \(x_2\).
Another way to think about it: We’re estimating two (parallel) lines.
Looking at our estimator can also help.
For the simple linear regression \(y_i = \beta_0 + \beta_1 x_i + u_i\)
\[ \begin{aligned} \hat{\beta}_1 &= \dfrac{\sum_i \left( x_i - \overline{x} \right) \left( y_i - \overline{y} \right)}{\sum_i \left( x_i -\overline{x} \right)^2} \\[0.3em] &= \dfrac{\sum_i \left( x_i - \overline{x} \right) \left( y_i - \overline{y} \right)/(n-1)}{\sum_i \left( x_i -\overline{x} \right)^2 / (n-1)} \\[0.3em] &= \dfrac{\mathop{\hat{\text{Cov}}}(x,\,y)}{\mathop{\hat{\text{Var}}} \left( x \right)} \end{aligned} \]
Simple linear regression estimator:
\[ \hat{\beta}_1 = \dfrac{\mathop{\hat{\text{Cov}}}(x,\,y)}{\mathop{\hat{\text{Var}}} \left( x \right)} \]
Moving to multiple linear regression, the estimator changes slightly:
\[ \hat{\beta}_1 = \dfrac{\mathop{\hat{\text{Cov}}}(\textcolor{#e64173}{\tilde{x}_1},\,y)}{\mathop{\hat{\text{Var}}} \left( \textcolor{#e64173}{\tilde{x}_1} \right)} \]
where \(\textcolor{#e64173}{\tilde{x}_1}\) is the residualized \(x_1\) variable, i.e., the variation remaining in \(x_1\) after controlling for the other explanatory variables.
Consider the multiple-regression model
\[ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + u \]
Residualized \(x_{1}\) (\(\textcolor{#e64173}{\tilde{x}_1}\)) comes from regressing \(x_1\) on an intercept and all other explanatory variables (then collecting the residuals), i.e.,
\[ \begin{aligned} \hat{x}_{1} &= \hat{\gamma}_0 + \hat{\gamma}_2 \, x_{2} + \hat{\gamma}_3 \, x_{3} \\ \textcolor{#e64173}{\tilde{x}_{1}} &= x_{1} - \hat{x}_{1} \end{aligned} \]
allowing us to better understand our OLS multiple-regression estimator
\[ \hat{\beta}_1 = \dfrac{\mathop{\hat{\text{Cov}}}(\textcolor{#e64173}{\tilde{x}_1},\,y)}{\mathop{\hat{\text{Var}}} \left( \textcolor{#e64173}{\tilde{x}_1} \right)} \]
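A minimal sketch of this residualization result in base R, using simulated data. The sample size, coefficients, and variable names are assumptions chosen only for illustration. It checks that the coefficient on \(x_1\) from the full regression matches the covariance-over-variance ratio built from the residualized \(x_1\).

```r
# Simulated data; every number here is an illustrative assumption
set.seed(123)
n  = 500
x2 = rnorm(n)
x3 = rnorm(n)
x1 = 0.5 * x2 - 0.3 * x3 + rnorm(n)   # x1 correlates with x2 and x3
y  = 1 + 2 * x1 - x2 + 0.5 * x3 + rnorm(n)

# Coefficient on x1 from the full multiple regression
b1_full = unname(coef(lm(y ~ x1 + x2 + x3))["x1"])

# Residualize x1: regress x1 on an intercept and the other regressors
x1_tilde = resid(lm(x1 ~ x2 + x3))

# Covariance-over-variance formula using the residualized x1
b1_resid = cov(x1_tilde, y) / var(x1_tilde)

# The two estimates should agree up to floating-point error
all.equal(b1_full, b1_resid)
```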
Measures of goodness of fit quantify how well a model describes/fits the data.
Common measure: \(R^2\) [R-squared] (a.k.a. coefficient of determination)
\[ R^2 = \dfrac{\sum_i (\hat{y}_i - \overline{y})^2}{\sum_i \left( y_i - \overline{y} \right)^2} = 1 - \dfrac{\sum_i \left( y_i - \hat{y}_i \right)^2}{\sum_i \left( y_i - \overline{y} \right)^2} \]
Notice our old friend SSE: \(\sum_i \left( y_i - \hat{y}_i \right)^2 = \sum_i e_i^2\).
\(R^2\) literally tells us the share of the variation in \(y\) our current model accounts for.
Thus \(0 \leq R^2 \leq 1\).
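As a quick check of the two equivalent forms of \(R^2\), here is a small base-R sketch on simulated data (the data-generating numbers are assumptions). It computes both forms by hand and compares them with the value reported by `summary()`.

```r
# Simulated data; the coefficients and sample size are illustrative assumptions
set.seed(42)
n = 200
x = rnorm(n)
y = 1 + 0.5 * x + rnorm(n)

reg   = lm(y ~ x)
y_hat = fitted(reg)

sst = sum((y - mean(y))^2)       # total variation in y
ssm = sum((y_hat - mean(y))^2)   # variation accounted for by the model
sse = sum((y - y_hat)^2)         # sum of squared residuals

r2_explained = ssm / sst         # first form of R2
r2_residual  = 1 - sse / sst     # second form of R2
r2_lm        = summary(reg)$r.squared

c(r2_explained, r2_residual, r2_lm)
```

The equality of the two forms relies on the regression including an intercept.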
The problem: As we add variables to our model, \(R^2\) mechanically increases.
To see this problem, we can simulate a dataset of 10,000 observations on \(y\) and 1,000 random \(x_k\) variables. No relations between \(y\) and the \(x_k\)!
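Pseudo-code outline of the simulation:
Generate 10,000 observations on \(y\).
Generate 10,000 observations on each of the variables \(x_1\) through \(x_{1000}\).
For each \(k\) from 2 to 1,000: regress \(y\) on \(x_1\) through \(x_k\); record \(R^2\) and adjusted \(R^2\).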
R code for the simulation:
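# Packages: pipes (%>% and %$%), parallel apply, and bind_rows()
library(magrittr)
library(parallel)
library(dplyr)
# Set the seed for reproducibility
set.seed(1234)
# Generate 10,000 observations on y
y = rnorm(1e4)
# Generate 10,000 observations on 1,000 random x variables (unrelated to y)
x = matrix(data = rnorm(1e7), nrow = 1e4)
# No column of ones is needed: lm() includes an intercept by default
# Regress y on the first i + 1 columns of x; record R2 and adjusted R2
r_df = mclapply(X = 1:(1e3-1), mc.cores = max(1, detectCores() - 1), FUN = function(i) {
  tmp_reg = lm(y ~ x[,1:(i+1)]) %>% summary()
  data.frame(
    k = i + 1,
    r2 = tmp_reg %$% r.squared,
    r2_adj = tmp_reg %$% adj.r.squared
  )
}) %>% bind_rows()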
One solution: Penalize for the number of variables, e.g., adjusted \(R^2\):
\[ \overline{R}^2 = 1 - \dfrac{\sum_i \left( y_i - \hat{y}_i \right)^2/(n-k-1)}{\sum_i \left( y_i - \overline{y} \right)^2/(n-1)} \]
Note: Adjusted \(R^2\) need not be between 0 and 1.
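A scaled-down sketch of the same idea in base R (100 observations and 30 pure-noise regressors, both assumptions made to keep it fast). It computes \(R^2\) and adjusted \(R^2\) by hand from the formulas above as regressors are added one at a time.

```r
# Pure-noise data: y is unrelated to every column of x (by construction)
set.seed(1)
n     = 100
k_max = 30
y = rnorm(n)
x = matrix(rnorm(n * k_max), nrow = n)

sst = sum((y - mean(y))^2)

# For k = 1, ..., k_max: regress y on the first k columns of x
fit = sapply(1:k_max, function(k) {
  reg = lm(y ~ x[, 1:k, drop = FALSE])
  sse = sum(resid(reg)^2)
  c(
    k      = k,
    r2     = 1 - sse / sst,
    r2_adj = 1 - (sse / (n - k - 1)) / (sst / (n - 1))
  )
})
fit = as.data.frame(t(fit))

# R2 never falls as k grows; adjusted R2 can fall (and can be negative)
head(fit)
range(diff(fit$r2))
```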
There are tradeoffs to remember as we add/remove variables:
Fewer variables
More variables
We’ll go deeper into this issue in a few weeks, but as a refresher:
Omitted-variable bias (OVB) arises when we omit a variable that
affects our outcome variable \(y\)
correlates with an explanatory variable \(x_j\)
As its name suggests, this situation leads to bias in our estimate of \(\beta_j\).
Note: OVB is not exclusive to multiple linear regression, but it does require that multiple variables affect \(y\).
Example
Let’s imagine a simple model of the returns to schooling \[ \text{Pay}_i = \beta_0 + \beta_1 \text{School}_i + \beta_2 \text{Male}_i + u_i \] where \(\text{School}_i\) gives \(i\)’s years of schooling; \(\text{Male}_i\) represents an indicator variable for whether individual \(i\) is male.
Thus
If \(\beta_2 > 0\), then males are favored in the labor market
(discrimination, all else equal).
Example, continued
From our population model
\[ \text{Pay}_i = \beta_0 + \beta_1 \text{School}_i + \beta_2 \text{Male}_i + u_i \]
If a study focuses on the relationship between pay and schooling, i.e., \[ \text{Pay}_i = \beta_0 + \beta_1 \text{School}_i + \left(\beta_2 \text{Male}_i + u_i\right) \] \[ \text{Pay}_i = \beta_0 + \beta_1 \text{School}_i + \varepsilon_i \] the “disturbance” becomes \(\varepsilon_i = \beta_2 \text{Male}_i + u_i\).
OLS needs exogeneity to be unbiased, and exogeneity is likely violated here.
But even if \(\mathop{\boldsymbol{E}}\left[ u | X \right] = 0\), it is not true that \(\mathop{\boldsymbol{E}}\left[ \varepsilon | X \right] = 0\) so long as \(\beta_2 \neq 0\).
Unless \(\text{School}\) and \(\text{Male}\) are unrelated, OLS is biased.
Example, continued
Let’s try to see this result graphically.
Population model:
\[ \text{Pay}_i = 20 + 0.5 \times \text{School}_i + 10 \times \text{Male}_i + u_i \]
Our regression model that suffers from omitted-variable bias:
\[ \text{Pay}_i = \hat{\beta}_0 + \hat{\beta}_1 \times \text{School}_i + e_i \]
Finally, imagine that women, on average, receive more schooling than men.
Example, continued: \(\text{Pay}_i = 20 + 0.5 \times \text{School}_i + 10 \times \text{Male}_i + u_i\)
The relationship between pay and schooling.
Biased regression estimate: \(\widehat{\text{Pay}}_i = 31.3 - 0.9 \times \text{School}_i\)
Recalling the omitted variable: Gender (female and male)
Unbiased regression estimate: \(\widehat{\text{Pay}}_i = 20.9 + 0.4 \times \text{School}_i + 9.1 \times \text{Male}_i\)
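The example can be sketched as a small simulation in base R. The population coefficients (20, 0.5, 10) come from the model above; the sample size, the schooling gap between women and men, and the error variance are assumptions, so the estimates will not reproduce the numbers quoted above.

```r
# Population model from the example: Pay = 20 + 0.5 School + 10 Male + u
set.seed(2024)
n    = 1000
male = rbinom(n, size = 1, prob = 0.5)
# Assumption: women average 3 more years of schooling than men
school = 12 + 3 * (1 - male) + rnorm(n)
pay    = 20 + 0.5 * school + 10 * male + rnorm(n, sd = 2)

# Short regression omits Male; long regression includes it
b_short = coef(lm(pay ~ school))
b_long  = coef(lm(pay ~ school + male))

# Auxiliary regression of the omitted variable on the included one
delta = coef(lm(male ~ school))["school"]

# In-sample OVB identity: short slope = long slope + (coef. on Male) * delta
ovb_check = b_long["school"] + b_long["male"] * delta

c(short = unname(b_short["school"]), long = unname(b_long["school"]),
  identity = unname(ovb_check))
```

Because schooling and the male indicator are negatively related here while \(\beta_2 > 0\), the short regression's slope is biased downward, which is the sense in which the bias can sometimes be signed.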
Solutions
Don’t omit variables
Instrumental variables and two-stage least squares
Warning: There are situations in which neither solution is possible.
Proceed with caution (sometimes you can sign the bias).
Maybe just stop.
Interpreting coefficients
Consider the relationship
\[ \text{Pay}_i = \beta_0 + \beta_1 \, \text{School}_i + u_i \]
where
Interpretations
Deriving the slope’s interpretation:
\[ \begin{aligned} \mathop{\boldsymbol{E}}\left[ \text{Pay} | \text{School} = \ell + 1 \right] - \mathop{\boldsymbol{E}}\left[ \text{Pay} | \text{School} = \ell \right] &= \\[0.5em] \mathop{\boldsymbol{E}}\left[ \beta_0 + \beta_1 (\ell + 1) + u \right] - \mathop{\boldsymbol{E}}\left[ \beta_0 + \beta_1 \ell + u \right] &= \\[0.5em] \left[ \beta_0 + \beta_1 (\ell + 1) \right] - \left[ \beta_0 + \beta_1 \ell \right] &= \\[0.5em] \beta_0 - \beta_0 + \beta_1 \ell - \beta_1 \ell + \beta_1 &= \beta_1 \end{aligned} \]
I.e., the slope gives the expected increase in our outcome variable for a one-unit increase in the explanatory variable.
If we have multiple explanatory variables, e.g.,
\[ \text{Pay}_i = \beta_0 + \beta_1 \, \text{School}_i + \beta_2 \, \text{Ability}_i + u_i \]
then the interpretation changes slightly.
\[ \begin{aligned} \mathop{\boldsymbol{E}}\left[ \text{Pay} | \text{School} = \ell + 1 \land \text{Ability} = \alpha \right] - & \\ \mathop{\boldsymbol{E}}\left[ \text{Pay} | \text{School} = \ell \land \text{Ability} = \alpha \right] &= \\ \mathop{\boldsymbol{E}}\left[ \beta_0 + \beta_1 (\ell + 1) + \beta_2 \alpha + u \right] - \mathop{\boldsymbol{E}}\left[ \beta_0 + \beta_1 \ell + \beta_2 \alpha + u \right] &= \\ \left[ \beta_0 + \beta_1 (\ell + 1) + \beta_2 \alpha \right] - \left[ \beta_0 + \beta_1 \ell + \beta_2 \alpha \right] &= \\ \beta_0 - \beta_0 + \beta_1 \ell - \beta_1 \ell + \beta_1 + \beta_2 \alpha - \beta_2 \alpha &= \beta_1 \end{aligned} \]
I.e., the slope gives the expected increase in our outcome variable for a one-unit increase in the explanatory variable, holding all other variables constant (ceteris paribus).
Alternative derivation
Consider the model
\[ y = \beta_0 + \beta_1 \, x + u \]
Differentiate the model:
\[ \dfrac{dy}{dx} = \beta_1 \]
Consider the relationship
\[ \text{Pay}_i = \beta_0 + \beta_1 \, \text{Female}_i + u_i \]
where
Interpretations
Derivations
\[ \begin{aligned} \mathop{\boldsymbol{E}}\left[ \text{Pay} | \text{Non-female} \right] &= \mathop{\boldsymbol{E}}\left[ \beta_0 + \beta_1\times 0 + u_i \right] \\ &= \mathop{\boldsymbol{E}}\left[ \beta_0 + 0 + u_i \right] \\ &= \beta_0 \end{aligned} \]
\[ \begin{aligned} \mathop{\boldsymbol{E}}\left[ \text{Pay} | \text{Female} \right] &= \mathop{\boldsymbol{E}}\left[ \beta_0 + \beta_1\times 1 + u_i \right] \\ &= \mathop{\boldsymbol{E}}\left[ \beta_0 + \beta_1 + u_i \right] \\ &= \beta_0 + \beta_1 \end{aligned} \]
Note: If there are no other variables to condition on, then \(\hat{\beta}_1\) equals the difference in the groups’ mean outcomes, e.g., \(\overline{y}_\text{Female} - \overline{y}_\text{Non-female}\).
Note 2: The “holding all other variables constant” interpretation also applies to categorical variables in multiple-regression settings.
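A short base-R check of the group-means result, using simulated pay data (the group means, noise level, and sample size are assumptions). With a single binary regressor, the intercept should equal the mean outcome of the reference group and the slope should equal the difference in the groups' mean outcomes.

```r
# Simulated data; all numbers are illustrative assumptions
set.seed(7)
n      = 400
female = rbinom(n, size = 1, prob = 0.5)
pay    = 30 - 4 * female + rnorm(n, sd = 5)

reg = lm(pay ~ female)

# Group means of the outcome
mean_nonfemale = mean(pay[female == 0])
mean_female    = mean(pay[female == 1])

# Intercept = mean for non-females; slope = difference in group means
all.equal(unname(coef(reg)["(Intercept)"]), mean_nonfemale)
all.equal(unname(coef(reg)["female"]), mean_female - mean_nonfemale)
```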
\(y_i = \beta_0 + \beta_1 x_i + u_i\) for binary variable \(x_i = \{\textcolor{#314f4f}{0}, \, \textcolor{#e64173}{1}\}\)
Interactions allow the effect of one variable to change based upon the level of another variable.
Examples
Does the effect of schooling on pay change by gender?
Does the effect of gender on pay change by race?
Does the effect of schooling on pay change by experience?
Previously, we considered a model that allowed women and men to have different wages, but the model assumed the effect of school on pay was the same for everyone:
\[ \text{Pay}_i = \beta_0 + \beta_1 \, \text{School}_i + \beta_2 \, \text{Female}_i + u_i \]
but we can also allow the effect of school to vary by gender:
\[ \text{Pay}_i = \beta_0 + \beta_1 \, \text{School}_i + \beta_2 \, \text{Female}_i + \beta_3 \, \text{School}_i\times\text{Female}_i + u_i \]
The model where schooling has the same effect for everyone (F and M):
The model where schooling’s effect can differ by gender (F and M):
Interpreting coefficients can be a little tricky with interactions, but the key is to carefully work through the math.
\[ \text{Pay}_i = \beta_0 + \beta_1 \, \text{School}_i + \beta_2 \, \text{Female}_i + \beta_3 \, \text{School}_i\times\text{Female}_i + u_i \]
Expected returns for an additional year of schooling for women:
\[ \begin{aligned} \mathop{\boldsymbol{E}}\left[ \text{Pay}_i | \text{Female} \land \text{School} = \ell + 1 \right] - \mathop{\boldsymbol{E}}\left[ \text{Pay}_i | \text{Female} \land \text{School} = \ell \right] &= \\[.5em] \mathop{\boldsymbol{E}}\left[ \beta_0 + \beta_1 (\ell+1) + \beta_2 + \beta_3 (\ell + 1) + u_i \right] - \mathop{\boldsymbol{E}}\left[ \beta_0 + \beta_1 \ell + \beta_2 + \beta_3 \ell + u_i \right] &= \\[.5em] \beta_1 + \beta_3 \end{aligned} \]
Expected returns for an additional year of schooling for women: \(\beta_1 + \beta_3\)
\(\beta_1\): the expected return for an additional year of schooling for non-females;
\(\beta_3\): the difference in the returns to schooling for females vs. non-females.
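These interpretations can be checked in base R on simulated data (all coefficients below are assumptions). The sketch estimates the interacted model and confirms that \(\hat\beta_1 + \hat\beta_3\) equals the schooling slope from a regression run on women only.

```r
# Simulated data; coefficients and sample size are illustrative assumptions
set.seed(99)
n      = 1000
female = rbinom(n, size = 1, prob = 0.5)
school = runif(n, min = 8, max = 20)
pay    = 10 + 1 * school + 2 * female + 0.5 * school * female + rnorm(n)

# school * female expands to school + female + school:female
b = coef(lm(pay ~ school * female))

slope_nonfemale = unname(b["school"])                       # beta_1
slope_female    = unname(b["school"] + b["school:female"])  # beta_1 + beta_3

# Same slope from a separate regression using only women
slope_female_sub = unname(coef(lm(pay ~ school, subset = female == 1))["school"])

c(slope_nonfemale, slope_female, slope_female_sub)
```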
The previous slides focused on interactions where one variable was binary.
If both variables are continuous, then the interpretation is slightly trickier.
Key: Interactions simply mean the effect of one variable depends on the level of another variable.
Suppose we’re interested in the model
\[ \text{Pay}_i = \beta_0 + \beta_1 \, \text{School}_i + \beta_2 \, \text{Experience}_i + \beta_3 \, \text{School}_i\times\text{Experience}_i + u_i \]
where \(\text{School}_i\) and \(\text{Experience}_i\) are both continuous variables (in years).
How do we interpret the interaction here?
School’s effect on pay now depends on the level of experience.
Interpretation: Consider the partial derivative:
\[ \dfrac{\partial\text{Pay}_i}{\partial\text{School}_i} = \beta_1 + \beta_3 \text{Experience}_i \]
In the model
\[ \text{Pay}_i = \beta_0 + \beta_1 \, \text{School}_i + \beta_2 \, \text{Experience}_i + \beta_3 \, \text{School}_i\times\text{Experience}_i + u_i \]
all else equal, an additional year of school changes pay by
\[ \beta_1 + \beta_3 \text{Experience} \]
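A base-R sketch of evaluating this expression at several experience levels. The data are simulated, the coefficients are assumptions, and `me_school()` is a hypothetical helper defined only for this illustration.

```r
# Simulated data; coefficients and sample size are illustrative assumptions
set.seed(5)
n          = 1000
school     = runif(n, min = 8, max = 20)
experience = runif(n, min = 0, max = 30)
pay = 5 + 1.2 * school + 0.3 * experience + 0.05 * school * experience + rnorm(n)

b = coef(lm(pay ~ school * experience))

# Hypothetical helper: estimated effect of one more year of school,
# evaluated at a given level of experience
me_school = function(exper) {
  unname(b["school"] + b["school:experience"] * exper)
}

# The estimated effect of schooling at 0, 10, and 20 years of experience
me_school(c(0, 10, 20))
```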
Polynomials are just interactions: they interact a variable with itself.
\[ \text{Pay}_i = \beta_0 + \beta_1 \, \text{School}_i + \beta_2 \, \text{School}_i^2 + u_i \]
Here the effect of schooling depends on an individual’s level of schooling.
Interpretation: Back to the partial derivative:
\[ \dfrac{\partial\text{Pay}_i}{\partial\text{School}_i} = \beta_1 + 2 \beta_2 \text{School}_i \]
All else equal, an additional year of school changes pay by approximately
\[ \beta_1 + 2 \beta_2 \text{School}_i \]
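A base-R sketch of the quadratic case on simulated data (coefficients are assumptions; `me_school()` and `step_school()` are hypothetical helpers). It compares the derivative with the exact change in predicted pay from one more year, which works out to \(\beta_1 + \beta_2 (2\,\text{School} + 1)\).

```r
# Simulated data; coefficients and sample size are illustrative assumptions
set.seed(11)
n      = 1000
school = runif(n, min = 8, max = 20)
pay    = 5 + 3 * school - 0.05 * school^2 + rnorm(n)

# I() protects the square inside the formula
b  = coef(lm(pay ~ school + I(school^2)))
b1 = unname(b["school"])
b2 = unname(b["I(school^2)"])

# Hypothetical helpers: derivative and exact one-year change at schooling level s
me_school   = function(s) b1 + 2 * b2 * s
step_school = function(s) b1 + b2 * (2 * s + 1)

data.frame(
  school     = c(10, 15, 20),
  derivative = me_school(c(10, 15, 20)),
  one_year   = step_school(c(10, 15, 20))
)
```

The two columns differ by exactly \(\hat\beta_2\), which is why the derivative is only an approximation to the one-year change.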
When your outcome variable is binary, the interpretation changes slightly.
Recall: The average of a binary variable gives the share of observations with a ‘1’.
Example: Avg(0, 0, 0, 1, 1) = 0.40 \(\implies\) 40% of observations = 1.
If your outcome is binary, then you are modeling the probability (percent) that the outcome equals one.
\[ \text{Employed}_i = \beta_0 + \beta_1 \text{School}_i + u_i \]
Interpretation: \(\beta_1\) is the effect of one additional year of schooling on the probability an individual is employed (all else equal).
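A minimal base-R sketch of a binary outcome in a linear regression. The employment probabilities are simulated under an assumed linear relationship with schooling; nothing here comes from real data.

```r
# The average of a binary variable is the share of ones
mean(c(0, 0, 0, 1, 1))

# Simulated data; the probability model below is an illustrative assumption
set.seed(3)
n      = 2000
school = sample(8:20, size = n, replace = TRUE)
p_employed = 0.3 + 0.03 * school   # stays between 0.54 and 0.90
employed   = rbinom(n, size = 1, prob = p_employed)

# Intercept-only regression recovers the sample share employed
share_employed = unname(coef(lm(employed ~ 1)))
all.equal(share_employed, mean(employed))

# Slope: estimated change in the employment probability per year of school
b_school = unname(coef(lm(employed ~ school))["school"])
b_school
```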
In economics, you will frequently see logged outcome variables with linear (non-logged) explanatory variables, e.g.,
\[ \log(\text{Pay}_i) = \beta_0 + \beta_1 \, \text{School}_i + u_i \]
This specification changes our interpretation of the slope coefficients.
Interpretation
A one-unit increase in our explanatory variable increases the outcome variable by approximately \(\beta_1\times 100\) percent.
Example: An additional year of schooling increases pay by approximately 3 percent (for \(\beta_1 = 0.03\)).
Derivation
Consider the log-linear model
\[ \log(y) = \beta_0 + \beta_1 \, x + u \]
and differentiate
\[ \dfrac{dy}{y} = \beta_1 dx \]
So a marginal change in \(x\) (i.e., \(dx\)) leads to a proportional change of \(\beta_1 dx\) in \(y\), i.e., a change of approximately \(100 \times \beta_1 dx\) percent.
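The word "approximately" matters: for a one-unit (rather than marginal) change in \(x\), the exact percentage change in \(y\) is \(100 \times (e^{\beta_1} - 1)\). A short base-R comparison follows; the grid of coefficient values is an assumption for illustration.

```r
# Approximate vs. exact percent change in y for a one-unit increase in x
b1 = 0.03
approx_pct = 100 * b1              # the usual approximation
exact_pct  = 100 * (exp(b1) - 1)   # exact change implied by the model
c(approx = approx_pct, exact = exact_pct)

# The approximation deteriorates as the coefficient grows
b_grid = c(0.01, 0.05, 0.10, 0.30, 0.50)
data.frame(
  b1         = b_grid,
  approx_pct = 100 * b_grid,
  exact_pct  = 100 * (exp(b_grid) - 1)
)
```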
Because the log-linear specification comes with a different interpretation, you need to make sure it fits your data-generating process/model.
Does \(x\) change \(y\) in levels (e.g., a 3-unit increase) or percentages (e.g., a 10-percent increase)?
I.e., you need to be sure an exponential relationship makes sense:
\[ \log(y_i) = \beta_0 + \beta_1 \, x_i + u_i \iff y_i = e^{\beta_0 + \beta_1 x_i + u_i} \]
Similarly, econometricians frequently employ log-log models, in which the outcome variable is logged and at least one explanatory variable is logged:
\[ \log(\text{Pay}_i) = \beta_0 + \beta_1 \, \log(\text{School}_i) + u_i \]
Interpretation:
Derivation
Consider the log-log model
\[ \log(y) = \beta_0 + \beta_1 \, \log(x) + u \]
and differentiate
\[ \dfrac{dy}{y} = \beta_1 \dfrac{dx}{x} \]
which says that for a one-percent increase in \(x\), we will see a \(\beta_1\) percent increase in \(y\). As an elasticity:
\[ \dfrac{dy}{dx} \dfrac{x}{y} = \beta_1 \]
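A base-R sketch of recovering an elasticity from a log-log regression. The data are simulated with an assumed elasticity of 0.8; the range of \(x\) and the noise level are also assumptions.

```r
# Simulated data with a constant elasticity of y with respect to x
set.seed(8)
n = 1000
x = runif(n, min = 1, max = 10)
elasticity = 0.8   # assumed for this illustration
y = exp(0.5 + elasticity * log(x) + rnorm(n, sd = 0.1))

# The slope of the log-log regression estimates the elasticity
b1 = unname(coef(lm(log(y) ~ log(x)))["log(x)"])
b1
```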
Note: If you have a log-linear model with a binary indicator variable, the interpretation for the coefficient on that variable changes.
Consider
\[ \log(y_i) = \beta_0 + \beta_1 x_1 + u_i \]
for binary variable \(x_1\).
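The interpretation of \(\beta_1\) is now: when the indicator switches from 0 to 1, \(y\) changes by \(100 \times \left( e^{\beta_1} - 1 \right)\) percent (approximately \(100 \times \beta_1\) percent when \(\beta_1\) is small).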
Additional topics
So far, we’ve focused mainly on statistical (causal) inference: using estimators and their distributional properties to learn about underlying, unknown population parameters.
\[ y_i = \textcolor{#e64173}{\hat{\beta}_{0}} + \textcolor{#e64173}{\hat{\beta}_{1}} \, x_{1i} + \textcolor{#e64173}{\hat{\beta}_{2}} \, x_{2i} + \cdots + \textcolor{#e64173}{\hat{\beta}_{k}} \, x_{ki} + e_i \]
Prediction includes a fairly different set of topics/tools within econometrics (and data science/machine learning)—creating models that accurately estimate individual observations.
\[ \textcolor{#e64173}{\hat{y}_i} = \mathop{\hat{f}}\left( x_1,\, x_2,\, \ldots x_k \right) \]
Succinctly
Inference: causality, \(\hat{\beta}_k\) (consistent and efficient), standard errors/hypothesis tests for \(\hat{\beta}_k\), generally OLS
Prediction: correlation, \(\hat{y}_i\) (low error), model selection, nonlinear models are much more common
Much of modern (micro)econometrics focuses on causally estimating (identifying) the effect of programs/policies, e.g.,
In this literature, the program is often a binary variable, and we place high importance on finding an unbiased estimate for the program’s effect.
Our linearity assumption requires
We allow nonlinear relationships between \(y\) and the explanatory variables.
Examples
Polynomials and interactions: \(y_i = \beta_0 + \beta_1 x_1 + \beta_2 x_1^2 + \beta_3 x_2 + \beta_4 x_2^2 + \beta_5 \left( x_1 x_2 \right) + u_i\)
Exponentials and logs: \(\log(y_i) = \beta_0 + \beta_1 x_1 + \beta_2 e^{x_2} + u_i\)
Indicators and thresholds: \(y_i = \beta_0 + \beta_1 x_1 + \beta_2 \, \mathbb{I}(x_1 \geq 100) + u_i\)
Transformation challenge: (literally) infinite possibilities. What do we pick?
\(y_i = \beta_0 + u_i\)
\(y_i = \beta_0 + \beta_1 x + u_i\)
\(y_i = \beta_0 + \beta_1 x + \beta_2 x^2 + u_i\)
\(y_i = \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 x^3 + u_i\)
\(y_i = \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 x^3 + \beta_4 x^4 + u_i\)
\(y_i = \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 x^3 + \beta_4 x^4 + \beta_5 x^5 + u_i\)
Truth: \(y_i = 2 e^{x} + u_i\)
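The sequence of polynomial fits can be sketched in base R. The truth \(y_i = 2e^{x} + u_i\) is taken from above; the range of \(x\), the noise level, and the sample size are assumptions, so this will not reproduce any particular figure.

```r
# Simulate from the stated truth; range, noise, and n are assumptions
set.seed(10)
n = 200
x = runif(n, min = 0, max = 3)
y = 2 * exp(x) + rnorm(n, sd = 2)

# Fit polynomials of degree 0 through 5 and record each model's SSE
sse = sapply(0:5, function(p) {
  reg = if (p == 0) lm(y ~ 1) else lm(y ~ poly(x, degree = p))
  sum(resid(reg)^2)
})
names(sse) = paste0("degree_", 0:5)

# In-sample SSE can only fall as the degree rises (the models are nested)
round(sse, 1)
```

A falling in-sample SSE is not, by itself, evidence that a higher-degree polynomial is the better choice, which is the point of the transformation challenge.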
Because OLS minimizes the sum of the squared errors, outliers can play a large role in our estimates.
Common responses
remove the outliers from the dataset;
related: leave-one-out regression to identify influential observations;
replace outliers with the 99th percentile of their variable (winsorize);
take the log of the variable to “take care of” outliers.
Another option
Do nothing. Outliers are not always bad. Some people are “far” from the average. It may not make sense to try to change this variation.
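A base-R sketch of how one extreme observation can move the OLS slope, along with two of the responses listed above. The data and the outlier are simulated assumptions, and `winsorize()` is a hypothetical helper written for this example (it clips at both the 1st and 99th percentiles).

```r
# Simulated data plus one extreme observation (all values are assumptions)
set.seed(21)
n = 50
x = rnorm(n)
y = 1 + 2 * x + rnorm(n)
x_out = c(x, 5)
y_out = c(y, -40)

b_clean = unname(coef(lm(y ~ x))["x"])
reg_out = lm(y_out ~ x_out)
b_out   = unname(coef(reg_out)["x_out"])
c(clean = b_clean, with_outlier = b_out)

# Leave-one-out influence: change in the slope when each observation is dropped
infl = dfbeta(reg_out)[, "x_out"]
most_influential = unname(which.max(abs(infl)))

# Hypothetical helper: clip a variable at its 1st and 99th percentiles
winsorize = function(v, probs = c(0.01, 0.99)) {
  q = unname(quantile(v, probs))
  pmin(pmax(v, q[1]), q[2])
}
y_wins = winsorize(y_out)
range(y_out)
range(y_wins)
```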
Similarly, missing data can affect your results.
R doesn’t know how to deal with a missing observation.
#> [1] NA
If you run a regression with missing values, R drops the observations missing those values.
If the observations are missing in a nonrandom way, a random sample may end up nonrandom.
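A short base-R illustration of both behaviors, using made-up values and simulated data (assumptions throughout): arithmetic with `NA` returns `NA` unless told otherwise, and `lm()` drops incomplete observations under R's default `na.action`.

```r
# Arithmetic with a missing value returns NA unless NAs are removed
sum(c(1, 2, 3, NA, 5))
sum(c(1, 2, 3, NA, 5), na.rm = TRUE)

# Simulated regression data with 10 missing x values
set.seed(4)
n = 100
x = rnorm(n)
y = 1 + x + rnorm(n)
x[1:10] = NA

# With the default na.action (na.omit), lm() uses only complete observations
reg = lm(y ~ x)
n_used    = nobs(reg)
n_dropped = n - n_used
c(used = n_used, dropped = n_dropped)
```

Checking `nobs()` against the number of rows in the data is a cheap way to notice how many observations a regression dropped.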
We’ve refreshed the main ingredients for OLS regression:
So far, the big message has been
Next: What happens when we violate the assumptions on the variance or covariance of \(u_i\)?