class: center, middle, inverse, title-slide .title[ # ScPoEconometrics ] .subtitle[ ## Linear Regression Extensions ] .author[ ### Florian Oswald, Mylène Feuillade, Gustave Kenedi and Junnan He ] .date[ ### SciencesPo Paris 2022-10-24 ] --- layout: true <div class="my-footer"><img src="../img/logo/ScPo-shield.png" style="height: 60px;"/></div> --- # Today - Linear Regression Extensions Depending on the data and the relationships between the variables of interest, you may need to move away from the baseline model. -- We will focus on 3 important variations: 1. ***Non-linear relationships***: log and polynomial models 1. ***Interactions*** between variables 1. ***Standardized*** regression -- In each case, the way we estimate these coefficients does not change (i.e. OLS). -- Empirical applications: (i) *college tuition* and *earnings potential*, (ii) *wage*, *education* and *gender*, (iii) *class size* and *student performance* --- layout: false class: title-slide-section-red, middle # Non-Linear Relationships --- layout: true <div class="my-footer"><img src="../img/logo/ScPo-shield.png" style="height: 60px;"/></div> --- # Accounting for Non-Linear Relationships There are two main "methods": -- 1. ***Log*** models -- 1. ***Polynomial*** models --- # Log Models * The models we have seen so far can be called ***level-level*** specifications. Both the dependent and the independent variables have been measured in levels. -- * This *level* can be: euros, years, number of students,... and even percentages. -- * Taking the *natural* logarithm of the dependent and/or the independent variable(s) leads us to define 3 other types of regressions (abuse of notation: `\(\ln(x) = \log_{e}(x)=\log(x)\)`): * ***Log - level***: `\(\quad \log(y_i) = b_0 + b_1 x_{1,i} + ... + e_i\)` * ***Level - log***: `\(\quad \textrm{y}_i = b_0 + b_1 \log(x_{1,i}) + ... + e_i\)` * ***Log - log***: `\(\quad \log(y_i) = b_0 + b_1 \log(x_{1,i}) + ... 
+ e_i\)` --- # The (natural) log Function: A Primer <img src="chapter_regext_files/figure-html/unnamed-chunk-1-1.svg" style="display: block; margin: auto;" /> --- # The (natural) log Function: A Primer The [natural log function](https://en.wikipedia.org/wiki/Natural_logarithm) is the inverse function of the exponential function, i.e. `\(\log(\exp(x))=x\)` -- `\(\rightarrow\)` since for all `\(x\)`, `\(\exp(x)>0 \implies\)` the natural log function is only defined for ***strictly positive values***! (It is not defined at 0!) -- ⚠️ You can only log your variables if they don't take 0 or negative values! Always think about this when taking the log of your dependent or independent variable(s). --- # The (natural) log Function: A Primer If a variable has a very ***skewed distribution***, taking the log will render it more ***normally distributed*** -- .pull-left[ <img src="chapter_regext_files/figure-html/unnamed-chunk-2-1.svg" style="display: block; margin: auto;" /> ] -- .pull-right[ <img src="chapter_regext_files/figure-html/unnamed-chunk-3-1.svg" style="display: block; margin: auto;" /> ] --- # Log Models: Simplified Interpretations | Specification | Model | Interpretation of `\(b_1\)` | |--------------------|:---------:|:-----------------------------------:| | Level - Level | `\(y = b_0 + b_1 x + e\)` | .small[A **one unit** increase in ] `\(x\)` .small[ is associated, on average, with a ] `\(b_1\)` .small[**unit change** in y] | | Log - Level | `\(\log(y) = b_0 + b_1 x + e\)` | .small[A **one unit** increase in ] `\(x\)` .small[ is associated, on average, with a] `\(b_1 \times 100\)` .small[ **percent change** in y] | | Level - Log | `\(y = b_0 + b_1 \log(x) + e\)` | .small[A **one percent** increase in ] `\(x\)` .small[ is associated, on average, with a ] `\(b_1 / 100\)` .small[**unit change** in y] | | Log - Log | `\(\log(y) = b_0 + b_1 \log(x) + e\)` | .small[A **one percent** increase in ] `\(x\)` .small[ is associated, on average, with a] `\(b_1\)` 
.small[**percent change** in y] | -- * This may look like a set of cooking recipes, but of course it can be [derived with some relatively simple maths](https://stats.idre.ucla.edu/other/mult-pkg/faq/general/faqhow-do-i-interpret-a-regression-model-when-some-variables-are-log-transformed/). -- * ⚠️ These interpretations are only true for ***small*** changes in `\(x\)` and/or small `\(b_1\)`. What happens if we want to know the change in `\(y\)` for big changes in `\(x\)` or when `\(b_1\)` is large? --- name: gen_log # Log Models: General Interpretations For ***any increase in `\(x\)`, `\(\Delta x,\)` and any `\(b_1\)`*** `\((\Delta x = 5\% = 0.05 \implies 1 + \Delta x = 1.05)\)`: | Specification | Model | Interpretation of `\(b_1\)` | |--------------------|:---------:|:-----------------------------------:| | Level - Level | `\(y = b_0 + b_1 x + e\)` | .small[A **one unit** increase in ] `\(x\)` .small[ is associated, on average, with a ] `\(b_1\)` .small[**unit change** in y] | | Log - Level | `\(\log(y) = b_0 + b_1 x + e\)` | .small[A **one unit** increase in ] `\(x\)` .small[ is associated, on average, with a] `\((e^{b_1} - 1) \times 100\)` .small[ **percent change** in y] | | Level - Log | `\(y = b_0 + b_1 \log(x) + e\)` | .small[A ] ** `\(\Delta x\)`** .small[**percent** increase in ] `\(x\)` .small[ is associated, on average, with a ] `\(b_1 \times \log(1 + \Delta x)\)` .small[**unit change** in y] | | Log - Log | `\(\log(y) = b_0 + b_1 \log(x) + e\)` | .small[A ] ** `\(\Delta x\)`** .small[**percent** increase in ] `\(x\)` .small[ is associated, on average, with a] `\(((1 + \Delta x)^{b_1} - 1) \times 100\)` .small[**percent change** in y] | ([*Appendix:*](#log_approx) Why are the previously shown approximations true?) --- # When Should You Use log Models? 1. If the relationship between `\(x\)` and `\(y\)` looks like a log or exponential function. 
-- .pull-left[ <img src="chapter_regext_files/figure-html/unnamed-chunk-4-1.svg" style="display: block; margin: auto;" /> ] -- .pull-right[ <img src="chapter_regext_files/figure-html/unnamed-chunk-5-1.svg" style="display: block; margin: auto;" /> ] --- # When Should You Use log Models? 1. If the relationship between `\(x\)` and `\(y\)` looks like a log or exponential function. .pull-left[ <img src="chapter_regext_files/figure-html/unnamed-chunk-6-1.svg" style="display: block; margin: auto;" /> ] -- .pull-right[ <img src="chapter_regext_files/figure-html/unnamed-chunk-7-1.svg" style="display: block; margin: auto;" /> ] --- # When Should You Use log Models? 1. If the relationship between `\(x\)` and `\(y\)` looks like a log or exponential function. 1. To easily interpret coefficients as <a href="https://en.wikipedia.org/wiki/Elasticity_(economics)">***elasticities***</a> which play a central role in economic theory. ***Elasticity of `\(y\)` with respect to `\(x\)`:*** percent change in `\(y\)` following a 1% increase in `\(x\)`. --- # Accounting for Other Types of Non-Linear Relationships What if the relationship between `\(x\)` and `\(y\)` is not exponential/log? -- `\(\rightarrow\)` ***polynomial*** regressions: just take a polynomial function of the regressor! --- # Polynomial Wut? .pull-left[ <img src="chapter_regext_files/figure-html/unnamed-chunk-8-1.svg" style="display: block; margin: auto;" /> ] -- .pull-right[ <img src="chapter_regext_files/figure-html/unnamed-chunk-9-1.svg" style="display: block; margin: auto;" /> ] --- # Polynomial Regressions What does this mean in practice? 
-- `\(\rightarrow\)` add a higher order of the regressor to the regression, depending on the visual (or expected) relationship -- Several ways of doing this in `R`: .pull-left[ ```r lm(y ~ x + I(x^2) + I(x^3), data) ``` ] -- .pull-right[ ```r lm(y ~ poly(x, 3, raw = TRUE), data) ``` ] --- # Polynomial Regressions .pull-left[ ***2nd order:*** <img src="chapter_regext_files/figure-html/unnamed-chunk-12-1.svg" style="display: block; margin: auto;" /> ] -- .pull-right[ ***3rd order:*** <img src="chapter_regext_files/figure-html/unnamed-chunk-13-1.svg" style="display: block; margin: auto;" /> ] --- <img src="chapter_regext_files/figure-html/curve_fitting.png" width="378px" height="600px" style="display: block; margin: auto;" /> --- class:inverse # Task 1: Non-linear relationships
1. Load the data [here](https://www.dropbox.com/s/2v5mb04nzw2u7bd/college_tuition_income.csv?dl=1). This dataset contains information about tuition and estimated incomes of graduates for universities in the US. More details can be found [here](https://github.com/rfordatascience/tidytuesday/blob/master/data/2020/2020-03-10/readme.md). 2. Create a scatter plot of estimated mid career pay (`mid_career_pay`) (*y*-axis) as a function of out of state tuition (`out_of_state_tuition`) (*x*-axis). Would you say the relationship is broadly linear or rather non-linear? Use `geom_smooth(method = "lm", se = F) + geom_smooth(method = "lm", se = F, formula = y ~ poly(x, 2, raw = T))` to fit both a linear and a 2nd order regression line. Which one seems more appropriate this time? 3. Create a variable equal to out of state tuition divided by 1000. Regress mid career pay on out of state tuition divided by 1000. Interpret the coefficient. 4. Regress mid career pay on out of state tuition divided by 1000 and its square. *Hint:* you can use either `poly(x, 2, raw = T)` or `x + I(x^2)`, where `x` is your regressor. What does the positive sign on the squared term imply? --- layout: false class: title-slide-section-red, middle # Interaction Terms --- layout: true <div class="my-footer"><img src="../img/logo/ScPo-shield.png" style="height: 60px;"/></div> --- # Interacting Regressors * We interact two regressors when we believe ***the effect of one depends on the value of the other***. * *Example:* The returns to education on wages vary by gender. -- * In practice, if we interact `\(x_1\)` and `\(x_2\)`, we would write our model like this: `$$y_i = b_0 + b_1 x_{1,i} + b_2 x_{2,i} + b_3 x_{1,i} \times x_{2,i} + ... + e_i$$` -- * The interpretation of `\(b_1\)`, `\(b_2\)`, and `\(b_3\)` will depend on the type of `\(x_1\)` and `\(x_2\)`. -- * Let's focus on the cases where one regressor is a ***dummy/categorical*** variable and the other is ***continuous***. 
* It will give you the intuition for the other cases: * Both regressors are dummies/categorical variables, * Both regressors are continuous variables. --- # Interacting Regressors Let's go back to the *STAR* experiment data. -- How does the effect of being in a small vs regular class vary with the experience of the teacher? -- Our regression model becomes: $$ \textrm{score}_i = \color{#d96502}{b_0} + \color{#027D83}{b_1} \textrm{small}_i + \color{#02AB0D}{b_2} \textrm{experience}_i + \color{#d90502}{b_3} \textrm{small}_i \times \textrm{experience}_i + e_i$$ -- Effect of a small class with a teacher with 10 years of experience? -- `\(\mathbb{E}[\textrm{score}_i | \textrm{small}_i = 1 \textrm{ & experience}_i = 10] = \color{#d96502}{b_0} + \color{#027D83}{b_1} + \color{#02AB0D}{b_2}*10 + \color{#d90502}{b_3}*10\)` -- `\(\mathbb{E}[\textrm{score}_i | \textrm{small}_i = 0 \textrm{ & experience}_i = 10] = \color{#d96502}{b_0} + \color{#02AB0D}{b_2}*10\)` -- `\(\begin{split} \mathbb{E}[\textrm{score}_i &| \textrm{small}_i = 1 \textrm{ & experience}_i = 10] - \mathbb{E}[\textrm{score}_i | \textrm{small}_i = 0 \textrm{ & experience}_i = 10] \\ &= \color{#d96502}{b_0} + \color{#027D83}{b_1} + \color{#02AB0D}{b_2}*10 + \color{#d90502}{b_3}*10 - (\color{#d96502}{b_0} + \color{#02AB0D}{b_2}*10) \\ &= \color{#027D83}{b_1} + \color{#d90502}{b_3}*10 \end{split}\)` --- # Interacting Regressors Running the regression for the `math` score (for all grades), we obtain: ```r lm(math ~ small + experience + small*experience, star_df) ``` ``` ## ## Call: ## lm(formula = math ~ small + experience + small * experience, ## data = star_df) ## ## Coefficients: ## (Intercept) smallTRUE experience ## 534.1919 15.8906 1.3305 ## smallTRUE:experience ## -0.3034 ``` ***Interpretation:*** -- * The interaction term allows the impact of being in a small class to vary with the experience of the teacher. 
-- * In particular, we still observe a ***positive impact of being in a small class*** on math score, * but this ***effect is decreasing in the experience of the teacher***. --- # Interacting Regressors: Visually $$ \textrm{score}_i = \color{#d96502}{b_0} + \color{#027D83}{b_1} \textrm{small}_i + \color{#02AB0D}{b_2} \textrm{experience}_i + \color{#d90502}{b_3} \textrm{small}_i * \textrm{experience}_i + e_i$$ <img src="chapter_regext_files/figure-html/graph_base.png" width="90%" style="display: block; margin: auto;" /> --- # Interacting Regressors: Visually $$ \textrm{score}_i = \color{#d96502}{b_0} + \color{#027D83}{b_1} \textrm{small}_i + \color{#02AB0D}{b_2} \textrm{experience}_i + \color{#d90502}{b_3} \textrm{small}_i * \textrm{experience}_i + e_i$$ <img src="chapter_regext_files/figure-html/graph_reg.png" width="90%" style="display: block; margin: auto;" /> --- # Interacting Regressors: Visually $$ \textrm{score}_i = \color{#d96502}{b_0} + \color{#027D83}{b_1} \textrm{small}_i + \color{#02AB0D}{b_2} \textrm{experience}_i + \color{#d90502}{b_3} \textrm{small}_i * \textrm{experience}_i + e_i$$ <img src="chapter_regext_files/figure-html/graph_reg_b0.png" width="90%" style="display: block; margin: auto;" /> --- # Interacting Regressors: Visually $$ \textrm{score}_i = \color{#d96502}{b_0} + \color{#027D83}{b_1} \textrm{small}_i + \color{#02AB0D}{b_2} \textrm{experience}_i + \color{#d90502}{b_3} \textrm{small}_i * \textrm{experience}_i + e_i$$ <img src="chapter_regext_files/figure-html/graph_reg_b0_b1.png" width="90%" style="display: block; margin: auto;" /> --- # Interacting Regressors: Visually $$ \textrm{score}_i = \color{#d96502}{b_0} + \color{#027D83}{b_1} \textrm{small}_i + \color{#02AB0D}{b_2} \textrm{experience}_i + \color{#d90502}{b_3} \textrm{small}_i * \textrm{experience}_i + e_i$$ <img src="chapter_regext_files/figure-html/graph_reg_b0_b1_b2.png" width="90%" style="display: block; margin: auto;" /> --- # Interacting Regressors: Visually $$ 
\textrm{score}_i = \color{#d96502}{b_0} + \color{#027D83}{b_1} \textrm{small}_i + \color{#02AB0D}{b_2} \textrm{experience}_i + \color{#d90502}{b_3} \textrm{small}_i * \textrm{experience}_i + e_i$$ <img src="chapter_regext_files/figure-html/graph_reg_b0_b1_b2_b3.png" width="90%" style="display: block; margin: auto;" /> --- class:inverse # Task 2: Wages, education and gender in 1985
1. Load the data `CPS1985` from the `AER` package. 1. Look at the `help` to get the definition of each variable: `?CPS1985` 1. We don't know if people are working part-time or full-time; does it matter here? 1. Create the `log_wage` variable equal to the log of `wage`. 1. Regress `log_wage` on `gender` and `education`, and save it as `reg1`. Interpret each coefficient. 1. Regress `log_wage` on `gender`, `education` and their interaction `gender*education`, and save it as `reg2`. Interpret each coefficient. Does the gender wage gap decrease with education? 1. Create a plot showing this interaction. (*Hint:* use the `color = gender` argument in `aes` and `geom_smooth(method = "lm", se = F)` to obtain a regression line per gender.) --- layout: false class: title-slide-section-red, middle # Standardized Regression --- layout: true <div class="my-footer"><img src="../img/logo/ScPo-shield.png" style="height: 60px;"/></div> --- # Standardized Regression Let's define what *standardizing* a variable means. > ***Standardizing*** a variable `\(z\)` means to *demean* the variable and to divide the demeaned value by its own standard deviation: $$ z_i^{stand} = \frac{z_i - \bar z}{\sigma(z)}$$ where `\(\bar z\)` is the mean of `\(z\)` and `\(\sigma(z)\)` is the standard deviation of `\(z\)`, i.e. `\(\sigma(z) = \sqrt{\textrm{Var}(z)}\)`. -- `\(z^{stand}\)` now has mean 0 and standard deviation 1, i.e. `\(\overline{z^{stand}} = 0\)` and `\(\sigma(z^{stand}) = 1\)`. -- Intuitively, standardizing ***puts variables on the same scale*** so we can compare them. In our class size and student performance example, it will help to interpret: * The **magnitude** of the effects, * The **relative importance of each variable**. 
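To make the definition concrete, here is a minimal `R` sketch (on simulated data, not one of our datasets) showing that the by-hand formula and `R`'s built-in `scale()` function agree:

```r
# Standardize a simulated variable by hand and with scale()
set.seed(1)
z <- rnorm(100, mean = 50, sd = 10)

z_stand_hand  <- (z - mean(z)) / sd(z)  # (z_i - mean of z) / sd of z
z_stand_scale <- as.numeric(scale(z))   # same operation via scale()

all.equal(z_stand_hand, z_stand_scale)  # TRUE
round(mean(z_stand_hand), 10)           # 0: standardized mean is zero
sd(z_stand_hand)                        # 1: standardized sd is one
```

Note that `scale()` returns a matrix, hence the `as.numeric()` wrapper when you want a plain vector (e.g. inside `mutate`).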
--- # Standardized Regression: Graphically .pull-left[ <img src="chapter_regext_files/figure-html/graph_before.png" width="3200" style="display: block; margin: auto;" /> ] -- .pull-right[ <img src="chapter_regext_files/figure-html/graph_after.png" width="3200" style="display: block; margin: auto;" /> ] --- # Standardized Regression: Interpretation If the ***dependent*** variable `\(y\)` is standardized, i.e. the model is `\(\color{#d90502}{y^{stand}} = b_0 + \sum_{k=1}^Kb_kx_k +e\)`: -- * By definition, `\(b_k\)` measures the predicted change in ** `\(y^{stand}\)` ** associated with a one unit increase in `\(x_k\)`. * If `\(y^{stand}\)` increases by one, it means that `\(y\)` increases by one standard deviation. So `\(b_k\)` measures the change in `\(y\)` **as a share of `\(y\)`'s standard deviation**. -- If the ***regressor*** `\(x_k\)` is standardized, i.e. the model is `\(y = b_0 + \sum_{k=1}^Kb_k\color{#d90502}{x_k^{stand}} +e\)`: -- * By definition, `\(b_k\)` measures the predicted change in `\(y\)` associated with a one unit increase in ** `\(x_k^{stand}\)` **. * If `\(x_k^{stand}\)` increases by one unit, it means that `\(x_k\)` increases by one standard deviation. So `\(b_k\)` measures the predicted change in `\(y\)` **associated with an increase in `\(x_k\)` by one standard deviation**. --- class:inverse # Task 3: Standardized regression
Let's go back to our [grades](https://www.dropbox.com/s/wwp2cs9f0dubmhr/grade5.dta?dl=1) dataset. Remember that to load the data you need to use the `read_dta()` function from the `haven` package. These are the estimates we got from regressing average math test scores on the full set of regressors. ``` ## (Intercept) classize disadvantaged school_enrollment ## 78.560725298 0.003320773 -0.389333008 0.000758258 ## female religious ## 0.923710499 2.876146701 ``` 1. Create a new variable `avgmath_stand` equal to the standardized math score. You can use the `scale()` function (combined with `mutate`) or do it by hand with base `R`. 1. Run the full regression using the standardized math test score as the dependent variable. Interpret the coefficients and their magnitude. 1. Create the standardized variables for each *continuous* regressor as `<regressor>_stand`. * Would it make sense to standardize the `religious` variable? 1. Regress `avgmath_stand` on the full set of standardized regressors and `religious`. Discuss the relative influence of the regressors. --- # Teaser for the Next 3 Lectures * You may have noticed that since the beginning we have always worked with **samples** drawn from the overall population. -- * Each time, imagine we could draw another sample from the population: * Would we obtain the same results? * In other words, how confident can we be that our estimates (sign, magnitude) are not just driven by randomness? -- * We will answer those kinds of questions: * We'll present the notion of **sampling**, and * Understand what **statistical inference** is and how to do it. --- # On the way to causality ✅ How to manage data? Read it, tidy it, visualise it! ✅ **How to summarise relationships between variables?** Simple and multiple linear regression, non-linear regressions, interactions... ❌ What is causality? ❌ What if we don't observe an entire population? ❌ Are our findings just due to randomness? ❌ How to find exogeneity in practice? 
--- class: title-slide-final, middle background-image: url(../img/logo/ScPo-econ.png) background-size: 250px background-position: 9% 19% # SEE YOU NEXT WEEK! | | | | :--------------------------------------------------------------------------------------------------------- | :-------------------------------- | | <a href="mailto:florian.oswald@sciencespo.fr">.ScPored[<i class="fa fa-paper-plane fa-fw"></i>] | florian.oswald@sciencespo.fr | | <a href="https://github.com/ScPoEcon/ScPoEconometrics-Slides">.ScPored[<i class="fa fa-link fa-fw"></i>] | Slides | | <a href="https://scpoecon.github.io/ScPoEconometrics">.ScPored[<i class="fa fa-link fa-fw"></i>] | Book | | <a href="http://twitter.com/ScPoEcon">.ScPored[<i class="fa fa-twitter fa-fw"></i>] | @ScPoEcon | | <a href="http://github.com/ScPoEcon">.ScPored[<i class="fa fa-github fa-fw"></i>] | @ScPoEcon | --- name: log_approx # Log Models: Approximations Why are the approximations shown previously true? -- ***Log-Level*** *General interpretation:* A **one unit** increase in `\(x\)` is associated, on average, with a `\((e^{b_1} - 1) \times 100\)` **percent change** in y. *Simplified interpretation:* A **one unit** increase in `\(x\)` is associated, on average, with a `\(b_1 \times 100\)` **percent change** in y. -- This is because, for small `\(b_1\)`, `\(e^{b_1} \approx 1 + b_1 \iff b_1 \approx e^{b_1} - 1\)` -- `\(\rightarrow\)` for `\(b_1 = \color{#d90502}{0.04}\)`, `\(e^{b_1} - 1 = e^{0.04} - 1 = 0.0408\)` -- `\(\rightarrow\)` for `\(b_1 = \color{#d90502}{0.5}\)`, `\(e^{b_1} - 1 = e^{0.5} - 1 = 0.6487\)` --- # Log Models: Approximations Why are the approximations shown previously true? ***Level-Log*** *General interpretation:* A ** `\(\Delta x\)`** **percent** increase in `\(x\)` is associated, on average, with a `\(b_1 \times \log(1 + \Delta x)\)` **unit change** in y. *Simplified interpretation:* A **one percent** increase in `\(x\)` is associated, on average, with a `\(b_1 / 100\)` **unit change** in y. 
-- This is because for small `\(\Delta x\)`, `\(\log(1 + \Delta x) \approx \Delta x\)` -- `\(\rightarrow\)` for `\(\Delta x = \color{#d90502}{1\%}=0.01\)`, `\(\log(1+\Delta x) = \log(1.01) = 0.01\)` (hence the `\(/100\)` in the simplified interpretation) -- `\(\rightarrow\)` for `\(\Delta x = \color{#d90502}{20\%}=0.20\)`, `\(\log(1+\Delta x) = \log(1.20) = 0.18\)` --- # Log Models: Approximations Why are the approximations shown previously true? ***Log-Log*** *General interpretation:* A ** `\(\Delta x\)`** **percent** increase in `\(x\)` is associated, on average, with a `\(((1 + \Delta x)^{b_1} - 1) \times 100\)` **percent change** in y. *Simplified interpretation:* A **one percent** increase in `\(x\)` is associated, on average, with a `\(b_1\)` **percent change** in y. -- This is because for small `\(|b_1|\times \Delta x\)`, `\((1 + \Delta x)^{b_1} \approx 1 + b_1 \times \Delta x \iff b_1 \times \Delta x \times 100 \approx ((1 + \Delta x)^{b_1} - 1) \times 100\)` -- `\(\rightarrow\)` for `\(\Delta x = \color{#d90502}{1\%}=0.01\)` and `\(b_1 = \color{#d90502}{0.5}\)`, `\(((1+\Delta x)^{b_1} - 1) \times 100 = (1.01^{0.5} - 1) \times 100 = 0.5\)` -- `\(\rightarrow\)` for `\(\Delta x = \color{#d90502}{10\%}=0.10\)` and `\(b_1 = \color{#d90502}{10}\)`, `\(((1+\Delta x)^{b_1} - 1) \times 100 = (1.1^{10} - 1) \times 100 = 159.37\)` [back](#gen_log)
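As a quick sanity check, the numbers on these appendix slides can be reproduced directly in `R`:

```r
# Log-level: exact vs. approximate percent change for a one-unit increase in x
(exp(0.04) - 1) * 100  # 4.08, close to the approximation b1 * 100 = 4
(exp(0.5) - 1) * 100   # 64.87, far from the approximation b1 * 100 = 50

# Level-log: log(1 + delta_x) vs. delta_x
log(1.01)              # 0.00995, close to 0.01
log(1.20)              # 0.18

# Log-log: exact percent change in y for a delta_x percent increase in x
(1.01^0.5 - 1) * 100   # 0.50
(1.1^10 - 1) * 100     # 159.37
```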