class: center, middle, inverse, title-slide

# Regression Logic
## EC 320: Introduction to Econometrics
### Winter 2022

---
class: inverse, middle

# Prologue

---
# Housekeeping

- Lab 3 & Exercise 3 & Extra office hour
  - Lab 3 @ 4 p.m.
  - Exercise 3 due today
  - Extra office hour @ 7 p.m.
- Problem Set 1 solution will be available later today.
- Problem Set due dates changed
  - Extra three days
  - Due Monday instead of Friday starting Problem Set 2
- Midterm 1 next week (Wednesday)
  - Midterm review on Monday

---
# Last Time

1. Fundamental problem of econometrics
2. Selection bias
3. Randomized control trials

---
class: inverse, middle

# Regression Logic

---
# Regression

Economists often rely on (linear) regression for statistical comparisons.

- *"Linear"* is more flexible than you think.

Regression analysis helps us make *other things equal* comparisons.

- We can model the effect of `\(X\)` on `\(Y\)` while .hi[controlling] .pink[for potential confounders].
- Forces us to be explicit about the potential sources of selection bias.
- Failure to control for confounding variables leads to .hi[omitted-variable bias], a close cousin of selection bias.

---
# Returns to Private College

**Research Question:** Does going to a private college instead of a public college increase future earnings?

- **Outcome variable:** earnings
- **Treatment variable:** going to a private college (binary)

--

**Q:** How might a private school education increase earnings?

--

**Q:** Does a comparison of the average earnings of private college graduates with those of public school graduates .pink[isolate the economic returns to private college education]? Why or why not?

---
# Returns to Private College

**How might we estimate the causal effect of private college on earnings?**

**Approach 1:** Compare average earnings of private college graduates with those of public college graduates.

- Prone to selection bias.

**Approach 2:** Use a matching estimator that compares the earnings of individuals with the same admissions profiles.

- Cleaner comparison than a simple difference-in-means.
- Somewhat difficult to implement.
- Throws away data (inefficient).

**Approach 3:** Estimate a regression that compares the earnings of individuals with the same admissions profiles.

<!-- --- -->
<!-- # Difference-in-Means, Take 2 -->
<!-- ## Example: Returns to private college -->
<!-- show same data with groupings based on application profile; what are the differences/similarities within/across groups?; calculate within-group diff-in-means; take average of these (unweighted, then weighted); show and discuss causal diagram -->
<!-- --- -->
<!-- # Difference-in-Means, Regression style -->
<!-- ## Example: Returns to private college -->
<!-- write pop model, describe coefficients and regression lingo; hand wave about OLS and estimated pop model; run regression of example data -->

---
# The Regression Model

We can estimate the effect of `\(X\)` on `\(Y\)` by estimating a .hi[regression model]:

`$$Y_i = \beta_0 + \beta_1 X_i + u_i$$`

- `\(Y_i\)` is the outcome variable.

--

- `\(X_i\)` is the treatment variable (continuous).

--

- `\(u_i\)` is an error term that includes all other (omitted) factors affecting `\(Y_i\)`.

--

- `\(\beta_0\)` is the **intercept** parameter.

--

- `\(\beta_1\)` is the **slope** parameter.

---
# Running Regressions

The intercept and slope are population parameters.

Using an estimator with data on `\(X_i\)` and `\(Y_i\)`, we can estimate a .hi[fitted regression line]:

`$$\hat{Y_i} = \hat{\beta}_0 + \hat{\beta}_1 X_i$$`

- `\(\hat{Y_i}\)` is the **fitted value** of `\(Y_i\)`.
- `\(\hat{\beta}_0\)` is the **estimated intercept**.
- `\(\hat{\beta}_1\)` is the **estimated slope**.

--

The estimation procedure produces misses called .hi[residuals], defined as `\(Y_i - \hat{Y_i}\)`.
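
---
# Running Regressions
## Example: Simulated data

A minimal sketch of the fitted line and residuals defined above, assuming simulated data rather than a course dataset. The sample size, the "true" intercept (2) and slope (0.5), and all object names are hypothetical choices for illustration; `lm()` from `base` .mono[R] serves as the estimator.

```r
# Simulated data: the population intercept (2) and slope (0.5) are
# assumptions chosen for illustration, not estimates from real data.
set.seed(320)
n <- 100
x <- runif(n, min = 0, max = 10)  # treatment variable X_i
u <- rnorm(n, mean = 0, sd = 1)   # error term u_i (unobserved in practice)
y <- 2 + 0.5 * x + u              # population model: Y_i = b0 + b1 * X_i + u_i

# Estimate the fitted regression line
fit <- lm(y ~ x)
b0_hat <- coef(fit)[["(Intercept)"]]  # estimated intercept
b1_hat <- coef(fit)[["x"]]            # estimated slope

# Fitted values and residuals, computed by hand from the definitions
y_hat <- b0_hat + b1_hat * x
e_hat <- y - y_hat

# The hand-computed values agree with what lm() stores
all.equal(unname(fitted(fit)), y_hat)
all.equal(unname(resid(fit)), e_hat)
```

Because the data are simulated, the estimates can be compared with the known population parameters. With real data, `\(\beta_0\)` and `\(\beta_1\)` are never observed.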

---
# Running Regressions

In practice, we estimate the regression coefficients using an estimator called .hi[Ordinary Least Squares] (OLS).

- Picks estimates that make `\(\hat{Y_i}\)` as close as possible to `\(Y_i\)` given the information we have on `\(X\)` and `\(Y\)`.
- We will dive into the weeds after the midterm.

---
# Running Regressions

OLS picks `\(\hat{\beta}_0\)` and `\(\hat{\beta}_1\)` that trace out the line of best fit. Ideally, we would like to interpret the slope of the line as the causal effect of `\(X\)` on `\(Y\)`.

<img src="05-Regression_Logic_files/figure-html/unnamed-chunk-2-1.svg" style="display: block; margin: auto;" />

---
# Confounders

However, the data are grouped by a third variable `\(W\)`. How would omitting `\(W\)` from the regression model affect the slope estimator?

<img src="05-Regression_Logic_files/figure-html/unnamed-chunk-3-1.svg" style="display: block; margin: auto;" />

---
# Confounders

The problem with `\(W\)` is that it affects both `\(Y\)` and `\(X\)`. Without adjusting for `\(W\)`, we cannot isolate the causal effect of `\(X\)` on `\(Y\)`.

<img src="05-Regression_Logic_files/figure-html/unnamed-chunk-4-1.svg" style="display: block; margin: auto;" />

---
# Controlling for Confounders

We can control for `\(W\)` by specifying it in the regression model:

`$$Y_i = \beta_0 + \beta_1 X_i + \beta_2 W_i + u_i$$`

- `\(W_i\)` is a **control variable**.
- By including `\(W_i\)` in the regression, OLS can difference out the confounding effect of `\(W\)`.
- **Note:** OLS doesn't care whether a right-hand side variable is a treatment or control variable, but we do.

---
# Controlling for Confounders

.center[![Control](control.gif)]

---
# Controlling for Confounders

Controlling for `\(W\)` "adjusts" the data by **differencing out** the group-specific means of `\(X\)` and `\(Y\)`. .hi-purple[Slope of the estimated regression line changes!]

<img src="05-Regression_Logic_files/figure-html/unnamed-chunk-6-1.svg" style="display: block; margin: auto;" />

---
# Controlling for Confounders

Can we interpret the estimated slope parameter as the causal effect of `\(X\)` on `\(Y\)` now that we've adjusted for `\(W\)`?

<img src="05-Regression_Logic_files/figure-html/unnamed-chunk-7-1.svg" style="display: block; margin: auto;" />

---
# Controlling for Confounders
## Example: Returns to schooling

Last class:

> **Q:** Could we simply compare the earnings of those with more education to those with less?
> <br> **A:** If we want to measure the causal effect, probably not.

.hi-green[What omitted variables should we worry about?]

---
# Controlling for Confounders
## Example: Returns to schooling

Three regressions ***of*** wages ***on*** schooling.

<table> <caption>Outcome variable: log(Wage)</caption> <thead> <tr> <th style="text-align:left;"> Explanatory variable </th> <th style="text-align:center;"> 1 </th> <th style="text-align:center;"> 2 </th> <th style="text-align:center;"> 3 </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;color: #23373b !important;line-height: 110%;font-style: italic;color: black !important;"> Intercept </td> <td style="text-align:center;color: #23373b !important;line-height: 110%;font-weight: bold;"> 5.571 </td> <td style="text-align:center;color: #23373b !important;line-height: 110%;"> 5.581 </td> <td style="text-align:center;color: #23373b !important;line-height: 110%;"> 5.695 </td> </tr> <tr> <td style="text-align:left;color: #23373b !important;color: #c2bebe !important;line-height: 110%;font-style: italic;color: black !important;"> </td> <td style="text-align:center;color: #23373b !important;color: #c2bebe !important;line-height: 110%;font-weight: bold;"> (0.039) </td> <td style="text-align:center;color: #23373b !important;color: #c2bebe !important;line-height: 110%;"> (0.066) </td> <td style="text-align:center;color: #23373b !important;color: #c2bebe !important;line-height: 
110%;"> (0.068) </td> </tr> <tr> <td style="text-align:left;color: #23373b !important;line-height: 110%;font-style: italic;color: black !important;"> Education </td> <td style="text-align:center;color: #23373b !important;line-height: 110%;font-weight: bold;"> 0.052 </td> <td style="text-align:center;color: #23373b !important;line-height: 110%;"> 0.026 </td> <td style="text-align:center;color: #23373b !important;line-height: 110%;"> 0.027 </td> </tr> <tr> <td style="text-align:left;color: #23373b !important;color: #c2bebe !important;line-height: 110%;font-style: italic;color: black !important;"> </td> <td style="text-align:center;color: #23373b !important;color: #c2bebe !important;line-height: 110%;font-weight: bold;"> (0.003) </td> <td style="text-align:center;color: #23373b !important;color: #c2bebe !important;line-height: 110%;"> (0.005) </td> <td style="text-align:center;color: #23373b !important;color: #c2bebe !important;line-height: 110%;"> (0.005) </td> </tr> <tr> <td style="text-align:left;color: #23373b !important;line-height: 110%;font-style: italic;color: black !important;"> IQ Score </td> <td style="text-align:center;color: #23373b !important;line-height: 110%;font-weight: bold;"> </td> <td style="text-align:center;color: #23373b !important;line-height: 110%;"> 0.004 </td> <td style="text-align:center;color: #23373b !important;line-height: 110%;"> 0.003 </td> </tr> <tr> <td style="text-align:left;color: #23373b !important;color: #c2bebe !important;line-height: 110%;font-style: italic;color: black !important;"> </td> <td style="text-align:center;color: #23373b !important;color: #c2bebe !important;line-height: 110%;font-weight: bold;"> </td> <td style="text-align:center;color: #23373b !important;color: #c2bebe !important;line-height: 110%;"> (0.001) </td> <td style="text-align:center;color: #23373b !important;color: #c2bebe !important;line-height: 110%;"> (0.001) </td> </tr> <tr> <td style="text-align:left;color: #23373b !important;line-height: 
110%;font-style: italic;color: black !important;"> South </td> <td style="text-align:center;color: #23373b !important;line-height: 110%;font-weight: bold;"> </td> <td style="text-align:center;color: #23373b !important;line-height: 110%;"> </td> <td style="text-align:center;color: #23373b !important;line-height: 110%;"> -0.127 </td> </tr> <tr> <td style="text-align:left;color: #23373b !important;color: #c2bebe !important;line-height: 110%;font-style: italic;color: black !important;"> </td> <td style="text-align:center;color: #23373b !important;color: #c2bebe !important;line-height: 110%;font-weight: bold;"> </td> <td style="text-align:center;color: #23373b !important;color: #c2bebe !important;line-height: 110%;"> </td> <td style="text-align:center;color: #23373b !important;color: #c2bebe !important;line-height: 110%;"> (0.019) </td> </tr> </tbody> </table> --- count: false # Controlling for Confounders ## Example: Returns to schooling Three regressions ***of*** wages ***on*** schooling. <table> <caption>Outcome variable: log(Wage)</caption> <thead> <tr> <th style="text-align:left;"> Explanatory variable </th> <th style="text-align:center;"> 1 </th> <th style="text-align:center;"> 2 </th> <th style="text-align:center;"> 3 </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;color: #23373b !important;line-height: 110%;font-style: italic;color: black !important;"> Intercept </td> <td style="text-align:center;color: #23373b !important;line-height: 110%;"> 5.571 </td> <td style="text-align:center;color: #23373b !important;line-height: 110%;font-weight: bold;"> 5.581 </td> <td style="text-align:center;color: #23373b !important;line-height: 110%;"> 5.695 </td> </tr> <tr> <td style="text-align:left;color: #23373b !important;color: #c2bebe !important;line-height: 110%;font-style: italic;color: black !important;"> </td> <td style="text-align:center;color: #23373b !important;color: #c2bebe !important;line-height: 110%;"> (0.039) </td> <td 
style="text-align:center;color: #23373b !important;color: #c2bebe !important;line-height: 110%;font-weight: bold;"> (0.066) </td> <td style="text-align:center;color: #23373b !important;color: #c2bebe !important;line-height: 110%;"> (0.068) </td> </tr> <tr> <td style="text-align:left;color: #23373b !important;line-height: 110%;font-style: italic;color: black !important;"> Education </td> <td style="text-align:center;color: #23373b !important;line-height: 110%;"> 0.052 </td> <td style="text-align:center;color: #23373b !important;line-height: 110%;font-weight: bold;"> 0.026 </td> <td style="text-align:center;color: #23373b !important;line-height: 110%;"> 0.027 </td> </tr> <tr> <td style="text-align:left;color: #23373b !important;color: #c2bebe !important;line-height: 110%;font-style: italic;color: black !important;"> </td> <td style="text-align:center;color: #23373b !important;color: #c2bebe !important;line-height: 110%;"> (0.003) </td> <td style="text-align:center;color: #23373b !important;color: #c2bebe !important;line-height: 110%;font-weight: bold;"> (0.005) </td> <td style="text-align:center;color: #23373b !important;color: #c2bebe !important;line-height: 110%;"> (0.005) </td> </tr> <tr> <td style="text-align:left;color: #23373b !important;line-height: 110%;font-style: italic;color: black !important;"> IQ Score </td> <td style="text-align:center;color: #23373b !important;line-height: 110%;"> </td> <td style="text-align:center;color: #23373b !important;line-height: 110%;font-weight: bold;"> 0.004 </td> <td style="text-align:center;color: #23373b !important;line-height: 110%;"> 0.003 </td> </tr> <tr> <td style="text-align:left;color: #23373b !important;color: #c2bebe !important;line-height: 110%;font-style: italic;color: black !important;"> </td> <td style="text-align:center;color: #23373b !important;color: #c2bebe !important;line-height: 110%;"> </td> <td style="text-align:center;color: #23373b !important;color: #c2bebe !important;line-height: 110%;font-weight: 
bold;"> (0.001) </td> <td style="text-align:center;color: #23373b !important;color: #c2bebe !important;line-height: 110%;"> (0.001) </td> </tr> <tr> <td style="text-align:left;color: #23373b !important;line-height: 110%;font-style: italic;color: black !important;"> South </td> <td style="text-align:center;color: #23373b !important;line-height: 110%;"> </td> <td style="text-align:center;color: #23373b !important;line-height: 110%;font-weight: bold;"> </td> <td style="text-align:center;color: #23373b !important;line-height: 110%;"> -0.127 </td> </tr> <tr> <td style="text-align:left;color: #23373b !important;color: #c2bebe !important;line-height: 110%;font-style: italic;color: black !important;"> </td> <td style="text-align:center;color: #23373b !important;color: #c2bebe !important;line-height: 110%;"> </td> <td style="text-align:center;color: #23373b !important;color: #c2bebe !important;line-height: 110%;font-weight: bold;"> </td> <td style="text-align:center;color: #23373b !important;color: #c2bebe !important;line-height: 110%;"> (0.019) </td> </tr> </tbody> </table> --- count: false # Controlling for Confounders ## Example: Returns to schooling Three regressions ***of*** wages ***on*** schooling. 
<table> <caption>Outcome variable: log(Wage)</caption> <thead> <tr> <th style="text-align:left;"> Explanatory variable </th> <th style="text-align:center;"> 1 </th> <th style="text-align:center;"> 2 </th> <th style="text-align:center;"> 3 </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;color: #23373b !important;line-height: 110%;font-style: italic;color: black !important;"> Intercept </td> <td style="text-align:center;color: #23373b !important;line-height: 110%;"> 5.571 </td> <td style="text-align:center;color: #23373b !important;line-height: 110%;"> 5.581 </td> <td style="text-align:center;color: #23373b !important;line-height: 110%;font-weight: bold;"> 5.695 </td> </tr> <tr> <td style="text-align:left;color: #23373b !important;color: #c2bebe !important;line-height: 110%;font-style: italic;color: black !important;"> </td> <td style="text-align:center;color: #23373b !important;color: #c2bebe !important;line-height: 110%;"> (0.039) </td> <td style="text-align:center;color: #23373b !important;color: #c2bebe !important;line-height: 110%;"> (0.066) </td> <td style="text-align:center;color: #23373b !important;color: #c2bebe !important;line-height: 110%;font-weight: bold;"> (0.068) </td> </tr> <tr> <td style="text-align:left;color: #23373b !important;line-height: 110%;font-style: italic;color: black !important;"> Education </td> <td style="text-align:center;color: #23373b !important;line-height: 110%;"> 0.052 </td> <td style="text-align:center;color: #23373b !important;line-height: 110%;"> 0.026 </td> <td style="text-align:center;color: #23373b !important;line-height: 110%;font-weight: bold;"> 0.027 </td> </tr> <tr> <td style="text-align:left;color: #23373b !important;color: #c2bebe !important;line-height: 110%;font-style: italic;color: black !important;"> </td> <td style="text-align:center;color: #23373b !important;color: #c2bebe !important;line-height: 110%;"> (0.003) </td> <td style="text-align:center;color: #23373b !important;color: #c2bebe 
!important;line-height: 110%;"> (0.005) </td> <td style="text-align:center;color: #23373b !important;color: #c2bebe !important;line-height: 110%;font-weight: bold;"> (0.005) </td> </tr> <tr> <td style="text-align:left;color: #23373b !important;line-height: 110%;font-style: italic;color: black !important;"> IQ Score </td> <td style="text-align:center;color: #23373b !important;line-height: 110%;"> </td> <td style="text-align:center;color: #23373b !important;line-height: 110%;"> 0.004 </td> <td style="text-align:center;color: #23373b !important;line-height: 110%;font-weight: bold;"> 0.003 </td> </tr> <tr> <td style="text-align:left;color: #23373b !important;color: #c2bebe !important;line-height: 110%;font-style: italic;color: black !important;"> </td> <td style="text-align:center;color: #23373b !important;color: #c2bebe !important;line-height: 110%;"> </td> <td style="text-align:center;color: #23373b !important;color: #c2bebe !important;line-height: 110%;"> (0.001) </td> <td style="text-align:center;color: #23373b !important;color: #c2bebe !important;line-height: 110%;font-weight: bold;"> (0.001) </td> </tr> <tr> <td style="text-align:left;color: #23373b !important;line-height: 110%;font-style: italic;color: black !important;"> South </td> <td style="text-align:center;color: #23373b !important;line-height: 110%;"> </td> <td style="text-align:center;color: #23373b !important;line-height: 110%;"> </td> <td style="text-align:center;color: #23373b !important;line-height: 110%;font-weight: bold;"> -0.127 </td> </tr> <tr> <td style="text-align:left;color: #23373b !important;color: #c2bebe !important;line-height: 110%;font-style: italic;color: black !important;"> </td> <td style="text-align:center;color: #23373b !important;color: #c2bebe !important;line-height: 110%;"> </td> <td style="text-align:center;color: #23373b !important;color: #c2bebe !important;line-height: 110%;"> </td> <td style="text-align:center;color: #23373b !important;color: #c2bebe !important;line-height: 
110%;font-weight: bold;"> (0.019) </td> </tr> </tbody> </table>

---
# Omitted-Variable Bias

The presence of omitted-variable bias (OVB) precludes causal interpretation of our slope estimates.

We can back out the sign and magnitude of OVB by subtracting the .pink[slope estimate from a ***long*** regression] from the .purple[slope estimate from a ***short*** regression]:

`$$\text{OVB} = \color{#9370DB}{\hat{\beta}_1^{\text{Short}}} - \color{#e64173}{\hat{\beta}_1^{\text{Long}}}$$`

--

__Dealing with potential sources of OVB is one of the main objectives of econometric analysis!__

<!-- Find example RCT data and run through R example w/ diff-in-means and regression -->
<!-- https://www.povertyactionlab.org/evaluation/summer-jobs-reduce-violence-among-disadvantaged-youth-united-states -->

---
class: inverse, middle

# Data and the .mono[tidyverse]

---
# Data
## Experimental data

Data generated in controlled, laboratory settings.

--

Ideal for __causal identification__, but difficult to obtain in the social sciences.

- Intractable logistical problems
- Too expensive
- Morally repugnant

--

Experiments outside the lab: __randomized control trials__ and __A/B testing__.

---
# Data
## Observational data

Data generated in non-experimental settings.

--

- Surveys
- Censuses
- Administrative records
- Environmental data
- Financial and sales transactions
- Social media

--

Mainstay of economic research, but __poses challenges__ to causal identification.

---
# Tidy Data

.more-left[
]

.less-right[

.hi-orange[Rows] represent .hi-orange[observations].

.hi-green[Columns] represent .hi-green[variables].

Each .hi-purple[value] is associated with an .hi-orange[observation] and a .hi-green[variable].

]

---
# Cross Sectional Data

.hi-purple[Sample of individuals from a population at a point in time.]

Ideally, collected using __random sampling__.

- Random sampling .mono[+] sufficient sample size .mono[=] representative sample.
- Random sampling simplifies data analysis, but non-random samples are common (and difficult to work with).

Used extensively in applied microeconomics.<sup>*</sup>

__Main focus of this course.__

.footnote[
<sup>*</sup> Applied microeconomics .mono[=] Labor, health, education, public finance, development, industrial organization, and urban economics.
]

---
# Cross Sectional Data

---
# Time Series Data

.hi-purple[Observations of variables over time.]

- Quarterly US GDP
- Annual US infant mortality rates
- Daily Amazon stock prices

Complication: Observations are not independent draws.

- GDP this quarter highly related to GDP last quarter.

Used extensively in empirical macroeconomics.

Requires more-advanced methods (EC 421 and EC 422).

---
# Time Series Data

---
# Pooled Cross Sectional Data

.hi-purple[Cross sections from different points in time.]

Useful for studying policy changes and relationships that change over time.

Requires more-advanced methods (EC 421 and many 400-level applied micro classes).

---
# Pooled Cross Sectional Data

---
# Panel or Longitudinal Data

.hi-purple[Time series for each cross-sectional unit.]

- Example: daily attendance data for a sample of students.

Difficult to collect, but useful for causal identification.

- Can control for _unobserved_ characteristics.

Requires more-advanced methods (EC 421 and many 400-level applied micro classes).

---
# Panel or Longitudinal Data
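
As a rough sketch of the panel structure described above, the snippet below builds a tiny attendance panel in `base` .mono[R]. The student labels, day index, and attendance values are all invented for illustration; they are not course data.

```r
# Hypothetical panel: 3 students (cross-sectional units) observed on 4 days.
# Every value below is made up for illustration.
students <- c("A", "B", "C")
days <- 1:4

# One row for every student-day combination
panel <- expand.grid(student = students, day = days, stringsAsFactors = FALSE)
panel <- panel[order(panel$student, panel$day), ]

# Invented attendance indicator (1 = present, 0 = absent)
panel$present <- rep(c(1, 1, 0, 1), times = length(students))

# Number of student-day observations
nrow(panel)

# Each student contributes a time series of the same length
table(panel$student)
```

Sorting by unit and then by time makes the structure visible: each student's rows form a short time series, and stacking those series gives the panel.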

---
# Tidy Data?

---
# Messy Data

**Analysis-ready datasets are rare.** Most data are "messy."

The focus of this class is data analysis, but .hi[data wrangling] is a non-trivial part of a data scientist/analyst's job.

.mono[R] has a suite of packages that facilitate data wrangling.

- `readr`, `tidyr`, `dplyr`, `ggplot2` .mono[+] others.
- Known collectively as the `tidyverse`.

---
# .mono[tidyverse]
## The [`tidyverse`](https://www.tidyverse.org): A package of packages

`readr`: Functions to import data.

`tidyr`: Functions to reshape messy data.

`dplyr`: Functions to work with data.

`ggplot2`: Functions to visualize data.

---
# Workflow
## Step 1: Load packages with `pacman`

```r
library(pacman)
p_load(tidyverse)
```

If the `tidyverse` hasn't already been installed, `p_load` will install it.

Loading the `tidyverse` automatically loads `readr`, `tidyr`, `dplyr`, `ggplot2`, and a few other packages.

---
# Workflow
## Step 2: Import data with `readr`

```r
workers <- read_csv("03-example_data.csv")
```

CSV files are a common non-proprietary format for storing tabular data.

The `read_csv` function imports CSV (comma-separated values) files.

- Converts the CSV file to a [`tibble`](https://tibble.tidyverse.org), the `tidyverse` version of a `data.frame`.

---
# Workflow
## Step 3: Reshape data with `tidyr`

Variables are stored in rows instead of columns:

```
#> # A tibble: 21,800 × 4
#>    worker_id  year variable value
#>        <dbl> <dbl> <chr>    <dbl>
#>  1        13  1980 educ        14
#>  2        13  1981 educ        14
#>  3        13  1982 educ        14
#>  4        13  1983 educ        14
#>  5        13  1984 educ        14
#>  6        13  1985 educ        14
#>  7        13  1986 educ        14
#>  8        13  1987 educ        14
#>  9        17  1980 educ        13
#> 10        17  1981 educ        13
#> # … with 21,790 more rows
```

---
# Workflow
## Step 3: Reshape data with `tidyr`

Make the data tidy by using the `spread` function:

```r
workers <- workers %>%
  spread(key = variable, value = value)
```

Note the use of the .hi[pipe operator].

- .hi[`%>%`] .mono[=] *"and then."*
- Chains multiple commands together without having to define intermediate objects.

---
# Workflow
## Step 3: Reshape data with `tidyr`

The result:

```
#> # A tibble: 4,360 × 7
#>    worker_id  year black earnings  educ exper union
#>        <dbl> <dbl> <dbl>    <dbl> <dbl> <dbl> <dbl>
#>  1        13  1980     0    8850.    14     1     0
#>  2        13  1981     0   14800.    14     2     1
#>  3        13  1982     0   11278.    14     3     0
#>  4        13  1983     0   12409.    14     4     0
#>  5        13  1984     0   14734.    14     5     0
#>  6        13  1985     0   15676.    14     6     0
#>  7        13  1986     0    1457.    14     7     0
#>  8        13  1987     0   14013.    14     8     0
#>  9        17  1980     0   13274.    13     4     0
#> 10        17  1981     0   12800.    13     5     0
#> # … with 4,350 more rows
```

---
# Workflow
## Step 4: Manipulate data with `dplyr`

Generate new variables with `mutate`:

```r
workers <- workers %>%
  mutate(union = ifelse(union == 1, "Yes", "No"))
```

Before, `union` was a binary variable equal to 1 if the worker is in a union or 0 if otherwise. Now `union` is a character variable.

---
# Workflow
## Step 4: Manipulate data with `dplyr`

The result:

```
#> # A tibble: 4,360 × 7
#>    worker_id  year black earnings  educ exper union
#>        <dbl> <dbl> <dbl>    <dbl> <dbl> <dbl> <chr>
#>  1        13  1980     0    8850.    14     1 No
#>  2        13  1981     0   14800.    14     2 Yes
#>  3        13  1982     0   11278.    14     3 No
#>  4        13  1983     0   12409.    14     4 No
#>  5        13  1984     0   14734.    14     5 No
#>  6        13  1985     0   15676.    14     6 No
#>  7        13  1986     0    1457.    14     7 No
#>  8        13  1987     0   14013.    14     8 No
#>  9        17  1980     0   13274.    13     4 No
#> 10        17  1981     0   12800.
   13     5 No
#> # … with 4,350 more rows
```

---
# Workflow
## Step 6: Visualize and analyze data with `ggplot2`

**How are education and earnings correlated?**

```r
workers %>%
  ggplot(aes(x = educ, y = earnings)) +
  geom_point()
```

<img src="05-Regression_Logic_files/figure-html/unnamed-chunk-21-1.png" style="display: block; margin: auto;" />

---
# Workflow
## Step 6: Visualize and analyze data with `ggplot2`

**How are education and earnings correlated?**

Can also use the `cor` function from `base` .mono[R]:

```r
cor(workers$educ, workers$earnings)
```

```
#> [1] 0.2685563
```

---
# Workflow
## Step 6: Visualize and analyze data with `ggplot2`

**How are education and earnings correlated?**

```r
workers %>%
  ggplot(aes(x = educ, y = earnings, color = union)) +
  geom_point()
```

<img src="05-Regression_Logic_files/figure-html/unnamed-chunk-23-1.png" style="display: block; margin: auto;" />

---
# Workflow
## Step 6: Visualize and analyze data with `ggplot2`

**How are education and earnings correlated?**

```r
workers %>%
  ggplot(aes(x = educ, y = earnings, color = union)) +
  geom_point() +
  facet_grid(~union)
```

<img src="05-Regression_Logic_files/figure-html/unnamed-chunk-24-1.png" style="display: block; margin: auto;" />

---
# Workflow
## Step 6: Visualize and analyze data with `ggplot2`

**How are education and earnings correlated?**

Can .hi[subset] the data to get group-specific correlations:

```r
workers_union <- workers %>%
*  filter(union == "Yes")
cor(workers_union$educ, workers_union$earnings)
```

```
#> [1] 0.211482
```

```r
workers_nounion <- workers %>%
*  filter(union == "No")
cor(workers_nounion$educ, workers_nounion$earnings)
```

```
#> [1] 0.2809786
```

---
# Why Bother?

**Q:** Why not just use .mono[.hi-green[MS Excel]] for data wrangling?

--

**A:** .hi[Reproducibility]

- Easier to retrace your steps with .mono[R].

--

**A:** .hi[Portability]

- Easy to re-purpose .mono[R] code for new projects.

--

**A:** .hi[Scalability]

- .mono[Excel] chokes on big datasets.

--

**A:** .hi[.mono[R] Saves time] (eventually)

- Lower marginal costs in exchange for higher fixed costs.

---
# Further Reading

1. [Tidy Data](https://vita.had.co.nz/papers/tidy-data.pdf) by Hadley Wickham (creator of the `tidyverse`)
2. [Cheatsheets](https://rstudio.com/resources/cheatsheets/)