class: center, middle, inverse, title-slide .title[ # Topic 8: Exploratory Analysis ] .subtitle[ ## Part 2: See how variables relate to each other ] .author[ ### Nick Hagerty
ECNS 460/560
Montana State University ] --- name: toc <style type="text/css"> # CSS for including pauses in printed PDF output (see bottom of lecture) @media print { .has-continuation { display: block !important; } } .remark-code-line { font-size: 95%; } .small { font-size: 75%; } .medsmall { font-size: 90%; } .scroll-output-full { height: 90%; overflow-y: scroll; } .scroll-output-75 { height: 75%; overflow-y: scroll; } </style> # Table of contents 1. [Describing relationships](#xy) 1. [Conditional expectations](#cef) 1. [Adjusting for other variables](#adjusting) 1. [Smoothing](#smooth) --- class: inverse, middle name: challenge # A brief challenge --- # Challenge Load this data: ```r puzzle = read_csv("https://bit.ly/3B4BraF") ``` </br> **Using your exploratory analysis skills, find the most important thing about this dataset.** --- # Challenge -- --- class: inverse, middle name: xy # Describing relationships --- # Anscombe's Quartet All 4 of these datasets have the same means, standard deviations, correlation coefficient, linear regression line, and regression R-squared. <img src="08-Exploratory-part2_files/figure-html/unnamed-chunk-3-1.png" width="70%" style="display: block; margin: auto;" /> **Always plot your data!** Don't rely on summary statistics alone. --- # Scatterplots Long before you run a regression, you should be looking at the scatterplots. ```r vienna = read_csv("https://osf.io/y6jvb/download") ggplot(vienna, aes(x = rating_count, y = price)) + geom_point() ``` <img src="08-Exploratory-part2_files/figure-html/unnamed-chunk-4-1.png" width="80%" style="display: block; margin: auto;" /> --- # Scatterplots Transformations are often extremely important to learning what's really going on. ```r vienna = vienna |> mutate(ln_price = log(price), ln_rating_count = log(rating_count)) ggplot(vienna, aes(x = ln_rating_count, y = ln_price)) + geom_point() ``` <img src="08-Exploratory-part2_files/figure-html/unnamed-chunk-5-1.png" width="80%" style="display: block; margin: auto;" /> --- # Scatterplots Another example of how helpful transformations can be. ```r library(gapminder) gap07 = gapminder |> filter(year == 2007) |> mutate(gdp = pop * gdpPercap) ggplot(gap07, aes(x = pop, y = gdp)) + geom_point() ``` <img src="08-Exploratory-part2_files/figure-html/unnamed-chunk-6-1.png" width="80%" style="display: block; margin: auto;" /> --- # Scatterplots Another example of how helpful transformations can be. ```r gap07 = gap07 |> mutate(ln_gdp = log(gdp), ln_pop = log(pop)) ggplot(gap07, aes(x = pop, y = ln_gdp)) + geom_point() ``` <img src="08-Exploratory-part2_files/figure-html/unnamed-chunk-7-1.png" width="80%" style="display: block; margin: auto;" /> --- # Scatterplots Another example of how helpful transformations can be. ```r gap07 = gap07 |> mutate(ln_gdp = log(gdp), ln_pop = log(pop)) ggplot(gap07, aes(x = ln_pop, y = ln_gdp)) + geom_point() ``` <img src="08-Exploratory-part2_files/figure-html/unnamed-chunk-8-1.png" width="80%" style="display: block; margin: auto;" /> --- # Binscatter When you have a lot of data, scatterplots get too crowded to be useful! Download this data scraped from Airbnb for listings in London on the night of 4 March 2017. 
```r london = read_csv("https://osf.io/ey5p7/download") london2 = london |> mutate(ln_price = log(price)) ggplot(london2, aes(x = longitude, y = ln_price)) + geom_point() ``` <img src="08-Exploratory-part2_files/figure-html/unnamed-chunk-9-1.png" width="80%" style="display: block; margin: auto;" /> --- # Binscatter An extremely useful alternative is the **binscatter** plot. Binscatter divides your X variable into "bins" and plots the mean of Y within each bin. --- # Binscatter First, create the bins. They can be equally spaced or equally sized (quantiles). - Quantiles are preferred, so each bin represents the same amount of underlying data. ```r london2 = london2 |> mutate(longitude_bins = factor(ntile(longitude, 40)), bin_color = factor(ntile(longitude, 40) %% 8)) ggplot(london2, aes(x = longitude, y = ln_price, color = bin_color)) + geom_point() + scale_color_brewer(palette = "Set2") ``` <img src="08-Exploratory-part2_files/figure-html/unnamed-chunk-10-1.png" width="80%" style="display: block; margin: auto;" /> --- # Binscatter Then, calculate the means of both X and Y within each bin. ```r london3 = london2 |> group_by(longitude_bins) |> mutate(ln_price_binned = mean(ln_price, na.rm = TRUE), longitude_binned = mean(longitude, na.rm = TRUE)) ggplot(london3) + geom_point(aes(x = longitude, y = ln_price, color = bin_color)) + geom_point(aes(x = longitude_binned, y = ln_price_binned), size = 2) + scale_color_brewer(palette = "Set2") ``` <img src="08-Exploratory-part2_files/figure-html/unnamed-chunk-11-1.png" width="80%" style="display: block; margin: auto;" /> --- # Binscatter Let's show just the binned points: ```r london_binned = london2 |> group_by(longitude_bins) |> summarize(across(c("ln_price", "longitude"), mean, na.rm = TRUE)) ggplot(london_binned, aes(x = longitude, y = ln_price)) + geom_point(size = 3) ``` <img src="08-Exploratory-part2_files/figure-html/unnamed-chunk-12-1.png" width="80%" style="display: block; margin: auto;" /> --- # Binscatter Easier: Use `binsreg` ([Cattaneo, Crump, Farrell, and Feng 2021](https://arxiv.org/abs/1902.09608)) - Correctly handles many of the finer statistical details. ```r install.packages("binsreg") ``` ```r library(binsreg) binsreg(london2$ln_price, london2$longitude) ``` <img src="08-Exploratory-part2_files/figure-html/unnamed-chunk-14-1.png" width="80%" style="display: block; margin: auto;" /> ``` ## Call: binsreg ## ## Binscatter Plot ## Bin/Degree selection method (binsmethod) = IMSE direct plug-in (select # of bins) ## Placement (binspos) = Quantile-spaced ## Derivative (deriv) = 0 ## ## Group (by) = Full Sample ## Sample size (n) = 53817 ## # of distinct values (Ndist) = 53817 ## # of clusters (Nclust) = NA ## dots, degree (p) = 0 ## dots, smoothness (s) = 0 ## # of bins (nbins) = 69 ``` --- # Local regression A useful alternative to binscatter is **local regression.** Local regression gives a smooth line that closely resembles the binscatter plot. What does it do exactly? We'll come back to this. ```r ggplot(london3, aes(x = longitude, y = ln_price)) + geom_smooth() ``` <img src="08-Exploratory-part2_files/figure-html/unnamed-chunk-15-1.png" width="80%" style="display: block; margin: auto;" /> --- # Local regression A useful alternative to binscatter is **local regression.** Local regression gives a smooth line that closely resembles the binscatter plot. What does it do exactly? We'll come back to this. 
```r ggplot(london3, aes(x = longitude, y = ln_price)) + geom_point(aes(x = longitude_binned, y = ln_price_binned), size = 2) + geom_smooth() ``` <img src="08-Exploratory-part2_files/figure-html/unnamed-chunk-16-1.png" width="80%" style="display: block; margin: auto;" /> --- # Linear regression (OLS) You can also add the **line of best fit**, aka the **least-squares regression line.** Clearly this is more useful in some situations than others. ```r ggplot(london3, aes(x = longitude, y = ln_price)) + geom_point(aes(x = longitude_binned, y = ln_price_binned), size = 2) + geom_smooth(method = "lm", formula = y ~ x) ``` <img src="08-Exploratory-part2_files/figure-html/unnamed-chunk-17-1.png" width="80%" style="display: block; margin: auto;" /> --- # Linear regression (OLS) You can also add the **line of best fit**, aka the **least-squares regression line.** Clearly this is more useful in some situations than others. ```r ggplot(london3, aes(x = longitude, y = ln_price)) + geom_smooth(data = filter(london3, longitude < -0.17), method = 'lm', formula = y~x) + geom_smooth(data = filter(london3, longitude > -0.17), method = 'lm', formula = y~x) + geom_point(aes(x = longitude_binned, y = ln_price_binned), size = 2) ``` <img src="08-Exploratory-part2_files/figure-html/unnamed-chunk-18-1.png" width="80%" style="display: block; margin: auto;" /> --- # Linear regression (OLS) You can also add the **line of best fit**, aka the **least-squares regression line.** Clearly this is more useful in some situations than others. **So if you just ran a regression without plotting your data, you would get a woefully misleading answer!** --- class: inverse, middle name: cef # Conditional Expectations Images and code in this section are from ["Why Regression?"](https://github.com/edrubin/EC607S21/tree/master/notes-lecture/03-why-regression) by Ed Rubin, used with permission, and are not included under this resource's overall CC license. --- # The Conditional Expectation Function (CEF) The CEF tells us the **expected value** (population mean) of an outcome variable `\(Y_i\)` at each value of `\(X_i\)`. $$ \mathbb{E}[Y_i|X_i] $$ Examples: - `\(\mathbb{E}[\text{Price}_i|\text{Longitude}_i]\)` - `\(\mathbb{E}[\text{Income}_i|\text{Education}_i]\)` - `\(\mathbb{E}[\text{Birth weight}_i|\text{Air pollution}_i]\)` </br> Binscatter and local regression give **nonparametric estimates** of the CEF. - Nonparametric: functional form is not pre-specified but rather driven by the data. --- class: clear, middle, center The conditional distributions of `\(\text{Y}_{i}\)` for `\(\text{X}_{i} = x\)` in 8, ..., 22. <img src="08-Exploratory-part2_files/figure-html/unnamed-chunk-19-1.png" style="display: block; margin: auto;" /> --- class: clear, middle, center The CEF, `\(\mathop{E}\left[ \text{Y}_{i}\mid \text{X}_{i} \right]\)`, connects these conditional distributions' means. <img src="08-Exploratory-part2_files/figure-html/unnamed-chunk-20-1.png" style="display: block; margin: auto;" /> --- class: clear, middle, center Focusing in on the CEF, `\(\mathop{E}\left[ \text{Y}_{i}\mid \text{X}_{i} \right]\)`... <img src="08-Exploratory-part2_files/figure-html/fig_cef_only-1.png" style="display: block; margin: auto;" /> --- # The Conditional Expectation Function A well-chosen CEF is (almost) everything you need for empirical work. - For **prediction:** the CEF is the best predictor of `\(Y_i\)` given `\(X_i\)`. - In the sense of minimizing mean squared error (bias squared plus variance). 
- For **causal inference:** when the variation in `\(X\)` is as-good-as-random, then the CEF shows how `\(X\)` causally affects `\(Y\)`. - Specifically, the slope of the CEF is the **average causal response** of `\(Y\)` to a 1-unit change in `\(X\)`, for each value of `\(X\)`. -- </br> So... **why are we always running regressions??** --- # Why Use Regression? **Regression gives the best linear approximation of the CEF.** - Again, "best" in the sense of lowest possible mean squared error. **Why?** The intuition: - Regression minimizes the *vertical* sum of squared errors. - The mean *also* minimizes the sum of squared errors for a given value of `\(X_i\)`. - So the regression line approximates the mean over all values of `\(X_i\)`, or the CEF. - Try out [this game](https://old.mathematik.tu-clausthal.de/en/mathematics-interactive/statistics/guessing-the-regression-line/). --- # Why Use Regression? **Regression provides the best linear approximation of the CEF.** - Again, "best" in the sense of lowest possible mean squared error. Regression is a **good way to summarize the CEF,** even when the CEF is not linear. - Your regression can be useful even if it is not a literally correct model of the true data generating process. - But, different regressions may be more or less useful (better or worse approximations of the CEF). How useful is *your* regression in *your* situation? - Look at a nonparametric estimate of your CEF to see if your regression is a good approximation. - That is: binscatter or local regression. --- # Why Use Regression? My view: **By the time you actually run a regression, the results should never be a surprise.** - Because you have already learned what the underlying CEF looks like. -- So what's the point of running regressions? 1. **Quantification:** To measure the average slope of the CEF. 2. **Statistical inference:** To compute standard errors. </br> -- What if linear regression is not a good approximation to the CEF? - You can always add nonlinear *terms* to a regression to improve the approximation. - Doesn't change the fact that regression provides a linear *approximation*, in the sense that all the terms enter the model in a linear combination. --- class: inverse, middle name: adjusting # Adjusting for other variables --- # Simpson's Paradox A trend in a full dataset can disappear or even reverse when looking at the constituent groups individually. <img src="img/Simpsons_paradox_-_animation.gif" width="67%" style="display: block; margin: auto;" /> Which trend is the "right" one? .small[Image from [Wikipedia](https://commons.wikimedia.org/wiki/File:Simpsons_paradox_-_animation.gif) and used under [CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/).] --- # The Simpsons Paradox A trend in a full dataset can disappear or even reverse when looking at the constituent groups individually. <img src="img/the-simpsons-paradox.jfif" width="67%" style="display: block; margin: auto;" /> .small[Image from [RJ Andrews @infowetrust](https://twitter.com/infowetrust/status/984536880199876608) and not included in the CC license.] --- # The Simpsons Pair O' Docs A trend in a full dataset can disappear or even reverse when looking at the constituent groups individually. <img src="img/the-simpsons-pair-o-docs.jfif" width="73%" style="display: block; margin: auto;" /> .small[Image from [RJ Andrews @infowetrust](https://twitter.com/infowetrust/status/984536880199876608) and not included in the CC license.] 
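---

# Simpson's Paradox

Here is a minimal simulated sketch of the reversal (made-up data, not from any dataset in this lecture). Within each group, `y` *falls* with `x`, but the pooled slope is positive because the high-`x` group also has a much higher intercept:

```r
set.seed(460)  # arbitrary seed, for reproducibility
sim = tibble(group = rep(c("A", "B"), each = 200)) |> 
  mutate(x = rnorm(n(), mean = if_else(group == "A", 2, 6)),
         y = if_else(group == "A", 1, 8) - 0.5 * x + rnorm(n()))

coef(lm(y ~ x, data = sim))["x"]                          # pooled slope: positive
coef(lm(y ~ x, data = filter(sim, group == "A")))["x"]    # within-group slope: negative

ggplot(sim, aes(x = x, y = y, color = group)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", formula = y ~ x) +                                 # one fit per group
  geom_smooth(aes(group = 1), method = "lm", formula = y ~ x, color = "black")  # pooled fit
```

Which slope is the "right" one depends on whether the grouping variable is something you should adjust for -- which is what the rest of this section is about.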
--- # Adjusting for a binary variable Let's go back to the `gapminder` data and plot life expectancy by GDP: ```r library(gapminder) gap07 = gapminder |> filter(year == 2007) |> mutate(gdp = pop * gdpPercap, ln_gdp = log(gdp), ln_pop = log(pop)) ggplot(gap07, aes(y = lifeExp, x = ln_gdp)) + geom_point() ``` <img src="08-Exploratory-part2_files/figure-html/unnamed-chunk-24-1.png" width="80%" style="display: block; margin: auto;" /> --- # Adjusting for a binary variable Suppose you think there might be something different about Africa. Create a binary variable and color the dots by it: ```r gap07 = gap07 |> mutate(africa = if_else(continent == "Africa", 1, 0)) ggplot(gap07, aes(y = lifeExp, x = ln_gdp, color = africa)) + geom_point() ``` <img src="08-Exploratory-part2_files/figure-html/unnamed-chunk-25-1.png" width="80%" style="display: block; margin: auto;" /> --- # Adjusting for a binary variable Suppose you think there might be something different about Africa. Create a binary variable and color the dots by it: ```r gap07 = gap07 |> mutate(africa = factor(if_else(continent == "Africa", 1, 0))) ggplot(gap07, aes(y = lifeExp, x = ln_gdp, color = africa)) + geom_point() ``` <img src="08-Exploratory-part2_files/figure-html/unnamed-chunk-26-1.png" width="80%" style="display: block; margin: auto;" /> --- # Adjusting for a binary variable We can **adjust** for the Africa indicator, in order to remove its influence on the relationship we're seeing between GDP and life expectancy. --- # Adjusting for a binary variable First, find the mean differences in the `\(X\)` variable that are explained by `Africa`. ```r gap07 = gap07 |> group_by(africa) |> mutate(ln_gdp_by_africa = mean(ln_gdp)) ggplot(gap07, aes(y = lifeExp, x = ln_gdp, color = africa)) + geom_point() + geom_vline(aes(xintercept = ln_gdp_by_africa, color = africa)) ``` <img src="08-Exploratory-part2_files/figure-html/unnamed-chunk-27-1.png" width="80%" style="display: block; margin: auto;" /> --- # Adjusting for a binary variable Remove the differences explained by `Africa`, by subtracting these group means from `\(X\)`. ```r gap07 = gap07 |> group_by(africa) |> mutate(ln_gdp_adj_africa = ln_gdp - mean(ln_gdp)) ggplot(gap07, aes(y = lifeExp, x = ln_gdp_adj_africa, color = africa)) + geom_point() + geom_vline(aes(xintercept = 0, color = africa)) ``` <img src="08-Exploratory-part2_files/figure-html/unnamed-chunk-28-1.png" width="80%" style="display: block; margin: auto;" /> --- # Adjusting for a binary variable Second, find the mean differences in the `\(Y\)` variable that are explained by `Africa`. ```r gap07 = gap07 |> group_by(africa) |> mutate(lifeExp_by_africa = mean(lifeExp)) ggplot(gap07, aes(y = lifeExp, x = ln_gdp_adj_africa, color = africa)) + geom_point() + geom_hline(aes(yintercept = lifeExp_by_africa, color = africa)) ``` <img src="08-Exploratory-part2_files/figure-html/unnamed-chunk-29-1.png" width="80%" style="display: block; margin: auto;" /> --- # Adjusting for a binary variable Remove the differences explained by `Africa`, by subtracting these group means from `\(Y\)`. 
```r gap07 = gap07 |> group_by(africa) |> mutate(lifeExp_adj_africa = lifeExp - mean(lifeExp)) ggplot(gap07, aes(y = lifeExp_adj_africa, x = ln_gdp_adj_africa, color = africa)) + geom_point() + geom_hline(aes(yintercept = 0, color = africa)) ``` <img src="08-Exploratory-part2_files/figure-html/unnamed-chunk-30-1.png" width="80%" style="display: block; margin: auto;" /> --- # Adjusting for a binary variable For a more easily interpretable graph, add the "grand" (overall sample) means back in: ```r gap07 = gap07 |> ungroup() |> mutate(ln_gdp_adj_africa = ln_gdp_adj_africa + mean(ln_gdp), lifeExp_adj_africa = lifeExp_adj_africa + mean(lifeExp)) ggplot(gap07, aes(y = lifeExp_adj_africa, x = ln_gdp_adj_africa, color = africa)) + geom_point() ``` <img src="08-Exploratory-part2_files/figure-html/unnamed-chunk-31-1.png" width="80%" style="display: block; margin: auto;" /> --- # Adjusting for a binary variable **This is exactly what happens when you "control for" a variable in a regression.** Note: You do have to adjust both your X and Y variables. Adjusting only Y is not enough -- it'll give the wrong answer. (Look up the Frisch-Waugh Theorem.) - One common place this trips people up is in seasonal adjustment. -- Did we gain anything useful by adjusting for whether a country is in Africa? <img src="img/you-were-so-preoccupied-with-whether-or-not-you-could-you-didnt-stop-to-think-if-you-should.jpg" width="70%" style="display: block; margin: auto;" /> --- # Within-unit variation What if, instead of a binary indicator variable, we want to adjust for many different categories? - For example, we might observe multiple observations over time for the same person, household, firm, or county. We can follow the same process of subtracting out group means as we did for the binary variable. The only difference is that we might have thousands of different groups, instead of just two. Now the variation remaining is all **within-unit.** -- This is exactly what **fixed effects** regression does. --- # Within-unit variation <img src="img/Animation of Fixed Effects.gif" style="display: block; margin: auto;" /> .small[Image by [Nick Huntington-Klein](https://nickchk.com/causalgraphs.html) and not included in this resource's overall CC license.] --- class: inverse, middle name: smooth # Smoothing Parts of this section are adapted from [“Introduction to Data Science”](http://rafalab.dfci.harvard.edu/dsbook/smoothing.html) by Rafael A. Irizarry, used under [CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0). --- # Smoothing **Basic idea:** Separate true signal (a predictable data generating process) from the noise. - Problem: Hard to tell the difference. - This is a general problem in prediction. <img src="img/signal-plus-noise-example-1.png" width="50%" style="display: block; margin: auto;" /> --- # Bin smoothing / moving averages How to best estimate a CEF? We can think of this as a problem of predicting `$$Y_i = f(x_i) + \epsilon_i$$` where `$$f(x) = \mathbb{E}[Y|X = x_i].$$` One option that holds intuitive appeal is **bin smoothing.** This is like binscatter, but with overlapping bins. Take moving averages of `\(Y\)` across your observations `\(i\)` at each point `\(x_0\)` in the support of `\(x\)`: $$ \hat{f}(x_0) = \frac{ \sum_i { Y_i \space 1 \\big\\{ |x_i - x_0| \leq h \\big\\} } } { \sum_i {1 \\big\\{ |x_i - x_0| \leq h \\big\\}} } $$ where `\(h\)` is the bandwidth. E.g., for a weekly moving average, `\(h = 3.5\)`. 
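---

# Bin smoothing / moving averages

A minimal by-hand sketch of this estimator, using the `london2` Airbnb data from earlier. The bandwidth of 0.01 degrees of longitude and the 200-point grid are arbitrary choices for illustration:

```r
h = 0.01  # bandwidth: half-width of each window (arbitrary, for illustration)
grid = tibble(x0 = seq(min(london2$longitude, na.rm = TRUE),
                       max(london2$longitude, na.rm = TRUE),
                       length.out = 200))

# At each evaluation point x0, average ln_price over all listings whose
# longitude falls within h of x0 -- a moving average with overlapping windows
smoothed = grid |> 
  rowwise() |> 
  mutate(fhat = mean(london2$ln_price[abs(london2$longitude - x0) <= h],
                     na.rm = TRUE)) |> 
  ungroup() |> 
  filter(is.finite(fhat))  # drop grid points whose window contains no listings

ggplot(london2, aes(x = longitude, y = ln_price)) +
  geom_point(alpha = 0.1) +
  geom_line(data = smoothed, aes(x = x0, y = fhat), color = "blue")
```

Same idea as binscatter, except that the windows overlap instead of partitioning the data.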
--- # Bin smoothing / moving averages Do this at enough values of `\(x_0\)` that it appears continuous: <img src="img/binsmoother-animation.gif" style="display: block; margin: auto;" /> --- # Kernels We can reduce the variance of this estimate by choosing different weights `\(w_0\)` in the moving average: $$ \hat{f}(x_0) = \sum_i { Y_i \space w_0(x_i) } $$ In the moving average example, the weights were either `\(0\)` or `\(1/N_0\)`, where `\(N_0\)` is the number of observations that fall within the window. - Turn on and off discretely as data points enter and exit the window. </br> Instead, we can choose weights that vary more smoothly. These weights are described with a **kernel** function. - Nonnegative, symmetric, integrates to 1. --- # Some kernels Uniform kernel: $$ w_0(x_i) = W(\frac{x_i - x_0}{h}), \space W(u) = \frac{1}{2} \space 1\\{|u| \leq 1 \\}. $$ <img src="img/Kernel_uniform.svg" width="60%" style="display: block; margin: auto;" /> .small[Image from <a href = "https://en.wikipedia.org/wiki/Kernel_(statistics)">Wikipedia</a> and used under [CC BY-SA 3.0](https://creativecommons.org/licenses/by-sa/3.0/deed.en).] --- # Some kernels Triangular kernel: $$ w_0(x_i) = W(\frac{x_i - x_0}{h}), \space W(u) = (1-|u|) \space 1\\{|u| \leq 1 \\}. $$ <img src="img/Kernel_triangle.svg" width="60%" style="display: block; margin: auto;" /> .small[Image from <a href = "https://en.wikipedia.org/wiki/Kernel_(statistics)">Wikipedia</a> and used under [CC BY-SA 3.0](https://creativecommons.org/licenses/by-sa/3.0/deed.en).] --- # Some kernels Epanechnikov kernel: $$ w_0(x_i) = W(\frac{x_i - x_0}{h}), \space W(u) = \frac{3}{4} (1-u^2) \space 1\\{|u| \leq 1 \\}. $$ <img src="img/Kernel_epanechnikov.svg" width="60%" style="display: block; margin: auto;" /> .small[Image from <a href = "https://en.wikipedia.org/wiki/Kernel_(statistics)">Wikipedia</a> and used under [CC BY-SA 3.0](https://creativecommons.org/licenses/by-sa/3.0/deed.en).] --- # Some kernels Gaussian (normal) kernel: $$ w_0(x_i) = W(\frac{x_i - x_0}{h}), \space W(u) = \frac{1}{\sqrt{2\pi}} e^{-\frac{1}{2}u^2}. $$ <img src="img/Kernel_exponential.svg" width="60%" style="display: block; margin: auto;" /> .small[Image from <a href = "https://en.wikipedia.org/wiki/Kernel_(statistics)">Wikipedia</a> and used under [CC BY-SA 3.0](https://creativecommons.org/licenses/by-sa/3.0/deed.en).] --- # Local regression Instead of treating the mean as constant within each window... - **Fit a regression line** within each window. Regression is estimated via weighted least squares, using the kernel weights. <img src="img/loess-animation.gif" width="50%" style="display: block; margin: auto;" /> --- # Local regression Instead of treating the mean as constant within each window... - **Fit a regression line** within each window. As usual, larger bandwidths lead to more smoothing. <img src="img/loess-multi-span-animation.gif" width="50%" style="display: block; margin: auto;" /> --- # Local regression >> bin smoothing Local regression is preferred over bin smoothing / moving averages. Moving averages can be badly biased at the edges of the data. --- # Beware of default arguments Be careful with `geom_smooth`. The default options may not be what you intend. - With `\(<1000\)` observations, uses **local polynomial** smoothing (`loess`) of **degree 2**. - With `\(\geq 1000\)` observations, uses **penalized regression splines** in a generalized additive model (`gam`), an entirely different method.
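For example, the same call can fit two entirely different models depending on the sample size (a quick check using the `london2` data from earlier; the 500-row subsample is an arbitrary choice):

```r
# With a subsample of < 1000 rows, geom_smooth() defaults to loess (degree 2)
ggplot(slice_sample(london2, n = 500), aes(x = longitude, y = ln_price)) +
  geom_point() + geom_smooth()

# With the full ~54,000 rows, the same call silently switches to gam
ggplot(london2, aes(x = longitude, y = ln_price)) +
  geom_point() + geom_smooth()
```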
To get *local linear* regression in `ggplot`, specify: `geom_smooth(method = "loess", method.args = list(degree = 1))`. --- # Beware of default arguments `loess` does not allow you to choose the bandwidth or kernel. - Instead, you can set the share of *all* data used at each point (the "span"). - Default: `span = 0.75`. - Default kernel: tricube. Alternatives: - Package `locpol` allows you to choose the kernel and bandwidth - Or choose it "optimally" based on cross-validation or a "rule of thumb" formula. - `gam` estimates the smoothing parameter for you, but it's uncommon in economics. Practical advice: - Use `locpol` for important results, but `geom_smooth` is fine for quick-and-dirty answers. - Robustness to method and bandwidth is more important than the particular choices themselves. --- # Summary of Part 2 ### Describing relationships * **ALWAYS.** PLOT. YOUR. DATA. * Try log transformations to make more sense of scatterplots in right-skewed data. * Use **binscatter** or **local regression** to find patterns in large datasets. * Both are nonparametric ways to estimate the **conditional expectation function**. * Regression provides the **best linear approximation** to the CEF. * When using local regression, pay close attention to the choice of **bandwidth.** ### Adjusting for other variables * Patterns within groups can look completely different when the groups are combined. * To use only **within-group variation**, subtract group means from both X and Y variables. --- # Summary of Part 2 ### Describing relationships * ALWAYS. **PLOT.** YOUR. DATA. * Try log transformations to make more sense of scatterplots in right-skewed data. * Use **binscatter** or **local regression** to find patterns in large datasets. * Both are nonparametric ways to estimate the **conditional expectation function**. * Regression provides the **best linear approximation** to the CEF. * When using local regression, pay close attention to the choice of **bandwidth.** ### Adjusting for other variables * Patterns within groups can look completely different when the groups are combined. * To use only **within-group variation**, subtract group means from both X and Y variables. --- # Summary of Part 2 ### Describing relationships * ALWAYS. PLOT. **YOUR.** DATA. * Try log transformations to make more sense of scatterplots in right-skewed data. * Use **binscatter** or **local regression** to find patterns in large datasets. * Both are nonparametric ways to estimate the **conditional expectation function**. * Regression provides the **best linear approximation** to the CEF. * When using local regression, pay close attention to the choice of **bandwidth.** ### Adjusting for other variables * Patterns within groups can look completely different when the groups are combined. * To use only **within-group variation**, subtract group means from both X and Y variables. --- # Summary of Part 2 ### Describing relationships * ALWAYS. PLOT. YOUR. **DATA.** * Try log transformations to make more sense of scatterplots in right-skewed data. * Use **binscatter** or **local regression** to find patterns in large datasets. * Both are nonparametric ways to estimate the **conditional expectation function**. * Regression provides the **best linear approximation** to the CEF. * When using local regression, pay close attention to the choice of **bandwidth.** ### Adjusting for other variables * Patterns within groups can look completely different when the groups are combined. 
* To use only **within-group variation**, subtract group means from both X and Y variables.