
ECON 3818

Chapter 26

Kyle Butts

27 September 2021



Chapter 26: Regression Inference


Introduction

Chapters 4 and 5 discussed how scatterplots and lines of best fit show us linear relationships, but several questions remain:

Is there really a linear relationship between x and y, or is the pattern just due to chance?

  • Spurious correlations

What is the slope that describes how y responds to x in the population, and what is the margin of error for our estimate?

  • If we use the least-squares line to predict y for a given x, how accurate is that prediction?

  • In econometrics, you will discuss when you can answer the causal question: what is the effect on y of changing x?


Regression Review

We can model the linear relationship between X and Y by thinking of a conditional expectation:

E(Y|X)= a + bX

We want estimates for a and b, \hat{a} and \hat{b}, and we find these estimates by minimizing the sum of squared residuals

\varepsilon_i = Y_i - \widehat{Y}_i \equiv Y_i - (\hat{a} + \hat{b} X_i)


OLS Estimators

We pick the values of \hat{a} and \hat{b} to minimize the sum of squared residuals, \sum_{i=1}^n \varepsilon_i^2. This yields the Ordinary Least Squares (OLS) estimators

\hat{a}=\bar{Y}-\hat{b}\bar{X}

\hat{b}=r_{XY}\frac{s_Y}{s_X}
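As a quick check, here is a minimal R sketch (using a made-up data frame df with columns x and y, so the numbers are purely illustrative) showing that these formulas reproduce the coefficients from lm():

# Simulated data for illustration only
set.seed(1)
df <- data.frame(x = rnorm(100))
df$y <- 2 + 0.5 * df$x + rnorm(100)

# OLS estimates from the formulas above
b_hat <- cor(df$x, df$y) * sd(df$y) / sd(df$x)
a_hat <- mean(df$y) - b_hat * mean(df$x)

# Same numbers lm() reports
c(a_hat, b_hat)
coef(lm(y ~ x, data = df))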


Next Steps

This chapter will answer

  • How can I interpret \hat{a} and \hat{b}?

  • What conditions are necessary for those interpretations?

  • Inference from a Regression


Interpreting a and b

\widehat{\text{Calcification Rate}} = -12.103 + 0.4615 \cdot \text{Temperature}

We can now predict how temperature affects the calcification rate. The R^2 tells us how much of the variation in calcification rate is explained by temperature, but it does not tell us whether this relationship is statistically significant.
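For instance, at a hypothetical water temperature of 29°C, the predicted calcification rate is

\widehat{\text{Calcification Rate}} = -12.103 + 0.4615 \times 29 \approx 1.28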

In order for this regression to be meaningful, we must determine whether the results are statistically significant


Estimating the Parameters

When the conditions for the regression are met¹

  • The slope \hat{b} of the least-squares line is an unbiased estimator of the population slope b

  • The intercept \hat{a} of the least-squares line is an unbiased estimator of the population intercept a

Now we only need to estimate the remaining parameter: \sigma, the standard deviation of the error term \varepsilon_i.

¹ We will discuss the conditions later


Regression Standard Error

Our regression model is:

y = a + b X + \varepsilon

\varepsilon is the error term that describes why an individual observation doesn't fall directly on the regression line a + b X.

We denote the variance of \varepsilon as \sigma^2. The standard deviation, \sigma, describes the variability of the response variable y about the population regression line.


Estimating Std. Dev. of the Error Term

The least-squares line estimates the population regression line

  • The residuals are the deviations of data points from the least-squares line

\hat{\varepsilon} \equiv \text{residual} = y - \hat{y}

Therefore we estimate \sigma by the sample standard deviation of the residuals, known as the regression standard error


Regression Standard Error

s = \sqrt{\frac{1}{n-2} \sum \text{residual}^2} \equiv \sqrt{\frac{1}{n-2} \sum (y - \hat{y})^2}

We use s to estimate the standard deviation, \sigma, of responses about the mean given by the population regression line

We will use this error to determine whether our predictions are statistically significant
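A minimal R sketch, reusing the simulated df from earlier: compute s by hand and confirm it matches the "Residual standard error" that R reports.

fit <- lm(y ~ x, data = df)

# Regression standard error: residual sum of squares over n - 2
s <- sqrt(sum(resid(fit)^2) / (nobs(fit) - 2))
s

# R computes the same quantity
sigma(fit)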


Testing the Hypothesis of No Linear Relationship

To answer questions about whether associations between two variables are statistically significant, we must test a hypothesis about the slope b:

H_0: \ b = 0 \quad \text{vs.} \quad H_1: b \neq 0

If we fail to reject H_0:

  • Regression line with slope 0 is horizontal -- meaning y does not change at all when x changes

  • H_0 says that there is no linear relationship between X and Y

If we reject H_0 in favor of H_1:

  • There is some linear relationship between X and Y



Population vs. sample

Question: Why do we care about population vs. sample?

Population

Population relationship

y_i = 2.53 + 0.57 x_i + u_i

y_i = \beta_0 + \beta_1 x_i + u_i


Population vs. sample

Question: Why do we care about population vs. sample?

Sample 1: 30 random individuals

Population relationship
y_i = 2.53 + 0.57 x_i + u_i

Sample relationship
\hat{y}_i = 2.36 + 0.61 x_i


Population vs. sample

Question: Why do we care about population vs. sample?

Sample 2: 30 random individuals

Population relationship
y_i = 2.53 + 0.57 x_i + u_i

Sample relationship
\hat{y}_i = 2.79 + 0.56 x_i


Population vs. sample

Question: Why do we care about population vs. sample?

Sample 3: 30 random individuals

Population relationship
y_i = 2.53 + 0.57 x_i + u_i

Sample relationship
\hat{y}_i = 3.21 + 0.45 x_i


1,000 samples of size 30

(Figure: the 1,000 estimated least-squares lines plotted around the population regression line.)


Population vs. sample

  • On average, our regression lines match the population line very nicely.

  • However, individual lines (samples) can really miss the mark.

  • Differences between individual samples and the population lead to uncertainty for the econometrician.
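A minimal R simulation of this idea, assuming the population relationship above (intercept 2.53, slope 0.57), with x and u drawn as standard normals purely for illustration: draw 1,000 samples of size 30 and re-estimate the slope in each.

set.seed(3818)
b_hats <- replicate(1000, {
  x <- rnorm(30)
  y <- 2.53 + 0.57 * x + rnorm(30)  # population relationship
  coef(lm(y ~ x))[2]                # sample slope estimate
})

mean(b_hats)  # close to 0.57: unbiased on average
sd(b_hats)    # but any single sample can miss the mark
hist(b_hats)  # roughly bell-shaped, previewing the next slide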


Sampling Distribution of \hat{b}

Since \hat{b} is a function of our data, it has a sampling distribution.

The sampling distribution of \hat{b} is:

\hat{b} \sim N\left(b, \ \frac{\sigma^2}{\sigma_X^2}\right)

Another instance of a sampling distribution being normally distributed!

\sigma^2 is the variance of \varepsilon and \sigma_X^2 is the variance of X.


Significance Test for Regression Slope

To test the hypothesis, H_0: b=0, compute the t-statistic:

t_{n-2} = \frac{\hat{b} - 0}{SE_{\hat{b}}}

It is important to note that the degrees of freedom for the t-statistic for testing a regression slope is n - 2 (we estimate two parameters, a and b)

In this formula, the standard error of the least-squares slope is our estimate of the sampling distribution's standard deviation:

SE_{\hat{b}} = \frac{s}{\sqrt{\sum (x - \bar{x})^2}}
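A minimal R check, again with the simulated df from earlier, that this formula reproduces the "Std. Error" and "t value" columns of summary(lm()):

fit  <- lm(y ~ x, data = df)
se_b <- sigma(fit) / sqrt(sum((df$x - mean(df$x))^2))
t    <- coef(fit)[2] / se_b   # t-statistic for H_0: b = 0

c(se_b, t)
summary(fit)$coefficients     # compare with the "x" row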


Example

We fit a least-squares line to the model, \text{Price} = a + b \cdot \text{Age}, with 28 observations on items sold at an antiques show. A summary of the output is below:

Parameter   Estimate   Std. Error of Estimate
\hat{a}     27.730     34.840
\hat{b}     1.893      0.267

Suppose we want to test the hypothesis H_0: b = 0 vs. H_1: b \neq 0. The value of the t-statistic is: t_{26} = \frac{\hat{b} - 0}{SE_{\hat{b}}} = \frac{1.893 - 0}{0.267} = 7.09

Using the t-table \implies p < 0.001
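The p-value can also be computed directly in R; for a two-sided alternative we double the upper-tail area:

2 * pt(7.09, df = 26, lower.tail = FALSE)  # far below 0.001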


Clicker Question

In the previous example we rejected the null hypothesis of b = 0, meaning we claim there is sufficient evidence of a linear relationship between age and price of items sold at an antiques show.

What type of error would we have committed if it turned out there was no relationship between age and price?

  1. Type I, reject the null even though it's true
  2. Type II, reject the null even though it's true
  3. Type I, fail to reject a false null
  4. Type II, fail to reject a false null

Additional Example -- Exam Style

My budtender friend Eric did a study on marijuana consumption and hot cheeto consumption. He surveyed 25 of his friends and collected the following regression results. Assume \alpha = 0.05.

Cheeto Consumption   Estimate   Std. Error   t-statistic   p-value
Intercept            21.0       12.3
Joints Smoked        4.2        1.8
  1. Fill in the rest of the table
  2. Is the intercept statistically significant? Why?
  3. Is the slope coefficient statistically significant? Why?
  4. Interpret the slope coefficient

Hypothesis Testing Example

Example: Regression analysis provides estimates of the relationship between daily wine consumption and risk of breast cancer. The estimated slope was \hat{b} = 0.009 with a standard error of SE_{\hat{b}} = 0.001, based on 25 observations.

We want to test whether these results are strong enough to reject the null hypothesis

H_0: b = 0

in favor of the alternative hypothesis

H_1: b > 0


Hypothesis Testing Example

So we have \hat{b} = 0.009 and SE_{\hat{b}} = 0.001. Solving the hypothesis test:

  • Find t-stat

t=\frac{0.009}{0.001}=9

  • Use t-table to find p-value

25 \text{ observations} \implies t_{n-2} = t_{23}

t_{23}^{0.0005} = 3.8 \implies p<0.0005

  • Interpret p-value: p < 0.0005 \implies p < 0.05 \implies \textbf{Reject } H_0
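Since the alternative here is one-sided (H_1: b > 0), the exact p-value is the upper-tail area alone, which R can compute directly:

pt(9, df = 23, lower.tail = FALSE)  # far below 0.0005, so reject H_0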

Regression Results

# Hourly Earnings ($) on Years of Education
summary(lm(wage ~ educ, data = wage1))
#>
#> Call:
#> lm(formula = wage ~ educ, data = wage1)
#>
#> Residuals:
#>     Min      1Q  Median      3Q     Max
#> -5.3396 -2.1501 -0.9674  1.1921 16.6085
#>
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)
#> (Intercept) -0.90485    0.68497  -1.321    0.187
#> educ         0.54136    0.05325  10.167   <2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 3.378 on 524 degrees of freedom
#> Multiple R-squared: 0.1648, Adjusted R-squared: 0.1632
#> F-statistic: 103.4 on 1 and 524 DF, p-value: < 2.2e-16

Confidence Interval for Regression Slope

The slope, b, of the population regression line is usually the most important parameter in a regression problem

  • The slope is the rate of change of the mean response as the explanatory variable increases

  • The slope explains how changes in x affect outcome variable y

A confidence interval is useful because it shows us how accurate the estimate of b is likely to be.


Confidence Interval for Regression Slope

A level C confidence interval for the slope b of the population regression line is

\hat{b} \pm t^* \cdot SE_{b},

where t^* = t^{(1-C)/2}_{n-2} is the critical value of the t_{n-2} distribution with upper-tail area \frac{1-C}{2}



Confidence Interval for Regression Slope

Example: Recall our regression results looking at the relationship of temperature on coral calcification. The estimated slope was \hat{b} = 0.4615 with a standard error SE_{\hat{b}} = 0.07394. Note this was based on a sample of 12 observations.

12 observations mean our t_{n-2} distribution has 12 - 2 = 10 degrees of freedom, and the critical t-stat is 2.23 when (1-C)/2 = 0.05/2 = 0.025

If we want to construct a 95% confidence interval:

\hat{b} \pm t^* SE_{\hat{b}} = 0.4615 \pm (2.23)(0.07394)

The 95% confidence interval for population slope b is [0.297, 0.626].
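A minimal R check of this interval, using qt() to get the exact critical value instead of the rounded 2.23 from the table:

t_star <- qt(0.975, df = 10)          # about 2.228
0.4615 + c(-1, 1) * t_star * 0.07394  # roughly [0.297, 0.626]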


Clicker Question

A random sample of 19 companies was selected, and the relationship between sales (in hundreds of thousands of dollars) and profits (in hundreds of thousands of dollars) was investigated by the regression \text{profits} = a + b \cdot \text{sales}. The following results were obtained from statistical software:

Parameter   Estimate    Std. Error of Estimate
\hat{a}     -176.6440   61.1600
\hat{b}     0.0925      0.0075

An approximate 90% confidence interval for the slope b is:

  1. -176.66 to -176.63
  2. 0.079 to 0.106
  3. 0.071 to 0.114

Confidence Intervals

R will spit out 95% confidence intervals for the estimates with confint:

# Hourly Earnings ($) on Years of Education
confint(lm(wage ~ educ, data = wage1))
#>                  2.5 %    97.5 %
#> (Intercept) -2.2504719 0.4407687
#> educ         0.4367534 0.6459651

We are 95% confident that an additional year of schooling increases average hourly earnings by between $0.44 and $0.65


Significance and Margin of Error

Conducting a hypothesis test on \hat{b} tells you about the significance of your result

  • If the p-value < \alpha, we can say our coefficient is statistically different from zero

A confidence interval says something about the precision of the coefficient

  • It gives the range of values in which we expect the true coefficient to lie

  • The confidence interval also contains exactly the values of b for which you would fail to reject the null


Significance and Margin of Error

#>
#> Call:
#> lm(formula = wage ~ educ, data = wage1)
#>
#> Residuals:
#>     Min      1Q  Median      3Q     Max
#> -5.3396 -2.1501 -0.9674  1.1921 16.6085
#>
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)
#> (Intercept) -0.90485    0.68497  -1.321    0.187
#> educ         0.54136    0.05325  10.167   <2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 3.378 on 524 degrees of freedom
#> Multiple R-squared: 0.1648, Adjusted R-squared: 0.1632
#> F-statistic: 103.4 on 1 and 524 DF, p-value: < 2.2e-16

Do we reject the null that education has no effect on wage?


Categorical Variable inside Regression

In the previous example, the explanatory variable was continuous. Now the explanatory variable is categorical; let's see how that changes interpretation.

# Hourly Earnings ($) on HS Degree
summary(lm(wage ~ hs_deg, data = wage1))
#>
#> Call:
#> lm(formula = wage ~ hs_deg, data = wage1)
#>
#> Residuals:
#>     Min      1Q  Median      3Q     Max
#> -5.8865 -2.4165 -0.9267  1.1734 18.5635
#>
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)
#> (Intercept)   4.0567     0.3309  12.258  < 2e-16 ***
#> hs_deg        2.3598     0.3748   6.296 6.48e-10 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 3.564 on 524 degrees of freedom
#> Multiple R-squared: 0.07032, Adjusted R-squared: 0.06854
#> F-statistic: 39.63 on 1 and 524 DF, p-value: 6.485e-10

Categorical Variable inside Regression

This regression implies the relationship between HS Degree and hourly earnings is:

\widehat{\text{Income}} = \$4.06 + \$2.36 \cdot \text{HS Degree}

The takeaways here would be:

  • Without a HS degree, predicted hourly wage is $4.06

  • With a HS degree, predicted hourly wage is $4.06 + $2.36 = $6.42

The coefficient on an indicator variable represents the difference in the average of Y between the x = 0 and x = 1 groups, as the R check below shows.
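A quick R check of this claim, assuming the wage1 data frame has the 0/1 indicator hs_deg used in the output above (a variable the author appears to have constructed, e.g., from educ):

# Difference in average wage between the two groups...
with(wage1, mean(wage[hs_deg == 1]) - mean(wage[hs_deg == 0]))

# ...equals the slope on the indicator: about 2.36
coef(lm(wage ~ hs_deg, data = wage1))[2]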


Conditions for Regression Inference

Say we have n observations regarding explanatory variable x and response variable y.

  • The mean response E(Y | X) has a straight-line relationship with x, given by a population regression line E(Y|X) = a + b X

  • For any fixed value of x, the response variable y varies according to a normal distribution

  • Repeated responses y are independent of each other

  • The standard deviation of \varepsilon, \sigma, is the same for all values of x.


Intuition about Conditions

The mean response E(Y \ \vert \ X) has a straight-line relationship with x, given by a population regression line

  • In practice, we observe y for many different values of x. Eventually we see an overall linear pattern formed by points scattered about the population line.

Intuition about Conditions

For any fixed value of x, the response variable y varies according to a normal distribution

  • We cannot observe the entire population regression line. The values of y that we do observe vary about their means according to a normal distribution. If we hold x constant and take many observations of y, the Normal pattern will eventually appear in a histogram.

Intuition about Conditions

The standard deviation of \varepsilon, \sigma, is the same for all values of x. The value of \sigma is unknown.

  • The standard deviation determines whether the points fall close to the population regression line (small \sigma) or are widely scattered (large \sigma)

  • If \sigma changes depending on x, then our sampling distribution for \hat{b} would be wrong.


Intuition about Conditions

  • For each possible value of x, the mean of the responses moves along the population regression line

  • For a fixed x, the responses y follow a normal distribution with std. dev \sigma

  • The normal curve shows how y will vary when x is held constant


Checking Conditions for Inference

Remember, all of this discussion about inferences hinges on the data meeting certain conditions.

  • The relationship is linear in the population

  • The response varies normally about the regression line

  • Observations are independent

  • The standard deviation of the responses is the same for all values of x


Checking Conditions for Inference

In order to check these conditions, it can be helpful to look at a residual plot. A residual plot plots the residuals against the explanatory variable x, with a horizontal line at the "residual =0" position. The "residual =0" line represents the position of the least-squares line in the scatterplot of y against x.

(Figures: scatterplot with the least-squares regression line, and the corresponding residual plot.)
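A minimal R sketch of a residual plot, using the wage-on-education regression from earlier (assuming wage1 is loaded):

fit <- lm(wage ~ educ, data = wage1)

# Residuals against the explanatory variable, with a line at residual = 0
plot(wage1$educ, resid(fit),
     xlab = "Years of Education", ylab = "Residual")
abline(h = 0, lty = 2)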


Checking Conditions for Inference

  • The relationship is linear. Look for curved patterns or other deviations from an overall straight-line pattern in the residual plot

  • The response varies normally about regression line. Check for departures from normality in your stemplot or histogram of residuals.

  • Observations are independent. Signs of dependence in the residual plot are subtle, so we usually rely on common sense about how the data were collected.

  • Standard deviation of responses is same for all values of x. Look at the scatter of residuals above and below the "residual =0" line. The scatter should be roughly the same from one end to the other.

