Chapters 4 and 5 discussed how scatterplots and lines of best fit show us linear relationships, but several questions remain:
Is there really a linear relationship between x and y, or could the pattern be due to chance?
What is the estimated slope that explains how y responds to x in the population? What is the margin of error for our estimate?
If we use the least-squares line to predict y for a given x, how accurate is that prediction?
In econometrics, you will learn when you can answer the question: what is the effect on y of changing x?
We can model the linear relationship between X and Y by thinking of a conditional expectation:
E(Y|X)= a + bX
We want estimates for a and b, \hat{a} and \hat{b}, and we find these estimates by minimizing the sum of squared residuals
\varepsilon_i = Y_i - \coral{\widehat{Y}_i} \equiv Y_i - (\coral{\hat{a} + \hat{b} X_i})
We pick the values of \hat{a} and \hat{b} to minimize the sum of squared residuals, $\sum_{i=1}^n \varepsilon_i^2 $. This yields the Ordinary Least Squares (OLS) estimators
\hat{a}=\bar{Y}-\hat{b}\bar{X}
\hat{b}=r_{XY}\frac{s_Y}{s_X}
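To see these formulas in action, here is a minimal R sketch on simulated data (the data-generating process and sample size are made up for illustration); the hand-computed estimates match lm():

# A minimal sketch of the OLS formulas on simulated data
# (x, y, and the sample size here are made up for illustration)
set.seed(1)
n <- 50
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)

b_hat <- cor(x, y) * sd(y) / sd(x)  # slope: r_XY * s_Y / s_X
a_hat <- mean(y) - b_hat * mean(x)  # intercept: Ybar - b_hat * Xbar

c(a_hat, b_hat)
coef(lm(y ~ x))  # same estimates from R's built-in OLS routine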
This chapter will answer
How can I interpret \hat{a} and \hat{b}?
What conditions are necessary for those interpretations?
Inference from a Regression
\hat{\coral{\text{Calcification Rate}}} = -12.103 + 0.4615 * \text{Temperature}
We can now predict how temperature affects the calcification rate. The R^2 will tell us how much of the variation in calcification rate is due to temperature, but it will not tell us whether this relationship is statistically significant.
In order for this regression to be meaningful, we must determine whether the results are statistically significant
When the conditions for the regression are met:¹
The slope \hat{b} of the least-squares line is an unbiased estimator of the population slope b
The intercept \hat{a} of the least-squares line is an unbiased estimator of the population intercept a
Now we only need to estimate the remaining parameter, \sigma, the standard deviation of the error term \varepsilon_i.
¹ We will discuss the conditions later.
Our regression model is:
y = a + b X + \varepsilon
\varepsilon is the error term that describes why an individual observation doesn't fall directly on the regression line a + b X.
We denote the variance of \varepsilon as \sigma^2. The standard deviation, \sigma, describes the variability of the response variable y about the population regression line (how far above or below the line y typically falls).
The least-squares line estimates the population regression line
\hat{\varepsilon} \equiv \text{residual} = y - \coral{\hat{y}}
Therefore we estimate \sigma by the sample standard deviation of the residuals, known as the regression standard error
s=\sqrt{\frac{1}{n-2} \sum\text{residual}^2} \equiv \sqrt{\frac{1}{n-2} \sum(y - \coral{\hat{y}})^2}
We use s to estimate the standard deviation, \sigma, of responses about the mean given by the population regression line
We will use this standard error to determine whether our estimates are statistically significant
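As a quick check, the regression standard error can be computed directly from the residuals. This sketch uses simulated data (not from the text) and compares the hand computation to the value R reports:

# Regression standard error: s = sqrt(sum(residual^2) / (n - 2))
# (simulated data, for illustration only)
set.seed(1)
x <- rnorm(50)
y <- 1 + 2 * x + rnorm(50)
fit <- lm(y ~ x)

s <- sqrt(sum(resid(fit)^2) / (length(y) - 2))
s
sigma(fit)  # the "Residual standard error" reported by summary(fit)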
To answer questions about whether associations between two variables are statistically significant, we must test a hypothesis about the slope b:
H_0: \ b = 0 \quad \text{vs.} \quad H_1: \ b \neq 0
If we fail to reject H_0:
A regression line with slope 0 is horizontal, meaning y does not change at all when x changes
H_0 says that there is no linear relationship between X and Y
If we reject H_0 in favor of H_1, we have evidence of a linear relationship between X and Y
Question: Why do we care about population vs. sample?

Population relationship:
y_i = \beta_0 + \beta_1 x_i + u_i
y_i = 2.53 + 0.57 x_i + u_i

Sample 1 (30 random individuals): \hat{y}_i = 2.36 + 0.61 x_i
Sample 2 (30 random individuals): \hat{y}_i = 2.79 + 0.56 x_i
Sample 3 (30 random individuals): \hat{y}_i = 3.21 + 0.45 x_i
1,000 Samples of Size 30
On average, our regression lines match the population line very nicely.
However, individual lines (samples) can really miss the mark.
Differences between individual samples and the population lead to uncertainty for the econometrician.
Since \hat{b} is a function of our data, it has a sampling distribution.
The sampling distribution of \hat{b} is: \hat{b} \sim N\left(b, \ \frac{\sigma^2}{n \, \sigma_X^2}\right). Another instance of a sampling distribution being normally distributed!
\sigma^2 is the variance of \varepsilon and \sigma_X^2 is the variance of X.
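A small simulation makes this concrete: draw many samples of size 30 from the population line used above and look at the spread of the slope estimates. The distributions of x and u below are assumptions made purely for illustration.

# Simulate the sampling distribution of b_hat
# (x and u are drawn from standard normals; an illustrative assumption)
set.seed(1)
b_hats <- replicate(1000, {
  x <- rnorm(30)
  u <- rnorm(30)
  y <- 2.53 + 0.57 * x + u
  coef(lm(y ~ x))["x"]
})

mean(b_hats)  # centered near the population slope 0.57 (unbiasedness)
sd(b_hats)    # sample-to-sample variability of the slope estimate
hist(b_hats)  # roughly bell-shaped, as the sampling distribution suggests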
To test the hypothesis, H_0: b=0, compute the t-statistic:
t_{n-2} = \frac{\hat{b} - 0}{SE_{\hat{b}}}
It is important to note that the degrees of freedom for the t-statistic testing a regression slope are n-2 (we estimate two parameters, a and b)
In this formula, the standard error of the least-squares slope is our estimate of the sampling distribution's standard deviation:
SE_{\hat{b}}=\frac{s}{\sqrt{\sum (x-\bar{x})^2}}
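The same kind of simulated data as before can be used to verify this formula against the standard error reported by summary(); the data-generating process is again an assumption for illustration:

# SE of the slope by hand: s / sqrt(sum((x - xbar)^2))
# (simulated data, for illustration only)
set.seed(1)
x <- rnorm(50)
y <- 1 + 2 * x + rnorm(50)
fit <- lm(y ~ x)

s    <- sqrt(sum(resid(fit)^2) / (length(y) - 2))
se_b <- s / sqrt(sum((x - mean(x))^2))

se_b
summary(fit)$coefficients["x", "Std. Error"]  # matches the formula above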
We fit a least-squares line to the model \text{Price} = a + b \, (\text{age}) with 28 observations on items sold at an antiques show. A summary of the output is below:
Parameter | Parameter Estimate | Std. Error of Estimate |
---|---|---|
\hat{a} | 27.730 | 34.840 |
\hat{b} | 1.893 | 0.267 |
Suppose we want to test the hypothesis H_0: b=0 vs. H_1: b \neq 0. The value of the t-statistic is: t_{26} = \frac{\hat{b} - 0}{SE_{\hat{b}}} = \frac{1.893 - 0}{0.267} = 7.09
Using the t-table \implies p < 0.001
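If you prefer software to the t-table, a rough check of that p-value in R (using the t-statistic and degrees of freedom above) looks like this:

# Two-sided p-value for t = 7.09 with n - 2 = 26 degrees of freedom
2 * pt(7.09, df = 26, lower.tail = FALSE)  # far below 0.001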
In the previous example we rejected the null hypothesis that b=0, meaning we claim there is sufficient evidence of a linear relationship between the age and sale price of items at an antiques show.
What type of error would we have committed if it turned out there was no relationship between age and price?
My budtender friend Eric did a study on marijuana consumption and hot cheeto consumption. He surveyed 25 of his friends and collected the following regression results. Assume \alpha = 0.05
Cheeto Consumption | Estimate | Std. Error | t-statistic | p-value |
---|---|---|---|---|
Intercept | 21.0 | 12.3 | | |
Joints Smoked | 4.2 | 1.8 | | |
Example: Regression analysis provides estimates of the relationship between daily wine consumption and risk of breast cancer. The estimated slope was \hat{b} = 0.009 with a standard error of SE_{\hat{b}} = 0.001, based on 25 observations.
We want to test whether these results are strong enough to reject the null hypothesis H_0: b = 0
in favor of the alternative hypothesis
H_1: b > 0
So we have \hat{b}=0.009 and SE_{\hat{b}}=0.001. Computing the t-statistic:
t=\frac{0.009}{0.001}=9
$$ 25 \text{ observations} \implies t_{n-2} = t_{23} $$
t_{23}^{0.0005} = 3.8 \implies p<0.0005
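Again, software gives the exact one-sided p-value; a quick sketch in R:

# One-sided p-value for H1: b > 0, with t = 9 and 23 degrees of freedom
pt(9, df = 23, lower.tail = FALSE)  # far below 0.0005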
# Hourly Earnings ($) on Years of Education
summary(lm(wage ~ educ, data = wage1))
#> 
#> Call:
#> lm(formula = wage ~ educ, data = wage1)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -5.3396 -2.1501 -0.9674  1.1921 16.6085 
#> 
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)    
#> (Intercept) -0.90485    0.68497  -1.321    0.187    
#> educ         0.54136    0.05325  10.167   <2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 3.378 on 524 degrees of freedom
#> Multiple R-squared:  0.1648, Adjusted R-squared:  0.1632 
#> F-statistic: 103.4 on 1 and 524 DF,  p-value: < 2.2e-16
The slope, b, of the population regression is usually the most important parameter in a regression problem
The slope is the rate of change of the mean response as the explanatory variable increases
The slope explains how changes in x affect outcome variable y
A confidence interval is useful because it shows us how accurate the estimate of b is likely to be.
A level C confidence interval for the slope b of the population regression line is
\hat{b} \pm t^* \cdot SE_{\hat{b}},
where t^* = t^{\frac{1-C}{2}}_{n-2}
Example: Recall our regression results looking at the relationship between temperature and coral calcification. The estimated slope was \hat{b} = 0.4615 with a standard error of SE_{\hat{b}} = 0.07394. Note this was based on a sample of 12 observations.
12 observations mean our t_{n-2} distribution has 12 - 2 = 10 degrees of freedom, and the critical value is t^* = 2.23 when (1-C)/2 = 0.05/2 = 0.025.
If we want to construct a 95% confidence interval:
\hat{b} \pm t^* SE_{\hat{b} }= 0.4615 \pm (2.23)(0.07394)
The 95% confidence interval for population slope b is [0.297, 0.626].
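The same interval can be computed in one line of R using the reported slope, standard error, and degrees of freedom:

# 95% confidence interval for the coral calcification slope
b_hat  <- 0.4615
se_b   <- 0.07394
t_star <- qt(0.975, df = 10)  # critical value, approximately 2.23

b_hat + c(-1, 1) * t_star * se_b  # approximately [0.297, 0.626]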
A random sample of 19 companies was selected, and the relationship between sales (in hundreds of thousands of dollars) and profits (in hundreds of thousands of dollars) was investigated with the regression \text{profits} = a + b \cdot \text{sales}. The following results were obtained from statistical software:
Parameter | Parameter Estimate | Std. Error of Estimate |
---|---|---|
\hat{a} | -176.6440 | 61.1600 |
\hat{b} | 0.0925 | 0.0075 |
An approximate 90% confidence interval for the slope b is:
R will spit out a 95% confidence interval associated with slope estimates with confint:
# Hourly Earnings ($) on Years of Education
confint(lm(wage ~ educ, data = wage1))
#>                  2.5 %    97.5 %
#> (Intercept) -2.2504719 0.4407687
#> educ         0.4367534 0.6459651
We are 95% confident that an additional year of schooling increases average hourly earnings by between $0.44 and $0.65
Conducting a hypothesis test on \hat{b} tells you about the significance of your result
A confidence interval says something about the precision of the coefficient
It gives the range of coefficient values within which we expect the true value to lie
The confidence interval also contains exactly the values of b for which you would fail to reject the null hypothesis
#> 
#> Call:
#> lm(formula = wage ~ educ, data = wage1)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -5.3396 -2.1501 -0.9674  1.1921 16.6085 
#> 
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)    
#> (Intercept) -0.90485    0.68497  -1.321    0.187    
#> educ         0.54136    0.05325  10.167   <2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 3.378 on 524 degrees of freedom
#> Multiple R-squared:  0.1648, Adjusted R-squared:  0.1632 
#> F-statistic: 103.4 on 1 and 524 DF,  p-value: < 2.2e-16
Do we reject null that education has no effect on wage?
In the previous example, the explanatory variable (years of education) was continuous. Let's see how the interpretation changes when the explanatory variable is categorical.
# Hourly Earnings ($) on HS Degree
summary(lm(wage ~ hs_deg, data = wage1))
#> 
#> Call:
#> lm(formula = wage ~ hs_deg, data = wage1)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -5.8865 -2.4165 -0.9267  1.1734 18.5635 
#> 
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)   4.0567     0.3309  12.258  < 2e-16 ***
#> hs_deg        2.3598     0.3748   6.296 6.48e-10 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 3.564 on 524 degrees of freedom
#> Multiple R-squared:  0.07032, Adjusted R-squared:  0.06854 
#> F-statistic: 39.63 on 1 and 524 DF,  p-value: 6.485e-10
This regression implies the relationship between HS Degree and hourly earnings is:
\coral{\widehat{\text{Wage}}} = \$4.06 + \$2.36 \cdot \text{HS Degree}
The takeaways here would be:
Without a HS degree, predicted wage is $4.06
With a HS degree, predicted wage is $4.06 + $2.36 = $6.42
The coefficient on an indicator represents the difference in averages of Y between the = 0 and = 1 groups.
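A quick way to convince yourself of this is to compare the regression coefficient to the raw difference in group means. The sketch below assumes wage1 comes from the wooldridge package and constructs a hypothetical hs_deg indicator from educ (the raw data set does not include hs_deg):

# Coefficient on a 0/1 indicator equals the difference in group means
# (hs_deg here is a hypothetical indicator constructed as educ >= 12)
library(wooldridge)
data("wage1")
wage1$hs_deg <- as.numeric(wage1$educ >= 12)

fit <- lm(wage ~ hs_deg, data = wage1)
coef(fit)["hs_deg"]

# Difference in average wage between the two groups
mean(wage1$wage[wage1$hs_deg == 1]) - mean(wage1$wage[wage1$hs_deg == 0])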
Say we have n observations regarding explanatory variable x and response variable y.
The mean response E(Y | X) has a straight-line relationship with x, given by a population regression line E(Y|X) = a + b X
For any fixed value of x, the response variable y varies according to a normal distribution
Repeated responses y are independent of each other
The standard deviation of \varepsilon, \sigma, is the same for all values of x. The value of \sigma is unknown.
The standard deviation determines whether the points fall close to the population regression line (small \sigma) or are widely scattered (large \sigma)
If \sigma changes depending on x, then our sampling distribution for \hat{b} would be wrong.
For each possible value of x, the mean of the responses moves along the population regression line
For a fixed x, the responses y follow a normal distribution with std. dev \sigma
The normal curve shows how y will vary when x is held constant
Remember, all of this discussion about inferences hinges on the data meeting certain conditions.
The relationship is linear in the population
The response varies normally about the regression line
Observations are independent
The standard deviation of the responses is the same for all values of x
In order to check these conditions, it can be helpful to look at a residual plot. A residual plot plots the residuals against the explanatory variable x, with a horizontal line at the "residual =0" position. The "residual =0" line represents the position of the least-squares line in the scatterplot of y against x.
The relationship is linear. Look for curved patterns or other deviations from an overall straight line pattern in residual plot
The response varies normally about regression line. Check for departures from normality in your stemplot or histogram of residuals.
Observations are independent. Signs of dependence in the residual plot are subtle, so usually use common sense.
Standard deviation of responses is same for all values of x. Look at the scatter of residuals above and below the "residual =0" line. The scatter should be roughly the same from one end to the other.
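A basic residual plot and residual histogram for the earlier wage regression can be produced with base R; this sketch assumes wage1 is available from the wooldridge package:

# Residual plot: residuals vs. the explanatory variable, with a line at 0
library(wooldridge)
data("wage1")
fit <- lm(wage ~ educ, data = wage1)

plot(wage1$educ, resid(fit),
     xlab = "Years of education", ylab = "Residual")
abline(h = 0, lty = 2)  # the "residual = 0" reference line

# Histogram of residuals to check the normality condition
hist(resid(fit), main = "", xlab = "Residual")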