QTM 151 - Introduction to Statistical Computing II

Lecture 12 - Application 03: Linear Models

Danilo Freire

Emory University

09 October, 2024

Hello, everyone! How are you doing? 😊

Linear Models

How to fit and interpret linear models in Python

  • Today we will learn how to run a linear regression using the statsmodels library in Python
  • Linear models (usually estimated via ordinary least squares, or OLS) are the most common type of regression analysis
    • They are used to predict the value of a dependent variable based on one or more independent variables
    • As the name implies, the relationship between the dependent variable and the independent variables is assumed to be linear (i.e., a straight line)
  • But don’t worry: if you haven’t had statistics before, I will explain everything you need to know (no maths required, trust me! 😅)
# Import libraries
import numpy as np
import pandas as pd
import statsmodels.api as sm
import matplotlib.pyplot as plt

# Create a pandas DataFrame
np.random.seed(151)
x = np.random.normal(loc = 5, scale = 5, size = 100)
y = 1 + 2*x + np.random.normal(loc = 5, scale = 10, size = 100)
df = pd.DataFrame({'y': y, 'x': x})

# Estimate OLS
X = sm.add_constant(x)
model = sm.OLS(y, X).fit()
predictions = model.predict(X)

# Create graph
plt.figure(figsize = (6, 5))
plt.plot(x, predictions, color = 'red', label = 'OLS regression line')
plt.scatter(x = df.x, y = df.y, color = 'black', alpha = 0.5, label = 'Data points')

# Add labels and legend
plt.xlabel('Variable X')
plt.ylabel('Variable Y')
plt.title('Linear Model')
plt.legend()
plt.show()

Import and install libraries

  • First, let’s import the libraries we have seen so far
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
  • Now, let’s install the statsmodels library

  • According to the Anaconda website, statsmodels is already included in the distribution, but you can install it using conda or pip.

  • Mac and Linux users can run the following command in the terminal:

conda install statsmodels
  • Windows users can do the following:
    • Open the Anaconda Prompt then type conda install statsmodels and press Enter
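  • If you prefer pip, the same command works on any platform:

pip install statsmodels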

  • statsmodels is the go-to library for statistical modelling in Python
  • It has models for linear regression, time series analysis, and much more
  • It works well with pandas and numpy and even supports R-style formulas
    • model = smf.ols(formula = 'Dependent ~ Independent01 + Independent02', data=df)
  • We will use it to estimate our linear model

Generate simulated data

  • Let’s generate some simulated data to illustrate how linear models work 😊📈
  • First, let’s create an empty DataFrame
dataset = pd.DataFrame()
  • Now, let’s create two random variables of size 50
n = 50
np.random.seed(42)
dataset["x"] = np.random.normal(loc = 0,scale = 1, size = n)
dataset["e"] = np.random.normal(loc = 0,scale = 1, size = n) 
  • Let me explain what we’re doing here 🤓
    • x is the independent variable, which we will use to predict the dependent variable y
    • e is the error term, which represents the noise in the data
  • Why do we include the error term?
    • Because in real life, we don’t have perfect data
    • The error term is what makes the relationship between x and y imperfect (see the sketch below)
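  • A quick sketch of this, using the x and e we just created (the names y_perfect and y_noisy are just illustrative):
# Without e, y is an exact function of x; with e, the points scatter around the line
y_perfect = 1 + 2 * dataset["x"]               # no noise: correlation with x is exactly 1
y_noisy = 1 + 2 * dataset["x"] + dataset["e"]  # with noise: correlation drops below 1

print(np.corrcoef(dataset["x"], y_perfect)[0, 1])
print(np.corrcoef(dataset["x"], y_noisy)[0, 1])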

Generate simulated data

  • Now, let’s create data from the linear model

  • The model will be \(y = b_0 + b_1 x + e\), where \(b_0 = 1, b_1 = 2.\)

  • \(b_0\) is the intercept, \(b_1\) is the slope, and \(e\) is the error term

  • Let’s create the dependent variable y using the formula above

# Create the dependent variable
# y = 1 + 2*x + e
b0 = 1
b1 = 2

dataset["y"] = b0 + b1 * dataset["x"] + dataset["e"] 
  • Then, let’s compute the line of best fit
# Line of best fit 
dataset["p"] = b0 + b1*dataset["x"]
dataset.head(3)
          x         e         y         p
0  0.496714  0.324084  2.317512  1.993428
1 -0.138264 -0.385082  0.338389  0.723471
2  0.647689 -0.676922  1.618455  2.295377

Visualising the data

  • Now, let’s plot the data and the line of best fit
# Plot the data
plt.scatter(x = dataset["x"], y = dataset["y"])
plt.plot(dataset["x"], dataset["p"], color = 'green')

plt.xlabel("X Variable")
plt.ylabel("Y Variable")
plt.legend(labels = ["Data points", "Best fit line"])
plt.show() 

Try it yourself! 🚀

  • Create a new dataset called subset_above2

  • Subset records with \(y \ge 2\) using .query()

  • Count the original rows with len(dataset)

  • Count the subsetted rows: len(subset_above2)

  • Compute the proportion of subsetted observations

  • Solution: Appendix 01

Try it yourself! 🚀

  • Store the sample mean of \(y\) as ybar
  • Compute the standard deviation of \(y\) as stdv_sample
  • Use .query() to subset observations that satisfy

\[\left| y - \text{ybar} \right| \le \text{stdv\_sample}\]

  • HINT: Use .mean() and .std()

  • HINT: Use the globals @ybar, @stdv_sample

  • Solution: Appendix 02

Note: .abs() is the pandas method that returns the absolute value of each element of a Series; Python’s built-in abs() does the same for a single number. For example, abs(-3) and abs(3) both return 3. It is useful for computing the distance between two numbers regardless of their sign, such as 5 and -3 (which are 8 units apart): a = 5; b = -3; abs(a - b) == 8.
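A quick illustration of the difference (the Series values are just examples):
a = 5
b = -3
print(abs(a - b))                # 8: the built-in abs() works on a single number
print(pd.Series([-3, 3]).abs())  # .abs() works element-wise: both entries become 3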

Estimate the line of best fit 📈

Estimate the line of best fit

  • We usually have information about the dependent variable \(y\) and the independent variable \(x\), but we don’t know the coefficients \(b_0\) and \(b_1\)
  • The goal is to estimate the line of best fit that minimises the sum of the squared residuals
  • The residuals are the differences between the observed values of \(y\) and the predicted values of \(y\)
  • The statistical method that estimates the coefficients is called ordinary least squares (OLS)
  • If you have used R before, this is the same as lm(y ~ x, data = dataset)
  • In Python, we use the statsmodels library to estimate the coefficients
    • We’ll use just one part of the library, statsmodels.api
    • Its alias is sm
    • If you want to use R-style formulas, you can use statsmodels.formula.api (alias smf)
  • First, let’s load the packages
# Import the library
import statsmodels.api as sm 
import statsmodels.formula.api as smf
  • Now let’s run the model
model = smf.ols(formula = 'y ~ x', data = dataset)
results = model.fit()
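  • For the curious, the OLS coefficients have a closed-form solution in the simple regression case; here is a minimal sketch of the textbook formulas (the names b0_hat and b1_hat are just illustrative), which should match the estimates below:
# Slope: covariance of x and y divided by the variance of x
xbar = dataset["x"].mean()
ybar = dataset["y"].mean()
b1_hat = ((dataset["x"] - xbar) * (dataset["y"] - ybar)).sum() / ((dataset["x"] - xbar) ** 2).sum()

# Intercept: the fitted line passes through the point of means
b0_hat = ybar - b1_hat * xbar
print(b0_hat, b1_hat)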

Estimate the line of best fit

from stargazer.stargazer import Stargazer

model = smf.ols(formula='y ~ x', data=dataset)
results = model.fit()

# Create a Stargazer object
table = Stargazer([results])
table
                        Dependent variable: y
                        ---------------------
                                 (1)
-----------------------------------------------------
Intercept                      1.041***
                               (0.128)
x                              2.103***
                               (0.134)
-----------------------------------------------------
Observations                      50
R2                              0.836
Adjusted R2                     0.833
Residual Std. Error        0.878 (df=48)
F Statistic           245.067*** (df=1; 48)
-----------------------------------------------------
Note:                 *p<0.1; **p<0.05; ***p<0.01
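  • The table above uses the stargazer package (installable with pip install stargazer); if you don’t have it, statsmodels’ built-in summary prints the same estimates:
print(results.summary())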

Compute the line of best fit

  • We will use the .params attribute to extract the estimated parameters from the results
b_list = results.params
print(b_list)
Intercept    1.041022
x            2.103076
dtype: float64
  • We can then compute the estimated best fit line by extracting the intercept and slope from b_list
dataset["p_estimated"] = b_list.iloc[0] + b_list.iloc[1] * dataset["x"]
print(dataset[["p", "p_estimated"]].head(10))
          p  p_estimated
0  1.993428     2.085649
1  0.723471     0.750241
2  2.295377     2.403160
3  4.046060     4.244069
4  0.531693     0.548579
5  0.531726     0.548614
6  4.158426     4.362226
7  2.534869     2.654995
8  0.061051     0.053682
9  2.085120     2.182067
  • It’s pretty close, right? 😊
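  • Instead of computing the fitted values by hand, we can also ask statsmodels for them directly; the following (with the illustrative column name p_check) should reproduce p_estimated:
# Fitted values straight from the results object
dataset["p_check"] = results.fittedvalues

# With the formula interface, we can also predict from the DataFrame
# dataset["p_check"] = results.predict(dataset)
print(dataset[["p_estimated", "p_check"]].head(3))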

Visualising the data

  • Now, let’s plot the data and the line of best fit
  • The first line plots the data points
  • The second line plots the estimated best fit line
  • The legend command creates a box with the color labels
plt.scatter(x = dataset["x"], y = dataset["y"])
plt.plot(dataset["x"], dataset["p_estimated"], color = 'green')

plt.legend(labels = ["Data points","Estimated Predicted Model"])
plt.show()

Try it yourself! 🚀

  • How good is the estimated fit?
  • Create two overlapping lineplots
  • \((x\) vs \(p)\) and \((x\) vs \(p_{estimated})\)
  • Create a legend to label each plot
  • Solution: Appendix 03

Try it yourself! 🚀

  • Compute a column with the formula sample_error = y - p_estimated

  • Create a lambda function fn_positive_error = lambda error: error >= 0

  • Compute a column for whether the error is positive using .apply()

  • Solution: Appendix 04

Try it yourself (last time)! 🚀

  • Compute a new column error_sqr = sample_error ** 2

  • Calculate the mean of error_sqr

  • Solution: Appendix 05

That’s all for today! 🎉

Enjoy the fall break! 🍂

Appendix 01

  • Create a new dataset called subset_above2
  • Subset records with \(y \ge 2\) using .query()
  • Count the original rows with len(dataset)
  • Count the subsetted rows: len(subset_above2)
  • Compute the proportion of subsetted observations
# Create the new dataset by subsetting the records with y >= 2
subset_above2 = dataset.query("y >= 2")

# Count the original rows
original_rows = len(dataset)

# Count the subsetted rows
subsetted_rows = len(subset_above2)

# Compute the proportion of subsetted observations
proportion = subsetted_rows / original_rows
proportion
0.3
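  • A shortcut worth knowing: the mean of a boolean Series is the proportion of True values, so the same number comes from a one-liner
(dataset["y"] >= 2).mean()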

Back to Exercise 01

Appendix 02

  • Store the sample mean of \(y\) as ybar
  • Compute the standard deviation of \(y\) as stdv_sample
  • Use .query() to subset observations that satisfy \(\left| y - \text{ybar} \right| \le \text{stdv\_sample}\)
  • HINT: Use .mean() and .std()
  • HINT: Use the globals @ybar, @stdv_sample
# Store the sample mean of y as ybar
ybar = dataset["y"].mean()

# Compute the standard deviation of y as stdv_sample
stdv_sample = dataset["y"].std()

# Subset observations that satisfy abs(y - ybar) <= stdv_sample
subset_within_stdv = dataset.query("abs(y - @ybar) <= @stdv_sample")

# Count the subsetted rows
len(subset_within_stdv)
33
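  • As a sanity check: 33/50 = 0.66, close to the roughly 68% of observations we would expect within one standard deviation of the mean, since y here is a linear combination of normal variables and hence normally distributed.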

Back to Exercise 02

Appendix 03

  • How good is the estimated fit?
  • Create two overlapping lineplots
  • \((x\) vs \(p)\) and \((x\) vs \(p_{estimated})\)
  • Create a legend to label each plot
plt.plot(dataset["x"], dataset["p"], color = 'blue')
plt.plot(dataset["x"], dataset["p_estimated"], color = 'green')

plt.legend(labels = ["Predicted Model", "Estimated Predicted Model"])
plt.show()

Back to Exercise 03

Appendix 04

  • Compute a column with the formula sample_error = y - p_estimated
  • Create a lambda function fn_positive_error = lambda error: error >= 0
  • Compute a column for whether the error is positive using .apply()
# Compute a column with the formula sample_error = y - p_estimated
dataset["sample_error"] = dataset["y"] - dataset["p_estimated"]

# Create a lambda function fn_positive_error = lambda error: error >= 0
fn_positive_error = lambda error: error >= 0

# Compute a column for whether the error is positive using .apply()
dataset["positive_error"] = dataset["sample_error"].apply(fn_positive_error)
dataset.head(3)
          x         e         y         p  p_estimated  sample_error  positive_error
0  0.496714  0.324084  2.317512  1.993428     2.085649      0.231863            True
1 -0.138264 -0.385082  0.338389  0.723471     0.750241     -0.411852           False
2  0.647689 -0.676922  1.618455  2.295377     2.403160     -0.784705           False

Back to Exercise 04

Appendix 05

  • Compute a new column error_sqr = sample_error ** 2
  • Calculate the mean of error_sqr
# Compute a new column error_sqr = sample_error ** 2
dataset["error_sqr"] = dataset["sample_error"] ** 2

# Calculate the mean of error_sqr
mean_error_sqr = dataset["error_sqr"].mean()
mean_error_sqr
0.7400786858214785
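  • This quantity is the mean squared error (MSE) of the fit. It is closely related to the residual standard error in the regression table: \(\sqrt{0.7401 \times 50 / 48} \approx 0.878\), the value reported there (dividing by \(n - 2 = 48\) rather than \(n\) accounts for the two estimated coefficients).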

Back to Exercise 05