QTM 151 - Introduction to Statistical Computing II

Lecture 12 - Application 03: Linear Models

Danilo Freire

Emory University

09 October, 2024

Hello, everyone! How are you doing? 😊

Linear Models

How to fit and interpret linear models in Python

  • Today we will learn how to run a linear regression using the statsmodels library in Python
  • Linear models (usually estimated via ordinary least squares, or OLS) are the most common type of regression analysis
    • They are used to predict the value of a dependent variable based on one or more independent variables
    • As the name implies, the relationship between the dependent variable and the independent variables is assumed to be linear (i.e., a straight line)
  • But don’t worry: if you haven’t had statistics before, I will explain everything you need to know (no maths required, trust me! 😅)
# Import libraries
import numpy as np
import pandas as pd
import statsmodels.api as sm
import matplotlib.pyplot as plt

# Create a pandas DataFrame
np.random.seed(151)
x = np.random.normal(loc = 5, scale = 5, size = 100)
y = 1 + 2*x + np.random.normal(loc = 5, scale = 10, size = 100)
df = pd.DataFrame({'y': y, 'x': x})

# Estimate OLS
X = sm.add_constant(x)
model = sm.OLS(y, X).fit()
predictions = model.predict(X)

# Create graph
plt.figure(figsize = (6, 5))
plt.plot(x, predictions, color = 'red', label = 'OLS regression line')
plt.scatter(x = df.x, y = df.y, color = 'black', alpha = 0.5, label = 'Data points')

# Add labels and legend
plt.xlabel('Variable X')
plt.ylabel('Variable Y')
plt.title('Linear Model')
plt.legend()
plt.show()

Import and install libraries

  • First, let’s import the libraries we have seen so far
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
  • Now, let’s install the statsmodels library

  • According to the Anaconda website, statsmodels is already included in the distribution, but you can install it using conda or pip.

  • Mac and Linux users can run the following command in the terminal:

conda install statsmodels
  • Windows users can do the following:
    • Open the Anaconda Prompt then type conda install statsmodels and press Enter
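  • If you prefer pip, the same command works on any platform:

pip install statsmodels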

  • statsmodels is the go-to library for statistical modelling in Python
  • It has models for linear regression, time series analysis, and much more
  • It works well with pandas and numpy and even supports R-style formulas
    • model = smf.ols(formula = 'Dependent ~ Independent01 + Independent02', data=df)
  • We will use it to estimate our linear model

Generate simulated data

  • Let’s generate some simulated data to illustrate how linear models work 😊📈
  • First, let’s create an empty DataFrame
dataset = pd.DataFrame()
  • Now, let’s create two random variables of size 50
n = 50
np.random.seed(42)
dataset["x"] = np.random.normal(loc = 0,scale = 1, size = n)
dataset["e"] = np.random.normal(loc = 0,scale = 1, size = n) 
  • Let me explain what we’re doing here 🤓
    • x is the independent variable, which we will use to predict the dependent variable y
    • e is the error term, which represents the noise in the data
  • Why do we include the error term?
    • Because in real life, we don’t have perfect data
    • The error term is what makes the relationship between x and y imperfect (see the sketch below)
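  • A quick sketch of this, using the x and e we just created (the names y_perfect and y_noisy are just illustrative):
# Without e, y is an exact function of x; with e, the points scatter around the line
y_perfect = 1 + 2 * dataset["x"]               # no noise: correlation with x is exactly 1
y_noisy = 1 + 2 * dataset["x"] + dataset["e"]  # with noise: correlation drops below 1

print(np.corrcoef(dataset["x"], y_perfect)[0, 1])
print(np.corrcoef(dataset["x"], y_noisy)[0, 1])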

Generate simulated data

  • Now, let’s create data from the linear model

  • The model will be \(y = b_0 + b_1 x + e\), where \(b_0 = 1, b_1 = 2.\)

  • \(b_0\) is the intercept, \(b_1\) is the slope, and \(e\) is the error term

  • Let’s create the dependent variable y using the formula above

# Create the dependent variable
# y = 1 + 2*x + e
b0 = 1
b1 = 2

dataset["y"] = b0 + b1 * dataset["x"] + dataset["e"] 
  • Then, let’s compute the line of best fit
# Line of best fit 
dataset["p"] = b0 + b1*dataset["x"]
dataset.head(3)
          x         e         y         p
0  0.496714  0.324084  2.317512  1.993428
1 -0.138264 -0.385082  0.338389  0.723471
2  0.647689 -0.676922  1.618455  2.295377

Visualising the data

  • Now, let’s plot the data and the line of best fit
# Plot the data
plt.scatter(x = dataset["x"], y = dataset["y"])
plt.plot(dataset["x"], dataset["p"], color = 'green')

plt.xlabel("X Variable")
plt.ylabel("Y Variable")
plt.legend(labels = ["Data points", "Best fit line"])
plt.show() 

Try it yourself! 🚀

  • Create a new dataset called subset_above2

  • Subset records with \(y \ge 2\) using .query()

  • Count the original rows with len(dataset)

  • Count the subsetted rows: len(subset_above2)

  • Compute the proportion of subsetted observations

  • Solution: Appendix 01

Try it yourself! 🚀

  • Store the sample mean of \(y\) as ybar
  • Compute the standard deviation of \(y\) as stdv_sample
  • Use .query() to subset observations that satisfy

\[\left| y - \text{ybar} \right| \le \text{stdv\_sample}\]

  • HINT: Use .mean() and .std()

  • HINT: Use the globals @ybar, @stdv_sample

  • Solution: Appendix 02

Note: .abs() is the pandas method that returns the absolute value of each element of a Series; Python’s built-in abs() does the same for a single number. For example, abs(-3) and abs(3) both return 3. It is useful for computing the distance between two numbers regardless of their sign, such as 5 and -3 (which are 8 units apart): a = 5; b = -3; abs(a - b) == 8.
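A quick illustration of the difference (the Series values are just examples):
a = 5
b = -3
print(abs(a - b))                # 8: the built-in abs() works on a single number
print(pd.Series([-3, 3]).abs())  # .abs() works element-wise: both entries become 3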

Estimate the line of best fit 📈

Estimate the line of best fit

  • We usually have information about the dependent variable \(y\) and the independent variable \(x\), but we don’t know the coefficients \(b_0\) and \(b_1\)
  • The goal is to estimate the line of best fit that minimises the sum of the squared residuals
  • The residuals are the differences between the observed values of \(y\) and the predicted values of \(y\)
  • The statistical method that estimates the coefficients is called ordinary least squares (OLS)
  • If you have used R before, this is the same as lm(y ~ x, data = dataset)
  • In Python, we use the statsmodels library to estimate the coefficients
    • We’ll use just one part of the library, statsmodels.api
    • Its alias is sm
    • If you want to use R-style formulas, you can use statsmodels.formula.api (alias smf)
  • First, let’s load the packages
# Import the library
import statsmodels.api as sm 
import statsmodels.formula.api as smf
  • Now let’s run the model
model = smf.ols(formula = 'y ~ x', data = dataset)
results = model.fit()
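  • For the curious, the OLS coefficients have a closed-form solution in the simple regression case; here is a minimal sketch of the textbook formulas (the names b0_hat and b1_hat are just illustrative), which should match the estimates below:
# Slope: covariance of x and y divided by the variance of x
xbar = dataset["x"].mean()
ybar = dataset["y"].mean()
b1_hat = ((dataset["x"] - xbar) * (dataset["y"] - ybar)).sum() / ((dataset["x"] - xbar) ** 2).sum()

# Intercept: the fitted line passes through the point of means
b0_hat = ybar - b1_hat * xbar
print(b0_hat, b1_hat)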

Estimate the line of best fit

from stargazer.stargazer import Stargazer

model = smf.ols(formula='y ~ x', data=dataset)
results = model.fit()

# Create a Stargazer object
table = Stargazer([results])
table
                        Dependent variable: y
                        ---------------------
                                 (1)
-----------------------------------------------------
Intercept                      1.041***
                               (0.128)
x                              2.103***
                               (0.134)
-----------------------------------------------------
Observations                      50
R2                              0.836
Adjusted R2                     0.833
Residual Std. Error        0.878 (df=48)
F Statistic           245.067*** (df=1; 48)
-----------------------------------------------------
Note:                 *p<0.1; **p<0.05; ***p<0.01
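  • The table above uses the stargazer package (installable with pip install stargazer); if you don’t have it, statsmodels’ built-in summary prints the same estimates:
print(results.summary())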

Compute the line of best fit

  • We will use the .params attribute to extract the estimated parameters from the results
b_list = results.params
print(b_list)
Intercept    1.041022
x            2.103076
dtype: float64
  • We can then compute the estimated best fit line by extracting the intercept and slope from b_list
dataset["p_estimated"] = b_list.iloc[0] + b_list.iloc[1] * dataset["x"]
print(dataset[["p", "p_estimated"]].head(10))
          p  p_estimated
0  1.993428     2.085649
1  0.723471     0.750241
2  2.295377     2.403160
3  4.046060     4.244069
4  0.531693     0.548579
5  0.531726     0.548614
6  4.158426     4.362226
7  2.534869     2.654995
8  0.061051     0.053682
9  2.085120     2.182067
  • It’s pretty close, right? 😊
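  • Instead of computing the fitted values by hand, we can also ask statsmodels for them directly; the following (with the illustrative column name p_check) should reproduce p_estimated:
# Fitted values straight from the results object
dataset["p_check"] = results.fittedvalues

# With the formula interface, we can also predict from the DataFrame
# dataset["p_check"] = results.predict(dataset)
print(dataset[["p_estimated", "p_check"]].head(3))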

Visualising the data

  • Now, let’s plot the data and the line of best fit
  • The first line plots the data points
  • The second line plots the estimated best fit line
  • The legend command creates a box with the color labels
plt.scatter(x = dataset["x"], y = dataset["y"])
plt.plot(dataset["x"], dataset["p_estimated"], color = 'green')

plt.legend(labels = ["Data points","Estimated Predicted Model"])
plt.show()

Try it yourself! 🚀

  • How good is the estimated fit?
  • Create two overlapping lineplots
  • \((x\) vs \(p)\) and \((x\) vs \(p_{estimated})\)
  • Create a legend to label each plot
  • Solution: Appendix 03

Try it yourself! 🚀

  • Compute a column with the formula sample_error = y - p_estimated

  • Create a lambda function fn_positive_error = lambda error: error >= 0

  • Compute a column for whether the error is positive using .apply()

  • Solution: Appendix 04

Try it yourself (last time)! 🚀

  • Compute a new column error_sqr = sample_error ** 2

  • Calculate the mean of error_sqr

  • Solution: Appendix 05

That’s all for today! 🎉

Enjoy the fall break! 🍂

Appendix 01

  • Create a new dataset called subset_above2
  • Subset records with \(y \ge 2\) using .query()
  • Count the original rows with len(dataset)
  • Count the subsetted rows: len(subset_above2)
  • Compute the proportion of subsetted observations
# Create the new dataset by subsetting the records with y >= 2
subset_above2 = dataset.query("y >= 2")

# Count the original rows
original_rows = len(dataset)

# Count the subsetted rows
subsetted_rows = len(subset_above2)

# Compute the proportion of subsetted observations
proportion = subsetted_rows / original_rows
proportion
0.3
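  • A shortcut worth knowing: the mean of a boolean Series is the proportion of True values, so the same number comes from a one-liner
(dataset["y"] >= 2).mean()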

Back to Exercise 01

Appendix 02

  • Store the sample mean of \(y\) as ybar
  • Compute the standard deviation of \(y\) as stdv_sample
  • Use .query() to subset observations that satisfy \(\left| y - \text{ybar} \right| \le \text{stdv\_sample}\)
  • HINT: Use .mean() and .std()
  • HINT: Use the globals @ybar, @stdv_sample
# Store the sample mean of y as ybar
ybar = dataset["y"].mean()

# Compute the standard deviation of y as stdv_sample
stdv_sample = dataset["y"].std()

# Subset observations that satisfy abs(y - ybar) <= stdv_sample
subset_within_stdv = dataset.query("abs(y - @ybar) <= @stdv_sample")

# Count the subsetted rows
len(subset_within_stdv)
33
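  • As a sanity check: 33/50 = 0.66, close to the roughly 68% of observations we would expect within one standard deviation of the mean, since y here is a linear combination of normal variables and hence normally distributed.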

Back to Exercise 02

Appendix 03

  • How good is the estimated fit?
  • Create two overlapping lineplots
  • \((x\) vs \(p)\) and \((x\) vs \(p_{estimated})\)
  • Create a legend to label each plot
plt.plot(dataset["x"], dataset["p"], color = 'blue')
plt.plot(dataset["x"], dataset["p_estimated"], color = 'green')

plt.legend(labels = ["Predicted Model", "Estimated Predicted Model"])
plt.show()

Back to Exercise 03

Appendix 04

  • Compute a column with the formula sample_error = y - p_estimated
  • Create a lambda function fn_positive_error = lambda error: error >= 0
  • Compute a column for whether the error is positive using .apply()
# Compute a column with the formula sample_error = y - p_estimated
dataset["sample_error"] = dataset["y"] - dataset["p_estimated"]

# Create a lambda function fn_positive_error = lambda error: error >= 0
fn_positive_error = lambda error: error >= 0

# Compute a column for whether the error is positive using .apply()
dataset["positive_error"] = dataset["sample_error"].apply(fn_positive_error)
dataset.head(3)
          x         e         y         p  p_estimated  sample_error  positive_error
0  0.496714  0.324084  2.317512  1.993428     2.085649      0.231863            True
1 -0.138264 -0.385082  0.338389  0.723471     0.750241     -0.411852           False
2  0.647689 -0.676922  1.618455  2.295377     2.403160     -0.784705           False

Back to Exercise 04

Appendix 05

  • Compute a new column error_sqr = sample_error ** 2
  • Calculate the mean of error_sqr
# Compute a new column error_sqr = sample_error ** 2
dataset["error_sqr"] = dataset["sample_error"] ** 2

# Calculate the mean of error_sqr
mean_error_sqr = dataset["error_sqr"].mean()
mean_error_sqr
0.7400786858214785
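  • This quantity is the mean squared error (MSE) of the fit. It is closely related to the residual standard error in the regression table: \(\sqrt{0.7401 \times 50 / 48} \approx 0.878\), the value reported there (dividing by \(n - 2 = 48\) rather than \(n\) accounts for the two estimated coefficients).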

Back to Exercise 05