Course: Math 535 - Mathematical Methods in Data Science (MMiDS)
Chapter: 2-Least squares
Author: Sebastien Roch, Department of Mathematics, University of Wisconsin-Madison
Updated: Feb 14, 2024
Copyright: © 2024 Sebastien Roch
Figure: Helpful map of ML by scitkit-learn (Source)
$\bowtie$
The following dataset is from the excellent textbook [ISLP]. Quoting [ISLP, Section 2.1]:
Suppose that we are statistical consultants hired by a client to provide advice on how to improve sales of a particular product. The
Advertising
data set consists of thesales
of that product in 200 different markets, along with advertising budgets for the product in each of those markets for three different media:TV
,radio
, andnewspaper
. [...] It is not possible for our client to directly increase sales of the product. On the other hand, they can control the advertising expenditure in each of the three media. Therefore, if we determine that there is an association between advertising and sales, then we can instruct our client to adjust advertising budgets, thereby indirectly increasing sales. In other words, our goal is to develop an accurate model that can be used to predict sales on the basis of the three media budgets.
This a regression problem. That is, we want to estimate the relationship between an outcome variable and one or more predictors (or features). We load the data, show its first few lines and some statistics.
df = pd.read_csv('advertising.csv')
df.head()
TV | radio | newspaper | sales | |
---|---|---|---|---|
0 | 230.1 | 37.8 | 69.2 | 22.1 |
1 | 44.5 | 39.3 | 45.1 | 10.4 |
2 | 17.2 | 45.9 | 69.3 | 9.3 |
3 | 151.5 | 41.3 | 58.5 | 18.5 |
4 | 180.8 | 10.8 | 58.4 | 12.9 |
df.describe()
TV | radio | newspaper | sales | |
---|---|---|---|---|
count | 200.000000 | 200.000000 | 200.000000 | 200.000000 |
mean | 147.042500 | 23.264000 | 30.554000 | 14.022500 |
std | 85.854236 | 14.846809 | 21.778621 | 5.217457 |
min | 0.700000 | 0.000000 | 0.300000 | 1.600000 |
25% | 74.375000 | 9.975000 | 12.750000 | 10.375000 |
50% | 149.750000 | 22.900000 | 25.750000 | 12.900000 |
75% | 218.825000 | 36.525000 | 45.100000 | 17.400000 |
max | 296.400000 | 49.600000 | 114.000000 | 27.000000 |
We will focus for now on the TV budget.
TV = df['TV'].to_numpy()
sales = df['sales'].to_numpy()
We make a scatterplot showing the realtion between those two quantities.
plt.scatter(TV, sales)
plt.xlabel('TV')
plt.ylabel('sales')
plt.show()
There does seem to be a relationship between the two. Roughly a higher TV budget is linked to higher sales, although the correspondence is not perfect. To express the relationship more quantitatively, we seek a function $f$ such that
$$ y \approx f(\mathbf{x}) $$where $\mathbf{x}$ denotes a sample TV budget and $y$ is the corresponding observed sales. We might posit for instance that there exists a true $f$ and that each observation is disrupted by some noise $\varepsilon$
$$ y = f(\mathbf{x}) + \varepsilon. $$A natural way to estimate such an $f$ from data is $k$-nearest-neighbors ($k$-NN) regression. Let the data be of the form $\{(\mathbf{x}_i, y_i)\}_{i=1}^n$, where in our case $\mathbf{x}_i$ is the TV budget of the $i$-th sample and $y_i$ is the corresponding sales. For each $\mathbf{x}$ (not necessarily in the data), we do the following:
find the $k$ nearest $\mathbf{x}_i$'s to $\mathbf{x}$
take an average of the corresponding $y_i$'s.
We implement this method in Python. We use the function numpy.argsort
to sort an array and the function numpy.absolute
to compute the absolute deviation. Our quick implementation here assumes that the $\mathbf{x}_i$'s are scalars.
def knnregression(x,y,k,xnew):
n = len(x)
closest = np.argsort([np.absolute(x[i]-xnew) for i in range(n)])
return np.mean(y[closest[0:k]])
For $k=3$ and a grid of $1000$ points, we get the following approximation $\hat{f}$. Here the function numpy.linspace
creates an array of equally spaced points.
k = 3
xgrid = np.linspace(TV.min(), TV.max(), num=1000)
yhat = [knnregression(TV,sales,k,xnew) for xnew in xgrid]
plt.scatter(TV, sales, alpha=0.5)
plt.xlabel('TV')
plt.ylabel('sales')
plt.plot(xgrid, yhat, 'r')
plt.show()
A higher $k$ gives something smoother.
k = 10
xgrid = np.linspace(TV.min(), TV.max(), num=1000)
yhat = [knnregression(TV,sales,k,xnew) for xnew in xgrid]
plt.scatter(TV, sales, alpha=0.5)
plt.xlabel('TV')
plt.ylabel('sales')
plt.plot(xgrid, yhat, 'r')
plt.show()
One downside of $k$-NN regression is that it does not give an easily interpretable relationship: if I increase my TV budget by $\Delta$ dollars, how is it expected to affect the sales? Another issue arises in high dimension where the counter-intuitive phenomena we discussed in a previous section can have a significant impact. Recall in particular the High-dimensional Cube Theorem. If we have $d$ predictors -- where $d$ is large -- and our data is distributed uniformly in a bounded region, then any given $\mathbf{x}$ will be far from any of our data points. In that case, the $y$-values of the closest $\mathbf{x}_i$'s may not be predictive. This is referred to as the Curse of Dimensionality.
One way out is to make stronger assumptions on the function $f$. For instance, we can assume that the true relationship is (approximately) affine, that is,
$$ y \approx \beta_0 + \beta_1 x. $$Or if we have $d$ predictors:
$$ y \approx \beta_0 + \sum_{j=1}^d \beta_j x_j. $$How do we estimate appropriate intercept and coefficients? The standard approach is to minimize the sum of the squared errors
$$ \sum_{i=1}^n \left(y_i - \left\{\beta_0 + \sum_{j=1}^d \beta_j (\mathbf{x}_{i})_j\right\}\right)^2, $$where $(\mathbf{x}_{i})_j$ is the $j$-th entry of input vector $\mathbf{x}_i$ and $y_i$ is the corresponding $y$-value. This is called multiple linear regression.
It is a least-squares problem. We re-write it in a more convenient matrix form and combine $\beta_0$ with the other $\beta_i$'s by adding a dummy predictor to each sample. Let
$$ \mathbf{y} = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix}, \quad\quad A = \begin{pmatrix} 1 & \mathbf{x}_1^T \\ 1 & \mathbf{x}_2^T \\ \vdots & \vdots \\ 1 & \mathbf{x}_n^T \end{pmatrix} \quad\text{and}\quad \boldsymbol{\beta} = \begin{pmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_d \end{pmatrix}. $$Then observe that
\begin{align*} \|\mathbf{y} - A \boldsymbol{\beta}\|^2 &= \sum_{i=1}^n \left(y_i - (A \boldsymbol{\beta})_i\right)^2\\ &= \sum_{i=1}^n \left(y_i - \left\{\beta_0 + \sum_{j=1}^d \beta_j (\mathbf{x}_{i})_j\right\}\right)^2. \end{align*}The linear least-squares problem is then formulated as
$$ \min_{\boldsymbol{\beta}} \|\mathbf{y} - A \boldsymbol{\beta}\|^2. $$In words, we are looking for a linear combination of the columns of $A$ that is closest to $\mathbf{y}$ in Euclidean distance. Indeed, minimizing the squared Euclidean distance is equivalent to minimizing its square root, as the latter in an increasing function.
One could solve this optimization problem through calculus (and we will come back to this approach later in the course), but understanding the geometric and algebraic structure of the problem turns out to provide powerful insights into its solution -- and that of many of problems in data science. It will also be an opportunity to review some basic linear-algebraic concepts along the way.
We will come back to the Advertising
dataset later in the chapter.
NUMERICAL CORNER: The plane $P$ made of all points $(x,y,z) \in \mathbb{R}^3$ that satisfy $z = x+y$ is a linear subspace. Indeed, $0 = 0 + 0$ so $(0,0,0) \in P$. And, for any $\mathbf{u}_1 = (x_1, y_1, z_1)$ and $\mathbf{u}_2 = (x_2, y_2, z_2)$ such that $z_1 = x_1 + y_1$ and $z_2 = x_2 + y_2$ and for any $\alpha \in \mathbb{R}$, we have
$$ \alpha z_1 + z_2 = \alpha (x_1 + y_1) + (x_2 + y_2) = (\alpha x_1 + x_2) + (\alpha y_1 + y_2). $$That is, $\alpha \mathbf{u}_1 + \mathbf{u}_2$ satisfies the condition defining $P$ and therefore is itself in $P$. Note also that $P$ passes through the origin.
In this example, the linear subspace $P$ can be described alternatively as the collection of every vector of the form $(x, y, x+y)$.
We use plot_surface
to plot it over a grid of points created using numpy.meshgrid
.
x = np.linspace(0,1,num=101)
y = np.linspace(0,1,num=101)
X, Y = np.meshgrid(x, y)
print(X)
[[0. 0.01 0.02 ... 0.98 0.99 1. ] [0. 0.01 0.02 ... 0.98 0.99 1. ] [0. 0.01 0.02 ... 0.98 0.99 1. ] ... [0. 0.01 0.02 ... 0.98 0.99 1. ] [0. 0.01 0.02 ... 0.98 0.99 1. ] [0. 0.01 0.02 ... 0.98 0.99 1. ]]
print(Y)
[[0. 0. 0. ... 0. 0. 0. ] [0.01 0.01 0.01 ... 0.01 0.01 0.01] [0.02 0.02 0.02 ... 0.02 0.02 0.02] ... [0.98 0.98 0.98 ... 0.98 0.98 0.98] [0.99 0.99 0.99 ... 0.99 0.99 0.99] [1. 1. 1. ... 1. 1. 1. ]]
Z = X + Y
print(Z)
[[0. 0.01 0.02 ... 0.98 0.99 1. ] [0.01 0.02 0.03 ... 0.99 1. 1.01] [0.02 0.03 0.04 ... 1. 1.01 1.02] ... [0.98 0.99 1. ... 1.96 1.97 1.98] [0.99 1. 1.01 ... 1.97 1.98 1.99] [1. 1.01 1.02 ... 1.98 1.99 2. ]]
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.plot_surface(X, Y, Z)
plt.show()
NUMERICAL CORNER: In Numpy, one can compute the rank of a matrix using the function numpy.linalg.matrix_rank
. We will see later in the course how to compute it using the singular value decomposition (which is how LA.matrix_rank
does it).
Let's try the example above.
w1 = np.array([1., 0., 1.])
w2 = np.array([0., 1., 1.])
w3 = np.array([1., -1., 0.])
A = np.stack((w1, w2, w3), axis=-1)
print(A)
[[ 1. 0. 1.] [ 0. 1. -1.] [ 1. 1. 0.]]
We compute the rank of A
.
LA.matrix_rank(A)
2
We take only the first two columns of A
this time to form B
.
B = np.stack((w1, w2),axis=-1)
print(B)
[[1. 0.] [0. 1.] [1. 1.]]
LA.matrix_rank(B)
2
In Numpy, @
is used for matrix product.
C = np.array([[1., 0., 1.],[0., 1., -1.]])
print(C)
[[ 1. 0. 1.] [ 0. 1. -1.]]
LA.matrix_rank(C)
2
print(B @ C)
[[ 1. 0. 1.] [ 0. 1. -1.] [ 1. 1. 0.]]
NUMERICAL CORNER: To solve a linear system in Numpy, use numpy.linalg.solve
. As an example, we consider the overdetermined system with
We use numpy.ndarray.T
for the transpose.
w1 = np.array([1., 0., 1.])
w2 = np.array([0., 1., 1.])
A = np.stack((w1, w2),axis=-1)
b = np.array([0., 0., 2.])
x = LA.solve(A.T @ A, A.T @ b)
print(x)
[0.66666667 0.66666667]
We can also use numpy.linalg.lstsq
directly on the overdetermined system to compute the least-square solution.
x = LA.lstsq(A, b, rcond=None)[0]
print(x)
[0.66666667 0.66666667]
NUMERICAL CORNER: We implement the Gram-Schmidt algorithm in Python. For reasons that will become clear in the next subsection, we output both the $\mathbf{q}_j$'s and $r_{ij}$'s, each in matrix form. Here we use numpy.dot
to compute inner products.
def gramschmidt(A):
(n,m) = A.shape
Q = np.zeros((n,m))
R = np.zeros((m,m))
for j in range(m):
v = np.copy(A[:,j])
for i in range(j):
R[i,j] = np.dot(Q[:,i], A[:,j])
v -= R[i,j]*Q[:,i]
R[j,j] = LA.norm(v)
Q[:,j] = v/R[j,j]
return Q, R
w1 = np.array([1., 0., 1.])
w2 = np.array([0., 1., 1.])
A = np.stack((w1, w2),axis=-1)
print(A)
[[1. 0.] [0. 1.] [1. 1.]]
Q, R = gramschmidt(A)
print(Q)
[[ 0.70710678 -0.40824829] [ 0. 0.81649658] [ 0.70710678 0.40824829]]
print(R)
[[1.41421356 0.70710678] [0. 1.22474487]]
NUMERICAL CORNER: We implement back substitution in Python. In our naive implementation, we assume that the diagonal entries are not zero, which will suffice for our purposes.
def backsubs(R,b):
m = b.shape[0]
x = np.zeros(m)
for i in reversed(range(m)):
x[i] = (b[i] - np.dot(R[i,i+1:m],x[i+1:m]))/R[i,i]
return x
NUMERICAL CORNER: We implement the QR approach to least squares and return to our simple overdetermined system example.
def ls_by_qr(A, b):
Q, R = gramschmidt(A)
return backsubs(R, Q.T @ b)
w1 = np.array([1., 0., 1.])
w2 = np.array([0., 1., 1.])
A = np.stack((w1, w2),axis=-1)
b = np.array([0., 0., 2.])
x = ls_by_qr(A, b)
print(x)
[0.66666667 0.66666667]
NUMERICAL CORNER: We implement the procedure above in Python. We will need the following function. For $\alpha \in \mathbb{R}$, let the sign of $\alpha$ be
$$ \mathrm{sign}(\alpha) = \begin{cases} 1 & \text{if $\alpha > 0$}\\ 0 & \text{if $\alpha = 0$}\\ -1 & \text{if $\alpha < 0$} \end{cases} $$For example, in Python, using the function numpy.sign
:
np.sign(-10), np.sign(20), np.sign(0)
(-1, 1, 0)
The following function constructs the upper triangular matrix $R$ by iteratively modifying the relevant block of $A$. On the other hand, computing the matrix $Q$ actually requires extra computational work that is often not needed. We saw that, in the context of the least-squares problem, we really only need to compute $Q^T \mathbf{b}$ for some input vector $\mathbf{b}$. This can be done at the same time that $R$ is constructed, as follows. The key point to note is that $Q^T \mathbf{b} = H_m \cdots H_1 \mathbf{b}$.
See here for an explanation of numpy.copy
.
def householder(A, b):
n, m = A.shape
R = np.copy(A)
Qtb = np.copy(b)
for k in range(m):
# computing z
y = R[k:n,k]
e1 = np.zeros(n-k)
e1[0] = 1
z = np.sign(y[0]) * LA.norm(y) * e1 + y
z = z / LA.norm(z)
# updating R
R[k:n,k:m] = R[k:n,k:m] - 2 * np.outer(z, z) @ R[k:n,k:m]
# updating Qtb
Qtb[k:n] = Qtb[k:n] - 2 * np.outer(z, z) @ Qtb[k:n]
return R[0:m,0:m], Qtb[0:m]
In householder
, we use both reflections defined above. We will not prove this here, but the particular choice made has good numerical properties. Quoting [TB, Lecture 10]:
Mathematically, either choice of sign is satisfactory. However, this is a case where numerical stability -- insensitivity to rounding errors -- dictates that one choice should be taken rather than the other. For numerical stability, it is desirable to reflect $\mathbf{x}$ to the vector $z \|\mathbf{x}\| \mathbf{e}_1$ that is not too close to $\mathbf{x}$ itself. [...] Suppose that [in the figure above] the angle between $H^+$ and the $\mathbf{e}_1$ axis is very small. Then the vector $\mathbf{v} = \|\mathbf{x}\| \mathbf{e}_1 - \mathbf{x}$ is much smaller than $\mathbf{x}$ or $\|\mathbf{x}\| \mathbf{e}_1$. Thus the calculation of $\mathbf{v}$ represents a subtraction of nearby quantities and will tend to suffer from cancellation errors.
We return to our overdetermined system example.
w1 = np.array([1., 0., 1.])
w2 = np.array([0., 1., 1.])
A = np.stack((w1, w2),axis=-1)
b = np.array([0., 0., 2.])
R, Qtb = householder(A, b)
x = backsubs(R, Qtb)
print(x)
[0.66666667 0.66666667]
One advantage of the Householder approach is that it produces a matrix $Q$ with very good orthogonality, i.e., $Q^T Q \approx I$. We give a quick example below comparing Gram-Schmidt and Householder. (The choice of matrix $A$ will become clearer when we discuss the singular value decomposition later in the chapter.)
n = 50
U, W = LA.qr(rng.normal(0,1,(n,n)))
V, W = LA.qr(rng.normal(0,1,(n,n)))
S = np.diag((1/2) ** np.arange(1,n+1))
A = U @ S @ V.T
Qgs, Rgs = gramschmidt(A)
print(LA.norm(A - Qgs @ Rgs))
1.3773646535704371e-16
print(LA.norm(Qgs.T @ Qgs - np.identity(n)))
21.989847693771175
As you can see above, the $Q$ and $R$ factors produced by Gram-Schmidt do have the property that $QR \approx A$. However, $Q$ is far from orthogonal. (Recall that LA.norm
computes the Frobenius norm introduced previously.)
On the other hand, Householder reflections perform much better in that respect as we show next. Here we use the implementation of Householder transformations in numpy.linalg.qr
.
Qhh, Rhh = LA.qr(A)
print(LA.norm(A - Qhh @ Rhh))
3.760888072615892e-16
print(LA.norm(Qhh.T @ Qhh - np.identity(n)))
4.5280497659384375e-15
NUMERICAL CORNER: We test our least-squares method on simulated data. This has the advantage that we know the truth.
Suppose the truth is a linear function of one variable.
n, b0, b1 = 100, -1, 1
x = np.linspace(0,10,num=n)
y = b0 + b1*x
plt.scatter(x,y,alpha=0.5)
plt.show()
A perfect straight line is little too easy. So let's add some noise. That is, to each $y_i$ we add an independent random variable $\varepsilon_i$ with a standard Normal distribution (mean $0$, variance $1$).
y += rng.normal(0,1,n)
plt.scatter(x,y,alpha=0.5)
plt.show()
We form the matrix $A$ and use our least-squares code to solve for $\boldsymbol{\hat\beta}$. The function ls_by_qr
, which we implemented previously, is in mmids.py, which is available on the GitHub of the notes.
A = np.stack((np.ones(n),x),axis=-1)
coeff = mmids.ls_by_qr(A,y)
print(coeff)
[-0.76231766 0.96166898]
plt.scatter(x,y,alpha=0.5)
plt.plot(x,coeff[0]+coeff[1]*x,'r')
plt.show()
NUMERICAL CORNER: Suppose the truth is in fact a degree-two polynomial of one variable with Gaussian noise.
n, b0, b1, b2 = 100, 0, 0, 1
x = np.linspace(0,10,num=n)
y = b0 + b1 * x + b2 * x**2 + 10*rng.normal(0,1,n)
plt.scatter(x,y,alpha=0.5)
plt.show()
We form the matrix $A$ and use our least-squares code to solve for $\boldsymbol{\hat\beta}$.
A = np.stack((np.ones(n), x, x**2), axis=-1)
coeff = mmids.ls_by_qr(A,y)
print(coeff)
[ 2.87806801 -1.24872939 1.12130512]
plt.scatter(x,y,alpha=0.5)
plt.plot(x, coeff[0] + coeff[1] * x + coeff[2] * x**2, 'r')
plt.show()
We return to the Advertising
dataset from the [ISLP] textbook.
df = pd.read_csv('advertising.csv')
df.head()
TV | radio | newspaper | sales | |
---|---|---|---|---|
0 | 230.1 | 37.8 | 69.2 | 22.1 |
1 | 44.5 | 39.3 | 45.1 | 10.4 |
2 | 17.2 | 45.9 | 69.3 | 9.3 |
3 | 151.5 | 41.3 | 58.5 | 18.5 |
4 | 180.8 | 10.8 | 58.4 | 12.9 |
We will focus for now on the TV budget.
TV = df['TV'].to_numpy()
sales = df['sales'].to_numpy()
plt.scatter(TV, sales)
plt.xlabel('TV')
plt.ylabel('sales')
plt.show()
We form the matrix $A$ and use our least-squares code to solve for $\boldsymbol{\beta}$.
n = np.size(TV)
A = np.stack((np.ones(n),TV),axis=-1)
coeff = mmids.ls_by_qr(A,sales)
print(coeff)
[7.03259355 0.04753664]
TVgrid = np.linspace(TV.min(), TV.max(), num=100)
plt.scatter(TV,sales,alpha=0.5)
plt.plot(TVgrid,coeff[0]+coeff[1]*TVgrid,'r')
plt.show()
A degree-two polynomial might be a better fit.
A = np.stack((np.ones(n), TV, TV**2), axis=-1)
coeff = mmids.ls_by_qr(A,sales)
print(coeff)
[ 6.11412013e+00 6.72659270e-02 -6.84693373e-05]
plt.scatter(TV,sales,alpha=0.5)
plt.plot(TVgrid, coeff[0] + coeff[1] * TVgrid + coeff[2] * TVgrid**2,'r')
plt.show()
The fit looks slightly better than the linear one. This is not entirely surprising though given that the linear model is a subset of the quadratic one. But, in adding more parameters, one must worry about overfitting. To quote Wikipedia:
In statistics, overfitting is "the production of an analysis that corresponds too closely or exactly to a particular set of data, and may therefore fail to fit additional data or predict future observations reliably".[1] An overfitted model is a statistical model that contains more parameters than can be justified by the data.[2] The essence of overfitting is to have unknowingly extracted some of the residual variation (i.e. the noise) as if that variation represented underlying model structure.[3]
To illustrate, let's see what happens with a degree-$20$ polynomial fit.
deg = 20
A = np.stack([TV**i for i in range(deg+1)], axis=-1)
coeff = mmids.ls_by_qr(A,sales)
print(coeff)
[-3.28380423e+00 1.90113955e+00 -1.13244689e-01 3.17646257e-03 -4.67589588e-05 3.79706699e-07 -1.62469373e-09 2.46221192e-12 6.70339576e-15 -3.23984638e-17 3.98573286e-20 -5.40524779e-24 -2.58679077e-27 -2.33298668e-30 -2.86575281e-33 -4.26967545e-36 -7.27522485e-39 -1.36905063e-41 -2.78115692e-44 -6.00362887e-47 -1.36152548e-49]
saleshat = np.sum([coeff[i] * TVgrid**i for i in range(deg+1)], axis=0)
plt.scatter(TV,sales,alpha=0.5)
plt.plot(TVgrid, saleshat, 'r')
plt.show()
We could use cross-validation to choose a suitable degree.
We return to the linear case, but with the full set of predictors.
radio = df['radio'].to_numpy()
newspaper = df['newspaper'].to_numpy()
f, (ax1, ax2) = plt.subplots(1, 2, sharex=False, sharey=True)
ax1.scatter(radio,sales,alpha=0.5)
ax1.set_title('radio')
ax2.scatter(newspaper,sales,alpha=0.5)
ax2.set_title('newspaper')
plt.show()
A = np.stack((np.ones(n), TV, radio, newspaper), axis=-1)
coeff = mmids.ls_by_qr(A,sales)
print(coeff)
[ 2.93888937e+00 4.57646455e-02 1.88530017e-01 -1.03749304e-03]
Newspaper advertising (the last coefficient) seems to have a much weaker effect on sales per dollar spent. Next, we briefly sketch one way to assess the statistical significance of such a conclusion.
Our coefficients are estimated from a sample. There is intrinsic variability in our sampling procedure. We would like to understand how our estimated coefficients compare to the true coefficients. This is set up beautifully in [Data8, Section 13.2]:
A data scientist is using the data in a random sample to estimate an unknown parameter. She uses the sample to calculate the value of a statistic that she will use as her estimate. Once she has calculated the observed value of her statistic, she could just present it as her estimate and go on her merry way. But she's a data scientist. She knows that her random sample is just one of numerous possible random samples, and thus her estimate is just one of numerous plausible estimates. By how much could those estimates vary? To answer this, it appears as though she needs to draw another sample from the population, and compute a new estimate based on the new sample. But she doesn't have the resources to go back to the population and draw another sample. It looks as though the data scientist is stuck. Fortunately, a brilliant idea called the bootstrap can help her out. Since it is not feasible to generate new samples from the population, the bootstrap generates new random samples by a method called resampling: the new samples are drawn at random from the original sample.
Without going into full details (see [DS100, Section 17.3] for more), it works as follows. Let $\{(\mathbf{x}_i, y_i)\}_{i=1}^n$ be our data. We assume that our sample is representative of the population and we simulate our sampling procedure by resampling from the sample. That is, we take a random sample with replacement $\mathcal{X}_{\mathrm{boot},1} = \{(\mathbf{x}_i, y_i)\,:\,i \in I\}$ where $I$ is a multi-set of elements from $[n]$ of size $n$. We recompute our estimated coefficients on $\mathcal{X}_{\mathrm{boot},1}$. Then we repeat independently for a desired number of replicates $\mathcal{X}_{\mathrm{boot},1}, \ldots, \mathcal{X}_{\mathrm{boot},r}$. Plotting a histogram of the resulting coefficients gives some idea of the variability of our estimates.
def linregboot(A, b, replicates = np.int32(10000)):
n,m = A.shape
coeff_boot = np.zeros((m,replicates))
for i in range(replicates):
resample = rng.integers(0,n,n)
Aboot = A[resample,:]
bboot = b[resample]
coeff_boot[:,i] = mmids.ls_by_qr(Aboot,bboot)
return coeff_boot
First, let's use a simple example from the lecture with a known ground truth.
n, b0, b1 = 100, -1, 1
x = np.linspace(0,10,num=n)
y = b0 + b1*x + rng.normal(0,1,n)
A = np.stack((np.ones(n),x),axis=-1)
The estimated coefficients are the following.
coeff = mmids.ls_by_qr(A,y)
print(coeff)
[-0.92360937 1.0091636 ]
Now we apply the bootstrap and plot histograms of the two coefficients.
coeff_boot = linregboot(A,y)
plt.hist(coeff_boot[0,:])
plt.show()
plt.hist(coeff_boot[1,:])
plt.show()
We see in the histograms that the true coefficient values $-1$ and $1$ fall within the likely range.
We return to the Advertising
dataset and apply the bootstrap.
n = np.size(TV)
A = np.stack((np.ones(n), TV, radio, newspaper), axis=-1)
coeff = mmids.ls_by_qr(A,sales)
print(coeff)
[ 2.93888937e+00 4.57646455e-02 1.88530017e-01 -1.03749304e-03]
coeff_boot = linregboot(A,sales)
Plotting a histogram of the coefficients corresponding to newspaper advertising shows that $0$ is a plausible value, while it is not for TV advertising.
plt.hist(coeff_boot[1,:])
plt.show()
plt.hist(coeff_boot[3,:])
plt.show()
NUMERICAL CORNER: We implement the algorithm above. In our naive implementation, we assume that $B$ is positive definite, and therefore that all steps are well-defined.
def cholesky(B):
n = B.shape[0]
L = np.zeros((n, n))
for j in range(n):
L[j,0:j] = mmids.forwardsubs(L[0:j,0:j],B[j,0:j])
L[j,j] = np.sqrt(B[j,j] - LA.norm(L[j,0:j])**2)
return L
Here is a simple example.
B = np.array([[2., 1.],[1., 2.]])
print(B)
[[2. 1.] [1. 2.]]
L = cholesky(B)
print(L)
[[1.41421356 0. ] [0.70710678 1.22474487]]
print(L @ L.T)
[[2. 1.] [1. 2.]]
NUMERICAL CORNER: We implement this algorithm below. In our naive implementation, we assume that $A$ has full column rank, and therefore that all steps are well-defined.
def ls_by_chol(A, b):
L = cholesky(A.T @ A)
z = mmids.forwardsubs(L, A.T @ b)
return mmids.backsubs(L.T, z)