Course: Math 535 - Mathematical Methods in Data Science (MMiDS)
Chapter: 3-Optimization theory and algorithms
Author: Sebastien Roch, Department of Mathematics, University of Wisconsin-Madison
Updated: July 15, 2024
Copyright: © 2024 Sebastien Roch
$\newcommand{\bmu}{\boldsymbol{\mu}}$ $\newcommand{\bSigma}{\boldsymbol{\Sigma}}$ $\newcommand{\bfbeta}{\boldsymbol{\beta}}$ $\newcommand{\bflambda}{\boldsymbol{\lambda}}$ $\newcommand{\bgamma}{\boldsymbol{\gamma}}$ $\newcommand{\bsigma}{{\boldsymbol{\sigma}}}$ $\newcommand{\bpi}{\boldsymbol{\pi}}$ $\newcommand{\btheta}{{\boldsymbol{\theta}}}$ $\newcommand{\bphi}{\boldsymbol{\phi}}$ $\newcommand{\balpha}{\boldsymbol{\alpha}}$ $\newcommand{\blambda}{\boldsymbol{\lambda}}$ $\renewcommand{\P}{\mathbb{P}}$ $\newcommand{\E}{\mathbb{E}}$ $\newcommand{\indep}{\perp\!\!\!\perp} \newcommand{\bx}{\mathbf{x}}$ $\newcommand{\bp}{\mathbf{p}}$ $\renewcommand{\bx}{\mathbf{x}}$ $\newcommand{\bX}{\mathbf{X}}$ $\newcommand{\by}{\mathbf{y}}$ $\newcommand{\bY}{\mathbf{Y}}$ $\newcommand{\bz}{\mathbf{z}}$ $\newcommand{\bZ}{\mathbf{Z}}$ $\newcommand{\bw}{\mathbf{w}}$ $\newcommand{\bW}{\mathbf{W}}$ $\newcommand{\bv}{\mathbf{v}}$ $\newcommand{\bV}{\mathbf{V}}$ $\newcommand{\bfg}{\mathbf{g}}$ $\newcommand{\bfh}{\mathbf{h}}$ $\newcommand{\horz}{\rule[.5ex]{2.5ex}{0.5pt}}$ $\renewcommand{\S}{\mathcal{S}}$ $\newcommand{\X}{\mathcal{X}}$ $\newcommand{\var}{\mathrm{Var}}$ $\newcommand{\pa}{\mathrm{pa}}$ $\newcommand{\Z}{\mathcal{Z}}$ $\newcommand{\bh}{\mathbf{h}}$ $\newcommand{\bb}{\mathbf{b}}$ $\newcommand{\bc}{\mathbf{c}}$ $\newcommand{\cE}{\mathcal{E}}$ $\newcommand{\cP}{\mathcal{P}}$ $\newcommand{\bbeta}{\boldsymbol{\beta}}$ $\newcommand{\bLambda}{\boldsymbol{\Lambda}}$ $\newcommand{\cov}{\mathrm{Cov}}$ $\newcommand{\bfk}{\mathbf{k}}$ $\newcommand{\idx}[1]{}$ $\newcommand{\xdi}{}$
Motivating example: analyzing customer satisfaction¶
We now turn to classification$\idx{classification}\xdi$.
Quoting Wikipedia:
In machine learning and statistics, classification is the problem of identifying to which of a set of categories (sub-populations) a new observation belongs, on the basis of a training set of data containing observations (or instances) whose category membership is known. Examples are assigning a given email to the "spam" or "non-spam" class, and assigning a diagnosis to a given patient based on observed characteristics of the patient (sex, blood pressure, presence or absence of certain symptoms, etc.). Classification is an example of pattern recognition. In the terminology of machine learning, classification is considered an instance of supervised learning, i.e., learning where a training set of correctly identified observations is available.
We will illustrate this problem on an airline customer satisfaction dataset available on Kaggle, an excellent source of data and community contributed analyses. The background is the following:
The dataset consists of the details of customers who have already flown with them. The feedback of the customers on various context and their flight data has been consolidated. The main purpose of this dataset is to predict whether a future customer would be satisfied with their service given the details of the other parameters values.
We first load the data and convert it to an appropriate matrix representation. We (or, more precisely, ChatGPT) pre-processed the original file to remove rows with missing data or 0 ratings, convert categorical variables into one-hot encodings, and keep only a subset of the rows and columns. You can see the details of the pre-processing in this chat history.
data = pd.read_csv('customer_airline_satisfaction.csv')
This is a large dataset. Here are the first five rows and first six columns.
print(data.iloc[:5, :6])
   Satisfied  Age  Class_Business  Class_Eco  Class_Eco Plus  Business travel
0          0   63               1          0               0                1
1          0   34               1          0               0                1
2          0   52               0          1               0                0
3          0   40               0          1               0                1
4          1   46               0          1               0                1
It has 100,000 rows and 24 columns.
data.shape
(100000, 24)
The column names are:
print(data.columns.tolist())
['Satisfied', 'Age', 'Class_Business', 'Class_Eco', 'Class_Eco Plus', 'Business travel', 'Loyal customer', 'Flight Distance', 'Departure Delay in Minutes', 'Arrival Delay in Minutes', 'Seat comfort', 'Departure/Arrival time convenient', 'Food and drink', 'Gate location', 'Inflight wifi service', 'Inflight entertainment', 'Online support', 'Ease of Online booking', 'On-board service', 'Leg room service', 'Baggage handling', 'Checkin service', 'Cleanliness', 'Online boarding']
The first column indicates whether a customer was satisfied (with 1 meaning satisfied). The next six columns give some information about the customers, e.g., their age or whether they are members of a loyalty program with the airline. The following three columns give information about the flight, with names that should be self-explanatory: Flight Distance, Departure Delay in Minutes, and Arrival Delay in Minutes. The remaining columns give the customers' ratings, between 1 and 5, of various features, e.g., Baggage handling and Checkin service.
Our goal will be to predict the first column, Satisfied, from the rest of the columns. For this, we transform our data into NumPy arrays.
y = data['Satisfied'].to_numpy()
X = data.drop(columns=['Satisfied']).to_numpy()
print(y)
[0 0 0 ... 0 1 0]
print(X)
[[63  1  0 ...  3  3  4]
 [34  1  0 ...  2  3  4]
 [52  0  1 ...  4  3  4]
 ...
 [39  0  1 ...  4  1  1]
 [25  0  0 ...  5  3  1]
 [44  1  0 ...  1  2  2]]
Some features may affect satisfaction more than others. Let us look at age, for instance. The following code extracts the Age column from X (i.e., column $0$) and computes the fraction of satisfied customers in several age bins.
Explanation by ChatGPT (who wrote the code):

- numpy.digitize bins the age data into the specified age bins. The -1 adjustment is to match zero-based indexing.
- numpy.bincount counts the occurrences of each bin index. The minlength parameter ensures that the resulting array length matches the number of age bins (age_labels). This is important if some bins have zero counts, ensuring the counts array covers all bins.
- freq_satisfied = counts_satisfied / counts_all calculates the satisfaction frequency for each age group by dividing the counts of satisfied customers by the total counts in each age group.
age_col_index = 0
age_data = X[:, age_col_index]
age_bins = [0, 18, 25, 35, 45, 55, 65, 100]
age_labels = ['0-17', '18-24', '25-34', '35-44', '45-54', '55-64', '65+']
age_bin_indices = np.digitize(age_data, bins=age_bins) - 1
counts_all = np.bincount(age_bin_indices, minlength=len(age_labels))
counts_satisfied = np.bincount(age_bin_indices[y == 1], minlength=len(age_labels))
freq_satisfied = counts_satisfied / counts_all
age_group_labels = np.array(age_labels)
The results are plotted using matplotlib's matplotlib.pyplot.bar function. We see in particular that younger people tend to be more dissatisfied. Of course, this might be because they cannot afford the most expensive services.
plt.figure(figsize=(4, 4))
plt.bar(age_group_labels, freq_satisfied, color='lightblue', edgecolor='black')
plt.xlabel('Age Group'), plt.ylabel('Frequency of Satisfied Customers')
plt.show()
The input data is now of the form $\{(\mathbf{x}_i, y_i) : i=1,\ldots, n\}$ where $\mathbf{x}_i \in \mathbb{R}^d$ are the features and $y_i \in \{0,1\}$ is the label. Above we use the matrix representation $X \in \mathbb{R}^{n \times d}$ with rows $\mathbf{x}_i^T$, $i = 1,\ldots, n$, and $\mathbf{y} = (y_1, \ldots, y_n)^T \in \{0,1\}^n$.
Our goal:
learn a classifier from the examples $\{(\mathbf{x}_i, y_i) : i=1,\ldots, n\}$, that is, a function $\hat{f} : \mathbb{R}^d \to \mathbb{R}$ such that $\hat{f}(\mathbf{x}_i) \approx y_i$.
We may want to enforce that the output is in $\{0,1\}$ as well. This problem is referred to as binary classification.
A natural approach to this type of supervised learning$\idx{supervised learning}\xdi$ problem is to define two objects:
Family of classifiers: A class $\widehat{\mathcal{F}}$ of classifiers from which to pick $\hat{f}$.
Loss function: A loss function $\ell(\hat{f}, (\mathbf{x},y))$ which quantifies how good of a fit $\hat{f}(\mathbf{x})$ is to $y$.
Our goal is then to solve
$$ \min_{\hat{f} \in \widehat{\mathcal{F}}} \frac{1}{n} \sum_{i=1}^n \ell(\hat{f}, (\mathbf{x}_i, y_i)), $$
that is, we seek to find a classifier among $\widehat{\mathcal{F}}$ that minimizes the average loss over the examples.
For instance, in logistic regression, we consider linear classifiers of the form
$$ \hat{f}(\mathbf{x}) = \sigma(\mathbf{x}^T \boldsymbol{\theta}) \qquad \text{with} \qquad \sigma(t) = \frac{1}{1 + e^{-t}} $$
where $\boldsymbol{\theta} \in \mathbb{R}^d$ is a parameter vector. And we use the cross-entropy loss
$$ \ell(\hat{f}, (\mathbf{x}, y)) = - y \log(\sigma(\mathbf{x}^T \boldsymbol{\theta})) - (1-y) \log(1- \sigma(\mathbf{x}^T \boldsymbol{\theta})). $$
In parametric form, the problem boils down to
$$ \min_{\boldsymbol{\theta} \in \mathbb{R}^d} - \frac{1}{n} \sum_{i=1}^n y_i \log(\sigma(\mathbf{x}_i^T \boldsymbol{\theta})) - \frac{1}{n} \sum_{i=1}^n (1-y_i) \log(1- \sigma(\mathbf{x}_i^T \boldsymbol{\theta})). $$
To obtain a prediction in $\{0,1\}$ here, we could cut off $\hat{f}(\mathbf{x})$ at a threshold $\tau \in [0,1]$, that is, return $\mathbf{1}\{\hat{f}(\mathbf{x}) > \tau\}$.
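To make these formulas concrete, here is a small self-contained sketch (a toy illustration of our own, not part of the airline analysis; all names and numbers below are made up) that evaluates the sigmoid, the average cross-entropy loss, and thresholded predictions at $\tau = 1/2$ for an arbitrary parameter vector.

import numpy as np

def sigmoid_toy(t):
    return 1 / (1 + np.exp(-t))

x_toy = np.array([[1.0, 2.0], [0.5, -1.0], [-2.0, 0.3]])  # three examples in R^2 (as rows)
y_toy = np.array([1.0, 0.0, 0.0])                         # labels in {0,1}
theta_toy = np.array([0.4, -0.2])                         # arbitrary parameter vector

p_toy = sigmoid_toy(x_toy @ theta_toy)                    # predicted probabilities sigma(x_i^T theta)
loss_toy = np.mean(-y_toy*np.log(p_toy) - (1 - y_toy)*np.log(1 - p_toy))  # average cross-entropy
labels_toy = (p_toy > 0.5).astype(int)                    # predictions thresholded at tau = 1/2
print(loss_toy, labels_toy)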
Optimality conditions¶
NUMERICAL CORNER: Consider $f(x) = e^x$. Then $f'(x) = f''(x) = e^x$. Suppose we are interested in approximating $f$ in the interval $[0,1]$. We take $a=0$ and $b=1$ in Taylor's Theorem. The linear term is
$$ f(a) + (x-a) f'(a) = 1 + x e^0 = 1 + x. $$
Then for any $x \in [0,1]$
$$ f(x) = 1 + x + \frac{1}{2}x^2 e^{\xi_x} $$
where $\xi_x \in (0,1)$ depends on $x$. We get a uniform bound on the error over $[0,1]$ by replacing $\xi_x$ with its worst possible value over $[0,1]$
$$ |f(x) - (1+x)| \leq \frac{1}{2}x^2 e^{\xi_x} \leq \frac{e}{2} x^2. $$
x = np.linspace(0,1,100)
y = np.exp(x)
taylor = 1 + x
err = (np.exp(1)/2) * x**2
If we plot the upper and lower bounds, we see that $f$ indeed falls within them.
plt.plot(x,y,label='f')
plt.plot(x,taylor,label='taylor')
plt.plot(x,taylor-err,linestyle=':',color='green',label='lower')
plt.plot(x,taylor+err,linestyle='--',color='green',label='upper')
plt.legend()
plt.show()
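As a small added check, we can also verify numerically that the absolute error stays below the uniform bound on this grid.

print(np.all(np.abs(y - taylor) <= err))  # added check: error within the bound everywhere on the grid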
NUMERICAL CORNER: Let $f(x) = x^3$. Then $f'(x) = 3 x^2$ and $f''(x) = 6 x$ so that $f'(0) = 0$ and $f''(0) \geq 0$. Hence $x=0$ is a stationary point. But $x=0$ is not a local minimizer. Indeed $f(0) = 0$ but, for any $\delta > 0$, $f(-\delta) < 0$.
x = np.linspace(-2,2,100)
y = x**3
plt.plot(x,y, c='k')
plt.ylim(-5,5)
plt.show()
NUMERICAL CORNER: Returning to the previous example, the points satisfying $h_1(\mathbf{x}) = 0$ sit on the circle of radius $1$ around the origin. We have already seen that
$$ \nabla h_1(\mathbf{x}) = \left( \frac{\partial h_1(\mathbf{x})}{\partial x_1}, \frac{\partial h_1(\mathbf{x})}{\partial x_2} \right) = (- 2 x_1, - 2 x_2). $$
Here is code illustrating the theorem (with help from ChatGPT). We first compute the function $h_1$ at a grid of points using numpy.meshgrid.
def h1(x1, x2):
return 1 - x1**2 - x2**2
x1, x2 = np.linspace(-1.5, 1.5, 400), np.linspace(-1.5, 1.5, 400)
X1, X2 = np.meshgrid(x1, x2)
H1 = h1(X1, X2)
We use matplotlib.pyplot.contour to plot the constraint set as a contour line (for the constant value $0$) of $h_1$. Gradients of $h_1$ are plotted at a collection of points with the matplotlib.pyplot.quiver function, which is used for plotting vectors as arrows. We see that the first-order feasible directions are orthogonal to the arrows, and therefore are tangent to the constraint set. At those same points, we also plot the gradient of $f$, which, recall, is $\nabla f(\mathbf{x}) = (4 x_1, 6 x_2)$. We make all gradients into unit vectors.
plt.figure(figsize=(4, 4))
plt.contour(X1, X2, H1, levels=[0], colors='b')
points = [(0.5, np.sqrt(3)/2), (-0.5, np.sqrt(3)/2), (0.5, -np.sqrt(3)/2),
(-0.5, -np.sqrt(3)/2), (1, 0), (-1, 0), (0, 1), (0, -1)]
for x1, x2 in points:
    # unit vector in the direction of the gradient of h1 (red)
    plt.quiver(x1, x2, -x1/np.sqrt(x1**2 + x2**2), -x2/np.sqrt(x1**2 + x2**2),
               scale=10, color='r')
    # unit vector in the direction of the gradient of f (lime)
    plt.quiver(x1, x2, 4*x1/np.sqrt(16 * x1**2 + 36 * x2**2),
               6*x2/np.sqrt(16 * x1**2 + 36 * x2**2), scale=10, color='lime')
plt.show()
We see that, at $(-1,0)$ and $(1,0)$, the gradient is indeed orthogonal to the first-order feasible directions.
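As a quick added check, we can read the gradient of $f$ off the plotting code above (up to normalization it is $(4x_1, 6x_2)$) and verify that it is orthogonal to the tangent direction of the unit circle at $(1,0)$ and $(-1,0)$.

# Added check: at (1,0) and (-1,0), the gradient direction used in the plot above
# is orthogonal to the tangent direction of the circle x1^2 + x2^2 = 1.
for x1, x2 in [(1.0, 0.0), (-1.0, 0.0)]:
    grad_f_pt = np.array([4*x1, 6*x2])
    tangent = np.array([-x2, x1])
    print((x1, x2), np.dot(grad_f_pt, tangent))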
Gradient descent and its convergence analysis¶
At each iteration of gradient descent, we take a step in the direction of the negative of the gradient, that is,
$$ \mathbf{x}^{t+1} = \mathbf{x}^t - \alpha_t \nabla f(\mathbf{x}^t), \quad t=0,1,2,\ldots $$
for a sequence of step sizes $\alpha_t > 0$. Choosing the right step size (also known as steplength or learning rate) is a large subject in itself. We will only consider the case of fixed step size here.
In general, we will not be able to guarantee that a global minimizer is reached in the limit, even if one exists. Our goal for now is more modest: to find a point where the gradient of $f$ approximately vanishes.
We implement gradient descent$\idx{gradient descent}\xdi$ in Python. We assume that a function f and its gradient grad_f are provided. We first code the basic steepest descent step with a step size$\idx{step size}\xdi$ $\idx{learning rate}\xdi$ $\alpha =$ alpha.
def desc_update(grad_f, x, alpha):
return x - alpha*grad_f(x)
def gd(f, grad_f, x0, alpha=1e-3, niters=int(1e6)):
xk = x0
for _ in range(niters):
xk = desc_update(grad_f, xk, alpha)
return xk, f(xk)
NUMERICAL CORNER: We illustrate on a simple example.
def f(x):
return (x-1)**2 + 10
def grad_f(x):
return 2*(x-1)
xgrid = np.linspace(-5,5,100)
plt.plot(xgrid, f(xgrid), label='f')
plt.plot(xgrid, grad_f(xgrid), label='grad_f')
plt.ylim((-20,50)), plt.legend()
plt.show()
gd(f, grad_f, 0)
(0.9999999999999722, 10.0)
We found a global minimizer in this case.
The next example shows that a different local minimizer may be reached depending on the starting point.
def f(x):
return 4 * (x-1)**2 * (x+1)**2 - 2*(x-1)
def grad_f(x):
return 8 * (x-1) * (x+1)**2 + 8 * (x-1)**2 * (x+1) - 2
xgrid = np.linspace(-2,2,100)
plt.plot(xgrid, f(xgrid), label='f')
plt.plot(xgrid, grad_f(xgrid), label='grad_f')
plt.ylim((-10,10)), plt.legend()
plt.show()
gd(f, grad_f, 0)
(1.057453770738375, -0.0590145651028224)
gd(f, grad_f, -2)
(-0.9304029265558538, 3.933005966859003)
TRY IT! In this last example, does changing the step size affect the outcome? (Open In Colab) $\ddagger$
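One way to start exploring this question (just a sketch of an experiment, not a worked solution) is to rerun gradient descent from the same starting points with a few different step sizes and compare the outputs.

# Sketch of an experiment for the TRY IT! prompt (added): vary the step size and
# check whether gradient descent still reaches the same local minimizer.
for alpha in [1e-4, 1e-3, 5e-3]:
    print(alpha, gd(f, grad_f, 0, alpha=alpha, niters=int(1e5)),
          gd(f, grad_f, -2, alpha=alpha, niters=int(1e5)))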
In the final example, we end up at a stationary point that is not a local minimizer. Here both the first and second derivatives are zero. This is known as a saddle point.
def f(x):
return x**3
def grad_f(x):
return 3 * x**2
xgrid = np.linspace(-2,2,100)
plt.plot(xgrid, f(xgrid), label='f')
plt.plot(xgrid, grad_f(xgrid), label='grad_f')
plt.ylim((-10,10)), plt.legend()
plt.show()
gd(f, grad_f, 2)
(0.00033327488712690107, 3.701755838398568e-11)
gd(f, grad_f, -2, niters=100)
(-4.93350410883896, -120.0788396909241)
NUMERICAL CORNER: We revisit our first simple single-variable example.
def f(x):
return (x-1)**2 + 10
def grad_f(x):
return 2*(x-1)
The second derivative is $f''(x) = 2$. Hence, this $f$ is $L$-smooth and $m$-strongly convex with $L = m = 2$. The theory we developed suggests taking step size $\alpha_t = \alpha = 1/L = 1/2$. It also implies that
$$ f(x^1) - f(x^*) \leq \left(1 - \frac{m}{L}\right) [f(x^0) - f(x^*)] = 0. $$
We converge in one step! And that holds for any starting point $x^0$.
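To see why explicitly for this particular $f$ (a short added calculation), the single gradient step with $\alpha = 1/2$ reads
$$ x^1 = x^0 - \frac{1}{2} f'(x^0) = x^0 - \frac{1}{2} \cdot 2 (x^0 - 1) = 1 = x^*, $$
so the global minimizer is reached after one update, regardless of $x^0$.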
Let's try this!
gd(f, grad_f, 0, alpha=0.5, niters=1)
(1.0, 10.0)
Let's try a different starting point.
gd(f, grad_f, 100, alpha=0.5, niters=1)
(1.0, 10.0)
Application: logistic regression¶
NUMERICAL CORNER: We plot the sigmoid function.
def sigmoid(z):
return 1/(1+np.exp(-z))
grid = np.linspace(-5, 5, 100)
plt.plot(grid, sigmoid(grid), c='k')
plt.show()
We modify our implementation of gradient descent to take a dataset as input. Recall that to run gradient descent, we first implement a function computing a descent update. It takes as input a function grad_fn computing the gradient itself, as well as a current iterate and a step size. We now also feed a dataset as additional input.
def desc_update_for_logreg(grad_fn, A, b, curr_x, beta):
gradient = grad_fn(curr_x, A, b)
return curr_x - beta*gradient
We are ready to implement GD. Our function takes as input a function loss_fn computing the objective, a function grad_fn computing the gradient, the dataset A and b, and an initial guess init_x. Optional parameters are the step size and the number of iterations.
def gd_for_logreg(loss_fn, grad_fn, A, b, init_x, beta=1e-3, niters=int(1e5)):
curr_x = init_x
    for _ in range(niters):
curr_x = desc_update_for_logreg(grad_fn, A, b, curr_x, beta)
return curr_x
To implement loss_fn and grad_fn, we define the sigmoid as above. Below, pred_fn is $\bsigma(A \mathbf{x})$. Here we write the loss function as
$$ \mathrm{mean}\left(- \mathbf{b} \odot \mathbf{log}(\bsigma(A \mathbf{x})) - (\mathbf{1} - \mathbf{b}) \odot \mathbf{log}(\mathbf{1} - \bsigma(A \mathbf{x}))\right) $$
where $\odot$ is the Hadamard product, or element-wise product (for example $\mathbf{u} \odot \mathbf{v} = (u_1 v_1, \ldots,u_n v_n)^T$), the logarithm (denoted in bold) is applied element-wise and $\mathrm{mean}(\mathbf{z})$ is the mean of the entries of $\mathbf{z}$ (i.e., $\mathrm{mean}(\mathbf{z}) = n^{-1} \sum_{i=1}^n z_i$).
def pred_fn(x, A):
return sigmoid(A @ x)
def loss_fn(x, A, b):
return np.mean(-b*np.log(pred_fn(x, A)) - (1 - b)*np.log(1 - pred_fn(x, A)))
def grad_fn(x, A, b):
return -A.T @ (b - pred_fn(x, A))/len(b)
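As an optional sanity check (our own addition, not part of the original analysis), we can compare grad_fn to a central finite-difference approximation of loss_fn on a small synthetic dataset; the names A_test, b_test and x_test below are ours.

# Added check: central finite differences of loss_fn should match grad_fn.
rng = np.random.default_rng(0)
A_test = rng.normal(size=(20, 3))
b_test = rng.integers(0, 2, size=20).astype(float)
x_test = rng.normal(size=3)
eps = 1e-6
fd_grad = np.array([(loss_fn(x_test + eps*np.eye(3)[j], A_test, b_test)
                     - loss_fn(x_test - eps*np.eye(3)[j], A_test, b_test)) / (2*eps)
                    for j in range(3)])
print(np.allclose(fd_grad, grad_fn(x_test, A_test, b_test)))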
We can choose a step size based on the smoothness of the objective as above. Recall that numpy.linalg.norm computes the Frobenius norm by default.
def stepsize_for_logreg(A, b):
L = LA.norm(A)**2 / (4 * len(b))
return 1/L
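Here is a small end-to-end sketch on synthetic data (our own toy example; A_toy, b_toy and x_true are made-up names): we generate a toy logistic regression dataset, set the step size from the smoothness bound above, and run gd_for_logreg.

# Added toy run: synthetic logistic regression data, step size from the smoothness
# bound, then gradient descent; x_hat should land roughly near x_true.
rng = np.random.default_rng(1)
A_toy = np.hstack((np.ones((500, 1)), rng.normal(size=(500, 2))))
x_true = np.array([0.5, 1.0, -2.0])
b_toy = (rng.random(500) < sigmoid(A_toy @ x_true)).astype(float)
beta_toy = stepsize_for_logreg(A_toy, b_toy)
x_hat = gd_for_logreg(loss_fn, grad_fn, A_toy, b_toy, np.zeros(3), beta=beta_toy, niters=10000)
print(x_hat)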
NUMERICAL CORNER: We return to our original motivation, the airline customer satisfaction dataset. We first load the dataset. We will need the column names later.
data = pd.read_csv('customer_airline_satisfaction.csv')
column_names = data.columns.tolist()
print(column_names)
['Satisfied', 'Age', 'Class_Business', 'Class_Eco', 'Class_Eco Plus', 'Business travel', 'Loyal customer', 'Flight Distance', 'Departure Delay in Minutes', 'Arrival Delay in Minutes', 'Seat comfort', 'Departure/Arrival time convenient', 'Food and drink', 'Gate location', 'Inflight wifi service', 'Inflight entertainment', 'Online support', 'Ease of Online booking', 'On-board service', 'Leg room service', 'Baggage handling', 'Checkin service', 'Cleanliness', 'Online boarding']
Our goal will be to predict the first column, Satisfied, from the rest of the columns. For this, we transform our data into NumPy arrays. We also standardize the columns by subtracting their mean and dividing by their standard deviation. This will allow us to compare the influence of different features on the prediction. And we add a column of 1s to account for the intercept.
y = data['Satisfied'].to_numpy()
X = data.drop(columns=['Satisfied']).to_numpy()
means = np.mean(X, axis=0)
stds = np.std(X, axis=0)
X_standardized = (X - means) / stds
A = np.concatenate((np.ones((len(y),1)), X_standardized), axis=1)
b = y
We use the functions loss_fn and grad_fn, which were written for general logistic regression problems.
init_x = np.zeros(A.shape[1])
best_x = gd_for_logreg(loss_fn, grad_fn, A, b, init_x, beta=1e-3, niters=int(1e3))
print(best_x)
[ 0.03622497 0.04123861 0.10020177 -0.08786108 -0.02485893 0.0420605 0.11995567 -0.01799992 -0.02399636 -0.02653084 0.1176043 -0.02382631 0.05909378 -0.01161711 0.06553672 0.21313777 0.12883519 0.14631027 0.12239595 0.11282894 0.08556647 0.08954403 0.08447245 0.108043 ]
To interpret the results, we plot the coefficients in decreasing order.
coefficients, features = best_x[1:], column_names[1:]
sorted_indices = np.argsort(coefficients)
sorted_coefficients = coefficients[sorted_indices]
sorted_features = np.array(features)[sorted_indices]
plt.figure(figsize=(6, 5))
plt.barh(sorted_features, sorted_coefficients, color='lightblue', edgecolor='black')
plt.xlabel('Coefficient Value'), plt.title('Logistic Regression Coefficients')
plt.show()
We see from the first ten bars or so that, as might be expected, higher ratings on various aspects of the flight generally contribute to a higher predicted likelihood of satisfaction (with one exception being Gate location, whose coefficient is negative but may not be statistically significant). Inflight entertainment seems particularly influential. Age also has a positive coefficient, consistent with a pattern we had noticed in the introductory section through a different analysis. On the other hand, departure delay and arrival delay contribute to a lower predicted likelihood of satisfaction, again an expected pattern. The most negative influence, however, appears to come from Class_Eco.
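As an optional follow-up (our own addition), we can threshold the fitted probabilities at $\tau = 1/2$, as discussed at the beginning of the chapter, and compute the resulting accuracy on the training set.

# Added follow-up: predicted labels from thresholding the fitted probabilities at 1/2,
# and the fraction of training examples classified correctly.
b_pred = (pred_fn(best_x, A) > 0.5).astype(int)
print(np.mean(b_pred == b))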