Optimization
In Chapter 2, we devised a closed form expression for the optimal decision rule assuming we have a probability model for the data. Then we turned to empirical risk minimization (ERM) where we instead rely on numerical methods to discover good decision rules when we don’t have such a probability model. In this chapter, we take a closer look at how to solve empirical risk minimization problems effectively. We focus on the core optimization methods commonly used to solve empirical risk minimization problems and on the mathematical tools used to analyze their running times.
Our main subject will be gradient descent algorithms and how to shape loss functions so that gradient descent succeeds. Gradient descent is an iterative procedure that moves through the space of possible models, at each step replacing the old model with one of lower empirical risk. We show that the class of optimization problems where gradient descent is guaranteed to find an optimal solution is the class of convex functions. When we turn to risk minimization, this means that gradient descent will find the model that minimizes the empirical risk whenever the loss function is convex and the decision function is a linear combination of features.
We then turn to studying stochastic gradient descent, the workhorse of machine learning. Stochastic gradient descent is effectively a generalization of the perceptron learning rule. Its generality enables us to apply it to a variety of function classes and loss functions and guarantee convergence even if the data may not be separable. We spend a good deal of time looking at the dynamics of the stochastic gradient method to try to motivate why it is so successful and popular in machine learning.
Starting from the convex case, we work towards more general nonconvex problems. In particular, we highlight two salient features of gradient descent and stochastic gradient descent that are particular to empirical risk minimization and help to motivate the resilience of these methods.
First, we show that even for problems that are not convex, gradient descent for empirical risk minimization has an implicit convexity property that encourages convergence. Though we explicitly optimize over function representations which are computationally intractable to optimize in the worst-case, it turns out that we can still reason about the convergence of the predictions themselves.
Second, we show that gradient descent implicitly manages the complexity of the prediction function, encouraging solutions of low complexity in cases where infinitely many solutions exist. We close the chapter with a discussion of other methods for empirical risk minimization that more explicitly account for model complexity and stable convergence.
Optimization basics
Stepping away from empirical risk minimization for a moment, consider the general minimization problem \begin{array}{ll} \text{minimize}_w & \Phi(w) \end{array} where \Phi\colon\mathbb{R}^d\to\mathbb{R} is a real-valued function over the domain \mathbb{R}^d.
When and how can we minimize such a function? Before we answer this question, we need to formally define what we’re shooting for.
A point w_\star is a minimizer of \Phi if \Phi(w_\star)\leq \Phi(w) for all w. It is a local minimizer of \Phi if for some \epsilon> 0, \Phi(w_\star)\leq \Phi(w) for all w such that \Vert w-w_\star \Vert\leq \epsilon.
Sometimes we will refer to minimizers as global minimizers to contrast against local minimizers.
The figure below presents example functions and their minima. In the first illustration, there is a unique minimizer. In the second, there are an infinite number of minimizers, but all local minimizers are global minimizers. In the third example, there are many local minimizers that are not global minimizers.
Note that in our example, the two functions without suboptimal local minimizers share the property that for any two points w_1 and w_2, the line segment connecting (w_1,\Phi(w_1)) to (w_2,\Phi(w_2)) lies completely above the graph of the function. Such functions are called convex functions.
A function \Phi is convex if for all w_1, w_2 in \mathbb{R}^d and \alpha \in [0,1], \Phi(\alpha w_1 + (1-\alpha) w_2 )\leq \alpha\Phi(w_1)+(1-\alpha)\Phi(w_2)\,.
We will see shortly that convex functions are the class of functions where gradient descent is guaranteed to find an optimal solution.
Gradient descent
Suppose we want to minimize a differentiable function \Phi\colon\mathbb{R}^d \rightarrow \mathbb{R}. Most of the algorithms we will consider start at some point w_0 and then aim to find a new point w_1 with a lower function value. The simplest way to do so is to find a direction v such that \Phi is decreasing when moving along the direction v. This notion can be formalized by the following definition:
A vector v is a descent direction for \Phi at w_0 if \Phi(w_0 + tv) < \Phi(w_0) for some t>0.
For continuously differentiable functions, it’s easy to tell if v is a descent direction: if v^T \nabla \Phi(w_0) <0 then v is a descent direction.
To see this note that by Taylor’s theorem, \Phi(w_0 + \alpha v) = \Phi(w_0) + \alpha \nabla \Phi(w_0 + \tilde{\alpha} v)^T v for some \tilde{\alpha} \in [0,\alpha]. By continuity, if \alpha is small, we’ll have \nabla \Phi(w_0 + \tilde{\alpha} v)^Tv <0. Therefore \Phi(w_0 + \alpha v) < \Phi(w_0) and v is a descent direction.
This characterization of descent directions allows us to provide conditions as to when w minimizes \Phi.
The point w_\star is a local minimizer only if \nabla \Phi(w_\star) = 0\,.
Why is this true? The direction -\nabla \Phi(w_\star) is a descent direction whenever it is nonzero. If w_\star is a local minimizer, there can be no descent directions. Therefore, the gradient must vanish.
Gradient descent uses the fact that the negative gradient is always a descent direction to construct an algorithm: repeatedly compute the gradient and take a step in the opposite direction to minimize \Phi.
Gradient Descent
- Start from an initial point w_0 \in \mathbb{R}^d.
- At each step t=0,1,2,\dots:
- Choose a step size \alpha_t>0
- Set w_{t+1} = w_t - \alpha_t \nabla \Phi(w_t)
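To make the procedure concrete, here is a minimal sketch in Python. The function grad_Phi, the constant step size, the iteration budget, and the stopping tolerance are placeholders chosen for illustration, not prescriptions.

```python
import numpy as np

def gradient_descent(grad_Phi, w0, alpha=0.1, max_steps=1000, tol=1e-8):
    """Repeatedly step in the direction of the negative gradient.

    grad_Phi: a function returning the gradient of Phi at a point w.
    Terminates when the gradient is small enough that the iterates
    no longer substantially change.
    """
    w = np.array(w0, dtype=float)
    for _ in range(max_steps):
        g = grad_Phi(w)
        if np.linalg.norm(g) < tol:
            break
        w = w - alpha * g
    return w

# Example: Phi(w) = 0.5 * ||w - c||^2 has gradient w - c and minimizer c.
c = np.array([1.0, -2.0])
w_star = gradient_descent(lambda w: w - c, w0=np.zeros(2))
```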
Gradient descent terminates whenever the gradient is so small that the iterates w_t no longer substantially change. Note, however, that there can be points where the gradient vanishes but where the function is not minimized. Maxima, for example, have this property. In general, points where the gradient vanishes are called stationary points. It is critically important to remember that not all stationary points are minimizers.
For convex \Phi, the situation is dramatically simpler. This is part of the reason why convexity is so appealing.
Let \Phi:\mathbb{R}^d\rightarrow \mathbb{R} be a differentiable convex function. Then w_\star is a global minimizer of \Phi if and only if \nabla \Phi(w_\star)=0.
To prove this, we need our definition of convexity: for any \alpha \in [0,1] and w\in\mathbb{R}^d, \Phi(w_\star + \alpha(w-w_\star)) = \Phi((1-\alpha)w_\star + \alpha w) \leq (1-\alpha) \Phi(w_\star) + \alpha \Phi(w)\,. Here, the inequality is just our definition of convexity. If we rearrange terms, we have \Phi(w) \geq \Phi(w_\star) + \frac{\Phi(w_\star + \alpha(w-w_\star)) - \Phi(w_\star)}{\alpha}\,. Now apply Taylor’s theorem: there is some \tilde{\alpha}\in[0,\alpha] such that \Phi(w_\star + \alpha(w-w_\star)) - \Phi(w_\star)=\alpha \nabla \Phi(w_\star+ \tilde{\alpha}(w-w_\star))^T(w-w_\star)\,. Taking the limit as \alpha goes to zero yields \Phi(w) \geq \Phi(w_\star) +\nabla \Phi(w_\star)^T(w-w_\star)\,.
But if \nabla \Phi(w_\star)=0, that means, \Phi(w) \geq \Phi(w_\star) for all w, and hence w_\star is a global minimizer.
Tangent hyperplanes always fall below the graphs of convex functions.
Let \Phi:\mathbb{R}^d\rightarrow \mathbb{R} be a differentiable convex function. Then for any u and v, we have \Phi(u) \geq \Phi(v) + \nabla \Phi(v)^T(u-v)\,.
A convex function cookbook
Testing if a function is convex can be tricky in more than two dimensions. But here are 5 rules that generate convex functions from simpler functions. In machine learning, almost all convex cost functions are built using these rules.
1. All norms are convex (this follows from the triangle inequality).
2. If \Phi is convex and \alpha \geq 0, then \alpha \Phi is convex.
3. If \Phi and \Psi are convex, then \Phi+\Psi is convex.
4. If \Phi and \Psi are convex, then h(w) = \max \{\Phi(w),\Psi(w)\} is convex.
5. If \Phi is convex, A is a matrix, and b is a vector, then the function h(w)=\Phi(Aw+b) is convex.
All of these properties can be verified using only the definition of convex functions. For example, consider the fourth property, which is probably the trickiest of the list. Take two points w_1 and w_2 and \alpha \in [0,1]. Suppose, without loss of generality, that \Phi((1-\alpha)w_1+ \alpha w_2) \geq \Psi((1-\alpha) w_1 + \alpha w_2)\,. Then \begin{aligned} h((1-\alpha) w_1 + \alpha w_2) &= \max \{\Phi((1- \alpha) w_1+ \alpha w_2),\Psi((1-\alpha) w_1 + \alpha w_2)\}\\ &= \Phi((1-\alpha) w_1+ \alpha w_2)\\ & \leq (1-\alpha) \Phi(w_1)+ \alpha \Phi(w_2)\\ &\leq (1-\alpha) \max\{\Phi(w_1),\Psi(w_1)\} + \alpha \max\{\Phi(w_2),\Psi(w_2)\}\\ & = (1-\alpha) h(w_1) + \alpha h(w_2) \end{aligned} Here, the first inequality follows because \Phi is convex. Everything else follows from the definition of h as the maximum of \Phi and \Psi. The reader should verify the other four assertions as an exercise. Another useful exercise is to verify that the SVM cost in the next section is convex using just these five basic rules and the fact that the one-dimensional function f(x)=mx+b is convex for any scalars m and b.
Applications to empirical risk minimization
For decision theory problems, we studied the zero-one loss that counts errors: \mathit{loss}(\hat{y},y) = \mathbb{1}\{ y\hat{y} < 0\} Unfortunately, this loss is not useful for the gradient method. The gradient is zero almost everywhere. As we discussed in the chapter on supervised learning, machine learning practice always turns to surrogate losses that are easier to optimize. Here we review three popular choices, all of which are convex loss functions. Each choice leads to a different important optimization problem that has been studied in its own right.
The support vector machine
Consider the canonical problem of support vector machine classification. We are provided pairs \left(x_{i},y_{i}\right), with x_{i}\in\mathbb{R}^{d} and y_{i}\in\left\{ -1,1\right\} for i=1,\ldots,n. (Note that the y labels are now in \{-1,1\} instead of \{0,1\}.) The goal is to find a vector w\in \mathbb{R}^d such that: \begin{aligned} w^{T}x_{i} & > 0\quad\text{ for }y_{i}=1\\ w^{T}x_{i} & < 0\quad\text{ for }y_{i}=-1 \end{aligned} Such a w defines a half-space where we believe all of the positive examples lie on one side and the negative examples on the other.
Rather than classifying these points exactly, we can allow some slack. We pay a penalty of 1-y_i w^T x_i for points that are not strongly classified. This motivates the hinge loss we encountered earlier and leads to the support vector machine objective: \text{minimize}_w\quad \sum_{i=1}^n \max\left\{1-y_i w^T x_i,0\right\} \,. Defining the function e(z) = \mathbb{1}\{ z\leq 1 \}, we can compute that the gradient of the SVM cost is -\sum_{i=1}^n e(y_i w^T x_i) y_i x_i \,. Hence, gradient descent for this ERM problem follows the iteration w_{t+1} = w_t + \alpha \sum_{i=1}^n e(y_i w_t^T x_i) y_i x_i\,. Although similar, note that this isn’t quite the perceptron method yet. The time to compute one gradient step is O(n) as we sum over all n inner products. We will soon turn to the stochastic gradient method, which has constant iteration complexity and subsumes the perceptron algorithm.
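As a rough sketch, one full gradient step on this cost can be written as follows in Python. The synthetic data, the step size, and the number of iterations are arbitrary choices for illustration.

```python
import numpy as np

def svm_gradient_step(w, X, y, alpha):
    """One full (sub)gradient step on the SVM cost sum_i max(1 - y_i w^T x_i, 0).

    The indicator e(z) = 1{z <= 1} selects the examples that are not
    strongly classified; each step costs O(n) inner products.
    """
    margins = y * (X @ w)                   # y_i w^T x_i for every example
    active = (margins <= 1).astype(float)   # e(y_i w^T x_i)
    return w + alpha * X.T @ (active * y)   # w + alpha * sum_i e(...) y_i x_i

# Tiny synthetic example (placeholder data).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = np.sign(X[:, 0] + 0.1 * rng.normal(size=100))
w = np.zeros(5)
for _ in range(200):
    w = svm_gradient_step(w, X, y, alpha=0.01)
```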
Logistic regression
Logistic regression is equivalent to using the loss function \mathit{loss}(\hat y,y) = \log \left(1+\exp(-y\hat y)\right)\,. Note that even though this loss has a probabilistic interpretation, it can also just be seen as an approximation to the error-counting zero-one loss.
Least squares classification
Least squares classification uses the loss function \mathit{loss}(\hat y,y) = \tfrac{1}{2}(\hat y-y)^2\,. Though this might seem like an odd approximation to the error-counting loss, it leads to the maximum a posteriori (MAP) decision rule when minimizing the population risk. Recall the MAP rule selects the label that has highest probability conditional on the observed data.
It is helpful to keep in mind the following picture, which summarizes how each of these loss functions approximates the zero-one loss. We can ensure that the squared loss is an upper bound on the zero-one loss by dropping the factor 1/2.
Insights from quadratic functions
Quadratic functions are the prototypical example that motivate algorithms for differentiable optimization problems. Though not all insights from quadratic optimization transfer to more general functions, there are several key features of the dynamics of iterative algorithms on quadratics that are notable. Moreover, quadratics are a good test case for reasoning about optimization algorithms: if a method doesn’t work well on quadratics, it typically won’t work well on more complicated nonlinear optimization problems. Finally, note that ERM with linear functions and a squared loss is a quadratic optimization problem, so such problems are indeed relevant to machine learning practice.
The general quadratic optimization problem takes the form \Phi(w) = \tfrac{1}{2} w^T Q w - p^T w + r\,, where Q is a symmetric matrix, p is a vector, and r is a scalar. The scalar r only affects the value of the function, and plays no role in the dynamics of gradient descent. The gradient of this function is \nabla \Phi(w) = Qw - p\,. The stationary points of \Phi are the w where Qw = p. If Q is full rank, there is a unique stationary point.
The gradient descent algorithm for quadratics follows the iterations w_{t+1} = w_t - \alpha (Qw_t - p)\,. If we let w_\star be any stationary point of \Phi, we can rewrite this iteration as w_{t+1}-w_\star = (I-\alpha Q) (w_t - w_\star)\,. Unwinding the recursion yields the “closed form” formula for the gradient descent iterates w_{t}-w_\star = (I-\alpha Q)^t (w_0 - w_\star)\,.
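As a quick numerical sanity check of this closed-form expression, the following sketch compares the gradient descent iterate with (I-\alpha Q)^t (w_0 - w_\star) + w_\star on a small random positive definite Q (the dimensions and step size are arbitrary).

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(4, 4))
Q = A @ A.T + np.eye(4)              # symmetric positive definite
p = rng.normal(size=4)
w_star = np.linalg.solve(Q, p)       # the unique stationary point Q w = p

alpha = 1.0 / np.max(np.linalg.eigvalsh(Q))
w = np.zeros(4)
for _ in range(50):
    w = w - alpha * (Q @ w - p)      # gradient descent on the quadratic

# Compare with w_t = (I - alpha Q)^t (w_0 - w_star) + w_star, starting from w_0 = 0.
closed_form = np.linalg.matrix_power(np.eye(4) - alpha * Q, 50) @ (-w_star) + w_star
assert np.allclose(w, closed_form)
```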
This expression reveals several possible outcomes. Let \lambda_1 \geq \lambda_2 \geq \ldots \geq \lambda_d denote the eigenvalues of Q. These eigenvalues are real because Q is symmetric. First suppose that Q has a negative eigenvalue \lambda_d<0 and v is an eigenvector such that Qv = \lambda_d v. Then (I-\alpha Q)^t v = (1+\alpha |\lambda_d|)^t v which tends to infinity as t grows. This is because 1+\alpha |\lambda_d| is greater than 1 if \alpha>0. Hence, if \langle v, w_0-w_\star \rangle \neq 0, gradient descent diverges. For a random initial condition w_0, we’d expect this dot product will not equal zero, and hence gradient descent will almost surely not converge from a random initialization.
In the case that all of the eigenvalues of Q are positive, choosing \alpha greater than zero and at most 1/\lambda_1 ensures that 0\leq 1-\alpha \lambda_k <1 for all k. In this case, the gradient method converges exponentially quickly to the optimum w_\star: \begin{aligned} \Vert w_{t+1}-w_\star\Vert &= \Vert (I-\alpha Q) (w_t - w_\star)\Vert\\ &\leq \Vert I-\alpha Q\Vert \Vert w_t - w_\star \Vert \leq \left(1-\alpha\lambda_d\right)\Vert w_t-w_\star\Vert\,. \end{aligned} When the eigenvalues of Q are all positive, the function \Phi is strongly convex. Strongly convex functions turn out to be the set of functions where gradient descent with a constant step size converges exponentially from any starting point.
Note that the ratio of \lambda_1 to \lambda_d governs how quickly all of the components converge to 0. Defining the condition number of Q to be \kappa = \lambda_1/\lambda_d and setting the step size \alpha = 1/\lambda_1 gives the bound \Vert w_{t}-w_\star\Vert \leq \left(1 - \kappa^{-1}\right)^t \Vert w_{0}-w_\star \Vert\,. This rate reflects what happens in practice: when some eigenvalues are much smaller than \lambda_1, so that the condition number is large, gradient descent tends to bounce around and oscillate as shown in the figure below. When the condition number of Q is small, gradient descent makes rapid progress towards the optimum.
There is one final case that’s worth considering. When all of the eigenvalues of Q are nonnegative but some of them are zero, the function \Phi is convex but not strongly convex. In this case, exponential convergence to a unique point cannot be guaranteed. In particular, there will be an infinite number of global minimizers of \Phi. If w_\star is a global minimizer and v is any vector with Qv=0, then w_\star+v is also a global minimizer. However, in the span of the eigenvectors corresponding to positive eigenvalues, gradient descent still converges exponentially. For general convex functions, it will be important to consider different parts of the parameter space to fully understand the dynamics of gradient methods.
Stochastic gradient descent
The stochastic gradient method is one of the most popular algorithms for contemporary data analysis and machine learning. It has a long history and has been “invented” several times by many different communities (under the names “least mean squares,” “backpropagation,” “online learning,” and the “randomized Kaczmarz method”). Most researchers attribute this algorithm to the initial work of Robbins and Monro from 1951 who solved a more general problem with the same method.Robbins and Monro, “A Stochastic Approximation Method,” The Annals of Mathematical Statistics, 1951, 400–407.
Consider again our main goal of minimizing the empirical risk with respect to a vector of parameters w, and consider the simple case of linear classification where w is d-dimensional and f(x_i;w) = w^Tx_i\,.
The idea behind the stochastic gradient method is that since the gradient of a sum is the sum of the gradients of the summands, each summand provides useful information about how to optimize the total sum. Stochastic gradient descent minimizes empirical risk by following the gradient of the risk evaluated on a single, random example.
Stochastic Gradient Descent
- Start from an initial point w_0 \in \mathbb{R}^d.
- At each step t=0,1,2,\dots:
- Choose a step size \alpha_t>0 and random index i \in [n].
- Set w_{t+1} = w_t - \alpha_t \nabla_w \mathit{loss}(f(x_i;w_t),y_i)
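The following sketch instantiates this loop in Python for a linear model with the square loss; the loss, model, step size, and iteration count are illustrative assumptions rather than fixed choices.

```python
import numpy as np

def sgd(X, y, alpha=0.01, steps=10_000, seed=0):
    """Stochastic gradient descent for ERM with a linear model and square loss.

    At each step, sample one example uniformly at random and follow the
    gradient of the loss on that single example.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(steps):
        i = rng.integers(n)                  # random index i in [n]
        grad_i = (X[i] @ w - y[i]) * X[i]    # gradient of (1/2)(w^T x_i - y_i)^2
        w = w - alpha * grad_i
    return w
```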
The intuition behind this method is that by following a descent direction in expectation, we should be able to get close to the optimal solution if we wait long enough. However, it’s not quite that simple. Note that even when the gradient of the sum is zero, the gradients of the individual summands may not be. The fact that w_\star is no longer a fixed point complicates the analysis of the method.
Example: revisiting the perceptron
Let’s apply the stochastic gradient method to the support vector machine cost. We initialize our half-space at some w_{0}. At iteration t, we choose a random data point (x_{i},y_{i}) and update w_{t+1}= w_{t}+\eta\begin{cases} y_{i}x_{i} &\text{if}~y_{i}w_{t}^{T}x_{i}\leq 1\\ 0 & \text{otherwise} \end{cases} As we promised earlier, we see that using stochastic gradient descent to minimize empirical risk with a hinge loss is completely equivalent to Rosenblatt’s Perceptron algorithm.
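In code, a single such update might look like the sketch below (the step size \eta and the calling convention are placeholders).

```python
import numpy as np

def hinge_sgd_step(w, x_i, y_i, eta):
    """One stochastic gradient step on the hinge loss max(1 - y_i w^T x_i, 0).

    The update is identical to the perceptron update whenever the example is
    not classified with margin greater than one, and does nothing otherwise.
    """
    if y_i * (w @ x_i) <= 1:
        return w + eta * y_i * x_i
    return w
```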
Example: computing a mean
Let’s now examine the simplest example possible. Consider applying the stochastic gradient method to the function \frac{1}{2n} \sum_{i=1}^n (w-y_i)^2\,, where y_1,\dots,y_n are fixed scalars. This setup corresponds to a rather simple classification problem where the x features are all equal to 1. Note that the gradient of one of the increments is \nabla \mathit{loss}(f(x_i;w),y_i) = w-y_i\,.
To simplify notation, let’s imagine that our random samples are coming to us in order \{1,2,3,4,\ldots\}. Start with w_1=0 and use the step size \alpha_k = 1/k. We can then write out the first few iterations: \begin{aligned} w_{2} & = w_{1}-w_{1}+y_{1}=y_{1}\\ w_{3} & = w_{2}-\frac{1}{2}\left(w_{2}-y_{2}\right)=\frac{1}{2}y_{1}+\frac{1}{2}y_{2}\\ w_{4} & = w_{3}-\frac{1}{3}\left(w_{3}-y_{3}\right) = \frac{1}{3}y_{1}+\frac{1}{3}y_{2}+\frac{1}{3}y_{3} \end{aligned} Generalizing from here, we can conclude by induction that w_{k+1} = \left(\frac{k-1}{k}\right)w_{k}+\frac{1}{k}y_{k}=\frac{1}{k}\sum_{i=1}^{k}y_{i}\,. After n steps, our iterate is exactly the mean of the y_i seen so far, and you can check by taking a gradient that this is indeed the minimizer of the ERM problem.
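A short numerical check of this calculation (the sample size and distribution are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(2)
y = rng.normal(size=1000)

w = 0.0
for k, y_k in enumerate(y, start=1):
    w = w - (1.0 / k) * (w - y_k)    # step size 1/k on the increment (1/2)(w - y_k)^2

assert np.isclose(w, np.mean(y))     # the final iterate is the sample mean
```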
The 1/k step size was the step size originally proposed by Robbins and Monro. This simple example illustrates why: we can think of the stochastic gradient method as computing a running average. Another motivation for the 1/k step size is that the steps tend to zero, but the path length is infinite.
Moving to a more realistic random setting where the data might arrive in any order, consider what happens when we run the stochastic gradient method on the function R(w) = \tfrac{1}{2} \mathbb{E}[ (w-Y)^2]\,. Here Y is some random variable with mean \mu and variance \sigma^2. If we run for k steps with i.i.d. samples Y_i at each iteration, the calculation above reveals that w_k = \frac{1}{k} \sum_{i=1}^k Y_i\,. The associated cost is R(w_k) = \frac{1}{2} \mathop\mathbb{E}\left[ \left(\frac{1}{k} \sum_{i=1}^k Y_i-Y\right)^2\right] = \frac{1}{2k}\sigma^2 + \frac{1}{2}\sigma^2\,. Compare this to the minimum achievable risk R_\star. Expanding the definition R(w) = \tfrac{1}{2} \mathbb{E}[w^2 - 2 Y w + Y^2] = \tfrac{1}{2}w^2 - \mu w + \tfrac{1}{2}\sigma^2 + \tfrac{1}{2}\mu^2\,, we find that the minimizer is w_\star=\mu. Its cost is R_\star=R(w_\star) = \frac{1}{2} \sigma^2\,, and after n iterations, we have the expected optimality gap \mathop\mathbb{E}\left[R(w_n)-R_\star\right] = \frac{1}{2n}\sigma^2\,. This is the best we could have achieved using any estimator for w_\star given the collection of random draws. Interestingly, the incremental “one-at-a-time” method finds as good a solution as one that considers all of the data together. This basic example reveals a fundamental limitation of the stochastic gradient method: we can’t expect to generically get fast convergence rates without additional assumptions. Statistical fluctuations themselves prevent the optimality gap from decreasing exponentially quickly.
This simple example also helps give intuition on the convergence as we sample stochastic gradients. The figure below plots an example of each individual term in the summand, shaded with colors of blue to distinguish each term. The minimizing solution is marked with a red star. To the far left and far right of the figure, all of the summands will have gradients pointing in the same direction to the solution. However, as our iterate gets close to the optimum, we will be pointed in different directions depending on which gradient we sample. By reducing the step size, we will be more likely to stay close and eventually converge to the optimal solution.
Example: Stochastic minimization of a quadratic function
Let’s consider a more general version of stochastic gradient descent that follows the gradient plus some unbiased noise. We can model the algorithm as minimizing a function \Phi(w) where we follow a direction \nabla \Phi(w) + \nu, where \nu is a random vector with mean zero and \mathop\mathbb{E}[\Vert \nu \Vert^2] = \sigma^2. Consider the special case where \Phi(w) is a quadratic function \Phi(w) = \tfrac{1}{2} w^T Q w - p^T w +r\,. Then the iterations take the form w_{t+1} = w_t - \alpha ( Qw_t -p + \nu_t)\,. Let’s consider what happens when Q is positive definite with maximum eigenvalue \lambda_1 and minimum eigenvalue \lambda_d >0. Then, if w_\star is a global minimizer of \Phi, we can again rewrite the iterations as w_{t+1}-w_\star = (I-\alpha Q) (w_t-w_\star) - \alpha\nu_t\,. Since we assume that the noise \nu_t is independent of all of the w_k with k \leq t, we have \mathop\mathbb{E}[ \Vert w_{t+1}-w_\star \Vert^2 ]\leq \Vert I-\alpha Q \Vert^2 \mathop\mathbb{E}[ \Vert w_t-w_\star\Vert^2 ] + \alpha^2\sigma^2\,, which looks like the formula we derived for quadratic functions, but now with an additional term from the noise. Assuming \alpha < 1/\lambda_1, we can unwind this recursion to find \mathop\mathbb{E}[ \Vert w_{t}-w_\star \Vert^2 ]\leq ( 1-\alpha \lambda_d)^{2t} \Vert w_0-w_\star\Vert^2 + \frac{\alpha \sigma^2}{\lambda_d}\,. From this expression, we see that the iterates converge exponentially quickly to a ball around the optimal solution. The smaller we make \alpha, the closer we converge to the optimal solution, but the rate of convergence is also slower for smaller \alpha. This tradeoff motivates many of the step size selection rules in stochastic gradient descent. In particular, it is common to start with a large step size and then successively reduce the step size as the algorithm progresses.
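The small simulation below is a sketch of this tradeoff: both runs converge to a ball around w_\star, and the smaller step size lands closer to the optimum (the matrix, noise level, and step sizes are arbitrary choices for illustration).

```python
import numpy as np

rng = np.random.default_rng(3)
Q = np.diag([1.0, 10.0])                  # lambda_d = 1, lambda_1 = 10
p = np.array([1.0, 1.0])
w_star = np.linalg.solve(Q, p)
sigma = 1.0

def final_distance(alpha, steps=5000):
    """Run noisy gradient descent and report the final distance to w_star."""
    w = np.zeros(2)
    for _ in range(steps):
        noise = rng.normal(scale=sigma / np.sqrt(2), size=2)  # E||noise||^2 = sigma^2
        w = w - alpha * (Q @ w - p + noise)
    return np.linalg.norm(w - w_star)

# The larger step size converges faster but stalls farther from the optimum.
print(final_distance(alpha=0.05), final_distance(alpha=0.005))
```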
Tricks of the trade
In this section, we describe key engineering techniques that are useful for tuning the performance of stochastic gradient descent. Every machine learning practitioner should know these simple tricks.
Shuffling. Even though we described the stochastic gradient method as sampling each gradient with replacement from the increments, in practice better results are achieved by simply randomly permuting the data points and then running SGD in this random order. This is called “shuffling,” and even a single shuffle can eliminate the pathological behavior we described in the example with highly correlated data. Recently, beginning with work by Gürbüzbalaban et al., researchers have shown that in theory, shuffling outperforms independent sampling of incrementsGürbüzbalaban, Ozdaglar, and Parrilo, “Why Random Reshuffling Beats Stochastic Gradient Descent,” Mathematical Programming, 2019, 1–36.. The arguments for without-replacement sampling remain more complicated than the with-replacement derivations, but optimal sampling for SGD remains an active area of optimization research.
Step size selection. Step size selection in SGD remains a hotly debated topic. We saw above that a decreasing step size of 1/k solved our simple one-dimensional ERM problem. However, a rule that works for an unreasonable number of cases is to simply pick the largest step size that does not result in divergence. This step size will result in a model that is not necessarily optimal, but significantly better than initialization. By slowly reducing the step size from this initial large step size to successively smaller ones, we can zero in on the optimal solution.
Step decay. The step size is usually reduced after a fixed number of passes over the training data. A pass over the entire dataset is called an epoch. In an epoch, some number of iterations are run, and then a choice is made about whether to change the step size. A common strategy is to run with a constant step size for some fixed number of iterations T, and then reduce the step size by a constant factor \gamma. Thus, if our initial step size is \alpha, on the kth epoch, the step size is \alpha \gamma^{k-1}. This method is often more robust in practice than the diminishing step size rule. For this step size rule, a reasonable heuristic is to choose \gamma between 0.8 and 0.9. Sometimes people choose rules as aggressive as \gamma = 0.1.
Another possible schedule for the step size is called epoch doubling. In epoch doubling, we run for T steps with step size \alpha, then run 2T steps with step size \alpha/2, and then 4T steps with step size \alpha/4 and so on. Note that this provides a piecewise constant approximation to the function \alpha/k.
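The two schedules can be sketched as small helper functions; the parameter names and the conventions for counting epochs and iterations are our own.

```python
def step_decay(alpha0, gamma, epoch):
    """Step size for step decay: constant within an epoch, cut by gamma after each one.

    With epochs numbered 1, 2, 3, ..., the k-th epoch uses alpha0 * gamma**(k - 1).
    """
    return alpha0 * gamma ** (epoch - 1)

def epoch_doubling(alpha0, T, k):
    """Step size at iteration k (counting from 0) under epoch doubling:
    T steps at alpha0, then 2T steps at alpha0/2, then 4T steps at alpha0/4, ..."""
    steps_so_far, alpha, length = 0, alpha0, T
    while k >= steps_so_far + length:
        steps_so_far += length
        length *= 2
        alpha /= 2
    return alpha
```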
Minibatching. A common technique used to take advantage of parallelism is called minibatching. A minibatch is an average of many stochastic gradients. Suppose at each iteration we sample a \mathrm{batch}_k with m data points. The update rule then becomes w_{k+1} = w_{k} -\alpha_k \frac{1}{m}\sum_{j\in \mathrm{batch}_k} \nabla_w \mathit{loss}(f(x_j;w_k),y_j)\,. Minibatching reduces the variance of the stochastic gradient estimate of the true gradient, and hence tends to be a better descent direction. Of course, there are tradeoffs in total computation time versus the size of the minibatch, and these typically need to be handled on a case by case basis.
Momentum. Finally, we note that one can run stochastic gradient descent with momentum. Momentum mixes the current gradient direction with the previously taken step. The idea here is that if the previous weight update was good, we may want to continue moving along this direction. The algorithm iterates are defined as w_{k+1}=w_{k}-\alpha_{k}g_k\left(w_{k}\right) + \beta (w_{k}-w_{k-1})\,, where g_k denotes a stochastic gradient. In practice, these methods are very successful. Typical choices for \beta here are between 0.8 and 0.95. Momentum can provide significant accelerations, and should be considered an option in any implementation of SGM.
The SGD quick start guide
Newcomers to stochastic gradient descent often find all of these design choices daunting, and it’s useful to have simple rules of thumb to get going. We recommend the following:
- Pick as large a minibatch size as you can given your computer’s RAM.
- Set your momentum parameter to either 0 or 0.9. Your call!
- Find the largest constant step size such that SGD doesn’t diverge. This takes some trial and error, but you only need to be accurate to within a factor of 10 here.
- Run SGD with this constant step size until the empirical risk plateaus.
- Reduce the step size by a constant factor (say, 10).
- Repeat steps 4 and 5 until you converge.
While this approach may not be optimal in all cases, it’s a great starting point and is good enough for probably 90% of applications we’ve encountered.
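As a sketch, the recipe might be implemented as below for a linear model with the square loss. The momentum update uses an equivalent “velocity” form of the formula above, and for simplicity the step size is cut on a fixed schedule rather than by detecting a plateau; all parameter values are illustrative.

```python
import numpy as np

def sgd_quickstart(X, y, batch_size=64, alpha=0.1, beta=0.9,
                   decay=10.0, epochs_per_stage=20, stages=3, seed=0):
    """Minibatch SGD with momentum and staged step-size decay (square loss, linear model)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    velocity = np.zeros(d)
    for stage in range(stages):
        step = alpha / decay ** stage            # cut the step size by `decay` each stage
        for _ in range(epochs_per_stage):
            order = rng.permutation(n)           # shuffle once per epoch
            for start in range(0, n, batch_size):
                batch = order[start:start + batch_size]
                grad = X[batch].T @ (X[batch] @ w - y[batch]) / len(batch)
                velocity = beta * velocity + grad   # momentum buffer
                w = w - step * velocity
    return w
```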
Analysis of the stochastic gradient method
We now turn to a theoretical analysis of the general stochastic gradient method. Before we proceed, let’s set up some conventions. We will assume that we are trying to minimize a convex function R:\mathbb{R}^d \rightarrow \mathbb{R}. Let w_\star denote any optimal solution of R. We will assume we gain access at every iteration to a stochastic function g(w;\xi) such that \mathbb{E}_\xi[g(w; \xi)] = \nabla R(w)\,. Here \xi is a random variable which determines what our direction looks like. We additionally assume that these stochastic gradients are bounded, so there exists a non-negative constant B such that \Vert g(w;\xi)\Vert \leq B\,.
We will study the stochastic gradient iteration w_{t+1}=w_{t}-\alpha_{t}g\left(w_{t};\xi_t\right)\,. Throughout, we will assume that the sequence \{\xi_j\} is selected i.i.d. from some fixed distribution.
We begin by expanding the distance to the optimal solution: \begin{aligned} \Vert w_{t+1} - w_\star \Vert^2 &= \Vert w_t - \alpha_t g(w_t;\xi_t) - w_\star \Vert^2 \\ &= \Vert w_t - w_\star\Vert^2 -2 \alpha_t \langle g( w_t;\xi_t), w_t- w_\star \rangle + \alpha_t^2 \Vert g(w_t; \xi_t) \Vert^2 \end{aligned}
We deal with each term in this expansion separately. First, note that by the law of iterated expectation, \begin{aligned} \mathbb{E}[ \langle g(w_t;\xi_t), w_t - w_\star \rangle ] &= \mathbb{E}\left[\mathbb{E}_{\xi_t}[ \langle g(w_t;\xi_t), w_t - w_\star \rangle \mid \xi_0,\ldots,\xi_{t-1} ]\right]\\ &= \mathbb{E}\left[ \langle\mathbb{E}_{\xi_t}[ g(w_t;\xi_t) \mid \xi_0,\ldots,\xi_{t-1} ] , w_t - w_\star \rangle\right]\\ &= \mathbb{E}\left[ \langle \nabla R(w_t) , w_t - w_\star \rangle\right]\,. \end{aligned} Here, we are simply using the fact that \xi_t being independent of all of the preceding \xi_i implies that it is independent of w_t. This means that when we iterate the expectation, the stochastic gradient can be replaced by the gradient. The last term we can bound using our assumption that the gradients are bounded: \begin{aligned} \mathbb{E}[ \Vert g(w_t; \xi_t)\Vert^2] \leq B^2 \end{aligned}
Letting \delta_t := \mathbb{E}[\Vert w_{t}-w_\star\Vert^2], this gives \delta_{t+1} \leq \delta_t - 2 \alpha_t \mathbb{E}\left[ \langle \nabla R(w_t) , w_t - w_\star \rangle\right] + \alpha_t^2 B^2\,.
Now let \lambda_t = \sum_{j=0}^t \alpha_j denote the sum of all the step sizes up to iteration t. Also define the average of the iterates weighted by the step size \bar{w}_t = \lambda_t^{-1} \sum_{j=0}^t \alpha_j w_j\,. We are going to analyze the deviation of R(\bar{w}_t) from optimality.
Also let \rho_0 = \Vert w_0-w_\star\Vert. \rho_0 is the initial distance to an optimal solution. It is not necessarily a random variable.
To proceed, we just expand the following expression: \begin{aligned} \mathbb{E}\left[ R\left (\bar{w}_T \right) - R(w_\star) \right] & \leq \mathbb{E}\left[ \lambda_T^{-1} \sum_{t=0}^T \alpha_t( R(w_t) - R(w_\star) ) \right] \\ &\leq \lambda_T^{-1} \sum_{t=0}^T \alpha_t \mathbb{E}[ \langle \nabla R(w_t), w_t-w_\star \rangle ]\\ &\leq \lambda_T^{-1} \sum_{t=0}^T \left(\tfrac{1}{2}(\delta_{t}-\delta_{t+1})+ \tfrac{1}{2} \alpha_t^2 B^2\right)\\ & = \frac{\delta_0 - \delta_{T+1} + B^2 \sum_{t=0}^T \alpha_t^2}{2 \lambda_T}\\ & \leq \frac{\rho_0^2 + B^2 \sum_{t=0}^T \alpha_t^2}{2 \sum_{t=0}^T \alpha_t} \end{aligned} Here, the first inequality follows because R is convex and \bar{w}_T is a weighted average of the iterates, so R(\bar{w}_T) \leq \lambda_T^{-1}\sum_{t=0}^T \alpha_t R(w_t). The second inequality uses the fact that tangent hyperplanes always lie below the graph of a convex function, i.e., R(w_\star) \geq R(w_t) + \langle\nabla R(w_t), w_\star-w_t\rangle. The third inequality uses the recursion we derived above for \delta_t.
The analysis we saw gives the following result.Nemirovski et al., “Robust Stochastic Approximation Approach to Stochastic Programming,” SIAM Journal on Optimization 19, no. 4 (2009): 1574–1609.
Suppose we run the SGM on a convex function R with minimum value R_\star for T steps with step size \alpha. Define \alpha_{\mathrm{opt}} = \frac{\rho_0}{B\sqrt{T}}\qquad\text{and}\qquad \theta = \frac\alpha{\alpha_{\mathrm{opt}}}\,. Then, we have the bound \mathbb{E}[R(\bar{w}_T) - R_\star] \leq \left(\tfrac{1}{2} \theta + \tfrac{1}{2} \theta^{-1}\right) \frac{B \rho_0}{\sqrt{T}} \,.
This proposition asserts that we pay linearly for errors in selecting the optimal constant step size. If we guess a constant step size that is two-times or one-half of the optimal choice, then we need to run for at most twice as many iterations. The optimal step size is found by minimizing our upper bound on the suboptimality gap. Other step sizes could also be selected here, including diminishing step size. But the constant step size turns out to be optimal for this upper bound.
What are the consequences for risk minimization? First, for empirical risk, assume we are minimizing a convex loss function and searching for a linear predictor. Assume further that there exists a model with zero empirical risk. Let C be the maximum value of the gradient of the loss function, let D be the largest norm of any example x_i, and let \rho denote the minimum norm of a w such that R_S[w]=0. Then we have \mathbb{E}[R_S[\bar{w}_T]] \leq \frac{CD\rho}{\sqrt{T}} \,. That is, with an appropriately chosen step size, the stochastic gradient method converges at a rate of 1/\sqrt{T}, the same rate of convergence we observed when studying the one-dimensional mean computation problem. Again, the stochasticity forces us into a slow, 1/\sqrt{T} rate of convergence, but high dimensionality does not change this rate.
Second, if we use each sample exactly once and we assume our data is i.i.d., we can think of the stochastic gradient method as minimizing the population risk instead of the empirical risk. With the same notation, we’ll have \mathbb{E}[R[\bar{w}_T]] -R_\star \leq \frac{CD\rho}{\sqrt{T}} \,. The analysis of stochastic gradient descent thus gives our second generalization bound of the book. What it shows is that by optimizing over a fixed set of T data points, we can get a solution that will have low cost on new data. We will return to this observation in the next chapter.
Implicit convexity
We have thus far focused on convex optimization problems, showing that gradient descent can find global minima with modest computational means. What about nonconvex problems? Nonconvex optimization is such a general class of problems that in general it is hard to make useful guarantees. However, ERM is a special optimization problem, and its structure enables nonconvexity to enter in a graceful way.
As it turns out, there’s a “hidden convexity” to ERM problems which shows that the predictions should converge to a global optimum even if we can’t pin down exactly which model the iterates converge to. We will show that this insight has useful benefits when models are overparameterized or nonconvex.
Suppose we have a loss function \mathit{loss} that is equal to zero when \hat{y} = y and is nonnegative otherwise. Suppose we have a generally parameterized function class \{ f(x;w) \colon w\in\mathbb{R}^d\} and we aim to find parameters w that minimize the empirical risk. The empirical risk R_S[w] =\frac{1}{n} \sum_{i=1}^n \mathit{loss}(f(x_i;w),y_i) is bounded below by 0. Hence if we can find a solution with f(x_i;w)=y_i for all i, we would have a global minimum not a local minimum. This is a trivial observation, but one that helps focus our study. If during optimization all predictions f(x_i;w) converge to y_i for all i, we will have computed a global minimizer. For the sake of simplicity, we specialize to the square loss in this section: \mathit{loss}(f(x_i;w),y_i) = \tfrac{1}{2} (f(x_i;w)-y_i)^2 The argument we develop here is inspired by the work of Du et al. who use a similar approach to rigorously analyze the convergence of two layer neural networks.Du et al., “Gradient Descent Provably Optimizes over-Parameterized Neural Networks,” in International Conference on Learning Representations, 2019. Similar calculations can be made for other losses with some modifications of the argument.
Convergence of overparameterized linear models
Let’s first consider the case of linear prediction functions f(x;w) = w^T x\,. Define y = \begin{bmatrix} y_1\\ \vdots \\ y_n \end{bmatrix}\qquad \qquad \text{and}\qquad \qquad X = \begin{bmatrix} x_1^T \\ \vdots \\x_n^T \end{bmatrix}\,. We can then write the empirical risk objective as R_S[w]=\frac1{2n}\|Xw-y\|^2\,. The gradient descent update rule has the form w_{t+1} = w_t - \alpha X^T (X w_t-y)\,. We pull the scaling factor 1/n into the step size for notational convenience. Now define the vector of predictions \hat{y}_t = \begin{bmatrix} f(x_1;w_t) \\ \vdots \\ f(x_n;w_t) \end{bmatrix}\,. For the linear case, the predictions are given by \hat{y}_t = X w_t. We can use this definition to track the evolution of the predictions instead of the parameters w. The predictions evolve according to the rule \hat{y}_{t+1} = \hat{y}_t - \alpha X X^T (\hat{y}_t-y)\,. This looks a lot like the gradient descent algorithm applied to a strongly convex quadratic function like the ones we studied earlier. Subtracting y from both sides and rearranging shows \hat{y}_{t+1}-y = (I - \alpha X X^T) (\hat{y}_t-y)\,. This expression proves that as long as XX^T is strictly positive definite and \alpha is small enough, the predictions converge to the training labels. Keep in mind that X is n\times d and a model is overparameterized if d>n. The n\times n matrix XX^T has a chance of being strictly positive definite in this case.
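A short simulation sketching this behavior in an overparameterized setting (the dimensions and data are arbitrary; with d > n, a random X makes XX^T positive definite with high probability):

```python
import numpy as np

rng = np.random.default_rng(4)
n, d = 20, 100                        # overparameterized: d > n
X = rng.normal(size=(n, d))
y = rng.normal(size=n)

alpha = 1.0 / np.max(np.linalg.eigvalsh(X @ X.T))

w = np.zeros(d)
errors = []
for _ in range(200):
    w = w - alpha * X.T @ (X @ w - y)          # gradient descent on the empirical risk
    errors.append(np.linalg.norm(X @ w - y))   # || y_hat_t - y ||

# The prediction error decays geometrically, as the recursion predicts.
print(errors[0], errors[50], errors[-1])
```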
When we use a sufficiently small and constant step size \alpha, our predictions converge at an exponential rate. This is in contrast to the behavior we saw for gradient methods on overdetermined problems. Our general analysis of the weights showed that the convergence rate might be only inverse polynomial in the iteration counter. In the overparameterized regime, we can guarantee the predictions converge more rapidly than the weights themselves.
The rate in this case is governed by properties of the matrix X. As we have discussed we need the eigenvalues of XX^T to be positive, and we’d ideally like that the eigenvalues are all of similar magnitude.
First note that a necessary condition for XX^T to be positive definite is that d, the dimension, is at least as large as n, the number of data points. That is, we effectively need an overparameterized model in order to ensure exponentially fast convergence of the predictions. We have already seen that such overparameterized models make it possible to interpolate any set of labels and to always force the data to be linearly separable. Here, we see further that overparameterization encourages optimization methods to converge in fewer iterations by improving the condition number of the data matrix.
Overparameterization can also help accelerate convergence. Recall that the eigenvalues of XX^T are the squares of the singular values of X. Let us write out a singular value decomposition X = USV^T\,, where S is a diagonal matrix of singular values (\sigma_1,\ldots, \sigma_n). In order to improve the condition number of this matrix, it suffices to add a feature that is concentrated in the span of the singular vectors with small singular values. How to find such features is not always apparent, but does give us a starting point as to where to look for new, informative features.
Convergence of nonconvex models
Surprisingly, this analysis naturally extends to nonconvex models. With some abuse of notation, let \hat y = f(x;w)\in\mathbb{R}^n denote the n predictions of some nonlinear model parameterized by the weights w on input x. Our goal is to minimize the squared loss objective \frac12\|f(x;w)-y\|^2\,. Since the model is nonlinear this objective is no longer convex. Nonetheless we can mimic the analysis we did previously for overparameterized linear models.
Running gradient descent on the weights gives w_{t+1} = w_t - \alpha \mathsf{D}f (x;w_t)(\hat{y}_t - y)\,, where \hat y_t=f(x;w_t) and \mathsf{D}f is the Jacobian of the predictions with respect to w. That is, \mathsf{D}f (x;w) is the d\times n matrix of first order derivatives of the function f(x;w) with respect to w. We can similarly define the Hessian operator H(w) to be the n \times d \times d array of second derivatives of f(x;w). We can think of H(w) as a quadratic form that maps pairs of vectors (u,v) \in \mathbb{R}^{d}\times\mathbb{R}^{d} to \mathbb{R}^n. With these higher order derivatives, Taylor’s theorem asserts \begin{aligned} \hat{y}_{t+1} &= f(x;w_{t+1})\\ &= f(x;w_t) + \mathsf{D}f (x;w_t)^T(w_{t+1}-w_t) \\ &\qquad\qquad+ \int_0^1 (1-s)\, H(w_t+s (w_{t+1}-w_t)) (w_{t+1}-w_t, w_{t+1}-w_t) \mathrm{d}s\,. \end{aligned} Since w_t are the iterates of gradient descent, this means that we can write the prediction as \begin{aligned} \hat{y}_{t+1} &= \hat{y}_t -\alpha \mathsf{D}f(x;w_t)^T \mathsf{D}f(x;w_t)(\hat{y}_t - y) + \alpha \epsilon_t \,, \end{aligned} where \epsilon_t = \alpha \int_0^1 (1-s)\, H(w_t+s (w_{t+1}-w_t)) \left( \mathsf{D}f(x;w_t) (\hat{y}_t - y),\mathsf{D}f(x;w_t) (\hat{y}_t - y)\right)\mathrm{d}s\,. Subtracting the labels y from both sides and rearranging terms gives the recursion \hat{y}_{t+1}-y = (I - \alpha \mathsf{D}f(x;w_t)^T \mathsf{D}f(x;w_t))(\hat{y}_t - y) + \alpha \epsilon_t\,. If \epsilon_t vanishes, this shows that the predictions again converge to the training labels as long as the eigenvalues of \mathsf{D}f(x;w_t)^T \mathsf{D}f(x;w_t) are strictly positive. When the error vector \epsilon_t is sufficiently small, similar dynamics will occur. We expect \epsilon_t to not be too large because it is quadratic in the distance of \hat{y}_t to y and because it is multiplied by the step size \alpha, which can be chosen to be small.
The nonconvexity isn’t particularly disruptive here. We just need to make sure our Jacobians have full rank most of the time and that our steps aren’t too large. Again, if the number of parameters is larger than the number of data points, then the matrices \mathsf{D}f^T \mathsf{D}f are likely to be positive definite as long as we’ve engineered them well. But how exactly can we guarantee that our Jacobians are well behaved? We can derive some reasonable ground rules by unpacking how we compute gradients of compositions of functions. More on this follows in our chapter on deep learning.
Regularization
The topic of regularization belongs somewhere between optimization and generalization, and it’s one way of connecting the two. Hence, we will encounter it in both chapters. Indeed, one complication with optimization in the overparameterized regime is that there is an infinite collection of models that achieve zero empirical risk. How do we break ties between these models, and which set of weights should we prefer?
To answer this question we need to take a step back and remember that the goal of supervised learning is not just to achieve zero training error. We also care about performance on data outside the training set, and having zero loss on its own doesn’t tell us anything about data outside the training set.
As a toy example, imagine we have two sets of data X_{\mathrm{train}} and X_{\mathrm{test}} where X_{\mathrm{train}} has shape n \times d and X_{\mathrm{test}} is m \times d. Let y_{\mathrm{train}} be the training labels and let q be an m-dimensional vector of random labels. Then if d>(m+n) we can find weights w such that \begin{bmatrix} X_{\mathrm{train}} \\ X_{\mathrm{test}} \end{bmatrix} w = \begin{bmatrix} y_{\mathrm{train}} \\ q \end{bmatrix}\,. That is, these weights would produce zero error on the training set, but error no better than random guessing on the testing set. That’s not desired behavior! Of course this example is pathological, because in reality we would have no reason to fit random labels against the test set when we create our model.
The main challenge in supervised learning is to design models that achieve low training error while performing well on new data. The main tool used in such problems is called regularization. Regularization is the general term for taking a problem with infinitely many solutions and biasing its solution towards a smaller subset of solution space. This is a highly encompassing notion.
Sometimes regularization is explicit insofar as we have a desired property of the solution in mind which we exercise as a constraint on our optimization problem. Sometimes regularization is implicit insofar as algorithm design choices lead to a unique solution, although the properties of this solution might not be immediately apparent.
Here, we take an unconventional tack of working from implicit to explicit, starting with stochastic gradient descent.
Implicit regularization by optimization
Consider again the linear case of gradient descent or stochastic gradient descent, w_{t+1} = w_t - \alpha e_t x_t\,, where e_t is the gradient of the loss at the current prediction. Note that if we initialize w_0=0, then w_t is always in the span of the data. This can be seen by simple induction. This already shows that even though the weights live in a high-dimensional space, SGD searches over a space of dimension at most n, the number of data points.
Now suppose we have a nonnegative loss with \frac{\partial \mathit{loss}(z,y)}{\partial z} = 0 if and only if y=z. This condition is satisfied by the square loss, but not hinge and logistic losses. For such losses, at optimality we have for some vector v that:
- Xw = y\,, because we have zero loss.
- w = X^T v\,, because we are in the span of the data.
Under the mild assumption that our examples are linearly independent, we can combine these equations to find that w=X^T(XX^T)^{-1}y\,. That is, when we run stochastic gradient descent we converge to a very specific solution. Even though we were searching through an n-dimensional space, we converge to a unique point in this space.
This special w is the minimum Euclidean norm solution of Xw=y. In other words, out of all the linear prediction functions that interpolate the training data, SGD selects the solution with the minimal Euclidean norm. To see why this solution has minimal norm, suppose that \hat{w} = X^T\alpha + v interpolates the data, with v orthogonal to all of the x_i. Then we have X\hat{w} = XX^T\alpha + Xv = XX^T\alpha \,, which means \alpha is completely determined and hence \hat{w} = X^T(XX^T)^{-1}y+ v. But now \Vert \hat{w}\Vert^2 = \Vert X^T(XX^T)^{-1}y\Vert^2 + \Vert v \Vert^2\,. Minimizing the right hand side shows that v must equal zero.
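The following sketch checks this numerically: gradient descent on the square loss, initialized at zero, converges to the minimum norm interpolating solution (the dimensions and data are arbitrary).

```python
import numpy as np

rng = np.random.default_rng(5)
n, d = 10, 50
X = rng.normal(size=(n, d))
y = rng.normal(size=n)

# Gradient descent on the square loss, initialized at zero, stays in the span of the data.
w = np.zeros(d)
alpha = 1.0 / np.max(np.linalg.eigvalsh(X @ X.T))
for _ in range(20_000):
    w = w - alpha * X.T @ (X @ w - y)

w_min_norm = X.T @ np.linalg.solve(X @ X.T, y)    # X^T (X X^T)^{-1} y
print(np.linalg.norm(X @ w - y))                  # interpolates the labels (near zero)
print(np.linalg.norm(w - w_min_norm))             # matches the minimum norm solution
```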
We now turn to showing that such minimum norm solutions have important robustness properties that suggest that they will perform well on new data. In the next chapter, we will prove that these methods are guaranteed to perform well on new data under reasonable assumptions.
Margin and stability
Consider a linear predictor that makes no classification errors and hence perfectly separates the data. Recall that the decision boundary of this predictor is the hyperplane \mathcal{B}= \{z\,:\,w^Tz = 0\} and the margin of the predictor is the distance of the decision boundary to the data: \mathrm{margin}(w) = \min_i \operatorname{dist}\left(x_i,\mathcal{B}\right)\,. Since we’re assuming that w correctly classifies all of the training data, we can write the margin in the convenient form \operatorname{margin}(w) = \min_i \frac{y_i w^Tx_i}{\Vert w\Vert}\,.
Ideally, we’d like our data to be far away from the boundary and hence we would like our predictor to have large margin. The reasoning behind this desideratum is as follows: If we expect new data to be similar to the training data and the decision boundary is far away from the training data, then it would be unlikely for a new data point to lie on the wrong side of the decision boundary. Note that margin tells us how large a perturbation in the x_i can be handled before a data point is misclassified. It is a robustness measure that tells us how sensitive a predictor is to changes in the data itself.
Let’s now specialize margin to the interpolation regime described in the previous section. Under the assumption that we interpolate the labels so that w^Tx_i=y_i, we have \operatorname{margin}(w) = \Vert w\Vert^{-1}\,. If we want to simultaneously maximize margin and interpolate the data, then the optimal solution is to choose the minimum norm solution of Xw=y. This is precisely the solution found by SGD and gradient descent.
Note that we could have directly tried to maximize margin by solving the constrained optimization problem \begin{array}{ll} \text{minimize} & \Vert w \Vert^2\\ \text{subject to} & y_i w^Tx_i \geq 1\,. \end{array} This optimization problem is the classic formulation of the support vector machine. The support vector machine is an example of explicit regularization. Here we declare exactly which solution we’d like to choose given that our training error is zero. Explicit regularization of high dimensional models is as old as machine learning. In contemporary machine learning, however, we often have to squint to see how our algorithmic decisions are regularizing. The tradeoff is that we can run faster algorithms with implicit regularizers. But it’s likely that revisiting classic regularization ideas in the context of contemporary models will lead to many new insights.
The representer theorem and kernel methods
As we have discussed so far, it is common in linear methods to restrict the search space to the span of the data. Even when d is large (or even infinite), this reduces the search problem to one in an n-dimensional space. It turns out that under broad generality, solutions in the span of the data are optimal for most optimization problems in prediction. Here, we make formal an argument we first introduced in our discussion of features: for most empirical risk minimization problems, the optimal model will lie in the span of the training data.
Consider the penalized ERM problem \text{minimize} \frac{1}{n} \sum_{i=1}^n \mathit{loss}(w^T x_i, y_i) + \lambda \Vert w\Vert^2 Here \lambda is called a regularization parameter. When \lambda=0, there are an infinite number of w that minimize the ERM problem. But for any \lambda>0, there is a unique minimizer. The term regularization refers to adding some prior information to an optimization problem to make the optimal solution unique. In this case, the prior information is explicitly encoding that we should prefer w with smaller norms if possible. As we discussed in our chapter on features, smaller norms tend to correspond to simpler solutions in many feature spaces. Moreover, we just described that minimum norm solutions themselves could be of interest in machine learning. A regularization parameter allows us to explicitly tune the norm of the optimal solution.
For our penalized ERM problem, using the same argument as above, we can write any w as w = X^T \beta + v for some vectors \beta and v with v^Tx_i=0 for all i. Plugging this equation into the penalized ERM problem yields \text{minimize}_{\beta,v} \frac{1}{n} \sum_{i=1}^n \mathit{loss}(\beta^T X x_i,y_i) + \lambda \Vert X^T \beta \Vert^2 + \lambda \Vert v\Vert^2\,. Now we can minimize with respect to v, seeing that the only option is to set v=0. Hence, we must have that the optimum model lies in the span of the data: w = X^T \beta\,. This derivation is commonly called the representer theorem in machine learning. As long as the cost function only depends on function evaluations f(x_i)=w^Tx_i and the cost increases as a function of \Vert w\Vert, then the empirical risk minimizer will lie in the span of the data.
Define the kernel matrix of the training data K = XX^T. We can then rewrite the penalized ERM problem as \text{minimize}_{\beta} \frac{1}{n} \sum_{i=1}^n \mathit{loss}(e_i^T K\beta,y_i) + \lambda \beta^T K \beta\,, where e_i is the standard Euclidean basis vector. Hence, we can solve the machine learning problem only using the values in the matrix K, searching only for the coefficients \beta in the kernel expansion.
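As a sketch with the square loss, the coefficients \beta have a closed form in terms of the kernel matrix. The Gaussian kernel, the bandwidth, and the particular scaling convention for \lambda used below are illustrative assumptions.

```python
import numpy as np

def gaussian_kernel(A, B, bandwidth=1.0):
    """K[i, j] = exp(-||a_i - b_j||^2 / (2 * bandwidth^2))."""
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2 * A @ B.T
    return np.exp(-sq / (2 * bandwidth**2))

def kernel_ridge_fit(X, y, lam, bandwidth=1.0):
    """Minimize (1/2)||K beta - y||^2 + (lam/2) beta^T K beta, a square-loss
    instance of the penalized ERM problem in beta. Setting the gradient to zero
    gives beta = (K + lam I)^{-1} y whenever K is invertible."""
    K = gaussian_kernel(X, X, bandwidth)
    return np.linalg.solve(K + lam * np.eye(len(y)), y)

def kernel_ridge_predict(X_train, X_new, beta, bandwidth=1.0):
    """Predictions are weighted combinations of kernel evaluations against training points."""
    return gaussian_kernel(X_new, X_train, bandwidth) @ beta
```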
The representer theorem (also known as the kernel trick) tells us that most machine learning problems reduce to a search in n dimensional space, even if the feature space has much higher dimension. Moreover, the optimization problems only care about the values of dot products between data points. This motivates the use of the kernel functions described in our discussion of representation: kernel functions allow us to evaluate dot products of vectors in high dimensional spaces often without ever materializing the vectors, reducing high-dimensional function spaces to the estimation of weightings of individual data points in the training sample.
Squared loss methods and other optimization tools
This chapter focused on gradient methods for minimizing empirical risk, as these are the most common methods in contemporary machine learning. However, there are a variety of other optimization methods that may be useful depending on the computational resources available and the particular application in question.
There are a variety of optimization methods that have proven fruitful in machine learning, most notably constrained quadratic programming for solving support vector machines and related problems. In this section we highlight least squares methods, which are attractive because they reduce to solving linear systems. For many problems, linear system solves are faster than iterative gradient methods, and the computed solution is exact up to numerical precision, rather than being approximate.
Consider the optimization problem \begin{array}{ll} \text{minimize}_w & \tfrac{1}{2}\sum_{i=1}^n (y_i-w^T x_i)^2 \,. \end{array} The gradient of this loss function with respect to w is given by -\sum_{i=1}^n (y_i-w^T x_i)x_i \,. If we let y denote the vector of y labels and X denote the n\times d matrix X = \begin{bmatrix} x_1^T\\ x_2^T \\ \vdots \\ x_n^T \end{bmatrix}\,, then setting the gradient of the least squares cost equal to zero yields the solution w = (X^T X)^{-1} X^T y\,. For many problems, it is faster to compute this closed form solution than it is to run the number of iterations of gradient descent required to find a w with small empirical risk.
Regularized least squares also has a convenient closed form solution. The penalized ERM problem where we use a square loss is called the ridge regression problem: \begin{array}{ll} \text{minimize}_w & \tfrac{1}{2}\sum_{i=1}^n (y_i-w^T x_i)^2 +\lambda \Vert w\Vert^2 \end{array} Ridge regression can be solved in the same manner as above, yielding the optimal solution w = (X^T X+ \lambda I)^{-1} X^T y\,. There are a few other important problems solvable by least squares. First, we have the identity (X^T X+ \lambda I_d)^{-1} X^T = X^T (X X^T+ \lambda I_n)^{-1}\,. This means that we can solve ridge regression either by solving a system of d equations in d unknowns or one of n equations in n unknowns. In the overparameterized regime, we’d choose the formulation with n parameters. Moreover, as we described above, the minimum norm interpolating problem \begin{array}{ll} \text{minimize} & \Vert w \Vert^2\\ \text{subject to} & w^Tx_i = y_i\,. \end{array} is solved by w= X^T (X X^T)^{-1}y. This shows that the limit as \lambda goes to zero in ridge regression is this minimum norm solution.
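A quick numerical sketch of both claims, using small random data (the shapes and \lambda are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(6)
n, d = 8, 30
X = rng.normal(size=(n, d))
y = rng.normal(size=n)
lam = 0.1

# Ridge regression solved two ways: a d x d system or an n x n system.
w_primal = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
w_dual = X.T @ np.linalg.solve(X @ X.T + lam * np.eye(n), y)
print(np.linalg.norm(w_primal - w_dual))              # identical up to round-off

# As lambda tends to zero, ridge approaches the minimum norm interpolating solution.
w_min_norm = X.T @ np.linalg.solve(X @ X.T, y)
w_tiny_lam = X.T @ np.linalg.solve(X @ X.T + 1e-10 * np.eye(n), y)
print(np.linalg.norm(w_tiny_lam - w_min_norm))        # near zero
```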
Finally, we note that for kernelized problems, we can simply replace the matrix X X^T with the appropriate kernel K. Hence, least squares formulations are extensible to solve prediction problems in arbitrary kernel spaces.
Chapter notes
Mathematical optimization is a vast field, and we clearly are only addressing a very small piece of the puzzle. For an expanded coverage of the material in this chapter with more mathematical rigor and implementation ideas, we invite the reader to consult the recent book by Wright and Recht.Wright and Recht, Optimization for Data Analysis (Cambridge University Press, 2021).
The chapter focuses mostly on iterative, stochastic gradient methods. Initially invented by Robbins and Monro for solving systems of equations in random variablesRobbins and Monro, “A Stochastic Approximation Method.”, stochastic gradient methods have played a key role in pattern recognition since the Perceptron. Indeed, it was very shortly after Rosenblatt’s invention that researchers realized the Perceptron was solving a stochastic approximation problem. Of course, the standard perceptron step size schedule does not converge to a global optimum when the data is not separable, and this led to a variety of methods to fix the problem. Many researchers employed the Widrow-Hoff “Least-Mean-Squares” rule, which in modern terms minimizes the empirical risk associated with a square loss by stochastic gradient descent.Widrow and Hoff, “Adaptive Switching Circuits,” in Institute of Radio Engineers, Western Electronic Show and Convention, Convention Record, 1960, 96–104. Aizerman and his colleagues determined not only how to apply stochastic gradient descent to linear functions, but how to operate in kernel spaces as well.Aizerman, Braverman, and Rozonoer, “The Robbins-Monro Process and the Method of Potential Functions,” Automation and Remote Control 26 (1965): 1882–85. Of course, all of these methods were closely related to each other, but it took some time to put them all on a unified footing. It wasn’t until the 1980s, with a full understanding of complexity theory, that optimal step sizes were discovered for stochastic gradient methods by Nemirovski and Yudin.Nemirovski and Yudin, Problem Complexity and Method Efficiency in Optimization (Wiley, 1983). More surprisingly, it was not until 2007 that the first non-asymptotic analysis of the perceptron algorithm was published.Shalev-Shwartz, Singer, and Srebro, “Pegasos: Primal Estimated Sub-GrAdient SOlver for SVM,” in International Conference on Machine Learning, 2007.
Interestingly, it wasn’t again until the early 2000s that stochastic gradient descent became the default optimization method for machine learning. There tended to be a repeated cycle of popularity for the various optimization methods. Global optimization methods like linear programming were found to be effective in the 1960s for pattern recognition problemsMangasarian, “Linear and Nonlinear Separation of Patterns by Linear Programming,” Operations Research 13, no. 3 (1965): 444–52., supplanting interest in stochastic descent methods. Stochastic gradient descent was rebranded as back propagation in the 1980s, but again more global methods eventually took center stage. Mangasarian, who was involved in both of these cycles, told us in private correspondence that linear programming methods were always more effective in terms of their speed of computation and quality of solution.
Indeed, this pattern was also followed in optimization. Nemirovski and Nesterov did pioneering work in iterative gradient and stochastic gradient methods. But they soon turned to developing the foundations of interior point methods for solving global optimization problems.Nesterov and Nemirovskii, Interior-Point Polynomial Methods in Convex Programming (SIAM, 1994). In the 2000s, they republished their work on iterative methods, leading to a revolution in machine learning.
It’s interesting to track this history of remembering and forgetting in machine learning. Though these tools are not new, they are often forgotten and replaced. It’s perhaps time to revisit the non-iterative methods in light of this.