# Generalization

Simply put, generalization relates the performance of a model on seen
examples to its performance on *unseen* examples. In this
chapter, we discuss the interplay between representation, optimization,
and generalization, again focusing on models with more parameters than
seen data points. We examine the intriguing empirical phenomena related
to overparameterization and generalization in today’s machine learning
practice. We then review available theory—some old and some emerging—to
better understand and anticipate what drives generalization
performance.

# Generalization gap

Recall that the risk of a predictor f\colon \mathcal{X}\to \mathcal{Y} with respect to a loss function \mathit{loss}\colon \mathcal{Y}\times \mathcal{Y}\to\mathbb{R} is defined as R[f] = \mathop\mathbb{E}\left[\mathit{loss}(f(X), Y)\right]\,. Throughout this chapter, it will often be convenient to stretch the notation slightly by using \mathit{loss}(f,(x,y)) to denote the loss of a predictor f on an example (x,y)\,. For predictors specified by model parameters w, we’ll also write \mathit{loss}(w,(x,y))\,.

For the purposes of this chapter, it makes sense to think of
the n samples as an ordered tuple
S=((x_1,y_1),\dots,(x_n,y_n))\in(\mathcal{X}\times
\mathcal{Y})^n\,.
The empirical risk R_S[f]
is, as before,
R_{S}[f] = \frac{1}{n}\sum_{i=1}^{n}\mathit{loss}(f(x_i),y_i)\,.
Empirical risk minimization seeks to find a predictor f^* in a specified class \mathcal{F} that minimizes the empirical
risk:
R_S[f^*] = \min_{f\in\mathcal{F}} R_S[f]
In machine learning practice, the empirical risk is often called
*training error* or *training loss*, as it corresponds to
the loss achieved by some optimization method on the sample. Depending
on the optimization problem, we may not be able to find an exact
empirical risk minimizer and it may not be unique.
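Since the empirical risk is just an average of per-example losses, it is easy to compute directly. Below is a minimal Python sketch; the sample, the threshold predictor `f`, and the helper `empirical_risk` are made up for illustration and are not from the text:

```python
import numpy as np

def empirical_risk(f, S, loss):
    """Average loss of predictor f over a sample S of (x, y) pairs."""
    return np.mean([loss(f(x), y) for x, y in S])

# Zero-one loss: 1 if prediction and label disagree, 0 otherwise.
zero_one = lambda y_hat, y: float(y_hat != y)

# A toy sample and a hypothetical threshold predictor.
S = [(0.2, 1), (0.7, -1), (0.9, -1), (0.4, -1)]
f = lambda x: 1 if x < 0.5 else -1

risk_hat = empirical_risk(f, S, zero_one)
print(risk_hat)  # one of the four examples is misclassified, so 0.25
```

Here empirical risk minimization would amount to searching over a class of such predictors for the one with the smallest value of `empirical_risk`.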

Empirical risk minimization is commonly used as a proxy for minimizing the unknown population risk. But how good is this proxy? Ideally, we would like that the predictor f we find via empirical risk minimization satisfies R_S[f]\approx R[f]. However, this may not be the case, since the risk R[f] captures loss on unseen examples, while the empirical risk R_S[f] captures loss on seen examples.

Generally, we expect to do much better on seen examples than unseen
examples. This performance gap between seen and unseen examples is what
we call the *generalization gap*.

Define the *generalization gap* of a predictor f with respect to a dataset S as
\Delta_{\mathrm{gen}}(f) = R[f] - R_S[f]\,.

This quantity is sometimes also called *generalization error*
or *excess risk*. Recall the following tautological, yet
important identity:
R[f] = R_S[f] + \Delta_{\mathrm{gen}}(f)
What it says is that if we manage to make the empirical risk
R_S[f] small through optimization, then
all that remains to worry about is the generalization gap.

The last chapter provided powerful tools to make optimization succeed. How we can bound the generalization gap is the topic of this chapter. We first take a tour of evidence from machine learning practice for inspiration.

# Overparameterization: empirical phenomena

We previously experienced the advantages of overparameterized models in terms of their ability to represent complex functions and our ability to optimize them feasibly. The question remains whether they generalize well to unseen data. Perhaps we simply kicked the can down the road. Does the model size that was previously a blessing now come back to haunt us? We will see that not only do large models often generalize well in practice, but often more parameters lead to better generalization performance. Model size does, however, challenge some theoretical analysis. The empirical evidence will orient our theoretical study towards dimension-free bounds that avoid worst-case analysis.

## Effects of model complexity

Think of a model family with an associated measure of complexity, such as number of trainable parameters. Suppose that for each setting of the complexity measure, we can solve the empirical risk minimization problem. We can then plot what happens to risk and empirical risk as we vary model complexity.

A traditional view of generalization posits that as we increase model complexity, initially both empirical risk and risk decrease. However, past a certain point the risk begins to increase again, while the empirical risk continues to decrease.

The graphic shown in many textbooks is a U-shaped risk curve. The
complexity range below the minimum of the curve is called
*underfitting*. The range above is called
*overfitting*.

This picture is often justified using the bias-variance trade-off, motivated by a least squares regression analysis. However, it does not seem to bear much resemblance to what is observed in practice.

We have already discussed the example of the Perceptron, which achieves zero training loss and still generalizes well in theory. Numerous practitioners have observed that other complex models also can simultaneously achieve close to zero training loss and still generalize well. Moreover, in many cases risk continues to decrease as model complexity grows and training data are interpolated exactly down to (nearly) zero training loss. This empirical relationship between overparameterization and risk appears to be robust and manifests in numerous model classes, including overparameterized linear models, ensemble methods, and neural networks.

In the absence of regularization and for certain model families, the
empirical relationship between model complexity and risk is more
accurately captured by the *double descent* curve in the figure
above. There is an interpolation threshold at which a model of the given
complexity can fit the training data exactly. The complexity range below
the threshold is the *underparameterized regime*, while the one
above is the overparameterized regime. Increasing model complexity in
the overparameterized regime continues to decrease risk indefinitely,
albeit at decreasing marginal returns, toward some convergence
point.

The double descent curve is not universal. In many cases, in practice
we observe a single descent curve throughout the entire complexity
range. In other cases, we can see multiple bumps as we increase model
complexity (Liang, Rakhlin, and Zhai, “On the Multiple Descent of
Minimum-Norm Interpolants and Restricted Lower Isometry of Kernels,” in
*Conference on Learning Theory*, 2020).
However, the general point remains. There is no evidence that highly
overparameterized models do not generalize. Indeed, empirical evidence
suggests larger models not only generalize, but that larger models make
better out-of-sample predictors than smaller ones (He
et al., “Deep Residual Learning for Image Recognition,” in
*Computer Vision and Pattern Recognition*, 2016; Huang et al.,
“GPipe: Efficient Training of Giant Neural Networks Using Pipeline
Parallelism,” *Advances in Neural Information Processing
Systems* 32 (2019): 103–12).

## Optimization versus generalization

Training neural networks with stochastic gradient descent, as is commonly done in practice, attempts to solve a non-convex optimization problem. Reasoning about non-convex optimization is known to be difficult. Theoreticians see a worthy goal in trying to prove mathematically that stochastic gradient methods successfully minimize the training objective of large artificial neural networks. The previous chapter discussed some of the progress that has been made toward this goal.

It is widely believed that what makes optimization easy crucially depends on the fact that models in practice have many more parameters than there are training points. While making optimization tractable, overparameterization puts a burden on generalization.

We can force a disconnect between optimization and generalization in a simple experiment that we will see next. One consequence is that even if a mathematical proof established the convergence guarantees of stochastic gradient descent for training some class of large neural networks, it would not necessarily on its own tell us much about why the resulting model generalizes well to the test objective.

Indeed, consider the following experiment. Fix training data (x_1,y_1),\dots, (x_n, y_n) and fix a training algorithm A that achieves zero training loss on these data and achieves good test loss as well.

Now replace all the labels y_1,\dots,y_n by randomly and independently drawn labels \tilde y_1,\dots, \tilde y_n\,. What happens if we run the same algorithm on the training data with noisy labels (x_1,\tilde y_1),\dots, (x_n, \tilde y_n)?

One thing is clear. If we choose from k discrete classes, we expect the model trained on the random labels to have no more than 1/k test accuracy, that is, the accuracy achieved by random guessing. After all, there is no statistical relationship between the training labels and the test labels that the model could learn.

What is more interesting is what happens to optimization. The left
panel of the figure shows the outcome of this kind of *randomization
test* on the popular CIFAR-10 image classification benchmark for a
standard neural network architecture. What we can see is that the
training algorithm continues to drive the training loss to zero even if
the labels are randomized. The right panel shows that we can vary the
amount of randomization to obtain a smooth degradation of the test
error. At full randomization, the test error degrades to 90\%, as good as guessing one of
the 10 classes. The figure shows
what happens to a specific model architecture, called Inception, but
similar observations hold for most, if not all, overparameterized
architectures that have been proposed.

The randomization experiment shows that optimization continues to work well even when generalization performance is no better than random guessing, i.e., 10\% accuracy in the case of the CIFAR-10 benchmark that has 10 classes. The optimization method is moreover insensitive to properties of the data, since it works even on random labels. A consequence of this simple experiment is that a proof of convergence for the optimization method may not reveal any insights into the nature of generalization.
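A scaled-down version of this randomization test fits in a few lines. The sketch below is a hypothetical setup, not the CIFAR-10 experiment: a minimum-norm linear “model” with d > n parameters interpolates both the true and the randomized labels perfectly, while test accuracy collapses to chance under random labels:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 50, 200                    # many more parameters than training points

X = rng.standard_normal((n, d))
w_star = rng.standard_normal(d)
y_true = np.sign(X @ w_star)              # labels from a ground-truth rule
y_rand = rng.choice([-1.0, 1.0], size=n)  # randomized labels

X_te = rng.standard_normal((1000, d))
y_te = np.sign(X_te @ w_star)

acc = {}
for name, y in [("true", y_true), ("random", y_rand)]:
    w = np.linalg.lstsq(X, y, rcond=None)[0]   # minimum-norm interpolant
    acc[name] = (np.mean(np.sign(X @ w) == y),       # train accuracy
                 np.mean(np.sign(X_te @ w) == y_te)) # test accuracy
    print(name, acc[name])
```

In both cases "optimization" succeeds in the sense of perfect training accuracy; only the true-label model carries any information about the test distribution.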

## The diminished role of explicit regularization

Regularization plays an important role in the theory of convex empirical risk minimization. The most common form of regularization used to be \ell_2-regularization, corresponding to adding a scalar multiple of the squared Euclidean norm of the parameter vector to the objective function.

A more radical form of regularization, called *data
augmentation*, is common in the practice of deep learning. Data
augmentation transforms each training point repeatedly throughout the
training process by some operation, such as a *random crop* of
the image. Training on such randomly modified data points is meant to
reduce overfitting, since the model never encounters the exact same data
point twice.

Regularization continues to be a component of training large neural networks in practice. However, the nature of regularization is not clear. We can see a representative empirical observation in the table below.

params | random crop | \ell_2-regularization | train accuracy | test accuracy |
---|---|---|---|---|
1,649,402 | yes | yes | 100.0 | 89.05 |
1,649,402 | yes | no | 100.0 | 89.31 |
1,649,402 | no | yes | 100.0 | 86.03 |
1,649,402 | no | no | 100.0 | 85.75 |

The table shows the performance of a common neural model architecture, called Inception, on the standard CIFAR-10 image classification benchmark. The model has more than 1.5 million trainable parameters, even though there are only 50,000 training examples spread across 10 classes. The training procedure uses two explicit forms of regularization. One is a form of data augmentation with random crops. The other is \ell_2-regularization. With both forms of regularization the fully trained model achieves close to 90\% test accuracy. But even if we turn both of them off, the model still achieves close to 86\% test accuracy (without even readjusting any hyperparameters such as learning rate of the optimizer). At the same time, the model fully interpolates the training data in the sense of making no errors whatsoever on the training data.

These findings suggest that while explicit regularization may help generalization performance, it is by no means necessary for strong generalization of heavily overparameterized models.

# Theories of generalization

With these empirical facts in hand, we now turn to mathematical theories that might help explain what we observe in practice and also may guide future empirical and theoretical work. In the remainder of the chapter, we tour several different, seemingly disconnected views of generalization.

We begin with a deep dive into *algorithmic stability*, which
posits that generalization arises when models are insensitive to
perturbations in the data on which they are trained. We then discuss
*VC dimension and Rademacher complexity*, which show how small
generalization gaps can arise when we restrict the complexity of models
we wish to fit to data. We then turn to *margin bounds* which
assert that whenever the data is easily separable, good generalization
will occur. Finally we discuss generalization bounds that arise from
*optimization*, showing how choice of an algorithmic scheme
itself can yield models with desired generalization properties.

In all of these cases, we show that we can recover generalization bounds of the form we saw in the Perceptron: the bounds will decrease with number of data points and increase with “complexity” of the optimal prediction function. Indeed, looking back at the proof of the Perceptron generalization bound, all of the above elements appeared. Our generalization bound arose because we could remove single data points from a set and not change the number of mistakes made by the Perceptron. A large margin assumption was essential to get a small mistake bound. The mistake bound itself was dependent on the iterations of the algorithm. And finally, we related the size of the margin to the scale of the data and optimal separator.

Though starting from different places, we will show that the four different views of generalization can all arrive at similar results. Each of the aforementioned ingredients can alone lead to generalization, but considerations of all of these aspects help to improve machine learning methods. Generalization is multifaceted and multiple perspectives are useful when designing data-driven predictive systems.

Before diving into these four different views, we first take a quick pause to consider how we hope generalization bounds might look.

## How should we expect the gap to scale?

Before we turn to analyzing generalization gaps, it’s worth first considering how we should expect them to scale. That is, what is the relationship between the expected size of \Delta_{\mathrm{gen}} and the number of observations, n?

First, note that we showed that for a *fixed* prediction
function f, the expectation of the
empirical risk is equal to the population risk. That is, the empirical
risk of a single function is a sample average of the population risk of
that function. As we discussed in Chapter 3, i.i.d. sample averages
should *generalize* and approximate the average at the population
level. Here, we now turn to describing *how* they might be
expected to scale under different assumptions.

## Quantitative central limit theorems

The *central limit theorem* formalizes how sample averages
estimate their expectations: if Z
is a random variable with bounded variance, then the sample average
\hat{\mu}_Z^{(n)} of n independent copies of Z is approximately
distributed as a Gaussian random variable with mean \mathop\mathbb{E}[Z]
and variance on the order of 1/n.

The following inequalities are useful *quantitative* forms of
the central limit theorem. They precisely measure how close the sample
average will be to the population average using limited information
about the random quantity.

**Markov’s inequality:** Let Z be a nonnegative random variable. Then, \mathop\mathbb{P}[Z \geq t] \leq \frac{\mathop\mathbb{E}[Z]}{t}\,.

This can be proven by taking expectations on both sides of the pointwise inequality I_{[Z\geq t]}(z) \leq \frac{z}{t}\,, valid for all z\geq 0.

**Chebyshev’s inequality:** Suppose Z is a random variable with mean \mu_Z and variance \sigma_Z^2. Then, \mathop\mathbb{P}[Z\geq t+\mu_Z] \leq \frac{\sigma_Z^2}{t^2}\,.

Chebyshev’s inequality helps us understand why sample averages are good estimates of the mean. Suppose that X_1,\dots, X_n are the independent samples we were considering above. Let \hat\mu denote the sample mean \frac1n\sum_{i=1}^n X_i. Chebyshev’s inequality implies \mathop\mathbb{P}[\hat\mu\geq t+\mu_X] \leq \frac{\sigma_X^2}{n t^2}\,, which tends to zero as n grows. A popular form of this inequality sets t=\mu_X, which gives \mathop\mathbb{P}[\hat{\mu}\geq 2\mu_X] \leq \frac{\sigma_X^2}{n \mu_X^2}\,.
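We can check Chebyshev's bound on sample means numerically. The sketch below uses illustrative parameters (uniform samples on [0,1], so \mu_X = 1/2 and \sigma_X^2 = 1/12) and estimates the tail probability by Monte Carlo:

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials, t = 100, 100_000, 0.1

# X_i uniform on [0, 1]: mean 1/2, variance 1/12
mu_hat = rng.uniform(0, 1, size=(trials, n)).mean(axis=1)

empirical = np.mean(mu_hat >= 0.5 + t)             # estimated tail probability
chebyshev = (1 / 12) / (n * t ** 2)                # sigma^2 / (n t^2)
print(f"empirical tail {empirical:.5f} <= Chebyshev bound {chebyshev:.4f}")
```

The empirical tail probability is far below the Chebyshev bound here, which is expected: Chebyshev uses only the variance, so it is loose for well-behaved distributions.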

**Hoeffding’s inequality:** Let Z_1, Z_2, \ldots, Z_n be independent random variables, each taking values in the interval [a_i,b_i]. Let \hat\mu denote the sample mean \frac1n\sum_{i=1}^n Z_i. Then \mathop\mathbb{P}[\hat\mu \geq \mu_Z + t] \leq \exp\left(- \frac{2n^2t^2}{\sum_{i=1}^n(b_i-a_i)^2}\right)\,.

An important special case is when the Z_i are identically distributed copies of Z and take values in [0,1]. Then we have \mathop\mathbb{P}[\hat\mu \geq \mu_Z + t] \leq \exp\left(- 2nt^2\right)\,. This shows that when random variables are bounded, sample averages concentrate around their mean value exponentially quickly. If we invoke this bound with t=C/\sqrt{n}, the point at which it gives non-trivial results, we have an error of O(1/\sqrt{n}) with exponentially high probability. We will see shortly that this relationship between error and number of samples is ubiquitous in generalization theory.
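A quick numerical check of Hoeffding's bound for [0,1]-valued variables (illustrative parameters; the empirical tail is estimated by Monte Carlo):

```python
import numpy as np

rng = np.random.default_rng(0)
trials, t = 10_000, 0.1

tails = {}
for n in [10, 100, 1000]:
    # sample means of n uniform [0, 1] variables, mu_Z = 0.5
    mu_hat = rng.uniform(0, 1, size=(trials, n)).mean(axis=1)
    empirical = np.mean(mu_hat >= 0.5 + t)
    bound = np.exp(-2 * n * t ** 2)                # Hoeffding tail bound
    tails[n] = (empirical, bound)
    print(f"n={n:4d}  empirical {empirical:.5f} <= Hoeffding {bound:.2e}")
```

As n grows, both the bound and the empirical tail probability shrink rapidly, consistent with the exponential concentration described above.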

These powerful *concentration inequalities* let us precisely
quantify how close the sample average will be to the population average.
For instance, we know a person’s height is a positive number and that
there are no people who are taller than nine feet. With these two facts,
Hoeffding’s inequality tells us that if we sample the heights of thirty
thousand individuals, our sample average will be within an inch of the
true average height with probability at least 83%. This assertion is
true no matter how large the population of individuals. The required
sample size is dictated only by the variability of height, not by the
number of total individuals.

You could replace “height” in this example with almost any attribute that you are able to measure well. The quantitative central limits tell us that for attributes with reasonable variability, a uniform sample from a general population will give a high quality estimate of the average value.

“Reasonable variability” of a random variable is necessary for quantitative central limit theorems to hold. When random variables have low variance or are tightly bounded, small experiments quickly reveal insights about the population. When variances are large or effectively unbounded, the number of samples required for high precision estimates might be impractical and our estimators and algorithms and predictions may need to be rethought.

## Bounding generalization gaps for individual predictors

Let us now return to generalization of prediction, considering the example where the quantity of interest is the prediction error on individuals in a population. There are effectively two scaling regimes of interest in generalization theory. In one case, when the empirical risk is large, we expect the generalization gap to decrease at a rate proportional to 1/\sqrt{n}\,. When the empirical risk is expected to be very small, on the other hand, we tend to see the generalization gap decrease at a rate proportional to 1/n\,.

Why we see these two regimes is illustrated by studying the case of a
*single* prediction function f\,, chosen *independently* of the
sample S\,. Our ultimate goal is
to reason about the generalization gap of predictors chosen by an
algorithm running on our data. The analysis we walk through next doesn’t
apply to data-dependent predictors directly, but it nonetheless provides
helpful intuition about what bounds we can hope to get.

For a fixed function f\,, the
zero-one loss on a single randomly chosen data point is a Bernoulli
random variable, equal to 1 with
probability p and 0 with probability 1-p\,. The empirical risk R_S[f] is the sample average of this random
variable and the risk R[f]=p is its
expectation. To estimate the generalization gap, we can apply
Hoeffding’s inequality to find

\mathop\mathbb{P}[ R[f]-R_S[f] \geq \epsilon] \leq \exp\left(
-2n\epsilon^2\right)\,.
Hence, we will have with probability 1-\delta on our sample that
|\Delta_{\mathrm{gen}}(f)| \leq \sqrt{\frac{\log(1/\delta)}{2n}}\,.
That is, the generalization gap goes to zero at a rate
of 1/\sqrt{n}\,.

In the regime where we observe no empirical mistakes, a more refined analysis can be applied. Suppose that R[f]>\epsilon\,. Then the probability that we observe R_S[f]=0 cannot exceed \begin{aligned} \mathop\mathbb{P}[ \forall i\colon\operatorname{sign}(f(x_i))=y_i] &= \prod_{i=1}^n \mathop\mathbb{P}[ \operatorname{sign}(f(x_i))=y_i] \\ & \leq (1-\epsilon)^n \leq e^{-\epsilon n}\,. \end{aligned} Hence, with probability 1-\delta, |\Delta_{\mathrm{gen}}(f)| \leq \frac{\log(1/\delta)}{n}\,, which is the 1/n regime. These two rates are precisely the ones we observe in more complex settings of generalization bounds in machine learning. The main difficulty in computing bounds on the generalization gap is that our prediction function f depends on the data, making the above analysis inapplicable.

In this chapter, we will focus mostly on 1/\sqrt{n} rates. These rates are more general as they make no assumptions about the expected empirical risk. With a few notable exceptions, the derivation of 1/\sqrt{n} rates tends to be easier than that of their 1/n counterparts. However, we note that every one of our approaches to generalization bounds has an analysis for the “low empirical risk” or “large margin” regime. We provide references at the end of this chapter to these more refined analyses.
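The 1/\sqrt{n} regime is easy to see numerically for a fixed predictor. In the sketch below (a hypothetical predictor with risk p = 0.3), the empirical risk on n samples is a Binomial(n, p) count divided by n, and the typical size of the gap R[f] - R_S[f] halves each time n quadruples:

```python
import numpy as np

rng = np.random.default_rng(0)
p, trials = 0.3, 100_000       # fixed predictor with risk R[f] = 0.3

typical_gap = {}
for n in [100, 400, 1600]:
    # number of mistakes on n fresh examples is Binomial(n, p)
    gap = p - rng.binomial(n, p, size=trials) / n   # R[f] - R_S[f]
    typical_gap[n] = np.abs(gap).mean()
    print(f"n={n:5d}  typical |gap| = {typical_gap[n]:.4f}")
```

Going from n = 100 to n = 1600 shrinks the typical gap by roughly a factor of four, the signature of a 1/\sqrt{n} rate.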

# Algorithmic stability

We will first see a tight characterization in terms of an algorithmic
robustness property we call *algorithmic stability*. Intuitively,
algorithmic stability measures how sensitive an algorithm is to changes
in a single training example. Whenever a model is insensitive to such
perturbations, the generalization gap will be small. Stability gives us
a powerful and intuitive way of reasoning about generalization.

There are a variety of different notions of perturbation. We could consider resampling a single data point and look at how much a model changes. We could also leave one data point out and see how much the model changes. This was the heart of our Perceptron generalization argument. More aggressively, we could study what happens when a single data point is arbitrarily corrupted. All three of these approaches yield similar generalization bounds, though it is often easier to work with one than the others. To simplify the exposition, we choose to focus on only one notion (resampling) here.

To introduce the idea of stability, we first condense our notation to make the presentation a bit less cumbersome. Recall that we operate on tuples of n labeled examples, S=((x_1,y_1),\dots,(x_n,y_n))\in(\mathcal{X}\times \mathcal{Y})^n\,. We denote a labeled example as z=(x,y). We will overload our notation and denote the loss accrued by a prediction function f on a data point z as \mathit{loss}(f,z). That is, \mathit{loss}(f,z) = \mathit{loss}(f(x),y). We use uppercase letters when a labeled example Z is randomly drawn from a population (X, Y).

With this notation in hand, let’s now consider two independent random
samples S = (Z_1, \dots , Z_n)
and S^{\prime} = (Z^{\prime}_1, \dots
,Z^{\prime}_{n}), each drawn independently and identically from a
population (X, Y). We call the
second sample S' a *ghost
sample* as it is solely an analytical device. We never actually
collect this second sample or run any algorithm on it.

We introduce n *hybrid
samples* S^{(i)}, for i\in\{1, \dots, n\} as
S^{(i)} = (Z_1,\dots, Z_{i-1}, Z_{i}^{\prime}, Z_{i+1},\dots,Z_{n})\,,
where the i-th example
comes from S', while all
others come from S.

We can now introduce a data-dependent notion of average stability of an algorithm. For this definition, we think of an algorithm as a deterministic map A that takes a training sample in (\mathcal{X}\times\mathcal{Y})^n to some prediction function in a function space \Omega\,. That is, A(S) denotes the function from \mathcal{X} to \mathcal{Y} that is returned by our algorithm when run on the sample S.

The *average stability* of an algorithm A\colon (\mathcal{X} \times \mathcal{Y})^{n}
\rightarrow \Omega is
\Delta(A) = \mathop\mathbb{E}_{S,S^{\prime}}
\left[
\frac{1}{n}\sum_{i=1}^{n}
\left(\mathit{loss}(A(S),Z_i^{\prime})
-\mathit{loss}(A(S^{(i)}),Z_{i}^{\prime})\right)
\right]\,.

There are two useful ways to parse this definition. The first is to interpret average stability as the average sensitivity of the algorithm to a change in a single example. Since we don’t know which of its n input samples the algorithm may be sensitive to, we test all of them and average out the results.

Second, from the perspective of A(S), the example Z_i^\prime is *unseen*, since it is
not part of S. But from the
perspective of A(S^{(i)}) the
example Z_i^\prime is seen, since
it is part of S^{(i)} via the
substitution that defines the i-th
hybrid sample. This shows that the instrument \Delta(A) also measures the average loss
difference of the algorithm on seen and unseen examples. We therefore
have reason to suspect that average stability relates to generalization
gap as the next proposition confirms.

The expected generalization gap equals average stability: \mathop\mathbb{E}[\Delta_{\mathrm{gen}}(A(S))] = \Delta(A)\,.

By linearity of expectation, \begin{aligned} \mathop\mathbb{E}[\Delta_{\mathrm{gen}}(A(S))] &= \mathop\mathbb{E}\left[R[A(S)] - R_S[A(S)]\right] \\ &= \mathop\mathbb{E}\left[\frac{1}{n}\sum_{i=1}^{n}\mathit{loss}(A(S),Z_i^{\prime})\right] - \mathop\mathbb{E}\left[\frac{1}{n}\sum_{i=1}^{n}\mathit{loss}(A(S),Z_i)\right]\,. \end{aligned} Here, we used that Z_i^\prime is an example drawn from the distribution that does not appear in the set S, while Z_i does appear in S. At the same time, Z_i and Z_i' are identically distributed and independent of the other examples. Therefore, \mathop\mathbb{E}\mathit{loss}(A(S),Z_i) = \mathop\mathbb{E}\mathit{loss}(A(S^{(i)}),Z_i^{\prime})\,. Applying this identity to each term in the empirical risk above, and comparing with the definition of \Delta(A), we conclude \mathop\mathbb{E}[R[A(S)] - R_S[A(S)]] = \Delta(A)\,.
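The proposition can be sanity-checked numerically for a simple algorithm. In the sketch below (an illustrative setup, not from the text), A outputs the sample mean of n standard normal labels and we use the squared loss. For this algorithm the hybrid-sample output has the closed form A(S^{(i)}) = A(S) + (y_i' - y_i)/n, and a short calculation shows the expected generalization gap is 2/n; the Monte Carlo estimates of both quantities should come out near that value:

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials = 10, 50_000

gap_est, stab_est = [], []
for _ in range(trials):
    y = rng.normal(0, 1, n)     # sample S: standard normal labels
    yp = rng.normal(0, 1, n)    # ghost sample S'
    f = y.mean()                # algorithm A: output the sample mean
    # squared loss; the risk of the constant predictor f is f^2 + 1
    gap_est.append((f ** 2 + 1) - np.mean((f - y) ** 2))
    # hybrid-sample outputs A(S^(i)) = f + (y'_i - y_i)/n, so the average
    # stability term can be evaluated for all i at once
    f_i = f + (yp - y) / n
    stab_est.append(np.mean((f - yp) ** 2 - (f_i - yp) ** 2))

print(np.mean(gap_est), np.mean(stab_est))  # both approach 2/n = 0.2
```

Both averages agree with each other and with the analytic value, illustrating the identity between expected generalization gap and average stability.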

## Uniform stability

While average stability gave us an exact characterization of the expected generalization gap, it can be hard to work with the expectation over S and S'. Uniform stability replaces the averages by suprema, leading to a stronger notion that is often easier to work with.

The *uniform stability* of an algorithm A is defined as
\Delta_{\mathrm{sup}}(A) = \sup_{\underset{d_H(S, S')=1}{S, S'
\in (\mathcal{X}\times\mathcal{Y})^n}}
\sup_{z\in \mathcal{X}\times\mathcal{Y}} |\mathit{loss}(A(S), z) -
\mathit{loss}(A(S'), z)|,
where d_H(S,S') is the
Hamming distance between tuples S
and S'\,.

In this definition, it is important to note that the example z has nothing to do with S and S'\,. Uniform stability is effectively computing the worst-case difference in the losses of the learning algorithm run on two arbitrary datasets that differ in exactly one point.

Uniform stability upper bounds average stability, and hence uniform stability upper bounds the generalization gap (in expectation). Thus, we have the corollary \mathop\mathbb{E}[\Delta_{\mathrm{gen}}(A(S))] \leq \Delta_{\mathrm{sup}} (A)\,.

This corollary turns out to be surprisingly useful since many algorithms are uniformly stable. For example, strong convexity of the loss function is sufficient for the uniform stability of empirical risk minimization, as we will see next.

## Stability of empirical risk minimization

We now show that empirical risk minimization is uniformly stable
under relatively strong assumptions on the loss function. One important
assumption we need is that the loss function \mathit{loss}(w, z) is differentiable and
*strongly convex* in the model parameters w for every example z. What this means is that for every
example z and for all w,w'\in\Omega,
\mathit{loss}(w',z)\ge
\mathit{loss}(w,z)
+ \langle\nabla \mathit{loss}(w,z), w' - w\rangle
+ \frac{\mu}2\|w-w'\|^2\,.
There’s only one property of strong convexity we’ll need.
Namely, if \Phi\colon\mathbb{R}^d\to\mathbb{R}
is \mu-strongly convex
and w^* is a stationary point (and
hence global minimum) of the function \Phi, then we have
\Phi(w) - \Phi(w^*) \geq \frac{\mu}{2} \| w - w^* \|^2\,.
The second assumption we need is that \mathit{loss}(w, z) is *L-Lipschitz* in w for every z, i.e., \|\nabla \mathit{loss}(w, z)\|\le L\,.
Equivalently, this means |\mathit{loss}(w, z)-\mathit{loss}(w',z)|\le
L\|w-w'\|.

Assume that, for every z, \mathit{loss}(w, z) is \mu-strongly convex in w over the domain \Omega. Further assume that the loss function \mathit{loss}(w, z) is L-Lipschitz in w for every z. Then, empirical risk minimization (ERM) satisfies \Delta_{\mathrm{sup}} (\text{ERM}) \leq \frac{4L^2}{\mu n}\,.

Let \hat w_S = \arg \min_{w\in \Omega} \frac{1}{n} \sum_{i=1}^n\mathit{loss}(w, z_i) denote the empirical risk minimizer on the sample S. Fix arbitrary samples S,S' of size n that differ in a single index i\in\{1,\dots,n\} where S contains z_i and S' contains z_i'\,. Fix an arbitrary example z\,. We need to show that |\mathit{loss}(\hat w_{S}, z) - \mathit{loss}(\hat w_{S'}, z)| \leq \frac{4 L^2}{\mu n}\,. Since the loss function is L-Lipschitz, it suffices to show that \| \hat w_{S} - \hat w_{S'} \| \leq \frac{4L}{\mu n}\,.

On the one hand, since \hat w_S minimizes the empirical risk by definition, it follows from the strong convexity of the empirical risk that \frac{\mu}{2} \| \hat w_{S} - \hat w_{S'} \|^2 \le R_S[\hat w_{S'}] - R_S[\hat w_{S}]\,.

On the other hand, we can bound the right hand side as \begin{aligned} & R_S[\hat w_{S'}] - R_S[\hat w_{S}]\\ &= \frac{1}{n} ( \mathit{loss}(\hat w_{S'}, z_i) - \mathit{loss}(\hat w_{S}, z_i)) + \frac{1}{n} \sum_{i \neq j} ( \mathit{loss}(\hat w_{S'}, z_j) - \mathit{loss}(\hat w_{S}, z_j))\\ &=\frac{1}{n} (\mathit{loss}(\hat w_{S'}, z_i) - \mathit{loss}(\hat w_{S}, z_i)) + \frac{1}{n} (\mathit{loss}(\hat w_{S}, z_i') - \mathit{loss}(\hat w_{S'}, z_i')) \\ & \qquad + \left(R_{S'}[\hat w_{S'}] - R_{S'}[\hat w_{S}]\right)\\ &\leq \frac{1}{n} | \mathit{loss}(\hat w_{S'}, z_i) - \mathit{loss}(\hat w_{S}, z_i)| + \frac{1}{n} | \mathit{loss}(\hat w_{S}, z_i') - \mathit{loss}(\hat w_{S'}, z_i') |\\ &\leq \frac{2 L}{n} \| \hat w_{S'} - \hat w_{S}\|\,. \end{aligned}

Here, we used the assumption that \mathit{loss} is L-Lipschitz and the fact that R_{S'}[\hat w_{S'}] - R_{S'}[\hat w_{S}] \leq 0\,.

Putting together the strong convexity property and our calculation above, we find \| \hat w_{S'} - \hat w_{S} \| \leq \frac{4 L}{\mu n}\,.

Hence, \Delta_{\mathrm{sup}}(\text{ERM}) \leq \frac{4 L ^ 2}{\mu n}\,.

An interesting point about this result is that it makes no explicit reference to the complexity of the model class \Omega.
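For a loss that is both strongly convex and Lipschitz on a bounded domain, the theorem can be checked directly. The sketch below uses an illustrative one-dimensional setup: \mathit{loss}(w, z) = (w - z)^2 with w, z \in [0, 1] is 2-strongly convex and 2-Lipschitz there, the ERM is simply the sample mean, and swapping one example moves it by at most 4L/(\mu n):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
# loss(w, z) = (w - z)^2 with w, z in [0, 1] is 2-strongly convex in w
# and L-Lipschitz there with L = sup |2 (w - z)| = 2
mu, L = 2.0, 2.0

z = rng.uniform(0, 1, n)
z_prime = z.copy()
z_prime[0] = rng.uniform(0, 1)        # S' differs from S in one example

w_S, w_Sp = z.mean(), z_prime.mean()  # ERM for squared loss is the mean
bound = 4 * L / (mu * n)
print(abs(w_S - w_Sp), "<=", bound)
```

Here the parameter change is in fact at most 1/n, comfortably inside the 4L/(\mu n) guarantee, which reflects that the theorem is a worst-case bound.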

## Stability of regularized empirical risk minimization

Some empirical risk minimization problems, such as the Perceptron
(ERM with hinge loss) we saw earlier, are convex but not strictly
convex. We can turn convex problems into strongly convex problems by
adding an *\ell_2-regularization* term to the loss
function:
r(w, z) = \mathit{loss}(w, z) + \frac{\mu}{2} \| w \|^2\,.
The last term is named \ell_2-regularization, *weight decay*,
or *Tikhonov regularization* depending on field and context.

By construction, if the loss is convex, then the regularized loss r(w, z) is \mu-strongly convex. Hence, our previous theorem applies. However, by adding regularization we changed the objective function. The optimizer of the regularized objective is in general not the same as the optimizer of the unregularized objective.

Fortunately, a simple argument shows that solving the regularized objective also solves the unregularized objective. The idea is that assuming \|w\|\le B we can set the regularization parameter \mu = \frac{L}{B\sqrt{n}}\,. This ensures that the regularization term \frac{\mu}{2}\|w\|^2 is at most O(\frac{LB}{\sqrt{n}}) and therefore the minimizer of the regularized risk also minimizes the unregularized risk up to error O(\frac{LB}{\sqrt{n}})\,. Plugging this choice of \mu into the ERM stability theorem, the generalization gap will also be O(\frac{LB}{\sqrt{n}})\,.

## The case of regularized hinge loss

Let’s relate the generalization theory we just saw to the familiar case of the perceptron algorithm from Chapter 3. This corresponds to the special case of minimizing the regularized hinge loss r(w, (x,y)) = \max\{1-y\langle w,x\rangle,0\} + \frac{\mu}{2} \| w \|^2\,. Moreover, we assume that the data are linearly separable with margin \gamma.

Denoting by \hat w_S the empirical risk minimizer on a random sample S of size n, we know that \frac\mu2\|\hat w_S\|^2 \le R_S(\hat w_S) \le R_S(0) = 1\,. Hence, \|\hat w_S\|\le B for B=\sqrt{2/\mu}. We can therefore restrict our domain to the Euclidean ball of radius B. If the data are also bounded, say \Vert x \Vert \leq D, we further get that \Vert \nabla_w r(w,z) \Vert \le \|x\| + \mu\|w\| = D + \mu B\,. Hence, the regularized hinge loss is L-Lipschitz with L=D+\mu B=D+\sqrt{2\mu}\,.

Let w_\gamma be a maximum margin hyperplane for the sample S. We know that the empirical loss will satisfy R_S[\hat w_S] \leq R_S[w_\gamma] = \frac{\mu}{2}\| w_\gamma \|^2 = \frac{\mu }{2\gamma^2}\,. Hence, by Theorem 1, \mathop\mathbb{E}[R[\hat w_S]] \le \mathop\mathbb{E}[R_S[\hat w_S]] + \Delta_{\mathrm{sup}}(\mathrm{ERM}) \le \frac{\mu }{2\gamma^2}+ \frac{4(D+\sqrt{2\mu})^2}{\mu n} Setting \mu=\frac{2\gamma D}{\sqrt{n}} and noting that \gamma\le D gives that \mathop\mathbb{E}[R[\hat w_S]] \le O\left(\frac{D}{\gamma \sqrt{n}}\right)\,. Finally, since the regularized hinge loss upper bounds the zero-one loss, we can conclude that \mathop\mathbb{P}[Y \hat w_S^T X < 0] \leq O\left(\frac{D}{\gamma \sqrt{n}}\right)\,, where the probability is taken over both sample S and test point (X, Y). Applying Markov’s inequality to the sample, we can conclude the same bound holds for a typical sample up to constant factors.

This bound is proportional to the square root of the bound we saw for the perceptron in Chapter 3. As we discussed earlier, this rate is slower than the perceptron rate as it does not explicitly take into account the fact that the empirical risk is zero. However, it is worth noting that the relationship between the variables in question—diameter, margin, and number of samples—is precisely the same as for the perceptron. This kind of bound is common and we will derive it a few more times in this chapter.

Stability analysis combined with explicit regularization and convexity thus give an appealing conceptual and mathematical approach to reasoning about generalization. However, empirical risk minimization involving non-linear models is increasingly successful in practice and generally leads to non-convex optimization problems.

# Model complexity and uniform convergence

We briefly review other useful tools to reason about generalization. Arguably, the most basic is based on counting the number of different functions that can be described with the given model parameters.

Given a sample S of n independent draws from the same underlying distribution, the empirical risk R_S[f] for a fixed function f is an average of n random variables, each with mean equal to the risk R[f]\,. Assuming for simplicity that the range of our loss function is bounded in the interval [0,1]\,, Hoeffding’s bound gives us the tail bound \mathop\mathbb{P}\left[ R_S[f] > R[f] + t\right] \le \exp(-2nt^2)\,.

By applying the union bound to a finite set of functions \mathcal{F} we can guarantee that with probability 1-\delta, we have for all functions f\in\mathcal{F} that \Delta_{\mathrm{gen}}(f)\le\sqrt{\frac{\ln|\mathcal{F}|+\ln(1/\delta)}{n}}\,.\quad\quad(1)

The cardinality |\mathcal{F}| is a basic measure of the complexity of the model family \mathcal{F}\,; the term \ln|\mathcal{F}| plays the role of a complexity measure in the bound. The gestalt of the generalization bound as “\sqrt{\mathrm{complexity}/n}” routinely appears with varying measures of complexity.
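As a sanity check of this union-bound argument, the sketch below (synthetic data and a made-up finite class of 50 threshold classifiers, purely illustrative) estimates the largest generalization gap over the class and compares it to the \sqrt{(\ln|\mathcal{F}|+\ln(1/\delta))/n} bound in Equation 1.

```python
import numpy as np

rng = np.random.default_rng(1)
n, n_test, delta = 500, 100_000, 0.05
thresholds = np.linspace(0.0, 1.0, 50)      # finite class F of threshold classifiers

def labels(x):
    """Noisy ground truth: sign(x - 1/2), flipped with probability 0.1."""
    clean = np.sign(x - 0.5)
    return np.where(rng.random(x.shape) < 0.1, -clean, clean)

def risks(x, y):
    """Zero-one loss of every classifier sign(x - t), t in thresholds."""
    preds = np.sign(x[None, :] - thresholds[:, None])
    return (preds != y[None, :]).mean(axis=1)

x_train, x_test = rng.random(n), rng.random(n_test)
y_train, y_test = labels(x_train), labels(x_test)

# Largest generalization gap over the class vs. the union bound.
gap = np.max(risks(x_test, y_test) - risks(x_train, y_train))
bound = np.sqrt((np.log(len(thresholds)) + np.log(1 / delta)) / n)
print(f"max gap {gap:.3f}, bound {bound:.3f}")
```

The large test set stands in for the population risk; the maximum gap over the class stays comfortably below the bound.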

## VC dimension

Bounding the generalization gap from above for all functions in a
function class is called *uniform convergence*. A classical tool
to reason about uniform convergence is the Vapnik-Chervonenkis dimension
(VC dimension) of a function class \mathcal{F}\subseteq X\rightarrow Y,
denoted \mathrm{VC}(\mathcal{F})\,. It’s defined as
the size of the largest set Q\subseteq
X such that for any Boolean function h\colon Q \to \{-1,1\}, there is a
predictor f\in\mathcal{F} such
that f(x)= h(x) for all x\in Q\,. In other words, if there is a
size-d sample Q such that the functions of \mathcal{F} induce all 2^d possible binary labelings of Q, then the VC-dimension of \mathcal{F} is at least d\,.

The VC-dimension measures the ability of the model class to conform to an arbitrary labeling of a set of points. The so-called VC inequality implies that with probability 1-\delta, we have for all functions f\in\mathcal{F} \Delta_{\mathrm{gen}}(f)\le \sqrt{\frac{\mathrm{VC}(\mathcal{F})\ln n+\ln(1/\delta)}{n}}\,.\quad\quad(2)

We can see that the complexity term \mathrm{VC}(\mathcal{F}) refines our earlier cardinality bound since \mathrm{VC}(\mathcal{F})\le\log|\mathcal{F}|+1. However VC-dimension also applies to infinite model classes. Linear models over \mathbb{R}^d have VC-dimension d, corresponding to the number of model parameters. Generally speaking, VC dimension tends to grow with the number of model parameters for many model families of interest. In such cases, the bound in Equation 2 becomes useless once the number of model parameters exceeds the size of the sample.
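The shattering definition is easy to test mechanically. The sketch below (an illustration, not from the text) checks every labeling of a point set against homogeneous linear classifiers \operatorname{sign}(\langle w,x\rangle) in \mathbb{R}^2, using the fact that a labeling is realizable exactly when the vectors y_i x_i fit in an open halfplane. It confirms that a generic pair of points is shattered while no triple is, matching VC dimension d=2.

```python
import itertools
import numpy as np

def realizable(points, y):
    """Is labeling y realized by some homogeneous linear classifier sign(<w, x>)
    in R^2? Equivalent to all vectors y_i * x_i lying in an open halfplane,
    i.e. some circular gap between their angles exceeding pi."""
    v = np.asarray(y)[:, None] * points
    ang = np.sort(np.arctan2(v[:, 1], v[:, 0]))
    gaps = np.diff(np.append(ang, ang[0] + 2 * np.pi))
    return bool(gaps.max() > np.pi)

def shatters(points):
    """Check every one of the 2^|points| labelings."""
    return all(realizable(points, y)
               for y in itertools.product([-1, 1], repeat=len(points)))

rng = np.random.default_rng(0)
two, three = rng.normal(size=(2, 2)), rng.normal(size=(3, 2))
print(shatters(two), shatters(three))   # generic pairs shatter; triples never do
```

The halfplane criterion is exact in two dimensions, which avoids having to search over classifiers.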

However, the picture changes significantly if we consider notions of
model complexity different than raw counts of parameters. Consider two
sets of vectors X_0 and X_1 all having Euclidean norm bounded
by D. Let \mathcal{F} be the set of all linear
functions f such that f(x)=w^T x with \|w\|\leq \gamma^{-1}, f(x)\leq -1 if x\in X_0, and f(x)\geq 1 if x \in X_1. Vapnik showedVapnik,
*Statistical Learning Theory* (Wiley, 1998).
that the VC dimension of this set of hyperplanes is \frac{D^2}{\gamma^2}\,. As described in a
survey of support vector machines by Burges, the worst case arrangement
of n data points is a simplex
in n-2 dimensions.Burges,
“A Tutorial on Support Vector Machines for Pattern
Recognition,” *Data Mining and Knowledge Discovery* 2, no.
2 (1998): 121–67. Plugging this VC-dimension into
our generalization bound yields
\Delta_{\mathrm{gen}}(f)\le \sqrt{\frac{D^2\ln
n+\gamma^2\ln(1/\delta)}{\gamma^2 n}}\,.
We again see our Perceptron style generalization bound! This
bound again holds when the empirical risk is nonzero. And the dimension
of the data, d, does not appear at
all in this bound. The difference between the parametric model and the
margin-like bound is that we considered properties of the data. In the
*worst case* bound which counts parameters, it appears that
high-dimensional prediction is impossible. It is only by considering
data-specific properties that we can find a reasonable generalization
bound.

## Rademacher complexity

An alternative to VC-dimension is Rademacher complexity, a flexible tool that is often more amenable to calculations incorporating problem-specific aspects such as restrictions on the distribution family or properties of the loss function. To get a generalization bound in terms of Rademacher complexity, we typically apply the definition not to the model class \mathcal{F} itself but to the class of functions \mathcal{L} of the form h(z) = \mathit{loss}(f, z) for some f\in\mathcal{F} and a loss function \mathit{loss}\,. By varying the loss function, we can derive different generalization bounds.

Fix a function class \mathcal{L}\subseteq Z\to\mathbb{R}; it will later correspond to the composition of a predictor with a loss function, which is why we chose the symbol \mathcal{L}\,. Think of the domain Z as the space of labeled examples z=(x,y)\,. Fix a distribution P over the space Z\,.

The *empirical Rademacher complexity* of a function
class \mathcal{L}\subseteq
Z\to\mathbb{R} with respect to a sample \{z_1,\dots,z_n\}\subseteq Z drawn i.i.d.
from the distribution P is defined
as:
\widehat{\mathfrak{R}}_n(\mathcal{L})
=\mathop\mathbb{E}_{\sigma\in\{-1,1\}^n}
\left[
\frac 1n \sup_{h\in\mathcal{L}}
\left|
\sum_{i=1}^n \sigma_i h(z_i)
\right|
\right]\,.

We obtain the *Rademacher complexity* \mathfrak{R}_n(\mathcal{L})=\mathop\mathbb{E}\left[\widehat{\mathfrak{R}}_n(\mathcal{L})\right]
by taking the expectation of the empirical Rademacher complexity with
respect to the sample. Rademacher complexity measures the ability of a
function class to interpolate a random sign pattern assigned to a point
set.

One useful bound applies when the loss function is L-Lipschitz in the parameterization of the model class for every example z\,. In this case, with probability 1-\delta, for all functions f\in\mathcal{F} we have \Delta_{\mathrm{gen}}(f)\le 2L{\mathfrak{R}}_n(\mathcal{F}) + 3\sqrt{\frac{\log(1/\delta)}{n}}\,. When applied to the hinge loss with the function class being hyperplanes of norm at most \gamma^{-1}, this bound again recovers the perceptron generalization bound \Delta_{\mathrm{gen}}(f)\le 2 \frac{D}{\gamma \sqrt{n}} + 3\sqrt{\frac{\log(1/\delta)}{n}}\,.
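For the norm-bounded linear class, the supremum inside the definition has a closed form, which makes a Monte Carlo estimate of the empirical Rademacher complexity easy. The sketch below (illustrative sizes and norm bound) checks the estimate against the classical B\max_i\|x_i\|/\sqrt{n} upper bound.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, B = 100, 10, 2.0
X = rng.normal(size=(n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)        # normalize so ||x_i|| = 1

# For the linear class {x -> <w, x> : ||w|| <= B} the supremum is explicit:
#   sup_{||w|| <= B} |(1/n) sum_i sigma_i <w, x_i>| = (B/n) || sum_i sigma_i x_i ||
sigmas = rng.choice([-1.0, 1.0], size=(10_000, n))   # Monte Carlo over sign patterns
rad = np.mean(B / n * np.linalg.norm(sigmas @ X, axis=1))

print(rad, "<=", B / np.sqrt(n))                     # classical bound B max_i ||x_i|| / sqrt(n)
```

The estimate sits just below B/\sqrt{n}, reflecting that the bound \mathop{\mathbb{E}}\|\sum_i\sigma_i x_i\|\le\sqrt{\sum_i\|x_i\|^2} is nearly tight for well-spread points.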

## Margin bounds for ensemble methods

Ensemble methods work by combining many weak predictors into one strong predictor. The combination step usually involves taking a weighted average or majority vote of the weak predictors. Boosting and random forests are two ensemble methods that continue to be highly popular and competitive in various settings. Both methods train a sequence of small decision trees, each on its own achieving modest accuracy on the training task. However, so long as different trees make errors that aren’t too correlated, we can obtain a higher accuracy model by taking, say, a majority vote over the individual predictions of the trees.

Researchers in the 1990s already observed that boosting often continues to improve test accuracy as more weak predictors are added to the ensemble. The complexity of the entire ensemble was thus often far too large to apply standard uniform convergence bounds.

A proffered explanation was that boosting, while growing the
complexity of the ensemble, also improved the *margin* of the
ensemble predictor. Assuming that the final predictor f\colon X\to \{-1,1\} is binary, its
*margin* on an example (x,y) is defined as the value yf(x)\,. The larger the margin the more
“confident” the predictor is about its prediction. A margin yf(x) just above 0 shows that the weak predictors in the
ensemble were nearly split evenly in their weighted votes.

An elegant generalization bound relates the risk of any
predictor f to the fraction of
correctly labeled training examples at a given margin \theta. Below, let R[f] be the risk of f w.r.t. the zero-one loss, and let R^\theta_S[f] be the empirical
risk with respect to *margin errors* at level \theta, i.e., the loss \mathbf{1}(yf(x)\le\theta) that penalizes
errors where the predictor is within an additive margin \theta of making a mistake.

With probability 1-\delta, every convex combination f of base predictors in \mathcal{H} satisfies the following bound for every \theta>0:

R[f] - R^\theta_S[f] \le O\left( \frac1{\sqrt{n}}\left(\frac{ \mathrm{VC}(\mathcal{H})\log n}{\theta^2} +\log(1/\delta) \right)^{1/2} \right)

The theorem can be proved using Rademacher complexity. Crucially, the bound depends only on the VC dimension of the base class \mathcal{H} and not on the complexity of the ensemble. Moreover, the bound holds for all \theta>0, so we can choose \theta after observing the margin that manifested during training.

## Margin bounds for linear models

Margins also play a fundamental role for linear prediction. We saw one margin bound for linear models in our chapter on the Perceptron algorithm. Similar bounds hold for other variants of linear prediction. We’ll state the result here for a simple least squares problem: w^* = \arg\min_{w\colon\|w\|\le B} \frac1n\sum_{i=1}^n\left(\langle x_i, w\rangle - y_i\right)^2

In other words, we minimize the empirical risk w.r.t. the squared loss over norm bounded linear separators, call this class \mathcal{W}_B\,. Further assume that all data points satisfy \|x_i\|\le 1 and y\in\{-1,1\}. Analogous to the margin bound in Theorem 2, it can be shown that with probability 1-\delta for every linear predictor f specified by weights in \mathcal{W}_B we have R[f] - R_S^\theta[f] \le 4\frac{{\mathfrak{R}}(\mathcal{W}_B)}{\theta} + O\left(\frac{\log(1/\delta)}{\sqrt{n}}\right)\,.

Moreover, given the assumptions on the data and model class we made, the Rademacher complexity satisfies {\mathfrak{R}}(\mathcal{W}_B)\le B/\sqrt{n}. What we can learn from this bound is that the relevant quantity for generalization is the ratio of complexity to margin, B/\theta\,.

It’s important to understand that margin is a scale-sensitive notion; it only makes sense to talk about it after suitable normalization of the parameter vector. If the norm didn’t appear in the bound we could scale up the parameter vector to achieve any margin we want. For linear predictors the Euclidean norm provides a natural and often suitable normalization.

# Generalization from algorithms

In the overparameterized regime, there are always infinitely many models that minimize empirical risk. However, when we run a particular algorithm, the algorithm usually returns only one model from this continuum. In this section, we show how directly analyzing the algorithm’s iterations can itself yield generalization bounds.

## One pass optimization of stochastic gradient descent

As we briefly discussed in the optimization chapter, we can interpret
the convergence analysis of stochastic gradient descent as directly
providing a generalization bound for a particular variant of SGD. Here
we give the argument in full detail. Suppose that we choose a loss
function that upper bounds the number of mistakes. That is \mathit{loss}(\hat{y},y) \geq
\mathbb{1}\{y\hat{y}<0\}\,. The hinge loss would be such an
example. Choose the function R to
be the risk (not empirical risk!) with respect to this loss function:
R[w] = \mathop\mathbb{E}[ \mathit{loss}(w^T x, y) ]
At each iteration, suppose we gain access to an example
pair (x_i, y_i) sampled i.i.d.
from the data-generating distribution. Then when we run the stochastic
gradient method, the iterates are
w_{t+1} = w_t - \alpha_t e(w_t^Tx_t,y_t) x_t\,,\quad
\text{where}\quad
e(z,y) = \frac{\partial \mathit{loss}(z,y)}{\partial z}\,.
Suppose that for all x, \Vert x
\Vert \leq D\,. Also suppose that |e(z,y)|\leq C\,. Then the SGD convergence
theorem tells us that after n
steps, starting at w_0=0 and using
an appropriately chosen constant step size, the average of our
iterates \bar{w}_n will satisfy
\mathop\mathbb{P}[\operatorname{sign}(\bar{w}_n^T x ) \neq y ] \leq
\mathbb{E}[R[\bar{w}_n]] \leq R[w_\star] + \frac{C D \Vert w_\star
\Vert }{\sqrt{n}}\,.
This inequality tells us that we will find a decision
boundary that has low *population* risk after seeing n samples. And the population risk itself
lets us upper bound the probability of our model making an error on new
data. That is, this inequality is a generalization bound.

Importantly, this argument never involves the empirical risk: by running stochastic gradient descent, we can find a low-risk model without ever computing the empirical risk.

Let us further assume that the population can be separated with large margin. As we showed when we discussed the Perceptron, the margin is equal to the inverse of the norm of the corresponding hyperplane. Suppose we ran the stochastic gradient method using a hinge loss. In this case, C=1, so, letting \gamma denote the maximum margin, we get the simplified bound \mathop\mathbb{P}[\operatorname{sign}(\bar{w}_n^T x ) \neq y ] \leq \frac{D}{\gamma\sqrt{n}}\,. Note that the Perceptron analysis did not have a step size parameter that depended on the problem instance. But, on the other hand, this analysis of SGD holds regardless of whether the data is separable or whether zero empirical risk is achieved after one pass over the data. The stochastic gradient analysis is more general but generality comes at the cost of a looser bound on the probability of error on new examples.
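The one-pass argument is easy to simulate. The sketch below (synthetic separable data with margin \gamma=0.1 and a heuristic constant step size, both illustrative choices rather than the theoretically tuned values) runs a single pass of hinge-loss SGD and estimates the population error on fresh samples.

```python
import numpy as np

rng = np.random.default_rng(0)
d, gamma = 20, 0.1
w_star = rng.normal(size=d)
w_star /= np.linalg.norm(w_star)

def sample(size):
    """Stream of i.i.d. unit-norm examples, linearly separable with margin gamma."""
    x = rng.normal(size=(size, d))
    x /= np.linalg.norm(x, axis=1, keepdims=True)
    keep = np.abs(x @ w_star) > gamma
    x = x[keep]
    return x, np.sign(x @ w_star)

# One pass of SGD on the hinge loss: every example is visited exactly once.
X, y = sample(5000)
w = np.zeros(d)
avg = np.zeros(d)
for xi, yi in zip(X, y):
    if yi * (w @ xi) < 1:            # subgradient step of max(1 - y<w,x>, 0)
        w = w + 0.1 * yi * xi        # heuristic constant step size
    avg += w
avg /= len(X)

X_test, y_test = sample(20_000)
err = np.mean(np.sign(X_test @ avg) != y_test)
print(f"estimated population error: {err:.4f}")
```

The averaged iterate achieves a small error on held-out samples even though the code never evaluates the empirical risk.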

## Uniform stability of stochastic gradient descent

Above we showed that empirical risk minimization is stable no matter
what optimization method we use to solve the objective. One weakness is
that the analysis applies only to the exact solution of the optimization
problem and only to strongly convex loss functions. In practice,
we might only be able to compute an approximate empirical risk minimizer
and may be interested in losses that are not strongly convex.
Fortunately, we can also show that some optimization methods are stable
even if they don’t end up computing a minimizer of a strongly convex
empirical risk. Specifically, this is true for the stochastic gradient
method under suitable assumptions. Below we state one such result which
requires the assumption that the loss function is *smooth*. A
continuously differentiable function f\colon\mathbb{R}^d\to\mathbb{R}
is \beta-smooth if \|\nabla f(y)-\nabla f(x)\|\le
\beta\|y-x\|.

Assume a continuously differentiable loss function that is \beta-smooth and L-Lipschitz on every example and convex. Suppose that we run the stochastic gradient method (SGM) with step sizes \eta_t\le 2/\beta for T steps. Then, we have \Delta_{\mathrm{sup}} (\text{SGM}) \leq \frac{2L^2}{n}\sum_{t=1}^T\eta_t\,.

The theorem allows for SGD to sample the same data points multiple times, as is common practice in machine learning. The stability approach also extends to the non-convex case albeit with a much weaker quantitative bound.

## What solutions does stochastic gradient descent favor?

We reviewed empirical evidence that explicit regularization is not
necessary for generalization. Researchers therefore believe that a
combination of data generating distribution and optimization algorithm
perform *implicit regularization*. Implicit regularization
describes the tendency of an algorithm to seek out solutions that
generalize well on a given dataset on their own, without the need for
explicit correction. Since the empirical phenomena we reviewed are all
based on gradient methods, it makes sense to study implicit
regularization of gradient descent. While a general theory for
non-convex problems remains elusive, the situation for linear models is
instructive.

Consider again the linear case of gradient descent or stochastic gradient descent: w_{t+1} = w_t - \alpha e_t x_t where e_t is the gradient of the loss at the current prediction. As we showed in the optimization chapter, if we run this algorithm to convergence, the resulting \hat{w} must lie in the span of the data and must interpolate the data. These two facts imply that \hat{w} is the minimum Euclidean norm solution of Xw=y\,. That is, \hat{w} solves the optimization problem \begin{array}{ll} \text{minimize} & \|w\|^2\\ \text{subject to} & y_i w^Tx_i = 1\,. \end{array} Moreover, a closed form solution of this problem is given by \hat{w}=X^T(XX^T)^{-1}y\,. That is, when we run stochastic gradient descent we converge to a very specific solution. Now what can we say about the generalization properties of this minimum norm interpolating solution?
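A quick numerical sketch (illustrative dimensions, squared loss) confirms the convergence claim: gradient descent started at zero converges to the closed-form minimum norm interpolator X^T(XX^T)^{-1}y.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 20, 50                            # overparameterized: d > n
X = rng.normal(size=(n, d))
y = rng.choice([-1.0, 1.0], size=n)

# Closed-form minimum norm interpolator.
w_min = X.T @ np.linalg.solve(X @ X.T, y)

# Gradient descent on the squared loss from w_0 = 0 never leaves the row
# span of X, so at convergence it reaches the same minimum norm solution.
w = np.zeros(d)
for _ in range(20_000):
    w -= 0.05 * X.T @ (X @ w - y) / n

print(np.linalg.norm(w - w_min))         # essentially zero
```

Among the infinitely many interpolating solutions in this underdetermined system, the algorithm picks out exactly one.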

The key to analyzing the generalization of the minimum norm solution will be a stability-like argument. We aim to control the error of the model trained on the first m data points on the next data point in the sequence, x_{m+1}\,. To do so, we use a simple identity that follows from linear algebra.

Let S be an arbitrary set of m\geq 2 data points. Let w_{m-1} and w_{m} denote the minimum norm solution trained on the first m-1 and m points respectively. Then (1-y_{m} \langle w_{m-1}, x_m \rangle)^2 = s_{m}^2 (\Vert w_m \Vert^2-\Vert w_{m-1} \Vert^2)\,, where s_{m} := \operatorname{dist} \left(\operatorname{span}( x_1,\ldots, x_{m-1}),x_{m}\right)\,.

We hold off on proving this lemma and first use it to establish our generalization result.

Let S_{n+1} denote a set of n+1 i.i.d. samples. Let S_j denote the first j samples and w_j denote the solution of minimum norm that interpolates these j points. Let R_j denote the maximum norm of \Vert x_i \Vert for 1\leq i\leq j\,. Let (x,y) denote another independent sample from \mathcal{D}\,. Then if \epsilon_j:=\mathop\mathbb{E}[(1 - y f_{S_j}(x))^2] is a non-increasing sequence, we have \mathop\mathbb{P}[ y \langle w_{n}, x \rangle < 0] \leq \frac{\mathop\mathbb{E}[R_{n+1}^2 \Vert w_{n+1} \Vert^2]}{n}\,.

The lemma together with the bound s_i^2 \leq R_{n+1}^2 yields the inequality \mathop\mathbb{E}[(1-y \langle w_i, x \rangle)^2] \leq \mathop\mathbb{E}[ R_{n+1}^2 \Vert w_{i+1} \Vert^2]-\mathop\mathbb{E}[ R_{n+1}^2\Vert w_i\Vert^2]\,. Here, we could drop the subscript on x and y on the left-hand side as they are identically distributed to (x_{i+1},y_{i+1})\,. Adding these inequalities together telescopes into the bound \frac{1}{n} \sum_{i=1}^n \mathop\mathbb{E}[(1 - y f_{S_i}(x))^2] \leq \frac{\mathop\mathbb{E}[R_{n+1}^2 \Vert w_{n+1} \Vert^2]}{n}\,. Since the sequence \epsilon_j is non-increasing, the smallest summand on the left-hand side is the last one, \mathop\mathbb{E}[(1 - y f_{S_n}(x))^2]\,, which is therefore bounded by the right-hand side. This and Markov’s inequality prove the theorem.

This proof reveals that the minimum norm solution, the one found by running stochastic gradient descent to convergence, achieves nearly the same generalization bound as the Perceptron, including the fast 1/n rate. Here, nothing is assumed about margin; instead we assume that the complexity of the interpolating solution does not grow rapidly as we increase the amount of data we collect. This proof combines ideas from stability, optimization, and model complexity to give yet another explanation for why gradient methods find high-quality solutions to machine learning problems.

## Proof of Lemma 1

We conclude with the deferred proof of Lemma 1.

Let K= XX^T denote the kernel matrix for S\,. Partition K as K = \begin{bmatrix} K_{11} & K_{12} \\ K_{21} & K_{22} \end{bmatrix} where K_{11} is (m-1) \times (m-1) and K_{22} is a scalar equal to \langle x_m,x_m \rangle\,. Similarly, partition the vector of labels y so that y^{(m-1)} denotes the first m-1 labels. Under this partitioning, \langle w_{m-1} , x_m \rangle = K_{21} K_{11}^{-1} y^{(m-1)}\,.

Now note that s_m^2 = K_{22}-K_{21} K_{11}^{-1} K_{12}\,. Next, using the formula for inverting partitioned matrices, we find K^{-1} = \begin{bmatrix} (K_{11}-K_{12}K_{22}^{-1}K_{21})^{-1} & -s_m^{-2} K_{11}^{-1} K_{12} \\ -s_m^{-2} (K_{11}^{-1} K_{12})^T & s_m^{-2} \end{bmatrix}\,. By the matrix inversion lemma we have (K_{11}-K_{12}K_{22}^{-1}K_{21})^{-1} = K_{11}^{-1} +s_m^{-2} \left(K_{21} K_{11}^{-1}\right)^T\left(K_{21} K_{11}^{-1}\right) \,. Hence, \begin{aligned} \Vert w_{m}\Vert^2 &= y^T K^{-1} y \\ &= s_m^{-2}(y_m^2 - 2y_m \langle w_{m-1} ,x_m\rangle +\langle w_{m-1} ,x_m\rangle^2 ) + {y^{(m-1)}}^T K_{11}^{-1} y^{(m-1)}\,. \end{aligned} Since {y^{(m-1)}}^T K_{11}^{-1} y^{(m-1)} = \Vert w_{m-1}\Vert^2 and y_m^2=1, rearranging terms proves the lemma.
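The identity in the lemma is also easy to verify numerically. The sketch below (random data, illustrative sizes) computes both sides directly, obtaining s_m by projecting x_m onto the span of the other points.

```python
import numpy as np

rng = np.random.default_rng(0)
m, d = 10, 30                            # overparameterized so interpolation is exact
X = rng.normal(size=(m, d))
y = rng.choice([-1.0, 1.0], size=m)

def min_norm(X, y):
    """Minimum norm interpolator X^T (X X^T)^{-1} y."""
    return X.T @ np.linalg.solve(X @ X.T, y)

w_prev = min_norm(X[:-1], y[:-1])        # trained on the first m-1 points
w_full = min_norm(X, y)                  # trained on all m points

# s_m: distance from x_m to the span of x_1, ..., x_{m-1}
Q, _ = np.linalg.qr(X[:-1].T)            # orthonormal basis for the span
s_m = np.linalg.norm(X[-1] - Q @ (Q.T @ X[-1]))

lhs = (1 - y[-1] * (w_prev @ X[-1])) ** 2
rhs = s_m**2 * (w_full @ w_full - w_prev @ w_prev)
print(lhs, rhs)                          # the two sides of the identity agree
```

This also makes the lemma's content tangible: adding a point increases the squared norm of the interpolator exactly in proportion to the squared prediction error on that point.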

# Looking ahead

Despite significant effort and many recent advances, the theory of generalization in overparameterized models still lags behind the empirical phenomenology. What governs generalization remains a matter of debate in the research community.

Existing generalization bounds often do not apply directly to practice by virtue of their assumptions, are quantitatively too weak to apply to heavily overparameterized models, or fail to explain important empirical observations. However, it is not just a lack of quantitative sharpness that limits our understanding of generalization.

Conceptual questions remain open: What is it that a successful theory of generalization should do? What are formal success criteria? Even a qualitative theory of generalization, one that is not quantitatively precise in concrete settings, may be useful if it leads to successful algorithmic interventions. But how do we best evaluate the value of a theory in this context?

Our focus in this chapter was decidedly narrow. We discussed how to
relate risk and empirical risk. This perspective can only capture
questions that relate performance on a sample to performance on the very
same distribution that the sample was drawn from. What is left out are
important questions of *extrapolation* from a training
environment to testing conditions that differ from training.
Overparameterized models that generalize well in the narrow sense can
fail dramatically even with small changes in the environment. We will
revisit the question of generalization for overparameterized models in
our chapter on deep learning.

# Chapter notes

The tight characterization of generalization gap in terms of average
stability, as well as stability of regularized empirical risk
minimization (Theorem 1), is due to Shalev-Shwartz et al.Shalev-Shwartz
et al., “Learnability, Stability and Uniform Convergence,”
*Journal of Machine Learning Research* 11, no. Oct (2010):
2635–70. Uniform stability was introduced by
Bousquet and Elisseeff.Bousquet
and Elisseeff, “Stability and Generalization,” *Journal
of Machine Learning Research* 2, no. Mar (2002):
499–526. For additional background on VC dimension
and Rademacher complexity, see, for example, the text by Shalev-Shwartz
and Ben-David.Shalev-Shwartz
and Ben-David, *Understanding Machine Learning: From Theory to
Algorithms* (Cambridge University Press,
2014).

The double descent figure is from work of Belkin et al.Belkin
et al., “Reconciling Modern Machine-Learning Practice and the
Classical Bias-Variance Trade-Off,” *Proceedings of the
National Academy of Sciences*, 2019. Earlier
work pointed out similar empirical risk-complexity relationships.Neyshabur,
Tomioka, and Srebro, “In Search of the Real Inductive Bias: On the
Role of Implicit Regularization in Deep Learning,”
*arXiv:1412.6614*, 2014. The empirical
findings related to the randomization test and the role of
regularization are due to Zhang et al.Zhang
et al., “Understanding Deep Learning Requires Rethinking
Generalization,” in *International Conference on Learning
Representations*, 2017.

Theorem 2 is due to Schapire et al.Schapire
et al., “Boosting the Margin: A New Explanation for the
Effectiveness of Voting Methods,” *The Annals of
Statistics* 26, no. 5 (1998): 1651–86. Later
work showed theoretically that boosting maximizes margin.Zhang
and Yu, “Boosting with Early Stopping: Convergence and
Consistency,” *The Annals of Statistics* 33 (2005):
1538–79; Telgarsky, “Margins, Shrinkage, and Boosting,” in
*International Conference on Machine Learning*,
2013. The margin bound for linear models follows
from more general results of Kakade, Sridharan, and TewariKakade,
Sridharan, and Tewari, “On the Complexity of Linear Prediction:
Risk Bounds, Margin Bounds, and Regularization,” in *Advances
in Neural Information Processing Systems*, 2009,
793–800. that build on earlier work by Bartlett and
MendelsonBartlett
and Mendelson, “Rademacher and Gaussian Complexities:
Risk Bounds and Structural Results,” *Journal of Machine
Learning Research* 3, no. Nov (2002): 463–82.,
as well as work of Koltchinskii and PanchenkoKoltchinskii
and Panchenko, “Empirical Margin Distributions and Bounding the
Generalization Error of Combined Classifiers,” *The Annals of
Statistics* 30, no. 1 (2002): 1–50.. Rademacher
complexity bounds for families of neural networks go back to work of
BartlettBartlett,
“The Sample Complexity of Pattern Classification with Neural
Networks: The Size of the Weights Is More Important Than the Size of the
Network,” *Transactions on Information Theory* 44, no. 2
(1998): 525–36. and remain an active research
topic. We will see more on this in our chapter on deep learning.

The uniform stability bound for stochastic gradient descent is due to
Hardt, Recht, and SingerHardt,
Recht, and Singer, “Train Faster, Generalize Better: Stability of
Stochastic Gradient Descent,” in *International Conference on
Machine Learning*, 2016.. Subsequent work
further explores the generalization performance of stochastic gradient
descent in terms of its stability properties. Theorem 4 and Lemma 1 are
due to Liang and RechtLiang
and Recht, “Interpolating Classifiers Make Few Mistakes,”
*arXiv:2101.11815*, 2021..

There has been an explosion of work on generalization and
overparameterization in recent years. See, also, recent work exploring
how other norms shed light on generalization performance.Neyshabur
et al., “Exploring Generalization in Deep Learning,” in
*Advances in Neural Information Processing Systems*, 2017,
5947–56. Our exposition is by no means a
representative survey of the broad literature on this topic. There are
several ongoing lines of work we did not cover: PAC-Bayes boundsDziugaite
and Roy, “Computing Nonvacuous Generalization Bounds for Deep
(Stochastic) Neural Networks with Many More Parameters Than Training
Data,” *arXiv:1703.11008*, 2017.,
compression boundsArora
et al., “Stronger Generalization Bounds for Deep Nets via a
Compression Approach,” *arXiv:1802.05296*,
2018., and arguments about the properties of the
optimization landscapeZhang
et al., “Theory of Deep Learning III: Generalization
Properties of SGD” (Discussion paper, Center for
Brains, Minds and Machines (CBMM). Preprint, 2017)..
This chapter builds on a chapter by HardtHardt,
“Generalization in Overparameterized Models,” in *Beyond
the Worst-Case Analysis of Algorithms*, ed. Tim Roughgarden
(Cambridge University Press, 2021), 486–505., but
contains several structural changes as well as different results.

# References

Arora et al. “Stronger Generalization Bounds for Deep Nets via a Compression Approach.” *arXiv:1802.05296*, 2018.

Bartlett. “The Sample Complexity of Pattern Classification with Neural Networks: The Size of the Weights Is More Important Than the Size of the Network.” *Transactions on Information Theory* 44, no. 2 (1998): 525–36.

Bartlett and Mendelson. “Rademacher and Gaussian Complexities: Risk Bounds and Structural Results.” *Journal of Machine Learning Research* 3, no. Nov (2002): 463–82.

Belkin et al. “Reconciling Modern Machine-Learning Practice and the Classical Bias-Variance Trade-Off.” *Proceedings of the National Academy of Sciences*, 2019.

Bousquet and Elisseeff. “Stability and Generalization.” *Journal of Machine Learning Research* 2, no. Mar (2002): 499–526.

Burges. “A Tutorial on Support Vector Machines for Pattern Recognition.” *Data Mining and Knowledge Discovery* 2, no. 2 (1998): 121–67.

Dziugaite and Roy. “Computing Nonvacuous Generalization Bounds for Deep (Stochastic) Neural Networks with Many More Parameters Than Training Data.” *arXiv:1703.11008*, 2017.

Hardt. “Generalization in Overparameterized Models.” In *Beyond the Worst-Case Analysis of Algorithms*, edited by Tim Roughgarden, 486–505. Cambridge University Press, 2021.

Hardt, Recht, and Singer. “Train Faster, Generalize Better: Stability of Stochastic Gradient Descent.” In *International Conference on Machine Learning*, 2016.

Kakade, Sridharan, and Tewari. “On the Complexity of Linear Prediction: Risk Bounds, Margin Bounds, and Regularization.” In *Advances in Neural Information Processing Systems*, 793–800, 2009.

Koltchinskii and Panchenko. “Empirical Margin Distributions and Bounding the Generalization Error of Combined Classifiers.” *The Annals of Statistics* 30, no. 1 (2002): 1–50.

Liang and Recht. “Interpolating Classifiers Make Few Mistakes.” *arXiv:2101.11815*, 2021.

Neyshabur et al. “Exploring Generalization in Deep Learning.” In *Advances in Neural Information Processing Systems*, 5947–56, 2017.

Neyshabur, Tomioka, and Srebro. “In Search of the Real Inductive Bias: On the Role of Implicit Regularization in Deep Learning.” *arXiv:1412.6614*, 2014.

Schapire et al. “Boosting the Margin: A New Explanation for the Effectiveness of Voting Methods.” *The Annals of Statistics* 26, no. 5 (1998): 1651–86.

Shalev-Shwartz and Ben-David. *Understanding Machine Learning: From Theory to Algorithms*. Cambridge University Press, 2014.

Shalev-Shwartz et al. “Learnability, Stability and Uniform Convergence.” *Journal of Machine Learning Research* 11, no. Oct (2010): 2635–70.

Telgarsky. “Margins, Shrinkage, and Boosting.” In *International Conference on Machine Learning*, 2013.

Vapnik. *Statistical Learning Theory*. Wiley, 1998.

Zhang et al. “Theory of Deep Learning III: Generalization Properties of SGD.” Discussion paper, Center for Brains, Minds and Machines (CBMM), 2017.

Zhang et al. “Understanding Deep Learning Requires Rethinking Generalization.” In *International Conference on Learning Representations*, 2017.

Zhang and Yu. “Boosting with Early Stopping: Convergence and Consistency.” *The Annals of Statistics* 33 (2005): 1538–79.