# Supervised learning

Previously, we talked about the fundamentals of prediction and statistical modeling of populations. Our goal was, broadly speaking, to use available information described by a random variable X to conjecture about an unknown outcome Y.

In the important special case of a binary outcome Y, we saw that we can write an optimal predictor \hat Y as a threshold of some function f: \hat Y(x) = \mathbb{1}\{f(x) > t\} We saw that in many cases the optimal function is a ratio of two likelihood functions.

This optimal predictor has a serious limitation in practice, however. To be able to compute the prediction for a given input, we need to know a probability density function for the positive instances in our problem and also one for the negative instances. But we are often unable to construct or unwilling to assume a particular density function.

As a thought experiment, attempt to imagine what a probability
density function over images labeled *cat* might look like.
Coming up with such a density function appears to be a formidable task,
one that’s not intuitively any easier than merely classifying whether an
image contains a cat or not.

In this chapter, we transition from a purely mathematical characterization of optimal predictors to an algorithmic framework. This framework has two components. One is the idea of working with finite samples from a population. The other is the theory of supervised learning and it tells us how to use finite samples to build predictors algorithmically.

# Sample versus population

Let’s take a step back to reflect on the interpretation of the pair of random variables (X, Y) that we’ve worked with so far. We think of the random variables (X, Y) as modeling a population of instances in our prediction problem. From this pair of random variables, we can derive other random variables such as a predictor \hat Y = \mathbb{1}\{f(X) > t\}. All of these are random variables in the same probability space. When we talk about, say, the true positive rate of the predictor \hat Y, we therefore make a statement about the joint distribution of (X, Y).

In almost all prediction problems, however, we do not have access to the entire population of instances that we will encounter. Neither do we have a probability model for the joint distribution of the random variables (X, Y). The joint distribution is a theoretical construct that we can reason about, but it doesn’t readily tell us what to do when we don’t have precise knowledge of the joint distribution.

What knowledge then do we typically have about the underlying population and how can we use it algorithmically to find good predictors? In this chapter we will begin to answer both questions.

First we assume that from past experience we have observed n labeled instances (x_1, y_1),...,(x_n, y_n). We assume that
each data point (x_i, y_i) is a
draw from the same underlying distribution (X, Y). Moreover, we will often assume that
the data points are drawn independently. This pair of assumptions is
often called the “i.i.d. assumption”, a shorthand for *independent
and identically distributed*.

To give an example, consider a population consisting of all currently
eligible voters in the United States and some of their features, such
as, age, income, state of residence etc. An *i.i.d. sample* from
this population would correspond to a repeated sampling process that
selects a uniformly random voter from the entire reference
population.

Sampling is a difficult problem with numerous pitfalls that can strongly affect the performance of statistical estimators and the validity of what we learn from data. In the voting example, individuals might be unreachable or decline to respond. Even defining a good population for the problem we’re trying to solve is often tricky. Populations can change over time. They may depend on a particular social context, geography, or may not be neatly characterized by formal criteria. Task yourself with the idea of taking a random sample of spoken sentences in the English language, for example, and you will quickly run into these issues.

In this chapter, as is common in learning theory, we largely ignore these important issues. We instead focus on the significant challenges that remain even if we have a well-defined population and an unbiased sample from it.

# Supervised learning

*Supervised learning* is the prevalent method for constructing
predictors from data. The essential idea is very simple. We assume we
have labeled data, in this context also called *training
examples*, of the form (x_1,y_1),
..., (x_n, y_n), where each *example* is a pair (x_i,y_i) of an *instance* x_i and a corresponding *label* y_i. The notion of *supervision*
refers to the availability of these labels.

Given such a collection of labeled data points, supervised learning
turns the task of finding a good predictor into an optimization problem
involving these data points. This optimization problem is called
*empirical risk minimization*.

Recall, in the last chapter we assumed full knowledge of the joint
distribution of (X,Y) and
analytically found predictors that minimize risk. The risk is equal to
the expected value of a *loss function* that quantifies the cost
of each possible prediction for a given true outcome. For binary
prediction problems, there are four possible pairs of labels
corresponding to true positives, false positives, true negatives, and
false negatives. In this case, the loss function boils down to
specifying a cost to each of the four possibilities.

More generally, a loss function is a function \mathit{loss}\colon\mathcal{Y}\times\mathcal{Y}\to\mathbb{R}\,, where \mathcal{Y} is the set of values that Y can assume. Whereas previously we focused on the predictor \hat Y as a random variable, in this chapter our focus shifts to the functional form that the predictor has. By convention, we write \hat Y=f(X), where f\colon\mathcal{X}\to\mathcal{Y} is a function that maps from the sample space \mathcal{X} into the label space \mathcal{Y}.

Although the random variable \hat Y and the function f are mathematically not the same objects, we will call both a predictor and extend our risk definition to apply the function as well: R[f] = \mathop\mathbb{E}\left[ \mathit{loss}(f(X), Y) \right]\,. The main new definition in this chapter is a finite sample analog of the risk, called empirical risk.

Given a set of labeled data points S=((x_1,y_1),...,(x_n, y_n)), the
*empirical risk* of a predictor f\colon \mathcal{X}\to\mathcal{Y} with
respect to the sample S is defined
as
R_S[f] = \frac1n \sum_{i=1}^n \mathbb{\mathit{loss}}( f(x_i), y_i )\,.

Rather than taking expectation over the population, the empirical risk averages the loss function over a finite sample. Conceptually, we think of the finite sample as something that is in our possession, e.g., stored on our hard disk.

Empirical risk serves as a proxy for the risk. Whereas the
risk R[f] is a population
quantity—that is, a property of the joint distribution (X, Y) and our predictor f—the empirical risk is a *sample
quantity*.

We can think of the empirical risk as the sample average estimator of the risk. When samples are drawn i.i.d., the empirical risk is a random variable that equals the sum of n independent random variables. If losses are bounded, the central limit theorem therefore suggests that the empirical risk approximates the risk for a fixed predictor f.

Regardless of the distribution of S, however, note that we can always compute the empirical risk R_S[f] entirely from the sample S and the predictor f. Since empirical risk a quantity we can compute from samples alone, it makes sense to turn it into an objective function that we can try to minimize numerically.

*Empirical risk minimization* is the optimization problem of
finding a predictor in a given function family that minimizes the
empirical risk.

Given a function class \mathcal{F}\subseteq
\mathcal{X}\to\mathcal{Y}, *empirical risk minimization*
on a set of labeled data points S
corresponds to the objective:
\min_{f\in\mathcal{F}} R_S[f]
A solution to the optimization problem is called *empirical
risk minimizer*.

There is a tautology relating risk and empirical risk that is good to
keep in mind:
R[f] = R_S[f] + (R[f] - R_S[f])
Although mathematically trivial, the tautology reveals an
important insight. To minimize risk, we can first attempt to minimize
empirical risk. If we successfully find a predictor f that achieves small empirical
risk R_S[f], we’re left worrying
about the term R[f]-R_S[f]. This
term quantifies how much the empirical risk of f underestimates its risk. We call this
difference *generalization gap* and it is of fundamental
importance to machine learning. Intuitively speaking, it tells us how
well the performance of our predictor transfers from seen examples (the
training examples) to unseen examples (a fresh example from the
population) drawn from the same distribution. This process is called
*generalization*.

Generalization is not the only goal of supervised learning. A
constant predictor that always outputs 0 generalizes perfectly well, but is almost
always entirely useless. What we also need is that the predictor
achieves small empirical risk R_S[f]. Making the empirical risk small is
fundamentally about *optimization*. As a consequence, a large
part of supervised learning deals with optimization. For us to be able
to talk about optimization, we need to commit to a
*representation* of the function class \mathcal{F} that appears in the empirical
risk minimization problem. The representation of the function class, as
well as the choice of a suitable loss function, determines whether or
not we can efficiently find an empirical risk minimizer.

To summarize, introducing empirical risk minimization directly leads to three important questions that we will work through in turn.

**Representation:**What is the class of functions \mathcal{F} we should choose?**Optimization:**How can we efficiently solve the resulting optimization problem?**Generalization:**Will the performance of predictor transfer gracefully from seen training examples to unseen instances of our problem?

These three questions are intertwined. Machine learning is not so much about studying these questions in isolation as it is about the often delicate interplay between them. Our choice of representation influences both the difficulty of optimization and our generalization performance. Improvements in optimization may not help, or could even hurt, generalization. Moreover, there are aspects of the problem that don’t neatly fall into only one of these categories. The choice of the loss function, for example, affects all of the three questions above.

There are important differences between the three questions. Results in optimization, for example, tend to be independent of the statistical assumptions about the data generating process. We will see a number of different optimization methods that under certain circumstances find either a global or local minimum of the empirical risk objective. In contrast, to reason about generalization, we need some assumptions about the data generating process. The most common one is the i.i.d.-assumption we discussed earlier. We will also see several mathematical frameworks for reasoning about the gap between risk and empirical risk.

Let’s start with a foundational example that illustrates these core concepts and their interplay.

# A first learning algorithm: The perceptron

As we discussed in the introduction, in 1958 the New
York Times reported the Office of Naval Research claiming the
perceptron algorithmRosenblatt,
“The Perceptron: A Probabilistic Model for Information Storage and
Organization in the Brain,” *Psychological Review*, 1958,
65–386. would “be able to walk, talk, see, write,
reproduce itself and be conscious of its existence.” Let’s now dive into
this algorithm that seemed to have such unbounded potential.

Toward introducing this algorithm, let’s assume we’re in a binary
prediction problem with labels in \{-1,1\} for notational convenience. The
perceptron algorithm aims to find a *linear separator* of the
data, that is, a hyperplane specified by coefficients w\in\mathbb{R}^d that so that all positive
examples lie on one side of the hyperplane and all negative ones on the
other.

Formally, we can express this as y_i\langle w, x_i\rangle > 0. In other
words, the linear function f(x)=\langle
w, x\rangle agrees in sign with the labels on all training
instances (x_i, y_i). In fact, the
perceptron algorithm will give us a bit more. Specifically, we require
that the sign agreement has some *margin* y_i\langle w, x_i\rangle \ge 1. That is,
when y=1, the linear function must
take on a value of at least 1 and
when y=-1, the linear function
must be at most -1. Once we find
such a linear function, our prediction \hat Y(x) on a data point x is \hat{Y}(x) = 1 if \langle w, x \rangle \geq 0 and \hat Y(x) = -1 otherwise.

The algorithm goes about finding a linear separator w incrementally in a sequence of update steps.

**Perceptron**

- Start from the initial solution w_0=0
- At each step t=0,1,2,...:
- Select a random index i\in\{1,...,n\}
- Case 1: If y_i\langle w_t, x_i\rangle < 1, put w_{t+1} = w_t + y_i x_i
- Case 2: Otherwise put w_{t+1} = w_t.

Case 1 corresponds to what’s called a *margin mistake*. The
sign of the linear function may not disagree with the label, but it
doesn’t have the required margin that we asked for.

When an update occurs, we have \langle w_{t+1}, x_i\rangle = \langle w_t, x_i\rangle + \|x_i\|^2\,. In this sense, the algorithm is nudging the hyperplane to be less wrong on example x_i. However, in doing so it could introduce errors on other examples. It is not yet clear that the algorithm converges to a linear separator when this is possible.

# Connection to empirical risk minimization

Before we turn to the formal guarantees of the perceptron, it is
instructive to see how to relate it to empirical risk minimization. In
order to do so, it’s helpful to introduce two *hyperparameters*
to the algorithm by considering the alternative update rule:
w_{t+1} = \gamma w_t + \eta y_i x_i
Here \eta is a positive
scalar called a *learning rate* and \gamma\in [0,1] is called the *forgetting
rate*.

First, it’s clear from the description that we’re looking for a
linear separator. Hence, our function class is the set of linear
functions f_w(x)=\langle w,
x\rangle, where w\in\mathbb{R}^d. We will sometimes call the
vector w the *weight
vector* or vector of *model parameters*.

An optimization method that picks a random example at each step and
makes a local improvement to the model parameters is the *stochastic
gradient method*. This method will figure prominently in our chapter
on optimization as it is the workhorse of many machine learning
applications today. The local improvement the method picks at each step
is given by a local linear approximation of the loss function around the
current model parameters. This linear approximation can be written
neatly in terms of the vector of first derivatives, called
*gradient*, of the loss function with respect to the current
model parameters.

The formal update rule reads w_{t+1} = w_t - \eta\nabla_{w_t} \mathit{loss}(f_{w_t}(x_i), y_i) Here, the example (x_i,y_i) is randomly chosen and the expression \nabla_{w_t} \mathit{loss}(f_{w_t}(x_i), y_i) is the gradient of the loss function with respect to the model parameters w_t on the example (x_i, y_i). We will typically drop the vector w_t from the subscript of the gradient when it’s clear from the context. The scalar \eta>0 is a step size parameter that we will discuss more carefully later. For now, think of it as a small constant.

It turns out that we can connect this update rule with the perceptron
algorithm by choosing a suitable loss function. Consider the loss
function
\mathit{loss}(\langle w, x\rangle, y)
= \max\big\{1-y\langle w, x\rangle,\, 0\big\}\,.
This loss function is called *hinge loss*. Note that its
gradient is -yx when y\langle w, x\rangle < 1 and 0 when y\langle w, x\rangle >1.

The gradient of the hinge loss is not defined when y\langle w, x\rangle=1. In other words, the
loss function is not differentiable everywhere. This is why technically
speaking the stochastic gradient method operates with what is called a
*subgradient*. The mathematical theory of subgradient
optimization rigorously justifies calling the gradient 0 when y\langle w, x\rangle=1. We will ignore this
technicality throughout the book.

We can see that the hinge loss gives us part of the update rule in
the perceptron algorithm. The other part comes from adding a weight
penalty \frac\lambda 2\Vert
w\Vert^2 to the loss function that discourages the weights from
growing out of bounds. This weight penalty is called *\ell_2-regularization*, *weight
decay*, or *Tikhonov regularization* depending on which field
you work in. The purpose of regularization is to promote generalization.
We will therefore return to regularization in detail when we discuss
generalization in more depth. For now, note that the margin constraint
we introduced is inconsequential unless we penalize large vectors.
Without the weight penalty we could simply scale up any linear separator
until it separates the points with the desired margin.

Putting the two loss functions together, we get the \ell_2-regularized empirical risk
minimization problem for the hinge loss:
\frac{1}{n}\sum_{i=1}^n \max\big\{1-y_i\langle w, x_i\rangle,\, 0\big\}
+ \frac\lambda 2 \| w \|_2^2
The perceptron algorithm corresponds to solving this empirical
risk objective with the stochastic gradient method. The
constant \eta, which we dubbed the
learning rate, is the step size of the stochastic gradient methods. The
forgetting rate constant \gamma is
equal to (1-\eta \lambda). The
optimization problem is also known as *support vector machine*
and we will return to it later on.

## A word about surrogate losses

When the goal was to maximize the accuracy of a predictor, we
mathematically solved the risk minimization problem with respect to the
*zero-one loss*
\mathit{loss}(\hat y, y)=\mathbb{1}\{\hat y\ne y\}
that gives us penalty 1 if
our label is incorrect, and penalty 0 if our predicted label \hat y matches the true label y. We saw that the optimal predictor in this
case was a *maximum a posteriori* rule, where we selected the
label with higher posterior probability.

Why don’t we directly solve empirical risk minimization with respect
to the zero-one loss? The reason is that the empirical risk with the
zero-one loss is computationally difficult to optimize directly. In
fact, this optimization problem is NP-hard even for linear prediction
rules.Kearns,
Schapire, and Sellie, “Toward Efficient Agnostic Learning,”
*Machine Learning* 17, no. 2–3 (1994):
115–41. To get a better sense of the difficulty,
convince yourself that the stochastic gradient method, for example,
fails entirely on the zero-one loss objective. Of course, the stochastic
gradient method is not the only learning algorithm.

The hinge loss therefore serves as a *surrogate loss* for the
zero-one loss. We hope that by optimizing the hinge loss, we end up
optimizing the zero-one loss as well. The hinge loss is not the only
reasonable choice. There are numerous loss functions that approximate
the zero-one loss in different ways.

- The
*hinge loss*is \max\{1-y\hat y, 0\} and*support vector machine*refers to empirical risk minimization with the hinge loss and \ell_2-regularization. This is what the perceptron is optimizing. - The
*squared loss*is given by \frac12(y-\hat y)^2. Linear least squares regression corresponds to empirical risk minimization with the squared loss. - The
*logistic loss*is -\log(\sigma(\hat y)) when y=1 and -\log(1-\sigma(\hat y)) when y=-1, where \sigma(z) = 1/(1+\exp(-z)) is the logistic function.*Logistic regression*corresponds to empirical risk minimization with the logistic loss and linear functions.

Sometimes we can theoretically relate empirical risk minimization under a surrogate loss to the zero-one loss. In general, however, these loss functions are used heuristically and practitioners evaluate performance by trial-and-error.

# Formal guarantees for the perceptron

We saw that the perceptron corresponds to finding a linear predictor
using the stochastic gradient method. What we haven’t seen yet is a
proof that the perceptron method works and under what conditions. Recall
that there are two questions we need to address. The first is why the
perceptron method successfully fits the training data, a question about
*optimization*. The second is why the solution should also
correctly classify unseen examples drawn from the same distribution, a
question about *generalization*. We will address each in turn
with results from the 1960s. Even though the analysis here is over 50
years old, it has all of the essential parts of more recent theoretical
arguments in machine learning.

## Mistake bound

To see why we perform well on the training data, we use a *mistake
bound* due to Novikoff.Novikoff,
“On Convergence Proofs on Perceptrons,” in *Symposium on
the Mathematical Theory of Automata*, 1962,
615–22. The bound shows that if there exists a
linear separator of the training data, then the perceptron will find it
quickly provided that the *margin* of the separating hyperplane
isn’t too small.

Margin is a simple way to evaluate how well a predictor separates data. Any vector w\in \mathbb{R}^d defines a hyperplane \mathcal{H}_w = \{x~:~w^Tx=0\}. Suppose that the hyperplane \mathcal{H}_w corresponding to the vector w perfectly separates the data in S. Then we define the margin of such a vector w as the smallest distance of our data points to this hyperplane: \gamma(S,w) = \min_{1\le i\le n} \operatorname{dist}(x_i,\mathcal{H}_w )\,. Here, \operatorname{dist}(x, \mathcal{H}_w) = \min\{ \|x - x'\| \colon x'\in \mathcal{H}_w\} = \frac{|\langle x, w\rangle|}{\|w\|}.

Overloading terminology, we define the margin of a *dataset*
to be the maximum margin achievable by any predictor w:
\gamma(S) = \max_{\Vert w \Vert =1} \gamma(S,w)\,.
We will now show that when a dataset has a large margin, the
perceptron algorithm will find a separating hyperplane quickly.

Let’s consider the simplest form of the perceptron algorithm. We initialize the algorithm with w_0=0. The algorithm proceeds by selecting a random index i_t at step t checking whether y_{i_t} w_t^Tx_{i_t} < 1. We call this condition a margin mistake, i.e., the prediction w_t^T x_{i_t} is either wrong or too close to the hyperplane. If a margin mistake occurs, the perceptron performs the update w_{t+1} = w_t + y_{i_t} x_{i_t}\,. That is, we rejigger the hyperplane to be more aligned with the signed direction of the mistake. If no margin mistake occurs, then w_{t+1}=w_t.

To analyze the perceptron we need one additional definition. Define the diameter of a data S to be D(S) = \max_{(x,y)\in S}\|x\|\,. We can now summarize a worst case analysis of the perceptron algorithm with the following theorem.

The perceptron algorithm makes at most (2+D(S)^2)\gamma(S)^{-2} margin mistakes on any sequence of examples S that can be perfectly classified by a linear separator.

The main idea behind the proof of this theorem is that since w only changes when you make a mistake, we can upper bound and lower bound w at each time a mistake is made, and then, by comparing these two bounds, compute an inequality on the total number of mistakes.

To find an upper bound, suppose that at step t the algorithm makes a margin mistake. We then have the inequality: \begin{aligned} \Vert w_{t+1} \Vert^2 &= \Vert w_t +y_{i_t} x_{i_t}\Vert^2 \\ & = \Vert w_t \Vert^2 + 2 y_{i_t} \langle w_t, x_{i_t}\rangle + \Vert x_{i_t} \Vert^2 \\ &\leq \Vert w_t\Vert^2 + 2 + D(S)^2 \,. \end{aligned} The final inequality uses the fact that y_{i_t} \langle w_t, x_{i_t}\rangle<1. Now, let m_t denote the total number of mistakes made by the perceptron in the first t iterations. Summing up the above inequality over all the mistakes we make and using the fact that \|w_0\|=0, we get our upper bound on the norm of w_t: \Vert w_{t} \Vert \leq \sqrt{m_t(2+ D(S)^2)}\,.

Working toward a lower bound on the norm of w_t, we will use the following argument. Let w be any unit vector that correctly classifies all points in S. If we make a mistake at iteration t, we have \langle w, w_{t+1}-w_t\rangle = \langle w, y_{i_t}x_{i_t}\rangle = \frac{|\langle w, x_{i_t}\rangle|}{\|w\|} \geq \gamma(S, w)\,. Note that the second equality here holds because w correctly classifies the point (x_{i_t}, y_{i_t}). This is where we use that the data are linearly separable. The inequality follows from the definition of margin.

Now, let w_\star denote the hyperplane that achieves the maximum margin \gamma(S). Instantiating the previous argument with w_\star, we find that \Vert w_t \Vert \geq \langle w_\star, w_t\rangle = \sum_{k=1}^t w_\star^T (w_{k}-w_{k-1}) \geq m_t\gamma(S)\,, where the equality follows from a telescoping sum argument.

This yields the desired lower bound on the norm of w_t. Combined with the upper bound we already derived, it follows that the total number of mistakes has the bound m_t \leq \frac{2+D(S)^2}{\gamma(S)^2}\,.

The proof we saw has some ingredients we’ll encounter again. Telescoping sums, for example, are a powerful trick used throughout the analysis of optimization algorithms. A telescoping sum lets us understand the behavior of the final iterate by decomposing it into the incremental updates of the individual iterations.

The mistake bound does not depend on the dimension of the data. This is appealing since the requirement of linear separability and high margin, intuitively speaking, become less taxing the larger the dimension is.

An interesting consequence of this theorem is that if we run the perceptron repeatedly over the same dataset, we will eventually end up with a separating hyperplane. To see this, imagine repeatedly running over the dataset until no mistake occurred on a full pass over the data. The mistake bound gives a bound on the number of passes required before the algorithm terminates.

## From mistake bounds to generalization

The previous analysis shows that the perceptron finds a good predictor on the training data. What can we say about new data that we have not yet seen?

To talk about generalization, we need to make a statistical assumption about the data generating process. Specifically we assume that the data points in the training set S=\{(x_1,y_1)\ldots, (x_n,y_n) \} where each drawn i.i.d. from a fixed underlying distribution \mathcal{D} with the labels taking values \{ -1,1 \} and each x_i \in \mathbb{R}^d.

We know that the perceptron finds a good linear predictor for the training data (if it exists). What we now show is that this predictor also works on new data drawn from the same distribution.

To analyze what happens on new data, we will employ a powerful
*stability* argument. Put simply, an algorithm is stable if the
effect of removing or replacing a single data point is small. We will do
a deep dive on stability in our chapter on generalization, but we will
have a first encounter with the idea here.

The perceptron is stable because it makes a bounded number of
mistakes. If we remove a data point where no mistake is made, the model
doesn’t change at all. In fact, it’s as if we had *never seen*
the data point. This lets us relate the performance on seen examples to
the performance on examples in the training data on which the algorithm
never updated.

Vapnik and Chervonenkis presented the following stability argument in
their classic text from 1974, though the original argument is likely a
decade older.Vapnik
and Chervonenkis, *Theory of Pattern Recognition: Statistical
Learning Problems* (Nauka, 1974). Their main
idea was to leverage our assumption that the data are i.i.d., so we can
swap the roles of training and test examples in the analysis.

Let S_n denote a training set of n i.i.d. samples from a distribution \mathcal{D} that we assume has a perfect linear separator. Let w(S) be the output of the perceptron on a dataset S after running until the hyperplane makes no more margin mistakes on S. Let Z=(X,Y) be an additional independent sample from \mathcal{D}. Then, the probability of making a margin mistake on (X,Y) satisfies the upper bound \mathop\mathbb{P}[Y w(S_n)^T X < 1] \leq \frac{1}{n+1} {\mathop\mathbb{E}}_{S_{n+1}}\left[ \frac{2+D(S_{n+1})^2}{\gamma(S_{n+1})^2} \right]\,.

First note that \mathop\mathbb{P}[Y w^T X < 1] = \mathop\mathbb{E}[\mathbb{1}\{Yw^T X < 1\}]\,.

Let S_n=(Z_1, ... , Z_n) and with Z_k=(X_k, Y_k) and put Z_{n+1}=Z=(X,Y). Note that these n+1 random variables are i.i.d. draws from \mathcal{D}. As a purely analytical device, consider the “leave-one-out set” S^{-k}=\{Z_1,\dots,Z_{k-1},Z_{k+1},...,Z_{n+1}\}\,. Since the data are drawn i.i.d., running the algorithm on S^{-k} and evaluating it on Z_k=(X_k,Y_k) is equivalent to running the algorithm on S_n and evaluating it on Z_{n+1}. These all correspond to the same random experiment and differ only in naming. In particular, we have \mathop\mathbb{P}[Y w(S_n)^T X < 1] = \frac1{n+1}\sum_{k=1}^{n+1} \mathop\mathbb{E}[\mathbb{1}\{Y_k w(S^{-k})^T X_k < 1\}]\,. Indeed, we’re averaging quantities that are each identical to the left hand side. But recall from our previous result that the perceptron makes at most m=\frac{2+D((Z_1,\dots,Z_{n+1}))^2}{\gamma((Z_1,\dots,Z_{n+1}))^2} margin mistakes when run on the entire sequence (Z_1,\dots,Z_{n+1}). Let i_1,\dots,i_m denote the indices on which the algorithm makes a mistake in any of its cycles over the data. If k\not\in\{i_1,\dots,i_m\}, the output of the algorithm remains the same after we remove the k-th sample from the sequence. It follows that such k satisfy Y_k w(S^{-k})X_k \geq 1 and therefore k does not contribute to the summation above. The other terms can at most contribute 1 to the summation. Hence, \sum_{k=1}^{n+1}\mathbb{1}\{Y_k w(S^{-k})^T X_k < 1\} \le m\,, and by linearity of expectation, as we hoped to show, \mathop\mathbb{P}[Y w(S_n)^T X < 1] \le \frac{\mathop\mathbb{E}[m]}{n+1}\,.

We can turn our mistake bounds into bounds on the empirical risk and risk achieved by the perceptron algorithm by choosing the loss function \mathit{loss}(\langle w, x\rangle, y)= \mathbb{1}\{ \langle w, x\rangle y < 1\}. These bounds also imply bounds on the (empirical) risk with respect to the zero-one loss, since the prediction error is bounded by the number of margin mistakes.

# Chapter notes

Rosenblatt developed the perceptron in 1957 and continued to publish
on the topic in the years that followed.Rosenblatt,
*Two Theorems of Statistical Separability in the Perceptron*
(United States Department of Commerce, 1958); Rosenblatt, *Principles
of Neurodynamics: Perceptions and the Theory of Brain Mechanisms*
(Spartan, 1962). The perceptron project was funded
by the US Office of Naval Research (ONR), who jointly announced the
project with Rosenblatt in a press conference in 1958, that lead to the
New York Times article we quoted earlier. This development sparked
significant interest in perceptrons research throughout the 1960s.

The simple proof the mistake bound we saw is due to Novikoff.Novikoff,
“On Convergence Proofs on
Perceptrons.” Block is credited with a more
complicated contemporaneous proof.Block,
“The Perceptron: A Model for Brain Functioning,”
*Reviews of Modern Physics* 34, no. 1 (1962):
123. Minsky and Papert attribute a simple analysis
of the convergence guarantees for the perceptron to a 1961 paper by
Papert.Papert,
“Some Mathematical Models of Learning,” in *London
Symposium on Information Theory* (Academic Press, New York,
1961).

Following these developments Vapnik and Chervonenkis proved the
generalization bound for the perceptron method that we saw earlier,
relying on the kind of stability argument that we will return to in our
chapter on generalization. The proof of Theorem 2 is available in their
1974 book.Vapnik
and Chervonenkis, *Theory of Pattern
Recognition*. Interestingly, by the 1970s,
Vapnik and Chervonenkis must have abandoned the stability argument in
favor of the VC-dimension.

In 1969, Minksy and Papert published their influential book
“Perceptrons: An introduction to computational geometry”.Minsky
and Papert, *Perceptrons: An Introduction to Computational
Geometry* (MIT Press, 2017). Among other
results, it showed that perceptrons fundamentally could not learn
certain concepts, like, an XOR of its input bits. In modern language,
linear predictors cannot learn parity functions. The results remain
relevant in the statistical learning community and have been extended in
numerous ways. On the other hand, pragmatic researchers realized one
could just add the XOR to the feature vector and continue to use linear
methods. We will discuss such feature engineering in the next
chapter.

The dominant narrative in the field has it that Minsky and Papert’s book curbed enthusiasm for perceptron research and their multilayer extensions, now better known as deep neural networks. In an updated edition of their book from 1988, Minsky and Papert argue that work on perceptrons had already slowed significantly by the time their book was published for a lack of new results:

One popular version is that the publication of our book so discouraged research on learning in network machines that a promising line of research was interrupted. Our version is that progress had already come to a virtual halt because of the lack of adequate basic theories, […].

On the other hand, the pattern recognition community had realized
that perceptrons were just one way to implement linear predictors.
Highleyman was arguably the first to propose empirical risk minimization
and applied this technique to optical character recognition.Highleyman,
“Linear Decision Functions, with Application to Pattern
Recognition,” *Proceedings of the IRE* 50,
no. 6 (1962): 1501–14. Active research in the 1960s
showed how to find linear rules using linear programming
techniques.Mangasarian,
“Linear and Nonlinear Separation of Patterns by Linear
Programming,” *Operations Research* 13, no. 3 (1965):
444–52. Work by Aizerman, Braverman and Rozonoer
developed iterative methods to fit nonlinear rules to data.Aizerman,
Braverman, and Rozonoer, “The
Robbins-Monro Process and the Method of
Potential Functions,” *Automation and Remote Control* 26
(1965): 1882–85. All of this work was covered in
depth in the first edition of Duda and Hart, which appeared five years
after *Perceptrons*.

It was at this point that the artificial intelligence community first
split from the pattern recognition community. While the artificial
intelligence community turned towards more *symbolic* techniques
in 1970s, work on statistical learning continued in Soviet and IEEE
journals. The modern view of empirical risk minimization, of which we
began this chapter, came out of this work and was codified by Vapnik and
Chervonenkis in the 1970s.

It wasn’t until the 1980s that work on pattern recognition, and with it the tools of the 1960s and earlier, took a stronger foothold in the machine learning community again.Langley, “The Changing Science of Machine Learning” (Springer, 2011). We will continue this discussion in our chapter on datasets and machine learning benchmarks, which were pivotal in the return of pattern recognition to the forefront of machine learning research.

# References

*Automation and Remote Control*26 (1965): 1882–85.

*Reviews of Modern Physics*34, no. 1 (1962): 123.

*Proceedings of the IRE*50, no. 6 (1962): 1501–14.

*Machine Learning*17, no. 2–3 (1994): 115–41.

*Operations Research*13, no. 3 (1965): 444–52.

*Perceptrons: An Introduction to Computational Geometry*. MIT Press, 2017.

*Symposium on the Mathematical Theory of Automata*, 615–22, 1962.

*London Symposium on Information Theory*. Academic Press, New York, 1961.

*Principles of Neurodynamics: Perceptions and the Theory of Brain Mechanisms*. Spartan, 1962.

*Psychological Review*, 1958, 65–386.

*Two Theorems of Statistical Separability in the Perceptron*. United States Department of Commerce, 1958.

*Theory of Pattern Recognition: Statistical Learning Problems*. Nauka, 1974.