# Supervised learning

Previously, we talked about decision theory. Our goal was, broadly speaking, to use available information described by a random variable X to make a decision about an unknown outcome Y.

Decision theory is directly relevant to machine learning. In a prediction problem, Y might be a future outcome and X is an array of available information. In a classification problem, the random variable X might describe an instance of an image and Y a categorical description of the image.

In the important special case of a binary outcome Y, we saw that we can write an optimal predictor \hat Y as a threshold of some scoring function f: \hat Y(x) = \mathbb{1}\{f(x) > t\} If our goal is to maximize the true positive rate of our predictor while constraining its false positive rate, the Neyman-Pearson Lemma applies. It shows that we can take the score function to be a ratio of two likelihood functions.

The optimal score function that comes out of the lemma has a serious limitation in practice, however. To compute the value of the score function, we need to know a probability density function for the positive instances in our problem and also one for the negative instances. But we are often unable to construct or unwilling to assume a particular density function. As a thought experiment, attempt to imagine what a probability density function over images labeled *cat* might look like. Coming up with such a density function appears to be a formidable task, one that’s not intuitively any easier than merely classifying whether an image contains a cat or not.

In this chapter, we transition from decision theory to learning theory. What learning theory adds to decision theory is precisely how we can extract useful score functions from a collection of data points.

# Sample versus population

Let’s take a step back to reflect on the interpretation of the pair of random variables (X, Y) that we’ve worked with so far. We think of the random variables (X, Y) as modeling a population of instances in our classification problem. From this pair of random variables, we can derive other random variables such as a score function f and a predictor \hat Y = \mathbb{1}\{f(X) > t\}. All of these are random variables in the same probability space. When we talk about, say, the true positive rate of the predictor \hat Y, we therefore make a statement about the joint distribution of (X, Y).

In almost all decision making scenarios, however, we do not have access to the entire population of instances that we will encounter. Neither do we have a probability model for the joint distribution of the random variables (X, Y). The joint distribution is a theoretical construct that we can reason about, but it doesn’t readily tell us what to do when we don’t have precise knowledge of the joint distribution.

What knowledge then do we typically have about the underlying population and how can we use it algorithmically to find good predictors? In this chapter we will begin to answer both questions.

First we assume that from past experience we have observed n labeled instances (x_1, y_1),...,(x_n, y_n). We assume that each data point (x_i, y_i) is a draw from the same underlying distribution (X, Y). Moreover, we will often assume that the data points are drawn independently. This pair of assumptions is often called the “i.i.d. assumption,” a shorthand for *independent and identically distributed*.

To give an example, consider a population consisting of all currently eligible voters in the United States and some of their features, such as, age, income, state of residence etc. An *i.i.d. sample* from this population would correspond to a repeated sampling process that selects a uniformly random voter from the entire reference population.

Sampling is a difficult problem with numerous pitfalls that can strongly affect the performance of statistical estimators and the validity of what we learn from data. In the voting example, individuals might be unreachable or decline to respond. Even defining a good population for the problem we’re trying to solve is often tricky. Populations can change over time. They may depend on a particular social context, geography, or may not be neatly characterized by formal criteria. Task yourself with the idea of taking a random sample of spoken sentences in the English language, for example, and you will quickly run into these issues.

In this chapter, as is common in learning theory, we largely ignore these important issues. We instead focus on the significant challenges that remain even if we have a well-defined population and an unbiased sample from it.

# Supervised learning

*Supervised learning* is the prevalent method for constructing classifiers from data. The essential idea is very simple. We assume we have labeled data, in this context also called *training examples*, of the form (x_1,y_1), ..., (x_n, y_n), where each *example* is a pair (x_i,y_i) of an *instance* x_i and a corresponding *label* y_i. The word *supervision* refers to the availability of these labels.

Given such a collection of labeled data points, supervised learning turns the task of finding a good classifier into an optimization problem involving these data points. This optimization problem is called *empirical risk minimization*.

Recall that when we had full knowledge of the joint distribution of (X,Y), we aimed to minimize the risk of a decision rule. The risk is equal to the expected value of a *loss function* that measures the expected discrepancy between the predicted and true labels. For binary prediction problems, there are four possible pairs of labels corresponding to true positives, false positives, true negatives, and false negatives. In this case, the loss function boils down to specifying a cost to each of the four possibilities. More generally, a loss function is a function \ell\colon\mathcal{Y}\times\mathcal{Y}\to\mathbb{R}\,. The risk was defined as the expectation
R[f] = \mathop\mathbb{E}\left[ \ell (f(X), Y) \right]\,.
We will now see its finite sample counterpart.

Given a set of labeled data points S=((x_1,y_1),...,(x_n, y_n)), the *empirical risk* of a classifier f\colon \mathcal{X}\to\mathcal{Y} with respect to the sample S is defined as
R_S[f] = \frac1n \sum_{i=1}^n \mathbb{\ell}( f(x_i), y_i )\,.

*Empirical risk minimization* is the optimization problem of finding a classifier in a given function family that minimizes the empirical risk:
\min_{f\in\mathcal{F}} R_S[f]
Empirical risk serves as a proxy objective for the risk. Whereas the risk R[f] is a population quantity—that is, a property of the joint distribution (X, Y) and our classifier f—the empirical risk is a *sample quantity*. Note that when the samples are drawn i.i.d., the empirical risk is the sample average of the risk, and in this case by the central limit theorem, we’d expect the sample risk to closely approximate the risk for a fixed predictor f. Regardless of the distribution of S, however, note that we can always compute the empirical risk R_S[f] entirely from the sample S and the classifier f. Put differently, it’s a quantity we can compute from samples alone.

There is a tautology relating risk and empirical risk that is good to keep in mind:
R[f] = R_S[f] + (R[f] - R_S[f])
Although mathematically trivial, the tautology reveals an important insight. If we somehow find a classifier f that achieves small empirical risk R_S[f], we’re left worrying about the term R[f]-R_S[f], which quantifies how much the empirical risk of f underestimates its risk. We call this difference *generalization gap* and it is of fundamental importance to machine learning. Intuitively speaking, it tells us how well the performance of our classifier transfers from seen examples (the training examples) to unseen examples (a fresh example from the population) drawn from the same distribution. This process is called *generalization*.

Generalization is not the only goal of supervised learning. A constant classifier that always outputs 0 generalizes perfectly well, but is almost always entirely useless. What we also need is that the classifier achieves small empirical risk R_S[f]. Making the empirical risk small is fundamentally about *optimization*. As a consequence, a large part of supervised learning deals with optimization. For us to be able to talk about optimization, we need to commit to a *representation* of the function class \mathcal{F} that appears in the empirical risk minimization problem. The representation of the function class, as well as the choice of a suitable loss function, determines whether or not we can efficiently find an empirical risk minimizer.

To summarize, introducing empirical risk minimization directly leads to three important questions that we will work through in turn.

**Representation:**What is the class of functions \mathcal{F} we should choose?**Optimization:**How can we efficiently solve the resulting optimization problem?**Generalization:**Will the performance of classifier transfer gracefully from seen training examples to unseen instances of our problem?

These three questions are intertwined. Machine learning is not so much about studying these questions in isolation as it is about the often delicate interplay between them. Our choice of representation influences both the difficulty of optimization and our generalization performance. Improvements in optimization may not help, or could even hurt, generalization. Moreover, there are aspects of the problem that don’t neatly fall into only one of these categories. The choice of the loss function, for example, affects all of the three questions above.

There are important differences between the three questions. Results in optimization, for example, tend to be independent of the statistical assumptions about the data generating process. We will see a number of different optimization methods that under certain circumstances find either a global or local minimum of the empirical risk objective. In contrast, to reason about generalization, we need some assumptions about the data generating process. The most common one is the i.i.d.-assumption we discussed earlier. We will also see several mathematical frameworks for reasoning about the gap between risk and empirical risk.

Let’s start with a foundational example that illustrates these core concepts and their interplay.

# A first learning algorithm: The perceptron

As we discussed in the introduction, in 1958 the New York Times reported the Office of Naval Research claiming the perceptron algorithmRosenblatt, “The Perceptron: A Probabilistic Model for Information Storage and Organization in the Brain,” *Psychological Review*, 1958, 65–386. would “be able to walk, talk, see, write, reproduce itself and be conscious of its existence.” Let’s now dive into this algorithm that seemed to have such unbounded potential.

Toward introducing this algorithm, let’s assume we’re in a binary classification problem with labels in \{-1,1\} for notational convenience. The perceptron algorithm aims to find a *linear separator* of the data, that is, a hyperplane specified by coefficients w\in\mathbb{R}^d that so that all positive examples lie on one side of the hyperplane and all negative ones on the other.Illustration of a linear separator.

Formally, we can express this as y_i\langle w, x_i\rangle > 0. In other words, the linear function f(x)=\langle w, x\rangle agrees in sign with the labels on all training instances (x_i, y_i). In fact, the perceptron algorithm will give us a bit more. Specifically, we require that the sign agreement has some *margin* y_i\langle w, x_i\rangle \ge 1. That is, when y=1, the linear function must take on a value of at least 1 and when y=-1, the linear function must be at most -1. Once we find such a linear function, our decision rule for new data is \hat{Y}(x) = \mathbb{1}\{ \langle w, x \rangle \geq 0\}.

The algorithm goes about finding w in the following way:

**Perceptron**

- Start from the initial solution w_0=0
- At each step t=0,1,2,...:
- Select a random index i\in\{1,...,n\}
- Case 1: If y_i\langle w_t, x_i\rangle < 1, put w_{t+1} = w_t + y_i x_i
- Case 2: Otherwise put w_{t+1} = w_t.

Case 1 corresponds to what’s called a *margin mistake*. The sign of the linear function may not disagree with the label, but it doesn’t have the required margin that we asked for.

It’s common to introduce two *hyperparameters* to the algorithm by considering the alternative update rule:
w_{t+1} = \gamma w_t + \eta y_i x_i
Here \eta is a positive scalar called a *learning rate* and \gamma\in [0,1] is called the *forgetting rate*.

Can we represent the perceptron algorithm as an instance of empirical risk minimization? The answer is that we can and it is instructive to see how.

First, it’s clear from the description that we’re looking for a linear separator. Hence, our function class is the set of linear functions f(x)=\langle w, x\rangle, where w\in\mathbb{R}^d. We will sometimes call the vector w the *weight vector* or vector of *model parameters*.

An optimization method that picks a random example at each step and makes an improvement to the model parameters is the popular stochastic gradient method specified by the update rule: w_{t+1} = w_t - \eta\nabla_{w_t} \ell(f(x_i), y_i) Here, \nabla \ell(f(x_i), y_i) is the gradient of the loss function with respect to the model parameters w_t on a randomly chosen example (x_i, y_i). We will typically drop the vector w_t from the subscript of the gradient when it’s clear from the context. The scalar \eta>0 is a step size parameter that we will discuss more carefully later. For now, think of it as a small constant.

Consider the loss function

\ell(y, \langle w, x\rangle) = \max(1-y\langle w, x\rangle,\; 0)\,.

This loss function is called *hinge loss*.An illustration of the hinge loss on example (x, y) as a function of y\langle w, x\rangle. Note that its gradient is -yx when y\langle w, x\rangle < 1 and 0 when y\langle w, x\rangle >1.Note that the gradient is not defined when y\langle w, x\rangle=1. The loss function is not differentiable everywhere. This is why technically speaking the stochastic gradient method operates with what is called a *subgradient*. The mathematical theory of subgradient optimization rigorously justifies calling the gradient 0 when y\langle w, x\rangle=1.

We can see that the hinge loss gives us part of the update rule in the perceptron algorithm. The other part comes from adding a weight penalty \frac\lambda 2\Vert w\Vert^2 to the loss function that discourages the weights from growing out of bounds. This weight penalty is called *\ell_2-regularization*, *weight decay*, or *Tikhonov regularization* depending on which field you work in. The purpose of regularization is to promote generalization. We will therefore return to regularization in detail when we discuss generalization in more depth. For now, note that the margin constraint we introduced is inconsequential unless we penalize large vectors. Without the weight penalty we could simply scale up any linear separator until it separates the points with the desired margin.

Putting the two loss functions together, we get the \ell_2-regularized empirical risk minimization problem for the hinge loss:

\frac{1}{n}\sum_{i=1}^n \max(1-y_i\langle w, x_i\rangle,\; 0) + \frac\lambda 2 {\Vert w \Vert_2}^2

The perceptron algorithm corresponds to solving this empirical risk objective with the stochastic gradient method. The constant \eta, which we dubbed the learning rate, is the step size of the stochastic gradient methods. The forgetting rate constant \gamma is equal to (1-\eta \lambda). The optimization problem is also known as *support vector machine* and we will return to it later on.

## A word about surrogate losses

In decision theory, when the goal was to maximize the accuracy of a predictor, we directly solved the risk minimization with respect to the *zero-one loss*
\ell(y, z)=\mathbb{1}\{y\ne z\}
that gives us penalty 1 if our label is incorrect, and penalty 0 if our predicted label z matches the true label y. We saw that the optimal classifier in this case was a *maximum a posteriori* rule, where we selected the label with higher posterior probability

Why don’t we directly solve empirical risk minimization with respect to the zero-one loss? The reason is that the *empirical* risk with the zero-one loss is generally difficult to optimize directly. In fact, this optimization problem is NP-hard even for linear classification rules.It’s worth referring back to the derivation of the likelihood ratio test to see why similar rules are impossible to compute from the empirical risk. In particular, that construction relies heavily on iterating conditional expectations, which cannot be done from samples. Take a moment to convince yourself that the stochastic gradient method, for example, fails entirely on the zero-one loss objective.

The hinge loss therefore serves a *surrogate loss* for the zero-one loss. We hope that by optimizing the hinge loss, we end up optimizing the zero-one loss as well.A comparison of hinge loss versus zero-one loss. The hinge loss is not the only reasonable choice. There are numerous loss functions that approximate the zero-one loss in different ways.

- The
*hinge loss*is \max\{1-yz, 0\} and*support vector machine*refers to empirical risk minimization with the hinge loss and \ell_2-regularization. This is what the perceptron is optimizing. - The
*squared loss*is given by \frac12(y-z)^2. Linear least squares regression corresponds to empirical risk minimization with the squared loss. - The
*logistic loss*is -\log(\sigma(z)) when y=1 and -\log(1-\sigma(z)) when y=-1, where \sigma(z) = 1/(1+\exp(-z)) is the logistic function.*Logistic regression*corresponds to empirical risk minimization with the logistic loss.

Sometimes we can theoretically relate empirical risk minimization under a surrogate loss to the zero-one loss. In general, however, these loss functions are used heuristically and performance is evaluated by trial-and-error.

## Formal guarantees for the perceptron

We saw that the perceptron corresponds to finding a linear classifier using the stochastic gradient method. What we haven’t seen yet is a proof that the perceptron method works and under what conditions. Recall that there are two questions we need to address. The first is why the perceptron method successfully fits the training data, a question about *optimization*. The second is why the solution should also correctly classify unseen examples drawn from the same distribution, a question about *generalization*. We will address each in turn with results from the 1960s. Even though the analysis here is over 50 years old, it has all of the essential parts of more recent theoretical arguments in machine learning.

## Mistake bound

To see why we perform well on the training data, we use a *mistake bound* due to Novikoff.Novikoff, “On Convergence Proofs on Perceptrons,” in *Proceedings of the Symposium on the Mathematical Theory of Automata*, 1962, 615–22. The bound shows that if there exists a linear separator of the training data, then the perceptron will find it quickly provided that the *margin* of the separating hyperplane isn’t too small.

Margin is a simple way to evaluate how well a classifier separates data. For a fixed vector w\in \mathbb{R}^d, w defines a hyperplane \mathcal{H}_w = \{x~:~w^Tx=0\}. Suppose that the hyperplane \mathcal{H}_w corresponding to the vector w perfectly separates the data in S. Then we define the margin of such a vector w as the smallest distance of our data points to this hyperplane: \gamma(S,w) = \min_{1\le i\le n} \operatorname{dist}(x_i,\mathcal{H}_w )\,. Here, \operatorname{dist}(x, \mathcal{H}_w) = \min\{ \|x - x'\| \colon x'\in \mathcal{H}_w\} = \frac{|\langle x, w\rangle|}{\|w\|}.

Overloading terminology, we define the margin of a *dataset* to be the maximum margin achievable by any classifier w:
\gamma(S) = \max_{\Vert w \Vert =1} \gamma(S,w)\,.
We will now show that when a dataset has a large margin, the perceptron algorithm will find a separating hyperplane quickly.

Let’s consider the simplest form of the perceptron algorithm with \gamma and \eta both equal to 1. We initialize the algorithm with w_0=0. The algorithm proceeds by selecting a random index i_t at step t checking whether y_{i_t} w_t^Tx_{i_t} < 1. If this condition is true, we say that the algorithm made a *margin mistake*, i.e., the prediction w_t^T x_{i_t} is either wrong or too close to the hyperplane. If a margin mistake occurs, the perceptron performs the update
w_{t+1} = w_t + y_{i_t} x_{i_t}\,.
That is, we rejigger the hyperplane to be more aligned with the signed direction of the mistake. If no margin mistake mistakes, then w_{t+1}=w_t.

We can summarize a worst case analysis of the perceptron algorithm with the following theorem.

Define the diameter of the data S to be D(S) = \max_{(x,y)\in S}\|x\|\,. The perceptron algorithm makes at most (2+D(S)^2)\gamma(S)^{-2} margin mistakes on any sequence of examples S that can be perfectly classified by a linear separator.

The main idea behind the proof of this theorem is that since w only changes when you make a mistake, we can upper bound and lower bound w at each time a mistake is made, and then, by comparing these two bounds, compute an inequality on the total number of mistakes.

To find an upper bound, suppose that at step t the algorithm makes a margin mistake. We then have the inequality: \begin{aligned} \Vert w_{t+1} \Vert^2 &= \Vert w_t +y_{i_t} x_{i_t}\Vert^2 \\ & \leq \Vert w_t \Vert^2 + 2 y_{i_t} \langle w_t, x_{i_t}\rangle + \Vert x_{i_t} \Vert^2 \\ &\leq \Vert w_t\Vert^2 + 2 + D(S)^2 \,. \end{aligned} The final inequality uses the fact that y_{i_t} \langle w_t, x_{i_t}\rangle<1. Now, let m_t denote the total number of mistakes made by the perceptron in the first t iterations. Summing up the above inequality over all the mistakes we make and using the fact that \|w_0\|=0, we get our upper bound on the norm of w_t: \Vert w_{t} \Vert \leq \sqrt{m_t(2+ D(S)^2)}\,.

Working toward a lower bound on the norm of w_t, we will use the following argument. Let w be any unit vector that correctly classifies all points in S. If we make a mistake at iteration t, we have \langle w, w_{t+1}-w_t\rangle = \langle w, y_{i_t}x_{i_t}\rangle = \frac{|\langle w, x_{i_t}\rangle|}{\|w\|} \geq \gamma(S, w)\,. Note that the second equality here holds because w correctly classifies the point (x_{i_t}, y_{i_t}).Convince yourself that this step fails if w does not perfectly separate the data. This is where we use that the data are linearly separable. The inequality follows from the definition of margin.

Now, let w_\star denote the hyperplane that achieves the maximum margin \gamma(S). Instantiating the previous argument with w_\star, we find that \Vert w_t \Vert \geq \langle w_\star, w_t\rangle = \sum_{k=1}^t w_\star^T (w_{k}-w_{k-1}) \geq m_t\gamma(S)\,, where the equality follows from a telescoping sum argument.Telescoping sums are a powerful trick used throughout the analysis of optimization algorithms. The telescoping sum lets us understand the behavior of the final iterate by decomposing it into the incremental updates of the individual iterations.

This yields the desired lower bound on the norm of w_t. Combined with the upper bound we already derived, it follows that the total number of mistakes has the bound m_t \leq \frac{2+D(S)^2}{\gamma(S)^2}\,, as we had hoped to show.

The mistake bound does not depend on the dimension of the data. This is appealing since the requirement of linear separability and high margin, intuitively speaking, become less taxing the larger the dimension is.

An interesting consequence of this theorem is that if we run the perceptron repeatedly over the same dataset, we will eventually end up with a separating hyperplane. To see this, imagine repeatedly running over the dataset until no mistake occurred on a full pass over the data. The mistake bound gives a bound on the number of passes required before the algorithm terminates.

## From mistake bounds to generalization

The previous analysis shows that the perceptron finds a good classifier on the training data. What can we say about new data that we have not yet seen?

To talk about generalization, we need to make a statistical assumption about the data generating process. Specifically we assume that the data points in the training set S=\{(x_1,y_1)\ldots, (x_n,y_n) \} where each drawn i.i.d. from a fixed underlying distribution \mathcal{D} with the labels taking values \{ -1,1 \} and each x_i \in \mathbb{R}^d.

We know that the perceptron finds a good linear classifier for the training data (if it exists). What we now show is that this classifier also works on new data drawn from the same distribution.

To analyze what happens on new data, we will employ a powerful *stability* argument. Put simply, an algorithm is stable if the effect of removing or replacing a single data point is small. We will do a deep dive on stability in our chapter on generalization, but we will have a first encounter with the idea here.

The perceptron is stable because it makes a bounded number of mistakes. If we remove a data point where no mistake is made, the model doesn’t change at all. In fact, it’s as if we had *never seen* the data point. This lets us relate the performance on seen examples to the performance on examples in the training data on which the algorithm never updated.

Vapnik and Chervonenkis presented the following stability argument in their classic text from 1974, though the original argument is likely a decade older.Vapnik and Chervonenkis, *Theory of Pattern Recognition: Statistical Learning Problems* (Moscow: Nauka, 1974). Their main idea was to leverage our assumption that the data are i.i.d., so we can swap the roles of training and test examples in the analysis.

Let S_n denote a training set of n i.i.d. samples from a distribution \mathcal{D} that has a perfect linear separator. Let w(S) be the output of the perceptron on a dataset S after running until the hyperplane makes no more margin mistakes on S. Let Z=(X,Y) be an additional independent sample from \mathcal{D}. Then, the probability of making a margin mistake on (X,Y) satisfies the upper bound \mathop\mathbb{P}[Y w(S_n)^T X < 1] \leq \frac{1}{n+1} {\mathop\mathbb{E}}_S\left[ \frac{2+D(S_{n+1})^2}{\gamma(S_{n+1})^2} \right]\,.

First note that \mathop\mathbb{P}[Y w^T X < 1] = \mathop\mathbb{E}[\mathbb{1}\{Yw^T X < 1\}]\,.

Let S_n=(Z_1, ... , Z_n) and with Z_k=(X_k, Y_k) and put Z_{n+1}=Z=(X,Y). Note that these n+1 random variables are i.i.d. draws from \mathcal{D}. As a purely analytical device, consider the “leave-one-out set” S^{-k}=\{Z_1,\dots,Z_{k-1},Z_{k+1},...,Z_{n+1}\}\,. Since the data are drawn i.i.d., running the algorithm on S^{-k} and evaluating it on Z_k=(X_k,Y_k) is equivalent to running the algorithm on S_n and evaluating it on Z_{n+1}. These all correspond to the same random experiment and differ only in naming.

In particular, we have \mathop\mathbb{P}[Y w(S_n)^T X < 1] = \frac1{n+1}\sum_{k=1}^{n+1} \mathop\mathbb{E}[\mathbb{1}\{Y_k w(S^{-k})^T X_k < 1\}]\,. Indeed, we’re averaging quantities that are each identical to the left hand side.

But recall from our previous result that the perceptron makes at most m=\frac{2+D((Z_1,\dots,Z_{n+1}))^2}{\gamma((Z_1,\dots,Z_{n+1}))^2} margin mistakes when run on the entire sequence (Z_1,\dots,Z_{n+1}). Let i_1,\dots,i_m denote the indices on which the algorithm makes a mistake in any of its cycles over the data. If k\not\in\{i_1,\dots,i_m\}, the output of the algorithm remains the same after we remove the k-th sample from the sequence. It follows that such k satisfy Y_k w(S^{-k})X_k \geq 1 and therefore k does not contribute to the summation above. The other terms can at most contribute 1 to the summation. Hence, \sum_{k=1}^{n+1}\mathbb{1}\{Y_k w(S^{-k})^T X_k < 1\} \le m\,, and by linearity of expectation, \mathop\mathbb{P}[Y w(S_n)^T X < 1] \le \frac{\mathop\mathbb{E}[m]}{n+1}\,, which is what we wanted to prove.

We can turn our mistake bounds into bounds on the empirical risk and risk achieved by the perceptron algorithm by choosing the loss function \ell(\langle w, x\rangle, y)= \mathbb{1}\{ \langle w, x\rangle y < 1\}. These bounds also imply bounds on the (empirical) risk with respect to the zero-one loss, since the classification error is bounded by the number of margin mistakes.

# Chapter notes

Rosenblatt developed the perceptron in 1957 and continued to publish on the topic in the years that followed.Rosenblatt, *Two Theorems of Statistical Separability in the Perceptron* (United States Department of Commerce, 1958); Rosenblatt, “Principles of Neurodynamics: Perceptions and the Theory of Brain Mechanisms,” 1962. The perceptron project was funded by the US Office of Naval Research (ONR), who jointly announced the project with Rosenblatt in a press conference in 1958, that lead to the New York Times article we quoted earlier. This development sparked significant interest in perceptrons research throughout the 1960s.

The simple proof the mistake bound we saw is due to Novikoff.Novikoff, “On Convergence Proofs on Perceptrons.” Block is credited with a more complicated contemporaneous proof.Block, “The Perceptron: A Model for Brain Functioning,” *Reviews of Modern Physics* 34, no. 1 (1962): 123. Minsky and Papert attribute a simple analysis of the convergence guarantees for the perceptron to a 1961 paper by Papert.Papert, “Some Mathematical Models of Learning,” in *London Symposium on Information Theory* (Academic Press, New York, 1961).

Following these developments Vapnik and Chervonenkis proved the generalization bound for the perceptron method that we saw earlier, relying on the kind of stability argument that we will return to in our chapter on generalization. The proof of is available in their 1974 book.Vapnik and Chervonenkis, *Theory of Pattern Recognition*. Interestingly, by the 1970s, Vapnik and Chervonenkis must have abandoned the stability argument in favor of the VC-dimension.

In 1969, Minksy and Papert published their influential book “perceptrons: an introduction to computational geometry.”Minsky and Papert, *Perceptrons: An Introduction to Computational Geometry* (MIT press, 2017). Among other results, it showed that perceptrons fundamentally could not learn certain concepts, like, an XOR of its input bits. In modern language, linear classifiers cannot learn parity functions. The results remain relevant in the statistical learning community and have been extended in numerous ways. On the other hand, pragmatic researchers realized one could just add the XOR to the feature vector and continue to use linear methods. We will discuss such feature engineering in the next chapter.

The dominant narrative in the field has it that Minsky and Papert’s book curbed enthusiasm for perceptron research and their multilayer extensions, now better known as deep neural networks. In an updated edition of their book from 1988, Minsky and Papert argue that work on perceptrons had already slowed significantly by the time their book was published for a lack of new results:

One popular version is that the publication of our book so discouraged research on learning in network machines that a promising line of research was interrupted. Our version is that progress had already come to a virtual halt because of the lack of adequate basic theories, […].

On the other hand, the pattern recognition community had realized that perceptrons were just one way to implement linear decision rules. Highleyman was arguably the first to propose empirical risk minimization and applied this technique to optical character recognition.Highleyman, “Linear Decision Functions, with Application to Pattern Recognition,” *Proceedings of the IRE* 50, no. 6 (1962): 1501–14. Active research in the 1960s showed how to find linear rules using linear programming techniques,Mangasarian, “Linear and Nonlinear Separation of Patterns by Linear Programming,” *Operations Research* 13, no. 3 (1965): 444–52. and work by Aizerman, Braverman and Rozonoer developed iterative methods to fit nonlinear rules to data.Aizerman, Braverman, and Rozonoer, “The Robbins-Monro Process and the Method of Potential Functions,” *Automation and Remote Control* 26 (1965): 1882–85. All of this work was covered in depth in the first edition of Duda and Hart, which appeared five years after *perceptrons*.

It was at this point that the artificial intelligence community first split from the pattern recognition community. While the artificial intelligence community turned towards more *symbolic* techniques in 1970s, work on statistical learning continued in Soviet and IEEE journals. The modern view of empirical risk minimization, of which we began this chapter, came out of this work and was codified by Vapnik and Chervonenkis in the 1970s.

It wasn’t until the 1980s that work on pattern recognition, and with it the tools of the 1960s and earlier, took a stronger foothold in the machine learning community again.Langley, “The Changing Science of Machine Learning” (Springer, 2011). We will continue this discussion in our chapter on datasets and machine learning benchmarks, which were pivotal in the return of pattern recognition to the forefront of machine learning research.

# References

*Automation and Remote Control*26 (1965): 1882–85.

*Reviews of Modern Physics*34, no. 1 (1962): 123.

*Proceedings of the IRE*50, no. 6 (1962): 1501–14.

*Operations Research*13, no. 3 (1965): 444–52.

*Perceptrons: An Introduction to Computational Geometry*. MIT press, 2017.

*Proceedings of the Symposium on the Mathematical Theory of Automata*, 615–22, 1962.

*London Symposium on Information Theory*. Academic Press, New York, 1961.

*Psychological Review*, 1958, 65–386.

*Two Theorems of Statistical Separability in the Perceptron*. United States Department of Commerce, 1958.

*Theory of Pattern Recognition: Statistical Learning Problems*. Moscow: Nauka, 1974.