# Fundamentals of prediction

*Prediction* is the art and science of leveraging patterns
found in natural and social processes to conjecture about uncertain
events. We use the word *prediction* broadly to refer to
statements about things we don’t know for sure *yet*, including
but not limited to the outcome of future events.

Machine learning is to a large extent the study of algorithmic prediction. Before we can dive into machine learning, we should familiarize ourselves with prediction. Starting from first principles, we will motivate the goals of prediction before building up to a statistical theory of prediction.

We can formalize the goal of prediction problems by assuming a population of N instances with a variety of attributes. We associate with each instance two variables, denoted X and Y. The goal of prediction is to conjecture a plausible value for Y after observing X alone. But when is a prediction good? For that, we must quantify some notion of the quality of prediction and aim to optimize that quantity.

To start, suppose that for each variable X we make a deterministic prediction f(X) by means of some prediction function f. A natural goal is to find a function f that makes the fewest number of incorrect predictions, where f(X)\ne Y, across the population. We can think of this function as a computer program that reads X as input and outputs a prediction f(X) that we hope matches the value Y. For a fixed prediction function, f, we can sum up all of the errors made on the population. Dividing by the size of the population, we observe the average (or mean) error rate of the function.

## Minimizing errors

Let’s understand how we can find a prediction function that makes as few errors as possible on a given population in the case of binary prediction, where the variable Y has only two values.

Consider a population of Abalone, a type of marine snail with colorful shells featuring a varying number of rings. Our goal is to predict the sex, male or female, of the Abalone from the number of rings on the shell. We can tabulate the population of Abalone by counting for each possible number of rings, the number of male and female instances in the population.

From this way of presenting the population, it is not hard to compute
the predictor that makes the fewest mistakes. For each value on the
X-axis, we predict “female” if the number of female instances with this
X-value is larger than the number of male instances. Otherwise, we
predict “male” for the given X-value. For example, there’s a majority of
male Abalone with seven rings on the shell. Hence, it makes sense to
predict “male” when we see seven rings on a shell. Scrutinizing the
figure a bit further, we can see that the best possible predictor is a
*threshold function* that returns “male” whenever the number of
rings is at most 8, and “female” whenever the number of rings is greater
or equal to 9.

The number of mistakes our predictor makes is still significant. After all, most counts are pretty close to each other. But it’s better than random guessing. It uses whatever there is that we can say from the number of rings about the sex of an Abalone snail, which is just not much.

What we constructed here is called the *minimum error rule*.
It generalizes to multiple attributes. If we had measured not only the
number of rings, but also the length of the shell, we would repeat the
analogous counting exercise over the two-dimensional space of all
possible values of the two attributes.

The minimum error rule is intuitive and simple, but computing the rule exactly requires examining the entire population. Tracking down every instance of a population is not only intractable. It also defeats the purpose of prediction in almost any practical scenario. If we had a way of enumerating the X and Y value of all instances in a population, the prediction problem would be solved. Given an instance X we could simply look up the corresponding value of Y from our records.

What’s missing so far is a way of doing prediction that does not require us to enumerate the entire population of interest.

# Modeling knowledge

Fundamentally, what makes prediction without enumeration possible is
*knowledge* about the population. Human beings organize and
represent knowledge in different ways. In this chapter, we will explore
in depth the consequences of one particular way to represent
populations, specifically, as *probability distributions*.

The assumption we make is that we have knowledge of a probability
distribution p(x,y) over pairs
of X and Y values. We assume that this distribution
conceptualizes the “typical instance” in a population. If we were to
select an instance uniformly at random from the population, what
relations between its attributes might we expect? We expect that a
uniform sample from our population would be the same as a sample
from p(x,y). We call such a
distribution a *statistical model* or simply *model* of a
population. The word *model* emphasizes that the distribution
isn’t the population itself. It is, in a sense, a sketch of a population
that we use to make predictions.

Let’s revisit our Abalone example in probabilistic form. Assume we know the distribution of the number of rings of male and female Abalone, as illustrated in the figure.

Both follow a skewed normal distribution described by three parameters each, a location, a scale, and a skew parameter. Knowing the distribution is to assume that we know these parameters. Although the specific numbers won’t matter for our example, let’s spell them out for concreteness. The distribution for male Abalone has location 7.4, scale 4.48, and skew 3.12, whereas the distribution for female Abalone has location 7.63, scale 4.67, and skew 4.34. To complete the specification of the joint distribution over X and Y, we need to determine the relative proportion of males and females. Assume for this example that male and female Abalone are equally likely.

Representing the population this way, it makes sense to predict “male” whenever the probability density for male Abalone is larger than that for female Abalone. By inspecting the plot we can see that the density is higher for male snails up until 8 rings at which point it is larger for female instances. We can see that the predictor we derive from this representation is the same threshold rule that we had before.

We arrived at the same result without the need to enumerate and count all possible instances in the population. Instead, we recovered the minimum error rule from knowing only 7 parameters, three for each conditional distribution, and one for the balance of the two classes.

Modeling populations as probability distributions is an important step in making prediction algorithmic. It allows us to represent populations succinctly, and gives us the means to make predictions about instances we haven’t encountered.

Subsequent chapters extend these fundamentals of prediction to the
case where we don’t know the exact probability distribution, but only
have a random sample drawn from the distribution. It is tempting to
think about machine learning as being all about *that*, namely
what we do with a sample of data drawn from a distribution. However, as
we learn in this chapter, many fundamentally important questions arise
even if we have full knowledge of the population.

## Prediction from statistical models

Let’s proceed to formalize prediction assuming we have full knowledge of a statistical model of the population. Our first goal is to formally develop the minimum error rule in greater generality.

We begin with binary prediction where we suppose Y has two alternative values, 0 and 1. Given some measured information X, our goal is to conjecture whether Y equals zero or one.

Throughout we assume that X
and Y are random variables drawn
from a joint probability distribution. It is convenient both
mathematically and conceptually to specify the joint distribution as
follows. We assume that Y has
*a priori* (or *prior*) probabilities:
p_0 = \mathop\mathbb{P}[Y=0]\,,
\qquad p_1 = \mathop\mathbb{P}[Y=1]
That is, we assume we know the proportion of instances
with Y=1 and Y=0 in the population. We’ll always model
available information as being a random vector X with support in \mathbb{R}^d. Its distribution depends on
whether Y is equal to zero or one.
In other words, there are two different statistical models for the data,
one for each value of Y. These
models are the conditional probability densities of X given a value y for Y, denoted p(x
\mid Y=y). This density function is often called a *generative
model* or *likelihood function* for each scenario.

## Example: signal versus noise

For a simple example with more mathematical formalism, suppose that when Y=0 we observe a scalar X=\omega where \omega is unit-variance, zero mean Gaussian noise \omega\sim \mathcal{N}(0,1). Recall that the Gaussian distribution of mean \mu and variance \sigma^2 is given by the density \frac{1}{\sigma\sqrt{2\pi}}e^{-\frac12\left(\frac{x-\mu}{\sigma}\right)^2}.

Suppose when Y=1, we would
observe X=s+\omega for some
scalar s. That is, the conditional
densities are
\begin{aligned}
p(x\mid Y=0) &= \mathcal{N}(0,1) \,, \\
p(x\mid Y=1) &= \mathcal{N}(s,1)\,.
\end{aligned}
The larger the shift s is,
the easier it is to predict whether Y=0 or Y=1. For example, suppose s=10 and we observed X=11. If we had Y=0, the probability that the observation is
greater than 10 is on the order
of 10^{-23}, and hence we’d likely
think we’re in the alternative scenario where Y=1. However, if s were very close to zero, distinguishing
between the two alternatives is rather challenging. We can think of a
small difference s that we’re
trying to detect as a *needle in a haystack*.

# Prediction via optimization

Our core approach to all statistical decision making will be to
formulate an appropriate optimization problem for which the decision
rule is the optimal solution. That is, we will optimize over
*algorithms*, searching for functions that map data to decisions
and predictions. We will define an appropriate notion of the cost
associated to each decision, and attempt to construct decision rules
that minimize the expected value of this cost. As we will see, choosing
this optimization framework has many immediate consequences.

## Predictors and labels

A *predictor* is a function \hat Y(x) that maps an input x to a prediction \hat y=\hat Y(x). The prediction \hat y is also called a *label* for
the point x. The target
variable Y can be both real valued
or discrete. When Y is a discrete
random variable, each different value it can take on is called a
*class* of the prediction problem.

To ease notation, we take the liberty to write \hat Y as a shorthand for the random variable \hat Y(X) that we get by applying the prediction function \hat Y to the random variable X.

The most common case we consider through the book is binary prediction, where we have two classes, 0 and 1. Sometimes it’s mathematically convenient to instead work with the numbers -1 and 1 for the two classes.

In most cases we consider, labels are scalars that are either discrete or real-valued. Sometimes it also makes sense to consider vector-valued predictions and target variables.

The creation and encoding of suitable labels for a prediction problem is an important step in applying machine learning to real world problems. We will return to it multiple times.

## Loss functions and risk

The final ingredient in our formal setup is a *loss function*
which generalizes the notion of an error that we defined as a mismatch
between prediction and target value.

A *loss function* takes two inputs, \hat y and y, and returns a real number \mathit{loss}(\hat y, y) that we interpret as
a quantified loss for predicting \hat
y when the target is y. A
loss could be negative in which case we think of it as a reward.

A prediction error corresponds to the loss function \mathit{loss}(\hat y, y)=\mathbb{1}\{\hat y \ne y\} that indicates disagreement between its two inputs. Loss functions give us modeling flexibility that will become crucial as we apply this formal setup throughout this book.

An important notion is the expected loss of a predictor taken over a
population. This construct is called *risk*.

We define the *risk* associated with \hat{Y} to be
R[\hat{Y}] := \mathop\mathbb{E}[\mathit{loss}(\hat{Y}(X),Y)]\,.
Here, the expectation is taken jointly over X and Y.

Now that we defined risk, our goal is to determine which decision rule minimizes risk. Let’s get a sense for how we might go about this.

In order to minimize risk, theoretically speaking, we need to solve
an *infinite dimensional* optimization problem over binary-valued
functions. That is, *for every* x, we need to find a binary assignment.
Fortunately, the infinite dimension here turns out to not be a problem
analytically once we make use of the law of iterated expectation.

We claim that the optimal predictor is given by \hat {Y}(x) = \mathbb{1}\left\{\mathop\mathbb{P}[Y=1\mid X=x] \geq \frac{\mathit{loss}(1,0)-\mathit{loss}(0,0)}{\mathit{loss}(0,1)-\mathit{loss}(1,1)} \,\, \mathop\mathbb{P}[Y=0\mid X=x]\right\}\,.

This rule corresponds to the intuitive rule we derived when thinking about how to make predictions over the population. For a fixed value of the data X=x, we compare the frequency of which Y=1 occurs to which Y=0 occurs. If this frequency exceeds some threshold that is defined by our loss function, then we set \hat{Y}(x)=1. Otherwise, we set \hat{Y}(x)=0.

To see why this rule is optimal, we make use of the law of iterated expectation: \begin{aligned} \mathop\mathbb{E}[\mathit{loss}(\hat{Y}(X),Y)] &= \mathop\mathbb{E}\left[ \mathop\mathbb{E}\left[\mathit{loss}(\hat{Y}(X),Y)\mid X\right] \right]\,. \end{aligned} Here, the outer expectation is over a random draw of X and the inner expectation samples Y conditional on X. Since there are no constraints on the predictor \hat Y, we can minimize the expression by minimizing the inner expectation independently for each possible setting that X can assume.

Indeed, for a fixed value x, we can expand the expected loss for each of the two possible predictions: \begin{aligned} \mathop\mathbb{E}[\mathit{loss}(0,Y)\mid X=x] &= \mathit{loss}(0,0) \mathop\mathbb{P}[Y=0\mid X=x] + \mathit{loss}(0,1) \mathop\mathbb{P}[Y=1\mid X=x]\\ \mathop\mathbb{E}[\mathit{loss}(1,Y)\mid X=x] &= \mathit{loss}(1,0) \mathop\mathbb{P}[Y=0\mid X=x] + \mathit{loss}(1,1) \mathop\mathbb{P}[Y=1\mid X=x]\,. \end{aligned} The optimal assignment for this x is to set \hat Y(x)=1 whenever the second expression is smaller than the first. Writing out this inequality and rearranging gives us the rule specified in the lemma.

Probabilities of the form \mathop\mathbb{P}[Y=y\mid X=x], as they
appeared in the lemma, are called *posterior* probability.

We can relate them to the likelihood function via Bayes rule: \mathop\mathbb{P}[Y=y\mid X=x] = \frac{p(x\mid Y=y) p_y }{p(x)}\,, where p(x) is a density function for the marginal distribution of X.

When we use posterior probabilities, we can rewrite the optimal predictor as \hat {Y}(x) = \mathbb{1}\left\{ \frac{p(x\mid Y=1)}{p(x\mid Y=0)} \geq \frac{p_0(\mathit{loss}(1,0)-\mathit{loss}(0,0))}{p_1(\mathit{loss}(0,1)-\mathit{loss}(1,1))}\right\}\,. This rule is an example of a likelihood ratio test.

The *likelihood ratio* is the ratio of the likelihood
functions:
\mathcal{L}(x) := \frac{p(x\mid Y=1)}{p(x\mid Y=0)}
A *likelihood ratio test* (LRT) is a predictor of the
form
\hat{Y}(x) = \mathbb{1}\{ \mathcal{L}(x) \geq \eta\}
for some scalar threshold \eta>0.

If we denote the optimal threshold value \eta = \frac{p_0(\mathit{loss}(1,0)-\mathit{loss}(0,0))}{p_1(\mathit{loss}(0,1)-\mathit{loss}(1,1))}\,,\qquad(1) then the predictor that minimizes the risk is the likelihood ratio test \hat{Y}(x) = \mathbb{1}\{ \mathcal{L}(x)\geq \eta \}\,.

A LRT naturally partitions the sample space in two regions: \begin{aligned} \mathcal{X}_0 & = \left\{ x \in \mathcal{X} \colon \mathcal{L}(x) \leq \eta \right\}\\ \mathcal{X}_1 &= \left\{ x \in \mathcal{X} \colon \mathcal{L}(x) > \eta \right\}\,. \end{aligned} The sample space \mathcal{X} then becomes the disjoint union of \mathcal{X}_0 and \mathcal{X}_1. Since we only need to identify which set x belongs to, we can use any function h:\mathcal{X} \rightarrow \mathbb{R} which gives rise to the same threshold rule. As long as h(x) \leq t whenever \mathcal{L}(x) \leq \eta and vice versa, these functions give rise to the same partition into \mathcal{X}_0 and \mathcal{X}_1. So, for example, if g is any monotonically increasing function, then the predictor \hat{Y}_g(x) = \mathbb{1} \{ g(\mathcal{L}(x))\geq g(\eta) \} is equivalent to using \hat{Y}(x). In particular, it’s popular to use the logarithmic predictor \hat{Y}_{\mathrm{log}}(x) = \mathbb{1} \{\log p(x\mid Y=1)- \log p(x\mid Y=0) \geq \log(\eta)\}\,, as it is often more convenient or numerically stable to work with logarithms of likelihoods.

This discussion shows that there are an *infinite number of
functions* which give rise to the same binary predictor. Hence, we
don’t need to know the conditional densities exactly and can still
compute the optimal predictor. For example, suppose the true
partitioning of the real line under an LRT is
\mathcal{X}_0 = \{x\colon x\geq 0\}\quad\text{and}\quad\mathcal{X}_1 =
\{x\colon x< 0\}\,.
Setting the threshold to t=0, the functions h(x) = x or h(x)=x^3 give the same predictor, as does any
odd function which is positive on the right half line.

## Example: needle in a haystack revisited

Let’s return to our needle in a haystack example with \begin{align*} p(X\mid Y=0) &= \mathcal{N}(0,1)\,,\\ p(X\mid Y=1) &= \mathcal{N}(s,1)\,, \end{align*} and assume that the prior probability of Y=1 is very small, say, p_1 = 10^{-6}. Suppose that if we declare \hat{Y}=0, we do not pay a cost. If we declare \hat{Y}=1 but are wrong, we incur a cost of 100. But if we guess \hat{Y}=1 and it is actually true that Y=1, we actually gain a reward of 1,000,000. That is \mathit{loss}(0,0) = 0, \mathit{loss}(0,1)=0, \mathit{loss}(1,0) = 100, and \mathit{loss}(1,1)=-1,000,000\,.

What is the LRT for this problem? Here, it’s considerably easier to
work with logarithms:
\log(\eta) = \log\left( \frac{(1-10^{-6}) \cdot 100}{10^{-6} \cdot
10^{6}}\right) \approx 4.61
Now,
\log p(x\mid Y=1)- \log p(x\mid Y=0) = -\frac{1}{2} (x-s)^2 +
\frac{1}{2} x^2 = sx-\frac{1}{2}s^2
Hence, the optimal predictor is to declare
\hat{Y}= \mathbb{1}\left\{ sx > \tfrac{1}{2}s^2+\log(\eta)
\right\}\,.
The optimal rule here is *linear*. Moreover, the rule
divides the space into two open intervals. While the entire real line
lies in the union of these two intervals, it is exceptionally unlikely
to ever see an x larger
than |s|+5. Hence, even if our
predictor were incorrect in these regions, the risk would still be
nearly optimal as these terms have almost no bearing on our expected
risk!

## Maximum a posteriori and maximum likelihood

A folk theorem of statistical decision theory states that essentially
all optimal rules are equivalent to likelihood ratio tests. While this
isn’t *always* true, many important prediction rules end up being
equivalent to LRTs. Shortly, we’ll see an optimization problem that
speaks to the power of LRTs. But before that, we can already show that
the well known *maximum likelihood* and *maximum a
posteriori* predictors are both LRTs.

The expected error of a predictor is the expected number of times we
declare \hat{Y}=0
(resp. \hat{Y}=1) when \hat{Y}=1 (resp. \hat{Y}=0) is true. Minimizing the error is
equivalent to minimizing the risk with cost \mathit{loss}(0,0)=\mathit{loss}(1,1)=0, \mathit{loss}(1,0)=\mathit{loss}(0,1)=1. The
optimum predictor is hence a likelihood ratio test. In particular,
\hat{Y}(x) =\mathbb{1}\left\{ \mathcal{L}(x) \geq
\tfrac{p_0}{p_1}\right\}\,.
Using Bayes rule, one can see that this rule is equivalent to
\hat{Y}(x) = \arg \max_{y\in\{0,1\}}\mathop\mathbb{P}[Y=y\mid
X=x]\,.
Recall that the expression \mathop\mathbb{P}[Y=y\mid X=x] is called the
posterior probability of Y=y
given X=x. And this rule is hence
referred to as the *maximum a posteriori* (MAP) rule.

As we discussed above, the expression p(x\mid Y=y) is called the
*likelihood* of the point x
given the class Y=y. A maximum
likelihood rule would set
\hat{Y}(x) = \arg \max_y p(x\mid Y=y)\,.
This is completely equivalent to the LRT when p_0=p_1 and the costs are \mathit{loss}(0,0)=\mathit{loss}(1,1)=0, \mathit{loss}(1,0)=\mathit{loss}(0,1)=1.
Hence, the maximum likelihood rule is equivalent to the MAP rule with a
uniform prior on the labels.

That both of these popular rules ended up reducing to LRTs is no accident. In what follows, we will show that LRTs are almost always the optimal solution of optimization-driven decision theory.

# Types of errors and successes

Let \hat{Y}(x) denote any
predictor mapping into \{0,1\}.
Binary predictions can be right or wrong in four different ways
summarized by the *confusion table*.

Y=0 | Y=1 | |
---|---|---|

\hat Y = 0 | true negative | false negative |

\hat Y = 1 | false positive | true positive |

Taking expected values over the populations give us four
corresponding *rates* that are characteristics of a
predictor.

**True Positive Rate:**\mathrm{TPR} = \mathop\mathbb{P}[\hat{Y}(X)=1\mid Y=1]. Also known as*power*,*sensitivity*,*probability of detection*, or*recall*.**False Negative Rate:**\mathrm{FNR} = 1-\mathrm{TPR}. Also known as*type II error*or*probability of missed detection*.**False Positive Rate:**\mathrm{FPR} = \mathop\mathbb{P}[\hat{Y}(X)=1\mid Y=0]. Also known as*size*or*type I error*or*probability of false alarm*.**True Negative Rate**\mathrm{TNR}=1-\mathrm{FPR}, the probability of declaring \hat{Y}=0 given Y=0. This is also known as*specificity*.

There are other quantities that are also of interest in statistics and machine learning:

**Precision:**P[Y=1 \mid \hat{Y}(X)=1]. This is equal to (p_1 \mathrm{TPR})/(p_0 \mathrm{FPR}+p_1 \mathrm{TPR}).**F1-score:**F_1 is the harmonic mean of precision and recall. We can write this as F_1 = \frac{2 \mathrm{TPR}}{1+\mathrm{TPR}+\tfrac{p_0}{p_1} \mathrm{FPR}}**False discovery rate:**False discovery rate (FDR) is equal to the expected ratio of the number of false positives to the total number of positives.

In the case where both labels are equally likely,
precision, F_1, and FDR are also only functions of \mathrm{FPR} and \mathrm{TPR}. However, these quantities
explicitly account for *class imbalances*: when there is a
significant skew between p_0
and p_1, such measures are often
preferred.

\mathrm{TPR} and \mathrm{FPR} are competing objectives. We’d like \mathrm{TPR} as large as possible and \mathrm{FPR} as small as possible.

We can think of risk minimization as optimizing a balance between \mathrm{TPR} and \mathrm{FPR}: R[\hat{Y}] := \mathop\mathbb{E}[\mathit{loss}(\hat{Y}(X),Y)] = \alpha \mathrm{FPR} - \beta \mathrm{TPR} + \gamma\,, where \alpha and \beta are nonnegative and \gamma is some constant. For all such \alpha, \beta, and \gamma, the risk-minimizing predictor is an LRT.

Other cost functions might try to balance \mathrm{TPR} versus \mathrm{FPR} in other ways. Which pairs of (\mathrm{FPR},\mathrm{TPR}) are achievable?

## ROC curves

True and false positive rate lead to another fundamental notion,
called the *receiver operating characteristic (ROC) curve*.

The ROC curve is a property of the joint distribution (X, Y) and shows for every possible value \alpha=[0, 1] the best possible true positive rate that we can hope to achieve with any predictor that has false positive rate \alpha. As a result the ROC curve is a curve in the FPR-TPR plane. It traces out the maximal TPR for any given FPR. Clearly the ROC curve contains values (0,0) and (1,1), which are achieved by constant predictors that either reject or accept all inputs.

We will now show, in a celebrated result by Neyman and Pearson, that the ROC curve is given by varying the threshold in the likelihood ratio test from negative to positive infinity.

# The Neyman-Pearson Lemma

The Neyman-Pearson Lemma, a fundamental lemma of decision theory, will be an important tool for us to establish three important facts. First, it will be a useful tool for understanding the geometric properties of ROC curves. Second, it will demonstrate another important instance where an optimal predictor is a likelihood ratio test. Third, it introduces the notion of probabilistic predictors.

Suppose we want to maximize true positive rate subject to an upper bound on the false positive rate. That is, we aim to solve the optimization problem: \begin{array}{ll} \text{maximize} & \mathrm{TPR}\\ \text{subject to} & \mathrm{FPR} \leq \alpha \end{array}

Let’s optimize over *probabilistic predictors*. A
probabilistic predictor Q
returns 1 with
probability Q(x) and 0 with probability 1-Q(x). With such rules, we can rewrite our
optimization problem as:
\begin{array}{ll}
\text{maximize}_Q & \mathop\mathbb{E}[ Q(X) \mid Y=1 ] \\
\text{subject to} & \mathop\mathbb{E}[ Q(X) \mid Y=0 ] \leq
\alpha\\
& \forall x\colon Q(x) \in [0,1]
\end{array}

**Neyman-Pearson Lemma.** Suppose the likelihood
functions p(x|y) are continuous.
Then the optimal probabilistic predictor that maximizes \mathrm{TPR} with an upper bound
on \mathrm{FPR} is a deterministic
likelihood ratio test.

Even in this constrained setup, allowing for more powerful probabilistic rules, we can’t escape likelihood ratio tests. The Neyman-Pearson Lemma has many interesting consequences in its own right that we will discuss momentarily. But first, let’s see why the lemma is true.

The key insight is that for any LRT, we can find a loss function for which it is optimal. We will prove the lemma by constructing such a problem, and using the associated condition of optimality.

Let \eta be the threshold for an LRT such that the predictor Q_\eta(x)= \mathbb{1}\{ \mathcal{L}(x) > \eta\} has \mathrm{FPR}=\alpha. Such an LRT exists because we assumed our likelihoods were continuous. Let \beta denote the \mathrm{TPR} of Q_\eta.

We claim that Q_\eta is optimal for the risk minimization problem corresponding to the loss function \mathit{loss}(1,0) = \tfrac{\eta p_1}{p_0},~\mathit{loss}(0,1)=1,~\mathit{loss}(1,1)=0,~\mathit{loss}(0,0)=0\,. Indeed, recalling Equation 1, the risk minimizer for this loss function corresponds to a likelihood ratio test with threshold value \frac{p_0( \mathit{loss}(1,0) - \mathit{loss}(0,0) )} {p_1(\mathit{loss}(0, 1) - \mathit{loss}(1, 1) )} =\frac{p_0\mathit{loss}(1,0)} {p_1\mathit{loss}(0, 1)} = \eta\,. Moreover, under this loss function, the risk of a predictor Q equals \begin{aligned} R[Q] &= p_0 \mathrm{FPR}(Q) \mathit{loss}(1,0) + p_1 (1-\mathrm{TPR}(Q)) \mathit{loss}(0,1)\\ &= p_1 \eta \mathrm{FPR}(Q) + p_1 (1-\mathrm{TPR}(Q))\,. \end{aligned}

Now let Q be any other predictor with \mathrm{FPR}(Q)\leq \alpha. We have by the optimality of Q_\eta that \begin{aligned} p_1 \eta \alpha + p_1 (1-\beta) &\leq p_1 \eta \mathrm{FPR}(Q) + p_1 (1-\mathrm{TPR}(Q)) \\ &\leq p_1 \eta \alpha + p_1 (1-\mathrm{TPR}(Q)) \,, \end{aligned} which implies \mathrm{TPR}(Q) \leq \beta. This in turn means that Q_\eta maximizes \mathrm{TPR} for all rules with \mathrm{FPR}\leq \alpha, proving the lemma.

## Properties of ROC curves

A specific randomized predictor that is useful for analysis combines two other rules. Suppose predictor one yields (\mathrm{FPR}^{(1)},\mathrm{TPR}^{(1)}) and the second rule achieves (\mathrm{FPR}^{(2)},\mathrm{TPR}^{(2)}). If we flip a biased coin and use rule one with probability p and rule 2 with probability 1-p, then this yields a randomized predictor with (\mathrm{FPR},\mathrm{TPR}) = (p\mathrm{FPR}^{(1)}+(1-p)\mathrm{FPR}^{(2)} ,p \mathrm{TPR}^{(1)}+(1-p)\mathrm{TPR}^{(2)}). Using this rule lets us prove several properties of ROC curves.

The points (0,0) and (1,1) are on the ROC curve.

This proposition follows because the point (0,0) is achieved when the threshold \eta = \infty in the likelihood ratio test, corresponding to the constant 0 predictor. The point (1,1) is achieved when \eta = 0, corresponding to the constant 1 predictor.

The Neyman-Pearson Lemma gives us a few more useful properties.

The ROC must lie above the main diagonal.

To see why this proposition is true, fix some \alpha>0. Using a randomized rule, we can achieve a predictor with \mathrm{TPR}=\mathrm{FPR}=\alpha. But the Neyman-Pearson LRT with \mathrm{FPR} constrained to be less than or equal to \alpha achieves true positive rate greater than or equal to the randomized rule.

The ROC curve is concave.

Suppose (\mathrm{FPR}(\eta_1),\mathrm{TPR}(\eta_1)) and (\mathrm{FPR}(\eta_2),\mathrm{TPR}(\eta_2)) are achievable. Then (t\mathrm{FPR}(\eta_1)+(1-t)\mathrm{FPR}(\eta_2) ,t \mathrm{TPR}(\eta_1)+(1-t)\mathrm{TPR}(\eta_2)) is achievable by a randomized test. Fixing \mathrm{FPR} \leq t\mathrm{FPR}(\eta_1)+(1-t)\mathrm{FPR}(\eta_2), we see that the optimal Neyman-Pearson LRT achieves \mathrm{TPR} \geq \mathrm{TPR}(\eta_1)+(1-t)\mathrm{TPR}(\eta_2).

## Example: the needle one more time

Consider again the *needle in a haystack* example,
where p(x\mid Y=0) =
\mathcal{N}(0,\sigma^2) and p(x\mid Y=1) = \mathcal{N}(s,\sigma^2)
with s a positive scalar. The
optimal predictor is to declare \hat{Y}=1 when X is greater than \gamma:= \frac{s}{2} + \frac{\sigma^2 \log
\eta}{s}. Hence we have
\begin{aligned}
\mathrm{TPR} &= \int_{\gamma}^{\infty} p(x\mid Y=1) \,\mathrm{d}x =
\tfrac{1}{2} \operatorname{erfc}
\left(\frac{\gamma-s}{\sqrt{2}\sigma}\right)\\
\mathrm{FPR} &= \int_{\gamma}^{\infty} p(x\mid Y=0) \,\mathrm{d}x =
\tfrac{1}{2} \operatorname{erfc}
\left(\frac{\gamma}{\sqrt{2}\sigma}\right)\,.
\end{aligned}

For fixed s and \sigma, the ROC curve (\mathrm{FPR}(\gamma),\mathrm{TPR}(\gamma))
only depends on the *signal to noise ratio* (SNR), s/\sigma. For small SNR, the ROC curve is
close to the \mathrm{FPR}=\mathrm{TPR} line. For large
SNR, \mathrm{TPR}
approaches 1 for all values
of \mathrm{FPR}.

## Area under the ROC curve

Oftentimes in information retrieval and machine learning, the term ROC curve is overloaded to describe the achievable FPR-TPR pairs that we get by varying the threshold t in any predictor \hat Y(x) = \mathbb{1}\{ R(x) > t\}. Note such curves must lie below the ROC curves that are traced out by the optimal likelihood ratio test, but may approximate the true ROC curves in many cases.

A popular summary statistic for evaluating the quality of a decision function is the area under its associated ROC curve. This is commonly abbreviated as AUC. In the ROC curve plotted in the previous section, as the SNR increases, the AUC increases. However, AUC does not tell the entire story. Here we plot two ROC curves with the same AUC.

If we constrain \mathrm{FPR} to be less than 10%, for the blue curve, \mathrm{TPR} can be as high as 80% whereas it can only reach 50% for the red. AUC should be always viewed skeptically: the shape of an ROC curve is always more informative than any individual number.

# Decisions that discriminate

The purpose of prediction is almost always decision making. We build predictors to guide our decision making by acting on our predictions. Many decisions entail a life changing event for the individual. The decision could grant access to a major opportunity, such as college admission, or deny access to a vital resource, such as a social benefit.

Binary decision rules always draw a boundary between one group in the
population and its complement. Some are labeled *accept*, others
are labeled *reject*. When decisions have serious consequences
for the individual, however, this decision boundary is not just a
technical artifact. Rather it has moral and legal significance.

The decision maker often has access to data that encode an individual’s status in socially salient groups relating to race, ethnicity, gender, religion, or disability status. These and other categories that have been used as the basis of adverse treatment, oppression, and denial of opportunity in the past and in many cases to this day.

Some see formal or algorithmic decision making as a neutral mathematical tool. However, numerous scholars have shown how formal models can perpetuate existing inequities and cause harm. In her book on this topic, Ruha Benjamin warns of

the employment of new technologies that reflect and reproduce existing inequities but that are promoted and perceived as more objective or progressive than the discriminatory systems of a previous era.Benjamin,

Race After Technology(Polity, 2019).

Even though the problems of inequality and injustice are much broader
than one of formal decisions, we already encounter an important and
challenging facet within the narrow formal setup of this chapter.
Specifically, we are concerned with decision rules that
*discriminate* in the sense of creating an unjustified basis of
differentiation between individuals.

A concrete example is helpful. Suppose we want to accept or reject
individuals for a job. Suppose we have a perfect estimate of the number
of hours an individual is going to work in the next 5 years. We decide
that this a reasonable measure of productivity and so we accept every
applicant where this number exceeds a certain threshold. On the face of
it, our rule might seem neutral. However, on closer reflection, we
realize that this decision rule systematically disadvantages individuals
who are more likely than others to make use of their parental leave
employment benefit that our hypothetical company offers. We are faced
with a conundrum. On the one hand, we trust our estimate of
productivity. On the other hand, we consider taking parental leave
*morally irrelevant* to the decision we’re making. It should not
be a disadvantage to the applicant. After all that is precisely the
reason why the company is offering a parental leave benefit in the first
place.

The simple example shows that statistical accuracy alone is no
safeguard against discriminatory decisions. It also shows that ignoring
*sensitive attributes* is no safeguard either. So what then is
*discrimination* and how can we avoid it? This question has
occupied scholars from numerous disciplines for decades. There is no
simple answer. Before we go into attempts to formalize discrimination in
our statistical decision making setting, it is helpful to take a step
back and reflect on what the law says.

## Legal background in the United States

The legal frameworks governing decision making differ from country to country, and from one domain to another. We take a glimpse at the situation in the United States, bearing in mind that our description is incomplete and does not transfer to other countries.

Discrimination is not a general concept. It is concerned with
socially salient categories that have served as the basis for
unjustified and systematically adverse treatment in the past. United
States law recognizes certain *protected categories* including
race, sex (which extends to sexual orientation), religion, disability
status, and place of birth.

Further, discrimination is a domain specific concept concerned with
important opportunities that affect people’s lives. Regulated domains
include credit (Equal Credit Opportunity Act), education (Civil Rights
Act of 1964; Education Amendments of 1972), employment (Civil Rights Act
of 1964), housing (Fair Housing Act), and *public accommodation*
(Civil Rights Act of 1964). Particularly relevant to machine learning
practitioners is the fact that the scope of these regulations extends to
marketing and advertising within these domains. An ad for a credit card,
for example, allocates access to credit and would therefore fall into
the credit domain.

There are different legal frameworks available to a plaintiff that
brings forward a case of discrimination. One is called *disparate
treatment*, the other is *disparate impact*. Both capture
different forms of discrimination. Disparate treatment is about
purposeful consideration of group membership with the intention of
discrimination. Disparate impact is about unjustified harm, possibly
through indirect mechanisms. Whereas disparate treatment is about
*procedural fairness*, disparate impact is more about
*distributive justice*.

It’s worth noting that anti-discrimination law does not reflect one overarching moral theory. Pieces of legislation often came in response to civil rights movements, each hard fought through decades of activism.

Unfortunately, these legal frameworks don’t give us a formal definition that we could directly apply. In fact, there is some well-recognized tension between the two doctrines.

## Formal non-discrimination criteria

The idea of formal non-discrimination (or *fairness*) criteria
goes back to pioneering work of Anne Cleary and other researchers in the
educational testing community of the 1960s.Hutchinson
and Mitchell, “50 Years of Test (Un) Fairness: Lessons for Machine
Learning,” in *Conference on Fairness, Accountability, and
Transparency*, 2019, 49–58.

The main idea is to introduce a discrete random variable A that encodes membership status in one or multiple protected classes. Formally, this random variable lives in the same probability space as the other covariates X, the decision \hat Y=\mathbb{1}\{ R > t\} in terms of a score R, and the outcome Y. The random variable A might coincide with one of the features in X or correlate strongly with some combination of them.

Broadly speaking, different statistical fairness criteria all equalize some group-dependent statistical quantity across groups defined by the different settings of A. For example, we could ask to equalize acceptance rates across all groups. This corresponds to imposing the constraint for all groups a and b:

\mathop\mathbb{P}[\hat Y = 1 \mid A=a] = \mathop\mathbb{P}[\hat Y = 1 \mid A=b]

Researchers have proposed dozens of different criteria, each trying
to capture different intuitions about what is *fair*. Simplifying
the landscape of fairness criteria, we can say that there are
essentially three fundamentally different ones of particular
significance:

- Acceptance rate \mathop\mathbb{P}[\hat Y = 1]
- Error rates \mathop\mathbb{P}[\hat Y = 0 \mid Y = 1] and \mathop\mathbb{P}[\hat Y = 1 \mid Y =0]
- Outcome frequency given score value \mathop\mathbb{P}[Y = 1 \mid R = r ]

The meaning of the first two as a formal matter is clear given what
we already covered. The third criterion needs a bit more motivation. A
useful property of score functions is *calibration* which asserts
that \mathop\mathbb{P}[Y = 1\mid
R=r]=r for all score values r. In words, we can interpret a score
value r as the propensity of
positive outcomes among instances assigned the score value r. What the third criterion says is closely
related. We ask that the score values have the same meaning in each
group. That is, instances labeled r in one group are equally likely to be
positive instances as those scored r in any other group.

The three criteria can be generalized and simplified using three different conditional independence statements.

Independence | Separation | Sufficiency |
---|---|---|

R\bot A | R\bot A \mid Y | Y\bot A\mid R |

Each of these applies not only to binary prediction, but any set of
random variables where the independence statement holds. It’s not hard
to see that independence implies equality of acceptance rates across
groups. Separation implies equality of error rates across groups. And
sufficiency implies that all groups have the same rate of positive
outcomes given a score value.Barocas,
Hardt, and Narayanan, *Fairness and Machine Learning*
(fairmlbook.org, 2019).

Researchers have shown that any two of the three criteria are
*mutually exclusive* except in special cases. That means,
generally speaking, imposing one criterion forgoes the other two.Kleinberg,
Mullainathan, and Raghavan, “Inherent Trade-Offs in the Fair
Determination of Risk Scores,” in *Innovations in Theoretical
Computer Science*, 2017; Chouldechova, “Fair Prediction with
Disparate Impact: A Study of Bias in Recidivism Prediction
Instruments,” *Big Data* 5, no. 2 (2017):
153–63.

Although these formal criteria are easy to state and arguably natural in the language of decision theory, their merit as measures of discrimination has been subject of an ongoing debate.

## Merits and limitations of a narrow statistical perspective

The tension between these criteria played out in a public debate
around the use of risk scores to predict *recidivism* in
pre-trial detention decisions.

There’s a risk score, called COMPAS, used by many jurisdictions in
the United States to assess *risk of recidivism* in pre-trial
bail decisions. Recidivism refers to a person’s relapse into criminal
behavior. In the United States, a defendant may either be detained or
released on bail prior to the trial in court depending on various
factors. Judges may detain defendants in part based on this score.

Investigative journalists at ProPublica found that Black defendants
face a higher false positive rate, i.e., more Black defendants labeled
*high risk* end up not committing a crime upon release than among
White defendants labeled *high risk*.Angwin
et al., “Machine Bias,” *ProPublica*, May 2016, https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing.
In other words, the COMPAS score fails the separation criterion.

A company called Northpointe, which sells the proprietary COMPAS risk
model, pointed out in return that Black and White defendants have equal
recidivism rates *given* a particular score value. That is
defendants labeled, say, an ‘8’ for *high risk* would go on to
recidivate at a roughly equal rate in either group. Northpointe claimed
that this property is desirable so that a judge can interpret scores
equally in both groups.Dieterich,
Mendoza, and Brennan, “COMPAS Risk Scales:
Demonstrating Accuracy Equity and Predictive Parity” (Northpointe
Inc. Research Department, 2016), https://www.documentcloud.org/documents/2998391-ProPublica-Commentary-Final-070616.html.

The COMPAS debate illustrates both the merits and limitations of the narrow framing of discrimination as a classification criterion.

On the hand, the error rate disparity gave ProPublica a tangible and concrete way to put pressure on Northpointe. The narrow framing of decision making identifies the decision maker as responsible for their decisions. As such, it can be used to interrogate and possibly intervene in the practices of an entity.

On the other hand, decisions are always part of a broader system that embeds structural patterns of discrimination. For example, a measure of recidivism hinges crucially on existing policing patterns. Crime is only found where policing activity happens. However, the allocation and severity of police force itself has racial bias. Some scholars therefore find an emphasis on statistical criteria rather than structural determinants of discrimination to be limited.

# Chapter notes

The theory we covered in this chapter is also called *detection
theory* and *decision theory*. Similarly, what we call a
predictor throughout has various different names, such as *decision
rule* or *classifier*.

The elementary detection theory covered in this chapter has not
changed much at all since the 1950s and is essentially considered a
“solved problem”. Neyman and Pearson invented the likelihood ratio
testNeyman
and Pearson, “On the Use and Interpretation of Certain Test
Criteria for Purposes of Statistical Inference: Part
I,” *Biometrika*, 1928,
175–240. and later proved their lemma showing it to
be optimal for maximizing true positive rates while controlling false
positive rates.Neyman
and Pearson, “On the Problem of the Most Efficient Tests of
Statistical Hypotheses,” *Philosophical Transactions of the
Royal Society of London. Series A* 231, no. 694–706 (1933):
289–337. Wald followed this work by inventing
general Bayes risk minimization in 1939.Wald,
“Contributions to the Theory of Statistical Estimation and Testing
Hypotheses,” *The Annals of Mathematical Statistics* 10,
no. 4 (1939): 299–326. Wald’s ideas were widely
adopted during World War II for the purpose of interpreting RADAR
signals which were often very noisy. Much work was done to improve RADAR
operations, and this led to the formalization that the output of a RADAR
system (the receiver) should be a likelihood ratio, and a decision
should be made based on an LRT. Our proof of Neyman-Pearson’s lemma came
later, and is due to Bertsekas and Tsitsiklis (See Section 9.3 of
*Introduction to Probability*Bertsekas
and Tsitsiklis, *Introduction to Probability*, 2nd ed. (Athena
Scientific, 2008).).

Our current theory of detection was fully developed by Peterson,
Birdsall, and Fox in their report on optimal signal detectability.Peterson,
Birdsall, and Fox, “The Theory of Signal Detectability,”
*Transactions of the IRE* 4, no. 4 (1954):
171–212. Peterson, Birdsall, and Fox may have been
the first to propose Receiver Operating Characteristics as the means to
characterize the performance of a detection system, but these ideas were
contemporaneously being applied to better understand psychology and
psychophysics as well.Tanner
Jr. and Swets, “A Decision-Making Theory of Visual
Detection,” *Psychological Review* 61, no. 6 (1954):
401.

Statistical Signal Detection theory was adopted in the pattern
recognition community at a very early stage. Chow proposed using optimal
detection theory,Chow,
“An Optimum Character Recognition System Using Decision
Functions,” *IRE Transactions on Electronic
Computers*, no. 4 (1957): 247–54. and this led
to a proposal by Highleyman to approximate the risk by its sample
average.Highleyman,
“Linear Decision Functions, with Application to Pattern
Recognition,” *Proceedings of the IRE* 50,
no. 6 (1962): 1501–14. This transition from
population risk to “empirical” risk gave rise to what we know today as
machine learning.

Of course, how decisions and predictions are applied and interpreted
remains an active research topic. There is a large amount of literature
now on the topic of fairness and machine learning. For a general
introduction to the problem and dangers associated with algorithmic
decision making not limited to discrimination, see the books by
BenjaminBenjamin,
*Race After Technology*., BroussardBroussard,
*Artificial Unintelligence: How Computers Misunderstand the
World* (MIT Press, 2018)., EubanksEubanks,
*Automating Inequality: How High-Tech Tools Profile, Police, and
Punish the Poor* (St. Martin’s Press, 2018).,
NobleNoble,
*Algorithms of Oppression: How Search Engines Reinforce Racism*
(NYU Press, 2018)., and O’NeilO’Neil,
*Weapons of Math Destruction: How Big Data Increases Inequality and
Threatens Democracy* (Broadway Books, 2016)..
The technical material in our section on discrimination follows Chapter
2 in the textbook by Barocas, Hardt, and Narayanan.Barocas,
Hardt, and Narayanan, *Fairness and Machine
Learning*.

The abalone example was derived from data available at the UCI Machine Learning Repository, which we will discuss in more detail in Chapter 8. We modified the data to ease exposition. The actual data does not have an equal number of male and female instances, and the optimal predictor is not exactly a threshold function.

# References

*ProPublica*, May 2016. https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing.

*Fairness and Machine Learning*. fairmlbook.org, 2019.

*Race After Technology*. Polity, 2019.

*Introduction to Probability*. 2nd ed. Athena Scientific, 2008.

*Artificial Unintelligence: How Computers Misunderstand the World*. MIT Press, 2018.

*Big Data*5, no. 2 (2017): 153–63.

*IRE Transactions on Electronic Computers*, no. 4 (1957): 247–54.

*Automating Inequality: How High-Tech Tools Profile, Police, and Punish the Poor*. St. Martin’s Press, 2018.

*Proceedings of the IRE*50, no. 6 (1962): 1501–14.

*Conference on Fairness, Accountability, and Transparency*, 49–58, 2019.

*Innovations in Theoretical Computer Science*, 2017.

*Philosophical Transactions of the Royal Society of London. Series A*231, no. 694–706 (1933): 289–337.

*Biometrika*, 1928, 175–240.

*Algorithms of Oppression: How Search Engines Reinforce Racism*. NYU Press, 2018.

*Weapons of Math Destruction: How Big Data Increases Inequality and Threatens Democracy*. Broadway Books, 2016.

*Transactions of the IRE*4, no. 4 (1954): 171–212.

*Psychological Review*61, no. 6 (1954): 401.

*The Annals of Mathematical Statistics*10, no. 4 (1939): 299–326.