Mathematical background
The main mathematical tools of machine learning are optimization and statistics. At their core are concepts from multivariate calculus and probability. Here, we briefly review some of the concepts from calculus and probability that we will frequently make use of in the book.
Common notation
- Lowercase letters u, v, w, x, y, z typically denote vectors. We use both \langle u, v\rangle and u^T v to denote the inner product between vectors u and v.
- Capital letters X, Y, Z typically denote random variables.
- The conditional probability \mathop\mathbb{P}[A \mid B] of an event A given an event B.
- The gradient \nabla f(x) of a function f\colon \mathbb{R}^d\to\mathbb{R} at a point x\in\mathbb{R}^d refers to the vector of partial derivatives of f evaluated at x.
- The identity matrix I.
- The first k positive integers [k]=\{1, 2, \dots, k\}.
Multivariable calculus and linear algebra
Positive definite matrices
Positive definite matrices are central to both optimization algorithms and statistics. In this section, we quickly review some of the core properties that we will use throughout the book.
A matrix M is positive definite (pd) if it is symmetric, M = M^T, and z^T M z > 0 for all nonzero z\in \mathbb{R}^d. We denote this as M \succ 0. A matrix M is positive semidefinite (psd) if it is symmetric and z^T M z \geq 0 for all z\in\mathbb{R}^d. We denote this as M \succeq 0. All pd matrices are psd, but not vice versa.
Some of the main properties of positive semidefinite matrices include the following:
1. If M_1 \succeq 0 and M_2 \succeq 0, then M_1 + M_2 \succeq 0.
2. If M \succeq 0 and a \in \mathbb{R} with a\geq 0, then aM \succeq 0.
3. For any matrix F, FF^T and F^TF are both psd. Conversely, if M is psd, there exists an F such that M = FF^T.
Note that (1) and (2) still hold if “psd” is replaced with “pd.” That is, the sum of two pd matrices is pd, and multiplying a pd matrix by a positive scalar preserves positive definiteness.
Recall that \lambda is an eigenvalue of a square matrix M if there exists a nonzero x\in \mathbb{R}^d such that Mx = \lambda x. Eigenvalues of psd matrices are all non-negative, and eigenvalues of pd matrices are all positive. This follows by multiplying the equation Mx = \lambda x on the left by x^T: since x^T M x = \lambda \Vert x \Vert^2 and \Vert x \Vert^2 > 0, the sign of \lambda is the sign of x^T M x.
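As a quick sanity check, here is a small numpy sketch (our own example, not from the text) verifying property (3) numerically: for any matrix F, the matrix FF^T has only nonnegative eigenvalues.

```python
import numpy as np

# Minimal check that M = F F^T is positive semidefinite: all of its
# eigenvalues should be nonnegative (up to floating-point error).
rng = np.random.default_rng(0)
F = rng.standard_normal((5, 3))
M = F @ F.T

eigenvalues = np.linalg.eigvalsh(M)   # eigvalsh is the right routine for symmetric matrices
print(eigenvalues)                    # all entries are >= 0 up to numerical error
assert np.all(eigenvalues >= -1e-10)
```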
Gradients, Taylor’s Theorem and infinitesimal approximation
Let \Phi:\mathbb{R}^d\rightarrow \mathbb{R}. Recall from multivariable calculus that the gradient of \Phi at a point w is the vector of partial derivatives \nabla \Phi(w) = \begin{bmatrix} \frac{\partial \Phi(w)}{\partial x_1} \\ \frac{\partial \Phi(w)}{\partial x_2} \\ \vdots \\ \frac{\partial \Phi(w)}{\partial x_d} \end{bmatrix}\,. Sometimes we write \nabla_x\Phi(w) to make clear which functional argument we are referring to.
One of the most important theorems in calculus is Taylor’s Theorem, which allows us to approximate smooth functions by simple polynomials. The following simplified version of Taylor’s Theorem is used throughout optimization. This form of Taylor’s theorem is sometimes called the multivariable mean-value theorem. We will use this at multiple points to analyze algorithms and understand the local properties of functions.
Taylor’s Theorem.
If \Phi is continuously differentiable, then, for some t \in [0,1]\,, \Phi(w) = \Phi(w_0) + \nabla \Phi(tw + (1-t)w_0)^T (w - w_0)\,.
If \Phi is twice continuously differentiable, then \nabla \Phi(w) = \nabla \Phi(w_0) + \int_0^1 \nabla^2 \Phi(tw + (1-t)w_0)(w - w_0) \mathrm{d}t and, for some t \in [0,1] \begin{aligned} \Phi(w) &= \Phi(w_0) + \nabla \Phi(w_0)^T(w - w_0) \\ &\qquad\qquad\quad+ \frac{1}{2} (w-w_0)^T \nabla^2 \Phi(tw + (1-t)w_0)^T (w - w_0)\,. \end{aligned}
Taylor’s theorem can be used to understand the local properties of functions. For example, \Phi(w+ \epsilon v) = \Phi(w) + \epsilon \nabla \Phi(w)^Tv + \frac{\epsilon^2}{2} v^T \nabla^2 \Phi(w + \delta v)^T v for some 0\leq \delta \leq \epsilon. This expression states that \Phi(w+ \epsilon v) = \Phi(w) + \epsilon \nabla \Phi(w)^Tv + \Theta(\epsilon^2)\,, so, to first order, we can approximate \Phi by a linear function.
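To make the infinitesimal approximation concrete, here is a brief numerical sketch with a test function and direction of our own choosing: the remainder after subtracting the first-order term shrinks like \epsilon^2.

```python
import numpy as np

# Verify numerically that Phi(w + eps*v) - Phi(w) - eps * grad(w)^T v is Theta(eps^2)
# for a smooth function Phi (the function and point below are made up for illustration).
def phi(w):
    return np.sum(np.sin(w)) + 0.5 * np.dot(w, w)

def grad_phi(w):
    return np.cos(w) + w

w = np.array([0.3, -1.2, 2.0])
v = np.array([1.0, 0.5, -0.7])

for eps in [1e-1, 1e-2, 1e-3]:
    remainder = phi(w + eps * v) - phi(w) - eps * grad_phi(w) @ v
    print(eps, remainder)   # dividing eps by 10 shrinks the remainder by roughly 100
```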
Jacobians and the multivariate chain rule
The matrix of first-order partial derivatives of a multivariate mapping \Phi\colon\mathbb{R}^n\to\mathbb{R}^m is called the Jacobian matrix. We define the Jacobian of \Phi with respect to a variable x evaluated at a value w as the m\times n matrix \mathsf{D}_x \Phi(w) = \left[ \frac{\partial \Phi_i(w)}{\partial x_j} \right]_{i=1\dots m, j=1\dots n}\,. The i-th row of the Jacobian therefore corresponds to the transpose of the familiar gradient \nabla_x^T \Phi_i(w) of the i-th coordinate of \Phi. In particular, when m=1 the Jacobian corresponds to the transpose of the gradient.
The first-order approximation given by Taylor’s theorem directly extends to multivariate functions via the Jacobian matrix. So does the chain rule from calculus for computing the derivatives of function compositions.
Let \Phi\colon\mathbb{R}^n\to\mathbb{R}^m and \Psi\colon\mathbb{R}^m\to\mathbb{R}^k. Then, we have \mathsf{D}_x \Psi\circ\Phi(w) = \mathsf{D}_{\Phi(w)}\Psi(\Phi(w))\mathsf{D}_x\Phi(w)\,.
As we did with the gradient notation, when the variable x is clear from context we may drop it from our notation and write \mathsf{D}\Phi(w).
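The chain rule is easy to check numerically. The sketch below uses small made-up maps \Phi\colon\mathbb{R}^2\to\mathbb{R}^3 and \Psi\colon\mathbb{R}^3\to\mathbb{R} and compares the product of their Jacobians against a finite-difference estimate of the Jacobian of the composition.

```python
import numpy as np

def Phi(x):                       # R^2 -> R^3 (illustrative map)
    return np.array([x[0] * x[1], np.sin(x[0]), x[1] ** 2])

def D_Phi(x):                     # 3x2 Jacobian of Phi
    return np.array([[x[1], x[0]],
                     [np.cos(x[0]), 0.0],
                     [0.0, 2 * x[1]]])

def Psi(y):                       # R^3 -> R^1 (illustrative map)
    return np.array([y[0] + y[1] * y[2]])

def D_Psi(y):                     # 1x3 Jacobian of Psi
    return np.array([[1.0, y[2], y[1]]])

w = np.array([0.7, -0.4])
chain = D_Psi(Phi(w)) @ D_Phi(w)  # chain rule: D(Psi∘Phi)(w) = D_Psi(Phi(w)) D_Phi(w)

# Finite-difference estimate of the Jacobian of the composition
eps = 1e-6
fd = np.zeros((1, 2))
for j in range(2):
    e = np.zeros(2)
    e[j] = eps
    fd[:, j] = (Psi(Phi(w + e)) - Psi(Phi(w - e))) / (2 * eps)

print(np.max(np.abs(chain - fd)))  # should be tiny
```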
Probability
Contemporary machine learning uses probability as its primary means of quantifying uncertainty. Here we review some of the basics we will make use of in this book. This will also allow us to fix notation.
We note that, oftentimes, mathematical rigor gets in the way of explaining concepts, so we will attempt to introduce mathematical machinery only when absolutely necessary.
Probability is a function on sets. Let \mathcal{X} denote the sample space. For every A\subset \mathcal{X}, we have 0 \leq \mathop\mathbb{P}[A] \leq 1\,, \qquad\qquad \mathop\mathbb{P}[\mathcal{X}]=1\,, \qquad \qquad \mathop\mathbb{P}[\emptyset] = 0\,, and \mathop\mathbb{P}[A\cup B] + \mathop\mathbb{P}[A\cap B] = \mathop\mathbb{P}[A] + \mathop\mathbb{P}[B]\,. This implies that \mathop\mathbb{P}[A\cup B] = \mathop\mathbb{P}[A] + \mathop\mathbb{P}[B] if and only if \mathop\mathbb{P}[A\cap B]=0. We always have the inequality \mathop\mathbb{P}[A\cup B] \leq \mathop\mathbb{P}[A] + \mathop\mathbb{P}[B]\,. By induction, we get the union bound \textstyle\mathop\mathbb{P}\left[\bigcup_{i} A_i\right] \leq \sum_i \mathop\mathbb{P}[A_i]\,.
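As a toy illustration (an example of our own construction, with the uniform distribution on [0,1] as sample space), a quick Monte Carlo estimate shows the union bound in action.

```python
import numpy as np

# Estimate P[A ∪ B] and compare it with P[A] + P[B] for two overlapping events.
rng = np.random.default_rng(1)
x = rng.uniform(size=100_000)          # samples from the uniform distribution on [0, 1]
A = (x < 0.3)                          # event A = {x < 0.3},            P[A] = 0.3
B = (x > 0.2) & (x < 0.5)              # event B = {0.2 < x < 0.5},      P[B] = 0.3

print(np.mean(A | B))                  # ≈ P[A ∪ B] = 0.5
print(np.mean(A) + np.mean(B))         # ≈ P[A] + P[B] = 0.6, an upper bound as expected
```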
Random variables and vectors
Random variables are a particular way of characterizing outcomes of random processes. We will use capital letters like X, Y, and Z to denote such random variables. The sample space of a random variable is the set of values the variable can take. Events are simply subsets of possible values. Common examples we will encounter in this book are
Probability that a random variable has a particular value. This will be denoted as \mathop\mathbb{P}[X=x]. Note here that we use a lower case letter to denote the value that the random variable might take.
Probability that a random variable satisfies some inequality. For example, the probability that X is less than a scalar t will be denoted as \mathop\mathbb{P}[X\leq t].
A random vector is a random variable whose sample space is \mathbb{R}^d. We will not use notation to distinguish between vectors and scalars in this text.
Densities
Random vectors are often characterized by probability densities rather than by probabilities. The density p of a random variable X is defined by its relation to probabilities of sets: \mathop\mathbb{P}[X \in A] = \int_{x\in A} p(x) \mathrm{d}x\,.
Expectations
If f is a function on \mathbb{R}^d and X is a random vector with density p, then the expectation of f is given by \mathop\mathbb{E}[f(X)] = \int f(x) p(x) \mathrm{d}x\,.
If A is a set, the indicator function of the set is the function I_A(x) = \begin{cases} 1 & \text{if}~x \in A \\ 0 & \text{otherwise} \end{cases} Note that the expectation of an indicator function is a probability: \mathop\mathbb{E}[I_A(X)] = \int_{x \in A} p(x) \mathrm{d}x = \mathop\mathbb{P}[X \in A]\,. This expression links the three concepts of expectation, density, and probability together.
Note that the expectation operator is linear: \mathop\mathbb{E}[af(X)+bg(X)] = a \mathop\mathbb{E}[f(X)] + b \mathop\mathbb{E}[g(X)] \,.
Two other important expectations are the mean and covariance. The mean of a random variable is the expected value of the identity function: \mu_X := \mathop\mathbb{E}[X] = \int x p(x) \mathrm{d}x\,. The covariance of a random variable is the matrix \Sigma_X := \mathop\mathbb{E}[(X-\mu_X)(X-\mu_X)^T]\,. Note that covariance matrices are positive semidefinite. To see this, take a nonzero vector z and compute z^T \Sigma_X z := \mathop\mathbb{E}[z^T(X-\mu_X)(X-\mu_X)^Tz] = \mathop\mathbb{E}[ ((X-\mu_X)^Tz)^2]\,. Since the term inside the expectation is nonnegative, the expectation is nonnegative as well.
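A brief numerical sketch, using synthetic Gaussian data with parameters we made up, estimates a mean and covariance by averaging and confirms the positive semidefiniteness argument above.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.multivariate_normal(mean=[1.0, -2.0], cov=[[2.0, 0.5], [0.5, 1.0]], size=10_000)

mu = X.mean(axis=0)
Sigma = (X - mu).T @ (X - mu) / len(X)      # average of (x - mu)(x - mu)^T over the sample

print(np.linalg.eigvalsh(Sigma))            # eigenvalues are nonnegative, so Sigma ⪰ 0
z = rng.standard_normal(2)
print(z @ Sigma @ z)                        # z^T Sigma z >= 0, matching the argument above
```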
Important examples of probability distributions
Bernoulli random variables. A Bernoulli random variable X can take two values, 0 and 1. In such a case, \mathop\mathbb{P}[X=1] = 1-\mathop\mathbb{P}[X=0]\,.
Gaussian random vectors. Gaussian random vectors are the most ubiquitous real-valued random vectors. Their densities are parameterized only by their mean and covariance: p(x) = \frac{1}{\operatorname{det}(2\pi \Sigma)^{1/2}} \exp\left( -\tfrac{1}{2} (x-\mu_X)^T \Sigma^{-1} (x-\mu_X) \right)\,. Gaussian random variables are often called “normal” random variables. We denote the distribution of a normal random variable with mean \mu and covariance \Sigma as \mathcal{N}(\mu,\Sigma)\,. The reason Gaussian random variables are ubiquitous is the central limit theorem: averages of many independent random variables tend to look like Gaussian random variables.
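A small simulation (with toy parameters of our own) illustrates this central limit behavior: averages of independent Bernoulli draws concentrate around their mean and look approximately Gaussian.

```python
import numpy as np

# Averages of 200 independent Bernoulli(0.25) draws, repeated over many trials.
rng = np.random.default_rng(3)
averages = rng.binomial(n=1, p=0.25, size=(50_000, 200)).mean(axis=1)

print(averages.mean())    # close to p = 0.25
print(averages.std())     # close to sqrt(p(1-p)/200) ≈ 0.031
# A histogram of `averages` (e.g., with matplotlib) looks approximately normal.
```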
Conditional probability and Bayes’ Rule
Conditional probability is applied quite cavalierly in machine learning. It’s actually very delicate and should only be applied when we really know what we’re doing.
The conditional probability of an event A given an event B is defined as \mathop\mathbb{P}[A|B] = \frac{\mathop\mathbb{P}[A\cap B]}{\mathop\mathbb{P}[B]}\,.
A and B are said to be independent if \mathop\mathbb{P}[A|B] = \mathop\mathbb{P}[A]. Note that from the definition of conditional probability A and B are independent if and only if \mathop\mathbb{P}[A\cap B] = \mathop\mathbb{P}[A]\mathop\mathbb{P}[B]\,.
Bayes’ Rule is an immediate corollary of the definition of conditional probability. In some sense, it’s just a restatement of the definition: \mathop\mathbb{P}[A|B] = \frac{\mathop\mathbb{P}[B|A]\mathop\mathbb{P}[A]}{\mathop\mathbb{P}[B]}\,. This is commonly applied when A is one of a set of several alternatives. Suppose the A_i are a collection of disjoint sets such that \cup_i A_i = \mathcal{X}. Then, for each i, Bayes’ Rule states \mathop\mathbb{P}[A_i|B] = \frac{\mathop\mathbb{P}[B|A_i] \mathop\mathbb{P}[A_i]}{\sum_j \mathop\mathbb{P}[B|A_j] \mathop\mathbb{P}[A_j]}\,. This shows that if we have models of the likelihood of B under each alternative A_i and if we have beliefs about the probability of each A_i, we can compute the probability of each A_i under the condition that B has occurred.
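Here is a tiny worked example with made-up priors and likelihoods, just to show the arithmetic of the formula above for two disjoint alternatives A_1 and A_2.

```python
# Two alternatives with prior probabilities, and a likelihood of observing B under each.
prior = [0.8, 0.2]                 # P[A_1], P[A_2]        (hypothetical values)
likelihood = [0.1, 0.7]            # P[B | A_1], P[B | A_2] (hypothetical values)

evidence = sum(p * l for p, l in zip(prior, likelihood))        # P[B] by total probability
posterior = [p * l / evidence for p, l in zip(prior, likelihood)]
print(posterior)                   # [P[A_1 | B], P[A_2 | B]] ≈ [0.364, 0.636]
```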
Conditional densities
Suppose X and Z are random variables whose joint distribution is continuous. If we try to write down the conditional distribution for X given Z=z, we find \mathop\mathbb{P}[X\in A \mid Z=z] = \frac{ \mathop\mathbb{P}[X\in A,\, Z=z] } {\mathop\mathbb{P}[Z=z]}\,. Both the numerator and denominator are equal to zero. In order to have a useful formula, we can appeal to densities. \begin{aligned} \mathop\mathbb{P}[X \in A \mid z \leq Z \leq z+\epsilon ] &= \frac{ \int_{z}^{z+\epsilon} \int_{x \in A} p(x,z') \mathrm{d}x \mathrm{d}z' } {\int_{z}^{z+\epsilon} p(z') \mathrm{d}z' }\\ &\approx \frac{ \epsilon \int_{x\in A} p(x,z) \mathrm{d}x } { \epsilon p(z) } \\ &= \int_{x\in A} \frac{p(x,z)}{p(z)} \mathrm{d}x\,. \end{aligned} Letting \epsilon go to zero, this calculation shows that we can use the conditional density to compute the conditional probabilities of X when Z=z: p(x|z) := \frac{p(x,z)}{p(z)}\,.
Conditional expectation and the law of iterated expectation
Conditional expectation is shorthand for computing expected values with respect to conditional probabilities: \mathop\mathbb{E}[f(x,z)|Z=z] = \int f(x,z) p(x|z) \mathrm{d}x\,. An important formula is the law of iterated expectation: \mathop\mathbb{E}[f(x,z)] = \mathop\mathbb{E}[\mathop\mathbb{E}[f(x,z)|Z=z]]\,. This formula follows because \begin{aligned} \mathop\mathbb{E}[f(x,z)] &= \int\int f(x,z) p(x,z) \mathrm{d}x \mathrm{d}z \\ &= \int\int f(x,z) p(x|z) p(z) \mathrm{d}x \mathrm{d}z \\ &= \int \left(\int f(x,z) p(x|z)\mathrm{d}x\right) p(z) \mathrm{d}z \,, \end{aligned} and the inner integral is precisely the conditional expectation \mathop\mathbb{E}[f(x,z)|Z=z].
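A short Monte Carlo sketch, with a toy joint model of our own (Z standard normal and X equal to Z plus independent noise), checks the law of iterated expectation for f(x,z) = xz.

```python
import numpy as np

rng = np.random.default_rng(4)
Z = rng.standard_normal(1_000_000)
X = Z + rng.standard_normal(1_000_000)     # X given Z = z has conditional mean z

print(np.mean(X * Z))                      # direct estimate of E[X Z]
print(np.mean(Z * Z))                      # E[E[X Z | Z]] = E[Z * E[X | Z]] = E[Z^2]; both ≈ 1
```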
Estimation
This book devotes much of its attention to probabilistic decision making. A different but related statistical problem is parameter estimation. Assuming that data X is generated by a statistical model, we’d like to infer some nonrandom property about its distribution. The most canonical examples here would be estimating the mean or variance of the distribution. Note that estimating these parameters has a different flavor than decision theory. In particular, our framework of risk minimization no longer applies.
If we aim to minimize a functional \text{minimize}_f~\mathop\mathbb{E}[\mathit{loss}(\vartheta,f(x))] then the optimal choice is to set f(x) = \vartheta. But we don’t know this parameter in the first place. So we end up with an algorithm that’s not implementable.
Instead, what we do in estimation theory is pose a variety of plausible estimators that might work for a particular parameter and consider the efficacy of these estimators in different settings. In particular, we’d like estimators that take a set of observations S=(x_1,\ldots,x_n) and return a guess for the parameter whose value improves as n increases: \lim_{n\rightarrow \infty} \mathop\mathbb{E}_S[\mathit{loss}(\vartheta,\hat{\vartheta}(S))] = 0\,.
Even though estimators are constructed from data, their design and implementation require a good deal of knowledge about the underlying probability distribution. Because of this, estimation is typically considered to be part of classical statistics and not machine learning. Estimation theory has a variety of powerful tools that are aimed at producing high quality estimators, and is certainly worth learning more about. We need rudimentary elements of estimation to understand popular baselines and algorithms in causal inference and reinforcement learning.
Plug-in estimators
We will restrict our attention to plug-in estimators. Plug-in estimators are functions of the moments of probability distributions. They are called plug-in estimators because we replace the true distribution with the empirical distribution. To be precise, suppose there exist vector-valued functions g and \psi such that \vartheta = g(\mathop\mathbb{E}[\psi(x)]). Then, given a dataset S=(x_1,\ldots,x_n), the associated plug-in estimator of \vartheta is \hat{\vartheta}(S) = g\left( \frac{1}{n} \sum_{i=1}^n \psi(x_i)\right)\,, that is, we replace the expectation with the sample average. There are several canonical examples of plug-in estimators.
The sample mean. The sample mean is the plug-in estimator where g and \psi are both the identity functions.
The sample covariance. The sample covariance is \hat{\Sigma}_x = \frac{1}{n}\sum_{i=1}^n x_i x_i^T -\left(\frac{1}{n}\sum_{i=1}^n x_i \right)\left(\frac{1}{n}\sum_{i=1}^n x_i \right)^T\,. From this formula, we can take \psi(x) = \begin{bmatrix} 1\\ x \end{bmatrix}\begin{bmatrix} 1\\ x \end{bmatrix}^T ~~~\text{and}~~~ g\left(\begin{bmatrix} A & B \\ B^T & C\end{bmatrix} \right) = C-B^TB\,.
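As a sanity check, here is a minimal numpy sketch (with synthetic data of our own) confirming that this plug-in formula agrees with directly averaging (x_i - \bar{x})(x_i - \bar{x})^T.

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.standard_normal((1_000, 3)) @ np.array([[1.0, 0.2, 0.0],
                                                [0.0, 1.0, 0.5],
                                                [0.0, 0.0, 1.0]])

second_moment = X.T @ X / len(X)               # (1/n) sum_i x_i x_i^T
mean = X.mean(axis=0)
plug_in = second_moment - np.outer(mean, mean) # plug-in sample covariance

direct = (X - mean).T @ (X - mean) / len(X)    # average of (x_i - mean)(x_i - mean)^T
print(np.max(np.abs(plug_in - direct)))        # agrees up to floating-point error
```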
Least-squares estimator. Suppose we have three random vectors, y, x, and v, and we assume that v and x are zero-mean and uncorrelated and that y=Ax+v for some matrix A. Let’s suppose we’d like to estimate A from a set of pairs S=( (x_1,y_1), \ldots, (x_n, y_n)). One can check that A = \Sigma_{yx} \Sigma_{x}^{-1}\,. Hence the plug-in estimator uses the sample covariances: \hat{A} = \left( \sum_{i=1}^n y_i x_i^T\right) \left( \sum_{i=1}^n x_i x_i^T\right)^{-1}\,. In this case, we have the formulation \psi(x,y) = \begin{bmatrix} x\\ y \end{bmatrix}\begin{bmatrix} x\\ y \end{bmatrix}^T ~~~\text{and}~~~ g\left(\begin{bmatrix} A & B \\ B^T & C\end{bmatrix} \right) = B^T A^{-1}\,.
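Below is a hedged simulation sketch, with a matrix A and noise level chosen arbitrarily by us, showing that the plug-in estimator \hat{A} recovers A as the sample size grows.

```python
import numpy as np

rng = np.random.default_rng(6)
A_true = np.array([[1.0, -0.5],
                   [0.3,  2.0]])                         # made-up ground-truth matrix

n = 50_000
X = rng.standard_normal((n, 2))                          # zero-mean x
V = 0.1 * rng.standard_normal((n, 2))                    # zero-mean noise, uncorrelated with x
Y = X @ A_true.T + V                                     # y_i = A x_i + v_i

A_hat = (Y.T @ X) @ np.linalg.inv(X.T @ X)               # (sum y_i x_i^T)(sum x_i x_i^T)^{-1}
print(np.max(np.abs(A_hat - A_true)))                    # small for large n
```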
Convergence rates
In our study of generalization, we reasoned that the empirical risk should be close to the true risk because sample averages should be close to population values. A similar reasoning holds true for plug-in estimators: smooth functions of sample averages should be close to their population counterparts.
We covered the case of the sample mean in our discussion of generalization. To recall, suppose x is a Bernoulli random variable with mean p. Let x_1,\ldots,x_n be independent and identically distributed as x. Then Hoeffding’s inequality states that \mathop\mathbb{P}\left[ \left| \frac{1}{n}\sum_{i=1}^n x_i - p \right| > \epsilon \right] \leq 2 \exp(-2n \epsilon^2)\,. Or, in other words, with probability 1-\delta, \left| \frac{1}{n}\sum_{i=1}^n x_i - p \right| \leq \sqrt{\frac{\log(2/\delta)}{2n}}\,.
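A quick simulation (with toy values of n, p, and \epsilon chosen for illustration) shows the empirical failure probability sitting comfortably below the Hoeffding bound.

```python
import numpy as np

rng = np.random.default_rng(7)
n, p, eps, trials = 200, 0.3, 0.1, 100_000

samples = rng.binomial(n=1, p=p, size=(trials, n))
deviations = np.abs(samples.mean(axis=1) - p)

print(np.mean(deviations > eps))         # empirical probability that the sample mean misses p by > eps
print(2 * np.exp(-2 * n * eps ** 2))     # Hoeffding bound: 2 e^{-4} ≈ 0.037
```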
Let’s consider a simple least-squares estimator. Suppose we know that y=w^T x + v, where w and x are vectors, w is deterministic, and x and v are uncorrelated; assume for simplicity that the noise v has zero mean and unit variance. Consider the least-squares estimator \hat{w}_S computed from n data points. The estimation error in w is the vector e_S = \hat{w}_S-w. The expectation of e_S is zero and the expected squared norm of the error is given by \mathop\mathbb{E}\left[ \Vert e_S \Vert^2 \right] = \operatorname{Trace} \left( \left( \sum_{i=1}^n x_i x_i^T \right)^{-1} \right)\,. This error is small if the sample covariance has large eigenvalues. Indeed, if \lambda_S denotes the minimum eigenvalue of the sample covariance of x, then \mathop\mathbb{E}\left[ \Vert e_S \Vert^2 \right] \leq \frac{d}{n \lambda_S}\,. This expression suggests that the distribution of x must have a density that covers all directions somewhat equally in order for the least-squares estimator to have good performance. On top of this, we see that the squared error decreases roughly as d/n. Hence, we need far more measurements than dimensions to find a good estimate of w.

This is in contrast to what we studied in classification. Most of the generalization bounds for classification we derived were dimension free and depended only on properties like the margin of the data. In contrast, in parameter estimation, we tend to get results that scale as the number of parameters over the number of data points. This rough rule of thumb, that the error scales as the ratio of the number of parameters to the number of data points, tends to be a good guiding principle when attempting to understand convergence rates of estimators.
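The following rough experiment, under our own assumptions of standard Gaussian features and unit-variance noise, illustrates the d/n scaling of the squared estimation error.

```python
import numpy as np

rng = np.random.default_rng(8)
d = 10
w_true = rng.standard_normal(d)                          # made-up ground-truth parameter

for n in [100, 1_000, 10_000]:
    X = rng.standard_normal((n, d))
    y = X @ w_true + rng.standard_normal(n)              # unit-variance noise
    w_hat = np.linalg.lstsq(X, y, rcond=None)[0]         # least-squares estimate
    print(n, np.sum((w_hat - w_true) ** 2), d / n)       # squared error tracks d/n as n grows
```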