# Sequential decision making and dynamic programming

As the previous chapters motivated, we don’t just make predictions for their own sake, but rather use data to inform decision making and action. This chapter examines sequential decisions and the interplay between predictions and actions in settings where our repeated actions are directed towards a concrete end goal. It will force us to understand statistical models that evolve over time and the nature of dependencies in temporally correlated data. We will also have to understand feedback and its impact on statistical decision-making problems.

In machine learning, the subfield of using statistical tools to
direct actions in dynamic environments is commonly called “reinforcement
learning” (RL). However, this blanket term tends to lead people towards
specific solution techniques. So we are going to try to maintain a
broader view of the area of *sequential decision making*,
including perspectives from related fields of predictive analytics and
optimal control. These multiple perspectives will allow us to highlight
how RL is different from the machine learning we are most familiar
with.

This chapter will follow a similar flow to our study of prediction.
We will formalize a temporal mathematical model of sequential decision
making involving notions from *dynamical systems*. We will then
present a common optimization framework for making sequential decisions
when models are known: *dynamic programming*. Dynamic programming
will enable algorithms for finding or approximating optimal decisions
under a variety of scenarios. In the sequel, we will turn to the
learning problem of how to best make sequential decisions when the
mechanisms underlying dynamics and costs are not known in advance.

# From predictions to actions

Let’s first begin with a discussion of how sequential decision making differs from static prediction. In our study of decision theory, we laid out a framework for making optimal predictions of a binary covariate Y when we had access to data X, and probabilistic models of how X and Y were related. Supervised learning was the resulting problem of making such decisions from data rather than probabilistic models.

In sequential decision making, we add two new variables. First, we
incorporate *actions* denoted U that we aim to take throughout a procedure.
We also introduce rewards R that
we aim to maximize. In sequential decision making, the goal is to
analyze the data X and then
subsequently choose U so
that R is large. We have explicit
agency in choosing U and are
evaluated based on some quality scoring of U and X. There is an endless number of problems
where this formulation applies, from supply chain optimization to
robotic planning to online engagement maximization. Reinforcement
learning is the resulting problem of taking actions so as to maximize
rewards where our actions are only a function of previously observed
data rather than probabilistic models. Not surprisingly, the
optimization problem associated with sequential decision making is more
challenging than the one that arises in decision theory.

# Dynamical systems

Beyond the action variable, another key feature of sequential decision making problems is the notion of time and sequence. We assume data is collected in an evolving process, and our current actions influence our future rewards.

We begin by bringing all of these elements together in the general definition of a discrete-time dynamical system. The definition is both simple and broad. We will illustrate with several examples shortly.

A *dynamical system model* has a *state* X_t, *exogenous input* U_t modeling our *control action*, and
*reward* R_t. The state evolves
in discrete time steps according to the equation
X_{t+1} = f_t(X_t,U_t,W_t)
where W_t is a random
variable and f_t is a function.
The reward is assumed to be a function of these variables as well:
R_{t} = g_t(X_t,U_t,W_t)
for some function g_t. To
simplify our notation throughout, we will commonly write R explicitly as a function of (X,U,W), that is, R_t[X_t,U_t,W_t].

Formally, we can think of this definition as a structural equation model that we also used to define causal models. After all, the equations above give us a way to incrementally build up a data-generating process from noise variables. Whether or not the dynamical system is intended to capture any causal relationships in the real world is a matter of choice. Practitioners might pursue this formalism for reward maximization without modeling causal relationships. A good example is the use of sequential decision making tools for revenue maximization in targeted advertising. Rather than modeling causal relationships between, say, preferences and clicks, targeted advertising heavily relies on all sorts of signals, be they causal or not.

## Concrete examples

**Grocery shopping.** Bob really likes to eat Cheerios
for breakfast every morning. Let the state X_t denote the amount of Cheerios in Bob’s
kitchen on day t. The
action U_t denotes the amount of
Cheerios Bob buys on day t
and W_t denotes the amount of
Cheerios he eats that day. The random variable W_t varies with Bob’s hunger. This yields the
dynamical system
X_{t+1} = X_t + U_t - W_t\,.
While this example is a bit cartoonish, it turns out that such
simple models are commonly used in managing large production supply
chains. Any system where resources are stochastically depleted and must
be replenished can be modeled comparably. If Bob had a rough model for
how much he eats in a given day, he could forecast when his supply would
be depleted. And he could then minimize the number of trips he’d need to
make to the grocer using optimal control.
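A minimal simulation of this inventory dynamic (the initial stock, demand distribution, and threshold restocking rule below are invented for illustration):

```python
import numpy as np

def simulate_pantry(days=30, restock_threshold=3, restock_amount=7, seed=0):
    """Simulate X_{t+1} = X_t + U_t - W_t under a threshold restocking policy."""
    rng = np.random.default_rng(seed)
    x = 5.0      # initial stock (hypothetical)
    trips = 0
    for _ in range(days):
        u = restock_amount if x < restock_threshold else 0.0  # action U_t
        if u > 0:
            trips += 1
        w = rng.uniform(0.5, 1.5)   # random daily consumption W_t
        x = max(x + u - w, 0.0)     # stock cannot go negative
    return x, trips

final_stock, trips = simulate_pantry()
```

Even with this crude threshold rule, the number of grocery trips stays far below the number of days, which is exactly the kind of tradeoff an optimal control formulation would tune.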

**Moving objects.** Consider a physical model of a
flying object. The simplest model of the dynamics of this system is
given by Newton’s laws of mechanics. Let Z_t denote the position of the vehicle (this
is a three-dimensional vector). The derivative of position is velocity
V_t = \frac{\partial Z_t}{\partial t}\,,
and the derivative of velocity is acceleration
A_t = \frac{\partial V_t}{\partial t}\,.
Now, we can approximate the rules in discrete time with the
simple Taylor approximations
\begin{aligned}
Z_{t+1} &= Z_t + \Delta V_t\\
V_{t+1} &= V_t + \Delta A_t
\end{aligned}
We also know, by Newton’s second law, that acceleration is equal
to the total applied force divided by the mass of the object: F=mA. The flying object will be subject to
external forces such as gravity and wind W_t and it will also receive forces from its
propellers U_t. Then we can add
this to the equations to yield a model
\begin{aligned}
Z_{t+1} &= Z_t + \Delta V_t\\
V_{t+1} &= V_t + \frac{\Delta}{m} (W_t+U_t)\,.
\end{aligned}

Oftentimes, we only observe the acceleration through an accelerometer. Then estimating the position and velocity becomes a filtering problem. Optimal control problems for this model include flying the object along a given trajectory or flying to a desired location in a minimal amount of time.
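As a rough sketch, here is the discretized double-integrator update in code, with hypothetical values for the time step \Delta and the mass m. With no disturbance and a constant thrust along one axis, velocity grows linearly and position quadratically, as Newton’s laws predict:

```python
import numpy as np

# Discretized double-integrator dynamics from the text:
#   Z_{t+1} = Z_t + Delta * V_t
#   V_{t+1} = V_t + (Delta / m) * (W_t + U_t)
def step(z, v, u, w, delta=0.1, m=1.0):
    z_next = z + delta * v
    v_next = v + (delta / m) * (w + u)
    return z_next, v_next

z, v = np.zeros(3), np.zeros(3)
u = np.array([0.0, 0.0, 1.0])   # constant thrust along the third axis
for _ in range(10):
    z, v = step(z, v, u, w=np.zeros(3))
```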

## Markov decision processes

Our definition in terms of structural equations is not the only
dynamical model used in machine learning. Some people prefer to work
directly with probabilistic transition models and conditional
probabilities. In a *Markov Decision Process*, we again have a
state X_t and input U_t, and they are linked by a probabilistic
model
\mathop\mathbb{P}[X_{t+1}\mid X_t,U_t]\,.
This is effectively the same as the structural equation model
above except we hide the randomness in this probabilistic notation.

**Example: machine repair.** The following example
illustrates the elegance of the conditional probability models for
dynamical systems. This example comes from Bertsekas, *Dynamic
Programming and Optimal Control*, 4th ed., vol. 1
(Athena Scientific, 2017). Suppose we have a
machine with ten states of repair. State 10 denotes excellent condition
and 1 denotes the inability to function. Every time one uses the machine
in state j, it has a probability
of falling into disrepair, given by the probabilities \mathop\mathbb{P}[X_{t+1} = i\mid X_t=j]\,,
where \mathop\mathbb{P}[X_{t+1}=i\mid
X_t=j]=0 if i>j. The
action one can take at any time
is to repair the machine, resetting the system state to 10. Hence
\mathop\mathbb{P}[X_{t+1}=i\mid X_{t}=j,U_t=0] =
\mathop\mathbb{P}[X_{t+1}=i\mid X_{t}=j]
and
\mathop\mathbb{P}[X_{t+1}=i\mid X_{t}=j,U_t=1] = \mathbb{1}\left\{ i=10
\right\} \,.
While we could write this dynamical system as a structural
equation model, it is more conveniently expressed by these probability
tables.
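To see how compact these probability tables are, here is one way to write them down as arrays. The particular disrepair probabilities are made up; only the structure — zero probability of improvement without repair, and a deterministic reset on repair — comes from the example:

```python
import numpy as np

n = 10  # states 1..10 indexed 0..9; index 9 is "excellent", 0 is "broken"

# P0[j, i] = P[X_{t+1}=i | X_t=j, U_t=0]: using the machine can only keep
# it in place or send it into disrepair. The uniform decay probabilities
# below are invented for illustration.
P0 = np.zeros((n, n))
for j in range(n):
    P0[j, : j + 1] = 1.0 / (j + 1)   # uniform over states <= j

# P1[j, i] = P[X_{t+1}=i | X_t=j, U_t=1]: repairing resets the state to 10.
P1 = np.zeros((n, n))
P1[:, n - 1] = 1.0
```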

# Optimal sequential decision making

Just as risk minimization was the main optimization problem we
studied in static decision theory, there is an abstract class of
optimization problems that underlie most sequential decision making
(SDM) problems. The main problem is to find a sequence of decision
*policies* that maximize a cumulative reward subject to the
uncertain, stochastic system dynamics. At each time, we assign a
reward R_t(X_t,U_t,W_t) to the
current state-action pair. The goal is to find a sequence of actions to
make the summed reward as large as possible:
\begin{array}{ll}
\text{maximize}_{\{u_t\}} & \mathop\mathbb{E}_{W_t}\left[
\sum_{t=0}^T R_t(X_t,u_t,W_t) \right]\\
\text{subject to} & X_{t+1} = f_t(X_t, u_t, W_t)\\
& \text{($x_0$ given)}
\end{array}
Here, the expected value is over the sequence of stochastic
disturbance variables W_t. Note
that since W_t is a random variable,
X_t is hence also a random
variable. The sequence of actions \{u_t\} is our decision variable. It could be
chosen via a random or deterministic procedure as a matter of design.
But it is important to understand what information is allowed to be used
in order to select u_t.

Since the dynamics are stochastic, the optimal SDM problem typically
allows a policy to observe the state before deciding upon the next
action. This allows a decision strategy to continually mitigate
uncertainty through feedback. This is why we optimize over policies
rather than over a deterministic sequence of actions. That is, our goal
is to find functions of the current state \pi_t such that U_t=\pi_t(X_t, X_{t-1}, \ldots) is optimal in
expected value. By a *control policy* (or simply “a policy”) we
mean a function that takes a trajectory from a dynamical system and
outputs a new control action. In order for \pi_t to be implementable, it must have
access only to current and past states and actions.

The policies \pi_t are the decision variables of the problem:
\begin{array}{ll}
\text{maximize}_{\pi_t} & \mathop\mathbb{E}_{W_t}\left[ \sum_{t=0}^T R_t(X_t,U_t,W_t) \right]\\
\text{subject to} & X_{t+1} = f(X_t, U_t, W_t) \\
& U_t = \pi_t(X_t, X_{t-1},\ldots)\\
& \text{($x_0$ given)}
\end{array}
Now, U_t is explicitly a random variable as it is a function of the state X_t.

This SDM problem will be the core of what we study in this chapter.
And our study will follow a similar path to the one we took with
decision theory. We will first study how to solve these SDM problems
when we know the model. There is a general purpose solution for these
problems known as *dynamic programming*.

# Dynamic programming

The dynamic programming solution to the SDM problem is based on the
*principle of optimality*: if you’ve found an optimal control
policy for a time horizon of length T, \pi_1,\ldots, \pi_T, and you want to know the
optimal strategy starting at state x at time t, then you just have to take the optimal
policy starting at time t, \pi_t,\ldots,\pi_T. The best analogy for this
is based on driving directions: if you have mapped out an optimal route
from Seattle to Los Angeles, and this path goes through San Francisco,
then you must also have the optimal route from San Francisco to Los
Angeles as the tail end of your trip. Dynamic programming is built on
this principle, allowing us to recursively find an optimal policy by
starting at the final time and going backwards in time to solve for the
earlier stages.

To proceed, define the *Q-function* to be the mapping:
\begin{aligned}
\mathcal{Q}_{a\rightarrow b}(x,u)
&= \max_{\{u_t\}}\mathop\mathbb{E}_{W_t}\left[ \sum_{t=a}^b
R_t(X_t,u_t,W_t)\right] \\
& \mathrm{s.t.}\quad X_{t+1} = f_t(X_t, u_t,
W_t),\quad(X_a,u_a)=(x,u)
\end{aligned}
The Q-function determines the best achievable value of the SDM
problem over times a to b when the action at time a is set to be u and the initial condition is x. It then follows that the optimal value of
the SDM problem is \max_u
\mathcal{Q}_{0\rightarrow T}(x_0,u), and the optimal policy
is \pi(x_0) = \arg \max_u
\mathcal{Q}_{0\rightarrow T}(x_0,u). If we had access to the
Q-function for the horizon {0,T},
then we’d have everything we’d need to know to take the first step in
the SDM problem. Moreover, the optimal policy is only a function of the
current state of the system. Once we see the current state, we have all
the information we need to predict future states, and hence we can
discard the previous observations.

We can use dynamic programming to compute this Q-function and the Q-function associated with every subsequent action. That is, clearly we have that the terminal Q-function is \mathcal{Q}_{T\rightarrow T}(x,u) = \mathop\mathbb{E}_{W_T}\left[ R_T(x,u,W_T) \right]\,, and then compute recursively \mathcal{Q}_{t\rightarrow T}(x,u) = \mathop\mathbb{E}_{W_t}\left[ R_t(x,u,W_t) + \max_{u'} \mathcal{Q}_{t+1 \rightarrow T}(f_t(x,u,W_t),u')\right]\,. This expression is known as Bellman’s equation. We also have that for all times t, the optimal policy is u_t = \arg\max_u \mathcal{Q}_{t\rightarrow T} (x_t,u) and the policy depends only on the current state.
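A minimal sketch of this backward recursion for a tabular problem with deterministic rewards (the two-state dynamics and rewards below are invented for illustration, and determinism makes the expectations disappear):

```python
import numpy as np

def backward_dp(P, R, T):
    """Compute Q_{t -> T} for t = T, ..., 0 via Bellman's equation.

    P[u] is an S x S transition matrix for action u, and R is an S x A
    table of deterministic rewards. Returns the Q-functions and the
    greedy policy argmax_u Q_{t -> T}(x, u) at each time.
    """
    S, A = R.shape
    Q = [None] * (T + 1)
    Q[T] = R.copy()   # terminal condition: Q_{T -> T}(x, u) = R(x, u)
    for t in range(T - 1, -1, -1):
        V_next = Q[t + 1].max(axis=1)   # max_{u'} Q_{t+1 -> T}(x', u')
        Q[t] = R + np.stack([P[u] @ V_next for u in range(A)], axis=1)
    policies = [q.argmax(axis=1) for q in Q]
    return Q, policies

# Two states, two actions: action 0 stays put, action 1 swaps states.
# State 1 yields reward 1; state 0 yields nothing.
P = [np.eye(2), np.array([[0.0, 1.0], [1.0, 0.0]])]
R = np.array([[0.0, 0.0], [1.0, 1.0]])
Q, policies = backward_dp(P, R, T=3)
```

On this toy instance, the greedy policy at time 0 correctly swaps out of the unrewarding state and stays in the rewarding one.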

To derive this form of the Q-function, we assume inductively that this form is true for all times beyond t+1 and then have the chain of identities \begin{aligned} \mathcal{Q}_{t\rightarrow T}(x,u) &= \max_{\pi_{t+1},\ldots,\pi_T} \mathop\mathbb{E}_{w}\left[R_t(x,u,W_t) + \sum_{s=t+1}^T R_s(X_s,\pi_{s}(X_s),W_s)\right]\\ &= \mathop\mathbb{E}_{W_t}\left[R_t(x,u,W_t) + \max_{\pi_{t+1},\ldots,\pi_T}\mathop\mathbb{E}_{W_{t+1},\ldots,W_T} \left\{ \sum_{s=t+1}^T R_s(X_s,\pi_{s}(X_s),W_s) \right\}\right]\\ &=\mathop\mathbb{E}_{W_t}\left[R_t(x,u,W_t) + \max_{\pi_{t+1}} Q\left\{f(x,u,W_t),\pi_{t+1}(f(x,u,W_t))\right\}\right]\\ &= \mathop\mathbb{E}_{W_t}\left[R_t(x,u,W_t) + \max_{u'} Q(f(x,u,W_t),u')\right]\,. \end{aligned} Here, the most important point is that the maximum can be exchanged with the expectation with respect to the first W_t. This is because the policies are allowed to make decisions based on the history of observed states, and these states are deterministic functions of the noise process.

## Infinite time horizons and stationary policies

The Q-functions we derived for these finite time horizons are
*time varying*. One applies a different policy for each step in
time. However, on long horizons with time invariant dynamics and costs,
we can get a simpler formula. First, for example, consider the limit:
\begin{array}{ll}
\text{maximize} & \lim_{N\rightarrow
\infty} \mathop\mathbb{E}_{W_t}[ \frac{1}{N} \sum_{t=0}^N
R(X_t,U_t,W_t) ]\\
\text{subject to} & X_{t+1} = f(X_t, U_t, W_t),~U_t=\pi_t(X_t)\\
& \text{($x_0$ given).}
\end{array}
Such infinite time horizon problems are referred to as
*average cost* dynamic programs. Note that there are no
subscripts on the rewards or transition functions in this model.

Average cost dynamic programming is deceptively difficult. These
formulations are not directly amenable to standard dynamic programming
techniques except in cases with special structure. A considerably
simpler infinite time formulation is known as *discounted*
dynamic programming, and this is the most widely studied formulation.
Discounting is a mathematical convenience that dramatically simplifies
algorithms and analysis. Consider the SDM problem
\begin{array}{ll}
\text{maximize} & (1-\gamma) \mathop\mathbb{E}_{W_t}[
\sum_{t=0}^\infty \gamma^t R(X_t,U_t,W_t) ]\\
\text{subject to} & X_{t+1} = f(X_t, U_t,
W_t),~U_t=\pi_t(X_t)\\
& \text{($x_0$ given).}
\end{array}
where \gamma is a scalar
in (0,1) called the *discount
factor*. For \gamma close
to 1, the discounted reward is
approximately equal to the average reward. However, unlike the average
cost model, the discounted cost has particularly clean optimality
conditions. If we define \mathcal{Q}_\gamma(x,u) to be the Q-function
obtained from solving the discounted problem with initial
condition x, then we have a
discounted version of dynamic programming, now with the same Q-functions
on the left and right hand sides:
\mathcal{Q}_\gamma(x,u) = \mathop\mathbb{E}_{W} \left[ R(x,u,W) +
\gamma \max_{u'} \mathcal{Q}_\gamma(f(x,u,W),u')\right]\,.
The optimal policy is now for *all times* to let
u_t= \arg\max_u \mathcal{Q}_\gamma(x_t,u)\,.
The policy is time invariant and one can execute it without any
knowledge of the reward or dynamics functions. At every stage, one
simply has to maximize a function to find the optimal action.
Foreshadowing the next chapter, the formula additionally suggests that
the amount that needs to be “learned” in order to “control” is not very
large for these infinite time horizon problems.

# Computation

Though dynamic programming is a beautiful universal solution to the very general SDM problem, the generality also suggests computational barriers. Dynamic programming is only efficiently solvable for special cases, and we now describe a few important examples.

## Tabular MDPs

Tabular MDPs refer to Markov Decision Processes with a small number of states and actions. Say that there are S states and A actions. Then the transition rules are given by tables of conditional probabilities \mathop\mathbb{P}[X_{t+1}|X_t,U_t], and the size of such tables is S^2 A. The Q-functions for the tabular case are also tables, each of size SA, enumerating the cost-to-go for all possible state-action pairs. In this case, the maximization \max_{u'}\mathcal{Q}_{a\rightarrow b}(x,u') corresponds to looking through all of the actions and choosing the largest entry in the table. Hence, in the case that the rewards are deterministic functions of x and u, Bellman’s equation simplifies to \mathcal{Q}_{t\rightarrow T}(x,u) = R_t(x,u) + \sum_{x'} \mathop\mathbb{P}[X_{t+1}=x'|X_t=x,U_t=u] \max_{u'} \mathcal{Q}_{t+1\rightarrow T}(x',u')\,. This function can be computed by elementary matrix-vector operations: the Q-functions are S \times A arrays of numbers. The “max” operation can be performed by operating over each row in such an array. The summation with respect to x' can be implemented by multiplying an SA \times S array by an S-dimensional vector. We complete the calculation by summing the resulting expression with the S\times A array of rewards. Hence, the total time to compute \mathcal{Q}_{t\rightarrow T} is O(S^2A).
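One Bellman backup, implemented as exactly the matrix-vector computation just described (a numpy sketch; the two-state MDP in the test is a toy instance):

```python
import numpy as np

def bellman_backup(P, R, Q_next):
    """One step of Bellman's equation for a tabular MDP.

    P has shape (S, A, S) with P[x, u, x'] = P[X_{t+1}=x' | X_t=x, U_t=u];
    R and Q_next are S x A tables (rewards assumed deterministic). The
    dominant cost is the (S*A) x S matrix times S-vector product,
    giving O(S^2 A) overall.
    """
    S, A, _ = P.shape
    V = Q_next.max(axis=1)                        # row-wise max: O(SA)
    EV = (P.reshape(S * A, S) @ V).reshape(S, A)  # expectation over x'
    return R + EV
```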

## Linear quadratic regulator

The other important problem where dynamic programming is efficiently
solvable is the case when the dynamics are *linear* and the
rewards are *quadratic*. In control design, this class of
problems is generally referred to as the problem of the Linear Quadratic
Regulator (LQR):
\begin{array}{ll}
\text{minimize} \, & \mathop\mathbb{E}_{W_t}
\left[\frac{1}{2}\sum_{t=0}^T X_t^T \Phi_t X_t + U_t^T \Psi_t
U_t\right], \\
\text{subject to} & X_{t+1} = A_t X_t+ B_t U_t + W_t,~U_t=\pi_t(X_t)
\\
& \text{($x_0$ given).}
\end{array}
Here, \Phi_t
and \Psi_t are most commonly
positive semidefinite matrices. W_t is noise with zero mean and bounded
variance, and we assume W_t
and W_{t'} are independent
when t\neq t'. The state
transitions are governed by a linear update rule with A_t and B_t appropriately sized matrices. We also
abide by the common convention in control textbooks to pose the problem
as a minimization—not maximization—problem.

As we have seen above, many real-world systems can be modeled by linear
dynamics. However, we haven’t yet discussed cost functions.
It’s important to emphasize here that cost functions are
*designed*, not given. Recall from supervised learning: though
we wanted to minimize the number of errors made on out-of-sample data,
on in-sample data we minimized convex surrogate problems. The situation
is exactly the same in this more complex world of dynamical decision
making. Cost functions are designed by the engineer so that the SDM
problems are tractable but also so that the desired outcomes are
achieved. Cost function design is part of the toolkit for online
decision making, and quadratic costs can often yield surprisingly good
performance for complex problems.

Quadratic costs are also attractive for computational reasons. They
are convex as long as \Phi_t
and \Psi_t are positive definite.
Quadratic functions are closed under minimization, maximization, and
addition. And for zero-mean noise W_t with covariance \Sigma, we know that the noise interacts
nicely with the cost function. That is, we have
\mathbb{E}_W[(x+W)^T M (x+W)] = x^T M x + \operatorname{Tr}(M\Sigma)
for any vector x and
matrix M. Hence, when we run
dynamic programming, every Q-function is necessarily quadratic.
Moreover, since the Q-functions are quadratic, the optimal action is a
*linear function* of the state
U_t = -K_t X_t
for some matrix K_t.
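We can sanity-check the trace identity above numerically; the particular M, \Sigma, and x below are arbitrary choices for the check:

```python
import numpy as np

rng = np.random.default_rng(0)
M = np.array([[2.0, 0.5], [0.5, 1.0]])       # a symmetric cost matrix
Sigma = np.array([[1.0, 0.3], [0.3, 0.5]])   # noise covariance
x = np.array([1.0, -2.0])

# Monte Carlo estimate of E[(x+W)^T M (x+W)] for W with covariance Sigma.
W = rng.multivariate_normal(np.zeros(2), Sigma, size=200_000)
xw = x + W
empirical = np.mean(np.einsum('ni,ij,nj->n', xw, M, xw))

# The identity: x^T M x + Tr(M Sigma).
exact = x @ M @ x + np.trace(M @ Sigma)
```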

Now consider the case where there are static costs \Phi_t=\Phi and \Psi_t=\Psi, and time invariant dynamics such
that A_t=A and B_t=B for all t. One can check that the Q-function on a
finite time horizon satisfies a recursion
\mathcal{Q}_{t \rightarrow T}(x,u) = x^T \Phi x + u^T \Psi u + (Ax+Bu)^T
M_{t+1} (Ax+Bu) + c_t\,.
for some positive definite matrix M_{t+1}. In the limit as the time horizon
tends to infinity, the optimal control *policy* is *static,
linear state feedback*:
u_t = -K x_t\,.
Here the matrix K is
defined by
K=(\Psi + B^T M B)^{-1} B^T M A
and M is a solution to the
*Discrete Algebraic Riccati Equation*
M = \Phi + A^T M A - (A^T M B)(\Psi + B^T M B)^{-1} (B^T M A)\,.
Here, M is the unique
solution of the Riccati equation where all of the eigenvalues
of A-BK have magnitude less
than 1. Finding this specific
solution is relatively easy using standard linear algebraic techniques.
It is also the limit of the Q-functions computed above.
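One simple way to find this solution, sketched below, is to run the Riccati recursion until it converges and then read off K. (Production code would use a dedicated DARE solver; the double-integrator matrices here are a hypothetical test case.)

```python
import numpy as np

def solve_lqr(A, B, Phi, Psi, iters=500):
    """Iterate the Riccati recursion to (approximate) convergence, then
    form the static feedback gain K = (Psi + B^T M B)^{-1} B^T M A."""
    M = Phi.copy()
    for _ in range(iters):
        G = Psi + B.T @ M @ B
        M = Phi + A.T @ M @ A - (A.T @ M @ B) @ np.linalg.solve(G, B.T @ M @ A)
    K = np.linalg.solve(Psi + B.T @ M @ B, B.T @ M @ A)
    return M, K

# Double integrator with force input (Delta = 1, m = 1 for simplicity).
A = np.array([[1.0, 1.0], [0.0, 1.0]])
B = np.array([[0.0], [1.0]])
Phi, Psi = np.eye(2), np.eye(1)
M, K = solve_lqr(A, B, Phi, Psi)
```

The resulting closed-loop matrix A - BK should have all of its eigenvalues inside the unit circle, and M should satisfy the Riccati equation up to numerical error.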

## Policy and value iteration

Two of the most well studied methods for solving such discounted
infinite time horizon problems are *value iteration* and
*policy iteration*. Value iteration proceeds by the steps
\mathcal{Q}_{k+1} (x,u) = \mathop\mathbb{E}_{W} \left[ R(x,u,W) +
\gamma \max_{u'} \mathcal{Q}_k (f(x,u,W),u')\right]\,.
That is, it simply tries to solve the Bellman equation by
running a fixed point operation. This method succeeds when the iteration
is a contraction mapping, and this occurs in many contexts.

On the other hand, Policy Iteration is a two step procedure:
*policy evaluation* followed by *policy improvement*.
Given a policy \pi_k, the policy
evaluation step is given by
\mathcal{Q}_{k+1} (x,u) = \mathop\mathbb{E}\left[ R(x,u,W) + \gamma
\mathcal{Q}_k (f(x,u,W),\pi_k(f(x,u,W)))\right]\,.
And then the policy is updated by the rule
\pi_{k+1}(x) = \arg\max_u \mathcal{Q}_{k+1} (x,u)\,.
Oftentimes, several steps of policy evaluation are performed
before updating the policy.
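A sketch of value iteration for the tabular case with deterministic rewards (the two-state MDP is invented for illustration; policy iteration would replace the inner max by the current policy’s action):

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-8):
    """Value iteration for a tabular discounted MDP.

    P has shape (S, A, S); R is an S x A deterministic reward table.
    The Bellman update is a gamma-contraction in the sup norm, so the
    iterates converge to the fixed point Q_gamma."""
    S, A, _ = P.shape
    Q = np.zeros((S, A))
    while True:
        V = Q.max(axis=1)
        Q_new = R + gamma * (P.reshape(S * A, S) @ V).reshape(S, A)
        if np.max(np.abs(Q_new - Q)) < tol:
            return Q_new
        Q = Q_new

# Toy instance: action 0 stays put, action 1 swaps states;
# state 1 yields reward 1, state 0 yields nothing.
P = np.zeros((2, 2, 2))
P[:, 0, :] = np.eye(2)
P[:, 1, :] = np.array([[0.0, 1.0], [1.0, 0.0]])
R = np.array([[0.0, 0.0], [1.0, 1.0]])
Q = value_iteration(P, R, gamma=0.9)
```

On this instance the fixed point is easy to verify by hand: staying in the rewarding state earns 1/(1-\gamma) = 10, and swapping into it from the other state earns \gamma times that.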

For both policy and value iteration, we need to be able to compute
expectations efficiently and must be able to update *all values*
of x and u in the associated Q functions. This is certainly doable for
tabular MDPs. For general low dimensional problems, policy iteration and
value iteration can be approximated by gridding state space, and then
treating the problem as a tabular one. Then, the resulting Q function can be extended to
other (x,u) pairs by
interpolation. There are also special cases where the maxima and minima
yield closed form solutions and hence these iterations reduce to simpler
forms. LQR is a canonical example of such a situation.

## Model predictive control

If the Q-functions in value or policy iteration converge quickly,
long-term planning might not be necessary, and we can effectively solve
the infinite horizon problem with short-term planning. This is the key idea
behind one of the most powerful techniques for efficiently and
effectively finding quality policies for SDM problems called *model
predictive control*.

Suppose that we aim to solve the infinite horizon average reward
problem:
\begin{array}{ll}
\text{maximize} & \lim_{T\rightarrow \infty}
\mathop\mathbb{E}_{W_t}[\frac{1}{T}\sum_{t=0}^T R_t(X_t,U_t) ]\\
\text{subject to} & X_{t+1} = f_t(X_t, U_t, W_t)\\
& U_t = \pi_t(X_t)\\
& \text{($x_0$ given).}
\end{array}
Model Predictive Control computes an *open loop* policy
on a finite horizon H
\begin{array}{ll}
\text{maximize}_{u_t} & \mathop\mathbb{E}_{W_t}[ \sum_{t=0}^H
R_t(X_t,u_t)]\\
\text{subject to} & X_{t+1} = f_t(X_t, u_t, W_t)\\
& \text{($X_0$ = x).}
\end{array}
This gives a sequence u_0(x),\ldots, u_H(x). The policy is then set
to be \pi(x)=u_0(x). After this
policy is executed, we observe a new state, x', based on the dynamics. We then
recompute the optimization, now using x_0=x' and setting the action to
be \pi(x')= u_0(x').

MPC is a rather intuitive decision strategy. The main idea is to plan out a sequence of actions for a given horizon, taking into account as much uncertainty as possible. But rather than executing the entire sequence, we play the first action and then gain information from the environment about the noise. This direct feedback influences the next planning stage. For this reason, model predictive control is often a successful control policy even when implemented with inaccurate or approximate models. Model Predictive Control also allows us to easily add a variety of constraints to our plan at little cost, such as bounds on the magnitude of the actions. We just append these to the optimization formulation and then lean on the computational solver to make us a plan with these constraints.
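A toy sketch of this receding-horizon loop (the brute-force planner, scalar dynamics, and action grid below are invented for illustration; real MPC implementations use structured solvers rather than enumeration):

```python
import numpy as np
from itertools import product

def plan(x0, H, actions, f, r):
    """Brute-force open-loop planner: search all action sequences of
    length H on the nominal (noise-free) model and return the best
    first action. Only feasible for tiny action grids."""
    best, best_u0 = -np.inf, actions[0]
    for seq in product(actions, repeat=H):
        x, total = x0, 0.0
        for u in seq:
            total += r(x, u)
            x = f(x, u)
        if total > best:
            best, best_u0 = total, seq[0]
    return best_u0

# Hypothetical scalar example: drive x toward 0 with a bounded action set.
f = lambda x, u: x + u                   # nominal dynamics
r = lambda x, u: -(x**2 + 0.1 * u**2)    # reward (negative cost)
actions = [-1.0, 0.0, 1.0]

rng = np.random.default_rng(0)
x = 3.0
for _ in range(20):                       # receding-horizon loop
    u = plan(x, H=3, actions=actions, f=f, r=r)   # replan from current state
    x = f(x, u) + 0.05 * rng.standard_normal()    # true, noisy dynamics
```

Note the structure: only the first planned action is executed, and the observed (noisy) next state seeds the next plan.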

To concretely see how Model Predictive Control can be effective, it’s helpful to work through an example. Let’s suppose the dynamics and rewards are time invariant. Let’s suppose further that the reward function is bounded above, and there is some state-action pair (x_\star,u_\star) which achieves this maximal reward R_\mathrm{max}.

Suppose we solve the finite time horizon problem where we enforce that (x,u) must be at (x_\star,u_\star) at the end of the time horizon: \begin{array}{ll} \text{maximize} & \mathop\mathbb{E}_{W_t}[ \sum_{t=0}^H R(X_t,u_t)]\\ \text{subject to} & X_{t+1} = f(X_t, u_t, W_t)\\ & (X_H,U_H) = (x_\star,u_\star)\\ & \text{($X_0$ = x).} \end{array} We replan at every time step by solving this optimization problem and taking the first action.

The following proposition summarizes how this policy performs:

Assume that all rewards are bounded above by R_{\mathrm{max}}. Then with the above MPC policy, we have for all T that \mathbb{E}\left[ \frac{1}{T} \sum_{t=0}^T R(x_t,u_t)\right] \geq \frac{Q_{0\rightarrow H}(x_0,u_0)-H R_{\mathrm{max}}}{T} + \mathbb{E}_W[R(f(x_\star,u_\star,W),0)]\,.

The proposition asserts that there is a “burn in” cost associated with the initial horizon length. This term goes to zero with T, but will have different values for different H. The policy converges to a residual average cost due to the stochasticity of the problem and the fact that we try to force the system to the state (x_\star,u_\star).

To analyze how the policy performs, we turn to Bellman’s equation. For any time t, the MPC policy is u_t = \arg\max_u \mathcal{Q}_{0\rightarrow H}(x_t,u)\,. Now, \mathcal{Q}_{0\rightarrow H}(x_t,u) = R(x_t,u) + \mathbb{E}[\max_{u'} \mathcal{Q}_{1\rightarrow H}(X_{t+1},u')]\,.

Now consider what to do at the time t+1. A *suboptimal* strategy at this
time is to try to play the optimal strategy on the horizon 1\rightarrow H, and then do nothing on the
last step. That is,
\max_u \mathcal{Q}_{0\rightarrow H}(x_{t+1},u) \geq
\max_{u'} \mathcal{Q}_{1\rightarrow
H}(x_{t+1},u') + \mathbb{E}[R(f(x_\star,u_\star,W_{t+H}),0)]\,.
The last expression follows because the action sequence
from 1\rightarrow H enforces (x_{t+H},u_{t+H})=(x_\star,u_\star). The
first term on the right hand side was computed in expectation above,
hence we have
\mathbb{E}[\max_u \mathcal{Q}_{0\rightarrow H}(X_{t+1},u)] \geq
\mathbb{E}[\mathcal{Q}_{0\rightarrow H}(x_t,u_t)] -
\mathbb{E}[R(x_t,u_t,W)] + \mathbb{E}[R(f(x_\star,u_\star,W),0)] \,.
Unwinding this recursion, we find
\begin{aligned}
\mathbb{E}[\max_u \mathcal{Q}_{0\rightarrow H}(X_{T+1},u)] &\geq
\mathcal{Q}_{0\rightarrow H}(x_0,u_0) - \mathbb{E}\left[\sum_{t=0}^T
R(x_t,u_t,W_t)\right]\\
&\qquad\qquad\qquad + T \mathbb{E}[R(f(x_\star,u_\star,W),0)]
\,.
\end{aligned}
Since the rewards are bounded above, we can upper bound the left
hand side by R_{\mathrm{max}} H.
Rearranging terms then proves the theorem.

The main caveat with this argument is that there may not
*exist* a policy that drives x to x_\star from an arbitrary initial condition
and any realization of the disturbance signal. Much of the analysis of
MPC schemes is devoted to guaranteeing that the problems are
*recursively feasible*, meaning that such constraints can be met
for all time.

This example also shows how it is often helpful to have some sort of recourse at the end of the planning horizon to mitigate the possibility of being too greedy and driving the system into a bad state. The terminal condition of forcing x_H=x_\star adds an element of safety to the planning, and ensures stable execution for all time. More generally, adding some terminal condition \mathcal{C}(x_{H}) to the planning horizon is part of good Model Predictive Control design and is often a powerful way to balance performance and robustness.

# Partial observation and the separation heuristic

Let’s now move to the situation where instead of observing the state directly, we observe an output Y_t: Y_t = h_t(X_t,U_t,W_t)\,. All of the policies we derived from optimization formulations above required feeding back a function of the state. When we can only act on outputs, SDM problems are considerably more difficult.

*Static output feedback is NP-hard.* Consider the case of just building a static policy from the output Y_t. Let's suppose our model is just the simple linear model: \begin{aligned} X_{t+1} &= A X_t + B U_t\\ Y_t &= C X_t \end{aligned} Here, A, B, and C are matrices. Suppose we want to find a feedback policy U_t = K Y_t, where K is a matrix, and all we want to guarantee is that for any initial x_0, the system state converges to zero. This problem is called *static output feedback* and is surprisingly *NP-hard*. It turns out that the problem is equivalent to finding a matrix K such that A+BKC has all of its eigenvalues inside the unit circle in the complex plane (Blondel and Tsitsiklis, "A Survey of Computational Complexity Results in Systems and Control," *Automatica* 36, no. 9 (2000): 1249–74). Though in the MDP case static state feedback was not only optimal but also computable for tabular MDPs and certain other SDM problems, static output feedback is computationally intractable.

*POMDPs are PSPACE-hard.* Papadimitriou and Tsitsiklis showed that optimization of general POMDPs, even on small state spaces, is in all likelihood completely intractable ("The Complexity of Markov Decision Processes," *Mathematics of Operations Research* 12, no. 3 (1987): 441–50). They reduced the problem of *quantifier elimination* in logical satisfiability problems (QSAT) to POMDPs. QSAT seeks to determine the validity of statements like "there exists x such that for all y there exists z such that for all w this logical formula is true." Optimal policies in POMDPs essentially have to keep track of all of the possible true states that might have been visited given the partial observations and choose actions accordingly. Hence, the policies have a similar flavor to quantifier elimination, as they seek actions that are beneficial for all possible occurrences of the unobserved variables. Since these policies act over long time horizons, the number of counterfactuals that must be maintained grows exponentially.
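To make the eigenvalue condition concrete, here is a minimal numerical check (a sketch using numpy; the function name and example matrices are our own): a gain K stabilizes the closed loop exactly when the spectral radius of A+BKC is less than one.

```python
import numpy as np

def is_stabilizing(A, B, C, K):
    """Check whether the static output feedback u_t = K y_t stabilizes
    x_{t+1} = A x_t + B u_t with y_t = C x_t, i.e. whether every
    eigenvalue of A + B K C lies strictly inside the unit circle."""
    closed_loop = A + B @ K @ C
    return np.max(np.abs(np.linalg.eigvals(closed_loop))) < 1.0

# A scalar example: x_{t+1} = 2 x_t + u_t with y_t = x_t.
A = np.array([[2.0]])
B = np.array([[1.0]])
C = np.array([[1.0]])
assert not is_stabilizing(A, B, C, np.array([[0.0]]))  # open loop is unstable
assert is_stabilizing(A, B, C, np.array([[-1.5]]))     # closed loop is 0.5
```

Of course, checking a candidate K is easy; the hardness lies in searching for one, since the set of stabilizing gains can be nonconvex or empty.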

Despite these challenges, engineers solve POMDP problems all of the time. Just because the problems are hard in general doesn't mean they are intractable on average. It only means that we cannot expect to have general purpose optimal algorithms for these problems. Fortunately, suboptimal solutions are oftentimes quite good in practice, and there are many useful heuristics for decision making with partial information. The most common approach to the output feedback problem is the following two-stage strategy:

1. *Filtering.* Using all of your past data \{y_s\} for s=0,\ldots,t, build an estimate, \hat{x}_t, of your state.
2. *Action based on certainty equivalence.* Solve the desired SDM problem as if you had perfect observation of the state X_t, using \hat{x}_t wherever you would have used an observation X_t=x_t. At run time, plug in the estimator \hat{x}_t as if it were a perfect measurement of the state.

This strategy uses a *separation principle* between prediction
and action. For certain problems, this two-staged approach is actually
optimal. Notably, if the SDM problem has quadratic rewards/costs, if the
dynamics are linear, and if the noise process is Gaussian, then the
separation between prediction and action is optimal. More commonly, the
separation heuristic is suboptimal, but this abstraction also enables a
simple heuristic that is easy to debug and simple to design.

While we have already covered algorithms for optimal control, we have
not yet discussed state estimation. Estimating the state of a dynamical
system receives the special name *filtering*. However, at the end
of the day, filtering is a prediction problem. Define the observed data
up to time t as
\tau_t := (y_t,\ldots, y_1, u_{t-1}, \ldots, u_1)\,.
The goal of filtering is to estimate a function h(\tau_t) that predicts X_t. We now describe two approaches to
filtering.

## Optimal filtering

Given a model of the state transition function and the observation
function, we can attempt to compute the maximum *a posteriori*
estimate of the state from the data. That is, we could compute p(x_t | \tau_t) and then estimate the mode of
this density. Here, we show that such an estimator has a relatively
simple recursive form, though it is not always computationally
tractable to evaluate.

To proceed, we first need a calculation that takes advantage of the conditional independence structure of our dynamical system model. Note that \begin{aligned} p(y_t, x_t, & x_{t-1}, u_{t-1} | \tau_{t-1} ) = \\ &p(y_t| x_t, x_{t-1}, u_{t-1}, \tau_{t-1} ) p(x_t | x_{t-1},u_{t-1}, \tau_{t-1} )\\ &\qquad\qquad\qquad\qquad\times p(x_{t-1}|\tau_{t-1} ) p(u_{t-1}|\tau_{t-1} )\\ &= p(y_t| x_t ) p(x_t | x_{t-1},u_{t-1} ) p(x_{t-1}|\tau_{t-1} ) p(u_{t-1}|\tau_{t-1} )\,. \end{aligned} This decomposition consists of terms we now recognize. The terms p(x_t | x_{t-1},u_{t-1} ) and p(y_t| x_t ) define the POMDP model and are known. p(u_t|\tau_t) is our policy, which is what we're trying to design. The only unknown here is p(x_{t-1}|\tau_{t-1} ), but this expression gives us a recursive formula for p(x_{t}|\tau_{t} ) for all t.

To derive this formula, we apply Bayes rule and then use the above
calculation:
\begin{aligned}
p(x_t | \tau_t) &=\frac{\int_{x_{t-1}} p(x_t , y_t, x_{t-1}, u_{t-1}
| \tau_{t-1}) }
{\int_{x_t,x_{t-1}}
p(x_t , y_t, x_{t-1}, u_{t-1}| \tau_{t-1})}\\
&=\frac{\int_{x_{t-1}} p(y_t| x_t ) p(x_t | x_{t-1},u_{t-1} )
p(x_{t-1}|\tau_{t-1} )
p(u_{t-1}|\tau_{t-1} ) }
{\int_{x_{t},x_{t-1}} p(y_t| x_t ) p(x_t | x_{t-1},u_{t-1} )
p(x_{t-1}|\tau_{t-1} )
p(u_{t-1}|\tau_{t-1} )}\\
&=\frac{\int_{x_{t-1}} p(y_t| x_t ) p(x_t | x_{t-1},u_{t-1} )
p(x_{t-1}|\tau_{t-1} ) }
{\int_{x_t,x_{t-1}} p(y_t| x_t ) p(x_t| x_{t-1},u_{t-1} )
p(x_{t-1}|\tau_{t-1} )}\,.\qquad (1)
\end{aligned}
Given a prior for x_0,
this now gives us a formula to compute a MAP estimate of x_t for all t, incorporating data in a streaming fashion.
For tabular POMDP models with small state spaces, this formula can be
computed simply by summing up the conditional probabilities. In POMDPs
without inputs—also known as *hidden Markov models*—this formula
gives the forward pass of Viterbi’s decoding algorithm. For models where
the dynamics are linear and the noise is Gaussian, these formulas reduce
to an elegant closed form solution known as *Kalman filtering*.
In general, this optimal filtering algorithm is called *belief
propagation* and is the basis of a variety of algorithmic techniques
in the field of graphical models.
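For a tabular POMDP, one step of the recursion in Equation 1 is just a matrix-vector product followed by a pointwise likelihood reweighting and normalization. Here is a minimal sketch (the array layout, names, and example model are our own assumptions, not from the text):

```python
import numpy as np

def bayes_filter_step(belief, u, y, P_trans, P_obs):
    """One step of the recursive filter in Equation 1 for a tabular POMDP.

    belief[i]       = p(x_{t-1} = i | tau_{t-1})
    P_trans[u][i,j] = p(x_t = j | x_{t-1} = i, u_{t-1} = u)
    P_obs[j, y]     = p(y_t = y | x_t = j)
    Returns the updated belief p(x_t | tau_t)."""
    predicted = belief @ P_trans[u]           # marginalize over x_{t-1}
    unnormalized = P_obs[:, y] * predicted    # multiply in the likelihood of y_t
    return unnormalized / unnormalized.sum()  # the denominator of Equation 1

# Two-state example with a single action and a noisy binary observation.
P_trans = {0: np.array([[0.9, 0.1], [0.2, 0.8]])}
P_obs = np.array([[0.8, 0.2], [0.3, 0.7]])
belief = np.array([0.5, 0.5])            # uniform prior on x_0
belief = bayes_filter_step(belief, 0, 1, P_trans, P_obs)
x_map = int(np.argmax(belief))           # MAP state estimate
```

Each step costs time quadratic in the number of states, which is why the tabular recursion is practical only for small state spaces.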

## Kalman filtering

For the case of linear dynamical systems, the above calculation has a
simple closed form solution that looks similar to the solution of LQR.
This estimator is called a *Kalman filter*, and is one of the
most important tools in signal processing and estimation. The Kalman
filter assumes a linear dynamical system driven by Gaussian noise with
observations corrupted by Gaussian noise
\begin{aligned}
X_{t+1} &= A X_t+ B U_t + W_t\\
Y_{t} & = C X_t + V_t\,.
\end{aligned}
Here, assume W_t
and V_t are independent for all
time and Gaussian with means zero and covariances \Sigma_W and \Sigma_V respectively. Because of the joint
Gaussianity, we can compute a closed form formula for the density
of X_t conditioned on the past
observations and actions, p(x_t |
\tau_t). Indeed, this conditional distribution is itself Gaussian.

On an infinite time horizon, the Kalman filter takes a simple and elucidating form:
\begin{aligned}
\hat{x}_{t+1} &= A \hat{x}_t + B u_t - L (y_t - \hat{y}_t)\\
\hat{y}_t &= C \hat{x}_t
\end{aligned}
where L=A P C^T (C P C^T + \Sigma_V)^{-1} and P is the positive
semidefinite solution to the discrete algebraic Riccati equation
P = A P A^T + \Sigma_W - (A P C^T) (C P C^T +\Sigma_V)^{-1} (C P A^T)\,.

The derivation of this form follows from the calculation in
Equation 1. But the explanation of
the formulas tend to be more insightful than the derivation. Imagine the
case where L=0. Then our
estimate \hat{x}_t simulates the
same dynamics as our model with no noise corruption. The
matrix L computes a correction for
this simulation based on the observed y_t. This feedback correction is chosen in
such a way that P is the
steady state covariance of the error X_t-\hat{x}_t. P ends up being the minimum variance possible
with an estimator that is *unbiased* in the sense that \mathop\mathbb{E}[\hat{x}_t-X_t]=0.

Another interesting property of this calculation is that
the L matrix is the LQR gain
associated with the LQR problem
\begin{array}{ll}
\text{minimize} \, & \lim_{T\rightarrow
\infty}\mathop\mathbb{E}_{W_t} \left[\frac{1}{2}\sum_{t=0}^T X_t^T
\Sigma_W X_t + U_t^T \Sigma_V U_t\right], \\
\text{subject to} & X_{t+1} = A^T X_t+ C^T U_t + W_t,~U_t=\pi_t(X_t)
\\
& \text{($x_0$ given).}
\end{array}
Control theorists often refer to this pairing as the *duality
between estimation and control*.
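One simple way to compute the steady-state gain L is to iterate the Riccati recursion to its fixed point P. Below is a hypothetical sketch (function name and example system are our own; a dedicated DARE solver such as `scipy.linalg.solve_discrete_are` would be the more robust choice in practice):

```python
import numpy as np

def steady_state_kalman_gain(A, C, Sigma_W, Sigma_V, iters=500):
    """Iterate the discrete algebraic Riccati equation to a fixed point P
    and return the steady-state Kalman gain L = A P C^T (C P C^T + Sigma_V)^{-1}."""
    P = np.eye(A.shape[0])
    for _ in range(iters):
        S = C @ P @ C.T + Sigma_V
        P = A @ P @ A.T + Sigma_W - A @ P @ C.T @ np.linalg.solve(S, C @ P @ A.T)
    S = C @ P @ C.T + Sigma_V
    L = A @ P @ C.T @ np.linalg.inv(S)
    return L, P

# Scalar example: x_{t+1} = 0.9 x_t + w_t, y_t = x_t + v_t, unit noise covariances.
A = np.array([[0.9]])
C = np.array([[1.0]])
L, P = steady_state_kalman_gain(A, C, np.array([[1.0]]), np.array([[1.0]]))
```

For this scalar system the fixed point can be checked by hand: the Riccati equation reduces to a quadratic in P, and L comes out strictly between zero and one, blending the model's prediction with the measurement.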

## Feedforward prediction

While the optimal filter can be computed in simple cases, we often do
not have simple computational means to compute the optimal state
estimate. That said, the problem of state estimation is necessarily one
of prediction, and the first half of this course gave us a general
strategy for building such estimators from data. Given many simulations
or experimental measurements of our system, we can try to estimate a
function h such that X_t \approx h(\tau_t). To make this concrete,
we can look at *time lags* of the history
\tau_{t-s\rightarrow t} := (y_t,\ldots, y_{t-s}, u_{t-1}, \ldots,
u_{t-s})\,.
Such time lags are necessarily all of the same length. Then
estimating
\begin{array}{ll}
\text{minimize}_h & \sum_t
\operatorname{loss}(h(\tau_{t-s\rightarrow t}), x_t)
\end{array}
is a supervised learning problem, and standard tools can be
applied to design architectures for and estimate h.
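As a concrete sketch of this supervised approach, the following fits a linear architecture for h by least squares on a simulated scalar system where the true states are logged (the system, noise levels, and variable names are all our own illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate a scalar linear system, logging both states and noisy outputs.
A, C, s = 0.9, 1.0, 2   # s = number of time lags in the feature vector
T = 2000
x = np.zeros(T)
y = np.zeros(T)
for t in range(T - 1):
    y[t] = C * x[t] + 0.1 * rng.standard_normal()
    x[t + 1] = A * x[t] + 0.5 * rng.standard_normal()
y[T - 1] = C * x[T - 1] + 0.1 * rng.standard_normal()

# Features: the lagged history tau_{t-s -> t} = (y_t, ..., y_{t-s});
# this example has no inputs, so the history contains outputs only.
features = np.column_stack([y[s - k: T - k] for k in range(s + 1)])
targets = x[s:T]

# Fit h by least squares: a linear architecture for the filter.
w, *_ = np.linalg.lstsq(features, targets, rcond=None)
mse = np.mean((features @ w - targets) ** 2)
```

Any regression architecture could replace the least-squares fit here; the key point is that once states are available in training data, filtering becomes an ordinary supervised learning problem.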

# Chapter notes

This chapter and the following chapter overlap significantly with a
survey of reinforcement learning by Recht ("A Tour of Reinforcement
Learning: The View from Continuous Control," *Annual Review of Control,
Robotics, and Autonomous Systems* 2 (2019)), which contains additional
connections to continuous control. Those interested in learning more
about continuous control from an optimization viewpoint should consult
the book by Borrelli, Bemporad, and Morari (*Predictive Control for
Linear and Hybrid Systems*, Cambridge University Press, 2017). This
book also provides an excellent introduction to model predictive
control. Another excellent introduction to continuous optimal control
and filtering is Boyd's lecture notes ("EE363: Linear Dynamical
Systems," 2009).

An invaluable introduction to the subject of dynamic programming is
by Bertsekas (*Dynamic Programming and Optimal Control*), who has done
pioneering research in this space and has written some of the most
widely read texts. For readers interested in a mathematical
introduction to dynamic programming on discrete processes, we recommend
Puterman's text (*Markov Decision Processes: Discrete Stochastic
Dynamic Programming*, Wiley-Interscience, 1994). Puterman also explains
the linear programming formulation of dynamic programming.

# References

Bertsekas, Dimitri P. *Dynamic Programming and Optimal Control*. 4th ed. Vol. 1. Athena Scientific, 2017.

Blondel, Vincent D., and John N. Tsitsiklis. "A Survey of Computational Complexity Results in Systems and Control." *Automatica* 36, no. 9 (2000): 1249–74.

Borrelli, Francesco, Alberto Bemporad, and Manfred Morari. *Predictive Control for Linear and Hybrid Systems*. Cambridge University Press, 2017.

Papadimitriou, Christos H., and John N. Tsitsiklis. "The Complexity of Markov Decision Processes." *Mathematics of Operations Research* 12, no. 3 (1987): 441–50.

Puterman, Martin L. *Markov Decision Processes: Discrete Stochastic Dynamic Programming*. Wiley-Interscience, 1994.

Recht, Benjamin. "A Tour of Reinforcement Learning: The View from Continuous Control." *Annual Review of Control, Robotics, and Autonomous Systems* 2 (2019).