# Causal inference in practice

The previous chapter introduced the conceptual foundations of causality, but there’s a lot more to learn about how these concepts play out in practice. In fact, there’s a flourishing practice of causal inference in numerous scientific disciplines. Increasingly, ideas from machine learning show up in the design of causal estimators. Conversely, ideas from causal inference can help machine learning practitioners run better experiments.

In this chapter we focus on estimating the average treatment effect, often abbreviated as ATE, of a binary treatment T on an outcome variable Y: \mathop\mathbb{E}[Y \mid \mathrm{do}(T:=1) ] - \mathop\mathbb{E}[Y \mid \mathrm{do}(T:=0)]\,. Causal effects are population quantities that involve two hypothetical actions, one holding the treatment variable constant at the treatment value 1, the other holding the treatment constant at its baseline value 0.

The central question in causal inference is how we can estimate causal quantities, such as the average treatment effect, from data.

Confounding between the outcome and treatment variable is the main impediment to causal inference from observational data. Recall that random variables Y and T are confounded if the conditional probability distribution of Y given T does not equal its interventional counterpart: \mathop\mathbb{P}\{Y=y\mid \mathrm{do}(T:=t)\} \ne \mathop\mathbb{P}\{Y=y\mid T=t\}\,. If these expressions were equal, we could estimate the average treatment effect directly by estimating the difference \mathop\mathbb{E}[Y\mid T=1]-\mathop\mathbb{E}[Y\mid T=0] from samples. Confounding makes the estimation of treatment effects more challenging, and sometimes impossible. Note that the main challenge is to arrive at an expression for the desired causal effect that is free of any causal constructs, such as the do-operator. Once we have a plain probability expression at hand, tools from statistics allow us to relate the population quantity to a finite sample estimate.

# Design and inference

There are two important components to causal inference: one is
*design*, the other is *inference*.

In short, design is about sorting out various substantive questions about the data generating process. Inference is about the statistical apparatus that we unleash on the data in order to estimate a desired causal effect.

Design requires us to decide on a population, a set of variables to include, and a precise question to ask. In this process we need to engage substantively with relevant scientific domain knowledge in order to understand what assumptions we can make about the data.

Design can only be successful if the assumptions we are able to make permit the estimation of the causal effect we’re interested in. In particular, this is where we need to think carefully about potential sources of confounding and how to cope with them.

There is no way statistical estimators can recover from poor design. If the design does not permit causal inference, there is simply no way that a clever statistical trick could remedy the shortcoming. It’s therefore apt to think of causal insights as consequences of the substantive assumptions that we can make about the data, rather than as products of sophisticated statistical ideas.

Hence, we emphasize design issues throughout this chapter and intentionally do not dwell on technical statements about rates of estimation. Such mathematical statements can be valuable, but design must take precedence.

## Experimental and observational designs

Causal inference distinguishes between *experimental* and
*observational* designs. Experimental designs generally are
active in the sense of administering some treatment to some set of
experimental units. Observational designs do not actively assign
treatment, but rather aim to make it possible to identify causal effects
from collected data without implementing any interventions.

The most common and well-established experimental design is a randomized controlled trial (RCT). The main idea is to assign treatment randomly. A randomly assigned treatment, by definition, is not influenced by any other variable. Hence, randomization eliminates any confounding bias between treatment and outcome.

In a typical implementation of a randomized controlled trial,
subjects are randomly partitioned into a *treatment group* and a
*control group*. The treatment group receives the treatment, the
control group receives no treatment. It is important that subjects do
not know which group they were assigned to. Otherwise knowledge of their
assignment may influence the outcome. To ensure this, subjects in the
control group receive what is called a *placebo*, a device or
procedure that looks indistinguishable from treatment to the study
subject, but lacks the treatment ingredient whose causal powers are in
question. Adequate placebos may not exist depending on what the
treatment is, for example, in the case of a surgery.

Randomized controlled trials have a long history with many success stories. They’ve become an important source of scientific knowledge.

Sometimes randomized controlled trials are difficult, expensive, or
impossible to administer. Treatment might be physically or legally
impossible, too costly, or too dangerous. Nor are they free of issues
and pitfalls (Deaton and Cartwright, “Understanding and Misunderstanding
Randomized Controlled Trials,” *Social Science & Medicine* 210
(2018): 2–21). In this chapter, we will see
observational alternatives to randomized controlled trials. However,
these are certainly not without their own set of difficulties and
shortcomings.

The machine learning practitioner is likely to encounter
randomization in the form of so-called *A/B tests*. In an A/B
test we randomly assign one of two treatments to a set of individuals.
Such experiments are common in the tech industry to find out which of
two changes to a product leads to a better outcome.
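As a tiny illustration of why randomization works, the sketch below simulates a hypothetical A/B test in which a covariate influences the outcome but, by design, not the treatment. Because the randomized treatment is unconfounded, a simple difference in group means recovers the average treatment effect. All numbers, including the true effect of 2.0, are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Hypothetical data generating process with a covariate x.
x = rng.normal(size=n)                       # covariate affecting the outcome
t = rng.integers(0, 2, size=n)               # randomized treatment: independent of x
y = 2.0 * t + 1.5 * x + rng.normal(size=n)   # true average treatment effect is 2.0

# Difference in means: an unbiased ATE estimate because t was randomized.
ate_hat = y[t == 1].mean() - y[t == 0].mean()
```

Had the treatment depended on x, the same difference in means would have been biased by confounding.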

# The observational basics: adjustment and controls

For the remainder of the chapter we focus on observational causal inference methods. In the previous chapter we saw that there are multiple ways to cope with confounding between treatment and outcome. One of them is to adjust (or control) for the parents (i.e., direct causes) of T via the adjustment formula.

The extra variables that we adjust for are also called
*controls*, and we take the phrase *controlling for* to
mean the same thing as *adjusting for*.

We then saw that we could use any set of random variables satisfying the graphical backdoor criterion. This is helpful in cases where some direct causes are unobserved so that we cannot use them in the adjustment formula.

Let’s generalize this idea even further and call a set of variables
*admissible* if it satisfies the adjustment formula.

We say that a discrete random variable X is *admissible* if it satisfies the
adjustment formula:
\mathop\mathbb{P}[Y=y\mid \mathrm{do}(T:=t)] = \sum_x
\mathop\mathbb{P}[Y=y\mid T=t, X=x] \mathop\mathbb{P}[X=x]
Here we sum over all values x in the support of X.

The definition directly suggests a basic estimator for the do-intervention.

**Basic adjustment estimator.**

- Collect n samples (t_i, y_i, x_i)_{i=1}^n.
- Estimate each of the conditional probabilities \mathop\mathbb{P}[Y=y\mid T=t,X=x] and the marginal probabilities \mathop\mathbb{P}[X=x] from the collected samples.
- Compute the sum \sum_x \mathop\mathbb{P}[Y=y\mid T=t, X=x] \mathop\mathbb{P}[X=x] from these estimates.

This estimator can only work if all slices \{T=t,X=x\} have nonzero probability, an
assumption often called *overlap* or *positivity* in
causal inference.
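Assuming overlap holds, the three steps of the basic adjustment estimator can be sketched directly. The data generating process below is entirely hypothetical: a binary confounder drives both a binary treatment and a binary outcome, so the naive difference in means is biased while the adjusted estimate recovers the true effect of 0.3.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

# Hypothetical confounded data: x influences both treatment and outcome.
x = rng.integers(0, 2, size=n)                    # binary confounder
t = rng.binomial(1, np.where(x == 1, 0.8, 0.2))   # treatment depends on x
y = rng.binomial(1, 0.1 + 0.3 * t + 0.4 * x)      # true treatment effect: 0.3

def adjusted_mean(t_val):
    # sum over x of P[Y=1 | T=t, X=x] * P[X=x], each estimated from samples
    total = 0.0
    for x_val in (0, 1):
        in_slice = (t == t_val) & (x == x_val)
        total += y[in_slice].mean() * (x == x_val).mean()
    return total

naive = y[t == 1].mean() - y[t == 0].mean()       # confounded estimate
ate_hat = adjusted_mean(1) - adjusted_mean(0)     # adjusted estimate
```

Under this process the adjusted estimate lands near 0.3, while the naive difference in means is inflated well above it by confounding.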

But the basic estimator also fails when the adjustment variable X can take on too many possible values. In general, X could correspond to a tuple of features, such as age, height, and weight. The support of X then grows exponentially with the number of features. This poses an obvious computational problem, but more importantly a statistical one as well. By a counting argument, some of the events \{T=t,X=x\} must have probability as small as the inverse of the size of the support of X. To estimate a probability p>0 from samples to within small relative error, we need on the order of 1/p samples.

Much work in causal inference deals with overcoming the statistical inefficiency of the basic estimator. Conceptually, however, most sophisticated estimators work from the same principle. We need to assume that we have an admissible variable X and that positivity holds. Different estimators then use this assumption in different ways.

## Potential outcomes and ignorability

In the causal inference literature, the average treatment effect often appears equivalently in its potential outcomes notation \mathop\mathbb{E}[Y_1 - Y_0]. The two formulations are mathematically equivalent, and either works for our purposes.

When talking about potential outcomes, it’s customary to replace the
assumption that X is admissible
with another essentially equivalent assumption called
*ignorability* or *unconfoundedness*. To recall from the
previous chapter, this assumption requires that the potential outcomes
variables are conditionally independent of treatment given X. Formally, T\bot (Y_0, Y_1)\mid X. It’s not hard to show
that ignorability implies that X
is admissible.

# Reductions to model fitting

Adjustment gives a simple and general way to estimate causal effects given an admissible set of variables. The primary shortcoming that we discussed is the sample inefficiency of the formula in high-dimensional settings.

There’s a vast literature of causal estimators that aim to address this central shortcoming in a range of different settings. While the landscape of causal estimators might seem daunting to newcomers, almost all causal inference methods share a fundamental idea: reduce causal inference to standard supervised machine learning tasks.

Let’s see how this central idea plays out in a few important cases.

## Propensity scores

Propensity scores are one popular way to cope with adjustment
variables that have large support. Let T
\in \{ 0, 1 \} be a binary treatment variable. The quantity
e(x) = \mathop\mathbb{E}[T \mid X=x]
is known as the *propensity score* and gives the
likelihood of treatment in the subpopulation defined by the
condition X=x.

Suppose that X is admissible and the propensity scores are positive, e(x) > 0, for all x. Then, \begin{aligned} \mathop\mathbb{E}[ Y \mid \mathrm{do}(T := 1) ] = \mathop\mathbb{E}\left[ \frac{YT}{e(X)} \right]\,. \end{aligned}

Applying the adjustment formula for a fixed y, we have \begin{aligned} \mathop\mathbb{P}[Y = y \mid \mathrm{do}(T := 1)] &= \sum_x \mathop\mathbb{P}[ Y = y \mid T = 1, X=x] \mathop\mathbb{P}[X=x]\\ &= \sum_x \frac{\mathop\mathbb{P}[ Y = y \mid T= 1, X=x] \mathop\mathbb{P}[X=x] \mathop\mathbb{P}[T = 1 \mid X=x]}{\mathop\mathbb{P}[T = 1 \mid X=x]}\\ &= \sum_x \frac{\mathop\mathbb{P}[Y = y, T = 1, X=x]}{\mathop\mathbb{P}[T = 1 \mid X=x]}\\ &= \sum_{x, t\in\{0, 1\}} \frac{t \mathop\mathbb{P}[Y = y,T = t,X=x] }{\mathop\mathbb{P}[T = 1 \mid X=x]}\,. \end{aligned} Here, we used that e(x) = \mathop\mathbb{P}[T = 1 \mid X=x] \neq 0. Completing the proof, \begin{aligned} \mathop\mathbb{E}[ Y \mid \mathrm{do}(T := 1)] &= \sum_y y \mathop\mathbb{P}[Y = y \mid \mathrm{do}(T := 1)] \\ &= \sum_{y, x, t\in\{0, 1\}} \frac{y t \mathop\mathbb{P}[ Y = y, T = t, X=x ]}{\mathop\mathbb{P}[T = 1 \mid X=x]} = \mathop\mathbb{E}\left[ \frac{YT}{e(X)} \right]\,. \end{aligned}

The same theorem also shows that
\mathop\mathbb{E}[Y \mid \mathrm{do}(T := 0) ] = \mathop\mathbb{E}\left[
\frac{Y(1 - T)}{1 - e(X)} \right]
and thus the average treatment effect of T on Y
is given by
\mathop\mathbb{E}[ Y \mid \mathrm{do}(T := 1)]
- \mathop\mathbb{E}[Y \mid \mathrm{do}(T := 0)]
= \mathop\mathbb{E}\left[
Y\left(\frac{T}{e(X)}-\frac{1-T}{1-e(X)}\right) \right]\,.
This formula for the average treatment effect is called
*inverse propensity score weighting*. Let’s understand what it
buys us compared with the adjustment formula when working with a finite
sample.

One way to approximate the expectation given in the theorem above is to collect many samples from which we estimate the propensity score e(x) separately for each possible value x. However, this approach runs into the very same issues as the basic estimator. Practitioners therefore choose a different route.

In a first step, we fit a model \hat e to the propensity scores hoping that our model \hat e approximates the propensity score function e uniformly well. We approach this step as we would any other machine learning problem. We create a dataset of observations (x_i, e_i) where e_i is an empirical estimate of e(x_i) that we compute from our sample. We then fit a model to these data points using our favorite statistical technique, be it logistic regression or something more sophisticated.

In a second step, we then use our model’s estimated propensity scores in our sample estimate instead of the true propensity scores: \frac{1}{n} \sum_{i=1}^n \frac{t_i y_i}{\hat e (x_i)}. The appeal of this idea is that we can use the entire repertoire of model fitting to get a good function approximation of the propensity scores. Depending on what the features are we could use logistic regression, kernel methods, random forests, or even deep models. Effectively we’re reducing the problem of causal inference to that of model fitting, which we know how to do.
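Here is a minimal sketch of the two-step procedure on synthetic data. The propensity model is a logistic regression fit by plain gradient descent rather than a library routine, and the data generating process, including the true effect of 2.0, is made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200_000

x = rng.normal(size=n)
e_true = 1 / (1 + np.exp(-x))            # true propensity score
t = rng.binomial(1, e_true)
y = 2.0 * t + x + rng.normal(size=n)     # true average treatment effect is 2.0

# Step 1: fit a logistic model e_hat(x) = sigmoid(a*x + b) by gradient descent.
a, b = 0.0, 0.0
for _ in range(500):
    p = 1 / (1 + np.exp(-(a * x + b)))
    grad_a = np.mean((p - t) * x)        # gradient of the logistic loss in a
    grad_b = np.mean(p - t)              # gradient of the logistic loss in b
    a -= 1.0 * grad_a
    b -= 1.0 * grad_b
e_hat = 1 / (1 + np.exp(-(a * x + b)))

# Step 2: inverse propensity weighting with the fitted scores.
ate_hat = np.mean(y * (t / e_hat - (1 - t) / (1 - e_hat)))
```

Any other model fitting technique could stand in for the gradient descent loop; the weighting step stays the same.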

## Double machine learning

Our previous reduction to model fitting has a notable shortcoming. The propensity score estimate \hat e(x_i) appears in the denominator of our estimator. This has two consequences. First, unbiased estimates of propensity scores do not imply an unbiased estimate of the causal effect. Second, when propensity scores are small and samples aren’t too plentiful, this can lead to substantial variance.

There’s a popular way to cope, called *double machine
learning*, that works in a partially linear structural causal model:
Y = \tau T + g(X) + U\,,
\qquad T = f(X) + V
In this model, the variable X is an observed confounder between treatment
and outcome. We allow the functions g and f
to be arbitrary, but note that g
only depends on X but not
on T as it could in general. The
random variables U, V are
independent exogenous noise variables with mean 0. In this model, the effect of treatment on
the outcome is linear and the coefficient \tau is the desired average treatment
effect.

The trick behind double machine learning is to subtract \mathop\mathbb{E}[Y\mid X] from each side of the first equation and to use the fact that \mathop\mathbb{E}[Y\mid X]=\tau\mathop\mathbb{E}[T\mid X] + g(X). We therefore get the equation Y-\mathop\mathbb{E}[Y\mid X] = \tau (T-\mathop\mathbb{E}[T\mid X]) + U\,. Denoting \tilde Y = Y-\mathop\mathbb{E}[Y\mid X] and \tilde T=T-\mathop\mathbb{E}[T\mid X] we can see that the causal effect \tau is the solution to the regression problem \tilde Y = \tau \tilde T+U.

The idea now is to solve *two* regression problems to find
good function approximations of the conditional expectations \mathop\mathbb{E}[Y\mid X] and \mathop\mathbb{E}[T\mid X], respectively. We
can do this using data drawn from the joint distribution of (X, T, Y) by solving two subsequent model
fitting problems, hence the name double machine learning.

Suppose then that we find two function approximations q(X) \approx \mathop\mathbb{E}[Y\mid X] and r(X)\approx\mathop\mathbb{E}[T\mid X]. We can define the random variables \hat Y = Y - q(X) and \hat T = T - r(X). The final step is to solve the regression problem \hat Y = \hat\tau\hat T + U for the parameter \hat\tau.

Compared with inverse propensity score weighting, we can see that finite sample errors in estimating the conditional expectations have a more benign effect on the causal effect estimate \hat\tau. In particular, unbiased estimates of the conditional expectations lead to an unbiased estimate of the causal effect.
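The two-stage procedure can be sketched on synthetic data from a partially linear model. Polynomial least squares stands in for the two model fitting problems; the functions f and g and the true effect \tau = 1.5 are made-up choices.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100_000

x = rng.normal(size=n)
t = np.sin(x) + rng.normal(size=n) * 0.5   # f(x) = sin(x), plus noise V
y = 1.5 * t + x**2 + rng.normal(size=n)    # tau = 1.5, g(x) = x^2, noise U

# Stage 1: fit E[Y|X] and E[T|X], here with simple polynomial least squares.
def fit_predict(x, target, degree=5):
    coeffs = np.polyfit(x, target, degree)
    return np.polyval(coeffs, x)

y_res = y - fit_predict(x, y)   # residual Y - E[Y|X]
t_res = t - fit_predict(x, t)   # residual T - E[T|X]

# Stage 2: regress the residual outcome on the residual treatment.
tau_hat = np.dot(t_res, y_res) / np.dot(t_res, t_res)
```

Any regression method could replace the polynomial fits, which is precisely the appeal of the reduction.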

## Heterogeneous treatment effects

In many applications, treatment effects can vary by subpopulation. In
such cases we may be interested in the *conditional average treatment
effect* (CATE) in the subpopulation defined by X=x:
\tau(x)=\mathop\mathbb{E}[Y \mid \mathrm{do}(T:=1), X=x ] -
\mathop\mathbb{E}[Y \mid \mathrm{do}(T:=0), X=x]\,.
We’re in luck, because the same proof we saw earlier shows that
we can estimate these so-called heterogeneous treatment effects with the
propensity score formula:
\tau(x) =
\mathop\mathbb{E}\left[Y\left(\frac{T}{e(X)}-\frac{1-T}{1-e(X)}\right)\mid
X=x\right]

We can also extend double machine learning easily to the heterogeneous case by replacing the coefficient \tau in the first structural equation with a function \tau(X) that depends on X. The argument remains the same except that in the end we need to solve the problem \hat Y = \hat\tau(X) \hat T + U, which amounts to optimizing over a function \hat\tau in some model family rather than a constant \hat\tau.

Both inverse propensity score weighting and double machine learning can, in principle, estimate heterogeneous treatment effects. These aren’t the only reductions to model fitting, however. Another popular method, called *causal forests*, constructs decision trees whose leaves correspond to covariate settings that deconfound treatment and outcome (Wager and Athey, “Estimation and Inference of Heterogeneous Treatment Effects Using Random Forests,” *Journal of the American Statistical Association* 113, no. 523 (2018): 1228–42).

# Quasi-experiments

The idea behind quasi-experimental designs is that sometimes
processes in nature or society are structured in a way that enables
causal inference. The three most widely used quasi-experimental designs
are *regression discontinuities*, *instrumental
variables*, and *differences in differences*. We will review
the first two briefly to see where machine learning comes in.

## Regression discontinuity

Many consequential interventions in society trigger when a certain score R exceeds a threshold value t. The idea behind a regression discontinuity design is that units that fall just below the threshold are indistinguishable from units just above the threshold. In other words, whether a unit falls just above or just below the threshold is a matter of pure chance. We can then hope to identify a causal effect of an intervention by comparing units just below and just above the threshold.

To illustrate the idea, consider an intervention in a hospital
setting that is assigned to newborn children just below a birth weight
of 1500g. We can ask if the intervention has a causal effect on
wellbeing of the child at a later age, as reflected in an outcome
variable such as mortality or cumulative hospital cost in their first
year. We expect various factors to influence both birth weight and
outcome variable. But we hope that these confounding factors are
essentially held constant right around the threshold weight of 1500g.
Regression discontinuity designs have indeed been used to answer such
questions for a number of different outcome variables (Almond et al.,
“Estimating Marginal Returns to Medical Care: Evidence from at-Risk
Newborns,” *The Quarterly Journal of Economics* 125, no. 2 (2010):
591–634; Bharadwaj, Løken, and Neilson, “Early Life Health Interventions
and Academic Achievement,” *American Economic Review* 103, no. 5
(2013): 1862–91).

Once we have identified the setup for a regression discontinuity, the idea is to perform two regressions. One fits a model to the data below the threshold. The other fits the model to data above the threshold. We then take the difference of the values that the two models predict at the threshold as our estimate of the causal effect. As usual, the idea works out nicely in an idealized linear setting and can be generalized in various ways.
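A minimal sketch of this two-regression idea on synthetic data: treatment triggers below the threshold (as in the birth weight example), the outcome depends linearly on the running variable, and the jump of 2.0 at the threshold is a made-up true effect.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 50_000

r = rng.uniform(-1, 1, size=n)               # running variable, threshold at 0
t = (r < 0).astype(float)                    # treatment assigned below threshold
y = 0.5 * r + 2.0 * t + rng.normal(size=n) * 0.3   # jump of 2.0 at threshold

# Fit separate linear models on narrow windows below and above the threshold.
below = (r >= -0.5) & (r < 0)
above = (r >= 0) & (r <= 0.5)
fit_below = np.polyfit(r[below], y[below], 1)
fit_above = np.polyfit(r[above], y[above], 1)

# The difference of the two predictions at the threshold estimates the effect.
effect_hat = np.polyval(fit_below, 0.0) - np.polyval(fit_above, 0.0)
```

In practice the window width and the model class for each side are delicate choices; the linear fits here work because the synthetic outcome really is linear in the running variable.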

There are numerous subtle and not so subtle ways a regression
discontinuity design can fail. One subtle failure mode is when the
intervention incentivizes people to strategically make efforts to fall
just below or above the threshold. Manipulation or *gaming* of
the running variable is a well-known issue, for instance, when it comes
to social program eligibility (Camacho and Conover, “Manipulation of
Social Program Eligibility,” *American Economic Journal: Economic
Policy* 3, no. 2 (2011): 41–65). But there are other less obvious
cases. For example, school class sizes in data from Chile exhibit
irregularities that void regression discontinuity designs (Urquiola and
Verhoogen, “Class-Size Caps, Sorting, and the Regression-Discontinuity
Design,” *American Economic Review* 99, no. 1 (2009): 179–215). In
turn, researchers have come up with tests designed to catch such
problems.

## Instrumental variables

Instrumental variables are a popular quasi-experimental method for
causal inference. The starting point is confounding between a
treatment T and our outcome of
interest Y. We are in a situation
where we’re unable to resolve confounding via the adjustment formula.
However, what we have is the existence of a special variable Z called an *instrument* that will
help us estimate the treatment effect.

What makes Z a valid instrument is nicely illustrated with a causal graph: the instrument Z points into the treatment T, the treatment points into the outcome Y, and an unobserved confounder W points into both T and Y, while no edge connects Z to either Y or W.

The graph structure encodes two key assumptions:

- The instrument Z and the outcome Y are unconfounded.
- The instrument Z has no direct effect on the outcome Y.

Let’s walk through how this works out in the one-dimensional linear
structural equation for the outcome:
Y= \alpha + \beta T + \gamma W + N
Here, W is the unobserved confounder and N is an independent
noise term. For convenience, we denote the *error term* U=\gamma W + N. What we’re interested in is
the coefficient \beta since we can
easily verify that it corresponds to the average treatment effect:
\beta = \mathop\mathbb{E}[Y\mid\mathrm{do}(T:=1)] -
\mathop\mathbb{E}[Y\mid\mathrm{do}(T:=0)]
To find the coefficient \beta, we cannot directly solve the
regression problem Y=\alpha + \beta T +
U, because the error term U
is not independent of T due to the
confounding influence of W.

However, there’s a way forward after we make a few additional assumptions:

- The error term is zero mean: \mathop\mathbb{E}[U]=0
- The instrument is uncorrelated with the error term: \mathop\mathrm{Cov}(Z, U)=0
- Instrument and treatment have nonzero correlation: \mathop\mathrm{Cov}(Z, T)\ne 0

The first two assumptions directly imply
\mathop\mathbb{E}[Y-\alpha-\beta T]= 0 \qquad\text{and}\qquad
\mathop\mathbb{E}[Z(Y-\alpha-\beta T)] = 0\,.
This leaves us with two linear equations in \alpha and \beta so that we can solve for both
parameters. Indeed, \alpha=\mathop\mathbb{E}[Y]-\beta\mathop\mathbb{E}[T].
Plugging this into the second equation, we have
\mathop\mathbb{E}[Z((Y-\mathop\mathbb{E}[Y])-\beta(T-\mathop\mathbb{E}[T])
)]=0,
which implies, via our third assumption \mathop\mathrm{Cov}(T, Z)\ne 0,
\beta = \frac{\mathop\mathrm{Cov}(Z, Y)}{\mathop\mathrm{Cov}(T, Z)}\,.
There’s a different, intuitive way to derive this solution by
solving a *two-stage least squares* procedure:

- Predict the treatment from the instrument via least squares regression, resulting in the predictor \hat T = cZ.
- Predict the outcome from the predicted treatment using least squares regression, resulting in the predictor \hat Y=\beta' \hat T.

A calculation reveals that indeed \beta'=\beta, the desired treatment effect. To see this, note that c=\frac{\mathop\mathrm{Cov}(Z, T)}{\mathop\mathrm{Var}(Z)}\, and hence \beta' = \frac{\mathop\mathrm{Cov}(Y, \hat T)}{\mathop\mathrm{Var}(\hat T)} = \frac{\mathop\mathrm{Cov}(Y, Z)}{c\mathop\mathrm{Var}(Z)} = \frac{\mathop\mathrm{Cov}(Z, Y)}{\mathop\mathrm{Cov}(T, Z)} = \beta\,. This solution directly generalizes to the multi-dimensional linear case. The two-stage regression approach is in fact how instrumental variables are often introduced operationally. Once again, we see that instrumental variables are a clever way of reducing causal inference to prediction.
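The two-stage procedure can be sketched on synthetic data from the linear model above. The instrument strength, confounding strength, and true effect \beta = 2.0 are all made up; note how the naive regression is biased by the confounder while the instrumental variable estimate is not.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 200_000

w = rng.normal(size=n)                        # unobserved confounder
z = rng.normal(size=n)                        # instrument, independent of w
t = 0.8 * z + w + rng.normal(size=n)          # treatment driven by z and w
y = 1.0 + 2.0 * t + 3.0 * w + rng.normal(size=n)   # true beta = 2.0

# Naive regression of y on t is biased by the confounder w.
beta_naive = np.cov(t, y)[0, 1] / np.var(t)

# Stage 1: predict the treatment from the instrument.
c = np.cov(z, t)[0, 1] / np.var(z)
t_hat = c * z
# Stage 2: regress the outcome on the predicted treatment.
beta_iv = np.cov(t_hat, y)[0, 1] / np.var(t_hat)
```

The instrument-based estimate matches the closed form \mathop\mathrm{Cov}(Z, Y)/\mathop\mathrm{Cov}(T, Z), while the naive estimate absorbs the confounder's contribution.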

One impediment to instrumental variables is a poor correlation
between the instrument and the treatment. Such instruments are called
*weak instruments*. In this case, the denominator \mathop\mathrm{Cov}(T, Z) in our expression
for \beta is small and the
estimation problem is ill-conditioned. The other impediment is that the
causal graph corresponding to instrumental variables is not necessarily
easy to come by in applications. What’s delicate about the graph is that
we want the instrument to have a significant causal effect on the
treatment, but at the same time have no other causal powers that might
influence the outcome in a way that’s not mediated by the treatment.

Nonetheless, researchers have found several intriguing applications
of instrumental variables. One famous example goes by the name *judge
instruments*. The idea is that within the United States, at least in
certain jurisdictions and courts, defendants may be assigned randomly to
judges. Different judges then assign different sentences, some perhaps
more lenient, others harsher. The treatment here could be the sentence
length and the outcome may indicate whether or not the defendant went on
to commit another crime upon serving the prison sentence. A perfectly
random assignment of judges implies that the judge assignment and the
outcome are unconfounded. Moreover, the assignment of a judge has a
causal effect on the treatment, but plausibly no direct causal effect on
the outcome. The assignment of judges then serves as an instrumental
variable. The observation that judge assignments may be random has been
the basis of much causal inference about the criminal justice system.
However, the assumption of randomness in judge assignments has also been
challenged (Chilton and Levy, “Challenging the Randomness of Panel
Assignment in the Federal Courts of Appeals,” *Cornell L. Rev.* 101
(2015): 1).

# Limitations of causal inference in practice

It’s worth making a distinction between causal modeling broadly
speaking and the practice of causal inference today. The previous
chapter covered the concepts of causal modeling. Structural causal
models make it painfully clear that the model necessarily specifies
strong assumptions about the data generating process. In contrast, the
practice of causal inference we covered in this chapter seems almost
*model-free* in how it reduces to pattern classification via
technical assumptions. This appears to free the practitioner from
difficult modeling choices.

The assumptions that make this all work, however, are not verifiable from data. Some papers seek assurance in statistical robustness checks, but these too are sample-based estimates. Traditional robustness checks, such as resampling methods or leave-one-out estimates, may get at issues of generalization, but cannot speak to the validity of causal assumptions.

As a result, a certain pragmatic attitude has taken hold. If we cannot verify the assumption from data anyway, we might as well make it in order to move forward. But this is a problematic position. Qualitative and theoretical ways of establishing substantive knowledge remain relevant where the limitations of data set in. The validity of a causal claim cannot be established solely based on a sample. Other sources of substantive knowledge are required.

## Validity of observational methods

The empirical evidence regarding the validity of observational causal inference studies is mixed and depends on the domain of application.

A well-known article compared observational studies in the medical
domain between 1985 and 1998 to the results of randomized controlled
trials (Benson and Hartz, “A Comparison of Observational Studies and
Randomized, Controlled Trials,” *New England Journal of Medicine* 342,
no. 25 (2000): 1878–86). The conclusion was good news for observational
methods:

> We found little evidence that estimates of treatment effects in observational studies reported after 1984 are either consistently larger than or qualitatively different from those obtained in randomized, controlled trials.

Another study around the same time came to a similar conclusion:

> The results of well-designed observational studies (with either a cohort or a case–control design) do not systematically overestimate the magnitude of the effects of treatment as compared with those in randomized, controlled trials on the same topic. (Concato, Shah, and Horwitz, “Randomized, Controlled Trials, Observational Studies, and the Hierarchy of Research Designs,” *New England Journal of Medicine* 342, no. 25 (2000): 1887–92.)

One explanation, however, is that medical researchers may create observational designs with great care on the basis of extensive domain knowledge and prior investigation.

Freedman’s paper *Statistical Models and Shoe Leather*
illustrates this point through the famous example of John Snow’s
discovery from the 1850s that cholera is a waterborne disease (Freedman,
“Statistical Models and Shoe Leather,” *Sociological Methodology*,
1991, 291–313). Many associate Snow with an early use of quantitative
methods. But their application followed years of substantive
investigation and theoretical considerations that formed the basis of
the quantitative analysis.

In other domains, observational methods have been much less
successful. Online advertising, for example, generates hundreds of
billions of dollars in yearly global revenue, but the causal effects of
targeted advertising remain a subject of debate (Hwang, *Subprime
Attention Crisis* (Farrar, Straus and Giroux, 2020)). Randomized
controlled trials in this domain are rare for technical and cultural
reasons. Advertising platforms are highly optimized toward a particular
way of serving ads, which can make true randomization difficult to
implement. As a result, practitioners rely on a range of observational
methods to determine the causal effect of showing an ad. However, these
methods tend to perform poorly, as a recent large-scale study reveals:

> The observational methods often fail to produce the same effects as the randomized experiments, even after conditioning on extensive demographic and behavioral variables. We also characterize the incremental explanatory power our data would require to enable observational methods to successfully measure advertising effects. Our findings suggest that commonly used observational approaches based on the data usually available in the industry often fail to accurately measure the true effect of advertising. (Gordon et al., “A Comparison of Approaches to Advertising Measurement: Evidence from Big Field Experiments at Facebook,” *Marketing Science* 38, no. 2 (2019): 193–225.)

## Interference, interaction, and spillovers

Confounding is not the only threat to the validity of causal studies. In a medical setting, it’s often relatively easy to ensure that treatment of one subject does not influence the treatment assignment or outcome of any other unit. We called this the Stable Unit Treatment Value Assumption (SUTVA) in the previous chapter and noted that it holds by default for the units in a structural causal model. Failures of SUTVA, however, are common and go by many names, such as interference, interaction, and spillover effects.

Take the example of an online social network. Interaction between
units is the default in all online platforms, whose entire purpose is
that people interact. Administering treatment to a subset of the
platform’s users typically has some influence on the control group. For
example, if our treatment exposes a group of users to more content of a
certain kind, those users might share the content with others outside
the treatment group. In other words, treatment *spills over* to
the control group. In certain cases, this problem can be mitigated by
assigning treatment to a cluster in the social network that has a
boundary with few outgoing edges, thus limiting bias from interaction
(Eckles, Karrer, and Ugander, “Design and Analysis of Experiments in
Networks: Reducing Bias from Interference,” *Journal of Causal
Inference* 5, no. 1 (2016)).

Interference is also common in the economic development context. To
borrow an example from economist John Roemer (*How We Cooperate: A
Theory of Kantian Optimization*, Yale University Press, 2019), suppose
we want to know
if better fishing nets would improve the yield of fishermen in a town.
We design a field experiment in which we give better fishing nets to a
random sample of fishermen. The results show a significantly improved
yield for the treated fishermen. However, if we scale the intervention
to the entire population of fishermen, we might cause overfishing and
hence reduced yield for everyone.

# Chapter notes

Aside from the introductory texts from the previous chapter, a few more texts are particularly relevant in the context of this chapter.

The textbook by Angrist and Pischke (*Mostly Harmless Econometrics:
An Empiricist’s Companion*, Princeton University Press, 2008) covers
causal inference with an emphasis on regression analysis and
applications in econometrics. See Athey and Imbens (“The State of
Applied Econometrics: Causality and Policy Evaluation,” *Journal of
Economic Perspectives* 31, no. 2 (2017): 3–32) for a more recent survey
of the state of causal inference in econometrics.

Marinescu et al. (“Quasi-Experimental Causality in Neuroscience and
Behavioural Research,” *Nature Human Behaviour* 2, no. 12 (2018):
891–98) give a short introduction to quasi-experiments and their
applications to neuroscience, with a focus on regression discontinuity
design, instrumental variables, and differences in differences.

# References

Almond et al. “Estimating Marginal Returns to Medical Care: Evidence from at-Risk Newborns.” *The Quarterly Journal of Economics* 125, no. 2 (2010): 591–634.

Angrist and Pischke. *Mostly Harmless Econometrics: An Empiricist’s Companion*. Princeton University Press, 2008.

Athey and Imbens. “The State of Applied Econometrics: Causality and Policy Evaluation.” *Journal of Economic Perspectives* 31, no. 2 (2017): 3–32.

Benson and Hartz. “A Comparison of Observational Studies and Randomized, Controlled Trials.” *New England Journal of Medicine* 342, no. 25 (2000): 1878–86.

Bharadwaj, Løken, and Neilson. “Early Life Health Interventions and Academic Achievement.” *American Economic Review* 103, no. 5 (2013): 1862–91.

Camacho and Conover. “Manipulation of Social Program Eligibility.” *American Economic Journal: Economic Policy* 3, no. 2 (2011): 41–65.

Chilton and Levy. “Challenging the Randomness of Panel Assignment in the Federal Courts of Appeals.” *Cornell L. Rev.* 101 (2015): 1.

Concato, Shah, and Horwitz. “Randomized, Controlled Trials, Observational Studies, and the Hierarchy of Research Designs.” *New England Journal of Medicine* 342, no. 25 (2000): 1887–92.

Deaton and Cartwright. “Understanding and Misunderstanding Randomized Controlled Trials.” *Social Science & Medicine* 210 (2018): 2–21.

Eckles, Karrer, and Ugander. “Design and Analysis of Experiments in Networks: Reducing Bias from Interference.” *Journal of Causal Inference* 5, no. 1 (2016).

Freedman. “Statistical Models and Shoe Leather.” *Sociological Methodology*, 1991, 291–313.

Gordon et al. “A Comparison of Approaches to Advertising Measurement: Evidence from Big Field Experiments at Facebook.” *Marketing Science* 38, no. 2 (2019): 193–225.

Hwang. *Subprime Attention Crisis*. Farrar, Straus and Giroux, 2020.

Marinescu, Lawlor, and Kording. “Quasi-Experimental Causality in Neuroscience and Behavioural Research.” *Nature Human Behaviour* 2, no. 12 (2018): 891–98.

Roemer. *How We Cooperate: A Theory of Kantian Optimization*. Yale University Press, 2019.

Urquiola and Verhoogen. “Class-Size Caps, Sorting, and the Regression-Discontinuity Design.” *American Economic Review* 99, no. 1 (2009): 179–215.

Wager and Athey. “Estimation and Inference of Heterogeneous Treatment Effects Using Random Forests.” *Journal of the American Statistical Association* 113, no. 523 (2018): 1228–42.