Preface
In its conception, our book is both an old take on something new and a new take on something old.
Looking at it one way, we return to the roots with our emphasis on pattern classification. We believe that the practice of machine learning today is surprisingly similar to pattern classification of the 1960s, with a few notable innovations from more recent decades.
This is not to understate recent progress. Like many, we are amazed by the advances that have happened in recent years. Image recognition has improved dramatically. Even small devices can now reliably recognize speech. Natural language processing and machine translation have made massive leaps forward. Machine learning has even been helpful in some difficult scientific problems, such as protein folding.
However, we think that it would be a mistake not to recognize pattern classification as a driving force behind these improvements. The ingenuity behind many advances in machine learning so far lies not in a fundamental departure from pattern classification, but rather in finding new ways to make problems amenable to the model fitting techniques of pattern classification.
Consequently, the first few chapters of this book follow relatively closely the excellent text “Pattern Classification and Scene Analysis” by Duda and Hart, particularly its first edition from 1973, which remains relevant today. Indeed, Duda and Hart summarize the state of pattern classification in 1973, and it bears a striking resemblance to the core of what we consider today to be machine learning. We add new developments on representations, optimization, and generalization, all of which remain topics of evolving, active research.
Looking at it differently, our book departs in some considerable ways from the way machine learning is commonly taught.
First, our text emphasizes the role that datasets play in machine learning. A full chapter explores the histories, significance, and scientific basis of machine learning benchmarks. Although ubiquitous and taken for granted today, the datasets-as-benchmarks paradigm was a relatively recent development of the 1980s. Detailed consideration of datasets, the collection and construction of data, as well as the training and testing paradigm, tends to be lacking from theoretical courses on machine learning.
Second, the book includes a modern introduction to causality and the practice of causal inference that lays to rest dated controversies in the field. The introduction is self-contained, starts from first principles, and requires no prior intellectual or ideological commitment to the field of causality. Our treatment of causality includes the conceptual foundations, as well as some of the practical tools of causal inference that are increasingly used across many domains. It’s interesting to note that many recent causal estimators reduce the problem of causal inference in clever ways to pattern classification. Hence, this material fits quite well with the rest of the book.
Third, our book covers sequential and dynamic models thoroughly. Though such material could easily fill a semester course on its own, we wanted to provide the basic elements required to think about making decisions in dynamic contexts. In particular, given so much recent interest in reinforcement learning, we hope to provide a self-contained short introduction to the concepts underpinning this field. Our approach here follows our approach to supervised learning: we focus on how we would make decisions given a probabilistic model of our environment, and then turn to how to take action when the model is unknown. Hence, we begin with a focus on optimal sequential decision making and dynamic programming. We describe some of the basic solution approaches to such problems, and discuss some of the complications that arise as our measurement quality deteriorates. We then turn to making decisions when our models are unknown, providing a survey of bandit optimization and reinforcement learning. Our focus here is to again highlight the power of prediction. We show that for most problems, pattern recognition can be seen as a complement to feedback control, and we highlight how “certainty equivalent” decision making—where we first use data to estimate a model and then use feedback control acting as if this model were true—is optimal or near optimal in a surprising number of scenarios.
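To make the idea of certainty equivalent decision making concrete, here is a minimal sketch in Python. Everything in it is an illustrative assumption rather than material from the book: the scalar linear system, its parameter values, the quadratic cost, and the helper names `simulate`, `estimate_model`, and `lqr_gain` are all hypothetical. The sketch fits a model to data by least squares and then computes a feedback gain as if the fitted model were the true one.

```python
import random

def simulate(a, b, steps, rng, noise_std=0.1):
    """Roll out x[t+1] = a*x[t] + b*u[t] + noise under random inputs u[t]."""
    data, x = [], 0.0
    for _ in range(steps):
        u = rng.gauss(0.0, 1.0)
        x_next = a * x + b * u + rng.gauss(0.0, noise_std)
        data.append((x, u, x_next))
        x = x_next
    return data

def estimate_model(data):
    """Least-squares fit of (a, b) via the 2x2 normal equations."""
    sxx = sum(x * x for x, u, y in data)
    sxu = sum(x * u for x, u, y in data)
    suu = sum(u * u for x, u, y in data)
    sxy = sum(x * y for x, u, y in data)
    suy = sum(u * y for x, u, y in data)
    det = sxx * suu - sxu * sxu
    return (sxy * suu - sxu * suy) / det, (sxx * suy - sxu * sxy) / det

def lqr_gain(a, b, q=1.0, r=1.0, iters=500):
    """Gain k of the policy u = -k*x minimizing the sum of q*x^2 + r*u^2."""
    p = q
    for _ in range(iters):  # fixed-point iteration on the scalar Riccati equation
        p = q + a * a * p - (a * b * p) ** 2 / (r + b * b * p)
    return a * b * p / (r + b * b * p)

rng = random.Random(0)
a_true, b_true = 0.9, 0.5                # hypothetical system, treated as unknown
a_hat, b_hat = estimate_model(simulate(a_true, b_true, 500, rng))
k_hat = lqr_gain(a_hat, b_hat)           # act as if the estimate were the truth
k_true = lqr_gain(a_true, b_true)        # gain computed with full model knowledge
print(f"estimated (a, b) = ({a_hat:.3f}, {b_hat:.3f}); gain {k_hat:.3f} vs {k_true:.3f}")
```

In this toy setting the gain computed from the estimated model should land close to the gain computed from the true one. The sketch illustrates the two-step recipe only; it says nothing about when that recipe is optimal.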
Finally, we attempt to highlight, in a few different places throughout the book, the potential harms, limitations, and social consequences of machine learning. From its roots in World War II, machine learning has always been political. Advances in artificial intelligence feed into a global industrial military complex, and are funded by it. As useful as machine learning is for some unequivocally positive applications such as assistive devices, it is also used to great effect for tracking, surveillance, and warfare. Commercially, its most successful use cases to date are targeted advertising and digital content recommendation, both of questionable value to society. Several scholars have explained how the use of machine learning can perpetuate inequity by placing additional burden on already marginalized, oppressed, and disadvantaged communities. Narratives of artificial intelligence also shape policy in several high-stakes debates about the replacement of human judgment with statistical models in the criminal justice system, health care, education, and social services.
There are some notable topics we left out. Some might find that the most glaring omission is unsupervised learning. Indeed, there has been a significant amount of work on unsupervised learning in recent years. Thankfully, some of the most successful approaches to learning without labels could be described as reductions to pattern recognition. For example, researchers have found ingenious ways of procuring labels from unlabeled data points, an approach called self-supervision. We believe that the contents of this book will prepare students interested in these topics well.
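As a small illustration of procuring labels from unlabeled data, the sketch below turns a raw sequence into (context, label) pairs by holding out the next item. Next-item prediction is just one pretext task, chosen here as an assumption for illustration, and the function name `next_item_examples` and its `context_size` parameter are hypothetical.

```python
def next_item_examples(sequence, context_size=2):
    """Turn one unlabeled sequence into (context, label) training pairs."""
    return [
        (tuple(sequence[i:i + context_size]), sequence[i + context_size])
        for i in range(len(sequence) - context_size)
    ]

# Each window of two words becomes an input; the word that follows is its label.
examples = next_item_examples(["the", "cat", "sat", "down"])
for context, label in examples:
    print(context, "->", label)
```

The resulting pairs can be handed to any supervised method, which is the sense in which such approaches reduce to pattern recognition.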
The material we cover supports a one-semester graduate introduction to machine learning. We invite readers from all backgrounds. However, mathematical maturity in probability, calculus, and linear algebra is required. We provide a chapter on mathematical background for review. Necessarily, this chapter cannot replace prerequisite coursework.
In writing this book, our goal was to balance mathematical rigor against presenting insights we have found useful in the most direct way possible. In contemporary learning theory, important results often have short sketches, yet making these arguments rigorous and precise may require dozens of pages of technical calculations. Such proofs are critical to the community’s scientific activities but often make important insights hard to access for those not yet versed in the appropriate techniques. On the other hand, many machine learning courses drop proofs altogether, thereby losing the important foundational ideas that they contain. We aim to strike a balance, including full details for as many arguments as possible, but frequently referring readers to the relevant literature for the rest.