

MA4K0 Introduction to Uncertainty Quantification

T. J. Sullivan
Mathematics Institute
University of Warwick
Coventry CV4 7AL, UK

[email protected]

DRAFT
Not For General Distribution

Version 11 (2013-10-16 15:45)


Contents

Contents
Preface

Introduction to Uncertainty Quantification

1 Introduction
  1.1 What is Uncertainty Quantification?
  1.2 Mathematical Prerequisites
  1.3 The Road Not Travelled

2 Recap of Measure and Probability Theory
  2.1 Measure and Probability Spaces
  2.2 Random Variables and Stochastic Processes
  2.3 Aside: Interpretations of Probability
  2.4 Lebesgue Integration
  2.5 The Radon–Nikodym Theorem and Densities
  2.6 Product Measures and Independence
  2.7 Gaussian Measures
  Bibliography

3 Recap of Banach and Hilbert Spaces
  3.1 Basic Definitions and Properties
  3.2 Dual Spaces and Adjoints
  3.3 Orthogonality and Direct Sums
  3.4 Tensor Products
  Bibliography

4 Basic Optimization Theory
  4.1 Optimization Problems and Terminology
  4.2 Unconstrained Global Optimization
  4.3 Constrained Optimization
  4.4 Convex Optimization
  4.5 Linear Programming
  4.6 Least Squares
  Bibliography
  Exercises


5 Measures of Information and Uncertainty
  5.1 The Existence of Uncertainty
  5.2 Interval Estimates
  5.3 Variance, Information and Entropy
  5.4 Information Gain
  Bibliography
  Exercises

6 Bayesian Inverse Problems
  6.1 Inverse Problems and Regularization
  6.2 Bayesian Inversion in Banach Spaces
  6.3 Well-Posedness and Approximation
  Bibliography
  Exercises

7 Filtering and Data Assimilation
  7.1 State Estimation in Discrete Time
  7.2 Linear Kalman Filter
  7.3 Extended Kalman Filter
  7.4 Ensemble Kalman Filter
  7.5 Eulerian and Lagrangian Data Assimilation
  Bibliography
  Exercises

8 Orthogonal Polynomials
  8.1 Basic Definitions and Properties
  8.2 Recurrence Relations
  8.3 Roots of Orthogonal Polynomials
  8.4 Polynomial Interpolation
  8.5 Polynomial Approximation
  8.6 Orthogonal Polynomials of Several Variables
  Bibliography
  Exercises

9 Numerical Integration
  9.1 Quadrature in One Dimension
  9.2 Gaussian Quadrature
  9.3 Clenshaw–Curtis / Fejér Quadrature
  9.4 Quadrature in Multiple Dimensions
  9.5 Monte Carlo Methods
  9.6 Pseudo-Random Methods
  Bibliography
  Exercises

10 Sensitivity Analysis and Model Reduction
  10.1 Model Reduction for Linear Models
  10.2 Derivatives
  10.3 McDiarmid Diameters
  10.4 ANOVA/HDMR Decompositions
  Bibliography


  Exercises

11 Spectral Expansions
  11.1 Karhunen–Loève Expansions
  11.2 Wiener–Hermite Polynomial Chaos
  11.3 Generalized PC Expansions
  Bibliography
  Exercises

12 Stochastic Galerkin Methods
  12.1 Lax–Milgram Theory and Galerkin Projection
  12.2 Stochastic Galerkin Projection
  12.3 Nonlinearities
  Bibliography
  Exercises

13 Non-Intrusive Spectral Methods
  13.1 Pseudo-Spectral Methods
  13.2 Stochastic Collocation
  Bibliography
  Exercises

14 Distributional Uncertainty
  14.1 Maximum Entropy Distributions
  14.2 Distributional Robustness
  14.3 Functional and Distributional Robustness
  Bibliography
  Exercises

Bibliography and Index

Bibliography

Index


Preface

These notes are designed as an introduction to Uncertainty Quantification (UQ) at the fourth year (senior) undergraduate or beginning postgraduate level, and are aimed primarily at students from a mathematical (rather than, say, engineering) background; the mathematical prerequisites are listed in Section 1.2, and the early chapters of the text recapitulate some of this material in more detail. These notes accompany the University of Warwick mathematics module MA4K0 Introduction to Uncertainty Quantification; while the notes are intended to be general, certain contextual remarks and assumptions about prior knowledge will be Warwick-specific, and will be indicated by a large “W” in the margin, like the one to the right. [W]

The aim is to give a survey of the main objectives in the field of UQ and a few of the mathematical methods by which they can be achieved. There are, of course, even more UQ problems and solution methods in the world that are not covered in these notes, which are intended — with the exception of the preliminary material on measure theory and functional analysis — to comprise approximately 30 hours’ worth of lectures. For any grievous omissions in this regard, I ask for your indulgence, and would be happy to receive suggestions for improvements.

The exercises contain, by deliberate choice, a number of terribly ill-posed or under-specified problems of the messy type often encountered in practice. It is my hope that these exercises will encourage students to grapple with the questions of mathematical modelling that are a necessary precursor to doing applied mathematics outside the tame classroom environment. Theoretical knowledge is important; however, problem solving, which begins with problem formulation, is an equally vital skill that too often goes neglected in undergraduate mathematics courses.

These notes have benefitted, from initial conception to nearly finished product, from discussions with many people. I would like to thank Charlie Elliott, Dave McCormick, Mike McKerns, Michael Ortiz, Houman Owhadi, Clint Scovel, Andrew Stuart, and all the students on the 2013–14 iteration of MA4K0 for their useful comments.

T. J. S.
University of Warwick, U.K.

Wednesday 16th October, 2013


Introduction to Uncertainty Quantification


Chapter 1

Introduction

We must think differently about our ideas — and how we test them. We must become more comfortable with probability and uncertainty. We must think more carefully about the assumptions and beliefs that we bring to a problem.

The Signal and the Noise: The Art and Science of Prediction

Nate Silver

1.1 What is Uncertainty Quantification?

Uncertainty Quantification (UQ) is, roughly put, the coming together of probability theory and statistical practice with ‘the real world’. These two anecdotes illustrate something of what is meant by this statement:

• Until 1990–1995, risk modelling for catastrophe insurance and re-insurance (i.e. insurance for property owners against risks arising from earthquakes, hurricanes, terrorism, &c., and then insurance for the providers of such insurance) was done on a purely statistical basis. Since that time, catastrophe modellers have started to incorporate models for the underlying physics or human behaviour, hoping to gain a more accurate predictive understanding of risks by blending the statistics and the physics, e.g. by focussing on what is both statistically and physically reasonable.

• Over roughly the same period of time, deterministic engineering models of complex physical processes began to incorporate some element of probability to account for lack of knowledge about important physical parameters, random variability in operating circumstances, or outright uncertainty about what the form of a ‘correct’ model would be. Again the aim is to provide more accurate predictions about systems’ behaviour.

Perhaps as a result of its history, there are many perspectives on what UQ is, including at the extremes assertions like “UQ is just a buzzword for statistics” or “UQ is just error analysis”; other perspectives on UQ include the study of numerical error and the stability of algorithms. UQ problems of interest include certification, prediction, model and software verification and validation, parameter estimation, data assimilation, and inverse problems. At its very broadest,

“UQ studies all sources of error and uncertainty, including the following: systematic and stochastic measurement error; ignorance; limitations of theoretical models; limitations of numerical representations of those models; limitations of the accuracy and reliability of computations, approximations, and algorithms; and human error. A more precise definition is: UQ is the end-to-end study of the reliability of scientific inferences.” [109, p. 135]

It is especially important to appreciate the “end-to-end” nature of UQ studies: one is interested in relationships between pieces of information (bearing in mind that they are only approximations of reality), not in the ‘truth’ of those pieces of information / assumptions. There is always a risk of ‘Garbage In, Garbage Out’. A mathematician performing a UQ analysis cannot tell you that your model is ‘right’ or ‘true’, but only that, if you accept the validity of the model (to some quantified degree), then you must logically accept the validity of certain conclusions (to some quantified degree). Naturally, a respectable analysis will include a meta-analysis examining the sensitivity of the original analysis to perturbations of the governing assumptions. In the author’s view, this is the proper interpretation of philosophically sound but practically unhelpful assertions like “Verification and validation of numerical models of natural systems is impossible” and “The primary value of models is heuristic.” [74].

Example 1.1. Consider the following elliptic boundary value problem on a connected Lipschitz domain Ω ⊆ R^n (typically n = 2 or 3):

−∇ · (κ∇u) = f in Ω,

u = 0 on ∂Ω.

This PDE is a simple but not naïve model for the pressure field u of some fluid occupying a domain Ω, the permeability of which to the fluid is described by a tensor field κ: Ω → R^{n×n}; there is a source term f and the boundary condition specifies that the pressure vanishes on the boundary of Ω. This simple model is of interest in the earth sciences because Darcy’s law asserts that the velocity field v of the fluid flow in this medium is related to the gradient of the pressure field by

v = −κ∇u.

If the fluid contains some kind of contaminant, then one is naturally very interested in where fluid following the velocity field v will end up, and how soon.

In a course on PDE theory, you will learn that if the permeability field κ is positive-definite and essentially bounded, then this problem has a unique weak solution u in the Sobolev space H^1_0(Ω) for each forcing term f in the dual Sobolev space H^{−1}(Ω). One objective of this course is to tell you that this is far from the end of the story! As far as practical applications go, existence and uniqueness of solutions is only the beginning. For one thing, this PDE model is only an approximation of reality. Secondly, even if the PDE were a perfectly accurate model, the ‘true’ κ and f are not known precisely, so our knowledge about u = u(κ, f) is also uncertain in some way. If κ and f are treated as random variables, then u is also a random variable, and one is naturally interested in properties of that random variable such as mean, variance, deviation probabilities &c. — and to do so it is necessary to build up the machinery of probability on function spaces.

Another issue is that often we want to solve the inverse problem: we know something about f and something about u and want to infer κ. Even a simple inverse problem such as this one is of enormous practical interest: it is by solving such inverse problems that oil companies attempt to infer the location of oil deposits in order to make a profit, and seismologists the structure of the planet in order to make earthquake predictions. Both of these problems, the forward and inverse propagation of uncertainty, fall under the very general remit of UQ. Furthermore, in practice, the fields f, κ and u are all discretized and solved for numerically (i.e. approximately and finite-dimensionally), so it is of interest to understand the impact of these discretization errors.
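To make the forward-propagation part of this example concrete, here is a minimal Python sketch (not part of the original notes) for a one-dimensional analogue of the Darcy problem: the log-permeability is a short random Fourier series standing in for a Karhunen–Loève expansion, the equation −(κu′)′ = f with u(0) = u(1) = 0 is solved by finite differences, and the mean and variance of the quantity of interest u(1/2) are estimated by crude Monte Carlo. The function names, the choice of random field, the source term and the sample size are all illustrative assumptions.

```python
# Illustrative sketch: Monte Carlo forward propagation of uncertainty through
# the 1-D Darcy-type problem -(kappa u')' = f on (0, 1), u(0) = u(1) = 0.
import numpy as np

def solve_darcy_1d(kappa_iface, f_interior, h):
    """Finite-difference solve of -(kappa u')' = f with zero Dirichlet data.

    kappa_iface : permeability at the m cell interfaces x_{j+1/2}, j = 0, ..., m-1
    f_interior  : source term at the m-1 interior nodes x_j, j = 1, ..., m-1
    h           : mesh width 1/m.  Returns u at the interior nodes.
    """
    diag = kappa_iface[:-1] + kappa_iface[1:]   # couples u_j to itself
    off = -kappa_iface[1:-1]                    # couples u_j to its neighbours
    A = (np.diag(diag) + np.diag(off, 1) + np.diag(off, -1)) / h**2
    return np.linalg.solve(A, f_interior)

rng = np.random.default_rng(0)
m = 200
h = 1.0 / m
x_iface = (np.arange(m) + 0.5) * h              # cell-interface coordinates
f = np.ones(m - 1)                              # constant source term

def sample_permeability():
    # Toy random field: log kappa is a 4-term sine series with independent
    # standard normal coefficients (a stand-in for a KL expansion).
    xi = rng.standard_normal(4)
    log_k = sum(xi[k] * np.sin((k + 1) * np.pi * x_iface) / (k + 1) for k in range(4))
    return np.exp(log_k)

# Quantity of interest: the pressure at x = 1/2, sampled 1000 times.
qoi = np.array([solve_darcy_1d(sample_permeability(), f, h)[m // 2 - 1]
                for _ in range(1000)])
print("E[u(1/2)] ≈", qoi.mean(), "  V[u(1/2)] ≈", qoi.var())
```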

Epistemic and Aleatoric Uncertainty. It is common in the literature to divide uncertainty into two types, aleatoric and epistemic uncertainty. Aleatoric uncertainty — from the Latin alea, meaning a die — refers to uncertainty about an inherently variable phenomenon. Epistemic uncertainty — from the Greek ἐπιστήμη, meaning knowledge — refers to uncertainty arising from lack of knowledge. To a certain extent, the distinction is an imprecise one, and repeats the old debate between frequentist and subjectivist (e.g. Bayesian) statisticians. Someone who was simultaneously a devout Newtonian physicist and a devout Bayesian might argue that the results of dice rolls are not aleatoric uncertainties — one simply doesn’t have complete enough information about the initial conditions of the die, the material and geometry of the die, any gusts of wind that might affect the flight of the die, &c. On the other hand, it is usually clear that some forms of uncertainty are epistemic rather than aleatoric: for example, when physicists say that they have yet to come up with a Theory of Everything, they are expressing a lack of knowledge about the laws of physics in our universe, and the correct mathematical description of those laws. In any case, regardless of one’s favoured interpretation of probability, the language of probability theory is a powerful tool in describing uncertainty.

Some Typical UQ Objectives. Many common UQ objectives can be illustrated in the context of a system, F, that maps inputs X in some space 𝒳 to outputs Y = F(X) in some space 𝒴. Such objectives include:

• The reliability or certification problem. Suppose that some set Y_fail ⊆ 𝒴 is identified as a ‘failure set’, i.e. the outcome F(X) ∈ Y_fail is undesirable in some way. Given appropriate information about the inputs X and forward process F, determine the failure probability,

P[F(X) ∈ Y_fail].

Furthermore, in the case of a failure, how large will the deviation from acceptable performance be, and what are the consequences? (A crude Monte Carlo sketch of this and the following prediction problem is given after this list.)

• The prediction problem. Dually to the reliability problem, given a maximum acceptable probability of error ε > 0, find a set Y_ε ⊆ 𝒴 such that

P[F(X) ∈ Y_ε] ≥ 1 − ε,


i.e. the prediction F(X) ∈ Y_ε is wrong with probability at most ε.

• The parameter identification or inverse problem. Given some observations of the output, Y, which may be corrupted or unreliable in some way, attempt to determine the corresponding inputs X such that F(X) = Y. In what sense are some estimates for X more or less reliable than others?

• The model reduction or model calibration problem. Construct another function F_h (perhaps a numerical model with certain numerical parameters to be calibrated, or one involving far fewer input or output variables) such that F_h ≈ F in an appropriate sense. Quantifying the accuracy of the approximation may itself be a certification or prediction problem.
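As a toy illustration of the first two objectives above (and not a method prescribed by these notes), the following sketch estimates a failure probability and a prediction set by crude Monte Carlo for a made-up forward map F and input distribution; every numerical choice in it is an assumption made purely for illustration.

```python
# Illustrative sketch: Monte Carlo treatment of the reliability and prediction
# problems for a made-up forward map F and a standard Gaussian input X.
import numpy as np

rng = np.random.default_rng(1)

def F(x):
    # Hypothetical forward model mapping a 2-dimensional input to a scalar output.
    return x[..., 0] ** 2 + np.sin(3.0 * x[..., 1])

X = rng.standard_normal((10**5, 2))   # samples of the input X
Y = F(X)                              # corresponding outputs Y = F(X)

# Reliability / certification: P[F(X) in Y_fail] for the failure set {y > 3}.
p_fail = np.mean(Y > 3.0)

# Prediction: a set Y_eps with P[F(X) in Y_eps] >= 1 - eps, here taken to be the
# interval between the eps/2 and 1 - eps/2 empirical quantiles of Y.
eps = 0.01
lo, hi = np.quantile(Y, [eps / 2, 1 - eps / 2])

print(f"P[failure] ≈ {p_fail:.4f},  Y_eps ≈ [{lo:.2f}, {hi:.2f}]")
```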

A Word of Warning. In this second decade of the third millennium, there is as yet no elegant unified theory of UQ. UQ is not a mature field like linear algebra or single-variable complex analysis, with stately textbooks containing well-polished presentations of classical theorems bearing august names like Cauchy, Gauss and Hamilton. Both because of its youth as a field and its very close engagement with applications, UQ is much more about problems, methods, and ‘good enough for the job’. There are some very elegant approaches within UQ, but as yet no single, general, over-arching theory of UQ.

1.2 Mathematical Prerequisites

Like any course, MA4K0 has certain prerequisites. If you are just following the course for fun, and attending the lectures merely to stay warm and dry in what is almost sure to be a fine English autumn, then good for you. However, if you actually want to understand what is going on, then it’s better for your own health if you can use your nearest time machine to ensure that you have already taken and understood, in addition to the standard G100/G103 Mathematics core courses, the following non-core courses: [W]

• ST112 Probability B
• Either MA359 Measure Theory or ST318 Probability Theory
• MA3G7 Functional Analysis I

As a crude diagnostic test, read the following sentence:

Given any σ-finite measure space (X, F, µ), the set of all F-measurable functions f: X → C for which ∫_X |f|² dµ is finite, modulo equality µ-almost everywhere, is a Hilbert space with respect to the inner product ⟨f, g⟩ := ∫_X f ḡ dµ.

If any of the symbols, concepts or terms used or implicit in that sentence give you more than a few moments’ pause, then you should think again before attempting MA4K0.

If, in addition, you have taken the following courses, then certain techniques, examples and remarks will make more sense to you:

• MA117 Programming for Scientists
• MA228 Numerical Analysis
• MA250 Introduction to Partial Differential Equations
• MA398 Matrix Analysis and Algorithms
• MA3H0 Numerical Analysis and PDEs


• ST407 Monte Carlo Methods
• MA482 Stochastic Analysis
• MA4A2 Advanced PDEs
• MA607 Data Assimilation

However, none of these courses are essential. That said, some ability and willingness to implement UQ methods — even in simple settings — in e.g. C/C++, Mathematica, Matlab, or Python〈1.1〉 is highly desirable. UQ is a topic best learned in the doing, not through pure theory.

1.3 The Road Not Travelled

There are many topics relevant to UQ that are either not covered or discussed only briefly in these notes, including: detailed treatment of data assimilation beyond the confines of the Kalman filter and its variations; accuracy, stability and computational cost of numerical methods; details of numerical implementation of optimization methods; stochastic homogenization; optimal control; and machine learning.

〈1.1〉 The author’s current language of choice.


Chapter 2

Recap of Measure and Probability Theory

To be conscious that you are ignorant is a great step to knowledge.

Sybil

Benjamin Disraeli

Probability theory, grounded in Kolmogorov’s axioms and the general foundations of measure theory, is an essential tool in the mathematical treatment of uncertainty. This chapter serves as a review, without detailed proof, of concepts from measure and probability theory that will be used in the rest of the text. Like Chapter 3, this chapter is intended as a review of material that should be understood as a prerequisite before proceeding; to some extent, Chapters 2 and 3 are interdependent and so can (and should) be read in parallel with one another.

2.1 Measure and Probability Spaces

Definition 2.1. A measurable space is a pair (Θ, F), where
• Θ is a set, called the sample space; and
• F is a σ-algebra on Θ, i.e. a collection of subsets of Θ containing ∅ and closed under countable applications of the operations of union, intersection, and complementation relative to Θ; elements of F are called measurable sets or events.

Definition 2.2. A signed measure (or charge) on a measurable space (Θ, F) is a function µ: F → R ∪ {±∞} that takes at most one of the two infinite values, has µ(∅) = 0, and, whenever E₁, E₂, . . . ∈ F are pairwise disjoint with union E ∈ F, the series Σ_{n∈N} µ(E_n) converges absolutely to µ(E). A measure is a signed measure that does not take negative values. A probability measure is a measure such that µ(Θ) = 1. The triple (Θ, F, µ) is called a signed measure space, measure space, or probability space as appropriate. The sets of all signed measures, measures, and probability measures on (Θ, F) are denoted M±(Θ, F), M₊(Θ, F), and M₁(Θ, F) respectively.


Examples 2.3.
1. The trivial measure: τ(E) := 0 for every E ∈ F.
2. The unit Dirac measure at a ∈ X:

   δ_a(E) := 1 if a ∈ E, and 0 if a ∉ E, for E ∈ F.

3. Counting measure:

   κ(E) := |E| if E ∈ F is a finite set, and +∞ if E ∈ F is an infinite set.

4. Lebesgue measure on R^n: the unique measure on R^n (equipped with its Borel σ-algebra B(R^n)) that assigns to every cube its volume. To be more precise, Lebesgue measure is actually defined on the completion of B(R^n), a larger σ-algebra than B(R^n).

Definition 2.4. Let (X, F, µ) be a measure space. If N ⊆ X is a subset of a measurable set E ∈ F such that µ(E) = 0, then N is called a µ-null set. If the set of x ∈ X for which some property P(x) does not hold is µ-null, then P is said to hold µ-almost everywhere (or, when µ is a probability measure, µ-almost surely). If every µ-null set is in fact an F-measurable set, then the measure space (X, F, µ) is said to be complete.

When the sample space is a topological space, it is usual to use the Borel σ-algebra (i.e. the smallest σ-algebra that contains all the open sets); measures on the Borel σ-algebra are called Borel measures. Unless noted otherwise, this is the convention followed in these notes.

Definition 2.5. The support of a measure µ defined on a topological space X is

   supp(µ) := ⋂ { F ⊆ X | F is closed and µ(X \ F) = 0 }.

That is, supp(µ) is the smallest closed subset of X that has full µ-measure. Equivalently, supp(µ) is the complement of the union of all open sets of µ-measure zero, or the set of all points x ∈ X for which every neighbourhood of x has strictly positive µ-measure.

M₁(X) is often called the probability simplex on X. The motivation for this terminology comes from the case in which X = {1, . . . , n} is a finite set. In this case, functions f: X → R are in bijection with column vectors [f(1), . . . , f(n)]^⊤, and probability measures µ on the power set of X are in bijection with row vectors [µ(1) · · · µ(n)] such that µ(i) ≥ 0 for all i ∈ {1, . . . , n} and Σ_{i=1}^{n} µ(i) = 1. As illustrated in Figure 2.1, the set of such µ is the (n − 1)-dimensional simplex in R^n that is the convex hull of the n points

   δ₁ = [1 0 · · · 0],
   δ₂ = [0 1 · · · 0],
   ...
   δₙ = [0 0 · · · 1].

Looking ahead, the expected value of f under µ is exactly the matrix product:

   E_µ[f] = Σ_{i=1}^{n} µ(i) f(i) = ⟨µ | f⟩ = µ f = [µ(1) · · · µ(n)] [f(1), . . . , f(n)]^⊤.
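For concreteness, here is a minimal numerical rendering of this identity for n = 3; the particular µ and f are arbitrary choices.

```python
# Expectation on the finite sample space {1, 2, 3} as a row-vector/column-vector product.
import numpy as np

mu = np.array([0.2, 0.5, 0.3])   # a point of the probability simplex M_1({1, 2, 3})
f = np.array([1.0, 4.0, -2.0])   # a function f: {1, 2, 3} -> R

assert mu.min() >= 0 and np.isclose(mu.sum(), 1.0)
print(mu @ f)                    # E_mu[f] = 0.2*1 + 0.5*4 + 0.3*(-2) = 1.6
```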

The geometry of M₁(X) is something that one forgets, in favour of the algebraic properties of measures and functions, at one’s peril, as poetically highlighted by Sir Michael Atiyah [4, Paper 160, p. 7]:

   “Algebra is the offer made by the devil to the mathematician. The devil says: ‘I will give you this powerful machine, it will answer any question you like. All you need to do is give me your soul: give up geometry and you will have this marvellous machine.’”

Or, as is traditionally but perhaps apocryphally said to have been inscribed over the entrance to Plato’s Academy:

   ΑΓΕΩΜΕΤΡΗΤΟΣ ΜΗΔΕΙΣ ΕΙΣΙΤΩ

In a sense that will be made precise later, for any ‘nice’ space X, M₁(X) is the simplex spanned by the collection of unit Dirac measures {δ_x | x ∈ X}. Given a bounded, measurable function f: X → R and c ∈ R,

   {µ ∈ M(X) | E_µ[f] ≤ c}

is a half-space of M(X), and so a set of the form

   {µ ∈ M₁(X) | E_µ[f₁] ≤ c₁, . . . , E_µ[f_m] ≤ c_m}

can be thought of as a polytope of probability measures.

Definition 2.6. If (Θ, F, µ) is a probability space and B ∈ F has µ(B) > 0, then the conditional probability measure µ( · |B) on (Θ, F) is defined by

   µ(E|B) := µ(E ∩ B) / µ(B)    for E ∈ F.

Theorem 2.7 (Bayes’ rule). If (Θ, F, µ) is a probability space and A, B ∈ F have µ(A), µ(B) > 0, then

   µ(A|B) = µ(B|A) µ(A) / µ(B).
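A quick numerical illustration of Bayes’ rule, using made-up numbers for a diagnostic-test example (not an example from the notes): with A = “condition present” and B = “test positive”, even a fairly accurate test yields a modest posterior probability when the prior probability of A is small.

```python
# Bayes' rule (Theorem 2.7) on a finite probability space, with illustrative numbers.
p_A = 0.01                 # mu(A): prevalence of the condition
p_B_given_A = 0.95         # mu(B|A): sensitivity of the test
p_B_given_notA = 0.05      # mu(B|A^c): false-positive rate

p_B = p_B_given_A * p_A + p_B_given_notA * (1 - p_A)   # law of total probability
p_A_given_B = p_B_given_A * p_A / p_B                  # Bayes' rule
print(p_A_given_B)         # ≈ 0.161: most positive tests are false positives
```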


[Figure 2.1 (figure omitted): The probability simplex M₁({1, 2, 3}) ⊂ M±({1, 2, 3}) ≅ R³, drawn as the triangle spanned by the unit Dirac masses δ_i, i ∈ {1, 2, 3}, in the space of signed measures on {1, 2, 3}.]

2.2 Random Variables and Stochastic Processes

Definition 2.8. Let (X, F) and (Y, G) be measurable spaces. A function f: X → Y generates a σ-algebra on X by

   σ(f) := σ({[f ∈ E] | E ∈ G}),

and f is called a measurable function if σ(f) ⊆ F. A measurable function whose domain is a probability space is usually called a random variable.

Definition 2.9. Let Ω be any set and let (Θ, F, µ) be a probability space. A function U: Ω × Θ → X such that each U(ω, · ) is a random variable is called an X-valued stochastic process on Ω.

Definition 2.10. A measurable function f: X → Y from a measure space (X, F, µ) to a measurable space (Y, G) defines a measure f_*µ on (Y, G), called the push-forward of µ by f, by

   (f_*µ)(E) := µ([f ∈ E]),    for E ∈ G.

When µ is a probability measure, f_*µ is called the distribution or law of the random variable f.

Definition 2.11. A filtration of a σ-algebra F is a family F_• = {F_i | i ∈ I} of sub-σ-algebras of F, indexed by an ordered set I, such that

   i ≤ j in I  ⟹  F_i ⊆ F_j.

The natural filtration associated to a stochastic process U: I × Θ → X is the filtration F^U_• defined by

   F^U_i := σ({U(i, · )^{−1}(E) ⊆ Θ | E ⊆ X is measurable}).

A stochastic process U is adapted to a filtration F_• if F^U_i ⊆ F_i for each i ∈ I.


Measurability and adaptedness are important properties of stochastic processes, and loosely correspond to certain questions being ‘answerable’ or ‘decidable’ with respect to the information contained in a given σ-algebra. For instance, if the event [X ∈ E] is not F-measurable, then it does not even make sense to ask about the probability P_µ[X ∈ E]. For another example, suppose that some stream of observed data is modelled as a stochastic process Y, and it is necessary to make some decision U(t) at each time t. It is common sense to require that the decision stochastic process be F^Y_•-adapted, since the decision U(t) must be made on the basis of the observations Y(s), s ≤ t, not on observations from any future time.

2.3 Aside: Interpretations of Probability

It is worth noting that the above discussions are purely mathematical: a probability measure is an abstract algebraic–analytic object with no necessary connection to everyday notions of chance or probability. The question of what interpretation of probability to adopt, i.e. what practical meaning to ascribe to probability measures, is a question of philosophy and mathematical modelling. The two main points of view are the frequentist and Bayesian perspectives. To a frequentist, the probability µ(E) of an event E is the relative frequency of occurrence of the event E in the limit of infinitely many independent but identical trials; to a Bayesian, µ(E) is a numerical representation of one’s degree of belief in the truth of a proposition E. The frequentist’s point of view is objective; the Bayesian’s is subjective; both use the same mathematical machinery of probability measures to describe the properties of the function µ.

Frequentists are careful to distinguish between parts of their analyses that are fixed and deterministic versus those that have a probabilistic character. However, for a Bayesian, any uncertainty can be described in terms of a suitable probability measure. In particular, one’s beliefs about some unknown θ (taking values in a space Θ) in advance of observing data are summarized by a prior probability measure π on Θ. The other ingredient of a Bayesian analysis is a likelihood function, which is up to normalization a conditional probability: given any observed datum y, L(y|θ) is the likelihood of observing y if the parameter value θ were the truth. A Bayesian’s belief about θ given the prior π and the observed datum y is the posterior probability measure π( · |y) on Θ, which is just the conditional probability

   π(θ|y) = L(y|θ) π(θ) / E_π[L(y|θ)] = L(y|θ) π(θ) / ∫_Θ L(y|θ) dπ(θ),

or, written in a fancier way that generalizes better to infinite-dimensional Θ,

   (dπ( · |y)/dπ)(θ) ∝ L(y|θ).

Both of the previous two equations are referred to as Bayes’ rule, and are at this stage informal applications of the standard Bayes’ rule (Theorem 2.7) for events A and B of non-zero probability.

Parameter estimation provides a good example of the philosophical difference between frequentist and subjectivist uses of probability. Suppose that X₁, . . . , X_n are n independent and identically distributed observations of some random variable X, which is distributed according to the normal distribution N(θ, 1) of mean θ and variance 1. We set our frequentist and Bayesian statisticians the challenge of estimating θ from the data X₁, . . . , X_n.

• To the frequentist, θ is a well-defined real number that happens to be unknown. This number can be estimated using the estimator

   θ_n := (1/n) Σ_{i=1}^{n} X_i,

which is a random variable. It makes sense to say that θ_n is close to θ with high probability, and hence to give a confidence interval for θ, but θ itself does not have a distribution.

• To the Bayesian, θ is a random variable, and its distribution in advance of seeing the data is encoded in a prior π. Upon seeing the data and conditioning upon it using Bayes’ rule, the distribution of the parameter is the posterior distribution π(θ|d). The posterior encodes everything that is known about θ in view of π, L(y|θ) ∝ e^{−|y−θ|²/2} and d, although this information may be summarized by a single number such as the maximum a posteriori estimator

   θ_MAP := argmax_{θ∈R} π(θ|d)

or the maximum likelihood estimator

   θ_MLE := argmax_{θ∈R} L(d|θ).
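The sketch below (not from the notes) carries out both analyses on simulated data. It relies on the standard conjugate update for a Gaussian prior N(m₀, s₀²) on θ combined with i.i.d. N(θ, 1) observations, under which the posterior is again Gaussian and the MAP estimator coincides with the posterior mean; the prior parameters, sample size and random seed are arbitrary choices.

```python
# Frequentist vs. Bayesian estimation of the mean theta of a N(theta, 1) distribution.
import numpy as np

rng = np.random.default_rng(2)
theta_true = 1.3                            # unknown to both statisticians
x = rng.normal(theta_true, 1.0, size=20)    # the data X_1, ..., X_n
n = len(x)

# Frequentist: the estimator theta_n (here also the MLE) and a 95% confidence interval.
theta_mle = x.mean()
ci = (theta_mle - 1.96 / np.sqrt(n), theta_mle + 1.96 / np.sqrt(n))

# Bayesian: prior N(m0, s0^2); conjugate update gives the Gaussian posterior exactly.
m0, s0sq = 0.0, 10.0
post_prec = 1.0 / s0sq + n                  # posterior precision 1 / s_n^2
post_mean = (m0 / s0sq + x.sum()) / post_prec
theta_map = post_mean                       # Gaussian posterior: MAP = posterior mean

print(f"MLE = {theta_mle:.3f},  95% CI = ({ci[0]:.3f}, {ci[1]:.3f})")
print(f"posterior N({post_mean:.3f}, {1 / post_prec:.3f}),  MAP = {theta_map:.3f}")
```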

It is also worth noting that there is a significant community that, in addition to being frequentist or Bayesian, asserts that selecting a single probability measure is too precise a description of uncertainty. These ‘imprecise probabilists’ count such distinguished figures as George Boole and John Maynard Keynes among their ranks, and would prefer to say that 1/2 − 2^{−100} ≤ P[heads] ≤ 1/2 + 2^{−100} than commit themselves to the assertion that P[heads] = 1/2; imprecise probabilists would argue that the former assertion can be verified, to a prescribed level of confidence, in finite time, whereas the latter cannot. Techniques like the use of lower and upper probabilities (or interval probabilities) are popular in this community, including sophisticated generalizations like Dempster–Shafer theory; one can also consider feasible sets of probability measures, which is the approach taken in Chapter 14 of these notes.

2.4 Lebesgue Integration

Definition 2.12. Let (X, F, µ) be a measure space. A function f: X → K is called simple if f = Σ_{i=1}^{n} α_i 1_{E_i} for some scalars α₁, . . . , α_n ∈ K and some pairwise disjoint measurable sets E₁, . . . , E_n ∈ F with µ(E_i) finite for i = 1, . . . , n. The Lebesgue integral of a simple function f := Σ_{i=1}^{n} α_i 1_{E_i} is defined to be

   ∫_X f dµ := Σ_{i=1}^{n} α_i µ(E_i).


Definition 2.13. Let (X, F, µ) be a measure space and let f: X → [0, +∞] be a measurable function. The Lebesgue integral of f is defined to be

   ∫_X f dµ := sup { ∫_X φ dµ | φ: X → R is a simple function, and 0 ≤ φ(x) ≤ f(x) for µ-almost all x ∈ X }.

Definition 2.14. Let (X, F, µ) be a measure space and let f: X → R be a measurable function. The Lebesgue integral of f is defined to be

   ∫_X f dµ := ∫_X f⁺ dµ − ∫_X f⁻ dµ

provided that at least one of the integrals on the right-hand side is finite. The integral of a complex-valued measurable function f: X → C is defined to be

   ∫_X f dµ := ∫_X Re f dµ + i ∫_X Im f dµ.

One of the major attractions of the Lebesgue integral is that, subject to a simple domination condition, pointwise convergence of integrands is enough to ensure convergence of integral values:

Theorem 2.15 (Dominated convergence theorem). Let (X, F, µ) be a measure space and let f_n: X → K be a measurable function for each n ∈ N. If f: X → K is such that lim_{n→∞} f_n(x) = f(x) for every x ∈ X and there is a measurable function g: X → [0, ∞] such that ∫_X |g| dµ is finite and |f_n(x)| ≤ g(x) for all x ∈ X and all large enough n ∈ N, then

   ∫_X f dµ = lim_{n→∞} ∫_X f_n dµ.

Furthermore, if the measure space is complete, then the conditions on pointwise convergence and pointwise domination of f_n(x) can be relaxed to hold µ-almost everywhere.

Definition 2.16. When (Θ, F, µ) is a probability space and X: Θ → K is a random variable, it is conventional to write E_µ[X] for ∫_Θ X(θ) dµ(θ) and to call E_µ[X] the expected value or expectation of X. Also,

   V_µ[X] := E_µ[|X − E_µ[X]|²] ≡ E_µ[|X|²] − |E_µ[X]|²

is called the variance of X. If X is a K^d-valued random variable, then E_µ[X], if it exists, is an element of K^d, and

   C := E_µ[(X − E_µ[X])(X − E_µ[X])*] ∈ K^{d×d},   i.e.   C_{ij} := E_µ[(X_i − E_µ[X_i])(X_j − E_µ[X_j])] ∈ K,

is the covariance matrix of X.
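These definitions are easy to check empirically: the sketch below (illustrative only, for an arbitrary two-dimensional distribution) estimates E_µ[X], the covariance matrix C and V_µ[X₁] from samples.

```python
# Monte Carlo estimates of the mean, covariance matrix and variance of Definition 2.16.
import numpy as np

rng = np.random.default_rng(3)
Z = rng.standard_normal((10**5, 2))
X = np.column_stack([Z[:, 0], 0.5 * Z[:, 0] + Z[:, 1] + 1.0])  # correlated components

mean = X.mean(axis=0)                   # E_mu[X], a vector (here approx. [0, 1])
centred = X - mean
C = centred.T @ centred / len(X)        # C_ij = E[(X_i - E X_i)(X_j - E X_j)]
print(mean)
print(C)                                # approx. [[1.0, 0.5], [0.5, 1.25]]
print(C[0, 0])                          # V_mu[X_1] = E|X_1|^2 - |E X_1|^2
```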

Definition 2.17. Let (X, F, µ) be a measure space. For 1 ≤ p ≤ ∞, the L^p space (or Lebesgue space) is defined by

   L^p(X, µ; K) := {f: X → K | f is measurable and ‖f‖_{L^p(µ)} is finite},


where

   ‖f‖_{L^p(µ)} := (∫_X |f(x)|^p dµ(x))^{1/p}    for 1 ≤ p < ∞,

and

   ‖f‖_{L^∞(µ)} := inf { ‖g‖_∞ | f = g: X → K µ-almost everywhere } = inf { t ≥ 0 | |f| ≤ t µ-almost everywhere }.

To be more precise, L^p(X, µ; K) is the set of equivalence classes of such functions, where functions that differ only on a set of µ-measure zero are identified.

Theorem 2.18 (Chebyshev’s inequality). Let X ∈ L^p(Θ, µ; K), 1 ≤ p < ∞, be a random variable. Then, for all t ≥ 0,

   P_µ[|X − E_µ[X]| ≥ t] ≤ t^{−p} E_µ[|X|^p].    (2.1)
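A crude Monte Carlo sanity check of (2.1) with p = 2, for an arbitrarily chosen exponential random variable (purely illustrative):

```python
# Empirical check of Chebyshev's inequality (2.1) with p = 2 for X ~ Exp(1).
import numpy as np

rng = np.random.default_rng(4)
X = rng.exponential(scale=1.0, size=10**6)
for t in [1.0, 2.0, 4.0]:
    lhs = np.mean(np.abs(X - X.mean()) >= t)   # P[|X - E X| >= t]
    rhs = np.mean(np.abs(X) ** 2) / t**2       # t^{-2} E[|X|^2]
    print(f"t = {t}:  {lhs:.4f} <= {rhs:.4f}")
```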

Integration of Vector-Valued Functions. Lebesgue integration of functions that take values in R^n can be handled componentwise, as indeed was done above for complex-valued integrands. Many interesting UQ problems concern random fields, i.e. random variables with values in infinite-dimensional spaces of functions. For definiteness, consider a function f defined on a measure space (X, F, µ) taking values in a Banach space V. There are two ways to proceed, and they are in general inequivalent:

1. The strong integral or Bochner integral of f is defined by integrating simple V-valued functions as in the construction of the Lebesgue integral, and then defining

   ∫_X f dµ := lim_{n→∞} ∫_X φ_n dµ

whenever (φ_n)_{n∈N} is a sequence of simple functions such that the (scalar-valued) Lebesgue integral ∫_X ‖f − φ_n‖ dµ converges to 0 as n → ∞. It transpires that f is Bochner integrable if and only if ‖f‖ is Lebesgue integrable. The Bochner integral satisfies a version of the Dominated Convergence Theorem, but there are some subtleties concerning the Radon–Nikodym theorem.

2. The weak integral or Pettis integral of f is defined using duality: ∫_X f dµ is defined to be an element v ∈ V such that

   ⟨ℓ | v⟩ = ∫_X ⟨ℓ | f(x)⟩ dµ(x)    for all ℓ ∈ V′.

Since this is a weaker integrability criterion, there are naturally more Pettis-integrable functions than Bochner-integrable ones, but the Pettis integral has deficiencies such as the space of Pettis-integrable functions being incomplete, the existence of a Pettis-integrable function f: [0, 1] → V such that F(t) := ∫_{[0,t]} f(τ) dτ is not differentiable [44], and so on.

2.5 The Radon–Nikodym Theorem and Densities

Let (X, F, µ) be a measure space and let ρ: X → [0, +∞] be a measurable function. The operation

   ν: E ↦ ∫_E ρ(x) dµ(x)    (2.2)


defines a measure ν on (X, F). It is natural to ask whether every measure ν on (X, F) can be expressed in this way. A moment’s thought reveals that the answer, in general, is no: there is no ρ that will make (2.2) hold when µ and ν are Lebesgue measure and a unit Dirac measure (or vice versa) on R.

Definition 2.19. Let µ and ν be measures on a measurable space (X, F). If, for E ∈ F, ν(E) = 0 whenever µ(E) = 0, then ν is said to be absolutely continuous with respect to µ, denoted ν ≪ µ. If ν ≪ µ ≪ ν, then µ and ν are said to be equivalent, and this is denoted µ ≈ ν. If there exists E ∈ F such that µ(E) = 0 and ν(X \ E) = 0, then µ and ν are said to be mutually singular, denoted µ ⊥ ν.

Theorem 2.20 (Radon–Nikodym). Suppose that µ and ν are σ-finite measures on a measurable space (X, F) and that ν ≪ µ. Then there exists a measurable function ρ: X → [0, ∞] such that, for all measurable functions f: X → R and all E ∈ F,

   ∫_E f dν = ∫_E f ρ dµ

whenever either integral exists. Furthermore, any two functions ρ with this property are equal µ-almost everywhere.

The function ρ in the Radon–Nikodym theorem is called the Radon–Nikodym derivative of ν with respect to µ, and the suggestive notation ρ = dν/dµ is often used. In probability theory, when ν is a probability measure, dν/dµ is called the probability density function (PDF) of ν (or any ν-distributed random variable) with respect to µ.
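The change-of-measure identity ∫ f dν = ∫ f (dν/dµ) dµ behind this definition can be checked numerically whenever both measures have known densities. The sketch below (an arbitrary illustrative choice, essentially importance sampling) does so for two equivalent Gaussian measures on R.

```python
# Numerical check of E_nu[f] = E_mu[f dnu/dmu] for nu = N(0, 1) and mu = N(1, 2^2).
import numpy as np

def gauss_pdf(x, m, s):
    # Lebesgue density of N(m, s^2) on R.
    return np.exp(-0.5 * ((x - m) / s) ** 2) / (s * np.sqrt(2.0 * np.pi))

rng = np.random.default_rng(5)
f = np.cos                                   # any bounded measurable test function

x_mu = rng.normal(1.0, 2.0, size=10**6)      # samples from mu
rho = gauss_pdf(x_mu, 0.0, 1.0) / gauss_pdf(x_mu, 1.0, 2.0)   # dnu/dmu at the samples

direct = np.mean(f(rng.normal(0.0, 1.0, size=10**6)))   # E_nu[f] sampled directly
reweighted = np.mean(f(x_mu) * rho)                      # E_mu[f dnu/dmu]
print(direct, reweighted)                                # both ≈ exp(-1/2) ≈ 0.6065
```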

2.6 Product Measures and Independence

Definition 2.21. Let (Θ, F, µ) be a probability space.
1. Two measurable sets (events) E₁, E₂ ∈ F are said to be independent if µ(E₁ ∩ E₂) = µ(E₁) µ(E₂).
2. Two sub-σ-algebras G₁ and G₂ of F are said to be independent if E₁ and E₂ are independent events whenever E₁ ∈ G₁ and E₂ ∈ G₂.
3. Two measurable functions (random variables) X: Θ → X and Y: Θ → Y are said to be independent if the σ-algebras generated by X and Y are independent.

Definition 2.22. Let (X, F, µ) and (Y, G, ν) be σ-finite measure spaces. The product σ-algebra F ⊗ G is the σ-algebra on X × Y that is generated by the measurable rectangles, i.e. the smallest σ-algebra for which all the products

   F × G,    F ∈ F, G ∈ G,

are measurable sets. The product measure µ ⊗ ν: F ⊗ G → [0, +∞] is the measure such that

   (µ ⊗ ν)(F × G) = µ(F) ν(G),    for all F ∈ F, G ∈ G.

In the other direction, given a measure on a product space, we can consider the measures induced on the factor spaces:


Definition 2.23. Let (X × Y, F, µ) be a measure space and suppose that the factor space X is equipped with a σ-algebra such that the projection Π_X: (x, y) ↦ x is a measurable function. Then the marginal measure µ_X is the measure on X defined by

   µ_X(E) := ((Π_X)_* µ)(E) = µ(E × Y).

The marginal measure µ_Y on Y is defined similarly.

Theorem 2.24. Let X = (X₁, X₂) be a random variable taking values in a product space X = X₁ × X₂. Let µ be the (joint) distribution of X, and µ_i the (marginal) distribution of X_i for i = 1, 2. Then X₁ and X₂ are independent random variables if and only if µ = µ₁ ⊗ µ₂.

The important property of integration with respect to a product measure, and hence of taking expected values of independent random variables, is that it can be performed by iterated integration:

Theorem 2.25 (Fubini–Tonelli). Let (X, F, µ) and (Y, G, ν) be σ-finite measure spaces, and let f: X × Y → [0, +∞] be measurable. Then, of the following three integrals, if one exists in [0, ∞], then all three exist and are equal:

   ∫_X ∫_Y f(x, y) dν(y) dµ(x),    ∫_Y ∫_X f(x, y) dµ(x) dν(y),    and    ∫_{X×Y} f(x, y) d(µ ⊗ ν)(x, y).

2.7 Gaussian Measures

An important class of probability measures and random variables is the class of Gaussian measures (also known as normal distributions) and random variables. Gaussian measures are particularly important because, unlike Lebesgue measure, they are well-defined on infinite-dimensional topological vector spaces; the non-existence of an infinite-dimensional Lebesgue measure is a consequence of the following theorem:

Theorem 2.26. Let V be an infinite-dimensional, separable Banach space. Then the only Borel measure µ on V that is locally finite (i.e. every point of V has a neighbourhood of finite µ-measure) and translation invariant (i.e. µ(x + E) = µ(E) for all x ∈ V and measurable E ⊆ V) is the trivial measure.

Gaussian measures on R^d are defined using a Radon–Nikodym derivative with respect to Lebesgue measure:

Definition 2.27. Let m ∈ R^d and let C ∈ R^{d×d} be symmetric and positive definite. The Gaussian measure with mean m and covariance C is denoted N(m, C) and defined by

   N(m, C)(E) = (det C)^{−1/2} (2π)^{−d/2} ∫_E exp( −(x − m) · C^{−1}(x − m) / 2 ) dx

for each measurable E ⊆ R^d. The Gaussian measure N(0, I) is called the standard Gaussian measure. A Dirac measure δ_m can be considered as a degenerate Gaussian measure on R, one with variance equal to zero.
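A standard way to realize N(m, C) on a computer is via a Cholesky factorization C = LLᵀ, since X = m + LZ with Z ∼ N(0, I) has law N(m, C). The sketch below (with an arbitrary m and C) checks this empirically; it is an illustration, not part of the notes.

```python
# Sampling from the Gaussian measure N(m, C) on R^2 via a Cholesky factor of C.
import numpy as np

rng = np.random.default_rng(6)
m = np.array([1.0, -2.0])
C = np.array([[2.0, 0.6],
              [0.6, 1.0]])              # symmetric and positive definite

L = np.linalg.cholesky(C)               # C = L L^T
Z = rng.standard_normal((10**5, 2))     # samples from the standard Gaussian N(0, I)
X = m + Z @ L.T                         # each row is a sample of N(m, C)

print(X.mean(axis=0))                   # ≈ m
print(np.cov(X, rowvar=False))          # ≈ C
```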


It is easily verified that the push-forward of N(m, C) by any linear functional ℓ: R^d → R is a Gaussian measure on R, and this is taken as the defining property of a general Gaussian measure for settings in which, by Theorem 2.26, there may not be a Lebesgue measure with respect to which densities can be taken:

Definition 2.28. Let V be a (locally convex) topological vector space. A Borel measure γ on V is said to be a (non-degenerate) Gaussian measure if, for every ℓ ∈ V′, the push-forward measure ℓ_*γ is a (non-degenerate) Gaussian measure on R. Equivalently, γ is Gaussian if, for every linear map T: V → R^d, T_*γ = N(m, C) for some m ∈ R^d and some symmetric positive-definite C ∈ R^{d×d}.

Definition 2.29. Let µ be a probability measure on a Banach space V. An element m_µ ∈ V is called the mean of µ if

   ∫_V ⟨ℓ | x − m_µ⟩ dµ(x) = 0    for all ℓ ∈ V′,

so that ∫_V x dµ(x) = m_µ in the sense of a Pettis integral. If m_µ = 0, then µ is said to be centred. The covariance operator is the symmetric operator C_µ: V′ × V′ → K defined by

   C_µ(k, ℓ) = ∫_V ⟨k | x − m_µ⟩ ⟨ℓ | x − m_µ⟩ dµ(x)    for all k, ℓ ∈ V′.

We often abuse notation and write C_µ: V′ → V′′ for the operator defined by

   ⟨C_µ k | ℓ⟩ := C_µ(k, ℓ).

The inverse of C_µ, if it exists, is called the precision operator of µ.

Theorem 2.30 (Vakhania). Let µ be a Gaussian measure on a separable, reflexive Banach space V with mean m_µ ∈ V and covariance operator C_µ: V′ → V. Then the support of µ is the affine subspace of V that is the translation by the mean of the closure of the range of the covariance operator, i.e.

   supp(µ) = m_µ + \overline{C_µ V′}.

Corollary 2.31. For a Gaussian measure µ on a separable, reflexive Banach space V, the following are equivalent:
1. µ is non-degenerate;
2. C_µ: V′ → V is one-to-one;
3. \overline{C_µ V′} = V.

Example 2.32. Consider a Gaussian random variable X = (X₁, X₂) ∼ µ taking values in R². Suppose that the mean and covariance of X (or, equivalently, µ) are, in the usual basis of R²,

   m = [0, 1]^⊤,    C = [1 0; 0 0].

Then X = (Z, 1), where Z ∼ N(0, 1) is a standard Gaussian random variable on R; the values of X all lie on the affine line L := {(x₁, x₂) ∈ R² | x₂ = 1}. Indeed, Vakhania’s theorem says that

   supp(µ) = m + C(R²) = { [0, 1]^⊤ + [x₁, 0]^⊤ | x₁ ∈ R } = L.


Theorem 2.33. A centred probability measure µ on V is a Gaussian measure if and only if its Fourier transform µ̂: V′ → C satisfies

   µ̂(ℓ) := ∫_V e^{i⟨ℓ | x⟩} dµ(x) = e^{−Q(ℓ)/2}    for all ℓ ∈ V′,

for some positive-definite quadratic form Q on V′. Indeed, Q(ℓ) = C_µ(ℓ, ℓ). Furthermore, if two Gaussian measures µ and ν have the same mean and covariance operator, then µ = ν.

Theorem 2.34 (Fernique). Let µ be a centred Gaussian measure on a separable Banach space V. Then there exists α > 0 such that

   ∫_V exp(α‖x‖²) dµ(x) < +∞.

A fortiori, µ has moments of all orders: for all k ≥ 0,

   ∫_V ‖x‖^k dµ(x) < +∞.

Definition 2.35. Let K: H → H be a linear operator on a separable Hilbert space H.
1. K is said to be compact if it has a singular value decomposition, i.e. if there exist finite or countably infinite orthonormal sequences (u_n) and (v_n) in H and a sequence of non-negative reals (σ_n) such that

   K = Σ_n σ_n ⟨v_n, · ⟩ u_n,

with lim_{n→∞} σ_n = 0 if the sequences are infinite.
2. K is said to be trace class or nuclear if Σ_n σ_n is finite, and Hilbert–Schmidt or nuclear of order 2 if Σ_n σ_n² is finite.
3. If K is trace class, then its trace is defined to be

   tr(K) := Σ_n ⟨e_n, K e_n⟩

for any orthonormal basis (e_n) of H, and (by Lidskii’s theorem) this equals the sum of the eigenvalues of K, counted with multiplicity.
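In finite dimensions, the basis-independence of the trace and its agreement with the sum of the eigenvalues are easy to verify numerically; the following small check is illustrative only.

```python
# Finite-dimensional check of Definition 2.35(3): tr(K) is basis-independent and
# equals the sum of the eigenvalues of the symmetric operator K.
import numpy as np

rng = np.random.default_rng(7)
A = rng.standard_normal((5, 5))
K = A @ A.T                                          # symmetric positive semi-definite

Q, _ = np.linalg.qr(rng.standard_normal((5, 5)))     # a random orthonormal basis (columns)
trace_std = sum(K[i, i] for i in range(5))           # sum of <e_i, K e_i>, standard basis
trace_Q = sum(Q[:, i] @ K @ Q[:, i] for i in range(5))
trace_eig = np.linalg.eigvalsh(K).sum()

print(trace_std, trace_Q, trace_eig)                 # all three agree
```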

Theorem 2.36. Let µ be a centred Gaussian measure on a separable Hilbert space H. Then C_µ: H → H is trace class and

   tr(C_µ) = ∫_H ‖x‖² dµ(x).

Conversely, if K: H → H is positive, symmetric and of trace class, then there is a Gaussian measure µ on H such that C_µ = K.

Definition 2.37. Let µ = N(m_µ, C_µ) be a Gaussian measure on a Banach space V. The Cameron–Martin space is the Hilbert space H_µ defined equivalently by:
• H_µ is the completion of

   { h ∈ V | for some h* ∈ V′, C_µ(h*, · ) = ⟨ · | h⟩ }

with respect to the inner product ⟨h, k⟩_µ = C_µ(h*, k*).


• H_µ is the completion of the range of the covariance operator C_µ: V′ → V with respect to this inner product (cf. the closure with respect to the norm in V in Theorem 2.30).
• If V is Hilbert, then H_µ is the completion of R(C_µ^{1/2}) with the inner product ⟨h, k⟩_µ = ⟨C_µ^{−1/2} h, C_µ^{−1/2} k⟩_V.
• H_µ is the set of all v ∈ V such that (T_v)_*µ ≈ µ, where T_v: x ↦ x + v denotes translation by v.
• H_µ is the intersection of all linear subspaces of V that have full µ-measure.

Note, however, that if H_µ is infinite-dimensional, then µ(H_µ) = 0. Furthermore, infinite-dimensional spaces have the rather alarming property that Gaussian measures on such spaces are either equivalent or mutually singular — there is no middle ground in the way that Lebesgue measure on [0, 1] has a density with respect to Lebesgue measure on R but is not equivalent to it — and surprisingly simple operations can destroy equivalence.

Theorem 2.38 (Feldman–Hájek). Let µ, ν be Gaussian probability measures on a locally convex topological vector space V. Then either
• µ and ν are equivalent, i.e. µ(E) = 0 ⟺ ν(E) = 0; or
• µ and ν are mutually singular, i.e. there exists E such that µ(E) = 0 and ν(E) = 1.

Furthermore, equivalence holds if and only if
1. R(C_µ^{1/2}) = R(C_ν^{1/2}) = E; and
2. m_µ − m_ν ∈ E; and
3. T := (C_µ^{−1/2} C_ν^{1/2})(C_µ^{−1/2} C_ν^{1/2})* − I is Hilbert–Schmidt in E.

Proposition 2.39. Let µ be a centred Gaussian measure on a separable Banach space V such that dim H_µ = ∞. For c ∈ R, let D_c: V → V be the dilation map D_c(x) := cx. Then (D_c)_*µ is equivalent to µ if and only if c ∈ {±1}, and (D_c)_*µ and µ are mutually singular otherwise.

Bibliography

At Warwick, this material is mostly covered in MA359 Measure Theory and ST318 Probability Theory. [W] Gaussian measure theory in infinite-dimensional spaces is covered in MA482 Stochastic Analysis and MA612 Probability on Function Spaces and Bayesian Inverse Problems. Vakhania’s theorem (Theorem 2.30) on the support of a Gaussian measure can be found in [110]. Fernique’s theorem (Theorem 2.34) on the integrability of Gaussian vectors was proved in [31]. The Feldman–Hájek dichotomy (Theorem 2.38) was proved independently by Feldman [30] and Hájek [36] in 1958.

Gordon’s book [35] is mostly a text on the gauge integral, but its first chapters provide an excellent condensed introduction to measure theory and Lebesgue integration. The book of Capiński & Kopp [18] is a clear, readable and self-contained introductory text confined mainly to Lebesgue integration on R (and later R^n), including material on L^p spaces and the Radon–Nikodym theorem. Another excellent text on measure and probability theory is the monograph of Billingsley [10]. The Bochner integral was introduced by Bochner in [12]; recent texts on the topic include those of Diestel & Uhl [24] and Mikusiński [67]. For detailed treatment of the Pettis integral, see Talagrand [102]. Further discussion of the relationship between tensor products and spaces of vector-valued integrable functions can be found in the book of Ryan [85].

Bourbaki [15] contains a treatment of measure theory from a functional-analytic perspective. The presentation is focussed on Radon measures on locally compact spaces, which is advantageous in terms of regularity but leads to an approach to measurable functions that is cumbersome, particularly from the viewpoint of probability theory.

The modern origins of imprecise probability lie in treatises like those of Boole [13] and Keynes [50]; more recent foundations and expositions have been put forward by Walley [114], Kuznetsov [56], Weichselberger [115], and by Dempster [23] and Shafer [87].


Chapter 3

Recap of Banach and Hilbert Spaces

Dr. von Neumann, ich möchte gern wissen, was ist dann eigentlich ein Hilbertscher Raum? [Dr. von Neumann, I should very much like to know: what, after all, is a Hilbert space?]

David Hilbert

This chapter covers the necessary concepts from linear functional analysis on Hilbert and Banach spaces, in particular basic and useful constructions like direct sums and tensor products. Like Chapter 2, this chapter is intended as a review of material that should be understood as a prerequisite before proceeding; to some extent, Chapters 2 and 3 are interdependent and so can (and should) be read in parallel with one another.

3.1 Basic Definitions and Properties

In what follows, K will denote either of the fields R or C.

Definition 3.1. A norm on a vector space V over K is a function ‖ · ‖ : V → R that is
1. positive semi-definite: for all x ∈ V, ‖x‖ ≥ 0;
2. positive definite: for all x ∈ V, ‖x‖ = 0 if and only if x = 0;
3. homogeneous: for all x ∈ V and α ∈ K, ‖αx‖ = |α| ‖x‖;
4. sublinear: for all x, y ∈ V, ‖x + y‖ ≤ ‖x‖ + ‖y‖.

If the positive definiteness requirement is omitted, then ‖ · ‖ is said to be a semi-norm. A vector space equipped with a (semi-)norm is called a (semi-)normed space.

Definition 3.2. An inner product on a vector space V over K is a function 〈 · , · 〉 : V × V → K that is
1. positive semi-definite: for all x ∈ V, 〈x, x〉 ≥ 0;
2. positive definite: for all x ∈ V, 〈x, x〉 = 0 if and only if x = 0;
3. conjugate symmetric: for all x, y ∈ V, 〈x, y〉 is the complex conjugate of 〈y, x〉;


4. sesquilinear: for all x, y, z ∈ V and all α, β ∈ K, 〈x, αy + βz〉 = α〈x, y〉 + β〈x, z〉.

A vector space equipped with an inner product is called an inner product space. In the case K = R, conjugate symmetry becomes symmetry, and sesquilinearity becomes bilinearity.

It is easily verified that every inner product space is a normed space under the induced norm ‖x‖ := √〈x, x〉. The inner product and norm satisfy the Cauchy–Schwarz inequality

|〈x, y〉| ≤ ‖x‖ ‖y‖ for all x, y ∈ V. (3.1)

Every norm on V that is induced by an inner product satisfies the parallelogram identity

‖x + y‖² + ‖x − y‖² = 2‖x‖² + 2‖y‖² for all x, y ∈ V. (3.2)

In the opposite direction, if ‖ · ‖ is a norm on V that satisfies the parallelogram identity, then the unique inner product 〈 · , · 〉 that induces this norm is found by the polarization identity

〈x, y〉 = (‖x + y‖² − ‖x − y‖²)/4 (3.3)

in the real case, and

〈x, y〉 = (‖x + y‖² − ‖x − y‖²)/4 + i(‖ix − y‖² − ‖ix + y‖²)/4 (3.4)

in the complex case.
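As a quick numerical sanity check (an illustration only, not part of the development), identities (3.2) and (3.3) can be verified for the Euclidean inner product on Rⁿ. The following Python/NumPy sketch uses arbitrary test vectors.

import numpy as np

# Check the parallelogram identity (3.2) and the real polarization identity (3.3)
# for random vectors in R^5; the vectors are arbitrary test data.
rng = np.random.default_rng(0)
x, y = rng.standard_normal(5), rng.standard_normal(5)
norm = np.linalg.norm

lhs = norm(x + y)**2 + norm(x - y)**2          # parallelogram identity, left-hand side
rhs = 2 * norm(x)**2 + 2 * norm(y)**2
assert np.isclose(lhs, rhs)

recovered = (norm(x + y)**2 - norm(x - y)**2) / 4   # polarization identity (3.3)
assert np.isclose(recovered, np.dot(x, y))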

Example 3.3. 1. For any n ∈ N, the coordinate space Kⁿ is an inner product space under the Euclidean inner product

〈x, y〉 := ∑_{i=1}^{n} xᵢ yᵢ.

In the real case, this is usually known as the dot product and denoted x · y.

2. For any m, n ∈ N, the space K^{m×n} of m × n matrices is an inner product space under the Frobenius inner product

〈A, B〉 ≡ A : B := ∑_{i=1,…,m; j=1,…,n} aᵢⱼ bᵢⱼ.

Definition 3.4. Let (V , ‖ · ‖) be a normed space. A sequence (xn)n∈N in Vconverges to x ∈ V if, for every ε > 0, there exists N ∈ N such that, whenevern ≥ N , ‖xn − x‖ < ε. A sequence (xn)n∈N in V is called Cauchy if, for everyε > 0, there exists N ∈ N such that, whenever m,n ≥ N , ‖xm − xn‖ < ε. Acomplete space is one in which each Cauchy sequence in V converges to someelement of V . Complete normed spaces are called Banach spaces, and completeinner product spaces are called Hilbert spaces.

Example 3.5. 1. Kn and Km×n are finite-dimensional Hilbert spaces withrespect to their usual inner products.


2. The standard example of an infinite-dimensional Hilbert space is the space of square-summable sequences,

ℓ²(K) := { x = (xₙ)_{n∈N} ∈ K^N | ‖x‖²_{ℓ²} := ∑_{n∈N} |xₙ|² < ∞ },

which is a Hilbert space with respect to the inner product

〈x, y〉_{ℓ²} := ∑_{n∈N} xₙ yₙ.

3. Given a measure space (X, F, µ), the space L²(X, µ; K) of equivalence classes (modulo equality µ-almost everywhere) of square-integrable functions from X to K is a Hilbert space with respect to the inner product

〈f, g〉_{L²(µ)} := ∫_X f(x) g(x) dµ(x). (3.5)

Note that it is necessary to take the quotient by the equivalence relation of equality µ-almost everywhere since a function f that vanishes on a set of full measure but is non-zero on a set of zero measure is not the zero function but nonetheless has ‖f‖_{L²(µ)} = 0. When (X, F, µ) is a probability space, elements of L²(X, µ; K) are thought of as random variables of finite variance, and (at least for centred random variables) the L² inner product is the covariance:

〈X, Y〉_{L²(µ)} := E_µ[XY] = cov(X, Y).

When L²(X, µ; K) is a separable space, it is isometrically isomorphic to ℓ²(K) (see Theorem 3.16).

4. Indeed, Hilbert spaces over a fixed field K are classified by their dimension:whenever H and K are Hilbert spaces of the same dimension over K,there is a K-linear map T : H → K such that 〈Tx, T y〉K = 〈x, y〉H for allx, y ∈ H.

Example 3.6. 1. For any compact topological space X, the space C(X; K) of continuous functions f : X → K is a Banach space with respect to the supremum norm

‖f‖∞ := sup_{x∈X} |f(x)|. (3.6)

For non-compact X, the supremum norm is only a bona fide norm if we restrict attention to bounded continuous functions (otherwise it may take the value ∞).

2. For 1 ≤ p ≤ ∞, the spaces Lᵖ(X, µ; K) with their norms

‖f‖_{Lᵖ(µ)} := ( ∫_X |f(x)|ᵖ dµ(x) )^{1/p} (3.7)

for 1 ≤ p < ∞ and

‖f‖_{L∞(µ)} := inf{ ‖g‖∞ | f = g : X → K µ-almost everywhere } (3.8)
            = inf{ t ≥ 0 | |f| ≤ t µ-almost everywhere }

are Banach spaces, but only the L² spaces are Hilbert spaces.


Another family of Banach spaces that arises very often in PDE applications is the family of Sobolev spaces. For the sake of brevity, we limit the discussion to those Sobolev spaces that are Hilbert spaces. To save space, we use multi-index notation for derivatives: α := (α₁, . . . , αₙ) ∈ N₀ⁿ denotes a multi-index, and |α| := α₁ + · · · + αₙ.

Definition 3.7. Let Ω ⊆ Rⁿ, let α ∈ N₀ⁿ, and consider f : Ω → R. A weak derivative of order α for f : Ω → R is a function v : Ω → R such that

∫_Ω f(x) (∂^{|α|} φ(x) / ∂^{α₁}x₁ · · · ∂^{αₙ}xₙ) dx = (−1)^{|α|} ∫_Ω v(x) φ(x) dx (3.9)

for every φ ∈ C_c^∞(Ω; R). Such a weak derivative is usually denoted D^α f, and coincides with the classical (strong) derivative if it exists. The Sobolev space H^k(Ω) is

H^k(Ω) := { f ∈ L²(Ω) | for all α ∈ N₀ⁿ with |α| ≤ k, f has a weak derivative D^α f ∈ L²(Ω) } (3.10)

with the inner product

〈u, v〉_{H^k} := ∑_{|α|≤k} 〈D^α u, D^α v〉_{L²}. (3.11)

3.2 Dual Spaces and Adjoints

Definition 3.8. The (continuous) dual space of a normed space V over K is the vector space V′ of all continuous linear functionals ℓ : V → K. The dual pairing between an element ℓ ∈ V′ and an element v ∈ V is denoted 〈ℓ | v〉 or simply ℓ(v). For a linear functional ℓ, being continuous is equivalent to being bounded in the sense that its operator norm (or dual norm)

‖ℓ‖′ := sup_{0≠v∈V} |〈ℓ | v〉| / ‖v‖ ≡ sup_{‖v‖=1} |〈ℓ | v〉| ≡ sup_{‖v‖≤1} |〈ℓ | v〉|

is finite.

Proposition 3.9. For every normed space V, the dual space V ′ is a Banach spacewith respect to ‖ · ‖′.

An important property of Hilbert spaces is that they are naturally self-dual :every continuous linear functional on a Hilbert space can be naturally identifiedwith the action of taking the inner product with some element of the space.This stands in stark contrast to the duals of even elementary Banach spaces.

Theorem 3.10 (Riesz representation theorem). Let H be a Hilbert space. Forevery continuous linear functional f ∈ H′, there exists f ♯ ∈ H such that 〈f |x〉 =〈f ♯, x〉 for all x ∈ H. Furthermore, the map f 7→ f ♯ is an isometric isomorphismbetween H and its dual.

Given a linear map A : V → W between normed spaces V and W , the adjointof A is the linear map A∗ : W ′ → V ′ defined by the relation

〈A∗ℓ | v〉 = 〈ℓ |Av〉 for all v ∈ V and ℓ ∈ W ′.


When considering a linear map A : H → K between Hilbert spaces H and K, we can appeal to the Riesz representation theorem and define the adjoint in terms of inner products:

〈A∗k, h〉_H = 〈k, Ah〉_K for all h ∈ H and k ∈ K.

Given a basis {eᵢ}_{i∈I} of H, the corresponding dual basis {ẽᵢ}_{i∈I} of H is defined by the relation 〈ẽᵢ, eⱼ〉_H = δᵢⱼ. The matrix of A with respect to bases {eᵢ}_{i∈I} of H and {fⱼ}_{j∈J} of K and the matrix of A∗ with respect to the corresponding dual bases are very simply related: the one is the conjugate transpose of the other, and so by abuse of terminology the conjugate transpose of a matrix is often referred to as the adjoint.

3.3 Orthogonality and Direct Sums

Orthogonal decompositions of Hilbert spaces will be fundamental tools in many of the methods considered later on.

Definition 3.11. A subset E of an inner product space V is said to be orthogonal if 〈x, y〉 = 0 for all distinct elements x, y ∈ E; it is said to be orthonormal if

〈x, y〉 = 1 if x = y ∈ E, and 〈x, y〉 = 0 if x, y ∈ E with x ≠ y.

The orthogonal complement E⊥ of a subset E of an inner product space V is

E⊥ := { y ∈ V | for all x ∈ E, 〈y, x〉 = 0 }.

The orthogonal complement of E ⊆ V is always a closed linear subspace of V, and hence if V = H is a Hilbert space, then E⊥ is also a Hilbert space in its own right.

Theorem 3.12. Let K be a closed subspace of a Hilbert space H. Then, for any x ∈ H, there is a unique ΠKx ∈ K that is closest to x in the sense that

‖ΠKx − x‖ = inf_{y∈K} ‖y − x‖.

Furthermore, x can be written uniquely as x = ΠKx + z, where z ∈ K⊥. Hence, H decomposes as the orthogonal direct sum H = K ⊕ K⊥.

Proof. Deferred to Lemma 4.20 and the more general context of convex optimization.

The operator ΠK : H → K is called the orthogonal projection onto K.

Theorem 3.13. Let K and L be closed subspaces of a Hilbert space H. The corresponding orthogonal projection operators
1. are continuous linear operators of norm at most 1;
2. are such that I − ΠK = ΠK⊥;


and satisfy, for every x ∈ H,
3. ‖x‖² = ‖ΠKx‖² + ‖(I − ΠK)x‖²;
4. ΠKx = x ⇐⇒ x ∈ K;
5. ΠKx = 0 ⇐⇒ x ∈ K⊥.

Example 3.14 (Conditional expectation). An important probabilistic application of orthogonal projection is the operation of conditioning a random variable. Let (Θ, F, µ) be a probability space and let X ∈ L²(Θ, F, µ; K) be a square-integrable random variable. If G ⊆ F is a σ-algebra, then the conditional expectation of X with respect to G, usually denoted E[X | G], is the orthogonal projection of X onto the subspace L²(Θ, G, µ; K). In elementary contexts, G is usually taken to be the σ-algebra generated by a single event E of positive µ-probability, i.e.

G = {∅, E, Θ ∖ E, Θ}.

The orthogonal projection point of view makes two important properties of conditional expectation intuitively obvious:

1. Whenever G₁ ⊆ G₂ ⊆ F, L²(Θ, G₁, µ; K) is a subspace of L²(Θ, G₂, µ; K), and composition of the orthogonal projections onto these subspaces yields the tower rule for conditional expectations:

E[X | G₁] = E[ E[X | G₂] | G₁ ],

and, in particular, taking G₁ to be the trivial σ-algebra {∅, Θ},

E[X] = E[ E[X | G₂] ].

2. Whenever X,Y ∈ L2(Θ,F , µ;K) and X is, in fact, G -measurable,

E[XY |G ] = XE[Y |G ].
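For readers who find a concrete computation helpful, the following Python/NumPy sketch (an illustration with made-up data, not part of the text) realizes E[X | G] as an orthogonal projection in the simplest possible setting: Θ = {0, . . . , 9} with uniform µ, and G generated by a two-cell partition, so that E[X | G] is the cell-wise average of X.

import numpy as np

# Theta = {0,...,9} with uniform probability; G is generated by the partition
# {0,...,4}, {5,...,9}. E[X|G] is then the within-cell mean of X, i.e. the
# L^2-orthogonal projection of X onto cell-wise constant random variables.
X = np.array([3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0, 5.0, 3.0])
cells = [np.arange(0, 5), np.arange(5, 10)]

condE = np.empty_like(X)
for cell in cells:
    condE[cell] = X[cell].mean()

# The residual X - E[X|G] is orthogonal to every G-measurable random variable,
# e.g. the indicator of the first cell (inner product w.r.t. uniform mu):
indicator = np.where(np.arange(10) < 5, 1.0, 0.0)
print(np.dot(X - condE, indicator) / 10)    # ~ 0

# The tower rule: averaging E[X|G] recovers E[X].
print(condE.mean(), X.mean())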

Direct Sums. Suppose that V and W are vector spaces over a common fieldK. The Cartesian product V ×W can be given the structure of a vector spaceover K by defining the operations componentwise:

(v, w) + (v′, w′) := (v + v′, w + w′),

α(v, w) := (αv, αw),

for all v, v′ ∈ V , w,w′ ∈ W , and α ∈ K. The resulting vector space is calledthe (algebraic) direct sum of V and W and is usually denoted by V ⊕W , whileelements of V ⊕W are usually denoted by v ⊕ w instead of (v, w).

If {eᵢ | i ∈ I} is a basis of V and {eⱼ | j ∈ J} is a basis of W, then {eₖ | k ∈ K := I ⊎ J} is a basis of V ⊕ W. Hence, the dimension of V ⊕ W over K is equal to the sum of the dimensions of V and W.

When H and K are Hilbert spaces, their (algebraic) direct sum H ⊕K canbe given a Hilbert space structure by defining

〈h⊕ k, h′ ⊕ k′〉H⊕K := 〈h, h′〉H + 〈k, k′〉K

for all h, h′ ∈ H and k, k′ ∈ K. The original spaces H and K embed into H⊕Kas the subspaces H ⊕ 0 and 0 ⊕ K respectively, and these two subspaces


are mutually orthogonal. For this reason, the orthogonality of the two summands in a Hilbert direct sum is sometimes emphasized by decorating the direct sum symbol, writing H ⊕⊥ K. The Hilbert space projection theorem (Theorem 3.12) was the statement that whenever K is a closed subspace of a Hilbert space H, H = K ⊕ K⊥ is such an orthogonal direct sum.

It is necessary to be a bit more careful in defining the direct sum of countably many Hilbert spaces. Let Hₙ be a Hilbert space over K for each n ∈ N. Then the Hilbert space direct sum H := ⊕_{n∈N} Hₙ is defined to be the completion of

{ x = (xₙ)_{n∈N} | xₙ ∈ Hₙ for each n ∈ N, and xₙ = 0 for all but finitely many n },

where the completion is taken with respect to the inner product

〈x, y〉_H := ∑_{n∈N} 〈xₙ, yₙ〉_{Hₙ},

which is always a finite sum when applied to elements of the generating set. This construction ensures that every element x of H has finite norm ‖x‖²_H = ∑_{n∈N} ‖xₙ‖²_{Hₙ}. As before, each of the summands Hₙ is a subspace of H that is

orthogonal to all the others.

Orthogonal direct sums and orthogonal bases are among the most important constructions in Hilbert space theory, and will be very useful in what follows. The prototypical example to bear in mind is the Fourier basis {eₙ | n ∈ Z} of L²(S¹; C), where

eₙ(x) := (1/2π) exp(−inx).

Indeed, Fourier's claim〈3.1〉 that any periodic function f could be written as

f(x) = ∑_{n∈Z} fₙ eₙ(x),   fₙ := ∫_{S¹} f(y) eₙ(y) dy,

can be seen as one of the historical drivers behind the development of much of analysis. Other important examples are the systems of orthogonal polynomials that will be considered in Chapter 8. Some important results about orthogonal systems are summarized below:

Lemma 3.15 (Bessel's inequality). Let V be an inner product space and (eₙ)_{n∈N} an orthonormal sequence in V. Then, for any x ∈ V, the series ∑_{n∈N} |〈x, eₙ〉|² converges and satisfies

∑_{n∈N} |〈x, eₙ〉|² ≤ ‖x‖². (3.12)

Theorem 3.16 (Parseval identity). Let H be a Hilbert space, let (eₙ)_{n∈N} be an orthonormal sequence in H, and let (αₙ)_{n∈N} be a sequence in K. Then the series ∑_{n∈N} αₙeₙ converges in H if and only if the series ∑_{n∈N} |αₙ|² converges in R, in which case

‖ ∑_{n∈N} αₙeₙ ‖² = ∑_{n∈N} |αₙ|². (3.13)

〈3.1〉 Of course, Fourier did not use the modern notation of Hilbert spaces! Furthermore, if he had, then it would have been 'obvious' that his claim could only hold true for L² functions and in the L² sense, not pointwise for arbitrary functions.

Corollary 3.17. Let H be a Hilbert space and let (eₙ)_{n∈N} be an orthonormal sequence in H. Then, for any x ∈ H, the series ∑_{n∈N} 〈x, eₙ〉eₙ converges.

Theorem 3.18. Let H be a Hilbert space and let (eₙ)_{n∈N} be an orthonormal sequence in H. Then the following are equivalent:
1. {eₙ | n ∈ N}⊥ = {0};
2. H is the closure of span{eₙ | n ∈ N};
3. H = ⊕_{n∈N} Keₙ as a direct sum of Hilbert spaces;
4. for all x ∈ H, ‖x‖² = ∑_{n∈N} |〈x, eₙ〉|²;
5. for all x ∈ H, x = ∑_{n∈N} 〈x, eₙ〉eₙ.

If one (and hence all) of these conditions holds true, then (eₙ)_{n∈N} is called a complete orthonormal basis for H.

Corollary 3.19. Let (eₙ)_{n∈N} be a complete orthonormal basis for H. For every x ∈ H, the truncation error x − ∑_{n=1}^{N} 〈x, eₙ〉eₙ is orthogonal to span{e₁, . . . , e_N}.

Proof. Let v := ∑_{n=1}^{N} vₙeₙ be any element of span{e₁, . . . , e_N}. By completeness, x = ∑_{n∈N} 〈x, eₙ〉eₙ. Hence,

〈 x − ∑_{n=1}^{N} 〈x, eₙ〉eₙ, v 〉 = 〈 ∑_{n>N} 〈x, eₙ〉eₙ, ∑_{m=1}^{N} vₘeₘ 〉
                                 = ∑_{n>N} ∑_{m=1}^{N} 〈 〈x, eₙ〉eₙ, vₘeₘ 〉
                                 = ∑_{n>N} ∑_{m=1}^{N} 〈x, eₙ〉 vₘ 〈eₙ, eₘ〉
                                 = 0

since 〈eₙ, eₘ〉 = δₙₘ.
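The following Python/NumPy sketch illustrates Corollary 3.19 numerically for the Fourier basis, under the simplifying assumptions that the basis is orthonormalized as eₙ(x) = exp(inx)/√(2π) and that equispaced quadrature stands in for integration over S¹; the test function is arbitrary.

import numpy as np

# Quadrature grid on S^1 and an arbitrary smooth test function.
M = 512
x = 2 * np.pi * np.arange(M) / M
f = np.exp(np.sin(x))

def inner(u, v):
    # Approximates the L^2(S^1) inner product int u(y) conj(v(y)) dy.
    return np.sum(u * np.conj(v)) * (2 * np.pi / M)

modes = range(-3, 4)                                   # retain the modes |n| <= 3
basis = [np.exp(1j * n * x) / np.sqrt(2 * np.pi) for n in modes]
coeffs = [inner(f, e) for e in basis]
truncation_error = f - sum(c * e for c, e in zip(coeffs, basis))

# The truncation error is orthogonal to every retained basis function.
print(max(abs(inner(truncation_error, e)) for e in basis))   # ~ 0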

3.4 Tensor Products

Intuitively, the tensor product V ⊗ W of two vector spaces V and W over a common field K is the vector space over K with basis given by the formal symbols {eᵢ ⊗ fⱼ | i ∈ I, j ∈ J}, where {eᵢ | i ∈ I} is a basis of V and {fⱼ | j ∈ J} is a basis of W. However, it is not immediately clear that this definition is independent of the bases chosen for V and W. A more thorough definition is as follows.

Definition 3.20. The free vector space F_{V×W} on the Cartesian product V × W is defined by taking the vector space in which the elements of V × W are a basis:

F_{V×W} := { ∑_{i=1}^{n} αᵢ e_{(vᵢ,wᵢ)} | n ∈ N, αᵢ ∈ K, (vᵢ, wᵢ) ∈ V × W }.


The ‘freeness’ of FV×W is that the elements e(v,w) are, by definition linearlyindependent for distinct pairs (v, w) ∈ V × W . Now define an equivalencerelation ∼ on FV×W such that

e(v+v′,w) ∼ e(v,w) + e(v′,w),

e(v,w+w′) ∼ e(v,w) + e(v,w′),

αe(v,w) ∼ e(αv,w) ∼ e(v,αw)

for arbitrary v, v′ ∈ V , w,w′ ∈ W , and α ∈ K. Let R be the subspace of FV×W

generated by these equivalence relations, i.e. the equivalence class of e(0,0).

Definition 3.21. The (algebraic) tensor product V ⊗ W is the quotient space

V ⊗ W := F_{V×W} / R.

One can easily check that V ⊗W , as defined in this way, is indeed a vectorspace over K. The subspace R of FV×W is mapped to the zero element of V⊗Wunder the quotient map, and so the above equivalences become equalities in thetensor product space:

(v + v′)⊗ w = v ⊗ w + v′ ⊗ w,

v ⊗ (w + w′) = v ⊗ w + v ⊗ w′,

α(v ⊗ w) = (αv) ⊗ w = v ⊗ (αw)

for all v, v′ ∈ V, w, w′ ∈ W, and α ∈ K.

One can also check that the heuristic definition in terms of bases holds true under the formal definition: if {eᵢ | i ∈ I} is a basis of V and {fⱼ | j ∈ J} is a basis of W, then {eᵢ ⊗ fⱼ | i ∈ I, j ∈ J} is a basis of V ⊗ W. Hence, the dimension of the tensor product is the product of the dimensions of the original spaces.

Definition 3.22. The Hilbert space tensor product of two Hilbert spaces H andK over the same field K is given by defining an inner product on the algebraictensor product H⊗K by

〈h⊗ k, h′ ⊗ k′〉H⊗K := 〈h, h′〉H〈k, k′〉K for all h, h′ ∈ H and k, k′ ∈ K,

extending this definition to all of the algebraic tensor product by sesquilinearity,and defining the Hilbert space tensor product H⊗K to be the completion of thealgebraic tensor product with respect to this inner product and its associatednorm.

Tensor products of Hilbert spaces arise very naturally when consideringspaces of functions of more than one variable, or spaces of functions that takevalues in other function spaces. A prime example of the second type is a spaceof stochastic processes.

Example 3.23. 1. Given two measure spaces (X ,F , µ) and (Y,G , ν), con-sider L2(X ×Y, µ⊗ ν;K), the space of functions on X ×Y that are squareintegrable with respect to the product measure µ⊗ ν. If f ∈ L2(X , µ;K)and g ∈ L2(Y, ν;K), then we can define a function h : X × Y → K byh(x, y) := f(x)g(y). The definition of the product measure ensures that


h ∈ L2(X × Y, µ ⊗ ν;K), so this procedure defines a bilinear mappingL2(X , µ;K) × L2(Y, ν;K) → L2(X × Y, µ ⊗ ν;K). It turns out that thespan of the range of this bilinear map is dense in L2(X × Y, µ ⊗ ν;K) ifL2(X , µ;K) and L2(Y, ν;K) are separable. This shows that

L2(X , µ;K)⊗ L2(Y, ν;K) ∼= L2(X × Y, µ⊗ ν;K),

and it also explains why it is necessary to take the completion in theconstruction of the Hilbert space tensor product.

2. Similarly, L²(X, µ; H), the space of functions f : X → H that are square-integrable in the sense that

∫_X ‖f(x)‖²_H dµ(x) < +∞,

is isomorphic to L²(X, µ; K) ⊗ H if this space is separable. The isomorphism maps f ⊗ φ ∈ L²(X, µ; K) ⊗ H to the H-valued function x ↦ f(x)φ in L²(X, µ; H).

3. Combining the previous two examples reveals that

L²(X, µ; K) ⊗ L²(Y, ν; K) ≅ L²(X × Y, µ ⊗ ν; K) ≅ L²(X, µ; L²(Y, ν; K)).

Similarly, one can consider Bochner spaces of functions (random variables) taking values in a Banach space V that are pth-power-integrable in the sense that ∫_X ‖f(x)‖ᵖ_V dµ(x) is finite, and identify this with a suitable tensor product Lᵖ(X, µ; R) ⊗ V. However, several subtleties arise in doing this, as there is no single 'natural' tensor product of Banach spaces as there is for Hilbert spaces. Readers who are interested in such spaces should consult the book of Ryan [85].

Bibliography

At Warwick, the theory of Hilbert and Banach spaces is covered in courses such as MA3G7 Functional Analysis I and MA3G8 Functional Analysis II. Sobolev spaces are covered in MA4A2 Advanced PDEs.

Classic reference texts on elementary functional analysis, including Banachand Hilbert space theory, include the monographs of Reed & Simon [79], Rudin [83],and Rynne & Youngson [86]. Further discussion of the relationship between ten-sor products and spaces of vector-valued integrable functions can be found inthe book of Ryan [85].

Truly intrepid students may wish to consult Bourbaki [14], but the standardwarnings about Bourbaki texts apply: the presentation is comprehensive butoften forbiddingly austere, and so is perhaps better as a reference text thana learning tool. Aliprantis & Border’s Hitchhiker’s Guide [2] is another en-cyclopædic text, but is surprisingly readable despite the Bourbakiste order inwhich material is presented.


Chapter 4

Basic Optimization Theory

We demand rigidly defined areas of doubtand uncertainty!

The Hitchhiker’s Guide to the Galaxy

Douglas Adams

This chapter reviews the basic elements of optimization theory and practice,without going into the fine details of numerical implementation.

4.1 Optimization Problems and Terminology

In an optimization problem, the objective is to find the extreme values (either the minimal value, the maximal value, or both) f(x) of a given function f among all x in a given subset of the domain of f, along with the point or points x that realize those extreme values. The general form of a constrained optimization problem is

extremize:   f(x)
with respect to:   x ∈ X
subject to:   gᵢ(x) ∈ Eᵢ for i = 1, 2, . . . ,

where X is some set; f : X → R ∪ {±∞} is a function called the objective function; and, for each i, gᵢ : X → Yᵢ is a function and Eᵢ ⊆ Yᵢ some subset. The conditions {gᵢ(x) ∈ Eᵢ | i = 1, 2, . . . } are called constraints, and a point x ∈ X for which all the constraints are satisfied is called feasible; the set of feasible points,

{ x ∈ X | gᵢ(x) ∈ Eᵢ for i = 1, 2, . . . },

is called the feasible set. If there are no constraints, so that the problem is a search over all of X, then the problem is said to be unconstrained. In the case of a minimization problem, the objective function f is also called the cost function or energy; for maximization problems, the objective function is also called the utility function.

From a purely mathematical point of view, the distinction between con-strained and unconstrained optimization is artificial: constrained minimization


over X is the same as unconstrained minimization over the feasible set. However, from a practical standpoint, the difference is huge. Typically, X is Rⁿ for some n, or perhaps a simple subset specified using inequalities on one coordinate at a time, such as [a₁, b₁] × · · · × [aₙ, bₙ]; a bona fide non-trivial constraint is one that involves a more complicated function of one coordinate, or of two or more coordinates, such as

g₁(x) := cos(x) − sin(x) > 0

or

g₂(x₁, x₂, x₃) := x₁x₂ − x₃ = 0.

Definition 4.1. The arg min or set of global minimizers of f : X → R ∪ {±∞} is defined to be

argmin_{x∈X} f(x) := { x ∈ X | f(x) = inf_{x′∈X} f(x′) },

and the arg max or set of global maximizers of f : X → R ∪ {±∞} is defined to be

argmax_{x∈X} f(x) := { x ∈ X | f(x) = sup_{x′∈X} f(x′) }.

Definition 4.2. A constraint is said to be
1. redundant if it does not change the feasible set, and relevant otherwise;
2. non-binding if it does not change the extreme value, and binding otherwise;
3. active if it holds as an equality at the extremizer, and inactive otherwise.

4.2 Unconstrained Global Optimization

In general, finding a global minimizer of an arbitrary function is very hard, es-pecially in high-dimensional settings and without nice features like convexity.Except in very simple settings like linear least squares, it is necessary to con-struct an approximate solution, and to do so iteratively; that is, one computes asequence (xn)n∈N in X such that xn converges as n→ ∞ to an extremizer of theobjective function within the feasible set. A simple example of a deterministiciterative method for finding the critical points, and hence extrema, of a smoothfunction is Newton’s method:

Definition 4.3. Given a differentiable function f and an initial state x₀, Newton's method for finding a zero of f is the sequence generated by the iteration

x_{n+1} := xₙ − (Df(xₙ))⁻¹ f(xₙ).

Newton's method is often applied to find critical points of f, i.e. points where Df vanishes, in which case the iteration is

x_{n+1} := xₙ − (D²f(xₙ))⁻¹ Df(xₙ).
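As a concrete illustration (with a made-up objective function, not one from the text), the following Python/NumPy sketch applies the second iteration above to find the critical point of a smooth function of two variables.

import numpy as np

# Newton's method for the critical point of f(x, y) = (x - 1)^2 + 10*(y + 2)^2.
def grad(x):
    return np.array([2.0 * (x[0] - 1.0), 20.0 * (x[1] + 2.0)])

def hess(x):
    return np.array([[2.0, 0.0], [0.0, 20.0]])

x = np.array([5.0, 5.0])                       # initial state x0
for _ in range(10):
    x = x - np.linalg.solve(hess(x), grad(x))  # x_{n+1} = x_n - (D^2 f)^{-1} Df
print(x)                                       # converges to the minimizer (1, -2)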

For objective functions f : X → R ∪ ±∞ that have little to no smooth-ness, or that have many local extremizers, it is often necessary to resort torandom searches of the space X . For such algorithms, there can only be a prob-abilistic guarantee of convergence. The rate of convergence and the degree of


approximate optimality naturally depend upon features like randomness of thegeneration of new elements of X and whether the extremizers of f are difficultto reach, e.g. because they are located in narrow ‘valleys’. We now describethree very simple random iterative algorithms for minimization of a prescribedobjective function f, in order to illustrate some of the relevant issues. For sim-plicity, suppose that f has a unique global minimizer x_min and write f_min

for f(x_min).

Algorithm 4.4 (Random sampling). For simplicity, the following algorithm runsfor n_max steps with no convergence checks. The algorithm returns an approxi-mate minimizer x_best along with the corresponding value of f. Suppose thatrandom() generates independent samples of X from a probability measure µwith support X .

f_best = +inf
n = 0
while n < n_max:
    x_new = random()
    f_new = f(x_new)
    if f_new < f_best:
        x_best = x_new
        f_best = f_new
    n = n + 1
return [x_best, f_best]

A weakness of Algorithm 4.4 is that it completely neglects local informationabout f. Even if the current state x_best is very close to the global minimizerx_min, the algorithm may continue to sample points x_new that are very faraway and have f(x_new) ≫ f(x_best). It would be preferable to explore aneighbourhood of x_best more thoroughly and hence find a better approxima-tion of [x_min, f_min]. The next algorithm attempts to rectify this deficiency.

Algorithm 4.5 (Random walk). As before, this algorithm runs for n_max steps.The algorithm returns an approximate minimizer x_best along with the corre-sponding value of f. Suppose that an initial state x0 is given, and that jump()generates independent samples of X from a probability measure µ with supportequal to the unit ball of X .

x_best = x0
f_best = f(x_best)
n = 0
while n < n_max:
    x_new = x_best + jump()
    f_new = f(x_new)
    if f_new < f_best:
        x_best = x_new
        f_best = f_new
    n = n + 1
return [x_best, f_best]

Algorithm 4.5 also has a weakness: since the state is only ever updated to states with a strictly lower value of f, and new states are only sought within unit distance of the current one, the algorithm is prone to becoming stuck in local minima if they are surrounded by wells that are sufficiently wide, even if they are very shallow. The next algorithm, the simulated annealing method, attempts to rectify this problem by allowing the optimizer to make some 'uphill' moves, which can be accepted or rejected according to comparison of a uniformly-distributed random variable with a user-prescribed acceptance probability function. Therefore, in the simulated annealing algorithm, a distinction is made between the current state x of the algorithm and the best state so far, x_best; unlike in the previous two algorithms, proposed states x_new may be accepted and become x even if f(x_new) > f(x_best). The idea is to introduce a parameter T, to be thought of as 'temperature': the optimizer starts off 'hot', and 'uphill' moves are likely to be accepted; by the end of the calculation, the optimizer is relatively 'cold', and 'uphill' moves are unlikely to be accepted.

Algorithm 4.6 (Simulated annealing). Suppose that an initial state x0 is given,and that functions temperature(), neighbour() and acceptance_prob()havebeen specified. Suppose that uniform() generates independent samples fromthe uniform distribution on [0, 1]. Then the simulated annealing algorithm is

x = x0
fx = f(x)
x_best = x
f_best = fx
n = 0
while n < n_max:
    T = temperature(n / n_max)
    x_new = neighbour(x)
    f_new = f(x_new)
    if acceptance_prob(fx, f_new, T) > uniform():
        x = x_new
        fx = f_new
    if f_new < f_best:
        x_best = x_new
        f_best = f_new
    n = n + 1
return [x_best, f_best]

Like Algorithm 4.4, the simulated annealing method can be guaranteed to find the global minimizer of f provided that the neighbour() function allows full exploration of the state space and the maximum run time n_max is large enough. However, the difficulty lies in coming up with functions temperature() and acceptance_prob() such that the algorithm finds the global minimizer in reasonable time. Simulated annealing calculations can be extremely computationally costly. A commonly-used acceptance probability function P is the one from the Metropolis–Hastings algorithm:

P(e, e′, T) = 1 if e′ < e,  and  P(e, e′, T) = exp(−(e′ − e)/T) if e′ ≥ e.

There are, however, many other choices; in particular, it is not necessary to automatically accept downhill moves, and it is permissible to have P(e, e′, T) < 1 for e′ < e.
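The following Python sketch gives one possible set of helper functions for Algorithm 4.6. The acceptance rule is the Metropolis–Hastings one displayed above, while the exponential cooling schedule and the Gaussian proposal are illustrative assumptions, not prescriptions.

import numpy as np

def acceptance_prob(e, e_new, T):
    # Metropolis--Hastings rule: always accept downhill moves, accept uphill
    # moves with probability exp(-(e_new - e) / T).
    if e_new < e:
        return 1.0
    return np.exp(-(e_new - e) / T)

def temperature(fraction_done):
    # Exponential cooling from T = 1 down to T ~ 1e-3 over the course of the run.
    return 10.0 ** (-3.0 * fraction_done)

def neighbour(x, step=0.5):
    # Gaussian perturbation of the current state (one possible proposal).
    return x + step * np.random.standard_normal(np.shape(x))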


4.3 Constrained Optimization

It is well-known that the unconstrained extremizers of smooth enough functionsmust be critical points, i.e. points where the derivative vanishes. The followingtheorem, the Lagrange multiplier theorem, states that the constrained minimiz-ers of a smooth enough function, subject to smooth enough equality constraints,are critical points of an appropriately generalized function:

Theorem 4.7 (Lagrange multipliers). Let X and Y be real Banach spaces. Let U ⊆ X be open and let f ∈ C¹(U; R). Let g ∈ C¹(U; Y), and suppose that u₀ is a constrained extremizer of f subject to the constraint that g(u₀) = 0. Suppose also that the Fréchet derivative Dg(u₀) : X → Y is surjective. Then there exists a Lagrange multiplier λ₀ ∈ Y′ such that (u₀, λ₀) is an unconstrained critical point of the Lagrangian L defined by

U × Y′ ∋ (u, λ) ↦ L(u, λ) := f(u) + 〈λ | g(u)〉 ∈ R,

i.e. Df(u₀) = λ₀ ∘ Dg(u₀) as linear maps from X to R.

The corresponding result for inequality constraints is the Karush–Kuhn–Tucker theorem:

Theorem 4.8 (Karush–Kuhn–Tucker). Suppose that x∗ ∈ Rⁿ is a local minimizer of f ∈ C¹(Rⁿ; R) subject to inequality constraints gᵢ(x) ≤ 0 and equality constraints hⱼ(x) = 0, where gᵢ, hⱼ ∈ C¹(Rⁿ; R) for i = 1, . . . , m and j = 1, . . . , ℓ. Then there exist µ ∈ Rᵐ and λ ∈ Rˡ such that

−∇f(x∗) = µ · ∇g(x∗) + λ · ∇h(x∗),

where x∗ is feasible, and µ satisfies the dual feasibility criteria µᵢ ≥ 0 and the complementary slackness criteria µᵢ gᵢ(x∗) = 0 for i = 1, . . . , m.

Strictly speaking, the validity of the Karush–Kuhn–Tucker theorem also de-pends upon some regularity conditions on the constraints called constraint qual-ification conditions, of which there are many variations that can easily be foundin the literature. A very simple one is that if gi and hj are affine functions,then no further regularity is needed; another is that the gradients of the activeinequality constraints and the gradients of the equality constraints be linearlyindependent at x∗.

Numerical Implementation of Constraints. In the numerical treatment of constrained optimization problems, there are many ways to implement constraints, not all of which actually enforce the constraints in the sense of ensuring that trial states x_new, accepted states x, or even the final solution x_best are actually members of the feasible set. For definiteness, consider the constrained minimization problem

minimize:   f(x)
with respect to:   x ∈ X
subject to:   c(x) ≤ 0

for some functions f, c : X → R ∪ {±∞}. One way of seeing the constraint 'c(x) ≤ 0' is as a Boolean true/false condition: either the inequality is satisfied,


or it is not. Supposing that neighbour(x) generates new (possibly infeasible) elements of X given a current state x, one approach to generating feasible trial states x_new is the following:

x' = neighbour(x)
while c(x') > 0:
    x' = neighbour(x)
x_new = x'

However, this accept/reject approach is extremely wasteful: if the feasible setis very small, then x’ will ‘usually’ be rejected, thereby wasting a lot of com-putational time, and this approach takes no account of how ‘nearly feasible’ aninfeasible x’ might be.

One alternative approach is to use penalty functions: instead of considering the constrained problem of minimizing f(x) subject to c(x) ≤ 0, one can consider the unconstrained problem of minimizing x ↦ f(x) + p(x), where p : X → [0, ∞) is some function that equals zero on the feasible set and takes larger values the 'more' the constraint inequality c(x) ≤ 0 is violated, e.g., for µ > 0,

p_µ(x) = 0 if c(x) ≤ 0,  and  p_µ(x) = e^{c(x)/µ} − 1 if c(x) > 0.

The hope is that (a) the minimization of f + p_µ over all of X is easy, and (b) as µ → 0, minimizers of f + p_µ converge to minimizers of f on the original feasible set. The penalty function approach is attractive, but the choice of penalty function is rather ad hoc, and issues of competition between the penalties corresponding to multiple constraints can easily arise.

An alternative to the use of penalty functions is to construct constrainingfunctions that enforce the constraints exactly. That is, we seek a function C()

that takes as input a possibly infeasible x’ and returns some x_new = C(x’)

that is guaranteed to satisfy the constraint c(x_new) <= 0. For example, sup-pose that X = Rn and the feasible set is the Euclidean unit ball, so the constraintis

c(x) := ‖x‖22 − 1 ≤ 0.

Then a suitable constraining function could be

C(x) := x if ‖x‖₂ ≤ 1,  and  C(x) := x/‖x‖₂ if ‖x‖₂ > 1.

Constraining functions are very attractive because the constraints are treated exactly. However, they must often be designed on a case-by-case basis for each constraint function c, and care must be taken to ensure that multiple constraining functions interact well and do not unduly favour parts of the feasible set over others; for example, the above constraining function C maps the entire infeasible set to the unit sphere, which might be considered undesirable in certain settings, and so a function such as

C(x) := x if ‖x‖₂ ≤ 1,  and  C(x) := x/‖x‖₂² if ‖x‖₂ > 1

might be more appropriate. Finally, note that the original accept/reject method of finding feasible states is a constraining function in this sense, albeit a very inefficient one.
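The following Python/NumPy sketch shows both devices for the unit-ball example above: a penalty function p_µ and a constraining function C. The value of µ is an arbitrary illustrative choice.

import numpy as np

def penalty(x, mu=0.1):
    # Penalty for the constraint c(x) = ||x||_2^2 - 1 <= 0: zero on the feasible
    # set, e^{c(x)/mu} - 1 outside it.
    c = np.dot(x, x) - 1.0
    return 0.0 if c <= 0 else np.expm1(c / mu)

def constrain(x):
    # Constraining function: radial projection of infeasible points onto the ball.
    norm = np.linalg.norm(x)
    return x if norm <= 1.0 else x / norm

# Usage: minimize the penalised objective F(x) = f(x) + penalty(x) with any
# unconstrained method, or apply constrain() to every proposed state in
# Algorithms 4.4-4.6 so that only feasible states are ever evaluated.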


4.4 Convex Optimization

The topic of this section is convex optimization. As will be seen, convexityis a powerful property that makes optimization problems tractable to a muchgreater extent than any amount of smoothness (which still permits local minima)or low-dimensionality can do.

In this section, X will be a Hausdorff, locally convex topological vectorspace. Given two points x0 and x1 of X and t ∈ [0, 1], xt will denote the convexcombination

xt := (1− t)x0 + tx1.

More generally, given points x0, . . . , xn of a vector space, a sum of the form

α0x0 + · · ·+ αnxn

is called a linear combination if the αi are any field elements, an affine combi-nation if their sum is 1, and a convex combination if they are non-negative andsum to 1.

Definition 4.9. A subset K ⊆ X is a convex set if, for all x₀, x₁ ∈ K and t ∈ [0, 1], x_t ∈ K; it is said to be strictly convex if x_t lies in the interior of K whenever x₀ and x₁ are distinct points of K and t ∈ (0, 1). An extreme point of a convex set K is a point of K that cannot be written as a non-trivial convex combination of distinct elements of K; the set of all extreme points of K is denoted ext(K). The convex hull co(S) (resp. the closed convex hull, the closure of co(S)) of S ⊆ X is defined to be the intersection of all convex (resp. closed and convex) subsets of X that contain S.

Example 4.10. The square [−1, 1]² is a convex subset of R², but is not strictly convex, and its extreme points are the four vertices (±1, ±1). The closed unit disc {(x, y) ∈ R² | x² + y² ≤ 1} is a strictly convex subset of R², and its extreme points are the points of the unit circle {(x, y) ∈ R² | x² + y² = 1}. See Figure 4.1 for further examples.

Example 4.11. M₁(X) is a convex subset of the space of all (signed) Borel measures on X. The extremal probability measures are the zero–one measures, i.e. those for which, for every measurable set E ⊆ X, µ(E) ∈ {0, 1}. Furthermore, as will be discussed in Chapter 14, if X is, say, a Polish space, then the zero–one measures (and hence the extremal probability measures) on X are the Dirac point masses. Indeed, in this situation, M₁(X) is the closed convex hull of {δₓ | x ∈ X} as a subset of the space M±(X) of signed measures on X.

The reason that these notes restrict attention to Hausdorff, locally convextopological vector spaces X is that it is just too much of a headache to workwith spaces for which the following ‘common sense’ results do not hold:

Theorem 4.12 (Kreın–Milman [54]). Let X be a Hausdorff, locally convex topo-logical vector space, and let K ⊆ X be compact and convex. Then K is theclosed convex hull of its extreme points.

Theorem 4.13 (Choquet–Bishop–de Leeuw [11]). Let X be a Hausdorff, locallyconvex topological vector space, let K ⊆ X be compact and convex, and let c ∈ K.


Then there exists a probability measure µ supported on ext(K) such that, for all affine functions f on K,

f(c) = ∫_{ext(K)} f(e) dµ(e).

[Figure 4.1: Convex sets, extreme points, and convex hulls. (a) A convex subset of the plane (grey) and its extreme points (black). (b) A non-convex subset of the plane (black) and its convex hull (grey).]

Informally speaking, the Kreın–Milman and Choquet–Bishop–de Leeuw theorems together assure us that a compact, convex subset K of a topologically respectable space is entirely characterized by its set of extreme points in the following sense: every point of K can be obtained as an average of extremal points of K, and, indeed, the value of any affine function at any point of K can be obtained as an average of its values at the extremal points in the same way.

Definition 4.14. Let K ⊆ X be convex. A function f : K → R ∪ ±∞ is aconvex function if, for all x0, x1 ∈ K and t ∈ [0, 1],

f(xt) ≤ (1− t)f(x0) + tf(x1),

and is called a strictly convex function if, for all distinct x0, x1 ∈ K and t ∈(0, 1),

f(xt) < (1− t)f(x0) + tf(x1).

It is straightforward to see that f : K → R ∪ {±∞} is convex (resp. strictly convex) if and only if its epigraph

epi(f) := { (x, v) ∈ K × R | v ≥ f(x) }

is a convex (resp. strictly convex) subset of K × R. Convex functions have many convenient properties with respect to minimization and maximization:

Theorem 4.15. Let f : K → R be a convex function on a compact, convex, non-empty set K ⊆ X. Then
1. any local minimizer of f in K is also a global minimizer;
2. the set argmin_K f of global minimizers of f in K is convex;
3. if f is strictly convex, then it has at most one global minimizer in K;
4. f has the same maximum values on K and ext(K).


Remark. Note well that Theorem 4.15 does not assert the existence of minimizers, for which simultaneous compactness of K and lower semicontinuity of f are required. For example:
• the exponential function on R is strictly convex, continuous and bounded below by 0, yet has no minimizer;
• the interval [−1, 1] is compact, and the function f : [−1, 1] → R ∪ {±∞} defined by

f(x) := x if |x| < 1/2,  and  f(x) := +∞ if |x| ≥ 1/2,

is convex, yet f has no minimizer: although inf_{x∈[−1,1]} f(x) = −1/2, there is no x for which f(x) attains this infimal value.

Proof. 1. Suppose that x0 is a local minimizer of f in K that is not a globalminimizer: that is, suppose that x0 is a minimizer of f in some openneighbourhood N of x0, and also that there exists x1 ∈ K \N such thatf(x1) < f(x0). Then, for sufficiently small t > 0, xt ∈ N , but convexityimplies that

f(xt) ≤ (1− t)f(x0) + tf(x1) < (1− t)f(x0) + tf(x0) = f(x0),

which contradicts the assumption that x0 is a minimizer of f in N .

2. Suppose that x0, x1 ∈ K are global minimizers of f . Then, for all t ∈ [0, 1],xt ∈ K and

f(x0) ≤ f(xt) ≤ (1 − t)f(x0) + tf(x1) = f(x0).

Hence, xt ∈ argminK f , and so argminK f is convex.

3. Suppose that x0, x1 ∈ K are distinct global minimizers of f , and lett ∈ (0, 1). Then xt ∈ K and

f(x0) ≤ f(xt) < (1 − t)f(x0) + tf(x1) = f(x0),

which is a contradiction. Hence, f has at most one minimizer in K.

4. Suppose that x is a non-extreme point of K that is also a strict maximizer for f on K. Let x = x_t = (1 − t)x₀ + tx₁, where x₀, x₁ are distinct points of K and t ∈ (0, 1). Since x is assumed to be a strict maximizer for f on K, max{f(x₀), f(x₁)} < f(x_t). Then, since f is convex,

f(x_t) ≤ (1 − t)f(x₀) + tf(x₁) ≤ max{f(x₀), f(x₁)} < f(x_t),

which is a contradiction.

Definition 4.16. A convex optimization problem (or convex program) is a mini-mization problem in which the objective function and all constraints are equal-ities or inequalities with respect to convex functions.

Remark 4.17. 1. Beware of the common pitfall of saying that a convex pro- gram is simply the minimization of a convex function over a convex set. Ofcourse, by Theorem 4.15, such minimization problems are nicer than gen-eral minimization problems, but bona fide convex programs are an evennicer special case.


2. In practice, many problems are not obviously convex programs, but can be transformed into convex programs by e.g. a cunning change of variables. Being able to spot the right equivalent problem is a major part of the art of optimization.

It is difficult to overstate the importance of convexity in making optimiza-tion problems tractable. Indeed, it has been remarked that lack of convexity isa much greater obstacle to tractability than high dimension. There are manypowerful methods for the solution of convex programs, with corresponding stan-dard software libraries such as cvxopt. For example, the interior point methodsexplore the interior of the feasible set in search of the solution to the convexprogram, while being kept away from the boundary of the feasible set by a bar-rier function. The discussion that follows is only intended as an outline; fordetails, see Chapter 11 of Boyd & Vandenberghe [16].

Consider the convex program

minimize:   f(x)
with respect to:   x ∈ Rⁿ
subject to:   cᵢ(x) ≤ 0 for i = 1, . . . , m,

where the functions f, c₁, . . . , cₘ : Rⁿ → R are all convex and differentiable. Let F denote the feasible set for this program. Let 0 < µ ≪ 1 be a small scalar, called the barrier parameter, and define the barrier function associated to the program by

B(x; µ) := f(x) − µ ∑_{i=1}^{m} log(−cᵢ(x)).

Note that B( · ; µ) is strictly convex for µ > 0, that B(x; µ) → +∞ as x → ∂F, and that B( · ; 0) = f; therefore, the unique minimizer x*_µ of B( · ; µ) lies in F and (hopefully) converges to the minimizer of the original problem as µ → 0. Indeed, using arguments based on convex duality, one can show that

f(x*_µ) − inf_{x∈F} f(x) ≤ mµ.

The strictly convex problem of minimizing B( · ;µ) can be solved approximatelyusing Newton’s method. In fact, however, one settles for a partial minimizationof B( · ;µ) using only one or two steps of Newton’s method, then decreases µ toµ′, performs another partial minimization of B( · ;µ′) using Newton’s method,and so on in this alternating fashion.
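The following Python sketch illustrates the barrier idea on a two-dimensional toy problem (an assumption for illustration, not an example from the text); for simplicity a generic unconstrained minimizer from SciPy stands in for the one or two Newton steps per value of µ described above.

import numpy as np
from scipy.optimize import minimize

# Toy convex program: minimize f(x) = (x1 - 1)^2 + (x2 - 2)^2
# subject to c1(x) = x1 + x2 - 2 <= 0.
f = lambda x: (x[0] - 1.0)**2 + (x[1] - 2.0)**2
c1 = lambda x: x[0] + x[1] - 2.0

def B(x, mu):
    # Barrier function B(x; mu) = f(x) - mu * log(-c1(x)); +inf off the feasible set.
    return f(x) - mu * np.log(-c1(x)) if c1(x) < 0 else np.inf

x = np.array([0.0, 0.0])                       # strictly feasible starting point
for mu in [1.0, 0.1, 0.01, 0.001]:             # decreasing barrier parameter
    x = minimize(lambda z: B(z, mu), x, method="Nelder-Mead").x
print(x)                                       # approaches the constrained minimizer (0.5, 1.5)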

4.5 Linear Programming

Theorem 4.15 has the following immediate corollary for the minimization and maximization of affine functions on convex sets:

Corollary 4.18. Let ℓ : K → R be an affine function on a non-empty, compact, convex set K ⊆ X. Then

ext_{x∈K} ℓ(x) = ext_{x∈ext(K)} ℓ(x).


Definition 4.19. A linear program is an optimization problem of the form

extremize:   f(x)
with respect to:   x ∈ Rⁿ
subject to:   gᵢ(x) ≤ 0 for i = 1, . . . , m,

where the functions f, g₁, . . . , gₘ : Rⁿ → R are all affine functions. Linear programs are often written in the canonical form

maximize:   c · x
with respect to:   x ∈ Rⁿ
subject to:   Ax ≤ b
             x ≥ 0,

where c ∈ Rⁿ, A ∈ R^{m×n} and b ∈ Rᵐ are given, and the two inequalities are interpreted componentwise.
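For example, a small linear program in this canonical form can be passed to SciPy's linprog routine, which minimizes c · x, so a maximization is handled by negating c; the data below are arbitrary illustrative values.

import numpy as np
from scipy.optimize import linprog

c = np.array([3.0, 2.0])
A = np.array([[1.0, 1.0],
              [2.0, 1.0]])
b = np.array([4.0, 6.0])

# linprog minimizes; the default bounds already impose x >= 0 componentwise.
result = linprog(-c, A_ub=A, b_ub=b)
print(result.x, -result.fun)   # optimizer (an extreme point of the polytope) and value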

Note that the feasible set for a linear program is an intersection of finitelymany half-spaces of Rn, i.e. a polytope. This polytope may be empty, in whichcase the constraints are mutually contradictory and the program is said to beinfeasible. Also, the polytope may be unbounded in the direction of c, in whichcase the extreme value of the problem is infinite.

Since linear programs are special cases of convex programs, methods such as interior point methods are applicable to linear programs as well. Such methods approach the optimum point x∗, which is necessarily an extremal element of the feasible polytope, from the interior of the feasible polytope. Historically, however, such methods were preceded by methods such as Dantzig's simplex algorithm, which sets out to explore the set of extreme points directly in a (hopefully) efficient way. Although the theoretical worst-case complexity of the simplex method as formulated by Dantzig is exponential in n and m, in practice the simplex method is remarkably efficient (polynomial running time) provided that certain precautions are taken to avoid pathologies such as 'stalling'.

4.6 Least Squares

An elementary example of convex programming is unconstrained quadratic min-imization, otherwise known as least squares. Least squares minimization playsa central role in elementary statistical estimation, as will be demonstrated bythe Gauss–Markov theorem (Theorem 6.2).

Lemma 4.20. Let K be a closed, convex subset of a Hilbert space H. Then, for each y ∈ H, there is a unique element x ∈ K such that

x ∈ argmin_{x∈K} ‖y − x‖.

Proof. By Exercise 4.1, the function J : X → [0, ∞) defined by J(x) := ‖y − x‖² is strictly convex, and hence it has at most one minimizer in K. Therefore, it only remains to show that J has at least one minimizer in K. Since J is bounded below (on X, not just on K), J has a sequence of approximate minimizers: let I := inf_{x∈K} ‖y − x‖ and let (xₙ)_{n∈N} be a sequence in K such that

I² ≤ ‖y − xₙ‖² ≤ I² + 1/n.


By the parallelogram identity for the Hilbert norm ‖ · ‖,

‖(y − xₘ) + (y − xₙ)‖² + ‖(y − xₘ) − (y − xₙ)‖² = 2‖y − xₘ‖² + 2‖y − xₙ‖²,

and hence

‖2y − (xₘ + xₙ)‖² + ‖xₙ − xₘ‖² ≤ 4I² + 2/n + 2/m.

Since K is convex, ½(xₘ + xₙ) ∈ K, so the first term on the left-hand side above is bounded below as follows:

‖2y − (xₘ + xₙ)‖² = 4 ‖y − (xₘ + xₙ)/2‖² ≥ 4I².

Hence,

‖xₙ − xₘ‖² ≤ 4I² + 2/n + 2/m − 4I² = 2/n + 2/m,

and so the sequence (xₙ)_{n∈N} is Cauchy; since H is complete and K is closed, this sequence converges to some x ∈ K. Since the norm ‖ · ‖ is continuous, ‖y − x‖ = I.

Lemma 4.21 (Orthogonality of the residual). Let V be a closed subspace of a Hilbert space H and let b ∈ H. Then x ∈ V minimizes the distance to b if and only if the residual x − b is orthogonal to V, i.e.

x = argmin_{x∈V} ‖x − b‖ ⇐⇒ (x − b) ⊥ V.

Proof. Let J(x) := ½‖x − b‖², which has the same minimizers as x ↦ ‖x − b‖; by Lemma 4.20, such a minimizer exists and is unique. Suppose that (x − b) ⊥ V and let y ∈ V. Then y − x ∈ V and so (y − x) ⊥ (x − b). Hence, by Pythagoras' theorem,

‖y − b‖² = ‖y − x‖² + ‖x − b‖² ≥ ‖x − b‖²,

and so x minimizes J.

Conversely, suppose that x minimizes J. Then, for every y ∈ V,

0 = ∂/∂λ J(x + λy) |_{λ=0} = ½(〈y, x − b〉 + 〈x − b, y〉) = Re〈x − b, y〉,

and, in the complex case,

0 = ∂/∂λ J(x + λiy) |_{λ=0} = ½(−i〈y, x − b〉 + i〈x − b, y〉) = −Im〈x − b, y〉.

Hence, 〈x − b, y〉 = 0, and since y was arbitrary, (x − b) ⊥ V.

Lemma 4.22 (Normal equations). Let A : H → K be a linear operator between Hilbert spaces such that R(A) is a closed subspace of K. Then, given b ∈ K,

x ∈ argmin_{x∈H} ‖Ax − b‖_K ⇐⇒ A∗Ax = A∗b,

the equations on the right-hand side being known as the normal equations.


Proof. Recall that, as a consequence of completeness, the only element of a Hilbert space that is orthogonal to every other element of the space is the zero element. Hence,

‖Ax − b‖_K is minimal
⇐⇒ (Ax − b) ⊥ Av for all v ∈ H, by Lemma 4.21
⇐⇒ 〈Ax − b, Av〉_K = 0 for all v ∈ H
⇐⇒ 〈A∗Ax − A∗b, v〉_H = 0 for all v ∈ H
⇐⇒ A∗Ax = A∗b, by completeness of H.
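In coordinates, the normal equations can be solved directly as follows; the data are arbitrary illustrative values, and the result agrees with NumPy's built-in least squares routine.

import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((20, 3))       # m = 20 observations, n = 3 unknowns
b = rng.standard_normal(20)

x_normal = np.linalg.solve(A.T @ A, A.T @ b)      # A*A x = A*b
x_lstsq = np.linalg.lstsq(A, b, rcond=None)[0]
print(np.allclose(x_normal, x_lstsq))             # True

# The residual Ax - b is orthogonal to the range of A (Lemma 4.21):
print(np.abs(A.T @ (A @ x_normal - b)).max())     # ~ 0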

Weighting and Regularization. It is common in practice that one does notwant to minimize the K-norm directly, but perhaps some re-weighted versionof the K-norm. This re-weighting is accomplished by a self-adjoint and positivedefinite operator on K.

Corollary 4.23 (Weighted least squares). Let A : H → K be a linear operator between Hilbert spaces such that R(A) is a closed subspace of K. Let Q : K → K be self-adjoint and positive definite, and define 〈k, k′〉_Q := 〈k, Qk′〉_K. Then, given b ∈ K,

x ∈ argmin_{x∈H} ‖Ax − b‖_Q ⇐⇒ A∗QAx = A∗Qb.

Proof. Exercise 4.2.

Another situation that arises frequently in practice is that the normal equations do not have a unique solution (e.g. because A∗A is not invertible) and so it is necessary to select one by some means, or that one has some prior belief that 'the right solution' should be close to some initial guess x₀. A technique that accomplishes both of these aims is Tikhonov regularization:

Corollary 4.24 (Tikhonov-regularized least squares). Let A : H → K be a linear operator between Hilbert spaces such that R(A) is a closed subspace of K, let Q : H → H be self-adjoint and positive definite, and let b ∈ K and x₀ ∈ H be given. Let

J(x) := ‖Ax − b‖² + ‖x − x₀‖²_Q.

Then

x ∈ argmin_{x∈H} J(x) ⇐⇒ (A∗A + Q)x = A∗b + Qx₀.

Proof. Exercise 4.3.
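A coordinate version of Corollary 4.24, with the illustrative choices Q = αI and x₀ = 0, looks as follows; the data and the value of α are arbitrary.

import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((20, 5))
b = rng.standard_normal(20)
alpha = 0.1
Q = alpha * np.eye(5)
x0 = np.zeros(5)

# Solve (A*A + Q) x = A*b + Q x0.
x_tik = np.linalg.solve(A.T @ A + Q, A.T @ b + Q @ x0)

# Sanity check: nearby points have a larger regularized objective J.
J = lambda x: np.sum((A @ x - b)**2) + (x - x0) @ Q @ (x - x0)
print(J(x_tik) <= J(x_tik + 1e-3 * rng.standard_normal(5)))   # True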

Nonlinear Least Squares and Gauss–Newton Iteration. It often occurs in practice that one wishes to find a vector of parameters θ ∈ Rᵖ such that a function Rᵏ ∋ x ↦ f(x; θ) ∈ Rˡ best fits a collection of data points {(xᵢ, yᵢ) ∈ Rᵏ × Rˡ | i = 1, . . . , m}. For each candidate parameter vector θ, define the residual vector

r(θ) := (r₁(θ), . . . , rₘ(θ)) = (y₁ − f(x₁; θ), . . . , yₘ − f(xₘ; θ)) ∈ Rᵐ.


The aim is to find θ to minimize the objective function J(θ) := ‖r(θ)‖²₂. Let

A := [ ∂rᵢ/∂θⱼ (θₙ) ]_{i=1,…,m; j=1,…,p} ∈ R^{m×p}

be the Jacobian matrix of the residual vector, evaluated at the current iterate θₙ, and note that A = −DF(θₙ), where

F(θ) := (f(x₁; θ), . . . , f(xₘ; θ)) ∈ Rᵐ.

Consider the first-order Taylor approximation

r(θ) ≈ r(θₙ) + A(θ − θₙ).

Thus, to approximately minimize ‖r(θ)‖₂, we choose the increment δ := θ − θₙ that minimizes the norm of the right-hand side of this approximation. This is an ordinary linear least squares problem, the solution of which is given by the normal equations as

δ = −(A∗A)⁻¹A∗r(θₙ).

Thus, we obtain the Gauss–Newton iteration for a sequence (θₙ)_{n∈N} of approximate minimizers of J:

θₙ₊₁ := θₙ − (A∗A)⁻¹A∗r(θₙ)
      = θₙ + ((DF(θₙ))∗(DF(θₙ)))⁻¹ (DF(θₙ))∗ r(θₙ).

In general, the Gauss–Newton iteration is not guaranteed to converge to the exact solution, particularly if δ is 'too large', in which case it may be appropriate to use a judiciously chosen small positive multiple of δ. The use of Tikhonov regularization in this context is known as the Levenberg–Marquardt algorithm or trust region method, and the small multiplier applied to δ is essentially the reciprocal of the Tikhonov regularization parameter.
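The following Python/NumPy sketch implements the Gauss–Newton iteration above for the entirely hypothetical model f(x; θ) = θ₀ exp(θ₁x) fitted to synthetic data.

import numpy as np

rng = np.random.default_rng(3)
x_data = np.linspace(0.0, 1.0, 20)
y_data = 2.0 * np.exp(-1.5 * x_data) + 0.01 * rng.standard_normal(20)

def residual(theta):
    return y_data - theta[0] * np.exp(theta[1] * x_data)

def jacobian(theta):
    # A_{ij} = d r_i / d theta_j
    e = np.exp(theta[1] * x_data)
    return np.column_stack((-e, -theta[0] * x_data * e))

theta = np.array([1.0, 0.0])                   # initial guess
for _ in range(20):
    A, r = jacobian(theta), residual(theta)
    delta = -np.linalg.solve(A.T @ A, A.T @ r) # delta = -(A*A)^{-1} A* r(theta_n)
    theta = theta + delta
print(theta)                                   # approximately [2.0, -1.5]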

Bibliography

Direct and iterative methods for the solution of linear least squares problems are covered in MA398 Matrix Analysis and Algorithms.

The book of Boyd & Vandenberghe [16] is an excellent reference on the theory and practice of convex optimization, as is the associated software library cvxopt. The classic reference for convex analysis in general is the monograph of Rockafellar [82]. A standard reference on numerical methods for optimization is the book of Nocedal & Wright [72].

For constrained global optimization in the absence of 'nice' features, particularly for the UQ methods in Chapter 14, the author recommends the Differential Evolution algorithm [77, 97] within the Mystic framework.


Exercises

Exercise 4.1. Let ‖ · ‖ be a norm on a vector space X, and fix y ∈ X. Show that the function J : X → [0, ∞) defined by J(x) := ‖y − x‖² is strictly convex.

Exercise 4.2. Let A : H → K be a linear operator between Hilbert spaces such that R(A) is a closed subspace of K. Let Q : K → K be self-adjoint and positive-definite and let

〈k, k′〉_Q := 〈k, Qk′〉_K.

Show that, given b ∈ K,

x ∈ argmin_{x∈H} ‖Ax − b‖_Q ⇐⇒ A∗QAx = A∗Qb.

Exercise 4.3. Let A : H → K be a linear operator between Hilbert spaces such that R(A) is a closed subspace of K, let Q : H → H be self-adjoint and positive-definite, and let b ∈ K and x0 ∈ H be given. Let

J(x) := ‖Ax − b‖² + ‖x − x0‖²_Q.

Show that

x ∈ argmin_{x∈H} J(x) ⇐⇒ (A∗A + Q)x = A∗b + Qx0.

Hint: Consider the operator from H into K ⊕ H given in block form as [A, Q^{1/2}]^⊤.


Chapter 5

Measures of Information and Uncertainty

As we know, there are known knowns. There are things we know we know. We also know there are known unknowns. That is to say we know there are some things we do not know. But there are also unknown unknowns, the ones we don't know we don't know.

Donald Rumsfeld

This chapter briefly summarizes some basic numerical measures of uncertainty, from interval bounds to information-theoretic quantities such as (Shannon) information and entropy. This discussion then naturally leads to consideration of distances (and distance-like functions) between probability measures.

5.1 The Existence of Uncertainty

At a very fundamental level, the first step in understanding the uncertainties affecting some system is to identify the sources of uncertainty. Sometimes, this can be a challenging task because there may be so much lack of knowledge about, e.g. the relevant physical mechanisms, that one does not even know what a list of the important parameters would be, let alone what uncertainty one has about their values. The presence of such so-called unknown unknowns is of major concern in high-impact settings like risk assessment.

One way of assessing the presence of unknown unknowns is that, if one subscribes to a deterministic view of the universe in which reality maps inputs x ∈ X to outputs y = f(x) ∈ Y by a well-defined single-valued function f : X → Y, then unknown unknowns are additional variables z ∈ Z whose existence one infers from contradictory observations like

f(x) = y1 and f(x) = y2 ≠ y1.


Unknown unknowns explain away this contradiction by asserting the existence of a space Z containing distinct elements z1 and z2, that in fact f is a function f : X × Z → Y, and that the observations were actually

f(x, z1) = y1 and f(x, z2) = y2.

Of course, this viewpoint does nothing to actually identify the relevant space Z nor the values z1 and z2.

5.2 Interval Estimates

Sometimes, nothing more can be said about some unknown quantity than a range of possible values, with none more or less probable than any other. In the case of an unknown real number x, such information may boil down to an interval such as [a, b] in which x is known to lie. This is, of course, a very basic form of uncertainty, and one may simply summarize the degree of uncertainty by the length of the interval.

Interval Arithmetic. As well as summarizing the degree of uncertainty by the length of the interval estimate, it is often of interest to manipulate the interval estimates themselves as if they were numbers. One commonly-used method of manipulating interval estimates of real quantities is interval arithmetic. Each of the basic arithmetic operations ∗ ∈ {+, −, ·, /} is extended to intervals A, B ⊆ R by

A ∗ B := {x ∈ R | x = a ∗ b for some a ∈ A, b ∈ B}.

Hence,

[a, b] + [c, d] = [a + c, b + d],
[a, b] − [c, d] = [a − d, b − c],
[a, b] · [c, d] = [min{ac, ad, bc, bd}, max{ac, ad, bc, bd}],
[a, b] / [c, d] = [min{a/c, a/d, b/c, b/d}, max{a/c, a/d, b/c, b/d}] when 0 ∉ [c, d].

The addition and multiplication operations are commutative, associative and sub-distributive:

A(B + C) ⊆ AB + AC.

These ideas can be extended to elementary functions without too much difficulty: monotone functions are straightforward, and the Intermediate Value Theorem ensures that the continuous image of an interval is again an interval. However, for general functions f, it is not straightforward to compute (the convex hull of) the image of f. A minimal sketch of interval arithmetic in code is given below.
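The four interval operations above translate directly into a small class. The following Python sketch (the class name Interval and the example values are ours) also illustrates sub-distributivity on a concrete example.

from itertools import product

class Interval:
    # Minimal interval arithmetic on closed intervals [lo, hi].
    def __init__(self, lo, hi):
        self.lo, self.hi = min(lo, hi), max(lo, hi)

    def __add__(self, other):
        return Interval(self.lo + other.lo, self.hi + other.hi)

    def __sub__(self, other):
        return Interval(self.lo - other.hi, self.hi - other.lo)

    def __mul__(self, other):
        prods = [a * b for a, b in product((self.lo, self.hi), (other.lo, other.hi))]
        return Interval(min(prods), max(prods))

    def __truediv__(self, other):
        if other.lo <= 0.0 <= other.hi:
            raise ZeroDivisionError("0 lies in the divisor interval")
        quots = [a / b for a, b in product((self.lo, self.hi), (other.lo, other.hi))]
        return Interval(min(quots), max(quots))

    def __repr__(self):
        return f"[{self.lo}, {self.hi}]"

A, B, C = Interval(1, 2), Interval(-1, 1), Interval(3, 4)
print(A * (B + C))      # [2, 10], contained in ...
print(A * B + A * C)    # ... [1, 10], illustrating A(B + C) ⊆ AB + AC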

The distributional robustness approaches covered in Chapter 14 can be seen as an extension of this approach from partially known real numbers to partially known probability measures.

5.3 Variance, Information and Entropy

Suppose that one adopts a subjectivist (e.g. Bayesian) interpretation of probability, so that one's knowledge about some system of interest with possible values in X is summarized by a probability measure µ ∈ M1(X). The probability measure µ is a very rich and high-dimensional object; often it is necessary to summarize the degree of uncertainty implicit in µ with a few numbers, perhaps even just one number.

Variance. One obvious summary statistic, when X is (a subset of) a normed vector space and µ has mean m, is the variance of µ, i.e.

V(µ) := ∫_X ‖x − m‖² dµ(x) ≡ E_{X∼µ}[‖X − m‖²].

If V(µ) is small (resp. large), then we are relatively certain (resp. uncertain) that X ∼ µ is in fact quite close to m. A more refined variance-based measure of informativeness is the covariance operator

C(µ) := E_{X∼µ}[(X − m) ⊗ (X − m)].

A distribution µ for which the operator norm of C(µ) is large may be said to be a relatively uninformative distribution. Note that when X = R^n, C(µ) is an n × n symmetric positive-definite matrix. Hence, such a C(µ) has n positive real eigenvalues (counted with multiplicity)

λ1 ≥ λ2 ≥ · · · ≥ λn > 0,

with corresponding normalized eigenvectors v1, . . . , vn ∈ R^n. The direction v1 corresponding to the largest eigenvalue λ1 is the direction in which the uncertainty about the random vector X is greatest; correspondingly, the direction vn is the direction in which the uncertainty about the random vector X is least.
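In coordinates, these most and least uncertain directions can be read off from an eigendecomposition of an (empirical) covariance matrix. The following numpy sketch, with made-up data, is purely illustrative.

import numpy as np

rng = np.random.default_rng(0)
# Samples from a distribution that is much more uncertain along (1, 1)/sqrt(2).
X = rng.multivariate_normal(mean=[0.0, 0.0],
                            cov=[[2.0, 1.8], [1.8, 2.0]], size=10_000)

C = np.cov(X, rowvar=False)            # empirical covariance operator C(mu)
eigvals, eigvecs = np.linalg.eigh(C)   # eigenvalues in ascending order
v_most = eigvecs[:, -1]    # direction of greatest uncertainty (largest eigenvalue)
v_least = eigvecs[:, 0]    # direction of least uncertainty (smallest eigenvalue)
print(eigvals[::-1], v_most, v_least)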

A beautiful and classical result concerning the variance of two quantities of interest is the uncertainty principle from quantum mechanics. In this setting, the probability distribution is written as p = |ψ|², where ψ is a unit-norm element of a suitable Hilbert space, usually something like L²(R^n; C). Physical observables like position, momentum &c. act as self-adjoint operators on this Hilbert space; e.g. the position operator Q is

(Qψ)(x) := xψ(x),

so that the expected position is

〈ψ, Qψ〉 = ∫_{R^n} ψ̄(x) x ψ(x) dx = ∫_{R^n} x |ψ(x)|² dx.

In general, for a fixed unit-norm element ψ ∈ H, the expected value 〈A〉 and variance V(A) ≡ σ²_A of a self-adjoint operator A : H → H are defined by

〈A〉 := 〈ψ, Aψ〉,   σ²_A := 〈(A − 〈A〉)²〉.

The following inequality provides a fundamental lower bound on the product of the variances of any two observables A and B in terms of their commutator [A, B] := AB − BA and their anti-commutator {A, B} := AB + BA. When this lower bound is positive, the two variances cannot both be close to zero, so simultaneous high-precision measurements of A and B are impossible.


Theorem 5.1 (Uncertainty principle: Schrödinger's inequality). Let A, B be self-adjoint operators on a Hilbert space H, and let ψ ∈ H have unit norm. Then

σ²_A σ²_B ≥ |(〈{A, B}〉 − 2〈A〉〈B〉)/2|² + |〈[A, B]〉/2|²   (5.1)

and, a fortiori, σ_A σ_B ≥ ½|〈[A, B]〉|.

Proof. Let f := (A − 〈A〉)ψ and g := (B − 〈B〉)ψ, so that

σ²_A = 〈f, f〉 = ‖f‖²,   σ²_B = 〈g, g〉 = ‖g‖².

Therefore, by the Cauchy–Schwarz inequality (3.1),

σ²_A σ²_B = ‖f‖²‖g‖² ≥ |〈f, g〉|².

Now write the right-hand side of this inequality as

|〈f, g〉|² = (Re〈f, g〉)² + (Im〈f, g〉)² = ((〈f, g〉 + 〈g, f〉)/2)² + ((〈f, g〉 − 〈g, f〉)/2i)².

Using the self-adjointness of A and B,

〈f, g〉 = 〈(A − 〈A〉)ψ, (B − 〈B〉)ψ〉 = 〈AB〉 − 〈A〉〈B〉 − 〈A〉〈B〉 + 〈A〉〈B〉 = 〈AB〉 − 〈A〉〈B〉;

similarly, 〈g, f〉 = 〈BA〉 − 〈A〉〈B〉. Hence,

〈f, g〉 − 〈g, f〉 = 〈[A, B]〉,   〈f, g〉 + 〈g, f〉 = 〈{A, B}〉 − 2〈A〉〈B〉,

which yields (5.1).

Information and Entropy. In information theory as pioneered by Claude Shannon, the information (or surprisal) associated to a possible outcome x of a random variable X ∼ µ taking values in a finite set X is defined to be

I(x) := − log P_{X∼µ}[X = x] ≡ − log µ(x). (5.2)

Information has units according to the base of the logarithm used:

base 2 ↔ bits, base e ↔ nats/nits, base 10 ↔ bans/dits/hartleys.

The negative sign in (5.2) makes I(x) non-negative, and logarithms are used because one seeks a quantity I( · ) that represents in an additive way the 'surprise value' of observing x. So, for example, if x has half the probability of y, then one is 'twice as surprised' to see the outcome X = x instead of X = y, and so I(x) = I(y) + log 2. The entropy of the measure µ is the expected information:

H(µ) := E_{X∼µ}[I(X)] ≡ −∑_{x∈X} µ(x) log µ(x). (5.3)


(We follow the convention that 0 log 0 := lim_{p→0} p log p = 0.) These definitions are readily extended to a random variable X taking values in R^n and distributed according to a probability measure µ that has Lebesgue density ρ:

I(x) := − log ρ(x),

H(µ) := −∫_{R^n} ρ(x) log ρ(x) dx.

Since entropy measures the average information content of the possible values of X ∼ µ, entropy is often interpreted as a measure of the uncertainty implicit in µ. (Remember that if µ is very 'spread out' and describes a lot of uncertainty about X, then observing a particular value of X carries a lot of 'surprise value' and hence a lot of information.)

Example 5.2. Consider a Bernoulli random variable X taking values x1, x2 ∈ X with probabilities p, 1 − p ∈ [0, 1] respectively. This random variable has entropy

−p log p − (1 − p) log(1 − p).

If X is certain to equal x1, then p = 1, and the entropy is 0; similarly, if X is certain to equal x2, then p = 0, and the entropy is again 0; these two distributions carry zero information and have minimal entropy. On the other hand, if p = ½, in which case X is uniformly distributed on X, then the entropy is log 2; indeed, this is the maximum possible entropy for a Bernoulli random variable. This example is often interpreted as saying that when interrogating someone with questions that demand "yes" or "no" answers, one gains maximum information by asking questions that have an equal probability of being answered "yes" versus "no".
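The discrete entropy (5.3) and the Bernoulli example are easy to check numerically. The following sketch (the function name entropy is ours) uses the 0 log 0 := 0 convention stated above.

import numpy as np

def entropy(p, base=np.e):
    # Shannon entropy H(mu) = -sum_x mu(x) log mu(x), with 0 log 0 := 0.
    p = np.asarray(p, dtype=float)
    nz = p > 0.0
    return -np.sum(p[nz] * np.log(p[nz])) / np.log(base)

print(entropy([1.0, 0.0]))          # 0: a certain outcome carries no information
print(entropy([0.5, 0.5]))          # log 2, the Bernoulli maximum, in nats
print(entropy([0.5, 0.5], base=2))  # the same entropy expressed as 1 bit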

Proposition 5.3. Let µ and ν be probability measures on discrete sets or R^n. Then the product measure µ ⊗ ν satisfies

H(µ ⊗ ν) = H(µ) + H(ν).

That is, the entropy of a random vector with independent components is the sum of the entropies of the component random variables.

Proof. Exercise 5.2.

5.4 Information Gain

Implicit in the definition of entropy (5.3) is the use of a uniform measure (counting measure on a finite set, or Lebesgue measure on R^n) as a reference measure. Upon reflection, there is no need to privilege uniform measure with being the unique reference measure. Indeed, in some settings, such as infinite-dimensional Banach spaces, there is no such thing as a uniform measure. In general, if µ is a probability measure on a measurable space (X, F) with reference measure π, then we would like to define the entropy of µ with respect to π by an expression like

H(µ | π) = −∫_X (dµ/dπ)(x) log((dµ/dπ)(x)) dπ(x)


whenever µ has a Radon–Nikodym derivative with respect to π. The negative of this functional is a distance-like function on the set of probability measures on (X, F):

Definition 5.4. Let µ, ν be σ-finite measures on (X, F). The Kullback–Leibler divergence from µ to ν is defined to be

D_KL(µ‖ν) := ∫_X (dµ/dν) log(dµ/dν) dν ≡ ∫_X log(dµ/dν) dµ, if µ ≪ ν,

and D_KL(µ‖ν) := +∞ otherwise.

While D_KL( · ‖ · ) is non-negative, and vanishes if and only if its arguments are identical, it is neither symmetric nor does it satisfy the triangle inequality. Nevertheless, D_KL( · ‖ · ) generates a topology on M1(X) that is finer than the total variation topology:

Theorem 5.5 (Pinsker). For any µ, ν ∈ M1(X, F),

dTV(µ, ν) ≤ √(D_KL(µ‖ν)/2),

where the total variation metric is defined by

dTV(µ, ν) := sup{|µ(E) − ν(E)| : E ∈ F}. (5.4)
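For measures on a finite set, both sides of Pinsker's inequality are easy to compute. The sketch below (function names are ours) uses the elementary identity that, for probability measures on a finite set, dTV(µ, ν) = ½ ∑_x |µ(x) − ν(x)|.

import numpy as np

def kl_divergence(mu, nu):
    # D_KL(mu || nu) for discrete distributions; +inf unless mu << nu.
    mu, nu = np.asarray(mu, float), np.asarray(nu, float)
    if np.any((mu > 0) & (nu == 0)):
        return np.inf
    supp = mu > 0
    return np.sum(mu[supp] * np.log(mu[supp] / nu[supp]))

def total_variation(mu, nu):
    # d_TV(mu, nu) = sup_E |mu(E) - nu(E)| = (1/2) sum_x |mu(x) - nu(x)|.
    return 0.5 * np.sum(np.abs(np.asarray(mu, float) - np.asarray(nu, float)))

mu, nu = [0.7, 0.2, 0.1], [0.5, 0.3, 0.2]
print(total_variation(mu, nu) <= np.sqrt(kl_divergence(mu, nu) / 2))  # Pinsker: True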

Example 5.6 (Bayesian experimental design). Suppose that a Bayesian point of view is adopted, and for simplicity that all the random variables of interest are finite-dimensional with Lebesgue densities ρ. Consider the problem of selecting an optimal experimental design λ for the inference of some parameters / unknowns θ from the observed data y that will result from the experiment λ. If, for each λ and θ, we know the conditional distribution y|λ, θ of the observed data, then the conditional distribution y|λ is obtained by integration with respect to the prior distribution of θ:

ρ(y|λ) = ∫ ρ(y|λ, θ) ρ(θ) dθ.

Let U(y, λ) be a real-valued measure of the utility of the posterior distribution

ρ(θ|y, λ) = ρ(y|θ, λ) ρ(θ) / ρ(y|λ).

For example, one could take the utility U(y, λ) to be the Kullback–Leibler divergence D_KL(ρ( · |y, λ) ‖ ρ( · |λ)) between the prior and posterior distributions on θ. An experimental design λ that maximizes

U(λ) := ∫ U(y, λ) ρ(y|λ) dy

is one that is optimal in the sense of maximizing the expected gain in Shannon information.
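Expected information gain rarely has a closed form and is usually estimated by Monte Carlo. The following sketch is a toy illustration under an assumed linear-Gaussian model (the model, the function names, and all numerical values are ours, not from the text); it uses the identity that the expected prior-to-posterior KL divergence equals E_{θ,y}[log ρ(y|θ, λ) − log ρ(y|λ)], with an inner Monte Carlo estimate of log ρ(y|λ).

import numpy as np

def log_mean_exp(a):
    # Numerically stable log of the mean of exp(a).
    m = np.max(a)
    return m + np.log(np.mean(np.exp(a - m)))

def expected_information_gain(lam, sigma=0.5, n_outer=2000, n_inner=2000, seed=0):
    # Toy model:  theta ~ N(0, 1),  y | theta, lambda ~ N(lambda * theta, sigma^2).
    rng = np.random.default_rng(seed)
    theta = rng.standard_normal(n_outer)                     # draws from the prior
    y = lam * theta + sigma * rng.standard_normal(n_outer)   # y ~ rho(y | theta, lambda)

    def log_lik(yi, th):  # log rho(y | theta, lambda), Gaussian
        return -0.5 * ((yi - lam * th) / sigma) ** 2 - np.log(sigma * np.sqrt(2 * np.pi))

    theta_inner = rng.standard_normal(n_inner)                # fresh prior draws
    log_evidence = np.array([log_mean_exp(log_lik(yi, theta_inner)) for yi in y])
    return np.mean(log_lik(y, theta) - log_evidence)

for lam in (0.1, 1.0, 3.0):
    print(lam, expected_information_gain(lam))  # larger |lambda|: a more informative design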


Bibliography

Comprehensive treatments of interval analysis include the classic monograph of Moore [69] and the more recent text of Jaulin & al. [42].

Information theory was pioneered by Shannon in his seminal 1948 paper [88]. The Kullback–Leibler divergence was introduced by Kullback & Leibler [55], who in fact considered the symmetrized version of the divergence that now bears their names. The book of MacKay [63] provides a thorough introduction to information theory.

Exercises

Exercise 5.1. Prove Gibbs' inequality that the Kullback–Leibler divergence is non-negative, i.e.

D_KL(µ‖ν) := ∫_X log(dµ/dν) dµ ≥ 0

whenever µ, ν are σ-finite measures on (X, F) with µ ≪ ν. Show also that D_KL(µ‖ν) = 0 if and only if µ = ν.

Exercise 5.2. Prove Proposition 5.3. That is, suppose that µ and ν are probability measures on discrete sets or R^n, and show that the product measure µ ⊗ ν satisfies

H(µ ⊗ ν) = H(µ) + H(ν).

That is, the entropy of a random vector with independent components is the sum of the entropies of the component random variables.

Exercise 5.3. Let µ0 = N(m0, C0) and µ1 = N(m1, C1) be non-degenerate Gaussian measures on R^n. Show that D_KL(µ0‖µ1) is

½ ( tr(C1^{−1} C0) + (m1 − m0)^⊤ C1^{−1} (m1 − m0) − n − log(det C0 / det C1) ).

Exercise 5.4. Suppose that µ and ν are equivalent probability measures on (X, F) and define

d(µ, ν) := ess sup_{x∈X} | log (dµ/dν)(x) |.

Show that this defines a well-defined metric on the measure equivalence class E containing µ and ν. In particular, show that neither the choice of function used as the Radon–Nikodym derivative dµ/dν, nor the choice of measure in E with respect to which the essential supremum is taken, affects the value of d(µ, ν).

Exercise 5.5. Show that Pinsker's inequality (Theorem 5.5) cannot be reversed. That is, show that, for any ε > 0, there exist probability measures µ and ν with dTV(µ, ν) ≤ ε but D_KL(µ‖ν) = +∞.


Chapter 6

Bayesian Inverse Problems

It ain’t what you don’t know that getsyou into trouble. It’s what you know forsure that just ain’t so.

Mark Twain

This chapter provides a general introduction, at the high level, to the back-ward propagation of uncertainty/information in the solution of inverse problems,and specifically a Bayesian probabilistic perspective on such inverse problems.Under the umbrella of inverse problems, we consider parameter estimation andregression. One specific aim is to make clear the connection between regulariza-tion and the application of a Bayesian prior. The filtering methods of Chapter 7fall under the general umbrella of Bayesian approaches to inverse problems, buthave an additional emphasis on real-time computational expediency.

6.1 Inverse Problems and Regularization

In many applications it is of interest to solve inverse problems, namely to find u, an input to a mathematical model, given y, an observation of (some components of, or functions of) the solution of the model. We have an equation of the form

y = H(u)

where X, Y are, say, Banach spaces, u ∈ X, y ∈ Y, and H : X → Y is the observation operator. However, inverse problems are typically ill-posed: there may be no solution, the solution may not be unique, or there may be a unique solution that depends sensitively on y. Indeed, very often we do not actually observe H(u), but some noisily corrupted version of it, such as

y = H(u) + η. (6.1)

The inverse problem framework encompasses the problem of model calibration (or parameter estimation), where a model Hθ relating inputs to outputs depends upon some parameters θ ∈ Θ, e.g., when X = Y = Θ, Hθ(u) = θu.


The problem is, given some observations of inputs ui and corresponding outputs yi, to find the parameter value θ such that

yi = Hθ(ui) for each i.

Again, this problem is typically ill-posed.

One approach to the problem of ill-posedness is to seek a least-squares solution: find, for the norm ‖ · ‖_Y on Y,

argmin_{u∈X} ‖y − H(u)‖²_Y.

However, this problem, too, can be difficult to solve as it may possess minimizing sequences that do not have a limit in X,〈6.1〉 or may possess multiple minima, or may depend sensitively on the observed data y. Especially in this last case, it may be advantageous to not try to fit the observed data too closely, and instead regularize the problem by seeking

argmin{ ‖y − H(u)‖²_Y + ‖u − ū‖²_E | u ∈ E ⊆ X }

for some Banach space E embedded in X and a chosen ū ∈ E. The standard example of this regularization setup is Tikhonov regularization, as in Corollary 4.24: when X and Y are Hilbert spaces, for suitable self-adjoint positive-definite operators Q on Y and R on X, we seek

argmin{ ‖Q^{−1/2}(y − H(u))‖²_Y + ‖R^{−1/2}(u − ū)‖²_X | u ∈ X }.

However, this approach appears to be somewhat ad hoc, especially where the choice of regularization is concerned.

Taking a probabilistic (specifically, Bayesian) viewpoint alleviates these difficulties. If we think of u and y as random variables, then (6.1) defines the conditioned random variable y|u, and we define the 'solution' of the inverse problem to be the conditioned random variable u|y. This allows us to model the noise, η, via its statistical properties, even if we do not know the exact instance of η that corrupted the given data, and it also allows us to specify a priori the form of solutions that we believe to be more likely, thereby enabling us to attach weights to multiple solutions which explain the data. This is the essence of the Bayesian approach to inverse problems.

Remark 6.1. In practice the true observation operator is often approximated by some numerical model H( · ; h), where h denotes a mesh parameter, or a parameter controlling missing physics. In this case (6.1) becomes

y = H(u; h) + ε + η,

where ε := H(u) − H(u; h). In principle, the observational noise η and the computational error ε could be combined into a single term, but keeping them separate is usually more appropriate: unlike η, ε is typically not of mean zero, and is dependent upon u.

〈6.1〉 Take a moment to reconcile the statement "there may exist minimizing sequences that do not have a limit in X" with X being a Banach space.


To illustrate the central role that least squares minimization plays in elementary statistical estimation, and hence motivate the more general considerations of the rest of the chapter, consider the following finite-dimensional linear problem. Suppose that we are interested in learning some vector of parameters x ∈ K^n, which gives rise to a vector y ∈ K^m of observations via

y = Ax + η,

where A ∈ K^{m×n} is a known linear operator (matrix) and η is a Gaussian noise vector known to have mean zero and self-adjoint, positive-definite covariance matrix E[ηη∗] = Q ∈ K^{m×m}, with η independent of x. A common approach is to seek an estimate x̂ of x that is a linear function Ky of the data y, is unbiased in the sense that E[x̂] = x, and is the best estimate in that it minimizes an appropriate cost function. The following theorem, the Gauss–Markov theorem, states that there is precisely one such estimator, and that it is the best in two very natural senses:

Theorem 6.2 (Gauss–Markov). Suppose that A∗Q^{−1}A is invertible. Then, among all unbiased linear estimators K ∈ K^{n×m}, producing an estimate x̂ = Ky of x, the one that minimizes both the mean-squared error E[‖x̂ − x‖²] and the covariance matrix〈6.2〉 E[(x̂ − x)(x̂ − x)∗] is

K = (A∗Q^{−1}A)^{−1}A∗Q^{−1},

and the resulting estimate x̂ has E[x̂] = x and covariance matrix

E[(x̂ − x)(x̂ − x)∗] = (A∗Q^{−1}A)^{−1}.

Remark 6.3. Indeed, by Corollary 4.23, x̂ = (A∗Q^{−1}A)^{−1}A∗Q^{−1}y is also the solution to the weighted least squares problem with weight Q^{−1}, i.e.

x̂ = argmin_{x∈K^n} J(x),   J(x) := ½‖Ax − y‖²_{Q^{−1}}.

Note that the first and second derivatives (gradient and Hessian) of J are

∇J(x) = A∗Q^{−1}Ax − A∗Q^{−1}y,   D²J(x) = A∗Q^{−1}A,

so the covariance matrix of x̂ is the inverse of the Hessian of J. These observations will be useful in the construction of the Kalman filter in Chapter 7.
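In finite dimensions, the Gauss–Markov estimator and its covariance are each one line of linear algebra. The following numpy sketch (function name and toy data are ours; no attempt is made at numerical care, such as avoiding the explicit inverse of Q) implements Theorem 6.2 and Remark 6.3 directly.

import numpy as np

def gauss_markov(A, Q, y):
    # Best linear unbiased estimate x_hat = (A* Q^{-1} A)^{-1} A* Q^{-1} y
    # together with its covariance (A* Q^{-1} A)^{-1}.
    Qinv = np.linalg.inv(Q)
    precision = A.conj().T @ Qinv @ A   # Hessian of J; its inverse is the covariance of x_hat
    cov = np.linalg.inv(precision)
    x_hat = cov @ (A.conj().T @ Qinv @ y)
    return x_hat, cov

# Toy problem: x in R^2 observed through a 5x2 matrix with correlated noise.
rng = np.random.default_rng(0)
A = rng.standard_normal((5, 2))
Q = 0.1 * np.eye(5) + 0.05 * np.ones((5, 5))   # noise covariance
x_true = np.array([1.0, -2.0])
y = A @ x_true + rng.multivariate_normal(np.zeros(5), Q)
x_hat, cov = gauss_markov(A, Q, y)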

Proof of Theorem 6.2. Note that the first part of this proof is surplus to requirements: we could simply check that K := (A∗Q^{−1}A)^{−1}A∗Q^{−1} is indeed the minimal linear unbiased estimator, but it is nice to derive the formula for K from first principles and get some practice at constrained convex optimization into the bargain.

Since K is required to be unbiased, it follows that KA = I. Therefore,

‖x̂ − x‖² = ‖Ky − x‖² = ‖K(Ax + η) − x‖² = ‖Kη‖².

〈6.2〉 Here, the minimization is meant in the sense of positive semi-definite matrices: for two matrices A and B, we say that A ≤ B if B − A is a positive semi-definite matrix.


Since ‖Kη‖² = η∗K∗Kη is a scalar and tr(XY) = tr(YX) for any two rectangular matrices X and Y of the appropriate sizes,

E[‖x̂ − x‖²] = E[η∗K∗Kη] = tr(E[Kηη∗K∗]) = tr(KQK∗).

Thus, K is the solution to the constrained optimization problem

K = argmin{tr(KQK∗) | KA = I}.

Note that this is a convex optimization problem, since, by Exercise 6.2, K ↦ √(tr(KQK∗)) is a norm. Introduce a matrix Λ ∈ K^{n×n} of Lagrange multipliers, so that the minimizer of the constrained problem is the unique critical point of the Lagrangian

L(K, Λ) := tr(KQK∗) − Λ : (KA − I) = tr(KQK∗ − Λ∗(KA − I)).

The critical point of the Lagrangian satisfies

0 = ∇_K L(K, Λ) = KQ∗ + KQ − ΛA∗ = 2KQ − ΛA∗,

since Q is self-adjoint. Multiplication on the right by Q^{−1}A, and using the constraint that KA = I, reveals that Λ = 2(A∗Q^{−1}A)^{−1}, and hence that K = (A∗Q^{−1}A)^{−1}A∗Q^{−1}.

It is easily verified that K is an unbiased estimator:

x̂ = (A∗Q^{−1}A)^{−1}A∗Q^{−1}(Ax + η) = x + (A∗Q^{−1}A)^{−1}A∗Q^{−1}η

and so, taking expectations of both sides, E[x̂] = x. Moreover, the covariance of this estimator satisfies

E[(x̂ − x)(x̂ − x)∗] = (A∗Q^{−1}A)^{−1}A∗Q^{−1} E[ηη∗] Q^{−1}A(A∗Q^{−1}A)^{−1} = (A∗Q^{−1}A)^{−1},

as claimed.

Now suppose that L = K + D is any linear unbiased estimator; note that DA = 0. Then the covariance of the estimate Ly satisfies

E[(Ly − x)(Ly − x)∗] = E[(K + D)ηη∗(K∗ + D∗)] = (K + D)Q(K∗ + D∗) = KQK∗ + DQD∗ + KQD∗ + (KQD∗)∗.

Since DA = 0,

KQD∗ = (A∗Q^{−1}A)^{−1}A∗Q^{−1}QD∗ = (A∗Q^{−1}A)^{−1}(DA)∗ = 0,

and so

E[(Ly − x)(Ly − x)∗] = KQK∗ +DQD∗ ≥ KQK∗

in the sense of positive semi-definite matrices, as claimed.


Remark 6.4. In the situation that A∗Q^{−1}A is not invertible, it is standard to use the estimator

K = (A∗Q^{−1}A)†A∗Q^{−1},

where B† denotes the Moore–Penrose pseudo-inverse of B, defined equivalently by

B† := lim_{δ→0} (B∗B + δI)^{−1}B∗,
B† := lim_{δ→0} B∗(BB∗ + δI)^{−1}, or
B† := V Σ⁺ U∗,

where B = UΣV∗ is the singular value decomposition of B, and Σ⁺ is the transpose of the matrix obtained from Σ by replacing all the strictly positive singular values by their reciprocals.

Bayesian Interpretation of Regularization. The Gauss–Markov estimator is not ideal: for example, because of its characterization as the minimizer of a quadratic cost function, it is sensitive to large outliers in the data, i.e. components of y that differ greatly from the corresponding component of Ax. In such a situation, it may be desirable to not try to fit the observed data y too closely, and instead regularize the problem by seeking x̂, the minimizer of

J(x) := ½‖Ax − y‖²_{Q^{−1}} + ½‖x − x̄‖²_{R^{−1}},

for some chosen x̄ ∈ K^n and positive-definite Tikhonov matrix R ∈ K^{n×n}. Depending upon the relative sizes of Q and R, x̂ will be influenced more by the data y and hence lie close to the Gauss–Markov estimator, or be influenced more by the regularization term and hence lie close to x̄. At first sight this procedure may seem somewhat ad hoc, but it has a natural Bayesian interpretation.

The observation equation

y = Ax + η

in fact defines the conditional distribution y|x by (y − Ax)|x = η ∼ N(0, Q). To find the minimizer of x ↦ ½‖Ax − y‖²_{Q^{−1}}, i.e. x̂ = Ky, amounts to finding the maximum likelihood estimator of x given y. The Bayesian interpretation of the regularization term is that N(x̄, R) is a prior distribution for x. The resulting posterior distribution for x|y has Lebesgue density ρ(x|y) with

ρ(x|y) ∝ exp(−½‖Ax − y‖²_{Q^{−1}}) exp(−½‖x − x̄‖²_{R^{−1}})
       = exp(−½‖Ax − y‖²_{Q^{−1}} − ½‖x − x̄‖²_{R^{−1}})
       = exp(−½‖x − Ky‖²_{A∗Q^{−1}A} − ½‖x − x̄‖²_{R^{−1}})
       = exp(−½‖x − (A∗Q^{−1}A + R^{−1})^{−1}(A∗Q^{−1}A Ky + R^{−1}x̄)‖²_{A∗Q^{−1}A + R^{−1}})

by the standard result that the product of two Gaussian densities with means m1 and m2 and covariances C1 and C2 is proportional to a Gaussian density with covariance C3 = (C1^{−1} + C2^{−1})^{−1} and mean C3(C1^{−1}m1 + C2^{−1}m2). The solution to the regularized least squares problem, i.e. minimizing the exponent in the above posterior distribution, is the maximum a posteriori estimator of x given y. However, the full posterior contains more information than the MAP estimator alone. In particular, the posterior covariance matrix (A∗Q^{−1}A + R^{−1})^{−1} reveals those components of x about which we are relatively more or less certain.
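The posterior mean and covariance identified above are easily computed in finite dimensions. The following sketch (the function name is ours) returns both, so that the posterior covariance can be inspected alongside the MAP estimator; it is a minimal illustration, not a numerically careful implementation.

import numpy as np

def gaussian_posterior(A, Q, R, x_bar, y):
    # Posterior for x ~ N(x_bar, R), y | x ~ N(Ax, Q):
    #   cov  = (A* Q^{-1} A + R^{-1})^{-1},
    #   mean = cov (A* Q^{-1} y + R^{-1} x_bar),  which is also the MAP estimator.
    Qinv, Rinv = np.linalg.inv(Q), np.linalg.inv(R)
    cov = np.linalg.inv(A.conj().T @ Qinv @ A + Rinv)
    mean = cov @ (A.conj().T @ Qinv @ y + Rinv @ x_bar)
    return mean, cov

# The diagonal of cov indicates which components of x the data y inform well.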

6.2 Bayesian Inversion in Banach Spaces

This section concerns Bayesian inversion in Banach spaces, and, in particular, establishing the appropriate rigorous statement of Bayes' rule in settings where there is no Lebesgue measure with respect to which we can take densities.

Example 6.5. There are many applications in which it is of interest to determine the permeability of subsurface rock, e.g. the prediction of transport of radioactive waste from an underground waste repository, or the optimization of oil recovery from underground fields. The flow velocity v of a fluid under pressure p in a medium of permeability K is given by Darcy's law

v = −K∇p.

The pressure field p within a bounded, open domain Ω ⊂ R^d is governed by the elliptic PDE

−∇ · (K∇p) = 0 in Ω,
p = h on ∂Ω.

For simplicity, take the permeability tensor field K to be a scalar field k times the identity tensor; for mathematical and physical reasons, it is important that k be positive, so write k = e^u. The objective is to recover u from, say, observations of the pressure field at known points x1, . . . , xm ∈ Ω:

yi = p(xi) + ηi.

Note that this fits the general 'y = H(u) + η' setup, with H being defined implicitly by the solution operator to the elliptic PDE.

In general, let u be a random variable with (prior) distribution µ0 (which we do not at this stage assume to be Gaussian) on a separable Banach space X. Suppose that we observe data y ∈ R^m according to (6.1), where η is an R^m-valued random variable independent of u with probability density ρ with respect to Lebesgue measure. Let Φ(u; y) be any function that differs from −log ρ(y − H(u)) by an additive function of y alone, so that

ρ(y − H(u)) / ρ(y) ∝ exp(−Φ(u; y))

with a constant of proportionality independent of u. An informal application of Bayes' rule suggests that the posterior probability distribution of u given y, µ^y, has Radon–Nikodym derivative with respect to the prior, µ0, given by

dµ^y/dµ0 (u) ∝ exp(−Φ(u; y)).

The next theorem makes this argument rigorous:


Theorem 6.6 (Generalized Bayes' rule). Suppose that H : X → R^m is continuous, and that η is absolutely continuous with support R^m. If u ∼ µ0, then u|y ∼ µ^y, where µ^y ≪ µ0 and

dµ^y/dµ0 (u) ∝ exp(−Φ(u; y)).

Lemma 6.7. Let µ, ν be probability measures on S × T, where (S, A) and (T, B) are measurable spaces. Assume that µ ≪ ν and that dµ/dν = ϕ. Assume further that the conditional distribution of x|y under ν, denoted by ν^y(dx), exists. Then the conditional distribution of x|y under µ, denoted µ^y(dx), exists and µ^y ≪ ν^y, with Radon–Nikodym derivative given by

dµ^y/dν^y (x) = ϕ(x, y)/Z(y) if Z(y) > 0, and 1 otherwise,

where Z(y) := ∫_S ϕ(x, y) dν^y(x).

Proof. See Section 10.2 of [25].

Proof of Theorem 6.6. Let Q0(dy) := ρ(y) dy on R^m and Q(dy|u) := ρ(y − H(u)) dy, so that, by construction,

dQ/dQ0 (y|u) = C(y) exp(−Φ(u; y)).

Define measures ν0 and ν on R^m × X by

ν0(dy, du) := Q0(dy) ⊗ µ0(du),
ν(dy, du) := Q(dy|u) µ0(du).

Note that ν0 is a product measure under which u and y are independent, whereas ν is not. Since H is continuous, so is Φ; since µ0(X) = 1, it follows that Φ is µ0-measurable. Therefore, ν is well-defined, ν ≪ ν0, and

dν/dν0 (y, u) = C(y) exp(−Φ(u; y)).

Note that

∫_X exp(−Φ(u; y)) dµ0(u) = C(y) ∫_X ρ(y − H(u)) dµ0(u) > 0,

since ρ is strictly positive on R^m and H is continuous. Since ν0(du|y) = µ0(du), the result follows from Lemma 6.7.

Proposition 6.8. If the prior µ0 is a Gaussian measure and the potential Φ is quadratic in u, then, for all y, the posterior µ^y is Gaussian.

Proof. Exercise 6.1.

6.3 Well-Posedness and Approximation

This section concerns the well-posedness of the Bayesian inference problem for Gaussian priors on Banach spaces. To save space later on, the following will be taken as our standard assumptions on the negative log-likelihood / potential Φ:


Assumptions on Φ. Assume that Φ: X × Y → R satisfies:

1. For every ε > 0 and r > 0, there exists M = M(ε, r) ∈ R such that, for all u ∈ X and all y ∈ Y with ‖y‖_Y < r,

Φ(u; y) ≥ M − ε‖u‖²_X.

2. For every r > 0, there exists K = K(r) > 0 such that, for all u ∈ X and all y ∈ Y with ‖u‖_X, ‖y‖_Y < r,

Φ(u; y) ≤ K.

3. For every r > 0, there exists L = L(r) > 0 such that, for all u1, u2 ∈ X and all y ∈ Y with ‖u1‖_X, ‖u2‖_X, ‖y‖_Y < r,

|Φ(u1; y) − Φ(u2; y)| ≤ L‖u1 − u2‖_X.

4. For every ε > 0 and r > 0, there exists C = C(ε, r) > 0 such that, for all u ∈ X and all y1, y2 ∈ Y with ‖y1‖_Y, ‖y2‖_Y < r,

|Φ(u; y1) − Φ(u; y2)| ≤ exp(ε‖u‖²_X + C)‖y1 − y2‖_Y.

Theorem 6.9. Let Φ satisfy standard assumptions (1), (2), and (3) and assume that µ0 is a Gaussian probability measure on X. Then, for each y ∈ Y, µ^y given by

dµ^y/dµ0 (u) = exp(−Φ(u; y)) / Z(y),   Z(y) = ∫_X exp(−Φ(u; y)) dµ0(u),

is a well-defined probability measure on X.

Proof. Assumption (2) implies that Z(y) is bounded below:

Z(y) ≥ ∫_{{u : ‖u‖_X ≤ r}} exp(−K(r)) dµ0(u) = exp(−K(r)) µ0[‖u‖_X ≤ r] > 0

for r > 0, since µ0 is a strictly positive measure on X. Assumption (3) implies that Φ is µ0-measurable, and hence that µ^y is a well-defined measure. By assumption (1), for ‖y‖_Y ≤ r and ε sufficiently small,

Z(y) = ∫_X exp(−Φ(u; y)) dµ0(u) ≤ ∫_X exp(ε‖u‖²_X − M(ε, r)) dµ0(u) ≤ C exp(−M(ε, r)) < ∞,

since µ0 is Gaussian and we may choose ε small enough that the Fernique theorem (Theorem 2.34) applies. Thus, µ^y can indeed be normalized to be a probability measure on X.


Definition 6.10. The Hellinger distance between two measures µ and ν is defined in terms of any reference measure ρ with respect to which both µ and ν are absolutely continuous by:

dHell(µ, ν) := ( ½ ∫_Θ | √(dµ/dρ)(θ) − √(dν/dρ)(θ) |² dρ(θ) )^{1/2}.

Exercises 6.4, 6.5 and 6.6 establish the major properties of the Hellinger metric. A particularly useful property is that closeness in the Hellinger metric implies closeness of expected values of polynomially bounded functions: if f : X → E, for some Banach space E, then

‖E_µ[f] − E_ν[f]‖_E ≤ 2 ( E_µ[‖f‖²_E] + E_ν[‖f‖²_E] )^{1/2} dHell(µ, ν).

The following theorem shows that Bayesian inference with respect to a Gaussian prior measure is well-posed with respect to perturbations of the observed data y:

Theorem 6.11. Let Φ satisfy the standard assumptions (1), (2) and (4), suppose that µ0 is a Gaussian probability measure on X, and that µ^y ≪ µ0 with density given by the generalized Bayes' rule for each y ∈ Y. Then there exists a constant C ≥ 0 such that, for all y, y′ ∈ Y,

dHell(µ^y, µ^{y′}) ≤ C‖y − y′‖_Y.

Proof. As in the proof of Theorem 6.9, standard assumption (2) gives a lower bound on Z(y). By standard assumptions (1) and (4) and the Fernique theorem (Theorem 2.34),

|Z(y) − Z(y′)| ≤ ‖y − y′‖_Y ∫_X exp(ε‖u‖²_X − M) exp(ε‖u‖²_X + C) dµ0(u) ≤ C‖y − y′‖_Y.

By the definition of the Hellinger distance,

2 dHell(µ^y, µ^{y′})² = ∫_X ( Z(y)^{−1/2} e^{−Φ(u;y)/2} − Z(y′)^{−1/2} e^{−Φ(u;y′)/2} )² dµ0(u) ≤ I1 + I2

where

I1 := (2/Z(y)) ∫_X ( e^{−Φ(u;y)/2} − e^{−Φ(u;y′)/2} )² dµ0(u),

I2 := 2 | Z(y)^{−1/2} − Z(y′)^{−1/2} |² ∫_X e^{−Φ(u;y′)} dµ0(u).

By standard assumptions (1) and (4) and the Fernique theorem,

(Z(y)/2) I1 ≤ ∫_X ¼ exp(ε‖u‖²_X − M) exp(2ε‖u‖²_X + 2C) ‖y − y′‖²_Y dµ0(u) ≤ C‖y − y′‖²_Y.


A similar application of standard assumption (1) and the Fernique theorem shows that the integral in I2 is finite. Also, the lower bound on Z( · ) implies that

| Z(y)^{−1/2} − Z(y′)^{−1/2} |² ≤ C max{ Z(y)^{−3}, Z(y′)^{−3} } |Z(y) − Z(y′)|² ≤ C‖y − y′‖²_Y.

Combining these facts yields the desired Lipschitz continuity in the Hellinger metric.

Similarly, the next theorem shows that Bayesian inference with respect to a Gaussian prior measure is well-posed with respect to approximation of measures and log-likelihoods. The approximation of Φ by some Φ^N typically arises through the approximation of H by some discretized numerical model H^N.

Theorem 6.12. Suppose that the probability measures µ and µ^N are the posteriors arising from potentials Φ and Φ^N and are all absolutely continuous with respect to µ0, and that Φ, Φ^N satisfy the standard assumptions (1) and (2) with constants uniform in N. Assume also that, for all ε > 0, there exists K = K(ε) > 0 such that

|Φ(u) − Φ^N(u)| ≤ K exp(ε‖u‖²_X) ψ(N), (6.2)

where lim_{N→∞} ψ(N) = 0. Then there is a constant C, independent of N, such that

dHell(µ, µ^N) ≤ Cψ(N).

Proof. Since y does not appear in this problem, y-dependence will be suppressed for the duration of this proof. Let Z and Z^N denote the appropriate normalization constants, as in the proof of Theorem 6.11. By standard assumption (1), (6.2), and the Fernique theorem,

|Z − Z^N| ≤ ∫_X Kψ(N) exp(ε‖u‖²_X − M) exp(ε‖u‖²_X) dµ0(u) ≤ Cψ(N).

By the definition of the Hellinger distance,

2 dHell(µ, µ^N)² = ∫_X ( Z^{−1/2} e^{−Φ(u)/2} − (Z^N)^{−1/2} e^{−Φ^N(u)/2} )² dµ0(u) ≤ I1 + I2

where

I1 := (2/Z) ∫_X ( e^{−Φ(u)/2} − e^{−Φ^N(u)/2} )² dµ0(u),

I2 := 2 | Z^{−1/2} − (Z^N)^{−1/2} |² ∫_X e^{−Φ^N(u)} dµ0(u).

By standard assumption (1), (6.2), and the Fernique theorem,

(Z/2) I1 ≤ ∫_X K² exp(3ε‖u‖²_X − M) ψ(N)² dµ0(u) ≤ Cψ(N)².


Similarly,

| Z^{−1/2} − (Z^N)^{−1/2} |² ≤ C max{ Z^{−3}, (Z^N)^{−3} } |Z − Z^N|² ≤ Cψ(N)².

Combining these facts yields the desired bound on dHell(µ, µ^N).

Remark 6.13. Note well that, regardless of the value of the observed data y, the Bayesian posterior µ^y is absolutely continuous with respect to the prior µ0 and, in particular, cannot associate positive posterior probability to any event of prior probability zero. However, the Feldman–Hájek theorem (Theorem 2.38) says that it is very difficult for probability measures on infinite-dimensional spaces to be absolutely continuous with respect to one another. Therefore, the choice of infinite-dimensional prior µ0 is a very strong modelling assumption that, if it is 'wrong', cannot be 'corrected' even by large amounts of data y. In this sense, it is not reasonable to expect that Bayesian inference on function spaces should be well-posed with respect to apparently small perturbations of the prior µ0, e.g. by a shift of mean that lies outside the Cameron–Martin space, or a change of covariance arising from a non-unit dilation of the space. Nevertheless, the infinite-dimensional perspective is not without genuine fruits: in particular, the well-posedness results (Theorems 6.11 and 6.12) are very important for the stability of finite-dimensional (discretized) Bayesian problems with respect to discretization dimension N.

Bibliography

This material is covered in much greater detail in the module MA612 Probability on Function Spaces and Bayesian Inverse Problems.

Tikhonov regularization was introduced in [105, 106]. An introduction to the general theory of regularization and its application to inverse problems is the book of Engl & al. [26]. The book of Tarantola [103] also provides a good applied introduction to inverse problems. Kaipio & Somersalo [45] provide a good introduction to the Bayesian approach to inverse problems, especially in the context of differential equations.

This chapter owes a great deal to the papers of Stuart [98] and Cotter & al. [20, 21], which set out the common structure of Bayesian inverse problems on Banach and Hilbert spaces, focussing on Gaussian priors. Stuart's article stresses the importance of delaying discretization to the last possible moment, much as in PDE theory, lest one carelessly end up with a family of finite-dimensional problems that are individually well-posed but collectively ill-conditioned as the discretization dimension tends to infinity. Extensions to Besov priors, which are constructed using wavelet bases of L² and allow for non-smooth local features in the random fields, can be found in the articles of Dashti & al. [22] and Lassas & al. [57].

Exercises

Exercise 6.1. Let µ0 be a Gaussian probability measure on a separable Banach space X and suppose that the potential Φ(u; y) is quadratic in u. Show that the posterior µ^y is also a Gaussian measure on X.


Exercise 6.2. Let Q ∈ K^{n×n} be a self-adjoint and positive-definite matrix. Show that

〈A, B〉 := tr(A∗QB) for A, B ∈ K^{n×m}

defines an inner product on the space K^{n×m} of n × m matrices over K, and hence that A ↦ √(tr(A∗QA)) is a norm on K^{n×m}.

Exercise 6.3. Let µ and ν be probability measures on (Θ, F), both absolutely continuous with respect to a reference measure ρ. Define the total variation distance between µ and ν by

dTV(µ, ν) := ½ E_ρ[ |dµ/dρ − dν/dρ| ] = ½ ∫_Θ | (dµ/dρ)(θ) − (dν/dρ)(θ) | dρ(θ).

Show that dTV is a metric on M1(Θ, F), that its values do not depend upon the choice of ρ, that M1(Θ, F) has diameter at most 1, and that, if ν ≪ µ, then

dTV(µ, ν) = ½ E_µ[ |1 − dν/dµ| ] ≡ ½ ∫_Θ | 1 − (dν/dµ)(θ) | dµ(θ).

Exercise 6.4. Let µ and ν be probability measures on (Θ,F ), both absolutelycontinuous with respect to a reference measure ρ. Define the Hellinger distancebetween µ and ν by

dHell(µ, ν) :=

√√√√√1

2Eρ

∣∣∣∣∣

√dµ

dρ−√

∣∣∣∣∣

2

=

√√√√1

2

Θ

∣∣∣∣∣

√dµ

dρ(θ) −

√dν

dρ(θ)

∣∣∣∣∣

2

dρ(θ).

Show that dHell is a metric on M1(Θ,F ), that its values do not depend uponthe choice of ρ, that M1(Θ,F ) has diameter at most 1, and that, if ν ≪ µ,then

dHell(µ, ν) :=

√√√√√1

2Eµ

∣∣∣∣∣1−

√dν

∣∣∣∣∣

2 ≡

√√√√1

2

Θ

∣∣∣∣∣1−√

dµ(θ)

∣∣∣∣∣

2

dµ(θ).

Exercise 6.5. Show that the total variation and Hellinger distances satisfy, for all µ and ν,

(1/√2) dTV(µ, ν) ≤ dHell(µ, ν) ≤ dTV(µ, ν)^{1/2}.

Exercise 6.6. Suppose that µ and ν are probability measures on a Banach space X. Show that, if E is a Banach space and f : X → E has finite second moment with respect to both µ and ν, then

‖E_µ[f] − E_ν[f]‖_E ≤ 2 ( E_µ[‖f‖²_E] + E_ν[‖f‖²_E] )^{1/2} dHell(µ, ν).


Show also that, if E is Hilbert and f has finite fourth moment with respect to both µ and ν, then

‖E_µ[f ⊗ f] − E_ν[f ⊗ f]‖ ≤ 2 ( E_µ[‖f‖⁴_E] + E_ν[‖f‖⁴_E] )^{1/2} dHell(µ, ν).

Hence show that the differences between the means and covariance operators of two measures on X are bounded above by the Hellinger distance between the two measures.

Exercise 6.7. Let Γ ∈ R^{q×q} be symmetric and positive definite. Suppose that H : X → R^q satisfies

1. For every ε > 0, there exists M ∈ R such that, for all u ∈ X,

‖H(u)‖_{Γ^{−1}} ≤ exp(ε‖u‖²_X + M).

2. For every r > 0, there exists K > 0 such that, for all u1, u2 ∈ X with ‖u1‖_X, ‖u2‖_X < r,

‖H(u1) − H(u2)‖_{Γ^{−1}} ≤ K‖u1 − u2‖_X.

Show that Φ : X × R^q → R defined by

Φ(u; y) := ½〈y − H(u), Γ^{−1}(y − H(u))〉

satisfies the standard assumptions.

Exercise 6.8. An exercise in forensic inference:

NEWSFLASH: THE PRESIDENT HAS BEEN SHOT!

While being driven through the streets of the capital in his open-topped limousine, President Marx of Freedonia has been shot by a sniper. To make matters worse, the bullet appears to have come from a twenty-storey building, on each floor of which was stationed a single bodyguard of President Marx's security detail who was meant to protect him. None of the suspects have any obvious marks of guilt such as one bullet missing from their magazine, gunpowder burns, failed lie detector tests. . . not even an evil moustache. The soundproofing inside the building was good enough that none of the security detail can even say whether the shot came from above or below them. You have been called in as an expert quantifier of uncertainty to try to infer from which floor the assassin took the shot, and hence identify the guilty man. You have the following information:

1. At the time of the shot, the President's limousine was 500 m from the building, travelling at about 20 mph.

2. The bullet entered the President's body at the base of his neck and exited through the centre of the breastbone. The President is an average-sized man.

Apply the techniques you have learned in this chapter to make a recommendation as to which floor the shot came from, and hence which bodyguard is the traitor. Do your own research on rifle muzzle velocities &c., explaining the assumptions that you make along the way.

Exercise 6.9. Construct a short sequence of Bayesian inferences in which the posterior appears to predict one thing and then another. Discuss the implications for not cutting short analyses and studies. [???] FINISH ME!!!


Chapter 7

Filtering and Data Assimilation

It is not bigotry to be certain we are right; but it is bigotry to be unable to imagine how we might possibly have gone wrong.

The Catholic Church and Conversion

G. K. Chesterton

Data assimilation is the integration of two information sources:

• a mathematical model of a time-dependent physical system, or a numerical implementation of such a model; and

• a sequence of observations of that system, usually corrupted by some noise.

The objective is to combine these two ingredients to produce a more accurate estimate of the system's true state, and hence more accurate predictions of the system's future state. Very often, data assimilation is synonymous with filtering, which incorporates many of the same ideas but arose in the context of signal processing. An additional component of the data assimilation / filtering problem is that one typically wants to achieve it in real time: if today is Monday, then a data assimilation scheme that takes until Friday to produce an accurate prediction of Tuesday's weather using Monday's observations is basically useless.

Data assimilation methods are typically Bayesian, in the sense that the current knowledge of the system state can be thought of as a prior, and the incorporation of the dynamics and observations as an update/conditioning step that produces a posterior. Bearing in mind considerations of computational cost and the imperative for real time data assimilation, there are two key ideas underlying filtering: the first is to build up knowledge about the posterior sequentially, and hence perhaps more efficiently; the second is to break up the unknown u and build up knowledge about its constituent parts sequentially, hence reducing the computational dimension of each sampling problem. Thus, the first idea means decomposing the data sequentially, while the second means decomposing the unknown state sequentially.


7.1 State Estimation in Discrete Time

In the Kalman filter, the probability distributions representing the system state and various noise terms are described purely in terms of their mean and covariance, so they are effectively being approximated as Gaussians.

For simplicity, the first description of the Kalman filter will be of a controlled linear dynamical system that evolves in discrete time steps

t0 < t1 < · · · < tk < · · · .

The state of the system at time tk is a vector xk ∈ K^n, and it evolves from the state xk−1 ∈ K^n at time tk−1 according to the linear model

xk = Fk xk−1 + Gk uk + wk (7.1)

where, for each time tk,

• Fk ∈ K^{n×n} is the state transition model, which is applied to the previous state xk−1 ∈ K^n;

• Gk ∈ K^{n×p} is the control-to-input model, which is applied to the control vector uk ∈ K^p;

• wk ∼ N(0, Qk) is the process noise, a K^n-valued centred Gaussian random variable with self-adjoint positive-definite covariance matrix Qk ∈ K^{n×n}.

At time tk an observation yk ∈ K^q of the true state xk is made according to

yk = Hk xk + vk, (7.2)

where

• Hk ∈ K^{q×n} is the observation operator, which maps the true state space K^n into the observable space K^q;

• vk ∼ N(0, Rk) is the observation noise, a K^q-valued centred Gaussian random variable with self-adjoint positive-definite covariance Rk ∈ K^{q×q}.

As an initial condition, the state of the system at time t0 is taken to be x0 = µ0 + w0 where µ0 ∈ K^n is known and w0 ∼ N(0, Q0). All the noise vectors are assumed to be mutually independent.

As a preliminary to constructing the full Kalman filter, we consider the problem of estimating states x1, . . . , xk given the corresponding controls u1, . . . , uk and m known observations y1, . . . , ym, where k ≥ m. In particular, we seek the best linear unbiased estimate of x1, . . . , xk.

First note that (7.1)–(7.2) is equivalent to the single equation

bk|m = Ak|mzk + ηk|m, (7.3)

where

bk|m := (µ0, G1u1, y1, . . . , Gmum, ym, Gm+1um+1, . . . , Gkuk)^⊤ ∈ K^{n(k+1)+qm},

zk := (x0, . . . , xk)^⊤,

ηk|m := (−w0, −w1, +v1, . . . , −wm, +vm, −wm+1, . . . , −wk)^⊤,


and Ak|m ∈ K(n(k+1)+qm)×n(k+1) is

the block matrix with k + 1 block columns (each of width n) whose block rows are, in order: the row [I, 0, . . . , 0], corresponding to the initial condition x0 = µ0 + w0; for each i = 1, . . . , k, a dynamics row with −Fi in block column i − 1, I in block column i, and zeros elsewhere, coming from (7.1); and, for each i = 1, . . . , m, an observation row with Hi in block column i and zeros elsewhere, coming from (7.2) and placed immediately after the corresponding dynamics row. For example, with k = 2 and m = 1,

A2|1 = [ I, 0, 0 ; −F1, I, 0 ; 0, H1, 0 ; 0, −F2, I ].

Note that the noise vector ηk|m is K^{n(k+1)+qm}-valued and has mean zero and block-diagonal positive-definite precision operator (inverse covariance) Wk|m given in block form by

Wk|m := diag(Q0^{−1}, Q1^{−1}, R1^{−1}, . . . , Qm^{−1}, Rm^{−1}, Qm+1^{−1}, . . . , Qk^{−1}).

By the Gauss–Markov theorem (Theorem 6.2), the best linear unbiased estimate ẑk|m = [x̂0|m, . . . , x̂k|m]∗ of zk satisfies

ẑk|m ∈ argmin_{zk ∈ K^{n(k+1)}} Jk|m(zk),   Jk|m(zk) := ½‖Ak|m zk − bk|m‖²_{Wk|m}, (7.4)

and by Lemma 4.22 is the solution of the normal equations

A∗k|m Wk|m Ak|m ẑk|m = A∗k|m Wk|m bk|m.

By Exercise 7.1, it follows from the assumptions made above that these normal equations have a unique solution

ẑk|m = (A∗k|m Wk|m Ak|m)^{−1} A∗k|m Wk|m bk|m. (7.5)

By Theorem 6.2 and Remark 6.3, E[ẑk|m] = zk and the covariance matrix of the estimate ẑk|m is (A∗k|m Wk|m Ak|m)^{−1}; a Bayesian statistician would say that zk, conditioned upon the control and observation data bk|m, is the Gaussian random variable with distribution N(ẑk|m, (A∗k|m Wk|m Ak|m)^{−1}).

Note that, since Wk|m is block diagonal, Jk|m can be written as

Jk|m(zk) = ½‖x0 − µ0‖²_{Q0^{−1}} + ½ ∑_{i=1}^{m} ‖yi − Hi xi‖²_{Ri^{−1}} + ½ ∑_{i=1}^{k} ‖xi − Fi xi−1 − Gi ui‖²_{Qi^{−1}}.

An expansion of this type will prove very useful in the derivation of the linear Kalman filter in the next section.


7.2 Linear Kalman Filter

We now consider the state estimation problem in the common practical situation that k = m. Why is the state estimate (7.5) not the end of the story? For one thing, there is an issue of immediacy: one does not want to have to wait for observation y1000 to come in before estimating states x1, . . . , x999 as well as x1000, in particular because the choice of the control uk+1 typically depends upon the estimate of xk; what one wants is to estimate xk upon observing yk. However, there is also an issue of computational cost, and hence computation time: the solution of the least squares problem

find x̂ = argmin_{x∈K^n} ‖Ax − b‖²_{K^m}

where A ∈ K^{m×n}, at least by direct methods such as solving the normal equations or QR factorization, requires of the order of mn² floating-point operations, as shown in MA398 Matrix Analysis and Algorithms. Hence, calculation of the state estimate ẑk by direct solution of (7.5) takes of the order of

(n(k + 1) + qm) n²(k + 1)²

operations. It is clearly impractical to work with a state estimation scheme with a computational cost that increases cubically with the number of time steps to be considered. The idea of filtering is to break the state estimation problem down into a sequence of estimation problems that can be solved with constant computational cost per time step, as each observation comes in.

The two-step linear Kalman filter (LKF) is a recursive method for constructing the best linear unbiased estimate x̂k|k (with covariance matrix Pk|k) of xk in terms of the previous state estimate x̂k−1|k−1 and the data uk and yk. It is called the two-step filter because the process of updating the state estimate (x̂k−1|k−1, Pk−1|k−1) for time tk−1 into the estimate (x̂k|k, Pk|k) for tk is split into two steps (which could, of course, be algebraically unified into a single step):

• the prediction step uses the dynamics alone to update (x̂k−1|k−1, Pk−1|k−1) into (x̂k|k−1, Pk|k−1), an estimate for the state at time tk that does not use the observation yk;

• the correction step updates (x̂k|k−1, Pk|k−1) into (x̂k|k, Pk|k) using the observation yk.

Initialization. We begin by initializing the state estimate as

(x̂0|0, P0|0) := (µ0, Q0).

Prediction. Write

Fk :=[0 · · · 0 Fk

]∈ Kn×nk,

where the Fk block is the block corresponding to xk−1, so that Fkzk−1 =Fkxk−1. A key insight here is to write the cost function Jk|k−1 recursivelyas

Jk|k−1(zk) = Jk−1|k−1(zk−1) +1

2‖xk −Fkzk−1 −Gkuk‖2Q−1

k

,

Page 83: MA4K0 Introduction to Uncertainty Quantificationhomepages.warwick.ac.uk/~masbm/ClimateCourse/... · uncertainty into two types, aleatoric and epistemic uncertainty. Aleatoric un-certainty

—DRAFT—

7.2. LINEAR KALMAN FILTER 75

in which case the gradient and Hessian of Jk|k−1 are given in block form withrespect to zk = [zk−1, xk]

∗ as

∇Jk|k−1(zk) =

[∇Jk−1|k−1(zk−1) + F∗kQ

−1k (Fkzk−1 − xk +Gkuk)

−Q−1k (Fkzk−1 − xk +Gkuk)

]

and

D2Jk|k−1(zk) =

[D2Jk−1|k−1(zk−1) + F∗

kQ−1k Fk −F∗

kQ−1k

−Q−1k Fk Q−1

k

],

which, by Exercise 7.2, is positive definite. Hence, by a single iteration of New-ton’s method with any initial condition zk, the minimizer zk|k−1 of Jk|k−1(zk)is simply

zk|k−1 = zk −(D2Jk|k−1(zk)

)−1∇Jk|k−1(zk).

Note that ∇Jk−1|k−1(zk−1|k−1) = 0 and Fkzk−1|k−1 = Fkxk−1|k−1, so cleverinitial guess is to take

zk =

[zk−1|k−1

Fkxk−1|k−1 +Gkuk

].

With this initial guess, the gradient becomes ∇Jk|k−1(zk) = 0 i.e. the optimalestimate of xk given y1, . . . , yk−1 is the bottom row of zk|k−1 and — by Re-mark 6.3 — the covariance matrix Pk|k−1 of this estimate is the bottom-right

block of the inverse Hessian(D2Jk|k−1(zk)

)−1, calculated using the method of

Schur complements (Exercise 7.3):

xk|k−1 = Fkxk−1|k−1 +Gkuk, (7.6)

Pk|k−1 = FkPk−1|k−1F∗k +Qk. (7.7)

These two updates comprise the prediction step of the Kalman filter. The cal-culation of xk|k−1 requires Θ(n2+np) operations, and the calculation of Pk|k−1

requires O(nα) operations, assuming that matrix-matrix multiplication for n×nmatrices can be effected in O(nα) operations for some 2 ≤ α ≤ 3.

Correction. The next step is a correction step (or update step) that corrects theprior estimate-covariance pair (xk|k−1, Pk|k−1) to a posterior estimate-covariancepair (xk|k, Pk|k) given the observation yk. Write

Hk :=[0 · · · 0 Hk

]∈ Kn×nk,

where the Hk block is the block corresponding to xk, so that Hkzk = Hkxk.Again, the cost function is written recursively:

Jk|k(zk) = Jk|k−1(zk) +1

2‖yk −Hkzk‖2R−1

k

.

The gradient and Hessian are

∇Jk|k(zk) = ∇Jk|k−1(zk) +H∗kR

−1k (Hkzk − yk)

= ∇Jk|k−1(zk) +H∗kR

−1k (Hkxk − yk)

Page 84: MA4K0 Introduction to Uncertainty Quantificationhomepages.warwick.ac.uk/~masbm/ClimateCourse/... · uncertainty into two types, aleatoric and epistemic uncertainty. Aleatoric un-certainty

—DRAFT—

76 CHAPTER 7. FILTERING AND DATA ASSIMILATION

andD2Jk|k(zk) = D2Jk|k−1(zk) +H∗

kR−1k Hk.

We now take zk = zk|k−1 as a clever initial guess for a single Newton iteration,so that the gradient becomes

∇Jk|k(zk|k−1) = ∇Jk|k−1(zk|k−1)︸ ︷︷ ︸=0

+H∗kR

−1k (Hkzk|k−1 − yk).

The posterior estimate xk|k is now obtained as the bottom row of the Newtonupdate, i.e.

xk|k = xk|k−1 − Pk|kH∗kR

−1k (Hkxk|k−1 − yk) (7.8)

where the posterior covariance Pk|k is obtained as the bottom-right block of the

inverse Hessian(D2Jk|k(zk)

)−1by Schur complementation:

Pk|k =(P−1k|k−1 +H∗

kR−1k Hk

)−1. (7.9)

Determination of the computational costs of these two steps is left as an exercise(Exercise 7.6).

In many presentations of the Kalman filter, the correction step is phrased interms of the Kalman gain Kk ∈ Kn×q:

Kk := Pk|k−1H∗k

(HkPk|k−1H

∗k +Rk

)−1. (7.10)

so that

xk|k = xk|k−1 +Kk(zk −Hkxk|k−1) (7.11)

Pk|k = (I −KkHk)Pk|k−1. (7.12)

It is also common to refer to

yk := zk −Hkxk|k−1

as the innovation residual and

Sk := HkPk|k−1H∗k +Rk

as the innovation covariance, so that Kk = Pk|k−1H∗kS

−1k and xk|k = xk|k−1 +

Kkyk. It is an exercise in algebra to show that the first presentation of thecorrection step (7.8)–(7.9) and the Kalman gain formulation (7.10)–(7.12) arethe same.

The Kalman filter can also be formulated in continuous time, or in a hybridform with continuous evolution but discrete observations. For example, thehybrid Kalman filter has the evolution and observation equations

x(t) = F (t)x(t) +G(t)u(t) + w(t),

yk = Hkxk + vk,

where xk := x(tk). The prediction equations are that xk|k−1 is the solution attime tk of the initial value problem

˙x(t) = F (t)x(t) +G(t)u(t),

x(tk−1) = xk−1|k−1,

Page 85: MA4K0 Introduction to Uncertainty Quantificationhomepages.warwick.ac.uk/~masbm/ClimateCourse/... · uncertainty into two types, aleatoric and epistemic uncertainty. Aleatoric un-certainty

—DRAFT—

7.3. EXTENDED KALMAN FILTER 77

and that Pk|k−1 is the solution at time tk of the initial value problem

P (t) = F (t)P (t)F (t)∗ +Q(t),

P (tk−1) = Pk−1|k−1.

The correction equations (in Kalman gain form) are as before:

Kk = Pk|k−1H∗k

(HkPk|k−1H

∗k +Rk

)−1

xk|k = xk|k−1 +Kk(zk −Hkxk|k−1)

Pk|k = (I −KkHk)Pk|k−1.

The LKF with continuous time evolution and observation is known as theKalman–Bucy filter. The evolution and observation equations are

x(t) = F (t)x(t) +G(t)u(t) + w(t),

y(t) = H(t)x(t) + v(t).

Notably, in the Kalman–Bucy filter, the distinction between prediction andcorrection does not exist.

˙x(t) = F (t)x(t) +G(t)u(t) +K(t)(y(t)−H(t)x(t)

),

P (t) = F (t)P (t) + P (t)F (t)∗ +Q(t)−K(t)R(t)K(t)∗,

whereK(t) := P (t)H(t)∗R(t)−1.

7.3 Extended Kalman Filter

The extended Kalman filter (EKF) is an extension of the Kalman filter to nonlin-ear dynamical systems. In discrete time, the evolution and observation equationsare

xk = fk(xk−1, uk) + wk,

yk = hk(xk) + vk,

where, as before, xk ∈ Kn are the states, uk ∈ Kp are the controls, yk ∈ Kq

are the observations, fk : Kn ×Kp → Kn are the vector fields for the dynamics,

hk : Kn → Kq are the observation maps, and the noise processes wk and vk

are uncorrelated with zero mean and positive-definite covariances Qk and Rk

respectively.The classical derivation of the EKF is to approximate the nonlinear evolution–

observation equations with a linear system and then use the LKF on that linearsystem. In contrast to the LKF, the EKF is neither the unbiased minimummean-squared error estimator nor the minimum variance unbiased estimator ofthe state; in fact, the EFK is generally biased. However, the EKF is the bestlinear unbiased estimator of the linearized dynamical system, which can oftenbe a good approximation of the nonlinear system. As a result, how well thelocal linear dynamics match the nonlinear dynamics determines in large parthow well the EKF will perform.

Page 86: MA4K0 Introduction to Uncertainty Quantificationhomepages.warwick.ac.uk/~masbm/ClimateCourse/... · uncertainty into two types, aleatoric and epistemic uncertainty. Aleatoric un-certainty

—DRAFT—

78 CHAPTER 7. FILTERING AND DATA ASSIMILATION

The approximate linearized system is obtained by first-order Taylor expan-sion of fk about the previous estimated state xk−1|k−1 and hk about xk|k−1

xk = fk(xk−1|k−1, uk) + Dfk(xk−1|k−1, uk)(xk−1 − xk−1|k−1) + wk,

yk = hk(xk|k−1) + Dhk(xk|k−1)(xk − xk|k−1) + vk.

Taking

Fk := Dfk(xk−1|k−1, uk),

Hk := Dhk(xk|k−1),

uk := fk(xk−1|k−1, uk)− Fkxk−1|k−1,

zk := hk(xk|k−1)−Hkxk|k−1,

the linearized system is

xk = Fkxk−1 + uk + wk,

yk = Hkxk + zk + vk.

We now apply the standard LKF to this system, treating uk as the controls forthe linear system and yk − zk as the observations, to obtain

xk|k−1 = fk(xk−1|k−1, uk), (7.13)

Pk|k−1 = FkPk−1|k−1F∗k +Qk, (7.14)

Pk|k =(P−1k|k−1 +H∗

kR−1k Hk

)−1, (7.15)

xk|k = xk|k−1 − Pk|kH∗kR

−1k (hk(xk|k−1)− yk). (7.16)

7.4 Ensemble Kalman Filter

The EnKF is a Monte Carlo approximation of the Kalman filter that avoidsevolving the covariance matrix of the state vector x ∈ Kn. Instead, the EnKFuses an ensemble of N states

X = [x(1), . . . , x(N)].

The columns of the matrix X ∈ Kn×N are the ensemble members.

Initialization. The ensemble is initialized by choosing the columns of X0|0 tobe N independent draws from, say, N (µ0, Q0). However, the ensemble membersare not generally independent except in the initial ensemble, since every EnKFstep ties them together, but all the calculations proceed as if they actually wereindependent.

Prediction. The prediction step of the EnKF is straightforward: each column

x(i)k−1|k−1 is evolved to xk|k−1 using the LKF prediction step (7.6)

x(i)k|k−1 = Fkx

(i)k−1|k−1 +Gkuk,

or the EKF prediction step (7.13)

xk|k−1 = fk(xk−1|k−1, uk).

Page 87: MA4K0 Introduction to Uncertainty Quantificationhomepages.warwick.ac.uk/~masbm/ClimateCourse/... · uncertainty into two types, aleatoric and epistemic uncertainty. Aleatoric un-certainty

—DRAFT—

7.4. ENSEMBLE KALMAN FILTER 79

Correction. The correction step for the EnKF uses a trick called data replica-tion: the observed data yk = Hkxk + vk is replicated into an m×N matrix

D = [d(1), . . . , d(N)], d(i) := yk + ηi, ηi ∼ N (0, Rk).

so that each column di consists of the observed data vector yk plus an indepen-dent random draw from N (0, Rk). If the columns of Xk|k−1 are a sample fromthe prior distribution, then the columns of

Xk|k−1 +Kk

(D −HkXk|k−1

)

form a sample from the posterior probability distribution, in the sense of aBayesian prior (before data) and posterior (conditioned upon the data). TheEnKF approximates this sample by replacing the exact Kalman gain matrix(7.10)

Kk := Pk|k−1H∗k

(HkPk|k−1H

∗k +Rk

)−1,

which involves the covariance matrix Pk|k−1, which is not tracked in the EnKF,by an approximate covariance matrix. The empirical mean and empirical co-variance of X are

〈X〉 := 1

N

N∑

i=1

x(i),(X − 〈X〉)(X − 〈X〉)∗

N − 1.

The Kalman gain matrix for the EnKF uses Ck|k−1 in place of Pk|k−1:

Kk := Ck|k−1H∗k

(HkCk|k−1H

∗k +Rk

)−1, (7.17)

so that the correction step becomes

Xk|k := Xk|k−1 + Kk

(D −HkXk|k−1

). (7.18)

One can also use sampling to dispense with Rk, and instead use the empiricalcovariance of the replicated data,

(D − 〈D〉)(D − 〈D〉)∗N − 1

.

Note, however, that the empirical covariance matrix is typically rank-deficient(in practical applications, there are usually many more state variables thanensemble members), in which case the matrix inverse in (7.17) may fail to exist;in such situations, a pseudo-inverse may be used.

Remark 7.1. Even when the matrices involved are positive-definite, insteadof computing the inverse of a matrix and multiplying by it, it is much better(several times cheaper and also more accurate) to compute the Cholesky decom-position of the matrix and treat the multiplication by the inverse as solutionof a system of linear equations (cf. MA398 Matrix Analysis and Algorithms).This is a general point relevant to the implementation of all KF-like methods.

Page 88: MA4K0 Introduction to Uncertainty Quantificationhomepages.warwick.ac.uk/~masbm/ClimateCourse/... · uncertainty into two types, aleatoric and epistemic uncertainty. Aleatoric un-certainty

—DRAFT—

80 CHAPTER 7. FILTERING AND DATA ASSIMILATION

7.5 Eulerian and Lagrangian Data Assimilation

Systems involving some kind of fluid flow are an important application area fordata assimilation. It is interesting to consider two classes of observations of suchsystems, namely Eulerian and Lagrangian observations: Eulerian observationsare observations at fixed points in space, whereas Lagrangian observations areobservations at points that are carried along by the flow field. For example,a fixed weather station on the ground with a thermometer, a barometer, ananemometer &c. would report Eulerian observations of temperature, pressureand wind speed. By contrast, an unpowered float set adrift on the ocean currentswould report Lagrangian data about temperature, salinity &c. at its current(evolving) position.

Consider for example the incompressible Stokes (ι = 0) or Navier–Stokes(ι = 1) equations on the unit square with periodic boundary conditions, thoughtof as the two-dimensional torus T2, starting at time t = 0:

∂u

∂t+ ιu · ∇u = ν∆u−∇p+ f on T2 × [0,∞),

∇ · u = 0 on T2 × [0,∞),

u = u0 on T2 × 0.Here u, f : T2 × [0,∞) → R2 are the velocity field and forcing term respectively,p : T2 × [0,∞) → R is the pressure field, u0 : T

2 → R2 is the initial value of thevelocity field, and ν ≥ 0 is the viscosity of the fluid.

Eulerian observations of this system might take the form of noisy obser-vations yj,k of the velocity field at fixed points zj ∈ T2, j = 1, . . . , J , at anincreasing sequence of discrete times tk ≥ 0, k ∈ N, i.e.

yj,k = u(zj , tk) + ηj,k.

On the other hand, Lagrangian observations might take the form of noisy ob-servations yj,k of the locations zj(tk) ∈ T2 at time tk of J passive tracers thatstart at position zj,0 ∈ T2 at time t = 0 and are carried along with the flowthereafter, i.e.

zj(t) = zj,0 +

∫ t

0

u(zj(s), s) ds,

yj,k = zj(tk) + ηj,k.

Bibliography

The description given of the Kalman filter, particularly in terms of Newton’smethod applied to the quadratic objective function J , follows that of Humpherys& al. [41]. The remarks about Eulerian versus Lagrangian data assimilationborrow from §3.6 of Stuart [98].

The original presentation of the Kalman [46] and Kalman–Bucy filters [47]was in the context of signal processing, and encountered some initial resistancefrom the engineering community, as related in the article of Humpherys & al.Filtering is now fully accepted in applications communities and has a soundalgorithmic and theoretical base; for a stochastic processes point of view onfiltering, see e.g. the books of Jazwinski [43] and Øksendal [73] (§6.1).

Page 89: MA4K0 Introduction to Uncertainty Quantificationhomepages.warwick.ac.uk/~masbm/ClimateCourse/... · uncertainty into two types, aleatoric and epistemic uncertainty. Aleatoric un-certainty

—DRAFT—

EXERCISES 81

Exercises

Exercise 7.1. Verify that the normal equations for the state estimation problem(7.4) have a unique solution.

Exercise 7.2. Suppose that A ∈ Kn×n and C ∈ Km×m are self-adjoint andpositive definite, B ∈ Km×n, and D ∈ Km×m is self-adjoint and positive semi-definite. Then [

A+B∗CB −B∗C−CB C +D

]

is self-adjoint and positive-definite.

Exercise 7.3 (Schur complements). Let

M =

[A BC D

]

be a square matrix with A, D, A−BD−1C and D − CA−1B all non-singular.Then

M−1 =

[A−1 +A−1B(D − CA−1B)−1CA−1 −A−1B(D − CA−1B)−1

−(D − CA−1B)−1CA−1 (D − CA−1B)−1

]

and

M−1 =

[(A−BD−1C)−1 −(A−BD−1C)−1BD−1

−D−1C(A−BD−1C)−1 D−1 +D−1C(A−BD−1C)−1BD−1

].

Exercise 7.4. Schur complementation is often stated in the more restrictivesetting of self-adjoint positive-definite matrices, in which it has a natural in-terpretation in terms of the conditioning of Gaussian random variables. Let(X,Y ) ∼ N (m,C) be jointly Gaussian, where, in block form,

m =

[m1

m2

], C =

[C11 C12

C∗12 C22

],

and C is self-adjoint and positive definite. Show:1. C11 and C22 are necessarily self-adjoint and positive-definite matrices.2. With the Schur complement defined by S := C11 − C12C

−122 C

∗12, S is self-

adjoint and positive definite, and

C−1 =

[S−1 −S−1C12C

−122

−C−122 C

∗12S

−1 C−122 + C−1

22 C∗12S

−1C12C−122

].

3. The conditional distribution of X given that Y = y is Gaussian:

(X |Y = y) ∼ N(m1 + C12C

−122 (y −m2), S

).

Sketch the PDF of a Gaussian random variable in R2 to further convinceyourself of this result.

Exercise 7.5 (Sherman–Morrison–Woodbury formula). Let A and D be invertiblematrices and let B and C be such that A+BD−1C and D +CA−1B are non-singular. Then

(A+BD−1C)−1 = A−1 −A−1B(D + CA−1B)−1CA−1.

Page 90: MA4K0 Introduction to Uncertainty Quantificationhomepages.warwick.ac.uk/~masbm/ClimateCourse/... · uncertainty into two types, aleatoric and epistemic uncertainty. Aleatoric un-certainty

—DRAFT—

82 CHAPTER 7. FILTERING AND DATA ASSIMILATION

Exercise 7.6. Determine the asymptotic computational costs of the correctionsteps (7.8) and (7.9) of the LKF, and hence the asymptotic computational costper iteration of the LKF.

Exercise 7.7 (Fading memory). In the LKF, the current state variable is up-dated as the latest inputs and measurements become known, but the estimationis based on the least squares solution of all the previous states where all measure-ments are weighted according to their covariance. One can also use an estimatorthat discounts the error in older measurements leading to a greater emphasison recent observations, which is particularly useful in situations where there issome modeling error in the system.

To do this, consider the objective function

J(λ)k|k (zk) :=

λk

2‖x0 − µ0‖2Q−1

0

+1

2

k∑

i=1

λk−i‖yi −Hixi‖2R−1i

+1

2

k∑

i=1

λk−1‖xi − Fixi−1 −Giui‖2Q−1i

,

where the parameter λ ∈ [0, 1] is called the forgetting factor ; note that the stan-dard LKF is the case λ = 1, and the objective function increasingly relies uponrecent measurements as λ → 0. Find a recursive expression for the objective

function J(λ)k|k and follow the steps in the derivation of the usual LKF to derive

the LKF with fading memory λ.

Exercise 7.8. Write the prediction and correction equations (7.13)–(7.16) forthe EKF in terms of the Kalman gain matrix.

Exercise 7.9. Suppose that a fuel tank of an aircraft is (when the aircraft islevel) the cuboid Ω := [a1, b1] × [a2, b2]× [a3, b3]. Assume that at some time t,the aircraft is flying such that

• the original upward-pointing [0, 0, 1]∗ vector of the plane, and hence thetank, is ν(t) ∈ R3;

• the fuel in the tank is in static equilibrium;• fuel probes inside the tank provide noisy measurements of the fuel depthat the four corners of the tank: specifically, if [ai, bi, zi]

∗ is the intersectionof the fuel surface with the boundary of the tank at the corner [ai, bi]

∗,assume that you are told ζi = zi +N (0, σ2), independently for each fuelprobe.

Using this information:1. Assuming first that σ2 = 0, calculate the volume of fuel in the tank.2. Assuming now that σ2 > 0, estimate the volume of fuel in the tank.3. Explain how to estimate the volume of fuel remaining in the tank as a

function of time. What other information other than the probe measure-ments might you use, and why?

4. Generalize your results to more general fuel tank and probe geometries,and more general observational noise.

Exercise 7.10. Consider the problem of using the LKF to estimate the positionand velocity of a projectile given a few noisy measurements of its position. Infact, the LKF not only provides a relatively smooth profile of the projectile’s

Page 91: MA4K0 Introduction to Uncertainty Quantificationhomepages.warwick.ac.uk/~masbm/ClimateCourse/... · uncertainty into two types, aleatoric and epistemic uncertainty. Aleatoric un-certainty

—DRAFT—

EXERCISES 83

trajectory as it passes by a radar sensor, but also effectively predicts the pointof impact as well as the point of origin — so that troops on the ground can bothduck for cover and return fire before the projectile lands.

The state of the projectile is. . .

Page 92: MA4K0 Introduction to Uncertainty Quantificationhomepages.warwick.ac.uk/~masbm/ClimateCourse/... · uncertainty into two types, aleatoric and epistemic uncertainty. Aleatoric un-certainty

—DRAFT—

84 CHAPTER 7. FILTERING AND DATA ASSIMILATION

Page 93: MA4K0 Introduction to Uncertainty Quantificationhomepages.warwick.ac.uk/~masbm/ClimateCourse/... · uncertainty into two types, aleatoric and epistemic uncertainty. Aleatoric un-certainty

—DRAFT—

Chapter 8

Orthogonal Polynomials

Although our intellect always longs forclarity and certainty, our nature oftenfinds uncertainty fascinating.

On War

Karl von Clausewitz

Orthogonal polynomials are an important example of orthogonal decompo-sitions of Hilbert spaces. They are also of great practical importance: they playa central role in numerical integration using quadrature rules (Chapter 9) andapproximation theory; in the context of UQ, they are also a foundational toolin polynomial chaos expansions (Chapter 11).

For the rest of this chapter, N = N0 or 0, 1, . . . , N for some N ∈ N0.

8.1 Basic Definitions and Properties

Recall that a polynomial of degree n in a single indeterminate x is an expressionof the form

p(x) = cnxn + cn−1x

n−1 + · · ·+ c1x+ c0,

where the coefficients ci are scalars drawn from some field K, and cn 6= 0. Ifcn = 1, then p is said to be monic. The space of all polynomials in x is denotedK[x], and the space of polynomials of degree at most n is denoted K≤n[x].

Definition 8.1. Let µ be a non-negative measure on R. A family of polynomialsQ = qn | n ∈ N is called an orthogonal system of polynomials if qn is of degreen and

〈qm, qn〉L2(µ) :=

R

qm(x)qn(x) dµ(x) = γnδmn for m,n ∈ N .

The constants

γn = ‖qn‖2L2(µ) =

R

q2n dµ

are required to be strictly positive and are called the normalization constantsof the system Q. If γn = 1 for all n ∈ N , then Q is an orthonormal system.

Page 94: MA4K0 Introduction to Uncertainty Quantificationhomepages.warwick.ac.uk/~masbm/ClimateCourse/... · uncertainty into two types, aleatoric and epistemic uncertainty. Aleatoric un-certainty

—DRAFT—

86 CHAPTER 8. ORTHOGONAL POLYNOMIALS

In other words, a system of orthogonal polynomials is nothing but a col-lection of orthogonal elements of the Hilbert space L2(R, µ) that happen to bepolynomials. Note that, given µ, orthogonal (resp. orthonormal) polynomialsfor µ can be found inductively by using the Gram–Schmidt orthogonalization(resp. orthonormalization) procedure on the monomials 1, x, x2, . . . .

Example 8.2. 1. The Legendre polynomials Len are the orthogonal polyno-mials for uniform measure on [−1, 1]:

∫ 1

−1

Lem(x)Len(x) dx = δmn.

2. The Hermite polynomials Hen are the orthogonal polynomials for standardGaussian measure (2π)−1/2e−x2/2 dx on the real line:

∫ ∞

−∞

Hem(x)Hen(x)exp(−x2/2)√

2πdx = n!δmn.

The first few Legendre and Hermite polynomials are given in Table 8.1and illustrated in Figures 8.1 and 8.2.

3. See Table 8.2 for a summary of other classical systems of orthogonal poly-nomials corresponding to various probability measures on subsets of thereal line.

Remark 8.3. Many sources, typically physicists’ texts, use the weight functione−x2

dx instead of probabilists’ preferred (2π)−1/2e−x2/2 dx or e−x2/2 dx forthe Hermite polynomials. Changing from one normalization to the other isof course not difficult, but special care must be exercised in practice to seewhich normalization a source is using, especially when relying on third-partysoftware packages: for example, the GAUSSQ Gaussian quadrature package fromhttp://netlib.org/ uses the e−x2

dx normalization. To convert integrals withrespect to one Gaussian measure to integrals with respect to another (and henceget the right answers for Gauss–Hermite quadrature), use the following change-of-variables formula:

1√2π

R

f(x)e−x2/2 dx =1

π

R

f(√2x)e−x2

dx.

It follows from this that conversion between the physicists’ and probabilists’Gauss–Hermite quadrature formulæ is achieved by

wprobi =

wphysi√π, xprobi =

√2xphysi .

Lemma 8.4. The L2(R, µ) inner product is positive definite on R≤d[x] if andonly if the Hankel determinant det(Hn) is strictly positive for n = 1, . . . , d+ 1,where

Hn :=

m0 m1 · · · mn−1

m1 m2 · · · mn

......

. . ....

mn−1 mn · · · m2n−2

, mn :=

R

xn dµ(x).

Hence, the L2(R, µ) inner product is positive definite on R[x] if and only ifdet(Hn) > 0 for all n ∈ N.

Page 95: MA4K0 Introduction to Uncertainty Quantificationhomepages.warwick.ac.uk/~masbm/ClimateCourse/... · uncertainty into two types, aleatoric and epistemic uncertainty. Aleatoric un-certainty

—DRAFT—

8.1. BASIC DEFINITIONS AND PROPERTIES 87

n Len Hen

0 1 11 x x

2 12 (3x

2 − 1) x2 − 1

3 12 (5x

3 − 3x) x3 − 3x

4 18 (35x

4 − 30x2 + 3) x4 − 6x2 + 3

5 18 (63x

5 − 70x3 + 15x) x5 − 10x3 + 15x

Table 8.1: The first few Legendre polynomials Len, which are the orthogo-nal polynomials for uniform measure dx on [−1, 1], and Hermite polynomi-als Hen, which are the orthogonal polynomials for standard Gaussian measure(2π)−1/2e−x2/2 dx on R.

0.5

1.0

−0.5

−1.0

0.25 0.50 0.75 1.00−0.25−0.50−0.75−1.00

Figure 8.1: The Legendre polynomials Le0 (red), Le1 (orange), Le2 (yellow),Le3 (green), Le4 (blue) and Le5 (purple) on [−1, 1].

5.0

10.0

−5.0

−10.0

0.5 1.0 1.5−0.5−1.0−1.5

Figure 8.2: The Hermite polynomials He0 (red), He1 (orange), He2 (yellow),He3 (green), He4 (blue) and He5 (purple) on R.

Page 96: MA4K0 Introduction to Uncertainty Quantificationhomepages.warwick.ac.uk/~masbm/ClimateCourse/... · uncertainty into two types, aleatoric and epistemic uncertainty. Aleatoric un-certainty

—DRAFT—

88 CHAPTER 8. ORTHOGONAL POLYNOMIALS

Distribution of ξ Polynomials qk(ξ) Support

Continuous

Gaussian Hermite R

Gamma Laguerre [0,∞)Beta Jacobi [a, b]

Uniform Legendre [a, b]

Discrete

Poisson Charlier N0

Binomial Krawtchouk 0, 1, . . . , nNegative Binomial Meixner N0

Hypergeometric Hahn 0, 1, . . . , n

Table 8.2: Families of probability distributions and the corresponding familiesof orthogonal polynomials.

Proof. Letp(x) := cdx

d + · · ·+ c1x+ c0

be any polynomial of degree at most d. Note that

‖p‖2L2(R,µ) =

R

d∑

k,ℓ=0

ckcℓxk+ℓ dµ(x) =

d∑

k,ℓ=0

ckcℓmk+ℓ,

and so ‖p‖L2(R,µ) > 0 if and only if Hd+1 is positive definite. This, in turn, isequivalent to having det(Hn) > 0 for n = 1, 2, . . . , d+ 1.

Theorem 8.5. If the L2(R, µ) inner product is positive definite on R[x], thenthere exists an infinite sequence of orthogonal polynomials for µ.

Proof. Apply the Gram–Schmidt procedure to the monomials xn, n ∈ N0. Thatis, take q0(x) = 1, and for n ∈ N recursively define

qn(x) := xn −n−1∑

k=0

〈xk, qk〉〈qk, qk〉

qk(x).

Since the inner product is positive definite, 〈qk, qk〉 > 0, and so each qn isuniquely defined. By construction, each qn is orthogonal to qk for k < n.

By Exercise 8.1, the hypothesis of Theorem 8.5 is satisfied if the measure µhas infinite support. In the other direction, we have the following:

Theorem 8.6. If the L2(R, µ) inner product is positive definite on K≤d[x], butnot on R≤n[x] for n > d, then µ admits only d+ 1 orthogonal polynomials.

Proof. The Gram–Schmidt procedure can be applied so long as the denomina-tors 〈qk, qk〉 are strictly positive, i.e. for k ≤ d + 1. The polynomial qd+1 isorthogonal to qn for n ≤ d; we now show that qd+1 = 0. By assumption, thereexists a polynomial P of degree d + 1, having the same leading coefficient asqd+1, such that ‖P‖L2(R,µ) = 0. Hence, P − qd+1 has degree d, so it can bewritten in the orthogonal basis q0, . . . , qd as

P − qd+1 =

d∑

k=0

ckqk

Page 97: MA4K0 Introduction to Uncertainty Quantificationhomepages.warwick.ac.uk/~masbm/ClimateCourse/... · uncertainty into two types, aleatoric and epistemic uncertainty. Aleatoric un-certainty

—DRAFT—

8.2. RECURRENCE RELATIONS 89

for some coefficients c0, . . . , cd. Hence,

0 = ‖P‖2L2(R,µ) = ‖qd+1‖2L2(R,µ) +

d∑

k=0

c2k‖qk‖2L2(R,µ),

which implies, in particular, that ‖qd+1‖L2(R,µ) = 0. Hence, the normalizationconstant γd+1 = 0, which is not permitted, and so qd+1 is not a member of asequence of orthogonal polynomials for µ.

Theorem 8.7. If µ has finite moments only of degrees 0, 1, . . . , r, then µ admitsonly a finite system of orthogonal polynomials q0, . . . , qd, where d = ⌊r/2⌋.Proof. Exercise 8.2

Theorem 8.8. The coefficients of any system of orthogonal polynomials aredetermined, up to multiplication by an arbitrary constant for each degree, by theHankel determinants of the polynomial moments. That is, if

mn :=

R

xn dµ(x)

then the nth degree orthogonal polynomial qn for µ is, for some cn 6= 0,

qn = cn det

mn

Hn

...m2n−1

1 · · · xn−1 xn

= cn det

m0 m1 m2 · · · mn

m1 m2 m3 · · · mn+1

......

.... . .

...mn−1 mn mn+2 · · · m2n−1

1 x x2 · · · xn

.

Proof. FINISH ME!!!

8.2 Recurrence Relations

The Legendre polynomials satisfy the recurrence relation

Len+1(x) =2n+ 1

n+ 1xLen(x)−

n

n+ 1Len−1(x).

The Hermite polynomials satisfy the recurrence relation

Hen+1(x) = xHen(x) − nHen−1(x).

In fact, all systems of orthogonal polynomials satisfy a three-term recurrencerelation of the form

qn+1(x) = (Anx+Bn)qn(x) − Cnqn−1(x)

for some sequences (An), (Bn), (Cn), with the initial terms q0(x) = 1 andq−1(x) = 0. Furthermore, there is a characterization of precisely which se-quences (An), (Bn), (Cn) arise from systems of orthogonal polynomials.

Page 98: MA4K0 Introduction to Uncertainty Quantificationhomepages.warwick.ac.uk/~masbm/ClimateCourse/... · uncertainty into two types, aleatoric and epistemic uncertainty. Aleatoric un-certainty

—DRAFT—

90 CHAPTER 8. ORTHOGONAL POLYNOMIALS

Theorem 8.9 (Favard). Let (An), (Bn), (Cn) be real sequences and let Q =qn | n ∈ N be defined by

qn+1(x) = (Anx+Bn)qn(x) − Cnqn−1(x),

q0(x) = 1,

q−1(x) = 0.

Then Q is a system of orthogonal polynomials for some measure µ if and onlyif, for all n ∈ N ,

An 6= 0, Cn 6= 0, CnAnAn−1 > 0.

Theorem 8.10 (Christoffel–Darboux formula). The orthonormal polynomials Pn |n ≥ 0 for a measure µ satisfy

n∑

k=0

Pk(y)Pk(x) =√βn+1

Pn+1(y)Pn(x)− Pn(y)Pn+1(x)

y − x(8.1)

andn∑

k=0

|Pk(x)|2 =√βn+1

(P ′n+1(x)Pn(x)− P ′

n(x)Pn+1(x)). (8.2)

Proof. Multiply the recurrence relation√βk+1Pk+1(x) = (x − αk)Pk(x)−

√βkPk−1(x)

by Pk(y) on both sides and subtract the corresponding expression with x and yinterchanged to obtain

(y − x)Pk(y)Pk(x) =√βk+1

(Pk+1(y)Pk(x) − Pk(y)Pk+1(x)

)

−√βk(Pk(y)Pk−1(x)− Pk−1(y)Pk(x)

).

Sum both sides from k = 0 to k = n and use the telescoping nature of the sumon the right to obtain (8.1). Take the limit as y → x to obtain (8.2).

Corollary 8.11. The orthonormal polynomials Pn | n ≥ 0 for a measure µsatisfy

P ′n+1(x)Pn(x)− P ′

n(x)Pn+1(x) > 0.

8.3 Roots of Orthogonal Polynomials

Definition 8.12. The Jacobi matrix of a measure µ is the infinite, symmetric,tridiagonal matrix

J∞(µ) :=

α0

√β1 0 · · ·

√β1 α1

√β2

. . .

0√β2 α2

. . ....

. . .. . .

. . .

where αk and βk are as in FINISH ME!!!. The upper-left n × n minor ofJ∞(µ) is denoted Jn(µ).

Page 99: MA4K0 Introduction to Uncertainty Quantificationhomepages.warwick.ac.uk/~masbm/ClimateCourse/... · uncertainty into two types, aleatoric and epistemic uncertainty. Aleatoric un-certainty

—DRAFT—

8.3. ROOTS OF ORTHOGONAL POLYNOMIALS 91

Theorem 8.13. Let P0, P1, . . . be the orthonormal polynomials for µ. The zerosof Pn are the eigenvalues of Jn(µ), and the eigenvector corresponding to the zeroz is

P0(z)

...Pn−1(z)

.

Theorem 8.14 (Zeros of orthogonal polynomials). Let µ be supported in a non-degenerate interval I ⊆ R, and let Q = qn | n ∈ N0 be a system of orthogonalpolynomials for µ

1. For each n ∈ N0, qn has exactly n distinct real roots z(n)1 , . . . , z

(n)n ∈ I.

2. If (a, b) is an open interval of µ-measure zero, then (a, b) contains at mostone root of any orthogonal polynomial qn for µ.

3. The zeros of qn and qn+1 alternate:

z(n+1)1 < z

(n)1 < z

(n+1)2 < · · · < z(n+1)

n < z(n)n < z(n+1)n+1 ;

hence, whenever m > n, between any two zeros of qn there lies a zero ofqm.

4. If the support of µ is the entire interval I, then the set of all zeros for thesystem Q is dense in I:

I =⋃

n∈N

z ∈ R | qn(z) = 0.

Proof. 1. First observe that 〈qn, 1〉L2(µ) = 0, and so qn changes sign in I.Since qn is continuous, the Intermediate Value Theorem implies that qnhas at least one real root z

(n)1 ∈ I. For n > 1, there must be another root

z(n)2 ∈ I of qn distinct from z

(n)1 , since if qn were to vanish only at z

(n)1 ,

then(x− z

(n)1

)qn would not change sign in I, which would contradict the

orthogonality relation 〈x− z(n)1 , qn〉L2(µ) = 0. Similarly, if n > 2, consider(

x− z(n)1

)(x− z

(n)2

)qn to deduce the existence of yet a third distinct root

z(n)3 ∈ I. This procedure terminates when all the n complex roots of qnguaranteed by the Fundamental Theorem of Algebra are shown to lie inI.

2. Suppose that (a, b) contains two distinct zeros z(n)i and z

(n)j of qn. Then

⟨qn,

k 6=i,j

(x− z

(n)k

)⟩

L2(µ)

=

R

qn(x)∏

k 6=i,j

(x− z

(n)k

)dµ(x)

=

R

k 6=i,j

(x− z

(n)k

)2(x− z

(n)i

)(x− z

(n)j

)dµ(x)

> 0,

since the integrand is positive outside of (a, b). However, this contradictsthe orthogonality of qn to all polynomials of degree less than n.

3. As usual, let Pn be the normalized version of qn. Let σ, τ be consecutive ze-ros of Pn, so that P ′

n(σ)P′n(τ) < 0. Then Corollary 8.11 implies that Pn+1

Page 100: MA4K0 Introduction to Uncertainty Quantificationhomepages.warwick.ac.uk/~masbm/ClimateCourse/... · uncertainty into two types, aleatoric and epistemic uncertainty. Aleatoric un-certainty

—DRAFT—

92 CHAPTER 8. ORTHOGONAL POLYNOMIALS

has opposite signs at σ and τ , and so the Intermediate Value Theoremimplies that Pn+1 has at least one zero between σ and τ . This observation

accounts for n−1 of the n+1 zeros of Pn+1, namely z(n+1)2 < · · · < z

(n+1)n .

There are two further zeros of Pn+1, one to the left of z(n)1 and one to the

right of z(n)n . This follows because P ′

n

(z(n)n

)> 0, so Corollary 8.11 implies

that Pn+1

(z(n)n

)< 0. Since Pn+1(x) → +∞ as x → ∞, the Intermediate

Value Theorem implies the existence of z(n+1)n+1 > z

(n)n . A similar argument

establishes the existence of z(n+1)1 < z

(n)1 .

4. FINISH ME!!!

8.4 Polynomial Interpolation

The existence of a unique polynomial p(x) =∑n

i=0 cixi of degree at most n that

interpolates the values y0, . . . , yn at n+1 distinct points x0, . . . , xn follows fromthe invertibility of the Vandermonde matrix

Vn(x0, . . . , xn) :=

1 x0 x20 · · · xn+10

1 x1 x21 · · · xn+11

......

.... . .

...1 xn x2n · · · xn+1

n

(8.3)

and hence the unique solvability of the system of simultaneous linear equations

Vn(x0, . . . , xn)

c0...cn

=

y0...yn

. (8.4)

There is, however, another way to express the polynomial interpolation problem,the so-called Lagrange form, which amounts to a clever choice of basis for R≤n[x](instead of the usual monomial basis 1, x, x2, . . . , xn) so that the matrix in(8.4) in the new basis is the identity matrix.

Definition 8.15 (Lagrange polynomials). Let x0, . . . , xK ∈ R be distinct. For0 ≤ j ≤ K, the associated Lagrange basis polynomial ℓj is defined by

ℓj(x) :=∏

0≤k≤Kk 6=j

x− xkxj − xk

.

Given also arbitrary values y0, . . . , yK ∈ R, the associated Lagrange interpola-tion polynomial is

L(x) :=

K∑

j=0

yjℓj(x).

Theorem 8.16. Given distinct x0, . . . , xK ∈ R and any y0, . . . , yK ∈ R, the as-sociated Lagrange interpolation polynomial L is the unique polynomial of degreeat most K such that L(xk) = yk for k = 0, . . . ,K.

Page 101: MA4K0 Introduction to Uncertainty Quantificationhomepages.warwick.ac.uk/~masbm/ClimateCourse/... · uncertainty into two types, aleatoric and epistemic uncertainty. Aleatoric un-certainty

—DRAFT—

8.5. POLYNOMIAL APPROXIMATION 93

Proof. Observe that each Lagrange basis polynomial is a polynomial of degreeK, and so L is a polynomial of degree at mostK. Observe also that ℓj(xk) = δjk.Hence,

L(xk) =

K∑

j=0

f(xj)ℓj(xk) =

K∑

j=0

f(xj)δjk = f(xk).

For uniqueness, consider the basis ℓ0, . . . , ℓK of R≤K [x] and suppose that

p =∑K

j=0 cjℓj is any polynomial that interpolates the values ykKk=0 at the

points xkKk=0. But then, for each k = 0, . . . ,K,

f(xk) =

K∑

j=0

cjℓj(xk) =

K∑

j=0

cjδjk = ck,

and so p = L, as claimed.

Runge’s Phenomenon. Given the task of choosing nodes xk ∈ [a, b] betweenwhich to interpolate functions f : [a, b] → R, it might seem natural to choosethe nodes xk to be equally spaced. Runge’s phenomenon [84] shows that thisis not always a good choice of interpolation scheme. Consider the functionf : [−1, 1] → R defined by

f(x) :=1

1 + 25x2, (8.5)

and let Ln be the degree-n (Lagrange) interpolation polynomial for f on theequally-spaced nodes xk := 2k

n − 1. As illustrated in Figure 8.1, Ln oscillateswildly near the endpoints of the interval [−1, 1]. Even worse, as n increases,these oscillations do not die down but increase without bound: it can be shownthat

limn→∞

supx∈[−1,1]

|f(x)− Ln(x)| = ∞.

As a consequence, polynomial interpolation and numerical integration usinguniformly spaced nodes — as in the Newton–Cotes formula (Definition 9.5) —can in general be very inaccurate. The oscillations near ±1 can be controlledby using a non-uniform set of nodes, in particular one that is denser ±1 thannear 0; the standard example is the set of Chebyshev nodes defined by

xk := cos

(2k − 1

2nπ

),

i.e. the roots of the Chebyshev polynomials of the first kind Tn, which are theorthogonal polynomials for the measure (1− x2)−1/2 dx on [−1, 1].

However, Chebyshev nodes are not a panacea. Indeed, for every predefinedset of interpolation nodes there is a continuous function for which the interpo-lation process on those nodes diverges. For every continuous function there isa set of nodes on which the interpolation process converges. Interpolation onChebyshev nodes converges uniformly for every absolutely continuous function.

8.5 Polynomial Approximation

The following theorem on the uniform approximation (on compact sets) of con-tinuous functions by polynomials should be familiar from elementary real anal-ysis:

Page 102: MA4K0 Introduction to Uncertainty Quantificationhomepages.warwick.ac.uk/~masbm/ClimateCourse/... · uncertainty into two types, aleatoric and epistemic uncertainty. Aleatoric un-certainty

—DRAFT—

94 CHAPTER 8. ORTHOGONAL POLYNOMIALS

0.5

1.0

−0.5

−1.0

0.5 1.0−0.5−1.0

Figure 8.1: Runge’s phenomenon. The function f(x) := (1 + 25x2)−1 in black,and polynomial interpolations of degrees 5 (red), 9 (green), and 13 (blue) onevenly-spaced nodes.

Theorem 8.17 (Weierstrass). Let [a, b] ⊂ R be a bounded interval, let f : [a, b] →R be continuous, and let ε > 0. Then there exists a polynomial p such that

supa≤x≤b

|f(x)− p(x)| < ε.

As a consequence of standard results on orthogonal projection in Hilbertspaces, we have the following:

Theorem 8.18. For any f ∈ L2(I, µ) and any d ∈ N0, the orthogonal projectionΠdf of f onto R≤d[x] is the best degree d polynomial approximation of f in theL2(I, µ) norm, i.e.

Πdf = argminp(x)∈R≤d[x]

‖p− f‖L2(I,µ),

where, denoting the orthogonal polynomials for µ by qk | k ∈ N0,

Πdf :=

d∑

k=0

〈f, qk〉L2(µ)

‖qk‖2L2(µ)

qk,

and the residual is orthogonal to the projection subspace:

〈f −Πdf, p〉L2(µ) = 0 for all p(x) ∈ R≤d[x].

An important property of polynomial expansions of functions is that thequality of the approximation (i.e. the rate of convergence) improves as the reg-ularity of the function to be approximated increases. This property is referred

Page 103: MA4K0 Introduction to Uncertainty Quantificationhomepages.warwick.ac.uk/~masbm/ClimateCourse/... · uncertainty into two types, aleatoric and epistemic uncertainty. Aleatoric un-certainty

—DRAFT—

8.5. POLYNOMIAL APPROXIMATION 95

to as spectral convergence and is easily quantified by using the machinery ofSobolev spaces. Recall that, given a measure µ on a subinterval I ⊆ R

〈u, v〉Hk(µ) :=

k∑

m=0

⟨dmu

dxm,dmv

dxm

L2(µ)

=

k∑

m=0

I

dmu

dxmdmv

dxmdµ

‖u‖Hk(µ) := 〈u, v〉1/2Hk(µ)

.

The Sobolev space Hk(µ) consists of all L2(µ) functions that have weak deriva-tives of all orders up to k in L2(µ), and is equipped with the above inner productand norm.

Legendre expansions of Sobolev functions on [−1, 1] satisfy the followingspectral convergence theorem; the analogous results for Hermite expansions ofSobolev functions on R and Laguerre expansions of Sobolev functions on (0,∞)are Exercise 8.5 and Exercise 8.6 respectively.

Theorem 8.19 (Spectral convergence of Legendre expansions). There is a constantC ≥ 0 that may depend upon k but is independent of d and f such that, for allf ∈ Hk([−1, 1], dx),

‖f −Πdf‖L2(dx) ≤ Cd−k‖f‖Hk(dx).

Proof. Recall that the Legendre polynomials satisfy

LLen = λnLen,

where the differential operator L and eigenvalues λn are

L =d

dx

((1− x2)

d

dx

)= (1− x2)

d2

dx2− 2x

d

dx, λn = −n(n+ 1)

Note that, by the definition of the Sobolev norm and the operator L, ‖Lf‖L2 ≤C‖f‖H2 and hence, for any m ∈ N, ‖Lmf‖L2 ≤ C‖f‖H2m .

The key ingredient of the proof is integration by parts:

〈f,Len〉L2 = λ−1n

∫ 1

−1

(LLen)(x)f(x) dx

= λ−1n

∫ 1

−1

((1− x2)Le′′n(x)f(x) − 2xLe′n(x)f(x)

)dx

= −λ−1n

∫ 1

−1

(((1 − x2)f)′(x)Le′n(x) + 2xLe′n(x)f(x)

)dx

= −λ−1n

∫ 1

−1

(1− x2)f ′(x)Le′n(x) dx

= λ−1n

∫ 1

−1

((1− x2)f ′

)′(x)Le′n(x) dx

= λ−1n 〈Lf,Len〉L2 .

Hence, for all m ∈ N0 for which f has 2m weak derivatives,

〈f,Len〉 =〈Lmf,Len〉

λmn

Page 104: MA4K0 Introduction to Uncertainty Quantificationhomepages.warwick.ac.uk/~masbm/ClimateCourse/... · uncertainty into two types, aleatoric and epistemic uncertainty. Aleatoric un-certainty

—DRAFT—

96 CHAPTER 8. ORTHOGONAL POLYNOMIALS

Hence,

‖f −Πdf‖2 =∞∑

n=d+1

|〈f,Len〉|2‖Len‖2

=

∞∑

n=d+1

|〈Lmf,Len〉|2λ2mn ‖Len‖2

≤ λ−2md

∞∑

n=d+1

|〈Lmf,Len〉|2‖Len‖2

≤ d−4m‖Lmf‖2

≤ C2d−4m‖f‖2H2m .

Taking k = 2m and square roots completes the proof.

However, in the other direction poor regularity can completely ruin the niceconvergence of spectral expansions. The classic example of this is Gibbs’ phe-nomenon, in which one tries to approximate the sign function

sgn(x) :=

−1, if x < 0,

0, if x = 0,

1, if x > 0,

on [−1, 1] by its expansion with respect to a system of orthogonal polynomi-als such as the Legendre polynomials Len(x) or the Fourier polynomials eπnx.FINISH ME!!!

8.6 Orthogonal Polynomials of Several Variables

For working with polynomials in d variables, we will use standard multi-indexnotation. Multi-indices will be denoted by Greek letters α = (α1, . . . , αd) ∈ Nd

0.For x = (x1, . . . , xd) ∈ Rd and α ∈ Nd

0, the monomial xα is defined by

xα := xα11 xα2

2 . . . xαd

d ,

and |α| := α1 + · · · + αd is called the total degree of xα. The total degree ofa polynomial p (i.e. a finite linear combination of such monomials) is denoteddeg(p) and is the maximum of the total degrees of the summands.

Given a measure µ on Rd, it is tempting to apply the Gram–Schmidt processwith respect to the inner product

〈f, g〉L2(µ) :=

Rd

f(x)g(x) dµ(x)

to the monomials xα | α ∈ Nd0 to obtain a system of orthogonal polynomials

for the measure µ. However, there is an immediate problem, in that orthogonalpolynomials of several variables are not unique. In order to apply the Gram–Schmidt process, we need to give a linear order to multi-indices α ∈ Nd

0. Thereare many choices of well-defined total order (for example, the lexicographic orderor the graded lexicographic order); but there is no natural choice and differentorders will give different sequences of orthogonal polynomials. Instead of fixingsuch a total order, we relax Definition 8.1 slightly:

Page 105: MA4K0 Introduction to Uncertainty Quantificationhomepages.warwick.ac.uk/~masbm/ClimateCourse/... · uncertainty into two types, aleatoric and epistemic uncertainty. Aleatoric un-certainty

—DRAFT—

BIBLIOGRAPHY 97

Definition 8.20. Let µ be a non-negative measure on Rd. A family of polyno-mials Q = qα | α ∈ Nd

0 is called an orthogonal system of polynomials if qα issuch that

〈qα, p〉L2(µ) = 0 for all p(x) ∈ R[x1, . . . , xd] with deg(p) < |α|.

Hence, in the many-variables case, an orthogonal polynomial of total degreen, while it is required to be orthogonal to all polynomials of strictly lower totaldegree, may be non-orthogonal to other polynomials of the same total degree n.However, the meaning of orthonormality is unchanged: a system of polynomialsPα | α ∈ Nd

0 is orthonormal if

〈Pα, Pβ〉L2(µ) = δαβ .

Bibliography

The classic reference on orthogonal polynomials is the 1939 monograph of Szego[100]. An excellent more modern reference is the book of Gautschi [33]; sometopics covered in that book that are not treated here include. . .FINISH ME!!!

Many important properties of orthogonal polynomials, and standard exam-ples, are given in Chapter 22 of Abramowitz & Stegun [1].

Exercises

Exercise 8.1. Show that the L2(R, µ) inner product is positive definite on thespace of polynomials if the measure µ has infinite support.

Exercise 8.2. Show that if µ has finite moments only of degrees 0, 1, . . . , r,then µ admits only a finite system of orthogonal polynomials q0, . . . , qd, whered = ⌊r/2⌋.

Exercise 8.3. Define a Borel measure µ on R by

dx(x) =

1

π

1

1 + x2.

Show that µ is a probability measure, that dimL2(R, µ;R) = ∞, find all or-thogonal polynomials for µ, and explain your results.

Exercise 8.4. Calculate the orthogonal polynomials of Table 8.2 by hand fordegree p ≤ 5, and write a numerical program to compute them for higherdegree.

Exercise 8.5 (Spectral convergence of Hermite expansions). Let γ = N (0, 1)be standard Gaussian measure on R. First establish the integration-by-partsformula ∫

R

f(x)g′(x) dγ(x) = −∫

R

(f ′(x) − xf(x))g(x) dγ(x).

Using this, and the fact that the Hermite polynomials satisfy

(d2

dx2− x

d

dx

)Hen(x) = −nHen(x), for n ∈ N0,

Page 106: MA4K0 Introduction to Uncertainty Quantificationhomepages.warwick.ac.uk/~masbm/ClimateCourse/... · uncertainty into two types, aleatoric and epistemic uncertainty. Aleatoric un-certainty

—DRAFT—

98 CHAPTER 8. ORTHOGONAL POLYNOMIALS

mimic the proof of Theorem 8.19 to show that there is a constant C ≥ 0 that maydepend upon k but is independent of d and f such that, for all f ∈ Hk(R, γ), fand its degree d expansion in the Hermite orthogonal basis of L2(R, γ) satisfy

‖f −Πdf‖L2(γ) ≤ Cd−k/2‖f‖Hk(γ).

Exercise 8.6 (Spectral convergence of Laguerre expansions). Let dµ(x) = e−x dx,for which the orthogonal polynomials are the Laguerre polynomials Lan, n ∈ N0.Establish an integration-by-parts formula for µ and then use this and the fact

that Lan is an eigenfunction for x d2

dx2 + (1− x) ddx with eigenvalue −n to prove

the analogue of Exercise 8.5 but for Laguerre expansions.

Page 107: MA4K0 Introduction to Uncertainty Quantificationhomepages.warwick.ac.uk/~masbm/ClimateCourse/... · uncertainty into two types, aleatoric and epistemic uncertainty. Aleatoric un-certainty

—DRAFT—

Chapter 9

Numerical Integration

A turkey is fed for a 1000 days — ev-ery day confirms to its statistical depart-ment that the human race cares about itswelfare “with increased statistical signif-icance”. On the 1001st day, the turkeyhas a surprise.

The Fourth Quadrant: A Map of the

Limits of Statistics

Nassim Taleb

The topic of this chapter is the numerical (i.e. approximate) evaluation ofdefinite integrals. Such methods of numerical integration will be essential if theexpectations the UQ methods of later chapters — with their many expectations— are to be implemented in a practical manner.

The topic of integration has a long history, as one of the twin pillars ofcalculus, and was historically also known as quadrature. Nowadays, quadratureusually refers to a particular method of numerical integration.

9.1 Quadrature in One Dimension

This section concerns the numerical integration of a real-valued function f withrespect to a measure µ on a sub-interval I ⊆ R, and to do so by sampling thefunction at pre-determined points of I and taking a suitable weighted average.That is, the aim is to construct an approximation of the form

I

f(x) dµ(x) ≈ Q(f) :=

K∑

k=1

wkf(xk),

with prescribed points x1, . . . , xK ∈ I called nodes and scalars w1, . . . , wK ∈ R

called weights. The approximationQ(f) is called a quadrature formula. The aimis to choose nodes and weights wisely, so that the quality of the approximation∫I f dµ ≈ Q(f) is good for a large class of integrands f . One measure of thequality of the approximation is the following:

Page 108: MA4K0 Introduction to Uncertainty Quantificationhomepages.warwick.ac.uk/~masbm/ClimateCourse/... · uncertainty into two types, aleatoric and epistemic uncertainty. Aleatoric un-certainty

—DRAFT—

100 CHAPTER 9. NUMERICAL INTEGRATION

Definition 9.1. A quadrature formula is said to have order of accuracy n ∈ N0

if∫If dµ = Q(f) whenever f is a polynomial of degree at most n.

A quadrature formula Q(f) =∑K

k=1 wkf(xk) can be identified with the

discrete measure∑K

k=1 wkδxk. If some of the weights wk are negative, then this

measure is a signed measure. This point of view will be particularly useful whenconsidering multi-dimensional quadrature formulas. Regardless of the signatureof the weights, the following limitation on the accuracy of quadrature formulasis fundamental:

Lemma 9.2. Let I ⊆ R be any interval. Then no quadrature formula with ndistinct nodes in I can have order of accuracy 2n or greater.

Proof. Let x1, . . . , xn ⊆ I be any set of n distinct points, and let w1, . . . , wnbe any set of weights. Let f be the degree-2n polynomial f(x) :=

∏nj=1(x−xj)2,

i.e. the square of the nodal polynomial. Then

I

f(x) dx > 0 =

n∑

j=1

wjf(xj),

since f vanishes at each node xj . Hence, the quadrature formula is not exactfor polynomials of degree 2n.

The first, simplest, quadrature formulas to consider are those in which thenodes form an equally-spaced discrete set of points in [a, b]. Many of thesequadrature formulas may be familiar from high-school mathematics.

Definition 9.3 (Midpoint rule). The midpoint quadrature formula has the singlenode x1 := b−a

2 and the single weight w1 := ρ(x1)|b − a|. That is, it is theapproximation

∫ b

a

f(x)ρ(x) dx ≈ I1(f) := f

(b− a

2

(b− a

2

)|b− a|.

Another viewpoint on the midpoint rule is that it is the approximation of theintegrand f by the constant function with value f

(b−a2

). The next quadrature

formula, on the other hand, amounts to the approximation of f by the affinefunction

x 7→ f(a) +x− a

b− a(f(b)− f(a))

that equals f(a) at a and f(b) at b.

Definition 9.4 (Trapezoidal rule). The midpoint quadrature formula has the

nodes x1 := a and x2 := b and the weights w1 := ρ(a) |b−a|2 and w2 := ρ(b) |b−a|

2 .That is, it is the approximation

∫ b

a

f(x)ρ(x) dx ≈ I2(f) := (f(a)ρ(a) + f(b)ρ(b))|b− a|

2.

Recall the definition of the Lagrange interpolation polynomial L for a set ofnodes and values from Definition 8.15. The midpoint and trapezoidal quadratureformulas amount to approximating f by a Lagrange interpolation polynomial L

of degree 0 or 1 and hence approximating∫ b

a f(x) dx by∫ b

a L(x) dx. The generalsuch construction is the following:

Page 109: MA4K0 Introduction to Uncertainty Quantificationhomepages.warwick.ac.uk/~masbm/ClimateCourse/... · uncertainty into two types, aleatoric and epistemic uncertainty. Aleatoric un-certainty

—DRAFT—

9.2. GAUSSIAN QUADRATURE 101

Definition 9.5 (Newton–Cotes formula). Consider K + 1 equally-spaced points

a = x0 < x1 = x0 + h < x2 = x0 + 2h < · · · < xK = b,

where h = 1K . The closed Newton–Cotes quadrature formula is the quadrature

formula that arises from approximating f by the Lagrange interpolating poly-nomial L that runs through the points (xk, f(xk))

Kk=0; the open Newton–Cotes

quadrature formula is the quadrature formula that arises from approximatingf by the Lagrange interpolating polynomial L that runs through the points(xk, f(xk))

K−1k=1 .

Proposition 9.6. The weights for the closed Newton–Cotes quadrature formulaare given by

wj =

∫ b

a

ℓj(x) dx.

Proof. Let L be as in the definition of the Newton–Cotes rule. Then

∫ b

a

f(x) dx ≈∫ b

a

L(x) dx

=

∫ b

a

K∑

j=0

f(xj)ℓj(x) dx

=K∑

j=0

f(xj)

∫ b

a

ℓj(x) dx

as claimed.

The midpoint rule is the open Newton–Cotes quadrature formula on threepoints; the trapezoidal rule is the closed Newton–Cotes quadrature formula ontwo points.

The quality of Newton–Cotes quadrature formulas can be very poor, essen-tially because Runge’s phenomemon can make the quality of the approximationf ≈ L very poor. The weakness of Newton–Cotes quadrature can be seen an-other way: it is possible for the Newton–Cotes weights to be negative, and anyquadrature formula with weights of both signs is prone to errors. For example,the positive-definite function f that linearly interpolates the values

??? at xk

has Q(f) = 0. FINISH ME!!!

9.2 Gaussian Quadrature

Gaussian quadrature is a powerful method for numerical integration in whichboth the nodes and the weights are chosen so as to maximize the order of accu-racy of the quadrature formula. Remarkably, by the correct choice of n nodesand weights, the quadrature formula can be made accurate for all polynomialsof degree at most 2n− 1. Moreover, the weights in this quadrature formula areall positive, and so the quadrature formula is stable even for high n.

Page 110: MA4K0 Introduction to Uncertainty Quantificationhomepages.warwick.ac.uk/~masbm/ClimateCourse/... · uncertainty into two types, aleatoric and epistemic uncertainty. Aleatoric un-certainty

—DRAFT—

102 CHAPTER 9. NUMERICAL INTEGRATION

See [108, Ch. 37].The objective is to approximate a definite integral

∫ b

a

f(x)w(x) dx,

where w : [a, b] → (0,+∞) is a fixed weight function, by a finite sum

In(f) :=n∑

j=1

wjf(xj),

where the nodes x1, . . . , xn and weights w1, . . . , wn will be chosen appropriately.Let q0, q1, . . . be a system of orthogonal polynomials with respect to the

weight function w. That is,

∫ b

a

f(x)qn(x)w(x) dx = 0

whenever f is a polynomial of degree at most n − 1. Let the nodes x1, . . . , xnbe the zeros of qn; by Theorem 8.14, qn has n distinct roots in [a, b]. Define theassociated weights by

wj :=anan−1

∫ b

a qn−1(x)2w(x) dx

q′n(xj)qn−1(xj),

where ak is the coefficient of xk in qk(x).Orthogonal polynomials for quadrature formulas can be found in Abramowitz

& Stegun [1, §25.4]1. The nodes actually determine the weights as above.2. Those weights are positive. DONE3. Order of accuracy 2n− 1. DONE4. Error estimate [95, Th. 3.6.24]: for f ∈ C2n,

∫ b

a

f(x)ρ(x) dx −QK(f) =f (2n)(ξ)

(2n)!‖pn‖2ρ

Definition 9.7. The n-point Gauss–Legendre quadrature formula is the quadra-ture formula with nodes x1, . . . , xn given by the zeros of the Legendre poly-nomial qn+1. The nodes are sometimes called Gauss points.

Theorem 9.8. The n-point Gauss quadrature formula has order of accuracyexactly 2n − 1, and no quadrature formula has order of accuracy higher thanthis.

Proof. Lemma 9.2 shows that no quadrature formula can have order of accuracygreater than 2n− 1.

On the other hand, suppose that p is any polynomial of degree at most2n− 1. Factor this polynomial as

p(x) = g(x)qn+1(x) + r(x),

where g is a polynomial of degree at most n− 1, and the remainder r is also apolynomial of degree at most n− 1. Since qn+1 is orthogonal to all polynomials

Page 111: MA4K0 Introduction to Uncertainty Quantificationhomepages.warwick.ac.uk/~masbm/ClimateCourse/... · uncertainty into two types, aleatoric and epistemic uncertainty. Aleatoric un-certainty

—DRAFT—

9.2. GAUSSIAN QUADRATURE 103

of degree at most n,∫ b

agqn+1 dµ = 0. However, since g(xj)qn+1(xj) = 0 for

each node xj ,

In(gqn+1) =

n∑

j=1

wjg(xj)qn+1(xj) = 0.

Since∫ b

a · dµ and Qn( · ) are both linear operators,∫ b

a

f dµ =

∫ b

a

r dµ and Qn(f) = Qn(r).

Since r is of degree at most n− 1,∫ b

ar dµ = In(r), and so

∫ b

af dµ = Qn(f), as

claimed.

Theorem 9.9. The Gauss weights are given by

wj =anan−1

∫ b

a qn−1(x)2w(x) dx

q′n(xj)qn−1(xj),

where ak is the coefficient of xk in qk(x).

Proof. Suppose that p is any polynomial of degree at most 2n− 1. Factor thispolynomial as

p(x) = g(x)qn+1(x) + r(x),

where g is a polynomial of degree at most n− 1, and the remainder r is also apolynomial of degree at most n − 1. Using Lagrange basis polynomials, writer =

∑ni=1 r(xi)ℓi, so that

∫ b

a

r dµ =n∑

i=1

r(xi)

∫ b

a

ℓi dµ.

Since the Gauss quadrature formula is exact for r, it follows that the Gaussweights satisfy

wi =

∫ b

a

ℓi dµ

.........FINISH ME!!!

Theorem 9.10. The weights of the Gauss quadrature formula are all positive.

Proof. Fix 1 ≤ i ≤ n and consider the polynomial

p(x) :=∏

1≤j≤nj 6=i

(x− xj)2

i.e. the square of the nodal polynomial, divided by (x−xi)2. Since the degree ofp is strictly less than 2n− 1, the Gauss quadrature formula is exact, and sincep vanishes at every node other than xi, it follows that

I

p dµ =

n∑

j=1

wjp(xj) = wip(xi).

Since µ is a non-negative measure, p ≥ 0 everywhere, and p(xi) > 0, it followsthat wi > 0.

Page 112: MA4K0 Introduction to Uncertainty Quantificationhomepages.warwick.ac.uk/~masbm/ClimateCourse/... · uncertainty into two types, aleatoric and epistemic uncertainty. Aleatoric un-certainty

—DRAFT—

104 CHAPTER 9. NUMERICAL INTEGRATION

9.3 Clenshaw–Curtis / Fejer Quadrature

Despite its optimal degree of polynomial exactness, Gaussian quadrature hassome major drawbacks in practice. One principal drawback is that, by The-orem 8.14, the Gaussian quadrature nodes are never nested — that is, if onewishes to increase the accuracy of the numerical integral by passing from using,say, n nodes to 2n nodes, then none of the first n nodes will be re-used. Ifevaluations of the integrand are computationally expensive, then this lack ofnesting is a serious concern. Another drawback of Gaussian quadrature on nnodes is the cost of computing the weights, which is O(n2). By contrast, theClenshaw–Curtis quadrature rules [19] (although in fact discovered thirty yearspreviously by Fejer [29]) are nested quadrature rules, with accuracy comparableto Gaussian quadrature in many circumstances, and with weights that can becomputed with cost O(n logn).

f(cos θ) =a02

+

∞∑

k=1

ak cos(kθ)

where

ak =2

π

∫ π

0

f(cos θ) cos(kθ) dθ

The cosine series expansion of f is also a Chebyshev polynomial expansion off , since by construction Tk(cos θ) = cos(kθ):

f(x) =a02T0(x) +

∞∑

k=1

akTk(x)

∫ 1

−1

f(x) dx =

∫ π

0

f(cos θ) sin θ dθ

= a0 +

∞∑

k=1

2a2k1− (2k)2

9.4 Quadrature in Multiple Dimensions

Having established quadrature formulæfor integrals with a one-dimensional do-main of integration, the next agendum is to produce quadrature formulas formulti-dimensional (i.e. iterated) integrals of the form

∫∏

dj=1[aj ,bj ]

f(x) dx =

∫ bd

ad

. . .

∫ b1

a1

f(x1, . . . , xd) dx1 . . .dxd.

Tensor Product Quadrature Formulæ. The first, obvious, strategy to try isto treat d-dimensional integration as a succession of d one-dimensional integralsand apply our favourite one-dimensional quadrature formula d times. This isthe idea underlying tensor product quadrature formulas, and it has one majorflaw: if the one-dimensional quadrature formula uses N nodes, then the tensorproduct rule uses Nd nodes, which very rapidly leads to an impractically largenumber of integrand evaluations for even moderately large values of N and d. In

Page 113: MA4K0 Introduction to Uncertainty Quantificationhomepages.warwick.ac.uk/~masbm/ClimateCourse/... · uncertainty into two types, aleatoric and epistemic uncertainty. Aleatoric un-certainty

—DRAFT—

9.5. MONTE CARLO METHODS 105

general, when the one-dimensional quadrature formula uses N nodes, the errorfor an integrand in Cr using a tensor product rule is O(N−r/d).

Sparse Quadrature Formulæ. The curse of dimension, which quickly renderstensor product quadrature formulae impractical in high dimension, spurs theconsideration of sparse quadrature formulas, in which far fewer than Nd nodesare used, at the cost of some accuracy in the quadrature formula: in practice, weare willing to pay the price of loss of accuracy in order to get any answer at all!One example of a popular sparse quadrature rule is the recursive construction ofSmolyak sparse grids, which is particularly useful when combined with a nestedone-dimensional quadrature rule such as the Clenshaw–Curtis rule.

Assume that we are given, for each ℓ ∈ N, we are given a one-dimensional

quadrature formula Q(1)ℓ . The formula for Smolyak quadrature in dimension

d ∈ N at level ℓ ∈ N is defined in terms of the lower-dimensional quadratureformalæby

Q(d)ℓ (f) :=

(ℓ∑

i=1

(Q

(1)i −Q

(1)i−1

)⊗Q

(d−1)ℓ−i+1

)(f)

This formula takes a little getting used to, and it helps to first consider thecase d = 2 and a few small values of ℓ. First, for ℓ = 1, Smolyak’s rule is thequadrature formula

Q(2)1 = Q

(1)1 ⊗Q

(1)1 ,

i.e. the full tensor product of the one-dimensional quadrature formula Q(1)1 with

itself. For the next level, ℓ = 2, Smolyak’s rule is

Q(2)2 =

2∑

i=1

(Q

(1)i −Q

(1)i−1

)⊗Q

(1)ℓ−i+1

= Q(1)1 ⊗Q

(1)2 +

(Q

(1)2 −Q

(1)1

)⊗Q

(1)1

= Q(1)1 ⊗Q

(1)2 +Q

(1)2 ⊗Q

(1)1 −Q

(1)1 ⊗Q

(1)1 .

The “−Q(1)1 ⊗Q

(1)1 ” term is included to avoid double counting. See Figure 9.1

for an illustration of nodes of the Smolyak construction in the case that the

one-dimensional quadrature formula Q(d)ℓ has 2ℓ − 1 equally-spaced nodes.

In general, when the one-dimensional quadrature formula at level ℓ uses Nℓ

nodes, the quadrature error for an integrand in Cr using Smolyak recursion isO(N−r

ℓ (logNℓ)(d−1)(r+1)).

Remark 9.11. The right Sobolev space for studying sparse grids, since we needpointwise evaluation, is H1

mix, in which functions are weakly differentiable ineach coordinate direction.

9.5 Monte Carlo Methods

As seen above, tensor product quadrature formulas suffer from the curse of di-mensionality: they require exponentially many evaluations of the integrand as afunction of the dimension of the integration domain. Sparse grid constructions

Page 114: MA4K0 Introduction to Uncertainty Quantificationhomepages.warwick.ac.uk/~masbm/ClimateCourse/... · uncertainty into two types, aleatoric and epistemic uncertainty. Aleatoric un-certainty

—DRAFT—

106 CHAPTER 9. NUMERICAL INTEGRATION

b

(a) ℓ = 1

bb b

b

b

(b) ℓ = 2

bb b

b

b

b

b

b

b

b b b b

b b

bb

(c) ℓ = 3

Figure 9.1: Illustration of the nodes of the 2-dimensional Smolyak sparse quadra-

ture formulas Q(2)ℓ for levels ℓ = 1, 2, 3, in the case that the 1-dimensional

quadrature formula Q(1)ℓ has 2ℓ − 1 equally-spaced nodes in the interior of the

domain of integration, i.e. is an open Newton–Cotes formula.

only partially alleviate this problem. Remarkably, however, the curse of dimen-sionality can be entirely circumvented by resorting to random sampling of theintegration domain — provided, of course, that it is possible to draw samplesfrom the measure against which the integrand is to be integrated.

Monte Carlo methods are, in essence, an application of the law of largenumbers (LLN). Recall that the LLN states that if X(1), X(2), . . . is a sequenceof independent samples from a random variable X with finite expectation E[X ],then the sample average

1

K

K∑

k=1

X(k)

converges to E[X ] as K → ∞. The weak LLN states that the mode of conver-gence is convergence in probability:

for all ε > 0, limK→∞

P

[∣∣∣∣∣1

K

K∑

k=1

X(k) − E[X ]

∣∣∣∣∣ > ε

]= 0;

the strong LLN (which is harder to prove than the weak LLN) states that themode of convergence is actually almost sure:

P

[lim

K→∞

1

K

K∑

k=1

X(k) = E[X ]

]= 1.

‘Vanilla’ Monte Carlo.

Theorem 9.12 (Birkhoff–Khinchin ergodic theorem). Let T : Θ → Θ be a measure-preserving map of a probability space (Θ,F , µ), and let f ∈ L1(Θ, µ;R). Then,for µ-almost every θ ∈ Θ,

limK→∞

1

K

K−1∑

k=0

f(T kθ) = Eµ[f |GT ],

Page 115: MA4K0 Introduction to Uncertainty Quantificationhomepages.warwick.ac.uk/~masbm/ClimateCourse/... · uncertainty into two types, aleatoric and epistemic uncertainty. Aleatoric un-certainty

—DRAFT—

9.6. PSEUDO-RANDOM METHODS 107

where GT is the σ-algebra of T -invariant sets. Hence, if T is ergodic, then

limK→∞

1

K

K−1∑

k=0

f(T kθ) = Eµ[f ] µ-a.s.

To obtain an error estimate for Monte Carlo integrals, we simply applyChebyshev’s inequality to SK := 1

K

∑Kk=1X

(k), which has E[SK ] = E[X ] and

V[SK ] =1

K2

K∑

k=1

V[X ] =V[X ]

K,

to obtain that, for any t ≥ 0,

P[|SK − E[X ]| ≥ t] ≤ V[X ]

Kt2.

That is, for any ε ∈ (0, 1], with probability at least 1 − ε with respect to theK Monte Carlo samples, the Monte Carlo average SK lies within (V[X ]/Kε)1/2

of the true expected value E[X ]. The fact that the error decays like K−1/2,i.e. slowly, is a major limitation of ‘vanilla’ Monte Carlo methods; it is undesir-able to have to quadruple the number of samples to double the accuracy of theapproximate integral.

CDF Inversion One obvious criticism of Monte Carlo integration as presentedabove is the accessibility of the measure of integration µ. Even leaving asidethe sensitive topic of the generation of truly ‘random’ numbers, it is no easymatter to draw random numbers from an arbitrary probability measure on R.The uniform measure on an interval may be said to be easily accessible; ρ dx,for some positive and integrable function ρ, is not.

Fν(x) :=

(−∞,x]

X ∼ Unif([0, 1]) =⇒ F−1ν (X) ∼ ν

Importance Sampling

Markov Chain Monte Carlo Markov chain Monte Carlo (MCMC) methodsare a class of algorithms for sampling from a probability distribution µ basedon constructing a Markov chain that has µ as its equilibrium distribution. Thestate of the chain after a large number of steps is then used as a sample of µ.The quality of the sample improves as a function of the number of steps. Usuallyit is not hard to construct a Markov chain with the desired properties; the moredifficult problem is to determine how many steps are needed to converge to µwithin an acceptable error.

9.6 Pseudo-Random Methods

This chapter concludes with a very brief survey of numerical integration methodsthat are in fact based upon deterministic sampling, but in such a way as thesample points ‘might as well be’ random.

Niederreiter [71]

Page 116: MA4K0 Introduction to Uncertainty Quantificationhomepages.warwick.ac.uk/~masbm/ClimateCourse/... · uncertainty into two types, aleatoric and epistemic uncertainty. Aleatoric un-certainty

—DRAFT—

108 CHAPTER 9. NUMERICAL INTEGRATION

Definition 9.13. The discrepancy:

DN (P ) := supB∈J

∣∣∣∣#(P ∩B)

N− λd(B)

∣∣∣∣

where J is the collection of all products of the form∏d

i=1[ai, bi), with 0 ≤ ai <bi ≤ 1. The star-discrepancy:

D∗N(P ) := sup

B∈J ∗

∣∣∣∣#(P ∩B)

N− λd(B)

∣∣∣∣

where J ∗ is the collection of all products of the form∏d

i=1[0, bi), with 0 ≤ bi < 1.

Lemma 9.14.D∗

N ≤ DN ≤ 2dD∗N .

Definition 9.15. Let f : [0, 1]d → R. If J ⊆ [0, 1]d is a subrectangle of [0, 1]d,i.e. a d-fold product of subintervals of [0, 1], let ∆J(f) be the sum of the valuesof f at the 2d vertices of J , with alternating signs at nearest-neighbour vertices.The Vitali variation of f : [0, 1]d → R is defined to be

V Vit(f) := sup

J∈Π

|∆J(f)|∣∣∣∣∣

Π is a partition of [0, 1]d into finitelymany non-overlapping subrectangles

For 1 ≤ s ≤ d, the Hardy–Krause variation of f is defined to be

V HK(f) :=∑

F

V VitF (f),

where the sum runs over all faces F of [0, 1]d having dimension at most s.

Theorem 9.16 (Koksma’s inequality). If f : [0, 1] → R has bounded (total) vari-ation, then, for any x1, . . . , xN ⊆ [0, 1],

∣∣∣∣∣1

N

N∑

i=1

f(xi)−∫

[0,1]df(x) dx

∣∣∣∣∣ ≤ V (f)D∗N (x1, . . . , xN ).

Theorem 9.17 (Koksma–Hlakwa Inequality). Let f : [0, 1]d → R have boundedHardy–Krause variation. Then, for any x1, . . . , xN ⊆ [0, 1)N ,

∣∣∣∣∣1

N

N∑

i=1

f(xi)−∫

[0,1]df(x) dx

∣∣∣∣∣ ≤ V[0,1]d(f)D∗N (x1, . . . , xN ).

Furthermore, this bound is sharp in the sense that, for every x1, . . . , xN ⊆[0, 1)N and every ε > 0, there exists f : [0, 1]d → R with V (f) = 1 such that

∣∣∣∣∣1

N

N∑

i=1

f(xi)−∫

[0,1]df(x) dx

∣∣∣∣∣ > D∗N (x1, . . . , xN )− ε.

Halton’s sequence: [37]Sobol′ sequence: [91]

Page 117: MA4K0 Introduction to Uncertainty Quantificationhomepages.warwick.ac.uk/~masbm/ClimateCourse/... · uncertainty into two types, aleatoric and epistemic uncertainty. Aleatoric un-certainty

—DRAFT—

BIBLIOGRAPHY 109

Bibliography

At Warwick, Monte Carlo integration and related topics are covered in the mod-Wule ST407 Monte Carlo Methods. See also Robert & Casella [81] for a surveyof MC methods in statistics.

Orthogonal polynomials for quadrature formulas can be found in Section 25.4of Abramowitz & Stegun [1]. Gautschi’s general monograph [33] on orthogonalpolynomials covers applications to Gaussian quadrature in Section 3.1. Thearticle [107] compares the Gaussian and Clenshaw–Curtis quadrature rules andexplains their similar accuracy in many circumstances.

Smolyak recursion was introduced in [90].

Exercises

Exercise 9.1. Determine the weights for the open Newton–Cotes quadratureformula. (Cf. Proposition 9.6.)

Exercise 9.2 (Takahasi–Mori (tanh–sinh) Quadrature [101]). Consider a definite

integral over [−1, 1] of the form∫ 1

−1f(x) dx. Employ a change of variables

x = ϕ(t) := tanh(π2 sinh(t)) to convert this to an integral over the real line. Leth > 0 and K ∈ N, and approximate this integral over R using 2K + 1 pointsequally spaced from −Kh to Kh to derive a quadrature rule

∫ 1

−1

f(x) dx ≈ Qh,K(f) :=

k=K∑

k=−K

wkf(xk),

where xk := tanh(π2 sinh(kh)),

and wk :=π2h cosh(kh)

cosh2(π2 sinh(kh)).

How are these nodes distributed in [−1, 1]? If f is bounded, then what rate ofdecay does f ϕ have? Hence, why is excluding the nodes xk with |k| > K areasonable approximation?

Page 118: MA4K0 Introduction to Uncertainty Quantificationhomepages.warwick.ac.uk/~masbm/ClimateCourse/... · uncertainty into two types, aleatoric and epistemic uncertainty. Aleatoric un-certainty

—DRAFT—

110 CHAPTER 9. NUMERICAL INTEGRATION

Page 119: MA4K0 Introduction to Uncertainty Quantificationhomepages.warwick.ac.uk/~masbm/ClimateCourse/... · uncertainty into two types, aleatoric and epistemic uncertainty. Aleatoric un-certainty

—DRAFT—

Chapter 10

Sensitivity Analysis andModel Reduction

Le doute n’est pas un etat bien agreable,mais l’assurance est un etat ridicule.

Voltaire

The topic of this chapter is sensitivity analysis, which may be broadly un-derstood as understanding how f(x1, . . . , xn) depends upon variations not onlyin the xi individually, but also combined or correlated effects among the xi

10.1 Model Reduction for Linear Models

Suppose that the model mapping inputs x ∈ Rn to outputs y = f(x) ∈ Rm isactually a linear map, and so can be represented by a matrixA ∈ Rm×n. There isessentially only one method for the dimensional reduction of such linear models,the singular value decomposition (SVD).

Theorem 10.1 (Singular value decomposition). For every matrix A ∈ Cm×n,there exist unitary matrices U ∈ Cm×m and V ∈ Cn×n and a diagonal matrixΣ ∈ Rm×n

≥0 such thatA = UΣV ∗.

The columns of U are called the left singular vectors of A; the columnsof V are called the right singular vectors of A; and the diagonal entries ofΣ are called the singular values of A. While the singular values are unique,the singular vectors may fail to be. By convention, the singular values andcorresponding singular vectors are ordered so that the singular values form adecreasing sequence

σ1 ≥ σ2 ≥ · · · ≥ σminm,n ≥ 0.

Thus, the SVD is a decomposition of A into a sum of rank-1 operators:

A = UΣV ∗ =

minm,n∑

j=1

σjuj ⊗ vj =

minm,n∑

j=1

σjuj〈vj , · 〉.

Page 120: MA4K0 Introduction to Uncertainty Quantificationhomepages.warwick.ac.uk/~masbm/ClimateCourse/... · uncertainty into two types, aleatoric and epistemic uncertainty. Aleatoric un-certainty

—DRAFT—

112 CHAPTER 10. SENSITIVITY ANALYSIS AND MODEL REDUCTION

The appeal of the SVD is that it is numerically stable, and that it providesoptimal low-rank approximation of linear operators: if Ak ∈ Rm×n is definedby

Ak :=

k∑

j=1

σjuj ⊗ vj ,

then Ak is the optimal rank-k approximation to A in the sense that

‖A−Ak‖2 = min

‖A−X‖2

∣∣∣∣X ∈ Rm×n andrank(X) ≤ k

,

where ‖ · ‖2 denotes the operator 2-norm on matrices.Chapter 11 contains an important application of the SVD to the analysis of

sample data from random variables, an discrete variant of the Karhunen–Loeveexpansion. Simply put, when A is a matrix whose columns are independentsamples from some some stochastic process (random vector), the SVD of A isthe ideal way to fit a linear structure to those data points. One may considernonlinear fitting and dimensionality reduction methods in the same way, andthis is known as manifold learning: see, for instance, the IsoMap algorithm ofTenenbaum & al. [104].

10.2 Derivatives

A natural first way to understand the dependence of f(x1, . . . , xn) upon x1, . . . , xnnear some nominal point x∗ = (x∗1, . . . , x

∗n) is to estimate the partial derivatives

of f at x∗, i.e. to approximate

∂f

∂xi(x∗) := lim

h→0

f(x∗1, . . . , x∗i + h, . . . , x∗n)− f(x∗)

h.

Approximate by e.g.∂f

∂xi(x∗) ≈

Ultimately boils down to polynomial approximation f ≈ p, f ′ ≈ p′.

10.3 McDiarmid Diameters

This section considers an ‘L∞-type’ sensitivity index that measures the sen-sitivity of a function of n variables or parameters to variations in those vari-ables/parameters individually.

Definition 10.2. The ith McDiarmid subdiameter of f :∏n

i=1 Xi → K is

Di[f ] := sup

|f(x)− f(y)|

∣∣∣∣x, y ∈∏n

i=1 Xi

xj = yj for j 6= i

= sup

|f(x1, . . . , xi, . . . , xn)− f(x1, . . . , x

′i, . . . , xn)|

∣∣∣∣xj ∈ Xj for j = 1, . . . , n

x′i ∈ Xi

.

The McDiarmid diameter of f is

D[f ] :=

√√√√n∑

i=1

Di[f ]2.

Page 121: MA4K0 Introduction to Uncertainty Quantificationhomepages.warwick.ac.uk/~masbm/ClimateCourse/... · uncertainty into two types, aleatoric and epistemic uncertainty. Aleatoric un-certainty

—DRAFT—

10.3. MCDIARMID DIAMETERS 113

Remark 10.3. Note that although the two definitions of Di[f ] given above areobviously mathematically equivalent, they are very different from a computa-tional point of view: the first formulation is ‘obviously’ a constrained optimiza-tion problem in 2n variables with n − 1 constraints (i.e. ‘difficult’), whereasthe second formulation is ‘obviously’ an unconstrained optimization problem inn+ 1 variables (i.e. ‘easy’).

Lemma 10.4. For each j = 1, . . . , n, Dj [ · ] is a semi-norm on the space ofbounded functions f : X → K, as is D[ · ].

Proof. Exercise 10.1.

The McDiarmid subdiameters and diameter are useful not only as sensitivityindices, but also for providing a rigorous upper bound on deviations of a functionof independent random variables from its mean value:

Theorem 10.5 (McDiarmid’s bounded differences inequality). Let X = (X1, . . . , Xn)be any random variable with independent components taking values in X =∏n

i=1 Xi, and let f : X → R be absolutely integrable with respect to the law of Xand have finite McDiarmid diameter D[f ]. Then, for any t ≥ 0,

P[f(X) ≥ E[f(X)] + t

]≤ exp

(− 2t2

D[f ]2

),

P[f(X) ≤ E[f(X)]− t

]≤ exp

(− 2t2

D[f ]2

),

P[|f(X)− E[f(X)]| ≥ t

]≤ 2 exp

(− 2t2

D[f ]2

).

Corollary 10.6 (Hoeffding’s inequality). Let X = (X1, . . . , Xn) be a random vari-able with independent components, taking values in the cuboid

∏ni=1[ai, bi]. Let

Sn := 1n

∑ni=1Xi. Then, for any t ≥ 0,

P[Sn − E[Sn] ≥ t

]≤ exp

(− −2n2t2∑n

i=1(bi − ai)2

),

and similarly for deviations below, and either side, of the mean.

McDiarmid’s and Hoeffding’s inequalities are just two examples of a broadfamily of inequalities known as concentration of measure inequalities. Roughlyput, the concentration of measure phenomenon, which was first noticed by Levy[61], is the fact that a function of a high-dimensional random variable with manyindependent (or weakly correlated) components has its values overwhelminglyconcentrated about the mean (or median). An inequality such as McDiarmid’sprovides a rigorous certification criterion: to be sure that f(X) will deviateabove its mean by more than t with probability no greater than ε ∈ [0, 1], itsufficies to show that

exp

(− 2t2

D[f ]2

)≤ ε

i.e.

D[f ] ≤ t

√2

log ε−1.

Page 122: MA4K0 Introduction to Uncertainty Quantificationhomepages.warwick.ac.uk/~masbm/ClimateCourse/... · uncertainty into two types, aleatoric and epistemic uncertainty. Aleatoric un-certainty

—DRAFT—

114 CHAPTER 10. SENSITIVITY ANALYSIS AND MODEL REDUCTION

Experimental effort then revolves around determining E[f(X)] and D[f ]; giventhose ingredients, the certification criterion is mathematically rigorous. Thatsaid, it is unlikely to be the optimal rigorous certification criterion, because Mc-Diarmid’s inequality is not guaranteed to be sharp. The calculation of optimalprobability inequalities is considered in Chapter 14.

To proveMcDiarmid’s inequality first requires a lemma bounding the moment-generating function of a random variable:

Lemma 10.7 (Hoeffding’s lemma). Let X be a random variable with mean zerotaking values in [a, b]. Then, for t ≥ 0,

E[etX ] ≤ exp

(t2(b− a)2

8

).

Proof. By the convexity of the exponential function, for each x ∈ [a, b],

etx ≤ b− x

b− aeta +

x− a

b− aetb.

Therefore, applying the expectation operator,

E[etX ] ≤ b

b− aeta +

a

b− aetb =: eφ(t).

Observe that φ(0) = 0, φ′(0) = 0, and φ′′(t) ≤ 14 (b− a)2. Hence, since exp is an

increasing and convex function,

E[etX ] ≤ exp

(0 + 0t+

(b− a)2

4

t2

2

)= exp

(t2(b − a)2

8

),

as claimed.

Proof of McDiarmid’s inequality (Theorem 10.5). The proof uses the proper-ties of conditional expectation outlined in Example 3.14. Let Fi be the σ-algebra generated by X1, . . . , Xi, and define random variables Z0, . . . , Zn byZi := E[f(X)|Fi]. Note that Z0 = E[f(X)] and Zn = f(X). Now consider theconditional increment (Zi − Zi−1)|Fi−1. First observe that

E[Zi − Zi−1|Fi−1] = 0,

so that the sequence (Zi)i≥0 is a martingale. Secondly, observe that

Li ≤ Zi − Zi−1|Fi−1 ≤ Ui,

where

Li := infℓE[f(X)|Fi−1, Xi = ℓ]− E[f(X)|Fi−1],

Ui := supu

E[f(X)|Fi−1, Xi = u]− E[f(X)|Fi−1].

Since Ui − Li ≤ Di[f ], Hoeffding’s lemma implies that

E[es(Zi−Zi−1)

∣∣∣Fi−1

]≤ es

2Di[f ]2/8.

Page 123: MA4K0 Introduction to Uncertainty Quantificationhomepages.warwick.ac.uk/~masbm/ClimateCourse/... · uncertainty into two types, aleatoric and epistemic uncertainty. Aleatoric un-certainty

—DRAFT—

10.4. ANOVA/HDMR DECOMPOSITIONS 115

Hence, for any s ≥ 0,

P[f(X)− E[f(X)] ≥ t]

= P[es(f(X)−E[f(X)]) ≥ est

]

≤ e−stE[es(f(X)−E[f(X)])

](Markov’s inequality)

= e−stE[es

∑ni=1 Zi−Zi−1

](telescoping sum)

= e−stE[E[es

∑ni=1 Zi−Zi−1

∣∣∣Fn−1

]](tower rule)

= e−stE[es

∑n−1i=1 Zi−Zi−1E

[es(Zn−Zn−1)

∣∣∣Fn−1

]](Z0, . . . , Zn−1 are Fn−1-measurable)

≤ e−stes2Dn[f ]

2/8E[es

∑n−1i=1 Zi−Zi−1

]

by the first part of the proof. Repeating this argument a further n − 1 timesshows that

P[f(X)− E[f(X)] ≥ t] ≤ exp

(−st+ s2

8D[f ]2

).

The expression on the right-hand side is minimized by s = 4t/D[f ]2, whichyields the first of McDiarmid’s inequalities, and the others follow easily.

10.4 ANOVA/HDMR Decompositions

The topic of this section is a variance-based decomposition of a function of n vari-ables that goes by various names such as the analysis of variance (ANOVA), thefunctional ANOVA, and the high-dimensional model representation (HDMR).As before, let (Xi,Fi, µi) be a probability space for i = 1, . . . , n, and let(X ,F , µ) be the product space. Write N = 1, . . . , n, and consider a (F -measurable) function of interest f : X → R. Bearing in mind that in practicalapplications n may be large (103 or more), it is of interest to efficiently identify

• which of the xi contribute in the most dominant ways to the variations inf(x1, . . . , xn),

• how the effects of multiple xi are cooperative or competitive with oneanother,

• and hence construct a surrogate model for f that uses a lower-dimensionalset of input variables, by using only those that give rise to dominant effects.

The idea is to write f(x1, . . . , xn) as a sum of the form

f(x1, . . . , xn) = f∅ +

n∑

i=1

fi(xi) +∑

1≤i<j≤n

fi,j(xi, xj) + . . .

=∑

I⊆N

fI(xI).

Experience suggests that ‘typical real-world systems’ f exhibit only low-ordercooperativity in the effects of the input variables x1, . . . , xn. That is, the termsfI with |I| ≫ 1 are typically small, and a good approximation of f is given by,say, a second-order expansion,

f(x1, . . . , xn) ≈ f∅ +

n∑

i=1

fi(xi) +∑

1≤i<j≤n

fi,j(xi, xj).

Page 124: MA4K0 Introduction to Uncertainty Quantificationhomepages.warwick.ac.uk/~masbm/ClimateCourse/... · uncertainty into two types, aleatoric and epistemic uncertainty. Aleatoric un-certainty

—DRAFT—

116 CHAPTER 10. SENSITIVITY ANALYSIS AND MODEL REDUCTION

Note, however, that low-order cooperativity does not necessarily imply thatthere is a small set of significant variables (it is possible that fi is large formost i ∈ 1, . . . , n), not does it say anything about the linearity or non-linearityof the input-output relationship. Furthermore, there are many HDMR-typeexpansions of the form given above; orthogonality criteria can be used to selecta particular HDMR representation.

RS-HDMR / ANOVA. A long-established decomposition of this type is theanalysis of variance (ANOVA) or random sampling HDMR (RS-HDMR) de-composition. Let

f∅(x) :=

X

f dµ,

i.e. f∅ is the orthogonal projection of f onto the one-dimensional space of con-stant functions, and so it is common to abuse notation and write f∅ ∈ R. Fori = 1, . . . , n, let

fi(x) :=

X1

. . .

Xi−1

Xi+1

. . .

Xn

f dµ1 . . . dµi−1dµi+1 . . . dµn − f∅.

Note that fi(x) is actually a function of xi only, and so it is common to abusenotation and write fi(xi) instead of fi(x); fi is constant with respect to theother n− 1 variables. To take this idea further and capture cooperative effectsamong two or more xi, for I ⊆ N := 1, . . . , n, let |I| denote the cardinality ofI and let ∼ I denote the relative complement D \ I. For I = (i1, . . . , i|I|) ⊆ Nand x ∈ X , define the point xI by xI := (xi1 , . . . , xi|I|); similar notation likex∼I , XI , µI &c. should hopefully be self-explanatory.

Definition 10.8. The ANOVA decomposition or RS-HDMR of f is the sumf =

∑I⊆N fI , where the functions fI : X → R (or, by abuse of notation,

fI : XI → R) are defined recursively by

f∅(x) :=

X

f(x) dµ(x)

fI(xI) :=

X∼I

f(x) −

J(I

fJ(xJ )

dx∼I

=

X∼I

f(x) dx∼I −∑

J(I

fJ(xJ ).

Theorem 10.9 (ANOVA). Let f ∈ L2(X , µ) have variance σ2 := ‖f − f∅‖2L2 .Then

1. whenever i ∈ I,∫XifI(x) dµi(xi) = 0;

2. whenever I 6= J ,∫fI(x)fJ (x) dµ(x) = 0;

3. σ2 =∑

I⊆D σ2I , where

σ2∅ := 0,

σ2I :=

∫fI(x)

2 dx.

Page 125: MA4K0 Introduction to Uncertainty Quantificationhomepages.warwick.ac.uk/~masbm/ClimateCourse/... · uncertainty into two types, aleatoric and epistemic uncertainty. Aleatoric un-certainty

—DRAFT—

10.4. ANOVA/HDMR DECOMPOSITIONS 117

Proof. 1. First suppose that I = i. Then∫

Xi

fI dµi =

Xi

(∫

X∼I

f dµ∼i − f∅

)dµi

=

X

f dµ− f∅

= 0.

Now suppose for an induction that |I| = k and that∫XjfJ dµj = 0 for all

j ∈ J whenever |J | < |I|. Then

Xi

fI dµi =

Xi

X∼I

f dµ∼I −∑

J(I

fJ

dµi

=

Xi

X∼I

f dµ∼Idµi −∑

J(Ii/∈J

Xi

fJ dµi

FINISH ME!!!

2. FINISH ME!!!

3. This follows immediately from the orthogonality relation

X

fI(x)fJ (x) dµ(x) = 〈fI , fJ〉L2(µ) = 0 whenever I 6= J .

Cut-HDMR. In Cut-HDMR, an expansion is performed with respect to a ref-erence point x ∈ X :

f∅(x) := f(x),

fi(x) := f(x1, . . . , xi−1, xi, xi+1, . . . , xn)− f∅(x)

≡ f(xi, x∼i)− f∅(x),

fi,j(x) := f(x1, . . . , xi−1, xi, xi+1, . . . , xj−1, xj , xj+1, . . . , xn)− fi(x) − fj(x) − f∅(x)

≡ f(xi,j, x∼i,j)− f(xi, x∼i)− f(xj , x∼j)− f∅(x),

fI(x) := f(xI , x∼I)−∑

J(I

fJ(x).

Note that a component function fI of a Cut-HDMR expansion vanishes at anyx ∈ X that has a component in common with x, i.e.

fI(x) = 0 whenever xi = xi for some i ∈ I.

Hence,

fI(x)fJ (x) = 0 whenever xk = xk for some k ∈ I ∪ J .

Indeed, this orthogonality relation defines the Cut-HDMR expansion.

Page 126: MA4K0 Introduction to Uncertainty Quantificationhomepages.warwick.ac.uk/~masbm/ClimateCourse/... · uncertainty into two types, aleatoric and epistemic uncertainty. Aleatoric un-certainty

—DRAFT—

118 CHAPTER 10. SENSITIVITY ANALYSIS AND MODEL REDUCTION

HDMR Projectors. Decomposition of a function f into an HDMR expansioncan be seen as the application of a suitable sequence of orthogonal projection op-erators, and hence HDMR provides an orthogonal decomposition of L2([0, 1]n).However, in contrast to orthogonal decompositions such as Fourier and polyno-mial chaos expansions, in which L2 is decomposed into an infinite direct sum offinite-dimensional subspaces, the ANOVA/HDMR decomposition is a decompo-sition of L2 into a finite direct sum of infinite-dimensional subspaces.

To be more precise, let F be any vector space of measurable functions fromX to R. The obvious candidate for the projector P∅ is the orthogonal projectionP∅ : F → F∅, where

F∅ := f ∈ F | f(x) ≡ a for some a ∈ R and all x ∈ [0, 1]n.is the space of constant functions. For i = 1, . . . , n, Pi : F → Fi, where

Fi :=

f ∈ F

∣∣∣∣ f is independent of xj for j 6= i and

∫ 1

0

f(x) dxi = 0

and, for ∅ 6= I ⊆ N , PI : F → FI , where

FI :=

f ∈ F

∣∣∣∣ f is independent of xj for j /∈ I and, for i ∈ I,

∫ 1

0

f(x) dxi = 0

.

These linear operators PI are idempotent, commutative and mutually orthogo-nal, i.e.

PIPJf = PJPIf =

PIf, if I = J ,

0, if I 6= J ,

and form a resolution of the identity∑

I⊆N

PIf = f.

Thus, the space of functions F decomposes as the direct sum F =⊕

I⊆N FI ,

and this direct sum is orthogonal when F is a Hilbert subspace of L2(X , µ).

Sobol′ Sensitivity Indices. The decomposition of the variance given by anHDMR / ANOVA decomposition naturally gives rise to a set of sensitivityindices for ranking the most important input variables and their cooperativeeffects. An obvious (and naıve) assessment of the relative importance of thevariables xI is the variance component σ2

I , or the normalized contribution σ2I/σ

2.However, this measure neglects the contributions of those xJ with J ⊆ I, orthose xJ such that J has some indices in commmon with I. With this in mind,the Sobol′ sensitivity indices are defined as follows:

Definition 10.10. Given an HDMR decomposition of a function f of n variables,the lower and upper Sobol′ sensitivity indices of I ⊆ N are, respectively,

τ2I :=∑

J⊆I

σ2J , and τ

2I :=

J∩I 6=∅

σ2J .

The normalized lower and upper Sobol′ sensitivity indices of I ⊆ N are, respec-tively,

s2I := τ2I/σ2, and s2I := τ2I/σ

2.

Page 127: MA4K0 Introduction to Uncertainty Quantificationhomepages.warwick.ac.uk/~masbm/ClimateCourse/... · uncertainty into two types, aleatoric and epistemic uncertainty. Aleatoric un-certainty

—DRAFT—

BIBLIOGRAPHY 119

Since∑

I⊆N σ2I = σ2 = ‖f − f∅‖2L2, it follows immediately that, for each

I ⊆ N ,0 ≤ s2I ≤ s2I ≤ 1.

Note, however, that while the ANOVA theorem guarantees that σ2 =∑

I⊆D σ2I ,

in general Sobol′ indices satisfy no such additivity relation:

1 6=∑

I⊆N

s2I <∑

I⊆N

s2I 6= 1.

Bibliography

Detailed treatment of the singular value decomposition can be found in any texton (numerical) linear algebra, such as that of Trefethen & Bau [108].

McDiarmid’s inequality appears in [65], although the underlying martingaleresults go back to Hoeffding [39] and Azuma [5]. Ledoux [59] and Ledoux &Talagrand [60] give more general presentations of the concentration-of-measurephenomenon, including geometrical considerations such as isoperimetric inequal-ities.

In the statistical literature, the analysis of variance (ANOVA) originateswith Fisher & Mackenzie [32]. The ANOVA decomposition was generalized byHoeffding [38] to functions in L2([0, 1]d) for d ∈ N; for d = ∞, see Owen [75].That generalization can easily be applied to L2 functions on any product do-main, and leads to the functional ANOVA of Stone [96]. In the mathematicalchemistry literature, the HDMR (with its obvious connections to ANOVA) waspopularized by Rabitz & al. [3, 78]. The presentation of ANOVA/HDMR inthis chapter draws upon those references and the presentations of Beccacece &Borgonovo [7] and Hooker [40].

Sobol′ indices were introduced by Sobol′ in [92]. HDMR by Sobol′ in [93].

Exercises

Exercise 10.1. Prove Lemma 10.4. That is, show that, for each j = 1, . . . , n, theMcDiarmid subdiameter Dj [ · ] is a semi-norm on the space of bounded functionsf : X → K, as is the McDiarmid diameter D[ · ]. What are the null-spaces ofthese semi-norms?

Exercise 10.2. Let f : [−1, 1]2 → R be a function of two variables. Sketchthe vanishing sets of the component functions of f in a Cut-HDMR expansionthrough x = (0, 0). Do the same exercise for f : [−1, 1]3 → R and x = (0, 0, 0),taking particular care with second-order terms like f1,2.

Page 128: MA4K0 Introduction to Uncertainty Quantificationhomepages.warwick.ac.uk/~masbm/ClimateCourse/... · uncertainty into two types, aleatoric and epistemic uncertainty. Aleatoric un-certainty

—DRAFT—

120 CHAPTER 10. SENSITIVITY ANALYSIS AND MODEL REDUCTION

Page 129: MA4K0 Introduction to Uncertainty Quantificationhomepages.warwick.ac.uk/~masbm/ClimateCourse/... · uncertainty into two types, aleatoric and epistemic uncertainty. Aleatoric un-certainty

—DRAFT—

Chapter 11

Spectral Expansions

The mark of a mature, psychologicallyhealthy mind is indeed the ability to livewith uncertainty and ambiguity, but onlyas much as there really is.

Julian Baggini

This chapter and its sequels consider several spectral methods for uncertaintyquantification. At their core, these are orthogonal decomposition methods inwhich a random variable stochastic process (usually the solution of interest)over a probability space (Θ,F , µ) is expanded with respect to an appropriateorthogonal basis of L2(Θ, µ;R). This chapter lays the foundations by consideringspectral expansions in general, starting with the Karhunen–Loeve biorthogonaldecomposition, and continuing with orthogonal polynomial bases for L2(Θ, µ;R)and the resulting polynomial chaos decompositions.

11.1 Karhunen–Loeve Expansions

Fix a compact domain Ω ⊆ Rd (which could be thought of as ‘space’, ‘time’, ora general parameter space) and a probability space (Θ,F , µ). The Karhunen–Loeve expansion of a stochastic process U : Ω × Θ → R is a particularly nicespectral decomposition, in that it decomposes U in a biorthogonal fashion, i.e.in terms of components that are both orthogonal over the parameter domain Ωand the probability space Θ.

To be more precise, consider a stochastic process U : Ω×Θ → R such that

• for all x ∈ Ω, U(x) ∈ L2(Θ, µ;R);• for all x ∈ Ω, Eµ[U(x)] = 0;• the covariance function CU (x, y) := Eµ[U(x)U(y)] is a well-defined con-tinuous function of x, y ∈ Ω.

Remark 11.1. 1. The condition that U is a zero-mean process is not a seriousrestriction; if U is not a zero-mean process, then simply consider U definedby U(x, θ) := U(x, θ)− Eµ[U(x)].

Page 130: MA4K0 Introduction to Uncertainty Quantificationhomepages.warwick.ac.uk/~masbm/ClimateCourse/... · uncertainty into two types, aleatoric and epistemic uncertainty. Aleatoric un-certainty

—DRAFT—

122 CHAPTER 11. SPECTRAL EXPANSIONS

2. It is common in practice to see the covariance function interpreted asproving some information on the correlation length of the process U . Thatis, CU (x, y) depends only upon ‖x−y‖ and, for some function g : [0,∞) →[0,∞), CU (x, y) = g(‖x− y‖). A typical such g is g(r) = exp(−r/r0), andthe constant r0 encodes how similar values of U at nearby points of Ω areexpected to be; when the correlation length r0 is small, the field U hasdissimilar values near to one another, and so is rough; when r0 is large,the field U has only similar values near to one another, and so is moresmooth.

Define the covariance operator of U , also denoted by CU : L2(Ω, dx;R) →L2(Ω, dx;R) by

(CUf)(x) :=

Ω

CU (x, y)f(y) dy.

Now let en | n ∈ N be an orthonormal basis of eigenvectors of L2(Ω, dx;R)with corresponding eigenvalues λn | n ∈ N, i.e.

Ω

CU (x, y)en(y) dy = λnen(x)

and ∫

Ω

em(x)en(x) dx = δmn.

Definition 11.2. Let X be a first-countable topological space. A functionK : X → X → R is called a Mercer kernel if

1. K is continuous;2. K is symmetric, i.e. K(x, x′) = K(x′, x) for all x, x′ ∈ X ; and3. K is positive semi-definite in the sense that, for all choices of finitely many

points x1, . . . , xn ∈ X , the Gram matrix

G :=

K(x1, x1) · · · K(x1, xn)

.... . .

...K(xn, x1) · · · K(xn, xn)

is positive semi-definite, i.e. satisfies ξ ·Gξ ≥ 0 for all ξ ∈ Rn.

Theorem 11.3 (Mercer). Let X be a first-countable topological space equippedwith a complete Borel measure µ. Let K : X → X → R be a Mercer kernel. Ifx 7→ K(x, x) lies in L1(X , µ;R), then there is an orthonormal basis enn∈N ofL2(X , µ;R) consisting of of eigenfunctions of the operator

f 7→∫

X

K( · , y)f(y) dµ(y)

with non-negative eigenvalues λnn∈N. Furthermore, the eigenfunctions corre-sponding to non-zero eigenvalues are continuous, and

K(x, y) =∑

n∈N

λnen(x)en(y),

and this series converges absolutely, and uniformly over compact subsets of X .

Page 131: MA4K0 Introduction to Uncertainty Quantificationhomepages.warwick.ac.uk/~masbm/ClimateCourse/... · uncertainty into two types, aleatoric and epistemic uncertainty. Aleatoric un-certainty

—DRAFT—

11.1. KARHUNEN–LOEVE EXPANSIONS 123

Theorem 11.4 (Karhunen–Loeve). Under the above assumptions on U , its co-variance function and its covariance operator, U can be written as

U =∑

n∈N

Znen

where the enn∈N are orthonormal eigenfunctions of the covariance operatorCU , the corresponding eigenvalues λnn∈N are non-negative, the convergenceof the series is in L2(Θ, µ;R) and uniform in x ∈ Ω, with

Zn =

Ω

U(x)en(x) dx.

Furthermore, the random variables Zn are centred, uncorrelated, and have vari-ance λn:

Eµ[Zn] = 0, and Eµ[ZmZn] = λnδmn.

Proof. Since it is continuous, the covariance function is a Mercer kernel. Hence,by Mercer’s theorem, there is an orthonormal basis enn∈N of L2(Ω, dx;R)consisting of eigenfunctions of the covariance operator with non-negative eigen-values λnn∈N. In this basis, the covariance function has the representation

CU (x, y) =∑

n∈N

λnen(x)en(y).

Write the process U in terms of this basis as

U =∑

n∈N

Znen,

where the coefficients Zn are random variables given by orthogonal projection:

Zn :=

Ω

U(x)en(x) dx.

Then

Eµ[Zn] = Eµ

[∫

Ω

U(x)en(x) dx

]=

Ω

E[U(x)]en(x) dx = 0.

and

Eµ[ZmZn] = Eµ

[∫

Ω

U(x)em(x) dx

Ω

U(x)en(x) dx

]

= Eµ

[∫

Ω

Ω

U(x)em(x)U(y)en(y) dydx

]

=

Ω

Ω

Eµ[U(x)U(y)]em(x)en(y) dydx

=

Ω

Ω

CU (x, y)em(x)en(y) dydx

=

Ω

em(x)

Ω

CU (x, y)en(y) dydx

=

Ω

em(x)λnen(x) dx

= λnδmn.

Page 132: MA4K0 Introduction to Uncertainty Quantificationhomepages.warwick.ac.uk/~masbm/ClimateCourse/... · uncertainty into two types, aleatoric and epistemic uncertainty. Aleatoric un-certainty

—DRAFT—

124 CHAPTER 11. SPECTRAL EXPANSIONS

Let SN :=∑N

n=1 Znen : Ω×Θ → R. Then, for any x ∈ Ω,

[|U(x) − SN (x)|2

]

= Eµ[U(x)2] + Eµ[SN (x)2]− 2Eµ[U(x)SN (x)]

= CU (x, x) + Eµ

[N∑

n=1

N∑

m=1

ZnZmem(x)en(x)

]− 2Eµ

[U(x)

N∑

n=1

Znen(x)

]

= CU (x, x) +

N∑

n=1

λnen(x)2 − 2Eµ

[N∑

n=1

Ω

U(x)U(y)en(y)en(x) dy

]

= CU (x, x) +

N∑

n=1

λnen(x)2 − 2

N∑

n=1

Ω

CU (x, y)en(y)en(x) dy

= CU (x, x) −N∑

n=1

λnen(x)2

→ 0 as N → ∞, uniformly in x, by Mercer’s theorem.

Among many possible decompositions of a random process, the Karhunen–Loeve expansion is optimal in the sense that the mean-square error of any trun-cation of the expansion after finitely many terms is minimal. However, itsutility is limited since the covariance function of the solution process is oftennot known a priori. Nevertheless, the Karhunen–Loeve expansion provides aneffective means of representing input random processes when their covariancestructure is known, and provides a simple method for sampling Gaussian mea-sures on Hilbert spaces, which is a necessary step in the implementation of themethods outlined in Chapter 6.

Example 11.5. Suppose that C : H → H is a self-adjoint, positive-definite,nuclear operator on a Hilbert space H and let m ∈ H. Let (λk, ek)k∈N bea sequence of orthonormal eigenpairs for C, ordered by decreasing eigenvalueλk. Let ξ1, ξ2, . . . be IID samples from N (0, 1). Then, by the Karhunen–Loevetheorem,

X := m+

∞∑

k=1

√λkξkek

is an H-valued random variable with distribution N (m,C). Therefore, a finite

sum of the form m+∑K

k=1

√λkξkek for large K is a reasonable approximation

to a draw from N (m,C); this is the procedure used to generate the samplepaths in Figure 11.1.

Definition 11.6. A principal component analysis of an RN -valued random vec-tor U is the Karhunen–Loeve expansion of U seen as a stochastic processU : 1, . . . , N × Ω → R. It is also known as the discrete Karhunen–Loevetransform, the Hotelling transform, and the proper orthogonal decomposition.

Principal component analysis is often applied to sample data, and is inti-mately related to the singular value decomposition:

Example 11.7. Let X ∈ RN×M be a matrix whose columns are M indepen-dent and identically distributed samples from some probability measure on RN ,

Page 133: MA4K0 Introduction to Uncertainty Quantificationhomepages.warwick.ac.uk/~masbm/ClimateCourse/... · uncertainty into two types, aleatoric and epistemic uncertainty. Aleatoric un-certainty

—DRAFT—

11.1. KARHUNEN–LOEVE EXPANSIONS 125

0

0.5

1.0

−0.5

1

(a) 10 KL modes

0

0.5

1.0

−0.5

1

(b) 100 KL modes

0

0.5

1.0

−0.5

1

(c) 1000 KL modes

0

0.5

1.0

−0.5

1

(d) 10000 KL modes

Figure 11.1: Sample paths of the Gaussian distribution on H10 ([0, 1]) that has

mean path m(x) = x(1−x) and covariance operator(− d2

dx2

)−1. Along with the

mean path (dashed), six sample paths are shown for Karhunen–Loeve expan-sions using 10, 100, 1000, and 10000 terms.

and assume without loss of generality that the samples have mean zero. Theempirical covariance matrix of the samples is

C := 1M2XX

⊤.

The eigenvalues λn and eigenfunctions en of the Karhunen–Loeve expansionare just the eigenvalues and eigenvectors of this matrix C. Let Λ ∈ RN×N

be the diagonal matrix of the eigenvalues λn (which are non-negative, and areassumed to be in decreasing order) and E ∈ RN×N the matrix of corresponding

orthonormal eigenvectors, so that C diagonalizes as

C = EΛE⊤.

The principal component transform of the data X is W := E⊤X ; this is an or-thogonal transformation of RN that transforms X to a new coordinate systemin which the greatest component-wise variance comes to lie on the first coordi-nate (called the first principal component), the second greatest variance on thesecond coordinate, and so on.

On the other hand, taking the singular value decomposition of the data(normalized by the number of samples) yields

1MX = UΣV ⊤,

Page 134: MA4K0 Introduction to Uncertainty Quantificationhomepages.warwick.ac.uk/~masbm/ClimateCourse/... · uncertainty into two types, aleatoric and epistemic uncertainty. Aleatoric un-certainty

—DRAFT—

126 CHAPTER 11. SPECTRAL EXPANSIONS

where U ∈ RN×N and V ∈ RM×M are orthogonal and Σ ∈ RN×M is diago-nal with decreasing non-negative diagonal entries (the singular values of 1

MX).Then

C = UΣV ⊤(UΣV ⊤)⊤ = UΣV ⊤V Σ⊤U⊤ = UΣ2U⊤.

from which we see that U = E and Σ2 = Λ. This is just another instanceof the well-known relation that, for any matrix A, the eigenvalues of AA∗ arethe singular values of A and the right eigenvectors of AA∗ are the left singularvectors of A; however, in this context, it also provides an alternative way tocompute the principal component transform.

In fact, performing principal component analysis via the singular value de-composition is numerically preferable to forming and then diagonalizing thecovariance matrix, since the formation of XX⊤ can cause a disastrous loss ofprecision; the classic example of this phenomenon is the Lauchli matrix

1 ε 0 01 0 ε 01 0 0 ε

(0 < ε≪ 1),

for which taking the singular value decomposition is stable, but forming anddiagonalizing XX⊤ is unstable.

11.2 Wiener–Hermite Polynomial Chaos

The next section will cover polynomial chaos (PC) expansions in greater gen-erality, and this section serves as an introductory prelude. In this, the classicaland notationally simplest setting, we consider expansions of a real-valued ran-dom variable U with respect to a single standard Gaussian random variableΞ, using appropriate orthogonal polynomials of Ξ, i.e. the Hermite polynomi-als. This setting was pioneered by Norbert Wiener, and so it is known as theWiener–Hermite polynomial chaos.

To this end, let Ξ ∼ N (0, 1) =: γ be a standard Gaussian random variable.The probability density function ρΞ : R → R of Ξ with respect to Lebesguemeasure on R is

dx(ξ) = ρΞ(ξ) =

1√2π

exp

(−ξ

2

2

).

Now let Hen(ξ) ∈ R[ξ], for n ∈ N0, be the Hermite polynomials, which are asystem of orthogonal polynomials for the standard Gaussian measure γ.

Lemma 11.8. If p(x) =∑

n∈N0pnx

n is any polynomial or convergent powerseries, then E[p(Ξ)Hem(Ξ)] = m!pm.

Proof. TO DO!

By the Stone–Weierstrass theorem and the approximability of L2 functionsby continuous ones, the Hermite polynomials form a complete orthogonal basisof the Hilbert space L2(R, γ;R) with the inner product

〈U, V 〉 := E[U(Ξ)V (Ξ)] ≡∫

R

U(ξ)V (ξ)ρΞ(ξ) dξ.

Page 135: MA4K0 Introduction to Uncertainty Quantificationhomepages.warwick.ac.uk/~masbm/ClimateCourse/... · uncertainty into two types, aleatoric and epistemic uncertainty. Aleatoric un-certainty

—DRAFT—

11.2. WIENER–HERMITE POLYNOMIAL CHAOS 127

Let us further extend the 〈 · , · 〉 notation for inner products and write 〈 · 〉 forexpectation with respect to the distribution γ of Ξ. So, for example, the or-thogonality relation for the Hermite polynomials reads 〈HemHen〉 = n!δmn.

Definition 11.9. Let U ∈ L2(R, γ;R) be a square-integrable real-valued ran-dom variable. The Wiener–Hermite polynomial chaos expansion of U with re-spect to the standard Gaussian Ξ is the expansion of U in the orthogonal basisHenn∈N0 , i.e.

U =∑

n∈N0

unHen(Ξ)

with scalar Wiener–Hermite polynomial chaos coefficients unn∈N0 ⊆ R givenby

un =〈UHen〉〈He2n〉

=1

n!√2π

∫ ∞

−∞

U(x)Hen(x)e−x2/2 dx.

Remark 11.10. From the perspective of sampling of random variables, thismeans that if we wish to draw a sample from the distribution of U , it is enoughto draw a sample ξ from the standard normal distribution and then evaluatethe series

∑n∈N0

unHen(ξ) at that ξ.

Note that, in particular, since He0 ≡ 1,

E[U ] = 〈He0, U〉 =∑

n∈N0

un〈He0,Hen〉 = u0,

so the expected value of U is simply its 0th PC coefficient. Similarly, its varianceis a weighted sum of the squares of its PC coefficients:

V[U ] = E[|U − E[U ]|2

]

= E

∣∣∣∣∣∑

n∈N

unHen

∣∣∣∣∣

2 since E[U ] = u0

=∑

m,n∈N

umun〈Hem,Hen〉

=∑

n∈N

u2n〈He2n〉 by Hermitian orthogonality.

Example 11.11. Let X ∼ N (µ, σ2) be a real-valued Gaussian random variablewith mean m ∈ R and variance σ2 ≥ 0. Let Y := eX ; since log Y is normallydistributed, the non-negative-valued random variable Y is said to be a log-normal random variable. As usual, let Ξ ∼ N (0, 1) be the standard Gaussian

random variable; clearly XL= µ + σΞ and Y

L= eµeσΞ. The Wiener–Hermite

Page 136: MA4K0 Introduction to Uncertainty Quantificationhomepages.warwick.ac.uk/~masbm/ClimateCourse/... · uncertainty into two types, aleatoric and epistemic uncertainty. Aleatoric un-certainty

—DRAFT—

128 CHAPTER 11. SPECTRAL EXPANSIONS

expansion of Y as∑

k∈N0ykHek(Ξ) has coeffients

yk =E[eµeσΞHek(Ξ)

]

E [Hek(Ξ)2]

=eµ

k!E[eσΞHek(Ξ)

]

=eµ

k!E

[σkΞk

k!Hek(Ξ)

]by Hermitian orthogonality

=eµeσ

2/2σk

k!

i.e.

Y = eµ+σ2/2∑

k∈N0

σk

k!Hek(Ξ).

From this expansion it can be seen that E[Y ] = eµ+σ2/2 and

V[Y ] = e2µ+σ2 ∑

k∈N0

(σk

k!

)2

〈He2k〉 = e2µ+σ2(eσ

2 − 1).

Of course, in practice, the series expansion U =∑

n∈N0unHen(Ξ) must be

truncated after finitely many terms, and so it is natural to ask about the qualityof the approximation

U ≈ UP :=

P∑

n=0

unHen(Ξ)

Since the Hermite polynomials form a complete orthogonal basis for L2(R, γ;R),the standard results about orthogonal approximations in Hilbert spaces apply.In particular, by Corollary 3.19, the truncation error U − UP is orthogonal tothe space from which UP was chosen, i.e.

spanHe0,He1, . . . ,HeP,and tends to zero in mean square.

Lemma 11.12. The truncation error U − UP is orthogonal to the subspace

spanHe0,He1, . . . ,HePof L2(R, ρΞ dx;R). Furthermore, limP→∞ UP = U in L2(R, γ;R).

Proof. Let V :=∑P

m=0 vmHem be any element of the subspace of L2(R, γ;R)spanned by the Hermite polynomials of degree at most P. Then

〈U − UP, V 〉 =⟨(

n>P

unHen

)(P∑

m=0

vmHem

)⟩

=

⟨∑

n>Pm∈0,...,P

unvmHenHem

=∑

n>Pm∈0,...,P

unvm〈HenHem〉

= 0.

Page 137: MA4K0 Introduction to Uncertainty Quantificationhomepages.warwick.ac.uk/~masbm/ClimateCourse/... · uncertainty into two types, aleatoric and epistemic uncertainty. Aleatoric un-certainty

—DRAFT—

11.3. GENERALIZED PC EXPANSIONS 129

Hence, by Pythagoras’ theorem,

‖U‖L2 = ‖UP‖L2(γ) + ‖U − UP‖L2(γ),

and hence ‖U − UP‖L2(γ) → 0 as P → ∞.

11.3 Generalized PC Expansions

The ideas of polynomial chaos can be generalized well beyond the setting inwhich the stochastic germ Ξ is a standard Gaussian random variable, or even avector Ξ = (Ξ1, . . . ,Ξd) of mutually orthogonal Gaussian random variables.

Let Ξ = (Ξ1, . . . ,Ξd) be an Rd-valued random variable with independent(and hence orthogonal) components. As usual, let R[ξ1, . . . , ξd] denote the ringof all polynomials in ξ1, . . . , ξd with real coefficients, and let R[ξ1, . . . , ξd]≤p de-note those polynomials of total degree at most p ∈ N0. Let Γp ⊆ R[ξ1, . . . , ξd]≤p

be a collection of polynomials that are mutually orthogonal and orthogonal toR[ξ1, . . . , ξd]≤p−1, and let Γp := spanΓp. This yields the orthogonal decompo-sition

L2(Θ, µ;R) =⊕

p∈N0

Γp.

It is important to note that there is a lack of uniqueness in these basis polyno-mials whenever d ≥ 2: each choice of ordering of multi-indices α ∈ Nd

0 can yielda different orthogonal basis of L2 when the Gram–Schmidt procedure is appliedto the monomials ξα.

Note that (as usual, assuming separability) the L2 space over the productprobability space (Θ,F , µ) is isomorphic to the Hilbert space tensor product ofthe L2 spaces over the marginal probability spaces:

L2(Θ1 × · · · ×Θd, µ1 ⊗ · · · ⊗ µd;R) =

d⊗

i=1

L2(Θi, µi;R),

from which we see that an orthogonal system of multivariate polynomials forL2(Θ, µ;R) can be found by taking products of univariate orthogonal polynomi-als for the marginal spaces L2(Θi, µi;R). A generalized polynomial chaos (gPC)expansion of a random variable or stochastic process U is simply the expansionof U with respect to such a complete orthogonal polynomial basis of L2(Θ, µ).

Example 11.13. Let Ξ = (Ξ1,Ξ2) be such that Ξ1 and Ξ2 are independent (andhence orthogonal) and such that Ξ1 is a standard Gaussian random variableand Ξ2 is uniformly distributed on [−1, 1]. Hence, the univariate orthogonalpolynomials for Ξ1 are the Hermite polynomials Hen and the univariate orthog-onal polynomials for Ξ2 are the Legendre polynomials Len. Then a system of

Page 138: MA4K0 Introduction to Uncertainty Quantificationhomepages.warwick.ac.uk/~masbm/ClimateCourse/... · uncertainty into two types, aleatoric and epistemic uncertainty. Aleatoric un-certainty

—DRAFT—

130 CHAPTER 11. SPECTRAL EXPANSIONS

orthogonal polynomials for Ξ up to total degree 3 is

Γ0 = 1,Γ1 = He1(ξ1),Le1(ξ2)

= ξ1, ξ2,Γ2 = He2(ξ1),He1(ξ1)Le1(ξ2),Le2(ξ2)

= ξ21 − 1, ξ1ξ2,12 (3ξ

22 − 1),

Γ3 = He3(ξ1),He2(ξ1)Le1(ξ2),He1(ξ1)Le2(ξ2),Le3(ξ2)= ξ31 − 3ξ1, ξ

21ξ2 − ξ2,

12 (3ξ1ξ

22 − ξ1),

12 (5ξ

32 − 3ξ2).

Rather than have the orthogonal basis polynomials have two indices, one forthe degree p and one within each set Γp, it is useful and conventional to orderthe basis polynomials using a single index k ∈ N0. It is common in practiceto take Ψ0 = 1 and to have the polynomial degree be (weakly) increasing withrespect to the new index k. So, to continue Example 11.13, one could take

Ψ0(ξ) = 1,

Ψ1(ξ) = ξ1,

Ψ2(ξ) = ξ2,

Ψ3(ξ) = ξ21 − 1,

Ψ4(ξ) = ξ1ξ2,

Ψ5(ξ) =12 (3ξ

22 − 1),

Ψ6(ξ) = ξ31 − 3ξ1,

Ψ7(ξ) = ξ21ξ2 − ξ2,

Ψ8(ξ) =12 (3ξ1ξ

22 − ξ1),

Ψ9(ξ) =12 (5ξ

32 − 3ξ2).

Truncation of gPC Expansions. Suppose that a gPC expansionU =∑

k∈N0ukΨk

is truncated, i.e. we consider

UP =

P∑

k=0

ukΨk.

It is an easy exercise to show that the truncation error U −UP is orthogonal tospanΨ0, . . . ,ΨP. It is also worth considering how many terms there are in sucha truncated gPC expansion. Suppose that the stochastic germ Ξ has dimensiond (i.e. has d independent components), and we work only with polynomialsof total degree at most p. The total number of coefficients in the truncatedexpansion UP is

P + 1 =(d+ p)!

d!p!

That is, the total number of gPC coefficients that must be calculated growscombinatorially as a function of the number of input random variables and thedegree of polynomial approximation. Such rapid growth limits the usefulness ofgPC expansions for practical applications where d and p are much greater than,say, 10.

Page 139: MA4K0 Introduction to Uncertainty Quantificationhomepages.warwick.ac.uk/~masbm/ClimateCourse/... · uncertainty into two types, aleatoric and epistemic uncertainty. Aleatoric un-certainty

—DRAFT—

11.3. GENERALIZED PC EXPANSIONS 131

Remark 11.14. It is possible to adapt the notion of a gPC expansion to thesituation of dependent random variables, but there are some complications. Insummary, suppose that Ξ = (Ξ1, . . . ,Ξd) taking values in Θ = Θ1 × · · · × Θd

has joint law µ, which is not necessarily a product measure. Nevertheless, letµi denote the marginal law of Ξi, i.e.

µi(Ei) := µ(Θ1 × · · · ×Θi−1 × Ei ×Θi+1 × · · · ×Θd).

To simplify matter further, assume that µ (resp. µi) has Lebesgue density ρ

(resp. ρi). Now let φ(i)p (ξi) ∈ R[ξi], p ∈ N0, be univariate orthogonal polynomials

for µi. The chaos function associated to a multi-index α defined to be

Ψα(ξ) :=

√ρ1(ξ1) . . . ρd(ξd)

ρ(ξ)φ(1)α1

(ξ1) . . . φ(d)αd

(ξd).

It can be shown that the family Ψα | α ∈ Nd0 is a complete orthonormal

basis for L2(Θ, µ;R), so we have the usual series expansion U =∑

α uαΨα.Note, however, that with the exception of Ψ0 = 1, the functions Ψα are notpolynomials. Nevertheless, we still have the usual properties that truncationerror is orthogonal the the approximation subspace, and

Eµ[U ] = u0, Vµ[U ] =∑

α6=0

u2α〈Ψ2α〉.

Expansions of Random Variables. Consider a real-valued random variable U ,which we expand in terms of a stochastic germ ξ. Let UP be a truncated GPCexpansion of U :

UP(ξ) =

P∑

k=0

ukΨk(ξ),

where the polynomials Ψk are orthogonal with respect to the law of ξ, and withthe usual convention that Ψ0 = 1. A first, easy, observation is that

E[UP] = 〈Ψ0, UP〉 =

P∑

k=0

uk〈Ψ0,Ψk〉 = u0,

so the expected value of UP is simply its 0th GPC coefficient. Similarly, itsvariance is a weighted sum of the squares of its GPC coefficients:

E[|UP − E[UP]|2

]= E

∣∣∣∣∣

P∑

k=1

ukΨk

∣∣∣∣∣

2

=

P∑

k,ℓ=1

ukuℓ〈Ψk,Ψℓ〉

=

P∑

k=1

u2k〈Ψ2k〉.

Page 140: MA4K0 Introduction to Uncertainty Quantificationhomepages.warwick.ac.uk/~masbm/ClimateCourse/... · uncertainty into two types, aleatoric and epistemic uncertainty. Aleatoric un-certainty

—DRAFT—

132 CHAPTER 11. SPECTRAL EXPANSIONS

Expansions of Random Vectors. Similar remarks can be made for a Rd-valuedrandom vector U having truncated GPC expansion

UP(ξ) =

P∑

k=0

ukΨk(ξ),

with coefficients uk = (u1k, . . . , udk) ∈ Rd for each k ∈ 0, . . . ,P. As before,

E[UP] = 〈Ψ0, UP〉 =

P∑

k=0

uk〈Ψ0,Ψk〉 = u0 ∈ Rd

and the covariance matrix C ∈ Rd×d of UP is given by

C =

P∑

k=1

uku⊤k 〈Ψ2

k〉 i.e. Cij =

P∑

k=1

uikujk〈Ψ2

k〉.

Expansions of Stochastic Processes. Consider now a square-integrable stochas-tic process U : Ω×Θ → R; that is, for each x ∈ Ω, U(x, · ) ∈ L2(Θ, µ) is a real-valued random variable, and, for each θ ∈ Θ, U( · , θ) ∈ L2(Ω, dx) is a scalarfield on the domain Ω. Recall that

L2(Θ, µ;R)⊗ L2(Ω, dx;R) ∼= L2(Θ× Ω, µ⊗ dx;R) ∼= L2(Θ, µ;L2(Ω, dx)

).

As usual, take Ψk | k ∈ N0 to be an orthogonal polynomial basis of L2(Θ, µ;R),ordered (weakly) by total degree, with Ψ0 = 1. A GPC expansion of the randomfield U is an L2-convergent expansion of the form

U(x, ξ) =∑

k∈N0

uk(x)Ψk(ξ),

which in practice is truncated to

U(x, ξ) ≈ UP(x, ξ) =

P∑

k=0

uk(x)Ψk(ξ).

The functions uk : Ω → R are called the stochastic modes of the process U . Thestochastic mode u0 : Ω → R is the mean field of U :

E[U(x)] = E[UP(x)] = u0(x).

The variance of the field at x ∈ Ω is

V[U(x)] =

∞∑

k=1

u2k〈Ψ2k〉 ≈

P∑

k=1

u2k〈Ψ2k〉,

whereas, for two points x, y ∈ Ω,

E[U(x)U(y)] =

⟨∑

k∈N0

uk(x)Ψk(ξ)∑

ℓ∈N0

uℓ(y)Ψℓ(ξ)

=∑

k∈N0

uk(x)uk(y)〈Ψ2k〉

≈P∑

k=0

uk(x)uk(y)〈Ψ2k〉,

Page 141: MA4K0 Introduction to Uncertainty Quantificationhomepages.warwick.ac.uk/~masbm/ClimateCourse/... · uncertainty into two types, aleatoric and epistemic uncertainty. Aleatoric un-certainty

—DRAFT—

BIBLIOGRAPHY 133

FINISH ME!!!

Figure 11.1: ...

FINISH ME!!!

Figure 11.2: ...

and so

CU (x, y) =∑

k∈N

uk(x)uk(y)〈Ψ2k〉

≈P∑

k=1

uk(x)uk(y)〈Ψ2k〉

= CUP(x, y).

At least when dimΩ is low, it is very common to see the behaviour of a stochasticfield U (or UP) summarized by plots of the mean field and the variance field,as in Figure 11.1; when dimΩ = 1, a surface or contour plot of the covariancefield CU (x, y) as in Figure 11.2 can also be informative.

Bibliography

Spectral expansions in general are covered in Chapter 2 of the monograph of LeMaıtre & Knio [58], and Chapter 5 of the book of Xiu [117].

The Karhunen–Loeve expansion bears the names of Karhunen [48] and Loeve[62], but KL-type series expansions of stochastic processes were considered ear-lier by Kosambi [52]. Lemma 11.12, that the truncation error in a PC expansionis orthogonal to the approximation subspace, is nowadays a simple corollary ofstandard results in Hilbert spaces, but is an observation that appears to havefirst first been made in the stochastic context by Cameron & Martin [17]. Theapplication of Wiener–Hermite PC expansions to engineering systems was pop-ularized by Ghanem & Spanos [34]; the extension to gPC and the connectionwith the Askey scheme is due to Xiu & Karniadakis [118].

The extension of generalized polynomial chaos to arbitrary dependency amongthe components of the stochastic germ, as in Remark 11.14, is due to Soize &Ghanem [94].

Exercises

Exercise 11.1. Use the Karhunen–Loève expansion to generate samples from a Gaussian random field U on [−1, 1] (i.e., for each x ∈ [−1, 1], U(x) is a Gaussian random variable) with covariance function

1. C_U(x, y) = exp(−|x − y|²/a²),

2. C_U(x, y) = exp(−|x − y|/a), and

3. C_U(x, y) = (1 + |x − y|²/a²)^{−1}

for various values of a > 0. Plot and comment upon your results, particularly the smoothness of the fields produced.
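The following is a minimal sketch of one possible approach to Exercise 11.1 (an illustration, not part of the exercise itself): discretize [−1, 1], assemble the covariance matrix for one of the kernels above, and sample by eigendecomposition, which is the discrete analogue of a truncated Karhunen–Loève expansion. All names and parameter values are illustrative.

    import numpy as np
    import matplotlib.pyplot as plt

    # Sample a zero-mean Gaussian random field on [-1, 1] by eigendecomposition
    # of the discretized covariance matrix (discrete Karhunen-Loeve expansion).
    x = np.linspace(-1.0, 1.0, 200)
    a = 0.5
    C = np.exp(-np.abs(x[:, None] - x[None, :])**2 / a**2)  # kernel 1. above
    lam, phi = np.linalg.eigh(C)          # eigenvalues (ascending) and eigenvectors
    lam = np.clip(lam, 0.0, None)         # guard against tiny negative round-off
    xi = np.random.standard_normal(x.size)
    sample = phi @ (np.sqrt(lam) * xi)    # one sample path of the field
    plt.plot(x, sample)
    plt.show()

Replacing the kernel by the exponential or rational kernels above and varying a makes the differences in sample-path smoothness immediately visible.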

Exercise 11.2. Consider the negative Laplacian operator L := −d²/dx² acting on real-valued functions on the interval [0, 1], with zero boundary conditions. Show that the eigenvalues µ_n and eigenfunctions e_n of L are
\[ \mu_n = (\pi n)^2, \qquad e_n(x) = \sin(\pi n x). \]
Hence show that C := L^{−1} has the same eigenfunctions, with eigenvalues λ_n = (πn)^{−2}. Hence, using the Karhunen–Loève theorem, generate figures similar to Figure 11.1 for your choice of mean field m : [0, 1] → R.

Exercise 11.3. Do the analogue of Exercise 11.2 for the negative Laplacian operator L := −∂²/∂x² − ∂²/∂y² acting on real-valued functions on the square [0, 1]², again with zero boundary conditions.

Exercise 11.4. Show that the eigenvalues λ_n and eigenfunctions e_n of the exponential covariance function C(x, y) = exp(−|x − y|/a) on [−b, b] are given by
\[ \lambda_n = \begin{cases} \dfrac{2a}{1 + a^2 w_n^2}, & \text{if } n \in 2\mathbb{Z}, \\[1ex] \dfrac{2a}{1 + a^2 v_n^2}, & \text{if } n \in 2\mathbb{Z} + 1, \end{cases} \qquad e_n(x) = \begin{cases} \sin(w_n x)\Big/\sqrt{b - \dfrac{\sin(2 w_n b)}{2 w_n}}, & \text{if } n \in 2\mathbb{Z}, \\[1ex] \cos(v_n x)\Big/\sqrt{b + \dfrac{\sin(2 v_n b)}{2 v_n}}, & \text{if } n \in 2\mathbb{Z} + 1, \end{cases} \]
where w_n and v_n solve the transcendental equations
\[ a w_n + \tan(w_n b) = 0 \quad \text{for } n \in 2\mathbb{Z}, \qquad 1 - a v_n \tan(v_n b) = 0 \quad \text{for } n \in 2\mathbb{Z} + 1. \]
Hence, using the Karhunen–Loève theorem, generate sample paths from the Gaussian measure with covariance kernel C and your choice of mean path.
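A brief sketch of how the transcendental equations in Exercise 11.4 might be solved numerically (an illustration only; the odd-index equation is handled analogously): each root of a w + tan(wb) = 0 lies between consecutive singularities of tan(wb), so a bracketing root-finder such as scipy.optimize.brentq can be used.

    import numpy as np
    from scipy.optimize import brentq

    a, b = 1.0, 1.0                              # illustrative parameter values
    f = lambda w: a * w + np.tan(w * b)          # even-index transcendental equation
    roots = []
    for j in range(1, 6):
        lo = (j - 0.5) * np.pi / b + 1e-6        # just past a singularity of tan
        hi = (j + 0.5) * np.pi / b - 1e-6        # just before the next singularity
        roots.append(brentq(f, lo, hi))          # exactly one root per interval
    w = np.array(roots)
    lam = 2 * a / (1 + a**2 * w**2)              # corresponding eigenvalues

With the w_n, v_n, λ_n and e_n in hand, sample paths can be generated exactly as in the previous sketch, now using the analytical eigenpairs.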

Chapter 12

Stochastic Galerkin Methods

Not to be absolutely certain is, I think, one of the essential things in rationality.

Am I an Atheist or an Agnostic?

Bertrand Russell

Unlike non-intrusive approaches, which rely on individual realizations to determine the stochastic model response to random inputs, Galerkin methods use a formalism of weak solutions, expressed in terms of inner products, to form systems of governing equations for the solution's PC coefficients, which are generally coupled together. They are in essence the extension to a suitable tensor product Hilbert space of the usual Galerkin formalism that underlies many theoretical and numerical approaches to PDEs. This chapter is devoted to the formulation of Galerkin methods of UQ using PC expansions.

Suppose that the relationship between some input data d and the output (solution) u can be expressed formally as
\[ M(u; d) = 0. \]
A good model for this kind of set-up is an elliptic boundary value problem on, say, a bounded, connected domain Ω ⊆ R^n with smooth boundary ∂Ω:
\[ -\nabla\cdot(\kappa(x)\nabla u(x)) = f(x) \quad \text{for } x \in \Omega, \qquad (12.1) \]
\[ u(x) = 0 \quad \text{for } x \in \partial\Omega. \]

In this case, the input data d are typically the forcing term f : Ω → R and the permeability field κ : Ω → R^{n×n}; in some cases, the domain Ω itself might depend upon d, but this introduces additional complications that will not be considered in this chapter. For a PDE such as this, solutions u are typically sought in the Sobolev space H^1_0(Ω) of L² functions that have a weak derivative that itself lies in L², and that vanish on ∂Ω in the sense of trace. Moreover, it is usual to seek weak solutions, i.e. u ∈ H^1_0(Ω) for which the inner product of (12.1) with any v ∈ H^1_0(Ω) is an equality. That is, integrating by parts, we seek u ∈ H^1_0(Ω) such that
\[ \langle \kappa\nabla u, \nabla v\rangle_{L^2(\Omega)} = \langle f, v\rangle_{L^2(\Omega)} \quad \text{for all } v \in H^1_0(\Omega). \qquad (12.2) \]

On expressing this problem in a chosen basis of H^1_0(Ω), the column vector [u] of coefficients of u in this basis turns out to satisfy a matrix-vector equation (i.e. a system of simultaneous linear equations) of the form [a][u] = [b] for some matrix [a] determined by the permeability field κ and a column vector [b] determined by the forcing term f.

In this chapter, after reviewing basic Lax–Milgram theory and Galerkin projection for problems like (12.1) / (12.2), we consider the situation in which the input data d are uncertain and are described as a random variable D(ξ). Then the solution is also a random variable U(ξ) and the model relationship becomes
\[ M(U(\xi); D(\xi)) = 0. \]
Again, this equation is usually interpreted in a weak sense in a suitable Hilbert space of H^1_0(Ω)-valued random variables. If D and U are expanded in some PC basis, it is natural to ask how the coefficients of U with respect to this PC basis and a chosen basis of H^1_0(Ω) are related to one another. It will turn out that, as in the standard deterministic setting, this problem can be written in the form of a matrix-vector equation [A][U] = [B] related to, but more complicated than, the deterministic problem [a][u] = [b].

12.1 Lax–Milgram Theory and Galerkin Projection

Let H be a real Hilbert space equipped with a bilinear form a : H × H → R. Given f ∈ H∗ (i.e. a continuous linear functional f : H → R), the associated weak problem is:
\[ \text{find } u \in \mathcal{H} \text{ such that } a(u, v) = \langle f \,|\, v\rangle \text{ for all } v \in \mathcal{H}. \qquad (12.3) \]

Example 12.1. Let Ω ⊆ R^n be a bounded, connected domain. Let a matrix-valued function κ : Ω → R^{n×n} and a scalar-valued function f : Ω → R be given, and consider the elliptic problem (12.1). The appropriate bilinear form a(·, ·) is defined by
\[ a(u, v) := \langle -\nabla\cdot(\kappa\nabla u), v\rangle_{L^2(\Omega)} = \langle \kappa\nabla u, \nabla v\rangle_{L^2(\Omega)}, \]
where the second equality follows from integration by parts when u, v are smooth functions that vanish on ∂Ω; such functions form a dense subset of the Sobolev space H^1_0(Ω). This short calculation motivates two important developments in the treatment of the PDE (12.1). First, even though the original formulation (12.1) seems to require the solution u to have two orders of differentiability, the last line of the above calculation makes sense even if u and v have only one order of (weak) differentiability, and so we restrict attention to H^1_0(Ω). Second, we declare u ∈ H^1_0(Ω) to be a weak solution of (12.1) if the L²(Ω) inner product of (12.1) with any v ∈ H^1_0(Ω) holds as an equality of real numbers, i.e. if
\[ -\int_{\Omega} \nabla\cdot(\kappa(x)\nabla u(x))\, v(x)\,\mathrm{d}x = \int_{\Omega} f(x)\, v(x)\,\mathrm{d}x, \]
i.e. if
\[ a(u, v) = \langle f, v\rangle_{L^2(\Omega)} \quad \text{for all } v \in H^1_0(\Omega). \]

The existence and uniqueness of solutions to problems like (12.3), under appropriate conditions on a (which of course are inherited from appropriate conditions on κ), is ensured by the Lax–Milgram theorem, which generalizes the Riesz representation theorem that any Hilbert space is isomorphic to its dual space.

Theorem 12.2 (Lax–Milgram). Let a be a bilinear form on a Hilbert space H such that

1. (boundedness) there exists a constant C > 0 such that, for all u, v ∈ H, |a(u, v)| ≤ C‖u‖‖v‖; and

2. (coercivity) there exists a constant c > 0 such that, for all v ∈ H, |a(v, v)| ≥ c‖v‖².

Then, for all f ∈ H∗, there exists a unique u ∈ H such that, for all v ∈ H, a(u, v) = ⟨f | v⟩.

Proof. For each u ∈ H, v ↦ a(u, v) is a bounded linear functional on H. So, by the Riesz representation theorem, given u ∈ H, there is a unique w ∈ H such that ⟨w, v⟩ = a(u, v) for all v ∈ H. Define Au := w. The map A : H → H is clearly well-defined. It is also linear: take α₁, α₂ ∈ R and u₁, u₂ ∈ H; then, for every v ∈ H,
\[ \langle A(\alpha_1 u_1 + \alpha_2 u_2), v\rangle = a(\alpha_1 u_1 + \alpha_2 u_2, v) = \alpha_1 a(u_1, v) + \alpha_2 a(u_2, v) = \alpha_1\langle Au_1, v\rangle + \alpha_2\langle Au_2, v\rangle = \langle \alpha_1 Au_1 + \alpha_2 Au_2, v\rangle. \]
A is a bounded map, since
\[ \|Au\|^2 = \langle Au, Au\rangle = a(u, Au) \le C\|u\|\|Au\|, \]
so ‖Au‖ ≤ C‖u‖. Furthermore, A is injective, since
\[ \|Au\|\|u\| \ge \langle Au, u\rangle = a(u, u) \ge c\|u\|^2, \]
so Au = 0 implies u = 0.

To see that the range R(A) is closed, take a sequence (v_n)_{n∈N} in R(A) that converges to some v ∈ H. Choose u_n ∈ H such that Au_n = v_n for each n ∈ N. The sequence (Au_n)_{n∈N} is Cauchy, so
\[ \|Au_n - Au_m\|\|u_n - u_m\| \ge |\langle Au_n - Au_m, u_n - u_m\rangle| = |a(u_n - u_m, u_n - u_m)| \ge c\|u_n - u_m\|^2. \]
So c‖u_n − u_m‖ ≤ ‖v_n − v_m‖ → 0. So (u_n)_{n∈N} is Cauchy and converges to some u ∈ H. So v_n = Au_n → Au = v by the continuity (boundedness) of A, so v ∈ R(A), and so R(A) is closed.

Finally, A is surjective: for, if not, there is an s ∈ H, s ≠ 0, such that s ⊥ R(A). But then
\[ c\|s\|^2 \le a(s, s) = \langle s, As\rangle = 0, \]
so s = 0, a contradiction.

So, take f ∈ H∗. There is a unique w ∈ H such that ⟨w, v⟩ = ⟨f | v⟩ for all v ∈ H. The equation Au = w has a unique solution since A is invertible. So ⟨Au, v⟩ = ⟨f | v⟩ for all v ∈ H. But ⟨Au, v⟩ = a(u, v). So there is a unique u ∈ H such that a(u, v) = ⟨f | v⟩ for all v ∈ H.

Galerkin Projection. Now consider the problem of finding a good finite-dimensional approximation to u. Fix a subspace V^(M) ⊆ H of dimension M. The Galerkin projection of the weak problem is:
\[ \text{find } u^{(M)} \in \mathcal{V}^{(M)} \text{ such that } a(u^{(M)}, v^{(M)}) = \langle f, v^{(M)}\rangle \text{ for all } v^{(M)} \in \mathcal{V}^{(M)}. \]
Note that if the hypotheses of the Lax–Milgram theorem are satisfied on the full space H, then they are certainly satisfied on the subspace V^(M), thereby ensuring the existence and uniqueness of solutions to the Galerkin problem. Note well, though, that existence of a unique Galerkin solution for each M ∈ N_0 does not imply the existence of a unique weak solution (or indeed of any weak solution) to the full problem; for this, one typically needs to show that the Galerkin approximations are uniformly bounded and appeal to a Sobolev embedding theorem to extract a convergent subsequence.

Example 12.3. 1. The Fourier basis {e_k}_{k∈Z} of L²(S¹, dx; C) is defined by
\[ e_k(x) = \frac{1}{\sqrt{2\pi}} \exp(\mathrm{i}kx). \]
For Galerkin projection, one can use the finite-dimensional subspace
\[ \mathcal{V}^{(2M+1)} := \operatorname{span}\{e_{-M}, \dots, e_{-1}, e_0, e_1, \dots, e_M\} \]
of functions that are band-limited to contain frequencies at most M. In the case of real-valued functions, one can use the functions x ↦ cos(kx) for k ∈ N_0 and x ↦ sin(kx) for k ∈ N.

2. Fix a partition a = x_0 < x_1 < · · · < x_M = b of a compact interval [a, b] ⊂ R and consider the associated tent functions defined by
\[ \phi_m(x) := \begin{cases} 0, & \text{if } x \le a \text{ or } x \le x_{m-1}; \\ (x - x_{m-1})/(x_m - x_{m-1}), & \text{if } x_{m-1} \le x \le x_m; \\ (x_{m+1} - x)/(x_{m+1} - x_m), & \text{if } x_m \le x \le x_{m+1}; \\ 0, & \text{if } x \ge b \text{ or } x \ge x_{m+1}. \end{cases} \]
The function φ_m takes the value 1 at x_m and decays linearly to 0 along the two line segments adjacent to x_m. The (M + 1)-dimensional vector space span{φ_0, . . . , φ_M} consists of all continuous functions on [a, b] that are piecewise affine on the partition, i.e. have constant derivative on each of the open intervals (x_{m−1}, x_m). The space span{φ_1, . . . , φ_{M−1}} consists of the continuous functions that are piecewise affine on the partition and take the value 0 at a and b; hence, it is one good choice for a finite-dimensional space with which to approximate the Sobolev space H^1_0([a, b]). More generally, one could consider tent functions associated to any simplicial mesh in R^n.

An important property of the solution u^(M) of the Galerkin problem, viewed as an approximation to the solution u of the original problem, is that the error u − u^(M) is a-orthogonal to the approximation subspace V^(M): for any choice of v^(M) ∈ V^(M) ⊆ H,
\[ a(u - u^{(M)}, v^{(M)}) = a(u, v^{(M)}) - a(u^{(M)}, v^{(M)}) = \langle f, v^{(M)}\rangle - \langle f, v^{(M)}\rangle = 0. \]

However, note well that u^(M) is generally not the optimal approximation of u from the subspace V^(M), i.e.
\[ \|u - u^{(M)}\| \neq \inf\big\{ \|u - v^{(M)}\| \,\big|\, v^{(M)} \in \mathcal{V}^{(M)} \big\}. \]
The optimal approximation of u from V^(M) is the orthogonal projection of u onto V^(M); if H has an orthonormal basis {e_n} and u = Σ_{n∈N} u^n e_n, then the optimal approximation of u in V^(M) = span{e_1, . . . , e_M} is Σ_{n=1}^M u^n e_n, but this is not generally the same as the Galerkin solution u^(M). However, the next result, Céa's lemma, shows that u^(M) is a quasi-optimal approximation to u (note that the ratio C/c is always at least 1):

Lemma 12.4 (Céa's lemma). Let a, c and C be as in the statement of the Lax–Milgram theorem. Then the weak solution u ∈ H and the Galerkin solution u^(M) ∈ V^(M) satisfy
\[ \|u - u^{(M)}\| \le \frac{C}{c} \inf\big\{ \|u - v^{(M)}\| \,\big|\, v^{(M)} \in \mathcal{V}^{(M)} \big\}. \]

Proof. Exercise 12.2.

Matrix Form. It is helpful to recast the Galerkin problem in matrix form. Let φ_1, . . . , φ_M be a basis for V^(M). Then u^(M) solves the Galerkin problem if and only if
\[ a(u^{(M)}, \phi_m) = \langle f, \phi_m\rangle \quad \text{for } m \in \{1, \dots, M\}. \]
Now expand u^(M) in this basis as u^(M) = Σ_{m=1}^M u^m φ_m and insert this into the previous equation:
\[ a\Big( \sum_{m=1}^{M} u^m \phi_m, \phi_i \Big) = \sum_{m=1}^{M} u^m a(\phi_m, \phi_i) = \langle f, \phi_i\rangle \quad \text{for } i \in \{1, \dots, M\}. \]

In other words, the vector of coefficients [u^(M)] = [u^1, . . . , u^M]^⊤ ∈ R^M satisfies the matrix equation
\[ \underbrace{\begin{bmatrix} a(\phi_1, \phi_1) & \dots & a(\phi_M, \phi_1) \\ \vdots & \ddots & \vdots \\ a(\phi_1, \phi_M) & \dots & a(\phi_M, \phi_M) \end{bmatrix}}_{=:[a]} \underbrace{\begin{bmatrix} u^1 \\ \vdots \\ u^M \end{bmatrix}}_{=:[u^{(M)}]} = \underbrace{\begin{bmatrix} \langle f, \phi_1\rangle \\ \vdots \\ \langle f, \phi_M\rangle \end{bmatrix}}_{=:[b]}. \qquad (12.4) \]
The matrix [a] ∈ R^{M×M} is the Gram matrix of the bilinear form a, and is of course a symmetric matrix whenever a is a symmetric bilinear form.

Remark 12.5. In practice the matrix-vector equation [a][u^(M)] = [b] is never solved by explicitly inverting the Gram matrix [a] to obtain the coefficients u^m via [u^(M)] = [a]^{−1}[b]. Indeed, in many situations the Gram matrix is sparse, and so solution methods that take advantage of that sparsity are used; furthermore, for large systems, the methods used are often iterative rather than direct (e.g. factorization-based).
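As a concrete illustration of (12.4) and Remark 12.5, here is a minimal sketch (anticipating Exercise 12.3) for the bilinear form a(u, v) = ∫₀¹ u′v′ dx with interior tent functions on a uniform mesh, for which the Gram matrix is tridiagonal and sparse; the load vector below uses the simple approximation ⟨f, φ_m⟩ ≈ h f(x_m), and all names are illustrative.

    import numpy as np
    from scipy.sparse import diags
    from scipy.sparse.linalg import spsolve

    N = 100                                           # number of mesh cells
    h = 1.0 / N
    x = np.linspace(0.0, 1.0, N + 1)[1:-1]            # interior nodes x_1, ..., x_{N-1}
    A = diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(N - 1, N - 1)) / h  # Gram matrix [a]
    f = lambda t: np.ones_like(t)                     # illustrative forcing term f = 1
    b = h * f(x)                                      # approximate load vector [b]
    u = spsolve(A.tocsr(), b)                         # Galerkin coefficients u^m

The matrix is never inverted explicitly; a sparse direct (or, for larger problems, iterative) solver is used, in line with the remark above.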

Lax–Milgram Theory for Banach Spaces. There are many extensions of the now-classical Lax–Milgram lemma beyond the setting of symmetric bilinear forms on Hilbert spaces. For example, the following generalization is due to Kozono & Yanagisawa [53]:

Theorem 12.6 (Kozono–Yanagisawa). Let X be a Banach space, Y a reflexive Banach space, and a : X × Y → C a bilinear form such that

1. there is a constant M > 0 such that
\[ |a(x, y)| \le M\|x\|_{\mathcal{X}}\|y\|_{\mathcal{Y}} \quad \text{for all } x \in \mathcal{X},\ y \in \mathcal{Y}; \]

2. the null spaces
\[ N_{\mathcal{X}} := \{x \in \mathcal{X} \mid a(x, y) = 0 \text{ for all } y \in \mathcal{Y}\} \subseteq \mathcal{X}, \qquad N_{\mathcal{Y}} := \{y \in \mathcal{Y} \mid a(x, y) = 0 \text{ for all } x \in \mathcal{X}\} \subseteq \mathcal{Y}, \]
admit complementary closed subspaces R_X and R_Y such that X = N_X ⊕ R_X and Y = N_Y ⊕ R_Y;

3. there is a constant C > 0 such that
\[ \|x\|_{\mathcal{X}} \le C\Big( \sup_{y\in\mathcal{Y}} \frac{|a(x, y)|}{\|y\|_{\mathcal{Y}}} + \|P_{\mathcal{X}}x\|_{\mathcal{X}} \Big) \quad \text{for all } x \in \mathcal{X}, \]
\[ \|y\|_{\mathcal{Y}} \le C\Big( \sup_{x\in\mathcal{X}} \frac{|a(x, y)|}{\|x\|_{\mathcal{X}}} + \|P_{\mathcal{Y}}y\|_{\mathcal{Y}} \Big) \quad \text{for all } y \in \mathcal{Y}, \]
where P_X (resp. P_Y) is the projection of X (resp. Y) onto N_X (resp. N_Y) along the direct sum X = N_X ⊕ R_X (resp. Y = N_Y ⊕ R_Y).

Then, for every f ∈ Y′ such that ⟨f | y⟩ = 0 for all y ∈ N_Y, there exists x ∈ X such that
\[ a(x, y) = \langle f \,|\, y\rangle \quad \text{for all } y \in \mathcal{Y}. \]
Furthermore, there is a constant C independent of x and f such that ‖x‖_X ≤ C‖f‖_{Y′}.

12.2 Stochastic Galerkin Projection

Stochastic Lax–Milgram Theory. The next step is to build appropriate Lax–Milgram theory and Galerkin projection for stochastic problems, for which a good prototype is
\[ -\nabla\cdot(\kappa(\theta, x)\nabla u(\theta, x)) = f(\theta, x) \quad \text{for } x \in \Omega, \]
\[ u(\theta, x) = 0 \quad \text{for } x \in \partial\Omega, \]
with θ being drawn from some probability space (Θ, F, µ). To that end, we introduce a stochastic space S, which in the following will be L²(Θ, µ; R). We retain also a Hilbert space V in which the deterministic solution u(θ) is sought for each θ ∈ Θ; implicitly, V is independent of the problem data, or rather of θ. Thus, the space in which the stochastic solution U is sought is the tensor

product Hilbert space H := V ⊗ S, which is isomorphic to the space L²(Θ, µ; V) of square-integrable V-valued random variables.

In terms of bilinear forms, the setup is that of a bilinear-form-on-V-valued random variable A and a V′-valued random variable F. Define a bilinear form α on H by
\[ \alpha(X, Y) := \mathbb{E}_\mu[A(X, Y)] \equiv \int_{\Theta} A(\theta)\big(X(\theta), Y(\theta)\big)\,\mathrm{d}\mu(\theta) \]
and, similarly, a linear functional β on H by
\[ \beta(Y) := \mathbb{E}_\mu\big[\langle F \,|\, Y\rangle_{\mathcal{V}}\big]. \]
Clearly, if α satisfies the boundedness and coercivity assumptions of the Lax–Milgram theorem on H, then, for every F ∈ L²(Θ, µ; V′), there is a unique weak solution U ∈ L²(Θ, µ; V) satisfying
\[ \alpha(U, Y) = \beta(Y) \quad \text{for all } Y \in L^2(\Theta, \mu; \mathcal{V}). \]
A sufficient, but not necessary, condition for α to satisfy the hypotheses of the Lax–Milgram theorem on H is for A(θ) to satisfy those hypotheses uniformly in θ on V:

Theorem 12.7 (Stochastic Lax–Milgram theorem). Let (Θ, F, µ) be a probability space, and let A be a random variable on Θ, taking values in the space of symmetric bilinear forms on a Hilbert space V, and satisfying the hypotheses of the deterministic Lax–Milgram theorem (Theorem 12.2) uniformly with respect to θ ∈ Θ. Define a bilinear form α on L²(Θ, µ; V) by

α(X,Y ) := Eµ[A(X,Y )].

Then, for every F ∈ L2(Θ, µ;V ′), there is a unique U ∈ L2(Θ, µ;V) such that

\[ \alpha(U, Y) = \beta(Y) \quad \text{for all } Y \in L^2(\Theta, \mu; \mathcal{V}). \]

Proof. Suppose that A(θ) satisfies the boundedness assumption with constant C(θ) and the coercivity assumption with constant c(θ). By hypothesis,
\[ C' := \sup_{\theta\in\Theta} C(\theta) \quad \text{and} \quad c' := \inf_{\theta\in\Theta} c(\theta) \]
are both strictly positive and finite. The bilinear form α satisfies, for all X, Y ∈ H,

\[ \alpha(X, Y) = \mathbb{E}_\mu[A(X, Y)] \le \mathbb{E}_\mu\big[C\,\|X\|_{\mathcal{V}}\|Y\|_{\mathcal{V}}\big] \le C'\,\mathbb{E}_\mu\big[\|X\|_{\mathcal{V}}^2\big]^{1/2}\mathbb{E}_\mu\big[\|Y\|_{\mathcal{V}}^2\big]^{1/2} = C'\,\|X\|_{\mathcal{H}}\|Y\|_{\mathcal{H}}, \]
and
\[ \alpha(X, X) = \mathbb{E}_\mu[A(X, X)] \ge \mathbb{E}_\mu\big[c\,\|X\|_{\mathcal{V}}^2\big] \ge c'\,\|X\|_{\mathcal{H}}^2. \]

Hence, by the deterministic Lax–Milgram theorem applied to the bilinear form α on the Hilbert space H, for every F ∈ L²(Θ, µ; V′), there exists a unique U ∈ L²(Θ, µ; V) such that
\[ \alpha(U, Y) = \beta(Y) \quad \text{for all } Y \in L^2(\Theta, \mu; \mathcal{V}). \]

Remark 12.8. Note, however, that uniform boundedness and coercivity of A are not necessary for α to be bounded and coercive. For example, the constants c(θ) and C(θ) may degenerate to 0 or ∞ as θ approaches certain points of the sample space Θ. Provided that these degeneracies are integrable and yield positive and finite expected values, this will not ruin the boundedness and coercivity of α. Indeed, there may be an arbitrarily large (but µ-measure zero) set of θ for which there is no weak solution u to the deterministic problem
\[ A(\theta)(u, v) = \langle F(\theta) \,|\, v\rangle \quad \text{for all } v \in \mathcal{V}. \]

Stochastic Galerkin Projection. Let V^(M) be a finite-dimensional subspace of V, with basis φ_1, . . . , φ_M. As indicated above, take the stochastic space S to be L²(Θ, µ; K), which we assume to be equipped with an orthogonal decomposition such as a PC decomposition. Let S^P be a finite-dimensional subspace of S, for example the span of the polynomials of degree at most P. The Galerkin projection of the stochastic problem on H is to find U = Σ_{m,k} u_k^m φ_m ⊗ Ψ_k ∈ V^(M) ⊗ S^P such that
\[ \alpha(U, V) = \beta(V) \quad \text{for all } V \in \mathcal{V}^{(M)} \otimes \mathcal{S}^P. \]
Equivalently, it suffices to find U that satisfies this condition for each basis element V = φ_n ⊗ Ψ_ℓ of V^(M) ⊗ S^P. Recall that φ_n ⊗ Ψ_ℓ is the function (θ, x) ↦ φ_n(x)Ψ_ℓ(θ).

As before, the Galerkin problem is equivalent to the matrix-vector equation
\[ [\alpha][U] = [\beta] \]
in the basis {φ_m ⊗ Ψ_k | m = 1, . . . , M; k = 0, . . . , P} of V^(M) ⊗ S^P. An obvious question is how the Gram matrix [α] ∈ R^{M(P+1)×M(P+1)} is related to the Gram matrix of the random bilinear form A.

...

...Suppose that we are already given a linear problem with its deterministic problem discretized and cast in the matrix form
\[ [A](\xi)\,U(\xi) = B(\xi), \]
which, for each fixed ξ ∈ Ξ, has its solution U(ξ) ∈ V^(M) ≅ R^M. The stochastic Galerkin projection for the stochastic solution U = Σ_{k=0}^P u_k Ψ_k ∈ V^(M) ⊗ S^P gives
\[ \sum_{j=0}^{P} \langle \Psi_i, [A]\Psi_j\rangle\, u_j = \langle \Psi_i, B\rangle \quad \text{for each } i \in \{0, \dots, P\}. \]

This is equivalent to the (large!) block system
\[ \begin{bmatrix} [A]_{00} & \dots & [A]_{0P} \\ \vdots & \ddots & \vdots \\ [A]_{P0} & \dots & [A]_{PP} \end{bmatrix} \begin{bmatrix} u_0 \\ \vdots \\ u_P \end{bmatrix} = \begin{bmatrix} b_0 \\ \vdots \\ b_P \end{bmatrix}, \qquad (12.5) \]
where, for 0 ≤ i, j ≤ P,
• [A]_{ij} := ⟨Ψ_i, [A]Ψ_j⟩ ∈ R^{M×M}, where [A] ∈ R^{M×M} is the Gram matrix of the random bilinear form A;
• u_i = Σ_{m=1}^M u_i^m φ_m ∈ V^(M) is the ith stochastic mode of the solution U;
• b_i := ⟨Ψ_i, B⟩ ∈ R^M is the ith stochastic mode of the source term B.
Note that, in general, the stochastic modes u_j of the solution U (and, indeed, the coefficients u_j^m of the stochastic modes in the deterministic basis φ_1, . . . , φ_M) are all coupled together through the matrix [A]. This can be a limitation of stochastic Galerkin methods, and will be remarked upon later.

Example 12.9. As a special case, suppose that the random data have no impact on the differential operator and affect only the right-hand side B. In this case A(θ) = a for all θ ∈ Θ and so
\[ [A]_{ij} := \langle \Psi_i, [a]\Psi_j\rangle = [a]\langle \Psi_i, \Psi_j\rangle = [a]\,\delta_{ij}\langle \Psi_i^2\rangle. \]
Hence, the stochastic Galerkin system, in its matrix form (12.5), becomes
\[ \begin{bmatrix} [a] & [0] & \dots & [0] \\ [0] & [a]\langle\Psi_1^2\rangle & \ddots & \vdots \\ \vdots & \ddots & \ddots & [0] \\ [0] & \dots & [0] & [a]\langle\Psi_P^2\rangle \end{bmatrix} \begin{bmatrix} u_0 \\ u_1 \\ \vdots \\ u_P \end{bmatrix} = \begin{bmatrix} b_0 \\ b_1 \\ \vdots \\ b_P \end{bmatrix}. \]
Hence the stochastic modes u_j decouple and are given by
\[ u_j = [a]^{-1}\,\frac{\langle B, \Psi_j\rangle}{\langle \Psi_j^2\rangle}. \]

The Galerkin Tensor. In contrast to Example 12.9, in which the differential operator is deterministic, we can consider the case in which the matrix [α] has a (truncated) PC expansion
\[ [\alpha] = \sum_{k=0}^{P} [\alpha]_k \Psi_k \]
with coefficient matrices [α]_k ∈ R^{M×M} for k ≥ 0. In this case, the blocks [α]_{ij} are given by
\[ [\alpha]_{ij} = \langle \Psi_i, [\alpha]\Psi_j\rangle = \sum_{k=0}^{P} [\alpha]_k \langle \Psi_i\Psi_j\Psi_k\rangle. \]
Hence, the Galerkin block system (12.5) is equivalent to
\[ \begin{bmatrix} [\alpha]_{00} & \dots & [\alpha]_{0P} \\ \vdots & \ddots & \vdots \\ [\alpha]_{P0} & \dots & [\alpha]_{PP} \end{bmatrix} \begin{bmatrix} u_0 \\ \vdots \\ u_P \end{bmatrix} = \begin{bmatrix} b_0 \\ \vdots \\ b_P \end{bmatrix}, \qquad (12.6) \]

where
\[ b_i := \frac{\langle B, \Psi_i\rangle}{\langle \Psi_i^2\rangle}, \qquad [\alpha]_{ij} := \sum_{k=0}^{P} [\alpha]_k C_{kji}, \qquad C_{ijk} := \frac{\langle \Psi_i\Psi_j\Psi_k\rangle}{\langle \Psi_k\Psi_k\rangle}. \]
The rank-3 tensor C_{ijk} is called the Galerkin tensor:
• it is symmetric in the first two indices: C_{ijk} = C_{jik};
• this induces symmetry in the problem (12.6): [α]_{ij} = [α]_{ji};
• and since the Ψ_k are an orthogonal system, many of the (P + 1)³ entries of C_{ijk} are zero, leading to sparsity for the matrix problem; for example,
\[ [\alpha]_{00} = \sum_{k=0}^{P} [\alpha]_k C_{k00} = [\alpha]_0. \]
Note that the Galerkin tensor C_{ijk} is determined entirely by the PC system {Ψ_k | k ≥ 0}, and so while there is a significant computational cost associated to evaluating its entries, this is a one-time cost: the Galerkin tensor can be pre-computed, stored, and then used for many different problems, i.e. many A's and B's.
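The following sketch indicates how the Galerkin tensor might be pre-computed in the simplest setting of a one-dimensional standard Gaussian germ with the probabilists' Hermite basis He_k (any other gPC family would use its own quadrature rule); the function name and quadrature order are illustrative.

    import numpy as np
    from math import factorial
    from numpy.polynomial.hermite_e import hermegauss, hermeval

    def galerkin_tensor(P, n_quad=None):
        """C[i, j, k] = E[He_i He_j He_k] / E[He_k^2] for a standard Gaussian germ."""
        if n_quad is None:
            n_quad = 2 * P + 2                       # exact for polynomials up to degree 3P
        x, w = hermegauss(n_quad)                    # Gauss-Hermite_e nodes and weights
        w = w / w.sum()                              # normalize so that E[1] = 1
        vals = np.array([hermeval(x, [0] * k + [1]) for k in range(P + 1)])
        C = np.einsum('iq,jq,kq,q->ijk', vals, vals, vals, w)
        norms = np.array([float(factorial(k)) for k in range(P + 1)])  # E[He_k^2] = k!
        return C / norms[None, None, :]

Once computed and stored, the same tensor serves every problem posed in this basis.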

Example 12.10 (Ordinary differential equations). Consider random variables Z, B ∈ L²(Θ, µ; R) and the random linear first-order ordinary differential equation
\[ \frac{\mathrm{d}U}{\mathrm{d}t} = -ZU, \qquad U(0) = B, \]
for U : [0, T] × Θ → R. Let {Ψ_k}_{k∈N_0} be an orthogonal basis for L²(Θ, µ; R) with the usual convention that Ψ_0 = 1. Give Z, B and U the chaos expansions Z = Σ_{k∈N_0} z_k Ψ_k, B = Σ_{k∈N_0} b_k Ψ_k and U(t) = Σ_{k∈N_0} u_k(t) Ψ_k respectively. Projecting the evolution equation onto the basis {Ψ_k}_{k∈N_0} yields
\[ \mathbb{E}\Big[ \frac{\mathrm{d}U}{\mathrm{d}t}\,\Psi_k \Big] = -\mathbb{E}[ZU\Psi_k] \quad \text{for each } k \in \mathbb{N}_0. \]
Inserting the chaos expansions for Z and U into this yields, for every k ∈ N_0,
\[ \mathbb{E}\Big[ \sum_{i\in\mathbb{N}_0} \dot{u}_i(t)\Psi_i\Psi_k \Big] = -\mathbb{E}\Big[ \sum_{j\in\mathbb{N}_0} z_j\Psi_j \sum_{i\in\mathbb{N}_0} u_i(t)\Psi_i\,\Psi_k \Big], \]
i.e.
\[ \dot{u}_k(t)\,\langle\Psi_k^2\rangle = -\sum_{i,j\in\mathbb{N}_0} z_j u_i(t)\,\mathbb{E}[\Psi_j\Psi_i\Psi_k], \]
i.e.
\[ \dot{u}_k(t) = -\sum_{i,j\in\mathbb{N}_0} C_{ijk}\, z_j u_i(t). \]
The coefficients u_k form a coupled system of countably many ordinary differential equations. If all the chaos expansions are truncated at order P, then all

the above summations over N_0 become summations over {0, . . . , P}, yielding a coupled system of P + 1 ordinary differential equations. In matrix form:
\[ \frac{\mathrm{d}}{\mathrm{d}t}\begin{bmatrix} u_0(t) \\ \vdots \\ u_P(t) \end{bmatrix} = A^{\top}\begin{bmatrix} u_0 \\ \vdots \\ u_P \end{bmatrix}, \qquad \begin{bmatrix} u_0(0) \\ \vdots \\ u_P(0) \end{bmatrix} = \begin{bmatrix} b_0 \\ \vdots \\ b_P \end{bmatrix}, \]
where A ∈ R^{(P+1)×(P+1)} is given by A_{ik} = −Σ_j C_{ijk} z_j.
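A minimal sketch of how the truncated system of Example 12.10 might be integrated numerically; it assumes the galerkin_tensor helper sketched earlier in this chapter, and the gPC coefficients of Z and B below are purely illustrative.

    import numpy as np
    from scipy.integrate import solve_ivp

    P = 4
    C = galerkin_tensor(P)                        # assumed: the earlier sketch
    z = np.array([1.0, 0.3] + [0.0] * (P - 1))    # e.g. Z = 1 + 0.3*xi (illustrative)
    b = np.array([1.0] + [0.0] * P)               # B = 1 deterministically

    A = -np.einsum('ijk,j->ik', C, z)             # A[i, k] = -sum_j C_ijk z_j

    def rhs(t, u):
        return A.T @ u                            # du_k/dt = sum_i A_ik u_i

    sol = solve_ivp(rhs, (0.0, 1.0), b, dense_output=True)
    mean_at_T = sol.y[0, -1]                      # u_0(T) is the mean of U(T)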

12.3 Nonlinearities

Nonlinearities of various types occur throughout practical problems, and their treatment is critical in the context of stochastic Galerkin methods, which require the projection of these nonlinearities onto the finite-dimensional solution spaces. For example, given the approximation
\[ U(\xi) \approx U_P(\xi) = \sum_{k=0}^{P} u_k\Psi_k(\xi), \]
how does one calculate the coefficients of, say, U(ξ)² or √U(ξ)? The first example, U², is a special case of taking the product of two Galerkin approximations, and can be resolved using the Galerkin tensor C_{ijk} of the previous section.

Galerkin Products. Consider two random variables U, V ∈ L²(Θ, µ; R). The product random variable UV is again an element of L²(Θ, µ; R). The natural question to ask, given expansions U = Σ_{k∈N_0} u_k Ψ_k and V = Σ_{k∈N_0} v_k Ψ_k, is how to quickly compute the coefficients of UV in terms of {u_k}_{k∈N_0} and {v_k}_{k∈N_0}, particularly if the expansions are truncated to finite precision.

Example 12.11. Suppose that U = Σ_{k=0}^P u_k Ψ_k and V = Σ_{k=0}^P v_k Ψ_k. Then their product W := UV has the expansion
\[ W = \sum_{i,j=0}^{P} u_i v_j \Psi_i\Psi_j. \]
Note that, while W is guaranteed to be in L², it is not necessarily in S^P. Nevertheless, if we write W = Σ_{k≥0} w_k Ψ_k, it follows that
\[ w_k = \frac{\langle W, \Psi_k\rangle}{\langle \Psi_k, \Psi_k\rangle} = \sum_{i,j=0}^{P} u_i v_j C_{ijk}. \]
The truncation of the expansion W = Σ_{k≥0} w_k Ψ_k to k = 0, . . . , P is the orthogonal projection of W onto S^P (and hence the L²-closest approximation of W in S^P) and is called the Galerkin product, or pseudo-spectral product, of U and V, denoted U ∗ V.
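In code, the Galerkin product is a single contraction against the pre-computed Galerkin tensor; the sketch below assumes u, v are coefficient vectors of length P + 1 and C is the tensor C_ijk.

    import numpy as np

    def galerkin_product(u, v, C):
        """Pseudo-spectral product: w_k = sum_{i,j} u_i v_j C_ijk."""
        return np.einsum('i,j,ijk->k', u, v, C)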

The fact that multiplication of two random variables can be handled efficiently, albeit with some truncation error, in terms of their expansions in the gPC basis and the Galerkin tensor is very useful: it adds to the list of reasons why one might wish to pre-compute and store the Galerkin tensor for use in many problems.

However, the situation of binary products (and hence squares) is. . .Triple and higher products... non-commutativity

Galerkin Inversion. Given
\[ U = \sum_{k\ge 0} u_k\Psi_k \approx \sum_{k=0}^{P} u_k\Psi_k, \]
we seek a random variable V = Σ_{k≥0} v_k Ψ_k ≈ Σ_{k=0}^P v_k Ψ_k such that U(ξ)V(ξ) = 1 for almost every ξ. The weak interpretation of this desideratum is that U ∗ V = Ψ_0. Thus we are led to the following matrix-vector equation for the gPC coefficients of V:
\[ \begin{bmatrix} \sum_{k=0}^{P} C_{k00}u_k & \dots & \sum_{k=0}^{P} C_{kP0}u_k \\ \vdots & \ddots & \vdots \\ \sum_{k=0}^{P} C_{k0P}u_k & \dots & \sum_{k=0}^{P} C_{kPP}u_k \end{bmatrix} \begin{bmatrix} v_0 \\ \vdots \\ v_P \end{bmatrix} = \begin{bmatrix} 1 \\ 0 \\ \vdots \\ 0 \end{bmatrix}. \qquad (12.7) \]

Naturally, if U(ξ) = 0 for a set of ξ of positive probability, then V(ξ) will be undefined for those same ξ. Furthermore, if U ≈ 0 with 'too large' probability, then V may exist a.e. but fail to be in L². Hence, it is not surprising to learn that, while (12.7) has a unique solution whenever the matrix on the left-hand side is non-singular, the system becomes highly ill-conditioned as the amount of probability mass near U = 0 increases.
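A sketch of Galerkin inversion as the solution of the linear system (12.7); as just noted, the system can be severely ill-conditioned when U places significant probability mass near zero.

    import numpy as np

    def galerkin_inverse(u, C):
        """Solve (12.7) for the gPC coefficients v of V with U * V = Psi_0."""
        M = np.einsum('kjl,k->lj', C, u)        # M[l, j] = sum_k C_kjl u_k
        rhs = np.zeros(len(u))
        rhs[0] = 1.0
        return np.linalg.solve(M, rhs)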

FINISH ME!!!

Bibliography

Basic Lax–Milgram theory and Galerkin methods for PDEs can be found in any modern textbook on PDEs, such as those by Evans [27] (see Chapter 6) and Renardy & Rogers [80] (see Chapter 9).

The monograph of Xiu [117] provides a general introduction to spectral methods for uncertainty quantification, including Galerkin methods (Chapter 6), but is light on proofs. The book of Le Maître & Knio [58] covers Galerkin methods in Chapter 4, including an extensive treatment of nonlinearities in Section 4.5.

Exercises

Exercise 12.1. Let a be a bilinear form satisfying the hypotheses of the Lax–Milgram theorem. Given f ∈ H∗, show that the unique u such that a(u, v) = ⟨f | v⟩ for all v ∈ H satisfies ‖u‖_H ≤ c^{−1}‖f‖_{H∗}.

Exercise 12.2 (Céa's lemma). Let a, c and C be as in the statement of the Lax–Milgram theorem. Show that the weak solution u ∈ H and the Galerkin solution u^(M) ∈ V^(M) satisfy
\[ \|u - u^{(M)}\| \le \frac{C}{c} \inf\big\{ \|u - v^{(M)}\| \,\big|\, v^{(M)} \in \mathcal{V}^{(M)} \big\}. \]

Exercise 12.3. Consider a partition of the unit interval [0, 1] into N + 1 equally spaced nodes
\[ 0 = x_0 < x_1 = h < x_2 = 2h < \dots < x_N = 1, \]
where h = 1/N > 0. For n = 0, . . . , N, let
\[ \phi_n(x) := \begin{cases} 0, & \text{if } x \le 0 \text{ or } x \le x_{n-1}; \\ (x - x_{n-1})/h, & \text{if } x_{n-1} \le x \le x_n; \\ (x_{n+1} - x)/h, & \text{if } x_n \le x \le x_{n+1}; \\ 0, & \text{if } x \ge 1 \text{ or } x \ge x_{n+1}. \end{cases} \]
What space of functions is spanned by φ_0, . . . , φ_N? For these functions φ_0, . . . , φ_N, calculate the Gram matrix for the bilinear form
\[ a(u, v) := \int_0^1 u'(x)\,v'(x)\,\mathrm{d}x \]
corresponding to the Laplace operator. Determine also the vector components ⟨f, φ_n⟩ in the Galerkin equation (12.4).

Exercise 12.4. Show that, for fixed P, the Galerkin product satisfies, for all U, V, W ∈ S^P and α, β ∈ R,
\[ U * V = V * U, \qquad (\alpha U) * (\beta V) = \alpha\beta(U * V), \qquad (U + V) * W = U * W + V * W. \]

Exercise 12.5. Galerkin division: WRITE ME!!!


Chapter 13

Non-Intrusive Spectral Methods

[W]hen people thought the Earth was flat, they were wrong. When people thought the Earth was spherical, they were wrong. But if you think that thinking the Earth is spherical is just as wrong as thinking the Earth is flat, then your view is wronger than both of them put together.

The Relativity of Wrong

Isaac Asimov

Chapter 12 considered a spectral approach to UQ, namely Galerkin expansion, that is mathematically very attractive in that it is a natural extension of the Galerkin methods that are commonly used for deterministic PDEs and minimizes the stochastic residual, but has the severe disadvantage that the stochastic modes of the solution are coupled together by a large system such as (12.5). Hence, the Galerkin formalism is not suitable for situations in which deterministic solutions are slow and expensive to obtain, and the deterministic solution method cannot be modified. Many so-called legacy codes are not amenable to such intrusive methods of UQ.

In contrast, this chapter considers non-intrusive spectral methods for UQ. These are characterized by the feature that the solution of the deterministic problem is a 'black box' that does not need to be modified for use in the spectral method, beyond being able to be evaluated at any desired point θ of the probability space (Θ, F, µ).


13.1 Pseudo-Spectral Methods

Consider a square-integrable stochastic process u : Θ → H taking values in a separable Hilbert space H, with a spectral expansion
\[ u = \sum_{k\in\mathbb{N}_0} u_k\Psi_k \]
of u ∈ L²(Θ, µ; H) ≅ H ⊗ L²(Θ, µ; R) in terms of coefficients (stochastic modes) u_k ∈ H and an orthogonal basis {Ψ_k | k ∈ N_0} of L²(Θ, µ; R). As usual, the stochastic modes are given by
\[ u_k = \frac{\mathbb{E}_\mu[u\Psi_k]}{\mathbb{E}_\mu[\Psi_k^2]} = \frac{1}{\gamma_k}\int_{\Theta} u(\theta)\Psi_k(\theta)\,\mathrm{d}\mu(\theta). \]
If the normalization constants γ_k := ‖Ψ_k‖²_{L²(µ)} are known ahead of time, then it remains only to approximate the integral with respect to µ of the product of u with each basis function Ψ_k.

Deterministic Quadrature. If the dimension of Θ is low and u(θ) is relatively smooth as a function of θ, then an appealing approach to the estimation of E_µ[uΨ_k] is deterministic quadrature. For optimal polynomial accuracy, Gaussian quadrature (i.e. nodes at the roots of µ-orthogonal polynomials) may be used. In practice, nested quadrature rules such as Clenshaw–Curtis may be preferable, since one does not wish to have to discard past solutions of u upon passing to a more accurate quadrature rule. For multi-dimensional domains of integration Θ, sparse quadrature rules may be used to alleviate the curse of dimension.
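The following is a minimal sketch of this quadrature-based pseudo-spectral approach for a scalar quantity depending on a single standard Gaussian germ; the black-box solver solve(theta) is a stand-in for whatever deterministic code is available, and all other names are illustrative.

    import numpy as np
    from numpy.polynomial.hermite_e import hermegauss, hermeval

    def nisp_modes(solve, K, n_quad=20):
        theta, w = hermegauss(n_quad)              # Gauss-Hermite_e rule
        w = w / w.sum()                            # expectation weights, E[1] = 1
        samples = np.array([solve(t) for t in theta])
        modes = []
        for k in range(K + 1):
            psi_k = hermeval(theta, [0] * k + [1]) # He_k at the quadrature nodes
            gamma_k = np.dot(w, psi_k**2)          # quadrature estimate of gamma_k
            modes.append(np.dot(w, samples * psi_k) / gamma_k)
        return np.array(modes)

    # e.g. modes = nisp_modes(lambda t: np.exp(-np.exp(t)), K=5)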

Monte Carlo Integration. If the dimension of Θ is high, or u(θ) is a non-smooth function of θ, then it is tempting to resort to Monte Carlo approximation of E_µ[uΨ_k]. This approach is also appealing because the calculation of the stochastic modes u_k can be written as a straightforward (but often large) matrix-matrix multiplication, as in Exercise 13.1. The problem with Monte Carlo methods, as ever, is the slow convergence rate of ∼ (number of samples)^{−1/2}.

Sources of Error. In practice, the following sources of error arise when computing pseudo-spectral expansions of this type:

1. discretization error comes about through the approximation of H by a finite-dimensional subspace, i.e. the approximation of the stochastic modes u_k by finite sums u_k ≈ Σ_{i=1}^m u_{k,i} ϕ_i, where {ϕ_i | i ∈ N} is some basis for H;

2. truncation error comes about through the truncation of the spectral expansion for u after finitely many terms, i.e. u ≈ Σ_{k=0}^K u_k Ψ_k;

3. quadrature error comes about through the approximate nature of the numerical integration scheme used to find the stochastic modes.

13.2 Stochastic Collocation

Collocation methods for ordinary and partial differential equations are a form of polynomial interpolation. The idea is to find a low-dimensional object, usually a polynomial, that approximates the true solution to the differential equation by exactly satisfying the differential equation at a selected set of points, called collocation points or collocation nodes.

Example 13.1 (Collocation for an ODE). Consider for example the initial value problem
\[ \dot{u}(t) = f(t, u(t)) \quad \text{for } t \in [a, b], \qquad u(a) = u_a, \]
to be solved on an interval of time [a, b]. Choose n points
\[ a \le t_1 < t_2 < \dots < t_n \le b, \]
called collocation nodes. Now find a polynomial p(t) ∈ R_{≤n}[t] so that the ODE
\[ \dot{p}(t_k) = f(t_k, p(t_k)) \]
is satisfied for k = 1, . . . , n, as is the initial condition p(a) = u_a. For example, if n = 2, t_1 = a and t_2 = b, then the coefficients c_2, c_1, c_0 ∈ R of the polynomial approximation
\[ p(t) = \sum_{k=0}^{2} c_k (t - a)^k, \]
which has derivative ṗ(t) = 2c_2(t − a) + c_1, are required to satisfy
\[ \dot{p}(a) = c_1 = f(a, p(a)), \qquad \dot{p}(b) = 2c_2(b - a) + c_1 = f(b, p(b)), \qquad p(a) = c_0 = u_a, \]
i.e.
\[ p(t) = \frac{f(b, p(b)) - f(a, u_a)}{2(b - a)}(t - a)^2 + f(a, u_a)(t - a) + u_a. \]
The above equation implicitly defines the final value p(b) of the collocation solution. This method is also known as the trapezoidal rule for ODEs, since the same solution is obtained by rewriting the differential equation as
\[ u(t) = u(a) + \int_a^t f(s, u(s))\,\mathrm{d}s \]
and approximating the integral on the right-hand side by the trapezoidal quadrature rule for integrals.

It should be made clear at the outset that there is nothing stochastic about 'stochastic collocation', just as there is nothing chaotic about 'polynomial chaos'. The meaning of the term 'stochastic' in this case is that the collocation principle is being applied across the 'stochastic space' (i.e. the probability space) of a stochastic process, rather than the space/time/space-time domain. Consider for example the random PDE
\[ \mathcal{L}_{\theta}[u(x, \theta)] = 0 \quad \text{for } x \in \Omega,\ \theta \in \Theta, \]
\[ \mathcal{B}_{\theta}[u(x, \theta)] = 0 \quad \text{for } x \in \partial\Omega,\ \theta \in \Theta, \]
where, for µ-a.e. θ in some probability space (Θ, F, µ), the differential operator L_θ and boundary operator B_θ are well-defined and the PDE admits a unique solution u(·, θ) : Ω → R. The solution u : Ω × Θ → R is then a stochastic process. We now let Θ_M = {θ^(1), . . . , θ^(M)} ⊆ Θ be a finite set of prescribed collocation nodes. The collocation problem is to find an approximate solution u^(M) ≈ u that satisfies
\[ \mathcal{L}_{\theta^{(m)}}\big[u^{(M)}(x, \theta^{(m)})\big] = 0 \quad \text{for } x \in \Omega, \]
\[ \mathcal{B}_{\theta^{(m)}}\big[u^{(M)}(x, \theta^{(m)})\big] = 0 \quad \text{for } x \in \partial\Omega, \]
for m = 1, . . . , M; there is, however, some flexibility in how to approximate u(x, θ) for θ ∉ Θ_M.

Interpolation Approach. An obvious first approach is to use interpolating polynomials when they are available. This is easiest when the stochastic space Θ is one-dimensional.

Example 13.2. Consider the initial value problem
\[ \frac{\mathrm{d}}{\mathrm{d}t}u(t, \theta) = -\mathrm{e}^{\theta} u(t, \theta), \qquad u(0, \theta) = 1, \]
with θ ∼ N(0, 1). Take as the collocation nodes θ^(1), . . . , θ^(M) ∈ R the M roots of the Hermite polynomial He_M of degree M. The collocation solution u^(M) is given at the collocation nodes θ^(m) by
\[ \frac{\mathrm{d}}{\mathrm{d}t}u^{(M)}(t, \theta^{(m)}) = -\mathrm{e}^{\theta^{(m)}} u^{(M)}(t, \theta^{(m)}), \qquad u^{(M)}(0, \theta^{(m)}) = 1, \]
i.e.
\[ u^{(M)}(t, \theta^{(m)}) = \exp\big(-\mathrm{e}^{\theta^{(m)}} t\big). \]
Away from the collocation nodes, u^(M) is defined by polynomial interpolation: for each t, u^(M)(t, θ) is a polynomial in θ of degree at most M with prescribed values at the collocation nodes. Writing this interpolation in Lagrange form yields
\[ u^{(M)}(t, \theta) = \sum_{m=1}^{M} u^{(M)}(t, \theta^{(m)})\,\ell_m(\theta) = \sum_{m=1}^{M} \exp\big(-\mathrm{e}^{\theta^{(m)}} t\big) \prod_{\substack{1 \le k \le M \\ k \ne m}} \frac{\theta - \theta^{(k)}}{\theta^{(m)} - \theta^{(k)}}. \]
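A short sketch of Example 13.2 in code, with the collocation nodes taken to be the roots of He_M (obtained here from the Gauss–Hermite_e quadrature rule) and the interpolant evaluated in Lagrange form; all names are illustrative.

    import numpy as np
    from numpy.polynomial.hermite_e import hermegauss

    M = 5
    nodes, _ = hermegauss(M)                       # the M roots of He_M

    def u_node(t, theta_m):
        return np.exp(-np.exp(theta_m) * t)        # exact solve at a collocation node

    def u_collocation(t, theta):
        """Lagrange interpolant through the collocated solutions."""
        total = 0.0
        for m, th_m in enumerate(nodes):
            ell = np.prod([(theta - nodes[k]) / (th_m - nodes[k])
                           for k in range(M) if k != m])
            total += u_node(t, th_m) * ell
        return total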

Tensor Product Collocation.

Sparse Grid Collocation. Sparse grid interpolation using Smolyak–Chebyshev nodes [6].

Stochastic Collocation on Unstructured Grids. Stochastic collocation on arbitrary unstructured sets of nodes is a notably tricky subject, essentially because it boils down to polynomial interpolation through an unstructured set of nodes, which, as we have seen (WHERE???), is generally impossible.


Bibliography

The monograph of Xiu [117] provides a general introduction to spectral methods for uncertainty quantification, including collocation methods, but is light on proofs. The recent paper of Narayan & Xiu [70] presents a method for stochastic collocation on arbitrary sets of nodes using the framework of least orthogonal interpolation.

Non-intrusive methods for UQ, including NISP and stochastic collocation, are covered in Chapter 3 of Le Maître & Knio [58].

Exercises

Exercise 13.1. Let u = (u_1, . . . , u_M) : Θ → R^M be a square-integrable random vector defined over a probability space (Θ, F, µ), and let {Ψ_k | k ∈ N_0}, with normalization constants γ_k := ‖Ψ_k‖²_{L²(µ)}, be an orthogonal basis for L²(Θ, µ; R). Suppose that N independent samples {(θ^(n), u(θ^(n))) | n = 1, . . . , N}, with θ^(n) ∼ µ, are given, and it is desired to use these N Monte Carlo samples to form a truncated pseudo-spectral expansion
\[ u \approx u^{(N)} := \sum_{k=0}^{K} u_k^{(N)}\Psi_k \]
of u, where the approximate stochastic modes are obtained using Monte Carlo integration. Write down the defining equation for the mth component of the kth approximate stochastic mode, u^{(N)}_{k,m}, and hence show that the approximate

stochastic modes solve the matrix equation
\[ \begin{bmatrix} u^{(N)}_{0,1} & \dots & u^{(N)}_{K,1} \\ \vdots & \ddots & \vdots \\ u^{(N)}_{0,M} & \dots & u^{(N)}_{K,M} \end{bmatrix} = \frac{1}{N}\, D P^{\top} \Gamma^{-1}, \]
where
\[ \Gamma := \operatorname{diag}(\gamma_0, \dots, \gamma_K), \qquad D := \begin{bmatrix} u_1(\theta^{(1)}) & \dots & u_1(\theta^{(N)}) \\ \vdots & \ddots & \vdots \\ u_M(\theta^{(1)}) & \dots & u_M(\theta^{(N)}) \end{bmatrix}, \qquad P := \begin{bmatrix} \Psi_0(\theta^{(1)}) & \dots & \Psi_0(\theta^{(N)}) \\ \vdots & \ddots & \vdots \\ \Psi_K(\theta^{(1)}) & \dots & \Psi_K(\theta^{(N)}) \end{bmatrix}. \]

Exercise 13.2. What is the analogue of the result of Exercise 13.1 when the integrals are approximated using a quadrature rule, rather than using Monte Carlo?


Chapter 14

Distributional Uncertainty

Technology, in common with many other activities, tends toward avoidance of risks by investors. Uncertainty is ruled out if possible. [P]eople generally prefer the predictable. Few recognize how destructive this can be, how it imposes severe limits on variability and thus makes whole populations fatally vulnerable to the shocking ways our universe can throw the dice.

Heretics of Dune

Frank Herbert

In the previous chapters, it has been assumed that an exact model is available for the probabilistic components of a system, i.e. that all probability distributions involved are known and can be sampled. In practice, however, such assumptions about probability distributions are always wrong to some degree: the distributions used in practice may only be simple approximations of more complicated real ones, or there may be significant uncertainty about what the real distributions actually are. The same is true of uncertainty about the correct form of the forward physical model.

14.1 Maximum Entropy Distributions

Principle of Maximum Entropy. If all one knows about a probability measure µ is that it lies in some set A ⊆ M1(X ), then one should take µ to be the element µ_ME ∈ A of maximum entropy.

((Heuristic justifications. . .Wallis–Jaynes derivation?))FINISH ME!!!

Example 14.1 (Unconstrained maximum entropy distributions). If X = {1, . . . , m} and p ∈ R^m_{>0} is a probability measure on X, then the entropy of p is
\[ H(p) := -\sum_{i=1}^{m} p_i \log p_i. \qquad (14.1) \]
The only constraints on p are the natural ones that p_i ≥ 0 and that S(p) := Σ_{i=1}^m p_i = 1. Temporarily neglect the inequality constraints and use the method of Lagrange multipliers to find the extrema of H(p) among all p ∈ R^m with S(p) = 1; such p must satisfy, for some λ ∈ R,
\[ 0 = \nabla H(p) - \lambda\nabla S(p) = -\begin{bmatrix} 1 + \log p_1 + \lambda \\ \vdots \\ 1 + \log p_m + \lambda \end{bmatrix}. \]
It is clear that any solution to this equation must have p_1 = · · · = p_m, for if p_i and p_j differ, then at most one of 1 + log p_i + λ and 1 + log p_j + λ can equal 0 for the same value of λ. Therefore, since S(p) = 1, it follows that the unique extremizer of H(p) among {p ∈ R^m | S(p) = 1} is p_1 = · · · = p_m = 1/m. The inequality constraints that were neglected initially are satisfied, and are not active constraints, so it follows that the uniform probability measure on X is the unique maximum entropy distribution on X.

A similar argument using the calculus of variations shows that the unique maximum entropy probability distribution on an interval [a, b] ⊂ R is the uniform distribution (1/|b − a|) dx.

Example 14.2 (Constrained maximum entropy distributions). Consider the set of all probability measures µ on R that have mean m and variance s²; what is the maximum entropy distribution in this set? Consider probability measures µ that are absolutely continuous with respect to Lebesgue measure, having density ρ. Then the aim is to find µ to maximize
\[ H(\mu) = -\int_{\mathbb{R}} \rho(x)\log\rho(x)\,\mathrm{d}x, \]
subject to the constraints that ρ ≥ 0, ∫_R ρ(x) dx = 1, ∫_R xρ(x) dx = m and ∫_R (x − m)²ρ(x) dx = s². Introduce Lagrange multipliers c = (c_0, c_1, c_2) and the Lagrangian
\[ F_c(\rho) := -\int_{\mathbb{R}} \rho(x)\log\rho(x)\,\mathrm{d}x + c_0\int_{\mathbb{R}} \rho(x)\,\mathrm{d}x + c_1\int_{\mathbb{R}} x\rho(x)\,\mathrm{d}x + c_2\int_{\mathbb{R}} (x - m)^2\rho(x)\,\mathrm{d}x. \]
Consider a perturbation ρ + tσ; if ρ is indeed a critical point of F_c, then, regardless of σ, it must be true that
\[ \frac{\mathrm{d}}{\mathrm{d}t} F_c(\rho + t\sigma)\Big|_{t=0} = 0. \]

This derivative is given by
\[ \frac{\mathrm{d}}{\mathrm{d}t} F_c(\rho + t\sigma)\Big|_{t=0} = \int_{\mathbb{R}} \big[ -\log\rho(x) - 1 + c_0 + c_1 x + c_2(x - m)^2 \big]\,\sigma(x)\,\mathrm{d}x. \]
Since this must vanish for every perturbation σ, the expression in the brackets must vanish, i.e.
\[ \rho(x) = \exp\big( c_0 - 1 + c_1 x + c_2(x - m)^2 \big). \]
Since ρ(x) is the exponential of a quadratic form in x, µ must be a Gaussian of some mean and variance, which, by hypothesis, are m and s² respectively, i.e.
\[ c_0 = 1 - \log\sqrt{2\pi s^2}, \qquad c_1 = 0, \qquad c_2 = -\frac{1}{2s^2}. \]

Discrete Entropy and Convex Programming. In discrete settings, the negative of the entropy (14.1) of a probability measure p ∈ M1({1, . . . , m}), namely p ↦ Σ_{i=1}^m p_i log p_i, is a strictly convex function of p ∈ R^m_{>0}. Therefore, when p is constrained by a family of convex constraints, finding the maximum entropy distribution is a convex program:

minimize: Σ_{i=1}^m p_i log p_i
with respect to: p ∈ R^m
subject to: p ≥ 0
p · 1 = 1
ϕ_i(p) ≤ 0 for i = 1, . . . , n,

for given convex functions ϕ_1, . . . , ϕ_n : R^m → R. This is useful because an explicit formula for the maximum entropy distribution, such as in Example 14.2, is rarely available. Therefore, the possibility of efficiently computing the maximum entropy distribution, as in this convex programming situation, is very attractive.
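As a sketch of how such a convex program might be solved numerically (a generic constrained solver is used here purely for illustration; a dedicated convex-optimization package would be equally natural), consider the maximum entropy distribution on {1, . . . , m} subject to one linear moment inequality; the test function and bound are illustrative.

    import numpy as np
    from scipy.optimize import minimize

    m = 6
    phi = np.arange(1, m + 1, dtype=float)          # illustrative test function phi(i) = i
    c = 3.0                                         # constraint E_p[phi] <= c

    def neg_entropy(p):
        return np.sum(p * np.log(np.maximum(p, 1e-300)))

    cons = [{'type': 'eq',   'fun': lambda p: p.sum() - 1.0},
            {'type': 'ineq', 'fun': lambda p: c - phi @ p}]   # c - E_p[phi] >= 0
    res = minimize(neg_entropy, x0=np.full(m, 1.0 / m),
                   bounds=[(0.0, 1.0)] * m, constraints=cons, method='SLSQP')
    p_maxent = res.x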

However, it must be remembered that, despite the various justifications for the use of the MaxEnt principle, it remains a selection mechanism that in some sense returns a 'typical' or 'representative' distribution from a given class; what if one is more interested in 'atypical' behaviour? This is the topic of the next section.

14.2 Distributional Robustness

Suppose that we are interested in the value Q(µ†) of some quantity of interest that is a functional of a partially-known probability measure µ† on a space X. Very often, Q(µ†) arises as the expected value with respect to µ† of some function q : X → R, so the objective is to determine
\[ Q(\mu^{\dagger}) \equiv \mathbb{E}_{X\sim\mu^{\dagger}}[q(X)]. \]
Now suppose that µ† is known only to lie in some subset A ⊆ M1(X ). In the absence of any further information about which µ ∈ A are more or less likely to be µ†, and in particular if the consequences of planning based on an inaccurate estimate of Q(µ†) are very high, it makes sense to adopt a posture of 'healthy conservatism' and compute bounds on Q(µ†) that are as tight as justified by the information that µ† ∈ A, but no tighter, i.e. to find
\[ \underline{Q}(\mathcal{A}) := \inf_{\mu\in\mathcal{A}} Q(\mu) \quad\text{and}\quad \overline{Q}(\mathcal{A}) := \sup_{\mu\in\mathcal{A}} Q(\mu). \]
When Q(µ) is the expected value with respect to µ of some function q : X → R, the objective is to determine
\[ \underline{Q}(\mathcal{A}) := \inf_{\mu\in\mathcal{A}} \mathbb{E}_\mu[q] \quad\text{and}\quad \overline{Q}(\mathcal{A}) := \sup_{\mu\in\mathcal{A}} \mathbb{E}_\mu[q]. \]
The inequality
\[ \underline{Q}(\mathcal{A}) \le Q(\mu^{\dagger}) \le \overline{Q}(\mathcal{A}) \]
is, by construction, the sharpest possible bound on Q(µ†) given only the information that µ† ∈ A. The obvious question is, can \underline{Q}(\mathcal{A}) and \overline{Q}(\mathcal{A}) be computed?

Finite Sample Spaces. Suppose that the sample space X = {1, . . . , K} is a finite set equipped with the discrete topology. Then the space of measurable functions f : X → R is isomorphic to R^K and the space of probability measures µ on X is isomorphic to the unit simplex in R^K. If the available information on µ† is that it lies in the set
\[ \mathcal{A} := \{\mu \in \mathcal{M}_1(\mathcal{X}) \mid \mathbb{E}_\mu[\varphi_n] \le c_n \text{ for } n = 1, \dots, N\} \]
for known measurable functions ϕ_1, . . . , ϕ_N : X → R and values c_1, . . . , c_N ∈ R, then the problem of finding the extreme values of E_µ[q] among µ ∈ A reduces to linear programming:

extremize: p · q
with respect to: p ∈ R^K
subject to: p ≥ 0
p · 1 = 1
p · ϕ_n ≤ c_n for n = 1, . . . , N.

Note that the feasible set A for this problem is a convex subset of R^K; indeed, A is a polytope, i.e. the intersection of finitely many closed half-spaces of R^K. Furthermore, as a closed subset of the probability simplex in R^K, A is compact. Therefore, by Corollary 4.18, the extreme values of this problem are certain to be found in the extremal set ext(A). This insight can be exploited to great effect in the study of distributional robustness problems for general sample spaces X.
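A minimal sketch of the linear program above using scipy.optimize.linprog; the quantity of interest and the single moment constraint are illustrative.

    import numpy as np
    from scipy.optimize import linprog

    K = 10
    q = np.sin(np.arange(K))                 # illustrative quantity of interest
    phi = np.arange(K, dtype=float)          # one test function phi_1(i) = i
    c = np.array([4.0])                      # constraint E_p[phi_1] <= 4

    A_ub, b_ub = phi[None, :], c
    A_eq, b_eq = np.ones((1, K)), np.array([1.0])
    bounds = [(0.0, None)] * K

    lower = linprog( q, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds).fun
    upper = -linprog(-q, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds).fun

Both extreme values are attained at vertices of the feasible polytope, i.e. at measures supported on few points, in line with the discussion of extreme points that follows.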

Remarkably, when the feasible set A of probability measures is sufficiently like a polytope, it is not necessary to consider finite sample spaces. What would appear to be an intractable optimization problem over an infinite-dimensional set of measures is in fact equivalent to a tractable finite-dimensional problem. Thus, the aim of this section is to find a finite-dimensional subset A_∆ of A with the property that
\[ \operatorname*{ext}_{\mu\in\mathcal{A}} Q(\mu) = \operatorname*{ext}_{\mu\in\mathcal{A}_{\Delta}} Q(\mu). \]
To perform this reduction, it is necessary to restrict attention to probability measures, topological spaces, and functionals that are sufficiently well-behaved.

Extreme Points of Moment Classes. The first step in this reduction is to classify the extremal measures in sets of probability measures that are prescribed by inequality or equality constraints on the expected value of finitely many arbitrary measurable test functions, so-called moment classes. Since, in finite time, we can only verify (even approximately, numerically) the truth of finitely many inequalities, such moment classes are appealing feasible sets from an epistemological point of view, because they conform to Karl Popper's dictum that "Our knowledge can only be finite, while our ignorance must necessarily be infinite."

Definition 14.3. A Borel measure µ on a topological space X is called inner regular if, for every Borel-measurable set E ⊆ X,
\[ \mu(E) = \sup\{\mu(K) \mid K \subseteq E \text{ and } K \text{ is compact}\}. \]
A pseudo-Radon space is a topological space on which every Borel probability measure is inner regular. A Radon space is a separable, metrizable, pseudo-Radon space.

Examples 14.4. 1. Lebesgue measure on Euclidean space R^n (restricted to the Borel σ-algebra B(R^n), if pedantry is the order of the day) is an inner regular measure. Similarly, Gaussian measure is an inner regular probability measure on R^n. Proof. See MA359 Measure Theory.

2. However, Lebesgue/Gaussian measures on R equipped with the topology of one-sided convergence are not inner regular measures. Proof. See Exercise 14.1.

3. Every Polish space (i.e. every separable and completely metrizable topological space) is a pseudo-Radon space.

Compare the following definition of a barycentre (a centre of mass) for a set of probability measures with the conclusion of the Choquet–Bishop–de Leeuw theorem, Theorem 4.13:

Definition 14.5. A barycentre for a set A ⊆ M1(X ) is a probability measure µ ∈ M1(X ) such that there exists p ∈ M1(ext(A)) such that
\[ \mu(B) = \int_{\operatorname{ext}(\mathcal{A})} \nu(B)\,\mathrm{d}p(\nu) \quad \text{for all measurable } B \subseteq \mathcal{X}. \qquad (14.2) \]

Definition 14.6. A Riesz space (or vector lattice) is a vector space V together with a partial order ≤ that is
1. (translation invariant) for all x, y, z ∈ V, x ≤ y =⇒ x + z ≤ y + z;
2. (positively homogeneous) for all x, y ∈ V and scalars α ≥ 0, x ≤ y =⇒ αx ≤ αy;
3. (lattice structure) for all x, y ∈ V, there exists a supremum element x ∨ y that is a least upper bound for x and y with respect to the order ≤.

Definition 14.7. A subset S of a vector space V is a Choquet simplex if the cone
\[ C := \{(tx, t) \in V \times \mathbb{R} \mid t \ge 0,\ x \in S\} \]
is such that C − C is a Riesz space when C is taken to be the non-negative cone.

Figure 14.1: Heuristic justification of Winkler's classification of extreme points of moment sets (Theorem 14.8).

The definition of a Choquet simplex extends the usual finite-dimensional definition: a finite-dimensional compact Choquet simplex is a simplex in the ordinary sense of being the closed convex hull of finitely many points.

With these definitions, the extreme points of moment sets of probability measures can be described by the following theorem due to Winkler. The proof is rather technical, and can be found in [116]. The important point is that when X is a pseudo-Radon space, Winkler's theorem applies with P = M1(X ), and so the extreme measures in moment classes will simply be finite convex combinations of Dirac measures. Figures like Figure 14.1 should make this an intuitively plausible claim.

Theorem 14.8 (Winkler). Let (X, F) be a measurable space and let P ⊆ M1(F) be a Choquet simplex such that ext(P) consists of Dirac measures. Fix measurable functions ϕ_1, . . . , ϕ_n : X → R and c_1, . . . , c_n ∈ R and let
\[ \mathcal{A} := \big\{\mu \in \mathcal{P} \,\big|\, \text{for } i = 1, \dots, n,\ \varphi_i \in L^1(\mu) \text{ and } \mathbb{E}_\mu[\varphi_i] \le c_i \big\}. \]
Then A is convex and its extremal set satisfies
\[ \operatorname{ext}(\mathcal{A}) \subseteq \mathcal{A}_{\Delta} := \Big\{\mu \in \mathcal{A} \,\Big|\, \mu = \textstyle\sum_{i=1}^{m}\alpha_i\delta_{x_i},\ 1 \le m \le n + 1,\ \text{and the vectors } (\varphi_1(x_i), \dots, \varphi_n(x_i), 1)_{i=1}^{m} \text{ are linearly independent}\Big\}. \]
Furthermore, if all the moment conditions defining A are given by equalities, then ext(A) = A_∆.

Optimization of Measure Affine Functionals. Having understood the extreme points of moment classes, the next step is to show that the optimization of suitably nice functionals on such classes can be exactly reduced to optimization over the extremal measures in the class.

Definition 14.9. For A ⊆ M1(X ), a function F : A → R ∪ {±∞} is said to be measure affine if, for all µ ∈ A and p ∈ M1(ext(A)) for which (14.2) holds, F is p-integrable with
\[ F(\mu) = \int_{\operatorname{ext}(\mathcal{A})} F(\nu)\,\mathrm{d}p(\nu). \qquad (14.3) \]

As always, the reader should check that the terminology 'measure affine' is a sensible choice by verifying that when X = {1, . . . , K} is a finite sample space, the restriction of any affine function F : R^K ≅ M_±(X ) → R to a subset A ⊆ M1(X ) is a measure affine function in the sense of Definition 14.9.

An important and simple example of a measure affine functional is an evaluation functional, i.e. the integration of a fixed measurable function q:

Proposition 14.10. If q is bounded either below or above, then ν ↦ E_ν[q] is a measure affine map.

Proof. Exercise 14.2.

In summary, we now have the following:

Theorem 14.11. Let X be a pseudo-Radon space and let A ⊆ M1(X ) be a moment class of the form
\[ \mathcal{A} := \{\mu \in \mathcal{M}_1(\mathcal{X}) \mid \mathbb{E}_\mu[\varphi_j] \le 0 \text{ for } j = 1, \dots, N\} \]
for prescribed measurable functions ϕ_j : X → R. Then the extreme points of A are given by
\[ \operatorname{ext}(\mathcal{A}) \subseteq \mathcal{A}_{\Delta} := \mathcal{A} \cap \Delta_N(\mathcal{X}) = \Big\{\mu \in \mathcal{M}_1(\mathcal{X}) \,\Big|\, \text{for some } \alpha_0, \dots, \alpha_N \in [0, 1] \text{ and } x_0, \dots, x_N \in \mathcal{X},\ \mu = \textstyle\sum_{i=0}^{N}\alpha_i\delta_{x_i},\ \textstyle\sum_{i=0}^{N}\alpha_i = 1,\ \text{and } \textstyle\sum_{i=0}^{N}\alpha_i\varphi_j(x_i) \le 0 \text{ for } j = 1, \dots, N\Big\}. \]
Hence, if q is bounded either below or above, then \underline{Q}(\mathcal{A}) = \underline{Q}(\mathcal{A}_{\Delta}) and \overline{Q}(\mathcal{A}) = \overline{Q}(\mathcal{A}_{\Delta}).

Proof. The classification of ext(A) is given by Winkler's theorem (Theorem 14.8). Since q is bounded on at least one side, Proposition 14.10 implies that µ ↦ F(µ) := E_µ[q] is measure affine. Let µ ∈ A be arbitrary and choose a probability measure p ∈ M1(ext(A)) with barycentre µ. Then it follows from the barycentric formula (14.3) that
\[ F(\mu) \le \sup\{F(\nu) \mid \nu \in \operatorname{ext}(\mathcal{A})\}. \]
This proves the claim for maximization; minimization is similar.

The kinds of constraints on measures (or, if you prefer, random variables) that can be considered in Theorem 14.11 include values for or bounds on functions of one or more of those random variables, e.g. the mean of X_1, the variance of X_2, the covariance of X_3 and X_4. However, one type of information that is not of this type is that X_5 and X_6 are independent random variables, i.e. that their joint law is a product measure. The problem here is that sets of product measures can fail to be convex (see Exercise 14.3), so the reduction to extreme points cannot be applied directly. Fortunately, a cunning application of Fubini's theorem resolves this difficulty. Note well, though, that unlike Theorem 14.11, Theorem 14.12 does not say that A_∆ = ext(A); it only says that the optimization problem has the same extreme values over A_∆ and A.

Theorem 14.12. Let $\mathcal{A} \subseteq \mathcal{M}_1(\mathcal{X})$ be a moment class of the form
\[
\mathcal{A} := \left\{ \mu = \bigotimes_{k=1}^{K} \mu_k \in \bigotimes_{k=1}^{K} \mathcal{M}_1(\mathcal{X}_k) \,\middle|\,
\begin{array}{l}
\mathbb{E}_\mu[\varphi_j] \le 0 \text{ for } j = 1, \dots, N, \\
\mathbb{E}_{\mu_1}[\varphi_{1j}] \le 0 \text{ for } j = 1, \dots, N_1, \\
\quad \vdots \\
\mathbb{E}_{\mu_K}[\varphi_{Kj}] \le 0 \text{ for } j = 1, \dots, N_K
\end{array}
\right\}
\]
for prescribed measurable functions $\varphi_j \colon \mathcal{X} \to \mathbb{R}$ and $\varphi_{kj} \colon \mathcal{X}_k \to \mathbb{R}$. Let
\[
\mathcal{A}_\Delta := \left\{ \mu \in \mathcal{A} \,\middle|\, \mu_k \in \Delta_{N+N_k}(\mathcal{X}_k) \right\}.
\]
Then, if $q$ is bounded either above or below, $\overline{Q}(\mathcal{A}) = \overline{Q}(\mathcal{A}_\Delta)$ and $\underline{Q}(\mathcal{A}) = \underline{Q}(\mathcal{A}_\Delta)$.

Proof. Let $\varepsilon > 0$ and let $\mu^* \in \mathcal{A}$ be $\frac{\varepsilon}{K+1}$-suboptimal for the maximization of $\mu \mapsto \mathbb{E}_\mu[q]$ over $\mu \in \mathcal{A}$, i.e.
\[
\mathbb{E}_{\mu^*}[q] \ge \sup_{\mu \in \mathcal{A}} \mathbb{E}_\mu[q] - \frac{\varepsilon}{K+1}.
\]
By Fubini's theorem,
\[
\mathbb{E}_{\mu_1^* \otimes \dots \otimes \mu_K^*}[q] = \mathbb{E}_{\mu_1^*}\!\left[ \mathbb{E}_{\mu_2^* \otimes \dots \otimes \mu_K^*}[q] \right].
\]
By the same arguments used in the proof of Theorem 14.11, $\mu_1^*$ can be replaced by some probability measure $\nu_1 \in \mathcal{M}_1(\mathcal{X}_1)$ with support on at most $N + N_1$ points, such that $\nu_1 \otimes \mu_2^* \otimes \dots \otimes \mu_K^* \in \mathcal{A}$, and
\[
\mathbb{E}_{\nu_1}\!\left[ \mathbb{E}_{\mu_2^* \otimes \dots \otimes \mu_K^*}[q] \right]
\ge \mathbb{E}_{\mu_1^*}\!\left[ \mathbb{E}_{\mu_2^* \otimes \dots \otimes \mu_K^*}[q] \right] - \frac{\varepsilon}{K+1}
\ge \sup_{\mu \in \mathcal{A}} \mathbb{E}_\mu[q] - \frac{2\varepsilon}{K+1}.
\]
Repeating this argument a further $K - 1$ times yields $\nu = \bigotimes_{k=1}^{K} \nu_k \in \mathcal{A}_\Delta$ such that
\[
\mathbb{E}_\nu[q] \ge \sup_{\mu \in \mathcal{A}} \mathbb{E}_\mu[q] - \varepsilon.
\]
Since $\varepsilon > 0$ was arbitrary, it follows that
\[
\sup_{\mu \in \mathcal{A}_\Delta} \mathbb{E}_\mu[q] = \sup_{\mu \in \mathcal{A}} \mathbb{E}_\mu[q].
\]
The proof for the infimum is similar.

14.3 Functional and Distributional Robustness

In addition to epistemic uncertainty about probability measures, applications often feature epistemic uncertainty about the functions involved. For example, if the system of interest is in reality some function $g^\dagger$ from a space $\mathcal{X}$ of inputs to another space $\mathcal{Y}$ of outputs, it may only be known that $g^\dagger$ lies in some subset $\mathcal{G}$ of the set of all (measurable) functions from $\mathcal{X}$ to $\mathcal{Y}$; furthermore, our information about $g^\dagger$ and our information about $\mu^\dagger$ may be coupled in some way, e.g. by knowledge of $\mathbb{E}_{X\sim\mu^\dagger}[g^\dagger(X)]$. Therefore, we now consider admissible sets of the form
\[
\mathcal{A} \subseteq \left\{ (g, \mu) \,\middle|\, g \colon \mathcal{X} \to \mathcal{Y} \text{ is measurable and } \mu \in \mathcal{M}_1(\mathcal{X}) \right\},
\]

quantities of interest of the form $Q(g, \mu) = \mathbb{E}_{X\sim\mu}[q(X, g(X))]$ for some measurable function $q \colon \mathcal{X} \times \mathcal{Y} \to \mathbb{R}$, and seek the extreme values
\[
\underline{Q}(\mathcal{A}) := \inf_{(g,\mu)\in\mathcal{A}} \mathbb{E}_{X\sim\mu}[q(X, g(X))]
\quad \text{and} \quad
\overline{Q}(\mathcal{A}) := \sup_{(g,\mu)\in\mathcal{A}} \mathbb{E}_{X\sim\mu}[q(X, g(X))].
\]

Obviously, if for each $g \colon \mathcal{X} \to \mathcal{Y}$ the set of $\mu \in \mathcal{M}_1(\mathcal{X})$ such that $(g, \mu) \in \mathcal{A}$ is a moment class of the form considered in Theorem 14.12, then
\[
\operatorname*{ext}_{(g,\mu)\in\mathcal{A}} \mathbb{E}_{X\sim\mu}[q(X, g(X))]
= \operatorname*{ext}_{\substack{(g,\mu)\in\mathcal{A} \\ \mu \in \bigotimes_{k=1}^{K} \Delta_{N+N_k}(\mathcal{X}_k)}} \mathbb{E}_{X\sim\mu}[q(X, g(X))].
\]

In principle, although the search over $\mu$ is finite-dimensional for each $g$, the search over $g$ is still infinite-dimensional. However, the passage to discrete measures often enables us to finite-dimensionalize the search over $g$, since, in some sense, only the values of $g$ on the finite set $\operatorname{supp}(\mu)$ 'matter' in computing $\mathbb{E}_{X\sim\mu}[q(X, g(X))]$.

The idea is quite simple: instead of optimizing with respect to $g \in \mathcal{G}$, we optimize with respect to the finite-dimensional vector $y = g|_{\operatorname{supp}(\mu)}$. However, this reduction step requires some care:

1. Some 'functions' do not have their values defined pointwise, e.g. 'functions' in Lebesgue and Sobolev spaces, which are actually equivalence classes of functions modulo equality almost everywhere. If isolated points have measure zero, then it makes no sense to restrict such 'functions' to a finite set $\operatorname{supp}(\mu)$. These difficulties are circumvented by insisting that $\mathcal{G}$ be a space of functions with pointwise-defined values.

2. In the other direction, it is sometimes difficult to verify whether a vector $y$ indeed arises as the restriction to $\operatorname{supp}(\mu)$ of some $g \in \mathcal{G}$; we need functions that can be extended from $\operatorname{supp}(\mu)$ to all of $\mathcal{X}$. Suitable extension properties are ensured if we restrict attention to smooth enough functions between the right kinds of spaces.

Theorem 14.13 (Minty's extension theorem). Let $(\mathcal{X}, d)$ be a metric space, let $\mathcal{H}$ be a Hilbert space, let $E \subseteq \mathcal{X}$, and suppose that $f \colon E \to \mathcal{H}$ satisfies
\[
\| f(x) - f(y) \|_{\mathcal{H}} \le d(x, y)^{\alpha} \quad \text{for all } x, y \in E \tag{14.4}
\]
with Hölder exponent $0 < \alpha \le 1$. Then there exists $F \colon \mathcal{X} \to \mathcal{H}$ such that $F|_E = f$ and (14.4) holds for all $x, y \in \mathcal{X}$ if either $\alpha \le \tfrac{1}{2}$ or $\mathcal{X}$ is an inner product space with metric given by $d(x, y) = k^{1/\alpha} \| x - y \|$ for some $k > 0$. Furthermore, the extension can be performed so that $F(\mathcal{X}) \subseteq \operatorname{co}(f(E))$, and hence without increasing the Hölder norm
\[
\sup_{x} \| f(x) \|_{\mathcal{H}} + \sup_{x \neq y} \frac{\| f(x) - f(y) \|_{\mathcal{H}}}{d(x, y)^{\alpha}}.
\]


[Figure 14.1 here: two panels, showing the three points of $E$ in $(\mathbb{R}^2, \|\cdot\|_\infty)$ on the left and their images under $f$ in $(\mathbb{R}^2, \|\cdot\|_2)$ on the right.]

Figure 14.1: The function $f$ that takes the three points on the left (equipped with $\|\cdot\|_\infty$) to the three points on the right (equipped with $\|\cdot\|_2$) has Lipschitz constant 1, but has no 1-Lipschitz extension $F$ to $(0, 0)$, let alone the whole plane $\mathbb{R}^2$, since $F((0, 0))$ would have to lie in the (empty) intersection of the three grey discs. Cf. Example 14.14.

Special cases of Minty's theorem include the Kirszbraun–Valentine theorem (which assures that Lipschitz functions between Hilbert spaces can be extended without increasing the Lipschitz constant) and McShane's theorem (which assures that scalar-valued continuous functions on metric spaces can be extended without increasing a prescribed convex modulus of continuity). However, the extensibility property fails for Lipschitz functions between Banach spaces, even finite-dimensional ones:

Example 14.14. Let $E \subseteq \mathbb{R}^2$ be given by $E := \{(1, -1), (-1, 1), (1, 1)\}$ and define $f \colon E \to \mathbb{R}^2$ by
\[
f((1, -1)) := (1, 0), \quad f((-1, 1)) := (-1, 0), \quad \text{and} \quad f((1, 1)) := (0, \sqrt{3}).
\]
Suppose that we wish to extend this $f$ to $F \colon \mathbb{R}^2 \to \mathbb{R}^2$, where $E$ and the domain copy of $\mathbb{R}^2$ are given the metric arising from the maximum norm $\|\cdot\|_\infty$ and the range copy of $\mathbb{R}^2$ is given the metric arising from the Euclidean norm $\|\cdot\|_2$. Then, for all distinct $x, y \in E$,
\[
\| x - y \|_\infty = 2 = \| f(x) - f(y) \|_2,
\]
so $f$ has Lipschitz constant 1 on $E$. What value should $F$ take at the origin, $(0, 0)$, if it is to have Lipschitz constant at most 1? Since $(0, 0)$ lies at $\|\cdot\|_\infty$-distance 1 from all three points of $E$, $F((0, 0))$ must lie within $\|\cdot\|_2$-distance 1 of all three points of $f(E)$. However, there is no such point of $\mathbb{R}^2$ within distance 1 of all three points of $f(E)$, and hence any extension of $f$ to $F \colon \mathbb{R}^2 \to \mathbb{R}^2$ must have $\operatorname{Lip}(F) > 1$. See Figure 14.1.
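The impossibility claimed in Example 14.14 is easy to confirm numerically (a small check, not part of the notes): the three image points form an equilateral triangle of side 2, so the point minimizing the maximum distance to them is the circumcentre, at distance $2/\sqrt{3} > 1$.

```python
# Verify that no point of R^2 lies within Euclidean distance 1 of all three image points.
import numpy as np
from scipy.optimize import minimize

pts = np.array([[1.0, 0.0], [-1.0, 0.0], [0.0, np.sqrt(3.0)]])   # f(E)

def worst_distance(z):
    return np.max(np.linalg.norm(pts - z, axis=1))   # largest distance from z to f(E)

res = minimize(worst_distance, x0=np.zeros(2), method="Nelder-Mead")
print("minimal worst-case distance:", res.fun)        # ~ 2/sqrt(3) = 1.1547 > 1
print("circumradius 2/sqrt(3)     :", 2.0 / np.sqrt(3.0))
```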

Theorem 14.15. Let $\mathcal{G}$ be a collection of measurable functions from $\mathcal{X}$ to $\mathcal{Y}$ such that, for every finite subset $E \subseteq \mathcal{X}$ and $g \colon E \to \mathcal{Y}$, it is possible to determine whether or not $g$ can be extended to an element of $\mathcal{G}$. Let $\mathcal{A} \subseteq \mathcal{G} \times \mathcal{M}_1(\mathcal{X})$ be such that, for each $g \in \mathcal{G}$, the set of $\mu \in \mathcal{M}_1(\mathcal{X})$ such that $(g, \mu) \in \mathcal{A}$ is a moment class of the form considered in Theorem 14.12. Let
\[
\mathcal{A}_\Delta := \left\{ (y, \mu) \,\middle|\,
\begin{array}{l}
\mu \in \bigotimes_{k=1}^{K} \Delta_{N+N_k}(\mathcal{X}_k), \\
y \text{ is the restriction to } \operatorname{supp}(\mu) \text{ of some } g \in \mathcal{G}, \\
\text{and } (g, \mu) \in \mathcal{A}
\end{array}
\right\}.
\]
Then, if $q$ is bounded either above or below, $\overline{Q}(\mathcal{A}) = \overline{Q}(\mathcal{A}_\Delta)$ and $\underline{Q}(\mathcal{A}) = \underline{Q}(\mathcal{A}_\Delta)$.

Proof. Exercise 14.6.

Example 14.16. Suppose that $g^\dagger \colon [-1, 1] \to \mathbb{R}$ is known to have Lipschitz constant $\operatorname{Lip}(g^\dagger) \le L$. Suppose also that the inputs of $g^\dagger$ are distributed according to $\mu^\dagger \in \mathcal{M}_1([-1, 1])$, and it is known that
\[
\mathbb{E}_{X\sim\mu^\dagger}[X] = 0 \quad \text{and} \quad \mathbb{E}_{X\sim\mu^\dagger}[g^\dagger(X)] \ge m > 0.
\]

Hence, the corresponding feasible set is
\[
\mathcal{A} = \left\{ (g, \mu) \,\middle|\,
\begin{array}{l}
g \colon [-1, 1] \to \mathbb{R} \text{ has Lipschitz constant} \le L, \\
\mu \in \mathcal{M}_1([-1, 1]),\ \mathbb{E}_{X\sim\mu}[X] = 0, \text{ and } \mathbb{E}_{X\sim\mu}[g(X)] \ge m
\end{array}
\right\}.
\]

Suppose that our quantity of interest is the probability of non-positive output values, i.e. $q(x, y) = \mathbb{1}[y \le 0]$. Then Theorem 14.15 ensures that the extreme values of
\[
Q(g, \mu) = \mathbb{E}_{X\sim\mu}\bigl[\mathbb{1}[g(X) \le 0]\bigr] = \mathbb{P}_{X\sim\mu}[g(X) \le 0]
\]

are the solutions of
\[
\begin{array}{rl}
\text{extremize:} & \displaystyle \sum_{i=0}^{2} w_i \, \mathbb{1}[y_i \le 0] \\[1.5ex]
\text{with respect to:} & w_0, w_1, w_2 \ge 0 \\
& x_0, x_1, x_2 \in [-1, 1] \\
& y_0, y_1, y_2 \in \mathbb{R} \\[1ex]
\text{subject to:} & \displaystyle \sum_{i=0}^{2} w_i = 1 \\[1.5ex]
& \displaystyle \sum_{i=0}^{2} w_i x_i = 0 \\[1.5ex]
& \displaystyle \sum_{i=0}^{2} w_i y_i \ge m \\[1.5ex]
& |y_i - y_j| \le L |x_i - x_j| \quad \text{for } i, j \in \{0, 1, 2\}.
\end{array}
\]
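The reduced problem above is a small nonconvex program, so one crude way to explore it (a sketch under stated assumptions, not part of the notes) is a penalized global search; the values of $L$, $m$, the penalty weight and the search bounds below are illustrative assumptions, and the output is only an approximation of $\overline{Q}(\mathcal{A})$.

```python
# Penalised search over (w, x, y) for the maximisation in Example 14.16.
import numpy as np
from scipy.optimize import differential_evolution
from itertools import combinations

L, m = 1.0, 0.25                               # assumed Lipschitz bound and mean bound

def neg_objective(z):
    w = np.abs(z[0:3]); w /= w.sum()           # weights of the three atoms
    x, y = z[3:6], z[6:9]                      # atom locations and candidate g-values
    violation = abs(w @ x)                                  # E[X] = 0
    violation += max(m - w @ y, 0.0)                        # E[g(X)] >= m
    for i, j in combinations(range(3), 2):                  # Lipschitz consistency of y
        violation += max(abs(y[i] - y[j]) - L * abs(x[i] - x[j]), 0.0)
    return -(w @ (y <= 0)) + 10.0 * violation               # maximise P[g(X) <= 0]

bounds = [(0.01, 1.0)] * 3 + [(-1.0, 1.0)] * 3 + [(-2.0, 2.0)] * 3
res = differential_evolution(neg_objective, bounds, seed=0, maxiter=3000, tol=1e-9)
print("approximate upper bound on P[g(X) <= 0]:", -res.fun)
```

By McShane's theorem, any data $(x_i, y_i)$ that are Lipschitz-consistent in this sense do extend to a function $g$ on $[-1, 1]$ with $\operatorname{Lip}(g) \le L$, so the constraints above really do describe the reduced feasible set.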

Example 14.17 (McDiarmid). The following admissible set corresponds to the assumptions of McDiarmid's inequality (Theorem 10.5):
\[
\mathcal{A} = \left\{ (g, \mu) \,\middle|\,
\begin{array}{l}
g \colon \mathcal{X} \to \mathbb{R} \text{ has } D_k[g] \le D_k \text{ for each } k, \\
\mu = \bigotimes_{k=1}^{K} \mu_k \in \mathcal{M}_1(\mathcal{X}), \\
\text{and } \mathbb{E}_{X\sim\mu}[g(X)] = m
\end{array}
\right\}.
\]


In this notation, McDiarmid's inequality is the upper bound
\[
\overline{Q}(\mathcal{A}) := \sup_{(g,\mu)\in\mathcal{A}} \mathbb{P}_\mu[g(X) \le 0] \le \exp\!\left( \frac{-2 \max\{0, m\}^2}{\sum_{k=1}^{K} D_k^2} \right).
\]
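For reference, the right-hand side of this bound is trivial to evaluate; the following snippet (an illustration, with assumed example values of $m$ and the $D_k$) does so.

```python
# Evaluate the McDiarmid upper bound exp(-2 max(0, m)^2 / sum_k D_k^2).
import numpy as np

def mcdiarmid_upper_bound(m, D):
    D = np.asarray(D, dtype=float)
    return float(np.exp(-2.0 * max(0.0, m) ** 2 / np.sum(D ** 2)))

print(mcdiarmid_upper_bound(m=1.0, D=[1.0, 1.0]))   # example values: exp(-1) ~ 0.368
```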

Perhaps not surprisingly given its general form, McDiarmid's inequality is not the least upper bound on $\mathbb{P}_\mu[g(X) \le 0]$; the actual least upper bound can be calculated using the reduction theorems.

FINISH ME!!!

Bibliography

Berger [8] makes the case for distributional robustness, with respect to priors and likelihoods, in Bayesian inference. Smith [89] provides theory and several practical examples for generalized Chebyshev inequalities in decision analysis. Boyd & Vandenberghe [16, Sec. 7.2] cover some aspects of distributional robustness under the heading of nonparametric distribution estimation, in the case in which it is a convex problem. Convex optimization approaches to distributional robustness and optimal probability inequalities are also considered by Bertsimas & Popescu [9]. There is also an extensive literature on the related topic of majorization, for which see the book of Marshall & al. [64].

The classification of the extreme points of moment sets and the consequences for the optimization of measure affine functionals are due to von Weizsäcker & Winkler [112, 113] and Winkler [116]. Karr [49] proved similar results under additional topological and continuity assumptions. Theorem 14.12 and the Lipschitz version of Theorem 14.15 can be found in Owhadi & al. [76] and Sullivan & al. [99] respectively. Extension Theorem 14.13 is due to Minty [68], and generalizes earlier results by McShane [66], Kirszbraun [51] and Valentine [111]. Example 14.14 is taken from the example given on p. 202 of Federer [28].

Exercises

Exercise 14.1. Consider the topology $\mathscr{T}$ on $\mathbb{R}$ generated by the basis of open sets $[a, b)$, where $a, b \in \mathbb{R}$.

1. Show that this topology generates the same $\sigma$-algebra on $\mathbb{R}$ as the usual Euclidean topology does. Hence, show that Gaussian measure is a well-defined probability measure on the Borel $\sigma$-algebra of $(\mathbb{R}, \mathscr{T})$.
2. Show that every compact subset of $(\mathbb{R}, \mathscr{T})$ is a countable set.
3. Conclude that Gaussian measure on $(\mathbb{R}, \mathscr{T})$ is not inner regular and that $(\mathbb{R}, \mathscr{T})$ is not a pseudo-Radon space.

Exercise 14.2. Suppose that $\mathcal{A}$ is a moment class of probability measures on $\mathcal{X}$ and that $q \colon \mathcal{X} \to \mathbb{R} \cup \{\pm\infty\}$ is bounded either below or above. Show that $\mu \mapsto \mathbb{E}_\mu[q]$ is a measure affine map. Hint: verify the assertion for the case in which $q$ is the indicator function of a measurable set; extend it to bounded measurable functions using the Monotone Class Theorem; for non-negative $\mu$-integrable functions $q$, use monotone convergence to verify the barycentric formula.


Exercise 14.3. Let $\lambda$ denote uniform measure on the unit interval $[0, 1] \subset \mathbb{R}$. Show that the line segment in $\mathcal{M}_1([0, 1]^2)$ joining the measures $\lambda \otimes \delta_0$ and $\delta_0 \otimes \lambda$ contains measures that are not product measures. Hence show that a set $\mathcal{A}$ of product probability measures like that considered in Theorem 14.12 is typically not convex.

Exercise 14.4. Calculate by hand, as a function of $t \in \mathbb{R}$, $D \ge 0$ and $m \in \mathbb{R}$,
\[
\sup_{\mu \in \mathcal{A}} \mathbb{P}_{X\sim\mu}[X \le t],
\]
where
\[
\mathcal{A} := \left\{ \mu \in \mathcal{M}_1(\mathbb{R}) \,\middle|\, \mathbb{E}_{X\sim\mu}[X] \ge m \text{ and } \operatorname{diam}(\operatorname{supp}(\mu)) \le D \right\}.
\]

Exercise 14.5. Calculate by hand, as a function of $t \in \mathbb{R}$, $s \ge 0$ and $m \in \mathbb{R}$,
\[
\sup_{\mu \in \mathcal{A}} \mathbb{P}_{X\sim\mu}[X - m \ge st]
\quad \text{and} \quad
\sup_{\mu \in \mathcal{A}} \mathbb{P}_{X\sim\mu}[|X - m| \ge st],
\]
where
\[
\mathcal{A} := \left\{ \mu \in \mathcal{M}_1(\mathbb{R}) \,\middle|\, \mathbb{E}_{X\sim\mu}[X] \le m \text{ and } \mathbb{E}_{X\sim\mu}[|X - m|^2] \le s^2 \right\}.
\]

Exercise 14.6. Prove Theorem 14.15.

Exercise 14.7. Calculate by hand, as a function of $t \in \mathbb{R}$, $m \in \mathbb{R}$, $z \in [0, 1]$ and $v \in \mathbb{R}$,
\[
\sup_{(g, \mu) \in \mathcal{A}} \mathbb{P}_{X\sim\mu}[g(X) \le t],
\]
where
\[
\mathcal{A} := \left\{ (g, \mu) \,\middle|\,
\begin{array}{l}
g \colon [0, 1] \to \mathbb{R} \text{ has Lipschitz constant } 1, \\
\mu \in \mathcal{M}_1([0, 1]),\ \mathbb{E}_{X\sim\mu}[g(X)] \ge m, \\
\text{and } g(z) = v
\end{array}
\right\}.
\]
Numerically verify your calculations.
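For the numerical verification, one possibility (a sketch, not a model solution) is to use the reduction theorems and search over measures with a small number of atoms, treating the pinned value $g(z) = v$ as an extra interpolation node in the Lipschitz-consistency constraints; the example values of $t$, $m$, $z$, $v$, the three-atom parametrization and the penalty weight are all assumptions.

```python
# Penalised search over three-atom measures and Lipschitz-consistent values for Ex. 14.7.
import numpy as np
from scipy.optimize import differential_evolution
from itertools import combinations

t, m, z, v = 0.0, 0.5, 0.5, 1.0                # assumed example values

def neg_objective(p):
    w = np.abs(p[0:3]); w /= w.sum()           # weights of the atoms
    x = np.append(p[3:6], z)                   # atom locations plus the pinned point z
    y = np.append(p[6:9], v)                   # candidate g-values plus g(z) = v
    violation = max(m - w @ y[:3], 0.0)                    # E[g(X)] >= m
    for i, j in combinations(range(4), 2):                 # Lip(g) <= 1 on the data
        violation += max(abs(y[i] - y[j]) - abs(x[i] - x[j]), 0.0)
    return -(w @ (y[:3] <= t)) + 10.0 * violation          # maximise P[g(X) <= t]

# since Lip(g) <= 1 and g(z) = v on [0, 1], every value g(x) lies in [v - 1, v + 1]
bounds = [(0.01, 1.0)] * 3 + [(0.0, 1.0)] * 3 + [(v - 1.0, v + 1.0)] * 3
res = differential_evolution(neg_objective, bounds, seed=0, maxiter=3000, tol=1e-9)
print("approximate value of the supremum:", -res.fun)
```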


Bibliography and Index


Bibliography

[1] M. Abramowitz and I. A. Stegun, editors. Handbook of Mathematical Functions with Formulas, Graphs, and Mathematical Tables. Dover Publications Inc., New York, 1992. Reprint of the 1972 edition.
[2] C. D. Aliprantis and K. C. Border. Infinite Dimensional Analysis: A Hitchhiker's Guide. Springer, Berlin, third edition, 2006.
[3] O. F. Alış and H. Rabitz. Efficient implementation of high dimensional model representations. J. Math. Chem., 29(2):127–142, 2001.
[4] M. Atiyah. Collected Works. Vol. 6. Oxford Science Publications. The Clarendon Press, Oxford University Press, New York, 2004.
[5] K. Azuma. Weighted sums of certain dependent random variables. Tohoku Math. J. (2), 19:357–367, 1967.
[6] V. Barthelmann, E. Novak, and K. Ritter. High dimensional polynomial interpolation on sparse grids. Adv. Comput. Math., 12(4):273–288, 2000.
[7] F. Beccacece and E. Borgonovo. Functional ANOVA, ultramodularity and monotonicity: applications in multiattribute utility theory. European J. Oper. Res., 210(2):326–335, 2011.
[8] J. O. Berger. An overview of robust Bayesian analysis. Test, 3(1):5–124, 1994. With comments and a rejoinder by the author.
[9] D. Bertsimas and I. Popescu. Optimal inequalities in probability theory: a convex optimization approach. SIAM J. Optim., 15(3):780–804 (electronic), 2005.
[10] P. Billingsley. Probability and Measure. Wiley Series in Probability and Mathematical Statistics. John Wiley & Sons Inc., New York, third edition, 1995. A Wiley-Interscience Publication.
[11] E. Bishop and K. de Leeuw. The representations of linear functionals by measures on sets of extreme points. Ann. Inst. Fourier. Grenoble, 9:305–331, 1959.
[12] S. Bochner. Integration von Funktionen, deren Werte die Elemente eines Vectorraumes sind. Fund. Math., 20:262–276, 1933.
[13] G. Boole. An Investigation of the Laws of Thought on Which are Founded the Mathematical Theories of Logic and Probabilities. Walton and Maberley, London, 1854.


[14] N. Bourbaki. Topological Vector Spaces. Chapters 1–5. Elements of Mathematics (Berlin). Springer-Verlag, Berlin, 1987. Translated from the French by H. G. Eggleston and S. Madan.
[15] N. Bourbaki. Integration. I. Chapters 1–6. Elements of Mathematics (Berlin). Springer-Verlag, Berlin, 2004. Translated from the 1959, 1965 and 1967 French originals by Sterling K. Berberian.
[16] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, Cambridge, 2004.
[17] R. H. Cameron and W. T. Martin. The orthogonal development of non-linear functionals in series of Fourier–Hermite functionals. Ann. of Math. (2), 48:385–392, 1947.
[18] M. Capiński and E. Kopp. Measure, Integral and Probability. Springer Undergraduate Mathematics Series. Springer-Verlag London Ltd., London, second edition, 2004.
[19] C. W. Clenshaw and A. R. Curtis. A method for numerical integration on an automatic computer. Numer. Math., 2:197–205, 1960.
[20] S. L. Cotter, M. Dashti, J. C. Robinson, and A. M. Stuart. Bayesian inverse problems for functions and applications to fluid mechanics. Inverse Problems, 25(11):115008, 43, 2009.
[21] S. L. Cotter, M. Dashti, and A. M. Stuart. Approximation of Bayesian inverse problems for PDEs. SIAM J. Numer. Anal., 48(1):322–345, 2010.
[22] M. Dashti, S. Harris, and A. Stuart. Besov priors for Bayesian inverse problems. Inverse Probl. Imaging, 6(2):183–200, 2012.
[23] A. P. Dempster. Upper and lower probabilities induced by a multivalued mapping. Ann. Math. Statist., 38:325–339, 1967.
[24] J. Diestel and J. J. Uhl, Jr. Vector Measures. Number 15 in Mathematical Surveys. American Mathematical Society, Providence, R.I., 1977.
[25] R. M. Dudley. Real Analysis and Probability, volume 74 of Cambridge Studies in Advanced Mathematics. Cambridge University Press, Cambridge, 2002. Revised reprint of the 1989 original.
[26] H. W. Engl, M. Hanke, and A. Neubauer. Regularization of Inverse Problems, volume 375 of Mathematics and its Applications. Kluwer Academic Publishers Group, Dordrecht, 1996.
[27] L. C. Evans. Partial Differential Equations, volume 19 of Graduate Studies in Mathematics. American Mathematical Society, Providence, RI, second edition, 2010.
[28] H. Federer. Geometric Measure Theory. Die Grundlehren der mathematischen Wissenschaften, Band 153. Springer-Verlag New York Inc., New York, 1969.


[29] L. Fejér. On the infinite sequences arising in the theories of harmonic analysis, of interpolation, and of mechanical quadratures. Bull. Amer. Math. Soc., 39(8):521–534, 1933.
[30] J. Feldman. Equivalence and perpendicularity of Gaussian processes. Pacific J. Math., 8:699–708, 1958.
[31] X. Fernique. Intégrabilité des vecteurs gaussiens. C. R. Acad. Sci. Paris Sér. A–B, 270:A1698–A1699, 1970.
[32] R. A. Fisher and W. A. Mackenzie. The manurial response of different potato varieties. J. Agric. Sci., 13:311–320, 1923.
[33] W. Gautschi. Orthogonal Polynomials: Computation and Approximation. Numerical Mathematics and Scientific Computation. Oxford University Press, New York, 2004. Oxford Science Publications.
[34] R. G. Ghanem and P. D. Spanos. Stochastic Finite Elements: A Spectral Approach. Springer-Verlag, New York, 1991.
[35] R. A. Gordon. The Integrals of Lebesgue, Denjoy, Perron, and Henstock, volume 4 of Graduate Studies in Mathematics. American Mathematical Society, Providence, RI, 1994.
[36] J. Hájek. On a property of normal distribution of any stochastic process. Czechoslovak Math. J., 8 (83):610–618, 1958.
[37] J. H. Halton. On the efficiency of certain quasi-random sequences of points in evaluating multi-dimensional integrals. Numer. Math., 2:84–90, 1960.
[38] W. Hoeffding. A class of statistics with asymptotically normal distribution. Ann. Math. Statistics, 19:293–325, 1948.
[39] W. Hoeffding. Probability inequalities for sums of bounded random variables. J. Amer. Statist. Assoc., 58:13–30, 1963.
[40] G. Hooker. Generalized functional ANOVA diagnostics for high-dimensional functions of dependent variables. J. Comput. Graph. Statist., 16(3):709–732, 2007.
[41] J. Humpherys, P. Redd, and J. West. A Fresh Look at the Kalman Filter. SIAM Rev., 54(4):801–823, 2012.
[42] L. Jaulin, M. Kieffer, O. Didrit, and E. Walter. Applied Interval Analysis: With Examples in Parameter and State Estimation, Robust Control and Robotics. Springer-Verlag London Ltd., London, 2001.
[43] A. H. Jazwinski. Stochastic Processes and Filtering Theory, volume 64 of Mathematics in Science and Engineering. Academic Press, New York, 1970.
[44] V. M. Kadets. Non-differentiable indefinite Pettis integrals. Quaestiones Math., 17(2):137–139, 1994.


[45] J. Kaipio and E. Somersalo. Statistical and Computational Inverse Problems, volume 160 of Applied Mathematical Sciences. Springer-Verlag, New York, 2005.
[46] R. E. Kalman. A new approach to linear filtering and prediction problems. Trans. ASME Ser. D. J. Basic Engrg., 82:35–45, 1960.
[47] R. E. Kalman and R. S. Bucy. New results in linear filtering and prediction theory. Trans. ASME Ser. D. J. Basic Engrg., 83:95–108, 1961.
[48] K. Karhunen. Über lineare Methoden in der Wahrscheinlichkeitsrechnung. Ann. Acad. Sci. Fennicae. Ser. A. I. Math.-Phys., 1947(37):79, 1947.
[49] A. F. Karr. Extreme points of certain sets of probability measures, with applications. Math. Oper. Res., 8(1):74–85, 1983.
[50] J. M. Keynes. A Treatise on Probability. Macmillan and Co., London, 1921.
[51] M. D. Kirszbraun. Über die zusammenziehende und Lipschitzsche Transformationen. Fund. Math., 22:77–108, 1934.
[52] D. D. Kosambi. Statistics in function space. J. Indian Math. Soc. (N.S.), 7:76–88, 1943.
[53] H. Kozono and T. Yanagisawa. Generalized Lax–Milgram theorem in Banach spaces and its application to the elliptic system of boundary value problems. Manuscripta Math., 141(3–4):637–662, 2013.
[54] M. Krein and D. Milman. On extreme points of regular convex sets. Studia Math., 9:133–138, 1940.
[55] S. Kullback and R. A. Leibler. On information and sufficiency. Ann. Math. Statistics, 22:79–86, 1951.
[56] V. P. Kuznetsov. Intervalnye statisticheskie modeli. "Radio i Svyaz′", Moscow, 1991.
[57] M. Lassas, E. Saksman, and S. Siltanen. Discretization-invariant Bayesian inversion and Besov space priors. Inverse Probl. Imaging, 3(1):87–122, 2009.
[58] O. P. Le Maître and O. M. Knio. Spectral Methods for Uncertainty Quantification: With applications to computational fluid dynamics. Scientific Computation. Springer, New York, 2010.
[59] M. Ledoux. The Concentration of Measure Phenomenon, volume 89 of Mathematical Surveys and Monographs. American Mathematical Society, Providence, RI, 2001.
[60] M. Ledoux and M. Talagrand. Probability in Banach Spaces. Classics in Mathematics. Springer-Verlag, Berlin, 2011. Isoperimetry and Processes. Reprint of the 1991 edition.


[61] P. Lévy. Problèmes Concrets d'Analyse Fonctionnelle. Avec un Complément sur les Fonctionnelles Analytiques par F. Pellegrino. Gauthier-Villars, Paris, 1951. 2d ed.
[62] M. Loève. Probability Theory. II. Springer-Verlag, New York, fourth edition, 1978. Graduate Texts in Mathematics, Vol. 46.
[63] D. J. C. MacKay. Information Theory, Inference and Learning Algorithms. Cambridge University Press, New York, 2003.
[64] A. W. Marshall, I. Olkin, and B. C. Arnold. Inequalities: Theory of Majorization and its Applications. Springer Series in Statistics. Springer, New York, second edition, 2011.
[65] C. McDiarmid. On the method of bounded differences. In Surveys in combinatorics, 1989 (Norwich, 1989), volume 141 of London Math. Soc. Lecture Note Ser., pages 148–188. Cambridge Univ. Press, Cambridge, 1989.
[66] E. J. McShane. Extension of range of functions. Bull. Amer. Math. Soc., 40(12):837–842, 1934.
[67] J. Mikusiński. The Bochner Integral. Birkhäuser Verlag, Basel, 1978. Lehrbücher und Monographien aus dem Gebiete der exakten Wissenschaften, Mathematische Reihe, Band 55.
[68] G. J. Minty. On the extension of Lipschitz, Lipschitz–Hölder continuous, and monotone functions. Bull. Amer. Math. Soc., 76:334–339, 1970.
[69] R. E. Moore. Interval Analysis. Prentice-Hall Inc., Englewood Cliffs, N.J., 1966.
[70] A. Narayan and D. Xiu. Stochastic collocation methods on unstructured grids in high dimensions via interpolation. SIAM J. Sci. Comput., 34(3):A1729–A1752, 2012.
[71] H. Niederreiter. Low-discrepancy and low-dispersion sequences. J. Number Theory, 30(1):51–70, 1988.
[72] J. Nocedal and S. J. Wright. Numerical Optimization. Springer Series in Operations Research and Financial Engineering. Springer, New York, second edition, 2006.
[73] B. Øksendal. Stochastic Differential Equations: An Introduction with Applications. Universitext. Springer-Verlag, Berlin, sixth edition, 2003.
[74] N. Oreskes, K. Shrader-Frechette, and K. Belitz. Verification, validation, and confirmation of numerical models in the earth sciences. Science, 263(5147):641–646, 1994.
[75] A. B. Owen. Latin supercube sampling for very high dimensional simulations. ACM Trans. Mod. Comp. Sim., 8(2):71–102, 1998.
[76] H. Owhadi, C. Scovel, T. J. Sullivan, M. McKerns, and M. Ortiz. Optimal Uncertainty Quantification. SIAM Rev., 55(2):271–345, 2013.


[77] K. V. Price, R. M. Storn, and J. A. Lampinen. Differential Evolution: A Practical Approach to Global Optimization. Natural Computing Series. Springer-Verlag, Berlin, 2005.
[78] H. Rabitz and O. F. Alış. General foundations of high-dimensional model representations. J. Math. Chem., 25(2–3):197–233, 1999.
[79] M. Reed and B. Simon. Methods of Modern Mathematical Physics. I. Functional Analysis. Academic Press, New York, 1972.
[80] M. Renardy and R. C. Rogers. An Introduction to Partial Differential Equations, volume 13 of Texts in Applied Mathematics. Springer-Verlag, New York, second edition, 2004.
[81] C. P. Robert and G. Casella. Monte Carlo Statistical Methods. Springer Texts in Statistics. Springer-Verlag, New York, second edition, 2004.
[82] R. T. Rockafellar. Convex Analysis. Princeton Landmarks in Mathematics. Princeton University Press, Princeton, NJ, 1997. Reprint of the 1970 original, Princeton Paperbacks.
[83] W. Rudin. Functional Analysis. International Series in Pure and Applied Mathematics. McGraw-Hill Inc., New York, second edition, 1991.
[84] C. Runge. Über empirische Funktionen und die Interpolation zwischen äquidistanten Ordinaten. Zeitschrift für Mathematik und Physik, 46:224–243, 1901.
[85] R. A. Ryan. Introduction to Tensor Products of Banach Spaces. Springer Monographs in Mathematics. Springer-Verlag London Ltd., London, 2002.
[86] B. P. Rynne and M. A. Youngson. Linear Functional Analysis. Springer Undergraduate Mathematics Series. Springer-Verlag London Ltd., London, second edition, 2008.
[87] G. Shafer. A Mathematical Theory of Evidence. Princeton University Press, Princeton, N.J., 1976.
[88] C. E. Shannon. A mathematical theory of communication. Bell System Tech. J., 27:379–423, 623–656, 1948.
[89] J. E. Smith. Generalized Chebychev inequalities: theory and applications in decision analysis. Oper. Res., 43(5):807–825, 1995.
[90] S. A. Smolyak. Quadrature and interpolation formulae on tensor products of certain function classes. Dokl. Akad. Nauk SSSR, 148:1042–1045, 1963.
[91] I. M. Sobol′. Uniformly distributed sequences with an additional property of uniformity. Z. Vycisl. Mat. i Mat. Fiz., 16(5):1332–1337, 1375, 1976.
[92] I. M. Sobol′. Estimation of the sensitivity of nonlinear mathematical models. Mat. Model., 2(1):112–118, 1990.
[93] I. M. Sobol′. Sensitivity estimates for nonlinear mathematical models. Math. Modeling Comput. Experiment, 1(4):407–414 (1995), 1993.


[94] C. Soize and R. Ghanem. Physical systems with random uncertainties: chaos representations with arbitrary probability measure. SIAM J. Sci. Comput., 26(2):395–410 (electronic), 2004.
[95] J. Stoer and R. Bulirsch. Introduction to Numerical Analysis, volume 12 of Texts in Applied Mathematics. Springer-Verlag, New York, third edition, 2002. Translated from the German by R. Bartels, W. Gautschi and C. Witzgall.
[96] C. J. Stone. The use of polynomial splines and their tensor products in multivariate function estimation. Ann. Statist., 22(1):118–184, 1994.
[97] R. Storn and K. Price. Differential evolution: a simple and efficient heuristic for global optimization over continuous spaces. J. Global Optim., 11(4):341–359, 1997.
[98] A. M. Stuart. Inverse problems: a Bayesian perspective. Acta Numer., 19:451–559, 2010.
[99] T. J. Sullivan, M. McKerns, D. Meyer, F. Theil, H. Owhadi, and M. Ortiz. Optimal uncertainty quantification for legacy data observations of Lipschitz functions. Math. Model. Numer. Anal., 47(6):1657–1689, 2013.
[100] G. Szegő. Orthogonal Polynomials. American Mathematical Society, Providence, R.I., fourth edition, 1975. American Mathematical Society, Colloquium Publications, Vol. XXIII.
[101] H. Takahasi and M. Mori. Double exponential formulas for numerical integration. Publ. Res. Inst. Math. Sci., 9:721–741, 1973/74.
[102] M. Talagrand. Pettis integral and measure theory. Mem. Amer. Math. Soc., 51(307):ix+224, 1984.
[103] A. Tarantola. Inverse Problem Theory and Methods for Model Parameter Estimation. Society for Industrial and Applied Mathematics (SIAM), Philadelphia, PA, 2005.
[104] J. B. Tenenbaum, V. de Silva, and J. C. Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290(5500):2319–2323, 2000.
[105] A. N. Tikhonov. On the stability of inverse problems. C. R. (Doklady) Acad. Sci. URSS (N.S.), 39:176–179, 1943.
[106] A. N. Tikhonov. On the solution of incorrectly put problems and the regularisation method. In Outlines Joint Sympos. Partial Differential Equations (Novosibirsk, 1963), pages 261–265. Acad. Sci. USSR Siberian Branch, Moscow, 1963.
[107] L. N. Trefethen. Is Gauss quadrature better than Clenshaw–Curtis? SIAM Rev., 50(1):67–87, 2008.
[108] L. N. Trefethen and D. Bau, III. Numerical Linear Algebra. Society for Industrial and Applied Mathematics (SIAM), Philadelphia, PA, 1997.


[109] U.S. Department of Energy. Scientific Grand Challenges for National Security: The Role of Computing at the Extreme Scale. 2009.
[110] N. N. Vakhania. The topological support of Gaussian measure in Banach space. Nagoya Math. J., 57:59–63, 1975.
[111] F. A. Valentine. A Lipschitz condition preserving extension for a vector function. Amer. J. Math., 67(1):83–93, 1945.
[112] H. von Weizsäcker and G. Winkler. Integral representation in the set of solutions of a generalized moment problem. Math. Ann., 246(1):23–32, 1979/80.
[113] H. von Weizsäcker and G. Winkler. Noncompact extremal integral representations: some probabilistic aspects. In Functional analysis: surveys and recent results, II (Proc. Second Conf. Functional Anal., Univ. Paderborn, Paderborn, 1979), volume 68 of Notas Mat., pages 115–148. North-Holland, Amsterdam, 1980.
[114] P. Walley. Statistical Reasoning with Imprecise Probabilities, volume 42 of Monographs on Statistics and Applied Probability. Chapman and Hall Ltd., London, 1991.
[115] K. Weichselberger. The theory of interval-probability as a unifying concept for uncertainty. Internat. J. Approx. Reason., 24(2–3):149–170, 2000.
[116] G. Winkler. Extreme points of moment sets. Math. Oper. Res., 13(4):581–587, 1988.
[117] D. Xiu. Numerical Methods for Stochastic Computations: A Spectral Method Approach. Princeton University Press, Princeton, NJ, 2010.
[118] D. Xiu and G. E. Karniadakis. The Wiener–Askey polynomial chaos for stochastic differential equations. SIAM J. Sci. Comput., 24(2):619–644 (electronic), 2002.


Index

absolute continuity, 17
affine combination, 39
almost everywhere, 10
ANOVA, 116
arg max, 34
arg min, 34

Banach space, 24
barycentre, 159
Bayes' rule, 11, 63
Bessel's inequality, 29
Birkhoff–Khinchin ergodic theorem, 106
Bochner integral, 16, 32
bounded differences inequality, 113

Céa's lemma, 139, 146
Cameron–Martin space, 20
Cauchy–Schwarz inequality, 24, 52
Chebyshev nodes, 93
Chebyshev's inequality, 16
Choquet simplex, 159
Choquet–Bishop–de Leeuw theorem, 39
Christoffel–Darboux formula, 90
Clenshaw–Curtis quadrature, 104
collocation method
  for ODEs, 151
  stochastic, 151
complete measure space, 10
concentration of measure, 113
conditional expectation, 28, 114
conditional probability, 11
constraining function, 38
convex combination, 39
convex function, 40
convex hull, 39
convex optimization problem, 41
convex set, 39
counting measure, 10
covariance
  matrix, 15
  operator, 19, 51
Cut-HDMR, 117

Dirac measure, 10
direct sum, 28
dominated convergence theorem, 15
dual space, 26

ensemble Kalman filter, 78
entropy, 52, 155
equivalent measures, 17
Eulerian observations, 80
expectation, 15
expected value, 15
extended Kalman filter, 77
extreme point, 39, 160

Favard's theorem, 90
Fejér quadrature, 104
Feldman–Hájek theorem, 21, 67
Fernique's theorem, 20
filtration, 12
Fubini–Tonelli theorem, 18

Galerkin product, 145
Galerkin projection
  deterministic, 137
  stochastic, 140
Galerkin tensor, 144
Gauss–Markov theorem, 59
Gauss–Newton iteration, 46
Gaussian measure, 18, 19
Gibbs' inequality, 55
Gibbs' phenomenon, 96
Gram matrix, 122, 139

Hankel determinant, 86, 89
HDMR
  projectors, 118
Hellinger distance, 65, 68
Hermite polynomials, 86, 126
Hilbert space, 24
Hoeffding's inequality, 113


Hoeffding's lemma, 114
Hotelling transform, 124

independence, 17
information, 52
inner product, 23
inner product space, 24
inner regular measure, 159
integral
  Bochner, 16, 32
  Lebesgue, 14
  Pettis, 16, 19
  strong, 16
  weak, 16
interior point method, 42
interval arithmetic, 50

Kalman filter, 74
  ensemble, 78
  extended, 77
  linear, 74
Karhunen–Loève theorem, 123
  sampling Gaussian measures, 124
Karush–Kuhn–Tucker conditions, 37
Koksma's inequality, 108
Koksma–Hlawka inequality, 108
Kozono–Yanagisawa theorem, 140
Kreĭn–Milman theorem, 39
Kullback–Leibler divergence, 54

Lagrange multipliers, 37
Lagrange polynomials, 92
Lagrangian observations, 80
law of large numbers, 106
Lax–Milgram theorem
  deterministic, 137
  stochastic, 141
Lebesgue L^p space, 15
Lebesgue integral, 14
Lebesgue measure, 10, 18
Legendre polynomials, 86
linear Kalman filter, 74
linear program, 43

marginal, 18
maximum entropy
  principle of, 155
McDiarmid diameter, 112
McDiarmid subdiameter, 112
McDiarmid's inequality, 113
measurable function, 12
measurable space, 9
measure, 9
measure affine function, 160
Mercer kernel, 122
Mercer's theorem, 122
midpoint rule, 100
Minty's extension theorem, 163
Moore–Penrose pseudo-inverse, 61
mutually singular measures, 17

Newton's method, 34
Newton–Cotes formula, 101
norm, 23
normal equations, 44
normed space, 23
null set, 10

orthogonal complement, 27
orthogonal polynomials, 88
orthogonal projection, 27
orthogonal set, 27
orthonormal set, 27

parallelogram identity, 24
Parseval identity, 29
penalty function, 38
Pettis integral, 16, 19
polarization identity, 24
precision operator, 19
principal component analysis, 124
probability density function, 17
probability measure, 9
product measure, 17
push-forward measure, 12

quadrature formula, 99

Radon space, 159
Radon–Nikodym theorem, 17
random variable, 12
Riesz representation theorem, 26
Riesz space, 159
RS-HDMR, 116
Runge's phenomenon, 93

Schrödinger's inequality, 52
Schur complements, 81
  and conditioning of Gaussians, 81
semi-norm, 23


Sherman–Morrison–Woodbury formula, 81
signed measure, 9
simulated annealing, 36
singular value decomposition, 111, 125
Sobol′ indices, 118
Sobolev space, 26, 95
stochastic collocation method, 151
stochastic process, 12
strong integral, 16
support, 10
surprisal, 52

Takahasi–Mori quadrature, 109
tanh–sinh quadrature, 109
tensor product, 30
Tikhonov regularization, 45, 58
total variation distance, 54, 68
trapezoidal rule, 100
trivial measure, 10

uncertainty principle, 52

Vandermonde matrix, 92
variance, 15
vector lattice, 159

weak integral, 16
Wiener–Hermite PC expansion, 127
Winkler's theorem, 160

zero-one measure, 39