
Springer Series in Statistics

Advisors:

P. Bickel, P. Diggle, S. Fienberg, U. Gather,

I. Olkin, S. Zeger


Springer Series in Statistics

Alho/Spencer: Statistical Demography and Forecasting
Andersen/Borgan/Gill/Keiding: Statistical Models Based on Counting Processes
Atkinson/Riani: Robust Diagnostic Regression Analysis
Atkinson/Riani/Cerioli: Exploring Multivariate Data with the Forward Search
Berger: Statistical Decision Theory and Bayesian Analysis, 2nd edition
Borg/Groenen: Modern Multidimensional Scaling: Theory and Applications, 2nd edition
Brockwell/Davis: Time Series: Theory and Methods, 2nd edition
Bucklew: Introduction to Rare Event Simulation
Cappé/Moulines/Rydén: Inference in Hidden Markov Models
Chan/Tong: Chaos: A Statistical Perspective
Chen/Shao/Ibrahim: Monte Carlo Methods in Bayesian Computation
Coles: An Introduction to Statistical Modeling of Extreme Values
Devroye/Lugosi: Combinatorial Methods in Density Estimation
Diggle/Ribeiro: Model-based Geostatistics
Dudoit/Van der Laan: Multiple Testing Procedures with Applications to Genomics
Efromovich: Nonparametric Curve Estimation: Methods, Theory, and Applications
Eggermont/LaRiccia: Maximum Penalized Likelihood Estimation, Volume I: Density Estimation
Fahrmeir/Tutz: Multivariate Statistical Modeling Based on Generalized Linear Models, 2nd edition
Fan/Yao: Nonlinear Time Series: Nonparametric and Parametric Methods
Ferraty/Vieu: Nonparametric Functional Data Analysis: Theory and Practice
Ferreira/Lee: Multiscale Modeling: A Bayesian Perspective
Fienberg/Hoaglin: Selected Papers of Frederick Mosteller
Frühwirth-Schnatter: Finite Mixture and Markov Switching Models
Ghosh/Ramamoorthi: Bayesian Nonparametrics
Glaz/Naus/Wallenstein: Scan Statistics
Good: Permutation Tests: Parametric and Bootstrap Tests of Hypotheses, 3rd edition
Gouriéroux: ARCH Models and Financial Applications
Gu: Smoothing Spline ANOVA Models
Györfi/Kohler/Krzyżak/Walk: A Distribution-Free Theory of Nonparametric Regression
Haberman: Advanced Statistics, Volume I: Description of Populations
Hall: The Bootstrap and Edgeworth Expansion
Härdle: Smoothing Techniques: With Implementation in S
Harrell: Regression Modeling Strategies: With Applications to Linear Models, Logistic Regression, and Survival Analysis
Hart: Nonparametric Smoothing and Lack-of-Fit Tests
Hastie/Tibshirani/Friedman: The Elements of Statistical Learning: Data Mining, Inference, and Prediction
Hedayat/Sloane/Stufken: Orthogonal Arrays: Theory and Applications
Heyde: Quasi-Likelihood and its Application: A General Approach to Optimal Parameter Estimation
Huet/Bouvier/Poursat/Jolivet: Statistical Tools for Nonlinear Regression: A Practical Guide with S-PLUS and R Examples, 2nd edition
Ibrahim/Chen/Sinha: Bayesian Survival Analysis
Jiang: Linear and Generalized Linear Mixed Models and Their Applications
Jolliffe: Principal Component Analysis, 2nd edition
Knottnerus: Sample Survey Theory: Some Pythagorean Perspectives
Konishi/Kitagawa: Information Criteria and Statistical Modeling

(continued after index)


(continued from page ii)

Kosorok: Introduction to Empirical Processes and Semiparametric Inference
Küchler/Sørensen: Exponential Families of Stochastic Processes
Kutoyants: Statistical Inference for Ergodic Diffusion Processes
Lahiri: Resampling Methods for Dependent Data
Lavallée: Indirect Sampling
Le/Zidek: Statistical Analysis of Environmental Space-Time Processes
Le Cam: Asymptotic Methods in Statistical Decision Theory
Le Cam/Yang: Asymptotics in Statistics: Some Basic Concepts, 2nd edition
Liese/Miescke: Statistical Decision Theory: Estimation, Testing, Selection
Liu: Monte Carlo Strategies in Scientific Computing
Manski: Partial Identification of Probability Distributions
Mielke/Berry: Permutation Methods: A Distance Function Approach, 2nd edition
Molenberghs/Verbeke: Models for Discrete Longitudinal Data
Mukerjee/Wu: A Modern Theory of Factorial Designs
Nelsen: An Introduction to Copulas, 2nd edition
Pan/Fang: Growth Curve Models and Statistical Diagnostics
Politis/Romano/Wolf: Subsampling
Ramsay/Silverman: Applied Functional Data Analysis: Methods and Case Studies
Ramsay/Silverman: Functional Data Analysis, 2nd edition
Reinsel: Elements of Multivariate Time Series Analysis, 2nd edition
Rosenbaum: Observational Studies, 2nd edition
Rosenblatt: Gaussian and Non-Gaussian Linear Time Series and Random Fields
Santner/Williams/Notz: The Design and Analysis of Computer Experiments
Särndal/Swensson/Wretman: Model Assisted Survey Sampling
Schervish: Theory of Statistics
Shaked/Shanthikumar: Stochastic Orders
Shao/Tu: The Jackknife and Bootstrap
Simonoff: Smoothing Methods in Statistics
Song: Correlated Data Analysis: Modeling, Analytics, and Applications
Sprott: Statistical Inference in Science
Stein: Interpolation of Spatial Data: Some Theory for Kriging
Taniguchi/Kakizawa: Asymptotic Theory for Statistical Inference for Time Series
Tanner: Tools for Statistical Inference: Methods for the Exploration of Posterior Distributions and Likelihood Functions, 3rd edition
Tillé: Sampling Algorithms
Tsiatis: Semiparametric Theory and Missing Data
van der Laan/Robins: Unified Methods for Censored Longitudinal Data and Causality
van der Vaart/Wellner: Weak Convergence and Empirical Processes: With Applications to Statistics
Verbeke/Molenberghs: Linear Mixed Models for Longitudinal Data
Weerahandi: Exact Statistical Methods for Data Analysis


Michael R. Kosorok

Introduction to Empirical Processes
and Semiparametric Inference


Michael R. Kosorok
Department of Biostatistics
University of North Carolina
3101 McGavran-Greenberg Hall
Chapel Hill, NC 27599-7420
USA
[email protected]

ISBN 978-0-387-74977-8        e-ISBN 978-0-387-74978-5

Library of Congress Control Number: 2007940955

© 2008 Springer Science+Business Media, LLC

All rights reserved. This work may not be translated or copied in whole or in part without the written permission of the publisher (Springer Science+Business Media, LLC, 233 Spring Street, New York, NY 10013, USA), except for brief excerpts in connection with reviews or scholarly analysis. Use in connection with any form of information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed is forbidden.

The use in this publication of trade names, trademarks, service marks, and similar terms, even if they are not identified as such, is not to be taken as an expression of opinion as to whether or not they are subject to proprietary rights.

Printed on acid-free paper.

9 8 7 6 5 4 3 2 1

springer.com


To

My Parents, John and Eleanor

My Wife, Pamela

My Daughters, Jessica and Jeanette

and

My Brothers, David, Andy and Matt


Preface

The goal of this book is to introduce statisticians, and other researchers with a background in mathematical statistics, to empirical processes and semiparametric inference. These powerful research techniques are surprisingly useful for studying large sample properties of statistical estimates from realistically complex models as well as for developing new and improved approaches to statistical inference.

This book is more of a textbook than a research monograph, although a number of new results are presented. The level of the book is more introductory than the seminal work of van der Vaart and Wellner (1996). In fact, another purpose of this work is to help readers prepare for the mathematically advanced van der Vaart and Wellner text, as well as for the semiparametric inference work of Bickel, Klaassen, Ritov and Wellner (1997). These two books, along with Pollard (1990) and Chapters 19 and 25 of van der Vaart (1998), formulate a very complete and successful elucidation of modern empirical process methods. The present book owes much by way of inspiration, concept, and notation to these previous works. What is perhaps new is the gradual, yet rigorous, and unified way this book introduces the reader to the field.

The book consists of three parts. The first part is an overview that concisely covers the basic concepts in both empirical processes and semiparametric inference, while avoiding many technicalities. The second part is devoted to empirical processes, while the third part is devoted to semiparametric efficiency and inference. In each of the last two parts, the chapter following the introductory chapter is devoted to the relevant mathematical concepts and technical background needed for the remainder of the part. For example, an overview of metric spaces, which are necessary to the study of weak convergence, is included in Chapter 6. Thus the book is largely self-contained. In addition, a chapter devoted to case studies is included at the end of each of the three parts of the book. These case studies explore in detail practical examples that illustrate applications of theoretical concepts.

The impetus for this work came from a course the author gave in the Department of Statistics at the University of Wisconsin-Madison during the Spring semester of 2001. Accordingly, the book is designed to be used as a text in a one- or two-semester sequence in empirical processes and semiparametric inference. In a one-semester course, most of Chapters 1–10 and 12–18 can be covered, along with Sections 19.1 and 19.2 and parts of Chapter 22. Parts of Chapters 3 and 4 may need to be skipped or glossed over, and other content judiciously omitted, in order to fit everything in. In a two-semester course, one can spend the first semester focusing on empirical processes and the second semester focusing more on semiparametric methods. In the first semester, Chapters 1 and 2, Sections 4.1–4.3, Chapters 5–10, 12–14, and parts of Chapter 15 could be covered, while in the second semester, Chapter 3, Sections 4.4–4.5, the remainder of Chapter 15, and Chapters 16–22 could be covered in some detail. The instructor can utilize those parts of Chapter 11 and elsewhere as deemed appropriate. It is good to pick and choose what is covered within every chapter presented, so that the students are not given too much material to digest.

The book can also be used for self-study and can be pursued in a basically linear format, with the reader omitting deeper concepts the first time through. For some sections, such as Chapter 3, it is worth skimming through to get an outline of the main ideas first without worrying about verifying the math. In general, this kind of material is learned best when homework problems are attempted. Students should generally have had at least half a year of graduate-level probability as well as a year of graduate-level mathematical statistics before working through this material.

Some of the research components presented in this book were partially supported by National Institutes of Health Grant CA075142. The author thanks the editor, John Kimmel, and numerous anonymous referees who provided much needed guidance throughout the process of writing. Thanks also go to numerous colleagues and students who provided helpful feedback and corrections on many levels. A partial list of such individuals includes Moulinath Banerjee, Hongyuan Cao, Guang Cheng, Sang-Hoon Cho, Kai Ding, Jason Fine, Minjung Kwak, Bee Leng Lee, Shuangge Ma, Rajat Mukherjee, Andrea Rotnitzky, Rui Song, Anand Vidyashankar, Jon Wellner, Donglin Zeng, and Songfeng Zheng.

Chapel Hill
October 2007

Contents

Preface

I Overview

1 Introduction

2 An Overview of Empirical Processes
   2.1 The Main Features
   2.2 Empirical Process Techniques
      2.2.1 Stochastic Convergence
      2.2.2 Entropy for Glivenko-Cantelli and Donsker Theorems
      2.2.3 Bootstrapping Empirical Processes
      2.2.4 The Functional Delta Method
      2.2.5 Z-Estimators
      2.2.6 M-Estimators
   2.3 Other Topics
   2.4 Exercises
   2.5 Notes

3 Overview of Semiparametric Inference
   3.1 Semiparametric Models and Efficiency
   3.2 Score Functions and Estimating Equations
   3.3 Maximum Likelihood Estimation
   3.4 Other Topics
   3.5 Exercises
   3.6 Notes

4 Case Studies I
   4.1 Linear Regression
      4.1.1 Mean Zero Residuals
      4.1.2 Median Zero Residuals
   4.2 Counting Process Regression
      4.2.1 The General Case
      4.2.2 The Cox Model
   4.3 The Kaplan-Meier Estimator
   4.4 Efficient Estimating Equations for Regression
      4.4.1 Simple Linear Regression
      4.4.2 A Poisson Mixture Regression Model
   4.5 Partly Linear Logistic Regression
   4.6 Exercises
   4.7 Notes

II Empirical Processes

5 Introduction to Empirical Processes

6 Preliminaries for Empirical Processes
   6.1 Metric Spaces
   6.2 Outer Expectation
   6.3 Linear Operators and Functional Differentiation
   6.4 Proofs
   6.5 Exercises
   6.6 Notes

7 Stochastic Convergence
   7.1 Stochastic Processes in Metric Spaces
   7.2 Weak Convergence
      7.2.1 General Theory
      7.2.2 Spaces of Bounded Functions
   7.3 Other Modes of Convergence
   7.4 Proofs
   7.5 Exercises
   7.6 Notes

8 Empirical Process Methods
   8.1 Maximal Inequalities
      8.1.1 Orlicz Norms and Maxima
      8.1.2 Maximal Inequalities for Processes
   8.2 The Symmetrization Inequality and Measurability
   8.3 Glivenko-Cantelli Results
   8.4 Donsker Results
   8.5 Exercises
   8.6 Notes

9 Entropy Calculations
   9.1 Uniform Entropy
      9.1.1 VC-Classes
      9.1.2 BUEI Classes
   9.2 Bracketing Entropy
   9.3 Glivenko-Cantelli Preservation
   9.4 Donsker Preservation
   9.5 Proofs
   9.6 Exercises
   9.7 Notes

10 Bootstrapping Empirical Processes
   10.1 The Bootstrap for Donsker Classes
      10.1.1 An Unconditional Multiplier Central Limit Theorem
      10.1.2 Conditional Multiplier Central Limit Theorems
      10.1.3 Bootstrap Central Limit Theorems
      10.1.4 Continuous Mapping Results
   10.2 The Bootstrap for Glivenko-Cantelli Classes
   10.3 A Simple Z-Estimator Master Theorem
   10.4 Proofs
   10.5 Exercises
   10.6 Notes

11 Additional Empirical Process Results
   11.1 Bounding Moments and Tail Probabilities
   11.2 Sequences of Functions
   11.3 Contiguous Alternatives
   11.4 Sums of Independent but not Identically Distributed Stochastic Processes
      11.4.1 Central Limit Theorems
      11.4.2 Bootstrap Results
   11.5 Function Classes Changing with n
   11.6 Dependent Observations
   11.7 Proofs
   11.8 Exercises
   11.9 Notes

12 The Functional Delta Method
   12.1 Main Results and Proofs
   12.2 Examples
      12.2.1 Composition
      12.2.2 Integration
      12.2.3 Product Integration
      12.2.4 Inversion
      12.2.5 Other Mappings
   12.3 Exercises
   12.4 Notes

13 Z-Estimators
   13.1 Consistency
   13.2 Weak Convergence
      13.2.1 The General Setting
      13.2.2 Using Donsker Classes
      13.2.3 A Master Theorem and the Bootstrap
   13.3 Using the Delta Method
   13.4 Exercises
   13.5 Notes

14 M-Estimators
   14.1 The Argmax Theorem
   14.2 Consistency
   14.3 Rate of Convergence
   14.4 Regular Euclidean M-Estimators
   14.5 Non-Regular Examples
      14.5.1 A Change-Point Model
      14.5.2 Monotone Density Estimation
   14.6 Exercises
   14.7 Notes

15 Case Studies II
   15.1 Partly Linear Logistic Regression Revisited
   15.2 The Two-Parameter Cox Score Process
   15.3 The Proportional Odds Model Under Right Censoring
      15.3.1 Nonparametric Maximum Likelihood Estimation
      15.3.2 Existence
      15.3.3 Consistency
      15.3.4 Score and Information Operators
      15.3.5 Weak Convergence and Bootstrap Validity
   15.4 Testing for a Change-point
   15.5 Large p Small n Asymptotics for Microarrays
      15.5.1 Assessing P-Value Approximations
      15.5.2 Consistency of Marginal Empirical Distribution Functions
      15.5.3 Inference for Marginal Sample Means
   15.6 Exercises
   15.7 Notes

III Semiparametric Inference

16 Introduction to Semiparametric Inference

17 Preliminaries for Semiparametric Inference
   17.1 Projections
   17.2 Hilbert Spaces
   17.3 More on Banach Spaces
   17.4 Exercises
   17.5 Notes

18 Semiparametric Models and Efficiency
   18.1 Tangent Sets and Regularity
   18.2 Efficiency
   18.3 Optimality of Tests
   18.4 Proofs
   18.5 Exercises
   18.6 Notes

19 Efficient Inference for Finite-Dimensional Parameters
   19.1 Efficient Score Equations
   19.2 Profile Likelihood and Least-Favorable Submodels
      19.2.1 The Cox Model for Right Censored Data
      19.2.2 The Proportional Odds Model for Right Censored Data
      19.2.3 The Cox Model for Current Status Data
      19.2.4 Partly Linear Logistic Regression
   19.3 Inference
      19.3.1 Quadratic Expansion of the Profile Likelihood
      19.3.2 The Profile Sampler
      19.3.3 The Penalized Profile Sampler
      19.3.4 Other Methods
   19.4 Proofs
   19.5 Exercises
   19.6 Notes

20 Efficient Inference for Infinite-Dimensional Parameters
   20.1 Semiparametric Maximum Likelihood Estimation
   20.2 Inference
      20.2.1 Weighted and Nonparametric Bootstraps
      20.2.2 The Piggyback Bootstrap
      20.2.3 Other Methods
   20.3 Exercises
   20.4 Notes

21 Semiparametric M-Estimation
   21.1 Semiparametric M-estimators
      21.1.1 Motivating Examples
      21.1.2 General Scheme for Semiparametric M-Estimators
      21.1.3 Consistency and Rate of Convergence
      21.1.4 √n Consistency and Asymptotic Normality
   21.2 Weighted M-Estimators and the Weighted Bootstrap
   21.3 Entropy Control
   21.4 Examples Continued
      21.4.1 Cox Model with Current Status Data (Example 1, Continued)
      21.4.2 Binary Regression Under Misspecified Link Function (Example 2, Continued)
      21.4.3 Mixture Models (Example 3, Continued)
   21.5 Penalized M-estimation
      21.5.1 Binary Regression Under Misspecified Link Function (Example 2, Continued)
      21.5.2 Two Other Examples
   21.6 Exercises
   21.7 Notes

22 Case Studies III
   22.1 The Proportional Odds Model Under Right Censoring Revisited
   22.2 Efficient Linear Regression
   22.3 Temporal Process Regression
   22.4 A Partly Linear Model for Repeated Measures
   22.5 Proofs
   22.6 Exercises
   22.7 Notes

References

Author Index

List of Symbols

Subject Index


1 Introduction

Both empirical processes and semiparametric inference techniques have become increasingly important tools for solving statistical estimation and inference problems. These tools are particularly important when the statistical model for the data at hand is semiparametric, in that it has one or more unknown component that is a function, measure or some other infinite-dimensional quantity. Semiparametric models also typically have one or more finite-dimensional Euclidean parameters of particular interest. The term nonparametric is often reserved for semiparametric models with no Euclidean parameters. Empirical process methods are powerful techniques for evaluating the large sample properties of estimators based on semiparametric models, including consistency, distributional convergence, and validity of the bootstrap. Semiparametric inference tools complement empirical process methods by, among other things, evaluating whether estimators make efficient use of the data.

Consider the semiparametric model

Y = β′Z + e,    (1.1)

where β, Z ∈ R^p are restricted to bounded sets, prime denotes transpose, (Y, Z) are the observed data, E[e|Z] = 0 and E[e²|Z] ≤ K < ∞ almost surely, and E[ZZ′] is positive definite. Given an independent and identically distributed (i.i.d.) sample of such data (Y_i, Z_i), i = 1, …, n, we are interested in estimating β without having to further specify the joint distribution of (e, Z). This is a very simple semiparametric model, and we definitely do not need empirical process methods to verify that

β̂ = [ ∑_{i=1}^n Z_i Z_i′ ]^{−1} ∑_{i=1}^n Z_i Y_i

is consistent for β and that √n(β̂ − β) is asymptotically normal with bounded variance.
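As a numerical aside (not from the text), the following Python sketch simulates i.i.d. data from model (1.1) with heteroscedastic, non-Gaussian residuals and computes the least-squares estimator above; the particular error distribution, covariate law, and sample sizes are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
beta = np.array([1.0, -2.0])  # "true" parameter, chosen arbitrarily

def simulate(n):
    # Z bounded, so E[ZZ'] exists and is positive definite
    Z = rng.uniform(-1.0, 1.0, size=(n, 2))
    # residuals with E[e|Z] = 0 and E[e^2|Z] bounded, but non-Gaussian
    # and dependent on Z through the scale
    e = (1.0 + 0.5 * np.abs(Z[:, 0])) * rng.standard_t(df=5, size=n)
    Y = Z @ beta + e
    return Y, Z

def ols(Y, Z):
    # beta_hat = [sum_i Z_i Z_i']^{-1} sum_i Z_i Y_i
    return np.linalg.solve(Z.T @ Z, Z.T @ Y)

for n in (100, 10_000):
    Y, Z = simulate(n)
    print(n, ols(Y, Z))  # estimates should tighten around (1.0, -2.0) as n grows
```

The estimates tightening around β as n grows is exactly the consistency claim; empirical process methods only become necessary for the subtler questions discussed next.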

A deeper question is whether β̂ achieves the lowest possible variance among all "reasonable" estimators. One important criterion for "reasonableness" is regularity, which is satisfied by β̂ and most standard √n consistent estimators and which we will define more precisely in Chapter 3. A regular estimator is efficient if it achieves the lowest possible variance among regular estimators. Semiparametric inference tools are required to establish this kind of optimality. Unfortunately, β̂ does not have the lowest possible variance among all regular estimators, unless we are willing to make some very strong assumptions. For instance, β̂ has optimal variance if we are willing to assume, in addition to what we have already assumed, that e has a Gaussian distribution and is independent of Z. In this instance, the model is almost fully parametric (except that the distribution of Z remains unspecified). Returning to the more general model in the previous paragraph, there is a modification of β̂ that does have the lowest possible variance among regular estimators, but computation of this modified estimator requires estimation of the function z → E[e²|Z = z]. We will explore this particular example in greater detail in Chapters 3 and 4.
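To make the shape of such a modification concrete, here is a hypothetical Python sketch of one reweighting of this kind: generalized least squares with weights 1/σ̂²(Z_i), where σ̂² is a deliberately crude binned estimate of z → E[e²|Z = z] built from preliminary residuals. This illustrates the idea only; it is not the construction developed in Chapters 3 and 4, and the binning scheme is a placeholder for a serious nonparametric variance estimator.

```python
import numpy as np

def weighted_ls(Y, Z, sigma2):
    # beta_w = [sum_i Z_i Z_i' / s_i]^{-1} sum_i Z_i Y_i / s_i,
    # with s_i = sigma2[i], an estimate of E[e^2 | Z = Z_i]
    W = 1.0 / sigma2
    ZW = Z * W[:, None]
    return np.linalg.solve(ZW.T @ Z, ZW.T @ Y)

def crude_sigma2(Z, resid, bins=20):
    # bin on the first covariate and average squared preliminary
    # residuals within each bin (illustrative placeholder only)
    edges = np.linspace(Z[:, 0].min(), Z[:, 0].max(), bins)
    idx = np.digitize(Z[:, 0], edges)
    s2 = np.ones(len(resid))
    for b in np.unique(idx):
        mask = idx == b
        s2[mask] = np.mean(resid[mask] ** 2)
    return np.maximum(s2, 1e-6)  # guard against degenerate bins
```

In practice one would first compute ordinary least-squares residuals, feed them to `crude_sigma2`, and then call `weighted_ls`; when σ̂² is consistent for the true conditional variance, this kind of reweighting is what recovers the lost efficiency.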

There is an interesting semiparametric model partway between the fully parametric Gaussian-residual model and the more general model which only assumes E[e|Z] = 0 almost surely. This alternative model assumes that the residual e and covariate Z are independent, with E[e] = 0 and E[e²] < ∞, but no additional restrictions are placed on the residual distribution F. Unfortunately, β̂ still does not have optimal variance. However, β̂ is a very good estimator and may be good enough for most purposes. In this setting, it may be useful to estimate F to determine whether the residuals are Gaussian. One promising estimator is

F̂(t) = n^{−1} ∑_{i=1}^n 1{Y_i − β̂′Z_i ≤ t},   (1.2)

where 1{A} is the indicator of A. Empirical process methods can be used to show that F̂ is uniformly consistent for F and that √n(F̂ − F) converges "in distribution" in a uniform sense to a certain Gaussian quantity, provided f is uniformly bounded. Quotes are used here because the convergence in question involves random real functions rather than Euclidean random variables. This kind of convergence is called weak convergence and is a generalization of convergence in distribution which will be defined more precisely in Chapter 2.


1. Introduction 5

Now we will consider a more complex example. Let N(t) be a counting process over the finite interval [0, τ] which is free to jump as long as t ∈ [0, V], where V ∈ (0, τ] is a random time. A counting process, by definition, is nonnegative and piecewise constant with positive jumps of size 1. Typically, the process counts a certain kind of event, such as hospitalizations, for an individual (see Chapter 1 of Fleming and Harrington, 1991, or Chapter II of Andersen, Borgan, Gill and Keiding, 1993). Define also the "at-risk" process Y(t) = 1{V ≥ t}. This process indicates whether an individual is at risk at time t− (just to the left of t) for a jump in N at time t. Suppose we also have baseline covariates Z ∈ R^p, and, for all t ∈ [0, τ], we assume

E{N(t)|Z} = ∫_0^t E{Y(s)|Z} e^{β′Z} dΛ(s),   (1.3)

for some β ∈ R^p and continuous nondecreasing function Λ(t) with Λ(0) = 0 and 0 < Λ(τ) < ∞. The model (1.3) is a variant of the "multiplicative intensity model" (see definition 4.2.1 of Fleming and Harrington, 1991). Basically, we are assuming that the mean of the counting process is proportional to e^{β′Z}. We also need to assume that E Y(τ) > 0, E N²(τ) < ∞, and that Z is restricted to a bounded set, but we do not otherwise restrict the distribution of N. Given an i.i.d. sample (N_i, Y_i, Z_i), i = 1, …, n, we are interested in estimating β and Λ.

Under mild regularity conditions, the estimating equation

U_n(t, β) = n^{−1} ∑_{i=1}^n ∫_0^t [Z_i − E_n(s, β)] dN_i(s),   (1.4)

where

E_n(s, β) = [n^{−1} ∑_{i=1}^n Z_i Y_i(s) e^{β′Z_i}] / [n^{−1} ∑_{i=1}^n Y_i(s) e^{β′Z_i}],

can be used for estimating β. The motivation for this estimating equation is that it arises as the score equation from the celebrated Cox partial likelihood (Cox, 1975) for either failure time data (where the counting process N(t) simply indicates whether the failure time has occurred by time t) or the multiplicative intensity model under an independent increment assumption on N (see Chapter 4 of Fleming and Harrington, 1991). Interestingly, this estimating equation can be shown to work under the more general model (1.3). Specifically, we can establish that (1.4) has an asymptotically unique zero β̂ at t = τ, that β̂ is consistent for β, and that √n(β̂ − β) is asymptotically mean zero Gaussian. Empirical process tools are needed to accomplish this. These same techniques can also establish that

Λ̂(t) = ∫_0^t [n^{−1} ∑_{i=1}^n dN_i(s)] / [n^{−1} ∑_{i=1}^n Y_i(s) e^{β̂′Z_i}]


is uniformly consistent for Λ(t) over t ∈ [0, τ] and that √n(Λ̂ − Λ) converges weakly to a mean zero Gaussian quantity. These methods can also be used to construct valid confidence intervals and confidence bands for β and Λ.

At this point, efficiency of these estimators is difficult to determine because so little has been specified about the distribution of N. Consider, however, the special case of right-censored failure time data. In this setting, the observed data are (V_i, d_i, Z_i), i = 1, …, n, where V_i = T_i ∧ C_i, d_i = 1{T_i ≤ C_i}, T_i is a failure time of interest with integrated hazard function e^{β′Z_i}Λ(t) given Z_i, and C_i is a right censoring time independent of T_i given Z_i with distribution not depending on β or Λ. Here, N_i(t) = d_i 1{V_i ≤ t} and a ∧ b denotes the minimum of a and b. Semiparametric inference techniques can now establish that both β̂ and Λ̂ are efficient. We will revisit this example in greater detail in Chapters 3 and 4.

In both of the previous examples, the estimator of the infinite-dimensional parameter (F in the first example and Λ in the second) is √n consistent, but slower rates of convergence for the infinite-dimensional part are also possible. As a third example, consider the partly linear logistic regression model described in Mammen and van de Geer (1997) and van der Vaart (1998, Page 405). The observed data are n independent realizations of the random triplet (Y, Z, U), where Z ∈ R^p and U ∈ R are covariates that are not linearly dependent, and Y is a dichotomous outcome with

E{Y|Z, U} = ν[β′Z + η(U)],   (1.5)

where β ∈ R^p, Z is restricted to a bounded set, U ∈ [0, 1], ν(t) = 1/(1 + e^{−t}), and η is an unknown smooth function. We assume, for some integer k ≥ 1, that the first k − 1 derivatives of η exist and are absolutely continuous with J²(η) ≡ ∫_0^1 [η^{(k)}(t)]² dt < ∞, where superscript (k) denotes the k-th derivative. Given an i.i.d. sample X_i = (Y_i, Z_i, U_i), i = 1, …, n, we are interested in estimating β and η.

The conditional density at Y = y given the covariates (Z, U) = (z, u) has the form

p_{β,η}(x) = {ν[β′z + η(u)]}^y {1 − ν[β′z + η(u)]}^{1−y}.

This cannot be used directly for defining a likelihood since for any 1 ≤ n < ∞ and fixed sample x_1, …, x_n, there exists a sequence of smooth functions η_m satisfying our criteria which converges to η̃, where η̃(u_i) = ∞ when y_i = 1 and η̃(u_i) = −∞ when y_i = 0. The issue is that requiring J(η) < ∞ does not restrict η on any finite collection of points. There are a number of methods for addressing this problem, including requiring J(η) ≤ M_n for each n, where M_n ↑ ∞ at an appropriately slow rate, or using a series of increasingly complex spline approximations.

An important alternative is to use the penalized log-likelihood

L_n(β, η) = n^{−1} ∑_{i=1}^n log p_{β,η}(X_i) − λ_n² J²(η),   (1.6)


where λ_n is a possibly data-dependent smoothing parameter. L_n is maximized over β and η to obtain the estimators β̂ and η̂. Large values of λ_n lead to very smooth but somewhat biased η̂, while small values of λ_n lead to less smooth but less biased η̂. The proper trade-off between smoothness and bias is usually best achieved by data-dependent schemes such as cross-validation. If λ_n is chosen to satisfy λ_n = o_P(n^{−1/4}) and λ_n^{−1} = O_P(n^{k/(2k+1)}), then both β̂ and η̂ are uniformly consistent and √n(β̂ − β) converges to a mean zero Gaussian vector. Furthermore, β̂ can be shown to be efficient even though η̂ is not √n consistent. More about this example will be discussed in Chapter 4.

These three examples illustrate the goals of empirical process and semiparametric inference research as well as hint at the power of these methods for solving statistical inference problems involving infinite-dimensional parameters. The goal of the first part of this book is to present the key ideas of empirical processes and semiparametric inference in a concise and heuristic way, without being distracted by technical issues, and to provide motivation to pursue the subject in greater depth as given in the remaining parts of the book. Even for those anxious to pursue the subject in depth, the broad view contained in this first part provides a valuable context for learning the details.

Chapter 2 presents an overview of empirical process methods and results, while Chapter 3 presents an overview of semiparametric inference techniques. Several case studies illustrating these methods, including further details on the examples given above, are presented in Chapter 4, which concludes the overview part. The empirical process part of the book (Part II) will be introduced in Chapter 5, while the semiparametric inference part (Part III) will be introduced in Chapter 16.


2 An Overview of Empirical Processes

This chapter presents an overview of the main ideas and techniques of empirical process research. The emphasis is on those concepts which directly impact statistical estimation and inference. The major distinction between empirical process theory and more standard asymptotics is that the random quantities studied have realizations as functions rather than real numbers or vectors. Proofs of results and certain details in definitions are postponed until Part II of the book.

We begin by defining and sketching the main features and asymptotic concepts of empirical processes with a view towards statistical issues. An outline of the main empirical process techniques covered in this book is presented next. This chapter concludes with a discussion of several additional related topics that will not be pursued in later chapters.

2.1 The Main Features

A stochastic process is a collection of random variables {X(t), t ∈ T} on the same probability space, indexed by an arbitrary index set T. An empirical process is a stochastic process based on a random sample. For example, consider a random sample X_1, …, X_n of i.i.d. real random variables with distribution F. The empirical distribution function is

F_n(t) = n^{−1} ∑_{i=1}^n 1{X_i ≤ t},   (2.1)


where the index t is allowed to vary over T = R, the real line.

More generally, we can consider a random sample X_1, …, X_n of independent draws from a probability measure P on an arbitrary sample space X. We define the empirical measure to be P_n = n^{−1} ∑_{i=1}^n δ_{X_i}, where δ_x is the measure that assigns mass 1 at x and zero elsewhere. For a measurable function f : X → R, we denote P_n f = n^{−1} ∑_{i=1}^n f(X_i). For any class F of measurable functions f : X → R, an empirical process {P_n f, f ∈ F} can be defined. This simple approach can generate a surprising variety of empirical processes, many of which we will consider in later sections in this chapter as well as in Part II.

Setting X = R, we can now re-express F_n as the empirical process {P_n f, f ∈ F}, where F = {1{x ≤ t}, t ∈ R}. Thus one can view the stochastic process F_n as indexed by either t ∈ R or f ∈ F. We will use either indexing approach, depending on which is most convenient for the task at hand. However, because of its generality, indexing empirical processes by classes of functions will be the primary approach taken throughout this book.
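The function-indexed view is easy to mirror in code. The following minimal sketch (an illustration, not from the text) treats the empirical measure as a map from functions f to averages P_n f, with the empirical distribution function recovered as the special case f(x) = 1{x ≤ t}:

```python
import random

random.seed(3)
X = [random.gauss(0.0, 1.0) for _ in range(5000)]
n = len(X)

def Pn(f):
    # Empirical measure applied to f: P_n f = n^{-1} sum_i f(X_i)
    return sum(f(x) for x in X) / n

# The empirical distribution function is the special case f(x) = 1{x <= t}
def Fn(t):
    return Pn(lambda x: float(x <= t))

ecdf_at_0 = Fn(0.0)                  # near Phi(0) = 0.5 for standard normal data
second_moment = Pn(lambda x: x * x)  # near E[X^2] = 1
```

Any measurable f can be plugged into the same `Pn`, which is exactly the sense in which one empirical measure generates many empirical processes.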

By the law of large numbers, we know that

F_n(t) →^{as} F(t)   (2.2)

for each t ∈ R, where →^{as} denotes almost sure convergence. A primary goal of empirical process research is to study empirical processes as random functions over the associated index set. Each realization of one of these random functions is a sample path. To this end, Glivenko (1933) and Cantelli (1933) demonstrated that (2.2) could be strengthened to

sup_{t∈R} |F_n(t) − F(t)| →^{as} 0.   (2.3)

Another way of saying this is that the sample paths of F_n get uniformly closer to F as n → ∞. Returning to general empirical processes, a class F of measurable functions f : X → R is said to be a P-Glivenko-Cantelli class if

sup_{f∈F} |P_n f − P f| →^{as*} 0,   (2.4)

where P f = ∫_X f(x) P(dx) and →^{as*} is a mode of convergence slightly stronger than →^{as} but which will not be precisely defined until later in this chapter (both modes of convergence are equivalent in the setting of (2.3)). Sometimes the P in P-Glivenko-Cantelli can be dropped if the context is clear.
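A quick simulation (illustrative only) shows the Glivenko-Cantelli phenomenon in (2.3). For Uniform(0, 1) data, F(t) = t and F_n is a step function, so the supremum over all t is attained at the order statistics:

```python
import random

random.seed(4)

def sup_diff(n):
    # sup_t |F_n(t) - F(t)| for Uniform(0,1) data; the supremum is attained
    # at the order statistics because F_n is a step function and F(t) = t
    xs = sorted(random.random() for _ in range(n))
    return max(max((i + 1) / n - x, x - i / n) for i, x in enumerate(xs))

d_small, d_large = sup_diff(100), sup_diff(100000)
print(d_small > d_large)  # the uniform distance shrinks as n grows
```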

Returning to F_n, we know by the central limit theorem that for each t ∈ R,

G_n(t) ≡ √n [F_n(t) − F(t)] ⇝ G(t),


where ⇝ denotes convergence in distribution and G(t) is a mean zero normal random variable with variance F(t)[1 − F(t)]. In fact, we know that G_n, simultaneously for all t in a finite set T_k = {t_1, …, t_k} ⊂ R, will converge in distribution to a mean zero multivariate normal vector G = {G(t_1), …, G(t_k)}′, where

cov[G(s), G(t)] = E[G(s)G(t)] = F(s ∧ t) − F(s)F(t)   (2.5)

for all s, t ∈ T_k.

Much more can be said. Donsker (1952) showed that the sample paths of G_n, as functions on R, converge in distribution to a certain stochastic process G. Weak convergence is the generalization of convergence in distribution from vectors of random variables to sample paths of stochastic processes. Donsker's result can be stated succinctly as G_n ⇝ G in ℓ∞(R), where, for any index set T, ℓ∞(T) is the collection of all bounded functions f : T → R. ℓ∞(T) is used in settings like this to remind us that we are thinking of distributional convergence in terms of the sample paths.

The limiting process G is a mean zero Gaussian process with E[G(s)G(t)] = F(s ∧ t) − F(s)F(t) for every s, t ∈ R. A Gaussian process is a stochastic process {Z(t), t ∈ T}, where for every finite T_k ⊂ T, {Z(t), t ∈ T_k} is multivariate normal, and where all sample paths are continuous in a certain sense that will be made more explicit later in this chapter. The process G can be written G(t) = B(F(t)), where B is a standard Brownian bridge on the unit interval. The process B has covariance s ∧ t − st and is equivalent to the process W(t) − tW(1), for t ∈ [0, 1], where W is a standard Brownian motion process. The standard Brownian motion is a Gaussian process on [0, ∞) with continuous sample paths, with W(0) = 0, and with covariance s ∧ t. Both B and W are important examples of Gaussian processes.

Returning again to general empirical processes, define the random measure G_n = √n(P_n − P), and, for any class F of measurable functions f : X → R, let G be a mean zero Gaussian process indexed by F, with covariance E[f(X)g(X)] − E f(X) E g(X) for all f, g ∈ F, and having appropriately continuous sample paths. Both G_n and G can be thought of as being indexed by F. We say that F is P-Donsker if G_n ⇝ G in ℓ∞(F). The P and/or the ℓ∞(F) may be dropped if the context is clear. Donsker's (1952) theorem tells us that F = {1{x ≤ t}, t ∈ R} is Donsker for all probability measures which are based on some real distribution function F. With f(x) = 1{x ≤ t} and g(x) = 1{x ≤ s},

E[f(X)g(X)] − E f(X) E g(X) = F(s ∧ t) − F(s)F(t).

For this reason, G is also referred to as a Brownian bridge.

Suppose we are interested in forming confidence bands for F over some subset H ⊂ R. Because F = {1{x ≤ t}, t ∈ R} is Glivenko-Cantelli, we can uniformly consistently estimate the covariance σ(s, t) = F(s ∧ t) − F(s)F(t) of G with σ̂(s, t) = F_n(s ∧ t) − F_n(s)F_n(t). While such a covariance could be


used to form confidence bands when H is finite, it is of little use when H is infinite, such as when H is a subinterval of R. In this case, it is preferable to make use of the Donsker result for G_n. Let U_n = sup_{t∈H} |G_n(t)|. The continuous mapping theorem tells us that whenever a process {Z_n(t), t ∈ H} converges weakly to a tight limiting process {Z(t), t ∈ H} in ℓ∞(H), then h(Z_n) ⇝ h(Z) in h(ℓ∞(H)) for any continuous map h. In our setting U_n = h(G_n), where h(g) = sup_{t∈H} |g(t)|, for any g ∈ ℓ∞(R), is a continuous real function. Thus the continuous mapping theorem tells us that U_n ⇝ U = sup_{t∈H} |G(t)|. When F is continuous and H = R, U = sup_{t∈[0,1]} |B(t)| has a known distribution from which it is easy to compute quantiles. If we let u_p be the p-th quantile of U, then an asymptotically valid symmetric 1 − α level confidence band for F is F_n ± u_{1−α}/√n.

An alternative is to construct confidence bands based on a large number of bootstraps of F_n. The bootstrap for F_n can be written as F̂_n(t) = n^{−1} ∑_{i=1}^n W_{ni} 1{X_i ≤ t}, where (W_{n1}, …, W_{nn}) is a multinomial random n-vector, with probabilities (1/n, …, 1/n) and number of trials n, and which is independent of the data X_1, …, X_n. The conditional distribution of Ĝ_n = √n(F̂_n − F_n) given X_1, …, X_n can be shown to converge weakly to the distribution of G in ℓ∞(R). Thus the bootstrap is an asymptotically valid way to obtain confidence bands for F.

Returning to the general empirical process set-up, let F be a Donsker class and suppose we wish to construct confidence bands for E f(X) that are simultaneously valid for all f ∈ H ⊂ F. Provided certain second moment conditions hold on F, the estimator σ̂(f, g) = P_n[f(X)g(X)] − P_n f(X) P_n g(X) is consistent for σ(f, g) = E[f(X)g(X)] − E f(X) E g(X) uniformly over all f, g ∈ F. As with the empirical distribution function estimator, this covariance is enough to form confidence bands provided H is finite. Fortunately, the bootstrap is always asymptotically valid when F is Donsker and can therefore be used for infinite H. More precisely, if Ĝ_n = √n(P̂_n − P_n), where P̂_n f = n^{−1} ∑_{i=1}^n W_{ni} f(X_i) and (W_{n1}, …, W_{nn}) is defined as before, then the conditional distribution of Ĝ_n given the data converges weakly to G in ℓ∞(F). Since this is true for all of F, it is certainly true for any H ⊂ F. The bootstrap result for F̂_n is clearly a special case of this more general result.

Many important statistics based on i.i.d. data cannot be written as empirical processes, but they can frequently be written in the form φ(P_n), where P_n is indexed by some F and φ is a smooth map from ℓ∞(F) to some set B (possibly infinite-dimensional). Consider, for example, the quantile process ξ_n(p) = F_n^{−1}(p) for p ∈ [a, b], where H^{−1}(p) = inf{t : H(t) ≥ p} for a distribution function H and 0 < a < b < 1. Here, ξ_n = φ(F_n), where φ maps a distribution function H to H^{−1}. When the underlying distribution F is continuous over N = [F^{−1}(a) − ǫ, F^{−1}(b) + ǫ] ⊂ [0, 1], for some ǫ > 0, with continuous density f such that 0 < inf_{t∈N} f(t) ≤ sup_{t∈N} f(t) < ∞, then √n(ξ_n(p) − ξ_p), where ξ_p = F^{−1}(p), is uniformly


asymptotically equivalent to −G_n(F^{−1}(p))/f(F^{−1}(p)) and hence converges weakly to G(F^{−1}(p))/f(F^{−1}(p)) in ℓ∞([a, b]). (Because the process G is symmetric around zero, both −G and G have the same distribution.) The above weak convergence result is a special case of the functional delta-method principle which states that √n[φ(P_n) − φ(P)] converges weakly in B to φ′(G), whenever F is Donsker and φ has a "Hadamard derivative" φ′ which will be defined more precisely later in this chapter.

which will be defined more precisely later in this chapter.Many additional statistics can be written as zeros or maximizers of

certain data-dependent processes. The former are known as Z-estimatorsand the latter as M-estimators. Consider the linear regression examplegiven in Chapter 1. Since β is the zero of Un(β) = Pn [X(Y −X ′β)], β

is a Z-estimator. In contrast, the penalized likelihood estimators (β, η) inthe partly linear logistic regression example of the same chapter are M-estimators since they are maximizers of L(β, η) given in (1.6). As is thecase with Un and Ln, the data-dependent objective functions used in Z-and M- estimation are often empirical processes, and thus empirical processmethods are frequently required when studying the large sample propertiesof the associated statistics.

The key attribute of empirical processes is that they are random functions, or stochastic processes, based on a random data sample. The main asymptotic issue is studying the limiting behavior of these processes in terms of their sample paths. Primary achievements in this direction are Glivenko-Cantelli results which extend the law of large numbers, Donsker results which extend the central limit theorem, the validity of the bootstrap for Donsker classes, and the functional delta method.

2.2 Empirical Process Techniques

In this section, we expand on several important techniques used in empirical processes. We first define and discuss several important kinds of stochastic convergence, including convergence in probability as well as almost sure and weak convergence. We then introduce the concept of entropy and present several Glivenko-Cantelli and Donsker theorems based on entropy. The empirical bootstrap and functional delta method are described next. A brief outline of Z- and M-estimator methods is then presented. This section is essentially a review in miniature of the main points covered in Part II of this book, with a minimum of technicalities.

2.2.1 Stochastic Convergence

When discussing convergence of stochastic processes, there is always a metric space (D, d) implicitly involved, where D is the space of possible values for the processes and d is a metric (distance measure), satisfying d(x, y) ≥ 0, d(x, y) = d(y, x), d(x, z) ≤ d(x, y) + d(y, z), and d(x, y) = 0 if and only if x = y, for all x, y, z ∈ D. Frequently, D = ℓ∞(T), where T is the index set for the processes involved, and d is the uniform distance on D, i.e., d(x, y) = sup_{t∈T} |x(t) − y(t)| for any x, y ∈ D. We are primarily interested in the convergence properties of the sample paths of stochastic processes. Weak convergence, or convergence in distribution, of a stochastic process X_n happens when the sample paths of X_n begin to behave in distribution, as n → ∞, more and more like a specific random process X. When X_n

and X are Borel measurable, weak convergence is equivalent to saying that E f(X_n) → E f(X) for every bounded, continuous function f : D → R, where the notation f : A → B means that f is a mapping from A to B, and where continuity is in terms of d. Hereafter, we will let C_b(D) denote the space of bounded, continuous maps f : D → R. We will define Borel measurability in detail later in Part II, but, for now, it is enough to say that lack of this property means that there are certain important subsets A ⊂ D where the probability that X_n ∈ A is not defined.

In many statistical applications, X_n may not be Borel measurable. To resolve this problem, we need to introduce the notion of outer expectation for arbitrary maps T : Ω → R̄ ≡ [−∞, ∞], where Ω is the sample space. T is not necessarily a random variable because it is not necessarily Borel measurable. The outer expectation of T, denoted E*T, is the infimum over all E U, where U : Ω → R̄ is measurable, U ≥ T, and E U exists. For E U to exist, it must not be indeterminate, although it can be ±∞, provided the sign is clear. We analogously define inner expectation: E_*T = −E*[−T]. There also exists a measurable function T* : Ω → R̄, called the minimal measurable majorant, satisfying T*(ω) ≥ T(ω) for all ω ∈ Ω and which is almost surely the smallest measurable function ≥ T. Furthermore, when E*T < ∞, E*T = E T*. The maximal measurable minorant is T_* = −(−T)*. We also define outer probability for possibly nonmeasurable sets: P*(A) is the infimum over all P(B) with A ⊂ B ⊂ Ω and B a Borel measurable set. Inner probability is defined as P_*(A) = 1 − P*(Ω − A). This use of outer measure permits defining weak convergence, for possibly nonmeasurable X_n, as E*f(X_n) → E f(X) for all f ∈ C_b(D). We denote this convergence by X_n ⇝ X. Notice that we require the limiting process X to be measurable. This definition of weak convergence also carries with it an implicit measurability requirement on X_n: X_n ⇝ X implies that X_n is asymptotically measurable, in that E*f(X_n) − E_*f(X_n) → 0, for every f ∈ C_b(D).

We now consider convergence in probability and almost surely. We say X_n converges to X in probability if P{d(X_n, X)* > ǫ} → 0 for every ǫ > 0, and we denote this X_n →^P X. We say that X_n converges outer almost surely to X if there exists a sequence ∆_n of measurable random variables with d(X_n, X) ≤ ∆_n for all n and with P{lim sup_{n→∞} ∆_n = 0} = 1. We denote this kind of convergence X_n →^{as*} X. While these modes of convergence are


slightly different than the standard ones, they are identical when all the quantities involved are measurable. The properties of the standard modes are also generally preserved in these new modes. The major difference is that these new modes can accommodate many situations in statistics and in other fields which could not be as easily accommodated with the standard ones. As much as possible, we will keep measurability issues suppressed throughout this book, except where it is necessary for clarity. From this point on, the metric d of choice will be the uniform metric, unless noted otherwise.

For almost all of the weak convergence applications in this book, the limiting quantity X will be tight, in the sense that the sample paths of X will have a certain minimum amount of smoothness. To be more precise, for an index set T, let ρ be a semimetric on T, in that ρ has all the properties of a metric except that ρ(s, t) = 0 does not necessarily imply s = t. We say that T is totally bounded by ρ if for every ǫ > 0, there exists a finite collection T_k = {t_1, …, t_k} ⊂ T such that for all t ∈ T, we have ρ(t, s) ≤ ǫ for some s ∈ T_k. Now define UC(T, ρ) to be the subset of ℓ∞(T) where each x ∈ UC(T, ρ) satisfies

lim_{δ↓0} sup_{s,t∈T : ρ(s,t)≤δ} |x(t) − x(s)| = 0.

The "UC" refers to uniform continuity. The stochastic process X is tight if X ∈ UC(T, ρ) almost surely for some ρ for which T is totally bounded. If X is a Gaussian process, then ρ can be chosen as ρ(s, t) = (var[X(s) − X(t)])^{1/2}. Tight Gaussian processes will be the most important limiting

processes considered in this book.

Two conditions need to be met in order for X_n to converge weakly in ℓ∞(T) to a tight X. This is summarized in the following theorem, which we present now but prove later in Chapter 7 (Page 114):

Theorem 2.1 X_n converges weakly to a tight X in ℓ∞(T) if and only if:

(i) For all finite {t_1, …, t_k} ⊂ T, the multivariate distribution of {X_n(t_1), …, X_n(t_k)} converges to that of {X(t_1), …, X(t_k)}.

(ii) There exists a semimetric ρ for which T is totally bounded and

lim_{δ↓0} lim sup_{n→∞} P*{ sup_{s,t∈T : ρ(s,t)<δ} |X_n(s) − X_n(t)| > ǫ } = 0,   (2.6)

for all ǫ > 0.

Condition (i) is convergence of all finite-dimensional distributions and Condition (ii) implies asymptotic tightness. In many applications, Condition (i) is not hard to verify while Condition (ii) is much more difficult.


In the empirical process setting based on i.i.d. data, we are interested in establishing that G_n ⇝ G in ℓ∞(F), where F is some class of measurable functions f : X → R, and where X is the sample space. When E f²(X) < ∞ for all f ∈ F, Condition (i) above is automatically satisfied by the standard central limit theorem, whereas establishing Condition (ii) is much more work and is the primary motivator behind the development of much of modern empirical process theory. Whenever F is Donsker, the limiting process G is always a tight Gaussian process, and F is totally bounded by the semimetric ρ(f, g) = (var[f(X) − g(X)])^{1/2}. Thus Conditions (i) and (ii) of Theorem 2.1 are both satisfied with T = F, X_n(f) = G_n f, and X(f) = G f, for all f ∈ F.

Another important result is the continuous mapping theorem. This theorem states that if g : D → E is continuous at every point of a set D_0 ⊂ D, and if X_n ⇝ X, where X takes all its values in D_0, then g(X_n) ⇝ g(X). For example, if F is a Donsker class, then sup_{f∈F} |G_n f| has the same limiting distribution as sup_{f∈F} |G f|, since the supremum map is uniformly continuous, i.e., |sup_{f∈F} |x(f)| − sup_{f∈F} |y(f)|| ≤ sup_{f∈F} |x(f) − y(f)| for all x, y ∈ ℓ∞(F). This fact can be used to construct confidence bands for P f. The continuous mapping theorem has many other practical uses that we will utilize at various points throughout this book.

2.2.2 Entropy for Glivenko-Cantelli and Donsker Theorems

The major challenge in obtaining Glivenko-Cantelli or Donsker theorems for classes of functions F is to somehow show that going from pointwise convergence to uniform convergence is feasible. Clearly the complexity, or entropy, of F plays a major role. The easiest entropy to introduce is entropy with bracketing. For 1 ≤ r < ∞, let L_r(P) denote the collection of functions g : X → R such that ‖g‖_{P,r} ≡ [∫_X |g(x)|^r dP(x)]^{1/r} < ∞. An ǫ-bracket in L_r(P) is a pair of functions l, u ∈ L_r(P) with P{l(X) ≤ u(X)} = 1 and with ‖l − u‖_{P,r} ≤ ǫ. A function f ∈ F lies in the bracket [l, u] if P{l(X) ≤ f(X) ≤ u(X)} = 1. The bracketing number N_{[]}(ǫ, F, L_r(P)) is the minimum number of ǫ-brackets in L_r(P) needed to ensure that every f ∈ F lies in at least one bracket. The logarithm of the bracketing number is the entropy with bracketing. The following is one of the simplest Glivenko-Cantelli theorems (the proof is deferred until Part II, Page 145):

Theorem 2.2 Let F be a class of measurable functions and suppose that N_{[]}(ǫ, F, L_1(P)) < ∞ for every ǫ > 0. Then F is P-Glivenko-Cantelli.

Consider, for example, the empirical distribution function F_n based on an i.i.d. sample X_1, …, X_n of real random variables with distribution F (which defines the probability measure P on X = R). In this setting, F_n is the empirical process P_n indexed by the class F = {1{x ≤ t}, t ∈ R}. For any ǫ > 0, a finite collection of real numbers −∞ = t_1 < t_2 < · · · < t_k = ∞

Page 29: Springer Series in Statistics - stat.ncsu.edulu/ST7903/Kosorok-Introduction_to... · Springer Series in Statistics Borg/Groenen: Modern Multidimensional Scaling: Theory and Applications,

2.2 Empirical Process Techniques 17

can be found so that F(t_j−) − F(t_{j−1}) ≤ ǫ for all 1 < j ≤ k, F(t_1) = 0 and F(t_k−) = 1, where H(t−) = lim_{s↑t} H(s) when such a limit exists. This can always be done in such a way that k ≤ 2 + 1/ǫ. Consider the collection of brackets (l_j, u_j), 1 < j ≤ k, with l_j(x) = 1{x ≤ t_{j−1}} and u_j(x) = 1{x < t_j} (notice that u_j is not in F). Now each f ∈ F is in at least one bracket and ‖u_j − l_j‖_{P,1} = F(t_j−) − F(t_{j−1}) ≤ ǫ for all 1 < j ≤ k. Thus N_{[]}(ǫ, F, L_1(P)) < ∞ for every ǫ > 0, and the conditions of Theorem 2.2 are met.
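For the uniform distribution on [0, 1] this construction can be written down explicitly. The following sketch (an illustration, not from the text) enumerates the cut points and checks the L_1(P) bracket sizes:

```python
import math

def l1_brackets(eps):
    # For F = {1{x <= t} : t in R} and P = Uniform(0, 1): cut points with
    # F-increments at most eps; bracket j is l_j(x) = 1{x <= t_{j-1}},
    # u_j(x) = 1{x < t_j}, and ||u_j - l_j||_{P,1} equals the F-increment.
    m = math.ceil(1.0 / eps)  # number of brackets, at most 1 + 1/eps
    return [(j / m, (j + 1) / m) for j in range(m)]

pairs = l1_brackets(0.3)
sizes = [hi - lo for lo, hi in pairs]
print(len(pairs), max(sizes))  # 4 brackets, each of L1-size 0.25 <= 0.3
```

Every f(x) = 1{x ≤ t} lies in the bracket whose interval contains t, so here N_{[]}(0.3, F, L_1(P)) ≤ 4.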

Donsker theorems based on entropy with bracketing require more stringent conditions on the number of brackets needed to cover F. The bracketing integral,

J_{[]}(δ, F, L_r(P)) ≡ ∫_0^δ √(log N_{[]}(ǫ, F, L_r(P))) dǫ,

needs to be bounded for r = 2 and δ = ∞ to establish that F is Donsker. Hence the bracketing entropy is permitted to go to ∞ as ǫ ↓ 0, but not too quickly. For most of the classes F of interest, the entropy does go to ∞ as ǫ ↓ 0. However, a surprisingly large number of these classes satisfy the conditions of Theorem 2.3 below, our first Donsker theorem (which we prove in Chapter 8, Page 148):

Theorem 2.3 Let F be a class of measurable functions with J_{[]}(∞, F, L_2(P)) < ∞. Then F is P-Donsker.

Returning again to the empirical distribution function example, we have for the ǫ-brackets used previously that ‖uj − lj‖P,2 = (‖uj − lj‖P,1)^{1/2} ≤ ǫ^{1/2}. Hence the minimum number of L2 ǫ-brackets needed to cover F is bounded by 1 + 1/ǫ², since an L1 ǫ²-bracket is an L2 ǫ-bracket. For ǫ > 1, the number of brackets needed is just 1. J[](∞, F, L2(P)) will therefore be finite if ∫₀¹ √(log(1 + 1/ǫ²)) dǫ < ∞. Using the fact that log(1 + a) ≤ 1 + log(a) for a ≥ 1 and the variable substitution u = 1 + log(1/ǫ²), we obtain that this integral is bounded by ∫₀^∞ u^{1/2} e^{−u/2} du = √(2π). Thus the conditions of Theorem 2.3 are easily satisfied. We now give two other examples of classes with bounded Lr(P) bracketing integral. Parametric classes of the form F = {fθ : θ ∈ Θ} work, provided Θ is a bounded subset of Rp and there exists an m ∈ Lr(P) such that |fθ1(x) − fθ2(x)| ≤ m(x)‖θ1 − θ2‖ for all θ1, θ2 ∈ Θ. Here, ‖ · ‖ is the standard Euclidean norm on Rp. The class F of all monotone functions f : R → [0, 1] also works, for all 1 ≤ r < ∞ and all probability measures P.
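The finiteness claim can be checked by direct numerical integration. This sketch (ours) evaluates the entropy integral for the indicator class and compares it with the √(2π) bound derived above; the integrand blows up only logarithmically at 0, so truncating near 0 is harmless:

```python
import numpy as np

# Evaluate J = int_0^1 sqrt(log(1 + 1/eps^2)) d(eps) by the trapezoid rule.
eps = np.linspace(1e-6, 1.0, 400001)
integrand = np.sqrt(np.log1p(eps ** -2))
J = float(np.sum((integrand[1:] + integrand[:-1]) / 2 * np.diff(eps)))
print(J, np.sqrt(2 * np.pi))  # J is finite and below the sqrt(2*pi) bound
```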

Entropy calculations for other classes that arise in statistical applications can be difficult. However, there are a number of techniques for doing this that are not difficult to apply in practice and that we will explore briefly later on in this section. Unfortunately, there are also many classes F for which entropy with bracketing does not work at all. An alternative which can be useful in such settings is entropy based on covering numbers. For a probability measure Q, the covering number N(ǫ, F, Lr(Q)) is the minimum number of Lr(Q) ǫ-balls needed to cover F, where an Lr(Q) ǫ-ball around a function g ∈ Lr(Q) is the set {h ∈ Lr(Q) : ‖h − g‖Q,r < ǫ}. For a collection of balls to cover F, all elements of F must be included in at least one of the balls, but it is not necessary that the centers of the balls be contained in F. The entropy is the logarithm of the covering number. The bracketing entropy conditions in Theorems 2.2 and 2.3 can be replaced by conditions based on the uniform covering numbers

  supQ N(ǫ‖F‖Q,r, F, Lr(Q)),   (2.7)

where F : X → R is an envelope for F, meaning that |f(x)| ≤ F(x) for all x ∈ X and all f ∈ F, and where the supremum is taken over all finitely discrete probability measures Q with ‖F‖Q,r > 0. A finitely discrete probability measure on X puts mass only at a finite number of points in X. Notice that the uniform covering number does not depend on the probability measure P for the observed data. The uniform entropy integral is

  J(δ, F, Lr) = ∫₀^δ √(log supQ N(ǫ‖F‖Q,r, F, Lr(Q))) dǫ,

where the supremum is taken over the same set used in (2.7). The following two theorems (given without proof) are Glivenko-Cantelli and Donsker results for uniform entropy:

Theorem 2.4 Let F be an appropriately measurable class of measurable functions with supQ N(ǫ‖F‖Q,1, F, L1(Q)) < ∞ for every ǫ > 0, where the supremum is taken over the same set used in (2.7). If P∗F < ∞, then F is P-Glivenko-Cantelli.

Theorem 2.5 Let F be an appropriately measurable class of measurable functions with J(1, F, L2) < ∞. If P∗F² < ∞, then F is P-Donsker.

Discussion of the “appropriately measurable” condition will be postponed until Part II (see Pages 145 and 149), but suffice it to say that it is satisfied for many function classes of interest in statistical applications.

An important collection of function classes F satisfying J(1, F, Lr) < ∞ for any 1 ≤ r < ∞ is the collection of Vapnik-Cervonenkis classes, or VC classes. Many classes of interest in statistics are VC, including the class of indicator functions explored earlier in the empirical distribution function example and also vector space classes. A vector space class F has the form {∑ᵢ₌₁ᵏ λi fi(x) : (λ1, . . . , λk) ∈ Rᵏ} for fixed functions f1, . . . , fk. We will postpone further definition and discussion of VC classes until Part II.

The important thing to know at this point is that one does not need to calculate entropy for each new problem. There are a number of easy methods which can be used to determine whether a given class is Glivenko-Cantelli or Donsker based on whether the class is built up of other, well-known classes. For example, subsets of Donsker classes are Donsker, since Condition (ii) of Theorem 2.1 is clearly satisfied for any subset of T if it is satisfied for T. One can also use Theorem 2.1 to show that finite unions of Donsker classes are Donsker. When F and G are Donsker, the following are also Donsker: {f ∧ g : f ∈ F, g ∈ G} and {f ∨ g : f ∈ F, g ∈ G}, where ∧ denotes minimum and ∨ denotes maximum, and {f + g : f ∈ F, g ∈ G}. If F and G are bounded Donsker classes, then {fg : f ∈ F, g ∈ G} is Donsker. Also, Lipschitz continuous functions of Donsker classes are Donsker. Furthermore, if F is Donsker, then it is also Glivenko-Cantelli. These, and many other tools for verifying that a given class is Glivenko-Cantelli or Donsker, will be discussed in greater detail in Chapter 9.

2.2.3 Bootstrapping Empirical Processes

An important aspect of inference for empirical processes is to be able to obtain covariance and confidence band estimates. The limiting covariance for a P-Donsker class F is σ : F × F → R, where σ(f, g) ≡ Pfg − PfPg. The covariance estimator σ̂ : F × F → R, where σ̂(f, g) ≡ Pnfg − PnfPng, is uniformly consistent for σ outer almost surely if and only if P∗[supf∈F(f(X) − Pf)²] < ∞. This will be proved later in Part II. However, this is only of limited use, since critical values for confidence bands cannot in general be determined from the covariance when F is not finite. The bootstrap is an effective alternative.
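The estimator σ̂ is simple to compute. As a quick illustration (our example, with P uniform on [0, 1] and f, g indicators as in the distribution-function example), σ̂(f, g) approaches the Brownian-bridge covariance s ∧ t − st:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200_000
x = rng.uniform(size=n)  # P = Uniform(0, 1), so P 1{X <= s} = s

def sigma_hat(s, t):
    # sigma_hat(f, g) = P_n fg - (P_n f)(P_n g) for f = 1{x <= s}, g = 1{x <= t}
    f, g = (x <= s).astype(float), (x <= t).astype(float)
    return float((f * g).mean() - f.mean() * g.mean())

s, t = 0.3, 0.7
print(sigma_hat(s, t), min(s, t) - s * t)  # both close to 0.09
```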

As mentioned earlier, some care must be taken to ensure that the concept of weak convergence makes sense when the statistics of interest may not be measurable. This issue becomes more delicate with bootstrap results, which involve convergence of conditional laws given the observed data. In this setting, there are two sources of randomness, the observed data and the resampling done by the bootstrap. For this reason, convergence of conditional laws is assessed in a slightly different manner than regular weak convergence. An important result is that Xn ⇝ X in the metric space (D, d) if and only if

  supf∈BL1 |E∗f(Xn) − Ef(X)| → 0,   (2.8)

where BL1 is the space of functions f : D → R with Lipschitz norm bounded by 1, i.e., ‖f‖∞ ≤ 1 and |f(x) − f(y)| ≤ d(x, y) for all x, y ∈ D, and where ‖ · ‖∞ is the uniform norm.

We can now use this alternative definition of weak convergence to define convergence of the conditional limit laws of bootstraps. Let X̂n be a sequence of bootstrapped processes in D with random weights that we will denote M. For some tight process X in D, we use the notation X̂n ⇝^P_M X to mean that

  suph∈BL1 |EMh(X̂n) − Eh(X)| →^P 0 and EMh(X̂n)^∗ − EMh(X̂n)_∗ →^P 0, for all h ∈ BL1,

where the subscript M in the expectations indicates conditional expectation over the weights M given the remaining data, and where h(X̂n)^∗ and h(X̂n)_∗ denote measurable majorants and minorants with respect to the joint data (including the weights M). We use the notation X̂n ⇝^as∗_M X to mean the same thing, except with all of the →^P's replaced by →^as∗'s. Note that the h(X̂n) inside the supremum does not have an asterisk: this is because the Lipschitz continuous functions of the bootstrapped processes that we will study in this book will always be measurable functions of the random weights when conditioning on the data.

As mentioned previously, the bootstrap empirical measure can be defined as P̂nf = n⁻¹ ∑ᵢ₌₁ⁿ Wni f(Xi), where Wn = (Wn1, . . . , Wnn) is a multinomial vector with probabilities (1/n, . . . , 1/n) and number of trials n, and where Wn is independent of the data sequence X = (X1, X2, . . .). We can now define a useful and simple alternative to this standard nonparametric bootstrap. Let ξ = (ξ1, ξ2, . . .) be an infinite sequence of nonnegative i.i.d. random variables, also independent of X, which have mean 0 < µ < ∞ and variance 0 < τ² < ∞, and which satisfy ‖ξ‖2,1 < ∞, where ‖ξ‖2,1 = ∫₀^∞ √(P(|ξ| > x)) dx. This last condition is slightly stronger than a bounded second moment but is implied whenever the 2 + ǫ moment exists for any ǫ > 0. We can now define a multiplier bootstrap empirical measure P̃nf = n⁻¹ ∑ᵢ₌₁ⁿ (ξi/ξ̄n) f(Xi), where ξ̄n = n⁻¹ ∑ᵢ₌₁ⁿ ξi and P̃n is defined to be zero if ξ̄n = 0. Note that the weights add up to n for both bootstraps. When ξ1 has a standard exponential distribution, for example, the moment conditions are clearly satisfied, and the resulting multiplier bootstrap has Dirichlet weights.
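Both bootstrap empirical measures are easy to simulate. The following sketch (ours) draws one set of multinomial weights and one set of standard-exponential multiplier weights for a single fixed function f, and confirms that each weight vector sums to n as noted above:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 1000
x = rng.normal(size=n)
f = lambda z: z ** 2                        # one fixed function f in the class

# Nonparametric bootstrap: W_n ~ Multinomial(n; 1/n, ..., 1/n), independent of X.
W = rng.multinomial(n, np.ones(n) / n)
P_hat = (W * f(x)).mean()                   # P-hat_n f = n^{-1} sum_i W_ni f(X_i)

# Multiplier bootstrap: nonnegative i.i.d. xi with finite mean and variance and
# ||xi||_{2,1} < infinity; standard exponentials qualify (Dirichlet weights).
xi = rng.exponential(size=n)
P_tilde = ((xi / xi.mean()) * f(x)).mean()  # P-tilde_n f = n^{-1} sum_i (xi_i/xibar_n) f(X_i)

print(W.sum(), (xi / xi.mean()).sum())      # both weight vectors sum to n
print(f(x).mean(), P_hat, P_tilde)          # bootstrap draws scatter around P_n f
```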

Under these conditions, we have the following two theorems (which we prove in Part II, Page 187) for convergence of the bootstrap, both in probability and outer almost surely. Let Ĝn = √n(P̂n − Pn), G̃n = √n(µ/τ)(P̃n − Pn), and let G be the standard Brownian bridge in ℓ∞(F).

Theorem 2.6 The following are equivalent:

(i) F is P-Donsker.

(ii) Ĝn ⇝^P_W G in ℓ∞(F) and the sequence Ĝn is asymptotically measurable.

(iii) G̃n ⇝^P_ξ G in ℓ∞(F) and the sequence G̃n is asymptotically measurable.

Theorem 2.7 The following are equivalent:

(i) F is P-Donsker and P∗[supf∈F(f(X) − Pf)²] < ∞.

(ii) Ĝn ⇝^as∗_W G in ℓ∞(F).

(iii) G̃n ⇝^as∗_ξ G in ℓ∞(F).

According to Theorem 2.7, the almost sure consistency of the bootstrap requires the same moment condition required for almost sure uniform consistency of the covariance estimator σ̂. In contrast, the consistency in probability of the bootstrap given in Theorem 2.6 requires only that F be Donsker. Thus consistency in probability of the bootstrap empirical process is an automatic consequence of weak convergence in the first place. Fortunately, consistency in probability is adequate for most statistical applications, since it implies that confidence bands constructed from the bootstrap are asymptotically valid. This follows because, as we will also establish in Part II, whenever the conditional law of a bootstrapped quantity (say X̂n) in a normed space (with norm ‖ · ‖) converges to a limiting law (say of X), either in probability or outer almost surely, the conditional law of ‖X̂n‖ converges to that of ‖X‖ under mild regularity conditions. We will also establish a slightly more general in-probability continuous mapping theorem for the bootstrap when the continuous map g is real valued.

Suppose we wish to construct a 1 − α level confidence band for {Pf, f ∈ F}, where F is P-Donsker. We can obtain a large number, say N, bootstrap realizations of supf∈F |Ĝnf| to estimate the 1 − α quantile of supf∈F |Gf|. If we call this estimate ĉ1−α, then Theorem 2.6 tells us that {Pnf ± n^{−1/2}ĉ1−α, f ∈ F} has coverage 1 − α for large enough n and N. For a more specific example, consider estimating F(t1, t2) = P(Y1 ≤ t1, Y2 ≤ t2), where X = (Y1, Y2) has an arbitrary bivariate distribution. We can estimate F(t1, t2) with Fn(t1, t2) = n⁻¹ ∑ᵢ₌₁ⁿ 1{Y1i ≤ t1, Y2i ≤ t2}. This is the same as estimating {Pf, f ∈ F}, where F = {f(x) = 1{y1 ≤ t1}1{y2 ≤ t2} : t1, t2 ∈ R}. This is a bounded Donsker class, since F = {f1f2 : f1 ∈ F1, f2 ∈ F2}, where Fj = {1{yj ≤ t} : t ∈ R} is a bounded Donsker class for j = 1, 2. We thus obtain consistency in probability of the bootstrap. We also obtain outer almost sure consistency of the bootstrap by Theorem 2.7, since F is bounded by 1.
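For the univariate version of this construction, the whole recipe fits in a few lines. This sketch (ours) bootstraps supf∈F |Ĝnf| for the indicator class and forms the resulting band; the estimated critical value should be near the Kolmogorov-Smirnov limit quantile, about 1.36 at α = 0.05:

```python
import numpy as np

rng = np.random.default_rng(4)
n, N, alpha = 500, 1000, 0.05
x = rng.normal(size=n)
grid = np.sort(x)                            # evaluate the band at the data points
Fn = np.arange(1, n + 1) / n                 # empirical cdf at the order statistics

ind = (x[None, :] <= grid[:, None])          # indicator matrix 1{X_i <= t}
sups = np.empty(N)
for b in range(N):
    W = rng.multinomial(n, np.ones(n) / n)   # nonparametric bootstrap weights
    Fb = (W[None, :] * ind).mean(axis=1)     # bootstrapped empirical cdf
    sups[b] = np.sqrt(n) * np.max(np.abs(Fb - Fn))

c = np.quantile(sups, 1 - alpha)             # estimate of the 1 - alpha quantile
band = (Fn - c / np.sqrt(n), Fn + c / np.sqrt(n))
print(c)  # compare with the K-S limit quantile, about 1.358
```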

2.2.4 The Functional Delta Method

Suppose Xn is a sequence of random variables with √n(Xn − θ) ⇝ X for some θ ∈ Rp, and the function φ : Rp → Rq has a derivative φ′(θ) at θ. The standard delta method now tells us that √n(φ(Xn) − φ(θ)) ⇝ φ′(θ)X. However, many important statistics based on i.i.d. data involve maps from empirical processes to spaces of functions, and hence cannot be handled by the standard delta method. A simple example is the map φξ which takes a cumulative distribution function H and computes {ξp, p ∈ [a, b]}, where ξp = H⁻¹(p) = inf{t : H(t) ≥ p} and [a, b] ⊂ (0, 1). The sample p-th quantile is then ξ̂n(p) = φξ(Fn)(p). Although the standard delta method cannot be used here, the functional delta method can be.


Before giving the main functional delta method results, we need to define derivatives for functions between normed spaces D and E. A normed space is a metric space (D, d) with d(x, y) = ‖x − y‖ for all x, y ∈ D, where ‖ · ‖ is a norm. A norm satisfies ‖x + y‖ ≤ ‖x‖ + ‖y‖, ‖αx‖ = |α| × ‖x‖, ‖x‖ ≥ 0, and ‖x‖ = 0 if and only if x = 0, for all x, y ∈ D and all complex numbers α. A map φ : Dφ ⊂ D → E is Gateaux-differentiable at θ ∈ D if, for every fixed h ∈ D with θ + th ∈ Dφ for all t > 0 small enough, there exists an element φ′θ(h) ∈ E such that

  (φ(θ + th) − φ(θ))/t → φ′θ(h)

as t ↓ 0. For the functional delta method, however, we need φ to have the stronger property of being Hadamard-differentiable. A map φ : Dφ → E is Hadamard-differentiable at θ ∈ D, tangentially to a set D0 ⊂ D, if there exists a continuous linear map φ′θ : D → E such that

  (φ(θ + tnhn) − φ(θ))/tn → φ′θ(h),

as n → ∞, for all converging sequences tn → 0 and hn → h ∈ D0, with hn ∈ D and θ + tnhn ∈ Dφ for all n ≥ 1 sufficiently large.

For example, let D = D[0, 1], where DA, for any interval A ⊂ R, is the space of cadlag (right-continuous with left-hand limits) real functions on A with the uniform norm. Let Dφ = {f ∈ D[0, 1] : |f| > 0}. Consider the function φ : Dφ → E = D[0, 1] defined by φ(g) = 1/g. Notice that for any θ ∈ Dφ we have, for any converging sequences tn ↓ 0 and hn → h ∈ D, with hn ∈ D and θ + tnhn ∈ Dφ for all n ≥ 1,

  (φ(θ + tnhn) − φ(θ))/tn = (1/(θ + tnhn) − 1/θ)/tn = −hn/(θ(θ + tnhn)) → −h/θ²,

where we have suppressed the function argument for clarity. Thus φ is Hadamard-differentiable, tangentially to D, with φ′θ(h) = −h/θ².

Sometimes Hadamard differentiability is also called compact differentiability. Another important property of this kind of derivative is that it satisfies a chain rule: compositions of Hadamard-differentiable functions are also Hadamard-differentiable. Details on this and several other aspects of functional differentiation will be postponed until Part II. We have the following important result (the proof of which will be given in Part II, Page 235):

Theorem 2.8 For normed spaces D and E, let φ : Dφ ⊂ D → E be Hadamard-differentiable at θ tangentially to D0 ⊂ D. Assume that rn(Xn − θ) ⇝ X for some sequence of constants rn → ∞, where Xn takes its values in Dφ, and X is a tight process taking its values in D0. Then rn(φ(Xn) − φ(θ)) ⇝ φ′θ(X).
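The φ(g) = 1/g example above can be illustrated numerically: the difference quotients along any hn → h converge in the uniform norm to the derivative −h/θ². A minimal check (our construction, on a grid for [0, 1]):

```python
import numpy as np

s = np.linspace(0.0, 1.0, 101)
theta = 2.0 + s                  # a function bounded away from zero
h = np.cos(3.0 * s)              # the limit direction h

errs = []
for t_n, wiggle in [(1e-1, 1e-1), (1e-2, 1e-2), (1e-3, 1e-3)]:
    h_n = h + wiggle * np.sin(7.0 * s)                # h_n -> h uniformly
    quotient = (1.0 / (theta + t_n * h_n) - 1.0 / theta) / t_n
    errs.append(float(np.max(np.abs(quotient - (-h / theta ** 2)))))
print(errs)  # uniform errors shrink roughly linearly in t_n
```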


Consider again the quantile map φξ, and let the distribution function F be absolutely continuous over N = [u, v] = [F⁻¹(a) − ǫ, F⁻¹(b) + ǫ], for some ǫ > 0, with continuous density f such that 0 < inft∈N f(t) ≤ supt∈N f(t) < ∞. Also let D1 ⊂ D[u, v] be the space of all distribution functions restricted to [u, v]. We will now argue that φξ is Hadamard-differentiable at F tangentially to C[u, v], where, for any interval A ⊂ R, CA is the space of continuous real functions on A. Let tn → 0 and hn ∈ D[u, v] converge uniformly to h ∈ C[u, v] such that F + tnhn ∈ D1 for all n ≥ 1, and denote ξp = F⁻¹(p), ξpn = (F + tnhn)⁻¹(p), ξᴺpn = (ξpn ∨ u) ∧ v, and ǫpn = tn² ∧ (ξᴺpn − u). The reason for the modification ξᴺpn is to ensure that the quantile estimate is contained in [u, v] and hence also that ǫpn ≥ 0. Thus there exists an n0 < ∞ such that, for all n ≥ n0, (F + tnhn)(u) < a, (F + tnhn)(v) > b, ǫpn > 0 and ξᴺpn = ξpn for all p ∈ [a, b], and therefore

  (F + tnhn)(ξᴺpn − ǫpn) ≤ F(ξp) ≤ (F + tnhn)(ξᴺpn)   (2.9)

for all p ∈ [a, b], since (F + tnhn)⁻¹(p) is the smallest x satisfying (F + tnhn)(x) ≥ p and F(ξp) = p.

Since F(ξᴺpn − ǫpn) = F(ξᴺpn) + O(ǫpn), hn(ξᴺpn) − h(ξᴺpn) = o(1), and hn(ξᴺpn − ǫpn) − h(ξᴺpn − ǫpn) = o(1), where O and o are uniform over p ∈ [a, b] (here and for the remainder of our argument), we have that (2.9) implies

  F(ξᴺpn) + tnh(ξᴺpn − ǫpn) + o(tn) ≤ F(ξp) ≤ F(ξᴺpn) + tnh(ξᴺpn) + o(tn).   (2.10)

But this implies that F(ξᴺpn) + O(tn) ≤ F(ξp) ≤ F(ξᴺpn) + O(tn), which implies that |ξpn − ξp| = O(tn). This, together with (2.10) and the fact that h is continuous, implies that F(ξpn) − F(ξp) = −tnh(ξp) + o(tn). This now yields

  (ξpn − ξp)/tn = −h(ξp)/f(ξp) + o(1),

and the desired Hadamard-differentiability of φξ follows, with derivative φ′F(h) = −h(F⁻¹(p))/f(F⁻¹(p)), p ∈ [a, b].

The functional delta method also applies to the bootstrap. Consider the sequence of random elements Xn(Xn) in a normed space D, and assume that rn(Xn − µ) ⇝ X, where X is tight in D, for some sequence of constants 0 < rn ↑ ∞. Here, Xn is a generic empirical process based on the data sequence Xn, n ≥ 1, and is not restricted to i.i.d. data. Now assume we have a bootstrap of Xn, X̂n(Xn, Wn), where W = {Wn} is a sequence of random bootstrap weights which are independent of Xn. Also assume X̂n ⇝^P_W X. We have the following bootstrap result:

Theorem 2.9 For normed spaces D and E, let φ : Dφ ⊂ D → E be Hadamard-differentiable at µ tangentially to D0 ⊂ D, with derivative φ′µ. Let Xn and X̂n have values in Dφ, with rn(Xn − µ) ⇝ X, where X is tight and takes its values in D0, the maps Wn → X̂n are appropriately measurable, and where rnc(X̂n − Xn) ⇝^P_W X for some 0 < c < ∞. Then rnc(φ(X̂n) − φ(Xn)) ⇝^P_W φ′µ(X).

We will postpone until Part II a more precise discussion of what “appropriately measurable” means in this context (see Page 236).

When Xn in the previous theorem is the empirical process Pn indexed by a Donsker class F and rn = √n, the results of Theorem 2.6 apply with µ = P for either the nonparametric or multiplier bootstrap weights. Moreover, the above measurability condition also holds (this will be verified in Chapter 12). Thus the bootstrap is automatically valid for Hadamard-differentiable functions applied to empirical processes indexed by Donsker classes. As a simple example, bootstraps of the quantile process {ξ̂n(p), p ∈ [a, b] ⊂ (0, 1)} are valid, provided the conditions given in the example following Theorem 2.8 for the density f over the interval N are satisfied. This can be used, for example, to create asymptotically valid confidence bands for {F⁻¹(p), p ∈ [a, b]}. There are also results for outer almost sure convergence of the conditional laws of the bootstrapped process rn(φ(X̂n) − φ(Xn)), but these require stronger conditions on the differentiability of φ, and we will not pursue this further in this book.
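As a concrete sketch of such a band (ours, assuming Exponential(1) data, so that F⁻¹(p) = −log(1 − p) and the density condition holds on any [a, b] ⊂ (0, 1)):

```python
import numpy as np

rng = np.random.default_rng(5)
n, N, alpha = 1000, 1000, 0.05
x = rng.exponential(size=n)
ps = np.linspace(0.2, 0.8, 61)              # [a, b] strictly inside (0, 1)
q_hat = np.quantile(x, ps)                  # the sample quantile process

# Bootstrap sup_p sqrt(n)|q*_n(p) - q_hat(p)| to calibrate a uniform band.
sups = np.empty(N)
for b in range(N):
    xb = x[rng.integers(0, n, size=n)]      # nonparametric bootstrap resample
    sups[b] = np.sqrt(n) * np.max(np.abs(np.quantile(xb, ps) - q_hat))

c = np.quantile(sups, 1 - alpha)
lower, upper = q_hat - c / np.sqrt(n), q_hat + c / np.sqrt(n)
true_q = -np.log1p(-ps)
covered = float(np.mean((lower <= true_q) & (true_q <= upper)))
print(covered)
```

The printed value is the fraction of the true quantile curve lying inside the band; over repeated samples the entire curve is covered with probability approaching 1 − α.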

2.2.5 Z-Estimators

A Z-estimator θ̂n is the approximate zero of a data-dependent function. To be more precise, let the parameter space be Θ, and let Ψn : Θ → L be a data-dependent function between two normed spaces, with norms ‖ · ‖ and ‖ · ‖L, respectively. If ‖Ψn(θ̂n)‖L →^P 0, then θ̂n is a Z-estimator. The main statistical issues for such estimators are consistency, asymptotic normality and validity of the bootstrap. Usually, Ψn is an estimator of a fixed function Ψ : Θ → L with Ψ(θ0) = 0 for some parameter of interest θ0 ∈ Θ. We save the proof of the following theorem as an exercise:

Theorem 2.10 Let Ψ(θ0) = 0 for some θ0 ∈ Θ, and assume that ‖Ψ(θn)‖L → 0 implies ‖θn − θ0‖ → 0 for any sequence {θn} ∈ Θ (this is an “identifiability” condition). Then:

(i) If ‖Ψn(θ̂n)‖L →^P 0 for some sequence of estimators θ̂n ∈ Θ and supθ∈Θ ‖Ψn(θ) − Ψ(θ)‖L →^P 0, then ‖θ̂n − θ0‖ →^P 0.

(ii) If ‖Ψn(θ̂n)‖L →^as∗ 0 for some sequence of estimators θ̂n ∈ Θ and supθ∈Θ ‖Ψn(θ) − Ψ(θ)‖L →^as∗ 0, then ‖θ̂n − θ0‖ →^as∗ 0.


Consider, for example, estimating the survival function for right-censored failure time data. In this setting, we observe X = (U, δ), where U = T ∧ C, δ = 1{T ≤ C}, T is a failure time of interest with distribution function F0 and survival function S0 = 1 − F0 with S0(0) = 1, and C is a censoring time with distribution and survival functions G and L = 1 − G, respectively, with L(0) = 1. For a sample of n observations Xi, i = 1, . . . , n, let Tj, j = 1, . . . , mn, be the unique observed failure times. The Kaplan-Meier estimator Ŝn of S0 is then given by

  Ŝn(t) = ∏_{j : Tj ≤ t} ( 1 − (∑ᵢ₌₁ⁿ δi 1{Ui = Tj}) / (∑ᵢ₌₁ⁿ 1{Ui ≥ Tj}) ).
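The product above translates directly into code. A minimal implementation (ours; it assumes continuous failure and censoring distributions, so ties have probability zero) recovers S0 on simulated data:

```python
import numpy as np

rng = np.random.default_rng(6)
n = 2000
T = rng.exponential(size=n)                     # failures, S0(t) = exp(-t)
C = rng.exponential(scale=2.0, size=n)          # independent censoring
U, delta = np.minimum(T, C), T <= C             # observed data (U, delta)

def kaplan_meier(t, U, delta):
    """Product-limit estimate of S0(t) from right-censored data (U, delta)."""
    order = np.argsort(U)
    U, delta = U[order], delta[order]
    at_risk = np.arange(len(U), 0, -1)          # #{i : U_i >= U_(j)}, no ties
    factors = 1.0 - delta / at_risk             # one factor per observed time
    return float(np.prod(factors[(U <= t) & delta]))

for t in (0.5, 1.0, 1.5):
    print(t, kaplan_meier(t, U, delta), np.exp(-t))  # estimate vs. truth
```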

Consistency and other properties of this estimator can be demonstrated via standard continuous-time martingale arguments (Fleming and Harrington, 1991; Andersen, Borgan, Gill and Keiding, 1993); however, it is instructive to use empirical process arguments for Z-estimators.

Let τ < ∞ satisfy L(τ−)S0(τ−) > 0, and let Θ be the space of all survival functions S with S(0) = 1, restricted to [0, τ]. We will use the uniform norm ‖ · ‖∞ on Θ. After some algebra, the Kaplan-Meier estimator can be shown to be the solution of Ψn(Ŝn) = 0, where Ψn has the form Ψn(S)(t) = PnψS,t, with

  ψS,t(X) = 1{U > t} + (1 − δ)1{U ≤ t}1{S(U) > 0} S(t)/S(U) − S(t).

This is Efron's (1967) “self-consistency” expression for the Kaplan-Meier estimator. For the fixed function Ψ, we use Ψ(S)(t) = PψS,t. Somewhat surprisingly, the class of functions F = {ψS,t : S ∈ Θ, t ∈ [0, τ]} is P-Donsker. To see this, first note that the class M of monotone functions f : [0, τ] → [0, 1] of the real random variable U has bounded entropy (with bracketing) integral, a fact we establish later in Part II. Now the class of functions M1 = {ψ̃S,t : S ∈ Θ, t ∈ [0, τ]}, where

  ψ̃S,t(U) = 1{U > t} + 1{U ≤ t}1{S(U) > 0} S(t)/S(U),

is a subset of M, since ψ̃S,t(U) is monotone in U on [0, τ] and takes values only in [0, 1] for all S ∈ Θ and t ∈ [0, τ]. Note that {1{U ≤ t} : t ∈ [0, τ]} is also Donsker (as argued previously), and so is δ (trivially) and {S(t) : S ∈ Θ, t ∈ [0, τ]}, since any class of fixed functions is always Donsker. Since all of these Donsker classes are bounded, we now have that F is Donsker, since sums and products of bounded Donsker classes are also Donsker. Since Donsker classes are also Glivenko-Cantelli, we have that supS∈Θ ‖Ψn(S) − Ψ(S)‖∞ →^as∗ 0. If we can establish the identifiability condition for Ψ, the outer almost sure version of Theorem 2.10 gives us that ‖Ŝn − S0‖∞ →^as∗ 0.


After taking expectations, the function Ψ can be shown to have the form

  Ψ(S)(t) = PψS,t = S0(t)L(t) + S(t) ∫₀ᵗ (S0(u)/S(u)) dG(u) − S(t).   (2.11)

Thus, if we make the substitution ǫn(t) = S0(t)/Ŝn(t) − 1, then Ψ(Ŝn)(t) → 0 uniformly over t ∈ [0, τ] implies that un(t) ≡ ǫn(t)L(t) + ∫₀ᵗ ǫn(u) dG(u) → 0 uniformly over the same interval. By solving this integral equation, we obtain

  ǫn(t) = un(t)/L(t−) − ∫₀^{t−} [L(s)L(s−)]⁻¹ un(s) dG(s),

which implies ǫn(t) → 0 uniformly, since L(t−) ≥ L(τ−) > 0. Thus ‖Ŝn − S0‖∞ → 0, implying the desired identifiability.

We now consider weak convergence of Z-estimators. Let Ψn, Ψ, Θ and L be as at the beginning of this section. We have the following master theorem for Z-estimators, the proof of which will be given in Part II (Page 254):

Theorem 2.11 Assume that Ψ(θ0) = 0 for some θ0 in the interior of Θ, √nΨn(θ̂n) →^P 0, and ‖θ̂n − θ0‖ →^P 0 for the random sequence θ̂n ∈ Θ. Assume also that √n(Ψn − Ψ)(θ0) ⇝ Z, for some tight random Z, and that

  ‖√n(Ψn(θ̂n) − Ψ(θ̂n)) − √n(Ψn(θ0) − Ψ(θ0))‖L / (1 + √n‖θ̂n − θ0‖) →^P 0.   (2.12)

If θ → Ψ(θ) is Frechet-differentiable at θ0 (defined below) with continuously invertible (also defined below) derivative Ψ̇θ0, then

  ‖√nΨ̇θ0(θ̂n − θ0) + √n(Ψn − Ψ)(θ0)‖L →^P 0,   (2.13)

and thus √n(θ̂n − θ0) ⇝ −Ψ̇θ0⁻¹(Z).

Frechet-differentiability of a map φ : Θ ⊂ D → L at θ ∈ Θ is stronger than Hadamard-differentiability, in that it means there exists a continuous, linear map φ′θ : D → L with

  ‖φ(θ + hn) − φ(θ) − φ′θ(hn)‖L / ‖hn‖ → 0   (2.14)

for all sequences {hn} ⊂ D with ‖hn‖ → 0 and θ + hn ∈ Θ for all n ≥ 1. Continuous invertibility of an operator A : Θ → L essentially means that A is invertible, with the property that, for some constant c > 0 and all θ1, θ2 ∈ Θ,

  ‖A(θ1) − A(θ2)‖L ≥ c‖θ1 − θ2‖.   (2.15)

An operator is a map between spaces of functions, such as the maps Ψ and Ψn. We will postpone further discussion of operators and continuous invertibility until Part II.

Returning to our Kaplan-Meier example, with Ψn(S)(t) = PnψS,t and Ψ(S)(t) = PψS,t as before, note that since F = {ψS,t : S ∈ Θ, t ∈ [0, τ]} is Donsker, we easily have that √n(Ψn − Ψ)(θ0) ⇝ Z, for θ0 = S0 and some tight random Z. We also have that, for any Sn ∈ Θ converging uniformly to S0,

  supt∈[0,τ] P(ψSn,t − ψS0,t)² ≤ 2 supt∈[0,τ] ∫₀ᵗ [Sn(t)/Sn(u) − S0(t)/S0(u)]² S0(u) dG(u) + 2 supt∈[0,τ] (Sn(t) − S0(t))² → 0.

This can be shown to imply (2.12). After some analysis, Ψ can be shown to be Frechet-differentiable at S0, with derivative

  Ψ̇θ0(h)(t) = −∫₀ᵗ (S0(t)h(u)/S0(u)) dG(u) − L(t)h(t),   (2.16)

for all h ∈ D[0, τ], having continuous inverse

  Ψ̇θ0⁻¹(a)(t) = −S0(t) { a(0) + ∫₀ᵗ [L(u−)S0(u−)]⁻¹ [da(u) + a(u) dF0(u)/S0(u)] },   (2.17)

for all a ∈ D[0, τ]. Thus all of the conditions of Theorem 2.11 are satisfied, and we obtain the desired weak convergence of √n(Ŝn − S0) to a tight, mean zero Gaussian process. The covariance of this process is

  V(s, t) = S0(s)S0(t) ∫₀^{s∧t} dF0(u)/[L(u−)S0(u)S0(u−)],

which can be derived after lengthy but straightforward calculations (which we omit).
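This covariance can be checked by simulation. The sketch below (ours) takes T and C both Exponential(1), for which the displayed formula reduces to V(t, t) = (1 − e^{−2t})/2, and compares n·Var(Ŝn(t)) with that value:

```python
import numpy as np

rng = np.random.default_rng(7)
n, reps, t = 400, 1500, 1.0

def km_at(t, U, delta):
    # Product-limit estimator at a single time point (continuous data, no ties).
    order = np.argsort(U)
    U, delta = U[order], delta[order]
    factors = 1.0 - delta / np.arange(len(U), 0, -1)
    return float(np.prod(factors[(U <= t) & delta]))

est = np.empty(reps)
for r in range(reps):
    T = rng.exponential(size=n)                 # S0(u) = exp(-u)
    C = rng.exponential(size=n)                 # L(u) = exp(-u)
    est[r] = km_at(t, np.minimum(T, C), T <= C)

V_tt = (1.0 - np.exp(-2.0 * t)) / 2.0           # V(t, t) for this model
print(n * est.var(), V_tt)                      # both near V(1,1), about 0.43
```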

Returning to general Z-estimators, there are a number of methods for showing that the conditional law of a bootstrapped Z-estimator, given the observed data, converges to the limiting law of the original Z-estimator. One important approach that is applicable to non-i.i.d. data involves establishing Hadamard-differentiability of the map φ which extracts a zero from the function Ψ. We will explore this approach in Part II. We close this section with a simple bootstrap result for the setting where Ψn(θ)(h) = Pnψθ,h and Ψ(θ)(h) = Pψθ,h, for random and fixed real maps indexed by θ ∈ Θ and h ∈ H. Assume that Ψ(θ0)(h) = 0 for some θ0 ∈ Θ and all h ∈ H, that suph∈H |Ψ(θn)(h)| → 0 implies ‖θn − θ0‖ → 0 for any sequence {θn} ∈ Θ, and that Ψ is Frechet-differentiable with continuously invertible derivative Ψ̇θ0. Also assume that F = {ψθ,h : θ ∈ Θ, h ∈ H} is P-Glivenko-Cantelli with supθ∈Θ,h∈H P|ψθ,h| < ∞. Furthermore, assume that G = {ψθ,h : θ ∈ Θ, ‖θ − θ0‖ ≤ δ, h ∈ H}, where δ > 0, is P-Donsker and that suph∈H P(ψθn,h − ψθ0,h)² → 0 for any sequence {θn} ∈ Θ with ‖θn − θ0‖ → 0. Then, using arguments similar to those used in the Kaplan-Meier example and with the help of Theorems 2.10 and 2.11, we have that if θ̂n satisfies suph∈H |√nΨn(θ̂n)(h)| →^P 0, then ‖θ̂n − θ0‖ →^P 0 and √n(θ̂n − θ0) ⇝ −Ψ̇θ0⁻¹(Z), where Z is the tight limiting distribution of √n(Ψn(θ0) − Ψ(θ0)).

Let Ψ◦n(θ)(h) = P◦nψθ,h, where P◦n is either the nonparametric bootstrap P̂n or the multiplier bootstrap P̃n defined in Section 2.2.3, and define the bootstrap estimator θ̂◦n ∈ Θ to be a minimizer of suph∈H |Ψ◦n(θ)(h)| over θ ∈ Θ. We will prove in Part II that these conditions are more than enough to ensure that √n(θ̂◦n − θ̂n) ⇝^P_W −Ψ̇θ0⁻¹(Z), where W refers to either the nonparametric or multiplier bootstrap weights. Thus the bootstrap is valid. These conditions for the bootstrap are satisfied in the Kaplan-Meier example, for either the nonparametric or multiplier weights, thus enabling the construction of confidence bands for {S0(t), t ∈ [0, τ]}.

2.2.6 M-Estimators

An M-estimator θ̂n is the approximate maximum of a data-dependent function. To be more precise, let the parameter set be a metric space (Θ, d) and let Mn : Θ → R be a data-dependent real function. If Mn(θ̂n) = supθ∈Θ Mn(θ) − oP(1), then θ̂n is an M-estimator. Maximum likelihood and least-squares (after changing the sign of the objective function) estimators are some of the most important examples, but there are many other examples as well. As with Z-estimators, the main statistical issues for M-estimators are consistency, weak convergence and validity of the bootstrap. Unlike Z-estimators, the rate of convergence for M-estimators is not necessarily √n, even for i.i.d. data, and finding the right rate can be quite challenging.

For establishing consistency, Mn is often an estimator of a fixed function M : Θ → R. We now present the following consistency theorem (the proof of which is deferred to Part II, Page 267):

Theorem 2.12 Assume, for some θ0 ∈ Θ, that lim infn→∞ M(θn) ≥ M(θ0) implies d(θn, θ0) → 0 for any sequence {θn} ∈ Θ (this is another identifiability condition). Then, for a sequence of estimators θ̂n ∈ Θ:

(i) If Mn(θ̂n) = supθ∈Θ Mn(θ) − oP(1) and supθ∈Θ |Mn(θ) − M(θ)| →^P 0, then d(θ̂n, θ0) →^P 0.

(ii) If Mn(θ̂n) = supθ∈Θ Mn(θ) − oas∗(1) and supθ∈Θ |Mn(θ) − M(θ)| →^as∗ 0, then d(θ̂n, θ0) →^as∗ 0.

Suppose, for now, we know that the rate of convergence for the M-estimator θ̂n is rn, or, in other words, we know that Zn = rn(θ̂n − θ0) = OP(1). Zn can now be re-expressed as the approximate maximum of the criterion function h → Hn(h) ≡ Mn(θ0 + h/rn), for h ranging over some metric space H. If the argmax of Hn over bounded subsets of H can now be shown to converge weakly to the argmax of a tight limiting process H over the same bounded subsets, then Zn converges weakly to argmaxh∈H H(h).

We will postpone the technical challenges associated with determining these rates of convergence until Part II, and restrict ourselves to an interesting special case involving Euclidean parameters, where the rate is known to be √n. The proof of the following theorem is also deferred to Part II (Page 270):

Theorem 2.13 Let X1, . . . , Xn be i.i.d. with sample space X and law P, and let mθ : X → R be measurable functions indexed by θ ranging over an open subset of Euclidean space Θ ⊂ Rp. Let θ0 be a bounded point of maximum of Pmθ in the interior of Θ, and assume, for some neighborhood Θ0 ⊂ Θ including θ0, that there exist measurable functions ṁ : X → R and ṁθ0 : X → Rp satisfying

  |mθ1(x) − mθ2(x)| ≤ ṁ(x)‖θ1 − θ2‖,   (2.18)

  P[mθ − mθ0 − (θ − θ0)′ṁθ0]² = o(‖θ − θ0‖²),   (2.19)

Pṁ² < ∞, and P‖ṁθ0‖² < ∞, for all θ1, θ2, θ ∈ Θ0. Assume also that M(θ) = Pmθ admits a second-order Taylor expansion with nonsingular second derivative matrix V. Denote Mn(θ) = Pnmθ, and assume the approximate maximizer θ̂n satisfies Mn(θ̂n) = supθ∈Θ Mn(θ) − oP(n⁻¹) and ‖θ̂n − θ0‖ →^P 0. Then √n(θ̂n − θ0) ⇝ −V⁻¹Z, where Z is the limiting Gaussian distribution of Gnṁθ0.

Consider, for example, least absolute deviation regression. In this setting, we have i.i.d. random vectors U1, . . . , Un in Rp and random errors e1, . . . , en, but we observe only the data Xi = (Yi, Ui), where Yi = θ′0Ui + ei, i = 1, . . . , n. The least-absolute-deviation estimator θ̂n minimizes the function θ → Pnmθ, where mθ(X) = |Y − θ′U|. Since a minimizer of a criterion function Mn is also a maximizer of −Mn, M-estimation methods can be used in this context with only a change in sign. Although boundedness of the parameter space Θ is not necessary for this regression setting, we restrict—for ease of discourse—Θ to be a bounded, open subset of Rp containing θ0. We also assume that the distribution of the errors ei has median zero and positive density at zero, which we denote f(0), and that P[UU′] is finite and positive definite.

Note that since we are not assuming E|ei| < ∞, it is possible that Pmθ = ∞ for all θ ∈ Θ. Since minimizing Pnmθ is the same as minimizing Pnm̃θ, where m̃θ = mθ − mθ0, we will use Mn(θ) = Pnm̃θ as our criterion function hereafter (without modifying the estimator θ̂n). By the definition of Y, m̃θ(X) = |e − (θ − θ0)′U| − |e|, and we now have that Pm̃θ ≤ ‖θ − θ0‖(E‖U‖²)^{1/2} < ∞ for all θ ∈ Θ. Since

|m̃θ1(x) − m̃θ2(x)| ≤ ‖θ1 − θ2‖ × ‖u‖, (2.20)

it is not hard to show that the class of functions {m̃θ : θ ∈ Θ} is P-Glivenko-Cantelli. It can also be shown that Pm̃θ ≥ 0 with equality only when θ = θ0. Hence Theorem 2.12, Part (ii), yields that θ̂n →as∗ θ0.

Now we consider M(θ) = Pm̃θ. By conditioning on U, one can show after some analysis that M(θ) is two times continuously differentiable, with second derivative V = 2f(0)P[UU′] at θ0. Note that (2.20) satisfies Condition (2.18); and with ṁθ0(X) = −U sign(e), we also have that the bound

|m̃θ(X) − m̃θ0(X) − (θ − θ0)′ṁθ0(X)|² ≤ 1{|e| ≤ |(θ − θ0)′U|}[2(θ − θ0)′U]²

satisfies Condition (2.19). Thus all the conditions of Theorem 2.13 are satisfied. Hence √n(θ̂n − θ0) is asymptotically mean zero normal, with variance V⁻¹P[ṁθ0ṁ′θ0]V⁻¹ = (P[UU′])⁻¹/(4f²(0)). This variance is not difficult to estimate from the data, but we postpone presenting the details.
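A quick Monte Carlo sketch of this variance formula (ours, with a hypothetical scalar setting): taking U ~ N(0,1) and e ~ N(0,1) gives P[UU′] = 1 and f(0) = 1/√(2π), so the asymptotic variance (P[UU′])⁻¹/(4f²(0)) equals π/2 ≈ 1.571:

```python
import numpy as np

# Illustrative Monte Carlo check of the LAD limit (scalar case, our choice):
# U ~ N(0,1), e ~ N(0,1), so P[UU'] = 1, f(0) = 1/sqrt(2*pi), and the
# asymptotic variance of sqrt(n)(theta_hat - theta0) is
# (P[UU'])^{-1} / (4 f(0)^2) = pi/2 ~ 1.571.
rng = np.random.default_rng(1)
n, reps, theta0 = 200, 300, 0.0
grid = np.linspace(-0.5, 0.5, 401)

est = np.empty(reps)
for r in range(reps):
    u = rng.standard_normal(n)
    y = theta0 * u + rng.standard_normal(n)
    # LAD criterion Pn |Y - theta U|, minimized over the grid.
    crit = np.mean(np.abs(y[:, None] - grid[None, :] * u[:, None]), axis=0)
    est[r] = grid[np.argmin(crit)]

print(round(float(n * est.var()), 3), "vs theory", round(float(np.pi / 2), 3))
```

The scaled Monte Carlo variance of the grid-search LAD estimator should land near π/2, up to simulation noise.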

Another technique for obtaining weak convergence of M-estimators that are √n consistent is to first establish consistency, then take an appropriate derivative of the criterion function Mn(θ), say Ψn(θ)(h), for h ranging over some index set H, and apply Z-estimator techniques to Ψn. This works because the derivative of a smooth criterion function at an approximate maximizer is approximately zero. This approach facilitates establishing the validity of the bootstrap, since such validity is often easier to obtain for Z-estimators than for M-estimators. This approach is also applicable to certain nonparametric maximum likelihood estimators which we will consider in Part III.

2.3 Other Topics

In addition to the empirical process topics outlined in the previous sections, we will cover a few other related topics in Part II, including results for sums of independent but not identically distributed stochastic processes and, briefly, for dependent but stationary processes. However, there are a number of interesting empirical process topics we will not pursue in later chapters, including general results for convergence of nets. In the remainder of this section, we briefly outline a few additional topics not covered later which involve sequences of empirical processes based on i.i.d. data. For some of these topics, we will primarily restrict ourselves to the empirical process Gn = √n(Fn − F), although many of these results have extensions which apply to more general empirical processes.


The law of the iterated logarithm for Gn states that

lim supn→∞ ‖Gn‖∞/√(2 log log n) ≤ 1/2, a.s., (2.21)

with equality if 1/2 is in the range of F, where ‖·‖∞ is the uniform norm. This can be generalized to empirical processes on P-Donsker classes F which have a measurable envelope with bounded second moment (Dudley and Philipp, 1983):

lim supn→∞ [supf∈F |Gn(f)|]∗/√(2(log log n) supf∈F P(f − Pf)²) ≤ 1, a.s.

Result (2.21) can be further strengthened to Strassen's (1964) theorem, which states that, on a set with probability 1, the set of all limiting paths of √(1/(2 log log n)) Gn is exactly the set of all functions of the form h(F), where h(0) = h(1) = 0 and h is absolutely continuous with derivative h′ satisfying ∫₀¹ [h′(s)]² ds ≤ 1. While the previous results give upper bounds on ‖Gn‖∞, it is also known that

lim infn→∞ √(2 log log n) ‖Gn‖∞ = π/2, a.s.,

implying that the smallest uniform distance between Fn and F is at least O(1/√(n log log n)).
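For uniform data, ‖Gn‖∞ is √n times the Kolmogorov-Smirnov statistic, so the ratio in (2.21) can be tracked along a single growing sample path. The sketch below (ours, not from the text) is only suggestive, since the LIL is a statement about the limsup:

```python
import numpy as np

# Illustrative only: track the LIL ratio of (2.21) along one growing
# U(0,1) sample path, where ||Gn||_inf = sqrt(n) * sup_t |Fn(t) - t|.
rng = np.random.default_rng(2)
x = rng.uniform(size=100_000)

def ks_sup(u):
    # sup_t |Fn(t) - t| for a uniform sample, via the sorted-data formula.
    u = np.sort(u)
    n = len(u)
    i = np.arange(1, n + 1)
    return max(float(np.max(i / n - u)), float(np.max(u - (i - 1) / n)))

for n in (100, 1_000, 10_000, 100_000):
    ratio = np.sqrt(n) * ks_sup(x[:n]) / np.sqrt(2 * np.log(np.log(n)))
    print(n, round(float(ratio), 3))  # the limsup of this ratio is at most 1/2
```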

A topic of interest regarding Donsker theorems is the closeness of the empirical process sample paths to the limiting Brownian bridge sample paths. The strongest result on this question for the empirical process Gn is the KMT construction, named after Komlós, Major and Tusnády (1975, 1976). The KMT construction states that there exist fixed positive constants a, b, and c, and a sequence of standard Brownian bridges Bn, such that

P( ‖Gn − Bn(F)‖∞ > (a log n + x)/√n ) ≤ be⁻ᶜˣ,

for all x > 0 and n ≥ 1. This powerful result can be shown to imply both

lim supn→∞ (√n/log n)‖Gn − Bn(F)‖∞ < ∞, a.s., and

lim supn→∞ E[(√n/log n)‖Gn − Bn(F)‖∞]^m < ∞,

for all 0 < m < ∞. These results are called strong approximations and have applications in statistics, such as in the construction of confidence bands for kernel density estimators (see, for example, Bickel and Rosenblatt, 1973). Another interesting application—to "large p, small n" asymptotics for microarrays—will be developed in some detail in Section 15.5 of Part II, although we will not address the theoretical derivation of the KMT construction.

An important class of generalizations of the empirical process for i.i.d. data are the U-processes. The mth order empirical U-process measure Un,m is defined, for a measurable function f : X^m → R and a sample of observations X1, . . . , Xn on X, as

(n choose m)⁻¹ ∑_{(i1,...,im)∈In,m} f(Xi1, . . . , Xim),

where In,m is the set of all m-tuples of integers (i1, . . . , im) satisfying 1 ≤ i1 < · · · < im ≤ n. For m = 1, this measure reduces to the usual empirical measure, i.e., Un,1 = Pn. The empirical U-process, for a class of m-variate real functions F (of the form f : X^m → R as above), is

{√n(Un,m − P)f : f ∈ F}.

These processes are useful for solving a variety of complex statistical problems arising in a number of areas, including nonparametric monotone regression (see Ghosal, Sen and van der Vaart, 2000) and testing for normality (see Arcones and Wang, 2006), among others. Fundamental work on Glivenko-Cantelli and Donsker type results for U-processes can be found in Nolan and Pollard (1987, 1988) and Arcones and Giné (1993), and some useful technical tools can be found in Giné (1997). Recent developments include Monte Carlo methods of inference for U-processes (see, for example, Zhang, 2001) and a central limit theorem for two-sample U-processes (Neumeyer, 2004).
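A minimal illustration of Un,m (ours, not from the text): with the kernel f(x1, x2) = (x1 − x2)²/2, the order-2 U-statistic equals the usual unbiased sample variance, and Un,1 recovers Pn:

```python
import numpy as np
from itertools import combinations
from math import comb

# Illustrative only: U_{n,2} with kernel f(x1, x2) = (x1 - x2)^2 / 2 equals
# the unbiased sample variance, and U_{n,1} reduces to Pn.
rng = np.random.default_rng(3)
x = rng.standard_normal(30)

def U(x, m, f):
    # (n choose m)^{-1} * sum of f over all index m-tuples i1 < ... < im.
    n = len(x)
    return sum(f(*x[list(idx)]) for idx in combinations(range(n), m)) / comb(n, m)

u2 = U(x, 2, lambda a, b: (a - b) ** 2 / 2)
u1 = U(x, 1, lambda a: a)
print(round(float(u2), 6), round(float(x.var(ddof=1)), 6))  # equal
print(round(float(u1), 6), round(float(x.mean()), 6))       # equal
```

The first identity follows from n(n − 1)S² = n∑xi² − (∑xi)², so the degree-2 U-statistic is exactly unbiased for the variance.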

2.4 Exercises

2.4.1. Let X, Y be a pair of real random variables with joint distribution P. Compute upper bounds for N[](ǫ, F, Lr(P)), for r = 1, 2, where F = {1{X ≤ s, Y ≤ t} : s, t ∈ R}.

2.4.2. Prove Theorem 2.10.

2.4.3. Consider the Z-estimation framework for the Kaplan-Meier estimator discussed in Section 2.2.5. Let Ψ(S)(t) be as defined in (2.11). Show that Ψ is Fréchet-differentiable at S0, with derivative Ψ̇θ0(h)(t) given by (2.16), for all h ∈ D[0, τ].

2.4.4. Continuing with the set-up of the previous problem, show that Ψ̇θ0 is continuously invertible, with inverse Ψ̇⁻¹θ0 given in (2.17). The following approach may be easiest: First show that, for any a ∈ D[0, τ], h(t) = Ψ̇⁻¹θ0(a)(t) satisfies Ψ̇θ0(h)(t) = a(t). The following identity may be helpful:

d[a(t)/S0(t)] = da(t)/S0(t−) + a(t)dF0(t)/(S0(t−)S0(t)).

Now show that there exists an M < ∞ such that ‖Ψ̇⁻¹θ0(a)‖ ≤ M‖a‖, where ‖·‖ is the uniform norm. This then implies that there exists a c > 0 such that ‖Ψ̇θ0(h)‖ ≥ c‖h‖.

2.5 Notes

Theorem 2.1 is a composite of Theorems 1.5.4 and 1.5.7 of van der Vaart and Wellner (1996) (hereafter abbreviated VW). Theorems 2.2, 2.3, 2.4 and 2.5 correspond to Theorems 19.4, 19.5, 19.13 and 19.14, respectively, of van der Vaart (1998). The if-and-only-if implications of (2.8) are described in VW, Page 73. The implications (i)⇔(ii) in Theorems 2.6 and 2.7 are given in Theorems 3.6.1 and 3.6.2, respectively, of VW. Theorems 2.8 and 2.11 correspond to Theorems 3.9.4 and 3.3.1 of VW, while Theorem 2.13 comes from Example 3.2.22 of VW.


3 Overview of Semiparametric Inference

This chapter presents an overview of the main ideas and techniques of semiparametric inference, with particular emphasis on semiparametric efficiency. The major distinction between this kind of efficiency and the standard notion of efficiency for parametric maximum likelihood estimators—as expressed in the Cramér-Rao lower bound—is the presence of an infinite-dimensional nuisance parameter in semiparametric models. Proofs and other technical details will generally be postponed until Part III.

In the first section, we define and sketch the main features of semiparametric models and semiparametric efficiency. The second section discusses efficient score functions and estimating equations and their connection to efficient estimation. The third section discusses nonparametric maximum likelihood estimation, the main tool for constructing efficient estimators. The fourth and final section briefly discusses several additional related topics, including variance estimation and confidence band construction for efficient estimators.

3.1 Semiparametric Models and Efficiency

A statistical model is a collection of probability measures P ∈ P on a sample space X. Such models can be expressed in the form P = {Pθ : θ ∈ Θ}, where Θ is some parameter space. Semiparametric models are statistical models where Θ has one or more infinite-dimensional components. For example, the parameter space for the linear regression model (1.1), where Y = β′Z + e, consists of two components, a subset of p-dimensional Euclidean space (for the regression parameter β) and the infinite-dimensional space of all joint distribution functions of (e, Z) with E[e|Z] = 0 and E[e²|Z] ≤ K almost surely, for some K < ∞.

The goal of semiparametric inference is to construct optimal estimators and test statistics for evaluating semiparametric model parameters. The parameter component of interest can be succinctly expressed as a function of the form ψ : P → D, where ψ extracts the component of interest and takes values in D. For now, we assume D is finite dimensional. As an illustration, if we are interested in the unconditional variance of the residual errors in the regression model (1.1), ψ would be the map ψ(P) = ∫R e²dF(e), where F(t) = P[e ≤ t] is the unconditional residual distribution component of P. Throughout this book, the statistics of interest will be based on an i.i.d. sample, X1, . . . , Xn, of realizations from some P ∈ P.

An estimator Tn of the parameter ψ(P), based on such a sample, is efficient if the limiting variance V of √n(Tn − ψ(P)) is the smallest possible among all regular estimators of ψ(P). The inverse of V is the information for Tn. Regularity will be defined more explicitly later in this section, but suffice it to say for now that the limiting distribution of √n(Tn − ψ(P)) (as n → ∞), for a regular estimator Tn, is continuous in P. Note that as we are changing P, the distribution of Tn changes as well as the parameter ψ(P). Not all estimators are regular, but most commonly used estimators in statistics are. Optimality of test statistics is closely related to efficient estimation, in that the most powerful test statistics for a hypothesis about a parameter are usually based on efficient estimators for that parameter.

The optimal efficiency for estimators of a parameter ψ(P) depends in part on the complexity of the model P. Estimation under the model P is more taxing than estimation under any parametric submodel P0 = {Pθ : θ ∈ Θ0} ⊂ P, where Θ0 is finite dimensional. Thus the information for estimation under model P is worse than the information under any parametric submodel P0. If the information for the regular estimator Tn is equal to the minimum of the information over all efficient estimators for all parametric submodels P0, then Tn is semiparametric efficient. For semiparametric models, this minimizer is the best possible, since the only models with more information are parametric models. A parametric model that achieves this minimum, if such a model exists, is called a least favorable or hardest submodel. Note that efficient estimators for parametric models are trivially semiparametric efficient since such models are their own parametric submodels.

Fortunately, finding the minimum information over parametric submodels usually only requires consideration of one-dimensional parametric submodels {Pt : t ∈ Nǫ} surrounding representative distributions P ∈ P, where Nǫ = [0, ǫ) for some ǫ > 0, P0 = P, and Pt ∈ P for all t ∈ Nǫ. If P has a dominating measure µ, then each P ∈ P can be expressed as a density p. In this case, we require the submodels around a representative density p to be smooth enough so that the real function g(x) = ∂ log pt(x)/(∂t)|t=0 exists with ∫X g²(x)p(x)µ(dx) < ∞. This idea can be restated in sufficient generality to allow for models that may not be dominated. In this more general case, we require

∫ [ ((dPt(x))^{1/2} − (dP(x))^{1/2})/t − (1/2)g(x)(dP(x))^{1/2} ]² → 0, (3.1)

as t ↓ 0. In this setting, we say that the submodel {Pt : t ∈ Nǫ} is differentiable in quadratic mean at t = 0, with score function g : X → R.

In evaluating efficiency, it is necessary to consider many such one-dimensional submodels surrounding the representative P, each with a different score function. Such a collection of score functions is called a tangent set of the model P at P, and is denoted ṖP. Because Pg = 0 and Pg² < ∞ for any g ∈ ṖP, such tangent sets are subsets of L₂⁰(P), the space of all functions h : X → R with Ph = 0 and Ph² < ∞. Note that, as we have done here, we will sometimes omit function arguments for simplicity, provided the context is clear. When the tangent set is closed under linear combinations, it is called a tangent space. Usually, one can take the closed linear span (the closure under linear combinations) of a tangent set to make a tangent space.

Consider P = {Pθ : θ ∈ Θ}, where Θ is an open subset of Rk. Assume that P is dominated by µ, and that the classical score function ℓ̇θ(x) = ∂ log pθ(x)/(∂θ) exists with Pθ[ℓ̇θℓ̇′θ] bounded and Pθ̃‖ℓ̇θ̃ − ℓ̇θ‖² → 0 as θ̃ → θ. Now for each h ∈ Rk, let ǫ > 0 be small enough so that {Pt : t ∈ Nǫ} ⊂ P, where for each t ∈ Nǫ, Pt = Pθ+th. One can show (see Exercise 3.5.1) that each of these one-dimensional submodels satisfies (3.1), with g = h′ℓ̇θ, resulting in the tangent space ṖPθ = {h′ℓ̇θ : h ∈ Rk}. Thus there is a simple connection between the classical score function and the more general idea of tangent sets and tangent spaces.

Continuing with the parametric setting, the estimator θ̂ is efficient for estimating θ if it is regular with information achieving the Cramér-Rao lower bound P[ℓ̇θℓ̇′θ]. Thus the tangent set for the model contains information about the optimal efficiency. This is also true for semiparametric models in general, although the relationship between tangent sets and the optimal information is more complex.

Consider estimation of the parameter ψ(P) ∈ Rk for the semiparametric model P. For any estimator Tn of ψ(P), if √n(Tn − ψ(P)) = √nPnψ̌P + oP(1), where oP(1) denotes a quantity going to zero in probability, then ψ̌P is an influence function for ψ(P) and Tn is asymptotically linear. For a given tangent set ṖP, assume, for each submodel {Pt : t ∈ Nǫ} satisfying (3.1) with some g ∈ ṖP and some ǫ > 0, that dψ(Pt)/(dt)|t=0 = ψ̇P(g) for some linear map ψ̇P : L₂⁰(P) → Rk. In this setting, we say that ψ is differentiable at P relative to ṖP. When ṖP is a linear space, there exists a measurable function ψ̃P : X → Rk such that ψ̇P(g) = P[ψ̃P(X)g(X)], for each g ∈ ṖP. The function ψ̃P ∈ ṖP ⊂ L₂⁰(P) is unique and is called the efficient influence function for the parameter ψ in the model P (relative to the tangent space ṖP). Here, and throughout the book, we abuse notation slightly by declaring that a random vector is in a given linear space if and only if each component of the vector is. Frequently, the efficient influence function can be found by taking a candidate influence function ψ̌P ∈ L₂⁰(P) and projecting it onto ṖP to obtain ψ̃P. If √n(Tn − ψ(P)) is asymptotically equivalent to √nPnψ̃P, then Tn can be shown to be semiparametric efficient (which we refer to hereafter simply as "efficient"). We will see in Theorem 18.8 in Section 18.2 (on Page 339) that full efficiency of an estimator Tn can essentially be verified by checking whether the influence function for Tn lies in ṖP. This can be a relatively straightforward method of verification for many settings.

Consider again the parametric example, with P = {Pθ : θ ∈ Θ}, where Θ ⊂ Rk. Suppose the parameter of interest is ψ(Pθ) = f(θ), where f : Rk → Rd has derivative ḟθ at θ, and that the Fisher information matrix Iθ = P[ℓ̇θℓ̇′θ] is invertible. It is not hard to show that ψ̇P(g) = ḟθI⁻¹θ P[ℓ̇θ(X)g(X)], and thus ψ̃P(x) = ḟθI⁻¹θ ℓ̇θ(x). Any estimator Tn for which √n(Tn − f(θ)) is asymptotically equivalent to √nPn[ḟθI⁻¹θ ℓ̇θ] has asymptotic variance equal to the Cramér-Rao lower bound ḟθI⁻¹θ ḟ′θ.

Returning to the semiparametric setting, any one-dimensional submodel {Pt : t ∈ Nǫ} satisfying (3.1) for the score function g ∈ ṖP and some ǫ > 0 is a parametric model with parameter t. The Fisher information for t, evaluated at t = 0, is Pg². Thus the Cramér-Rao lower bound for estimating a univariate parameter ψ(P) based on this model is (P[ψ̃Pg])²/Pg², since dψ(Pt)/(dt)|t=0 = P[ψ̃Pg]. Provided that ψ̃P ∈ ṖP and all necessary derivatives exist, the maximum Cramér-Rao lower bound over all such submodels in ṖP is thus Pψ̃²P. For more general Euclidean parameters ψ(P), this lower bound on the asymptotic variance is P[ψ̃Pψ̃′P].
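A numeric sanity check of the parametric bound (ours, with a hypothetical exponential model): for X ~ Exponential(θ), the score is 1/θ − x and Iθ = 1/θ², so for f(θ) = 1/θ (the mean) the lower bound ḟθI⁻¹θ ḟ′θ equals 1/θ², which the sample mean attains:

```python
import numpy as np

# Illustrative only (hypothetical submodel): X ~ Exponential(theta) has score
# 1/theta - x and I_theta = 1/theta^2. For f(theta) = 1/theta (the mean),
# f_dot = -1/theta^2, so the Cramer-Rao bound f_dot I^{-1} f_dot' = 1/theta^2,
# which the efficient estimator Tn = Pn X attains.
rng = np.random.default_rng(4)
theta, n, reps = 2.0, 500, 2000

tn = rng.exponential(scale=1 / theta, size=(reps, n)).mean(axis=1)
v = float(n * tn.var())
print(round(v, 4), "vs bound", 1 / theta**2)
```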

Hence, for P[ψ̃Pψ̃′P] to be the upper bound for all parametric submodels, the tangent set must be sufficiently large. Obviously, the tangent set must also be restricted to score functions which reflect valid submodels. In addition, the larger the tangent set, the fewer the number of regular estimators. To see this, we will now provide a more precise definition of regularity. Let {Pt,g} denote a submodel {Pt : t ∈ Nǫ} satisfying (3.1) for the score g and some ǫ > 0. Tn is regular for ψ(P) if the limiting distribution of √n(Tn − ψ(P1/√n,g)), over the sequence of distributions P1/√n,g, exists and is constant over all g ∈ ṖP. Thus the tangent set must be chosen to be large enough but not too large. Fortunately, most estimators in common use for semiparametric inference are regular for large tangent sets, and thus there is usually quite a lot to be gained by making the effort to obtain an efficient estimator. Once the tangent set has been identified, the corresponding efficient estimator Tn of ψ(P) will always be asymptotically linear with influence function equal to the efficient influence function ψ̃P.

Consider, for example, the unrestricted model P of all distributions on X. Suppose we are interested in estimating ψ(P) = Pf for some f ∈ L2(P), the space of all measurable functions h with Ph² < ∞. For bounded g ∈ L₂⁰(P), the one-dimensional submodel {Pt : dPt = (1 + tg)dP, t ∈ Nǫ} ⊂ P for ǫ small enough. Furthermore, (3.1) is satisfied with ∂ψ(Pt)/(∂t)|t=0 = P[fg]. It is not hard to show, in fact, that one-dimensional submodels satisfying (3.1) with ∂ψ(Pt)/(∂t)|t=0 = P[fg] exist for all g ∈ L₂⁰(P). Thus ṖP = L₂⁰(P) is a tangent set for the unrestricted model, and ψ̃P(x) = f(x) − Pf is the corresponding efficient influence function. Since √n(Pnf − ψ(P)) = √nPnψ̃P, Pnf is efficient for estimating Pf. Note that, in this unrestricted model, ṖP is the maximal possible tangent set, since all tangent sets must be subsets of L₂⁰(P). In general, the size of a tangent set reflects the amount of restrictions placed on a model, in that larger tangent sets reflect fewer restrictions.
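This example translates directly into a plug-in variance estimate. The sketch below (ours; the choice of f is arbitrary) estimates Pf by Pnf and builds a standard error from the empirical influence values f(Xi) − Pnf:

```python
import numpy as np

# Illustrative only (f is an arbitrary choice): estimate Pf = E[X^2] = 1/3
# for X ~ U(0,1) by Pn f, with a standard error built from the estimated
# efficient influence function f(x) - Pn f.
rng = np.random.default_rng(5)
x = rng.uniform(size=4000)
f = lambda t: t**2

est = float(f(x).mean())
psi_hat = f(x) - est                      # empirical influence values
se = float(np.sqrt(np.mean(psi_hat**2) / len(x)))
print(round(est, 4), "+/-", round(1.96 * se, 4))  # approximate 95% CI for 1/3
```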

Efficiency can also be established for infinite-dimensional parameters ψ(P) when √n consistent regular estimators for ψ(P) exist. There are a number of ways of expressing efficiency in this context, but we will only mention the convolution approach here. The convolution theorem states that, for any regular estimator Tn of ψ(P), √n(Tn − ψ(P)) has a weak limiting distribution which is the convolution of a Gaussian process Z and an independent process M, where Z has the same limiting distribution as √nPnψ̃P. In other words, an inefficient estimator always has an asymptotically non-negligible independent noise process M added to the efficient estimator distribution. A regular estimator Tn for which M is zero is efficient. Occasionally, we will use the term uniformly efficient when it is helpful to emphasize the fact that ψ(P) is infinite dimensional. If ψ(P) is indexed by t ∈ T, and if Tn(t) is efficient for ψ(P)(t) for each t ∈ T, then it can be shown that weak convergence of √n(Tn − ψ(P)) to a tight, mean zero Gaussian process implies uniform efficiency of Tn. This result will be formally stated in Theorem 18.9 on Page 341. Another important fact is that if Tn is an efficient estimator for ψ(P) and φ is a suitable Hadamard-differentiable function, then φ(Tn) is an efficient estimator for φ(ψ(P)). We will make these results more explicit in Part III.

3.2 Score Functions and Estimating Equations

A parameter ψ(P) of particular interest is the parametric component θ of a semiparametric model {Pθ,η : θ ∈ Θ, η ∈ H}, where Θ is an open subset of Rk and H is an arbitrary set that may be infinite dimensional. Tangent sets can be used to develop an efficient estimator for ψ(Pθ,η) = θ through the formation of an efficient score function. In this setting, we consider submodels of the form {Pθ+ta,ηt : t ∈ Nǫ} that are differentiable in quadratic mean with score function ∂ log dPθ+ta,ηt/(∂t)|t=0 = a′ℓ̇θ,η + g, where a ∈ Rk, ℓ̇θ,η : X → Rk is the ordinary score for θ when η is fixed, and where g : X → R is an element of a tangent set Ṗ(η)Pθ,η for the submodel Pθ = {Pθ,η : η ∈ H} (holding θ fixed). This tangent set is the tangent set for η and should be rich enough to reflect all parametric submodels of Pθ. The tangent set for the full model is ṖPθ,η = {a′ℓ̇θ,η + g : a ∈ Rk, g ∈ Ṗ(η)Pθ,η}.

While ψ(Pθ+ta,ηt) = θ + ta is clearly differentiable with respect to t, we also need, as in the previous section, that there exists a function ψ̃θ,η : X → Rk such that

∂ψ(Pθ+ta,ηt)/∂t |t=0 = a = P[ψ̃θ,η(ℓ̇′θ,ηa + g)], (3.2)

for all a ∈ Rk and all g ∈ Ṗ(η)Pθ,η. After setting a = 0, we see that such a function must be uncorrelated with all of the elements of Ṗ(η)Pθ,η.

Define Πθ,η to be the orthogonal projection onto the closed linear span of Ṗ(η)Pθ,η in L₂⁰(Pθ,η). We will describe how to obtain such projections in detail in Part III, but suffice it to say that for any h ∈ L₂⁰(Pθ,η), h = h − Πθ,ηh + Πθ,ηh, where Πθ,ηh ∈ Ṗ(η)Pθ,η but P[(h − Πθ,ηh)g] = 0 for all g ∈ Ṗ(η)Pθ,η. The efficient score function for θ is ℓ̃θ,η = ℓ̇θ,η − Πθ,ηℓ̇θ,η, while the efficient information matrix for θ is Ĩθ,η = P[ℓ̃θ,ηℓ̃′θ,η].

Provided that Ĩθ,η is nonsingular, the function ψ̃θ,η = Ĩ⁻¹θ,ηℓ̃θ,η satisfies (3.2) for all a ∈ Rk and all g ∈ Ṗ(η)Pθ,η. Thus the functional (parameter) ψ(Pθ,η) = θ is differentiable at Pθ,η relative to the tangent set ṖPθ,η, with efficient influence function ψ̃θ,η. Hence the search for an efficient estimator of θ is over if one can find an estimator Tn satisfying √n(Tn − θ) = √nPnψ̃θ,η + oP(1). Note that Ĩθ,η = Iθ,η − P[Πθ,ηℓ̇θ,η(Πθ,ηℓ̇θ,η)′], where Iθ,η = P[ℓ̇θ,ηℓ̇′θ,η]. An intuitive justification for the form of the efficient score is that some information for estimating θ is lost due to a lack of knowledge about η. The amount subtracted off of the score, Πθ,ηℓ̇θ,η, is the minimum possible amount for regular estimators when η is unknown.

Consider again the semiparametric regression model (1.1), where Y = β′Z + e, E[e|Z] = 0 and E[e²|Z] ≤ K < ∞ almost surely, and where we observe (Y, Z), with the joint density η of (e, Z) satisfying ∫R eη(e, Z)de = 0 almost surely. Assume η has partial derivative with respect to the first argument, η1, satisfying η1/η ∈ L2(Pβ,η), and hence η1/η ∈ L₂⁰(Pβ,η), where Pβ,η is the joint distribution of (Y, Z). The Euclidean parameter of interest in this semiparametric model is θ = β. The score for β, assuming η is known, is ℓ̇β,η = −Z(η1/η)(Y − β′Z, Z), where we use the shorthand (f/g)(u, v) = f(u, v)/g(u, v) for ratios of functions.

One can show that the tangent set Ṗ(η)Pβ,η for η is the subset of L₂⁰(Pβ,η) which consists of all functions g(e, Z) ∈ L₂⁰(Pβ,η) which satisfy

E[eg(e, Z)|Z] = ∫R eg(e, Z)η(e, Z)de / ∫R η(e, Z)de = 0,

almost surely. One can also show that this set is the orthocomplement in L₂⁰(Pβ,η) of all functions of the form ef(Z), where f satisfies Pβ,ηf²(Z) < ∞. This means that ℓ̃β,η = (I − Πβ,η)ℓ̇β,η is the projection in L₂⁰(Pβ,η) of −Z(η1/η)(e, Z) onto {ef(Z) : Pβ,ηf²(Z) < ∞}, where I is the identity. Thus

ℓ̃β,η(Y, Z) = −Ze ∫R η1(e, Z)ede / Pβ,η[e²|Z] = −Ze(−1)/Pβ,η[e²|Z] = Z(Y − β′Z)/Pβ,η[e²|Z],

where the second-to-last step follows from the identity ∫R η1(e, Z)ede = ∂∫R η(te, Z)de/(∂t)|t=1, and the last step follows since e = Y − β′Z. When the function z → Pβ,η[e²|Z = z] is non-constant in z, ℓ̃β,η(Y, Z) is not proportional to Z(Y − β′Z), and the estimator β̂ defined in Chapter 1 will not be efficient. We will discuss efficient estimation for this model in greater detail in Chapter 4.
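A small simulation (ours; the conditional variance function is a hypothetical choice) illustrating the point: when Pβ,η[e²|Z] varies with Z, reweighting the score by 1/E[e²|Z], as the efficient score suggests, reduces the variance relative to the unweighted estimator:

```python
import numpy as np

# Illustrative simulation (sigma2 is a hypothetical choice): with
# Var(e | Z) = sigma2(Z) non-constant, the unweighted estimator solving
# Pn[Z(Y - beta Z)] = 0 is not efficient; weighting by 1/E[e^2 | Z], as the
# efficient score suggests, shrinks the variance.
rng = np.random.default_rng(6)
beta, n, reps = 1.0, 400, 1500

def sigma2(z):
    return 0.2 + 2.0 * z**2               # assumed known conditional variance

ols = np.empty(reps)
wls = np.empty(reps)
for r in range(reps):
    z = rng.standard_normal(n)
    y = beta * z + rng.standard_normal(n) * np.sqrt(sigma2(z))
    w = 1.0 / sigma2(z)
    ols[r] = (z * y).sum() / (z * z).sum()          # unweighted score
    wls[r] = (w * z * y).sum() / (w * z * z).sum()  # efficiently weighted score

print(round(float(n * ols.var()), 2), ">", round(float(n * wls.var()), 2))
```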

Two very useful tools for computing efficient scores are score and information operators. Although we will provide more precise definitions in Parts II and III, operators are maps between spaces of functions. Returning to the generic semiparametric model {Pθ,η : θ ∈ Θ, η ∈ H}, sometimes it is easier to represent an element g in the tangent set for η, Ṗ(η)Pθ,η, as Bθ,ηb, where b is an element of another set Hη and Bθ,η is an operator satisfying Ṗ(η)Pθ,η = {Bθ,ηb : b ∈ Hη}. Such an operator is a score operator. The adjoint of the score operator Bθ,η : Hη → L₂⁰(Pθ,η) is another operator B∗θ,η : L₂⁰(Pθ,η) → lin Hη which is similar in spirit to a transpose for matrices. Here we use lin A to denote the closed linear span (the closure of the linear space consisting of all linear combinations) of A. Additional details on adjoints and methods for computing them will be described in Part III. One can now define the information operator B∗θ,ηBθ,η : Hη → lin Hη. If B∗θ,ηBθ,η has an inverse, then it can be shown that the efficient score for θ has the form

ℓ̃θ,η = (I − Bθ,η[B∗θ,ηBθ,η]⁻¹B∗θ,η)ℓ̇θ,η.

To illustrate these methods, consider the Cox model for right-censored data introduced in Chapter 1. In this setting, we observe a sample of n realizations of X = (V, d, Z), where V = T ∧ C, d = 1{V = T}, Z ∈ Rk is a covariate vector, T is a failure time, and C is a censoring time. We assume that T and C are independent given Z, that T given Z has integrated hazard function e^{β′Z}Λ(t) for β in an open subset B ⊂ Rk and Λ continuous and monotone increasing with Λ(0) = 0, and that the censoring distribution does not depend on β or Λ (i.e., censoring is uninformative). Define the counting and at-risk processes N(t) = 1{V ≤ t}d and Y(t) = 1{V ≥ t}, and let M(t) = N(t) − ∫₀^t Y(s)e^{β′Z}dΛ(s). For some 0 < τ < ∞ with P{C ≥ τ} > 0, let H be the set of all Λ's satisfying our criteria with Λ(τ) < ∞. Now the set of models P is indexed by β ∈ B and Λ ∈ H. We let Pβ,Λ be the distribution of (V, d, Z) corresponding to the given parameters.

The likelihood for a single observation is thus proportional to pβ,Λ(X) = [e^{β′Z}λ(V)]^d exp[−e^{β′Z}Λ(V)], where λ is the derivative of Λ. Now let L2(Λ) be the set of measurable functions b : [0, τ] → R with ∫₀^τ b²(s)dΛ(s) < ∞. If b ∈ L2(Λ) is bounded, then Λt(s) = ∫₀^s e^{tb(u)}dΛ(u) ∈ H for all t. The score function ∂ log pβ+ta,Λt(X)/(∂t)|t=0 is thus ∫₀^τ [a′Z + b(s)]dM(s), for any a ∈ Rk. The score function for β is therefore ℓ̇β,Λ(X) = ZM(τ), while the score function for Λ is ∫₀^τ b(s)dM(s). In fact, one can show that there exist one-dimensional submodels Λt such that log pβ+ta,Λt is differentiable with score a′ℓ̇β,Λ(X) + ∫₀^τ b(s)dM(s), for any b ∈ L2(Λ) and a ∈ Rk.

The operator $B_{\beta,\Lambda}: L_2(\Lambda) \to L_2^0(P_{\beta,\Lambda})$, given by $B_{\beta,\Lambda}(b) = \int_0^\tau b(s)\,dM(s)$, is the score operator which generates the tangent set for $\Lambda$, $\dot{\mathcal{P}}^{(\Lambda)}_{P_{\beta,\Lambda}} \equiv \{B_{\beta,\Lambda}b : b \in L_2(\Lambda)\}$. It can be shown that this tangent space spans all square-integrable score functions for $\Lambda$ generated by parametric submodels. The adjoint operator can be shown to be $B^*_{\beta,\Lambda}: L_2(P_{\beta,\Lambda}) \to L_2(\Lambda)$, where $B^*_{\beta,\Lambda}(g)(t) = P_{\beta,\Lambda}[g(X)\,dM(t)]/d\Lambda(t)$. The information operator $B^*_{\beta,\Lambda}B_{\beta,\Lambda}: L_2(\Lambda) \to L_2(\Lambda)$ is thus

$$B^*_{\beta,\Lambda}B_{\beta,\Lambda}(b)(t) = \frac{P_{\beta,\Lambda}\left[\int_0^\tau b(s)\,dM(s)\,dM(t)\right]}{d\Lambda(t)} = P_{\beta,\Lambda}\left[Y(t)e^{\beta'Z}\right]b(t),$$

using martingale methods.

Since $B^*_{\beta,\Lambda}\big(\dot\ell_{\beta,\Lambda}\big)(t) = P_{\beta,\Lambda}\left[ZY(t)e^{\beta'Z}\right]$, we have that the efficient score for $\beta$ is

$$\tilde\ell_{\beta,\Lambda} = \left(I - B_{\beta,\Lambda}\left[B^*_{\beta,\Lambda}B_{\beta,\Lambda}\right]^{-1}B^*_{\beta,\Lambda}\right)\dot\ell_{\beta,\Lambda} = \int_0^\tau\left\{Z - \frac{P_{\beta,\Lambda}\left[ZY(t)e^{\beta'Z}\right]}{P_{\beta,\Lambda}\left[Y(t)e^{\beta'Z}\right]}\right\}dM(t). \quad (3.3)$$

When $\tilde{I}_{\beta,\Lambda} \equiv P_{\beta,\Lambda}\big[\tilde\ell_{\beta,\Lambda}\tilde\ell'_{\beta,\Lambda}\big]$ is positive definite, the resulting efficient influence function is $\tilde\psi_{\beta,\Lambda} \equiv \tilde{I}^{-1}_{\beta,\Lambda}\tilde\ell_{\beta,\Lambda}$. Since the estimator $\hat\beta_n$ obtained from maximizing the partial likelihood

$$L_n(\beta) = \prod_{i=1}^n\left(\frac{e^{\beta'Z_i}}{\sum_{j=1}^n 1\{V_j \ge V_i\}e^{\beta'Z_j}}\right)^{d_i} \quad (3.4)$$

can be shown to satisfy $\sqrt{n}(\hat\beta_n - \beta) = \sqrt{n}\mathbb{P}_n\tilde\psi_{\beta,\Lambda} + o_P(1)$, this estimator is efficient.
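To make the partial-likelihood computation concrete, here is a minimal numerical sketch, not the book's code: a Newton-Raphson root of the partial-likelihood score, built from the risk-set averages that also appear in (3.8). The function names (`cox_score_info`, `fit_cox`) and the data layout are our own illustrative assumptions.

```python
import numpy as np

def cox_score_info(beta, V, d, Z):
    """Partial-likelihood score and observed information (both summed
    over subjects), using the risk-set average of Z at each failure."""
    risk = np.exp(Z @ beta)                      # e^{beta'Z_j} for each subject
    U = np.zeros_like(beta)
    I = np.zeros((beta.size, beta.size))
    for i in np.flatnonzero(d):                  # sum over observed failures
        in_risk = V >= V[i]
        w, Zr = risk[in_risk], Z[in_risk]
        Ebar = (w @ Zr) / w.sum()                # risk-set weighted mean of Z
        U += Z[i] - Ebar
        I += (Zr.T * w) @ Zr / w.sum() - np.outer(Ebar, Ebar)
    return U, I

def fit_cox(V, d, Z, steps=25):
    """Newton-Raphson root of the partial-likelihood score, started at 0."""
    beta = np.zeros(Z.shape[1])
    for _ in range(steps):
        U, I = cox_score_info(beta, V, d, Z)
        beta = beta + np.linalg.solve(I, U)
    return beta
```

Since the log partial likelihood is concave, a few Newton steps typically suffice, and the inverse of the accompanying information matrix provides the usual variance estimate for the fitted $\hat\beta_n$.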

Returning to our discussion of score and information operators, these operators are also useful for generating scores for the entire model, not just for the nuisance component. For semiparametric models having score functions of the form $a'\dot\ell_{\theta,\eta} + B_{\theta,\eta}b$, for $a \in \mathbb{R}^k$ and $b \in \mathcal{H}_\eta$, we can define a new operator $A_{\theta,\eta}: \{(a,b): a \in \mathbb{R}^k, b \in \overline{\text{lin}}\,\mathcal{H}_\eta\} \to L_2^0(P_{\theta,\eta})$, where $A_{\theta,\eta}(a,b) = a'\dot\ell_{\theta,\eta} + B_{\theta,\eta}b$. More generally, we can define the score operator $A_\eta: \overline{\text{lin}}\,\mathcal{H}_\eta \to L_2(P_\eta)$ for the model $\{P_\eta: \eta \in H\}$, where $H$ indexes the entire model and may include both parametric and nonparametric components, and where $\overline{\text{lin}}\,\mathcal{H}_\eta$ indexes directions in $H$. Let the parameter of interest be $\psi(P_\eta) = \chi(\eta) \in \mathbb{R}^k$. We assume there exists a linear operator $\dot\chi: \overline{\text{lin}}\,\mathcal{H}_\eta \to \mathbb{R}^k$ such that, for every $b \in \overline{\text{lin}}\,\mathcal{H}_\eta$, there exists a one-dimensional submodel $\{P_{\eta_t}: \eta_t \in H, t \in N_\epsilon\}$ satisfying

$$\int\left[\frac{(dP_{\eta_t})^{1/2} - (dP_\eta)^{1/2}}{t} - \frac{1}{2}A_\eta b\,(dP_\eta)^{1/2}\right]^2 \to 0,$$

as $t \downarrow 0$, and $\partial\chi(\eta_t)/(\partial t)\big|_{t=0} = \dot\chi(b)$.

We require $\mathcal{H}_\eta$ to have a suitable inner product $\langle\cdot,\cdot\rangle_\eta$, where an inner product is an operation on elements of $\mathcal{H}_\eta$ with the properties $\langle a,b\rangle_\eta = \langle b,a\rangle_\eta$, $\langle a+b,c\rangle_\eta = \langle a,c\rangle_\eta + \langle b,c\rangle_\eta$, $\langle a,a\rangle_\eta \ge 0$, and $\langle a,a\rangle_\eta = 0$ if and only if $a = 0$, for all $a,b,c \in \mathcal{H}_\eta$. The efficient influence function is the solution $\tilde\psi_{P_\eta} \in \overline{\mathcal{R}(A_\eta)} \subset L_2^0(P_\eta)$ of

$$A^*_\eta\tilde\psi_{P_\eta} = \tilde\chi_\eta, \quad (3.5)$$

where $\mathcal{R}$ denotes range, $\overline{B}$ denotes the closure of the set $B$, $A^*_\eta$ is the adjoint of $A_\eta$, and $\tilde\chi_\eta \in \mathcal{H}_\eta$ satisfies $\langle\tilde\chi_\eta, b\rangle_\eta = \dot\chi(b)$ for all $b \in \mathcal{H}_\eta$. Methods for obtaining such a $\tilde\chi_\eta$ will be given in Part III. When $A^*_\eta A_\eta$ is invertible, the solution to (3.5) can be written $\tilde\psi_{P_\eta} = A_\eta\left(A^*_\eta A_\eta\right)^{-1}\tilde\chi_\eta$. In Chapter 4, we will illustrate this approach to derive efficient estimators for all parameters of the Cox model.

Returning to the semiparametric model setting, where $\mathcal{P} = \{P_{\theta,\eta}: \theta \in \Theta, \eta \in H\}$, $\Theta$ is an open subset of $\mathbb{R}^k$, and $H$ is a set, the efficient score can be used to derive estimating equations for computing efficient estimators of $\theta$. An estimating equation is a data-dependent function $\Psi_n: \Theta \to \mathbb{R}^k$ for which an approximate zero yields a Z-estimator for $\theta$. When $\Psi_n(\theta)$ has the form $\mathbb{P}_n\hat\ell_{\theta,n}$, where $\hat\ell_{\theta,n}(X\,|\,X_1,\dots,X_n)$ is a function of the generic observation $X$ which depends on the value of $\theta$ and the sample data $X_1,\dots,X_n$, we have the following estimating equation result (the proof of which will be given in Part III, Page 350):

Theorem 3.1 Suppose that the model $\{P_{\theta,\eta}: \theta \in \Theta\}$, where $\Theta \subset \mathbb{R}^k$, is differentiable in quadratic mean with respect to $\theta$ at $(\theta,\eta)$, and let the efficient information matrix $\tilde{I}_{\theta,\eta}$ be nonsingular. Let $\hat\theta_n$ satisfy $\sqrt{n}\mathbb{P}_n\hat\ell_{\hat\theta_n,n} = o_P(1)$ and be consistent for $\theta$. Also assume that $\hat\ell_{\hat\theta_n,n}$ is contained in a $P_{\theta,\eta}$-Donsker class with probability tending to 1 and that the following conditions hold:

$$P_{\hat\theta_n,\eta}\hat\ell_{\hat\theta_n,n} = o_P(n^{-1/2} + \|\hat\theta_n - \theta\|), \quad (3.6)$$

$$P_{\theta,\eta}\left\|\hat\ell_{\hat\theta_n,n} - \tilde\ell_{\theta,\eta}\right\|^2 \xrightarrow{P} 0, \qquad P_{\hat\theta_n,\eta}\left\|\hat\ell_{\hat\theta_n,n}\right\|^2 = O_P(1). \quad (3.7)$$

Then $\hat\theta_n$ is asymptotically efficient at $(\theta,\eta)$.

Returning to the Cox model example, the profile likelihood score is the partial likelihood score $\Psi_n(\beta) = \mathbb{P}_n\hat\ell_{\beta,n}$, where

$$\hat\ell_{\beta,n}(X = (V,d,Z)\,|\,X_1,\dots,X_n) = \int_0^\tau\left\{Z - \frac{\mathbb{P}_n\left[ZY(t)e^{\beta'Z}\right]}{\mathbb{P}_n\left[Y(t)e^{\beta'Z}\right]}\right\}dM_\beta(t), \quad (3.8)$$

and $M_\beta(t) = N(t) - \int_0^t Y(u)e^{\beta'Z}\,d\Lambda(u)$. We will show in Chapter 4 that all the conditions of Theorem 3.1 are satisfied for the root $\hat\beta_n$ of $\Psi_n(\beta) = 0$, and thus the partial likelihood yields efficient estimation of $\beta$.

Returning to the general semiparametric setting, even if an estimating equation $\Psi_n$ is not close enough to $\mathbb{P}_n\tilde\ell_{\theta,\eta}$ to yield an efficient estimator, it will frequently still yield a $\sqrt{n}$-consistent estimator which is precise enough to be useful. In some cases, the computational effort needed to obtain an efficient estimator may be too great a cost, and one must settle for an inefficient estimating equation that works. Even in these settings, modifications to the estimating equation can often be made which improve efficiency while maintaining computability. This issue will be explored in greater detail in Part III.

3.3 Maximum Likelihood Estimation

The most common approach to efficient estimation is based on modifications of maximum likelihood estimation that lead to efficient estimates. These modifications, which we will call "likelihoods," are generally not really likelihoods (products of densities) because of complications resulting from the presence of an infinite-dimensional nuisance parameter. Consider estimating an unknown real density $f(x)$ from an i.i.d. sample $X_1,\dots,X_n$. The likelihood is $\prod_{i=1}^n f(X_i)$, and the maximizer over all densities has arbitrarily high peaks at the observations, with zero at the other values, and is therefore not a density. This problem can be fixed by using an empirical likelihood $\prod_{i=1}^n p_i$, where $p_1,\dots,p_n$ are the masses assigned to the observations indexed by $i = 1,\dots,n$ and are constrained to satisfy $\sum_{i=1}^n p_i = 1$. This leads to the empirical distribution function estimator, which is known to be fully efficient.
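That the empirical distribution's uniform masses $p_i = 1/n$ maximize $\prod_i p_i$ subject to $\sum_i p_i = 1$ can be checked numerically; the following is a small sketch of ours (not from the text), comparing the uniform weights against randomly drawn competitors on the probability simplex.

```python
import numpy as np

def empirical_log_lik(p):
    # log of the empirical likelihood  prod_i p_i
    return float(np.sum(np.log(p)))

n = 5
uniform = np.full(n, 1.0 / n)            # masses of the empirical distribution
rng = np.random.default_rng(0)
for _ in range(200):
    q = rng.dirichlet(np.ones(n))        # any competing mass assignment
    assert empirical_log_lik(q) <= empirical_log_lik(uniform) + 1e-12
```

The dominance of the uniform weights is just the arithmetic-geometric mean inequality, which is why the empirical distribution function emerges as the maximizer.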

Consider again the Cox model for right-censored data explored in the previous section. The density for a single observation $X = (V,d,Z)$ is proportional to $\left[e^{\beta'Z}\lambda(V)\right]^d \exp\left[-e^{\beta'Z}\Lambda(V)\right]$. Maximizing the likelihood based on this density will result in the same problem raised in the previous paragraph. A likelihood that works is the following, which assigns mass only at observed failure times:

$$L_n(\beta,\Lambda) = \prod_{i=1}^n\left[e^{\beta'Z_i}\Delta\Lambda(V_i)\right]^{d_i}\exp\left[-e^{\beta'Z_i}\Lambda(V_i)\right], \quad (3.9)$$

where $\Delta\Lambda(t)$ is the jump size of $\Lambda$ at $t$. For each value of $\beta$, one can maximize or profile $L_n(\beta,\Lambda)$ over the "nuisance" parameter $\Lambda$ to obtain the profile likelihood $pL_n(\beta)$, which for the Cox model is $\exp\left[-\sum_{i=1}^n d_i\right]$ times the partial likelihood (3.4). Let $\hat\beta$ be the maximizer of $pL_n(\beta)$. Then the maximizer $\hat\Lambda$ of $L_n(\hat\beta,\Lambda)$ is the "Breslow estimator"

$$\hat\Lambda(t) = \int_0^t\frac{\mathbb{P}_n\,dN(s)}{\mathbb{P}_n\left[Y(s)e^{\hat\beta'Z}\right]}.$$
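A minimal numerical sketch of the Breslow estimator (our own illustration; the function name and data layout are assumptions, not from the text). The $n$ in the numerator and denominator of the display cancels, so raw counts and sums can be used directly.

```python
import numpy as np

def breslow(beta_hat, V, d, Z):
    """Breslow estimator of the baseline integrated hazard Lambda:
    at each observed failure time s, the jump is
        (# events at s) / sum_{j: V_j >= s} exp(beta_hat' Z_j)."""
    risk = np.exp(Z @ beta_hat)                  # e^{beta_hat' Z_j} per subject
    times = np.sort(np.unique(V[d == 1]))        # observed failure times
    jumps = np.array([
        np.sum(d[V == s]) / np.sum(risk[V >= s]) # P_n dN(s) / P_n[Y(s) e^{b'Z}]
        for s in times
    ])
    return times, np.cumsum(jumps)

# tiny check: with beta_hat = 0 and no censoring, the estimator reduces
# to the Nelson-Aalen form sum of 1/(number still at risk)
V = np.array([1.0, 2.0, 3.0, 4.0])
d = np.array([1, 1, 1, 1])
Z = np.zeros((4, 1))
t, L = breslow(np.zeros(1), V, d, Z)
expected = np.cumsum(1.0 / np.array([4.0, 3.0, 2.0, 1.0]))
assert np.allclose(L, expected)
```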

We will see in Chapter 4 that $\hat\beta$ and $\hat\Lambda$ are both efficient.

Another useful class of likelihood variants is penalized likelihoods. Penalized likelihoods add a penalty term in order to maintain an appropriate level of smoothness for one or more of the nuisance parameters. This method is used in the partly linear logistic regression model described in Chapter 1. Other methods of generating likelihood variants that work are possible. The basic idea is that using the likelihood principle to guide estimation of semiparametric models often leads to efficient estimators of the model components which are $\sqrt{n}$ consistent. Because of the richness of this approach to estimation, one needs to verify for each new situation that a likelihood-inspired estimator is consistent, efficient and well-behaved for moderate sample sizes. Verifying efficiency usually entails demonstrating that the estimator satisfies the efficient score equation described in the previous section.

Unfortunately, there is no guarantee that the efficient score is a derivative of the log-likelihood along some submodel. A way around this problem is to use approximately least-favorable submodels. This is done by finding a function $\eta_t(\theta,\eta)$ such that $\eta_0(\theta,\eta) = \eta$ for all $\theta \in \Theta$ and $\eta \in H$, where $\eta_t(\theta,\eta) \in H$ for all $t$ small enough, and such that $\tilde\kappa_{\theta_0,\eta_0} = \tilde\ell_{\theta_0,\eta_0}$, where $\tilde\kappa_{\theta,\eta}(x) = \partial l_{\theta+t,\eta_t(\theta,\eta)}(x)/(\partial t)\big|_{t=0}$, $l_{\theta,\eta}(x)$ is the log-likelihood for the observed value $x$ at the parameters $(\theta,\eta)$, and where $(\theta_0,\eta_0)$ are the true parameter values. Note that we require $\tilde\kappa_{\theta,\eta} = \tilde\ell_{\theta,\eta}$ only when $(\theta,\eta) = (\theta_0,\eta_0)$.

If $(\hat\theta_n,\hat\eta_n)$ is the maximum likelihood estimator, i.e., the maximizer of $\mathbb{P}_n l_{\theta,\eta}$, then the function $t \mapsto \mathbb{P}_n l_{\hat\theta_n+t,\eta_t(\hat\theta_n,\hat\eta_n)}$ is maximal at $t = 0$, and thus $(\hat\theta_n,\hat\eta_n)$ is a zero of $\mathbb{P}_n\tilde\kappa_{\theta,\eta}$. Now if $\hat\theta_n$ and $\hat\ell_{\theta,n} = \tilde\kappa_{\theta,\hat\eta_n}$ satisfy the conditions of Theorem 3.1 at $(\theta,\eta) = (\theta_0,\eta_0)$, then the maximum likelihood estimator $\hat\theta_n$ is asymptotically efficient at $(\theta_0,\eta_0)$. We will explore in Part III how this flexibility is helpful for certain models.

Consider now the special case that both $\theta$ and $\eta$ are $\sqrt{n}$ consistent in the model $\{P_{\theta,\eta}: \theta \in \Theta, \eta \in H\}$, where $\Theta \subset \mathbb{R}^k$. Let $l_{\theta,\eta}$ be the log-likelihood for a single observation, and let $\hat\theta_n$ and $\hat\eta_n$ be the corresponding maximum likelihood estimators. Since $\theta$ is finite-dimensional, the log-likelihood can be varied with respect to $\theta$ in the usual way, so that the maximum likelihood estimators satisfy $\mathbb{P}_n\dot\ell_{\hat\theta_n,\hat\eta_n} = 0$.

In contrast, varying $\eta$ in the log-likelihood is more complex. We can typically use a subset of the submodels $t \mapsto \eta_t$ used for defining the tangent set and information for $\eta$ in the model. This is easiest when the score for $\eta$ is expressed as a score operator $B_{\theta,\eta}$ working on a set of indices $h \in \mathcal{H}$ (similar to what was described in Section 3.2). The likelihood equation for $\eta$ is then usually of the form $\mathbb{P}_n B_{\hat\theta_n,\hat\eta_n}h - P_{\hat\theta_n,\hat\eta_n}B_{\hat\theta_n,\hat\eta_n}h = 0$ for all $h \in \mathcal{H}$. Note that we have forced the scores to be mean zero by subtracting off the mean rather than simply assuming $P_{\theta,\eta}B_{\theta,\eta}h = 0$. The approach is valid if there exists some path $t \mapsto \eta_t(\theta,\eta)$, with $\eta_0(\theta,\eta) = \eta$, such that

$$B_{\theta,\eta}h(x) - P_{\theta,\eta}B_{\theta,\eta}h = \partial l_{\theta,\eta_t(\theta,\eta)}(x)/(\partial t)\big|_{t=0} \quad (3.10)$$

for all $x$ in the sample space $\mathcal{X}$. We assume (3.10) is valid for all $h \in \mathcal{H}$, where $\mathcal{H}$ is chosen so that $B_{\theta,\eta}h(x) - P_{\theta,\eta}B_{\theta,\eta}h$ is uniformly bounded over $h \in \mathcal{H}$ for all $x \in \mathcal{X}$ and $(\theta,\eta) \in \mathbb{R}^k \times H$.

We can now express $(\hat\theta_n,\hat\eta_n)$ as a Z-estimator with estimating function $\Psi_n: \mathbb{R}^k \times H \to \mathbb{R}^k \times \ell^\infty(\mathcal{H})$, where $\Psi_n = (\Psi_{n1},\Psi_{n2})$, with $\Psi_{n1}(\theta,\eta) = \mathbb{P}_n\dot\ell_{\theta,\eta}$ and $\Psi_{n2}(\theta,\eta)(h) = \mathbb{P}_n B_{\theta,\eta}h - P_{\theta,\eta}B_{\theta,\eta}h$, for all $h \in \mathcal{H}$. The expectation of these maps under the true parameter $(\theta_0,\eta_0)$ is the deterministic map $\Psi = (\Psi_1,\Psi_2)$, where $\Psi_1(\theta,\eta) = P_{\theta_0,\eta_0}\dot\ell_{\theta,\eta}$ and $\Psi_2(\theta,\eta)(h) = P_{\theta_0,\eta_0}B_{\theta,\eta}h - P_{\theta,\eta}B_{\theta,\eta}h$, for all $h \in \mathcal{H}$. We have constructed these estimating equations so that the maximum likelihood estimators and true parameters satisfy $\Psi_n(\hat\theta_n,\hat\eta_n) = 0 = \Psi(\theta_0,\eta_0)$. Provided $\mathcal{H}$ is a subset of a normed space, we can use the Z-estimator master theorem (Theorem 2.11) to obtain weak convergence of $\sqrt{n}(\hat\theta_n - \theta_0, \hat\eta_n - \eta_0)$:


Corollary 3.2 Suppose that $\dot\ell_{\theta,\eta}$ and $B_{\theta,\eta}h$, with $h$ ranging over $\mathcal{H}$ and with $(\theta,\eta)$ ranging over a neighborhood of $(\theta_0,\eta_0)$, are contained in a $P_{\theta_0,\eta_0}$-Donsker class, and that both $P_{\theta_0,\eta_0}\left\|\dot\ell_{\theta,\eta} - \dot\ell_{\theta_0,\eta_0}\right\|^2 \xrightarrow{P} 0$ and $\sup_{h\in\mathcal{H}}P_{\theta_0,\eta_0}\left|B_{\theta,\eta}h - B_{\theta_0,\eta_0}h\right|^2 \xrightarrow{P} 0$, as $(\theta,\eta) \to (\theta_0,\eta_0)$. Also assume that $\Psi$ is Fréchet-differentiable at $(\theta_0,\eta_0)$ with derivative $\dot\Psi_0: \mathbb{R}^k \times \overline{\text{lin}}\,H \to \mathbb{R}^k \times \ell^\infty(\mathcal{H})$ that is continuously invertible and onto its range, with inverse $\dot\Psi_0^{-1}: \mathbb{R}^k \times \ell^\infty(\mathcal{H}) \to \mathbb{R}^k \times \overline{\text{lin}}\,H$. Then, provided $(\hat\theta_n,\hat\eta_n)$ is consistent for $(\theta_0,\eta_0)$ and $\Psi_n(\hat\theta_n,\hat\eta_n) = o_P(n^{-1/2})$ (uniformly over $\mathbb{R}^k \times \ell^\infty(\mathcal{H})$), $(\hat\theta_n,\hat\eta_n)$ is efficient at $(\theta_0,\eta_0)$ and $\sqrt{n}(\hat\theta_n - \theta_0, \hat\eta_n - \eta_0) \rightsquigarrow -\dot\Psi_0^{-1}Z$, where $Z$ is the Gaussian limiting distribution of $\sqrt{n}\Psi_n(\theta_0,\eta_0)$.

The proof of this will be given later in Part III (Page 379). A function $f: U \to V$ is onto if, for every $v \in V$, there exists a $u \in U$ with $v = f(u)$. Note that while $\mathcal{H}$ can be a subset of the tangent set used for information calculations, it must be rich enough to ensure that the inverse of $\dot\Psi_0$ exists.

The efficiency in Corollary 3.2 can be shown to follow from the score operator calculations given in Section 3.2, but we will postpone further details until Part III. As was done at the end of Section 3.2, the above discussion can be completely re-expressed in terms of a single parameter model $\{P_\eta: \eta \in H\}$ with a single score operator $A_\eta: \mathcal{H}_\eta \to L_2(P_\eta)$, where $H$ is a richer parameter set, including, for example, both $\Theta$ and $H$ as defined in the previous paragraphs, and where $\mathcal{H}_\eta$ is similarly enriched to include the tangent sets for all subcomponents of the model.

3.4 Other Topics

Other topics of importance include frequentist and Bayesian methods for constructing confidence sets. We will focus primarily on frequentist approaches in this book and only briefly discuss Bayesian methods. While the bootstrap is generally valid in the setting of Corollary 3.2, it is unclear whether this remains true when the nuisance parameter converges at a rate slower than $\sqrt{n}$, even if interest is limited to the parametric component. Even when the bootstrap is valid, it may be excessively cumbersome to re-estimate the entire model for many bootstrapped data sets. We will explore this issue in more detail in Part III. We now present one approach to hypothesis testing and variance estimation for the parametric component $\theta$ of the semiparametric model $\{P_{\theta,\eta}: \theta \in \Theta, \eta \in H\}$. This approach is often valid even when the nuisance parameter is not $\sqrt{n}$ consistent.

Murphy and van der Vaart (2000) demonstrated that, under reasonable regularity conditions, the log-profile likelihood $pl_n(\theta)$ (profiling over the nuisance parameter) admits the following expansion about the maximum likelihood estimator $\hat\theta_n$ for the parametric component:

$$\log pl_n(\tilde\theta_n) = \log pl_n(\hat\theta_n) - \frac{1}{2}n(\tilde\theta_n - \hat\theta_n)'\tilde{I}_{\theta_0,\eta_0}(\tilde\theta_n - \hat\theta_n) + o_P\left(\sqrt{n}\|\tilde\theta_n - \theta_0\| + 1\right)^2, \quad (3.11)$$

for any estimator $\tilde\theta_n \xrightarrow{P} \theta_0$. This can be shown to lead naturally to chi-square tests of full versus reduced models. Furthermore, this result demonstrates that the curvature of the log-profile likelihood can serve as a consistent estimator of the efficient information for $\theta$ at $\theta_0$, $\tilde{I}_{\theta_0,\eta_0}$, and thereby permit estimation of the limiting variance of $\sqrt{n}(\hat\theta_n - \theta_0)$. We will discuss this method, along with several other methods of inference, in greater detail toward the end of Part III.
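As a toy illustration of reading the efficient information off the curvature of a log-profile likelihood, here is a sketch of ours (not from the text) using the normal location model, chosen because the nuisance variance can be profiled out in closed form: the numerical second derivative of $\log pl_n$ at $\hat\theta_n$, divided by $-n$, recovers $1/\hat\sigma^2$, the efficient information for the mean.

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.normal(loc=2.0, scale=1.5, size=5000)
n = x.size

def log_profile_lik(theta):
    # normal model with mean theta; the variance (nuisance parameter) is
    # profiled out analytically: eta_hat(theta) = mean((x - theta)^2)
    return -0.5 * n * np.log(np.mean((x - theta) ** 2)) - 0.5 * n

theta_hat = x.mean()                       # the profile MLE of the mean
h = 1e-4                                   # central-difference step
curv = (log_profile_lik(theta_hat + h) - 2.0 * log_profile_lik(theta_hat)
        + log_profile_lik(theta_hat - h)) / h ** 2
info_hat = -curv / n                       # curvature-based information estimate
sigma2_hat = np.mean((x - theta_hat) ** 2)
# efficient information for the mean of a normal is 1 / sigma^2
assert abs(info_hat - 1.0 / sigma2_hat) < 1e-3
```

In genuinely semiparametric models the profiling step is numerical rather than closed-form, but the curvature-based variance estimate works the same way.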

3.5 Exercises

3.5.1. Consider $\mathcal{P} = \{P_\theta: \theta \in \Theta\}$, where $\Theta$ is an open subset of $\mathbb{R}^k$. Assume that $\mathcal{P}$ is dominated by $\mu$, and that $\dot\ell_\theta(x) = \partial\log p_\theta(x)/(\partial\theta)$ exists with $P_\theta[\dot\ell_\theta\dot\ell'_\theta]$ bounded and $P_\theta\|\dot\ell_{\theta'} - \dot\ell_\theta\|^2 \to 0$ as $\theta' \to \theta$. For each $h \in \mathbb{R}^k$, let $\epsilon > 0$ be small enough so that $\{P_t: t \in N_\epsilon\} \subset \mathcal{P}$, where for each $t \in N_\epsilon$, $P_t = P_{\theta+th}$. Show that each of these one-dimensional submodels satisfies (3.1), with $g = h'\dot\ell_\theta$.

3.5.2. Consider the paragraphs leading up to the efficient score for $\beta$ in the Cox model, given in Expression (3.3). Show that the tangent set for the nuisance parameter $\Lambda$, $\dot{\mathcal{P}}^{(\Lambda)}_{P_{\beta,\Lambda}}$, spans all square-integrable score functions for $\Lambda$ generated by parametric submodels (where the parameters for $\Lambda$ are independent of $\beta$).

3.5.3. Let $L_n(\beta,\Lambda)$ be the Cox model likelihood given in (3.9). Show that the corresponding profile likelihood for $\beta$, $pL_n(\beta)$, obtained by maximizing over $\Lambda$, equals $\exp\left[-\sum_{i=1}^n d_i\right]$ times the partial likelihood (3.4).

3.6 Notes

The linear regression example was partially inspired by Example 25.28 ofvan der Vaart (1998), and Theorem 3.1 is his Theorem 25.54.


4 Case Studies I

We now expand upon several examples introduced in Chapters 1–3 to morefully illustrate the methods and theory we have outlined thus far. Certaintechnical aspects which involve concepts introduced later in the book will beglossed over to avoid getting bogged down with details. The main objectiveof this chapter is to initiate an appreciation for what empirical processesand efficiency calculations can accomplish.

The first example is linear regression with either mean zero or median zero residuals. In addition to efficiency calculations for model parameters, empirical processes are needed for inference on the distribution of the residuals. The second example is counting process regression for both general counting processes and the Cox model for right-censored failure time data. Empirical processes will be needed for parameter inference, and efficiency will be established under the Cox proportional hazards model for maximum likelihood estimation of both the regression parameters and the baseline hazard. The third example is the Kaplan-Meier estimator of the survival function for right-censored failure time data. Since weak convergence of the Kaplan-Meier estimator has already been established in Chapter 2 using empirical processes, the focus in this chapter will be on efficiency calculations. The fourth example considers estimating equations for general regression models when the residual variance may be a function of the covariates. Estimation of the variance function is needed for efficient estimation. We also consider optimality of a certain class of estimating equations. The general results are illustrated with both simple linear regression and a Poisson mixture regression model. In the latter case, the mixture distribution is not $\sqrt{n}$ consistent in the uniform norm, the proof of which fact we omit. The fifth, and final, example is partly linear logistic regression. The emphasis will be on efficient estimation of the parametric regression parameter. For the last two examples, both empirical processes and efficiency calculations will be needed.

4.1 Linear Regression

The semiparametric linear regression model is $Y = \beta'Z + e$, where we observe $X = (Y,Z)$ and assume $E\|Z\|^2 < \infty$, $E[e|Z] = 0$ and $E[e^2|Z] \le K < \infty$ almost surely, $Z$ includes the constant 1, and $E[ZZ']$ is full rank. The model for $X$ is $\{P_{\beta,\eta}: \beta \in \mathbb{R}^k, \eta \in H\}$, where $\eta$ is the joint density of the residual $e$ and covariate $Z$, with partial derivative with respect to the first argument $\eta_1$, which we assume satisfies $\eta_1/\eta \in L_2(P_{\beta,\eta})$ and hence $\eta_1/\eta \in L_2^0(P_{\beta,\eta})$. We consider this model under two assumptions on $\eta$: the first is that the residuals have conditional mean zero, i.e., $\int_{\mathbb{R}}u\,\eta(u,Z)\,du = 0$ almost surely; and the second is that the residuals have median zero, i.e., $\int_{\mathbb{R}}\mathrm{sign}(u)\,\eta(u,Z)\,du = 0$ almost surely.

4.1.1 Mean Zero Residuals

We have already demonstrated in Section 3.2 that the usual least squares estimator $\hat\beta = \left[\mathbb{P}_n ZZ'\right]^{-1}\mathbb{P}_n ZY$ is $\sqrt{n}$ consistent but not always efficient for $\beta$ when the only assumption we are willing to make is that the residuals have mean zero conditional on the covariates. The basic argument for this was taken from the form of the efficient score $\tilde\ell_{\beta,\eta}(Y,Z) = Z(Y - \beta'Z)/P_{\beta,\eta}[e^2|Z]$, which yields a distinctly different estimator than $\hat\beta$ when $z \mapsto P_{\beta,\eta}[e^2|Z = z]$ is non-constant in $z$. In Section 4.4 of this chapter, we will describe a data-driven procedure for efficient estimation in this context based on approximating the efficient score.
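The empirical version of the efficient score equation, $\mathbb{P}_n[Z(Y-\beta'Z)/\sigma^2(Z)] = 0$, is just weighted least squares with weights $1/\sigma^2(Z)$. A sketch of ours (not from the text): in the hypothetical example below we hand the solver the true variance function; the data-driven procedure of Section 4.4 replaces it with an estimate.

```python
import numpy as np

def weighted_ls(Y, Z, sigma2):
    # solves P_n[Z (Y - beta'Z) / sigma^2(Z)] = 0: weighted least squares
    W = 1.0 / sigma2
    return np.linalg.solve((Z.T * W) @ Z, (Z.T * W) @ Y)

# heteroscedastic example with var(e | Z) = Z_1^2, beta = (1, 2)
rng = np.random.default_rng(0)
n = 5000
Z = np.column_stack([np.ones(n), rng.uniform(0.5, 2.0, n)])
sigma2 = Z[:, 1] ** 2
Y = Z @ np.array([1.0, 2.0]) + rng.normal(0.0, np.sqrt(sigma2))
beta_w = weighted_ls(Y, Z, sigma2)
assert np.max(np.abs(beta_w - np.array([1.0, 2.0]))) < 0.15
```

With `sigma2` set identically to 1, the same routine reduces to the ordinary least squares estimator $\hat\beta$.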

For now, however, we turn our attention to efficient estimation when we also assume that the covariates are independent of the residual. Accordingly, we now denote $\eta$ to be the density of the residual $e$ and $\dot\eta$ to be the derivative of $\eta$. We discussed this model in Chapter 1 and pointed out that $\hat\beta$ is still not efficient in this setting. We also claimed that an empirical estimator $\hat{F}$ of the residual distribution, based on the residuals $Y_1 - \hat\beta'Z_1,\dots,Y_n - \hat\beta'Z_n$, had the property that $\sqrt{n}(\hat{F} - F)$ converges in a uniform sense to a certain Gaussian quantity. We now derive the efficient score for estimating $\beta$ in this model and sketch a proof of the claimed convergence of $\sqrt{n}(\hat{F} - F)$.

We assume that $\eta$ is continuously differentiable with $P_{\beta,\eta}(\dot\eta/\eta)^2 < \infty$. Using techniques described in Section 3.2, it is not hard to verify that the tangent set for the full model is the linear span of $-(\dot\eta/\eta)(e)a'Z + b(e)$, as $a$ spans $\mathbb{R}^k$ and $b$ spans the tangent set $\dot{\mathcal{P}}^{(\eta)}_{P_{\beta,\eta}}$ for $\eta$, consisting of all functions in $L_2^0(e)$ which are orthogonal to $e$, where we use $L_2^0(U)$ to denote all mean zero real functions $f$ of the random variable $U$ for which $Pf^2(U) < \infty$. The structure of the tangent set follows from the constraint that $\int_{\mathbb{R}}e\,\eta(e)\,de = 0$.

The projection of the usual score for $\beta$, $\dot\ell_{\beta,\eta} \equiv -(\dot\eta/\eta)(e)Z$, onto $\dot{\mathcal{P}}^{(\eta)}_{P_{\beta,\eta}}$ is a function $h \in \dot{\mathcal{P}}^{(\eta)}_{P_{\beta,\eta}}$ such that $-(\dot\eta/\eta)(e)Z - h(e)$ is uncorrelated with $b(e)$ for all $b \in \dot{\mathcal{P}}^{(\eta)}_{P_{\beta,\eta}}$. We will now verify that $h(e) = -(\dot\eta/\eta)(e)\mu - e\mu/\sigma^2$, where $\mu \equiv E[Z]$ and $\sigma^2 \equiv E[e^2]$. It is easy to see that $h$ is square-integrable. Moreover, $h$ has mean zero since $\int_{\mathbb{R}}\dot\eta(e)\,de = 0$; and since $\int_{\mathbb{R}}e\,\dot\eta(e)\,de = -1$, $h$ is orthogonal to $e$. Thus $h \in \dot{\mathcal{P}}^{(\eta)}_{P_{\beta,\eta}}$. It also follows that $-(\dot\eta/\eta)(e)Z - h(e)$ is orthogonal to $\dot{\mathcal{P}}^{(\eta)}_{P_{\beta,\eta}}$, after noting that $-(\dot\eta/\eta)(e)Z - h(e) = -(\dot\eta/\eta)(e)(Z-\mu) + e\mu/\sigma^2 \equiv \tilde\ell_{\beta,\eta}$ is orthogonal to all square-integrable mean zero functions $b(e)$ which satisfy $P_{\beta,\eta}[b(e)e] = 0$.

Thus the efficient information is

$$\tilde{I}_{\beta,\eta} \equiv P_{\beta,\eta}\left[\tilde\ell_{\beta,\eta}\tilde\ell'_{\beta,\eta}\right] = P_{\beta,\eta}\left[\left(\frac{\dot\eta}{\eta}(e)\right)^2(Z-\mu)(Z-\mu)'\right] + \frac{\mu\mu'}{\sigma^2}. \quad (4.1)$$
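The cross terms in the product $\tilde\ell_{\beta,\eta}\tilde\ell'_{\beta,\eta}$ vanish here; spelling out the step (using the independence of $e$ and $Z$, $E[Z-\mu] = 0$, and $\int_{\mathbb{R}}e\,\dot\eta(e)\,de = -1$):

```latex
P_{\beta,\eta}\!\left[-\frac{\dot\eta}{\eta}(e)\,(Z-\mu)\,\frac{e\,\mu'}{\sigma^2}\right]
  = -\,E\!\left[e\,\frac{\dot\eta}{\eta}(e)\right] E\!\left[Z-\mu\right]\frac{\mu'}{\sigma^2}
  = -(-1)\cdot 0 \cdot \frac{\mu'}{\sigma^2} = 0,
```

and similarly for its transpose, which leaves exactly the two terms displayed in (4.1).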

Since

$$1 = \left(\int_{\mathbb{R}}e\,\dot\eta(e)\,de\right)^2 = \left(\int_{\mathbb{R}}e\,\frac{\dot\eta}{\eta}(e)\,\eta(e)\,de\right)^2 \le \sigma^2\,P_{\beta,\eta}\left(\frac{\dot\eta}{\eta}(e)\right)^2,$$

we have that

$$(4.1) \ge P_{\beta,\eta}\left[\frac{(Z-\mu)(Z-\mu)'}{\sigma^2}\right] + \frac{\mu\mu'}{\sigma^2} = \frac{P[ZZ']}{\sigma^2},$$

where, for two $k \times k$ symmetric matrices $A$ and $B$, $A \ge B$ means that $A - B$ is positive semidefinite. Thus the efficient estimator can have strictly lower variance than the least squares estimator $\hat\beta$.

Developing a procedure for calculating an asymptotically efficient estimator appears to require nonparametric estimation of $\dot\eta/\eta$. We will show in the semiparametric inference case studies of Chapter 22 how to accomplish this via the residual distribution estimator $\hat{F}$ defined above. We now turn our attention to verifying that $\sqrt{n}(\hat{F} - F)$ converges weakly to a tight, mean zero Gaussian process. This result can also be used to check normality of the residuals. Recall that if the residuals are Gaussian, the least squares estimator is fully efficient.

Now, using $P = P_{\beta,\eta}$,

$$\sqrt{n}(\hat{F}(v) - F(v)) = \sqrt{n}\left[\mathbb{P}_n 1\{Y - \hat\beta'Z \le v\} - P\,1\{Y - \beta'Z \le v\}\right]$$
$$= \sqrt{n}(\mathbb{P}_n - P)1\{Y - \hat\beta'Z \le v\} + \sqrt{n}P\left[1\{Y - \hat\beta'Z \le v\} - 1\{Y - \beta'Z \le v\}\right]$$
$$\equiv U_n(v) + V_n(v).$$


We will show in Part II that $\left\{1\{Y - b'Z \le v\}: b \in \mathbb{R}^k, v \in \mathbb{R}\right\}$ is a VC (and hence Donsker) class of functions. Thus, since

$$\sup_{v\in\mathbb{R}}P\left[1\{Y - \hat\beta'Z \le v\} - 1\{Y - \beta'Z \le v\}\right]^2 \xrightarrow{P} 0,$$

we have that $U_n(v) = \sqrt{n}(\mathbb{P}_n - P)1\{Y - \beta'Z \le v\} + \epsilon_n(v)$, where $\sup_{v\in\mathbb{R}}|\epsilon_n(v)| \xrightarrow{P} 0$. It is not difficult to show that

$$V_n(v) = \sqrt{n}\,P\left[\int_v^{v+(\hat\beta-\beta)'Z}\eta(u)\,du\right]. \quad (4.2)$$

We leave it as an exercise (see Exercise 4.6.1) to show that, for any $u,v \in \mathbb{R}$,

$$|\eta(u) - \eta(v)| \le |F(u) - F(v)|^{1/2}\left(P\left(\frac{\dot\eta}{\eta}(e)\right)^2\right)^{1/2}. \quad (4.3)$$

Thus $\eta$ is both bounded and equicontinuous, and thus by (4.2), $V_n(v) = \sqrt{n}(\hat\beta - \beta)'\mu\,\eta(v) + \epsilon_n(v)$, where $\sup_{v\in\mathbb{R}}|\epsilon_n(v)| \xrightarrow{P} 0$. Hence $\hat{F}$ is asymptotically linear with influence function

$$\check\psi(v) = 1\{e \le v\} - F(v) + e\,Z'P[ZZ']^{-1}\mu\,\eta(v),$$

and thus, since this influence function is contained in a Donsker class (as the sum of two obviously Donsker classes), $\sqrt{n}(\hat{F} - F)$ converges weakly to a tight, mean zero Gaussian process.
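The estimator $\hat F$ itself is straightforward to compute from the fitted residuals; a small sketch of ours (the names `residual_edf` and `F_hat` are hypothetical, not from the text):

```python
import numpy as np

def residual_edf(Y, Z, beta_hat):
    """Empirical distribution function of the estimated residuals
    Y_i - beta_hat' Z_i."""
    e_hat = np.sort(Y - Z @ beta_hat)
    def F_hat(v):
        # fraction of estimated residuals <= v
        return np.searchsorted(e_hat, v, side="right") / e_hat.size
    return F_hat

# tiny check on three points whose residuals are 0, 1, 2
F = residual_edf(np.array([1.0, 2.0, 3.0]), np.ones((3, 1)), np.array([1.0]))
assert F(-0.1) == 0.0 and F(0.0) == 1 / 3 and F(2.0) == 1.0
```

Plotting this $\hat F$ against a Gaussian distribution function gives the normality check mentioned above.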

4.1.2 Median Zero Residuals

We have already established in Section 2.2.6 that when the residuals have median zero and are independent of the covariates, the least-absolute-deviation estimator $\hat\beta \equiv \mathrm{argmin}_{b\in\mathbb{R}^k}\mathbb{P}_n|Y - b'Z|$ is consistent for $\beta$, and $\sqrt{n}(\hat\beta - \beta)$ is asymptotically linear with influence function

$$\check\psi = \left[2\eta(0)\right]^{-1}P[ZZ']^{-1}Z\,\mathrm{sign}(e).$$

Thus $\sqrt{n}(\hat\beta - \beta)$ is asymptotically mean zero Gaussian with covariance equal to $\left[4\eta^2(0)\right]^{-1}P[ZZ']^{-1}$. In this section, we will study efficiency for this model and show that $\hat\beta$ is not in general fully efficient. Before doing this, however, we will briefly study efficiency in the more general model where we only assume $E[\mathrm{sign}(e)|Z] = 0$ almost surely.
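The least-absolute-deviation fit referenced here can be computed, for example, by iteratively reweighted least squares; the sketch below is ours (one standard algorithm, not the book's). With an intercept-only design it recovers the sample median, the defining property of the LAD criterion.

```python
import numpy as np

def lad(Y, Z, iters=200, eps=1e-8):
    """Least-absolute-deviation fit via iteratively reweighted least
    squares: |r| is majorized by r^2 / max(|r|, eps) at the current
    residuals, giving a weighted least-squares problem at each step."""
    beta = np.linalg.lstsq(Z, Y, rcond=None)[0]   # start from least squares
    for _ in range(iters):
        w = 1.0 / np.maximum(np.abs(Y - Z @ beta), eps)
        beta = np.linalg.solve((Z.T * w) @ Z, (Z.T * w) @ Y)
    return beta

# intercept-only design: the LAD fit is the sample median, unmoved by
# the outlying observation 10.0
b = lad(np.array([1.0, 2.0, 10.0]), np.ones((3, 1)))
assert abs(b[0] - 2.0) < 1e-3
```

The insensitivity to the outlier in this tiny example previews the 50% breakdown point discussed at the end of this section.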

Under this more general model, the joint density $\eta$ of $(e,Z)$ must satisfy $\int_{\mathbb{R}}\mathrm{sign}(e)\,\eta(e,Z)\,de = 0$ almost surely. As we did when we studied the conditionally mean zero residual case in Section 3.2, assume $\eta$ has partial derivative with respect to the first argument, $\eta_1$, satisfying $\eta_1/\eta \in L_2(P_{\beta,\eta})$. Clearly, $(\eta_1/\eta)(e,Z)$ also has mean zero. The score for $\beta$, assuming $\eta$ is known, is $\dot\ell_{\beta,\eta} = -Z(\eta_1/\eta)(Y - \beta'Z, Z)$.

Similar to what was done in Section 3.2 for the conditionally mean zero case, it can be shown that the tangent set $\dot{\mathcal{P}}^{(\eta)}_{P_{\beta,\eta}}$ for $\eta$ is the subset of $L_2^0(P_{\beta,\eta})$ consisting of all functions $g(e,Z) \in L_2^0(P_{\beta,\eta})$ which satisfy

$$E[\mathrm{sign}(e)g(e,Z)|Z] = \frac{\int_{\mathbb{R}}\mathrm{sign}(e)\,g(e,Z)\,\eta(e,Z)\,de}{\int_{\mathbb{R}}\eta(e,Z)\,de} = 0,$$

almost surely. It can also be shown that this set is the orthocomplement in $L_2^0(P_{\beta,\eta})$ of all functions of the form $\mathrm{sign}(e)f(Z)$, where $f$ satisfies $P_{\beta,\eta}f^2(Z) < \infty$. Hence the efficient score $\tilde\ell_{\beta,\eta}$ is the projection in $L_2^0(P_{\beta,\eta})$ of $-Z(\eta_1/\eta)(e,Z)$ onto $\{\mathrm{sign}(e)f(Z): P_{\beta,\eta}f^2(Z) < \infty\}$. Hence

$$\tilde\ell_{\beta,\eta}(Y,Z) = -Z\,\mathrm{sign}(e)\int_{\mathbb{R}}\eta_1(e,Z)\,\mathrm{sign}(e)\,de = 2Z\,\mathrm{sign}(e)\,\eta(0,Z),$$

where the second equality follows from the facts that $\int_{\mathbb{R}}\eta_1(e,Z)\,de = 0$ and $\int_{-\infty}^0\eta_1(e,Z)\,de = \eta(0,Z)$. When $\eta(0,z)$ is non-constant in $z$, $\tilde\ell_{\beta,\eta}(Y,Z)$ is not proportional to $Z\,\mathrm{sign}(Y - \beta'Z)$, and thus the least-absolute-deviation estimator is not efficient in this instance. Efficient estimation in this situation appears to require estimation of $\eta(0,Z)$, but we will not pursue this further.

We now return our attention to the setting where the median zero residuals are independent of the covariates. We will also assume that $0 < \eta(0) < \infty$. Recall that in this setting, the usual score for $\beta$ is $\dot\ell_{\beta,\eta} \equiv -Z(\dot\eta/\eta)(e)$, where $\dot\eta$ is the derivative of the density $\eta$ of $e$. If we temporarily make the fairly strong assumption that the residuals have a Laplace density (i.e., $\eta(e) = (\nu/2)\exp(-\nu|e|)$ for a real parameter $\nu > 0$), then the usual score simplifies to $\nu Z\,\mathrm{sign}(e)$, and thus the least-absolute-deviation estimator is fully efficient for this special case.

Relaxing the assumptions to allow an arbitrary median zero density, we can follow arguments similar to those used in the previous section to obtain that the tangent set for $\eta$, $\dot{\mathcal{P}}^{(\eta)}_{P_{\beta,\eta}}$, consists of all functions in $L_2^0(e)$ which are orthogonal to $\mathrm{sign}(e)$. The structure of the tangent set follows from the median zero residual constraint, which can be expressed as $\int_{\mathbb{R}}\mathrm{sign}(e)\,\eta(e)\,de = 0$. The projection of $\dot\ell_{\beta,\eta}$ onto $\dot{\mathcal{P}}^{(\eta)}_{P_{\beta,\eta}}$ is a function $h \in \dot{\mathcal{P}}^{(\eta)}_{P_{\beta,\eta}}$ such that $-(\dot\eta/\eta)(e)Z - h(e)$ is uncorrelated with $b(e)$ for all $b \in \dot{\mathcal{P}}^{(\eta)}_{P_{\beta,\eta}}$.

We will now prove that $h(e) = -(\dot\eta/\eta)(e)\mu - 2\eta(0)\,\mathrm{sign}(e)\mu$ satisfies the above constraints. First, it is easy to see that $h(e)$ is square-integrable. Second, since $-\int_{\mathbb{R}}\mathrm{sign}(e)\,\dot\eta(e)\,de = 2\eta(0)$, we have that $\mathrm{sign}(e)h(e)$ has zero expectation. Thus $h \in \dot{\mathcal{P}}^{(\eta)}_{P_{\beta,\eta}}$. Third, it is straightforward to verify that $\dot\ell_{\beta,\eta}(Y,Z) - h(e) = -(\dot\eta/\eta)(e)(Z-\mu) + 2\eta(0)\,\mathrm{sign}(e)\mu$ is orthogonal to all elements of $\dot{\mathcal{P}}^{(\eta)}_{P_{\beta,\eta}}$. Hence the efficient score is $\tilde\ell_{\beta,\eta} = \dot\ell_{\beta,\eta} - h$, and thus the efficient information is

. Hence the efficient score is ℓβ,η = ℓβ,η − h. Thus theefficient information is

Iβ,η = Pβ,η

[

ℓβ,ηℓ′β,η

]

= Pβ,η

[

η

η(e)

2

(Z − µ)(Z − µ)′]

+ 4η2(0)µµ′.

Note that

$$4\eta^2(0) = 4\left[\int_{-\infty}^0\dot\eta(e)\,de\right]^2 = 4\left[\int_{-\infty}^0\frac{\dot\eta}{\eta}(e)\,\eta(e)\,de\right]^2$$
$$\le 4\int_{-\infty}^0\left(\frac{\dot\eta}{\eta}(e)\right)^2\eta(e)\,de\int_{-\infty}^0\eta(e)\,de = 2\int_{-\infty}^0\left(\frac{\dot\eta}{\eta}(e)\right)^2\eta(e)\,de. \quad (4.4)$$

Similar arguments yield that

$$4\eta^2(0) \le 2\int_0^\infty\left(\frac{\dot\eta}{\eta}(e)\right)^2\eta(e)\,de.$$

Combining this last inequality with (4.4), we obtain that

$$4\eta^2(0) \le \int_{\mathbb{R}}\left(\frac{\dot\eta}{\eta}(e)\right)^2\eta(e)\,de.$$

Hence the efficient estimator can have strictly lower variance than the least-absolute-deviation estimator in this situation.

While the least-absolute-deviation estimator is only guaranteed to befully efficient for Laplace distributed residuals, it does have excellent ro-bustness properties. In particular, the breakdown point of this estimator is50%. The breakdown point is the maximum proportion of contamination—from arbitrarily large symmetrically distributed residuals—tolerated by theestimator without resulting in inconsistency.

4.2 Counting Process Regression

We now examine in detail the counting process regression model considered in Chapter 1. The observed data are $X = (N,Y,Z)$, where for $t \in [0,\tau]$, $N(t)$ is a counting process and $Y(t) = 1\{V \ge t\}$ is an at-risk process based on a random time $V \ge 0$ which may depend on $N$, with $PY(0) = 1$, $\inf_Z P[Y(\tau)|Z] > 0$ and $PN^2(\tau) < \infty$, and where $Z \in \mathbb{R}^k$ is a regression covariate. The regression model (1.3) is assumed, implying $E\{dN(t)|Z\} = E\{Y(t)|Z\}e^{\beta'Z}\,d\Lambda(t)$, for some $\beta \in B \subset \mathbb{R}^k$ and continuous nondecreasing function $\Lambda(t)$ with $\Lambda(0) = 0$ and $0 < \Lambda(\tau) < \infty$. We assume $Z$ is restricted to a bounded set, $\mathrm{var}(Z)$ is positive definite, and that $B$ is open, convex and bounded. We first consider inference for $\beta$ and $\Lambda$ in this general model, and then examine the specialization to the Cox proportional hazards model for right-censored failure time data.

4.2.1 The General Case

As described in Chapter 1, we estimate $\beta$ with the estimating equation $U_n(t,\beta) = \mathbb{P}_n\int_0^t\left[Z - E_n(s,\beta)\right]dN(s)$, where

$$E_n(t,\beta) = \frac{\mathbb{P}_n\left[ZY(t)e^{\beta'Z}\right]}{\mathbb{P}_n\left[Y(t)e^{\beta'Z}\right]}.$$

Specifically, the estimator $\hat\beta$ is a root of $U_n(\tau,\beta) = 0$. The estimator for $\Lambda$ is

$$\hat\Lambda(t) = \int_0^t\left\{\mathbb{P}_n\left[Y(s)e^{\hat\beta'Z}\right]\right\}^{-1}\mathbb{P}_n\,dN(s).$$

We first show that $\hat\beta$ is consistent for $\beta$ and that $\sqrt{n}(\hat\beta - \beta)$ is asymptotically normal with a consistently estimable covariance matrix. We then derive consistency and weak convergence results for $\hat\Lambda$ and suggest a simple method of inference based on the influence function.

We first argue that certain classes of functions are Donsker, and thereforealso Glivenko-Cantelli. We first present the following lemma:

Lemma 4.1 For −∞ < a < b < ∞, Let X(t), t ∈ [a, b] be a monotone

cadlag or caglad stochastic process with P [|X(a)| ∨ |X(b)|]2 <∞. Then Xis P -Donsker.

The proof will be given later in Part II. Note that we usually speak of classes of functions as being Donsker or Glivenko-Cantelli, but in Lemma 4.1 we are saying this about a process. Let $\mathcal{X}$ be the sample space for the stochastic process $\{X(t) : t \in T\}$. Then $\sqrt{n}(P_n - P)X$ converges weakly in $\ell^\infty(T)$ to a tight, mean zero Gaussian process if and only if $\mathcal{F} = \{f_t : t \in T\}$ is $P$-Donsker, where for any $x \in \mathcal{X}$ and $t \in T$, $f_t(x) = x(t)$. Viewed in this manner, this modified use of the term Donsker is, in fact, not a modification at all.

Since $Y$ and $N$ both satisfy the conditions of Lemma 4.1, they are both Donsker as processes in $\ell^\infty([0, \tau])$. Trivially, the classes $\{\beta \in B\}$ and $\{Z\}$ are both Donsker, and therefore so is $\{\beta'Z : \beta \in B\}$, since products of bounded Donsker classes are Donsker. The class $\{e^{\beta'Z} : \beta \in B\}$ is then Donsker since exponentiation is Lipschitz continuous on compacts. Hence $\{Y(t)e^{\beta'Z} : t \in [0, \tau], \beta \in B\}$, $\{ZY(t)e^{\beta'Z} : t \in [0, \tau], \beta \in B\}$, and $\{ZZ'Y(t)e^{\beta'Z} : t \in [0, \tau], \beta \in B\}$ are all Donsker, since they are all products of bounded Donsker classes.
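The Donsker property asserted above is what licenses the weak-convergence steps that follow. As a purely illustrative numerical sketch (our own; not part of the text's argument), one can simulate $\sqrt{n}(P_n - P)$ over the indicator class $\{1\{X \le t\}\}$ and watch the distribution of its supremum stabilize as $n$ grows, as Donsker's theorem predicts; the sample sizes and replication count below are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def sup_emp_process(n):
    """sup_t |sqrt(n) (F_n(t) - F(t))| for n Uniform(0,1) draws."""
    x = np.sort(rng.uniform(size=n))
    i = np.arange(1, n + 1)
    # the supremum is attained just before or after a jump of F_n
    return np.sqrt(n) * np.maximum(i / n - x, x - (i - 1) / n).max()

for n in (50, 500, 5000):
    sups = [sup_emp_process(n) for _ in range(1000)]
    print(n, round(float(np.median(sups)), 2))
```

The printed medians settle near the median of the Kolmogorov distribution (about 0.83), reflecting that the limiting law of the supremum no longer depends on $n$.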


Now the derivative of $U_n(\tau, \beta)$ with respect to $\beta$ can be shown to be $-V_n(\beta)$, where
$$V_n(\beta) = \int_0^\tau \left[ \frac{P_n\, ZZ'Y(t)e^{\beta'Z}}{P_n\, Y(t)e^{\beta'Z}} - \left( \frac{P_n\, ZY(t)e^{\beta'Z}}{P_n\, Y(t)e^{\beta'Z}} \right)^{\otimes 2} \right] P_n\, dN(t),$$
and where the superscript $\otimes 2$ denotes outer product. Because all of the classes involved are Glivenko-Cantelli and the limiting values of the denominators are bounded away from zero, we have $\sup_{\beta \in B} |V_n(\beta) - V(\beta)| \stackrel{as*}{\to} 0$, where
$$V(\beta) = \int_0^\tau \left[ \frac{P\, ZZ'Y(t)e^{\beta'Z}}{P\, Y(t)e^{\beta'Z}} - \left( \frac{P\, ZY(t)e^{\beta'Z}}{P\, Y(t)e^{\beta'Z}} \right)^{\otimes 2} \right] P\left[ Y(t)e^{\beta'Z} \right] d\Lambda(t). \tag{4.5}$$

After some work, it can be shown that there exists a $c > 0$ such that $V(\beta) \ge c\, \mathrm{var}(Z)$, where for two symmetric matrices $A_1, A_2$, $A_1 \ge A_2$ means that $A_1 - A_2$ is positive semidefinite. Thus the criterion function with gradient $U_n(\tau, \cdot)$ is, for all $n$ large enough, almost surely strictly concave, and $\hat\beta$ is almost surely consistent for the true parameter $\beta_0$.

Using algebra, $U_n(\tau, \beta) = P_n \int_0^\tau [Z - E_n(s, \beta)]\, dM_\beta(s)$, where $M_\beta(t) = N(t) - \int_0^t Y(s) e^{\beta'Z}\, d\Lambda_0(s)$ and $\Lambda_0$ is the true value of $\Lambda$. Let $U(t, \beta) = P \int_0^t [Z - E(s, \beta)]\, dM_\beta(s)$, where
$$E(t, \beta) = \frac{P\left[ ZY(t)e^{\beta'Z} \right]}{P\left[ Y(t)e^{\beta'Z} \right]}.$$

It is not difficult to verify that
$$\sqrt{n}\left[ U_n(\tau, \beta) - U(\tau, \beta) \right] - \sqrt{n}\left[ U_n(\tau, \beta_0) - U(\tau, \beta_0) \right] \stackrel{P}{\to} 0,$$
since
$$\sqrt{n}\, P_n \int_0^\tau \left[ E_n(s, \beta) - E(s, \beta) \right] dM_\beta(s) \tag{4.6}$$
$$= \sqrt{n}\, P_n \int_0^\tau \left[ E_n(s, \beta) - E(s, \beta) \right] \left\{ dM_{\beta_0}(s) - Y(s)\left[ e^{\beta'Z} - e^{\beta_0'Z} \right] d\Lambda_0(s) \right\} \stackrel{P}{\to} 0.$$

This follows from the following lemma (which we prove later in Part II), with $[a, b] = [0, \tau]$, $A_n(t) = E_n(t, \beta) - E(t, \beta)$, and $B_n(t) = \sqrt{n}\, P_n M_{\beta_0}(t)$:

Lemma 4.2 Let $B_n \in D[a, b]$ and $A_n \in \ell^\infty([a, b])$ be either cadlag or caglad, and assume $\sup_{t \in (a, b]} |A_n(t)| \stackrel{P}{\to} 0$, $A_n$ has uniformly bounded total variation, and $B_n$ converges weakly to a tight, mean zero process with sample paths in $D[a, b]$. Then $\int_a^b A_n(s)\, dB_n(s) \stackrel{P}{\to} 0$.


Thus the conditions of the Z-estimator master theorem, Theorem 2.11, are all satisfied, and $\sqrt{n}(\hat\beta - \beta_0)$ converges weakly to a mean zero random vector with covariance
$$C = V^{-1}(\beta_0)\, P\left\{ \int_0^\tau \left[ Z - E(s, \beta_0) \right] dM_{\beta_0}(s) \right\}^{\otimes 2} V^{-1}(\beta_0).$$
With a little additional work, it can be verified that $C$ can be consistently estimated with
$$\hat C = V_n^{-1}(\hat\beta)\, P_n\left\{ \int_0^\tau \left[ Z - E_n(s, \hat\beta) \right] d\hat M(s) \right\}^{\otimes 2} V_n^{-1}(\hat\beta),$$
where $\hat M(t) = N(t) - \int_0^t Y(s) e^{\hat\beta'Z}\, d\hat\Lambda(s)$, and where
$$\hat\Lambda(t) = \int_0^t \frac{P_n\, dN(s)}{P_n\, Y(s) e^{\hat\beta'Z}},$$
as defined in Chapter 1. Recall that $\Lambda_0$ is the true value of $\Lambda$. Then

$$\hat\Lambda(t) - \Lambda_0(t) = \int_0^t 1\{P_n Y(s) > 0\}\, \frac{(P_n - P)\, dN(s)}{P_n\, Y(s)e^{\hat\beta'Z}} - \int_0^t 1\{P_n Y(s) = 0\}\, \frac{P\, dN(s)}{P\, Y(s)e^{\hat\beta'Z}}$$
$$- \int_0^t \frac{(P_n - P)\, Y(s)e^{\hat\beta'Z}}{P\, Y(s)e^{\hat\beta'Z}} \cdot \frac{P\, dN(s)}{P_n\, Y(s)e^{\hat\beta'Z}} - \int_0^t \frac{P\left[ Y(s)\left( e^{\hat\beta'Z} - e^{\beta_0'Z} \right) \right]}{P\, Y(s)e^{\hat\beta'Z}}\, d\Lambda_0(s)$$
$$= A_n(t) - B_n(t) - C_n(t) - D_n(t).$$

By the smoothness of these functions of $\beta$ and the almost sure consistency of $\hat\beta$, each of the processes $A_n$, $B_n$, $C_n$ and $D_n$ converges uniformly to zero, and thus $\sup_{t \in [0, \tau]} |\hat\Lambda(t) - \Lambda_0(t)| \stackrel{as*}{\to} 0$. Since $P\{P_n Y(t) = 0\} \le [P(V < \tau)]^n$, $B_n(t) = o_P(n^{-1/2})$, where the $o_P(n^{-1/2})$ term is uniform in $t$. It is also not hard to verify that $A_n(t) = (P_n - P)A(t) + o_P(n^{-1/2})$ and $C_n(t) = (P_n - P)C(t) + o_P(n^{-1/2})$, where
$$A(t) = \int_0^t \left[ P\, Y(s)e^{\beta_0'Z} \right]^{-1} dN(s), \qquad C(t) = \int_0^t \left[ P\, Y(s)e^{\beta_0'Z} \right]^{-1} Y(s)e^{\beta_0'Z}\, d\Lambda_0(s),$$
and both remainder terms are uniform in $t$. In addition,
$$D_n(t) = (\hat\beta - \beta_0)' \int_0^t \frac{P\, ZY(s)e^{\beta_0'Z}}{P\, Y(s)e^{\beta_0'Z}}\, d\Lambda_0(s) + o_P(n^{-1/2}),$$
where the remainder term is again uniform in $t$.


Taking this all together, we obtain the expansion
$$\sqrt{n}\left[ \hat\Lambda(t) - \Lambda_0(t) \right] = \sqrt{n}(P_n - P) \int_0^t \frac{dM_{\beta_0}(s)}{P\, Y(s)e^{\beta_0'Z}} - \sqrt{n}(\hat\beta - \beta_0)' \int_0^t E(s, \beta_0)\, d\Lambda_0(s) + o_P(1),$$
where the remainder term is uniform in $t$. By previous arguments, $\sqrt{n}(\hat\beta - \beta_0) = V^{-1}(\beta_0)\, \sqrt{n}\, P_n \int_0^\tau [Z - E(s, \beta_0)]\, dM_{\beta_0}(s) + o_P(1)$, and thus $\hat\Lambda$ is asymptotically linear with influence function
$$\psi(t) = \int_0^t \frac{dM_{\beta_0}(s)}{P\, Y(s)e^{\beta_0'Z}} - \left\{ \int_0^\tau \left[ Z - E(s, \beta_0) \right]' dM_{\beta_0}(s) \right\} V^{-1}(\beta_0) \int_0^t E(s, \beta_0)\, d\Lambda_0(s). \tag{4.7}$$

Since $\{\psi(t) : t \in [0, \tau]\}$ is Donsker, $\sqrt{n}(\hat\Lambda - \Lambda_0)$ converges weakly in $D[0, \tau]$ to a tight, mean zero Gaussian process $\mathcal{Z}$ with covariance $P[\psi(s)\psi(t)]$. Let $\hat\psi(t)$ be $\psi(t)$ with $\hat\beta$ and $\hat\Lambda$ substituted for $\beta_0$ and $\Lambda_0$, respectively. After some additional analysis, it can be verified that
$$\sup_{t \in [0, \tau]} P_n\left[ \hat\psi(t) - \psi(t) \right]^2 \stackrel{P}{\to} 0.$$

Since $\psi$ has envelope $kN(\tau)$ for some fixed $k < \infty$, the class $\{\psi(s)\psi(t) : s, t \in [0, \tau]\}$ is Glivenko-Cantelli, and thus $P_n[\hat\psi(s)\hat\psi(t)]$ is uniformly consistent (in probability) for $P[\psi(s)\psi(t)]$ by the discussion at the beginning of Section 2.2.3 and the smoothness of the involved functions in $\beta$ and $\Lambda$. However, this is not particularly helpful for inference, and the following approach is better. Let $\xi$ be standard normal and independent of the data $X = (N, Y, Z)$, and consider the wild bootstrap $\hat\Delta(t) = P_n[\xi \hat\psi(t)]$. Although some work is needed, it can be shown that $\sqrt{n}\hat\Delta$ converges weakly to $\mathcal{Z}$, conditionally on the data, in probability. This is computationally quite simple, since it requires saving $\hat\psi_1(t_j), \ldots, \hat\psi_n(t_j)$ only at the observed jump points $t_1, \ldots, t_{m_n}$, drawing a sample of standard normals $\xi_1, \ldots, \xi_n$, evaluating $\sup_{1 \le j \le m_n} |\hat\Delta(t_j)|$, and repeating often enough to obtain a reliable estimate of the $(1 - \alpha)$-level quantile $\hat c_\alpha$. The confidence band $\{\hat\Lambda(t) \pm \hat c_\alpha : t \in [0, \tau]\}$ thus has approximate coverage $1 - \alpha$. A number of modifications are possible, including one in which the width of the band at $t$ is roughly proportional to the variance of $\sqrt{n}(\hat\Lambda(t) - \Lambda_0(t))$.
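The multiplier-bootstrap recipe just described can be sketched in code as follows. Everything here is a hypothetical illustration: `psi_hat` is assumed to be an $n \times m$ matrix already holding the estimated influence functions $\hat\psi_i(t_j)$ at the observed jump points $t_1 < \cdots < t_m$.

```python
import numpy as np

def wild_bootstrap_critical_value(psi_hat, alpha=0.05, n_boot=1000, seed=None):
    """Estimate c_alpha for the band {Lambda_hat(t) +/- c_alpha : t in [0, tau]}.

    psi_hat : (n, m) array, psi_hat[i, j] = hat{psi}_i(t_j).
    """
    rng = np.random.default_rng(seed)
    n = psi_hat.shape[0]
    sups = np.empty(n_boot)
    for b in range(n_boot):
        xi = rng.standard_normal(n)        # multipliers, independent of the data
        delta = xi @ psi_hat / n           # Delta_hat(t_j) = P_n[xi * psi_hat(t_j)]
        sups[b] = np.abs(np.sqrt(n) * delta).max()
    # (1 - alpha) quantile of sup_j |sqrt(n) Delta_hat(t_j)|, on the Lambda scale
    return float(np.quantile(sups, 1.0 - alpha)) / np.sqrt(n)
```

Each bootstrap replicate only redraws the $n$ multipliers, so the loop is cheap relative to refitting the model.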


4.2.2 The Cox Model

For the Cox regression model applied to right-censored failure time data, we observe $X = (W, \delta, Z)$, where $W = T \wedge C$, $\delta = 1\{W = T\}$, $Z \in \mathbb{R}^k$ is a regression covariate, $T$ is a right-censored failure time with integrated hazard $e^{\beta'Z}\Lambda(t)$ given the covariate, and $C$ is a censoring time independent of $T$ given $Z$. We also assume that censoring is uninformative of $\beta$ and $\Lambda$. This is a special case of the general counting process regression model of the previous section, with $N(t) = \delta 1\{W \le t\}$ and $Y(t) = 1\{W \ge t\}$. The consistency of $\hat\beta$ (a zero of $U_n(\tau, \cdot)$) and of $\hat\Lambda$ both follow from the previous general results, as do the asymptotic normality and the validity of the multiplier bootstrap based on the estimated influence function. There are, however, some special features of the Cox model that are of interest, including the martingale structure, the limiting covariance, and the efficiency of the estimators.

First, it is not difficult to show that $M_{\beta_0}(t)$ and $U_n(t, \beta_0)$ are both continuous-time martingales (Fleming and Harrington, 1991; Andersen, Borgan, Gill and Keiding, 1993). This implies that
$$P\left\{ \int_0^\tau \left[ Z - E(s, \beta_0) \right] dM_{\beta_0}(s) \right\}^{\otimes 2} = V(\beta_0),$$
and thus the asymptotic limiting variance of $\sqrt{n}(\hat\beta - \beta_0)$ is simply $V^{-1}(\beta_0)$.

Thus $\hat\beta$ is efficient. In addition, the results of this section verify Condition (3.7) and the non-displayed conditions of Theorem 3.1 for $\hat\ell_{\beta,n}$ defined in (3.8). Verification of (3.6) is left as an exercise (see Exercise 4.6.4). This provides another proof of the efficiency of $\hat\beta$. What remains to be verified is that $\hat\Lambda$ is also efficient (in the uniform sense). The influence function for $\hat\Lambda$ is $\psi$ given in (4.7). We will now use the methods of Section 3.2 to verify that $\psi(u)$ is the efficient influence function for the parameter $\Lambda(u)$, for each $u \in [0, \tau]$. This will imply that $\hat\Lambda$ is uniformly efficient for $\Lambda$, since $\sqrt{n}(\hat\Lambda - \Lambda)$ converges weakly to a tight, mean zero Gaussian process. This last argument will be vindicated in Theorem 18.9 on Page 341.

The tangent space for the Cox model (for both parameters together) is $\{A(a, b) = \int_0^\tau [Z'a + b(s)]\, dM_\beta(s) : (a, b) \in H\}$, where $H = \mathbb{R}^k \times L_2(\Lambda)$. Thus $A : H \to L_2^0(P_{\beta,\Lambda})$ is a score operator for the full model. The natural inner product for pairs of elements in $H$ is $\langle (a, b), (c, d) \rangle_H = a'c + \int_0^\tau b(s)d(s)\, d\Lambda(s)$, for $(a, b), (c, d) \in H$; and the natural inner product for $g, h \in L_2^0(P_{\beta,\Lambda})$ is $\langle g, h \rangle_{P_{\beta,\Lambda}} = P_{\beta,\Lambda}[gh]$. Hence the adjoint of $A$, $A^*$, satisfies $\langle A(a, b), g \rangle_{P_{\beta,\Lambda}} = \langle (a, b), A^*g \rangle_H$ for all $(a, b) \in H$ and all $g \in L_2^0(P_{\beta,\Lambda})$. It can be shown that
$$A^*g = \left( P_{\beta,\Lambda}\left[ \int_0^\tau Z\, dM_\beta(s)\, g \right],\; \frac{P_{\beta,\Lambda}\left[ dM_\beta(t)\, g \right]}{d\Lambda(t)} \right)$$

satisfies these equations and is thus the adjoint of $A$. Let the one-dimensional submodel $\{P_t : t \in N_\epsilon\}$ be a perturbation in the direction $A(a, b)$, for some $(a, b) \in H$. In Section 3.2, we showed that $\Lambda_t(u) = \int_0^u (1 + tb(s))\, d\Lambda(s) + o(t)$, and thus, for the parameter $\chi(P_{\beta,\Lambda}) = \Lambda(u)$,
$$\left. \frac{\partial \chi(P_t)}{\partial t} \right|_{t=0} = \int_0^u b(s)\, d\Lambda(s).$$
Thus $\langle \tilde\chi, (a, b) \rangle_H = \int_0^u b(s)\, d\Lambda(s)$, and therefore $\tilde\chi = (0, 1\{s \le u\}) \in H$ ($s$ is the function argument here). Our proof will be complete if we can show that $A^*\psi = \tilde\chi$, since this would imply that $\psi(u)$ is the efficient influence function for $\Lambda(u)$. First,

$$P_{\beta,\Lambda}\left[ \int_0^\tau Z\, dM_\beta(s)\, \psi(u) \right] = P_{\beta,\Lambda}\left\{ \int_0^\tau Z\, dM_\beta(s) \int_0^u \frac{dM_\beta(s)}{P_{\beta,\Lambda}\left[ Y(s)e^{\beta'Z} \right]} \right\}$$
$$- P_{\beta,\Lambda}\left\{ \int_0^\tau Z\, dM_\beta(s) \int_0^\tau \left[ Z - E(s, \beta) \right]' dM_\beta(s) \right\} V^{-1}(\beta) \int_0^u E(s, \beta)\, d\Lambda(s)$$
$$= \int_0^u E(s, \beta)\, d\Lambda(s) - V(\beta)V^{-1}(\beta) \int_0^u E(s, \beta)\, d\Lambda(s) = 0.$$

Second, it is not difficult to verify that $P_{\beta,\Lambda}\left[ dM_\beta(s)\, \psi(u) \right] = 1\{s \le u\}\, d\Lambda(s)$, and thus we obtain the desired result. Therefore, both parameter estimators $\hat\beta$ and $\hat\Lambda$ are uniformly efficient for estimating $\beta$ and $\Lambda$ in the Cox model, since asymptotic tightness plus pointwise efficiency implies uniform efficiency (see Theorem 18.9 in Part III).

4.3 The Kaplan-Meier Estimator

In this section, we consider the same right-censored failure time data considered in Section 4.2.2, except that there is no regression covariate. The precise set-up is described in Section 2.2.5. Thus $T$ and $C$ are assumed to be independent, where $T$ has distribution function $F$ and $C$ has distribution function $G$. Assume also that $F(0) = 0$. We let $P_F$ denote the probability measure for the observed data $X = (W, \delta)$, and allow both $F$ and $G$ to have jumps. Define $S = 1 - F$, $L = 1 - G$, and $\pi(t) = P_F Y(t)$, and let $\tau \in (0, \infty)$ satisfy $F(\tau) > 0$ and $\pi(\tau) > 0$. Also define
$$\hat\Lambda(t) = \int_0^t \left[ P_n Y(s) \right]^{-1} P_n\, dN(s).$$
The Kaplan-Meier estimator $\hat S$ has the product integral form
$$\hat S(t) = \prod_{0 < s \le t} \left[ 1 - d\hat\Lambda(s) \right].$$
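The product-integral form translates directly into code; the following is a minimal sketch (function and variable names are ours), with one factor in the product per observed failure time:

```python
import numpy as np

def kaplan_meier(w, delta):
    """Kaplan-Meier estimate S_hat(t) = prod_{0 < s <= t} [1 - dLambda_hat(s)].

    w     : observed times W_i = T_i ^ C_i
    delta : indicators 1{W_i = T_i}
    Returns the failure times and S_hat evaluated at each of them.
    """
    w = np.asarray(w, dtype=float)
    delta = np.asarray(delta, dtype=int)
    times = np.unique(w[delta == 1])                        # jump points of N
    at_risk = np.array([(w >= t).sum() for t in times])     # n P_n Y(t)
    events = np.array([((w == t) & (delta == 1)).sum() for t in times])
    return times, np.cumprod(1.0 - events / at_risk)        # the product integral

times, s_hat = kaplan_meier([3, 1, 4, 1, 5], [1, 0, 1, 1, 0])
print(times, s_hat)   # step function values at the failure times 1, 3, 4
```

Note that a subject censored exactly at a failure time is still counted as at risk there, matching $Y(t) = 1\{W \ge t\}$.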

We have already established in Section 2.2.5, using the self-consistency representation of $\hat S$, that $\hat S$ is uniformly consistent for $S$ over $[0, \tau]$ and that $\sqrt{n}[\hat S - S]$ converges weakly in $D[0, \tau]$ to a tight, mean zero Gaussian process. In this section, we verify that $\hat S$ is also uniformly efficient. We first derive the influence function $\psi$ for $\hat S$, and then show that it satisfies the appropriate version of the adjoint formula (3.5) for estimating $S(u)$, for each $u \in [0, \tau]$. This pointwise efficiency will then imply uniform efficiency because of the weak convergence of $\sqrt{n}[\hat S - S]$.


Standard calculations (Fleming and Harrington, 1991, Chapter 3) reveal that
$$\hat S(u) - S(u) = -S(u) \int_0^u \frac{\hat S(v-)}{S(v)} \left\{ d\hat\Lambda(v) - d\Lambda(v) \right\}$$
$$= -S(u) \int_0^u 1\{P_n Y(v) > 0\}\, \frac{\hat S(v-)}{S(v)}\, \frac{P_n\, dM(v)}{P_n Y(v)} + S(u) \int_0^u 1\{P_n Y(v) = 0\}\, \frac{\hat S(v-)}{S(v)}\, d\Lambda(v)$$
$$= A_n(u) + B_n(u),$$
where $M(v) = N(v) - \int_0^v Y(s)\, d\Lambda(s)$ is a martingale. Since $\pi(\tau) > 0$, $B_n(u) = o_P(n^{-1/2})$. Using martingale methods, it can be shown that $A_n(u) = P_n \psi(u) + o_P(n^{-1/2})$, where
$$\psi(u) = -S(u) \int_0^u \frac{1}{1 - \Delta\Lambda(v)}\, \frac{dM(v)}{\pi(v)}.$$

Since all error terms are uniform for $u \in [0, \tau]$, we obtain that $\hat S$ is asymptotically linear with influence function $\psi$. As an element of $L_2^0(P_F)$, $\psi(u)(W, \delta) = g(W, \delta)$, where
$$g(W, 1) = -S(u) \left[ \frac{1\{W \le u\}}{\left[ 1 - \Delta\Lambda(W) \right] \pi(W)} - \int_0^u \frac{1\{W \ge s\}\, d\Lambda(s)}{\left[ 1 - \Delta\Lambda(s) \right] \pi(s)} \right]$$
and
$$g(W, 0) = S(u) \int_0^u \frac{1\{W \ge s\}\, d\Lambda(s)}{\left[ 1 - \Delta\Lambda(s) \right] \pi(s)}.$$

For each $h \in L_2^0(F)$, there exists a one-dimensional submodel $\{F_t : t \in N_\epsilon\}$, with $F_t(v) = \int_0^v (1 + th(s))\, dF(s) + o(t)$. This is clearly the maximal tangent set for $F$. This collection of submodels can be shown to generate the tangent set for the observed data model $\{P_F : F \in \mathcal{D}\}$, where $\mathcal{D}$ is the collection of all failure time distribution functions with $F(0) = 0$, via the score operator $A : L_2^0(F) \to L_2^0(P_F)$ defined by
$$(Ah)(W, \delta) = \delta h(W) + (1 - \delta)\, \frac{\int_W^\infty h(v)\, dF(v)}{S(W)}.$$
For $a, b \in L_2^0(F)$, let $\langle a, b \rangle_F = \int_0^\infty a(s)b(s)\, dF(s)$; and for $h_1, h_2 \in L_2^0(P_F)$, let $\langle h_1, h_2 \rangle_{P_F} = P_F[h_1 h_2]$. Thus the adjoint of $A$ must satisfy $\langle Ah, g \rangle_{P_F} = \langle h, A^*g \rangle_F$. Accordingly,

$$P_F[(Ah)g] = P_F\left[ \delta h(W) g(W, 1) + (1 - \delta)\, \frac{\int_W^\infty h(v)\, dF(v)}{S(W)}\, g(W, 0) \right]$$
$$= \int_0^\infty g(v, 1) L(v-) h(v)\, dF(v) + \int_0^\infty \frac{\int_w^\infty h(v)\, dF(v)}{S(w)}\, g(w, 0)\, S(w)\, dG(w)$$
$$= \int_0^\infty g(v, 1) L(v-) h(v)\, dF(v) - \int_0^\infty \left[ \int_{[v,\infty]} g(w, 0)\, dG(w) \right] h(v)\, dF(v),$$
by the fact that $\int_s^\infty h(v)\, dF(v) = -\int_0^s h(v)\, dF(v)$ and by changing the order of integration on the right-hand side. Thus
$$A^*g(v) = g(v, 1) L(v-) - \int_{[v,\infty]} g(s, 0)\, dG(s).$$

With $S_t(u) = 1 - F_t(u)$ based on a submodel perturbed in the direction $h \in L_2^0(F)$, we have that
$$\left. \frac{\partial S_t(u)}{\partial t} \right|_{t=0} = \int_u^\infty h(v)\, dF(v) = -\int_0^u h(v)\, dF(v) = \langle \tilde\chi, h \rangle_F,$$
where $\tilde\chi = -1\{v \le u\}$. We now verify that $A^*[\psi(u)] = \tilde\chi$, and thereby prove that $\hat S(u)$ is efficient for estimating the parameter $S(u)$, since it can be shown that $\psi \in R(A)$. We now have

$$\left( A^*[\psi(u)] \right)(v) = \psi(u)(v, 1) L(v-) - \int_{[v,\infty]} \psi(u)(s, 0)\, dG(s) \tag{4.8}$$
$$= -\frac{S(u)}{S(v)}\, 1\{v \le u\} + S(u) L(v-) \int_0^u \frac{1\{v \ge s\}\, d\Lambda(s)}{\left[ 1 - \Delta\Lambda(s) \right] \pi(s)} - \int_{[v,\infty]} S(u) \int_0^u \frac{1\{s \ge r\}\, d\Lambda(r)}{\left[ 1 - \Delta\Lambda(r) \right] \pi(r)}\, dG(s).$$
Since $\int_{[v,\infty]} 1\{s \ge r\}\, dG(s) = L([v \vee r]-) = 1\{v \ge r\} L(v-) + 1\{v < r\} L(r-)$, we now have that
$$(4.8) = -\frac{S(u)}{S(v)}\, 1\{v \le u\} - S(u) \int_0^u \frac{1\{v < r\} L(r-)\, d\Lambda(r)}{\left[ 1 - \Delta\Lambda(r) \right] \pi(r)}$$
$$= -1\{v \le u\} \left[ \frac{1}{S(v)} + \int_v^u \frac{dF(r)}{S(r) S(r-)} \right] S(u) = -1\{v \le u\}.$$
Thus (3.5) is satisfied, and we obtain the result that $\hat S$ is pointwise, and therefore uniformly, efficient for estimating $S$.

4.4 Efficient Estimating Equations for Regression

We now consider a generalization of the conditionally-mean-zero-residual linear regression model considered previously. A typical observation is assumed to be $X = (Y, Z)$, where $Y = g_\theta(Z) + e$, $E\{e|Z\} = 0$, $Z, \theta \in \mathbb{R}^k$, and $g_\theta(Z)$ is a known, sufficiently smooth function of $\theta$. In addition to linear regression, generalized linear models, as well as many nonlinear regression models, fall into this structure. We assume that $(Z, e)$ has a density $\eta$, and, therefore, that the observation $(Y, Z)$ has density $\eta(y - g_\theta(z), z)$, with the only restriction being that $\int_{\mathbb{R}} e\, \eta(e, z)\, de = 0$. As we observed in Section 3.2, these conditions force the score functions for $\eta$ to be all square-integrable functions $a(e, z)$ which satisfy
$$E\{e\, a(e, Z)\,|\,Z\} = \frac{\int_{\mathbb{R}} e\, a(e, Z)\, \eta(e, Z)\, de}{\int_{\mathbb{R}} \eta(e, Z)\, de} = 0$$


almost surely. As also demonstrated in Section 3.2, the above equality implies that the tangent space for $\eta$ is the orthocomplement in $L_2^0(P_{\theta,\eta})$ of the set $\mathcal{H}$ of all functions of the form $e\, h(Z)$, where $E h^2(Z) < \infty$.

Hence, as pointed out previously, the efficient score for $\theta$ is obtained by projecting the ordinary score $\dot\ell_{\theta,\eta}(e, z) = -\left[ \eta_1(e, z)/\eta(e, z) \right] \dot g_\theta(z)$ onto $\mathcal{H}$, where $\eta_1$ is the derivative of $\eta$ with respect to its first argument and $\dot g_\theta$ is the derivative of $g_\theta$ with respect to $\theta$. Of course, we are assuming that these derivatives exist and are square-integrable. Since the projection of an arbitrary $b(e, z)$ onto $\mathcal{H}$ is $e\, E\{e\, b(e, Z)|Z\}/V(Z)$, where $V(z) \equiv E\{e^2|Z = z\}$, the efficient score for $\theta$ is
$$\tilde\ell_{\theta,\eta}(Y, Z) = -\frac{\dot g_\theta(Z)\, e \int_{\mathbb{R}} \eta_1(u, Z)\, u\, du}{V(Z) \int_{\mathbb{R}} \eta(u, Z)\, du} = \frac{\dot g_\theta(Z)(Y - g_\theta(Z))}{V(Z)}. \tag{4.9}$$
This, of course, implies the result in Section 3.2 for the special case of linear regression.

In practice, the form of $V(Z)$ is typically not known and needs to be estimated; denote this estimator $\hat V(Z)$. It can be shown that, even if $\hat V$ is not consistent but converges to some $\tilde V$ possibly different from $V$, the estimating equation
$$S_n(\theta) \equiv P_n\left[ \frac{\dot g_\theta(Z)(Y - g_\theta(Z))}{\hat V(Z)} \right]$$
is approximately equivalent to the estimating equation
$$\tilde S_n(\theta) \equiv P_n\left[ \frac{\dot g_\theta(Z)(Y - g_\theta(Z))}{\tilde V(Z)} \right],$$
both of which can still yield $\sqrt{n}$-consistent estimators of $\theta$. The closer $\tilde V$ is to $V$, the more efficient the estimator based on solving $S_n(\theta) = 0$ will be. Another variant of the question of optimality is: for what choice of $\tilde V$ will the estimator obtained by solving $\tilde S_n(\theta) = 0$ yield the smallest possible variance? Godambe (1960) showed that, for univariate $\theta$, the answer is $\tilde V = V$. This result is not based on semiparametric efficiency analysis, but is obtained by minimizing the limiting variance of the estimator over all "reasonable" choices of $\tilde V$.

For any real, measurable function $w$ of $z \in \mathbb{R}^k$ (measurable on the probability space for $Z$), let
$$S_{n,w}(\theta) \equiv P_n\left[ \dot g_\theta(Z)(Y - g_\theta(Z))\, w(Z)/V(Z) \right],$$
let $U(z) \equiv \dot g_\theta(z) \dot g_\theta'(z)/V(z)$, and let $\tilde\theta_{n,w}$ be a solution of $S_{n,w}(\theta) = 0$. Standard methods can be used to show that if both $E[U(Z)]$ and $E[U(Z)w(Z)]$ are positive definite, then the limiting variance of $\sqrt{n}(\tilde\theta_{n,w} - \theta)$ is
$$\left\{ E[U(Z)w(Z)] \right\}^{-1} E[U(Z)w^2(Z)] \left\{ E[U(Z)w(Z)] \right\}^{-1}.$$
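This sandwich formula is easy to probe numerically. In the toy setup below (our own choices: scalar $\theta$, $g_\theta(z) = \theta z$, $V(z) = 1 + z^2$, $Z$ standard normal), Monte Carlo integration approximates the asymptotic variance for two weights: $w = 1$, which corresponds to weighting by $1/V$, and $w = V$, which corresponds to unweighted least squares.

```python
import numpy as np

rng = np.random.default_rng(1)
z = rng.standard_normal(200_000)   # Monte Carlo sample for the expectations

V = 1.0 + z**2                     # assumed conditional variance of Y given Z
U = z**2 / V                       # U(z) = g_dot(z)^2 / V(z) with g_theta(z) = theta z

def asy_var(w):
    """E[Uw]^{-1} E[Uw^2] E[Uw]^{-1} for scalar theta."""
    return float((U * w**2).mean() / (U * w).mean() ** 2)

v_opt = asy_var(np.ones_like(z))   # w = 1: estimating equation weighted by 1/V
v_ols = asy_var(V)                 # w = V: unweighted (ordinary least squares)
print(v_opt, v_ols)                # v_opt is the smaller of the two
```

The $w = 1$ value also equals $E[U]^{-1}$, the lower bound appearing in Proposition 4.3.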


The following proposition yields Godambe's (1960) result generalized to arbitrary $k \ge 1$:

Proposition 4.3 Assume $E[U(Z)]$ is positive definite. Then, for any real, $Z$-measurable function $w$ for which $E[U(Z)w(Z)]$ is positive definite,
$$C_{0,w} \equiv \left\{ E[U(Z)w(Z)] \right\}^{-1} E[U(Z)w^2(Z)] \left\{ E[U(Z)w(Z)] \right\}^{-1} - \left\{ E[U(Z)] \right\}^{-1}$$
is positive semidefinite.

Proof. Define $B(z) \equiv \left\{ E[U(Z)w(Z)] \right\}^{-1} w(z) - \left\{ E[U(Z)] \right\}^{-1}$, and note that $C_w(Z) \equiv B(Z) U(Z) B'(Z)$ must therefore be positive semidefinite. The desired result now follows since $E[C_w(Z)] = C_{0,w}$.$\square$

Note that when $A - B$ is positive semidefinite, for two $k \times k$ variance matrices $A$ and $B$, we know that $B$ is the smaller variance: for any $v \in \mathbb{R}^k$, $v'Av \ge v'Bv$. Thus the choice $w = 1$ yields the minimum variance; in other words, the choice $\tilde V = V$ yields the lowest possible variance among estimators asymptotically equivalent to the solution of $\tilde S_n(\theta) = 0$.

We now verify that estimation based on solving $S_n(\theta) = 0$ is asymptotically equivalent to estimation based on solving $\tilde S_n(\theta) = 0$, under reasonable regularity conditions. Assume that $\hat\theta$ satisfies $S_n(\hat\theta) = o_P(n^{-1/2})$ and that $\hat\theta \stackrel{P}{\to} \theta$. Assume also that there exists a $P$-Donsker class $\mathcal{G}$ such that the inner probability that $\hat V \in \mathcal{G}$ is $> 1 - \epsilon$ for all $n$ large enough and all $\epsilon > 0$, and that, for some $\tau > 0$, the class
$$\mathcal{F}_1 = \left\{ \frac{\dot g_{\theta_1}(Z)(Y - g_{\theta_2}(Z))}{W(Z)} : \|\theta_1 - \theta\| \le \tau,\, \|\theta_2 - \theta\| \le \tau,\, W \in \mathcal{G} \right\}$$
is $P$-Donsker, and the class
$$\mathcal{F}_2 = \left\{ \frac{\dot g_{\theta_1}(Z) \dot g_{\theta_2}'(Z)}{W(Z)} : \|\theta_1 - \theta\| \le \tau,\, \|\theta_2 - \theta\| \le \tau,\, W \in \mathcal{G} \right\}$$
is $P$-Glivenko-Cantelli. We also need that, for any $\check\theta \stackrel{P}{\to} \theta$,
$$P\left[ \frac{\dot g_{\check\theta}(Z)}{\hat V(Z)} - \frac{\dot g_\theta(Z)}{\tilde V(Z)} \right]^2 \stackrel{P}{\to} 0, \tag{4.10}$$
and
$$P\left[ \frac{\dot g_{\check\theta}(Z) \dot g_{\check\theta}'(Z)}{\hat V(Z)} \right] - P\left[ \frac{\dot g_\theta(Z) \dot g_\theta'(Z)}{\tilde V(Z)} \right] \stackrel{P}{\to} 0. \tag{4.11}$$

We have the following lemma:


Lemma 4.4 Assume that $E[Y|Z = z] = g_\theta(z)$, that
$$U_0 \equiv E\left[ \frac{\dot g_\theta(Z) \dot g_\theta'(Z)}{\tilde V(Z)} \right]$$
is positive definite, that $\hat\theta$ satisfies $S_n(\hat\theta) = o_P(n^{-1/2})$, and that $\hat\theta \stackrel{P}{\to} \theta$. Suppose also that $\mathcal{F}_1$ is $P$-Donsker, that $\mathcal{F}_2$ is $P$-Glivenko-Cantelli, and that both (4.10) and (4.11) hold for any $\check\theta \stackrel{P}{\to} \theta$. Then $\sqrt{n}(\hat\theta - \theta)$ is asymptotically normal with variance
$$U_0^{-1}\, E\left[ \frac{\dot g_\theta(Z) \dot g_\theta'(Z)\, V(Z)}{\tilde V^2(Z)} \right] U_0^{-1}.$$
Moreover, if $\tilde V = V$, $Z$-almost surely, then $\hat\theta$ is optimal in the sense of Proposition 4.3.

Proof. For any $h \in \mathbb{R}^k$,
$$o_P(n^{-1/2}) = h' P_n\left[ \frac{\dot g_{\hat\theta}(Z)(Y - g_{\hat\theta}(Z))}{\hat V(Z)} \right]$$
$$= h' P_n\left[ \frac{\dot g_\theta(Z)(Y - g_\theta(Z))}{\tilde V(Z)} \right] + h' P_n\left[ \left\{ \frac{\dot g_{\hat\theta}(Z)}{\hat V(Z)} - \frac{\dot g_\theta(Z)}{\tilde V(Z)} \right\} (Y - g_\theta(Z)) \right] - h' P_n\left[ \frac{\dot g_{\hat\theta}(Z) \dot g_{\bar\theta}'(Z)}{\hat V(Z)} \right] (\hat\theta - \theta),$$
where $\bar\theta$ is on the line segment between $\hat\theta$ and $\theta$. Now the conditions of the lemma can be seen to imply that
$$\sqrt{n}(\hat\theta - \theta) = \sqrt{n}\, U_0^{-1} P_n\left[ \frac{\dot g_\theta(Z)(Y - g_\theta(Z))}{\tilde V(Z)} \right] + o_P(1),$$
and the first conclusion of the lemma follows. The final conclusion now follows directly from Proposition 4.3.$\square$

The conditions of Lemma 4.4 are easily satisfied for a variety of regression models and variance estimators $\hat V$. For example, if $g_\theta(Z) = (1 + e^{-\theta'Z})^{-1}$ is the conditional expectation of a Bernoulli outcome $Y$ given $Z$, and both $\theta$ and $Z$ are assumed to be bounded, then all the conditions are easily satisfied with $\hat\theta$ a zero of $S_n$ and $\hat V(Z) = g_{\hat\theta}(Z)\left[ 1 - g_{\hat\theta}(Z) \right]$, provided $E[ZZ']$ is positive definite. Note that for this example, $\hat\theta$ is also semiparametric efficient. We now consider two additional examples in some detail. The first example is a special case of the semiparametric model discussed at the beginning of this section: the model is simple linear regression, but with an unspecified form for $V(Z)$. Note that, for general $g_\theta$, if the conditions of Lemma 4.4 are satisfied with $\tilde V = V$, then the estimator $\hat\theta$ is semiparametric efficient by the form of the efficient score given in (4.9). The second example considers estimation of a semiparametric Poisson mixture regression model, where the mixture induces extra-Poisson variation. We will develop an optimal estimating equation procedure in the sense of Proposition 4.3. Unfortunately, it is unclear in this instance how to strengthen this result to obtain semiparametric efficiency.
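For the Bernoulli example just given, $\dot g_\theta(z) = g_\theta(z)[1 - g_\theta(z)]\,z$, so with $\hat V(Z) = g_\theta(Z)[1 - g_\theta(Z)]$ the estimating equation $S_n(\theta) = 0$ reduces at its root to the familiar logistic score equation $P_n[Z(Y - g_\theta(Z))] = 0$. A minimal Newton-type solver for that equation (our own sketch, with simulated data and arbitrary parameter values) is:

```python
import numpy as np

def nu(t):
    return 1.0 / (1.0 + np.exp(-t))

def solve_score(Z, Y, max_iter=50, tol=1e-10):
    """Newton iteration for the score equation P_n[Z (Y - nu(theta'Z))] = 0."""
    theta = np.zeros(Z.shape[1])
    for _ in range(max_iter):
        mu = nu(Z @ theta)
        score = Z.T @ (Y - mu) / len(Y)
        # derivative of the score is -P_n[Z Z' V(Z)] with V = mu (1 - mu)
        deriv = -(Z * (mu * (1.0 - mu))[:, None]).T @ Z / len(Y)
        step = np.linalg.solve(deriv, score)
        theta = theta - step
        if np.abs(step).max() < tol:
            break
    return theta

rng = np.random.default_rng(2)
Z = np.column_stack([np.ones(4000), rng.uniform(-1.0, 1.0, size=4000)])
theta_true = np.array([0.5, -1.0])
Y = rng.binomial(1, nu(Z @ theta_true)).astype(float)
print(solve_score(Z, Y))   # close to theta_true
```

Because $Z$ is bounded here, the Newton iteration is well behaved and the derivative matrix stays negative definite.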

4.4.1 Simple Linear Regression

Consider simple linear regression based on a univariate $Z$. Let $\theta = (\alpha, \beta)' \in \mathbb{R}^2$ and assume $g_\theta(Z) = \alpha + \beta Z$. We also assume that the support of $Z$ is a known compact interval $[a, b]$. We will use a modified kernel method of estimating $V(z) \equiv E[e^2|Z = z]$. Let the kernel $L : \mathbb{R} \to [0, 1]$ satisfy $L(x) = 0$ for all $|x| > 1$ and $L(x) = 1 - |x|$ otherwise, and let $h \le (b - a)/2$ be the bandwidth for this kernel. For the sample $(Y_1, Z_1), \ldots, (Y_n, Z_n)$, let $\hat\theta = (\hat\alpha, \hat\beta)'$ be the usual least-squares estimator of $\theta$, and let $\hat e_i = Y_i - \hat\alpha - \hat\beta Z_i$ be the estimated residuals. Let $F_n(u)$ be the empirical distribution of $Z_1, \ldots, Z_n$, and let $H_n(u) \equiv n^{-1} \sum_{i=1}^n \hat e_i^2\, 1\{Z_i \le u\}$. We denote by $F$ the true distribution of $Z$, with density $f$, and also define $H(z) \equiv \int_a^z V(u)\, dF(u)$. Now define

$$\hat V(z) \equiv \frac{\int_{\mathbb{R}} h^{-1} L\left( \frac{z-u}{h} \right) H_n(du)}{\int_{\mathbb{R}} h^{-1} L\left( \frac{z-u}{h} \right) F_n(du)}$$
for $z \in [a + h, b - h]$, and let $\hat V(z) = \hat V(a + h)$ for all $z \in [a, a + h)$ and $\hat V(z) = \hat V(b - h)$ for all $z \in (b - h, b]$.

We need to assume in addition that both $F$ and $V$ are twice differentiable with second derivatives uniformly bounded on $[a, b]$, that for some $M < \infty$ we have $M^{-1} \le f(z), V(z) \le M$ for all $a \le z \le b$, and that the possibly data-dependent bandwidth satisfies $h = o_P(1)$ and $h^{-1} = o_P(n^{1/4})$. If we let $U_i \equiv (1, Z_i)'$, $i = 1, \ldots, n$, then
$$H_n(z) = n^{-1} \sum_{i=1}^n \hat e_i^2\, 1\{Z_i \le z\} = n^{-1} \sum_{i=1}^n \left[ e_i - (\hat\theta - \theta)'U_i \right]^2 1\{Z_i \le z\}$$
$$= n^{-1} \sum_{i=1}^n e_i^2\, 1\{Z_i \le z\} - 2(\hat\theta - \theta)'\, n^{-1} \sum_{i=1}^n U_i e_i\, 1\{Z_i \le z\} + (\hat\theta - \theta)'\, n^{-1} \sum_{i=1}^n U_i U_i'\, 1\{Z_i \le z\}\, (\hat\theta - \theta)$$
$$= A_n(z) - B_n(z) + C_n(z).$$


In Chapter 9, we will show in an exercise (see Exercise 9.6.8) that $\mathcal{G}_1 \equiv \{U e\, 1\{Z \le z\} : z \in [a, b]\}$ is Donsker. Hence $\|P_n - P\|_{\mathcal{G}_1} = O_P(n^{-1/2})$. Since also $E[e|Z] = 0$ and $\|\hat\theta - \theta\| = O_P(n^{-1/2})$, we now have that $\sup_{z \in [a, b]} |B_n(z)| = O_P(n^{-1})$. By noting that $\|U\|$ is bounded under our assumptions, we also obtain that $\sup_{z \in [a, b]} |C_n(z)| = O_P(n^{-1})$. In Exercise 9.6.8 in Chapter 9, we will also verify that $\mathcal{G}_2 \equiv \{e^2\, 1\{Z \le z\} : z \in [a, b]\}$ is Donsker. Hence $\sup_{z \in [a, b]} |A_n(z) - H(z)| = O_P(n^{-1/2})$, and thus also $\sup_{z \in [a, b]} |H_n(z) - H(z)| = O_P(n^{-1/2})$. Standard results also verify that $\sup_{z \in [a, b]} |F_n(z) - F(z)| = O_P(n^{-1/2})$.

Let $D_n \equiv H_n - H$, let $\dot L$ be the derivative of $L$, and note that for any $z \in [a + h, b - h]$ we have, by integration by parts and by the form of $L$,
$$\int_{\mathbb{R}} h^{-1} L\left( \frac{z - u}{h} \right) D_n(du) = -\int_{z-h}^{z+h} D_n(u)\, h^{-2} \dot L\left( \frac{z - u}{h} \right) du.$$
Thus
$$\left| \int_{\mathbb{R}} h^{-1} L\left( \frac{z - u}{h} \right) D_n(du) \right| \le 2 h^{-1} \sup_{z \in [a, b]} \left| H_n(z) - H(z) \right|.$$

Since the right-hand side does not depend on $z$, and by the result of the previous paragraph, we obtain that the supremum of the left-hand side over $z \in [a + h, b - h]$ is $O_P(h^{-1} n^{-1/2})$. Letting $\dot H$ be the derivative of $H$, we leave it as an exercise (see Exercise 9.6.8 again) to verify that both
$$\sup_{z \in [a+h, b-h]} \left| \int_{\mathbb{R}} h^{-1} L\left( \frac{z - u}{h} \right) H(du) - \dot H(z) \right| = O(h)$$
and
$$\left( \sup_{z \in [a, a+h)} \left| \dot H(z) - \dot H(a + h) \right| \right) \vee \left( \sup_{z \in (b-h, b]} \left| \dot H(z) - \dot H(b - h) \right| \right) = O(h).$$
Hence

$$\hat R_n(z) \equiv \begin{cases} \int_{\mathbb{R}} h^{-1} L\left( \frac{z-u}{h} \right) H_n(du) & \text{for } z \in [a + h, b - h], \\ \hat R_n(a + h) & \text{for } z \in [a, a + h), \\ \hat R_n(b - h) & \text{for } z \in (b - h, b] \end{cases}$$
is uniformly consistent for $\dot H$ with uniform error $O_P(h + h^{-1}n^{-1/2}) = o_P(1)$. Similar, but somewhat simpler, arguments can be used to verify that


$$\hat Q_n(z) \equiv \begin{cases} \int_{\mathbb{R}} h^{-1} L\left( \frac{z-u}{h} \right) F_n(du) & \text{for } z \in [a + h, b - h], \\ \hat Q_n(a + h) & \text{for } z \in [a, a + h), \\ \hat Q_n(b - h) & \text{for } z \in (b - h, b] \end{cases}$$
is uniformly consistent for $f$, also with uniform error $O_P(h + h^{-1}n^{-1/2}) = o_P(1)$. Since $f$ is bounded below, we now have that $\sup_{z \in [a, b]} |\hat V(z) - V(z)| = o_P(1)$. Since $\dot g_\theta(Z) = (1, Z)'$, we have established that both (4.10) and (4.11) hold for this example, since $V$ is also bounded below.

We will now show that there exists a $k_0 < \infty$ such that the probability that $\hat V$ goes below $1/k_0$, or that the first derivative of $\hat V$ exceeds $k_0$, goes to zero as $n \to \infty$. If we let $\mathcal{G}$ be the class of all functions $q : [a, b] \to [k_0^{-1}, k_0]$ whose first derivative $\dot q$ satisfies $|\dot q| \le k_0$, then our result will imply that the inner probability that $\hat V \in \mathcal{G}$ is $> 1 - \epsilon$ for all $n$ large enough and all $\epsilon > 0$. An additional exercise in Chapter 9 (see again Exercise 9.6.8) will then show that, for this simple linear regression example, the class $\mathcal{F}_1$ defined above is Donsker and $\mathcal{F}_2$ defined above is Glivenko-Cantelli. Thus Lemma 4.4 applies. This means that if we first use least-squares estimators of $\alpha$ and $\beta$ to construct the estimator $\hat V$, and then compute the "two-stage" estimator

$$\tilde\theta \equiv \left[ \sum_{i=1}^n \frac{U_i U_i'}{\hat V(Z_i)} \right]^{-1} \sum_{i=1}^n \frac{U_i Y_i}{\hat V(Z_i)},$$
then this $\tilde\theta$ will be efficient for $\theta$. Since $\hat V$ is uniformly consistent, the only thing remaining to show is that the derivative of $\hat V$, denoted $\dot V_n$, is uniformly bounded in probability. Note that the derivative of $\hat R_n$, which we will denote $\dot R_n$, satisfies the following for all $z \in [a + h, b - h]$:

$$\dot R_n(z) = -\int_{\mathbb{R}} h^{-2} \dot L\left( \frac{z - u}{h} \right) H_n(du) = h^{-2}\left[ H_n(z) - H_n(z - h) - H_n(z + h) + H_n(z) \right]$$
$$= O_P(h^{-2}n^{-1/2}) + h^{-2}\left[ 2H(z) - H(z - h) - H(z + h) \right],$$
where the last equality follows from the previously established fact that $\sup_{z \in [a+h, b-h]} |H_n(z) - H(z)| = O_P(n^{-1/2})$. Now the uniform boundedness of the second derivative of $H$ ensures that $\sup_{z \in [a+h, b-h]} |\dot R_n(z)| = O_P(1)$. Similar arguments can be used to establish that the derivative of $\hat Q_n$, which we will denote $\dot Q_n$, satisfies $\sup_{z \in [a+h, b-h]} |\dot Q_n(z)| = O_P(1)$. Now we have uniform boundedness in probability of $\dot V_n$ over $[a + h, b - h]$. Since $\hat V$ does not change over either $[a, a + h)$ or $(b - h, b]$, we have also established that $\sup_{z \in [a, b]} |\dot V_n(z)| = O_P(1)$, and the desired results follow.
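With $\hat V$ in hand, the two-stage estimator $\tilde\theta$ displayed earlier in this subsection is just a weighted least-squares computation. A minimal sketch (our own names; the fitted values $\hat V(Z_i)$, from the kernel estimator or any other uniformly consistent estimate, are passed in as a vector):

```python
import numpy as np

def two_stage(Z, Y, V_hat_at_Z):
    """theta_tilde = [sum_i U_i U_i'/V_hat(Z_i)]^{-1} sum_i U_i Y_i / V_hat(Z_i),
    with U_i = (1, Z_i)'. The first stage (least squares plus the variance
    estimate) is assumed already done and summarized by V_hat_at_Z."""
    U = np.column_stack([np.ones_like(Z), Z])
    Uw = U / np.asarray(V_hat_at_Z)[:, None]   # rows U_i / V_hat(Z_i)
    return np.linalg.solve(Uw.T @ U, Uw.T @ Y)
```

This is exactly generalized least squares with estimated weights, which is why consistency of $\hat V$ (rather than a rate) suffices for efficiency here.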


4.4.2 A Poisson Mixture Regression Model

In this section, we consider a Poisson mixture regression model in which the nuisance parameter is not $\sqrt{n}$ consistent in the uniform norm. Given a regression vector $Z \in \mathbb{R}^k$ and a nonnegative random quantity $W \in \mathbb{R}$, the observation $Y$ is Poisson with parameter $W e^{\beta'Z}$, for some $\beta \in \mathbb{R}^k$. We only observe the pair $(Y, Z)$. Thus the density of $Y$ given $Z$ is
$$Q_{\beta,G}(y) = \int_0^\infty \frac{e^{-w e^{\beta'Z}} \left[ w e^{\beta'Z} \right]^y}{y!}\, dG(w),$$
where $G$ is the unknown distribution function for $W$. For identifiability, we assume that one of the components of $Z$ is the constant 1 and that the expectation of $W$ is also 1. We denote the joint distribution of the observed data as $P_{\beta,G}$, and assume that $Z$ is bounded, $P_{\beta,G}[ZZ']$ is full rank, and that $P_{\beta,G} Y^2 < \infty$.

In this situation, it is not hard to verify that
$$V(z) = E\left\{ \left[ Y - e^{\beta'Z} \right]^2 \Big|\, Z = z \right\} = E\left\{ E\left[ \left( Y - W e^{\beta'Z} \right)^2 \Big|\, W, Z = z \right] + (W - 1)^2 e^{2\beta'z} \right\} = e^{\beta'z} + \sigma^2 e^{2\beta'z},$$
where $\sigma^2 \equiv E[W - 1]^2$. Let $\hat\beta$ be the estimator obtained by solving
$$P_n\left[ Z\left( Y - e^{\beta'Z} \right) \right] = 0.$$

Standard arguments reveal that $\sqrt{n}(\hat\beta - \beta)$ is asymptotically mean zero Gaussian with a finite variance matrix. Relatively simple calculations also reveal that $E[Y(Y - 1)|Z = z] = \int_0^\infty u^2\, dG(u)\, e^{2\beta'z} = (\sigma^2 + 1)e^{2\beta'z}$. Hence
$$\hat\sigma^2 \equiv -1 + n^{-1} \sum_{i=1}^n e^{-2\hat\beta'Z_i}\, Y_i(Y_i - 1)$$
will satisfy $\hat\sigma^2 = \sigma^2 + O_P(n^{-1/2})$. Now let $\tilde\beta$ be the solution of
$$P_n\left[ \frac{Z\left( Y - e^{\beta'Z} \right)}{\hat V(Z)} \right] = 0,$$
where $\hat V(z) \equiv e^{\hat\beta'z} + \hat\sigma^2 e^{2\hat\beta'z}$. It is left as an exercise (see Exercise 4.6.7) to verify that $\tilde\beta$ satisfies the conditions of Lemma 4.4 with $\tilde V = V$, and thus the desired optimality is achieved.
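The full scheme for this example can be sketched end to end. The simulation choices below (gamma mixing with mean 1 and $\sigma^2 = 0.5$, the sample size, and the Newton-type solver) are all our own illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(3)
n, beta_true = 20000, np.array([0.2, 0.7])
Z = np.column_stack([np.ones(n), rng.uniform(-1.0, 1.0, size=n)])
W = rng.gamma(shape=2.0, scale=0.5, size=n)     # E W = 1, sigma^2 = Var W = 0.5
Y = rng.poisson(W * np.exp(Z @ beta_true)).astype(float)

def solve_ee(v, beta0, n_iter=50):
    """Newton iteration for P_n[Z (Y - exp(beta'Z)) / v] = 0, v a fixed vector."""
    beta = beta0.copy()
    for _ in range(n_iter):
        mu = np.exp(Z @ beta)
        score = Z.T @ ((Y - mu) / v) / n
        deriv = -(Z * (mu / v)[:, None]).T @ Z / n
        beta = beta - np.linalg.solve(deriv, score)
    return beta

beta_hat = solve_ee(np.ones(n), np.zeros(2))            # unweighted equation
mu_hat = np.exp(Z @ beta_hat)
sigma2_hat = -1.0 + np.mean(Y * (Y - 1.0) / mu_hat**2)  # moment estimator
V_hat = mu_hat + sigma2_hat * mu_hat**2                 # optimal working variance
beta_tilde = solve_ee(V_hat, beta_hat)                  # reweighted estimator
print(beta_tilde, sigma2_hat)                           # near beta_true and 0.5
```

Note that $\hat V$ is held fixed at $\hat\beta$ during the second solve, exactly as in the display above.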

4.5 Partly Linear Logistic Regression

For the partly linear logistic regression example given in Chapter 1, the observed data are $n$ independent realizations of the random triplet $(Y, Z, U)$, where $Z \in \mathbb{R}^p$ and $U \in \mathbb{R}$ are covariates which are not linearly dependent, $Y$ is a dichotomous outcome with conditional expectation $\nu[\beta'Z + \eta(U)]$, $\beta \in \mathbb{R}^p$, $Z$ is restricted to a bounded set, $U \in [0, 1]$, $\nu(t) = 1/(1 + e^{-t})$, and $\eta$ is an unknown smooth function. Hereafter, for simplicity, we will also assume that $p = 1$. We further assume, for some integer $k \ge 1$, that the first $k - 1$ derivatives of $\eta$ exist and are absolutely continuous with
$$J^2(\eta) = \int_0^1 \left[ \eta^{(k)}(t) \right]^2 dt < \infty.$$
To estimate $\beta$ and $\eta$ based on an i.i.d. sample $X_i = (Y_i, Z_i, U_i)$, $i = 1, \ldots, n$, we use the following penalized log-likelihood:
$$L_n(\beta, \eta) = n^{-1} \sum_{i=1}^n \log p_{\beta,\eta}(X_i) - \lambda_n^2 J^2(\eta),$$
where
$$p_{\beta,\eta}(x) = \left\{ \nu[\beta z + \eta(u)] \right\}^y \left\{ 1 - \nu[\beta z + \eta(u)] \right\}^{1-y}$$
and $\lambda_n$ is chosen to satisfy $\lambda_n = o_P(n^{-1/4})$ and $\lambda_n^{-1} = O_P(n^{k/(2k+1)})$.

Denote βn and ηn to be the maximizers of Ln(β, η), let Pβ,η denote ex-pectation under the model, and let β0 and η0 to be the true values of theparameters.

Consistency of β̂_n and η̂_n and efficiency of β̂_n are established for partly linear generalized linear models in Mammen and van de Geer (1997). We now derive the efficient score for β and then sketch a verification that β̂_n is asymptotically linear with influence function equal to the efficient influence function. Several technically difficult steps will be reserved for the case studies II chapter, Chapter 15, in Part II of the book. Our approach to proving this diverges only slightly from the approach used by Mammen and van de Geer (1997).

Let H be the linear space of functions h : [0, 1] → R with J(h) < ∞. For t ∈ [0, ε) and ε sufficiently small, let β_t = β + tv and η_t = η + th for v ∈ R and h ∈ H. If we differentiate the non-penalized log-likelihood, we deduce that the score for β and η, in the direction (v, h), is (vZ + h(U))(Y − μ_{β,η}(Z, U)), where μ_{β,η}(Z, U) = ν[βZ + η(U)]. Now let

h_1(u) = E[ZV_{β,η}(Z, U)|U = u] / E[V_{β,η}(Z, U)|U = u],

where V_{β,η} = μ_{β,η}(1 − μ_{β,η}), and assume that h_1 ∈ H. It can easily be verified that Z − h_1(U) is uncorrelated with any h(U), h ∈ H, and thus the efficient score for β is ℓ̃_{β,η}(Y, Z, U) = (Z − h_1(U))(Y − μ_{β,η}(Z, U)). Hence

the efficient information for β is Ĩ_{β,η} = P_{β,η}[(Z − h_1(U))²V_{β,η}(Z, U)] and the efficient influence function is ψ̃_{β,η} = Ĩ_{β,η}^{−1} ℓ̃_{β,η}, provided Ĩ_{β,η} > 0, which we assume hereafter to be true for β = β_0 and η = η_0.

In order to prove asymptotic linearity of β̂_n, we need to also assume that P_{β_0,η_0}[Z − h̃_1(U)]² > 0, where h̃_1(u) = E[Z|U = u]. We prove in


Chapter 15 that β̂_n and η̂_n are both uniformly consistent for β_0 and η_0, respectively, and that

P_n[(β̂_n − β_0)Z + η̂_n(U) − η_0(U)]² = o_P(n^{−1/2}).    (4.12)

Let β̂_{ns} = β̂_n + s and η̂_{ns}(u) = η̂_n(u) − sh_1(u). If we now differentiate L_n(β̂_{ns}, η̂_{ns}) and evaluate at s = 0, we obtain

0 = P_n[(Y − μ_{β_0,η_0})(Z − h_1(U))] − P_n[(μ_{β̂_n,η̂_n} − μ_{β_0,η_0})(Z − h_1(U))] − λ_n² ∂J²(η̂_{ns})/(∂s)|_{s=0}
  = A_n − B_n − C_n,

since L_n(β, η) is maximized at β̂_n and η̂_n by definition.

Using (4.12), we obtain

B_n = P_n[V_{β_0,η_0}(Z, U){(β̂_n − β_0)Z + η̂_n(U) − η_0(U)}(Z − h_1(U))] + o_P(n^{−1/2}),

since ∂ν(t)/(∂t) = ν(1− ν). By definition of h1, we have

P_{β_0,η_0}[V_{β_0,η_0}(Z, U)(η̂_n(U) − η_0(U))(Z − h_1(U))] = 0,

and we also have that P_{β_0,η_0}[V_{β_0,η_0}(Z, U)(η̂_n(U) − η_0(U))(Z − h_1(U))]² → 0 in probability. Thus, if we can establish that for each τ > 0, η̂_n(U) − η_0(U) lies in a bounded P_{β_0,η_0}-Donsker class with probability > 1 − τ for all n ≥ 1 large enough, then

P_n[V_{β_0,η_0}(Z, U)(η̂_n(U) − η_0(U))(Z − h_1(U))] = o_P(n^{−1/2}),

and thus

B_n = (β̂_n − β_0)P_{β_0,η_0}[V_{β_0,η_0}(Z, U)(Z − h_1(U))²] + o_P(n^{−1/2}),    (4.13)

since products of bounded Donsker classes are Donsker (and therefore also Glivenko-Cantelli), and since

P_{β_0,η_0}[V_{β_0,η_0}(Z, U)Z(Z − h_1(U))] = P_{β_0,η_0}[V_{β_0,η_0}(Z, U)(Z − h_1(U))²]

by definition of h_1. Let H_c be the subset of H with functions h satisfying J(h) ≤ c and ‖h‖_∞ ≤ c. We will show in Chapter 15 that {h(U) : h ∈ H_c} is indeed Donsker for each c < ∞ and also that J(η̂_n) = O_P(1). This yields the desired Donsker property, and (4.13) follows.

It is now not difficult to verify that C_n ≤ 2λ_n²J(η̂_n)J(h_1) = o_P(n^{−1/2}), since λ_n = o_P(n^{−1/4}) by assumption. Hence √n(β̂_n − β_0) = √n P_n ψ̃_{β_0,η_0} + o_P(1), and we have verified that β̂_n is indeed efficient for β_0.


4.6 Exercises

4.6.1. For the linear regression example, verify the inequality (4.3).

4.6.2. For the general counting process regression model setting, show that V(β) ≥ c var(Z), for some c > 0, where V(β) is given in (4.5).

4.6.3. Show how to use Lemma 4.2 to establish (4.6). Hint: Let A_n(t) = E_n(t, β) − E(t, β) and B_n(t) = √n P_n M_{β_0}(t) for part of it, and let

A_n(t) = P_n ∫_0^t Y(s)[e^{β′Z} − e^{β_0′Z}] dΛ_0(s)

and B_n(t) = √n[E_n(t, β) − E(t, β)] for the other part.

4.6.4. For the Cox model example (Section 4.2.2), verify Condition (3.6) of Theorem 3.1 for ℓ_{β,n} defined in (3.8).

4.6.5. We will see in Section 18.2 (in Theorem 18.8 on Page 339) that, for a regular estimator, it is enough to verify efficiency by showing that the influence function lies in the tangent space for the full model. For the Cox model example, verify that the influence functions for both β and Λ(t) (for all t ∈ [0, τ]) lie in the tangent space for the full model.

4.6.6. For the Kaplan-Meier example of Section 4.3, verify that ψ ∈ R(A).

4.6.7. For the Poisson mixture model example of Section 4.4.2, verify that the conditions of Lemma 4.4 are satisfied for the given choice of V̂:

(a) Show that the class G ≡ {e^{t′Z} + s²e^{2t′Z} : ‖t − β‖ ≤ ε_1, |s² − σ²| ≤ ε_2} is Donsker for some ε_1, ε_2 > 0. Hint: First show that {t′Z : ‖t − β‖ ≤ ε_1} is Donsker from the fact that the product of two (in this case trivial) bounded Donsker classes is also Donsker. Now complete the proof by using the facts that Lipschitz functions of Donsker classes are Donsker, that products of bounded Donsker classes are Donsker (used earlier), and that sums of Donsker classes are Donsker.

(b) Now complete the verification of Lemma 4.4.

4.6.8. For the partly linear logistic regression example of Section 4.5, verify that (Z − h_1(U))(Y − μ_{β_0,η_0}) is uncorrelated with h(U)(Y − μ_{β_0,η_0}) for all h ∈ H, where the quantities are as defined in the example.

4.7 Notes

Least absolute deviation regression was studied using equicontinuity arguments in Bassett and Koenker (1978). More succinct results based on


empirical processes can be found in Pollard (1991). An excellent discussion of breakdown points and other robustness issues is given in Huber (1981).


5
Introduction to Empirical Processes

The goal of Part II is to provide an in-depth coverage of the basics of empirical process techniques which are useful in statistics. Chapter 6 presents preliminary mathematical background which provides a foundation for later technical development. The topics covered include metric spaces, outer expectations, linear operators and functional differentiation. The main topics overviewed in Chapter 2 of Part I will then be covered in greater depth, along with several additional topics, in Chapters 7 through 14. Part II finishes in Chapter 15 with several case studies. The main approach is to present the mathematical and statistical ideas in a logical, linear progression, and then to illustrate the application and integration of these ideas in the case study examples. The scaffolding provided by the overview, Part I, should enable the reader to maintain perspective during the sometimes rigorous developments of this section.

Stochastic convergence is studied in Chapter 7. An important aspect of the modes of convergence explored in this book is the notion of outer integrals and outer measure, which were mentioned briefly in Section 2.2.1. While many of the standard relationships between the modes of stochastic convergence apply when using outer measure, there are a few important differences that we will examine. While these differences may, in some cases, add complexity to an already difficult asymptotic theory, the gain in breadth of applicability to semiparametric statistical estimators is well worth the trouble. For example, convergence based on outer measure permits the use of the uniform topology for studying convergence of empirical processes with complex index sets. This contrasts with more traditional approaches, which require special topologies that can be harder to use in


applications, such as the Skorohod topology for cadlag processes (see Chapter 3 of Billingsley, 1968).

The main techniques for proving empirical process central limit theorems will be presented in Chapter 8. Establishing Glivenko-Cantelli and Donsker theorems requires bounding expectations involving suprema of stochastic processes. Maximal inequalities and symmetrization techniques are important tools for accomplishing this, and careful measurability arguments are also sometimes needed. Symmetrization involves replacing inequalities for the empirical process f ↦ (P_n − P)f, f ∈ F, with inequalities for the "symmetrized" process n^{−1}∑_{i=1}^n ε_i f(X_i), where ε_1, . . . , ε_n are i.i.d. Rademacher random variables (e.g., P{ε_1 = −1} = P{ε_1 = 1} = 1/2) independent of X_1, . . . , X_n. Several tools for assessing measurability in statistical applications will also be discussed.
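A small simulation illustrates the idea behind symmetrization (our own sketch, not from the text): for the class of indicators F = {1{x ≤ t}} over a finite grid of t's and X ~ Uniform(0, 1), the Monte Carlo average of the supremum of the empirical process is bounded by roughly twice that of the symmetrized process, in line with the symmetrization inequality:

```python
import random

def sup_emp(xs, ts, F):
    # sup over the grid of |(P_n - P) 1{X <= t}|
    n = len(xs)
    return max(abs(sum(1 for x in xs if x <= t) / n - F(t)) for t in ts)

def sup_sym(xs, ts, rng):
    # sup over the grid of |n^{-1} sum_i eps_i 1{X_i <= t}|, eps_i Rademacher
    n = len(xs)
    eps = [1 if rng.random() < 0.5 else -1 for _ in range(n)]
    return max(abs(sum(e for x, e in zip(xs, eps) if x <= t) / n) for t in ts)

rng = random.Random(42)
ts = [j / 10 for j in range(1, 10)]
B, n = 300, 100
emp_mean = sym_mean = 0.0
for _ in range(B):
    xs = [rng.random() for _ in range(n)]   # P = Uniform(0, 1), so F(t) = t
    emp_mean += sup_emp(xs, ts, lambda t: t)
    sym_mean += sup_sym(xs, ts, rng)
emp_mean /= B
sym_mean /= B
```

The point of the device is that the symmetrized process, conditionally on the data, is a simple Rademacher average to which maximal inequalities apply directly.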

Entropy with bracketing, uniform entropy, and other measures of entropy are essential aspects in all of these results. This is the topic of Chapter 9. The associated entropy calculations can be quite challenging, but the work is often greatly simplified by using Donsker preservation results to build larger Donsker classes from smaller ones. Similar preservation results are available for Glivenko-Cantelli classes.

Bootstrapping of empirical processes, based on multinomial or other Monte Carlo weights, is studied in Chapter 10. The bootstrap is a valuable way to conduct inference for empirical processes because of its broad applicability. In many semiparametric settings, there are no viable alternatives for inference. A central role in establishing validity of the bootstrap is played by multiplier central limit theorems which establish weak convergence of processes of the form √n P_n ξ(f(X) − Pf), f ∈ F, where ξ is independent of X, has mean zero and variance 1, and satisfies ∫_0^∞ P(|ξ| > x) dx < ∞.

In Chapter 11, several extensions of empirical process results are presented for function classes which either consist of sequences or change with the sample size n, as well as results for independent but not identically distributed data. These results are useful in a number of statistical settings, including asymptotic analysis of the Cramér-von Mises statistic and regression settings where the covariates are assumed fixed or when a biased coin study design (see Wei, 1978, for example) is used. Extensions of the bootstrap for conducting inference in these new situations are also discussed.
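The multiplier idea can be sketched numerically (an illustration under our own assumptions, with f(x) = x, Gaussian multipliers ξ, and P_n f substituted for the unknown Pf): the conditional standard deviation of √n P_n ξ(f(X) − P_n f) approximates that of √n(P_n − P)f:

```python
import math
import random

rng = random.Random(7)
n = 400
xs = [rng.random() for _ in range(n)]   # sample from P = Uniform(0, 1)
xbar = sum(xs) / n

# multiplier-bootstrap draws of sqrt(n) * P_n xi (f(X) - P_n f) for f(x) = x
B = 500
draws = []
for _ in range(B):
    xi = [rng.gauss(0.0, 1.0) for _ in range(n)]
    draws.append(sum(w * (x - xbar) for w, x in zip(xi, xs)) / math.sqrt(n))

boot_sd = (sum(d * d for d in draws) / B) ** 0.5
# sd of sqrt(n)(P_n - P)f under Uniform(0, 1) is sqrt(1/12)
target_sd = math.sqrt(1.0 / 12.0)
```

The multiplier draws are computed conditionally on the data, which is exactly what makes them usable for inference in practice.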

Many interesting statistical quantities can be expressed as functionals of empirical processes. The functional delta method, discussed in Chapter 12, can be used to translate weak convergence and bootstrap results for empirical processes to corresponding inference results for these functionals. Most Z- and M-estimators are functionals of empirical processes. For example, under reasonable regularity conditions, the functional which extracts the zero (root) of a Z-estimating equation is sufficiently smooth to permit the delta method to carry over inference results for the estimating equation to the corresponding Z-estimator. The results also apply to M-estimators which can be expressed as approximate Z-estimators.


Z-estimation is discussed in Chapter 13, while M-estimation is discussed in Chapter 14. A key challenge with many important M-estimators is to establish the rate of convergence, especially in settings where the estimators are not √n-consistent. This issue was only briefly mentioned in Section 2.2.6 because of the technical complexity of the problem. There are a number of tools which can be used to establish these rates, and several such tools will be studied in Chapter 14. These techniques rely significantly on accurate entropy calculations for the M-estimator empirical process, as indexed by the parameter set, within a small neighborhood of the true parameter.

The case studies presented in Chapter 15 demonstrate that the technical power of empirical process methods facilitates valid inference for flexible models in many interesting and important statistical settings.


6
Preliminaries for Empirical Processes

In this chapter, we cover several mathematical topics that play a central role in the empirical process results we present later. Metric spaces are crucial since they provide the descriptive language by which the most important results about stochastic processes are derived and expressed. Outer expectations, or, more correctly, outer integrals, are key to defining and utilizing outer modes of convergence for quantities which are not measurable. Since many statistical quantities of interest are not measurable with respect to the uniform topology, which is often the topology of choice for applications, outer modes of convergence will be the primary approach for stochastic convergence throughout this book. Linear operators and functional derivatives also play a major role in empirical process methods and are key tools for the functional delta method and Z-estimator theory discussed in Chapters 12 and 13.

6.1 Metric Spaces

We now introduce a number of concepts and results for metric spaces. Before defining metric spaces, we briefly review topological spaces, σ-fields, and measure spaces. A collection O of subsets of a set X is a topology in X if:

(i) ∅ ∈ O and X ∈ O, where ∅ is the empty set;

(ii) If U_j ∈ O for j = 1, . . . , m, then ⋂_{j=1}^m U_j ∈ O;


(iii) If {U_α} is an arbitrary collection of members of O (finite, countable or uncountable), then ⋃_α U_α ∈ O.

When O is a topology in X, then X (or the pair (X, O)) is a topological space, and the members of O are called the open sets in X. For a subset A ⊂ X, the relative topology on A consists of the sets {A ∩ B : B ∈ O}.

A map f : X → Y between topological spaces is continuous if f^{−1}(U) is open in X whenever U is open in Y. A set B in X is closed if and only if its complement in X, denoted X − B, is open. The closure of an arbitrary set E ⊂ X, denoted Ē, is the smallest closed set containing E; while the interior of an arbitrary set E ⊂ X, denoted E̊, is the largest open set contained in E. A subset A of a topological space X is dense if Ā = X. A topological space X is separable if it has a countable dense subset.

A neighborhood of a point x ∈ X is any open set that contains x. A topological space is Hausdorff if distinct points in X have disjoint neighborhoods. A sequence of points {x_n} in a topological space X converges to a point x ∈ X if every neighborhood of x contains all but finitely many of the x_n. This convergence is denoted x_n → x. Suppose x_n → x and x_n → y. Then x and y share all neighborhoods, and x = y when X is Hausdorff. If a map f : X → Y between topological spaces is continuous, then f(x_n) → f(x) whenever x_n → x in X. To see this, let {x_n} ⊂ X be a sequence with x_n → x ∈ X. Then for any open U ⊂ Y containing f(x), all but finitely many x_n are in f^{−1}(U), and thus all but finitely many f(x_n) are in U. Since U was arbitrary, we have f(x_n) → f(x).

We now review the important concept of compactness. A subset K of a topological space is compact if for every set A ⊃ K, where A is the union of a collection of open sets S, K is also contained in some finite union of sets in S. When the topological space involved is also Hausdorff, then compactness of K is equivalent to the assertion that every sequence in K has a convergent subsequence (converging to a point in K). We omit the proof of this equivalence. This result implies that compact subsets of Hausdorff topological spaces are necessarily closed. Note that a compact set is sometimes called a compact for short. A σ-compact set is a countable union of compacts.

A collection A of subsets of a set X is a σ-field in X (sometimes called a σ-algebra) if:

(i) X ∈ A;

(ii) If U ∈ A, then X − U ∈ A;

(iii) The countable union ⋃_{j=1}^∞ U_j ∈ A whenever U_j ∈ A for all j ≥ 1.

Note that (iii) clearly includes finite unions. When (iii) is only required to hold for finite unions, then A is called a field. When A is a σ-field in X, then X (or the pair (X, A)) is a measurable space, and the members of A are called the measurable sets in X. If X is a measurable space and Y


is a topological space, then a map f : X → Y is measurable if f^{−1}(U) is measurable in X whenever U is open in Y.

If O is a collection of subsets of X (not necessarily open), then there exists a smallest σ-field A* in X so that O ⊂ A*. This A* is called the σ-field generated by O. To see that such an A* exists, let S be the collection of all σ-fields in X which contain O. Since the collection of all subsets of X is one such σ-field, S is not empty. Define A* to be the intersection of all A ∈ S. Clearly, O ⊂ A* and A* is contained in every σ-field containing O. All that remains is to show that A* is itself a σ-field. Assume that A_j ∈ A* for all integers j ≥ 1. If A ∈ S, then ⋃_{j≥1} A_j ∈ A. Since ⋃_{j≥1} A_j ∈ A for every A ∈ S, we have ⋃_{j≥1} A_j ∈ A*. Also X ∈ A* since X ∈ A for all A ∈ S; and for any A ∈ A*, both A and X − A are in every A ∈ S. Thus A* is indeed a σ-field.

A σ-field is separable if it is generated by a countable collection of subsets. Note that we have already defined "separable" as a characteristic of certain topological spaces. There is a connection between the two definitions which we will point out in a few pages during our discussion on metric spaces. When X is a topological space, the smallest σ-field B generated by the open sets is called the Borel σ-field of X. Elements of B are called Borel sets. A function f : X → Y between topological spaces is Borel-measurable if it is measurable with respect to the Borel σ-field of X. Clearly, a continuous function between topological spaces is also Borel-measurable.

For a σ-field A in a set X, a map μ : A → R̄ is a measure if:

(i) µ(A) ∈ [0,∞] for all A ∈ A;

(ii) µ(∅) = 0;

(iii) For a disjoint sequence A_1, A_2, . . . ∈ A, μ(⋃_{j=1}^∞ A_j) = ∑_{j=1}^∞ μ(A_j) (countable additivity).

If X = A_1 ∪ A_2 ∪ · · · for some finite or countable sequence of sets in A with μ(A_j) < ∞ for all indices j, then μ is σ-finite. The triple (X, A, μ) is called a measure space. If μ(X) = 1, then μ is a probability measure. For a probability measure P on a set Ω with σ-field A, the triple (Ω, A, P) is called a probability space. If the set [0, ∞] in Part (i) is extended to (−∞, ∞] or replaced by [−∞, ∞) (but not both), then μ is a signed measure. For a measure space (X, A, μ), let A* be the collection of all E ⊂ X for which there exist A, B ∈ A with A ⊂ E ⊂ B and μ(B − A) = 0, and define μ(E) = μ(A) in this setting. Then A* is a σ-field, μ is still a measure, and A* is called the μ-completion of A.

A metric space is a set D together with a metric. A metric, or distance function, is a map d : D × D → [0, ∞) where:

(i) d(x, y) = d(y, x);

(ii) d(x, z) ≤ d(x, y) + d(y, z) (the triangle inequality);


(iii) d(x, y) = 0 if and only if x = y.

A semimetric or pseudometric satisfies (i) and (ii) but not necessarily (iii). Technically, a metric space consists of the pair (D, d), but usually only D is given and the underlying metric d is implied by the context. This is similar to topological and measurable spaces, where only the set of all points X is given while the remaining components are omitted except where needed to clarify the context. A semimetric space is also a topological space with the open sets generated by applying arbitrary unions to the open r-balls B_r(x) ≡ {y : d(x, y) < r} for r ≥ 0 and x ∈ D (where B_0(x) ≡ ∅). A metric space is also Hausdorff, and, in this case, a sequence x_n ∈ D converges to x ∈ D if d(x_n, x) → 0. For a semimetric space, d(x_n, x) → 0 ensures only that x_n converges to elements in the equivalence class of x, where the equivalence class of x consists of all {y ∈ D : d(x, y) = 0}. Accordingly, the closure Ā of a set A ⊂ D is not only the smallest closed set containing A, as stated earlier, but Ā also equals the set of all points that are limits of sequences {x_n} ⊂ A. Showing this relationship is saved as an exercise. In addition, two semimetrics d_1 and d_2 on a set D are considered equivalent (in a topological sense) if they both generate the same open sets. It is left as an exercise to show that equivalent metrics yield the same convergent subsequences.

A map f : D → E between two semimetric spaces is continuous at a point x if and only if f(x_n) → f(x) for every sequence x_n → x. The map f is continuous (in the topological sense) if and only if it is continuous at all points x ∈ D. Verifying this last equivalence is saved as an exercise. The following lemma helps to define semicontinuity for real valued maps:

Lemma 6.1 Let f : D → R be a function on the metric space D. Then the following are equivalent:

(i) For all c ∈ R, the set {y : f(y) ≥ c} is closed.

(ii) For all y_0 ∈ D, lim sup_{y→y_0} f(y) ≤ f(y_0).

Proof. Assume (i) holds but that (ii) is untrue for some y_0 ∈ D. This implies that for some δ > 0, lim sup_{y→y_0} f(y) = f(y_0) + δ. Thus H ∩ {y : d(y, y_0) < ε}, where H ≡ {y : f(y) ≥ f(y_0) + δ}, is nonempty for all ε > 0. Since H is closed by (i), we now have that y_0 ∈ H. But this implies that f(y_0) = f(y_0) + δ, which is impossible. Hence (ii) holds. The proof that (ii) implies (i) is saved as an exercise.

A function f : D → R satisfying either (i) or (ii) (and hence both) of the conditions in Lemma 6.1 is said to be upper semicontinuous. A function f : D → R is lower semicontinuous if −f is upper semicontinuous. Using Condition (ii), it is easy to see that a function which is both upper and lower semicontinuous is also continuous. The set of all continuous and bounded functions f : D → R, which we denote C_b(D), plays an important role in weak convergence on the metric space D, which we will explore in


Chapter 7. It is not hard to see that the Borel σ-field on a metric space D is the smallest σ-field generated by the open balls. It turns out that the Borel σ-field B of D is also the smallest σ-field A making all of C_b(D) measurable. To see this, note that any closed A ⊂ D is the preimage of the closed set {0} for the continuous bounded function x ↦ d(x, A) ∧ 1, where for any set B ⊂ D, d(x, B) ≡ inf{d(x, y) : y ∈ B}. Thus B ⊂ A. Since it is obvious that A ⊂ B, we now have A = B. A Borel-measurable map X : Ω → D defined on a probability space (Ω, A, P) is called a random element or random map with values in D. Borel measurability is, in many ways, the natural concept to use on metric spaces since it connects nicely with the topological structure.

A Cauchy sequence is a sequence {x_n} in a semimetric space (D, d) such that d(x_n, x_m) → 0 as n, m → ∞. A semimetric space D is complete if every Cauchy sequence has a limit x ∈ D. Every metric space D has a completion D̄ which has a dense subset isometric with D. Two metric spaces are isometric if there exists a bijection (a one-to-one and onto map) between them which preserves distances.
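A concrete illustration of incompleteness and completion (our own sketch, using exact rational arithmetic): the Newton iterates for √2 stay in Q and form a Cauchy sequence under |·|, but their limit lies outside Q; the completion of Q under this metric is R:

```python
from fractions import Fraction

# Newton iteration x -> (x + 2/x)/2 stays in Q and is Cauchy under |.|,
# but its limit sqrt(2) is not rational: Q is not complete.
x = Fraction(1)
seq = [x]
for _ in range(6):
    x = (x + 2 / x) / 2
    seq.append(x)

# successive distances shrink (in fact quadratically), witnessing Cauchyness
gaps = [abs(seq[i + 1] - seq[i]) for i in range(len(seq) - 1)]
```

Every iterate is an exact rational, yet no rational squares to 2, so the sequence has no limit inside Q.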

When a metric space D is separable, and therefore has a countable dense subset, the Borel σ-field for D is itself a separable σ-field. To see this, let A ⊂ D be a countable dense subset and consider the collection of open balls with centers at points in A and with rational radii. Clearly, the set of such balls is countable and generates all open sets in D. A topological space X is Polish if it is separable and if there exists a metric making X into a complete metric space. Hence any complete and separable metric space is Polish. Furthermore, any open subset of a Polish space is also Polish. Examples of Polish spaces include Euclidean spaces and many other interesting spaces that we will explore shortly. A Suslin set is the continuous image of a Polish space. If a Suslin set is also a Hausdorff topological space, then it is a Suslin space. An analytic set is a subset A of a Polish space (X, O) which is Suslin with respect to the relative topology {A ∩ B : B ∈ O}. Since there always exists a continuous and onto map f : X → A for any Borel subset A of a Polish space (X, O), every Borel subset of a Polish space is Suslin and therefore also analytic.

A subset K is totally bounded if and only if for every r > 0, K can be covered by finitely many open r-balls. Furthermore, it can be shown that a subset K of a complete semimetric space is compact if and only if it is totally bounded and closed. A totally bounded subset K is also called precompact because every sequence in K has a Cauchy subsequence. To see this, assume K is totally bounded and choose any sequence {x_n} ∈ K. There exists a nested series of subsequence indices {N_m} and a nested series of 2^{−m}-balls {A_m} ⊂ K, such that for each integer m ≥ 1, N_m is infinite, N_{m+1} ⊂ N_m, A_{m+1} ⊂ A_m, and x_j ∈ A_m for all j ∈ N_m. This follows from the total boundedness properties. For each m ≥ 1, choose an n_m ∈ N_m, and note that the subsequence {x_{n_m}} is Cauchy. Now assume every sequence in K has a Cauchy subsequence. It is not difficult to verify


that if K were not totally bounded, then it is possible to come up with a sequence which has no Cauchy subsequences (see Exercise 6.5.4). This relationship between compactness and total boundedness implies that a σ-compact set in a metric space is separable. These definitions of compactness agree with the previously given compactness properties for Hausdorff spaces. This happens because a semimetric space D can be made into a metric—and hence Hausdorff—space D_H by equating points in D_H with equivalence classes in D.

A very important example of a metric space is a normed space. A normed space D is a vector space (also called a linear space) equipped with a norm, and a norm is a map ‖·‖ : D → [0, ∞) such that, for all x, y ∈ D and α ∈ R,

(i) ‖x + y‖ ≤ ‖x‖+ ‖y‖ (another triangle inequality);

(ii) ‖αx‖ = |α| × ‖x‖;

(iii) ‖x‖ = 0 if and only if x = 0.

A seminorm satisfies (i) and (ii) but not necessarily (iii). A normed space is a metric space (and a seminormed space is a semimetric space) with d(x, y) = ‖x − y‖, for all x, y ∈ D. A complete normed space is called a Banach space. Two seminorms ‖·‖_1 and ‖·‖_2 on a set D are equivalent if the following is true for all x, x_n ∈ D: ‖x_n − x‖_1 → 0 if and only if ‖x_n − x‖_2 → 0.

In our definition of a normed space D, we require the space to also be a vector space (and therefore it contains all linear combinations of elements in D). However, it is sometimes of interest to apply norms to subsets K ⊂ D which may not be linear subspaces. In this setting, let lin K denote the linear span of K (all linear combinations of elements in K), and let l̅i̅n̅ K denote the closure of lin K. Note that both lin K and l̅i̅n̅ K are now vector spaces and that l̅i̅n̅ K is also a Banach space.

We now present several specific examples of metric spaces. The Euclidean space R^d is a Banach space with squared norm ‖x‖² = ∑_{j=1}^d x_j². This space is equivalent under several other norms, including ‖x‖ = max_{1≤j≤d} |x_j| and ‖x‖ = ∑_{j=1}^d |x_j|. A Euclidean space is separable with a countable dense subset consisting of all vectors with rational coordinates. By the Heine-Borel theorem, a subset in a Euclidean space is compact if and only if it is closed and bounded. The Borel σ-field is generated by the intervals of the type (−∞, x], for rational x, where the interval is defined as follows: y ∈ (−∞, x] if and only if y_j ∈ (−∞, x_j] for all coordinates j = 1, . . . , d. For one-dimensional Euclidean space, R, the norm is ‖x‖ = |x| (absolute value). The extended real line R̄ = [−∞, ∞] is a metric space with respect to the metric d(x, y) = |G(x) − G(y)|, where G : R̄ → R is any strictly monotone increasing, continuous and bounded function, such as the arctan function. For any sequence x_n ∈ R, |x_n − x| → 0 implies d(x_n, x) → 0,


while divergence of d(x_n, x) implies divergence of |x_n − x|. In addition, it is possible for a sequence to converge, with respect to d, to either −∞ or ∞. This makes R̄ compact.
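For instance (an illustrative sketch with G = arctan, extended to ±∞ by its limits ±π/2): the sequence x_n = 10^n converges to +∞ with respect to d, even though |x_n − x| diverges for every real x:

```python
import math

def G(t):
    # strictly increasing, continuous, bounded map from the extended reals
    # into [-pi/2, pi/2]
    if t == math.inf:
        return math.pi / 2
    if t == -math.inf:
        return -math.pi / 2
    return math.atan(t)

def d(x, y):
    return abs(G(x) - G(y))

# distances from x_n = 10**n to +infinity shrink to zero
dists = [d(10.0 ** n, math.inf) for n in range(1, 7)]
```

Since G is 1-Lipschitz on R, |x_n − x| → 0 still implies d(x_n, x) → 0, so d refines rather than contradicts ordinary convergence on R.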

Another important example is the set of bounded real functions f : T → R, where T is an arbitrary set. This is a vector space if sums z_1 + z_2 and products with scalars, αz, are defined pointwise for all z, z_1, z_2 ∈ ℓ^∞(T). Specifically, (z_1 + z_2)(t) = z_1(t) + z_2(t) and (αz)(t) = αz(t), for all t ∈ T. This space is denoted ℓ^∞(T). The uniform norm ‖x‖_T ≡ sup_{t∈T} |x(t)| makes ℓ^∞(T) into a Banach space consisting exactly of all functions z : T → R satisfying ‖z‖_T < ∞. It is not hard to show that ℓ^∞(T) is separable if and only if T is finite.

Two useful subspaces of ℓ^∞([a, b]), where a, b ∈ R, are C[a, b] and D[a, b]. The space C[a, b] consists of continuous functions z : [a, b] → R, and D[a, b] is the space of cadlag functions, which are right-continuous with left-hand limits (cadlag is an abbreviation for continu à droite, limites à gauche). We usually equip these spaces with the uniform norm ‖·‖_{[a,b]} inherited from ℓ^∞([a, b]). Note that C[a, b] ⊂ D[a, b] ⊂ ℓ^∞([a, b]). Relative to the uniform norm, C[a, b] is separable, and thus also Polish by the completeness established in Exercise 6.5.5(a), but D[a, b] is not separable. Sometimes, D[a, b] is called the Skorohod space, although Skorohod equipped D[a, b] with a special metric—quite different from the uniform metric—resulting in a separable space.

An important subspace of ℓ^∞(T) is the space UC(T, ρ), where ρ is a semimetric on T. UC(T, ρ) consists of all bounded functions f : T → R which are uniformly ρ-continuous, i.e.,

lim_{δ↓0} sup_{ρ(s,t)<δ} |f(s) − f(t)| = 0.

When (T, ρ) is totally bounded, the boundedness requirement for functions in UC(T, ρ) is superfluous since a uniformly continuous function on a totally bounded set must necessarily be bounded. We denote C(T, ρ) to be the space of ρ-continuous (not necessarily bounded) functions on T. It is left as an exercise to show that the spaces C[a, b], D[a, b], UC(T, ρ) and C(T̄, ρ), when (T, ρ) is a totally bounded semimetric space, and UC(T, ρ) and ℓ^∞(T), for an arbitrary set T, are all complete with respect to the uniform metric. When (T, ρ) is a compact semimetric space, T is totally bounded, and a ρ-continuous function on T is automatically uniformly ρ-continuous. Thus, when T is compact, C(T, ρ) = UC(T, ρ). Actually, every space UC(T, ρ) is equivalent to a space C(T̄, ρ), because the completion T̄ of a totally bounded space T is compact and, furthermore, every uniformly continuous function on T has a unique continuous extension to T̄. Showing this is saved as an exercise.

The foregoing structure makes it clear that UC(T, ρ), which is complete under the uniform norm, is a Polish space. Moreover, UC(T, ρ) is σ-compact. In fact, any σ-compact set in ℓ∞(T) is contained in UC(T, ρ) for some totally bounded semimetric space (T, ρ), and all compact sets in ℓ∞(T) have a specific form:

Theorem 6.2 (Arzelà-Ascoli)

(a) The closure of K ⊂ UC(T, ρ), where (T, ρ) is totally bounded, is compact if and only if

(i) sup_{x∈K} |x(t0)| < ∞, for some t0 ∈ T; and

(ii) lim_{δ↓0} sup_{x∈K} sup_{s,t∈T: ρ(s,t)<δ} |x(s) − x(t)| = 0.

(b) The closure of K ⊂ ℓ∞(T) is σ-compact if and only if K ⊂ UC(T, ρ) for some semimetric ρ making T totally bounded.

The proof is given in Section 6.4. Since all compact sets are trivially σ-compact, Theorem 6.2 implies that any compact set in ℓ∞(T) is actually contained in UC(T, ρ) for some semimetric ρ making T totally bounded.

Another important class of metric spaces consists of product spaces. For a pair of metric spaces (D, d) and (E, e), the Cartesian product D × E is a metric space with respect to the metric ρ((x1, y1), (x2, y2)) ≡ d(x1, x2) ∨ e(y1, y2), for x1, x2 ∈ D and y1, y2 ∈ E. The resulting topology is the product topology. In this setting, convergence of (xn, yn) → (x, y) is equivalent to convergence of both xn → x and yn → y. There are two natural σ-fields for D × E that we can consider. The first is the Borel σ-field for D × E generated from the product topology. The second is the product σ-field generated by all sets of the form A × B, where A ∈ A, B ∈ B, and A and B are the respective Borel σ-fields for D and E. These two σ-fields are equal when D and E are separable, but they may be unequal otherwise, with the first σ-field larger than the second. Suppose X : Ω → D and Y : Ω → E are Borel-measurable maps defined on a measurable space (Ω, A). Then (X, Y) : Ω → D × E is measurable with respect to the product σ-field, by the definition of a measurable map. Unfortunately, when the Borel σ-field for D × E is larger than the product σ-field, it is possible for (X, Y) to fail to be Borel-measurable.

6.2 Outer Expectation

An excellent overview of outer expectations is given in Chapter 1.2 of van der Vaart and Wellner (1996). The concept applies to an arbitrary probability space (Ω, A, P) and an arbitrary map T : Ω → R̄, where R̄ ≡ [−∞, ∞]. As described in Chapter 2, the outer expectation of T, denoted E*T, is the infimum over all EU, where U : Ω → R̄ is measurable, U ≥ T, and EU exists. For EU to exist, it must not be indeterminate, although it can be ±∞, provided the sign is clear. Since T is not necessarily a random variable, the proper term for E*T is outer integral. However, we will use the term outer expectation throughout the remainder of this book in deference to its connection with the classical notion of expectation. We analogously define the inner expectation: E_*T ≡ −E*[−T]. The following lemma verifies the existence of a minimal measurable majorant T* ≥ T:

Lemma 6.3 For any T : Ω → R̄, there exists a minimal measurable majorant T* : Ω → R̄ with

(i) T* ≥ T;

(ii) for every measurable U : Ω → R̄ with U ≥ T a.s., T* ≤ U a.s.

For any T* satisfying (i) and (ii), E*T = ET*, provided ET* exists; and ET* exists whenever E*T < ∞.

The proof is given in Section 6.4 at the end of this chapter. The following lemma, whose proof is left as an exercise, is an immediate consequence of Lemma 6.3 and verifies the existence of a maximal measurable minorant:

Lemma 6.4 For any T : Ω → R̄, the maximal measurable minorant T_* ≡ −(−T)* exists and satisfies

(i) T_* ≤ T;

(ii) for every measurable U : Ω → R̄ with U ≤ T a.s., T_* ≥ U a.s.

For any T_* satisfying (i) and (ii), E_*T = ET_*, provided ET_* exists; and ET_* exists whenever E_*T > −∞.

An important special case of outer expectation is outer probability. The outer probability of an arbitrary B ⊂ Ω, denoted P*(B), is the infimum over all P(A) such that A ⊃ B and A ∈ A. The inner probability of an arbitrary B ⊂ Ω is defined to be P_*(B) ≡ 1 − P*(Ω − B). The following lemma gives the precise connection between outer and inner expectations and outer and inner probabilities:

Lemma 6.5 For any B ⊂ Ω,

(i) P*(B) = E*1{B} and P_*(B) = E_*1{B};

(ii) there exists a measurable set B* ⊃ B so that P(B*) = P*(B); for any such B*, 1{B*} = (1{B})*;

(iii) for B_* ≡ Ω − (Ω − B)*, P_*(B) = P(B_*);

(iv) (1{B})* + (1{Ω − B})_* = 1.

Proof. From the definitions, P*(B) = inf_{A∈A: A⊃B} E1{A} ≥ E*1{B}. Next, E*1{B} = E(1{B})* ≥ E1{(1{B})* ≥ 1} = P((1{B})* ≥ 1) ≥ P*(B), where the last inequality follows from the definition of P*, since {(1{B})* ≥ 1} is a measurable set containing B. Combining the two conclusions yields that all of the inequalities are actually equalities. This gives the first parts of (i) and (ii), with B* = {(1{B})* ≥ 1}. The second part of (i) results from P_*(B) = 1 − P*(Ω − B) = 1 − E(1 − 1{B})* = 1 − E(1 − (1{B})_*) = E_*1{B}. The second part of (ii) follows from the chain (1{B})* ≤ 1{B*} = 1{(1{B})* ≥ 1} ≤ (1{B})*. The definition of P_* implies (iii) directly. To verify (iv), note that (1{Ω − B})_* = (1 − 1{B})_* = −(1{B} − 1)* = 1 − (1{B})*.
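The definitions of outer and inner probability can be made concrete on a small finite example. The sample space, σ-field, and measure below are purely illustrative choices:

```python
# Outer and inner probability on a three-point sample space with the coarse
# sigma-field A = {empty, {a}, {b,c}, Omega}. All names are illustrative.
Omega = frozenset({'a', 'b', 'c'})
A = [frozenset(), frozenset({'a'}), frozenset({'b', 'c'}), Omega]
P = {frozenset(): 0.0, frozenset({'a'}): 0.5,
     frozenset({'b', 'c'}): 0.5, Omega: 1.0}

def P_outer(B):
    """P*(B) = inf{P(A) : A in the sigma-field, A contains B}."""
    return min(P[S] for S in A if B <= S)

def P_inner(B):
    """P_*(B) = 1 - P*(Omega - B)."""
    return 1.0 - P_outer(Omega - B)

B = frozenset({'b'})            # B is not a member of the sigma-field
print(P_outer(B), P_inner(B))   # 0.5 0.0
```

Here B = {b} is not measurable, and the gap P_*(B) = 0 < 0.5 = P*(B) is exactly the kind of discrepancy that the minimal measurable covers of Lemma 6.5 quantify.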

The following three lemmas, Lemmas 6.6–6.8, provide several relations which will prove useful later on but which may be skipped on a first reading. The proofs are given in Section 6.4 and in the exercises.

Lemma 6.6 Let S, T : Ω → R̄ be arbitrary maps. The following statements are true almost surely, provided the statements are well defined:

(i) S_* + T* ≤ (S + T)* ≤ S* + T*, with all equalities if S is measurable;

(ii) S_* + T_* ≤ (S + T)_* ≤ S_* + T*, with all equalities if T is measurable;

(iii) (S − T)* ≥ S* − T*;

(iv) |S* − T*| ≤ |S − T|*;

(v) (1{T > c})* = 1{T* > c}, for any c ∈ R;

(vi) (1{T ≥ c})_* = 1{T_* ≥ c}, for any c ∈ R;

(vii) (S ∨ T)* = S* ∨ T*;

(viii) (S ∧ T)* ≤ S* ∧ T*, with equality if S is measurable.

Lemma 6.7 For any sets A, B ⊂ Ω,

(i) (A ∪ B)* = A* ∪ B* and (A ∩ B)_* = A_* ∩ B_*;

(ii) (A ∩ B)* ⊂ A* ∩ B* and (A ∪ B)_* ⊃ A_* ∪ B_*, with the inclusions replaced by equalities if either A or B is measurable;

(iii) if A ∩ B = ∅, then P_*(A) + P_*(B) ≤ P_*(A ∪ B) ≤ P*(A ∪ B) ≤ P*(A) + P*(B).

Lemma 6.8 Let T : Ω → R̄ be an arbitrary map, and let φ : R → R be monotone, with an extension to R̄. The following statements are true almost surely, provided the statements are well defined:

A. If φ is nondecreasing, then

(i) φ(T*) ≥ [φ(T)]*, with equality if φ is left-continuous on [−∞, ∞);

(ii) φ(T_*) ≤ [φ(T)]_*, with equality if φ is right-continuous on (−∞, ∞].

B. If φ is nonincreasing, then

(i) φ(T*) ≤ [φ(T)]_*, with equality if φ is left-continuous on [−∞, ∞);

(ii) φ(T_*) ≥ [φ(T)]*, with equality if φ is right-continuous on (−∞, ∞].

We next present an outer-expectation version of the unconditional Jensen's inequality:

Lemma 6.9 (Jensen's inequality) Let T : Ω → R be an arbitrary map with E*|T| < ∞, and assume φ : R → R is convex. Then

(i) E*φ(T) ≥ φ(E*T) ∨ φ(E_*T);

(ii) if φ is also monotone, then E_*φ(T) ≥ φ(E*T) ∧ φ(E_*T).

Proof. Assume first that φ is monotone increasing. Since φ is also continuous (by convexity), E*φ(T) = Eφ(T*) ≥ φ(ET*) = φ(E*T), where the first equality follows from A(i) of Lemma 6.8 and the inequality from the usual Jensen's inequality. Similar arguments based on A(ii) of the same lemma verify that E_*φ(T) ≥ φ(E_*T). Note also that φ(E*T) ≥ φ(E_*T). Now assume that φ is monotone decreasing. Using B(i) and B(ii) of Lemma 6.8 and arguments similar to those used for increasing φ, we obtain both E*φ(T) ≥ φ(E_*T) and E_*φ(T) ≥ φ(E*T); note that in this case φ(E_*T) ≥ φ(E*T). Thus, when φ is monotone (either increasing or decreasing), we have both E*φ(T) ≥ φ(E*T) ∨ φ(E_*T) and E_*φ(T) ≥ φ(E*T) ∧ φ(E_*T). Hence (ii) follows.

We have also proved (i) in the case where φ is monotone. Suppose now that φ is not monotone. Then there exists a c ∈ R so that φ is nonincreasing over (−∞, c] and nondecreasing over (c, ∞). Let g1(t) ≡ φ(t)1{t ≤ c} + φ(c)1{t > c} and g2(t) ≡ φ(c)1{t ≤ c} + φ(t)1{t > c}, and note that φ(t) = g1(t) + g2(t) − φ(c) and that both g1 and g2 are convex and monotone. Now [φ(T)]* = [g1(T) + g2(T) − φ(c)]* ≥ [g1(T)]_* + [g2(T)]* − φ(c), by Part (i) of Lemma 6.6. Using the results of the previous paragraph, we thus have E*φ(T) ≥ g1(E*T) + g2(E*T) − φ(c) = φ(E*T). However, we also have [g1(T) + g2(T) − φ(c)]* ≥ [g1(T)]* + [g2(T)]_* − φ(c). Thus, again using the results of the previous paragraph, E*φ(T) ≥ g1(E_*T) + g2(E_*T) − φ(c) = φ(E_*T). Hence (i) follows.

The following outer-expectation version of Chebyshev's inequality is also useful:

Lemma 6.10 (Chebyshev's inequality) Let T : Ω → R be an arbitrary map, with φ : [0, ∞) → [0, ∞) nondecreasing and strictly positive on (0, ∞). Then, for every u > 0,

P*(|T| ≥ u) ≤ E*φ(|T|) / φ(u).

Proof. The result follows from the standard Chebyshev inequality combined with the chain of inequalities

(1{|T| ≥ u})* ≤ (1{φ(|T|) ≥ φ(u)})* ≤ 1{[φ(|T|)]* ≥ φ(u)}.

The first inequality follows from the fact that |T| ≥ u implies φ(|T|) ≥ φ(u), and the second follows from A(i) of Lemma 6.8.
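As a numerical sanity check (not a proof), the bound of Lemma 6.10 can be tested by simulation for a measurable T. The choice φ(x) = x², the Gaussian distribution, and the sample size below are arbitrary:

```python
import random

# Monte Carlo check of the Chebyshev-type bound
# P(|T| >= u) <= E phi(|T|) / phi(u), here with phi(x) = x^2
# and T ~ N(0, 1); seed and sample size are arbitrary choices.
random.seed(0)
sample = [random.gauss(0.0, 1.0) for _ in range(100_000)]
u = 2.0
lhs = sum(abs(t) >= u for t in sample) / len(sample)   # empirical P(|T| >= u)
rhs = sum(t * t for t in sample) / len(sample) / u**2  # empirical E T^2 / u^2
print(lhs <= rhs)  # True: lhs is near 0.046, rhs near 0.25
```

The bound is far from tight here, which is typical of Chebyshev-type inequalities; its value lies in requiring only a moment of φ(|T|).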

There are many other analogies between outer and standard versions of expectation and probability, but we present only a few more in this chapter. We next give versions of the monotone and dominated convergence theorems. The proofs are given in Section 6.4, and some additional results appear in the exercises.

Lemma 6.11 (Monotone convergence) Let Tn, T : Ω → R̄ be arbitrary maps on a probability space, with Tn ↑ T pointwise on a set of inner probability one. Then Tn* ↑ T* almost surely. Provided E*Tn > −∞ for some n, we also have E*Tn ↑ E*T.

Lemma 6.12 (Dominated convergence) Let Tn, T, S : Ω → R̄ be maps on a probability space, with |Tn − T|* → 0 almost surely, |Tn| ≤ S for all n, and E*S < ∞. Then E*Tn → E*T.

Let (Ω, Ā, P̄) be the P-completion of the probability space (Ω, A, P), as defined in the previous section (for general measure spaces). Recall that a completion of a measure space is itself a measure space. One can also show that Ā is the σ-field of all sets of the form A ∪ N, with A ∈ A and N ⊂ Ω satisfying P*(N) = 0, and that P̄ is the probability measure satisfying P̄(A ∪ N) = P(A). A nice property of (Ω, Ā, P̄) is that for every measurable map S̄ : (Ω, Ā) → R̄, there is a measurable map S : (Ω, A) → R̄ such that P*(S̄ ≠ S) = 0. Furthermore, a minimal measurable cover T* of a map T : (Ω, A, P) → R̄ is a version of a minimal measurable cover T̄* of T as a map on the P-completion of (Ω, A, P), i.e., P*(T* ≠ T̄*) = 0. While it is not difficult to show this, we do not prove it.

We close this section with two results that have application to product probability spaces. The first result involves perfect maps, and the second is a special formulation of Fubini's theorem. Consider composing a map T : Ω → R̄ with a measurable map φ : Ω̄ → Ω defined on some probability space (Ω̄, Ā, P̄), to form T ∘ φ : (Ω̄, Ā, P̄) → R̄. Denote by T* the minimal measurable cover of T for P̄ ∘ φ⁻¹. Since T* ∘ φ ≥ T ∘ φ, it is easy to see that (T ∘ φ)* ≤ T* ∘ φ. The map φ is perfect if (T ∘ φ)* = T* ∘ φ for every bounded T : Ω → R̄. This property ensures that P̄*(φ ∈ A) = (P̄ ∘ φ⁻¹)*(A) for every set A ⊂ Ω.

An important example of a perfect map is a coordinate projection on a product probability space. Specifically, let T be a real-valued map defined on (Ω1 × Ω2, A1 × A2, P1 × P2) which depends only on the first coordinate of ω = (ω1, ω2). Then T* can be computed by simply ignoring Ω2 and thinking of T as a map on Ω1. More precisely, suppose T = T1 ∘ π1, where π1 is the projection onto the first coordinate. The following lemma shows that T* = T1* ∘ π1, and thus coordinate projections such as π1 are perfect. We will see other examples of perfect maps later on, in Chapter 7.

Lemma 6.13 A coordinate projection on a product probability space withproduct measure is perfect.

Proof. Let π1 : (Ω1 × Ω2, A1 × A2, P1 × P2) → Ω1 be the projection onto the first coordinate, and let T : Ω1 → R be bounded but otherwise arbitrary. Let T* be the minimal measurable cover of T for P1 = (P1 × P2) ∘ π1⁻¹. By definition, (T ∘ π1)* ≤ T* ∘ π1. Now suppose U : Ω1 × Ω2 → R̄ is measurable with U ≥ T ∘ π1, P1 × P2-a.s. Fubini's theorem yields that for P2-almost all ω2, U(ω1, ω2) ≥ T(ω1) for P1-almost all ω1. But for fixed ω2, U is a measurable function of ω1. Thus, for P2-almost all ω2, U(ω1, ω2) ≥ T*(ω1) for P1-almost all ω1. Applying Fubini's theorem again, the jointly measurable set {(ω1, ω2) : U < T* ∘ π1} is P1 × P2-null. Hence (T ∘ π1)* = T* ∘ π1 almost surely.

Now we consider Fubini's theorem for maps on product spaces which may not be measurable. There is no generally satisfactory version of Fubini's theorem that works in all nonmeasurable settings of interest, and it is frequently necessary to establish at least some kind of measurability to obtain certain key empirical process results. The version of Fubini's theorem we now present basically states that repeated outer expectations are always smaller than joint outer expectations. Let T be an arbitrary real map defined on the product space (Ω1 × Ω2, A1 × A2, P1 × P2). We write E1*E2*T to mean outer expectations taken in turn: for fixed ω1, (E2*T)(ω1) is the infimum of ∫_{Ω2} U(ω2) dP2(ω2) over all measurable U : Ω2 → R̄ such that U(ω2) ≥ T(ω1, ω2) for every ω2 and the integral exists. Then E1*E2*T is the outer integral of E2*T : Ω1 → R̄ with respect to P1. Repeated inner expectations are defined analogously. The following version of Fubini's theorem gives bounds for this repeated expectation process. We omit the proof.

Lemma 6.14 (Fubini's theorem) Let T be an arbitrary real-valued map on a product probability space. Then E_*T ≤ E1_*E2_*T ≤ E1*E2*T ≤ E*T.

6.3 Linear Operators and Functional Differentiation

A linear operator is a map T : D → E between normed spaces with the property that T(ax + by) = aT(x) + bT(y) for all scalars a, b and all x, y ∈ D. When the range space E is R, T is called a linear functional. When T is linear, we will often write Tx instead of T(x). A linear operator T : D → E is a bounded linear operator if

‖T‖ ≡ sup_{x∈D: ‖x‖≤1} ‖Tx‖ < ∞.    (6.1)

Here, the norms ‖ · ‖ are defined by context. We have the following proposition:

Proposition 6.15 For a linear operator T : D → E between normed spaces, the following are equivalent:

(i) T is continuous at a point x0 ∈ D;

(ii) T is continuous on all of D;

(iii) T is bounded.

Proof. We save the implication (i) ⇒ (ii) as an exercise. Note that, by linearity, boundedness of T is equivalent to the existence of some 0 < c < ∞ for which

‖Tx‖ ≤ c‖x‖ for all x ∈ D.    (6.2)

Assume T is continuous but that no 0 < c < ∞ satisfies (6.2). Then there exists a sequence xn ∈ D with ‖xn‖ = 1 and ‖Txn‖ ≥ n for all n ≥ 1. Define yn ≡ ‖Txn‖⁻¹xn, and note that ‖Tyn‖ = 1 by linearity. Now ‖yn‖ ≤ 1/n, so yn → 0 and thus Tyn → 0 by continuity, which is a contradiction. Hence some 0 < c < ∞ satisfies (6.2), and (iii) follows. Now assume (iii), and let xn ∈ D be any sequence with xn → 0. Then, by (6.2), ‖Txn‖ → 0, and thus (i) holds at x0 = 0.
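The interplay between (6.1) and (6.2) can be illustrated in R², where every linear operator is bounded. The particular map T and the sampling scheme below are hypothetical:

```python
import math
import random

# Illustration of (6.1)/(6.2) in R^2 for the illustrative linear map
# T(x) = (3*x1 + x2, x2): estimate ||T|| by sampling the unit circle,
# then verify the bound ||Tx|| <= c ||x|| for random x.
def T(x):
    return (3 * x[0] + x[1], x[1])

def norm(x):
    return math.hypot(x[0], x[1])

# Approximate ||T|| = sup_{||x|| <= 1} ||Tx|| via a dense grid of directions.
est = max(norm(T((math.cos(a), math.sin(a))))
          for a in [2 * math.pi * k / 10_000 for k in range(10_000)])

random.seed(1)
c = est + 1e-3
for _ in range(1000):
    x = (random.uniform(-5, 5), random.uniform(-5, 5))
    assert norm(T(x)) <= c * norm(x) + 1e-9   # the bound (6.2)

print(round(est, 3))  # 3.18, close to the exact norm sqrt((11 + sqrt(85))/2)
```

For an unbounded operator on an infinite-dimensional space, no such finite estimate would exist, which is exactly the failure of continuity described in the proof above.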

For normed spaces D and E, let B(D, E) be the space of all bounded linear operators T : D → E. The norm ‖ · ‖ defined in (6.1) makes B(D, E) into a normed space, and when E is a Banach space, B(D, E) is also a Banach space. When E is a Banach space but D is not, any T ∈ B(D, E) has a unique continuous extension to the completion D̄. To see this, fix x ∈ D̄ and let xn ∈ D be a sequence converging to x. Since ‖Txn − Txm‖ ≤ c‖xn − xm‖ and E is complete, Txn converges to some point in E. Next, note that if two sequences xn, yn ∈ D both converge to x, then ‖Txn − Tyn‖ ≤ c‖xn − yn‖ → 0. Thus we can define the extension T̄ : D̄ → E to be the unique linear operator with T̄x = lim_{n→∞} Txn, where x is any point in D̄ and xn is any sequence in D converging to x.

For normed spaces D and E, and for any T ∈ B(D, E), N(T) ≡ {x ∈ D : Tx = 0} is the null space of T and R(T) ≡ {y ∈ E : Tx = y for some x ∈ D} is the range space of T. It is clear that T is one-to-one if and only if N(T) = {0}. We have the following two results for inverses, which we give without proof:

Lemma 6.16 Assume D and E are normed spaces and that T ∈ B(D, E). Then

(i) T has a continuous inverse T⁻¹ : R(T) → D if and only if there exists a c > 0 so that ‖Tx‖ ≥ c‖x‖ for all x ∈ D;

(ii) (Banach's theorem) if D and E are complete and T is continuous with N(T) = {0}, then T⁻¹ is continuous if and only if R(T) is closed.

A linear operator T : D → E between normed spaces is a compact operator if the closure of T(U) is compact in E, where U ≡ {x ∈ D : ‖x‖ ≤ 1} is the unit ball in D and, for a set A ⊂ D, T(A) ≡ {Tx : x ∈ A}. The operator T is onto if for every y ∈ E there exists an x ∈ D with Tx = y. Later on in the book, we will encounter linear operators of the form T + K, where T is continuously invertible and onto and K is compact. The following result will be useful:

Lemma 6.17 Let A ≡ T + K : D → E be a linear operator between Banach spaces, where T is both continuously invertible and onto and K is compact. If N(A) = {0}, then A is also continuously invertible and onto.

Proof. We only sketch the proof. Since T⁻¹ is continuous, the operator T⁻¹K : D → D is compact. Hence I + T⁻¹K is one-to-one and therefore also onto, by a result of Riesz for compact operators (see, for example, Theorem 3.4 of Kress, 1999). Thus T + K is also onto. We will be done if we can show that I + T⁻¹K is continuously invertible, since that would imply that (T + K)⁻¹ = (I + T⁻¹K)⁻¹T⁻¹ is bounded. Let L ≡ I + T⁻¹K, and assume L⁻¹ is not bounded. Then there exists a sequence {xn} ⊂ D with ‖xn‖ = 1 and ‖L⁻¹xn‖ ≥ n for all integers n ≥ 1. Let yn ≡ ‖L⁻¹xn‖⁻¹xn and φn ≡ ‖L⁻¹xn‖⁻¹L⁻¹xn, and note that yn → 0 while ‖φn‖ = 1 for all n ≥ 1. Since T⁻¹K is compact, there exists a subsequence n′ so that T⁻¹Kφn′ → φ ∈ D. Since φn + T⁻¹Kφn = yn for all n ≥ 1, we have φn′ → −φ. Hence −φ ∈ N(L), which implies φ = 0 since L is one-to-one. But this contradicts ‖φn‖ = 1 for all n ≥ 1. Thus L⁻¹ is bounded.

The following simple inversion result for contraction operators is also useful. An operator A is a contraction operator if ‖A‖ < 1.

Proposition 6.18 Let A : D → D be a linear operator with ‖A‖ < 1. Then I − A, where I is the identity, is continuously invertible and onto, with inverse (I − A)⁻¹ = ∑_{j=0}^∞ A^j.

Proof. Let B ≡ ∑_{j=0}^∞ A^j, and note that ‖B‖ ≤ ∑_{j=0}^∞ ‖A‖^j = (1 − ‖A‖)⁻¹ < ∞. Thus B is a bounded linear operator on D. Since (I − A)B = B(I − A) = I by simple algebra, we have B = (I − A)⁻¹, and the result follows.
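In finite dimensions, Proposition 6.18 can be checked directly. The 2 × 2 matrix below (whose operator norm is well below 1) and the truncation level of the Neumann series are illustrative choices:

```python
# Finite-dimensional sketch of Proposition 6.18: for a 2x2 matrix A with
# ||A|| < 1, the Neumann series I + A + A^2 + ... inverts I - A.
def mat_mul(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def mat_add(X, Y):
    return [[X[i][j] + Y[i][j] for j in range(2)] for i in range(2)]

A = [[0.2, 0.1],
     [0.0, 0.3]]      # spectral norm roughly 0.33 < 1
I = [[1.0, 0.0],
     [0.0, 1.0]]

# Partial sums of the Neumann series B = I + A + A^2 + ...
B, power = I, I
for _ in range(60):
    power = mat_mul(power, A)
    B = mat_add(B, power)

# Verify (I - A) B = I, i.e. B inverts I - A up to a tiny truncation error.
I_minus_A = [[I[i][j] - A[i][j] for j in range(2)] for i in range(2)]
check = mat_mul(I_minus_A, B)
print(all(abs(check[i][j] - I[i][j]) < 1e-10
          for i in range(2) for j in range(2)))  # True
```

Sixty terms are far more than needed here: the truncation error decays geometrically at rate ‖A‖, exactly as in the norm bound of the proof.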

We now shift our attention to differentiation. Let D and E be two normed spaces, and let φ : Dφ ⊂ D → E be a function; we allow the domain Dφ of the function to be an arbitrary subset of D. The function φ : Dφ ⊂ D → E is Gateaux-differentiable at θ ∈ Dφ, in the direction h, if there exists a quantity φ′θ(h) ∈ E so that

(φ(θ + tn h) − φ(θ))/tn → φ′θ(h),

as n → ∞, for any scalar sequence tn → 0 with θ + tn h ∈ Dφ for all n. Gateaux differentiability is usually not strong enough for the applications of functional derivatives needed for Z-estimators and the delta method. The stronger notions we need are Hadamard and Fréchet differentiability.
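A Gateaux derivative can be approximated by its defining difference quotient. In the following sketch, the functional φ, the finite grid, the point θ, and the direction h are all hypothetical choices:

```python
# Numerical illustration of a Gateaux derivative. For the illustrative
# functional phi(x) = sum over the grid of x(t)^2, the derivative at theta
# in direction h is phi'_theta(h) = 2 * sum theta(t) h(t); a difference
# quotient with a small t_n approaches this limit.
grid = [i / 10 for i in range(11)]
theta = [t * t for t in grid]   # theta(t) = t^2 on the grid
h = [1.0 for _ in grid]         # direction h(t) = 1

def phi(x):
    return sum(v * v for v in x)

t_n = 1e-6
quotient = (phi([a + t_n * b for a, b in zip(theta, h)]) - phi(theta)) / t_n
analytic = 2 * sum(a * b for a, b in zip(theta, h))
print(abs(quotient - analytic) < 1e-4)  # True
```

Note that this checks differentiability only in one fixed direction h; the stronger Hadamard notion defined next requires control over moving directions hn → h.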

A map φ : Dφ ⊂ D → E is Hadamard-differentiable at θ ∈ Dφ if there exists a continuous linear operator φ′θ : D → E such that

(φ(θ + tn hn) − φ(θ))/tn → φ′θ(h),    (6.3)

as n → ∞, for any scalar sequence tn → 0 and any h, hn ∈ D with hn → h and θ + tn hn ∈ Dφ for all n. It is left as an exercise to show that Hadamard differentiability is equivalent to compact differentiability, which requires that

sup_{h∈K, θ+th∈Dφ} ‖(φ(θ + th) − φ(θ))/t − φ′θ(h)‖ → 0, as t → 0,    (6.4)

for every compact K ⊂ D. Hadamard differentiability can be refined by restricting the h values to a set D0 ⊂ D. More precisely, if in (6.3) it is required that hn → h only for h ∈ D0 ⊂ D, we say φ is Hadamard-differentiable tangentially to the set D0. There appears to be no easy way to refine compact differentiability in an equivalent manner.

A map φ : Dφ ⊂ D → E is Fréchet-differentiable at θ if there exists a continuous linear operator φ′θ : D → E so that (6.4) holds uniformly in h on bounded subsets of D. This is equivalent to ‖φ(θ + h) − φ(θ) − φ′θ(h)‖ = o(‖h‖), as ‖h‖ → 0. Since compact sets are bounded, Fréchet differentiability implies Hadamard differentiability. Fréchet differentiability will be needed for Z-estimator theory, while Hadamard differentiability is useful in the delta method. The following chain rule for Hadamard differentiability will also prove useful:

Lemma 6.19 (Chain rule) Assume φ : Dφ ⊂ D → Eψ ⊂ E is Hadamard-differentiable at θ ∈ Dφ tangentially to D0 ⊂ D, and ψ : Eψ ⊂ E → F is Hadamard-differentiable at φ(θ) tangentially to φ′θ(D0). Then ψ ∘ φ : Dφ → F is Hadamard-differentiable at θ tangentially to D0, with derivative ψ′φ(θ) ∘ φ′θ.

Proof. First, ψ ∘ φ(θ + t ht) − ψ ∘ φ(θ) = ψ(φ(θ) + t kt) − ψ(φ(θ)), where kt ≡ [φ(θ + t ht) − φ(θ)]/t. Note that if h ∈ D0, then kt → k ≡ φ′θ(h) ∈ φ′θ(D0), as t → 0, by the Hadamard differentiability of φ. Now [ψ(φ(θ) + t kt) − ψ(φ(θ))]/t → ψ′φ(θ)(k) by the Hadamard differentiability of ψ.

6.4 Proofs

Proof of Theorem 6.2. First assume K ⊂ ℓ∞(T) is compact. Let ρ(s, t) ≡ sup_{x∈K} |x(s) − x(t)|. We will now establish that (T, ρ) is totally bounded. Fix η > 0 and cover K with finitely many open balls of radius η, centered at x1, . . . , xk. Partition R^k into cubes with edges of length η. For every such cube for which it is possible, choose at most one t ∈ T so that (x1(t), . . . , xk(t)) is in the cube. This results in a finite set Tη ≡ {t1, . . . , tm} ⊂ T, because x1, . . . , xk are uniformly bounded. For each s ∈ Tη and every t ∈ T for which (x1(t), . . . , xk(t)) and (x1(s), . . . , xk(s)) are in the same cube, we have

ρ(s, t) = sup_{x∈K} |x(s) − x(t)| ≤ sup_{1≤j≤k} |xj(s) − xj(t)| + 2 sup_{x∈K} inf_{1≤j≤k} sup_{u∈T} |xj(u) − x(u)| < 3η.

Thus the balls {t : ρ(t, t1) < 3η}, . . . , {t : ρ(t, tm) < 3η} completely cover T. Hence (T, ρ) is totally bounded, since η was arbitrary. Also, by construction, the condition in Part (a.ii) of the theorem is satisfied. Combining this with the total boundedness of (T, ρ) yields Condition (a.i). We have now obtained the fairly strong result that compactness of the closure of K ⊂ ℓ∞(T) implies the existence of a semimetric ρ which makes (T, ρ) totally bounded and under which Conditions (a.i) and (a.ii) are satisfied for K.

Now assume that the closure of K ⊂ ℓ∞(T) is σ-compact. This implies that there exists a sequence of compact sets K1 ⊂ K2 ⊂ · · · whose union equals the closure of K. The previous result yields, for each i ≥ 1, a semimetric ρi making T totally bounded and for which Conditions (a.i) and (a.ii) are satisfied for Ki. Now let ρ(s, t) ≡ ∑_{i=1}^∞ 2^{−i} (ρi(s, t) ∧ 1). Fix η > 0, and select a finite integer m so that 2^{−m} < η. Cover T with finitely many open ρm-balls of radius η, and let Tη = {t1, . . . , tk} be their centers. Because ρ1 ≤ ρ2 ≤ · · ·, there is for every t ∈ T an s ∈ Tη with ρ(s, t) ≤ ∑_{i=1}^m 2^{−i}ρi(s, t) + 2^{−m} ≤ 2η. Thus (T, ρ) is totally bounded, since η was arbitrary. Now, any x ∈ K lies in Km for some finite m ≥ 1, and thus x is both bounded and uniformly ρ-continuous, since ρm ≤ 2^m ρ. Hence σ-compactness of the closure of K ⊂ ℓ∞(T) implies K ⊂ UC(T, ρ) for some semimetric ρ making T totally bounded. In the discussion preceding the statement of Theorem 6.2, we argued that UC(T, ρ) is σ-compact whenever (T, ρ) is totally bounded. Hence we have proven Part (b).

The only part of the proof which remains is to show that if K ⊂ UC(T, ρ) satisfies Conditions (a.i) and (a.ii), where (T, ρ) is totally bounded, then the closure of K is compact. Assume Conditions (a.i) and (a.ii) hold for K, and define

mδ(x) ≡ sup_{s,t∈T: ρ(s,t)<δ} |x(s) − x(t)|

and mδ ≡ sup_{x∈K} mδ(x). Note that mδ(x) is continuous in x and that m_{1/n}(x) is nonincreasing in n, with lim_{n→∞} m_{1/n}(x) = 0. Choose k < ∞ large enough so that m_{1/k} < ∞. For every δ > 0, let Tδ ⊂ T be a finite mesh satisfying sup_{t∈T} inf_{s∈Tδ} ρ(s, t) < δ, and let Nδ be the number of points in Tδ. Now, for any t ∈ T, |x(t)| ≤ |x(t0)| + |x(t) − x(t0)| ≤ |x(t0)| + N_{1/k} m_{1/k}, and thus α ≡ sup_{x∈K} sup_{t∈T} |x(t)| < ∞. For each ε > 0, pick a δ > 0 so that mδ < ε and an integer n < ∞ so that α/n ≤ ε. Let U ≡ T_{1/(2δ)}, and define a "bracket" to be a finite collection h = {ht : t ∈ U} so that ht = −α + jα/n, for some integer 1 ≤ j ≤ 2n − 1, for each t ∈ U. Say that x ∈ ℓ∞(T) is "in" the bracket h, denoted x ∈ h, if x(t) ∈ [ht, ht + α/n] for all t ∈ U. Let B(K) be the set of all brackets h for which x ∈ h for some x ∈ K. For each h ∈ B(K), choose one and only one x ∈ K with x ∈ h, discard duplicates, and denote the resulting set X(K). It is not hard to verify that sup_{x∈K} inf_{y∈X(K)} ‖x − y‖_T < 2ε, and thus the union of 2ε-balls with centers in X(K) is a finite cover of K. Since ε was arbitrary, K is totally bounded.

Proof of Lemma 6.3. Begin by selecting a measurable sequence Um ≥ T such that E arctan Um ↓ E* arctan T, and set T*(ω) ≡ lim_{m→∞} inf_{1≤k≤m} Uk(ω). This gives a measurable function T* taking values in R̄, with T* ≥ T, and E arctan T* = E* arctan T by monotone convergence. For any measurable U ≥ T, arctan(U ∧ T*) ≥ arctan T, and thus E arctan(U ∧ T*) ≥ E* arctan T = E arctan T*. However, U ∧ T* is trivially smaller than T*, and since both quantities therefore have the same expectation, arctan(U ∧ T*) = arctan T* a.s. Hence T* ≤ U a.s., and (i) and (ii) follow. When ET* exists, it is larger than E*T by (i) yet smaller by (ii), and thus ET* = E*T. When E*T < ∞, there exists a measurable U ≥ T with EU⁺ < ∞, where z⁺ denotes the positive part of z. Hence E(T*)⁺ ≤ EU⁺ < ∞, and ET* exists.

Proof of Lemma 6.6. The second inequality in (i) is obvious. If S and U ≥ S + T are both measurable, then U − S ≥ T, and hence U − S ≥ T* since U − S is also measurable. Taking U = (S + T)* yields (S + T)* ≥ S + T*, so the second inequality is an equality when S is measurable. Now (S + T)* ≥ (S_* + T)* = S_* + T*, and we obtain the first inequality. If S is measurable, then S_* = S*, and thus S_* + T* = S* + T*. Part (ii) is left as an exercise. Part (iii) follows from the second inequality in (i) after relabeling and rearranging. Part (iv) follows from S* − T* ≤ (S − T)* ≤ |S − T|* and then exchanging the roles of S and T. For Part (v), it is clear that (1{T > c})* ≤ 1{T* > c}. If U ≥ 1{T > c} is measurable, then S ≡ T*1{U ≥ 1} + (T* ∧ c)1{U < 1} ≥ T and is measurable. Hence S ≥ T*, and thus T* ≤ c whenever U < 1. This trivially implies 1{T* > c} = 0 when U < 1, and thus 1{T* > c} ≤ U. Part (vi) is left as an exercise.

For Part (vii), (S ∨ T)* ≤ S* ∨ T* trivially. Let U ≡ (S ∨ T)*, and note that U is measurable with both U ≥ S and U ≥ T. Hence both U ≥ S* and U ≥ T*, and thus (S ∨ T)* ≥ S* ∨ T*, yielding the desired equality. The inequality in (viii) is obvious. Assume S is measurable, and let U ≡ (S ∧ T)*. Clearly, U ≤ S* ∧ T* = S ∧ T*. Define T̃ ≡ U1{U < S} + T*1{U ≥ S}, and note that T̃ is measurable and T̃ ≥ T, since if U < S then S ∧ T = T and thus U ≥ T. Hence T̃ ≥ T*. Fix ω ∈ Ω. If U < S, then U = T̃ ≥ T* ≥ S ∧ T*. If U ≥ S, then U = S, since also U ≤ S; furthermore, in this case T* ≥ S, for if T* < S then U ≤ S ∧ T* < S, which is clearly a contradiction. Thus, when U ≥ S, U = S = S ∧ T*. Hence U ≥ S ∧ T* a.s., and the desired equality in (viii) follows.

Proof of Lemma 6.7. The first part of (i) is a consequence of the following chain of equalities: 1{(A ∪ B)*} = (1{A ∪ B})* = (1{A} ∨ 1{B})* = (1{A})* ∨ (1{B})* = 1{A*} ∨ 1{B*} = 1{A* ∪ B*}. The first and fourth equalities follow from (ii) of Lemma 6.5, the second and fifth equalities follow directly, and the third equality follows from (vii) of Lemma 6.6. For the second part of (i), we have (A ∩ B)_* = Ω − [(Ω − A) ∪ (Ω − B)]* = Ω − [(Ω − A)* ∪ (Ω − B)*] = Ω − [(Ω − A_*) ∪ (Ω − B_*)] = A_* ∩ B_*, where the second equality follows from the first part of (i).

The inclusions in Part (ii) are obvious. Assume A is measurable. Then (viii) of Lemma 6.6 can be used to validate the following string of equalities: 1{(A ∩ B)*} = (1{A ∩ B})* = (1{A} ∧ 1{B})* = (1{A})* ∧ (1{B})* = 1{A*} ∧ 1{B*} = 1{A* ∩ B*}. Thus (A ∩ B)* = A* ∩ B*. By symmetry, this works whether A or B is measurable. The proof that (A ∪ B)_* = A_* ∪ B_* when either A or B is measurable is left as an exercise, as is the proof of Part (iii).

Proof of Lemma 6.8. All of the inequalities follow from the definitions. For the equality in A(i), assume that φ is nondecreasing and left-continuous. Define φ⁻¹(u) ≡ inf{t : φ(t) ≥ u}, and note that φ(t) > u if and only if t > φ⁻¹(u). Thus, for any c ∈ R,

1{φ(T*) > c} = 1{T* > φ⁻¹(c)} = (1{T > φ⁻¹(c)})* = (1{φ(T) > c})* = 1{[φ(T)]* > c}.

The second and fourth equalities follow from (v) of Lemma 6.6. Hence φ(T*) = [φ(T)]*. For the equality in A(ii), assume that φ is nondecreasing and right-continuous, and define φ⁻¹(u) ≡ sup{t : φ(t) ≤ u}. Note that φ(t) ≥ u if and only if t ≥ φ⁻¹(u). The proof proceeds in the same manner as for A(i), only Part (vi) of Lemma 6.6 is used in place of Part (v). We leave the proof of Part B as an exercise.

Proof of Lemma 6.11. Clearly, lim inf Tn* ≤ lim sup Tn* ≤ T*. Conversely, lim inf Tn* ≥ lim inf Tn = T on a set of inner probability one, and lim inf Tn* is measurable; thus lim inf Tn* ≥ T* almost surely. Hence Tn* ↑ T* almost surely. Now E*Tn = ETn* ↑ ET* = E*T by monotone convergence for measurable maps. Note that we are allowing +∞ as a possible value of E*Tn, for some n, or of E*T.

Proof of Lemma 6.12. Since |T| ≤ |Tn| + |T − Tn| for all n, we have |T − Tn|* ≤ 2S* a.s. Fix ε > 0. Since E*S < ∞, there exists a 0 < k < ∞ so that E[S*1{S* > k}] ≤ ε/2. Thus E*|T − Tn| ≤ E[k ∧ |T − Tn|*] + 2E[S*1{S* > k}], and hence lim sup_{n→∞} E*|T − Tn| ≤ ε by bounded convergence. Since ε was arbitrary, E*|T − Tn| → 0, and the result follows from |E*Tn − E*T| ≤ E*|Tn − T|.


100 6. Preliminaries for Empirical Processes

6.5 Exercises

6.5.1. Show that Part (iii) in the definition of σ-field can be replaced, without really changing the definition, by the following: The countable intersection ∩_{j=1}^∞ Uj ∈ A whenever Uj ∈ A for all j ≥ 1.

6.5.2. Show the following:

(a) For a metric space D and a set A ⊂ D, the closure Ā consists of all limits of sequences xn ∈ A.

(b) Two metrics d1 and d2 on a set D are equivalent if and only if we have the following for any sequence xj ∈ D, as n, m → ∞: d1(xn, xm) → 0 if and only if d2(xn, xm) → 0.

(c) A function f : D → E between two metric spaces is continuous (in the topological sense) if and only if, for all x ∈ D and all sequences xn ∈ D, f(xn) → f(x) whenever xn → x.

6.5.3. Verify the implication (ii) ⇒ (i) in Lemma 6.1.

6.5.4. Show that if a subset K of a metric space is not totally bounded, then it is possible to construct a sequence xn ∈ K which has no Cauchy subsequences.

6.5.5. Show that the following spaces are complete with respect to the uniform metric:

(a) C[a, b] and D[a, b];

(b) UC(T, ρ) and C(T̄, ρ), where (T, ρ) is a totally bounded semimetric space;

(c) UC(T, ρ) and ℓ∞(T), where T is an arbitrary set.

6.5.6. Show that a uniformly continuous function f : T → R, where T is totally bounded, has a unique continuous extension to T̄.

6.5.7. Let CL[0, 1] ⊂ C[0, 1] be the space of all Lipschitz-continuous functions on [0, 1], and endow it with the uniform metric:

(a) Show that CL[0, 1] is dense in C[0, 1].

(b) Show that no open ball in C[0, 1] is contained in CL[0, 1].

(c) Show that CL[0, 1] is not complete.

(d) Show that for 0 < c < ∞, the set {f ∈ CL[0, 1] : |f(x)| ≤ c and |f(x) − f(y)| ≤ c|x − y|, ∀x, y ∈ [0, 1]} is compact.


6.5.8. A collection F of maps f : D → E between metric spaces, with respective Borel σ-fields D and E, can generate a (possibly) new σ-field for D by considering all inverse images f⁻¹(A), for f ∈ F and A ∈ E. Show that the σ-field σp generated by the coordinate projections x → x(t) on C[a, b] is equal to the Borel σ-field σc generated by the uniform norm. Hint: Show first that continuity of the projection maps implies σp ⊂ σc. Second, show that the open balls in σc can be created from countable set operations on sets in σp.

6.5.9. Show that the following metrics generate the product topology on D × E, where d and e are the respective metrics for D and E:

(i) ρ1((x1, y1), (x2, y2)) ≡ d(x1, x2) + e(y1, y2).

(ii) ρ2((x1, y1), (x2, y2)) ≡ √(d²(x1, x2) + e²(y1, y2)).
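One route through this exercise is the pointwise bounds ρ2 ≤ ρ1 ≤ √2·ρ2 (from a² + b² ≤ (a + b)² and (a + b)² ≤ 2(a² + b²) for a, b ≥ 0), which show the two metrics have the same convergent sequences and hence the same topology. A quick numeric check of the bounds, taking D = E = R with d and e the absolute difference (an arbitrary choice for illustration):

```python
import math, random

random.seed(0)

# Component metrics on D = E = R: the usual absolute difference.
d = lambda x1, x2: abs(x1 - x2)
e = lambda y1, y2: abs(y1 - y2)

def rho1(p, q):
    return d(p[0], q[0]) + e(p[1], q[1])

def rho2(p, q):
    return math.sqrt(d(p[0], q[0])**2 + e(p[1], q[1])**2)

for _ in range(1000):
    p = (random.uniform(-5, 5), random.uniform(-5, 5))
    q = (random.uniform(-5, 5), random.uniform(-5, 5))
    # rho2 <= rho1 and rho1 <= sqrt(2) * rho2, up to floating-point slack.
    assert rho2(p, q) <= rho1(p, q) + 1e-12
    assert rho1(p, q) <= math.sqrt(2) * rho2(p, q) + 1e-12
```

Since each metric bounds the other up to a constant, open balls in one contain open balls in the other, which is exactly the topological equivalence the exercise asks for.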

6.5.10. For any map T : Ω → R, show that E_∗T is the supremum of all EU, where U ≤ T, U : Ω → R measurable and EU exists. Show also that for any set B ⊂ Ω, P_∗(B) is the supremum of all P(A), where A ⊂ B and A ∈ A.

6.5.11. Use Lemma 6.3 to prove Lemma 6.4.

6.5.12. Prove Parts (ii) and (vi) of Lemma 6.6 using Parts (i) and (v), respectively.

6.5.13. Let S, T : Ω → R be arbitrary. Show the following:

(a) |S_∗ − T_∗| ≤ |S − T|∗ + (S∗ − S_∗) ∧ (T∗ − T_∗);

(b) |S − T|_∗ ≤ (S∗ − T_∗) ∨ (T∗ − S_∗) ≤ |S − T|∗ + (S∗ − S_∗) ∧ (T∗ − T_∗).

6.5.14. Finish the proof of Lemma 6.7:

(a) Show that (A ∪ B)∗ = A∗ ∪ B∗ if either A or B is measurable.

(b) Prove Part (iii) of the lemma.

6.5.15. Prove Part B of Lemma 6.8 using the approach outlined in the proof of Part A.

6.5.16. Prove the following “converse” to Jensen’s inequality:

Lemma 6.20 (converse to Jensen’s inequality) Let T : Ω → R be an arbitrary map, with E∗|T| < ∞, and assume φ : R → R is concave. Then

(a) E_∗φ(T) ≤ φ(E_∗T) ∧ φ(E∗T);

(b) if φ is also monotone, E∗φ(T) ≤ φ(E_∗T) ∨ φ(E∗T).

6.5.17. In the proof of Proposition 6.15, show that (i) implies (ii).

6.5.18. Show that Hadamard and compact differentiability are equivalent.


6.6 Notes

Much of the material in Section 6.2 is an amalgamation of several concepts and presentation styles found in Chapter 2 of Billingsley (1986), Sections 1.3 and 1.7 of van der Vaart and Wellner (1996), Section 18.1 of van der Vaart (1998), and in Chapter 1 of both Rudin (1987) and Rudin (1991). A nice proof of the equivalence of the several definitions of compactness can be found in Appendix I of Billingsley (1968).

Many of the ideas in Section 6.3 come from Chapter 1.2 of van der Vaart and Wellner (1996), abbreviated VW hereafter. Lemma 6.3 corresponds to Lemma 1.2.1 of VW, Lemma 6.4 is given in Exercise 1.2.1 of VW, and Parts (i), (ii) and (iii) correspond to Lemma 1.2.3 of VW. Most of Lemma 6.6 is given in Lemma 1.2.2 of VW, although the first inequalities in Parts (i) and (ii), as well as Part (vi), are new. Lemma 6.7 is given in Exercise 1.2.15 in VW. Lemmas 6.11 and 6.12 correspond to Exercises 1.2.3 and 1.2.4 of VW, respectively, after some modification. Also, Lemmas 6.13 and 6.14 correspond to Lemmas 1.2.5 and 1.2.6 of VW, respectively.

Much of the material on linear operators can be found in Appendix A.1 of Bickel, Klaassen, Ritov and Wellner (1998), and in Chapter 2 of Kress (1999). Lemma 6.16 is Proposition 7, Parts A and B, in Appendix A.1 of Bickel, et al. (1998). The presentation on functional differentiation is motivated by the first few pages in Chapter 3.9 of VW, and the chain rule (Lemma 6.19) is Lemma 3.9.3 of VW.


7 Stochastic Convergence

In this chapter, we study concepts and theory useful in understanding the limiting behavior of stochastic processes. We begin with a general discussion of stochastic processes in metric spaces. The focus of this discussion is on measurable stochastic processes since most limits of empirical processes in statistical applications are measurable. We next discuss weak convergence both in general and in the specific case of bounded stochastic processes. One of the interesting aspects of the approach we take to weak convergence is that the processes studied need not be measurable except in the limit. This is useful in applications since many empirical processes in statistics are not measurable with respect to the uniform metric. The final section of this chapter considers other modes of convergence, such as in probability and outer almost surely, and their relationships to weak convergence.

7.1 Stochastic Processes in Metric Spaces

In this section, we introduce several important concepts about stochastic processes in metric spaces. Recall that for a stochastic process {X(t), t ∈ T}, X(t) is a measurable real random variable for each t ∈ T on a probability space (Ω, A, P). The sample paths of such a process typically reside in the metric space D = ℓ∞(T) with the uniform metric. Often, however, when X is viewed as a map from Ω to D, it is no longer Borel measurable. A classic example of this duality comes from Billingsley (1968, Pages 152–153). The example hinges on the fact that there exists a set H ⊂ [0, 1] which is not a


Borel set. Define the stochastic process X(t) = 1{U ≤ t}, where t ∈ [0, 1] and U is uniformly distributed on [0, 1]. The probability space is (Ω, B, P), where Ω = [0, 1], B are the Borel sets on [0, 1], and P is the uniform probability measure on [0, 1]. A natural metric space for the sample paths of X is ℓ∞([0, 1]). Define the set A = ∪_{s∈H} Bs(1/2), where Bs(1/2) is the uniform open ball of radius 1/2 around the function t → fs(t) ≡ 1{s ≤ t}. Since A is an open set in ℓ∞([0, 1]), and since the uniform distance between fs1 and fs2 is 1 whenever s1 ≠ s2, the set {ω ∈ Ω : X(ω) ∈ A} equals H. Since H is not a Borel set, X is not Borel measurable.
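The geometry driving this example is easy to check numerically: on a finite grid of t values, sample paths with distinct jump points sit at uniform distance exactly 1 from one another, so the balls Bs(1/2) are pairwise disjoint. A small sketch (the grid resolution is an arbitrary choice):

```python
# Sample paths of X: for U = u the path is t -> 1{u <= t}, i.e. f_u.
def f(s, t):
    return 1.0 if s <= t else 0.0

grid = [k / 1000 for k in range(1001)]

def unif_dist(s1, s2):
    """Uniform distance between the paths f_{s1} and f_{s2} over the grid."""
    return max(abs(f(s1, t) - f(s2, t)) for t in grid)

# Distinct jump locations give uniform distance exactly 1 ...
assert unif_dist(0.25, 0.75) == 1.0
assert unif_dist(0.1, 0.101) == 1.0
# ... so the balls B_s(1/2), s in [0, 1], are pairwise disjoint.
assert unif_dist(0.5, 0.5) == 0.0
```

Because the balls are disjoint, the event {X ∈ A} picks out exactly the set {U ∈ H}, which is why the non-Borel set H defeats measurability.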

This lack of measurability is actually the usual state for most of the empirical processes we are interested in studying, especially since most of the time the uniform metric is the natural choice of metric. Many of the associated technical difficulties can be resolved through the use of outer measure and outer expectation as defined in the previous chapter and which we will utilize in our study of weak convergence. In contrast, most of the limiting processes we will be studying are, in fact, Borel measurable. For this reason, a brief study of Borel measurable processes is valuable. The following lemma, for example, provides two ways of establishing equivalence between Borel probability measures. Recall from Section 2.2.3 that BL1(D) is the set of all functions f : D → R bounded by 1 and with Lipschitz norm bounded by 1, i.e., with |f(x) − f(y)| ≤ d(x, y) for all x, y ∈ D. When the choice of metric space is clear by the context, we will sometimes use the abbreviated notation BL1 as was done in Chapter 2. Define also a vector lattice F ⊂ Cb(D) to be a vector space for which if f ∈ F then f ∨ 0 ∈ F. We also say that a set F of real functions on D separates points of D if, for any x, y ∈ D with x ≠ y, there exists f ∈ F such that f(x) ≠ f(y). We are now ready for the lemma:

Lemma 7.1 Let L1 and L2 be Borel probability measures on a metric space D. The following are equivalent:

(i) L1 = L2.

(ii) ∫ f dL1 = ∫ f dL2 for all f ∈ Cb(D).

If L1 and L2 are also separable, then (i) and (ii) are both equivalent to

(iii) ∫ f dL1 = ∫ f dL2 for all f ∈ BL1.

Moreover, if L1 and L2 are also tight, then (i)–(iii) are all equivalent to

(iv) ∫ f dL1 = ∫ f dL2 for all f in a vector lattice F ⊂ Cb(D) that both contains the constant functions and separates points in D.

The proof is given in Section 7.4. We say that two Borel random maps X and X′, with respective laws L and L′, are versions of each other if L = L′.

In addition to being Borel measurable, most of the limiting stochastic processes of interest are tight. A Borel probability measure L on a metric


space D is tight if for every ε > 0, there exists a compact K ⊂ D so that L(K) ≥ 1 − ε. A Borel random map X : Ω → D is tight if its law L is tight. Tightness is equivalent to there being a σ-compact set that has probability 1 under L or X. L or X is separable if there is a measurable and separable set which has probability 1. L or X is Polish if there is a measurable Polish set having probability 1. Note that tightness, separability and Polishness are all topological properties and do not depend on the metric. Since both σ-compact and Polish sets are also separable, separability is the weakest of the three properties. Whenever we say X has any one of these three properties, we tacitly imply that X is also Borel measurable.

On a complete metric space, tightness, separability and Polishness are equivalent. This equivalence for Polishness and separability follows from the definitions. To see the remaining equivalence, assume L is separable. By completeness, there is a D0 ⊂ D having probability 1 which is both separable and closed. Fix any ε ∈ (0, 1). By separability, there exists a sequence xk ∈ D0 which is dense in D0. For every δ > 0, the union of the balls of radius δ centered on the xk covers D0. Hence for every integer j ≥ 1, there exists a finite collection of balls of radius 1/j whose union Dj has probability ≥ 1 − ε/2^j. Thus the closure of the intersection ∩_{j≥1} Dj is totally bounded and has probability ≥ 1 − ε. Since ε is arbitrary, L is tight.

For a stochastic process {X(t), t ∈ T}, where (T, ρ) is a separable, semimetric space, there is another meaning for separable. X is separable (as a stochastic process) if there exists a countable subset S ⊂ T and a null set N so that, for each ω ∉ N and t ∈ T, there exists a sequence sm ∈ S with ρ(sm, t) → 0 and |X(sm, ω) − X(t, ω)| → 0. It turns out that many of the empirical processes we will be studying are separable in this sense, even though they are not Borel measurable and therefore cannot satisfy the other meaning for separable. Throughout the remainder of the book, the distinction between these two definitions will either be explicitly stated or made clear by the context.

Most limiting processes X of interest will reside in ℓ∞(T), where the index set T is often a class of real functions F with domain equal to the sample space. When such limiting processes are tight, the following lemma tells us that X resides in UC(T, ρ), where ρ is some semimetric making T totally bounded, with probability 1:

Lemma 7.2 Let X be a Borel measurable random element in ℓ∞(T). Then the following are equivalent:

(i) X is tight.

(ii) There exists a semimetric ρ making T totally bounded and for which X ∈ UC(T, ρ) with probability 1.

Furthermore, if (ii) holds for any ρ, then it also holds for the semimetric ρ0(s, t) ≡ E arctan |X(s) − X(t)|.


The proof is given in Section 7.4. A nice feature of tight processes in ℓ∞(T) is that the laws of such processes are completely defined by their finite-dimensional marginal distributions (X(t1), . . . , X(tk)), where t1, . . . , tk ∈ T and k ≥ 1 is an integer:

Lemma 7.3 Let X and Y be tight, Borel measurable stochastic processes in ℓ∞(T). Then the Borel laws of X and Y are equal if and only if all corresponding finite-dimensional marginal distributions are equal.

Proof. Consider the collection F ⊂ Cb(D) of all functions f : ℓ∞(T) → R of the form f(x) = g(x(t1), . . . , x(tk)), where g ∈ Cb(R^k) and k ≥ 1 is an integer. We leave it as an exercise to show that F is a vector lattice, an algebra, and separates points of ℓ∞(T). The desired result now follows from Lemma 7.1.

While the semimetric ρ0 defined in Lemma 7.2 is always applicable when X is tight, it is frequently not the most convenient semimetric to work with. The family of semimetrics ρp(s, t) ≡ (E|X(s) − X(t)|^p)^{1/(p∨1)}, for some choice of p ∈ (0, ∞), is sometimes more useful. There is an interesting link between ρp and other semimetrics for which Lemma 7.2 holds. For a process X in ℓ∞(T) and a semimetric ρ on T, we say that X is uniformly ρ-continuous in pth mean if E|X(sn) − X(tn)|^p → 0 whenever ρ(sn, tn) → 0. The following lemma is a conclusion from Lemma 1.5.9 of van der Vaart and Wellner (1996) (abbreviated VW hereafter), and we omit the proof:

Lemma 7.4 Let X be a tight Borel measurable random element in ℓ∞(T), and let ρ be any semimetric for which (ii) of Lemma 7.2 holds. If X is ρ-continuous in pth mean for some p ∈ (0, ∞), then (ii) of Lemma 7.2 also holds for the semimetric ρp.

Perhaps the most frequently occurring limiting process in ℓ∞(T) is a Gaussian process. A stochastic process {X(t), t ∈ T} is Gaussian if all finite-dimensional marginals {X(t1), . . . , X(tk)} are multivariate normal. If a Gaussian process X is tight, then by Lemma 7.2, there is a semimetric ρ making T totally bounded and for which the sample paths t → X(t) are uniformly ρ-continuous. An interesting feature of Gaussian processes is that this result implies that the map t → X(t) is uniformly ρ-continuous in pth mean for all p ∈ (0, ∞). To see this, take p = 2, and note that |X(sn) − X(tn)| → 0 in probability if and only if E|X(sn) − X(tn)|² → 0. Thus whenever ρ(sn, tn) → 0, E|X(sn) − X(tn)|² → 0 and hence also E|X(sn) − X(tn)|^p → 0 for any p ∈ (0, ∞) since X(sn) − X(tn) is normally distributed for all n ≥ 1. Lemma 7.4 now implies that tightness of a Gaussian process is equivalent to T being totally bounded by ρp with almost all sample paths of X being uniformly ρp-continuous for all p ∈ (0, ∞).
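The normal-moment fact underlying this argument is that, for Z ~ N(0, σ²), E|Z|^p = σ^p 2^{p/2} Γ((p + 1)/2)/√π, so every absolute pth moment is a constant multiple of σ^p and all of them vanish together as σ → 0. A Monte Carlo sketch of this identity (the sample size and tolerance are arbitrary choices):

```python
import math, random

random.seed(1)

def abs_moment(sigma, p):
    """Closed form E|Z|^p for Z ~ N(0, sigma^2):
    sigma^p * 2^(p/2) * Gamma((p+1)/2) / sqrt(pi)."""
    return sigma**p * 2**(p / 2) * math.gamma((p + 1) / 2) / math.sqrt(math.pi)

n = 200_000
for p in (1.0, 3.0):
    for sigma in (1.0, 0.1):
        mc = sum(abs(random.gauss(0, sigma))**p for _ in range(n)) / n
        # Monte Carlo estimate should match the closed form to within ~5%.
        assert abs(mc - abs_moment(sigma, p)) < 0.05 * max(1.0, abs_moment(sigma, p))

# E|Z|^p is a constant multiple of sigma^p, so E|X(s_n) - X(t_n)|^2 -> 0
# (i.e. sigma_n -> 0) forces E|X(s_n) - X(t_n)|^p -> 0 for every p > 0.
assert abs_moment(1e-6, 3.0) < 1e-17
```

For increments X(sn) − X(tn) of a Gaussian process, σn² = E|X(sn) − X(tn)|², which is exactly why the p = 2 case controls all the other pth means.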

For a general Banach space D, a Borel measurable random element X on D is Gaussian if and only if f(X) is Gaussian for every continuous, linear map f : D → R. When D = ℓ∞(T) for some set T, this definition appears to contradict the definition of Gaussianity given in the preceding


paragraph, since now we are using all continuous linear functionals instead of just linear combinations of coordinate projections. These two definitions are not really reconcilable in general, and so some care must be taken in reading the literature to determine the appropriate context. However, when the process in question is tight, the two definitions are equivalent, as verified in the following proposition:

Proposition 7.5 Let X be a tight, Borel measurable map into ℓ∞(T). Then the following are equivalent:

(i) The vector (X(t1), . . . , X(tk)) is multivariate normal for every finite set {t1, . . . , tk} ⊂ T.

(ii) φ(X) is Gaussian for every continuous, linear map φ : ℓ∞(T) → R.

(iii) φ(X) is Gaussian for every continuous, linear map φ : ℓ∞(T) → D into any Banach space D.

Proof. The proof that (i)⇒(ii) is given in the proof of Lemma 3.9.8 of VW, and we omit the details here. Now assume (ii), and fix any Banach space D and any continuous, linear map φ : ℓ∞(T) → D. Now for any continuous, linear map ψ : D → R, the composition map ψ ∘ φ : ℓ∞(T) → R is continuous and linear, and thus by (ii) we have that ψ(φ(X)) is Gaussian. Since ψ is arbitrary, we have by the definition of a Gaussian process on a Banach space that φ(X) is Gaussian. Since both D and φ were also arbitrary, conclusion (iii) follows. Finally, (iii)⇒(i) since multivariate coordinate projections are special examples of continuous, linear maps into Banach spaces.

7.2 Weak Convergence

We first discuss the general theory of weak convergence in metric spaces and then discuss results for the special metric space of uniformly bounded functions, ℓ∞(T), for an arbitrary index set T. This last space is where most—if not all—of the action occurs for statistical applications of empirical processes.

7.2.1 General Theory

The extremely important concept of weak convergence of sequences arises in many areas of statistics. To be as flexible as possible, we allow the probability spaces associated with the sequences to change with n. Let (Ωn, An, Pn) be a sequence of probability spaces and Xn : Ωn → D a sequence of maps. We say that Xn converges weakly to a Borel measurable X : Ω → D if

Page 118: Springer Series in Statistics - stat.ncsu.edulu/ST7903/Kosorok-Introduction_to... · Springer Series in Statistics Borg/Groenen: Modern Multidimensional Scaling: Theory and Applications,

108 7. Stochastic Convergence

E∗f(Xn) → Ef(X), for every f ∈ Cb(D).    (7.1)

If L is the law of X, (7.1) can be reexpressed as

E∗f(Xn) → ∫_D f(x) dL(x), for every f ∈ Cb(D).

This weak convergence is denoted Xn ⇝ X or, equivalently, Xn ⇝ L. Weak convergence is equivalent to “convergence in distribution” and “convergence in law.” By Lemma 7.1, this definition of weak convergence ensures that the limiting distributions are unique. Note that the choice of probability spaces (Ωn, An, Pn) is important since these dictate the outer expectation used in the definition of weak convergence. In most of the settings discussed in this book, Ωn = Ω for all n ≥ 1. Some important exceptions to this rule will be discussed in Chapter 11. Fortunately, even in those settings where Ωn does change with n, one can frequently readjust the probability spaces so that the sample spaces are all the same. It is also possible to generalize the concept of weak convergence of sequences to weak convergence of nets as done in VW, but we will restrict ourselves to sequences throughout this book.

The foregoing definition of weak convergence does not obviously appear to be related to convergence of probabilities, but this is in fact true for the probabilities of sets B ⊂ D which have boundaries δB satisfying L(δB) = 0. Here and elsewhere, we define the boundary δB of a set B in a topological space to be the closure of B minus the interior of B. Several interesting equivalent formulations of weak convergence on a metric space D are given in the following portmanteau theorem:

Theorem 7.6 (Portmanteau) The following are equivalent:

(i) Xn ⇝ L;

(ii) lim inf P_∗(Xn ∈ G) ≥ L(G) for every open G;

(iii) lim sup P∗(Xn ∈ F) ≤ L(F) for every closed F;

(iv) lim inf E_∗f(Xn) ≥ ∫_D f(x) dL(x) for every lower semicontinuous f bounded below;

(v) lim sup E∗f(Xn) ≤ ∫_D f(x) dL(x) for every upper semicontinuous f bounded above;

(vi) lim P_∗(Xn ∈ B) = lim P∗(Xn ∈ B) = L(B) for every Borel B with L(δB) = 0;

(vii) lim inf E_∗f(Xn) ≥ ∫_D f(x) dL(x) for every bounded, Lipschitz continuous, nonnegative f.

Furthermore, if L is separable, then (i)–(vii) are also equivalent to

Page 119: Springer Series in Statistics - stat.ncsu.edulu/ST7903/Kosorok-Introduction_to... · Springer Series in Statistics Borg/Groenen: Modern Multidimensional Scaling: Theory and Applications,

7.2 Weak Convergence 109

(viii) sup_{f∈BL1} |E∗f(Xn) − Ef(X)| → 0.

The proof is given in Section 7.4. Depending on the setting, one or more of these alternative definitions will prove more useful than the others. For example, definition (vi) is probably the most intuitive from a statistical point of view, while definition (viii) is convenient for studying certain properties of the bootstrap.
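The condition L(δB) = 0 in (vi) cannot be dropped. A minimal counterexample uses point masses Xn = δ_{1/n}, which converge weakly to X = δ_0 (a toy choice for illustration; sets are encoded as predicates):

```python
# Point masses on R: X_n = delta_{1/n} converges weakly to X = delta_0.
def P_n(n, B):          # probability that X_n lands in B (B is a predicate)
    return 1.0 if B(1.0 / n) else 0.0

def L(B):               # the limit law delta_0
    return 1.0 if B(0.0) else 0.0

B_bad  = lambda x: 0 < x <= 1     # boundary {0, 1}; L puts mass 1 at 0
B_good = lambda x: -1 < x <= 1    # boundary {-1, 1} is L-null

# For B_bad, P(X_n in B) = 1 for all n, yet L(B_bad) = 0: no convergence.
assert all(P_n(n, B_bad) == 1.0 for n in range(1, 100))
assert L(B_bad) == 0.0
# For the L-boundary-null set B_good, the probabilities do converge.
assert all(P_n(n, B_good) == 1.0 for n in range(1, 100))
assert L(B_good) == 1.0
```

The mass of the limit law sits exactly on the boundary of B_bad, which is what breaks the convergence of set probabilities even though Xn ⇝ X.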

Another very useful result is the continuous mapping theorem:

Theorem 7.7 (Continuous mapping) Let g : D → E be continuous at all points in D0 ⊂ D, where D and E are metric spaces. If Xn ⇝ X in D, with P∗(X ∈ D0) = 1, then g(Xn) ⇝ g(X).

The proof is given in Section 7.4. As mentioned in Chapter 2, a common application of this theorem is in the construction of confidence bands based on the supremum distance.

A potential issue is that there may sometimes be more than one choice of metric space D to work with in a given weak convergence setting. For example, if we are studying weak convergence of the usual empirical process √n(Fn(t) − F(t)) based on data in [0, 1], we could let D be either ℓ∞([0, 1]) or D[0, 1]. The following lemma tells us that the choice of metric space is generally not a problem. Recall from Chapter 6 that for a topological space (X, O), the relative topology on A ⊂ X consists of the open sets {A ∩ B : B ∈ O}.

Lemma 7.8 Let the metric spaces D0 ⊂ D have the same metric, and assume X and Xn reside in D0. Then Xn ⇝ X in D0 if and only if Xn ⇝ X in D.

Proof. Since a set B0 ⊂ D0 is open if and only if it is of the form B ∩ D0 for some open B in D, the result follows from Part (ii) of the portmanteau theorem.

Recall from Chapter 2 that a sequence Xn is asymptotically measurable if and only if

E∗f(Xn) − E_∗f(Xn) → 0,    (7.2)

for all f ∈ Cb(D). An important, related concept is that of asymptotic tightness. A sequence Xn is asymptotically tight if for every ε > 0, there is a compact K so that lim inf P_∗(Xn ∈ Kδ) ≥ 1 − ε, for every δ > 0, where for a set A ⊂ D, Aδ = {x ∈ D : d(x, A) < δ} is the “δ-enlargement” around A. The following lemma tells us that when Xn is asymptotically tight, we can determine asymptotic measurability by verifying (7.2) only for a subset of functions in Cb(D). For the purposes of this lemma, an algebra F ⊂ Cb(D) is a vector space for which if f, g ∈ F then fg ∈ F.

Lemma 7.9 Assume the sequence Xn is asymptotically tight and that (7.2) holds for all f in a subalgebra F ⊂ Cb(D) that separates points of D. Then Xn is asymptotically measurable.


We omit the proof of this lemma, but it can be found in Chapter 1.3 of VW.

When D is a Polish space and Xn and X are both Borel measurable, tightness of Xn for each n ≥ 1 plus asymptotic tightness is equivalent to the concept of uniform tightness used in the classical theory of weak convergence (see p. 37, Billingsley, 1968). More precisely, a Borel measurable sequence Xn is uniformly tight if for every ε > 0, there is a compact K so that P(Xn ∈ K) ≥ 1 − ε for all n ≥ 1. The following is a more formal statement of the equivalence we are describing:

Lemma 7.10 Assume D is a Polish space and that the maps Xn and X are Borel measurable. Then Xn is uniformly tight if and only if Xn is tight for each n ≥ 1 and Xn is asymptotically tight.

The proof is given in Section 7.4. Because Xn will typically not be measurable in many of the applications of interest to us, uniform tightness will not prove as useful a concept as asymptotic tightness.

Two good properties of asymptotic tightness are that it does not depend on the metric chosen—only on the topology—and that weak convergence often implies asymptotic tightness. The first of these two properties is verified in the following lemma:

Lemma 7.11 Xn is asymptotically tight if and only if for every ε > 0 there exists a compact K so that lim inf P_∗(Xn ∈ G) ≥ 1 − ε for every open G ⊃ K.

Proof. Assume first that Xn is asymptotically tight. Fix ε > 0, and let the compact set K satisfy lim inf P_∗(Xn ∈ Kδ) ≥ 1 − ε for every δ > 0. If G ⊃ K is open, then there exists a δ0 > 0 so that G ⊃ Kδ0. If this were not true, then there would exist a sequence xn ∉ G so that d(xn, K) → 0. This implies the existence of a sequence yn ∈ K so that d(xn, yn) → 0. Thus, since K is compact and the complement of G is closed, there is a subsequence n′ and a point y ∉ G so that d(yn′, y) → 0, but this is impossible since y is then also a limit of points of K and hence lies in K ⊂ G. Hence lim inf P_∗(Xn ∈ G) ≥ 1 − ε. Now assume that Xn satisfies the alternative definition. Fix ε > 0, and let the compact set K satisfy lim inf P_∗(Xn ∈ G) ≥ 1 − ε for every open G ⊃ K. For every δ > 0, Kδ is an open set. Thus lim inf P_∗(Xn ∈ Kδ) ≥ 1 − ε for every δ > 0.

The second good property of asymptotic tightness is given in the second part of the following lemma, the first part of which gives the necessity of asymptotic measurability for weakly convergent sequences:

Lemma 7.12 Assume Xn ⇝ X. Then

(i) Xn is asymptotically measurable.

(ii) Xn is asymptotically tight if and only if X is tight.

Proof. For Part (i), fix f ∈ Cb(D). Note that weak convergence implies both E∗f(Xn) → Ef(X) and E_∗f(Xn) = −E∗[−f(Xn)] → −E[−f(X)] =


Ef(X), and the desired result follows since f is arbitrary. For Part (ii), fix ε > 0. Assume X is tight, and choose a compact K so that P(X ∈ K) ≥ 1 − ε. By Part (ii) of the portmanteau theorem, lim inf P_∗(Xn ∈ Kδ) ≥ P(X ∈ Kδ) ≥ 1 − ε for every δ > 0. Hence Xn is asymptotically tight. Now assume that Xn is asymptotically tight, fix ε > 0, and choose a compact K so that lim inf P_∗(Xn ∈ Kδ) ≥ 1 − ε for every δ > 0. By Part (iii) of the portmanteau theorem applied to the closed set F equal to the closure of Kδ, P(X ∈ F) ≥ lim sup P∗(Xn ∈ F) ≥ lim inf P_∗(Xn ∈ Kδ) ≥ 1 − ε. By letting δ ↓ 0, we obtain that X is tight.

Prohorov’s theorem (given below) tells us that asymptotic measurability and asymptotic tightness together almost give us weak convergence. This “almost-weak-convergence” is relative compactness. A sequence Xn is relatively compact if every subsequence Xn′ has a further subsequence Xn′′ which converges weakly to a tight Borel law. Weak convergence happens when all of the limiting Borel laws are the same. Note that when all of the limiting laws assign probability one to a fixed Polish space, there is a converse of Prohorov’s theorem, that relative compactness of Xn implies asymptotic tightness of Xn (and hence also uniform tightness). Details of this result are discussed in Chapter 1.12 of VW, but we do not pursue it further here.

Theorem 7.13 (Prohorov’s theorem) If the sequence Xn is asymptotically measurable and asymptotically tight, then it has a subsequence Xn′ that converges weakly to a tight Borel law.

The proof, which we omit, is given in Chapter 1.3 of VW. Note that the conclusion of Prohorov’s theorem does not state that Xn is relatively compact, and thus it appears as if we have broken our earlier promise. However, if Xn is asymptotically measurable and asymptotically tight, then every subsequence Xn′ is also asymptotically measurable and asymptotically tight. Thus repeated application of Prohorov’s theorem does indeed imply relative compactness of Xn.

A natural question to ask at this juncture is: under what circumstances do asymptotic measurability and/or tightness of the marginal sequences Xn and Yn imply asymptotic measurability and/or tightness of the joint sequence (Xn, Yn)? This question is answered in the following lemma:

Lemma 7.14 Let Xn : Ωn → D and Yn : Ωn → E be sequences of maps. Then the following are true:

(i) Xn and Yn are both asymptotically tight if and only if the same is true for the joint sequence (Xn, Yn) : Ωn → D × E.

(ii) Asymptotically tight sequences Xn and Yn are both asymptotically measurable if and only if (Xn, Yn) : Ωn → D × E is asymptotically measurable.


Proof. Let d and e be the metrics for D and E, respectively, and let D × E be endowed with the product topology. Now note that a set in D × E of the form K1 × K2 is compact if and only if K1 and K2 are both compact. Let πjK, j = 1, 2, be the projections of K onto D and E, respectively. To be precise, the projection π1K consists of all x ∈ D such that (x, y) ∈ K for some y ∈ E, and π2K is analogously defined for E. We leave it as an exercise to show that π1K and π2K are both compact. It is easy to see that K is contained in π1K × π2K. Using the product space metric ρ((x1, y1), (x2, y2)) = d(x1, x2) ∨ e(y1, y2) (one of several metrics generating the product topology), we now have for K1 ⊂ D and K2 ⊂ E that (K1 × K2)δ = K1δ × K2δ. Part (i) now follows from the definition of asymptotic tightness.

Let π1 : D × E → D be the projection onto the first coordinate, and note that π1 is continuous. Thus for any f ∈ Cb(D), f ∘ π1 ∈ Cb(D × E). Hence joint asymptotic measurability of (Xn, Yn) implies asymptotic measurability of Xn by the definition of asymptotic measurability. The same argument holds for Yn. The difficult part of proving Part (ii) is the implication that asymptotic tightness plus asymptotic measurability of both marginal sequences yields asymptotic measurability of the joint sequence. We omit this part of the proof, but it can be found in Chapter 1.4 of VW.

A very useful consequence of Lemma 7.14 is Slutsky’s theorem. Note that the proof of Slutsky’s theorem (given below) also utilizes both Prohorov’s theorem and the continuous mapping theorem.

Theorem 7.15 (Slutsky’s theorem) Suppose Xn ⇝ X and Yn ⇝ c, where X is separable and c is a fixed constant. Then the following are true:

(i) (Xn, Yn) ⇝ (X, c).

(ii) If Xn and Yn are in the same metric space, then Xn + Yn ⇝ X + c.

(iii) Assume in addition that the Yn are scalars. Then whenever c ∈ R, YnXn ⇝ cX. Also, whenever c ≠ 0, Xn/Yn ⇝ X/c.

Proof. By completing the metric space for X, we can without loss of generality assume that X is tight. Thus by Lemma 7.14, (Xn, Yn) is asymptotically tight and asymptotically measurable. Thus by Prohorov’s theorem, all subsequences of (Xn, Yn) have further subsequences which converge to tight limits. Since these limit points have marginals X and c, and since the marginals in this case completely determine the joint distribution, we have that all limiting distributions are uniquely determined as (X, c). This proves Part (i). Parts (ii) and (iii) now follow from the continuous mapping theorem.
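Part (iii) is the form of Slutsky’s theorem behind the usual t-statistic: with Xn = √n(X̄n − μ) ⇝ N(0, σ²) and Yn the sample standard deviation, Yn → σ, so Xn/Yn ⇝ N(0, 1). A simulation sketch with Exponential(1) data (the sample size, replication count, and tolerances are arbitrary choices):

```python
import math, random

random.seed(2)

def t_stat(n):
    """Studentized mean for an Exponential(1) sample: X_n / Y_n with
    X_n = sqrt(n) * (mean - 1) and Y_n = sample standard deviation."""
    xs = [random.expovariate(1.0) for _ in range(n)]
    m = sum(xs) / n
    sd = math.sqrt(sum((x - m)**2 for x in xs) / (n - 1))
    return math.sqrt(n) * (m - 1.0) / sd

reps = [t_stat(400) for _ in range(2000)]
mean = sum(reps) / len(reps)
var = sum((r - mean)**2 for r in reps) / (len(reps) - 1)

# Slutsky: the ratio should look approximately standard normal for large n.
assert abs(mean) < 0.1
assert abs(math.sqrt(var) - 1.0) < 0.1
```

Neither Xn nor Yn alone is asymptotically pivotal here; it is the joint convergence (Xn, Yn) ⇝ (X, σ) from Part (i), pushed through the continuous map (x, y) → x/y, that standardizes the limit.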


7.2.2 Spaces of Bounded Functions

Now we consider the setting where the Xn are stochastic processes with index set T. The natural metric space for weak convergence in this setting is ℓ∞(T). A nice feature of this setting is the fact that asymptotic measurability of Xn follows from asymptotic measurability of Xn(t) for each t ∈ T:

Lemma 7.16 Let the sequence of maps Xn in ℓ∞(T) be asymptotically tight. Then Xn is asymptotically measurable if and only if Xn(t) is asymptotically measurable for each t ∈ T.

Proof. Let ft : ℓ∞(T) → R be the marginal projection at t ∈ T, i.e., ft(x) = x(t) for any x ∈ ℓ∞(T). Since each ft is continuous, asymptotic measurability of Xn implies asymptotic measurability of Xn(t) for each t ∈ T. Now assume that Xn(t) is asymptotically measurable for each t ∈ T. Then Lemma 7.14 implies asymptotic measurability for all finite-dimensional marginals (Xn(t1), . . . , Xn(tk)). Consequently, f(Xn) is asymptotically measurable for all f ∈ F, where F is the subset of Cb(D) defined in the proof of Lemma 7.3 given above in Section 7.2, with D = ℓ∞(T). Since F is an algebra that separates points in ℓ∞(T), asymptotic measurability of Xn follows from Lemma 7.9.

We now verify that convergence of finite-dimensional distributions plus asymptotic tightness is equivalent to weak convergence in ℓ∞(T):

Theorem 7.17 The sequence Xn converges to a tight limit in ℓ∞(T) if and only if Xn is asymptotically tight and all finite-dimensional marginals converge weakly to limits. Moreover, if Xn is asymptotically tight and all of its finite-dimensional marginals (Xn(t1), . . . , Xn(tk)) converge weakly to the marginals (X(t1), . . . , X(tk)) of a stochastic process X, then there is a version of X such that Xn ⇝ X and X resides in UC(T, ρ) for some semimetric ρ making T totally bounded.

Proof. The result that "asymptotic tightness plus convergence of finite-dimensional distributions implies weak convergence" follows from Prohorov's theorem and Lemmas 7.9 and 7.1, using the vector lattice and subalgebra F ⊂ Cb(D) defined above in the proof of Lemma 7.3. The implication in the opposite direction follows easily from Lemma 7.12 and the continuous mapping theorem. Now assume Xn is asymptotically tight and that all finite-dimensional distributions of Xn converge to those of a stochastic process X. By asymptotic tightness of Xn, the probability that a version of X lies in some σ-compact K ⊂ ℓ∞(T) is one. By Theorem 6.2, K ⊂ UC(T, ρ) for some semimetric ρ making T totally bounded.

Recall Theorem 2.1 and Condition (2.6) from Chapter 2. When Condition (2.6) holds for every ǫ > 0, we say that the sequence Xn is asymptotically uniformly ρ-equicontinuous in probability. We are now in a position to prove Theorem 2.1 on Page 15. Note that the statement of this theorem is slightly informal: Conditions (i) and (ii) of the theorem actually imply that Xn ⇝ X′ in ℓ∞(T) for some tight version X′ of X. Recall that ‖x‖T ≡ sup_{t∈T} |x(t)|.

Proof of Theorem 2.1 (see Page 15). First assume Xn ⇝ X in ℓ∞(T), where X is tight. Convergence of all finite-dimensional distributions follows from the continuous mapping theorem. Now by Theorem 6.2, P(X ∈ UC(T, ρ)) = 1 for some semimetric ρ making T totally bounded. Hence for every η > 0, there exists some compact K ⊂ UC(T, ρ) so that

lim inf_{n→∞} P∗(Xn ∈ K^δ) ≥ 1 − η, for all δ > 0.   (7.3)

Fix ǫ, η > 0, and let the compact set K satisfy (7.3). By Theorem 6.2, there exists a δ0 > 0 so that sup_{x∈K} sup_{s,t:ρ(s,t)<δ0} |x(s) − x(t)| ≤ ǫ/3. Now

P∗[ sup_{s,t∈T:ρ(s,t)<δ0} |Xn(s) − Xn(t)| > ǫ ]
  ≤ P∗[ sup_{s,t∈T:ρ(s,t)<δ0} |Xn(s) − Xn(t)| > ǫ, Xn ∈ K^{ǫ/3} ] + P∗(Xn ∉ K^{ǫ/3})
  ≡ En

satisfies lim sup_{n→∞} En ≤ η, since if x ∈ K^{ǫ/3} then sup_{s,t∈T:ρ(s,t)<δ0} |x(s) − x(t)| < ǫ. Thus Xn is asymptotically uniformly ρ-equicontinuous in probability since ǫ and η were arbitrary.

Now assume that Conditions (i) and (ii) of the theorem hold. Lemma 7.18 below, the proof of which is given in Section 7.4, yields that Xn is asymptotically tight. Thus the desired weak convergence of Xn follows from Theorem 7.17 above.

Lemma 7.18 Assume Conditions (i) and (ii) of Theorem 2.1 hold. Then Xn is asymptotically tight.

The proof of Theorem 2.1 verifies that whenever Xn ⇝ X and X is tight, any semimetric ρ defining a σ-compact set UC(T, ρ) such that P(X ∈ UC(T, ρ)) = 1 will also result in Xn being asymptotically uniformly ρ-equicontinuous in probability. What is not clear at this point is the converse: that any semimetric ρ∗ which enables uniform asymptotic equicontinuity of Xn will also define a σ-compact set UC(T, ρ∗) wherein X resides with probability 1. The following theorem shows that, in fact, any semimetric which works for one of these implications will work for the other:

Theorem 7.19 Assume Xn ⇝ X in ℓ∞(T), and let ρ be a semimetric making (T, ρ) totally bounded. Then the following are equivalent:

(i) Xn is asymptotically uniformly ρ-equicontinuous in probability.

(ii) P(X ∈ UC(T, ρ)) = 1.


Proof. If we assume (ii), then (i) will follow by arguments given in the proof of Theorem 2.1 above. Now assume (i). For x ∈ ℓ∞(T), define Mδ(x) ≡ sup_{s,t∈T:ρ(s,t)<δ} |x(s) − x(t)|. Note that if we restrict δ to (0, 1), then x ↦ M(·)(x), as a map from ℓ∞(T) to ℓ∞((0, 1)), is continuous, since |Mδ(x) − Mδ(y)| ≤ 2‖x − y‖T for all δ ∈ (0, 1). Hence M(·)(Xn) ⇝ M(·)(X) in ℓ∞((0, 1)). Condition (i) now implies that there exists a positive sequence δn ↓ 0 so that P∗(M_{δn}(Xn) > ǫ) → 0 for every ǫ > 0. Hence M_{δn}(X) ⇝ 0. This implies (ii) since X is tight by Theorem 2.1.
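The Lipschitz bound |Mδ(x) − Mδ(y)| ≤ 2‖x − y‖T used above is easy to check numerically on discretized paths; a small sketch (illustration only, with T replaced by a finite grid):

```python
import numpy as np

def modulus(x, t, delta):
    """M_delta(x): sup of |x(s) - x(u)| over grid points with |s - u| < delta."""
    close = np.abs(t[:, None] - t[None, :]) < delta
    return np.abs(x[:, None] - x[None, :])[close].max()

rng = np.random.default_rng(1)
t = np.linspace(0.0, 1.0, 200)
x = np.cumsum(rng.normal(size=t.size)) / 15.0   # a rough random path
y = np.sin(2 * np.pi * t)                       # a smooth path

# |M_delta(x) - M_delta(y)| <= 2 * sup_t |x(t) - y(t)| for every delta.
for delta in (0.05, 0.1, 0.25):
    lhs = abs(modulus(x, t, delta) - modulus(y, t, delta))
    assert lhs <= 2 * np.abs(x - y).max() + 1e-12
```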

An interesting consequence of Theorems 2.1 and 7.19, in conjunction with Lemma 7.4, arises when Xn ⇝ X in ℓ∞(T) and X is a tight Gaussian process. Recall from Section 7.1 the semimetric ρp(s, t) ≡ (E|X(s) − X(t)|^p)^{1/(p∨1)}, for any p ∈ (0,∞). Then for any p ∈ (0,∞), (T, ρp) is totally bounded, the sample paths of X are ρp-continuous, and, furthermore, Xn is asymptotically uniformly ρp-equicontinuous in probability. While any value of p ∈ (0,∞) will work, the choice p = 2 (the "standard deviation" metric) is often the most convenient to work with.
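For instance (an illustration added here, not in the text), for standard Brownian motion on T = [0, 1] the increments satisfy X(s) − X(t) ∼ N(0, |s − t|), so the p = 2 metric is explicit:

```latex
% Standard Brownian motion on [0,1]: X(s) - X(t) ~ N(0, |s-t|), so
\rho_2(s,t) \;=\; \bigl(\mathrm{E}\,|X(s)-X(t)|^{2}\bigr)^{1/2}
           \;=\; |s-t|^{1/2}.
% A \rho_2-ball of radius \epsilon is an interval of length about
% 2\epsilon^2, so [0,1] is totally bounded in \rho_2.
```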

We now point out an equivalent condition for Xn to be asymptotically uniformly ρ-equicontinuous in probability. This new condition, which is expressed in the following lemma, is sometimes easier to verify in certain settings (one of which occurs in the next chapter):

Lemma 7.20 Let Xn be a sequence of stochastic processes indexed by T. Then the following are equivalent:

(i) There exists a semimetric ρ making T totally bounded and for which Xn is asymptotically uniformly ρ-equicontinuous in probability.

(ii) For every ǫ, η > 0, there exists a finite partition T = ∪_{i=1}^k Ti such that

lim sup_{n→∞} P∗( sup_{1≤i≤k} sup_{s,t∈Ti} |Xn(s) − Xn(t)| > ǫ ) < η.   (7.4)

The proof is given in Section 7.4 below.

7.3 Other Modes of Convergence

Recall the definitions of convergence in probability and outer almost surely, for arbitrary maps Xn : Ω → D, as defined in Chapter 2. We now introduce two additional modes of convergence which can be useful in some settings. Xn converges almost uniformly to X if, for every ǫ > 0, there exists a measurable set A such that P(A) ≥ 1 − ǫ and d(Xn, X) → 0 uniformly on A. Xn converges almost surely to X if P∗(lim_{n→∞} d(Xn, X) = 0) = 1. Note that an important distinction between almost sure and outer almost sure convergence is that, in the latter mode, there must exist a measurable majorant of d(Xn, X) which goes to zero. This distinction is quite important because it can be shown that almost sure convergence does not in general imply convergence in probability when d(Xn, X) is not measurable. For this reason, we rarely use the almost sure convergence mode in this book. One of those rare times is in Exercise 7.5.7; another is in Proposition 7.22 below, which will be used in Chapter 10. The following lemma characterizes the relationships among the three remaining modes:

Lemma 7.21 Let Xn, X : Ω → D be maps with X Borel measurable. Then

(i) Xn →as∗ X implies Xn →P X.

(ii) Xn →P X if and only if every subsequence Xn′ has a further subsequence Xn′′ such that Xn′′ →as∗ X.

(iii) Xn →as∗ X if and only if Xn converges almost uniformly to X if and only if sup_{m≥n} d(Xm, X) →P 0.

The proof is given in Section 7.4. Since almost uniform convergence and outer almost sure convergence are equivalent for sequences, we will not use the almost uniform mode very much.

The following proposition gives a connection between almost sure convergence and convergence in probability. We need this proposition for a continuous mapping result for bootstrapped processes presented in Chapter 10:

Proposition 7.22 Let Xn, Yn : Ω → D be maps with Yn measurable. Suppose every subsequence n′ has a further subsequence n′′ such that Xn′′ → 0 almost surely. Suppose also that d(Xn, Yn) →P 0. Then Xn →P 0.

Proof. For every subsequence n′ there exists a further subsequence n′′ such that both Xn′′ → 0 and d(Xn′′, Yn′′)∗ → 0 almost surely, for some versions d(Xn′′, Yn′′)∗. Since d(Yn, 0) ≤ d(Xn, 0) + d(Xn, Yn)∗, we have that Yn′′ → 0 almost surely. But this implies Yn′′ →as∗ 0 since the Yn are measurable. Since the subsequence n′ was arbitrary, we now have that Yn →P 0. Thus Xn →P 0 since d(Xn, 0) ≤ d(Yn, 0) + d(Xn, Yn).

The next lemma describes several important relationships between weak convergence and convergence in probability. Before presenting it, we need to extend the definition of convergence in probability, in the setting where the limit is a constant, to allow the probability spaces involved to change with n, as is already permitted for weak convergence. We denote this modified convergence Xn →P c, and distinguish it from the previous form of convergence in probability only by context.

Lemma 7.23 Let Xn, Yn : Ωn → D be maps, X : Ω → D be Borel measurable, and c ∈ D be a constant. Then

(i) If Xn ⇝ X and d(Xn, Yn) →P 0, then Yn ⇝ X.

(ii) Xn →P X implies Xn ⇝ X.

(iii) Xn →P c if and only if Xn ⇝ c.

Proof. We first prove (i). Let F ⊂ D be closed, and fix ǫ > 0. Then lim sup_{n→∞} P∗(Yn ∈ F) = lim sup_{n→∞} P∗(Yn ∈ F, d(Xn, Yn)∗ ≤ ǫ) ≤ lim sup_{n→∞} P∗(Xn ∈ F^ǫ) ≤ P(X ∈ F^ǫ). The result follows by letting ǫ ↓ 0. Now assume Xn →P X. Since X ⇝ X, d(X, Xn) →P 0 implies Xn ⇝ X by (i), and thus (ii) follows. We now prove (iii). Xn →P c implies Xn ⇝ c by (ii). Now assume Xn ⇝ c, and fix ǫ > 0. Note that P∗(d(Xn, c) ≥ ǫ) = P∗(Xn ∉ B(c, ǫ)), where B(c, ǫ) is the open ǫ-ball around c ∈ D. By the portmanteau theorem, lim sup_{n→∞} P∗(Xn ∉ B(c, ǫ)) ≤ P(X ∉ B(c, ǫ)) = 0, since X = c almost surely. Thus Xn →P c since ǫ is arbitrary, and (iii) follows.

We now present a generalized continuous mapping theorem that allows for sequences of maps gn which converge to g in a fairly general sense. In the exercises at the end of the chapter, we consider an instance of this where one is interested in maximizing a stochastic process Xn(t), t ∈ T, over an "approximation" Tn of a subset T0 ⊂ T. As a specific motivation, suppose T is high dimensional. The computational burden of computing the supremum of Xn(t) over T may be reduced by choosing a finite mesh Tn which closely approximates T. In what follows, (D, d) and (E, e) are metric spaces.
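The mesh idea can be pictured with a short simulation (a sketch added here, not from the text): for a process with continuous sample paths, the supremum over a finite mesh Tn increases to the supremum over T as the mesh is refined.

```python
import numpy as np

rng = np.random.default_rng(2)

# A path with continuous sample paths on T = [0, 1]:
# X(t) = sum_k a_k sin(k pi t), with random coefficients a_k.
a = rng.normal(size=8) / (1.0 + np.arange(8))

def x_path(t):
    k = np.arange(1, 9)
    return np.sin(np.pi * np.outer(t, k)) @ a

dense = x_path(np.linspace(0, 1, 20001)).max()  # proxy for the sup over T
sups = np.array([x_path(np.linspace(0, 1, m)).max() for m in (11, 101, 1001)])
errors = dense - sups

assert np.all(errors >= -1e-9)          # a mesh sup never exceeds the sup over T
assert errors[-1] <= errors[0] + 1e-12  # refining the mesh cannot do worse
```

Here the coarse grids are nested inside the dense grid, so the mesh suprema are monotone in the refinement.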

Theorem 7.24 (Extended continuous mapping). Let Dn ⊂ D and gn : Dn → E satisfy the following: if xn → x with xn ∈ Dn for all n ≥ 1 and x ∈ D0, then gn(xn) → g(x), where D0 ⊂ D and g : D0 → E. Let Xn be maps taking values in Dn, and let X be Borel measurable and separable with P∗(X ∈ D0) = 1. Then

(i) Xn ⇝ X implies gn(Xn) ⇝ g(X).

(ii) Xn →P X implies gn(Xn) →P g(X).

(iii) Xn →as∗ X implies gn(Xn) →as∗ g(X).

The proof can be found in Chapter 1.11 of VW, and we omit it here. It is easy to see that if g : D → E is continuous at all points in D0, and if we set gn = g and Dn = D for all n ≥ 1, then the standard continuous mapping theorem (Theorem 7.7), specialized to the setting where X is separable, is a corollary of Part (i) of the above theorem.

The following theorem gives another kind of continuous mapping result for sequences which converge in probability and outer almost surely. When X is separable, the conclusions of this theorem are a simple corollary of Theorem 7.24.

Theorem 7.25 Let g : D → E be continuous at all points in D0 ⊂ D, and let X be Borel measurable with P∗(X ∈ D0) = 1. Then

(i) Xn →P X implies g(Xn) →P g(X).

(ii) Xn →as∗ X implies g(Xn) →as∗ g(X).

Proof. Assume Xn →P X, and fix ǫ > 0. Define Bk to be all x ∈ D such that the 1/k-ball around x contains points y and z with e(g(y), g(z)) > ǫ. Part of the proof of Theorem 7.7 in Section 7.4 verifies that Bk is open. It is clear that Bk decreases as k increases. Furthermore, P(X ∈ Bk) ↓ 0, since every point in ∩_{k=1}^∞ Bk is a point of discontinuity of g. Now the outer probability that e(g(Xn), g(X)) > ǫ is bounded above by the outer probability that either X ∈ Bk or d(Xn, X) ≥ 1/k. But this last outer probability converges to P∗(X ∈ Bk) since d(Xn, X) →P 0. Part (i) now follows by letting k → ∞ and noting that ǫ was arbitrary. Now assume that Xn →as∗ X. A minor modification of the proof of Part (i) verifies that sup_{m≥n} d(Xm, X) →P 0 implies sup_{m≥n} e(g(Xm), g(X)) →P 0. Now Part (iii) of Lemma 7.21 yields that Xn →as∗ X implies g(Xn) →as∗ g(X).

We now present a useful outer almost sure representation result for weak convergence. Such representations allow the conversion of certain weak convergence problems into problems about convergence of fixed sequences. We give an illustration of this approach in the proof of Proposition 7.27 below.

Theorem 7.26 Let Xn : Ωn → D be a sequence of maps, and let X∞ be Borel measurable and separable. If Xn ⇝ X∞, then there exists a probability space (Ω̃, Ã, P̃) and maps X̃n : Ω̃ → D with

(i) X̃n →as∗ X̃∞;

(ii) E∗f(X̃n) = E∗f(Xn), for every bounded f : D → R and all 1 ≤ n ≤ ∞.

Moreover, X̃n can be chosen to be equal to Xn ∘ φn, for all 1 ≤ n ≤ ∞, where the φn : Ω̃ → Ωn are measurable and perfect maps and Pn = P̃ ∘ φn^{−1}.

The proof can be found in Chapter 1.10 of VW, and we omit it here. Recall the definition of perfect maps from Chapter 6. In the setting of the above theorem, if the X̃n are constructed from the perfect maps φn, then [f(X̃n)]∗ = [f(Xn)]∗ ∘ φn for all bounded f : D → R. Thus the equivalence between X̃n and Xn can be made much stronger than mere equivalence in law.

The following proposition can be useful in studying weak convergence of certain statistics which can be expressed as stochastic integrals. For example, the Wilcoxon statistic can be expressed in this way. The proof of the proposition provides the illustration of Theorem 7.26 promised above.
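To make the Wilcoxon remark concrete (an illustrative sketch, not from the text): with F̂n and Ĝm the empirical distribution functions of two samples, the Mann-Whitney form of the Wilcoxon statistic is exactly the stochastic integral ∫ F̂n dĜm, i.e. a discrete sum of F̂n over the second sample's points.

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(size=50)          # first sample; its edf is F_n
y = rng.normal(size=40) + 0.5    # second sample; its edf is G_m

def edf(sample):
    s = np.sort(sample)
    return lambda t: np.searchsorted(s, t, side="right") / s.size

Fn = edf(x)

# Stochastic-integral form: int F_n dG_m = (1/m) sum_j F_n(y_j),
# since G_m places mass 1/m at each y_j.
integral = Fn(y).mean()

# Mann-Whitney form: the proportion of pairs with x_i <= y_j.
u = (x[:, None] <= y[None, :]).mean()

assert np.isclose(integral, u)
```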

Proposition 7.27 Let Xn, Gn ∈ D[a, b] be stochastic processes with Xn ⇝ X and Gn →P G in D[a, b], where X is bounded with continuous sample paths, G is fixed, and Gn and G have total variation bounded by K < ∞. Then ∫_a^{(·)} Xn(s) dGn(s) ⇝ ∫_a^{(·)} X(s) dG(s) in D[a, b].

Proof. First, Slutsky's theorem and Lemma 7.23 establish that (Xn, Gn) ⇝ (X, G). Next, Theorem 7.26 tells us that there exists a new probability space and processes X̃n, X̃, G̃n and G̃ which have the same outer integrals for bounded functions as Xn, X, Gn and G, respectively, but which also satisfy (X̃n, G̃n) →as∗ (X̃, G̃). By the continuity of the sample paths of X̃, we have for each integer m ≥ 1 that there exists a finite partition of [a, b], a = t0 < t1 < · · · < t_{km} = b, with P(Mm > 1/m) ≤ 1/m, where

Mm ≡ max_{1≤j≤km} sup_{s,t∈(t_{j−1},t_j]} |X̃(s) − X̃(t)|.

Define X̃m ∈ D[a, b] such that X̃m(a) = X̃(a) and X̃m(t) ≡ Σ_{j=1}^{km} 1{t_{j−1} < t ≤ t_j} X̃(t_j), for t ∈ (a, b]. Note that for integrals over the range (a, t], for t ∈ [a, b], we define the value of the integral to be zero when t = a, since (a, a] is the null set. We now have, for any t ∈ [a, b], that

| ∫_a^t X̃n(s) dG̃n(s) − ∫_a^t X̃(s) dG̃(s) |

  ≤ ∫_a^b |X̃n(s) − X̃(s)| |dG̃n(s)| + ∫_a^b |X̃m(s) − X̃(s)| |dG̃n(s)|
    + | ∫_a^t X̃m(s) [dG̃n(s) − dG̃(s)] | + ∫_a^t |X̃m(s) − X̃(s)| |dG̃(s)|

  ≤ K( ‖X̃n − X̃‖_{[a,b]} + 2Mm ) + Σ_{j=1}^{km} |X̃(t_j)| | ∫_{(t_{j−1},t_j]∩(a,t]} [dG̃n(s) − dG̃(s)] |

  ≤ K( ‖X̃n − X̃‖∗_{[a,b]} + 2Mm ) + km( ‖X̃‖_{[a,b]} × ‖G̃n − G̃‖∗_{[a,b]} )

  ≡ En(m).

Note that En(m) is measurable and → 0 almost surely. Define Dn to be the infimum of En(m) over all integers m ≥ 1. Since Dn →as∗ 0 and Dn is measurable, we have that ∫_a^{(·)} X̃n(s) dG̃n(s) →as∗ ∫_a^{(·)} X̃(s) dG̃(s). Choose any f ∈ Cb(D[a, b]), and note that the map (x, y) ↦ f(∫_a^{(·)} x(s) dy(s)), for x, y ∈ D[a, b] with the total variation of y bounded, is bounded. Thus


E∗f( ∫_a^{(·)} Xn(s) dGn(s) ) = E∗f( ∫_a^{(·)} X̃n(s) dG̃n(s) ) → Ef( ∫_a^{(·)} X̃(s) dG̃(s) ) = Ef( ∫_a^{(·)} X(s) dG(s) ).

Since this convergence holds for all f ∈ Cb(D[a, b]), the desired result now follows.

We give one more result before closing this chapter. The result applies to certain weak convergence settings involving questions that are easier to answer for measurable maps. The following lemma shows that a nonmeasurable, weakly convergent sequence Xn is usually quite close to a measurable sequence Yn:

Lemma 7.28 Let Xn : Ωn → D be a sequence of maps. If Xn ⇝ X, where X is Borel measurable and separable, then there exists a Borel measurable sequence Yn : Ωn → D with d(Xn, Yn) →P 0.

The proof can be found in Chapter 1.10 of VW, and we omit it here.

7.4 Proofs

Proof of Lemma 7.1. Clearly (i) implies (ii). Now assume (ii). For every open G ⊂ D, define the sequence of functions fm(x) = [m d(x, D − G)] ∧ 1, for integers m ≥ 1, and note that each fm is bounded and Lipschitz continuous and fm ↑ 1{G} as m → ∞. By monotone convergence, L1(G) = L2(G). Since this is true for every open G ⊂ D, including G = D, the collection of Borel sets for which L1(B) = L2(B) is a σ-field and is at least as large as the Borel σ-field. Hence (ii) implies (i). The equivalence of (i) and (iii) under separability follows from Theorem 1.12.2 of VW, and we omit the details here.

The fact that (i) implies (iv) is obvious. Now assume L1 and L2 are tight and that (iv) holds. Fix ǫ > 0, and choose a compact K ⊂ D such that L1(K) ∧ L2(K) ≥ 1 − ǫ. According to a version of the Stone-Weierstrass theorem given in Jameson (1974, p. 263), a vector lattice F ⊂ Cb(K) that includes the constants and separates points of K is uniformly dense in Cb(K). Choose a g ∈ Cb(D) for which 0 ≤ g ≤ 1, and select an f ∈ F such that sup_{x∈K} |g(x) − f(x)| ≤ ǫ. Now we have

| ∫ g dL1 − ∫ g dL2 | ≤ | ∫_K g dL1 − ∫_K g dL2 | + 2ǫ ≤ | ∫_K (f ∧ 1)+ dL1 − ∫_K (f ∧ 1)+ dL2 | + 4ǫ = 4ǫ.

The last equality follows since (f ∧ 1)+ ∈ F. Thus ∫ g dL1 = ∫ g dL2 since ǫ is arbitrary. By adding and subtracting scalars, we can verify that the same result holds for all g ∈ Cb(D). Hence (i) holds.


Proof of Lemma 7.2. The equivalence of (i) and (ii) is an immediate consequence of Theorem 6.2. Now assume (ii) holds. Then |X| is bounded almost surely. Hence for any pair of sequences sn, tn ∈ T such that ρ0(sn, tn) → 0, X(sn) − X(tn) →P 0. Thus X ∈ UC(T, ρ0) with probability 1. It remains to show that (T, ρ0) is totally bounded. Let the pair of sequences sn, tn ∈ T satisfy ρ(sn, tn) → 0. Then X(sn) − X(tn) →P 0 and thus ρ0(sn, tn) → 0. This means that, since (T, ρ) is totally bounded, we have for every ǫ > 0 that there exists a finite Tǫ ⊂ T such that sup_{t∈T} inf_{s∈Tǫ} ρ0(s, t) < ǫ. Thus (T, ρ0) is also totally bounded, and the desired result follows.

Proof of Theorem 7.6. Assume (i), and note that (i) implies (vii) trivially. Now assume (vii), and fix an open G ⊂ D. As in the proof of Lemma 7.1 above, there exists a sequence of nonnegative, Lipschitz continuous functions fm with 0 ≤ fm ↑ 1{G}. Now for each integer m ≥ 1, lim inf_{n→∞} P∗(Xn ∈ G) ≥ lim inf_{n→∞} E∗fm(Xn) = Efm(X). Taking the limit as m → ∞ yields (ii). Thus (vii) ⇒ (ii). The equivalence of (ii) and (iii) follows by taking complements.

Assume (ii) and let f be lower semicontinuous with f ≥ 0. Define the sequence of functions fm = Σ_{i=1}^{m²} (1/m) 1{Gi}, where Gi = {x : f(x) > i/m}, i = 1, . . . , m². Thus fm "rounds" f down to i/m if f(x) ∈ (i/m, (i + 1)/m] for any i = 0, . . . , m² − 1, and fm = m when f(x) > m. Hence 0 ≤ fm ≤ f ∧ m and |fm − f|(x) ≤ 1/m whenever f(x) ≤ m. Fix m. Note that each Gi is open by the definition of lower semicontinuity. Thus

lim inf_{n→∞} E∗f(Xn) ≥ lim inf_{n→∞} E∗fm(Xn) ≥ Σ_{i=1}^{m²} (1/m) [ lim inf_{n→∞} P∗(Xn ∈ Gi) ] ≥ Σ_{i=1}^{m²} (1/m) P(X ∈ Gi) = Efm(X).

Thus (ii) implies (iv) after letting m → ∞ and adding then subtracting a constant as needed to compensate for the lower bound of f. The equivalence of (iv) and (v) follows by replacing f with −f. Assume (v) (and thus also (iv)). Since a continuous function is both upper and lower semicontinuous, we have for any f ∈ Cb(D) that Ef(X) ≥ lim sup_{n→∞} E∗f(Xn) ≥ lim inf_{n→∞} E∗f(Xn) ≥ Ef(X). Hence (v) implies (i).

Assume (ii) (and hence also (iii)). For any Borel set B ⊂ D, L(B̊) ≤ lim inf_{n→∞} P∗(Xn ∈ B) ≤ lim sup_{n→∞} P∗(Xn ∈ B) ≤ L(B̄); however, the foregoing inequalities all become equalities when L(∂B) = 0. Thus (ii) implies (vi). Assume (vi), and let F be closed. For each ǫ > 0 define F^ǫ = {x : d(x, F) < ǫ}. Since the sets ∂F^ǫ are disjoint, L(∂F^ǫ) > 0 for at most countably many ǫ. Hence we can choose a sequence ǫm ↓ 0 so that L(∂F^{ǫm}) = 0 for each integer m ≥ 1. Note that for fixed m, lim sup_{n→∞} P∗(Xn ∈ F) ≤ lim sup_{n→∞} P∗(Xn ∈ F^{ǫm}) = L(F^{ǫm}). By letting m → ∞, we obtain that (vi) implies (ii). Thus Conditions (i)–(vii) are all equivalent.

The equivalence of (i)–(vii) to (viii) when L is separable follows from Theorem 1.12.2 of VW, and we omit the details.

Proof of Theorem 7.7. The set of all points at which g is not continuous can be expressed as Dg ≡ ∪_{m=1}^∞ ∩_{k=1}^∞ G_k^m, where G_k^m consists of all x ∈ D so that e(g(y), g(z)) > 1/m for some y, z ∈ B_{1/k}(x), and where e is the metric for E. Note that the complement of G_k^m, (G_k^m)^c, consists of all x for which e(g(y), g(z)) ≤ 1/m for all y, z ∈ B_{1/k}(x). Now if xn → x, then for any y, z ∈ B_{1/k}(x), we have that y, z ∈ B_{1/k}(xn) for all n large enough. Hence (G_k^m)^c is closed, and thus G_k^m is open. This means that Dg is a Borel set. Let F ⊂ E be closed, and let xn ∈ g^{−1}(F) be a sequence for which xn → x. If x is a continuity point of g, then x ∈ g^{−1}(F). Otherwise x ∈ Dg. Hence the closure cl g^{−1}(F) ⊂ g^{−1}(F) ∪ Dg. Since g is continuous on the range of X, there is a version of g(X) that is Borel measurable. By the portmanteau theorem, lim sup_{n→∞} P∗(g(Xn) ∈ F) ≤ lim sup_{n→∞} P∗(Xn ∈ cl g^{−1}(F)) ≤ P(X ∈ cl g^{−1}(F)). Since Dg has probability zero under the law of X, P(X ∈ cl g^{−1}(F)) = P(g(X) ∈ F). Reapplying the portmanteau theorem, we obtain the desired result.

Proof of Lemma 7.10. It is easy to see that uniform tightness implies asymptotic tightness. To verify the equivalence going the other direction, assume that Xn is tight for each n ≥ 1 and that the sequence is asymptotically tight. Fix ǫ > 0, and choose a compact K0 for which lim inf_{n→∞} P(Xn ∈ K_0^δ) ≥ 1 − ǫ for all δ > 0. For each integer m ≥ 1, choose an nm < ∞ so that P(Xn ∈ K_0^{1/m}) ≥ 1 − 2ǫ for all n ≥ nm. For each integer n ∈ (nm, nm+1], choose a compact K̃n so that P(Xn ∈ K̃n) ≥ 1 − ǫ/2 and an ηn ∈ (0, 1/m) so that P(Xn ∈ K_0^{1/m} − K_0^{ηn}) < ǫ/2. Let Kn = (K̃n ∪ K0) ∩ {x : d(x, K0) ≤ ηn}, and note that K0 ⊂ Kn ⊂ K_0^{1/m} and that Kn is compact. We leave it as an exercise to show that K ≡ ∪_{n=1}^∞ Kn is also compact. Now P(Xn ∈ Kn) ≥ 1 − 3ǫ for all n ≥ 1, and thus P(Xn ∈ K) ≥ 1 − 3ǫ for all n ≥ 1. Uniform tightness follows since ǫ was arbitrary.

Proof of Lemma 7.18. Fix ζ > 0. The conditions imply that P∗(‖Xn‖∗_T > M) < ζ for some M < ∞. Let ǫm be a positive sequence converging down to zero, and let ηm ≡ 2^{−m}ζ. By Condition (ii), there exists a positive sequence δm ↓ 0 so that

lim sup_{n→∞} P∗( sup_{s,t∈T:ρ(s,t)<δm} |Xn(s) − Xn(t)| > ǫm ) < ηm.

Now fix m. By the total boundedness of T, there exists a finite partition T = ∪_{i=1}^k Ti into disjoint sets, each of ρ-diameter less than δm, so that

lim sup_{n→∞} P∗( max_{1≤i≤k} sup_{s,t∈Ti} |Xn(s) − Xn(t)| > ǫm ) < ηm.

Let z1, . . . , zp be the set of all functions in ℓ∞(T) which are constant on each Ti and which take only the values ±iǫm, for i = 0, . . . , K, where K is the largest integer ≤ M/ǫm. Now let Km be the union of the closed balls of radius ǫm around the zi, i = 1, . . . , p. This construction ensures that if ‖x‖T ≤ M and max_{1≤i≤k} sup_{s,t∈Ti} |x(s) − x(t)| ≤ ǫm, then x ∈ Km. This construction can be repeated for each m ≥ 1.

Let K = ∩_{m=1}^∞ Km, and note that K is totally bounded and closed. Total boundedness follows since each Km is a union of finitely many ǫm-balls which cover K and ǫm ↓ 0. We leave the proof that K is closed as an exercise. Now we show that for every δ > 0, there is an m < ∞ so that K^δ ⊃ ∩_{i=1}^m Ki. If this were not true, then there would be a sequence zm ∉ K^δ with zm ∈ ∩_{i=1}^m Ki for every m ≥ 1. This sequence has a subsequence z_{m1(k)}, k ≥ 1, contained in one of the balls making up K1. This subsequence has a further subsequence z_{m2(k)}, k ≥ 1, contained in one of the balls making up K2. We can continue with this process to generate, for each integer j ≥ 1, a subsequence z_{mj(k)}, k ≥ 1, contained in the intersection ∩_{i=1}^j Bi, where each Bi is one of the balls making up Ki. Define a new subsequence z̃k = z_{mk(k)}, and note that z̃k is Cauchy with limit in K since K is closed. However, this contradicts d(zm, K) ≥ δ for all m. Hence the complement of K^δ is contained in the complement of ∩_{i=1}^m Ki for some m < ∞. Thus lim sup_{n→∞} P∗(Xn ∉ K^δ) ≤ lim sup_{n→∞} P∗(Xn ∉ ∩_{i=1}^m Ki) ≤ ζ + Σ_{i=1}^m ηi ≤ 2ζ. Since this result holds for all δ > 0, we now have that lim inf_{n→∞} P∗(Xn ∈ K^δ) ≥ 1 − 2ζ for all δ > 0. Asymptotic tightness follows since ζ is arbitrary.

Proof of Lemma 7.20. The fact that (i) implies (ii) we leave as an exercise. Assume (ii). Then there exists a sequence of finite partitions T1, T2, . . . such that

lim sup_{n→∞} P∗( sup_{U∈Tk} sup_{s,t∈U} |Xn(s) − Xn(t)| > 2^{−k} ) < 2^{−k},

for all integers k ≥ 1. Here, each partition Tk is a collection of disjoint sets U1, . . . , U_{mk} with T = ∪_{i=1}^{mk} Ui. Without loss of generality, we can insist that the Tk are nested in the sense that any U ∈ Tk is a union of sets in Tk+1. Such a nested sequence of partitions is easy to construct from any other sequence of partitions T_k^∗ by letting Tk consist of all nontrivial intersections of all sets in T_1^∗ up through and including T_k^∗. Let T0 denote the partition consisting of the single set T.

For any s, t ∈ T, define K(s, t) ≡ sup{k : s, t ∈ U for some U ∈ Tk} and ρ(s, t) ≡ 2^{−K(s,t)}. Also, for any δ > 0, let J(δ) ≡ inf{k : 2^{−k} < δ}. It is not hard to verify that for any s, t ∈ T and δ > 0, ρ(s, t) < δ if and only if s, t ∈ U for some U ∈ T_{J(δ)}. Thus


sup_{s,t∈T:ρ(s,t)<δ} |Xn(s) − Xn(t)| = sup_{U∈T_{J(δ)}} sup_{s,t∈U} |Xn(s) − Xn(t)|

for all 0 < δ ≤ 1. Since J(δ) → ∞ as δ ↓ 0, we now have that Xn is asymptotically uniformly ρ-equicontinuous in probability, as long as ρ is a pseudometric. The only difficulty here is to verify that the triangle inequality holds for ρ, which we leave as an exercise. It is easy to see that T is totally bounded with respect to ρ, and (i) follows.

Proof of Lemma 7.21. We first prove (iii). Assume that Xn →as∗ X, and define A_n^k ≡ {sup_{m≥n} d(Xm, X)∗ > 1/k}. For each integer k ≥ 1, we have P(A_n^k) ↓ 0 as n → ∞. Now fix ǫ > 0, and note that for each k ≥ 1 we can choose an nk so that P(A_{nk}^k) ≤ ǫ/2^k. Let A = Ω − ∪_{k=1}^∞ A_{nk}^k, and observe that, by this construction, P(A) ≥ 1 − ǫ and d(Xn, X)∗ ≤ 1/k for all n ≥ nk and all ω ∈ A. Thus Xn converges to X almost uniformly since ǫ is arbitrary.

Assume now that Xn converges to X almost uniformly. Fix ǫ > 0, and let A be measurable with P(A) ≥ 1 − ǫ and d(Xn, X) → 0 uniformly over ω ∈ A. Fix η > 0, and note that η ≥ (d(Xn, X)1{A})∗ for all sufficiently large n, since η is measurable and satisfies η ≥ d(Xn, X)1{A} for sufficiently large n. Now let S, T : Ω → [0,∞) be maps with S measurable and T∗ bounded. Then for any c > 0, [(S + c)T]∗ ≤ (S + c)T∗, and (S + c)T∗ ≤ [(S + c)T]∗ since T∗ ≤ [(S + c)T]∗/(S + c). Hence [(S + c)T]∗ = (S + c)T∗. By letting c ↓ 0, we obtain that (ST)∗ = ST∗. Hence d(Xn, X)∗1{A} = (d(Xn, X)1{A})∗ ≤ η for all n large enough, and thus d(Xn, X)∗ → 0 for almost all ω ∈ A. Since ǫ is arbitrary, Xn →as∗ X.

Now assume that Xn →as∗ X. This clearly implies that sup_{m≥n} d(Xm, X) →P 0. Conversely, assume that sup_{m≥n} d(Xm, X) →P 0, and fix ǫ > 0. Generate a subsequence nk by finding, for each integer k ≥ 1, an integer nk ≥ n_{k−1} ≥ 1 which satisfies P∗(sup_{m≥nk} d(Xm, X) > 1/k) ≤ ǫ/2^k. Call the set inside this outer probability statement Ak, and define A = Ω − ∪_{k=1}^∞ A_k^∗. Now P(A) ≥ 1 − ǫ, and for each ω ∈ A and all m ≥ nk, d(Xm, X) ≤ 1/k for all k ≥ 1. Hence sup_{ω∈A} d(Xn, X)(ω) → 0 as n → ∞. Thus Xn converges almost uniformly to X, since ǫ is arbitrary, and therefore Xn →as∗ X. Thus we have proven (iii).

We now prove (i). It is easy to see that Xn converging to X almost uniformly implies Xn →P X. Thus (i) follows from (iii). We next prove (ii). Assume Xn →P X. Construct a subsequence 1 ≤ n1 < n2 < · · · so that P(d(X_{nj}, X)∗ > 1/j) < 2^{−j} for all integers j ≥ 1. Then

P(d(X_{nj}, X)∗ > 1/j, for infinitely many j) = 0

by the Borel-Cantelli lemma. Hence X_{nj} →as∗ X as a sequence in j. This now implies that every subsequence has an outer almost surely convergent further subsequence. Assume now that every subsequence has an outer almost surely convergent further subsequence. By (i), the outer almost surely convergent subsequences also converge in probability. Hence Xn →P X.


7.5 Exercises

7.5.1. Show that F, defined in the proof of Lemma 7.3, is a vector lattice and an algebra, and that it separates points in D.

7.5.2. Show that the set K defined in the proof of Lemma 7.10 in Section 7.2 is compact.

7.5.3. In the setting of the proof of Lemma 7.14, show that π1K and π2K are both compact whenever K ⊂ D × E is compact.

7.5.4. In the proof of Lemma 7.18 (given in Section 7.4), show that the set K = ∩_{m=1}^∞ Km is closed.

7.5.5. Suppose that Tn and T0 are subsets of a semimetric space (T, ρ) such that Tn → T0 in the sense that

(i) Every t ∈ T0 is the limit of a sequence tn ∈ Tn;

(ii) For every closed S ⊂ T − T0, S ∩ Tn = ∅ for all n large enough.

Suppose that Xn and X are stochastic processes indexed by T for which Xn ⇝ X in ℓ∞(T) and X is Borel measurable with P(X ∈ UC(T, ρ)) = 1, where (T, ρ) is not necessarily totally bounded. Show that sup_{t∈Tn} Xn(t) ⇝ sup_{t∈T0} X(t). Hint: show first that for any xn → x, where xn ∈ ℓ∞(T) and x ∈ UC(T, ρ), we have lim_{n→∞} sup_{t∈Tn} xn(t) = sup_{t∈T0} x(t).

7.5.6. Complete the proof of Lemma 7.20:

1. Show that (i) implies (ii).

2. Show that for any s, t ∈ T and δ > 0, ρ(s, t) < δ if and only if s, t ∈ U for some U ∈ TJ(δ), where ρ(s, t) ≡ 2^{−K(s,t)} and K is as defined in the proof.

3. Verify that the triangle inequality holds for ρ.

7.5.7. Let Xn and X be maps into R with X Borel measurable. Show the following:

(i) Xn →as∗ X if and only if both X∗n and Xn∗ ≡ (Xn)∗ converge almost surely to X.

(ii) Xn ⇝ X if and only if X∗n ⇝ X and Xn∗ ⇝ X.

Hints: For (i), first show |Xn − X|∗ = |X∗n − X| ∨ |Xn∗ − X|. For (ii), assume Xn ⇝ X and apply the extended almost sure representation theorem to find a new probability space and a perfect sequence of maps φn such that X̃n = Xn ◦ φn →as∗ X̃. By (i), X̃∗n → X̃ almost surely, and thus X̃∗n ⇝ X̃. Since φn is perfect, X̃∗n = X∗n ◦ φn; and thus Ef(X̃∗n) = Ef(X∗n) for every


measurable f. Hence X∗n ⇝ X. Now show Xn∗ ⇝ X. For the converse, use the facts that P(X∗n ≤ x) ≤ P∗(Xn ≤ x) ≤ P(Xn∗ ≤ x) and that distributions for real random variables are completely determined by their cumulative distribution functions.

7.5.8. Using the ideas in the proof of Proposition 7.27, prove Lemma 4.2.

7.6 Notes

Many of the ideas and results of this chapter come from Chapters 1.3–1.5 and 1.9–1.12 of VW, specialized to sequences (rather than nets). Lemma 7.1 is a composite of Lemma 1.3.12 and Theorem 1.12.2 of VW, while Lemma 7.4 is Lemma 1.5.3 of VW. Components (i)–(vii) of the portmanteau theorem are a specialization to sequences of the portmanteau theorem in Chapter 1.3 of VW. Theorem 7.7, Lemmas 7.8, 7.9, and 7.12, and Theorem 7.13 correspond to VW Theorems 1.3.6 and 1.3.10, Lemmas 1.3.13 and 1.3.8, and Theorem 1.3.9, respectively. Lemma 7.14 is a composite of Lemmas 1.4.3 and 1.4.4 of VW. Lemma 7.16 and Theorem 7.17 are essentially VW Lemma 1.5.2 and Theorem 1.5.4. Lemmas 7.21 and 7.23 and Theorem 7.24 are specializations to sequences of VW Lemmas 1.9.2 and 1.10.2 and Theorem 1.11.1. Theorem 7.26 is a composite of VW Theorem 1.10.4 and Addendum 1.10.5 applied to sequences, while Lemma 7.28 is essentially Proposition 1.10.12 of VW.

Proposition 7.5 on Gaussian processes is essentially a variation of Lemma 3.9.8 of VW. Proposition 7.27 is a modification of Lemma A.3 of Bilias, Gu and Ying (1997), who use this result to obtain weak convergence of the proportional hazards regression parameter in a continuously monitored sequential clinical trial.


8 Empirical Process Methods

Recall from Section 2.2.2 the concepts of bracketing and uniform entropy along with the corresponding Glivenko-Cantelli and Donsker theorems. We now briefly review the set-up. Given a probability space (Ω, A, P), the data of interest consist of n independent copies X1, . . . , Xn of a map X : Ω → X, where X is the sample space. We are interested in studying the limiting behavior of empirical processes indexed by classes F of functions f : X → R which are measurable in the sense that each composite map f(X) : Ω → X → R is A-measurable. With Pn denoting the empirical measure based on the data X1, . . . , Xn, the empirical process of interest is Pnf, viewed as a stochastic process indexed by f ∈ F.

A class F is P-Glivenko-Cantelli if ‖Pn − P‖F →as∗ 0, where for any u ∈ ℓ∞(T), ‖u‖T ≡ sup_{t∈T} |u(t)|. We say F is weak P-Glivenko-Cantelli if the outer almost sure convergence is replaced by convergence in probability. Sometimes, for clarification, we will call a P-Glivenko-Cantelli class a strong P-Glivenko-Cantelli class to remind ourselves that the convergence is outer almost sure. A class F is P-Donsker if Gn ⇝ G in ℓ∞(F), where G is a tight Brownian bridge. Of course, the P prefix can be dropped if the context is clear.

As mentioned in Section 4.2.1, these ideas also apply directly to i.i.d. samples of stochastic processes. In this setting, X has the form {X(t), t ∈ T}, where X(t) is measurable for each t ∈ T, and X is typically ℓ∞(T). We say that X is P-Glivenko-Cantelli if sup_{t∈T} |(Pn − P)X(t)| →as∗ 0 and that X is P-Donsker if GnX converges weakly to a tight Gaussian process. This is exactly equivalent to considering whether the class FT ≡ {ft, t ∈ T},


where ft(x) ≡ x(t) for all x ∈ X and t ∈ T, is Glivenko-Cantelli or Donsker. The limiting Brownian bridge G for the class FT has covariance E[(Gfs)(Gft)] = P[(X(s) − PX(s))(X(t) − PX(t))]. This duality between the function class and stochastic process viewpoints will prove useful from time to time, and which approach we take will depend on the setting.

The main goal of this chapter is to present the empirical process techniques needed to prove the Glivenko-Cantelli and Donsker theorems of Section 2.2.2. The approach we take is guided by Chapters 2.2–2.5 of VW, although we leave out many technical details. The most difficult step in these proofs is going from pointwise convergence to uniform convergence. Maximal inequalities are very useful tools for accomplishing this step. For uniform entropy results, an additional tool, symmetrization, is also needed. To use symmetrization, several measurability conditions are required on the class of functions F beyond the usual requirement that each f ∈ F be measurable. In the sections presenting the Glivenko-Cantelli and Donsker theorem proofs, results for bracketing entropy are presented before the uniform entropy results.

8.1 Maximal Inequalities

We first present several results about Orlicz norms, which are useful for controlling the size of the maximum of a finite collection of random variables. Several maximal inequalities for stochastic processes will be given next. These inequalities include a general maximal inequality for separable stochastic processes and a maximal inequality for sub-Gaussian processes. The results will utilize Orlicz norms combined with a method known as chaining. The results of this section will play a key role in the proofs of the Donsker theorems developed later on in this chapter.

8.1.1 Orlicz Norms and Maxima

A very useful class of norms for random variables used in maximal inequalities is the class of Orlicz norms. For a nondecreasing, nonzero convex function ψ : [0,∞] → [0,∞], with ψ(0) = 0, the Orlicz norm ‖X‖ψ of a real random variable X is defined as

‖X‖ψ ≡ inf{ c > 0 : Eψ(|X|/c) ≤ 1 },

where the norm takes the value ∞ if no finite c exists for which Eψ(|X|/c) ≤ 1. Exercise 8.5.1 below verifies that ‖·‖ψ is indeed a norm on the space of random variables with ‖X‖ψ < ∞. The Orlicz norm ‖·‖ψ is also called the ψ-norm, in order to specify the choice of ψ. When ψ is of the form x ↦ x^p, where p ≥ 1, the corresponding Orlicz norm is just the Lp-norm


‖X‖p ≡ (E|X|^p)^{1/p}. For maximal inequalities, Orlicz norms defined with ψp(x) ≡ e^{x^p} − 1, for p ≥ 1, are of greater interest because of their sensitivity to behavior in the tails. Clearly, since x^p ≤ ψp(x), we have ‖X‖p ≤ ‖X‖ψp. Also, by the series representation for exponentiation, ‖X‖p ≤ p! ‖X‖ψ1 for all p ≥ 1. The following result shows how Orlicz norms based on ψp relate fairly precisely to the tail probabilities:

Lemma 8.1 For a real random variable X and any p ∈ [1,∞), the following are equivalent:

(i) ‖X‖ψp < ∞.

(ii) There exist constants 0 < C, K < ∞ such that

P(|X| > x) ≤ K e^{−Cx^p}, for all x > 0. (8.1)

Moreover, if either condition holds, then K = 2 and C = ‖X‖ψp^{−p} satisfy (8.1), and, for any C, K ∈ (0,∞) satisfying (8.1), ‖X‖ψp ≤ ((1 + K)/C)^{1/p}.

Proof. Assume (i). Then P(|X| > x) equals

P( ψp(|X|/‖X‖ψp) ≥ ψp(x/‖X‖ψp) ) ≤ 1 ∧ ( 1/ψp(x/‖X‖ψp) ), (8.2)

by Markov's inequality. By Exercise 8.5.2 below, 1 ∧ (e^u − 1)^{−1} ≤ 2e^{−u} for all u > 0. Thus the right-hand side of (8.2) is bounded above by (8.1) with K = 2 and C = ‖X‖ψp^{−p}. Hence (ii) and the first half of the last sentence of the lemma follow. Now assume (ii). For any c ∈ (0, C), Fubini's theorem gives us

E( e^{c|X|^p} − 1 ) = E ∫_0^{|X|^p} c e^{cs} ds = ∫_0^∞ P(|X| > s^{1/p}) c e^{cs} ds ≤ ∫_0^∞ K e^{−Cs} c e^{cs} ds = Kc/(C − c),

where the inequality follows from the assumption. Now Kc/(C − c) ≤ 1 whenever c ≤ C/(1 + K) or, in other words, whenever c^{−1/p} ≥ ((1 + K)/C)^{1/p}. This implies (i) and the rest of the lemma.
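As a concrete illustration of the definitions above, the sketch below (not from the text; all names are ours) computes Orlicz norms for the empirical distribution of a sample by bisection, using the fact that c ↦ Eψ(|X|/c) is nonincreasing; with ψ(t) = t^p it recovers the Lp-norm exactly:

```python
import numpy as np

# Numerical sketch (not from the text): the Orlicz norm
# ||X||_psi = inf{c > 0 : E psi(|X|/c) <= 1} for the *empirical*
# distribution of a sample, found by bisection in c, since
# c -> E psi(|X|/c) is nonincreasing.

def orlicz_norm(sample, psi, lo=1e-6, hi=1e6):
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if np.mean(psi(np.abs(sample) / mid)) > 1.0:
            lo = mid        # c too small: expectation exceeds 1
        else:
            hi = mid
    return hi

rng = np.random.default_rng(1)
x = rng.normal(size=10_000)

# With psi(t) = t^2 the Orlicz norm is exactly the L_2 norm.
l2_via_orlicz = orlicz_norm(x, lambda t: t**2)
l2_direct = np.mean(x**2) ** 0.5

# With psi_2(t) = exp(t^2) - 1 we get the sub-Gaussian Orlicz norm,
# which Lemma 8.1 ties to Gaussian-type tail bounds.
psi2_norm = orlicz_norm(x, lambda t: np.expm1(t**2))
```

Since ψ2(t) ≥ t², the ψ2-norm must come out larger than the L2-norm, which the computation confirms.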

An important use for Orlicz norms is to control the behavior of maxima. This control is somewhat of an extension of the following simple result for Lp-norms: for any random variables X1, . . . , Xm,

‖ max_{1≤i≤m} Xi ‖p ≤ ( E max_{1≤i≤m} |Xi|^p )^{1/p} ≤ ( E Σ_{i=1}^m |Xi|^p )^{1/p} ≤ m^{1/p} max_{1≤i≤m} ‖Xi‖p.

The following lemma shows that a similar result holds for certain Orlicz norms, but with the m^{1/p} replaced with a constant times ψ^{−1}(m):

Lemma 8.2 Let ψ : [0,∞) → [0,∞) be convex, nondecreasing and nonzero, with ψ(0) = 0 and lim sup_{x,y→∞} ψ(x)ψ(y)/ψ(cxy) < ∞ for some constant c < ∞. Then, for any random variables X1, . . . , Xm,

‖ max_{1≤i≤m} Xi ‖ψ ≤ K ψ^{−1}(m) max_{1≤i≤m} ‖Xi‖ψ,

where the constant K depends only on ψ.

Proof. We first make the stronger assumption that ψ(1) ≤ 1/2 and that ψ(x)ψ(y) ≤ ψ(cxy) for all x, y ≥ 1. Under this stronger assumption, we also have ψ(x/y) ≤ ψ(cx)/ψ(y) for all x ≥ y ≥ 1. Hence, for any y ≥ 1 and k > 0,

max_{1≤i≤m} ψ( |Xi|/(ky) ) ≤ max_i [ (ψ(c|Xi|/k)/ψ(y)) 1{|Xi|/(ky) ≥ 1} + ψ( |Xi|/(ky) ) 1{|Xi|/(ky) < 1} ] ≤ Σ_{i=1}^m [ ψ(c|Xi|/k)/ψ(y) ] + ψ(1).

In the summation, set k = c max_i ‖Xi‖ψ and take expectations of both sides to obtain

E ψ( max_i |Xi|/(ky) ) ≤ m/ψ(y) + 1/2.

With y = ψ^{−1}(2m), the right-hand side is ≤ 1. Thus ‖max_i |Xi|‖ψ ≤ c ψ^{−1}(2m) max_i ‖Xi‖ψ. Since ψ is convex and ψ(0) = 0, x ↦ ψ^{−1}(x) is concave and one-to-one for x > 0. Thus ψ^{−1}(2m) ≤ 2ψ^{−1}(m), and the result follows with K = 2c for the special ψ functions specified at the beginning of the proof.

By Exercise 8.5.3 below, for any ψ satisfying the conditions of the lemma there exist constants 0 < σ ≤ 1 and τ > 0 such that φ(x) ≡ σψ(τx) satisfies φ(1) ≤ 1/2 and φ(x)φ(y) ≤ φ(cxy) for all x, y ≥ 1. Furthermore, for this φ, φ^{−1}(u) ≤ ψ^{−1}(u)/(στ), for all u > 0, and, for any random variable X, ‖X‖ψ ≤ ‖X‖φ/(στ) ≤ ‖X‖ψ/σ. Hence

στ ‖ max_i Xi ‖ψ ≤ ‖ max_i Xi ‖φ ≤ 2c φ^{−1}(m) max_i ‖Xi‖φ ≤ (2c/σ) ψ^{−1}(m) max_i ‖Xi‖ψ,


and the desired result follows with K = 2c/(σ²τ).

An important consequence of Lemma 8.2 is that maxima of random variables with bounded ψ-norm grow at the rate ψ^{−1}(m). Based on Exercise 8.5.4, ψp satisfies the conditions of Lemma 8.2 with c = 1, for any p ∈ [1,∞). The implication in this situation is that the growth of maxima is at most logarithmic, since ψp^{−1}(m) = (log(1 + m))^{1/p}. These results will prove quite useful in the next section.
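The logarithmic growth can be checked by simulation. The sketch below (not from the text; all names are ours) estimates E max_{1≤i≤m} |Xi| for standard normal variables and compares it with ψ2^{−1}(m) = √(log(1 + m)):

```python
import numpy as np

# Simulation sketch (not from the text): for m i.i.d. standard normals
# (which have bounded psi_2-norm), E max_{i<=m} |X_i| grows like
# psi_2^{-1}(m) = sqrt(log(1 + m)), i.e. only logarithmically in m.

rng = np.random.default_rng(2)
ratios = []
for m in (10, 100, 1000, 10_000):
    # 500 Monte Carlo replicates of the maximum of m absolute normals.
    maxima = np.abs(rng.normal(size=(500, m))).max(axis=1)
    ratios.append(maxima.mean() / np.sqrt(np.log(1.0 + m)))
```

The ratios remain bounded while m grows by three orders of magnitude, consistent with Lemma 8.2.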

We now present an inequality for collections X1, . . . , Xm of random variables which satisfy

P(|Xi| > x) ≤ 2 e^{−(1/2) x²/(b+ax)}, for all x > 0, (8.3)

for i = 1, . . . , m and some a, b ≥ 0. This setting will arise later in the development of a Donsker theorem based on bracketing entropy.

Lemma 8.3 Let X1, . . . , Xm be random variables that satisfy the tail bound (8.3) for 1 ≤ i ≤ m and some a, b ≥ 0. Then

‖ max_{1≤i≤m} |Xi| ‖ψ1 ≤ K [ a log(1 + m) + √b √(log(1 + m)) ],

where the constant K is universal, in the sense that it does not depend on a, b, or on the random variables.

Proof. Assume for now that a, b > 0. The condition implies for all x ≤ b/a the upper bound 2 exp(−x²/(4b)) for P(|Xi| > x), since in this case b + ax ≤ 2b. For all x > b/a, the condition implies an upper bound of 2 exp(−x/(4a)), since b/x + a ≤ 2a in this case. This implies that P(|Xi| 1{|Xi| ≤ b/a} > x) ≤ 2 exp(−x²/(4b)) and P(|Xi| 1{|Xi| > b/a} > x) ≤ 2 exp(−x/(4a)) for all x > 0. Hence, by Lemma 8.1, the Orlicz norms ‖ |Xi| 1{|Xi| ≤ b/a} ‖ψ2 and ‖ |Xi| 1{|Xi| > b/a} ‖ψ1 are bounded by √(12b) and 12a, respectively. The result now follows by Lemma 8.2 combined with the inequality

‖ max_i |Xi| ‖ψ1 ≤ ‖ max_i [ |Xi| 1{|Xi| ≤ b/a} ] ‖ψ2 + ‖ max_i [ |Xi| 1{|Xi| > b/a} ] ‖ψ1,

where the replacement of ψ1 with ψ2 in the first term on the right is permissible since ψp-norms increase in p.

Suppose now that a > 0 but b = 0. Then the tail bound (8.3) holds for all b > 0, and the result of the lemma is thus true for all b > 0. The desired result now follows by letting b ↓ 0. A similar argument will verify that the result holds when a = 0 and b > 0. Finally, the result is trivially true when a = b = 0 since, in this case, Xi = 0 almost surely for i = 1, . . . , m.

8.1.2 Maximal Inequalities for Processes

The goals of this section are first to establish a general maximal inequality for separable stochastic processes and then to specialize the result to sub-Gaussian processes. A stochastic process {X(t), t ∈ T} is separable when


there exists a countable subset T∗ ⊂ T such that sup_{t∈T} inf_{s∈T∗} |X(t) − X(s)| = 0 almost surely. For example, any cadlag process indexed by a closed interval in R is separable because the rationals are a countable dense subset of R. The need for separability of certain processes in the Glivenko-Cantelli and Donsker theorems is hidden in other conditions of the involved theorems, and direct verification of separability is seldom required in statistical applications.

A stochastic process is sub-Gaussian when

P(|X(t) − X(s)| > x) ≤ 2 e^{−(1/2) x²/d²(s,t)}, for all s, t ∈ T, x > 0, (8.4)

for a semimetric d on T. In this case, we say that X is sub-Gaussian with respect to d. An important example of a separable sub-Gaussian stochastic process, the Rademacher process, will be presented at the end of this section. These processes will be utilized later in this chapter in the development of a Donsker theorem based on uniform entropy. Another example of a sub-Gaussian process is Brownian motion on [0, 1], which can easily be shown to be sub-Gaussian with respect to d(s, t) = |s − t|^{1/2}. Because the sample paths are continuous, Brownian motion is also separable.

The conclusion of Lemma 8.2 above is not immediately useful for maximizing X(t) over t ∈ T, since a potentially infinite number of random variables is involved. However, a method called chaining, which involves linking up increasingly refined finite subsets of T and repeatedly applying Lemma 8.2, does make such maximization possible in some settings. The technique depends on the metric entropy of the index set T based on the semimetric d(s, t) = ‖X(s) − X(t)‖ψ.

For an arbitrary semimetric space (T, d), the covering number N(ǫ, T, d) is the minimal number of closed d-balls of radius ǫ required to cover T. The packing number D(ǫ, T, d) is the maximal number of points that can fit in T while maintaining a distance greater than ǫ between all points. When the choice of index set T is clear by context, the notation for covering and packing numbers will be abbreviated as N(ǫ, d) and D(ǫ, d), respectively. The associated entropy numbers are the respective logarithms of the covering and packing numbers. Taken together, these concepts define metric entropy.

For a semimetric space (T, d) and each ǫ > 0,

N(ǫ, d) ≤ D(ǫ, d) ≤ N(ǫ/2, d).

To see this, note that there exists a subset Tǫ ⊂ T such that the cardinality of Tǫ is D(ǫ, d) and the minimum distance between distinct points in Tǫ is > ǫ. If we now place closed ǫ-balls around each point in Tǫ, we have a covering of T. If this were not true, there would exist a point t ∈ T which has distance > ǫ from all the points in Tǫ, but this would mean that D(ǫ, d) + 1 points can fit into T while still maintaining a separation > ǫ between all points. But this contradicts the maximality of D(ǫ, d). Thus


N(ǫ, d) ≤ D(ǫ, d). Now note that no ball of radius ≤ ǫ/2 can cover more than one point in Tǫ, and thus at least D(ǫ, d) closed ǫ/2-balls are needed to cover Tǫ. Hence D(ǫ, d) ≤ N(ǫ/2, d).
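For a finite subset of the real line these quantities can be computed exactly, which makes the sandwich N(ǫ, d) ≤ D(ǫ, d) ≤ N(ǫ/2, d) easy to check numerically. The following sketch (not from the text; all names are ours) uses greedy sweeps, which are optimal in one dimension:

```python
import numpy as np

# Sketch (not from the text): exact covering and packing numbers for a
# finite subset of the line with d(s, t) = |s - t|.  For points on a line,
# a greedy left-to-right sweep is optimal for both quantities.

def covering_number(points, eps):
    """Minimal number of closed intervals of radius eps covering the points."""
    pts = np.sort(points)
    count, i = 0, 0
    while i < len(pts):
        # Center a ball at pts[i] + eps; it covers [pts[i], pts[i] + 2*eps].
        i = np.searchsorted(pts, pts[i] + 2 * eps, side="right")
        count += 1
    return count

def packing_number(points, eps):
    """Maximal number of points that are pairwise more than eps apart."""
    count, last = 0, -np.inf
    for p in np.sort(points):
        if p - last > eps:
            count, last = count + 1, p
    return count

rng = np.random.default_rng(3)
T = rng.uniform(0.0, 1.0, size=200)
sandwich = [(covering_number(T, e), packing_number(T, e), covering_number(T, e / 2))
            for e in (0.05, 0.1, 0.2)]
```

Each triple in `sandwich` is (N(ǫ), D(ǫ), N(ǫ/2)), and the two inequalities proved in the text hold for every ǫ.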

The foregoing discussion reveals that covering and packing numbers are essentially equivalent in behavior as ǫ ↓ 0. However, it turns out to be slightly more convenient for our purposes to focus on packing numbers in this section. Note that T is totally bounded if and only if D(ǫ, d) is finite for each ǫ > 0. The success of the following maximal inequality depends on how fast D(ǫ, d) increases as ǫ ↓ 0:

Theorem 8.4 (General maximal inequality) Let ψ satisfy the conditions of Lemma 8.2, and let {X(t), t ∈ T} be a separable stochastic process with ‖X(s) − X(t)‖ψ ≤ r d(s, t), for all s, t ∈ T, some semimetric d on T, and a constant r < ∞. Then for any η, δ > 0,

‖ sup_{s,t∈T: d(s,t)≤δ} |X(s) − X(t)| ‖ψ ≤ K [ ∫_0^η ψ^{−1}(D(ǫ, d)) dǫ + δ ψ^{−1}(D²(η, d)) ],

for a constant K < ∞ which depends only on ψ and r. Moreover,

‖ sup_{s,t∈T} |X(s) − X(t)| ‖ψ ≤ 2K ∫_0^{diam T} ψ^{−1}(D(ǫ, d)) dǫ,

where diam T ≡ sup_{s,t∈T} d(s, t) is the diameter of T.

Before we present the proof of this theorem, recall from the discussion following the proof of Lemma 8.2 that ψp-norms, for any p ∈ [1,∞), satisfy the conditions of that lemma. The case p = 2, which applies to sub-Gaussian processes, is of most interest to us and is explicitly evaluated in Corollary 8.5 below. This corollary plays a key role in the proof of the Donsker theorem for uniform entropy (see Theorem 8.19 in Section 8.4).

Proof of Theorem 8.4. Note that if the first integral were infinite, the inequalities would be trivially true. Hence we can, without loss of generality, assume that the packing numbers and the associated integral are bounded. Construct a sequence of finite nested sets T0 ⊂ T1 ⊂ · · · ⊂ T such that for each Tj, d(s, t) > η 2^{−j} for every distinct s, t ∈ Tj, and such that each Tj is "maximal" in the sense that no additional points can be added to Tj without violating the inequality. Note that by the definition of packing numbers, the number of points in Tj is bounded above by D(η 2^{−j}, d).

Now we will do the chaining part of the proof. Begin by "linking" every point tj+1 ∈ Tj+1 to one and only one tj ∈ Tj such that d(tj, tj+1) ≤ η 2^{−j}, for all points in Tj+1. Continue this process to link all points in Tj with points in Tj−1, and so on, to obtain for every tj+1 ∈ Tj+1 a chain tj+1, tj, tj−1, . . . , t0 that connects to a point in T0. For any integer k ≥ 0 and arbitrary points sk+1, tk+1 ∈ Tk+1, the difference in increments along their respective chains connecting to s0, t0 can be bounded as follows:


|{X(sk+1) − X(tk+1)} − {X(s0) − X(t0)}| = | Σ_{j=0}^k {X(sj+1) − X(sj)} − Σ_{j=0}^k {X(tj+1) − X(tj)} | ≤ 2 Σ_{j=0}^k max |X(u) − X(v)|,

where for fixed j the maximum is taken over all links (u, v) from Tj+1 to Tj. Hence the jth maximum is taken over at most the cardinality of Tj+1 links, with each link having ‖X(u) − X(v)‖ψ bounded by r d(u, v) ≤ r η 2^{−j}. By Lemma 8.2, we have for a constant K0 < ∞ depending only on ψ and r,

‖ max_{s,t∈Tk+1} |{X(s) − X(s0)} − {X(t) − X(t0)}| ‖ψ (8.5)

≤ K0 Σ_{j=0}^k ψ^{−1}(D(η 2^{−j−1}, d)) η 2^{−j} = 4K0 Σ_{j=0}^k ψ^{−1}(D(η 2^{−j−1}, d)) η 2^{−j−2} ≤ 4η K0 ∫_0^1 ψ^{−1}(D(ηu, d)) du = 4K0 ∫_0^η ψ^{−1}(D(ǫ, d)) dǫ.

In this bound, s0 and t0 depend on s and t in that they are the endpoints of the chains starting at s and t, respectively.

The maximum of the increments |X(sk+1) − X(tk+1)|, over all sk+1 and tk+1 in Tk+1 with d(sk+1, tk+1) < δ, is bounded by the left-hand side of (8.5) plus the maximum of the discrepancies at the ends of the chains |X(s0) − X(t0)| for those points in Tk+1 which are less than δ apart. For every such pair of endpoints s0, t0 of chains starting at two points in Tk+1 within distance δ of each other, choose one and only one pair sk+1, tk+1 in Tk+1, with d(sk+1, tk+1) < δ, whose chains end at s0, t0. By definition of T0, this results in at most D²(η, d) pairs. Now,

|X(s0) − X(t0)| ≤ |{X(s0) − X(sk+1)} − {X(t0) − X(tk+1)}| + |X(sk+1) − X(tk+1)|. (8.6)

Take the maximum of (8.6) over all pairs of endpoints s0, t0. The maximum of the first term on the right-hand side of (8.6) is bounded by the left-hand side of (8.5). The maximum of the second term on the right-hand side of (8.6) is the maximum of at most D²(η, d) terms with ψ-norm bounded by


rδ. By Lemma 8.2, this maximum is bounded by some constant C times δ ψ^{−1}(D²(η, d)). Combining this with (8.5), we obtain

‖ max_{s,t∈Tk+1: d(s,t)<δ} |X(s) − X(t)| ‖ψ ≤ 8K0 ∫_0^η ψ^{−1}(D(ǫ, d)) dǫ + C δ ψ^{−1}(D²(η, d)).

By the fact that the right-hand side does not depend on k, we can replace Tk+1 with T∞ = ∪_{j=0}^∞ Tj by the monotone convergence theorem. If we can verify that taking the supremum over T∞ is equivalent to taking the supremum over T, then the first conclusion of the theorem follows with K = (8K0) ∨ C.

Since X is separable, there exists a countable subset T∗ ⊂ T such that sup_{t∈T} inf_{s∈T∗} |X(t) − X(s)| = 0 almost surely. Let Ω∗ denote the subset of the sample space of X for which this supremum is zero. Accordingly P(Ω∗) = 1. Now, for any point t and sequence tn in T, it is easy to see that d(t, tn) → 0 implies |X(t) − X(tn)| → 0 almost surely (see Exercise 8.5.5 below). For each t ∈ T∗, let Ωt be the subset of the sample space of X for which inf_{s∈T∞} |X(s) − X(t)| = 0. Since T∞ is a dense subset of the semimetric space (T, d), P(Ωt) = 1. Letting Ω ≡ Ω∗ ∩ (∩_{t∈T∗} Ωt), we now have P(Ω) = 1. This, combined with the fact that

sup_{t∈T} inf_{s∈T∞} |X(t) − X(s)| ≤ sup_{t∈T} inf_{s∈T∗} |X(t) − X(s)| + sup_{t∈T∗} inf_{s∈T∞} |X(s) − X(t)|,

implies that sup_{t∈T} inf_{s∈T∞} |X(t) − X(s)| = 0 almost surely. Thus taking the supremum over T is equivalent to taking the supremum over T∞.

The second conclusion of the theorem follows from the previous result by setting δ = η = diam T and noting that, in this case, D(η, d) = 1. Now we have

δ ψ^{−1}(D²(η, d)) = η ψ^{−1}(D(η, d)) = ∫_0^η ψ^{−1}(D(η, d)) dǫ ≤ ∫_0^η ψ^{−1}(D(ǫ, d)) dǫ,

and the second conclusion follows.

As a consequence of Exercise 8.5.5 below, the conclusions of Theorem 8.4 show that X has d-continuous sample paths almost surely whenever the integral ∫_0^η ψ^{−1}(D(ǫ, d)) dǫ is bounded for some η > 0. It is also easy to verify that the maximum of the process X is bounded, since ‖sup_{t∈T} X(t)‖ψ ≤ ‖X(t0)‖ψ + ‖sup_{s,t∈T} |X(t) − X(s)|‖ψ, for any choice of t0 ∈ T. Thus X


is tight and takes its values in UC(T, d) almost surely. These results will prove quite useful in later developments.

An important application of Theorem 8.4 is to sub-Gaussian processes:

Corollary 8.5 Let {X(t), t ∈ T} be a separable sub-Gaussian process with respect to d. Then for all δ > 0,

E( sup_{s,t∈T: d(s,t)≤δ} |X(s) − X(t)| ) ≤ K ∫_0^δ √(log D(ǫ, d)) dǫ,

where K is a universal constant. Also, for any t0 ∈ T,

E( sup_{t∈T} |X(t)| ) ≤ E|X(t0)| + K ∫_0^{diam T} √(log D(ǫ, d)) dǫ.

Proof. Apply Theorem 8.4 with ψ = ψ2 and η = δ. Because ψ2^{−1}(m) = √(log(1 + m)), we have ψ2^{−1}(D²(δ, d)) ≤ √2 ψ2^{−1}(D(δ, d)). Hence the second term of the general maximal inequality can be replaced by

√2 δ ψ2^{−1}(D(δ, d)) ≤ √2 ∫_0^δ ψ2^{−1}(D(ǫ, d)) dǫ,

and we obtain

‖ sup_{d(s,t)≤δ} |X(s) − X(t)| ‖ψ2 ≤ K ∫_0^δ √(log(1 + D(ǫ, d))) dǫ,

for an enlarged universal constant K. Note that D(ǫ, d) ≥ 2 for all ǫ strictly less than diam T. Since 1 + m ≤ m² for all m ≥ 2, the 1 inside the logarithm can be removed at the cost of increasing K again, whenever δ < diam T. Thus the result is also true for all δ ≤ diam T. We are done with the first conclusion since d(s, t) ≤ diam T for all s, t ∈ T. Since the second conclusion is an easy consequence of the first, the proof is complete.

The next corollary shows how to use the previous corollary to establish bounds on the modulus of continuity of certain sub-Gaussian processes. Here the modulus of continuity for a stochastic process {X(t) : t ∈ T}, where (T, d) is a semimetric space, is defined as

mX(δ) ≡ sup_{s,t∈T: d(s,t)≤δ} |X(s) − X(t)|.

Corollary 8.6 Assume the conditions of Corollary 8.5. Also assume there exists a differentiable function δ ↦ h(δ), with derivative ḣ(δ), satisfying h(δ) ≥ √(log D(δ, d)) for all δ > 0 small enough and lim_{δ↓0}[δ ḣ(δ)/h(δ)] = 0. Then

lim_{M→∞} lim sup_{δ↓0} P( mX(δ)/(δ h(δ)) > M ) = 0.


Proof. Using L'Hospital's rule and the assumptions of the theorem, we obtain that

∫_0^δ √(log D(ǫ, d)) dǫ / (δ h(δ)) ≤ ∫_0^δ h(ǫ) dǫ / (δ h(δ)) → 1,

as δ ↓ 0. The result now follows from the first assertion of Corollary 8.5.

In the situation where D(ǫ, d) ≤ K(1/ǫ)^r, for constants 0 < r, K < ∞ and all ǫ > 0 small enough, the above corollary works for h(δ) = c√(log(1/δ)), for some constant 0 < c < ∞. This follows from simple calculations. This situation applies, for example, when X is either a standard Brownian motion or a Brownian bridge on T = [0, 1]. Both of these processes are sub-Gaussian with respect to the metric d(s, t) = |s − t|^{1/2}, and if we let η = δ², we obtain from the corollary that

lim_{M→∞} lim sup_{η↓0} P( mX(η)/√(η log(1/η)) > M ) = 0.

The rate in the denominator is quite precise in this instance, since the Lévy modulus theorem (see Theorem 9.25 of Karatzas and Shreve, 1991) yields

P( lim sup_{η↓0} mX(η)/√(η log(1/η)) = √2 ) = 1.

The above discussion is also applicable to the modulus of continuity of certain empirical processes, and we will examine this briefly in Chapter 11. We note that the condition δ ḣ(δ)/h(δ) → 0, as δ ↓ 0, can be thought of as a requirement that D(δ, d) goes to infinity extremely slowly as δ ↓ 0. While this condition is fairly severe, it is reasonably plausible as illustrated in the above Brownian bridge example.

We now consider an important example of a sub-Gaussian process useful for studying empirical processes. This is the Rademacher process

X(a) = Σ_{i=1}^n ǫi ai, a ∈ R^n,

where ǫ1, . . . , ǫn are i.i.d. Rademacher random variables satisfying P(ǫ = −1) = P(ǫ = 1) = 1/2. We will verify shortly that this is indeed a sub-Gaussian process with respect to the Euclidean distance d(a, b) = ‖a − b‖ (which obviously makes T = R^n into a metric space). This process will emerge in our development of Donsker results based on uniform entropy. The following lemma, also known as Hoeffding's inequality, verifies that Rademacher processes are sub-Gaussian:

Lemma 8.7 (Hoeffding's inequality) Let a = (a1, . . . , an) ∈ R^n and ǫ1, . . . , ǫn be independent Rademacher random variables. Then


P( |Σ_{i=1}^n ǫi ai| > x ) ≤ 2 e^{−(1/2) x²/‖a‖²},

for the Euclidean norm ‖·‖. Hence ‖Σ_{i=1}^n ǫi ai‖ψ2 ≤ √6 ‖a‖.

Proof. For any λ and Rademacher variable ǫ, one has Ee^{λǫ} = (e^λ + e^{−λ})/2 = Σ_{i=0}^∞ λ^{2i}/(2i)! ≤ e^{λ²/2}, where the last inequality follows from the relation (2i)! ≥ 2^i i! for all nonnegative integers i. Hence Markov's inequality gives, for any λ > 0,

P( Σ_{i=1}^n ǫi ai > x ) ≤ e^{−λx} E exp( λ Σ_{i=1}^n ǫi ai ) ≤ exp( (λ²/2)‖a‖² − λx ).

Setting λ = x/‖a‖² yields the desired upper bound. Since multiplying ǫ1, . . . , ǫn by −1 does not change the joint distribution, we obtain

P( −Σ_{i=1}^n ǫi ai > x ) = P( Σ_{i=1}^n ǫi ai > x ),

and the desired upper bound for the absolute value of the sum follows. The bound on the ψ2-norm follows directly from Lemma 8.1.
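Hoeffding's bound is easy to check by simulation. The sketch below (not from the text; all names are ours) compares the empirical tail of a Rademacher sum with the bound 2 exp(−x²/(2‖a‖²)):

```python
import numpy as np

# Monte Carlo sketch (not from the text): check Hoeffding's bound
# P(|sum_i eps_i a_i| > x) <= 2 exp(-x^2 / (2 ||a||^2))
# for a fixed vector a and i.i.d. Rademacher signs eps_i.

rng = np.random.default_rng(4)
a = rng.uniform(-1.0, 1.0, size=50)
norm_sq = float(np.sum(a**2))

signs = rng.choice([-1.0, 1.0], size=(100_000, a.size))
sums = signs @ a                     # 100,000 draws of the Rademacher sum

checks = []
for x in (1.0, 2.0, 4.0, 6.0):
    empirical = float(np.mean(np.abs(sums) > x))
    bound = 2.0 * np.exp(-x**2 / (2.0 * norm_sq))
    checks.append((empirical, bound))
```

The empirical tail probability sits below the bound at every threshold, typically with a wide margin since the Gaussian-type bound is not tight for Rademacher sums.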

8.2 The Symmetrization Inequality and Measurability

We now discuss a powerful technique for empirical processes called symmetrization. We begin by defining the "symmetrized" empirical process f ↦ P°nf ≡ n^{−1} Σ_{i=1}^n ǫi f(Xi), where ǫ1, . . . , ǫn are independent Rademacher random variables which are also independent of X1, . . . , Xn. The basic idea behind symmetrization is to replace suprema of the form ‖(Pn − P)f‖F with suprema of the form ‖P°nf‖F. This replacement is very useful in Glivenko-Cantelli and Donsker theorems based on uniform entropy, and a proof of the validity of this replacement is the primary goal of this section. Note that the processes (Pn − P)f and P°nf both have mean zero. A deeper connection between these two processes is that a Donsker theorem or Glivenko-Cantelli theorem holds for one of these processes if and only if it holds for the other.
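The replacement of one supremum by the other can be observed numerically. The following sketch (not from the text; the class of half-line indicators and all names are our choices) compares the two expected suprema for uniform data:

```python
import numpy as np

# Sketch (not from the text): compare E||P_n - P||_F with E||P_n^o||_F for
# F = { 1{x <= t} : t in a grid } and X_i ~ Uniform(0, 1), so P f_t = t.
# Symmetrization says the two expected suprema are comparable up to
# constants (plus the |R_n| * ||P||_F term).

rng = np.random.default_rng(5)
n, reps = 200, 500
grid = np.linspace(0.0, 1.0, 101)

sup_centered, sup_symmetrized = [], []
for _ in range(reps):
    x = rng.uniform(size=n)
    eps = rng.choice([-1.0, 1.0], size=n)
    ind = x[:, None] <= grid[None, :]                     # f_t(X_i) = 1{X_i <= t}
    sup_centered.append(np.abs(ind.mean(axis=0) - grid).max())
    sup_symmetrized.append(np.abs((eps[:, None] * ind).mean(axis=0)).max())

m_cen = float(np.mean(sup_centered))
m_sym = float(np.mean(sup_symmetrized))
```

Both quantities are of order n^{−1/2}, and the symmetrized supremum lands between the lower and upper symmetrization bounds.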

One potentially troublesome difficulty is that the suprema involved may not be measurable, and we need to be clear about the underlying product probability spaces so that the outer expectations are well defined. In this setting, we will assume that X1, . . . , Xn are the coordinate projections of the product space (X^n, A^n, P^n), where (X, A, P) is the probability space for a single observation and A^n is shorthand for


the product σ-field generated from sets of the form A1 × · · · × An, where A1, . . . , An ∈ A. In many of the settings of interest to us, the σ-field A^n will be strictly smaller than the Borel σ-field generated from the product topology, as discussed in Section 6.1, but the results we obtain using A^n will be sufficient for our purposes. In some settings, an additional source of randomness, independent of X1, . . . , Xn, will be involved, which we will denote Z. If we let the probability space for Z be (Z, D, Q), we will assume that the resulting underlying joint probability space has the form (X^n, A^n, P^n) × (Z, D, Q) = (X^n × Z, A^n × D, P^n × Q), where we define the product σ-field A^n × D in the same manner as before. Now X1, . . . , Xn are equal to the coordinate projections on the first n coordinates, while Z is equal to the coordinate projection on the (n + 1)st coordinate.

We now present the symmetrization theorem. After its proof, we will discuss a few additional important measurability issues.

Theorem 8.8 (Symmetrization) For every nondecreasing, convex φ : R → R and class of measurable functions F,

E∗φ( (1/2)‖Pn − P‖F ) ≤ E∗φ( ‖P∘n‖F ) ≤ E∗φ( 2‖Pn − P‖F + |Rn| · ‖P‖F ),

where Rn ≡ P∘n 1 = n−1 ∑n i=1 ǫi and the outer expectations are computed based on the product σ-field described in the previous paragraph.

Before giving the proof of this theorem, we make a few observations. Firstly, the constants 1/2, 1 and 2 appearing in front of the three respective supremum norms in the chain of inequalities can all be replaced by c/2, c and 2c, respectively, for any positive constant c. This follows trivially since, for any positive c, x ↦ φ(cx) is nondecreasing and convex whenever x ↦ φ(x) is nondecreasing and convex. Secondly, we note that most of our applications of this theorem will be for the setting φ(x) = x. Thirdly, we note that the first inequality in the chain of inequalities will be of greatest use to us. However, the second inequality in the chain can be used to establish the following Glivenko-Cantelli result, the complete proof of which will be given later on, at the tail end of Section 8.3:

Proposition 8.9 For any class of measurable functions F, the following are equivalent:

(i) F is P-Glivenko-Cantelli and ‖P‖F < ∞.

(ii) ‖P∘n‖F as∗→ 0.

As mentioned previously, there is also a similar equivalence involving Donsker results, but we will postpone further discussion of this until we encounter multiplier central limit theorems in Chapter 10.

Proof of Theorem 8.8. Let Y1, . . . , Yn be independent copies of X1, . . . , Xn. Formally, Y1, . . . , Yn are the coordinate projections on the last n coordinates in the product space (Xn, An, Pn) × (Z, D, Q) × (Xn, An, Pn).


Here, (Z, D, Q) is the probability space for the n-vector of independent Rademacher random variables ǫ1, . . . , ǫn used in P∘n. Since, by Lemma 6.13, coordinate projections are perfect maps, the outer expectations in the theorem are unaffected by the enlarged product probability space. For fixed X1, . . . , Xn,

‖Pn − P‖F = supf∈F | n−1 ∑n i=1 [f(Xi) − Ef(Yi)] | ≤ E∗Y supf∈F | n−1 ∑n i=1 [f(Xi) − f(Yi)] |,

where E∗Y is the outer expectation with respect to Y1, . . . , Yn computed by treating the X1, . . . , Xn as constants and using the probability space (Xn, An, Pn). Applying Jensen's inequality, we obtain

φ( ‖Pn − P‖F ) ≤ EY φ( ( ‖n−1 ∑n i=1 [f(Xi) − f(Yi)]‖F )∗Y ),

where ∗Y denotes the minimal measurable majorant of the supremum with respect to Y1, . . . , Yn and holding X1, . . . , Xn fixed. Because φ is nondecreasing and continuous, the ∗Y inside of the φ in the foregoing expression can be removed after replacing EY with E∗Y, as a consequence of Lemma 6.8. Now take the expectation of both sides with respect to X1, . . . , Xn to obtain

E∗φ( ‖Pn − P‖F ) ≤ E∗X E∗Y φ( ‖n−1 ∑n i=1 [f(Xi) − f(Yi)]‖F ).

The repeated outer expectation can now be bounded above by the joint outer expectation E∗ by Lemma 6.14 (Fubini's theorem for outer expectations).

By the product space structure of the underlying probability space, the outer expectation of any function g(X1, . . . , Xn, Y1, . . . , Yn) remains unchanged under permutations of its 2n arguments. Since −[f(Xi) − f(Yi)] = [f(Yi) − f(Xi)], we have for any n-vector (e1, . . . , en) ∈ {−1, 1}n, that ‖n−1 ∑n i=1 ei[f(Xi) − f(Yi)]‖F is just a permutation of

h(X1, . . . , Xn, Y1, . . . , Yn) ≡ ‖n−1 ∑n i=1 [f(Xi) − f(Yi)]‖F.

Hence

E∗φ( ‖Pn − P‖F ) ≤ Eǫ E∗X,Y φ( ‖n−1 ∑n i=1 ǫi[f(Xi) − f(Yi)]‖F ).

Now the triangle inequality combined with the convexity of φ yields

E∗φ( ‖Pn − P‖F ) ≤ Eǫ E∗X,Y φ( 2‖P∘n‖F ).


By the perfectness of coordinate projections, E∗X,Y can be replaced by E∗X E∗Y. Now Eǫ E∗X E∗Y is bounded above by the joint expectation E∗ by reapplication of Lemma 6.14. This proves the first inequality.

For the second inequality, let Y1, . . . , Yn be independent copies of X1, . . . , Xn as before. Holding X1, . . . , Xn and ǫ1, . . . , ǫn fixed, we have

‖P∘n f‖F = ‖P∘n(f − Pf) + P∘n Pf‖F = ‖P∘n(f − Ef(Y)) + Rn Pf‖F ≤ E∗Y ‖n−1 ∑n i=1 ǫi[f(Xi) − f(Yi)]‖F + |Rn| · ‖P‖F.

Applying Jensen’s inequality, we now have

φ( ‖P∘n‖F ) ≤ E∗Y φ( ‖n−1 ∑n i=1 ǫi[f(Xi) − f(Yi)]‖F + |Rn| · ‖P‖F ).

Using the permutation argument we used for the first part of the proof, we can replace the ǫ1, . . . , ǫn in the summation with all 1's, and take expectations with respect to X1, . . . , Xn and ǫ1, . . . , ǫn (which are still present in Rn). This gives us

E∗φ( ‖P∘n‖F ) ≤ Eǫ E∗X E∗Y φ( ‖n−1 ∑n i=1 [f(Xi) − f(Yi)]‖F + |Rn| · ‖P‖F ).

After adding and subtracting Pf in the summation and applying the convexity of φ, we can bound the right-hand-side by

(1/2) Eǫ E∗X E∗Y φ( 2‖n−1 ∑n i=1 [f(Xi) − Pf]‖F + |Rn| · ‖P‖F ) + (1/2) Eǫ E∗X E∗Y φ( 2‖n−1 ∑n i=1 [f(Yi) − Pf]‖F + |Rn| · ‖P‖F ).

By reapplication of the permutation argument and Lemma 6.14, we obtain the desired upper bound.

The above symmetrization results will be most useful when the supremum ‖P∘n‖F is measurable and Fubini's theorem permits taking the expectation first with respect to ǫ1, . . . , ǫn given X1, . . . , Xn and secondly with respect to X1, . . . , Xn. Without this measurability, only the weaker version of Fubini's theorem for outer expectations applies (Lemma 6.14), and thus the desired reordering of expectations may not be valid. To overcome this difficulty, we will assume that the class F is a P-measurable class. A class F of measurable functions f : X → R, on the probability space (X, A, P), is P-measurable if (X1, . . . , Xn) ↦ ‖∑n i=1 ei f(Xi)‖F is measurable on the completion of (Xn, An, Pn) for every constant vector (e1, . . . , en) ∈ Rn. It is possible to weaken this condition, but at least some measurability assumptions will usually be needed. In the Donsker theorem for uniform entropy, it will be necessary to assume that several classes related to F are also P-measurable. These additional classes are Fδ ≡ {f − g : f, g ∈ F, ‖f − g‖P,2 < δ}, for all δ > 0, and F2∞ ≡ {(f − g)2 : f, g ∈ F} (recall that ‖f‖P,2 ≡ (Pf2)1/2).

Another assumption on F that is stronger than P-measurability and often easier to verify in statistical applications is pointwise measurability. A class F of measurable functions is pointwise measurable if there exists a countable subset G ⊂ F such that for every f ∈ F, there exists a sequence gm ∈ G with gm(x) → f(x) for every x ∈ X. Since, by Exercise 8.5.6 below, ‖∑ ei f(Xi)‖F = ‖∑ ei f(Xi)‖G for all (e1, . . . , en) ∈ Rn, pointwise measurable classes are P-measurable for all P. Consider, for example, the class F = {1{x ≤ t} : t ∈ R} where the sample space X = R. Let G = {1{x ≤ t} : t ∈ Q}, and fix the function x ↦ f(x) = 1{x ≤ t0} for some t0 ∈ R. Note that G is countable. Let tm be a sequence of rationals with tm ≥ t0, for all m ≥ 1, and with tm ↓ t0. Then x ↦ gm(x) = 1{x ≤ tm} satisfies gm ∈ G, for all m ≥ 1, and gm(x) → f(x) for all x ∈ R. Since t0 was arbitrary, we have just proven that F is pointwise measurable (and hence also P-measurable for all P). Hereafter, we will use the abbreviation PM as a shorthand for denoting pointwise measurable classes.
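The identity ‖∑ ei f(Xi)‖F = ‖∑ ei f(Xi)‖G behind this argument can be checked numerically for the indicator example just given; the data, signs, and the decimal-rational approximating thresholds below are illustrative assumptions only:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=15)
e = rng.choice([-1.0, 1.0], size=15)

def sup_over(thresholds):
    # ||sum_i e_i 1{x_i <= t}|| over a family of thresholds t
    return max(abs(np.sum(e * (x <= t))) for t in thresholds)

# Supremum over ALL real t: the map t -> sum_i e_i 1{x_i <= t} is a step
# function changing only at the data points, so evaluating there is exact.
sup_R = max(sup_over(x), 0.0)

# Supremum over a countable set of rationals: approximate each data point
# from above, mimicking the sequence tm decreasing to t0 in the text.
rationals = [np.ceil(xi * 1e9) / 1e9 for xi in x]
sup_Q = sup_over(rationals)

print(sup_R, sup_Q)
```

Both suprema agree, as Exercise 8.5.6 asserts for any pointwise measurable class.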

Another nice feature of PM classes is that they have a number of useful preservation features. An obvious example is that when F1 and F2 are PM classes, then so is F1 ∪ F2. The following lemma provides a number of additional preservation results:

Lemma 8.10 Let F1, . . . , Fk be PM classes of real functions on X, and let φ : Rk → R be continuous. Then the class φ(F1, . . . , Fk) is PM, where φ(F1, . . . , Fk) denotes the class {φ(f1, . . . , fk) : (f1, . . . , fk) ∈ F1 × · · · × Fk}.

Proof. Denote H ≡ φ(F1, . . . , Fk). Fix an arbitrary h = φ(f1, . . . , fk) ∈ H. By assumption, each Fj has a countable subset Gj ⊂ Fj such that there exists a sequence gjm ∈ Gj with gjm(x) → fj(x), as m → ∞, for all x ∈ X and j = 1, . . . , k. By continuity of φ, we thus have that φ(g1m(x), . . . , gkm(x)) → φ(f1(x), . . . , fk(x)) = h(x), as m → ∞, for all x ∈ X. Since the choice of h was arbitrary, we therefore have that the set φ(G1, . . . , Gk) is a countable subset of H making H pointwise measurable.

Lemma 8.10 automatically yields many other useful PM preservation results, including the following for PM classes F1 and F2:

• F1 ∧ F2 (all possible pairwise minimums) is PM.

• F1 ∨ F2 (all possible pairwise maximums) is PM.

• F1 + F2 is PM.

• F1 · F2 ≡ {f1 f2 : f1 ∈ F1, f2 ∈ F2} is PM.


We will use these properties of PM classes to establish Donsker properties for some specific statistical examples later on in the case studies presented in Chapter 15. The following proposition shows an additional property of PM classes that potentially simplifies the measurability requirements of the Donsker theorem for uniform entropy, Theorem 8.19, given in Section 8.4 below:

Proposition 8.11 Let F be a class of measurable functions f : X → R on the probability space (X, A, P). If F is PM with envelope F such that P∗F2 < ∞, then Fδ and F2∞ are PM for all 0 < δ ≤ ∞.

Proof. The fact that both F∞ and F2∞ are PM follows easily from Lemma 8.10. Assume, without loss of generality, that the envelope F is measurable (if not, simply replace F with F∗). Next, let H ⊂ F∞ be a countable subset for which there exists for each g ∈ F∞ a sequence hm ∈ H such that hm(x) → g(x) for all x ∈ X. Fix δ > 0 and h ∈ Fδ. Then there exists an ǫ > 0 such that Ph2 = δ2 − ǫ. Let gm ∈ H be a sequence for which gm(x) → h(x) for all x ∈ X, and assume that Pg2m ≥ δ2 infinitely often. Then there exists another sequence g̃m ∈ H such that P g̃2m ≥ δ2 for all m ≥ 1 and also g̃m(x) → h(x) for all x ∈ X. Since |g̃m| ≤ F, for all m ≥ 1, we have by the dominated convergence theorem that δ2 ≤ lim infm→∞ P g̃2m = Ph2 = δ2 − ǫ, which is impossible. Hence, returning to the original sequence gm, ‖gm‖P,2 cannot be ≥ δ infinitely often. Thus there exists a sequence gm ∈ Hδ ≡ {g ∈ H : ‖g‖P,2 < δ} such that gm(x) → h(x) for all x ∈ X. Thus Fδ is PM since h was arbitrary and Hδ does not depend on h. Since δ was also arbitrary, the proof is complete.

We next consider establishing P-measurability for the class

{1{Y − βT Z ≤ t} : β ∈ Rk, t ∈ R},

where X ≡ (Y, Z) ∈ X ≡ R × Rk has distribution P, for arbitrary P. This class was considered in the linear regression example of Section 4.1. The desired measurability result is stated in the following lemma:

Lemma 8.12 Let F ≡ {1{Y − βT Z ≤ t} : β ∈ Rk, t ∈ R}. Then the classes F, Fδ ≡ {f − g : f, g ∈ F, ‖f − g‖P,2 < δ}, and F2∞ ≡ {(f − g)2 : f, g ∈ F} are all P-measurable for any probability measure on X.

Proof. We first assume that ‖Z‖ ≤ M for some fixed M < ∞. Hence the sample space is XM ≡ {(y, z) : (y, z) ∈ X, ‖z‖ ≤ M}. Consider the countable set G = {1{Y − βT Z ≤ t} : β ∈ Qk, t ∈ Q}, where Q are the rationals. Fix β ∈ Rk and t ∈ R, and construct a sequence (βm, tm) as follows: for each m ≥ 1, pick βm ∈ Qk so that ‖βm − β‖ < 1/(2mM) and pick tm ∈ Q so that tm ∈ (t + 1/(2m), t + 1/m]. Now, for any (y, z) with y ∈ R and z ∈ Rk with ‖z‖ ≤ M, we have that 1{y − βTm z ≤ tm} = 1{y − βT z ≤ tm + (βm − β)T z}. Since |(βm − β)T z| < 1/(2m) by design, we have that rm ≡ tm + (βm − β)T z − t > 0 for all m and that rm → 0 as m → ∞. Since the function t ↦ 1{u ≤ t} is right-continuous and since (y, z) was arbitrary, we have just proven that 1{y − βTm z ≤ tm} → 1{y − βT z ≤ t} for all (y, z) ∈ XM. Thus F is pointwise measurable with respect to the countable subset G.

We can also verify that Fδ and F2∞ are likewise PM classes, under the constraint that the random variable Z satisfies ‖Z‖ ≤ M. To see this for Fδ, let f1, f2 ∈ F satisfy ‖f1 − f2‖P,2 < δ and let gm,1, gm,2 ∈ G be such that gm,1 → f1 and gm,2 → f2 pointwise in XM. Then, by dominated convergence, ‖gm,1 − gm,2‖P,2 → ‖f1 − f2‖P,2, and thus ‖gm,1 − gm,2‖P,2 < δ for all m large enough. Hence

gm,1 − gm,2 ∈ Gδ ≡ {f − g : f, g ∈ G, ‖f − g‖P,2 < δ}

for all m large enough, and thus Gδ is a countable subset of Fδ making Fδ into a PM class. The proof that F2∞ is also PM follows directly from Lemma 8.10.

Now let JM(x1, . . . , xn) = 1{max1≤i≤n ‖zi‖ ≤ M}, where xi = (yi, zi), 1 ≤ i ≤ n. Since M was arbitrary, the previous two paragraphs have established that

(x1, . . . , xn) ↦ ‖∑n i=1 ei f(xi)‖H JM(x1, . . . , xn)    (8.7)

is measurable for every n-tuple (e1, . . . , en) ∈ Rn, every M < ∞, and with H being replaced by F, Fδ or F2∞. Now, for any (x1, . . . , xn) ∈ Xn, JM(x1, . . . , xn) = 1 for all M large enough. Thus the map (8.7) is also measurable after replacing JM with its pointwise limit 1 = limM→∞ JM. Hence F, Fδ and F2∞ are all P-measurable classes for any measure P on X.

Another example of a P-measurable class occurs when F is a Suslin topological space (for an arbitrary topology O), and the map (x, f) ↦ f(x) is jointly measurable on X × F for the product σ-field of A and the Borel σ-field generated from O. Further insights and results on this measurable Suslin condition can be found in Example 2.3.5 and Chapter 1.7 of VW. While this approach to establishing measurability can be useful in some settings, a genuine need for it does not often occur in statistical applications, and we will not pursue it further here.

8.3 Glivenko-Cantelli Results

We now present several Glivenko-Cantelli (G-C) results. First, we discuss an interesting necessary condition for a class F to be P-G-C. Next, we present the proofs of G-C theorems for bracketing (Theorem 2.2 on Page 16) and uniform (Theorem 2.4 on Page 18) entropy. Part of the proof in the uniform entropy case will include the presentation of a new G-C theorem, Theorem 8.15 below. Finally, we give the proof of Proposition 8.9 which was promised in the previous section.

The following lemma shows that the existence of an integrable envelope of the centered functions of a class F is a necessary condition for F to be P-G-C:

Lemma 8.13 If the class of functions F is strong P-G-C, then P‖f − Pf‖∗F < ∞. If in addition ‖P‖F < ∞, then also P‖f‖∗F < ∞.

Proof. Since f(Xn) − Pf = n(Pn − P)f − (n − 1)(Pn−1 − P)f, we have n−1‖f(Xn) − Pf‖F ≤ ‖Pn − P‖F + (1 − n−1)‖Pn−1 − P‖F. Since F is strong P-G-C, we now have that P(‖f(Xn) − Pf‖∗F ≥ n infinitely often) = 0. The Borel-Cantelli lemma now yields that ∑∞ n=1 P(‖f(Xn) − Pf‖∗F ≥ n) < ∞. Since the Xn are i.i.d., the f(Xn) in the summands can be replaced with f(X1) for all n ≥ 1. Now we have

P∗‖f − Pf‖F ≤ ∫0∞ P(‖f(X1) − Pf‖∗F > x) dx ≤ 1 + ∑∞ n=1 P(‖f(X1) − Pf‖∗F ≥ n) < ∞.
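The tail-sum bound used in the last display, EX ≤ 1 + ∑n≥1 P(X ≥ n) for any nonnegative X, is easy to check by simulation; the exponential distribution and truncation point below are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.exponential(scale=1.5, size=200_000)  # any nonnegative variable works

mean_x = x.mean()
# 1 + sum_{n>=1} P(X >= n); terms beyond n = 60 are numerically negligible here
tail_sum = 1.0 + sum((x >= k).mean() for k in range(1, 60))
print(mean_x, tail_sum)
```

The sample mean stays below the tail sum, as the integral comparison guarantees.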

Proof of Theorem 2.2 (see Page 16). Fix ǫ > 0. Since the L1-bracketing entropy is bounded, it is possible to choose finitely many ǫ-brackets [li, ui] so that their union contains F and P(ui − li) < ǫ for every i. Now, for every f ∈ F, there is a bracket [li, ui] containing f with (Pn − P)f ≤ (Pn − P)ui + P(ui − f) ≤ (Pn − P)ui + ǫ. Hence

supf∈F (Pn − P)f ≤ maxi (Pn − P)ui + ǫ as∗→ ǫ.

Similar arguments can be used to verify that

inff∈F (Pn − P)f ≥ mini (Pn − P)li − ǫ as∗→ −ǫ.

The desired result now follows since ǫ was arbitrary.

To prove Theorem 2.4 on Page 18, we first restate the theorem to clarify the meaning of "appropriately measurable" in the original statement of the theorem, and then prove a more general version (Theorem 8.15 below):

Theorem 8.14 (Restated Theorem 2.4 from Page 18) Let F be a P-measurable class of measurable functions with envelope F and

supQ N(ǫ‖F‖Q,1, F, L1(Q)) < ∞,

for every ǫ > 0, where the supremum is taken over all finitely discrete probability measures Q with ‖F‖Q,1 > 0. If P∗F < ∞, then F is P-G-C.

Proof. The result is trivial if P∗F = 0. Hence we will assume without loss of generality that P∗F > 0. Thus there exists an η > 0 such that, with probability 1, PnF > η for all n large enough. Fix ǫ > 0. By assumption, there is a K < ∞ such that 1{PnF > 0} log N(ǫPnF, F, L1(Pn)) ≤ K almost surely, since Pn is a finitely discrete probability measure. Hence, with probability 1, log N(ǫη, F, L1(Pn)) ≤ K for all n large enough. Since ǫ was arbitrary, we now have that log N(ǫ, F, L1(Pn)) = O∗P(1) for all ǫ > 0. Now fix ǫ > 0 (again) and M < ∞, and define FM ≡ {f 1{F ≤ M} : f ∈ F}. Since ‖(f − g)1{F ≤ M}‖1,Pn ≤ ‖f − g‖1,Pn for any f, g ∈ F, N(ǫ, FM, L1(Pn)) ≤ N(ǫ, F, L1(Pn)). Hence log N(ǫ, FM, L1(Pn)) = O∗P(1). Finally, since ǫ and M are both arbitrary, the desired result follows from Theorem 8.15 below.
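For intuition about the uniform entropy hypothesis, the L1(Pn) covering number can be counted directly for the half-line indicator class F = {1{x ≤ t} : t ∈ R} with envelope F ≡ 1 (so ‖F‖Q,1 = 1): the L1(Pn) distance between two indicators is the fraction of sample points between their thresholds. The count below is an illustrative construction, not from the text:

```python
import numpy as np

def l1_covering_number(n, eps):
    # A ball of L1(Pn)-radius eps centred at 1{x <= c} covers every threshold
    # with at most eps*n sample points between it and c, so centres spaced
    # 2*eps*n order statistics apart suffice.
    step = max(int(np.floor(2 * eps * n)), 1)
    return int(np.ceil(n / step)) + 1  # extra ball for t below the smallest point

results = {(n, eps): l1_covering_number(n, eps)
           for n in (100, 1000, 10_000) for eps in (0.1, 0.05)}
print(results)
```

The counts stay bounded (roughly 1/(2 eps) + 1) no matter how large n is, which is exactly the uniform-in-Q finiteness the theorem requires.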

Theorem 8.15 Let F be a P-measurable class of measurable functions with envelope F such that P∗F < ∞. Let FM be as defined in the above proof. If log N(ǫ, FM, L1(Pn)) = o∗P(n) for every ǫ > 0 and M < ∞, then P‖Pn − P‖∗F → 0 and F is strong P-G-C.

Before giving the proof of Theorem 8.15, we give the following lemma which will be needed. This is Lemma 2.4.5 of VW, and we omit the proof:

Lemma 8.16 Let F be a class of measurable functions f : X → R with integrable envelope. Define a filtration by letting Σn be the σ-field generated by all measurable functions h : X∞ → R that are permutation-symmetric in their first n arguments. Then E(‖Pn − P‖∗F | Σn+1) ≥ ‖Pn+1 − P‖∗F, almost surely. Furthermore, there exist versions of the measurable cover functions ‖Pn − P‖∗F that are adapted to the filtration. Any such versions form a reverse submartingale and converge almost surely to a constant.
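The reverse submartingale property implies in particular that n ↦ E‖Pn − P‖∗F is nonincreasing. A quick simulation with the empirical distribution function class (an illustrative choice, with P = Uniform(0,1) so the supremum is the exact Kolmogorov-Smirnov statistic) displays this monotone decay:

```python
import numpy as np

rng = np.random.default_rng(4)

def ks_stat(x):
    # ||Pn - P||_F for F = {1{x <= t}} and P = Uniform(0,1)
    xs = np.sort(x)
    i = np.arange(1, len(xs) + 1)
    return max(np.max(i / len(xs) - xs), np.max(xs - (i - 1) / len(xs)))

means = [np.mean([ks_stat(rng.uniform(size=n)) for _ in range(500)])
         for n in (10, 40, 160, 640)]
print(means)
```

The averages decrease toward zero as n grows through 10, 40, 160, 640.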

Proof of Theorem 8.15. By the symmetrization Theorem 8.8, P-measurability of F, and by Fubini's theorem, we have for all M > 0 that

E∗‖Pn − P‖F ≤ 2 EX Eǫ ‖n−1 ∑n i=1 ǫi f(Xi)‖F
≤ 2 EX Eǫ ‖n−1 ∑n i=1 ǫi f(Xi) 1{F ≤ M}‖F + 2 EX Eǫ ‖n−1 ∑n i=1 ǫi f(Xi) 1{F > M}‖F
≤ 2 EX Eǫ ‖n−1 ∑n i=1 ǫi f(Xi)‖FM + 2 P∗[F 1{F > M}].


The last term can be made arbitrarily small by making M large enough. Thus, for convergence in mean, it is enough to show that the first term goes to zero for each M. Accordingly, fix M < ∞. Fix also X1, . . . , Xn, and let G be a finite δ-mesh in L1(Pn) over FM. Thus

‖n−1 ∑n i=1 ǫi f(Xi)‖FM ≤ ‖n−1 ∑n i=1 ǫi f(Xi)‖G + δ.    (8.8)

By definition of the entropy number, the cardinality of G can be chosen equal to N(δ, FM, L1(Pn)). Now, we can bound the L1-norm on the right-hand-side of (8.8) by the Orlicz-norm for ψ2(x) = exp(x2) − 1, and apply the maximal inequality Lemma 8.2 to find that the left-hand-side of (8.8) is bounded by a universal constant times

√(1 + log N(δ, FM, L1(Pn))) supf∈FM ‖n−1 ∑n i=1 ǫi f(Xi)‖ψ2|X + δ,

where the Orlicz norms ‖ · ‖ψ2|X are taken over ǫ1, . . . , ǫn with X1, . . . , Xn still fixed. From Exercise 8.5.7 below, we have—by Hoeffding's inequality (Lemma 8.7) combined with Lemma 8.1—that the Orlicz norms are all bounded by √(6/n) (Pnf2)1/2, which is bounded by √(6/n) M. The last displayed expression is thus bounded by

√( 6[1 + log N(δ, FM, L1(Pn))]/n ) M + δ P→ δ.

Thus the left-hand-side of (8.8) goes to zero in probability. Since it is also bounded by M, the bounded convergence theorem implies that its expectation also goes to zero. Since M was arbitrary, we now have that E∗‖Pn − P‖F → 0. This now implies that F is weak P-G-C.

From Lemma 8.16, we know that there exists a version of ‖Pn − P‖∗F that converges almost surely to a constant. Since we already know that ‖Pn − P‖∗F P→ 0, this constant must be zero. The desired result now follows.

Proof of Proposition 8.9. Assume (i). By the second inequality of the symmetrization theorem (Theorem 8.8), ‖P∘n‖F P→ 0. This convergence can be strengthened to outer almost surely, since ‖P∘n‖F forms a reverse submartingale as in the previous proof. Thus (ii) follows. Now assume (ii). By Lemma 8.13 and the fact that P[ǫf(X)] = 0 for a Rademacher ǫ independent of X, we obtain that P‖f‖∗F = P‖ǫf(X)‖∗F < ∞. Now, the fact that F is weak P-G-C follows from the first inequality in the symmetrization theorem. The convergence can be strengthened to outer almost sure by the reverse martingale argument used previously. Thus (i) follows.


8.4 Donsker Results

We now present several Donsker results. We begin with several interesting necessary and sufficient conditions for a class to be P-Donsker. We next present the proofs of Donsker theorems for bracketing (Theorem 2.3 on Page 17) and uniform (Theorem 2.5 on Page 18) entropy. Before proceeding, let Fδ ≡ {f − g : f, g ∈ F, ‖f − g‖P,2 < δ}, for any 0 < δ ≤ ∞.

The following lemma outlines several properties of Donsker classes and shows that Donsker classes are automatically strong Glivenko-Cantelli classes:

Lemma 8.17 Let F be a class of measurable functions, with envelope F ≡ ‖f‖F. For any f, g ∈ F, define ρ(f, g) ≡ {P(f − Pf − g + Pg)2}1/2; and, for any δ > 0, let Fδ ≡ {f − g : f, g ∈ F, ρ(f, g) < δ}. Then the following are equivalent:

(i) F is P-Donsker;

(ii) (F, ρ) is totally bounded and ‖Gn‖Fδn P→ 0 for every δn ↓ 0;

(iii) (F, ρ) is totally bounded and E∗‖Gn‖Fδn → 0 for every δn ↓ 0.

These conditions imply that E∗‖Gn‖rF → E‖G‖rF < ∞, for every 0 < r < 2; that P(‖f − Pf‖∗F > x) = o(x−2) as x → ∞; and that F is strong P-G-C. If in addition ‖P‖F < ∞, then also P(F∗ > x) = o(x−2) as x → ∞.

Proof. The equivalence of (i)–(iii) and the first assertion is Lemma 2.3.11 of VW, and we omit the equivalence part of the proof. Now assume Conditions (i)–(iii) hold. Lemma 2.3.9 of VW states that if F is Donsker, then

limx→∞ x2 supn≥1 P∗( ‖Gn‖F > x ) = 0.    (8.9)

This immediately implies that the rth moment of ‖Gn‖F is uniformly bounded in n and that E‖G‖rF < ∞ for all 0 < r < 2. Thus the first assertion follows and, therefore, F is weak P-G-C. Lemma 8.16 now implies F is strong G-C. Letting n = 1 in (8.9) yields that P(‖f − Pf‖∗F > x) = o(x−2) as x → ∞, and the remaining assertions follow.

Proof of Theorem 2.3 (Donsker with bracketing entropy, on Page 17). With a given set of ǫ/2-brackets [li, ui] covering F, we can construct a set of ǫ-brackets covering F∞ by taking differences [li − uj, ui − lj] of upper and lower bounds, i.e., if f ∈ [li, ui] and g ∈ [lj, uj], then f − g ∈ [li − uj, ui − lj]. Thus N[](ǫ, F∞, L2(P)) ≤ N2[](ǫ/2, F, L2(P)). From Exercise 8.5.8 below, we now have that J[](δ, F∞, L2(P)) ≤ √8 J[](δ, F, L2(P)).

Now, for a given, small δ > 0, select a minimal number of δ-brackets that cover F, and use them to construct a finite partition F = ∪m i=1 Fi consisting of sets of L2(P)-diameter δ. For any i ∈ {1, . . . , m}, the subset of F∞ consisting of all f − g with f, g ∈ Fi consists of functions with L2(P) norm strictly smaller than δ. Hence by Lemma 8.18 below, there exists a number a(δ) > 0 satisfying

E∗[ sup1≤i≤m supf,g∈Fi |Gn(f − g)| ] ≲ J[](δ, F, L2(P)) + √n P∗[G 1{G > a(δ)√n}],    (8.10)

where G is an envelope for F∞ and the relation ≲ means that the left-hand-side is bounded above by a universal positive constant times the right-hand-side. In this setting, "universal" means that the constant does not depend on n or δ. If [l, u] is a minimal bracket for covering all of F, then G can be taken to be u − l. Boundedness of the entropy integral implies that there exists some k < ∞ so that only one L2(P) bracket of size k is needed to cover F. This implies PG2 < ∞.

By Exercise 8.5.9 below, the second term on the right-hand-side of (8.10) is bounded above by [a(δ)]−1 P∗[G2 1{G > a(δ)√n}] and thus goes to zero as n → ∞. Since J[](δ, F, L2(P)) → 0, as δ ↓ 0, we now have that the limδ↓0 lim supn→∞ of the left-hand-side of (8.10) is zero. In view of Markov's inequality for outer probability (which follows from Chebyshev's inequality for outer probability as given in Lemma 6.10), Condition (ii) in Lemma 7.20 is satisfied for the stochastic process Xn(f) = Gn(f) with index set T = F. Now, Theorem 2.1 yields the desired result.

Lemma 8.18 For any class F of measurable functions f : X → R with Pf2 < δ2 for every f, we have, with

a(δ) ≡ δ/√(1 ∨ log N[](δ, F, L2(P))),

and F an envelope function for F, that

E∗‖Gn‖F ≲ J[](δ, F, L2(P)) + √n P∗[F 1{F > √n a(δ)}].

This is Lemma 19.34 of van der Vaart (1998), who gives a nice proof which utilizes the maximal inequality result given in Lemma 8.3 above. The arguments are lengthy, and we omit the proof.

To prove Theorem 2.5 (Donsker with uniform entropy, from Page 18), we first restate the theorem with a clarification of the measurability assumption, as done in the previous section for Theorem 2.4:

Theorem 8.19 (Restated Theorem 2.5 from Page 18) Let F be a class of measurable functions with envelope F and J(1, F, L2) < ∞. Let the classes Fδ and F2∞ ≡ {h2 : h ∈ F∞} be P-measurable for every δ > 0. If P∗F2 < ∞, then F is P-Donsker.


We note here that by Proposition 8.11, if F is PM, then so are Fδ and F2∞, for all δ > 0, provided F has envelope F such that P∗F2 < ∞. Since PM implies P-measurability, all measurability requirements for Theorem 8.19 are thus satisfied whenever F is PM.

Proof of Theorem 8.19. Let the positive, decreasing sequence δn ↓ 0 be arbitrary. By Markov's inequality for outer probability (see Lemma 6.10) and the symmetrization Theorem 8.8,

P∗( ‖Gn‖Fδn > x ) ≤ (2/x) E∗ ‖n−1/2 ∑n i=1 ǫi f(Xi)‖Fδn,

for i.i.d. Rademachers ǫ1, . . . , ǫn independent of X1, . . . , Xn. By the P-measurability assumption for Fδ, for all δ > 0, the standard version of Fubini's theorem applies, and the outer expectation is just a standard expectation and can be calculated in the order EX Eǫ. Accordingly, fix X1, . . . , Xn. By Hoeffding's inequality (Lemma 8.7), the stochastic process f ↦ n−1/2 ∑n i=1 ǫi f(Xi) is sub-Gaussian for the L2(Pn)-seminorm

‖f‖n ≡ ( n−1 ∑n i=1 f2(Xi) )1/2.
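Sub-Gaussianity here means the conditional tail bound P( |n−1/2 ∑ ǫi f(Xi)| > x ) ≤ 2 exp( −x2/(2‖f‖2n) ) holds given the data. A simulation with one fixed, arbitrarily chosen vector playing the role of (f(X1), . . . , f(Xn)) (an illustration, not part of the proof) confirms the bound:

```python
import numpy as np

rng = np.random.default_rng(5)
n, reps = 50, 100_000
a = rng.uniform(-1, 1, size=n)    # stands in for (f(X1), ..., f(Xn)), held fixed
norm2 = np.sum(a ** 2) / n        # ||f||_n^2, the squared L2(Pn)-seminorm

eps = rng.choice([-1.0, 1.0], size=(reps, n))
z = eps @ a / np.sqrt(n)          # draws of n^{-1/2} sum_i eps_i f(X_i)

checks = []
for x in (1.0, 1.5, 2.0):
    emp = np.mean(np.abs(z) > x)
    bound = 2 * np.exp(-x ** 2 / (2 * norm2))
    checks.append((x, emp, bound))
print(checks)
```

At every threshold the empirical tail frequency sits below the Hoeffding sub-Gaussian bound.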

This stochastic process is also separable since, for any measure Q and ǫ > 0, N(ǫ, Fδn, L2(Q)) ≤ N(ǫ, F∞, L2(Q)) ≤ N2(ǫ/2, F, L2(Q)), and the latter is finite for any finitely discrete probability measure Q and any ǫ > 0. Thus the second conclusion of Corollary 8.5 holds with

Eǫ ‖n−1/2 ∑n i=1 ǫi f(Xi)‖Fδn ≲ ∫0∞ √(log N(ǫ, Fδn, L2(Pn))) dǫ.    (8.11)

Note that we can omit the term E|n−1/2 ∑n i=1 ǫi f0(Xi)| from the conclusion of the corollary because it is also bounded by the right-hand-side of (8.11).

For sufficiently large ǫ, the set Fδn fits in a single ball of L2(Pn)-radius ǫ around the origin, in which case the integrand on the right-hand-side of (8.11) is zero. This will definitely happen when ǫ is larger than θn, where

θ2n ≡ supf∈Fδn ‖f‖2n = ‖ n−1 ∑n i=1 f2(Xi) ‖Fδn.

Thus the right-hand-side of (8.11) is bounded by


∫0θn √(log N(ǫ, Fδn, L2(Pn))) dǫ ≲ ∫0θn √(log N2(ǫ/2, F, L2(Pn))) dǫ
≲ ‖F‖n ∫0θn/(2‖F‖n) √(log N(ǫ‖F‖n, F, L2(Pn))) dǫ
≲ ‖F‖n J(θn, F, L2).

The second inequality follows from the change of variables u = ǫ/(2‖F‖n) (and then renaming u to ǫ). For the third inequality, note that we can add 1/2 to the envelope function F without changing the existence of its second moment. Hence ‖F‖n ≥ 1/2 without loss of generality, and thus θn/(2‖F‖n) ≤ θn. Because ‖F‖n = Op(1), we can now conclude that the left-hand-side of (8.11) goes to zero in probability, provided we can verify that θn P→ 0. This would then imply asymptotic L2(P)-equicontinuity in probability.

Since ‖Pf2‖Fδn → 0 and Fδn ⊂ F∞, establishing that ‖Pn − P‖F2∞ P→ 0 would prove that θn P→ 0. The class F2∞ has integrable envelope (2F)2 and is P-measurable by assumption. Since also, for any f, g ∈ F∞, Pn|f2 − g2| ≤ Pn(|f − g| 4F) ≤ ‖f − g‖n ‖4F‖n, we have that the covering number N(ǫ‖2F‖2n, F2∞, L1(Pn)) is bounded by N(ǫ‖F‖n, F∞, L2(Pn)). Since this last covering number is bounded by supQ N2(ǫ‖F‖Q,2/2, F, L2(Q)) < ∞, where the supremum is taken over all finitely discrete probability measures with ‖F‖Q,2 > 0, we have by Theorem 8.14 that F2∞ is P-Glivenko-Cantelli. Thus θn P→ 0. This completes the proof of asymptotic equicontinuity.

The last thing we need to prove is that F is totally bounded in L2(P). By the result of the last paragraph, there exists a sequence of discrete probability measures Pn with ‖(Pn − P)f2‖F∞ → 0. Fix ǫ > 0 and take n large enough so that ‖(Pn − P)f2‖F∞ < ǫ2. Note that N(ǫ, F, L2(Pn)) is finite by assumption, and, for any f, g ∈ F with ‖f − g‖Pn,2 < ǫ, P(f − g)2 ≤ Pn(f − g)2 + |(Pn − P)(f − g)2| ≤ 2ǫ2. Thus any ǫ-net in L2(Pn) is also a √2 ǫ-net in L2(P). Hence F is totally bounded in L2(P) since ǫ was arbitrary.
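The final step—an ǫ-net in L2(Pn) being a √2 ǫ-net in L2(P)—can also be checked numerically for the half-line indicators with P = Uniform(0,1), where the squared L2(P) distance between thresholds s and t is |s − t| in closed form; the sample size, grid, and net construction below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(6)
n, eps = 200_000, 0.05
x = np.sort(rng.uniform(size=n))

# eps-net in L2(Pn) for {1{x <= t}}: the squared L2(Pn) distance between two
# indicators is the fraction of sample points between their thresholds, so
# thresholds at empirical-mass steps of eps^2 form an eps-net.
step = int(np.floor(eps ** 2 * n))
net = np.append(x[::step], x[-1])

# Squared L2(P) distance is |s - t| exactly; measure how far a fine grid of
# thresholds can be from the net in L2(P).
grid = np.linspace(0.0, 1.0, 2001)
worst = float(np.sqrt(np.max(np.min(np.abs(grid[:, None] - net[None, :]), axis=1))))
print(worst, np.sqrt(2) * eps)
```

In this run the worst-case L2(P) distance to the net stays below √2 ǫ, as the argument above guarantees up to sampling error.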

8.5 Exercises

8.5.1. For any ψ valid for defining an Orlicz norm ‖ · ‖ψ, show that the space Hψ of real random variables X with ‖X‖ψ < ∞ defines a Banach space, where we equate a random variable X with zero if X = 0 almost surely:


(a) Show first that ‖ · ‖ψ defines a norm on Hψ. Hint: Use the convexity of ψ to establish that for any X, Y ∈ Hψ and any c1, c2 > 0,

Eψ( |X + Y|/(c1 + c2) ) ≤ (c1/(c1 + c2)) Eψ( |X|/c1 ) + (c2/(c1 + c2)) Eψ( |Y|/c2 ).

(b) Now show that Hψ is complete. Hint: Show that for any Cauchy sequence of random variables Xn ∈ Hψ, lim supn→∞ ‖Xn‖ψ < ∞. Use Prohorov's theorem to show that every such Cauchy sequence converges to an almost surely unique element of Hψ.

8.5.2. Show that 1 ∧ (eu − 1)−1 ≤ 2e−u, for any u > 0.

8.5.3. For a function ψ meeting the conditions of Lemma 8.2, show that there exist constants 0 < σ ≤ 1 and τ > 0 such that φ(x) ≡ σψ(τx) satisfies φ(x)φ(y) ≤ φ(cxy) for all x, y ≥ 1 and φ(1) ≤ 1/2. Show that this φ also satisfies the following:

(a) For all u > 0, φ−1(u) ≤ ψ−1(u)/(στ).

(b) For any random variable X , ‖X‖ψ ≤ ‖X‖φ/(στ) ≤ ‖X‖ψ/σ.

8.5.4. Show that for any p ∈ [1,∞), ψp satisfies the conditions of Lemma 8.2 with c = 1.

8.5.5. Let ψ satisfy the conditions of Lemma 8.2. Show that for any sequence of random variables Xn, ‖Xn‖ψ → 0 implies Xn →P 0. Hint: Show that lim infx→∞ ψ(x)/x > 0, and hence ‖Xn‖ψ → 0 implies E|Xn| → 0.

8.5.6. Show that if the class of functions F has a countable subset G ⊂ F such that for each f ∈ F there exists a sequence gm ∈ G with gm(x) → f(x) for every x ∈ X, then ‖∑ ei f(Xi)‖F = ‖∑ ei f(Xi)‖G for all (e1, . . . , en) ∈ R^n.

8.5.7. In the context of the proof of Theorem 8.15, use Hoeffding's inequality (Lemma 8.7) combined with Lemma 8.1 to show that

‖ n^{−1} ∑_{i=1}^n ǫi f(Xi) ‖_{ψ2|X} ≤ √(6/n) (Pn f²)^{1/2},

where the meaning of the subscript ψ2|X is given in the body of the proof.

8.5.8. In the context of the proof of Theorem 2.3 above, show that by taking logarithms followed by square roots,

J[](δ, F∞, L2(P)) ≤ √8 · J[](δ, F, L2(P)).

8.5.9. Show, for any map X : Ω → R and constants α ∈ [0,∞) and m ∈ (0,∞), that

E*[ |X|^α 1{|X| > m} ] ≤ m^{−1} E*[ |X|^{α+1} 1{|X| > m} ].


8.6 Notes

Lemma 8.2, Corollary 8.5, and Theorems 8.15 and 8.19 correspond to Lemma 2.2.1, Corollary 2.2.8, and Theorems 2.4.3 and 2.5.2 of VW, respectively. The first inequality in Theorem 8.8 corresponds to Lemma 2.3.1 of VW. Lemma 8.3 is a modification of Lemma 2.2.10 of VW, and Theorem 8.4 is a modification and combination of both Theorem 2.2.4 and Corollary 2.2.5 of VW. The version of Hoeffding's inequality we use (Lemma 8.7) is Lemma 2.2.7 of VW, and Lemma 8.13 was inspired by Exercise 2.4.1 of VW. The proof of Theorem 2.3 follows closely van der Vaart's (1998) proof of his Theorem 19.5.


9 Entropy Calculations

The focus of this chapter is on computing entropy for empirical processes. An important use of such entropy calculations is in evaluating whether a class of functions F is Glivenko-Cantelli and/or Donsker or neither. Several additional uses of entropy bounds will be discussed in Chapter 11. Some of these uses will be very helpful in Chapter 14 for establishing rates of convergence for M-estimators. An additional use of entropy bounds, one which will receive only limited discussion in this book, is in precisely assessing the asymptotic smoothness of certain empirical processes. Such smoothness results play a role in the asymptotic analysis of a number of statistical applications, including confidence bands for kernel density estimation (e.g., Bickel and Rosenblatt, 1973) and certain hypothesis tests for multimodality (Polonik, 1995).

We begin the chapter by describing methods to evaluate uniform entropy. Provided the uniform entropy for a class F is not too large, F might be G-C or Donsker, as long as sufficient measurability holds. Since many of the techniques we will describe for building bounded uniform entropy integral (BUEI) classes (which include VC classes) closely parallel the methods for building pointwise measurable (PM) classes described in the previous chapter, we will include a discussion on joint BUEI and PM preservation towards the end of Section 9.1.2. We then present methods based on bracketing entropy. Several important results for building G-C classes from other G-C classes (G-C preservation) are presented next. Finally, we discuss several useful Donsker preservation results.

One can think of this chapter as a handbag of tools for establishing weak convergence properties of empirical processes. Illustrations of how to


use these tools will be given in various applications scattered throughout later chapters. To help anchor the context for these tools in practice, it might be worthwhile rereading the counting process regression example of Section 4.2.1, in the first case studies chapter of this book. In the second case studies chapter of this book (Chapter 15), we will provide additional—and more complicated—illustrations of these tools, with special emphasis on Donsker preservation techniques.

9.1 Uniform Entropy

We first discuss the very powerful concept of VC-classes of sets and functions. Such classes are extremely important tools in assessing and controlling bounded uniform entropy. We then discuss several useful and powerful preservation results for bounded uniform entropy integral (BUEI) classes.

9.1.1 VC-Classes

In this section, we introduce Vapnik-Červonenkis (VC) classes of sets, VC-classes of functions, and several related function classes. We then present several examples of VC-classes.

Consider an arbitrary collection {x1, . . . , xn} of points in a set X and a collection C of subsets of X. We say that C picks out a certain subset A of {x1, . . . , xn} if A = C ∩ {x1, . . . , xn} for some C ∈ C. We say that C shatters {x1, . . . , xn} if all of the 2^n possible subsets of {x1, . . . , xn} are picked out by the sets in C. The VC-index V(C) of the class C is the smallest n for which no set of size n, {x1, . . . , xn} ⊂ X, is shattered by C. If C shatters all sets {x1, . . . , xn} for all n ≥ 1, we set V(C) = ∞. Clearly, the more refined C is, the higher the VC-index. We say that C is a VC-class if V(C) < ∞.

For example, let X = R and define the collection of sets C = {(−∞, c] : c ∈ R}. Consider any two-point set {x1, x2} ⊂ R and assume, without loss of generality, that x1 < x2. It is easy to verify that C can pick out the null set and the sets {x1} and {x1, x2} but cannot pick out {x2}. Thus V(C) = 2 and C is a VC-class. As another example, let C = {(a, b] : −∞ ≤ a < b ≤ ∞}. This collection can shatter any two-point set, but consider what happens with a three-point set {x1, x2, x3}. Without loss of generality, assume x1 < x2 < x3, and note that the set {x1, x3} cannot be picked out by C. Thus V(C) = 3 in this instance.
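Both of these computations can be reproduced by brute force. The sketch below (the helper names and the grid of thresholds are ours) enumerates the subsets each class picks out from small point sets and recovers V(C) = 2 for the half-lines and V(C) = 3 for the left-open intervals:

```python
from itertools import combinations

def picked_out(points, sets):
    """Subsets of `points` of the form C ∩ {points}, with each C in `sets`
    represented as a membership predicate."""
    return {frozenset(p for p in points if C(p)) for C in sets}

def shatters(points, sets):
    return len(picked_out(points, sets)) == 2 ** len(points)

def vc_index(candidates, sets, max_n):
    """Smallest n such that no n-point subset of `candidates` is shattered."""
    for n in range(1, max_n + 1):
        if not any(shatters(s, sets) for s in combinations(candidates, n)):
            return n
    return None

pts = [0.0, 1.0, 2.0, 3.0]
cuts = [-0.5, 0.5, 1.5, 2.5, 3.5]  # thresholds interleaving the points

half_lines = [lambda x, c=c: x <= c for c in cuts]           # C = {(-inf, c]}
intervals = [lambda x, a=a, b=b: a < x <= b
             for a in cuts for b in cuts if a < b]           # C = {(a, b]}

print(vc_index(pts, half_lines, 4))  # 2
print(vc_index(pts, intervals, 4))   # 3
```

Thresholds strictly between the points suffice here because only the induced subsets matter, not the particular value of c.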

For any class of sets C and any collection {x1, . . . , xn} ⊂ X, let ∆n(C, x1, . . . , xn) be the number of subsets of {x1, . . . , xn} which can be picked out by C. A surprising combinatorial result is that if V(C) < ∞, then ∆n(C, x1, . . . , xn) can increase in n no faster than O(n^{V(C)−1}). This is more precisely stated in the following lemma:

Lemma 9.1 For a VC-class of sets C,


max_{x1,...,xn ∈ X} ∆n(C, x1, . . . , xn) ≤ ∑_{j=0}^{V(C)−1} (n choose j).

Since the right-hand side is bounded by V(C) n^{V(C)−1}, the left-hand side grows polynomially in n of order at most O(n^{V(C)−1}).
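For the intervals class C = {(a, b] : a < b} from the example above, with V(C) = 3, this bound is attained exactly: the subsets picked out from n ordered points are the contiguous runs together with the empty set, 1 + n + n(n − 1)/2 in all. A brute-force check (helper names are ours):

```python
from math import comb

def delta_n(points, sets):
    """∆_n(C, x_1, ..., x_n): number of subsets picked out by the class,
    with each set represented as a membership predicate."""
    return len({frozenset(p for p in points if C(p)) for C in sets})

def sauer_bound(n, V):
    """Right-hand side of Lemma 9.1: sum_{j=0}^{V-1} C(n, j)."""
    return sum(comb(n, j) for j in range(V))

for n in range(1, 11):
    pts = list(range(n))
    cuts = [p - 0.5 for p in pts] + [n - 0.5, n + 0.5]
    intervals = [lambda x, a=a, b=b: a < x <= b
                 for a in cuts for b in cuts if a < b]
    # Intervals pick out exactly the contiguous runs plus the empty set:
    # 1 + n(n+1)/2 = C(n,0) + C(n,1) + C(n,2), so the bound is attained.
    assert delta_n(pts, intervals) == sauer_bound(n, 3)

print("Lemma 9.1 bound attained by intervals for n = 1..10")
```

That the bound can be attained shows the polynomial rate O(n^{V(C)−1}) is sharp for this class.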

This is a corollary of Lemma 2.6.2 of VW, and we omit the proof.

Let 1C denote the collection of all indicator functions of sets in the class C. The following theorem gives a bound on the Lr covering numbers of 1C:

Theorem 9.2 There exists a universal constant K < ∞ such that for any VC-class of sets C, any r ≥ 1, and any 0 < ǫ < 1,

N(ǫ, 1C, Lr(Q)) ≤ K V(C) (4e)^{V(C)} (1/ǫ)^{r(V(C)−1)}.

This is Theorem 2.6.4 of VW, and we omit the proof. Since F = 1 serves as an envelope for 1C, we have as an immediate corollary that, for F = 1C, supQ N(ǫ‖F‖Q,1, F, L1(Q)) < ∞ and

J(1, F, L2) ≲ ∫_0^1 √(log(1/ǫ)) dǫ = ∫_0^∞ u^{1/2} e^{−u} du ≤ 1,

where the supremum is over all finite probability measures Q with ‖F‖Q,2 > 0. Thus the uniform entropy conditions required in the G-C and Donsker theorems of the previous chapter are satisfied for indicators of VC-classes of sets. Since the constant 1 serves as a universally applicable envelope function, these classes of indicator functions are therefore G-C and Donsker, provided the requisite measurability conditions hold.
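The change of variables behind the display above (ǫ = e^{−u}, so √(log(1/ǫ)) becomes u^{1/2}) can be checked numerically: a crude midpoint rule for ∫_0^1 √(log(1/ǫ)) dǫ reproduces Γ(3/2) = √π/2 ≈ 0.886 < 1. A quick sanity check, not part of the text:

```python
import math

# Midpoint-rule approximation of ∫_0^1 sqrt(log(1/ǫ)) dǫ.  The substitution
# ǫ = e^{-u} turns it into ∫_0^∞ u^{1/2} e^{-u} du = Γ(3/2) = sqrt(π)/2.
N = 200_000
integral = sum(math.sqrt(math.log(N / (i + 0.5))) for i in range(N)) / N
target = math.sqrt(math.pi) / 2  # ≈ 0.8862

assert abs(integral - target) < 1e-3
assert integral <= 1.0
print(f"integral ≈ {integral:.4f}, Γ(3/2) = {target:.4f}")
```

The mild singularity of the integrand at ǫ = 0 is integrable, which is why the midpoint rule converges without special handling.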

For a function f : X → R, the subset of X × R given by {(x, t) : t < f(x)} is the subgraph of f. A collection F of measurable real functions on the sample space X is a VC-subgraph class or VC-class (for short) if the collection of all subgraphs of functions in F forms a VC-class of sets (as sets in X × R). Let V(F) denote the VC-index of the set of subgraphs of F. The following theorem, the proof of which is given in Section 9.5, shows that covering numbers of VC-classes of functions grow at a polynomial rate just like VC-classes of sets:

Theorem 9.3 There exists a universal constant K < ∞ such that, for any VC-class of measurable functions F with integrable envelope F, any r ≥ 1, any probability measure Q with ‖F‖Q,r > 0, and any 0 < ǫ < 1,

N(ǫ‖F‖Q,r, F, Lr(Q)) ≤ K V(F) (4e)^{V(F)} (2/ǫ)^{r(V(F)−1)}.


Thus VC-classes of functions easily satisfy the uniform entropy requirements of the G-C and Donsker theorems of the previous chapter. A related kind of function class is the VC-hull class. A class of measurable functions G is a VC-hull class if there exists a VC-class F such that each f ∈ G is the pointwise limit of a sequence of functions {fm} in the symmetric convex hull of F (denoted sconv F). A function f is in sconv F if f = ∑_{i=1}^m αi fi, where the αi's are real numbers satisfying ∑_{i=1}^m |αi| ≤ 1 and the fi's are in F. The convex hull of a class of functions F, denoted conv F, is similarly defined but with the requirement that the αi's are positive. We use cl(conv F) to denote the pointwise closure of conv F and cl(sconv F) to denote the pointwise closure of sconv F. Thus the class of functions F is a VC-hull class if F = cl(sconv G) for some VC-class G. The following theorem provides a useful relationship between the L2 covering numbers of a class F (not necessarily a VC-class) and the L2 covering numbers of conv F when the covering numbers for F are polynomial in 1/ǫ:

Theorem 9.4 Let Q be a probability measure on (X, A), and let F be a class of measurable functions with measurable envelope F, such that QF² < ∞ and, for 0 < ǫ < 1,

N(ǫ‖F‖Q,2, F, L2(Q)) ≤ C (1/ǫ)^V,

for constants C, V < ∞ (possibly dependent on Q). Then there exists a constant K depending only on V and C such that

log N(ǫ‖F‖Q,2, conv F, L2(Q)) ≤ K (1/ǫ)^{2V/(V+2)}.

This is Theorem 2.6.9 of VW, and we omit the proof.

It is not hard to verify that sconv F is a subset of the convex hull of F ∪ (−F) ∪ {0}, where −F ≡ {−f : f ∈ F} (see Exercise 9.6.1 below). Since the covering numbers of F ∪ (−F) ∪ {0} are at most one plus twice the covering numbers of F, the conclusion of Theorem 9.4 also holds if conv F is replaced with sconv F. This leads to the following easy corollary for VC-hull classes, the proof of which we save as an exercise:

Corollary 9.5 For any VC-hull class F of measurable functions,

supQ log N(ǫ‖F‖Q,2, F, L2(Q)) ≤ K (1/ǫ)^{2−2/V}, 0 < ǫ < 1,

where the supremum is taken over all probability measures Q with ‖F‖Q,2 > 0, V is the VC-index of the VC-subgraph class associated with F, and the constant K < ∞ depends only on V.


We now present several important examples and results about VC-classes of sets and both VC-subgraph and VC-hull classes of functions. The first lemma, Lemma 9.6, applies to vector spaces of functions, a frequently occurring function class in statistical applications.

Lemma 9.6 Let F be a finite-dimensional vector space of measurable functions f : X → R. Then F is VC-subgraph with V(F) ≤ dim(F) + 2.

Proof. Fix any collection G of k = dim(F) + 2 points (x1, t1), . . . , (xk, tk) in X × R. By assumption, the vectors (f(x1) − t1, . . . , f(xk) − tk)^T, as f ranges over F, are restricted to a (dim(F) + 1 = k − 1)-dimensional subspace H of R^k. Any vector a ≠ 0 in R^k orthogonal to H satisfies

∑_{j: aj>0} aj (f(xj) − tj) = ∑_{j: aj<0} (−aj)(f(xj) − tj),   (9.1)

for all f ∈ F, where we define sums over empty sets to be zero. There exists such a vector a with at least one strictly positive coordinate. For this a, the subset of G of the form {(xj, tj) : aj > 0} cannot also be of the form {(xj, tj) : tj < f(xj)} for any f ∈ F, since otherwise the left side of Equation (9.1) would be strictly positive while the right side would be nonpositive. Conclude that the subgraphs of F cannot pick out the subset {(xj, tj) : aj > 0}. Since G was arbitrary, we have just shown that the subgraphs of F cannot shatter any set of k points in X × R. The desired result now follows.

The next three lemmas, Lemmas 9.7 through 9.9, consist of useful tools for building VC-classes from other VC-classes. One of these lemmas, Lemma 9.8, is the important identity that classes of sets are VC if and only if the corresponding classes of indicator functions are VC-subgraph. The proof of Lemma 9.9 is relegated to Section 9.5.

Lemma 9.7 Let C and D be VC-classes of sets in a set X, with respective VC-indices VC and VD; and let E be a VC-class of sets in W, with VC-index VE. Also let φ : X → Y and ψ : Z → X be fixed functions. Then

(i) C^c ≡ {C^c : C ∈ C} is VC with V(C^c) = V(C);

(ii) C ⊓ D ≡ {C ∩ D : C ∈ C, D ∈ D} is VC with index ≤ VC + VD − 1;

(iii) C ⊔ D ≡ {C ∪ D : C ∈ C, D ∈ D} is VC with index ≤ VC + VD − 1;

(iv) D × E is VC in X × W with VC index ≤ VD + VE − 1;

(v) φ(C) is VC with index VC if φ is one-to-one;

(vi) ψ^{−1}(C) is VC with index ≤ VC.


Proof. For any C ∈ C, the set C^c picks out the points of a given set {x1, . . . , xn} that C does not pick out. Thus if C shatters a given set of points, so does C^c, and vice versa. This proves (i). From n points, C can pick out O(n^{VC−1}) subsets. From each of these subsets, D can pick out O(n^{VD−1}) further subsets. Hence C ⊓ D can pick out at most O(n^{VC+VD−2}) subsets, and thus (ii) follows from the definition of a VC-class. We save it as an exercise to show that (i) and (ii) imply (iii). For (iv), note that D × W and X × E are both VC-classes with respective VC-indices VD and VE. The desired conclusion now follows from Part (ii), since D × E = (D × W) ⊓ (X × E).

For Part (v), if φ(C) shatters {y1, . . . , yn}, then each yi must be in the range of φ and there must therefore exist x1, . . . , xn such that φ is a bijection between {x1, . . . , xn} and {y1, . . . , yn}. Hence C must shatter {x1, . . . , xn}, and thus V(φ(C)) ≤ V(C). Conversely, it is obvious that if C shatters {x1, . . . , xn}, then φ(C) shatters {φ(x1), . . . , φ(xn)}. Hence V(C) ≤ V(φ(C)). For (vi), if ψ^{−1}(C) shatters {z1, . . . , zn}, then all ψ(zi) must be distinct and the restriction of ψ to {z1, . . . , zn} is a bijection onto its range. Thus C shatters {ψ(z1), . . . , ψ(zn)}, and hence V(ψ^{−1}(C)) ≤ V(C).

Lemma 9.8 For any class C of sets in a set X, the class FC of indicator functions of sets in C is VC-subgraph if and only if C is a VC-class. Moreover, whenever at least one of C or FC is VC, the respective VC-indices are equal.

Proof. Let D be the collection of sets of the form {(x, t) : t < 1{x ∈ C}} for all C ∈ C. Suppose that D is VC, and let k = V(D). Then no set of the form {(x1, 0), . . . , (xk, 0)} can be shattered by D, and hence V(C) ≤ V(D). Now suppose that C is VC with VC-index k. Since for any t < 0, 1{x ∈ C} > t for all x and all C, we have that no collection {(x1, t1), . . . , (xk, tk)} can be shattered by D if any of the tj's are < 0. It is similarly true that no collection {(x1, t1), . . . , (xk, tk)} can be shattered by D if any of the tj's are ≥ 1, since 1{x ∈ C} > t is never true when t ≥ 1. It can now be deduced that {(x1, t1), . . . , (xk, tk)} can only be shattered if {(x1, 0), . . . , (xk, 0)} can be shattered. But this can only happen if {x1, . . . , xk} can be shattered by C. Thus V(D) ≤ V(C).

Lemma 9.9 Let F and G be VC-subgraph classes of functions on a set X, with respective VC indices VF and VG. Let g : X → R, φ : R → R, and ψ : Z → X be fixed functions. Then

(i) F ∧ G ≡ {f ∧ g : f ∈ F, g ∈ G} is VC-subgraph with index ≤ VF + VG − 1;

(ii) F ∨ G is VC with index ≤ VF + VG − 1;

(iii) {F > 0} ≡ {{f > 0} : f ∈ F} is a VC-class of sets with index VF;

(iv) −F is VC-subgraph with index VF;


(v) F + g ≡ {f + g : f ∈ F} is VC with index VF;

(vi) F · g ≡ {fg : f ∈ F} is VC with index ≤ 2VF − 1;

(vii) F ◦ ψ ≡ {f(ψ) : f ∈ F} is VC with index ≤ VF;

(viii) φ ◦ F is VC with index ≤ VF for monotone φ.

The next two lemmas, Lemmas 9.10 and 9.11, refer to properties of monotone processes and classes of monotone functions. The proof of Lemma 9.11 is relegated to Section 9.5.

Lemma 9.10 Let {X(t), t ∈ T} be a monotone increasing stochastic process, where T ⊂ R. Then X is VC-subgraph with index V(X) = 2.

Proof. Let X be the set of all monotone increasing functions g : T → R; and for any s ∈ T and x ∈ X, define (s, x) → fs(x) = x(s). Thus the proof is complete if we can show that the class of functions F ≡ {fs : s ∈ T} is VC-subgraph with VC index 2. Now let (x1, t1), (x2, t2) be any two points in X × R. F shatters {(x1, t1), (x2, t2)} if the graph G of (fs(x1), fs(x2)) in R² “surrounds” the point (t1, t2) as s ranges over T. By surrounding a point (a, b) ∈ R², we mean that the graph must pass through all four of the sets {(u, v) : u ≤ a, v ≤ b}, {(u, v) : u > a, v ≤ b}, {(u, v) : u ≤ a, v > b} and {(u, v) : u > a, v > b}. By the assumed monotonicity of x1 and x2, the graph G forms a monotone curve in R², and it is thus impossible for it to surround any point in R². Thus {(x1, t1), (x2, t2)} cannot be shattered by F, and the desired result follows.

Lemma 9.11 The set F of all monotone functions f : R → [0, 1] satisfies

supQ log N(ǫ, F, L2(Q)) ≤ K/ǫ, 0 < ǫ < 1,

where the supremum is taken over all probability measures Q, and the constant K < ∞ is universal.

The final lemma of this section, Lemma 9.12 below, addresses the claim raised in Section 4.1 that the class of functions F ≡ {1{Y − b′Z ≤ v} : b ∈ R^k, v ∈ R} is Donsker. Because the indicator functions in F are a subset of the indicator functions for half-spaces in R^{k+1}, Part (i) of the lemma implies that F is VC with index k + 3. Since Lemma 8.12 from Chapter 8 verifies that F, Fδ and F∞² are all P-measurable, for any probability measure P, Theorem 9.3 combined with Theorem 8.19 and the fact that indicator functions are bounded establishes that F is P-Donsker for any P. Lemma 9.12 also gives a related result on closed balls in R^d. In the lemma, 〈a, b〉 denotes the Euclidean inner product.

Lemma 9.12 The following are true:


(i) The collection of all half-spaces in R^d, consisting of the sets {x ∈ R^d : 〈x, u〉 ≤ c} with u ranging over R^d and c ranging over R, is VC with index d + 2.

(ii) The collection of all closed balls in Rd is VC with index ≤ d + 3.

Proof. The class A+ of sets {x : 〈x, u〉 ≤ c}, with u ranging over R^d and c ranging over (0,∞), is equivalent to the class of sets {x : 〈x, u〉 − 1 ≤ 0} with u ranging over R^d. In this last class, since 〈x, u〉 spans a d-dimensional vector space, Lemma 9.6 and Part (v) of Lemma 9.9 yield that the class of functions spanned by 〈x, u〉 − 1 is VC with index d + 2. Part (iii) of Lemma 9.9 combined with Part (i) of Lemma 9.7 now yields that the class A+ is VC with index d + 2. Similar arguments verify that both the class A−, with c restricted to (−∞, 0), and the class A0, with c = 0, are VC with index d + 2. It is easy to verify that the union of finite VC classes has VC index equal to the maximum of the respective VC indices. This concludes the proof of (i).

Closed balls in R^d are sets of the form {x : 〈x, x〉 − 2〈x, u〉 + 〈u, u〉 ≤ c}, where u ranges over R^d and c ranges over [0,∞). It is straightforward to check that the class G of all functions of the form x → −2〈x, u〉 + 〈u, u〉 − c is contained in a (d + 1)-dimensional vector space, and thus G is VC with index ≤ d + 3. Combining this with Part (v) of Lemma 9.9 yields that the class F = G + 〈x, x〉 is also VC with index ≤ d + 3. Now the desired conclusion follows from Part (iii) of Lemma 9.9 combined with Part (i) of Lemma 9.7.
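Part (i) is easy to verify directly when d = 1, where half-spaces are rays: they shatter every two-point set but can never pick out the middle point of three, so V = 3 = d + 2. A brute-force sketch (the grids of u and c values are ours; in one dimension only the sign of u matters):

```python
from itertools import combinations

def picked_out(points, sets):
    """Subsets of `points` picked out by the class, each set a predicate."""
    return {frozenset(p for p in points if C(p)) for C in sets}

def shatters(points, sets):
    return len(picked_out(points, sets)) == 2 ** len(points)

# Half-spaces {x : xu <= c} in R^1, with thresholds bracketing the points.
pts = [0.0, 1.0, 2.0]
cuts = [-2.5, -1.5, -0.5, 0.5, 1.5, 2.5]
half_spaces = [lambda x, u=u, c=c: x * u <= c
               for u in (-1.0, 0.0, 1.0) for c in cuts]

# Every two-point set is shattered, but no three-point set is: the middle
# point alone can never be picked out by a ray.  Hence V = 3 = d + 2.
assert all(shatters(list(p), half_spaces) for p in combinations(pts, 2))
assert not shatters(pts, half_spaces)
print("half-spaces in R^1: V = 3 = d + 2")
```

The same enumeration idea works in higher dimensions, though the search over (u, c) grows quickly.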

9.1.2 BUEI Classes

Recall, for a class of measurable functions F with envelope F, the uniform entropy integral

J(δ, F, L2) ≡ ∫_0^δ √( supQ log N(ǫ‖F‖Q,2, F, L2(Q)) ) dǫ,

where the supremum is taken over all finitely discrete probability measures Q with ‖F‖Q,2 > 0. Note the dependence on the choice of envelope F. This is crucial since there are many functions which can serve as an envelope: for example, if F is an envelope, then so are F + 1 and 2F. One must allow that different envelopes may be needed in different settings. We say that the class F has bounded uniform entropy integral (BUEI) with envelope F—or is BUEI with envelope F—if J(1, F, L2) < ∞ for that particular choice of envelope.

Theorem 9.3 tells us that a VC-class F is automatically BUEI with any envelope. We leave it as an exercise to show that if F and G are BUEI with respective envelopes F and G, then F ⊔ G is BUEI with envelope F ∨ G. The following lemma, which is closely related to an important Donsker


preservation theorem in Section 9.4 below, is also useful for building BUEI classes from other BUEI classes:

Lemma 9.13 Let F1, . . . , Fk be BUEI classes with respective envelopes F1, . . . , Fk, and let φ : R^k → R satisfy

|φ ◦ f(x) − φ ◦ g(x)|² ≤ c² ∑_{j=1}^k (fj(x) − gj(x))²,   (9.2)

for every f, g ∈ F1 × · · · × Fk and x, for a constant 0 < c < ∞. Then the class φ ◦ (F1, . . . , Fk) is BUEI with envelope

H ≡ |φ(f0)| + c ∑_{j=1}^k (|f0j| + Fj),

where f0 ≡ (f01, . . . , f0k) is any function in F1 × · · · × Fk, and where φ ◦ (F1, . . . , Fk) is as defined in Lemma 8.10.

Proof. Fix ǫ > 0 and a finitely discrete probability measure Q, and let f, g ∈ F1 × · · · × Fk satisfy ‖fj − gj‖Q,2 ≤ ǫ‖Fj‖Q,2 for 1 ≤ j ≤ k. Now (9.2) implies that

‖φ ◦ f − φ ◦ g‖Q,2 ≤ c ( ∑_{j=1}^k ‖fj − gj‖²Q,2 )^{1/2} ≤ ǫc ( ∑_{j=1}^k ‖Fj‖²Q,2 )^{1/2} ≤ ǫ‖H‖Q,2.

Hence

N(ǫ‖H‖Q,2, φ ◦ (F1, . . . , Fk), L2(Q)) ≤ ∏_{j=1}^k N(ǫ‖Fj‖Q,2, Fj, L2(Q)),

and the desired result follows since ǫ and Q were arbitrary.

Some useful consequences of Lemma 9.13 are given in the following lemma:

Lemma 9.14 Let F and G be BUEI with respective envelopes F and G, and let φ : R → R be a Lipschitz continuous function with Lipschitz constant 0 < c < ∞. Then

(i) F ∧ G is BUEI with envelope F + G;

(ii) F ∨ G is BUEI with envelope F + G;

(iii) F + G is BUEI with envelope F + G;

(iv) φ(F) is BUEI with envelope |φ(f0)|+ c(|f0|+ F ), provided f0 ∈ F .


The proof, which we omit, is straightforward.

As mentioned earlier, Lemma 9.13 is very similar to a Donsker preservation result we will present later in this chapter. In fact, most of the BUEI preservation results we give in this section have parallel Donsker preservation properties. An important exception, and one which is perhaps the primary justification for the use of BUEI preservation techniques, applies to products of Donsker classes. As verified in the following theorem, the product of two BUEI classes is BUEI, whether or not the two classes involved are bounded (compare with Corollary 9.15 below):

Theorem 9.15 Let F and G be BUEI classes with respective envelopes F and G. Then F · G ≡ {fg : f ∈ F, g ∈ G} is BUEI with envelope FG.

Proof. Fix ǫ > 0 and a finitely discrete probability measure Q with ‖FG‖Q,2 > 0, and let dQ* ≡ G² dQ/‖G‖²Q,2. Clearly, Q* is a finitely discrete probability measure with ‖F‖Q*,2 > 0. Let f1, f2 ∈ F satisfy ‖f1 − f2‖Q*,2 ≤ ǫ‖F‖Q*,2. Then

ǫ ≥ ‖f1 − f2‖Q*,2 / ‖F‖Q*,2 = ‖(f1 − f2)G‖Q,2 / ‖FG‖Q,2,

and thus, if we let F · G ≡ {fG : f ∈ F},

N(ǫ‖FG‖Q,2, F · G, L2(Q)) ≤ N(ǫ‖F‖Q*,2, F, L2(Q*)) ≤ supQ N(ǫ‖F‖Q,2, F, L2(Q)),   (9.3)

where the supremum is taken over all finitely discrete probability measures Q for which ‖F‖Q,2 > 0. Since the right-hand side of (9.3) does not depend on Q, and since Q satisfies ‖FG‖Q,2 > 0 but is otherwise arbitrary, we have that

supQ N(ǫ‖FG‖Q,2, F · G, L2(Q)) ≤ supQ N(ǫ‖F‖Q,2, F, L2(Q)),

where the supremums are taken over all finitely discrete probability measures Q, but with the left side taken over the subset for which ‖FG‖Q,2 > 0 while the right side is taken over the subset for which ‖F‖Q,2 > 0.

We can similarly show that the uniform entropy numbers for the class G · F ≡ {Fg : g ∈ G} with envelope FG are bounded by the uniform entropy numbers for G with envelope G. Since |f1g1 − f2g2| ≤ |f1 − f2|G + |g1 − g2|F for all f1, f2 ∈ F and g1, g2 ∈ G, the foregoing results imply that

supQ N(ǫ‖FG‖Q,2, F · G, L2(Q)) ≤ supQ N(ǫ‖F‖Q,2/2, F, L2(Q)) × supQ N(ǫ‖G‖Q,2/2, G, L2(Q)),


where the supremums are all taken over the appropriate subsets of all finitely discrete probability measures. After taking logs, square roots, and then integrating both sides with respect to ǫ, the desired conclusion follows.

In order for BUEI results to be useful for obtaining Donsker results, it is necessary that sufficient measurability be established so that Theorem 8.19 can be used. As shown in Proposition 8.11 and the comments following Theorem 8.19, pointwise measurability (PM) is sufficient measurability for this purpose. Since there are significant similarities between PM preservation and BUEI preservation results, one can construct useful joint PM and BUEI preservation results. Here is one such result:

Lemma 9.16 Let the classes F1, . . . , Fk be both BUEI and PM with respective envelopes F1, . . . , Fk, and let φ : R^k → R satisfy (9.2) for every f, g ∈ F1 × · · · × Fk and x, for a constant 0 < c < ∞. Then the class φ ◦ (F1, . . . , Fk) is both BUEI and PM with envelope H ≡ |φ(f0)| + c ∑_{j=1}^k (|f0j| + Fj), where f0 is any function in F1 × · · · × Fk.

Proof. Since a function satisfying (9.2) as specified is also continuous, the desired result is a direct consequence of Lemmas 8.10 and 9.13.

The following lemma contains some additional joint preservation results:

Lemma 9.17 Let the classes F and G be both BUEI and PM with respective envelopes F and G, and let φ : R → R be a Lipschitz continuous function with Lipschitz constant 0 < c < ∞. Then

(i) F ∪ G is both BUEI and PM with envelope F ∨G;

(ii) F ∧ G is both BUEI and PM with envelope F + G;

(iii) F ∨ G is both BUEI and PM with envelope F + G;

(iv) F + G is both BUEI and PM with envelope F + G;

(v) F · G is both BUEI and PM with envelope FG;

(vi) φ(F) is both BUEI and PM with envelope |φ(f0)|+ c(|f0|+F ), wheref0 ∈ F .

Proof. Verifying (i) is straightforward. Results (ii), (iii), (iv) and (vi) are consequences of Lemma 9.16. Result (v) is a consequence of Lemma 8.10 and Theorem 9.15.

If a class of measurable functions F is both BUEI and PM with envelope F, then Theorem 8.19 implies that F is P-Donsker whenever P*F² < ∞. Note that we have somehow avoided discussing preservation for subsets of classes. This is because it is unclear whether a subset of a PM class F is itself a PM class. The difficulty is that while F may have a countable dense subset G (dense in terms of pointwise convergence), it is unclear whether any arbitrary subset H ⊂ F also has a suitable countable dense subset. An easy way around this problem is to use various preservation results to


establish that F is P-Donsker, and then it follows directly that any H ⊂ F is also P-Donsker by the definition of weak convergence. We will explore several additional preservation results as well as several practical examples later in this chapter and in the case studies of Chapter 15.

9.2 Bracketing Entropy

We now present several useful bracketing entropy results for certain function classes as well as a few preservation results. We first mention that bracketing numbers are in general larger than covering numbers, as verified in the following lemma:

Lemma 9.18 Let F be any class of real functions on X and ‖·‖ any norm on F. Then

N(ǫ, F, ‖·‖) ≤ N[](ǫ, F, ‖·‖)

for all ǫ > 0.

Proof. Fix ǫ > 0, and let B be a collection of ǫ-brackets that covers F. From each bracket B ∈ B, take a function g(B) ∈ B ∩ F to form a finite collection of functions G ⊂ F of the same cardinality as B, consisting of one function from each bracket in B. Now every f ∈ F lies in a bracket B ∈ B such that ‖f − g(B)‖ ≤ ǫ by the definition of an ǫ-bracket. Thus G is an ǫ-cover of F of the same cardinality as B. The desired conclusion now follows.

The first substantive bracketing entropy result we present considers classes of smooth functions on a bounded set X ⊂ R^d. For any vector k = (k1, . . . , kd) of nonnegative integers, define the differential operator D^k ≡ ∂^{|k|}/(∂x1^{k1} · · · ∂xd^{kd}), where |k| ≡ k1 + · · · + kd. As defined previously, let ⌊x⌋ be the largest integer j ≤ x, for any x ∈ R. For any function f : X → R and α > 0, define the norm

‖f‖α ≡ max_{k:|k|≤⌊α⌋} sup_x |D^k f(x)| + max_{k:|k|=⌊α⌋} sup_{x,y} |D^k f(x) − D^k f(y)| / ‖x − y‖^{α−⌊α⌋},

where the suprema are taken over x ≠ y in the interior of X. When k = 0, we set D^k f = f. Now let C^α_M(X) be the set of all continuous functions f : X → R with ‖f‖α ≤ M. Recall that for a set A in a metric space, diam A = sup_{x,y∈A} d(x, y). We have the following theorem:

Theorem 9.19 Let X ⊂ R^d be bounded and convex with nonempty interior. There exists a constant K < ∞ depending only on α, diam X, and d such that

log N[](ǫ, C^α_1(X), Lr(Q)) ≤ K (1/ǫ)^{d/α},

for every r ≥ 1, ǫ > 0, and any probability measure Q on R^d.


This is Corollary 2.7.2 of VW, and we omit the proof.

We now present a generalization of Theorem 9.19 which permits X to be unbounded:

Theorem 9.20 Let R^d = ⋃_{j=1}^∞ Ij be a partition of R^d into bounded, convex sets with nonempty interior, and let F be a class of measurable functions f : R^d → R such that the restrictions F|Ij belong to C^α_{Mj}(Ij) for all j. Then, for every V ≥ d/α, there exists a constant K depending only on α, r, and d, such that

log N[](ǫ, F, Lr(Q)) ≤ K (1/ǫ)^V [ ∑_{j=1}^∞ λ(Ij¹)^{r/(V+r)} Mj^{Vr/(V+r)} Q(Ij)^{V/(V+r)} ]^{(V+r)/r},

for every ǫ > 0 and probability measure Q, where, for a set A ⊂ R^d, A¹ ≡ {x : ‖x − A‖ < 1}.

This is Corollary 2.7.4 of VW, and we omit the proof. We now consider several results for Lipschitz and Sobolev function classes. We first present the results for covering numbers based on the uniform norm and then present the relationship to bracketing entropy.

Theorem 9.21 For a compact, convex subset C ⊂ R^d, let F be the class of all convex functions f : C → [0, 1] with |f(x) − f(y)| ≤ L‖x − y‖ for every x, y. For some integer m ≥ 1, let G be the class of all functions g : [0, 1] → [0, 1] with ∫_0^1 [g^{(m)}(x)]^2 dx ≤ 1, where superscript (m) denotes the m-th derivative. Then

log N(ǫ, F, ‖·‖_∞) ≤ K (1 + L)^{d/2} (1/ǫ)^{d/2}, and

log N(ǫ, G, ‖·‖_∞) ≤ M (1/ǫ)^{1/m},

where ‖·‖_∞ is the uniform norm, the constant K < ∞ depends only on d and C, and the constant M depends only on m.

The first displayed result is Corollary 2.7.10 of VW, while the second displayed result is Theorem 2.4 of van de Geer (2000). We omit the proofs.

The following lemma shows how Theorem 9.21 applies to bracketing entropy:

Lemma 9.22 For any norm ‖·‖ dominated by ‖·‖_∞ and any class of functions F,

log N_{[]}(2ǫ, F, ‖·‖) ≤ log N(ǫ, F, ‖·‖_∞),

for all ǫ > 0.



Proof. Let f_1, . . . , f_m be a uniform ǫ-cover of F. Since the 2ǫ-brackets [f_i − ǫ, f_i + ǫ] now cover F, the result follows.

We now present a second Lipschitz continuity result which is in fact a generalization of Lemma 9.22. The result applies to function classes of the form F = {f_t : t ∈ T}, where

(9.4)   |f_s(x) − f_t(x)| ≤ d(s, t) F(x)

for some metric d on T, some real function F on the sample space X, and for all x ∈ X. This special Lipschitz structure arises in a number of settings, including parametric Z- and M-estimation. For example, consider the least absolute deviation regression setting of Section 2.2.6, under the assumption that the random covariate U and regression parameter θ are constrained to known compact subsets U, Θ ⊂ R^p. Recall that, in this setting, the outcome given U is modeled as Y = θ′U + e, where the residual error e has median zero. Estimation of the true parameter value θ_0 is accomplished by minimizing θ ↦ P_n m_θ, where m_θ(X) ≡ |e − (θ − θ_0)′U| − |e|, X ≡ (Y, U) and e = Y − θ_0′U. From (2.20) in Section 2.2.6, we know that the class F = {m_θ : θ ∈ Θ} satisfies (9.4) with T = Θ, d(s, t) = ‖s − t‖ and F(x) = ‖u‖, where x = (y, u) is a realization of X.
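The Lipschitz bound (9.4) for this least absolute deviation class is easy to check numerically. The sketch below is illustrative only: θ_0 and all draws are arbitrary simulated values, and the bound |m_s(x) − m_t(x)| ≤ ‖s − t‖ ‖u‖ follows from the reverse triangle inequality.

```python
import numpy as np

rng = np.random.default_rng(0)
p = 3
theta0 = rng.normal(size=p)  # hypothetical "true" parameter, chosen arbitrarily

def m(theta, y, u):
    # m_theta(X) = |e - (theta - theta0)'u| - |e|, where e = y - theta0'u
    e = y - theta0 @ u
    return abs(e - (theta - theta0) @ u) - abs(e)

# Check the Lipschitz bound (9.4): |m_s(x) - m_t(x)| <= ||s - t|| ||u||
for _ in range(1000):
    s, t, u = (rng.normal(size=p) for _ in range(3))
    y = theta0 @ u + rng.normal()
    lhs = abs(m(s, y, u) - m(t, y, u))
    assert lhs <= np.linalg.norm(s - t) * np.linalg.norm(u) + 1e-9
```

The assertion never fails, reflecting that the bound holds pointwise and deterministically, not just on average.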

The following theorem shows that the bracketing numbers for a general F satisfying (9.4) are bounded by the covering numbers for the associated index set T.

Theorem 9.23 Suppose the class of functions F = {f_t : t ∈ T} satisfies (9.4) for every s, t ∈ T and some fixed function F. Then, for any norm ‖·‖,

N_{[]}(2ǫ‖F‖, F, ‖·‖) ≤ N(ǫ, T, d).

Proof. Note that for any ǫ-net t_1, . . . , t_k that covers T with respect to d, the brackets [f_{t_j} − ǫF, f_{t_j} + ǫF] cover F. Since these brackets are all of size 2ǫ‖F‖, the proof is complete.

Note that when ‖·‖ is any norm dominated by ‖·‖_∞, Theorem 9.23 simplifies to Lemma 9.22 when T = F and d = ‖·‖_∞ (and thus automatically F = 1).

We move now from continuous functions to monotone functions. As was done in Lemma 9.11 above for uniform entropy, we can study the bracketing entropy of the class of all monotone functions mapping into [0, 1]:

Theorem 9.24 For each integer r ≥ 1, there exists a constant K < ∞ such that the class F of monotone functions f : R → [0, 1] satisfies

log N_{[]}(ǫ, F, L_r(Q)) ≤ K/ǫ,

for all ǫ > 0 and every probability measure Q.

The lengthy proof, which we omit, is given in Chapter 2.7 of VW.



We now briefly discuss preservation results. Unfortunately, it appears that there are not as many useful preservation results for bracketing entropy as there are for uniform entropy, but the following lemma contains two such results which are easily verified:

Lemma 9.25 Let F and G be classes of measurable functions. Then for any probability measure Q and any 1 ≤ r ≤ ∞,

(i) N_{[]}(2ǫ, F + G, L_r(Q)) ≤ N_{[]}(ǫ, F, L_r(Q)) N_{[]}(ǫ, G, L_r(Q));

(ii) Provided F and G are bounded by 1,

N_{[]}(2ǫ, F · G, L_r(Q)) ≤ N_{[]}(ǫ, F, L_r(Q)) N_{[]}(ǫ, G, L_r(Q)).

The straightforward proof is saved as an exercise.

9.3 Glivenko-Cantelli Preservation

In this section, we discuss methods which are useful for building up Glivenko-Cantelli (G-C) classes from other G-C classes. Such results can be useful for establishing consistency for Z- and M-estimators and their bootstrapped versions. It is clear from the definition of P-G-C classes that if F and G are P-G-C, then F ∪ G and any subset thereof is also P-G-C. The purpose of the remainder of this section is to discuss more substantive preservation results. The main tool for this is the following theorem, which is a minor modification of Theorem 3 of van der Vaart and Wellner (2000) and which we give without proof:

Theorem 9.26 Suppose that F_1, . . . , F_k are strong P-G-C classes of functions with max_{1≤j≤k} ‖P‖_{F_j} < ∞, and that φ : R^k → R is continuous. Then the class H ≡ φ(F_1, . . . , F_k) is strong P-G-C provided it has an integrable envelope.

The following corollary lists some obvious consequences of this theorem:

Corollary 9.27 Let F and G be P-G-C classes with respective integrable envelopes F and G. Then the following are true:

(i) F + G is P-G-C.

(ii) F · G is P-G-C provided P[FG] < ∞.

(iii) Let R be the union of the ranges of functions in F, and let ψ : R → R be continuous. Then ψ(F) is P-G-C provided it has an integrable envelope.



Proof. The statement (i) is obvious. Since (x, y) ↦ xy is continuous on R^2, statement (ii) follows from Theorem 9.26. Statement (iii) also follows from the theorem, since ψ has a continuous extension to R, ψ̃, such that ‖Pψ(f)‖_F = ‖Pψ̃(f)‖_F.
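To illustrate Part (ii) concretely, the following simulation sketch (not from the text; the distribution and grid are arbitrary choices) takes the G-C class of half-line indicators f_t(x) = 1{x ≤ t} and the single function g(x) = x, for which P[FG] < ∞ under an exponential law, and checks that the sup-distance between P_n and P over the product class shrinks as n grows.

```python
import numpy as np

rng = np.random.default_rng(7)
grid = np.linspace(0.1, 5.0, 100)
# For X ~ Exp(1): P[X 1{X <= t}] = 1 - e^{-t}(1 + t)
truth = 1 - np.exp(-grid) * (1 + grid)

for n in (100, 1000, 10000):
    x = rng.exponential(size=n)
    # P_n[f_t g] for f_t(x) = 1{x <= t} and g(x) = x
    emp = np.array([np.mean(x * (x <= t)) for t in grid])
    print(n, np.max(np.abs(emp - truth)))  # sup-distance shrinks with n
```

The printed sup-norm errors decrease with the sample size, as the G-C property of the product class predicts.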

It is interesting to note that the "preservation of products" result in Part (ii) of the above corollary does not hold in general for Donsker classes (although, as was shown in Section 9.1.2, it does hold for BUEI classes). This preservation result for G-C classes can be useful in formulating master theorems for bootstrapped Z- and M-estimators. Consider, for example, verifying the validity of the bootstrap for a parametric Z-estimator θ̂_n which is a zero of θ ↦ P_n ψ_θ, for θ ∈ Θ, where ψ_θ is a suitable random function. Let Ψ(θ) = Pψ_θ, where we assume that for any sequence θ_n ∈ Θ, Ψ(θ_n) → 0 implies θ_n → θ_0 ∈ Θ (i.e., the parameter is identifiable). Usually, to obtain consistency, it is reasonable to assume that the class {ψ_θ : θ ∈ Θ} is P-G-C. Clearly, this condition is sufficient to ensure that θ̂_n → θ_0 outer almost surely.

Now, under a few additional assumptions, the Z-estimator master theorem, Theorem 2.11, can be applied to obtain asymptotic normality of √n(θ̂_n − θ_0). In Section 2.2.5, we made the claim that if Ψ is appropriately differentiable and the parameter is identifiable (as defined in the previous paragraph), sufficient additional conditions for this asymptotic normality to hold and for the bootstrap to be valid are that {ψ_θ : θ ∈ Θ} is strong P-G-C with sup_{θ∈Θ} P|ψ_θ| < ∞, that {ψ_θ : θ ∈ Θ, ‖θ − θ_0‖ ≤ δ} is P-Donsker for some δ > 0, and that P‖ψ_θ − ψ_{θ_0}‖^2 → 0 as θ → θ_0. As we will see in Chapter 13, where we present the arguments for this result in detail, an important step in the proof of bootstrap validity is to show that the bootstrap estimate is unconditionally consistent for θ_0. If we use a weighted bootstrap with i.i.d. non-negative weights ξ_1, . . . , ξ_n, which are independent of the data and which satisfy Eξ_1 = 1, then result (ii) from the above corollary tells us that F ≡ {ξψ_θ : θ ∈ Θ} is P-G-C. This follows since both classes of functions {ξ} (a trivial class with one member) and {ψ_θ : θ ∈ Θ} are P-G-C and since the product class F has an integrable envelope by Lemma 8.13. Note here that we are tacitly augmenting P to be the product probability measure of both the data and the independent bootstrap weights. We will expand on these ideas in Section 10.3 of the next chapter for the special case where Θ ⊂ R^p and in Chapter 13 for the more general case.

Another result that can be useful for inference is the following lemma on covariance estimation. We mentioned this result in the first paragraph of Section 2.2.3 in the context of conducting uniform inference for Pf as f ranges over a class of functions F. The lemma answers the question of when the limiting covariance of G_n, indexed by F, can be consistently estimated. Recall that this covariance is σ(f, g) ≡ Pfg − PfPg, and its estimator is σ̂(f, g) ≡ P_n fg − P_n f P_n g. Although knowledge of this covariance matrix is



usually not sufficient in itself to obtain inference on {Pf : f ∈ F}, it still provides useful information.

Lemma 9.28 Let F be Donsker. Then ‖σ̂(f, g) − σ(f, g)‖_{F·F} →^{as∗} 0 if and only if P∗‖f − Pf‖^2_F < ∞.

Proof. Note that since F is Donsker, F is also G-C. Hence F̄ ≡ {f̄ : f ∈ F} is G-C, where for any f ∈ F, f̄ = f − Pf. Now we first assume that P∗‖f − Pf‖^2_F < ∞. By Theorem 9.26, F̄ · F̄ is also G-C. Uniform consistency of σ̂ now follows since, for any f, g ∈ F, σ̂(f, g) − σ(f, g) = (P_n − P)f̄ḡ − P_n f̄ P_n ḡ. Assume next that ‖σ̂(f, g) − σ(f, g)‖_{F·F} →^{as∗} 0. This implies that F̄ · F̄ is G-C. Now Lemma 8.13 implies that P∗‖f − Pf‖^2_F = P∗‖f̄ḡ‖_{F̄·F̄} < ∞.
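For a concrete illustration of Lemma 9.28 (a simulation sketch, not part of the text; the grid and sample size are arbitrary), take F = {x ↦ 1{x ≤ t}} with standard normal data. Then σ(f_s, f_t) = Φ(s ∧ t) − Φ(s)Φ(t), and the plug-in estimator σ̂ is uniformly close over the grid:

```python
import math
import numpy as np

rng = np.random.default_rng(11)
grid = np.linspace(-1.5, 1.5, 25)
Phi = np.array([0.5 * (1 + math.erf(t / math.sqrt(2))) for t in grid])

n = 20000
x = rng.normal(size=n)
ind = (x[:, None] <= grid[None, :]).astype(float)  # f_t(X_i) = 1{X_i <= t}

# sigma_hat(f_s, f_t) = P_n f_s f_t - (P_n f_s)(P_n f_t)
Pn_f = ind.mean(axis=0)
sigma_hat = (ind.T @ ind) / n - np.outer(Pn_f, Pn_f)

# True covariance: sigma(f_s, f_t) = Phi(min(s, t)) - Phi(s) Phi(t)
sigma = np.minimum.outer(Phi, Phi) - np.outer(Phi, Phi)
print(np.max(np.abs(sigma_hat - sigma)))  # uniform error, small for large n
```

The printed sup-norm error shrinks as n grows, consistent with the lemma (the envelope condition holds trivially here since indicators are bounded).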

We close this section with the following theorem that provides several interesting necessary and sufficient conditions for F to be strong G-C:

Theorem 9.29 Let F be a class of measurable functions. Then the following are equivalent:

(i) F is strong P-G-C;

(ii) E∗‖P_n − P‖_F → 0 and E∗‖f − Pf‖_F < ∞;

(iii) ‖P_n − P‖_F → 0 in probability and E∗‖f − Pf‖_F < ∞.

Proof. Since P_n − P does not change when the class F is replaced by {f − Pf : f ∈ F}, we will assume hereafter that ‖P‖_F = 0 without loss of generality.

(i)⇒(ii): That (i) implies E∗‖f‖_F < ∞ follows from Lemma 8.13. Fix 0 < M < ∞, and note that

(9.5)   E∗‖P_n − P‖_F ≤ E∗‖(P_n − P) f 1{F ≤ M}‖_F + 2E∗[F 1{F > M}].

By Assertion (ii) of Corollary 9.27, F · 1{F ≤ M} is strong P-G-C, and thus the first term on the right of (9.5) → 0 by the bounded convergence theorem. Since the second term on the right of (9.5) can be made arbitrarily small by increasing M, the left side of (9.5) → 0, and the desired result follows.

(ii)⇒(iii): This is obvious.

(iii)⇒(i): By the assumed integrability of the envelope F, Lemma 8.16 can be employed to verify that there is a version of ‖P_n − P‖∗_F that converges outer almost surely to a constant. Condition (iii) implies that this constant must be zero.



9.4 Donsker Preservation

In this section, we describe several techniques for building Donsker classes from other Donsker classes. The first theorem, Theorem 9.30, gives results for subsets, pointwise closures and symmetric convex hulls of Donsker classes. The second theorem, Theorem 9.31, presents a very powerful result for Lipschitz functions of Donsker classes. The corollary that follows presents consequences of this theorem that are quite useful in statistical applications.

For a class F of real-valued, measurable functions on the sample space X, let F̄^{(P,2)} be the set of all f : X → R for which there exists a sequence {f_m} ∈ F such that f_m → f both pointwise (i.e., for every argument x ∈ X) and in L_2(P). Similarly, let sconv^{(P,2)} F be the pointwise and L_2(P) closure of sconv F defined in Section 9.1.1.

Theorem 9.30 Let F be a P-Donsker class. Then

(i) For any G ⊂ F, G is P-Donsker.

(ii) F̄^{(P,2)} is P-Donsker.

(iii) sconv^{(P,2)} F is P-Donsker.

Proof. The proof of (i) is obvious by the facts that weak convergence consists of marginal convergence plus asymptotic equicontinuity and that the maximum modulus of continuity does not increase when maximizing over a smaller set. For (ii), one can without loss of generality assume that both F and F̄^{(P,2)} are mean-zero classes. For a class of measurable functions G, denote the modulus of continuity

M_G(δ) ≡ sup_{f,g∈G : ‖f−g‖_{P,2} < δ} |G_n(f − g)|.

Fix δ > 0. We can choose f, g ∈ F̄^{(P,2)} such that |G_n(f − g)| is arbitrarily close to M_{F̄^{(P,2)}}(δ) and ‖f − g‖_{P,2} < δ. We can also choose f∗, g∗ ∈ F such that ‖f − f∗‖_{P,2} and ‖g − g∗‖_{P,2} are arbitrarily small (for fixed data). Thus M_{F̄^{(P,2)}}(δ) ≤ M_F(2δ). Since δ > 0 was arbitrary, we obtain that asymptotic equicontinuity in probability for F̄^{(P,2)} follows from asymptotic equicontinuity in probability of {G_n(f) : f ∈ F}. Part (iii) is Theorem 2.10.3 of VW, and we omit its proof.

The following theorem, Theorem 2.10.6 of VW, is one of the most useful Donsker preservation results for statistical applications. We omit the proof.

Theorem 9.31 Let F_1, . . . , F_k be Donsker classes with max_{1≤i≤k} ‖P‖_{F_i} < ∞. Let φ : R^k → R satisfy

|φ ∘ f(x) − φ ∘ g(x)|^2 ≤ c^2 Σ_{i=1}^k (f_i(x) − g_i(x))^2,



for every f, g ∈ F_1 × · · · × F_k and x ∈ X, and for some constant c < ∞. Then φ ∘ (F_1, . . . , F_k) is Donsker provided φ ∘ f is square integrable for at least one f ∈ F_1 × · · · × F_k.

The following corollary is a useful consequence of this theorem:

Corollary 9.32 Let F and G be Donsker classes. Then:

(i) F ∪ G and F + G are Donsker.

(ii) If ‖P‖_{F∪G} < ∞, then the classes of pairwise infima, F ∧ G, and pairwise suprema, F ∨ G, are both Donsker.

(iii) If F and G are both uniformly bounded, F · G is Donsker.

(iv) If ψ : R → R is Lipschitz continuous, where R is the range of functions in F, and ‖ψ(f)‖_{P,2} < ∞ for at least one f ∈ F, then ψ(F) is Donsker.

(v) If ‖P‖_F < ∞ and g is a uniformly bounded, measurable function, then F · g is Donsker.

Proof. For any measurable function f, let f̄ ≡ f − Pf. Also define F̄ ≡ {f̄ : f ∈ F} and Ḡ ≡ {ḡ : g ∈ G}. Note that for any f ∈ F and g ∈ G, G_n f = G_n f̄, G_n g = G_n ḡ and G_n(f + g) = G_n(f̄ + ḡ). Hence F ∪ G is Donsker if and only if F̄ ∪ Ḡ is Donsker; and, similarly, F + G is Donsker if and only if F̄ + Ḡ is Donsker. Clearly, ‖P‖_{F̄∪Ḡ} = 0. Hence Lipschitz continuity of the map (x, y) ↦ x + y on R^2 yields that F̄ + Ḡ is Donsker, via Theorem 9.31. Hence also F + G is Donsker. Since F̄ ∪ Ḡ is contained in the sum of F̄ ∪ {0} and Ḡ ∪ {0}, F̄ ∪ Ḡ is Donsker, and hence so is F ∪ G. Thus Part (i) follows.

Proving Parts (ii) and (iv) is saved as an exercise. Part (iii) follows since the map (x, y) ↦ xy is Lipschitz continuous on bounded subsets of R^2. For Part (v), note that for any f_1, f_2 ∈ F, |f_1(x)g(x) − f_2(x)g(x)| ≤ ‖g‖_∞ |f_1(x) − f_2(x)|. Hence φ ∘ (F, {g}), where φ(x, y) = xy, is Lipschitz continuous in the required manner.

9.5 Proofs

Proof of Theorem 9.3. Let C denote the set of all subgraphs C_f of functions f ∈ F. Note that for any probability measure Q on X and any f, g ∈ F,

Q|f − g| = ∫_X ∫_R |1{t < f(x)} − 1{t < g(x)}| dt Q(dx) = (Q × λ)(C_f Δ C_g),



where λ is Lebesgue measure, AΔB ≡ A ∪ B − A ∩ B for any two sets A, B, and the second equality follows from Fubini's theorem. Construct a probability measure P on X × R by restricting Q × λ to the set {(x, t) : |t| ≤ F(x)} and letting P = (Q × λ)/(2‖F‖_{Q,1}). Now Q|f − g| = 2‖F‖_{Q,1} P|1_{C_f} − 1_{C_g}|. Thus, by Theorem 9.2 above,

(9.6)   N(ǫ2‖F‖_{Q,1}, F, L_1(Q)) = N(ǫ, C, L_1(P)) ≤ KV(C)(4e)^{V(C)} (1/ǫ)^{V(C)−1}.

For r > 1, we have

Q|f − g|^r ≤ Q[|f − g|(2F)^{r−1}] = 2^{r−1} R|f − g| QF^{r−1},

where R is the probability measure with density F^{r−1}/QF^{r−1} with respect to Q. Thus

‖f − g‖_{Q,r} ≤ 2^{1−1/r} ‖f − g‖^{1/r}_{R,1} (QF^{r−1})^{1/r} = 2^{1−1/r} ‖f − g‖^{1/r}_{R,1} ‖F‖_{Q,r} (QF^{r−1}/QF^r)^{1/r},

which implies

‖f − g‖_{Q,r} / (2‖F‖_{Q,r}) ≤ ( ‖f − g‖_{R,1} / (2‖F‖_{R,1}) )^{1/r}.

Hence N(ǫ2‖F‖_{Q,r}, F, L_r(Q)) ≤ N(ǫ^r 2‖F‖_{R,1}, F, L_1(R)). Since (9.6) applies equally well with Q replaced by R, we now have that

N(ǫ2‖F‖_{Q,r}, F, L_r(Q)) ≤ KV(C)(4e)^{V(C)} (1/ǫ)^{r(V(C)−1)}.

The desired result now follows by bringing the factor 2 on the left-hand side over to the numerator of 1/ǫ on the right-hand side.
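The change-of-measure step above is easy to check numerically on a discrete Q. The sketch below is an illustrative verification only (the measure, functions, and envelope are arbitrary random choices with F ≥ |f| ∨ |g|); it confirms the displayed inequality relating the L_r(Q) and L_1(R) distances.

```python
import numpy as np

rng = np.random.default_rng(3)
m, r = 50, 3
q = rng.dirichlet(np.ones(m))               # a discrete probability measure Q
f, g = rng.uniform(-1, 1, m), rng.uniform(-1, 1, m)
F = np.maximum(np.abs(f), np.abs(g)) + 0.1  # an envelope with |f|, |g| <= F

w = q * F ** (r - 1)
R = w / w.sum()                             # dR/dQ = F^{r-1} / (Q F^{r-1})

lhs = (q @ np.abs(f - g) ** r) ** (1 / r) / (2 * (q @ F ** r) ** (1 / r))
rhs = ((R @ np.abs(f - g)) / (2 * (R @ F))) ** (1 / r)
assert lhs <= rhs + 1e-12                   # the displayed inequality holds
```

Since |f − g| ≤ 2F pointwise, the first inequality in the derivation applies, and the rest is an algebraic identity, so the assertion holds for any such draw.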

Proof of Lemma 9.9. We leave it as an exercise to show that Parts (i) and (ii) follow from Parts (ii) and (iii) of Lemma 9.7. To prove (iii), note first that the sets {x : f(x) > 0} are (obviously) one-to-one images of the sets {(x, 0) : f(x) > 0}, which are the intersections of the open subgraphs {(x, t) : f(x) > t} with the set X × {0}. Since a single set has VC index 1, the result now follows by applying (ii) and then (iv) of Lemma 9.7. The subgraphs of −F are the images of the open supergraphs {(x, t) : t > f(x)} under the one-to-one map (x, t) ↦ (x, −t). Since the open supergraphs are the complements of the closed subgraphs {(x, t) : t ≥ f(x)}, they have the same VC-index as F by Lemma 9.33 below and by Part (i) of Lemma 9.7.

Part (v) follows from the fact that F + g shatters a given set of points (x_1, t_1), . . . , (x_n, t_n) if and only if F shatters (x_1, t′_1), . . . , (x_n, t′_n), where



t′_i = t_i − g(x_i), 1 ≤ i ≤ n. For Part (vi), note that for any f ∈ F the subgraph of fg is the union of the sets C^+_f ≡ {(x, t) : t < f(x)g(x), g(x) > 0}, C^−_f ≡ {(x, t) : t < f(x)g(x), g(x) < 0} and C^0_f ≡ {(x, t) : t < 0, g(x) = 0}. Define C^+ ≡ {C^+_f : f ∈ F}, C^− ≡ {C^−_f : f ∈ F} and C^0 ≡ {C^0_f : f ∈ F}. By Exercise 9.6.6 below, it suffices to show that these three classes are VC on the respective disjoint sets (X ∩ {x : g(x) > 0}) × R, (X ∩ {x : g(x) < 0}) × R and (X ∩ {x : g(x) = 0}) × R, with respective VC indices bounded by V_F, V_F and 1. Consider first C^+ on X ∩ {x : g(x) > 0}. Note that the subset (x_1, t_1), . . . , (x_m, t_m) is shattered by C^+ if and only if the subset (x_1, t_1/g(x_1)), . . . , (x_m, t_m/g(x_m)) is shattered by the subgraphs of F. Thus the VC-index of C^+ on the relevant subset of X × R is V_F. The same VC-index occurs for C^−, but the VC-index for C^0 is clearly 1. This concludes the proof of (vi).

For (vii), the result follows from Part (vi) of Lemma 9.7, since the subgraphs of the class F ∘ ψ are the inverse images of the subgraphs of F under the map (z, t) ↦ (ψ(z), t). For Part (viii), suppose that the subgraphs of φ ∘ F shatter the set of points (x_1, t_1), . . . , (x_n, t_n). Now choose f_1, . . . , f_m from F so that the subgraphs of the functions φ ∘ f_j pick out all m = 2^n subsets. For each 1 ≤ i ≤ n, define s_i to be the largest value of f_j(x_i) over those j ∈ {1, . . . , m} for which φ(f_j(x_i)) ≤ t_i. Now note that f_j(x_i) ≤ s_i if and only if φ(f_j(x_i)) ≤ t_i, for all 1 ≤ i ≤ n and 1 ≤ j ≤ m. Hence the subgraphs of f_1, . . . , f_m shatter the points (x_1, s_1), . . . , (x_n, s_n).

Proof of Lemma 9.11. First consider the class H_{+,r} of monotone increasing, right-continuous functions h : R → [0, 1]. For each h ∈ H_{+,r}, define h^{−1}(t) ≡ inf{x : h(x) ≥ t}, and note that for any x, t ∈ R, h(x) ≥ t if and only if x ≥ h^{−1}(t). Thus the class of indicator functions {1{h(x) ≥ t} : h ∈ H_{+,r}} = {1{x ≥ h^{−1}(t)} : h ∈ H_{+,r}} ⊂ {1{x ≥ t} : t ∈ R}. Since the last class of sets has VC index 2, the first class is also VC with index 2. Since each function h ∈ H_{+,r} is the pointwise limit of the sequence

h_m = Σ_{j=1}^m (1/m) 1{h(x) ≥ j/m},

we have that H_{+,r} is contained in the closed convex hull of a VC-subgraph class with VC index 2. Thus, by Corollary 9.5, we have for all 0 < ǫ < 1,

sup_Q log N(ǫ, H_{+,r}, L_2(Q)) ≤ K_0/ǫ,

where the supremum is taken over all probability measures Q and the constant K_0 is universal. Now consider the class H_{+,l} of monotone increasing, left-continuous functions, and define h^{−1}(t) ≡ sup{x : h(x) ≤ t}. Now note that for any x, t ∈ R, h(x) > t if and only if x > h^{−1}(t). Arguing as before, we deduce that {1{h(x) > t} : h ∈ H_{+,l}} is a VC-class with index 2. Since each h ∈ H_{+,l} is the pointwise limit of the sequence



h_m = Σ_{j=1}^m (1/m) 1{h(x) > j/m},

we can again apply Corollary 9.5 to arrive at the same uniform entropy bound we arrived at for H_{+,r}.

Now let H_+ be the class of all monotone increasing functions h : R → [0, 1], and note that each h ∈ H_+ can be written as h_r + h_l, where h_r ∈ H_{+,r} and h_l ∈ H_{+,l}. Hence for any probability measure Q and any h^{(1)}, h^{(2)} ∈ H_+, ‖h^{(1)} − h^{(2)}‖_{Q,2} ≤ ‖h^{(1)}_r − h^{(2)}_r‖_{Q,2} + ‖h^{(1)}_l − h^{(2)}_l‖_{Q,2}, where h^{(1)}_r, h^{(2)}_r ∈ H_{+,r} and h^{(1)}_l, h^{(2)}_l ∈ H_{+,l}. Thus N(ǫ, H_+, L_2(Q)) ≤ N(ǫ/2, H_{+,r}, L_2(Q)) × N(ǫ/2, H_{+,l}, L_2(Q)), and hence

sup_Q log N(ǫ, H_+, L_2(Q)) ≤ K_1/ǫ,

where K_1 = 4K_0. Since any monotone decreasing function g : R → [0, 1] can be written as 1 − h, where h ∈ H_+, the uniform entropy numbers for the class of all monotone functions f : R → [0, 1], which we denote F, are log(2) plus the uniform entropy numbers for H_+. Since 0 < ǫ < 1, we obtain the desired conclusion given in the statement of the lemma, with K = √2 K_1 = √32 K_0.

Lemma 9.33 Let F be a set of measurable functions on X. Then the closed subgraphs have the same VC-index as the open subgraphs.

Proof. Suppose the closed subgraphs (the subgraphs of the form {(x, t) : t ≤ f(x)}) shatter the set of points (x_1, t_1), . . . , (x_n, t_n). Now select out of F functions f_1, . . . , f_m whose closed subgraphs shatter all m = 2^n subsets. Let ǫ = (1/2) inf{t_i − f_j(x_i) : t_i − f_j(x_i) > 0}, and note that the open subgraphs (the subgraphs of the form {(x, t) : t < f(x)}) of the f_1, . . . , f_m shatter the set of points (x_1, t_1 − ǫ), . . . , (x_n, t_n − ǫ). This follows since, by construction, t_i − ǫ ≥ f_j(x_i) if and only if t_i > f_j(x_i), for all 1 ≤ i ≤ n and 1 ≤ j ≤ m. Now suppose the open subgraphs shatter the set of points (x_1, t_1), . . . , (x_n, t_n). Select out of F functions f_1, . . . , f_m whose open subgraphs shatter all m = 2^n subsets, and let ǫ = (1/2) inf{f_j(x_i) − t_i : f_j(x_i) − t_i > 0}. Note now that the closed subgraphs of f_1, . . . , f_m shatter the set of points (x_1, t_1 + ǫ), . . . , (x_n, t_n + ǫ), since, by construction, t_i < f_j(x_i) if and only if t_i + ǫ ≤ f_j(x_i). Thus the VC-indices of open and closed subgraphs are the same.

9.6 Exercises

9.6.1. Show that sconv F ⊂ conv G, where G = F ∪ (−F) ∪ {0}.



9.6.2. Show that the expression N(ǫ‖aF‖_{bQ,r}, aF, L_r(bQ)) does not depend on the constants 0 < a, b < ∞, where 1 ≤ r < ∞.

9.6.3. Prove Corollary 9.5.

9.6.4. In the proof of Lemma 9.7, verify that Part (iii) follows from Parts (i) and (ii).

9.6.5. Show that Parts (i) and (ii) of Lemma 9.9 follow from Parts (ii) and (iii) of Lemma 9.7.

9.6.6. Let X = ⋃_{i=1}^m X_i, where the X_i are disjoint; and assume C_i is a VC-class of subsets of X_i, with VC-index V_i, 1 ≤ i ≤ m. Show that ⊔_{i=1}^m C_i is a VC-class in X with VC-index V_1 + · · · + V_m − m + 1. Hint: Note that C_1 ∪ X_2 and X_1 ∪ C_2 are VC on X_1 ∪ X_2 with respective indices V_1 and V_2. Now use Part (ii), not Part (iii), of Lemma 9.7 to show that C_1 ⊔ C_2 is VC on X_1 ∪ X_2 with VC index V_1 + V_2 − 1.

9.6.7. Show that if F and G are BUEI with respective envelopes F and G, then F ⊔ G is BUEI with envelope F ∨ G.

9.6.8. In the context of the simple linear regression example of Section 4.4.1, verify the following:

(a) Show that both G_1 and G_2 are Donsker even though neither U nor e are bounded. Hint: Use BUEI preservation results.

(b) Verify that both

sup_{z∈[a+h,b−h]} | ∫_R h^{−1} L((z − u)/h) H(du) − H(z) | = O(h)

and

( sup_{z∈[a,a+h)} |H(z) − H(a + h)| ) ∨ ( sup_{z∈(b−h,b]} |H(z) − H(b − h)| ) = O(h).

(c) Show that F1 is Donsker and F2 is Glivenko-Cantelli.

9.6.9. Prove Lemma 9.25.

9.6.10. Consider the class F of all functions f : [0, 1] → [0, 1] such that |f(x) − f(y)| ≤ |x − y|. Show that a set of ǫ-brackets can be constructed to cover F with cardinality bounded by exp(C/ǫ) for some 0 < C < ∞. Hint: Fix ǫ > 0, and let n be the smallest integer ≥ 3/ǫ. For any p = (k_0, . . . , k_n) ∈ P ≡ {1, . . . , n}^{n+1}, define the path p to be the collection of all functions in F such that f ∈ p only if f(i/n) ∈ [(k_i − 1)/n, k_i/n] for all i = 0, . . . , n. Show that for all f ∈ F, if f(i/n) ∈ [j/n, (j + 1)/n], then

Page 188: Springer Series in Statistics - stat.ncsu.edulu/ST7903/Kosorok-Introduction_to... · Springer Series in Statistics Borg/Groenen: Modern Multidimensional Scaling: Theory and Applications,

178 9. Entropy Calculations

f((i + 1)/n) ∈ [ ((j − 1) ∨ 0)/n, ((j + 2) ∧ n)/n ]

for i, j = 0, . . . , n − 1. Show that this implies that the number of paths of the form p for p ∈ P needed to "capture" all elements of F is bounded by n3^n. Now show that for each p ∈ P, there exists a pair of right-continuous "bracketing" functions L_p, U_p : [0, 1] → [0, 1] such that for all x ∈ [0, 1], L_p(x) < U_p(x), U_p(x) − L_p(x) ≤ 3/n ≤ ǫ, and L_p(x) ≤ f(x) ≤ U_p(x) for all f ∈ p. Now complete the proof.

9.6.11. Show that if F is Donsker with ‖P‖_F < ∞ and f ≥ δ for all f ∈ F and some constant δ > 0, then 1/F ≡ {1/f : f ∈ F} is Donsker.

9.6.12. Complete the proof of Corollary 9.32:

1. Prove Part (ii). Hint: show first that for any real numbers a1, a2, b1, b2,|a1 ∧ b1 − a2 ∧ b2| ≤ |a1 − a2|+ |b1 − b2|.

2. Prove Part (iv).

9.7 Notes

Theorem 9.3 is a minor modification of Theorem 2.6.7 of VW. Corollary 9.5, Lemma 9.6 and Theorem 9.23 are Corollary 2.6.12, Lemma 2.6.15 and Theorem 2.7.11, respectively, of VW. Lemmas 9.7 and 9.9 are modifications of Lemmas 2.6.17 and 2.6.18, respectively, of VW. Lemma 9.11 was suggested by Example 2.6.21 of VW, and Lemma 9.12 is a modification of Exercise 2.6.14 of VW. Parts (i) and (ii) of Theorem 9.30 are Theorems 2.10.1 and 2.10.2, respectively, of VW. Corollary 9.32 includes some modifications of Examples 2.10.7, 2.10.8 and 2.10.10 of VW. Lemma 9.33 was suggested by Exercise 2.6.10 of VW. Exercise 9.6.10 is a modification of Exercise 19.5 of van der Vaart (1998).

The bounded uniform entropy integral (BUEI) preservation techniques presented here grew out of the author's work on estimating equations for functional data described in Fine, Yan and Kosorok (2004).


10 Bootstrapping Empirical Processes

The purpose of this chapter is to obtain consistency results for bootstrapped empirical processes. These results can then be applied to many kinds of bootstrapped estimators, since most estimators can be expressed as functionals of empirical processes. Most of the bootstrap results for such estimators will be deferred to later chapters, where we discuss the functional delta method, Z-estimation and M-estimation. We do, however, present one specialized result for parametric Z-estimators in Section 3 of this chapter as a practical illustration of bootstrap techniques.

We note that both conditional and unconditional bootstrap consistency results can be useful, depending on the application. For the conditional bootstrap, the goal is to establish convergence of the conditional law given the data to an unconditional limit law. This convergence can be either in probability or outer almost sure. While the latter convergence is certainly stronger, convergence in probability is usually sufficient for statistical applications.

The best choice of bootstrap weights for a given statistical application is also an important question, and the answer depends on the application. While the multinomial bootstrap is conceptually simple, its use in survival analysis applications may result in too much tied data. In the presence of censoring, it is even possible that a bootstrap sample could be drawn that consists of only censored observations. To avoid complications of this kind, it may be better to use the Bayesian bootstrap (Rubin, 1981). The weights for the Bayesian bootstrap are ξ_1/ξ̄, . . . , ξ_n/ξ̄, where ξ_1, . . . , ξ_n are i.i.d. standard exponential (mean and variance 1), independent of the data X_1, . . . , X_n, and where ξ̄ ≡ n^{−1} Σ_{i=1}^n ξ_i. Since these weights are strictly



positive, all observations are represented in each bootstrap realization, and the aforementioned problem with tied data won't happen unless the original data has ties. Both the multinomial and Bayesian bootstraps are included in the bootstrap weights we discuss in this chapter.
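As a small illustration (a sketch with simulated data, not from the text), both weighting schemes are one line each in code; note that the Bayesian bootstrap weights are strictly positive by construction, so every observation appears in every bootstrap draw:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 200
x = rng.exponential(size=n)  # arbitrary simulated data

# Multinomial (nonparametric) bootstrap: counts/n; some observations
# may receive zero weight
w_mult = rng.multinomial(n, np.ones(n) / n) / n

# Bayesian bootstrap (Rubin, 1981): xi_i i.i.d. standard exponential;
# the normalized weights are strictly positive
xi = rng.exponential(size=n)
w_bayes = xi / xi.sum()

assert np.all(w_bayes > 0)       # every observation represented
boot_mean_mult = w_mult @ x      # one multinomial bootstrap draw of the mean
boot_mean_bayes = w_bayes @ x    # one Bayesian bootstrap draw of the mean
```

Repeating the weight draws B times gives the bootstrap distribution of any weighted functional of the data.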

The multinomial weighted bootstrap is sometimes called the nonparametric bootstrap, since it amounts to sampling from the empirical distribution, which is a nonparametric estimate of the true distribution. In contrast, the parametric bootstrap is obtained by sampling from a parametric estimate P_{θ̂_n} of the true distribution, where θ̂_n is a consistent estimate of the true value of θ (see, for example, Chapter 1 of Shao and Tu, 1995). A detailed discussion of the parametric bootstrap is beyond the scope of this chapter. Another kind of bootstrap is the exchangeable weighted bootstrap, which we only mention briefly in Lemma 10.18 below. This lemma is needed for the proof of Theorem 10.15.

We also note that the asymptotic results of this chapter are all first order, and in this situation the limiting results do not vary among those schemes that satisfy the stated conditions. A more refined analysis of differences between weighting schemes is beyond the scope of this chapter, but such differences may be important in small samples. A good reference for higher order properties of the bootstrap is Hall (1992).

The first section of this chapter considers unconditional and conditional convergence of bootstrapped empirical processes to limiting laws when the class of functions involved is Donsker. The main result of this section is a proof of Theorems 2.6 and 2.7 on Page 20 of Section 2.2.3. At the end of the section, we present several special continuous mapping results for bootstrapped processes. The second section considers parallel results when the function class involved is Glivenko-Cantelli. In this case, the limiting laws are degenerate, i.e., constant with probability 1. Such results are helpful for establishing consistency of bootstrapped estimators. The third section presents the simple Z-estimator illustration promised above. Throughout this chapter, we will sometimes for simplicity omit the subscript when referring to a representative of an i.i.d. sample. For example, we may use E|ξ| to refer to E|ξ_1|, where ξ_1 is the first member of the sample ξ_1, . . . , ξ_n. The context will make the meaning clear.

10.1 The Bootstrap for Donsker Classes

The overall goal of this section is to prove the validity of the bootstrap central limit theorems given in Theorems 2.6 and 2.7 on Page 20 of Chapter 2. Both unconditional and conditional multiplier central limit theorems play a pivotal role in this development and will be presented first. At the end of the section, we also present several special continuous mapping results which apply to bootstrapped processes. These results allow the construction of

Page 191: Springer Series in Statistics - stat.ncsu.edulu/ST7903/Kosorok-Introduction_to... · Springer Series in Statistics Borg/Groenen: Modern Multidimensional Scaling: Theory and Applications,

10.1 The Bootstrap for Donsker Classes 181

asymptotically uniformly valid confidence bands for Pf : f ∈ F when Fis Donsker.

10.1.1 An Unconditional Multiplier Central Limit Theorem

In this section, we present a multiplier central limit theorem that forms the basis for proving the unconditional central limit theorems of the next section. We also present an interesting corollary. For a real random variable ξ, recall from Section 2.2.3 the quantity

‖ξ‖_{2,1} ≡ ∫_0^∞ √(P(|ξ| > x)) dx.

Exercise 10.5.1 below verifies that this is a norm which is slightly larger than ‖·‖_2. Also recall that δ_{Xi} is the probability measure that assigns a mass of 1 to Xi, so that Pn = n^{-1} Σ_{i=1}^n δ_{Xi} and Gn = n^{-1/2} Σ_{i=1}^n (δ_{Xi} − P).

Theorem 10.1 (Multiplier central limit theorem) Let F be a class of measurable functions, and let ξ1, . . . , ξn be i.i.d. random variables with mean zero, variance 1, and ‖ξ‖_{2,1} < ∞, independent of the sample data X1, . . . , Xn. Let G′n ≡ n^{-1/2} Σ_{i=1}^n ξi(δ_{Xi} − P) and G″n ≡ n^{-1/2} Σ_{i=1}^n (ξi − ξ̄)δ_{Xi}, where ξ̄ ≡ n^{-1} Σ_{i=1}^n ξi. Then the following are equivalent:

(i) F is P-Donsker;

(ii) G′n converges weakly to a tight process in ℓ∞(F);

(iii) G′n ⇝ G in ℓ∞(F);

(iv) G″n ⇝ G in ℓ∞(F).
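To make the two processes concrete, here is a minimal numerical sketch (our own illustration, not from the text; the function names and the uniform setting are assumptions). It evaluates G′n and G″n over the Donsker class of half-line indicators {1{x ≤ t} : 0 < t < 1} with P = Uniform(0, 1), using standard normal multipliers, which satisfy the mean-zero, variance-1, and ‖ξ‖_{2,1} < ∞ requirements.

```python
import numpy as np

def multiplier_processes(x, xi, grid, cdf):
    # G'_n(t)  = n^{-1/2} sum_i xi_i (1{X_i <= t} - F(t))
    # G''_n(t) = n^{-1/2} sum_i (xi_i - xibar) 1{X_i <= t}
    n = len(x)
    ind = (x[:, None] <= grid[None, :]).astype(float)
    g1 = xi @ (ind - cdf(grid)[None, :]) / np.sqrt(n)
    g2 = (xi - xi.mean()) @ ind / np.sqrt(n)
    return g1, g2

rng = np.random.default_rng(1)
n = 4000
x = rng.uniform(size=n)      # sample data, P = Uniform(0, 1)
xi = rng.normal(size=n)      # mean-zero, variance-1 multipliers
t = np.linspace(0.1, 0.9, 9)
g1, g2 = multiplier_processes(x, xi, t, lambda s: s)
print(np.max(np.abs(g1 - g2)))
```

The printed gap is small because here G′n(t) − G″n(t) = ξ̄ Gn(t) and ξ̄ = O_P(n^{-1/2}); this is the same identity, in the form G′n − G″n = √n ξ̄ Pn after centering, that drives the (iii)⇒(iv) step of the proof below.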

Before giving the proof of this theorem, we will need the following tool. This lemma is Lemma 2.9.1 of VW, and we give it without proof:

Lemma 10.2 (Multiplier inequalities) Let Z1, . . . , Zn be i.i.d. stochastic processes with index F such that E*‖Z‖_F < ∞, independent of the i.i.d. Rademacher variables ǫ1, . . . , ǫn. Then for every i.i.d. sample ξ1, . . . , ξn of real, mean-zero random variables independent of Z1, . . . , Zn, and any 1 ≤ n0 ≤ n,

(1/2)‖ξ‖_1 E*‖n^{-1/2} Σ_{i=1}^n ǫi Zi‖_F ≤ E*‖n^{-1/2} Σ_{i=1}^n ξi Zi‖_F
  ≤ 2(n0 − 1) E*‖Z‖_F E max_{1≤i≤n} |ξi|/√n + 2√2 ‖ξ‖_{2,1} max_{n0≤k≤n} E*‖k^{-1/2} Σ_{i=n0}^k ǫi Zi‖_F.

When the ξi are symmetrically distributed, the constants 1/2, 2 and 2√2 can all be replaced by 1.


Proof of Theorem 10.1. Note that the processes G, Gn, G′n and G″n do not change if they are indexed by {f − Pf : f ∈ F} rather than F. Thus we can assume throughout the proof that ‖P‖_F = 0 without loss of generality.

(i)⇔(ii): Convergence of the finite-dimensional marginal distributions of Gn and G′n is equivalent to F ⊂ L2(P), and thus it suffices to show that the asymptotic equicontinuity conditions of the two processes are equivalent. By Lemma 8.17, if F is Donsker, then P*(F > x) = o(x^{-2}) as x → ∞, where F is an envelope of F. Similarly, if ξ · F is Donsker, then P*(|ξ| × F > x) = o(x^{-2}) as x → ∞. In both cases, P*F < ∞. Since the variance of ξ is finite, we have by Exercise 10.5.2 below that E* max_{1≤i≤n} |ξi|/√n → 0. Combining this with the multiplier inequality (Lemma 10.2), we have

(1/2)‖ξ‖_1 lim sup_{n→∞} E*‖n^{-1/2} Σ_{i=1}^n ǫi f(Xi)‖_{Fδ} ≤ lim sup_{n→∞} E*‖n^{-1/2} Σ_{i=1}^n ξi f(Xi)‖_{Fδ}
  ≤ 2√2 ‖ξ‖_{2,1} sup_{k≥n0} E*‖k^{-1/2} Σ_{i=1}^k ǫi f(Xi)‖_{Fδ},

for every δ > 0 and n0 ≤ n. By the symmetrization theorem (Theorem 8.8), we can remove the Rademacher variables ǫ1, . . . , ǫn at the cost of changing the constants. Hence, for any sequence δn ↓ 0, E*‖n^{-1/2} Σ_{i=1}^n (δ_{Xi} − P)‖_{Fδn} → 0 if and only if E*‖n^{-1/2} Σ_{i=1}^n ξi(δ_{Xi} − P)‖_{Fδn} → 0. By Lemma 8.17, these mean versions of the asymptotic equicontinuity conditions imply the probability versions, and the desired results follow. We have actually proved that the first three assertions are equivalent.

(iii)⇒(iv): Note that by the equivalence of (i) and (iii), F is Glivenko-Cantelli. Since G′n − G″n = √n ξ̄ Pn, we now have that ‖G′n − G″n‖_F →^P 0. Thus (iv) follows.

(iv)⇒(i): Let (Y1, . . . , Yn) be an independent copy of (X1, . . . , Xn), and let (ξ̃1, . . . , ξ̃n) be an independent copy of (ξ1, . . . , ξn), so that (ξ1, . . . , ξn, ξ̃1, . . . , ξ̃n) is independent of (X1, . . . , Xn, Y1, . . . , Yn). Let ξ̄ be the pooled mean of the ξi's and ξ̃i's; set

G″_{2n} ≡ (2n)^{-1/2} ( Σ_{i=1}^n (ξi − ξ̄)δ_{Xi} + Σ_{i=1}^n (ξ̃i − ξ̄)δ_{Yi} )

and define

G̃″_{2n} ≡ (2n)^{-1/2} ( Σ_{i=1}^n (ξ̃i − ξ̄)δ_{Xi} + Σ_{i=1}^n (ξi − ξ̄)δ_{Yi} ).

We now have that both G″_{2n} ⇝ G and G̃″_{2n} ⇝ G in ℓ∞(F). Thus, by the definition of weak convergence, (F, ρP) is totally bounded and, for any sequence δn ↓ 0, both ‖G″_{2n}‖_{Fδn} →^P 0 and ‖G̃″_{2n}‖_{Fδn} →^P 0. Hence also ‖G″_{2n} − G̃″_{2n}‖_{Fδn} →^P 0. However, since

G″_{2n} − G̃″_{2n} = n^{-1/2} Σ_{i=1}^n ((ξi − ξ̃i)/√2) (δ_{Xi} − δ_{Yi}),

and since the weights (ξi − ξ̃i)/√2 satisfy the moment conditions of the theorem we are proving, we now have that Ğn ≡ n^{-1/2} Σ_{i=1}^n (δ_{Xi} − δ_{Yi}) ⇝ √2 G in ℓ∞(F), by the already proved equivalence between (iii) and (i). Thus, for any sequence δn ↓ 0, E*‖Ğn‖_{Fδn} → 0. Since also

E_Y ‖n^{-1/2} Σ_{i=1}^n (f(Xi) − f(Yi))‖_{Fδn} ≥ ‖n^{-1/2} Σ_{i=1}^n (f(Xi) − Ef(Yi))‖_{Fδn} = ‖n^{-1/2} Σ_{i=1}^n f(Xi)‖_{Fδn},

we can invoke Fubini's theorem (Lemma 6.14) to yield

E*‖Ğn‖_{Fδn} ≥ E*‖Gn‖_{Fδn} → 0.

Hence F is Donsker.

We now present the following interesting corollary, which shows the possibly unexpected result that the multiplier empirical process is asymptotically independent of the usual empirical process, even though the same data X1, . . . , Xn are used in both processes:

Corollary 10.3 Assume the conditions of Theorem 10.1 hold and that F is Donsker. Then (Gn, G′n, G″n) ⇝ (G, G′, G′) in [ℓ∞(F)]³, where G and G′ are independent P-Brownian bridges.

Proof. By the preceding theorem, the three processes are asymptotically tight marginally and hence asymptotically tight jointly. Since the first process is uncorrelated with the second process, the limiting distribution of the first process is independent of the limiting distribution of the second process. As argued in the proof of the multiplier central limit theorem, the uniform difference between G′n and G″n goes to zero in probability, and thus the remainder of the corollary follows.

10.1.2 Conditional Multiplier Central Limit Theorems

In this section, the convergence properties of the multiplier processes of the previous section are studied conditional on the data. This yields in-probability and outer-almost-sure conditional multiplier central limit theorems. These results are one step closer to the bootstrap validity results of the next section. For a metric space (D, d), define BL1(D) to be the space of all functions f : D → R with Lipschitz norm bounded by 1, i.e., ‖f‖_∞ ≤ 1 and |f(x) − f(y)| ≤ d(x, y) for all x, y ∈ D. In the current set-up, D = ℓ∞(F) for some class of measurable functions F, and d is the corresponding uniform metric. As we did in Section 2.2.3, we will use BL1 as shorthand for BL1(ℓ∞(F)). The conditional weak convergence arrows ⇝^P_ξ and ⇝^{as*}_ξ used in Theorems 10.4 and 10.6 below were also defined in Section 2.2.3.

We now present the in-probability conditional multiplier central limit theorem:

Theorem 10.4 Let F be a class of measurable functions, and let ξ1, . . . , ξn be i.i.d. random variables with mean zero, variance 1, and ‖ξ‖_{2,1} < ∞, independent of the sample data X1, . . . , Xn. Let G′n, G″n and ξ̄ be as defined in Theorem 10.1. Then the following are equivalent:

(i) F is Donsker;

(ii) G′n ⇝^P_ξ G in ℓ∞(F) and G′n is asymptotically measurable;

(iii) G″n ⇝^P_ξ G in ℓ∞(F) and G″n is asymptotically measurable.

Before giving the proof of this theorem, we make a few points and present Lemma 10.5 below to aid in the proof. In the above theorem, E_ξ denotes expectation conditional on X1, . . . , Xn. Note that for a continuous function h : ℓ∞(F) → R, if we fix X1, . . . , Xn, then (a1, . . . , an) → h(n^{-1/2} Σ_{i=1}^n ai(δ_{Xi} − P)) is a measurable map from R^n to R, provided ‖f(X) − Pf‖*_F < ∞ almost surely. This last inequality is tacitly assumed so that the empirical processes under investigation reside in ℓ∞(F). Thus the expectation E_ξ in conclusions (ii) and (iii) is proper. The following lemma is a conditional multiplier central limit theorem for i.i.d. Euclidean data:

Lemma 10.5 Let Z1, . . . , Zn be i.i.d. Euclidean random vectors with EZ = 0 and E‖Z‖² < ∞, independent of the i.i.d. sequence of real random variables ξ1, . . . , ξn with Eξ = 0 and Eξ² = 1. Then, conditionally on Z1, Z2, . . ., n^{-1/2} Σ_{i=1}^n ξi Zi ⇝ N(0, cov Z), for almost all sequences Z1, Z2, . . ..

Proof. By the Lindeberg central limit theorem, convergence to the given normal limit will occur for every sequence Z1, Z2, . . . for which

n^{-1} Σ_{i=1}^n Zi Zi^T → cov Z   and   n^{-1} Σ_{i=1}^n ‖Zi‖² E_ξ[ ξi² 1{|ξi| × ‖Zi‖ > ǫ√n} ] → 0,   (10.1)

for all ǫ > 0, where E_ξ is the conditional expectation given Z1, Z2, . . .. The first condition is true for almost all sequences by the strong law of large numbers. We now evaluate the second condition. Fix ǫ > 0. Now, for any τ > 0, the sum in (10.1) is bounded by

n^{-1} Σ_{i=1}^n ‖Zi‖² E[ξ² 1{|ξ| > ǫ/τ}] + n^{-1} Σ_{i=1}^n ‖Zi‖² 1{‖Zi‖ > √n τ}.

The first sum has an arbitrarily small upper bound in the limit if we choose τ sufficiently small. Since E‖Z‖² < ∞, the second sum will go to zero for almost all sequences Z1, Z2, . . .. Thus, for almost all sequences Z1, Z2, . . ., (10.1) will hold for any ǫ > 0. For the intersection of the two sets of sequences, all required conditions hold, and the desired result follows.

Proof of Theorem 10.4. Since the processes G, Gn, G′n and G″n are unaffected if the class F is replaced with {f − Pf : f ∈ F}, we will assume ‖P‖_F = 0 throughout the proof, without loss of generality.

(i)⇒(ii): If F is Donsker, the sequence G′n converges in distribution to a Brownian bridge process by the unconditional multiplier central limit theorem (Theorem 10.1). Thus G′n is asymptotically measurable. Now, by Lemma 8.17, a Donsker class is totally bounded by the semimetric ρP(f, g) ≡ (P[f − g]²)^{1/2}. For each fixed δ > 0 and f ∈ F, let Πδf denote the closest element in a given, finite δ-net (with respect to the metric ρP) for F. We have, by continuity of the limit process G, that G ∘ Πδ → G almost surely as δ ↓ 0. Hence, for any sequence δn ↓ 0,

sup_{h∈BL1} |E h(G ∘ Πδn) − E h(G)| → 0.   (10.2)

By Lemma 10.5 above, we also have for any fixed δ > 0 that

sup_{h∈BL1} |E_ξ h(G′n ∘ Πδ) − E h(G ∘ Πδ)| → 0,   (10.3)

as n → ∞, for almost all sequences X1, X2, . . .. To see this, let f1, . . . , fm be the δ-mesh of F that defines Πδ. Now define the map A : R^m → ℓ∞(F) by (A(y))(f) = y_k, where y = (y1, . . . , ym) and the integer k satisfies Πδf = f_k. Then h(G ∘ Πδ) = g(G(f1), . . . , G(fm)) for the function g : R^m → R defined by g(y) = h(A(y)). It is not hard to see that if h is bounded Lipschitz on ℓ∞(F), then g is also bounded Lipschitz on R^m, with a Lipschitz norm no larger than that of h. Now (10.3) follows from Lemma 10.5. Note also that BL1(R^m) is separable with respect to the metric ρ^(m)(f, g) ≡ Σ_{i=1}^∞ 2^{-i} sup_{x∈K_i} |f(x) − g(x)|, where K1 ⊂ K2 ⊂ · · · are compact sets satisfying ∪_{i=1}^∞ K_i = R^m. Hence, since G′n ∘ Πδ and G ∘ Πδ are both tight, the supremum in (10.3) can be replaced by a countable supremum. Thus the displayed quantity is measurable, since h(G′n ∘ Πδ) is measurable.

Now, still holding δ fixed,

sup_{h∈BL1} |E_ξ h(G′n ∘ Πδ) − E_ξ h(G′n)| ≤ sup_{h∈BL1} E_ξ |h(G′n ∘ Πδ) − h(G′n)|
  ≤ E_ξ ‖G′n ∘ Πδ − G′n‖*_F
  ≤ E_ξ ‖G′n‖*_{Fδ},

where Fδ ≡ {f − g : ρP(f, g) < δ, f, g ∈ F}. Thus the outer expectation of the left-hand side is bounded above by E*‖G′n‖_{Fδ}. As we demonstrated in the proof of Theorem 10.1, E*‖G′n‖_{Fδn} → 0 for any sequence δn ↓ 0. Now we choose the sequence δn to go to zero slowly enough that (10.3) still holds with δ replaced by δn. Combining this with (10.2), the desired result follows.

(ii)⇒(i): Let h(G′n)* and h(G′n)_* denote measurable majorants and minorants with respect to (ξ1, . . . , ξn, X1, . . . , Xn) jointly. We now have, by the triangle inequality and Fubini's theorem (Lemma 6.14),

|E*h(G′n) − E h(G)| ≤ |E_X E_ξ h(G′n)* − E*_X E_ξ h(G′n)| + E*_X |E_ξ h(G′n) − E h(G)|,

where E_X denotes taking the expectation over X1, . . . , Xn. By (ii) and the dominated convergence theorem, the second term on the right side converges to zero for all h ∈ BL1. Since the first term on the right is bounded above by E_X E_ξ h(G′n)* − E_X E_ξ h(G′n)_*, it also converges to zero, since G′n is asymptotically measurable. It is easy to see that the same result holds true if BL1 is replaced by the class of all bounded, Lipschitz continuous nonnegative functions h : ℓ∞(F) → R, and thus G′n ⇝ G unconditionally by the Portmanteau theorem (Theorem 7.6). Hence F is Donsker by the converse part of Theorem 10.1.

(ii)⇒(iii): Since we can assume ‖P‖_F = 0, we have

|h(G′n) − h(G″n)| ≤ ‖ξ̄ Gn‖_F.   (10.4)

Moreover, since (ii) also implies (i), we have that E*‖ξ̄ Gn‖_F → 0 by Lemma 8.17. Thus sup_{h∈BL1} |E_ξ h(G′n) − E_ξ h(G″n)| → 0 in outer probability. Since (10.4) also implies that G″n is asymptotically measurable, (iii) follows.

(iii)⇒(i): Arguing as we did in the proof that (ii)⇒(i), it is not hard to show that G″n ⇝ G unconditionally. Now Theorem 10.1 yields that F is Donsker.

We now present the outer-almost-sure conditional multiplier central limit theorem:

Theorem 10.6 Assume the conditions of Theorem 10.4. Then the following are equivalent:

(i) F is Donsker and P*‖f − Pf‖²_F < ∞;

(ii) G′n ⇝^{as*}_ξ G in ℓ∞(F);

(iii) G″n ⇝^{as*}_ξ G in ℓ∞(F).

Proof. The equivalence of (i) and (ii) is given in Theorem 2.9.7 of VW, and we omit its proof.

(ii)⇒(iii): As in the proof of Theorem 10.4, we assume that ‖P‖_F = 0 without loss of generality. Since

|h(G′n) − h(G″n)| ≤ |√n ξ̄| × ‖Pn‖_F,   (10.5)

for any h ∈ BL1, we have

sup_{h∈BL1} |E_ξ h(G′n) − E_ξ h(G″n)| ≤ E_ξ |√n ξ̄| × ‖Pn‖_F ≤ ‖Pn‖_F →^{as*} 0,

since the equivalence of (i) and (ii) implies that F is both Donsker and Glivenko-Cantelli. Hence

sup_{h∈BL1} |E_ξ h(G″n) − E h(G)| →^{as*} 0.

The relation (10.5) also yields that E_ξ h(G″n)* − E_ξ h(G″n)_* →^{as*} 0, and thus (iii) follows.

(iii)⇒(ii): Let h ∈ BL1. Since E_ξ h(G″n)* − E h(G) →^{as*} 0, we have E*h(G″n) → E h(G). Since this holds for all h ∈ BL1, we now have that G″n ⇝ G unconditionally by the Portmanteau theorem (Theorem 7.6). Now we can invoke Theorem 10.1 to conclude that F is both Donsker and Glivenko-Cantelli. Now (10.5) implies (ii) by using an argument almost identical to the one used in the previous paragraph.

10.1.3 Bootstrap Central Limit Theorems

Theorems 10.4 and 10.6 will now be used to prove Theorems 2.6 and 2.7 from Page 20 of Section 2.2.3. Recall that the multinomial bootstrap is obtained by resampling from the data X1, . . . , Xn, with replacement, n times to obtain a bootstrapped sample X*1, . . . , X*n. The empirical measure P*n of the bootstrapped sample has, given the data, the same distribution as the measure P̂n ≡ n^{-1} Σ_{i=1}^n Wni δ_{Xi}, where Wn ≡ (Wn1, . . . , Wnn) is a multinomial(n, n^{-1}, . . . , n^{-1}) deviate independent of the data. As in Section 2.2.3, let Ĝn ≡ √n(P̂n − Pn). Also recall the definitions P̃n ≡ n^{-1} Σ_{i=1}^n (ξi/ξ̄)δ_{Xi} and G̃n ≡ √n(µ/τ)(P̃n − Pn), where the weights ξ1, . . . , ξn are i.i.d. nonnegative, independent of X1, . . . , Xn, with mean 0 < µ < ∞ and variance 0 < τ² < ∞, and with ‖ξ‖_{2,1} < ∞. When ξ̄ = 0, we define P̃n to be zero. Note that the weights ξ1, . . . , ξn in this section must have µ subtracted from them and then be divided by τ before they satisfy the criteria for the multiplier weights of the previous section.
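The two resampling schemes just defined can be sketched side by side (our own illustration, not from the text; the Exp(1) data, the single function f = 1{x ≤ 1}, and the Exp(1) weights, which give µ = τ = 1 and hence the Bayesian bootstrap, are all assumed choices). Conditionally on the data, both Ĝn f and G̃n f should approximate the N(0, F(t)(1 − F(t))) limit of Gn f.

```python
import numpy as np

rng = np.random.default_rng(3)
n, B, t = 1000, 2000, 1.0
x = rng.exponential(size=n)          # sample data, P = Exp(1)
ind = (x <= t).astype(float)         # f(X_i) for f = 1{x <= t}
Fn = ind.mean()                      # P_n f

# Multinomial bootstrap: hat-G_n f = sqrt(n)(hat-P_n - P_n) f.
w = rng.multinomial(n, np.ones(n) / n, size=B)
G_hat = np.sqrt(n) * (w @ ind / n - Fn)

# Weighted bootstrap with Exp(1) weights (mu = tau = 1):
# tilde-P_n f = sum_i xi_i f(X_i) / sum_j xi_j, tilde-G_n f = sqrt(n)(tilde-P_n - P_n) f.
xi = rng.exponential(size=(B, n))
G_tilde = np.sqrt(n) * (xi @ ind / xi.sum(axis=1) - Fn)

sd_limit = np.sqrt((1 - np.exp(-t)) * np.exp(-t))  # sd of G f, with F(t) = 1 - e^{-t}
print(G_hat.std(), G_tilde.std(), sd_limit)
```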

Proof of Theorem 2.6 (Page 20). The equivalence of (i) and (ii) follows from Theorem 3.6.1 of VW, whose proof we omit. We note, however, that a key component of this proof is a clever approximation of the multinomial weights with i.i.d. Poisson mean-1 weights. We will use this approximation in our proof of Theorem 10.15 below.

We now prove the equivalence of (i) and (iii). Let ξ°i ≡ τ^{-1}(ξi − µ), i = 1, . . . , n, and define G°n ≡ n^{-1/2} Σ_{i=1}^n (ξ°i − ξ̄°)δ_{Xi}, where ξ̄° ≡ n^{-1} Σ_{i=1}^n ξ°i. The basic idea is to show the asymptotic equivalence of G̃n and G°n; Theorem 10.4 can then be used to establish the desired result. Accordingly,

G°n − G̃n = (1 − µ/ξ̄) G°n = (ξ̄/µ − 1) G̃n.   (10.6)

First, assume that F is Donsker. Since the weights ξ°1, . . . , ξ°n satisfy the conditions of the unconditional multiplier central limit theorem, we have that G°n ⇝ G. Theorem 10.4 also implies that G°n ⇝^P_ξ G. Now (10.6) can be applied to verify that ‖G°n − G̃n‖_F →^P 0, and thus G̃n is asymptotically measurable and

sup_{h∈BL1} |E_ξ h(G̃n) − E_ξ h(G°n)| →^P 0.

Thus (i)⇒(iii).

Second, assume that G̃n ⇝^P_ξ G. It is not hard to show, arguing as we did in the proof of Theorem 10.4 for the implication (ii)⇒(i), that G̃n ⇝ G in ℓ∞(F) unconditionally. By applying (10.6) again, we now have that ‖G°n − G̃n‖_F →^P 0, and thus G°n ⇝ G in ℓ∞(F) unconditionally. The unconditional multiplier central limit theorem now verifies that F is Donsker, and thus (iii)⇒(i).

Proof of Theorem 2.7 (Page 20). The equivalence of (i) and (ii) follows from Theorem 3.6.2 of VW, whose proof we again omit. We now prove the equivalence of (i) and (iii).

First, assume (i). Then G°n ⇝^{as*}_ξ G by Theorem 10.6. Fix ρ > 0, and note that, by using the first equality in (10.6), we have for any h ∈ BL1 that

|h(G̃n) − h(G°n)| ≤ 2 × 1{|1 − µ/ξ̄| > ρ} + (ρ‖G°n‖_F) ∧ 1.   (10.7)

The first term on the right →^{as*} 0. Since the map ‖·‖_F ∧ 1 : ℓ∞(F) → R is in BL1, we have by Theorem 10.6 that E_ξ[(ρ‖G°n‖_F) ∧ 1] →^{as*} E[‖ρG‖_F ∧ 1]. Let the sequence 0 < ρn ↓ 0 converge slowly enough that the first term on the right in (10.7) →^{as*} 0 after replacing ρ with ρn. Since E[‖ρnG‖_F ∧ 1] → 0, we can apply E_ξ to both sides of (10.7), after replacing ρ with ρn, to obtain

sup_{h∈BL1} |E_ξ h(G̃n) − E_ξ h(G°n)| →^{as*} 0.


Combining the fact that h(G°n)* − h(G°n)_* →^{as*} 0 with additional applications of (10.7) yields h(G̃n)* − h(G̃n)_* →^{as*} 0. Since h was arbitrary, we have established that G̃n ⇝^{as*}_ξ G, and thus (iii) follows.

Second, assume (iii). Fix ρ > 0, and note that, by using the second equality in (10.6), we have for any h ∈ BL1 that

|h(G°n) − h(G̃n)| ≤ 2 × 1{|ξ̄/µ − 1| > ρ} + (ρ‖G̃n‖_F) ∧ 1.

Since the first term on the right →^{as*} 0, we can use virtually identical arguments to those used in the previous paragraph, but with the roles of G°n and G̃n reversed, to obtain that G°n ⇝^{as*}_ξ G. Now Theorem 10.6 yields that F is Donsker, and thus (i) follows.

10.1.4 Continuous Mapping Results

We now assume a more general set-up, where X̂n is a bootstrapped process in a Banach space (D, ‖·‖) and is composed of the sample data Xn ≡ (X1, . . . , Xn) and a random weight vector Mn ∈ R^n independent of Xn. We do not require that X1, . . . , Xn be i.i.d. In this section, we obtain two continuous mapping results. The first result, Proposition 10.7, is a simple continuous mapping result for the very special case of Lipschitz continuous maps. It is applicable to both the in-probability and outer-almost-sure versions of bootstrap consistency. An interesting special case is the map g(x) = ‖x‖. In this case, the proposition validates the use of the bootstrap to construct asymptotically uniformly valid confidence bands for {Pf : f ∈ F} whenever Pf is estimated by Pn f and F is P-Donsker. Now assume that X̂n ⇝^P_M X and that the distribution of ‖X‖ is continuous. Lemma 10.11 towards the end of this section reveals that P(‖X̂n‖ ≤ t | Xn) converges uniformly to P(‖X‖ ≤ t), in probability. A parallel outer almost sure result holds when X̂n ⇝^{as*}_M X.
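As a concrete instance of such a band, here is a minimal sketch under assumed choices of our own (standard normal data, the multinomial bootstrap, and a finite grid of half-line indicators standing in for F): the bootstrap (1 − α)-quantile c of the sup-norm g(Ĝn) = ‖Ĝn‖_F yields the band Pn f ± c/√n.

```python
import numpy as np

rng = np.random.default_rng(4)
n, B, alpha = 500, 2000, 0.05
x = rng.normal(size=n)
grid = np.linspace(-2.5, 2.5, 41)            # t-values indexing f_t = 1{x <= t}
ind = (x[:, None] <= grid[None, :]).astype(float)
Fn = ind.mean(axis=0)                        # P_n f_t

# Bootstrap the map g(x) = ||x||_F applied to hat-G_n = sqrt(n)(hat-P_n - P_n):
w = rng.multinomial(n, np.ones(n) / n, size=B)
sup_stats = np.abs(np.sqrt(n) * (w @ ind / n - Fn)).max(axis=1)
c = np.quantile(sup_stats, 1 - alpha)

# Asymptotically uniform band for {P f_t} over the grid:
lower, upper = Fn - c / np.sqrt(n), Fn + c / np.sqrt(n)
print(round(float(c), 3))
```

Here c approximates the 0.95-quantile of the supremum of |G| over the grid, which for this indicator class is roughly the Kolmogorov-Smirnov critical value 1.36.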

The second result, Theorem 10.8, is a considerably deeper result for general continuous maps applied to bootstraps which are consistent in probability. Because of this generality, we must require certain measurability conditions on the map Mn → X̂n. Fortunately, based on the discussion in the paragraph following Theorem 10.4 above, these measurability conditions are easily satisfied when either X̂n = Ĝn or X̂n = G̃n. It appears that other continuous mapping results for bootstrapped empirical processes hold, such as for bootstraps which are outer almost surely consistent, but such results seem to be very challenging to verify.

Proposition 10.7 Let D and E be Banach spaces, X a tight random variable on D, and g : D → E Lipschitz continuous. We have the following:

(i) If X̂n ⇝^P_M X, then g(X̂n) ⇝^P_M g(X);

(ii) If X̂n ⇝^{as*}_M X, then g(X̂n) ⇝^{as*}_M g(X).

Proof. Let c0 < ∞ be the Lipschitz constant for g and, without loss of generality, assume c0 ≥ 1. Note that for any h ∈ BL1(E), the map x → h(g(x)) is an element of c0 BL1(D). Thus

sup_{h∈BL1(E)} |E_M h(g(X̂n)) − E h(g(X))| ≤ sup_{h∈c0 BL1(D)} |E_M h(X̂n) − E h(X)|
  = c0 sup_{h∈BL1(D)} |E_M h(X̂n) − E h(X)|,

and the desired result follows by the respective definitions of ⇝^P_M and ⇝^{as*}_M.

Theorem 10.8 Let g : D → E be continuous at all points in D0 ⊂ D, where D and E are Banach spaces and D0 is closed. Assume that Mn → h(X̂n) is measurable for every h ∈ Cb(D) outer almost surely. Then if X̂n ⇝^P_M X in D, where X is tight and P*(X ∈ D0) = 1, we have g(X̂n) ⇝^P_M g(X).

Proof. As in the proof of the implication (ii)⇒(i) of Theorem 10.4, we can argue that X̂n ⇝ X unconditionally, and thus g(X̂n) ⇝ g(X) unconditionally by the standard continuous mapping theorem. Moreover, we can replace E with its closed linear span so that the restriction of g to D0 has an extension g̃ : D → E which is continuous on all of D, by Dugundji's extension theorem (Theorem 10.9 below). Thus (g(X̂n), g̃(X̂n)) ⇝ (g(X), g̃(X)), and hence g(X̂n) − g̃(X̂n) →^P 0. Therefore we can assume without loss of generality that g is continuous on all of D. We can also assume without loss of generality that D0 is a separable Banach space, since X is tight. Hence E0 ≡ g(D0) is also a separable Banach space.

Fix ǫ > 0. There now exists a compact K ⊂ E0 such that P(g(X) ∉ K) < ǫ. By Theorem 10.10 below, whose proof is given in Section 10.4, there exist an integer k < ∞, elements z1, . . . , zk ∈ C[0, 1], continuous functions f1, . . . , fk : E → R, and a Lipschitz continuous function J : lin(z1, . . . , zk) → E, such that the map x → Tǫ(x) ≡ J(Σ_{j=1}^k zj fj(x)) has domain E and range ⊂ E and satisfies sup_{x∈K} ‖Tǫ(x) − x‖ < ǫ. Let BL1 ≡ BL1(E). We now have

sup_{h∈BL1} |E_M h(g(X̂n)) − E h(g(X))| ≤ sup_{h∈BL1} |E_M h(Tǫ g(X̂n)) − E h(Tǫ g(X))|
  + E_M [‖Tǫ g(X̂n) − g(X̂n)‖ ∧ 2] + E[‖Tǫ g(X) − g(X)‖ ∧ 2].


However, the outer expectation of the second term on the right converges to the third term, as n → ∞, by the usual continuous mapping theorem. Thus, provided

sup_{h∈BL1} |E_M h(Tǫ g(X̂n)) − E h(Tǫ g(X))| →^P 0,   (10.8)

we have that

lim sup_{n→∞} E*[ sup_{h∈BL1} |E_M h(g(X̂n)) − E h(g(X))| ]   (10.9)
  ≤ 2 E[‖Tǫ g(X) − g(X)‖ ∧ 2]
  ≤ 2 E‖(Tǫ g(X) − g(X)) 1{g(X) ∈ K}‖ + 4 P(g(X) ∉ K)
  < 6ǫ.

Now note that for each h ∈ BL1, h(J(Σ_{j=1}^k zj aj)) = h̃(a1, . . . , ak) for all (a1, . . . , ak) ∈ R^k and some h̃ ∈ c0 BL1(R^k), where 1 ≤ c0 < ∞ (this follows since J is Lipschitz continuous and ‖Σ_{j=1}^k zj aj‖ ≤ max_{1≤j≤k} |aj| × Σ_{j=1}^k ‖zj‖). Hence

sup_{h∈BL1} |E_M h(Tǫ g(X̂n)) − E h(Tǫ g(X))|   (10.10)
  ≤ sup_{h∈c0 BL1(R^k)} |E_M h(u(X̂n)) − E h(u(X))|
  = c0 sup_{h∈BL1(R^k)} |E_M h(u(X̂n)) − E h(u(X))|,

where x → u(x) ≡ (f1(g(x)), . . . , fk(g(x))). Fix any v : R^k → [0, 1] which is Lipschitz continuous (the Lipschitz constant may be > 1). Then, since X̂n ⇝ X unconditionally,

E*[E_M v(u(X̂n))* − E_M v(u(X̂n))_*] ≤ E*[v(u(X̂n))* − v(u(X̂n))_*] → 0,

where sub- and superscript * denote measurable majorants and minorants, respectively, with respect to the joint probability space of (Xn, Mn). Thus

|E_M v(u(X̂n)) − E_M v(u(X̂n))*| →^P 0.   (10.11)

Note that we are using at this point the outer almost sure measurability of Mn → v(u(X̂n)) to ensure that E_M v(u(X̂n)) is well defined, even if the resulting random expectation is not itself measurable.

Now, for every subsequence {n′}, there exists a further subsequence {n″} such that X̂_{n″} ⇝^{as*}_M X. This means that, for this subsequence, the set B of data subsequences {X_{n″} : n ≥ 1} for which E_M v(u(X̂_{n″})) − E v(u(X)) → 0 has inner probability 1. Combining this with (10.11) and Proposition 7.22, we obtain that E_M v(u(X̂n)) − E v(u(X)) →^P 0. Since v was an arbitrary real, Lipschitz continuous function on R^k, we now have, by Part (i) of Lemma 10.11 below followed by Lemma 10.12 below, that

sup_{h∈BL1(R^k)} |E_M h(u(X̂n)) − E h(u(X))| →^P 0.

Combining this with (10.10), we obtain that (10.8) is satisfied. The desired result now follows from (10.9), since ǫ > 0 was arbitrary.

Theorem 10.9 (Dugundji's extension theorem) Let X be an arbitrary metric space, A a closed subset of X, L a locally convex linear space (which includes Banach vector spaces), and f : A → L a continuous map. Then there exists a continuous extension of f, F : X → L. Moreover, F(X) lies in the closed linear span of the convex hull of f(A).

Proof. This is Theorem 4.1 of Dugundji (1951), and the proof can be found therein.

Theorem 10.10 Let E0 ⊂ E be Banach spaces, with E0 separable and lin E0 ⊂ E. Then for every ǫ > 0 and every compact K ⊂ E0, there exist an integer k < ∞, elements z1, . . . , zk ∈ C[0, 1], continuous functions f1, . . . , fk : E → R, and a Lipschitz continuous function J : lin(z1, . . . , zk) → E, such that the map x → Tǫ(x) ≡ J(Σ_{j=1}^k zj fj(x)) has domain E and range ⊂ E, is continuous, and satisfies sup_{x∈K} ‖Tǫ(x) − x‖ < ǫ.

The proof of this theorem is given in Section 10.4. For the next two lemmas, we use the usual partial ordering on R^k to define relations between points; e.g., for any s, t ∈ R^k, s ≤ t is equivalent to s1 ≤ t1, . . . , sk ≤ tk.

Lemma 10.11 Let Xn and X be random variables in R^k for all n ≥ 1. Define S ⊂ [R ∪ {−∞, ∞}]^k to be the set of all continuity points of t → F(t) ≡ P(X ≤ t), and H to be the set of all Lipschitz continuous functions h : R^k → [0, 1] (the Lipschitz constants may be > 1). Then, provided the expectations are well defined, we have:

(i) If E[h(Xn)|Yn] →^P E h(X) for all h ∈ H, then sup_{t∈A} |P(Xn ≤ t|Yn) − F(t)| →^P 0 for all closed A ⊂ S;

(ii) If E[h(Xn)|Yn] →^{as*} E h(X) for all h ∈ H, then sup_{t∈A} |P(Xn ≤ t|Yn) − F(t)| →^{as*} 0 for all closed A ⊂ S.

Proof. Let t0 ∈ S. For every δ > 0, there exist h1, h2 ∈ H such that h1(u) ≤ 1{u ≤ t0} ≤ h2(u) for all u ∈ R^k and E[h2(X) − h1(X)] < δ. Under the condition in (i), we therefore have that P(Xn ≤ t0|Yn) →^P F(t0), since δ was arbitrary. The conclusion of (i) follows since this convergence holds for all t0 ∈ S, since both P(Xn ≤ t|Yn) and F(t) are monotone in t with range ⊂ [0, 1], and since [0, 1] is compact. The proof of Part (ii) follows similarly.

Lemma 10.12 Let Fn and F be distribution functions on R^k, and let S ⊂ [R ∪ {−∞, ∞}]^k be the set of all continuity points of F. Then the following are equivalent:

(i) sup_{t∈A} |Fn(t) − F(t)| → 0 for all closed A ⊂ S;

(ii) sup_{h∈BL1(R^k)} |∫_{R^k} h (dFn − dF)| → 0.

The relatively straightforward proof is saved as Exercise 10.5.3.

10.2 The Bootstrap for Glivenko-Cantelli Classes

We now present several results for the bootstrap applied to Glivenko-Cantelli classes. The primary use of these results is to assist in verifying consistency of bootstrapped estimators. The first theorem (Theorem 10.13) consists of various multiplier bootstrap results, and it is followed by a corollary (Corollary 10.14) which applies to certain weighted bootstraps. The final theorem of this section (Theorem 10.15) gives corresponding results for the multinomial bootstrap. On a first reading of this section, it may be best to skip the proofs and focus on the results and the discussion between the proofs.

Theorem 10.13 Let F be a class of measurable functions, and let ξ1, . . . , ξn be i.i.d. nonconstant random variables with 0 < E|ξ| < ∞, independent of the sample data X1, . . . , Xn. Let Wn ≡ n^{-1} Σ_{i=1}^n ξi(δ_{Xi} − P) and W̃n ≡ n^{-1} Σ_{i=1}^n (ξi − ξ̄)δ_{Xi}, where ξ̄ ≡ n^{-1} Σ_{i=1}^n ξi. Then the following are equivalent:

(i) F is strong Glivenko-Cantelli;

(ii) ‖Wn‖_F →^{as*} 0;

(iii) E_ξ‖Wn‖_F →^{as*} 0 and P*‖f − Pf‖_F < ∞;

(iv) For every η > 0, P(‖Wn‖_F > η | Xn) →^{as*} 0 and P*‖f − Pf‖_F < ∞, where Xn ≡ (X1, . . . , Xn);

(v) For every η > 0, P(‖Wn‖*_F > η | Xn) →^{as*} 0 and P*‖f − Pf‖_F < ∞, for some version of ‖Wn‖*_F, where the superscript * denotes a measurable majorant with respect to (ξ1, . . . , ξn, X1, . . . , Xn) jointly;

(vi) ‖W̃n‖_F →^{as*} 0;

(vii) E_ξ‖W̃n‖_F →^{as*} 0 and P*‖f − Pf‖_F < ∞;

(viii) For every η > 0, P(‖W̃n‖_F > η | Xn) →^{as*} 0 and P*‖f − Pf‖_F < ∞;

(ix) For every η > 0, P(‖W̃n‖*_F > η | Xn) →^{as*} 0 and P*‖f − Pf‖_F < ∞, for some version of ‖W̃n‖*_F.

The lengthy proof is given in Section 10.4 below. As is shown in the proof, the conditional expectations and conditional probabilities in (iii), (iv), (vii) and (viii) are well defined. This is because the quantities inside the expectations in Parts (iii) and (vii) (and inside the conditional probabilities of (iv) and (viii)) are measurable as functions of ξ1, . . . , ξn conditional on the data. The distinctions between (iv) and (v), and between (viii) and (ix), are not as trivial as they appear. This is because the measurable majorants involved are computed with respect to (ξ1, . . . , ξn, X1, . . . , Xn) jointly, and thus the differences between ‖Wn‖_F and ‖Wn‖*_F, or between ‖W̃n‖_F and ‖W̃n‖*_F, may be nontrivial.

The following corollary applies to a class of weighted bootstraps that includes the Bayesian bootstrap mentioned earlier:

Corollary 10.14 Let F be a class of measurable functions, and let ξ1, . . . , ξn be i.i.d. nonconstant, nonnegative random variables with 0 < Eξ < ∞ and independent of X1, . . . , Xn. Let P̃n ≡ n^{-1}∑_{i=1}^n (ξi/ξ̄)δ_{Xi}, where we set P̃n = 0 when ξ̄ = 0. Then the following are equivalent:

(i) F is strong Glivenko-Cantelli;

(ii) ‖P̃n − Pn‖F as∗→ 0 and P∗‖f − Pf‖F < ∞;

(iii) Eξ‖P̃n − Pn‖F as∗→ 0 and P∗‖f − Pf‖F < ∞;

(iv) For every η > 0, P(‖P̃n − Pn‖F > η | Xn) as∗→ 0 and P∗‖f − Pf‖F < ∞;

(v) For every η > 0, P(‖P̃n − Pn‖∗F > η | Xn) as∗→ 0 and P∗‖f − Pf‖F < ∞, for some version of ‖P̃n − Pn‖∗F.

If in addition P(ξ = 0) = 0, then the requirement that P∗‖f − Pf‖F < ∞ in (ii) may be dropped.
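For intuition, the corollary is easy to probe by simulation. The sketch below is illustrative code, not from the text: it uses i.i.d. standard exponential weights ξi (the Bayesian bootstrap mentioned above, with Eξ = 1 and P(ξ = 0) = 0) and takes F to be the half-line indicators {1{x ≤ t}}, a classical strong Glivenko-Cantelli class. Both P̃n and Pn jump only at the data points, so ‖P̃n − Pn‖F is a maximum over the order statistics, and the data values themselves cancel out of the computation.

```python
import numpy as np

rng = np.random.default_rng(0)

def weighted_boot_sup(n):
    """One draw of sup_t |Ptilde_n(-inf,t] - P_n(-inf,t]|."""
    xi = rng.exponential(size=n)      # nonconstant, nonnegative weights with E xi = 1
    w = xi / xi.mean()                # xi_i / xi_bar
    emp = np.arange(1, n + 1) / n     # P_n evaluated at the order statistics
    boot = np.cumsum(w) / n           # Ptilde_n evaluated at the order statistics
    return np.abs(boot - emp).max()

for n in [100, 1000, 10000]:
    print(n, weighted_boot_sup(n))
```

The printed suprema should shrink roughly at the n^{-1/2} rate, in line with Assertion (ii).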

Proof. Since the processes Pn − P and P̃n − Pn do not change when the class F is replaced with {f − Pf : f ∈ F}, we can assume ‖P‖F = 0 without loss of generality. Let the envelope of F be denoted F ≡ ‖f‖∗F. Since multiplying the ξi by a constant does not change ξi/ξ̄, we can also assume Eξ = 1 without loss of generality. The fact that the conditional expressions in (iii) and (iv) are well defined can be argued as in the proof of


Theorem 10.13 (given below in Section 10.4), and we do not repeat the details here.

(i)⇒(ii): Since

P̃n − Pn − W̃n = (1/ξ̄ − 1)1{ξ̄ > 0}W̃n − 1{ξ̄ = 0}Pn,   (10.12)

(ii) follows by Theorem 10.13 and the fact that ξ̄ as∗→ 1.

(ii)⇒(i): Note that

P̃n − Pn − W̃n = −(ξ̄ − 1)1{ξ̄ > 0}(P̃n − Pn) − 1{ξ̄ = 0}Pn.   (10.13)

The first term on the right as∗→ 0 by (ii), while the second term on the right is bounded in absolute value by 1{ξ̄ = 0}‖Pn‖F ≤ 1{ξ̄ = 0}PnF as∗→ 0, by the moment condition.

(ii)⇒(iii): The method of proof will be to use the expansion (10.12) to show that Eξ‖P̃n − Pn − W̃n‖F as∗→ 0. Then (iii) will follow by Theorem 10.13 and the established equivalence between (ii) and (i). Along this vein, we have by symmetry followed by an application of Theorem 9.29 that

Eξ[|1/ξ̄ − 1| 1{ξ̄ > 0} ‖W̃n‖F] ≤ n^{-1}∑_{i=1}^n F(Xi) Eξ[ξi |1/ξ̄ − 1| 1{ξ̄ > 0}] = PnF · Eξ[|1 − ξ̄| 1{ξ̄ > 0}] as∗→ 0.

Since also Eξ[1{ξ̄ = 0}]‖Pn‖F as∗→ 0, the desired conclusion follows.

(iii)⇒(iv): This is obvious.

(iv)⇒(i): Consider again expansion (10.13). The moment conditions easily give us, conditional on X1, X2, . . ., that 1{ξ̄ = 0}‖Pn‖F ≤ 1{ξ̄ = 0}PnF P→ 0 for almost all sequences X1, X2, . . .. By (iv), we also obtain that |ξ̄ − 1|1{ξ̄ > 0}‖P̃n − Pn‖F P→ 0 for almost all sequences X1, X2, . . .. Thus Assertion (viii) of Theorem 10.13 follows, and we obtain (i).

If P(ξ = 0) = 0, then 1{ξ̄ = 0}Pn = 0 almost surely, and we no longer need the moment condition PF < ∞ in the proofs of (ii)⇒(i) and (ii)⇒(iii), and thus the moment condition in (ii) can be dropped in this setting.

(ii)⇒(v): Assertion (ii) implies that there exists a measurable set B of infinite sequences (ξ1, X1), (ξ2, X2), . . . with P(B) = 1 such that ‖P̃n − Pn‖∗F → 0 on B for some version of ‖P̃n − Pn‖∗F. Let Eξ,∞ be the expectation taken over the infinite sequence ξ1, ξ2, . . . holding the infinite sequence X1, X2, . . . fixed. By the bounded convergence theorem, we have for any η > 0 and almost all sequences X1, X2, . . .,


lim sup_{n→∞} P(‖P̃n − Pn‖∗F > η | Xn) = lim sup_{n→∞} Eξ,∞[1{‖P̃n − Pn‖∗F > η}1{B}] = Eξ,∞[lim sup_{n→∞} 1{‖P̃n − Pn‖∗F > η}1{B}] = 0.

Thus (v) follows.

(v)⇒(iv): This is obvious.

The following theorem verifies consistency of the multinomial bootstrapped empirical measure defined in Section 10.1.3, which we denote P̂n, when F is strong G-C. The proof is given in Section 10.4 below.

Theorem 10.15 Let F be a class of measurable functions, and let the multinomial vectors Wn in P̂n be independent of the data. Then the following are equivalent:

(i) F is strong Glivenko-Cantelli;

(ii) ‖P̂n − Pn‖F as∗→ 0 and P∗‖f − Pf‖F < ∞;

(iii) EW‖P̂n − Pn‖F as∗→ 0 and P∗‖f − Pf‖F < ∞;

(iv) For every η > 0, P(‖P̂n − Pn‖F > η | Xn) as∗→ 0 and P∗‖f − Pf‖F < ∞;

(v) For every η > 0, P(‖P̂n − Pn‖∗F > η | Xn) as∗→ 0 and P∗‖f − Pf‖F < ∞, for some version of ‖P̂n − Pn‖∗F.
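The multinomial bootstrap is just as easy to simulate. In this illustrative sketch (not from the text), the multinomial weights make the bootstrapped measure the classical nonparametric bootstrap empirical measure, and F is again the class of half-line indicators, so the data values cancel and only the weights matter:

```python
import numpy as np

rng = np.random.default_rng(1)

def multinomial_boot_sup(n):
    """One draw of sup_t |Phat_n(-inf,t] - P_n(-inf,t]|."""
    w = rng.multinomial(n, np.full(n, 1.0 / n))   # W_n: resampling counts, sum = n
    emp = np.arange(1, n + 1) / n                 # P_n at the order statistics
    boot = np.cumsum(w) / n                       # Phat_n at the order statistics
    return np.abs(boot - emp).max()

for n in [100, 1000, 10000]:
    print(n, multinomial_boot_sup(n))
```

As with the weighted bootstrap, the supremum should shrink as n grows, consistent with Assertion (ii) of the theorem.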

10.3 A Simple Z-Estimator Master Theorem

Consider Z-estimation based on the estimating equation θ → Ψn(θ) ≡ Pnψθ, where θ ∈ Θ ⊂ ℝ^p and x → ψθ(x) is a measurable p-vector valued function for each θ. This is a special case of the more general Z-estimation approach discussed in Section 2.2.5. Define the map θ → Ψ(θ) ≡ Pψθ, and assume θ0 ∈ Θ satisfies Ψ(θ0) = 0. Let θ̂n be an approximate zero of Ψn, and let θ̂∘n be an approximate zero of the bootstrapped estimating equation θ → Ψ∘n(θ) ≡ P∘nψθ, where P∘n is either P̃n of Corollary 10.14, with ξ1, . . . , ξn satisfying the conditions specified in the first paragraph of Section 10.1.3 (the multiplier bootstrap), or P̂n of Theorem 10.15 (the multinomial bootstrap).

The goal of this section is to determine reasonably general conditions under which √n(θ̂n − θ0) ⇝ Z, where Z is mean zero normally distributed, and √n(θ̂∘n − θ̂n) ⇝ k0Z conditionally on the data. Here the conditional weak convergence is over the weights ξ or W, depending on which bootstrap is being used, and k0 = τ/µ for the multiplier


bootstrap while k0 = 1 for the multinomial bootstrap. One could also estimate the limiting variance rather than use the bootstrap, but there are many settings, such as least absolute deviation regression, where variance estimation may be more awkward than the bootstrap. For theoretical validation of the bootstrap approach, we have the following theorem, which is related to Theorem 2.11 and which utilizes some of the bootstrap results of this chapter:

Theorem 10.16 Let Θ ⊂ ℝ^p be open, and assume θ0 ∈ Θ satisfies Ψ(θ0) = 0. Also assume the following:

(A) For any sequence {θn} ∈ Θ, Ψ(θn) → 0 implies ‖θn − θ0‖ → 0;

(B) The class {ψθ : θ ∈ Θ} is strong Glivenko-Cantelli;

(C) For some η > 0, the class F ≡ {ψθ : θ ∈ Θ, ‖θ − θ0‖ ≤ η} is Donsker and P‖ψθ − ψθ0‖² → 0 as ‖θ − θ0‖ → 0;

(D) P‖ψθ0‖² < ∞ and Ψ(θ) is differentiable at θ0 with nonsingular derivative matrix Vθ0;

(E) Ψn(θ̂n) = oP(n^{-1/2}) and Ψ∘n(θ̂∘n) = oP(n^{-1/2}).

Then √n(θ̂n − θ0) ⇝ Z ∼ N(0, V_{θ0}^{-1}P[ψ_{θ0}ψ_{θ0}^T](V_{θ0}^{-1})^T), and √n(θ̂∘n − θ̂n) ⇝ k0Z conditionally on the data.

Before giving the proof, we make a few comments about the Conditions (A)–(E) of the theorem. Condition (A) is one of several possible identifiability conditions. Condition (B) is a sufficient condition, when combined with (A), to yield consistency of a zero of Ψn. This condition is generally reasonable to verify in practice. Condition (C) is needed for asymptotic normality of √n(θ̂n − θ0) and is also not hard to verify in practice. Condition (D) enables application of the delta method at the appropriate juncture in the proof below, and (E) is a specification of the level of approximation permitted in obtaining the zeros of the estimating equations. See Exercise 10.5.5 below for a specific example of an estimation setting that satisfies these conditions.
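To make the setting concrete, the following hypothetical sketch (the model and score function ψθ are those of Exercise 10.5.5; the Newton iteration is our own stand-in for computing an approximate zero of the estimating equation) produces θ̂n and one multiplier-bootstrap draw θ̂∘n using standard exponential weights, for which τ/µ = 1 and hence k0 = 1:

```python
import numpy as np

rng = np.random.default_rng(2)
n, theta0 = 500, np.array([1.0, -0.5])

z = rng.normal(size=(n, 2))                  # covariates; E[ZZ^T] is positive definite
y = rng.binomial(1, 1 / (1 + np.exp(-z @ theta0)))

def solve_score(w, theta=np.zeros(2)):
    """Newton iteration for the weighted logistic score n^{-1} sum_i w_i psi_theta(Y_i,Z_i) = 0."""
    theta = theta.astype(float)
    for _ in range(50):
        p = 1 / (1 + np.exp(-z @ theta))
        score = (w[:, None] * z * (y - p)[:, None]).mean(axis=0)
        hess = -(w[:, None] * z * (p * (1 - p))[:, None]).T @ z / n  # derivative of the score
        theta = theta - np.linalg.solve(hess, score)
    return theta

theta_hat = solve_score(np.ones(n))                  # hat theta_n
xi = rng.exponential(size=n)                         # multiplier weights; tau/mu = 1, so k0 = 1
theta_boot = solve_score(xi / xi.mean(), theta_hat)  # one bootstrap draw hat theta_n^o

print(theta_hat, theta_boot)
```

Repeating the bootstrap draw B times and taking the empirical covariance of √n(θ̂∘n − θ̂n) then estimates the sampling covariance of √n(θ̂n − θ0).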

Proof of Theorem 10.16. By (B) and (E),

‖Ψ(θ̂n)‖ ≤ ‖Ψn(θ̂n)‖ + sup_{θ∈Θ} ‖Ψn(θ) − Ψ(θ)‖ ≤ oP(1).

Thus θ̂n P→ θ0 by the identifiability Condition (A). By Assertion (ii) of either Corollary 10.14 or Theorem 10.15 (depending on which bootstrap is used), Condition (B) implies sup_{θ∈Θ} ‖Ψ∘n(θ) − Ψ(θ)‖ as∗→ 0. Thus reapplication of Conditions (A) and (E) yields θ̂∘n P→ θ0. Note that for the first


part of the proof we will be using unconditional bootstrap results, while the associated conditional bootstrap results will be used only at the end.

By (C) and the consistency of θ̂n, we have Gnψ_{θ̂n} − Gnψ_{θ0} P→ 0. Since (E) now implies that Gnψ_{θ̂n} = √n P(ψ_{θ0} − ψ_{θ̂n}) + oP(1), we can use the parametric (Euclidean) delta method plus differentiability of Ψ to obtain

√n Vθ0(θ0 − θ̂n) + √n oP(‖θ̂n − θ0‖) = Gnψ_{θ0} + oP(1).   (10.14)

Since Vθ0 is nonsingular, this yields that √n‖θ̂n − θ0‖(1 + oP(1)) = OP(1), and thus √n(θ̂n − θ0) = OP(1). Combining this with (10.14), we obtain

√n(θ̂n − θ0) = −V_{θ0}^{-1}√n Pnψ_{θ0} + oP(1),   (10.15)

and thus √n(θ̂n − θ0) ⇝ Z with the specified covariance.

The first part of Condition (C) and Theorem 2.6 imply that G∘n ≡ k0^{-1}√n(P∘n − Pn) ⇝ G in ℓ∞(F) unconditionally, by arguments similar to those used in the (ii)⇒(i) part of the proof of Theorem 10.4. Combining this with the second part of Condition (C), we obtain k0G∘n(ψ_{θ̂∘n}) + Gn(ψ_{θ̂∘n}) − k0G∘n(ψ_{θ0}) − Gn(ψ_{θ0}) P→ 0. Condition (E) now implies √nP(ψ_{θ0} − ψ_{θ̂∘n}) = √nP∘nψ_{θ0} + oP(1). Using similar arguments to those used in the previous paragraph, we obtain

√n(θ̂∘n − θ0) = −V_{θ0}^{-1}√n P∘nψ_{θ0} + oP(1).

Combining with (10.15), we have

√n(θ̂∘n − θ̂n) = −V_{θ0}^{-1}√n(P∘n − Pn)ψ_{θ0} + oP(1).

The desired conditional bootstrap convergence now follows from Theorem 2.6, Part (ii) or Part (iii) (depending on which bootstrap is used).

10.4 Proofs

Proof of Theorem 10.10. Fix ǫ > 0 and a compact K ⊂ E0. The proof stems from certain properties of separable Banach spaces that can be found in Megginson (1998). Specifically, the fact that every separable Banach space is isometrically isomorphic to a subspace of C[0, 1] implies the existence of an isometric isomorphism J∗ : E0 → A0, where A0 is a subspace of C[0, 1]. Since C[0, 1] has a basis, we know by Theorem 4.1.33 of Megginson (1998) that it also has the “approximation property.” This means, by Theorem 3.4.32 of Megginson (1998), that since J∗(K) is compact, there exists a finite rank, bounded linear operator T∗ : A0 → A0 such that sup_{y∈J∗(K)} ‖T∗(y) − y‖ < ǫ. Because T∗ is finite rank, there exist elements z1, . . . , zk ∈ A0 ⊂ C[0, 1] and bounded linear functionals


f∗1, . . . , f∗k : A0 → ℝ such that T∗(y) = ∑_{j=1}^k zj f∗j(y). Note that since both J∗ and J∗^{-1} : A0 → E0 are isometric isomorphisms, they are also both Lipschitz continuous with Lipschitz constant 1. This means that the map x → Tǫ(x) ≡ J∗^{-1}(T∗(J∗(x))), with domain and range E0, satisfies sup_{x∈K} ‖Tǫ(x) − x‖ < ǫ.

We now need to verify the existence of several important extensions. By Dugundji's extension theorem (Theorem 10.9 above), there exists a continuous extension of J∗, J̄∗ : E → lin A0. Also, by the Hahn-Banach extension theorem, there exist bounded linear extensions of f∗1, . . . , f∗k, f̄∗1, . . . , f̄∗k : lin A0 → ℝ. Now let J denote the restriction of J∗^{-1} to the domain {∑_{j=1}^k zj f̄∗j(y) : y ∈ J∗(E0)}. Since J is Lipschitz continuous, as noted previously, we now have by Theorem 10.17 below that there exists a Lipschitz continuous extension of J, J̄ : lin (z1, . . . , zk) → E, with Lipschitz constant possibly larger than 1. Now define x → fj(x) ≡ f̄∗j(J̄∗(x)), for j = 1, . . . , k, and x → T̄ǫ(x) ≡ J̄(∑_{j=1}^k zj fj(x)); and note that T̄ǫ is a continuous extension of Tǫ. Now the quantities k, z1, . . . , zk, f1, . . . , fk, J̄ and T̄ǫ all satisfy the given requirements.

Proof of Theorem 10.13. Since the processes Pn − P, Wn and W̃n do not change when F is replaced by {f − Pf : f ∈ F}, we can use for the index set either F or {f − Pf : f ∈ F} without changing the processes. Define also F ≡ ‖f − Pf‖∗F. We first need to show that the expectations and probabilities in (iii), (iv), (vii) and (viii) are well defined. Note that for fixed x1, . . . , xn and a ≡ (a1, . . . , an) ∈ ℝ^n, the map

(a1, . . . , an) → ‖n^{-1}∑_{i=1}^n ai f(xi)‖_F = sup_{u∈B} |a^T u|,

where B ≡ {(f(x1), . . . , f(xn)) : f ∈ F} ⊂ ℝ^n. By the continuity of the map (a, u) → |a^T u| and the separability of ℝ^n, this map is a measurable function even if the set B is not a Borel set. Thus the conditional expectations and conditional probabilities are indeed well defined.

(i)⇒(ii): Note that F being P-G-C implies that PF < ∞ by Lemma 8.13. Because F is G-C and ξ is trivially G-C, the desired result follows from Corollary 9.27 (of Section 9.3) and the fact that ‖ξ(f − Pf)‖∗F ≤ |ξ| × F is integrable.

(ii)⇒(i): Since both sign(ξ) and ξ · F are P-G-C, Corollary 9.27 can be applied to verify that sign(ξ) · ξ · F = |ξ| · F is also P-G-C. We also have by Lemma 8.13 that P∗F < ∞ since E|ξ| > 0. Now we have, for fixed X1, . . . , Xn,

(Eξ)‖Pn‖F = ‖n^{-1}∑_{i=1}^n (Eξi)δ_{Xi}‖F ≤ Eξ‖Wn‖F,


and thus (Eξ)E∗‖Pn−P‖F ≤ E∗‖Wn‖F . By applying Theorem 9.29 twice,we obtain the desired result.

(ii)⇒(iii): Note first that (ii) immediately implies P|ξ|F(X) < ∞ by Lemma 8.13. Thus PF < ∞ since E|ξ| > 0. Define Rn ≡ n^{-1}∑_{i=1}^n |ξi|F(Xi), and let B be the set of all infinite sequences (ξ1, X1), (ξ2, X2), . . . such that ‖Wn‖F as∗→ 0 and Rn → E[|ξ|F]. Note that the set B has probability 1. Moreover, by the bounded convergence theorem,

lim sup Eξ[‖Wn‖F 1{Rn ≤ K}] = lim sup Eξ,∞[‖Wn‖F 1{Rn ≤ K}1{B}] ≤ Eξ,∞[lim sup ‖Wn‖∗F 1{Rn ≤ K}1{B}] = 0,

outer almost surely, for any K < ∞. In addition, if we let Sn = n^{-1}∑_{i=1}^n |ξi|, we have for any 0 < N < ∞ that

Eξ[‖Wn‖F 1{Rn > K}] ≤ Eξ[Rn 1{Rn > K}] ≤ Eξ[Sn 1{Rn > K}]PnF ≤ N(E|ξ|)[PnF]²/K + Eξ[Sn 1{Sn > N}]PnF,

where the second-to-last inequality follows by symmetry. By Exercise 10.5.4, the last line of the display has a lim sup ≤ N(PF)²/K + (E|ξ|)²/N outer almost surely. Thus, if we let N = √K and allow K ↑ ∞ slowly enough, we ensure that lim sup_{n→∞} Eξ‖Wn‖F → 0, outer almost surely. Hence (iii) follows.

(iii)⇒(iv): This is obvious.

(iv)⇒(ii): Assertion (iv) clearly implies that ‖Wn‖F P→ 0. Now Lemma 8.16 implies that, since the class |ξ| × F has an integrable envelope, a version of ‖Wn‖∗F must converge outer almost surely to a constant. Thus (ii) follows.

(ii)⇒(v): Assertion (ii) implies that there exists a measurable set B of infinite sequences (ξ1, X1), (ξ2, X2), . . . with P(B) = 1 such that ‖Wn‖∗F → 0 on B for some version of ‖Wn‖∗F. Thus by the bounded convergence theorem, we have for any η > 0 and almost all sequences X1, X2, . . .,

lim sup_{n→∞} P(‖Wn‖∗F > η | Xn) = lim sup_{n→∞} Eξ,∞[1{‖Wn‖∗F > η}1{B}] = Eξ,∞[lim sup_{n→∞} 1{‖Wn‖∗F > η}1{B}] = 0.

Thus (v) follows.

(v)⇒(iv): This is obvious.

(ii)⇒(vi): Note that

‖W̃n − Wn‖F ≤ |ξ̄ − Eξ| × |n^{-1}∑_{i=1}^n F(Xi)| + (Eξ)‖Pn‖F.   (10.16)


Since (ii)⇒(i), P∗‖f − Pf‖F < ∞. Thus, since the centered weights ξi − Eξ satisfy the conditions of the theorem as well as the original weights, the right side converges to zero outer almost surely. Hence ‖W̃n‖F as∗→ 0, and (vi) follows.

(vi)⇒(ii): Since ξi can be replaced with ξi − Eξ without changing W̃n, we will assume without loss of generality that Eξ = 0 (for this paragraph only). Accordingly, (10.16) implies

‖W̃n − Wn‖F ≤ |ξ̄ − Eξ| × |n^{-1}∑_{i=1}^n F(Xi)|.   (10.17)

Thus (ii) will follow by the strong law of large numbers if we can show that EF < ∞. Now let Y1, . . . , Yn be i.i.d. P independent of X1, . . . , Xn, and let ξ′1, . . . , ξ′n be i.i.d. copies of ξ1, . . . , ξn independent of X1, . . . , Xn, Y1, . . . , Yn. Define

W̄2n ≡ (2n)^{-1}∑_{i=1}^n [(ξi − ξ̄)f(Xi) + (ξ′i − ξ̄)f(Yi)]  and  W̄′2n ≡ (2n)^{-1}∑_{i=1}^n [(ξ′i − ξ̄)f(Xi) + (ξi − ξ̄)f(Yi)],

where ξ̄ ≡ (2n)^{-1}∑_{i=1}^n (ξi + ξ′i). Since both ‖W̄2n‖F as∗→ 0 and ‖W̄′2n‖F as∗→ 0, we have that ‖W̄2n − W̄′2n‖F as∗→ 0. However,

W̄2n − W̄′2n = n^{-1}∑_{i=1}^n ((ξi − ξ′i)/2)[f(Xi) − f(Yi)],

and thus ‖n^{-1}∑_{i=1}^n [f(Xi) − f(Yi)]‖F as∗→ 0 by the previously established equivalence between (i) and (ii) and the fact that the new weights (ξi − ξ′i)/2 satisfy the requisite conditions. Thus

E∗F = E∗‖f(X) − Ef(Y)‖F ≤ E∗‖f(X) − f(Y)‖F < ∞,

where the last inequality holds by Lemma 8.13, and (ii) now follows.

(iii)⇒(vii): Since

Eξ‖W̃n − Wn‖F ≤ (E|ξ|)‖Pn‖F,

we have that Eξ‖W̃n‖F as∗→ 0, because (iii) also implies (i).

(vii)⇒(viii): This is obvious.

(viii)⇒(vi): Since W̃n does not change if the ξi are replaced by ξi − Eξ, we will assume, as we did in the proof that (vi)⇒(ii), that Eξ = 0 without loss of generality. By reapplication of (10.17) and the strong law of large numbers, we obtain that ‖W̃n‖F P→ 0. Since the class |ξ| × F has an integrable envelope, reapplication of Lemma 8.16 yields the desired result.

(vi)⇒(ix): The proof here is identical to the proof that (ii)⇒(v), after exchanging Wn with W̃n.

(ix)⇒(viii): This is obvious.


Proof of Theorem 10.15. The fact that the conditional expressions in Assertions (iii) and (iv) are well defined can be argued as in the proof of Theorem 10.13 above, and we omit the details.

(i)⇒(v): This follows from Lemma 10.18 below, since the vectors Wn/n are exchangeable and satisfy the other required conditions.

(v)⇒(iv): This is obvious.

(iv)⇒(i): For each integer n ≥ 1, generate an infinite sequence of independent random row n-vectors m_n^{(1)}, m_n^{(2)}, . . . as follows. Set m_1^{(k)} = 1 for all integers k ≥ 1, and for each n > 1, generate an infinite sequence of i.i.d. Bernoullis B_n^{(1)}, B_n^{(2)}, . . . with probability of success 1/n, and set m_n^{(k)} = [1 − B_n^{(k)}](m_{n−1}^{(k)}, 0) + B_n^{(k)}(0, . . . , 0, 1). Note that for each fixed n, m_n^{(1)}, m_n^{(2)}, . . . are i.i.d. multinomial(1, n^{-1}, . . . , n^{-1}) vectors. Independent of these random quantities, generate an infinite sequence U1, U2, . . . of i.i.d. Poisson random variables with mean 1, and set Nn ≡ ∑_{i=1}^n Ui. Also make sure that all of these random quantities are independent of X1, X2, . . ..

Without loss of generality assume Wn = ∑_{i=1}^n m_n^{(i)}, and define ξ^{(n)} ≡ ∑_{i=1}^{Nn} m_n^{(i)}. It is easy to verify that the Wn are indeed multinomial(n, n^{-1}, . . . , n^{-1}) vectors as claimed, and that ξ_1^{(n)}, . . . , ξ_n^{(n)}, where (ξ_1^{(n)}, . . . , ξ_n^{(n)}) ≡ ξ^{(n)}, are i.i.d. Poisson mean 1 random variables. Note also that these random weights are independent of X1, X2, . . ..
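The recursive construction of the m_n^{(k)} vectors is concrete enough to run. The following illustrative sketch (not from the text) implements the recursion exactly as described and checks that each resulting vector places its single 1 in a uniformly chosen cell, i.e., is multinomial(1, n^{-1}, . . . , n^{-1}):

```python
import numpy as np

rng = np.random.default_rng(5)

def m_vector(n):
    """One draw of m_n^(k) via the recursion in the proof: at step j, with
    probability 1/j move all mass to a new j-th cell, else keep and pad with 0."""
    m = np.array([1])
    for j in range(2, n + 1):
        if rng.random() < 1.0 / j:                        # B_j = 1
            m = np.append(np.zeros(j - 1, dtype=int), 1)
        else:                                             # B_j = 0
            m = np.append(m, 0)
    return m

n = 5
draws = np.array([m_vector(n) for _ in range(20000)])
print(draws.mean(axis=0))   # each cell frequency should be near 1/n = 0.2
```

Summing n independent such vectors then gives a multinomial(n, n^{-1}, . . . , n^{-1}) vector, while summing a Poisson(n) number of them gives the i.i.d. Poisson weights used in the coupling argument.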

Let Wn ≡ n^{-1}∑_{i=1}^n (ξ_i^{(n)} − 1)(δ_{Xi} − P), and note that

P̂n − Pn − Wn = n^{-1}∑_{i=1}^n (Wni − ξ_i^{(n)})(δ_{Xi} − P).

Since the nonzero elements of (Wni − ξ_i^{(n)}) all have the same sign by construction, we have that

EW,ξ‖P̂n − Pn − Wn‖F ≤ EW,ξ‖n^{-1}∑_{i=1}^n |Wni − ξ_i^{(n)}|(δ_{Xi} − P)‖F ≤ (E[|Nn − n|/n])[PnF + PF] as∗→ 0,

where the last inequality follows from the exchangeability result EW,ξ|Wni − ξ_i^{(n)}| = E[|Nn − n|/n], 1 ≤ i ≤ n, and the outer almost sure convergence to zero follows from the fact that E[|Nn − n|/n] ≤ n^{-1/2} combined with the moment conditions. In the foregoing, we have used EW,ξ to denote taking expectations over Wn and ξ^{(n)} conditional on X1, X2, . . .. We have just established that Assertion (iv) holds in Theorem 10.13 with weights (ξ_1^{(n)} − 1, . . . , ξ_n^{(n)} − 1) that satisfy the necessary conditions. Thus F is Glivenko-Cantelli.


(v)⇒(ii): Let EW,ξ,∞ be the expectation over the infinite sequences of the weights conditional on X1, X2, . . .. For fixed η > 0 and X1, X2, . . ., we have by the bounded convergence theorem

EW,ξ,∞[lim sup_{n→∞} 1{‖P̂n − Pn‖∗F > η}] = lim sup_{n→∞} EW,ξ,∞[1{‖P̂n − Pn‖∗F > η}].

But the right side → 0 for almost all X1, X2, . . . by (v). This implies 1{‖P̂n − Pn‖∗F > η} → 0, almost surely. Now (ii) follows since η was arbitrary.

(ii)⇒(iii): Let B be the set of all sequences of weights and data for which ‖P̂n − Pn‖∗F → 0. From (ii), we know that B is measurable, P(B) = 1 and, by the bounded convergence theorem, we have for every η > 0 and all X1, X2, . . .,

0 = EW,ξ,∞[lim sup_{n→∞} 1{‖P̂n − Pn‖∗F > η}1{B}] = lim sup_{n→∞} EW,ξ,∞[1{‖P̂n − Pn‖∗F > η}1{B}].

Since P(B) = 1, this last line implies (v), since η was arbitrary, and hence Assertions (i) and (iv) also hold by the previously established equivalences. Fix 0 < K < ∞, and note that the class F · 1{F ≤ K} is strong G-C by Corollary 9.27. Now

EW‖P̂n − Pn‖F ≤ EW‖P̂n − Pn‖_{F·1{F≤K}} + EW[(P̂n + Pn)F1{F > K}] ≤ EW‖P̂n − Pn‖_{F·1{F≤K}} + 2Pn[F1{F > K}] as∗→ 2P[F1{F > K}],

by Assertion (iv). Since this last term can be made arbitrarily small by choosing K large enough, Assertion (iii) follows.

(iii)⇒(iv): This is obvious.

Theorem 10.17 Let X, Z be metric spaces, with the dimension of X being finite, and let Y ⊂ X. For any Lipschitz continuous map f : Y → Z, there exists a Lipschitz continuous extension F : X → Z.

Proof. This is a simplification of Theorem 2 of Johnson, Lindenstrauss and Schechtman (1986), and the proof can be found therein.

Lemma 10.18 Let F be a strong Glivenko-Cantelli class of measurable functions. For each n, let (Mn1, . . . , Mnn) be an exchangeable nonnegative random vector, independent of X1, X2, . . ., such that ∑_{i=1}^n Mni = 1 and max_{1≤i≤n} |Mni| P→ 0. Then, for every η > 0,


P(‖∑_{i=1}^n Mni(δ_{Xi} − P)‖_F > η | Xn) as∗→ 0.

This is Lemma 3.6.16 of VW, and we omit the proof.

10.5 Exercises

10.5.1. Show the following:

(a) ‖ · ‖2,1 is a norm on the space of real, square-integrable random variables.

(b) For any real random variable ξ, and any r > 2, (1/2)‖ξ‖2 ≤ ‖ξ‖2,1 ≤ (r/(r − 2))‖ξ‖r. Hints: For the first inequality, show that E|ξ|² ≤ 2∫_0^∞ P(|ξ| > u)u du ≤ 2‖ξ‖2,1 × ‖ξ‖2. For the second inequality, show first that

‖ξ‖2,1 ≤ a + ∫_a^∞ (‖ξ‖_r^r / x^r)^{1/2} dx

for any a > 0.

10.5.2. Show that for any p > 1, and any real i.i.d. X1, . . . , Xn with E|X|^p < ∞, we have

E[max_{1≤i≤n} |Xi|/n^{1/p}] → 0,

as n → ∞. Hint: Show first that for any x > 0,

lim sup_{n→∞} P(max_{1≤i≤n} |Xi|/n^{1/p} > x) ≤ 1 − exp(−E|X|^p/x^p).
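A quick Monte Carlo sanity check of this limit (illustrative code, not part of the exercise) takes p = 2 and standard normal Xi, which satisfy E|X|² < ∞, and estimates the expectation by averaging over repeated samples:

```python
import numpy as np

rng = np.random.default_rng(3)

def mean_scaled_max(n, p=2.0, reps=200):
    """Monte Carlo estimate of E[max_i |X_i| / n^(1/p)] for standard normal X_i."""
    x = np.abs(rng.normal(size=(reps, n)))
    return (x.max(axis=1) / n ** (1.0 / p)).mean()

for n in [10, 100, 1000, 10000]:
    print(n, mean_scaled_max(n))
```

For Gaussian data the maximum grows only like √(2 log n), so the scaled expectation should visibly decay toward zero as n increases.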

10.5.3. Prove Lemma 10.12.

10.5.4. Let ξ1, . . . , ξn be i.i.d. nonnegative random variables with Eξ < ∞, and denote Sn = n^{-1}∑_{i=1}^n ξi. Show that

lim_{m→∞} lim sup_{n→∞} E[Sn 1{Sn > m}] = 0.

Hint: Theorem 9.29 may be useful here.

10.5.5. Assume that, given the covariate Z ∈ ℝ^p, Y is Bernoulli with probability of success e^{θᵀZ}/(1 + e^{θᵀZ}), where θ ∈ Θ = ℝ^p and E[ZZᵀ] is positive definite. Assume that we observe an i.i.d. sample (Y1, Z1), . . . , (Yn, Zn) generated from this model with true parameter θ0 ∈ ℝ^p. Show that the conditions of Theorem 10.16 are satisfied for Z-estimators based on

ψθ(y, z) = z(y − e^{θᵀz}/(1 + e^{θᵀz})).

Note that one of the challenges here is the noncompactness of Θ.


10.6 Notes

Much of the material in Section 10.1 is inspired by Chapters 2.9 and 3.6 of VW, although the results for the weights (ξ1 − ξ̄, . . . , ξn − ξ̄) and (ξ1/ξ̄ − 1, . . . , ξn/ξ̄ − 1) and the continuous mapping results are essentially new. The equivalence of Assertions (i) and (ii) of Theorem 10.4 is Theorem 2.9.2 of VW, while the equivalence of (i) and (ii) of Theorem 10.6 is Theorem 2.9.6 of VW. Lemma 10.5 is Lemma 2.9.5 of VW. Theorem 10.16 is an expansion of Theorem 5.21 of van der Vaart (1998), and Part (b) of Exercise 10.5.1 is Exercise 2.9.1 of VW.


11 Additional Empirical Process Results

In this chapter, we study several additional empirical process results that are useful but don't fall neatly into the framework of the other chapters. Because the contents of this chapter are somewhat specialized, some readers may want to skip it the first time they read the book. Although some of the results given herein will be used in later chapters, the results of this chapter are not really necessary for a philosophical understanding of the remainder of the book. On the other hand, this chapter does contain results and references that are useful for readers interested in the deeper potential of empirical process methods in genuinely hard statistical problems.

We first discuss bounding tail probabilities and moments of ‖Gn‖F. These results will be useful in Chapter 14 for determining rates of convergence of M-estimators. We then discuss Donsker results for classes composed of sequences of functions and present several related statistical applications. After this, we discuss contiguous alternative probability measures Pn that get progressively closer to a fixed “null” probability measure P as n gets larger. These results will be useful in Part III of the book, especially in Chapter 18, where we discuss optimality of tests.

We then discuss weak convergence of sums of independent but not identically distributed stochastic processes which arise, for example, in clinical trials with non-independent randomization schemes such as biased coin designs (see, for example, Wei, 1978). We develop this topic in some depth, discussing both a central limit theorem and validity of a certain weighted bootstrap procedure. We also specialize these results to empirical processes based on i.i.d. data but with function classes Fn that change with n.


The final topic we cover is Donsker results for dependent observations.Here, our brief treatment is primarily meant to introduce the subject andpoint the interested reader toward the key results and references.

11.1 Bounding Moments and Tail Probabilities

We first consider bounding moments of ‖Gn‖∗F under assumptions similar to the Donsker theorems of Chapter 8. While these do not provide sharp bounds, the bounds are still useful for certain problems. We need to introduce some slightly modified entropy integrals. The first is based on a modified uniform entropy integral:

J∗(δ, F) ≡ sup_Q ∫_0^δ √(1 + log N(ǫ‖F‖_{Q,2}, F, L2(Q))) dǫ,

where the supremum is over all finitely discrete probability measures Q with ‖F‖_{Q,2} > 0. The only difference between this and the previously defined uniform entropy integral J(δ, F, L2) is the presence of the 1 under the radical. The following theorem, which we give without proof, is a subset of Theorem 2.14.1 of VW:

Theorem 11.1 Let F be a P-measurable class of measurable functions, with measurable envelope F. Then, for each p ≥ 1,

‖ ‖Gn‖∗F ‖_{P,p} ≤ c_p J∗(1, F)‖F‖_{P,2∨p},

where the constant c_p < ∞ depends only on p.

We next provide an analogue of Theorem 11.1 for bracketing entropy, based on the modified bracketing integral:

J∗[](δ, F) ≡ ∫_0^δ √(1 + log N[](ǫ‖F‖_{P,2}, F, L2(P))) dǫ.

The difference between this definition and the previously defined J[](δ, F, L2(P)) is twofold: the presence of the 1 under the radical and a rescaling of ǫ by the factor ‖F‖_{P,2}. The following theorem, a subset of Theorem 2.14.2 of VW, is given without proof:

Theorem 11.2 Let F be a class of measurable functions with measurable envelope F. Then

‖ ‖Gn‖∗F ‖_{P,1} ≤ c J∗[](1, F)‖F‖_{P,2},

for some universal constant c < ∞.


We now provide an additional moment bound based on another modification of the bracketing integral. This modification is

J̃[](δ, F, ‖ · ‖) ≡ ∫_0^δ √(1 + log N[](ǫ, F, ‖ · ‖)) dǫ.

Note that unlike the definition of J∗[] above, the size of the brackets used to compute N[] is not standardized by the norm of the envelope F. The following theorem, which we give without proof, is Lemma 3.4.2 of VW:

Theorem 11.3 Let F be a class of measurable functions such that Pf² < δ² and ‖f‖∞ ≤ M for every f ∈ F. Then

‖ ‖Gn‖∗F ‖_{P,1} ≤ c J̃[](δ, F, L2(P)) [1 + (J̃[](δ, F, L2(P))/(δ²√n)) M],

for some universal constant c < ∞.

For some problems, the previous moment bounds are sufficient. In other settings, more refined tail probability bounds are needed. To accomplish this, stronger assumptions are needed for the involved function classes. Recall the definition of pointwise measurable (PM) classes from Section 8.2. The following tail probability results, the proofs of which are given in Section 11.7 below, apply only to bounded and PM classes:

Theorem 11.4 Let F be a pointwise measurable class of functions f : X → [−M, M], for some M < ∞, such that

sup_Q log N(ǫ, F, L2(Q)) ≤ K(1/ǫ)^W,   (11.1)

for all 0 < ǫ ≤ M and some constants 0 < W < 2 and K < ∞, where the supremum is taken over all finitely discrete probability measures. Then

‖ ‖Gn‖∗F ‖_{ψ2} ≤ c,

for all n ≥ 1, where c < ∞ depends only on K, W and M.

Examples of interesting function classes that satisfy the conditions of Theorem 11.4 are bounded VC classes that are also PM, and the set of all nondecreasing distribution functions on ℝ. This follows from Theorem 9.3 and Lemma 9.11, since the class of distribution functions can be shown to be PM. To see this last claim, for each integer m ≥ 1, let Gm be the class of empirical distribution functions based on a sample of size m from the rationals union {−∞, ∞}. It is not difficult to show that G ≡ ∪_{m≥1} Gm is countable and that for each distribution function f, there exists a sequence {gm} ∈ G such that gm(x) → f(x), as m → ∞, for each x ∈ ℝ.


Theorem 11.5 Let F be a pointwise measurable class of functions f : X → [−M, M], for some M < ∞, such that

N[](ǫ, F, L2(P)) ≤ K(1/ǫ)^W,   (11.2)

for all 0 < ǫ ≤ M and some constants K, W < ∞. Then

‖ ‖Gn‖∗F ‖_{ψ2} ≤ c,

for all n ≥ 1, where c < ∞ depends only on K, W and M.

An example of a function class that satisfies the conditions of Theorem 11.5 is given by the Lipschitz classes of Theorem 9.23 which satisfy Condition (9.4), provided T is separable and N(ǫ, T, d) ≤ K(1/ǫ)^W for some constants K, W < ∞. This will certainly be true if (T, d) is a compact subset of a Euclidean space.

By Lemma 8.1, if the real random variable X satisfies ‖X‖ψ2 < ∞, then the tail probabilities of X are “subgaussian” in the sense that P(|X| > x) ≤ Ke^{−Cx²} for some constants K < ∞ and C > 0. These results can be significantly refined under stronger conditions to yield more precise bounds on the constants. Some results along this line can be found in Chapter 2.14 of VW. A very strong result applies to the empirical distribution function Fn, where F consists of left half-lines in ℝ:

Theorem 11.6 For any i.i.d. sample X1, . . . , Xn with distribution F,

P(sup_{t∈ℝ} √n|Fn(t) − F(t)| > x) ≤ 2e^{−2x²},

for all x > 0.
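The bound is easy to probe by simulation. The sketch below (illustrative code, not from the text) uses uniform samples, for which F(t) = t, computes √n sup_t |Fn(t) − F(t)| exactly at the jump points, and compares the empirical tail at x = 1 with 2e^{−2x²}; the two should be close, since the constant 2 in the bound is known to be essentially sharp for moderate x:

```python
import numpy as np

rng = np.random.default_rng(4)

def ks_stat(n):
    """sqrt(n) * sup_t |F_n(t) - F(t)| for Uniform(0,1) data, where F(t) = t."""
    u = np.sort(rng.uniform(size=n))
    k = np.arange(1, n + 1)
    # the sup of the two one-sided deviations is attained at the jump points
    return np.sqrt(n) * np.maximum(k / n - u, u - (k - 1) / n).max()

n, reps, x = 200, 2000, 1.0
stats = np.array([ks_stat(n) for _ in range(reps)])
empirical_tail = (stats > x).mean()
dkw_bound = 2 * np.exp(-2 * x ** 2)
# the empirical tail should not exceed the bound by more than Monte Carlo error
print(empirical_tail, dkw_bound)
```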

Proof. This is the celebrated result of Dvoretzky, Kiefer and Wolfowitz (1956), given in their Lemma 2, as refined by Massart (1990) in his Corollary 1. We omit the proof of their result but note that it applies to the special case where F is continuous. We now show that it also applies when F may be discontinuous. Without loss of generality, assume that F has discontinuities, and let t1, . . . , tm be the locations of the discontinuities of F, where m may be infinity. Note that the number of discontinuities can be at most countable. Let p1, . . . , pm be the jump sizes of F at t1, . . . , tm. Now let U1, . . . , Un be i.i.d. uniform random variables independent of X1, . . . , Xn, and define new random variables

Yi = Xi + ∑_{j=1}^m pj[1{Xi > tj} + Ui1{Xi = tj}],

1 ≤ i ≤ n. Define also the transformation t → T (t) = t +∑m

j=1 pj1t ≥tj; let F∗

n be the empirical distribution of Y1, . . . , Yn; and let F ∗ be thedistribution of Y1. It is not hard to verify that


$$\sup_{t \in \mathbb{R}} |\mathbb{F}_n(t) - F(t)| = \sup_{t \in \mathbb{R}} |\mathbb{F}_n^*(T(t)) - F^*(T(t))| \le \sup_{s \in \mathbb{R}} |\mathbb{F}_n^*(s) - F^*(s)|,$$

and the desired result now follows since $F^*$ is continuous.
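As a quick numerical illustration (ours, not from the text), the bound of Theorem 11.6 can be compared against a Monte Carlo estimate for uniform data; the function name `dkw_exceedance` and all tuning constants below are hypothetical choices:

```python
import numpy as np

rng = np.random.default_rng(0)

def dkw_exceedance(n, x, reps=2000):
    """Monte Carlo estimate of P(sup_t sqrt(n)|F_n(t) - F(t)| > x)
    for n i.i.d. uniform(0, 1) observations."""
    count = 0
    for _ in range(reps):
        u = np.sort(rng.uniform(size=n))
        i = np.arange(1, n + 1)
        # sup_t |F_n(t) - F(t)| is attained at the order statistics
        ks = np.maximum(i / n - u, u - (i - 1) / n).max()
        count += np.sqrt(n) * ks > x
    return count / reps

x = 1.0
print(dkw_exceedance(200, x), "vs bound", 2 * np.exp(-2 * x**2))
```

At $x = 1$ the estimate should come out close to the bound $2e^{-2} \approx 0.271$, reflecting that the inequality is nearly sharp in the continuous case.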

11.2 Sequences of Functions

Whether a countable class of functions $\mathcal{F}$ is $P$-Donsker can be verified using the methods of Chapters 9 and 10, but sometimes the special structure of certain countable classes simplifies the evaluation. This is true for certain classes composed of sequences of functions. The following is our first result in this direction:

Theorem 11.7 Let $\{f_i, i \ge 1\}$ be any sequence of measurable functions satisfying $\sum_{i=1}^\infty P(f_i - Pf_i)^2 < \infty$. Then the class

$$\left\{ \sum_{i=1}^\infty c_i f_i : \sum_{i=1}^\infty |c_i| \le 1 \text{ and the series converges pointwise} \right\}$$

is $P$-Donsker.

Proof. Since the class given in the conclusion of the theorem is the pointwise closure of the symmetric convex hull (see the comments given in Section 9.1.1 just before Theorem 9.4) of the class $\{f_i\}$, it is enough to verify that $\{f_i\}$ is Donsker, by Theorem 9.30. To this end, fix $\epsilon > 0$ and define for each positive integer $m$ a partition $\{f_i\} = \cup_{i=1}^{m+1} \mathcal{F}_i$ as follows. For each $i = 1, \ldots, m$, let $\mathcal{F}_i$ consist of the single point $f_i$, and let $\mathcal{F}_{m+1} = \{f_{m+1}, f_{m+2}, \ldots\}$. Since $\sup_{f,g \in \mathcal{F}_i} |\mathbb{G}_n(f - g)| = 0$ (trivially) for $i = 1, \ldots, m$, we have, by Chebyshev's inequality,

$$P\left( \sup_i \sup_{f,g \in \mathcal{F}_i} |\mathbb{G}_n(f - g)| > \epsilon \right) \le P\left( \sup_{f \in \mathcal{F}_{m+1}} |\mathbb{G}_n f| > \frac{\epsilon}{2} \right) \le \frac{4}{\epsilon^2} \sum_{i=m+1}^\infty P(f_i - Pf_i)^2.$$

Since this last term can be made arbitrarily small by choosing $m$ large enough, and since $\epsilon$ was arbitrary, the desired result now follows by Theorem 2.1 via Lemma 7.20.

When the sequence $\{f_i\}$ satisfies $P f_i f_j = 0$ whenever $i \ne j$, Theorem 11.7 can be strengthened as follows:

Theorem 11.8 Let $\{f_i, i \ge 1\}$ be any sequence of measurable functions satisfying $P f_i f_j = 0$ for all $i \ne j$ and $\sum_{i=1}^\infty P f_i^2 < \infty$. Then the class

$$\left\{ \sum_{i=1}^\infty c_i f_i : \sum_{i=1}^\infty c_i^2 \le 1 \text{ and the series converges pointwise} \right\}$$

is $P$-Donsker.

Proof. Since the conditions on $c \equiv (c_1, c_2, \ldots)$ ensure $\sum_{i=1}^\infty c_i f_i \le \sqrt{\sum_{i=1}^\infty f_i^2}$, we have by the dominated convergence theorem that pointwise converging sums also converge in $L_2(P)$. Now we argue that the class $\mathcal{F}$ of all of these sequences is totally bounded in $L_2(P)$. This follows because $\mathcal{F}$ can be arbitrarily closely approximated by a finite-dimensional set, since

$$P\left( \sum_{i>m} c_i f_i \right)^2 = \sum_{i>m} c_i^2 P f_i^2 \le \sum_{i>m} P f_i^2 \to 0,$$

as $m \to \infty$. Thus the theorem is proved if we can show that the sequence $\mathbb{G}_n$, as a process indexed by $\mathcal{F}$, is asymptotically equicontinuous with respect to the $L_2(P)$-seminorm. Accordingly, note that for any $f = \sum_{i=1}^\infty c_i f_i$, $g = \sum_{i=1}^\infty d_i f_i$, and integer $k \ge 1$,

$$|\mathbb{G}_n(f) - \mathbb{G}_n(g)|^2 = \left[ \sum_{i=1}^\infty (c_i - d_i) \mathbb{G}_n(f_i) \right]^2 \le 2 \left[ \sum_{i=1}^k (c_i - d_i)^2 P f_i^2 \right] \left[ \sum_{i=1}^k \frac{\mathbb{G}_n^2(f_i)}{P f_i^2} \right] + 2 \left[ \sum_{i=k+1}^\infty (c_i - d_i)^2 \right] \left[ \sum_{i=k+1}^\infty \mathbb{G}_n^2(f_i) \right].$$

Since, by assumption, $\|c - d\| \le \|c\| + \|d\| \le 2$ (here, $\|\cdot\|$ is the infinite-dimensional Euclidean norm), the above expression is bounded by

$$2 \|f - g\|_{P,2}^2 \sum_{i=1}^k \frac{\mathbb{G}_n^2(f_i)}{P f_i^2} + 8 \sum_{i=k+1}^\infty \mathbb{G}_n^2(f_i).$$

Now take the supremum over all pairs of series $f$ and $g$ with $\|f - g\|_{P,2} < \delta$. Since also $E \mathbb{G}_n^2(f_i) \le P f_i^2$, the expectation is bounded above by $2\delta^2 k + 8 \sum_{i=k+1}^\infty P f_i^2$. This quantity can now be made arbitrarily small by first choosing $k$ large and then choosing $\delta$ small enough.

If the functions $f_i$ involved in the preceding theorem are an orthonormal sequence $\{\psi_i\}$ in $L_2(P)$, then the result can be reexpressed in terms of an elliptical class for a fixed sequence of constants $\{b_i\}$:

$$\mathcal{F} \equiv \left\{ \sum_{i=1}^\infty c_i \psi_i : \sum_{i=1}^\infty \frac{c_i^2}{b_i^2} \le 1 \text{ and the series converges pointwise} \right\}.$$

More precisely, Theorem 11.8 implies that $\mathcal{F}$ is $P$-Donsker if $\sum_{i=1}^\infty b_i^2 < \infty$. Note that a sufficient condition for the stated pointwise convergence to hold


at the point $x$ for all $c_i$ satisfying $\sum_{i=1}^\infty c_i^2 / b_i^2 \le 1$ is that $\sum_{i=1}^\infty b_i^2 \psi_i^2(x) < \infty$. A very important property of an empirical process indexed by an elliptical class $\mathcal{F}$ is the following:

$$\|\mathbb{G}_n\|_{\mathcal{F}}^2 = \sup_{f \in \mathcal{F}} \left[ \sum_{i=1}^\infty c_i \mathbb{G}_n(\psi_i) \right]^2 = \sum_{i=1}^\infty b_i^2 \mathbb{G}_n^2(\psi_i). \qquad (11.3)$$

In the central quantity, each function $f \in \mathcal{F}$ is represented by its series representation $\{c_i\}$. For the second equality, it is easy to see that the last term is an upper bound for the second term by the Cauchy-Schwarz inequality combined with the fact that $\sum_{i=1}^\infty c_i^2 / b_i^2 \le 1$. The next thing to note is that this maximum can be achieved by setting $c_i = b_i^2 \mathbb{G}_n(\psi_i) \big/ \sqrt{\sum_{i=1}^\infty b_i^2 \mathbb{G}_n^2(\psi_i)}$.

An important use for elliptical classes is to characterize the limiting distributions of the one- and two-sample Cramér-von Mises, Anderson-Darling, and Watson statistics. We will now demonstrate this for both the Cramér-von Mises and Anderson-Darling statistics. For a study of the one-sample Watson statistic, see Example 2.13.4 of VW. We now have the following key result, the proof of which we give in Section 11.7. Note that the result (11.3) plays an important role in the proof.
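The Cauchy-Schwarz argument behind identity (11.3) is easy to verify numerically on a finite truncation of the series; the constants $b_i$ and the stand-in values for $\mathbb{G}_n(\psi_i)$ below are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(1)

# Finite-dimensional check of (11.3): for fixed a_i standing in for
# G_n(psi_i), sup over {c : sum c_i^2/b_i^2 <= 1} of (sum c_i a_i)^2
# equals sum b_i^2 a_i^2.
b = 1.0 / np.arange(1, 21)          # hypothetical constants b_i
a = rng.normal(size=20)             # stand-ins for G_n(psi_i)

closed_form = np.sum(b**2 * a**2)

# the maximizer from the text: c_i = b_i^2 a_i / sqrt(sum b_i^2 a_i^2)
c_star = b**2 * a / np.sqrt(closed_form)
assert np.isclose(np.sum(c_star**2 / b**2), 1.0)       # on the constraint boundary
assert np.isclose(np.dot(c_star, a)**2, closed_form)   # attains the supremum

# random feasible points never exceed the closed form (Cauchy-Schwarz)
for _ in range(1000):
    c = rng.normal(size=20) * b
    c /= max(1.0, np.sqrt(np.sum(c**2 / b**2)))        # project into the ellipse
    assert np.dot(c, a)**2 <= closed_form + 1e-9
print("identity (11.3) verified on a 20-term truncation")
```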

Theorem 11.9 Let $\mathbb{P}_n$ be the empirical distribution of an i.i.d. sample of uniform $[0,1]$ random variables, let $\mathbb{G}_n = \sqrt{n}(\mathbb{P}_n - P)$ be the classical empirical process indexed by $t \in [0,1]$, and let $Z_1, Z_2, \ldots$ be an i.i.d. sequence of standard normal deviates independent of $\mathbb{P}_n$. Also define the function classes

$$\mathcal{F}_1 \equiv \left\{ \sum_{j=1}^\infty c_j \sqrt{2} \cos \pi j t : \sum_{j=1}^\infty c_j^2 \pi^2 j^2 \le 1 \right\}$$

and

$$\mathcal{F}_2 \equiv \left\{ \sum_{j=1}^\infty c_j \sqrt{2}\, p_j(2t - 1) : \sum_{j=1}^\infty c_j^2 j(j+1) \le 1 \text{ and the series converges pointwise} \right\},$$

where the functions $p_0(u) \equiv (1/2)\sqrt{2}$, $p_1(u) \equiv (1/2)\sqrt{6}\,u$, $p_2(u) \equiv (1/4)\sqrt{10}(3u^2 - 1)$, $p_3(u) \equiv (1/4)\sqrt{14}(5u^3 - 3u)$, and so on, are the orthonormalized Legendre polynomials in $L_2[-1, 1]$. Then the following are true:

(i) The one-sample Cramér-von Mises statistic for uniform data satisfies

$$\int_0^1 \mathbb{G}_n^2(t)\, dt = \|\mathbb{G}_n\|_{\mathcal{F}_1}^2 \rightsquigarrow \frac{1}{\pi^2} \sum_{j=1}^\infty \frac{Z_j^2}{j^2} \equiv T_1.$$

(ii) The one-sample Anderson-Darling statistic for uniform data satisfies

$$\int_0^1 \frac{\mathbb{G}_n^2(t)}{t(1-t)}\, dt = \|\mathbb{G}_n\|_{\mathcal{F}_2}^2 \rightsquigarrow \sum_{j=1}^\infty \frac{Z_j^2}{j(j+1)} \equiv T_2.$$

Theorem 11.9 applies to testing whether i.i.d. real data $X_1, \ldots, X_n$ come from an arbitrary continuous distribution $F$. This is realized by replacing $t$ with $F(x)$ throughout the theorem. A more interesting result can be obtained by applying the theorem to testing whether two samples have the same distribution. Let $\mathbb{F}_{n,j}$ be the empirical distribution of sample $j$ of $n_j$ i.i.d. real random variables, $j = 1, 2$, where the two samples are independent, and where $n = n_1 + n_2$. Let $\mathbb{F}_{n,0} \equiv (n_1 \mathbb{F}_{n,1} + n_2 \mathbb{F}_{n,2})/n$ be the pooled empirical distribution. The two-sample Cramér-von Mises statistic is

$$T_1 \equiv \frac{n_1 n_2}{n_1 + n_2} \int_{-\infty}^{\infty} \left( \mathbb{F}_{n,1}(s) - \mathbb{F}_{n,2}(s) \right)^2 d\mathbb{F}_{n,0}(s),$$

while the two-sample Anderson-Darling statistic is

$$T_2 \equiv \frac{n_1 n_2}{n_1 + n_2} \int_{-\infty}^{\infty} \frac{\left( \mathbb{F}_{n,1}(s) - \mathbb{F}_{n,2}(s) \right)^2}{\mathbb{F}_{n,0}(s)\left[ 1 - \mathbb{F}_{n,0}(s) \right]}\, d\mathbb{F}_{n,0}(s).$$

The proof of the following corollary is given in Section 11.7:

Corollary 11.10 Under the null hypothesis that the two samples come from the same continuous distribution $F_0$, $T_j \rightsquigarrow T_j$, as $n_1 \wedge n_2 \to \infty$, for $j = 1, 2$.

Since the limiting distributions do not depend on $F_0$, critical values can be easily calculated by Monte Carlo simulation. Our own calculations resulted in critical values at the 0.05 level of 0.46 for $T_1$ and 2.50 for $T_2$.
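A sketch of such a Monte Carlo calculation, truncating the infinite series defining $T_1$ and $T_2$ at finitely many terms (the truncation level and replication count below are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)

# Approximate the 0.05-level critical values of T1 and T2 of Theorem 11.9
# by truncating the infinite series at J terms.
J, reps = 500, 20_000
j = np.arange(1, J + 1)
Z2 = rng.standard_normal((reps, J)) ** 2
T1 = (Z2 / j**2).sum(axis=1) / np.pi**2      # Cramer-von Mises limit
T2 = (Z2 / (j * (j + 1))).sum(axis=1)        # Anderson-Darling limit
print(np.quantile(T1, 0.95), np.quantile(T2, 0.95))
```

The two printed quantiles should land near the values 0.46 and 2.50 reported above.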

11.3 Contiguous Alternatives

For each $n \ge 1$, let $X_{n1}, \ldots, X_{nn}$ be i.i.d. random elements in a measurable space $(\mathcal{X}, \mathcal{A})$. Let $P$ denote the common probability distribution under the "null hypothesis," and let $P_n$ be a "contiguous alternative hypothesis" distribution satisfying

$$\int \left[ \sqrt{n}\left( dP_n^{1/2} - dP^{1/2} \right) - \frac{1}{2} h\, dP^{1/2} \right]^2 \to 0, \qquad (11.4)$$

as $n \to \infty$, for some measurable function $h : \mathcal{X} \to \mathbb{R}$. The following lemma, which is part of Lemma 3.10.11 of VW and which we give without proof, provides some properties for $h$:


Lemma 11.11 If the sequence of probability measures $P_n$ satisfies (11.4), then necessarily $Ph = 0$ and $Ph^2 < \infty$.

The following theorem gives very general weak convergence properties of the empirical process under the contiguous alternative $P_n$. Such weak convergence will be useful for studying efficiency of tests in Chapter 18. This is Theorem 3.10.12 of VW, which we give without proof:

Theorem 11.12 Let $\mathcal{F}$ be a $P$-Donsker class of measurable functions with $\|P\|_{\mathcal{F}} < \infty$, and assume the sequence of probability measures $P_n$ satisfies (11.4). Then $\sqrt{n}(\mathbb{P}_n - P)$ converges under $P_n$ in distribution in $\ell^\infty(\mathcal{F})$ to the process $f \mapsto \mathbb{G}(f) + Pfh$, where $\mathbb{G}$ is a tight Brownian bridge. If, moreover, $\|P_n f^2\|_{\mathcal{F}} = O(1)$, then $\|\sqrt{n}(P_n - P)f - Pfh\|_{\mathcal{F}} \to 0$ and $\sqrt{n}(\mathbb{P}_n - P_n)$ converges under $P_n$ in distribution to $\mathbb{G}$.
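As a concrete illustration (ours, not from the text), take $P = N(0,1)$ and the local alternatives $P_n = N(\theta/\sqrt{n}, 1)$, for which $h(x) = \theta x$; for the half-line indicator $f = 1\{x \le 0\}$ the drift in Theorem 11.12 is $Pfh = -\theta\phi(0)$. All constants below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

theta, n, reps = 1.0, 400, 20_000
Pf = 0.5                                  # P f for f = 1{x <= 0} under N(0, 1)

# Monte Carlo mean of the empirical process sqrt(n)(P_n f - Pf)
# when the data are drawn from the contiguous alternative P_n
drift_mc = np.mean([
    np.sqrt(n) * ((rng.normal(theta / np.sqrt(n), 1.0, n) <= 0.0).mean() - Pf)
    for _ in range(reps)
])
drift_theory = -theta / np.sqrt(2 * np.pi)   # Pfh = -theta * phi(0)
print(drift_mc, drift_theory)
```

The simulated mean should track the theoretical drift closely, illustrating the shift $\mathbb{G}(f) + Pfh$ in the limit.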

We now present a bootstrap result for contiguous alternatives. For i.i.d. nonnegative random weights $\xi_1, \ldots, \xi_n$ with mean $0 < \mu < \infty$ and variance $0 < \tau^2 < \infty$, recall the bootstrapped empirical measure $\tilde{\mathbb{P}}_n f = n^{-1} \sum_{i=1}^n (\xi_i/\bar{\xi}) f(X_i)$ and process $\tilde{\mathbb{G}}_n = \sqrt{n}(\mu/\tau)(\tilde{\mathbb{P}}_n - \mathbb{P}_n)$ from Sections 2.2.3 and 10.1. We define several new symbols: $\overset{P_n}{\to}$ and $\overset{P_n}{\rightsquigarrow}$ denote convergence in probability and weak convergence, respectively, under the distribution $P_n^n$. In addition, $X_n \overset{P_n}{\underset{M}{\rightsquigarrow}} X$ in a metric space $\mathbb{D}$ denotes conditional bootstrap convergence in probability under $P_n$, i.e., $\sup_{g \in BL_1} |E_M g(X_n) - E g(X)| \overset{P_n}{\to} 0$ and $E_M g(X_n)^* - E_M g(X_n)_* \overset{P_n}{\to} 0$, for all $g \in BL_1$, where $BL_1$ is the same bounded Lipschitz function space defined in Section 2.2.3 and the subscript $M$ denotes taking the expectation over $\xi_1, \ldots, \xi_n$ conditional on the data. Note that we require the weights $\xi_1, \ldots, \xi_n$ to have the same distribution and independence from $X_{n1}, \ldots, X_{nn}$ under both $P_n$ and $P$.

Theorem 11.13 Let $\mathcal{F}$ be a $P$-Donsker class of measurable functions, let $P_n$ satisfy (11.4), and assume

$$\lim_{M \to \infty} \limsup_{n \to \infty} P_n (f - Pf)^2 1\{|f - Pf| > M\} = 0 \qquad (11.5)$$

for all $f \in \mathcal{F}$. Also let $\xi_1, \ldots, \xi_n$ be i.i.d. nonnegative random variables, independent of $X_{n1}, \ldots, X_{nn}$, with mean $0 < \mu < \infty$, variance $0 < \tau^2 < \infty$, and with $\|\xi_1\|_{2,1} < \infty$. Then $\tilde{\mathbb{G}}_n \overset{P_n}{\underset{\xi}{\rightsquigarrow}} \mathbb{G}$ in $\ell^\infty(\mathcal{F})$ and $\tilde{\mathbb{G}}_n$ is asymptotically measurable.

Proof. Let $\eta_i \equiv \tau^{-1}(\xi_i - \mu)$, $i = 1, \ldots, n$, and note that

$$\tilde{\mathbb{G}}_n = n^{-1/2}(\mu/\tau) \sum_{i=1}^n (\xi_i/\bar{\xi} - 1)\, \delta_{X_i} \qquad (11.6)$$

$$= n^{-1/2}(\mu/\tau) \sum_{i=1}^n (\xi_i/\bar{\xi} - 1)(\delta_{X_i} - P)$$

$$= n^{-1/2} \sum_{i=1}^n \eta_i (\delta_{X_i} - P) + \left( \frac{\mu}{\bar{\xi}} - 1 \right) n^{-1/2} \sum_{i=1}^n \eta_i (\delta_{X_i} - P) + \left( \frac{\mu}{\tau} \right) \left( \frac{\mu}{\bar{\xi}} - 1 \right) n^{-1/2} \sum_{i=1}^n (\delta_{X_i} - P).$$

Since $\mathcal{F}$ is $P$-Donsker, we also have that $\dot{\mathcal{F}} \equiv \{f - Pf : f \in \mathcal{F}\}$ is $P$-Donsker. Thus by the unconditional multiplier central limit theorem (Theorem 10.1), we have that $\eta \cdot \dot{\mathcal{F}}$ is also $P$-Donsker. Now, by the fact that $\|P(f - Pf)\|_{\mathcal{F}} = 0$ (trivially) combined with Theorem 11.12, both $n^{-1/2} \sum_{i=1}^n \eta_i (\delta_{X_i} - P) \overset{P_n}{\rightsquigarrow} \mathbb{G}(f)$ and $n^{-1/2} \sum_{i=1}^n (\delta_{X_i} - P) \overset{P_n}{\rightsquigarrow} \mathbb{G}(f) + P(f - Pf)h$ in $\ell^\infty(\mathcal{F})$. The reason the first limiting process has mean zero is because $\eta$ is independent of $X$ and thus $P\eta(f - Pf)h = 0$ for all $f \in \mathcal{F}$. Thus the last two terms in (11.6) $\overset{P_n}{\to} 0$ and $\tilde{\mathbb{G}}_n \overset{P_n}{\rightsquigarrow} \mathbb{G}$ in $\ell^\infty(\mathcal{F})$. This now implies the unconditional asymptotic tightness and desired asymptotic measurability of $\tilde{\mathbb{G}}_n$.

By the same kinds of arguments we used in the proof of Theorem 10.4, all we need to verify now is that all finite-dimensional collections $f_1, \ldots, f_m \in \mathcal{F}$ converge under $P_n$ in distribution, conditional on the data, to the appropriate limiting Gaussian process. Accordingly, let $Z_i = (f_1(X_i) - Pf_1, \ldots, f_m(X_i) - Pf_m)'$, $i = 1, \ldots, n$. What we need to show is that $n^{-1/2} \sum_{i=1}^n \eta_i Z_i$ converges weakly under $P_n$, conditional on the $Z_1, \ldots, Z_n$, to a mean zero Gaussian process with variance $\Sigma \equiv P Z_1 Z_1'$. By Lemma 10.5, we are done if we can verify that $n^{-1} \sum_{i=1}^n Z_i Z_i' \overset{P_n}{\to} \Sigma$ and $\max_{1 \le i \le n} \|Z_i\|/\sqrt{n} \overset{P_n}{\to} 0$.

By the assumptions of the theorem,

$$\limsup_{n \to \infty} E_n(M) \to 0 \quad \text{as } M \to \infty, \quad \text{where } E_n(M) \equiv P_n\left[ Z_1 Z_1' 1\{\|Z_1\| > M\} \right].$$

Note also that (11.4) can be shown to imply that $P_n h \to Ph$, as $n \to \infty$, for any bounded $h$ (this verification is saved as an exercise). Thus, for any $M < \infty$, $P_n Z_1 Z_1' 1\{\|Z_1\| \le M\}$ converges to $P Z_1 Z_1' 1\{\|Z_1\| \le M\}$. Since $M$ is arbitrary, this convergence continues to hold if $M$ is replaced by a sequence $M_n$ going to infinity slowly enough. Accordingly,


$$n^{-1} \sum_{i=1}^n Z_i Z_i' = n^{-1} \sum_{i=1}^n Z_i Z_i' 1\{\|Z_i\| > M_n\} + n^{-1} \sum_{i=1}^n Z_i Z_i' 1\{\|Z_i\| \le M_n\} \overset{P_n}{\to} P Z_1 Z_1',$$

as n →∞. Now we also have

$$\max_{1 \le i \le n} \frac{\|Z_i\|}{\sqrt{n}} = \sqrt{\max_{1 \le i \le n} \frac{\|Z_i\|^2}{n}} \le \sqrt{\max_{1 \le i \le n} \frac{\|Z_i\|^2}{n} 1\{\|Z_i\| \le M\} + \frac{1}{n} \sum_{i=1}^n \|Z_i\|^2 1\{\|Z_i\| > M\}}.$$

The first term under the last square root sign $\overset{P_n}{\to} 0$ trivially, while the expectation under $P_n$ of the second term, $E_n(M)$, goes to zero as $n \to \infty$ and $M \to \infty$ sufficiently slowly with $n$, as argued previously. Thus $\max_{1 \le i \le n} \|Z_i\|/\sqrt{n} \overset{P_n}{\to} 0$, and the proof is complete.

We close this section with the following additional useful consequences of contiguity:

Theorem 11.14 Let $Y_n = Y_n(X_{n1}, \ldots, X_{nn})$ and assume that (11.4) holds. Then the following are true:

(i) If $Y_n \overset{P}{\to} 0$ under the sequence of distributions $P^n$, then $Y_n \overset{P_n}{\to} 0$.

(ii) If $Y_n$ is asymptotically tight under the sequence of distributions $P^n$, then $Y_n$ is also asymptotically tight under $P_n^n$.

Proof. In addition to the result of Lemma 11.11, Lemma 3.10.11 of VW yields that, under (11.4),

$$\sum_{i=1}^n \log \frac{dP_n}{dP}(X_{ni}) = G_n - \frac{1}{2} Ph^2 + R_n,$$

where $G_n \equiv \sqrt{n}\, \mathbb{P}_n h(X)$ and $R_n$ goes to zero under both $P^n$ and $P_n^n$. Note that by Theorem 11.12, $G_n \overset{P_n}{\rightsquigarrow} \mathbb{G}h + Ph^2$.

We first prove (i). Note that for every $\delta > 0$, there exists an $M < \infty$ such that $\limsup_{n\to\infty} P_n^n(|G_n| > M) \le \delta$. For any $\epsilon > 0$, let $g_n(\epsilon) \equiv 1\{\|Y_n\|^* > \epsilon\}$, and note that $g_n(\epsilon)$ is measurable under both $P^n$ and $P_n^n$. Thus, for any $\epsilon > 0$,

$$P_n^n g_n(\epsilon) \le P_n^n\left[ g_n(\epsilon) 1\{|G_n| \le M, |R_n| \le \epsilon\} \right] + P_n^n(|G_n| > M) + P_n^n(|R_n| > \epsilon)$$
$$\le P^n\left[ g_n(\epsilon) e^{M - Ph^2/2 + \epsilon} \right] + P_n^n(|G_n| > M) + P_n^n(|R_n| > \epsilon) \to 0 + \delta + 0,$$


as $n \to \infty$. Thus (i) follows since both $\delta$ and $\epsilon$ were arbitrary.

We now prove (ii). Fix $\eta > 0$, and note that by recycling the above arguments, $H_n \equiv \prod_{i=1}^n (dP_n/dP)(X_{ni})$ is asymptotically bounded in probability under both $P^n$ and $P_n^n$. Thus there exists an $H < \infty$ such that $\limsup_{n\to\infty} P(H_n > H) \le \eta/2$. Since $Y_n$ is asymptotically tight under $P^n$, there exists a compact set $K$ such that $\limsup_{n\to\infty} P(Y_n \in (\mathbb{X} - K^\delta)^*) \le \eta/(2H)$ under $P^n$, for every $\delta > 0$, where $K^\delta$ is the $\delta$-enlargement of $K$ as defined in Section 7.2.1 and the superscript $*$ denotes a set that is the minimal measurable covering under both $P^n$ and $P_n^n$. Hence, under $P_n^n$,

$$\limsup_{n\to\infty} P(Y_n \in (\mathbb{X} - K^\delta)^*) = \limsup_{n\to\infty} \int 1\{Y_n \in (\mathbb{X} - K^\delta)^*\}\, dP_n^n$$
$$\le H \limsup_{n\to\infty} \int 1\{Y_n \in (\mathbb{X} - K^\delta)^*\}\, dP^n + \limsup_{n\to\infty} P(H_n > H) \le H \frac{\eta}{2H} + \frac{\eta}{2} = \eta,$$

for every $\delta > 0$. Thus $Y_n$ is asymptotically tight under $P_n^n$ since $\eta$ was arbitrary. Hence (ii) follows.

11.4 Sums of Independent but not Identically Distributed Stochastic Processes

In this section, we are interested in deriving the limiting distribution of sums of the form $\sum_{i=1}^{m_n} f_{ni}(\omega, t)$, where the real-valued stochastic processes $\{f_{ni}(\omega, t), t \in T\}$, $1 \le i \le m_n$, for all integers $n \ge 1$, are independent within rows on the probability space $(\Omega, \mathcal{A}, P)$ for some index set $T$. In addition to a central limit theorem, we will present a multiplier bootstrap result to aid in inference. An example using these techniques will be presented in the upcoming case studies II chapter. Throughout this section, function arguments or subscripts will sometimes be suppressed for notational clarity.

11.4.1 Central Limit Theorems

The notation and set-up are similar to that found in Pollard (1990) and Kosorok (2003). A slightly more general approach to the same question can be found in Chapter 2.11 of VW. Part of the generality in VW is the ability to utilize bracketing entropy in addition to uniform entropy for establishing tightness. An advantage of Pollard's approach, on the other hand, is that total boundedness of the index set $T$ is a conclusion rather than a condition.


Both approaches have their merits and appear to be roughly equally useful in practice.

We need to introduce a few measurability conditions that are different from but related to conditions introduced in previous chapters. The first condition is almost measurable Suslin: Call a triangular array $\{f_{ni}(\omega, t), t \in T\}$ almost measurable Suslin (AMS) if for all integers $n \ge 1$, there exists a Suslin topological space $T_n \subset T$ with Borel sets $\mathcal{B}_n$ such that

(i) $P^*\left( \sup_{t \in T} \inf_{s \in T_n} \sum_{i=1}^{m_n} (f_{ni}(\omega, s) - f_{ni}(\omega, t))^2 > 0 \right) = 0$;

(ii) for $i = 1, \ldots, m_n$, $f_{ni} : \Omega \times T_n \to \mathbb{R}$ is $\mathcal{A} \times \mathcal{B}_n$-measurable.

The second condition is stronger yet seems to be more easily verified in applications: Call a triangular array of processes $\{f_{ni}(\omega, t), t \in T\}$ separable if for every integer $n \ge 1$, there exists a countable subset $T_n \subset T$ such that

$$P^*\left( \sup_{t \in T} \inf_{s \in T_n} \sum_{i=1}^{m_n} (f_{ni}(\omega, s) - f_{ni}(\omega, t))^2 > 0 \right) = 0.$$

The following lemma shows that separability implies AMS:

Lemma 11.15 If the triangular array of stochastic processes $\{f_{ni}(\omega, t), t \in T\}$ is separable, then it is AMS.

Proof. The discrete topology applied to $T_n$ makes it into a Suslin topological space by countability, with resulting Borel sets $\mathcal{B}_n$. For $i = 1, \ldots, m_n$, $f_{ni} : \Omega \times T_n \to \mathbb{R}$ is $\mathcal{A} \times \mathcal{B}_n$-measurable since, for every $\alpha \in \mathbb{R}$,

$$\{(\omega, t) \in \Omega \times T_n : f_{ni}(\omega, t) > \alpha\} = \bigcup_{s \in T_n} \{(\omega, s) : f_{ni}(\omega, s) > \alpha\},$$

and the right-hand side is a countable union of $\mathcal{A} \times \mathcal{B}_n$-measurable sets.

The foregoing measurable Suslin condition is closely related to the definition given in Example 2.3.5 of VW, while the definition of separable arrays is similar in spirit to the definition of separable stochastic processes given in the discussion preceding Lemma 7.2 in Section 7.1 above. The modifications of these definitions presented in this section have been made to accommodate non-identically distributed arrays for a broad scope of statistical applications. However, finding the best possible measurability conditions was not the primary goal.

We need the following definition of manageability (Definition 7.9 of Pollard, 1990, with minor modification). First, for any set $A \subset \mathbb{R}^m$, let $D_m(x, A)$ be the packing number for the set $A$ at Euclidean distance $x$, i.e., the largest $k$ such that there exist $k$ points in $A$ with the smallest Euclidean distance between any two distinct points being greater than $x$. Also let $\mathcal{F}_{n\omega} \equiv \{[f_{n1}(\omega, t), \ldots, f_{nm_n}(\omega, t)] \in \mathbb{R}^{m_n} : t \in T\}$; and for any vectors $u, v \in \mathbb{R}^m$, $u \odot v \in \mathbb{R}^m$ is the pointwise product and $\|\cdot\|$ denotes Euclidean distance. A triangular array of processes $\{f_{ni}(\omega, t)\}$ is manageable, with respect to the envelopes $F_n(\omega) \equiv [F_{n1}(\omega), \ldots, F_{nm_n}(\omega)] \in \mathbb{R}^{m_n}$, if there exists a deterministic function $\lambda$ (the capacity bound) for which

(i) $\int_0^1 \sqrt{\log \lambda(x)}\, dx < \infty$;

(ii) there exists $N \subset \Omega$ such that $P^*(N) = 0$ and, for each $\omega \notin N$,

$$D_{m_n}\left( x \|\alpha \odot F_n(\omega)\|, \alpha \odot \mathcal{F}_{n\omega} \right) \le \lambda(x),$$

for $0 < x \le 1$, all vectors $\alpha \in \mathbb{R}^{m_n}$ of nonnegative weights, all $n \ge 1$, and where $\lambda$ does not depend on $\omega$ or $n$.

We now state a minor modification of Pollard's Functional Central Limit Theorem for the stochastic process sum

$$X_n(\omega, t) \equiv \sum_{i=1}^{m_n} \left[ f_{ni}(\omega, t) - E f_{ni}(\cdot, t) \right].$$

The modification is the inclusion of a sufficient measurability requirement which was omitted in Pollard's (1990) version of the theorem.

Theorem 11.16 Suppose the triangular array $\{f_{ni}(\omega, t), i = 1, \ldots, m_n, t \in T\}$ consists of independent processes within rows, is AMS, and satisfies:

(A) the $f_{ni}$ are manageable, with envelopes $\{F_{ni}\}$ which are also independent within rows;

(B) $H(s, t) = \lim_{n\to\infty} E X_n(s) X_n(t)$ exists for every $s, t \in T$;

(C) $\limsup_{n\to\infty} \sum_{i=1}^{m_n} E^* F_{ni}^2 < \infty$;

(D) $\lim_{n\to\infty} \sum_{i=1}^{m_n} E^* F_{ni}^2 1\{F_{ni} > \epsilon\} = 0$, for each $\epsilon > 0$;

(E) $\rho(s, t) = \lim_{n\to\infty} \rho_n(s, t)$, where

$$\rho_n(s, t) \equiv \left( \sum_{i=1}^{m_n} E |f_{ni}(\cdot, s) - f_{ni}(\cdot, t)|^2 \right)^{1/2},$$

exists for every $s, t \in T$, and for all deterministic sequences $\{s_n\}$ and $\{t_n\}$ in $T$, if $\rho(s_n, t_n) \to 0$ then $\rho_n(s_n, t_n) \to 0$.

Then

(i) $T$ is totally bounded under the $\rho$ pseudometric;

(ii) $X_n$ converges weakly on $\ell^\infty(T)$ to a tight mean zero Gaussian process $X$ concentrated on $UC(T, \rho)$, with covariance $H(s, t)$.


The proof is given in Kosorok (2003), who relies on Chapter 10 of Pollard (1990) for some of the steps. We omit the details. We will use this theorem when discussing function classes changing with $n$ later in this chapter, as well as in one of the case studies of Chapter 15. One can think of this theorem as a Lindeberg central limit theorem for stochastic processes, where Condition (D) is a modified Lindeberg condition.

Note that the manageability Condition (A) is an entropy condition quite similar to the BUEI condition of Chapter 9. Pollard (1990) discussed several methods and preservation results for establishing manageability, including bounded pseudodimension classes, which are very close in spirit to VC classes of functions (see Chapter 4 of Pollard): the set $\mathcal{F}_n \subset \mathbb{R}^{m_n}$ has pseudodimension of at most $V$ if, for every point $t \in \mathbb{R}^{V+1}$, no proper coordinate projection of $\mathcal{F}_n$ can surround $t$. A proper coordinate projection $\mathcal{F}_n^k$ is obtained by choosing a subset $i_1, \ldots, i_k$ of the indices $1, \ldots, m_n$, where $k \le m_n$, and then letting $\mathcal{F}_n^k = \{[f_{ni_1}(t), \ldots, f_{ni_k}(t)] : t \in T\}$.

It is not hard to verify that if $f_{n1}(t), \ldots, f_{nm_n}(t)$ are always monotone increasing functions in $t$, then the resulting $\mathcal{F}_n$ has pseudodimension 1. This happens because all two-dimensional projections always form monotone increasing trajectories, and thus can never surround any point in $\mathbb{R}^2$. The proof is almost identical to the proof of Lemma 9.10. By Theorem 4.8 of Pollard (1990), every triangular array of stochastic processes for which $\mathcal{F}_n$ has pseudodimension bounded above by $V < \infty$, for all $n \ge 1$, is manageable. As verified in the following theorem, complicated manageable classes can be built up from simpler manageable classes:

Theorem 11.17 Let $f_{n1}, \ldots, f_{nm_n}$ and $g_{n1}, \ldots, g_{nm_n}$ be manageable arrays with respective index sets $T$ and $U$ and with respective envelopes $F_{n1}, \ldots, F_{nm_n}$ and $G_{n1}, \ldots, G_{nm_n}$. Then the following are true:

(i) $\{f_{n1}(t) + g_{n1}(u), \ldots, f_{nm_n}(t) + g_{nm_n}(u) : (t, u) \in T \times U\}$ is manageable with envelopes $F_{n1} + G_{n1}, \ldots, F_{nm_n} + G_{nm_n}$;

(ii) $\{f_{n1}(t) \wedge g_{n1}(u), \ldots, f_{nm_n}(t) \wedge g_{nm_n}(u) : (t, u) \in T \times U\}$ is manageable with envelopes $F_{n1} + G_{n1}, \ldots, F_{nm_n} + G_{nm_n}$;

(iii) $\{f_{n1}(t) \vee g_{n1}(u), \ldots, f_{nm_n}(t) \vee g_{nm_n}(u) : (t, u) \in T \times U\}$ is manageable with envelopes $F_{n1} + G_{n1}, \ldots, F_{nm_n} + G_{nm_n}$;

(iv) $\{f_{n1}(t) g_{n1}(u), \ldots, f_{nm_n}(t) g_{nm_n}(u) : (t, u) \in T \times U\}$ is manageable with envelopes $F_{n1} G_{n1}, \ldots, F_{nm_n} G_{nm_n}$.

Proof. For any vectors $x_1, x_2, y_1, y_2 \in \mathbb{R}^{m_n}$, it is easy to verify that $\|x_1 \circ y_1 - x_2 \circ y_2\| \le \|x_1 - x_2\| + \|y_1 - y_2\|$, where $\circ$ is any one of the operations $\wedge$, $\vee$, or $+$. Thus

$$N_{m_n}\left( \epsilon \|\alpha \odot (F_n + G_n)\|, \alpha \odot (\mathcal{F}_n \circ \mathcal{G}_n) \right) \le N_{m_n}\left( \epsilon \|\alpha \odot F_n\|, \alpha \odot \mathcal{F}_n \right) \times N_{m_n}\left( \epsilon \|\alpha \odot G_n\|, \alpha \odot \mathcal{G}_n \right),$$

for any $0 < \epsilon \le 1$ and all vectors $\alpha \in \mathbb{R}^{m_n}$ of nonnegative weights, where $N_{m_n}$ denotes the covering number version of $D_{m_n}$, $\mathcal{G}_n \equiv \{[g_{n1}(u), \ldots, g_{nm_n}(u)] : u \in U\}$, and $\mathcal{F}_n \circ \mathcal{G}_n$ has the obvious interpretation. Thus Parts (i)-(iii) of the theorem follow by the relationship between packing and covering numbers discussed in Section 8.1.2 in the paragraphs preceding Theorem 8.4.

Proving Part (iv) requires a slightly different approach. Let $x_1, x_2, y_1, y_2$ be any vectors in $\mathbb{R}^{m_n}$, and let $x, y \in \mathbb{R}^{m_n}$ be any nonnegative vectors such that both $x - [x_1 \vee x_2 \vee (-x_1) \vee (-x_2)]$ and $y - [y_1 \vee y_2 \vee (-y_1) \vee (-y_2)]$ are nonnegative. It is not hard to verify that $\|x_1 \odot y_1 - x_2 \odot y_2\| \le \|y \odot x_1 - y \odot x_2\| + \|x \odot y_1 - x \odot y_2\|$. From this, we can deduce that

$$N_{m_n}\left( 2\epsilon \|\alpha \odot F_n \odot G_n\|, \alpha \odot \mathcal{F}_n \odot \mathcal{G}_n \right) \le N_{m_n}\left( \epsilon \|\alpha \odot F_n \odot G_n\|, \alpha \odot G_n \odot \mathcal{F}_n \right) \times N_{m_n}\left( \epsilon \|\alpha \odot F_n \odot G_n\|, \alpha \odot F_n \odot \mathcal{G}_n \right)$$
$$= N_{m_n}\left( \epsilon \|\alpha' \odot F_n\|, \alpha' \odot \mathcal{F}_n \right) \times N_{m_n}\left( \epsilon \|\alpha'' \odot G_n\|, \alpha'' \odot \mathcal{G}_n \right),$$

for any $0 < \epsilon \le 1$ and all vectors $\alpha \in \mathbb{R}^{m_n}$ of nonnegative weights, where $\alpha' \equiv \alpha \odot G_n$ and $\alpha'' \equiv \alpha \odot F_n$. Since capacity bounds do not depend on the nonnegative weight vector (whether $\alpha$, $\alpha'$ or $\alpha''$), Part (iv) now follows, and the theorem is proved.

11.4.2 Bootstrap Results

We now present a weighted bootstrap for inference about the limiting process $X$ of Theorem 11.16. The basic idea shares some similarities with the wild bootstrap (Praestgaard and Wellner, 1993). Let $Z \equiv \{z_i, i \ge 1\}$ be a sequence of random variables satisfying

(F) the $z_i$ are independent and identically distributed, on the probability space $(\Omega_z, \mathcal{A}_z, \Pi_z)$, with mean zero and variance 1.

Denote $\mu_{ni}(t) \equiv E f_{ni}(\cdot, t)$, and let $\hat{\mu}_{ni}(t)$ be estimators of $\mu_{ni}(t)$. The weighted bootstrapped process we propose for inference is

$$\hat{X}_n(\omega, t) \equiv \sum_{i=1}^{m_n} z_i \left[ f_{ni}(\omega, t) - \hat{\mu}_{ni}(\omega, t) \right],$$

which is defined on the product probability space $(\Omega, \mathcal{A}, \Pi) \times (\Omega_z, \mathcal{A}_z, \Pi_z)$, similar to what was done in Chapter 9 for the weighted bootstrap in the i.i.d. case. What is unusual about this bootstrap is the need to estimate the $\mu_{ni}$ terms: this need is a consequence of the terms being non-identically distributed.

The proposed method of inference is to resample $\hat{X}_n$, using many realizations of $z_1, \ldots, z_{m_n}$, to approximate the distribution of $X_n$. The following theorem gives us conditions under which this procedure is asymptotically valid:


Theorem 11.18 Suppose the triangular array $\{f_{ni}\}$ satisfies the conditions of Theorem 11.16 and the sequence $\{z_i, i \ge 1\}$ satisfies Condition (F) above. Suppose also that the array of estimators $\{\hat{\mu}_{ni}(\omega, t), t \in T, 1 \le i \le m_n, n \ge 1\}$ is AMS and satisfies the following:

(G) $\sup_{t \in T} \sum_{i=1}^{m_n} \left[ \hat{\mu}_{ni}(\omega, t) - \mu_{ni}(t) \right]^2 = o_P(1)$;

(H) the stochastic processes $\{\hat{\mu}_{ni}(\omega, t)\}$ are manageable with envelopes $\{\hat{F}_{ni}(\omega)\}$;

(I) $k \vee \sum_{i=1}^{m_n} \left[ \hat{F}_{ni}(\omega) \right]^2$ converges to $k$ in outer probability as $n \to \infty$, for some $k < \infty$.

Then the conclusions of Theorem 11.16 obtain, $\hat{X}_n$ is asymptotically measurable, and $\hat{X}_n \overset{P}{\underset{Z}{\rightsquigarrow}} X$.

The main idea of the proof is to first study the conditional limiting distribution of $\tilde{X}_n(\omega, t) \equiv \sum_{i=1}^{m_n} z_i \left[ f_{ni}(\omega, t) - \mu_{ni}(t) \right]$, and then show that the limiting result is unchanged after replacing $\mu_{ni}$ with $\hat{\mu}_{ni}$. The first step is summarized in the following theorem:

Theorem 11.19 Suppose the triangular array $\{f_{ni}\}$ satisfies the conditions of Theorem 11.16 and the sequence $\{z_i, i \ge 1\}$ satisfies Condition (F) above. Then the conclusions of Theorem 11.16 obtain, $\tilde{X}_n$ is asymptotically measurable, and $\tilde{X}_n \overset{P}{\underset{Z}{\rightsquigarrow}} X$.

An interesting step in the proof of this theorem is verifying that manageability of the triangular array $z_1 f_{n1}, \ldots, z_{m_n} f_{nm_n}$ follows directly from manageability of $f_{n1}, \ldots, f_{nm_n}$. We now demonstrate this. For vectors $u \in \mathbb{R}^{m_n}$, let $|u|$ denote pointwise absolute value and $\operatorname{sign}(u)$ denote pointwise sign. Now, for any nonnegative $\alpha \in \mathbb{R}^{m_n}$,

$$D_{m_n}\left( x \left\| \alpha \odot |z_n| \odot F_n(\omega) \right\|, \alpha \odot z_n \odot \mathcal{F}_{n\omega} \right) = D_{m_n}\left( x \left\| \tilde{\alpha} \odot F_n(\omega) \right\|, \tilde{\alpha} \odot \operatorname{sign}(z_n) \odot \mathcal{F}_{n\omega} \right) = D_{m_n}\left( x \left\| \tilde{\alpha} \odot F_n(\omega) \right\|, \tilde{\alpha} \odot \mathcal{F}_{n\omega} \right),$$

where $z_n \equiv (z_1, \ldots, z_{m_n})^T$, since the absolute value of the $z_i$ can be absorbed into $\alpha$ to yield a new nonnegative weight vector $\tilde{\alpha} \equiv \alpha \odot |z_n|$, and since any coordinate change of sign does not affect the geometry of $\mathcal{F}_{n\omega}$. Thus the foregoing triangular array is manageable with envelopes $\{|z_i| F_{ni}(\omega)\}$. The remaining details of the proofs of both Theorems 11.18 and 11.19, which we omit here, can be found in Kosorok (2003).
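To make the procedure concrete, here is a minimal sketch (ours, not from the text) in the simplest i.i.d. special case $f_{ni}(\omega, t) = n^{-1/2} 1\{X_i \le t\}$, where $\mu_{ni}(t) = n^{-1/2} F(t)$ is estimated by $n^{-1/2} \mathbb{F}_n(t)$ and the weighted bootstrap reduces to a wild bootstrap of the classical empirical process; the sample size, grid, and replication count are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)

n, B = 500, 2000
x = rng.uniform(size=n)
grid = np.linspace(0.0, 1.0, 201)
ind = (x[:, None] <= grid[None, :]).astype(float)   # 1{X_i <= t} on a grid
Fn = ind.mean(axis=0)                               # empirical cdf (the estimator)

# B weighted-bootstrap replicates of the process and their sup norms
Z = rng.standard_normal((B, n))                     # weights satisfying (F)
Xhat = Z @ (ind - Fn) / np.sqrt(n)                  # B x grid matrix of replicates
sups = np.abs(Xhat).max(axis=1)

# The 95% quantile should be near 1.358, the corresponding quantile of
# the supremum of the absolute value of a standard Brownian bridge.
print(np.quantile(sups, 0.95))
```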


11.5 Function Classes Changing with n

We now return to the i.i.d. empirical process setting where the i.i.d. observations $X_1, X_2, \ldots$ are drawn from a measurable space $(\mathcal{X}, \mathcal{A})$, with probability measure $P$. What is new, however, is that we allow the function class to depend on $n$. Specifically, we assume the function class has the form $\mathcal{F}_n \equiv \{f_{n,t} : t \in T\}$, where the functions $x \mapsto f_{n,t}(x)$ are indexed by a fixed $T$ but are allowed to change with sample size $n$. Note that this trivially includes the standard empirical process set-up with an arbitrary but fixed function class $\mathcal{F}$ by setting $T = \mathcal{F}$ and $f_{n,t} = t$ for all $n \ge 1$ and $t \in \mathcal{F}$. The approach we take is to specialize the results of Section 11.4 after replacing manageability with a bounded uniform entropy integral condition. An alternative approach which can utilize either uniform or bracketing entropy is given in Section 2.11.3 of VW, but we do not pursue this second approach here.

Let $X_n(t) \equiv n^{-1/2} \sum_{i=1}^n (f_{n,t}(X_i) - P f_{n,t})$, for all $t \in T$, and let $F_n$ be an envelope for $\mathcal{F}_n$. We say that the sequence $\{\mathcal{F}_n\}$ is AMS if for all $n \ge 1$, there exists a Suslin topological space $T_n \subset T$ with Borel sets $\mathcal{B}_n$ such that

$$P^*\left( \sup_{t \in T} \inf_{s \in T_n} |f_{n,s}(X_1) - f_{n,t}(X_1)| > 0 \right) = 0 \qquad (11.7)$$

and $f_{n,\cdot} : \mathcal{X} \times T_n \to \mathbb{R}$ is $\mathcal{A} \times \mathcal{B}_n$-measurable. Moreover, the sequence $\{\mathcal{F}_n\}$ is said to be separable if, for all $n \ge 1$, there exists a countable subset $T_n \subset T$ such that (11.7) holds. The arguments in the proof of Lemma 11.15 verify that separability implies AMS, as is true in the more general setting of Section 11.4.

We also require the following bounded uniform entropy integral condition:

$$\limsup_{n \to \infty} \sup_Q \int_0^1 \sqrt{\log N\left( \epsilon \|F_n\|_{Q,2}, \mathcal{F}_n, L_2(Q) \right)}\, d\epsilon < \infty, \qquad (11.8)$$

where, for each $n \ge 1$, $\sup_Q$ is the supremum taken over all finitely discrete probability measures $Q$ with $\|F_n\|_{Q,2} > 0$. We are now ready to present the following functional central limit theorem:

Theorem 11.20 Suppose $\{\mathcal{F}_n\}$ is AMS and the following hold:

(A) $\mathcal{F}_n$ satisfies (11.8) with envelope $F_n$;

(B) $H(s, t) = \lim_{n\to\infty} E X_n(s) X_n(t)$ exists for every $s, t \in T$;

(C) $\limsup_{n\to\infty} E^* F_n^2 < \infty$;

(D) $\lim_{n\to\infty} E^* F_n^2 1\{F_n > \epsilon \sqrt{n}\} = 0$, for each $\epsilon > 0$;

(E) $\rho(s, t) = \lim_{n\to\infty} \rho_n(s, t)$, where $\rho_n(s, t) \equiv \sqrt{E |f_{n,s}(X_1) - f_{n,t}(X_1)|^2}$, exists for every $s, t \in T$, and for all deterministic sequences $\{s_n\}$ and $\{t_n\}$ in $T$, if $\rho(s_n, t_n) \to 0$ then $\rho_n(s_n, t_n) \to 0$.


Then

(i) $T$ is totally bounded under the $\rho$ pseudometric;

(ii) $X_n$ converges weakly in $\ell^\infty(T)$ to a tight, mean zero Gaussian process $X$ concentrated on $UC(T,\rho)$, with covariance $H(s,t)$.

Proof. The proof consists of showing that the current setting is a special case of Theorem 11.16. Specifically, we let $f_{ni}(t) = n^{-1/2}f_{n,t}(X_i)$ and $m_n = n$, and we study the array $\{f_{ni}(t), t \in T\}$. First, it is clear that $\{\mathcal{F}_n\}$ being AMS implies that $\{f_{ni}(t), t \in T\}$ is AMS. Now let

$$F_n \equiv \left\{[f_{n1}(t), \ldots, f_{nn}(t)] \in \mathbb{R}^n : t \in T\right\}$$

and $\bar{F}_n \equiv [F_{n1}, \ldots, F_{nn}]$, where $F_{ni} \equiv F_n(X_i)/\sqrt{n}$; and note that for any $\alpha \in \mathbb{R}^n$,

$$D_n\left(\epsilon\|\alpha \odot \bar{F}_n\|, \alpha \odot F_n\right) \le D\left(\epsilon\|F_n\|_{Q_\alpha,2}, \mathcal{F}_n, L_2(Q_\alpha)\right),$$

where $Q_\alpha \equiv \|\alpha\|^{-2}\sum_{i=1}^n \alpha_i^2\delta_{X_i}$ is a finitely discrete probability measure. Thus, by the relationship between packing and covering numbers given in Section 8.1.2, if we let

$$\lambda(x) = \limsup_{n\to\infty}\, \sup_Q N\left(x\|F_n\|_{Q,2}/2, \mathcal{F}_n, L_2(Q)\right),$$

where $\sup_Q$ is taken over all finitely discrete probability measures, then Condition (11.8) implies that

$$D_n\left(\epsilon\|\alpha \odot \bar{F}_n\|, \alpha \odot F_n\right) \le \lambda(\epsilon),$$

for all $0 < \epsilon \le 1$, all vectors $\alpha \in \mathbb{R}^n$ of nonnegative weights, and all $n \ge 1$; and that $\int_0^1 \sqrt{\log \lambda(\epsilon)}\,d\epsilon < \infty$. Note that, without loss of generality, we can set $D_n(\cdot,\cdot) = 1$ whenever $\|\alpha \odot \bar{F}_n\| = 0$ and let $\lambda(1) = 1$, and thus the foregoing arguments yield that Condition (11.8) implies manageability of the triangular array $\{f_{n1}(t), \ldots, f_{nn}(t), t \in T\}$.

The remaining conditions of the theorem can now easily be shown to imply Conditions (B) through (E) of Theorem 11.16 for the new triangular array and envelope vector $\bar{F}_n$. Hence the desired results follow from Theorem 11.16.

The following lemma gives us an important example of a sequence of classes $\{\mathcal{F}_n\}$ that satisfies Condition (11.8):

Lemma 11.21 For a fixed index set $T$, let $\mathcal{F}_n = \{f_{n,t} : t \in T\}$ be a VC class of measurable functions with VC-index $V_n$ and integrable envelope $F_n$, for all $n \ge 1$, and assume $\sup_{n \ge 1} V_n = V < \infty$. Then the sequence $\{\mathcal{F}_n\}$ satisfies Condition (11.8).


Proof. By Theorem 9.3, there exists a universal constant $K$ depending only on $V$ such that

$$N\left(\epsilon\|F_n\|_{Q,2}, \mathcal{F}_n, L_2(Q)\right) \le K\left(\frac{1}{\epsilon}\right)^{2(V-1)},$$

for all $0 < \epsilon \le 1$. Note that we have extended the range of $\epsilon$ to include 1, but this presents no difficulty since $N(\|F_n\|_{Q,2}, \mathcal{F}_n, L_2(Q)) = 1$ always holds by the definition of an envelope function. The desired result now follows since $V_n \le V$ for all $n \ge 1$.

As a simple example, let $T \subset \mathbb{R}$ and assume that $f_{n,t}(x)$ is always monotone increasing in $t$. Then $\mathcal{F}_n$ always has VC-index 2 by Lemma 9.10, and hence Lemma 11.21 applies and (11.8) holds. This particular situation will arise later, in Section 14.5.2, when we study the weak convergence of a certain monotone density estimator. Condition (11.8) is quite similar to the BUEI condition for fixed function classes, and most of the preservation results of Section 9.1.2 also apply. The following proposition, the proof of which is saved as an exercise, is one such preservation result:

Proposition 11.22 Let $\{\mathcal{G}_n\}$ and $\{\mathcal{H}_n\}$ be sequences of classes of measurable functions with respective envelope sequences $\{G_n\}$ and $\{H_n\}$, where Condition (11.8) is satisfied for the sequences $(\mathcal{F}_n, F_n) = (\mathcal{G}_n, G_n)$ and $(\mathcal{F}_n, F_n) = (\mathcal{H}_n, H_n)$. Then Condition (11.8) is also satisfied for the sequence of classes $\mathcal{F}_n = \mathcal{G}_n \cdot \mathcal{H}_n$ (consisting of all pairwise products) and the envelope sequence $F_n = G_n H_n$.
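The monotone-class example above can be illustrated numerically. The sketch below (an illustration of mine, not from the text; the class $f_t(x) = \min(x,t)$ and the grids are arbitrary choices) brute-forces all pairs of points and verifies that a class that is pointwise ordered in $t$ cannot subgraph-shatter two points, consistent with the VC-index of 2 given by Lemma 9.10.

```python
# Sketch: a class {f_t} that is pointwise ordered in t cannot subgraph-shatter
# two points, so its VC-index is at most 2 (cf. Lemma 9.10).  The class
# f_t(x) = min(x, t) is increasing in t; we brute-force all pairs of candidate
# points (x, y) on a small grid and verify that no pair is shattered.
from itertools import product

ts = [k / 10 for k in range(11)]              # index set T (grid of t values)

def f(t, x):                                   # f_t(x) = min(x, t), increasing in t
    return min(x, t)

grid = [(x / 5, y / 5) for x in range(6) for y in range(6)]  # candidate points

def picked(t, pt):                             # subgraph test: is (x, y) under f_t?
    x, y = pt
    return y < f(t, x)

def shattered(p1, p2):
    # which of the 4 subsets of {p1, p2} can the subgraphs pick out?
    patterns = {(picked(t, p1), picked(t, p2)) for t in ts}
    return len(patterns) == 4

assert not any(shattered(p1, p2) for p1, p2 in product(grid, grid))
print("no pair of points is subgraph-shattered; VC-index is at most 2")
```

Because `picked(t, pt)` is monotone in $t$, the achievable patterns form a monotone path from (False, False) to (True, True), so at most three of the four subsets can ever be picked out.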

We now present a weighted bootstrap result for this setting. Let $Z \equiv \{z_i, i \ge 1\}$ be a sequence of random variables satisfying:

(F) The $z_i$ are positive, i.i.d. random variables which are independent of the data $X_1, X_2, \ldots$ and which have mean $0 < \mu < \infty$ and variance $0 < \tau^2 < \infty$.

The weighted bootstrapped process we propose for use here is

$$\hat{X}_n(t) \equiv \frac{\mu}{\tau}\, n^{-1/2}\sum_{i=1}^n\left(\frac{z_i}{\bar{z}_n} - 1\right)f_{n,t}(X_i),$$

where $\bar{z}_n \equiv n^{-1}\sum_{i=1}^n z_i$. The following theorem tells us that this is a valid bootstrap procedure:

Theorem 11.23 Suppose the class of functions $\mathcal{F}_n$, with envelope $F_n$, satisfies the conditions of Theorem 11.20, and the sequence $\{z_i, i \ge 1\}$ satisfies Condition (F) above. Then the conclusions of Theorem 11.20 obtain, $\hat{X}_n$ is asymptotically measurable, and $\hat{X}_n \overset{P}{\underset{Z}{\rightsquigarrow}} X$.

Proof. Note that

$$\frac{\tau}{\mu}\hat{X}_n(t) = n^{-1/2}\sum_{i=1}^n\left(\frac{z_i}{\bar{z}_n} - 1\right)\left(f_{n,t}(X_i) - Pf_{n,t}\right)$$
$$= n^{-1/2}\sum_{i=1}^n\left(\frac{z_i}{\mu} - 1\right)\left(f_{n,t}(X_i) - Pf_{n,t}\right)$$
$$\quad + n^{-1/2}\left(\frac{\mu}{\bar{z}_n} - 1\right)\sum_{i=1}^n\left(\frac{z_i}{\mu} - 1\right)\left(f_{n,t}(X_i) - Pf_{n,t}\right)$$
$$\quad + n^{-1/2}\left(\frac{\mu}{\bar{z}_n} - 1\right)\sum_{i=1}^n\left(f_{n,t}(X_i) - Pf_{n,t}\right)$$
$$= A_n(t) + B_n(t) + C_n(t).$$

By Theorem 11.20, $\mathbb{G}_n f_{n,t} = O_P^{*T}(1)$, where $O_P^{*T}(1)$ denotes a term bounded in outer probability uniformly over $T$. Since Theorem 11.20 is really a special case of Theorem 11.16, we have by Theorem 11.19 that $n^{-1/2}\sum_{i=1}^n \frac{\mu}{\tau}\left(\frac{z_i}{\mu} - 1\right)\left(f_{n,t}(X_i) - Pf_{n,t}\right) \overset{P}{\underset{Z}{\rightsquigarrow}} X$, since $\frac{\mu}{\tau}\left(\frac{z_1}{\mu} - 1\right)$ has mean zero and variance 1. Hence $(\mu/\tau)A_n$ is asymptotically measurable and $(\mu/\tau)A_n \overset{P}{\underset{Z}{\rightsquigarrow}} X$. Moreover, since $\bar{z}_n \overset{as*}{\to} \mu$, we now have that both $B_n = o_P^{*T}(1)$ and $C_n = o_P^{*T}(1)$, where $o_P^{*T}(1)$ denotes a term going to zero in outer probability uniformly over $T$. The desired results now follow.

11.6 Dependent Observations

In this section, we review a number of empirical process results for dependent observations. A survey of recent results on this subject is Empirical Process Techniques for Dependent Data, edited by Dehling, Mikosch and Sørensen (2002); a helpful general reference on theory for dependent observations is Dependence in Probability and Statistics: A Survey of Recent Results, edited by Eberlein and Taqqu (1986). Our focus here is on strongly mixing stationary sequences (see Bradley, 1986). For the interested reader, a few results for non-stationary dependent sequences can be found in Andrews (1991), while several results for long range dependent sequences can be found in Dehling and Taqqu (1989), Yu (1994) and Wu (2003), among other references.

Let $X_1, X_2, \ldots$ be a stationary sequence of possibly dependent random variables on a probability space $(\Omega, \mathcal{D}, Q)$, and let $\mathcal{M}_a^b$ be the $\sigma$-field generated by $X_a, \ldots, X_b$. By stationary, we mean that for any set of positive integers $m_1, \ldots, m_k$, the joint distribution of $(X_{m_1+j}, X_{m_2+j}, \ldots, X_{m_k+j})$ is unchanging for all integers $j \ge -m_1 + 1$. The sequence $\{X_i, i \ge 1\}$ is strongly mixing (also $\alpha$-mixing) if

$$\alpha(k) \equiv \sup_{m \ge 1}\sup\left\{|P(A \cap B) - P(A)P(B)| : A \in \mathcal{M}_1^m,\ B \in \mathcal{M}_{m+k}^\infty\right\} \to 0,$$

as $k \to \infty$, and it is absolutely regular (also $\beta$-mixing) if

$$\beta(k) \equiv \sup_{m \ge 1} E\sup\left\{|P(B \mid \mathcal{M}_1^m) - P(B)| : B \in \mathcal{M}_{m+k}^\infty\right\} \to 0,$$

as $k \to \infty$. Other forms of mixing include $\rho$-mixing, $\phi$-mixing, $\psi$-mixing and $*$-mixing (see Definition 3.1 of Dehling and Philipp, 2002). Note that the stronger notion of $m$-dependence, where observations more than $m$ lags apart are independent, implies that $\beta(k) = 0$ for all $k > m$ and therefore also implies absolute regularity. It is also known that absolute regularity implies strong mixing (see Section 3.1 of Dehling and Philipp, 2002). Hereafter, we restrict our attention to $\beta$-mixing sequences, since these will be the most useful for our purposes.
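Strong mixing can be made concrete with a stationary two-state Markov chain (an illustration of mine, not from the text; the transition probabilities are arbitrary). For transition matrix $\begin{pmatrix}1-a & a\\ b & 1-b\end{pmatrix}$ one has $P^k = \Pi + \lambda^k(I - \Pi)$, where $\Pi$ has the stationary distribution $\pi$ in each row and $\lambda = 1 - a - b$, so correlations between events $k$ steps apart decay like $|\lambda|^k$ and $\alpha(k)$ decays geometrically.

```python
# Sketch: for a stationary two-state Markov chain with transition matrix
# [[1-a, a], [b, 1-b]], P^k = Pi + lam^k (I - Pi) with lam = 1 - a - b, so
# |P(X_{1+k}=j | X_1=i) - pi_j| = |lam|^k |1{i=j} - pi_j| and the alpha-mixing
# coefficient decays geometrically.
a, b = 0.3, 0.2
lam = 1 - a - b
pi = (b / (a + b), a / (a + b))               # stationary distribution

def matmul(A, B):
    return [[sum(A[i][r] * B[r][j] for r in range(2)) for j in range(2)]
            for i in range(2)]

P = [[1 - a, a], [b, 1 - b]]
Pk = [[1.0, 0.0], [0.0, 1.0]]                 # will hold P^k
for k in range(1, 11):
    Pk = matmul(Pk, P)
    for i in range(2):
        for j in range(2):
            expected = pi[j] + lam ** k * ((i == j) - pi[j])
            assert abs(Pk[i][j] - expected) < 1e-12
print("P^k - Pi shrinks like lam^k, so alpha(k) <= C * |lam|^k")
```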

We now present several empirical process Donsker and bootstrap results for absolutely regular stationary sequences. Let the values of $X_1$ lie in a Polish space $\mathcal{X}$ with distribution $P$, and let $\mathbb{G}_n$ be the empirical process for the first $n$ observations of the sequence, i.e., $\mathbb{G}_n f = n^{-1/2}\sum_{i=1}^n(f(X_i) - Pf)$, for any measurable $f : \mathcal{X} \to \mathbb{R}$. We now present the following bracketing central limit theorem:

Theorem 11.24 Let $X_1, X_2, \ldots$ be a stationary sequence in a Polish space with marginal distribution $P$, and let $\mathcal{F}$ be a class of functions in $L_2(P)$. Suppose there exists a $2 < p < \infty$ such that

(a) $\sum_{k=1}^\infty k^{2/(p-2)}\beta(k) < \infty$, and

(b) $J_{[\,]}(\infty, \mathcal{F}, L_p(P)) < \infty$.

Then $\mathbb{G}_n \rightsquigarrow H$ in $\ell^\infty(\mathcal{F})$, where $H$ is a tight, mean zero Gaussian process with covariance

$$\Gamma(f,g) \equiv \lim_{k\to\infty}\sum_{i=1}^\infty \mathrm{cov}\left(f(X_k), g(X_i)\right), \quad \text{for all } f, g \in \mathcal{F}. \tag{11.9}$$

Proof. The result follows through Condition 2 of Theorem 5.2 of Dedecker and Louhichi (2002), after noting that their Condition 2 can be shown to be implied by Conditions (a) and (b) above, via arguments contained in Section 4.3 of Dedecker and Louhichi (2002). We omit the details.

We next present a result for VC classes $\mathcal{F}$. In this case, we need to address the issue of measurability with some care. For what follows, let $\mathcal{B}$ be the $\sigma$-field of the measurable sets on $\mathcal{X}$. The class of functions $\mathcal{F}$ is permissible if it can be indexed by some set $T$, i.e., $\mathcal{F} = \{f(\cdot, t) : t \in T\}$ ($T$ could potentially be $\mathcal{F}$ for this purpose), in such a way that the following hold:

(a) $T$ is a Suslin metric space with Borel $\sigma$-field $\mathcal{B}(T)$;

(b) $f(\cdot, \cdot)$ is a $\mathcal{B} \times \mathcal{B}(T)$-measurable function from $\mathcal{X} \times T$ to $\mathbb{R}$.

Note that this definition is similar to the almost measurable Suslin condition of Section 11.5. We now have the following theorem:

Theorem 11.25 Let $X_1, X_2, \ldots$ be a stationary sequence in a Polish space with marginal distribution $P$, and let $\mathcal{F}$ be a class of functions in $L_2(P)$. Suppose there exists a $2 < p < \infty$ such that

(a) $\lim_{k\to\infty} k^{2/(p-2)}(\log k)^{2(p-1)/(p-2)}\beta(k) = 0$, and

(b) $\mathcal{F}$ is permissible, VC, and has envelope $F$ satisfying $P^*F^p < \infty$.

Then $\mathbb{G}_n \rightsquigarrow H$ in $\ell^\infty(\mathcal{F})$, where $H$ is as defined in Theorem 11.24 above.

Proof. This theorem is essentially Theorem 2.1 of Arcones and Yu (1994), and the proof, under slightly different measurability conditions, can be found therein. The measurability issues, including the sufficiency of the permissibility condition, are addressed in the appendix of Yu (1994). We omit the details.

Now we consider the bootstrap. A problem with the usual nonparametric bootstrap is that samples from $X_1, X_2, \ldots$ randomly drawn with replacement will be independent and will lose the dependency structure of the stationary sequence. Hence the usual bootstrap will generally not work. A modified bootstrap, the moving blocks bootstrap (MBB), was independently introduced by Künsch (1989) and Liu and Singh (1992) to address this problem. The method works as follows for a stationary sample $X_1, \ldots, X_n$: For a chosen block length $b \le n$, extend the sample by defining $(X_{n+1}, \ldots, X_{n+b-1}) = (X_1, \ldots, X_{b-1})$, and let $k$ be the smallest integer such that $kb \ge n$. Now define blocks (as row vectors) $B_i = (X_i, X_{i+1}, \ldots, X_{i+b-1})$, for $i = 1, \ldots, n$, and sample from the $B_i$'s with replacement to obtain $k$ blocks $B_1^*, B_2^*, \ldots, B_k^*$. The bootstrapped sample $X_1^*, \ldots, X_n^*$ consists of the first $n$ observations from the row vector $(B_1^*, \ldots, B_k^*)$. The bootstrapped empirical measure indexed by the class $\mathcal{F}$ is then defined as

$$\mathbb{G}_n^* f \equiv n^{-1/2}\sum_{i=1}^n\left(f(X_i^*) - \mathbb{P}_n f\right),$$

for all $f \in \mathcal{F}$, where $\mathbb{P}_n f \equiv n^{-1}\sum_{i=1}^n f(X_i)$ is the usual empirical probability measure (except that the data are now potentially dependent).

For now, we assume that $X_1, X_2, \ldots$ are real-valued, although the results probably could be extended to general Polish-valued random variables. MBB bootstrap consistency has been established under bracketing entropy conditions by Bühlmann (1995), although the entropy requirements are much stronger than those of Theorem 11.24 above, and also for VC classes by Radulović (1996). Other interesting, related references are Naik-Nimbalkar and Rajarshi (1994) and Peligrad (1998), among others. We conclude this section by presenting Radulović's (1996) Theorem 1 (slightly modified to address measurability) without proof:

Theorem 11.26 Let $X_1, X_2, \ldots$ be a stationary sequence of real random variables with marginal distribution $P$, and let $\mathcal{F}$ be a class of functions in $L_2(P)$. Also assume that $X_1^*, \ldots, X_n^*$ are generated by the MBB procedure with block size $b(n) \to \infty$, as $n \to \infty$, and that there exist $2 < p < \infty$, $q > p/(p-2)$, and $0 < \rho < (p-2)/[2(p-1)]$ such that

(a) $\limsup_{k\to\infty} k^q\beta(k) < \infty$,

(b) $\mathcal{F}$ is permissible, VC, and has envelope $F$ satisfying $P^*F^p < \infty$, and

(c) $\limsup_{n\to\infty} n^{-\rho}b(n) < \infty$.

Then $\mathbb{G}_n^* \overset{P^*}{\rightsquigarrow} H$ in $\ell^\infty(\mathcal{F})$, where $H$ is as defined in Theorem 11.24 above.

11.7 Proofs

Proofs of Theorems 11.4 and 11.5. Consider the class $\mathcal{F}' \equiv \{1/2 + f/(2M) : f \in \mathcal{F}\}$, and note that all functions $f \in \mathcal{F}'$ satisfy $0 \le f \le 1$. Moreover, $\|\mathbb{G}_n\|_{\mathcal{F}} = 2M\|\mathbb{G}_n\|_{\mathcal{F}'}$. Thus, if we prove the results for $\mathcal{F}'$, we are done.

For Theorem 11.4, Condition (11.1) is also satisfied if we replace $\mathcal{F}$ with $\mathcal{F}'$. This can be done without changing $W$, although we may need to change $K$ to some $K' < \infty$. Theorem 2.14.10 of VW now yields that

$$P^*\left(\|\mathbb{G}_n\|_{\mathcal{F}'} > t\right) \le C e^{Dt^{U+\delta}}e^{-2t^2}, \tag{11.10}$$

for every $t > 0$ and $\delta > 0$, where $U = W(6-W)/(2+W)$ and $C$ and $D$ depend only on $K'$, $W$, and $\delta$. Since $0 < W < 2$, it can be shown that $0 < U < 2$. Accordingly, choose a $\delta > 0$ so that $U + \delta < 2$. Now it can be shown that there exist a $C^* < \infty$ and $K^* > 0$ so that the bound in (11.10) is $\le C^* e^{-K^* t^2}$, for every $t > 0$. Theorem 11.4 now follows by Lemma 8.1.

For Theorem 11.5, Condition (11.2) implies the existence of a $K'$ such that $N_{[\,]}(\epsilon, \mathcal{F}', L_2(P)) \le (K'/\epsilon)^V$, for all $0 < \epsilon < K'$, where $V$ is as in (11.2). Now Theorem 2.14.9 of VW yields that

$$P^*\left(\|\mathbb{G}_n\|_{\mathcal{F}'} > t\right) \le C t^V e^{-2t^2}, \tag{11.11}$$

for every $t > 0$, where the constant $C$ depends only on $K'$ and $V$. Thus there exist a $C^* < \infty$ and $K^* > 0$ such that the bound in (11.11) is $\le C^* e^{-K^* t^2}$, for every $t > 0$. The desired result now follows.

Proof of Theorem 11.9. For the proof of result (i), note that the cosines are bounded, and thus the series defining $\mathcal{F}_1$ is automatically pointwise convergent by the discussion prior to Theorem 11.9. Now, the Cramér–von Mises statistic is the square of the $L_2[0,1]$ norm of the function $t \mapsto G_n(t) \equiv \mathbb{G}_n 1\{X \le t\}$. Since the functions $g_j \equiv \sqrt{2}\sin(\pi j t)$, $j = 1, 2, \ldots$, form an orthonormal basis for $L_2[0,1]$, Parseval's formula tells us that the integral of the square of any function in $L_2[0,1]$ can be replaced by the sum of the squares of its Fourier coefficients. This yields

$$\int_0^1 G_n^2(t)\,dt = \sum_{j=1}^\infty\left[\int_0^1 G_n(t)g_j(t)\,dt\right]^2 = \sum_{j=1}^\infty\left[\mathbb{G}_n\int_0^X g_j(t)\,dt\right]^2. \tag{11.12}$$

But since $\int_0^x g_j(t)\,dt = -(\pi j)^{-1}\sqrt{2}\cos(\pi j x)$, up to an additive constant which $\mathbb{G}_n$ annihilates, the last term in (11.12) becomes $\sum_{j=1}^\infty \mathbb{G}_n^2\left(\sqrt{2}\cos(\pi j X)\right)/(\pi j)^2$. Standard methods can now be used to establish that this converges weakly to the appropriate limiting distribution. An alternative proof can be obtained via the relation (11.3) and the fact that $\mathcal{F}_1$ is an elliptical class and hence Donsker. Now (i) follows.

For result (ii), note that the orthonormalized Legendre polynomials can be obtained by applying the Gram–Schmidt procedure to the functions $1, u, u^2, \ldots$. By Problem 2.13.1 of VW, the orthonormalized Legendre polynomials satisfy the differential equations $(1-u^2)p_j''(u) - 2up_j'(u) = -j(j+1)p_j(u)$, for all $u \in [-1,1]$ and integers $j \ge 1$. By a change of variables, followed by partial integration and use of this differential identity, we obtain

$$2\int_0^1 p_i'(2t-1)\,p_j'(2t-1)\,t(1-t)\,dt = \frac{1}{4}\int_{-1}^1 p_i'(u)p_j'(u)(1-u^2)\,du$$
$$= -\frac{1}{4}\int_{-1}^1 p_i(u)\left[p_j''(u)(1-u^2) - 2up_j'(u)\right]du = \frac{1}{4}\,j(j+1)\,1\{i = j\}.$$

Thus the functions $2\sqrt{2}\,p_j'(2t-1)\sqrt{t(1-t)}/\sqrt{j(j+1)}$, with $j$ ranging over the positive integers, form an orthonormal basis for $L_2[0,1]$. By Parseval's formula, we obtain

$$\int_0^1 \frac{G_n^2(t)}{t(1-t)}\,dt = \sum_{j=1}^\infty\left[\int_0^1 G_n(t)\,p_j'(2t-1)\,dt\right]^2\frac{8}{j(j+1)} = \sum_{j=1}^\infty\frac{2}{j(j+1)}\,\mathbb{G}_n^2\left(p_j(2X-1)\right).$$

By arguments similar to those used to establish (i), we can now verify that (ii) follows.

Proof of Corollary 11.10. By transforming the data via $x \mapsto F_0(x)$, we can, without loss of generality, assume that the data are i.i.d. uniform and reduce the interval of integration to $[0,1]$. Now Proposition 7.27 yields that $T_1 \rightsquigarrow \int_0^1 G^2(t)\,dt$, where $G(t)$ is a standard Brownian bridge. Parseval's formula and arguments used in the proof of Theorem 11.9 now yield that

$$\int_0^1 G^2(t)\,dt = \sum_{i=1}^\infty G^2\left(\sqrt{2}\cos(\pi i\,\cdot)\right)/(\pi i)^2 \equiv T_1^*,$$

where now $G$ is a mean zero Gaussian (Brownian bridge) random measure for which the covariance between $G(f)$ and $G(g)$, for $f, g : [0,1] \to \mathbb{R}$, is $\int_0^1 f(s)g(s)\,ds - \int_0^1 f(s)\,ds\int_0^1 g(s)\,ds$. The fact that $T_1^*$ is tight, combined with the covariance structure of $G$, yields that $T_1^*$ has the same distribution as $T_1$, and the desired weak convergence result for $T_1$ follows.

For $T_2$, apply the same transformation as above so that, without loss of generality, we can again assume that the data are i.i.d. uniform and that the interval of integration is $[0,1]$. Let

$$G_n \equiv \sqrt{\frac{n_1 n_2}{n_1 + n_2}}\left(\mathbb{F}_{n,1} - \mathbb{F}_{n,2}\right),$$

and fix $\epsilon \in (0, 1/2)$. We can now apply Proposition 7.27 to verify that

$$\int_\epsilon^{1-\epsilon}\frac{G_n^2(s)}{\mathbb{F}_{n,0}(s)\left[1 - \mathbb{F}_{n,0}(s)\right]}\,d\mathbb{F}_{n,0}(s) \rightsquigarrow \int_\epsilon^{1-\epsilon}\frac{G^2(s)}{s(1-s)}\,ds.$$

Note also that Fubini's theorem yields that both $E\int_0^\epsilon G^2(s)/[s(1-s)]\,ds = \epsilon$ and $E\int_{1-\epsilon}^1 G^2(s)/[s(1-s)]\,ds = \epsilon$.

We will now work towards bounding $\int_0^\epsilon G_n^2(s)/\left(\mathbb{F}_{n,0}(s)[1 - \mathbb{F}_{n,0}(s)]\right)d\mathbb{F}_{n,0}(s)$. Fix $s \in (0, \epsilon)$ and note that, under the null hypothesis, the conditional distribution of $G_n(s)$ given $n\mathbb{F}_{n,0}(s) = m$ has the form

$$\sqrt{\frac{n_1 n_2}{n_1 + n_2}}\left(\frac{A}{n_1} - \frac{m - A}{n_2}\right),$$

where $A$ is hypergeometric with density

$$P(A = a) = \binom{n_1}{a}\binom{n_2}{m-a}\bigg/\binom{n}{m},$$

where $a$ is any integer between $(m - n_2)\vee 0$ and $m \wedge n_1$. Hence

$$E\left[G_n^2(s)\,\middle|\, n\mathbb{F}_{n,0}(s) = m\right] = \frac{n_1 + n_2}{n_1 n_2}\,\mathrm{var}(A) = \frac{n}{n_1 n_2}\left(\frac{m\,n_1 n_2(n - m)}{n^2(n-1)}\right) = \frac{n}{n-1}\left[\frac{m(n-m)}{n^2}\right].$$
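The conditional-variance identity just displayed can be verified by direct enumeration of the hypergeometric distribution (an illustration of mine; the particular values of $n_1$, $n_2$, $m$ are arbitrary):

```python
# Check: with A hypergeometric (m draws, n_1 of one type, n_2 of the other,
# n = n_1 + n_2),  ((n_1+n_2)/(n_1 n_2)) Var(A) = (n/(n-1)) * m(n-m)/n^2.
from math import comb

n1, n2, m = 7, 5, 4
n = n1 + n2
den = comb(n, m)
support = range(max(0, m - n2), min(m, n1) + 1)
pmf = {a: comb(n1, a) * comb(n2, m - a) / den for a in support}
mean = sum(a * p for a, p in pmf.items())
var = sum((a - mean) ** 2 * p for a, p in pmf.items())

lhs = (n1 + n2) / (n1 * n2) * var
rhs = n / (n - 1) * m * (n - m) / n ** 2
assert abs(lhs - rhs) < 1e-12
print(lhs, rhs)
```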

Thus

$$E\left[\int_0^\epsilon\frac{G_n^2(s)}{\mathbb{F}_{n,0}(s)\left[1 - \mathbb{F}_{n,0}(s)\right]}\,d\mathbb{F}_{n,0}(s)\right] = \frac{n}{n-1}\,E\,\mathbb{F}_{n,0}(\epsilon) \le 2\epsilon,$$

for all $n \ge 2$. Similar arguments verify that

$$E\left[\int_{1-\epsilon}^1\frac{G_n^2(s)}{\mathbb{F}_{n,0}(s)\left[1 - \mathbb{F}_{n,0}(s)\right]}\,d\mathbb{F}_{n,0}(s)\right] \le 2\epsilon,$$

for all $n \ge 2$. Since $\epsilon$ was arbitrary, we now have that

$$T_2 \rightsquigarrow \int_0^1\frac{G^2(s)}{s(1-s)}\,ds \equiv T_2^*,$$

where $G$ is the same Brownian bridge process used in defining $T_1^*$. We can again use arguments from the proof of Theorem 11.9 to obtain that $T_2^*$ has the same distribution as $T_2$.

11.8 Exercises

11.8.1. Verify that $F^*$ in the proof of Theorem 11.6 is continuous.

11.8.2. Show that when $P_n$ and $P$ satisfy (11.4), we have that $P_n h \to Ph$, as $n \to \infty$, for all bounded and measurable $h$.

11.8.3. Prove Proposition 11.22. Hint: Consider the arguments used in the proof of Theorem 9.15.

11.9 Notes

Theorems 11.4 and 11.5 are inspired by Theorem 2.14.9 of VW, and Theorem 11.7 is derived from Theorem 2.13.1 of VW. Theorem 11.8 is Theorem 2.13.2 of VW, while Theorem 11.9 is derived from Examples 2.13.3 and 2.13.5 of VW. Much of the structure of Section 11.4 comes from Kosorok (2003), although Theorem 11.17 was derived from material in Section 5 of Pollard (1990). Lemma 11.15 and Theorems 11.16, 11.18 and 11.19 are Lemma 2 and Theorems 1 (with a minor modification), 3 and 2, respectively, of Kosorok (2003).


12
The Functional Delta Method

In this chapter, we build on the presentation of the functional delta method given in Section 2.2.4. Recall the concept of Hadamard differentiability introduced in that section and also defined more precisely in Section 6.3. The key result of Section 2.2.4 is that the delta method and its bootstrap counterpart work provided the map $\phi$ is Hadamard differentiable tangentially to a suitable set $\mathbb{D}_0$. We first present in Section 12.1 clarifications and proofs of the two main theorems given in Section 2.2.4: the functional delta method for Hadamard differentiable maps (Theorem 2.8 on Page 22) and the conditional analog for the bootstrap (Theorem 2.9 on Page 23). We then give in Section 12.2 several important examples of Hadamard differentiable maps of use in statistics, along with specific illustrations of how those maps are utilized.

12.1 Main Results and Proofs

In this section, we first prove the functional delta method theorem (Theorem 2.8 on Page 22) and then restate and prove Theorem 2.9 from Page 23. Before proceeding, recall that $X_n$ in the statement of Theorem 2.8 is a random quantity that takes its values in a normed space $\mathbb{D}$.

Proof of Theorem 2.8 (Page 22). Consider the map $h \mapsto r_n\left(\phi(\theta + r_n^{-1}h) - \phi(\theta)\right) \equiv g_n(h)$, and note that it is defined on the domain $\mathbb{D}_n \equiv \{h : \theta + r_n^{-1}h \in \mathbb{D}_\phi\}$ and satisfies $g_n(h_n) \to \phi_\theta'(h)$ for every $h_n \to h \in \mathbb{D}_0$ with $h_n \in \mathbb{D}_n$. Thus the conditions of the extended continuous mapping theorem (Theorem 7.24) are satisfied with $g(\cdot) = \phi_\theta'(\cdot)$. Hence conclusion (i) of that theorem implies $g_n(r_n(X_n - \theta)) \rightsquigarrow \phi_\theta'(X)$.

We now restate and prove Theorem 2.9 on Page 23. The restatement clarifies the measurability condition. Before proceeding, recall the definitions of $X_n$ and $\hat{X}_n$ in the statement of Theorem 2.9. Specifically, $X_n = X_n(\mathcal{X}_n)$ is a sequence of random elements in a normed space $\mathbb{D}$ based on the data sequence $\{\mathcal{X}_n, n \ge 1\}$, while $\hat{X}_n = \hat{X}_n(\mathcal{X}_n, W_n)$ is a bootstrapped version of $X_n$ based on both the data sequence and a sequence of weights $W = \{W_n, n \ge 1\}$. Note that the proof of this theorem utilizes the bootstrap continuous mapping theorem (Theorem 10.8). Here is the restated version of Theorem 2.9 from Page 23:

Theorem 12.1 For normed spaces $\mathbb{D}$ and $\mathbb{E}$, let $\phi : \mathbb{D}_\phi \subset \mathbb{D} \to \mathbb{E}$ be Hadamard differentiable at $\mu$ tangentially to $\mathbb{D}_0 \subset \mathbb{D}$, with derivative $\phi_\mu'$. Let $X_n$ and $\hat{X}_n$ have values in $\mathbb{D}_\phi$, with $r_n(X_n - \mu) \rightsquigarrow X$, where $X$ is tight and takes its values in $\mathbb{D}_0$, for some sequence of constants $0 < r_n \to \infty$; assume the maps $W_n \mapsto h(\hat{X}_n)$ are measurable for every $h \in C_b(\mathbb{D})$ outer almost surely; and assume $r_n c(\hat{X}_n - X_n) \overset{P}{\underset{W}{\rightsquigarrow}} X$, for a constant $0 < c < \infty$. Then $r_n c\left(\phi(\hat{X}_n) - \phi(X_n)\right) \overset{P}{\underset{W}{\rightsquigarrow}} \phi_\mu'(X)$.

Proof. We can, without loss of generality, assume that $\mathbb{D}_0$ is complete and separable (since $X$ is tight), that $\mathbb{D}$ and $\mathbb{E}$ are both complete, and that $\phi_\mu' : \mathbb{D} \to \mathbb{E}$ is continuous on all of $\mathbb{D}$, although it is permitted to be neither bounded nor linear off of $\mathbb{D}_0$. To accomplish this, one can apply the Dugundji extension theorem (Theorem 10.9), which extends any continuous operator defined on a closed subset to the entire space. It may be necessary to replace $\mathbb{E}$ with its closed linear span.

We can now use arguments nearly identical to those used in the proof of Theorem 10.4 given in Section 10.1 to verify that, unconditionally, both $\hat{U}_n \equiv r_n(\hat{X}_n - X_n) \rightsquigarrow c^{-1}X$ and $r_n(X_n - \mu) \rightsquigarrow Z$, where $Z$ is a tight random element. Fix some $h \in BL_1(\mathbb{D})$, define $U_n \equiv r_n(X_n - \mu)$, and let $X_1$ and $X_2$ be two independent copies of $X$. We now have that

$$\left|E^* h(\hat{U}_n + U_n) - E h(c^{-1}X_1 + X_2)\right|$$
$$\le \left|E_{\mathcal{X}_n}E_{W_n}h(\hat{U}_n + U_n)^* - E^* E_{W_n}h(\hat{U}_n + U_n)\right|$$
$$\quad + E^*\left|E_{W_n}h(\hat{U}_n + U_n) - E_{X_1}h(c^{-1}X_1 + U_n)\right|$$
$$\quad + \left|E^* E_{X_1}h(c^{-1}X_1 + U_n) - E_{X_2}E_{X_1}h(c^{-1}X_1 + X_2)\right|,$$

where $E_{W_n}$, $E_{X_1}$ and $E_{X_2}$ are expectations taken over the bootstrap weights, $X_1$ and $X_2$, respectively. The first term on the right in the above expression goes to zero by the asymptotic measurability of $\hat{U}_n + U_n = r_n(\hat{X}_n - \mu)$. The second term goes to zero by the fact that $\hat{U}_n \overset{P}{\underset{W}{\rightsquigarrow}} c^{-1}X$ combined with the fact that the map $x \mapsto h(x + U_n)$ is Lipschitz continuous with Lipschitz constant 1, outer almost surely. The third term goes to zero since $U_n \rightsquigarrow X$ and the map $x \mapsto E_{X_1}h(c^{-1}X_1 + x)$ is also Lipschitz continuous with Lipschitz constant 1. Since $h$ was arbitrary, we have by the Portmanteau theorem that, unconditionally,

$$r_n\begin{pmatrix}\hat{X}_n - \mu\\ X_n - \mu\end{pmatrix} \rightsquigarrow \begin{pmatrix}c^{-1}X_1 + X_2\\ X_2\end{pmatrix}.$$

Now the functional delta method (Theorem 2.8) yields

$$r_n\begin{pmatrix}\phi(\hat{X}_n) - \phi(\mu)\\ \phi(X_n) - \phi(\mu)\\ \hat{X}_n - \mu\\ X_n - \mu\end{pmatrix} \rightsquigarrow \begin{pmatrix}\phi_\mu'(c^{-1}X_1 + X_2)\\ \phi_\mu'(X_2)\\ c^{-1}X_1 + X_2\\ X_2\end{pmatrix},$$

since the map $(x, y) \mapsto (\phi(x), \phi(y), x, y)$ is Hadamard differentiable at $(\mu, \mu)$ tangentially to $\mathbb{D}_0$. This implies two things. First,

$$r_n c\begin{pmatrix}\phi(\hat{X}_n) - \phi(X_n)\\ \hat{X}_n - X_n\end{pmatrix} \rightsquigarrow \begin{pmatrix}\phi_\mu'(X)\\ X\end{pmatrix},$$

since $\phi_\mu'$ is linear on $\mathbb{D}_0$. Second, the usual continuous mapping theorem now yields that, unconditionally,

$$r_n c\left(\phi(\hat{X}_n) - \phi(X_n)\right) - \phi_\mu'\left(r_n c(\hat{X}_n - X_n)\right) \overset{P}{\to} 0, \tag{12.1}$$

since the map $(x, y) \mapsto x - \phi_\mu'(y)$ is continuous on all of $\mathbb{E} \times \mathbb{D}$.

Now for any map $h \in C_b(\mathbb{D})$, the map $x \mapsto h(r_n c(x - X_n))$ is continuous and bounded for all $x \in \mathbb{D}$, outer almost surely. Thus the maps $W_n \mapsto h(r_n c(\hat{X}_n - X_n))$ are measurable for every $h \in C_b(\mathbb{D})$ outer almost surely. Hence the bootstrap continuous mapping theorem, Theorem 10.8, yields that $\phi_\mu'\left(r_n c(\hat{X}_n - X_n)\right) \overset{P}{\underset{W}{\rightsquigarrow}} \phi_\mu'(X)$. The desired result now follows from (12.1).

12.2 Examples

We now give several important examples of Hadamard differentiable maps, along with illustrations of how these maps are utilized in statistical applications.

12.2.1 Composition

Recall from Section 2.2.4 the map $\phi : \mathbb{D}_\phi \to D[0,1]$, where $\phi(f) = 1/f$ and $\mathbb{D}_\phi = \{f \in D[0,1] : |f| > 0\}$. In that section, we established that $\phi$ is Hadamard differentiable, tangentially to $D[0,1]$, with derivative at $\theta \in \mathbb{D}_\phi$ equal to $h \mapsto -h/\theta^2$. This is a simple example of the following general composition result:

Lemma 12.2 Let $g : B \subset \bar{\mathbb{R}} \equiv [-\infty,\infty] \to \mathbb{R}$ be differentiable with derivative continuous on all closed subsets of $B$, and let $\mathbb{D}_\phi = \{A \in \ell^\infty(\mathcal{X}) : (R(A))^\delta \subset B \text{ for some } \delta > 0\}$, where $\mathcal{X}$ is a set, $R(C)$ denotes the range of the function $C \in \ell^\infty(\mathcal{X})$, and $D^\delta$ is the $\delta$-enlargement of the set $D$. Then $A \mapsto g \circ A$ is Hadamard differentiable as a map from $\mathbb{D}_\phi \subset \ell^\infty(\mathcal{X})$ to $\ell^\infty(\mathcal{X})$, at every $A \in \mathbb{D}_\phi$. The derivative is given by $\phi_A'(\alpha) = g'(A)\alpha$, where $g'$ is the derivative of $g$.

Before giving the proof, we briefly return to our simple example of the reciprocal map $A \mapsto 1/A$. The differentiability of this map easily generalizes from $D[0,1]$ to $\ell^\infty(\mathcal{X})$, for arbitrary $\mathcal{X}$, provided we restrict the domain of the reciprocal map to $\mathbb{D}_\phi = \{A \in \ell^\infty(\mathcal{X}) : \inf_{x\in\mathcal{X}}|A(x)| > 0\}$. This follows after applying Lemma 12.2 to the set $B = [-\infty, 0) \cup (0, \infty]$.

Proof of Lemma 12.2. Note that $\mathbb{D} = \ell^\infty(\mathcal{X})$ in this case, and that the tangent set for the derivative is all of $\mathbb{D}$. Let $\{t_n\}$ be any real sequence with $t_n \to 0$, let $h_n \in \ell^\infty(\mathcal{X})$ be any sequence converging to $\alpha \in \ell^\infty(\mathcal{X})$ uniformly, and define $A_n = A + t_n h_n$. Then, by the conditions of the lemma, there exists a closed $B_1 \subset B$ such that $(R(A) \cup R(A_n))^\delta \subset B_1$ for some $\delta > 0$ and all $n$ large enough. Hence

$$\sup_{x\in\mathcal{X}}\left|\frac{g(A(x) + t_n h_n(x)) - g(A(x))}{t_n} - g'(A(x))\alpha(x)\right| \to 0,$$

as $n \to \infty$, since continuous functions on closed sets are bounded and thus $g'$ is uniformly continuous on $B_1$.
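Lemma 12.2 can be illustrated numerically for the reciprocal map on a finite grid, where the $\ell^\infty$ norm is a maximum (an illustration of mine; the grid and the functions $A$, $\alpha$ are arbitrary choices):

```python
# Numeric illustration of Lemma 12.2 for g(u) = 1/u on a finite grid: with
# A_n = A + t_n h_n and h_n -> alpha uniformly, the difference quotient
# converges in sup-norm to g'(A) alpha = -alpha / A^2.
import math

xs = [k / 50 for k in range(51)]                 # grid playing the role of X
A = [2 + math.sin(x) for x in xs]                # A bounded away from 0
alpha = [math.cos(3 * x) for x in xs]

def sup_err(t, h):
    quot = [((1 / (a + t * hh)) - 1 / a) / t for a, hh in zip(A, h)]
    deriv = [-al / a ** 2 for a, al in zip(A, alpha)]
    return max(abs(q - d) for q, d in zip(quot, deriv))

errs = []
for k in range(1, 6):
    t = 10.0 ** -k
    h = [al + t for al in alpha]                 # h_n -> alpha uniformly
    errs.append(sup_err(t, h))

assert all(e2 < e1 for e1, e2 in zip(errs, errs[1:]))   # errors decrease
assert errs[-1] < 1e-4
print(errs)
```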

12.2.2 Integration

For an $M < \infty$ and an interval $[a,b] \subset \mathbb{R}$, let $BV_M[a,b]$ be the set of all functions $A \in D[a,b]$ with total variation $|A(a)| + \int_{(a,b]}|dA(s)| \le M$. In this section, we consider, for given functions $A \in D[a,b]$ and $B \in BV_M[a,b]$ and domain $\mathbb{D}_M \equiv D[a,b] \times BV_M[a,b]$, the maps $\phi : \mathbb{D}_M \to \mathbb{R}$ and $\psi : \mathbb{D}_M \to D[a,b]$ defined by

$$\phi(A,B) = \int_{(a,b]}A(s)\,dB(s) \quad\text{and}\quad \psi(A,B)(t) = \int_{(a,t]}A(s)\,dB(s). \tag{12.2}$$

The following lemma verifies that these two maps are Hadamard differentiable:

Lemma 12.3 For each fixed $M < \infty$, the maps $\phi : \mathbb{D}_M \to \mathbb{R}$ and $\psi : \mathbb{D}_M \to D[a,b]$ defined in (12.2) are Hadamard differentiable at each $(A,B) \in \mathbb{D}_M$ with $\int_{(a,b]}|dA| < \infty$. The derivatives are given by

$$\phi_{A,B}'(\alpha,\beta) = \int_{(a,b]}A\,d\beta + \int_{(a,b]}\alpha\,dB, \quad\text{and}$$
$$\psi_{A,B}'(\alpha,\beta)(t) = \int_{(a,t]}A\,d\beta + \int_{(a,t]}\alpha\,dB.$$

Note that in the above lemma we define $\int_{(a,t]}A\,d\beta \equiv A(t)\beta(t) - A(a)\beta(a) - \int_{(a,t]}\beta(s-)\,dA(s)$, so that the integral is well defined even when $\beta$ does not have bounded variation. We will present the proof of this lemma at the end of this section.

We now look at two statistical applications of Lemma 12.3: the two-sample Wilcoxon rank sum statistic and the Nelson–Aalen integrated hazard estimator. Consider first the Wilcoxon statistic. Let $X_1, \ldots, X_m$ and $Y_1, \ldots, Y_n$ be independent samples from distributions $F$ and $G$ on the reals. If $\mathbb{F}_m$ and $\mathbb{G}_n$ are the respective empirical distribution functions, the Wilcoxon rank sum statistic for comparing $F$ and $G$ has the form

$$T_1 = m\int_{\mathbb{R}}\left(m\mathbb{F}_m(x) + n\mathbb{G}_n(x)\right)d\mathbb{F}_m(x).$$

If we temporarily assume that $F$ and $G$ are continuous, then

$$T_1 = mn\int_{\mathbb{R}}\mathbb{G}_n(x)\,d\mathbb{F}_m(x) + m^2\int_{\mathbb{R}}\mathbb{F}_m(x)\,d\mathbb{F}_m(x) = mn\int_{\mathbb{R}}\mathbb{G}_n(x)\,d\mathbb{F}_m(x) + \frac{m^2 + m}{2} \equiv mnT_2 + \frac{m^2 + m}{2},$$

where $T_2$ is the Mann–Whitney statistic. When $F$ or $G$ have atoms, the relationship between the Wilcoxon and Mann–Whitney statistics is more complex. We will now study the asymptotic properties of the Mann–Whitney version of the rank sum statistic, $T_2$.
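The identity $T_1 = mnT_2 + (m^2+m)/2$ for continuous (hence almost surely distinct) data can be checked directly, since $T_1$ is then exactly the sum of the pooled-sample ranks of the $X$'s (a small numerical illustration of mine):

```python
# Check of T_1 = m n T_2 + (m^2 + m)/2 for distinct data:
# T_1 = m * int (m F_m + n G_n) dF_m is the sum of pooled ranks of the X's,
# and T_2 = int G_n dF_m is the Mann-Whitney statistic.
import random
random.seed(7)

m, n = 6, 9
x = [random.random() for _ in range(m)]
y = [random.random() for _ in range(n)]

def cnt_le(arr, t):
    return sum(v <= t for v in arr)

T1 = sum(cnt_le(x, xi) + cnt_le(y, xi) for xi in x)   # integer-valued rank sum
T2 = sum(cnt_le(y, xi) for xi in x) / (m * n)          # Mann-Whitney statistic
rank_sum = sum(sorted(x + y).index(xi) + 1 for xi in x)

assert T1 == rank_sum
assert abs(T1 - (m * n * T2 + (m * m + m) / 2)) < 1e-9
print(T1, rank_sum)
```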

For arbitrary $F$ and $G$, $T_2 = \phi(\mathbb{G}_n, \mathbb{F}_m)$, where $\phi$ is as defined in Lemma 12.3. Note that $F$, $G$, $\mathbb{F}_m$ and $\mathbb{G}_n$ all have total variation $\le 1$. Thus Lemma 12.3 applies, and we obtain that the Hadamard derivative of $\phi$ at $(A,B) = (G,F)$ is the map $\phi_{G,F}'(\alpha,\beta) = \int_{\mathbb{R}}G\,d\beta + \int_{\mathbb{R}}\alpha\,dF$, which is continuous and linear over $\alpha, \beta \in D[-\infty,\infty]$. If we assume that $m/(m+n) \to \lambda \in [0,1]$, as $m \wedge n \to \infty$, then

$$\sqrt{\frac{mn}{m+n}}\begin{pmatrix}\mathbb{G}_n - G\\ \mathbb{F}_m - F\end{pmatrix} \rightsquigarrow \begin{pmatrix}\sqrt{\lambda}\,B_1(G)\\ \sqrt{1-\lambda}\,B_2(F)\end{pmatrix},$$

where $B_1$ and $B_2$ are independent standard Brownian bridges. Hence $G_G(\cdot) \equiv B_1(G(\cdot))$ and $G_F(\cdot) \equiv B_2(F(\cdot))$ both live in $D[-\infty,\infty]$. Now Theorem 2.8 yields

$$\sqrt{\frac{mn}{m+n}}\left(T_2 - \int_{\mathbb{R}}G\,dF\right) \rightsquigarrow \sqrt{1-\lambda}\int_{\mathbb{R}}G\,dG_F + \sqrt{\lambda}\int_{\mathbb{R}}G_G\,dF,$$

as $m \wedge n \to \infty$. When $F = G$ and $F$ is continuous, this limiting distribution is mean zero normal with variance $1/12$. The delta method bootstrap, Theorem 12.1, is also applicable and can be used to obtain an estimate of the limiting distribution under more general hypotheses on $F$ and $G$.

We now shift our attention to the Nelson-Aalen estimator under right censoring. In the right-censored survival data setting, an observation consists of the pair (X, δ), where X = T ∧ C is the minimum of a failure time T and a censoring time C, and δ = 1{T ≤ C}. T and C are assumed to be independent. Let F be the distribution function for T, and define the integrated baseline hazard for F to be Λ(t) = ∫_0^t dF(s)/S(s−), where S ≡ 1 − F is the survival function. The Nelson-Aalen estimator for Λ, based on the i.i.d. sample (X1, δ1), ..., (Xn, δn), is

    Λ̂n(t) ≡ ∫_[0,t] dNn(s)/Yn(s),

where Nn(t) ≡ n⁻¹ Σ_{i=1}^n δi 1{Xi ≤ t} and Yn(t) ≡ n⁻¹ Σ_{i=1}^n 1{Xi ≥ t}. It is easy to verify that the classes {δ1{X ≤ t} : t ≥ 0} and {1{X ≥ t} : t ≥ 0} are both Donsker and hence that

    √n (Nn − N0, Yn − Y0) ⇝ (G1, G2),    (12.3)

where N0(t) ≡ P(T ≤ t, C ≥ T), Y0(t) ≡ P(X ≥ t), and G1 and G2 are tight Gaussian processes with respective covariances N0(s ∧ t) − N0(s)N0(t) and Y0(s ∨ t) − Y0(s)Y0(t), and with cross-covariance 1{s ≥ t}[N0(s) − N0(t−)] − N0(s)Y0(t). Note that while we have already seen this survival set-up several times (e.g., Sections 2.2.5 and 4.2.2), we are choosing to use slightly different notation than previously used to emphasize certain features of the underlying empirical processes.
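The estimator Λ̂n is a simple cumulative sum over the observed failure times. A minimal Python sketch (the helper name and conventions are ours; tied failure times are pooled):

```python
import numpy as np

def nelson_aalen(times, delta):
    # Lambda_n(t) = sum over failure times s <= t of dN_n(s)/Y_n(s),
    # with N_n the (scaled) counting process of observed failures and
    # Y_n the at-risk fraction, as in the text
    times = np.asarray(times, float)
    delta = np.asarray(delta, int)
    n = len(times)
    grid, jumps = [], []
    for t in np.unique(times[delta == 1]):
        dN = np.sum((times == t) & (delta == 1)) / n  # jump of N_n at t
        Y = np.sum(times >= t) / n                    # at-risk fraction Y_n(t)
        grid.append(t)
        jumps.append(dN / Y)
    return np.array(grid), np.cumsum(jumps)

# toy example: X = (1, 2, 3, 4), all uncensored;
# the increments are 1/4, 1/3, 1/2, 1
grid, Lam = nelson_aalen([1, 2, 3, 4], [1, 1, 1, 1])
```

With no censoring, the increments are 1/(number still at risk), recovering the classical form of the estimator.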

The Nelson-Aalen estimator depends on the pair (Nn, Yn) through the two maps

    (A, B) → (A, 1/B) → ∫_[0,t] (1/B) dA.

From Section 12.1.1, Lemma 12.3, and the chain rule (Lemma 6.19), it is easy to see that this composition map is Hadamard differentiable on a domain of the type

    {(A, B) : ∫_[0,τ] |dA(t)| ≤ M, inf_{t∈[0,τ]} |B(t)| ≥ ε},

for a given M < ∞ and ε > 0, at every point (A, B) such that 1/B has bounded variation. Note that the interval of integration we are using, [0, τ], is left-closed rather than left-open as in the definition of ψ given in (12.2). However, if we pick some η > 0, then in fact integrals over [0, t], for any t > 0, of functions which have zero variation over (−∞, 0) are unchanged if


we replace the interval of integration with (−η, t]. Thus we will still be able to utilize Lemma 12.3 in our current set-up. In this case, the point (A, B) of interest is A = N0 and B = Y0. Thus if t is restricted to the interval [0, τ], where τ satisfies Y0(τ) > 0, then it is easy to see that the pair (Nn, Yn) is contained in the given domain with probability tending to 1 as n → ∞. The derivative of the composition map is given by

    (α, β) → (α, −β/Y0²) → ∫_[0,t] dα/Y0 − ∫_[0,t] β dN0/Y0².

Thus from (12.3), we obtain via Theorem 2.8 that

    √n(Λ̂n − Λ) ⇝ ∫_[0,(·)] dG1/Y0 − ∫_[0,(·)] G2 dN0/Y0².    (12.4)

The Gaussian process on the right side of (12.4) is equal to ∫_[0,(·)] dM/Y0, where M(t) ≡ G1(t) − ∫_[0,t] G2 dΛ can be shown to be a Gaussian martingale with independent increments and covariance ∫_[0,s∧t] (1 − ΔΛ)Y0 dΛ, where ΔA(t) ≡ A(t) − A(t−) is the mass at t of a signed measure A. This means that the Gaussian process on the right side of (12.4) is also a Gaussian martingale with independent increments but with covariance ∫_[0,s∧t] (1 − ΔΛ) dΛ/Y0. A useful discussion of continuous time martingales arising in right-censored survival data can be found in Fleming and Harrington (1991).

The delta method bootstrap, Theorem 12.1, is also applicable here and can be used to obtain an estimate of the limiting distribution. However, when Λ is continuous over [0, τ], the independent increments structure implies that the limiting distribution is time-transformed Brownian motion. More precisely, the limiting process can be expressed as W(v(t)), where W is standard Brownian motion on [0,∞) and v(t) ≡ ∫_(0,t] dΛ/Y0. As discussed in Chapter 7 of Fleming and Harrington (1991), this fact can be used to compute asymptotically exact simultaneous confidence bands for Λ.
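For Λ continuous, the limit W(v(t)) suggests a simple Monte Carlo route to a simultaneous band: plug in an estimate of v(t), simulate the supremum of the time-transformed Brownian motion, and take a quantile. The Python sketch below is our own illustration of this idea, not the exact construction of Fleming and Harrington; the function and argument names are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(1)

def band_critical_value(v_grid, alpha=0.05, reps=4000):
    # Monte Carlo critical value for sup_t |W(v(t))| over a grid of
    # (estimated) values of the time transform v(t); W is approximated
    # by partial sums of independent N(0, dv) increments
    dv = np.diff(np.concatenate([[0.0], np.asarray(v_grid, float)]))
    W = np.cumsum(rng.normal(scale=np.sqrt(dv), size=(reps, len(dv))), axis=1)
    sup = np.max(np.abs(W), axis=1)
    return float(np.quantile(sup, 1 - alpha))

# with v(t) = t on (0, 1], this approximates the 95% point of
# sup_{[0,1]} |W|, which is about 2.24
c = band_critical_value(np.linspace(0.01, 1.0, 100))
```

An approximate simultaneous band for Λ over [0, τ] would then be Λ̂n(t) ± c/√n, with v replaced by a plug-in estimate such as v̂(t) = ∫_(0,t] dΛ̂n/Yn.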

Proof of Lemma 12.3. For sequences tn → 0, αn → α, and βn → β, define An ≡ A + tn αn and Bn ≡ B + tn βn. Since we require that (An, Bn) ∈ D_M, we know that the total variation of Bn is bounded by M. Consider first the derivative of ψ, and note that

    [∫_(a,t] An dBn − ∫_(a,t] A dB]/tn − ψ′_{A,B}(αn, βn)    (12.5)
        = ∫_(a,t] α d(Bn − B) + ∫_(a,t] (αn − α) d(Bn − B).

Since it is easy to verify that the map (α, β) → ψ′_{A,B}(α, β) is continuous and linear, the desired Hadamard differentiability of ψ will follow provided


the right side of (12.5) goes to zero. To begin with, the second term on the right side goes to zero uniformly over t ∈ (a, b], since ‖αn − α‖∞ → 0 and both Bn and B have total variation bounded by M.

Now, for the first term on the right side of (12.5), fix ε > 0. Since α is cadlag, there exists a partition a = t0 < t1 < ··· < tm = b such that α varies less than ε over each interval [t_{i−1}, t_i), 1 ≤ i ≤ m, and m < ∞. Now define the function α̃ to be equal to α(t_{i−1}) over the interval [t_{i−1}, t_i), 1 ≤ i ≤ m, with α̃(b) = α(b). Thus

    ‖∫_(a,t] α d(Bn − B)‖∞
        ≤ ‖∫_(a,t] (α − α̃) d(Bn − B)‖∞ + ‖∫_(a,t] α̃ d(Bn − B)‖∞
        ≤ ‖α − α̃‖∞ 2M + Σ_{i=1}^m |α(t_{i−1})| × |(Bn − B)(t_i) − (Bn − B)(t_{i−1})|
            + |α(b)| × |(Bn − B)(b)|
        ≤ ε2M + (2m + 1)‖Bn − B‖∞ ‖α‖∞
        → ε2M,

as n → ∞. Since ε was arbitrary, we have that the first term on the right side of (12.5) goes to zero, as n → ∞, and the desired Hadamard differentiability of ψ follows.

Now the desired Hadamard differentiability of φ follows from the trivial but useful Lemma 12.4 below, by taking the extraction map f : D[a, b] → R defined by f(x) = x(b), noting that φ = f ∘ ψ, and then applying the chain rule for Hadamard derivatives (Lemma 6.19).□

Lemma 12.4 Let T be a set and fix T0 ⊂ T. Define the extraction map f : ℓ∞(T) → ℓ∞(T0) as f(x) = {x(t) : t ∈ T0}. Then f is Hadamard differentiable at all x ∈ ℓ∞(T), with derivative f′_x(h) = {h(t) : t ∈ T0}.

Proof. Let tn be any real sequence with tn → 0, and let hn ∈ ℓ∞(T) be any sequence converging to h ∈ ℓ∞(T). The desired conclusion follows after noting that tn⁻¹[f(x + tn hn) − f(x)] = {hn(t) : t ∈ T0} → {h(t) : t ∈ T0}, as n → ∞.□

12.2.3 Product Integration

For a function A ∈ D(0, b], let A^c(t) ≡ A(t) − Σ_{0<s≤t} ΔA(s), where ΔA is as defined in the previous section, be the continuous part of A. We define the product integral to be the map A → φ(A), where

    φ(A)(t) ≡ ∏_{0<s≤t} (1 + dA(s)) = exp(A^c(t)) ∏_{0<s≤t} (1 + ΔA(s)).


The first product is merely notation, but it is motivated by the mathematical definition of the product integral:

    φ(A)(t) = lim_{max_i |t_i − t_{i−1}| → 0} ∏_i {1 + [A(t_i) − A(t_{i−1})]},

where the limit is over partitions 0 = t0 < t1 < ··· < tm = t with maximum separation decreasing to zero. We will also use the notation

    φ(A)(s, t] = ∏_{s<u≤t} (1 + dA(u)) ≡ φ(A)(t)/φ(A)(s),

for all 0 ≤ s < t. The two terms on the left are defined by the far right term. Three alternative definitions of the product integral, as solutions of two different Volterra integral equations and as a "Peano series," are given in Exercise 12.3.2.

The following lemma verifies that product integration is Hadamard differentiable:

Lemma 12.5 For fixed constants 0 < b, M < ∞, the product integral map φ : BV_M[0, b] ⊂ D[0, b] → D[0, b] is Hadamard differentiable with derivative

    φ′_A(α)(t) = ∫_(0,t] φ(A)(0, u) φ(A)(u, t] dα(u).

When α ∈ D[0, b] has unbounded variation, the above quantity is well-defined via integration by parts.

We give the proof later on in this section, after first discussing an important statistical application. From the discussion of the Nelson-Aalen estimator Λ̂n in Section 12.2.2, it is not hard to verify that in the right-censored survival analysis setting S(t) = φ(−Λ)(t), where φ is the product integration map. Moreover, it is easily verified that the Kaplan-Meier estimator Ŝn discussed in Sections 2.2.5 and 4.3 satisfies Ŝn(t) = φ(−Λ̂n)(t).
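Since Λ̂n is a pure jump process, φ(−Λ̂n)(t) reduces to the finite product ∏_{s≤t}(1 − ΔΛ̂n(s)), which is exactly the Kaplan-Meier product-limit formula. A Python sketch of this identity (our own helper name, not the book's code):

```python
import numpy as np

def km_via_product_integral(times, delta):
    # Kaplan-Meier as the product integral S_n(t) = phi(-Lambda_n)(t)
    # = prod over failure times s <= t of (1 - dLambda_n(s)),
    # where dLambda_n(s) = dN_n(s)/Y_n(s)
    times = np.asarray(times, float)
    delta = np.asarray(delta, int)
    grid = np.unique(times[delta == 1])
    dLam = np.array([np.sum((times == t) & (delta == 1)) / np.sum(times >= t)
                     for t in grid])
    return grid, np.cumprod(1.0 - dLam)

# toy right-censored sample: the observation at time 2 with delta = 0
# is censored; the result matches the classical product-limit values
grid, S = km_via_product_integral([1, 2, 2, 3, 5], [1, 1, 0, 1, 1])
```

For this toy sample the survival estimates at the failure times 1, 2, 3, 5 are 0.8, 0.6, 0.3, 0, the usual Kaplan-Meier values.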

We can now use Lemma 12.5 to derive the asymptotic limiting distribution of √n(Ŝn − S). As in Section 12.2.2, we will restrict our time domain to [0, τ], where P(X > τ) > 0. Under these circumstances, there exists an M < ∞ such that Λ(τ) < M and Λ̂n(τ) < M with probability tending to 1 as n → ∞. Now Lemma 12.5, combined with (12.4) and the discussion immediately following, yields

    √n(Ŝn − S) ⇝ −∫_(0,(·)] φ(−Λ)(0, u) φ(−Λ)(u, (·)] dM(u)/Y0(u)
               = −S(·) ∫_(0,(·)] dM/((1 − ΔΛ)Y0),

where M is the Gaussian martingale with independent increments described in the previous section. Thus √n(Ŝn − S)/S is asymptotically time-transformed Brownian motion W(w(t)), where W is standard Brownian


motion on [0,∞) and where w(t) ≡ ∫_(0,t] [(1 − ΔΛ)Y0]⁻¹ dΛ. Along the lines discussed in the Nelson-Aalen example of Section 12.2.2, the form of the limiting distribution can be used to obtain asymptotically exact simultaneous confidence bands for S. The delta method bootstrap, Theorem 12.1, can also be used for inference on S.

Before giving the proof of Lemma 12.5, we present the following lemma, which we will need and which includes the important Duhamel equation for the difference between two product integrals:

Lemma 12.6 For A, B ∈ D(0, b], we have for all 0 ≤ s < t ≤ b the following, where M is the sum of the total variations of A and B:

(i) (the Duhamel equation)

    (φ(B) − φ(A))(s, t] = ∫_(s,t] φ(A)(s, u) φ(B)(u, t] (B − A)(du).

(ii) ‖φ(A) − φ(B)‖_(s,t] ≤ e^M (1 + M)² ‖A − B‖_(s,t].

Proof. For any u ∈ (s, t], consider the function Cu ∈ D(s, t] defined as

    Cu(x) = A(x) − A(s),                      for s ≤ x < u,
          = A(u−) − A(s),                     for x = u,
          = A(u−) − A(s) + B(x) − B(u),       for u < x ≤ t.

Using the Peano series expansion of Exercise 12.3.2, Part (b), we obtain:

    φ(A)(s, u) φ(B)(u, t] = φ(Cu)(s, t]
        = 1 + Σ_{m,n≥0: m+n≥1} ∫_{s<t1<···<tm<u<t_{m+1}<···<t_{m+n}≤t}
              A(dt1) ··· A(dtm) × B(dt_{m+1}) ··· B(dt_{m+n}).

Thus

    ∫_(s,t] φ(A)(s, u) φ(B)(u, t] (B − A)(du)
        = Σ_{n≥1} ∫_{s<x1<···<xn≤t} [1 + Σ_{m≥1} ∫_{s<t1<···<tm<x1} A(dt1) ··· A(dtm)]
              × B(dx1) ··· B(dxn)
        − Σ_{n≥1} ∫_{s<t1<···<tn≤t} [1 + Σ_{m≥1} ∫_{tn<x1<···<xm≤t} B(dx1) ··· B(dxm)]
              × A(dt1) ··· A(dtn)
        = Σ_{n≥1} ∫_{s<x1<···<xn≤t} B(dx1) ··· B(dxn)
        − Σ_{n≥1} ∫_{s<t1<···<tn≤t} A(dt1) ··· A(dtn)
        = φ(B)(s, t] − φ(A)(s, t].

This proves Part (i).

For Part (ii), we need to derive an integration by parts formula for the Duhamel equation. Define G = B − A and H(u) ≡ ∫_(0,u] φ(B)(v, t] G(dv). Now integration by parts gives us

    ∫_(s,t] φ(A)(0, u) φ(B)(u, t] G(du)    (12.6)
        = φ(A)(t)H(t) − φ(A)(s)H(s) − ∫_(s,t] H(u) φ(A)(du).

From the backwards integral equation (Part (c) of Exercise 12.3.2), we know that φ(B)(dv, t] = −φ(B)(v, t] B(dv), and thus, by integration by parts, we obtain

    H(u) = G(u) φ(B)(u, t] + ∫_(0,u] G(v−) φ(B)(v, t] B(dv).

Combining this with (12.6) and the fact that φ(A)(du) = φ(A)(u−)A(du), we get

    ∫_(s,t] φ(A)(0, u) φ(B)(u, t] G(du)    (12.7)
        = φ(A)(t)G(t) + φ(A)(t) ∫_(0,t] G(u−) φ(B)(u, t] B(du)
        − φ(A)(s) φ(B)(s, t] G(s)
        − φ(A)(s) ∫_(0,s] G(u−) φ(B)(u, t] B(du)
        − ∫_(s,t] G(u) φ(B)(u, t] φ(A)(u−) A(du)
        − ∫_(s,t] [∫_(0,u] G(v−) φ(B)(v, t] B(dv)] φ(A)(u−) A(du).

From Exercise 12.3.3, we know that φ(A) and φ(B) are bounded by the exponentiation of the respective total variations of A and B. Now the desired result follows.□

Proof of Lemma 12.5. Set An = A + tn αn for a sequence αn → α with the total variation of both A and An bounded by M. In view of the Duhamel equation (Part (i) of Lemma 12.6 above), it suffices to show that


    ∫_(0,t] φ(A)(0, u) φ(An)(u, t] dαn(u) → ∫_(0,t] φ(A)(0, u) φ(A)(u, t] dα(u),

uniformly in 0 ≤ t ≤ b. Fix ε > 0. Since α ∈ D[0, b], there exists a function α̃ with total variation V < ∞ such that ‖α − α̃‖∞ ≤ ε.

Now recall that the derivation of the integration by parts formula (12.7) from the proof of Lemma 12.6 did not depend on the definition of G, other than the necessity of G being right-continuous. If we replace G with αn − α̃, we obtain from (12.7) that

    ‖∫_(0,(·)] φ(A)(0, u) φ(An)(u, t] d(αn − α̃)(u)‖∞
        ≤ e^{2M}(1 + 2M)² ‖αn − α̃‖∞ → e^{2M}(1 + 2M)² ε,

as n → ∞, since ‖αn − α‖∞ → 0. Moreover,

    ‖∫_(0,(·)] φ(A)(0, u) [φ(An) − φ(A)](u, t] dα̃(u)‖∞
        ≤ ‖φ(An) − φ(A)‖∞ ‖φ(A)‖∞ V → 0,

as n → ∞. Now using again the integration by parts formula (12.7), but with G = α̃ − α, we obtain

    ‖∫_(0,(·)] φ(A)(0, u) φ(A)(u, t] d(α̃ − α)(u)‖∞ ≤ e^{2M}(1 + 2M)² ε.

Thus the desired result follows since ε was arbitrary.□

12.2.4 Inversion

Recall the derivation given in the paragraphs following Theorem 2.8 of the Hadamard derivative of the inverse of a distribution function F. Note that this derivation did not depend on F being a distribution function per se. In fact, the derivation will carry through unchanged if we replace the distribution function F with any nondecreasing, cadlag function A satisfying mild regularity conditions. For a nondecreasing function B ∈ D(−∞,∞), define the left-continuous inverse r → B⁻¹(r) ≡ inf{x : B(x) ≥ r}. We will hereafter use the notation D[u, v] to denote all left-continuous functions with right-hand limits (caglad) on [u, v] and D1[u, v] to denote the restriction of all nondecreasing functions in D(−∞,∞) to the interval [u, v]. Here is a precise statement of the general Hadamard differentiation result for nondecreasing functions:
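On a finite grid, the left-continuous inverse B⁻¹(r) = inf{x : B(x) ≥ r} is a one-line search. A hedged Python sketch (a grid-based approximation; the helper name is ours):

```python
import numpy as np

def left_cont_inverse(B, r, xs):
    # B^{-1}(r) = inf{x : B(x) >= r}, approximated over a grid xs of
    # candidate x values; B must be nondecreasing on xs
    vals = np.asarray([B(x) for x in xs], dtype=float)
    idx = np.searchsorted(vals, r, side='left')  # first idx with vals[idx] >= r
    return xs[min(idx, len(xs) - 1)]

xs = np.linspace(0.0, 1.0, 101)
q = left_cont_inverse(lambda x: x ** 2, 0.25, xs)  # inf{x : x^2 >= 0.25} = 0.5
```

For the empirical distribution function, this definition reduces to the familiar order-statistic form of the empirical quantile function.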


Lemma 12.7 Let −∞ < p ≤ q < ∞, and let the nondecreasing function A ∈ D(−∞,∞) be continuously differentiable on the interval [u, v] ≡ [A⁻¹(p) − ε, A⁻¹(q) + ε], for some ε > 0, with derivative A′ being strictly positive and bounded over [u, v]. Then the inverse map B → B⁻¹, as a map D1[u, v] ⊂ D[u, v] → D[p, q], is Hadamard differentiable at A tangentially to C[u, v], with derivative α → −(α/A′) ∘ A⁻¹.

We will give the proof of Lemma 12.7 at the end of this section. We now restrict ourselves to the setting where A is a distribution function, which we will now denote by F. The following lemma provides two results similar to Lemma 12.7 but which utilize knowledge about the support of the distribution function F. Let D2[u, v] be the subset of distribution functions in D1[u, v] with support only on [u,∞), and let D3[u, v] be the subset of distribution functions in D2[u, v] which have support only on [u, v].

Lemma 12.8 Let F be a distribution function. We have the following:

(i) Let F ∈ D2[u,∞), for finite u, and let q ∈ (0, 1). Assume F is continuously differentiable on the interval [u, v] = [u, F⁻¹(q) + ε], for some ε > 0, with derivative f being strictly positive and bounded over [u, v]. Then the inverse map G → G⁻¹, as a map D2[u, v] ⊂ D[u, v] → D(0, q], is Hadamard differentiable at F tangentially to C[u, v].

(ii) Let F ∈ D3[u, v], for [u, v] compact, and assume that F is continuously differentiable on [u, v] with derivative f strictly positive and bounded over [u, v]. Then the inverse map G → G⁻¹, as a map D3[u, v] ⊂ D[u, v] → D(0, 1), is Hadamard differentiable at F tangentially to C[u, v].

In either case, the derivative is the map α → −(α/f) ∘ F⁻¹.

Before giving the proofs of the above two lemmas, we will discuss some important statistical applications. As discussed in Section 2.2.4, an important application of these results is to estimation and inference for the quantile function p → F⁻¹(p) based on the usual empirical distribution function for i.i.d. data. Lemma 12.8 is useful when some information is available on the support of F, since it allows the range of p to extend as far as possible. These results are applicable to other estimators of the distribution function F besides the usual empirical distribution, provided the standardized estimators converge to a tight limiting process over the necessary intervals. Several examples of such estimators include the Kaplan-Meier estimator, the self-consistent estimator of Chang (1990) for doubly-censored data, and certain estimators from dependent data as mentioned in Kosorok (1999).

We now apply Lemma 12.8 to the construction of quantile processes based on the Kaplan-Meier estimator discussed in Section 12.2.3 above. Since it is known that the support of a survival function is on [0,∞), we can utilize Part (i) of this lemma. Define the Kaplan-Meier quantile


process ξ̂(p) ≡ F̂n⁻¹(p), 0 < p ≤ q, where F̂n = 1 − Ŝn, Ŝn is the Kaplan-Meier estimator, and where 0 < q < F(τ) for τ as defined in the previous section. Assume that F is continuously differentiable on [0, τ] with density f strictly positive and bounded. Combining the results of the previous section with Part (i) of Lemma 12.8 and Theorem 2.8, we obtain

    √n(ξ̂ − ξ)(·) ⇝ −[S(ξ(·))/f(ξ(·))] ∫_(0,ξ(·)] dM/((1 − ΔΛ)Y0),

in D(0, q], where ξ(p) ≡ ξp and M is the Gaussian martingale described in the previous section. Thus √n(ξ̂ − ξ)f(ξ)/S(ξ) is asymptotically time-transformed Brownian motion with time transform w(ξ), where w is as defined in the previous section, over the interval (0, q]. As described in Kosorok (1999), one can construct kernel estimators for f (which can be shown to be uniformly consistent) to facilitate inference. An alternative approach is the bootstrap, which can be shown to be valid in this setting based on Theorem 12.1.

Proof of Lemma 12.7. The arguments are essentially identical to those used in the paragraphs following Theorem 2.8, except that the distribution function F is replaced by a more general, nondecreasing function A.□

Proof of Lemma 12.8. To prove Part (i), let αn → α uniformly in D[u, v] and tn → 0, where α is continuous and F + tn αn is contained in D2[u, v] for all n ≥ 1. Abbreviate F⁻¹(p) and (F + tn αn)⁻¹(p) to ξp and ξpn, respectively. Since F and F + tn αn have all of their mass on (u,∞) (the lower bound by assumption), we have that ξp, ξpn > u for all 0 < p ≤ q. Moreover, ξp, ξpn ≤ v for all n large enough. Thus the numbers εpn ≡ tn² ∧ (ξpn − u) are positive for all 0 < p ≤ q, for all n large enough. Hence, by definition, we have for all p ∈ (0, q] that

    (F + tn αn)(ξpn − εpn) ≤ p ≤ (F + tn αn)(ξpn),    (12.8)

for all sufficiently large n.

By the smoothness of F, we have F(ξp) = p and F(ξpn − εpn) = F(ξpn) + O(εpn), uniformly over p ∈ (0, q]. Thus from (12.8) we obtain

    −tn α(ξpn) + o(tn) ≤ F(ξpn) − F(ξp) ≤ −tn α(ξpn − εpn) + o(tn),    (12.9)

where the o(tn) terms are uniform over 0 < p ≤ q. Both the far left and far right sides are O(tn), while the middle term is bounded above and below by constants times |ξpn − ξp|, for all 0 < p ≤ q. Hence |ξpn − ξp| = O(tn), uniformly over 0 < p ≤ q. The Part (i) result now follows from (12.9), since F(ξpn) − F(ξp) = f(ξp)(ξpn − ξp) + En, where En = o(sup_{0<p≤q} |ξpn − ξp|) by the uniform differentiability of F over (u, v].

Note that Part (ii) of this lemma is precisely Part (ii) of Lemma 3.9.23 of VW, and the details of the proof (which are quite similar to the proof of Part (i)) can be found therein.□


12.2.5 Other Mappings

We now mention briefly a few additional interesting examples. The first example is the copula map. For a bivariate distribution function H, with marginals F_H(x) ≡ H(x,∞) and G_H(y) ≡ H(∞, y), the copula map is the map φ from bivariate distributions on R² to bivariate distributions on [0, 1]² defined as follows:

    H → φ(H)(u, v) = H(F_H⁻¹(u), G_H⁻¹(v)), (u, v) ∈ [0, 1]²,

where the inverse functions are the left-continuous quantile functions defined in the previous section. Section 3.9.4.4 of VW verifies that this map is Hadamard differentiable in a manner which permits developing inferential procedures for the copula function based on i.i.d. bivariate data.

The second example is multivariate trimming. Let P be a given probability distribution on R^d, fix α ∈ (0, 1/2], and define H to be the collection of all closed half-spaces in R^d. The set K_P ≡ ∩{H ∈ H : P(H) ≥ 1 − α} can be easily shown to be compact and convex (see Exercise 12.3.4). The α-trimmed mean is the quantity

    T(P) ≡ (1/λ(K_P)) ∫_{K_P} x dλ(x),

where λ is the Lebesgue measure on R^d. Using non-trivial arguments, Section 3.9.4.6 of VW shows how P → T(P) can be formulated as a Hadamard differentiable functional of P and how this formulation can be applied to develop inference for T(P) based on i.i.d. data from P.

There are many other important examples in statistics, some of which we will explore later on in this book, including a delta method formulation of Z-estimator theory, which we will describe in the next chapter (Chapter 13), and several other examples in the case studies of Chapter 15.

12.3 Exercises

12.3.1. In the Wilcoxon statistic example of Section 12.2.2, verify explicitly that every hypothesis of Theorem 2.8 is satisfied.

12.3.2. Show that the product integral of A, φ(A)(s, t], is equivalent to the following:

(a) The unique solution B of the following Volterra integral equation:

    B(s, t] = 1 + ∫_(s,t] B(s, u) A(du).


(b) The following Peano series representation:

    φ(A)(s, t] = 1 + Σ_{m=1}^∞ ∫_{s<t1<···<tm≤t} A(dt1) ··· A(dtm),

where the signed-measure interpretation of A is being used. Hint: Use the uniqueness from Part (a).

(c) The unique solution B of the "backward" Volterra integral equation:

    B(s, t] = 1 + ∫_(s,t] B(u, t] A(du).

Hint: Start at t and go backwards in time to s.

12.3.3. Let φ be the product integral map of Section 12.2.3. Show that if the total variation of A over the interval (s, t] is M, then |φ(A)(s, t]| ≤ e^M. Hint: Recall that log(1 + x) ≤ x for all x > 0.

12.3.4. Show that the set K_P ⊂ R^d defined in Section 12.2.5 is compact and convex.

12.4 Notes

Much of the material of this chapter is inspired by Chapter 3.9 of VW, although there is some new material and the method of presentation is different. Section 12.1 contains results from Sections 3.9.1 and 3.9.3 of VW, although our results are specialized to Banach spaces (rather than the more general topological vector spaces). The examples of Sections 12.2.1 through 12.2.4 are modified versions of the examples of Sections 3.9.4.3, 3.9.4.1, 3.9.4.5 and 3.9.4.2, respectively, of VW. The order has been changed to emphasize a natural progression leading up to quantile inference based on the Kaplan-Meier estimator. Lemma 12.2 is a generalization of Lemma 3.9.25 of VW, while Lemmas 12.3 and 12.5 correspond to Lemmas 3.9.17 and 3.9.30 of VW. Lemma 12.7 is a generalization of Part (i) of Lemma 3.9.23 of VW, while Part (ii) of Lemma 12.8 corresponds to Part (ii) of Lemma 3.9.23 of VW. Exercise 12.3.2 is based on Exercises 3.9.5 and 3.9.6 of VW.


13 Z-Estimators

Recall from Section 2.2.5 that Z-estimators are approximate zeros of data-dependent functions. These data-dependent functions, denoted Ψn, are maps between a possibly infinite dimensional normed parameter space Θ and a normed space L, where the respective norms are ‖·‖ and ‖·‖_L. The Ψn are frequently called estimating equations. A quantity θ̂n ∈ Θ is a Z-estimator if ‖Ψn(θ̂n)‖_L →P 0. In this chapter, we extend and prove the results of Section 2.2.5. As part of this, we extend the Z-estimator master theorem, Theorem 10.16, to the infinite dimensional case, although we divide the result into two parts, one for consistency and one for weak convergence.

We first discuss consistency and present a Z-estimator master theorem for consistency. We then discuss weak convergence and examine closely the special case of Z-estimators which are empirical measures of Donsker classes. We then use this structure to develop a Z-estimator master theorem for weak convergence. Both master theorems, the one for consistency and the one for weak convergence, will include results for the bootstrap. Finally, we demonstrate how Z-estimators can be viewed as Hadamard differentiable functionals of the involved estimating equations and how this structure enables use of a modified delta method to obtain very general results for Z-estimators. Recall from Section 2.2.5 that the Kaplan-Meier estimator is an important and instructive example of a Z-estimator. A more sophisticated example, which will be presented later in the case studies of Chapter 15, is the nonparametric maximum likelihood estimator for the proportional odds survival model.


13.1 Consistency

The main consistency result we have already presented as Theorem 2.10 of Section 2.2.5, and the proof of this theorem was given as an exercise (Exercise 2.4.2). We will now extend this result to the bootstrapped Z-estimator. First, we restate the identifiability condition of Theorem 2.10: the map Ψ : Θ → L is identifiable at θ0 ∈ Θ if

    ‖Ψ(θn)‖_L → 0 implies ‖θn − θ0‖ → 0 for any sequence {θn} ⊂ Θ.    (13.1)

Note that there are alternative identifiability conditions that will also work, including the stronger condition that both Ψ(θ0) = 0 and Ψ : Θ → L be one-to-one. Nevertheless, Condition (13.1) seems to be the most efficient for our purposes.

In what follows, we will use the bootstrap-weighted empirical process P̊n to denote either the nonparametric bootstrapped empirical process (with multinomial weights) or the multiplier bootstrapped empirical process defined by

    f → P̊n f = n⁻¹ Σ_{i=1}^n (ξi/ξ̄) f(Xi),

where ξ1, ..., ξn are i.i.d. positive weights with 0 < μ = Eξ1 < ∞ and ξ̄ = n⁻¹ Σ_{i=1}^n ξi. Note that this is a special case of the weighted bootstrap introduced in Theorem 10.13 but with the addition of ξ̄ in the denominator. We leave it as an exercise (Exercise 13.4.1) to verify that the conclusions of Theorem 10.13 are not affected by this addition. Let Xn ≡ {X1, ..., Xn} as given in Theorem 10.13. The following is the main result of this section:
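The multiplier weights ξi/ξ̄ are easy to generate; for example, standard exponential weights (the "Bayesian bootstrap") satisfy the stated moment condition. The Python sketch below is ours (the function name and the choice ξ ~ Exp(1) are illustrative assumptions, not prescribed by the text):

```python
import numpy as np

rng = np.random.default_rng(2)

def multiplier_bootstrap_means(f_x, reps=2000):
    # P_n^o f = n^{-1} sum_i (xi_i / xibar) f(X_i), with i.i.d. positive
    # weights xi_i; here xi ~ Exp(1), one admissible choice with
    # 0 < mu = E xi < infinity.  One bootstrap replicate per row.
    f_x = np.asarray(f_x, float)
    xi = rng.exponential(size=(reps, len(f_x)))
    w = xi / xi.mean(axis=1, keepdims=True)   # xi_i / xibar
    return (w * f_x).mean(axis=1)

f_x = rng.normal(size=500)
boot = multiplier_bootstrap_means(f_x)
# conditional on the data, boot fluctuates around the sample mean P_n f,
# with spread approximating the sampling variability of P_n f
```

Dividing by ξ̄ centers the weights so that each replicate is a proper (random) reweighting of the sample, as in the definition above.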

Theorem 13.1 (Master Z-estimator theorem for consistency) Let θ → Ψ(θ) = Pψθ, θ → Ψn(θ) = Pnψθ and θ → Ψ̊n(θ) = P̊nψθ, where Ψ satisfies (13.1) and the class {ψθ : θ ∈ Θ} is P-Glivenko-Cantelli. Then, provided ‖Ψn(θ̂n)‖_L = oP(1) and

    P(‖Ψ̊n(θ̊n)‖_L > η | Xn) = oP(1) for every η > 0,    (13.2)

we have both ‖θ̂n − θ0‖ = oP(1) and P(‖θ̊n − θ0‖ > η | Xn) = oP(1) for every η > 0.

Note in (13.2) the absence of an outer probability on the left side. This is because, as argued in Section 2.2.3, a Lipschitz continuous map of either of these bootstrapped empirical processes is measurable with respect to the random weights conditional on the data.

Proof of Theorem 13.1. The result that ‖θ̂n − θ0‖ = oP(1) is a conclusion of Theorem 2.10. For the conditional bootstrap result, (13.2) implies that for some sequence ηn ↓ 0, P(‖Ψ(θ̊n)‖_L > ηn | Xn) = oP(1), since

    P( sup_{θ∈Θ} ‖Ψ̊n(θ) − Ψ(θ)‖_L > η | Xn ) = oP(1)


for all η > 0, by Theorems 10.13 and 10.15. Thus, for any ε > 0,

    P(‖θ̊n − θ0‖ > ε | Xn)
        ≤ P(‖θ̊n − θ0‖ > ε, ‖Ψ(θ̊n)‖_L ≤ ηn | Xn) + P(‖Ψ(θ̊n)‖_L > ηn | Xn)
        →P 0,

since the identifiability Condition (13.1) implies that for all δ > 0 there exists an η > 0 such that ‖Ψ(θ)‖_L < η implies ‖θ − θ0‖ < δ. Hence it is impossible for there to exist any θ ∈ Θ such that both ‖θ − θ0‖ > ε and ‖Ψ(θ)‖_L < ηn for all n ≥ 1. The conclusion now follows since ε was arbitrary.□

Note that we might have worked toward obtaining outer almost sure results, since we are making a strong Glivenko-Cantelli assumption for the class of functions involved. However, we only need convergence in probability for statistical applications. Notice also that we only assumed ‖Ψ̊n(θ̊n)‖_L goes to zero conditionally rather than unconditionally as done in Theorem 10.16. However, it seems to be easier to check the conditional version in practice. Moreover, the unconditional version is actually stronger than the conditional version, since

    E* P(‖Ψ̊n(θ̊n)‖_L > η | Xn) ≤ P*(‖Ψ̊n(θ̊n)‖_L > η)

by the version of Fubini's theorem given as Theorem 6.14. It is unclear how to extend this argument to the outer almost sure setting. This is another reason for restricting our attention to the convergence in probability results. Nevertheless, we still need the strong Glivenko-Cantelli assumption, since this enables the use of Theorems 10.13 and 10.15.

While the above approach will be helpful for some Z-estimators, many Z-estimators are complex enough to require individually tailored approaches to establishing consistency. In Section 13.2.3 below, we will revisit the Kaplan-Meier estimator example of Section 2.2.5, to which we can apply a generalization of Theorem 13.1, Theorem 13.4, which includes weak convergence. In contrast, the proportional odds model for right-censored survival data, which will be presented in Chapter 15, requires a more individualized approach to establishing consistency.

13.2 Weak Convergence

In this section, we first provide general results for Z-estimators which may not be based on i.i.d. data. We then present sufficient conditions for the i.i.d. case when the estimating equation is an empirical measure ranging over a Donsker class. Finally, we give a master theorem for Z-estimators based on i.i.d. data which includes bootstrap validity.


13.2.1 The General Setting

We now prove Theorem 2.11 on Page 26 and give a method of weakening the differentiability requirement for Ψ. An important thing to note is that no assumptions about the data being i.i.d. are required. The proof follows closely the proof of Theorem 3.3.1 given in VW.

Proof of Theorem 2.11 (Page 26). By the definitions of θ̂n and θ0,

    √n(Ψ(θ̂n) − Ψ(θ0)) = −√n(Ψn(θ̂n) − Ψ(θ̂n)) + oP(1)    (13.3)
                        = −√n(Ψn − Ψ)(θ0) + oP(1 + √n‖θ̂n − θ0‖),

by Assumption (2.12). Note that the error terms throughout this theorem are with respect to the norms of the spaces involved, e.g., Θ or L. Since Ψ̇θ0 is continuously invertible, we have by Part (i) of Lemma 6.16 that there exists a constant c > 0 such that ‖Ψ̇θ0(θ − θ0)‖ ≥ c‖θ − θ0‖ for all θ and θ0 in lin Θ. Combining this with the differentiability of Ψ yields ‖Ψ(θ) − Ψ(θ0)‖ ≥ c‖θ − θ0‖ + o(‖θ − θ0‖). Combining this with (13.3), we obtain

    √n‖θ̂n − θ0‖(c + oP(1)) ≤ OP(1) + oP(1 + √n‖θ̂n − θ0‖).

We now have that θ̂n is √n-consistent for θ0 with respect to ‖·‖. By the differentiability of Ψ, the left side of (13.3) can be replaced by √n Ψ̇θ0(θ̂n − θ0) + oP(1 + √n‖θ̂n − θ0‖). This last error term is now oP(1), as also is the error term on the right side of (13.3). Now the result (2.13) follows. Next, the continuity of Ψ̇θ0⁻¹ and the continuous mapping theorem yield √n(θ̂n − θ0) ⇝ −Ψ̇θ0⁻¹(Z), as desired.□

The following lemma allows us to weaken the Frechet differentiability requirement to Hadamard differentiability when it is also known that √n(θ̂n − θ0) is asymptotically tight:

Lemma 13.2 Assume the conditions of Theorem 2.11, except that the consistency of θ̂n is strengthened to asymptotic tightness of √n(θ̂n − θ0) and the Frechet differentiability of Ψ is weakened to Hadamard differentiability at θ0. Then the results of Theorem 2.11 still hold.

Proof. The asymptotic tightness of $\sqrt{n}(\hat\theta_n-\theta_0)$ enables Expression (13.3) to imply $\sqrt{n}\bigl(\Psi(\hat\theta_n)-\Psi(\theta_0)\bigr)=-\sqrt{n}(\Psi_n-\Psi)(\theta_0)+o_P(1)$. The Hadamard differentiability of $\Psi$ yields $\sqrt{n}\bigl(\Psi(\hat\theta_n)-\Psi(\theta_0)\bigr)=\sqrt{n}\,\dot\Psi_{\theta_0}(\hat\theta_n-\theta_0)+o_P(1)$. Combining, we now have $\sqrt{n}\,\dot\Psi_{\theta_0}(\hat\theta_n-\theta_0)=-\sqrt{n}(\Psi_n-\Psi)(\theta_0)+o_P(1)$, and all of the results of the theorem follow.

13.2.2 Using Donsker Classes

We now consider the special case where the data involved are i.i.d., i.e., $\Psi_n(\theta)(h)=\mathbb{P}_n\psi_{\theta,h}$ and $\Psi(\theta)(h)=P\psi_{\theta,h}$, for measurable functions $\psi_{\theta,h}$,


13.2 Weak Convergence 255

where $h$ ranges over an index set $\mathcal{H}$. The following lemma gives us reasonably verifiable sufficient conditions for (2.12) to hold:

Lemma 13.3 Suppose the class of functions
\[
\{\psi_{\theta,h}-\psi_{\theta_0,h}:\|\theta-\theta_0\|<\delta,\ h\in\mathcal{H}\}\qquad(13.4)
\]
is $P$-Donsker for some $\delta>0$, and
\[
\sup_{h\in\mathcal{H}}P(\psi_{\theta,h}-\psi_{\theta_0,h})^2\to 0,\quad\text{as }\theta\to\theta_0.\qquad(13.5)
\]
Then, if $\hat\theta_n\overset{P}{\to}\theta_0$, $\sup_{h\in\mathcal{H}}\bigl|\mathbb{G}_n\psi_{\hat\theta_n,h}-\mathbb{G}_n\psi_{\theta_0,h}\bigr|=o_P(1)$.

Before giving the proof of this lemma, we make the somewhat trivial observation that the conclusion of this lemma implies (2.12).

Proof of Lemma 13.3. Let $\Theta_\delta\equiv\{\theta:\|\theta-\theta_0\|<\delta\}$ and define the extraction function $f:\ell^\infty(\Theta_\delta\times\mathcal{H})\times\Theta_\delta\to\ell^\infty(\mathcal{H})$ as $f(z,\theta)(h)\equiv z(\theta,h)$, where $z\in\ell^\infty(\Theta_\delta\times\mathcal{H})$. Note that $f$ is continuous at every point $(z,\theta_1)$ such that $\sup_{h\in\mathcal{H}}|z(\theta,h)-z(\theta_1,h)|\to 0$ as $\theta\to\theta_1$. Define the stochastic process $Z_n(\theta,h)\equiv\mathbb{G}_n(\psi_{\theta,h}-\psi_{\theta_0,h})$, indexed by $\Theta_\delta\times\mathcal{H}$. As assumed, the process $Z_n$ converges weakly in $\ell^\infty(\Theta_\delta\times\mathcal{H})$ to a tight Gaussian process $Z_0$ with continuous sample paths with respect to the metric $\rho$ defined by $\rho^2((\theta_1,h_1),(\theta_2,h_2))=P(\psi_{\theta_1,h_1}-\psi_{\theta_0,h_1}-\psi_{\theta_2,h_2}+\psi_{\theta_0,h_2})^2$. Since $\sup_{h\in\mathcal{H}}\rho((\theta,h),(\theta_0,h))\to 0$ as $\theta\to\theta_0$ by assumption, we have that $f$ is continuous at almost all sample paths of $Z_0$. By Slutsky's theorem (Theorem 7.15), $(Z_n,\hat\theta_n)\rightsquigarrow(Z_0,\theta_0)$. The continuous mapping theorem (Theorem 7.7) now implies that $Z_n(\hat\theta_n)=f(Z_n,\hat\theta_n)\rightsquigarrow f(Z_0,\theta_0)=0$.

If, in addition to the assumptions of Lemma 13.3, we are willing to assume that
\[
\{\psi_{\theta_0,h}:h\in\mathcal{H}\}\qquad(13.6)
\]
is $P$-Donsker, then $\sqrt{n}(\Psi_n-\Psi)(\theta_0)\rightsquigarrow Z$, and all of the weak convergence assumptions of Theorem 2.11 are satisfied. Alternatively, we could just assume that
\[
\mathcal{F}_\delta\equiv\{\psi_{\theta,h}:\|\theta-\theta_0\|<\delta,\ h\in\mathcal{H}\}\qquad(13.7)
\]
is $P$-Donsker for some $\delta>0$; then both (13.4) and (13.6) are $P$-Donsker for some $\delta>0$. We are now well poised for a Z-estimator master theorem for weak convergence.

13.2.3 A Master Theorem and the Bootstrap

In this section, we augment the results of the previous section to achieve a general Z-estimator master theorem that includes both weak convergence and validity of the bootstrap. Here we consider the two bootstrapped Z-estimators described in Section 13.1, except that for the multiplier bootstrap we make the additional requirements that $0<\tau^2=\operatorname{var}(\xi_1)<\infty$ and $\|\xi_1\|_{2,1}<\infty$. We use $\overset{\mathsf{P}}{\rightsquigarrow}$ to denote either $\overset{\mathsf{P}}{\underset{\xi}{\rightsquigarrow}}$ or $\overset{\mathsf{P}}{\underset{W}{\rightsquigarrow}}$, depending on which bootstrap is being used, and we let the constant $k_0=\tau/\mu$ for the multiplier bootstrap and $k_0=1$ for the multinomial bootstrap. Here is the main result:

Theorem 13.4 Assume $\Psi(\theta_0)=0$ and the following hold:

(A) $\theta\mapsto\Psi(\theta)$ satisfies (13.1);

(B) The class $\{\psi_{\theta,h}:\theta\in\Theta,\,h\in\mathcal{H}\}$ is $P$-Glivenko-Cantelli;

(C) The class $\mathcal{F}_\delta$ in (13.7) is $P$-Donsker for some $\delta>0$;

(D) Condition (13.5) holds;

(E) $\|\Psi_n(\hat\theta_n)\|_{\mathbb{L}}=o_P(n^{-1/2})$ and $\mathrm{P}\bigl(\sqrt{n}\,\|\hat\Psi_n^{\circ}(\hat\theta_n^{\circ})\|_{\mathbb{L}}>\eta\,\big|\,X_n\bigr)=o_P(1)$ for every $\eta>0$;

(F) $\theta\mapsto\Psi(\theta)$ is Fréchet-differentiable at $\theta_0$ with continuously invertible derivative $\dot\Psi_{\theta_0}$.

Then $\sqrt{n}(\hat\theta_n-\theta_0)\rightsquigarrow-\dot\Psi_{\theta_0}^{-1}Z$, where $Z\in\ell^\infty(\mathcal{H})$ is the tight, mean zero Gaussian limiting distribution of $\sqrt{n}(\Psi_n-\Psi)(\theta_0)$, and $\sqrt{n}(\hat\theta_n^{\circ}-\hat\theta_n)\overset{\mathsf{P}}{\rightsquigarrow}k_0Z$.

Condition (A) is identifiability. Conditions (B) and (C) are consistency and asymptotic normality conditions for the estimating equation. Condition (D) is an asymptotic equicontinuity condition for the estimating equation at $\theta_0$. Condition (E) simply states that the estimators are approximate zeros of the estimating equation, while Condition (F) specifies the smoothness and invertibility requirements of the derivative of $\Psi$. Except for the last half of Condition (E), all of the conditions are requirements for asymptotic normality of $\sqrt{n}(\hat\theta_n-\theta_0)$. What is perhaps surprising is how few additional assumptions are needed to obtain bootstrap validity. Only an assurance that the bootstrapped estimator is an approximate zero of the bootstrapped estimating equation is required. Thus bootstrap validity is almost an automatic consequence of asymptotic normality.

Before giving the proof of the theorem, we will present an example. Recall the right-censored Kaplan-Meier estimator example of Section 2.2.5, which was shown to be a Z-estimator with a certain estimating equation $\Psi_n(\theta)=\mathbb{P}_n\psi_\theta(t)$, where
\[
\psi_\theta(t)=1\{U>t\}+(1-\delta)1\{U\le t\}1\{\theta(U)>0\}\,\frac{\theta(t)}{\theta(U)}-\theta(t),
\]


where the observed data $(U,\delta)$ are the right-censored survival time and censoring indicator, respectively, and where we have replaced the survival function $S$ with $\theta$ in the notation of Section 2.2.5 to obtain greater consistency with the notation of the current chapter. The limiting estimating function $\Psi(\theta)=P\psi_\theta$ is given in (2.11). It was shown in the paragraphs surrounding (2.11) both that $\Psi(\theta_0)=0$ and that $\Psi(\theta_n)\to 0$ implies $\|\theta_n-\theta_0\|_\infty\to 0$, and thus the opening line and Condition (A) of the theorem are satisfied. Exercise 2.4.3 verifies that $\Psi(\theta)$ is Fréchet differentiable with derivative $\dot\Psi_{\theta_0}$ defined in (2.16). Exercise 2.4.4 verifies that $\dot\Psi_{\theta_0}$ is continuously invertible with inverse $\dot\Psi_{\theta_0}^{-1}$ given explicitly in (2.17). Thus Condition (F) of the theorem is also established.

In the paragraphs in Section 2.2.5 after the presentation of Theorem 2.1, the class $\{\psi_\theta(t):\theta\in\Theta,\,t\in[0,\tau]\}$, where $\Theta$ is the class of all survival functions $t\mapsto\theta(t)$ with $\theta(0)=1$ and with $t$ restricted to $[0,\tau]$, was shown to be Donsker. Note that in this setting, $\mathcal{H}=[0,\tau]$. Thus Conditions (B) and (C) of the theorem hold. It is also quite easy to verify directly that
\[
\sup_{t\in[0,\tau]}P\bigl[\psi_\theta(t)-\psi_{\theta_0}(t)\bigr]^2\to 0,
\]
as $\|\theta-\theta_0\|_\infty\to 0$, and thus Condition (D) of the theorem is satisfied. If $\hat\theta_n$ is the Kaplan-Meier estimator, then $\|\Psi_n(\hat\theta_n)(t)\|_\infty=0$ almost surely. If the bootstrapped version is
\[
\hat\theta_n^{\circ}(t)\equiv\prod_{j:T_j\le t}\left(1-\frac{n\mathbb{P}_n^{\circ}\bigl[\delta 1\{U=T_j\}\bigr]}{n\mathbb{P}_n^{\circ}1\{U\ge T_j\}}\right),
\]
where $T_1,\ldots,T_{m_n}$ are the observed failure times in the sample, then also $\|\hat\Psi_n^{\circ}(\hat\theta_n^{\circ})\|_\infty=0$ almost surely. Thus Condition (E) of the theorem is satisfied, and hence all of the conditions of the theorem are satisfied. Thus we obtain consistency, weak convergence, and bootstrap consistency for the Kaplan-Meier estimator all at once.
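The product-limit formula above is straightforward to compute. The following is an illustrative sketch (not code from the text) on synthetic right-censored data: `kaplan_meier` evaluates a weighted product-limit estimator at a few time points, and a single multinomial-bootstrap replicate is obtained by plugging in multinomial resampling weights. All names and the simulated censoring model are assumptions of the example.

```python
import numpy as np

def kaplan_meier(U, d, times, w=None):
    # Weighted product-limit estimator over (U, delta) pairs; with the
    # default weights w = 1 this is the ordinary Kaplan-Meier estimator,
    # while multinomial weights give the bootstrapped version.
    w = np.ones(len(U)) if w is None else w
    surv = np.empty(len(times))
    for k, t in enumerate(times):
        fail_times = np.unique(U[(d == 1) & (U <= t)])
        terms = [1.0 - w[(U == s) & (d == 1)].sum() / w[U >= s].sum()
                 for s in fail_times]
        surv[k] = np.prod(terms)
    return surv

rng = np.random.default_rng(1)
n = 200
T = rng.exponential(1.0, n)            # latent survival times
C = rng.exponential(1.5, n)            # latent censoring times
U, d = np.minimum(T, C), (T <= C).astype(int)
grid = np.array([0.25, 0.5, 1.0])

km = kaplan_meier(U, d, grid)
# One multinomial-bootstrap replicate: W ~ Multinomial(n, (1/n, ..., 1/n)).
W = rng.multinomial(n, np.ones(n) / n).astype(float)
km_boot = kaplan_meier(U, d, grid, w=W)
```

Repeating the last two lines many times gives an approximation to the sampling distribution of the estimator, which the theorem guarantees is asymptotically valid.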

As mentioned at the end of Section 13.1, a master result such as this will not apply to all Z-estimator settings. Many interesting and important Z-estimators require an individualized approach to obtaining consistency, such as the Z-estimator for the proportional odds model for right-censored data which we examine in Chapter 15.

Proof of Theorem 13.4. The consistency of $\hat\theta_n$ and weak convergence of $\sqrt{n}(\hat\theta_n-\theta_0)$ follow from Theorems 13.1 and 2.11 and Lemma 13.3. Theorem 13.1 also yields that there exists a decreasing sequence $0<\eta_n\downarrow 0$ such that
\[
\mathrm{P}\bigl(\|\hat\theta_n^{\circ}-\theta_0\|>\eta_n\,\big|\,X_n\bigr)=o_P(1).
\]
Now we can use the same arguments used in the proof of Lemma 13.3, in combination with Theorem 2.6, to obtain that $\sqrt{n}(\hat\Psi_n^{\circ}-\Psi)(\hat\theta_n^{\circ})-\sqrt{n}(\hat\Psi_n^{\circ}-\Psi)(\theta_0)=E_n$, where $\mathrm{P}(E_n>\eta|X_n)=o_P(1)$ for all $\eta>0$. Combining this with arguments used in the proof of Theorem 2.11, we can deduce that $\sqrt{n}(\hat\theta_n^{\circ}-\theta_0)=-\dot\Psi_{\theta_0}^{-1}\sqrt{n}(\hat\Psi_n^{\circ}-\Psi)(\theta_0)+E_n'$, where $\mathrm{P}(E_n'>\eta|X_n)=o_P(1)$ for all $\eta>0$. Combining this with the conclusion of Theorem 2.11, we obtain $\sqrt{n}(\hat\theta_n^{\circ}-\hat\theta_n)=-\dot\Psi_{\theta_0}^{-1}\sqrt{n}(\hat\Psi_n^{\circ}-\Psi_n)(\theta_0)+E_n''$, where $\mathrm{P}(E_n''>\eta|X_n)=o_P(1)$ for all $\eta>0$. The final conclusion now follows from reapplication of Theorem 2.6.

13.3 Using the Delta Method

There is an alternative approach to Z-estimators that may be more effective for more general data settings, including non-i.i.d. and dependent data settings. The idea is to view the extraction of the zero from the estimating equation as a continuous mapping. Our approach is closely related to the approach in Section 3.9.4.7 of VW but with some important modifications which simplify the required assumptions. We require $\Theta$ to be a subset of a Banach space and $\mathbb{L}$ to be a Banach space. Let $\ell^\infty(\Theta,\mathbb{L})$ be the Banach space of all uniformly norm-bounded functions $z:\Theta\to\mathbb{L}$. Let $Z(\Theta,\mathbb{L})$ be the subset consisting of all maps with at least one zero, and let $\Phi(\Theta,\mathbb{L})$ be the collection of all maps (or algorithms) $\phi:Z(\Theta,\mathbb{L})\to\Theta$ that for each element $z\in Z(\Theta,\mathbb{L})$ extract one of its zeros $\phi(z)$. This structure allows for multiple zeros.

The following lemma gives us a kind of uniform Hadamard differentiability of members of $\Phi(\Theta,\mathbb{L})$, which we will be able to use to obtain a delta method result for Z-estimators $\hat\theta_n$ that satisfy $\Psi_n(\hat\theta_n)=o_P(r_n^{-1})$ for some sequence $r_n\to\infty$ for which $X_n(\theta)\equiv r_n(\Psi_n-\Psi)(\theta)$ converges weakly to a tight random element $X\in\ell^\infty(\Theta_0,\mathbb{L})$, where $\Theta_0\subset\Theta$ is an open neighborhood of $\theta_0$ and $\|X(\theta)-X(\theta_0)\|_{\mathbb{L}}\to 0$ as $\theta\to\theta_0$ almost surely, i.e., $X$ has continuous sample paths in $\theta$. Define $\ell_0^\infty(\Theta,\mathbb{L})$ to be the set of elements $x\in\ell^\infty(\Theta,\mathbb{L})$ for which $\|x(\theta)-x(\theta_0)\|_{\mathbb{L}}\to 0$ as $\theta\to\theta_0$.

Theorem 13.5 Assume $\Psi:\Theta\to\mathbb{L}$ is uniformly norm-bounded over $\Theta$, $\Psi(\theta_0)=0$, and Condition (13.1) holds. Let $\Psi$ also be Fréchet differentiable at $\theta_0$ with continuously invertible derivative $\dot\Psi_{\theta_0}$. Then the continuous linear operator $\phi_\Psi':\ell_0^\infty(\Theta,\mathbb{L})\to\operatorname{lin}\Theta$ defined by $z\mapsto\phi_\Psi'(z)\equiv-\dot\Psi_{\theta_0}^{-1}(z(\theta_0))$ satisfies:
\[
\sup_{\phi\in\Phi(\Theta,\mathbb{L})}\left\|\frac{\phi(\Psi+t_nz_n)-\phi(\Psi)}{t_n}-\phi_\Psi'(z)\right\|\to 0,
\]
as $n\to\infty$, for any sequences $(t_n,z_n)\in(0,\infty)\times\ell^\infty(\Theta,\mathbb{L})$ such that $t_n\downarrow 0$, $\Psi+t_nz_n\in Z(\Theta,\mathbb{L})$, and $z_n\to z\in\ell_0^\infty(\Theta,\mathbb{L})$.

Proof. Let $0<t_n\downarrow 0$ and $z_n\to z\in\ell_0^\infty(\Theta,\mathbb{L})$ be such that $\Psi+t_nz_n\in Z(\Theta,\mathbb{L})$. Choose any sequence $\phi_n\in\Phi(\Theta,\mathbb{L})$, and note that $\theta_n\equiv\phi_n(\Psi+t_nz_n)$ satisfies $\Psi(\theta_n)+t_nz_n(\theta_n)=0$ by construction. Hence $\Psi(\theta_n)=O(t_n)$. By Condition (13.1), $\theta_n\to\theta_0$. By the Fréchet differentiability of $\Psi$,
\[
\liminf_{n\to\infty}\frac{\|\Psi(\theta_n)-\Psi(\theta_0)\|_{\mathbb{L}}}{\|\theta_n-\theta_0\|}
\ge\liminf_{n\to\infty}\frac{\|\dot\Psi_{\theta_0}(\theta_n-\theta_0)\|_{\mathbb{L}}}{\|\theta_n-\theta_0\|}
\ge\inf_{\|g\|=1}\|\dot\Psi_{\theta_0}(g)\|_{\mathbb{L}},
\]
where $g$ ranges over $\operatorname{lin}\Theta$. Since the inverse of $\dot\Psi_{\theta_0}$ is continuous, the right side of the above is positive. Thus there exists a universal constant $c<\infty$ (depending only on $\dot\Psi_{\theta_0}$ and $\operatorname{lin}\Theta$) for which $\|\theta_n-\theta_0\|\le c\,\|\Psi(\theta_n)-\Psi(\theta_0)\|_{\mathbb{L}}=c\,\|t_nz_n(\theta_n)\|_{\mathbb{L}}$ for all $n$ large enough. Hence $\|\theta_n-\theta_0\|=O(t_n)$. By Fréchet differentiability, $\Psi(\theta_n)-\Psi(\theta_0)=\dot\Psi_{\theta_0}(\theta_n-\theta_0)+o(\|\theta_n-\theta_0\|)$, where $\dot\Psi_{\theta_0}$ is linear and continuous on $\operatorname{lin}\Theta$. The remainder term is $o(t_n)$ by the previous arguments. Combining this with the fact that $t_n^{-1}\bigl(\Psi(\theta_n)-\Psi(\theta_0)\bigr)=-z_n(\theta_n)\to-z(\theta_0)$, we obtain
\[
\frac{\theta_n-\theta_0}{t_n}=\dot\Psi_{\theta_0}^{-1}\left(\frac{\Psi(\theta_n)-\Psi(\theta_0)}{t_n}+o(1)\right)\to-\dot\Psi_{\theta_0}^{-1}(z(\theta_0)).
\]

The conclusion now follows since the sequence $\phi_n$ was arbitrary.

The following simple corollary allows the delta method to be applied to Z-estimators. Let $\tilde\phi:\ell^\infty(\Theta,\mathbb{L})\to\Theta$ be a map such that for each $x\in\ell^\infty(\Theta,\mathbb{L})$, $\tilde\phi(x)=\theta_1\neq\theta_0$ when $x\notin Z(\Theta,\mathbb{L})$, and $\tilde\phi(x)=\phi(x)$ for some $\phi\in\Phi(\Theta,\mathbb{L})$ otherwise.

Corollary 13.6 Suppose $\Psi$ satisfies the conditions of Theorem 13.5, $\hat\theta_n=\tilde\phi(\Psi_n)$, and $\Psi_n$ has at least one zero for all $n$ large enough, outer almost surely. Suppose also that $r_n(\Psi_n-\Psi)\rightsquigarrow X$ in $\ell^\infty(\Theta,\mathbb{L})$, with $X$ tight and $\|X(\theta)-X(\theta_0)\|_{\mathbb{L}}\to 0$ as $\theta\to\theta_0$ almost surely. Then $r_n(\hat\theta_n-\theta_0)\rightsquigarrow-\dot\Psi_{\theta_0}^{-1}X(\theta_0)$.

Proof. Since $\Psi_n$ has a zero for all $n$ large enough, outer almost surely, we can, without loss of generality, assume that $\Psi_n$ has a zero for all $n$. Thus we can assume that $\tilde\phi\in\Phi(\Theta,\mathbb{L})$. By Theorem 13.5, we know that $\tilde\phi$ is Hadamard differentiable tangentially to $\ell_0^\infty(\Theta,\mathbb{L})$, which, by assumption, contains $X$ with probability 1. Thus Theorem 2.8 applies, and the desired result follows.

We leave it as an exercise (see Exercise 13.4.2 below) to develop a corol-lary which utilizes φ to obtain a bootstrap result for Z-estimators. A draw-back with this approach is that root finding algorithms in practice areseldom exact, and room needs to be allowed for computational error. Thefollowing corollary of Theorem 13.5 yields a very general Z-estimator resultbased on a modified delta method. We make the fairly realistic assumptionthat the Z-estimator θn is computed from Ψn using a deterministic algo-rithm (e.g., a computer program) that is allowed to depend on n and whichis not required to yield an exact root of Ψn.
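For intuition, the following toy sketch (an illustration under assumed names, not code from the text) plays the role of such a deterministic algorithm $A_n$: a bisection routine applied to a one-dimensional estimating equation, stopped at a tolerance of order $o(n^{-1/2})$, so it returns only an approximate root.

```python
import numpy as np

def bisection_algorithm(psi, lo, hi, tol):
    # A deterministic algorithm A_n: returns an approximate zero of the
    # monotone map theta -> psi(theta); it stops once the bracket is
    # shorter than tol, so it is not required to yield an exact root.
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if psi(lo) * psi(mid) <= 0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)

rng = np.random.default_rng(2)
n = 10_000
X = rng.normal(loc=1.5, scale=1.0, size=n)

# Toy estimating equation Psi_n(theta) = P_n(X - theta), whose exact
# zero is the sample mean (here theta_0 = 1.5).
Psi_n = lambda theta: np.mean(X - theta)

# Tolerance n^(-3/4) = o(n^(-1/2)), so Psi_n(theta_n) = o_P(r_n^{-1})
# with r_n = sqrt(n).
theta_n = bisection_algorithm(Psi_n, -10.0, 10.0, tol=n ** -0.75)
```

Here $|\Psi_n(\hat\theta_n)|=|\bar X_n-\hat\theta_n|\le n^{-3/4}$ by construction, which is exactly the kind of inexactness the corollary below accommodates.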

Corollary 13.7 Suppose $\Psi$ satisfies the conditions of Theorem 13.5, and $\hat\theta_n=A_n(\Psi_n)$ for some sequence of deterministic algorithms $A_n:\ell^\infty(\Theta,\mathbb{L})\to\Theta$ and random sequence $\Psi_n:\Theta\to\mathbb{L}$ of estimating equations such that $\Psi_n\overset{P}{\to}\Psi$ in $\ell^\infty(\Theta,\mathbb{L})$ and $\Psi_n(\hat\theta_n)=o_P(r_n^{-1})$, where $0<r_n\to\infty$ is a sequence of constants for which $r_n(\Psi_n-\Psi)\rightsquigarrow X$ in $\ell^\infty(\Theta_0,\mathbb{L})$ for some closed $\Theta_0\subset\Theta$ containing an open neighborhood of $\theta_0$, with $X$ tight and $\|X(\theta)-X(\theta_0)\|_{\mathbb{L}}\to 0$ as $\theta\to\theta_0$ almost surely. Then $r_n(\hat\theta_n-\theta_0)\rightsquigarrow-\dot\Psi_{\theta_0}^{-1}X(\theta_0)$.

Proof. Let $X_n\equiv r_n(\Psi_n-\Psi)$. By Theorem 7.26, there exists a new probability space $(\tilde\Omega,\tilde{\mathcal{A}},\tilde{\mathrm{P}})$ on which: $\mathrm{E}^*f(\tilde\Psi_n)=\mathrm{E}^*f(\Psi_n)$ for all bounded $f:\ell^\infty(\Theta,\mathbb{L})\to\mathbb{R}$ and all $n\ge 1$; $\tilde\Psi_n\overset{\mathrm{as}*}{\to}\Psi$ in $\ell^\infty(\Theta,\mathbb{L})$; $r_n(\tilde\Psi_n-\Psi)\overset{\mathrm{as}*}{\to}\tilde X$ in $\ell^\infty(\Theta_0,\mathbb{L})$; $\tilde X$ and $X$ have the same distributions; and $r_n(A_n(\tilde\Psi_n)-\theta_0)$ and $r_n(A_n(\Psi_n)-\theta_0)$ have the same distributions. Note that for any bounded $f$, $\tilde\Psi_n\mapsto f(\tilde\Psi_n(A_n(\tilde\Psi_n)))=g(\tilde\Psi_n)$ for some bounded $g$. Thus $\tilde\Psi_n(\tilde\theta_n)=o_P(r_n^{-1})$ for $\tilde\theta_n\equiv A_n(\tilde\Psi_n)$.

Hence for any subsequence $\{n'\}$ there exists a further subsequence $\{n''\}$ such that $\tilde\Psi_{n''}\overset{\mathrm{as}*}{\to}\Psi$ in $\ell^\infty(\Theta,\mathbb{L})$, $r_{n''}(\tilde\Psi_{n''}-\Psi)\overset{\mathrm{as}*}{\to}\tilde X$ in $\ell^\infty(\Theta_0,\mathbb{L})$, and $\tilde\Psi_{n''}(\tilde\theta_{n''})\overset{\mathrm{as}*}{\to}0$ in $\mathbb{L}$. Thus also $\Psi(\tilde\theta_{n''})\overset{\mathrm{as}*}{\to}0$, which implies $\tilde\theta_{n''}\overset{\mathrm{as}*}{\to}\theta_0$. Note that $\tilde\theta_{n''}$ is a zero of $\theta\mapsto\tilde\Psi_{n''}(\theta)-\tilde\Psi_{n''}(\tilde\theta_{n''})$ by definition and is contained in $\Theta_0$ for all $n$ large enough. Hence, for all $n$ large enough,
\[
r_{n''}(\tilde\theta_{n''}-\theta_0)=r_{n''}\Bigl(\phi_{n''}\bigl(\tilde\Psi_{n''}-\tilde\Psi_{n''}(\tilde\theta_{n''})\bigr)-\phi_{n''}(\Psi)\Bigr)
\]
for some sequence $\phi_n\in\Phi(\Theta_0,\mathbb{L})$, possibly dependent on the sample realization $\tilde\omega\in\tilde\Omega$. This implies that, for all $n$ large enough,
\[
\bigl\|r_{n''}(\tilde\theta_{n''}-\theta_0)-\phi_\Psi'(\tilde X)\bigr\|
\le\sup_{\phi\in\Phi(\Theta_0,\mathbb{L})}\Bigl\|r_{n''}\Bigl(\phi\bigl(\tilde\Psi_{n''}-\tilde\Psi_{n''}(\tilde\theta_{n''})\bigr)-\phi(\Psi)\Bigr)-\phi_\Psi'(\tilde X)\Bigr\|\overset{\mathrm{as}*}{\to}0,
\]
by Theorem 13.5 (with $\Theta_0$ replacing $\Theta$). This implies $\bigl\|r_{n''}\bigl(A_{n''}(\tilde\Psi_{n''})-\theta_0\bigr)-\phi_\Psi'(\tilde X)\bigr\|\overset{\mathrm{as}*}{\to}0$. Since this holds for every subsequence, we have $r_n(A_n(\tilde\Psi_n)-\theta_0)\rightsquigarrow\phi_\Psi'(\tilde X)$. This of course implies $r_n(A_n(\Psi_n)-\theta_0)\rightsquigarrow\phi_\Psi'(X)$.

The following corollary extends the previous result to generalized bootstrapped processes. Let $\hat\Psi_n^{\circ}$ be a bootstrapped version of $\Psi_n$ based on both the data sequence $\{X_n\}$ (the data used in $\Psi_n$) and a sequence of weights $W=\{W_n,\,n\ge 1\}$.

Corollary 13.8 Assume the conditions of Corollary 13.7 and, in addition, that $\hat\theta_n^{\circ}=A_n(\hat\Psi_n^{\circ})$ for a sequence of bootstrapped estimating equations $\hat\Psi_n^{\circ}(X_n,W_n)$, with $\hat\Psi_n^{\circ}-\Psi\overset{\mathsf{P}}{\underset{W}{\rightsquigarrow}}0$ in $\ell^\infty(\Theta,\mathbb{L})$ and $r_n\hat\Psi_n^{\circ}(\hat\theta_n^{\circ})\overset{\mathsf{P}}{\underset{W}{\rightsquigarrow}}0$, and with $r_nc(\hat\Psi_n^{\circ}-\Psi_n)\overset{\mathsf{P}}{\underset{W}{\rightsquigarrow}}X$ in $\ell^\infty(\Theta_0,\mathbb{L})$ for some $0<c<\infty$, where the maps $W_n\mapsto h(\hat\Psi_n^{\circ})$ are measurable for every $h\in C_b(\ell^\infty(\Theta,\mathbb{L}))$ outer almost surely. Then $r_nc(\hat\theta_n^{\circ}-\hat\theta_n)\overset{\mathsf{P}}{\underset{W}{\rightsquigarrow}}\phi_\Psi'(X)$.


Proof. This proof shares many similarities with the proof of Theorem 12.1. To begin with, by using the same arguments used in the beginning of that proof, we can obtain that, unconditionally,
\[
r_n\begin{pmatrix}\hat\Psi_n^{\circ}-\Psi\\[2pt] \Psi_n-\Psi\end{pmatrix}\rightsquigarrow\begin{pmatrix}c^{-1}X_1'+X_2'\\[2pt] X_2'\end{pmatrix}
\]
in $\ell^\infty(\Theta_0,\mathbb{L})$, where $X_1'$ and $X_2'$ are two independent copies of $X$. We can also obtain that $\hat\Psi_n^{\circ}\overset{P}{\to}\Psi$ in $\ell^\infty(\Theta,\mathbb{L})$ unconditionally. Combining this with a minor adaptation of the above Corollary 13.7 (see Exercise 13.4.3 below), we obtain unconditionally that
\[
r_n\begin{pmatrix}\hat\theta_n^{\circ}-\theta_0\\[2pt] \hat\theta_n-\theta_0\\[2pt] \hat\Psi_n^{\circ}(\theta_0)-\Psi(\theta_0)\\[2pt] \Psi_n(\theta_0)-\Psi(\theta_0)\end{pmatrix}\rightsquigarrow\begin{pmatrix}\phi_\Psi'(c^{-1}X_1'+X_2')\\[2pt] \phi_\Psi'(X_2')\\[2pt] c^{-1}X_1'(\theta_0)+X_2'(\theta_0)\\[2pt] X_2'(\theta_0)\end{pmatrix}.
\]
This implies two things. First,
\[
r_nc\begin{pmatrix}\hat\theta_n^{\circ}-\hat\theta_n\\[2pt] (\hat\Psi_n^{\circ}-\Psi_n)(\theta_0)\end{pmatrix}\rightsquigarrow\begin{pmatrix}\phi_\Psi'(X)\\[2pt] X(\theta_0)\end{pmatrix}
\]
unconditionally, since $\phi_\Psi'$ is linear. Second, the usual continuous mapping theorem now yields unconditionally that
\[
r_nc(\hat\theta_n^{\circ}-\hat\theta_n)+\dot\Psi_{\theta_0}^{-1}\bigl(r_nc(\hat\Psi_n^{\circ}-\Psi_n)(\theta_0)\bigr)\overset{P}{\to}0,\qquad(13.8)
\]
since the map $(x,y)\mapsto x+\dot\Psi_{\theta_0}^{-1}(y)$ is continuous on all of $\operatorname{lin}\Theta\times\mathbb{L}$ (recall that $x\mapsto\phi_\Psi'(x)=-\dot\Psi_{\theta_0}^{-1}(x(\theta_0))$).

Now for any map $h\in C_b(\mathbb{L})$, the map $x\mapsto h\bigl(r_nc(x-\Psi_n(\theta_0))\bigr)$ is continuous and bounded for all $x\in\mathbb{L}$ outer almost surely. Thus the maps $W_n\mapsto h\bigl(r_nc(\hat\Psi_n^{\circ}-\Psi_n)(\theta_0)\bigr)$ are measurable for every $h\in C_b(\mathbb{L})$ outer almost surely. Hence the bootstrap continuous mapping theorem, Theorem 10.8, yields that $\dot\Psi_{\theta_0}^{-1}\bigl(r_nc(\hat\Psi_n^{\circ}-\Psi_n)(\theta_0)\bigr)\overset{\mathsf{P}}{\underset{W}{\rightsquigarrow}}\dot\Psi_{\theta_0}^{-1}\bigl(X(\theta_0)\bigr)$. The desired result now follows from (13.8).

An interesting application of the above results is to estimating equations for empirical processes from dependent data, as discussed in Section 11.6. Specifically, suppose $\Psi_n(\theta)(h)=\mathbb{P}_n\psi_{\theta,h}$, where the stationary sample data $X_1,X_2,\ldots$ and $\mathcal{F}=\{\psi_{\theta,h}:\theta\in\Theta,\,h\in\mathcal{H}\}$ satisfy the conditions of Theorem 11.24 with marginal distribution $P$, and let $\Psi(\theta)(h)=P\psi_{\theta,h}$. Then the conclusion of Theorem 11.24 is that $\sqrt{n}(\Psi_n-\Psi)\rightsquigarrow\mathbb{H}$ in $\ell^\infty(\Theta\times\mathcal{H})$, where $\mathbb{H}$ is a tight, mean zero Gaussian process. Provided $\Psi$ satisfies the conditions of Theorem 13.1, and provided a few other conditions hold, Corollary 13.7 will give us weak convergence of the standardized Z-estimators $\sqrt{n}(\hat\theta_n-\theta_0)$ based on $\Psi_n$. Under certain regularity conditions, a moving blocks bootstrapped estimating equation $\hat\Psi_n^{\circ}$ can be shown by Theorem 11.26 to satisfy the requirements of Corollary 13.8. This enables valid bootstrap estimation of the limiting distribution of $\sqrt{n}(\hat\theta_n-\theta_0)$. These results can also be extended to stationary sequences with long range dependence, where the normalizing rate $r_n$ may differ from $\sqrt{n}$.
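For intuition on the moving blocks bootstrap referenced above, here is a minimal sketch (an illustration under assumed names, not the construction of Theorem 11.26): overlapping blocks of a fixed length are drawn uniformly at random and concatenated, which preserves the short-range dependence structure within blocks.

```python
import numpy as np

def moving_blocks_resample(x, block_len, rng):
    # Moving blocks bootstrap: draw blocks of length block_len uniformly
    # from the n - block_len + 1 overlapping blocks of the stationary
    # series x, concatenate them, and truncate to the original length.
    n = len(x)
    n_blocks = -(-n // block_len)           # ceil(n / block_len)
    starts = rng.integers(0, n - block_len + 1, size=n_blocks)
    pieces = [x[s:s + block_len] for s in starts]
    return np.concatenate(pieces)[:n]

rng = np.random.default_rng(3)
n, phi = 500, 0.6
eps = rng.normal(size=n)
x = np.empty(n)
x[0] = eps[0]
for t in range(1, n):                       # a stationary AR(1) series
    x[t] = phi * x[t - 1] + eps[t]          # as stand-in dependent data

x_star = moving_blocks_resample(x, block_len=25, rng=rng)
```

Recomputing the Z-estimator on many such resampled series gives the bootstrap approximation whose validity the corollary addresses.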

13.4 Exercises

13.4.1. Show that the addition of $\bar\xi$ in the denominator of the weights in the weighted bootstrap introduced in Theorem 10.13, as discussed in Section 13.1, does not affect the conclusions of that theorem.

13.4.2. Develop a bootstrap central limit theorem for Z-estimators based on Theorem 12.1 which utilizes the Hadamard differentiability of the zero-extraction map $\tilde\phi$ used in Corollary 13.6.

13.4.3. Verify the validity of the “minor adaptation” of Corollary 13.7 used in the proof of Corollary 13.8.

13.5 Notes

Theorem 2.11 and Lemma 13.2 are essentially a decomposition of Theorem 3.3.1 of VW into two parts. Lemma 13.3 is Lemma 3.3.5 of VW.


14
M-Estimators

M-estimators, as introduced in Section 2.2.6, are approximate maximizers of objective functions computed from data. Note that estimators based on minimizing objective functions are trivially also M-estimators after taking the negative of the objective function. In some ways, M-estimators are more basic than Z-estimators, since Z-estimators can always be expressed as M-estimators. The reverse is not true, however, since there are M-estimators that cannot be effectively formulated as Z-estimators. Nevertheless, Z-estimator theory is usually much easier to use whenever it can be applied. The focus of this chapter, then, is on M-estimator settings for which it is not practical to directly use Z-estimator theory. The usual issues for M-estimation theory are to establish consistency, determine the correct rate of convergence, establish weak convergence, and, finally, to conduct inference.

We first present a key result central to M-estimation theory, the argmax theorem, which permits deriving weak limits of M-estimators as the argmax of the limiting process. This is useful both for consistency, which we discuss next, and for weak convergence. The section on consistency includes a proof of Theorem 2.12 on Page 28. We then discuss how to determine the correct rate of convergence, which is necessary for establishing weak convergence based on the argmax theorem. We then present general results for “regular estimators,” i.e., estimators whose rate of convergence is $\sqrt{n}$. We then give several examples that illustrate M-estimation theory for non-regular estimators that have rates distinct from $\sqrt{n}$. Much of the content of this chapter is inspired by Chapter 3.2 of VW.


The M-estimators in both the regular and non-regular examples we present will be i.i.d. empirical processes of the form $M_n(\theta)=\mathbb{P}_nm_\theta$ for a class of measurable, real valued functions $\mathcal{M}=\{m_\theta:\theta\in\Theta\}$, where the parameter space $\Theta$ is usually a subset of a semimetric space. The entropy of the class $\mathcal{M}$ plays a crucial role in determining the proper rate of convergence. Determining the rate of convergence is often the most technically difficult part of M-estimation theory and usually requires fairly precise bounds on moments of the empirical processes involved, such as those described in Section 11.1. We note that our presentation involves only a small portion of the useful results in the area. Many more results of this kind can be found in Chapter 3.4 of VW and in van de Geer (2000).

Inference for M-estimators is more challenging than it is for Z-estimators because the bootstrap is not in general automatically valid, especially when the convergence rate is not $\sqrt{n}$. On the other hand, subsampling $m$ out of $n$ observations (see Politis and Romano, 1994) can be shown to be universally valid, provided $m\to\infty$ and $m/n\to 0$. However, even this result is not entirely satisfactory because it requires $n$ to be quite large, since $m$ must also be large yet small relative to $n$. Bootstrapping and other methods of inference for M-estimators are currently an area of active research, but we do not pursue them further in this chapter.
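The m-out-of-n subsampling recipe just described can be sketched as follows (an illustration with assumed names, not code from Politis and Romano): subsets of size $m$ are drawn without replacement, the estimator is recomputed on each, and the deviations are scaled by the rate evaluated at $m$. The example uses the sample maximum of Uniform(0, 1) data, a non-regular estimator whose rate is $n$ rather than $\sqrt{n}$.

```python
import numpy as np

def subsample_distribution(x, estimator, m, n_sub, rate, rng):
    # m-out-of-n subsampling: recompute the estimator on random size-m
    # subsets drawn WITHOUT replacement and return the rate-scaled
    # deviations rate(m) * (theta_m - theta_n).
    theta_n = estimator(x)
    reps = np.empty(n_sub)
    for b in range(n_sub):
        sub = rng.choice(x, size=m, replace=False)
        reps[b] = rate(m) * (estimator(sub) - theta_n)
    return reps

rng = np.random.default_rng(4)
n = 2000
x = rng.uniform(0.0, 1.0, size=n)

# theta_n = max(X_i) estimates theta_0 = 1 at rate n, so the scaling is
# rate(m) = m; m grows with n while m/n stays small.
m = 180
reps = subsample_distribution(x, np.max, m, n_sub=500,
                              rate=lambda k: k, rng=rng)
```

The empirical distribution of `reps` approximates the sampling distribution of $n(\hat\theta_n-\theta_0)$, which is exactly where the ordinary bootstrap fails for this estimator.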

14.1 The Argmax Theorem

We now consider a sequence $\{M_n(h):h\in H\}$ of stochastic processes indexed by a metric space $H$. Let $\hat h_n$ denote an M-estimator obtained by nearly maximizing $M_n$. The idea of the argmax theorem presented below is that, under reasonable regularity conditions, when $M_n\rightsquigarrow M$, where $M$ is another stochastic process in $\ell^\infty(H)$, then $\hat h_n\rightsquigarrow\hat h$, where $\hat h$ is the argmax of $M$. If we know that the rate of convergence of an M-estimator $\hat\theta_n$ is $r_n$ (a nondecreasing, positive sequence), where $\hat\theta_n$ is the argmax of $\theta\mapsto M_n(\theta)$, then $\hat h_n=r_n(\hat\theta_n-\theta_0)$ can be expressed as the argmax of $h\mapsto\tilde M_n(h)\equiv r_n\bigl[M_n(\theta_0+h/r_n)-M_n(\theta_0)\bigr]$. Provided $\tilde M_n\rightsquigarrow\tilde M$, and the regularity conditions of the argmax theorem apply, $\hat h_n=r_n(\hat\theta_n-\theta_0)\rightsquigarrow\hat h$, where $\hat h$ is the argmax of $\tilde M$. Consistency results will follow for the choice $r_n=1$ for all $n\ge 1$.
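As a concrete illustration of the rescaling $h\mapsto r_n[M_n(\theta_0+h/r_n)-M_n(\theta_0)]$ (a sketch under assumed choices, not from the text), take the sample median: $M_n(\theta)=-\mathbb{P}_n|X-\theta|$ with $\theta_0$ the population median and $r_n=\sqrt{n}$. Since the outer factor does not change where the maximum occurs, the grid argmax of the localized process recovers $\hat h_n=r_n(\hat\theta_n-\theta_0)$.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 4000
X = rng.normal(0.0, 1.0, size=n)   # theta_0 = 0 is the population median

def M_n(theta):
    # Objective whose maximizer is the sample median.
    return -np.mean(np.abs(X - theta))

r_n = np.sqrt(n)
h_grid = np.linspace(-5.0, 5.0, 2001)
# Localized process h -> r_n [ M_n(theta_0 + h / r_n) - M_n(theta_0) ].
M_local = r_n * (np.array([M_n(h / r_n) for h in h_grid]) - M_n(0.0))
h_n = h_grid[np.argmax(M_local)]   # approximates r_n (theta_n - theta_0)
theta_n = h_n / r_n                # grid-restricted sample median
```

The grid argmax `theta_n` sits next to the exact sample median, illustrating that maximizing the localized process is the same as maximizing the original objective over the local window.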

Note that in the theorem, we require the sequence $\hat h_n$ to be uniformly tight. This is stronger than asymptotic tightness, as pointed out in Lemma 7.10, but is also quite easy to establish for Euclidean parameters, which will be our main focus in this chapter. For finite Euclidean estimators that are measurable, uniform tightness follows automatically from asymptotic tightness (see Exercise 14.6.1). This is a reasonable restriction, since, in practice, most infinite-dimensional estimators that converge weakly can usually be expressed as Z-estimators. The consistency results that we present later will not require uniform tightness and will thus be more readily applicable to infinite dimensional estimators. Returning to the theorem at hand, most weak convergence results for non-regular estimators apply to finite dimensional parameters, and thus the theorem below will be applicable. We also note that it is not hard to modify these results for applicability to specific settings, including some infinite dimensional settings. For interesting examples in this direction, see Ma and Kosorok (2005a) and Kosorok and Song (2007). We now present the argmax theorem, which utilizes upper semicontinuity as defined in Section 6.1:

Theorem 14.1 (Argmax theorem) Let $M_n$, $M$ be stochastic processes indexed by a metric space $H$ such that $M_n\rightsquigarrow M$ in $\ell^\infty(K)$ for every compact $K\subset H$. Suppose also that almost all sample paths $h\mapsto M(h)$ are upper semicontinuous and possess a unique maximum at a (random) point $\hat h$, which as a random map in $H$ is tight. If the sequence $\hat h_n$ is uniformly tight and satisfies $M_n(\hat h_n)\ge\sup_{h\in H}M_n(h)-o_P(1)$, then $\hat h_n\rightsquigarrow\hat h$ in $H$.

Proof. Fix $\epsilon>0$. By the uniform tightness of $\hat h_n$ and the tightness of $\hat h$, there exists a compact set $K\subset H$ such that $\liminf_{n\to\infty}\mathrm{P}_*(\hat h_n\in K)\ge 1-\epsilon$ and $\mathrm{P}(\hat h\in K)\ge 1-\epsilon$. Then, almost surely,
\[
M(\hat h)>\sup_{h\notin G,\,h\in K}M(h),
\]
for every open $G\ni\hat h$, by the upper semicontinuity of $M$. To see this, suppose it were not true. Then, by the compactness of $K$, there would exist a convergent sequence $h_m\in G^c\cap K$, for some open $G\ni\hat h$, with $h_m\to h_0$ and $M(h_m)\to M(\hat h)$. The upper semicontinuity forces $M(h_0)\ge M(\hat h)$, which contradicts the uniqueness of the maximum.

Now apply Lemma 14.2 below, with the sets $A=B=K$, to obtain
\[
\begin{aligned}
\limsup_{n\to\infty}\mathrm{P}^*(\hat h_n\in F)
&\le\limsup_{n\to\infty}\mathrm{P}^*(\hat h_n\in F\cap K)+\limsup_{n\to\infty}\mathrm{P}^*(\hat h_n\notin K)\\
&\le\mathrm{P}(\hat h\in F\cup K^c)+\epsilon\\
&\le\mathrm{P}(\hat h\in F)+\mathrm{P}(\hat h\in K^c)+\epsilon\\
&\le\mathrm{P}(\hat h\in F)+2\epsilon.
\end{aligned}
\]
The desired result now follows from the Portmanteau theorem, since $\epsilon$ was arbitrary.

Lemma 14.2 Let $M_n$, $M$ be stochastic processes indexed by a metric space $H$, and let $A,B\subset H$ be arbitrary. Suppose there exists a random element $\hat h$ such that, almost surely,
\[
M(\hat h)>\sup_{h\notin G,\,h\in B}M(h),\quad\text{for every open }G\ni\hat h.\qquad(14.1)
\]


Suppose the sequence $\hat h_n$ satisfies $M_n(\hat h_n)\ge\sup_{h\in H}M_n(h)-o_P(1)$. Then, if $M_n\rightsquigarrow M$ in $\ell^\infty(A\cup B)$, we have for every closed set $F$,
\[
\limsup_{n\to\infty}\mathrm{P}^*(\hat h_n\in F\cap A)\le\mathrm{P}(\hat h\in F\cup B^c).
\]

Proof. By the continuous mapping theorem,
\[
\sup_{h\in F\cap A}M_n(h)-\sup_{h\in B}M_n(h)\rightsquigarrow\sup_{h\in F\cap A}M(h)-\sup_{h\in B}M(h),
\]
and thus
\[
\begin{aligned}
\limsup_{n\to\infty}\mathrm{P}^*(\hat h_n\in F\cap A)
&\le\limsup_{n\to\infty}\mathrm{P}^*\Bigl(\sup_{h\in F\cap A}M_n(h)\ge\sup_{h\in H}M_n(h)-o_P(1)\Bigr)\\
&\le\limsup_{n\to\infty}\mathrm{P}^*\Bigl(\sup_{h\in F\cap A}M_n(h)\ge\sup_{h\in B}M_n(h)-o_P(1)\Bigr)\\
&\le\mathrm{P}\Bigl(\sup_{h\in F\cap A}M(h)\ge\sup_{h\in B}M(h)\Bigr),
\end{aligned}
\]
by Slutsky's theorem (to get rid of the $o_P(1)$ part) followed by the Portmanteau theorem. Note that the event $E$ in the last probability cannot happen when $\hat h\in F^c\cap B$, because of Assumption (14.1) and the fact that $F^c$ is open. Thus $E$ is contained in the set $\{\hat h\in F\}\cup\{\hat h\notin B\}$, and the conclusion of the lemma follows.

14.2 Consistency

We can obtain a consistency result by specializing the argmax theorem to the setting where $M$ is fixed. This will not yield as general a result as Theorem 2.12 because of the uniform tightness requirement. The primary goal of this section is to prove Theorem 2.12 on Page 28. Before giving the proof, we want to present a result comparing a few different ways of establishing identifiability. We assume throughout this section that $(\Theta,d)$ is a metric space. In the following lemma, the condition given in (i) is the identifiability condition assumed in Theorem 2.12, while Condition (ii) is often called the “well-separated maximum” condition:

Lemma 14.3 Let $M:\Theta\to\mathbb{R}$ be a map and $\theta_0\in\Theta$ a point. The following conditions are equivalent:

(i) For any sequence $\{\theta_n\}\in\Theta$, $\liminf_{n\to\infty}M(\theta_n)\ge M(\theta_0)$ implies $d(\theta_n,\theta_0)\to 0$.


(ii) For every open $G\ni\theta_0$, $M(\theta_0)>\sup_{\theta\notin G}M(\theta)$.

The following condition implies both (i) and (ii):

(iii) $M$ is upper semicontinuous with a unique maximum at $\theta_0$.

Proof. Suppose (i) is true but (ii) is not. Then there exists an open $G\ni\theta_0$ such that $\sup_{\theta\notin G}M(\theta)\ge M(\theta_0)$. This implies the existence of a sequence $\{\theta_n\}\notin G$ with both $\liminf_{n\to\infty}M(\theta_n)\ge M(\theta_0)$ and $d(\theta_n,\theta_0)\ge\tau$ for some $\tau>0$, which is a contradiction. Thus (i) implies (ii). Now assume (ii) is true but (i) is not. Then there exists a sequence $\{\theta_n\}$ with $\liminf_{n\to\infty}M(\theta_n)\ge M(\theta_0)$ but with $\theta_n\notin G$ for all $n$ large enough, for some open $G\ni\theta_0$. Of course, this contradicts (ii), and thus (ii) implies (i). Now suppose $M$ is upper semicontinuous with a unique maximum at $\theta_0$ but (ii) does not hold. Then there exists an open $G\ni\theta_0$ for which $\sup_{\theta\notin G}M(\theta)\ge M(\theta_0)$. But this implies that the set $\{\theta:M(\theta)\ge M(\theta_0)\}$ contains at least one point in addition to $\theta_0$, since $G^c$ is closed. This contradiction completes the proof.

Any one of the three identifiability conditions given in the above lemma is sufficient for Theorem 2.12. The most convenient condition in practice will depend on the setting. Here is the awaited proof:

Proof of Theorem 2.12 (Page 28). Since $\liminf_{n\to\infty}M(\theta_n)\ge M(\theta_0)$ implies $d(\theta_n,\theta_0)\to 0$ for any sequence $\{\theta_n\}\in\Theta$, we know that there exists a non-decreasing cadlag function $f:[0,\infty]\to[0,\infty]$ that satisfies both $f(0)=0$ and $d(\theta,\theta_0)\le f(|M(\theta)-M(\theta_0)|)$ for all $\theta\in\Theta$. The details for constructing such an $f$ are left as an exercise (see Exercise 14.6.2).

For Part (i), note that $M(\theta_0)\ge M(\hat\theta_n)\ge M_n(\hat\theta_n)-\|M_n-M\|_\Theta\ge M_n(\theta_0)-o_P(1)\ge M(\theta_0)-o_P(1)$. By the previous paragraph, this implies $d(\hat\theta_n,\theta_0)\le f(|M(\hat\theta_n)-M(\theta_0)|)\overset{P}{\to}0$. An almost identical argument yields Part (ii).

14.3 Rate of Convergence

In this section, we relax the requirement that $(\Theta,d)$ be a metric space and require only that it be a semimetric space. If $\theta\mapsto M(\theta)$ is twice differentiable at a point of maximum $\theta_0$, then the first derivative of $M$ at $\theta_0$ must vanish while the second derivative should be negative definite. Thus it is not unreasonable to require that $M(\theta)-M(\theta_0)\le-cd^2(\theta,\theta_0)$ for all $\theta$ in a neighborhood of $\theta_0$ and some $c>0$. The following theorem shows that an upper bound for the rate of convergence of a near-maximizer of a random objective function $M_n$ can be obtained from the modulus of continuity of $M_n-M$ at $\theta_0$. In practice, one may need to try several rates that satisfy the conditions of this theorem before finding the right $r_n$ for which the weak limit of $r_n(\hat\theta_n-\theta_0)$ is nontrivial.


Theorem 14.4 (Rate of convergence) Let $M_n$ be a sequence of stochastic processes indexed by a semimetric space $(\Theta,d)$ and $M:\Theta\to\mathbb{R}$ a deterministic function such that for every $\theta$ in a neighborhood of $\theta_0$, there exists a $c_1>0$ such that
\[
M(\theta)-M(\theta_0)\le-c_1\tilde d^2(\theta,\theta_0),\qquad(14.2)
\]
where $\tilde d:\Theta\times\Theta\to[0,\infty)$ satisfies $\tilde d(\theta_n,\theta_0)\to 0$ whenever $d(\theta_n,\theta_0)\to 0$. Suppose that, for all $n$ large enough and sufficiently small $\delta$, the centered process $M_n-M$ satisfies
\[
\mathrm{E}^*\sup_{\tilde d(\theta,\theta_0)<\delta}\sqrt{n}\,\bigl|(M_n-M)(\theta)-(M_n-M)(\theta_0)\bigr|\le c_2\phi_n(\delta),\qquad(14.3)
\]
for $c_2<\infty$ and functions $\phi_n$ such that $\delta\mapsto\phi_n(\delta)/\delta^\alpha$ is decreasing for some $\alpha<2$ not depending on $n$. Let $r_n$ satisfy
\[
r_n^2\,\phi_n(r_n^{-1})\le c_3\sqrt{n},\quad\text{for every }n\text{ and some }c_3<\infty.\qquad(14.4)
\]
If the sequence $\hat\theta_n$ satisfies $M_n(\hat\theta_n)\ge\sup_{\theta\in\Theta}M_n(\theta)-O_P(r_n^{-2})$ and converges in outer probability to $\theta_0$, then $r_n\tilde d(\hat\theta_n,\theta_0)=O_P(1)$.

Proof. We will use a modified "peeling device" (see, for example, Section 5.3 of van de Geer, 2000). For every η > 0, let η′ > 0 be a number for which d̃(θ, θ0) ≤ η whenever θ ∈ Θ satisfies d(θ, θ0) ≤ η′, and also d̃(θ, θ0) ≤ η/2 whenever θ ∈ Θ satisfies d(θ, θ0) ≤ η′/2. Such an η′ always exists for each η by the assumed relationship between d and d̃. Note also that Mn(θn) ≥ sup_{θ∈Θ} Mn(θ) − OP(rn⁻²) ≥ Mn(θ0) − OP(rn⁻²). Now fix ǫ > 0, and choose K < ∞ such that the probability that Mn(θn) − Mn(θ0) < −K rn⁻² is ≤ ǫ.

For each n, the parameter space minus the point θ0 can be partitioned into "peels" Sj,n ≡ {θ : 2^{j−1} < rn d̃(θ, θ0) ≤ 2^j}, with j ranging over the integers. Assume that Mn(θn) − Mn(θ0) ≥ −K rn⁻², and note that if rn d̃(θn, θ0) > 2^M for a given integer M, then θn is in one of the peels Sj,n with j > M. In that situation, the supremum of the map θ → Mn(θ) − Mn(θ0) + K rn⁻² over that peel is nonnegative by the property of θn. Conclude that for every η > 0,

P*( rn d̃(θn, θ0) > 2^M ) ≤ Σ_{j≥M, 2^j≤ηrn} P*( sup_{θ∈Sj,n} [ Mn(θ) − Mn(θ0) + K rn⁻² ] ≥ 0 ) + P*( 2 d(θn, θ0) ≥ η′ ) + P*( Mn(θn) − Mn(θ0) < −K rn⁻² ).   (14.5)

The lim sup_{n→∞} of the sum of the two probabilities after the summation on the right side is ≤ ǫ by the consistency of θn and the choice of K. Now choose η small enough that (14.2) holds for all d(θ, θ0) ≤ η′ and (14.3) holds for all δ ≤ η. Then, for every j involved in the sum and every θ ∈ Sj,n, M(θ) − M(θ0) ≤ −c1 d̃²(θ, θ0) ≤ −c1 2^{2j−2} rn⁻². In terms of the centered process Wn ≡ Mn − M, the summation on the right side of (14.5) may therefore be bounded by

Σ_{j≥M, 2^j≤ηrn} P*( ‖Wn(θ) − Wn(θ0)‖_{Sj,n} ≥ (c1 2^{2j−2} − K) rn⁻² ) ≤ Σ_{j≥M} c2 φn(2^j/rn) rn² / ( √n (c1 2^{2j−2} − K) ) ≤ Σ_{j≥M} c2 c3 2^{jα} / ( c1 2^{2j−2} − K ),

by Markov's inequality, the conditions on rn, and the fact that φn(cδ) ≤ c^α φn(δ) for every c > 1, a consequence of the assumptions on φn. It is not difficult to show that this last sum goes to zero as M → ∞ (verifying this is saved as an exercise). Thus we can choose an M < ∞ such that the lim sup_{n→∞} of the left side of (14.5) is ≤ 2ǫ. The desired result now follows since ǫ was arbitrary.
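The claim that this final sum vanishes as M → ∞ (Exercise 14.6.3) is easy to check numerically, since the terms decay like a geometric series with ratio 2^{α−2} < 1. A minimal sketch (the helper `tail_sum` and its constants are illustrative, not from the text):

```python
def tail_sum(M, alpha=1.5, c1=1.0, K=4.0, jmax=200):
    """Partial tail of sum_{j>=M} 2^(j*alpha) / (c1*2^(2j-2) - K).

    For alpha < 2 the terms behave like 2^(j*(alpha - 2)), a geometric
    series, so the tail goes to zero as M grows.
    """
    return sum(2 ** (j * alpha) / (c1 * 2 ** (2 * j - 2) - K)
               for j in range(M, jmax))

# The denominator is positive once c1 * 2^(2j-2) > K, i.e. j >= 3 here.
print(tail_sum(3), tail_sum(10), tail_sum(30))
```

The printed tails shrink toward zero, matching the conclusion that M can be chosen to make the bound in (14.5) arbitrarily small.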

Consider the i.i.d. setting with criterion functions of the form Mn(θ) = Pn mθ and M(θ) = P mθ. The scaled and centered process √n(Mn − M) = Gn mθ equals the empirical process at mθ. Condition (14.3) then involves assessing the suprema of the empirical process indexed by the classes of functions Mδ ≡ {mθ − mθ0 : d̃(θ, θ0) < δ}. Taking this view, establishing (14.3) will require fairly precise (but not unreasonably precise) knowledge of the involved empirical process. The moment results in Section 11.1 will be useful here, and we will illustrate this with several examples later on in this chapter. We note that the problems we address in this book represent only a small subset of the scope and capabilities of empirical process techniques for determining rates of M-estimators. We close this section with the following corollary, which essentially specializes Theorem 14.4 to the i.i.d. setting. Because the specialization is straightforward, we omit the somewhat trivial proof. Recall that the relation a ≲ b means that a is less than or equal to b times a universal finite and positive constant.

Corollary 14.5 In the i.i.d. setting, assume that for every θ in a neighborhood of θ0, P(mθ − mθ0) ≲ −d̃²(θ, θ0), where d̃ satisfies the conditions given in Theorem 14.4. Assume moreover that there exists a function φ such that δ → φ(δ)/δ^α is decreasing for some α < 2 and, for every n, E*‖Gn‖_{Mδ} ≲ φ(δ). If the sequence θn satisfies Pn mθn ≥ sup_{θ∈Θ} Pn mθ − OP(rn⁻²) and converges in outer probability to θ0, then rn d̃(θn, θ0) = OP(1) for every sequence rn for which rn² φ(1/rn) ≲ √n for all n ≥ 1.
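In applications one typically has φ(δ) = δ^β for some β < 2, and then rn² φ(1/rn) ≍ √n solves to rn = n^{1/(2(2−β))}. A small sketch of this arithmetic (`rate_exponent` is a hypothetical helper, not from the text):

```python
def rate_exponent(beta):
    """Exponent e with r_n = n^e solving r_n^2 (1/r_n)^beta = n^(1/2):
    n^(e (2 - beta)) = n^(1/2)  =>  e = 1 / (2 (2 - beta))."""
    return 1.0 / (2.0 * (2.0 - beta))

print(rate_exponent(1.0))  # Lipschitz-type classes: the regular rate n^(1/2)
print(rate_exponent(0.5))  # indicator-type classes: the cube-root rate n^(1/3)
```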


14.4 Regular Euclidean M-Estimators

A general result for Euclidean M-estimators based on i.i.d. data was given in Theorem 2.13 on Page 29 of Section 2.2.6. We now prove this theorem. In Section 2.2.6, the theorem was used to establish asymptotic normality of a least-absolute-deviation regression estimator. Establishing asymptotic normality in this situation is quite difficult without empirical process methods.
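A one-dimensional analogue of this kind of conclusion (an illustration, not the regression setting itself) is the sample median: taking mθ(x) = −|x − θ| makes the M-estimator the median, and √n(θn − θ0) is asymptotically mean-zero normal with variance 1/(4f²(θ0)), which equals π/2 for standard normal data. A minimal Monte Carlo sketch under these assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
n, reps = 400, 2000

# m_theta(x) = -|x - theta|: the maximizer of Pn m_theta is the sample median.
samples = rng.normal(size=(reps, n))         # true theta_0 = 0
z = np.sqrt(n) * np.median(samples, axis=1)  # sqrt(n) (theta_n - theta_0)

# Limiting variance 1/(4 f(0)^2) = pi/2 for the standard normal density f.
print(z.mean(), z.var(), np.pi / 2)
```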

Proof of Theorem 2.13 (Page 29). We first utilize Corollary 14.5 to verify that √n is the correct rate of convergence. We will use Euclidean distance as both the discrepancy measure and the distance throughout, i.e., d̃(θ1, θ2) = d(θ1, θ2) = ‖θ1 − θ2‖. Condition (2.18) of the theorem indicates that Mδ in this instance is a Lipschitz class, and thus Theorem 9.23 implies that

N[]( 2ǫ‖Fδ‖P,2, Mδ, L2(P) ) ≲ ǫ^{−p},   (14.6)

where Fδ ≡ δṁ is an envelope for Mδ. To see this, it may be helpful to rewrite Condition (2.18) as

|mθ1(x) − mθ2(x)| ≤ ( ‖θ1 − θ2‖/δ ) Fδ(x).

Now (14.6) can be utilized in Theorem 11.2 to obtain

E*‖Gn‖_{Mδ} ≲ ‖Fδ‖P,2 ≲ δ.

Hence the modulus of continuity condition in Corollary 14.5 is satisfied for φ(δ) = δ. Combining Condition (2.19), the maximality of θ0, and the fact that the second derivative matrix V is nonsingular and continuous yields that M(θ) − M(θ0) ≲ −‖θ − θ0‖². Since φ(δ)/δ^α = δ^{1−α} is decreasing for any α ∈ (1, 2), and n φ(1/√n) = n/√n = √n, the remaining conditions of Corollary 14.5 are satisfied for rn = √n, and thus √n(θn − θ0) = OP(1).

The next step is to apply the argmax theorem (Theorem 14.1) to the process h → Un(h) ≡ n( Mn(θ0 + h/√n) − Mn(θ0) ). Fix a compact K ⊂ R^p, and note that

Un(h) = Gn[ √n( mθ0+h/√n − mθ0 ) − h^T ṁθ0 ] + h^T Gn ṁθ0 + n( M(θ0 + h/√n) − M(θ0) )
      ≡ En(h) + h^T Gn ṁθ0 + (1/2) h^T V h + o(1),

where o(1) denotes a quantity going to zero uniformly over K. Note that hn ≡ √n(θn − θ0) satisfies Un(hn) ≥ sup_{h∈R^p} Un(h) − oP(1). Thus, provided we can establish that ‖En‖K = oP(1), the argmax theorem will yield that hn ⇝ h, where h is the argmax of h → U(h) ≡ h^T Z + (1/2) h^T V h and Z is the Gaussian limiting distribution of Gn ṁθ0. Hence h = −V⁻¹Z, and the desired result will follow.

We now prove that ‖En‖K = oP(1) for all compact K ⊂ R^p. Let u_h^n(x) ≡ √n( mθ0+h/√n(x) − mθ0(x) ) − h^T ṁθ0(x), and note that by (2.18),

|u_{h1}^n(x) − u_{h2}^n(x)| ≤ ( ṁ(x) + ‖ṁθ0(x)‖ ) ‖h1 − h2‖,

for all h1, h2 ∈ R^p and all n ≥ 1. Fix a compact K ⊂ R^p, and let Fn ≡ {u_h^n : h ∈ K}. Applying Theorem 9.23 once again, but with ‖·‖Q,2 (for any probability measure Q on X) instead of ‖·‖P,2, we obtain

N[]( 2ǫ‖Fn‖Q,2, Fn, L2(Q) ) ≤ k ǫ^{−p},

where the envelope Fn ≡ ( ṁ + ‖ṁθ0‖ ) sup_{h∈K} ‖h‖ and k < ∞ depends on neither Q nor n. Lemma 9.18 in Chapter 9 now yields that

sup_{n≥1} sup_Q N( ǫ‖Fn‖Q,2, Fn, L2(Q) ) ≤ k (2/ǫ)^p,

where the second supremum is taken over all finitely discrete probability measures on X. This implies that Condition (A) of Theorem 11.20 holds for Fn and Fn. In addition, Condition (2.19) implies that Condition (B) of Theorem 11.20 also holds, with H(s, t) = 0 for all s, t ∈ K. It is not difficult to verify that all of the remaining conditions of Theorem 11.20 also hold (we save this as an exercise), and thus Gn u_h^n ⇝ 0 in ℓ∞(K). This, of course, is the desired result.

14.5 Non-Regular Examples

We now present two examples in detail that illustrate the techniques presented in this chapter for parameter estimation with non-regular rates of convergence. The first example considers a simple change-point model with three parameters, wherein two of the parameter estimates converge at the regular rate while one converges at the n-rate, i.e., faster than √n. The second example is monotone density estimation based on the Grenander estimator, which is shown to yield convergence at the cube-root rate.

14.5.1 A Change-Point Model

For this model, we observe i.i.d. realizations of X = (Y, Z), where Y = α 1{Z ≤ ζ} + β 1{Z > ζ} + ǫ; Z and ǫ are independent, with ǫ continuous, Eǫ = 0 and σ² ≡ Eǫ² < ∞; γ ≡ (α, β) ∈ R²; and ζ is known to lie in a bounded interval [a, b]. The unknown parameters can be collected as θ = (γ, ζ), and the subscript zero will be used to denote the true parameter values. We make the very important assumption that α0 ≠ β0, and we also assume that Z has a strictly bounded and positive density f over [a, b], with P(Z < a) > 0 and P(Z > b) > 0. Our goal is to estimate θ through least squares. This is the same as maximizing Mn(θ) = Pn mθ, where

mθ(x) ≡ −( y − α 1{z ≤ ζ} − β 1{z > ζ} )².

Let θn ≡ (γn, ζn), with γn ≡ (αn, βn), be a maximizer of Mn(θ). Since we are not assuming that γ is bounded, we first need to prove the existence of γn, i.e., we need to prove that ‖γn‖ = OP(1). We then need to prove consistency of all the parameters and establish their rates of convergence. Finally, we need to obtain the joint limiting distribution of the parameter estimates.
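These steps can be previewed by simulation. For fixed ζ, the least-squares α and β are the sample means of Y over {Z ≤ ζ} and {Z > ζ}, so the estimator is computable by profiling ζ over the observed Z values in [a, b]. A minimal sketch (all constants illustrative):

```python
import numpy as np

def fit_change_point(y, z, a, b):
    """Profile least squares for Y = alpha 1{Z <= zeta} + beta 1{Z > zeta} + eps."""
    best = None
    for zeta in np.sort(z[(z >= a) & (z <= b)]):
        lo = z <= zeta
        alpha, beta = y[lo].mean(), y[~lo].mean()
        rss = ((y[lo] - alpha) ** 2).sum() + ((y[~lo] - beta) ** 2).sum()
        if best is None or rss < best[0]:
            best = (rss, alpha, beta, zeta)
    return best[1:]

rng = np.random.default_rng(1)
n = 2000
z = rng.uniform(0, 1, n)                # density f = 1 on [a, b]
eps = rng.normal(0, 0.5, n)
y = np.where(z <= 0.5, 1.0, 2.0) + eps  # alpha0 = 1, beta0 = 2, zeta0 = 0.5
alpha, beta, zeta = fit_change_point(y, z, a=0.25, b=0.75)
print(alpha, beta, zeta)
```

With n = 2000, the fitted ζn typically lands within a few multiples of 1/n of ζ0 = 0.5, previewing the n-rate established below, while αn and βn vary on the usual n^{−1/2} scale.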

Existence.

Note that the covariate Z and parameter ζ can be partitioned into four mutually exclusive sets: {Z ≤ ζ ∧ ζ0}, {ζ < Z ≤ ζ0}, {ζ0 < Z ≤ ζ} and {Z > ζ ∨ ζ0}. Since also 1{Z < a} ≤ 1{Z ≤ ζ ∧ ζ0} and 1{Z > b} ≤ 1{Z > ζ ∨ ζ0} by assumption, we obtain

−Pn ǫ² = Mn(θ0) ≤ Mn(θn) ≤ −Pn[ (ǫ − αn + α0)² 1{Z < a} + (ǫ − βn + β0)² 1{Z > b} ].

By decomposing the squares, we now have

(αn − α0)² Pn[1{Z < a}] + (βn − β0)² Pn[1{Z > b}]
 ≤ Pn[ǫ² 1{a ≤ Z ≤ b}] + 2|αn − α0| Pn[ǫ 1{Z < a}] + 2|βn − β0| Pn[ǫ 1{Z > b}]
 ≤ OP(1) + oP(1) ‖γn − γ0‖.

Since P(Z < a) ∧ P(Z > b) > 0, the above now implies that ‖γn − γ0‖² = OP(1 + ‖γn − γ0‖), and hence that ‖γn − γ0‖ = OP(1). Thus all the parameters are bounded in probability and therefore exist.

Consistency.

Our approach to establishing consistency will be to utilize the argmax theorem (Theorem 14.1). We first need to establish that Mn ⇝ M in ℓ∞(K) for all compact K ⊂ H ≡ R² × [a, b], where M(θ) ≡ P mθ. We then need to show that θ → M(θ) is upper semicontinuous with a unique maximum at θ0. We already know from the previous paragraph that θn is asymptotically tight (i.e., ‖θn‖ = OP(1)). The argmax theorem will then yield that θn ⇝ θ0, as desired.

Fix a compact K ⊂ H. We now verify that FK ≡ {mθ : θ ∈ K} is Glivenko-Cantelli. Note that


mθ(X) = −(ǫ − α + α0)² 1{Z ≤ ζ ∧ ζ0} − (ǫ − β + α0)² 1{ζ < Z ≤ ζ0} − (ǫ − α + β0)² 1{ζ0 < Z ≤ ζ} − (ǫ − β + β0)² 1{Z > ζ ∨ ζ0}.

It is not difficult to verify that {(ǫ − α + α0)² : θ ∈ K} and {1{Z ≤ ζ ∧ ζ0} : θ ∈ K} are separately Glivenko-Cantelli classes. Thus the product of the two classes is also Glivenko-Cantelli by Corollary 9.27, since the product of the two envelopes is integrable. Similar arguments reveal that the remaining components of the sum are also Glivenko-Cantelli, and reapplication of Corollary 9.27 yields that FK itself is Glivenko-Cantelli. Thus Mn ⇝ M in ℓ∞(K) for all compact K.

We now establish upper semicontinuity of θ → M(θ) and uniqueness of the maximum. Using the decomposition of the sets for (Z, ζ) used in the Existence paragraph above, we have

M(θ) = −Pǫ² − (α − α0)² P(Z ≤ ζ ∧ ζ0) − (β − α0)² P(ζ < Z ≤ ζ0) − (α − β0)² P(ζ0 < Z ≤ ζ) − (β − β0)² P(Z > ζ ∨ ζ0)
     ≤ −Pǫ² = M(θ0).

Because Z has a bounded density on [a, b], we obtain that M is continuous. It is also clear that M has a unique maximum at θ0, because the density of Z is bounded below and α0 ≠ β0 (see Exercise 14.6.5 below). Now the conditions of the argmax theorem are met, and the desired consistency follows.

Rate of convergence.

We will utilize Corollary 14.5 to obtain the convergence rates via the discrepancy function d̃(θ, θ0) ≡ ‖γ − γ0‖ + √|ζ − ζ0|. Note that d̃ is not a norm, because of the square root on the second term. Nevertheless, d̃(θ, θ0) → 0 if and only if ‖θ − θ0‖ → 0. Moreover, from the Consistency paragraph above, we have that

M(θ) − M(θ0) = −P{Z ≤ ζ ∧ ζ0}(α − α0)² − P{Z > ζ ∨ ζ0}(β − β0)² − P{ζ < Z ≤ ζ0}(β − α0)² − P{ζ0 < Z ≤ ζ}(α − β0)²
 ≤ −P{Z < a}(α − α0)² − P{Z > b}(β − β0)² − k1(1 − o(1))|ζ − ζ0|
 ≤ −( k1 ∧ δ1 − o(1) ) d̃²(θ, θ0),

where the first inequality follows from the fact that the product of the density of Z and (α0 − β0)² is bounded below by some k1 > 0, and the second inequality follows from both P(Z < a) and P(Z > b) being bounded below by some δ1 > 0. Thus M(θ) − M(θ0) ≲ −d̃²(θ, θ0) for all ‖θ − θ0‖ small enough, as desired.
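For a concrete instance with closed-form M (Z uniform on [0, 1], α0 = 1, β0 = 2, ζ0 = 0.5; all values illustrative, not from the text), the bound M(θ) − M(θ0) ≤ −c d̃²(θ, θ0) near θ0 can be checked numerically; a minimal sketch:

```python
import numpy as np

a0, b0, z0 = 1.0, 2.0, 0.5  # illustrative truth; Z ~ Uniform(0, 1)

def M_gap(alpha, beta, zeta):
    """M(theta) - M(theta0) for uniform Z: probabilities are interval lengths."""
    return -((alpha - a0) ** 2 * min(zeta, z0)
             + (beta - a0) ** 2 * max(z0 - zeta, 0.0)
             + (alpha - b0) ** 2 * max(zeta - z0, 0.0)
             + (beta - b0) ** 2 * (1.0 - max(zeta, z0)))

def d_tilde(alpha, beta, zeta):
    """Discrepancy ||gamma - gamma0|| + sqrt(|zeta - zeta0|)."""
    return np.hypot(alpha - a0, beta - b0) + np.sqrt(abs(zeta - z0))

rng = np.random.default_rng(2)
c = 0.05  # an illustrative constant that works in this neighborhood
for _ in range(1000):
    alpha = a0 + rng.uniform(-0.1, 0.1)
    beta = b0 + rng.uniform(-0.1, 0.1)
    zeta = z0 + rng.uniform(-0.05, 0.05)
    assert M_gap(alpha, beta, zeta) <= -c * d_tilde(alpha, beta, zeta) ** 2
print("quadratic bound holds on all sampled points")
```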


Consider now the class of functions Mδ ≡ {mθ − mθ0 : d̃(θ, θ0) < δ}. Using previous calculations, we have

mθ − mθ0 = 2(α − α0) ǫ 1{Z ≤ ζ ∧ ζ0} + 2(β − β0) ǫ 1{Z > ζ ∨ ζ0}   (14.7)
 + 2(β − α0) ǫ 1{ζ < Z ≤ ζ0} + 2(α − β0) ǫ 1{ζ0 < Z ≤ ζ}
 − (α − α0)² 1{Z ≤ ζ ∧ ζ0} − (β − β0)² 1{Z > ζ ∨ ζ0}
 − (β − α0)² 1{ζ < Z ≤ ζ0} − (α − β0)² 1{ζ0 < Z ≤ ζ}
 ≡ A1(θ) + A2(θ) + B1(θ) + B2(θ) − C1(θ) − C2(θ) − D1(θ) − D2(θ).

Consider first A1. Since {1{Z ≤ t} : t ∈ [a, b]} is a VC class, it is easy to compute that

E* sup_{d̃(θ,θ0)<δ} |Gn A1(θ)| ≲ δ,

as a consequence of Lemma 8.17. Similar calculations apply to A2, and also to C1 and C2, except that for the latter two the upper bounds are δ² instead of δ. Now consider B1. An envelope for the class F ≡ {B1(θ) : d̃(θ, θ0) < δ} is F = 2( |β0 − α0| + δ ) |ǫ| 1{ζ0 − δ² < Z ≤ ζ0}. It is not hard to verify that

log N[]( η‖F‖P,2, F, L2(P) ) ≲ log(1/η)   (14.8)

(see Exercise 14.6.6). Now Theorem 11.2 yields that

E* sup_{d̃(θ,θ0)<δ} |Gn B1(θ)| = E*‖Gn‖F ≲ ‖F‖P,2 ≲ δ.

Similar calculations apply also to B2, D1 and D2. Combining all of these results with the fact that O(δ²) = O(δ) as δ ↓ 0, we obtain E*‖Gn‖_{Mδ} ≲ δ.

Now, when φ(δ) = δ, δ → φ(δ)/δ^α is decreasing for any α ∈ (1, 2). Thus the conditions of Corollary 14.5 are satisfied with φ(δ) = δ. Since rn² φ(1/rn) = rn, taking rn = √n we obtain that √n d̃(θn, θ0) = OP(1). By the form of d̃, this now implies that √n ‖γn − γ0‖ = OP(1) and n|ζn − ζ0| = OP(1).

Weak convergence.

We will utilize a minor modification of the argmax theorem, together with the rate result above, to obtain the limiting distribution of hn ≡ ( √n(γn − γ0), n(ζn − ζ0) ). From the rate result, we know that hn is uniformly tight and is the smallest argmax of h → Qn(h) ≡ n Pn( mθn,h − mθ0 ), where θn,h ≡ θ0 + ( h1/√n, h2/√n, h3/n ) and h ≡ (h1, h2, h3) ∈ R³ ≡ H. Note that we have qualified hn as being the smallest argmax, which is interpreted componentwise since H is three dimensional. This is because, if we hold (h1, h2) fixed, Mn(θn,h) does not vary in h3 over the interval n[ Z(j) − ζ0, Z(j+1) − ζ0 ), for j = 0, . . . , n, where Z(1), . . . , Z(n) are the order statistics of Z1, . . . , Zn, Z(0) ≡ −∞, and Z(n+1) ≡ ∞. Because P(Z < a) > 0, we only need to consider h3 at the values n( Z(j) − ζ0 ), j = 1, . . . , n, provided n is large enough.

Let DK be the space of functions q : K → R that are continuous in the first two arguments (h1, h2) and right-continuous and piecewise constant in the third argument h3, with jump locations constant across (h1, h2), where K ⊂ H is a closed, bounded, rectangular set that contains an open neighborhood of zero. For each q ∈ DK, let h3 → Jq(h3) be the cadlag counting process recording the jump locations, with Jq(0−) = 0, with jumps of size +1 at each jump point of q(·, ·, h3) for h3 ≥ 0, and with h3 → Jq(−h3) also having jumps of size +1 (but left-continuous) at each jump point of q(·, ·, h3), also for h3 ≥ 0 (the left-continuity comes from the reversed time scale). Thus Jq(h3) is decreasing for h3 < 0 and increasing for h3 ≥ 0.

For q1, q2 ∈ DK, define the distance dK(q1, q2) to be the sum of a "modified Skorohod metric" d̃K(q1, q2) (to be defined shortly) and the Skorohod distance between Jq1 and Jq2. Note that the rectangular structure of K ensures that K = K1 × K2 × K3, where each Kj is a closed interval in R, j = 1, 2, 3. For each closed interval C in R, define ΛC to be the collection of continuous, strictly increasing maps λ : C → C such that λ(C) = C, and define a norm on ΛC by

λ → ‖λ‖ ≡ sup_{s≠t: s,t∈C} | log( (λ(t) − λ(s)) / (t − s) ) |.   (14.9)

For q1, q2 ∈ DK, we then define

d̃K(q1, q2) ≡ inf_{λ∈ΛK3} { sup_{h∈K} | q1(h1, h2, h3) − q2(h1, h2, λ(h3)) | + ‖λ‖ }.

We save the verification that dK is a metric as an exercise (see Exercise 14.6.7).

Now it is not difficult to see that the smallest-argmax function is continuous on DK with respect to dK. We will argue that Qn(h) ⇝ Q(h) in (DK, dK), for some limiting process Q and each compact K ⊂ H. By the continuous mapping theorem, the smallest argmax of the restriction of Qn to K will then converge weakly to the smallest argmax of the restriction of Q to K. Since hn is uniformly tight, we obtain hn ⇝ h, where h is the smallest argmax of Q.

All that remains is to establish the specified weak convergence and to characterize Q. We first argue that Qn − Q̃n = oK_P(1) in (DK, dK) for each compact K ⊂ H, where

Q̃n(h) ≡ 2h1 √n Pn[ ǫ 1{Z ≤ ζ0} ] − h1² P(Z ≤ ζ0)
 + 2h2 √n Pn[ ǫ 1{Z > ζ0} ] − h2² P(Z > ζ0)
 + n Pn[ −2(α0 − β0)ǫ − (α0 − β0)² ] 1{ζ0 + h3/n < Z ≤ ζ0}
 + n Pn[ 2(α0 − β0)ǫ − (α0 − β0)² ] 1{ζ0 < Z ≤ ζ0 + h3/n}
 ≡ An(h1) + Bn(h2) + Cn(h3) + Dn(h3),

and where the superscript K in oK_P(1) indicates that the error is in terms of dK. Fix a compact K ⊂ H. Note that by (14.7), An(h1) = n Pn[ A1(θn,h) − C1(θn,h) ] + En(h), where

En(h) = 2h1 Gn[ ǫ( 1{Z ≤ ζ0} − 1{Z ≤ (ζ0 + h3/n) ∧ ζ0} ) ] + h1²[ Pn 1{Z ≤ (ζ0 + h3/n) ∧ ζ0} − P(Z ≤ ζ0) ] → 0

in probability, as n → ∞, uniformly over h ∈ K. A similar analysis reveals the uniform asymptotic equivalence of Bn(h2) and n Pn[ A2(θn,h) − C2(θn,h) ]. It is fairly easy to see that Cn(h3) and n Pn[ B1(θn,h) − D1(θn,h) ] are asymptotically uniformly equivalent in probability, as are Dn(h3) and n Pn[ B2(θn,h) − D2(θn,h) ]. Thus Qn − Q̃n goes to zero, in probability, uniformly over h ∈ K. Note that the potential jump points in h3 are the same for Qn and Q̃n, and thus Qn − Q̃n = oK_P(1), as desired.

Lemma 14.6 below shows that Q̃n ⇝ Q ≡ 2h1 Z1 − h1² P(Z ≤ ζ0) + 2h2 Z2 − h2² P(Z > ζ0) + Q+(h3) 1{h3 > 0} + Q−(h3) 1{h3 < 0} in (DK, dK), where Z1, Z2, Q+ and Q− are all independent, and Z1 and Z2 are mean zero Gaussian with respective variances σ² P(Z ≤ ζ0) and σ² P(Z > ζ0). Let s → ν+(s) be a right-continuous homogeneous Poisson process on [0, ∞) with intensity f(ζ0) (recall that f is the density of Z), and let s → ν−(s) be another Poisson process, independent of ν+, on [−∞, 0), which is left-continuous, goes backward in time, and also has intensity f(ζ0). Let (V_k^+)_{k≥1} and (V_k^−)_{k≥1} be independent sequences of i.i.d. random variables, with V_1^+ distributed as 2(α0 − β0)ǫ − (α0 − β0)² and V_1^− distributed as −2(α0 − β0)ǫ − (α0 − β0)². Also define V_0^+ = V_0^− = 0 for convenience. Then h3 → Q+(h3) ≡ 1{h3 > 0} Σ_{0≤k≤ν+(h3)} V_k^+ and h3 → Q−(h3) ≡ 1{h3 < 0} Σ_{0≤k≤ν−(h3)} V_k^−.

Putting this all together, we conclude that hn ⇝ h = (h1, h2, h3), where all three components are mutually independent, h1 and h2 are both mean zero Gaussian with respective variances σ²/P(Z ≤ ζ0) and σ²/P(Z > ζ0), and h3 is the smallest argmax of h3 → Q+(h3) + Q−(h3). Note that the expected value of both V_1^+ and V_1^− is −(α0 − β0)² < 0. Thus Q+ + Q− is zero at h3 = 0 and eventually always negative for all h3 far enough away from zero. This means that the smallest argmax of Q+ + Q− is bounded in probability, as desired.


Lemma 14.6 For each closed, bounded, rectangular K ⊂ H that contains an open neighborhood of zero, Q̃n ⇝ Q in (DK, dK).

Proof. Fix K. The approach we take is to establish asymptotic tightness of Q̃n on DK and convergence of all finite dimensional distributions. More precisely, we need ( Q̃n, J_{Q̃n} ) ⇝ ( Q, ν− + ν+ ) in the product of the "modified Skorohod" and Skorohod topologies. This boils down to establishing

( √n Pn[ǫ 1{Z ≤ ζ0}], √n Pn[ǫ 1{Z > ζ0}], n Pn 1{ζ0 + h3/n < Z ≤ ζ0}, n Pn[ǫ 1{ζ0 + h3/n < Z ≤ ζ0}], n Pn 1{ζ0 < Z ≤ ζ0 + h3/n}, n Pn[ǫ 1{ζ0 < Z ≤ ζ0 + h3/n}] ) ⇝ ( Z1, Z2, ν−, Q−, ν+, Q+ ),   (14.10)

in the product of the R² and (D[−K, K])^{⊗4} topologies, where D[−K, K] is the space of cadlag functions on [−K, K] endowed with the Skorohod topology.

It can be shown, by an adaptation of the characteristic function approach used in the proof of Theorem 5 of Kosorok and Song (2007), that all finite dimensional distributions in (14.10) converge. We omit the somewhat lengthy arguments.

We next need to show that the processes involved are asymptotically tight. Accordingly, consider first Rn(u) ≡ n Pn 1{ζ0 < Z ≤ ζ0 + u/n}, and note that E Rn(K) → K f(ζ0) < ∞. Thus the number of jumps of the cadlag, monotone increasing, piecewise constant process Rn over [0, K] is bounded in probability. Fix η ∈ (0, K], and let 0 = u0 < u1 < · · · < um = K be a finite partition of [0, K] such that m < 2/η and max_{1≤j≤m} |uj − uj−1| < η. Note that the limiting probability that any two jumps of Rn occur within a single interval (uj−1, uj] is bounded above by η² f²(ζ0). Thus, for arbitrary η > 0, the limiting probability that any two jumps of Rn occur within a distance η of each other anywhere in [0, K] is O(η). Hence

lim_{η↓0} lim sup_{n→∞} P( sup_{s,t∈[0,K]: |s−t|<η} |Rn(s) − Rn(t)| > η ) = 0,

and Rn is therefore asymptotically tight. Similar arguments can be used to show that the remaining processes are also asymptotically tight. The desired conclusion now follows.
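The heart of the tightness argument is that Rn(K) counts only OP(1) observations in a window of width K/n, with mean tending to K f(ζ0); the count is approximately Poisson. A minimal simulation sketch (uniform Z so that f(ζ0) = 1; all constants illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
n, zeta0, K, reps = 10_000, 0.5, 5.0, 1000

# R_n(K) = n * Pn 1{zeta0 < Z <= zeta0 + K/n}: count of Z_i in an O(1/n) window
counts = np.empty(reps)
for r in range(reps):
    z = rng.uniform(0, 1, n)
    counts[r] = np.count_nonzero((z > zeta0) & (z <= zeta0 + K / n))

# Binomial(n, K/n), approximately Poisson(K f(zeta0)) = Poisson(5)
print(counts.mean(), counts.var())
```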

14.5.2 Monotone Density Estimation

This example is a special case of the cube-root asymptotics of Kim and Pollard (1990) and was analyzed in detail in Section 3.2.14 of VW.


Let X1, . . . , Xn be a sample of size n from a Lebesgue density f on [0, ∞) that is known to be decreasing. The maximum likelihood estimator f̂n of f is the non-increasing step function equal to the left derivative of the least concave majorant of the empirical distribution function Fn. This f̂n is the celebrated Grenander estimator (Grenander, 1956). For a fixed value of t > 0, we will study the properties of f̂n(t) under the assumption that f is differentiable at t with derivative satisfying −∞ < f′(t) < 0. Specifically, we will establish consistency of f̂n(t), verify that the rate of convergence of f̂n(t) is n^{1/3}, and derive the weak limit of n^{1/3}( f̂n(t) − f(t) ). Existence of f̂n will be verified automatically as a consequence of consistency.

Consistency.

Let F̂n denote the least concave majorant of Fn. In general, the least concave majorant of a function g is the smallest concave function h such that h ≥ g. One can construct F̂n by imagining a string tied at (x, y) = (0, 0) which is pulled tight over the top of the graph of (x, y = Fn(x)). The slope of each of the resulting piecewise linear segments is non-increasing, and the string (F̂n) touches Fn at two or more points (xj, yj), j = 0, . . . , k, where k ≥ 1, (x0, y0) ≡ (0, 0) and xk is the last observation in the sample. For all x > xk, we set F̂n(x) = 1. Note also that F̂n is continuous. We leave it as an exercise to verify that this algorithm does indeed produce the least concave majorant of Fn. The following lemma (Marshall's lemma) yields that F̂n is uniformly consistent for F. We save the proof as another exercise.
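The string construction just described is exactly an upper-convex-hull scan over the points (0, 0), (X(1), 1/n), . . . , (X(n), 1), and the Grenander estimate at t is the slope of the hull segment covering t. A minimal sketch (not from the text; assumes distinct observations):

```python
import numpy as np

def grenander(x, t):
    """Grenander estimate at t: left derivative at t of the least concave
    majorant of the empirical CDF of the sample x (a minimal sketch)."""
    xs = np.sort(x)
    n = len(xs)
    pts = [(0.0, 0.0)] + [(xs[i], (i + 1) / n) for i in range(n)]
    hull = []  # knots of the upper convex hull = least concave majorant
    for p in pts:
        while len(hull) >= 2:
            (x1, y1), (x2, y2) = hull[-2], hull[-1]
            # pop the middle knot while slopes fail to strictly decrease
            if (y2 - y1) * (p[0] - x2) <= (p[1] - y2) * (x2 - x1):
                hull.pop()
            else:
                break
        hull.append(p)
    # left derivative at t: slope of the segment whose right end is >= t
    for (x1, y1), (x2, y2) in zip(hull, hull[1:]):
        if t <= x2:
            return (y2 - y1) / (x2 - x1)
    return 0.0  # beyond the largest observation the estimate is zero

rng = np.random.default_rng(4)
x = rng.exponential(size=5000)  # decreasing density f(t) = exp(-t)
print(grenander(x, 1.0))        # close to exp(-1) ~ 0.368 for large n
```

By construction the returned values are non-increasing in t, as a left derivative of a concave function must be.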

Lemma 14.7 (Marshall's lemma) Under the given conditions, sup_{t≥0} |F̂n(t) − F(t)| ≤ sup_{t≥0} |Fn(t) − F(t)|.

Now fix 0 < δ < t, and note that, by the definition of f̂n as a left derivative of the concave function F̂n,

( F̂n(t + δ) − F̂n(t) )/δ ≤ f̂n(t) ≤ ( F̂n(t) − F̂n(t − δ) )/δ.

By Marshall's lemma, the lower and upper bounds converge almost surely to δ⁻¹( F(t + δ) − F(t) ) and δ⁻¹( F(t) − F(t − δ) ), respectively. By the assumptions on F and the arbitrariness of δ, we obtain f̂n(t) → f(t) outer almost surely.

Rate of convergence.

To determine the rate of convergence, we perform an interesting inverse transformation of the problem that will also be useful for obtaining the weak limiting distribution. Define the stochastic process {ŝn(a) : a > 0} by ŝn(a) ≡ argmax_{s≥0} { Fn(s) − as }, where the largest value is selected when multiple maximizers exist. The function ŝn is a sort of inverse of the function f̂n, in the sense that f̂n(t) ≤ a if and only if ŝn(a) ≤ t, for every t ≥ 0 and a > 0. To see this, first assume that f̂n(t) ≤ a. This means that the left derivative of F̂n at t is ≤ a. Hence a line of slope a that is moved down vertically from +∞ will first touch F̂n at a point s0 to the left of (or equal to) t. That point is also the point at which Fn(s) − as is maximized, so s0 = argmax_{s≥0}{ Fn(s) − as }, and hence ŝn(a) ≤ t. Now suppose ŝn(a) ≤ t. Then the argument can be taken in reverse: the left derivative of F̂n at t is less than or equal to the slope a of the line that touches F̂n at ŝn(a), and thus f̂n(t) ≤ a. Hence,

P( n^{1/3}( f̂n(t) − f(t) ) ≤ x ) = P( ŝn( f(t) + x n^{−1/3} ) ≤ t ),   (14.11)

and the desired rate and weak convergence results can be deduced from the argmax values of x → ŝn( f(t) + x n^{−1/3} ). Applying the change of variable s → t + g in the definition of ŝn, we obtain

ŝn( f(t) + x n^{−1/3} ) − t = argmax_{g>−t} { Fn(t + g) − ( f(t) + x n^{−1/3} )(t + g) }.

In this manner, the probability in (14.11) is precisely P(ĝn ≤ 0), where ĝn is the argmax above.

Now, since the location of the maximum of a function does not change when the function is shifted vertically, the previous argmax expression gives ĝn = argmax_{g>−t} M̂n(g), where M̂n(g) ≡ Fn(t + g) − Fn(t) − f(t)g − x g n^{−1/3}. It is not hard to see that ĝn = OP(1) and that M̂n(g) → M(g) ≡ F(t + g) − F(t) − f(t)g in probability, uniformly on compacts, and thus that ĝn = oP(1). We now utilize Theorem 14.4 to obtain the rate for ĝn, with the metric d(θ1, θ2) = |θ1 − θ2|, θ = g, θ0 = 0, and d̃ = d. The fact that M̂n(0) = M(0) = 0 simplifies the calculations. It is now easy to see that M(g) ≲ −g² for g in a neighborhood of zero, and, by using Theorem 11.2, that

E* sup_{|g|<δ} √n | M̂n(g) − M(g) | ≤ E* sup_{|g|<δ} | Gn( 1{X ≤ t + g} − 1{X ≤ t} ) | + O( √n δ n^{−1/3} ) ≲ φn(δ) ≡ δ^{1/2} + √n δ n^{−1/3}.

Clearly, φn(δ)/δ^α is decreasing in δ for α = 3/2. Since n^{2/3} φn(n^{−1/3}) = n^{1/2} + n^{1/2} = O(n^{1/2}), Theorem 14.4 yields n^{1/3} ĝn = OP(1). We show below how this enables the weak convergence of n^{1/3}( f̂n(t) − f(t) ).

Weak convergence.

Let hn ≡ n^{1/3} ĝn, and note that, since the location of the maximum of a function does not change when the function is multiplied by a positive constant, hn is the argmax of the process

h → n^{2/3} M̂n( n^{−1/3} h ) = n^{2/3}(Pn − P)( 1{X ≤ t + h n^{−1/3}} − 1{X ≤ t} ) + n^{2/3}[ F(t + h n^{−1/3}) − F(t) − f(t) h n^{−1/3} ] − x h.   (14.12)

Fix 0 < K < ∞, and apply Theorem 11.20 to the sequence of classes Fn = { n^{1/6}( 1{X ≤ t + h n^{−1/3}} − 1{X ≤ t} ) : −K ≤ h ≤ K }, with envelope sequence Fn = n^{1/6} 1{ t − K n^{−1/3} ≤ X ≤ t + K n^{−1/3} }, to obtain that the process on the right side of (14.12) converges weakly in ℓ∞([−K, K]) to

h → H(h) ≡ √f(t) Z(h) + (1/2) f′(t) h² − x h,

where Z is a two-sided Brownian motion originating at zero (two independent Brownian motions starting at zero, one going to the right of zero and the other going to the left). From the previous paragraph, we know that hn = OP(1). Since it is not hard to verify that H is continuous with a unique maximum, the argmax theorem now yields, by the arbitrariness of K, that hn ⇝ ĥ, where ĥ ≡ argmax_h H(h). By Exercise 14.6.10 below, we can simplify the form of ĥ to

( 4f(t)/[f′(t)]² )^{1/3} argmax_h{ Z(h) − h² } + x/f′(t).

Since

P( ( 4f(t)/[f′(t)]² )^{1/3} argmax_h{ Z(h) − h² } + x/f′(t) ≤ 0 ) = P( |4 f′(t) f(t)|^{1/3} argmax_h{ Z(h) − h² } ≤ x ),

we now have, as a result of (14.11), that

n^{1/3}( f̂n(t) − f(t) ) ⇝ |4 f′(t) f(t)|^{1/3} C,

where the random variable C ≡ argmax_h{ Z(h) − h² } has Chernoff's distribution (see Groeneboom, 1989).
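Chernoff's distribution has no simple closed form, but draws of C can be approximated by discretizing a two-sided Brownian motion and maximizing on a grid; a minimal sketch (grid step and truncation are illustrative, and truncating at |h| = 3 is harmless because the parabola dominates far from zero):

```python
import numpy as np

def chernoff_draw(rng, half_width=3.0, step=0.001):
    """One approximate draw of argmax_h {Z(h) - h^2}, Z a two-sided BM from 0."""
    h = np.arange(step, half_width + step / 2, step)
    right = np.cumsum(rng.normal(0.0, np.sqrt(step), h.size)) - h ** 2
    left = np.cumsum(rng.normal(0.0, np.sqrt(step), h.size)) - h ** 2
    grid = np.concatenate((-h[::-1], [0.0], h))
    vals = np.concatenate((left[::-1], [0.0], right))
    return grid[np.argmax(vals)]

rng = np.random.default_rng(5)
draws = np.array([chernoff_draw(rng) for _ in range(400)])
# Chernoff's distribution is symmetric about zero
print(draws.mean(), draws.std())
```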

14.6 Exercises

14.6.1. Show that for a sequence Xn of measurable, Euclidean random variables which are finite almost surely, measurability plus asymptotic tightness implies uniform tightness.

14.6.2. For a metric space (D, d), let H : D → [0, ∞] be a function such that H(x0) = 0 for a point x0 ∈ D and such that H(xn) → 0 implies d(xn, x0) → 0 for any sequence xn ∈ D. Show that there exists a non-decreasing cadlag function f : [0, ∞] → [0, ∞] that satisfies both f(0) = 0 and d(x, x0) ≤ f(|H(x)|) for all x ∈ D. Hint: Use the fact that the given conditions on H imply the existence of a sequence 0 < τn ↓ 0 such that H(x) < τn implies d(x, x0) < 1/n, and note that it is permissible to have f(u) = ∞ for all u ≥ τ1.


14.6.3. In the proof of Theorem 14.4, verify that, for fixed c < ∞ and α < 2,

Σ_{j≥M} 2^{jα}/( 2^{2j} − c ) → 0, as M → ∞.

14.6.4. In the context of the last paragraph of the proof of Theorem 2.13, given in Section 14.4, complete the verification of the conditions of Theorem 11.20.

14.6.5. Consider the function θ → M(θ) defined in Section 14.5.1. Show that it has a unique maximum over R² × [a, b]. Also show that the maximum is not unique if α0 = β0.

14.6.6. Verify (14.8).

14.6.7. Show that the modified Skorohod metric d̃K defined in Section 14.5.1 is a metric. Hint: First show that, for any λ1, λ2 ∈ ΛC,

‖λ1 ∘ λ2‖ ≤ ‖λ1‖ + ‖λ2‖,

for the norm defined in (14.9).

14.6.8. Verify that the algorithm described in the second paragraph of Section 14.5.2 does indeed generate the least concave majorant of Fn.

14.6.9. The goal of this exercise is to prove Marshall's lemma given in Section 14.5.2. Denoting An(t) ≡ F̂n(t) − F(t) and Bn(t) ≡ Fn(t) − F(t), the proof can be broken into the following steps:

(a) Show that 0 ≥ inf_{t≥0} An(t) ≥ inf_{t≥0} Bn(t).

(b) Show that

i. sup_{t≥0} An(t) ≥ 0 and sup_{t≥0} Bn(t) ≥ 0.

ii. If sup_{t≥0} Bn(t) = 0, then sup_{t≥0} An(t) = 0.

iii. If sup_{t≥0} Bn(t) > 0, then sup_{t≥0} An(t) ≤ sup_{t≥0} Bn(t) (this last step is tricky).

Now verify that 0 ≤ sup_{t≥0} An(t) ≤ sup_{t≥0} Bn(t).

(c) Now complete the proof.

(c) Now complete the proof.

14.6.10. Let {Z(h) : h ∈ R} be a standard two-sided Brownian motion with Z(0) = 0. (The process is zero-mean Gaussian, and the increment Z(g) − Z(h) has variance |g − h|.) Show that argmax_h{ aZ(h) − bh² − ch } is equal in distribution to (a/b)^{2/3} argmax_g{ Z(g) − g² } − c/(2b), where a, b, c > 0. Hint: The process h → Z(σh − µ) is equal in distribution to the process h → √σ Z(h) + Z(−µ), where σ ≥ 0 and µ ∈ R. Apply the change of variable h = (a/b)^{2/3} g − c/(2b), and note that the location of a maximum does not change under multiplication by a positive constant or a vertical shift.


14.7 Notes

Theorem 14.1 and Lemma 14.2 are Theorem 3.2.2 and Lemma 3.2.1, respectively, of VW, while Theorem 14.4 and Corollary 14.5 are modified versions of Theorem 3.2.5 and Corollary 3.2.6 of VW. The monotone density estimation example in Section 14.5.2 is a variation of Example 3.2.14 of VW. The limiting behavior of the Grenander estimator of this example was obtained by Prakasa Rao (1969). Exercise 14.6.9 is an expanded version of Exercise 24.5 of van der Vaart (1998) and Exercise 14.6.10 is an expanded version of Exercise 3.2.5 of VW.


15 Case Studies II

The examples of this section illustrate a variety of empirical process techniques applied in a statistical context. The first example is a continuation of the partly linear logistic regression example introduced in Chapter 1 and studied in some detail in Chapter 4. The issues addressed are somewhat technical, but they are also both instructive and necessary for securing the desired results. The second example utilizes empirical process results for independent but not identically distributed observations from Chapter 11 to address an issue in clinical trials. The third example applies Z-estimation theory for estimation and inference in the proportional odds regression model for right-censored survival data, while the fourth example considers hypothesis testing for the presence of a change-point in the regression model studied in Section 14.5.1. An interesting feature of this fourth example is that the model is partially unidentifiable under the null hypothesis of no change-point. The fifth example utilizes maximal inequalities (see Section 8.1) to establish asymptotic results for the very high dimensional data sets which arise in gene microarray studies. These varied examples demonstrate how the empirical process methods of the previous chapters can be used to solve challenging and important problems in statistical inference.

15.1 Partly Linear Logistic Regression Revisited

As promised in Section 4.5, we now tie up some of the technical loose ends for establishing that the proposed penalized log-likelihood estimator βn is


asymptotically efficient. Specifically, we establish (i) that Hc is Donsker for every c < ∞, (ii) that J(ηn) = OP(1), (iii) that Expression (4.12) holds, and (iv) that both βn and ηn are uniformly consistent. Along the way, we will also verify that the L2 convergence of ηn occurs at the optimal rate n^{−k/(2k+1)} (see Cox and O'Sullivan, 1990, and Mammen and van de Geer, 1997). We need to strengthen our previous assumptions to require the existence of a known c0 < ∞ such that both |β| < c0 and ‖η‖∞ < c0, and also that the density of U is strictly positive and finite.

For (i), Theorem 9.21 readily implies, after some rescaling of Hc, that

log N[](ε, Hc, L2(P)) ≤ M ε^{−1/k},    (15.1)

for all ε > 0, where M depends only on c and P. This readily yields that Hc is Donsker.

To establish (ii), first define the composite parameter θ ≡ (β, η), and let ℓθ(x) ≡ y wθ(x) − log(1 + e^{wθ(x)}) and wθ(x) ≡ zβ + η(u). The quite technical Theorem 15.1 below yields that both J(ηn) = OP(1) and

‖wθn(X) − wθ0(X)‖_{P,2} = OP(n^{−k/(2k+1)})    (15.2)

hold. Thus (4.12) holds trivially, and we are done with Parts (ii) and (iii).

We now use (15.2) to obtain (iv) and the optimality of the L2 convergence rate of ηn. Recall that h1(u) ≡ E[Z|U = u] and that P[Z − h1(U)]² > 0 by assumption. Since E[(Z − h1(U))g(U)] = 0 for all g ∈ L2(U), (15.2) implies |βn − β0| = OP(n^{−k/(2k+1)}), and thus βn is consistent. These results now imply that P[ηn(U) − η0(U)]² = OP(n^{−2k/(2k+1)}), and thus we have L2 optimality of ηn because of the assumptions on the density of U. Uniform consistency of ηn now follows from the fact that J(ηn) = OP(1) forces u → ηn(u) to be uniformly equicontinuous in probability.

We now present Theorem 15.1:

Theorem 15.1 Under the given assumptions, J(ηn) = OP(1) and

‖wθn(X) − wθ0(X)‖_{P,2} = OP(n^{−k/(2k+1)}).

Proof. We first need to establish a fairly precise bound on the bracketing entropy of

G ≡ { (ℓθ(X) − ℓθ0(X)) / (1 + J(η)) : |β − β0| ≤ c1, ‖η − η0‖∞ ≤ c1, J(η) < ∞ },

which satisfies G ⊂ G1 + G2(G1), where

G1 = { (wθ(X) − wθ0(X)) / (1 + J(η)) : |β − β0| ≤ c1, ‖η − η0‖∞ ≤ c1, J(η) < ∞ },

G2 consists of all functions t → a^{−1} log(1 + e^{at}) with a ≥ 1, and G2(G1) ≡ {g2(g1) : g1 ∈ G1, g2 ∈ G2}. By (15.1) combined with the properties of bracketing entropy and the fact that J(η0) is a finite constant, it is not hard to verify that there exists an M0 < ∞ such that log N[](ε, G1, L2(P)) ≤ M0 ε^{−1/k}. Now by Exercise 15.6.1 combined with Lemma 15.2 below, we obtain that there exists a K1 < ∞ such that log N[](ε, G2(G1), L2(P)) ≤ K1 ε^{−1/k}, for every ε > 0. Combining this with preservation properties of bracketing entropy (see Lemma 9.25), we obtain that there exists an M1 < ∞ such that log N[](ε, G, L2(P)) ≤ M1 ε^{−1/k} for all ε > 0.

Combining this result with Theorem 15.3 below, we obtain that

|(Pn − P)(ℓθn(X) − ℓθ0(X))| = OP(n^{−1/2})(1 + J(ηn)) × [ ‖(ℓθn(X) − ℓθ0(X))/(1 + J(ηn))‖_{P,2}^{1−1/(2k)} ∨ n^{−(2k−1)/[2(2k+1)]} ].    (15.3)

Now note that by a simple Taylor expansion and the boundedness constraints on the parameters, there exist a c1 > 0 and a c2 < ∞ such that

P(ℓθn(X) − ℓθ0(X)) ≤ −c1 P[wθn(X) − wθ0(X)]²

and

|ℓθn(X) − ℓθ0(X)| ≤ c2 |wθn(X) − wθ0(X)|,

almost surely. Combining this with (15.3) and a simple Taylor expansion, we can readily establish that

λn² J²(ηn) ≤ λn² J²(η0) + (Pn − P)(ℓθn(X) − ℓθ0(X)) + P[ℓθn(X) − ℓθ0(X)]
≤ OP(λn²) + OP(n^{−1/2})(1 + J(ηn)) × [ ‖(wθn(X) − wθ0(X))/(1 + J(ηn))‖_{P,2}^{1−1/(2k)} ∨ n^{−(2k−1)/[2(2k+1)]} ] − c1 P[wθn(X) − wθ0(X)]²,

from which we can deduce that both

J²(ηn)/(1 + J(ηn)) = OP(1) + OP(n^{(2k−1)/[2(2k+1)]}) × [ ‖(wθn(X) − wθ0(X))/(1 + J(ηn))‖_{P,2}^{1−1/(2k)} ∨ n^{−(2k−1)/[2(2k+1)]} ]    (15.4)

and

‖(wθn(X) − wθ0(X))/(1 + J(ηn))‖_{P,2}² = OP(λn²) + OP(n^{−1/2}) × [ ‖(wθn(X) − wθ0(X))/(1 + J(ηn))‖_{P,2}^{1−1/(2k)} ∨ n^{−(2k−1)/[2(2k+1)]} ].    (15.5)


Letting An ≡ n^{k/(2k+1)} (1 + J(ηn))^{−1} ‖wθn(X) − wθ0(X)‖_{P,2}, we obtain from (15.5) that An² = OP(1) + OP(1) An^{1−1/(2k)}. Solving this yields An = OP(1), which implies

‖wθn(X) − wθ0(X)‖_{P,2} / (1 + J(ηn)) = OP(n^{−k/(2k+1)}).

Applying this to (15.4) now yields J(ηn) = OP(1), which in turn implies ‖wθn(X) − wθ0(X)‖_{P,2} = OP(n^{−k/(2k+1)}), and the proof is complete.
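The "solving" step above rests on a deterministic fact: any A ≥ 0 satisfying A² ≤ C(1 + A^{1−1/(2k)}) is bounded by a constant depending only on C and k, because the quadratic term eventually dominates the exponent 1 − 1/(2k) < 2. A quick numerical illustration (the values of C and k below are our own arbitrary choices):

```python
def bound_on_A(C, k, hi=1e6):
    """Root of A**2 = C * (1 + A**(1 - 1/(2k))), located by bisection; any
    A >= 0 satisfying A**2 <= C * (1 + A**(1 - 1/(2k))) lies below it."""
    g = lambda A: A ** 2 - C * (1 + A ** (1 - 1.0 / (2 * k)))
    assert g(hi) > 0  # the A**2 term wins for large A since 1 - 1/(2k) < 2
    lo = 0.0          # g(0) = -C < 0, so a sign change is bracketed
    for _ in range(200):
        mid = (lo + hi) / 2
        lo, hi = (mid, hi) if g(mid) < 0 else (lo, mid)
    return hi

print(bound_on_A(5.0, 2))  # a finite bound, about 4.5 here
```

Increasing C enlarges the bound, but it stays finite for every fixed C and k, which is exactly why An = OP(1) follows.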

Lemma 15.2 For a probability measure P, let F1 be a class of measurable functions f : X → R, and let F2 denote a class of nondecreasing functions f2 : R → [0, 1] that are measurable for every probability measure. Then,

log N[](ε, F2(F1), L2(P)) ≤ 2 log N[](ε/3, F1, L2(P)) + sup_Q log N[](ε/3, F2, L2(Q)),

for all ε > 0, where the supremum is over all probability measures Q.

Proof. Fix ε > 0, and let [fk, gk], 1 ≤ k ≤ n1, be a minimal L2(P) bracketing ε/3-cover for F1, where fk is the lower and gk the upper boundary function of the bracket. For each fk, construct a minimal L2(Qk,1) bracketing ε/3-cover for F2, where Qk,1 is the distribution of fk(X). Let n2 = sup_Q N[](ε/3, F2, L2(Q)), and choose a corresponding minimal cover [fk,j,1, gk,j,1], 1 ≤ j ≤ n2. Construct a similar cover [fk,j,2, gk,j,2], 1 ≤ j ≤ n2, with respect to the distribution of gk(X), for each 1 ≤ k ≤ n1.

Let h1 ∈ F1 and h2 ∈ F2 be arbitrary; let [fk, gk] be the bracket containing h1; let [fk,j,1, gk,j,1] be the bracket containing h2(fk); and let [fk,j,2, gk,j,2] be the bracket containing h2(gk). Then [fk,j,1(fk), gk,j,2(gk)] is an L2(P) ε-bracket which satisfies fk,j,1(fk) ≤ h2(fk) ≤ h2(h1) ≤ h2(gk) ≤ gk,j,2(gk). Thus, since h1 and h2 were arbitrary, the number of L2(P) ε-brackets needed to completely cover F2(F1) is bounded by

N[]²(ε/3, F1, L2(P)) × sup_Q N[](ε/3, F2, L2(Q)),

and the desired result follows.

Theorem 15.3 Let F be a uniformly bounded class of measurable functions such that, for some measurable f0, sup_{f∈F} ‖f − f0‖∞ < ∞. Moreover, assume that log N[](ε, F, L2(P)) ≤ K0 ε^{−α} for some K0 < ∞ and α ∈ (0, 2) and for all ε > 0. Then

sup_{f∈F} [ |(Pn − P)(f − f0)| / ( ‖f − f0‖_{P,2}^{1−α/2} ∨ n^{−(2−α)/[2(2+α)]} ) ] = OP(n^{−1/2}).

Proof. This is a result presented on page 79 of van de Geer (2000) and follows from Lemma 5.13 on the same page, the proof of which can be found on pages 79–80.


15.2 The Two-Parameter Cox Score Process

In this section, we apply the results of Section 11.4 to study the two-parameter Cox score process in a sequential clinical trial with staggered entry of patients. The material of this section comes from Section 4 of Kosorok (2003). We will make the somewhat weak assumption that patients are independent but not necessarily i.i.d. Such a generalization is necessary if an adaptive design, such as a biased coin design (see Wei, 1978), is used rather than a simple i.i.d. randomized design. The two time parameters are the calendar time of entry and the time since entry for each patient. Sellke and Siegmund (1983) made a fundamental breakthrough on this problem, and important work on this problem has also been done by Slud (1984) and Gu and Lai (1991). The most recent work on this problem, and the work upon which we will build, is that of Bilias, Gu, and Ying (1997) (hereafter abbreviated BGY).

For each patient, there is an entry time τi ≥ 0, a continuous failure time Xi ≥ 0, a censoring time Ci ≥ 0, and a real-valued covariate process Zi = {Zi(s), s ≥ 0}, 1 ≤ i ≤ n, n ≥ 1. Although the results we present are valid when Zi is vector valued, we will assume that Zi is scalar valued for simplicity. The entry time is on the calendar time scale, while the remaining quantities are on the time-since-entry scale. As is done in BGY, we will assume that the quadruples (τi, Xi, Ci, Zi), i = 1, . . . , n, are independent (but not identically distributed) and that the conditional hazard rate of Xi at s, given τi, Ci, and Zi(u) for u ≤ s, has the Cox proportional hazards form exp[βZi(s)]λ0(s), for some unknown baseline hazard λ0. Thus, at calendar time x, the ith individual's failure time is censored at Ci ∧ (x − τi)+, where for a real number u, u+ = u1{u ≥ 0}. At any calendar time x, what we actually observe under possible right censoring is Ui(x) = Xi ∧ Ci ∧ (x − τi)+ and ∆i(x) = 1{Xi ≤ Ci ∧ (x − τi)+}. For x ≥ s, denote Ni(x, s) = ∆i(x)1{Ui(x) ≤ s}, Yi(x, s) = 1{Ui(x) ≥ s}, and

Z̄(β; x, s) = [ ∑_{i=1}^n Zi(s) exp[βZi(s)] Yi(x, s) ] / [ ∑_{i=1}^n exp[βZi(s)] Yi(x, s) ].

The statistic of interest we will focus on is

Wn(x, s) = n^{−1/2} ∑_{i=1}^n ∫_0^s [Zi(u) − Z̄(β0; x, u)] Ni(x, du)

under the null hypothesis that β0 is the true value of β in the foregoing Cox proportional hazards model. For ease of exposition, we will assume throughout that β0 is known. It is easy to show that Wn(x, x) is the partial likelihood score at calendar time x and at β = β0. When Zi is a dichotomous treatment indicator, Wn(x, x) is the well-known two-sample log-rank statistic and is a good choice for testing H0 : β = β0


versus HA : β ≠ β0. However, for certain other alternative hypotheses, the supremum version of Wn, sup_{s∈(0,x]} |Wn(x, s)|, is a more powerful statistic (see Fleming, Harrington, and O'Sullivan, 1987). Determining the asymptotic behavior of this last statistic requires weak convergence results, whether continuous or group sequential interim monitoring is being done. Let D(x0) = {(x, s) : 0 ≤ s ≤ x ≤ x0}, where x0 satisfies some assumptions given below. Under several assumptions, which we will review later, BGY demonstrate that Wn(x, s) converges weakly in the uniform topology on ℓ∞(D(x0)) to a tight mean zero Gaussian process W with a covariance which we will denote H.

Unfortunately, the complexity of the distribution of W precludes it from being used directly to compute critical boundaries, especially for the supremum version of the test statistic, during clinical trial execution. We will now show that the results of Section 11.4 can be used to obtain an asymptotically valid Monte Carlo estimate of the distribution of W for this purpose, something like a "wild bootstrap." We now present the assumptions given by BGY:

(a) x0 < ∞ satisfies lim inf_{n→∞} n^{−1} ∑_{i=1}^n E Yi(x0, x0) > 0;

(b) (Condition 1 in BGY) There exists a nonrandom B < ∞ such that the total variation |Zi(0)| + ∫_0^{x0} |Zi(du)| ≤ B;

(c) (Condition 2 in BGY) For k = 0, 1, 2, there exists Γk(x, s) such that, for all (x, s) ∈ D(x0),

lim_{n→∞} n^{−1} ∑_{i=1}^n E[ Zi^k(s) Yi(x, s) exp(β0Zi(s)) ] = Γk(x, s);

(d) (Condition 3 in BGY) Letting K(x, s) = Γ1(x, s)/Γ0(x, s) and

Kn(x, s) = [ ∑_{i=1}^n E[Zi(s)Yi(x, s) exp(β0Zi(s))] ] / [ ∑_{i=1}^n E[Yi(x, s) exp(β0Zi(s))] ],

then, for each fixed s, K(·, s) is continuous on [s, x0] and

lim_{n→∞} sup_{0≤x≤x0} ∫_0^x [Kn(x, s) − K(x, s)]² ds = 0.

In the proof of their Theorem 2.2, BGY establish that Wn(x, s) converges in probability to W̄n(x, s) ≡ ∑_{i=1}^n fni(ω, (x, s)), where

fni(ω, (x, s)) = n^{−1/2} ∫_0^s [Zi(u) − Kn(x, u)] Mi(x, du),    (15.6)

Mi(x, s) = Ni(x, s) − ∫_0^s Yi(x, u) exp(β0Zi(u)) λ0(u) du, ω ∈ Ω, and where the data for this statistic come from the probability space (Ω, W, Π). BGY


establish that all of the conditions of Pollard's (1990) functional central limit theorem are satisfied with measurable envelope functions; hence all of the conditions of Theorem 11.16 are satisfied, except for the "sufficient measurability" condition. It is actually not clear how all the necessary measurability conditions are addressed by BGY; however, it is not difficult to show that the inherent double right-continuity of both W̄n(x, s) and all of the array components fni(ω, (x, s)) gives us separability (which implies AMS by Lemma 11.15) with T = D(x0) and Tn = D(x0) ∩ [Q ∪ {x0}]², where Q is the set of rationals.

All of the conditions of Theorem 11.16 are thus satisfied, and W̄n converges weakly to a tight, mean zero Gaussian process W. Let {zi, i ≥ 1} be a random sequence satisfying Condition (F) of Section 11.4.2 and which is independent of the data generating W̄n. Since we usually do not have exact knowledge of either Kn or the Mi's, we will need to find appropriate estimators and then apply Theorem 11.18 to obtain a Monte Carlo method of estimating the distribution of W based only on the data from the clinical trial. In this setting, μni((x, s)) = 0. However, if we use the estimator

μ̂ni(ω, (x, s)) = n^{−1/2} ∫_0^s [Z̄(β0; x, u) − Kn(x, u)] Mi(x, du)    (15.7)
  − n^{−1/2} ∫_0^s [Zi(u) − Z̄(β0; x, u)] [M̂i(x, du) − Mi(x, du)],

where

M̂i(x, du) = Mi(x, du) − Yi(x, u) exp(β0Zi(u)) [ ∑_{j=1}^n Mj(x, du) ] / [ ∑_{j=1}^n Yj(x, u) exp(β0Zj(u)) ],

then

fni(ω, (x, s)) − μ̂ni(ω, (x, s)) = n^{−1/2} ∫_0^s [Zi(u) − Z̄(β0; x, u)] M̂i(x, du)
  = n^{−1/2} ∫_0^s [Zi(u) − Z̄(β0; x, u)] [ Ni(x, du) − Yi(x, u) exp(β0Zi(u)) ( ∑_{j=1}^n Nj(x, du) ) / ( ∑_{j=1}^n Yj(x, u) exp(β0Zj(u)) ) ]

can be computed from the data. If we can next establish that the array μ̂ni satisfies the conditions of

Theorem 11.18, we will then have a Monte Carlo method of estimating the distribution of W based on the conditional distribution of the process

W̃n(x, s) = ∑_{i=1}^n zi [fni(ω, (x, s)) − μ̂ni(ω, (x, s))]


given the data from the clinical trial. To this end, we have the following theorem:

Theorem 15.4 In the two-parameter Cox score process setting with staggered entry of patients, with Assumptions (a) through (d) satisfied, the triangular array of estimators μ̂ni, of the form given in (15.7), satisfies Conditions (G) through (I) of Theorem 11.18.

Proof. Because the processes μ̂ni possess the same double right-continuity that W̄n possesses, the separability, and hence AMS, condition is readily satisfied. Because each μ̂ni can be written as a difference between a monotone increasing and a monotone decreasing part with envelope Fni(ω), we have manageability, since sums of monotone processes have pseudodimension one (see Lemma A.2 of BGY) and since sums of manageable processes are manageable. Since Assumption (a) implies Λ0(x0) ≡ ∫_0^{x0} λ0(u)du < ∞, and also because of the bounded total variation of Zi, we can use envelopes Fni(ω) = Cn^{−1/2}, where there exists a k < ∞ not depending on n such that C ∨ k converges in probability to k as n → ∞. Hence Conditions (H) and (I) of Theorem 11.18 are satisfied. All that remains to be shown is that sup_{(x,s)∈T} ∑_{i=1}^n μ̂ni²(ω, (x, s)) converges to zero in probability.

Clearly,

sup_{(x,s)∈T} ∑_{i=1}^n μ̂ni²(ω, (x, s))    (15.8)
  ≤ 2 sup_{(x,s)∈T} n^{−1} ∑_{i=1}^n ( ∫_0^x [Z̄(β0; x, u) − Kn(x, u)] Mi(x, du) )²
  + 2 sup_{(x,s)∈T} n^{−1} ∑_{i=1}^n ( ∫_0^x [Zi(u) − Z̄(β0; x, u)] Yi(x, u) exp(β0Zi(u)) M̄(x, du) )²,

where

M̄(x, du) ≡ [ ∑_{i=1}^n Yi(x, u) exp(β0Zi(u)) ]^{−1} [ ∑_{i=1}^n Mi(x, du) ].

However, since Λ0(x0) < ∞ and since Mi is the difference between two nonnegative monotone functions bounded by Λ0(x0), the first term on the right-hand side of (15.8) is bounded by k1 sup_{(x,s)∈T} |Z̄(β0; x, s) − Kn(x, s)|², for some k1 < ∞; thus this term vanishes in probability as n → ∞. Using integration by parts combined with the fact that the total variation of Zi is bounded, it can be shown that there exist constants c1 < ∞ and c2 < ∞ such that the second term on the right-hand side of (15.8) is bounded by


sup_{(x,s)∈T} n^{−1} ∑_{i=1}^n ( 2c1 sup_{u∈[0,s]} |M̄(x, u)| + 2c2 sup_{u∈[0,s]} |∫_0^u Z̄(β0; x, v) M̄(x, dv)| )²
  ≤ 4c1² sup_{(x,s)∈T} |M̄(x, s)|² + 4c2² sup_{(x,s)∈T} |∫_0^s Z̄(β0; x, u) M̄(x, du)|²;

and both of these terms can be shown to converge to zero in probability, as n → ∞, by using arguments contained in BGY.
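In practice, the conditional distribution of the multiplier process ∑_i zi[fni − μ̂ni] given the data is approximated by Monte Carlo: the computable summands are held fixed while the multipliers zi are redrawn many times, and empirical quantiles of the supremum serve as critical values. A minimal sketch (the data structures here are our own generic illustration; i.i.d. standard normal multipliers are one choice satisfying Condition (F)):

```python
import random

def multiplier_sup_draws(summands, n_draws=1000, seed=0):
    """summands[i][j] holds the computable quantity f_ni - mu_ni at the j-th
    grid point (x, s); returns n_draws simulated values of
    sup_j | sum_i z_i * summands[i][j] |, with fresh multipliers each draw."""
    rng = random.Random(seed)
    n, m = len(summands), len(summands[0])
    sups = []
    for _ in range(n_draws):
        z = [rng.gauss(0.0, 1.0) for _ in range(n)]  # mean-zero, variance-one
        proc = [sum(z[i] * summands[i][j] for i in range(n)) for j in range(m)]
        sups.append(max(abs(v) for v in proc))
    return sups  # empirical quantiles of these approximate critical values
```

For instance, the 95th percentile of the returned draws gives an asymptotically valid critical boundary for the supremum statistic sup |Wn(x, s)|.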

15.3 The Proportional Odds Model Under Right Censoring

In the right-censored regression set-up, we observe X = (U, δ, Z), where U = T ∧ C, δ = 1{U = T}, Z ∈ Rd is a covariate vector, T is a failure time of interest, and C is a right censoring time. We assume that C and T are independent given Z. The proportional odds regression model stipulates that the survival function of T given Z has the form

SZ(t) = ( 1 + e^{β'Z} A(t) )^{−1},    (15.9)

where t → A(t) is nondecreasing on [0, τ], with A(0) = 0 and τ < ∞ being the upper limit of the censoring distribution, i.e., we assume that P(C > τ) = 0. We also assume that P(C = τ) > 0, that var[Z] is positive definite, and that the distributions of Z and C are uninformative of SZ.

Let the true parameter values be denoted β0 and A0. We make the additional assumptions that the support of Z is compact and that β0 lies in a known compact subset of Rd. Murphy, Rossini and van der Vaart (1997) develop asymptotic theory for maximum likelihood estimation of this model under general conditions for A0 which permit ties in the failure time distribution. To simplify the exposition, we will make stronger assumptions on A0 in order to facilitate arguments similar to those used in Lee (2000) and Kosorok, Lee and Fine (2004). Specifically, we assume that A0 has a derivative a0 that satisfies 0 < inf_{t∈[0,τ]} a0(t) ≤ sup_{t∈[0,τ]} a0(t) < ∞.

Let FZ ≡ 1 − SZ. For distinct covariate values Z1 and Z2, we can deduce

from (15.9) that

[ FZ1(t) SZ2(t) ] / [ FZ2(t) SZ1(t) ] = e^{β'Z1} / e^{β'Z2},

which justifies the "proportional odds" designation. A motivation for this model is that in some settings it can be easier to justify on scientific grounds than other common alternatives such as the proportional hazards or accelerated failure time models (Murphy, Rossini and van der Vaart, 1997).
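The deduction from (15.9) is one line: under the model, the odds of failure by time t are

```latex
\frac{F_Z(t)}{S_Z(t)}
  = \frac{1-\bigl(1+e^{\beta'Z}A(t)\bigr)^{-1}}{\bigl(1+e^{\beta'Z}A(t)\bigr)^{-1}}
  = e^{\beta'Z}A(t),
\qquad\text{so}\qquad
\frac{F_{Z_1}(t)\,S_{Z_2}(t)}{F_{Z_2}(t)\,S_{Z_1}(t)}
  = \frac{e^{\beta'Z_1}A(t)}{e^{\beta'Z_2}A(t)}
  = \frac{e^{\beta'Z_1}}{e^{\beta'Z_2}}.
```

The odds ratio is free of t, which is exactly the proportional odds property.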

Define the composite model parameter θ ≡ (β, A). In the following sections, we derive the nonparametric maximum likelihood estimator (NPMLE) θn, prove its existence, establish consistency and weak convergence, and verify that the bootstrap procedure is valid for all model parameters. Certain score and information operators will be needed for the weak convergence component, and these will be introduced just before we establish weak convergence but after we have proven consistency. While β is assumed to lie in a known compact set, we make no such restrictions on A for estimation. Hence the NPMLE An might be unbounded: for this reason, we need to verify that An(τ) = OP(1) and thus that θn "exists."

15.3.1 Nonparametric Maximum Likelihood Estimation

The likelihood for a sample of n i.i.d. observations (U1, δ1, Z1), . . . , (Un, δn, Zn) is

ℓn(θ) = Pn[ δ(log a(U) + β'Z) − (1 + δ) log(1 + e^{β'Z}A(U)) ],

where a is the derivative of A. As discussed in Murphy (1994), maximizing ℓn over A for fixed β results in a maximizer that is piecewise constant with jumps at the observed failure times and thus does not have a continuous density. To address this issue, Murphy (1994) and Parner (1998) suggest replacing a(u) in ℓn with n∆A(u), the jump size of A at u; the resulting modified "empirical log-likelihood" we will denote Ln(θ). We will show later that when A0 is continuous, the step sizes of the maximizer over A of Ln will go to zero as n → ∞.

The procedure we will use to estimate θ is to first maximize the profile log-likelihood pLn(β) ≡ sup_A Ln(β, A) to obtain βn. The associated maximizer over A we will denote An. In other words, An = Aβn, where Aβ ≡ argmax_A Ln(β, A). We also define θβ ≡ (β, Aβ). Obviously θn ≡ θβn is just the joint maximizer of Ln(θ). To characterize Aβ, consider one-dimensional submodels for A defined by the map

t → At ≡ ∫_0^{(·)} (1 + t h1(s)) dA(s),

where h1 is an arbitrary cadlag function on [0, τ] of bounded total variation. The derivative of Ln(β, At) with respect to t, evaluated at t = 0, is the score function for A:

(15.10)

V^τ_{n,2}(θ)(h1) ≡ Pn[ ∫_0^τ h1(s) dN(s) − (1 + δ) ( e^{β'Z} ∫_0^{U∧τ} h1(s) dA(s) ) / ( 1 + e^{β'Z} A(U ∧ τ) ) ],

where the subscript "2" denotes that this is the score of the second parameter A. The dependence on τ will prove useful in later sections, but,


for now, it can be ignored since P(U ≤ τ) = 1 by assumption. Choose h1(u) = 1{u ≤ t}, insert into (15.10), and equate the result to zero to obtain

Pn N(t) = Pn[ (1 + δ) e^{β'Z} ∫_0^t Y(s) dAβ(s) / ( 1 + e^{β'Z} Aβ(U) ) ],    (15.11)

where N(t) ≡ δ1{U ≤ t} and Y(t) ≡ 1{U ≥ t} are the usual counting and at-risk processes for right-censored survival data.

Next define

W(t; θ) ≡ (1 + δ) e^{β'Z} Y(t) / ( 1 + e^{β'Z} A(U) )

and solve (15.11) to obtain

Aβ(t) = ∫_0^t [ Pn W(s; θβ) ]^{−1} Pn dN(s).    (15.12)

Thus Aβ can be characterized as a stationary point of (15.12). This structure will prove useful in later developments and can also be used to calculate θn from data. One approach to accomplishing this is to first use (15.12) to facilitate calculating pLn(β), so that βn can be determined via a simple search algorithm, and then to take An to be the solution of (15.12) corresponding to β = βn. In the case of multiple solutions, we take the one corresponding to the maximizer of A → Ln(βn, A).

15.3.2 Existence

While we are assuming that β lies in a known compact subset of Rd, we are not setting boundedness restrictions on A. Fortunately, such restrictions are not necessary, as can be seen in the following lemma, which is the contribution of this section:

Lemma 15.5 Under the given conditions, lim sup_{n→∞} An(τ) < ∞ with inner probability one.

Proof. Note that

‖PnN − Q0‖∞ as∗→ 0 and sup_{u≥0} |(Pn − P)1{U ≤ u}| as∗→ 0,    (15.13)

where Q0 ≡ PN. Let X1, X2, . . . be a fixed data sequence satisfying (15.13). Note that, without loss of generality, all data sequences can be assumed to satisfy this, since the set of such sequences has probability 1 by the definition of outer almost sure convergence.

Under this set-up, we can treat the maximum likelihood estimators as a sequence of fixed quantities. By the assumed compactness of the parameter space for β, there exists a subsequence of {n} along which βn converges to a bounded vector β∗. Now choose a further subsequence {nk} along which both βnk → β∗ ∈ Rd and Ank(τ) → ∞. We will now work towards a contradiction.

work towards a contradiction.Let θn ≡ (β0, PnN), and note that, by definition of the NPMLE,

0 ≤ Lnk(θn)− Lnk

(θn)(15.14)

≤ O(1) +

∫ τ

0

log(n∆Ank(s))Pnk

dN(s)

−Pnk

[

(1 + δ) log(1 + Ank(U))

]

.

Let u0, u1, . . . , uM be a partition of [0, τ] for some finite M, with 0 = u0 < u1 < · · · < uM = τ, which we will specify in more detail shortly, and define Nj(s) ≡ N(s)1{U ∈ [uj−1, uj]} for 1 ≤ j ≤ M. Now by Jensen's inequality,

∫_0^τ log(nk∆Ank(s)) Pnk dNj(s) ≤ Pnk Nj(τ) log( [ ∫_0^{uj} nk∆Ank(s) dPnk Nj(s) ] / Pnk Nj(τ) )
  ≤ O(1) + log(Ank(uj)) Pnk δ1{U ∈ [uj−1, uj]}.

Thus the right side of (15.14) is dominated by

(15.15)

O(1) + ∑_{j=1}^{M−1} log Ank(uj) × Pnk( δ1{U ∈ [uj−1, uj]} − (1 + δ)1{U ∈ [uj, uj+1]} )
  + log Ank(τ) Pnk( δ1{U ∈ [uM−1, ∞]} − (1 + δ)1{U ∈ [τ, ∞]} ).

Without loss of generality, assume Q0(τ) > 0. Because of the assumptions, we can choose u0, u1, . . . , uM with M < ∞ such that for some η > 0,

P( (1 + δ)1{U ∈ [τ, ∞]} ) ≥ η + P( δ1{U ∈ [uM−1, ∞]} )

and

P( (1 + δ)1{U ∈ [uj, uj+1]} ) ≥ η + P( δ1{U ∈ [uj−1, uj]} ),

for all 1 ≤ j ≤ M − 1. Hence

(15.15) ≤ −(η + o(1)) log Ank(τ) → −∞,

as k → ∞, which yields the desired contradiction. Thus the conclusions of the lemma follow, since the data sequence was arbitrary.


15.3.3 Consistency

In this section, we prove uniform consistency of θn. Let Θ ≡ B0 × A be the parameter space for θ, where B0 ⊂ Rd is the known compact set containing β0 and A is the collection of all monotone increasing functions A : [0, τ] → [0, ∞] with A(0) = 0. The following is the main result of this section:

Theorem 15.6 Under the given conditions, θn as∗→ θ0.

Proof. Define θ̃n = (β0, Ãn), where Ãn ≡ ∫_0^{(·)} [PW(s; θ0)]^{−1} Pn dN(s). Note that

(15.16)

Ln(θn) − Ln(θ̃n) = ∫_0^τ log[ PW(s; θ0) / PnW(s; θn) ] Pn dN(s) + (βn − β0)′ Pn ∫_0^τ Z dN(s)
  − Pn[ (1 + δ) log( ( 1 + e^{βn′Z} An(U) ) / ( 1 + e^{β0′Z} Ãn(U) ) ) ].

By Lemma 15.7 below, (Pn − P)W(t; θn) as∗→ 0, uniformly in t. Combining this with Lemma 15.5 yields that lim inf_{n→∞} inf_{t∈[0,τ]} PnW(t; θn) > 0 and that the lim sup_{n→∞} of the total variation of t → [PW(t; θn)]^{−1} is < ∞ with inner probability one. Since { ∫_0^t g(s)dN(s) : t ∈ [0, τ], g ∈ D[0, τ], the total variation of g ≤ M } is Donsker for every M < ∞, we now have

∫_0^τ log[ PW(s; θ0) / PnW(s; θn) ] Pn dN(s) − ∫_0^τ log[ PW(s; θ0) / PW(s; θn) ] dQ0(s) as∗→ 0.    (15.17)

Combining Lemma 15.5 with the fact that

{ (1 + δ) log(1 + e^{β'Z}A(U)) : θ ∈ Θ, A(τ) ≤ M }

is Glivenko-Cantelli for each M < ∞ yields

(Pn − P)[ (1 + δ) log( ( 1 + e^{βn′Z} An(U) ) / ( 1 + e^{β0′Z} A0(U) ) ) ] as∗→ 0.    (15.18)

Now combining results (15.17) and (15.18) with (15.16), we obtain that

(15.19)

Ln(θn) − Ln(θ̃n) − ∫_0^τ log[ PW(s; θ0) / PW(s; θn) ] dQ0(s) − (βn − β0)′ P[Zδ]
  + P[ (1 + δ) log( ( 1 + e^{βn′Z} An(U) ) / ( 1 + e^{β0′Z} A0(U) ) ) ] as∗→ 0.


Now select a fixed sequence X1, X2, . . . for which the previous convergence results hold, and note that such sequences occur with inner probability one. Reapplying (15.12) yields

lim sup_{n→∞} sup_{s,t∈[0,τ]} [ |An(s) − An(t)| / |Pn(N(s) − N(t))| ] < ∞.

Thus there exists a subsequence {nk} along which both ‖Ank − Ā‖∞ → 0 and βnk → β̄, for some θ̄ = (β̄, Ā), where Ā is both continuous and bounded.

0 ≤ Lnk(θnk

)− Lnk(θn) → P0 log

[

dPθ

dP0

]

≤ 0,

where Pθ is the probability measure of a single observation on the specifiedmodel at parameter value θ and where P0 ≡ Pθ0 . Since the “model is

identifiable” (see Exercise 15.6.2 below), we obtain that θn → θ0 uniformly.Since the sequence X1, X2, . . . was an arbitrary representative from a setwith inner probability one, we obtain that θn → θ0 almost surely.

Since An is a piecewise constant function with jumps ∆An only at ob-served failure times t1, . . . , tmn , θn is a continuous function of a maximumtaken over mn+d real variables. This structure implies that supt∈[0,τ ] |An(t)−A0(t)| is a measurable random variable, and hence the uniform distance

between θn and θ0 is also measurable. Thus the almost sure convergencecan be strengthened to the desired outer almost sure convergence.

Lemma 15.7 The class of functions {W(t; θ) : t ∈ [0, τ], θ ∈ Θ} is P-Donsker.

Proof. It is fairly easy to verify that

F1 ≡ { (1 + δ) e^{β'Z} Y(t) : t ∈ [0, τ], β ∈ B0 }

is a bounded P-Donsker class. If we can also verify that

F2 ≡ { ( 1 + e^{β'Z} A(U) )^{−1} : θ ∈ Θ }

is P-Donsker, then we are done, since the product of two bounded Donsker classes is also Donsker. To this end, let φ : R² → R be defined by

φ(x, y) = (1 − y) / (1 − y + e^x y),

and note that φ is Lipschitz continuous on sets of the form [−k, k] × [0, 1], with a finite Lipschitz constant depending only on k, for all k < ∞ (see Exercise 15.6.3 below). Note also that

F2 = { φ( β'Z, A(U)/(1 + A(U)) ) : θ ∈ Θ }.

Clearly, {β'Z : β ∈ B0} is Donsker with range contained in [−k0, k0] for some k0 < ∞ by the given conditions. Moreover, { A(U)(1 + A(U))^{−1} : A ∈ A } is a subset of the class of all monotone, increasing functions with range in [0, 1] and thus, by Theorem 9.24, is Donsker. Hence, by Theorem 9.31, F2 is P-Donsker, and the desired conclusions follow.
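The substitution behind the identity F2 = {φ(β'Z, A(U)/(1 + A(U))) : θ ∈ Θ} is worth recording: with y = A(U)/(1 + A(U)), so that 1 − y = 1/(1 + A(U)),

```latex
\phi\Bigl(\beta'Z,\ \tfrac{A(U)}{1+A(U)}\Bigr)
  = \frac{\frac{1}{1+A(U)}}{\frac{1}{1+A(U)}+\frac{e^{\beta'Z}A(U)}{1+A(U)}}
  = \frac{1}{1+e^{\beta'Z}A(U)}.
```

Since y → y/(1 − y) maps [0, 1) onto [0, ∞), the Lipschitz composition argument covers every A ∈ A with A(U) finite.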

15.3.4 Score and Information Operators

Because we have an infinite dimensional parameter A, we need to take care with the score and information operator calculations. The overall idea is that we need these operators in order to utilize the general Z-estimator convergence theorem (Theorem 2.11) in the next section to establish asymptotic normality of √n(θn − θ0) and obtain validity of the bootstrap. To facilitate the development of these operators, let H denote the space of elements h = (h1, h2) with h1 ∈ Rd and h2 ∈ D[0, τ] of bounded variation. We supply H with the norm ‖h‖H ≡ ‖h1‖ + ‖h2‖v, where ‖ · ‖ is the Euclidean norm and ‖ · ‖v is the total variation norm.

Define $\mathcal{H}_p \equiv \{h\in\mathcal{H} : \|h\|_{\mathcal{H}} \le p\}$, where the inequality is strict when $p=\infty$. The parameter $\theta$ can now be viewed as an element of $\ell^\infty(\mathcal{H}_p)$ if we define
$$\theta(h) \equiv h_1'\beta + \int_0^\tau h_2(s)\,dA(s),\quad h\in\mathcal{H}_p,\ \theta\in\Theta.$$

Note that $\mathcal{H}_1$ is sufficiently rich to be able to extract all components of $\theta$. For example, $h=(h_1,0)=((0,1,0,\ldots)',0)$ extracts the second component of $\beta$, i.e., $\beta_2=\theta(h)$, while $h_{*,u}=(0,1\{(\cdot)\le u\})$ extracts $A(u)$, i.e., $A(u)=\theta(h_{*,u})$. As a result, the parameter space $\Theta$ becomes a subset of $\ell^\infty(\mathcal{H}_p)$ with norm $\|\theta\|_{(p)} \equiv \sup_{h\in\mathcal{H}_p}|\theta(h)|$. We can study weak convergence in the uniform topology of $\hat\theta_n$ via this functional representation of the parameter space since, for all $1\le p<\infty$ and every $\theta\in\Theta$, $\|\theta\|_\infty \le \|\theta\|_{(p)} \le 4p\|\theta\|_\infty$ (see Exercise 15.6.4).

We now calculate the score operator, which will become the Z-estimating equation to which we will apply the Z-estimator convergence theorem. Consider the one-dimensional submodel defined by the map
$$t \mapsto \theta_t \equiv \theta + t\left(h_1,\ \int_0^{(\cdot)} h_2(s)\,dA(s)\right),\quad h\in\mathcal{H}_p.$$

The score operator has the form
$$V_n^\tau(\theta)(h) \equiv \left.\frac{\partial}{\partial t} L_n(\theta_t)\right|_{t=0} = V_{n,1}^\tau(\theta)(h_1) + V_{n,2}^\tau(\theta)(h_2),$$

where


$$V_{n,1}^\tau(\theta)(h_1) \equiv \mathbb{P}_n\left\{h_1'Z\,N(\tau) - (1+\delta)\left[\frac{h_1'Z\,e^{\beta'Z}A(U\wedge\tau)}{1+e^{\beta'Z}A(U\wedge\tau)}\right]\right\},$$

and $V_{n,2}^\tau(\theta)(h_2)$ is defined by replacing $h_1$ with $h_2$ in (15.10). As mentioned earlier, we will need to utilize the dependence on $\tau$ at a later point. Now we have that the NPMLE $\hat\theta_n$ can be characterized as a zero of the map $\theta\mapsto\Psi_n(\theta)=V_n^\tau(\theta)$, and thus $\hat\theta_n$ is a Z-estimator with the estimating equation residing in $\ell^\infty(\mathcal{H}_\infty)$.

The expectation of $\Psi_n$ is $\Psi \equiv P V^\tau$, where $V^\tau$ equals $V_1^\tau$ (i.e., $V_n^\tau$ with $n=1$), with $X_1$ replaced by a generic observation $X$. Thus $V^\tau$ also satisfies $V_n^\tau = \mathbb{P}_n V^\tau$. The Gâteaux derivative of $\Psi$ at any $\theta_1\in\Theta$ exists and is obtained by differentiating over the submodels $t\mapsto\theta_1+t\theta$. This derivative is
$$\dot\Psi_{\theta_1}(\theta)(h) \equiv \left.\frac{\partial}{\partial t}\Psi(\theta_1+t\theta)(h)\right|_{t=0} = -\theta(\sigma_{\theta_1}(h)),$$

where the operator $\sigma_\theta:\mathcal{H}_\infty\to\mathcal{H}_\infty$ can be shown to be

$$\sigma_\theta(h) = \begin{pmatrix} \sigma_\theta^{11} & \sigma_\theta^{12}\\[2pt] \sigma_\theta^{21} & \sigma_\theta^{22}\end{pmatrix}\begin{pmatrix}h_1\\ h_2\end{pmatrix} \equiv P\begin{pmatrix} \hat\sigma_\theta^{11} & \hat\sigma_\theta^{12}\\[2pt] \hat\sigma_\theta^{21} & \hat\sigma_\theta^{22}\end{pmatrix}\begin{pmatrix}h_1\\ h_2\end{pmatrix} \equiv P\hat\sigma_\theta(h),\tag{15.20}$$

where
$$\begin{aligned}
\hat\sigma_\theta^{11}(h_1) &= \xi_\theta A(U)\,ZZ'h_1,\\
\hat\sigma_\theta^{12}(h_2) &= \xi_\theta Z\int_0^\tau Y(s)h_2(s)\,dA(s),\\
\hat\sigma_\theta^{21}(h_1)(s) &= \xi_\theta Y(s)Z'h_1,\\
\hat\sigma_\theta^{22}(h_2)(s) &= \frac{(1+\delta)e^{\beta'Z}Y(s)h_2(s)}{1+e^{\beta'Z}A(U)} - \xi_\theta e^{\beta'Z}\int_0^\tau Y(u)h_2(u)\,dA(u)\,Y(s),
\end{aligned}$$
and
$$\xi_\theta \equiv \frac{(1+\delta)e^{\beta'Z}}{\left[1+e^{\beta'Z}A(U)\right]^2}.$$

We need to strengthen this Gâteaux differentiability of $\Psi$ to Fréchet differentiability, at least at $\theta=\theta_0$. This is accomplished in the following lemma:

Lemma 15.8 Under the given assumptions, the operator $\theta\mapsto\Psi(\theta)(\cdot)$, viewed as a map from $\ell^\infty(\mathcal{H}_p)$ to $\ell^\infty(\mathcal{H}_p)$, is Fréchet differentiable for each $p<\infty$ at $\theta=\theta_0$, with derivative $\theta\mapsto\dot\Psi_{\theta_0}(\theta) \equiv -\theta(\sigma_{\theta_0}(\cdot))$.


Before giving the proof of this lemma, we note that we also need to verify that $\dot\Psi_{\theta_0}$ is continuously invertible. The following theorem establishes this plus a little more.

Theorem 15.9 Under the given conditions, $\sigma_{\theta_0}:\mathcal{H}_\infty\to\mathcal{H}_\infty$ is continuously invertible and onto. Moreover, $\dot\Psi_{\theta_0}:\operatorname{lin}\Theta\to\operatorname{lin}\Theta$ is also continuously invertible and onto, with inverse $\theta\mapsto\dot\Psi_{\theta_0}^{-1}(\theta) \equiv -\theta(\sigma_{\theta_0}^{-1})$, where $\sigma_{\theta_0}^{-1}$ is the inverse of $\sigma_{\theta_0}$.

The "onto" property is not needed for the Z-estimator convergence theorem, but it will prove useful in Chapter 22, where we will revisit this example and show that $\hat\theta_n$ is asymptotically efficient. We conclude this section with the proofs of Lemma 15.8 and Theorem 15.9.

Proof of Lemma 15.8. By the smoothness of $\theta\mapsto\sigma_\theta(\cdot)$, we have
$$\lim_{t\downarrow 0}\ \sup_{\theta:\|\theta\|_{(p)}\le 1}\ \sup_{h\in\mathcal{H}_p}\left|\int_0^1 \theta\left(\sigma_{\theta_0+ut\theta}(h)-\sigma_{\theta_0}(h)\right)du\right| = 0.$$

Thus
$$\sup_{h\in\mathcal{H}_p}\left|\Psi(\theta_0+\theta)(h)-\Psi(\theta_0)(h)+\theta(\sigma_{\theta_0}(h))\right| = o\left(\|\theta\|_{(p)}\right)$$

as $\|\theta\|_{(p)}\to 0$.

Proof of Theorem 15.9. From the explicit form of the operator $\sigma_{\theta_0}$ defined above, we have that $\sigma_{\theta_0}=\sigma_1+\sigma_2$, where
$$\sigma_1 \equiv \begin{pmatrix} I & 0\\ 0 & g_0(\cdot)\end{pmatrix},$$
where $I$ is the $d\times d$ identity matrix,
$$g_0(s) = P\left[\frac{(1+\delta)e^{\beta_0'Z}Y(s)}{1+e^{\beta_0'Z}A_0(U)}\right],$$
and where $\sigma_2\equiv\sigma_{\theta_0}-\sigma_1$. It is not hard to verify that $\sigma_2$ is a compact operator, i.e., that the range of $h\mapsto\sigma_2(h)$ over the unit ball in $\mathcal{H}_\infty$ lies within a compact set (see Exercise 15.6.6). Note that since $1/g_0$ has bounded total variation, we have for any $g=(g_1,g_2)\in\mathcal{H}$ that $g=\sigma_1(h)$, where $h=(g_1,\,g_2(\cdot)/g_0(\cdot))\in\mathcal{H}$. Thus $\sigma_1$ is onto. It is also true that
$$\|g_0(\cdot)h_2(\cdot)\|_{\mathcal{H}} \ge \left(\inf_{s\in[0,\tau]}|g_0(s)|\right)\|h_2\|_{\mathcal{H}} \ge c_0\|h_2\|_{\mathcal{H}},$$
and thus $\sigma_1$ is both continuously invertible and onto. If we can also verify that $\sigma_{\theta_0}$ is one-to-one, we then have by Lemma 6.17 that $\sigma_{\theta_0}=\sigma_1+\sigma_2$ is both continuously invertible and onto.


We will now verify that $\sigma_{\theta_0}$ is one-to-one by showing that for any $h\in\mathcal{H}_\infty$, $\sigma_{\theta_0}(h)=0$ implies that $h=0$. Fix an $h\in\mathcal{H}_\infty$ for which $\sigma_{\theta_0}(h)=0$, and define the one-dimensional submodel $t\mapsto\theta_{0t}=(\beta_{0t},A_{0t}) \equiv (\beta_0,A_0)+t\left(h_1,\int_0^{(\cdot)}h_2(s)\,dA_0(s)\right)$. Note that $\sigma_{\theta_0}(h)=0$ implies
$$-P\left[\left.\frac{\partial^2}{(\partial t)^2}\,\ell_n(\theta_{0t})\right|_{t=0}\right] = P\left[V^\tau(\theta_0)(h)\right]^2 = 0,\tag{15.21}$$

where we are using the original form of the likelihood $\ell_n$ instead of the modified form $L_n$ because the submodels $A_{0t}$ are differentiable for all $t$ small enough.

It can be verified that $V^u(\theta_0)(h)$ is a continuous-time martingale over $u\in[0,\tau]$ (here is where we need the dependency of $V^\tau$ on $\tau$). The basic idea is that $V^u(\theta_0)(h)$ can be reexpressed as
$$\int_0^u \left(\frac{\dot\lambda_{\theta_0}(s)}{\lambda_{\theta_0}(s)}\right)\left(h_1'Z + h_2(s)\right)dM(s),$$
where
$$\dot\lambda_{\theta_0} \equiv \left.\frac{\partial}{\partial t}\lambda_{\theta_{0t}}\right|_{t=0},\qquad \lambda_\theta(u) \equiv \frac{e^{\beta'Z}a_0(u)}{1+e^{\beta'Z}A(u)},$$
and where $M(u) = N(u) - \int_0^u Y(s)\lambda_{\theta_0}(s)\,ds$ is a martingale, since $\lambda_{\theta_0}$ is the correct hazard function for the failure time $T$ given $Z$. Thus $P[V^\tau(\theta_0)(h)]^2 = P[V^u(\theta_0)(h)]^2 + P[V^\tau(\theta_0)(h)-V^u(\theta_0)(h)]^2$ for all $u\in[0,\tau]$, and hence $P[V^u(\theta_0)(h)]^2 = 0$ for all $u\in[0,\tau]$. Thus $V^u(\theta_0)(h)=0$ almost surely, for all $u\in[0,\tau]$.

Hence, if we assume that the failure time $T$ is censored at some $U\in(0,\tau]$, we have almost surely that
$$\frac{e^{\beta_0'Z}\int_0^u \left(h_1'Z + h_2(s)\right)Y(s)\,dA_0(s)}{1+\int_0^u e^{\beta_0'Z}Y(s)\,dA_0(s)} = 0,$$
for all $u\in[0,\tau]$. Hence $\int_0^u (h_1'Z+h_2(s))Y(s)\,dA_0(s) = 0$ almost surely for all $u\in[0,\tau]$. Taking the derivative with respect to $u$ yields that $h_1'Z+h_2(u) = 0$ almost surely for all $u\in[0,\tau]$. This of course forces $h=0$, since $\operatorname{var}[Z]$ is positive definite. Thus $\sigma_{\theta_0}$ is one-to-one, since $h$ was an arbitrary choice satisfying $\sigma_{\theta_0}(h)=0$. Hence $\sigma_{\theta_0}$ is continuously invertible and onto, and the first result of the theorem is proved.

We now prove the second result of the theorem. Note that since $\sigma_{\theta_0}:\mathcal{H}_\infty\to\mathcal{H}_\infty$ is continuously invertible and onto, for each $p>0$ there is a $q>0$ such that $\sigma_{\theta_0}^{-1}(\mathcal{H}_q)\subset\mathcal{H}_p$. Fix $p>0$, and note that
$$\inf_{\theta\in\operatorname{lin}\Theta}\frac{\|\theta(\sigma_{\theta_0}(\cdot))\|_{(p)}}{\|\theta(\cdot)\|_{(p)}} \ \ge\ \inf_{\theta\in\operatorname{lin}\Theta}\frac{\sup_{h\in\sigma_{\theta_0}^{-1}(\mathcal{H}_q)}\left|\theta(\sigma_{\theta_0}(h))\right|}{\|\theta\|_{(p)}} \ =\ \inf_{\theta\in\operatorname{lin}\Theta}\frac{\|\theta\|_{(q)}}{\|\theta\|_{(p)}} \ \ge\ \frac{q}{2p}.$$


(See Exercise 15.6.4 to verify the last inequality.) Thus $\|\theta(\sigma_{\theta_0})\|_{(p)} \ge c_p\|\theta\|_{(p)}$ for all $\theta\in\operatorname{lin}\Theta$, where $c_p>0$ depends only on $p$. Lemma 6.16, Part (i), now implies that $\theta\mapsto\theta(\sigma_{\theta_0})$ is continuously invertible. For any $\theta_1\in\operatorname{lin}\Theta$, we have $\theta_2(\sigma_{\theta_0})=\theta_1$, where $\theta_2=\theta_1(\sigma_{\theta_0}^{-1})\in\operatorname{lin}\Theta$. Thus $\theta\mapsto\theta(\sigma_{\theta_0})$ is also onto. Hence $\theta\mapsto\dot\Psi_{\theta_0}(\theta) = -\theta(\sigma_{\theta_0})$ is both continuously invertible and onto, and the theorem is proved.

15.3.5 Weak Convergence and Bootstrap Validity

Our approach to establishing weak convergence will be through verifying the conditions of Theorem 2.11 via the Donsker class result of Lemma 13.3. After establishing weak convergence, we will use a similar technical approach, but with some important differences, to obtain validity of a simple weighted bootstrap procedure.

Recall that $\Psi_n(\theta)(h) = \mathbb{P}_n V^\tau(\theta)(h)$, and note that $V^\tau(\theta)(h)$ can be expressed as
$$V^\tau(\theta)(h) = \int_0^\tau \left(h_1'Z+h_2(s)\right)dN(s) - \int_0^\tau \left(h_1'Z+h_2(s)\right)W(s;\theta)\,dA(s).$$

We now show that for any $0<\epsilon<\infty$, $\mathcal{G}_\epsilon \equiv \{V^\tau(\theta)(h) : \theta\in\Theta_\epsilon,\ h\in\mathcal{H}_1\}$, where $\Theta_\epsilon \equiv \{\theta\in\Theta : \|\theta-\theta_0\|_{(1)}\le\epsilon\}$, is $P$-Donsker. First, Lemma 15.7 tells us that $\{W(t;\theta) : t\in[0,\tau],\,\theta\in\Theta\}$ is Donsker. Second, it is easily seen that $\{h_1'Z+h_2(t) : t\in[0,\tau],\,h\in\mathcal{H}_1\}$ is also Donsker. Since the product of bounded Donsker classes is also Donsker, we have that $\{f_{t,\theta}(h) \equiv (h_1'Z+h_2(t))W(t;\theta) : t\in[0,\tau],\,\theta\in\Theta_\epsilon,\,h\in\mathcal{H}_1\}$ is Donsker. Third, consider the map $\phi:\ell^\infty([0,\tau]\times\Theta_\epsilon\times\mathcal{H}_1)\to\ell^\infty(\Theta_\epsilon\times\mathcal{H}_1\times\mathcal{A}_\epsilon)$ defined by

$$\phi(f_{\cdot,\theta}(h)) \equiv \int_0^\tau f_{s,\theta}(h)\,dA(s),$$
for $A$ ranging over $\mathcal{A}_\epsilon \equiv \{A\in\mathcal{A} : \sup_{t\in[0,\tau]}|A(t)-A_0(t)|\le\epsilon\}$. Note that for any $\theta_1,\theta_2\in\Theta_\epsilon$ and $h,\tilde h\in\mathcal{H}_1$,
$$\left|\phi(f_{\cdot,\theta_1}(h)) - \phi(f_{\cdot,\theta_2}(\tilde h))\right| \le \sup_{t\in[0,\tau]}\left|f_{t,\theta_1}(h)-f_{t,\theta_2}(\tilde h)\right| \times \left(A_0(\tau)+\epsilon\right).$$

Thus $\phi$ is continuous and linear, and hence $\{\phi(f_{\cdot,\theta}(h)) : \theta\in\Theta_\epsilon,\,h\in\mathcal{H}_1\}$ is Donsker by Lemma 15.10 below. Thus also
$$\left\{\int_0^\tau \left(h_1'Z+h_2(s)\right)W(s;\theta)\,dA(s) : \theta\in\Theta_\epsilon,\ h\in\mathcal{H}_1\right\}$$
is Donsker. Since it is not hard to verify that
$$\left\{\int_0^\tau \left(h_1'Z+h_2(s)\right)dN(s) : h\in\mathcal{H}_1\right\}$$


is also Donsker, we now have that $\mathcal{G}_\epsilon$ is indeed Donsker, as desired. We now present the needed lemma and its proof before continuing:

Lemma 15.10 Suppose $\mathcal{F}$ is Donsker and $\phi:\ell^\infty(\mathcal{F})\to\mathbb{D}$ is continuous and linear. Then $\phi(\mathcal{F})$ is Donsker.

Proof. Observe that $\mathbb{G}_n\phi(\mathcal{F}) = \phi(\mathbb{G}_n\mathcal{F}) \rightsquigarrow \phi(\mathbb{G}\mathcal{F}) = \mathbb{G}(\phi(\mathcal{F}))$, where the first equality follows from linearity, the weak convergence follows from the continuous mapping theorem, the second equality follows from a reapplication of linearity, and the meaning of the "abuse in notation" is obvious.

We now have that both $\{V^\tau(\theta)(h)-V^\tau(\theta_0)(h) : \theta\in\Theta_\epsilon,\,h\in\mathcal{H}_1\}$ and $\{V^\tau(\theta_0)(h) : h\in\mathcal{H}_1\}$ are also Donsker. Thus $\sqrt{n}(\Psi_n(\theta_0)-\Psi(\theta_0)) \rightsquigarrow \mathbb{G}V^\tau(\theta_0)$ in $\ell^\infty(\mathcal{H}_1)$. Moreover, since it is not hard to show (see Exercise 15.6.7) that
$$\sup_{h\in\mathcal{H}_1} P\left(V^\tau(\theta)(h)-V^\tau(\theta_0)(h)\right)^2 \to 0,\quad\text{as }\theta\to\theta_0,\tag{15.22}$$
Lemma 13.3 yields that
$$\left\|\sqrt{n}(\Psi_n(\theta)-\Psi(\theta)) - \sqrt{n}(\Psi_n(\theta_0)-\Psi(\theta_0))\right\|_{(1)} = o_P(1),\quad\text{as }\theta\to\theta_0.$$

Combining these results with Theorem 15.9, we have that all of the conditions of Theorem 2.11 are satisfied, and thus
$$\left\|\sqrt{n}\,\dot\Psi_{\theta_0}(\hat\theta_n-\theta_0) + \sqrt{n}(\Psi_n(\theta_0)-\Psi(\theta_0))\right\|_{(1)} = o_P(1)\tag{15.23}$$
and $\sqrt{n}(\hat\theta_n-\theta_0) \rightsquigarrow Z_0 \equiv -\dot\Psi_{\theta_0}^{-1}(\mathbb{G}V^\tau(\theta_0))$ in $\ell^\infty(\mathcal{H}_1)$. We can observe from this result that $Z_0$ is a tight, mean zero Gaussian process with covariance $P[Z_0(h)Z_0(\tilde h)] = P\left[V^\tau(\theta_0)(\sigma_{\theta_0}^{-1}(h))\,V^\tau(\theta_0)(\sigma_{\theta_0}^{-1}(\tilde h))\right]$, for any $h,\tilde h\in\mathcal{H}_1$. As pointed out earlier, this is in fact uniform convergence, since any component of $\theta$ can be extracted via $\theta(h)$ for some $h\in\mathcal{H}_1$.

Now we will establish validity of a weighted bootstrap procedure for inference. Let $w_1,\ldots,w_n$ be positive, i.i.d., and independent of the data $X_1,\ldots,X_n$, with $0<\mu\equiv Pw_1<\infty$, $0<\sigma^2\equiv\operatorname{var}(w_1)<\infty$, and $\|w_1\|_{2,1}<\infty$. Define the weighted bootstrapped empirical process $\tilde{\mathbb{P}}_n \equiv n^{-1}\sum_{i=1}^n (w_i/\bar w)\Delta_{X_i}$, where $\bar w \equiv n^{-1}\sum_{i=1}^n w_i$ and $\Delta_{X_i}$ is the empirical measure for the observation $X_i$. This particular bootstrap was introduced in Section 2.2.3. Let $\tilde L_n(\theta)$ be $L_n(\theta)$ but with $\mathbb{P}_n$ replaced by $\tilde{\mathbb{P}}_n$, and let $\tilde\Psi_n$ be $\Psi_n$ but with $\mathbb{P}_n$ replaced by $\tilde{\mathbb{P}}_n$. Define $\tilde\theta_n$ to be the maximizer of $\theta\mapsto\tilde L_n(\theta)$. The idea is, after conditioning on the data sample $X_1,\ldots,X_n$, to compute $\tilde\theta_n$ for many replications of the weights $w_1,\ldots,w_n$ to form confidence intervals for $\theta_0$. We want to show that
$$\sqrt{n}(\mu/\sigma)\left(\tilde\theta_n - \hat\theta_n\right) \rightsquigarrow_{P_w} Z_0,\tag{15.24}$$
where $\rightsquigarrow_{P_w}$ denotes conditional weak convergence given the data, in probability.

We first study the unconditional properties of $\tilde\theta_n$. Note that for maximizing $\theta\mapsto\tilde L_n(\theta)$ and for zeroing $\theta\mapsto\tilde\Psi_n$, we can temporarily drop the $\bar w$ factor, since neither the maximizer nor the zero of a function is modified when the function is multiplied by a positive constant. Let $w$ be a generic version of $w_1$, and note that if a class of functions $\mathcal{F}$ is Glivenko-Cantelli, then so also is the class of functions $w\cdot\mathcal{F}$, via Theorem 10.13. Likewise, if the class $\mathcal{F}$ is Donsker, then so is $w\cdot\mathcal{F}$, via the multiplier central limit theorem, Theorem 10.1. Also, $Pwf = \mu Pf$, trivially. What this means is that the arguments in Sections 15.3.2 and 15.3.3 can all be replicated for $\tilde\theta_n$ with only trivial modifications. This means that $\tilde\theta_n \overset{\mathrm{as}*}{\to} \theta_0$.

Now, reinstate the $\bar w$ everywhere, and note that by Corollary 10.3 we can

verify both that
$$\sqrt{n}(\tilde\Psi_n-\Psi)(\theta_0) \rightsquigarrow (\sigma/\mu)\,\mathbb{G}_1 V^\tau(\theta_0) + \mathbb{G}_2 V^\tau(\theta_0),$$
where $\mathbb{G}_1$ and $\mathbb{G}_2$ are independent Brownian bridge random measures, and that
$$\left\|\sqrt{n}(\tilde\Psi_n(\tilde\theta_n)-\Psi(\tilde\theta_n)) - \sqrt{n}(\tilde\Psi_n(\theta_0)-\Psi(\theta_0))\right\|_{(1)} = o_P(1).$$

Thus reapplication of Theorem 2.11 yields that
$$\left\|\sqrt{n}\,\dot\Psi_{\theta_0}(\tilde\theta_n-\theta_0) + \sqrt{n}(\tilde\Psi_n-\Psi)(\theta_0)\right\|_{(1)} = o_P(1).$$

Combining this with (15.23), we obtain
$$\left\|\sqrt{n}\,\dot\Psi_{\theta_0}(\tilde\theta_n-\hat\theta_n) + \sqrt{n}(\tilde\Psi_n-\Psi_n)(\theta_0)\right\|_{(1)} = o_P(1).$$

Now, using the linearity of $\dot\Psi_{\theta_0}$, the continuity of $\dot\Psi_{\theta_0}^{-1}$, and the bootstrap central limit theorem, Theorem 2.6, we have the desired result that $\sqrt{n}(\mu/\sigma)(\tilde\theta_n-\hat\theta_n) \rightsquigarrow_{P_w} Z_0$. Thus the proposed weighted bootstrap is valid.

We also note that it is not clear how to verify the validity of the usual nonparametric bootstrap, although its validity probably does hold. The key to the relative simplicity of the theory for the proposed weighted bootstrap is that Glivenko-Cantelli and Donsker properties of function classes are not altered after multiplying by independent random weights satisfying the given moment conditions. We also note that the weighted bootstrap is computationally simple, and thus it is quite practical to generate a reasonably large number of replications of $\tilde\theta_n$ to form confidence intervals. This is demonstrated numerically in Kosorok, Lee and Fine (2004).
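The weighted bootstrap just described can be sketched generically as follows. This is a hypothetical illustration using the sample mean as a stand-in estimator (in the proportional odds model one would re-maximize the modified likelihood instead); Exponential(1) weights are positive, i.i.d., satisfy the stated moment conditions, and have µ = σ = 1, so the scaling factor µ/σ drops out.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sketch of the weighted bootstrap: replace P_n by
# n^{-1} sum_i (w_i / wbar) delta_{X_i} and recompute the estimator.
def weighted_bootstrap(x, estimator, n_rep=2000, rng=rng):
    n = len(x)
    reps = np.empty(n_rep)
    for b in range(n_rep):
        w = rng.exponential(scale=1.0, size=n)  # positive i.i.d., mu = sigma = 1
        reps[b] = estimator(x, w / w.mean())    # weights normalized by wbar
    return reps

x = rng.normal(loc=2.0, size=100)
theta_hat = x.mean()                            # stand-in for the NPMLE
reps = weighted_bootstrap(x, lambda x, w: np.average(x, weights=w))

# With mu = sigma = 1, the spread of the replications approximates the
# sampling law of theta_hat; a 95% bootstrap interval:
lo, hi = np.quantile(reps, [0.025, 0.975])
print(round(lo, 2), round(hi, 2))
```

Conditioning on the data, only the weights are redrawn across replications, exactly as in the text's prescription for forming confidence intervals.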

15.4 Testing for a Change-point

Recall the change-point model example of Section 14.5.1 and consider testing the null hypothesis $H_0:\alpha=\beta$. Under this null, the change-point parameter $\zeta$ is not identifiable, and thus $\hat\zeta_n$ is not consistent. This means that testing $H_0$ is an important concern, since it is unlikely we would know in advance whether $H_0$ were true. The statistic we propose using is $T_n \equiv \sup_{\zeta\in[a,b]}|U_n(\zeta)|$, where


$$U_n(\zeta) \equiv \frac{\sqrt{n\,\hat F_n(\zeta)(1-\hat F_n(\zeta))}}{\hat\sigma_n}\left(\frac{\sum_{i=1}^n 1\{Z_i\le\zeta\}Y_i}{n\hat F_n(\zeta)} - \frac{\sum_{i=1}^n 1\{Z_i>\zeta\}Y_i}{n(1-\hat F_n(\zeta))}\right),$$
$$\hat\sigma_n^2 \equiv n^{-1}\sum_{i=1}^n\left(Y_i - \hat\alpha_n 1\{Z_i\le\hat\zeta_n\} - \hat\beta_n 1\{Z_i>\hat\zeta_n\}\right)^2,$$
and where $\hat F_n(t) \equiv \mathbb{P}_n 1\{Z\le t\}$. We will study the asymptotic limiting behavior of this statistic under the sequence of contiguous alternative hypotheses $H_{1n}:\beta=\alpha+\eta/\sqrt{n}$, where the distribution of $Z$ and $\epsilon$ does not change with $n$. Thus $Y_i = \epsilon_i + \alpha_0 + (\eta/\sqrt{n})1\{Z_i>\zeta_0\}$, $i=1,\ldots,n$.
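The statistic $T_n$ can be computed directly over a grid of candidate change-points. The following is a hypothetical sketch (simulated data; all parameter values illustrative), with one simplification: the variance estimate is recomputed at each candidate $\zeta$ rather than at the argmax split used in the text, which is also consistent under the null.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical sketch of T_n = sup_{zeta in [a,b]} |U_n(zeta)| for the
# change-point model Y_i = alpha 1{Z_i <= zeta0} + beta 1{Z_i > zeta0} + eps_i.
n = 200
Z = rng.uniform(size=n)
Y = 1.0 + 0.5 * (Z > 0.5) + rng.normal(scale=0.3, size=n)  # alpha=1, beta=1.5

def U_n(zeta):
    left, right = Z <= zeta, Z > zeta
    Fn = left.mean()
    # Residual variance at this candidate split (simplification; the text
    # plugs in the split at the estimated change-point).
    resid = Y - left * Y[left].mean() - right * Y[right].mean()
    sigma_n = resid.std()
    return (np.sqrt(n * Fn * (1 - Fn)) / sigma_n
            * (Y[left].mean() - Y[right].mean()))

grid = np.linspace(0.1, 0.9, 81)          # [a, b] = [0.1, 0.9], illustrative
T_n = max(abs(U_n(z)) for z in grid)
print(T_n > 3.0)
```

Under this alternative the difference in group means is large relative to the noise, so $T_n$ comfortably exceeds typical critical values; under $H_0$ it would fluctuate like the supremum of the limiting process $B_0$ below.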

We will first show that under $H_{1n}$,
$$U_n(\zeta) = a_0(\zeta)B_n(\zeta) + \nu_0(\zeta) + r_n(\zeta),\tag{15.25}$$
where $a_0(\zeta) \equiv \sigma^{-1}\sqrt{F(\zeta)(1-F(\zeta))}$, $F(\zeta) \equiv P1\{Z\le\zeta\}$,
$$B_n(\zeta) \equiv \frac{\sqrt{n}\,\mathbb{P}_n\!\left[1\{\zeta_0<Z\le\zeta\}\epsilon\right]}{F(\zeta)} - \frac{\sqrt{n}\,\mathbb{P}_n\!\left[1\{Z>\zeta\vee\zeta_0\}\epsilon\right]}{1-F(\zeta)},$$
$$\nu_0(\zeta) \equiv \eta\,a_0(\zeta)\left(\frac{P[\zeta_0<Z\le\zeta]}{F(\zeta)} - \frac{P[Z>\zeta\vee\zeta_0]}{1-F(\zeta)}\right),$$

and $\sup_{\zeta\in[a,b]}|r_n(\zeta)| \overset{P}{\to} 0$. The first step is to note that $\hat F_n$ is uniformly consistent on $[a,b]$ for $F$, and that both $\inf_{\zeta\in[a,b]}F(\zeta)>0$ and $\inf_{\zeta\in[a,b]}(1-F(\zeta))>0$. Since $\eta/\sqrt{n}\to 0$, we also have that both
$$V_n(\zeta) \equiv \frac{\sum_{i=1}^n 1\{Z_i\le\zeta\}Y_i}{\sum_{i=1}^n 1\{Z_i\le\zeta\}}\quad\text{and}\quad W_n(\zeta) \equiv \frac{\sum_{i=1}^n 1\{Z_i>\zeta\}Y_i}{\sum_{i=1}^n 1\{Z_i>\zeta\}}$$
are uniformly consistent, over $\zeta\in[a,b]$, for $\alpha_0$ by straightforward Glivenko-Cantelli arguments. Since $\hat\zeta_n\in[a,b]$ with probability 1 by assumption, we have $\hat\sigma_n^2 \overset{P}{\to} \sigma^2$ even though $\hat\zeta_n$ may not be consistent. Now the fact that both $\mathcal{F}_1 \equiv \{1\{Z\le\zeta\} : \zeta\in[a,b]\}$ and $\mathcal{F}_2 \equiv \{1\{Z>\zeta\} : \zeta\in[a,b]\}$ are Donsker yields the final conclusion, after some simple calculations.

Reapplying the fact that $\mathcal{F}_1$ and $\mathcal{F}_2$ are Donsker yields that $a_0(\zeta)B_n(\zeta) \rightsquigarrow B_0(\zeta)$ in $\ell^\infty([a,b])$, where $B_0$ is a tight, mean zero Gaussian process with covariance
$$H_0(\zeta_1,\zeta_2) \equiv \sqrt{\frac{F(\zeta_1\wedge\zeta_2)\left(1-F(\zeta_1\vee\zeta_2)\right)}{F(\zeta_1\vee\zeta_2)\left(1-F(\zeta_1\wedge\zeta_2)\right)}}.$$

Thus, from (15.25), we have that $U_n \rightsquigarrow B_0+\nu_0$, and it is easy to verify that $\nu_0\equiv 0$ under $H_0$ (i.e., when $\eta=0$) while $\nu_0(\zeta_0)\ne 0$ whenever $\eta\ne 0$. Hence, if we use critical values based on $B_0$, the statistic $T_n$ will asymptotically have the correct size under $H_0$ as well as have arbitrarily large power under $H_{1n}$ as $|\eta|$ gets larger.

Thus what we need now is a computationally easy method for obtaining the critical values of $\sup_{\zeta\in[a,b]}|B_0(\zeta)|$. Define


$$A_i^n(\zeta) \equiv \sqrt{\hat F_n(\zeta)\left(1-\hat F_n(\zeta)\right)}\left(\frac{1\{Z_i\le\zeta\}}{\hat F_n(\zeta)} - \frac{1\{Z_i>\zeta\}}{1-\hat F_n(\zeta)}\right),$$
and consider the weighted "bootstrap" $\hat B_n(\zeta) \equiv n^{-1/2}\sum_{i=1}^n A_i^n(\zeta)w_i$, where $w_1,\ldots,w_n$ are i.i.d. standard normals independent of the data. The continuous mapping theorem applied to the following lemma verifies that this bootstrap will satisfy $\sup_{\zeta\in[a,b]}|\hat B_n(\zeta)| \rightsquigarrow_{P_w} \sup_{\zeta\in[a,b]}|B_0(\zeta)|$ and thus can be used to obtain the needed critical values:

Lemma 15.11 $\hat B_n \rightsquigarrow_{P_w} B_0$ in $\ell^\infty([a,b])$.

Before ending this section with the proof of this lemma, we note that the problem of testing a null hypothesis under which some of the parameters are no longer identifiable is quite challenging (see Andrews, 2001), especially when infinite-dimensional parameters are present. An example of the latter setting is transformation regression models with a change-point (Kosorok and Song, 2007). Obtaining the critical value of the appropriate statistic for these models is much more difficult than it is for the simple example we have presented in this section.
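The multiplier bootstrap just described is simple to implement; the following hypothetical sketch (simulated data; grid, sample size, and number of replications illustrative) forms $A_i^n(\zeta)$, multiplies by i.i.d. standard normal weights, and takes the 95% quantile of the resulting suprema as the critical value.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical sketch of the multiplier bootstrap for sup |B_0|.
n = 200
Z = rng.uniform(size=n)
grid = np.linspace(0.1, 0.9, 81)             # [a, b] = [0.1, 0.9]

Fn = np.array([(Z <= z).mean() for z in grid])
# A_i^n(zeta) = sqrt(Fn(1-Fn)) * (1{Z_i<=zeta}/Fn - 1{Z_i>zeta}/(1-Fn)),
# stored as an (n, len(grid)) array via broadcasting.
A = np.sqrt(Fn * (1 - Fn)) * ((Z[:, None] <= grid) / Fn
                              - (Z[:, None] > grid) / (1 - Fn))

def sup_Bn(rng=rng):
    w = rng.normal(size=n)                   # i.i.d. standard normal multipliers
    Bn = (A * w[:, None]).sum(axis=0) / np.sqrt(n)
    return np.abs(Bn).max()

crit = np.quantile([sup_Bn() for _ in range(2000)], 0.95)
print(round(crit, 2))
```

Because the data are held fixed while only the Gaussian weights are redrawn, each replication is conditionally a mean zero Gaussian process with covariance $H_n$, exactly as exploited in the proof of Lemma 15.11.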

Proof of Lemma 15.11. Let $\mathcal{G}$ be the class of nondecreasing functions $G:[a,b]\to[a_0,b_0]$ of $\zeta$, where $a_0 \equiv F(a)/2$ and $b_0 \equiv 1/2+F(b)/2$, and note that by the uniform consistency of $\hat F_n$, $\hat F_n\in\mathcal{G}$ with probability arbitrarily close to 1 for all $n$ large enough. Also note that
$$\left\{\sqrt{G(\zeta)\left(1-G(\zeta)\right)}\left(\frac{1\{Z\le\zeta\}}{G(\zeta)} - \frac{1\{Z>\zeta\}}{1-G(\zeta)}\right) : \zeta\in[a,b],\ G\in\mathcal{G}\right\}$$
is Donsker, since $\{af : a\in K,\ f\in\mathcal{F}\}$ is Donsker for any compact $K\subset\mathbb{R}$ and any Donsker class $\mathcal{F}$. Thus by the multiplier central limit theorem, Theorem 10.1, we have that $\hat B_n$ converges weakly to a tight Gaussian process unconditionally. Thus we know that $\hat B_n$ is asymptotically tight conditionally, i.e., we know that

$$P^*\left(\left.\sup_{\zeta_1,\zeta_2\in[a,b]:\,|\zeta_1-\zeta_2|\le\delta_n}\left|\hat B_n(\zeta_1)-\hat B_n(\zeta_2)\right| > \tau\ \right|\ X_1,\ldots,X_n\right) = o_P(1),$$
for all $\tau>0$ and all sequences $\delta_n\downarrow 0$. All we need to verify now is that the finite-dimensional distributions of $\hat B_n$ converge to the appropriate limiting multivariate normal distributions conditionally.

Since $w_1,\ldots,w_n$ are i.i.d. standard normal, it is easy to see that $\hat B_n$ is conditionally a Gaussian process with mean zero and, after some algebra (see Exercise 15.6.8), with covariance
$$H_n(\zeta_1,\zeta_2) \equiv \sqrt{\frac{\hat F_n(\zeta_1\wedge\zeta_2)\left(1-\hat F_n(\zeta_1\vee\zeta_2)\right)}{\hat F_n(\zeta_1\vee\zeta_2)\left(1-\hat F_n(\zeta_1\wedge\zeta_2)\right)}}.\tag{15.26}$$

Since $H_n$ can easily be shown to converge to $H_0(\zeta_1,\zeta_2)$ uniformly over $[a,b]\times[a,b]$, the proof is complete.


15.5 Large p Small n Asymptotics for Microarrays

In this section, we present results for the "large p, small n" paradigm (West, 2003) that arises in microarray studies, image analysis, high throughput molecular screening, astronomy, and many other high dimensional applications. To solidify our discussion, we will focus on an idealized version of microarray data. The material for this section comes primarily from Kosorok and Ma (2007), which can be referred to for additional details and which contains a brief review of statistical methods for microarrays. Microarrays are capable of monitoring the gene expression of thousands of genes and have become both important and routine in biomedical research. The basic feature of a simple microarray experiment is that data from a small number n of i.i.d. microarrays are collected, with each microarray having measurements on a large number p of genes. Typically, n is around 20 or less (usually, in fact, much less), whereas p is easily in the thousands. The data are usually aggregated across the n arrays to form test statistics for each of the p genes, resulting in large scale multiple testing. False discovery rate (FDR) methods (see Benjamini and Hochberg, 1995) are used to account for this multiplicity in order to successfully identify which among thousands of monitored genes are significantly differentially expressed.

For FDR methods to be valid for identifying differentially expressed genes, the p-values for the non-differentially expressed genes must simultaneously have uniform distributions marginally. While this is feasible for permutation based p-values, it is unclear whether such uniformity holds for p-values based on asymptotic approximations. For instance, suppose that we wish to use t-tests for each of the p genes and to compute approximate p-values based on the normal approximation for simplicity. If the data are not normally distributed, we would have to rely on the central limit theorem. Unfortunately, it is unclear whether this will work for all of the tests simultaneously. The issue is that the number of p-values involved goes to infinity, and intuition suggests that at least some of the p-values should behave erratically. In this section, we will show the somewhat surprising result that, under arbitrary dependency structures across genes, the p-values are in fact simultaneously valid when using the normal approximation for t-tests. The result is essentially a very high dimensional central limit theorem.
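A minimal sketch of the p-value approximation under discussion, with simulated data (the number of arrays, number of genes, shift size, and number of differentially expressed genes are all illustrative, not from the text): each gene's one-sample t-statistic is referred to the standard normal distribution.

```python
import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(3)

# Hypothetical illustration: n arrays, p genes, approximate two-sided p-values
# for H0: mu_j = 0 from the normal approximation to the t-statistic.
n, p = 20, 5000
Y = rng.standard_normal((n, p))      # model (15.27) with mu_j = 0, normal errors
Y[:, :50] += 1.0                     # 50 illustrative differentially expressed genes

t = np.sqrt(n) * Y.mean(axis=0) / Y.std(axis=0, ddof=1)
Phi = lambda x: 0.5 * (1 + erf(x / sqrt(2)))     # standard normal CDF
pvals = 2 * (1 - np.vectorize(Phi)(np.abs(t)))   # approximate marginal p-values
print(pvals[:50].mean() < pvals[50:].mean())
```

The theory developed below addresses whether these p approximate p-values are *simultaneously* close enough to the exact ones for downstream FDR procedures, even when p is vastly larger than n.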

To further clarify ideas, consider a simple one-sample cDNA microarray study. Note that this data setting and the results that follow can easily be extended to incorporate more complex designs. Studies using Affymetrix genechip data can also be included in the same framework after some modification. Denote by $Y_{ij}$ the background-corrected log-ratios, for array $i=1,\ldots,n$ and gene $j=1,\ldots,p$. Consider the following simple linear model:
$$Y_{ij} = \mu_j + \epsilon_{ij},\tag{15.27}$$


where the $\mu_j$ are the fixed gene effects and the $\epsilon_{ij}$ are mean zero random errors. For simplicity of exposition, we have omitted other potentially important terms in our model, such as array-specific intensity normalization and possible print-tip effects. We note, however, that the theory we present here extends readily to these richer models and, in fact, has been extended to address normalization in the original Kosorok and Ma (2007) paper.

Under the simple model (15.27), permutation tests can be used. However, when normalization is needed and the residual errors are not Gaussian, it may be convenient to use the normal approximation to compute the p-values for the t-tests associated with the sample mean $X_j$ for each gene $j=1,\ldots,p$. In this section, we wish to study inference for these sample means when the number of arrays $n\to\infty$ slowly while the number of genes $p\gg n$. This is essentially the asymptotic framework considered in van der Laan and Bryan (2001), who show that, provided the range of expression levels is bounded, the sample means consistently estimate the mean gene effects uniformly across genes whenever $\log p = o(n)$. We extend the results of van der Laan and Bryan (2001) in two important ways. First, uniform consistency results are extended to general empirical distribution functions. This consistency will apply to sample means and will also apply to other functionals of the empirical distribution function, such as sample medians (this is explored more in the Kosorok and Ma, 2007, paper). Second, a precise Brownian bridge approximation to the empirical distribution function is developed and utilized to establish uniform validity of marginal p-values based on the normal approximation. More specifically, we develop a central limit theorem for the large p small n setting which holds provided $\log p = o(n^{1/2})$.

An important consequence of these results is that approximate p-values based on normalized gene expression data can be validly applied to FDR methods for identifying differentially expressed genes. We refer to this kind of asymptotic regime as "marginal asymptotics" (see also Kosorok and Ma, 2005) because the focus of the inference is at the marginal (gene) level, even though the results are uniformly valid over all genes. Qualitatively, the requirement that $\log p = o(n^{1/2})$ seems to be approximately the correct order of asymptotics for microarray experiments with a moderate number, say ∼20–30, of replications. The main technical tools we use from empirical processes include maximal inequalities (Section 8.1) and a specialized Hungarian construction for the empirical distribution function.

We first discuss what is required of p-value approximations in order for the FDR procedure to be valid asymptotically. We then present the main theoretical results for the empirical processes involved, and then close this section with a specialization of these results to estimation and inference based on sample means.


15.5.1 Assessing P-Value Approximations

Suppose we have p hypothesis tests with p-values $q^{(p)} \equiv \{q_1,\ldots,q_p\}$ but only know the estimated p-values $\hat q^{(p)} \equiv \{\hat q_1,\ldots,\hat q_p\}$. An important question is how accurate must $\hat q^{(p)}$ be in order for inference based on $\hat q^{(p)}$ to be asymptotically equivalent to inference based on $q^{(p)}$? Specifically, the chief hypothesis testing issue is controlling the FDR asymptotically in p. To fix ideas, suppose the indices $J_p = \{1,\ldots,p\}$ for the hypothesis tests are divided into two groups, $J_{0p}$ and $J_{1p}$, where some null hypotheses hold for all $j\in J_{0p}$ and some alternatives hold for all $j\in J_{1p}$. We will assume that $q_j$ is uniformly distributed for all $j\in J_{0p}$ and that $q_j$ has distribution $F_1$ for all $j\in J_{1p}$, where $F_1(t)\ge t$ for all $t\in[0,1]$ and $F_1$ is strictly concave with $\lim_{t\downarrow 0}F_1(t)/t = \infty$. Let $\lambda_p \equiv \#J_{0p}/p$ be the proportion of true null hypotheses, and assume $\lambda_p\to\lambda_0\in(0,1]$ as $p\to\infty$. Also let $\bar F_p(t) \equiv p^{-1}\sum_{j=1}^p 1\{q_j\le t\}$, where $1\{A\}$ is the indicator of $A$, and assume $\bar F_p(t)$ converges uniformly in $t$ to $F_0(t) \equiv \lambda_0 t + (1-\lambda_0)F_1(t)$.

The estimate of FDR proposed by Storey (2002) (see also Genovese and Wasserman, 2002) for a p-value threshold of $t\in[0,1]$ is $\widehat{\mathrm{FDR}}_l(t) \equiv \hat\lambda(l)\,t/(\bar F_p(t)\vee(1/p))$, where $\hat\lambda(l) \equiv (1-\bar F_p(l))/(1-l)$ is a conservative estimate of $\lambda_0$, in that $\hat\lambda(l)\to\lambda^*$ in probability, where $\lambda_0\le\lambda^*\le 1$, and where $a\vee b$ denotes the maximum of $a$ and $b$. The quantity $l$ is the tuning parameter and is constrained to be in $(0,1)$, with decreasing bias as $l$ gets closer to zero. Because of the upward bias of $\hat\lambda(l)$, if $\hat\lambda(l)$ is distinctly $<1$, then one can be fairly confident that $\lambda_0<1$.

Assume $\lambda_0<1$. The asymptotic FDR for the procedure rejecting all hypotheses corresponding to indices with $q_j\le t$ is $r_0(t) \equiv \lambda_0 t/(\lambda_0 t + (1-\lambda_0)F_1(t))$. Storey, Taylor and Siegmund (2004) demonstrate that, under fairly general dependencies among the p-values $q^{(p)}$, $\bar F_p(t)$ converges to $F_0(t)$, and thus $\widehat{\mathrm{FDR}}_l(t)$ converges in probability to $r^*(t) \equiv (\lambda^*/\lambda_0)r_0(t)$. Our assumptions on $F_1$ ensure that $r_0(t)$ is monotone increasing with derivative $\dot r_0(t)$ bounded by $(4\delta)^{-1}$. Thus, for each $\rho\in[0,\lambda^*]$, there exists a $t\in[0,1]$ with $r^*(t)=\rho$ and $r_0(t)\le\rho$. Thus using $\widehat{\mathrm{FDR}}_l(t)$ to control FDR is asymptotically valid, albeit conservative.

Suppose all we have available is $\hat q^{(p)}$. Now we estimate $F_0$ with $\hat F_p(t) = p^{-1}\sum_{j=1}^p 1\{\hat q_j\le t\}$ and $\lambda^*$ with $\hat\lambda(l) \equiv (1-\hat F_p(l))/(1-l)$. The previous results will all hold for $\widehat{\mathrm{FDR}}_l(t) \equiv \hat\lambda(l)\,t/(\hat F_p(t)\vee(1/p))$, provided $\hat F_p$ is uniformly consistent for $F_0$. We now show that a sufficient condition for this is that $\max_{1\le j\le p}|\hat q_j - q_j|\to 0$ in probability. Under this condition, there exists a positive sequence $\epsilon_p\downarrow 0$ such that $P\left(\max_{1\le j\le p}|\hat q_j - q_j| > \epsilon_p\right)\to 0$. Accordingly, we have with probability tending to one that, for any $t\in[\epsilon_p,1-\epsilon_p]$, $\bar F_p(t-\epsilon_p) \le \hat F_p(t) \le \bar F_p(t+\epsilon_p)$. Thus, by continuity of $F_0$, uniform consistency of $\hat F_p$ follows from uniform consistency of $\bar F_p$.


In summary, the above procedure for controlling FDR is asymptotically valid when $\lambda_0<1$, provided $\max_{1\le j\le p}|\hat q_j - q_j|$ goes to zero in probability. However, this result does not hold when $\lambda_0=1$. This is discussed in greater detail in Kosorok and Ma (2007), who also present methods of addressing this issue which we do not pursue further here.

15.5.2 Consistency of Marginal Empirical Distribution Functions

For each $n\ge 1$, let $X_1(n),\ldots,X_n(n)$ be a sample of i.i.d. vectors (e.g., microarrays) of length $p_n$, where the dependence within vectors is allowed to be arbitrary. Denote the $j$th component (e.g., gene) of the $i$th vector by $X_{ij}(n)$, i.e., $X_i(n) = (X_{i1}(n),\ldots,X_{ip_n}(n))'$. Also let the marginal distribution of $X_{1j}(n)$ be denoted $F_{j(n)}$, and let $\hat F_{j(n)}(t) = n^{-1}\sum_{i=1}^n 1\{X_{ij}(n)\le t\}$, for all $t\in\mathbb{R}$ and each $j=1,\ldots,p_n$.

The main results of this subsection are Theorems 15.12 and 15.13 below. The two theorems are somewhat surprising, high dimensional extensions of two classical univariate results for empirical distribution functions: the celebrated Dvoretzky, Kiefer and Wolfowitz (1956) inequality, as refined by Massart (1990), and the celebrated Komlós, Major and Tusnády (1976) Hungarian (KMT) construction, as refined by Bretagnolle and Massart (1989). The extensions utilize maximal inequalities based on Orlicz norms (see Section 8.1). The proofs are given at the end of this subsection.

The first theorem yields simultaneous consistency of all of the $\hat F_{j(n)}$:

Theorem 15.12 For a universal constant $0<c_0<\infty$ and all $n,p_n\ge 2$,
$$\left\|\max_{1\le j\le p_n}\left\|\hat F_{j(n)} - F_{j(n)}\right\|_\infty\right\|_{\psi_2} \le c_0\sqrt{\frac{\log p_n}{n}}.\tag{15.28}$$
In particular, if $n\to\infty$ and $\log p_n = o(n)$, then the left side of (15.28) $\to 0$.

Note that the rate on the right side of (15.28) is sharp, in the sense that there exist sequences of data sets where $(\log p_n/n)^{-1/2}\times\max_{1\le j\le p_n}\|\hat F_{j(n)}-F_{j(n)}\|_\infty \to c > 0$, in probability, as $n\to\infty$. In particular, this is true if the genes are all independent, $n,p_n\to\infty$ with $\log p_n = o(n)$, and $c=1/2$.
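A quick simulation, assuming independent uniformly distributed genes (the sharp-rate case noted above; sample sizes illustrative), shows $\max_j\|\hat F_{j(n)}-F_{j(n)}\|_\infty$ scaling like $\sqrt{\log p_n/n}$:

```python
import numpy as np

rng = np.random.default_rng(5)

# Hypothetical sketch of Theorem 15.12 with independent genes: the maximum
# Kolmogorov distance over p genes scales like sqrt(log p / n).
def max_ks(n, p, rng=rng):
    X = rng.uniform(size=(n, p))         # F_j = Uniform(0,1) for every gene
    Xs = np.sort(X, axis=0)
    i = np.arange(1, n + 1)[:, None]
    # Per-gene KS distance via the order statistics of the uniform:
    D = np.maximum(i / n - Xs, Xs - (i - 1) / n).max(axis=0)
    return D.max()                        # max over the p genes

ratios = {}
for n, p in [(25, 100), (25, 10000)]:
    ratios[(n, p)] = max_ks(n, p) / np.sqrt(np.log(p) / n)
print({k: round(v, 2) for k, v in ratios.items()})
```

Even as p grows by two orders of magnitude with n fixed at a microarray-like 25, the ratio stays of order one, consistent with the $\sqrt{\log p_n/n}$ bound.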

The second theorem shows that the standardized empirical processes $\sqrt{n}(\hat F_{j(n)}-F_{j(n)})$ can be simultaneously approximated by Brownian bridges in a manner which preserves the original dependency structure in the data. For example, if the original data has weak dependence, as defined in Storey, Taylor and Siegmund (2004), then so will the approximating Brownian bridges. An example of weak dependence is the m-dependence described in Section 11.6. Let $\mathcal{F}_{j(n)}$ be the smallest $\sigma$-field making all of $X_{1j}(n),\ldots,X_{nj}(n)$ measurable, $1\le j\le p_n$, and let $\mathcal{F}_n$ be the smallest $\sigma$-field making all of $\mathcal{F}_{1(n)},\ldots,\mathcal{F}_{p_n(n)}$ measurable.


Theorem 15.13 For universal constants 0 < c1, c2 < ∞ and all n, pn ≥ 2,

(15.29)  ∥ max1≤j≤pn ‖√n(F̂j(n) − Fj(n)) − Bj(n)(Fj(n))‖∞ ∥ψ1 ≤ (c1 log n + c2 log pn)/√n,

for some stochastic processes B1(n), . . . , Bpn(n) which are conditionally independent given Fn and for which each Bj(n) is a standard Brownian bridge with conditional distribution given Fn depending only on Fj(n), 1 ≤ j ≤ pn.

Before giving the promised proofs, we note that while the above results are only used in the next subsection for developing inference based on the mean, these results extend nicely to median and other robust estimation methods (see Kosorok and Ma, 2007, and Ma, Kosorok, Huang, et al., 2006).

Proof of Theorem 15.12. Define Vj(n) ≡ √n ‖F̂j(n) − Fj(n)‖∞, and note that by Corollary 1 of Massart (1990), P(Vj(n) > x) ≤ 2e^(−2x²), for all x ≥ 0 and any distribution Fj(n). This inequality is a refinement of the celebrated result of Dvoretzky, Kiefer and Wolfowitz (1956), given in their Lemma 2, and the extension to distributions with discontinuities is standard. Using Lemma 8.1, we obtain ‖Vj(n)‖ψ2 ≤ √(3/2) for all 1 ≤ j ≤ pn. Now, by Lemma 8.2 combined with the fact that

lim supx,y→∞ ψ2(x)ψ2(y)/ψ2(xy) = 0,

we have that there exists a universal constant c∗ < ∞ with

∥ max1≤j≤pn Vj(n) ∥ψ2 ≤ c∗ √(log(1 + pn)) √(3/2)

for all n ≥ 1. The desired result now follows for the constant c0 = √3 c∗, since log(k + 1) ≤ 2 log k for any k ≥ 2.

Proof of Theorem 15.13. Let Uj, j = 1, . . . , pn, be i.i.d. uniform random variables independent of the data. Then, by Theorem 15.14 below, we have for each 1 ≤ j ≤ pn that there exists a measurable map gj(n) : Rn × [0, 1] → C[0, 1] where Bj(n) = gj(n)(X1j(n), . . . , Xnj(n), Uj) is a Brownian bridge with

(15.30)  P( √n ‖√n(F̂j(n) − Fj(n)) − Bj(n)(Fj(n))‖∞ > x + 12 log n ) ≤ 2e^(−x/6),

for all x ≥ 0. Note that this construction generates an ensemble of Brownian bridges B1(n), . . . , Bpn(n) that may be dependent when the components in X1(n) = (X11(n), . . . , X1pn(n))′ are dependent. However, each Bj(n) only depends on the information contained in Fj(n) and the independent uniform random variable Uj. Thus Bj(n) depends on Fn only through the information contained in Fj(n), and the ensemble of Brownian bridges is conditionally independent given Fn. Note also the validity of (15.30) for all n ≥ 2.

Let Vj(n) ≡ ( (√n/ log n) ‖√n(F̂j(n) − Fj(n)) − Bj(n)(Fj(n))‖∞ − 12 )+, where u+ is the positive part of u. By Lemma 8.1, Expression (15.30) implies that ‖Vj(n)‖ψ1 ≤ 18/ log n. Reapplying the result that log(k + 1) ≤ 2 log k for any k ≥ 2, we now have, by the fact that lim supx,y→∞ ψ1(x)ψ1(y)/ψ1(xy) = 0 combined with Lemma 8.2, that there exists a universal constant 0 < c2 < ∞ for which

∥ max1≤j≤pn Vj(n) ∥ψ1 ≤ c2 log pn/ log n.

Now (15.29) follows, for c1 = 12, from the definition of Vj(n).

Theorem 15.14 For n ≥ 2, let Y1, . . . , Yn be i.i.d. real random variables with distribution G (not necessarily continuous), and let U0 be a uniform random variable independent of Y1, . . . , Yn. Then there exists a measurable map gn : Rn × [0, 1] → C[0, 1] such that B = gn(Y1, . . . , Yn, U0) is a standard Brownian bridge satisfying, for all x ≥ 0,

(15.31)  P( √n ‖√n(Ĝn − G) − B(G)‖∞ > x + 12 log n ) ≤ 2e^(−x/6),

where Ĝn is the empirical distribution of Y1, . . . , Yn.

Proof. By Theorem 20.4 of Billingsley (1995), there exists a measurable h0 : [0, 1] → [0, 1]² such that (U1, U2) ≡ h0(U0) is a pair of independent uniforms. Moreover, standard arguments yield the existence of a function hn : Rn × [0, 1] → [0, 1]n such that (V1, . . . , Vn) ≡ hn(Y1, . . . , Yn, U1) is a sample of i.i.d. uniforms and (Y1, . . . , Yn) = (ψ(V1), . . . , ψ(Vn)), where ψ(u) ≡ inf{x : G(x) ≥ u} (see Exercise 15.6.9). The variable U1 is needed to handle possible discontinuities in G.
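The quantile-transform map ψ can be illustrated numerically. The sketch below (with a hypothetical three-point distribution G, not from the text) checks that ψ(V) has distribution G when V is uniform; the proof's map hn runs in the reverse direction, using U1 to randomize within the atoms of G.

```python
import numpy as np

rng = np.random.default_rng(1)

# A distribution G with discontinuities: atoms at 0, 1, 2 (hypothetical example)
values = np.array([0.0, 1.0, 2.0])
probs = np.array([0.2, 0.5, 0.3])
cdf = np.cumsum(probs)                     # G evaluated at the atoms

def psi(u):
    """psi(u) = inf{x : G(x) >= u}, the generalized inverse of G."""
    return values[np.searchsorted(cdf, u, side="left")]

V = rng.uniform(size=100_000)              # i.i.d. uniforms
Y = psi(V)                                 # psi(V) should have distribution G
freq = np.array([(Y == v).mean() for v in values])
print(freq)                                # close to [0.2, 0.5, 0.3]
```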

Let Ĥn be the empirical distribution of V1, . . . , Vn, and note that

(15.32)  √n(Ĥn(G(x)) − G(x)) = √n(Ĝn(x) − G(x)), ∀x ∈ R,

by design. Now by the Hungarian construction (Theorem 1) of Bretagnolle and Massart (1989), there exists a Brownian bridge B depending only on V1, . . . , Vn and U2 such that

P( √n supu∈[0,1] |√n(Ĥn(u) − u) − B(u)| > x + 12 log n ) ≤ 2e^(−x/6),

for all x ≥ 0, and thus by (15.32),

(15.33)  P( √n ‖√n(Ĝn − G) − B(G)‖∞ > x + 12 log n ) ≤ 2e^(−x/6), ∀x ≥ 0.

By Lemma 15.15 below, we can take B to be fn(V1, . . . , Vn, U2), where fn : [0, 1]n+1 → D[0, 1] is measurable and D[0, 1] has the Skorohod rather than uniform metric, since both t → √n(Ĥn(t) − t) and t → B(t) are Borel measurable on the Skorohod space D[0, 1]. Since P(B ∈ C[0, 1]) = 1, and since the uniform and Skorohod metrics are equivalent on C[0, 1], we now have that fn is also measurable with respect to the uniform topology. Thus the map gn : Rn × [0, 1] → C[0, 1] defined by the composition (Y1, . . . , Yn, U0) → (V1, . . . , Vn, U2) → B is Borel measurable, and (15.31) follows.

Lemma 15.15 Given two random elements X and Y in a separable metric space X, there exists a Borel measurable f : X × [0, 1] → X and a uniform random variable Z independent of X, such that Y = f(X, Z) almost surely.

Proof. The result and proof are given in Skorohod (1976). While Skorohod's paper does not specify uniformity of Z, this readily follows without loss of generality, since any real random variable can be expressed as a right-continuous function of a uniform deviate.

15.5.3 Inference for Marginal Sample Means

For each 1 ≤ j ≤ pn, assume for this section that Fj(n) has finite mean µj(n) and standard deviation σj(n) > 0. Let X̄j(n) be the sample mean of X1j(n), . . . , Xnj(n). The following corollary yields simultaneous consistency of the marginal sample means. All proofs are given at the end of this subsection.

Corollary 15.16 Assume the closure of the support of Fj(n) is a compact interval [aj(n), bj(n)] with aj(n) < bj(n). Under the conditions of Theorem 15.12 and with the same constant c0, we have for all n, pn ≥ 2,

(15.34)  ∥ max1≤j≤pn |X̄j(n) − µj(n)| ∥ψ2 ≤ c0 √(log pn / n) max1≤j≤pn |bj(n) − aj(n)|.

Note that Corollary 15.16 slightly extends the large p small n consistency results of van der Laan and Bryan (2001) by allowing the range of the support to increase with n, provided it does not increase too rapidly. Kosorok and Ma (2007) show that the bounded range assumption can be relaxed at the expense of increasing the upper bound in (15.34).

Now suppose we wish to test the marginal null hypothesis Hj(n)0 : µj(n) = µ0,j(n) with Tj(n) = √n(X̄j(n) − µ0,j(n))/σ̂j(n), where σ̂j(n) is a location-invariant and consistent estimator of σj(n). To use FDR, as mentioned previously, we need uniformly consistent estimates of the p-values of these tests. While permutation methods can be used, an easier way is to use π̂j(n) = 2Φ(−|Tj(n)|), where Φ is the standard normal distribution function, but we need to show this is valid. For the estimator σ̂j(n), we require σ̂j(n)/σj(n) to be uniformly consistent for 1, i.e.,

E0n ≡ max1≤j≤pn |σ̂²j(n)/σ²j(n) − 1| = oP(1).

One choice for σ̂²j(n) that satisfies this requirement is the sample variance S²j(n) for X1j(n), . . . , Xnj(n):

Corollary 15.17 Assume n → ∞, with pn ≥ 2 and log pn = o(n^γ) for some γ ∈ (0, 1]. Assume the closure of the support of Fj(n) is compact as in Corollary 15.16, and let dn ≡ max1≤j≤pn σ^(−1)j(n)|bj(n) − aj(n)|. Then E0n = O(n^(−1)) + oP(d²n n^(γ/2−1/2)). In particular, if dn = O(1), then E0n = oP(1).

This approach leads to uniformly consistent p-values:

Corollary 15.18 Assume as n → ∞ that pn ≥ 2, log pn = o(n^(1/2)), dn = O(1) and E0n = oP(1). Then there exist standard normal random variables Z1(n), . . . , Zpn(n) which are conditionally independent given Fn, with each Zj(n) having conditional distribution given Fn depending only on Fj(n), 1 ≤ j ≤ pn, such that

(15.35)  max1≤j≤pn |π̂j(n) − πj(n)| = oP(1),

where

(15.36)  πj(n) ≡ 2Φ( −| Zj(n) + √n(µj(n) − µ0,j(n))/σj(n) | ), 1 ≤ j ≤ pn.

In particular, (15.35) holds if σ̂²j(n) = S²j(n).

Corollary 15.18 tells us that if we assume bounded ranges of the distributions, then the approximate p-values are asymptotically valid, provided we use the sample standard deviation for t-tests and log pn = o(n^(1/2)).
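The testing recipe above can be sketched in a few lines (hypothetical simulated data; Tj(n) and π̂j(n) as defined in the text, with the sample variance as the scale estimator):

```python
import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(2)

def norm_cdf(x):
    """Standard normal distribution function Phi."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

n, p = 50, 200
X = rng.normal(size=(n, p))     # null mu_j = 0 for all genes...
X[:, 0] += 1.0                  # ...except gene 0, which is shifted

mu0 = 0.0
xbar = X.mean(axis=0)
s = X.std(axis=0, ddof=1)                  # sample standard deviations
T = sqrt(n) * (xbar - mu0) / s             # marginal test statistics
pvals = 2.0 * np.array([norm_cdf(-abs(t)) for t in T])
```

The resulting approximate p-values can then be fed to an FDR procedure such as Benjamini-Hochberg.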

Proof of Corollary 15.16. Apply Theorem 15.12 and the following identity:

(15.37)  ∫[aj(n),bj(n)] x [dF̂j(n)(x) − dFj(n)(x)] = − ∫[aj(n),bj(n)] [F̂j(n)(x) − Fj(n)(x)] dx.

Proof of Corollary 15.17. Let S̃²j(n) be the sample variance version with n in the denominator, and let S²j(n) be the version with denominator n − 1. Then S²j(n)/σ²j(n) − 1 = O(n^(−1)) + (1 + o(1))(S̃²j(n)/σ²j(n) − 1), and thus we can assume the denominator is n after adding the term O(n^(−1)). Note that

|S̃²j(n)/σ²j(n) − 1| ≤ | n^(−1) ∑ni=1 (Xij(n) − µj(n))²/σ²j(n) − 1 | + ( (X̄j(n) − µj(n))/σj(n) )².

Apply Corollary 15.16 twice, once for the data (Xij(n) − µj(n))²/σ²j(n) and once for the data (Xij(n) − µj(n))/σj(n). This gives us

max1≤j≤pn |S̃²j(n)/σ²j(n) − 1| ≤ OP( √(log pn/n) d²n + (log pn/n) d²n )

= oP(d²n n^(γ/2−1/2)), since log pn = o(n^γ) by assumption, and the desired

result follows.

Proof of Corollary 15.18. Note that for any x ∈ R and any y > 0, |Φ(xy) − Φ(x)| ≤ 0.25 × (|1 − y| ∨ |1 − 1/y|). Thus

max1≤j≤pn |π̂j(n) − π∗j(n)| ≤ (1/2) max1≤j≤pn ( (σ̂j(n) ∨ σj(n)) |1/σ̂j(n) − 1/σj(n)| ) = OP(E0n^(1/2)),

where π∗j(n) ≡ 2Φ(−|T∗j(n)|) and T∗j(n) ≡ √n(X̄j(n) − µ0,j(n))/σj(n). Now Theorem 15.13 yields

max1≤j≤pn |π∗j(n) − πj(n)| = OP( (c1 log n + c2 log pn)/√n ) × max1≤j≤pn |bj(n) − aj(n)|/σj(n) = oP(n^(γ−1/2) dn),

and thus max1≤j≤pn |π̂j(n) − πj(n)| = OP(E0n^(1/2)) + oP(n^(γ−1/2) dn). The desired results now follow.
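The normal-distribution-function inequality used at the start of the proof, |Φ(xy) − Φ(x)| ≤ 0.25(|1 − y| ∨ |1 − 1/y|), can be checked numerically on a grid (a verification sketch, not part of the text):

```python
import numpy as np
from math import erf, sqrt

def norm_cdf(x):
    # standard normal distribution function Phi
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

xs = np.linspace(-8.0, 8.0, 401)
ys = np.concatenate([np.linspace(0.05, 1.0, 40), np.linspace(1.0, 20.0, 40)])

ok = True
for y in ys:
    bound = 0.25 * max(abs(1.0 - y), abs(1.0 - 1.0 / y))
    diffs = np.array([abs(norm_cdf(x * y) - norm_cdf(x)) for x in xs])
    ok = ok and bool(np.all(diffs <= bound + 1e-12))
print("inequality holds on grid:", ok)
```

The constant 0.25 is not far from tight: as y → 1 the ratio of the two sides approaches sup_x xφ(x) / 0.25 = φ(1)/0.25 ≈ 0.97.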

15.6 Exercises

15.6.1. For any c ≥ 1, let t → fc(t) ≡ c^(−1) log(1 + e^(ct)). Show that supt∈R |fa(t) − fb(t)| ≤ |a − b| and, for any 1 ≤ A0 < ∞, that

supa,b≥A0, t∈R |fa(t) − fb(t)| ≤ A0^(−1).

Use these results to show that N(ε, G2, ‖ · ‖∞) ≤ K0 ε^(−2), for all ε > 0.

15.6.2. In the proof of Theorem 15.6, it is necessary to verify that the proportional odds model is identifiable. This means that we need to show that whenever (1 + e^(β′Z)A(U))^(−1) = (1 + e^(β′0Z)A0(U))^(−1) almost surely, β = β0 and A(t) = A0(t) for all t ∈ [0, τ]. Show that this is true. Hint: Because A0 is continuous with density bounded above and below, it is enough to verify that whenever e^(β′Z)a(t) = e^(β′0Z) almost surely for every t ∈ [0, τ], where a(t) ≡ dA(t)/dA0(t), we have β = β0 and a(t) = 1 for all t ∈ [0, τ]. It may help to use the fact that var[Z] is positive definite.

15.6.3. Prove that the function φ defined in the proof of Lemma 15.7 is Lipschitz continuous on [−k, k] × [0, 1] with finite Lipschitz constant depending only on k, for all k < ∞.

15.6.4. In the context of Section 15.3.4, verify that for any 0 < q < p < ∞, the following are true for all θ ∈ lin Θ:

(i) (q/(2p))‖θ‖(p) ≤ ‖θ‖(q) ≤ ‖θ‖(p), and

(ii) p‖θ‖∞ ≤ ‖θ‖(p) ≤ 4p‖θ‖∞.

15.6.5. Verify the form of the components of σθ given in the formulas immediately following (15.20). It may help to scrutinize the facts that θ(h) = β′h1 + ∫0τ h2(s) dA(s) and that the derivative of t → PVτ(θt)(h), where θt = θ1 + tθ, is −θ(σθ1(h)).

15.6.6. For σ2 defined in the proof of Theorem 15.9, verify that the range of h → σ2(h) over the unit ball in H∞ lies within a compact set.

15.6.7. Show that (15.22) holds. Hint: Use the continuity of θ → Vτ(θ)(h).

15.6.8. In the proof of Lemma 15.11, verify that the conditional covariance of Bn has the form given in (15.26).

15.6.9. In the context of the proof of Theorem 15.14, show that there exists a function hn : Rn × [0, 1] → [0, 1]n such that (V1, . . . , Vn) ≡ hn(Y1, . . . , Yn, U1) is a sample of i.i.d. uniforms and (Y1, . . . , Yn) = (ψ(V1), . . . , ψ(Vn)), where ψ(u) ≡ inf{x : G(x) ≥ u}. You can use the result from Theorem 20.4 of Billingsley (1995) that yields the existence of a measurable function gm : [0, 1] → [0, 1]m, so that (U′1, . . . , U′m) ≡ gm(U1) are i.i.d. uniforms. The challenge is to address the possible presence of discontinuities in the distribution of Y1.

15.7 Notes

In the setting of Section 15.4, taking the supremum of |Un(ζ)| over ζ may not be optimal in some settings. The issue is that standard maximum likelihood optimality arguments do not hold when some of the parameters are unidentifiable under the null hypothesis. Using Bayesian arguments, Andrews and Ploberger (1994) show that it is sometimes better to compute a weighted integral over the unidentifiable index parameter (ζ in our example) rather than taking the supremum. This issue is explored briefly in the context of transformation models in Kosorok and Song (2007).

We also note that the original motivation behind the theory developed in Section 15.5 was to find a theoretical justification for using least absolute deviation estimation for a certain semiparametric regression model for robust normalization and significance analysis of cDNA microarrays. The model, estimation, and theoretical justification for this are described in Ma, Kosorok, Huang, et al. (2006).


16 Introduction to Semiparametric Inference

The goal of Part III, in combination with Chapter 3, is to provide a solid working knowledge of semiparametric inference techniques that will facilitate research in the area. Chapter 17 presents additional mathematical background, beyond that provided in Chapter 6, which enables the technical development of efficiency calculations and related mathematical tools needed for the remaining chapters. The topics covered include projections, Hilbert spaces, and adjoint operators in Banach spaces. The main topics overviewed in Chapter 3 will then be extended and generalized in Chapters 18 through 21. Many of these topics will utilize in a substantial way the empirical process methods developed in Part II. Part III, and the entire book, comes to a conclusion in the case studies presented in Chapter 22.

The overall development of the material in this last part will be logical and orderly, but not quite as linear as was the case with Part II. Only a subset of semiparametric inference topics will be covered, but the material that is covered should facilitate understanding of the omitted topics. The structure outlined in Chapter 3 should help the reader maintain perspective during the developments in this part, while the case study examples should help the reader successfully integrate the concepts discussed. Some of the concepts presented in Chapter 3 were covered sufficiently in that chapter and will not be discussed in greater detail in Part III. It might be helpful at this point for the reader to review Chapter 3 before continuing.

Many of the topics in Part III were chosen because of the author's personal interests and not because they are necessarily the most important topics in the area. In fact, there are a number of interesting and valuable topics in semiparametric inference that are omitted because of space and time constraints. Some of these omissions will be mentioned towards the end of this chapter. In spite of these caveats, the topics included are all worthwhile and, taken together, will provide a very useful introduction both for those who wish to pursue research in semiparametric inference and for those who simply want to sample the area.

The main components of semiparametric efficiency theory for general models are studied in Chapter 18. The important concepts of tangent spaces, regularity and efficiency are presented with greater mathematical completeness than in Section 3.2. With the background we have from Part II of the book, this mathematical rigor adds clarity to the presentation that is helpful in research. Efficiency for parameters defined as functionals of other parameters, infinite-dimensional parameters and groups of parameters is also studied in this chapter. The delta method from Chapter 12 proves to be useful in this setting. Chapter 18 then closes with a discussion of optimality theory for hypothesis testing. As one might expect, there is a fundamental connection between efficient estimators for a parameter and optimal tests of hypotheses involving that parameter.

In Chapter 19, we focus on efficient estimation and inference for the parametric component θ in a semiparametric model that includes a potentially infinite-dimensional component η. The concepts of efficient score functions discussed in Section 3.2 and least-favorable submodels are elaborated upon and connected to profile likelihood based estimation and inference. This includes a development of some of the methods hinted at in Section 3.4 as well as some discussion of penalized likelihoods.

Estimation and inference for regular but infinite-dimensional parameters using semiparametric likelihood is developed in Chapter 20. Topics covered include joint efficient estimation of θ and η, when η is infinite-dimensional but can still be estimated efficiently and at the √n-rate. Inference for both parameters, jointly and separately, is discussed. The usual weighted bootstrap will be discussed along with a more efficient approach, called the "piggyback bootstrap," that utilizes the profile likelihood structure. Empirical process methods from Part II will be quite useful in this chapter.

Chapter 21 will introduce and develop semiparametric estimation and inference for general semiparametric M-estimators, including non-likelihood estimation procedures such as least-squares and misspecified likelihood estimation. Because infinite-dimensional parameters are still involved, tangent spaces and other concepts from earlier chapters in Part III will prove useful in deriving limiting distributions. A weighted bootstrap procedure for inference will also be presented and studied. Entropy calculations and M-estimator theory discussed in Part II will be quite useful in these developments. The concepts, issues, and results will be illustrated with a number of interesting examples, including least-squares estimation for the Cox model with current status data and M-estimation for partly-linear logistic regression with a misspecified link function.


A number of important ideas will not be covered in these chapters, including local efficiency, double robustness and causal inference. The rich topic of efficiency in estimating equations will also not be discussed in detail, although the concept will be covered briefly in tandem with an example in the case studies of Chapter 22 (see Section 22.3). An estimating equation is locally efficient if it produces root-n consistent estimators for a large class of models while also producing an efficient estimator for at least one of the models in that class (see Section 1.2.6 of van der Laan and Robins, 2003). A semiparametric estimation procedure is doubly robust if only part of the model needs to be correctly specified in order to achieve full efficiency for the parameter of interest (see Section 1.6 of van der Laan and Robins, 2003). Double robustness can be quite useful in missing data settings. Causal inference is inference for causal relationships between predictors and outcomes and has emerged in recent years as one of the most active and important areas of biostatistical research. Semiparametric methods are extremely useful in causal inference because of the need to develop models which flexibly adjust for confounding variables and other factors. Additional details and insights into these concepts can be found in van der Laan and Robins (2003). A somewhat less technical but very informative discussion is given in Tsiatis (2006), who also examines semiparametric missing data models in some depth.

The case studies in Chapter 22 demonstrate that empirical process methods combined with semiparametric inference tools can solve many difficult and important problems in statistical settings in a manner that makes maximum, or nearly maximum, use of the available data under both realistic and flexible assumptions. Efficiency for both finite and infinite dimensional parameters is considered. The example studied in Section 22.3 includes some discussion of general theory for estimating equations. An interesting aspect of this example is that an optimal estimating equation is not in general achievable, although improvements in efficiency are possible for certain classes of estimating equations.

As mentioned previously, many important topics are either omitted or only briefly addressed. Thus it will be important for the interested reader to pursue additional references before embarking on certain kinds of research. Another point that needs to be made is that optimality is not the only worthy goal in semiparametric inference, although it is important. An even more important aspect of semiparametric methods is their value in developing flexible and meaningful models for scientific research. This is the primary goal of this book, and it is hoped that the concepts presented will prove useful as a starting point for research in empirical processes and semiparametric inference.


17 Preliminaries for Semiparametric Inference

In this chapter, we present additional technical background, beyond that given in Chapter 6, needed for the development of semiparametric inference theory. We first discuss projections without direct reference to Hilbert spaces. Next, we present basic properties and results for Hilbert spaces and revisit projections as objects in Hilbert spaces. Finally, we build on the Banach space material presented in Chapter 6, and in other places such as Chapter 12. Concepts discussed include adjoints of operators and dual spaces. The background in this chapter provides the basic framework for a "semiparametric efficiency calculus" that is used for important computations in semiparametric inference research. The development of this calculus will be the main goal of the next several chapters.

17.1 Projections

Geometrically, the projection of an object T onto a space S is the element Ŝ ∈ S that is "closest" to T, provided such an element exists. In the semiparametric inference context, the object is usually a random variable and the spaces of interest for projection are usually sets of square-integrable random variables. The following theorem gives a simple method for identifying the projection in this setting:

Theorem 17.1 Let S be a linear space of real random variables with finite second moments. Then Ŝ is the projection of T onto S if and only if

(i) Ŝ ∈ S and

(ii) E(T − Ŝ)S = 0 for all S ∈ S.

If S1 and S2 are both projections, then S1 = S2 almost surely. If S contains the constants, then ET = EŜ and cov(T − Ŝ, S) = 0 for all S ∈ S.

Proof. First assume (i) and (ii) hold. Then, for any S ∈ S, we have

(17.1)  E(T − S)² = E(T − Ŝ)² + 2E(T − Ŝ)(Ŝ − S) + E(Ŝ − S)².

But Conditions (i) and (ii) force the middle term to be zero, and thus E(T − S)² ≥ E(T − Ŝ)², with strict inequality whenever E(Ŝ − S)² > 0. Thus Ŝ is the almost surely unique projection of T onto S.

Conversely, assume that Ŝ is a projection and note that for any α ∈ R and any S ∈ S,

E(T − Ŝ − αS)² − E(T − Ŝ)² = −2αE(T − Ŝ)S + α²ES².

Since Ŝ is a projection, the left side is nonnegative for every α. But the parabola α → α²ES² − 2αE(T − Ŝ)S is nonnegative for all α and S only if E(T − Ŝ)S = 0 for all S. Thus (ii) holds, and (i) is part of the definition of a projection and so holds automatically.

The uniqueness follows from application of (17.1) to both S1 and S2, forcing E(S1 − S2)² = 0. If the constants are in S, then Condition (ii) implies that E(T − Ŝ)c = 0 for any c ∈ R, including c = 1. Thus the final assertion of the theorem also follows.

Note that the theorem does not imply that a projection always exists. In fact, if the set S is open in the L2(P) norm, then the infimum of E(T − S)² over S ∈ S is not achieved. A sufficient condition for existence, then, is that S be closed in the L2(P) norm, but often existence can be established directly. We will visit this issue more in the upcoming chapters.

A very useful example of a projection is a conditional expectation. Let X and Y be real random variables on a probability space. Then g0(y) ≡ E(X | Y = y) is the conditional expectation of X given Y = y. If we let G be the space of all measurable functions g of Y such that Eg²(Y) < ∞, then it is easy to verify that

E(X − g0(Y))g(Y) = 0, for every g ∈ G.

Thus, provided Eg0²(Y) < ∞, E(X | Y = y) is the projection of X onto the linear space G. By Theorem 17.1, the conditional expectation is almost surely unique. We will utilize conditional expectations frequently for calculating the projections needed in semiparametric inference settings.
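The orthogonality property E(X − g0(Y))g(Y) = 0 can be illustrated by simulation (a hypothetical model X = Y² + ε with g0(y) = y², assumed only for this sketch):

```python
import numpy as np

rng = np.random.default_rng(3)

m = 200_000
Y = rng.normal(size=m)
X = Y**2 + rng.normal(size=m)       # so E(X | Y) = Y^2 in this toy model

g0 = Y**2                           # the conditional expectation g0(Y)
resid = X - g0

# E[(X - g0(Y)) g(Y)] should be (approximately) 0 for square-integrable g
inner = {name: float(np.mean(resid * g))
         for name, g in [("Y", Y), ("Y^2", Y**2),
                         ("sin Y", np.sin(Y)), ("1", np.ones(m))]}

# and g0 minimizes the L2 distance among functions of Y
mse_g0 = float(np.mean(resid**2))
mse_alt = float(np.mean((X - np.abs(Y))**2))   # a competing g(Y) = |Y|
```

The empirical inner products are all near zero, and g0 beats the competing function of Y in mean squared error, exactly as the projection characterization predicts.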

17.2 Hilbert Spaces

A Hilbert space is essentially an abstract generalization of a finite-dimensional Euclidean space. This abstraction is a special case of a Banach space and, like a Banach space, is often infinite-dimensional. To be precise, a Hilbert space is a Banach space with an inner product. An inner product on a Banach space D with norm ‖ · ‖ is a function 〈·, ·〉 : D × D → R such that, for all α, β ∈ R and x, y, z ∈ D, the following hold:

(i) 〈x, x〉 = ‖x‖²,

(ii) 〈x, y〉 = 〈y, x〉, and

(iii) 〈αx + βy, z〉 = α〈x, z〉 + β〈y, z〉.

A semi-inner product arises when ‖ · ‖ is a semi-norm.

It is also possible to generate a norm beginning with an inner product. Let D be a linear space with semi-inner product 〈·, ·〉 (not necessarily with a norm). In other words, 〈·, ·〉 satisfies Conditions (ii)–(iii) of the previous paragraph and also satisfies 〈x, x〉 ≥ 0. Then ‖x‖ ≡ 〈x, x〉^(1/2), for all x ∈ D, defines a semi-norm. This is verified in the following theorem:

Theorem 17.2 Let 〈·, ·〉 be a semi-inner product on D, with ‖x‖ ≡ 〈x, x〉^(1/2) for all x ∈ D. Then, for all α ∈ R and all x, y ∈ D,

(a) 〈x, y〉 ≤ ‖x‖ ‖y‖,

(b) ‖x + y‖ ≤ ‖x‖ + ‖y‖, and

(c) ‖αx‖ = |α| × ‖x‖.

Moreover, if 〈·, ·〉 is also an inner product, then ‖x‖ = 0 if and only if x = 0.

Proof. Note that

0 ≤ 〈x − αy, x − αy〉 = 〈x, x〉 − 2α〈x, y〉 + α²〈y, y〉,

from which we can deduce

0 ≤ at² + bt + c ≡ q(t),

where c ≡ 〈x, x〉, b ≡ −2|〈x, y〉|, a ≡ 〈y, y〉, and t ≡ α sign〈x, y〉 (which is clearly free to vary over R). This now forces q(t) to have at most one real root. This implies that the discriminant from the quadratic formula is not positive. Hence

0 ≥ b² − 4ac = 4〈x, y〉² − 4〈x, x〉〈y, y〉,

and Part (a) follows.

Using (a), we have for any x, y ∈ D,

‖x + y‖² = 〈x + y, x + y〉 = ‖x‖² + 2〈x, y〉 + ‖y‖² ≤ ‖x‖² + 2‖x‖ ‖y‖ + ‖y‖² = (‖x‖ + ‖y‖)²,

and (b) follows. Part (c) follows readily from the equalities

‖αx‖² = 〈αx, αx〉 = α〈αx, x〉 = α²〈x, x〉 = (α‖x‖)².

The final assertion follows readily from the definitions.

Part (a) in Theorem 17.2 is also known as the Cauchy-Schwarz inequality. Two elements x, y in a Hilbert space H with inner product 〈·, ·〉 are orthogonal if 〈x, y〉 = 0, denoted x ⊥ y. For any set C ⊂ H and any x ∈ H, x is orthogonal to C if x ⊥ y for every y ∈ C, denoted x ⊥ C. Two subsets C1, C2 ⊂ H are orthogonal if x ⊥ y for every x ∈ C1 and y ∈ C2, denoted C1 ⊥ C2. For any C1 ⊂ H, the orthocomplement of C1, denoted C1⊥, is the set {x ∈ H : x ⊥ C1}.

Let the subspace H ⊂ H be linear and closed. By Theorem 17.1, we have for any x ∈ H that there exists an element y ∈ H that satisfies ‖x − y‖ ≤ ‖x − z‖ for all z ∈ H, and such that 〈x − y, z〉 = 0 for all z ∈ H. Let Π denote an operator that performs this projection, i.e., let Πx ≡ y, where y is the projection of x onto H. The projection operator Π : H → H has several important properties, which we give in the following theorem. Recall from Section 6.3 the definitions of the null space N(T) and range R(T) of an operator T.

Theorem 17.3 Let H be a closed, linear subspace of H, and let Π : H → H be the projection operator onto H. Then

(i) Π is continuous and linear,

(ii) ‖Πx‖ ≤ ‖x‖ for all x ∈ H,

(iii) Π² ≡ ΠΠ = Π, and

(iv) N(Π) = H⊥ and R(Π) = H.

Proof. Let x, y ∈ H and α, β ∈ R. If z ∈ H, then

〈[αx + βy] − [αΠx + βΠy], z〉 = α〈x − Πx, z〉 + β〈y − Πy, z〉 = 0.

Now by the uniqueness of projections (via Theorem 17.1), we have that Π(αx + βy) = αΠx + βΠy. This yields linearity of Π. If we can establish (ii), (i) will follow. Since 〈x − Πx, Πx〉 = 0 for any x ∈ H, we have

‖x‖² = ‖x − Πx‖² + ‖Πx‖² ≥ ‖Πx‖²,

and (ii) (and hence also (i)) follows.

For any y ∈ H, Πy = y. Thus for any x ∈ H, Π(Πx) = Πx, and (iii) follows. Now let x ∈ N(Π). Then x = x − Πx ∈ H⊥. Conversely, for any x ∈ H⊥, Πx = 0 by definition, and thus x ∈ N(Π). Hence N(Π) = H⊥. It is trivial to verify from the definitions that R(Π) ⊂ H. Moreover, for any x ∈ H, Πx = x, and thus H ⊂ R(Π). This implies (iv).


One other point we note is that for any projection Π onto a closed linear subspace ℋ ⊂ H, I − Π, where I is the identity, is also a projection, onto the closed linear subspace ℋ⊥. An important example of a Hilbert space H, one that will be of great interest to us throughout this chapter, is H = L₂(P) with inner product 〈f, g〉 = ∫ fg dP. A closed, linear subspace of interest to us is L₂⁰(P) ⊂ L₂(P), which consists of all mean-zero functions in L₂(P). The projection operator Π : L₂(P) → L₂⁰(P) is Πx = x − Px. To see this, note that Πx ∈ L₂⁰(P) and 〈x − Πx, y〉 = 〈Px, y〉 = Px Py = 0 for all y ∈ L₂⁰(P). Thus by Theorem 17.1, Πx is the unique projection onto L₂⁰(P). It is also not hard to verify that I − Π is the projection onto the constants (see Exercise 17.4.2).
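As a quick finite-sample analogue of this projection (our own sketch, using numpy; the sample-based setting and all variable names are illustrative assumptions, not from the text), centering values at their empirical mean plays the role of Πx = x − Px:

```python
import numpy as np

# Empirical analogue of the projection Pi x = x - Px from L2(P) onto the
# mean-zero functions: represent a "function" by its values on a sample
# and subtract the empirical mean.
rng = np.random.default_rng(0)

x = rng.normal(size=1000) ** 2        # values of a non-centered function
proj_x = x - x.mean()                 # Pi x = x - Px

# proj_x has mean zero, and the residual x - proj_x (a constant) is
# orthogonal to every mean-zero y in the empirical inner product Pn[fg].
y = rng.normal(size=1000)
y = y - y.mean()                      # an arbitrary mean-zero element

print(abs(proj_x.mean()))             # ~0
print(abs(np.mean((x - proj_x) * y))) # ~0: residual is orthogonal
```

The residual x − Πx is constant, so its empirical inner product with any mean-zero y vanishes, mirroring the orthogonality in Theorem 17.1.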

Now consider the situation where there are two closed, linear subspaces ℋ₁, ℋ₂ ⊂ H, but ℋ₁ and ℋ₂ are not necessarily orthogonal. Let Πⱼ be the projection onto ℋⱼ, and define Qⱼ ≡ I − Πⱼ, for j = 1, 2. The idea of alternating projections is to alternate between projecting h ∈ H onto the orthocomplements of ℋ₁ and ℋ₂ repeatedly, so that in the limit we obtain the projection h̃ of h onto the orthocomplement of the closure of the sumspace of ℋ₁ and ℋ₂, denoted ℋ₁ + ℋ₂. The sumspace of ℋ₁ and ℋ₂ is simply the closed linear span of {h₁ + h₂ : h₁ ∈ ℋ₁, h₂ ∈ ℋ₂}. Then h − h̃ is hopefully the projection of h onto ℋ₁ + ℋ₂.

For any h ∈ H, let hⱼ⁽ᵐ⁾ ≡ Πⱼ[I − (Q₁Q₂)ᵐ]h and h̃ⱼ⁽ᵐ⁾ ≡ Πⱼ[I − (Q₂Q₁)ᵐ]h, for j = 1, 2. Also, let Π be the projection onto the closure of ℋ₁ + ℋ₂ and Q ≡ I − Π. The results of the following theorem are that Π can be computed as the limit of iterations between Q₂ and Q₁, and that Πh can be expressed as a sum of elements in ℋ₁ and ℋ₂ under certain conditions. These results will be useful in the partly linear Cox model case study of Chapter 22. We omit the proof, which can be found in Section A.4 of Bickel, Klaassen, Ritov and Wellner (1998) (hereafter abbreviated BKRW):

Theorem 17.4 Assume that ℋ₁, ℋ₂ ⊂ H are closed and linear. Then, for any h ∈ H,

(i) ‖h₁⁽ᵐ⁾ + h₂⁽ᵐ⁾ − Πh‖ ≡ uₘ → 0, as m → ∞;

(ii) ‖h̃₁⁽ᵐ⁾ + h̃₂⁽ᵐ⁾ − Πh‖ ≡ ũₘ → 0, as m → ∞;

(iii) uₘ ∨ ũₘ ≤ ρ²⁽ᵐ⁻¹⁾, where ρ is the cosine of the minimum angle τ between ℋ₁ and ℋ₂ considering only elements in (ℋ₁ ∩ ℋ₂)⊥;

(iv) ρ < 1 if and only if ℋ₁ + ℋ₂ is closed; and

(v) ‖I − (Q₂Q₁)ᵐ − Π‖ = ‖I − (Q₁Q₂)ᵐ − Π‖ = ρ²ᵐ⁻¹.
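The alternating-projections iteration of Theorem 17.4 can be sketched numerically in ℝ³ (a hypothetical example of ours, assuming numpy; the subspaces and variable names are not from the text):

```python
import numpy as np

# H1 = span{e1}, H2 = span{(1,1,0)/sqrt(2)}: then H1 + H2 is the xy-plane
# and its orthocomplement is the z-axis.
e1 = np.array([1.0, 0.0, 0.0])
u = np.array([1.0, 1.0, 0.0]) / np.sqrt(2.0)

P1 = np.outer(e1, e1)            # projection Pi_1 onto H1
P2 = np.outer(u, u)              # projection Pi_2 onto H2
Q1, Q2 = np.eye(3) - P1, np.eye(3) - P2

h = np.array([1.0, 2.0, 3.0])
v = h.copy()
for _ in range(60):              # iterate (Q2 Q1)^m h
    v = Q2 @ (Q1 @ v)

# v converges to Qh, the projection of h onto (H1 + H2)^perp (the z-axis),
# so h - v recovers Pi h.  The rate is governed by rho = <e1, u> = 1/sqrt(2).
print(np.round(v, 6))            # ~ [0, 0, 3]
print(np.round(h - v, 6))        # ~ [1, 2, 0]
```

Here ρ = 1/√2 < 1, consistent with part (iv): the sumspace (the xy-plane) is closed, and the iterates converge geometrically as in parts (iii) and (v).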

We close this section with a brief discussion of linear functionals on Hilbert spaces. Recall from Section 6.3 the definition of a linear operator and the fact that the norm of a linear operator T : D → E is ‖T‖ ≡ sup{‖Tx‖ : x ∈ D, ‖x‖ ≤ 1}. In the special case where E = ℝ, a linear operator is called a linear functional. A linear functional T, like a linear operator, is bounded when ‖T‖ < ∞. By Proposition 6.15, boundedness is equivalent to continuity in this setting. A very important result for bounded linear functionals in Hilbert spaces is the following:

Theorem 17.5 (Riesz representation theorem) If L : H → ℝ is a bounded linear functional on a Hilbert space H, then there exists a unique element h₀ ∈ H such that L(h) = 〈h, h₀〉 for all h ∈ H, and, moreover, ‖L‖ = ‖h₀‖.

Proof. Let ℋ = N(L), and note that ℋ is closed and linear because L is continuous and the space {0} (consisting of only zero) is trivially closed and linear. Assume that ℋ ≠ H, since otherwise the proof would be trivial with h₀ = 0. Thus there is a vector f₀ ∈ ℋ⊥ such that L(f₀) = 1. Hence, for all h ∈ H, h − L(h)f₀ ∈ ℋ, since

L(h − L(h)f₀) = L(h) − L(h)L(f₀) = 0.

Thus, by the orthogonality of ℋ and ℋ⊥, we have for all h ∈ H that

0 = 〈h − L(h)f₀, f₀〉 = 〈h, f₀〉 − L(h)‖f₀‖².

Hence if h₀ ≡ ‖f₀‖⁻²f₀, then L(h) = 〈h, h₀〉 for all h ∈ H. Suppose h₀′ ∈ H also satisfies 〈h, h₀′〉 = 〈h, h₀〉 for all h ∈ H; then h₀ − h₀′ ⊥ H, and thus h₀′ = h₀. Since by the Cauchy–Schwarz inequality |〈h, h₀〉| ≤ ‖h‖ ‖h₀‖, and also 〈h₀, h₀〉 = ‖h₀‖², we have ‖L‖ = ‖h₀‖, and thus the theorem is proved.
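In ℝᵏ the Riesz representer is explicit (a hypothetical finite-dimensional sketch of ours, assuming numpy; the functional and names are not from the text): every linear functional has the form L(h) = 〈h, h₀〉, and its operator norm equals ‖h₀‖:

```python
import numpy as np

# A bounded linear functional on H = R^3 and its Riesz representer h0.
h0 = np.array([3.0, -1.0, 2.0])

def L(h):
    return float(h0 @ h)          # L(h) = <h, h0>

# The operator norm sup{|L(h)| : ||h|| <= 1} is attained at h = h0/||h0||,
# which is exactly where Cauchy-Schwarz holds with equality.
op_norm = L(h0 / np.linalg.norm(h0))
print(op_norm, np.linalg.norm(h0))   # the two agree
```

The bound |L(h)| ≤ ‖h‖ ‖h₀‖ holds for every h, with equality along h₀, which is the finite-dimensional shadow of ‖L‖ = ‖h₀‖ above.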

17.3 More on Banach Spaces

Recall the discussion about Banach spaces in Section 6.1. As with Hilbert spaces, a linear functional on a Banach space is just a linear operator with real range. The dual space B∗ of a Banach space B is the set of all continuous, linear functionals on B. By applying Proposition 6.15, it is clear that every b∗ ∈ B∗ satisfies |b∗b| ≤ ‖b∗‖ ‖b‖ for every b ∈ B, where ‖b∗‖ ≡ sup{|b∗b| : b ∈ B, ‖b‖ ≤ 1} < ∞.

For the special case of a Hilbert space H, H∗ can be identified with H by the Riesz representation theorem given above. This implies that there exists an isometry (a one-to-one map that preserves norms) between H and H∗. To see this, choose h∗ ∈ H∗ and let h̃ ∈ H be the unique element that satisfies 〈h, h̃〉 = h∗h for all h ∈ H. Then

‖h∗‖ = sup{|〈h, h̃〉| : h ∈ H, ‖h‖ ≤ 1} = ‖h̃‖

by the Cauchy–Schwarz inequality (with equality attained at h = h̃/‖h̃‖). The desired conclusion follows since h∗ was arbitrary.



We now return to the generality of Banach spaces. For each continuous, linear operator between Banach spaces A : B₁ → B₂, there exists an adjoint map (or just adjoint) A∗ : B₂∗ → B₁∗ defined by (A∗b₂∗)b₁ = b₂∗Ab₁ for all b₁ ∈ B₁ and b₂∗ ∈ B₂∗. It is straightforward to verify that the resulting A∗ is linear. The following proposition tells us that A∗ is also continuous (by being bounded):

Proposition 17.6 Let A : B₁ → B₂ be a bounded linear operator between Banach spaces. Then ‖A∗‖ = ‖A‖.

Proof. Since, for any b₂∗ ∈ B₂∗,

‖A∗b₂∗‖ = sup{|(A∗b₂∗)b₁| : b₁ ∈ B₁, ‖b₁‖ ≤ 1}
= sup{|b₂∗(Ab₁/‖Ab₁‖)| ‖Ab₁‖ : b₁ ∈ B₁, ‖b₁‖ ≤ 1}
≤ ‖b₂∗‖ ‖A‖,

we have ‖A∗‖ ≤ ‖A‖. Thus A∗ is a continuous, linear operator.

Now let A∗∗ : B₁∗∗ → B₂∗∗ be the adjoint of A∗ with respect to the double duals (duals of the duals) of B₁ and B₂. Note that for j = 1, 2, Bⱼ ⊂ Bⱼ∗∗, since for any bⱼ ∈ Bⱼ, the map bⱼ : Bⱼ∗ → ℝ defined by bⱼ∗ ↦ bⱼ∗bⱼ is a bounded linear functional. By the definitions involved, we now have for any b₁ ∈ B₁ and b₂∗ ∈ B₂∗ that

(A∗∗b₁)b₂∗ = (A∗b₂∗)b₁ = b₂∗Ab₁,

and thus ‖A∗∗‖ ≤ ‖A∗‖ and the restriction of A∗∗ to B₁, denoted hereafter A₁∗∗, equals A. Hence ‖A‖ = ‖A₁∗∗‖ ≤ ‖A∗‖, and the desired result follows.

We can readily see that the adjoint of an operator A : H₁ → H₂ between two Hilbert spaces, with respective inner products 〈·, ·〉₁ and 〈·, ·〉₂, is a map A∗ : H₂ → H₁ satisfying 〈Ah₁, h₂〉₂ = 〈h₁, A∗h₂〉₁ for every h₁ ∈ H₁ and h₂ ∈ H₂. Note that we are using here the isometry for Hilbert spaces described at the beginning of this section. Now consider the adjoint of a restriction of a continuous linear operator A : H₁ → H₂, namely A₀ : H₀,₁ ⊂ H₁ → H₂, where H₀,₁ is a closed, linear subspace of H₁. If Π : H₁ → H₀,₁ is the projection onto the subspace, it is not hard to verify that the adjoint of A₀ is A₀∗ ≡ Π A∗ (see Exercise 17.4.3).

Recall from Section 6.3 the notation B(D, E) denoting the collection of all linear operators between normed spaces D and E. From Lemma 6.16, we know that for a given T ∈ B(B₁, B₂), for Banach spaces B₁ and B₂, R(T) is not closed unless T is continuously invertible. We now give an interesting counter-example that illustrates this. Let B₁ = B₂ = L₂(0, 1), and define T : L₂(0, 1) → L₂(0, 1) by Tx(u) ≡ ux(u). Then ‖T‖ ≤ 1, and thus T ∈ B(L₂(0, 1), L₂(0, 1)). However, it is clear that

R(T) = { y ∈ L₂(0, 1) : ∫₀¹ u⁻² y²(u) du < ∞ }.    (17.2)



Although R(T) is dense in L₂(0, 1) (see Exercise 17.4.4), the functions y₁(u) ≡ 1 and y₂(u) ≡ √u are clearly not in R(T). Thus R(T) is not closed.

This lack of closure of R(T) arises from the simple fact that the inverse operator T⁻¹y(u) = u⁻¹y(u) is not bounded over y ∈ L₂(0, 1) (consider y = 1, for example). On the other hand, it is easy to verify that for any normed spaces D and E and any T ∈ B(D, E), N(T) is always closed, as a direct consequence of the continuity of T. Observe also that for any T ∈ B(B₁, B₂),

N(T∗) = {b₂∗ ∈ B₂∗ : (T∗b₂∗)b₁ = 0 for all b₁ ∈ B₁}    (17.3)
= {b₂∗ ∈ B₂∗ : b₂∗(Tb₁) = 0 for all b₁ ∈ B₁}
= R(T)⊥,

where R(T)⊥ is an abuse of notation used in this context to denote the linear functionals in B₂∗ that yield zero on R(T). For Hilbert spaces, the notation is valid because of the isometry between a Hilbert space H and its dual H∗. The identity (17.3) has the following interesting extension:

Theorem 17.7 For two Banach spaces B₁ and B₂ and for any T ∈ B(B₁, B₂), R(T) = B₂ if and only if N(T∗) = {0} and R(T∗) is closed.

Proof. If R(T) = B₂, then {0} = R(T)⊥ = N(T∗) by (17.3), and thus T∗ is one-to-one. Moreover, since B₂ is closed, we also have that R(T∗) is closed by Lemma 17.8 below. Conversely, if N(T∗) = {0} and R(T∗) is closed, then, by reapplication of (17.3), R(T)⊥ = {0} and R(T) is closed, which implies R(T) = B₂.

Lemma 17.8 For two Banach spaces B₁ and B₂ and for any T ∈ B(B₁, B₂), R(T) is closed if and only if R(T∗) is closed.

Proof. This is part of Theorem 4.14 of Rudin (1991), and we omit the proof.

If we specialize (17.3) to Hilbert spaces H₁ and H₂, we obtain trivially for any A ∈ B(H₁, H₂) that R(A)⊥ = N(A∗). We close this section with the following useful, additional result for Hilbert spaces:

Theorem 17.9 For two Hilbert spaces H₁ and H₂ and any A ∈ B(H₁, H₂), the following are equivalent: R(A) is closed, R(A∗) is closed, and R(A∗A) is closed. Moreover, if R(A) is closed, then R(A∗) = R(A∗A) and

A(A∗A)⁻¹A∗ : H₂ → H₂

is the projection onto R(A).

Proof. The result that R(A) is closed if and only if R(A∗) is closed holds by Lemma 17.8, but we will prove this again, for the specialization to Hilbert spaces, to highlight some interesting features that will be useful later in the proof.

First assume R(A) is closed, and let A₀∗ be the restriction of A∗ to R(A), and note that R(A∗) = R(A₀∗), since R(A)⊥ = N(A∗) according to (17.3). Also let A₀ be the restriction of A to N(A)⊥, and note that R(A₀) = R(A) by the definition of N(A), and therefore R(A₀) is closed with N(A₀) = {0}. Thus by Part (ii) of Lemma 6.16, there exists a c > 0 such that ‖A₀x‖ ≥ c‖x‖ for all x ∈ N(A)⊥. Now fix y ∈ R(A₀), and note that there exists an x ∈ N(A)⊥ such that y = A₀x. Thus

‖x‖ ‖A₀∗y‖ ≥ 〈x, A₀∗y〉 = 〈A₀x, y〉 = ‖A₀x‖ ‖y‖ ≥ c‖x‖ ‖y‖,

and therefore ‖A₀∗y‖ ≥ c‖y‖ for all y ∈ R(A₀). This means, by reapplication of Part (ii) of Lemma 6.16, that R(A∗) = R(A₀∗) is closed.

Now assume R(A∗) is closed. By arguments identical to those used in the previous paragraph, we know that R(A∗∗) is closed. But because we are restricted to Hilbert spaces, A∗∗ = A, and thus R(A) is closed. Now assume that either R(A) or R(A∗) is closed. By what we have proven so far, we know that both R(A) and R(A∗) must be closed. By recycling arguments, we know that R(A∗A) = R(A₀∗A₀). Now, by applying yet again Part (ii) of Lemma 6.16, we have that there exist c₁, c₂ > 0 such that ‖A₀x‖ ≥ c₁‖x‖ and ‖A₀∗y‖ ≥ c₂‖y‖ for all x ∈ N(A)⊥ and y ∈ R(A). Thus for all x ∈ N(A)⊥,

‖A₀∗A₀x‖ ≥ c₂‖A₀x‖ ≥ c₁c₂‖x‖,

and thus by both parts of Lemma 6.16, R(A∗A) = R(A₀∗A₀) is closed.

Now assume that R(A∗A) is closed. Then R(A₀∗A₀) is also closed and, by recycling arguments, we have that N(A₀∗A₀) = N(A₀) = {0}. Thus there exists a c > 0 such that for all x ∈ N(A₀)⊥,

c‖x‖ ≤ ‖A₀∗A₀x‖ ≤ ‖A₀∗‖ ‖A₀x‖,

and therefore R(A) = R(A₀) is closed. This completes the first part of the proof.

Now assume that R(A) is closed, and hence so also are R(A∗) and R(A∗A). Clearly R(A∗A) ⊂ R(A∗). Since R(A)⊥ = N(A∗), we also have that R(A∗) ⊂ R(A∗A), and thus R(A∗) = R(A∗A). By Part (i) of Lemma 6.16, we now have that A∗A is continuously invertible on R(A∗), and thus A(A∗A)⁻¹A∗ is well defined on H₂. Observe that for any y ∈ R(A), there exists an x ∈ H₁ such that y = Ax, and thus

Πy ≡ A(A∗A)⁻¹A∗y = A(A∗A)⁻¹(A∗A)x = Ax = y.

For any y ∈ R(A)⊥, we have y ∈ N(A∗) and thus Πy = 0. Hence Π is the projection operator onto R(A), and the proof is complete.
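In finite dimensions, the operator A(A∗A)⁻¹A∗ of Theorem 17.9 is the familiar least-squares "hat matrix" (a hypothetical numerical sketch of ours, assuming numpy; the random matrix A is full column rank almost surely):

```python
import numpy as np

# A : R^3 -> R^6 as a full-column-rank matrix; its adjoint is A^T.
rng = np.random.default_rng(0)
A = rng.normal(size=(6, 3))

# Pi = A (A^T A)^{-1} A^T is the orthogonal projection onto R(A).
Pi = A @ np.linalg.inv(A.T @ A) @ A.T

assert np.allclose(Pi @ Pi, Pi)     # idempotent: Pi^2 = Pi
assert np.allclose(Pi, Pi.T)        # self-adjoint
assert np.allclose(Pi @ A, A)       # fixes every element of R(A)

# Any y splits as Pi y + (I - Pi) y with (I - Pi) y orthogonal to R(A),
# i.e., (I - Pi) y lies in N(A^T) = R(A)^perp.
y = rng.normal(size=6)
assert np.allclose(A.T @ (y - Pi @ y), 0.0)
```

The assertions mirror the two cases in the proof: Πy = y on R(A) and Πy = 0 on R(A)⊥ = N(A∗).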



17.4 Exercises

17.4.1. In the context of Theorem 17.3, show that for any projection Π : H → ℋ, where ℋ ⊂ H is closed and linear, I − Π is also a projection, onto ℋ⊥.

17.4.2. In the projection example given in the paragraph following the proof of Theorem 17.3, show that I − Π is the projection onto the space of constants, where Π is the projection from L₂(P) onto L₂⁰(P).

17.4.3. Let H₁ and H₂ be two Hilbert spaces, and let A₀ : H₀,₁ → H₂ be the restriction to the closed subspace H₀,₁ ⊂ H₁ of a continuous linear operator A : H₁ → H₂. If Π : H₁ → H₀,₁ is the projection onto H₀,₁, show that the adjoint of A₀ is Π A∗.

17.4.4. Show that R(T) defined in (17.2) is dense in L₂(0, 1).

17.4.5. In the last paragraph of the proof of Theorem 17.9, explain why R(A)⊥ = N(A∗) implies R(A∗) ⊂ R(A∗A). Hint: divide the domain of A∗ into the two components R(A) and R(A)⊥.

17.5 Notes

Much of the presentation in the first section on projections was inspired by Sections 11.1 and 11.2 of van der Vaart (1998), and Theorem 17.1 is van der Vaart's Theorem 11.1. Parts of Sections 17.2 and 17.3 were inspired by material in Section 25.2 of van der Vaart (1998), although much of the technical content of Section 17.2 came from Conway (1990) and Appendix 4 of BKRW, while much of the technical content of Section 17.3 came from Chapter 4 of Rudin (1991) and Appendix 1 of BKRW.

Theorem 17.2 is a composition of Conway's Inequality 1.4 and Corollary 1.5, while Theorems 17.3 and 17.5 are modifications of Conway's Theorems 2.7 and 3.4, respectively. Theorem 17.4 is a minor modification of Theorem A.4.2 of BKRW: Parts (i), (ii) and (iii) correspond to Parts A and B of the theorem in BKRW, while Parts (iv) and (v) correspond to Parts C and D of the theorem in BKRW.

Proposition 17.6 is part of Theorem 4.10 of Rudin, and Theorem 17.7 is Theorem 4.15 of Rudin. The counter-example in L₂(0, 1) of the operator Tx(u) ≡ ux(u) is Example A.1.11 of BKRW. The basic content of Theorem 17.9 was inspired by the last few sentences of van der Vaart's Section 25.2, although the proof is new.


18 Semiparametric Models and Efficiency

The goal of this chapter is to expand in a rigorous fashion on the concepts of statistical models and efficiency introduced in Section 3.1, and it would be a good idea at this point for the reader to recall and briefly review the content of that section. Many of the results in this chapter are quite general and are not restricted to semiparametric models per se.

We begin with an in-depth discussion of the relationship between tangent sets and the concept of regularity. This includes several characterizations of regularity for general Banach-valued parameters. We then present several important results that characterize asymptotic efficiency, along with several useful tools for establishing asymptotic efficiency for both finite- and infinite-dimensional parameters. We conclude with a discussion of optimal hypothesis testing for one-dimensional parameters.

18.1 Tangent Sets and Regularity

For a statistical model P ∈ P on a sample space X, a one-dimensional model {Pt} is a smooth submodel at P if P₀ = P, {Pt : t ∈ Nε} ⊂ P for some ε > 0, and (3.1) holds for some measurable “tangent” function g : X → ℝ. Here, Nε is either the set [0, ε) (as was the case throughout Chapter 3) or (−ε, ε) (as will be the case hereafter). Also, P is usually the true but unknown distribution of the data. Note that Lemma 11.11 forces the g in (3.1) to be contained in L₂⁰(P).



A tangent set Q_P ⊂ L₂⁰(P) represents a submodel Q ⊂ P at P if the following hold:

(i) For every smooth one-dimensional submodel {Pt} for which

P₀ = P and {Pt : t ∈ Nε} ⊂ Q for some ε > 0,    (18.1)

and for which (3.1) holds for some g ∈ L₂⁰(P), we have g ∈ Q_P; and

(ii) For every g ∈ Q_P, there exists a smooth one-dimensional submodel {Pt} such that (18.1) and (3.1) both hold.

An appropriate question to ask at this point is: why the focus on one-dimensional submodels? The basic reason is that score functions for finite-dimensional submodels can be represented by tangent sets corresponding to one-dimensional submodels. To see this, let Q ≡ {Pθ : θ ∈ Θ} ⊂ P, where Θ ⊂ ℝᵏ. Let θ₀ ∈ Θ be the true value of the parameter, i.e., P = Pθ₀. Suppose that the members Pθ of Q all have densities pθ dominated by a measure µ, and that

ℓ̇θ₀ ≡ (∂/∂θ) log pθ |θ=θ₀,

where ℓ̇θ₀ ∈ L₂⁰(P), P‖ℓ̇θ − ℓ̇θ₀‖² → 0 as θ → θ₀, and the meaning of the extension of L₂⁰(P) to vectors of random variables is obvious. As discussed in the paragraphs following (3.1), the tangent set Q_P ≡ {h′ℓ̇θ₀ : h ∈ ℝᵏ} contains all the information in the score ℓ̇θ₀, and, moreover, it is not hard to verify that Q_P represents Q.

Thus one-dimensional submodels are sufficient to represent all finite-dimensional submodels. Moreover, since semiparametric efficiency is assessed by examining the information for the worst finite-dimensional submodel, one-dimensional submodels are sufficient for semiparametric models in general, including models with infinite-dimensional parameters.

Now if {Pt : t ∈ Nε} and g ∈ P_P satisfy (3.1), then for any a ≥ 0, everything will also hold when ε is replaced by ε/a and g is replaced by ag. Thus we can usually assume, without a significant loss in generality, that a tangent set P_P is a cone, i.e., a set that is closed under multiplication by nonnegative scalars. We will also frequently find it useful to replace a tangent set with its closed linear span, or to simply assume that the tangent set is closed under limits of linear combinations, in which case it becomes a tangent space.

We now review and generalize several additional important concepts from Section 3.1. For an arbitrary model parameter ψ : P → D, consider the fairly general setting where D is a Banach space B. In this case, ψ is differentiable at P relative to the tangent set P_P if, for every smooth one-dimensional submodel {Pt} with tangent g ∈ P_P, dψ(Pt)/dt|t=0 = ψ̇_P(g) for some bounded linear operator ψ̇_P : P_P → B.



When P_P is a linear space, it is a subspace of the Hilbert space L₂⁰(P), and some additional results follow. To begin with, the Riesz representation theorem yields that for every b∗ ∈ B∗, b∗ψ̇_P(g) = P[ψ̃_P(b∗)g] for some operator ψ̃_P : B∗ → lin P_P. Note also that for any g ∈ P_P and b∗ ∈ B∗, we have b∗ψ̇_P(g) = 〈g, ψ̇_P∗(b∗)〉, where 〈·, ·〉 is the inner product on L₂⁰(P) and ψ̇_P∗ is the adjoint of ψ̇_P. Thus the operator ψ̃_P is precisely ψ̇_P∗. In this case, ψ̃_P is the efficient influence function. The justification for the qualifier “efficient” will be given in Theorems 18.3 and 18.4 in the next section.

An estimator sequence Tn for a parameter ψ(P) is asymptotically linear if there exists an influence function ψ̌_P : X → B such that √n(Tn − ψ(P)) − √n ℙn ψ̌_P → 0 in probability. The estimator Tn is regular at P relative to P_P if, for every smooth one-dimensional submodel {Pt} ⊂ P and every sequence tn with tn = O(1/√n), √n(Tn − ψ(Ptn)) ⇝ Z under Pn, for some tight Borel random element Z, where Pn ≡ Ptn.

There are a number of ways to establish regularity, but when B = ℓ∞(H), for some set H, and Tn is asymptotically linear, the conditions for regularity can be expressed as properties of the influence function, as verified in the theorem below. Note that by Proposition 18.14 in Section 18.4 below, we only need to consider the influence function as evaluated for the subset of linear functionals B′ ⊂ B∗ that are coordinate projections. These projections are defined by the relation ψ̇_P(g)(h) = P[ψ̃_P(h)g] for all h ∈ H.

Theorem 18.1 Assume Tn and ψ(P) are in ℓ∞(H), that ψ is differentiable at P relative to the tangent space P_P with efficient influence function ψ̃_P : H → L₂⁰(P), and that Tn is asymptotically linear for ψ(P), with influence function ψ̌_P. For each h ∈ H, let ψ̌•_P(h) be the projection of ψ̌_P(h) onto P_P. Then the following are equivalent:

(i) The class F ≡ {ψ̌_P(h) : h ∈ H} is P-Donsker and ψ̌•_P(h) = ψ̃_P(h) almost surely for all h ∈ H.

(ii) Tn is regular at P .

Before giving the proof, we highlight an important rule for verifying that a candidate influence function ψ̃_P : H → L₂⁰(P) is an efficient influence function. This rule can be extracted from the preceding arguments and is summarized in the following proposition, the proof of which is saved as an exercise (see Exercise 18.5.1):

Proposition 18.2 Assume ψ : P → ℓ∞(H) is differentiable at P relative to the linear tangent set P_P, with bounded linear derivative ψ̇_P : P_P → ℓ∞(H). Then ψ̃_P : H → L₂⁰(P) is an efficient influence function if and only if the following both hold:

(i) ψ̃_P(h) is in the closed linear span of P_P for all h ∈ H, and



(ii) ψ̇_P(g)(h) = P[ψ̃_P(h)g] for all h ∈ H and g ∈ P_P.

This simple proposition is a primary ingredient in a “calculus” of semiparametric efficient estimators and was mentioned in a less general form in Chapter 3. A second important ingredient is making sure that P_P is rich enough to represent all smooth finite-dimensional submodels of P (see Exercise 3.5.1 for a discussion of this regarding the Cox model for right-censored data). This issue was also discussed a few paragraphs previously but is sufficiently important to bear repeating.

Proof of Theorem 18.1. Suppose F is P-Donsker. Let {Pt} be any smooth one-dimensional submodel with tangent g ∈ P_P, and let tn be any sequence with √n tn → k, for some finite constant k. Then Pn ≡ Ptn satisfies (11.4) for h = g, and thus, by Theorem 11.12,

√n ℙn ψ̌_P(·) ⇝ 𝔾ψ̌_P(·) + P[ψ̌_P(·)g] = 𝔾ψ̌_P(·) + P[ψ̌•_P(·)g]

in ℓ∞(H), under Pn. The last equality follows from the fact that g ∈ P_P. Let Yn ≡ ‖√n(Tn − ψ(P)) − √n ℙn ψ̌_P‖_H, and note that Yn → 0 in probability under P by the asymptotic linearity assumption.

At this point, we note that it is really no loss of generality to assume that the measurable sets for P and Pn (applied to the data X₁, . . ., Xn) are the same for all n ≥ 1. Using this assumption, we now have by Theorem 11.14 that Yn → 0 in probability under Pn. Combining this with the differentiability of ψ, we obtain that

√n(Tn − ψ(Pn))(·) = √n(Tn − ψ(P))(·) − √n(ψ(Pn) − ψ(P))(·)    (18.2)
⇝ 𝔾ψ̌_P(·) + P[(ψ̌•_P(·) − ψ̃_P(·))g]

in ℓ∞(H), under Pn.

Suppose now that (ii) holds but we do not know whether F is P-Donsker. The fact that Tn is regular implies that √n(Tn − ψ(P)) ⇝ Z, for some tight process Z. The asymptotic linearity of Tn now forces √n ℙn ψ̌_P(·) ⇝ Z also, and this yields that F is in fact P-Donsker. Suppose, however, that for some h ∈ H we have g̃(h) ≡ ψ̌•_P(h) − ψ̃_P(h) ≠ 0. Since the choice of g in the arguments leading up to (18.2) was arbitrary, we can choose g = a g̃(h) for any a > 0 to yield

√n(Tn(h) − ψ(Pn)(h)) ⇝ 𝔾ψ̌_P(h) + a P[g̃²(h)]    (18.3)

under Pn. Thus we can easily obtain different limiting distributions by choosing different values of a. This means that Tn is not regular. Thus we have proved, by contradiction, that (i) holds.

Assume now that (i) holds. Using once again the arguments preceding (18.2), we obtain, for arbitrary choices of g ∈ P_P and constants k, that

√n(Tn − ψ(Pn)) ⇝ 𝔾ψ̌_P

in ℓ∞(H), under Pn. Now relax the assumption that √n tn → k to √n tn = O(1), and allow g to be arbitrary as before. Under this weaker assumption, we have that for every subsequence n′ there exists a further subsequence n′′ such that √n′′ tn′′ → k, for some finite k, as n′′ → ∞. Arguing along this subsequence, our previous arguments can all be recycled to verify that

√n′′(Tn′′ − ψ(Pn′′)) ⇝ 𝔾ψ̌_P

in ℓ∞(H), under Pn′′, as n′′ → ∞.

Fix f ∈ Cb(ℓ∞(H)), recall the portmanteau theorem (Theorem 7.6), and define Zn ≡ √n(Tn − ψ(Pn)). What we have just proven is that every subsequence n′ has a further subsequence n′′ such that E∗f(Zn′′) → Ef(Z) as n′′ → ∞, where Z ≡ 𝔾ψ̌_P. This of course implies that E∗f(Zn) → Ef(Z) as n → ∞. Reapplication of the portmanteau theorem now yields Zn ⇝ Z under Pn. Thus Tn is regular, and the proof is complete for both directions.

We note that in the context of Theorem 18.1 above, a non-regular estimator Tn has a serious defect. From (18.3), we see that for any M < ∞ and ε > 0, there exists a smooth one-dimensional submodel {Pt} such that

P(‖√n(Tn − ψ(Pn))‖_H > M) > 1 − ε.

Thus the estimator Tn has arbitrarily poor performance for certain submodels represented by P_P. Hence regularity is not just a mathematically convenient definition: it reflects, even in infinite-dimensional settings, a certain intuitive reasonableness about Tn. This does not mean that nonregular estimators are never useful, because they can be, especially when the parameter ψ(P) is not √n-consistent. Nevertheless, regular estimators are very appealing when they are available.

18.2 Efficiency

We now turn our attention to the question of efficiency in estimating general Banach-valued parameters. We first present general optimality results and then characterize efficient estimators in the special Banach space ℓ∞(H). We then consider efficiency of Hadamard-differentiable functionals of efficient parameters and show how to establish efficiency of estimators in ℓ∞(H) from efficiency of all one-dimensional components. We also consider the related issue of efficiency in product spaces.

The following two theorems, which characterize optimality in Banach spaces, are the key results of this section. For a Borel random element Y, let L(Y) denote the law of Y (as in Section 7.1), and let ∗ denote the convolution operation. Also define a function u : B → [0, ∞) to be subconvex if, for every b ∈ B, u(b) ≥ 0 = u(0) and u(b) = u(−b), and also, for every c ∈ ℝ, the set {b ∈ B : u(b) ≤ c} is convex and closed. A simple example of a subconvex function is the norm ‖·‖ for B. Here are the theorems:

Theorem 18.3 (Convolution theorem) Assume that ψ : P → B is differentiable at P relative to the tangent space P_P, with efficient influence function ψ̃_P. Assume that Tn is regular at P relative to P_P, with Z being the tight weak limit of √n(Tn − ψ(P)) under P. Then L(Z) = L(Z₀) ∗ L(M), where M is some Borel random element in B, and Z₀ is a tight Gaussian process in B with covariance P[(b₁∗Z₀)(b₂∗Z₀)] = P[ψ̃_P(b₁∗) ψ̃_P(b₂∗)] for all b₁∗, b₂∗ ∈ B∗.

Theorem 18.4 Assume the conditions of Theorem 18.3 hold and that u : B → [0, ∞) is subconvex. Then

lim sup (n → ∞) E∗u(√n(Tn − ψ(P))) ≥ Eu(Z₀),

where Z₀ is as defined in Theorem 18.3.

We omit the proofs, but they can be found in Section 5.2 of BKRW.

The previous two theorems characterize optimality of regular estimators in terms of the limiting process Z₀, which is a tight, mean-zero Gaussian process with covariance obtained from the efficient influence function. This can be viewed as an asymptotic generalization of the Cramér–Rao lower bound. We say that an estimator Tn is efficient if it is regular and the limiting distribution of √n(Tn − ψ(P)) is Z₀, i.e., if Tn achieves the optimal lower bound. The following proposition, the proof of which is given in Section 18.4, assures us that Z₀ is fully characterized by the distributions of b∗Z₀ for b∗ ranging over all of B∗:

Proposition 18.5 Let Xn be an asymptotically tight sequence in a Banach space B, and assume b∗Xn ⇝ b∗X for every b∗ ∈ B∗ and some tight, Gaussian process X in B. Then Xn ⇝ X.

The next result assures us that Hadamard-differentiable functions of efficient estimators are also asymptotically efficient:

Theorem 18.6 Assume that ψ : P → B is differentiable at P relative to the tangent space P_P, with derivative ψ̇_P g for every g ∈ P_P and efficient influence function ψ̃_P, and that ψ takes its values in a subset Bφ ⊂ B. Suppose also that φ : Bφ ⊂ B → E is Hadamard differentiable at ψ(P) tangentially to B₀ ≡ lin ψ̇_P(P_P). Then φ ∘ ψ : P → E is also differentiable at P relative to P_P. If Tn is a sequence of estimators with values in Bφ that is efficient at P for estimating ψ(P), then φ(Tn) is efficient at P for estimating φ ∘ ψ(P).

Proof. Let φ′_{ψ(P)} : B → E be the derivative of φ. Note that for any g ∈ P_P and any submodel {Pt} with tangent g, we have by the specified Hadamard differentiability of φ that

[φ ∘ ψ(Pt) − φ ∘ ψ(P)]/t → φ′_{ψ(P)} ψ̇_P g,    (18.4)

as t → 0. Thus φ ∘ ψ : P → E is differentiable at P relative to P_P.

For any chosen submodel {Pt} with tangent g ∈ P_P, define Pn ≡ P_{1/√n}. By the efficiency of Tn, we have that √n(Tn − ψ(Pn)) ⇝ Z₀ under Pn, where Z₀ has the optimal, mean-zero, tight Gaussian limiting distribution. By the delta method (Theorem 2.8), we now have that √n(φ(Tn) − φ ∘ ψ(Pn)) ⇝ φ′_{ψ(P)}Z₀ under Pn. Since the choice of {Pt} was arbitrary, we now know that φ(Tn) is regular and also that √n(φ(Tn) − φ ∘ ψ(P)) ⇝ φ′_{ψ(P)}Z₀. By Theorem 18.3,

P[(e₁∗φ′_{ψ(P)}Z₀)(e₂∗φ′_{ψ(P)}Z₀)] = P[ψ̃_P(e₁∗φ′_{ψ(P)}) ψ̃_P(e₂∗φ′_{ψ(P)})],

for all e₁∗, e₂∗ ∈ E∗. Thus the desired result now follows from (18.4) and the definition of ψ̃_P, since P[ψ̃_P(e∗φ′_{ψ(P)})g] = e∗φ′_{ψ(P)} ψ̇_P(g) for every e∗ ∈ E∗ and g ∈ lin P_P.

The following very useful theorem completely characterizes efficient estimators of Euclidean parameters. The theorem is actually Lemma 25.23 of van der Vaart (1998), and the proof, which we omit, can be found therein:

Theorem 18.7 Let Tn be an estimator for a parameter ψ : P → ℝᵏ, where ψ is differentiable at P relative to the tangent space P_P with k-variate efficient influence function ψ̃_P ∈ L₂⁰(P). Then the following are equivalent:

(i) Tn is efficient at P relative to P_P, and thus the limiting distribution of √n(Tn − ψ(P)) is mean-zero normal with covariance P[ψ̃_P ψ̃_P′].

(ii) Tn is asymptotically linear with influence function ψ̃_P.
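Theorem 18.7 can be illustrated with the simplest Euclidean parameter, the mean (our own sketch, assuming numpy; the simulated data and names are illustrative, not from the text): the sample mean is asymptotically linear, with zero remainder, in the influence function x ↦ x − EX:

```python
import numpy as np

# Simulated data from P with psi(P) = E X = 2.0 and Var(X) = 1.5^2 = 2.25.
rng = np.random.default_rng(0)
mu = 2.0
x = rng.normal(loc=mu, scale=1.5, size=5000)

n = x.size
Tn = x.mean()
influence = x - mu                      # psi_tilde evaluated at the data

# Asymptotic linearity sqrt(n)(Tn - psi(P)) = sqrt(n) Pn[psi_tilde] holds
# exactly (zero remainder) for the sample mean:
lhs = np.sqrt(n) * (Tn - mu)
rhs = np.sqrt(n) * influence.mean()
print(abs(lhs - rhs))                   # 0 up to rounding

# The limiting variance P[psi_tilde^2] is Var(X):
print(np.mean(influence**2))            # close to 2.25
```

Here the influence function x − EX spans the tangent space of the full nonparametric model, so by (i)–(ii) the sample mean is efficient with limiting variance P[ψ̃_P²] = Var(X).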

The next theorem we present endeavors to extend the above characterization of efficient estimators to more general parameter spaces of the form ℓ∞(H):

Theorem 18.8 Let Tn be an estimator for a parameter ψ : P → ℓ∞(H), where ψ is differentiable at P relative to the tangent space P_P with efficient influence function ψ̃_P : H → L₂⁰(P). Let F ≡ {ψ̃_P(h) : h ∈ H}. Then the following are equivalent:

(a) Tn is efficient at P relative to P_P and at least one of the following holds:

(i) Tn is asymptotically linear.

(ii) F is P-Donsker for some version of ψ̃_P.

(b) For some version of ψ̃_P, Tn is asymptotically linear with influence function ψ̃_P, and F is P-Donsker.

(c) Tn is regular and asymptotically linear with influence function ψ̌_P such that {ψ̌_P(h) : h ∈ H} is P-Donsker and ψ̌_P(h) ∈ P_P for all h ∈ H.

340 18. Semiparametric Models and Efficiency

We are assuming tacitly that {ψ̃P(h) : h ∈ H} is a stochastic process. Two stochastic processes X and X̃ are versions of each other when X(h) = X̃(h) almost surely for every h ∈ H. Note that this is a slightly different definition than the one used in Section 7.1 for Borel random variables, but the concepts are equivalent when the processes are both Borel measurable and live on ℓ∞(H). This equivalence follows since finite-dimensional distributions determine the full distribution for tight stochastic processes. However, in the generality of the above theorem, we need to be careful since ψ̃P may not be Borel measurable when H is infinite, and different versions may have different properties.

The theorem gives us several properties of efficient estimators that can be useful for a number of things, including establishing efficiency. In particular, conclusion (c) tells us that a simple method for establishing efficiency of Tn requires only that Tn be regular and asymptotically linear with an influence function that is contained in a Donsker class and whose individual components ψ̃P(h) are contained in the tangent space for all h ∈ H. The theorem also tells us that if Tn is efficient, only one of (i) or (ii) in (a) is required and the other will follow. This means, for example, that if Tn is efficient and F is not P-Donsker for any version of ψ̃P, then Tn must not be asymptotically linear. Also note that the requirement that F is P-Donsker collapses to requiring that ‖ψ̃P‖P,2 < ∞ when H is finite, in which case we are in the setting of Theorem 18.7. However, such a requirement is not needed in the statement of Theorem 18.7, since ‖ψ̃P‖P,2 < ∞ automatically follows from the required differentiability of ψ when ψ is Rk-valued. This follows since the Riesz representation theorem assures us that ψ̃P is in the closed linear span of ṖP, which is a subset of L2(P).

Proof of Theorem 18.8. Assume (b), and note that Theorem 18.1 now implies that Tn is regular. Efficiency now follows immediately from Theorems 18.3 and 18.4 and the definition of efficiency. Thus (a) follows.

Now assume (a) and (i). Since the definition of efficiency includes regularity, we know by Theorem 18.1 that Tn is asymptotically linear with influence function ψP having projection ψ̃P onto ṖP. By Theorem 18.3, ψP(h) − ψ̃P(h) = 0 almost surely, for all h ∈ H. Since {ψP(h) : h ∈ H} is P-Donsker by the asymptotic linearity and efficiency of Tn, we now have that ψP is a version of ψ̃P for which F is P-Donsker. This yields (b). Note that in applying Theorem 18.3, we only needed to consider the evaluation functionals in B∗ (i.e., b∗h(b) ≡ b(h), where h ∈ H) and not all of B∗, since a tight, mean zero Gaussian process on ℓ∞(H) is completely determined by its covariance function (h1, h2) → P[Z0(h1)Z0(h2)] (see Proposition 18.14 below).

Now assume (a) and (ii). The regularity of Tn and the fact that F is P-Donsker yields that both √n(Tn − ψ(P)) and √nPnψ̃P are asymptotically tight. Thus, for verifying the desired asymptotic linearity, it is sufficient to verify it for all finite-dimensional subsets of H, i.e., we need to verify that

sup_{h∈H0} |√n(Tn(h) − ψ(P)(h)) − √nPnψ̃P(h)| = oP(1),   (18.5)

for all finite subsets H0 ⊂ H. This holds by Theorem 18.7, and thus the desired conclusions follow.

Note that (c) trivially follows from (b). Moreover, (b) follows from (c) by Theorem 18.1 and the regularity of Tn. This completes the proof.

The following theorem tells us that pointwise efficiency implies uniform efficiency under weak convergence. This fairly deep result is quite useful in applications.

Theorem 18.9 Let Tn be an estimator for a parameter ψ : P → ℓ∞(H), where ψ is differentiable at P relative to the tangent space ṖP with efficient influence function ψ̃P : H → L₂⁰(P). The following are equivalent:

(a) Tn is efficient for ψ(P).

(b) Tn(h) is efficient for ψ(P)(h), for every h ∈ H, and √n(Tn − ψ(P)) is asymptotically tight under P.

The proof of this theorem makes use of the following deep lemma, the proof of which is given in Section 18.4 below.

Lemma 18.10 Suppose that ψ : P → D is differentiable at P relative to the tangent space ṖP and that d′Tn is asymptotically efficient at P for estimating d′ψ(P) for every d′ in a subset D′ ⊂ D∗ which satisfies

‖d‖ ≤ c sup_{d′∈D′, ‖d′‖≤1} |d′(d)|,   (18.6)

for some constant c < ∞. Then Tn is asymptotically efficient at P, provided √n(Tn − ψ(P)) is asymptotically tight under P.

Proof of Theorem 18.9. That (a) implies (b) is obvious. Now assume (b), and let D = ℓ∞(H) and D′ be the set of all coordinate projections d → d∗h(d) ≡ d(h) for h ∈ H. Since the uniform norm on ℓ∞(H) is trivially equal to sup_{d′∈D′} |d′d| and all d′ ∈ D′ satisfy ‖d′‖ = 1, Condition (18.6) is easily satisfied. Since √n(Tn − ψ(P)) is asymptotically tight by assumption, all of the conditions of Lemma 18.10 are satisfied. Hence Tn is efficient, and the desired conclusions follow.

We close this section with an interesting corollary of Lemma 18.10 that provides a remarkably simple connection between marginal and joint efficiency on product spaces:

Theorem 18.11 Suppose that ψj : P → Dj is differentiable at P relative to the tangent space ṖP, and suppose that Tn,j is asymptotically efficient at P for estimating ψj(P), for j = 1, 2. Then (Tn,1, Tn,2) is asymptotically efficient at P for estimating (ψ1(P), ψ2(P)).

Thus marginal efficiency implies joint efficiency even though marginal weak convergence does not imply joint weak convergence! This is not quite so surprising as it may appear at first. Consider the finite-dimensional setting where ψj(P) ∈ R for j = 1, 2. If Tn,j is efficient for ψj(P), for each j = 1, 2, then Theorem 18.7 tells us that (Tn,1, Tn,2) is asymptotically linear with influence function (ψ̃1,P, ψ̃2,P). Thus the limiting joint distribution will in fact be the optimal bivariate Gaussian distribution. The above theorem can be viewed as an infinite-dimensional generalization of this finite-dimensional phenomenon.

Proof of Theorem 18.11. Let D′ be the set of all maps (d1, d2) → d∗j dj for d∗j ∈ D∗j and j equal to either 1 or 2. Note that by the Hahn-Banach theorem (see Corollary 6.7 of Conway, 1990), ‖dj‖ = sup{|d∗j dj| : ‖d∗j‖ = 1, d∗j ∈ D∗j}. Thus the product norm ‖(d1, d2)‖ = ‖d1‖ ∨ ‖d2‖ satisfies Condition (18.6) of Lemma 18.10 with c = 1. Hence the desired conclusion follows.

18.3 Optimality of Tests

In this section, we study testing of the null hypothesis

H0 : ψ(P) ≤ 0   (18.7)

versus the alternative H1 : ψ(P) > 0 for a one-dimensional parameter ψ(P). The basic conclusion we will endeavor to show is that a test based on an asymptotically optimal estimator for ψ(P) will, in a meaningful way, be asymptotically optimal. Note that null hypotheses of the form H01 : ψ(P) ≤ ψ0 can trivially be rewritten in the form given in (18.7) by replacing P → ψ(P) with P → ψ(P) − ψ0. For dimensions higher than one, coming up with a satisfactory criterion for optimality of tests is difficult (see the discussion in Section 15.2 of van der Vaart, 1998), and we will not pursue the higher-dimensional setting in this book.

For a given model P and measure P on the boundary of the null hypothesis where ψ(P) = 0, we are interested in studying the "local asymptotic power" in a neighborhood of P. These neighborhoods are of size 1/√n and are the appropriate magnitude when considering sample size computation for √n consistent parameter estimates. Consider, for example, the univariate normal setting where the data are i.i.d. N(µ, σ²). A natural choice of test for H0 : µ ≤ 0 versus H1 : µ > 0 is the indicator of whether Tn = √n x̄/sn > z1−α, where x̄ and sn are the sample mean and standard deviation from an i.i.d. sample X1, . . . , Xn, zq is the qth quantile of a standard normal, and α is the chosen size of this one-sided test. For any µ > 0, Tn diverges to infinity with probability 1. However, if µ = k/√n for some finite k, then Tn ⇝ N(k, 1). Thus we can derive non-trivial power functions only for shrinking "contiguous alternatives" in a 1/√n neighborhood of zero. In this case, since ψ̃P = X and the corresponding one-dimensional submodel Pt must satisfy ∂ψ(Pt)/(∂t)|t=0 = k, we know that the score function g corresponding to Pt must be g(X) = kX/σ. Hence, in this example, we can easily express the local alternative sequence in terms of the score function rather than k.
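This local-alternative behavior is easy to check by simulation. The following sketch is a hypothetical illustration (not code from the text): it draws repeated samples under µ = k/√n and confirms that Tn = √n x̄/sn is approximately N(k, 1), so the one-sided test has non-trivial power against the contiguous alternative.

```python
import math
import numpy as np

rng = np.random.default_rng(0)

def t_stat(x):
    # T_n = sqrt(n) * xbar / s_n, the one-sided test statistic from the text
    return math.sqrt(len(x)) * x.mean() / x.std(ddof=1)

n, k, reps = 400, 2.0, 20000
mu = k / math.sqrt(n)                     # contiguous alternative mu = k / sqrt(n)
stats = np.array([t_stat(rng.normal(mu, 1.0, n)) for _ in range(reps)])

print(stats.mean(), stats.std())          # approximately k and 1, respectively
# empirical power of the one-sided size-0.05 test; z_{0.95} is approximately 1.645
power = (stats > 1.645).mean()
print(power)                              # approximately 1 - Phi(1.645 - k)
```

With σ = 1 held fixed, the simulated mean and standard deviation match the N(k, 1) limit, and the empirical rejection rate matches the asymptotic power formula derived below.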

Thus it makes sense in general to study the performance of tests under contiguous alternatives defined by one-dimensional submodels corresponding to score functions. Accordingly, for a given element g of a tangent set ṖP, let t → Pt,g be a one-dimensional submodel which is differentiable in quadratic mean at P with score function g and along which ψ is differentiable, i.e.,

(ψ(Pt,g) − ψ(P))/t → P[ψ̃P g],

as t ↓ 0. For each such g for which P[ψ̃P g] > 0, we can see that when ψ(P) = 0, the submodel Pt,g belongs to H1 : ψ(P) > 0 for all sufficiently small t > 0. We will therefore consider power over contiguous alternatives of the form P_{h/√n,g} for h > 0.

Before continuing, we need to define a power function. For a subset Q ⊂ P containing P, a power function π : Q → [0, 1] at level α is a function on probability measures that satisfies π(Q) ≤ α for all Q ∈ Q for which ψ(Q) ≤ 0. We say that a sequence of power functions πn has asymptotic level α if lim sup_{n→∞} πn(Q) ≤ α for every Q ∈ {Q ∈ Q : ψ(Q) ≤ 0}. The power function for a level α hypothesis test of H0 is the probability of rejecting H0 under Q. Hence statements about power functions can be viewed as statements about hypothesis tests. Here is our main result:

Theorem 18.12 Let ψ : P → R be differentiable at P relative to the tangent space ṖP with efficient influence function ψ̃P, and suppose ψ(P) = 0. Then, for every sequence of power functions P → πn(P) of asymptotic level α tests for H0 : ψ(P) ≤ 0, and for every g ∈ ṖP with P[ψ̃P g] > 0 and every h > 0,

lim sup_{n→∞} πn(P_{h/√n,g}) ≤ 1 − Φ(z1−α − hP[ψ̃P g]/√(P[ψ̃P²])).

This is a minor modification of Theorem 25.44 in Section 25.6 of van der Vaart (1998), and the proof is given therein. While there are some differences in notation, the substantive modification is that van der Vaart requires the power functions to have level α for each n, whereas our version only requires the levels to be asymptotically α. This does not affect the proof, and we omit the details. An advantage of this modification is that it permits the use of approximate hypothesis tests, such as those which depend on the central limit theorem, whose level for fixed n may not be exactly α but whose asymptotic level is known to be α.

An important consequence of Theorem 18.12 is that a test based on an efficient estimator Tn of ψ(P) can achieve the given optimality. To see this, let Sn² be a consistent estimator of the limiting variance of √n(Tn − ψ(P)), and let πn(Q) be the power function defined as the probability that √nTn/Sn > z1−α under the model Q. It is easy to see that this power function has asymptotic level α under the null hypothesis. The following result shows that this procedure is asymptotically optimal:

Lemma 18.13 Let ψ : P → R be differentiable at P relative to the tangent space ṖP with efficient influence function ψ̃P, and suppose ψ(P) = 0. Suppose the estimator Tn is asymptotically efficient at P and, moreover, that Sn² →P P[ψ̃P²]. Then, for every h ≥ 0 and g ∈ ṖP,

lim sup_{n→∞} πn(P_{h/√n,g}) = 1 − Φ(z1−α − hP[ψ̃P g]/√(P[ψ̃P²])).

Proof. By Theorem 18.7 and Part (i) of Theorem 11.14, we have that

√nTn/Sn ⇝ Z + hP[ψ̃P g]/√(P[ψ̃P²]) under Pn,

where Pn ≡ P_{h/√n,g} and Z has a standard normal distribution. The desired result now follows.
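The optimal local power bound is simple to compute numerically. The sketch below is an illustration (not from the text): it evaluates 1 − Φ(z1−α − hP[ψ̃P g]/√(P[ψ̃P²])) as a function of h, specialized to the normal-mean example where taking ψ̃P(X) = X and score g(X) = X (with σ = 1) gives P[ψ̃P g] = P[ψ̃P²] = 1.

```python
import math

def Phi(z):
    # standard normal CDF via the error function
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def optimal_local_power(h, cov_psi_g, var_psi, z_alpha=1.645):
    # upper bound of Theorem 18.12, attained by the efficient test of Lemma 18.13:
    # 1 - Phi(z_{1-alpha} - h * P[psi_tilde g] / sqrt(P[psi_tilde^2]))
    return 1.0 - Phi(z_alpha - h * cov_psi_g / math.sqrt(var_psi))

# normal-mean illustration: both moments equal 1, so the bound is 1 - Phi(z - h)
for h in (0.0, 1.0, 2.0, 3.0):
    print(h, round(optimal_local_power(h, 1.0, 1.0), 3))
```

At h = 0 the bound reduces to the level α (here 0.05), and it increases monotonically in h, as a power envelope should.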

Consider, for example, the Mann-Whitney test discussed in Section 12.2.2 for comparing two independent samples of respective sample sizes m and n. Let Fm and Gn be the respective empirical distributions with corresponding true distributions F and G, which we assume to be continuous. The Mann-Whitney statistic is Tn = ∫_R Gn(s)dFm(s) − 1/2, which is consistent for ψ(P) = ∫_R G(s)dF(s) − 1/2. We are interested in testing the null hypothesis H0 : ψ(P) ≤ 0 versus H1 : ψ(P) > 0.

By Theorem 18.11, we know that (Fm, Gn) is jointly efficient for (F, G). Moreover, by Lemma 12.3, we know that (F, G) → ∫_R G(s)dF(s) is Hadamard differentiable. Thus Theorem 18.6 applies, and we obtain that ∫_R Gn(s)dFm(s) is asymptotically efficient for ∫_R G(s)dF(s). Hence Lemma 18.13 also applies, and we obtain that Tn is optimal for testing H0, provided it is suitably standardized. We know from the discussion in Section 12.2.2 that the asymptotic variance of √nTn is 1/12. Thus the test that rejects the null when √(12n)Tn is greater than z1−α is optimal.

Another simple example is the sign test for symmetry about zero for a sample of real random variables X1, . . . , Xn with distribution F that is continuous at zero. The test statistic is Tn = ∫_R sign(x)dFn(x), where Fn is the usual empirical distribution. Using arguments similar to those used in the previous paragraphs, it can be shown that Tn is an asymptotically efficient estimator for ψ(P) = ∫_R sign(x)dF(x) = P(X > 0) − P(X < 0). Thus, by Lemma 18.13 above, the sign test is asymptotically optimal for testing the null hypothesis H0 : P(X > 0) ≤ P(X < 0) versus the alternative H1 : P(X > 0) > P(X < 0).
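Both tests above are straightforward to implement. The following is a hypothetical sketch, not code from the text: the Mann-Whitney statistic is standardized here by the classical two-sample null standard error √((m + n + 1)/(12mn)) rather than the single-sample 1/12 normalization quoted above, and the sign test uses the sample standard deviation of sign(Xi) as one consistent choice of Sn (compare Exercise 18.5.3).

```python
import numpy as np

def mann_whitney_test(x, y, z_alpha=1.645):
    # T = int G_n dF_m - 1/2, i.e. the proportion of pairs with y_j < x_i, minus 1/2
    x, y = np.asarray(x, float), np.asarray(y, float)
    m, n = len(x), len(y)
    t = (y[None, :] < x[:, None]).mean() - 0.5
    se = np.sqrt((m + n + 1) / (12.0 * m * n))   # classical null standard error
    return t / se, t / se > z_alpha

def sign_test(x, z_alpha=1.645):
    # T_n = int sign(x) dF_n(x), standardized by an estimate S_n of its null sd
    s = np.sign(np.asarray(x, float))
    stat = np.sqrt(len(s)) * s.mean() / s.std(ddof=1)
    return stat, stat > z_alpha

z, reject = mann_whitney_test([1.0, 2.0, 3.0], [-1.0, -2.0, -3.0])
print(z, reject)   # every pair has y < x, so the test rejects
```

Each function returns the standardized statistic and the one-sided rejection decision at asymptotic level α.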

These examples illustrate the general concept that if the parameter of interest is a smooth functional of the underlying distribution functions, then the estimator obtained by substituting the empirical distributions for the true distributions will be asymptotically optimal, provided we are not willing to make any parametrically restrictive assumptions about the distributions.

18.4 Proofs

Proof of Proposition 18.5. Let B∗1 ≡ {b∗ ∈ B∗ : ‖b∗‖ ≤ 1} and B̃ ≡ ℓ∞(B∗1). Note that (B, ‖·‖) ⊂ (B̃, ‖·‖_{B∗1}) by letting x(b∗) ≡ b∗x for every b∗ ∈ B∗1 and all x ∈ B, and recognizing that ‖x‖ = sup_{b∗∈B∗1} |b∗x| = ‖x‖_{B∗1} by the Hahn-Banach theorem (see Corollary 6.7 of Conway, 1990). Thus weak convergence of Xn in B̃ will imply weak convergence in B by Lemma 7.8. Since we already know that Xn is asymptotically tight in B, we are done if we can show that all finite-dimensional distributions of Xn converge. Accordingly, let b∗1, . . . , b∗m ∈ B∗1 be arbitrary, and note that for any (α1, . . . , αm) ∈ Rm, ∑_{j=1}^m αjXn(b∗j) = b∗Xn for b∗ ≡ ∑_{j=1}^m αjb∗j ∈ B∗. Since we know that b∗Xn ⇝ b∗X, we now have that ∑_{j=1}^m αjb∗jXn ⇝ ∑_{j=1}^m αjb∗jX. Thus (Xn(b∗1), . . . , Xn(b∗m))T ⇝ (X(b∗1), . . . , X(b∗m))T since (α1, . . . , αm) ∈ Rm was arbitrary and X is Gaussian. Since b∗1, . . . , b∗m and m were also arbitrary, we have the result that all finite-dimensional distributions of Xn converge, and the desired conclusion now follows.

Proof of Lemma 18.10. By Part (ii) of Theorem 11.14, we know that √n(Tn − ψ(P)) is asymptotically tight under Pn ≡ P_{1/√n} for every differentiable submodel Pt. By the differentiability of ψ, we know that √n(Tn − ψ(Pn)) is also asymptotically tight under Pn. By assumption, we also have that d′√n(Tn − ψ(Pn)) is asymptotically linear with covariance Pψ̃P²(d′) for every d′ ∈ D′. Since the same result will hold for all finite linear combinations of elements of D′, and since increasing the size of D′ will not negate Condition (18.6), we can replace D′ without loss of generality with lin D′. By using a minor variation of the proof of Proposition 18.5, we now have that √n(Tn − ψ(Pn)) ⇝ Z under Pn in ℓ∞(D′1), where D′1 ≡ {d∗ ∈ D′ : ‖d∗‖ ≤ 1} and Z is a tight Gaussian process with covariance P[ψ̃P(d∗1)ψ̃P(d∗2)] for all d∗1, d∗2 ∈ D′1. Note that this covariance uniquely defines the distribution of Z as a tight, Gaussian element in ℓ∞(D′1) by the linearity of D′ and Lemma 7.3.

Note that by Condition (18.6), we have for every d ∈ D that

‖d‖ ≤ c sup_{d′∈D′, ‖d′‖≤1} |d′(d)| ≤ c‖d‖.

We thus have by Lemma 7.8 that the convergence of √n(Tn − ψ(Pn)) to Z under Pn also occurs in D. Since Condition (18.6) also implies that ℓ∞(D′1) contains D, we have that Z viewed as an element in D is completely characterized by the covariance P[ψ̃P(d∗1)ψ̃P(d∗2)] for d∗1 and d∗2 ranging over D′1. This is verified in Proposition 18.14 below. Hence, by the assumed differentiability of ψ, the covariance of Z also satisfies P[(d∗1Z0)(d∗2Z0)] = P[ψ̃P(d∗1)ψ̃P(d∗2)] for all d∗1, d∗2 ∈ D∗. Hence Tn is regular with √n(Tn − ψ(P)) ⇝ Z = Z0, where Z0 is the optimal limiting Gaussian process defined in Theorem 18.3. Thus Tn is efficient.

Proposition 18.14 Let X be a tight, Borel measurable element in a Banach space D, and suppose there exists a set D′ ⊂ D∗ such that

(i) (18.6) holds for all d ∈ D and some constant 0 < c < ∞, and

(ii) d′ → d′X is a Gaussian process on ℓ∞(D′1), where D′1 ≡ {d′ ∈ D′ : ‖d′‖ ≤ 1}.

Then X is uniquely defined in law on D.

Proof. Let φ : D → ℓ∞(D′1) be defined by φ(d) ≡ {d′d : d′ ∈ D′1}, and note that by Exercise 18.5.4 below, both φ and φ−1 : D̃ → D, where

D̃ ≡ {x ∈ ℓ∞(D′1) : x = {d′d : d′ ∈ D′1} for some d ∈ D},

are continuous and linear. Note also that D̃ is a closed and linear subspace of ℓ∞(D′1). Thus Y = φ(X) is tight. Now let d′1, . . . , d′m ∈ D′1 and α1, . . . , αm ∈ R be arbitrary. Then by Condition (ii), ∑_{j=1}^m αjd′jX is Gaussian with variance ∑_{j,k} αjαkP[(d′jX)(d′kX)]. Since the choice of the αj's was arbitrary, (d′1X, . . . , d′mX)T is multivariate normal with covariance P[(d′jX)(d′kX)]. Since the choice of d′1, . . . , d′m was also arbitrary, we have that all finite-dimensional distributions of Y are multivariate normal. Now the equivalence between (i) and (ii) in Proposition 7.5 implies that d∗X = d∗φ−1(Y) is Gaussian for all d∗ ∈ D∗. This follows since P(X ∈ φ−1(D̃)) = 1 and, for any d∗ ∈ D∗, e∗ = d∗φ−1 is a continuous linear functional on D̃ that can be extended to a continuous linear functional ẽ∗ on all of ℓ∞(D′1) with ẽ∗ = e∗ on D̃. The desired conclusion now follows by the definition of a Gaussian process on a Banach space.

18.5 Exercises

18.5.1. Prove Proposition 18.2.

18.5.2. Show that Theorem 18.7 is a special case of Theorem 18.8.

18.5.3. Consider the sign test example discussed at the end of Section 18.3, where Tn = ∫_R sign(x)dFn. What value of Sn should be used so that the test that rejects H0 when √nTn/Sn > z1−α is asymptotically optimal at level α?

18.5.4. Show that both φ : D → D̃ and φ−1 : D̃ → D as defined in the proof of Proposition 18.14 are continuous and linear.

18.6 Notes

Theorem 18.3 is part of Theorem 5.2.1 of BKRW, while Theorem 18.4 is their Proposition 5.2.1. Theorem 18.6 and Lemma 18.10 are Theorem 25.47 and Lemma 25.49 of van der Vaart (1998), respectively, while Theorem 18.9 is a minor modification of Theorem 25.48 of van der Vaart (1998). The proof of Lemma 18.10 given above is somewhat simpler than the corresponding proof of van der Vaart's, primarily as a result of a different utilization of linear functionals. Theorem 18.11 and Lemma 18.13 are Theorem 25.44 and Lemma 25.45 of van der Vaart (1998).


19 Efficient Inference for Finite-Dimensional Parameters

In this chapter, as in Section 3.2, we focus on semiparametric models {Pθ,η : θ ∈ Θ, η ∈ H}, where Θ is an open subset of Rk and H is an arbitrary, possibly infinite-dimensional set. The parameter of interest for this chapter is ψ(Pθ,η) = θ.

We first present the promised proof of Theorem 3.1. It may at first appear that Theorem 3.1 can only be used when the form of the efficient score equation can be written down explicitly. However, even in those settings where the efficient score cannot be written in closed form, Theorem 3.1 and a number of related approaches can be used for semiparametric efficient estimation based on the profile likelihood. This process can be facilitated through the use of approximately least-favorable submodels, which are discussed in the second section.

The main ideas of this chapter are given in Section 19.3, which presents several methods of inference for θ that go significantly beyond Theorem 3.1. The first two methods are based on a multivariate normal approximation of the profile likelihood that is valid in a neighborhood of the true parameter. The first of the two methods is based on a quadratic expansion that is valid in a shrinking neighborhood. The second method, the profile sampler, is based on an expansion that is valid on a compact, fixed neighborhood. The third method is an extension that is valid for penalized profile likelihoods. A few other methods are also discussed, including bootstrap, jackknife, and fully Bayesian approaches.


19.1 Efficient Score Equations

The purpose of this section is to prove Theorem 3.1 on Page 44. Recall that this theorem was used in Section 3.2 to verify efficiency of the regression parameter estimator for the Cox model based on an estimating equation that approximated the efficient score. However, it is worth reiterating that this theorem is probably more useful for semiparametric maximum likelihood estimation, including penalized estimation, than as a direct method of estimation. The reason for this is that the efficient score is typically too complicated to use realistically for direct estimation, except in certain special cases like the Cox model under right censoring, and often does not have a closed form, while maximization of the semiparametric likelihood is usually much less difficult. This theorem will be utilized in some of the case studies of Chapter 22.

Proof of Theorem 3.1 (Page 44). Define F to be the Pθ,η-Donsker class of functions that contains both ℓ̂θ̂n,n and ℓ̃θ,η with probability tending towards 1. By Condition (3.7), we now have

Gnℓ̂θ̂n,n = Gnℓ̃θ,η + oP(1) = √nPnℓ̃θ,η + oP(1).

Combining this with the "no-bias" Condition (3.6), we obtain

√n(Pθ̂n,η − Pθ,η)ℓ̂θ̂n,n = oP(1 + √n‖θ̂n − θ‖) + Gnℓ̂θ̂n,n = √nPnℓ̃θ,η + oP(1 + √n‖θ̂n − θ‖).

If we can show

√n(Pθ̂n,η − Pθ,η)ℓ̂θ̂n,n = (Ĩθ,η + oP(1))√n(θ̂n − θ),   (19.1)

then the proof will be complete. Since Ĩθ,η = Pθ,η[ℓ̃θ,ηℓ̃′θ,η],

√n(Pθ̂n,η − Pθ,η)ℓ̂θ̂n,n − Ĩθ,η√n(θ̂n − θ)
= √n(Pθ̂n,η − Pθ,η)ℓ̂θ̂n,n − Pθ,η[ℓ̃θ,ηℓ̃′θ,η]√n(θ̂n − θ)
= √n ∫ ℓ̂θ̂n,n(dP_{θ̂n,η}^{1/2} + dP_{θ,η}^{1/2}) × [(dP_{θ̂n,η}^{1/2} − dP_{θ,η}^{1/2}) − (1/2)(θ̂n − θ)′ℓ̃θ,η dP_{θ,η}^{1/2}]
  + ∫ ℓ̂θ̂n,n(dP_{θ̂n,η}^{1/2} − dP_{θ,η}^{1/2})(1/2)ℓ̃′θ,η dP_{θ,η}^{1/2} √n(θ̂n − θ)
  + ∫ (ℓ̂θ̂n,n − ℓ̃θ,η)ℓ̃′θ,η dPθ,η √n(θ̂n − θ)
≡ An + Bn + Cn.

Combining the assumed differentiability in quadratic mean with Condition (3.7), we obtain An = oP(√n‖θ̂n − θ‖). Condition (3.7) also implies Cn = oP(√n‖θ̂n − θ‖).

Now we consider Bn. Note that for any sequence mn → ∞,

‖∫ ℓ̂θ̂n,n(dP_{θ̂n,η}^{1/2} − dP_{θ,η}^{1/2})(1/2)ℓ̃′θ,η dP_{θ,η}^{1/2}‖
≤ mn ∫ ‖ℓ̂θ̂n,n‖ dP_{θ,η}^{1/2} |dP_{θ̂n,η}^{1/2} − dP_{θ,η}^{1/2}| + [∫ ‖ℓ̂θ̂n,n‖²(dPθ̂n,η + dPθ,η)]^{1/2} [∫_{‖ℓ̃θ,η‖>mn} ‖ℓ̃θ,η‖² dPθ,η]^{1/2}
≡ mnDn + En,

where En = oP(1) by Condition (3.7) and the square-integrability of ℓ̃θ,η. Now

Dn² ≤ ∫ ‖ℓ̂θ̂n,n‖² dPθ,η × ∫ (dP_{θ̂n,η}^{1/2} − dP_{θ,η}^{1/2})² ≡ Fn × Gn,

where Fn = OP(1) by reapplication of Condition (3.7) and Gn = oP(1) by differentiability in quadratic mean combined with the consistency of θ̂n (see Exercise 19.5.1). Thus there exists some sequence mn → ∞ slowly enough so that mn²Gn = oP(1). Hence, for this choice of mn, mnDn = oP(1). Thus Bn = oP(√n‖θ̂n − θ‖), and we obtain (19.1). The desired result now follows.

19.2 Profile Likelihood and Least-Favorable Submodels

In this section, we will focus on the maximum likelihood estimator θ̂n based on maximizing the joint empirical log-likelihood Ln(θ, η) ≡ nPnl(θ, η), where l(·, ·) is the log-likelihood for a single observation. In this chapter, η is regarded as a nuisance parameter, and thus we can restrict our attention to the profile log-likelihood θ → pLn(θ) ≡ supη Ln(θ, η). Note that Ln is a sum and not an average, since we multiplied the empirical measure by n. In the remaining sections of the chapter, we will use a zero subscript to denote the true parameter values. While the solution of an efficient score equation need not be a maximum likelihood estimator, it is also possible that the maximum likelihood estimator in a semiparametric model may not be expressible as the zero of an efficient score equation. This possibility occurs because the efficient score is a projection, and, as such, there is no assurance that this projection is the derivative of the log-likelihood along a submodel. To address this issue, we can utilize the approximately least-favorable submodels introduced in Section 3.3.
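For intuition, the profiling step can be sketched on a toy fully parametric model (a hypothetical illustration, not one of the semiparametric examples treated below): for i.i.d. N(θ, η) data with nuisance variance η, the supremum over η is available in closed form, and the maximizer of pLn(θ) recovers the maximum likelihood estimator of θ.

```python
import numpy as np

def profile_loglik(theta, x):
    # pL_n(theta) = sup_eta L_n(theta, eta) for the toy model N(theta, eta);
    # the sup over the nuisance variance eta is attained at eta_hat(theta) below
    eta_hat = np.mean((x - theta) ** 2)
    n = len(x)
    return -0.5 * n * (np.log(2.0 * np.pi * eta_hat) + 1.0)

x = np.random.default_rng(2).normal(1.0, 2.0, 200)
thetas = np.linspace(0.0, 2.0, 2001)
pl = np.array([profile_loglik(t, x) for t in thetas])
theta_hat = thetas[pl.argmax()]
print(theta_hat, x.mean())   # the profile maximizer agrees (up to the grid) with xbar
```

In semiparametric models the inner supremum over η is rarely available in closed form, which is exactly why the approximately least-favorable submodels described next are needed.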

Recall that an approximately least-favorable submodel approximates the true least-favorable submodel (defined in the early part of Section 3.1) to a useful level of accuracy that facilitates analysis of semiparametric estimators. We will now describe this process in generality; the specifics will depend on the situation. As described in Section 3.3, we first need a general map from a neighborhood of θ into the parameter set for η, which map we will denote by t → ηt(θ, η), for t ∈ Rk. We require that

ηt(θ, η) ∈ H, for all ‖t − θ‖ small enough, and   (19.2)
ηθ(θ, η) = η, for any (θ, η) ∈ Θ × H̃,

where H̃ is a suitable enlargement of H that includes all estimators that satisfy the constraints of the estimation process. Now define the map

ℓ(t, θ, η) ≡ l(t, ηt(θ, η)).

We will require several things of ℓ(·, ·, ·), at various points in our discussion, that will result in further restrictions on ηt(θ, η).

Define ℓ̇(t, θ, η) ≡ (∂/(∂t))ℓ(t, θ, η), and let ℓ̂θ,n ≡ ℓ̇(θ, θ, η̂n). Clearly, Pnℓ̂θ̂n,n = 0, and thus θ̂n is efficient for θ0, provided ℓ̂θ,n satisfies the conditions of Theorem 3.1. The reason it is necessary to check this even for maximum likelihood estimators is that often η̂n is on the boundary of (or even a little bit outside of) the parameter space. Consider for example the Cox model for right censored data (see Section 4.2.2). In this case, η is the baseline integrated hazard function, which is usually assumed to be continuous. However, η̂n is the Breslow estimator, which is a right-continuous step function with jumps at observed failure times and is therefore not in the parameter space. Thus direct differentiation of the log-likelihood at the maximum likelihood estimator will not yield an efficient score equation.

We will also require that

ℓ̇(θ0, θ0, η0) = ℓ̃θ0,η0.   (19.3)

Note that we are only insisting that this identity holds at the true parameter values. We will now review four examples to illustrate this concept: the Cox model for right censored data, the proportional odds model for right censored data, the Cox model for current status data, and partly linear logistic regression. Some of these examples have already been studied extensively in previous chapters. The approximately least-favorable submodel structure we present here will be utilized later in this chapter for developing methods of inference for θ.

19.2.1 The Cox Model for Right Censored Data

The Cox model for right censored data has been discussed previously in this book in a number of places, including Section 4.2.2. Because of notational tradition, we will use β instead of θ for the regression parameter and Λ instead of η for the baseline integrated hazard function. Recall that an observation from this model has the form X = (W, δ, Z), where W = T ∧ C, δ = 1{W = T}, Z ∈ Rk is a regression covariate, T is a right-censored failure time with integrated hazard given Z equal to t → e^{β′Z}Λ(t), and C is a censoring time independent of T given Z and uninformative of (β, Λ). We assume that there exists a τ < ∞ such that P(C ≥ τ) = P(C = τ) > 0. We require H to consist of all monotone increasing functions Λ ∈ C[0, τ] with Λ(0) = 0. We define H̃ to be the set of all monotone increasing functions Λ ∈ D[0, τ].

As shown in (3.3), the efficient score for β is ℓ̃β,Λ = ∫_0^τ (Z − h0(s))dM(s), where M(t) ≡ N(t) − ∫_0^t Y(s)e^{β′Z}dΛ(s), N and Y are the usual counting and at-risk processes, respectively, and

h0(t) ≡ P[Z1{W ≥ t}e^{β0′Z}] / P[1{W ≥ t}e^{β0′Z}],

where P is the true probability measure (at the parameter values (β0, Λ0)). Recall that the log-likelihood for a single observation is

l(β, Λ) = (β′Z + log ∆Λ(W))δ − e^{β′Z}Λ(W),

where ∆Λ(w) is the jump size of Λ at w.

If we let t → Λt(β, Λ) ≡ ∫_0^{(·)} (1 + (β − t)′h0(s))dΛ(s), then Λt(β, Λ) ∈ H for all ‖t − β‖ small enough, Λβ(β, Λ) = Λ, and

ℓ̇(β0, β0, Λ0) = ∫_0^τ (Z − h0(s))dM(s) = ℓ̃β0,Λ0   (19.4)

(see Exercise 19.5.2). Thus Conditions (19.2) and (19.3) are both satisfied. We will use these results later in this chapter to develop valid methods of inference for β.
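The least-favorable direction h0 has a direct empirical counterpart: replacing P with Pn gives the at-risk, e^{β′Z}-weighted average of the covariate familiar from the Cox partial likelihood. The sketch below is a hypothetical illustration on simulated data (not code from the text), with the data-generating choices ours.

```python
import numpy as np

def h0_hat(t, w, z, beta):
    # empirical plug-in for h_0(t) = P[Z 1{W >= t} e^{b'Z}] / P[1{W >= t} e^{b'Z}]
    wgt = (w >= t) * np.exp(z @ beta)
    return (wgt[:, None] * z).sum(axis=0) / wgt.sum()

rng = np.random.default_rng(3)
n, beta = 500, np.array([0.5])
z = rng.normal(size=(n, 1))
# hypothetical exponential failure times with hazard e^{beta'Z}; no censoring here
w = rng.exponential(1.0 / np.exp(z @ beta).ravel())
print(h0_hat(0.0, w, z, beta))   # at t = 0 this is the e^{beta'Z}-weighted mean of Z
```

As t increases, the at-risk set shrinks and h0_hat(t, ...) drifts away from the raw covariate mean, which is precisely the effect the centering term Z − h0(s) in the efficient score accounts for.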

19.2.2 The Proportional Odds Model for Right Censored Data

This model was studied extensively in Section 15.3. The data and parameterconstraints are the same as in the Cox model, but the survival functiongiven the covariates has the proportional odds form (15.9) instead of theproportional hazards form. Moreover, we use A for the baseline integrated“hazard” instead of Λ, thus the composite parameter is (β, A). Recall thescore and information operators derived for this model from Section 15.3,and, for a fixed or random function s → g(s), denote

Wτ(θ)(g) ≡ ∫_0^τ g(s) dN(s) − (1 + δ) ∫_0^{U∧τ} e^{β′Z} g(s) dA(s) / [1 + e^{β′Z} A(U ∧ τ)].


354 19. Efficient Inference for Finite-Dimensional Parameters

Thus the usual score for β in the direction h1 ∈ Rk is ℓβ,A = Wτ(θ)(Z′h1). We will now argue that

ℓ̃β,A = Wτ(θ)(Z′h1 − [σ_θ^{22}]^{-1}σ_θ^{21}(h1)(·)),

where σ_θ^{22} is continuously invertible, and thus the least favorable direction for this model is

s → h0(s) ≡ [σ_θ^{22}]^{-1}σ_θ^{21}(h1)(s).

Recall that the operator σ_θ^{21} is a compact, linear operator from Rk to H²∞ ≡ "the space of functions in D[0, τ] with bounded total variation", and the operator σ_θ^{22} is a linear operator from H²∞ to H²∞. We first need to show that σ_θ^{22} is continuously invertible, and then we need to show that h1 → Wτ(θ)(h0′(·)h1) is the projection of h1 → Wτ(θ)(Z′h1) onto the closed linear span of h2 → Wτ(θ)(h2(·)), where h2 ranges over H²∞. The first result follows from the same arguments used in the proof of Theorem 15.9, but under a simplified model where β is known to have the value β0 (see Exercise 19.5.3).

For the second result, it is clear based on the first result that the function s → [σ_θ^{22}]^{-1}σ_θ^{21}(h1)(s) is an element of H²∞ for every h1 ∈ Rk. Thus Wτ(θ)(h0′(·)h1) is in the closed linear span of Wτ(θ)(h2), where h2 ranges over H²∞, and thus all that remains to verify is that

P[(Wτ(θ)(Z′h1) − Wτ(θ)(h0′(·)h1)) Wτ(θ)(h2)] = 0, for all h2 ∈ H²∞.    (19.5)

Considering the relationship between Wτ and the Vτ defined in Section 15.3.4, we have that the left side of (19.5) equals

P[ Vτ(θ)(h1, −[σ_θ^{22}]^{-1}σ_θ^{21}(h1)) Vτ(θ)(0, h2) ]
 = h1′σ_θ^{11}(0) + h1′σ_θ^{12}(h2) − (0)′σ_θ^{12}([σ_θ^{22}]^{-1}σ_θ^{21}(h1)) − ∫_0^τ h2(s) σ_θ^{21}(h1)(s) dA(s)
 = 0,

since

h1′σ_θ^{12}(h2) = ∫_0^τ h2(s) σ_θ^{21}(h1)(s) dA(s),

as can be deduced from the definitions following (15.20). Since h2 was arbitrary, we have proven (19.5). Thus h0 ∈ H²∞ as defined previously in this section is indeed the least favorable direction. Hence t → At(β, A) ≡ ∫_0^{(·)} (1 + (β − t)′h0(s)) dA(s) satisfies both (19.2) and (19.3).


19.2.3 The Cox Model for Current Status Data

Current status data arise when each subject is observed at a single examination time, Y, to determine whether an event has occurred. The event time, T, cannot be observed exactly. Including the covariate Z, the observed data in this setting consist of n independent and identically distributed realizations of X = (Y, δ, Z), where δ = 1{T ≤ Y}. We assume that the integrated hazard function of T given Z has the proportional hazards form and parameters (β, Λ) as given in Section 19.2.1 above.

We also make the following additional assumptions. T and Y are independent given Z. Z lies in a compact set almost surely, and the covariance of Z − E(Z|Y) is positive definite, which, as we will show later, guarantees that the efficient information Ĩ0 is positive definite. Y possesses a Lebesgue density which is continuous and positive on its support [σ, τ], where 0 < σ < τ < ∞, for which the true parameter Λ0 satisfies Λ0(σ−) > 0 and Λ0(τ) < M < ∞, for some known M, and is continuously differentiable on this interval with derivative bounded away from zero. We let H denote all such possible choices of Λ0 satisfying these constraints for the given value of M, and we let H̄ consist of all nonnegative, nondecreasing, right-continuous functions Λ on [σ, τ] with Λ(τ) ≤ M.

We can deduce that the log-likelihood for a single observation, l(β, Λ), has the form

l(β, Λ) = δ log[1 − exp(−Λ(Y) exp(β′Z))] − (1 − δ) exp(β′Z)Λ(Y).    (19.6)

From this, we can deduce that the score function for β takes the form ℓβ,Λ(x) = zΛ(y)Q(x; β, Λ), where

Q(x; β, Λ) = e^{β′z} [ δ exp(−e^{β′z}Λ(y)) / (1 − exp(−e^{β′z}Λ(y))) − (1 − δ) ].
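The closed form of this score is easy to check numerically: z Λ(y) Q(x; β, Λ) must agree with a numerical derivative in β of the log-likelihood (19.6). A minimal sketch, in which the single observation and the linear choice of Λ are arbitrary illustrative assumptions:

```python
import math

# One hypothetical current status observation x = (y, delta, z) and an assumed
# baseline integrated hazard Lambda(s) = 0.7 * s, used only for illustration.
y, delta, z, beta = 1.3, 1, 0.4, 0.2
Lam = lambda s: 0.7 * s

def loglik(b):
    # log-likelihood (19.6) for a single observation
    u = math.exp(b * z) * Lam(y)
    return delta * math.log(1.0 - math.exp(-u)) - (1 - delta) * u

def Q(b):
    u = math.exp(b * z) * Lam(y)
    return math.exp(b * z) * (delta * math.exp(-u) / (1.0 - math.exp(-u)) - (1 - delta))

score = z * Lam(y) * Q(beta)      # closed-form score for beta
eps = 1e-6
numeric = (loglik(beta + eps) - loglik(beta - eps)) / (2 * eps)
```

The two quantities agree to numerical precision, confirming the displayed form of Q.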

Inserting a submodel t → Λt such that h(y) = −∂/∂t|t=0 Λt(y) exists for every y into the log-likelihood and differentiating at t = 0, we obtain a score function for Λ of the form Aβ,Λh(x) = h(y)Q(x; β, Λ). The linear span of these functions contains Aβ,Λh for all bounded functions h of bounded variation. Thus the efficient score function for β is ℓ̃β,Λ = ℓβ,Λ − Aβ,Λhβ,Λ for the least-favorable direction vector of functions hβ,Λ minimizing the distance Pβ,Λ‖ℓβ,Λ − Aβ,Λh‖². The solution at the true parameter is

h0(Y) ≡ Λ0(Y)h00(Y) ≡ Λ0(Y) Eβ0,Λ0(Z Q²(X; β0, Λ0)|Y) / Eβ0,Λ0(Q²(X; β0, Λ0)|Y)    (19.7)

(see Exercise 19.5.4). As the formula shows, the vector of functions h0(y) is unique a.s., and h0(y) is a bounded function since Q(x; β0, Λ0) is bounded


away from zero and infinity. We assume that the function h0 given by (19.7) has a version which is differentiable with a bounded derivative on [σ, τ].

An approximately least-favorable submodel can thus be of the form

Λt(β, Λ)(·) = Λ(·) + φ(Λ(·))(β − t)′(h00 ∘ Λ0^{-1} ∘ Λ)(·),

where φ(·) is a fixed function, to be defined shortly, that approximates the identity. Note that we need to extend Λ0^{-1} so that it is defined on all of [0, M]. This is done by letting Λ0^{-1}(t) = σ for all t ∈ [0, Λ0(σ)] and Λ0^{-1}(t) = τ for all t ∈ [Λ0(τ), M]. We take φ: [0, M] → [0, M] to be any fixed function with φ(u) = u on [Λ0(σ), Λ0(τ)] such that u → φ(u)/u is Lipschitz and φ(u) ≤ c(u ∧ (M − u)) for all u ∈ [0, M] and some c < ∞ depending only on (β0, Λ0). Our conditions on the model ensure that such a function exists.

The function Λt(β, Λ) is essentially Λ plus a perturbation in the least favorable direction, h0, but its definition is somewhat complicated in order to ensure that Λt(β, Λ) really defines a cumulative hazard function within our parameter space, at least for all t sufficiently close to β. To see this, first note that by using h00 ∘ Λ0^{-1} ∘ Λ, rather than h00, we ensure that the perturbation added to Λ is Lipschitz continuous with respect to Λ. Combining this with the Lipschitz continuity of φ(u)/u, we obtain that for any 0 ≤ v < u ≤ M,

Λt(β, Λ)(u) − Λt(β, Λ)(v) ≥ Λ(u) − Λ(v) − Λ(v)‖t − β‖ k0 (Λ(u) − Λ(v)),

for some universal constant 0 < k0 < ∞. Since Λ(v) ≤ M, we obtain that for all ‖t − β‖ small enough, Λt(β, Λ) is non-decreasing. The additional constraints on φ ensure that 0 ≤ Λt(β, Λ) < M for all ‖t − β‖ small enough.

Hence t → Λt(β, Λ) as constructed above satisfies both (19.2) and (19.3). Additional details about the construction of the approximately least-favorable submodel for this example can be found in Section 4.1 of Murphy and van der Vaart (2000).

19.2.4 Partly Linear Logistic Regression

This is a continuation of the example discussed in Chapter 1 and Sections 4.5 and 15.1. Recall that an observation from this model has the form X = (Y, Z, U), where Y is a dichotomous outcome, Z ∈ Rd is a covariate, and U ∈ [0, 1] is an additional, continuous covariate. Moreover, the probability that Y = 1 given that (Z, U) = (z, u) is ν[β′z + η(u)], where u → ν(u) ≡ e^u/(1 + e^u), η ∈ H, and H̄ consists of those functions h ∈ C[0, 1] with

J(h) ≡ ∫_0^1 [h^{(d)}(s)]² ds < ∞,

where h^{(j)} is the jth derivative of h, and d < ∞ is a known, positive integer. Additional assumptions are given in Sections 4.5 and 15.1. These assumptions include Z being restricted to a known, compact set, and the existence of a known c0 < ∞ for which ‖β‖ < c0 and ‖h‖∞ < c0. Thus H = {h ∈ H̄ : ‖h‖∞ < c0}.


As shown in Section 4.5, the efficient score for β is ℓ̃β,η(Y, Z, U) ≡ (Z − h1(U))(Y − µβ,η(Z, U)), where µβ,η(Z, U) ≡ ν[β′Z + η(U)],

h1(u) ≡ E{Z Vβ,η(Z, U)|U = u} / E{Vβ,η(Z, U)|U = u},

and Vβ,η(Z, U) ≡ µβ,η(Z, U)(1 − µβ,η(Z, U)). Thus h1 is the least-favorable direction, and, provided there exists a version of h1 such that h1 ∈ H, ηt(β, η) = η + (β − t)′h1 satisfies both (19.2) and (19.3). This example differs from the previous examples since the likelihood used for estimation is penalized.
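Since h1 is a ratio of conditional expectations given U, it can be approximated by kernel-weighted averages, and the efficient score residuals (Z − h1(U))(Y − µβ,η(Z, U)) then average to approximately zero at the true parameters. A rough sketch, in which the data-generating design, the Gaussian kernel, and the bandwidth are all illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
nu = lambda x: 1.0 / (1.0 + np.exp(-x))           # logistic link nu(u) = e^u / (1 + e^u)

# Hypothetical design: beta0 = 1 and eta(u) = sin(2*pi*u) - 0.5.
n, beta0 = 5000, 1.0
U = rng.uniform(0.0, 1.0, n)
Z = rng.normal(0.0, 1.0, n) + U                   # Z correlated with U
mu = nu(beta0 * Z + np.sin(2 * np.pi * U) - 0.5)  # mu_{beta,eta}(Z, U)
Y = rng.binomial(1, mu)
V = mu * (1.0 - mu)                               # V_{beta,eta}(Z, U)

def h1_hat(u, bw=0.05):
    """Kernel-weighted estimate of h1(u) = E[Z V | U = u] / E[V | U = u]."""
    w = np.exp(-0.5 * ((U - u) / bw) ** 2)
    return np.sum(w * Z * V) / np.sum(w * V)

idx = rng.choice(n, 1000, replace=False)
eff_score = np.array([(Z[i] - h1_hat(U[i])) * (Y[i] - mu[i]) for i in idx])
# eff_score.mean() is close to zero at the true parameters
```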

19.3 Inference

We now discuss a few methods of conducting efficient estimation and inference for θ using the profile likelihood. The first method is based on the very important result that, under reasonable regularity conditions, a profile likelihood for θ behaves asymptotically like a parametric likelihood of a normal random variable, with variance equal to the inverse of the efficient Fisher information Ĩθ,η, in a shrinking neighborhood of the true parameter θ0. This can yield valid likelihood-ratio-based inference for θ. The second method, the profile sampler, extends the first method to show that this behavior occurs in a compact neighborhood of θ0, and thus an automatic method for estimation and inference can be obtained by applying a simple random walk to the profile likelihood. The third method extends the idea of the second method to penalized maximum likelihoods. Several other methods, including the bootstrap, jackknife and purely Bayesian approaches, are also discussed.

19.3.1 Quadratic Expansion of the Profile Likelihood

The main ideas of this section come from a very elegant paper by Murphy and van der Vaart (2000) on profile likelihood. The context is the same as in the beginning of Section 19.2, wherein we have maximum likelihood estimators (θ̂n, η̂n) based on an i.i.d. sample X1, . . . , Xn, where one is the finite-dimensional parameter of primary interest (θ̂n) and the other is an infinite-dimensional nuisance parameter (η̂n). The main result, given formally below as Theorem 19.5 (on Page 359), is that under certain regularity conditions, we have for any estimator θ̃n →P θ0, that


pLn(θ̃n) = pLn(θ0) + (θ̃n − θ0)′ Σ_{i=1}^n ℓ̃θ0,η0(Xi)    (19.8)
 − (1/2) n (θ̃n − θ0)′ Ĩθ0,η0 (θ̃n − θ0)
 + oP0(1 + √n‖θ̃n − θ0‖)²,

where Ĩθ0,η0 is the efficient Fisher information and P0 the probability measure of X at the true parameter values.

Suppose we know that the maximum profile likelihood estimator θ̂n is consistent, i.e., that θ̂n = θ0 + oP0(1), and that Ĩθ0,η0 is positive definite. Then if (19.8) also holds, we have that

‖√n(θ̂n − θ0)‖² ≤ √n(θ̂n − θ0)′ [ n^{-1/2} Σ_{i=1}^n ℓ̃θ0,η0(Xi) ] + oP0(1 + √n‖θ̂n − θ0‖)²
 = OP0(√n‖θ̂n − θ0‖) + oP0(1 + √n‖θ̂n − θ0‖)²,

since pLn(θ̂n) − pLn(θ0) ≥ 0 (the positive definiteness of Ĩθ0,η0 allows its smallest eigenvalue to be absorbed into the left side). This now implies that

(1 + √n‖θ̂n − θ0‖)² = OP0(1 + √n‖θ̂n − θ0‖) + oP0(1 + √n‖θ̂n − θ0‖)²,

which yields that √n‖θ̂n − θ0‖ = OP0(1). Let K ⊂ Rk be a compact neighborhood of 0, and note that for any

sequence of possibly random points θ̃n ∈ θ0 + n^{-1/2}K, we have θ̃n = θ0 + oP0(1) and √n(θ̃n − θ0) = OP0(1). Thus sup_{u∈K} |pLn(θ0 + u/√n) − pLn(θ0) − Mn(u)| = oP0(1), where u → Mn(u) ≡ u′Zn − (1/2)u′Ĩθ0,η0u and Zn ≡ √n Pn ℓ̃θ0,η0(X). Since Zn ⇝ Z, where Z ∼ Nk(0, Ĩθ0,η0), and since K was arbitrary, we have by the argmax theorem (Theorem 14.1) that √n(θ̂n − θ0) ⇝ Ĩθ0,η0^{-1} Z, which implies that θ̂n is efficient, and thus, by Theorem 18.7, we also have

√n(θ̂n − θ0) = √n Pn Ĩθ0,η0^{-1} ℓ̃θ0,η0(X) + oP0(1).    (19.9)

These arguments can easily be strengthened to imply the following simple corollary:

Corollary 19.1 Let the estimator θ̃n be consistent for θ0 and satisfy pLn(θ̃n) ≥ pLn(θ̂n) − oP0(1), where θ̂n is the maximum profile likelihood estimator. Then, provided Ĩθ0,η0 is positive definite and (19.8) holds, θ̃n is efficient.

The proof is not difficult and is saved as an exercise (see Exercise 19.5.5). Combining (19.8) and Corollary 19.1, we obtain after some algebra the following profile likelihood expansion centered around any consistent, approximate maximizer of the profile likelihood:


Corollary 19.2 Let θ̃n = θ0 + oP0(1) and satisfy pLn(θ̃n) ≥ pLn(θ̂n) − oP0(1). Then, provided Ĩθ0,η0 is positive definite and (19.8) holds, we have for any random sequence θ̆n = θ0 + oP0(1),

pLn(θ̆n) = pLn(θ̃n) − (1/2) n (θ̆n − θ̃n)′ Ĩθ0,η0 (θ̆n − θ̃n)    (19.10)
 + oP0(1 + √n‖θ̆n − θ0‖)².

The proof is again saved as an exercise (see Exercise 19.5.6). The following two additional corollaries provide methods of using this quadratic expansion to conduct inference for θ0:

Corollary 19.3 Assume the conditions of Corollary 19.1 hold for θ̃n. Then, under the null hypothesis H0: θ = θ0, 2(pLn(θ̃n) − pLn(θ0)) ⇝ χ²(k), where χ²(k) is a chi-squared random variable with k degrees of freedom.

Corollary 19.4 Assume the conditions of Corollary 19.1 hold for θ̃n. Then for any vector sequence vn →P v ∈ Rk and any scalar sequence hn →P 0 such that (√n hn)^{-1} = OP(1), where the convergence is under P = P0, we have

−2 [pLn(θ̃n + hnvn) − pLn(θ̃n)] / (n hn²) →P v′Ĩθ0,η0 v.

Proofs. Corollary 19.3 follows immediately from Corollary 19.2 by setting θ̆n = θ0, while Corollary 19.4 also follows from Corollary 19.2, but after setting θ̆n = θ̃n + hnvn.

Corollary 19.3 can be used for hypothesis testing and confidence region construction for θ0, while Corollary 19.4 can be used to obtain consistent, numerical estimates of Ĩθ0,η0. The purpose of the remainder of this section is to present and verify reasonable regularity conditions for (19.8) to hold. After the presentation and proof of Theorem 19.5, we will verify the conditions of the theorem for two of the examples mentioned above, the Cox model for right-censored data and the Cox model for current status data. The next section will show that slightly stronger assumptions can yield even more powerful methods of inference.
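Corollary 19.3 says the profile log-likelihood ratio can be inverted exactly as in a parametric model. As a toy illustration, the following sketch uses a fully parametric problem in which the profile likelihood has a closed form (X ~ N(θ, σ²) with σ² profiled out); the model, sample size, and seed are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)

# Profiling sigma^2 out of the normal likelihood gives, up to a constant,
# pLn(theta) = -(n/2) * log(mean((X - theta)^2)).
n, theta0 = 400, 1.0
X = rng.normal(theta0, 2.0, n)

def pLn(theta):
    return -0.5 * n * np.log(np.mean((X - theta) ** 2))

theta_hat = X.mean()                      # maximizer of this profile likelihood
lr = 2 * (pLn(theta_hat) - pLn(theta0))   # approximately chi-squared, 1 df, under H0
# 95% confidence set: {theta : 2 * (pLn(theta_hat) - pLn(theta)) <= 3.84}
in_ci = lr <= 3.84
```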

To begin with, we will need an approximately least-favorable submodel t → ηt(θ, η) that satisfies Conditions (19.2) and (19.3). Define ℓ̇(t, θ, η) ≡ (∂/∂t)ℓ(t, θ, η) and ℓ̈(t, θ, η) ≡ (∂²/∂t²)ℓ(t, θ, η), define η̂θ ≡ argmaxη Ln(θ, η), and assume that for any possibly random sequence θ̃n →P θ0, we have

η̂θ̃n →P η0 and    (19.11)

P0 ℓ̇(θ0, θ̃n, η̂θ̃n) = oP0(‖θ̃n − θ0‖ + n^{-1/2}).    (19.12)

We are now ready to present the theorem:

Theorem 19.5 Assume Conditions (19.2), (19.3), (19.11) and (19.12) are satisfied, and assume that the functions (t, θ, η) → ℓ̇(t, θ, η)(X) and (t, θ, η) → ℓ̈(t, θ, η)(X) are continuous at (θ0, θ0, η0) for P0-almost every X. Also assume that, for some neighborhood V of (θ0, θ0, η0), the class of functions F1 ≡ {ℓ̇(t, θ, η) : (t, θ, η) ∈ V} is P0-Donsker with square-integrable envelope function and the class of functions F2 ≡ {ℓ̈(t, θ, η) : (t, θ, η) ∈ V} is P0-Glivenko-Cantelli and bounded in L1(P0). Then (19.8) holds.

Proof. The proof follows closely the proof of Theorem 1 given in Murphy and van der Vaart (2000). Since ℓ̇(t, θ, η) → ℓ̃θ0,η0 ≡ ℓ̃0 as (t, θ, η) → (θ0, θ0, η0), and since the envelope of F1 is square-integrable, we have by the dominated convergence theorem that P0(ℓ̇(t, θ, η) − ℓ̃0)² →P 0 for every (t, θ, η) →P (θ0, θ0, η0). We likewise have P0ℓ̈(t, θ, η) →P P0ℓ̈(θ0, θ0, η0). Because t → exp[ℓ(t, θ0, η0)] corresponds to a smooth parametric likelihood up to a multiplicative constant for all ‖t − θ0‖ small enough, we know that its derivatives satisfy the usual identity:

P0ℓ̈(θ0, θ0, η0) = −P0[ ℓ̇(θ0, θ0, η0) ℓ̇′(θ0, θ0, η0) ]    (19.13)
 = −Ĩθ0,η0 ≡ −Ĩ0.

Combining these facts with the empirical process conditions, we obtain that for every (t, θ, η) →P (θ0, θ0, η0), both

Gn ℓ̇(t, θ, η) = Gn ℓ̃0 + oP0(1) and    (19.14)

Pn ℓ̈(t, θ, η) →P −Ĩ0.    (19.15)

By (19.2) and the definition of η̂θ, we have

n^{-1}[pLn(θ) − pLn(θ0)] = Pn l(θ, η̂θ) − Pn l(θ0, η̂θ0),

which is bounded below by Pn l(θ, ηθ(θ0, η̂θ0)) − Pn l(θ0, ηθ0(θ0, η̂θ0)) and bounded above by Pn l(θ, ηθ(θ, η̂θ)) − Pn l(θ0, ηθ0(θ, η̂θ)). The lower bound follows from the facts that η̂θ maximizes η → Ln(θ, η) and ηθ0(θ0, η̂θ0) = η̂θ0, while the upper bound follows from the facts that ηθ(θ, η̂θ) = η̂θ and η̂θ0 maximizes η → Ln(θ0, η). Note that both the upper and lower bounds have the form Pnℓ(θ, ψ) − Pnℓ(θ0, ψ), where ψ = (θ0, η̂θ0) for the lower bound and ψ = (θ, η̂θ) for the upper bound. The basic idea of the remainder of the proof is to apply a two-term Taylor expansion to both the upper and lower bounds and show that they agree up to a term of order oP0(‖θ − θ0‖ + n^{-1/2})².

For some t̃ on the line segment between θ and θ0, we have that

Pnℓ(θ, ψ) − Pnℓ(θ0, ψ) = (θ − θ0)′Pnℓ̇(θ0, ψ) + (1/2)(θ − θ0)′Pnℓ̈(t̃, ψ)(θ − θ0).


For the second term on the right, we can, as a consequence of (19.15), replace Pnℓ̈(t̃, ψ) with −Ĩ0 after adding an oP0(‖θ − θ0‖²) term. For the first term, we can utilize (19.14) to obtain that

√n Pnℓ̇(θ0, ψ) = Gnℓ̃0 + Gn[ ℓ̇(θ0, ψ) − ℓ̃0 ] + √n P0ℓ̇(θ0, ψ)
 = Gnℓ̃0 + oP0(1) + oP0(1 + √n‖θ − θ0‖),

where the error terms follow from the empirical process assumptions and (19.12), respectively. Thus the first term becomes

(θ − θ0)′Pnℓ̃0 + oP0(‖θ − θ0‖² + n^{-1/2}‖θ − θ0‖).

model for right censored data and the Cox model for current status data.The verification of the conditions for the proportional odds model underright-censoring will be considered in the case studies of Chapter 22. In-ference for the partly-linear logistic regression example will be consideredlater in Section 19.3.3. Several other very interesting semiparametric exam-ples for which Theorem 19.5 is applicable are presented in Murphy and vander Vaart (2000), including a case-control set-up with a missing covariate,a shared gamma-frailty model for right-censored data, and an interestingsemiparametric mixture model.

The Cox model for right censored data.

This is a continuation of Section 19.2.1. Recall that Conditions (19.2)and (19.3) have already been established for this model. Putting the pre-vious results for this example together, we first obtain that

ℓ(t, β, Λ) = (β′Z + log [∆Λt(β, Λ)(W )] δ − eβ′ZΛt(β, Λ)(W ).

Differentiating twice with respect to t, we next obtain

ℓ(t, β, Λ) = Zδ − Zet′ZΛt(β, Λ)(W )

− h0(W )δ

1 + (β − t)′h0(W )+ et′Z

∫ W

0

h0(s)dΛ(s)

and

ℓ(t, β, Λ) = −ZZ ′et′ZΛt(β, Λ)(W ) + et′Z

∫ W

0

(Zh′0(s) + h0(s)Z

′)dΛ(s)

− h0(W )h′0(W )δ

[1 + (β − t)′h0(W )]2 .

In this case, the maximizer of Λ → Ln(β, Λ) has the closed form of the Breslow estimator:


Λ̂β(s) = ∫_0^s Pn dN(u) / [Pn Y(u) e^{β′Z}],

and it is easy to verify that Λ̂β̃n is uniformly consistent for Λ0 for any possibly random sequence β̃n →P β0. Thus Condition (19.11) holds. Recall the counting process notation Y(s) ≡ 1{W ≥ s} and N(s) ≡ 1{W ≤ s}δ, and note that P0ℓ̇(β0, β̃n, Λ̂β̃n)

= P0[ Zδ − ∫_0^τ (1 + (β̃n − β0)′h0(s)) Z Y(s) e^{β0′Z} dΛ̂β̃n(s)
  − h0(W)δ / (1 + (β̃n − β0)′h0(W)) + ∫_0^τ h0(s) Y(s) e^{β0′Z} dΛ̂β̃n(s) ]

= P0[ −∫_0^τ (β̃n − β0)′h0(s) Z Y(s) e^{β0′Z} dΛ̂β̃n(s)
  + h0(W)((β̃n − β0)′h0(W)) δ / (1 + (β̃n − β0)′h0(W)) ]

= P0[ −∫_0^τ Z (β̃n − β0)′h0(s) Y(s) e^{β0′Z} ( dΛ̂β̃n(s) − dΛ0(s) )
  − h0(W)((β̃n − β0)′h0(W))² δ / (1 + (β̃n − β0)′h0(W)) ]

= oP0(‖β̃n − β0‖).

The second equality follows from the identity

P0[ (Z − h0(s)) Y(s) e^{β0′Z} ] = 0, for all s ∈ [0, τ],    (19.16)

while the third equality follows from the fact that

M(s) ≡ N(s) − ∫_0^s Y(u) e^{β0′Z} dΛ0(u)

is a mean-zero martingale, combined with a reapplication of (19.16). Thus Condition (19.12) holds.

The smoothness of the functions involved, combined with the fact that u → h0(u) is bounded in total variation, along with other standard arguments, yields the required continuity, Donsker and Glivenko-Cantelli conditions of Theorem 19.5 (see Exercise 19.5.7). Thus all of the conditions of Theorem 19.5 are satisfied for the Cox model for right censored data.
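In this example the profile likelihood can be evaluated directly by plugging the Breslow estimator back into the likelihood, and it differs from the Cox partial likelihood only by an additive constant not depending on β (the number of events). The following sketch checks this on simulated data; the data-generating design is an illustrative assumption:

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical right-censored Cox sample with scalar covariate Z.
n, beta0 = 200, 0.5
Z = rng.uniform(-1.0, 1.0, n)
T = rng.exponential(1.0, n) / np.exp(beta0 * Z)
C = rng.exponential(1.5, n)
W, d = np.minimum(T, C), T <= C

def profile_loglik(beta):
    """log L_n(beta, Breslow estimator at beta)."""
    risk = np.exp(beta * Z)
    events = np.flatnonzero(d)
    jump = {i: 1.0 / risk[W >= W[i]].sum() for i in events}    # Breslow jumps
    ll = 0.0
    for i in range(n):
        Lam_Wi = sum(jump[j] for j in events if W[j] <= W[i])  # Breslow estimator at W[i]
        if d[i]:
            ll += beta * Z[i] + np.log(jump[i])
        ll -= risk[i] * Lam_Wi
    return ll

def partial_loglik(beta):
    return sum(beta * Z[i] - np.log(np.exp(beta * Z[W >= W[i]]).sum())
               for i in np.flatnonzero(d))
```

Plugging in the Breslow estimator gives profile_loglik(β) = partial_loglik(β) − D exactly, where D is the number of observed events, so both are maximized by the same β̂.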

The Cox model for current status data.

This is a continuation of Section 19.2.3, in which we formulated an approximately least-favorable submodel Λt(β, Λ) that satisfies Conditions (19.2) and (19.3). We also have from Section 19.2.3 that


ℓ(t, β, Λ) = l(t, Λt(β, Λ))
 = δ log[ 1 − exp(−Λt(β, Λ)(Y) e^{t′Z}) ] − (1 − δ) e^{t′Z} Λt(β, Λ)(Y).

This yields

ℓ̇(t, β, Λ) = (Z Λt(β, Λ)(Y) + Λ̇(Y)) Q(X; t, Λt(β, Λ)), and

ℓ̈(t, β, Λ) = ( Λt(β, Λ)(Y) ZZ′ + Λ̇(Y)Z′ + Z Λ̇′(Y) ) Q(X; t, Λt(β, Λ))
 − e^{2t′Z} [ Z Λt(β, Λ)(Y) + Λ̇(Y) ]^{⊗2} × exp(−e^{t′Z}Λt(β, Λ)(Y)) / [1 − exp(−e^{t′Z}Λt(β, Λ)(Y))]²,

where Λ̇(u) ≡ −φ(Λ(u)) (h00 ∘ Λ0^{-1} ∘ Λ)(u).

Note that these functions are smooth enough to satisfy the continuity conditions of Theorem 19.5.

Using entropy calculations, Murphy and van der Vaart (1999) extend earlier results of Huang (1996) to obtain

P0[ Λ̂β̃n(Y) − Λ0(Y) ]² = OP0(‖β̃n − β0‖² + n^{-2/3}),

for any possibly random sequence β̃n →P β0. This yields that Condition (19.11) is satisfied. The rate of convergence also helps in verifying Condition (19.12). We omit the details of verifying (19.12), but they can be found on Page 460 of Murphy and van der Vaart (2000).

It is not difficult to verify that the carefully formulated structure of Λt(β, Λ) ensures the existence of a neighborhood V of (β0, β0, Λ0) for which the class G ≡ {Λt(β, Λ) : (t, β, Λ) ∈ V}, viewed as a collection of functions of Y, is a subset of the class of all monotone functions f: [0, τ] → [0, M], which class is known to be Donsker for any probability measure. Since ℓ̇(t, β, Λ) and ℓ̈(t, β, Λ) are bounded, Lipschitz continuous functions of Λt(β, Λ), it is not difficult to verify that F1 is P0-Donsker and F2 is P0-Glivenko-Cantelli for this example, with suitably integrable respective envelope functions. Thus all of the conditions of Theorem 19.5 are satisfied.

19.3.2 The Profile Sampler

The material for this section comes largely from Lee, Kosorok and Fine (2005), who proposed inference based on sampling from a posterior distribution based on the profile likelihood. The quadratic expansion of the


profile likelihood given in the previous section permits the construction of confidence sets for θ by inverting the log-likelihood ratio. Translating this elegant theory into practice can be computationally challenging. Even if the log profile likelihood ratio can be successfully inverted for a multivariate parameter, this inversion does not enable the construction of confidence intervals for each one-dimensional subcomponent separately, as is standard practice in data analysis. For such confidence intervals, it would be necessary to further profile over all remaining components in θ. A related problem for which inverting the log-likelihood is not adequate is the construction of rectangular confidence regions for θ, such as minimum volume confidence rectangles (Di Bucchianico, Einmahl and Mushkudiani, 2001) or rescaled marginal confidence intervals. For many practitioners, rectangular regions are preferable to ellipsoids for ease of interpretation.

In principle, having an estimator of θ and its variance simplifies these inferences considerably. However, the computation of these quantities using the semiparametric likelihood poses stiff challenges relative to those encountered with parametric models, as has been illustrated in several places in this book. Finding the maximizer of the profile likelihood is done implicitly and typically involves numerical approximations. When the nuisance parameter is not √n estimable, nonparametric functional estimation of η for fixed θ may be required, which depends heavily on the proper choice of smoothing parameters. Even when η is estimable at the parametric rate, and without smoothing, Ĩ0 does not ordinarily have a closed form. When it does have a closed form, it may include linear operators which are difficult to estimate well, and inverting the estimated linear operators may not be straightforward. The validity of these variance estimators must be established on a case-by-case basis.

The bootstrap is a possible solution to some of these problems, and we will discuss this briefly later in this chapter and in more detail in Chapter 21. We will show that theoretical justification for the bootstrap is possible but quite challenging for semiparametric models where the nuisance parameter is not √n consistent. Even when the bootstrap can be shown to be valid, the computational burden is quite substantial, since maximization over both θ and η is needed for each bootstrap sample. A different approach to variance estimation is possible via Corollary 19.4 above, which verifies that the curvature of the profile likelihood near θ̂n is asymptotically equal to Ĩ0. In practice, one can perform second order numerical differentiation by evaluating the profile likelihood on a hyperrectangular grid of 3^p equidistant points centered at θ̂n, taking the appropriate differences, and then dividing by 4h², where p is the dimension of θ and h is the spacing between grid points. While the properties of h needed for the asymptotic validity of this approach are well known, there are no clear-cut rules on choosing the grid spacing in a given data set. Thus, it would seem difficult to automate this technique for practical usage.
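To make the grid differencing concrete, here is a one-dimensional sketch (p = 1, so the grid has 3 points and the relevant divisor for the pure second difference is h² rather than 4h²); the toy normal-mean profile likelihood and all constants are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(4)

# Toy profile likelihood (normal mean with the variance profiled out), used to
# illustrate the second-difference estimate of the efficient information.
n, sigma = 1000, 2.0
X = rng.normal(0.0, sigma, n)

def pLn(theta):
    return -0.5 * n * np.log(np.mean((X - theta) ** 2))

theta_hat, h = X.mean(), n ** -0.25   # spacing h with (sqrt(n) h)^{-1} = O(1)
# central second difference around the maximizer, scaled by 1/(n h^2)
I_hat = -(pLn(theta_hat + h) - 2 * pLn(theta_hat) + pLn(theta_hat - h)) / (n * h * h)
# the true efficient information for this toy model is 1/sigma^2 = 0.25
```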


Prior to the Lee, Kosorok and Fine (2005) paper, there did not appear to exist in the statistical literature a general, theoretically justified and automatic method for approximating Ĩ0. Lee, Kosorok and Fine propose an application of Markov chain Monte Carlo to the semiparametric profile likelihood. The method involves generating a Markov chain {θ^{(1)}, θ^{(2)}, . . .} with stationary density proportional to pn(θ) ≡ exp(pLn(θ))q(θ), where q(θ) = Q(dθ)/(dθ) for some prior measure Q. This can be accomplished by using, for example, the Metropolis-Hastings algorithm (Metropolis et al., 1953; Hastings, 1970). Begin with an initial value θ^{(1)} for the chain. For each k = 1, 2, . . ., obtain a proposal θ̃^{(k+1)} by a random walk step from θ^{(k)}. Compute the value of exp(pLn(·))q(·) at the proposal, and decide whether to accept θ̃^{(k+1)} by evaluating

the ratio of the proposed to the current value of exp(pLn(θ))q(θ) and applying an acceptance rule. After generating a sufficiently long chain, one may compute the mean of the chain to estimate the maximizer of pLn(θ) and the variance of the chain to estimate Ĩ0^{-1}. The output from the Markov chain can also be directly used to construct various confidence sets, including minimum volume confidence rectangles. Whether or not a Markov chain is used to sample from the "posterior" proportional to exp(pLn(θ))q(θ), the procedure based on sampling from this posterior is referred to as the profile sampler.
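The steps above can be sketched in a few lines, with a toy parametric profile likelihood standing in for a semiparametric one (in practice pLn(θ) would be computed by maximizing over the nuisance parameter for each θ); the proposal scale, chain length, burn-in, and flat prior are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(5)

# Toy profile likelihood: X ~ N(theta, sigma^2) with sigma^2 profiled out.
n, sigma = 500, 2.0
X = rng.normal(1.0, sigma, n)

def pLn(theta):
    return -0.5 * n * np.log(np.mean((X - theta) ** 2))

# Random-walk Metropolis chain with stationary density proportional to
# exp(pLn(theta)) times a flat prior q(theta) = 1.
chain, theta, lp = [], X.mean(), pLn(X.mean())
for _ in range(20000):
    prop = theta + rng.normal(0.0, 0.2)        # random-walk proposal
    lp_prop = pLn(prop)
    if np.log(rng.uniform()) < lp_prop - lp:   # Metropolis acceptance rule
        theta, lp = prop, lp_prop
    chain.append(theta)

post = np.array(chain[2000:])                  # discard burn-in
theta_bar = post.mean()                        # estimates the maximizer of pLn
info_inv = n * post.var()                      # estimates the inverse efficient information
```

Here theta_bar is close to the maximizer of pLn and info_inv is close to σ² = 4, the inverse information for this toy model, using only evaluations of pLn, with no maximization or differentiation.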

Part of the computational simplicity of this procedure is that pLn(θ) does not need to be maximized; it only needs to be evaluated. As mentioned earlier in this chapter, the profile likelihood is generally fairly easy to compute as a consequence of algorithms such as the stationary point algorithm for maximizing over the nuisance parameter. On the other hand, sometimes the profile likelihood can be very hard to compute. When this is the case, numerical differentiation via Corollary 19.4 may be advantageous since it requires fewer evaluations of the profile likelihood. However, numerical evidence in Section 4.2 of Lee, Kosorok and Fine (2005) seems to indicate that, at least for moderately small samples, numerical differentiation does not perform as well in general as the profile sampler. This observation is supported by theoretical work on the profile sampler by Cheng and Kosorok (2007a, 2007b), who show that the profile sampler yields frequentist inference that is second-order accurate. Thus the profile sampler may be beneficial even when the profile likelihood is hard to compute.

The procedure's validity is established in Theorem 19.6 below, which extends Theorem 19.5 in a manner that enables the quadratic expansion of the log-likelihood around θ̂n to be valid in a fixed, bounded set, rather than only in a shrinking neighborhood. The conclusion of these arguments is that the "posterior" distribution of the profile likelihood with respect to a prior on θ is asymptotically equivalent to the distribution of θ̂n. In order to do this, the new theorem will require an additional assumption on the profile likelihood. Define ∆n(θ) ≡ n^{-1}(pLn(θ) − pLn(θ̂n)). Here is the theorem:


Theorem 19.6 Assume Θ is compact, Ĩ0 is positive definite, Q(Θ) < ∞, q(θ0) > 0, and q is continuous at θ0. Assume also that θ̂n is efficient and that (19.10) holds with θ̃n = θ̂n and for any possibly random sequence θ̆n →P θ0. Assume moreover that for every random sequence θ̃n ∈ Θ,

∆n(θ̃n) = oP0(1) implies that θ̃n = θ0 + oP0(1).    (19.17)

Then, for every measurable function g: Rk → R satisfying

limsup_{m→∞} m^{-2} log( sup_{u∈Rk: ‖u‖≤m} |g(u)| ) ≤ 0,    (19.18)

we have

∫Θ g(√n(θ − θ̂n)) exp(pLn(θ))q(θ) dθ / ∫Θ exp(pLn(θ))q(θ) dθ    (19.19)
 = ∫Rk g(u) (2π)^{-k/2} |Ĩ0|^{1/2} exp[−u′Ĩ0u/2] du + oP0(1).

The proof is given in Section 19.4 below. Note that when g(u) = O(1 + ‖u‖)^d for some d < ∞, Condition (19.18) is readily satisfied. This means that indicators of measurable sets and the first two moments of √n(T − θ̂n), where T has the posterior density proportional to t → exp(pLn(t))q(t), are consistent for the corresponding probabilities and moments of the limiting Gaussian distribution. Specifically, E(T) = θ̂n + oP0(n^{-1/2}) and n var(T) = Ĩ0^{-1} + oP0(1). Thus we can calculate all the quantities needed for inference on θ without having to actually maximize the profile likelihood directly or compute derivatives.

Note that the interesting Condition (19.17) is not implied by the other conditions and is not implied by the identifiability of the Kullback-Leibler information from the full likelihood. Nevertheless, if it can be shown that ∆n(θ) converges uniformly over Θ to the profiled Kullback-Leibler information ∆0(θ), then identifiability of the Kullback-Leibler information for Ln(θ, η) is sufficient. This approach works for the Cox model for right-censored data, as we will see below. However, in two out of the three examples considered in the Lee, Kosorok and Fine (2005) paper, and probably for a large portion of models generally, it appears that the strategy based on ∆0(θ) is usually not fruitful, and it seems to be easier to establish (19.17) directly. Condition (19.17) is needed because the integration in (19.19) is over all of Θ, and thus it is important to guarantee that there are no other distinct modes besides θ̂n in the limiting posterior. Condition (19.10) is not sufficient for this since it only applies to shrinking neighborhoods of θ0 and not to all of Θ as required.


19.3 Inference

The examples and simulation studies in Lee, Kosorok and Fine demonstrate that the profile sampler works very well and is in general computationally efficient. The Metropolis algorithm applied to $p_{\theta,n}(\theta)$ with a Lebesgue prior measure is usually quite easy to tune and seems to achieve equilibrium quickly. By the ergodic theorem, there exists a sequence of finite chain lengths $M_n\to\infty$ so that the chain mean $\bar\theta_n \equiv M_n^{-1}\sum_{j=1}^{M_n}\theta^{(j)}$ satisfies $\bar\theta_n = \hat\theta_n + o_{P_0}(n^{-1/2})$; the standardized sample variance $V_n \equiv M_n^{-1}\sum_{j=1}^{M_n} n(\theta^{(j)}-\bar\theta_n)(\theta^{(j)}-\bar\theta_n)'$ is consistent for $I_0^{-1}$; and the empirical measure $G_n(A) \equiv M_n^{-1}\sum_{j=1}^{M_n} 1\bigl\{\sqrt{n}(\theta^{(j)}-\bar\theta_n)\in A\bigr\}$, for bounded convex $A\subset\mathbb{R}^k$, is consistent for the probability that a mean zero Gaussian deviate with variance $I_0^{-1}$ lies in $A$. Hence the output of the chain can be used for inference about $\theta_0$, provided $M_n$ is large enough that the sampling error from using a finite chain is negligible.

We now verify the additional Assumption (19.17) for the Cox model for right censored data and for the Cox model for current status data. Note that the requirements that $\hat\theta_n$ be efficient and that (19.10) hold, for $\tilde\theta_n = \hat\theta_n$ and for any possibly random sequence $\tilde\theta_n \stackrel{P}{\to} \theta_0$, have already been established for these examples at the end of the previous section. Thus verifying (19.17) will enable the application of Theorem 19.6 to these examples.

The Cox model for right censored data.

For this example, we can use the identifiability of the profile Kullback-Leibler information since the profile likelihood does not involve the nuisance parameter. Let $B$ be the compact parameter space for $\beta$, where $\beta_0$ is known to be in the interior of $B$, and assume that $\|Z\|$ is bounded by a constant. We know from our previous discussions of this model that $n^{-1}\mathrm{pL}_n(\beta)$ equals, up to a constant that does not depend on $\beta$,

$$H_n(\beta) \equiv \mathbb{P}_n\left[\int_0^\tau \Bigl(\beta'Z - \log\bigl[\mathbb{P}_nY(s)e^{\beta'Z}\bigr]\Bigr)\,dN(s)\right].$$

By arguments which are by now familiar to the reader, it is easy to verify that $\|H_n - H_0\|_B \stackrel{P}{\to} 0$, where

$$H_0(\beta) \equiv P_0\left[\int_0^\tau \Bigl(\beta'Z - \log P_0\bigl[Y(s)e^{\beta'Z}\bigr]\Bigr)\,dN(s)\right].$$

It is also easy to verify that $H_0$ has first derivative

$$U_0(\beta) \equiv P_0\left[\int_0^\tau \bigl(Z - E(s,\beta)\bigr)\,dN(s)\right],$$

where $E(s,\beta)$ is as defined in Section 4.2.1, and second derivative $-V(\beta)$, where $V(\beta)$ is defined in (4.5). By the boundedness of $\|Z\|$ and $B$ combined with the other assumptions of the model, it can be shown (see Exercise 19.5.8 below) that there exists a constant $c_0>0$ not depending on $\beta$ such that $V(\beta) \ge c_0\,\mathrm{var}\,Z$, where for $k\times k$ matrices $A$ and $B$, $A\ge B$ means that $c'Ac \ge c'Bc$ for every $c\in\mathbb{R}^k$. Thus $H_0$ is strictly concave and hence has a unique maximum on $B$. It is also easy to verify that $U_0(\beta_0)=0$ (see Part (b) of Exercise 19.5.8), and thus the unique maximum is located at $\beta=\beta_0$.

Hence $\|\Delta_n(\beta)-\Delta_0(\beta)\|_B \stackrel{P}{\to} 0$, where $\Delta_0(\beta) = H_0(\beta)-H_0(\beta_0) \le 0$ is continuous, with the last inequality being strict whenever $\beta \neq \beta_0$. This immediately yields Condition (19.17) with $\beta$ replacing $\theta$.
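The concavity and identifiability just argued can be checked numerically. The sketch below is an illustration under simulated data (all generation settings, including $\beta_0 = 0.5$ and the censoring rate, are our own assumptions, not from the text): it evaluates $H_n(\beta)$, which is $n^{-1}$ times the Cox log partial likelihood, on a grid, and confirms that the grid values are strictly concave with a maximizer near $\beta_0$.

```python
import math
import random

def toy_cox_data(n, beta0=0.5, seed=7):
    # Hypothetical right-censored Cox data: T ~ Exp(rate = exp(beta0*Z)), C ~ Exp(0.3)
    random.seed(seed)
    data = []
    for _ in range(n):
        z = random.gauss(0.0, 1.0)
        t = random.expovariate(math.exp(beta0 * z))
        c = random.expovariate(0.3)
        data.append((min(t, c), t <= c, z))  # (observed time, event indicator, Z)
    return data

def H_n(beta, data):
    """n^{-1} times the Cox log partial likelihood (H_n up to a constant)."""
    n = len(data)
    total = 0.0
    for y, event, z in data:
        if event:
            # P_n Y(s) e^{beta Z} at s = y: average of e^{beta Z} over the risk set
            risk = sum(math.exp(beta * zj) for yj, _, zj in data if yj >= y) / n
            total += beta * z - math.log(risk)
    return total / n

data = toy_cox_data(300)
grid = [-0.5 + 0.1 * i for i in range(21)]  # beta in [-0.5, 1.5]
vals = [H_n(b, data) for b in grid]
beta_hat = grid[max(range(len(grid)), key=vals.__getitem__)]
# strict concavity: every second difference along the grid should be negative
concave = all(vals[i - 1] - 2 * vals[i] + vals[i + 1] < 0
              for i in range(1, len(vals) - 1))
```

Strict concavity of each summand $\beta'Z - \log \mathbb{P}_n Y(s)e^{\beta'Z}$ in $\beta$ (a linear term minus a log-sum-exp) is what makes the grid check succeed regardless of the simulated data.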

The Cox model for current status data.

For this example, we verify (19.17) directly. Let $\tilde\beta_n$ be a possibly random sequence satisfying $\Delta_n(\tilde\beta_n) = o_{P_0}(1)$, where $\beta$ replaces $\theta$. Fix some $\alpha\in(0,1)$ and note that since $\Delta_n(\tilde\beta_n) = o_{P_0}(1)$ and $\Delta_n(\beta_0) \le 0$ almost surely, we have

$$n^{-1}\sum_{i=1}^n \log\frac{f(\tilde\beta_n,\hat F_{\tilde\beta_n};X_i)}{f(\beta_0,F_0;X_i)} \ge o_{P_0}(1),$$

where $f(\beta,F;X) \equiv \delta\bigl\{1-\bar F(Y)^{\exp(\beta'Z)}\bigr\} + (1-\delta)\bar F(Y)^{\exp(\beta'Z)}$, $\bar F \equiv 1-F = \exp(-\Lambda)$, and $\hat F_\beta \equiv 1-\exp(-\hat\Lambda_\beta)$ is the maximizer of the likelihood over the nuisance parameter for fixed $\beta$. This now implies

$$n^{-1}\sum_{i=1}^n \log\left[1+\alpha\left\{\frac{f(\tilde\beta_n,\hat F_{\tilde\beta_n};X_i)}{f(\beta_0,F_0;X_i)}-1\right\}\right] \ge o_{P_0}(1),$$

because $\alpha\log x \le \log(1+\alpha(x-1))$ for any $x>0$. This implies that

$$P_0\log\left[1+\alpha\left\{\frac{f(\tilde\beta_n,\hat F_{\tilde\beta_n};X)}{f(\beta_0,F_0;X)}-1\right\}\right] \ge o_{P_0}(1) \tag{19.20}$$

by Lemma 19.7 below, the proof of which is given in Section 19.4, since $x\mapsto\log(1+\alpha x)$ is Lipschitz continuous for $x\ge 0$ and $f(\theta_0,F_0;X)\ge c$ almost surely, for some $c>0$.

Because $x\mapsto\log x$ is strictly concave, we now have that

$$P_0\log\left[1+\alpha\left\{\frac{f(\tilde\beta_n,\hat F_{\tilde\beta_n};X)}{f(\beta_0,F_0;X)}-1\right\}\right] \le 0.$$

This combined with (19.20) implies that

$$P_0\log\left[1+\alpha\left\{\frac{f(\tilde\beta_n,\hat F_{\tilde\beta_n};X)}{f(\beta_0,F_0;X)}-1\right\}\right] = o_{P_0}(1).$$

This forces the result

$$P_0\Bigl|\bar F_{\tilde\beta_n}(Y)^{\exp(\tilde\beta_n'Z)} - \bar F_0(Y)^{\exp(\beta_0'Z)}\Bigr| = o_{P_0}(1)$$

by the strict concavity of x → log x. This, in turn, implies that

$$P_0\left[\bigl\{(\tilde\beta_n-\beta_0)'(Z-E[Z|Y]) - c_n(Y)\bigr\}^2\,\Big|\,Y\right] = o_{P_0}(1),$$

for almost all $Y$, where $c_n(Y)$ is uncorrelated with $Z-E[Z|Y]$. Hence $\tilde\beta_n = \beta_0 + o_{P_0}(1)$, and Condition (19.17) now follows.

Lemma 19.7 The class $\mathcal{F} \equiv \{f(\beta,F;X): \beta\in B,\ F\in\mathcal{M}\}$, where $\mathcal{M}$ is the class of distribution functions on $[0,\tau]$, is $P_0$-Donsker.

19.3.3 The Penalized Profile Sampler

In many semiparametric models involving a smooth nuisance parameter,it is often convenient and beneficial to perform estimation using penaliza-tion. One motivation for this is that, in the absence of any restrictionson the form of the function η, maximum likelihood estimation for somesemiparametric models leads to over-fitting. We have discussed this issuein the context of the partly linear logistic regression model (see Chapter 1and Sections 4.5, 15.1 and, in this chapter, Section 19.2.4). Under certainreasonable regularity conditions, penalized semiparametric log-likelihoodestimation can yield fully efficient estimates for θ, as we demonstrated inSection 4.5.

The purpose of this section is to briefly discuss a modification of the profile sampler that works with profiled penalized likelihoods, called the penalized profile sampler. Another method of inference that works in this context is the weighted bootstrap, which will be discussed in Chapter 21 and is applicable in general to semiparametric M-estimation. Interestingly, it is unclear whether the usual nonparametric bootstrap will work at all in this context.

We assume in this section that the nuisance parameter $\eta$ is a function in a Sobolev class of functions supported on some compact set $U$ on the real line, whose $d$-th derivative exists and is absolutely continuous with $J(\eta)<\infty$, where

$$J^2(\eta) = \int_U \bigl(\eta^{(d)}(u)\bigr)^2\,du.$$

Here $d$ is a fixed, positive integer and $\eta^{(j)}$ is the $j$-th derivative of $\eta$ with respect to $u$. Clearly $J^2(\eta)$ is a measure of the complexity of $\eta$. We denote by $\mathcal{H}$ the Sobolev class of functions of degree $d$ on $U$.
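As a concrete illustration of the penalty (ours, not from the text), the following sketch approximates $J^2(\eta)$ for $d=2$ by central second differences on a grid; for the test function $\eta(u)=\sin u$ on $U=[0,\pi]$ we have $\eta''=-\sin$, so $J^2(\eta)=\int_0^\pi \sin^2u\,du = \pi/2$.

```python
import math

def J_squared(eta, a, b, n=2000):
    # Approximate J^2(eta) = int_a^b (eta''(u))^2 du for d = 2 by
    # central second differences on an equally spaced grid.
    h = (b - a) / n
    f = [eta(a + i * h) for i in range(n + 1)]
    total = 0.0
    for i in range(1, n):
        second = (f[i + 1] - 2.0 * f[i] + f[i - 1]) / h ** 2  # eta''(u_i)
        total += second ** 2 * h
    return total

val = J_squared(math.sin, 0.0, math.pi)  # exact value is pi/2
```

In practice $\eta$ is typically a spline fitted to data, but the same quadrature idea applies.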

The penalized log-likelihood in this context is

$$\tilde L_n(\theta,\eta) = n^{-1}L_n(\theta,\eta) - \lambda_n^2 J^2(\eta),$$

where $L_n(\theta,\eta)$ is the log-likelihood as used in previous sections in this chapter, and $\lambda_n$ is a smoothing parameter, possibly dependent on the data. This is the approach we have been using for the partly linear logistic regression model. We make the same assumptions about the smoothing parameter in this general context that we made for the partly linear logistic regression model, namely:

$$\lambda_n = o_{P_0}(n^{-1/4}) \quad\text{and}\quad \lambda_n^{-1} = O_{P_0}\bigl(n^{d/(2d+1)}\bigr). \tag{19.21}$$

One way to ensure (19.21) in practice is simply to set $\lambda_n = n^{-d/(2d+1)}$. Alternatively, we can choose $\lambda_n = n^{-1/3}$, which satisfies (19.21) for every $d\ge 1$ and is independent of $d$. The log profile penalized likelihood is defined as $\mathrm{pL}_n(\theta) = \tilde L_n(\theta,\hat\eta_\theta)$, where $\hat\eta_\theta \equiv \mathrm{argmax}_{\eta\in\mathcal{H}}\tilde L_n(\theta,\eta)$ for fixed $\theta$ and $\lambda_n$. The penalized profile sampler is simply the procedure of sampling from the posterior distribution induced by $\mathrm{pL}_n(\theta)$ after assigning a prior to $\theta$ (Cheng and Kosorok, 2007c).

Cheng and Kosorok (2007c) analyze the penalized profile sampler andobtain the following conclusions under reasonable regularity conditions:

• Distribution approximation: The posterior distribution proportional to $\exp(\mathrm{pL}_n(\theta))q(\theta)$, where $q$ is a suitably smooth prior density with $q(\theta_0)>0$, can be approximated by a normal distribution with mean the maximum penalized likelihood estimator of $\theta$ and variance the inverse of the efficient information matrix, with a level of accuracy similar to that obtained for the profile sampler (see Theorem 19.6).

• Moment approximation: The maximum penalized likelihood estimator of $\theta$ can be approximated by the posterior mean with error $O_{P_0}(\lambda_n^2)$. The inverse of the efficient information matrix can be approximated by the posterior variance with error $O_{P_0}(n^{1/2}\lambda_n^2)$.

• Confidence interval approximation: An exact frequentist confidence interval of Wald's type for $\theta$ can be estimated by the credible set obtained from the posterior distribution with error $O_{P_0}(\lambda_n^2)$.

The last item above is perhaps the most important: a confidence region based on the posterior distribution will be $O_{P_0}(\lambda_n^2)$ close to an exact frequentist confidence region. This means, for example, that when $d=2$ and $\lambda_n = n^{-2/5}$, the size of the error is $O_{P_0}(n^{-4/5})$. Note that this is second-order accurate, since the first-order coverage accuracy of a parametric confidence band is $o_{P_0}(n^{-1/2})$.

Cheng and Kosorok (2007c) also verify that the partly linear logistic regression model satisfies the needed regularity conditions for the penalized profile sampler with the approximately least-favorable submodel described in Section 19.2.4.


19.3.4 Other Methods

In this section, we briefly discuss a few alternatives to quadratic expansionand the profile sampler. We first discuss the bootstrap and some relatedprocedures. We then present a computationally simpler alternative, the“block jackknife,” and then briefly discuss a fully Bayesian alternative.Note that this list is not exhaustive, and there are a number of usefulmethods that we are omitting.

The nonparametric bootstrap is useful for models where all parameters are $\sqrt{n}$-consistent, but it is unclear how well it works when the nuisance parameter is not $\sqrt{n}$-consistent. The weighted bootstrap appears to be more generally applicable, and we will discuss this method in greater detail in Chapter 21 in the context of general semiparametric M-estimators, including non-likelihood based approaches. Two other important alternatives are the m within n bootstrap (see Bickel, Götze and van Zwet, 1997) and subsampling (see Politis and Romano, 1994). The idea for both of these is to compute the estimate based on a bootstrap sample of size $m<n$. For the m within n bootstrap, the sample is taken with replacement, while for subsampling, the sample is taken without replacement. The properties of both procedures are quite similar when $m/n\to 0$ and both $m$ and $n$ go to infinity. The main result of Politis and Romano is that subsampling recovers the true limiting distribution whenever that limiting distribution exists and is continuous. Since $\sqrt{n}(\hat\theta_n-\theta_0)$ is known to have a continuous limiting distribution $L$, Theorem 2.1 of Politis and Romano (1994) yields that the m out of n subsampling bootstrap converges, conditionally on the data, to the same distribution $L$, provided $m/n\to 0$ and $m\to\infty$ as $n\to\infty$.
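To make the subsampling recipe concrete, here is an illustrative sketch for the simplest case of the sample mean (all tuning choices, such as the subsample size and number of draws, are our own): the quantiles of $\sqrt{b}(\hat\theta_b^*-\hat\theta_n)$ over subsamples of size $b$, drawn without replacement, approximate those of $\sqrt{n}(\hat\theta_n-\theta_0)$ and yield a confidence interval.

```python
import random

def subsampling_ci(xs, b, reps=500, alpha=0.05, seed=11):
    """Politis-Romano style subsampling CI for the mean (b = subsample size)."""
    n = len(xs)
    theta_n = sum(xs) / n
    rng = random.Random(seed)
    roots = []
    for _ in range(reps):
        sub = rng.sample(xs, b)  # size-b subsample WITHOUT replacement
        roots.append(b ** 0.5 * (sum(sub) / b - theta_n))
    roots.sort()
    hi_q = roots[int((1 - alpha / 2) * reps)]  # upper quantile of the roots
    lo_q = roots[int((alpha / 2) * reps)]      # lower quantile of the roots
    return theta_n - hi_q / n ** 0.5, theta_n - lo_q / n ** 0.5

random.seed(11)
xs = [random.gauss(1.0, 2.0) for _ in range(400)]
lo, hi = subsampling_ci(xs, b=40)
```

Here $b/n = 0.1$ plays the role of $m/n\to 0$; note that every replicate recomputes the estimator on a subsample, which is the computational burden the block jackknife below avoids.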

Because of the requirement that $m\to\infty$ as $n\to\infty$, the subsampling bootstrap potentially involves many calculations of the estimator. Fortunately, asymptotic linearity can be used to formulate a computationally simpler alternative, the "block jackknife," as described in Ma and Kosorok (2005a). Let $\check\theta_n$ be any asymptotically linear estimator of a parameter $\theta_0\in\mathbb{R}^k$, based on an i.i.d. sample $X_1,\ldots,X_n$, having square-integrable influence function $\phi$ for which $E[\phi\phi^T]$ is nonsingular.

Let $m$ be a fixed integer greater than $k$, and, for each $n\ge m$, define $q_{m,n}$ to be the largest integer satisfying $mq_{m,n}\le n$. Also define $N_{m,n}\equiv mq_{m,n}$. For the data $X_1,\ldots,X_n$, compute the estimator $\check\theta_n$ and randomly sample $N_{m,n}$ out of the $n$ observations without replacement, to obtain $X_1^*,\ldots,X_{N_{m,n}}^*$. Note that we are using the notation $\check\theta_n$ rather than $\hat\theta_n$ to remind ourselves that this estimator is a general, asymptotically linear estimator and not necessarily a maximum likelihood estimator. For $j=1,\ldots,m$, let $\check\theta_{n,j}^*$ be the estimate of $\theta$ based on the observations $X_1^*,\ldots,X_{N_{m,n}}^*$ after omitting $X_j^*, X_{m+j}^*, X_{2m+j}^*,\ldots,X_{(q_{m,n}-1)m+j}^*$. Compute $\bar\theta_n^* \equiv m^{-1}\sum_{j=1}^m \check\theta_{n,j}^*$ and

$$S_n^* \equiv (m-1)q_{m,n}\sum_{j=1}^m \bigl(\check\theta_{n,j}^*-\bar\theta_n^*\bigr)\bigl(\check\theta_{n,j}^*-\bar\theta_n^*\bigr)^T.$$

The following lemma provides a method of obtaining asymptotically valid confidence ellipses for $\theta_0$:

Lemma 19.8 Let $\check\theta_n$ be an estimator of $\theta_0\in\mathbb{R}^k$, based on an i.i.d. sample $X_1,\ldots,X_n$, which satisfies $n^{1/2}(\check\theta_n-\theta_0) = \sqrt{n}\mathbb{P}_n\phi + o_P(1)$, where $E[\phi\phi^T]$ is nonsingular. Then $n(\check\theta_n-\theta_0)^T[S_n^*]^{-1}(\check\theta_n-\theta_0)$ converges weakly to $k(m-1)F_{k,m-k}/(m-k)$, where $F_{r,s}$ has an F distribution with respective degrees of freedom $r$ and $s$.

Proof. Fix $m>k$. Without loss of generality, we can assume by the i.i.d. structure that $X_i^* = X_i$ for $i=1,\ldots,N_{m,n}$. Let $\epsilon_{0,n} \equiv \sqrt{n}(\check\theta_n-\theta_0) - n^{-1/2}\sum_{i=1}^n\phi_i$ and

$$\epsilon_{j,n} \equiv (N_{m,n}-q_{m,n})^{1/2}\bigl(\check\theta_{n,j}^*-\theta_0\bigr) - (N_{m,n}-q_{m,n})^{-1/2}\sum_{i\in K_{j,n}}\phi_i,$$

where $K_{j,n} \equiv \{1,\ldots,N_{m,n}\}\setminus\{j, m+j, 2m+j,\ldots,(q_{m,n}-1)m+j\}$, for $j=1,\ldots,m$; and note that $\max_{0\le j\le m}|\epsilon_{j,n}| = o_{P_0}(1)$ by asymptotic linearity. Now let $Z_{j,n}^* \equiv q_{m,n}^{-1/2}\sum_{i=1}^{q_{m,n}}\phi_{(i-1)m+j}$, for $j=1,\ldots,m$, and define $\bar Z_n^* \equiv m^{-1}\sum_{j=1}^m Z_{j,n}^*$. Thus

$$S_n^* = (m-1)^{-1}\sum_{j=1}^m\bigl(Z_{j,n}^*-\bar Z_n^*\bigr)\bigl(Z_{j,n}^*-\bar Z_n^*\bigr)^T + o_P(1).$$

Hence $S_n^*$ and $\sqrt{n}(\check\theta_n-\theta_0)$ are jointly asymptotically equivalent to $S_m \equiv (m-1)^{-1}\sum_{j=1}^m(Z_j-\bar Z_m)(Z_j-\bar Z_m)^T$ and $Z_0$, respectively, where $\bar Z_m \equiv m^{-1}\sum_{j=1}^m Z_j$ and $Z_0,\ldots,Z_m$ are i.i.d. mean zero Gaussian deviates with variance $E[\phi\phi^T]$. Now the results follow by standard normal theory (see Appendix V of Scheffé, 1959).

The fact that $m$ remains fixed as $n\to\infty$ in the above approach results in a potentially significant computational savings over subsampling, which requires $m$ to grow increasingly large as $n\to\infty$. A potential challenge is choosing $m$ for a given data set. The larger $m$ is, the larger the denominator degrees of freedom in $F_{k,m-k}$ and the tighter the confidence ellipsoid. On the other hand, $m$ cannot be so large that the required asymptotic linearity fails to hold simultaneously for all jackknife components. The need to choose $m$ makes this approach somewhat less automatic than the profile sampler. Nevertheless, the fact that this "block jackknife" procedure requires fewer assumptions than the profile sampler makes it a potentially useful alternative.
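The construction is easy to code for the simplest asymptotically linear estimator, the sample mean ($k=1$, influence function $\phi = X-\theta_0$). The sketch below is an illustration under our own simulation settings, not from the text: it builds $S_n^*$ as defined above and checks the Lemma 19.8 pivot, which for $k=1$ is asymptotically $F_{1,m-1} = t_{m-1}^2$, so a 95% interval uses the $t_{m-1}$ quantile (2.262 for $m=10$).

```python
import random

def block_jackknife_var(xs, estimator, m, rng):
    """Block-jackknife S*_n (k = 1 case), following the construction above."""
    n = len(xs)
    q = n // m                    # q_{m,n}: largest integer with m*q <= n
    star = rng.sample(xs, m * q)  # N_{m,n} observations, without replacement
    thetas = []
    for j in range(m):
        # omit X*_j, X*_{m+j}, ..., X*_{(q-1)m+j}
        omit = set(range(j, m * q, m))
        thetas.append(estimator([x for i, x in enumerate(star) if i not in omit]))
    tbar = sum(thetas) / m
    return (m - 1) * q * sum((t - tbar) ** 2 for t in thetas)

def mean(v):
    return sum(v) / len(v)

rng = random.Random(3)
m, n, t_quant = 10, 200, 2.262  # t_{9, 0.975}: the F_{1,9} = t_9^2 pivot
covered, reps = 0, 300
for _ in range(reps):
    xs = [rng.gauss(0.0, 1.0) for _ in range(n)]
    s_star = block_jackknife_var(xs, mean, m, rng)
    half = t_quant * (s_star / n) ** 0.5
    covered += (abs(mean(xs) - 0.0) <= half)  # does the CI cover theta_0 = 0?
coverage = covered / reps
```

Over repeated simulations the empirical coverage should be near the nominal 95%, and each replicate costs only $m$ recomputations of the estimator.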

Of course, it is always simplest to estimate the variance directly whenever this is possible. Unfortunately, this is rarely the case except for extremely simple semiparametric models such as the Cox model for right censored data. Inference on $\theta$ can also be based on the marginal posterior of $\theta$ from the full likelihood with respect to a joint prior on $(\theta,\eta)$. Shen (2002) has shown that this approach yields valid inferences for $\theta$ when $\theta$ is estimable at the parametric rate. The profile sampler, however, appears to be a better choice most of the time since it greatly simplifies the theory and computations through the avoidance of specifying a prior for $\eta$. In this light, the profile sampler may be useful as an approximately Bayesian alternative to a fully Bayesian procedure when $\eta$ is strictly a nuisance parameter.

19.4 Proofs

Proof of Theorem 19.6. Let

$$\delta_n^2 \equiv n^{-1} \vee \sup_{k\ge n} k^{-1}\log\Bigl\{\sup_{\theta_1,\theta_2\in\Theta}\bigl|g\bigl(\sqrt{k}[\theta_1-\theta_2]\bigr)\bigr|\Bigr\},$$

where $x\vee y$ denotes the maximum of $x$ and $y$, and note that by Condition (19.18) this defines a positive sequence $\delta_n\downarrow 0$ such that $n\delta_n\to\infty$. Hence also $\limsup_{n\to\infty} r_n \le 0$, where

$$r_n = \sup_{\theta_1,\theta_2\in\Theta}\frac{\log\bigl|\sqrt{n}\,g\bigl(\sqrt{n}(\theta_1-\theta_2)\bigr)\bigr|}{n\delta_n}.$$

Fix $\epsilon>0$, and let

$$\Delta_n^\epsilon \equiv \sup_{\theta\in\Theta:\|\theta-\hat\theta_n\|>\epsilon}\Delta_n(\theta).$$

Now note that $p_{\theta,n}(\theta)$ is proportional to $\exp\{n\Delta_n(\theta)\}q(\theta)$ and that

$$\int_{\theta\in\Theta:\|\theta-\hat\theta_n\|>\epsilon}\sqrt{n}\bigl|g\bigl(\sqrt{n}(\theta-\hat\theta_n)\bigr)\bigr|\exp\{n\Delta_n(\theta)\}q(\theta)\,d\theta \le 1\bigl\{\Delta_n^\epsilon<-\delta_n\bigr\}\int_\Theta q(\theta)\,d\theta\,\exp\bigl\{-n\delta_n[1-r_n]\bigr\} + o_{P_0}(1) \to 0,$$

in probability, by Lemma 19.9 below. This now implies that there exists a positive decreasing sequence $\epsilon_n\downarrow 0$ such that $\sqrt{n}\epsilon_n\to\infty$ and

$$\int_{\theta\in\Theta:\|\theta-\hat\theta_n\|>\epsilon_n}\sqrt{n}\bigl|g\bigl(\sqrt{n}(\theta-\hat\theta_n)\bigr)\bigr|\exp\{n\Delta_n(\theta)\}q(\theta)\,d\theta \to 0,$$

in probability.

Now,

$$\int_{\theta\in\Theta:\|\theta-\hat\theta_n\|\le\epsilon_n}\sqrt{n}\bigl|g\bigl(\sqrt{n}(\theta-\hat\theta_n)\bigr)\bigr|\exp\Bigl\{-\frac{n}{2}(\theta-\hat\theta_n)'I_0(\theta-\hat\theta_n)\Bigr\}\times\Bigl|\exp\Bigl\{n\Delta_n(\theta)+\frac{n}{2}(\theta-\hat\theta_n)'I_0(\theta-\hat\theta_n)\Bigr\}-1\Bigr|\,q(\theta)\,d\theta$$
$$\le \int_{\theta\in\Theta:\|\theta-\hat\theta_n\|\le\epsilon_n}\sqrt{n}\bigl|g\bigl(\sqrt{n}(\theta-\hat\theta_n)\bigr)\bigr|\exp\Bigl\{-\frac{n}{4}(\theta-\hat\theta_n)'I_0(\theta-\hat\theta_n)\Bigr\}q(\theta)\,d\theta$$
$$\quad\times\sup_{\theta\in\Theta:\|\theta-\hat\theta_n\|\le\epsilon_n}\Bigl[\exp\Bigl\{-\frac{n}{4}(\theta-\hat\theta_n)'I_0(\theta-\hat\theta_n)\Bigr\}\Bigl|\exp\Bigl\{n\Delta_n(\theta)+\frac{n}{2}(\theta-\hat\theta_n)'I_0(\theta-\hat\theta_n)\Bigr\}-1\Bigr|\Bigr]$$
$$= A_n\times\sup_{\theta\in\Theta:\|\theta-\hat\theta_n\|\le\epsilon_n}B_n(\theta),$$

where $A_n = O_{P_0}(1)$ by Condition (19.18). However, for any random sequence $\tilde\theta_n\in\Theta$ such that $\|\tilde\theta_n-\hat\theta_n\|\le\epsilon_n$ for all $n\ge 1$, and any fixed $M<\infty$, we have by Condition (19.10) that

$$B_n(\tilde\theta_n) \le \exp\Bigl\{-\frac{n}{4}(\tilde\theta_n-\hat\theta_n)'I_0(\tilde\theta_n-\hat\theta_n)\Bigr\}\times\Bigl|\exp\Bigl\{o_{P_0}\bigl(1+\sqrt{n}\|\tilde\theta_n-\theta_0\|\bigr)^2\Bigr\}-1\Bigr|, \tag{19.22}$$

where

$$o_{P_0}\bigl(1+\sqrt{n}\|\tilde\theta_n-\theta_0\|\bigr)^2 \le o_{P_0}(1)\Bigl[\bigl(1+\sqrt{n}\|\tilde\theta_n-\hat\theta_n\|\bigr)^2 + n\|\hat\theta_n-\theta_0\|^2\Bigr] = o_{P_0}(1)\bigl(1+\sqrt{n}\|\tilde\theta_n-\hat\theta_n\|\bigr)^2,$$

since $\sqrt{n}\|\hat\theta_n-\theta_0\| = O_{P_0}(1)$. Combining this with (19.22) and the fact that $|e^x-1|\le e^{|x|}$ for all $x\in\mathbb{R}$, we obtain that

$$B_n(\tilde\theta_n) \le \exp\Bigl\{-\frac{n}{4}(\tilde\theta_n-\hat\theta_n)'I_0(\tilde\theta_n-\hat\theta_n)\Bigr\}\bigl|\exp\{o_{P_0}(1)\}-1\bigr|\,1\bigl\{\sqrt{n}\|\tilde\theta_n-\hat\theta_n\|\le M\bigr\}$$
$$\quad+\exp\Bigl\{-\frac{n}{4}(\tilde\theta_n-\hat\theta_n)'I_0(\tilde\theta_n-\hat\theta_n)+o_{P_0}(1)\bigl(1+\sqrt{n}\|\tilde\theta_n-\hat\theta_n\|\bigr)^2\Bigr\}\,1\bigl\{\sqrt{n}\|\tilde\theta_n-\hat\theta_n\|>M\bigr\}$$
$$\le o_{P_0}(1)+\exp\Bigl\{-\frac{M^2\lambda_1}{4}\bigl(1-o_{P_0}(1)\bigr)\Bigr\}\to\exp\Bigl\{-\frac{M^2\lambda_1}{4}\Bigr\},$$

in probability, where $\lambda_1$ is the smallest eigenvalue of $I_0$ and is positive by positive definiteness. Hence $B_n(\tilde\theta_n) = o_{P_0}(1)$ since $M$ was arbitrary. Thus $\sup_{\theta\in\Theta:\|\theta-\hat\theta_n\|\le\epsilon_n}B_n(\theta) = o_{P_0}(1)$.

By reapplication of Condition (19.9),

$$\int_{\theta\in\Theta:\|\theta-\hat\theta_n\|\le\epsilon_n}\sqrt{n}\,g\bigl(\sqrt{n}(\theta-\hat\theta_n)\bigr)\exp\Bigl\{-\frac{n}{2}(\theta-\hat\theta_n)'I_0(\theta-\hat\theta_n)\Bigr\}q(\theta)\,d\theta$$
$$= \int_{u\in\sqrt{n}(\Theta-\hat\theta_n):\|u\|\le\sqrt{n}\epsilon_n}g(u)\exp\Bigl\{-\frac{u'I_0u}{2}\Bigr\}q\Bigl(\hat\theta_n+\frac{u}{\sqrt{n}}\Bigr)\,du$$
$$\to \int_{\mathbb{R}^k}g(u)\exp\Bigl\{-\frac{u'I_0u}{2}\Bigr\}\,du\times q(\theta_0),$$

in probability. (Here, we have used the notation $\sqrt{n}(\Theta-\hat\theta_n)$ to denote the set $\{\sqrt{n}(\theta-\hat\theta_n):\theta\in\Theta\}$.) The same conclusion holds true for $g=1$, and thus

$$\int_\Theta\sqrt{n}\exp\{n\Delta_n(\theta)\}q(\theta)\,d\theta = \int_{\mathbb{R}^k}\exp\Bigl\{-\frac{u'I_0u}{2}\Bigr\}\,du\times q(\theta_0)+o_{P_0}(1).$$

Hence the desired result follows.

Lemma 19.9 Assume Condition (19.17). For every $\epsilon>0$ and every positive decreasing sequence $\delta_n\downarrow 0$,

$$\lim_{n\to\infty}P\bigl\{\Delta_n^\epsilon\ge-\delta_n\bigr\} = 0.$$

Proof. Suppose that the lemma is not true. Then there exists an $\epsilon>0$, a positive decreasing sequence $\delta_n^*\downarrow 0$, a random sequence $\{\tilde\theta_n\in\Theta:\|\tilde\theta_n-\hat\theta_n\|>\epsilon\}$, and a subsequence $\{n_k\}$ such that

$$\lim_{k\to\infty}P\bigl\{\Delta_{n_k}(\tilde\theta_{n_k})\ge-\delta^*_{n_k}\bigr\} = \rho>0.$$

Define $H_n \equiv \bigl\{\Delta_n(\tilde\theta_n)\ge-\delta_n^*\bigr\}$, and let $\theta_n^* = 1\{H_n\}\tilde\theta_n + 1\{H_n^c\}\hat\theta_n$, where $1\{A^c\}$ is the indicator of the complement of $A$. Now note that $\Delta_n(\theta_n^*) = \Delta_n(\hat\theta_n) = 0$ on $H_n^c$ and that $\Delta_n(\theta_n^*) = \Delta_n(\tilde\theta_n)\ge-\delta_n^*$ on $H_n$. Hence $\Delta_n(\theta_n^*)\to 0$ in probability, which implies by Condition (19.17) that $\theta_n^*\to\theta_0$ in probability. Now, on the set $H_{n_k}$, $\theta^*_{n_k} = \tilde\theta_{n_k}$, and therefore, by construction, $\theta^*_{n_k}$ is separated from $\hat\theta_{n_k}$ by at least $\epsilon$. Since $\hat\theta_n$ converges to $\theta_0$, we have that eventually, on the set $H_{n_k}$, $\theta^*_{n_k}(=\tilde\theta_{n_k})$ is separated from $\theta_0$ by at least $\epsilon/2$. However, the probability of the set $H_{n_k}$ converges to $\rho>0$, which implies that, with probability larger than $\rho/2$, $\theta^*_{n_k}$ is eventually separated from $\theta_0$ by at least $\epsilon/2$. This contradicts the fact that $\theta^*_{n_k}-\theta_0\to 0$ in probability, and the desired result

follows.

Proof of Lemma 19.7. First note that, for all $a,b\in[1,\infty)$,

$$\sup_{u\in[0,1]}|u^a-u^b| \le c+\log(1/c)|a-b|, \tag{19.23}$$

for any $c>0$. Minimizing the right-hand side of (19.23) over $c$, we obtain that the left-hand side of (19.23) is bounded above by

$$|a-b|\left[1+\log\left(\frac{1}{|a-b|\wedge 1}\right)\right],$$

which, in turn, implies that $\sup_{u\in[0,1]}|u^a-u^b|\le 2|a-b|^{1/2}$ for all $a,b\ge 1$. Here, $x\wedge y$ denotes the minimum of $x$ and $y$. Note also that for any $a\ge 1$ and any $u,v\in[0,1]$, $|u^a-v^a|\le a|u-v|$. Let $M\in(1,\infty)$ satisfy $1/M\le e^{\beta'Z}\le M$ for all $\beta\in B$ and all possible $Z$. Then, for any $F_1,F_2\in\mathcal{M}$ and any $\beta_1,\beta_2\in B$,

$$\Bigl|\bar F_1(Y)^{M\exp(\beta_1'Z)}-\bar F_2(Y)^{M\exp(\beta_2'Z)}\Bigr| \le M^2\bigl|\bar F_1(Y)-\bar F_2(Y)\bigr| + k|\beta_1-\beta_2|^{1/2}, \tag{19.24}$$

for some fixed $k\in(0,\infty)$, since $Z$ lies in a compact set. Since the classes $\{\bar F: F\in\mathcal{M}\}$ and $\{\bar F^{1/M}: F\in\mathcal{M}\}$ are equivalent, and since the bracketing entropy of $\mathcal{M}$, for brackets of size $\epsilon$, is $O(\epsilon^{-1})$ by Theorem 9.24, (19.24) implies that the bracketing entropy of $\mathcal{F}$ is $O(\epsilon^{-1})$, and the desired result follows.

19.5 Exercises

19.5.1. Verify explicitly that Gn = oP (1), where Gn is defined towardsthe end of the proof of Theorem 3.1.

19.5.2. Verify (19.4).

19.5.3. Using the techniques given in the proof of Theorem 15.9, show that $\sigma_{22}^\theta: \mathcal{H}_\infty^2 \to \mathcal{H}_\infty^2$ is continuously invertible and onto.

19.5.4. In the context of Section 19.2.2, show that the choice $h(Y) = h_0(Y)$, where $h_0$ is as defined in (19.7), is bounded and minimizes $P_{\beta,\Lambda}\|\dot\ell_{\beta,\Lambda}-A_{\beta,\Lambda}h\|^2$ over all square-integrable choices of $h(Y)$.

19.5.5. Adapting the arguments leading up to (19.9), prove Corollary 19.1.

19.5.6. Use (19.8) and Corollary 19.1 to prove Corollary 19.2.


19.5.7. Verify that $\mathcal{F}_1$ is $P_0$-Donsker and that $\mathcal{F}_2$ is $P_0$-Glivenko-Cantelli for the Cox model for right censored data, where $\mathcal{F}_1$ and $\mathcal{F}_2$ are as defined in Theorem 19.5. See the discussion of this model at the end of Section 19.3.1.

19.5.8. In the context of the paragraphs on the Cox model for right censored data at the end of Section 19.3.2, show that the following are true:

(a) There exists a constant $c_0>0$ not depending on $\beta$ such that $V(\beta)\ge c_0\,\mathrm{var}\,Z$. Hint: It may be helpful to recall that $Y(t)\ge Y(\tau)$ for all $0\le t\le\tau$.

(b) $U_0(\beta_0)=0$.

19.6 Notes

The proof of Theorem 3.1 in Section 19.1 follows closely the proof of The-orem 25.54 in van der Vaart (1998). Corollaries 19.3 and 19.4 are modifi-cations of Corollaries 2 and 3 of Murphy and van der Vaart (2000), whileTheorem 19.5 is Murphy and van der Vaart’s Theorem 1. Theorem 19.6is a modification of Theorem 1 of Lee, Kosorok and Fine (2005), whileSection 19.3.2 on the Cox model for current status data comes from Sec-tion 4.1 and Lemma 2 of Lee, Kosorok and Fine. Lemmas 19.7 and 19.9are Lemmas A.2 and A.1, respectively, of Lee, Kosorok and Fine, whileLemma 19.8 is Lemma 5 of Ma and Kosorok (2005a).


20 Efficient Inference for Infinite-Dimensional Parameters

We now consider the special case that both $\theta$ and $\eta$ are $\sqrt{n}$ consistent in the semiparametric model $\{P_{\theta,\eta}:\theta\in\Theta,\eta\in\mathcal{H}\}$, where $\Theta\subset\mathbb{R}^k$. Often in this setting, $\eta$ may be of some interest to the data analyst. Hence, in this chapter, $\eta$ will not be considered a nuisance parameter. In the first section, we expand on the general ideas presented in the last half of Section 3.3. This includes a proof of Corollary 3.2 on Page 46. The second section presents several methods of inference, including a weighted bootstrap and a computationally efficient piggyback bootstrap, along with a brief review of a few other methods.

20.1 Semiparametric Maximum Likelihood Estimation

Assume that the score operator for $\eta$ has the form $h\mapsto B_{\theta,\eta}h$, where $h\in\mathcal{H}$ for some set of indices $\mathcal{H}$. The joint maximum likelihood estimator $(\hat\theta_n,\hat\eta_n)$ will then usually satisfy a Z-estimator equation $\Psi_n(\hat\theta_n,\hat\eta_n)=0$, where $\Psi_n = (\Psi_{n1},\Psi_{n2})$, $\Psi_{n1}(\theta,\eta) = \mathbb{P}_n\dot\ell_{\theta,\eta}$ and $\Psi_{n2}(\theta,\eta) = \mathbb{P}_nB_{\theta,\eta}h - P_{\theta,\eta}B_{\theta,\eta}h$, for all $h\in\mathcal{H}$. The expectation of $\Psi_n$ under the true parameter value is $\Psi = (\Psi_1,\Psi_2)$ as defined in the paragraph preceding Corollary 3.2. Here is the proof of this corollary:

Proof of Corollary 3.2 (Page 46). The conditions given in the first few lines of the corollary combined with the consistency of $(\hat\theta_n,\hat\eta_n)$ enable us to use Lemma 13.3 to conclude that

$$\sqrt{n}(\Psi_n-\Psi)(\hat\theta_n,\hat\eta_n) - \sqrt{n}(\Psi_n-\Psi)(\theta_0,\eta_0) = o_P(1),$$

where the convergence is uniform. Since the Donsker assumption on the score equation ensures $\sqrt{n}\Psi_n(\theta_0,\eta_0)\rightsquigarrow Z$, for some tight, mean zero Gaussian process $Z$, we have satisfied all of the conditions of Theorem 2.11, and thus $\sqrt{n}(\hat\theta_n-\theta_0,\hat\eta_n-\eta_0)\rightsquigarrow -\Psi_0^{-1}Z$.

The remaining challenge is to establish efficiency. Recall that the differentiation used to obtain the score and information operators involves a smooth function $t\mapsto\eta_t(\theta,\eta)$ for which $\eta_0(\theta,\eta)=\eta$, $t$ is a scalar, and

$$B_{\theta,\eta}h(x) - P_{\theta,\eta}B_{\theta,\eta}h = \partial\ell(\theta,\eta_t(\theta,\eta))(x)/(\partial t)\big|_{t=0},$$

and where $\ell(\theta,\eta)(x)$ is the log-likelihood for a single observation (we are changing the notation slightly from Section 3.3). Note that this $\eta_t$ is not necessarily an approximately least-favorable submodel. The purpose of this $\eta_t$ is to incorporate the effect of a perturbation of $\eta$ in the direction $h\in\mathcal{H}$. The resulting one-dimensional submodel is $t\mapsto\psi_t\equiv(\theta+ta,\eta_t(\theta,\eta))$, with derivative

$$\frac{\partial}{\partial t}\psi_t\Big|_{t=0} \equiv \dot\psi(a,h),$$

where $c\equiv(a,h)\in\mathbb{R}^k\times\mathcal{H}$ and $\dot\psi:\mathbb{R}^k\times\mathcal{H}\to\mathbb{R}^k\times\mathrm{lin}\,\mathcal{H}\equiv C$ is a linear operator that may depend on the composite (joint) parameter $\psi\equiv(\theta,\eta)$. To be explicit about which tangent in $C$ is being applied to the one-dimensional submodel, we will use the notation $\psi_{t,c}$, i.e., $\partial/(\partial t)|_{t=0}\,\psi_{t,c} = \dot\psi(c)$.

Define the abbreviated notation $U_\psi(c) \equiv a'\dot\ell_\psi + B_\psi h - P_\psi B_\psi h$, for $c=(a,h)\in C$. Our construction now gives us that, for any $c_1,c_2\in C$,

$$\dot\Psi(\dot\psi_0(c_2))(c_1) = \frac{\partial}{\partial t}P_{\psi_0}\bigl[U_{\psi_{t,c_2}}(c_1)\bigr]\Big|_{t=0,\,\psi=\psi_0} = -P_{\psi_0}\bigl[U_{\psi_0}(c_1)U_{\psi_0}(c_2)\bigr], \tag{20.1}$$

where $\psi_0\equiv(\theta_0,\eta_0)$. The first equality follows from the definition of $\dot\Psi$ as the derivative of the expected score under the true model. The second equality follows from viewing the likelihood as a two-dimensional submodel with variables $(s,t)$, where $s$ perturbs the model in the direction $c_1$ and $t$ perturbs the model in the direction $c_2$. The two-by-two information matrix in this context is the negative of the derivative of the expected score vector and is also equal to the expected outer product of the score vector.

We know from the conclusion of the first paragraph of the proof that the influence function for $\hat\psi_n\equiv(\hat\theta_n,\hat\eta_n)$ is $\tilde\psi\equiv-\dot\Psi^{-1}[U_{\psi_0}(\cdot)]$. Thus, for any $c\in C$,

$$P_{\psi_0}\bigl[\tilde\psi U_{\psi_0}(c)\bigr] = P_{\psi_0}\bigl[\bigl(-\dot\Psi^{-1}[U_{\psi_0}(\cdot)]\bigr)U_{\psi_0}(c)\bigr] = -\dot\Psi^{-1}P_{\psi_0}\bigl[U_{\psi_0}(\cdot)U_{\psi_0}(c)\bigr] = -\dot\Psi^{-1}\bigl[-\dot\Psi(\dot\psi_0(c))(\cdot)\bigr] = \dot\psi_0(c).$$

The first equality follows from the form of the influence function, the second follows from the fact that $\dot\Psi^{-1}$ is onto and linear, the third follows from (20.1), and the fourth from the properties of an inverse function. This means, by the definition given in Section 18.1, that $\tilde\psi$ is the efficient influence function.

Since $\sqrt{n}(\hat\psi_n-\psi_0)$ is asymptotically tight and Gaussian with covariance equal to the covariance of the efficient influence function, we have by Theorem 18.3 that $\hat\psi_n$ is efficient.

A key condition for Corollary 3.2 is that $\dot\Psi$ be continuously invertible and onto. This can be quite non-trivial to establish. We have demonstrated for the proportional odds model considered in Section 15.3 how this can be done by using Lemma 6.17.

For most semiparametric models where the joint parameter is regular, we can assume a little more structure than in the previous paragraphs. For many jointly regular models, we have that $\eta = A$, where $t\mapsto A(t)$ is restricted to a subset $H\subset D[0,\tau]$ of functions bounded in total variation, where $\tau<\infty$. The composite parameter is thus $\psi=(\theta,A)$. We endow the parameter space with the uniform norm since this is usually the most useful in applications. Examples include many right-censored univariate regression models, including the proportional odds model of Section 15.3, certain multivariate survival models, and certain biased sampling models. We will give a few examples later on in this section.

The index set $H$ we assume consists of all finite variation functions in $D[0,\tau]$, and we assign to $C=\mathbb{R}^k\times H$ the norm $\|c\|_C\equiv\|a\|+\|h\|_v$, where $c=(a,h)$, $\|\cdot\|$ is the Euclidean norm, and $\|\cdot\|_v$ is the total variation norm on $[0,\tau]$. We let $C_p\equiv\{c\in C:\|c\|_C\le p\}$, where the inequality is strict when $p=\infty$. This is the same structure utilized in Section 15.3.4 for the proportional odds model, aside from some minor changes in notation. The full composite parameter $\psi=(\theta,A)$ can be viewed as an element of $\ell^\infty(C_p)$ if we define

$$\psi(c) \equiv a'\theta + \int_0^\tau h(s)\,dA(s), \quad c\in C_p,\ \psi\in\Omega\equiv\Theta\times H.$$

As described in Section 15.3.4, $\Omega$ thus becomes a subset of $\ell^\infty(C_p)$, with norm $\|\psi\|_{(p)}\equiv\sup_{c\in C_p}|\psi(c)|$. Moreover, if $\|\cdot\|_\infty$ is the uniform norm on $\Omega$, then, for any $1\le p<\infty$, $\|\psi\|_\infty\le\|\psi\|_{(p)}\le 4p\|\psi\|_\infty$. Thus the uniform and $\|\cdot\|_{(p)}$ norms are equivalent.


For a direction $h\in H$, we will perturb $A$ via the one-dimensional submodel $t\mapsto A_t(\cdot) = \int_0^{(\cdot)}(1+th(s))\,dA(s)$. This means, in the notation at the beginning of this section, that $\dot\psi(c) = \bigl(a, \int_0^{(\cdot)}h(s)\,dA(s)\bigr)$. We now modify the score notation slightly. For any $c\in C$, let

$$U[\psi](c) = \frac{\partial}{\partial t}\ell\Bigl(\theta+ta,\ A(\cdot)+t\int_0^{(\cdot)}h(s)\,dA(s)\Bigr)\Big|_{t=0}$$
$$= \frac{\partial}{\partial t}\ell\bigl(\theta+ta,\ A(\cdot)\bigr)\Big|_{t=0} + \frac{\partial}{\partial t}\ell\Bigl(\theta,\ A(\cdot)+t\int_0^{(\cdot)}h(s)\,dA(s)\Bigr)\Big|_{t=0}$$
$$\equiv U_1[\psi](a) + U_2[\psi](h).$$

Note that $B_\psi h = U_2[\psi](h)$, $\Psi_n(\psi)(c) = \mathbb{P}_nU[\psi](c)$, and $\Psi(\psi)(c) = P_0U[\psi](c)$, where $P_0 = P_{\psi_0}$. In this context, $P_\psi U_2[\psi](h) = 0$ for all $h\in H$ by definition of the maximum and under identifiability of the model. Thus we will not need the $P_\psi B_\psi h$ term used earlier in this section. It is important to note that the map $\psi\mapsto U[\psi](\cdot)$ actually has domain $\mathrm{lin}\,\Omega$ and range contained in $\ell^\infty(C)$.

We now consider properties of the second derivative of the log-likelihood.Let a ∈ Rk and h ∈ H. For ease of exposition, we will use the somewhat re-dundant notation c = (a, h) ≡ (c1, c2). We assume the following derivativestructure exists and is valid for j = 1, 2 and all c ∈ C:

∂sUj [θ + sa, A + sh](cj)

s=0

=∂

∂sUj[θ + sa, A](cj)

s=0

+∂

∂sUj[θ, A + sh](cj)

s=0

,

≡ a′σ1j [ψ](cj) +

∫ τ

0

σ2j [ψ](cj)(u)dh(u),

where $\sigma_{1j}[\psi](c_j)$ is a random $k$-vector and $u \mapsto \sigma_{2j}[\psi](c_j)(u)$ is a random function contained in $\mathcal{H}$. Denote $\sigma_{jk}[\psi] = P_0\sigma_{jk}[\psi]$ and $\sigma_{jk} = \sigma_{jk}[\psi_0]$, for $j, k = 1, 2$, where $P_0 = P_{\psi_0}$ and where, with minor abuse of notation, the same symbol $\sigma_{jk}$ is used for the random quantity and its expectation under $P_0$.

Let $\hat\psi_n = (\hat\theta_n, \hat A_n)$ be the maximizer of the log-likelihood. Then $\Psi_n(\hat\psi_n)(c) = 0$ for all $c \in \mathcal{C}$. Moreover, since $\operatorname{lin}\Omega$ is contained in $\mathcal{C}$, we have that the map $c \in \operatorname{lin}\Omega \mapsto -\Psi(c)(\cdot) \in \ell^\infty(\mathcal{C})$ has the form $-\Psi(c)(\cdot) = c(\sigma(\cdot))$, where $\sigma \equiv \sigma[\psi_0]$ and

\[
\sigma[\psi](c) \equiv \begin{pmatrix} \sigma_{11}[\psi] & \sigma_{12}[\psi] \\ \sigma_{21}[\psi] & \sigma_{22}[\psi] \end{pmatrix}\begin{pmatrix} c_1 \\ c_2 \end{pmatrix}
\]

and, for any $c, \tilde c \in \mathcal{C}$, $\tilde c(c) = \tilde c_1'c_1 + \int_0^\tau c_2(u)\,d\tilde c_2(u)$. Provided $\sigma : \mathcal{C} \to \mathcal{C}$ is continuously invertible and onto, we have that $\Psi : \operatorname{lin}\Omega \to \mathcal{C}$ is also continuously invertible and onto with inverse satisfying $-\Psi^{-1}(c)(\cdot) = c(\sigma^{-1}(\cdot))$.


20.1 Semiparametric Maximum Likelihood Estimation 383

In this set-up, we will need the following conditions for some $p > 0$ in order to apply Corollary 3.2:
\begin{align}
&\{U[\psi](c) : \|\psi - \psi_0\| \le \epsilon,\ c \in \mathcal{C}_p\}\ \text{is Donsker for some}\ \epsilon > 0,\tag{20.2}\\
&\sup_{c \in \mathcal{C}_p} P_0\left|U[\psi](c) - U[\psi_0](c)\right|^2 \to 0,\ \text{as}\ \psi \to \psi_0,\ \text{and}\tag{20.3}\\
&\sup_{c \in \mathcal{C}_p}\left\|\sigma[\psi](c) - \sigma[\psi_0](c)\right\|_{(p)} \to 0,\ \text{as}\ \|\psi - \psi_0\|_{(p)} \to 0.\tag{20.4}
\end{align}

Note by Exercise 20.3.1 that (20.4) implies $\Psi$ is Fréchet-differentiable in $\ell^\infty(\mathcal{C}_p)$. It is also not hard to verify that if Conditions (20.2)–(20.4) hold for some $p > 0$, then they hold for all $0 < p < \infty$ (see Exercise 20.3.2). This yields the following corollary, whose simple proof is saved as an exercise (see Exercise 20.3.3):

Corollary 20.1 Assume Conditions (20.2)–(20.4) hold for some $p > 0$, that $\sigma : \mathcal{C} \to \mathcal{C}$ is continuously invertible and onto, and that $\hat\psi_n$ is uniformly consistent for $\psi_0$ with
\[
\sup_{c \in \mathcal{C}_1}\left|\Psi_n(\hat\psi_n)(c)\right| = o_{P_0}(n^{-1/2}).
\]
Then $\hat\psi_n$ is efficient with
\[
\sqrt{n}(\hat\psi_n - \psi_0)(\cdot) \rightsquigarrow Z(\sigma^{-1}(\cdot))
\]
in $\ell^\infty(\mathcal{C}_1)$, where $Z$ is the tight limiting distribution of $\sqrt{n}\,\mathbb{P}_n U[\psi_0](\cdot)$.

Note that we actually need $Z$ to be a tight element in $\ell^\infty(\sigma^{-1}(\mathcal{C}_1))$, but the linearity of $U[\psi](\cdot)$ ensures that if $\sqrt{n}\,\mathbb{P}_n U[\psi_0](\cdot)$ converges to $Z$ in $\ell^\infty(\mathcal{C}_1)$, then it will also converge weakly in $\ell^\infty(\mathcal{C}_p)$ for any $p < \infty$. Since $\sigma$ is continuously invertible by assumption, there exists a $p_0 < \infty$ such that
\[
\sup_{c \in \mathcal{C}_1}\left\|\sigma^{-1}(c)\right\|_{\mathcal{C}} \le p_0,
\]
and thus $\sigma^{-1}(\mathcal{C}_1) \subset \mathcal{C}_{p_0}$.

This kind of analysis was used extensively in the proportional odds model example studied in Section 15.3. In fact, the joint maximum likelihood estimator $(\hat\theta_n, \hat A_n)$ studied in Section 15.3 can easily be shown to satisfy the conditions of Corollary 20.1 (see Exercise 20.3.4). Thus $(\hat\theta_n, \hat A_n)$ is asymptotically efficient for estimating $(\theta_0, A_0)$ in the proportional odds example. Since the $\|\cdot\|_{(1)}$ norm dominates the uniform norm, the weak convergence and efficiency results we obtain with respect to the $\|\cdot\|_{(1)}$ norm carry over to the uniform norm.

The next section, Section 20.2, will study methods of inference for maximum likelihood settings that satisfy the conditions of Corollary 20.1. Typically, these maximum likelihood settings involve an empirical likelihood structure for the infinite-dimensional component, as discussed at the beginning of Section 3.3. We will consider three examples in detail: the Cox regression model for right-censored data, the proportional odds model for right-censored data, and an interesting biased sampling model. Two additional models were also studied using the methods of this chapter in Dixon, Kosorok and Lee (2005), but we will not consider them further here: the odds-rate regression model for right-censored survival data and the correlated gamma frailty regression model for multivariate right-censored survival data. Before moving on to Section 20.2, we briefly discuss the application of Corollary 20.1 to the Cox model and the biased sampling examples. The application of Corollary 20.1 to the proportional odds example has already been discussed.

The Cox model for right-censored data.

This model has been explored extensively in previous sections, and both weak convergence and efficiency have already been established. Nevertheless, it is useful to study this model again from the perspective of this section. Accordingly, we will let $\theta$ be the regression effect and $A$ the baseline hazard, with the observed data $X = (U, \delta, Z)$, where $U$ is the minimum of the failure time $T$ and censoring time $C$, and where $\delta$ and $Z$ have the usual definitions. We make the usual assumptions for this model as in Section 4.2.2, including requiring the baseline hazard to be continuous, except that we will use $(\theta, A)$ to denote the model parameters $(\beta, \Lambda)$.

It is not hard to verify that $U_1[\psi](a) = \int_0^\tau Z'a\,dM_\psi(s)$ and $U_2[\psi](h) = \int_0^\tau h(s)\,dM_\psi(s)$, where $M_\psi(t) \equiv N(t) - \int_0^t Y(s)e^{\theta'Z}\,dA(s)$ and $N$ and $Y$ are the usual counting and at-risk processes. It is also easy to show that the components of $\sigma$ are defined by

\begin{align*}
\sigma_{11}a &= \int_0^\tau P_0\left[ZZ'Y(s)e^{\theta_0'Z}\right]dA_0(s)\,a,\\
\sigma_{12}h &= \int_0^\tau P_0\left[ZY(s)e^{\theta_0'Z}\right]h(s)\,dA_0(s),\\
\sigma_{21}a &= P_0\left[Z'Y(\cdot)e^{\theta_0'Z}\right]a,\quad\text{and}\\
\sigma_{22}h &= P_0\left[Y(\cdot)e^{\theta_0'Z}\right]h(\cdot).
\end{align*}

The maximum likelihood estimator is $\hat\psi_n = (\hat\theta_n, \hat A_n)$, where $\hat\theta_n$ is the maximizer of the well-known partial likelihood and $\hat A_n$ is the Breslow estimator. We will now establish that the conditions of Corollary 20.1 hold for this example. Conditions (20.2)–(20.4) are easy to verify by recycling arguments we have used previously. Likewise, most of the remaining conditions are easy to verify.

The only somewhat difficult condition to verify is that $\sigma$ is continuously invertible and onto. We will use Lemma 6.17 to do this. First, it is easy to verify that $\sigma = \kappa_1 + \kappa_2$, where $\kappa_1 c \equiv (a, \rho_0(\cdot)h(\cdot))$, $\kappa_2 \equiv \sigma - \kappa_1$, $\rho_0(t) \equiv P_0[Y(t)e^{\theta_0'Z}]$, and where $\kappa_2$ is a compact operator and $\kappa_1$ is trivially continuously invertible and onto. Lemma 6.17 will now imply that $\sigma$ is continuously invertible and onto, provided we can show that $\sigma$ is one-to-one.

Accordingly, let $c \in \mathcal{C}$ be such that $\sigma(c) = 0$, where, as usual, $c = (a, h)$. Then we also have that $\tilde c(\sigma(c)) = 0$, where $\tilde c = (a, \int_0^{(\cdot)} h(s)\,dA_0(s))$. This implies, after a little algebra, that
\begin{align*}
0 &= \int_0^\tau P_0\left([a'Z + h(s)]^2 Y(s)e^{\theta_0'Z}\right)dA_0(s)\\
&= \int_0^\tau P_0\left([a'Z + h(s)]^2\,dN(s)\right),
\end{align*}
which implies that

\[
[a'Z + h(T)]^2\delta = 0,\quad P_0\text{-almost surely}.\tag{20.5}
\]

This now forces $a = 0$ and $h = 0$ (see Exercise 20.3.5). Thus $\sigma$ is one-to-one. Hence all of the conditions of Corollary 20.1 hold, and we conclude that $\hat\psi_n$ is efficient.

A biased sampling model.

We will now consider a special case of a class of biased sampling models which were studied by Gilbert (2000). The data consist of $n$ i.i.d. realizations of $X = (\delta, Y)$. Here, $\delta \in \{0, 1\}$ is a random stratum identifier, taking on the value $j$ with selection probability $\lambda_j > 0$, $j = 0, 1$, with $\lambda_0 + \lambda_1 = 1$. Given $\delta = j$, $Y \in [0, \tau]$ has distribution $F_j$ defined on a sigma-field $\mathcal{B}$ of subsets of $[0, \tau]$ by
\[
F_j(B, \theta, A) \equiv W_j^{-1}(\theta, A)\int_B w_j(u, \theta)\,dA(u)
\]
for $B \in \mathcal{B}$. The $w_j$, $j = 0, 1$, are nonnegative (measurable) stratum weight functions assumed to be known up to the finite-dimensional parameter $\theta \in \Theta \subset \mathbb{R}$. We will assume hereafter that $w_0(t, \theta) = 1$ and that $w_1(t, \theta) = e^{\theta t}$. The quantity $W_j(\theta, A) \equiv \int_0^\tau w_j(u, \theta)\,dA(u)$ is assumed to be finite for all $\theta \in \Theta$. The probability measure $A$ is the unknown infinite-dimensional parameter of interest, and $\psi = (\theta, A)$ is the joint parameter. We assume that $A_0$ is continuous with support on all of $[0, \tau]$. The goal is to estimate $\psi$ based on information from samples from the $F_j$ distributions, $j = 0, 1$. Thus the log-likelihood for a single observation is

\begin{align*}
\ell(\psi)(X) &= \log w_\delta(Y, \theta) + \log \Delta A(Y) - \log W_\delta(\theta, A)\\
&= \delta\theta Y + \log \Delta A(Y) - \log\int_0^\tau e^{\delta\theta s}\,dA(s),
\end{align*}
where $\Delta A(Y)$ is the probability mass of $A$ at $Y$. Thus the score functions are



\begin{align*}
U_1[\psi](a) &= \delta\left[Y - \frac{\int_0^\tau s e^{\delta\theta s}\,dA(s)}{\int_0^\tau e^{\delta\theta s}\,dA(s)}\right]a,\quad\text{and}\\
U_2[\psi](h) &= h(Y) - \frac{\int_0^\tau e^{\delta\theta s}h(s)\,dA(s)}{\int_0^\tau e^{\delta\theta s}\,dA(s)}.
\end{align*}

The components of $\sigma$ are obtained by taking expectations under the true distribution $P_0$ of the $\sigma_{jk}$, $j, k = 1, 2$, where
\begin{align*}
\sigma_{11}a &= \left(E_\delta([\delta y]^2) - [E_\delta(\delta y)]^2\right)a,\\
\sigma_{21}a &= \frac{e^{\delta\theta_0(\cdot)}}{\int_0^\tau e^{\delta\theta_0 s}\,dA_0(s)}\left[\delta(\cdot) - E_\delta(\delta y)\right]a,\\
\sigma_{12}h &= E_\delta(\delta y\,h(y)) - E_\delta(\delta y)E_\delta(h(y)),\quad\text{and}\\
\sigma_{22}h &= \frac{e^{\delta\theta_0(\cdot)}}{\int_0^\tau e^{\delta\theta_0 s}\,dA_0(s)}\left[h(\cdot) - E_\delta(h(y))\right],
\end{align*}
where, for a $\mathcal{B}$-measurable function $y \mapsto f(y)$,
\[
E_j(f(y)) \equiv \frac{\int_0^\tau f(y)e^{j\theta_0 y}\,dA_0(y)}{\int_0^\tau e^{j\theta_0 y}\,dA_0(y)},
\]

for $j = 0, 1$. Then $\sigma_{jk} = P_0\sigma_{jk}$, $j, k = 1, 2$.

We now show that $\sigma : \mathcal{C} \to \mathcal{C}$ is continuously invertible and onto. First, as with the previous example, it is easy to verify that $\sigma = \kappa_1 + \kappa_2$, where $\kappa_1 c \equiv (a, \rho_0(\cdot)h(\cdot))$, $\kappa_2 \equiv \sigma - \kappa_1$,
\[
\rho_0(\cdot) \equiv P_0\left[\frac{e^{\delta\theta_0(\cdot)}}{\int_0^\tau e^{\delta\theta_0 s}\,dA_0(s)}\right],
\]
and where $\kappa_2$ is a compact operator and $\kappa_1$ is continuously invertible and onto. Provided we can show that $\sigma$ is one-to-one, we will again be able to utilize Lemma 6.17 to obtain that $\sigma$ is continuously invertible and onto.

Our argument for showing that $\sigma$ is one-to-one will be similar to that used for the Cox model for right-censored data. Accordingly, let $c = (a, h) \in \mathcal{C}$ satisfy $\sigma c = 0$, and let $\tilde c = (a, \int_0^{(\cdot)} h(s)\,dA_0(s))$. Thus $\tilde c(\sigma c) = 0$. After some algebra, it can be shown that this implies
\begin{align}
0 &= P_0 E_\delta\left(a\delta y + h(y) - E_\delta[a\delta y + h(y)]\right)^2\tag{20.6}\\
&= \lambda_0 V_0(h(y)) + \lambda_1 V_1(ay + h(y)),\nonumber
\end{align}

where, for a measurable function $y \mapsto f(y)$, $V_j(f(y))$ is the variance of $f(Y)$ given $\delta = j$, $j = 0, 1$. Recall that both $\lambda_0$ and $\lambda_1$ are positive. Thus, since (20.6) implies $V_0(h(y)) = 0$, $y \mapsto h(y)$ must be a constant function. Since (20.6) also implies $V_1(ay + h(y)) = 0$, we now have that $a = 0$. Hence $h(Y) = E_\delta(h(y))$ almost surely. Thus $P_0 h^2(Y) = 0$, which implies $h = 0$ almost surely. Hence $c = 0$, and thus $\sigma$ is one-to-one.

Conditions (20.2)–(20.4) can be established for $p = 1$ by recycling previous arguments (see Exercise 20.3.6). Gilbert (2000) showed that the maximum likelihood estimator $\hat\psi_n = \operatorname{argmax}_\psi \mathbb{P}_n\ell(\psi)(X)$ is uniformly consistent for $\psi_0$. Since $\Psi_n(\hat\psi_n)(c) = 0$ almost surely for all $c \in \mathcal{C}$ by definition of the maximum, all of the conditions of Corollary 20.1 hold for $\hat\psi_n$. Thus $\hat\psi_n$ is efficient.

20.2 Inference

We now present several methods of inference for the semiparametric maximum likelihood estimation setting of Corollary 20.1. The first method is the specialization to the current setting of the bootstrap approach for Z-estimators described in Section 13.2.3. The second method, the "piggyback bootstrap," is a much more computationally efficient method that takes advantage of the special structure of efficient estimators. We conclude this section with a brief review of several additional methods of inference for regular, infinite-dimensional parameters.

20.2.1 Weighted and Nonparametric Bootstraps

Recall the nonparametric and weighted bootstrap methods for Z-estimators described in Section 13.2.3. Let $\mathbb{P}_n^\circ$ and $\mathbb{G}_n^\circ$ be the bootstrapped empirical measure and process based on either kind of bootstrap, and let $\overset{\mathrm{P}}{\rightsquigarrow}$ denote either $\overset{\mathrm{P}}{\underset{W}{\rightsquigarrow}}$ for the nonparametric version or $\overset{\mathrm{P}}{\underset{\xi}{\rightsquigarrow}}$ for the weighted version. We will use $\hat\psi_n^\circ$ to denote an approximate maximizer of the bootstrapped empirical log-likelihood $\psi \mapsto \mathbb{P}_n^\circ\ell(\psi)(X)$, and we will denote $\Psi_n^\circ(\psi)(c) \equiv \mathbb{P}_n^\circ U[\psi](c)$ for all $\psi \in \Omega$ and $c \in \mathcal{C}$. We now have the following simple corollary, where $\mathcal{X}_n$ is the $\sigma$-field of the observations $X_1, \ldots, X_n$:

Corollary 20.2 Assume the conditions of Corollary 20.1 and, in addition, that $\hat\psi_n^\circ \overset{\mathrm{as}*}{\to} \psi_0$ unconditionally and that, for every $\epsilon > 0$,
\[
\mathrm{P}\left(\sqrt{n}\sup_{c \in \mathcal{C}_1}\left|\Psi_n^\circ(\hat\psi_n^\circ)(c)\right| > \epsilon\ \Big|\ \mathcal{X}_n\right) = o_P(1).\tag{20.7}
\]
Then the conclusions of Corollary 20.1 hold and $\sqrt{n}(\hat\psi_n^\circ - \hat\psi_n) \overset{\mathrm{P}}{\rightsquigarrow} Z(\sigma^{-1}(\cdot))$ in $\ell^\infty(\mathcal{C}_1)$; i.e., the limiting distribution of $\sqrt{n}(\hat\psi_n - \psi_0)$ and the conditional limiting distribution of $\sqrt{n}(\hat\psi_n^\circ - \hat\psi_n)$ given $\mathcal{X}_n$ are the same.

Before giving the simple proof of this corollary, we make a few remarks. First, we make the obvious point that if $\hat\psi_n^\circ$ is the maximizer of $\psi \mapsto \mathbb{P}_n^\circ\ell(\psi)(X)$, then (20.7) is easily satisfied by the definition of the maximizer. Second, we note that the method of proving consistency of $\hat\psi_n^\circ$ is different from that used in Theorem 13.4, which utilizes the identifiability properties of the limiting estimating equation $\Psi$ as expressed in Condition (13.1). The reason for this is that for many semiparametric models identifiability is quite tricky to establish, and for many complicated models it is not clear how to prove identifiability of $\Psi$. Establishing consistency of $\hat\psi_n^\circ$ directly is often more fruitful.

For the weighted bootstrap, maximizing $\psi \mapsto \mathbb{P}_n^\circ\ell(\psi)(X)$ is equivalent to maximizing $\psi \mapsto \mathbb{P}_n\xi\ell(\psi)(X)$, where the positive weights $\xi_1, \ldots, \xi_n$ are i.i.d. and independent of the data $X_1, \ldots, X_n$. By Corollary 9.27, $\{\xi\ell(\psi)(X) : \psi \in \Omega\}$ forms a Glivenko-Cantelli class with integrable envelope if and only if $\{\ell(\psi)(X) : \psi \in \Omega\}$ is Glivenko-Cantelli with integrable envelope. Thus many of the consistency arguments used in establishing consistency of $\hat\psi_n$ can be applied, almost without modification, to verifying unconditional consistency of $\hat\psi_n^\circ$. This is true, for example, with the existence and consistency arguments for the joint estimator from the proportional odds model for right-censored data discussed in Sections 15.3.2 and 15.3.3. Thus, for this proportional odds example, $\hat\psi_n^\circ$ is unconditionally consistent and satisfies (20.7), and thus the weighted bootstrap is valid in this case. It is not clear what arguments would be needed to obtain the corresponding results for the nonparametric bootstrap.

An important advantage of the weighted and nonparametric bootstraps is that they do not require the likelihood model to be correctly specified. An example of the validity of these methods under misspecification of univariate frailty regression models for right-censored data, a large class of models which includes both the Cox model and the proportional odds model as special cases, is described in Kosorok, Lee and Fine (2004). An important disadvantage of these bootstraps is that they are computationally intensive, since the weighted likelihood must be maximized over the entire parameter space for every bootstrap realization. This issue will be addressed in Section 20.2.2 below.

Proof of Corollary 20.2. The proof essentially follows from the fact that the conditional bootstrapped distribution of a Donsker class is automatically consistent via Theorem 2.6. Since conditional weak convergence implies unconditional weak convergence (as argued in the proof of Theorem 10.4), both Lemma 13.3 and Theorem 2.11 apply to $\Psi_n^\circ$, and thus
\[
\sup_{c \in \mathcal{C}_p}\left|\sqrt{n}(\hat\psi_n^\circ - \psi_0)(\sigma(c)) - \sqrt{n}(\Psi_n^\circ - \Psi)(c)\right| = o_{P_0}(1),
\]
unconditionally, for any $0 < p < \infty$. Combining this with previous results for $\hat\psi_n$, we obtain, for any $0 < p < \infty$,
\[
\sup_{c \in \mathcal{C}_p}\left|\sqrt{n}(\hat\psi_n^\circ - \hat\psi_n)(\sigma(c)) - \sqrt{n}(\Psi_n^\circ - \Psi_n)(c)\right| = o_{P_0}(1).
\]
Since $\{U[\psi_0](c) : c \in \mathcal{C}_p\}$ is Donsker for any $0 < p < \infty$, we have the desired conclusion by reapplication of Theorem 2.6 and the continuous invertibility of $\sigma$.

20.2.2 The Piggyback Bootstrap

We now discuss an inferential method for the joint maximum likelihood estimator $\hat\psi_n = (\hat\theta_n, \hat A_n)$ that is computationally much more efficient than the usual bootstrap. From the previous chapter, Chapter 19, we have learned a method, the "profile sampler," for generating random realizations $\tilde\theta_n$ such that $\sqrt{n}(\tilde\theta_n - \hat\theta_n)$ given the data has the same limiting distribution as $\sqrt{n}(\hat\theta_n - \theta_0)$ does unconditionally. The piggyback bootstrap will utilize these $\tilde\theta_n$ realizations to improve computational efficiency. It turns out that it does not matter how the $\tilde\theta_n$ are generated, provided $\sqrt{n}(\tilde\theta_n - \hat\theta_n)$ has the desired conditional limiting distribution. For any $\theta \in \Theta$, let $\hat A_\theta^\circ = \operatorname{argmax}_A \mathbb{P}_n^\circ\ell(\theta, A)(X)$, where $\mathbb{P}_n^\circ$ is the weighted bootstrap empirical measure. In this section, we will not utilize the nonparametric bootstrap but will only consider the weighted bootstrap.

The main idea of the piggyback bootstrap of Dixon, Kosorok and Lee (2005) is to generate a realization of $\tilde\theta_n$, then generate the random weights $\xi_1, \ldots, \xi_n$ in $\mathbb{P}_n^\circ$ independent of both the data and $\tilde\theta_n$, and then compute $\hat A^\circ_{\tilde\theta_n}$. This generates a joint realization $\tilde\psi_n \equiv (\tilde\theta_n, \hat A^\circ_{\tilde\theta_n})$. For instance, one can generate a sequence of $\tilde\theta_n$s, $\tilde\theta_n^{(1)}, \ldots, \tilde\theta_n^{(m)}$, using the profile sampler. For each $\tilde\theta_n^{(j)}$, a new draw of the random weights $\xi_1, \ldots, \xi_n$ is made, and $\hat A^{(j)} \equiv \hat A^\circ_{\tilde\theta_n^{(j)}}$ is computed. A joint realization $\tilde\psi^{(j)} \equiv (\tilde\theta_n^{(j)}, \hat A^\circ_{\tilde\theta_n^{(j)}})$ is thus obtained for each $j = 1, \ldots, m$.

Under regularity conditions which we will delineate, the conditional distribution of $\sqrt{n}(\tilde\psi_n - \hat\psi_n)$ converges to the same limiting distribution as $\sqrt{n}(\hat\psi_n - \psi_0)$ does unconditionally. Hence the realizations $\tilde\psi^{(1)}, \ldots, \tilde\psi^{(m)}$ can be used to construct joint confidence bands for Hadamard-differentiable functions of $\psi_0 = (\theta_0, A_0)$, as permitted by Theorem 12.1. For example, this could be used to construct confidence bands for estimated survival curves from a proportional odds model for a given covariate value.

We now present a few regularity conditions which, in addition to the assumptions of Corollary 20.1, are sufficient for the piggyback bootstrap to be valid. First, decompose $Z$ from Corollary 20.1 into two parts, $(Z_1, Z_2) = Z$, where $Z_1 \in \mathbb{R}^k$ and $Z_2 \in \ell^\infty(\mathcal{H}_p)$, for some $0 < p < \infty$. Let $M$ denote the random quantities used to generate $\tilde\theta_n$, so that $\sqrt{n}(\tilde\theta_n - \hat\theta_n) \overset{\mathrm{P}}{\underset{M}{\rightsquigarrow}} Z_1'$, where $Z_1'$ is a Gaussian vector, independent of $Z$, with mean zero and covariance equal to the upper $k \times k$ entry of $\sigma^{-1}$; and let $\xi$ denote $\xi_1, \xi_2, \ldots$, so that $\overset{\mathrm{P}}{\underset{\xi}{\rightsquigarrow}}$ will signify conditional convergence of the weighted bootstrap process in probability, given the data. We assume the random quantities represented by $M$ are independent of those represented by $\xi$; i.e., the generation of $\tilde\theta_n$ is independent of the random weights $\xi_1, \xi_2, \ldots$. For simplicity of exposition, we also require $\hat\psi_n = (\hat\theta_n, \hat A_n)$ to be a maximizer of $\psi \mapsto \mathbb{P}_n\ell(\psi)(X)$. We also need that $\sigma_{22} : \mathcal{H}_p \to \mathcal{H}_p$ is continuously invertible and onto; and that, for any potentially random sequence $\tilde\theta_n \overset{P}{\to} \theta_0$, $\hat A^\circ_{\tilde\theta_n} \overset{P}{\to} A_0$ in $\ell^\infty(\mathcal{H}_p)$, unconditionally, for some $0 < p < \infty$. We now present the main result of this section:

Corollary 20.3 Assume the above conditions in addition to the conditions of Corollary 20.1. Then the conclusions of Corollary 20.1 hold and
\[
\sqrt{n}\begin{pmatrix} \tilde\theta_n - \hat\theta_n \\ \hat A^\circ_{\tilde\theta_n} - \hat A_n \end{pmatrix} \overset{\mathrm{P}}{\underset{M,\xi}{\rightsquigarrow}} Z(\sigma^{-1}(\cdot)),\quad\text{in } \ell^\infty(\mathcal{C}_1).
\]

Before giving the proof, we note several things. First, for the two examples considered in detail at the end of Section 20.1, continuous invertibility of $\sigma_{22}$ is essentially implied by the assumed structure and continuous invertibility of $\sigma$. To see this, it is enough to verify that $\sigma_{22}$ is one-to-one. Accordingly, fix an $h \in \mathcal{H}$ that satisfies $\sigma_{22}h = 0$. We thus have that $\tilde c(\sigma c) = 0$, where $\tilde c = \left(0, \int_0^{(\cdot)} h(u)\,dA_0(u)\right)$ and $c = (0, h)$. For both of the examples, we verified that this implies $c = 0$, which obviously implies $h = 0$. While the above argument does not directly work for the proportional odds example, it is not hard to verify the desired result by using simple, recycled arguments from the proof of Theorem 15.9 (see Exercise 20.3.7). In general, establishing continuous invertibility of $\sigma_{22}$ is quite easy to do, and is almost automatic, once continuous invertibility of $\sigma$ has been established.

The only challenging condition to verify is that $\hat A^\circ_{\tilde\theta_n} \overset{P}{\to} A_0$ for any sequence $\tilde\theta_n \overset{P}{\to} \theta_0$. We will examine this, after the proof of the corollary given below, for three examples: the Cox and proportional odds models for right-censored data, and the biased sampling model considered at the end of Section 20.1.

As a final note, we point out that the computational advantages of the piggyback bootstrap over the usual bootstrap can be quite dramatic. A simulation study given in Section 4 of Dixon, Kosorok and Lee (2005) identified a 30-fold decrease in computation time (as measured by both actual time and number of maximizations performed) from using the piggyback bootstrap instead of the weighted bootstrap.

Proof of Corollary 20.3. Arguments similar to those used in Exercise 20.3.2 can be used to verify that the assumptions of the corollary which apply for some $0 < p < \infty$ will hold for all $0 < p < \infty$. We will therefore assume without loss of generality that $p$ is large enough for both $\sigma^{-1}(\mathcal{C}_1) \subset \mathcal{C}_p$ and $\sigma_{22}^{-1}(\mathcal{H}_1) \subset \mathcal{H}_p$. Recycling arguments used before, we can readily obtain that
\[
\sqrt{n}(\mathbb{P}_n^\circ - \mathbb{P}_n)U_2[\tilde\theta_n, \hat A^\circ_{\tilde\theta_n}](\cdot) = \sqrt{n}(\mathbb{P}_n^\circ - \mathbb{P}_n)U_2[\theta_0, A_0](\cdot) + o_{P_0}^{\mathcal{H}_p}(1),
\]



unconditionally. However, the left-hand side of this expression equals, by the definition of a maximizer,
\begin{align*}
-\sqrt{n}\,\mathbb{P}_n U_2[\tilde\theta_n, \hat A^\circ_{\tilde\theta_n}]
&= -\sqrt{n}\,\mathbb{P}_n\left(U_2[\tilde\theta_n, \hat A^\circ_{\tilde\theta_n}] - U_2[\hat\theta_n, \hat A_n]\right)\\
&= -\sqrt{n}\,P_0\left(U_2[\tilde\theta_n, \hat A^\circ_{\tilde\theta_n}] - U_2[\hat\theta_n, \hat A_n]\right) + o_{P_0}^{\mathcal{H}_p}(1)\\
&= -\sigma_{21}\sqrt{n}(\tilde\theta_n - \hat\theta_n) - \sigma_{22}\sqrt{n}(\hat A^\circ_{\tilde\theta_n} - \hat A_n) + o_{P_0}^{\mathcal{H}_p}(1),
\end{align*}
where the second equality follows from the stochastic equicontinuity assured by Condition (20.3), and the third equality follows from the Fréchet differentiability of $\Psi$ as assured by Exercise 20.3.1.

Thus we have, unconditionally, that
\[
\sqrt{n}(\hat A^\circ_{\tilde\theta_n} - \hat A_n) = -\sigma_{22}^{-1}\sigma_{21}\sqrt{n}(\tilde\theta_n - \hat\theta_n) - \sqrt{n}(\mathbb{P}_n^\circ - \mathbb{P}_n)U_2[\psi_0](\sigma_{22}^{-1}(\cdot)) + o_{P_0}^{\mathcal{H}_p}(1).
\]
Hence

\[
\sqrt{n}\begin{pmatrix} \tilde\theta_n - \hat\theta_n \\ \hat A^\circ_{\tilde\theta_n} - \hat A_n \end{pmatrix} \overset{\mathrm{P}}{\underset{M,\xi}{\rightsquigarrow}} \begin{pmatrix} Z_1' \\ Z_2(\sigma_{22}^{-1}(\cdot)) - \sigma_{22}^{-1}\sigma_{21}Z_1' \end{pmatrix},
\]

where $Z_1'$ and $Z_2$ are independent mean-zero Gaussian processes with covariances defined above. Hence the right-hand side has joint covariance $\tilde c(V_0 c)$, where $\tilde c \equiv \left(c_1, \int_0^{(\cdot)} c_2(s)\,dA_0(s)\right)$ and $c \equiv (c_1, c_2)$, for any $c \in \mathcal{C}_1$, and where
\[
V_0 \equiv \begin{pmatrix} v_{11} & -v_{11}\sigma_{12}\sigma_{22}^{-1} \\ -\sigma_{22}^{-1}\sigma_{21}v_{11} & \sigma_{22}^{-1} + \sigma_{22}^{-1}\sigma_{21}v_{11}\sigma_{12}\sigma_{22}^{-1} \end{pmatrix},
\]
and
\[
v_{11} \equiv \left(\sigma_{11} - \sigma_{12}\sigma_{22}^{-1}\sigma_{21}\right)^{-1}.
\]

Since $V_0 = \sigma^{-1}$, the desired result follows.

The Cox model under right censoring.

All that remains to show for the Cox model is that $\hat A^\circ_{\tilde\theta_n}$ converges uniformly to $A_0$, unconditionally, for any sequence $\tilde\theta_n \overset{P}{\to} \theta_0$. Fortunately, the maximizer of $A \mapsto \mathbb{P}_n^\circ\ell(\tilde\theta_n, A)(X)$ has the explicit "Breslow estimator" form:
\[
\hat A^\circ_{\tilde\theta_n}(t) = \int_0^t \frac{\mathbb{P}_n^\circ\,dN(s)}{\mathbb{P}_n^\circ Y(s)e^{\tilde\theta_n'Z}}.
\]
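As a sketch of how this weighted Breslow form can be computed in practice (with a scalar covariate and variable names of our choosing; setting all $\xi_i = 1$ recovers the ordinary Breslow estimator):

```python
import numpy as np

def breslow_weighted(t, theta, U, delta, Z, xi):
    """Weighted-bootstrap Breslow estimator A°_theta(t), a sketch for a
    scalar covariate Z.  U are the follow-up times, delta the event
    indicators, and xi the bootstrap weights."""
    U, delta, Z, xi = map(np.asarray, (U, delta, Z, xi))
    risk = xi * np.exp(theta * Z)                      # weighted hazard scale
    A = 0.0
    for i in np.flatnonzero((delta == 1) & (U <= t)):  # events up to time t
        at_risk = risk[U >= U[i]].sum()                # weighted at-risk sum
        A += xi[i] / at_risk                           # jump at event time
    return A
```

The numerator and denominator are the weighted empirical versions of $\mathbb{P}_n^\circ\,dN$ and $\mathbb{P}_n^\circ Y(s)e^{\theta' Z}$ in the display above.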

Since the class of functions $\{Y(t)e^{\theta'Z} : t \in [0,\tau], \|\theta - \theta_0\| \le \epsilon\}$ is Glivenko-Cantelli for some $\epsilon > 0$, $\mathbb{P}_n^\circ Y(t)e^{\tilde\theta_n'Z} \overset{P}{\to} P_0 Y(t)e^{\theta_0'Z}$ uniformly and unconditionally. Since the class $\{N(t) : t \in [0,\tau]\}$ is also Glivenko-Cantelli, and since the map $(U, V) \mapsto \int(1/V)\,dU$ is continuous, provided $V$ is bounded below, via Lemma 12.3, we obtain the desired result that $\hat A^\circ_{\tilde\theta_n} \overset{P}{\to} A_0$ uniformly and unconditionally. Thus the conclusions of Corollary 20.3 hold.



The piggyback bootstrap in this instance is incredibly simple to implement. Let
\[
\hat V \equiv \int_0^\tau\left[\frac{\mathbb{P}_n ZZ'Y(t)e^{\hat\theta_n'Z}}{\mathbb{P}_n Y(t)e^{\hat\theta_n'Z}} - \left(\frac{\mathbb{P}_n ZY(t)e^{\hat\theta_n'Z}}{\mathbb{P}_n Y(t)e^{\hat\theta_n'Z}}\right)^{\otimes 2}\right]\mathbb{P}_n\,dN(t),
\]
and let $\tilde\theta_n = \hat\theta_n + Z_n/\sqrt{n}$, where $Z_n$ is independent of $\xi_1, \ldots, \xi_n$ and normally distributed with mean zero and variance $\hat V^{-1}$. Then $\tilde\psi_n = (\tilde\theta_n, \hat A^\circ_{\tilde\theta_n})$ will be an asymptotically valid Monte Carlo replication of the correct joint limiting distribution of $\hat\psi_n = (\hat\theta_n, \hat A_n)$.
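A minimal sketch of this recipe for a scalar covariate, with all names hypothetical; the information estimate integrates the usual variance term against the empirical event measure, and the normal draw is scaled so that $\sqrt{n}(\tilde\theta_n - \hat\theta_n)$ has variance $\hat V^{-1}$:

```python
import numpy as np

def cox_piggyback_theta(U, delta, Z, theta_hat, rng):
    """Sketch: estimated information V_hat for the Cox model (scalar Z)
    and one draw theta = theta_hat + Z_n/sqrt(n), Z_n ~ N(0, V_hat^{-1})."""
    U, delta, Z = map(np.asarray, (U, delta, Z))
    n = len(U)
    e = np.exp(theta_hat * Z)
    V = 0.0
    for i in np.flatnonzero(delta == 1):
        at = U >= U[i]                        # at-risk indicators Y(U_i)
        S0 = np.mean(e * at)
        S1 = np.mean(Z * e * at)
        S2 = np.mean(Z ** 2 * e * at)
        V += (S2 / S0 - (S1 / S0) ** 2) / n   # integrate against Pn dN
    theta_draw = theta_hat + rng.normal(0.0, np.sqrt(1.0 / V)) / np.sqrt(n)
    return V, theta_draw
```

Each such draw is then paired with a fresh set of weights $\xi_1, \ldots, \xi_n$ and the weighted Breslow estimator to form one joint replication.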

The proportional odds model under right censoring.

A careful inspection of the arguments for estimator existence given in Section 15.3.2 yields the conclusion that $\limsup_{n\to\infty}\hat A^\circ_{\theta_n'}(\tau) < \infty$ with inner probability one, unconditionally, for any arbitrary (not necessarily convergent) deterministic sequence $\theta_n' \in \Theta$. This follows primarily from the fact that for any Glivenko-Cantelli class $\mathcal{F}$, $\|\mathbb{P}_n^\circ - \mathbb{P}_n\|_{\mathcal{F}} \overset{\mathrm{as}*}{\to} 0$ via Corollary 10.14. Similar reasoning can be applied to the arguments of Section 15.3.3 to establish that $\hat A^\circ_{\theta_n'} \overset{\mathrm{as}*}{\to} A_0$, unconditionally, for any deterministic sequence $\theta_n' \to \theta_0$. Thus, for any arbitrary deterministic sequence of sets $\Theta_n' \ni \theta_0$ converging to $\theta_0$, we have $\sup_{\theta \in \Theta_n'}\|\hat A_n^\circ(\theta) - A_0\|_\infty \overset{\mathrm{as}*}{\to} 0$. This yields that $\|\hat A^\circ_{\tilde\theta_n} - A_0\|_\infty \overset{P}{\to} 0$ for any sequence $\tilde\theta_n \overset{P}{\to} \theta_0$. Hence the conclusions of Corollary 20.3 follow, provided we can obtain random draws $\tilde\theta_n$ which satisfy the requisite conditions.

We will show in Chapter 22 that the profile sampler is valid for the proportional odds model under right censoring, and thus $\tilde\theta_n$ values generated from the profile sampler will have the needed properties. Hence we will be able to generate random realizations $(\tilde\theta_n, \hat A^\circ_{\tilde\theta_n})$ which have the desired conditional joint limiting distribution and thus can be used for inference. In this instance, one can adapt the arguments of Section 15.3.1 to obtain a computationally efficient stationary point algorithm which involves solving iteratively
\[
\hat A^\circ_{\tilde\theta_n}(t) = \int_0^t\left[\mathbb{P}_n^\circ W(s; \tilde\psi_n)\right]^{-1}\mathbb{P}_n^\circ\,dN(s),
\]
where
\[
W(t; \psi) \equiv \frac{(1 + \delta)e^{\theta'Z}Y(t)}{1 + e^{\theta'Z}A(U)},
\]
$\psi = (\theta, A)$ and $\tilde\psi_n = (\tilde\theta_n, \hat A^\circ_{\tilde\theta_n})$. This will result in significant computational savings over the usual bootstrap.

A biased sampling model.

Establishing the regularity conditions for the biased sampling model presented at the end of Section 20.1 is somewhat involved, and we will omit the proofs, which are given in Dixon, Kosorok and Lee (2005). However, we will discuss implementation of the piggyback bootstrap for this model. Firstly, we can use arguments given in Vardi (1985) to obtain that the maximum likelihood estimator $\hat\theta_n$ is the maximizer of the partial log-likelihood
\[
p\ell_n(\theta) \equiv \sum_{j=0,1}\lambda_{nj}\int_0^\tau\log\left[\frac{w_j(u,\theta)B_n^{-j}(\theta)}{\sum_{k=0,1}\lambda_{nk}w_k(u,\theta)B_n^{-k}(\theta)}\right]dG_{nj}(u),
\]
where $\lambda_{nj} \equiv n_j/n$, $n_j$ is the number of observations in stratum $j$, $G_{nj}$ is the empirical distribution of the observations $Y$ in stratum $j$, for $j = 0, 1$, $B_n(\theta)$ is the unique solution of
\[
\int_0^\tau \frac{w_1(u,\theta)}{B_n(\theta)\lambda_{n0}w_0(u,\theta) + \lambda_{n1}w_1(u,\theta)}\,dG_n(u) = 1,
\]
and where $G_n \equiv \lambda_{n0}G_{n0} + \lambda_{n1}G_{n1}$. Thus we can obtain draws $\tilde\theta_n$ via the profile sampler applied to $p\ell_n(\theta)$.

Secondly, arguments in Vardi (1985) can also be used to obtain a closed-form solution for $\hat A_\theta$:
\[
\hat A_\theta(t) = \frac{\int_0^t\left[\sum_{j=0,1}\lambda_{nj}w_j(u,\theta)B_n^{-j}(\theta)\right]^{-1}dG_n(u)}{\int_0^\tau\left[\sum_{j=0,1}\lambda_{nj}w_j(u,\theta)B_n^{-j}(\theta)\right]^{-1}dG_n(u)}.
\]
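These two displays can be sketched numerically; we assume here the special weights $w_0 = 1$ and $w_1(u) = e^{\theta u}$ of this example, treat `y` as the pooled sample (so that integration against $G_n$ is an average over all observations), and take `lam` as the stratum proportions; the bisection solver and variable names are our own choices:

```python
import numpy as np

def solve_Bn(theta, y, lam):
    """Solve Vardi's equation for B_n(theta) by bisection:
    integral of w1 / (B*lam0*w0 + lam1*w1) dGn = 1, with w0 = 1."""
    w1 = np.exp(theta * np.asarray(y))
    def f(B):  # decreasing in B; positive for small B, negative for large B
        return np.mean(w1 / (B * lam[0] + lam[1] * w1)) - 1.0
    lo, hi = 1e-8, 1e8
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if f(mid) > 0 else (lo, mid)
    return 0.5 * (lo + hi)

def A_theta(t, theta, y, lam):
    """Closed-form A_theta(t) from the display above (a sketch)."""
    y = np.asarray(y)
    Bn = solve_Bn(theta, y, lam)
    g = 1.0 / (lam[0] + lam[1] * np.exp(theta * y) / Bn)  # integrand vs Gn
    return g[y <= t].sum() / g.sum()
```

When $\theta = 0$ the two strata carry identical weights, $B_n(0) = 1$, and $\hat A_\theta$ reduces to the pooled empirical distribution, which gives a quick sanity check on an implementation.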

Since $n_j = n\mathbb{P}_n\mathbf{1}\{\delta = j\}$ and $G_{nj}(u) = n_j^{-1}n\mathbb{P}_n\mathbf{1}\{Y \le u, \delta = j\}$, $\hat A_\theta$ is a function of the empirical distribution $\mathbb{P}_n$. Hence $\hat A_\theta^\circ$ is obtained by replacing $\mathbb{P}_n$ with $\mathbb{P}_n^\circ$ in the definition of $\hat A_\theta$ and its components (e.g., $\lambda_{nj}$, for $j = 0, 1$, $B_n$, etc.). Dixon, Kosorok and Lee (2005) verify the conditions of Corollary 20.3 for this setting, and thus the piggyback bootstrap is valid, provided appropriate draws $\tilde\theta_n$ can be obtained.

Dixon, Kosorok and Lee (2005) also establish all of the conditions of Theorem 19.6 for the partial-profile likelihood $p\ell_n(\theta)$ except for Condition (19.17). Nevertheless, this condition probably holds, although we will not verify it here. Alternatively, one can estimate the limiting covariance matrix for $\sqrt{n}(\hat\theta_n - \theta_0)$ using Corollary 19.4, which is easy to implement in this instance since $\theta$ is a scalar. Call the resulting covariance estimator $\hat V$. Then one can generate $\tilde\theta_n = \hat\theta_n + Z_n\sqrt{\hat V/n}$, where $Z_n$ is a standard normal deviate. Hence $(\tilde\theta_n, \hat A^\circ_{\tilde\theta_n})$ will be a jointly valid deviate that can be used for joint inference on $(\theta, A)$.

20.2.3 Other Methods

There are a number of other methods that can be used for inference in the setting of this chapter, including fully Bayesian approaches, although none of them are as broadly applicable as the bootstrap or piggyback bootstrap, as far as we are aware. Nevertheless, there are a few worthwhile methods of inference that apply to specific settings.

For the Cox model under right censoring, Lin, Fleming and Wei (1994) developed an influence function approach for inference for the baseline hazard. The basic idea, which is broadly applicable when the influence function is a smooth, closed-form function of the parameter, is to estimate the full influence function $\tilde\psi(X)$ for the joint parameters at observation $X$ by plugging the maximum likelihood estimators in place of the true parameter values. Let the resulting estimated influence function applied to observation $X$ be denoted $\hat\psi(X)$. Let $G_1, \ldots, G_n$ be i.i.d. standard normals, and define $\hat H_n \equiv n^{-1/2}\sum_{i=1}^n G_i\hat\psi(X_i)$, and let $H$ be the tight, mean-zero Gaussian process limiting distribution of $\sqrt{n}(\hat\psi_n - \psi_0) = n^{-1/2}\sum_{i=1}^n \tilde\psi(X_i) + o_{P_0}(1)$, where the error term is uniform. Then, provided $\tilde\psi(X)$ and $\hat\psi(X)$ live in a Donsker class with probability approaching 1 as $n \to \infty$ and provided $P\|\hat\psi(X) - \tilde\psi(X)\|_\infty^2 \overset{P}{\to} 0$, it is easy to see that $\hat H_n \overset{\mathrm{P}}{\underset{G}{\rightsquigarrow}} H$. Although it is computationally easy, this general approach is somewhat narrowly applicable because most influence functions do not have a sufficiently simple, closed form.
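Given a matrix of estimated influence-function values, the multiplier scheme just described is only a few lines; this is a sketch with names of our choosing, not the Lin-Fleming-Wei implementation itself:

```python
import numpy as np

def multiplier_draws(infl, m=1000, rng=None):
    """Multiplier draws H_n = n^{-1/2} sum_i G_i psi_hat(X_i), where infl
    is the n x d matrix of estimated influence-function values psi_hat(X_i)
    and the G_i are i.i.d. standard normals, redrawn for each replication."""
    rng = rng or np.random.default_rng(0)
    infl = np.asarray(infl, dtype=float)
    n = infl.shape[0]
    G = rng.standard_normal((m, n))     # fresh multipliers for each draw
    return (G @ infl) / np.sqrt(n)      # shape (m, d): m draws of H_n
```

Quantiles across the m rows then approximate the law of $H$, and hence of $\sqrt{n}(\hat\psi_n - \psi_0)$, either pointwise or through sup-norm functionals for simultaneous bands.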

Another approach for the Cox model under right censoring, proposed by Kim and Lee (2003), is to put certain priors on the full likelihood and generate realizations of the posterior distribution. Kim and Lee derive an elegant Bernstein-von Mises type result which shows that the resulting posterior, conditional on the data, converges weakly to the correct joint limiting distribution of the joint maximum likelihood estimators. This approach is only applicable for models with very special structure, but it is quite computationally efficient when it can be applied.

Hunter and Lange (2002) develop an interesting accelerated bootstrap procedure for the proportional odds model under right censoring which facilitates faster maximization, resulting in faster computation of bootstrap realizations. Because the method relies on the special structure of the model, it is unclear whether the procedure can be generalized more broadly. There are probably a number of additional inferential methods that apply to various specific but related problems in maximum likelihood inference that we have failed to mention. Our omission is unintentional, and we invite the interested reader to search out these methods more completely than we have. Nevertheless, the weighted bootstrap and piggyback bootstrap approaches we have discussed are among the most broadly applicable and computationally feasible methods available.



20.3 Exercises

20.3.1. Show that Condition (20.4) implies that $\Psi$ is Fréchet-differentiable in $\ell^\infty(\mathcal{C}_p)$.

20.3.2. Show that if Conditions (20.2)–(20.4) hold for some $p > 0$, then they hold for all $0 < p < \infty$. Hint: Use the linearity of the operators involved.

20.3.3. Prove Corollary 20.1.

20.3.4. Prove that the joint maximum likelihood estimator $(\hat\theta_n, \hat A_n)$ for the proportional odds model satisfies the conditions of Corollary 20.1. Thus $(\hat\theta_n, \hat A_n)$ is asymptotically efficient.

20.3.5. Show that (20.5) forces a = 0 and h = 0.

20.3.6. Show that Conditions (20.2)–(20.4) hold for p = 1 in the biased sampling example given at the end of Section 20.1.

20.3.7. Show that σ22 : H → H for the proportional odds regression model for right censored data is continuously invertible and onto. Note that this σ22 corresponds to σ22,θ0 as defined in (15.20).

20.4 Notes

The general semiparametric model structure that emerges in Section 20.1 after the proof of Corollary 3.2 was inspired and informed by the general model framework utilized in Dixon, Kosorok and Lee (2005). The biased sampling example was also studied in Dixon, Kosorok and Lee, and Corollary 20.3 is essentially Theorem 2.1 of Dixon, Kosorok and Lee, although the respective proofs are quite different.


21 Semiparametric M-Estimation

The purpose of this chapter is to extend the M-estimation results of Chapter 14 to semiparametric M-estimation, where there is both a Euclidean parameter of interest θ and a nuisance parameter η. Obviously, the semiparametric maximum likelihood estimators we have been discussing in the last several chapters are important examples of semiparametric M-estimators, where the objective function is an empirical likelihood. However, there are numerous other examples of semiparametric M-estimators, including estimators obtained from misspecified semiparametric likelihoods, least squares, least absolute deviation, and penalized maximum likelihood (which we discussed in Sections 4.5 and 15.1 and elsewhere). In this chapter, we will try to provide general results on estimation and inference for semiparametric M-estimators, along with several illustrative examples. The material for this chapter is adapted from Ma and Kosorok (2005b).

As with Chapter 19, the observed data of interest consist of n i.i.d. observations X1, . . . , Xn drawn from a semiparametric model {Pθ,η : θ ∈ Θ, η ∈ H}, where Θ is an open subset of Rk and H is a possibly infinite-dimensional set. The parameter of interest is θ, and ψ = (θ, η) will be used to denote the joint parameter. The criterion functions we will focus on will primarily be of the form Pn mψ(X), for some class of measurable functions mψ. The most widely used M-estimators of this kind include maximum likelihood (MLE), misspecified MLE, ordinary least squares (OLS), and least absolute deviation estimators. A useful general theorem for investigating the asymptotic behavior of M-estimators for the Euclidean parameter in semiparametric models, which we will build upon in this chapter, is given in Wellner, Zhang and Liu (2002).


An alternative approach to obtaining the limit distribution of M-estimators is to view them as Z-estimators based on estimating equations obtained by differentiating the involved objective functions. We have used this approach several times for maximum likelihood estimation, including in Section 15.2 for the proportional odds model under right censoring. An advantage of this approach is that the Z-estimator theory of Chapter 13 can be utilized. The Z-estimator approach is particularly useful when the joint estimator ψn = (θn, ηn) has convergence rate √n. Unfortunately, this theory usually does not apply when some of the parameters cannot be estimated at the √n rate.

Penalized M-estimation, as illustrated previously for the partly linear logistic regression model (see Sections 4.5 and 15.1), provides a flexible alternative to ordinary M-estimation. Other examples of penalized M-estimators for semiparametric models include the penalized MLE for generalized partial linear models of Mammen and van de Geer (1997), the penalized MLE for the Cox model with interval censored data of Cai and Betensky (2003), and the penalized MLE for transformation models with current status data of Ma and Kosorok (2005a).

The main innovation of Ma and Kosorok (2005b) is the development of bootstrap theory for M-estimation that does not require subsampling (as in Politis and Romano, 1994), yet permits the nuisance parameter to be slower than √n-consistent. This development draws on the empirical process bootstrap literature (see Mason and Newton, 1992; Barbe and Bertail, 1995; Giné, 1996; and Wellner and Zhan, 1996). The specific bootstrap studied in Ma and Kosorok (2005b) is the weighted bootstrap which, as mentioned in previous chapters, consists of i.i.d. positive random weights applied to each observation. The reason for using the weighted bootstrap is that the i.i.d. behavior of the weights makes many of the proofs easier. While it is possible that our results also hold for the nonparametric bootstrap, such a determination is beyond the scope of this chapter and appears to be quite difficult.

The overarching goal of this chapter is to develop a paradigm for semiparametric M-estimators that facilitates a conceptual link between weak convergence and validation of the bootstrap, even when the nuisance parameter may not be √n-estimable. To this end, the first main result we present is a general theory for weak convergence, summarized in Theorem 21.1 below. The theorem mentioned earlier in Wellner, Zhang and Liu (2002) is a corollary of our Theorem 21.1. A great variety of models can be analyzed by the resulting unified approach. The second main result is a validation of the use of the weighted bootstrap as a universal inference tool for the parametric component, even when the nuisance parameters cannot be estimated at the √n convergence rate and/or when the usual inferences based on the likelihood principle do not apply, as happens for example in the penalized M-estimation settings. We also show how these two main results can be linked through a careful analysis of the entropy numbers of the objective functions.

The layout of the chapter is as follows. In Section 21.1, after first presenting some motivating examples, we develop √n-consistency and asymptotic normality results for the estimators of the Euclidean parameters. In Section 21.2, weighted M-estimators with independent weights are studied and these results are used to establish validity of the weighted bootstrap as an inference tool for the Euclidean parameter. Control of the modulus of continuity of the weighted empirical processes plays an important role in our approach. This can be viewed as an extension of the principle utilized in the rate of convergence theorem (Theorem 14.4) of Section 14.3. The connection mentioned above with the entropy numbers of the involved objective functions is described in detail in Section 21.3. In Section 21.4, we study in greater detail the examples of Section 21.1 and verify consistency and bootstrap validity using the proposed techniques. The chapter concludes in Section 21.5 with an extension to penalized M-estimation.

21.1 Semiparametric M-estimators

The purpose of this section is to provide a general framework for estimation and weak convergence of M-estimators for the parametric component θ in a semiparametric model. We first present several motivating examples. Then we present a general scheme for semiparametric M-estimation. We then discuss consistency and rate of convergence, followed by two different approaches to establishing asymptotic normality of the estimator. As with previous chapters, we let P denote expectation under the true distribution.

21.1.1 Motivating Examples

Although only three examples will be given here, they stand as archetypes for a great variety of models that can be studied in a similar manner. In the following sections, derivatives will be denoted with the superscript "( )". For example, h(2) refers to the second-order derivative of the function h. Notice that each of these examples departs in some way from the usual semiparametric MLE paradigm.

The Cox model with current status data (Example 1).

This is an augmentation of the example presented in Section 19.2.3. Let θ be the regression coefficient and Λ the baseline integrated hazard function. The MLE approach to inference for this model was discussed at the end of both Sections 19.3.1 and 19.3.2. As an alternative estimation approach, (θ, Λ) can also be estimated by OLS:


(θn, Λn) = argmin Pn [1 − δ − exp(−exp(θ′Z)Λ(t))]².

In this model, the nuisance parameter Λ cannot be estimated at the √n rate, but is estimable at the n^{1/3} rate, as discussed previously (see also Groeneboom and Wellner, 1992).
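To make the criterion concrete, the following sketch (ours, not from the text) fits this OLS criterion on simulated current status data, approximating Λ by a nondecreasing step function over a fixed grid; the simulation design, the grid, and the choice of optimizer are all assumptions of the sketch.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n = 2000
theta_true = 0.5
Z = rng.normal(size=n)
# Cox model event times with baseline cumulative hazard Lambda(t) = t:
T = rng.exponential(scale=np.exp(-theta_true * Z))
U = rng.uniform(0.05, 2.0, size=n)           # inspection times
delta = (T <= U).astype(float)               # current status: only 1{T <= U} is seen

# Sieve for Lambda: a nondecreasing step function, parameterized by cumulative
# sums of positive increments over a fixed grid (a choice made for this sketch).
grid = np.linspace(0.0, 2.0, 11)[1:]         # right endpoints of 10 cells
cell = np.minimum(np.searchsorted(grid, U), len(grid) - 1)

def criterion(par):
    """The OLS criterion P_n[(1 - delta) - exp(-exp(theta'Z) Lambda(U))]^2."""
    theta, log_inc = par[0], par[1:]
    lam_u = np.cumsum(np.exp(log_inc))[cell]  # Lambda evaluated on the grid
    resid = (1.0 - delta) - np.exp(-np.exp(theta * Z) * lam_u)
    return np.mean(resid ** 2)

start = np.concatenate([[0.0], np.full(10, np.log(0.2))])  # Lambda(u) ~ u
fit = minimize(criterion, start, method="BFGS", options={"maxiter": 300})
theta_hat = fit.x[0]
print(theta_hat)
```

The cumulative-sum parameterization enforces monotonicity of the estimated Λ automatically; the slower-than-√n behavior of the Λ estimator shows up as much greater sampling variability in the estimated increments than in θ.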

Binary regression under misspecified link function (Example 2).

This is a modification of the partly linear logistic regression model introduced in Chapter 1 and discussed in Sections 4.5 and 15.1. Suppose that we observe an i.i.d. random sample (Y1, Z1, U1), . . . , (Yn, Zn, Un) consisting of a binary outcome Y, a k-dimensional covariate Z, and a one-dimensional continuous covariate U ∈ [0, 1], following the additive model

Pθ,h(Y = 1|Z = z, U = u) = φ(θ′z + h(u)),

where h is a smooth function belonging to

H = { h : [0, 1] → [−1, 1], ∫₀¹ (h(s)(u))² du ≤ K },   (21.1)

for a fixed and known K ∈ (0, ∞) and an integer s ≥ 1, and where φ : R → [0, 1] is a known continuously differentiable monotone function. The choices φ(t) = 1/(1 + e−t) and φ = Φ (the cumulative normal distribution function) correspond to the logit and probit models, respectively.

The maximum likelihood estimator (θn, hn) maximizes the (conditional) log-likelihood function

ℓn(θ, h)(X) = Pn [Y log(φ(θ′Z + h(U))) + (1 − Y) log(1 − φ(θ′Z + h(U)))],   (21.2)

where X = (Y, Z, U). Here, we investigate the estimation of (θ, h) under misspecification of φ. Under model misspecification, the usual theory for semiparametric MLEs, as discussed in Chapter 19, does not apply.

The condition ∫₀¹ (h(s)(u))² du ≤ K in (21.1) can be relaxed as follows.

Instead of maximizing the log-likelihood as in (21.2), we can take (θn, hn) to be the maximizer of the penalized log-likelihood ℓn(θ, h) − λn² J²(h), as done in Chapter 1, where J is as defined in Chapter 1, and where λn is a data-driven smoothing parameter. Then we only need to assume ∫₀¹ (h(s)(u))² du < ∞. The application of penalization to the misspecified model will be examined more closely in Section 21.5.
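As a small numerical illustration (our own sketch, not from the text), the following code evaluates the criterion (21.2) on simulated data at the true (θ, h) under both the correct logit link and a misspecified probit link; the data-generating design is an assumption of the sketch.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
n = 2000
Z = rng.normal(size=(n, 2))                # k = 2 covariates
U = rng.uniform(size=n)                    # U in [0, 1]
theta0 = np.array([1.0, -0.5])
h0 = np.sin(2.0 * np.pi * U) / 2.0         # a smooth h with values in [-1, 1]

def logit(t):
    return 1.0 / (1.0 + np.exp(-t))

Y = rng.binomial(1, logit(Z @ theta0 + h0))  # data truly follow the logit link

def loglik(theta, h_vals, phi):
    """The empirical criterion (21.2): P_n[Y log phi + (1 - Y) log(1 - phi)]."""
    p = np.clip(phi(Z @ theta + h_vals), 1e-10, 1.0 - 1e-10)
    return np.mean(Y * np.log(p) + (1.0 - Y) * np.log(1.0 - p))

ll_logit = loglik(theta0, h0, logit)       # correctly specified link
ll_probit = loglik(theta0, h0, norm.cdf)   # misspecified (probit) link
print(ll_logit, ll_probit)
```

At the true parameters the correctly specified criterion is larger on average; under misspecification the M-estimator instead targets the maximizer of the limiting criterion Pm, which is why the M-estimation theory of this chapter, rather than standard MLE theory, is needed.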

Mixture models (Example 3).

Suppose that an observation X has a conditional density pθ(x|z) given an unobservable variable Z = z, where pθ is known up to the Euclidean parameter θ. If the unobservable Z possesses an unknown distribution η, then the observation X has the mixture density pθ,η(x) = ∫ pθ(x|z)dη(z). The maximum likelihood estimator (θn, ηn) maximizes the log-likelihood function (θ, η) → ℓn(θ, η) ≡ Pn log(pθ,η(X)).

Examples of mixture models include frailty models, errors-in-variables models in which the errors are modeled by a Gaussian distribution, and scale mixture models over symmetric densities. For a detailed discussion of the MLE for semiparametric mixture models, see van der Vaart (1996) and van de Geer (2000).
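For a concrete (hypothetical) instance with a discrete mixing distribution η, the mixture density is a finite weighted sum and the log-likelihood can be evaluated directly; the normal kernel and the chosen support points below are assumptions of this sketch.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
# Conditional density p_theta(x|z): N(z, theta^2); eta: a discrete mixing
# distribution placing mass on a few support points (all choices hypothetical).
support = np.array([-1.0, 0.0, 2.0])
weights = np.array([0.3, 0.5, 0.2])
theta_true = 0.7

Zlat = rng.choice(support, p=weights, size=1000)       # latent, never observed
X = Zlat + theta_true * rng.normal(size=1000)

def log_lik(theta, w):
    """P_n log p_{theta,eta}(X) for discrete eta: sum_j w_j N(X; z_j, theta^2)."""
    dens = norm.pdf(X[:, None], loc=support[None, :], scale=theta) @ w
    return np.mean(np.log(dens))

ll_true = log_lik(theta_true, weights)
ll_bad = log_lik(theta_true, np.array([0.98, 0.01, 0.01]))  # badly wrong eta
print(ll_true, ll_bad)
```

In the semiparametric problem η ranges over all distributions, so the MLE of η is typically a discrete distribution whose support points must themselves be estimated; the finite-support evaluation above only illustrates the structure of the criterion.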

21.1.2 General Scheme for Semiparametric M-Estimators

Consider a semiparametric statistical model {Pθ,η}, with n i.i.d. observations X1, . . . , Xn drawn from Pθ,η, where θ ∈ Rk and η ∈ H. Assume that the infinite-dimensional space H has norm ‖ · ‖, and that the true unknown parameter is (θ0, η0). An M-estimator (θn, ηn) of (θ, η) has the form

(θn, ηn) = argmax Pn mθ,η(X),   (21.3)

where m is a known, measurable function. All of the following results of this chapter will hold, with only minor modifications, if "argmax" in Equation (21.3) is replaced by "argmin". For simplicity, we assume the limit criterion function Pmψ, where ψ = (θ, η), has a unique and "well-separated" point of maximum ψ0, i.e., Pmψ0 > sup_{ψ∉G} Pmψ for every open set G that contains ψ0.

Analysis of the asymptotic behavior of M-estimators can be split into three main steps: (1) establishing consistency; (2) establishing a rate of convergence; and (3) deriving the limiting distribution.

A typical scheme for studying general semiparametric M-estimators is as follows. First, consistency is established with the argmax theorem (Theorem 14.1) or a similar method. Second, the rate of convergence for the estimators of all parameters can then be obtained from convergence rate theorems such as Theorem 14.4. We briefly discuss in Section 21.1.3 consistency and rate of convergence results in the semiparametric M-estimation context. The asymptotic behavior of estimators of the Euclidean parameters can be studied with Theorem 21.1 or Corollary 21.2 presented in Section 21.1.4 below.

Important properties of weighted M-estimators that enable validation of the weighted bootstrap are described in Section 21.2 below. Lemmas 21.8–21.10 in Section 21.3 can be used to control the modulus of continuity of weighted M-estimators, as needed for bootstrap validation, based on modulus of continuity results for the unweighted M-estimators. The basic conclusion is that bootstrap validity of weighted M-estimators is almost an automatic consequence of modulus of continuity control for the corresponding unweighted M-estimator. The remainder of this section is devoted to elucidating this general scheme.


21.1.3 Consistency and Rate of Convergence

The first steps are to establish consistency and rates of convergence for all parameters of the unweighted (original) M-estimator. General theory for these aspects was studied in detail in Chapter 14.

Consistency of M-estimators can be achieved by careful application of the argmax theorem, as discussed, for example, in Section 14.2. Application of the argmax theorem often involves certain compactness assumptions on the parameter sets along with model identifiability. In this context, it is often sufficient to verify that the class of functions {mθ,η : θ ∈ Θ, η ∈ H} is P-Glivenko-Cantelli. Such an approach is used in the proof of consistency for Example 1, given below in Section 21.4. More generally, establishing consistency can be quite difficult. While a streamlined approach to establishing consistency is not a primary goal of this chapter, we will present several useful tools for accomplishing this in the examples and results that follow.

The basic tool in establishing the rate of convergence for an M-estimator is control of the modulus of continuity of the empirical criterion function using entropy integrals over the parameter sets. Entropy results in Section 11.1 and in van de Geer (2000) give rate of convergence results for a large variety of models, as we will demonstrate for Examples 1–3 in Section 21.4 below.

21.1.4 √n Consistency and Asymptotic Normality

In this section, we develop theory for establishing √n consistency and asymptotic normality for the maximizer θn obtained from a semiparametric objective function m. We present two philosophically different approaches, one based on influence functions and one based on score equations.

An influence function approach.

We now develop a paradigm for studying the asymptotic properties of θn, based on an arbitrary m, which parallels the efficient influence function paradigm used for MLEs (where m is the log-likelihood).

For any fixed η ∈ H, let η(t) be a smooth curve running through η at t = 0, that is, η(0) = η. Let a = (∂/∂t)η(t)|t=0 be a proper tangent in the tangent set P(η)Pθ,η for the nuisance parameter. The concepts of tangent (or "direction of perturbation") and tangent set/space are identical in this context to those in the usual semiparametric context of Section 3.2, since tangent sets come from the model and not from the method of estimation. Accordingly, the requirements on the perturbation a are the usual ones, including the properties that a ∈ L2(P) and, when t is small enough, that η(t) ∈ H. To simplify notation, we will use A to denote P(η)Pθ,η.

For simplicity, we denote m(θ, η; X) as m(θ, η). Set


m1(θ, η) = (∂/∂θ) m(θ, η), and m2(θ, η)[a] = (∂/∂t)|t=0 m(θ, η(t)),

where a ∈ A. We also define

m11(θ, η) = (∂/∂θ) m1(θ, η), m12(θ, η)[a] = (∂/∂t)|t=0 m1(θ, η(t)), and

m21(θ, η)[a] = (∂/∂θ) m2(θ, η)[a], m22(θ, η)[a1][a2] = (∂/∂t)|t=0 m2(θ, η2(t))[a1],

where a, a1 and a2 ∈ A, and (∂/∂t)η2(t)|t=0 = a2.

A brief review of the construction of maximum likelihood estimators is insightful here. In this special case, m is a log-likelihood. Denote by [m2] the linear span of the components of m2 in L2(P). Recall that the efficient score function m̃ is equal to the projection of the score function m1 onto the orthocomplement of [m2] in L2(P). As mentioned in Section 3.2, one way of estimating θ is by solving the efficient score equations Pn m̃θ,ηn(X) = 0.

For general M-estimators, a natural extension is to construct estimating equations as follows. Define m2(θ, η)[A] = (m2(θ, η)[a1], . . . , m2(θ, η)[ak]), where A = (a1, . . . , ak) and a1, . . . , ak ∈ A. Then m12[A1] and m22[A1][A2] can be defined accordingly, for A1 = (a11, . . . , a1k), A2 = (a21, . . . , a2k) and aij ∈ A. Assume there exists an A∗ = (a∗1, . . . , a∗k), where a∗i ∈ A, such that for any A = (a1, . . . , ak), ai ∈ A,

P (m12(θ0, η0)[A] − m22(θ0, η0)[A∗][A]) = 0.   (21.4)

Define m̃(θ, η) ≡ m1(θ, η) − m2(θ, η)[A∗]. θ is then estimated by solving Pn m̃(θ, ηn; X) = 0, where we substitute an estimator ηn for the unknown nuisance parameter. A variation of this approach is to obtain an estimator ηn(θ) of η for each given value of θ and then solve for θ from

Pn m̃(θ, ηn(θ); X) = 0.   (21.5)

In some cases, estimators satisfying (21.5) may not exist. Hence we weaken (21.5) to the following "nearly-maximizing" condition:

Pn m̃(θn, ηn) = oP(n^{−1/2}).   (21.6)

We next give sufficient conditions for θn, based on (21.6), to be √n consistent and asymptotically normally distributed:

A1: (Consistency and rate of convergence) Assume

|θn − θ0| = oP(1), and ‖ηn − η0‖ = OP(n^{−c1}),

for some c1 > 0, where | · | will be used in this chapter to denote the Euclidean norm.


A2: (Finite variance) 0 < det(I∗) < ∞, where det denotes the determinant of a matrix and

I∗ = {P (m11(θ0, η0) − m21(θ0, η0)[A∗])}^{−1} × P [m1(θ0, η0) − m2(θ0, η0)[A∗]]⊗2 × {P (m11(θ0, η0) − m21(θ0, η0)[A∗])}^{−1},

where the superscript ⊗2 denotes outer product.

A3: (Stochastic equicontinuity) For any δn ↓ 0 and C > 0,

sup_{|θ−θ0|≤δn, ‖η−η0‖≤Cn^{−c1}} |√n(Pn − P)(m̃(θ, η) − m̃(θ0, η0))| = oP(1).

A4: (Smoothness of the model) For some c2 > 1 satisfying c1c2 > 1/2 and for all (θ, η) in {(θ, η) : |θ − θ0| ≤ δn, ‖η − η0‖ ≤ Cn^{−c1}},

|P{ (m̃(θ, η) − m̃(θ0, η0)) − (m11(θ0, η0) − m21(θ0, η0)[A∗])(θ − θ0) − (m12(θ0, η0)[(η − η0)/‖η − η0‖] − m22(θ0, η0)[A∗][(η − η0)/‖η − η0‖]) ‖η − η0‖ }| = o(|θ − θ0|) + O(‖η − η0‖^{c2}).

Theorem 21.1 Suppose that (θn, ηn) satisfies Equation (21.6), and that Conditions A1–A4 hold. Then

√n(θn − θ0) = −√n {P (m11(θ0, η0) − m21(θ0, η0)[A∗])}^{−1} × Pn(m1(θ0, η0) − m2(θ0, η0)[A∗]) + oP(1).

Hence √n(θn − θ0) is asymptotically normal with mean zero and variance I∗.

Proof. Combining Conditions A1 and A3, we have

√n(Pn − P) m̃(θn, ηn) = √n(Pn − P) m̃(θ0, η0) + oP(1).   (21.7)

Considering Equation (21.6), we can further simplify Equation (21.7) to

√n P (m̃(θn, ηn) − m̃(θ0, η0)) = −√n Pn m̃(θ0, η0) + oP(1).   (21.8)

Considering Conditions A1 and A4, we can expand the left side of Equation (21.8) to obtain


P (m11(θ0, η0) − m21(θ0, η0)[A∗])(θn − θ0)   (21.9)
+ P (m12(θ0, η0)[(ηn − η0)/‖ηn − η0‖] − m22(θ0, η0)[A∗][(ηn − η0)/‖ηn − η0‖]) × ‖ηn − η0‖
= oP(|θn − θ0|) + OP(‖ηn − η0‖^{c2}) − Pn m̃(θ0, η0) + oP(n^{−1/2}).

Equation (21.9), Condition A1 and Condition A2 together give us

√n P (m11(θ0, η0) − m21(θ0, η0)[A∗])(θn − θ0) = −√n Pn m̃(θ0, η0) + oP(1),

and Theorem 21.1 follows.

Condition A1 can be quite difficult to establish. Some of these challenges were discussed in Section 21.1.3 above. In some cases, establishing A1 can be harder than establishing all of the remaining conditions of Theorem 21.1 combined. Fortunately, there are various techniques for attacking the problem, and we will outline some of them in the examples considered in Section 21.4 below. Condition A2 corresponds to the nonsingular information condition for the MLE. The asymptotic distribution of √n(θn − θ0) is degenerate if this condition is not satisfied. For the case of the MLE, I∗ can be further simplified to −{P (m11(θ0, η0) − m21(θ0, η0)[A∗])}^{−1}.

Conditions A3 and A4 are perhaps less transparent than Conditions A1 and A2, but a number of techniques are available for verification. Condition A3 can be verified via entropy calculations and certain maximal inequalities, for example, as discussed in Chapter 14 and as demonstrated in Huang (1996) and in van der Vaart (1996). One relatively simple, sufficient condition is for the class of functions {m̃(θ, η) : |θ − θ0| ≤ ε1, ‖η − η0‖ ≤ ε2} to be Donsker for some ε1, ε2 > 0 and that P (m̃(θ, η) − m̃(θ0, η0))² → 0 as both |θ − θ0| → 0 and ‖η − η0‖ → 0.

Condition A4 can be checked via Taylor expansion techniques for functionals. Roughly speaking, if the model is differentiable in the ordinary sense when we replace the nonparametric parameter with a Euclidean parameter, then often the right kind of smoothness for the infinite-dimensional parameter η also holds. For example, if all third derivatives with respect to both θ and η of Pm(θ, η) are bounded in a neighborhood of (θ0, η0), then the expression in Condition A4 holds for c2 = 2. Hence Condition A4 will hold provided c1 > 1/4. Fortunately, c1 > 1/4 for many settings, including all of the examples considered in Section 21.4 below.
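The Taylor-expansion heuristic just described can be sketched as follows (an informal reconstruction, not part of the original argument): expand (θ, η) ↦ Pm̃(θ, η) to first order, noting that second derivatives of Pm̃ correspond to third derivatives of Pm because m̃ is built from first derivatives of m:

```latex
P\bigl(\tilde m(\theta,\eta)-\tilde m(\theta_0,\eta_0)\bigr)
 = P\bigl(m_{11}(\theta_0,\eta_0)-m_{21}(\theta_0,\eta_0)[A^*]\bigr)(\theta-\theta_0)
 + P\Bigl(m_{12}(\theta_0,\eta_0)\Bigl[\tfrac{\eta-\eta_0}{\|\eta-\eta_0\|}\Bigr]
        -m_{22}(\theta_0,\eta_0)[A^*]\Bigl[\tfrac{\eta-\eta_0}{\|\eta-\eta_0\|}\Bigr]\Bigr)
   \|\eta-\eta_0\| + R,
\qquad |R|\le C\bigl(|\theta-\theta_0|^2+\|\eta-\eta_0\|^2\bigr).
```

Since |θ − θ0|² = o(|θ − θ0|), this remainder bound matches Condition A4 with c2 = 2, and the requirement c1c2 > 1/2 then reduces to c1 > 1/4.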

A score equation approach.

Another way of looking at M-estimators, which is perhaps more intuitive, is as follows. By definition, M-estimators maximize an objective function, i.e.,


(θn, ηn) = argmax Pn m(θ, η; X).   (21.10)

From Equation (21.10), we have

Pn m1(θn, ηn) = 0, and Pn m2(θn, ηn)[a] = 0,   (21.11)

where a runs over A. Viewed in this fashion, M-estimators are also Z-estimators. Hence the theory of Chapter 13 can be utilized. We can relax (21.11) to the following "nearly-maximizing" conditions:

Pn m1(θn, ηn) = oP(n^{−1/2}), and Pn m2(θn, ηn)[a] = oP(n^{−1/2}), for all a ∈ A.   (21.12)

The following corollary provides sufficient conditions under which estimators satisfying (21.12) and Conditions A1–A2 in Theorem 21.1 will have the same properties obtained in Theorem 21.1. Before giving the result, we articulate two substitute conditions for A3–A4 which are needed in this setting:

B3: (Stochastic equicontinuity) For any δn ↓ 0 and C > 0,

sup_{|θ−θ0|≤δn, ‖η−η0‖≤Cn^{−c1}} |√n(Pn − P)(m1(θ, η) − m1(θ0, η0))| = oP(1), and

sup_{|θ−θ0|≤δn, ‖η−η0‖≤Cn^{−c1}} |√n(Pn − P)(m2(θ, η) − m2(θ0, η0))[A∗]| = oP(1),

where c1 is as in Condition A1.

B4: (Smoothness of the model) For some c2 > 1 satisfying c1c2 > 1/2, where c1 is given in Condition A1, and for all (θ, η) belonging to {(θ, η) : |θ − θ0| ≤ δn, ‖η − η0‖ ≤ Cn^{−c1}},

|P{ m1(θ, η) − m1(θ0, η0) − m11(θ0, η0)(θ − θ0) − m12(θ0, η0)[(η − η0)/‖η − η0‖] ‖η − η0‖ }| = o(|θ − θ0|) + O(‖η − η0‖^{c2}),

and

|P{ m2(θ, η)[A∗] − m2(θ0, η0)[A∗] − m21(θ0, η0)[A∗](θ − θ0) − m22(θ0, η0)[A∗][(η − η0)/‖η − η0‖] ‖η − η0‖ }| = o(|θ − θ0|) + O(‖η − η0‖^{c2}).


Corollary 21.2 Suppose that the estimator (θn, ηn) satisfies Equation (21.12), and Conditions A1, A2, B3 and B4 all hold. Then the results of Theorem 21.1 hold for θn.

Setting A = A∗ and combining the two equations in (21.12), we obtain (21.6). Subtracting the two equations in Conditions B3 and B4 in Corollary 21.2, we can obtain Conditions A3 and A4 in Theorem 21.1, respectively. Thus the conditions in Corollary 21.2 are stronger than their counterparts in Theorem 21.1. However, Corollary 21.2 is sometimes easier to understand and implement. We also note that simpler sufficient conditions for B3 and B4 can be developed along the lines of those developed above for Conditions A3 and A4.

21.2 Weighted M-Estimators and the Weighted Bootstrap

We now investigate inference for θ. For the parametric MLE, the most widely used inference techniques are based on the likelihood. For general semiparametric M-estimation, a natural thought is to mimic the approach for profile log-likelihood expansion studied in Chapter 19. However, what we would then obtain is

n Pn m(θn) = n Pn m(θ0) + (θn − θ0)′ ∑_{i=1}^{n} m1(θ0)(Xi) − (1/2) n (θn − θ0)′ P{m11 − m21[A∗]}(θn − θ0) + oP(1 + √n|θn − θ0|)²,

for any sequence θn →P θ0. Unfortunately, the curvature in the quadratic term does not correspond to the inverse of I∗. This makes inference based on the likelihood ratio impractical, and we will not pursue this approach further in this chapter.

In contrast, the weighted bootstrap, which can be expressed in this setting as a weighted M-estimator, appears to be an effective and nearly universal inference tool for semiparametric M-estimation. The goal of this section is to verify that this holds true in surprising generality. We first study the unconditional behavior of weighted M-estimators and then use these results to establish conditional asymptotic validity of the weighted bootstrap.

Consider n i.i.d. observations X1, . . . , Xn drawn from the true distribution P. Denote by ξi, i = 1, . . . , n, n i.i.d. positive random weights satisfying E(ξ) = 1 and 0 ≤ var(ξ) = v0 < ∞, and which are independent of the data Xn = σ(X1, . . . , Xn), where σ(U) denotes the smallest σ-algebra for which U is measurable. Note that the weights are very similar to the ones introduced for the weighted bootstrap in Section 2.2.3, except that E(ξ) = 1 and we do not require ‖ξ‖2,1 < ∞, nor do we divide the weights by their mean. The weighted M-estimator (θ∘n, η∘n) satisfies

(θ∘n, η∘n) = argmax Pn {ξ m(θ, η; X)}.   (21.13)

Asymptotic properties of weighted M-estimators will be studied in a fashion parallel to those of ordinary M-estimators.
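A minimal sketch (ours, not from the text) of the weighted bootstrap in the simplest parametric case: the M-estimator for the criterion m(θ; X) = −(X − θ)² is the sample mean, its weighted version is the ξ-weighted mean, and standard exponential weights satisfy E(ξ) = 1 with v0 = var(ξ) = 1.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 400
X = rng.normal(loc=2.0, scale=1.5, size=n)

# M-estimator for m(theta; X) = -(X - theta)^2: the sample mean.
theta_hat = X.mean()

# Weighted M-estimators: argmax_theta P_n xi * m(theta; X) is the xi-weighted
# mean.  Exponential(1) weights are positive with E(xi) = 1, v0 = var(xi) = 1.
B = 2000
boots = np.empty(B)
for b in range(B):
    xi = rng.exponential(scale=1.0, size=n)
    boots[b] = np.average(X, weights=xi)

# With v0 = 1, the spread of the weighted replicates directly estimates the
# sampling sd of theta_hat, namely sigma / sqrt(n).
se_boot = boots.std(ddof=1)
se_theory = X.std(ddof=1) / np.sqrt(n)
print(se_boot, se_theory)
```

Because v0 = 1 for exponential weights, no rescaling of the bootstrap draws is needed here; for other weight distributions, the √(n/v0) scaling appearing in Theorem 21.7 below suggests dividing the centered replicates by √v0.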

Since we assume the random weights are independent of Xn, the consistency and rate of convergence for the estimators of all parameters can be established using Theorems 2.12 and 14.4, respectively, after minor modifications. For completeness, we now present these slightly modified versions, Corollaries 21.3 and 21.4 below. Since these are somewhat obvious generalizations of Theorems 2.12 and 14.4, we save their proofs as exercises (see Exercise 21.6.1). In both cases, one trivially gets the results for the unweighted empirical measure Pn by assuming the weights ξ1, . . . , ξn in P∘n are 1 almost surely. We denote ψ = (θ, η) for simplicity, where ψ ∈ Γ, and assume there exists a semimetric d making (Γ, d) into a semimetric space. We also enlarge P to be the product measure of the true distribution and the distribution of the independent weights.

Corollary 21.3 Consistency of weighted M-estimators: Let m be a measurable function indexed by (Γ, d) such that Pm : Γ → R is deterministic. Also let P∘n ≡ n^{−1} ∑_{i=1}^{n} ξi δXi, where ξ1, . . . , ξn are i.i.d. positive weights independent of the data X1, . . . , Xn.

(i) Suppose that ‖(P∘n − P)m‖Γ → 0 in outer probability and there exists a point ψ0 such that Pm(ψ0) > sup_{ψ∉G} Pm(ψ), for every open set G that contains ψ0. Then any sequence ψ∘n, such that P∘n m(ψ∘n) ≥ sup_ψ P∘n m(ψ) − oP(1), satisfies ψ∘n → ψ0 in outer probability.

(ii) Suppose that ‖(P∘n − P)m‖K → 0 in outer probability for every compact K ⊂ Γ and that the map ψ → Pm(ψ) is upper semicontinuous with a unique maximizer at ψ0. Then the same conclusion given in (i) is true provided that the sequence ψ∘n is uniformly tight.

Corollary 21.4 Rate of convergence of weighted M-estimators: Let m and P∘n be as defined in Corollary 21.3 above. Assume also that for every ψ in a neighborhood of ψ0, P(m(ψ) − m(ψ0)) ≲ −d²(ψ, ψ0). Suppose also that, for every n and sufficiently small δ, the centered process (P∘n − P)(m(ψ) − m(ψ0)) satisfies

E∗ sup_{d(ψ,ψ0)<δ} |(P∘n − P)(m(ψ) − m(ψ0))| ≲ φn(δ)/√n,

for a function φn such that δ → φn(δ)/δ^c is decreasing for some c < 2. Let rn satisfy rn² φn(1/rn) ≤ √n, for every n. If the sequence ψ∘n satisfies

P∘n m(ψ∘n) ≥ P∘n m(ψ0) − oP(rn^{−2})

and converges in outer probability to ψ0, then rn d(ψ∘n, ψ0) = OP(1).

Assume hereafter that the estimator (θ∘n, η∘n) satisfies

P∘n m̃(θ∘n, η∘n) = Pn ξ m̃(θ∘n, η∘n) = oP(n^{−1/2}).   (21.14)

We now investigate the unconditional limiting distribution of θ∘n:

Corollary 21.5 Replace all m̃ in Theorem 21.1 with ξm̃. Suppose that the estimator (θ∘n, η∘n) satisfies Equation (21.14) and Conditions A1–A4 in Theorem 21.1. Then

√n(θ∘n − θ0) = −√n {P (m11(θ0, η0) − m21(θ0, η0)[A∗])}^{−1} × P∘n(m1(θ0, η0) − m2(θ0, η0)[A∗]) + oP(1).

Thus √n(θ∘n − θ0) is asymptotically normally distributed with variance (1 + v0)I∗, where I∗ is as defined in Condition A2.

We can also obtain results for weighted M-estimators similar to those in Corollary 21.2:

Corollary 21.6 Suppose the estimator (θ∘n, η∘n) satisfies

P∘n m1(θ∘n, η∘n) = oP(n^{−1/2}), and P∘n m2(θ∘n, η∘n)[a] = oP(n^{−1/2}),   (21.15)

for any a ∈ A. If we replace m1 and m2 with ξm1 and ξm2, respectively, then for estimators satisfying Equation (21.15) and all of the conditions of Corollary 21.2, the conclusions of Corollary 21.5 hold.

The proofs of Corollaries 21.5 and 21.6 are direct extensions of the proofs of Theorem 21.1 and Corollary 21.2, respectively, and we save the details as an exercise (see Exercise 21.6.2).

The above results can be used to justify the use of the weighted bootstrap for general M-estimators.

Theorem 21.7 Suppose the M-estimator $\hat\theta_n$ and the weighted M-estimator $\hat\theta_n^\circ$ satisfy:

$$\sqrt{n}(\hat\theta_n - \theta_0) = I_0^{-1}\sqrt{n}\,\mathbb{P}_n m + o_P(1), \quad\text{and}\quad \sqrt{n}(\hat\theta_n^\circ - \theta_0) = I_0^{-1}\sqrt{n}\,\mathbb{P}_n^\circ m + o_P(1).$$

Assume also that the conclusions of Theorem 21.1 and Corollary 21.5 hold. Then we have $\sqrt{n}(\hat\theta_n^\circ - \hat\theta_n) = I_0^{-1}\sqrt{n}(\mathbb{P}_n^\circ - \mathbb{P}_n)m + o_P(1)$. Since $E(\xi) = 1$ and $\xi$ is independent of $X_n$, $\sqrt{n/v_0}(\hat\theta_n^\circ - \hat\theta_n) \overset{P_\xi}{\rightsquigarrow} Z_0$, where $\overset{P_\xi}{\rightsquigarrow}$ denotes conditional convergence given the data $X_n$ and $Z_0$ is mean zero Gaussian with covariance $I^*$.


21. Semiparametric M-Estimation

Thus the weighted bootstrap is asymptotically valid for inference on $\theta$. Another widely used inference tool is the nonparametric bootstrap based on multinomial weights, as discussed in Chapter 10. Consistency of the nonparametric bootstrap estimators can be established along the lines of Part (ii) of Theorems 2.6 and 2.7. However, convergence rate and asymptotic normality results (such as those in Corollary 21.5 above) are quite difficult to establish for the nonparametric bootstrap, especially for models with parameters not estimable at the $\sqrt{n}$ rate.

For the weighted bootstrap, once the asymptotic properties of the ordinary semiparametric M-estimators are established along the lines of Section 21.1, the weighted bootstrap can be verified almost automatically. The entropy control results in Section 21.3 below play an important role in connecting properties of ordinary M-estimators with the validity of the weighted bootstrap. The ease of validating the weighted bootstrap will be illustrated with the examples studied in Section 21.4.
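The mechanics of Theorem 21.7 can be sketched for the simplest M-estimator, the sample mean, where $I_0$ is the identity (an illustration, not from the text): draw i.i.d. positive weights $\xi$ with $E(\xi) = 1$ and $\mathrm{var}(\xi) = v_0$, recompute the weighted estimator, and rescale the centered difference by $\sqrt{n/v_0}$. The exponential(1) weights, the sample size, and the data-generating choices below are all assumptions of the sketch.

```python
import math
import random

def weighted_bootstrap_se(x, n_boot=2000, seed=0):
    """Weighted-bootstrap standard error for the sample mean.

    Uses i.i.d. exponential(1) weights xi (so E(xi) = 1, v0 = var(xi) = 1),
    the self-normalized weighted mean theta_circ = sum(xi_i x_i)/sum(xi_i),
    and the rescaling sqrt(n / v0)*(theta_circ - theta_hat) of Theorem 21.7."""
    rng = random.Random(seed)
    n = len(x)
    theta_hat = sum(x) / n
    v0 = 1.0
    reps = []
    for _ in range(n_boot):
        xi = [rng.expovariate(1.0) for _ in range(n)]
        theta_circ = sum(w * v for v, w in zip(x, xi)) / sum(xi)
        reps.append(math.sqrt(n / v0) * (theta_circ - theta_hat))
    mean = sum(reps) / n_boot
    sd = math.sqrt(sum((r - mean) ** 2 for r in reps) / (n_boot - 1))
    # the SD of the rescaled replicates estimates the SD of sqrt(n)(theta_hat - theta_0)
    return sd / math.sqrt(n)

rng = random.Random(42)
x = [rng.gauss(0.0, 2.0) for _ in range(400)]
print(round(weighted_bootstrap_se(x), 3))  # should be close to the true SE 2/sqrt(400) = 0.1
```

The same recipe applies to any M-estimator satisfying the displayed expansions: only the criterion being reweighted changes.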

21.3 Entropy Control

The asymptotic behavior of M-estimators is often closely related to certain entropy integrals, usually of the set $\{m(\theta, \eta) : \theta \in \Theta, \eta \in \mathcal{H}\}$. For weighted M-estimators, the function classes of interest are composed of functions multiplied by independent weights. The implication of the results in this section is that, in many situations, both the weighted and unweighted M-estimators can be controlled by the same entropy integral bounds. Practically, this means that the corresponding rates of convergence will also be the same. The following three lemmas are essentially generalizations of Theorems 8.4, 11.2 and 11.3, respectively. We encourage the reader to briefly review the measurability and entropy definitions and results of Chapters 8 and 9. Recall also the definitions of $J^*(\delta, \mathcal{F})$, $J^*_{[]}(\delta, \mathcal{F})$, and $J(\delta, \mathcal{F}, \|\cdot\|)$ of Section 11.1.

Lemma 21.8 (Entropy control with covering number) Let $\mathcal{F}$ be a $P$-measurable class of measurable functions with measurable envelope $F$. Let $\xi$ be a positive random variable with $E(\xi) = 1$ and $\mathrm{var}(\xi) = v_0$, for $0 \le v_0 < \infty$. Then

$$E^*\left[\|\mathbb{G}_n(\xi f)\|_{\mathcal{F}}\right] \le k J^*(1, \mathcal{F})\|F\|_{P,2},$$

where $k$ does not depend on $\mathcal{F}$, $\xi$ or $F$, and where $\|\cdot\|_{\mathcal{F}}$ denotes the supremum over all $f \in \mathcal{F}$.

Proof. Since $\|\xi(f_1 - f_2)\|_{Q,2} \le \|\xi\|_{Q,2}\|f_1 - f_2\|_{Q,2}$, we have

$$N\left(\epsilon\|\xi\|_{Q,2}\|F\|_{Q,2},\ \xi\mathcal{F},\ L_2(Q)\right) \le N\left(\epsilon\|F\|_{Q,2},\ \mathcal{F},\ L_2(Q)\right). \qquad (21.16)$$

Also, since $\mathcal{F}$ is measurable, we have that $\xi\mathcal{F}$ is measurable.


Set $\mathbb{G}_n^\cdot = n^{-1/2}\sum_{i=1}^n \epsilon_i f(X_i)$, where the $\epsilon_i$'s are i.i.d. Rademacher random variables, defined by $\mathrm{P}(\epsilon = 1) = \mathrm{P}(\epsilon = -1) = 1/2$. Set $\psi_2(y) = e^{y^2} - 1$. Then by Theorem 8.4, we have

$$\left\|\,\|\mathbb{G}_n^\cdot\|_{\xi\mathcal{F}}\,\right\|_{\psi_2|X,\xi} \lesssim \int_0^{\tau_n} \sqrt{1 + \log N(\epsilon, \xi\mathcal{F}, L_2(\mathbb{P}_n))}\, d\epsilon, \qquad (21.17)$$

for $\tau_n = \|\,\|\xi f\|_n\,\|_{\mathcal{F}}$, where "$\lesssim$" means "bounded above up to a universal constant". Here $\|\cdot\|_n$ is the $L_2(\mathbb{P}_n)$-seminorm and $\|\cdot\|_{\psi_2}$ is the usual $\psi_2$ Orlicz norm. Setting $\epsilon = u\|\xi\|_n\|F\|_n$ and changing variables in (21.17), we obtain

$$\int_0^{\tau_n} \sqrt{1 + \log N(\epsilon, \xi\mathcal{F}, L_2(\mathbb{P}_n))}\, d\epsilon = K\int_0^{\tau_n/(\|\xi\|_n\|F\|_n)} \sqrt{1 + \log N(u\|\xi\|_n\|F\|_n, \xi\mathcal{F}, L_2(\mathbb{P}_n))}\,\|\xi\|_n\|F\|_n\, du \le K\int_0^1 \sqrt{1 + \log N(u\|F\|_n, \mathcal{F}, L_2(\mathbb{P}_n))}\,\|\xi\|_n\|F\|_n\, du.$$

Hence we can conclude $E\|\mathbb{G}_n^\cdot\|_{\xi\mathcal{F}} \lesssim J^*(1, \mathcal{F})\left(E\|F\|_n^2\right)^{1/2}$ after taking expectations. Thus the desired result holds by symmetrization (Theorem 8.8) and the fact that $\|\cdot\|_{\psi_2}$ dominates the associated $L_2$ norm.

Lemma 21.9 (Entropy control with bracketing I) Let $\mathcal{F}$ be a class of measurable functions with envelope $F$, and let $\xi$ be as in Lemma 21.8. Then

$$E^*\left[\|\mathbb{G}_n(\xi f)\|_{\mathcal{F}}\right] \le k\sqrt{1 + v_0}\, J^*_{[]}(1, \mathcal{F})\|F\|_{P,2},$$

where $k$ does not depend on $\mathcal{F}$, $\xi$ or $F$.

Proof. By Theorem 11.2, we have

$$E\left[\|\mathbb{G}_n(\xi f)\|^*_{\mathcal{F}}\right] \lesssim J^*_{[]}(1, \xi\mathcal{F})\,\|\xi F\|_{P,2} = J^*_{[]}(1, \xi\mathcal{F})\sqrt{1 + v_0}\,\|F\|_{P,2}.$$

Since $\|\xi f_1 - \xi f_2\|_{P,2} = \sqrt{1 + v_0}\,\|f_1 - f_2\|_{P,2}$ and $\|\xi F\|_{P,2} = \sqrt{1 + v_0}\,\|F\|_{P,2}$, we have

$$N_{[]}\left(\epsilon\sqrt{1 + v_0}\,\|F\|_{P,2},\ \xi\mathcal{F},\ L_2(P)\right) = N_{[]}\left(\epsilon\|F\|_{P,2},\ \mathcal{F},\ L_2(P)\right).$$

Thus $J^*_{[]}(1, \xi\mathcal{F}) = J^*_{[]}(1, \mathcal{F})$, and the desired result follows.

Lemma 21.10 (Entropy control with bracketing II) Let $\mathcal{F}$ be a class of measurable functions with $Pf^2 < \delta^2$ and $\|f\|_\infty \le M < \infty$. Also let $\xi$ be as defined in Lemma 21.8, except that we require $\xi \le c$ for some fixed constant $c < \infty$. Then

$$E^*\left[\|\mathbb{G}_n(\xi f)\|_{\mathcal{F}}\right] \le k\sqrt{1 + v_0}\, J_{[]}(\delta, \mathcal{F}, L_2(P))\left(1 + \frac{J_{[]}(\delta, \mathcal{F}, L_2(P))}{\sqrt{1 + v_0}\,\delta^2\sqrt{n}}\, cM\right),$$

where $k$ does not depend on $\mathcal{F}$, $\xi$ or $F$.


Proof. By Theorem 11.3, we have

$$E^*\|\mathbb{G}_n\|_{\xi\mathcal{F}} \lesssim J_{[]}\!\left(\delta\sqrt{1 + v_0}, \xi\mathcal{F}, L_2(P)\right)\left(1 + \frac{J_{[]}\!\left(\delta\sqrt{1 + v_0}, \xi\mathcal{F}, L_2(P)\right)}{(1 + v_0)\delta^2\sqrt{n}}\, cM\right).$$

We also have $\|\xi F\|_{P,2} \le \sqrt{1 + v_0}\,\delta$, $\|\xi F\|_\infty \le cM$, and $\|\xi f_1 - \xi f_2\|_{P,2} = \sqrt{1 + v_0}\,\|f_1 - f_2\|_{P,2}$. From the definition of $J_{[]}$, we have

$$J_{[]}(\delta, \xi\mathcal{F}, L_2(P)) = \int_0^\delta \sqrt{1 + \log N_{[]}(\epsilon, \xi\mathcal{F}, L_2(P))}\, d\epsilon = \int_0^\delta \sqrt{1 + \log N_{[]}\!\left(\frac{\epsilon}{\sqrt{1 + v_0}}, \mathcal{F}, L_2(P)\right)}\, d\epsilon = \int_0^{\delta/\sqrt{1 + v_0}} \sqrt{1 + \log N_{[]}(u, \mathcal{F}, L_2(P))}\,\sqrt{1 + v_0}\, du = \sqrt{1 + v_0}\, J_{[]}\!\left(\frac{\delta}{\sqrt{1 + v_0}}, \mathcal{F}, L_2(P)\right).$$

Hence the desired result follows.

We note that these maximal inequalities all permit $v_0 = 0$. Hence the setting where $\xi = 1$ almost surely is a special case of the above lemmas. More precisely, the conclusions of Theorems 11.1, 11.2 and 11.3 are respectively implied by the above lemmas.

The previous three lemmas are enough for our examples, but there are many settings where other maximal inequalities may be needed. For a more complete reference on maximal inequalities without the random weights, see Section 11.1 of this book, Chapter 6 of van de Geer (2000), and Sections 2.14 and 3.4 of VW.
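The change-of-variables identity at the heart of the last proof, $J_{[]}(\delta, \xi\mathcal{F}, L_2(P)) = \sqrt{1+v_0}\, J_{[]}(\delta/\sqrt{1+v_0}, \mathcal{F}, L_2(P))$, can also be checked numerically. The sketch below (an illustration, not from the text) assumes a hypothetical bracketing entropy $\log N_{[]}(\epsilon, \mathcal{F}, L_2(P)) = 1/\epsilon$ and models the weighted class through the relation $N_{[]}(\epsilon, \xi\mathcal{F}, L_2(P)) = N_{[]}(\epsilon/\sqrt{1+v_0}, \mathcal{F}, L_2(P))$ established in the proof.

```python
import math

def entropy_F(eps):
    # hypothetical bracketing entropy: log N_[](eps, F, L2(P)) = 1/eps
    return 1.0 / eps

def J_bracket(delta, log_N, steps=200000):
    # J_[](delta) = integral_0^delta sqrt(1 + log N(eps)) d eps, midpoint rule
    h = delta / steps
    return sum(math.sqrt(1.0 + log_N((i + 0.5) * h)) * h for i in range(steps))

v0 = 0.75
scale = math.sqrt(1.0 + v0)

def entropy_xiF(eps):
    # weighted class: N_[](eps, xiF) = N_[](eps / sqrt(1 + v0), F)
    return entropy_F(eps / scale)

delta = 0.5
lhs = J_bracket(delta, entropy_xiF)
rhs = scale * J_bracket(delta / scale, entropy_F)
print(abs(lhs - rhs) < 1e-6)  # the two sides of the identity agree
```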

21.4 Examples Continued

We now study the models presented in Section 21.1.1 in greater detail. We only include the relevant model assumptions here. For demonstration purposes and for clarity, some assumptions are made stronger here than in the original references.

21.4.1 Cox Model with Current Status Data (Example 1, Continued)

The MLE for the Cox model with current status data was studied extensively in Section 19.2.3, and elsewhere in Chapter 19, and in both Huang (1996) and Section 25.11.1 of van der Vaart (1998). The OLS estimator has not been studied previously. The main assumptions include:


E1.1: $\theta \in \Theta$ and $Z \in \mathcal{B}$, where $\Theta$ and $\mathcal{B}$ are known compact subsets of $\mathbb{R}^k$.

E1.2: There exists a known $K$ such that $0 < 1/K < \Lambda < K < \infty$.

E1.3: The event time $T$ and censoring time $Y$ are both bounded by a known constant, with $Y \in [\sigma, \tau]$, where $0 < \sigma < \tau < \infty$.

E1.4: The event time and censoring time are conditionally independent given $Z$.

We need the following technical tool. For a proof (which we omit), see Lemma 25.84 of van der Vaart (1998):

Lemma 21.11 Under Conditions E1.1–E1.4, there exists a constant $C$ such that, for every $\epsilon > 0$, and for

$$\mathcal{M}_1 = \left\{\delta\log\left(1 - \exp[-\exp(\theta' Z)\Lambda]\right) - (1 - \delta)\exp(\theta' Z)\Lambda : \theta \in \Theta,\ \Lambda \in \mathcal{L}\right\}, \quad\text{and}$$
$$\mathcal{M}_2 = \left\{\left(1 - \delta - \exp[-\exp(\theta' Z)\Lambda]\right)^2 : \theta \in \Theta,\ \Lambda \in \mathcal{L}\right\},$$

where $\mathcal{L}$ denotes all nondecreasing functions $\Lambda$ with $1/K \le \Lambda(\sigma) \le \Lambda(\tau) \le K$, we have

$$\log N_{[]}(\epsilon, \mathcal{M}_j, L_2(P)) \le C\left(\frac{1}{\epsilon}\right), \quad\text{for } j = 1, 2. \qquad (21.18)$$

Consistency.

The parameter set for $\theta$ is compact by assumption, and the parameter set for $\Lambda$ is compact relative to the weak topology (the topology used for weak convergence of measures, as done in Part (viii) of Theorem 7.6). Consistency of the MLE and OLS can be obtained by the argmax theorem (Theorem 14.1), as discussed for the MLE in Section 25.11.1 of van der Vaart (1998). The consistency argument for the OLS is quite similar, and we omit the details.

Rate of convergence.

Define $d((\theta, \Lambda), (\theta_0, \Lambda_0)) = |\theta - \theta_0| + \|\Lambda - \Lambda_0\|_2$, where $\|\cdot\|_2$ is the $L_2$ norm on $[\sigma, \tau]$. Combining Inequality (21.18) from Lemma 21.11 with Lemma 21.9 above and taking $\phi_n(\delta) = \sqrt{\delta}\left(1 + \sqrt{\delta}/(\delta^2\sqrt{n})\right)$ in Corollary 21.4 above, we can conclude a convergence rate of $n^{1/3}$ for both $|\hat\theta_n - \theta_0|$ and $\|\hat\Lambda_n - \Lambda_0\|_2$. This result holds for both the MLE and OLS, since they have the same consistency and entropy results. The $n^{1/3}$ convergence rate is optimal for the estimation of $\Lambda$, as discussed in Groeneboom and Wellner (1992).
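The $n^{1/3}$ rate can be read off numerically from the requirement $r_n^2\phi_n(1/r_n) \le \sqrt{n}$ of Corollary 21.4: with $\phi_n(\delta) = \sqrt{\delta}(1 + \sqrt{\delta}/(\delta^2\sqrt{n}))$, the ratio $r_n^2\phi_n(1/r_n)/\sqrt{n}$ stays bounded (at the immaterial constant 2) when $r_n = n^{1/3}$, but diverges for a faster candidate such as $r_n = n^{0.4}$. The following sketch, an illustration rather than part of the text, checks this:

```python
def phi_n(delta, n):
    # modulus from the Cox current status example:
    # phi_n(d) = sqrt(d) * (1 + sqrt(d) / (d^2 sqrt(n)))
    return delta ** 0.5 * (1.0 + delta ** 0.5 / (delta ** 2 * n ** 0.5))

def rate_ratio(exponent, n):
    # r_n^2 phi_n(1/r_n) / sqrt(n) for the candidate rate r_n = n^exponent;
    # Corollary 21.4 needs this bounded in n (constants are immaterial)
    r = n ** exponent
    return r ** 2 * phi_n(1.0 / r, n) / n ** 0.5

for n in (1e4, 1e8, 1e12):
    print(round(rate_ratio(1 / 3, n), 3), round(rate_ratio(0.4, n), 3))
```

The first column stays at 2 for all $n$, while the second grows without bound, ruling out the faster rate.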


$\sqrt{n}$ consistency and asymptotic normality.

For the MLE, define $m(\theta, \Lambda) = \left\{Z\Lambda - \phi(\Lambda)\,(h_{00}\circ\Lambda_0^{-1})\Lambda\right\} Q(X; \theta, \Lambda)$, where $Q$, $\phi$ and $h_{00}$ are as defined in Section 19.2.3. The following "finite variance" assumption is needed:

E1.5: $P\left[(Z\Lambda_0(Y) - h_{00}(Y))\, Q(X; \theta_0, \Lambda_0)\right]^{\otimes 2}$ is non-singular.

$\sqrt{n}$ consistency and asymptotic normality of $\hat\theta_n$ can then be proved by the "efficient score" approach given in Section 25.11.1 of van der Vaart (1998). This is a special case of Theorem 21.1.

For the OLS, on the other hand, set

$$m_1 = 2Z e^{\theta' Z}\Lambda(Y)\exp(-e^{\theta' Z}\Lambda(Y))\left(1 - \delta - \exp(-e^{\theta' Z}\Lambda(Y))\right), \quad\text{and}$$
$$m_2[a] = 2 e^{\theta' Z}\exp(-e^{\theta' Z}\Lambda(Y))\left(1 - \delta - \exp(-e^{\theta' Z}\Lambda(Y))\right) a(Y),$$

where $a \in \mathcal{A}$, and $\mathcal{A} = L_2^0(Y)$, the set of all mean zero, square-integrable functions of $Y$. Consider the estimator $(\hat\theta_n, \hat\eta_n)$ satisfying

$$\mathbb{P}_n m_1(\hat\theta_n, \hat\eta_n) = o_P(n^{-1/2}) \quad\text{and}\quad \mathbb{P}_n m_2(\hat\theta_n, \hat\eta_n)[a] = o_P(n^{-1/2}),$$

for all $a \in \mathcal{A}$. We now prove the $\sqrt{n}$ consistency and asymptotic normality of $\hat\theta_n$ using Corollary 21.2. From the rate of convergence results discussed above, Condition A1 is satisfied with $c_1 = 1/3$. Condition B3 can be verified with the entropy result (21.18) and Lemma 21.10. Combining Taylor expansion with the differentiability of the functions involved in $m_1$ and $m_2$, we can see that Condition B4 is satisfied with $c_2 = 2$. Now we check Condition A2.

Define $\Lambda_t = \Lambda_0 + ta$ for a proper perturbation direction $a$. Set

$$m_{12}[a] = 2Z e^{\theta' Z}\exp(-e^{\theta' Z}\Lambda)\left((1 - \Lambda e^{\theta' Z})(1 - \delta - \exp(-e^{\theta' Z}\Lambda)) + \Lambda e^{\theta' Z}\exp(-e^{\theta' Z}\Lambda)\right) a \equiv L(\theta, \Lambda)a$$

and

$$m_{22}[a_1][a_2] = -2 e^{2\theta' Z}\exp(-e^{\theta' Z}\Lambda)\left[1 - \delta - 2\exp(-e^{\theta' Z}\Lambda)\right] a_1 a_2 \equiv R(\theta, \Lambda)a_1 a_2,$$

for $a_1, a_2 \in \mathcal{A}$. Condition A2 requires $P(m_{12}[A] - m_{22}[A^*][A]) = 0$, for all $A \in \mathcal{A}^k$, which will be satisfied by $A^* \equiv E(L(\theta, \Lambda)|Y)/E(R(\theta, \Lambda)|Y)$. Simple calculations give

$$m_{11} = 2Z^{\otimes 2}\Lambda e^{\theta' Z}\exp(-e^{\theta' Z}\Lambda)\left[(1 - \Lambda e^{\theta' Z})(1 - \delta - \exp(-e^{\theta' Z}\Lambda)) + \Lambda e^{\theta' Z}\exp(-e^{\theta' Z}\Lambda)\right]$$


and $m_{21}[a] = L(\theta, \Lambda)a$. Define

$$I^* = P(m_{11} - m_{21}[A^*])^{-1}\, P\left[m_1 - m_2[A^*]\right]^{\otimes 2}\, P(m_{11} - m_{21}[A^*])^{-1},$$

and assume

E1.5′: $0 < \det(I^*) < \infty$.

Then all conditions of Corollary 21.2 are satisfied, and the desired $\sqrt{n}$ consistency and asymptotic normality of $\hat\theta_n$ follows.

Validity of the weighted bootstrap.

For the random weights $\xi$, we assume

E1.6: There exists a constant $c$ such that $\xi < c < \infty$.

Consistency of the weighted M-estimators can be established by Corollary 21.3, following the same arguments as for the ordinary M-estimators. With Condition E1.6 and Lemma 21.10, we can apply Corollary 21.4 to establish a convergence rate of $n^{1/3}$ for all parameters.

For the $\sqrt{n}$ consistency and asymptotic normality of $\hat\theta_n^\circ$, Corollaries 21.5 and 21.6 are applied to the weighted MLE and the weighted least squares estimator, respectively. For both estimators, Conditions A2 and A4 have been verified previously. We now only need to check Conditions A1 and A3 for the MLE and A1 and B3 for the least squares estimator. Condition A1 is satisfied with $c_1 = 1/3$ from the convergence rate results for both estimators. Conditions A3 and B3 can be checked directly with the entropy result (21.18) and Lemma 21.10. Hence the $\sqrt{n}$ consistency and asymptotic normality of the weighted estimators for $\theta$ follow. Based on Theorem 21.7, the validity of the weighted bootstrap follows for both the MLE and OLS.

21.4.2 Binary Regression Under Misspecified Link Function (Example 2, Continued)

Denote the true value of $(\theta, h)$ as $(\theta_0, h_0)$. Under misspecification, it is assumed that

$$P_{\theta,h}(Y = 1|Z = z, U = u) = \psi(\theta' z + h(u)), \qquad (21.19)$$

where $\psi \ne \phi$ is an incorrect link function. When the model is misspecified, the maximizer $(\theta, h)$ of the likelihood function is not necessarily $(\theta_0, h_0)$. The maximizer $(\theta, h)$ can also be viewed as a minimizer of the Kullback–Leibler discrepancy, which is defined as

$$-P\left[\log\frac{P_{\theta,h}}{P_0}\right] = -P\Big[\left\{Y\log\psi(\theta' Z + h) + (1 - Y)\log(1 - \psi(\theta' Z + h))\right\} - \left\{Y\log\phi(\theta_0' Z + h_0) + (1 - Y)\log(1 - \phi(\theta_0' Z + h_0))\right\}\Big].$$


Here the expectation is taken with respect to the true underlying distribution. Thus $(\theta, h)$ can be viewed as a parameter of the true distribution $P$.

The following model assumptions are needed.

E2.1: $\left(\frac{\psi^{(1)}}{\psi}\right)^{(1)} < 0$, $\left(-\frac{\psi^{(1)}}{1-\psi}\right)^{(1)} < 0$, and $\left(\frac{\phi^{(1)}}{\phi}\right)^{(1)} < 0$, $\left(-\frac{\phi^{(1)}}{1-\phi}\right)^{(1)} < 0$.

E2.2: For simplicity, we assume $U \in [0, 1]$ and $Eh(U) = c$ for a known constant $c$. We estimate $h$ under the constraint $\mathbb{P}_n\hat h_n = c$.

E2.3: $\theta \in B_1$ and $Z \in B_2$, where $B_1$ and $B_2$ are both compact sets in $\mathbb{R}^k$.

E2.4: $\mathrm{var}(Z - E(Z|U))$ is non-singular.

Condition E2.1 is satisfied by many link functions, including the logit link (corresponding to $\psi(u) = e^u/(1 + e^u)$), the probit link (corresponding to $\psi(u) = \Phi(u)$), and the complementary log-log link (corresponding to $\psi(u) = 1 - \exp(-e^u)$).
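The derivative conditions in E2.1 can be spot-checked numerically for a given link; the sketch below (an illustration, not from the text) does this for the logit link, approximating the outer derivative by central finite differences on a grid.

```python
import math

def logit(u):
    return 1.0 / (1.0 + math.exp(-u))

def dlogit(u):
    p = logit(u)
    return p * (1.0 - p)  # psi^(1) for the logit link

def deriv(f, u, h=1e-5):
    # central finite-difference approximation to f'(u)
    return (f(u + h) - f(u - h)) / (2.0 * h)

# E2.1 asks that (psi'/psi)' < 0 and (-psi'/(1 - psi))' < 0
def g1(u):
    return dlogit(u) / logit(u)            # psi'/psi

def g2(u):
    return -dlogit(u) / (1.0 - logit(u))   # -psi'/(1 - psi)

grid = [x / 10.0 for x in range(-50, 51)]
print(all(deriv(g1, u) < 0 and deriv(g2, u) < 0 for u in grid))
```

For the logit link this also follows analytically, since $\psi'/\psi = 1 - \psi$ and $-\psi'/(1-\psi) = -\psi$, both of which are strictly decreasing.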

Combining the second entropy result from Theorem 9.21 with the conditions that $\theta$ and $Z$ are both bounded and $\psi$ is continuously differentiable, we can conclude that

$$\log N\left(\epsilon, \{\psi(\theta' Z + h(U))\}, L_2(Q)\right) \le C\epsilon^{-1/s}, \qquad (21.20)$$

where $h \in \mathcal{H}$, $\theta \in B_1$, $Z \in B_2$, $Q$ ranges over all probability measures, and $C$ is a fixed constant. Equation (21.20), together with the boundedness conditions and the Lipschitz property of the log function, leads to a similar result for the class of log-likelihood functions.

Consistency.

Let $m$ denote the log-likelihood. Combining $\mathbb{P}_n m(\hat\theta_n, \hat h_n) \ge \mathbb{P}_n m(\theta, h)$ with $Pm(\hat\theta_n, \hat h_n) \le Pm(\theta, h)$, we have

$$0 \le P\left(m(\theta, h) - m(\hat\theta_n, \hat h_n)\right) \le (\mathbb{P}_n - P)\left\{m(\hat\theta_n, \hat h_n) - m(\theta, h)\right\}.$$

The entropy result (21.20), the boundedness Assumptions E2.2 and E2.3, and Lemma 21.8 give $\sqrt{n}(\mathbb{P}_n - P)\left\{m(\hat\theta_n, \hat h_n) - m(\theta, h)\right\} = o_P(1)$. Thus we can conclude

$$0 \le P\left(m(\theta, h) - m(\hat\theta_n, \hat h_n)\right) \le o_P(n^{-1/2}).$$

Also, via Taylor expansion,

$$P\left(m(\theta, h) - m(\hat\theta_n, \hat h_n)\right) = -P\left[\left((\theta' Z + h) - (\hat\theta_n' Z + \hat h_n)\right)^2\left\{\left(\frac{\psi^{(1)}(\tilde\theta' Z + \tilde h)}{\psi(\tilde\theta' Z + \tilde h)}\right)^{(1)} Y - \left(\frac{\psi^{(1)}(\tilde\theta' Z + \tilde h)}{1 - \psi(\tilde\theta' Z + \tilde h)}\right)^{(1)}(1 - Y)\right\}\right],$$


where $(\tilde\theta, \tilde h)$ is on the line segment between $(\hat\theta_n, \hat h_n)$ and $(\theta, h)$. Combining Condition E2.1 with Conditions E2.2 and E2.3, we know there exists a scalar $c_0$ such that

$$-\left\{\left(\frac{\psi^{(1)}(\tilde\theta' Z + \tilde h)}{\psi(\tilde\theta' Z + \tilde h)}\right)^{(1)} Y - \left(\frac{\psi^{(1)}(\tilde\theta' Z + \tilde h)}{1 - \psi(\tilde\theta' Z + \tilde h)}\right)^{(1)}(1 - Y)\right\} > c_0 > 0.$$

Hence we can conclude $P\left((\theta' Z + h) - (\hat\theta_n' Z + \hat h_n)\right)^2 = o_P(1)$, which is equivalent to

$$P\left\{(\theta - \hat\theta_n)'(Z - E(Z|U)) + \left[(h - \hat h_n) + (\theta - \hat\theta_n)' E(Z|U)\right]\right\}^2 = o_P(1).$$

Since $\mathrm{var}(Z - E(Z|U))$ is non-singular, the above equation gives us $|\hat\theta_n - \theta| = o_P(1)$ and $\|\hat h_n - h\|_2 = o_P(1)$.
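To see the effect of misspecification concretely, the sketch below (an illustration, not from the text) computes the Kullback–Leibler minimizer $\theta$ when the truth is a probit model with $\theta_0 = 1$ but a logit link is fitted; taking $h = 0$ and $Z$ uniform on $[-2, 2]$ are hypothetical simplifications. The minimizer lands well away from $\theta_0$, as the discussion above anticipates.

```python
import math

def probit(u):  # true link phi (assumed for this sketch)
    return 0.5 * (1.0 + math.erf(u / math.sqrt(2.0)))

def logit(u):   # misspecified fitting link psi
    return 1.0 / (1.0 + math.exp(-u))

theta0 = 1.0
zs = [z / 50.0 for z in range(-100, 101)]  # Z on a uniform grid over [-2, 2]

def kl(theta):
    # Kullback-Leibler discrepancy, up to a constant not depending on theta:
    # -E_Z[ p0(Z) log psi(theta Z) + (1 - p0(Z)) log(1 - psi(theta Z)) ]
    total = 0.0
    for z in zs:
        p0 = probit(theta0 * z)
        p = logit(theta * z)
        total -= p0 * math.log(p) + (1.0 - p0) * math.log(1.0 - p)
    return total / len(zs)

grid = [t / 100.0 for t in range(50, 301)]
theta_star = min(grid, key=kl)
print(theta_star)  # noticeably larger than theta0 = 1.0
```

The value near 1.7 reflects the familiar logistic-to-probit scaling, and it is this pseudo-true parameter, not $\theta_0$, that the misspecified MLE targets.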

Rate of convergence.

Define d((θ1, h1), (θ2, h2)) = |θ1−θ2|+‖h1−h2‖L2 . From the boundednessassumptions, the log-likelihood functions are uniformly bounded. Based onthe entropy result given in Theorem 9.21 and Lemma 21.8, and using thebound of the log-likelihood function as the envelope function, we now have

E∗|√n(Pn − P )(m(θ, h)−m(θn, hn))| ≤ K1δ1−1/(2s),(21.21)

for d((θ, h), (θn, hn)) < δ and a constant K1. This result can then be appliedto Corollary 21.4 above to yield the rate of convergence ns/(2s+1).

$\sqrt{n}$ consistency and asymptotic normality.

Corollary 21.2 is applied here. We can see that Condition A1 is satisfied with $c_1 = s/(2s+1)$; Condition B3 has been shown in (21.21); and Condition B4 is satisfied with $c_2 = 2$ based on the continuity of the involved functions and on Taylor expansion. Now we only need to verify the finite variance condition.

We have

$$m_1 = \left(Y\frac{\psi^{(1)}}{\psi} - (1 - Y)\frac{\psi^{(1)}}{1 - \psi}\right) Z \quad\text{and}\quad m_2[a] = \left(Y\frac{\psi^{(1)}}{\psi} - (1 - Y)\frac{\psi^{(1)}}{1 - \psi}\right) a,$$

for a proper tangent $a$, and where $h_t = h_0 + ta$. We also have, for $a_1, a_2 \in \mathcal{A}$, where $\mathcal{A} = \{a : a \in L_2^0(U) \text{ with } J(a) < \infty\}$, that


$$m_{11} = \left(Y\left(\frac{\psi^{(1)}}{\psi}\right)^{(1)} - (1 - Y)\left(\frac{\psi^{(1)}}{1 - \psi}\right)^{(1)}\right) Z^{\otimes 2},$$
$$m_{12}[a_1] = \left(Y\left(\frac{\psi^{(1)}}{\psi}\right)^{(1)} - (1 - Y)\left(\frac{\psi^{(1)}}{1 - \psi}\right)^{(1)}\right) Z a_1,$$
$$m_{21}[a_1] = \left(Y\left(\frac{\psi^{(1)}}{\psi}\right)^{(1)} - (1 - Y)\left(\frac{\psi^{(1)}}{1 - \psi}\right)^{(1)}\right) Z a_1, \quad\text{and}$$
$$m_{22}[a_1][a_2] = \left(Y\left(\frac{\psi^{(1)}}{\psi}\right)^{(1)} - (1 - Y)\left(\frac{\psi^{(1)}}{1 - \psi}\right)^{(1)}\right) a_1 a_2.$$

The "finite variance" condition requires that there exists $A^* = (a_1^*, \ldots, a_k^*)$, such that for any $A = (a_1, \ldots, a_k)$,

$$P\left\{m_{12}[A] - m_{22}[A^*][A]\right\} = P\left[\left(Y\left(\frac{\psi^{(1)}}{\psi}\right)^{(1)} - (1 - Y)\left(\frac{\psi^{(1)}}{1 - \psi}\right)^{(1)}\right)(ZA' - A^*A')\right] = 0, \qquad (21.22)$$

where $a_i^*, a_i \in \mathcal{A}$. Denote

$$Q \equiv Y\left(\frac{\psi^{(1)}}{\psi}\right)^{(1)} - (1 - Y)\left(\frac{\psi^{(1)}}{1 - \psi}\right)^{(1)}.$$

It is clear that $A^* = E(ZQ|U)/E(Q|U)$ satisfies (21.22). Assume

E2.5: $0 < \det\left(P(m_{11} - m_{21}[A^*])^{-1}\, P\left[m_1 - m_2[A^*]\right]^{\otimes 2}\, P(m_{11} - m_{21}[A^*])^{-1}\right) < \infty$.

Now the $\sqrt{n}$ consistency and asymptotic normality of $\hat\theta_n$ follow.

Validity of the weighted bootstrap.

Consistency of the weighted MLE can be obtained similarly to that of the ordinary MLE, with a reapplication of the entropy result from Lemma 21.8. The rate of convergence $n^{s/(2s+1)}$ for all parameters can be obtained by Corollary 21.4. Unconditional $\sqrt{n}$ consistency and asymptotic normality for the estimator of $\theta$ can be achieved via Corollary 21.6. Thus, from Theorem 21.7, the weighted bootstrap is valid.

21.4.3 Mixture Models (Example 3, Continued)

Consider the mixture model with kernel $p_\theta(x, y|z) = ze^{-zx}\,\theta z e^{-\theta zy}$, where we write the observations as pairs $(X_i, Y_i)$, $i = 1, \ldots, n$. Thus, given the unobservable variable $Z_i = z$, each observation consists of a pair of exponentially distributed variables with parameters $z$ and $\theta z$, respectively. Assume $Z$ has unknown distribution $\eta(z)$.

Existence of the "least-favorable direction" $a^*$ for this model is discussed in Section 3 of van der Vaart (1996). The efficient score function is shown to be

$$\tilde\ell_{\theta,\eta}(x, y) = \frac{\int \frac{1}{2}(x - \theta y)\, z^3\exp(-z(x + \theta y))\, d\eta(z)}{\int \theta z^2\exp(-z(x + \theta y))\, d\eta(z)}. \qquad (21.23)$$

We assume the following conditions hold:

E3.1: $\theta \in B$, where $B$ is a compact set in $\mathbb{R}$.

E3.2: The true mixing distribution $\eta_0$ satisfies $\int (z^2 + z^{-5})\, d\eta_0(z) < \infty$.

E3.3: $P\tilde\ell^2_{\theta_0,\eta_0} > 0$.
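As an illustrative check (not from the text), the efficient score (21.23) has mean zero under the true model. For a discrete mixing distribution $\eta$, the joint density of $(X, Y)$ is $p(x, y) = \int \theta z^2\exp(-z(x + \theta y))\, d\eta(z)$, which is exactly the denominator of (21.23), so $\tilde\ell\, p$ reduces to the numerator; its integral over $x, y > 0$ is zero in closed form. The sketch below confirms this numerically on a truncated grid, with $\theta$ and $\eta$ chosen as hypothetical examples.

```python
import math

theta = 2.0
eta = [(0.5, 1.0), (0.3, 2.0), (0.2, 3.5)]  # hypothetical (weight, z) pairs

def score_times_density(x, y):
    # efficient score (21.23) times the density of (X, Y): the denominator
    # cancels, leaving integral of (1/2)(x - theta*y) z^3 exp(-z(x + theta*y)) d eta(z)
    s = x + theta * y
    return 0.5 * (x - theta * y) * sum(w * z ** 3 * math.exp(-z * s) for w, z in eta)

# E[efficient score] as a double integral over x, y > 0 (midpoint rule,
# truncated where the exponential tail is negligible)
n, upper = 400, 12.0
h = upper / n
total = sum(
    score_times_density((i + 0.5) * h, (j + 0.5) * h)
    for i in range(n) for j in range(n)
) * h * h
print(abs(total) < 1e-3)  # mean zero at the truth, up to discretization error
```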

We need the following technical tool. The result (the proof of which we omit) comes from Lemma L.23 of Pfanzagl (1990).

Lemma 21.12 There exists a weak neighborhood $V$ of the true mixing distribution such that the class $\mathcal{F} \equiv \{\tilde\ell_{\theta,\eta} : \theta \in B, \eta \in V\}$ is Donsker, with

$$\log N_{[]}\left(\epsilon, \{\tilde\ell_{\theta,\eta} : \theta \in B, \eta \in V\}, L_2(P)\right) \le K\left(\frac{1}{\epsilon}\right)^L$$

for fixed constants $K < \infty$ and $1 < L < 2$.

Consistency.

Consistency of $(\hat\theta_n, \hat\eta_n)$ with respect to the product of the Euclidean and weak topologies follows from Kiefer and Wolfowitz (1956), as discussed in van der Vaart (1996).

Rate of convergence.

From Lemma 21.12, we obtain that the entropy integral satisfies $J_{[]}(\epsilon, \mathcal{F}, L_2(P)) \lesssim \epsilon^{1 - L/2}$. Lemma 21.10 and Corollary 21.4 can now be applied to obtain the convergence rate $n^{1/(2+L)}$ for all parameters.

$\sqrt{n}$ consistency and asymptotic normality of $\hat\theta_n$.

We now check the conditions of Theorem 21.1. Condition A1 is satisfied with $c_1 = 1/(2+L)$, as shown above. Condition A2 is Assumption E3.3. See van der Vaart (1996) for a discussion of the tangent set $\mathcal{A}$. Condition A3 can be checked by Lemma 21.10 above. Condition A4 is satisfied with $c_2 = 2$. Hence the $\sqrt{n}$ consistency and asymptotic normality of $\hat\theta_n$ follow.


Validity of the weighted bootstrap.

We now establish properties of the weighted MLE. Consistency can be obtained following the arguments in Kiefer and Wolfowitz (1956), using the same arguments as used for the ordinary MLE. The convergence rate $n^{1/(2+L)}$ for all parameters can be achieved with Corollary 21.4 and the entropy result from Lemma 21.9. The $\sqrt{n}$ consistency and asymptotic normality of the unconditional weighted MLE for $\theta$ can be obtained by Corollary 21.5. Thus we can conclude that the conditional weighted bootstrap is valid by Theorem 21.7.

21.5 Penalized M-estimation

Penalized M-estimators have been studied extensively, for example in Wahba (1990), Mammen and van de Geer (1997), van de Geer (2000, 2001), and Gu (2002). Generally, these estimators do not fit in the framework discussed above because of the extra penalty terms. Another important feature of penalized M-estimators is the difficulty of inference. Although Murphy and van der Vaart (2000) show that inference for the semiparametric MLE can be based on the profile likelihood, their technique is not directly applicable to the penalized MLE. The main difficulty is that the penalty term of the objective function may have too large an order, and thus Condition (3.9) in Murphy and van der Vaart (2000) may not be satisfied in the limit.

We now show that certain penalized M-estimators can be coaxed into the framework discussed above. The examples we investigate include the penalized MLE for binary regression under a misspecified link function, the penalized least squares estimator for partly linear regression, and the penalized MLE for the partly additive transformation model with current status data. We will examine the first example in detail, while the latter two examples will only be considered briefly. The key is that once we can establish that the penalty term is $o_P(n^{-1/2})$, the "nearly-maximizing" condition of Corollary 21.4 (see also (21.12) and (21.6)) is satisfied. After this, all of the remaining analysis can be carried out in the same manner as done for ordinary semiparametric M-estimators.

21.5.1 Binary Regression Under Misspecified Link Function (Example 2, Continued)

As discussed earlier, we can study this model under the relaxed condition $\int_0^1 \left(h^{(s)}(u)\right)^2 du < \infty$ by taking the penalized approach. All other assumptions discussed in Section 21.4 for this model will still be needed. Consider the penalized MLE $(\hat\theta_n, \hat h_n)$ as an estimator of $(\theta, h)$, where


$$(\hat\theta_n, \hat h_n) = \mathop{\mathrm{argmax}}_{(\theta,h)}\left\{\mathbb{P}_n m(\theta, h) - \lambda_n^2 J^2(h)\right\}. \qquad (21.24)$$

In (21.24), $\lambda_n$ is a data-driven smoothing parameter and

$$J^2(h) \equiv \int_0^1 \left(h^{(s)}(u)\right)^2 du,$$

as defined previously. We also assume

E2.6: $\lambda_n = O_P(n^{-s/(2s+1)})$ and $\lambda_n^{-1} = O_P(n^{s/(2s+1)})$.
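To make the shape of (21.24) concrete, the sketch below (an illustration, not from the text) replaces the binomial log-likelihood with a simpler least-squares criterion and approximates $J^2(h)$ by summed squared second differences of a discretized $h$ (the $s = 2$ case); both substitutions are assumptions of the sketch. The penalized minimizer then solves the linear system $(I + \lambda^2 D'D)h = y$, where $D$ is the second-difference operator.

```python
def solve(A, b):
    # Gaussian elimination with partial pivoting for a small dense system
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x

def fit_penalized(y, lam):
    """Minimize sum_i (y_i - h_i)^2 + lam^2 * sum_i (h_{i-1} - 2h_i + h_{i+1})^2,
    a discretized stand-in for the penalized criterion in (21.24)."""
    n = len(y)
    A = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for i in range(1, n - 1):
        d = [0.0] * n
        d[i - 1], d[i], d[i + 1] = 1.0, -2.0, 1.0  # second-difference row
        for j in (i - 1, i, i + 1):
            for k in (i - 1, i, i + 1):
                A[j][k] += lam ** 2 * d[j] * d[k]
    return solve(A, y)

def roughness(h):
    # discrete analogue of the penalty J^2(h)
    return sum((h[i - 1] - 2.0 * h[i] + h[i + 1]) ** 2 for i in range(1, len(h) - 1))

y = [0.1 * i + (0.5 if i % 2 == 0 else -0.5) for i in range(30)]  # trend plus zig-zag
rough = [roughness(fit_penalized(y, lam)) for lam in (0.0, 1.0, 4.0)]
print(rough[0] > rough[1] > rough[2])  # larger penalties give smoother fits
```

The rate conditions on $\lambda_n$ above balance exactly this trade-off between fidelity and roughness as $n$ grows.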

Consistency.

From the entropy result of Theorem 9.21 (with $m$ in the theorem replaced by $s$), we have

$$(\mathbb{P}_n - P)\left(m(\tilde\theta, \tilde h) - m(\theta, h)\right) = (1 + J(\tilde h))\,O_P(n^{-1/2}), \qquad (21.25)$$

where $(\tilde\theta, \tilde h)$ satisfies all the model assumptions (see Exercise 21.6.3). The penalized MLE satisfies

$$\mathbb{P}_n m(\hat\theta_n, \hat h_n) - \lambda_n^2 J^2(\hat h_n) \ge \mathbb{P}_n m(\theta, h) - \lambda_n^2 J^2(h). \qquad (21.26)$$

Combining (21.25) and (21.26), we can conclude

$$0 \le P\left(m(\theta, h) - m(\hat\theta_n, \hat h_n)\right) \le \lambda_n^2 J^2(h) - \lambda_n^2 J^2(\hat h_n) + O_P(n^{-1/2})(1 + J(\hat h_n)). \qquad (21.27)$$

Under the boundedness assumptions, we can now conclude $\lambda_n J(\hat h_n) = o_P(1)$. Hence $0 \le P(m(\theta, h) - m(\hat\theta_n, \hat h_n)) = o_P(1)$. Following the same arguments as used in Section 21.4, we can now conclude the consistency of $(\hat\theta_n, \hat h_n)$.

Rate of convergence.

Denote ψ = θ′Z + h(U), ψ = θ′Z + h(U) and ψn = θ′nZ + hn(U). Taylorexpansion gives us

P (m(θ, h)−m(θn, hn)) = P

(

1

2m(2)(ψ)(ψn − ψ)2

)

,(21.28)

where ψ is on the line segment between ψn and ψ. From the compactnessassumptions, there exist ǫ1 and ǫ2, such that 0 < ǫ1 < m(2) < ǫ2 < ∞almost surely. Combining this result with (21.28), we now have

ǫ1‖ψn − ψ‖2 ≤ P (m(θ, h)−m(θn, hn)) ≤ ǫ2‖ψn − ψ‖2.(21.29)

Careful application of Theorem 15.3, with α = 1/s, combined with the fact

that there exists a constant k <∞ such that |m(ψn)−m(ψ)| ≤ k‖ψn− ψ‖almost surely, yields


$$\left|(\mathbb{P}_n - P)\left(m(\hat\psi_n) - m(\psi)\right)\right| \le O_P(n^{-1/2})(1 + J(\hat h_n))\left(\|\hat\psi_n - \psi\|^{1 - 1/(2s)} \vee n^{-(2s-1)/[2(2s+1)]}\right) \qquad (21.30)$$

(see Exercise 21.6.4). Combining this with inequalities (21.29) and (21.26) and the boundedness assumptions, we have

$$\lambda_n^2 J^2(\hat h_n) + \epsilon_1\|\hat\psi_n - \psi\|^2 \le \lambda_n^2 J^2(h) + O_P(n^{-1/2})(1 + J(\hat h_n))\left(\|\hat\psi_n - \psi\|^{1 - 1/(2s)} \vee n^{-(2s-1)/[2(2s+1)]}\right). \qquad (21.31)$$

Some simple calculations, similar to those used in Section 15.1, yield $J(\hat h_n) = O_P(1)$ and $\|\hat\psi_n - \psi\| = O_P(n^{-s/(2s+1)})$ (see Exercise 21.6.5). Once these results are established, the remaining analysis for the ordinary MLE can be carried out as done in Section 21.4. We can see that, if we replace $\mathbb{P}_n$ with $\mathbb{P}_n^\circ$ in the above analysis, all results hold with only minor modifications needed. Thus we can establish the consistency and rate of convergence results for the weighted MLE similarly. Then the analysis of the weighted MLE for $\theta$ and the validity of the weighted bootstrap can be carried out using the same arguments as used in Section 21.4.

21.5.2 Two Other Examples

Two other penalized M-estimation examples are studied in detail in Ma and Kosorok (2005b). The first example is partly linear regression, where the observed data consist of $(Y, Z, U)$, with $Y = \theta' Z + h(U) + \epsilon$, where $h$ is nonparametric with $J(h) < \infty$ and $\epsilon$ is normally distributed. Penalized least squares is used for estimation and the weighted bootstrap is used for inference. The main result is that the penalized bootstrap $\hat\theta_n^*$ is valid, i.e., that $\sqrt{n}(\hat\theta_n^* - \hat\theta_n) \overset{P_\xi}{\rightsquigarrow} Z$, where $Z$ is the limiting distribution of $\sqrt{n}(\hat\theta_n - \theta_0)$.

The second example is partly linear transformation models for current status data. This model was also studied in Ma and Kosorok (2005a). The observed data are $(V, 1\{U \le V\}, Z, W)$, where $U$ is the failure time of interest, $V$ is a random observation time, $Z$ is a Euclidean covariate and $W$ is a continuous covariate. The model is $H(U) + \theta' Z + h(W) = e$, where $e$ has a known distribution, and $U$ and $V$ are assumed to be independent given $(Z, W)$. The nuisance parameter $\eta = (H, h)$ consists of two components, a transformation function $H$ and a nonparametric regression function $h$. It is assumed that $H$ is monotone and $J(h) < \infty$. Only $h$ is penalized in the maximum likelihood. The main result is, as in the previous paragraph, that the conditional weighted bootstrap is valid for inference on $\theta$. Additional details can be found in Ma and Kosorok (2005b).


21.6 Exercises

21.6.1. Prove Corollaries 21.3 and 21.4. This can be done by clarifying how the presence of random weights affects the arguments used in the proofs of Theorems 2.12 and 14.4. Lemma 14.3 may also be helpful for proving Corollary 21.3.

21.6.2. Prove Corollaries 21.5 and 21.6 using the arguments in the respective proofs of Theorem 21.1 and Corollary 21.2.

21.6.3. Verify that (21.25) holds. Hint: Consider the class

$$\mathcal{F} = \left\{\frac{m(\tilde\theta, \tilde h) - m(\theta, h)}{1 + J(\tilde h)} : \tilde\theta \in \Theta,\ J(\tilde h) < \infty\right\},$$

and verify that Theorem 9.21 can be applied, after some manipulations, to obtain

$$\sup_Q \log N(\epsilon, \mathcal{F}, L_2(Q)) \le M^*\left(\frac{1}{\epsilon}\right)^{1/s},$$

where the supremum is taken over all finitely discrete probability measures and $M^* < \infty$.

21.6.4. Verify that (21.30) holds.

21.6.5. In the context of Section 21.5.1, verify that $J(\hat h_n) = O_P(1)$ and $\|\hat\psi_n - \psi\| = O_P(n^{-s/(2s+1)})$. Hint: Let $A_n \equiv n^{s/(2s+1)}\|\hat\psi_n - \psi\|$, and show that (21.31) implies

$$J^2(\hat h_n) + A_n^2 = O_P(1) + O_P(1)(1 + J(\hat h_n))\left(A_n^{1 - 1/(2s)} \vee 1\right).$$

Thus $A_n^2 = O_P(1) + O_P(1)(1 + J(\hat h_n))A_n^{1 - 1/(2s)}$, and hence $J^2(\hat h_n) = O_P(1)(1 + J(\hat h_n))^{4s/(2s+1)}$ (this step is a little tricky). Thus $J(\hat h_n) = O_P(1)$, and the desired result can now be deduced.

21.7 Notes

As mentioned previously, the material in this chapter is adapted from Ma and Kosorok (2005b). In particular, Examples 1–3 are the first three examples in Ma and Kosorok, and many of the theorems, corollaries and lemmas of this chapter also have corresponding results in Ma and Kosorok. Theorem 21.1 and Corollaries 21.2 through 21.5 correspond to Theorem 1 and Corollaries 1, 3, 4, and 2, respectively, of Ma and Kosorok. Corollary 21.6 is based on Remark 7 of Ma and Kosorok; and Theorem 21.7 and Lemmas 21.8 through 21.12 correspond to Theorem 2, Lemmas 1–3, and Technical Tools T1 and T3, respectively, of Ma and Kosorok.


22 Case Studies III

In this chapter, we examine closely four examples that utilize, or are related to, some of the concepts discussed in Part III. In most cases, concepts and results from Part II also play a key role. A secondary goal in these examples is to illustrate a paradigm for statistical research that can be applied to many different data generating models. This is particularly true for the first example on the proportional odds model under right censoring. This model has been discussed extensively in previous chapters, but it is helpful to organize all the steps that are (or should be) taken in developing inference for this model. A few holes in these steps are filled in along the way.

The second example is a follow-up on the linear regression model introduced in Chapter 1 and discussed in Section 4.1.1. The objective is to develop efficient estimation for the setting where the covariate and residual are known to be independent. The third example is new and is concerned with a time-varying regression model for temporal process data. An interesting issue for functional data—such as temporal process data—is that sometimes optimal efficiency is not possible, although improvements in efficiency within certain classes of estimating equations are possible. Empirical process methods play a crucial role in developing methods of inference for this setting.

The fourth example, also new, considers partly linear regression for repeated measures data. The dependencies between the repeated measures within a single observation produce nontrivial complications in the asymptotic theory. Other complications arise from the lack of boundedness restrictions on the parameter spaces, which is in contrast to the assumptions made for other

Page 434: Springer Series in Statistics - stat.ncsu.edulu/ST7903/Kosorok-Introduction_to... · Springer Series in Statistics Borg/Groenen: Modern Multidimensional Scaling: Theory and Applications,

426 22. Case Studies III

partly linear regression models considered in this book (see, for example,the first paragraph of Section 15.1).

These examples, in combination with other examples presented previously in this book, clearly demonstrate the power of semiparametric and empirical process methods to enlarge the field of possibilities in scientific research for both flexible and meaningful statistical modeling. It is hoped that the reader will not only have gained a meaningful technical background that will facilitate further learning and research but, more importantly, that the reader's imagination has been stimulated with new ideas and insights.

22.1 The Proportional Odds Model Under Right Censoring Revisited

The proportional odds model for right censored data has been studied extensively in Sections 15.3 and 19.2.2 and in several places in Chapter 20, including Exercise 20.3.4. The main steps for estimation and inference that have been discussed for this model are:

• Developing a method of estimation. A maximum log-likelihood approach was developed and discussed in Section 15.3.1.

• Establishing consistency of the estimator. This was accomplished in Sections 15.3.2 and 15.3.3. A preliminary step of establishing existence (Section 15.3.2) was also needed in this instance for the baseline function A because of the omission of boundedness assumptions on A (an omission made to be most realistic in practice).

• Establishing the rates of convergence. This step was addressed indirectly in this instance through weak convergence, since all of the parameters end up being regular.

• Obtaining weak convergence for all regular parameters. This was accomplished in Sections 15.3.4 and 15.3.5. A Z-estimator approach was utilized that involved both score and information operators.

• Establishing efficiency of all regular parameters. This was accomplished for all parameters in the paragraphs following the presentation of Corollary 20.1. This step may be considered optional if maximum likelihood-like inference is difficult to implement or is impractical for some reason. Nevertheless, it is useful and often fruitful to at least attempt to obtain efficiency in estimation.

• Obtaining convergence of non-regular parameters. This step was not needed in this instance. In general, this step is optional if scientific interest is restricted to only the regular parameters.

• Developing a method of inference. The weighted bootstrap was suggested and validated in Section 15.3.5 for this model. For computational simplicity, the profile sampler of Chapter 19 and the piggyback bootstrap of Chapter 20 were also proposed for this model (see Section 19.2.2 and toward the end of Section 20.2). The only part that remains to be done is to verify all of the conditions of Theorems 19.5 and 19.6; this will establish the validity of the profile sampler. The validity of the piggyback bootstrap has already been established toward the end of Section 20.2.

• Studying the properties of estimation and inference under model misspecification. This is an optional, but often useful, step. It was accomplished to some degree for a generalization of the proportional odds model, the class of univariate frailty regression models, in Kosorok, Lee and Fine (2004).

Note that in practice, a researcher may need to iterate between several of these steps before achieving all of the desired conclusions. For instance, it may take a few iterations to arrive at a computationally feasible and efficient estimator, or to arrive at the optimal rate of convergence. In any case, the above list of steps can be considered a general paradigm for estimation and inference, although there are probably other useful paradigms that could serve the same purpose.

The focus of the remainder of this section will be on establishing the conditions of Theorems 19.5 and 19.6, since these are the only missing pieces in establishing the validity of the profile sampler for the proportional odds model under right censoring. All of the remaining steps have been addressed previously.

Recall from Section 19.2.2 that
$$A_t(\beta, A) = \int_0^{(\cdot)}\left(1 + (\beta - t)'h_0(s)\right)dA(s),$$
where $h_0(s) = \left[\sigma_{22}^{\theta_0}\right]^{-1}\sigma_{21}^{\theta_0}(\cdot)(s)$ is as defined in that section, satisfies Conditions 19.2 and 19.3. Hence
$$\begin{aligned}
\dot\ell(t, \beta, A) &= \left(\frac{\partial}{\partial t}\right) l(t, A_t(\beta, A)) \quad(22.1)\\
&= \left(\frac{\partial}{\partial t}\right)\left[\delta\left(\log \Delta A_t(\beta, A)(U) + t'Z\right) - (1 + \delta)\log\left(1 + e^{t'Z}A_t(\beta, A)(U)\right)\right]\\
&= \int_0^\tau \left(Z - \frac{h_0(s)}{1 + (\beta - t)'h_0(s)}\right)\left[dN(s) - \frac{(1 + \delta)Y(s)e^{t'Z}\,dA_t(\beta, A)(s)}{1 + e^{t'Z}A_t(\beta, A)(U \wedge \tau)}\right]
\end{aligned}$$


and
$$\begin{aligned}
\ddot\ell(t, \beta, A) &= \left(\frac{\partial}{\partial t}\right)\dot\ell(t, \beta, A) \quad(22.2)\\
&= -\int_0^\tau \frac{h_0^{\otimes 2}(s)}{\left(1 + (\beta - t)'h_0(s)\right)^2}\left[dN(s) - \frac{(1 + \delta)Y(s)e^{t'Z}\,dA_t(\beta, A)(s)}{1 + e^{t'Z}A_t(\beta, A)(U \wedge \tau)}\right]\\
&\quad - (1 + \delta)\int_0^\tau \left[Z - \frac{h_0(s)}{1 + (\beta - t)'h_0(s)}\right]^{\otimes 2}\frac{Y(s)e^{t'Z}\,dA_t(\beta, A)(s)}{1 + e^{t'Z}A_t(\beta, A)(U \wedge \tau)}\\
&\quad + (1 + \delta)\left\{\int_0^\tau \left[Z - \frac{h_0(s)}{1 + (\beta - t)'h_0(s)}\right]\frac{Y(s)e^{t'Z}\,dA_t(\beta, A)(s)}{1 + e^{t'Z}A_t(\beta, A)(U \wedge \tau)}\right\}^{\otimes 2}.
\end{aligned}$$

Hence it is easy to see that both $(t, \beta, A) \mapsto \dot\ell(t, \beta, A)(X)$ and $(t, \beta, A) \mapsto \ddot\ell(t, \beta, A)(X)$ are continuous for $P_0$-almost every $X$. Although it is tedious, it is not hard to verify that for some uniform neighborhood $V$ of $(\beta_0, \beta_0, A_0)$,

$$\mathcal{F}_1 \equiv \left\{\dot\ell(t, \beta, A) : (t, \beta, A) \in V\right\} \quad(22.3)$$
is $P_0$-Donsker and
$$\mathcal{F}_2 \equiv \left\{\ddot\ell(t, \beta, A) : (t, \beta, A) \in V\right\} \quad(22.4)$$
is $P_0$-Glivenko-Cantelli (see Exercise 22.6.1).

Now consider any sequence $\beta_n \stackrel{P}{\to} \beta_0$, and let $\hat A_{\beta_n}$ be the profile maximizer at $\beta = \beta_n$, i.e., $\hat A_{\beta_n} = \operatorname*{argmax}_A L_n(\beta_n, A)$, where $L_n$ is the log-likelihood as defined in Section 15.3.1. The arguments in Section 15.3.2 can be modified to verify that $\hat A_{\beta_n}(\tau)$ is asymptotically bounded in probability. Since, by definition, $L_n(\beta_n, \hat A_{\beta_n}) \geq L_n(\beta_n, \tilde A_n)$, where $\tilde A_n$ is as defined in Section 15.3.3, we can argue along the lines used in Section 15.3.3 to obtain that $\hat A_{\beta_n}$ is uniformly consistent for $A_0$.

We now wish to strengthen this last result to

$$\left\|\hat A_{\beta_n} - A_0\right\|_{[0,\tau]} = O_P\left(n^{-1/2} + \|\beta_n - \beta_0\|\right), \quad(22.5)$$
for any sequence $\beta_n \stackrel{P}{\to} \beta_0$. If (22.5) holds, then, as shown by Murphy and van der Vaart (2000) in the discussion following their Theorem 1,
$$P\dot\ell(\beta_0, \beta_n, \hat A_{\beta_n}) = o_P\left(n^{-1/2} + \|\beta_n - \beta_0\|\right),$$
and thus both Conditions (19.11) and (19.12) hold. Hence all of the conditions of Theorem 19.5 hold.

We now show (22.5). The basic idea of the proof is similar to arguments given in the proof of Theorem 3.4 in Lee (2000). Recall the definition of $V_{n,2}^\tau$ as given in Expression (15.10) and the space $\mathcal{H}^2_\infty$ from Section 19.2.2; and define, for all $h \in \mathcal{H}^2_\infty$,
$$\hat D_n(A)(h) \equiv V_{n,2}^\tau(\beta_n, A)(h), \qquad D_n(A)(h) \equiv V_{n,2}^\tau(\beta_0, A)(h),$$
and
$$D_0(A)(h) \equiv P V_{n,2}^\tau(\beta_0, A)(h).$$
By definition of a maximizer, $\hat D_n(\hat A_{\beta_n})(h) = 0$ and $D_0(A_0)(h) = 0$ for all $h \in \mathcal{H}^2_\infty$.

By Lemma 13.3,
$$\sqrt{n}(D_n - D_0)(\hat A_{\beta_n}) - \sqrt{n}(D_n - D_0)(A_0) = o_P(1),$$
uniformly in $\ell^\infty(\mathcal{H}^2_1)$, where $\mathcal{H}^2_1$ is the subset of $\mathcal{H}^2_\infty$ consisting of functions of total variation $\leq 1$. By the differentiability of the score function, we also have
$$\sqrt{n}(\hat D_n(A_0) - D_n(A_0)) = O_P(\sqrt{n}\|\beta_n - \beta_0\|),$$
uniformly in $\ell^\infty(\mathcal{H}^2_1)$. Combining the previous two displays, we have
$$\begin{aligned}
\sqrt{n}(D_0(\hat A_{\beta_n}) - D_0(A_0)) &= -\sqrt{n}(\hat D_n(\hat A_{\beta_n}) - D_0(\hat A_{\beta_n}))\\
&= -\sqrt{n}(D_n - D_0)(\hat A_{\beta_n}) + O_P(\sqrt{n}\|\beta_n - \beta_0\|)\\
&= -\sqrt{n}(D_n - D_0)(A_0) + o_P(1) + O_P(\sqrt{n}\|\beta_n - \beta_0\|)\\
&= O_P(1 + \sqrt{n}\|\beta_n - \beta_0\|).
\end{aligned}$$
Note that $D_0(A)(h) = \int_0^\tau \left(\sigma_{22}^{\beta_0,A_0}h\right)(s)\,dA(s)$, where $\sigma_{22}$ is as defined in Section 15.3.4. Since $\sigma_{22}^{\beta_0,A_0}$ is continuously invertible, as shown in Section 19.2.2, there exists some $c > 0$ such that
$$\sqrt{n}\left\|D_0(\hat A_{\beta_n}) - D_0(A_0)\right\|_{\mathcal{H}^2_1} \geq c\sqrt{n}\left\|\hat A_{\beta_n} - A_0\right\|_{[0,\tau]}.$$
Thus (22.5) is satisfied.

The only thing remaining to do for this example is to verify (19.17), after replacing $\theta_n$ with $\beta_n$ and $\Theta$ with $B$, since this will then imply the validity of the conclusions of Theorem 19.6. For each $\beta \in B$, let
$$A_\beta \equiv \operatorname*{argmax}_A P\left[\log\left(\frac{dP_{\beta,A}}{dP_{\beta_0,A_0}}\right)\right],$$
where $P = P_{\beta_0,A_0}$ by definition; let $\hat A_\beta$ denote the corresponding profile maximizer of the empirical log-likelihood at $\beta$; and define
$$\hat\Delta_n(\beta) \equiv \mathbb{P}_n\left[\log\left(\frac{dP_{\beta,\hat A_\beta}}{dP_{\hat\beta_n,\hat A_n}}\right)\right], \quad \Delta_n(\beta) \equiv P\left[\log\left(\frac{dP_{\beta,A_\beta}}{dP_{\hat\beta_n,\hat A_n}}\right)\right] \quad\text{and}\quad \Delta_0(\beta) \equiv P\left[\log\left(\frac{dP_{\beta,A_\beta}}{dP_{\beta_0,A_0}}\right)\right].$$
Theorem 2 of Kosorok, Lee and Fine (2004) is applicable here since the proportional odds model is a special case of the odds rate model with frailty variance parameter $\gamma = 1$. Hence
$$\sup_{\beta\in B}\left|\hat\Delta_n(\beta) - \Delta_n(\beta)\right| = o_P(1).$$
The smoothness of the model now implies
$$\sup_{\beta\in B}\left|\Delta_n(\beta) - \Delta_0(\beta)\right| = o_P(1),$$
and thus
$$\sup_{\beta\in B}\left|\hat\Delta_n(\beta) - \Delta_0(\beta)\right| = o_P(1).$$
As a consequence, we have for any sequence $\beta_n \in B$ that $\hat\Delta_n(\beta_n) = o_P(1)$ implies $\Delta_0(\beta_n) = o_P(1)$. Hence $\beta_n \stackrel{P}{\to} \beta_0$ by model identifiability, and (19.17) follows.

22.2 Efficient Linear Regression

This is a continuation of the linear regression example of Section 4.1.1. The model is $Y = \beta_0'Z + e$, where $Y, e \in \mathbb{R}$, $Z \in \mathbb{R}^k$ lies in a compact set $P$-almost surely, and the regression parameter $\beta_0$ lies in a known, open and bounded subset $B \subset \mathbb{R}^k$. We assume that $e$ has mean zero and unknown variance $\sigma^2 < \infty$ and that $e$ and $Z$ are independent, with $P[ZZ']$ positive definite. An observation from this model consists of the pair $X = (Y, Z)$. We mentioned in Section 4.1.1 that efficient estimation for $\beta_0$ is tricky in this situation. The goal of this section is to keep the promise made in Section 4.1.1 by developing and validating a method of efficient estimation for $\beta_0$. We will also derive a consistent estimator of the efficient Fisher information for $\beta_0$ as well as develop efficient inference for the residual distribution.

Recall that $P$ represents the true distribution and that $\eta_0$ is the density function for $e$. We need to assume that $\eta_0$ has support on all of $\mathbb{R}$ and is three times continuously differentiable with $\eta_0$, $\dot\eta_0$, $\ddot\eta_0$ and $\eta_0^{(3)}$ all uniformly bounded on $\mathbb{R}$. We also need to assume that there exist constants $0 \leq a_1, a_2 < \infty$ and $1 \leq b_1, b_2 < \infty$ such that
$$\frac{\left|\frac{\dot\eta_0}{\eta_0}(x)\right|}{1 + |x|^{a_1}} \leq b_1 \quad\text{and}\quad \frac{\left|\frac{\ddot\eta_0}{\eta_0}(x)\right|}{1 + |x|^{a_2}} \leq b_2,$$
for all $x \in \mathbb{R}$, and $\int_{\mathbb{R}}|x|^{(8a_1)\vee(4a_2)+6}\eta_0(x)\,dx < \infty$. The reason for allowing these functions to grow at a polynomial rate is that at least one very common possibility for $\eta_0$, the Gaussian density, has $\dot\eta_0/\eta_0$ growing at such a rate. In Exercise 22.6.2, this rate is verified, with $a_1 = 1$, $a_2 = 2$, and $b_1 = b_2 = 1$. Hence it seems unrealistic to require stronger conditions such as boundedness.


Method of estimation.

Our basic approach for obtaining efficient estimation will be to estimate the efficient score $\tilde\ell_{\beta,\eta}(x) = -(\dot\eta/\eta)(e)(Z - \mu) + e\mu/\sigma^2$, where $\mu \equiv PZ$, with an estimated efficient score function $\hat\ell_{\beta,n}$ that satisfies the conditions of Theorem 3.1. Recall that the data $(Y, Z)$ and $(e, Z)$ are equivalent for probabilistic analysis even though $e$ cannot be observed directly. The final assumptions we make are that $I_0 \equiv P[\tilde\ell_{\beta_0,\eta_0}\tilde\ell_{\beta_0,\eta_0}']$ is positive definite and that the model is identifiable, in the sense that the map $\beta \mapsto P\tilde\ell_{\beta,\eta_0}$ is continuous and has a unique zero at $\beta = \beta_0$. This identifiability can be shown to hold if the map $x \mapsto (\dot\eta_0/\eta_0)(x)$ is strictly monotone (see Exercise 22.6.3). Identifiability will be needed for establishing consistency of the Z-estimator $\hat\beta_n$ based on the estimating equation $\beta \mapsto \mathbb{P}_n\hat\ell_{\beta,n}$, which will be defined presently.

The first step is to perform ordinary least-squares estimation to obtain the $\sqrt{n}$-consistent estimator $\tilde\beta$ and then compute $\hat F(t) \equiv \mathbb{P}_n 1\{Y - \tilde\beta'Z \leq t\}$. As verified in Section 4.1.1, $\|\hat F - F\|_\infty = O_P(n^{-1/2})$, where $F$ is the cumulative distribution function corresponding to the residual density function $\eta_0$. The second step is to utilize $\hat F$ to estimate $\eta_0$ and its first two derivatives. We will utilize kernel estimators to accomplish this.

For a given sequence $h_n$ of bandwidths, define the kernel density estimator
$$\hat\eta_n(t) \equiv \int_{\mathbb{R}}\frac{1}{h_n}\phi\left(\frac{t-u}{h_n}\right)d\hat F(u),$$
where $\phi$ is the standard normal density. Taking the derivative of $t \mapsto \hat\eta_n(t)$ twice, we can obtain estimators for $\dot\eta_0$:
$$\hat\eta_n^{(1)}(t) \equiv -\int_{\mathbb{R}}\frac{1}{h_n^2}\left(\frac{t-u}{h_n}\right)\phi\left(\frac{t-u}{h_n}\right)d\hat F(u),$$
and for $\ddot\eta_0$:
$$\hat\eta_n^{(2)}(t) \equiv \int_{\mathbb{R}}\frac{1}{h_n^3}\left[\left(\frac{t-u}{h_n}\right)^2 - 1\right]\phi\left(\frac{t-u}{h_n}\right)d\hat F(u).$$

Define $D_n \equiv \hat F - F$ and use integration by parts to obtain
$$\int_{\mathbb{R}}\frac{1}{h_n}\phi\left(\frac{t-u}{h_n}\right)dD_n(u) = -\int_{\mathbb{R}}D_n(u)\frac{1}{h_n^2}\left(\frac{t-u}{h_n}\right)\phi\left(\frac{t-u}{h_n}\right)du.$$
Thus
$$\left|\int_{\mathbb{R}}\frac{1}{h_n}\phi\left(\frac{t-u}{h_n}\right)dD_n(u)\right| \leq O_P(n^{-1/2})\int_{\mathbb{R}}\left|\frac{t-u}{h_n}\right|\phi\left(\frac{t-u}{h_n}\right)h_n^{-2}\,du = O_P(n^{-1/2}h_n^{-1}),$$

where the error terms are uniform in $t$. Moreover, since $\ddot\eta_0(t)$ is uniformly bounded,
$$\int_{\mathbb{R}}\frac{1}{h_n}\phi\left(\frac{t-u}{h_n}\right)\left[\eta_0(u) - \eta_0(t)\right]du = \int_{\mathbb{R}}\frac{1}{h_n}\phi\left(\frac{t-u}{h_n}\right)\left[\dot\eta_0(t)(u-t) + \frac{\ddot\eta_0(t^*)}{2}(u-t)^2\right]du = O(h_n^2),$$

where $t^*$ is in the interval between $t$ and $u$. Thus $\|\hat\eta_n - \eta_0\|_\infty = O_P(n^{-1/2}h_n^{-1} + h_n^2)$. Similar analyses (see Exercise 22.6.4) can be used to show that both $\|\hat\eta_n^{(1)} - \dot\eta_0\|_\infty = O_P(n^{-1/2}h_n^{-2} + h_n^2)$ and $\|\hat\eta_n^{(2)} - \ddot\eta_0\|_\infty = O_P(n^{-1/2}h_n^{-3} + h_n)$. Accordingly, we will assume hereafter that $h_n = o_P(1)$ and $h_n^{-1} = o_P(n^{1/6})$, so that all of these uniform errors converge to zero in probability. Now let $r_n$ be a positive sequence converging to zero with $r_n^{-1} = O_P(n^{1/2}h_n^3 \wedge h_n^{-1})$, and define $A_n \equiv \{t : \hat\eta_n(t) \geq r_n\}$. The third step is to estimate $t \mapsto (\dot\eta_0/\eta_0)(t)$ with
$$\hat k_n(t) \equiv \frac{\hat\eta_n^{(1)}(t)}{\frac{r_n}{2} \vee \hat\eta_n(t)}\,1\{t \in A_n\}.$$

We will now derive some important properties of $\hat k_n$. First note that
$$\sup_{t\in A_n}\left|\frac{\eta_0(t)}{\hat\eta_n(t)} - 1\right| = O_P\left(r_n^{-1}\|\hat\eta_n - \eta_0\|_\infty\right) = O_P\left(r_n^{-1}[n^{-1/2}h_n^{-1} + h_n^2]\right).$$
Thus, for all $t \in A_n$,
$$\frac{\hat\eta_n^{(1)}(t)}{\hat\eta_n(t)} = \frac{\dot\eta_0(t)}{\eta_0(t)} + O_P\left(r_n^{-1}\|\hat\eta_n^{(1)} - \dot\eta_0\|_\infty\right) + \frac{\dot\eta_0(t)}{\eta_0(t)}O_P\left(r_n^{-1}\|\hat\eta_n - \eta_0\|_\infty\right) = \frac{\dot\eta_0(t)}{\eta_0(t)}(1 + o_P(1)) + o_P(1),$$
where the errors are uniform over $t \in A_n$. Since $\hat k_n(t)$ is zero off of $A_n$, we have that
$$\frac{|\hat k_n(t)|}{1 + |t|^{a_1}} \leq b_1(1 + o_P(1)) + o_P(1), \quad(22.6)$$
for all $t \in \mathbb{R}$, where the errors are uniform. Note also that since the support of $\eta_0$ is all of $\mathbb{R}$, we have for every compact set $A \subset \mathbb{R}$ that $A \subset A_n$ with probability going to 1 as $n \to \infty$. Thus $\|\hat k_n - \dot\eta_0/\eta_0\|_K \stackrel{P}{\to} 0$ for every compact $K \subset \mathbb{R}$.

Second, note that the derivative of $t \mapsto \hat k_n(t)$, for all $t \in A_n$, is

$$\begin{aligned}
\hat k_n^{(1)}(t) &= \frac{\hat\eta_n^{(2)}(t)}{\hat\eta_n(t)} - \left(\frac{\hat\eta_n^{(1)}(t)}{\hat\eta_n(t)}\right)^2\\
&= \frac{\ddot\eta_0(t)}{\eta_0(t)} + O_P\left(r_n^{-1}\|\hat\eta_n^{(2)} - \ddot\eta_0\|_\infty\right) + \frac{\ddot\eta_0(t)}{\eta_0(t)}O_P\left(r_n^{-1}\|\hat\eta_n - \eta_0\|_\infty\right) + \left(\frac{\dot\eta_0(t)}{\eta_0(t)}\right)^2(1 + o_P(1)) + o_P(1)\\
&= \frac{\ddot\eta_0(t)}{\eta_0(t)} + O_P\left(r_n^{-1}[n^{-1/2}h_n^{-3} + h_n]\right) + \frac{\ddot\eta_0(t)}{\eta_0(t)}O_P\left(r_n^{-1}[n^{-1/2}h_n^{-1} + h_n^2]\right) + \left(\frac{\dot\eta_0(t)}{\eta_0(t)}\right)^2(1 + o_P(1)) + o_P(1)\\
&= \frac{\ddot\eta_0(t)}{\eta_0(t)}(1 + o_P(1)) + O_P(1) + \left(\frac{\dot\eta_0(t)}{\eta_0(t)}\right)^2(1 + o_P(1)) + o_P(1),
\end{aligned}$$
where all error terms are uniform in $t$. Hence, since $\hat k_n^{(1)}(t)$ is zero for all $t \notin A_n$, we have
$$\frac{|\hat k_n^{(1)}(t)|}{(1 + |t|^{a_1})^2(1 + |t|^{a_2})} \leq b_1^2(1 + o_P(1)) + b_2(1 + o_P(1)) + O_P(1),$$
where the error terms are uniform in $t$. This means that for some sequence $M_n$ which is bounded in probability as $n \to \infty$,
$$\frac{|\hat k_n^{(1)}(t)|}{1 + |t|^{(2a_1)\vee a_2}} \leq M_n, \quad(22.7)$$
for all $t \in \mathbb{R}$ and all $n$ large enough.

Third, let
$$\hat\ell_{\beta,n}(X) \equiv -\hat k_n(Y - \beta'Z)(Z - \hat\mu) + \frac{(Y - \beta'Z)\hat\mu}{\hat\sigma^2},$$
where $\hat\mu \equiv \mathbb{P}_n Z$ and $\hat\sigma^2 \equiv \mathbb{P}_n(Y - \tilde\beta'Z)^2$. Our approach to estimation will be to find a zero of $\beta \mapsto \mathbb{P}_n\hat\ell_{\beta,n}$, i.e., a value of $\beta$ that minimizes $\|\mathbb{P}_n\hat\ell_{\beta,n}\|$ over $B$. Let $\hat\beta_n$ be such a zero. We first need to show that $\hat\beta_n$ is consistent. Then we will utilize Theorem 3.1 to establish efficiency as promised.
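To make the full pipeline concrete (OLS pilot fit, kernel estimation from the estimated residuals, truncation on $A_n$, and a root search for $\beta \mapsto \mathbb{P}_n\hat\ell_{\beta,n}$), here is a scalar-covariate sketch; the sample size, bandwidth, truncation level $r_n$, and bisection solver are illustrative choices rather than prescriptions from the text:

```python
import numpy as np

rng = np.random.default_rng(0)
n, beta0 = 1500, 1.5
Z = rng.uniform(-1.0, 1.0, n)        # compactly supported scalar covariate
e = rng.standard_normal(n)           # residuals independent of Z
Y = beta0 * Z + e

# Step 1: OLS pilot estimate (sqrt(n)-consistent) and estimated residuals.
beta_ols = (Z @ Y) / (Z @ Z)
res = Y - beta_ols * Z

# Step 2: kernel estimates of eta_0 and its derivative, built from F-hat.
h = n ** (-1.0 / 7.0)                # h_n -> 0 with 1/h_n = o(n^{1/6})
phi = lambda x: np.exp(-0.5 * x ** 2) / np.sqrt(2.0 * np.pi)

def eta_hat(t):
    u = (t[:, None] - res[None, :]) / h
    return phi(u).mean(axis=1) / h

def eta1_hat(t):
    u = (t[:, None] - res[None, :]) / h
    return (-u * phi(u)).mean(axis=1) / h ** 2

# Step 3: truncated estimate k_n of eta_0-dot/eta_0 on A_n = {eta_hat >= r_n}.
r_n = 0.05                           # illustrative truncation level
def k_n(t):
    num, den = eta1_hat(t), eta_hat(t)
    return np.where(den >= r_n, num / np.maximum(den, r_n / 2.0), 0.0)

# Estimated efficient score, and a bisection search for its zero in beta.
mu, s2 = Z.mean(), np.mean(res ** 2)
def score(b):
    eb = Y - b * Z
    return np.mean(-k_n(eb) * (Z - mu) + eb * mu / s2)

lo, hi = beta_ols - 0.5, beta_ols + 0.5   # bracket; score decreases in b here
for _ in range(40):
    mid = 0.5 * (lo + hi)
    lo, hi = (mid, hi) if score(mid) > 0 else (lo, mid)
beta_eff = 0.5 * (lo + hi)
```

With Gaussian residuals, $\dot\eta_0/\eta_0$ is linear and the efficient estimator behaves much like OLS; the gains appear for non-Gaussian residual densities.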


Consistency.

Combining (22.6) and (22.7) with the theorem and corollary given below, the proofs of which are given in Section 22.4, we obtain that $\{\hat k_n(Y - \beta'Z) : \beta \in B\}$ is contained in a $P$-Donsker class with probability going to 1 as $n \to \infty$. This implies that both $\{\hat\ell_{\beta,n} : \beta \in B\}$ and $\{\tilde\ell_{\beta,\eta_0} : \beta \in B\}$ are also contained in a $P$-Donsker class with probability going to 1 as $n \to \infty$. Thus
$$\sup_{\beta\in B}\left\|(\mathbb{P}_n - P)\left(\hat\ell_{\beta,n} - \tilde\ell_{\beta,\eta_0}\right)\right\| = O_P(n^{-1/2}).$$

Using $k_0 \equiv \dot\eta_0/\eta_0$, we also have that
$$\begin{aligned}
P\left\|\hat\ell_{\beta,n} - \tilde\ell_{\beta,\eta_0}\right\| &\leq P\left|(\hat k_n - k_0)(Y - \beta'Z)(Z - \hat\mu)\right| + P\left|k_0(Y - \beta'Z)(\hat\mu - PZ)\right| \quad(22.8)\\
&\quad + P\left|(Y - \beta'Z)\left(\frac{\hat\mu}{\hat\sigma^2} - \frac{PZ}{\sigma^2}\right)\right|\\
&\leq P\left(1\{Y - \beta'Z \notin A_n\}\left|k_0(Y - \beta'Z)(Z - \hat\mu)\right|\right) + P\left((1 + |Y - \beta'Z|^{a_1})b_1|\hat\mu - PZ|\right) + o_P(1)\\
&= o_P(1),
\end{aligned}$$
as a consequence of previously determined properties of $\hat k_n$ and $k_0$. Hence
$$\sup_{\beta\in B}\left\|\mathbb{P}_n\hat\ell_{\beta,n} - P\tilde\ell_{\beta,\eta_0}\right\| = o_P(1).$$
Thus the identifiability of the model implies that $\hat\beta_n \stackrel{P}{\to} \beta_0$.

Theorem 22.1 Let $\mathcal{F}$ be a class of differentiable functions $f : \mathbb{R} \to \mathbb{R}$ such that for some $0 \leq \alpha, \beta < \infty$ and $1 \leq M < \infty$,
$$\sup_{f\in\mathcal{F}}\sup_{x\in\mathbb{R}}\frac{|f(x)|}{1 + |x|^\alpha} \leq M \quad\text{and}\quad \sup_{f\in\mathcal{F}}\sup_{x\in\mathbb{R}}\frac{|\dot f(x)|}{1 + |x|^\beta} \leq M.$$
Then for some constant $K$ depending only on $\alpha$, $\beta$ and $M$, we have
$$\log N_{[]}(\delta, \mathcal{F}, L_2(Q)) \leq \frac{K\left[1 + Q|X|^{4(\alpha\vee\beta)+6}\right]^{1/4}}{\delta},$$
for all $\delta > 0$ and all probability measures $Q$.

Corollary 22.2 Let $\mathcal{F}$ be as defined in Theorem 22.1, and define the class of functions
$$\mathcal{H} \equiv \{f(Y - \beta'Z) : \beta \in B, f \in \mathcal{F}\},$$
where $B \subset \mathbb{R}^k$ is bounded. Then, for every probability measure $P$ for which $P|Y - \beta_0'Z|^{4(\alpha\vee\beta)+6} < \infty$ for some $\beta_0 \in B$, and for which $Z$ lies in a bounded subset of $\mathbb{R}^k$ $P$-almost surely, $\mathcal{H}$ is $P$-Donsker.


Efficiency.

Provided we can establish that
$$P\left[k_0^2(Y - \hat\beta_n'Z)1\{Y - \hat\beta_n'Z \notin A_n\}\right] = o_P(1), \quad(22.9)$$
the result in (22.8) can be strengthened to conclude that Condition (3.7) holds. Accordingly, the left-hand side of (22.9) is bounded by
$$P\left[(1 + |Y - \hat\beta_n'Z|^{a_1})^2 b_1^2\, 1\{Y - \hat\beta_n'Z \notin K_n\}\right] + o_P(1),$$
for some sequence of increasing compact sets $K_n$, by previously established properties of $k_0$ and $A_n$. Since $Y = \beta_0'Z + e$, we obtain that $|Y - \hat\beta_n'Z|^{a_1} \lesssim |e \vee 1|^{a_1}$. Hence, for some sequence $t_n \to \infty$, the left-hand side of (22.9) is bounded up to a constant by $P\left[|e|^{2a_1}1\{|e| > t_n\}\right] = o(1)$. Thus (3.7) holds.

The only condition of Theorem 3.1 remaining to be established is (3.6).

Consider the distribution $P_{\hat\beta_n,\eta_0}$. Under this distribution, the mean of $\hat e \equiv Y - \hat\beta_n'Z$ is zero, and $\hat e$ and $Z$ are independent. Hence
$$P_{\hat\beta_n,\eta_0}\hat\ell_{\hat\beta_n,n} = P_{\hat\beta_n,\eta_0}\left[-\hat k_n(Y - \hat\beta_n'Z)(PZ - \hat\mu)\right].$$
However, $\hat\mu - PZ = O_P(n^{-1/2})$ and
$$P_{\hat\beta_n,\eta_0}\hat k_n(Y - \hat\beta_n'Z) = o_P(1) + P\left[k_0(Y - \hat\beta_n'Z)1\{Y - \hat\beta_n'Z \in A_n\}\right].$$
Previous arguments can be recycled to establish that the last term on the right is $o_P(1)$. Thus $P_{\hat\beta_n,\eta_0}\hat\ell_{\hat\beta_n,n} = o_P(n^{-1/2})$, and (3.6) follows. This means that all of the conditions of Theorem 3.1 are established . . . except the requirement that $\mathbb{P}_n\hat\ell_{\hat\beta_n,n} = o_P(n^{-1/2})$. The issue is that even though we are assuming $\|\mathbb{P}_n\hat\ell_{\beta,n}\|$ is minimized at $\beta = \hat\beta_n$, there is no guarantee at this point that such a minimum is of the order $o_P(n^{-1/2})$. Once we verify this, however, the theorem will finally yield that $\hat\beta_n$ is efficient.

To accomplish this last step, we will use arguments in the proof of Theorem 3.1, which proof is given in Section 19.1, to show that if $\bar\beta_n$ satisfies $\mathbb{P}_n\tilde\ell_{\bar\beta_n,\eta_0} = o_P(n^{-1/2})$, then $\mathbb{P}_n\hat\ell_{\bar\beta_n,n} = o_P(n^{-1/2})$. The existence of such a $\bar\beta_n$ is guaranteed by the local quadratic structure of $\beta \mapsto \mathbb{P}_n\tilde\ell_{\beta,\eta_0}$ and the assumed positive definiteness of $I_0$. Now,
$$\begin{aligned}
\mathbb{P}_n(\hat\ell_{\bar\beta_n,n} - \tilde\ell_{\bar\beta_n,\eta_0}) &= n^{-1/2}\mathbb{G}_n(\hat\ell_{\bar\beta_n,n} - \tilde\ell_{\bar\beta_n,\eta_0}) + P_{\bar\beta_n,\eta_0}(\hat\ell_{\bar\beta_n,n} - \tilde\ell_{\bar\beta_n,\eta_0})\\
&\quad - (P_{\bar\beta_n,\eta_0} - P)(\hat\ell_{\bar\beta_n,n} - \tilde\ell_{\bar\beta_n,\eta_0})\\
&\equiv E_n + F_n - G_n.
\end{aligned}$$
By the $P$-Donsker properties of $\hat\ell_{\bar\beta_n,n}$ and $\tilde\ell_{\bar\beta_n,\eta_0}$, combined with the fact that $P\left(\hat\ell_{\bar\beta_n,n} - \tilde\ell_{\bar\beta_n,\eta_0}\right)^2 = o_P(1)$, we have that $E_n = o_P(n^{-1/2})$. Condition (3.6) implies that $F_n = o_P(n^{-1/2} + \|\bar\beta_n - \beta_0\|) = o_P(n^{-1/2})$. The last equality follows from the fact that $\bar\beta_n$ is efficient (since we are using the exact efficient score for inference). Finally, arguments used in the second paragraph of the proof of Theorem 3.1 yield that $G_n = o_P(n^{-1/2})$. Thus $o_P(n^{-1/2}) = \|\mathbb{P}_n\hat\ell_{\bar\beta_n,n}\| \geq \|\mathbb{P}_n\hat\ell_{\hat\beta_n,n}\|$, and the desired efficiency of $\hat\beta_n$ follows.

Variance estimation for $\hat\beta_n$.

The next issue for this example is to obtain a consistent estimator of $I_0 = P[\tilde\ell_{\beta_0,\eta_0}\tilde\ell_{\beta_0,\eta_0}']$, which we denote $\hat I_n$. This will facilitate inference for $\beta_0$, since we would then have that $n(\hat\beta_n - \beta_0)'\hat I_n(\hat\beta_n - \beta_0)$ converges weakly to a chi-square random variable with $k$ degrees of freedom. We now show that $\hat I_n \equiv \mathbb{P}_n\left[\hat\ell_{\hat\beta_n,n}\hat\ell_{\hat\beta_n,n}'\right]$ is such an estimator.

Since $\hat\ell_{\hat\beta_n,n}$ is contained in a $P$-Donsker class with probability tending to 1 as $n \to \infty$, it is also similarly contained in a $P$-Glivenko-Cantelli class. Recycling previous arguments, we can verify that
$$P\left[\hat\ell_{\hat\beta_n,n}\hat\ell_{\hat\beta_n,n}'\right] = P\left[\tilde\ell_{\beta_0,\eta_0}\tilde\ell_{\beta_0,\eta_0}'\right] + o_P(1).$$
Since the product of two $P$-Glivenko-Cantelli classes is also $P$-Glivenko-Cantelli provided the product of the two envelopes has bounded expectation, we now have that $(\mathbb{P}_n - P)\left[\hat\ell_{\hat\beta_n,n}\hat\ell_{\hat\beta_n,n}'\right] = o_P(1)$, after arguing along subsequences if needed. The desired consistency follows.
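Once the estimated score values $\hat\ell_{\hat\beta_n,n}(X_i)$ are in hand, $\hat I_n$ and the Wald-type chi-square statistic are a few lines of linear algebra. In the sketch below, the simulated score draws are merely illustrative stand-ins for the real estimated scores:

```python
import numpy as np

def wald_chisq(scores, beta_hat, beta_null):
    """Return n (beta_hat - beta_null)' I_hat_n (beta_hat - beta_null) and
    I_hat_n = P_n[score score'], from an n x k matrix of score values."""
    n = scores.shape[0]
    I_hat = scores.T @ scores / n
    d = beta_hat - beta_null
    return float(n * d @ I_hat @ d), I_hat

# Stand-in scores mimicking mean-zero efficient scores with information
# 2 * I_k, and an estimator within O(n^{-1/2}) of the null value.
rng = np.random.default_rng(1)
n, k = 2000, 3
scores = rng.normal(scale=np.sqrt(2.0), size=(n, k))
beta0 = np.zeros(k)
I_n = scores.T @ scores / n
beta_hat = beta0 + np.linalg.solve(I_n, scores.mean(axis=0))
W, I_hat = wald_chisq(scores, beta_hat, beta0)
# Under the null, W is approximately chi-square with k = 3 degrees
# of freedom, so it would be compared with, e.g., the 0.95 quantile.
```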

Inference for the residual distribution.

It is not hard to show that
$$\hat F_n(t) \equiv \mathbb{P}_n 1\{Y - \hat\beta_n'Z \leq t\}$$
is efficient for estimating the true residual distribution $F_0(t)$, uniformly over $t \in \mathbb{R}$, by using arguments in Section 4.1.1 and concepts in Chapters 3 and 18. The verification of this is saved as an exercise (see Exercise 22.6.5). This means that $(\hat\beta_n, \hat F_n)$ are jointly efficient for $(\beta_0, F_0)$.

Moreover, valid joint inference for both parameters can be obtained by using the piggyback bootstrap $(\beta_n^\circ, F_n^\circ)$, where $\beta_n^\circ = \hat\beta_n + n^{-1/2}\hat I_n^{-1/2}U$ and $F_n^\circ(t) = \mathbb{P}_n^\circ 1\{Y - \beta_n^{\circ\prime}Z \leq t\}$; here $U$ is a standard $k$-variate normal deviate independent of the data, and $\mathbb{P}_n^\circ$ is the weighted empirical measure with random weights $\xi_1, \ldots, \xi_n$ which are independent of both $U$ and the data and which satisfy the conditions given in Section 20.2.2. The verification of this is also saved as an exercise (see Exercise 22.6.6).
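A single piggyback bootstrap draw is inexpensive once $\hat\beta_n$ and $\hat I_n$ are available. In this sketch, standard exponential weights play the role of $\xi_1, \ldots, \xi_n$ (one choice satisfying the Section 20.2.2 conditions), and the data and the values of $\hat\beta_n$ and $\hat I_n$ are illustrative stand-ins:

```python
import numpy as np

rng = np.random.default_rng(2)
n, k = 1000, 2

# Illustrative stand-ins for the data and the already-computed estimates:
Z = rng.uniform(-1.0, 1.0, (n, k))
Y = Z @ np.array([1.0, -0.5]) + rng.standard_normal(n)
beta_hat = np.array([1.0, -0.5])   # efficient estimator (stand-in)
I_hat = np.eye(k)                  # estimated efficient information

def piggyback_draw(t_grid):
    """One draw (beta_circ, F_circ) of the piggyback bootstrap."""
    U = rng.standard_normal(k)     # k-variate normal, independent of data
    eigval, eigvec = np.linalg.eigh(I_hat)
    I_inv_sqrt = eigvec @ np.diag(eigval ** -0.5) @ eigvec.T
    beta_circ = beta_hat + I_inv_sqrt @ U / np.sqrt(n)
    # Weighted empirical CDF of the residuals at beta_circ, with positive
    # weights independent of both U and the data:
    xi = rng.exponential(size=n)
    w = xi / xi.sum()
    res = Y - Z @ beta_circ
    return beta_circ, np.array([w @ (res <= t) for t in t_grid])

beta_circ, F_circ = piggyback_draw(np.linspace(-2.0, 2.0, 5))
```

Repeating `piggyback_draw` many times yields joint confidence sets for $(\beta_0, F_0)$ without refitting the efficient estimator.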

22.3 Temporal Process Regression

In this section, we will study estimation for a varying-coefficient regression model for temporal process data. The material is adapted from Fine, Yan and Kosorok (2004). Consider, for example, bone marrow transplantation studies in which the time-varying effect of a certain medication on the prevalence of graft-versus-host (GVH) disease may be of scientific interest. Let the outcome measure, or "response," be denoted $Y(t)$, where $t$ is restricted to a finite time interval $[l, u]$. In the example, $Y(t)$ is a dichotomous process indicating the presence of GVH at time $t$. More generally, we allow $Y$ to be a stochastic process, but we require $Y$ to have square-integrable total variation over $[l, u]$.

We model the mean of the response $Y$ at time $t$ as a function of a $k$-vector of time-dependent covariates $X(t)$ and a time-dependent stratification indicator $S(t)$ as follows:
$$E[Y(t)|X(t), S(t) = 1] = g^{-1}(\beta'(t)X(t)), \quad(22.10)$$
where the link function $g$ is monotone, differentiable and invertible, and $\beta(t) = \{\beta_1(t), \ldots, \beta_k(t)\}'$ is a $k$-vector of time-dependent regression coefficients. A discussion of the interpretation and uses of this model is given in Fine, Yan and Kosorok (2004). For the bone marrow example, $g^{-1}(u) = e^u/(1+e^u)$ would yield a time-indexed logistic model, with $\beta(t)$ denoting the changes in log odds ratios over time for GVH disease prevalence per unit increase in the covariates.

Note that no Markov assumption is involved here, since the conditioning in (22.10) involves only the current time and not previous times. In addition to the stratification indicator $S(t)$, it is useful to include a non-missing indicator $R(t)$, for which $R(t) = 1$ if $\{Y(t), X(t), S(t)\}$ is fully observed at $t$, and $R(t) = 0$ otherwise. We assume that $Y(t)$ and $R(t)$ are independent conditionally on $\{X(t), S(t) = 1\}$. The data structure is similar in spirit to that described in Nadeau and Lawless (1998). The approach we take for inference is to utilize the fact that the model only posits the conditional mean of $Y(t)$ and not the correlation structure. Thus we can construct "working independence" estimating equations to obtain simple, nonparametric estimators. The pointwise properties of these estimators follow from standard estimating equation results, but uniform properties are quite nontrivial to establish since martingale theory is not applicable here. We will use empirical process methods.

The observed data consist of $n$ independent and identically distributed copies of $\{R(t) : t \in [l, u]\}$ and $\{(Y(t), X(t), S(t)) : R(t) = 1, t \in [l, u]\}$. We can compute an estimator $\hat\beta_n(t)$ for each $t \in [l, u]$ as the root of $U_n(\beta(t), t) \equiv \mathbb{P}_n A(\beta(t), t)$, where
$$A(\beta(t), t) \equiv S(t)R(t)D(\beta(t))V(\beta(t), t)\left[Y(t) - g^{-1}(\beta'(t)X(t))\right],$$
$D(u) \equiv \partial[g^{-1}(u'X(t))]/(\partial u)$ is a column $k$-vector-valued function, and $V(\beta(t), t)$ is a time-dependent and possibly data-dependent scalar-valued weight function. Here are the specific data and estimating equation assumptions we will need:

(i) $(S_i, R_i, X_i, Y_i)$, $i = 1, \ldots, n$, are i.i.d. and all component processes are cadlag. We require $S$, $R$ and $X$ to all have total variation over $[l, u]$ bounded by a fixed constant $c < \infty$, and we require $Y$ to have total variation $\tilde Y$ over $[l, u]$ which has finite second moment.

(ii) $t \mapsto \beta(t)$ is cadlag on $[l, u]$.

(iii) $h \equiv g^{-1}$ and $\dot h \equiv \partial h(u)/(\partial u)$ are Lipschitz continuous and bounded above and below on compact sets.

(iv) We require
$$\inf_{t\in[l,u]} \operatorname{eigmin} P\left[S(t)R(t)X(t)X'(t)\right] > 0,$$
where eigmin denotes the minimum eigenvalue of a matrix.

(v) For all bounded $B \subset \mathbb{R}^k$, the class of random functions $\{V(b, t) : b \in B, t \in [l, u]\}$ is bounded above and below by positive constants and is BUEI and PM (recall these definitions from Sections 9.1.2 and 8.2).

Note that Condition (v) is satisfied if $V(b, t) \equiv v(b'X(t))$, where $u \mapsto v(u)$ is a nonrandom, positive and Lipschitz continuous function.
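For a concrete feel for the pointwise computation, consider the time-indexed logistic link $h(u) = e^u/(1+e^u)$ with the weight $V(b, t) = 1/\dot h(b'X(t))$, which is positive and reduces $U_n(\cdot, t)$ to the familiar logistic score at each fixed $t$. The sketch below solves the equation by Newton iterations at a few time points; all data-generating choices are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)
n, times = 3000, [0.25, 0.5, 0.75]

def expit(x):
    return 1.0 / (1.0 + np.exp(-x))

# Illustrative data: X(t) = (1, W)', everyone in the stratum (S = 1) and
# fully observed (R = 1), with a time-varying true coefficient beta_0(t).
W = rng.uniform(-1.0, 1.0, n)
X = np.column_stack([np.ones(n), W])    # k = 2

def beta_true(t):
    return np.array([-0.5, 1.0 + t])    # hypothetical beta_0(t)

def solve_at(t, n_iter=25):
    """Newton root of U_n(b, t) = P_n X [Y(t) - expit(b'X)] at fixed t."""
    Y = rng.binomial(1, expit(X @ beta_true(t)))   # Y(t) given X(t), S(t)=1
    b = np.zeros(2)
    for _ in range(n_iter):
        p = expit(X @ b)
        U = X.T @ (Y - p) / n                      # estimating function
        H = (X * (p * (1.0 - p))[:, None]).T @ X / n
        b = b + np.linalg.solve(H, U)              # Newton update
    return b

beta_hat = {t: solve_at(t) for t in times}
```

Because each time point is solved separately under working independence, the full process estimator is just this computation repeated over the jump times of the data.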

The form of the estimator will depend on the form of the observed data. The estimator jumps at those $M$ times where $t \mapsto \{Y_i(t), X_i(t), S_i(t) : R_i(t) = 1\}$ and $t \mapsto R_i(t)$ jump, $i = 1, \ldots, n$. If $Y_i(t)$ and $X_i(t)$ are piecewise constant, then so also is $\hat\beta_n$. In this situation, finding $\hat\beta_n$ (as a process) involves solving $t \mapsto U_n(\beta(t), t) = 0$ at these $M$ time points. For most practical applications, $Y$ and $X$ will be either piecewise constant or continuous, and, therefore, so will $\hat\beta_n$. In the piecewise-constant case, we can interpolate in a right-continuous manner between the $M$ jump points; otherwise, we can smoothly interpolate between them. When $M$ is large, the differences between these two approaches will be small. The bounded total variation assumptions on the data make the transition from pointwise to uniform estimation and inference both theoretically possible and practically feasible. In this light, we will assume hereafter that $\hat\beta_n$ can be computed at every value of $t \in [l, u]$.

We will now discuss consistency, asymptotic normality, and inference based on simultaneous confidence bands. We will conclude this example with some comments on the optimality of the estimators in this setting. Several interesting examples of data analyses and simulation studies for this set-up are given in Fine, Yan and Kosorok (2004).


Consistency.

The following theorem gives us existence and consistency of $\hat\beta_n$ under the above conditions:

Theorem 22.3 Assume (22.10) holds with true parameter $\{\beta_0(t) : t \in [l, u]\}$, where $\sup_{t\in[l,u]}|\beta_0(t)| < \infty$. Let $\hat\beta_n = \{\hat\beta_n(t) : t \in [l, u]\}$ be the smallest, in uniform norm, root of $\{U_n(\beta(t), t) = 0 : t \in [l, u]\}$. Then such a root exists for all $n$ large enough almost surely, and
$$\sup_{t\in[l,u]}\left|\hat\beta_n(t) - \beta_0(t)\right| \stackrel{as*}{\to} 0.$$

Proof. Define
$$C(\gamma, \beta, t) \equiv S(t)R(t)D(\gamma(t))V(\gamma(t), t)\left[Y(t) - h(\beta'(t)X(t))\right],$$
where $\gamma, \beta \in \ell^\infty_c([l, u])^k$ and $\ell^\infty_c(H)$ is the collection of bounded real functions on the set $H$ with absolute value $\leq c$ (when $c = \infty$, the subscript $c$ is omitted and the standard definition of $\ell^\infty$ prevails). Let
$$\mathcal{G} \equiv \left\{C(\gamma, \beta, t) : \gamma, \beta \in \ell^\infty_c([l, u])^k, t \in [l, u]\right\}.$$
The first step is to show that $\mathcal{G}$ is BUEI and PM with square-integrable envelope for each $c < \infty$. This implies that $\mathcal{G}$ is $P$-Donsker and hence also $P$-Glivenko-Cantelli. We begin by observing that the classes
$$\left\{\beta'(t)X(t) : \beta \in \ell^\infty_c([l, u])^k, t \in [l, u]\right\} \quad\text{and}\quad \left\{b'X(t) : b \in [-c, c]^k, t \in [l, u]\right\}$$
are equivalent. Next, the following lemma yields that processes with square-integrable total variation are BUEI and PM (the proof is given below):

Lemma 22.4 Let $\{W(t) : t \in [l, u]\}$ be a cadlag stochastic process with square-integrable total variation $\tilde W$. Then $\{W(t) : t \in [l, u]\}$ is BUEI and PM with envelope $2\tilde W$.

Proof. Let $\mathcal{W}$ consist of all cadlag functions $w : [l, u] \to \mathbb{R}$ of bounded total variation, and let $\mathcal{W}_0$ be the subset consisting of monotone increasing functions. Then the functions $w_j : \mathcal{W} \to \mathcal{W}_0$, $j = 1, 2$, where $w_1(w)$ extracts the monotone increasing part of $w$ and $w_2(w)$ extracts the negative of the monotone decreasing part of $w$, are both measurable. Moreover, for any $w \in \mathcal{W}$, $w = w_1(w) - w_2(w)$. Lemma 9.10 tells us that $w_j(\mathcal{W})$ is VC with index 2 for both $j = 1, 2$. It is not difficult to verify that cadlag monotone increasing processes are PM (see Exercise 22.6.7). Hence we can apply Part (iv) of Lemma 9.17 to obtain the desired result.

By applying Lemma 9.17, we obtain that $\mathcal{G}$ is BUEI and PM with square-integrable envelope, and hence is $P$-Donsker by Theorem 8.19 and the statements immediately following the theorem.


The second step is to use this Donsker property to obtain existence and consistency. Accordingly, we now have for each $c < \infty$ and all $\beta \in \ell^\infty_c([l, u])^k$ that
$$\begin{aligned}
U_n(\beta(t), t) &= \mathbb{P}_n\left\{C(\beta, \beta_0, t) - S(t)R(t)D(\beta(t))V(\beta(t), t)\left[h(\beta'(t)X(t)) - h(\beta_0'(t)X(t))\right]\right\}\\
&= -\mathbb{P}_n\left[S(t)R(t)X(t)X'(t)\dot h(\bar\beta'(t)X(t))\dot h(\beta'(t)X(t))V(\beta(t), t)\right]\left\{\beta(t) - \beta_0(t)\right\} + \epsilon_n(t),
\end{aligned}$$
where $\bar\beta(t)$ is on the line segment between $\beta(t)$ and $\beta_0(t)$, and $\epsilon_n(t) \equiv \mathbb{P}_n C(\beta, \beta_0, t)$, $t \in [l, u]$. Since $\mathcal{G}$ is $P$-Glivenko-Cantelli and $PC(\beta, \beta_0, t) = 0$ for all $t \in [l, u]$ and $\beta \in \ell^\infty_c([l, u])^k$, we have $\sup_{t\in[l,u]}|\epsilon_n(t)| \stackrel{as*}{\to} 0$. By Condition (ii) and the uniform positive definiteness assured by Condition (iv), the above results imply that $U_n(\beta(t), t)$ has a uniformly bounded solution $\hat\beta_n$ for all $n$ large enough. Hence $\|U_n(\hat\beta_n(t), t)\| \geq c_0\|\hat\beta_n(t) - \beta_0(t)\| - \epsilon_n^*(t)$, where $c_0 > 0$ does not depend on $t$ and $\|\epsilon_n^*\|_\infty \stackrel{as*}{\to} 0$. This follows because $\{S(t)R(t)X(t)X'(t) : t \in [l, u]\}$ is $P$-Glivenko-Cantelli, using previous arguments. Thus the desired uniform consistency follows.

Asymptotic normality.

The following theorem establishes both asymptotic normality and an asymptotic linearity structure which we will utilize later for inference:

Theorem 22.5 Under the conditions of Theorem 22.3, βn is asymptotically linear with influence function ψ(t) ≡ −[H(t)]⁻¹A(β0(t), t), where

H(t) ≡ P[ S(t)R(t)D(β0(t))V(β0(t), t)D′(β0(t)) ],

and √n(βn − β0) converges weakly in ℓ∞([l, u])^k to a tight, mean zero Gaussian process X(t) with covariance

Σ(s, t) ≡ P[X(s)X′(t)] = P[ψ(s)ψ′(t)].

Proof. By Theorem 22.3, we have for all n large enough,

0 = √n Un(βn(t), t)
 = √n Pn A(β0(t), t) + √n Pn[ A(βn(t), t) − A(β0(t), t) ]
 = √n Pn A(β0(t), t) + √n Pn[ C(βn, β0, t) − C(β0, β0, t) ]
  − √n Pn[ S(t)R(t)D(βn(t))V(βn(t), t) × {h(β′n(t)X(t)) − h(β′0(t)X(t))} ]
 ≡ √n Pn A(β0(t), t) + Jn(t) − Kn(t).


22.3 Temporal Process Regression 441

Since G is P-Donsker and since sums of Donsker classes are Donsker, and also since

sup_{t∈[l,u]} P[ C(βn, β0, t) − C(β0, β0, t) ]² →^P 0,

we have that sup_{t∈[l,u]} |Jn(t)| = oP(1). By previous arguments, we also have that

Kn(t) = [H(t) + ǫ**n(t)] √n(βn(t) − β0(t)),

where a simple extension of previous arguments yields that sup_{t∈[l,u]} |ǫ**n(t)| = oP(1). This now yields the desired asymptotic linearity. The weak convergence follows since {A(β0(t), t) : t ∈ [l, u]} is a subset of the Donsker class G.

Simultaneous confidence bands.

We now utilize the asymptotic linear structure derived in the previous theorem to develop simultaneous confidence band inference. Define

Hn(t) ≡ Pn[ S(t)R(t)D(βn(t))V(βn(t), t)D′(βn(t)) ]

and

Σn(s, t) ≡ Hn⁻¹(s) Pn[ A(βn(s), s)A′(βn(t), t) ] Hn⁻¹(t),

and let M(t) = [M1(t), …, Mk(t)]′ be the component-wise square root of the diagonal of Σn(t, t). Define also

In(t) ≡ n^{−1/2} ∑_{i=1}^n Gi [diag M(t)]⁻¹ Hn⁻¹(t) Ai(βn(t), t),

where G1, …, Gn are i.i.d. standard normal deviates independent of the data. Now let mn(α) be the 1 − α quantile of the conditional sampling distribution of sup_{t∈[l,u]} ‖In(t)‖.

The next theorem establishes that Σn is consistent for Σ and that

βn(t) ± n^{−1/2} mn(α) M(t)  (22.11)

is a 1 − α-level simultaneous confidence band for β0(t), simultaneous for all t ∈ [l, u]. A nice property of the band in (22.11) is that its size is rescaled in a time-dependent manner proportionally to the component-wise standard error. This means that time-dependent changes in precision are accurately reflected by the confidence band.

Theorem 22.6 Under the conditions of Theorem 22.3, Σn(s, t) is uniformly consistent for Σ(s, t), over all s, t ∈ [l, u], almost surely. If, in addition, inf_{t∈[l,u]} eigmin Σ(t, t) > 0, then the 1 − α confidence band given in (22.11) is simultaneously valid asymptotically.


Proof. The proof of uniform consistency of Σn follows from minor modifications of previous arguments. Provided the minimum eigenvalue condition holds, M(t) will be asymptotically bounded both above and below uniformly over t ∈ [l, u] and uniformly consistent for the component-wise square root of the diagonal of Σ(t, t), which we denote M0(t). The arguments in Section 20.2.3 are applicable, and we can establish, again by recycling earlier arguments, that In(t) converges weakly, conditionally on the data, to M0⁻¹(t)X(t) in ℓ∞([l, u])^k. The desired conclusions now follow.
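The band (22.11) lends itself to direct simulation: conditional on the data, one draws the multipliers G1, …, Gn repeatedly and records the supremum of In(t) over a time grid. Below is a minimal numerical sketch, not from the text: the array names and shapes, and the use of the componentwise max norm for ‖In(t)‖, are all our own assumptions, and psi_hat is presumed to hold pre-computed values Hn⁻¹(t)Ai(βn(t), t) on a grid of time points.

```python
import numpy as np

def multiplier_band(psi_hat, beta_hat, M_hat, alpha=0.05, n_boot=1000, seed=0):
    """Simultaneous 1-alpha band of the form (22.11) via Gaussian multipliers.

    psi_hat : (n, T, k) values of H_n^{-1}(t) A_i(beta_n(t), t) on a grid,
    beta_hat: (T, k) estimates beta_n(t),
    M_hat   : (T, k) componentwise standard errors (sqrt of diag Sigma_n(t, t)).
    """
    rng = np.random.default_rng(seed)
    n = psi_hat.shape[0]
    sups = np.empty(n_boot)
    for b in range(n_boot):
        G = rng.standard_normal(n)                       # i.i.d. N(0,1) multipliers
        # I_n(t) = n^{-1/2} sum_i G_i [diag M(t)]^{-1} H_n^{-1}(t) A_i(...)
        I_t = (G[:, None, None] * psi_hat).sum(axis=0) / np.sqrt(n) / M_hat
        sups[b] = np.abs(I_t).max()                      # sup over t, max norm over components
    m_alpha = np.quantile(sups, 1 - alpha)               # conditional quantile m_n(alpha)
    half_width = m_alpha * M_hat / np.sqrt(n)
    return beta_hat - half_width, beta_hat + half_width
```

The returned arrays give the lower and upper envelopes of the band at each grid point; as noted above, the half-width inflates and deflates with the componentwise standard error M(t).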

Optimality.

Note that the estimating equation Un is an infinite-dimensional analog of that in Liang and Zeger (1986), with an independence working assumption across t. This avoids having to worry about modeling temporal correlations. Nevertheless, incorporating weights which account for these dependencies may improve the efficiency of the estimators. With standard longitudinal data, the response's dimension is small and specifying the dependencies is not difficult (see Prentice and Zhao, 1991; Zhao, Prentice and Self, 1990). Moreover, misspecification of the dependence in this case does not bias the estimators. It is not clear whether this approach can be readily extended to temporal process responses.

In order to assess efficiency for the functional model, we need to select a Hilbert space in D[l, u] and define a score operator on that space. A reasonable choice is the space, which we will denote H1, that has inner product ⟨a, b⟩1 = ∫_{[l,u]} a(s)b(s) dµ(s), where µ is a bounded measure. For ease of exposition, we will assume throughout this section that X(t) = X is time-independent. Now let Un(β) be the empirical expectation of DX RS VR,S,X Y(β), where DX : H1 → H1^p takes a ∈ H1 and multiplies it component-wise to obtain a vector function with components

a(t) ∂g⁻¹(u′X)/∂u |_{u=βn(t)},

for a sequence βn uniformly consistent for β0, and where Y(β)(t) ≡ Y(t) − g⁻¹(β(t)′X). We assume that VR,S,X : H1 → H1 is an estimated operator that converges in probability. Let τj,R,S ≡ {t ∈ [l, u] : R(t)S(t) = j}, for j = 0, 1, and restrict VR,S,X such that if g = VR,S,X f, then g(t) = f(t) for all t ∈ τ0,R,S. In other words, the operator functions as the pointwise identity for all t where S(t)R(t) = 0. Arguments similar to those used above to establish consistency and asymptotic normality of βn can be extended to verify that βn satisfying Un(βn) = 0 exists for all n large enough and is consistent and, moreover, that √n(βn − β0) converges weakly in ℓ∞([l, u])^k to a tight Gaussian process.

Our efforts toward achieving optimality, therefore, can essentially be reduced to finding a VR,S,X that is uniformly consistent for an operator which


makes Un optimal with respect to H1. What this translates to is that the variance of ⟨h, βn⟩1 is minimal over all estimators from equations of the form specified above. In other words, it is variance-minimizing for all h ∈ H1, corresponding to all linear functionals on H1. This is a natural extension of the optimal estimating function concept for finite-dimensional parameters (see Godambe, 1960). It is not difficult to derive that the limiting covariance operator for √n(βn − β0) is

V ≡ P[ DX RS VR,S,X RS D*X ]⁻¹ × P[ DX RS VR,S,X WR,S,X V*R,S,X RS D*X ] × P[ DX RS VR,S,X RS D*X ]⁻¹,  (22.12)

provided the inverses exist, where DX and VR,S,X here denote the asymptotic limits of their estimated counterparts above, the superscript * denotes adjoint, and WR,S,X : H1 → H1 is the self-adjoint operator which takes a ∈ H1 to

WR,S,X a(t) ≡ ∫_{[l,u]} σX(s, t)a(s)R(s)S(s) dµ(s), for all t ∈ τ1,R,S, and WR,S,X a(t) ≡ a(t), for all t ∈ τ0,R,S,

where

σx(s, t) ≡ P[ Y(β0)(s)Y(β0)(t) | X = x, R(s)S(s) = R(t)S(t) = 1 ].

Using Hilbert space techniques, including those presented in Chapter 17, we can adapt Godambe's (1960) proof to show that the optimal choice for VR,S,X is W⁻¹R,S,X, provided the inverse exists. In this instance, the right side of (22.12) simplifies to

V = P[ DX W⁻¹R,S,X D*X ]⁻¹.  (22.13)

When the measure µ used in defining the inner product on H1 only gives mass to a finite set of points, then (22.13) follows from the standard finite-dimensional results. Thus, one may view the optimal score operator for the functional model (22.10) as a generalization of Liang and Zeger (1986).

An important question is whether or not W⁻¹R,S,X exists. If not, then, although the lower bound (22.13) may exist, it may not be achievable through any estimating equation of the form Un. Unlike with the finite-dimensional setting, where WR,S,X is a matrix, serious problems can arise in the functional set-up. Consider, for example, the case where H1 is truly infinite-dimensional, as when µ is Lebesgue measure on [l, u]. Suppose, for simplicity, that R(t)S(t) = 1 almost surely for all t ∈ [l, u] and that Y(β0) is a smooth Gaussian process with conditional covariance function σx(s, t) satisfying ∫_{[l,u]}∫_{[l,u]} σ²x(s, t) ds dt ≤ M for all x, where M < ∞, and the Hilbert space is L2[l, u]. Note that such smooth Gaussian processes can be obtained, for example, by integrating less smooth Gaussian processes with


bounded covariances. For any finite collection of time points in [l, u], the optimum equation exists if the matrix inverse of WR,S,X restricted to those time points exists. However, this does not typically imply that W⁻¹R,S,X exists on L2[l, u].

To see the issue, note that the image of WR,S,X can be shown using Picard's theorem (see, for example, Wahba, 1990, Chapter 8) to be the reproducing kernel Hilbert space with reproducing kernel

Kx(t, v) ≡ ∫_{[l,u]} σx(t, s)σx(s, v) ds.

This implies that W⁻¹R,S,X exists only on a strict subset of the reproducing kernel Hilbert space with reproducing kernel σx. It can be further shown that a tight Gaussian process with covariance σx is, with probability 1, not a member of the reproducing kernel Hilbert space with kernel σx (see Page 5 of Wahba, 1990) and hence not a member of the reproducing kernel Hilbert space with kernel Kx. Thus, with probability 1, W⁻¹R,S,X Y(β0) does not exist. Hence, even if W⁻¹R,S,X is valid on a subspace of H1 that is large enough for V to exist, the score operator will not exist. It is easy to conjecture that similar difficulties arise with more complicated non-Gaussian data.

On the other hand, even though W⁻¹R,S,X may not exist, we do have, for every ǫ > 0, that W̃R,S,X ≡ ǫI + WR,S,X, where I is the identity operator, does have an inverse over all of L2[l, u], by standard results for Volterra integral equations (see Section 3.3 of Kress, 1999). This means that it is possible to construct a score operator, using W̃⁻¹R,S,X, with deterministic, small ǫ > 0, that is arbitrarily close to the optimal estimating equation. In practical settings, the effectiveness of W̃R,S,X will depend on the choice of ǫ and the degree of temporal correlation. The stronger the correlations, the more unstable the inverse. While optimality may be out of reach, significant improvements over the estimating equation Un are possible by utilizing the correlation structure of Y(β0) as we have described above. Further discussions of these concepts can be found in Fine, Yan and Kosorok (2004).

An important conclusion from this example is that efforts to achieve optimality may yield meaningful gains in efficiency even in those settings where optimality is unachievable.
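The regularized operator ǫI + WR,S,X can be inverted numerically after discretizing [l, u]. The sketch below is ours, not the authors': it assumes σ(s, t) has been evaluated on a grid with quadrature weights Δµ(s), so the operator becomes a matrix and the inverse a linear solve.

```python
import numpy as np

def regularized_inverse_apply(sigma, a, d_mu, eps=1e-2):
    """Apply (eps*I + W)^{-1} to a grid function a, where
    (W a)(t) = sum_s sigma(s, t) a(s) d_mu(s) discretizes W_{R,S,X}
    on tau_{1,R,S} (here taken to be the whole grid)."""
    T = len(a)
    W = sigma * d_mu[None, :]            # W[t, s] = sigma(s, t) * weight at s
    # eps > 0 guarantees invertibility even when W itself is singular
    return np.linalg.solve(eps * np.eye(T) + W, a)
```

Smaller ǫ tracks the optimal weighting more closely but, as noted above, the solve becomes increasingly ill-conditioned when the correlations encoded in σ are strong.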

22.4 A Partly Linear Model for Repeated Measures

In this section, we study a marginal partly linear regression model for repeated measures data. We assume the data are i.i.d. observations X1, …, Xn with 0 < q < ∞ repeated measures per individual Xi. More precisely, an individual observation is Xi ≡ (Yi, Ui), where Yi = (Yi1, …, Yiq)′ is a q-vector of possibly dependent outcome measures, Ui ≡ (Zi, Wi) is a matrix of regression covariates, Zi ≡ (Zi1, …, Ziq)′ with Zij ≡ (Zij1, …, Zijp)′ ∈ R^p, j = 1, …, q, and Wi ≡ (Wi1, …, Wiq)′ ∈ [0, 1]^q. Consistent with these definitions, we also define Uij ≡ (Z′ij, Wij)′, so that Ui = (Ui1, …, Uiq)′. This kind of data arises in longitudinal studies, family studies, and in many other settings involving clustered data. In many applications, q may actually vary from individual to individual, but, for ease of exposition, we assume in this section that q is fixed throughout.

The model we assume is Yij = β′0Zij + h0(Wij) + ǫij, j = 1, …, q, where β0 ∈ R^p, h0 ∈ H, H is the space of functions h : [0, 1] → R with J(h) < ∞, with J²(h) ≡ ∫₀¹ (h^{(m)}(w))² dw and the integer 1 ≤ m < ∞ being known, and where ǫi ≡ (ǫi1, …, ǫiq)′ is mean zero Gaussian with nonsingular and finite covariance Σ0. The space H and function J were defined previously in Sections 4.5 and 15.1 for the partly linear logistic regression model.

Let Vn be a symmetric, positive semidefinite q × q matrix estimator that converges in distribution to a symmetric, finite and positive definite matrix V0. We will study estimation of β0 and h0 via minimizing the following objective function:

(β, h) → n⁻¹ ∑_{i=1}^n (Yi − Ziβ − h(Wi))′ Vn (Yi − Ziβ − h(Wi)) + λn² J²(h),  (22.14)

where h(Wi) ≡ (h(Wi1), …, h(Wiq))′ and λn > 0 is a tuning parameter satisfying λn = OP(n^{−m/(2m+1)}) and λn⁻¹ = OP(n^{m/(2m+1)}). Accordingly, let θn ≡ (βn, hn) be a minimizer of (22.14).
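For concreteness, the criterion (22.14) can be written out directly once h is represented numerically. In the sketch below (our own illustration, with all names hypothetical), h is passed both as a callable evaluated at the Wij and as values on a uniform grid used to approximate the penalty J²(h) by finite differences.

```python
import numpy as np

def penalty_sq(h_grid, m, dw):
    """Finite-difference approximation of J^2(h) = integral of (h^(m)(w))^2 dw."""
    d = np.diff(h_grid, n=m) / dw ** m       # m-th derivative by differences
    return float((d ** 2).sum() * dw)

def criterion(beta, h_fn, Y, Z, W, V_n, lam, h_grid, dw, m=2):
    """Objective (22.14): n^{-1} sum_i r_i' V_n r_i + lam^2 J^2(h),
    with residual vectors r_i = Y_i - Z_i beta - h(W_i)."""
    n = Y.shape[0]
    r = Y - np.einsum('nqp,p->nq', Z, beta) - h_fn(W)
    quad = float(np.einsum('nq,qr,nr->', r, V_n, r)) / n
    return quad + lam ** 2 * penalty_sq(h_grid, m, dw)
```

The actual estimator minimizes this criterion over (β, h); the sketch only evaluates it, which is the building block any numerical minimizer would call.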

We will establish consistency and rates of convergence for θn as well as asymptotic normality and efficiency for βn. Note that the issue of asymptotic normality and efficient estimation of β0 is nontrivial in this repeated measures context and that significant difficulties can arise. For instance, asymptotic normality may be difficult or impossible to achieve with apparently reasonable objective functions when Vn is not diagonal (see Lin and Carroll, 2001). A general efficient procedure for a variety of partly linear models for repeated measures, that includes our example as a special case, is given in Lin and Carroll (2005). However, the presentation and theory for this general approach, which the authors refer to as "profile kernel and backfitting estimation," is lengthy, and we will not include it here.

We now return to establishing the theoretical properties of θn. We need some additional assumptions before continuing. We assume common marginal expectations

E[Zi1k | Wi1 = w] = · · · = E[Ziqk | Wiq = w] ≡ hk(w), with hk ∈ H,

for k = 1, …, p. When Vn is not diagonal, we also require that E[Zij | Wi] = E[Zij | Wij], j = 1, …, q. The first assumption is satisfied if the covariates (Zij, Wij) have the same marginal distribution across j = 1, …, q and the conditional expectation functions (h1, …, hp) are sufficiently smooth.

Let ζij ≡ (Zij1, …, Zijp, 1, Wij, W²ij, …, W^{m−1}ij)′. We need to assume that

P[ζijζ′ij] is positive definite, and P‖Zij‖² < ∞, for all j = 1, …, q.  (22.15)

Note that this assumption precludes including an intercept in Zij. The intercept term for our model is contained in the function h0. Define also

Z̃ij ≡ Zij − (h1(Wij), …, hp(Wij))′, and let Z̃i ≡ (Z̃i1, …, Z̃iq)′.

Our final assumptions are that P[Z̃′iV0Z̃i] is nonsingular and that the density of Wij is bounded away from zero for at least one value of j = 1, …, q. The assumptions in this paragraph are essentially boundedness and identifiability conditions on the covariates that will be needed for rate determination of both θn and hn and asymptotic normality. Unlike with previous examples in this book that involve partly linear models, we do not require any specific knowledge of the bounded sets containing β0 and h0 (compare, for example, with the first paragraph of Section 15.1). Neither do we require the covariate vectors Zij to be uniformly bounded, only bounded in an L2(P) sense. These weaker assumptions are more realistic in practice. However, more refined empirical process analysis is required.

Consistency and rate calculation

It turns out that in this example it is easier to compute the rate directly. Consistency will then follow. Define Uij → gn(Uij) ≡ β′nZij + hn(Wij), Uij → g0(Uij) ≡ β′0Zij + h0(Wij),

‖gn − g0‖n,j ≡ ( n⁻¹ ∑_{i=1}^n |(gn − g0)(Uij)|² )^{1/2},

and ‖gn − g0‖n ≡ ( ∑_{j=1}^q ‖gn − g0‖²n,j )^{1/2}. Note that by definition of θn, we have

n⁻¹ ∑_{i=1}^n [ ǫi − ((gn − g0)(Ui1), …, (gn − g0)(Uiq))′ ]′ Vn [ ǫi − ((gn − g0)(Ui1), …, (gn − g0)(Uiq))′ ] + λn²J²(hn) ≤ n⁻¹ ∑_{i=1}^n ǫ′iVnǫi + λn²J²(h0).
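The empirical norms just defined transcribe directly into code. A small illustration of our own (diff[i, j] stands for the value (gn − g0)(Uij)):

```python
import numpy as np

def norm_nj(diff, j):
    """The norm ‖g_n − g_0‖_{n,j}: root mean square over individuals at coordinate j."""
    return float(np.sqrt(np.mean(diff[:, j] ** 2)))

def norm_n(diff):
    """The norm ‖g_n − g_0‖_n: Euclidean combination over j = 1, ..., q."""
    q = diff.shape[1]
    return float(np.sqrt(sum(norm_nj(diff, j) ** 2 for j in range(q))))
```

These are the quantities whose OP rates are tracked throughout the remainder of the rate calculation.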


Hence

n⁻¹ ∑_{i=1}^n ((gn − g0)(Ui1), …, (gn − g0)(Uiq)) Vn ((gn − g0)(Ui1), …, (gn − g0)(Uiq))′ + λn²J²(hn) ≤ 2n⁻¹ ∑_{i=1}^n ǫ′iVn ((gn − g0)(Ui1), …, (gn − g0)(Uiq))′ + λn²J²(h0).  (22.16)

Combining (22.16) with our assumptions on Vn, which ensure that Vn is asymptotically bounded and that its minimum eigenvalue is asymptotically bounded away from zero, we obtain

‖gn − g0‖²n + λn²J²(hn) ≤ OP( λn² + ∑_{j=1}^q ∑_{l=1}^q |n⁻¹ ∑_{i=1}^n ǫil(gn − g0)(Uij)| ).  (22.17)

The Cauchy-Schwarz inequality applied to the right-hand side yields that

‖gn − g0‖²n ≤ OP(1 + ‖gn − g0‖n),

and thus ‖gn − g0‖n = OP(1).

We will now proceed to the next step, in which we utilize careful empirical process calculations to obtain tighter bounds for each of the q² terms of the form n⁻¹ ∑_{i=1}^n ǫil(gn − g0)(Uij), where 1 ≤ j, l ≤ q. Accordingly, fix j and l, and note that ǫil is mean zero Gaussian and independent of Uij, i = 1, …, n. We will need to use the following theorem, which is Lemma 8.4 of van de Geer (2000), the proof of which we will omit:

Theorem 22.7 Let v1, …, vn be points on a space V, and let F be a class of measurable functions on V with sup_{f∈F} ‖f‖Qn ≤ R < ∞, where ‖f‖Qn ≡ ( n⁻¹ ∑_{i=1}^n f²(vi) )^{1/2}, and with

log N(δ, F, Qn) ≤ Aδ^{−α}, for all δ > 0,

for constants 0 < α < 2 and A < ∞. Suppose also that η1, …, ηn are independent, mean zero real random variables satisfying

max_{1≤i≤n} K²( E e^{ηi²/K²} − 1 ) ≤ σ0²,

for constants 0 < K, σ0 < ∞. Then for a constant c which depends only on A, α, R, K, σ0 (and not on the points v1, …, vn), we have for all T > c,

P( sup_{f∈F} |n^{−1/2} ∑_{i=1}^n ηi f(vi)| / ‖f‖^{1−α/2}Qn ≥ T ) ≤ c exp[−T²/c²].


Fix the sample U1j = u1j, …, Unj = unj, and set vi = uij, i = 1, …, n. Define u → gθ(u) ≡ (β − β0)′z + (h − h0)(w), where u ≡ (z′, w)′, z ∈ R^p and w ∈ [0, 1]. Consider the class of functions

F ≡ { gθ(u) / (1 + ‖gθ‖Qn + J(h − h0)) : θ ∈ R^p × H }.

By Taylor expansion,

(h − h0)(w) = (h − h0)(0) + (h − h0)^{(1)}(0)w + · · · + (h − h0)^{(m−1)}(0) w^{m−1}/(m − 1)! + ∫₀^w (h − h0)^{(m)}(t)(w − t)^{m−1}/(m − 1)! dt
 ≡ ∑_{k=0}^{m−1} h*k w^k + h̃2(w)
 ≡ h̃1(w) + h̃2(w),

where h̃1 is an (m − 1)-degree polynomial and sup_{w∈[0,1]} |h̃2(w)| ≤ J(h − h0). Thus

‖(β − β0)′z + h̃1(w)‖Qn ≤ ‖gθ‖Qn + J(h − h0),

which implies

1 + ‖gθ‖Qn + J(h − h0) ≥ 1 + γn √( ‖β − β0‖² + ∑_{k=0}^{m−1} (h*k)² ),

where γn² is the smallest eigenvalue of Bn ≡ n⁻¹ ∑_{i=1}^n ζiζ′i, where ζi ≡ (zij1, …, zijp, 1, wij, …, w^{m−1}ij)′, i = 1, …, n. Using related arguments, it is also easy to verify that

sup_{w∈[0,1]} |(h − h0)(w)| ≤ ( ∑_{k=0}^{m−1} (h*k)² )^{1/2} + J(h − h0).

Thus, for the class of functions

F1 ≡ { (h − h0)(w) / (1 + ‖gθ‖Qn + J(h − h0)) : β ∈ R^p, h ∈ H },

we have

sup_{f∈F1, w∈[0,1]} |f(w)| ≤ γn⁻¹ + 1,  (22.18)

and, trivially,

sup_{f∈F1} J(f) ≤ 1.  (22.19)

We take a brief digression now to verify that, without loss of generality, we can replace γn⁻¹ in (22.18) with a finite constant that does not depend on the particular values of U1j, …, Unj. Assumption (22.15) ensures that Cn ≡ n⁻¹ ∑_{i=1}^n ζijζ′ij converges almost surely to a positive definite matrix. Thus the minimum eigenvalue γn of Cn will also converge to a positive constant. Thus, with probability going to 1 as n → ∞, all possible values of γn⁻¹ will be bounded above by some Γ0 < ∞. Since the L2(Q) norm is bounded by the uniform norm for any choice of probability measure Q, Theorem 9.21 now implies that

log N(δ, F1, Qn) ≤ A0δ^{−1/m}, for all δ > 0,  (22.20)

where A0 does not depend on the values of u1j, …, unj (with probability tending to 1 as n → ∞).

Now consider the class of functions

F2 ≡ { (β − β0)′z / (1 + ‖gθ‖Qn + J(h − h0)) : β ∈ R^p, h ∈ H },

and note that by previous arguments,

F2 ⊂ F3 ≡ { β′z : β ∈ R^p, ‖β‖ ≤ γn⁻¹ }.

Combining the fact that n⁻¹ ∑_{i=1}^n ‖Zij‖² converges almost surely to a finite constant by assumption with the previously established properties of γn, we obtain that

N(δ, F2, Qn) ≤ A1δ^{−p}, for all δ > 0,  (22.21)

where A1 does not depend on the values of u1j, …, unj (with probability tending to 1 as n → ∞).

Combining (22.20) and (22.21), we obtain that

log N(δ, F, Qn) ≤ A2δ^{−1/m}, for all δ > 0,

where A2 < ∞ does not depend on the values of u1j, …, unj with probability tending to 1 as n → ∞. Now we return to arguing for fixed u1j, …, unj and apply Theorem 22.7 with α = 1/m to obtain that

P( sup_{f∈F} |n^{−1/2} ∑_{i=1}^n ǫil f(uij)| / ‖f‖^{1−1/(2m)}Qn ≥ T | U1j = u1j, …, Unj = unj ) ≤ c exp[−T²/c²],  (22.22)


where T and c can be chosen to be both finite and not dependent on the values of u1j, …, unj. Accordingly, we can remove the conditioning in (22.22) and apply the fact that ‖gn − g0‖n,j = OP(1) to obtain

|n⁻¹ ∑_{i=1}^n ǫil(gn − g0)(Uij)| = OP( n^{−1/2} ‖gn − g0‖^{1−1/(2m)}n,j (1 + J(hn − h0))^{1/(2m)} )  (22.23)

(see Exercise 22.6.9). We now combine (22.23) with (22.17) to obtain that

‖gn − g0‖²n + λn²J²(hn) = OP( λn² + n^{−1/2} ‖gn − g0‖^{1−1/(2m)}n (1 + J(hn))^{1/(2m)} ).  (22.24)

Putting Rn ≡ n^{m/(2m+1)} ‖gn − g0‖n / (1 + J(hn)), (22.24) now implies that Rn² = OP(1 + Rn^{1−1/(2m)}). This implies that Rn = OP(1), and thus ‖gn − g0‖n = OP(n^{−m/(2m+1)}(1 + J(hn))). Applying this to (22.24), we obtain J²(hn) = OP(1 + J(hn)), which implies J(hn) = OP(1) and hence also ‖gn − g0‖n = OP(n^{−m/(2m+1)}).

Recycling the arguments accompanying the polynomial expansion mentioned previously for h ∈ H, we can verify that

‖βn − β0‖ = OP(1) and sup_{w∈[0,1]} |hn(w)| = OP(1)  (22.25)

(see Exercise 22.6.10). Using this and recycling again previous arguments, we now have (see again Exercise 22.6.10) that gn − g0 is contained in a class G with probability increasing to 1, as n → ∞, that satisfies

log N[](δ, G, L2(P)) ≤ Aδ^{−1/m}, for all δ > 0.  (22.26)

We now use (22.26) to transfer the convergence rate from the norm ‖·‖n,j to the norm ‖·‖P,2. This transfer follows, after setting ν = 1/m, from the following theorem, which is Theorem 2.3 of Mammen and van de Geer (1997) and which we present without proof:

Theorem 22.8 Suppose the class of measurable functions F satisfies the following for some norm ‖·‖, constants A < ∞ and 0 < ν < 2, and for all δ > 0:

log N[](δ, F, ‖·‖) ≤ Aδ^{−ν}.

Then for all η > 0 there exists a 0 < C < ∞ such that

lim sup_{n→∞} P( sup_{f∈F: ‖f‖>Cn^{−1/(2+ν)}} | ‖f‖n/‖f‖ − 1 | > η ) = 0.


It now follows (see Exercise 22.6.12) that

‖gn − g0‖P,2 = OP(n^{−m/(2m+1)}).  (22.27)

Now (22.27) implies for each j = 1, …, q that

P( (βn − β0)′Zij + (hn − h0)(Wij) )² = OP(n^{−2m/(2m+1)}),

which, by the assumptions and definition of Z̃ij, implies for some function fn of Wij that

OP(n^{−2m/(2m+1)}) = P( (βn − β0)′Z̃ij + fn(Wij) )² ≥ P( (βn − β0)′Z̃ij )² ≥ c‖βn − β0‖²,

for some c > 0. Thus ‖βn − β0‖ = OP(n^{−m/(2m+1)}) and therefore also P( (hn − h0)(Wij) )² = OP(n^{−2m/(2m+1)}). Since the density of Wij is bounded away from zero for at least one value of j ∈ {1, …, q}, we also have ‖hn − h0‖L2 = OP(n^{−m/(2m+1)}). The fact that J(hn) = OP(1) now, finally, yields ‖hn − h0‖∞ = oP(1). Thus we have achieved uniform consistency of both parameters as well as the optimal L2 rate of convergence for hn.

Asymptotic normality

Consider evaluating the objective function in (22.14) at the point θn,s ≡ ( βn + st, hn − st′(h1, …, hp)′ ), where s is a scalar and t ∈ R^p. Differentiating with respect to s, evaluating at s = 0 and allowing t to range over t ∈ R^p, we obtain by definition of a minimizer that

0 = n⁻¹ ∑_{i=1}^n [ Zi − H(Wi) ]′ Vn [ Yi − Ziβn − hn(Wi) ] − λn² ∫₀¹ ( h^{(m)}n(w)h^{(m)}1(w), …, h^{(m)}n(w)h^{(m)}p(w) )′ dw,

where H(Wi) ≡ (h1(Wi), …, hp(Wi)) and, for any function g : [0, 1] → R, g(Wi) ≡ (g(Wi1), …, g(Wiq))′. This now implies


n⁻¹ ∑_{i=1}^n [ Zi − H(Wi) ]′ Vn Zi (βn − β0)
 = n⁻¹ ∑_{i=1}^n [ Zi − H(Wi) ]′ Vn ǫi
  − n⁻¹ ∑_{i=1}^n [ Zi − H(Wi) ]′ Vn [ (hn − h0)(Wi) ]
  − λn² ∫₀¹ ( h^{(m)}n(w)h^{(m)}1(w), …, h^{(m)}n(w)h^{(m)}p(w) )′ dw
 ≡ An − Bn − Cn.

We save it as an exercise (see Exercise 22.6.13) to verify that

n⁻¹ ∑_{i=1}^n [ Zi − H(Wi) ]′ Vn Zi →^P P[ Z̃′iV0Z̃i ],

√n An ⇝ Np( 0, P[ Z̃′iV0Σ0V0Z̃i ] ),

Bn = oP(n^{−1/2}), and Cn = oP(n^{−1/2}),

where Np(a, B) is a p-variate normal distribution with mean vector a and variance matrix B, and Z̃i ≡ Zi − H(Wi). Thus we can conclude

√n(βn − β0) ⇝ Np( 0, {P[Z̃′iV0Z̃i]}⁻¹ P[Z̃′iV0Σ0V0Z̃i] {P[Z̃′iV0Z̃i]}⁻¹ ).  (22.28)

Efficiency

Based on the Gaussian likelihood, it is easy to verify that the score for β is Z′iΣ0⁻¹ǫi and the tangent set for h is {h(Wi)′Σ0⁻¹ǫi : h ∈ H}. We could choose to take the closure of the tangent set in the space of square-integrable functions of Wi and to center the set so that all elements have mean zero. However, these adjustments are not needed to derive that the efficient score for β is

ℓ̃β,h(Xi) ≡ [ Zi − H(Wi) ]′Σ0⁻¹ǫi.  (22.29)

The verification of this is saved for Exercise 22.6.8.

Accordingly, choosing Vn so that it converges to V0 = Σ0⁻¹ will result in efficient estimation of β0. This comes as no surprise given the form of the limiting variance of √n(βn − β0) in (22.28) and the optimality concepts of Godambe (1960) (see Proposition 4.3 of Section 4.4 and the surrounding discussion). The resulting limiting covariance of √n(βn − β0) is {P[Z̃′iΣ0⁻¹Z̃i]}⁻¹. The following choice for Vn will work:


Vn = [ n⁻¹ ∑_{i=1}^n ( Yi − Ziβ̃n − h̃n(Wi) )( Yi − Ziβ̃n − h̃n(Wi) )′ ]⁻¹,  (22.30)

where θ̃n ≡ (β̃n, h̃n) is a minimizer of (22.14) with Vn replaced by the q × q identity matrix. Verification that Vn →^P Σ0⁻¹ is saved for Exercise 22.6.14.

Efficient estimation of βn using Vn given in (22.30) thus requires two optimizations of an objective function, but this is a small price to pay for efficiency. Note that more information on the structure of Σ0 could be used to reduce the number of parameters needed for estimating Vn. For example, identity, diagonal, exchangeable or autocorrelation covariance structures could be employed. If the covariance structure is correctly specified, full efficiency is obtained. If it is not correctly specified, asymptotic normality of √n(βn − β0) is achieved although the associated limiting variance is suboptimal. This is a very simple example of a "locally efficient estimator" as mentioned briefly at the end of Chapter 16 (see also Section 1.2.6 of van der Laan and Robins, 2003).
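The two-stage recipe just described can be sketched numerically. Everything below is our own illustration: h is represented by a small fixed basis purely for simplicity, the penalty term is dropped for brevity, and all function and variable names are hypothetical.

```python
import numpy as np

def weighted_fit(Y, Z, B, V):
    """Weighted least-squares fit of Y_ij = beta'Z_ij + theta'B_ij
    with working weight matrix V; Y: (n, q), Z: (n, q, p), B: (n, q, d)."""
    X = np.concatenate([Z, B], axis=2)                 # joint design, (n, q, p + d)
    A = np.einsum('nqa,qr,nrb->ab', X, V, X)           # sum_i X_i' V X_i
    b = np.einsum('nqa,qr,nr->a', X, V, Y)             # sum_i X_i' V Y_i
    return np.linalg.solve(A, b)

def two_stage(Y, Z, B):
    """Stage 1: identity weight; form V_n from residuals as in (22.30); refit."""
    n, q, p = Z.shape
    coef0 = weighted_fit(Y, Z, B, np.eye(q))           # stage 1
    X = np.concatenate([Z, B], axis=2)
    resid = Y - np.einsum('nqa,a->nq', X, coef0)       # stage-1 residual vectors
    Sigma_hat = resid.T @ resid / n                    # estimate of Sigma_0
    V_n = np.linalg.inv(Sigma_hat)                     # efficient working weight
    coef1 = weighted_fit(Y, Z, B, V_n)                 # stage 2
    return coef1[:p], Sigma_hat
```

Stage one uses the identity working weight, Σ0 is estimated from the stage-one residuals, and stage two refits with Vn = Σ̂⁻¹, mirroring the construction in (22.30).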

Moreover, it seems quite reasonable that efficient estimation of Σ0 could also be developed, but we do not pursue this here. We also have not yet discussed a method of estimating the limiting covariance of √n(βn − β0). However, inference based on weighted M-estimation under penalization, as discussed in Section 21.5, appears to be applicable here. We omit the details.

One final issue we mention is in regard to the assumption we make that E[Zij | Wi] = E[Zij | Wij], j = 1, …, q, i = 1, …, n. This seems oddly restrictive, but it definitely makes the asymptotic normality result easier to achieve when Vn is not diagonal. The complications resulting from omitting this assumption are connected to several intriguing observations made in Lin and Carroll (2001). The profile kernel and backfitting estimation method of Lin and Carroll (2006) seems to be able to get around this problem so that full efficiency is achieved under less restrictive assumptions on the dependency between Zij and Wij than what we have used here. We refer the interested reader to the cited publications for more details.

22.5 Proofs

Proof of Theorem 22.1. We will use Theorem 9.20. Let γ = α ∨ β + 3/2, and note that

|f(x)|/(1 + |x|^γ) ≤ [ |f(x)|/(1 + |x|^α) ] × [ 2(1 + |x|^α)/(1 + |x|^{α+3/2}) ] ≤ M × 6(1 ∨ |x|)^{−3/2},


for all x ∈ R. Similarly, we can verify that

|ḟ(x)|/(1 + |x|^γ) ≤ 6M(1 ∨ |x|)^{−3/2},

for all x ∈ R.

Now define the class of functions

G ≡ { x → f(x)/(1 + |x|^γ) : f ∈ F },

and note that for every g ∈ G and all x ∈ R, |g(x)| ≤ 6M(1 ∨ |x|)^{−3/2} and

|ġ(x)| ≤ |ḟ(x)|/(1 + |x|^γ) + |f(x)| γ(1 ∨ |x|)^{γ−1}/(1 + |x|^γ)² ≤ 6M(1 ∨ |x|)^{−3/2} + 6M(1 ∨ |x|)^{−3/2}.

Setting k* ≡ 12M, we obtain that |g(x)| ∨ |ġ(x)| ≤ k*(1 ∨ |x|)^{−3/2}.

Now let Ij,+ ≡ [j, j + 1] and Ij,− ≡ [−(j + 1), −j] for all integers j ≥ 0, and note that each of these sets is trivially bounded and convex and that their union equals R. Let ‖·‖1 be the Lipschitz norm of degree 1, and note that for every g ∈ G, ‖g|Ij,+‖1 ≤ k*(1 ∨ j)^{−3/2} for all j ≥ 0, where g|A is the restriction of g to the set A ⊂ R. Similarly, ‖g|Ij,−‖1 ≤ k*(1 ∨ j)^{−3/2}. Note also that I¹j,+ = {x : ‖x − Ij,+‖ < 1} = (j − 1, j + 2) and, similarly, I¹j,− = (−j − 2, −j + 1), for all j ≥ 0.

Applying Theorem 9.20 and using the fact that the probability of any set is ≤ 1, we obtain

log N[](δ, G, L4(Q)) ≤ (K*/δ) ( 2(3k*)^{4/5} ∑_{j=0}^∞ (1 ∨ j)^{−6/5} )^{5/4},

for all δ > 0, where K* < ∞ is universal. Since ∑_{j=1}^∞ j^{−6/5} converges, we obtain that there exists a constant K** < ∞, depending only on k*, such that

log N[](δ, G, L4(Q)) ≤ K**/δ,  (22.31)

for all δ > 0 and all probability measures Q.

Note that for any f1, f2 ∈ F,

Q[f1 − f2]² = Q[ (1 + |X|^γ)²(g1 − g2)² ] ≤ ( Q[1 + |X|^γ]⁴ )^{1/2} ( Q[g1 − g2]⁴ )^{1/2},

where x → gj(x) ≡ (1 + |x|^γ)⁻¹fj(x), for j = 1, 2. Suppose Q|X|^{4γ} = Q|X|^{4(α∨β)+6} ≡ M* < ∞. Then Q(1 + |X|^γ)⁴ ≤ 8(1 + M*). Now let [g1i, g2i], 1 ≤ i ≤ m, be a minimal covering of δ-L4(Q) brackets for G. Then [(1 + |x|^γ)g1i, (1 + |x|^γ)g2i], 1 ≤ i ≤ m, is an [8(1 + M*)]^{1/4}δ-L2(Q) covering of F. The desired conclusion of the theorem now follows by (22.31) and the fact that k* depends only on α, β, M.

Proof of Corollary 22.2. It is not hard to verify (see Exercise 22.6.14) that the class

H_0 ≡ {−(β − β_0)′Z : β ∈ B}   (22.32)

satisfies

N_[](δ, H_0, L_2(P)) ≤ k∗∗δ^{−k},   (22.33)

for all δ > 0 and some constant k∗∗ < ∞ (which may depend on P).

Fix a P that satisfies the conditions of the corollary, and let ε ≡ Y − β_0′Z. For any h ∈ H_0, define Y_h ≡ ε + h(Z) and let Q_h be the probability distribution of Y_h. Note that sup_{h ∈ H_0} Q_h|Y_h|^{4(α∨β)+6} ≡ M∗∗ < ∞. Now fix δ > 0, and let [h_{1i}, h_{2i}], i = 1, . . . , m, be a minimal collection of δ/3-L_2(P) brackets that cover H_0. Let H_δ be the collection of all of the functions h_{ji}, j = 1, 2, i = 1, . . . , m, and let h ∈ H_δ be arbitrary. Theorem 22.1 now implies that

log N_[](δ/3, F, L_2(Q_h)) ≤ K_0/δ,

for a K_0 < ∞ that depends on neither δ nor Q_h. Accordingly, let [f_{1i,h}, f_{2i,h}], i = 1, . . . , m∗, be a minimal collection of δ/3-L_2(Q_h) brackets that cover F. Moreover, by the definition of Q_h, (P[f_{2i,h}(Y_h) − f_{1i,h}(Y_h)]^2)^{1/2} ≤ δ/3. Repeat this process for all members of H_δ.

Let f(Y − β′Z) be an arbitrary element of H, and let [h_{1j}, h_{2j}] be the δ/3-L_2(P) bracket that contains −(β − β_0)′Z. Let [f_{1i,h_{1j}}, f_{2i,h_{1j}}] and [f_{1i′,h_{2j}}, f_{2i′,h_{2j}}] be δ/3-L_2(Q_{h_{1j}}) and δ/3-L_2(Q_{h_{2j}}) brackets, respectively, that contain f. Then

f_{1i,h_{1j}}(Y + h_{1j}(Z)) ≤ f(Y − β′Z) ≤ f_{2i′,h_{2j}}(Y + h_{2j}(Z))

and

(P[f_{2i′,h_{2j}}(Y + h_{2j}(Z)) − f_{1i,h_{1j}}(Y + h_{1j}(Z))]^2)^{1/2}
  ≤ (Q_{h_{2j}}[f_{1i′,h_{2j}} − f_{2i′,h_{2j}}]^2)^{1/2} + δ/3 + (Q_{h_{1j}}[f_{1i,h_{1j}} − f_{2i,h_{1j}}]^2)^{1/2}
  ≤ δ.

Since the logarithm of the number of such brackets is bounded by K_0′δ^{−1}, for a constant K_0′ depending on P but not on δ, and since δ was arbitrary, we have that the bracketing entropy integral for H, J_[](∞, H, L_2(P)), is finite. Thus the desired conclusion follows from Theorem 2.3.
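The final step rests on the fact that bracketing entropy of order δ^{−1} yields a finite entropy integral, since the integral involves the square root of the entropy: ∫_0^1 √(K/δ) dδ = 2√K < ∞. A quick numerical check of this integrability, with an arbitrary illustrative constant in place of K_0′, is:

```python
import math

# The bracketing entropy bound log N <= K/delta gives a finite entropy
# integral because sqrt of the entropy is integrable at zero:
#   J = int_0^1 sqrt(K / delta) d delta = 2 sqrt(K).
# K below is an arbitrary illustrative constant, not a value from the text.
K = 3.7
n = 1_000_000
# midpoint rule on (0, 1]; the integrand blows up at 0 but is integrable
J = sum(math.sqrt(K / ((i + 0.5) / n)) for i in range(n)) / n
assert abs(J - 2.0 * math.sqrt(K)) < 1e-2
```

By contrast, an entropy of order δ^{−2} (square root of order δ^{−1}) would make the same integral diverge, which is why the rate in (22.33) and Theorem 22.1 matters.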


22.6 Exercises

22.6.1. Verify that F1 given in (22.3) is Donsker and that F2 given in (22.4) is Glivenko-Cantelli.

22.6.2. Show that the conditions given in the second paragraph for the residual density η are satisfied by the standard normal density, with a1 = 1, a2 = 2, and b1 = b2 = 1.

22.6.3. For the linear regression example, show that having P[ZZ′] be positive definite combined with having x → (η′_0/η_0)(x) be strictly monotone (along with the other assumptions specified for this model) implies that the map β → Pℓ_{β,η_0} is continuous with a unique zero at β = β_0.

22.6.4. In the linear regression example, verify that both ‖η_n^{(1)} − η_0^{(1)}‖_∞ = O_P(n^{−1/2}h_n^{−2} + h_n^2) and ‖η_n^{(2)} − η_0^{(2)}‖_∞ = O_P(n^{−1/2}h_n^{−3} + h_n).
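For intuition on the first rate in Exercise 22.6.4: a bound of the form n^{−1/2}h_n^{−2} + h_n^2 trades a variance-type term against a bias-type term and is minimized, up to constants, at h_n ≍ n^{−1/8}, giving the overall rate n^{−1/4}. The grid search below confirms this calculus numerically; the specific n and grid are illustrative choices only.

```python
import math

# Grid search on the rate r(h) = n^{-1/2} h^{-2} + h^2 from the first bound.
# Setting r'(h) = 0 gives h = n^{-1/8}, with r(h) then of order n^{-1/4}.
n = 10 ** 8

def r(h):
    return n ** -0.5 * h ** -2 + h ** 2

hs = [10 ** (k / 200.0) for k in range(-600, 1)]  # h in [1e-3, 1], log-spaced
h_best = min(hs, key=r)
assert abs(math.log(h_best) - math.log(n ** (-1.0 / 8.0))) < 0.05
assert r(h_best) < 3.0 * n ** -0.25  # achieved rate is of order n^{-1/4}
```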

22.6.5. Show that the estimator F_n defined at the end of Section 22.2 is uniformly efficient.

22.6.6. Show that the piggyback bootstrap (β_n, F_n) defined at the end of Section 22.2 is valid.

22.6.7. Consider a cadlag, monotone increasing stochastic process {X(t) : t ∈ T}, and consider the class of evaluation functions F ≡ {f_t : t ∈ T}, where f_t(X) ≡ X(t) for all t ∈ T, and T is a closed interval in R. Show that F is PM.

22.6.8. Verify that (22.23) follows from (22.22). Part of this verification should include a discussion and resolution of any measurability issues.

22.6.9. Verify that both (22.25) and (22.26) hold. Hint: For the second inequality, consider the classes {(β − β_0)′Z_{ij} : ‖β − β_0‖ ≤ K_1} and {(h − h_0)(W_{ij}) : ‖h − h_0‖_∞ ≤ K_2, J(h − h_0) ≤ K_3} separately. For the first class, the desired bracketing entropy bound is easy to verify. For the second class, it may be helpful to first establish the bracketing entropy bound for the uniform norm and then apply Lemma 9.22.

22.6.10. Verify (22.27). It may be helpful to first verify it for fixed j ∈ {1, . . . , q}, since U_{1j}, . . . , U_{nj} forms an i.i.d. sample.

22.6.11. Verify (22.27).

22.6.12. Utilizing arguments applied in Section 22.4 as needed, verify (22.28).

22.6.13. Show the following:

1. ℓ_{β,h} given in (22.29) is the efficient score for β as claimed.

2. V_n given in (22.30) is consistent for Σ_0^{−1}.


22.6.14. Verify that the class H_0 defined in (22.32) satisfies the bracketing entropy bound (22.33) for a constant k∗∗ < ∞ that may depend on P (but only through the compact set containing Z P-almost surely).
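Exercise 22.6.14 comes down to a standard covering argument: if B is compact in R^d and Z is bounded by some R, then β → −(β − β_0)′Z is R-Lipschitz in β, so an ε-grid of B with ε = δ/R induces δ-brackets for H_0, and N_[](δ, H_0, L_2(P)) grows only polynomially in 1/δ, as in (22.33). The sketch below illustrates the polynomial count; d, R, and the unit-cube parameter set are assumptions made for the illustration.

```python
import math

# Covering-number sketch: a (delta/R)-grid of the unit cube B in R^d
# induces delta-brackets for {-(beta - beta0)'Z : beta in B} when |Z| <= R,
# so N(delta) <= (C/delta)^d with C = 2R, a polynomial bound as in (22.33).
d, R = 3, 2.0  # illustrative dimension and bound on |Z|

def n_cover(delta):
    # grid points needed to cover the unit cube at mesh delta / R
    return math.ceil(R / delta) ** d

for delta in (0.5, 0.1, 0.01):
    assert n_cover(delta) <= (2.0 * R / delta) ** d
```

Taking logarithms, log N_[](δ) ≤ d log(2R/δ), which is far smaller than the k∗∗δ^{−k} bound actually needed in (22.33).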

22.7 Notes

Theorems 22.3, 22.5, and 22.6 are, after at most a few minor modifications, Theorems A1, A2, and A3 of Fine, Yan and Kosorok (2004). Other connections to original sources have been acknowledged previously.


References

Andersen, P. K., Borgan, Ø., Gill, R. D., and Keiding, N. (1993). Statistical Models Based on Counting Processes. Springer-Verlag, New York.

Andrews, D. W. K. (1991). An empirical process central limit theorem for dependent non-identically distributed random variables. Journal of Multivariate Analysis, 38:187–203.

Andrews, D. W. K. (2001). Testing when a parameter is on the boundary of the maintained hypothesis. Econometrica, 69:683–734.

Andrews, D. W. K. and Ploberger, W. (1994). Optimal tests when a nuisance parameter is present only under the alternative. Econometrica, 62:1383–1414.

Arcones, M. A. and Giné, E. (1993). Limit theorems for U-processes. Annals of Probability, 21:1494–1542.

Arcones, M. A. and Wang, Y. (2006). Some new tests for normality based on U-processes. Statistics and Probability Letters, 76:69–82.

Arcones, M. A. and Yu, B. (1994). Central limit theorems for empirical and U-processes of stationary mixing sequences. Journal of Theoretical Probability, 7:47–71.

Barbe, P. and Bertail, P. (1995). The Weighted Bootstrap. Springer, New York.


Bassett, G., Jr. and Koenker, R. (1978). Asymptotic theory of least absolute error regression. Journal of the American Statistical Association, 73:618–622.

Benjamini, Y. and Hochberg, Y. (1995). Controlling the false discovery rate: A practical and powerful approach to multiple testing. Journal of the Royal Statistical Society, Series B, 57:289–300.

Bickel, P. J., Götze, F., and van Zwet, W. R. (1997). Resampling fewer than n observations: Gains, losses, and remedies for losses. Statistica Sinica, 7:1–31.

Bickel, P. J., Klaassen, C. A. J., Ritov, Y., and Wellner, J. A. (1998). Efficient and Adaptive Estimation for Semiparametric Models. Springer-Verlag, New York.

Bickel, P. J. and Rosenblatt, M. (1973). On some global measures of the deviations of density function estimates. Annals of Statistics, 1:1071–1095.

Bilius, Y., Gu, M. G., and Ying, Z. (1997). Towards a general theory for Cox model with staggered entry. Annals of Statistics, 25:662–682.

Billingsley, P. (1968). Convergence of Probability Measures. Wiley, New York.

Billingsley, P. (1986a). Probability and Measure. Wiley, New York, second edition.

Billingsley, P. (1986b). Probability and Measure. Wiley, New York, third edition.

Bradley, R. C. (1986). Basic properties of strong mixing conditions. In Eberlein, E. and Taqqu, M. S., editors, Dependence in Probability and Statistics: A Survey of Recent Results, pages 165–192. Birkhäuser, Basel.

Bretagnolle, J. and Massart, P. (1989). Hungarian construction from the nonasymptotic viewpoint. Annals of Probability, 17:239–256.

Bühlmann, P. (1995). The blockwise bootstrap for general empirical processes of stationary sequences. Stochastic Processes and Their Applications, 58:247–265.

Cai, T. X. and Betensky, R. A. (2003). Hazard regression for interval-censored data with penalized splines. Biometrics, 59:570–579.

Cantelli, F. P. (1933). Sulla determinazione empirica delle leggi di probabilità. Giornale dell'Istituto Italiano degli Attuari, 4:421–424.


Chang, M. N. (1990). Weak convergence of a self-consistent estimator of the survival function with doubly censored data. Annals of Statistics, 18:391–404.

Cheng, G. and Kosorok, M. R. General frequentist properties of the posterior profile distribution. Annals of Statistics. To appear.

Cheng, G. and Kosorok, M. R. Higher order semiparametric frequentist inference with the profile sampler. Annals of Statistics. To appear.

Cheng, G. and Kosorok, M. R. (2007). The penalized profile sampler. http://arxiv.org/abs/math.ST/0701540.

Conway, J. B. (1990). A Course in Functional Analysis. Springer, New York, 2nd edition.

Cox, D. D. and O'Sullivan, F. (1990). Asymptotic analysis of penalized likelihood and related estimators. Annals of Statistics, 18:1676–1695.

Cox, D. R. (1975). Partial likelihood. Biometrika, 62:269–276.

Dedecker, J. and Louhichi, S. (2002). Maximal inequalities and empirical central limit theorems. In Dehling, H., Mikosch, T., and Sørensen, M., editors, Empirical Process Techniques for Dependent Data, pages 137–159. Birkhäuser, Boston.

Dehling, H., Mikosch, T., and Sørensen, M., editors (2002). Empirical Process Techniques for Dependent Data. Birkhäuser, Boston.

Dehling, H. and Philipp, W. (2002). Empirical process techniques for dependent data. In Dehling, H., Mikosch, T., and Sørensen, M., editors, Empirical Process Techniques for Dependent Data, pages 3–113. Birkhäuser, Boston.

Dehling, H. and Taqqu, M. S. (1989). The empirical process of some long-range dependent sequences with an application to U-statistics. Annals of Statistics, 17:1767–1783.

Di Bucchianico, A., Einmahl, J. H. J., and Mushkudiani, N. A. (2001). Smallest nonparametric tolerance regions. Annals of Statistics, 29:1320–1343.

Dixon, J. R., Kosorok, M. R., and Lee, B. L. (2005). Functional inference in semiparametric models using the piggyback bootstrap. Annals of the Institute of Statistical Mathematics, 57:255–277.

Donsker, M. D. (1952). Justification and extension of Doob's heuristic approach to the Kolmogorov-Smirnov theorems. Annals of Mathematical Statistics, 23:277–281.


Dudley, R. M. and Philipp, W. (1983). Invariance principles for sums of Banach space valued random elements and empirical processes. Probability Theory and Related Fields, 62:509–552.

Dugundji, J. (1951). An extension of Tietze's theorem. Pacific Journal of Mathematics, 1:353–367.

Dvoretzky, A., Kiefer, J., and Wolfowitz, J. (1956). Asymptotic minimax character of the sample distribution function and of the classical multinomial estimator. Annals of Mathematical Statistics, 27:642–669.

Eberlein, E. and Taqqu, M. S., editors (1986). Dependence in Probability and Statistics: A Survey of Recent Results. Birkhäuser, Basel.

Efron, B. (1967). The two sample problem with censored data. Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, 4:831–855.

Fine, J. P., Yan, J., and Kosorok, M. R. (2004). Temporal process regression. Biometrika, 91:683–703.

Fleming, T. R. and Harrington, D. P. (1991). Counting Processes and Survival Analysis. Wiley, New York.

Fleming, T. R., Harrington, D. P., and O'Sullivan, M. (1987). Supremum versions of the logrank and generalized Wilcoxon statistics. Journal of the American Statistical Association, 82:312–320.

Genovese, C. and Wasserman, L. (2002). Operating characteristics and extensions of the false discovery rate procedure. Journal of the Royal Statistical Society, Series B, 64:499–517.

Ghosal, S., Sen, A., and van der Vaart, A. W. (2000). Testing monotonicity of regression. Annals of Statistics, 28:1054–1082.

Gilbert, P. B. (2000). Large sample theory of maximum likelihood estimates in semiparametric biased sampling models. Annals of Statistics, 28:151–194.

Giné, E. (1996). Lectures on some aspects of the bootstrap. Unpublished notes.

Giné, E. (1997). Decoupling and limit theorems for U-statistics and U-processes. In Lectures on Probability Theory and Statistics (Saint-Flour, 1996), pages 1–35. Springer, Berlin. Lecture Notes in Mathematics, 1665.

Glivenko, V. (1933). Sulla determinazione empirica delle leggi di probabilità. Giornale dell'Istituto Italiano degli Attuari, 4:92–99.


Godambe, V. P. (1960). An optimum property of regular maximum likelihood estimation. Annals of Mathematical Statistics, 31:1208–1211.

Grenander, U. (1956). On the theory of mortality measurement, Part II. Skandinavisk Aktuarietidskrift, 39:125–153.

Groeneboom, P. (1989). Brownian motion with a parabolic drift and Airy functions. Probability Theory and Related Fields, 81:79–109.

Groeneboom, P. and Wellner, J. A. (1992). Information Bounds and Nonparametric Maximum Likelihood Estimation. Birkhäuser Verlag, Basel.

Gu, C. (2002). Smoothing Spline ANOVA Models. Springer, New York.

Gu, M. G. and Lai, T. L. (1991). Weak convergence of time-sequential censored rank statistics with applications to sequential testing in clinical trials. Annals of Statistics, 19:1403–1433.

Hall, P. (1992). The Bootstrap and Edgeworth Expansion. Springer-Verlag, New York.

Hastings, W. K. (1970). Monte Carlo sampling methods using Markov chains and their applications. Biometrika, 57:97–109.

Huang, J. (1996). Efficient estimation for the Cox model with interval censoring. Annals of Statistics, 24:540–568.

Huber, P. J. (1981). Robust Statistics. Wiley, New York.

Hunter, D. R. and Lange, K. (2002). Computing estimates in the proportional odds model. Annals of the Institute of Statistical Mathematics, 54:155–168.

Jameson, J. O. (1974). Topology and Normed Spaces. Chapman and Hall, London.

Johnson, W. B., Lindenstrauss, J., and Schechtman, G. (1986). Extensions of Lipschitz maps into Banach spaces. Israel Journal of Mathematics, 54:129–138.

Karatzas, I. and Shreve, S. E. (1991). Brownian Motion and Stochastic Calculus. Springer, New York, second edition.

Kiefer, J. and Wolfowitz, J. (1956). Consistency of the maximum likelihood estimator in the presence of infinitely many nuisance parameters. Annals of Mathematical Statistics, 27:887–906.

Kim, J. and Pollard, D. (1990). Cube root asymptotics. Annals of Statistics, 18:191–219.


Kim, Y. and Lee, J. (2003). Bayesian bootstrap for proportional hazards models. Annals of Statistics, 31:1905–1922.

Komlós, J., Major, P., and Tusnády, G. (1975). An approximation of partial sums of independent rv's and the sample df. I. Probability Theory and Related Fields, 32:111–131.

Komlós, J., Major, P., and Tusnády, G. (1976). An approximation of partial sums of independent rv's and the sample df. II. Probability Theory and Related Fields, 34:33–58.

Kosorok, M. R. (1999). Two-sample quantile tests under general conditions. Biometrika, 86:909–921.

Kosorok, M. R. (2003). Bootstraps of sums of independent but not identically distributed stochastic processes. Journal of Multivariate Analysis, 84:299–318.

Kosorok, M. R., Lee, B. L., and Fine, J. P. (2004). Robust inference for univariate proportional hazards frailty regression models. Annals of Statistics, 32:1448–1491.

Kosorok, M. R. and Ma, S. (2005). Comment on "Semilinear high-dimensional model for normalization of microarray data: a theoretical analysis and partial consistency" by J. Fan, H. Peng, and T. Huang. Journal of the American Statistical Association, 100:805–807.

Kosorok, M. R. and Ma, S. (2007). Marginal asymptotics for the "large p, small n" paradigm: With applications to microarray data. Annals of Statistics, 35:1456–1486.

Kosorok, M. R. and Song, R. (2007). Inference under right censoring for transformation models with a change-point based on a covariate threshold. Annals of Statistics, 35:957–989.

Kress, R. (1999). Linear Integral Equations. Springer, New York, second edition.

Künsch, H. R. (1989). The jackknife and the bootstrap for general stationary observations. Annals of Statistics, 17:1217–1241.

Lee, B. L. (2000). Efficient Semiparametric Estimation Using Markov Chain Monte Carlo. PhD thesis, University of Wisconsin at Madison, Madison, Wisconsin.

Lee, B. L., Kosorok, M. R., and Fine, J. P. (2005). The profile sampler. Journal of the American Statistical Association, 100:960–969.

Liang, K. Y. and Zeger, S. L. (1986). Longitudinal data analysis using generalized linear models. Biometrika, 73:13–22.


Lin, D. Y., Fleming, T. R., and Wei, L. J. (1994). Confidence bands for survival curves under the proportional hazards model. Biometrika, 81:73–81.

Lin, X. and Carroll, R. J. (2001). Semiparametric regression for clustered data using generalized estimating equations. Journal of the American Statistical Association, 96:1045–1056.

Lin, X. and Carroll, R. J. (2006). Semiparametric estimation in general repeated measures problems. Journal of the Royal Statistical Society, Series B, 68:69–88.

Liu, R. Y. and Singh, K. (1992). Moving blocks jackknife and bootstrap capture weak dependence. In LePage, R. and Billard, L., editors, Exploring the Limits of Bootstrap, pages 225–248. Wiley, New York.

Ma, S. and Kosorok, M. R. (2005a). Penalized log-likelihood estimation for partly linear transformation models with current status data. Annals of Statistics, 33:2256–2290.

Ma, S. and Kosorok, M. R. (2005b). Robust semiparametric M-estimation and the weighted bootstrap. Journal of Multivariate Analysis, 96:190–217.

Ma, S., Kosorok, M. R., Huang, J., Xie, H., Manzella, L., and Soares, M. B. (2006). Robust semiparametric microarray normalization and significance analysis. Biometrics, 62:555–561.

Mammen, E. and van de Geer, S. A. (1997). Penalized quasi-likelihood estimation in partial linear models. Annals of Statistics, 25:1014–1035.

Mason, D. M. and Newton, M. A. (1992). A rank statistic approach to the consistency of a general bootstrap. Annals of Statistics, 20:1611–1624.

Massart, P. (1990). The tight constant in the Dvoretzky-Kiefer-Wolfowitz inequality. Annals of Probability, 18:1269–1283.

Megginson, R. E. (1998). An Introduction to Banach Space Theory. Springer, New York.

Metropolis, N., Rosenbluth, A. W., Rosenbluth, M. N., Teller, A. H., and Teller, E. (1953). Equation of state calculations by fast computing machines. Journal of Chemical Physics, 21:1087–1092.

Murphy, S. A. (1994). Consistency in a proportional hazards model incorporating a random effect. Annals of Statistics, 22:712–731.

Murphy, S. A., Rossini, A. J., and van der Vaart, A. W. (1997). Maximum likelihood estimation in the proportional odds model. Journal of the American Statistical Association, 92:968–976.


Murphy, S. A. and van der Vaart, A. W. (1999). Observed information in semiparametric models. Bernoulli, 5:381–412.

Murphy, S. A. and van der Vaart, A. W. (2000). On profile likelihood. With comments and a rejoinder by the authors. Journal of the American Statistical Association, 95:449–485.

Nadeau, C. and Lawless, J. F. (1998). Inferences for means and covariances of point processes through estimating functions. Biometrika, 85:893–906.

Naik-Nimbalkar, U. V. and Rajarshi, M. B. (1994). Validity of blockwise bootstrap for empirical processes with stationary observations. Annals of Statistics, 22:980–994.

Neumeyer, N. (2004). A central limit theorem for two-sample U-processes. Statistics and Probability Letters, 67:73–85.

Nolan, D. and Pollard, D. (1987). U-processes: Rates of convergence. Annals of Statistics, 15:780–799.

Nolan, D. and Pollard, D. (1988). Functional limit theorems for U-processes. Annals of Probability, 16:1291–1298.

Parner, E. (1998). Asymptotic theory for the correlated gamma-frailty model. Annals of Statistics, 26:183–214.

Peligrad, M. (1998). On the blockwise bootstrap for empirical processes for stationary sequences. Annals of Probability, 26:877–901.

Pfanzagl, J. (1990). Estimation in Semiparametric Models: Some Recent Developments, volume 63 of Lecture Notes in Statistics. Springer.

Politis, D. N. and Romano, J. P. (1994). Large sample confidence regions based on subsampling under minimal assumptions. Annals of Statistics, 22:2031–2050.

Pollard, D. (1990). Empirical Processes: Theory and Applications, volume 2 of NSF-CBMS Regional Conference Series in Probability and Statistics. Institute of Mathematical Statistics and American Statistical Association, Hayward, California.

Pollard, D. (1991). Asymptotics for least absolute deviation regression estimators. Econometric Theory, 7:186–199.

Polonik, W. (1995). Measuring mass concentrations and estimating density contour clusters: An excess mass approach. Annals of Statistics, 23:855–881.


Praestgaard, J. and Wellner, J. A. (1993). Exchangeably weighted bootstraps of the general empirical process. Annals of Probability, 21:2053–2086.

Prakasa Rao, B. L. S. (1969). Estimation of a unimodal density. Sankhyā, Series A, 31:23–36.

Prentice, R. L. and Zhao, L. P. (1991). Estimating equations for parameters in means and covariances of multivariate discrete and continuous responses. Biometrics, 47:825–839.

Radulovic, D. (1996). The bootstrap for empirical processes based on stationary observations. Stochastic Processes and Their Applications, 65:259–279.

Rubin, D. B. (1981). The Bayesian bootstrap. Annals of Statistics, 9:130–134.

Rudin, W. (1987). Real and Complex Analysis. McGraw-Hill, Inc., New York, third edition.

Rudin, W. (1991). Functional Analysis. McGraw-Hill, Inc., New York, second edition.

Scheffé, H. (1959). The Analysis of Variance. Wiley, New York.

Sellke, T. and Siegmund, D. (1983). Sequential analysis of the proportional hazards model. Biometrika, 70:315–326.

Shao, J. and Tu, D. (1995). The Jackknife and Bootstrap. Springer-Verlag, New York.

Shen, X. (2002). Asymptotic normality of semiparametric and nonparametric posterior distributions. Journal of the American Statistical Association, 97:222–235.

Skorohod, A. V. (1976). On a representation of random variables. Theory of Probability and its Applications, 21:628–632.

Slud, E. V. (1984). Sequential linear rank tests for two-sample censored survival data. Annals of Statistics, 12:551–571.

Storey, J. D. (2002). A direct approach to false discovery rates. Journal of the Royal Statistical Society, Series B, 64:479–498.

Storey, J. D., Taylor, J. E., and Siegmund, D. (2004). Strong control, conservative point estimation and simultaneous conservative consistency of false discovery rates: A unified approach. Journal of the Royal Statistical Society, Series B, 66:187–205.


Strassen, V. (1964). An invariance principle for the law of the iterated logarithm. Probability Theory and Related Fields, 3:211–226.

Tsiatis, A. A. (2006). Semiparametric Theory and Missing Data. Springer, New York.

van de Geer, S. A. (2000). Empirical Processes in M-Estimation. Cambridge University Press, New York.

van de Geer, S. A. (2001). Least squares estimation with complexity penalties. Mathematical Methods of Statistics, 10:355–374.

van der Laan, M. J. and Bryan, J. (2001). Gene expression analysis with the parametric bootstrap. Biostatistics, 2:445–461.

van der Laan, M. J. and Robins, J. M. (2003). Unified Methods for Censored Longitudinal Data and Causality. Springer-Verlag, New York.

van der Vaart, A. W. (1996). Efficient maximum likelihood estimation in semiparametric mixture models. Annals of Statistics, 24:862–878.

van der Vaart, A. W. (1998). Asymptotic Statistics. Cambridge University Press, New York.

van der Vaart, A. W. and Wellner, J. A. (1996). Weak Convergence and Empirical Processes: With Applications to Statistics. Springer-Verlag, New York.

van der Vaart, A. W. and Wellner, J. A. (2000). Preservation theorems for Glivenko-Cantelli and uniform Glivenko-Cantelli classes. In High Dimensional Probability, II (Seattle, WA, 1999), volume 47 of Progress in Probability, pages 115–133. Birkhäuser, Boston.

Vardi, Y. (1985). Empirical distributions in selection bias models. Annals of Statistics, 13:178–203.

Wahba, G. (1990). Spline Models for Observational Data, volume 59 of CBMS-NSF Regional Conference Series in Applied Mathematics. SIAM (Society for Industrial and Applied Mathematics), Philadelphia.

Wei, L. J. (1978). The adaptive biased coin design for sequential experiments. Annals of Statistics, 6:92–100.

Wellner, J. A. and Zhan, Y. H. (1996). Bootstrapping Z-estimators. Technical Report 308, Department of Statistics, University of Washington.

Wellner, J. A., Zhang, Y., and Liu, H. (2002). Two semiparametric estimation methods for panel count data. Unpublished manuscript.

West, M. (2003). Bayesian factor regression models in the "large p, small n" paradigm. In Bernardo, J. M., Bayarri, M. J., Dawid, A. P., Berger, J. O., Heckerman, D., Smith, A. F. M., and West, M., editors, Bayesian Statistics 7, pages 733–742. Oxford University Press, Oxford.

Wu, W. B. (2003). Empirical processes of long-memory sequences. Bernoulli, 9:809–831.

Yu, B. (1994). Rates of convergence for empirical processes of stationary mixing sequences. Annals of Probability, 22:94–116.

Zhang, D. (2001). Bayesian bootstraps for U-processes, hypothesis tests and convergence of Dirichlet U-processes. Statistica Sinica, 11:463–478.

Zhao, L. P., Prentice, R. L., and Self, S. G. (1990). Multivariate mean parameter estimation by using a partly exponential model. Journal of the Royal Statistical Society, Series B, 54:805–811.


Author Index

Andersen, P. K., 5, 25, 59
Andrews, D. W. K., 227, 305, 315
Arcones, M. A., 32, 229
Barbe, P., 398
Bassett, G., Jr., 72
Benjamini, Y., 306
Bertail, P., 398
Betensky, R. A., 398
Bickel, P. J., vii, 31, 102, 155, 327, 332, 347, 371
Bilius, Y., 126, 287
Billingsley, P., 102, 103, 110, 311, 315
Borgan, Ø., 5, 25, 59
Bradley, R. C., 227
Bretagnolle, J., 309, 311
Bryan, J., 307, 312
Bühlmann, P., 229
Cai, T. X., 398
Cantelli, F. P., 10
Carroll, R. J., 445, 453
Chang, M. N., 247
Cheng, G., 365, 370
Conway, J. B., 332, 342, 345
Cox, D. D., 284
Cox, D. R., 5, 41
Dedecker, J., 228
Dehling, H., 227, 228
Di Bucchianico, A., 364
Dixon, J. R., 384, 389, 393, 395
Donsker, M. D., 11
Dudley, R. M., 31
Dugundji, J., 192
Dvoretzky, A., 210, 309, 310
Eberlein, E., 227
Efron, B., 25
Einmahl, J. H. J., 364
Fine, J. P., 178, 291, 303, 363, 365, 366, 377, 388, 427, 429, 437, 444, 457
Fleming, T. R., 5, 25, 59, 61, 241, 288, 394
Götze, F., 371
Genovese, C., 308
Ghosal, S., 32


Gilbert, P. B., 385, 387
Gill, R. D., 5, 25, 59
Giné, E., 32, 398
Glivenko, V., 10
Godambe, V. P., 63, 64, 443, 452
Grenander, U., 278
Groeneboom, P., 280, 400, 413
Gu, C., 420
Gu, M. G., 126, 287
Hall, P., 180
Harrington, D. P., 5, 25, 59, 61, 241, 288
Hastings, W. K., 365
Hochberg, Y., 306
Huang, J., 310, 316, 363, 405, 412
Huber, P. J., 73
Hunter, D. R., 394
Jameson, J. O., 120
Johnson, W. B., 203
Karatzas, I., 137
Keiding, N., 5, 25, 59
Kiefer, J., 210, 309, 310, 419
Kim, J., 277
Kim, Y., 394
Klaassen, C. A. J., vii, 102, 327, 332, 347
Koenker, R., 72
Komlós, J., 31, 309
Kosorok, M. R., 178, 218, 221, 223, 233, 247, 248, 265, 277, 287, 291, 303, 305–307, 309, 310, 312, 316, 363, 365, 366, 370, 371, 377, 384, 388, 389, 393, 395, 397, 422, 423, 427, 429, 437, 444, 457
Kress, R., 95, 102, 444
Künsch, H. R., 229
Lai, T. L., 287
Lange, K., 394
Lawless, J. F., 437
Lee, B. L., 291, 303, 363, 365, 366, 377, 384, 388, 389, 393, 395, 427–429
Lee, J., 394
Lin, D. Y., 394
Lin, X., 445, 453
Lindenstrauss, J., 203
Liu, H., 397
Liu, R. Y., 229
Louhichi, S., 228
Ma, S., 265, 306, 307, 309, 310, 312, 316, 371, 377, 397, 422, 423
Major, P., 31, 309
Mammen, E., 6, 70, 284, 398, 420, 450
Manzella, L., 310, 316
Mason, D. M., 398
Massart, P., 210, 309–311
Megginson, R. E., 198
Metropolis, N., 365
Mikosch, T., 227
Murphy, S. A., 48, 291, 292, 356, 357, 360, 361, 363, 377, 420, 428
Mushkudiani, N. A., 364
Nadeau, C., 437
Naik-Nimbalkar, U. V., 229
Neumeyer, N., 32
Newton, M. A., 398
Nolan, D., 32
O'Sullivan, F., 284
O'Sullivan, M., 288
Parner, E., 292
Peligrad, M., 229
Pfanzagl, J., 419
Philipp, W., 31, 228
Ploberger, W., 315
Politis, D. N., 264, 371, 398
Pollard, D., vii, 32, 73, 218–221, 233, 277, 289


Polonik, W., 155Praestgaard, J., 222Prakasa Rao, B. L. S., 282Prentice, R. L., 442

Radulovic, D., 229, 230Rajarshi, M. B., 229Ritov, Y., vii, 102, 327, 332, 347Robins, J. M., 321, 453Romano, J. P., 264, 371, 398Rosenblatt, M., 31, 155Rosenbluth, A. W., 365Rosenbluth, M. N., 365Rossini, A. J., 291Rubin, D. B., 179Rudin, W., 102, 330, 332

Schechtman, G., 203Scheffe, H., 372Self, S. G., 442Selke, T., 287Sen, A., 32Shao, J., 180Shen, X., 372Shreve, S. E., 137Siegmund, D., 287, 308, 309Singh, K., 229Skorohod, A. V., 312Slud, E. V., 287Soares, M. B., 310, 316Song, R., 265, 277, 305, 316Sørensen, M., 227Storey, J. D., 308, 309Strassen, V., 31

Taqqu, M. S., 227Taylor, J. E., 308, 309Teller, A. H., 365Teller, E., 365Tsiatis, A. A., 321Tu, D., 180Tusnady, G., 31, 309

van de Geer, S. A., 6, 70, 167, 264,268, 284, 286, 398, 401,412, 420, 447, 450

van der Laan, M. J., 307, 312,321, 453

van der Vaart, A. W., vii, 32, 33,48, 88, 102, 106, 126,149, 153, 178, 205, 250,262, 282, 291, 332, 339,342, 347, 356, 357, 360,361, 363, 377, 401, 405,412, 413, 419, 420, 428

van Zwet, W. R., 371Vardi, Y., 393

Wahba, G., 420, 444Wang, Y., 32Wasserman, L., 308Wei, L. J., 207, 287, 394Wellner, J. A., vii, 33, 88, 102,

106, 126, 153, 178, 205,222, 250, 262, 282, 327,332, 347, 397, 398, 400,412, 413

West, M., 306Wolfowitz, J., 210, 309, 310, 419Wu, W. B., 227

Xie, H., 310, 316

Yan, J., 178, 437, 444, 457Ying, Z., 126, 287Yu, B., 227, 229

Zeger, S. L., 442, 443
Zhan, Y. H., 398
Zhang, D., 32
Zhang, Y., 397
Zhao, L. P., 442


List of Symbols

A: σ-field of measurable sets, 82
A: tangent set, 402
A, A∗: direction, orthogonal projection vectors, 403
A∗: µ-completion of A, 83
B∗: adjoint of the operator B, 41

B′: transpose of B, 3
B⊥: orthocomplement of B, 331
B∗: dual of the space B, 328
B(D, E): space of bounded linear operators between D and E, 94

BL1(D): space of real functions on D with Lipschitz norm bounded by 1, 19

as∗⇝M: conditional convergence of bootstrap outer almost surely, 20
P⇝M: conditional convergence of bootstrap in probability, 19
B: Brownian bridge, 11

C: random variable with Chernoff’s distribution, 280

C[a, b]: space of continuous, real functions on [a, b], 23

C(T, ρ): space of ρ-continuous functions on T, 87

Cc: complement of the set C, 159

Cc: class of complements of the set class C, 159

D × E: class of pairwise Cartesian product sets, 159

C ⊓ D: class of pairwise intersections, 159

C ⊔ D: class of pairwise unions, 159

Cb(D): space of bounded, continuous maps f : D → R, 14

CαM(X): bounded real Lipschitz continuous functions on X, 166

as→: convergence almost surely, 10
as∗→: convergence outer almost surely, 14
P→: convergence in probability, 14

⇝: weak convergence, 11



convF: convex hull of a class F, 158

convF: closed convex hull of a class F, 158

sconvF: symmetric convex hull of a class F, 158

sconvF: symmetric closed convex hull of a class F, 158

cov[X, Y]: covariance of X and Y, 11

(D, d), (E, e): metric spaces, 13
D[a, b]: space of cadlag functions on [a, b], 22
D × E: Cartesian product of D and E, 88
δB: boundary of the set B, 108
δx: point mass at x or Dirac measure, 10
diam A: diameter of the set A, 166

E: closure of E, 82
E: interior of E, 82
E(t, β): expected Cox empirical average, 56
En(t, β): Cox empirical average, 5
E_∗: inner expectation, 14
E^∗: outer expectation, 14

F: distribution function, 9
F: envelope of the class F, 18
F, G: collections of functions, 10, 19
Fδ, F²∞: special modifications of the class F, 142
F(P,2): L2(P)-closure of F, 172
sconv(P,2) F: L2(P)-closure of sconvF, 172

F: mean-zero centered F, 171
F1 ∨ F2: all pairwise maximums, 142
F1 ∧ F2: all pairwise minimums, 142
F1 + F2: all pairwise sums, 142
F1 × F2: all pairwise products, 142
F1 ∪ F2: union of classes F1 and F2, 142
F > 0: class of sets {x : f(x) > 0} over f ∈ F, 160
Fn: empirical distribution function, 9
FT: extraction function class, 128

G: general Brownian bridge, 11
Gn: empirical process, 11
G′n, G′′n: multiplier bootstrap empirical process, 181
Gn: standardized empirical distribution function, 10

H: Hilbert space, 326

1A: indicator of A, 4
〈·, ·〉: inner product, 43
Iθ,η: efficient information matrix, 40

J: Sobolev norm, 6
J[](δ, F, ‖ · ‖): bracketing integral, 17
J∗[](δ, F), J[](δ, F, ‖ · ‖): modified bracketing integrals, 208, 209
J(δ, F, ‖ · ‖): uniform entropy integral, 18
J∗(δ, F): modified uniform entropy integral, 208

ℓ̇θ,η: score for θ when η is fixed, 40

ℓ̃θ,η: efficient score for θ, 40
ℓ(t, θ, η): likelihood for approximately least favorable submodel, 352
ℓ̇(t, θ, η): score for approximately least favorable submodel, 352

ℓ∞(T): set of all uniformly bounded real functions on T, 11



Lr: equivalence class of r-integrable functions, 16

L2^0(P): mean zero subspace of L2(P), 37
λn: smoothing parameter, 7
Λ(t): integrated hazard function, 5
lin H: linear span of H, 41
lin H: closed linear span of H, 41

mθ,η(X): objective function for semiparametric M-estimator, 401

↦: function specifier (“maps to”), 11

mX(δ), mF(δ): modulus of continuity of process X, class F, 136, 172

M: martingale, 42
Mn(θ): M-estimating equation, 28
T_∗: maximal measurable minorant of T, 14
T^∗: minimal measurable majorant of T, 14
a ∨ b: maximum of a and b, 19
a ∧ b: minimum of a and b, 6

N(t): counting process, 5
N(T): null space of T, 94
N[](ε, F, ‖ · ‖): bracketing number, 16
N(ε, F, ‖ · ‖): covering number, 18
‖ · ‖2,1: special variant of L2 norm, 20
‖ · ‖P,r: Lr(P) norm, 16
‖ · ‖∞: uniform norm, 19
‖ · ‖ψ: Orlicz ψ-norm, 128
‖ · ‖T: uniform norm over T, 87

O: topology, 81
Ω, Ωn: sample spaces, 14, 107
(Ω, A, P): probability space, 83
(Ωn, An, Pn): sequence of probability spaces, 108
OP(rn): rate rn asymptotic boundedness in probability, 7
oP(rn): rate rn convergence to zero in probability, 7

⊥: orthogonal, 326
∅: empty set, 82
⊗2: outer product, 56

Π: projection operator, 326
PP: tangent set, 37
P: collection of probability measures, 35
Pn: empirical measure, 10
Pn: symmetrized, bootstrapped empirical processes, 138, 408
Pn, Pn: weighted empirical measures, 194, 196

P, Q: probability measures on underlying sample space X, 10, 18

P_∗: inner probability, 14
P^∗: outer probability, 14
ψP: influence function, 37
ψ(P): model parameter, 37
ψp: the function x ↦ ψp(x) = exp(x^p) − 1, 129
ψ̃P: efficient influence function, 38
Ψn: Z-estimating equation, 24

R(T): range space of T, 94
R: real numbers, 3
R̄: extended reals, 14

σ(f, g): covariance between f(X) and g(X), 19

σ̂(f, g): empirical estimate of σ(f, g), 19

sign(a): sign of a, 52

T: index set for a stochastic process, 9

Tn: estimator, 36

UC(T, ρ): set of uniformly continuous functions from (T, ρ) to R, 15

Un(t, β): Cox empirical score, 5



u ⊙ v: pointwise vector product, 220

V(C), V(F): VC-index of a set C, function class F, 156, 157

var[X]: variance of X, 15

W: Brownian motion, 11
Wn, Wn: centered, weighted empirical measures, 193

X, Xn: random map, sequence, 14
X̂n: bootstrapped version of Xn, 19
X: sample space, 10
(X, A): measurable space, 82
(X, A, µ): measure space, 83
(X, O): topological space, 82
{X(t), t ∈ T}: stochastic process, 9

Y(t): “at-risk” process, 5

Z: two-sided Brownian motion, 280


Subject Index

α-mixing, see strongly mixing
α-trimmed mean, 249
absolutely regular, 228
adjoint, 41
almost measurable Suslin (AMS), 219
almost sure representation, 118
alternating projections, 327
AMS, see almost measurable Suslin
analytic set, 85
Anderson-Darling statistic, 213
Argmax theorem, 264
Arzelà-Ascoli theorem, 88
asymptotic normality, 402, 440, 451
asymptotically linear, 37, 335, 339
asymptotically measurable, 110
asymptotically tight, see tightness, asymptotic, 109

β-mixing, see absolutely regular
Banach space, 86, 250, 258, 324, 328
Banach’s theorem, 95
Bayesian methods, 47, 315, 372, 394

Bernstein-von Mises theorem, 394
biased sampling, see model, biased sampling
bijection, 85
BKRW: abbreviation for Bickel, Klaassen, Ritov and Wellner (1998), 327

block jackknife, 371
bone marrow transplantation, 437
bootstrap, 12, 19, 222, 371, 387
  m within n, 371
  accelerated, 394
  Bayesian, 179
  empirical process, 19, 179
  multiplier, 20, 181, 183
  nonparametric, 180, 303, 371, 387
  parametric, 180
  piggyback, 389
  subsampling, 371
  weighted, 194, 302, 371, 387, 407
  wild, 222, 288

bootstrap consistency, 193
Borel measurable, 14, 85, 104



Borel sets, 83
Borel σ-field, 83
Borel-Cantelli lemma, 124
bounded function space, 113
bracket, 16
bracketing, see entropy, with bracketing
bracketing number, 16
Breslow estimator, 45
Brownian bridge, 11, 309
Brownian motion, 11
  two-sided, 280
BUEI: bounded uniform entropy integral, 155
BUEI class, 162
BUEI and PM class, 165, 439

cadlag, 22, 87
caglad, 246
capacity bound, 220
Cartesian product, 88
Cauchy sequence, 85
Cauchy-Schwarz inequality, 326
causal inference, 321
central limit theorem, see Donsker theorem
  bootstrap, 187, 230
  functional, 218, 228, 229
  multiplier, 181

chain rule, 96
chaining, 133
change-point model, see model, change-point
Chebyshev’s inequality, 91
Chernoff’s distribution, 280
closed set, 82
compact operator, 95
compact set, 82
compact, σ, set, see σ-compact set
complete semimetric space, 85
completion, 83
confidence band, 11, 389, 441
consistency, 21, 252, 266, 295, 309, 402, 434, 439, 446

contiguity, 214
contiguous alternative, 214, 342
continuous
  at a point, 84
  map, 82
  uniformly in pth mean, 106
continuous mapping theorem, 12, 16, 109
  bootstrap, 189, 190
  extended, 117

continuously invertible, 26, 95, 299
contraction, 95
convergence
  almost sure, 10, 115
  almost uniform, 115
  dominated, 92
  in probability, 14, 116
  monotone, 92
  outer almost sure, 14, 116
  stochastic, 13, 103
  weak, 4, 11, 107, 301
  weak, of nets, 108

convex hull, 158
convolution theorem, 338
copula, 249
countable additivity, 83
covariance, 11, 340
cover, 98
covering number, 17, 132, 410
Cox model, 41, 45, 59, 352, 355, 361, 362, 367, 368, 384, 391, 399

Cox score process, 287
Cramér-Rao lower bound, 38, 338
Cramér-von Mises statistic, 213
current status data, 355, 362, 368, 399

delta method, 13, 21, 235, 258
dense set, 82, 330
dependent observations, 227
derivative
  compact, 22
  Fréchet, 26, 96, 254, 258, 298
  Gâteaux, 22, 95



  Hadamard, 13, 22, 96, 235, 254, 258, 338
differentiable in quadratic mean, 37
differentiable relative to tangent set, 334
Donsker class, 11, 148, 155, 180, 181, 254, 301, 335, 339
Donsker preservation, 172
Donsker theorem, 17, 18, 127, 149, 228, 229
double robustness, 321
doubly-censored data, 247
dual space, 328
Dugundji’s extension theorem, 192, 236
Duhamel equation, 244

efficiency, see efficient estimator, 452
efficient estimator, 4, 36, 38, 320, 333, 337, 338, 426, 435

efficient influence function, 38, 335
efficient score function, 40, 350
elliptical class, 212
empirical
  distribution function, 9
  measure, 10
  process, 9

entropy, 16, 155
  control, 410
  uniform, 156
  with bracketing, 16, 166, 411
entropy integral
  bracketing, 17, 411
  uniform, 18, 410

envelope, 18
equicontinuity
  stochastic, 404, 406
  uniform, 113, 114
equivalence class, 84
estimating equation, 43, 442
  efficient, 62
  optimal, 63, 442
existence, 293, 439

expectation
  inner, 14
  outer, 14

false discovery rate (FDR), 306
field, 82
field, σ, see σ-field
Fubini’s theorem, 92, 141
function classes changing with n, 224
function inverse, 12, 246
functional
  linear, see linear functional

G-C class, see Glivenko-Cantelli class
Gaussian process, 11, 106, 126, 340
Glivenko-Cantelli class, 10, 127, 144, 155, 180, 193
  strong, 127
  weak, 127
Glivenko-Cantelli preservation, 155, 169
Glivenko-Cantelli theorem, 16, 18, 127, 145

Grenander estimator, 278

Hahn-Banach theorem, 342
Hausdorff space, 82, 84
Hilbert space, 324
Hoeffding’s inequality, 137
Hungarian construction, see KMT construction
hypothesis testing, 342

independent but not identically distributed, 218
infinite dimensional parameter, 379
influence function, 37, 335, 394, 402
information
  efficient, 40
  Fisher, 38

information operator, 41, 297
inner product, 43, 325



  semi-, 325
interior of set, 82
isometry, 85, 330

Jensen’s inequality, 91, 101
joint efficiency, 341

Kaplan-Meier estimator, 25, 60, 243, 247

kernel estimator, 431
KMT construction, 31, 309
Kullback-Leibler information, 366

“large p, small n” asymptotics, 32, 306

law of large numbers, 10
law of the iterated logarithm, 31
least absolute deviation, 397
least concave majorant, 278
least squares, 397
L’Hospital’s rule, 137
Liang, K. Y., 442, 443
linear functional, 93, 327
linear operator, 93
  bounded, 93
  continuous, 93
linear space, 323
linear span, 37, 86
  closed, 37, 41, 86
local asymptotic power, 342
local efficiency, 321
long range dependence, 227

m-dependence, 228, 309
M-estimator, 13, 263, 397
  penalized, 398, 420
  semiparametric, 397, 399
  weighted, 407
manageability, 219
Mann-Whitney statistic, 239, 344
marginal efficiency, 341
Markov chain Monte Carlo, 365
martingale, 25, 42, 59, 241, 300
maximal inequality, 128, 133
maximal measurable minorant, 14, 89, 90

maximum likelihood, 44, 291, 379, 397
  nonparametric, 291
  semiparametric, 379

measurability, 138
measurable
  asymptotically, see asymptotically measurable
  Borel, see Borel measurable
  map, 83, 104, 138
  sets, 82
  space, 82
measure, 83
measure space, 83
M-estimator, 28
metric space, 81, 83, 103
Metropolis-Hastings algorithm, 365
microarray, 306
minimal measurable majorant, 14, 89, 90
mixture model, 69, 400
model

  biased sampling, 385, 392
  change-point, 271, 303
  Cox, see Cox model
  linear, 306
  misspecified, 400
  mixture, see mixture model
  nonparametric, 3
  parameter of, 37
  proportional odds, 291, 353, 392, 426
  semiparametric, see semiparametric model
  statistical, 35
  transformation, 316
  varying coefficient, 436

modulus of continuity, 136, 172
moment bounds, 208
monotone density estimation, 278
moving blocks bootstrap, 229
multiplier inequality, 181

neighborhood, 82
Nelson-Aalen estimator, 239



norm, 86, 325
normed space, 86, 236
null space, 94

onto, 47, 95, 299
open set, 82
operator, 26
  compact, see compact operator
  information, see information operator
  linear, see linear operator
  projection, see projection operator
  score, see score operator

optimality of estimators, see efficient estimators
optimality of tests, 342
Orlicz norm, 128
orthocomplement, 41, 326
orthogonal, 326

P-measurable class, 141
packing number, 132
partial likelihood, 43
peeling device, 268
penalized estimation, 445
penalized likelihood, 45
penalized profile sampler, see profile sampler, penalized
permissible, 228
Picard’s theorem, 444
piggyback bootstrap, see bootstrap, piggyback
PM, see pointwise measurability
pointwise measurability, 142, 155
Polish process, 105
Polish space, 85, 87, 110
portmanteau theorem, 108, 337
precompact set, see totally bounded set
probability
  inner, 14, 89
  outer, 14, 89

probability measure, 83

probability space, 83
product, 92, 138, 341

product integral, 242
profile likelihood, 45, 47, 351
  quadratic expansion of, 357
profile sampler, 363, 365
  penalized, 369
Prohorov’s theorem, 111
projection, 40, 323

  coordinate, 92, 221
projection operator, 326
proportional odds model, see model, proportional odds
pseudodimension, 221
pseudometric, see semimetric

quadratic expansion of profile likelihood, see profile likelihood, quadratic expansion of

Rademacher process, 137
Rademacher random variable, 137
random map, 85
range space, 94
rate of convergence, 28, 267, 446
rates of convergence, 402
regression
  counting process, 54
  least absolute deviation, 29, 52
  linear, 3, 35, 50, 66, 426
  logistic, partly linear, 6, 69, 283, 356, 400
  partly linear, 444
  Poisson, 69
  temporal process, 436

regular estimator, 4, 333, 335, 339
relative compactness, 111
repeated measures, 444
reproducing kernel Hilbert space, 444
residual distribution, 4, 436
reverse submartingale, 146



Riesz representation theorem, 328, 335

right-censored data, 25

σ-algebra, 82
σ-compact set, 82, 88, 114
σ-field, 82, 88
sample path, 10
score function, 37, 405
score operator, 41, 297, 444
semicontinuous, 84
  lower, 84
  upper, 84, 267

semimetric, 15, 84
seminorm, 86
semiparametric efficient, see efficient estimator
semiparametric inference, 36, 319
semiparametric model, 3, 35, 333
separable process, 105, 131
separable σ-field, 83
separable space, 82
sequences of functions, 211
shatter, 156
signed measure, 83
Skorohod metric, 281
Skorohod space, 87
Slutsky’s theorem, 112
stationary process, 227
stochastic process, 9, 103
Strassen’s theorem, 31
strong approximation, 31
strongly mixing, 227
sub-Gaussian process, 131
subconvex function, 337
subgraph, 157
submodel

  approximately least favorable, 46, 352, 359
  hardest, see submodel, least favorable
  least favorable, 36, 352
  one-dimensional, 36, 37, 43
  one-dimensional, smooth, 333

sumspace, 327

Suslin set, 85
Suslin space, 85, 144
symmetric convex hull, 158
symmetrization, 138
symmetrization theorem, 139

tail probability bound, 210
tangent set, 37, 333, 334
tangent space, 37
tightness, 15, 105
  asymptotic, 15, 109, 254
  uniform, 110

topological space, 82
topology, 82
  relative, 82
total variation, 238
totally bounded set, 85
triangle inequality, 83, 86
triangular array, 219
tuning parameter, 445

U-process, 32
uniformly efficient, 39

Vapnik-Cervonenkis (VC) class, 18, 156, 157, 229

variance estimation, 436
varying coefficient model, see model, varying coefficient
VC-class, see Vapnik-Cervonenkis class
VC-hull class, 158
VC-index, 156
VC-subgraph class, 157
vector lattice, 104
versions, 104, 339
Volterra integral equation, 444
VW: abbreviation for van der Vaart and Wellner (1996), 33

Watson statistic, 213
weak dependence, 309
well-separated maximum, 266
Wilcoxon statistic, 239
working independence, 437



Z-estimator, 13, 24, 43, 196, 251, 298, 398