
  • MAKING CAUSAL CONCLUSIONS

    FROM HETEROGENEOUS DATA SOURCES

    A DISSERTATION

    SUBMITTED TO THE DEPARTMENT OF STATISTICS

    AND THE COMMITTEE ON GRADUATE STUDIES

    OF STANFORD UNIVERSITY

    IN PARTIAL FULFILLMENT OF THE REQUIREMENTS

    FOR THE DEGREE OF

    DOCTOR OF PHILOSOPHY

    Evan Taylor Ragosa Rosenman

    August 2020

  • http://creativecommons.org/licenses/by-nc/3.0/us/

    This dissertation is online at: http://purl.stanford.edu/pq377kc2214

    © 2020 by Evan Taylor Ragosa Rosenman. All Rights Reserved.

    Re-distributed by Stanford University under license with the author.

    This work is licensed under a Creative Commons Attribution-Noncommercial 3.0 United States License.



I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.

    Mike Baiocchi, Co-Adviser

I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.

    Art Owen, Primary Adviser

I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.

    Julia Palacios

    Approved for the Stanford University Committee on Graduate Studies.

    Stacey F. Bent, Vice Provost for Graduate Education

This signature page was generated electronically upon submission of this dissertation in electronic format. An original signed hard copy of the signature page is on file in University Archives.


  • Abstract

    As datasets grow larger and more complex, the field of statistics must provide commensurate methods

    for synthesizing and gathering evidence. This thesis presents new methodology for related questions

    at the intersection of observational and experimental causal inference. I consider problems of “data

    fusion,” in which one set of causal estimates is derived by merging results from an observational

    dataset and an experimental dataset. I also consider how to design experiments informed by the

    results of observational studies.

    I begin by providing an overview of relevant concepts and prior work in causal inference, data

    fusion, sensitivity analysis, and optimization. Chapter 2 considers the data fusion question in the

    case when all confounding variables are measured; I propose several estimators and derive when

    each is expected to outperform. In Chapter 3, I remove the unconfoundedness assumption, which

    leads to a new class of estimators based on a shrinkage approach. Chapter 4 considers the design

question, and proposes a novel solution for regret-minimizing designs in the case of a binary outcome.

Throughout, I use data from the Women’s Health Initiative, a 1991 study of the efficacy of hormone

    therapy involving thousands of postmenopausal women, to demonstrate the utility of my methods.

    The contents of this thesis are drawn from two existing manuscripts. The second chapter is

    adapted from “Propensity Score Methods for Merging Observational and Experimental Datasets,”

    jointly authored with Art Owen, Michael Baiocchi, and Hailey Banack (Rosenman et al., 2018).

    The third chapter is adapted from “Combining Observational and Experimental Datasets Using

    Shrinkage Estimators,” jointly authored with Guillaume Basse, Art Owen, and Michael Baiocchi

    (Rosenman et al., 2020). The fourth chapter is forthcoming as its own paper.

    This work touches on a variety of methods from disparate areas of the literature. While the

    central problem is one of causal inference, tools from Empirical Bayes, decision theory, and convex

    optimization are deployed throughout. It is my hope that this work can serve two purposes. First, I

    hope it can excite methodologists to consider new ways of posing causal questions and new methods

for solving emergent challenges. Second, I hope to empower practitioners to more efficiently use

their data to identify treatment effects.


  • Acknowledgments

    If it takes a village to raise a child, it requires a small city to shepherd someone to a doctorate. With

    apologies for the lack of brevity, I would like to thank the many teachers, friends, and colleagues

    who have played a role in my intellectual and personal development over the past three decades.

    From the third through twelfth grades, I attended the Pingry School in Martinsville, New Jersey,

    and I am forever grateful for the extraordinary education I received there. In the eighth grade, my

    English teacher, Marnie McKoy, took to referring to me as “Dr. Rosenman,” a moniker that I can

    finally inhabit more than fifteen years later. The wonderful educators I encountered at Pingry –

    Mr. and Mrs. Grant, Dr. Dineen, Mrs. Landau, Mr. Coe, Mr. Tramontana, Dr. Korfhage, Dr.

    DeSimone, Mrs. O’Mara, and many others – have had a lasting impact on my intellectual life. My

    time at Pingry also provided me with lifelong friends who have been sources of great encourage-

ment throughout this doctorate, including Kerry Bickford, Darina Shtrakhman, Biff Parker-Magyar,

    Melinda Zoephel, and Meredith Skiba.

    My undergraduate years at Harvard were when I first discovered my ardor for STEM. Harvard

    was an intimidating place, in which impostor syndrome ran rampant. Inspiring teachers and mentors

    are the only reason I emerged with my passion for math and science intact. I’d like to acknowledge

    Yiling Chen, Joe Blitzstein, Sarah Koch, David Morin, and Mike Ruberry for all the guidance and

    excitement that they provided. Harvard also gave me a terrific group of friends, many of whom have

    patiently listened to me kvetch over the past five years. To Matt Chartier, Olga Zinoveva, Kevin

    Fogarty, Alice Li-Fogarty, Stephanie Wang, Danielle Kolin-Freeman, and Amy Guan – a hearty

    thanks.

    The choice to move to Washington, D.C. after undergrad, to work at a midsize tech company

    called Applied Predictive Technologies, remains one of my best decisions. I am so grateful for the

    wonderful friends I made in my three years there, many of whom have moved westward and become

    my support network here in the Bay Area. Many thanks to Alex Svistunov, Nitin Viswanathan,

    Kathy Qian, Chao Xue, Brady Kelly, Simon Krauss, J.D. Astudillo, and Liz Casano. I am also

    appreciative of the teachers and mentors I met while studying as a nighttime Master’s student at

    Georgetown. David Caraballo, Sivan Leviyang, Ken Shaw, and Ali Arab all played a role in nurturing

    my love of Statistics and helping me on the path to the Ph.D.


  • One of my greatest blessings has been sharing the doctoral experience with two badass Harvard

    statisticians who also matriculated in the fall of 2015. Both of these women influenced my decision to

    pursue a Ph.D., and I would never have completed this degree without them. To Michele Zemplenyi:

    there will never be another person who makes me laugh like you do. Thank you for all the moments

    of levity and delight, and for always being just a phone call away. To Kristen Hunter, who has

    enriched my life for a dozen years, thank you for so many adventures, for your endless willingness

    to call out the bullshit, for being my lifetime partner in crime.

    Any Ph.D. is an emotional rollercoaster – a cacophony of self-doubt, pride, disappointment, and

    fleeting moments of insight. If one is exceedingly lucky, out of this mess emerges a doctor. I have

    been so lucky. I must thank my wonderful committee members: Guillaume Basse, Julia Palacios,

    and Stefan Wager, as well as my longtime mentor, Clea Sarnquist. Each of these individuals has been

generous with time and with wisdom, and each has kept the proverbial office door ajar. Moreover,

    they have written recommendation letters for me, provided me guidance on career choices, and

    periodically reminded me to keep perspective. These faculty have modeled what it is to be a

    curious, responsible researcher and I will be proud to follow in their footsteps.

    It is a rare and wonderful thing to find an advisor who is simultaneously brilliant and kind. I

    have been fortunate enough to find two of them. So much can be said about Mike Baiocchi: that he

    is unrelentingly supportive of his students, incredibly down-to-earth, fearless, resilient, and smart as

    hell. He is a credit to the field of Statistics and to the state of Maine. Mike, you are the reason I

    have stuck with causal inference, and you have profoundly shaped my career. Just as importantly,

    you have made me feel valued, competent, and worthy at times when I did not believe in myself. I

    am so grateful to have had the chance to collaborate with you.

    Art Owen, whose statistical prowess is eclipsed only by his unfailing generosity and fundamental

    decency, has been an extraordinary mentor as well. Over the past four years, the opportunity to

    spend an hour a week talking to him about research and ideas has been a true privilege – and a

    complete blast. Art, you have made me a better researcher and a better communicator. You have

    been an advocate for me in every way: the person who cheers me up after a journal rejection; who

    is always willing to provide career advice; who reminds me to keep an eye on the bigger picture.

    Thank you for everything.

    I could not have imagined going through this experience without my cohort: nine brilliant,

    quirky individuals, who came together from all over the world to share in these five years. We may

    have seen each other less in recent times, but I will always remember the birthday cakes, the Patio

    outings, the barbecues, the hushed laughs shared in the backs of classrooms. I’d especially like to

    thank Rina, my sometimes travel buddy and perpetual coauthor, who has made some of the most

    trying times bearable; Claire, who has listened patiently to my many complaints, and always been a

    delightful co-conspirator in dark humor; and Andy, mensch among mensches, the guy who will teach

    you probability and also take you out for ramen when you’re feeling down. You have all enriched


  • this experience.

    Friendship and guidance from students in prior cohorts, most notably Alex Chin, Jelena Markovic,

    Jessica Hwang, and Paulo Orenstein, have also been essential, as have many wonderful relationships

outside of Sequoia Hall. Jason Weinreb, Murphy Temple, Surajit Bose, Matt Seymour, Jeff Sheng,

    and Daniel Kremer: thank you, all.

    I would also like to thank my family. I find myself thinking of my grandparents, the last of

    whom – my paternal grandfather, Lawrence Rosenman – passed away just as I was beginning my

    doctorate. In the final voicemail he left me, Grandpa said, “I have love for all my grandkids – they

    all went to great universities, so many I don’t know what to do with them all. Have a wonderful

    day and do well. I love you.” It’s a simple message, and one I have cherished always. I also can’t

    help but think of my maternal grandparents, Frances and Amerigo Ragosa. Stung by the Great

    Depression, both cut their educations short in order to go to work: Mema barely managed to finish

    high school while Pop-Pop, an Italian immigrant, dropped out before getting his diploma. Denied

    opportunities for formal schooling, they built a successful life, but always yearned to learn more. I

    hope that I make them proud.

    My immediate family has spread all over the contiguous United States, but they have continued

    to be my rock, and I am immensely lucky to have them in my life. Thanks to my brother, Michael,

for his words of encouragement, offered from far-away Wisconsin. My father, Mark, the most happily

    retired attorney in the whole history of north-central New Jersey, has nonetheless been willing to

    proofread every single manuscript, application, poster, and presentation that I have produced in

    my graduate career – including this thesis. Dad has played the role of consummate problem-solver,

    always willing to go the extra mile to make my day more manageable, whether by buying me a

coffee mug or listening to me kvetch about my students. Dad, you have two characteristics that

    have profoundly influenced my graduate career. First, you are a stubborn man, and I inherited that

    quality, and it has served me well. Second, your unshakeable belief in my abilities is almost certainly

    somewhat misplaced, but it has gone a long way in helping me to stay the course. Thank you.

    And last but not least, to the BAMF-without-an-F, the OG embodiment of “Nevertheless, she

    persisted,” the woman who showed early promise in math, who graduated summa cum laude from

    a 98% male engineering college, who managed assembly lines, who earned a nighttime law degree

    while raising two children and working full-time, who became an attorney at 49, who scrambled

    and got knocked down and picked herself back up, who made partner, who has worked nonstop

    since age 21 and faced down a lifetime of misogynist bullshit and who still fights every day to make

    things a little bit easier for the next crop of women – to the incomparable Diane Ragosa: there is

    nothing better in life than having a baller for a mother. I would never have attempted something so

    ambitious without your example, and I would never have finished it without your unwavering love

    and support. Thank you.


  • Contents

    Abstract iv

    Acknowledgments v

    1 Introduction 1

    1.1 The Data Revolution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1

    1.2 Causal Inference Review . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2

    1.2.1 Potential Outcome Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2

    1.2.2 Sources of Randomness . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

    1.2.3 Estimation Methods for Observational Studies . . . . . . . . . . . . . . . . . 4

    1.3 Literature Review . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5

    1.4 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7

    2 Combining Datasets Under SITA 9

    2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9

    2.1.1 Scientific assumptions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10

    2.1.2 Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10

    2.2 Notation, assumptions and estimators . . . . . . . . . . . . . . . . . . . . . . . . . . 11

    2.2.1 Sampling assumptions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11

2.2.2 Stratification and treatment effect assumptions . . . . . . . . . . . . . . 12

    2.2.3 Estimators . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13

    2.3 Delta method results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14

    2.3.1 Population quantities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15

    2.3.2 Main theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16

    2.3.3 Delta method means and variances . . . . . . . . . . . . . . . . . . . . . . . 17

    2.3.4 The dynamic weighted estimator . . . . . . . . . . . . . . . . . . . . . . . . . 19

    2.3.5 Performance comparison . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20

    2.3.6 Proof of Corollary 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21

    2.4 Derivation: Dynamic Weighted Estimator . . . . . . . . . . . . . . . . . . . . . . . . 22


  • 2.5 Simulations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23

    2.5.1 Simulation of the ideal case . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23

    2.5.2 Restrictive enrollment criteria . . . . . . . . . . . . . . . . . . . . . . . . . . . 26

    2.5.3 Violation of Assumption 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27

    2.6 WHI data example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30

    2.6.1 Propensity score construction and covariate balance . . . . . . . . . . . . . . 30

2.6.2 Gold standard causal effect . . . . . . . . . . . . . . . . . . . . . . . . . 33

    2.6.3 Prognostic modeling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34

    2.6.4 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35

    2.7 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37

    3 Combining Datasets Without SITA 39

    3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39

    3.2 Related Literature . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40

    3.3 Notation, Assumptions, and Set-Up . . . . . . . . . . . . . . . . . . . . . . . . . . . 41

    3.3.1 Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41

    3.3.2 Assumptions and Loss Function . . . . . . . . . . . . . . . . . . . . . . . . . . 41

    3.3.3 Estimator Distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43

    3.4 Proposed Estimators . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43

    3.4.1 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44

3.4.2 λ1, Common Shrinkage Factor . . . . . . . . . . . . . . . . . . . . . . . . 45

3.4.3 λ2, Variance-Weighted Shrinkage Factors . . . . . . . . . . . . . . . . . . 49

    3.4.4 Practical Considerations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50

    3.5 Sensitivity Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51

    3.5.1 Set-Up . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52

3.5.2 Estimating Implied Λ . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52

    3.6 Simulations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54

    3.6.1 Simulation Set-Up . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54

    3.6.2 Identical Observational and RCT Covariate Distributions . . . . . . . . . . . 56

3.6.3 Differing Observational and RCT Covariate Distributions . . . . . . . . . 59

    3.7 Application to the Women’s Health Initiative Data . . . . . . . . . . . . . . . . . . . 61

    3.7.1 Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61

    3.7.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63

    3.8 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65

    4 Designing Experiments Using Observational Studies 67

    4.1 Problem Set-Up . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67

    4.1.1 A Change in Perspective . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67


  • 4.1.2 Stratification and Assumptions . . . . . . . . . . . . . . . . . . . . . . . . . . 68

    4.1.3 Loss and Problem Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . 69

    4.2 Converting to an Optimization Problem . . . . . . . . . . . . . . . . . . . . . . . . . 69

4.2.1 Naïve Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69

    4.2.2 Regret Minimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70

    4.3 Tractable Case: Binary Outcomes . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70

    4.4 Application to the Women’s Health Initiative Data . . . . . . . . . . . . . . . . . . . 74

    4.4.1 Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74

4.4.2 Detailed Example: Λ = 1.5, Fine Stratification . . . . . . . . . . . . . . . 75

    4.4.3 Performance Over Multiple Conditions . . . . . . . . . . . . . . . . . . . . . . 76

    4.5 Future Work: General Case . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78

    5 Conclusion 80

    5.1 Discussion and Limitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80

    5.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81

    A Appendix to Chapter 2 91

    A.1 Lengthier Proofs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91

    A.1.1 Proof of Theorem 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91

    A.1.2 Proof of Corollary 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92

    A.2 WHI Data Example: Additional Details . . . . . . . . . . . . . . . . . . . . . . . . . 93

    A.2.1 Observational study propensity modeling and covariate balance . . . . . . . . 93

    A.2.2 RCT covariate balance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93

    B Appendix to Chapter 3 95

    B.1 Proof of Theorem 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95

    B.2 Proof of Theorem 4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97

    C Appendix to Chapter 4 100

    C.1 Proof of Validity of Confidence Regions . . . . . . . . . . . . . . . . . . . . . . . . . 100

    C.1.1 Review of Proof in Zhao et al. (2019) . . . . . . . . . . . . . . . . . . . . . . 100

    C.1.2 Extension to Design Case . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102

    C.2 Proof of Concavity of Minimax Problem . . . . . . . . . . . . . . . . . . . . . . . . . 104


  • List of Tables

2.1 These are the four propensity vectors used in our simulations. The first two correlate with the

mean response vector, while the second two do not. The second and fourth imply

    larger sampling biases than the first and third do. . . . . . . . . . . . . . . . . . . . 24

2.2 MSEs for treatment effect in the ideal setting. Column 1 gives treatment (constant,

    linear, quadratic). Column 2 shows whether the propensity was correlated with the

mean response. Column 3 indicates the magnitude of the propensity vector. The

    remaining columns are mean squared errors for the overall treatment from our 5

    estimators and an oracle. In every case, the spiked-in estimator using (2.4) has lowest

    MSE. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25

2.3 MSEs for treatment effect in the setting with restricted enrollments. The columns are

    the same as in Table 2.2. Here the oracle estimator is always best and the dynamic

    estimator is the best of the ones that can be implemented. . . . . . . . . . . . . . . 27

    2.4 These are the results of the simulations where Assumption 3 is violated but the RCT

    has the same x distribution as the ODB. . . . . . . . . . . . . . . . . . . . . . . . . 29

    2.5 These are the results of the simulations where Assumption 3 is violated and the x in

    the RCT are subject to restrictive enrollment criteria. . . . . . . . . . . . . . . . . . 29

2.6 Standardized differences (SD) between treated and control populations in the obser-

    vational dataset, before and after stratification on the propensity score, for clinical

risk factors for coronary heart disease. . . . . . . . . . . . . . . . . . . . . . . . . 32

2.7 Standardized differences (SD) between treated and control populations in the obser-

    vational database, before and after stratification on the propensity score, for ethnicity

    category. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32

2.8 Standardized differences (SD) between treated and control populations in the obser-

    vational database, before and after stratification on the propensity score, for smoking

    category. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32


  • 3.1 Distribution of age variable values in the observational study, RCT, and RCT “silver”

    datasets. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62

    3.2 Distribution of history of cardiovascular disease in the observational study, RCT, and

    RCT “silver” datasets. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62

    3.3 Distribution of Langley scatter categories in the observational study, RCT, and RCT

    “silver” datasets. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63

    3.4 Simulation results for each stratification scheme. The third column gives the average

L2 loss over 1,000 replicates of τ̂r, the RCT-only estimator (assuming an RCT of

size 1,000). The following five columns give the average L2 loss of various shrinkage

estimators as a percentage of the average L2 loss of τ̂r. . . . . . . . . . . . . . . . 63

3.5 Frequency across simulations that the conditions under which λ1+ and λ2+ dominate

τ̂r are met. These conditions are given in Lemma 1 and Lemma 3. . . . . . . . . . . 65

    4.1 L2 loss comparisons for regret-minimizing allocations relative to equal allocation. For

    starred entries, the regret-minimizing allocation defaults to equal allocation. . . . . . 77

4.2 L2 loss comparisons for regret-minimizing allocations relative to naïve allocation. . .

A.1 Standardized differences (SD) between treated and control populations in RCT gold

    dataset, for clinical risk factors for coronary heart disease. . . . . . . . . . . . . . . . 94

A.2 Standardized differences (SD) between treated and control populations in RCT gold

    dataset, for ethnicity category. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94

A.3 Standardized differences (SD) between treated and control populations in RCT gold

    dataset, for smoking category. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94


  • List of Figures

    2.1 Performance measures across all 2,000 simulations run in the ideal case. Bias squared

    is shown in black, and variance in gray, so that total bar height represents the MSE.

    The much larger values for the RCT estimator are excluded to make visual comparison

    easier. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26

    2.2 Performance measures across all 2,000 simulations run in the restricted enrollment

    case. Bias squared is shown in black, and variance in gray, so that total bar height

    represents the MSE. The much larger values for the RCT estimator are excluded to

    make visual comparison easier. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28

    2.3 Propensity score distributions among treated and control women (left panel) and

    marginal propensity score distributions (right panel) for the ODB and RCT Women’s

    Health Initiative populations. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33

    2.4 ROC AUC scores for logistic regression prognostic score model in the control popu-

    lations of the ODB and RCT silver datasets. . . . . . . . . . . . . . . . . . . . . . . 35

    2.5 Causal estimators computed over 100 bootstrap replicates for small and large RCT

    sizes. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36

2.6 Root mean square error when estimating the causal effect of HT on CHD, across

100 bootstrap replicates for small and large RCT sizes. The gold standard causal

effect is taken to be the age-stratified reweighted estimator, the magnitude of which

    is shown via the dashed gold line. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37

3.1 Percent reduction in risk relative to τ̂r for our proposed estimators, the Green and

Strawderman estimators, and an oracle under four different conditions. Here, we

assume τ̂o is computed without any adjustment for selection bias, yielding a highly

    biased estimator. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57

3.2 Percent reduction in risk relative to τ̂r for our proposed estimators, the Green and

Strawderman estimators, and an oracle under four different conditions. Here, we

assume τ̂o is computed by stabilized inverse probability of treatment weighting, such

    that some of the selection bias is removed. . . . . . . . . . . . . . . . . . . . . . . . . 58


3.3 Percent reduction in risk relative to τ̂r for our proposed estimators, the Green and

Strawderman estimators, and an oracle under four different conditions. Here, we

assume τ̂o is computed without any adjustment for selection bias, yielding a highly

biased estimator. We also induce different distributions for the covariates Xi among

    the observational and RCT units. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60

3.4 Percent reduction in risk relative to τ̂r for our proposed estimators, the Green and

Strawderman estimators, and an oracle under four different conditions. Here, we

assume τ̂o is computed by stabilized inverse probability of treatment weighting, such

that some of the selection bias is removed. We also induce different distributions for

    the covariates Xi among the observational and RCT units. . . . . . . . . . . . . . . 61

4.1 Simulated example of confidence regions in four strata under Λ = 1.2. . . . . . . . . 73

4.2 Allocation of units to strata under naïve scheme and regret-minimizing scheme. . . 75

4.3 Average loss over 1,000 resamples of 1,000-unit experiments under equal-allocation,

naïve-allocation, and regret-minimizing allocation designs. . . . . . . . . . . . . . . 76

A.1 Nominal and cross-validated receiver operating characteristic area under curve for

propensity models with different numbers of variables . . . . . . . . . . . . . . . . . 94


  • Chapter 1

    Introduction

    1.1 The Data Revolution

    Passive data collection is a defining feature of modern life. Massive online social networks continually

    monitor user interactions (Eckles and Bakshy, 2017); hospitals capture patient medical records in

    electronic health databases (Charles et al., 2015); e-commerce giants record real-time sales data as

    their customers shop (Bajari et al., 2019). For applied researchers in the social and medical sciences,

    this ever-expanding global panopticon yields data that is both promising and perilous.

In the optimistic view, these observational data can provide insight into the causal effect of a

    proposed treatment, such as a novel drug regimen or a new marketing strategy. If the strongest

    assumptions hold, such data can be used to identify any desired causal e↵ect, obviating the need

    to run randomized trials. This prospect has yielded recent conjecture that Big Data may supplant

    experimentation as the future of decision-making (Bareinboim and Pearl, 2016).

    Yet the past half-century of causal inference research engenders deep skepticism toward such an

    approach (Imbens and Rubin, 2015). Researchers do not control the treatment assignment in obser-

    vational data, and, as a result, cannot be certain that treated individuals and untreated individuals

    are otherwise comparable. This challenge can be overcome only by making untestable assumptions

– and even if these assumptions hold, careful modeling is necessary to remove the selection effect.

    The applied literature includes myriad examples of treatments that showed promise in observational

    studies only to be overturned by later randomized trials (Hartman et al., 2015). One prominent

case, the effect of hormone therapy on the health of postmenopausal women, will be discussed at

    length in this manuscript (Writing Group for the Women’s Health Initiative Investigators, 2002).

    The “virtuous” counterpart to observational data is the well-designed experiment. Data from a

randomized trial yield unbiased estimates of a causal effect without the need for problematic sta-

tistical assumptions. Yet experiments suffer two significant drawbacks. First, they are frequently



    expensive, and, as a consequence, generally involve fewer units. Especially if one is interested in sub-

group causal effects, this means experimental estimates can be imprecise. Second, experiments often

    involve inclusion criteria that can make them dissimilar from target populations of interest. Hence,

while observational studies may suffer significant selection bias due to unmeasured confounding,

experimental data will frequently have high variance and may suffer from bias as well.

    There has been considerable recent interest in the development of statistical methods to synthe-

    size evidence from these two types of data (Mueller et al., 2018; Bareinboim and Pearl, 2016; Kallus

et al., 2018). Yet the current literature – discussed at length in Section 1.3 – offers few concrete

    methodological recommendations for applied researchers. This thesis will seek to fulfill the unmet

    need. We consider the data fusion problem from three angles. First, we develop methods for merging

experimental and observational causal effect estimates in the case when all confounding variables are

    measured in the observational studies. Next, we remove the unconfoundedness assumption, which

    leads to a new class of estimators based on a shrinkage approach. Finally, we propose a novel solution

    for designing experiments informed by observational studies, making use of the regret minimization

    framework. Throughout, we deploy tools from disparate areas of the literature, including Empirical

    Bayes, decision theory, and convex optimization.

    1.2 Causal Inference Review

    1.2.1 Potential Outcome Model

    In the vein of Chin (2019), I provide a brief review of causal inference concepts that will be relevant

    to this manuscript.

    We suppose we have access to a finite population of n individuals. We are considering a treatment

    of interest, such as an experimental drug or a behavioral intervention. In this thesis, we will only

consider binary treatments, and we associate with each unit i a random variable Wi ∈ {0, 1}, where Wi = 1 indicates that unit i receives the treatment and Wi = 0 indicates that unit i does not receive

    the treatment. We will also assume the units have an associated outcome Yi where we typically

suppose Yi ∈ R. Lastly, we will assume the units have measured covariates Xi ∈ R^p. Throughout, we will adopt the potential outcomes framework of Neyman and Rubin (Rubin,

    1974). We associate with each subject i two values, Yi(1) and Yi(0). These values represent the

    realized outcome for unit i if the unit is treated or not treated respectively. Hence, the observed

    outcome Yi satisfies

Y_i = W_i Y_i(1) + (1 - W_i) Y_i(0) .

    Implicit in this definition is the “Stable Unit Treatment Value Assumption” (SUTVA) – the

    assumption that a unit’s potential outcomes do not vary with treatments assigned to other units,

    and that there is only one version of the treatment (Rubin, 1980). We will make this assumption


throughout. Additionally, note that in Chapter 2, we use slightly different notation; the potential

outcomes are denoted (Yit, Yic) and we use Wit to denote Wi and Wic to denote 1 − Wi. The typical target of inference will be the average treatment effect (ATE),

\tau = \frac{1}{n} \sum_{i=1}^{n} \big( Y_i(1) - Y_i(0) \big) . \qquad (1.1)

We will often also be interested in treatment effects involving subgroups. Suppose we have subgroups

indexed by k = 1, . . . , K and we have associated indexing sets Sk such that:

\bigcup_{k} S_k = \{1, 2, \ldots, n\} \quad \text{and} \quad S_k \cap S_j = \emptyset \ \text{ for any } k \neq j.

Then the treatment effect for subgroup k is simply

\tau_k = \frac{1}{|S_k|} \sum_{i \in S_k} \big( Y_i(1) - Y_i(0) \big) .
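To fix ideas, the short simulation below is purely illustrative (it is not part of the thesis; names such as y0, y1, and subgroup are invented for the example). It generates a finite population of potential outcomes, randomizes a binary treatment, forms the observed outcome Y_i = W_i Y_i(1) + (1 − W_i) Y_i(0), and compares difference-in-means estimates against the finite-population ATE in (1.1) and the subgroup effects τk.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Finite population of potential outcomes (treated as fixed constants here).
x = rng.normal(size=n)                 # a single covariate
y0 = x + rng.normal(size=n)            # outcome under control, Y_i(0)
y1 = y0 + 1.0 + 0.5 * (x > 0)          # outcome under treatment, Y_i(1)
subgroup = (x > 0).astype(int)         # two strata, k = 0, 1

# True (unobservable) average treatment effects.
ate = np.mean(y1 - y0)
ate_k = [np.mean((y1 - y0)[subgroup == k]) for k in (0, 1)]

# Completely randomized assignment; the observed outcome is
# Y_i = W_i Y_i(1) + (1 - W_i) Y_i(0).
w = rng.binomial(1, 0.5, size=n)
y = w * y1 + (1 - w) * y0

# Difference-in-means estimates of the overall and subgroup effects.
dim = y[w == 1].mean() - y[w == 0].mean()
dim_k = [y[(w == 1) & (subgroup == k)].mean() - y[(w == 0) & (subgroup == k)].mean()
         for k in (0, 1)]

print(f"ATE = {ate:.3f}, estimate = {dim:.3f}")
print("subgroup ATEs:", np.round(ate_k, 3), "estimates:", np.round(dim_k, 3))
```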

    1.2.2 Sources of Randomness

    Wi is a random variable. Characterizing the joint distribution of all the Wi variables – known as the

    “assignment mechanism” – is a central part of inferring causality. One core dichotomy is between

    experimental and observational datasets. In the former, the researcher controls the treatment as-

    signment and thus knows the assignment mechanism explicitly. In the latter, the researcher typically

    makes some assumptions about the distribution and then seeks to infer it from the data.

    Define pi = P (Wi = 1) to be the unit-level treatment probability for unit i. A closely related

    concept is the “propensity score,” given by

e(x) = \frac{1}{\sum_{i} I(X_i = x)} \sum_{i : X_i = x} p_i ,

    the average treatment probability for all units such that Xi = x. These values will be extremely

    important to our analysis of observational studies, though they are also defined in the setting of

    experiments. Following Chapter 3 of Imbens and Rubin (2015), we can characterize assignment

    mechanisms with some additional descriptors based on the dependencies of pi. Namely, an assign-

    ment mechanism is

• probabilistic if 0 < pi < 1 for all i = 1, . . . , n, signifying that all units have some positive probability of receiving the treatment and some positive probability of receiving the control

    condition;

• individualistic if pi depends only on Yi(0), Yi(1), and Xi for i = 1, . . . , n, and exhibits no dependency on the covariates or potential outcomes of units j ≠ i; and


• unconfounded if pi does not depend on the potential outcomes conditional on the covariates, i.e. Xi = Xj ⟹ pi = pj for any i and j, even if their potential outcomes differ.

    Throughout this manuscript, we will assume probabilistic and individualistic assignment for all

    data. Unconfounded assignment is also always assumed for the experimental data, and is assumed for

    the observational data in Chapter 2 but not in subsequent chapters. The combination of probabilistic,

    individualistic, and unconfounded assignment is known as “strongly ignorable treatment assignment”

    (SITA) and we will often refer to whether or not SITA is assumed.

    In the typical setting, Wi is our only source of randomness and we treat Yi(1) and Yi(0) as fixed

    – but only partially observed – constants. This is the approach we will take in Chapters 2 and 3.

    An alternative formulation considers Yi(0) and Yi(1) to themselves be random variables (see e.g.

VanderWeele and Robins, 2012). In this setting, the definition in (1.1) can easily be redefined as

\tau = \frac{1}{n} \sum_{i=1}^{n} E\big( Y_i(1) - Y_i(0) \big) .

We can now also define a useful quantity that is related to the subgroup treatment effect: the

conditional average treatment effect (CATE),

\tau(x) = \frac{1}{n} \sum_{i=1}^{n} E\big( Y_i(1) - Y_i(0) \mid X_i = x \big) .

    The above definitions for the assignment mechanism also easily extend to this case; individualistic

and unconfounded assignment become properties of the joint distribution of Wi, Xi, Yi(0), and

    Yi(1), rather than properties of the finite sample of n units. We find that this formulation is

    more appropriate for discussion of experimental design, and will adopt it as needed in Chapter 4.

    1.2.3 Estimation Methods for Observational Studies

    In observational studies in which unconfoundedness holds, the treatment assignment is approxi-

mately independent of the potential outcomes among units who have sufficiently similar covariate

values. Hence, we can treat these units as being drawn from a local experiment, and directly com-

pare treated and control units to infer causal effects. A further insight, first offered in Rosenbaum

and Rubin (1983), is that the propensity score is a balancing score. This means that we need only

find units with sufficiently similar values of the propensity score – not the entire covariate vector –

    in order to obtain independence of the treatment assignment from the covariates and the potential

    outcomes.

    This discovery gave rise to causal inference techniques that rely on estimating the propensity

    score as a function of the covariates, and then using the fitted propensity score estimates to create

    cohorts of units with similar values. Common methods include pair-matching treated and control


    units; stratifying units; and weighting units in order to recover a local experiment (Austin, 2011).

    Propensity score stratification approaches will be used in Chapter 2, while weighting approaches,

    which give rise to “inverse probability weighting” (IPW) estimators, will play a major role in Chap-

    ters 3 and 4.
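As a concrete illustration of these two propensity-based strategies, the sketch below (not code from the thesis; the simulated data-generating process and all names are invented for the example) fits a logistic propensity model on a confounded sample, then estimates the ATE both by stratifying on quintiles of the fitted score and by stabilized inverse probability weighting.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n = 20_000
x = rng.normal(size=(n, 3))

# Confounded assignment: treatment probability depends on the covariates.
true_e = 1 / (1 + np.exp(-(x[:, 0] + 0.5 * x[:, 1])))
w = rng.binomial(1, true_e)
y = x @ np.array([1.0, 1.0, 0.5]) + 2.0 * w + rng.normal(size=n)  # true ATE = 2

# Estimate the propensity score e(x) with logistic regression.
e_hat = LogisticRegression(max_iter=1000).fit(x, w).predict_proba(x)[:, 1]

# (a) Stratify on quintiles of the fitted score; average the within-stratum
#     differences in means, weighted by stratum share.
strata = np.digitize(e_hat, np.quantile(e_hat, [0.2, 0.4, 0.6, 0.8]))
tau_strat = 0.0
for k in range(5):
    in_k = strata == k
    diff_k = y[in_k & (w == 1)].mean() - y[in_k & (w == 0)].mean()
    tau_strat += in_k.mean() * diff_k

# (b) Stabilized inverse probability weighting.
wt1, wt0 = w / e_hat, (1 - w) / (1 - e_hat)
tau_ipw = np.sum(wt1 * y) / np.sum(wt1) - np.sum(wt0 * y) / np.sum(wt0)

print(f"stratified: {tau_strat:.3f}, stabilized IPW: {tau_ipw:.3f}")
```

Both estimates should recover an ATE near 2 here, whereas a raw difference in means would not, since treated units systematically have larger covariate values.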

    There exist many more complex estimators that rely on modeling both the propensity score and

    the potential outcomes as functions of the covariates. Imbens and Rubin (2015) suggest combining

    propensity score stratification with regression adjustment to improve precision. “Doubly robust”

    estimators (Kang et al., 2007) are formulated using both models such that they are asymptotically

    consistent if either the propensity model or the outcome model is correctly specified. Heterogeneous

treatment effect estimation in observational studies is a very active area of research; many modern

    methods involve both outcome and propensity modeling (see e.g. Künzel et al., 2019; Chernozhukov

    et al., 2018) and attain this property.
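A minimal doubly robust sketch, under the same kind of simulated setup as above (again illustrative only: this is the textbook augmented IPW estimator rather than a method developed in this thesis, and cross-fitting of the nuisance models, as in Chernozhukov et al., 2018, is omitted for brevity):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.default_rng(2)
n = 20_000
x = rng.normal(size=(n, 3))
e = 1 / (1 + np.exp(-(x[:, 0] + 0.5 * x[:, 1])))                    # true propensity
w = rng.binomial(1, e)
y = x @ np.array([1.0, 1.0, 0.5]) + 2.0 * w + rng.normal(size=n)    # true ATE = 2

# Nuisance models: a propensity model and an outcome model for each arm.
e_hat = LogisticRegression(max_iter=1000).fit(x, w).predict_proba(x)[:, 1]
mu1 = LinearRegression().fit(x[w == 1], y[w == 1]).predict(x)
mu0 = LinearRegression().fit(x[w == 0], y[w == 0]).predict(x)

# Augmented IPW: outcome-model contrast plus inverse-probability-weighted
# residual corrections for each arm.
tau_aipw = np.mean(
    mu1 - mu0
    + w * (y - mu1) / e_hat
    - (1 - w) * (y - mu0) / (1 - e_hat)
)
print(f"AIPW estimate of the ATE: {tau_aipw:.3f}")
```

The weighted-residual correction vanishes in expectation when the outcome model is correct, and the outcome-model contrast is corrected by the weighting when the propensity model is correct, which is the source of the double robustness property described above.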

    In the absence of unconfoundedness, inferential methodology is much more limited. Causal

    estimates can be reliably identified in certain quasi-experimental settings – regression discontinu-

    ity designs and instrumental variable analyses being two prominent examples – but these are not

    generic methods. Researchers frequently focus not on point identification, but rather on methods to

determine whether measured causal effects are sufficiently robust to the possibility of unmeasured

    confounding. This area, known as “sensitivity analysis,” has yielded a variety of new methods in

    recent years (Ding and VanderWeele, 2016; Fogarty, 2019; VanderWeele and Ding, 2017). We will

make use of a new method offered by Zhao et al. (2019) in Chapters 3 and 4.

    1.3 Literature Review

The data fusion problem relates to several different areas in the statistical and epidemiological

    literature. For example, the meta-analysis literature considers the broad challenges of evidence

    synthesis across multiple studies. Papers in this area have been highlighting the lack of consensus

    about how to merge observational and experimental studies for at least a quarter of a century

    (Pladevall-Vila et al., 1996). Yet even without clear methodological guidelines, observational studies

    are frequently included in systematic reviews; a 2014 survey found that just 36% of 300 such reviews

    were restricted to experimental and quasi-experimental data, with the remainder including at least

    some types of observational studies (Page et al., 2016).

    Mueller et al. (2018) recently published a summary of methods to incorporate observational

    studies into meta-analyses. They considered 93 relevant articles published between 1994 and 2016,

    finding that many recommendations for the inclusion of observational studies were essentially un-

    changed from those used for randomized controlled trials (RCTs). While about 40% of the articles

    made recommendations on the assessment of bias in observational studies, there was little agreement

    on best practices for combining the data. The methodological questions considered in most of the


papers were whether or not to report a single effect estimate, and whether or not to use a fixed vs.

a random effects model to combine the individual study estimates. These questions relate to the

heterogeneity of effect estimates, but they do not engage with the unique challenges of confounding

    in observational studies.

    Mueller and coauthors highlight a few exceptions. Thompson et al. (2011) propose estimating bias

    reduction based on the subjective judgment of a panel of assessors, and adjusting the observational

    study results accordingly. Their method requires a high degree of subject matter expertise. Prevost

et al. (2000) suggest a hierarchical Bayes approach in which the difference between observational

    and experimental results is modeled explicitly. The authors emphasize certain advantages to this

    approach but also note that results are sensitive to the choice of prior. In totality, the meta-analysis

    literature underscores the need for a more robust toolbox to synthesize heterogeneous data sources,

    while accounting for the unique challenges of causal inference with observational data.

    Another closely related area of the literature is that of transportability and generalizability. In

    the middle of the twentieth century, Campbell (1957) introduced the concepts of “internal validity”

and “external validity” to distinguish between challenges of treatment effect estimation and general-

    izability in quasi-experimental research. This paradigm was widely adopted among social scientists.

The problem of extending causal findings across different domains is now known under the broader

banner of “transportability,” which subsumes results from the meta-analysis and treatment effect

    heterogeneity literatures (Bareinboim and Pearl, 2016).

    In this context, there is substantial research on using observational data to determine how to

generalize causal effects from an experiment to a target population. Early work focused heavily on

    cases in which such generalization is invalid (see e.g. Manski, 2009; Höfler et al., 2010). In a series

of papers (Pearl and Bareinboim, 2011; Bareinboim and Pearl, 2013, 2016), Pearl and Bareinboim

    formalized this problem, derived conditions under which “transport” is possible, and developed

    algorithms for returning the correct “transport formula.” While they considered the problem in the

    context of graphical models, it has also been considered in the more classical causal inference setting.

    For example, Hartman et al. (2015) derived assumptions and placebo tests for identifying population

treatment effects from RCTs. Stuart et al. (2011) advocated the use of propensity scores as a tool

    to assess generalizability. A variety of other work has advocated reweighting approaches in order to

    generalize results (Cole and Stuart, 2010; Andrews and Oster, 2017).

    We are explicitly interested in using causal estimates derived from observational studies, an area

    that has received comparatively less attention than the transportability question. The hesitancy

    to use these estimates is not particularly surprising: if a researcher has access to an experiment,

    he or she will likely be cautious about incorporating observational data that might introduce bias.

    One approach is to assume unconfoundedness in the observational study, meaning that all variables

    a↵ecting the treatment assignment and the outcome are measured. This is our approach in Chapter

    2; it is also used in Athey et al. (2019).


    A small number of prior papers have attempted to weaken the unconfoundedness assumption

    and proceed with merged estimation. They often introduce alternative assumptions. In Kallus et al.

    (2018), the authors assume that the hidden confounding has a parametric structure that can be

modeled effectively. They suggest fitting a model ω̂ to predict local average treatment effects to the

observational study, and then learning a second model η̂ which interpolates between the predictions

of ω̂ on the RCT units and the actual observed outcomes in the RCT. They posit that the sum of

these functions, ω̂(x) + η̂(x), is a good estimate for the CATE at x. Yet their theoretical guarantees

rely heavily on determining the correct functional form for η̂(·). In Peysakhovich and Lada (2016), it is assumed the bias preserves unit-level relative rank or-

dering (as the authors say, “bigger causal effects imply bigger bias”). They argue that their set of

    assumptions is reasonable in their setting, which involves time series data with multiple observations

    per unit. But it does not easily generalize to the more standard case where each unit’s outcome is

    observed only at a single time point.

    A number of other approaches have been suggested, such as methods that make use of Bayesian

    networks (Cooper and Yoo, 1999) or structural causal models (Mooij et al., 2016). Yet this inquiry

is unquestionably in its infancy. One can find a chorus of recent papers explicitly calling for more

    methodological development in the area of combining observational and experimental data (Mueller

    et al., 2018; Shalit, 2020). This thesis seeks to heed that call.

    1.4 Contributions

    The contents of this thesis are divided among the three chapters to follow, each of which corresponds

to a distinct manuscript. Each chapter considers the data fusion problem from a different angle.

    The second chapter is adapted from “Propensity Score Methods for Merging Observational and

    Experimental Datasets,” jointly authored with Art Owen, Michael Baiocchi, and Hailey Banack

    (Rosenman et al., 2018). This chapter considers estimation of the average treatment e↵ect from a

    completed pair of studies: a randomized controlled trial and an observational study. We assume that

    SITA holds in the latter study. We propose a general procedure in which the data is jointly stratified

    on the output of a propensity score estimation function fitted to the observational study, as well as

    causal e↵ect moderators. We propose three novel estimators for the causal e↵ect within each stratum,

    and use the delta method to determine when each would be expected to outperform. We apply our

    methods to data from the Women’s Health Initiative, a study of thousands of postmenopausal women

    which has both observational and experimental data on hormone therapy (HT).

    The third chapter is adapted from “Combining Observational and Experimental Datasets Using

    Shrinkage Estimators,” jointly authored with Guillaume Basse, Art Owen, and Michael Baiocchi

    (Rosenman et al., 2020). This chapter considers the same setting, but we remove the assumption

    that all confounders are measured and choose as our objective the L2 loss in measuring a vector of


stratum-specific treatment effects. We propose a generic procedure for deriving shrinkage estimators

    in this setting, making use of a generalized unbiased risk estimate. Then, we develop two new

    estimators, prove finite sample conditions under which they have lower risk than an estimator using

    only experimental data, and show that each achieves a notion of asymptotic optimality. Lastly, we

    draw connections between our approach and state-of-the-art results in sensitivity analysis, including

    proposing a method for evaluating the feasibility of our estimators.

    The fourth chapter is forthcoming as its own manuscript. In this chapter, we depart from the

    estimation setting and consider how to design a stratified experiment making use of data from a

    completed observational study. Again, our objective is the L2 loss in measuring the vector of treat-

ment effects. In the case of a binary outcome, we obtain valid, bias-aware confidence regions for the

    pilot estimates of the stratum-specific variances derived from the observational study, generalizing

    recent results from Zhao et al. (2019). Then, we show that experiments can be designed to minimize

    a notion of regret by solving a convex optimization problem. We again demonstrate the utility of

    our methods with an application to data from the Women’s Health Initiative.

    In the final chapter, we discuss limitations and future directions for this line of research.

  • Chapter 2

    Combining Datasets Under SITA

    2.1 Introduction

    We first consider how to combine the information from a large observational database (ODB) with

    data from a smaller randomized controlled trial (RCT), under the assumption that all confounders

are measured in the observational study. Our goal is to obtain a treatment effect estimate that is

    more accurate than either source could yield on its own.

    We present three methods to combine our two sources of data. The key technique underlying all

    of these methods is to score subjects in the RCT according to their propensity for treatment had they

    been in the ODB instead. They are then placed in pooled strata containing some ODB and some

    RCT observations with comparable propensities. To see why this might help, consider a stratum in

    the ODB comprised entirely of subjects with a very low treatment propensity. The RCT samples in

    that same stratum will be more evenly split between treatment and control, increasing a critically

    low within-stratum sample size of treated subjects. We are therefore extending the stratification of

    observational data by propensity, as described by Imbens and Rubin (2015) and Stuart and Rubin

    (2007).
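A minimal sketch of this pooling step, stated as code (illustrative only, not the thesis's procedure; the treatment column name w, the covariate list, and the quantile-based strata are assumptions made for the example): fit the propensity model on the ODB alone, score every subject in both samples with that model, and cut the pooled sample into common propensity strata.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

def pooled_propensity_strata(odb: pd.DataFrame, rct: pd.DataFrame,
                             covariates: list, n_strata: int = 5) -> pd.DataFrame:
    """Score RCT subjects by their ODB treatment propensity and pool both
    samples into common propensity strata (an illustrative sketch)."""
    # Fit the propensity model on the observational database only.
    model = LogisticRegression(max_iter=1000).fit(odb[covariates], odb["w"])

    pooled = pd.concat(
        [odb.assign(source="ODB"), rct.assign(source="RCT")], ignore_index=True
    )
    # Propensity each subject would have had in the ODB.
    pooled["e_hat"] = model.predict_proba(pooled[covariates])[:, 1]

    # Stratum boundaries from the ODB score distribution, applied to everyone.
    cuts = np.quantile(pooled.loc[pooled.source == "ODB", "e_hat"],
                       np.linspace(0, 1, n_strata + 1))
    pooled["stratum"] = np.digitize(pooled["e_hat"], cuts[1:-1])
    return pooled
```

Within each pooled stratum, treated and control units can then be compared with or without regard to their source; the estimators that make this precise are defined formally in Section 2.2.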

    For any data combination method such as ours to succeed, it is necessary to make assumptions

    that cannot be tested within the available data and have to be judged instead on scientific grounds.

At a minimum, we require that treatment effects for subjects in a stratum not depend too strongly,

    if at all, on the data set the subjects came from. We will describe this and our other assumptions

    below.

    We apply our methods to data from the Women’s Health Initiative, or WHI (Writing Group for

    the Women’s Health Initiative Investigators, 2002). This study includes an RCT paired with an

    ODB. The treatment is hormone therapy (HT) and the outcome measure we consider is coronary

    heart disease (CHD). Conclusions from the observational data alone proved to be misleading due to

differences between the treated and untreated subjects, as revealed by a large RCT. Accordingly,



    it would be interesting to see if a smaller RCT combined with the ODB would have been more

    accurate than the ODB alone, potentially providing an earlier warning. The WHI’s RCT was quite

large. This allows us to split it into a holdout sample that we use to define a true treatment effect.

    Using that holdout estimate as a gold standard, we can then compare our combined methods with

    methods using only one of the data sources.

    2.1.1 Scientific assumptions

    Here we describe our assumptions in qualitative terms.

As mentioned above, we require the RCT and ODB to have comparable treatment effects within

strata. The strong form of this assumption is that everybody in each stratum we form has the same

treatment effect. Most of our results use a weaker form in which only the average treatment effects

have to be the same for RCT and ODB subjects in each stratum. Note however that equal average

treatment effects implies equal average differential outcomes, but does not imply equal average

    outcomes under either the test or control conditions.

    We will assume that the propensity for treatment in the ODB is nearly constant within each of

    our strata. This assumption requires that the variables we use to form strata include the important

    quantities predictive of whether treatment or control was assigned. It also requires that we estimate

    the propensity well. In our motivating problems, the ODB is so large that getting good propensity

    estimates is a very reasonable assumption, provided that suitable predictor variables are present.

    Our goal is to estimate a population average treatment effect. For that, we need to know the

    correct proportion of the population corresponding to each stratum. In our motivating medical

    context, we suppose that the owner of the large ODB is interested in stratum proportions given by

    that very same patient population. The RCT, on the other hand, might have different sampling

    proportions due, for example, to enrollment criteria that restrict who may participate. In our

    examples, the stratum proportions are exactly those of the ODB.

    2.1.2 Outline

    This chapter is organized as follows. In Section 2.2, we define our notation, assumptions and

    estimators of the average treatment effect. Our estimates combine some within-stratum estimates, computed in various ways. Our first proposed estimator, called the spiked-in estimator, simply places RCT units into propensity based strata defined on the ODB, and estimates the treatment effect in each stratum without regard to which data set the subjects came from. Our second estimator takes a sample size weighted average of RCT and ODB-based treatment effect estimates within each

    stratum. Our third new estimator, called the dynamic weighted average, uses data driven weights

    instead of sample size weights to combine ODB and RCT estimates within each stratum so as to

    minimize an estimate of the mean squared error in each stratum. For the large sample sizes of

    interest to us, delta method approximations to the mean and variance are accurate enough.


    Section 2.3 presents delta method estimates of the within-stratum bias and variance for our

    estimators. We see theoretically that one of our estimators, the spiked-in estimator, can have an

    enormous bias if average test and control outcomes are not both comparable between data sets.

    Another estimator, the dynamic weighted estimator, is more robust.

    Section 2.5 gives numerical illustrations of our method for an ODB of size 5,000 and an RCT

    of size 200. Section 2.6 gives more background on the WHI data. It then gives a detailed model

    for the WHI data, developing a propensity and defining a gold standard average treatment effect

    on a holdout sample from the WHI’s RCT. A variant of the spiked-in estimator is introduced in

    Section 2.6.3. That variant refines propensity strata via a prognostic score predictive of coronary

    heart disease in untreated subjects, leading to a “dual-spiked” estimator. All the estimators are

    compared via bootstrap simulations in Section 2.6.4, and the dual-spiked estimate is most accurate for the

    WHI data. Lengthier proofs can be found in the Appendix. Section 2.7 summarizes our conclusions.

    2.2 Notation, assumptions and estimators

    Some subjects belong to the randomized controlled trial (RCT) and others to the observational database (ODB). We assume that no subject is in both data sets. We write $i \in \mathcal{R}$ if subject $i$ is in the RCT and $i \in \mathcal{O}$ otherwise. Subject $i$ has an outcome $Y_i \in \mathbb{R}$ and some covariates that we encode in the vector $x_i \in \mathbb{R}^d$. Subject $i$ receives either the test or control condition.

    The condition of subject $i$ is given by a treatment variable $W_i \in \{0,1\}$, where $W_i = 1$ if subject $i$ is in the test condition (and $0$ otherwise). Some formulas simplify when we can use parallel notation for both test and control settings. Accordingly we introduce $W_{it} = W_i$ and $W_{ic} = 1 - W_i$. Other formulas look better when focused on the test condition. For instance, letting $p_{it} = \Pr(W_{it} = 1)$ and $p_{ic} = \Pr(W_{ic} = 1)$, the expression $p_{it}(1 - p_{it})$ is immediately recognizable as a Bernoulli variance and is preferred to $p_{it}p_{ic}$.

    2.2.1 Sampling assumptions

    We adopt the potential outcomes framework of Neyman and Rubin (Rubin, 1974). Subject $i$ has two potential outcomes, $Y_{it}$ and $Y_{ic}$, corresponding to the test and control conditions respectively. Then $Y_i = W_{it}Y_{it} + W_{ic}Y_{ic}$. The potential outcomes $(Y_{it}, Y_{ic})$ are non-random and we will assume that they are bounded. The treatment effect for subject $i$ is $Y_{it} - Y_{ic}$. We work conditionally on the observed values of the covariates, and so the $x_i$ are also non-random.

    All of the randomness in our model comes from the treatment variables $W_i$. We write $\mathrm{Bern}(p)$ for a Bernoulli random variable taking the value $1$ with probability $p$ and $0$ with probability $1-p$. The ODB and RCT differ in how the $W_i$ are distributed.

    Assumption 1 (ODB sampling). If $i \in \mathcal{O}$, then $W_i \sim \mathrm{Bern}(p_i)$ independently, where $p_i = e(x_i)$ with $0 < p_i < 1$.

    The function $e(\cdot)$ in Assumption 1 is the propensity. Because the propensity depends only on $x$, and is never $0$ or $1$, the ODB has a strongly ignorable treatment assignment (SITA) (Rosenbaum and Rubin, 1984). Because the $W_i$ are independent, the outcome for subject $i$ is unaffected by the treatment $W_{i'}$ for any subject $i' \neq i$. That is, our model for the ODB satisfies the stable unit treatment value assumption, or SUTVA (Imbens and Rubin, 2015).

    Assumption 2 (RCT sampling). If $i \in \mathcal{R}$, then $W_i \sim \mathrm{Bern}(p_r)$ independently for a common probability $0 < p_r < 1$.

    The RCT will commonly have $p_r = 1/2$, but we do not assume this. We additionally assume that the ODB is independent of the RCT.

    2.2.2 Stratification and treatment effect assumptions

    We will use $K$ strata indexed by $k = 1, \dots, K$. The stratum for subject $i$ depends on $x_i$. The sets $\mathcal{O}_k$ and $\mathcal{R}_k$ contain the subjects in stratum $k$ from the ODB and RCT respectively. We assume that each stratum contains only a narrow range of propensity values $e(x_i)$. Strata defined by propensity ranges may be further partitioned by variables in $x_i$, using domain knowledge if applicable, in order to make the treatment effect more nearly constant within strata. Propensity score stratification with sub-stratification on other important predictors is a commonly used strategy for causal inference in observational studies (Imbens and Rubin, 2015; Stuart and Rubin, 2007).

    Our model allows the treatment effect to vary by stratum. We begin with a strong assumption about the treatment effects.

    Assumption 3. For all strata $k = 1, \dots, K$, there is a treatment effect $\tau_k$ with $Y_{it} - Y_{ic} = \tau_k$ for all $i \in \mathcal{O}_k \cup \mathcal{R}_k$.

    In most of our work, we can weaken this assumption to just require equality on average within

    each stratum. The weakened version is given as Assumption 4 below. Let the sample sizes of the

    ODB and RCT be $n_o$ and $n_r$ respectively. Ordinarily $n_o \gg n_r$. The ODB and RCT sample sizes within stratum $k$ are $n_{ok}$ and $n_{rk}$. The within-stratum average treatment effects are

    $$\tau_{ok} = \frac{1}{n_{ok}}\sum_{i\in\mathcal{O}_k}(Y_{it} - Y_{ic}) \quad\text{and}\quad \tau_{rk} = \frac{1}{n_{rk}}\sum_{i\in\mathcal{R}_k}(Y_{it} - Y_{ic}), \tag{2.1}$$

    when their denominator counts are positive. We will never use strata with $n_{ok} = 0$ when we later weight strata proportionally to their ODB sizes.

    Assumption 4. For $k = 1, \dots, K$, if $\min(n_{ok}, n_{rk}) > 0$ then $\tau_{ok} = \tau_{rk}$ and we call their common value $\tau_k$. If $n_{ok} > n_{rk} = 0$ take $\tau_k = \tau_{ok}$, and if $n_{rk} > n_{ok} = 0$ take $\tau_k = \tau_{rk}$.

    Assumption 4 might be unrealistic if the treatment is applied differently in the ODB versus the RCT. We thus suppose some form of “treatment version irrelevance” (Lesko et al., 2017).


    We need the strong Assumption 3 in one place, to estimate a quantity that depends on both potential outcomes of a single subject. Because our strata will be based at least partially on propensity, Assumption 3 is very nearly true under the model of Xie et al. (2012b). In the Appendix, some simulations will involve data that violate Assumption 3.

    2.2.3 Estimators

    Our estimand is the global average treatment effect defined by

    $$\tau = \sum_{k=1}^K \omega_k \tau_k$$

    for weights $\omega_k > 0$ with $\sum_{k=1}^K \omega_k = 1$. The weights can be chosen to match population characteristics. We use $\omega_k = n_{ok}/n_o$. Then $\omega_k = 0$ whenever $n_{ok} = 0$, and we have a well defined $\tau_k$ for every stratum that contributes to $\tau$. We may still have $n_{rk} = 0$ for some strata with $\omega_k > 0$. Our estimators all take the form $\sum_k \omega_k \hat\tau_k$ for different within-stratum estimates $\hat\tau_k$.

    We begin with “single data source” estimators before describing our proposed new estimators.

    An ODB-only estimate of the treatment effect in stratum $k$ is

    $$\hat\tau_{ok} = \frac{\sum_{i\in\mathcal{O}_k} W_{it}Y_{it}}{\sum_{i\in\mathcal{O}_k} W_{it}} - \frac{\sum_{i\in\mathcal{O}_k} W_{ic}Y_{ic}}{\sum_{i\in\mathcal{O}_k} W_{ic}}. \tag{2.2}$$

    Then $\hat\tau_o = \sum_k \omega_k \hat\tau_{ok}$. A potential problem with $\hat\tau_o$ comes from bins $k$ with very small propensity values. Then $\mathcal{O}_k$ may contain very few observations with $W_{it} = 1$ and $\hat\tau_{ok}$ may have high variance. Similarly, for bins $k$ associated with large propensity values, $\mathcal{O}_k$ may contain very few observations with $W_{ic} = 1$, which again leads to high variance. That is, the “edge bins” can have very skewed sample sizes, causing problems for $\hat\tau_o$.

    The ODB estimate (2.2) is a difference of ratio estimators, because the denominators are random. We will see in Section 2.3 that there can also be a severe bias in the edge bins. An analogous RCT-only estimator is $\hat\tau_r = \sum_k \omega_k \hat\tau_{rk}$ where

    $$\hat\tau_{rk} = \frac{\sum_{i\in\mathcal{R}_k} W_{it}Y_{it}}{\sum_{i\in\mathcal{R}_k} W_{it}} - \frac{\sum_{i\in\mathcal{R}_k} W_{ic}Y_{ic}}{\sum_{i\in\mathcal{R}_k} W_{ic}}. \tag{2.3}$$

    Because the RCT assigns treatments with constant probability, the edge bins have less imbalanced

    treatment outcomes. However, because the RCT is small, we may find several of the strata have

    very small sample sizes nrk.

    Our first hybrid estimator is $\hat\tau_s = \sum_k \omega_k \hat\tau_{sk}$, where

    $$\hat\tau_{sk} = \frac{\sum_{i\in\mathcal{O}_k} W_{it}Y_{it} + \sum_{i\in\mathcal{R}_k} W_{it}Y_{it}}{\sum_{i\in\mathcal{O}_k} W_{it} + \sum_{i\in\mathcal{R}_k} W_{it}} - \frac{\sum_{i\in\mathcal{O}_k} W_{ic}Y_{ic} + \sum_{i\in\mathcal{R}_k} W_{ic}Y_{ic}}{\sum_{i\in\mathcal{O}_k} W_{ic} + \sum_{i\in\mathcal{R}_k} W_{ic}}. \tag{2.4}$$


    The RCT data are “spiked” into the ODB strata. This spiked-in estimator can improve upon

    the ODB estimator by increasing the number of treated units in the low-propensity edge bins and

    increasing the number of control units in the high-propensity edge bins. Even a small number of

    such balancing observations can be extremely valuable.

    The spiked-in estimator is not a convex combination of $\hat\tau_{ok}$ and $\hat\tau_{rk}$, because the pooling is first done among the test and control units. Our final two estimators are constructed as convex combinations of $\hat\tau_{ok}$ and $\hat\tau_{rk}$.

    The weighted average estimator $\hat\tau_w$ uses

    $$\hat\tau_{wk} = \lambda_k \hat\tau_{ok} + (1 - \lambda_k)\hat\tau_{rk}, \quad\text{where}\quad \lambda_k = \frac{n_{ok}}{n_{ok} + n_{rk}}. \tag{2.5}$$

    It weights $\hat\tau_{rk}$ and $\hat\tau_{ok}$ according to the number of data points involved in each estimate.
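    The four stratum-level estimators above translate directly into a few lines of code. The following sketch (not from the thesis; the array names are illustrative assumptions) computes $\hat\tau_{ok}$, $\hat\tau_{rk}$, $\hat\tau_{sk}$ and $\hat\tau_{wk}$ for a single stratum $k$.

```python
import numpy as np

def stratum_estimates(Yo, Wo, Yr, Wr):
    """Within-stratum estimates for one stratum k.

    Yo, Wo: outcomes and treatment indicators for ODB subjects in O_k.
    Yr, Wr: outcomes and treatment indicators for RCT subjects in R_k.
    Returns (tau_ok, tau_rk, tau_sk, tau_wk) following equations (2.2)-(2.5).
    """
    diff = lambda Y, W: Y[W == 1].mean() - Y[W == 0].mean()
    tau_ok = diff(Yo, Wo)                           # ODB-only, equation (2.2)
    tau_rk = diff(Yr, Wr)                           # RCT-only, equation (2.3)
    # Spiked-in: pool treated units and pool control units before differencing (2.4).
    Y, W = np.concatenate([Yo, Yr]), np.concatenate([Wo, Wr])
    tau_sk = diff(Y, W)
    # Sample-size-weighted convex combination (2.5).
    lam = len(Yo) / (len(Yo) + len(Yr))
    tau_wk = lam * tau_ok + (1 - lam) * tau_rk
    return tau_ok, tau_rk, tau_sk, tau_wk
```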

    Our final estimator is a “dynamic weighted average” $\hat\tau_d$. It uses weights for $\hat\tau_{rk}$ and $\hat\tau_{ok}$ that are estimated from the data. Those weights are chosen to minimize an estimate of the mean squared error (MSE), derived using the delta method in the following section. We can observe its approximate optimality via the following result, recalling that the RCT estimator will in general be unbiased.

    Proposition 1. Let $\hat\theta_1$ and $\hat\theta_2$ be independent estimators of a common quantity $\theta$, with bias, variance, and mean squared errors satisfying $\mathrm{Bias}(\hat\theta_1) \in (-\infty, \infty)$, $\mathrm{Bias}(\hat\theta_2) = 0$, and $\mathrm{var}(\hat\theta_j), \mathrm{MSE}(\hat\theta_j) \in (0, \infty)$ for $j = 1, 2$. For $c \in \mathbb{R}$, let $\hat\theta_c = c\hat\theta_1 + (1 - c)\hat\theta_2$. Then

    $$c_* \equiv \operatorname*{argmin}_c \mathrm{MSE}(\hat\theta_c) = \frac{\mathrm{var}(\hat\theta_2)}{\mathrm{MSE}(\hat\theta_1) + \mathrm{var}(\hat\theta_2)}.$$

    This linear combination has

    $$\mathrm{Bias}(\hat\theta_{c_*}) = \frac{\mathrm{Bias}(\hat\theta_1)\,\mathrm{MSE}(\hat\theta_2)}{\mathrm{MSE}(\hat\theta_1) + \mathrm{MSE}(\hat\theta_2)}, \quad \mathrm{var}(\hat\theta_{c_*}) = c_*^2\,\mathrm{var}(\hat\theta_1) + (1 - c_*)^2\,\mathrm{var}(\hat\theta_2), \quad\text{and}\quad \mathrm{MSE}(\hat\theta_{c_*}) = \frac{\mathrm{MSE}(\hat\theta_1)\,\mathrm{var}(\hat\theta_2)}{\mathrm{MSE}(\hat\theta_1) + \mathrm{var}(\hat\theta_2)}. \tag{2.6}$$

    Proof. Independence of the $\hat\theta_j$ yields $\mathrm{var}(\hat\theta_c) = c^2\mathrm{var}(\hat\theta_1) + (1 - c)^2\mathrm{var}(\hat\theta_2)$, while linearity of expectation yields $\mathrm{Bias}(\hat\theta_c) = c\,\mathrm{Bias}(\hat\theta_1)$. Optimizing $\mathrm{MSE}(\hat\theta_c)$ over $c$ yields the result.
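    Proposition 1 is easy to verify numerically. The sketch below (a simulation with arbitrarily chosen bias and variance values, not taken from the thesis) compares the closed-form optimal weight and minimal MSE with a grid search over Monte Carlo estimates.

```python
import numpy as np

rng = np.random.default_rng(0)
theta, bias1, var1, var2 = 1.0, 0.3, 0.04, 0.25     # illustrative values
mse1 = bias1**2 + var1

# Closed-form optimum and MSE from Proposition 1.
c_star = var2 / (mse1 + var2)
mse_star = mse1 * var2 / (mse1 + var2)

# Monte Carlo MSE of the combined estimator over a grid of weights c.
t1 = theta + bias1 + rng.normal(0, np.sqrt(var1), 200_000)   # biased estimator
t2 = theta + rng.normal(0, np.sqrt(var2), 200_000)           # unbiased estimator
grid = np.linspace(0, 1, 201)
mse_grid = [np.mean((c * t1 + (1 - c) * t2 - theta) ** 2) for c in grid]

print(c_star, grid[int(np.argmin(mse_grid))])   # optimal weights agree closely
print(mse_star, min(mse_grid))                  # and so do the minimal MSEs
```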

    2.3 Delta method results

    Let $X$ be a random vector with mean $\mu$ and a finite covariance matrix. Let $f$ be a function of $X$ that is twice differentiable in an open set containing $\mu$, and let $f_1$ and $f_2$ be first and second order Taylor approximations to $f$ around $\mu$. Then the delta method mean and variance of $f(X)$ are

    $$E_\delta(f(X)) = E(f_2(X)) \quad\text{and}\quad \mathrm{var}_\delta(f(X)) = \mathrm{var}(f_1(X)).$$

    Sometimes, to combine estimates, we will need a delta method mean for a weighted sum of those estimates. We will also need a delta method variance for a weighted sum of independent random variables. We use the following natural expressions without resorting to Taylor approximations:

    $$E_\delta\Bigl(\sum_j \lambda_j \hat\tau_j\Bigr) = \sum_j \lambda_j E_\delta(\hat\tau_j) \tag{2.1}$$

    $$\mathrm{var}_\delta\Bigl(\sum_j \lambda_j \hat\tau_j\Bigr) = \sum_j \lambda_j^2\, \mathrm{var}_\delta(\hat\tau_j), \quad\text{for independent } \hat\tau_j. \tag{2.2}$$

    2.3.1 Population quantities

    We will study our estimators in terms of some population quantities. These involve some unobserved values of $Y_{it}$ or $Y_{ic}$. For instance, the test and control stratum averages in the ODB are

    $$\mu_{okt} = \frac{\sum_{i\in\mathcal{O}_k} Y_{it}}{n_{ok}} \quad\text{and}\quad \mu_{okc} = \frac{\sum_{i\in\mathcal{O}_k} Y_{ic}}{n_{ok}},$$

    and it is typical that both of these are unobserved. Corresponding values for the RCT are $\mu_{rkt}$ and $\mu_{rkc}$.

    When we merge ODB and RCT strata we will have to consider a kind of skew in which the within-stratum mean responses above differ between the two data sets. To this end, define $\Gamma_{kt} = \mu_{okt} - \mu_{rkt}$ and $\Gamma_{kc} = \mu_{okc} - \mu_{rkc}$. Under either the stronger Assumption 3 or the weaker Assumption 4, $\Gamma_{kt} = (\tau_k + \mu_{okc}) - (\tau_k + \mu_{rkc}) = \Gamma_{kc}$. We will use $\Gamma_k = \Gamma_{kt} = \Gamma_{kc}$. We will see that large values of $\Gamma_k$ can bias the spiked-in estimator. Reducing that bias is the main motivation for our dynamic weighted average estimator.

    Now we define several other population quantities. Let $\mathcal{S}$ be a finite non-empty set of $n = n(\mathcal{S})$ indices, such as one of our strata $\mathcal{O}_k$ or $\mathcal{R}_k$. For each $i \in \mathcal{S}$, let $(Y_{it}, Y_{ic}) \in [-B, B]^2$ be a pair of bounded potential outcomes, let $W_i = W_{it}$ be independent $\mathrm{Bern}(p_i)$ random variables, and let $W_{ic} = 1 - W_{it}$. Some of our results add the condition that all $p_i \in [\epsilon, 1 - \epsilon]$ for some $\epsilon > 0$. For $\mathcal{S}$ so equipped, we define average responses

    $$\mu_t = \mu_t(\mathcal{S}) = \frac{1}{n}\sum_{i\in\mathcal{S}} Y_{it} \quad\text{and}\quad \mu_c = \mu_c(\mathcal{S}) = \frac{1}{n}\sum_{i\in\mathcal{S}} Y_{ic}. \tag{2.3}$$

    For example, $\mu_{okt}$ above is $\mu_t(\mathcal{O}_k)$. We use average treatment probabilities

    $$p_t = p_t(\mathcal{S}) = \frac{1}{n}\sum_{i\in\mathcal{S}} p_i \quad\text{and}\quad p_c = p_c(\mathcal{S}) = 1 - p_t(\mathcal{S}). \tag{2.4}$$


    These become $p_{okt}$, $p_{okc}$, $p_{rkt}$ and $p_{rkc}$ in a natural notation when $\mathcal{S}$ is $\mathcal{O}_k$ or $\mathcal{R}_k$. The above quantities are averages over $i$ uniformly distributed in $\mathcal{S}$, as distinct from expectations with respect to the random $W_i$. We also need some covariances of this type between response and propensity values,

    $$s_t = s_t(\mathcal{S}) = \frac{1}{n}\sum_{i\in\mathcal{S}} Y_{it}p_i - \mu_t p_t \quad\text{and}\quad s_c = s_c(\mathcal{S}) = \frac{1}{n}\sum_{i\in\mathcal{S}} Y_{ic}(1 - p_i) - \mu_c p_c. \tag{2.5}$$

    We will find that these quantities play an important role in bias. If, for instance, the larger values of $Y_{it}$ tend to co-occur with higher propensities $p_i$, then averages are biased up.

    The delta method variances of our estimators depend on the following weighted averages of squares and cross products,

    $$S_{tt} = S_{tt}(\mathcal{S}) = \frac{1}{n}\sum_{i\in\mathcal{S}} p_i(1 - p_i)(Y_{it} - \rho_t)^2, \quad S_{cc} = S_{cc}(\mathcal{S}) = \frac{1}{n}\sum_{i\in\mathcal{S}} p_i(1 - p_i)(Y_{ic} - \rho_c)^2, \quad\text{and}\quad S_{tc} = S_{tc}(\mathcal{S}) = \frac{1}{n}\sum_{i\in\mathcal{S}} p_i(1 - p_i)(Y_{it} - \rho_t)(Y_{ic} - \rho_c), \tag{2.6}$$

    where $\rho_t = \rho_t(\mathcal{S}) = \mu_t(\mathcal{S}) + s_t(\mathcal{S})/p_t(\mathcal{S})$ and $\rho_c = \rho_c(\mathcal{S}) = \mu_c(\mathcal{S}) + s_c(\mathcal{S})/p_c(\mathcal{S})$. The quantity $\rho_t$ is the lead term in $E_\delta\bigl(\sum_{i\in\mathcal{S}} W_{it}Y_{it}/\sum_{i\in\mathcal{S}} W_{it}\bigr)$ and $\rho_c$ is similar. More details about these quantities are in the Appendix where Theorem 1 is proved.

    Proposition 2. Let $\mathcal{S}$ be $\mathcal{O}_k$, $\mathcal{R}_k$ or $\mathcal{O}_k \cup \mathcal{R}_k$. Then under Assumption 3, $s_c(\mathcal{S}) = -s_t(\mathcal{S})$.

    Proof. Under Assumption 3, we can set $Y_{it} = Y_{ic} + \tau_k$ and $\mu_t = \mu_c + \tau_k$ in (2.5).

    2.3.2 Main theorem

    We will compare the efficiency of our five estimators using their delta method approximations. We

    state two elementary propositions without proof and then give our main theorem. Results for our

    various estimators are mostly direct corollaries of that theorem.

    Proposition 3. Let $x$ and $y$ be jointly distributed random variables with means $x_0 \neq 0$ and $y_0$ respectively, and finite variances. Let $\rho = y_0/x_0$. Then

    $$E_\delta\Bigl(\frac{y}{x}\Bigr) = \rho - \frac{\mathrm{cov}(y - \rho x,\, x)}{x_0^2}, \quad\text{and}\quad \mathrm{var}_\delta\Bigl(\frac{y}{x}\Bigr) = \frac{\mathrm{var}(y - \rho x)}{x_0^2}.$$


    Proposition 4. Let $x_t$, $x_c$, $y_t$, $y_c$ be jointly distributed random variables with finite variances and means $x_{j,0} \neq 0$ and $y_{j,0}$ respectively, for $j \in \{t, c\}$. Let $\rho_j = y_{j,0}/x_{j,0}$. Then

    $$\mathrm{var}_\delta\Bigl(\frac{y_t}{x_t} \pm \frac{y_c}{x_c}\Bigr) = \frac{\mathrm{var}(y_t - \rho_t x_t)}{x_{t,0}^2} + \frac{\mathrm{var}(y_c - \rho_c x_c)}{x_{c,0}^2} \pm \frac{2\,\mathrm{cov}(y_t - \rho_t x_t,\, y_c - \rho_c x_c)}{x_{t,0}x_{c,0}}.$$

    Theorem 1. Let $\mathcal{S}$ be an index set of finite cardinality $n > 0$. For $i \in \mathcal{S}$, let $W_{it} \sim \mathrm{Bern}(p_i)$ be independent with $W_{ic} = 1 - W_{it}$ and $0 < p_i < 1$. Let

    $$\hat\tau = \frac{\sum_{i\in\mathcal{S}} W_{it}Y_{it}}{\sum_{i\in\mathcal{S}} W_{it}} - \frac{\sum_{i\in\mathcal{S}} W_{ic}Y_{ic}}{\sum_{i\in\mathcal{S}} W_{ic}}$$

    where $(Y_{it}, Y_{ic}) \in [-B, B]^2$ for $B < \infty$. Then with $\mu_t$, $\mu_c$, $p_t$, $p_c$, $s_t$, $s_c$, $S_{tt}$, $S_{cc}$, $S_{tc}$ defined at equations (2.3) through (2.6),

    $$\mathrm{var}_\delta(\hat\tau) = \frac{1}{n}\Bigl(\frac{S_{tt}}{p_t^2} + \frac{S_{cc}}{p_c^2} + \frac{2S_{tc}}{p_tp_c}\Bigr). \tag{2.7}$$

    If all $p_i \in [\epsilon, 1 - \epsilon]$ for some $\epsilon > 0$, then

    $$E_\delta(\hat\tau) = (\mu_t - \mu_c) + \Bigl(\frac{s_t}{p_t} - \frac{s_c}{p_c}\Bigr) + O\Bigl(\frac{1}{n}\Bigr). \tag{2.8}$$

    Proof. See Appendix Section A.1.1.

    The implied constant in the $O(1/n)$ term of equation (2.8) holds for all $n > 1$.
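    The delta method approximations in Theorem 1 are straightforward to check by simulation. The sketch below (with illustrative potential outcomes and propensities, not data from the thesis) computes the population quantities of Section 2.3.1 for a small finite population and compares equations (2.7) and (2.8) with Monte Carlo draws of the treatment assignment.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 400
p = rng.uniform(0.1, 0.9, n)            # propensities, bounded away from 0 and 1
Yc = rng.normal(0, 1, n) + 2 * p        # outcomes correlated with propensity
Yt = Yc + 1.0                           # constant treatment effect (Assumption 3)

# Population quantities from equations (2.3)-(2.6).
mu_t, mu_c, p_t = Yt.mean(), Yc.mean(), p.mean()
p_c = 1 - p_t
s_t = np.mean(Yt * p) - mu_t * p_t
s_c = np.mean(Yc * (1 - p)) - mu_c * p_c
rho_t, rho_c = mu_t + s_t / p_t, mu_c + s_c / p_c
S_tt = np.mean(p * (1 - p) * (Yt - rho_t) ** 2)
S_cc = np.mean(p * (1 - p) * (Yc - rho_c) ** 2)
S_tc = np.mean(p * (1 - p) * (Yt - rho_t) * (Yc - rho_c))

var_delta = (S_tt / p_t**2 + S_cc / p_c**2 + 2 * S_tc / (p_t * p_c)) / n   # (2.7)
mean_delta = (mu_t - mu_c) + (s_t / p_t - s_c / p_c)                       # lead terms of (2.8)

# Monte Carlo over the Bernoulli treatment assignments.
draws = []
for _ in range(20_000):
    W = rng.binomial(1, p)
    draws.append(Yt[W == 1].mean() - Yc[W == 0].mean())
draws = np.array(draws)
print(mean_delta, draws.mean())   # delta method mean vs. simulated mean
print(var_delta, draws.var())     # delta method variance vs. simulated variance
```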

    2.3.3 Delta method means and variances

    We define the delta method bias of an estimate $\hat\tau_k$ via $\mathrm{Bias}_\delta(\hat\tau_k) = E_\delta(\hat\tau_k) - \tau_k$. We also assume $0 < \epsilon < e(x_i) < 1 - \epsilon$ for some $\epsilon > 0$.

    Corollary 1. Let $\hat\tau_{ok}$ be the ODB-only estimator from (2.2). Then

    $$\mathrm{var}_\delta(\hat\tau_{ok}) = \frac{1}{n_{ok}}\Bigl(\frac{S_{tt}}{p_t^2} + \frac{S_{cc}}{p_c^2} + \frac{2S_{tc}}{p_tp_c}\Bigr),$$

    where $s_t$, $s_c$, $p_t$, $p_c$, $S_{tt}$, $S_{cc}$ and $S_{tc}$ are given in equations (2.3) through (2.6) with $\mathcal{S} = \mathcal{O}_k$. If $0 < \epsilon < e(x_i) < 1 - \epsilon$ for all $i \in \mathcal{O}_k \cup \mathcal{R}_k$, then

    $$\mathrm{Bias}_\delta(\hat\tau_{ok}) = \frac{s_t}{p_t} - \frac{s_c}{p_c} + O\Bigl(\frac{1}{n_{ok}}\Bigr).$$

    If also Assumption 3 holds, then

    $$\mathrm{Bias}_\delta(\hat\tau_{ok}) = \frac{s_t}{p_t(1 - p_t)} + O\Bigl(\frac{1}{n_{ok}}\Bigr).$$

    Proof. The first two claims follow from Theorem 1, using $e(x_i) \in [\epsilon, 1 - \epsilon]$ for $i \in \mathcal{O}_k \cup \mathcal{R}_k$ in the second one. Under Assumption 3, $s_c = -s_t$, so the lead term in $\mathrm{Bias}_\delta(\hat\tau_{ok})$ is $s_t(1/p_t + 1/p_c) = s_t(p_t + p_c)/(p_t(1 - p_t)) = s_t/(p_t(1 - p_t))$.

    Corollary 2. Let $\hat\tau_{rk}$ be the RCT-only estimator from (2.3). Then $\hat\tau_{rk}$ is known to be unbiased, and

    $$\mathrm{var}_\delta(\hat\tau_{rk}) = \frac{\bar\sigma^2_{rk}}{n_{rk}\,p_r(1 - p_r)}, \quad\text{where}\quad \bar\sigma^2_{rk} = \frac{1}{n_{rk}}\sum_{i\in\mathcal{R}_k}\bigl[(Y_{it} - \mu_{rkt})(1 - p_r) + (Y_{ic} - \mu_{rkc})p_r\bigr]^2, \tag{2.9}$$

    for $\mu_{rkt} = \mu_t(\mathcal{R}_k)$ and $\mu_{rkc} = \mu_c(\mathcal{R}_k)$. Under Assumption 3, $\bar\sigma^2_{rk} = \sigma^2_{rkt} \equiv (1/n_{rk})\sum_{i\in\mathcal{R}_k}(Y_{it} - \mu_{rkt})^2$. If $p_r = 1/2$, then

    $$\mathrm{var}_\delta(\hat\tau_{rk}) = \frac{4}{n_{rk}^2}\sum_{i\in\mathcal{R}_k}\Bigl(\bar Y_i - \frac{\mu_{rkt} + \mu_{rkc}}{2}\Bigr)^2$$

    for $\bar Y_i = (Y_{it} + Y_{ic})/2$.

    Proof. See Appendix Section A.1.2.

    Corollary 3. Let $\hat\tau_{wk}$ be the weighted-average estimator (2.5). Then, with $\lambda_k = n_{ok}/(n_{ok} + n_{rk})$,

    $$\mathrm{var}_\delta(\hat\tau_{wk}) = \frac{\lambda_k}{n_{ok} + n_{rk}}\Bigl(\frac{S_{tt}}{p_t^2} + \frac{S_{cc}}{p_c^2} + \frac{2S_{tc}}{p_tp_c}\Bigr) + \frac{1 - \lambda_k}{n_{ok} + n_{rk}}\,\frac{\bar\sigma^2_{rk}}{p_r(1 - p_r)},$$

    where $S_{tt}$, $S_{cc}$ and $S_{tc}$ are given in equation (2.6) with $\mathcal{S} = \mathcal{O}_k$, and $\bar\sigma^2_{rk}$ is defined at (2.9). If $0 < \epsilon < e(x_i) < 1 - \epsilon$ for all $i \in \mathcal{O}_k \cup \mathcal{R}_k$, then

    $$\mathrm{Bias}_\delta(\hat\tau_{wk}) = \lambda_k\Bigl(\frac{s_{okt}}{p_{okt}} - \frac{s_{okc}}{p_{okc}}\Bigr) + O\Bigl(\frac{1}{n_{ok} + n_{rk}}\Bigr),$$

    where $s_{okt}$, $p_{okt}$, $s_{okc}$, and $p_{okc}$ are defined by equations (2.4) and (2.5) for $\mathcal{S} = \mathcal{O}_k$. If Assumption 3 also holds, then

    $$\mathrm{Bias}_\delta(\hat\tau_{wk}) = \frac{\lambda_k s_{okt}}{p_{okt}(1 - p_{okt})} + O\Bigl(\frac{1}{n_{ok} + n_{rk}}\Bigr).$$

    Proof. See Section 2.3.6.


    In our motivating scenarios we anticipate that $n_o \gg n_r$, so that $\lambda_k \approx 1$ for most $k$. Then the first term in $\mathrm{var}_\delta(\hat\tau_{wk})$ is only slightly smaller than $\mathrm{var}_\delta(\hat\tau_{ok})$ for the ODB-only estimate, and at most a small variance reduction is to be expected from weighting.

    The spiked-in estimator’s bias and variance cannot be obtained by combining Corollaries 1 and 2, but they can be computed directly from Theorem 1 applied to the pooled stratum.

    Corollary 4. Let $\hat\tau_{sk}$ be the spiked-in estimator (2.4). Then

    $$\mathrm{var}_\delta(\hat\tau_{sk}) = \frac{1}{n_{ok} + n_{rk}}\Bigl(\frac{S_{tt}}{p_t^2} + \frac{S_{cc}}{p_c^2} + \frac{2S_{tc}}{p_tp_c}\Bigr),$$

    where $s_t$, $s_c$, $p_t$, $p_c$, $S_{tt}$, $S_{cc}$ and $S_{tc}$ are given in equations (2.3) through (2.6) with $\mathcal{S} = \mathcal{O}_k \cup \mathcal{R}_k$. If $0 < \epsilon < e(x_i) < 1 - \epsilon$ for all $i \in \mathcal{O}_k \cup \mathcal{R}_k$, then

    $$\mathrm{Bias}_\delta(\hat\tau_{sk}) = \frac{s_t}{p_t} - \frac{s_c}{p_c} + O\Bigl(\frac{1}{n_{ok} + n_{rk}}\Bigr).$$

    If Assumption 3 also holds, then

    $$\mathrm{Bias}_\delta(\hat\tau_{sk}) = \frac{s_t}{p_t(1 - p_t)} + O\Bigl(\frac{1}{n_{ok} + n_{rk}}\Bigr).$$

    Proof. The spiked-in estimates are computed by pooling $\mathcal{O}_k$ and $\mathcal{R}_k$ into their union, so the results follow from Theorem 1 applied to $\mathcal{S} = \mathcal{O}_k \cup \mathcal{R}_k$.

    To relate the bias of $\hat\tau_{sk}$ to that of the other estimators, we write it in terms of the quantities computed using $\mathcal{S} = \mathcal{O}_k$ and $\mathcal{S} = \mathcal{R}_k$. Denoting these quantities using an additional subscript of $o$ and $r$,

    $$\mathrm{Bias}_\delta(\hat\tau_{sk}) = \Gamma_k n_{ok}\Bigl(\frac{p_{okt}}{n_{ok}p_{okt} + n_{rk}p_{rkt}} - \frac{p_{okc}}{n_{ok}p_{okc} + n_{rk}p_{rkc}}\Bigr) + \frac{s_{okt}\,n_{ok}}{n_{ok}p_{okt} + n_{rk}p_{rkt}} - \frac{s_{okc}\,n_{ok}}{n_{ok}p_{okc} + n_{rk}p_{rkc}} + O\Bigl(\frac{1}{n_{ok} + n_{rk}}\Bigr). \tag{2.10}$$

    The bias for $\hat\tau_{rk}$ is zero. The bias for $\hat\tau_{ok}$ has terms analogous to the second and third (and error) terms above, but the first term is new to $\hat\tau_{sk}$. This term is linear in $\Gamma_k$. For large values of $\Gamma_k$, this term will dominate, yielding biases that can easily exceed those of $\hat\tau_{ok}$. This is the fundamental danger of the spiked-in estimator: if the mean potential outcomes differ substantially between ODB and RCT subjects with similar values of the propensity score, then the estimate will be poor due to large bias.

    2.3.4 The dynamic weighted estimator

    The bias-variance tradeoffs are intrinsically different in each stratum. Using results from the prior section, we derive a dynamic weighted estimator that uses different weights in each stratum. Our

    dynamic weighted estimator is based on Assumption 3, though we will test it in settings where that

    assumption does not hold.

    From Proposition 1, the MSE-optimal convex combination of $\hat\tau_{ok}$ and $\hat\tau_{rk}$ is $c_{k*}\hat\tau_{ok} + (1 - c_{k*})\hat\tau_{rk}$, where $c_{k*} = \mathrm{var}(\hat\tau_{rk})/(\mathrm{var}(\hat\tau_{rk}) + \mathrm{MSE}(\hat\tau_{ok}))$. The dynamic weighted estimator is

    $$\hat\tau_{dk} = \hat c_{k*}\hat\tau_{ok} + (1 - \hat c_{k*})\hat\tau_{rk}, \quad\text{with}\quad \hat c_{k*} = \frac{\widehat{\mathrm{var}}(\hat\tau_{rk})}{\widehat{\mathrm{var}}(\hat\tau_{rk}) + \widehat{\mathrm{MSE}}(\hat\tau_{ok})}, \tag{2.11}$$

    for plug-in estimators of $\mathrm{MSE}(\hat\tau_{ok})$ and $\mathrm{var}(\hat\tau_{rk})$. To obtain our MSE estimates we use $\widetilde{\mathrm{MSE}}(\cdot) = \mathrm{Bias}_\delta(\cdot)^2 + \mathrm{var}_\delta(\cdot)$, taking the delta method moments from Corollaries 1 and 2. These expressions include some unknown population quantities that we then approximate from the data to get $\widehat{\mathrm{MSE}}(\cdot)$.

    For the ODB estimate we use

    $$\widetilde{\mathrm{MSE}}(\hat\tau_{ok}) = \Bigl(\frac{s_t}{p_t(1 - p_t)}\Bigr)^2 + \frac{1}{n_{ok}}\Bigl(\frac{S_{tt}}{p_t^2} + \frac{S_{cc}}{p_c^2} + \frac{2S_{tc}}{p_tp_c}\Bigr),$$

    where the quantities on the right hand side are given in Section 2.3.1 with $\mathcal{S} = \mathcal{O}_k$. For the RCT estimate we use

    $$\widetilde{\mathrm{var}}(\hat\tau_{rk}) = \frac{\bar\sigma^2_{rk}}{p_r(1 - p_r)n_{rk}}, \quad\text{with}\quad \bar\sigma^2_{rk} = \frac{1}{n_{rk}}\sum_{i\in\mathcal{R}_k}\bigl(W_{it}\hat\sigma^2_{rkt} + W_{ic}\hat\sigma^2_{rkc}\bigr),$$

    where $\hat\sigma^2_{rkt}$ and $\hat\sigma^2_{rkc}$ are the sample variances observed among the treated and control units respectively. Both of these estimates use Assumption 3.

    The values of $p_t$ and $p_c$ are known: $p_t = \sum_{i\in\mathcal{O}_k} p_{it}/n_{ok}$, where $p_{it}$ is the propensity $e(x_i)$, and $p_c = 1 - p_t$. We use Horvitz-Thompson style inverse probability weighting to estimate the other quantities. Full details can be found in Section 2.4.
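    As a sketch of how the weight in equation (2.11) is formed (the function and argument names are illustrative assumptions, and the plug-in inputs would come from the delta method expressions above), the computation for one stratum is just a ratio of estimated second moments:

```python
import numpy as np

def dynamic_weighted_estimate(tau_ok, tau_rk, mse_ok_hat, var_rk_hat):
    """Dynamic weighted average for one stratum, mirroring equation (2.11).

    tau_ok, tau_rk: ODB-only and RCT-only stratum estimates.
    mse_ok_hat:     plug-in estimate of MSE(tau_ok), e.g. Bias^2 + variance.
    var_rk_hat:     plug-in estimate of var(tau_rk).
    """
    c_hat = var_rk_hat / (var_rk_hat + mse_ok_hat)
    return c_hat * tau_ok + (1 - c_hat) * tau_rk, c_hat

def rct_variance_hat(Yr, Wr, p_r):
    """Plug-in RCT variance, mirroring the sigma-bar formula above (a sketch)."""
    n_rk = len(Yr)
    s2_t = Yr[Wr == 1].var(ddof=1)              # treated-arm sample variance
    s2_c = Yr[Wr == 0].var(ddof=1)              # control-arm sample variance
    sigma_bar2 = (np.sum(Wr == 1) * s2_t + np.sum(Wr == 0) * s2_c) / n_rk
    return sigma_bar2 / (p_r * (1 - p_r) * n_rk)
```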

    2.3.5 Performance comparison

    The ideal dynamic estimator with the optimal weight $c_{k*}$ must be at least as good as $\hat\tau_{ok}$, $\hat\tau_{rk}$ and $\hat\tau_{wk}$, because those estimators are all special cases of weighting estimators belonging to the class that $c_{k*}$ optimizes over. Our estimator $\hat\tau_{dk}$ will not always be better than those other estimators, because in estimating $\hat c_{k*}$ we may introduce enough error to make it less efficient.

    When combining stratum-based estimates $\hat\tau_k$ into the weighted estimator $\hat\tau = \sum_k \omega_k \hat\tau_k$, there is the possibility of biases canceling between strata. None of the competing estimators we consider are designed to exploit such cancellation. For large strata, $c_{k*}$ should be well estimated. To arrange cancellations among biased within-stratum estimates would require domain-specific assumptions that we do not make here.

    The comparison to the spiked-in estimator is more complex. As we saw in equation (2.10), the bias can grow without bound in $\Gamma_k$, so for large $\Gamma_k$ this estimator will have the largest MSE. However, for small values of $\Gamma_k$, the spiked-in estimator can outperform all the other estimators. To see why, we make a direct comparison with the dynamic weighted estimator and reference our prior discussion showing that the dynamic weighted estimator will generally outperform $\hat\tau_{ok}$, $\hat\tau_{rk}$ and $\hat\tau_{wk}$.

    We introduce sample counterparts of $\Gamma_k$, given by

    $$\hat\Gamma_{kt} = \frac{\sum_{i\in\mathcal{O}_k} W_{it}Y_{it}}{\sum_{i\in\mathcal{O}_k} W_{it}} - \frac{\sum_{i\in\mathcal{R}_k} W_{it}Y_{it}}{\sum_{i\in\mathcal{R}_k} W_{it}}, \quad\text{and}\quad \hat\Gamma_{kc} = \frac{\sum_{i\in\mathcal{O}_k} W_{ic}Y_{ic}}{\sum_{i\in\mathcal{O}_k} W_{ic}} - \frac{\sum_{i\in\mathcal{R}_k} W_{ic}Y_{ic}}{\sum_{i\in\mathcal{R}_k} W_{ic}}.$$

    Then, after some algebra, $\hat\tau_{sk}$ differs from the RCT estimate as follows,

    $$\hat\tau_{sk} - \hat\tau_{rk} = c_{kt}\hat\Gamma_{kt} - c_{kc}\hat\Gamma_{kc} \tag{2.12}$$

    for sample size proportions

    $$c_{kt} = \frac{\sum_{i\in\mathcal{O}_k} W_{it}}{\sum_{i\in\mathcal{O}_k\cup\mathcal{R}_k} W_{it}} \quad\text{and}\quad c_{kc} = \frac{\sum_{i\in\mathcal{O}_k} W_{ic}}{\sum_{i\in\mathcal{O}_k\cup\mathcal{R}_k} W_{ic}}.$$

    By comparison,

    $$\hat\tau_{dk} - \hat\tau_{rk} = c_{k\star}\hat\Gamma_{kt} - c_{k\star}\hat\Gamma_{kc}, \tag{2.13}$$

    where the dynamic estimator tunes $c_{k\star}$ to the available data. An oracle could choose $c_{k\star}$ optimally using Proposition 1. While the oracle is working in a one parameter family (2.13) for each bin $k$, the spiked-in estimator uses two weights $c_{kt}$ and $c_{kc}$ in (2.12) that are not necessarily within the family that the oracle optimizes over. This is why it is possible for the spiked-in estimator to outperform the oracle.
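    Identity (2.12) is easy to verify numerically. The sketch below (all simulated values are illustrative, not data from the thesis) draws a single stratum and confirms that $\hat\tau_{sk} - \hat\tau_{rk}$ equals $c_{kt}\hat\Gamma_{kt} - c_{kc}\hat\Gamma_{kc}$ up to rounding error.

```python
import numpy as np

rng = np.random.default_rng(2)
Yo, Wo = rng.normal(size=300), rng.binomial(1, 0.2, 300)   # ODB stratum
Yr, Wr = rng.normal(size=40), rng.binomial(1, 0.5, 40)     # RCT stratum

mean = lambda Y, W, arm: Y[W == arm].mean()
tau_rk = mean(Yr, Wr, 1) - mean(Yr, Wr, 0)
Y, W = np.concatenate([Yo, Yr]), np.concatenate([Wo, Wr])
tau_sk = mean(Y, W, 1) - mean(Y, W, 0)                     # spiked-in estimate

gamma_t = mean(Yo, Wo, 1) - mean(Yr, Wr, 1)                # sample Gamma_kt
gamma_c = mean(Yo, Wo, 0) - mean(Yr, Wr, 0)                # sample Gamma_kc
c_kt = (Wo == 1).sum() / (W == 1).sum()
c_kc = (Wo == 0).sum() / (W == 0).sum()

print(tau_sk - tau_rk, c_kt * gamma_t - c_kc * gamma_c)    # the two sides of (2.12)
```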

    2.3.6 Proof of Corollary 3

    Using (2.1) and Corollaries 1 and 2, $\mathrm{Bias}_\delta(\hat\tau_{wk}) = \lambda_k \times \mathrm{Bias}_\delta(\hat\tau_{ok})$ for $\lambda_k$ given in (2.5). This yields the lead terms in both expressions for $\mathrm{Bias}_\delta(\hat\tau_{wk})$. The error terms are $\lambda_k O(1/n_{ok}) = O(1/(n_{ok} + n_{rk}))$.

    Using independence of the RCT and ODB, Corollaries 1 and 2, and definition (2.2),

    $$\mathrm{var}_\delta(\hat\tau_{wk}) = \frac{\lambda_k^2}{n_{ok}}\Bigl(\frac{S_{tt}}{p_t^2} + \frac{S_{cc}}{p_c^2} + \frac{2S_{tc}}{p_tp_c}\Bigr) + (1 - \lambda_k)^2\,\frac{\bar\sigma^2_{rk}}{n_{rk}\,p_r(1 - p_r)} = \frac{\lambda_k}{n_{ok} + n_{rk}}\Bigl(\frac{S_{tt}}{p_t^2} + \frac{S_{cc}}{p_c^2} + \frac{2S_{tc}}{p_tp_c}\Bigr) + \frac{1 - \lambda_k}{n_{ok} + n_{rk}}\,\frac{\bar\sigma^2_{rk}}{p_r(1 - p_r)}.$$


    2.4 Derivation: Dynamic Weighted Estimator

    Recall we seek to estimate $c_{k*}\hat\tau_{ok} + (1 - c_{k*})\hat\tau_{rk}$, where the weight $c_{k*} = \mathrm{var}(\hat\tau_{rk})/(\mathrm{var}(\hat\tau_{rk}) + \mathrm{MSE}(\hat\tau_{ok}))$. For the ODB, our plug-in estimate is

    $$\widetilde{\mathrm{MSE}}(\hat\tau_{ok}) = \Bigl(\frac{s_t}{p_t(1 - p_t)}\Bigr)^2 + \frac{1}{n_{ok}}\Bigl(\frac{S_{tt}}{p_t^2} + \frac{S_{cc}}{p_c^2} + \frac{2S_{tc}}{p_tp_c}\Bigr),$$

    and for the RCT estimate we use

    $$\widetilde{\mathrm{var}}(\hat\tau_{rk}) = \frac{\bar\sigma^2_{rk}}{p_r(1 - p_r)n_{rk}}, \quad\text{with}\quad \bar\sigma^2_{rk} = \frac{1}{n_{rk}}\sum_{i\in\mathcal{R}_k}\bigl(W_{it}\hat\sigma^2_{rkt} + W_{ic}\hat\sigma^2_{rkc}\bigr),$$

    where $\hat\sigma^2_{rkt}$ and $\hat\sigma^2_{rkc}$ are the sample variances observed among the treated and control units respectively.

    We use Horvitz-Thompson style inverse probability weighting to estimate key quantities, as follows:

    $$\hat\rho_t = \frac{\sum_{i\in\mathcal{O}_k} W_{it}Y_{it}}{\sum_{i\in\mathcal{O}_k} W_{it}}, \qquad \hat\rho_c = \frac{\sum_{i\in\mathcal{O}_k} W_{ic}Y_{ic}}{\sum_{i\in\mathcal{O}_k} W_{ic}},$$

    $$\hat s_t = \frac{\sum_{i\in\mathcal{O}_k} W_{it}}{n_{ok}^2}\Biggl(\sum_{i\in\mathcal{O}_k} W_{it}Y_{it} - p_t\sum_{i\in\mathcal{O}_k} \frac{W_{it}Y_{it}}{p_{it}}\Biggr) - \frac{\sum_{i\in\mathcal{O}_k} W_{ic}}{n_{ok}^2}\Biggl(\sum_{i\in\mathcal{O}_k} W_{ic}Y_{ic} - p_c\sum_{i\in\mathcal{O}_k} \frac{W_{ic}Y_{ic}}{p_{ic}}\Biggr),$$

    $$\widehat S_{tt} = \frac{\sum_{i\in\mathcal{O}_k} W_{it}\,p_{it}(1 - p_{it})(Y_{it} - \hat\rho_t)^2}{\sum_{i\in\mathcal{O}_k} W_{it}}, \quad\text{and}\quad \widehat S_{cc} = \frac{\sum_{i\in\mathcal{O}_k} W_{ic}\,p_{it}(1 - p_{it})(Y_{ic} - \hat\rho_c)^2}{\sum_{i\in\mathcal{O}_k} W_{ic}}.$$

    The sole quantity that does not have a Horvitz-Thompson estimator is $S_{tc}(\mathcal{O}_k)$, because we never observe both potential outcomes for a given unit. First, we write $S_{tc}$ as

    $$\frac{1}{n}\sum_{i\in\mathcal{O}_k} W_{it}\,p_{it}(1 - p_{it})(Y_{it} - \rho_t)(Y_{ic} - \rho_c) + \frac{1}{n}\sum_{i\in\mathcal{O}_k} W_{ic}\,p_{it}(1 - p_{it})(Y_{it} - \rho_t)(Y_{ic} - \rho_c).$$

    Next, under Assumption 3,

    $$Y_{it} - \rho_t = Y_{ic} + \tau_k - \mu_t - s_t/p_t = Y_{ic} - \rho_c - \frac{s_t}{p_tp_c},$$

    and similarly $Y_{ic} - \rho_c = Y_{it} - \rho_t + s_t/(p_tp_c)$. Therefore

    $$S_{tc} = \frac{1}{n}\sum_{i\in\mathcal{O}_k} W_{it}\,p_{it}(1 - p_{it})(Y_{it} - \rho_t)^2 + \frac{1}{n}\sum_{i\in\mathcal{O}_k} W_{ic}\,p_{it}(1 - p_{it})(Y_{ic} - \rho_c)^2 - \frac{s_t}{n\,p_t(1 - p_t)}\sum_{i\in\mathcal{O}_k} p_{it}(1 - p_{it})\Bigl(W_{ic}(Y_{ic} - \rho_c) - W_{it}(Y_{it} - \rho_t)\Bigr), \tag{2.14}$$

    and we get $\widehat S_{tc}$ by plugging the above estimates of $\rho_t$, $\rho_c$ and known values of $p_t$, $p_c$ into (2.14).

    Although Assumption 3 is used to derive the estimator, some of our simulations in Section 2.5 test it under a violation of that assumption.
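    The ratio-style plug-ins $\hat\rho_t$, $\hat\rho_c$, $\widehat S_{tt}$ and $\widehat S_{cc}$ translate directly into code. The sketch below is illustrative (the array names are assumptions) and deliberately omits $\hat s_t$ and $\widehat S_{tc}$, which follow from the expressions above once $\hat s_t$ is in hand.

```python
import numpy as np

def ht_plugins(Y, W, e):
    """HT-style plug-in estimates for one ODB stratum O_k (a sketch).

    Y: observed outcomes, W: treatment indicators, e: known propensities e(x_i).
    Returns (rho_t_hat, rho_c_hat, S_tt_hat, S_cc_hat).
    """
    t, c = (W == 1), (W == 0)
    rho_t = Y[t].mean()                       # ratio estimate of rho_t
    rho_c = Y[c].mean()                       # ratio estimate of rho_c
    w = e * (1 - e)                           # the p_i(1 - p_i) weights
    S_tt = np.sum(w[t] * (Y[t] - rho_t) ** 2) / t.sum()
    S_cc = np.sum(w[c] * (Y[c] - rho_c) ** 2) / c.sum()
    return rho_t, rho_c, S_tt, S_cc
```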

    2.5 Simulations

    Our goal is to estimate the average treatment effect in the target population, from which we assume

    the ODB data was randomly sampled. The value of the RCT is that it can substitute for ODB data

    in places where that data is sparse due to the treatment assignment mechanism.

    We simulate two high level scenarios. In one, the RCT is a random sample from the same

    population that the ODB came from. Then the RCT and ODB data differ only in their treatment

    assignment mechanisms. We consider this case the idea