
EFFICIENT PERMUTATION P-VALUE ESTIMATION FOR

    GENE SET TESTS

    A DISSERTATION

    SUBMITTED TO THE DEPARTMENT OF STATISTICS

    AND THE COMMITTEE ON GRADUATE STUDIES

    OF STANFORD UNIVERSITY

    IN PARTIAL FULFILLMENT OF THE REQUIREMENTS

    FOR THE DEGREE OF

    DOCTOR OF PHILOSOPHY

    Yu He

    July 2016

© Copyright by Yu He 2016. All Rights Reserved.


  • Yu He

    I certify that I have read this dissertation and that, in my opinion, it

    is fully adequate in scope and quality as a dissertation for the degree

    of Doctor of Philosophy.

    (Art B. Owen) Principal Adviser

    I certify that I have read this dissertation and that, in my opinion, it

    is fully adequate in scope and quality as a dissertation for the degree

    of Doctor of Philosophy.

    (Trevor Hastie)

    I certify that I have read this dissertation and that, in my opinion, it

    is fully adequate in scope and quality as a dissertation for the degree

    of Doctor of Philosophy.

    (Wing H. Wong)

    Approved for the Stanford University Committee on Graduate Studies

  • Abstract

    In a genome-wide expression study, gene set testing is often used to find potential

gene sets that correlate with a treatment (disease, drug, phenotype, etc.). A gene

set may contain tens to thousands of genes, and genes within a gene set are generally

    correlated. Permutation tests are standard approaches of getting p-values for these

    gene set tests. Plain Monte Carlo methods that generate random permutations can be

computationally infeasible for small p-values. Ackermann and Strimmer (2009) find

two families of test statistics that achieve the overall best performance - a linear family

and a quadratic family. This dissertation first reviews the relevant background of gene

    set testing and permutation tests, and then provides three alternative approaches to

    estimate small permutation p-values efficiently.

The first approach focuses on the linear statistic. Observing that the p-value can

be written as the proportion of points lying in a spherical cap, the p-value is ap-

    proximated by the volume of a spherical cap. Error estimates can be derived from

generalized Stolarsky's invariance principle, and alternative probabilistic proofs are

    provided.

    The second approach focuses on the quadratic statistic. Importance sampling is

    used to estimate the area of the (continuous) significant region on the sphere, and

    the volume of the region is used as an approximation for the (discrete proportion)

    p-value. Different proposal distributions are studied and compared.

    The third approach estimates the p-value with nested sampling. It may work for

    both the linear and the quadratic statistic. Similar ideas can be found in literature

    spanning from combinatorics, sequential Monte Carlo, Bayesian computation, rare

event estimation, network reliability etc., and bear different names, e.g. approximate


  • counting, nested sampling, subset simulation, multilevel splitting etc. We give a

    thorough review of literature in these different areas, and apply the technique to the

    gene set testing with the quadratic test statistic.

    Finally, we compare the proposed methods with plain Monte Carlo and saddle-

    point approximation on three expression studies in Parkinson’s Disease patients.

    This work was supported by the US National Science Foundation under grant

    DMS-1521145.


  • Acknowledgement

    It is my pleasure to thank the many people who made this thesis possible.

    First and foremost, I owe a debt of gratitude to my advisor Professor Art Owen.

    With a contagious passion for statistics, a dedicated pursuit for high quality research

    and diligence that brings overflowing ideas, Art has set a great example for me. Art

    has also guided me through the research journey with encouragement, great patience,

    sound advice and timely help. His guidance has been the compass in the wilderness,

    without which I would have been lost. Although Art is often found juggling many

    meetings, lectures, office hours and emails, he has always kept his door open and

    welcomed a discussion anytime. I am deeply grateful for all the time he has spent

    with me.

    I would like to thank Professor Wing Wong and Professor Trevor Hastie for reading

    my thesis and providing insightful feedback. I also thank Professor Lester Mackey

    for serving on my dissertation committee and Professor Hua Tang for chairing my

    dissertation committee.

    I am very fortunate to have met many great math and statistics teachers. I would

    like to thank my high school math teacher Songbin Lan, my undergraduate teach-

    ers at Nanjing University and University of Toronto (especially Andrey Feuerverger

    and Larry Guth), my graduate teachers at Stanford (especially Trevor Hastie, Tze

    Leung Lai and David Siegmund). I thank them for their constant inspiration and

    encouragement.

    I thank Qingyuan Zhao, Murat A. Erdogdu, Anand Rajaraman and Jure Leskovec

    for the collaboration on a data mining paper that complements my thesis work. I

    would also like to express my gratitude to Kinjal Basu and Qingyuan Zhao for the


  • collaboration on the work that appears in Chapter 2. The time we spent together,

    struggling or cheering, is among the fondest memories of my time at Stanford.

    Many wonderful friends have enriched and added color to my life at Stanford.

    Thank you to everyone in the Owen research group and all other students in the

    department for providing a friendly, supportive and intellectually stimulating envi-

    ronment. Special thanks to my roommate Jingshu, my officemates Pragya and Xiaoy-

    ing for their constant companionship, to the 206 family Qingyuan, Bhaswar, Murat,

    Joshua and Pooja for their incredible friendship that can instantly sweep away any

    anxiety or stress, and to many other friends who made me feel at home in the depart-

    ment. I am especially grateful to my seniors Su Chen, Jeremy Shen, Pei He, Yi Liu

    and Ya Xu for their encouragement and immense help in my career development.

    The luckiest thing that has ever happened to me is to be born into my family. It

    is my father who first motivated my love for math, and it is my mother’s resilience

    and optimism in the face of adversity that prepared me for the many obstacles that

    I encountered in research and in life. I am most grateful for their unconditional

    love and support for me, which is always my pillar of strength. Last but not least,

    I thank Qingyuan again for his companionship as an intimate friend, an inspiring

    collaborator, a considerate partner and a strong emotional pillar. I cannot imagine

    getting through everything without his support and love.


  • Contents

    Abstract iv

    Acknowledgement vi

    1 Introduction 1

    1.1 Background: gene set enrichment analysis(GSEA) . . . . . . . . . . . 1

    1.2 Null hypothesis in GSEA . . . . . . . . . . . . . . . . . . . . . . . . . 3

    1.3 Permutation test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

    1.4 Test statistic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5

    1.5 Notations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7

2 Approximation via Stolarsky's Invariance 10

    2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10

    2.2 Background and notation . . . . . . . . . . . . . . . . . . . . . . . . . 13

    2.3 Approximation via spherical cap volume . . . . . . . . . . . . . . . . 16

    2.4 A finer approximation to the p-value . . . . . . . . . . . . . . . . . . 20

    2.5 Generalized Stolarsky Invariance . . . . . . . . . . . . . . . . . . . . . 26

    2.6 Two sided p-values . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28

    2.7 Numerical Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30

    2.8 Comparison to saddlepoint approximation . . . . . . . . . . . . . . . 31

    2.9 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37

    3 Advances in importance sampling 41

    3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41


  • 3.2 Importance sampling and control variates . . . . . . . . . . . . . . . . 43

    3.3 Regret bounds and convexity . . . . . . . . . . . . . . . . . . . . . . 46

    3.3.1 Mixture importance sampling . . . . . . . . . . . . . . . . . . 47

    3.3.2 Multiple importance sampling . . . . . . . . . . . . . . . . . . 50

    3.3.3 Convexity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52

    3.4 Choosing α . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53

    3.4.1 Bounding αj away from zero . . . . . . . . . . . . . . . . . . . 54

    3.4.2 Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . 55

    3.5 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57

    3.5.1 Singular function . . . . . . . . . . . . . . . . . . . . . . . . . 57

    3.5.2 Rare event . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59

    3.5.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62

    3.6 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62

4 Approximation with importance sampling 65

    4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65

    4.2 Geometry of the quadratic spherical cap . . . . . . . . . . . . . . . . 66

    4.3 Three sequential importance sampling algorithms . . . . . . . . . . . 68

4.3.1 Uniform sampling from Sd with polar coordinates . . . . . . 69

4.3.2 Sequential sampling from Sd(λ) starting from the largest eigenvalue . . . . . . 71

4.3.3 Sequential sampling from Sd(λ) starting from the smallest eigenvalue . . . . . . 75

4.3.4 Sequential sampling from Sd(λ) with low rank matrix Σ . . . 76

4.3.5 Simulation results for the three sequential importance algorithms 78

    4.4 From continuous approximation to the exact p-value . . . . . . . . . . 84

    5 Subset simulation 88

    5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88

    5.2 Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89

    5.2.1 Estimation of P ∗ . . . . . . . . . . . . . . . . . . . . . . . . . 89

    5.2.2 Uniform sampling on A` . . . . . . . . . . . . . . . . . . . . . 89


  • 5.2.3 Quick update for T (x′) . . . . . . . . . . . . . . . . . . . . . . 92

    5.2.4 Choosing q . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92

    5.2.5 Algorithm for estimating P ∗ . . . . . . . . . . . . . . . . . . . 92

    5.3 Simulation results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93

    6 Real data example 103

    A Appendix 112

    A.1 Proof of Theorem 3 (Limiting invariance) . . . . . . . . . . . . . . . . 112

    A.2 Proof of Lemma 3 (Double inclusion for Model 2) . . . . . . . . . . . 114

    A.3 Proof of Theorem 6 (Second moment under Model 2) . . . . . . . . . 116

    A.4 Proof of Theorem 7 (Location weighted invariance) . . . . . . . . . . 117

    A.5 Proof of Theorem 8 (Spatially weighed invariance) . . . . . . . . . . 120

    A.6 Proof of Theorem 9 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122

    A.7 Proof of Corollaries 1 and 2 . . . . . . . . . . . . . . . . . . . . . . . 123

    A.8 Proof of Lemma 4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123

    A.9 Proof of Theorem 11 . . . . . . . . . . . . . . . . . . . . . . . . . . . 124

    A.10 Proof of Theorem 12 . . . . . . . . . . . . . . . . . . . . . . . . . . . 124

    A.11 Proof of Theorem 13 . . . . . . . . . . . . . . . . . . . . . . . . . . . 125

    A.12 Algorithms and Figures . . . . . . . . . . . . . . . . . . . . . . . . . . 127


  • List of Tables

    2.1 Maximal Z scores observed for p̂2 and p̂3. . . . . . . . . . . . . . . . 40

    3.1 Singular function example. The estimate µ̂ was computed from 500,000

    observations using the sampler given in the first column. Control vari-

    ates were used in two of those samples. The final columns give variance

    reduction factors compared to plain Monte Carlo and compared to uni-

    form mixture importance sampling with no control variates. . . . . . 60

    3.2 Top 10 mixture components N (xk, σ2rI5) for the singular integrand inα∗∗, which uses control variates. D denotes the defensive mixture. The

    last columns are mean and sd of αj over 5000 simulations. . . . . . . 61

    3.3 Top 10 mixture components N (xk, σ2rI5) for the singular integrandin α∗, which does not use control variates. D denotes the defensive

    mixture. The last columns are mean and sd of αj over 5000 simulations. 62

    3.4 Rare event example. The estimate µ̂ was computed from 100,000 obser-

    vations using the sampler given in the first column. Control variates

    were used in two of those samples. The next columns give variance

    reduction factors compared to plain Monte Carlo and compared to

    uniform mixture importance sampling with no control variates. The

    final column compares actual squared error with its sample estimate. 63

    3.5 Top 10 mixture components N (zk, σ2rI2) for the singular integrandin α∗, which does not use control variates. D denotes the defensive

    mixture. The last columns are mean and sd of αj over 5000 simulations. 64

    3.6 Average running times in seconds for four estimators on two examples. 64


  • 4.1 Four groups of data sets with different sizes . . . . . . . . . . . . . . 78

    4.2 Comparison of sampling with pl and ql for one data set in GPI with

    P ∗ = 8.426 × 10−2. N(Sd(λ))/N is the proportion of points lying inSd(λ). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80

    4.3 Data set DS∗ in GPI with P∗ = 1.083×10−4. Here Nql(Sd(λ))/Nql 6= 1,

    i.e. ql fails to sample exclusively from Sd(λ) because of numericalinaccuracy. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81

    4.4 Comparison of sampling with ps and qs on data set DS∗, with P ∗ =

    1.083× 10−4. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 834.5 Comparison of sampling with pr and qr on data set DS

    ∗ with P ∗ =

    1.083× 10−4. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 834.6 Data set DS∗ with P ∗ = 1.083 × 10−4 is considered. We estimate P ∗

    by projecting continuous samples from pr and qr and calculate with

    formula (4.17). N(Sd(T0;Q,Λ))/N is the proportion of g(a) lying inSd(T0;Q,Λ). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85

    4.7 Data set DS∗ with P ∗ = 1.083×10−4 is considered. We compare sam-pling with pr, qU and qα∗ and estimate with formula (4.17). N(Sd(T0;Q,Λ))/Ndenotes the proportion of g(a) lying in Sd(T0;Q,Λ). . . . . . . . . . . 85

    4.8 Data set DS∗ with P ∗ = 1.083 × 10−4 is considered. We comparesampling with pr, qU and qα∗∗ and estimate with control variates.

    N(Sd(T0;Q,Λ))/N denotes the proportion of g(a) lying in Sd(T0;Q,Λ). 86

    5.1 Comparison of Monte Carlo sampling and subset simulation on data

    set DS∗ with P ∗ = 1.083× 10−4. We run K = 50 independent subsetsimulation with n = 1000, q = 0.2, B = 20. . . . . . . . . . . . . . . . 96

    6.1 Three data sets used for non-permutation GSEA. . . . . . . . . . . . 105

6.2 Kendall's τ metric for Moran set with 10−5 ≤ p̂MC ≤ 10−4 . . . . . . 105

6.3 Running time for all gene sets in different data sets with the linear

    statistic (in seconds). p̂MC are run with 106 samples. Extra time is

    needed for gene sets with p̂MC < 10−4 to generate 107 samples. . . . . 110


  • 6.4 Running time for all gene sets in different data sets with the quadratic

    statistic (in seconds). p̂MC are run with 106 samples. Extra time is

    needed for gene sets with p̂MC < 10−4 to generate 107 samples. . . . . 110

6.5 Kendall's τ metric for Moran set with 10−5 ≤ p̂MC ≤ 10−4. We focus on the gene sets in

Moran data with linear statistic p-values 10−5 < p̂MC < 10−4 because these gene sets have

small p-values, yet reliable Monte Carlo estimates p̂MC as gold standards - with 107 Monte

Carlo samples. The methods that are closest to Monte Carlo (gold standard)

are p̂1, p̂2 and p̂3. . . . . . . . . . . . . . . . . . . . . . . . . 110


  • List of Figures

    1.1 Framework for the construction of a gene set statistic . . . . . . . . . 6

    2.1 Illustration for Model 1. The point y is uniformly distributed over Sd.The small open circles represent permuted vectors xk. The point y0

    is the observed value of y. The circle around it goes through x0 and

    represents a spherical cap of height yT0x0. A second spherical cap of

    equal volume is centered at y = y1. We study moments of p(y; ρ̂), the

    fraction of xk in the cap centered at random y. . . . . . . . . . . . . 17

    2.2 Illustration for Model 2. The original response vector is y0 with yT0x0 =

    ρ̂. We consider alternative y uniformly distributed on the surface of

    C(x0; ρ̂) with examples y1 and y2. Around each such yj there is a

    spherical cap of height ρ̂ that just barely includes xc = x0. We use

    p̂2 = E2(p(y; ρ̂)) and find an expression for E2((p̂2 − p(y; ρ̂))2). . . . 222.3 Illustration of r1, r2, δ1 and δ2. The points xc, x1 and x2 each have m0

    negative and m1 positive components. For j = 1, 2 the swap distance

    between xj and xc is rj. There are δ1 positive components of xc where

    both x1 and x2 are negative, and δ2 negative components of xc where

    both xj are positive. . . . . . . . . . . . . . . . . . . . . . . . . . . . 25

    2.4 RMSEs for p̂1 and p̂2 under Models 1 and 2. The x-axis shows the

    estimate p̂ as ρ varies from 1 to 0. Here m0 = m1. Plots with m0 6= m1are similar. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32

    2.5 Comparison of p̂1 and p̂2. In (a), log10(p̂2) is plotted against log10(p̂1)

    for varying ρ’s. The black line is the 45 degree line. In (b), the ratio of

    RMSEs for p̂1 and p̂2 is plotted against log10(p̂1). The x-axis is log10(p̂1). 33


  • 2.6 The coefficient of variation for p̂2 with varying ρ’s. . . . . . . . . . . 33

    2.7 Comparison of p̂3 versus p̂2. For a given triple (m0,m1, ρ̂), we ran-

    domly sample 100 vectors y0 with xT0y0 = ρ̂. By symmetry, x0 can be

    any permutation. We get 100 different p̂3 and a common p̂2 for each

    triple (m0,m1, ρ̂). In the two panels on the left, ρ̂’s are chosen to give

    two-sided p̂1(ρ̂) = 2×10−10 with various dimensions (m0,m1). The twopanels on the right correspond to two-sided p̂1(ρ̂) = 2 × 10−20. Esti-mates for two-sided p-values are plotted on the y-axis, with p̂3 plotted

    as black dots with distributions and p̂2 as red crosses. . . . . . . . . 34

    2.8 Comparison of RMSE(p̂3) versus RMSE(p̂2). The same simulation set-

    ting as described in Figure 2.7. The red crosses and black dots are

    estimated RMSE(p̂2) and RMSE(p̂3) under Model 2 with centers c = 0

    and c = arg max06i

  • 5.1 Comparison of σ̂1(P̂∗1:K) and σ̂2(P̂

    ∗i )’s. With the same K = 50 sim-

    ulations in Table 5.1, the sample standard deviation for all (P̂ ∗i )’s is

    σ̂1(P̂∗1:K) = 2.813 × 10−5. We have 50 stand alone estimates σ̂2(P̂ ∗i )

    computed with eq. (5.2). We plot the z-scores of P̂ ∗i ’s computed with

    σ̂1(P̂∗1:K) and σ̂2(P̂

    ∗i ) respectively in Figure 5.1b and 5.1c. We plot the

    histogram of σ̂2(P̂∗i )’s in Figure 5.1a with σ̂1(P̂

    ∗1:K) added as the dashed

    reference line. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97

    5.2 Simulation result for GPI. For Monte Carlo NMC = 108. For subset

    simulation, n = 1000, B = 40, and K = 50 for each data set. . . . . . 99

    5.3 Simulation result for GPII. For Monte Carlo NMC = 108. For subset

    simulation, n = 1000, B = 40, and K = 50 for each data set. . . . . . 100

    5.4 Simulation result for GPIII. For Monte Carlo NMC = 108. For subset

    simulation, n = 1000, B = 120, and K = 50 for each data set. . . . . 101

    5.5 Simulation result for GPIV. For Monte Carlo NMC = 108. For subset

    simulation, n = 1000, B = 120, and K = 50 for each data set. . . . . 102

    6.1 Moran: scatter plot for linear statistic . . . . . . . . . . . . . . . . . . 106

    6.2 Moran: scatter plot for linear statistic. Gene sets are plotted with

    p-values satisfying 10−5 < p̂MC < 10−4, with a total of 190 gene sets. 107

    6.3 Scherzer: scatter plot for linear statistic . . . . . . . . . . . . . . . . . 108

    6.4 Scherzer: scatter plot for linear statistic. Gene sets are plotted with

    p-values satisfying 10−5 < p̂MC < 10−3, with a total of 15 gene sets. . 109

    6.5 Comparison of p̂MC and p̂SS on Zhang, Moran and Scherzer data set

    for quadratic statistic. . . . . . . . . . . . . . . . . . . . . . . . . . . 111

    A.1 For each of 40 data sets in GPI, we obtain Nps = 5 × 107 and Nqs =5 × 105 samples from ps and qs respectively and estimate µ. FigureA.1a plots µ̂qs versus µ̂ps . Figure A.1b plots the estimated relative

    error (µ̂ps − P ∗)/P ∗ versus P ∗. Figure A.1c plots VRFN(qs; ps) versusµ̂ps . Figure A.1d plots VRFt(qs; ps) versus µ̂ps . . . . . . . . . . . . . 134


  • A.2 For each of 40 data sets in GPI, we obtain Npr = 5 × 107 and Nqr =5 × 105 samples from pr and qr respectively and estimate µ. FigureA.2a plots µ̂qr versus µ̂pr . Figure A.2b plots the estimated relative

    error (µ̂pr − P ∗)/P ∗ versus P ∗. Figure A.2c plots VRFN(qr; pr) versusµ̂pr . Figure A.2d plots VRFt(qr; pr) versus µ̂pr . . . . . . . . . . . . . 135

    A.3 For each of 40 data sets in GPII, we obtain Nps = 5 × 107 and Nqs =5 × 105 samples from ps and qs respectively and estimate µ. FigureA.3a plots µ̂qs versus µ̂ps . Figure A.3b plots the estimated relative

    error (µ̂ps − P ∗)/P ∗ versus P ∗. Figure A.3c plots VRFN(qs; ps) versusµ̂ps . Figure A.3d plots VRFt(qs; ps) versus µ̂ps . . . . . . . . . . . . . 136

    A.4 For each of 40 data sets in GPIII, we obtain Npr = 108 and Nqr = 10

    8

    samples from pr and qr respectively and estimate µ. Figure A.4a plots

    µ̂qr versus µ̂pr . Figure A.4b plots the ratio µ̂pr/µ̂qr versus µ̂qr . Figure

    A.4c plots VRFN(qr; pr) versus µ̂pr . Figure A.4d plots VRFt(qr; pr)

    versus µ̂pr . VRFN and VRFt are computed with equations (4.14),

    (4.15) and (4.16). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137

    A.5 For each of 40 data sets in GPIV, we obtain Npr = 108 and Nqr = 10

    8

    samples from pr and qr respectively and estimate µ. Figure A.5a plots

    µ̂qr versus µ̂pr . Figure A.5b plots the ratio µ̂pr/µ̂qr versus µ̂qr . Figure

    A.5c plots VRFN(qr; pr) versus µ̂pr . Figure A.5d plots VRFt(qr; pr)

    versus µ̂pr . VRFN and VRFt are computed with equations (4.14),

    (4.15) and (4.16). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138


  • Chapter 1

    Introduction

1.1 Background: gene set enrichment analysis (GSEA)

    Since the introduction of DNA microarray measurement technology, genomewide ex-

    pression analysis with these DNA microarrays has become a mainstay of both ge-

    nomics and statistical research. Researchers seek methods that extract useful infor-

    mation from the DNA microarrays both accurately and efficiently. For a particular

    experiment, gene expressions are measured for thousands of genes from a group of

    samples belonging to either the treatment or control group, for example, candidates

with or without lung cancer. Traditionally, the most differentially expressed genes are

tested individually for their relationship with the treatment. However, this approach

has some major limitations:

• After correcting for the multiple hypothesis testing effect, genes that achieve the required significance level may be too few or even non-existent. We may leave

    out many weakly correlated genes due to the noise and the multiple hypothesis

    testing effect.

• Even if we end up with a long list of statistically significant genes, they may well be biologically unstructured. Interpretation can be ad hoc and arbitrary,

    and often biased by the biologist’s area of expertise.

• Many measured genes are biologically related - sharing the same biological function, chromosomal location, or regulation, and hence their measurements are

    correlated. Standard multiple hypothesis testing procedures that control FDR

require independence or special dependence structures like PRDS (Benjamini

    and Yekutieli (2001)), which often does not hold in real microarray data sets.

In view of the above challenges for the single gene testing procedure, Mootha

et al. (2003) first introduced gene set enrichment analysis (GSEA). Instead of studying

    single gene effects individually, they propose to study microarray data at the level

    of gene sets. “The gene sets are defined based on prior biological knowledge, e.g.,

    published information about biochemical pathways or coexpression in previous ex-

    periments.”(Subramanian et al. (2005)) The goal of GSEA is to find gene sets that

    are correlated with the treatment as a whole. Moving the analysis from the single

    gene level to the gene set level has several advantages,

• It is common to have many weakly correlated single genes that appear as insignificant individually. By combining their weak effects with appropriate

    choices of the test statistic, we may find the gene set as a whole achieves the

    desired statistical significance.

• Conducting hypothesis testing on the gene set level yields much more interpretable results. We can focus on exploring the scientific explanation of the

    relationships between gene sets and the treatment, without the extra step of

    summarizing uncoordinated single gene test results.

• The total number of tests carried out at the same time is significantly reduced. Multiple single gene tests are condensed to one single test for the whole gene

    set.

• When a single test statistic is constructed for the whole gene set, the correlations of single genes within gene sets no longer play a role in the final decision, making

    the conclusion more statistically sound.


Because of the above benefits, GSEA has gained much attention since its first

introduction, and has become the standard practice in the last decade (Tamayo et al.

(2012)). The gene set database has grown from the original database (Subramanian

et al. (2005)) of 1,325 gene sets, including four major collections, to 13,311 gene

    sets as of today in the Molecular Signatures Database (MSigDB, Liberzon et al.

    (2015)), divided into 8 major collections, and several subcollections. The gene sets

    are available for download from the Broad Institute (2016).

    1.2 Null hypothesis in GSEA

    A key component in hypothesis testing is the null hypothesis. Tian et al. (2005) and

    Goeman and Bühlmann (2007) among others introduce two different null hypotheses

in GSEA. Let S be the gene set of interest and Y be the treatment. The first

    null compares the association between Y and S with the association between Y and

other gene sets of comparable sizes. This null hypothesis essentially means S cannot

    stand out from comparable gene sets, hence is often known as the “competitive null”

hypothesis. Methods for testing the competitive null typically involve randomizing

the gene labels and keeping the sample labels fixed. This permutation does not give

    a rigorous test when genes are correlated, which is usually the case for those within

    a gene set. (Goeman and Bühlmann (2007))

    The second null only focuses on the gene set of interest S. It compares the associa-

    tion between S and Y with the association between S and random treatments. To test

    this “self-contained null”, usually the labels in the treatment Y are permuted, with

    the gene labels fixed. While the competitive null is often of interest, this dissertation

    focuses on testing for the “self-contained” null.

    1.3 Permutation test

Gene set tests construct a single test statistic for the whole gene set. In most cases,

    the null distributions have no closed form, hence the p-values are usually estimated

    by permutation tests. Even in cases where closed form null distributions are available


    under appropriate assumptions, such as the Kolmogorov-Smirnov statistic in the ini-

    tial GSEA in Mootha et al. (2003), and the J-G score in Jiang and Gentleman (2007),

    the permutation tests are suggested to gain robustness in case the data falls short of

    the assumptions.

    A detailed explanation of permutation test is described in Lehmann and Romano

    (2005). We describe its procedure in our particular application of GSEA. Suppose

for $m$ independent samples we observe the single gene measurements $Y_g \in \mathbb{R}^m$, $g = 1, \dots, G$, for all genes in a gene set of size $G$, and denote the corresponding treatment as $X \in \mathbb{R}^m$ or $\{0,1\}^m$. In cases where $X$ takes binary values, let $m_0$ be the number of 0's and $m_1$ be the number of 1's, with $m = m_0 + m_1$. Denote the gene measurements for the gene set as the matrix $Y_{1:G} = [Y_1, \dots, Y_G] \in \mathbb{R}^{m \times G}$. We first decide on the test statistic for the gene set. An example is taking the sum of single gene correlations as the test statistic, $T(Y_{1:G}, X) = \sum_{g=1}^{G} \mathrm{corr}(Y_g, X)$. Another example is constructing the test statistic as the sum of squared t-statistics for the single genes, $T(Y_{1:G}, X) = \sum_{g=1}^{G} t(Y_g, X)^2$. To perform the permutation test, we keep $Y_{1:G}$ fixed and obtain all $N$ unique permutations of $X$ as $X_0^*, \dots, X_{N-1}^*$. In cases where $X \in \mathbb{R}^m$, $N = m!$, and in cases where $X \in \{0,1\}^{m_0+m_1}$, $N = \binom{m_0+m_1}{m_0}$. Then the permutation p-value is defined as

$$ p = \frac{1}{N} \sum_{i=0}^{N-1} \mathbf{1}\bigl(T(Y_{1:G}, X_i^*) \ge T(Y_{1:G}, X)\bigr). $$

    Note that T (Y1:G, X∗i ) ≥ T (Y1:G, X) holds true at least for X∗i = X, so the true

    permutation p-value never goes below 1/N .
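As a concrete illustration of this procedure, the following minimal Python sketch estimates the permutation p-value of the sum-of-correlations statistic by plain Monte Carlo. It is not code from the thesis: the toy expression matrix, sample sizes, number of permutations, and the helper name `linear_stat` are illustrative assumptions.

```python
import numpy as np

def linear_stat(Y, x):
    """Sum over genes of corr(Y_g, x); Y is m-by-G, x is the length-m treatment vector."""
    Yc = (Y - Y.mean(axis=0)) / Y.std(axis=0)
    xc = (x - x.mean()) / x.std()
    return (Yc.T @ xc).sum() / len(x)     # each term is a sample correlation

rng = np.random.default_rng(0)
m0, m1, G = 10, 10, 50                     # toy sizes
x = np.repeat([0.0, 1.0], [m0, m1])        # binary treatment labels
Y = rng.standard_normal((m0 + m1, G))      # toy expression matrix (null data)

T_obs = linear_stat(Y, x)
M = 10_000                                 # number of sampled permutations
exceed = sum(linear_stat(Y, rng.permutation(x)) >= T_obs for _ in range(M))
# Counting the observed labelling as one extra permutation keeps the estimate >= 1/(M+1).
p_hat = (exceed + 1) / (M + 1)
print(p_hat)
```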

    The total number of permutations N increases exponentially and quickly becomes

intractable as the sample size $m$ grows. For example, with $m = 20$, $N = 20! \doteq 2.4 \times 10^{18}$, and with $m_0 = m_1 = 20$, $N = \binom{40}{20} \doteq 1.4 \times 10^{11}$. It is common to approximate

    the exact permutation p-value by random sampling from all permutations (Good

    (2013)). Monte Carlo permutation tests are easy to implement, require no specific

    distributional assumptions on the data, and can be applied to any test statistic of our

    choice. Despite their generality, they are often computationally expensive, especially


    when the true p-values are small. As discussed in Larson and Owen (2015), for p-

values as small as ε, random permutation sample sizes between 3/ε and 19/ε are needed to

    get adequate power.

    Monte Carlo based permutation also suffers from a resampling granularity problem,

    whose name is adopted from Larson and Owen (2015). It is conventional to add the

    observation X as an additional random sample of permutations. Then the smallest

p-value that we can possibly get from M − 1 random permutations is 1/M. When two or more gene sets are tied at this granularity value, there is no way to distinguish

    them. Many existing methods rank the gene sets by their corresponding test statistics.

    However this practice is subject to the assumption that all test statistics have the

    same null distribution, which is clearly not the case when comparing gene sets of

    different sizes or different correlation structures.

    Observing the challenges in plain Monte Carlo sampling of permutations, we seek

    alternative methods to estimate permutation p-values efficiently, especially those that

    are extremely small. The methods that work most efficiently are usually specialized

    to the chosen test statistic. We discuss the choice of test statistics in the next section.

    1.4 Test statistic

    A gene set statistic is usually constructed with three components: gene-level statistic,

    transformation and summary statistic as shown in Fig. 1.1. There can be many

    different choices in each component. To obtain the gene set test statistic, one can

    first choose the gene level statistic as the t-statistic or the correlation coefficient,

    then take no transformation, or take transformations such as the absolute value or

    the square, and finally take the mean or median. The reader can easily decode the

aforementioned two examples of gene set statistics, $T(Y_{1:G}, X) = \sum_{g=1}^{G} \mathrm{corr}(Y_g, X)$ and $T(Y_{1:G}, X) = \sum_{g=1}^{G} t(Y_g, X)^2$, in terms of these three steps. For a more detailed discussion

    of the construction framework for gene set tests, see Ackermann and Strimmer (2009).

Figure 1.1: Framework for the construction of a gene set statistic

Ackermann and Strimmer (2009) compared 261 different gene set statistics, and found particularly good performance of two families of statistics - a linear family and a quadratic family. Let $\rho_g(Y_g, X)$ and $t_g(Y_g, X)$ be the single gene correlation and t-statistic respectively. The linear family consists of $T_1 = \sum_{g=1}^{G} \rho_g(Y_g, X)$ and $T_1' = \sum_{g=1}^{G} t_g(Y_g, X)$, and the quadratic family consists of $T_2 = \sum_{g=1}^{G} \rho_g(Y_g, X)^2$ and $T_2' = \sum_{g=1}^{G} t_g(Y_g, X)^2$. The best performance comes from the quadratic family. By squaring the correlation or t-statistics, effects from genes that are differentially expressed in

    opposite directions are added up instead of being cancelled from each other. The

    linear family is the second best. Here T ′1 is also known as the J-G score proposed in

    Jiang and Gentleman (2007). It is remarkable that the best performing statistics are

    surprisingly simple, especially when compared with the complicated GSEA method

    in Subramanian et al. (2005).

    The similar performances of using t-statistic and the correlation can be justified

    through Taylor approximation, as shown in Larson and Owen (2015). The usual t-

statistic for testing a linear relationship is $t_g \equiv \sqrt{m-2}\,\hat\rho_g/(1 - \hat\rho_g^2)^{1/2}$. The Taylor expansion gives $t_g = \sqrt{m-2}\,\bigl(\hat\rho_g + \tfrac{1}{2}\hat\rho_g^3 + O(\hat\rho_g^5)\bigr)$. Gene set tests are most useful when individual $|\hat\rho_g|$'s are small. In these cases $t_g$ is approximately a constant multiple of $\hat\rho_g$, hence using the correlation as the gene level statistic should yield similar performances to those using t-statistics. We study the linear and quadratic statistics with

    correlation as the gene level statistic.
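For illustration, the four family statistics can be computed as in the following sketch. The toy data and the helper name `family_statistics` are assumptions made for the example; scipy's pooled two-sample t-test is used for the gene-level t-statistics, which for binary x agrees with the correlation form of the t-statistic above.

```python
import numpy as np
from scipy import stats

def family_statistics(Y, x):
    """Return (T1, T1', T2, T2') for one gene set; Y is m-by-G, x is a binary treatment vector."""
    G = Y.shape[1]
    rho = np.array([stats.pearsonr(Y[:, g], x)[0] for g in range(G)])
    # Pooled two-sample t statistics, one per gene.
    t = np.array([stats.ttest_ind(Y[x == 1, g], Y[x == 0, g])[0] for g in range(G)])
    return rho.sum(), t.sum(), (rho ** 2).sum(), (t ** 2).sum()

rng = np.random.default_rng(1)
x = np.repeat([0, 1], [12, 12])
Y = rng.standard_normal((24, 30))
print(family_statistics(Y, x))
```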


    1.5 Notations

We summarize the notation that is used throughout the dissertation here. Let $m = m_0 + m_1$ be the number of patients, with $m_0$ in the control group and $m_1$ in the treatment group, and $X \in \{0,1\}^m$ be the indicator variable for treatment. We limit our discussion to binary $X$, while some methods can be easily extended to continuous $X$'s as well. Let $S$ be the gene set of interest and $G = |S|$ be the cardinality of $S$. Denote the expression level for gene $g$ as $Y_g \in \mathbb{R}^m$, and let $Y_{1:G} = [Y_1, \dots, Y_G] \in \mathbb{R}^{m \times G}$ be the expression level measurement matrix. We center and standardize the binary $X$ to obtain $x_0$ s.t. $x_0^T 1 = 0$, $x_0^T x_0 = 1$. Let $x_0, \dots, x_{N-1}$ be all permutations of $x_0$, where $N = \binom{m_0+m_1}{m_0}$.

We are interested in approximating the permutation p-value for two statistics,

$$ T_1(X; Y_{1:G}) = \sum_{g=1}^{G} \rho_g(Y_g, X), \qquad T_2(X; Y_{1:G}) = \sum_{g=1}^{G} \rho_g(Y_g, X)^2. $$

Note that we can replace $X$ with $x_0$ because centering and scaling do not change the correlation coefficients, hence the corresponding p-values are

$$ p_j = \frac{1}{N} \sum_{i=0}^{N-1} \mathbf{1}\bigl(T_j(x_i; Y_{1:G}) \ge T_j(x_0; Y_{1:G})\bigr), \quad j = 1, 2. $$

    In the following chapters we may omit the subscript on p-value when there is no

    confusion on the test statistic. We now derive alternative formulas for T1 and T2 for

    ease of discussion in the following chapters. Note that

$$ T_1 = \sum_{g=1}^{G} \mathrm{corr}(X, Y_g) = \sum_{g=1}^{G} \mathrm{corr}\Bigl(X, \frac{Y_g}{\mathrm{sd}(Y_g)}\Bigr) = \sqrt{G}\,\mathrm{corr}\Bigl(X, \sum_{g=1}^{G} \frac{Y_g}{\mathrm{sd}(Y_g)}\Bigr). $$

We define $Y = \sum_{g=1}^{G} Y_g/\mathrm{sd}(Y_g) \in \mathbb{R}^m$; then using $T_1$ is equivalent to using $\mathrm{corr}(X, Y)$ as the test statistic for permutation tests. We center and standardize $Y$ to get $y_0$ s.t. $y_0^T 1 = 0$, $y_0^T y_0 = 1$. Then the p-value for the linear statistic can be written in terms of the correlations between $y_0$ and the $x_i$'s:

$$ p_1 = \frac{1}{N} \sum_{i=0}^{N-1} \mathbf{1}(x_i^T y_0 \ge x_0^T y_0). \qquad (1.1) $$

To simplify $T_2$, we center and standardize all columns of $Y_{1:G}$ to get $\tilde Y_{1:G}$, s.t. $\tilde Y_{1:G}^T 1 = 0$ and $\mathrm{diag}(\tilde Y_{1:G}^T \tilde Y_{1:G}) = 1$. Define $\Sigma = \tilde Y_{1:G} \tilde Y_{1:G}^T$; then

$$ T_2 = \sum_{g=1}^{G} \mathrm{corr}(X, Y_g)^2 = \sum_{g=1}^{G} (x^T \tilde Y_g)^2 = x^T \Sigma x. $$

    Then the p-value for the quadratic statistic is

$$ p_2 = \frac{1}{N} \sum_{i=0}^{N-1} \mathbf{1}(x_i^T \Sigma x_i \ge x_0^T \Sigma x_0). \qquad (1.2) $$
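The reduction of T2 to a quadratic form is easy to check numerically. The sketch below is an illustrative assumption on toy data, not code from the thesis: it standardizes X and the columns of Y1:G as above and confirms that the sum of squared correlations equals the quadratic form in Σ.

```python
import numpy as np

rng = np.random.default_rng(2)
m0, m1, G = 8, 8, 20
m = m0 + m1
X = np.repeat([0.0, 1.0], [m0, m1])
Y = rng.standard_normal((m, G))

# Center and scale X so that x0 sums to zero and has unit Euclidean norm.
x0 = X - X.mean()
x0 /= np.linalg.norm(x0)

# Center each column of Y and give it unit Euclidean norm, as in Section 1.5.
Yt = Y - Y.mean(axis=0)
Yt /= np.linalg.norm(Yt, axis=0)

Sigma = Yt @ Yt.T                                  # m-by-m matrix from Section 1.5

T2_quadratic_form = x0 @ Sigma @ x0                # x^T Sigma x
T2_sum_of_corr = sum(np.corrcoef(X, Y[:, g])[0, 1] ** 2 for g in range(G))
print(np.isclose(T2_quadratic_form, T2_sum_of_corr))   # True: the two forms agree
```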

    This dissertation proposes three novel methods to efficiently estimate small permu-

    tation p-values p1 and p2 with T1 and T2 as test statistics respectively. The rest of the

    dissertation is organized as follows. Chapter 2, 4 and 5 are devoted to three distinct

    methods respectively. Chapter 2 introduces three approximations for p1. Error esti-

    mates are derived from generalized Stolarsky’s invariance. Alternative probabilistic

    arguments are provided as well. Chapter 3 provides some new results on importance

    sampling. Specifically, we provide a method to jointly optimize the weights in mixture

    importance sampling in combination with control variates. Chapter 4 focuses on the

    quadratic statistic T2, and introduces an estimation procedure based on importance

    sampling. Chapter 5 is devoted to a subset sampling method that may work for both

    linear and quadratic statistics. Similar ideas can be found in literature spanning from

    combinatorics, sequential Monte Carlo, Bayesian computation, rare event estimation,

    network reliability etc., and bears different names, e.g. approximate counting, nested

    sampling, subset simulation, multilevel splitting etc. We give a thorough review of

    literature in these different areas, and apply the technique to the gene set testing with

    the quadratic test statistics. Finally Chapter 6 applies the three methods- Stolarsky,

    importance sampling and subset simulation - on a real data example.


    Chapter 2 is based on the tech report He et al. (2016), and Chapter 4 is joint work

    with Kinjal Basu, Qingyuan Zhao and Art B. Owen. It appears in the tech report

    He and Owen (2014).

  • Chapter 2

    p-value approximation via

    Stolarsky’s Invariance principle

    2.1 Introduction

    This chapter focuses on estimating the p-value for the linear statistic, as defined in

    eq. (1.1). We drop the subscript of p1 throughout the discussion in this chapter. For

    linear test statistics, as we show below, the permutation p-value is the fraction of

    permuted data vectors lying in a given spherical cap subset of a d-dimensional sphere

$S^d = \{z \in \mathbb{R}^{d+1} \mid z^T z = 1\}$. A natural but crude approximation to that p-value is the fraction $\hat p$ of the sphere's surface volume contained in that spherical cap.

Stolarsky's invariance principle gives a remarkable description of the accuracy of this approximation $\hat p$. For $y \in S^d$ and $t \in [-1, 1]$ we can define the spherical cap of center $y$ and height $t$ via $C(y; t) = \{z \in S^d \mid \langle y, z\rangle \ge t\}$. For $x_0, \dots, x_{N-1} \in S^d$, let $p(y, t)$ be the fraction of those $N$ points that lie in $C(y; t)$ and let $\hat p(y, t) = \hat p(t) = \mathrm{vol}(C(y; t))/\mathrm{vol}(S^d)$. The squared $L_2$ spherical cap discrepancy of these points is

$$ L_2(x_0, \dots, x_{N-1})^2 = \int_{-1}^{1} \int_{S^d} |\hat p(t) - p(z, t)|^2 \, d\sigma_d(z)\, dt. $$

Stolarsky (1973) shows that

$$ \frac{d\,\omega_d}{\omega_{d-1}} \times L_2(\cdot)^2 = \int_{S^d}\int_{S^d} \|x - y\| \, d\sigma_d(x)\, d\sigma_d(y) - \frac{1}{N^2} \sum_{k,l=0}^{N-1} \|x_k - x_l\| \qquad (2.1) $$

where $\sigma_d$ is the uniform (Haar) measure on $S^d$ and $\omega_d$ is the (surface) volume of $S^d$. Equation (2.1) relates the mean squared error of $\hat p$ to the mean absolute Euclidean

    distance among the N points. In our applications, the N points will be the distinct

    permuted values of a data vector, but the formula holds for an arbitrary set of N

    points.

    The left side of (2.1) is, up to normalization, a mean squared discrepancy over

spherical caps. This average of $(\hat p - p)^2$ includes p-values of all sizes between 0 and 1. It is not then a very good accuracy measure when $\hat p$ turns out to be very small.

    It would be more useful to get such a mean squared error taken over caps of exactly

    the size p̂, and no others.

    Brauchart and Dick (2013) consider quasi-Monte Carlo (QMC) sampling in the

    sphere. They generalize Stolarsky’s discrepancy formula to include a weighting func-

    tion on the height t. By specializing their formula, we get an expression for the mean

of $(\hat p - p)^2$ over spherical caps of any fixed size.

Discrepancy theory plays a prominent role in QMC (Niederreiter, 1992), which

    is about approximating an integral by a sample average. The present setting is in a

    sense the reverse of QMC: the discrete average over permutations is the exact value,

    and the integral over a continuum is the approximation. A second difference is that

    the QMC literature focuses on choosing N points to minimize a criterion such as (2.1),

    whereas here the N points are determined by the problem.

We present several results for the mean of $(\hat p - p)^2$ under different conditions. In addition to fixing the size of the caps we can restrict the mean squared error to only

be over caps centered on points $y$ satisfying $\langle y, x_0\rangle = \langle y_0, x_0\rangle$ where $x_0$ is the original (unpermuted) $x$ vector and $y_0$ is the observed $y$ value. We can obtain this result by

    further extending Brauchart and Dick’s generalization of Stolarsky’s invariance. We

    call this the ‘finer approximation’ and will show it has advantages over constraining


    only the height of the caps. More generally, the point xc could be any of the permuted

    x vectors, such as the one that happens to be closest to y0.

    Although we found these results via invariance we can also obtain them via proba-

    bilistic arguments. As a consequence we have a probabilistic derivation of Stolarsky’s

    formula. Some of our results are for arbitrary x, but our best computational for-

    mulas are for the case where the variable x is binary, as it would be in experiments

    comparing treatment and control groups.

    The rest of the chapter is organized as follows. Section 2.2 presents some context

    on permutation tests and gives some results from spherical geometry. In Section 2.3

    we use Stolarsky’s invariance principle as generalized by Brauchart and Dick (2013)

    to obtain the mean squared error between the true p-value and its continuous ap-

    proximation p̂1, taken over all spherical caps of volume p̂1. This section also has a

    probabilistic derivation of that mean squared error. In Section 2.4 we describe some

    finer approximations p̃ for the p-value. These condition on not just the volume of the

    spherical cap but also on its distance from the original data point x0, or from some

    other point, such as the closest permutation of x0 to y0. By always including the

    original point we ensure that p̃ > 1/N . That is a desirable property because the true

    permutation p-value cannot be smaller than 1/N . In Section 2.5 we modify the proof

    in Brauchart and Dick (2013), to further generalize their invariance results to include

    the mean squared error of the finer approximations. Section 2.6 extends our estimates

    to two-sided testing. Section 2.7 illustrates our p-value approximations numerically.

    We see that an RMS error in the finer approximate p-values is of the same order

    of magnitude as those p-values themselves. Section 2.8 makes a numerical compari-

    son to saddlepoint methods. Section 2.9 discusses the results and gives more details

    about the bioinformatics problems that motivate the search for approximations to

    the permutation distribution. Most of the proofs are in the Appendix A.


    2.2 Background and notation

The raw data contain points $(X_i, Y_i)$ for $i = 1, \dots, m$. The $X_i$ are the treatment indicators, and $Y = \sum_{g=1}^{G} Y_g/\mathrm{sd}(Y_g) \in \mathbb{R}^m$ as discussed in Section 1.5. The sample correlation of these points is $\hat\rho = x_0^T y_0$ where $x_0$ has components $(X_i - \bar X)/s_X$ for $\bar X = (1/m)\sum_{i=1}^m X_i$, $s_X^2 = (1/m)\sum_{i=1}^m (X_i - \bar X)^2$, and $\bar Y$ and $s_Y$ are defined similarly. We assume that $s_X$ and $s_Y$ are positive. Both $x_0$ and $y_0$ belong to $S^{m-1}$. Moreover they belong to $\{z \in S^{m-1} \mid z^T 1_m = 0\}$. We can use an orthogonal matrix to rotate the points of this set onto $S^{m-2} \times \{0\}$. As a result, we may simply work with $x_0, y_0 \in S^d$ where $d = m - 2$.

    The quantity ρ̂ measures association between X and Y . It can be used as such

    a measure if the Xi are fixed and Yi observed conditionally, or vice versa, or if the

    (Xi, Yi) pairs are independently sampled from some joint distribution. Let π be

    a permutation of the indices 1, . . . ,m. There are m! vectors xπ that result from

    centering and scaling Xπ = (Xπ(1), Xπ(2), . . . , Xπ(m)). The permutation p-value is

$p = (1/m!)\sum_{\pi} \mathbf{1}(x_\pi^T y_0 > x_0^T y_0)$. The justification for this p-value relies on the group

    structure of permutations (Lehmann and Romano, 2005). For a cautionary tale on

    the use of permutation sets without a group structure, see Southworth et al. (2009).

    For notational simplicity we assume ρ̂ > 0 and work with one-sided p-values. Negative

    ρ̂ can be handled similarly, or simply by switching the group labels. For two-sided

    p-values see Section 2.6.

    Our proposals are computationally most attractive in the case where Xi takes on

    just two values, such as 0 and 1. Then ρ̂ is a two-sample test statistic. If there are

$m_0$ observations with $X_i = 0$ and $m_1$ with $X_i = 1$ then $x_0$ contains $m_0$ components equal to $-\sqrt{m_1/(m m_0)}$ and $m_1$ components equal to $+\sqrt{m_0/(m m_1)}$. Some formulas involve the smaller sample size, $m \equiv \min(m_0, m_1)$.

For this two-sample case there are only $N = \binom{m_0+m_1}{m_0}$ distinct permutations of $x_0$. Calling these $x_0, x_1, \dots, x_{N-1}$ we find that

$$ p = \frac{1}{N} \sum_{k=0}^{N-1} \mathbf{1}(x_k^T y_0 > \hat\rho). \qquad (2.2) $$
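As an illustration of this geometric reformulation, the exact permutation p-value of the linear statistic can be computed by enumerating the distinct permutations of x0 and counting how many fall in the spherical cap determined by y0 and ρ̂. The sketch below uses toy data small enough to enumerate; the sizes and variable names are assumptions, not code from the thesis.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(3)
m0, m1 = 5, 5
m = m0 + m1
X = np.repeat([0.0, 1.0], [m0, m1])
y = rng.standard_normal(m)

# Centered, unit-norm versions of X and y, so correlations become inner products.
x0 = (X - X.mean()) / np.linalg.norm(X - X.mean())
y0 = (y - y.mean()) / np.linalg.norm(y - y.mean())
rho_hat = x0 @ y0

# Every permuted x has m0 entries at -sqrt(m1/(m*m0)) and m1 entries at +sqrt(m0/(m*m1)).
lo, hi = -np.sqrt(m1 / (m * m0)), np.sqrt(m0 / (m * m1))

# Enumerate all N = C(m0+m1, m0) distinct permutations and count those in the cap
# {x : x^T y0 >= rho_hat}; only the identity attains equality for continuous data.
count, N = 0, 0
for ones in combinations(range(m), m1):
    xk = np.full(m, lo)
    xk[list(ones)] = hi
    count += (xk @ y0 >= rho_hat)
    N += 1
print(count / N)        # exact permutation p-value for the linear statistic
```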


    Now suppose that there are exactly r indices for which xk is positive and xl is

    negative. There are then r indices with the reverse pattern too. We say that xk and

    xl are at ‘swap distance r’. In that case we easily find that

$$ u(r) \equiv x_k^T x_l = 1 - r\Bigl(\frac{1}{m_0} + \frac{1}{m_1}\Bigr). \qquad (2.3) $$

    We need some geometric properties of the unit sphere and spherical caps. The

surface volume of $S^d$ is $\omega_d = 2\pi^{(d+1)/2}/\Gamma((d+1)/2)$. We use $\sigma_d$ for the volume element in $S^d$ normalized so that $\sigma_d(S^d) = 1$. The spherical cap $C(y; t) = \{z \in S^d \mid z^T y > t\}$ has volume

$$ \sigma_d(C(y; t)) = \begin{cases} \tfrac12 I_{1-t^2}\bigl(\tfrac d2, \tfrac12\bigr), & 0 \le t \le 1, \\[4pt] 1 - \tfrac12 I_{1-t^2}\bigl(\tfrac d2, \tfrac12\bigr), & -1 \le t < 0, \end{cases} $$

where $I_t(a, b)$ is the incomplete beta function

$$ I_t(a, b) = \frac{1}{B(a, b)} \int_0^t x^{a-1}(1-x)^{b-1}\,dx $$

with $B(a, b) = \int_0^1 x^{a-1}(1-x)^{b-1}\,dx$. Obviously, this volume is 0 if $t < -1$ and it is 1 if $t > 1$. This volume is independent of $y$ so we may write $\sigma_d(C(\cdot, t))$ for the volume. By symmetry, $\mathbf{1}(x \in C(y, t)) = \mathbf{1}(y \in C(x, t))$.

    Our first approximation of the p-value is

    p̂1(ρ̂) = σd(C(y; ρ̂)). (2.4)

    This approximation has two intuitive explanations. First, the true p-value is the

    proportion of permutations of x0 that lie in C(y0; ρ̂), and σd(C(y0, ρ̂)) is the pro-

portion of the volume of $S^d$ in that set. Second, as we show in Proposition 2, $\hat p_1 = E(p \mid \langle x_0, y\rangle = \hat\rho)$ for $y \sim U(S^d)$, as $y_0$ would be if the original $Y_i$ were IID Gaussian. In Theorem 4, we find $\mathrm{Var}(\hat p_1)$ under this assumption.
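The cap volume, and hence p̂1, is easy to evaluate with the regularized incomplete beta function. The following sketch is an illustrative assumption using scipy, not code from the thesis; it implements σd(C(·; t)) and checks it against a crude Monte Carlo estimate on the sphere.

```python
import numpy as np
from scipy.special import betainc

def cap_volume(t, d):
    """sigma_d(C(.; t)) for a spherical cap of height t on S^d, via the regularized incomplete beta."""
    if t >= 1.0:
        return 0.0
    if t <= -1.0:
        return 1.0
    v = 0.5 * betainc(d / 2, 0.5, 1 - t * t)
    return v if t >= 0 else 1.0 - v

# p_hat_1(rho_hat) is cap_volume(rho_hat, d) with d = m - 2.
# Quick Monte Carlo check of the formula on S^8:
rng = np.random.default_rng(4)
d, t = 8, 0.6
y = np.zeros(d + 1)
y[0] = 1.0
Z = rng.standard_normal((200_000, d + 1))
Z /= np.linalg.norm(Z, axis=1, keepdims=True)
print(cap_volume(t, d), float(np.mean(Z @ y >= t)))   # the two numbers should roughly agree
```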

We frequently need to project $y \in S^d$ onto a point $x \in S^d$. In this representation $y = tx + \sqrt{1-t^2}\,y^*$ where $t = y^T x \in [-1, 1]$ and $y^* \in \{z \in S^d \mid z^T x = 0\}$, which is isomorphic to $S^{d-1}$. The coordinates $t$ and $y^*$ are unique. From equation (A.1) in Brauchart and Dick (2013) we get

$$ d\sigma_d(y) = \frac{\omega_{d-1}}{\omega_d} (1 - t^2)^{d/2 - 1} \, dt \, d\sigma_{d-1}(y^*). \qquad (2.5) $$

    In their case x was (0, 0, . . . , 1).

    The intersection of two spherical caps of common height t is

    C2(x,y; t) ≡ C(x; t) ∩ C(y; t).

    We will need the volume of this intersection. Lee and Kim (2014) give a general solu-

    tion for spherical cap intersections without requiring equal heights. They enumerate

    25 cases, but our case does not correspond to any single such case and so we obtain

    the formula we need directly, below. We suspect it must be known already, but we

    were unable to find it in the literature.

Lemma 1. Let $x, y \in S^d$ and $-1 \le t \le 1$ and put $u = x^T y$. Let $V_2(u; t, d) = \sigma_d(C_2(x, y; t))$. If $u = 1$, then $V_2(u; t, d) = \sigma_d(C(x; t))$. If $-1 < u < 1$, then

$$ V_2(u; t, d) = \frac{\omega_{d-1}}{\omega_d} \int_t^1 (1 - s^2)^{\frac d2 - 1} \sigma_{d-1}\bigl(C(y^*; \rho(s))\bigr) \, ds, \qquad (2.6) $$

where $\rho(s) = (t - su)/\sqrt{(1 - s^2)(1 - u^2)}$. Finally, for $u = -1$,

$$ V_2(u; t, d) = \begin{cases} 0, & t > 0, \\[4pt] \dfrac{\omega_{d-1}}{\omega_d} \displaystyle\int_{-|t|}^{|t|} (1 - s^2)^{\frac d2 - 1} \, ds, & \text{else.} \end{cases} \qquad (2.7) $$

Proof. Let $z \sim U(S^d)$. Then $V_2(u; t, d) = \sigma_d(C_2(x, y; t)) = \Pr(z \in C_2(x, y; t))$. If $u = 1$ then $x = y$ and so $C_2(x, y; t) = C(x; t)$. For $u < 1$, we project $y$ and $z$ onto $x$, via $z = sx + \sqrt{1 - s^2}\,z^*$ and $y = ux + \sqrt{1 - u^2}\,y^*$. Now

$$ \begin{aligned} V_2(u; t, d) &= \int_{S^d} \mathbf{1}(\langle x, z\rangle \ge t)\,\mathbf{1}(\langle y, z\rangle \ge t) \, d\sigma(z) \\ &= \int_{-1}^{1} \mathbf{1}(s > t)\,\frac{\omega_{d-1}}{\omega_d} (1 - s^2)^{\frac d2 - 1} \int_{S^{d-1}} \mathbf{1}\bigl(su + \sqrt{1 - s^2}\sqrt{1 - u^2}\,\langle y^*, z^*\rangle \ge t\bigr) \, d\sigma_{d-1}(z^*) \, ds. \end{aligned} $$

If $u > -1$ then this reduces to (2.6). For $u = -1$ we get

$$ V_2(u; t, d) = \frac{\omega_{d-1}}{\omega_d} \int_{-1}^{1} \mathbf{1}(s > t)\,\mathbf{1}(-s > t)(1 - s^2)^{\frac d2 - 1} \, ds, $$

which reduces to (2.7).
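For small examples, V2(u; t, d) can be evaluated directly from (2.6) and (2.7) by one-dimensional numerical integration, as in the sketch below. This is illustrative only: the helper names, tolerances, and sanity checks are assumptions, not code from the thesis.

```python
import numpy as np
from scipy.special import betainc, gamma
from scipy.integrate import quad

def cap_volume(t, d):
    # sigma_d(C(.; t)): normalized volume of a height-t cap on S^d, via the incomplete beta.
    if t >= 1.0:
        return 0.0
    if t <= -1.0:
        return 1.0
    v = 0.5 * betainc(d / 2, 0.5, 1 - t * t)
    return v if t >= 0 else 1.0 - v

def omega_ratio(d):
    # omega_{d-1} / omega_d with omega_d = 2 pi^{(d+1)/2} / Gamma((d+1)/2).
    return gamma((d + 1) / 2) / (np.sqrt(np.pi) * gamma(d / 2))

def V2(u, t, d):
    """Volume of the intersection of two height-t caps whose centers have inner product u (eqs. (2.6)-(2.7))."""
    if u >= 1.0:
        return cap_volume(t, d)
    if u <= -1.0:
        if t > 0:
            return 0.0
        return omega_ratio(d) * quad(lambda s: (1 - s * s) ** (d / 2 - 1), -abs(t), abs(t))[0]
    def integrand(s):
        rho = (t - s * u) / np.sqrt((1 - s * s) * (1 - u * u))
        return (1 - s * s) ** (d / 2 - 1) * cap_volume(rho, d - 1)
    return omega_ratio(d) * quad(integrand, t, 1)[0]

# Sanity checks: nearly coincident centers give nearly a single cap; farther centers shrink the overlap.
print(V2(0.999, 0.5, 8), cap_volume(0.5, 8))
print(V2(0.8, 0.5, 8), V2(0.2, 0.5, 8))
```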

    When we give probabilistic arguments and interpretations we do so for a random

    center y of a spherical cap. We use Models 1 and 2 below. Model 1 is illustrated in

    Figure 2.1. Model 2 is illustrated in Figure 2.2 of Section 2.4 where we first use it.

Model 1. The vector $y$ is uniformly distributed on the sphere $S^d$. Expectation under this model is denoted $E_1(\cdot)$.

Model 2. The vector $y$ is uniformly distributed on $\{z \in S^d \mid z^T x_c = \tilde\rho\}$, for some $-1 \le \tilde\rho \le 1$ and $c \in \{0, 1, \dots, N-1\}$. Then $y = \tilde\rho\, x_c + \sqrt{1 - \tilde\rho^2}\, y^*$ for $y^*$ uniformly distributed on a subset of $S^d$ isometric to $S^{d-1}$. Expectation under this model is denoted $E_2(\cdot)$.

    2.3 Approximation via spherical cap volume

    Here we study the approximate p-value p̂1(ρ̂) = σd(C(y; ρ̂)). First we find the mean

    squared error of this approximation over all spherical caps of the given volume via

    invariance. Then we give a probabilistic interpretation which includes the conditional

    unbiasedness result in Proposition 2 below. Then we give two computational simpli-

    fications, first for points obtained via permutation, and second for permutations of a

    binary vector. We begin by restating the invariance principle.


Figure 2.1: Illustration for Model 1. The point $y$ is uniformly distributed over $S^d$. The small open circles represent permuted vectors $x_k$. The point $y_0$ is the observed value of $y$. The circle around it goes through $x_0$ and represents a spherical cap of height $y_0^T x_0$. A second spherical cap of equal volume is centered at $y = y_1$. We study moments of $p(y; \hat\rho)$, the fraction of $x_k$ in the cap centered at random $y$.

    Theorem 1. Let x0, . . . ,xN−1 be any points in Sd. Then

$$ \frac{1}{N^2} \sum_{k,l=0}^{N-1} \|x_k - x_l\| + \frac{1}{C_d} \int_{-1}^{1} \int_{S^d} \Bigl| \sigma_d(C(z; t)) - \frac{1}{N} \sum_{k=0}^{N-1} 1_{C(z;t)}(x_k) \Bigr|^2 \, d\sigma_d(z)\, dt = \int_{S^d}\int_{S^d} \|x - y\| \, d\sigma_d(x)\, d\sigma_d(y) $$

where $C_d = \omega_{d-1}/(d\,\omega_d)$.

    Proof. Stolarsky (1973).

    Brauchart and Dick (2013) gave a simple proof of Theorem 1 using reproducing

    kernel Hilbert spaces. They generalized Theorem 1 as follows.


Theorem 2. Let $x_0, \dots, x_{N-1}$ be any points in $S^d$. Let $v : [-1, 1] \to (0, \infty)$ be any function with an antiderivative. Then

$$ \int_{-1}^{1} v(t) \int_{S^d} \Bigl| \sigma_d(C(z; t)) - \frac1N \sum_{k=0}^{N-1} 1_{C(z;t)}(x_k) \Bigr|^2 \, d\sigma_d(z)\, dt = \frac{1}{N^2} \sum_{k,l=0}^{N-1} K_v(x_k, x_l) - \int_{S^d}\int_{S^d} K_v(x, y)\, d\sigma_d(x)\, d\sigma_d(y) \qquad (2.8) $$

where $K_v(x, y)$ is a reproducing kernel function defined by

$$ K_v(x, y) = \int_{-1}^{1} v(t) \int_{S^d} 1_{C(z;t)}(x)\, 1_{C(z;t)}(y) \, d\sigma_d(z)\, dt. \qquad (2.9) $$

Proof. See Theorem 5.1 in Brauchart and Dick (2013).

If we set $v(t) = 1$ and $K(x, y) = 1 - C_d\|x - y\|$, then we recover the original Stolarsky formula. Note that the statement of Theorem 5.1 in Brauchart and Dick (2013) has a sign error in their counterpart to (2.8). The corrected statement (2.8) can be verified by comparing equations (5.3) and (5.4) of Brauchart and Dick (2013).

We would like a version of (2.8) just for one value of $t$ such as $t = \hat\rho = x_0^T y_0$. For $\hat\rho \in [-1, 1)$ and $\epsilon = (\epsilon_1, \epsilon_2) \in (0, 1)^2$, let

$$ v_\epsilon(t) = \epsilon_2 + \frac{1}{\epsilon_1} \mathbf{1}(\hat\rho \le t \le \hat\rho + \epsilon_1). \qquad (2.10) $$

Each $v_\epsilon$ satisfies the conditions of Theorem 2, making (2.8) an identity in $\epsilon$. We let $\epsilon_2 \to 0$ and then $\epsilon_1 \to 0$ on both sides of (2.8) for $v = v_\epsilon$, yielding Theorem 3.

Theorem 3. Let $x_0, x_1, \dots, x_{N-1} \in S^d$ and $t \in [-1, 1]$. Then

$$ \int_{S^d} |p(y, t) - \hat p_1(t)|^2 \, d\sigma_d(y) = \frac{1}{N^2} \sum_{k=0}^{N-1} \sum_{l=0}^{N-1} \sigma_d(C_2(x_k, x_l; t)) - \hat p_1(t)^2. \qquad (2.11) $$

    Proof. See Section A.1 of the Appendix which uses the limit argument described

    above.


    We now give a proposition that holds for all models, including our Model 1 and

    Model 2.

    Proposition 1. For a random point y ∈ Sd,

$$ E(p(y, t)) = \frac1N \sum_{k=0}^{N-1} \Pr(y \in C(x_k; t)), \quad\text{and} \qquad (2.12) $$

$$ E(p(y, t)^2) = \frac{1}{N^2} \sum_{k,l=0}^{N-1} \Pr(y \in C_2(x_k, x_l; t)). \qquad (2.13) $$

    Proposition 1 provides a probabilistic interpretation for equation (2.11). When

$y \sim U(S^d)$, the double sum on the right side of (2.11) is $E(p(y, t)^2)$. Additionally $\hat p_1(t)$ has a probabilistic interpretation under Model 1.

    Proposition 2. For any x0, . . . ,xN−1 ∈ Sd and t ∈ [−1, 1], p̂1(t) = E1(p(y, t)).

Proof. $E_1(p(y; t)) = E_1\bigl[\frac1N \sum_{k=0}^{N-1} 1_{C(y;t)}(x_k)\bigr] = \sigma_d(C(y; t)) = \hat p_1(t)$.

    given by (2.11) with t = ρ̂.

    The right hand side of (2.11) sums O(N2) terms. In a permutation analysis we

    might have N = m! or N =(m0+m1m0

    )for binary Xi, and so the computational cost

    could be high. The symmetry in a permutation set allows us to use

    ∫Sd|p(y, t)− p̂1(t)|2 dσd(y) =

    1

    N

    N−1∑k=0

    σd(C2(x0,xk; t))− p̂1(t)2

    instead. But that costs O(N), the same as the full permutation analysis.

When the X_i are binary, then for fixed t, σ_d(C_2(x_k, x_l; t)) just depends on the swap distance r between x_k and x_l. Then

\[
\int_{S^d} |p(y, t) - \hat p_1(t)|^2\,d\sigma_d(y)
= \frac{1}{N^2}\sum_{r=0}^{m} N_r\, V_2(u(r); t, d) - \hat p_1(t)^2
\tag{2.14}
\]


for V_2(u(r); t, d) given in Lemma 1, where N_r = \sum_{k=0}^{N-1}\sum_{l=0}^{N-1} 1(r_{k,l} = r) counts the pairs (x_k, x_l) at swap distance r.

Theorem 4. Let x_0 ∈ S^d be the centered and scaled vector from an experiment with binary X_i of which m_0 are negative and m_1 are positive. Let x_0, x_1, ..., x_{N-1} be the N = \binom{m_0+m_1}{m_0} distinct permutations of x_0. If y ∼ U(S^d), then for t ∈ [−1, 1], and with u(r) defined in (2.3),

\[
E(p(y; t)) = \sigma_d(C(y_0; t)), \quad\text{and}\quad
\mathrm{Var}(p(y, t)) = \frac{1}{N}\sum_{r=0}^{m}\binom{m_0}{r}\binom{m_1}{r} V_2(u(r); t, d) - \hat p_1(t)^2.
\]

Proof. There are \binom{m_0}{r}\binom{m_1}{r} permuted points x_i at swap distance r from x_0.

    2.4 A finer approximation to the p-value

In the previous section, we studied the distribution of p-values with the spherical cap centers y uniformly distributed on the sphere S^d. In this section, we give a finer approximation to p(y_0, ρ̂) by studying the distribution of the p-values with centers y satisfying the constraint 〈y, x_c〉 = 〈y_0, x_c〉 = ρ̃. The point x_c may be any permutation of x_0. There are two special choices. The first is to choose c = 0 so that x_c = x_0 is the original unpermuted data. The second is to choose x_c to be the closest permutation of x_0 to y_0. That is c = arg max_i 〈y_0, x_i〉. We will give a general formula that works for any choice of x_c and compare the performance of the above two choices.

The rationale for conditioning on all y satisfying 〈y, x_c〉 = ρ̃ is as follows. Since we want the exact p-value centered at y_0 with radius ρ̂, the more targeted the set of p-values we study, the better our approximation should be. When conditioning on 〈y, x_c〉 = ρ̃, we eliminate many irrelevant y. The approximation could be improved by conditioning on even more information, but the cost would go up. If we condition on the order statistics of all inner products 〈y_0, x_i〉, we get back the exact p-value. For an index c ∈ {0, 1, ..., N − 1} we propose finer approximations to the p-value


based on Model 2 from Section 2.2. These are

\[
\tilde p_c = E_2(p(y, \hat\rho)) = E_1\big(p(y, \hat\rho) \mid y^\mathsf{T} x_c = y_0^\mathsf{T} x_c\big).
\tag{2.15}
\]

We are interested in two special cases,

\[
\hat p_2 = \tilde p_0, \quad\text{and}\quad \hat p_3 = \tilde p_c, \quad\text{where } c = \arg\max_{0\le i<N} \langle y_0, x_i\rangle.
\tag{2.16}
\]

Because the exact permutation p-value is always at least 1/N, having p̂_2 > 1/N is a desirable property. Similarly, p̂_3 > 1/N because then x_c is in general an interior point of C(y, ρ̂). We expect that p̂_3 should be more conservative than p̂_2 and we see this numerically in Section 2.7.

From Proposition 1, we can get our estimate p̃_c and its mean squared error by finding single and double inclusion probabilities for y.

To compute p̃_c we need to sum N values Pr(y ∈ C(x_k; t) | y^T x_c = ρ̃), and for p̃_c to be useful we must compute it in o(N) time. The computations are feasible in the binary case, which we now focus on.

Let u_j = x_j^T x_c for j = 1, 2, and let u_3 = x_1^T x_2. Let the projection of y on x_c be y = ρ̃ x_c + √(1 − ρ̃²) y*. Then the single and double point inclusion probabilities under Model 2 are

\[
P_1(u_1, \tilde\rho, \hat\rho) = \int_{S^{d-1}} 1(\langle y, x_1\rangle \ge \hat\rho)\,d\sigma_{d-1}(y^*), \quad\text{and}
\tag{2.17}
\]
\[
P_2(u_1, u_2, u_3, \tilde\rho, \hat\rho) = \int_{S^{d-1}} 1(\langle y, x_1\rangle \ge \hat\rho)\,1(\langle y, x_2\rangle \ge \hat\rho)\,d\sigma_{d-1}(y^*)
\tag{2.18}
\]

where ρ̂ = 〈x_0, y_0〉. If two permutations of x_0 are at swap distance r, then their inner product is u(r) = 1 − r(m_0^{-1} + m_1^{-1}) from equation (2.3).

Figure 2.2: Illustration for Model 2. The original response vector is y_0 with y_0^T x_0 = ρ̂. We consider alternative y uniformly distributed on the surface of C(x_0; ρ̂), with examples y_1 and y_2. Around each such y_j there is a spherical cap of height ρ̂ that just barely includes x_c = x_0. We use p̂_2 = E_2(p(y; ρ̂)) and find an expression for E_2((p̂_2 − p(y; ρ̂))²).

Lemma 2. Let the projection of x_1 onto x_c be x_1 = u_1 x_c + √(1 − u_1²) x_1^*. Then the

single point inclusion probability from (2.17) is

\[
P_1(u_1, \tilde\rho, \hat\rho) =
\begin{cases}
1(\tilde\rho u_1 \ge \hat\rho), & u_1 = \pm 1 \text{ or } \tilde\rho = \pm 1\\
\sigma_{d-1}(C(x_1^*, \rho^*)), & u_1 \in (-1, 1),\ \tilde\rho \in (-1, 1)
\end{cases}
\tag{2.19}
\]

where ρ* = (ρ̂ − ρ̃ u_1)/√((1 − ρ̃²)(1 − u_1²)).

Proof. The projection of y onto x_c is y = ρ̃ x_c + √(1 − ρ̃²) y*. Now

\[
\langle y, x_1\rangle =
\begin{cases}
\tilde\rho u_1, & u_1 = \pm 1 \text{ or } \tilde\rho = \pm 1\\
\tilde\rho u_1 + \sqrt{1-\tilde\rho^2}\,\sqrt{1-u_1^2}\,\langle y^*, x_1^*\rangle, & u_1 \in (-1, 1),\ \tilde\rho \in (-1, 1)
\end{cases}
\]

and the result easily follows.


    We can now give a computable expression for p̃c and hence for p̂2 and p̂3.

Theorem 5. For −1 ≤ ρ̂ ≤ 1 and −1 ≤ ρ̃ ≤ 1,

\[
\tilde p_c = E_2(p(y, \hat\rho)) = \frac{1}{N}\sum_{r=0}^{m}\binom{m_0}{r}\binom{m_1}{r} P_1(u(r), \tilde\rho, \hat\rho)
\tag{2.20}
\]

where u(r) is given in equation (2.3), P_1(u(r), ρ̃, ρ̂) is given in equation (2.19), and ρ̃ = x_c^T y_0.

Proof. There are \binom{m_0}{r}\binom{m_1}{r} permutations of x_0 at swap distance r from x_c.
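A minimal Python sketch of (2.19) and (2.20) follows, reusing cap_volume from the sketch after Theorem 4. Here rho_t = 〈x_c, y_0〉 and rho_h = 〈x_0, y_0〉 are supplied by the user, and the sphere dimension is taken to be d = m_0 + m_1 − 2 for centered, unit-length vectors; that convention is an assumption here, not a statement from the text.

    from math import comb, sqrt
    # cap_volume(t, d) as in the sketch after Theorem 4.

    def P1(u1, rho_t, rho_h, d):
        """Single inclusion probability (2.19) under Model 2."""
        if abs(u1) == 1.0 or abs(rho_t) == 1.0:
            return 1.0 if rho_t * u1 >= rho_h else 0.0
        rho_star = (rho_h - rho_t * u1) / sqrt((1 - rho_t**2) * (1 - u1**2))
        return cap_volume(rho_star, d - 1)           # cap on S^{d-1}

    def p_tilde_c(m0, m1, rho_t, rho_h):
        """Finer approximation (2.20); rho_t = <x_c, y_0>, rho_h = <x_0, y_0>."""
        d = m0 + m1 - 2                              # assumed sphere dimension
        N = comb(m0 + m1, m0)
        total = 0.0
        for r in range(0, min(m0, m1) + 1):          # terms beyond min(m0, m1) vanish
            u_r = 1.0 - r * (1.0 / m0 + 1.0 / m1)    # u(r) from equation (2.3)
            total += comb(m0, r) * comb(m1, r) * P1(u_r, rho_t, rho_h, d)
        return total / N

Calling p_tilde_c(m0, m1, rho_hat, rho_hat) gives p̂_2, which conditions on x_c = x_0; for p̂_3, rho_t is instead the largest inner product max_i 〈y_0, x_i〉.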

    From (2.20) we see that p̃c can be computed in O(m) work. The mean squared

    error for p̃c is more complicated and will be more expensive. We need the double

    point inclusion probabilities and then we need to count the number of pairs xk,xl

    forming a given set of swap distances among xk,xl,xc.

Lemma 3. For j = 1, 2, let x_j be at swap distance r_j from x_c and let r_3 be the swap distance between x_1 and x_2. Let u_1, u_2, u_3 be the corresponding inner products given by (2.3). If there are equalities among x_1, x_2 and x_c, then the double point inclusion probability from (2.18) is

\[
P_2(u_1, u_2, u_3, \tilde\rho, \hat\rho) =
\begin{cases}
1(\tilde\rho \ge \hat\rho), & x_1 = x_2 = x_c\\
1(\tilde\rho \ge \hat\rho)\,P_1(u_2, \tilde\rho, \hat\rho), & x_1 = x_c \ne x_2\\
1(\tilde\rho \ge \hat\rho)\,P_1(u_1, \tilde\rho, \hat\rho), & x_2 = x_c \ne x_1\\
P_1(u_2, \tilde\rho, \hat\rho), & x_1 = x_2 \ne x_c.
\end{cases}
\]

If x_1, x_2 and x_c are three distinct points with min(u_1, u_2) = −1, then

\[
P_2(u_1, u_2, u_3, \tilde\rho, \hat\rho) =
\begin{cases}
1(-\tilde\rho \ge \hat\rho)\,P_1(u_2, \tilde\rho, \hat\rho), & u_1 = -1\\
1(-\tilde\rho \ge \hat\rho)\,P_1(u_1, \tilde\rho, \hat\rho), & u_2 = -1.
\end{cases}
\]


Otherwise −1 < u_1, u_2 < 1, and then

\[
P_2(u_1, u_2, u_3, \tilde\rho, \hat\rho) =
\begin{cases}
1(\tilde\rho u_1 \ge \hat\rho)\,1(\tilde\rho u_2 \ge \hat\rho), & \tilde\rho = \pm 1\\[4pt]
\displaystyle\int_{-1}^{1} \frac{\omega_{d-2}}{\omega_{d-1}} (1-t^2)^{\frac{d-1}{2}-1}\,1(t \ge \rho_1)\,1(t u_3^* \ge \rho_2)\,dt, & \tilde\rho \ne \pm 1,\ u_3^* = \pm 1\\[4pt]
\displaystyle\int_{-1}^{1} \frac{\omega_{d-2}}{\omega_{d-1}} (1-t^2)^{\frac{d-1}{2}-1}\,1(t \ge \rho_1)\,\sigma_{d-2}\Big(C\Big(x_2^{**},\ \tfrac{\rho_2 - t u_3^*}{\sqrt{1-t^2}\sqrt{1-u_3^{*2}}}\Big)\Big)\,dt, & \tilde\rho \ne \pm 1,\ |u_3^*| < 1
\end{cases}
\]

where

\[
u_3^* = \frac{u_3 - u_1 u_2}{\sqrt{1-u_1^2}\,\sqrt{1-u_2^2}}
\quad\text{and}\quad
\rho_j = \frac{\hat\rho - \tilde\rho u_j}{\sqrt{1-\tilde\rho^2}\,\sqrt{1-u_j^2}}, \quad j = 1, 2,
\tag{2.21}
\]

and x_2^{**} is the residual from the projection of x_2^* on x_1^*.

    Proof. See Section A.2.
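For the generic third branch above (three distinct points with |u_1|, |u_2| < 1, ρ̃ ≠ ±1 and |u_3^*| < 1), the one-dimensional integral can be done by numerical quadrature. The sketch below is illustrative only and again reuses cap_volume from the earlier sketch; it is not the implementation behind the numerical results reported later.

    from math import sqrt
    from scipy.integrate import quad
    from scipy.special import beta

    def P2_generic(u1, u2, u3, rho_t, rho_h, d):
        """Double inclusion probability (2.18), generic case of Lemma 3."""
        u3s = (u3 - u1 * u2) / (sqrt(1 - u1**2) * sqrt(1 - u2**2))
        rho1 = (rho_h - rho_t * u1) / (sqrt(1 - rho_t**2) * sqrt(1 - u1**2))
        rho2 = (rho_h - rho_t * u2) / (sqrt(1 - rho_t**2) * sqrt(1 - u2**2))
        w = 1.0 / beta((d - 1) / 2.0, 0.5)        # omega_{d-2} / omega_{d-1}

        def integrand(t):
            s = 1.0 - t * t
            if s <= 0.0:
                return 0.0
            inner = (rho2 - t * u3s) / (sqrt(s) * sqrt(1.0 - u3s**2))
            return w * s ** ((d - 3) / 2.0) * cap_volume(inner, d - 2)

        lo = max(rho1, -1.0)
        if lo >= 1.0:
            return 0.0
        value, _ = quad(integrand, lo, 1.0)
        return value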

Next we consider the swap configuration among x_1, x_2 and x_c. Let x_j be at swap distance r_j from x_c, for j = 1, 2. We let δ_1 be the number of positive components of x_c that are negative in both x_1 and x_2. Similarly, δ_2 is the number of negative components of x_c that are positive in both x_1 and x_2. See Figure 2.3. The swap distance between x_1 and x_2 is then r_3 = r_1 + r_2 − δ_1 − δ_2. Let r = (r_1, r_2) and δ = (δ_1, δ_2). We will study values of r_1, r_2, r_3, δ_1, δ_2 ranging over the following sets:

\[
\begin{aligned}
r_1, r_2 &\in R = \{1, \ldots, m\}\\
\delta_1 &\in D_1(r) = \{\max(0, r_1 + r_2 - m_0), \ldots, \min(r_1, r_2)\}\\
\delta_2 &\in D_2(r) = \{\max(0, r_1 + r_2 - m_1), \ldots, \min(r_1, r_2)\}, \quad\text{and}\\
r_3 &\in R_3(r) = \{\max(1, r_1 + r_2 - 2\min(r_1, r_2)), \ldots, \min(r_1 + r_2,\ m,\ m_0 + m_1 - r_1 - r_2)\}.
\end{aligned}
\]

Whenever the lower bound for one of these sets exceeds the upper bound, we take the set to be empty, and a sum over it to be zero. Note that while r_1 = 0 is possible, it corresponds to x_1 = x_c and we will handle that case specially, excluding it from R.


Figure 2.3: Illustration of r_1, r_2, δ_1 and δ_2. The points x_c, x_1 and x_2 each have m_0 negative and m_1 positive components. For j = 1, 2 the swap distance between x_j and x_c is r_j. There are δ_1 positive components of x_c where both x_1 and x_2 are negative, and δ_2 negative components of x_c where both x_j are positive.

The number of pairs (x_l, x_k) with a fixed r and δ is

\[
c(r, \delta) = \binom{m_0}{\delta_1}\binom{m_1}{\delta_2}\binom{m_0-\delta_1}{r_1-\delta_1}\binom{m_1-\delta_2}{r_1-\delta_2}\binom{m_0-r_1}{r_2-\delta_1}\binom{m_1-r_1}{r_2-\delta_2}.
\tag{2.22}
\]

Then the number of configurations given r_1, r_2 and r_3 is

\[
c(r_1, r_2, r_3) = \sum_{\delta_1\in D_1}\sum_{\delta_2\in D_2} c(r, \delta)\, 1(r_3 = r_1 + r_2 - \delta_1 - \delta_2).
\tag{2.23}
\]

We can now get an expression for the mean square under Model 2 which, combined with Theorem 5 for the mean, provides an expression for the mean squared error of p̃_c.


Theorem 6. For −1 ≤ ρ̂ ≤ 1 and −1 ≤ ρ̃ ≤ 1,

\[
\begin{aligned}
E_2(p(y, \hat\rho)^2) = \frac{1}{N^2}\bigg[\, & 1(\tilde\rho \ge \hat\rho)
+ 2\sum_{r=1}^{m}\binom{m_0}{r}\binom{m_1}{r} P_2(1, u(r), u(r), \tilde\rho, \hat\rho)\\
& + \sum_{r=1}^{m}\binom{m_0}{r}\binom{m_1}{r} P_1(u(r), \tilde\rho, \hat\rho)
+ \sum_{r_1\in R}\sum_{r_2\in R}\sum_{r_3\in R_3(r)} c(r_1, r_2, r_3)\, P_2(u_1, u_2, u_3, \tilde\rho, \hat\rho)\,\bigg]
\end{aligned}
\tag{2.24}
\]

where P_2(·) is the double inclusion probability in (2.18) and c(r_1, r_2, r_3) is the configuration count in (2.23).

    Proof. See Section A.3 of the Appendix.

In our experience, the cost of computing E_2(p(y, ρ̂)²) under Model 2 is dominated by the cost of the O(m³) integrals required to get the P_2(·) values in (2.24). The cost also includes an O(m⁴) component because c(r_1, r_2, r_3) is also a sum of O(m) terms, but it did not dominate the computation at the sample sizes we looked at (up to several hundred).
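Assembling the pieces, the following sketch evaluates (2.24) with the helper functions P1, P2_generic and c_r123 from the earlier sketches (all assumed names, not from the dissertation). The equality cases of Lemma 3 with x_1 = x_c or x_2 = x_c are handled directly; configurations where u_1, u_2 or u_3^* equal ±1 would need the remaining branches of Lemma 3 and are not treated, so this is a simplification rather than a full implementation.

    from math import comb

    def E2_second_moment(m0, m1, rho_t, rho_h):
        """Sketch of the Model 2 second moment (2.24)."""
        d = m0 + m1 - 2                 # assumed sphere dimension
        mmin = min(m0, m1)              # larger swap distances contribute nothing
        N = comb(m0 + m1, m0)

        def u_of(r):                    # inner product u(r), equation (2.3)
            return 1.0 - r * (1.0 / m0 + 1.0 / m1)

        ind = 1.0 if rho_t >= rho_h else 0.0
        total = ind                     # x_1 = x_2 = x_c term
        for r in range(1, mmin + 1):
            nr = comb(m0, r) * comb(m1, r)
            # x_1 = x_c != x_2 and its mirror image: Lemma 3 gives 1(rho_t >= rho_h) P_1
            total += 2.0 * nr * ind * P1(u_of(r), rho_t, rho_h, d)
            # x_1 = x_2 != x_c
            total += nr * P1(u_of(r), rho_t, rho_h, d)
        for r1 in range(1, mmin + 1):
            for r2 in range(1, mmin + 1):
                for r3 in range(max(1, abs(r1 - r2)), min(r1 + r2, mmin) + 1):
                    cnt = c_r123(m0, m1, r1, r2, r3)
                    if cnt:
                        total += cnt * P2_generic(u_of(r1), u_of(r2), u_of(r3),
                                                  rho_t, rho_h, d)
        return total / N**2

The triple loop makes O(m³) calls to the quadrature routine, matching the cost profile described above.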

    2.5 Generalized Stolarsky Invariance

    Here we obtain the Model 2 results in a different way, by extending the work by

    Brauchart and Dick (2013). They introduced a weight on the height t of the spherical

cap in the average. We now apply a weight function to the inner product 〈z, x_c〉 between the center z of the spherical cap and a special point x_c.

Theorem 7. Let x_0, ..., x_{N-1} be arbitrary points in S^d and let v(·) and h(·) be positive functions in L²([−1, 1]). Then for any x′ ∈ S^d, the following equation holds:

\[
\begin{aligned}
\int_{-1}^{1} v(t)\int_{S^d} h(\langle z, x'\rangle)\,\bigg|\sigma_d(C(z;t)) - \frac{1}{N}\sum_{k=0}^{N-1} 1_{C(z;t)}(x_k)\bigg|^2\,d\sigma_d(z)\,dt
= {} & \frac{1}{N^2}\sum_{k,l=0}^{N-1} K_{v,h,x'}(x_k, x_l)
+ \int_{S^d}\int_{S^d} K_{v,h,x'}(x, y)\,d\sigma_d(x)\,d\sigma_d(y)\\
& - \frac{2}{N}\sum_{k=0}^{N-1}\int_{S^d} K_{v,h,x'}(x, x_k)\,d\sigma_d(x)
\end{aligned}
\tag{2.25}
\]

where K_{v,h,x'} : S^d × S^d → \mathbb{R} is a reproducing kernel defined by

\[
K_{v,h,x'}(x, y) = \int_{-1}^{1} v(t)\int_{S^d} h(\langle z, x'\rangle)\, 1_{C(z;t)}(x)\, 1_{C(z;t)}(y)\,d\sigma_d(z)\,dt.
\tag{2.26}
\]

    Proof. See Section A.4 of the Appendix.

    Remark. We will use this result for x′ = xc, where xc is one of the N given points.

The theorem holds for general x′ ∈ S^d, but the result is computationally and statistically more attractive when x′ = x_c.
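As a quick consistency check, added here for orientation: taking h ≡ 1 in (2.26) gives K_{v,1,x′} = K_v, and

\[
\int_{S^d} K_v(x, x_k)\,d\sigma_d(x) = \int_{-1}^{1} v(t)\,\sigma_d(C(\cdot\,; t))^2\,dt = \int_{S^d}\int_{S^d} K_v(x, y)\,d\sigma_d(x)\,d\sigma_d(y)
\]

for every x_k, so the last two terms on the right of (2.25) combine to −∫∫ K_v dσ_d dσ_d and (2.25) reduces to (2.8).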

We now show that the second moment in Theorem 6 holds as a special limiting case of Theorem 7. In addition to v_ε from Section 2.3 we introduce η = (η_1, η_2) ∈ (0, 1)² and

\[
h_\eta(s) = \eta_2 + \frac{1}{\eta_1\,\frac{\omega_{d-1}}{\omega_d}\,(1-s^2)^{d/2-1}}\, 1(\tilde\rho \le s \le \tilde\rho + \eta_1).
\tag{2.27}
\]

    Using these results we can now establish the following theorem, which provides

    the second moment of p(y, ρ̂) under Model 2.

Theorem 8. Let x_0 ∈ S^d be the centered and scaled vector from an experiment with binary X_i of which m_0 are negative and m_1 are positive. Let x_0, x_1, ..., x_{N-1} be the N = \binom{m_0+m_1}{m_0} distinct permutations of x_0. Let x_c be one of the x_k and let p̃_c be given by (2.15). Then

\[
E_2(p(y, \hat\rho)^2) = \frac{1}{N^2}\sum_{k,l=0}^{N-1}\int_{S^{d-1}} 1(\langle y, x_k\rangle \ge \hat\rho)\, 1(\langle y, x_l\rangle \ge \hat\rho)\,d\sigma_{d-1}(y^*)
\]

where y = ρ̃ x_c + √(1 − ρ̃²) y*.

    Proof. The proof uses Theorem 7 with a sequence of h defined in (2.27) and v defined

    in (2.10). See Section A.5 of the appendix.

    This result shows that we can use the invariance principle to derive the second

    moment of p(y, ρ̂) under Model 2. The mean square in Theorem 8 is consistent with

    the second moment equation (2.13) in Proposition 1.

2.6 Two-sided p-values

    In statistical applications it is more usual to report two-sided p-values. A conservative

    approach is to use 2 min(p, 1− p) where p is a one-sided p-value. A sharper choice is

\[
p = \frac{1}{N}\sum_{k=0}^{N-1} 1(|x_k^\mathsf{T} y_0| \ge |\hat\rho|).
\tag{2.28}
\]

    This choice changes our Model 2 estimate. It also changes the second moment of our

    Model 1 estimate.

The two-sided version of the estimate p̂_1(ρ̂) is 2σ_d(C(y; |ρ̂|)), the same as if we had doubled a one-tailed estimate. Also E_1(p) = p̂_1 in the two-tailed case. We now consider the mean square for the two-tailed estimate under Model 1. For x_1, x_2 ∈ S^d with u = x_1^T x_2, the two-tailed double inclusion probability under Model 1 is

\[
\tilde V_2(u; t, d) = \int_{S^d} 1(|z^\mathsf{T} x_1| \ge |t|)\, 1(|z^\mathsf{T} x_2| \ge |t|)\,d\sigma_d(z).
\]

Writing 1(|z^T x_i| ≥ |t|) = 1(z^T x_i ≥ |t|) + 1(z^T(−x_i) ≥ |t|) for i = 1, 2 and expanding the product, we get

\[
\tilde V_2(u; t, d) = 2V_2(u; |t|, d) + 2V_2(-u; |t|, d).
\]

By replacing V_2(u; t, d) with Ṽ_2(u; t, d) and p̂_1(t) with 2σ_d(C(y; |t|)) in Theorem 4, we get the variance of two-sided p-values under Model 1.

To obtain corresponding formulas under Model 2, we use the usual notation. Let u_j = x_j^T x_c for j = 1, 2, and let u_3 = x_1^T x_2. Let the projection of y on x_c be y = ρ̃ x_c + √(1 − ρ̃²) y*. Now

\[
\tilde P_1(u_1, \tilde\rho, \hat\rho) = \int_{S^{d-1}} 1(|\langle y, x_1\rangle| \ge |\hat\rho|)\,d\sigma_{d-1}(y^*), \quad\text{and}
\tag{2.29}
\]
\[
\tilde P_2(u_1, u_2, u_3, \tilde\rho, \hat\rho) = \int_{S^{d-1}} 1(|\langle y, x_1\rangle| \ge |\hat\rho|)\, 1(|\langle y, x_2\rangle| \ge |\hat\rho|)\,d\sigma_{d-1}(y^*)
\tag{2.30}
\]

are the appropriate single and double inclusion probabilities.

After writing 1(|〈y, x_i〉| ≥ |ρ̂|) = 1(〈y, x_i〉 ≥ |ρ̂|) + 1(〈y, −x_i〉 ≥ |ρ̂|) for i = 1, 2 and expanding the product, we get

\[
\begin{aligned}
\tilde P_1(u_1, \tilde\rho, \hat\rho) &= P_1(u_1, \tilde\rho, |\hat\rho|) + P_1(-u_1, \tilde\rho, |\hat\rho|), \quad\text{and}\\
\tilde P_2(u_1, u_2, u_3, \tilde\rho, \hat\rho) &= P_2(u_1, u_2, u_3, \tilde\rho, |\hat\rho|) + P_2(-u_1, u_2, -u_3, \tilde\rho, |\hat\rho|)\\
&\quad + P_2(u_1, -u_2, -u_3, \tilde\rho, |\hat\rho|) + P_2(-u_1, -u_2, u_3, \tilde\rho, |\hat\rho|).
\end{aligned}
\]

Changing P_1(u_1, ρ̃, ρ̂) and P_2(u_1, u_2, u_3, ρ̃, ρ̂) to P̃_1(u_1, ρ̃, ρ̂) and P̃_2(u_1, u_2, u_3, ρ̃, ρ̂) respectively in Theorems 5 and 6, we get the first and second moments for two-sided p-values under Model 2.
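A minimal sketch of the two-sided single inclusion probability above and of the resulting two-sided version of (2.20), reusing P1 from the sketch after Theorem 5 and the assumed convention d = m_0 + m_1 − 2:

    from math import comb

    def P1_two_sided(u1, rho_t, rho_h, d):
        """Two-sided single inclusion probability, from the display above."""
        return P1(u1, rho_t, abs(rho_h), d) + P1(-u1, rho_t, abs(rho_h), d)

    def p_tilde_c_two_sided(m0, m1, rho_t, rho_h):
        """Theorem 5 with P_1 replaced by its two-sided version."""
        d = m0 + m1 - 2
        N = comb(m0 + m1, m0)
        total = 0.0
        for r in range(0, min(m0, m1) + 1):
            u_r = 1.0 - r * (1.0 / m0 + 1.0 / m1)
            total += comb(m0, r) * comb(m1, r) * P1_two_sided(u_r, rho_t, rho_h, d)
        return total / N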

For a two-sided p-value, p̂_3 is calculated with x_c̃ where c̃ = arg max_i |〈y_0, x_i〉|. For m_0 = m_1, c̃ = c = arg max_i 〈y_0, x_i〉, but the result may differ significantly for unequal sample sizes.


    2.7 Numerical Results

We consider two-sided p-values in this section. First we evaluate the accuracy of p̂_1, the simple spherical cap volume approximate p-value. We considered m_0 = m_1 in a range of values from 5 to 200. The values p̂_1 ranged from just below 1 to 2 × 10^{-30}. We judge the accuracy of this estimate by its root mean squared error. Under Model 1 this is (E(p̂_1(ρ) − p(y, ρ))²)^{1/2} for y ∼ U(S^d). Figure 2.4a shows this RMSE decreasing towards 0 as p̂_1 goes to 0 with ρ going to 1. The RMSE also decreases with increasing sample size, as we would expect from the central limit theorem.

    As seen in Figures 2.4a and 2.4b, the RMSE is not monotone in p̂1. Right at

    p̂1 = 1 we know that RMSE = 0 and around 0.1 there is a dip. The practically

    interesting values of p̂1 are much smaller than 0.1, and the RMSE is monotone for

    them.

A problem with p̂_1 is that it can approach 0 even though p > 1/N. The Model 1 RMSE does not reflect this problem. By studying E_2((p̂_1(ρ) − p(y, ρ))²)^{1/2}, we get a different result. In Figure 2.4c, the RMSE of p̂_1 under Model 2 reaches a plateau as p̂_1 goes to 0. The Model 2 RMSE reveals the flaw in p̂_1 going below 1/N.

    The estimator p̂2 = p̃0 performs better than p̂1 because it makes more use of the

    data, and it is never below 1/N . As seen in Figure 2.4d, the RMSE of p̂2 very closely

matches p̂_2 itself as p̂_2 decreases to zero. That is, the relative error |p̂_2 − p|/p̂_2 is well behaved for small p-values. Also as p̂_2 drops to the granularity limit 1/N, its RMSE drops to 0.

The estimators p̂_1 and p̂_2 do not differ much for larger p-values, as seen in Figure 2.5a. But in the limit as ρ̂ → 1 we see that p̂_1 → 0, while p̂_2 approaches the granularity limit 1/N instead.

Figure 2.5b compares the RMSE of the two estimators under Model 2. As expected, p̂_2 is more accurate. It also shows that the biggest differences occur only

    when p̂1 goes below 1/N .

To examine the behavior of p̂_2 more closely, we plot its coefficient of variation in Figure 2.6. We see that the relative uncertainty in p̂_2 is not extremely large. Even when the estimated p-values are as small as 10^{-30}, the coefficient of variation is below 5.

    In Section 2.4, we mentioned another choice for xc. It was p̂3 = p̃c, where xc is the

    closest permutation of x0 to y0. We compare p̂3 to p̂2 in Figures 2.7 and 2.8. We fixed

the observed x_0 and ρ, and then randomly sampled 100 vectors y_0 with 〈y_0, x_0〉 = ρ. All 100 of the y_0 lead to the same value for p̂_2 and its standard deviation. We get 100

    different estimates for p̂3 and its standard deviation. We varied m0 and m1, choosing

    ρ so that the values of p̂2 are comparable at different sample sizes. Figure 2.7 shows

the estimates p̂_3 with reference points for p̂_2. As expected, p̂_3 tends to be larger than

    p̂2. Figure 2.8 shows the sample RMSEs for p̂3 with reference points for the RMSE

    for p̂2. The top row of plots has m0 = m1 while the bottom row has m1 = 2m0. The

left column of plots is at larger p-values than the rightmost column. We see that

    neither choice always has the smaller RMSE, but p̂2 is usually more accurate.

    2.8 Comparison to saddlepoint approximation

    Many approximation methods have been proposed for permutation tests. Zhou et al.

    (2009) fit approximations by moments in the Pearson family. Larson and Owen (2015)

    fit Gaussian and beta approximations to linear statistics and gamma approximations

    to quadratic statistics for gene set testing problems. Knijnenburg et al. (2009) fit

    generalized extreme value distributions to the tails of sampled permutation values.

These approximations do not come with an all-inclusive p-value that accounts for both numerical and sampling uncertainty. The sampling method does come with such a p-value if we add one to the numerator and denominator as Barnard (1963) suggests. But that method cannot attain very small p-values. Reasonable power to attain p ≤ ε requires a sample of somewhere between 3/ε and 19/ε random permutations (Larson

    and Owen, 2015).
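For contrast, here is a minimal sketch of the plain Monte Carlo permutation p-value with the add-one adjustment of Barnard (1963) for a two-sided linear statistic; x, y and the number of permutations B are placeholders. The smallest attainable value is 1/(B + 1), which is why p-values near 10^{-6} or below require millions of permutations.

    import numpy as np

    def barnard_pvalue(x, y, B=9999, rng=None):
        """Add-one Monte Carlo permutation p-value for the two-sided
        linear statistic |x^T y| under random relabelling of x."""
        rng = np.random.default_rng(rng)
        t_obs = abs(x @ y)
        exceed = sum(abs(rng.permutation(x) @ y) >= t_obs for _ in range(B))
        return (exceed + 1) / (B + 1)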

The strongest theoretical support for approximate p-values comes from saddlepoint approximations. Reid (1988) surveys saddlepoint approximations and Robinson (1982) develops them for permutation tests of the linear statistics we have considered here. When the true p-value is p, the saddlepoint approximation p̂_s satisfies

    p̂s = p(1+O(1/n)). Because we do not know the implied constant in O(1/n) or the n
