
EFFICIENT PERMUTATION P-VALUE ESTIMATION FOR

    GENE SET TESTS

    A DISSERTATION

    SUBMITTED TO THE DEPARTMENT OF STATISTICS

    AND THE COMMITTEE ON GRADUATE STUDIES

    OF STANFORD UNIVERSITY

    IN PARTIAL FULFILLMENT OF THE REQUIREMENTS

    FOR THE DEGREE OF

    DOCTOR OF PHILOSOPHY

    Yu He

    July 2016

© Copyright by Yu He 2016. All Rights Reserved.


  • Yu He

    I certify that I have read this dissertation and that, in my opinion, it

    is fully adequate in scope and quality as a dissertation for the degree

    of Doctor of Philosophy.

    (Art B. Owen) Principal Adviser

    I certify that I have read this dissertation and that, in my opinion, it

    is fully adequate in scope and quality as a dissertation for the degree

    of Doctor of Philosophy.

    (Trevor Hastie)

    I certify that I have read this dissertation and that, in my opinion, it

    is fully adequate in scope and quality as a dissertation for the degree

    of Doctor of Philosophy.

    (Wing H. Wong)

    Approved for the Stanford University Committee on Graduate Studies

  • Abstract

    In a genome-wide expression study, gene set testing is often used to find potential

gene sets that correlate with a treatment (disease, drug, phenotype, etc.). A gene

set may contain tens to thousands of genes, and genes within a gene set are generally

    correlated. Permutation tests are standard approaches of getting p-values for these

    gene set tests. Plain Monte Carlo methods that generate random permutations can be

computationally infeasible for small p-values. Ackermann and Strimmer (2009) find

two families of test statistics that achieve the overall best performance - a linear family

and a quadratic family. This dissertation first reviews the relevant background of gene

    set testing and permutation tests, and then provides three alternative approaches to

    estimate small permutation p-values efficiently.

The first approach focuses on the linear statistic. Observing that the p-value can

be written as the proportion of points lying in a spherical cap, the p-value is ap-

    proximated by the volume of a spherical cap. Error estimates can be derived from

generalized Stolarsky's invariance principle, and alternative probabilistic proofs are

    provided.

    The second approach focuses on the quadratic statistic. Importance sampling is

    used to estimate the area of the (continuous) significant region on the sphere, and

    the volume of the region is used as an approximation for the (discrete proportion)

    p-value. Different proposal distributions are studied and compared.

    The third approach estimates the p-value with nested sampling. It may work for

    both the linear and the quadratic statistic. Similar ideas can be found in literature

    spanning from combinatorics, sequential Monte Carlo, Bayesian computation, rare

event estimation, network reliability etc., and bear different names, e.g. approximate


  • counting, nested sampling, subset simulation, multilevel splitting etc. We give a

    thorough review of literature in these different areas, and apply the technique to the

    gene set testing with the quadratic test statistic.

    Finally, we compare the proposed methods with plain Monte Carlo and saddle-

    point approximation on three expression studies in Parkinson’s Disease patients.

    This work was supported by the US National Science Foundation under grant

    DMS-1521145.


  • Acknowledgement

    It is my pleasure to thank the many people who made this thesis possible.

    First and foremost, I owe a debt of gratitude to my advisor Professor Art Owen.

    With a contagious passion for statistics, a dedicated pursuit for high quality research

    and diligence that brings overflowing ideas, Art has set a great example for me. Art

    has also guided me through the research journey with encouragement, great patience,

    sound advice and timely help. His guidance has been the compass in the wilderness,

    without which I would have been lost. Although Art is often found juggling many

    meetings, lectures, office hours and emails, he has always kept his door open and

    welcomed a discussion anytime. I am deeply grateful for all the time he has spent

    with me.

    I would like to thank Professor Wing Wong and Professor Trevor Hastie for reading

    my thesis and providing insightful feedback. I also thank Professor Lester Mackey

    for serving on my dissertation committee and Professor Hua Tang for chairing my

    dissertation committee.

    I am very fortunate to have met many great math and statistics teachers. I would

    like to thank my high school math teacher Songbin Lan, my undergraduate teach-

    ers at Nanjing University and University of Toronto (especially Andrey Feuerverger

    and Larry Guth), my graduate teachers at Stanford (especially Trevor Hastie, Tze

    Leung Lai and David Siegmund). I thank them for their constant inspiration and

    encouragement.

    I thank Qingyuan Zhao, Murat A. Erdogdu, Anand Rajaraman and Jure Leskovec

    for the collaboration on a data mining paper that complements my thesis work. I

    would also like to express my gratitude to Kinjal Basu and Qingyuan Zhao for the


  • collaboration on the work that appears in Chapter 2. The time we spent together,

    struggling or cheering, is among the fondest memories of my time at Stanford.

    Many wonderful friends have enriched and added color to my life at Stanford.

    Thank you to everyone in the Owen research group and all other students in the

    department for providing a friendly, supportive and intellectually stimulating envi-

    ronment. Special thanks to my roommate Jingshu, my officemates Pragya and Xiaoy-

    ing for their constant companionship, to the 206 family Qingyuan, Bhaswar, Murat,

    Joshua and Pooja for their incredible friendship that can instantly sweep away any

    anxiety or stress, and to many other friends who made me feel at home in the depart-

    ment. I am especially grateful to my seniors Su Chen, Jeremy Shen, Pei He, Yi Liu

    and Ya Xu for their encouragement and immense help in my career development.

    The luckiest thing that has ever happened to me is to be born into my family. It

    is my father who first motivated my love for math, and it is my mother’s resilience

    and optimism in the face of adversity that prepared me for the many obstacles that

    I encountered in research and in life. I am most grateful for their unconditional

    love and support for me, which is always my pillar of strength. Last but not least,

    I thank Qingyuan again for his companionship as an intimate friend, an inspiring

    collaborator, a considerate partner and a strong emotional pillar. I cannot imagine

    getting through everything without his support and love.


  • Contents

    Abstract iv

    Acknowledgement vi

    1 Introduction 1

    1.1 Background: gene set enrichment analysis(GSEA) . . . . . . . . . . . 1

    1.2 Null hypothesis in GSEA . . . . . . . . . . . . . . . . . . . . . . . . . 3

    1.3 Permutation test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

    1.4 Test statistic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5

    1.5 Notations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7

2 Approximation via Stolarsky's Invariance 10

    2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10

    2.2 Background and notation . . . . . . . . . . . . . . . . . . . . . . . . . 13

    2.3 Approximation via spherical cap volume . . . . . . . . . . . . . . . . 16

    2.4 A finer approximation to the p-value . . . . . . . . . . . . . . . . . . 20

    2.5 Generalized Stolarsky Invariance . . . . . . . . . . . . . . . . . . . . . 26

    2.6 Two sided p-values . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28

    2.7 Numerical Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30

    2.8 Comparison to saddlepoint approximation . . . . . . . . . . . . . . . 31

    2.9 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37

    3 Advances in importance sampling 41

    3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41


  • 3.2 Importance sampling and control variates . . . . . . . . . . . . . . . . 43

    3.3 Regret bounds and convexity . . . . . . . . . . . . . . . . . . . . . . 46

    3.3.1 Mixture importance sampling . . . . . . . . . . . . . . . . . . 47

    3.3.2 Multiple importance sampling . . . . . . . . . . . . . . . . . . 50

    3.3.3 Convexity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52

    3.4 Choosing α . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53

    3.4.1 Bounding αj away from zero . . . . . . . . . . . . . . . . . . . 54

    3.4.2 Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . 55

    3.5 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57

    3.5.1 Singular function . . . . . . . . . . . . . . . . . . . . . . . . . 57

    3.5.2 Rare event . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59

    3.5.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62

    3.6 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62

4 Approximation with importance sampling 65

    4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65

    4.2 Geometry of the quadratic spherical cap . . . . . . . . . . . . . . . . 66

    4.3 Three sequential importance sampling algorithms . . . . . . . . . . . 68

4.3.1 Uniform sampling from Sd with polar coordinates . . . . . . 69

4.3.2 Sequential sampling from Sd(λ) starting from the largest eigenvalue . . . . . . 71

4.3.3 Sequential sampling from Sd(λ) starting from the smallest eigenvalue . . . . . . 75

4.3.4 Sequential sampling from Sd(λ) with low rank matrix Σ . . . 76

4.3.5 Simulation results for the three sequential importance algorithms 78

    4.4 From continuous approximation to the exact p-value . . . . . . . . . . 84

    5 Subset simulation 88

    5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88

    5.2 Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89

    5.2.1 Estimation of P ∗ . . . . . . . . . . . . . . . . . . . . . . . . . 89

    5.2.2 Uniform sampling on A` . . . . . . . . . . . . . . . . . . . . . 89


  • 5.2.3 Quick update for T (x′) . . . . . . . . . . . . . . . . . . . . . . 92

    5.2.4 Choosing q . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92

    5.2.5 Algorithm for estimating P ∗ . . . . . . . . . . . . . . . . . . . 92

    5.3 Simulation results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93

    6 Real data example 103

    A Appendix 112

    A.1 Proof of Theorem 3 (Limiting invariance) . . . . . . . . . . . . . . . . 112

    A.2 Proof of Lemma 3 (Double inclusion for Model 2) . . . . . . . . . . . 114

    A.3 Proof of Theorem 6 (Second moment under Model 2) . . . . . . . . . 116

    A.4 Proof of Theorem 7 (Location weighted invariance) . . . . . . . . . . 117

    A.5 Proof of Theorem 8 (Spatially weighed invariance) . . . . . . . . . . 120

    A.6 Proof of Theorem 9 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122

    A.7 Proof of Corollaries 1 and 2 . . . . . . . . . . . . . . . . . . . . . . . 123

    A.8 Proof of Lemma 4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123

    A.9 Proof of Theorem 11 . . . . . . . . . . . . . . . . . . . . . . . . . . . 124

    A.10 Proof of Theorem 12 . . . . . . . . . . . . . . . . . . . . . . . . . . . 124

    A.11 Proof of Theorem 13 . . . . . . . . . . . . . . . . . . . . . . . . . . . 125

    A.12 Algorithms and Figures . . . . . . . . . . . . . . . . . . . . . . . . . . 127


  • List of Tables

    2.1 Maximal Z scores observed for p̂2 and p̂3. . . . . . . . . . . . . . . . 40

    3.1 Singular function example. The estimate µ̂ was computed from 500,000

    observations using the sampler given in the first column. Control vari-

    ates were used in two of those samples. The final columns give variance

    reduction factors compared to plain Monte Carlo and compared to uni-

    form mixture importance sampling with no control variates. . . . . . 60

    3.2 Top 10 mixture components N (xk, σ2rI5) for the singular integrand inα∗∗, which uses control variates. D denotes the defensive mixture. The

    last columns are mean and sd of αj over 5000 simulations. . . . . . . 61

    3.3 Top 10 mixture components N (xk, σ2rI5) for the singular integrandin α∗, which does not use control variates. D denotes the defensive

    mixture. The last columns are mean and sd of αj over 5000 simulations. 62

    3.4 Rare event example. The estimate µ̂ was computed from 100,000 obser-

    vations using the sampler given in the first column. Control variates

    were used in two of those samples. The next columns give variance

    reduction factors compared to plain Monte Carlo and compared to

    uniform mixture importance sampling with no control variates. The

    final column compares actual squared error with its sample estimate. 63

    3.5 Top 10 mixture components N (zk, σ2rI2) for the singular integrandin α∗, which does not use control variates. D denotes the defensive

    mixture. The last columns are mean and sd of αj over 5000 simulations. 64

    3.6 Average running times in seconds for four estimators on two examples. 64


  • 4.1 Four groups of data sets with different sizes . . . . . . . . . . . . . . 78

    4.2 Comparison of sampling with pl and ql for one data set in GPI with

    P ∗ = 8.426 × 10−2. N(Sd(λ))/N is the proportion of points lying inSd(λ). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80

    4.3 Data set DS∗ in GPI with P∗ = 1.083×10−4. Here Nql(Sd(λ))/Nql 6= 1,

    i.e. ql fails to sample exclusively from Sd(λ) because of numericalinaccuracy. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81

    4.4 Comparison of sampling with ps and qs on data set DS∗, with P ∗ =

    1.083× 10−4. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 834.5 Comparison of sampling with pr and qr on data set DS

    ∗ with P ∗ =

    1.083× 10−4. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 834.6 Data set DS∗ with P ∗ = 1.083 × 10−4 is considered. We estimate P ∗

    by projecting continuous samples from pr and qr and calculate with

    formula (4.17). N(Sd(T0;Q,Λ))/N is the proportion of g(a) lying inSd(T0;Q,Λ). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85

    4.7 Data set DS∗ with P ∗ = 1.083×10−4 is considered. We compare sam-pling with pr, qU and qα∗ and estimate with formula (4.17). N(Sd(T0;Q,Λ))/Ndenotes the proportion of g(a) lying in Sd(T0;Q,Λ). . . . . . . . . . . 85

    4.8 Data set DS∗ with P ∗ = 1.083 × 10−4 is considered. We comparesampling with pr, qU and qα∗∗ and estimate with control variates.

    N(Sd(T0;Q,Λ))/N denotes the proportion of g(a) lying in Sd(T0;Q,Λ). 86

    5.1 Comparison of Monte Carlo sampling and subset simulation on data

    set DS∗ with P ∗ = 1.083× 10−4. We run K = 50 independent subsetsimulation with n = 1000, q = 0.2, B = 20. . . . . . . . . . . . . . . . 96

    6.1 Three data sets used for non-permutation GSEA. . . . . . . . . . . . 105

6.2 Kendall's τ metric for Moran set with 10−5 ≤ p̂MC ≤ 10−4 . . . . . . 105

6.3 Running time for all gene sets in different data sets with the linear

    statistic (in seconds). p̂MC are run with 106 samples. Extra time is

    needed for gene sets with p̂MC < 10−4 to generate 107 samples. . . . . 110


  • 6.4 Running time for all gene sets in different data sets with the quadratic

    statistic (in seconds). p̂MC are run with 106 samples. Extra time is

    needed for gene sets with p̂MC < 10−4 to generate 107 samples. . . . . 110

6.5 Kendall's τ metric for Moran set with 10−5 ≤ p̂MC ≤ 10−4. We focus on the gene sets in

Moran data with linear statistic p-values 10−5 < p̂MC < 10−4 because these gene sets have

small p-values, yet reliable Monte Carlo estimates p̂MC as gold standards - with 107 Monte

Carlo samples. The methods that are closest to Monte Carlo (gold standard)

are p̂1, p̂2 and p̂3. . . . . . . . . . . . . . . . . . . . . . . . . 110


  • List of Figures

    1.1 Framework for the construction of a gene set statistic . . . . . . . . . 6

    2.1 Illustration for Model 1. The point y is uniformly distributed over Sd.The small open circles represent permuted vectors xk. The point y0

    is the observed value of y. The circle around it goes through x0 and

    represents a spherical cap of height yT0x0. A second spherical cap of

    equal volume is centered at y = y1. We study moments of p(y; ρ̂), the

    fraction of xk in the cap centered at random y. . . . . . . . . . . . . 17

    2.2 Illustration for Model 2. The original response vector is y0 with yT0x0 =

    ρ̂. We consider alternative y uniformly distributed on the surface of

    C(x0; ρ̂) with examples y1 and y2. Around each such yj there is a

    spherical cap of height ρ̂ that just barely includes xc = x0. We use

    p̂2 = E2(p(y; ρ̂)) and find an expression for E2((p̂2 − p(y; ρ̂))2). . . . 222.3 Illustration of r1, r2, δ1 and δ2. The points xc, x1 and x2 each have m0

    negative and m1 positive components. For j = 1, 2 the swap distance

    between xj and xc is rj. There are δ1 positive components of xc where

    both x1 and x2 are negative, and δ2 negative components of xc where

    both xj are positive. . . . . . . . . . . . . . . . . . . . . . . . . . . . 25

    2.4 RMSEs for p̂1 and p̂2 under Models 1 and 2. The x-axis shows the

    estimate p̂ as ρ varies from 1 to 0. Here m0 = m1. Plots with m0 6= m1are similar. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32

    2.5 Comparison of p̂1 and p̂2. In (a), log10(p̂2) is plotted against log10(p̂1)

    for varying ρ’s. The black line is the 45 degree line. In (b), the ratio of

    RMSEs for p̂1 and p̂2 is plotted against log10(p̂1). The x-axis is log10(p̂1). 33


  • 2.6 The coefficient of variation for p̂2 with varying ρ’s. . . . . . . . . . . 33

    2.7 Comparison of p̂3 versus p̂2. For a given triple (m0,m1, ρ̂), we ran-

    domly sample 100 vectors y0 with xT0y0 = ρ̂. By symmetry, x0 can be

    any permutation. We get 100 different p̂3 and a common p̂2 for each

    triple (m0,m1, ρ̂). In the two panels on the left, ρ̂’s are chosen to give

    two-sided p̂1(ρ̂) = 2×10−10 with various dimensions (m0,m1). The twopanels on the right correspond to two-sided p̂1(ρ̂) = 2 × 10−20. Esti-mates for two-sided p-values are plotted on the y-axis, with p̂3 plotted

    as black dots with distributions and p̂2 as red crosses. . . . . . . . . 34

    2.8 Comparison of RMSE(p̂3) versus RMSE(p̂2). The same simulation set-

    ting as described in Figure 2.7. The red crosses and black dots are

    estimated RMSE(p̂2) and RMSE(p̂3) under Model 2 with centers c = 0

    and c = arg max06i

  • 5.1 Comparison of σ̂1(P̂∗1:K) and σ̂2(P̂

    ∗i )’s. With the same K = 50 sim-

    ulations in Table 5.1, the sample standard deviation for all (P̂ ∗i )’s is

    σ̂1(P̂∗1:K) = 2.813 × 10−5. We have 50 stand alone estimates σ̂2(P̂ ∗i )

    computed with eq. (5.2). We plot the z-scores of P̂ ∗i ’s computed with

    σ̂1(P̂∗1:K) and σ̂2(P̂

    ∗i ) respectively in Figure 5.1b and 5.1c. We plot the

    histogram of σ̂2(P̂∗i )’s in Figure 5.1a with σ̂1(P̂

    ∗1:K) added as the dashed

    reference line. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97

    5.2 Simulation result for GPI. For Monte Carlo NMC = 108. For subset

    simulation, n = 1000, B = 40, and K = 50 for each data set. . . . . . 99

    5.3 Simulation result for GPII. For Monte Carlo NMC = 108. For subset

    simulation, n = 1000, B = 40, and K = 50 for each data set. . . . . . 100

    5.4 Simulation result for GPIII. For Monte Carlo NMC = 108. For subset

    simulation, n = 1000, B = 120, and K = 50 for each data set. . . . . 101

    5.5 Simulation result for GPIV. For Monte Carlo NMC = 108. For subset

    simulation, n = 1000, B = 120, and K = 50 for each data set. . . . . 102

    6.1 Moran: scatter plot for linear statistic . . . . . . . . . . . . . . . . . . 106

    6.2 Moran: scatter plot for linear statistic. Gene sets are plotted with

    p-values satisfying 10−5 < p̂MC < 10−4, with a total of 190 gene sets. 107

    6.3 Scherzer: scatter plot for linear statistic . . . . . . . . . . . . . . . . . 108

    6.4 Scherzer: scatter plot for linear statistic. Gene sets are plotted with

    p-values satisfying 10−5 < p̂MC < 10−3, with a total of 15 gene sets. . 109

    6.5 Comparison of p̂MC and p̂SS on Zhang, Moran and Scherzer data set

    for quadratic statistic. . . . . . . . . . . . . . . . . . . . . . . . . . . 111

    A.1 For each of 40 data sets in GPI, we obtain Nps = 5 × 107 and Nqs =5 × 105 samples from ps and qs respectively and estimate µ. FigureA.1a plots µ̂qs versus µ̂ps . Figure A.1b plots the estimated relative

    error (µ̂ps − P ∗)/P ∗ versus P ∗. Figure A.1c plots VRFN(qs; ps) versusµ̂ps . Figure A.1d plots VRFt(qs; ps) versus µ̂ps . . . . . . . . . . . . . 134


  • A.2 For each of 40 data sets in GPI, we obtain Npr = 5 × 107 and Nqr =5 × 105 samples from pr and qr respectively and estimate µ. FigureA.2a plots µ̂qr versus µ̂pr . Figure A.2b plots the estimated relative

    error (µ̂pr − P ∗)/P ∗ versus P ∗. Figure A.2c plots VRFN(qr; pr) versusµ̂pr . Figure A.2d plots VRFt(qr; pr) versus µ̂pr . . . . . . . . . . . . . 135

    A.3 For each of 40 data sets in GPII, we obtain Nps = 5 × 107 and Nqs =5 × 105 samples from ps and qs respectively and estimate µ. FigureA.3a plots µ̂qs versus µ̂ps . Figure A.3b plots the estimated relative

    error (µ̂ps − P ∗)/P ∗ versus P ∗. Figure A.3c plots VRFN(qs; ps) versusµ̂ps . Figure A.3d plots VRFt(qs; ps) versus µ̂ps . . . . . . . . . . . . . 136

    A.4 For each of 40 data sets in GPIII, we obtain Npr = 108 and Nqr = 10

    8

    samples from pr and qr respectively and estimate µ. Figure A.4a plots

    µ̂qr versus µ̂pr . Figure A.4b plots the ratio µ̂pr/µ̂qr versus µ̂qr . Figure

    A.4c plots VRFN(qr; pr) versus µ̂pr . Figure A.4d plots VRFt(qr; pr)

    versus µ̂pr . VRFN and VRFt are computed with equations (4.14),

    (4.15) and (4.16). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137

    A.5 For each of 40 data sets in GPIV, we obtain Npr = 108 and Nqr = 10

    8

    samples from pr and qr respectively and estimate µ. Figure A.5a plots

    µ̂qr versus µ̂pr . Figure A.5b plots the ratio µ̂pr/µ̂qr versus µ̂qr . Figure

    A.5c plots VRFN(qr; pr) versus µ̂pr . Figure A.5d plots VRFt(qr; pr)

    versus µ̂pr . VRFN and VRFt are computed with equations (4.14),

    (4.15) and (4.16). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138


  • Chapter 1

    Introduction

1.1 Background: gene set enrichment analysis (GSEA)

    Since the introduction of DNA microarray measurement technology, genomewide ex-

    pression analysis with these DNA microarrays has become a mainstay of both ge-

    nomics and statistical research. Researchers seek methods that extract useful infor-

    mation from the DNA microarrays both accurately and efficiently. For a particular

    experiment, gene expressions are measured for thousands of genes from a group of

    samples belonging to either the treatment or control group, for example, candidates

with or without lung cancer. Traditionally, the most differentially expressed genes are

tested individually for their relationship with the treatment. However, this approach

has some major limitations:

• After correcting for the multiple hypothesis testing effect, genes that achieve the required significance level may be too few or even non-existent. We may leave

    out many weakly correlated genes due to the noise and the multiple hypothesis

    testing effect.

• Even if we end up with a long list of statistically significant genes, they may well be biologically unstructured. Interpretation can be ad hoc and arbitrary,

    and often biased by the biologist’s area of expertise.

• Many measured genes are biologically related - sharing the same biological function, chromosomal location, or regulation, and hence their measurements are

    correlated. Standard multiple hypothesis testing procedures that control FDR

require independence or special dependence structures like PRDS (Benjamini

    and Yekutieli (2001)), which often does not hold in real microarray data sets.

In view of the above challenges for the single gene testing procedure, Mootha

et al. (2003) first introduced gene set enrichment analysis (GSEA). Instead of studying

    single gene effects individually, they propose to study microarray data at the level

    of gene sets. “The gene sets are defined based on prior biological knowledge, e.g.,

    published information about biochemical pathways or coexpression in previous ex-

    periments.”(Subramanian et al. (2005)) The goal of GSEA is to find gene sets that

    are correlated with the treatment as a whole. Moving the analysis from the single

    gene level to the gene set level has several advantages,

• It is common to have many weakly correlated single genes that appear as insignificant individually. By combining their weak effects with appropriate

    choices of the test statistic, we may find the gene set as a whole achieves the

    desired statistical significance.

• Conducting hypothesis testing on the gene set level yields much more interpretable results. We can focus on exploring the scientific explanation of the

    relationships between gene sets and the treatment, without the extra step of

    summarizing uncoordinated single gene test results.

• The total number of tests carried out at the same time is significantly reduced. Multiple single gene tests are condensed to one single test for the whole gene

    set.

• When a single test statistic is constructed for the whole gene set, the correlations of single genes within gene sets no longer play a role in the final decision, making

    the conclusion more statistically sound.


Because of the above benefits, GSEA has gained much attention since its first

introduction, and has become the standard practice in the last decade (Tamayo et al.

(2012)). The gene set database has grown from the original database (Subramanian

et al. (2005)) of 1,325 gene sets, including four major collections, to 13,311 gene

    sets as of today in the Molecular Signatures Database (MSigDB, Liberzon et al.

    (2015)), divided into 8 major collections, and several subcollections. The gene sets

    are available for download from the Broad Institute (2016).

    1.2 Null hypothesis in GSEA

    A key component in hypothesis testing is the null hypothesis. Tian et al. (2005) and

    Goeman and Bühlmann (2007) among others introduce two different null hypotheses

in GSEA. Let S be the gene set of interest and Y be the treatment. The first

    null compares the association between Y and S with the association between Y and

other gene sets of comparable sizes. This null hypothesis essentially means S cannot

    stand out from comparable gene sets, hence is often known as the “competitive null”

hypothesis. Methods for testing the competitive null typically involve randomizing

the gene labels and keeping the sample labels fixed. This permutation does not give

    a rigorous test when genes are correlated, which is usually the case for those within

    a gene set. (Goeman and Bühlmann (2007))

    The second null only focuses on the gene set of interest S. It compares the associa-

    tion between S and Y with the association between S and random treatments. To test

    this “self-contained null”, usually the labels in the treatment Y are permuted, with

    the gene labels fixed. While the competitive null is often of interest, this dissertation

    focuses on testing for the “self-contained” null.

    1.3 Permutation test

Gene set tests construct a single test statistic for the whole gene set. In most cases,

    the null distributions have no closed form, hence the p-values are usually estimated

    by permutation tests. Even in cases where closed form null distributions are available


    under appropriate assumptions, such as the Kolmogorov-Smirnov statistic in the ini-

    tial GSEA in Mootha et al. (2003), and the J-G score in Jiang and Gentleman (2007),

    the permutation tests are suggested to gain robustness in case the data falls short of

    the assumptions.

    A detailed explanation of permutation test is described in Lehmann and Romano

    (2005). We describe its procedure in our particular application of GSEA. Suppose

for $m$ independent samples we observe the single gene measurements $Y_g \in \mathbb{R}^m$, $g = 1, \dots, G$, for all genes in a gene set of size $G$, and denote the corresponding treatment as $X \in \mathbb{R}^m$ or $\{0,1\}^m$. In cases where $X$ takes binary values, let $m_0$ be the number of 0's and $m_1$ be the number of 1's, with $m = m_0 + m_1$. Denote the gene measurements for the gene set as the matrix $Y_{1:G} = [Y_1, \dots, Y_G] \in \mathbb{R}^{m \times G}$. We first decide on the test statistic for the gene set. An example is taking the sum of single gene correlations as the test statistic, $T(Y_{1:G}, X) = \sum_{g=1}^{G} \mathrm{corr}(Y_g, X)$. Another example is constructing the test statistic as the sum of squared t-statistics for the single genes, $T(Y_{1:G}, X) = \sum_{g=1}^{G} t(Y_g, X)^2$. To perform the permutation test, we keep $Y_{1:G}$ fixed and obtain all $N$ unique permutations of $X$ as $X_0^*, \dots, X_{N-1}^*$. In cases where $X \in \mathbb{R}^m$, $N = m!$, and in cases where $X \in \{0,1\}^{m_0+m_1}$, $N = \binom{m_0+m_1}{m_0}$. Then the permutation p-value is defined as

$$ p = \frac{1}{N} \sum_{i=0}^{N-1} \mathbf{1}\bigl(T(Y_{1:G}, X_i^*) \ge T(Y_{1:G}, X)\bigr). $$

    Note that T (Y1:G, X∗i ) ≥ T (Y1:G, X) holds true at least for X∗i = X, so the true

    permutation p-value never goes below 1/N .
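As a concrete illustration of this procedure, the following minimal Python sketch estimates the permutation p-value of the sum-of-correlations statistic by plain Monte Carlo. It is not code from the thesis: the toy expression matrix, sample sizes, number of permutations, and the helper name `linear_stat` are illustrative assumptions.

```python
import numpy as np

def linear_stat(Y, x):
    """Sum over genes of corr(Y_g, x); Y is m-by-G, x is the length-m treatment vector."""
    Yc = (Y - Y.mean(axis=0)) / Y.std(axis=0)
    xc = (x - x.mean()) / x.std()
    return (Yc.T @ xc).sum() / len(x)     # each term is a sample correlation

rng = np.random.default_rng(0)
m0, m1, G = 10, 10, 50                     # toy sizes
x = np.repeat([0.0, 1.0], [m0, m1])        # binary treatment labels
Y = rng.standard_normal((m0 + m1, G))      # toy expression matrix (null data)

T_obs = linear_stat(Y, x)
M = 10_000                                 # number of sampled permutations
exceed = sum(linear_stat(Y, rng.permutation(x)) >= T_obs for _ in range(M))
# Counting the observed labelling as one extra permutation keeps the estimate >= 1/(M+1).
p_hat = (exceed + 1) / (M + 1)
print(p_hat)
```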

    The total number of permutations N increases exponentially and quickly becomes

intractable as the sample size $m$ grows. For example, with $m = 20$, $N = 20! \doteq 2.4 \times 10^{18}$, and with $m_0 = m_1 = 20$, $N = \binom{40}{20} \doteq 1.4 \times 10^{11}$. It is common to approximate

    the exact permutation p-value by random sampling from all permutations (Good

    (2013)). Monte Carlo permutation tests are easy to implement, require no specific

    distributional assumptions on the data, and can be applied to any test statistic of our

    choice. Despite their generality, they are often computationally expensive, especially


    when the true p-values are small. As discussed in Larson and Owen (2015), for p-

values as small as ε, random permutation sample sizes between 3/ε and 19/ε are needed to

    get adequate power.

    Monte Carlo based permutation also suffers from a resampling granularity problem,

    whose name is adopted from Larson and Owen (2015). It is conventional to add the

    observation X as an additional random sample of permutations. Then the smallest

p-value that we can possibly get from M − 1 random permutations is 1/M. When two or more gene sets are tied at this granularity value, there is no way to distinguish

    them. Many existing methods rank the gene sets by their corresponding test statistics.

    However this practice is subject to the assumption that all test statistics have the

    same null distribution, which is clearly not the case when comparing gene sets of

    different sizes or different correlation structures.

    Observing the challenges in plain Monte Carlo sampling of permutations, we seek

    alternative methods to estimate permutation p-values efficiently, especially those that

    are extremely small. The methods that work most efficiently are usually specialized

    to the chosen test statistic. We discuss the choice of test statistics in the next section.

    1.4 Test statistic

    A gene set statistic is usually constructed with three components: gene-level statistic,

    transformation and summary statistic as shown in Fig. 1.1. There can be many

    different choices in each component. To obtain the gene set test statistic, one can

    first choose the gene level statistic as the t-statistic or the correlation coefficient,

    then take no transformation, or take transformations such as the absolute value or

    the square, and finally take the mean or median. The reader can easily decode the

aforementioned two examples of gene set statistics, $T(Y_{1:G}, X) = \sum_{g=1}^{G} \mathrm{corr}(Y_g, X)$ and $T(Y_{1:G}, X) = \sum_{g=1}^{G} t(Y_g, X)^2$, in terms of these three steps. For a more detailed discussion

    of the construction framework for gene set tests, see Ackermann and Strimmer (2009).

Figure 1.1: Framework for the construction of a gene set statistic

Ackermann and Strimmer (2009) compared 261 different gene set statistics, and found particularly good performance of two families of statistics - a linear family and a quadratic family. Let $\rho_g(Y_g, X)$ and $t_g(Y_g, X)$ be the single gene correlation and t-statistic respectively. The linear family consists of $T_1 = \sum_{g=1}^{G} \rho_g(Y_g, X)$ and $T_1' = \sum_{g=1}^{G} t_g(Y_g, X)$, and the quadratic family consists of $T_2 = \sum_{g=1}^{G} \rho_g(Y_g, X)^2$ and $T_2' = \sum_{g=1}^{G} t_g(Y_g, X)^2$. The best performance comes from the quadratic family. By squaring the correlation or t-statistics, effects from genes that are differentially expressed in

    opposite directions are added up instead of being cancelled from each other. The

    linear family is the second best. Here T ′1 is also known as the J-G score proposed in

    Jiang and Gentleman (2007). It is remarkable that the best performing statistics are

    surprisingly simple, especially when compared with the complicated GSEA method

    in Subramanian et al. (2005).

    The similar performances of using t-statistic and the correlation can be justified

    through Taylor approximation, as shown in Larson and Owen (2015). The usual t-

statistic for testing a linear relationship is $t_g \equiv \sqrt{m-2}\,\hat\rho_g/(1 - \hat\rho_g^2)^{1/2}$. The Taylor expansion gives $t_g = \sqrt{m-2}\,\bigl(\hat\rho_g + \tfrac{1}{2}\hat\rho_g^3 + O(\hat\rho_g^5)\bigr)$. Gene set tests are most useful when individual $|\hat\rho_g|$'s are small. In these cases $t_g$ is approximately a constant multiple of $\hat\rho_g$, hence using the correlation as the gene level statistic should yield similar performances to those using t-statistics. We study the linear and quadratic statistics with

    correlation as the gene level statistic.
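For illustration, the four family statistics can be computed as in the following sketch. The toy data and the helper name `family_statistics` are assumptions made for the example; scipy's pooled two-sample t-test is used for the gene-level t-statistics, which for binary x agrees with the correlation form of the t-statistic above.

```python
import numpy as np
from scipy import stats

def family_statistics(Y, x):
    """Return (T1, T1', T2, T2') for one gene set; Y is m-by-G, x is a binary treatment vector."""
    G = Y.shape[1]
    rho = np.array([stats.pearsonr(Y[:, g], x)[0] for g in range(G)])
    # Pooled two-sample t statistics, one per gene.
    t = np.array([stats.ttest_ind(Y[x == 1, g], Y[x == 0, g])[0] for g in range(G)])
    return rho.sum(), t.sum(), (rho ** 2).sum(), (t ** 2).sum()

rng = np.random.default_rng(1)
x = np.repeat([0, 1], [12, 12])
Y = rng.standard_normal((24, 30))
print(family_statistics(Y, x))
```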


    1.5 Notations

We summarize the notation that is used throughout the dissertation here. Let $m = m_0 + m_1$ be the number of patients, with $m_0$ in the control group and $m_1$ in the treatment group, and $X \in \{0,1\}^m$ be the indicator variable for treatment. We limit our discussion to binary $X$, while some methods can be easily extended to continuous $X$'s as well. Let $S$ be the gene set of interest and $G = |S|$ be the cardinality of $S$. Denote the expression level for gene $g$ as $Y_g \in \mathbb{R}^m$, and let $Y_{1:G} = [Y_1, \dots, Y_G] \in \mathbb{R}^{m \times G}$ be the expression level measurement matrix. We center and standardize the binary $X$ to obtain $x_0$ s.t. $x_0^T 1 = 0$, $x_0^T x_0 = 1$. Let $x_0, \dots, x_{N-1}$ be all permutations of $x_0$, where $N = \binom{m_0+m_1}{m_0}$.

We are interested in approximating the permutation p-value for two statistics,

$$ T_1(X; Y_{1:G}) = \sum_{g=1}^{G} \rho_g(Y_g, X), \qquad T_2(X; Y_{1:G}) = \sum_{g=1}^{G} \rho_g(Y_g, X)^2. $$

Note that we can replace $X$ with $x_0$ because centering and scaling do not change the correlation coefficients, hence the corresponding p-values are

$$ p_j = \frac{1}{N} \sum_{i=0}^{N-1} \mathbf{1}\bigl(T_j(x_i; Y_{1:G}) \ge T_j(x_0; Y_{1:G})\bigr), \quad j = 1, 2. $$

    In the following chapters we may omit the subscript on p-value when there is no

    confusion on the test statistic. We now derive alternative formulas for T1 and T2 for

    ease of discussion in the following chapters. Note that

$$ T_1 = \sum_{g=1}^{G} \mathrm{corr}(X, Y_g) = \sum_{g=1}^{G} \mathrm{corr}\Bigl(X, \frac{Y_g}{\mathrm{sd}(Y_g)}\Bigr) = \sqrt{G}\,\mathrm{corr}\Bigl(X, \sum_{g=1}^{G} \frac{Y_g}{\mathrm{sd}(Y_g)}\Bigr). $$

We define $Y = \sum_{g=1}^{G} Y_g/\mathrm{sd}(Y_g) \in \mathbb{R}^m$; then using $T_1$ is equivalent to using $\mathrm{corr}(X, Y)$ as the test statistic for permutation tests. We center and standardize $Y$ to get $y_0$ s.t. $y_0^T 1 = 0$, $y_0^T y_0 = 1$. Then the p-value for the linear statistic can be written in terms of the correlations between $y_0$ and the $x_i$'s:

$$ p_1 = \frac{1}{N} \sum_{i=0}^{N-1} \mathbf{1}(x_i^T y_0 \ge x_0^T y_0). \qquad (1.1) $$

To simplify $T_2$, we center and standardize all columns of $Y_{1:G}$ to get $\tilde Y_{1:G}$, s.t. $\tilde Y_{1:G}^T 1 = 0$ and $\mathrm{diag}(\tilde Y_{1:G}^T \tilde Y_{1:G}) = 1$. Define $\Sigma = \tilde Y_{1:G} \tilde Y_{1:G}^T$; then

$$ T_2 = \sum_{g=1}^{G} \mathrm{corr}(X, Y_g)^2 = \sum_{g=1}^{G} (x^T \tilde Y_g)^2 = x^T \Sigma x. $$

    Then the p-value for the quadratic statistic is

$$ p_2 = \frac{1}{N} \sum_{i=0}^{N-1} \mathbf{1}(x_i^T \Sigma x_i \ge x_0^T \Sigma x_0). \qquad (1.2) $$
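The reduction of T2 to a quadratic form is easy to check numerically. The sketch below is an illustrative assumption on toy data, not code from the thesis: it standardizes X and the columns of Y1:G as above and confirms that the sum of squared correlations equals the quadratic form in Σ.

```python
import numpy as np

rng = np.random.default_rng(2)
m0, m1, G = 8, 8, 20
m = m0 + m1
X = np.repeat([0.0, 1.0], [m0, m1])
Y = rng.standard_normal((m, G))

# Center and scale X so that x0 sums to zero and has unit Euclidean norm.
x0 = X - X.mean()
x0 /= np.linalg.norm(x0)

# Center each column of Y and give it unit Euclidean norm, as in Section 1.5.
Yt = Y - Y.mean(axis=0)
Yt /= np.linalg.norm(Yt, axis=0)

Sigma = Yt @ Yt.T                                  # m-by-m matrix from Section 1.5

T2_quadratic_form = x0 @ Sigma @ x0                # x^T Sigma x
T2_sum_of_corr = sum(np.corrcoef(X, Y[:, g])[0, 1] ** 2 for g in range(G))
print(np.isclose(T2_quadratic_form, T2_sum_of_corr))   # True: the two forms agree
```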

    This dissertation proposes three novel methods to efficiently estimate small permu-

    tation p-values p1 and p2 with T1 and T2 as test statistics respectively. The rest of the

    dissertation is organized as follows. Chapter 2, 4 and 5 are devoted to three distinct

    methods respectively. Chapter 2 introduces three approximations for p1. Error esti-

    mates are derived from generalized Stolarsky’s invariance. Alternative probabilistic

    arguments are provided as well. Chapter 3 provides some new results on importance

    sampling. Specifically, we provide a method to jointly optimize the weights in mixture

    importance sampling in combination with control variates. Chapter 4 focuses on the

    quadratic statistic T2, and introduces an estimation procedure based on importance

    sampling. Chapter 5 is devoted to a subset sampling method that may work for both

    linear and quadratic statistics. Similar ideas can be found in literature spanning from

    combinatorics, sequential Monte Carlo, Bayesian computation, rare event estimation,

    network reliability etc., and bears different names, e.g. approximate counting, nested

    sampling, subset simulation, multilevel splitting etc. We give a thorough review of

    literature in these different areas, and apply the technique to the gene set testing with

    the quadratic test statistics. Finally Chapter 6 applies the three methods- Stolarsky,

    importance sampling and subset simulation - on a real data example.


    Chapter 2 is based on the tech report He et al. (2016), and Chapter 4 is joint work

    with Kinjal Basu, Qingyuan Zhao and Art B. Owen. It appears in the tech report

    He and Owen (2014).

  • Chapter 2

    p-value approximation via

    Stolarsky’s Invariance principle

    2.1 Introduction

    This chapter focuses on estimating the p-value for the linear statistic, as defined in

    eq. (1.1). We drop the subscript of p1 throughout the discussion in this chapter. For

    linear test statistics, as we show below, the permutation p-value is the fraction of

    permuted data vectors lying in a given spherical cap subset of a d-dimensional sphere

$S^d = \{z \in \mathbb{R}^{d+1} \mid z^T z = 1\}$. A natural but crude approximation to that p-value is the fraction $\hat p$ of the sphere's surface volume contained in that spherical cap.

Stolarsky's invariance principle gives a remarkable description of the accuracy of this approximation $\hat p$. For $y \in S^d$ and $t \in [-1, 1]$ we can define the spherical cap of center $y$ and height $t$ via $C(y; t) = \{z \in S^d \mid \langle y, z\rangle \ge t\}$. For $x_0, \dots, x_{N-1} \in S^d$, let $p(y, t)$ be the fraction of those $N$ points that lie in $C(y; t)$ and let $\hat p(y, t) = \hat p(t) = \mathrm{vol}(C(y; t))/\mathrm{vol}(S^d)$. The squared $L_2$ spherical cap discrepancy of these points is

$$ L_2(x_0, \dots, x_{N-1})^2 = \int_{-1}^{1} \int_{S^d} |\hat p(t) - p(z, t)|^2 \, d\sigma_d(z)\, dt. $$

Stolarsky (1973) shows that

$$ \frac{d\,\omega_d}{\omega_{d-1}} \times L_2(\cdot)^2 = \int_{S^d}\int_{S^d} \|x - y\| \, d\sigma_d(x)\, d\sigma_d(y) - \frac{1}{N^2} \sum_{k,l=0}^{N-1} \|x_k - x_l\| \qquad (2.1) $$

where $\sigma_d$ is the uniform (Haar) measure on $S^d$ and $\omega_d$ is the (surface) volume of $S^d$. Equation (2.1) relates the mean squared error of $\hat p$ to the mean absolute Euclidean

    distance among the N points. In our applications, the N points will be the distinct

    permuted values of a data vector, but the formula holds for an arbitrary set of N

    points.

    The left side of (2.1) is, up to normalization, a mean squared discrepancy over

spherical caps. This average of $(\hat p - p)^2$ includes p-values of all sizes between 0 and 1. It is not then a very good accuracy measure when $\hat p$ turns out to be very small.

    It would be more useful to get such a mean squared error taken over caps of exactly

    the size p̂, and no others.

    Brauchart and Dick (2013) consider quasi-Monte Carlo (QMC) sampling in the

    sphere. They generalize Stolarsky’s discrepancy formula to include a weighting func-

    tion on the height t. By specializing their formula, we get an expression for the mean

of $(\hat p - p)^2$ over spherical caps of any fixed size.

Discrepancy theory plays a prominent role in QMC (Niederreiter, 1992), which

    is about approximating an integral by a sample average. The present setting is in a

    sense the reverse of QMC: the discrete average over permutations is the exact value,

    and the integral over a continuum is the approximation. A second difference is that

    the QMC literature focuses on choosing N points to minimize a criterion such as (2.1),

    whereas here the N points are determined by the problem.

We present several results for the mean of $(\hat p - p)^2$ under different conditions. In addition to fixing the size of the caps we can restrict the mean squared error to only

be over caps centered on points $y$ satisfying $\langle y, x_0\rangle = \langle y_0, x_0\rangle$ where $x_0$ is the original (unpermuted) $x$ vector and $y_0$ is the observed $y$ value. We can obtain this result by

    further extending Brauchart and Dick’s generalization of Stolarsky’s invariance. We

    call this the ‘finer approximation’ and will show it has advantages over constraining


    only the height of the caps. More generally, the point xc could be any of the permuted

    x vectors, such as the one that happens to be closest to y0.

    Although we found these results via invariance we can also obtain them via proba-

    bilistic arguments. As a consequence we have a probabilistic derivation of Stolarsky’s

    formula. Some of our results are for arbitrary x, but our best computational for-

    mulas are for the case where the variable x is binary, as it would be in experiments

    comparing treatment and control groups.

    The rest of the chapter is organized as follows. Section 2.2 presents some context

    on permutation tests and gives some results from spherical geometry. In Section 2.3

    we use Stolarsky’s invariance principle as generalized by Brauchart and Dick (2013)

    to obtain the mean squared error between the true p-value and its continuous ap-

    proximation p̂1, taken over all spherical caps of volume p̂1. This section also has a

    probabilistic derivation of that mean squared error. In Section 2.4 we describe some

    finer approximations p̃ for the p-value. These condition on not just the volume of the

    spherical cap but also on its distance from the original data point x0, or from some

    other point, such as the closest permutation of x0 to y0. By always including the

    original point we ensure that p̃ > 1/N . That is a desirable property because the true

    permutation p-value cannot be smaller than 1/N . In Section 2.5 we modify the proof

    in Brauchart and Dick (2013), to further generalize their invariance results to include

    the mean squared error of the finer approximations. Section 2.6 extends our estimates

    to two-sided testing. Section 2.7 illustrates our p-value approximations numerically.

    We see that an RMS error in the finer approximate p-values is of the same order

    of magnitude as those p-values themselves. Section 2.8 makes a numerical compari-

    son to saddlepoint methods. Section 2.9 discusses the results and gives more details

    about the bioinformatics problems that motivate the search for approximations to

    the permutation distribution. Most of the proofs are in the Appendix A.


    2.2 Background and notation

The raw data contain points $(X_i, Y_i)$ for $i = 1, \dots, m$. The $X_i$ are the treatment indicators, and $Y = \sum_{g=1}^{G} Y_g/\mathrm{sd}(Y_g) \in \mathbb{R}^m$ as discussed in Section 1.5. The sample correlation of these points is $\hat\rho = x_0^T y_0$ where $x_0$ has components $(X_i - \bar X)/s_X$ for $\bar X = (1/m)\sum_{i=1}^m X_i$, $s_X^2 = (1/m)\sum_{i=1}^m (X_i - \bar X)^2$, and $\bar Y$ and $s_Y$ are defined similarly. We assume that $s_X$ and $s_Y$ are positive. Both $x_0$ and $y_0$ belong to $S^{m-1}$. Moreover they belong to $\{z \in S^{m-1} \mid z^T 1_m = 0\}$. We can use an orthogonal matrix to rotate the points of this set onto $S^{m-2} \times \{0\}$. As a result, we may simply work with $x_0, y_0 \in S^d$ where $d = m - 2$.

    The quantity ρ̂ measures association between X and Y . It can be used as such

    a measure if the Xi are fixed and Yi observed conditionally, or vice versa, or if the

    (Xi, Yi) pairs are independently sampled from some joint distribution. Let π be

    a permutation of the indices 1, . . . ,m. There are m! vectors xπ that result from

    centering and scaling Xπ = (Xπ(1), Xπ(2), . . . , Xπ(m)). The permutation p-value is

$p = (1/m!)\sum_{\pi} \mathbf{1}(x_\pi^T y_0 > x_0^T y_0)$. The justification for this p-value relies on the group

    structure of permutations (Lehmann and Romano, 2005). For a cautionary tale on

    the use of permutation sets without a group structure, see Southworth et al. (2009).

    For notational simplicity we assume ρ̂ > 0 and work with one-sided p-values. Negative

    ρ̂ can be handled similarly, or simply by switching the group labels. For two-sided

    p-values see Section 2.6.

    Our proposals are computationally most attractive in the case where Xi takes on

    just two values, such as 0 and 1. Then ρ̂ is a two-sample test statistic. If there are

$m_0$ observations with $X_i = 0$ and $m_1$ with $X_i = 1$ then $x_0$ contains $m_0$ components equal to $-\sqrt{m_1/(m m_0)}$ and $m_1$ components equal to $+\sqrt{m_0/(m m_1)}$. Some formulas involve the smaller sample size, $m \equiv \min(m_0, m_1)$.

For this two-sample case there are only $N = \binom{m_0+m_1}{m_0}$ distinct permutations of $x_0$. Calling these $x_0, x_1, \dots, x_{N-1}$ we find that

$$ p = \frac{1}{N} \sum_{k=0}^{N-1} \mathbf{1}(x_k^T y_0 > \hat\rho). \qquad (2.2) $$
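As an illustration of this geometric reformulation, the exact permutation p-value of the linear statistic can be computed by enumerating the distinct permutations of x0 and counting how many fall in the spherical cap determined by y0 and ρ̂. The sketch below uses toy data small enough to enumerate; the sizes and variable names are assumptions, not code from the thesis.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(3)
m0, m1 = 5, 5
m = m0 + m1
X = np.repeat([0.0, 1.0], [m0, m1])
y = rng.standard_normal(m)

# Centered, unit-norm versions of X and y, so correlations become inner products.
x0 = (X - X.mean()) / np.linalg.norm(X - X.mean())
y0 = (y - y.mean()) / np.linalg.norm(y - y.mean())
rho_hat = x0 @ y0

# Every permuted x has m0 entries at -sqrt(m1/(m*m0)) and m1 entries at +sqrt(m0/(m*m1)).
lo, hi = -np.sqrt(m1 / (m * m0)), np.sqrt(m0 / (m * m1))

# Enumerate all N = C(m0+m1, m0) distinct permutations and count those in the cap
# {x : x^T y0 >= rho_hat}; only the identity attains equality for continuous data.
count, N = 0, 0
for ones in combinations(range(m), m1):
    xk = np.full(m, lo)
    xk[list(ones)] = hi
    count += (xk @ y0 >= rho_hat)
    N += 1
print(count / N)        # exact permutation p-value for the linear statistic
```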


    Now suppose that there are exactly r indices for which xk is positive and xl is

    negative. There are then r indices with the reverse pattern too. We say that xk and

    xl are at ‘swap distance r’. In that case we easily find that

$$ u(r) \equiv x_k^T x_l = 1 - r\Bigl(\frac{1}{m_0} + \frac{1}{m_1}\Bigr). \qquad (2.3) $$

    We need some geometric properties of the unit sphere and spherical caps. The

surface volume of $S^d$ is $\omega_d = 2\pi^{(d+1)/2}/\Gamma((d+1)/2)$. We use $\sigma_d$ for the volume element in $S^d$ normalized so that $\sigma_d(S^d) = 1$. The spherical cap $C(y; t) = \{z \in S^d \mid z^T y > t\}$ has volume

$$ \sigma_d(C(y; t)) = \begin{cases} \tfrac12 I_{1-t^2}\bigl(\tfrac d2, \tfrac12\bigr), & 0 \le t \le 1, \\[4pt] 1 - \tfrac12 I_{1-t^2}\bigl(\tfrac d2, \tfrac12\bigr), & -1 \le t < 0, \end{cases} $$

where $I_t(a, b)$ is the incomplete beta function

$$ I_t(a, b) = \frac{1}{B(a, b)} \int_0^t x^{a-1}(1-x)^{b-1}\,dx $$

with $B(a, b) = \int_0^1 x^{a-1}(1-x)^{b-1}\,dx$. Obviously, this volume is 0 if $t < -1$ and it is 1 if $t > 1$. This volume is independent of $y$ so we may write $\sigma_d(C(\cdot, t))$ for the volume. By symmetry, $\mathbf{1}(x \in C(y, t)) = \mathbf{1}(y \in C(x, t))$.

    Our first approximation of the p-value is

    p̂1(ρ̂) = σd(C(y; ρ̂)). (2.4)

    This approximation has two intuitive explanations. First, the true p-value is the

    proportion of permutations of x0 that lie in C(y0; ρ̂), and σd(C(y0, ρ̂)) is the pro-

portion of the volume of $S^d$ in that set. Second, as we show in Proposition 2, $\hat p_1 = E(p \mid \langle x_0, y\rangle = \hat\rho)$ for $y \sim U(S^d)$, as $y_0$ would be if the original $Y_i$ were IID Gaussian. In Theorem 4, we find $\mathrm{Var}(\hat p_1)$ under this assumption.
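The cap volume, and hence p̂1, is easy to evaluate with the regularized incomplete beta function. The following sketch is an illustrative assumption using scipy, not code from the thesis; it implements σd(C(·; t)) and checks it against a crude Monte Carlo estimate on the sphere.

```python
import numpy as np
from scipy.special import betainc

def cap_volume(t, d):
    """sigma_d(C(.; t)) for a spherical cap of height t on S^d, via the regularized incomplete beta."""
    if t >= 1.0:
        return 0.0
    if t <= -1.0:
        return 1.0
    v = 0.5 * betainc(d / 2, 0.5, 1 - t * t)
    return v if t >= 0 else 1.0 - v

# p_hat_1(rho_hat) is cap_volume(rho_hat, d) with d = m - 2.
# Quick Monte Carlo check of the formula on S^8:
rng = np.random.default_rng(4)
d, t = 8, 0.6
y = np.zeros(d + 1)
y[0] = 1.0
Z = rng.standard_normal((200_000, d + 1))
Z /= np.linalg.norm(Z, axis=1, keepdims=True)
print(cap_volume(t, d), float(np.mean(Z @ y >= t)))   # the two numbers should roughly agree
```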

We frequently need to project $y \in S^d$ onto a point $x \in S^d$. In this representation $y = tx + \sqrt{1-t^2}\,y^*$ where $t = y^T x \in [-1, 1]$ and $y^* \in \{z \in S^d \mid z^T x = 0\}$, which is isomorphic to $S^{d-1}$. The coordinates $t$ and $y^*$ are unique. From equation (A.1) in Brauchart and Dick (2013) we get

$$ d\sigma_d(y) = \frac{\omega_{d-1}}{\omega_d} (1 - t^2)^{d/2 - 1} \, dt \, d\sigma_{d-1}(y^*). \qquad (2.5) $$

    In their case x was (0, 0, . . . , 1).

    The intersection of two spherical caps of common height t is

    C2(x,y; t) ≡ C(x; t) ∩ C(y; t).

    We will need the volume of this intersection. Lee and Kim (2014) give a general solu-

    tion for spherical cap intersections without requiring equal heights. They enumerate

    25 cases, but our case does not correspond to any single such case and so we obtain

    the formula we need directly, below. We suspect it must be known already, but we

    were unable to find it in the literature.

Lemma 1. Let $x, y \in S^d$ and $-1 \le t \le 1$ and put $u = x^T y$. Let $V_2(u; t, d) = \sigma_d(C_2(x, y; t))$. If $u = 1$, then $V_2(u; t, d) = \sigma_d(C(x; t))$. If $-1 < u < 1$, then

$$ V_2(u; t, d) = \frac{\omega_{d-1}}{\omega_d} \int_t^1 (1 - s^2)^{\frac d2 - 1} \sigma_{d-1}\bigl(C(y^*; \rho(s))\bigr) \, ds, \qquad (2.6) $$

where $\rho(s) = (t - su)/\sqrt{(1 - s^2)(1 - u^2)}$. Finally, for $u = -1$,

$$ V_2(u; t, d) = \begin{cases} 0, & t > 0, \\[4pt] \dfrac{\omega_{d-1}}{\omega_d} \displaystyle\int_{-|t|}^{|t|} (1 - s^2)^{\frac d2 - 1} \, ds, & \text{else.} \end{cases} \qquad (2.7) $$

Proof. Let $z \sim U(S^d)$. Then $V_2(u; t, d) = \sigma_d(C_2(x, y; t)) = \Pr(z \in C_2(x, y; t))$. If $u = 1$ then $x = y$ and so $C_2(x, y; t) = C(x; t)$. For $u < 1$, we project $y$ and $z$ onto $x$, via $z = sx + \sqrt{1 - s^2}\,z^*$ and $y = ux + \sqrt{1 - u^2}\,y^*$. Now

$$ \begin{aligned} V_2(u; t, d) &= \int_{S^d} \mathbf{1}(\langle x, z\rangle \ge t)\,\mathbf{1}(\langle y, z\rangle \ge t) \, d\sigma(z) \\ &= \int_{-1}^{1} \mathbf{1}(s > t)\,\frac{\omega_{d-1}}{\omega_d} (1 - s^2)^{\frac d2 - 1} \int_{S^{d-1}} \mathbf{1}\bigl(su + \sqrt{1 - s^2}\sqrt{1 - u^2}\,\langle y^*, z^*\rangle \ge t\bigr) \, d\sigma_{d-1}(z^*) \, ds. \end{aligned} $$

If $u > -1$ then this reduces to (2.6). For $u = -1$ we get

$$ V_2(u; t, d) = \frac{\omega_{d-1}}{\omega_d} \int_{-1}^{1} \mathbf{1}(s > t)\,\mathbf{1}(-s > t)(1 - s^2)^{\frac d2 - 1} \, ds, $$

which reduces to (2.7).
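For small examples, V2(u; t, d) can be evaluated directly from (2.6) and (2.7) by one-dimensional numerical integration, as in the sketch below. This is illustrative only: the helper names, tolerances, and sanity checks are assumptions, not code from the thesis.

```python
import numpy as np
from scipy.special import betainc, gamma
from scipy.integrate import quad

def cap_volume(t, d):
    # sigma_d(C(.; t)): normalized volume of a height-t cap on S^d, via the incomplete beta.
    if t >= 1.0:
        return 0.0
    if t <= -1.0:
        return 1.0
    v = 0.5 * betainc(d / 2, 0.5, 1 - t * t)
    return v if t >= 0 else 1.0 - v

def omega_ratio(d):
    # omega_{d-1} / omega_d with omega_d = 2 pi^{(d+1)/2} / Gamma((d+1)/2).
    return gamma((d + 1) / 2) / (np.sqrt(np.pi) * gamma(d / 2))

def V2(u, t, d):
    """Volume of the intersection of two height-t caps whose centers have inner product u (eqs. (2.6)-(2.7))."""
    if u >= 1.0:
        return cap_volume(t, d)
    if u <= -1.0:
        if t > 0:
            return 0.0
        return omega_ratio(d) * quad(lambda s: (1 - s * s) ** (d / 2 - 1), -abs(t), abs(t))[0]
    def integrand(s):
        rho = (t - s * u) / np.sqrt((1 - s * s) * (1 - u * u))
        return (1 - s * s) ** (d / 2 - 1) * cap_volume(rho, d - 1)
    return omega_ratio(d) * quad(integrand, t, 1)[0]

# Sanity checks: nearly coincident centers give nearly a single cap; farther centers shrink the overlap.
print(V2(0.999, 0.5, 8), cap_volume(0.5, 8))
print(V2(0.8, 0.5, 8), V2(0.2, 0.5, 8))
```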

    When we give probabilistic arguments and interpretations we do so for a random

    center y of a spherical cap. We use Models 1 and 2 below. Model 1 is illustrated in

    Figure 2.1. Model 2 is illustrated in Figure 2.2 of Section 2.4 where we first use it.

Model 1. The vector $y$ is uniformly distributed on the sphere $S^d$. Expectation under this model is denoted $E_1(\cdot)$.

Model 2. The vector $y$ is uniformly distributed on $\{z \in S^d \mid z^T x_c = \tilde\rho\}$, for some $-1 \le \tilde\rho \le 1$ and $c \in \{0, 1, \dots, N-1\}$. Then $y = \tilde\rho\, x_c + \sqrt{1 - \tilde\rho^2}\, y^*$ for $y^*$ uniformly distributed on a subset of $S^d$ isometric to $S^{d-1}$. Expectation under this model is denoted $E_2(\cdot)$.

    2.3 Approximation via spherical cap volume

    Here we study the approximate p-value p̂1(ρ̂) = σd(C(y; ρ̂)). First we find the mean

    squared error of this approximation over all spherical caps of the given volume via

    invariance. Then we give a probabilistic interpretation which includes the conditional

    unbiasedness result in Proposition 2 below. Then we give two computational simpli-

    fications, first for points obtained via permutation, and second for permutations of a

    binary vector. We begin by restating the invariance principle.


Figure 2.1: Illustration for Model 1. The point $y$ is uniformly distributed over $S^d$. The small open circles represent permuted vectors $x_k$. The point $y_0$ is the observed value of $y$. The circle around it goes through $x_0$ and represents a spherical cap of height $y_0^T x_0$. A second spherical cap of equal volume is centered at $y = y_1$. We study moments of $p(y; \hat\rho)$, the fraction of $x_k$ in the cap centered at random $y$.

    Theorem 1. Let x0, . . . ,xN−1 be any points in Sd. Then

$$ \frac{1}{N^2} \sum_{k,l=0}^{N-1} \|x_k - x_l\| + \frac{1}{C_d} \int_{-1}^{1} \int_{S^d} \Bigl| \sigma_d(C(z; t)) - \frac{1}{N} \sum_{k=0}^{N-1} 1_{C(z;t)}(x_k) \Bigr|^2 \, d\sigma_d(z)\, dt = \int_{S^d}\int_{S^d} \|x - y\| \, d\sigma_d(x)\, d\sigma_d(y) $$

where $C_d = \omega_{d-1}/(d\,\omega_d)$.

    Proof. Stolarsky (1973).

    Brauchart and Dick (2013) gave a simple proof of Theorem 1 using reproducing

    kernel Hilbert spaces. They generalized Theorem 1 as follows.


Theorem 2. Let $x_0, \dots, x_{N-1}$ be any points in $S^d$. Let $v : [-1, 1] \to (0, \infty)$ be any function with an antiderivative. Then

$$ \int_{-1}^{1} v(t) \int_{S^d} \Bigl| \sigma_d(C(z; t)) - \frac1N \sum_{k=0}^{N-1} 1_{C(z;t)}(x_k) \Bigr|^2 \, d\sigma_d(z)\, dt = \frac{1}{N^2} \sum_{k,l=0}^{N-1} K_v(x_k, x_l) - \int_{S^d}\int_{S^d} K_v(x, y)\, d\sigma_d(x)\, d\sigma_d(y) \qquad (2.8) $$

where $K_v(x, y)$ is a reproducing kernel function defined by

$$ K_v(x, y) = \int_{-1}^{1} v(t) \int_{S^d} 1_{C(z;t)}(x)\, 1_{C(z;t)}(y) \, d\sigma_d(z)\, dt. \qquad (2.9) $$

Proof. See Theorem 5.1 in Brauchart and Dick (2013).

If we set $v(t) = 1$ and $K(x, y) = 1 - C_d\|x - y\|$, then we recover the original Stolarsky formula. Note that the statement of Theorem 5.1 in Brauchart and Dick (2013) has a sign error in their counterpart to (2.8). The corrected statement (2.8) can be verified by comparing equations (5.3) and (5.4) of Brauchart and Dick (2013).

We would like a version of (2.8) just for one value of $t$ such as $t = \hat\rho = x_0^T y_0$. For $\hat\rho \in [-1, 1)$ and $\epsilon = (\epsilon_1, \epsilon_2) \in (0, 1)^2$, let

$$ v_\epsilon(t) = \epsilon_2 + \frac{1}{\epsilon_1} \mathbf{1}(\hat\rho \le t \le \hat\rho + \epsilon_1). \qquad (2.10) $$

Each $v_\epsilon$ satisfies the conditions of Theorem 2, making (2.8) an identity in $\epsilon$. We let $\epsilon_2 \to 0$ and then $\epsilon_1 \to 0$ on both sides of (2.8) for $v = v_\epsilon$, yielding Theorem 3.

Theorem 3. Let $x_0, x_1, \dots, x_{N-1} \in S^d$ and $t \in [-1, 1]$. Then

$$ \int_{S^d} |p(y, t) - \hat p_1(t)|^2 \, d\sigma_d(y) = \frac{1}{N^2} \sum_{k=0}^{N-1} \sum_{l=0}^{N-1} \sigma_d(C_2(x_k, x_l; t)) - \hat p_1(t)^2. \qquad (2.11) $$

    Proof. See Section A.1 of the Appendix which uses the limit argument described

    above.


    We now give a proposition that holds for all models, including our Model 1 and

    Model 2.

    Proposition 1. For a random point y ∈ Sd,

$$ E(p(y, t)) = \frac1N \sum_{k=0}^{N-1} \Pr(y \in C(x_k; t)), \quad\text{and} \qquad (2.12) $$

$$ E(p(y, t)^2) = \frac{1}{N^2} \sum_{k,l=0}^{N-1} \Pr(y \in C_2(x_k, x_l; t)). \qquad (2.13) $$

    Proposition 1 provides a probabilistic interpretation for equation (2.11). When

$y \sim U(S^d)$, the double sum on the right side of (2.11) is $E(p(y, t)^2)$. Additionally $\hat p_1(t)$ has a probabilistic interpretation under Model 1.

    Proposition 2. For any x0, . . . ,xN−1 ∈ Sd and t ∈ [−1, 1], p̂1(t) = E1(p(y, t)).

Proof. $E_1(p(y; t)) = E_1\bigl[\frac1N \sum_{k=0}^{N-1} 1_{C(y;t)}(x_k)\bigr] = \sigma_d(C(y; t)) = \hat p_1(t)$.

    given by (2.11) with t = ρ̂.

    The right hand side of (2.11) sums O(N2) terms. In a permutation analysis we

    might have N = m! or N =(m0+m1m0

    )for binary Xi, and so the computational cost

    could be high. The symmetry in a permutation set allows us to use

    ∫Sd|p(y, t)− p̂1(t)|2 dσd(y) =

    1

    N

    N−1∑k=0

    σd(C2(x0,xk; t))− p̂1(t)2

    instead. But that costs O(N), the same as the full permutation analysis.

When the X_i are binary, then for fixed t, σ_d(C_2(x_k, x_l; t)) just depends on the swap distance r between x_k and x_l. Then

\[
\int_{S^d} |p(y, t) - \hat p_1(t)|^2\,d\sigma_d(y)
= \frac{1}{N^2}\sum_{r=0}^{m} N_r\, V_2(u(r); t, d) - \hat p_1(t)^2
\tag{2.14}
\]


for V_2(u(r); t, d) given in Lemma 1, where N_r = \sum_{k=0}^{N-1}\sum_{l=0}^{N-1} 1(r_{k,l} = r) counts the pairs (x_k, x_l) at swap distance r.

Theorem 4. Let x_0 ∈ S^d be the centered and scaled vector from an experiment with binary X_i of which m_0 are negative and m_1 are positive. Let x_0, x_1, ..., x_{N-1} be the N = \binom{m_0+m_1}{m_0} distinct permutations of x_0. If y ∼ U(S^d), then for t ∈ [−1, 1], and with u(r) defined in (2.3),

\[
E(p(y; t)) = \sigma_d(C(y_0; t)), \quad\text{and}\quad
\mathrm{Var}(p(y, t)) = \frac{1}{N}\sum_{r=0}^{m}\binom{m_0}{r}\binom{m_1}{r} V_2(u(r); t, d) - \hat p_1(t)^2.
\]

Proof. There are \binom{m_0}{r}\binom{m_1}{r} permuted points x_i at swap distance r from x_0.

    2.4 A finer approximation to the p-value

In the previous section, we studied the distribution of p-values with the spherical cap centers y uniformly distributed on the sphere S^d. In this section, we give a finer approximation to p(y_0, ρ̂) by studying the distribution of the p-values with centers y satisfying the constraint 〈y, x_c〉 = 〈y_0, x_c〉 = ρ̃. The point x_c may be any permutation of x_0. There are two special choices. The first is to choose c = 0 so that x_c = x_0 is the original unpermuted data. The second is to choose x_c to be the closest permutation of x_0 to y_0. That is c = arg max_i 〈y_0, x_i〉. We will give a general formula that works for any choice of x_c and compare the performance of the above two choices.

The rationale for conditioning on all y satisfying 〈y, x_c〉 = ρ̃ is as follows. Since we want the exact p-value centered at y_0 with radius ρ̂, the more targeted the set of p-values we study, the better our approximation should be. When conditioning on 〈y, x_c〉 = ρ̃, we eliminate many irrelevant y. The approximation could be improved by conditioning on even more information, but the cost would go up. If we condition on the order statistics of all inner products 〈y_0, x_i〉, we get back the exact p-value. For an index c ∈ {0, 1, ..., N − 1} we propose finer approximations to the p-value


based on Model 2 from Section 2.2. These are

\[
\tilde p_c = E_2(p(y, \hat\rho)) = E_1\big(p(y, \hat\rho) \mid y^\mathsf{T} x_c = y_0^\mathsf{T} x_c\big).
\tag{2.15}
\]

We are interested in two special cases,

\[
\hat p_2 = \tilde p_0, \quad\text{and}\quad \hat p_3 = \tilde p_c, \quad\text{where } c = \arg\max_{0\le i<N} \langle y_0, x_i\rangle.
\tag{2.16}
\]

Because the exact permutation p-value is always at least 1/N, having p̂_2 > 1/N is a desirable property. Similarly, p̂_3 > 1/N because then x_c is in general an interior point of C(y, ρ̂). We expect that p̂_3 should be more conservative than p̂_2 and we see this numerically in Section 2.7.

From Proposition 1, we can get our estimate p̃_c and its mean squared error by finding single and double inclusion probabilities for y.

To compute p̃_c we need to sum N values Pr(y ∈ C(x_k; t) | y^T x_c = ρ̃), and for p̃_c to be useful we must compute it in o(N) time. The computations are feasible in the binary case, which we now focus on.

Let u_j = x_j^T x_c for j = 1, 2, and let u_3 = x_1^T x_2. Let the projection of y on x_c be y = ρ̃ x_c + √(1 − ρ̃²) y*. Then the single and double point inclusion probabilities under Model 2 are

\[
P_1(u_1, \tilde\rho, \hat\rho) = \int_{S^{d-1}} 1(\langle y, x_1\rangle \ge \hat\rho)\,d\sigma_{d-1}(y^*), \quad\text{and}
\tag{2.17}
\]
\[
P_2(u_1, u_2, u_3, \tilde\rho, \hat\rho) = \int_{S^{d-1}} 1(\langle y, x_1\rangle \ge \hat\rho)\,1(\langle y, x_2\rangle \ge \hat\rho)\,d\sigma_{d-1}(y^*)
\tag{2.18}
\]

where ρ̂ = 〈x_0, y_0〉. If two permutations of x_0 are at swap distance r, then their inner product is u(r) = 1 − r(m_0^{-1} + m_1^{-1}) from equation (2.3).

Figure 2.2: Illustration for Model 2. The original response vector is y_0 with y_0^T x_0 = ρ̂. We consider alternative y uniformly distributed on the surface of C(x_0; ρ̂), with examples y_1 and y_2. Around each such y_j there is a spherical cap of height ρ̂ that just barely includes x_c = x_0. We use p̂_2 = E_2(p(y; ρ̂)) and find an expression for E_2((p̂_2 − p(y; ρ̂))²).

Lemma 2. Let the projection of x_1 onto x_c be x_1 = u_1 x_c + √(1 − u_1²) x_1^*. Then the

single point inclusion probability from (2.17) is

\[
P_1(u_1, \tilde\rho, \hat\rho) =
\begin{cases}
1(\tilde\rho u_1 \ge \hat\rho), & u_1 = \pm 1 \text{ or } \tilde\rho = \pm 1\\
\sigma_{d-1}(C(x_1^*, \rho^*)), & u_1 \in (-1, 1),\ \tilde\rho \in (-1, 1)
\end{cases}
\tag{2.19}
\]

where ρ* = (ρ̂ − ρ̃ u_1)/√((1 − ρ̃²)(1 − u_1²)).

Proof. The projection of y onto x_c is y = ρ̃ x_c + √(1 − ρ̃²) y*. Now

\[
\langle y, x_1\rangle =
\begin{cases}
\tilde\rho u_1, & u_1 = \pm 1 \text{ or } \tilde\rho = \pm 1\\
\tilde\rho u_1 + \sqrt{1-\tilde\rho^2}\,\sqrt{1-u_1^2}\,\langle y^*, x_1^*\rangle, & u_1 \in (-1, 1),\ \tilde\rho \in (-1, 1)
\end{cases}
\]

and the result easily follows.


    We can now give a computable expression for p̃c and hence for p̂2 and p̂3.

Theorem 5. For −1 ≤ ρ̂ ≤ 1 and −1 ≤ ρ̃ ≤ 1,

\[
\tilde p_c = E_2(p(y, \hat\rho)) = \frac{1}{N}\sum_{r=0}^{m}\binom{m_0}{r}\binom{m_1}{r} P_1(u(r), \tilde\rho, \hat\rho)
\tag{2.20}
\]

where u(r) is given in equation (2.3), P_1(u(r), ρ̃, ρ̂) is given in equation (2.19), and ρ̃ = x_c^T y_0.

Proof. There are \binom{m_0}{r}\binom{m_1}{r} permutations of x_0 at swap distance r from x_c.
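A minimal Python sketch of (2.19) and (2.20) follows, reusing cap_volume from the sketch after Theorem 4. Here rho_t = 〈x_c, y_0〉 and rho_h = 〈x_0, y_0〉 are supplied by the user, and the sphere dimension is taken to be d = m_0 + m_1 − 2 for centered, unit-length vectors; that convention is an assumption here, not a statement from the text.

    from math import comb, sqrt
    # cap_volume(t, d) as in the sketch after Theorem 4.

    def P1(u1, rho_t, rho_h, d):
        """Single inclusion probability (2.19) under Model 2."""
        if abs(u1) == 1.0 or abs(rho_t) == 1.0:
            return 1.0 if rho_t * u1 >= rho_h else 0.0
        rho_star = (rho_h - rho_t * u1) / sqrt((1 - rho_t**2) * (1 - u1**2))
        return cap_volume(rho_star, d - 1)           # cap on S^{d-1}

    def p_tilde_c(m0, m1, rho_t, rho_h):
        """Finer approximation (2.20); rho_t = <x_c, y_0>, rho_h = <x_0, y_0>."""
        d = m0 + m1 - 2                              # assumed sphere dimension
        N = comb(m0 + m1, m0)
        total = 0.0
        for r in range(0, min(m0, m1) + 1):          # terms beyond min(m0, m1) vanish
            u_r = 1.0 - r * (1.0 / m0 + 1.0 / m1)    # u(r) from equation (2.3)
            total += comb(m0, r) * comb(m1, r) * P1(u_r, rho_t, rho_h, d)
        return total / N

Calling p_tilde_c(m0, m1, rho_hat, rho_hat) gives p̂_2, which conditions on x_c = x_0; for p̂_3, rho_t is instead the largest inner product max_i 〈y_0, x_i〉.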

    From (2.20) we see that p̃c can be computed in O(m) work. The mean squared

    error for p̃c is more complicated and will be more expensive. We need the double

    point inclusion probabilities and then we need to count the number of pairs xk,xl

    forming a given set of swap distances among xk,xl,xc.

Lemma 3. For j = 1, 2, let x_j be at swap distance r_j from x_c and let r_3 be the swap distance between x_1 and x_2. Let u_1, u_2, u_3 be the corresponding inner products given by (2.3). If there are equalities among x_1, x_2 and x_c, then the double point inclusion probability from (2.18) is

\[
P_2(u_1, u_2, u_3, \tilde\rho, \hat\rho) =
\begin{cases}
1(\tilde\rho \ge \hat\rho), & x_1 = x_2 = x_c\\
1(\tilde\rho \ge \hat\rho)\,P_1(u_2, \tilde\rho, \hat\rho), & x_1 = x_c \ne x_2\\
1(\tilde\rho \ge \hat\rho)\,P_1(u_1, \tilde\rho, \hat\rho), & x_2 = x_c \ne x_1\\
P_1(u_2, \tilde\rho, \hat\rho), & x_1 = x_2 \ne x_c.
\end{cases}
\]

If x_1, x_2 and x_c are three distinct points with min(u_1, u_2) = −1, then

\[
P_2(u_1, u_2, u_3, \tilde\rho, \hat\rho) =
\begin{cases}
1(-\tilde\rho \ge \hat\rho)\,P_1(u_2, \tilde\rho, \hat\rho), & u_1 = -1\\
1(-\tilde\rho \ge \hat\rho)\,P_1(u_1, \tilde\rho, \hat\rho), & u_2 = -1.
\end{cases}
\]


Otherwise −1 < u_1, u_2 < 1, and then

\[
P_2(u_1, u_2, u_3, \tilde\rho, \hat\rho) =
\begin{cases}
1(\tilde\rho u_1 \ge \hat\rho)\,1(\tilde\rho u_2 \ge \hat\rho), & \tilde\rho = \pm 1\\[4pt]
\displaystyle\int_{-1}^{1} \frac{\omega_{d-2}}{\omega_{d-1}} (1-t^2)^{\frac{d-1}{2}-1}\,1(t \ge \rho_1)\,1(t u_3^* \ge \rho_2)\,dt, & \tilde\rho \ne \pm 1,\ u_3^* = \pm 1\\[4pt]
\displaystyle\int_{-1}^{1} \frac{\omega_{d-2}}{\omega_{d-1}} (1-t^2)^{\frac{d-1}{2}-1}\,1(t \ge \rho_1)\,\sigma_{d-2}\Big(C\Big(x_2^{**},\ \tfrac{\rho_2 - t u_3^*}{\sqrt{1-t^2}\sqrt{1-u_3^{*2}}}\Big)\Big)\,dt, & \tilde\rho \ne \pm 1,\ |u_3^*| < 1
\end{cases}
\]

where

\[
u_3^* = \frac{u_3 - u_1 u_2}{\sqrt{1-u_1^2}\,\sqrt{1-u_2^2}}
\quad\text{and}\quad
\rho_j = \frac{\hat\rho - \tilde\rho u_j}{\sqrt{1-\tilde\rho^2}\,\sqrt{1-u_j^2}}, \quad j = 1, 2,
\tag{2.21}
\]

and x_2^{**} is the residual from the projection of x_2^* on x_1^*.

    Proof. See Section A.2.
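For the generic third branch above (three distinct points with |u_1|, |u_2| < 1, ρ̃ ≠ ±1 and |u_3^*| < 1), the one-dimensional integral can be done by numerical quadrature. The sketch below is illustrative only and again reuses cap_volume from the earlier sketch; it is not the implementation behind the numerical results reported later.

    from math import sqrt
    from scipy.integrate import quad
    from scipy.special import beta

    def P2_generic(u1, u2, u3, rho_t, rho_h, d):
        """Double inclusion probability (2.18), generic case of Lemma 3."""
        u3s = (u3 - u1 * u2) / (sqrt(1 - u1**2) * sqrt(1 - u2**2))
        rho1 = (rho_h - rho_t * u1) / (sqrt(1 - rho_t**2) * sqrt(1 - u1**2))
        rho2 = (rho_h - rho_t * u2) / (sqrt(1 - rho_t**2) * sqrt(1 - u2**2))
        w = 1.0 / beta((d - 1) / 2.0, 0.5)        # omega_{d-2} / omega_{d-1}

        def integrand(t):
            s = 1.0 - t * t
            if s <= 0.0:
                return 0.0
            inner = (rho2 - t * u3s) / (sqrt(s) * sqrt(1.0 - u3s**2))
            return w * s ** ((d - 3) / 2.0) * cap_volume(inner, d - 2)

        lo = max(rho1, -1.0)
        if lo >= 1.0:
            return 0.0
        value, _ = quad(integrand, lo, 1.0)
        return value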

Next we consider the swap configuration among x_1, x_2 and x_c. Let x_j be at swap distance r_j from x_c, for j = 1, 2. We let δ_1 be the number of positive components of x_c that are negative in both x_1 and x_2. Similarly, δ_2 is the number of negative components of x_c that are positive in both x_1 and x_2. See Figure 2.3. The swap distance between x_1 and x_2 is then r_3 = r_1 + r_2 − δ_1 − δ_2. Let r = (r_1, r_2) and δ = (δ_1, δ_2). We will study values of r_1, r_2, r_3, δ_1, δ_2 ranging over the following sets:

\[
\begin{aligned}
r_1, r_2 &\in R = \{1, \ldots, m\}\\
\delta_1 &\in D_1(r) = \{\max(0, r_1 + r_2 - m_0), \ldots, \min(r_1, r_2)\}\\
\delta_2 &\in D_2(r) = \{\max(0, r_1 + r_2 - m_1), \ldots, \min(r_1, r_2)\}, \quad\text{and}\\
r_3 &\in R_3(r) = \{\max(1, r_1 + r_2 - 2\min(r_1, r_2)), \ldots, \min(r_1 + r_2,\ m,\ m_0 + m_1 - r_1 - r_2)\}.
\end{aligned}
\]

Whenever the lower bound for one of these sets exceeds the upper bound, we take the set to be empty, and a sum over it to be zero. Note that while r_1 = 0 is possible, it corresponds to x_1 = x_c and we will handle that case specially, excluding it from R.


Figure 2.3: Illustration of r_1, r_2, δ_1 and δ_2. The points x_c, x_1 and x_2 each have m_0 negative and m_1 positive components. For j = 1, 2 the swap distance between x_j and x_c is r_j. There are δ_1 positive components of x_c where both x_1 and x_2 are negative, and δ_2 negative components of x_c where both x_j are positive.

The number of pairs (x_l, x_k) with a fixed r and δ is

\[
c(r, \delta) = \binom{m_0}{\delta_1}\binom{m_1}{\delta_2}\binom{m_0-\delta_1}{r_1-\delta_1}\binom{m_1-\delta_2}{r_1-\delta_2}\binom{m_0-r_1}{r_2-\delta_1}\binom{m_1-r_1}{r_2-\delta_2}.
\tag{2.22}
\]

Then the number of configurations given r_1, r_2 and r_3 is

\[
c(r_1, r_2, r_3) = \sum_{\delta_1\in D_1}\sum_{\delta_2\in D_2} c(r, \delta)\, 1(r_3 = r_1 + r_2 - \delta_1 - \delta_2).
\tag{2.23}
\]

We can now get an expression for the mean square under Model 2 which, combined with Theorem 5 for the mean, provides an expression for the mean squared error of p̃_c.


Theorem 6. For −1 ≤ ρ̂ ≤ 1 and −1 ≤ ρ̃ ≤ 1,

\[
\begin{aligned}
E_2(p(y, \hat\rho)^2) = \frac{1}{N^2}\bigg[\, & 1(\tilde\rho \ge \hat\rho)
+ 2\sum_{r=1}^{m}\binom{m_0}{r}\binom{m_1}{r} P_2(1, u(r), u(r), \tilde\rho, \hat\rho)\\
& + \sum_{r=1}^{m}\binom{m_0}{r}\binom{m_1}{r} P_1(u(r), \tilde\rho, \hat\rho)
+ \sum_{r_1\in R}\sum_{r_2\in R}\sum_{r_3\in R_3(r)} c(r_1, r_2, r_3)\, P_2(u_1, u_2, u_3, \tilde\rho, \hat\rho)\,\bigg]
\end{aligned}
\tag{2.24}
\]

where P_2(·) is the double inclusion probability in (2.18) and c(r_1, r_2, r_3) is the configuration count in (2.23).

    Proof. See Section A.3 of the Appendix.

In our experience, the cost of computing E_2(p(y, ρ̂)²) under Model 2 is dominated by the cost of the O(m³) integrals required to get the P_2(·) values in (2.24). The cost also includes an O(m⁴) component because c(r_1, r_2, r_3) is also a sum of O(m) terms, but it did not dominate the computation at the sample sizes we looked at (up to several hundred).
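Assembling the pieces, the following sketch evaluates (2.24) with the helper functions P1, P2_generic and c_r123 from the earlier sketches (all assumed names, not from the dissertation). The equality cases of Lemma 3 with x_1 = x_c or x_2 = x_c are handled directly; configurations where u_1, u_2 or u_3^* equal ±1 would need the remaining branches of Lemma 3 and are not treated, so this is a simplification rather than a full implementation.

    from math import comb

    def E2_second_moment(m0, m1, rho_t, rho_h):
        """Sketch of the Model 2 second moment (2.24)."""
        d = m0 + m1 - 2                 # assumed sphere dimension
        mmin = min(m0, m1)              # larger swap distances contribute nothing
        N = comb(m0 + m1, m0)

        def u_of(r):                    # inner product u(r), equation (2.3)
            return 1.0 - r * (1.0 / m0 + 1.0 / m1)

        ind = 1.0 if rho_t >= rho_h else 0.0
        total = ind                     # x_1 = x_2 = x_c term
        for r in range(1, mmin + 1):
            nr = comb(m0, r) * comb(m1, r)
            # x_1 = x_c != x_2 and its mirror image: Lemma 3 gives 1(rho_t >= rho_h) P_1
            total += 2.0 * nr * ind * P1(u_of(r), rho_t, rho_h, d)
            # x_1 = x_2 != x_c
            total += nr * P1(u_of(r), rho_t, rho_h, d)
        for r1 in range(1, mmin + 1):
            for r2 in range(1, mmin + 1):
                for r3 in range(max(1, abs(r1 - r2)), min(r1 + r2, mmin) + 1):
                    cnt = c_r123(m0, m1, r1, r2, r3)
                    if cnt:
                        total += cnt * P2_generic(u_of(r1), u_of(r2), u_of(r3),
                                                  rho_t, rho_h, d)
        return total / N**2

The triple loop makes O(m³) calls to the quadrature routine, matching the cost profile described above.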

    2.5 Generalized Stolarsky Invariance

    Here we obtain the Model 2 results in a different way, by extending the work by

    Brauchart and Dick (2013). They introduced a weight on the height t of the spherical

cap in the average. We now apply a weight function to the inner product 〈z, x_c〉 between the center z of the spherical cap and a special point x_c.

Theorem 7. Let x_0, ..., x_{N-1} be arbitrary points in S^d and let v(·) and h(·) be positive functions in L²([−1, 1]). Then for any x′ ∈ S^d, the following equation holds:

\[
\begin{aligned}
\int_{-1}^{1} v(t)\int_{S^d} h(\langle z, x'\rangle)\,\bigg|\sigma_d(C(z;t)) - \frac{1}{N}\sum_{k=0}^{N-1} 1_{C(z;t)}(x_k)\bigg|^2\,d\sigma_d(z)\,dt
= {} & \frac{1}{N^2}\sum_{k,l=0}^{N-1} K_{v,h,x'}(x_k, x_l)
+ \int_{S^d}\int_{S^d} K_{v,h,x'}(x, y)\,d\sigma_d(x)\,d\sigma_d(y)\\
& - \frac{2}{N}\sum_{k=0}^{N-1}\int_{S^d} K_{v,h,x'}(x, x_k)\,d\sigma_d(x)
\end{aligned}
\tag{2.25}
\]

where K_{v,h,x'} : S^d × S^d → \mathbb{R} is a reproducing kernel defined by

\[
K_{v,h,x'}(x, y) = \int_{-1}^{1} v(t)\int_{S^d} h(\langle z, x'\rangle)\, 1_{C(z;t)}(x)\, 1_{C(z;t)}(y)\,d\sigma_d(z)\,dt.
\tag{2.26}
\]

    Proof. See Section A.4 of the Appendix.

    Remark. We will use this result for x′ = xc, where xc is one of the N given points.

The theorem holds for general x′ ∈ S^d, but the result is computationally and statistically more attractive when x′ = x_c.
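As a quick consistency check, added here for orientation: taking h ≡ 1 in (2.26) gives K_{v,1,x′} = K_v, and

\[
\int_{S^d} K_v(x, x_k)\,d\sigma_d(x) = \int_{-1}^{1} v(t)\,\sigma_d(C(\cdot\,; t))^2\,dt = \int_{S^d}\int_{S^d} K_v(x, y)\,d\sigma_d(x)\,d\sigma_d(y)
\]

for every x_k, so the last two terms on the right of (2.25) combine to −∫∫ K_v dσ_d dσ_d and (2.25) reduces to (2.8).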

We now show that the second moment in Theorem 6 holds as a special limiting case of Theorem 7. In addition to v_ε from Section 2.3 we introduce η = (η_1, η_2) ∈ (0, 1)² and

\[
h_\eta(s) = \eta_2 + \frac{1}{\eta_1\,\frac{\omega_{d-1}}{\omega_d}\,(1-s^2)^{d/2-1}}\, 1(\tilde\rho \le s \le \tilde\rho + \eta_1).
\tag{2.27}
\]

    Using these results we can now establish the following theorem, which provides

    the second moment of p(y, ρ̂) under Model 2.

Theorem 8. Let x_0 ∈ S^d be the centered and scaled vector from an experiment with binary X_i of which m_0 are negative and m_1 are positive. Let x_0, x_1, ..., x_{N-1} be the N = \binom{m_0+m_1}{m_0} distinct permutations of x_0. Let x_c be one of the x_k and let p̃_c be given by (2.15). Then

\[
E_2(p(y, \hat\rho)^2) = \frac{1}{N^2}\sum_{k,l=0}^{N-1}\int_{S^{d-1}} 1(\langle y, x_k\rangle \ge \hat\rho)\, 1(\langle y, x_l\rangle \ge \hat\rho)\,d\sigma_{d-1}(y^*)
\]

where y = ρ̃ x_c + √(1 − ρ̃²) y*.

    Proof. The proof uses Theorem 7 with a sequence of h defined in (2.27) and v defined

    in (2.10). See Section A.5 of the appendix.

    This result shows that we can use the invariance principle to derive the second

    moment of p(y, ρ̂) under Model 2. The mean square in Theorem 8 is consistent with

    the second moment equation (2.13) in Proposition 1.

2.6 Two-sided p-values

    In statistical applications it is more usual to report two-sided p-values. A conservative

    approach is to use 2 min(p, 1− p) where p is a one-sided p-value. A sharper choice is

\[
p = \frac{1}{N}\sum_{k=0}^{N-1} 1(|x_k^\mathsf{T} y_0| \ge |\hat\rho|).
\tag{2.28}
\]

    This choice changes our Model 2 estimate. It also changes the second moment of our

    Model 1 estimate.

The two-sided version of the estimate p̂_1(ρ̂) is 2σ_d(C(y; |ρ̂|)), the same as if we had doubled a one-tailed estimate. Also E_1(p) = p̂_1 in the two-tailed case. We now consider the mean square for the two-tailed estimate under Model 1. For x_1, x_2 ∈ S^d with u = x_1^T x_2, the two-tailed double inclusion probability under Model 1 is

\[
\tilde V_2(u; t, d) = \int_{S^d} 1(|z^\mathsf{T} x_1| \ge |t|)\, 1(|z^\mathsf{T} x_2| \ge |t|)\,d\sigma_d(z).
\]

Writing 1(|z^T x_i| ≥ |t|) = 1(z^T x_i ≥ |t|) + 1(z^T(−x_i) ≥ |t|) for i = 1, 2 and expanding the product, we get

\[
\tilde V_2(u; t, d) = 2V_2(u; |t|, d) + 2V_2(-u; |t|, d).
\]

By replacing V_2(u; t, d) with Ṽ_2(u; t, d) and p̂_1(t) with 2σ_d(C(y; |t|)) in Theorem 4, we get the variance of two-sided p-values under Model 1.

To obtain corresponding formulas under Model 2, we use the usual notation. Let u_j = x_j^T x_c for j = 1, 2, and let u_3 = x_1^T x_2. Let the projection of y on x_c be y = ρ̃ x_c + √(1 − ρ̃²) y*. Now

\[
\tilde P_1(u_1, \tilde\rho, \hat\rho) = \int_{S^{d-1}} 1(|\langle y, x_1\rangle| \ge |\hat\rho|)\,d\sigma_{d-1}(y^*), \quad\text{and}
\tag{2.29}
\]
\[
\tilde P_2(u_1, u_2, u_3, \tilde\rho, \hat\rho) = \int_{S^{d-1}} 1(|\langle y, x_1\rangle| \ge |\hat\rho|)\, 1(|\langle y, x_2\rangle| \ge |\hat\rho|)\,d\sigma_{d-1}(y^*)
\tag{2.30}
\]

are the appropriate single and double inclusion probabilities.

After writing 1(|〈y, x_i〉| ≥ |ρ̂|) = 1(〈y, x_i〉 ≥ |ρ̂|) + 1(〈y, −x_i〉 ≥ |ρ̂|) for i = 1, 2 and expanding the product, we get

\[
\begin{aligned}
\tilde P_1(u_1, \tilde\rho, \hat\rho) &= P_1(u_1, \tilde\rho, |\hat\rho|) + P_1(-u_1, \tilde\rho, |\hat\rho|), \quad\text{and}\\
\tilde P_2(u_1, u_2, u_3, \tilde\rho, \hat\rho) &= P_2(u_1, u_2, u_3, \tilde\rho, |\hat\rho|) + P_2(-u_1, u_2, -u_3, \tilde\rho, |\hat\rho|)\\
&\quad + P_2(u_1, -u_2, -u_3, \tilde\rho, |\hat\rho|) + P_2(-u_1, -u_2, u_3, \tilde\rho, |\hat\rho|).
\end{aligned}
\]

Changing P_1(u_1, ρ̃, ρ̂) and P_2(u_1, u_2, u_3, ρ̃, ρ̂) to P̃_1(u_1, ρ̃, ρ̂) and P̃_2(u_1, u_2, u_3, ρ̃, ρ̂) respectively in Theorems 5 and 6, we get the first and second moments for two-sided p-values under Model 2.
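A minimal sketch of the two-sided single inclusion probability above and of the resulting two-sided version of (2.20), reusing P1 from the sketch after Theorem 5 and the assumed convention d = m_0 + m_1 − 2:

    from math import comb

    def P1_two_sided(u1, rho_t, rho_h, d):
        """Two-sided single inclusion probability, from the display above."""
        return P1(u1, rho_t, abs(rho_h), d) + P1(-u1, rho_t, abs(rho_h), d)

    def p_tilde_c_two_sided(m0, m1, rho_t, rho_h):
        """Theorem 5 with P_1 replaced by its two-sided version."""
        d = m0 + m1 - 2
        N = comb(m0 + m1, m0)
        total = 0.0
        for r in range(0, min(m0, m1) + 1):
            u_r = 1.0 - r * (1.0 / m0 + 1.0 / m1)
            total += comb(m0, r) * comb(m1, r) * P1_two_sided(u_r, rho_t, rho_h, d)
        return total / N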

For a two-sided p-value, p̂_3 is calculated with x_c̃ where c̃ = arg max_i |〈y_0, x_i〉|. For m_0 = m_1, c̃ = c = arg max_i 〈y_0, x_i〉, but the result may differ significantly for unequal sample sizes.


    2.7 Numerical Results

We consider two-sided p-values in this section. First we evaluate the accuracy of p̂_1, the simple spherical cap volume approximate p-value. We considered m_0 = m_1 in a range of values from 5 to 200. The values p̂_1 ranged from just below 1 to 2 × 10^{-30}. We judge the accuracy of this estimate by its root mean squared error. Under Model 1 this is (E(p̂_1(ρ) − p(y, ρ))²)^{1/2} for y ∼ U(S^d). Figure 2.4a shows this RMSE decreasing towards 0 as p̂_1 goes to 0 with ρ going to 1. The RMSE also decreases with increasing sample size, as we would expect from the central limit theorem.

    As seen in Figures 2.4a and 2.4b, the RMSE is not monotone in p̂1. Right at

    p̂1 = 1 we know that RMSE = 0 and around 0.1 there is a dip. The practically

    interesting values of p̂1 are much smaller than 0.1, and the RMSE is monotone for

    them.

A problem with p̂_1 is that it can approach 0 even though p > 1/N. The Model 1 RMSE does not reflect this problem. By studying E_2((p̂_1(ρ) − p(y, ρ))²)^{1/2}, we get a different result. In Figure 2.4c, the RMSE of p̂_1 under Model 2 reaches a plateau as p̂_1 goes to 0. The Model 2 RMSE reveals the flaw in p̂_1 going below 1/N.

    The estimator p̂2 = p̃0 performs better than p̂1 because it makes more use of the

    data, and it is never below 1/N . As seen in Figure 2.4d, the RMSE of p̂2 very closely

matches p̂_2 itself as p̂_2 decreases to zero. That is, the relative error |p̂_2 − p|/p̂_2 is well behaved for small p-values. Also as p̂_2 drops to the granularity limit 1/N, its RMSE drops to 0.

The estimators p̂_1 and p̂_2 do not differ much for larger p-values, as seen in Figure 2.5a. But in the limit as ρ̂ → 1 we see that p̂_1 → 0, while p̂_2 approaches the granularity limit 1/N instead.

Figure 2.5b compares the RMSE of the two estimators under Model 2. As expected, p̂_2 is more accurate. It also shows that the biggest differences occur only

    when p̂1 goes below 1/N .

To examine the behavior of p̂_2 more closely, we plot its coefficient of variation in Figure 2.6. We see that the relative uncertainty in p̂_2 is not extremely large. Even when the estimated p-values are as small as 10^{-30}, the coefficient of variation is below 5.

    In Section 2.4, we mentioned another choice for xc. It was p̂3 = p̃c, where xc is the

    closest permutation of x0 to y0. We compare p̂3 to p̂2 in Figures 2.7 and 2.8. We fixed

the observed x_0 and ρ, and then randomly sampled 100 vectors y_0 with 〈y_0, x_0〉 = ρ. All 100 of the y_0 lead to the same value for p̂_2 and its standard deviation. We get 100

    different estimates for p̂3 and its standard deviation. We varied m0 and m1, choosing

    ρ so that the values of p̂2 are comparable at different sample sizes. Figure 2.7 shows

the estimates p̂_3 with reference points for p̂_2. As expected, p̂_3 tends to be larger than

    p̂2. Figure 2.8 shows the sample RMSEs for p̂3 with reference points for the RMSE

    for p̂2. The top row of plots has m0 = m1 while the bottom row has m1 = 2m0. The

left column of plots is at larger p-values than the rightmost column. We see that

    neither choice always has the smaller RMSE, but p̂2 is usually more accurate.

    2.8 Comparison to saddlepoint approximation

    Many approximation methods have been proposed for permutation tests. Zhou et al.

    (2009) fit approximations by moments in the Pearson family. Larson and Owen (2015)

    fit Gaussian and beta approximations to linear statistics and gamma approximations

    to quadratic statistics for gene set testing problems. Knijnenburg et al. (2009) fit

    generalized extreme value distributions to the tails of sampled permutation values.

These approximations do not come with an all-inclusive p-value that accounts for both numerical and sampling uncertainty. The sampling method does come with such a p-value if we add one to the numerator and denominator as Barnard (1963) suggests. But that method cannot attain very small p-values. Reasonable power to attain p ≤ ε requires a sample of somewhere between 3/ε and 19/ε random permutations (Larson

    and Owen, 2015).
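For contrast, here is a minimal sketch of the plain Monte Carlo permutation p-value with the add-one adjustment of Barnard (1963) for a two-sided linear statistic; x, y and the number of permutations B are placeholders. The smallest attainable value is 1/(B + 1), which is why p-values near 10^{-6} or below require millions of permutations.

    import numpy as np

    def barnard_pvalue(x, y, B=9999, rng=None):
        """Add-one Monte Carlo permutation p-value for the two-sided
        linear statistic |x^T y| under random relabelling of x."""
        rng = np.random.default_rng(rng)
        t_obs = abs(x @ y)
        exceed = sum(abs(rng.permutation(x) @ y) >= t_obs for _ in range(B))
        return (exceed + 1) / (B + 1)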

The strongest theoretical support for approximate p-values comes from saddlepoint approximations. Reid (1988) surveys saddlepoint approximations and Robinson (1982) develops them for permutation tests of the linear statistics we have considered here. When the true p-value is p, the saddlepoint approximation p̂_s satisfies

    p̂s = p(1+O(1/n)). Because we do not know the implied constant in O(1/n) or the n
