EFFICIENT PERMUTATION P -VALUE ESTIMATION FOR
GENE SET TESTS
A DISSERTATION
SUBMITTED TO THE DEPARTMENT OF STATISTICS
AND THE COMMITTEE ON GRADUATE STUDIES
OF STANFORD UNIVERSITY
IN PARTIAL FULFILLMENT OF THE REQUIREMENTS
FOR THE DEGREE OF
DOCTOR OF PHILOSOPHY
Yu He
July 2016
© Copyright by Yu He 2016. All Rights Reserved.
I certify that I have read this dissertation and that, in my opinion, it
is fully adequate in scope and quality as a dissertation for the degree
of Doctor of Philosophy.
(Art B. Owen) Principal Adviser
I certify that I have read this dissertation and that, in my opinion, it
is fully adequate in scope and quality as a dissertation for the degree
of Doctor of Philosophy.
(Trevor Hastie)
I certify that I have read this dissertation and that, in my opinion, it
is fully adequate in scope and quality as a dissertation for the degree
of Doctor of Philosophy.
(Wing H. Wong)
Approved for the Stanford University Committee on Graduate Studies
Abstract
In a genome-wide expression study, gene set testing is often used to find potential
gene sets that correlate with a treatment (disease, drug, phenotype, etc.). A gene
set may contain tens to thousands of genes, and genes within a gene set are generally
correlated. Permutation tests are the standard approach for getting p-values for these
gene set tests. Plain Monte Carlo methods that generate random permutations can be
computationally infeasible for small p-values. Ackermann and Strimmer (2009) find
two families of test statistics that achieve the best overall performance: a linear family
and a quadratic family. This dissertation first reviews the relevant background of gene
set testing and permutation tests, and then provides three alternative approaches to
estimate small permutation p-values efficiently.
The first approach focuses on the linear statistic. Observing that the p-value can
be written as the proportion of points lying in a spherical cap, we approximate the
p-value by the volume of a spherical cap. Error estimates can be derived from a
generalized Stolarsky's invariance principle, and alternative probabilistic proofs are
provided.
The second approach focuses on the quadratic statistic. Importance sampling is
used to estimate the area of the (continuous) significant region on the sphere, and
the volume of the region is used as an approximation to the p-value, which is a
discrete proportion. Different proposal distributions are studied and compared.
The third approach estimates the p-value with nested sampling. It may work for
both the linear and the quadratic statistic. Similar ideas can be found in literatures
spanning combinatorics, sequential Monte Carlo, Bayesian computation, rare
event estimation, and network reliability, and bear different names, e.g., approximate
counting, nested sampling, subset simulation, and multilevel splitting. We give a
thorough review of the literature in these different areas, and apply the technique to
gene set testing with the quadratic test statistic.
Finally, we compare the proposed methods with plain Monte Carlo and saddle-
point approximation on three expression studies in Parkinson’s Disease patients.
This work was supported by the US National Science Foundation under grant
DMS-1521145.
Acknowledgement
It is my pleasure to thank the many people who made this thesis possible.
First and foremost, I owe a debt of gratitude to my advisor Professor Art Owen.
With a contagious passion for statistics, a dedicated pursuit for high quality research
and diligence that brings overflowing ideas, Art has set a great example for me. Art
has also guided me through the research journey with encouragement, great patience,
sound advice and timely help. His guidance has been the compass in the wilderness,
without which I would have been lost. Although Art is often found juggling many
meetings, lectures, office hours and emails, he has always kept his door open and
welcomed a discussion anytime. I am deeply grateful for all the time he has spent
with me.
I would like to thank Professor Wing Wong and Professor Trevor Hastie for reading
my thesis and providing insightful feedback. I also thank Professor Lester Mackey
for serving on my dissertation committee and Professor Hua Tang for chairing my
dissertation committee.
I am very fortunate to have met many great math and statistics teachers. I would
like to thank my high school math teacher Songbin Lan, my undergraduate teach-
ers at Nanjing University and University of Toronto (especially Andrey Feuerverger
and Larry Guth), my graduate teachers at Stanford (especially Trevor Hastie, Tze
Leung Lai and David Siegmund). I thank them for their constant inspiration and
encouragement.
I thank Qingyuan Zhao, Murat A. Erdogdu, Anand Rajaraman and Jure Leskovec
for the collaboration on a data mining paper that complements my thesis work. I
would also like to express my gratitude to Kinjal Basu and Qingyuan Zhao for the
collaboration on the work that appears in Chapter 2. The time we spent together,
struggling or cheering, is among the fondest memories of my time at Stanford.
Many wonderful friends have enriched and added color to my life at Stanford.
Thank you to everyone in the Owen research group and all other students in the
department for providing a friendly, supportive and intellectually stimulating envi-
ronment. Special thanks to my roommate Jingshu, my officemates Pragya and Xiaoy-
ing for their constant companionship, to the 206 family Qingyuan, Bhaswar, Murat,
Joshua and Pooja for their incredible friendship that can instantly sweep away any
anxiety or stress, and to many other friends who made me feel at home in the depart-
ment. I am especially grateful to my seniors Su Chen, Jeremy Shen, Pei He, Yi Liu
and Ya Xu for their encouragement and immense help in my career development.
The luckiest thing that has ever happened to me is to be born into my family. It
is my father who first motivated my love for math, and it is my mother’s resilience
and optimism in the face of adversity that prepared me for the many obstacles that
I encountered in research and in life. I am most grateful for their unconditional
love and support for me, which is always my pillar of strength. Last but not least,
I thank Qingyuan again for his companionship as an intimate friend, an inspiring
collaborator, a considerate partner and a strong emotional pillar. I cannot imagine
getting through everything without his support and love.
Contents
Abstract iv
Acknowledgement vi
1 Introduction 1
1.1 Background: gene set enrichment analysis (GSEA) . . . . . . . . . . 1
1.2 Null hypothesis in GSEA . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3 Permutation test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.4 Test statistic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.5 Notations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2 Approximation via Stolarsky's Invariance 10
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.2 Background and notation . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.3 Approximation via spherical cap volume . . . . . . . . . . . . . . . . 16
2.4 A finer approximation to the p-value . . . . . . . . . . . . . . . . . . 20
2.5 Generalized Stolarsky Invariance . . . . . . . . . . . . . . . . . . . . . 26
2.6 Two sided p-values . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.7 Numerical Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
2.8 Comparison to saddlepoint approximation . . . . . . . . . . . . . . . 31
2.9 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3 Advances in importance sampling 41
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.2 Importance sampling and control variates . . . . . . . . . . . . . . . . 43
3.3 Regret bounds and convexity . . . . . . . . . . . . . . . . . . . . . . 46
3.3.1 Mixture importance sampling . . . . . . . . . . . . . . . . . . 47
3.3.2 Multiple importance sampling . . . . . . . . . . . . . . . . . . 50
3.3.3 Convexity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
3.4 Choosing α . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
3.4.1 Bounding αj away from zero . . . . . . . . . . . . . . . . . . . 54
3.4.2 Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
3.5 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
3.5.1 Singular function . . . . . . . . . . . . . . . . . . . . . . . . . 57
3.5.2 Rare event . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
3.5.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
3.6 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
4 Approximation with importance sampling 65
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
4.2 Geometry of the quadratic spherical cap . . . . . . . . . . . . . . . . 66
4.3 Three sequential importance sampling algorithms . . . . . . . . . . . 68
4.3.1 Uniform sampling from Sd with polar coordinates . . . . . . . . 69
4.3.2 Sequential sampling from Sd(λ) starting from the largest eigenvalue . . . . . . 71
4.3.3 Sequential sampling from Sd(λ) starting from the smallest eigenvalue . . . . . . 75
4.3.4 Sequential sampling from Sd(λ) with low rank matrix Σ . . . . 76
4.3.5 Simulation results for the three sequential importance algorithms 78
4.4 From continuous approximation to the exact p-value . . . . . . . . . . 84
5 Subset simulation 88
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
5.2 Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
5.2.1 Estimation of P ∗ . . . . . . . . . . . . . . . . . . . . . . . . . 89
5.2.2 Uniform sampling on A` . . . . . . . . . . . . . . . . . . . . . 89
5.2.3 Quick update for T (x′) . . . . . . . . . . . . . . . . . . . . . . 92
5.2.4 Choosing q . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
5.2.5 Algorithm for estimating P ∗ . . . . . . . . . . . . . . . . . . . 92
5.3 Simulation results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
6 Real data example 103
A Appendix 112
A.1 Proof of Theorem 3 (Limiting invariance) . . . . . . . . . . . . . . . . 112
A.2 Proof of Lemma 3 (Double inclusion for Model 2) . . . . . . . . . . . 114
A.3 Proof of Theorem 6 (Second moment under Model 2) . . . . . . . . . 116
A.4 Proof of Theorem 7 (Location weighted invariance) . . . . . . . . . . 117
A.5 Proof of Theorem 8 (Spatially weighed invariance) . . . . . . . . . . 120
A.6 Proof of Theorem 9 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
A.7 Proof of Corollaries 1 and 2 . . . . . . . . . . . . . . . . . . . . . . . 123
A.8 Proof of Lemma 4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
A.9 Proof of Theorem 11 . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
A.10 Proof of Theorem 12 . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
A.11 Proof of Theorem 13 . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
A.12 Algorithms and Figures . . . . . . . . . . . . . . . . . . . . . . . . . . 127
List of Tables
2.1 Maximal Z scores observed for p̂2 and p̂3. . . . . . . . . . . . . . . . 40
3.1 Singular function example. The estimate µ̂ was computed from 500,000
observations using the sampler given in the first column. Control vari-
ates were used in two of those samples. The final columns give variance
reduction factors compared to plain Monte Carlo and compared to uni-
form mixture importance sampling with no control variates. . . . . . 60
3.2 Top 10 mixture components N(xk, σr²I5) for the singular integrand in α∗∗, which uses control variates. D denotes the defensive mixture. The last columns are mean and sd of αj over 5000 simulations. . . . 61
3.3 Top 10 mixture components N(xk, σr²I5) for the singular integrand in α∗, which does not use control variates. D denotes the defensive mixture. The last columns are mean and sd of αj over 5000 simulations. . . . 62
3.4 Rare event example. The estimate µ̂ was computed from 100,000 obser-
vations using the sampler given in the first column. Control variates
were used in two of those samples. The next columns give variance
reduction factors compared to plain Monte Carlo and compared to
uniform mixture importance sampling with no control variates. The
final column compares actual squared error with its sample estimate. 63
3.5 Top 10 mixture components N(zk, σr²I2) for the singular integrand in α∗, which does not use control variates. D denotes the defensive mixture. The last columns are mean and sd of αj over 5000 simulations. . . . 64
3.6 Average running times in seconds for four estimators on two examples. 64
4.1 Four groups of data sets with different sizes . . . . . . . . . . . . . . 78
4.2 Comparison of sampling with pl and ql for one data set in GPI with P∗ = 8.426×10⁻². N(Sd(λ))/N is the proportion of points lying in Sd(λ). . . . 80
4.3 Data set DS∗ in GPI with P∗ = 1.083×10⁻⁴. Here Nql(Sd(λ))/Nql ≠ 1, i.e. ql fails to sample exclusively from Sd(λ) because of numerical inaccuracy. . . . 81
4.4 Comparison of sampling with ps and qs on data set DS∗, with P∗ = 1.083×10⁻⁴. . . . 83
4.5 Comparison of sampling with pr and qr on data set DS∗ with P∗ = 1.083×10⁻⁴. . . . 83
4.6 Data set DS∗ with P∗ = 1.083×10⁻⁴ is considered. We estimate P∗ by projecting continuous samples from pr and qr and calculate with formula (4.17). N(Sd(T0;Q,Λ))/N is the proportion of g(a) lying in Sd(T0;Q,Λ). . . . 85
4.7 Data set DS∗ with P∗ = 1.083×10⁻⁴ is considered. We compare sampling with pr, qU and qα∗ and estimate with formula (4.17). N(Sd(T0;Q,Λ))/N denotes the proportion of g(a) lying in Sd(T0;Q,Λ). . . . 85
4.8 Data set DS∗ with P∗ = 1.083×10⁻⁴ is considered. We compare sampling with pr, qU and qα∗∗ and estimate with control variates. N(Sd(T0;Q,Λ))/N denotes the proportion of g(a) lying in Sd(T0;Q,Λ). . . . 86
5.1 Comparison of Monte Carlo sampling and subset simulation on data set DS∗ with P∗ = 1.083×10⁻⁴. We run K = 50 independent subset simulations with n = 1000, q = 0.2, B = 20. . . . 96
6.1 Three data sets used for non-permutation GSEA. . . . . . . . . . . . 105
6.2 Kendall's τ metric for Moran set with 10⁻⁵ ≤ p̂MC ≤ 10⁻⁴. . . . 105
6.3 Running time for all gene sets in different data sets with the linear statistic (in seconds). p̂MC are run with 10⁶ samples. Extra time is needed for gene sets with p̂MC < 10⁻⁴ to generate 10⁷ samples. . . . 110
6.4 Running time for all gene sets in different data sets with the quadratic statistic (in seconds). p̂MC are run with 10⁶ samples. Extra time is needed for gene sets with p̂MC < 10⁻⁴ to generate 10⁷ samples. . . . 110
6.5 Kendall's τ metric for Moran set with 10⁻⁵ ≤ p̂MC ≤ 10⁻⁴. We focus on the gene sets in Moran data with linear statistic p-values 10⁻⁵ < p̂MC < 10⁻⁴ because these gene sets have small p-values, yet reliable Monte Carlo estimates p̂MC as gold standards, computed with 10⁷ Monte Carlo samples. The methods that are closest to Monte Carlo (the gold standard) are p̂1, p̂2 and p̂3. . . . 110
List of Figures
1.1 Framework for the construction of a gene set statistic . . . . . . . . . 6
2.1 Illustration for Model 1. The point y is uniformly distributed over Sd. The small open circles represent permuted vectors xk. The point y0 is the observed value of y. The circle around it goes through x0 and represents a spherical cap of height y0ᵀx0. A second spherical cap of equal volume is centered at y = y1. We study moments of p(y; ρ̂), the fraction of xk in the cap centered at random y. . . . 17
2.2 Illustration for Model 2. The original response vector is y0 with y0ᵀx0 = ρ̂. We consider alternative y uniformly distributed on the surface of C(x0; ρ̂) with examples y1 and y2. Around each such yj there is a spherical cap of height ρ̂ that just barely includes xc = x0. We use p̂2 = E2(p(y; ρ̂)) and find an expression for E2((p̂2 − p(y; ρ̂))²). . . . 22
2.3 Illustration of r1, r2, δ1 and δ2. The points xc, x1 and x2 each have m0 negative and m1 positive components. For j = 1, 2 the swap distance between xj and xc is rj. There are δ1 positive components of xc where both x1 and x2 are negative, and δ2 negative components of xc where both xj are positive. . . . 25
2.4 RMSEs for p̂1 and p̂2 under Models 1 and 2. The x-axis shows the estimate p̂ as ρ varies from 1 to 0. Here m0 = m1. Plots with m0 ≠ m1 are similar. . . . 32
2.5 Comparison of p̂1 and p̂2. In (a), log10(p̂2) is plotted against log10(p̂1)
for varying ρ’s. The black line is the 45 degree line. In (b), the ratio of
RMSEs for p̂1 and p̂2 is plotted against log10(p̂1). The x-axis is log10(p̂1). 33
2.6 The coefficient of variation for p̂2 with varying ρ’s. . . . . . . . . . . 33
2.7 Comparison of p̂3 versus p̂2. For a given triple (m0, m1, ρ̂), we randomly sample 100 vectors y0 with x0ᵀy0 = ρ̂. By symmetry, x0 can be any permutation. We get 100 different p̂3 and a common p̂2 for each triple (m0, m1, ρ̂). In the two panels on the left, ρ̂'s are chosen to give two-sided p̂1(ρ̂) = 2×10⁻¹⁰ with various dimensions (m0, m1). The two panels on the right correspond to two-sided p̂1(ρ̂) = 2×10⁻²⁰. Estimates for two-sided p-values are plotted on the y-axis, with p̂3 plotted as black dots with distributions and p̂2 as red crosses. . . . 34
2.8 Comparison of RMSE(p̂3) versus RMSE(p̂2). The same simulation setting as described in Figure 2.7. The red crosses and black dots are estimated RMSE(p̂2) and RMSE(p̂3) under Model 2 with centers c = 0 and c = arg max0≤i
5.1 Comparison of σ̂1(P̂∗1:K) and σ̂2(P̂∗i)'s. With the same K = 50 simulations in Table 5.1, the sample standard deviation of all P̂∗i's is σ̂1(P̂∗1:K) = 2.813×10⁻⁵. We have 50 stand-alone estimates σ̂2(P̂∗i) computed with eq. (5.2). We plot the z-scores of the P̂∗i's computed with σ̂1(P̂∗1:K) and σ̂2(P̂∗i) respectively in Figures 5.1b and 5.1c. We plot the histogram of the σ̂2(P̂∗i)'s in Figure 5.1a with σ̂1(P̂∗1:K) added as the dashed reference line. . . . 97
5.2 Simulation result for GPI. For Monte Carlo, NMC = 10⁸. For subset simulation, n = 1000, B = 40, and K = 50 for each data set. . . . 99
5.3 Simulation result for GPII. For Monte Carlo, NMC = 10⁸. For subset simulation, n = 1000, B = 40, and K = 50 for each data set. . . . 100
5.4 Simulation result for GPIII. For Monte Carlo, NMC = 10⁸. For subset simulation, n = 1000, B = 120, and K = 50 for each data set. . . . 101
5.5 Simulation result for GPIV. For Monte Carlo, NMC = 10⁸. For subset simulation, n = 1000, B = 120, and K = 50 for each data set. . . . 102
6.1 Moran: scatter plot for linear statistic . . . . . . . . . . . . . . . . . . 106
6.2 Moran: scatter plot for linear statistic. Gene sets are plotted with p-values satisfying 10⁻⁵ < p̂MC < 10⁻⁴, with a total of 190 gene sets. . . . 107
6.3 Scherzer: scatter plot for linear statistic . . . . . . . . . . . . . . . . . 108
6.4 Scherzer: scatter plot for linear statistic. Gene sets are plotted with p-values satisfying 10⁻⁵ < p̂MC < 10⁻³, with a total of 15 gene sets. . . . 109
6.5 Comparison of p̂MC and p̂SS on Zhang, Moran and Scherzer data set
for quadratic statistic. . . . . . . . . . . . . . . . . . . . . . . . . . . 111
A.1 For each of 40 data sets in GPI, we obtain Nps = 5×10⁷ and Nqs = 5×10⁵ samples from ps and qs respectively and estimate µ. Figure A.1a plots µ̂qs versus µ̂ps. Figure A.1b plots the estimated relative error (µ̂ps − P∗)/P∗ versus P∗. Figure A.1c plots VRFN(qs; ps) versus µ̂ps. Figure A.1d plots VRFt(qs; ps) versus µ̂ps. . . . 134
A.2 For each of 40 data sets in GPI, we obtain Npr = 5×10⁷ and Nqr = 5×10⁵ samples from pr and qr respectively and estimate µ. Figure A.2a plots µ̂qr versus µ̂pr. Figure A.2b plots the estimated relative error (µ̂pr − P∗)/P∗ versus P∗. Figure A.2c plots VRFN(qr; pr) versus µ̂pr. Figure A.2d plots VRFt(qr; pr) versus µ̂pr. . . . 135
A.3 For each of 40 data sets in GPII, we obtain Nps = 5×10⁷ and Nqs = 5×10⁵ samples from ps and qs respectively and estimate µ. Figure A.3a plots µ̂qs versus µ̂ps. Figure A.3b plots the estimated relative error (µ̂ps − P∗)/P∗ versus P∗. Figure A.3c plots VRFN(qs; ps) versus µ̂ps. Figure A.3d plots VRFt(qs; ps) versus µ̂ps. . . . 136
A.4 For each of 40 data sets in GPIII, we obtain Npr = 10⁸ and Nqr = 10⁸ samples from pr and qr respectively and estimate µ. Figure A.4a plots µ̂qr versus µ̂pr. Figure A.4b plots the ratio µ̂pr/µ̂qr versus µ̂qr. Figure A.4c plots VRFN(qr; pr) versus µ̂pr. Figure A.4d plots VRFt(qr; pr) versus µ̂pr. VRFN and VRFt are computed with equations (4.14), (4.15) and (4.16). . . . 137
A.5 For each of 40 data sets in GPIV, we obtain Npr = 10⁸ and Nqr = 10⁸ samples from pr and qr respectively and estimate µ. Figure A.5a plots µ̂qr versus µ̂pr. Figure A.5b plots the ratio µ̂pr/µ̂qr versus µ̂qr. Figure A.5c plots VRFN(qr; pr) versus µ̂pr. Figure A.5d plots VRFt(qr; pr) versus µ̂pr. VRFN and VRFt are computed with equations (4.14), (4.15) and (4.16). . . . 138
Chapter 1
Introduction
1.1 Background: gene set enrichment analysis (GSEA)
Since the introduction of DNA microarray measurement technology, genome-wide ex-
pression analysis with these DNA microarrays has become a mainstay of both ge-
nomics and statistical research. Researchers seek methods that extract useful infor-
mation from the DNA microarrays both accurately and efficiently. For a particular
experiment, gene expressions are measured for thousands of genes from a group of
samples belonging to either the treatment or control group, for example, candidates
with or without lung cancer. Traditionally, the most differentially expressed genes are
tested individually for their relationship with the treatment. However, this approach
has some major limitations:
• After correcting for the multiple hypothesis testing effect, genes that achieve the
required significance level may be too few or even non-existent. We may leave
out many weakly correlated genes due to the noise and the multiple hypothesis
testing effect.
• Even if we end up with a long list of statistically significant genes, they may
well be biologically unstructured. Interpretation can be ad hoc and arbitrary,
and often biased by the biologist's area of expertise.
• Many measured genes are biologically related, sharing the same biological function,
chromosomal location, or regulation, and hence their measurements are
correlated. Standard multiple hypothesis testing procedures that control FDR
require independence or special dependence structures like PRDS (Benjamini
and Yekutieli (2001)), which often do not hold in real microarray data sets.
Given the above challenges for the single gene testing procedure, Mootha
et al. (2003) first introduced gene set enrichment analysis (GSEA). Instead of studying
single gene effects individually, they propose to study microarray data at the level
of gene sets. “The gene sets are defined based on prior biological knowledge, e.g.,
published information about biochemical pathways or coexpression in previous ex-
periments.” (Subramanian et al. (2005)) The goal of GSEA is to find gene sets that
are correlated with the treatment as a whole. Moving the analysis from the single
gene level to the gene set level has several advantages:
• It is common to have many weakly correlated single genes that appear
insignificant individually. By combining their weak effects with appropriate
choices of the test statistic, we may find that the gene set as a whole achieves
the desired statistical significance.
• Conducting hypothesis testing at the gene set level yields much more interpretable
results. We can focus on exploring the scientific explanation of the
relationships between gene sets and the treatment, without the extra step of
summarizing uncoordinated single gene test results.
• The total number of tests carried out at the same time is significantly reduced.
Multiple single gene tests are condensed into one single test for the whole gene
set.
• When a single test statistic is constructed for the whole gene set, the correlations
of single genes within gene sets no longer play a role in the final decision, making
the conclusion more statistically sound.
Because of the above benefits, GSEA has gained much attention since its first
introduction, and has become standard practice in the last decade (Tamayo et al.
(2012)). The gene set database has grown from the original database (Subramanian
et al. (2005)) of 1,325 gene sets, including four major collections, to 13,311 gene
sets as of today in the Molecular Signatures Database (MSigDB, Liberzon et al.
(2015)), divided into 8 major collections and several subcollections. The gene sets
are available for download from the Broad Institute (2016).
1.2 Null hypothesis in GSEA
A key component in hypothesis testing is the null hypothesis. Tian et al. (2005) and
Goeman and Bühlmann (2007), among others, introduce two different null hypotheses
in GSEA. Let S be the gene set of interest and Y be the treatment. The first
null compares the association between Y and S with the association between Y and
other gene sets of comparable sizes. This null hypothesis essentially means S cannot
stand out from comparable gene sets, hence it is often known as the “competitive null”
hypothesis. Methods for testing the competitive null typically involve randomizing
the gene labels and keeping the sample labels fixed. This permutation does not give
a rigorous test when genes are correlated, which is usually the case for those within
a gene set (Goeman and Bühlmann (2007)).
The second null only focuses on the gene set of interest S. It compares the associa-
tion between S and Y with the association between S and random treatments. To test
this “self-contained null”, usually the labels in the treatment Y are permuted, with
the gene labels fixed. While the competitive null is often of interest, this dissertation
focuses on testing for the “self-contained” null.
1.3 Permutation test
Gene set tests construct a single test statistic for the whole gene set. In most cases,
the null distribution has no closed form, hence the p-values are usually estimated
by permutation tests. Even in cases where closed form null distributions are available
under appropriate assumptions, such as the Kolmogorov-Smirnov statistic in the ini-
tial GSEA in Mootha et al. (2003), and the J-G score in Jiang and Gentleman (2007),
the permutation tests are suggested to gain robustness in case the data falls short of
the assumptions.
A detailed explanation of the permutation test is given in Lehmann and Romano
(2005). We describe its procedure in our particular application of GSEA. Suppose
for $m$ independent samples we observe the single gene measurements $Y_g \in \mathbb{R}^m$, $g = 1, \cdots, G$,
for all genes in a gene set of size $G$, and denote the corresponding treatment
as $X \in \mathbb{R}^m$ or $\{0,1\}^m$. In cases where $X$ takes binary values, let $m_0$ be the number of
0's and $m_1$ be the number of 1's, with $m = m_0 + m_1$. Denote the gene measurements
for the gene set as the matrix $Y_{1:G} = [Y_1, \cdots, Y_G] \in \mathbb{R}^{m \times G}$. We first decide on
the test statistic for the gene set. One example takes the sum of single gene
correlations as the test statistic, $T(Y_{1:G}, X) = \sum_{g=1}^{G} \mathrm{corr}(Y_g, X)$. Another example
constructs the test statistic as the sum of squared t-statistics for the single genes,
$T(Y_{1:G}, X) = \sum_{g=1}^{G} t(Y_g, X)^2$. To perform the permutation test, we keep $Y_{1:G}$ fixed and
obtain all $N$ unique permutations of $X$ as $X_0^*, \cdots, X_{N-1}^*$. In cases where $X \in \mathbb{R}^m$,
$N = m!$, and in cases where $X \in \{0,1\}^{m_0+m_1}$, $N = \binom{m_0+m_1}{m_0}$. Then the permutation
p-value is defined as
$$ p = \frac{1}{N} \sum_{i=0}^{N-1} \mathbf{1}\big(T(Y_{1:G}, X_i^*) \ge T(Y_{1:G}, X)\big). $$
Note that $T(Y_{1:G}, X_i^*) \ge T(Y_{1:G}, X)$ holds true at least for $X_i^* = X$, so the true
permutation p-value never goes below $1/N$.
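For small samples, the permutation p-value above can be computed exactly by enumerating every distinct 0/1 labeling. The following is a minimal sketch, not code from this dissertation; the helper names are ours, and the linear statistic (sum of single gene correlations) is used as the example:

```python
import itertools
import numpy as np

def linear_stat(Y, x):
    """T(Y_{1:G}, x) = sum over genes g of corr(Y_g, x), the linear statistic."""
    Yc = (Y - Y.mean(axis=0)) / Y.std(axis=0)   # standardize each gene column
    xc = (x - x.mean()) / x.std()               # standardize the treatment vector
    return (Yc * xc[:, None]).mean(axis=0).sum()  # sum of per-gene correlations

def exact_perm_pvalue(Y, x):
    """Exact permutation p-value over all N = C(m0+m1, m1) binary labelings.

    Y is the m-by-G expression matrix and x a length-m 0/1 treatment vector.
    Only feasible for small m, since N grows combinatorially."""
    m, m1 = len(x), int(x.sum())
    t_obs = linear_stat(Y, x)
    hits = total = 0
    for ones in itertools.combinations(range(m), m1):
        x_perm = np.zeros(m)
        x_perm[list(ones)] = 1.0
        hits += linear_stat(Y, x_perm) >= t_obs
        total += 1
    return hits / total
```

Since the observed labeling is one of the $N$ enumerated permutations, the returned value is always at least $1/N$, matching the note above.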
The total number of permutations $N$ increases exponentially and quickly becomes
intractable as the sample size $m$ grows. For example, with $m = 20$, $N = 20! \doteq 2.4 \times 10^{18}$,
and with $m_0 = m_1 = 20$, $N = \binom{40}{20} \doteq 1.4 \times 10^{11}$. It is common to approximate
the exact permutation p-value by random sampling from all permutations (Good
(2013)). Monte Carlo permutation tests are easy to implement, require no specific
distributional assumptions on the data, and can be applied to any test statistic of our
choice. Despite their generality, they are often computationally expensive, especially
when the true p-values are small. As discussed in Larson and Owen (2015), for p-
values as small as $\epsilon$, between $3/\epsilon$ and $19/\epsilon$ random permutations are needed to
get adequate power.
Monte Carlo based permutation also suffers from a resampling granularity problem,
a name adopted from Larson and Owen (2015). It is conventional to add the
observation $X$ as an additional random sample of permutations. Then the smallest
p-value that we can possibly get from $M - 1$ random permutations is $1/M$. When
two or more gene sets are tied at this granularity value, there is no way to distinguish
them. Many existing methods rank the gene sets by their corresponding test statistics.
However this practice rests on the assumption that all test statistics have the
same null distribution, which is clearly not the case when comparing gene sets of
different sizes or different correlation structures.
Observing the challenges in plain Monte Carlo sampling of permutations, we seek
alternative methods to estimate permutation p-values efficiently, especially those that
are extremely small. The methods that work most efficiently are usually specialized
to the chosen test statistic. We discuss the choice of test statistics in the next section.
1.4 Test statistic
A gene set statistic is usually constructed from three components: a gene-level statistic,
a transformation, and a summary statistic, as shown in Fig. 1.1. There can be many
different choices in each component. To obtain the gene set test statistic, one can
first choose the gene level statistic as the t-statistic or the correlation coefficient,
then take no transformation, or take transformations such as the absolute value or
the square, and finally take the mean or median. The reader can easily decode the
aforementioned two examples of gene set statistics, $T(Y_{1:G}, X) = \sum_{g=1}^{G} \mathrm{corr}(Y_g, X)$ and
$T(Y_{1:G}, X) = \sum_{g=1}^{G} t(Y_g, X)^2$, in terms of these three steps. For a more detailed discussion
of the construction framework for gene set tests, see Ackermann and Strimmer (2009).
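The three-component framework is easy to make concrete in code. The sketch below is our own illustration (the function names are hypothetical, not from any GSEA package); it composes an arbitrary gene-level statistic, transformation, and summary into a gene set statistic:

```python
import numpy as np

def gene_set_stat(Y, x, gene_stat, transform, summarize):
    """Build a gene set statistic from the three components of Fig. 1.1:
    a gene-level statistic, a transformation, and a summary statistic."""
    gene_scores = np.array([gene_stat(Y[:, g], x) for g in range(Y.shape[1])])
    return summarize(transform(gene_scores))

def corr(y, x):
    """Gene-level statistic: Pearson correlation of one gene with treatment."""
    return np.corrcoef(y, x)[0, 1]

def identity(s):
    return s

def linear_T(Y, x):
    # Linear statistic: sum of correlations, no transformation.
    return gene_set_stat(Y, x, corr, identity, np.sum)

def quadratic_T(Y, x):
    # Quadratic statistic: sum of squared correlations.
    return gene_set_stat(Y, x, corr, np.square, np.sum)
```

Swapping `corr` for a t-statistic, or `np.square` for `np.abs`, or `np.sum` for `np.median` recovers the other choices mentioned above.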
Figure 1.1: Framework for the construction of a gene set statistic

Ackermann and Strimmer (2009) compared 261 different gene set statistics, and
found particularly good performance for two families of statistics: a linear family
and a quadratic family. Let $\rho_g(Y_g, X)$ and $t_g(Y_g, X)$ be the single gene correlation
and t-statistic respectively. The linear family consists of T1 =G∑g=1
ρg(Yg, X) and
T ′1 =G∑g=1
tg(Yg, X), and the quadratic family consists of T2 =G∑g=1
ρg(Yg, X)2 and T ′2 =
G∑g=1
tg(Yg, X)2. The best performance comes from the quadratic family. By squaring
the correlation or t-statistics, effects from genes that are deferentially expressed in
opposite directions are added up instead of being cancelled from each other. The
linear family is the second best. Here T ′1 is also known as the J-G score proposed in
Jiang and Gentleman (2007). It is remarkable that the best performing statistics are
surprisingly simple, especially when compared with the complicated GSEA method
in Subramanian et al. (2005).
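For concreteness, both families reduce to a few lines of code. The sketch below (our own illustration, not part of the dissertation; the helper name gene_set_stats is hypothetical) computes T_1 and T_2 from an m × G expression matrix using the correlation as the gene-level statistic:

```python
import numpy as np

def gene_set_stats(Y, X):
    """Linear (T1) and quadratic (T2) gene set statistics, following the
    three-step construction: gene-level statistic -> transformation -> summary.
    Y: (m, G) expression matrix, X: (m,) treatment indicator vector."""
    Xc = (X - X.mean()) / X.std()                 # standardize treatment
    Yc = (Y - Y.mean(axis=0)) / Y.std(axis=0)     # standardize each gene
    rho = Xc @ Yc / len(X)                        # per-gene correlations
    T1 = rho.sum()                                # linear family: no transformation
    T2 = (rho ** 2).sum()                         # quadratic family: squares
    return T1, T2

rng = np.random.default_rng(0)
Y = rng.standard_normal((10, 5))
X = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1], dtype=float)
T1, T2 = gene_set_stats(Y, X)
```

Since each correlation lies in [−1, 1], T_2 is bounded by G, and T_1² ≤ G T_2 by the Cauchy-Schwarz inequality.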
The similar performance of the t-statistic and the correlation can be justified through a Taylor approximation, as shown in Larson and Owen (2015). The usual t-statistic for testing a linear relationship is t_g ≡ √(m−2) ρ̂_g/(1 − ρ̂_g²)^{1/2}. The Taylor expansion gives t_g = √(m−2)(ρ̂_g + (1/2)ρ̂_g³ + O(ρ̂_g⁵)). Gene set tests are most useful when the individual |ρ̂_g| are small. In those cases t_g is approximately a constant multiple of ρ̂_g, hence using the correlation as the gene-level statistic should yield performance similar to using t-statistics. We study the linear and quadratic statistics with the correlation as the gene-level statistic.
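As a quick numerical sanity check of this expansion (our own illustration, not from Larson and Owen), one can compare t_g with its two-term Taylor approximation for a few small correlations; the relative error behaves like (3/8)ρ̂_g⁴:

```python
import numpy as np

m = 50
rel_errs = []
for rho in [0.01, 0.05, 0.1]:
    t_exact = np.sqrt(m - 2) * rho / np.sqrt(1 - rho ** 2)   # t-statistic from rho
    t_taylor = np.sqrt(m - 2) * (rho + 0.5 * rho ** 3)       # truncated expansion
    rel_errs.append(abs(t_exact - t_taylor) / t_exact)
# the omitted term is (3/8) * rho**5 inside the parentheses,
# so the relative error is roughly (3/8) * rho**4
```

For |ρ̂_g| ≤ 0.1 the relative error is below 10⁻⁴, which is why the two gene-level statistics behave so similarly in practice.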
1.5 Notation
We summarize here the notation used throughout the dissertation. Let m = m0 + m1 be the number of patients, with m0 in the control group and m1 in the treatment group, and let X ∈ {0, 1}^m be the indicator variable for treatment. We limit our discussion to binary X, though some methods extend easily to continuous X as well. Let S be the gene set of interest and G = |S| the cardinality of S. Denote the expression level for gene g by Y_g ∈ R^m, and let Y_{1:G} = [Y_1, · · · , Y_G] ∈ R^{m×G} be the expression level measurement matrix. We center and standardize the binary X to obtain x_0 such that x_0^T 1 = 0 and x_0^T x_0 = 1. Let x_0, · · · , x_{N−1} be all distinct permutations of x_0, where N = (m0+m1 choose m0).
We are interested in approximating the permutation p-values for two statistics

T_1(X; Y_{1:G}) = Σ_{g=1}^{G} ρ_g(Y_g, X),   T_2(X; Y_{1:G}) = Σ_{g=1}^{G} ρ_g(Y_g, X)².

Note that we can replace X with x_0 because centering and scaling do not change the correlation coefficients, hence the corresponding p-values are

p_j = (1/N) Σ_{i=0}^{N−1} 1(T_j(x_i; Y_{1:G}) ≥ T_j(x_0; Y_{1:G})),   j = 1, 2.
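For small samples these p-values can be computed exactly by enumerating all N assignments. The brute-force sketch below (ours, for illustration; it is precisely the computation that becomes infeasible for large m) uses itertools.combinations:

```python
import numpy as np
from itertools import combinations

def perm_pvalues(Y, x):
    """Exact permutation p-values for T1 and T2 by enumerating all
    N = C(m0+m1, m0) distinct treatment assignments."""
    m, G = Y.shape
    Yc = (Y - Y.mean(axis=0)) / Y.std(axis=0)

    def stats(z):
        zc = (z - z.mean()) / z.std()
        rho = zc @ Yc / m                      # per-gene correlations
        return rho.sum(), (rho ** 2).sum()

    m1 = int(x.sum())
    T1_obs, T2_obs = stats(x)
    n = c1 = c2 = 0
    for idx in combinations(range(m), m1):     # every way to choose the treated group
        z = np.zeros(m)
        z[list(idx)] = 1.0
        T1, T2 = stats(z)
        n += 1
        c1 += T1 >= T1_obs - 1e-12             # small tolerance for float ties
        c2 += T2 >= T2_obs - 1e-12
    return c1 / n, c2 / n

rng = np.random.default_rng(0)
Y = rng.standard_normal((8, 3))
x0 = np.array([0, 0, 0, 0, 1, 1, 1, 1], dtype=float)
p1, p2 = perm_pvalues(Y, x0)
```

Because the observed assignment is included in the enumeration, both p-values are at least 1/N, here 1/70.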
In the following chapters we may omit the subscript on the p-value when there is no confusion about the test statistic. We now derive alternative formulas for T_1 and T_2 for ease of discussion in the following chapters. Note that

T_1 = Σ_{g=1}^{G} corr(X, Y_g) = Σ_{g=1}^{G} corr(X, Y_g/sd(Y_g)) = √G corr(X, Σ_{g=1}^{G} Y_g/sd(Y_g)).

We define Y = Σ_{g=1}^{G} Y_g/sd(Y_g) ∈ R^m; then using T_1 is equivalent to using corr(X, Y) as the test statistic for permutation tests. We center and standardize Y to get y_0 such that y_0^T 1 = 0 and y_0^T y_0 = 1. Then the p-value for the linear statistic can be written in terms
of the correlations between y_0 and the x_i:

p_1 = (1/N) Σ_{i=0}^{N−1} 1(x_i^T y_0 ≥ x_0^T y_0).   (1.1)

To simplify T_2, we center and standardize all columns of Y_{1:G} to get Ỹ_{1:G}, such that Ỹ_{1:G}^T 1 = 0 and diag(Ỹ_{1:G}^T Ỹ_{1:G}) = 1. Define Σ = Ỹ_{1:G} Ỹ_{1:G}^T; then

T_2 = Σ_{g=1}^{G} corr(X, Y_g)² = Σ_{g=1}^{G} (x^T Ỹ_g)² = x^T Σ x.

Then the p-value for the quadratic statistic is

p_2 = (1/N) Σ_{i=0}^{N−1} 1(x_i^T Σ x_i ≥ x_0^T Σ x_0).   (1.2)
This dissertation proposes three novel methods to efficiently estimate the small permutation p-values p_1 and p_2, with T_1 and T_2 as test statistics respectively. The rest of the dissertation is organized as follows. Chapters 2, 4 and 5 are devoted to the three distinct methods. Chapter 2 introduces three approximations for p_1; error estimates are derived from a generalized Stolarsky invariance principle, and alternative probabilistic arguments are provided as well. Chapter 3 provides some new results on importance sampling; specifically, we provide a method to jointly optimize the weights in mixture importance sampling in combination with control variates. Chapter 4 focuses on the quadratic statistic T_2 and introduces an estimation procedure based on importance sampling. Chapter 5 is devoted to a subset sampling method that may work for both linear and quadratic statistics. Similar ideas can be found in literature spanning combinatorics, sequential Monte Carlo, Bayesian computation, rare event estimation and network reliability, under different names, e.g. approximate counting, nested sampling, subset simulation and multilevel splitting. We give a thorough review of the literature in these different areas, and apply the technique to gene set testing with the quadratic test statistic. Finally, Chapter 6 applies the three methods - Stolarsky, importance sampling and subset simulation - to a real data example.
Chapter 2 is based on the tech report He et al. (2016), and Chapter 4 is joint work
with Kinjal Basu, Qingyuan Zhao and Art B. Owen. It appears in the tech report
He and Owen (2014).
Chapter 2
p-value approximation via
Stolarsky’s Invariance principle
2.1 Introduction
This chapter focuses on estimating the p-value for the linear statistic, as defined in eq. (1.1). We drop the subscript of p_1 throughout the discussion in this chapter. For linear test statistics, as we show below, the permutation p-value is the fraction of permuted data vectors lying in a given spherical cap subset of the d-dimensional sphere S^d = {z ∈ R^{d+1} | z^T z = 1}. A natural but crude approximation to that p-value is the fraction p̂ of the sphere's surface volume contained in that spherical cap.
Stolarsky's invariance principle gives a remarkable description of the accuracy of this approximation p̂. For y ∈ S^d and t ∈ [−1, 1] we define the spherical cap with center y and height t via C(y; t) = {z ∈ S^d | ⟨y, z⟩ ≥ t}. For x_0, . . . , x_{N−1} ∈ S^d, let p(y, t) be the fraction of those N points that lie in C(y; t) and let p̂(y, t) = p̂(t) = vol(C(y; t))/vol(S^d). The squared L_2 spherical cap discrepancy of these points is

L_2(x_0, . . . , x_{N−1})² = ∫_{−1}^{1} ∫_{S^d} |p̂(t) − p(z, t)|² dσ_d(z) dt.
CHAPTER 2. APPROXIMATION VIA STOLARSKY'S INVARIANCE
Stolarsky (1973) shows that

(d ω_d/ω_{d−1}) × L_2(·)² = ∫_{S^d} ∫_{S^d} ‖x − y‖ dσ_d(x) dσ_d(y) − (1/N²) Σ_{k,l=0}^{N−1} ‖x_k − x_l‖   (2.1)

where σ_d is the uniform (Haar) measure on S^d and ω_d is the (surface) volume of S^d. Equation (2.1) relates the mean squared error of p̂ to the mean absolute Euclidean distance among the N points. In our applications, the N points will be the distinct permuted values of a data vector, but the formula holds for an arbitrary set of N points.
The left side of (2.1) is, up to normalization, a mean squared discrepancy over spherical caps. This average of (p̂ − p)² includes p-values of all sizes between 0 and 1. It is not then a very good accuracy measure when p̂ turns out to be very small. It would be more useful to get such a mean squared error taken over caps of exactly the size p̂, and no others.

Brauchart and Dick (2013) consider quasi-Monte Carlo (QMC) sampling on the sphere. They generalize Stolarsky's discrepancy formula to include a weighting function on the height t. By specializing their formula, we get an expression for the mean of (p̂ − p)² over spherical caps of any fixed size.

Discrepancy theory plays a prominent role in QMC (Niederreiter, 1992), which
is about approximating an integral by a sample average. The present setting is in a
sense the reverse of QMC: the discrete average over permutations is the exact value,
and the integral over a continuum is the approximation. A second difference is that
the QMC literature focuses on choosing N points to minimize a criterion such as (2.1),
whereas here the N points are determined by the problem.
We present several results for the mean of (p̂ − p)² under different conditions. In addition to fixing the size of the caps, we can restrict the mean squared error to only be over caps centered on points y satisfying ⟨y, x_0⟩ = ⟨y_0, x_0⟩, where x_0 is the original (unpermuted) x vector and y_0 is the observed y value. We can obtain this result by further extending Brauchart and Dick's generalization of Stolarsky's invariance. We call this the 'finer approximation' and will show it has advantages over constraining
only the height of the caps. More generally, the point xc could be any of the permuted
x vectors, such as the one that happens to be closest to y0.
Although we found these results via invariance, we can also obtain them via probabilistic arguments. As a consequence we have a probabilistic derivation of Stolarsky's formula. Some of our results are for arbitrary x, but our best computational formulas are for the case where the variable x is binary, as it would be in experiments comparing treatment and control groups.
The rest of the chapter is organized as follows. Section 2.2 presents some context
on permutation tests and gives some results from spherical geometry. In Section 2.3
we use Stolarsky’s invariance principle as generalized by Brauchart and Dick (2013)
to obtain the mean squared error between the true p-value and its continuous approximation p̂1, taken over all spherical caps of volume p̂1. This section also has a
probabilistic derivation of that mean squared error. In Section 2.4 we describe some
finer approximations p̃ for the p-value. These condition on not just the volume of the
spherical cap but also on its distance from the original data point x0, or from some
other point, such as the closest permutation of x0 to y0. By always including the
original point we ensure that p̃ > 1/N . That is a desirable property because the true
permutation p-value cannot be smaller than 1/N . In Section 2.5 we modify the proof
in Brauchart and Dick (2013), to further generalize their invariance results to include
the mean squared error of the finer approximations. Section 2.6 extends our estimates
to two-sided testing. Section 2.7 illustrates our p-value approximations numerically.
We see that an RMS error in the finer approximate p-values is of the same order
of magnitude as those p-values themselves. Section 2.8 makes a numerical comparison to saddlepoint methods. Section 2.9 discusses the results and gives more details
about the bioinformatics problems that motivate the search for approximations to
the permutation distribution. Most of the proofs are in Appendix A.
2.2 Background and notation
The raw data contain points (X_i, Y_i) for i = 1, . . . , m. The X_i are the treatment indicators, and Y = Σ_{g=1}^{G} Y_g/sd(Y_g) ∈ R^m as discussed in Section 1.5. The sample correlation of these points is ρ̂ = x_0^T y_0, where x_0 has components (X_i − X̄)/(√m s_X) for X̄ = (1/m) Σ_{i=1}^{m} X_i and s_X² = (1/m) Σ_{i=1}^{m} (X_i − X̄)², and y_0 is defined similarly from Ȳ and s_Y. We assume that s_X and s_Y are positive. Both x_0 and y_0 belong to S^{m−1}. Moreover, they belong to {z ∈ S^{m−1} | z^T 1_m = 0}. We can use an orthogonal matrix to rotate the points of this set onto S^{m−2} × {0}. As a result, we may simply work with x_0, y_0 ∈ S^d where d = m − 2.
The quantity ρ̂ measures association between X and Y. It can be used as such a measure if the X_i are fixed and the Y_i observed conditionally, or vice versa, or if the (X_i, Y_i) pairs are independently sampled from some joint distribution. Let π be a permutation of the indices 1, . . . , m. There are m! vectors x_π that result from centering and scaling X_π = (X_{π(1)}, X_{π(2)}, . . . , X_{π(m)}). The permutation p-value is p = (1/m!) Σ_π 1(x_π^T y_0 > x_0^T y_0). The justification for this p-value relies on the group structure of permutations (Lehmann and Romano, 2005). For a cautionary tale on the use of permutation sets without a group structure, see Southworth et al. (2009). For notational simplicity we assume ρ̂ > 0 and work with one-sided p-values. Negative ρ̂ can be handled similarly, or simply by switching the group labels. For two-sided p-values see Section 2.6.
Our proposals are computationally most attractive in the case where X_i takes on just two values, such as 0 and 1. Then ρ̂ is a two-sample test statistic. If there are m0 observations with X_i = 0 and m1 with X_i = 1, then x_0 contains m0 components equal to −√(m1/(m m0)) and m1 components equal to +√(m0/(m m1)). Some formulas involve the smaller sample size min(m0, m1).

For this two-sample case there are only N = (m0+m1 choose m0) distinct permutations of x_0. Calling these x_0, x_1, . . . , x_{N−1}, we find that

p = (1/N) Σ_{k=0}^{N−1} 1(x_k^T y_0 > ρ̂).   (2.2)
Now suppose that there are exactly r indices at which x_k is positive and x_l is negative. There are then r indices with the reverse pattern too. We say that x_k and x_l are at 'swap distance' r. In that case we easily find that

u(r) ≡ x_k^T x_l = 1 − r(1/m0 + 1/m1).   (2.3)
We need some geometric properties of the unit sphere and spherical caps. The surface volume of S^d is ω_d = 2π^{(d+1)/2}/Γ((d+1)/2). We use σ_d for the volume element on S^d, normalized so that σ_d(S^d) = 1. The spherical cap C(y; t) = {z ∈ S^d | z^T y ≥ t} has volume

σ_d(C(y; t)) = (1/2) I_{1−t²}(d/2, 1/2) for 0 ≤ t ≤ 1, and 1 − (1/2) I_{1−t²}(d/2, 1/2) for −1 ≤ t < 0,

where I_t(a, b) is the incomplete beta function

I_t(a, b) = (1/B(a, b)) ∫_{0}^{t} x^{a−1}(1 − x)^{b−1} dx

with B(a, b) = ∫_{0}^{1} x^{a−1}(1 − x)^{b−1} dx. Obviously, this volume is 0 if t > 1 and it is 1 if t < −1. The volume is independent of y, so we may write σ_d(C(· ; t)) for it. By symmetry, 1(x ∈ C(y; t)) = 1(y ∈ C(x; t)).

Our first approximation of the p-value is

p̂_1(ρ̂) = σ_d(C(y; ρ̂)).   (2.4)

This approximation has two intuitive explanations. First, the true p-value is the proportion of permutations of x_0 that lie in C(y_0; ρ̂), and σ_d(C(y_0; ρ̂)) is the proportion of the volume of S^d in that set. Second, as we show in Proposition 2, p̂_1 = E(p | ⟨x_0, y⟩ = ρ̂) for y ∼ U(S^d), as y_0 would be if the original Y_i were IID Gaussian. In Theorem 4 we find the corresponding variance under this assumption.
We frequently need to project y ∈ S^d onto a point x ∈ S^d. In this representation y = t x + √(1 − t²) y*, where t = y^T x ∈ [−1, 1] and y* ∈ {z ∈ S^d | z^T x = 0}, which
is isomorphic to S^{d−1}. The coordinates t and y* are unique. From equation (A.1) in Brauchart and Dick (2013) we get

dσ_d(y) = (ω_{d−1}/ω_d) (1 − t²)^{d/2−1} dt dσ_{d−1}(y*).   (2.5)
In their case x was (0, 0, . . . , 1).
The intersection of two spherical caps of common height t is
C2(x,y; t) ≡ C(x; t) ∩ C(y; t).
We will need the volume of this intersection. Lee and Kim (2014) give a general solu-
tion for spherical cap intersections without requiring equal heights. They enumerate
25 cases, but our case does not correspond to any single such case and so we obtain
the formula we need directly, below. We suspect it must be known already, but we
were unable to find it in the literature.
Lemma 1. Let x, y ∈ S^d and −1 ≤ t ≤ 1, and put u = x^T y. Let V_2(u; t, d) = σ_d(C_2(x, y; t)). If u = 1, then V_2(u; t, d) = σ_d(C(x; t)). If −1 < u < 1, then

V_2(u; t, d) = (ω_{d−1}/ω_d) ∫_{t}^{1} (1 − s²)^{d/2−1} σ_{d−1}(C(y*; ρ(s))) ds,   (2.6)

where ρ(s) = (t − su)/√((1 − s²)(1 − u²)). Finally, for u = −1,

V_2(u; t, d) = 0 if t > 0, and otherwise (ω_{d−1}/ω_d) ∫_{−|t|}^{|t|} (1 − s²)^{d/2−1} ds.   (2.7)
Proof. Let z ∼ U(S^d). Then V_2(u; t, d) = σ_d(C_2(x, y; t)) = Pr(z ∈ C_2(x, y; t)). If u = 1 then x = y and so C_2(x, y; t) = C(x; t). For u < 1, we project y and z onto x, via z = s x + √(1 − s²) z* and y = u x + √(1 − u²) y*. Now

V_2(u; t, d) = ∫_{S^d} 1(⟨x, z⟩ ≥ t) 1(⟨y, z⟩ ≥ t) dσ_d(z)
= ∫_{−1}^{1} 1(s ≥ t) (ω_{d−1}/ω_d) (1 − s²)^{d/2−1} ∫_{S^{d−1}} 1(su + √(1 − s²)√(1 − u²) ⟨y*, z*⟩ ≥ t) dσ_{d−1}(z*) ds.

If u > −1 then this reduces to (2.6). For u = −1 we get

V_2(u; t, d) = (ω_{d−1}/ω_d) ∫_{−1}^{1} 1(s ≥ t) 1(−s ≥ t) (1 − s²)^{d/2−1} ds,

which reduces to (2.7).
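The one-dimensional integral in (2.6) is easy to evaluate numerically. The sketch below (ours, using scipy.integrate.quad; cap_volume implements the incomplete beta formula for σ_d from Section 2.2) covers the cases u ∈ (−1, 1] needed later:

```python
import numpy as np
from scipy.integrate import quad
from scipy.special import betainc, gammaln

def cap_volume(t, d):
    """sigma_d(C(y; t)) via the regularized incomplete beta function."""
    if t >= 1.0:
        return 0.0
    if t <= -1.0:
        return 1.0
    half = 0.5 * betainc(d / 2.0, 0.5, 1.0 - t * t)
    return half if t >= 0 else 1.0 - half

def V2(u, t, d):
    """Volume of the intersection of two caps of common height t whose
    centers have inner product u, by quadrature over (2.6)."""
    if u >= 1.0:
        return cap_volume(t, d)
    # omega_{d-1}/omega_d = Gamma((d+1)/2) / (sqrt(pi) * Gamma(d/2))
    ratio = np.exp(gammaln((d + 1) / 2.0) - gammaln(d / 2.0)) / np.sqrt(np.pi)

    def integrand(s):
        rho = (t - s * u) / np.sqrt((1.0 - s * s) * (1.0 - u * u))
        return (1.0 - s * s) ** (d / 2.0 - 1.0) * cap_volume(rho, d - 1)

    val, _ = quad(integrand, t, 1.0 - 1e-12)   # stop short of s = 1 for stability
    return ratio * val
```

Two sanity checks: for u = 1 the caps coincide, and for t = 0 the caps are hemispheres, whose intersection has volume (π − arccos(u))/(2π) in every dimension.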
When we give probabilistic arguments and interpretations, we do so for a random center y of a spherical cap. We use Models 1 and 2 below. Model 1 is illustrated in Figure 2.1. Model 2 is illustrated in Figure 2.2 of Section 2.4, where we first use it.

Model 1. The vector y is uniformly distributed on the sphere S^d. Expectation under this model is denoted E_1(·).

Model 2. The vector y is uniformly distributed on {z ∈ S^d | z^T x_c = ρ̃} for some −1 ≤ ρ̃ ≤ 1 and c ∈ {0, 1, . . . , N − 1}. Then y = ρ̃ x_c + √(1 − ρ̃²) y* for y* uniformly distributed on a subset of S^d isometric to S^{d−1}. Expectation under this model is denoted E_2(·).
2.3 Approximation via spherical cap volume
Here we study the approximate p-value p̂1(ρ̂) = σd(C(y; ρ̂)). First we find the mean
squared error of this approximation over all spherical caps of the given volume via
invariance. Then we give a probabilistic interpretation which includes the conditional
unbiasedness result in Proposition 2 below. Then we give two computational simpli-
fications, first for points obtained via permutation, and second for permutations of a
binary vector. We begin by restating the invariance principle.
Figure 2.1: Illustration for Model 1. The point y is uniformly distributed over S^d. The small open circles represent permuted vectors x_k. The point y_0 is the observed value of y. The circle around it goes through x_0 and represents a spherical cap of height y_0^T x_0. A second spherical cap of equal volume is centered at y = y_1. We study moments of p(y; ρ̂), the fraction of the x_k in the cap centered at a random y.
Theorem 1. Let x_0, . . . , x_{N−1} be any points in S^d. Then

(1/N²) Σ_{k,l=0}^{N−1} ‖x_k − x_l‖ + (1/C_d) ∫_{−1}^{1} ∫_{S^d} |σ_d(C(z; t)) − (1/N) Σ_{k=0}^{N−1} 1_{C(z;t)}(x_k)|² dσ_d(z) dt = ∫_{S^d} ∫_{S^d} ‖x − y‖ dσ_d(x) dσ_d(y),

where C_d = ω_{d−1}/(d ω_d).
Proof. Stolarsky (1973).
Brauchart and Dick (2013) gave a simple proof of Theorem 1 using reproducing
kernel Hilbert spaces. They generalized Theorem 1 as follows.
Theorem 2. Let x_0, . . . , x_{N−1} be any points in S^d. Let v : [−1, 1] → (0, ∞) be any function with an antiderivative. Then

∫_{−1}^{1} v(t) ∫_{S^d} |σ_d(C(z; t)) − (1/N) Σ_{k=0}^{N−1} 1_{C(z;t)}(x_k)|² dσ_d(z) dt = (1/N²) Σ_{k,l=0}^{N−1} K_v(x_k, x_l) − ∫_{S^d} ∫_{S^d} K_v(x, y) dσ_d(x) dσ_d(y),   (2.8)

where K_v(x, y) is a reproducing kernel function defined by

K_v(x, y) = ∫_{−1}^{1} v(t) ∫_{S^d} 1_{C(z;t)}(x) 1_{C(z;t)}(y) dσ_d(z) dt.   (2.9)
Proof. See Theorem 5.1 in Brauchart and Dick (2013).

If we set v(t) = 1, then K_v(x, y) = 1 − C_d ‖x − y‖ and we recover the original Stolarsky formula. Note that the statement of Theorem 5.1 in Brauchart and Dick (2013) has a sign error in their counterpart to (2.8). The corrected statement (2.8) can be verified by comparing equations (5.3) and (5.4) of Brauchart and Dick (2013).

We would like a version of (2.8) for just one value of t, such as t = ρ̂ = x_0^T y_0. For ρ̂ ∈ [−1, 1) and ε = (ε_1, ε_2) ∈ (0, 1)², let

v_ε(t) = ε_2 + (1/ε_1) 1(ρ̂ ≤ t ≤ ρ̂ + ε_1).   (2.10)

Each v_ε satisfies the conditions of Theorem 2, making (2.8) an identity in ε. We let ε_2 → 0 and then ε_1 → 0 on both sides of (2.8) for v = v_ε, yielding Theorem 3.
Theorem 3. Let x_0, x_1, . . . , x_{N−1} ∈ S^d and t ∈ [−1, 1]. Then

∫_{S^d} |p(y, t) − p̂_1(t)|² dσ_d(y) = (1/N²) Σ_{k=0}^{N−1} Σ_{l=0}^{N−1} σ_d(C_2(x_k, x_l; t)) − p̂_1(t)².   (2.11)

Proof. See Section A.1 of the Appendix, which uses the limit argument described above.
We now give a proposition that holds for all models, including our Model 1 and Model 2.

Proposition 1. For a random point y ∈ S^d,

E(p(y, t)) = (1/N) Σ_{k=0}^{N−1} Pr(y ∈ C(x_k; t)), and   (2.12)

E(p(y, t)²) = (1/N²) Σ_{k,l=0}^{N−1} Pr(y ∈ C_2(x_k, x_l; t)).   (2.13)
Proposition 1 provides a probabilistic interpretation of equation (2.11). When y ∼ U(S^d), the double sum on the right side of (2.11) is E(p(y, t)²). Additionally, p̂_1(t) has a probabilistic interpretation under Model 1.

Proposition 2. For any x_0, . . . , x_{N−1} ∈ S^d and t ∈ [−1, 1], p̂_1(t) = E_1(p(y, t)).

Proof. E_1(p(y, t)) = E_1[(1/N) Σ_{k=0}^{N−1} 1_{C(y;t)}(x_k)] = σ_d(C(y; t)) = p̂_1(t).

Combining Proposition 2 and Theorem 3, we find that if y ∼ U(S^d), as it would be for IID Gaussian Y_i, then p(y, ρ̂) is a random variable with mean p̂_1(ρ̂) and variance given by (2.11) with t = ρ̂.
The right hand side of (2.11) sums O(N²) terms. In a permutation analysis we might have N = m! or N = (m0+m1 choose m0) for binary X_i, and so the computational cost could be high. The symmetry in a permutation set allows us to use

∫_{S^d} |p(y, t) − p̂_1(t)|² dσ_d(y) = (1/N) Σ_{k=0}^{N−1} σ_d(C_2(x_0, x_k; t)) − p̂_1(t)²

instead. But that costs O(N), the same as the full permutation analysis.

When the X_i are binary, then for fixed t, σ_d(C_2(x_k, x_l; t)) depends only on the swap distance r between x_k and x_l. Then

∫_{S^d} |p(y, t) − p̂_1(t)|² dσ_d(y) = (1/N²) Σ_{r=0}^{min(m0,m1)} N_r V_2(u(r); t, d) − p̂_1(t)²   (2.14)
for V_2(u(r); t, d) given in Lemma 1, where N_r = Σ_{k=0}^{N−1} Σ_{l=0}^{N−1} 1(r_{k,l} = r) counts the pairs (x_k, x_l) at swap distance r.
Theorem 4. Let x_0 ∈ S^d be the centered and scaled vector from an experiment with binary X_i, of which m0 are negative and m1 are positive. Let x_0, x_1, . . . , x_{N−1} be the N = (m0+m1 choose m0) distinct permutations of x_0. If y ∼ U(S^d), then for t ∈ [−1, 1], and with u(r) defined in (2.3),

E(p(y, t)) = σ_d(C(y_0; t)), and
Var(p(y, t)) = (1/N) Σ_{r=0}^{min(m0,m1)} (m0 choose r)(m1 choose r) V_2(u(r); t, d) − p̂_1(t)².

Proof. There are (m0 choose r)(m1 choose r) permuted points x_i at swap distance r from x_0.
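The mean claim of Theorem 4 is easy to check by simulation (our own experiment, not from the text): draw y uniformly on the copy of S^d inside {z : z^T 1 = 0}, compute p(y, t) over all N permutations, and compare the average with σ_d(C(·; t)).

```python
import numpy as np
from itertools import combinations
from scipy.special import betainc

def cap_volume(t, d):
    if t >= 1.0:
        return 0.0
    if t <= -1.0:
        return 1.0
    half = 0.5 * betainc(d / 2.0, 0.5, 1.0 - t * t)
    return half if t >= 0 else 1.0 - half

m0 = m1 = 3
m = m0 + m1
d = m - 2
# all N = C(6, 3) = 20 centered and scaled binary vectors
perms = []
for idx in combinations(range(m), m1):
    x = np.zeros(m)
    x[list(idx)] = 1.0
    x = x - x.mean()
    perms.append(x / np.linalg.norm(x))
X = np.array(perms)

rng = np.random.default_rng(1)
t = 0.5
G = rng.standard_normal((20000, m))
G -= G.mean(axis=1, keepdims=True)                 # restrict to the mean-zero subspace
Ysamp = G / np.linalg.norm(G, axis=1, keepdims=True)  # uniform on its unit sphere, a copy of S^d
ps = (Ysamp @ X.T >= t).mean(axis=1)               # p(y, t) for each sampled center
mc_mean = float(ps.mean())
exact = cap_volume(t, d)                           # = 0.15625 for t = 0.5, d = 4
```

The Monte Carlo average of p(y, t) should agree with the cap volume to within sampling error, a few parts in a thousand here.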
2.4 A finer approximation to the p-value
In the previous section, we studied the distribution of p-values with the spherical cap centers y uniformly distributed on the sphere S^d. In this section, we give a finer approximation to p(y_0, ρ̂) by studying the distribution of the p-values with centers y satisfying the constraint ⟨y, x_c⟩ = ⟨y_0, x_c⟩ = ρ̃. The point x_c may be any permutation of x_0. There are two special choices. The first is to choose c = 0, so that x_c = x_0 is the original unpermuted data. The second is to choose x_c to be the permutation of x_0 closest to y_0, that is, c = arg max_i ⟨y_0, x_i⟩. We will give a general formula that works for any choice of x_c and compare the performance of these two choices.

The rationale for conditioning on all y satisfying ⟨y, x_c⟩ = ρ̃ is as follows. Since we want the exact p-value centered at y_0 with radius ρ̂, the more targeted the set of p-values we study, the better our approximation should be. When conditioning on ⟨y, x_c⟩ = ρ̃, we eliminate many irrelevant y. The approximation could be improved by conditioning on even more information, but the cost would go up. If we condition on the order statistics of all inner products ⟨y_0, x_i⟩, we get back the exact p-value.

For an index c ∈ {0, 1, . . . , N − 1} we propose finer approximations to the p-value
based on Model 2 from Section 2.2. These are

p̃_c = E_2(p(y, ρ̂)) = E_1(p(y, ρ̂) | y^T x_c = y_0^T x_c).   (2.15)

We are interested in two special cases,

p̂_2 = p̃_0, and p̂_3 = p̃_c with c = arg max_{0≤i<N} ⟨y_0, x_i⟩.   (2.16)

Because the true permutation p-value is always at least 1/N, having p̂_2 > 1/N is a desirable property. Similarly, p̂_3 > 1/N, because then x_c is in general an interior point of C(y, ρ̂). We expect p̂_3 to be more conservative than p̂_2, and we see this numerically in Section 2.7.
From Proposition 1, we can get our estimate p̃_c and its mean squared error by finding single and double inclusion probabilities for y.

To compute p̃_c we need to sum N values Pr(y ∈ C(x_k; t) | y^T x_c = ρ̃), and for p̃_c to be useful we must compute it in o(N) time. The computations are feasible in the binary case, which we now focus on.

Let u_j = x_j^T x_c for j = 1, 2, and let u_3 = x_1^T x_2. Let the projection of y on x_c be y = ρ̃ x_c + √(1 − ρ̃²) y*. Then the single and double point inclusion probabilities under Model 2 are

P_1(u_1, ρ̃, ρ̂) = ∫_{S^{d−1}} 1(⟨y, x_1⟩ ≥ ρ̂) dσ_{d−1}(y*), and   (2.17)

P_2(u_1, u_2, u_3, ρ̃, ρ̂) = ∫_{S^{d−1}} 1(⟨y, x_1⟩ ≥ ρ̂) 1(⟨y, x_2⟩ ≥ ρ̂) dσ_{d−1}(y*),   (2.18)

where ρ̂ = ⟨x_0, y_0⟩. If two permutations of x_0 are at swap distance r, then their inner product is u(r) = 1 − r(1/m0 + 1/m1) from equation (2.3).
Lemma 2. Let the projection of x_1 onto x_c be x_1 = u_1 x_c + √(1 − u_1²) x_1*. Then the
Figure 2.2: Illustration for Model 2. The original response vector is y_0 with y_0^T x_0 = ρ̂. We consider alternative y uniformly distributed on the surface of C(x_0; ρ̂), with examples y_1 and y_2. Around each such y_j there is a spherical cap of height ρ̂ that just barely includes x_c = x_0. We use p̂_2 = E_2(p(y; ρ̂)) and find an expression for E_2((p̂_2 − p(y; ρ̂))²).
single point inclusion probability from (2.17) is

P_1(u_1, ρ̃, ρ̂) = 1(ρ̃ u_1 ≥ ρ̂) if u_1 = ±1 or ρ̃ = ±1, and σ_{d−1}(C(x_1*; ρ*)) if u_1 ∈ (−1, 1) and ρ̃ ∈ (−1, 1),   (2.19)

where ρ* = (ρ̂ − ρ̃ u_1)/√((1 − ρ̃²)(1 − u_1²)).

Proof. The projection of y onto x_c is y = ρ̃ x_c + √(1 − ρ̃²) y*. Now

⟨y, x_1⟩ = ρ̃ u_1 if u_1 = ±1 or ρ̃ = ±1, and ρ̃ u_1 + √(1 − ρ̃²)√(1 − u_1²) ⟨y*, x_1*⟩ if u_1 ∈ (−1, 1) and ρ̃ ∈ (−1, 1),

and the result easily follows.
We can now give a computable expression for p̃_c, and hence for p̂_2 and p̂_3.

Theorem 5. For −1 ≤ ρ̂ ≤ 1 and −1 ≤ ρ̃ ≤ 1,

p̃_c = E_2(p(y, ρ̂)) = (1/N) Σ_{r=0}^{min(m0,m1)} (m0 choose r)(m1 choose r) P_1(u(r), ρ̃, ρ̂),   (2.20)

where u(r) is given in equation (2.3), P_1(u(r), ρ̃, ρ̂) is given in equation (2.19), and ρ̃ = x_c^T y_0.

Proof. There are (m0 choose r)(m1 choose r) permutations of x_0 at swap distance r from x_c.
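Theorem 5 turns p̃_c into a short loop over swap distances. The sketch below (ours, not the dissertation's own code) uses the incomplete beta formula for σ_d and Lemma 2 for P_1; the clamp on u(r) guards against floating point drift at r = min(m0, m1):

```python
import numpy as np
from math import comb
from scipy.special import betainc

def cap_volume(t, d):
    if t >= 1.0:
        return 0.0
    if t <= -1.0:
        return 1.0
    half = 0.5 * betainc(d / 2.0, 0.5, 1.0 - t * t)
    return half if t >= 0 else 1.0 - half

def P1(u1, rho_t, rho_h, d):
    """Single point inclusion probability of Lemma 2."""
    if abs(u1) >= 1.0 or abs(rho_t) >= 1.0:
        return float(rho_t * u1 >= rho_h)
    rho_star = (rho_h - rho_t * u1) / np.sqrt((1 - rho_t ** 2) * (1 - u1 ** 2))
    return cap_volume(rho_star, d - 1)

def p_tilde(m0, m1, rho_t, rho_h):
    """Finer approximation (2.20); rho_t = x_c^T y_0, rho_h = x_0^T y_0."""
    m = m0 + m1
    d = m - 2
    N = comb(m, m0)
    total = 0.0
    for r in range(min(m0, m1) + 1):
        u = max(-1.0, 1.0 - r * (1.0 / m0 + 1.0 / m1))   # u(r), clamped to [-1, 1]
        total += comb(m0, r) * comb(m1, r) * P1(u, rho_t, rho_h, d)
    return total / N
```

With c = 0 and ρ̃ = ρ̂ > 0, the r = 0 term contributes 1, so p̃_0 ≥ 1/N, matching the discussion of p̂_2 above.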
From (2.20) we see that p̃_c can be computed in O(m) work. The mean squared error for p̃_c is more complicated and will be more expensive. We need the double point inclusion probabilities, and then we need to count the number of pairs (x_k, x_l) forming a given set of swap distances among x_k, x_l and x_c.
Lemma 3. For j = 1, 2, let x_j be at swap distance r_j from x_c, and let r_3 be the swap distance between x_1 and x_2. Let u_1, u_2, u_3 be the corresponding inner products given by (2.3). If there are equalities among x_1, x_2 and x_c, then the double point inclusion probability from (2.18) is

P_2(u_1, u_2, u_3, ρ̃, ρ̂) =
1(ρ̃ ≥ ρ̂), if x_1 = x_2 = x_c;
1(ρ̃ ≥ ρ̂) P_1(u_2, ρ̃, ρ̂), if x_1 = x_c ≠ x_2;
1(ρ̃ ≥ ρ̂) P_1(u_1, ρ̃, ρ̂), if x_2 = x_c ≠ x_1;
P_1(u_2, ρ̃, ρ̂), if x_1 = x_2 ≠ x_c.

If x_1, x_2 and x_c are three distinct points with min(u_1, u_2) = −1, then

P_2(u_1, u_2, u_3, ρ̃, ρ̂) =
1(−ρ̃ ≥ ρ̂) P_1(u_2, ρ̃, ρ̂), if u_1 = −1;
1(−ρ̃ ≥ ρ̂) P_1(u_1, ρ̃, ρ̂), if u_2 = −1.
Otherwise −1 < u_1, u_2 < 1, and then

P_2(u_1, u_2, u_3, ρ̃, ρ̂) =
1(ρ̃ u_1 ≥ ρ̂) 1(ρ̃ u_2 ≥ ρ̂), if ρ̃ = ±1;
∫_{−1}^{1} (ω_{d−2}/ω_{d−1}) (1 − t²)^{(d−1)/2 − 1} 1(t ≥ ρ_1) 1(t u_3* ≥ ρ_2) dt, if ρ̃ ≠ ±1 and u_3* = ±1;
∫_{−1}^{1} (ω_{d−2}/ω_{d−1}) (1 − t²)^{(d−1)/2 − 1} 1(t ≥ ρ_1) σ_{d−2}(C(x_2**; (ρ_2 − t u_3*)/(√(1 − t²) √(1 − u_3*²)))) dt, if ρ̃ ≠ ±1 and |u_3*| < 1,

where

u_3* = (u_3 − u_1 u_2)/(√(1 − u_1²) √(1 − u_2²)) and ρ_j = (ρ̂ − ρ̃ u_j)/(√(1 − ρ̃²) √(1 − u_j²)), j = 1, 2,   (2.21)

and x_2** is the residual from the projection of x_2* on x_1*.

Proof. See Section A.2.
Proof. See Section A.2.
Next we consider the swap configuration among x_1, x_2 and x_c. Let x_j be at swap distance r_j from x_c, for j = 1, 2. We let δ_1 be the number of positive components of x_c that are negative in both x_1 and x_2. Similarly, δ_2 is the number of negative components of x_c that are positive in both x_1 and x_2. See Figure 2.3. The swap distance between x_1 and x_2 is then r_3 = r_1 + r_2 − δ_1 − δ_2.

Let r = (r_1, r_2), δ = (δ_1, δ_2) and r_min = min(r_1, r_2). We will study values of r_1, r_2, r_3, δ_1, δ_2 ranging over the following sets:

r_1, r_2 ∈ R = {1, . . . , min(m0, m1)}
δ_1 ∈ D_1(r) = {max(0, r_1 + r_2 − m0), . . . , r_min}
δ_2 ∈ D_2(r) = {max(0, r_1 + r_2 − m1), . . . , r_min}, and
r_3 ∈ R_3(r) = {max(1, r_1 + r_2 − 2 r_min), . . . , min(r_1 + r_2, min(m0, m1), m0 + m1 − r_1 − r_2)}.

Whenever the lower bound for one of these sets exceeds the upper bound, we take the set to be empty, and a sum over it to be zero. Note that while r_1 = 0 is possible, it corresponds to x_1 = x_c; we handle that case specially, excluding it from R.
Figure 2.3: Illustration of r_1, r_2, δ_1 and δ_2. The points x_c, x_1 and x_2 each have m0 negative and m1 positive components. For j = 1, 2 the swap distance between x_j and x_c is r_j. There are δ_1 positive components of x_c where both x_1 and x_2 are negative, and δ_2 negative components of x_c where both x_j are positive.
The number of pairs (x_k, x_l) with a given r and δ is

c(r, δ) = (m0 choose δ_1)(m1 choose δ_2)(m0−δ_1 choose r_1−δ_1)(m1−δ_2 choose r_1−δ_2)(m0−r_1 choose r_2−δ_1)(m1−r_1 choose r_2−δ_2).   (2.22)

Then the number of configurations for given r_1, r_2 and r_3 is

c(r_1, r_2, r_3) = Σ_{δ_1 ∈ D_1} Σ_{δ_2 ∈ D_2} c(r, δ) 1(r_3 = r_1 + r_2 − δ_1 − δ_2).   (2.23)
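The counts (2.22) and (2.23) can be checked by brute force: summed over all admissible (r_1, r_2, r_3), they must account for every ordered pair of permutations distinct from each other and from x_c, i.e. (N − 1)(N − 2) pairs. A sketch of that check (ours, for illustration):

```python
from math import comb

def total_config_count(m0, m1):
    """Sum of c(r1, r2, r3) over the admissible ranges of Section 2.4."""
    mmin = min(m0, m1)
    total = 0
    for r1 in range(1, mmin + 1):
        for r2 in range(1, mmin + 1):
            rmin = min(r1, r2)
            for d1 in range(max(0, r1 + r2 - m0), rmin + 1):
                for d2 in range(max(0, r1 + r2 - m1), rmin + 1):
                    r3 = r1 + r2 - d1 - d2
                    lo = max(1, r1 + r2 - 2 * rmin)      # r3 >= 1 means x1 != x2
                    hi = min(r1 + r2, mmin, m0 + m1 - r1 - r2)
                    if lo <= r3 <= hi:
                        total += (comb(m0, d1) * comb(m1, d2)
                                  * comb(m0 - d1, r1 - d1) * comb(m1 - d2, r1 - d2)
                                  * comb(m0 - r1, r2 - d1) * comb(m1 - r1, r2 - d2))
    return total
```

Every ordered pair (x_1, x_2) with x_1 ≠ x_2 and both distinct from x_c has a unique configuration (r_1, r_2, δ_1, δ_2), so the grand total should be (N − 1)(N − 2).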
We can now give an expression for the second moment of p(y, ρ̂) under Model 2 which, combined with Theorem 5 for the mean, provides an expression for the mean squared error of p̃_c.
Theorem 6. For −1 ≤ ρ̂ ≤ 1 and −1 ≤ ρ̃ ≤ 1,

E_2(p(y, ρ̂)²) = (1/N²) [ 1(ρ̃ ≥ ρ̂) + 2 Σ_{r=1}^{min(m0,m1)} (m0 choose r)(m1 choose r) P_2(1, u(r), u(r), ρ̃, ρ̂) + Σ_{r=1}^{min(m0,m1)} (m0 choose r)(m1 choose r) P_1(u(r), ρ̃, ρ̂) + Σ_{r_1 ∈ R} Σ_{r_2 ∈ R} Σ_{r_3 ∈ R_3(r)} c(r_1, r_2, r_3) P_2(u_1, u_2, u_3, ρ̃, ρ̂) ],   (2.24)

where P_2(·) is the double inclusion probability in (2.18) and c(r_1, r_2, r_3) is the configuration count in (2.23).

Proof. See Section A.3 of the Appendix.
In our experience, the cost of computing E_2(p(y, ρ̂)²) under Model 2 is dominated by the cost of the O(m³) integrals required to get the P_2(·) values in (2.24). The cost also includes an O(m⁴) component, because c(r_1, r_2, r_3) is itself a sum of O(m) terms, but that part did not dominate the computation at the sample sizes we looked at (up to several hundred).
2.5 Generalized Stolarsky Invariance
Here we obtain the Model 2 results in a different way, by extending the work of Brauchart and Dick (2013). They introduced a weight on the height t of the spherical cap in the average. We now apply a weight function to the inner product ⟨z, x_c⟩ between the center z of the spherical cap and a special point x_c.
Theorem 7. Let x_0, . . . , x_{N−1} be arbitrary points in S^d and let v(·) and h(·) be positive functions in L²([−1, 1]). Then for any x′ ∈ S^d, the following equation holds:

∫_{−1}^{1} v(t) ∫_{S^d} h(⟨z, x′⟩) |σ_d(C(z; t)) − (1/N) Σ_{k=0}^{N−1} 1_{C(z;t)}(x_k)|² dσ_d(z) dt
= (1/N²) Σ_{k,l=0}^{N−1} K_{v,h,x′}(x_k, x_l) + ∫_{S^d} ∫_{S^d} K_{v,h,x′}(x, y) dσ_d(x) dσ_d(y) − (2/N) Σ_{k=0}^{N−1} ∫_{S^d} K_{v,h,x′}(x, x_k) dσ_d(x),   (2.25)

where K_{v,h,x′} : S^d × S^d → R is a reproducing kernel defined by

K_{v,h,x′}(x, y) = ∫_{−1}^{1} v(t) ∫_{S^d} h(⟨z, x′⟩) 1_{C(z;t)}(x) 1_{C(z;t)}(y) dσ_d(z) dt.   (2.26)
Proof. See Section A.4 of the Appendix.
Remark. We will use this result with x′ = x_c, where x_c is one of the N given points. The theorem holds for general x′ ∈ S^d, but the result is computationally and statistically more attractive when x′ = x_c.

We now show that the second moment in Theorem 6 arises as a special limiting case of Theorem 7. In addition to v_ε from Section 2.3, we introduce η = (η_1, η_2) ∈ (0, 1)² and

h_η(s) = η_2 + (1/η_1) ((ω_{d−1}/ω_d)(1 − s²)^{d/2−1})^{−1} 1(ρ̃ ≤ s ≤ ρ̃ + η_1).   (2.27)

Using these results we can now establish the following theorem, which provides the second moment of p(y, ρ̂) under Model 2.
Theorem 8. Let x0 ∈ S^d be the centered and scaled vector from an experiment with binary Xi of which m0 are negative and m1 are positive. Let x0, x1, . . . , xN−1 be the N = \binom{m_0+m_1}{m_0} distinct permutations of x0. Let xc be one of the xk and let p̃c be given by (2.15). Then

\[
E_2\bigl(p(y,\hat\rho)^2\bigr) = \frac{1}{N^2}\sum_{k,l=0}^{N-1}
\int_{S^{d-1}} \mathbf{1}(\langle y, x_k\rangle \ge \hat\rho)\,
\mathbf{1}(\langle y, x_l\rangle \ge \hat\rho)\,d\sigma_{d-1}(y^*)
\]

where y = ρ̃xc + √(1 − ρ̃²) y∗.
Proof. The proof uses Theorem 7 with a sequence of h defined in (2.27) and v defined in (2.10). See Section A.5 of the Appendix.
This result shows that we can use the invariance principle to derive the second
moment of p(y, ρ̂) under Model 2. The mean square in Theorem 8 is consistent with
the second moment equation (2.13) in Proposition 1.
2.6 Two-sided p-values
In statistical applications it is more usual to report two-sided p-values. A conservative approach is to use 2 min(p, 1 − p) where p is a one-sided p-value. A sharper choice is

\[
p = \frac{1}{N}\sum_{k=0}^{N-1} \mathbf{1}\bigl(|x_k^{\mathsf T} y_0| \ge |\hat\rho|\bigr).
\tag{2.28}
\]
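For tiny samples, the two-sided p-value in (2.28) can be computed by direct enumeration of the distinct permutations. The brute-force sketch below (ours; the function name is illustrative) does exactly that, and checks it on a balanced design where the answer is known by symmetry.

```python
import itertools
import numpy as np

def two_sided_perm_p(x0, y0):
    """Exact two-sided permutation p-value in the sense of (2.28): the fraction
    of the N = C(m0+m1, m1) distinct permutations x_k of x0 with
    |x_k' y0| >= |rho_hat|."""
    m = len(x0)
    lo, hi = np.sort(np.unique(x0))      # the two distinct entry values of x0
    m1 = int((x0 == hi).sum())
    rho_hat = abs(float(x0 @ y0))
    count = N = 0
    for idx in itertools.combinations(range(m), m1):
        xk = np.full(m, lo)
        xk[list(idx)] = hi               # one distinct permutation of x0
        N += 1
        count += abs(float(xk @ y0)) >= rho_hat
    return count / N

# Binary treatment with m0 = m1 = 3, centered and scaled to unit norm.
X = np.array([0, 0, 0, 1, 1, 1], dtype=float)
x0 = (X - X.mean()) / np.linalg.norm(X - X.mean())
# With y0 = x0 we get rho_hat = 1; only x0 and -x0 attain |inner product| = 1,
# so the two-sided p-value is 2/20 = 0.1 over the N = C(6,3) = 20 permutations.
p = two_sided_perm_p(x0, x0)
```

This enumeration is only feasible for small N; the point of the approximations in this chapter is precisely to avoid it when N is astronomically large.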
This choice changes our Model 2 estimate. It also changes the second moment of our
Model 1 estimate.
The two-sided version of the estimate p̂1(ρ̂) is 2σd(C(y; |ρ̂|)), the same as if we had doubled a one-tailed estimate. Also E1(p) = p̂1 in the two-tailed case. We now consider the mean square for the two-tailed estimate under Model 1. For x1, x2 ∈ S^d with u = x1ᵀx2, the two-tailed double inclusion probability under Model 1 is

\[
\tilde V_2(u; t, d) = \int_{S^d}
\mathbf{1}(|z^{\mathsf T} x_1| \ge |t|)\,
\mathbf{1}(|z^{\mathsf T} x_2| \ge |t|)\,d\sigma_d(z).
\]
Writing 1(|zᵀxi| ≥ |t|) = 1(zᵀxi ≥ |t|) + 1(zᵀ(−xi) ≥ |t|) for i = 1, 2 and expanding the product, we get

\[
\tilde V_2(u; t, d) = 2V_2(u; |t|, d) + 2V_2(-u; |t|, d).
\]
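The expansion behind this identity is easy to check numerically. In the sketch below (ours), every term is estimated from the same uniform draws z on S^d, so the four-way decomposition of the absolute-value indicators holds exactly draw by draw, while A ≈ D and B ≈ C reflect the z → −z symmetry that pairs the four terms into 2V₂(u; |t|, d) + 2V₂(−u; |t|, d).

```python
import numpy as np

rng = np.random.default_rng(1)
d, t = 4, 0.3
x1, x2 = rng.standard_normal((2, d + 1))
x1, x2 = x1 / np.linalg.norm(x1), x2 / np.linalg.norm(x2)
z = rng.standard_normal((100_000, d + 1))
z /= np.linalg.norm(z, axis=1, keepdims=True)   # uniform points on S^d

a1, a2 = z @ x1, z @ x2
vtilde = np.mean((np.abs(a1) >= t) & (np.abs(a2) >= t))  # est. of V~2(u; t, d)
# The four products from 1(|a| >= t) = 1(a >= t) + 1(-a >= t), valid for t > 0:
A = np.mean((a1 >= t) & (a2 >= t))     # estimates V2(u; t, d)
B = np.mean((a1 >= t) & (-a2 >= t))    # estimates V2(-u; t, d)
C = np.mean((-a1 >= t) & (a2 >= t))    # estimates V2(-u; t, d)
D = np.mean((-a1 >= t) & (-a2 >= t))   # estimates V2(u; t, d)
assert abs(vtilde - (A + B + C + D)) < 1e-12    # exact, draw by draw
```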
By replacing V2(u; t, d) with Ṽ2(u; t, d) and p̂1(t) with 2σd(C(y; |t|)) in Theorem 4, we get the variance of two-sided p-values under Model 1.
To obtain corresponding formulas under Model 2, we use the usual notation. Let uj = xjᵀx0 for j = 1, 2, and let u3 = x1ᵀx2. Let the projection of y on xc be y = ρ̃xc + √(1 − ρ̃²) y∗. Now

\[
\tilde P_1(u_1, \tilde\rho, \hat\rho) = \int_{S^{d-1}}
\mathbf{1}\bigl(|\langle y, x_1\rangle| \ge |\hat\rho|\bigr)\,d\sigma_{d-1}(y^*),
\quad\text{and}
\tag{2.29}
\]

\[
\tilde P_2(u_1, u_2, u_3, \tilde\rho, \hat\rho) = \int_{S^{d-1}}
\mathbf{1}\bigl(|\langle y, x_1\rangle| \ge |\hat\rho|\bigr)\,
\mathbf{1}\bigl(|\langle y, x_2\rangle| \ge |\hat\rho|\bigr)\,d\sigma_{d-1}(y^*)
\tag{2.30}
\]

are the appropriate single and double inclusion probabilities.
After writing 1(|〈y, xi〉| ≥ |ρ̂|) = 1(〈y, xi〉 ≥ |ρ̂|) + 1(〈y, −xi〉 ≥ |ρ̂|) for i = 1, 2 and expanding the product, we get

\[
\begin{aligned}
\tilde P_1(u_1, \tilde\rho, \hat\rho) &= P_1(u_1, \tilde\rho, |\hat\rho|) + P_1(-u_1, \tilde\rho, |\hat\rho|), \quad\text{and}\\
\tilde P_2(u_1, u_2, u_3, \tilde\rho, \hat\rho) &= P_2(u_1, u_2, u_3, \tilde\rho, |\hat\rho|) + P_2(-u_1, u_2, -u_3, \tilde\rho, |\hat\rho|)\\
&\quad + P_2(u_1, -u_2, -u_3, \tilde\rho, |\hat\rho|) + P_2(-u_1, -u_2, u_3, \tilde\rho, |\hat\rho|).
\end{aligned}
\]

Changing P1(u1, ρ̃, ρ̂) and P2(u1, u2, u3, ρ̃, ρ̂) to P̃1(u1, ρ̃, ρ̂) and P̃2(u1, u2, u3, ρ̃, ρ̂) respectively in Theorems 5 and 6, we get the first and second moments for two-sided p-values under Model 2.
For a two-sided p-value, p̂3 is calculated with xc̃ where c̃ = arg maxᵢ |〈y0, xi〉|. For m0 = m1, c̃ = c = arg maxᵢ 〈y0, xi〉, but the result may differ significantly for unequal sample sizes.
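For a binary x0, the argmax c̃ need not be found by enumerating all N permutations: by the rearrangement inequality, 〈y0, x〉 is maximized by placing the m1 larger entries of x0 on the m1 largest coordinates of y0 and minimized by placing them on the m1 smallest, so the two-sided maximizer is one of those two candidates. A sketch (ours; helper names are illustrative, not from the text), checked against brute force:

```python
import itertools
import numpy as np

def closest_permutation(x0, y0):
    """Return the permutation x of x0 maximizing |<y0, x>|, using sorting
    instead of enumerating all N = C(m0+m1, m1) distinct permutations."""
    lo, hi = np.sort(np.unique(x0))          # the two distinct entry values
    m1 = int((x0 == hi).sum())
    order = np.argsort(y0)
    x_max = np.full(len(x0), lo); x_max[order[-m1:]] = hi  # maximizes <y0, x>
    x_min = np.full(len(x0), lo); x_min[order[:m1]] = hi   # minimizes <y0, x>
    return max((x_max, x_min), key=lambda x: abs(float(x @ y0)))

# Brute-force check for m0 = 4, m1 = 3 (N = 35 permutations).
X = np.array([0, 0, 0, 0, 1, 1, 1], dtype=float)
x0 = (X - X.mean()) / np.linalg.norm(X - X.mean())
y0 = np.random.default_rng(2).standard_normal(7)
lo, hi = x0.min(), x0.max()
brute = max(
    (np.where(np.isin(np.arange(7), idx), hi, lo)
     for idx in itertools.combinations(range(7), 3)),
    key=lambda xk: abs(float(xk @ y0)),
)
assert abs(abs(brute @ y0) - abs(closest_permutation(x0, y0) @ y0)) < 1e-9
```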
2.7 Numerical Results
We consider two-sided p-values in this section. First we evaluate the accuracy of p̂1, the simple spherical cap volume approximate p-value. We considered m0 = m1 in a range of values from 5 to 200. The values p̂1 ranged from just below 1 to 2 × 10⁻³⁰. We judge the accuracy of this estimate by its root mean squared error. Under Model 1 this is (E(p̂1(ρ) − p(y, ρ))²)^{1/2} for y ∼ U(S^d). Figure 2.4a shows this RMSE decreasing towards 0 as p̂1 goes to 0 with ρ going to 1. The RMSE also decreases with increasing sample size, as we would expect from the central limit theorem.
As seen in Figures 2.4a and 2.4b, the RMSE is not monotone in p̂1. Right at
p̂1 = 1 we know that RMSE = 0 and around 0.1 there is a dip. The practically
interesting values of p̂1 are much smaller than 0.1, and the RMSE is monotone for
them.
A problem with p̂1 is that it can approach 0 even though p ≥ 1/N. The Model 1 RMSE does not reflect this problem. By studying (E₂(p̂1(ρ) − p(y, ρ))²)^{1/2}, we get a different result. In Figure 2.4c, the RMSE of p̂1 under Model 2 reaches a plateau as p̂1 goes to 0. The Model 2 RMSE reveals the flaw in p̂1 going below 1/N.
The estimator p̂2 = p̃0 performs better than p̂1 because it makes more use of the data, and it is never below 1/N. As seen in Figure 2.4d, the RMSE of p̂2 very closely matches p̂2 itself as p̂2 decreases to zero. That is, the relative error |p̂2 − p|/p̂2 is well behaved for small p-values. Also, as p̂2 drops to the granularity limit 1/N, its RMSE drops to 0.
The estimators p̂1 and p̂2 do not differ much for larger p-values, as seen in Figure 2.5a. But in the limit as ρ̂ → 1 we see that p̂1 → 0, while p̂2 approaches the granularity limit 1/N instead.
Figure 2.5b compares the RMSE of the two estimators under Model 2. As ex-
pected, p̂2 is more accurate. It also shows that the biggest differences occur only
when p̂1 goes below 1/N .
To examine the behavior of p̂2 more closely, we plot its coefficient of variation in
Figure 2.6. We see that the relative uncertainty in p̂2 is not extremely large. Even
when the estimated p-values are as small as 10⁻³⁰, the coefficient of variation is below 5.
In Section 2.4, we mentioned another choice for xc. It was p̂3 = p̃c, where xc is the
closest permutation of x0 to y0. We compare p̂3 to p̂2 in Figures 2.7 and 2.8. We fixed
the observed x0 and ρ, and then randomly sampled 100 vectors y0 with 〈y0, x0〉 = ρ. All 100 of the y0 lead to the same value for p̂2 and its standard deviation. We get 100
different estimates for p̂3 and its standard deviation. We varied m0 and m1, choosing
ρ so that the values of p̂2 are comparable at different sample sizes. Figure 2.7 shows
the estimates p̂3 with reference points for p̂2. As expected p̂3 tends to be larger than
p̂2. Figure 2.8 shows the sample RMSEs for p̂3 with reference points for the RMSE
for p̂2. The top row of plots has m0 = m1 while the bottom row has m1 = 2m0. The
left column of plots is at larger p-values than the rightmost column. We see that
neither choice always has the smaller RMSE, but p̂2 is usually more accurate.
2.8 Comparison to saddlepoint approximation
Many approximation methods have been proposed for permutation tests. Zhou et al.
(2009) fit approximations by moments in the Pearson family. Larson and Owen (2015)
fit Gaussian and beta approximations to linear statistics and gamma approximations
to quadratic statistics for gene set testing problems. Knijnenburg et al. (2009) fit
generalized extreme value distributions to the tails of sampled permutation values.
These approximations do not come with an all-inclusive p-value that accounts for both numerical and sampling uncertainty. The sampling method does come with such a p-value if we add one to the numerator and denominator, as Barnard (1963) suggests. But that method cannot attain very small p-values. Reasonable power to attain p ≤ ε requires a sample of somewhere between 3/ε and 19/ε random permutations (Larson and Owen, 2015).
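The Barnard-style sampled estimate is (b + 1)/(M + 1), where b counts the M sampled permutations at least as extreme as the observed statistic; its floor of 1/(M + 1) is what forces roughly 1/ε draws to resolve p ≤ ε. A sketch (ours; the statistic and names are illustrative):

```python
import numpy as np

def sampled_perm_p(x, y, n_perm, rng):
    """Sampled permutation p-value with Barnard's add-one correction:
    (b + 1)/(M + 1), a valid p-value that accounts for sampling uncertainty."""
    t_obs = abs(float(x @ y))
    b = sum(abs(float(rng.permutation(x) @ y)) >= t_obs for _ in range(n_perm))
    return (b + 1) / (n_perm + 1)

rng = np.random.default_rng(3)
x = np.repeat([-1.0, 1.0], 10)       # balanced binary treatment, m0 = m1 = 10
y = rng.standard_normal(20)
p = sampled_perm_p(x, y, 999, rng)
# The estimate is confined to [1/1000, 1]: with 999 sampled permutations it
# can never report a p-value smaller than 10^-3, however extreme t_obs is.
assert 1 / 1000 <= p <= 1
```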
The strongest theoretical support for approximate p-values comes from saddlepoint approximations. Reid (1988) surveys saddlepoint approximations and Robinson (1982) develops them for permutation tests of the linear statistics we have considered here. When the true p-value is p, the saddlepoint approximation p̂s satisfies p̂s = p(1 + O(1/n)). Because we do not know the implied constant in O(1/n) or the n