View
16
Download
0
Category
Preview:
Citation preview
October 14, 2021
TRUNCATED RANK-BASED TESTS FOR TWO-PART MODELS WITHEXCESSIVE ZEROS AND APPLICATIONS TO MICROBIOME DATA
BY WANJIE WANG1, ERIC Z. CHEN2 AND HONGZHE LI2,*
1National University of Singapore, staww@nus.edu.sg
2University of Pennsylvania, *hongzhe@upenn.edu
High-throughput sequencing technology allows us to test the composi-tional difference of bacteria in different populations. One important feature ofhuman microbiome data is that it often includes a large number of zeros. Suchdata can be treated as being generated from a two-part model that includes azero point-mass. Motivated by analysis of such non-negative data with exces-sive zeros, we introduce several truncated rank-based two-group and multi-group tests for such data, including a truncated rank-based Wilcoxon rank-sum test for two-group comparison and two truncated Kruskal-Wallis testsfor multi-group comparison. We show both analytically through asymptoticrelative efficiency analysis and by simulations that the proposed tests havehigher power than the standard rank-based tests, especially when the propor-tion of zeros in the data is high. The tests can also be applied to repeatedmeasurements of compositional data via simple within-subject permutations.We apply the tests to the analysis of a gut microbiome data set to compare themicrobiome compositions of healthy and pediatric Crohn’s disease patientsand to assess the treatment effects on microbiome compositions. We identifyseveral bacterial genera that are missed by the standard rank-based tests.
1. Introduction. The human microbiome includes all microorganisms in various humanbody sites such as gut, skin and mouth and has been shown to be associated with many humandiseases such as obesity, diabetes and inflammatory bowel disease (Turnbaugh et al., 2006;Qin et al., 2012; Manichanh et al., 2012). Two high-throughput sequencing based approaches,including 16S ribosomal RNA (rRNA) sequencing and shotgun metagenomic sequencing arecommonly used in microbiome studies (Turnbaugh et al., 2007; Qin et al., 2010). Bioin-formatics methods are available for quantifying the microbial relative abundances based onsuch sequencing data, which typically involve aligning the reads to some known databaseor marker genes (Huson et al., 2007; Segata et al., 2012). Since the DNA yielding materialsare different across different samples, the resulting numbers of sequencing reads vary greatlyfrom sample to sample. In order to make the microbial abundance comparable across sam-ples, the abundance in read counts is usually normalized to the relative abundance of thebacteria observed, which results in high dimensional compositional data. Some of the mostwidely used metagenomic processing software such as MEGAN (Huson et al., 2007) andMetaPhlAn (Segata et al., 2012) only outputs the relative abundances of the bacterial taxa.
In microbiome studies, one is often interested in identifying the bacterial taxa such asgenera or species that show different distributions between two or more conditions. One im-portant feature of the microbiome compositional data is that the data include large clumpsof zeros that represent the absence of the bacterial taxa in the samples, especially for those
arXiv: math.PR/0000000*This research was supported by NIH grants GM123056 and GM129781. Hongzhe Li is the corresponding
author.Keywords and phrases: Asymptotic relative efficiency; Differential abundance analysis; Two-part model;
Truncation;.
1
arX
iv:2
110.
0536
8v3
[st
at.M
E]
13
Oct
202
1
2
relatively rare taxa. Zero observations can also result from under-sampling of the sequencingreads for rare taxa. As an example, Figure 1 shows the heatmap of zeros and the relative abun-dances for 26 healthy and normal samples and 85 samples in Crohn’s disease group for eachof the 60 bacterial genera. We observed over 62.5% of the observations are zero. In addition,the compositional data are often skewed, which makes parametric modeling of such data dif-ficult and parametric tests problematic. Non-parametric tests, such as the Wilcoxon rank-sumtest and Kruskal-Wallis test, can be applied to such data and are commonly used in analysisof microbiome data. However, such rank-based tests tend to have low power because of thelarge number of ties from zero observations (Lachenbruch, 1976, 2001; Hallstrom, 2010).A two-part test combining the square of a test statistic for comparison of the proportion ofzeros and the square of an appropriate normal test such as the Wilcoxon rank-sum to comparethe non-zero scores was proposed and evaluated by Lachenbruch (2001). This two-part testwas recently applied to analysis of microbiome data (Wagner, Robertson and Harris, 2011).However, the theory developed for Lachenbruch’s 2 degree of freedoms χ2 test assumes thatthe non-zero compositions are independent of the proportion of zeros.
g__B
acter
oides
g_
_Prev
otella
g_
_Esc
heric
hia
g__L
actob
acillu
s g_
_Stre
ptoco
ccus
g_
_Bifid
obac
terium
g_
_Rum
inoco
ccus
g_
_Eub
acter
ium
g__F
aeca
libac
terium
g_
_Alist
ipes
g__E
nteroc
occu
s g_
_Kleb
siella
g_
_Ros
eburi
a g_
_Clos
tridium
g_
_Dial
ister
g__H
aemo
philu
s g_
_Sutt
erella
g_
_Veil
lonell
a g_
_Para
bacte
roide
s g_
_Coll
insell
a g_
_Acti
nomy
ces
g__M
egam
onas
g_
_Lac
tococ
cus
g__O
dorib
acter
g_
_Blau
tia
g__C
oproc
occu
s g_
_Dore
a g_
_Akke
rman
sia
g__P
edioc
occu
s g_
_Pep
toniph
ilus
__Me
thano
brevib
acter
g_
_Cop
robac
illus
g__R
othia
hasc
olarct
obac
terium
g_
_Hold
eman
ia g_
_Bilo
phila
g_
_Turi
cibac
ter
g__E
nterob
acter
g_
_Ana
erosti
pes
g__G
ranuli
catel
la g_
_Gem
ella
g__E
ggert
hella
g_
_Ana
erotru
ncus
g_
_Cate
nibac
terium
g_
_Des
ulfov
ibrio
g__A
naero
cocc
us
g__P
orphy
romon
as
g__V
ictiva
llis
g__A
biotro
phia
g__S
oloba
cteriu
m g_
_Atop
obium
g_
_Sub
dolig
ranulu
m g_
_Parv
imon
as
g__G
ordon
ibacte
r g_
_Slac
kia
_Can
didatu
s_Zin
deria
g_
_Ana
erofus
tis
g__O
ribac
terium
g_
_Buty
rivibr
io _P
seud
oflav
onifra
ctor
0
20
40
60
80
Croh
n’s D
iseas
e No
rmal
Fig 1: Heatmap of zeros and the relative abundances for 26 normal samples and 85 samplesin Crohn’s disease group for each of the 60 bacterial genera, showing over 62.5% of theobservations being zero.
To account for excessive zeros in non-negative distributions, Hallstrom (2010) introduceda truncated Wilcoxon rank-sum test, where the Wilcoxon rank-sum test is performed afterremoving an equal and maximal number of zeros from each sample. He showed that thistest recovers much of the power loss from the standard application of the Wilcoxon test.Compared with a directional modification of a two-part test proposed by Lachenbruch, thetruncated Wilcoxon test has similar power when the non-zero scores are independent of theproportion of zeros. In addition, the truncated Wilcoxon test is relatively unaffected whenthese processes are dependent. Hallstrom (2010) however only considered the two-sampletest when the sample sizes are the same in both groups.
In this paper, we assume that the data are generated from two-part models with point-massat zero as one of the components. We develop several rank-based tests for the general two-group comparison with possible unequal sample sizes, and for the multiple-group comparison
3
with equal or unequal sample sizes. These new tests are based on the idea of data truncationand asymptotic calculations and can effectively deal with the clumps of zeros in the data.The asymptotic null distributions of the tests are given. The key difficulty of defining suchtruncated rank-based tests is to calculate the variance of the test statistics under the null,which does not have close-form expressions. We instead use asymtotic analysis to obtainapproximations of the variance estimates.
In order to demonstrate the advantages of the proposed truncated rank-based tests, wealso derive the asymptotic relative efficiency of the proposed tests compared to commonlyused Wilcoxon rank-sum test and the Kruskal-Wallis test when the data are generated fromtwo-part models (Lachenbruch, 2001). We show large gain in efficiency, especially when theproportions of zeros in the data are high. These tests are rank-based, easy to calculate andprovide new tools for identifying the bacterial taxa with different distributions among differ-ent groups in human microbiome studies. We illustrate and compare the proposed tests usinga microbiome study comparing healthy and pediatric Crohn’s disease patients and assessingthe effects of treatment over time.
The rest of the paper is organized as follows. In Section 2, we extend the truncatedWilcoxon rank-sum test of Hallstrom (2010) to data with unequal sample sizes. We thenpresent in Section 3 and Section 4 a modified Kruskal-Wallis test to account for clumps ofzeros for multiple-sample setting with equal sample sizes and unequal sample sizes, respec-tively. For all these extensions, we derive the asymptotic relatively efficiency compared to thestandard rank-based tests. In Section 5, we present simulations to evaluate the proposed testsand to compare their power. In Section 6, we present results from an analysis of pediatricCrohn’s disease microbiome study at the University of Pennsylvania to identify the bacterialgenera that are associated with pediatric Crohn’s disease and to identify the bacterial generathat change their abundances over time during the anti-TNF treatment. Finally, we give abrief summary and discussion in Section 7. Details of the mathematical derivations and proofof relevant lemmas are provided as online Supplemental Materials.
2. A truncated Wilcoxon rank-sum test for data with excessive zero.
2.1. A truncated Wilcoxon rank-sum test. Consider the two-sample setting where wehave N1 non-negative independent observations from population 1, x = (x(1), · · · , x(N1)),and N2 independent observations from population 2, y = (y(1), · · · , y(N2)) where x(i) (ory(i)) represents the relative abundance of a bacterium in the ith sample of group 1 (or 2). Formost of the bacteria, x and y include many zeros, which represent absence of the bacteriumin these samples. We are interested in testing whether these observations x and y are fromthe same distribution, i.e., the hypothesis testing problem that
H0 : x∼ F,y∼ F vs H1 : x∼ F,y∼G,F 6=G,
where F andG are both probability density functions with a proportion of zeros. We considerthe two-part model of Lachenbruch (1976), which assumes that the data are generated fromthe following distributions,
(1) x(i)∼ (1− θ1)δ0 + θ1f, y(j)∼ (1− θ2)δ0 + θ2g, i= 1, · · · ,N1, j = 1, · · · ,N2,
where θk is the probability of being nonzero for population k, δ0 is point mass at 0, and f (org) is the distribution for nonzero elements for the population 1 (or 2).
Because of the excessive zeros, the standard non-parametric Wilcoxon rank-sum test statis-tic is less effective. A truncated Wilcoxon rank-sum test statistic has been proposed by Hall-strom (2010) for the case N1 =N2 =N . We first extend the test to the general setting where
4
N1 6=N2, and examine the asymptotic relative efficiency compared to the standard Wilcoxontest statistic.
Given x and y, denote n1 (or n2) as the number of non-zero observations in x (or y) and letp1 (or p2) be the proportion of non-zero observations in x (or y), where p1 = n1/N1 (or p2 =n2/N2). Let p= max{p1, p2}. We rank the combined observations x∪ y from the largest tothe smallest so that zeros have the highest ranks, where the tied measurements are given theaverage rank. For x (or y), we keep only bpN1c (or bpN2c) observations with the smallestranks, which implies that the observations removed are all zeros. The truncated samples aredenoted as x and y, respectively. Let R denote the sum of the ranks of all observations in x.Hence, the Wilcoxon test statistic can be written as
(2) TW =S2
Var(S), where S =R− N1 +N2 + 1
2N1,
and Var(S) is the variance of S under the null hypothesis.The same procedure can be applied to the truncated data x and y. Rank the combined
observations x ∪ y, from the largest to the smallest, and let r denote the sum of ranks of allobservations in x. First define the counterpart of S as
s= r− bp(N1 +N2)c+ 1
2bpN1c −
1
4
p1 + p2
2
(1− p1 + p2
2
)(N2 −N1),
and define the truncated Wilcoxon test statistic as
TtW =s2
Var(s).
The statistic s is very similar to S, except an extra term
1
4
p1 + p2
2
(1− p1 + p2
2
)(N2 −N1),
which is caused by the difference between the variances due to different sample sizes. Thisterm disappears when N1 =N2. Under the alternative hypothesis, this term is a small orderterm compared to the other part of s. Under the null hypothesis, this extra term is used toeliminate the effect of different sample sizes, leading to an expectation close to 0.
To calculate Var(s) and to derive the asymptotic distribution of s, we show that under thenull that both x and y follow the same distribution with a point mass at 0, when N1→∞,N2→∞, we have
E[S] = 0, E[s] =O(max{N1,N2}),
where the remainder O(max{N1,N2}) is caused by the difference between bpN1c (orbpN1c) and pN1 (or pN2). In addition, we have
Var(S) =N2
1N22
4E{(p2 − p1)2}+
N1N2
12E(N1p
21p2 +N2p1p
22 + p1p2),
Var(s) =N2
1N22
4Var{(p2 − p1)p}+
N1N2
12E(N1p
21p2 +N2p1p
22 + p1p2).
(see Supplemental Materials). With these the results, under the null hypothesis, when N1→∞, N2→∞, and c≤N1/N2 ≤C for some constants c and C , we have
Var(S) =N1N2(N1 +N2)θ(1− θ+ θ2/3)/4 +O(N2.5),
Var(s) =N1N2(N1 +N2)θ3(1− θ+ 1/3)/4 +O(N2.5).
5
where θ is the expected value of the proportion non-zero observations in the data. Hence,√Var(s) = O(
√N1N2 max{N1,N2}), and E[s] = O(max{N1,N2}) is relatively small
when N1 and N2 are large. Asymptotically, s/√V ar(s) follows a standard normal distri-
bution, and the truncated Wilcoxon test statistic TtW = s2/Var(s) has a χ21 null distribution.
The asymptotic analysis above provides a way of approximating Var(s) using the mainterm of Var(s), which leads to the final test statistic
TtW =s2
N1N2(N1 +N2)(p1+p22 )3(4/3− p1+p2
2 )/4.
This is used in our simulation and real data analysis.
2.2. Asymptotic relative efficiency of TW and TtW for two-sample test. To compare thetest statistics TW and TtW , we evaluate the asymptotic property of the relative efficiency.Since we reject the null hypothesis when the statistic is large, the relative efficiency is definedas
RE(TtW , TW ) =E1(TtW )
E1(TW )=
E1[s2]/Var(s)
E1[S2]/Var(S).
Here, E1 denotes the expectation of the statistic under alternative hypothesis. A value largerthan 1 implies power gain using the truncated Wilcoxon test statistic compared to the standardWilcoxon statistic.
For the two-part model (1), we need some terms to quantify the difference between thetwo distributions. Let θ = (θ1 + θ2)/2, θm = max{θ1, θ2}, and4θ = θ2− θ1. Define ∆f,g =P (X < Y ) + P (X = Y )/2− 1/2, X ∼ f and Y ∼ g. The term ∆f,g is used to measure theeffect size of the non-zero part. The following theorem provides an explicit expression forARE.
Theorem 1 Under model (1) and that N1→∞, N2→∞, and c ≤ N1/N2 ≤ C holds forsome constants c and C , then the ARE can be derived as
ARE(TtW , TW ) =1
θ2
1− θ+ θ2/3
1− θ+ 1/3
(θ1θ2∆f,g +4θθm/2θ1θ2∆f,g +4θ/2
)2
+O(N−1/2).
Especially, we have that E[s|n1, n2
]= n1n2∆f,g.
To illustrate the gain in efficiency of using the truncated Wilcoxon test, we present theARE(Wm,W ) for the following four different parameter settings to assess the effect of ∆f,g ,θ,4θ, and N1/N2:
(a) Effect of ∆f,g: N1 = 40, and N2 = 50 and θ1 = 0.3, θ2 = 0.8. Let ∆f,g range between−1/2 and 1/2, with a step size 0.01.
(b) Effect of θ: N1 = 40, N2 = 50, and ∆f,g = 0.1. Let θ range between 0.1 and 0.9, with astep size 0.01. With θ, take θ1 = θ− 0.1 and θ2 = θ+ 0.1.
(c) Effect of 4θ: N1 = 40, N2 = 50 and ∆f,g = 0.1. Let 4θ range between 0.2 and 1, witha step size 0.02. With4θ, take θ1 = 0.5−4θ/2 and θ2 = 0.5 +4θ/2.
(d) Effect of sample size: θ1 = 0.3, θ2 = 0.8, ∆f,g = 0.1, and N2 = 50. Let N1 range be-tween 20 and 120, with a step size 5.
Figure 2 shows the ARE for these four different settings. We observe that AREs are largerthan 1 for all four settings, indicating that the truncated Wilcoxon statistic has greater powerthan the standard Wilcoxon statistic. In settings (a) and (b), ARE(TtW , TW ) increases when∆f,g increases or the proportion of zeros increases. Hence, when the nonzero part are awayfrom each other, the truncated test statistic gains more power compared to the standard
6
Wilcoxon statistics. In setting (c), the ARE(TtW , TW ) function is not monotone due to thecancellation of difference between non-zero part and the non-zero proportions. In setting (d),obviously ARE(TtW , TW ) does not depend on the value of N1 or N2, which can be seen inthe formula.
Fig 2: Asymptotic relative efficiency ARE(TtW , TW ) comparing the truncated Wilcoxon testand the standard Wilcoxon test as a function of (a) ∆f,g , (b) θ, (c)4θ and (d) N1.
3. A truncated Kruskal-Wallis test for K-group comparisons with equal samplesizes.
3.1. A truncated Kruskal-Wallis test. Consider the setting where we have data from Kgroups x1,x2, · · · ,xK , all containing non-negative independent observations from popula-tion 1,2, · · · ,K , respectively. Each sample xi = (x1(1), · · · , x1(Ni)) contains Ni i.i.d ob-servations. We assume that the samples from the K group are generated from the followingtwo-part model,
(3) Xk(i)∼ (1− θk)δ0 + θkfk, k = 1, · · · ,K.
7
The hypothesis of interest is
H0 : θ1 = · · ·= θK = θ, f1 = · · ·= fK = f vs H1 : not all θis and fis are equal.
Kruskal-Wallis test is a standard nonparametric test for K-group comparison based on theranks. We propose to develop a similar rank-based test that account for excessive zeros in thetwo-part model. We first consider the case when the samples sizes from all K samples arethe same, i.e., N1 =N2 = · · ·=NK =N . Let ri(j) be the rank of the jth observation in theith sample among all KN observations. The Kruskal-Wallis statistic can be written as
TKW =12
KN2(KN + 1)
K∑i=1
S2i ,
where Si =∑N
j=1 ri(j)− (KN + 1)N/2. To derive the distribution of TKW under the nullhypothesis, we rewrite TKW in terms of Yi, where Yi =
∑ij=1 Sj − iSi+1, 1≤ i≤K − 1,
TKW =12
KN2(KN + 1)
K−1∑i=1
Y 2i
i(i+ 1)=
K−1∑i=1
Y 2i
Var(Yi).
With this transformation, Yis are asymptotically independent with each other, and under thenull hypothesis, the test statistic is the summation of K − 1 chi-square random variableswith degree of freedom of 1 and therefore TKW is asymptotically distributed as a χ2
K−1distribution.
In order to account for excessive of zeros in the data sets, we propose to modify the TKWstatistic using the idea of truncation. Let ni denote the number of nonzero observations insample i, so the proportion of nonzero entries is pi = ni/N . Let p= max1≤i≤K pi. For eachsample i, keep the pN observations with smallest ranks (or largest values), so that all theremoved entries are zeros. Let n= pN = max1≤i≤K piN = max1≤i≤K ni, then the truncateddata set has Kn observations in total. Rank all Kn observations from smallest to largest, andthen let ri denote the rank-sum of the observations in sample i. Define
si = ri − n(Kn+ 1)/2, Ui =
i∑j=1
sj − isi+1, 1≤ i≤K − 1,
where sK =−∑K−1
i=1 si. The following Lemma show that Ui’s are independent.
Lemma 1 Under the model that all xi(j) are independently and identically distributed,
E[Ui] = 0, Cov(Ui1 ,Ui2) = 0.
This leads to our definition of the truncated Kruskal-Wallis test statistic as
TtKW =
K−1∑i=1
U2i
Var(Ui),
which has a χ2 distribution with degree of freedoms K − 1 under the null hypothesis. WhenK = 2, this statistic becomes the truncated Wilcoxon rank-sum statistic given N1 = N2.The following Lemma 2 provides an approximation of the variance of Yi and Ui under nullhypothesis and alternative hypothesis.
Lemma 2 Consider the two-part model (3). Under null hypothesis and suppose that N →∞, we have
Var(Yi) =i(i+ 1)K2N3θ
4
{θ2
3+ (1− θ)
}+O(N5/2), 1≤ i≤K − 1;
8
Var(Ui) =i(i+ 1)K2N3θ3
4
{1
3+ (1− θ)
}+O(N5/2), 1≤ i≤K − 1.
Under alternative hypothesis, we have the upper bound of the variances,
Var1(Yi)≤K3N3, Var1(Ui)≤K3N3, 1≤ i≤K.
Based on this lemma, the unknown variance Var(Ui) can be approximated by
i(i+ 1)K2N3(∑K
k=1 pk/K)3
4
(4
3−
K∑k=1
pk/K
).
3.2. Asymptotic relative efficiency of TtKW and TKW under a two-part model. We eval-uate the relative efficiency of the truncated statistic TtKW with the standard statistic TKWby
RE(TtKW , TKW ) =E1[TtKW ]
E1[TKW ]=
∑K−1i=1 E1[U2
i ]/Var(Ui)∑K−1i=1 E1[Y 2
i ]/Var(Yi).
We assume that K samples have the same sample size N and the data are generated from thetwo-part model (3). To find RE(TtKW , TKW ), there are four terms to calculate: the expecta-tion of U2
i and Y 2i under the alternative hypothesis, and the variance of Ui and Yi under the
null hypothesis. To find E1[U2i ] and E1[Y 2
i ], we can calculate Var(Ui) and Var(Yi), and theexpectation of Ui and Yi separately, where the variance are needed under either hypothesis.Lemma 2 provides an approximation of the variance of Yi and Ui under null hypothesis andalternative hypothesis.
In order to calculate the expectation of Yi and Ui under alternative hypothesis, we notethat the basic terms in Yi and Ui are the rank-sum of the nonzero observations. Let r0
i denotethe rank-sum of the non-zero observations in sample i, 1≤ i≤K , and that s0
i = r0i − (1 +∑K
j=1 nj)ni/2. Define ∆i,k = P (Xi <Xk) + P (Xi =Xk)/2− 1/2, Xi ∼ fi and Xk ∼ fk,for 1≤ i, k ≤K− 1, where ∆i,k can be used to measure the effect sizes. In addition, one caneasily check that
E[ i∑j=1
s0j − is0
i+1
∣∣n1, · · · , nK]
=
i∑j=1
∑k 6=j
njnk∆j,k − ini+1
∑k 6=i+1
nk∆i+1,k,(4)
for 1≤ i≤K − 1. For model (3) and the effect sizes specified above, we have the results onE1[Yi] and E1[Ui], as shown in the following Lemma.
Lemma 3 Under alternative hypothesis, for 1≤ i≤K − 1, the expectation of Yi is
E1[Yi] =N2
( i∑j=1
∑k 6=j
θjθk∆j,k − iθi+1
∑k 6=i+1
θk∆i+1,k
)+KN2
2
(iθi+1 −
i∑j=1
θj
),
and the expectation of Ui is
E1[Ui] =N2
( i∑j=1
∑k 6=j
θjθk∆j,k−iθi+1
∑k 6=i+1
θk∆i+1,k
)+KN2
2θ(K)
(iθi+1−
i∑j=1
θj
)+o(N2).
Plugging E1[Ui], Var1(Ui) and Var(Ui) into the definition of ARE(TtKW , TKW ) and usingLemma 2 and Lemma 3, we obtain the ARE of TtKW versus TKW given by the followingtheorem.
9
Theorem 2 Under model (3) and that N →∞, then the ARE can be derived as
(5) ARE(TtKW , TKW ) =
K−1∑i=1
{(∑i
j=1 θj+iθi+1)2∆i+Kθ(K)(iθi+1−∑i
j=1 θj)/2}2
θ2i(i+1)(1/3+1−θ)
K−1∑i=1
{(∑i
j=1 θj+iθi+1)2∆i+K(iθi+1−∑i
j=1 θj)/2}2
i(i+1)(θ2/3+1−θ)
+ o(1).
3.3. Asymptotic relative efficiency for zero-Beta distributions. We consider the casewhere the nonzero functions fks are Beta distributions with parameters αk and β = 1, inwhich case the effect size ∆j,k can be calculated and the ARE can be expressed by αk’sand θk’s. Since compositional data are always between 0 and 1, Beta distribution provides areasonable parametric model for such data.
Given the Beta distribution for each sample, note that P (Xi = Xk) = 0 as they are con-tinuous random variables. Hence ∆i,k = P (X(i) < X(k)) − 1/2. If X ∼ Beta(α,1), thenXα ∼ Unif(0,1), therefore
P (Xi <Xk) = P (Xαi
i <Xαi
k ) = P (U < (Xαk
k )αk/αi), U ∼ Unif(0,1).
Note that Xαk
k ∼ Unif(0,1), so (Xαk
k )αi/αk ∼ Beta(αk/αi,1). Further, for any continuousrandom variable X with range [0,1], there is
P (U <X) =
∫ 1
0(1− FX(u))du=E[X].
This leads to
∆i,k = P (Xi <Xk)− 1/2 =αk/αi
αk/αi + 1=
αkαi + αk
− 1/2, 1≤ i, k ≤K.
When combining with (4), we have
E1[r0i |n1, · · · , nK ] = ni
{(ni + 1)/2 +
∑k 6=i
nkαk
αi + αk
},
E1[s0i |n1, · · · , nK ] = E1[r0
i ]−1 +
∑Kk=1 nk
2ni = ni
∑k 6=i
nk
(αk
αi + αk− 1/2
).
Plugging these equalities to (5) gives the closed-form expression for ARE(TtKW , TKW ).To demonstrate the gain in efficiency, we calculate ARE(TtKW , TKW ) for 5 group com-
parisons (K = 5) in two scenarios: (a) αi = i, 1≤ i≤ 5. θ1 = · · ·= θK = θ, where θ rangesfrom 0.1 to 1 with a step size 0.01; (b) αi = i, 1≤ i≤ 5. θ = (0.2,0.15,0.3,0.1,0.25) + d,where d ranges from 0 to 0.7 with a step size 0.01. The results are shown in Figure 3. In bothcases, we observe a high asymptotic relative efficiency using the truncated Kruskal-Wallistest statistic when compared to the original Kruskal-Wallis test.
4. A truncated Kruskal-Wallis multi-group test with unequal sample sizes. We nowconsider the setting where we have multiple samples x1,x2, · · · ,xK , all containing non-negative independent observations from populations 1,2, · · · ,K , respectively. Each samplexi = (x1(1), · · · , x1(Ni)) contains Ni observations sampled from the two-part model reft-wopart. We consider the case that N1, N2, · · · , NK are not necessarily equal.
Consider the standard Kruskal-Wallis statistic first. Let ri(j) be the rank of the jth obser-vation in the ith sample among all the observations. Define Si =
∑Ni
j=1 ri(j)− (∑K
i=1Ni +
10
Fig 3: Asymptotic relative efficiency. ARE(TtKW , TKW ) as a function of (a) θ when θ1 =· · ·= θK = θ, and (b) θ(K) when not all θis are equal.
1)Ni/2 and Yi =∑i
j=1(Ni+1Sj −NjSi+1), 1≤ i≤K − 1, where Yi is also related to thesample size. Then, Kruskal-Wallis statistic is defined as
TKW =12∑K
j=1Nj + 1
K−1∑i=1
Y 2i
Ni+1(∑i
j=1Nj)(∑i+1
j=1Nj)(∑K
j=1Nj), 1≤ i≤K − 1.
Each term follows a χ21 distribution asymptotically and is independent with each other, so the
statistic follows χ2K−1 distribution under null hypothesis.
To account for zeros, for sample i, there are ni non-zero elements, and the correspondingnon-zero ratio is pi = ni/Ni. Let p= max
1≤i≤Kpi, and keep the bpNic largest observations only
for sample i as the truncated data. All the removed entries are zeros. Let ri be the rank-sumfor the truncated sample i, and
si = ri −bp∑K
i=1Nic+ 1
2bpNic.
We define the statistic Ui as the counterpart of Yi in standard Kruskal-Wallis statistic,
Ui =
i∑j=1
(Ni+1sj −Njsi+1
), 1≤ i≤K − 1.
Then, a natural test statistic based on truncation is
(6) TtKW =
K−1∑i=1
U2i
Var(Ui),
where Var(Ui) is the variance of Ui under the null. This is calculated by noting thatVar(Ui) = Var(E[Ui|p1, · · · , pK ]) + E[Var(Ui|p1, · · · , pK)]. Lemma 4 in Supplemental Ma-terials shows that
Var[Ui|p1, · · · , pK ] =1
12E
{(
K∑k=1
nk + 1)
[ K∑k=1
nk[N2i+1
i∑j=1
nj + ni+1(
i∑j=1
Nj)2]
11
−(ni+1
i∑j=1
Nj −Ni+1
i∑j=1
nj)2
]},
E(Var[Ui|p1, · · · , pK ]) =θ2
12Ni+1(
i∑j=1
Nj)(
i+1∑j=1
Nj)(
K∑j=1
Nj)[(
K∑j=1
Nj)θ+ 3− 2θ].
However, when the sample sizes are not equal, it is difficult to evaluate Var[Ui|p1, · · · , pK ].Instead we approximate this under the null hypothesis by assuming that the non-zero proba-bility θ is the average of the empirical non-zero probability among all samples and by simu-lations since Var[Ui|p1, · · · , pK ] only depends on the non-zero proportions, but not of f .
Finally, in order to prove that TtKW has an asymptotic distribution of χ2K−1, we show that
(E[Ui])2 is much smaller than Var(Ui) so that E[Ui]/
√Var(Ui) is approximately 0, and the
correlation between either two terms is asymptotically 0. Combining upper bound of E[Ui]given in Lemma 3 and the lower bound for the variance term given in Lemma 4, we can showthat E[Ui]/
√Var(Ui)→ 0. Using the upper bound for the covariance in Lemma 5, we see
that when max1≤i≤KNi/N2(1)→ 0 and N(1)→∞, we have
|Cor(Ui,Uj)|= |Cov(Ui,Uj)√
Var(Ui)Var(Uj)| ≤C
√√√√(
K∑k=1
1
Nk)≤ C
√K√
N(1)
→ 0,1≤ i, j ≤K − 1.
Therefore, as long as the sample sizes are at the same order and go to infinity, (Ui/Var(Ui), i=1, · · · ,K) asymptotically has a multivariate normal distribution with zero mean and identitycovariance matrix. We leave the details of these lemmas in the Supplemental Materials.
5. Simulations. To further verify the gain in efficiency in using the proposed truncatedrank-based tests, we present simulation studies to evaluate the proposed tests and to com-pare with the standard Wilcoxon rank-sum and Kruskal-Wallis test statistics. In each of thesimulation setups, we perform the following steps.
1. Given parameters (K,N,θ,α,β, s), where N , θ, α and β are all K × 1 arrays, generateX from the distribution
Xi(j)i.i.d∼ (1− θi)δ0 + θiBeta(αi, βi), 1≤ i≤K,1≤ j ≤Ni.
2. Calculate the p-value from usual rank tests, including the Wilcoxon rank-sum test fortwo-sample test, and Kruskal-Wallis test for more groups, and our proposed tests.
3. Repeat steps 1-2 for M times, and calculate the power or type I error for a given signifi-cance level α.
5.1. Simulation 1 - evaluation of Type I error. We first evaluate the type I errors of var-ious test statistics proposed in this paper and compare them to Wilcoxon rank-sum test andKruskal-Wallis test. To simulate data from the null distribution, for each sample i, we simu-late data from a two-part model,
Xi(j)i.i.d∼ (1− θ)δ0 + θBeta(a, b),
where a= b= 2, θ = 0.5. We consider three different scenarios :
(a) K = 2, sample sizes = (0.65,1)×N ;(b) K = 3, equal sample sizes N ;(c) K = 3, sample sizes= (0.7,1,1.5)×N .
12
For each scenario, take N = {30,60,100,300,600,900}, and simulate 100,000 test statisticsfor each choice of N . The empirical Type I errors are summarized in Table 1 for significancelevel α ∈ {0.05,0.01,0.001}. In general, we observe that the proposed tests have correct TypeI errors, especially when the sample sizes are large. For small sample sizes, the proposed testsare slightly anti-conservative, indicating the asymptotic approximation of the test statisticsmay require relatively large sample sizes. In practice, when the sample sizes are small, onecan obtain more accurate p-values based on permutations.
TABLE 1Comparison of empirical Type I errors for two-group test with unequal sample sizes and three-group tests with
equal and unequal sample sizes. TW : Wilcoxon rank-sum test; TtW : truncated Wilcoxon rank-sum test; TKW :Kruskal-Wallis test; TtKW : truncated Kruskal-Wallis test.
Two-group - unequal sample sizesα-level (20, 30) (39, 60) (65, 100) (195, 300) (390, 600)
0.05TW .049 .049 .050 .049 .049TtW .054 .051 .051 .051 .050
0.01TW .009 .010 .010 .010 .010TtW .014 .013 .012 .011 .011
0.001TW 6.8× 10−4 9.3× 10−4 9.1× 10−4 1.2× 10−3 9.9× 10−4
TtW 2.2× 10−3 1.9× 10−3 1.8× 10−3 1.6× 10−3 1.2× 10−3
Three-group - equal sample sizes30 60 100 300 600
0.05TKW .049 .048 .049 .050 .050KMm .056 .053 .053 .052 .051
0.01TKW .009 .009 .010 .010 .010TtKW .015 .013 .013 .012 .011
0.001TKW 6.7× 10−4 8.7× 10−4 1.1× 10−3 9.9× 10−4 1.0× 10−3
TtKW 2.5× 10−3 2.0× 10−3 1.9× 10−3 1.6× 10−3 1.2× 10−3
Three-group - unequal sample sizes(21, 30, 45) (42, 60, 90) (70, 100, 150) (210, 300, 450) (420, 600, 900)
0.05TKW .047 .050 .049 .049 .049TtKW .055 .056 .054 .051 .051
0.01TKW .009 .010 .010 .010 .010TtKW .014 .014 .013 .012 .011
0.001TKW 6.4× 10−4 8.1× 10−4 1.0× 10−3 8.9× 10−4 8.6× 10−4
TtKW 2.5× 10−3 2.1× 10−3 1.9× 10−3 1.4× 10−3 1.3× 10−3
5.2. Simulation 2 - power comparisons. We next evaluate the power of the proposedtests. We consider three different models with K = 3 groups and repeat the simulationM = 10000 times. For the first model, we assume that the proportions of zeros are thesame across different groups and examine how the distribution of the non-zero observa-tion affects the power of the proposed tests. We set the θ = (0.5,0.5,0.5) and equal sam-ple sizes for the three groups, ranging from 100 to 2000 with a step size 100. We setα ∈ {(1.5,2,2.5), (1.7,2,2.3)}, and β = (2,2,2) and calculate the power function for dif-ferent sample sizes N . The resulting power curves are shown in the top row of Figure 4. Weobserve a substantial gain in power from the truncated Kruskal-Wallis test.
For the second model, we study how different proportions of zeros affect the test poweras the sample size N changes. We set (α,β) = {(1/2,1/2,1/2), (2,2,2)}, which assumesthat the nonzero distributions are the same across different groups. We again assume that thesample sizes are the same for all three groups, ranging from 20 to 300 with a step size 20. Weset θ ∈ {(0.1,0.15,0.22), (0.15,0.20,0.27)}, and calculate the power function for different
13
500 1000 1500 2000
0.0
0.4
0.8
Sample Size
Pow
er
500 1000 1500 2000
0.0
0.4
0.8
Sample Size
Pow
er
50 100 150 200 250 300
0.0
0.4
0.8
Sample Size
Pow
er
50 100 150 200 250 300
0.0
0.4
0.8
Sample Size
Pow
er
500 1000 1500 2000
0.0
0.4
0.8
Sample Size
Pow
er
500 1000 1500 2000
0.0
0.4
0.8
Sample Size
Pow
er
(a) (b)
(c) (d)
(f)(e)
Fig 4: Power curves for the truncated Kruskal-Wallis (solid line) and Kruskal-Wallistest (dashed line) as a function of the sample size N . (a)-(b ): a two-part modelwith θ = (0.5,0.5,0.5), α = (1.5,2,2.5), β = (2,2,2) and (b )α = (1.7,2,2.3),β = (2,2,2). (c)-(d): a two-part model with (α,β) = {(1/2,1/2,1/2), (2,2,2)}, θ ∈{(0.10,0.15,0.22), (0.15,0.20,0.27)}. (e)-(f): a two-part model with θ = (.4, .5, .6) andα ∈ {(1.5,2,2.5), (1.7,2,2.3)}, and β = (2,2,2).
sample sizes N . The resulting power curves are shown in the middle row of Figure 4. Ourtest has some improvement over the original test. However, in this case, the improvement isnot as large as when fis are different.
14
For the last model, we evaluate the proposed tests when the sample sizes are different indifferent groups. We set θ = (0.4,0.5,0.6) and the sample size as N × (0.8,1,1.5), whereN ranges from 100 to 2000 with a step size 100. We choose α ∈ {(1.5,2,2.5), (1.7,2,2.3)}and β = (2,2,2), and calculate the power function for different sample sizes N . The powercurves are shown in the bottom row of Figure 4, showing that the truncated Kruskal-Wallistest has much higher power than the Kruskal-Wallis test when there are ties.
6. Identifying the Crohn’s disease-associated bacterial genera and the effects oftreatment. Crohn’s disease, a chronic inflammatory bowel disease, is characterized by al-tered composition of the gut microbiota or dysbiosis. The etiology and clinical significanceof the dysbiosis is unknown. In a recent study at the University of Pennsylvania, the compo-sition of the gut microbiota among a cohort of 85 children with Crohn’s disease who wereinitiating therapy with either a defined formula diet (n=33) or an anti-tumor necrosis fac-tor α (anti-TNF) drug (n=52) was examined in order to better understand the cause of thedysbiosis in Crohn’s disease. Fecal samples were collected at baseline, 1, 4, and 8 weeksand DNA content characterized by shot-gun genomic sequencing with >0.5 tetra-bytes oftotal sequences (Lewis et al., 2015). The gut microbiota data of 26 normal children with noknown gastrointestinal disorders were similarly collected. The MetaPhlAn program (Segataet al., 2012) was applied to first obtain the relative proportions of the bacteria genera for eachof the normal and Crohn’s disease samples. The number of bacterial genera called by theMetaPhlAn program varies from sample to sample. Altogether, 60 bacterial genera were ob-served in both normal samples and the Crohn’s disease samples. Figure 1 shows the relativeabundances of the 60 genera in all the samples, where large proportions of zeros are observedfor many of the genera.
6.1. Comparison of gut microbiome between normal and Crohn’s disease patients. Weare interested in identifying the bacterial genera that show different distribution between nor-mal and Crohn’s disease children. At an α level of 0.05, the truncated Wilcoxon rank-sumtest identified 23 genera that showed different distributions between normal and Crohn’s dis-eases, while the standard Wilcoxon test identified 20, among these, 19 were identified by bothmethods. At FDR of 10%, the truncated Wilcoxon rank-sum test identified 21 genera and thestandard Wilcoxon test identified the same 20 genera. Four genera, including Eggerthella,Lactobacillus, Gemella, and Rothia were identified only by the truncated Wilcoxon rank-sum test. Figure 5 shows the the proportions of zeros and boxplots of these four genera thatclearly show difference in abundances in normal and Crohn’s disease children, both in termsof proportion of zeros and the median of non-zero abundances. All four genera had higherabundances in Crohn’s patients than normal controls. Among these genera, Eggerthella, Lac-tobacillus, both are anaerobic, non-sporulating, Gram-positive bacilli, have been reported tobe associated with clinically significant bacteraemia and Crohn’s disease (Lau et al., 2004).In contrast, for genus Methanobrevibacter, the standard Wilcoxon test had a sightly smallerp-value. However, 98% of the disease individuals did not carry this genus.
6.2. Comparison of gut microbiome across time after treatment. We next aim to identifythe bacterial genera that show changes of relative abundances across the four time pointsduring the anti-TNF treatment. We applied our truncated Kruskal-Wallis test statistic withwithin-subject permutations (100,000 permutations) and identified 7 genera with changesof abundances during the treatment with p < 0.05. As a comparison, the Friedman’s testidentified only three. Figure 6 shows the boxplots of the abundances of the genera across 4time points for the four genera identified only by our truncated Kruskal-Wallis test, includingBacteroides, Roseburia, Eubacterium and Bilophila. Among these, Bacteroides, Eubacterium
15
Fig 5: Proportions of zeros and the box plots of the non-zero relative abundances for thefour bacterial genera that were identified by the truncated Wilcoxon rank-sum test only for anominal α level of 0.05, where p-values from TW and TtW tests are shown as pS and ps. Thegenus Methanobrevibacterm was identified only by the standard Wilcoxon rank-sum test.
and Bilophila showed increased abundances at 8 weeks after the anti-TNF treatment. Interest-ingly, all three genera have been shown to have reduced abundance in Crohn’s disease patients(Gevers et al., 2014). Reduction of Roseburia, a well-known butyrate-producing bacteriumof the Firmicutes phylum, has been consistently demonstrated to be associated with Crohn’sdisease (Machiels et al., 2014). There has been evidence that the gut bacteria in patients withinflammatory bowel disease do not make butyrate, and that they have low levels of the fattyacid in their gut (Sartor and Mazmanian, 2012). The decreased abundance of Bilophila, maytranslate into a reduction of commensal bacteria-mediated, anti-inflammatory activities in themucosa, which are relevant to the pathophysiology of Crohn’s disease. The result shows theeffect of anti-TNF treatment in increasing the relative abundances of Roseburia, Bacteroidesand Bilophila, therefore potentially increase level of fatty acid butyrate and anti-inflammatoryactivities. This partially explains that 50% of the patients showed clinical improvement, asreflected by reduction of fecal calprotection below 250 mcg/g.
7. Discussion. Motivated by comparing the distributions of taxa composition in differ-ent groups in microbiome studies, we have developed several extensions of the popular rank-based tests to account for clumps of zeros, including truncated rank-based Wilcoxon andKruskal-Wallis tests for two- or multiple-group comparisons. These tests are nonparametricand are easy to implement. By using within-sample permutations, such tests can also be ap-plied to paired samples or repeated measurements analysis. We have shown that the proposedtests have better power than the standard rank-based tests, both by asymptotic relative effi-ciency analysis and by simulations. We observed a large gain in power when the proportionsof zeros in the data sets are high as compared to the standard rank-based tests due the highnumber of tied ranks. Hallstrom (2010) showed that the truncated Wilcoxon rank-sum test
16
0 1 4 8Bacteroides
0.00
0.02
0.04
Per
cent
age
of Z
eroe
s
0 4
040
80
pkw =.036, pKW =.21
Rel
ativ
e A
bund
ance
0 1 4 8Eubacterium
0.00
0.10
0.20
●
●
●
●
●
●
●
●
●
0 4
020
4060
pkw =.047, pKW =.142
0 1 4 8Roseburia
0.00
0.15
0.30
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●●
0 4
020
40
pkw =.034, pKW =.141
0 1 4 8Bilophila
0.0
0.2
0.4
0.6
●
●
0 4
0.0
1.0
2.0
3.0
pkw =.01, pKW =.079
Fig 6: Bar plots of proportions of zeros (top row) and box plots of the relative non-zeroabundances (bottom row) over four time points during anti-TNF treatment for four bacteriagenera that were identified only by the truncated Kruskal-Wallis test.
has equal or better power than the test of two-degree of freedoms based on two-part modelwhen the sample sizes are equal. Although such tests can be extended to two-part modelfor multiple group comparisons, it has not been studied in literature. It would be interestingto compare the performance of our proposed truncated Kruskal-Wallis tests with other testsbased on the two-part models.
We have demonstrated the applications of the proposed tests to analysis of real metage-nomic data sets. Our results have shown that the truncated rank-based tests are effective andidentify more bacterial genera that are associated with the clinical phenotypes and treatment.As observed in other studies (Wagner, Robertson and Harris, 2011), tests that account forclumps of zeros can be more powerful in testing abundance difference in microbiome stud-ies. We have demonstrated that some well-known Crohn’s disease-associated bacterial generacan be missed by using the standard rank-based tests without adjusting for excessive zeros.We expect to see more applications of the proposed tests in microbiome studies.
Acknowledgment. This research was supported by NIH grants GM129781 and GM123056.
SUPPLEMENTARY MATERIALS. The online Supplemental Materials include proofsof Theorem 1, Theorem 2 and all the lemmas. It also includes detailed derivations ofthe proposed test statistics. An R package of the proposed method is available on github,https://rdrr.io/github/chvlyl/ZIR/. The source codes for these methods canalso be found in the Supplemental Materials.
REFERENCES
GEVERS, D., KUGATHASAN, S., DENSON, L. A. and ET AL (2014). The treatment-naive microbiome in new-onset Crohn’s disease. Cell Host & Microbe 15 382–392.
HALLSTROM, A. (2010). A modified Wilcoxon test for non-negative distributions with a clump of zeros. Statisticsin Medicine 29 391–400.
HUSON, D. H., AUCH, A. F., QI, J. and SCHUSTER, S. C. (2007). MEGAN analysis of metagenomic data.Genome research 17 377–386.
17
LACHENBRUCH, P. (1976). Analysis of Data with Clumping at Zero. Biometrische Zeitschrift 18 351–356.LACHENBRUCH, P. (2001). Comparison of two-part models with competitors. Statistics in Medicine 20 1215–
1234.LAU, S., WOO, P., FUNG, A., CHAN, K., WOO, G. and YUEN, K. (2004). Journal of Medical Microbiology 53
1247–1253.LEWIS, J., CHEN, E. Z., BALDASSANO, R. N., OTLEY, A. R., GRIFFITHS, A. M., LEE, D., BITTINGER, K.,
BAILEY, A., FRIEDMAN, E. S., HOFFMANN, C., ALBENBERG, L., SINHA, R., COMPHER, C., GILROY, E.,NESSEL, L., GRANT, A., CHEHOUD, C., LI, H., WU, G. D. and BUSHMAN, F. D. (2015). Inflammation,Antibiotics, and Diet as Environmental Stressors of the Gut Microbiome in Pediatric Crohn’s Disease. CellHost & Microbe 18 489-500.
MACHIELS, K., JOOSSENS, M., SABINO, J., DE PRETER, V., ARIJS, I., EECKHAUT, V., BALLET, V.,CLAES, K., VAN IMMERSEEL, F., VERBEKE, K., FERRANTE, M., VERHAEGEN, J., RUTGEERTS, P. andVERMEIRE, S. (2014). A decrease of the butyrate-producing species Roseburia hominis and Faecalibacteriumprausnitzii defines dysbiosis in patients with ulcerative colitis. Gut 63(8) 1275–1283.
MANICHANH, C., BORRUEL, N., CASELLAS, F. and GUARNER, F. (2012). The gut microbiota in IBD. NatureReviews Gastroenterology and Hepatology 9 599–608.
QIN, J., LI, R., RAES, J., ARUMUGAM, M., BURGDORF, K. S., MANICHANH, C., NIELSEN, T., PONS, N.,LEVENEZ, F., YAMADA, T. et al. (2010). A human gut microbial gene catalogue established by metagenomicsequencing. Nature 464 59–65.
QIN, J., LI, Y., CAI, Z., LI, S., ZHU, J., ZHANG, F., LIANG, S., ZHANG, W., GUAN, Y., SHEN, D. et al.(2012). A metagenome-wide association study of gut microbiota in type 2 diabetes. Nature 490 55–60.
SARTOR, R. and MAZMANIAN, S. (2012). Intestinal Microbes in Inflammatory Bowel Diseases. American Jour-nal of Gastroenterology (Supplements) 1 15-21.
SEGATA, N., WALDRON, L., BALLARINI, A., NARASIMHAN, V., JOUSSON, O. and HUTTENHOWER, C.(2012). Metagenomic microbial community profiling using unique clade-specific marker genes. Nature Meth-ods 9 811–814.
TURNBAUGH, P. J., LEY, R. E., MAHOWALD, M. A., MAGRINI, V., MARDIS, E. R. and GORDON, J. I. (2006).An obesity-associated gut microbiome with increased capacity for energy harvest. Nature 444 1027–131.
TURNBAUGH, P. J., LEY, R. E., HAMADY, M., FRASER-LIGGETT, C. M., KNIGHT, R. and GORDON, J. I.(2007). The human microbiome project. Nature 449 804–810.
WAGNER, B., ROBERTSON, C. and HARRIS, J. (2011). Application of Two–Part Statistics for Comparison ofSequence Variant Counts. PloS One 6(5) e20296.
18
SUPPLEMENTARY MATERIALS
APPENDIX A: BASIC PROPERTIES
Suppose there is a sample with size n. Rank all the observations so that the smallest ob-servations have largest ranks. Let ri and rj denote the ranks of two randomly selected ob-servations. We have the results for the expectation, variance and covariance as follows. Thecalculations are also presented here although its basic in statistics.
Property 1. E[ri] = n+12 .
Property 2. E[r2i ] = (n+1)(2n+1)
6 .Proof. E[r2
i ] = 1n
∑ni=1 i
2 = (n+1)(2n+1)6 .
Property 3. Var[ri] = n2−112 .
Proof. Note that Var(ri) = E[r2i ]− (E[ri])
2. According to Properties 1–2, the variance is
Var(ri) =1
n
n∑i=1
i2 − (n+ 1
2)2 =
n2 − 1
12.
Property 4. Cov[ri, rj ] =−n+112 .
Proof. When i 6= j, we have
E[rirj ] =1
n(n− 1)
∑i 6=j
ij =(n+ 1)(3n+ 2)
12.
Therefore,
Cov(ri, rj) = E[rirj ]− [E(ri)]2 =
(n+ 1)(3n+ 2)
12− (
n+ 1
2)2 =−n+ 1
12.
Property 5. Consider a randomly selected group of observations S with size n1. The rank-sum of this group is denoted by rg . Then,
E[rg] =1 + n
2n1, Var(rg) =
n1(n− n1)(n+ 1)
12.
Proof. For the expectation part, it simply introduces in E[ri] for each observation in thisgroup. Next we check the variance.
Var(rg) =∑i∈S
Var(ri) +∑
i,j∈S,i6=jCov(ri, rj)
=∑i∈S
n2 − 1
12+
∑i,j∈S,i6=j
−n+ 1
12
=n1(n− n1)(n+ 1)
12.
APPENDIX B: DERIVATIONS OF THE MODIFIED WILCOXON RANK SUMSTATISTIC
For a given statistic S, we use E[S] and Var(S) to represent its mean and variance calcu-lated under the null hypothesis and use E1[S] to represent its mean calculated under the alter-native hypothesis. In this section, given that W = S2/Var(S) where S =R− N1+N2+1
2 N1,and s= r− bp(N1+N2)c+1
2 bpN1c − 14p1+p2
2 (1− p1+p22 )(N2 −N1), we want to verify the fol-
lowing results:
19
• E[S] = 0, E[s] =O(max{N1,N2});• the results for Var(S) and Var(s):
Var(S) =N1N2(N1 +N2)θ(1− θ+ θ2/3)/4 +O(N2.5),
Var(s) =N1N2(N1 +N2)θ3(1− θ+ 1/3)/4 +O(N2.5);
Proof. Now we prove the results one by one.
• We show the expectation first. The result for E[S] is direct, so we focus on E[s].Recall that p = max{p1, p2}, where p1 = n1/N1 and p2 = n2/N2. Let r = r −
bp(N1+N2)c+12 bpN1c, and so s= r− 1
4p1+p2
2 (1− p1+p22 )(N2 −N1). To find E[s], we need
to calculate E[r] and E[14p1+p2
2 (1− p1+p22 )(N2 −N1)].
With independent samples, p1 and p2 are independent, E[pi] = θi and Var(pi) = θi(1−θi)/Ni), i= 1,2. Under null hypothesis, θ1 = θ2 = θ. We then have
E
[1
4
p1 + p2
2(1− p1 + p2
2)(N2 −N1)
]=N2 −N1
4
[(θ1 + θ2
2)− (
θ1 + θ2
2)2 − θ1(1− θ1)
4N1− θ2(1− θ2)
4N2
](Let θ =
θ1 + θ2
2) =
N2 −N1
4(θ(1− θ) +O(
1
N1+
1
N2)).(7)
Here, θ is the probability of zero under null and θ = (θ1 + θ2)/2 under alternative hypoth-esis.
Next we calculate E[r]. Recall that r is the sum-rank of the truncated sample 1. Let r0
be the sum-rank of all the non-zero elements in the truncated sample 1. With r0, we rewriter as
r = r0 −n1 + n2 + 1
2n1 +
n1 + n2 + 1 + bp(N1 +N2)c2
(bpN1c − n1)
−1 + bp(N1 +N2)c2
bpN1c+n1 + n2 + 1
2n1
= (r0 −n1 + n2 + 1
2n1) +
1
2pN1N2(p2 − p1) +O(max{N1,N2}),
where O(N2) comes from the difference between bpN1c (bpN1c) and pN1 (pN2).Since all the observations are non-negative, so the non-zero elements always enjoy the
ranks from 1 to n1 +n2. Hence, under null hypothesis that f = g, E[r0− n1+n2+12 n1] = 0.
Therefore,
(8) E[r|p1, p2] =pN1N2
2(p2 − p1) +O(max{N1,N2}).
We still need to remove the condition that p1 and p2 are given. The following lemma isused to describe the distribution of p(p2 − p1).
Lemma 1 Define p1 = n1/N1, p2 = n2/N2, and p= max{p1, p2}. Under null hypothesisthat θ1 = θ2 = θ, F =G, and let N1,N2→∞ at the same order, there is
E[p(p2 − p1)/2] =θ(1− θ)
4
N1 −N2
N1N2,
E[p2(p2 − p1)2] = θ3(1− θ)(1/N1 + 1/N2) +O(N−3/21 ).
20
Combining Lemma 1 with (8) leads to
E[r] =θ(1− θ)
4(N1 −N2) +O(max{N1,N2}).
Combining this with (7), we have that
E[s] =O(max{N1,N2}).
The first conclusion is proved.• Next we want to check Var(S) and Var(s).
Note that s= r− 14p1+p2
2 (1− p1+p22 )(N2 −N1) = r− I . Therefore,
(9) Var(s) = Var(r) + Var(I)− 2 Cov (() r, I).
Check the terms one by one.The first term is Var(r). Note that we already find E[r|p1, p2] = pN1N2(p2 − p1)/2 +
O(max{N1,N2}). Therefore,
Var(r) = Var(E[r|p1, p2]) + E[Var(r|p1, p2)]
=N2
1N22
4{E[(p2 − p1)2p2]− (E[(p2 − p1)p])2}+
N1N2
12E[N1p
21p2 +N2p1p
22 + p1p2].
With Lemma 1 we have that
E[(p2 − p1)2p2] = θ3(1− θ)(1/N1 + 1/N2) +O(N−3/21 ),
(E[(p2 − p1)p])2 = θ2(1− θ)2(N1 −N2)2/(4N21N
22 ) =O(N−2
1 ).
Combining withE[N1p21p2 +N2p1p
22 +p1p2] = (N1 +N2)θ3 +O(1) from basic statistics,
we have that
(10) Var(r) =N1N2(N1 +N2)θ3(1− θ+ 1/3)/4 +O(√N1N
22 ).
For the second term, note that
Var(I)≤E[I2]≤ (N2 −N1)2E[(1
4
p1 + p2
2(1− p1 + p2
2))2]≤ (N2 −N1)2 =O(N2),
and
Cov (() r, I)≤√
Var(r),Var(I)≤ (N2 −N1)√N1N2(N1 +N2) =O(N5/2).
Introduce the result and (10) into (9), we have(11)Var(s) = Var(r)+Var(I)−2 Cov (() r, I) =N1N2(N1 +N2)θ3(1−θ+1/3)/4+O(N2.5).
Similarly, we have that Var(S) =N1N2(N1 +N2)θ(1− θ+ θ2/3)/4 +O(N2.5).
APPENDIX C: PROOF OF THEOREM 1 IN MAIN PAPER
Recall that the definition of ARE(Wm,W ) is
ARE(Wm,W ) =E[s2]/Var(s)
E1[S2]/Var(S),
where E1 denote the expectation under alternative hypothesis, and Var(s) means the vari-ance of s under null hypothesis. With previous analysis, Var(s) and Var(S) are known. Theremaining problem is E1[s2], for which we can derive E1[s] and Var1(s) separately, and thecombine them together. Here, Var1(s) denote the variance of s under alternative hypothesis.
21
Recall that we rewrite s= r− I , where r can be rewritten as
r = [r0 −n1 + n2 + 1
2n1] +
1
2pN1N2(p2 − p1) +O(N2).
Since r0 depend on the non-zero part only, so we only consider the non-zero elements of xand y. Let x(j) and y(j) denote the jth non-zero observation in x and y, respectively. Therank of x(j) equals to
rj =1
2+
n1∑m=1
1{x(j)< x(m)}+n2∑m=1
1{x(j)< y(m)}+ 1
2
n1∑m=1
1{x(j) = x(m)}+ 1
2
n2∑m=1
1{x(j) = x(m)}.
Note that E[1{x(j) < x(m)}] = P (X1 < X2) when X1,X2i.i.d.∼ f , and E[1{x(j) <
y(m)}] = P (X < Y ) when X ∼ f,Y ∼ g independently. Therefore, we have
E1[rj |n1, n2] = 1/2 + n1P (X1 <X2) + n2P (X < Y ) +n1
2P (X1 =X2) +
n2
2P (X = Y ).
Note that P (X1 =X2)/2 + P (X1 <X2) = 1/2 since they follow the same distribution andP (X = Y )/2 + P (X < Y ) = ∆f,g + 1/2 according to the definition, plug them in and wehave that
E1[r0|n1, n2] = E1
[ n1∑j=1
r1j
]= n1
[(n1 + 1)/2 + n2[∆f,g + 1/2]
].
Introduce it into the definition of r, we have
E1[r|n1, n2] = E1[r0|n1, n2]− n1 + n2 + 1
2n1 +
1
2pN1N2(p2−p1) = n1n2∆f,g+
1
2pN1N2(p2−p1).
Further, let N1,N2→∞, and note that n1 ∼ Binomial(N1, θ1), n2 ∼ Binomial(N2, θ2),we have that
E1[r] = θ1θ2N1N2∆f,g +1
2θ(m)N1N2∆θ,
For the last term I , note that |E1[I]|<N2 −N1 =O(N). Combining all these terms, wehave that
(12) E1[s] = E1[r]−E[I] = θ1θ2N1N2∆f,g +1
2θ(m)N1N2∆θ+O(N).
Similarly, we have the result for S as
E1[S] = E1
[E1[r0|n1, n2] +
n1 + n2 + 1 +N1 +N2
2(N1 − n1)− N1 +N2 + 1
2N1
]= E1
[n1n2∆f,g +
N1n2 − n1N2
2
]= θ1θ2N1N2∆f,g +
1
2N1N2∆θ.(13)
Now we consider the variance under alternative hypothesis. We deal with S first.Note that S is a function of all the observations x and y. To find the variance, we try to
bound P (|S − µ| ≥ t) for any t with concentration inequality. When one observation of xchanges, the largest change of S caused by it is N2. When one observation of y changes, thelargest change of S is N1. Mathematically, we have
|S(x(1), · · · , x(k), · · · , x(N1),y)− S(x(1), · · · , x′(k), · · · , x(N1),y)| ≤N2,
|S(x, y(1), · · · , y(k), · · · , y(N2))− S(x, y(1), · · · , y′(k), · · · , y(N2))| ≤N1.
22
According to McDiarmid’s Inequality (?, Page 206), there is
P (|S −E[S]|> t)≤ 2e−2t2
N1N22+N2N2
1 .
Therefore,
Var1(S) = E[|S −E[S]|2] =
∫ ∞0
P (|S −E(S)|2 ≥ t)dt
=
∫ ∞0
P (|S −E(S)| ≥√t)dt
≤∫ ∞
02e
−2t
N1N22+N2N2
1 dt
=N1N2(N1 +N2).
Therefore,
(14) Var1(S)≤N1N2(N1 +N2) =O(N3).
Similarly, we derive the result for s. When one observation of x changes, the change of scannot be larger than the change of S, and so
(15) Var1(s)≤N1N2(N1 +N2) =O(N3).
Combine (12), (13), (14) and (15), with the results about variance in the second bullet, wehave
ARE(Wm,W ) =E1[s2]/Var(s)
E1[S2]/Var(S)=
(E1[s])2 + Var1(s)
(E1[S])2 + Var1(S)
Var(S)
Var(s)
=
(θ1θ2N1N2∆f,g + 1
2θ(m)N1N2∆θ)2
+O(N3)(θ1θ2N1N2∆f,g + 1
2N1N2∆θ)2
+O(N3)· N1N2(N1 +N2)θ(1− θ+ θ2/3)/4 +O(N2.5)
N1N2(N1 +N2)θ3(1− θ+ 1/3)/4 +O(N2.5)
=
(θ1θ2∆f,g + 1
2θ(m)∆θ)2(
θ1θ2∆f,g + 12∆θ
)2 · (1− θ+ θ2/3)
θ2(1− θ+ 1/3)+O(N−1/2).
The result is proved.
APPENDIX D: PROOF OF LEMMA 1 IN MAIN PAPER
Proof. It is easy to verify that E[si] = 0, 1≤ i≤K . Since Ui is a linear combination of si,there is
E[Ui] =
i∑j=1
E[sj ]− iE[si+1] = 0, 1≤ i≤K.
To check the variance, we begin with the variances and covariances of si. Given n1, · · · , nK ,Var(si|n1, · · · , nK) equals to the variance of the sum-rank of non-zero elements, which isalready derived in the basic property section in Supplementary materials. Therefore, with thelaw of total variance, we find that
Var(si) = E
[ni(
K∑j=1,j 6=i
nj)(
K∑j=1
nj + 1)
]/12 + E
[n2(
K∑j=1
nj −Kni)2
]/4,
Cov(si, sj) =−E
[ninj(
K∑k=1
nk + 1)
]/12 + E
[n2(
K∑k=1
nk −Kni)(K∑k=1
nk −Knj)]/4,
23
where 1≤ i, j ≤K − 1, i 6= j. From this, we obtain
Var(Ui) =E
{∑Kj=1 nj + 1
12
[(
i∑j=1
nj)ni+1(i+ 1)2 + (
K∑k=i+2
nk)(
i∑j=1
nj + i2ni+1)]}
+E
[K2n2
4(
i∑j=1
nj − ini+1)2
],
Cov(Ui1 ,Ui2) = E
[K2n2(
i1∑k=1
nk − i1ni1+1)(
i2∑j=1
nj − i2ni2+1)
]/4,
where 1≤ i≤K − 1, 1≤ i1 < i2 ≤K − 1 and∑K
k=i+2 nk = 0 when i+ 2>K .Now the remaining term is E
[K2n2(
∑i1k=1 nk− i1ni1+1)(
∑i2j=1 nj− i2ni2+1)
]. We prove
the following lemma.
Lemma 2 Let n1, · · · , nKi.i.d∼ Binomial(N,θ), and n= max1≤i≤K ni. Given n, we have
(16) E
[(
i1∑k=1
nk − i1ni1+1)(
i2∑j=1
nj − i2ni2+1)|n]
= 0, 1≤ i1 < i2 ≤K − 1.
With the lemma, we have that
Cov(Ui1 ,Ui2) = E
[K2n2E[(
i1∑k=1
nk− i1ni1+1)(
i2∑j=1
nj − i2ni2+1)|n]
]/4 = E[K2n2 · 0] = 0.
The result is proved. 2
Proof of Lemma 2:To show (16), note that
E
[(
i1∑k=1
nk − i1ni1+1)(
i2∑j=1
nj − i2ni2+1)|n]
= E
[ i1∑k=1
(nk − ni1+1)
i2∑j=1
(nj − ni2+1)|n]
=
i1∑k=1
i2∑j=1
E
[(nk − ni1+1)(nj − ni2+1)|n
],
for 1 ≤ i1 < i2 ≤K − 1. Therefore, to show the left-hand side is 0, we check the value ofeach single term
E[(nk − ni1)(nj − ni2)|n].
As k ≤ i1 < i2, and 1≤ j ≤ i2− 1, there are 3 cases: 1. j = k; 2. j = i1; 3. j 6= k and j 6= i1.In case 1, we have that
E[(nk − ni1)(nk − ni2)|n] = E[n2k − ni1nk − ni2nk + ni1ni2 |n]
Given n, ninj has the same joint distribution for any i 6= j, and so they share the sameexpectation. What’s more, ni and nj has the same conditional variance given n. Therefore,
E[n2k − ni1nk − ni2nk + ni1ni2 |n]
= (E[nk|n])2 + Var(nk|n)−E[ni1 |n]E[nk|n]−Cov (()ni1 |n,nk|n)
= Var(n1|n)−Cov (()n1|n,n2|n).(17)
24
The last equation stands since nk|n and ni1 |n follow the same distribution.In case 2, we have that
E[(nk − ni1)(ni1 − ni2)|n] = E[nkni1 − n2i1 − ni2nk + ni1ni2 |n].
With the same analysis of (17), we have that
(18) E[nkni1 − n2i1 − ni2nk + ni1ni2 |n] = Cov (()n1|n,n2|n)−Var(nk|n).
In case 3, there is
(19) E[(nk − ni1)(nj − ni2)|n] = E[nknj − ni1nj − ni2nk + ni1ni2 |n] = 0.
Combining (17)–(19), and summation over 1≤ k ≤ i1 and 1≤ j ≤ i2, we can find that themean is 0, and (16) follows. 2
APPENDIX E: PROOF OF LEMMA 2 IN MAIN PAPER
In this proof, we use the calculation of Var(UK−1) as an example for simplicity. Thecalculation of Var(Ui) is similar.
We first consider the term under null hypothesis. According to (??), there is an expressionfor Var(UK−1). Decompose Var(UK−1) into two parts,
Var(UK−1)
K2=
E[nK(∑K−1
j=1 nj)(∑K
j=1 nj + 1)]
12+
E[n2(∑K
j=1 nj −KnK)2]
4= I + II.
Part I can be easily calculated by the distribution of nj , and with basic calculations we obtain
(20) I =N3θ3K(K − 1)/12 +O(N2).
For II , we take n = m+Nθ, where m = max{n1 −Nθ,n2 −Nθ, · · · , nK −Nθ}. Thenwe have
4 · II = E[(m+Nθ)2(
K∑j=1
nj −KnK)2]
=N2θ2E[(
K∑j=1
nj −KnK)2] + 2NθE[m(
K∑j=1
nj −KnK)2] + E[m2(
K∑j=1
nj −KnK)2]
= IIa+ IIb+ IIc.
With basic calculations and that (∑K
j=1 nj − KnK)/N is asymptotically distributed as anormal distribution with mean 0 and variance K(K − 1)θ(1− θ), we have that
(21) IIa=N3θ3K(K − 1)(1− θ) +O(N2).
For IIb and IIc, according to Cauchy-Schwarz inequality, we have that
(22) IIb≤ 2Nθ
√√√√E[m2]E[(
K∑j=1
nj −KnK)4] = 2Nθ√
E[m2]√
3K(K − 1)Nθ(1− θ),
and
(23) IIc≤
√√√√E[m4]E[(
K∑j=1
nj −KnK)4] =√
E[m4]√
3K(K − 1)Nθ(1− θ).
25
The unknown part here is E[m2] and E[m4]. For these two terms, we can bound them by
E[m4]≤KE[(n1 −Nθ)4] =O(N2),
E[m2]≤KE[(n1 −Nθ)2] =O(N).
Plugging these into (22) and (23), we have that
(24) IIb.O(N5/2), IIc.O(N2).
Combining (21) and (24), we have that
II =N3θ3K(K − 1)(1− θ)/4 +O(N5/2).
Combining this with (20), and we have that
Var(UK−1) =N3θ3K3(K − 1)(1/3 + (1− θ))/4 +O(N5/2).
The result under null hypothesis is proved.Next, we consider the variances under alternative hypothesis. The analysis is similar with
what we did for the modified Wilcoxon statistics under alternative.Consider Yi first. Yi is a function of x1,x2, · · · ,xK . When one observation change, the
change of Yi is no larger than KN . For Ui, the same thing happens that the change is nolarger than KN . Therefore,
KN∑i=1
(KN)2 =K3N3.
According to McDiarmid’s Inequality (?, Page 206), there is
P (|Yi −E[Yi]|> t)≤ 2e−2t2
K3N3 .
Therefore,
Var1(Yi) = E[|Yi −E[Yi]|2] =
∫ ∞0
P (|Yi −E(Yi)|2 ≥ t)dt
=
∫ ∞0
P (|Yi −E(Yi)| ≥√t)dt
≤∫ ∞
02e
−2t
K3N3 dt
=K3N3.
Therefore,
(25) Var1(Yi)≤K3N3, Var1(Ui)≤K3N3, 1≤ i≤K.
The lemma is proved. 2
APPENDIX F: PROOF OF LEMMA 3 IN MAIN PAPER
In this section, all the proofs are for the expectations under alternative hypothesis. So wedrop the subscriptsof E1 for simplicity in this section only.
With basic calculations, it can be shown that, for 1≤ i≤K − 1,
Yi = (
i∑j=1
s0j − is0
i+1) +NK
2[ini+1−
i∑j=1
nj ], Ui = (
i∑j=1
s0j − is0
i+1) +nK
2[ini+1−
i∑j=1
nj ].
26
So, for Yi terms, the problem reduces to find E[NK2 (ini+1 −∑i
j=1 nj)]. With basic calcu-lations, this expectation turns to be N2K
2 (iθi+1 −∑i
j=1 θj). Combining with E[∑i
j=1 s0j −
is0i+1
∣∣n1, · · · , nK]
=∑i
j=1
∑k 6=j njnk∆j,k − ini+1
∑k 6=i+1 nk∆i+1,k, we have that
E[Yi] =N2
[ i∑j=1
∑k 6=j
θjθk∆j,k−iθi+1
∑k 6=i+1
θk∆i+1,k
]+KN2
2
(iθi+1−
i∑j=1
θj
), 1≤ i≤K−1.
So, the first part of the conclusion is proved.To find E[Ui], it also reduces to find E1[nK2 [ini+1−
∑ij=1 nj ]]. Decompose n=Nθ(K) +
m, then we have that
E[Ui] =N2
[ i∑j=1
∑k 6=j
θjθk∆j,k − iθi+1
∑k 6=i+1
θk∆i+1,k
]
+KN2
2θ(K)(iθi+1 −
i∑j=1
θj) + E[mK
2[ini+1 −
i∑j=1
nj ]].
As long as we can show that E[mK2 [ini+1 −∑i
j=1 nj ]] = o(N2), the lemma is proved.With Cauchy-Schwarz Inequality, we have that
(1) E[mK
2[ini+1 −
i∑j=1
nj ]]≤K
2
√√√√E[m2]E[(ini+1 −i∑
j=1
nj)2].
If θ1 = θ2 = · · ·= θK = θ, then θ(K) = θ. What’s more, E[m2]≤∑K
i=1 E[(ni −Nθ)2] =
KNθ(1 − θ) = O(N). The second term that E[(ini+1 −∑i
j=1 nj)2] = i(i + 1)Nθ(1 −
θ) = O(N). Combining the two terms with (1), it can be concluded that E[mK2 [ini+1 −∑ij=1 nj ]]≤O(N). So, the result is proved.On the other hand, if there is some i and j, such that θi 6= θj . Then,
(2) E[(ini+1 −i∑
j=1
nj)2] .O(N2).
For any given N , we decompose the index set S = {1,2, · · · ,K} into S1 = {j : θj − θ(K) ≥−2√
log(N)/N} and S2 = S\S1. Correspondingly, assume that the index that achieves themaximum is h, then the expectation can also be decomposed as following
E[m2] = E[m21{h ∈ S1}] + E[m21{h ∈ S2}] = I + II.
For part I , we have that
(3) I ≤ E[∑j∈S1
n2j ]≤ 4 log(N)N +N/4 =O(N log(N)).
To bound II , assume that θ1 = θ(K) without loss of generality. Then we have that
(4) II ≤∑j∈S2
E[(nj −Nθ(K))21{nj ≥ n1}]≤
∑j∈S2
√E[(nj −Nθ(K))4]P (nj ≥ n1).
The last inequality comes from Cauchy-Schwartz Inequality. Recall that we have θj−θ(K) <
−2√
log(N)/N} as j ∈ S2. According to Hoeffding’s Inequality, we have that
P (nj−n1 ≥ 0)≤ P ((nj−n1)−N(θj−θ(K))> 2√
log(N)N})≤ exp(−2(2√
log(N)N)2
4N),
27
and it reduces to P (nj − n1 ≥ 0)≤ 1/N2. Combining with (4), we have that
II ≤O(√
1/N2 ∗N4) =O(N).
Combining this result about II with (3), there is E[m2] =O(N log(N)). Plugging this resultand (2) into (1), we have
E[mK
2[ini+1 −
i∑j=1
nj ]] =O(N3/2√
log(N)) = o(N2).
So, the lemma is proved. 2
APPENDIX G: PROOFS OF LEMMA 1 IN SUPPLEMENTAL MATERIALS
When N1,N2→∞, according to Central Limit Theorem, we have that(p1
p2
)(d)→N
((θθ
)(θ(1− θ)/N1 0
0 θ(1− θ)/N2
)).
Let α= p1− p2, β = (N1p1 +N2p2)/(N1 +N2), inversely there are p1 = β + N2
N1+N2α and
p2 = β − N1
N1+N2α. What’s more, the distribution of (α,β)T is that(
αβ
)=N
((0θ
)(θ(1− θ)(1/N1 + 1/N2) 0
0 θ(1− θ)/(N1 +N2)
)).
With α, β, we rewrite the term p(p2 − p1) as
p(p2 − p1) =
{ 12(β + N2
N1+N2α)(−α) α≥ 0
12(β − N1
N1+N2α)(−α) α< 0
As α⊥ β asymptotically, we have
E[αβ] =E[α]E[β] = 0, E[α21{α≥ 0}] =E[α21{α< 0}] =1
2θ(1− θ)(1/N1 + 1/N2).
Combining with these and basic calculations, we have that
E[p(p2 − p1)] =θ(1− θ)
4
N1 −N2
N1N2.
On the other hand, assume N1 ≤N2 without loss of generality, we have that
E[α2β2] = θ3(1− θ)(1/N1 + 1/N2) +O(N−22 ),
E[α31{α≥ 0}] =E[α31{α< 0}] =O(N−3/21 ),
E[α41{α≥ 0}] =E[α41{α< 0}] =O(N−21 ).
So, we have that E[p2(p2−p1)2] =E[α2β2] +O(E[α3β]) +O(E[α4]) = θ3(1− θ)(1/N1 +
1/N2) +O(N−3/21 )
APPENDIX H: RELEVANT LEMMAS FOR TRUNCATED KRUSKAL-WALLIS TESTFOR UNEQUAL SAMPLE SIZES
We analyze E[Ui] under null hypothesis first. It is a linear function of si’s. For si, giventhe number of non-zero entries in each sample (equivalently, p1, · · · , pK are given), there is
E[si|p1, p2, · · · , pK ] =Ni
2
[p
K∑j=1
(pj − pi)Nj
]+O
(max{Ni,
K∑i=1
ni}),
28
where the remainder comes from the fact that we take the largest integer smaller than pNi
instead of pNi for each sample. Therefore,(1)
E[Ui] =1
2Ni+1
K∑k=1
Nk
i∑j=1
NjE[p(pi+1 − pj)] +O
( K∑k=1
Nk
i+1∑j=1
Nj
), 1≤ i≤K − 1.
The expectation in above equation is hard to calculate. However, we cn show it is compar-atively small, hence we only need to find an upper bound, which is given as Lemma 3 (seeSupplementary Materials).
Lemma 3 Let pi ∼N(θ, θ(1− θ)/Ni) independently, i = 1, · · · ,K , and N(1) = min1≤i≤K
Ni,
then
|E[p(pj − pi)]| ≤ (2√K + 1)θ(1− θ)/N(1), i 6= j,1≤ i, j ≤K,
where p= max1≤i≤K
pi.
Based on Lemma 3, let N(1) = min1≤i≤K
Ni,
(2) E[Ui] =O
(Ni+1
K∑k=1
Nk
i∑j=1
Nj/N(1)
),1≤ i≤K − 1.
Next, we derive the order of Var(Ui). It is given by
Var(Ui) = Var(E[Ui|p1, · · · , pK ]) + E[Var(Ui|p1, · · · , pK)](3)
≥ E[Var(Ui|p1, · · · , pK)], 1≤ i≤K − 1
Lemma 4 Under the null hypothesis that f1 = f2 = · · ·= fK = f , there is
Var[Ui|p1, · · · , pK ] =
1
12E
{(
K∑k=1
nk + 1)[ K∑k=1
nk[N2i+1
i∑j=1
nj + ni+1(
i∑j=1
Nj)2]− (ni+1
i∑j=1
Nj −Ni+1
i∑j=1
nj)2]}.
Further, the expectation follows that
E(Var[Ui|p1, · · · , pK ]) =θ2
12Ni+1(
i∑j=1
Nj)(
i+1∑j=1
Nj)(
K∑j=1
Nj)[(
K∑j=1
Nj)θ+ 3− 2θ]
≥ θ3
12Ni+1(
i∑j=1
Nj)2(
K∑j=1
Nj)2.
Combine (3) with Lemma 4, the lower bound for Var(Ui) is
Var(Ui)≥θ3
12Ni+1(
i∑j=1
Nj)2(
K∑j=1
Nj)2 =O
(N2i+1(
K∑j=1
Nj)2
( i∑j=1
Nj
)2
θ3/Ni+1
),(4)
when 1≤ i≤K . Compare it to theE[Ui], where (E[Ui])2 =O
(N2i+1(
∑Kj=1Nj)
2(∑i
j=1Nj)2/N2
(1)
).
E[Ui]/Var(Ui) = o(1) when max1≤i≤KNi/N2(1)→ 0 and N(1)→∞.
For the covariance term, we apply the following lemma.
29
Lemma 5 Under the null hypothesis where f1 = f2 = · · ·= fK = f , and all θi’s are equal,there is
|Cov(Ui,Uj)| ≤ θ2(
K∑k=1
Nk)2Ni+1Nj+1(1 +
K∑k=1
Nk/N(1))
√√√√(
i∑l=1
Nl)(
i∑l=j
Nl)
√√√√(
K∑k=1
1
Nk),
where N(1) = min1≤k≤KNk and N(1)→∞.
APPENDIX I: PROOF OF LEMMA 3
With Cauchy-Schwarz inequality and Jensen’s inequality, note that f(x) =√x is a con-
cave function, we have that∣∣E[p(pj − pi)]∣∣=∣∣E[(p− pi)(pj − pi)] +E[pi(pj − pi)]
∣∣≤√E[(p− pi)2]E[(pj − pi)2] +
∣∣E[pi(pj − pi)]∣∣.(1)
For the first part, note that pjNj ∼Binomial(Nj , θ), piNi ∼Binomial(Ni, θ), Ni and Nj
are constants and pi and pi are independent, so we have
(2) E[(pj − pi)2] = θ(1− θ)(1/Nj + 1/Ni).
To bound E[(p− pi)2], note that we have (p− pi)2 ≤∑K
k=1(pk − pi)2. Combining with (2),there is
(3) E[(p− pi)2]≤K∑k=1
E[(pk − pi)2] =∑k 6=i
θ(1− θ)(1/Nk + 1/Ni)≤ 2Kθ(1− θ)/N(1).
Introduce (2) and (3) into (1), and we have
|E[p(pj − pi)]| ≤ 2θ(1− θ)√K/N(1) + θ(1− θ)/Ni.
So the result follows. 2
APPENDIX J: PROOF OF LEMMA 4
There are two results to prove in this lemma, the conditional variance, and the expectationof the conditional variance.
• Conditional on the number of nonzeros in each sample, the variance is
Var[Ui|p1, · · · , pK ] =
1
12(
K∑k=1
nk + 1)[ K∑k=1
nk[N2i+1
i∑j=1
nj + ni+1(
i∑j=1
Nj)2]− (ni+1
i∑j=1
Nj −Ni+1
i∑j=1
nj)2].
• The expectation of the variance is
E(Var[Ui|p1, · · · , pK ]) =θ2
12Ni+1(
i∑j=1
Nj)(
i+1∑j=1
Nj)(
K∑j=1
Nj)[(
K∑j=1
Nj)θ+ 3− 2θ]
≥ θ3
12Ni+1(
i∑j=1
Nj)2(
K∑j=1
Nj)2.
30
We will show them one by one.First we find the conditional variance. Recall that Ui =
∑ij=1(Ni+1sj −Njsi+1). Define
r0j as the sum-rank of non-zero elements in Sample j. Hence, sj = r0
j +∑K
i=1bpNic+1+∑K
i=1 ni
2 (bpNic−ni)−
∑Ki=1bpNic+1
2 (bpNic). Given p1, p2, · · · , pK , equivalently n1, n2, · · · , nK , the only ran-dom part is r0
j . Therefore,
Var(Ui|p1, · · · , pK) = Var(
i∑j=1
(Ni+1sj−Njsi+1)|p1, · · · , pK) = Var(
i∑j=1
(Ni+1r0j−Njr
0i+1)|p1, · · · , pK).
Given p1, · · · , pK , then n1, n2, · · · , nK are also given. It means we have∑K
i=1 ni indepen-dently and identically distributed observations, and we are interested in the sum-rank of onesample j, denoted as r0
j . Definitely, we can use the properties in Section A.
Var(
i∑j=1
(Ni+1r0j −Njr
0i+1)|p1, · · · , pK)
= Var(Ni+1
i∑j=1
r0j − r0
i+1
i∑j=1
Nj)
=N2i+1
i∑j=1
Var(r0j ) + (
i∑j=1
Nj)2Var(r0
i+1)− 2Ni+1(
i∑j=1
Nj)
i∑j=1
Cov (() r0i+1, r
0j ).
According to the properties in Section A, Var(r0i ) =
(∑K
k=1 nk+1)12 ni(
∑Kk 6=i,k=1 nk), Cov (() r0
i , r0j ) =
− (∑K
k=1 nk+1)12 ninj . Introduce the terms into the equation, and we have that
Var[Ui|p1, · · · , pK ]
=
∑Kk=1 nk + 1
12
[N2i+1(
i∑j=1
nj)(
K∑j=i+1
nj) + (
i∑j=1
Nj)2ni+1(
K∑j 6=i+1,j=1
nj) + 2Ni+1(
i∑j=1
Nj)ni+1(
i∑j=1
nj)]
=1
12(
K∑k=1
nk + 1)[ K∑k=1
nk[N2i+1
i∑j=1
nj + ni+1(
i∑j=1
Nj)2]− (ni+1
i∑j=1
Nj −Ni+1
i∑j=1
nj)2].
The first conclusion is proved.To derive the second conclusion, we simply calculate the expectation of the results. Note
that in the equation, there is no p term, only products of ni, nj and nk. Under null hypothesis,ni ∼Binomial(Ni, θ) independently, therefore, we have
E[ni] =Niθ, Var(ni) =Niθ(1− θ), E[n2i ] =Niθ(Niθ+ 1− θ),
E[ninj ] =NiNjθ2, E[n2
inj ] =NiNjθ2(Niθ+ 1− θ).
Introduce these terms into EVar(Ui|p1, · · · , pK), we have the final result
E(Var[Ui|p1, · · · , pK ]) =θ2
12Ni+1(
i∑j=1
Nj)(
i+1∑j=1
Nj)(
K∑j=1
Nj)[(
K∑j=1
Nj)θ+ 3− 2θ]
≥ θ3
12Ni+1(
i∑j=1
Nj)2(
K∑j=1
Nj)2.
The final inequality comes from 3− 2θ > 0 and∑i+1
j=1Nj >∑i
j=1Nj .Therefore, the lemma is proved.
31
APPENDIX K: PROOF OF LEMMA 5
We are interested in the case that f1 = f2 = · · ·= fK = f , and all θi’s are equal. For thiscase, we have that(1)Cov (()Ui,Uj) = E[Cov (()Ui,Uj |p1, · · · , pK)]+Cov (()E[Ui|p1, · · · , pK ],E[Uj |p1, · · · , pK ]) = I+II.
We want to find the upper bound for I and II separately.Consider I first. Given p1, · · · , pK , the number of non-zero elements are given, so the
random part is the rank of non-zero elements in every sample. Let r0i denote the sum-rank of
the non-zero elements in sample i. Therefore, we calculate the conditional covariance as
Cov (()Ui,Uj |p1, · · · , pK) = Cov (()Ni+1
i∑l1=1
r0l1 − r
0i+1
i∑l1=1
Nl1 ,Nj+1
j∑l2=1
r0l2 − r
0j+1
j∑l2=1
Nl2)
=1 +
∑Kk=1 nk
12
[Ni+1Nj+1(
i∑l1=1
nl1)(
K∑l2=j+1
nl2) +Ni+1nj+1(
i∑l1=1
nl1)(
j∑l2=1
nl2)
−ni+1Nj+1(
i∑l1=1
Nl1)(
K∑l2=j+1
nl2)− ni+1nj+1(
i∑l1=1
Nl1)(
j∑l2=1
Nl2)
].
Note that ni ∼Binomial(Ni, θ) independently, so that we have
(2) I = E[Cov (()Ui,Uj |p1, · · · , pK)] = 0.
Next consider II . According to (1), we have the result that
E[Ui] =1
2Ni+1
K∑k=1
Nk
i∑j=1
NjE[p(pi+1 − pj)] +O
( K∑k=1
Nk
i+1∑j=1
Nj
), 1≤ i≤K − 1,
where the remainder comes from the difference between bpNic and pNi, which is a smallerorder term. Therefore, the covariance between two terms is
II = Cov (()1
2Ni+1
K∑k=1
Nk
i∑l1=1
Nl1p(pi+1 − pl1),1
2Nj+1
K∑k=1
Nk
j∑l2=1
Nl2p(pj+1 − pl2))
=1
4Ni+1Nj+1
( K∑k=1
Nk
)2Cov (()
i∑l1=1
Nl1p(pi+1 − pl1),j∑
l2=1
Nl2p(pj+1 − pl2)).
It’s hard to find an exact value, so we work on an upper bound only. We rewrite p= θ+ (p−θ), and so the covariance part is the summation of four parts
Cov (()
i∑l1=1
Nl1p(pi+1 − pl1),j∑
l2=1
Nl2p(pj+1 − pl2))
= Cov (()
i∑l1=1
Nl1θ(pi+1 − pl1),j∑
l2=1
Nl2θ(pj+1 − pl2)) + Cov (()
i∑l1=1
Nl1(p− θ)(pi+1 − pl1),j∑
l2=1
Nl2θ(pj+1 − pl2))
+ Cov (()
i∑l1=1
Nl1θ(pi+1 − pl1),j∑
l2=1
Nl2(p− θ)(pj+1 − pl2)) + Cov (()
i∑l1=1
Nl1(p− θ)(pi+1 − pl1),j∑
l2=1
Nl2(p− θ)(pj+1 − pl2))
= IIa+ IIb+ IIc+ IId.
32
Check the terms one by one. For IIa, note that Cov (()pi, pj) = 0 when i 6= j because ofindependence, so we have
IIa= θ2 Cov (()
i∑l1=1
Nl1(pi+1 − pl1),j∑
l2=1
Nl2(pi+1 − pl2))
= θ2
[ i∑l=1
[N2l Var(pl)]− (
i∑l=1
Nl)Ni+1Var(pi+1)
]
= θ2
[ i∑l=1
Nlθ(1− θ)− (
i∑l=1
Nl)θ(1− θ)]
= 0.(3)
The parts IIb and IIc are symmetric. We analyze one and the result for the other one canalso be achieved.
IIb= θCov (()
i∑l1=1
Nl1(p− θ)(pi+1 − pl1),j∑
l2=1
Nl2(pi+1 − pl2))
≤ θ
√√√√Var[ i∑l1=1
Nl1(p− θ)(pi+1 − pl1)]Var[ j∑l2=1
Nl2(pi+1 − pl2)].(4)
Then we check the two variances. Consider the first one. Note that p = max1≤i≤K pi, so(p− θ)2 ≤
∑Ki=1(pi − θ)2. So we have
Var[ i∑l1=1
Nl1(p− θ)(pi+1 − pl1)]≤ E[(p− θ)2]E[(
i∑l1=1
Nl1(pi+1 − pl1))2]
≤[ K∑k=1
E[(pk − θ)2]
]E[(
i∑l1=1
Nl1(pi+1 − pl1))2]
(Note E[
i∑l1=1
Nl1(pi+1 − pl1)] = 0) =
[ K∑k=1
Var(pk − θ)]Var[
i∑l1=1
Nl1(pi+1 − pl1)]
= θ2(1− θ)2(
K∑k=1
1/Nk)(
i∑l=1
Nl)[1 +
i∑l=1
Nl/Ni+1](5)
For the second one, the calculation is straigtforward.
(6) Var[ j∑l2=1
Nl2(pi+1 − pl2)]
= θ(1− θ)(j∑l=1
Nl)[1 +
j∑l=1
Nl/Nj+1].
Combine (5) – (6) with (4), we have that
IIb≤ θ
√√√√θ2(1− θ)2(
K∑k=1
1/Nk)(
i∑l=1
Nl)[1 +
i∑l=1
Nl/Ni+1]θ(1− θ)(j∑l=1
Nl)[1 +
j∑l=1
Nl/Nj+1]
≤ θ2[1 +
K∑l=1
Nl/N(1)]
√√√√(
K∑k=1
1/Nk)(
i∑l=1
Nl)(
j∑l=1
Nl),(7)
33
where N(1) is the minimum of N1, · · · ,NK .Similarly, for IIc, we have
(8) IIc≤ θ2[1 +
K∑l=1
Nl/N(1)]
√√√√(
K∑k=1
1/Nk)(
i∑l=1
Nl)(
j∑l=1
Nl).
Finally, we check IId. With the results above, we have the upper bound for IId as
IId≤
√√√√Var[ i∑l1=1
Nl1(p− θ)(pi+1 − pl1)]Var[ j∑l2=1
(p− θ)Nl2(pi+1 − pl2)]
≤
√√√√θ2(1− θ)2(
K∑k=1
1/Nk)(
i∑l=1
Nl)[1 +
i∑l=1
Nl/Ni+1]θ2(1− θ)2(
K∑k=1
1/Nk)(
i∑l=1
Nl)[1 +
i∑l=1
Nl/Ni+1]
≤ θ2[1 +
K∑l=1
Nl/N(1)](
K∑k=1
1/Nk)
√√√√(
i∑l=1
Nl)(
j∑l=1
Nl).(9)
Finally, combine (3), (7), (8) and (9), we have that
|Cov (()
i∑l1=1
Nl1p(pi+1 − pl1),j∑
l2=1
Nl2p(pj+1 − pl2))|= |IIa+ IIb+ IIc+ IId|
≤ 0 + θ2[1 +
K∑l=1
Nl/N(1)]
√√√√(
K∑k=1
1/Nk)(
i∑l=1
Nl)(
j∑l=1
Nl) + θ2[1 +
K∑l=1
Nl/N(1)]
√√√√(
K∑k=1
1/Nk)(
i∑l=1
Nl)(
j∑l=1
Nl)
+θ2[1 +
K∑l=1
Nl/N(1)](
K∑k=1
1/Nk)
√√√√(
i∑l=1
Nl)(
j∑l=1
Nl)
≤ 3θ2[1 +
K∑l=1
Nl/N(1)]
√√√√(
K∑k=1
1/Nk)(
i∑l=1
Nl)(
j∑l=1
Nl),(10)
when N1, · · · ,NK →∞.Finally, combine (2) and (10) with (1), there is
|Cov (()Ui,Uj)|= |I + II|= |II| ≤ 1
4Ni+1Nj+1
( K∑k=1
Nk
)2 · 3θ2[1 +
K∑l=1
Nl/N(1)]
√√√√(
K∑k=1
1/Nk)(
i∑l=1
Nl)(
j∑l=1
Nl)
≤ θ2Ni+1Nj+1
( K∑k=1
Nk
)2[1 +
K∑l=1
Nl/N(1)]
√√√√(
K∑k=1
1/Nk)(
i∑l=1
Nl)(
j∑l=1
Nl).
Therefore, the result is proved.
Recommended