The Generalized Higher Criticism for Testing SNP-sets inGenetic Association Studies
Ian Barnett, Rajarshi Mukherjee & Xihong Lin
Harvard University
August 5, 2014
Ian Barnett JSM 2014 August 5, 2014 1 / 20
Background
Genome-wide association studies (GWAS): millions of common (minorallele frequency > 0.05) SNPs genotyped.
Gene-level/pathway-level analysis can provide power to detect thesetypes of effects by combining information over the SNPs.
Goal: Develop powerful, computationally efficient, statisticalmethodology for SNP-sets that have the power to detect joint SNPeffects.
Ian Barnett JSM 2014 August 5, 2014 2 / 20
Model
n subjects, q covariates, p genetic variants.
Yi is phenotype for ith individual
Xi · contains q covariates for ith individual
Gi · contains SNP information (minor allele counts) in agene/pathway/SNP-set for ith individual
α and β contain regression coefficients.
µi = E (Yi |Gi ·,Xi ·)
Model
h(µi ) = Xi ·α + Gi ·β
h(·) is the link function.
Ian Barnett JSM 2014 August 5, 2014 3 / 20
Marginal SNP test statistics
The marginal score test statistic for the jth variant is:
Zj = GT·j (Y − µ0)
where µ0 is the MLE of E (Y |H0). Assume Zj is normalized.
Letting UUT = Cov(Z ) = Σ, define the transformed (decorrelated)test statistics:
Z∗ = U
−1Z
L−−−→n→∞
MVN(0, Ip)
Ian Barnett JSM 2014 August 5, 2014 4 / 20
Current popular methods
Method SKAT MinP
Test statistic∑p
j=1 Z2j maxj{|Zj |}
Pros
High power when signalsparsity is low. Accuratep-values can be obtainedquickly.
High power when signalsparsity is high.
ConsCan have very low powerwhen sparsity is high.
Slightly lower power whensparsity is low. Difficult toobtain accurate analytic p-values.
Ian Barnett JSM 2014 August 5, 2014 5 / 20
The higher criticism
Let
S(t) =
p∑j=1
1{|Zj |≥t}
Assumes Σ = Ip
Under H0, S(t) ∼ Binomial(p, 2Φ(t)) where Φ(t) = 1− Φ(t) is thesurvival function of the normal distribution.
The Higher Criticism test statistic is:
HC = supt>0
{S(t)− 2pΦ(t)√
2pΦ(t)(1− 2Φ(t))
}
Ian Barnett JSM 2014 August 5, 2014 6 / 20
The higher criticism
−4 −2 0 2 4
0.00
0.10
0.20
0.30
Histogram of the Zi
t
Den
sity
argmax{HC(t)}
Ian Barnett JSM 2014 August 5, 2014 7 / 20
Adjusting for correlation
Recalling that Z ∗ = U−1Z , let
S∗(t) =
p∑j=1
1{|Z∗j |≥t}
Note that under H0, S∗(t) ∼ Binomial(p, 2Φ(t)) regardless forgeneral correlated Σ.
The innovated Higher Criticism test statistic is:
iHC = supt>0
{S∗(t)− 2pΦ(t)√2pΦ(t)(1− 2Φ(t))
}
Ian Barnett JSM 2014 August 5, 2014 8 / 20
Asymptotic p-values for iHC
The supremum of this standardized empirical process follows aGumbel distribution asymptotically.
For 0 < ε < δ < 1, if the supremum is taken over the intervalΦ−1(1− δ/2) < t < Φ−1(1− ε/2), then, as shown in Jaeschke(1979), we can write:
pr(iHC < x) ≈ exp[− exp{−x(2 log ρ)1/2 − 2−1 log π + 2−1 log2 ρ+ 2 log ρ}],
where ρ = 2−1 log[δ(1− ε)/{ε(1− δ)}].Jaeschke (1979) shows that this converges in distribution at anabysmal rate of O{(log p)−1/2}
Ian Barnett JSM 2014 August 5, 2014 9 / 20
Slow convergence to asymptotic distribution
The supremum of the higher criticism test statistic is taken over the rangeΦ−1(1− δ/2) < t < Φ−1(1− ε/2) where ε = 0·01 and δ = 0·40. Eachempirical distribution is constructed using 500 samples of p independentstandard normal random variables.
0 1 2 3 4
0.0
0.2
0.4
0.6
0.8
1.0
x
CD
F(x
)
Theoretical; p=∞ Empirical; p=102 Empirical; p=106
Ian Barnett JSM 2014 August 5, 2014 10 / 20
Analytic p-values for iHC
Letting h be the observed GHC statistic:
p-value = pr
(supt>0
{S∗(t)− 2pΦ(t)√2pΦ(t)(1− 2Φ(t))
}≥ h
)
There exists 0 < t1 < · · · < tp, such that
p-value = 1− pr
(p⋂
k=1
{S∗(tk) ≤ p − k}
)
Then apply the chain rule of conditioning to get a product ofbinomial probabilities.
Ian Barnett JSM 2014 August 5, 2014 11 / 20
Type I error rates
Table: Empirical type I error rate percentages from 106 simulations for the highercriticism using the proposed analytic method over a range of significance levels,α. Asymptotic type I error rate percentages are provided for comparison:exact(asymptotic).
pα 2 10 50
5·0 4·98(5·95) 4·98(3·24) 5·01(1·84)1·0 1·00(2·24) 9·92 × 10−1(7·31 × 10−1) 1·01(1·59 × 10−1)
1·0 × 10−1 9·97 × 10−2(2·05 × 10−1) 1·01 × 10−1(6·03 × 10−2) 9·75 × 10−2(4·90 × 10−3)
1·0 × 10−2 1·01 × 10−2(5·32 × 10−2) 1·12 × 10−2(7·30 × 10−3) 9·80 × 10−3(4·00 × 10−4)
Ian Barnett JSM 2014 August 5, 2014 12 / 20
Decorrelating dampens true signals
Cancer Genetic Markers of Susceptibility (CGEM) Breast Cancer GWAS: FGFR2 gene
Fre
quen
cy
02
46
8
Z
Marginal test statistics
Fre
quen
cy
−4 −2 0 2 4
02
46
8
Z*
Decorrelating causes iHC to lose power.Ian Barnett JSM 2014 August 5, 2014 13 / 20
Comparison
Method MinP SKAT iHC
Robust to signal sparsity X XRobust to correlation/LD structure X X
Computationally efficient X XDoes not require decorrelating test statistics X X
Ian Barnett JSM 2014 August 5, 2014 14 / 20
Comparison
Method MinP SKAT iHC GHC
Robust to signal sparsity X X XRobust to correlation/LD structure X X X
Computationally efficient X X XDoesn’t require decorrelating test statistics X X X X
*We will also consider the omnibus test, OMNI, in our power simulations.It is based on the minimum p-value of the SKAT, MinP, and GHC.
Ian Barnett JSM 2014 August 5, 2014 15 / 20
Our contribution: the generalized higher critcism (GHC)
Recall
S(t) =
p∑j=1
1{|Zj |≥t}
Now we allow Σ to have arbitrary correlation structure.
S(t) is no longer binomial. Instead we approximate withBeta-binomial, matching on first two moments.
The Generalized Higher Criticism test statistic is:
GHC = supt>0
S(t)− 2pΦ(t)√Var(S(t))
GHC achieves the same as detection boundary as HC .
Ian Barnett JSM 2014 August 5, 2014 16 / 20
The variance estimator Var(S(t))
Theorem 1
Let rn = 2p(1−p)
∑1≤k<l≤p(Σkl)
n and let Hi (t) be the Hermite
polynomials: H0(t) = 1, H1(t) = t, H2(t) = t2 − 1 and so on. Then
Cov
(S(tk),S(tj)
)= p[2Φ(max{tj , tk})− 4Φ(tj)Φ(tk)]
+4p(p − 1)φ(tj)φ(tk)∞∑i=1
H2i−1(tj)H2i−1(tk)r2i
(2i)!
Proof follows from Schwartzman and Lin (2009) where they showed:
P(Zk > ti ,Zl > tj) = Φ(ti )Φ(tj) + φ(ti )φ(tj)∞∑n=1
Σnkl
n!Hn−1(ti )Hn−1(tj)
Ian Barnett JSM 2014 August 5, 2014 17 / 20
Analytic p-values for the GHC
Letting h be the observed GHC statistic:
p-value = pr
supt>0
S(t)− 2pΦ(t)√Var(S(t))
≥ h
There exists 0 < t1 < · · · < tp, such that
p-value = 1− pr
(p⋂
k=1
{S(tk) ≤ p − k}
)
Ian Barnett JSM 2014 August 5, 2014 18 / 20
Power simulations
0.00 0.05 0.10 0.15
0.0
0.2
0.4
0.6
0.8
1.0
ρ1=0.4; ρ3=0; p=40; 2 causal
ρ2
pow
er
GHCMinPSKATiHCOMNI
0.00 0.05 0.10 0.15
0.0
0.2
0.4
0.6
0.8
1.0
ρ1=0.4; ρ3=0; p=40; 4 causal
ρ2
pow
er
GHCMinPSKATiHCOMNI
0.0 0.1 0.2 0.3 0.4 0.5 0.6
0.0
0.2
0.4
0.6
0.8
1.0
ρ1=0.4; ρ3=0.4; p=40; 2 causal
ρ2
pow
er
GHCMinPSKATiHCOMNI
0.0 0.1 0.2 0.3 0.4 0.5
0.0
0.2
0.4
0.6
0.8
1.0
ρ1=0.4; ρ3=0.4; p=40; 4 causal
ρ2
pow
er
GHCMinPSKATiHCOMNI
ρ1: correlation within causal variants.
ρ2: correlation between causal and noncausal variants.
ρ3: correlation within non-causal variants.
Ian Barnett JSM 2014 August 5, 2014 19 / 20
Data analysis
The National Cancer Institute’s Cancer Genetic Markers of Susceptibility(CGEM) breast cancer GWAS. Sample has 1145 cases, 1142 controls witheuropean ancestry.
0 1 2 3 4 5
0
1
2
3
4
5
Expected − log10(pvalue)
Obs
erve
d −
log 1
0(pva
lue)
FGFR2PTCD3
POLR1A
Ian Barnett JSM 2014 August 5, 2014 20 / 20