The Generalized Higher Criticism for Testing SNP-sets in Genetic Association Studies

Ian Barnett, Rajarshi Mukherjee & Xihong Lin

Harvard University

August 5, 2014

Ian Barnett JSM 2014 August 5, 2014 1 / 20

Background

Genome-wide association studies (GWAS): millions of common (minor allele frequency > 0.05) SNPs genotyped.

Gene-level/pathway-level analysis can provide power to detect these types of effects by combining information over the SNPs.

Goal: Develop powerful, computationally efficient statistical methodology for SNP-sets that has the power to detect joint SNP effects.


Model

$n$ subjects, $q$ covariates, $p$ genetic variants.

$Y_i$ is the phenotype for the $i$th individual.

$X_{i\cdot}$ contains the $q$ covariates for the $i$th individual.

$G_{i\cdot}$ contains SNP information (minor allele counts) in a gene/pathway/SNP-set for the $i$th individual.

$\alpha$ and $\beta$ contain regression coefficients.

\[ \mu_i = E(Y_i \mid G_{i\cdot}, X_{i\cdot}) \]

Model

\[ h(\mu_i) = X_{i\cdot}\alpha + G_{i\cdot}\beta \]

$h(\cdot)$ is the link function.


Marginal SNP test statistics

The marginal score test statistic for the $j$th variant is:

\[ Z_j = G_{\cdot j}^{T}(Y - \hat{\mu}_0) \]

where $\hat{\mu}_0$ is the MLE of $E(Y \mid H_0)$. Assume $Z_j$ is normalized.

Letting $UU^{T} = \mathrm{Cov}(Z) = \Sigma$, define the transformed (decorrelated) test statistics:

\[ Z^{*} = U^{-1}Z \xrightarrow[n\to\infty]{L} \mathrm{MVN}(0, I_p) \]
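A minimal sketch of the decorrelation step, assuming $U$ is taken to be the lower-triangular Cholesky factor of $\Sigma$ (the slides only require $UU^{T} = \Sigma$, so this choice is an assumption). The function names `cholesky_lower` and `decorrelate` are illustrative, not from the talk.

```python
import math

def cholesky_lower(sigma):
    """Return lower-triangular U with U U^T = sigma (sigma positive definite)."""
    p = len(sigma)
    U = [[0.0] * p for _ in range(p)]
    for i in range(p):
        for j in range(i + 1):
            s = sum(U[i][k] * U[j][k] for k in range(j))
            if i == j:
                U[i][j] = math.sqrt(sigma[i][i] - s)
            else:
                U[i][j] = (sigma[i][j] - s) / U[j][j]
    return U

def decorrelate(z, sigma):
    """Compute Z* = U^{-1} Z by forward substitution on U z_star = z."""
    U = cholesky_lower(sigma)
    z_star = []
    for i in range(len(z)):
        partial = sum(U[i][k] * z_star[k] for k in range(i))
        z_star.append((z[i] - partial) / U[i][i])
    return z_star

# Two test statistics with correlation 0.5 (made-up values for illustration).
sigma = [[1.0, 0.5], [0.5, 1.0]]
z_star = decorrelate([1.0, 2.0], sigma)
```

Forward substitution avoids forming $U^{-1}$ explicitly. A different square root of $\Sigma$ would give a different, equally valid $Z^{*}$.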


Current popular methods

SKAT
  Test statistic: $\sum_{j=1}^{p} Z_j^{2}$
  Pros: High power when signal sparsity is low. Accurate p-values can be obtained quickly.
  Cons: Can have very low power when sparsity is high.

MinP
  Test statistic: $\max_j \{|Z_j|\}$
  Pros: High power when signal sparsity is high.
  Cons: Slightly lower power when sparsity is low. Difficult to obtain accurate analytic p-values.
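As a small illustration of why the two statistics favour different sparsity regimes, the sketch below evaluates the statistic forms listed above on two made-up vectors of Z-scores. The function names and the example values are hypothetical, and no p-value calculation is attempted.

```python
def skat_statistic(z):
    """Sum of squared marginal statistics, the SKAT form shown above."""
    return sum(v * v for v in z)

def minp_statistic(z):
    """Largest absolute marginal statistic, the MinP form shown above."""
    return max(abs(v) for v in z)

# Dense signal: every variant carries a modest effect.
dense = [1.5] * 10
# Sparse signal: a single variant carries a strong effect.
sparse = [4.0] + [0.0] * 9

dense_stats = (skat_statistic(dense), minp_statistic(dense))
sparse_stats = (skat_statistic(sparse), minp_statistic(sparse))
```

The dense vector has the larger sum of squares, while the sparse vector has the larger maximum.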


The higher criticism

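Let

\[ S(t) = \sum_{j=1}^{p} 1\{|Z_j| \ge t\} \]

Assumes $\Sigma = I_p$.

Under $H_0$, $S(t) \sim \mathrm{Binomial}(p, 2\bar{\Phi}(t))$, where $\bar{\Phi}(t) = 1 - \Phi(t)$ is the survival function of the standard normal distribution.

The Higher Criticism test statistic is:

\[ HC = \sup_{t>0} \left\{ \frac{S(t) - 2p\bar{\Phi}(t)}{\sqrt{2p\bar{\Phi}(t)\{1 - 2\bar{\Phi}(t)\}}} \right\} \]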
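A minimal sketch of how the supremum could be evaluated from a vector of Z-scores. It relies on the observation that $S(t)$ only changes at the observed $|Z_j|$ and that, for a fixed count, the standardized ratio decreases as the tail probability grows, so it is enough to check $t = |Z_j|$ for each $j$. The function names are illustrative, not from the talk.

```python
import math
from statistics import NormalDist

def two_sided_pvalue(z):
    """2 * (1 - Phi(|z|)), written with the complementary error function."""
    return math.erfc(abs(z) / math.sqrt(2.0))

def higher_criticism(z):
    """Evaluate the HC ratio at each observed |Z_j| and return the maximum."""
    p = len(z)
    pvals = sorted(two_sided_pvalue(v) for v in z)
    best = -math.inf
    for j, pi in enumerate(pvals, start=1):
        # At t equal to the j-th largest |Z|, S(t) = j and 2*Phi_bar(t) = pi.
        # Tail probabilities of exactly 0 or 1 are skipped to avoid dividing by zero.
        if 0.0 < pi < 1.0:
            best = max(best, (j - p * pi) / math.sqrt(p * pi * (1.0 - pi)))
    return best

# A single statistic whose two-sided p-value is 0.5.
z_half = NormalDist().inv_cdf(0.75)
hc_single = higher_criticism([z_half])
```

Applying the same function to decorrelated statistics $Z^{*}$ would give the corresponding statistic for the independent case.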


The higher criticism

[Figure: histogram of the $Z_j$ (x-axis: t, y-axis: density), with argmax{HC(t)} marked.]


Adjusting for correlation

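Recalling that $Z^{*} = U^{-1}Z$, let

\[ S^{*}(t) = \sum_{j=1}^{p} 1\{|Z^{*}_j| \ge t\} \]

Note that under $H_0$, $S^{*}(t) \sim \mathrm{Binomial}(p, 2\bar{\Phi}(t))$ for a general correlation structure $\Sigma$, where $\bar{\Phi}(t) = 1 - \Phi(t)$.

The innovated Higher Criticism test statistic is:

\[ iHC = \sup_{t>0} \left\{ \frac{S^{*}(t) - 2p\bar{\Phi}(t)}{\sqrt{2p\bar{\Phi}(t)\{1 - 2\bar{\Phi}(t)\}}} \right\} \]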


Asymptotic p-values for iHC

The supremum of this standardized empirical process follows a Gumbel distribution asymptotically.

For $0 < \epsilon < \delta < 1$, if the supremum is taken over the interval $\Phi^{-1}(1 - \delta/2) < t < \Phi^{-1}(1 - \epsilon/2)$, then, as shown in Jaeschke (1979), we can write:

\[ \mathrm{pr}(iHC < x) \approx \exp[-\exp\{-x(2\log\rho)^{1/2} - 2^{-1}\log\pi + 2^{-1}\log_2\rho + 2\log\rho\}], \]

where $\rho = 2^{-1}\log[\delta(1-\epsilon)/\{\epsilon(1-\delta)\}]$.

Jaeschke (1979) shows that this converges in distribution at an abysmal rate of $O\{(\log p)^{-1/2}\}$.
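The asymptotic approximation can be evaluated directly. The sketch below assumes that $\log_2\rho$ denotes the iterated logarithm $\log\log\rho$, which is the usual convention for extreme-value results of this type but is not spelled out on the slide. The function name `jaeschke_cdf` is hypothetical.

```python
import math

def jaeschke_cdf(x, eps, delta):
    """Approximate pr(iHC < x) for the supremum over the (eps, delta) range.

    Assumption: log_2(rho) in the formula is read as log(log(rho)).
    """
    rho = 0.5 * math.log(delta * (1.0 - eps) / (eps * (1.0 - delta)))
    log_rho = math.log(rho)
    if log_rho <= 0.0:
        # The square root and the iterated logarithm both need log(rho) > 0.
        raise ValueError("formula requires rho > 1")
    exponent = (-x * math.sqrt(2.0 * log_rho)
                - 0.5 * math.log(math.pi)
                + 0.5 * math.log(log_rho)
                + 2.0 * log_rho)
    return math.exp(-math.exp(exponent))

# Range used in the convergence illustration: eps = 0.01, delta = 0.40.
approx = [jaeschke_cdf(x, 0.01, 0.40) for x in (1.0, 2.0, 3.0)]
```

The approximate p-value for an observed statistic `h` would then be `1 - jaeschke_cdf(h, eps, delta)`, subject to the slow convergence noted above.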


Slow convergence to asymptotic distribution

The supremum of the higher criticism test statistic is taken over the range $\Phi^{-1}(1 - \delta/2) < t < \Phi^{-1}(1 - \epsilon/2)$, where $\epsilon = 0.01$ and $\delta = 0.40$. Each empirical distribution is constructed using 500 samples of $p$ independent standard normal random variables.

[Figure: CDF(x) against x. Curves: theoretical ($p = \infty$), empirical ($p = 10^{2}$), empirical ($p = 10^{6}$).]


Analytic p-values for iHC

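Letting $h$ be the observed iHC statistic:

\[ \text{p-value} = \mathrm{pr}\left( \sup_{t>0} \left\{ \frac{S^{*}(t) - 2p\bar{\Phi}(t)}{\sqrt{2p\bar{\Phi}(t)\{1 - 2\bar{\Phi}(t)\}}} \right\} \ge h \right) \]

There exist $0 < t_1 < \cdots < t_p$ such that

\[ \text{p-value} = 1 - \mathrm{pr}\left( \bigcap_{k=1}^{p} \{S^{*}(t_k) \le p - k\} \right) \]

Then apply the chain rule of conditioning to get a product of binomial probabilities.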
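One way the chain-rule calculation could be organised is sketched below, assuming independent statistics so that the two-sided p-values are i.i.d. uniform under $H_0$. For each count $m = 1, \dots, p$ it solves $p\pi + h\sqrt{p\pi(1-\pi)} = m$ for the tail probability $\pi_m$ (this corresponds to $t_k$ with $k = p - m + 1$), then sums products of conditional binomial probabilities over all count paths that stay below the boundary. The function names and the dynamic-programming layout are assumptions of this sketch, not the authors' implementation.

```python
import math

def binom_pmf(k, n, q):
    """Binomial(n, q) probability mass at k."""
    return math.comb(n, k) * q ** k * (1.0 - q) ** (n - k)

def crossing_levels(h, p):
    """Tail probabilities pi_m with p*pi + h*sqrt(p*pi*(1-pi)) = m, m = 1..p."""
    a = p * p + h * h * p
    levels = []
    for m in range(1, p + 1):
        b = 2.0 * m * p + h * h * p
        # Discriminant of the quadratic in pi, simplified so it is never negative.
        disc = 4.0 * m * p * h * h * (p - m) + (h * h * p) ** 2
        levels.append((b - math.sqrt(disc)) / (2.0 * a))
    return levels

def hc_pvalue(h, p):
    """pr(HC >= h) for p independent statistics and h > 0."""
    levels = crossing_levels(h, p)
    # state[a] = pr(all constraints so far hold and exactly a p-values <= current level)
    state = [1.0] + [0.0] * p
    prev = 0.0
    for m, level in enumerate(levels, start=1):
        # Given a p-values below prev, each remaining one lands in (prev, level] w.p. q.
        q = (level - prev) / (1.0 - prev)
        new = [0.0] * (p + 1)
        for a in range(m):
            if state[a] > 0.0:
                for k in range(m - a):  # keep the running count at most m - 1
                    new[a + k] += state[a] * binom_pmf(k, p - a, q)
        state, prev = new, level
    return 1.0 - sum(state)

pval_single = hc_pvalue(1.0, 1)
```

For $p = 1$ the statistic reduces to $\sqrt{(1-u)/u}$ for a single uniform p-value $u$, so the p-value is $1/(1+h^2)$, which gives a simple check of the recursion.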


Type I error rates

Table: Empirical type I error rate percentages from $10^{6}$ simulations for the higher criticism using the proposed analytic method over a range of significance levels, α. Asymptotic type I error rate percentages are provided for comparison: exact (asymptotic).

α (%)          p = 2                          p = 10                         p = 50
5.0            4.98 (5.95)                    4.98 (3.24)                    5.01 (1.84)
1.0            1.00 (2.24)                    9.92 × 10^-1 (7.31 × 10^-1)    1.01 (1.59 × 10^-1)
1.0 × 10^-1    9.97 × 10^-2 (2.05 × 10^-1)    1.01 × 10^-1 (6.03 × 10^-2)    9.75 × 10^-2 (4.90 × 10^-3)
1.0 × 10^-2    1.01 × 10^-2 (5.32 × 10^-2)    1.12 × 10^-2 (7.30 × 10^-3)    9.80 × 10^-3 (4.00 × 10^-4)


Decorrelating dampens true signals

Cancer Genetic Markers of Susceptibility (CGEM) Breast Cancer GWAS: FGFR2 gene

[Figure: histograms (y-axis: frequency) of the marginal test statistics Z and of the decorrelated test statistics Z*.]

Decorrelating causes iHC to lose power.

Comparison

Method MinP SKAT iHC

Robust to signal sparsity X X
Robust to correlation/LD structure X X
Computationally efficient X X
Does not require decorrelating test statistics X X


Comparison

Method MinP SKAT iHC GHC

Robust to signal sparsity X X X
Robust to correlation/LD structure X X X
Computationally efficient X X X
Doesn’t require decorrelating test statistics X X X X

*We will also consider the omnibus test, OMNI, in our power simulations. It is based on the minimum p-value of the SKAT, MinP, and GHC.


Our contribution: the generalized higher criticism (GHC)

Recall

\[ S(t) = \sum_{j=1}^{p} 1\{|Z_j| \ge t\} \]

Now we allow $\Sigma$ to have arbitrary correlation structure.

$S(t)$ is no longer binomial. Instead we approximate it with a Beta-binomial distribution, matching on the first two moments.

The Generalized Higher Criticism test statistic is:

\[ GHC = \sup_{t>0} \frac{S(t) - 2p\bar{\Phi}(t)}{\sqrt{\widehat{\mathrm{Var}}\{S(t)\}}} \]

where $\bar{\Phi}(t) = 1 - \Phi(t)$.

GHC achieves the same detection boundary as HC.
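The moment-matching step can be illustrated as follows. This is a sketch of matching a Beta-binomial$(p, a, b)$ distribution to a given mean and variance of $S(t)$ using the standard Beta-binomial moment formulas. The parametrisation through an intraclass correlation and the function names are choices made here, and the sketch covers only a single threshold $t$, not the full p-value calculation.

```python
def beta_binomial_params(p, mean, var):
    """Shape parameters (a, b) of a Beta-binomial(p, a, b) with the given moments."""
    pi = mean / p
    binom_var = p * pi * (1.0 - pi)
    # var = binom_var * (1 + (p - 1) * icc), with icc = 1 / (a + b + 1).
    icc = (var / binom_var - 1.0) / (p - 1)
    if not 0.0 < icc < 1.0:
        raise ValueError("variance must exceed the binomial variance")
    scale = (1.0 - icc) / icc  # equals a + b
    return pi * scale, (1.0 - pi) * scale

def beta_binomial_moments(p, a, b):
    """Mean and variance of a Beta-binomial(p, a, b) distribution."""
    s = a + b
    mean = p * a / s
    var = p * a * b * (s + p) / (s * s * (s + 1.0))
    return mean, var

# Hypothetical moments for S(t) with p = 40 correlated statistics.
a, b = beta_binomial_params(40, 4.0, 6.0)
check_mean, check_var = beta_binomial_moments(40, a, b)
```

The match is only possible when the target variance exceeds the binomial variance $p\pi(1-\pi)$, which is the situation created by positive squared correlations among the $Z_j$.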


The variance estimator Var(S(t))

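Theorem 1

Let $r_n = \frac{2}{p(p-1)} \sum_{1 \le k < l \le p} (\Sigma_{kl})^{n}$ and let $H_i(t)$ be the Hermite polynomials: $H_0(t) = 1$, $H_1(t) = t$, $H_2(t) = t^2 - 1$ and so on. Then

\[ \mathrm{Cov}\{S(t_k), S(t_j)\} = p\left[2\bar{\Phi}(\max\{t_j, t_k\}) - 4\bar{\Phi}(t_j)\bar{\Phi}(t_k)\right] + 4p(p-1)\phi(t_j)\phi(t_k) \sum_{i=1}^{\infty} \frac{H_{2i-1}(t_j)H_{2i-1}(t_k)\, r_{2i}}{(2i)!} \]

where $\bar{\Phi} = 1 - \Phi$ and $\phi$ is the standard normal density.

Proof follows from Schwartzman and Lin (2009), who showed:

\[ P(Z_k > t_i, Z_l > t_j) = \bar{\Phi}(t_i)\bar{\Phi}(t_j) + \phi(t_i)\phi(t_j) \sum_{n=1}^{\infty} \frac{\Sigma_{kl}^{n}}{n!} H_{n-1}(t_i)H_{n-1}(t_j) \]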
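A minimal sketch of evaluating the covariance in Theorem 1, taking $r_n$ as the average of $\Sigma_{kl}^{n}$ over pairs $k < l$ and truncating the infinite series after a fixed number of terms. The truncation point `n_terms` and all function names are assumptions of this sketch. The Hermite polynomials are generated by the recurrence $H_{n+1}(t) = tH_n(t) - nH_{n-1}(t)$.

```python
import math

def normal_pdf(t):
    return math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)

def normal_sf(t):
    """Survival function 1 - Phi(t) of the standard normal."""
    return 0.5 * math.erfc(t / math.sqrt(2.0))

def hermite_values(t, max_order):
    """[H_0(t), ..., H_max_order(t)] for the Hermite polynomials with H_2 = t^2 - 1."""
    he = [1.0, t]
    for n in range(1, max_order):
        he.append(t * he[n] - n * he[n - 1])
    return he[:max_order + 1]

def mean_corr_power(sigma, n):
    """r_n: average of sigma[k][l]**n over the pairs k < l."""
    p = len(sigma)
    total = sum(sigma[k][l] ** n for k in range(p) for l in range(k + 1, p))
    return 2.0 * total / (p * (p - 1))

def cov_s(tj, tk, sigma, n_terms=30):
    """Cov{S(tk), S(tj)} with the series truncated after n_terms terms."""
    p = len(sigma)
    indep = p * (2.0 * normal_sf(max(tj, tk)) - 4.0 * normal_sf(tj) * normal_sf(tk))
    if p < 2:
        return indep
    hj = hermite_values(tj, 2 * n_terms - 1)
    hk = hermite_values(tk, 2 * n_terms - 1)
    series = sum(
        hj[2 * i - 1] * hk[2 * i - 1] * mean_corr_power(sigma, 2 * i) / math.factorial(2 * i)
        for i in range(1, n_terms + 1)
    )
    return indep + 4.0 * p * (p - 1) * normal_pdf(tj) * normal_pdf(tk) * series

# Variance of S(1) for two statistics with correlation 0.5 (illustrative values).
var_s = cov_s(1.0, 1.0, [[1.0, 0.5], [0.5, 1.0]])
```

Setting `tj == tk` gives $\mathrm{Var}\{S(t)\}$. With $\Sigma = I_p$ every $r_{2i}$ is zero and the expression reduces to the binomial variance.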


Analytic p-values for the GHC

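Letting $h$ be the observed GHC statistic:

\[ \text{p-value} = \mathrm{pr}\left( \sup_{t>0} \frac{S(t) - 2p\bar{\Phi}(t)}{\sqrt{\widehat{\mathrm{Var}}\{S(t)\}}} \ge h \right) \]

There exist $0 < t_1 < \cdots < t_p$ such that

\[ \text{p-value} = 1 - \mathrm{pr}\left( \bigcap_{k=1}^{p} \{S(t_k) \le p - k\} \right) \]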


Power simulations

[Figure: four panels of power curves. In each panel the x-axis is ρ2 and the y-axis is power (0 to 1), with curves for GHC, MinP, SKAT, iHC and OMNI. Panel titles: "ρ1=0.4; ρ3=0; p=40; 2 causal", "ρ1=0.4; ρ3=0; p=40; 4 causal", "ρ1=0.4; ρ3=0.4; p=40; 2 causal", "ρ1=0.4; ρ3=0.4; p=40; 4 causal".]

ρ1: correlation within causal variants.

ρ2: correlation between causal and noncausal variants.

ρ3: correlation within non-causal variants.
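To make the three correlation parameters concrete, the sketch below builds a correlation matrix with that block structure. It assumes exchangeable correlation within each block and places the causal variants first. Both choices, and the function name `block_correlation`, are assumptions for illustration; the slides do not describe how the simulation matrices were constructed.

```python
def block_correlation(p, n_causal, rho1, rho2, rho3):
    """p x p correlation matrix with the first n_causal variants treated as causal."""
    sigma = [[0.0] * p for _ in range(p)]
    for i in range(p):
        for j in range(p):
            if i == j:
                sigma[i][j] = 1.0
            elif i < n_causal and j < n_causal:
                sigma[i][j] = rho1  # within causal variants
            elif i >= n_causal and j >= n_causal:
                sigma[i][j] = rho3  # within non-causal variants
            else:
                sigma[i][j] = rho2  # between causal and non-causal variants
    return sigma

# One setting from the first panel: rho1 = 0.4, rho3 = 0, p = 40, 2 causal.
sigma = block_correlation(40, 2, 0.4, 0.1, 0.0)
```

Not every combination of the three parameters yields a positive definite matrix, so a check (for example, attempting a Cholesky factorization) would be needed before simulating from it.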


Data analysis

The National Cancer Institute’s Cancer Genetic Markers of Susceptibility (CGEM) breast cancer GWAS. Sample has 1145 cases and 1142 controls with European ancestry.

[Figure: QQ plot of observed against expected −log10(p-value); labelled genes: FGFR2, PTCD3, POLR1A.]