
SUPPLEMENTARY INFORMATION FOR:

Sparse PCA Corrects for Cell-Type Heterogeneity in Epigenome-Wide Association Studies

Elior Rahmani, Noah Zaitlen, Yael Baran, Celeste Eng, Donglei Hu, Joshua Galanter, Sam Oh, Esteban G. Burchard, Eleazar Eskin, James Zou, and Eran Halperin

Supplementary Figure 1.

The fraction of variance explained in each of the simulated cell-types (simulation parameters: n = 500, p = 0.15, τ = 0.07; see Online Methods). The ReFACTor components are in blue and the PCs of a standard PCA are in red. (a) Correlation with simulated monocytes proportion as a function of the number of components used in the linear predictor (squared linear correlation). (b) Correlation with simulated B cells proportion. (c) Correlation with simulated T cells proportion. (d) Correlation with simulated NK cells proportion. (e) Correlation with simulated granulocytes proportion.
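The quantity plotted in these panels, the squared linear correlation between a cell-type proportion and a linear predictor built from the first j components, can be sketched as follows. This is an illustrative reading of the caption rather than the authors' code: the function name `r2_by_num_components` and the use of ordinary least squares with an intercept are assumptions.

```python
import numpy as np

def r2_by_num_components(components, proportion):
    """Return R^2 of the linear predictor using the first 1..k components.

    components: (n, k) array, one column per component (hypothetical input).
    proportion: (n,) array with the cell-type proportion of each individual.
    """
    components = np.asarray(components, dtype=float)
    y = np.asarray(proportion, dtype=float)
    centered = y - y.mean()
    total = float(centered @ centered)
    if total == 0.0:
        raise ValueError("proportion is constant; R^2 is undefined")
    scores = []
    for j in range(1, components.shape[1] + 1):
        # Design matrix: intercept plus the first j components.
        design = np.column_stack([np.ones(len(y)), components[:, :j]])
        coef, *_ = np.linalg.lstsq(design, y, rcond=None)
        residual = y - design @ coef
        # With an intercept, 1 - RSS/TSS equals the squared correlation
        # between y and the fitted linear predictor.
        scores.append(1.0 - float(residual @ residual) / total)
    return scores
```

Plotting the returned list against j = 1, ..., k would give one curve of the kind shown in each panel.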

Nature Methods: doi:10.1038/nmeth.3809


Supplementary Figure 2.

The fraction of variance explained in each of the simulated cell-types as a function of the fraction of DMRs p in the data (simulation parameters: n = 500, τ = 0.07; see Online Methods). The correlation levels of the linear predictor from the first five ReFACTor components are in blue and the correlation levels of the linear predictor from the first five PCs of a standard PCA are in red. (a) Correlation with simulated monocytes proportion (squared linear correlation). (b) Correlation with simulated B cells proportion. (c) Correlation with simulated T cells proportion. (d) Correlation with simulated NK cells proportion. (e) Correlation with simulated granulocytes proportion.


Supplementary Figure 3.

The fraction of variance explained in each of the simulated cell-types as a function of the parameter τ (simulation parameters: n = 500, p = 0.15; see Online Methods). The correlation levels of the linear predictor from the first five ReFACTor components are in blue and the correlation levels of the linear predictor from the first five PCs of a standard PCA are in red. (a) Correlation with simulated monocytes proportion (squared linear correlation). (b) Correlation with simulated B cells proportion. (c) Correlation with simulated T cells proportion. (d) Correlation with simulated NK cells proportion. (e) Correlation with simulated granulocytes proportion.


Supplementary Figure 4.

A visualization of the p-values distribution in EWAS analysis of cell counts as phenotypes in the GALA II dataset. The p-values distributions of several methods including ReFACTor (in blue) are visualized by counting the number of false positives as a function of the significance threshold. (a) A logarithmic scale of the number of false positives found in each method as a function of the significance threshold when using lymphocytes cell count as a phenotype. (b) Using monocytes cell count as a phenotype. (c) Using neutrophil granulocytes cell count as a phenotype. (d) Using eosinophil granulocytes cell count as a phenotype. (e) Using basophil granulocytes cell count as a phenotype.
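The curves described in this caption count, for each significance threshold, how many sites have a p-value at or below it. A minimal sketch of that counting step is given below; the function name `false_positive_curve` and the choice of a non-strict comparison are assumptions, and every site is treated as a false positive because the phenotype is itself a cell count.

```python
from bisect import bisect_right

def false_positive_curve(p_values, thresholds):
    """For each threshold, count the p-values that are <= that threshold."""
    ordered = sorted(p_values)
    # bisect_right returns how many sorted values are <= the threshold.
    return [bisect_right(ordered, t) for t in thresholds]
```

For example, `false_positive_curve([0.001, 0.02, 0.5, 0.9], [0.01, 0.05, 1.0])` returns `[1, 2, 4]`. Zero counts would need special handling before being drawn on a logarithmic axis.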


Supplementary Figure 5.

A visualization of the p-values distribution in EWAS analysis of reference-based cell proportion estimates as phenotypes in the GALA II dataset. The p-values distributions of several methods including ReFACTor (in blue) are visualized by counting the number of false positives as a function of the significance threshold. (a) A logarithmic scale of the number of false positives found in each method as a function of the significance threshold when using estimated CD8T proportions as a phenotype. (b) Using estimated CD4T proportion as a phenotype. (c) Using estimated NK cells proportion as a phenotype. (d) Using estimated B-cells proportion as a phenotype. (e) Using estimated monocytes proportion as a phenotype. (f) Using estimated granulocytes proportion as a phenotype.


Supplementary Figure 6.

Evaluation of the false positive rate in EWAS analysis of simulated data. For each simulated dataset a phenotype was generated using a linear model of the simulated cell-type proportions (see Online Methods). The bars indicate log-scaled values of the total number of false positives reported across all simulations by several methods including ReFACTor. P-values in each EWAS analysis were adjusted for multiple hypotheses using Bonferroni. Error bars (in red) indicate 95% confidence intervals.


Supplementary Figure 7.

Evaluation of true discovery detection rate in EWAS analysis of simulated data. For each simulated dataset a phenotype was generated using a linear model of the simulated cell-type proportions, and a randomly chosen methylation site was simulated to be causal (see Online Methods). The bars indicate the discovery rate of the causal site across all simulations, using several different effect sizes of the causal site, as reported by several methods including ReFACTor (in blue). P-values in each EWAS analysis were adjusted for multiple hypotheses using Bonferroni. Error bars indicate 95% confidence intervals.


Supplementary Figure 8.

Evaluation of true discovery detection rate in EWAS analysis of simulated data in the case where the association is derived from a single cell-type. For each simulated dataset a phenotype was generated to be correlated with the proportions of a single cell-type, and a randomly chosen methylation site was simulated to be correlated with the phenotype via methylation differences in the same cell-type (see Online Methods). The bars indicate the discovery rate of the causal site across all simulations, using several different effect sizes of the causal site, as reported by several methods including ReFACTor (in blue). P-values in each EWAS analysis were adjusted for multiple hypotheses using Bonferroni. Error bars indicate 95% confidence intervals.


Supplementary Figure 9.

Evaluation of true discovery detection rate in EWAS analysis of simulated data in the case of multiple causal sites. For each simulated dataset a phenotype was generated using a linear model of the simulated cell-type proportions, and ten randomly chosen methylation sites were simulated to be causal (see Online Methods). Recall-precision plots, averaged across all simulations, describe the trade-off between the rate of true positives and false positives as reported by several methods including ReFACTor (in blue). Each plot describes a different effect size of the causal sites.


Supplementary Figure 10.

Alleviating the risk of over-correction in the RA analysis by using feature selection based on control samples only. (a) The original results of applying ReFACTor on the RA data. (b) Repeating the analysis, this time using only the controls to determine the most informative sites, corrects the inflation and reveals the same results found in the original analysis. In particular, the same 3 significant associations were found. Moreover, most of the informative sites selected by ReFACTor in both experiments were identical (80% of the sites).


Supplementary Figure 11.

ReFACTor captures cell composition in the GALA II dataset better than other methods. The fraction of variance explained by ReFACTor (in blue) in each of the cell-types for which cell counts were available in the GALA II dataset (78 samples), compared with a standard PCA, Factor Analysis (FA) and PEER, an FA-based method for the analysis of RNA data [1]. (a) Correlation with lymphocytes cell count as a function of the number of components used in the linear predictor (squared linear correlation). (b) Correlation with monocytes cell count. (c) Correlation with neutrophil granulocytes cell count. (d) Correlation with eosinophil granulocytes cell count. (e) Correlation with basophil granulocytes cell count.


Supplementary Figure 12.

The sites selected by ReFACTor are more informative compared with sites selected by a variance-based feature selection approach. The plots show the fraction of variance explained in each one of the cell-types for which cell counts were available in the GALA II dataset (78 samples). Neither applying PCA on the most variable sites (PCA-VAR) nor applying PCA on the sites with highest coefficient of variation (PCA-CV) could match ReFACTor. (a) Correlation with lymphocytes cell count as a function of the number of components used in the linear predictor (squared linear correlation). (b) Correlation with monocytes cell count. (c) Correlation with neutrophil granulocytes cell count. (d) Correlation with eosinophil granulocytes cell count. (e) Correlation with basophil granulocytes cell count.
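The two baseline selection rules named in this caption can be illustrated with a short sketch that ranks sites by variance (PCA-VAR) or by coefficient of variation (PCA-CV). The function name `top_sites`, the `criterion` argument and the handling of zero-mean sites are assumptions made for illustration; the caption does not specify how many sites were kept or how ties were resolved.

```python
import numpy as np

def top_sites(O, t, criterion="var"):
    """Return the indices of the t highest-ranked sites (rows of O).

    O: (m, n) array of uncentered methylation levels, sites by individuals.
    criterion: "var" ranks by variance, "cv" by coefficient of variation.
    """
    O = np.asarray(O, dtype=float)
    variance = O.var(axis=1)
    if criterion == "var":
        score = variance
    elif criterion == "cv":
        mean = O.mean(axis=1)
        # Sites with zero mean have an undefined CV and are ranked last.
        score = np.full(O.shape[0], -np.inf)
        nonzero = mean != 0
        score[nonzero] = np.sqrt(variance[nonzero]) / np.abs(mean[nonzero])
    else:
        raise ValueError("criterion must be 'var' or 'cv'")
    # Sort descending by score and keep the first t indices.
    return np.argsort(score)[::-1][:t]
```

A standard PCA would then be applied to `O[top_sites(O, t, criterion), :]` after centering each selected site. The two criteria can disagree: a low-methylation site with small absolute variation may rank first by CV yet low by variance.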


Supplementary Figure 13.

ReFACTor is robust to the choice of the parameter t in simulated data. The fraction of variance explained in each of the simulated cell-types (simulation parameters: n = 500, p = 0.15, τ = 0.07; see Online Methods) in several ReFACTor executions (in blue), compared with a standard PCA (in red). Each blue line corresponds to a single execution of ReFACTor with k = 5 and a different value of t in the range 400 to 10000. (a) Correlation with simulated monocytes proportion as a function of the number of components used in the linear predictor (squared linear correlation). (b) Correlation with simulated B cells proportion. (c) Correlation with simulated T cells proportion. (d) Correlation with simulated NK cells proportion. (e) Correlation with simulated granulocytes proportion.


Supplementary Figure 14.

ReFACTor is robust to the choice of the parameter k in simulated data. The fraction of variance explained in each of the simulated cell-types (simulation parameters: n = 500, p = 0.15, τ = 0.07; see Online Methods) in several ReFACTor executions (in blue), compared with a standard PCA (in red). Each blue line corresponds to a single execution of ReFACTor with a different value of k in the range 4 to 10. (a) Correlation with simulated monocytes proportion as a function of the number of components used in the linear predictor (squared linear correlation). (b) Correlation with simulated B cells proportion. (c) Correlation with simulated T cells proportion. (d) Correlation with simulated NK cells proportion. (e) Correlation with simulated granulocytes proportion.


Supplementary Figure 15.

ReFACTor is robust to the choice of the parameter t in the GALA II dataset. The fraction of variance explained in each of the cell-types for which cell counts were available (78 samples) in several ReFACTor executions (in blue), compared with a standard PCA (in red). Each blue line corresponds to a single execution of ReFACTor with k = 6 and a different value of t in the range 400 to 10000. (a) Correlation with lymphocytes cell count as a function of the number of components used in the linear predictor (squared linear correlation). (b) Correlation with monocytes cell count. (c) Correlation with neutrophil granulocytes cell count. (d) Correlation with eosinophil granulocytes cell count. (e) Correlation with basophil granulocytes cell count.


Supplementary Figure 16.

ReFACTor is robust to the choice of the parameter k in the GALA II dataset. The fraction of variance explained in each of the cell-types for which cell counts were available (78 samples) in several ReFACTor executions (in blue), compared with a standard PCA (in red). Each blue line corresponds to a single execution of ReFACTor with a different value of k in the range 3 to 10. (a) Correlation with lymphocytes cell count as a function of the number of components used in the linear predictor (squared linear correlation). (b) Correlation with monocytes cell count. (c) Correlation with neutrophil granulocytes cell count. (d) Correlation with eosinophil granulocytes cell count. (e) Correlation with basophil granulocytes cell count.


Supplementary Table 1.

Method                    Lymphocytes  Monocytes  Neutrophils  Eosinophils  Basophils
Uncorrected                      3412          1         1404           38          0
Reference-based [2]                 0          0            0           29          0
PCA (6 components)                 16          0            0          165          0
FaST-LMM-EWASher [3]                1          1            0            0          0
RefFreeEWAS [4]                    39          0          216          215      10568
ReFACTor (6 components)             0          0            0            0          0

The results of performing EWAS on the measured cell counts as phenotypes in the GALA II dataset. The table shows the number of false positives reported by several methods including ReFACTor. A proper correction for cell-type composition is expected to result in no false positives. The significance threshold was set to be 0.05 after adjusting for multiple hypotheses using Bonferroni.
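Each entry of the table is a count of sites that remain significant at 0.05 after Bonferroni adjustment. The counting rule can be sketched as below, assuming one p-value per tested site; the function name `count_bonferroni_hits` is hypothetical, and comparing raw p-values against 0.05 divided by the number of tests is the standard equivalent of adjusting the p-values themselves.

```python
def count_bonferroni_hits(p_values, alpha=0.05):
    """Count p-values that stay significant after Bonferroni correction."""
    m = len(p_values)
    if m == 0:
        return 0
    # Bonferroni: compare each raw p-value with alpha divided by the number of tests.
    cutoff = alpha / m
    return sum(1 for p in p_values if p <= cutoff)
```

With four tests the cutoff is 0.0125, so `count_bonferroni_hits([1e-9, 0.004, 0.02, 0.5])` returns 2.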


Supplementary Table 2.

Method                    CD8T   CD4T    NK  B-cells  Monocytes  Granulocytes
Uncorrected               5897  10683   892     1691         17         16950
Cell counts               1798    474   423        0          0          7713
PCA (6 components)           0      0     5       25         63             0
FaST-LMM-EWASher [3]         2      0     0        0          2             0
RefFreeEWAS [4]              0      8     0        0          0          1106
ReFACTor (6 components)      0      0     0        0          0             0

The results of performing EWAS on the reference-based estimates of the cell proportions as phenotypes in the GALA II dataset. The table shows the number of false positives reported by several methods including ReFACTor. A proper correction for cell-type composition is expected to result in no false positives. The significance threshold was set to be 0.05 after adjusting for multiple hypotheses using Bonferroni.


Supplementary Table 3.

ID          Chr  Position  Gene      β      p-value
cg05428452    6  32712979  HLA-DQA2  -3.66  3.72e-8
cg07839457   16  57023022  NLRC5     -5.14  1.67e-7
cg16411857   16  57023191  NLRC5     -8.74  4.69e-7

The significant associations found in the RA analysis. The data were adjusted for cell-type composition using the first several ReFACTor components. β indicates the effect size.


Supplementary Table 4.

ReFACTor

Statistical basis: Models the methylation levels of the i-th individual:

$$O_i \sim \mathrm{MVN}(B X_i^t + M R_i, \Sigma)$$

where $M$ represents cell-type specific contributions for each site (most of $M$'s rows are assumed to be constant or near-constant) and $\Sigma$ is a diagonal matrix with $\sigma_j^2$ in the j-th position of the diagonal. In practice, the solution is approximated by applying PCA on a subset of selected sites.

Feature selection: ReFACTor applies PCA on a small group of methylation sites that are well approximated by the low rank approximation of $O$. Sites that are highly variable across different cell-types are expected to contribute substantially to the low rank approximation, and they will therefore be selected and used to capture the cell composition.

Principal Component Analysis (PCA)

Statistical basis: Models the methylation levels:

$$O_i \sim \mathrm{MVN}(M R_i, \sigma^2 I_n)$$

where $M$ represents cell-type specific contributions for each site. The model is fitted by taking a maximum likelihood approach.

Feature selection: No feature selection is applied in the general case.

FaST-LMM-EWASher [3]

Statistical basis: Models a phenotype of interest:

$$y \sim \mathrm{MVN}(X\beta, \sigma_g^2 K + \sigma_e^2 I_n), \qquad K = \frac{1}{m} O^t O$$

where $\sigma_g^2 K$ represents inter-individual relationships using the kinship matrix $K$, and $\sigma_e^2 I_n$ represents an environmental component. The model is fitted by taking a maximum likelihood approach.

Feature selection: In practice, the kinship matrix $K$ is constructed from the sites most correlated with the phenotype.

RefFreeEWAS [4]

Statistical basis: Models the methylation levels:

$$O_i = B \tilde{X}_i^t + M R_i + \varepsilon_i$$

$$R_i = \gamma \tilde{X}_i^t + \tau_i$$

$M$ represents cell-type specific contributions for each site, $\tilde{X}_i$ is a vector of covariates and the phenotype of interest of the i-th individual and $\varepsilon_i$ is an error term. In addition, the cell proportions are assumed to be linearly dependent in $\tilde{X}_i$ ($\tau_i$ is an error term). The method applies singular value decomposition (SVD) on a concatenated matrix of the coefficients and residuals of a standard linear model fitted for each site in $O$ using $\tilde{X}$.

Feature selection: No feature selection is applied.

Surrogate Variable Analysis (SVA) [5]

Statistical basis: Assumes the same model described above for RefFreeEWAS. The method computes an orthogonal basis for the confounding effects in the data by iteratively applying SVD on the residuals of the linear model fitted using $\tilde{X}$ for a subset of the sites in $O$.

Feature selection: Only a group of features that are most correlated with the singular vectors of the data are used to construct the final singular vectors.

A comparison between the algorithmic details underlying ReFACTor and those of other currently available reference-free methods for EWAS analysis, including two widely used unsupervised methods for uncovering hidden structure in data (PCA and SVA). Notation: $O$ is an m × n observed methylation matrix (centered) and $O_i$ is the m × 1 methylation vector of the i-th individual, $X$ is an n × p covariates matrix and $X_i$ is a 1 × p covariates row vector of the i-th individual, $R_i$ is a k × 1 cell proportions vector of the i-th individual and $y$ is an n × 1 phenotype vector.
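The feature-selection idea described for ReFACTor in this table can be illustrated with a short sketch. It is not the authors' implementation: the function name `refactor_like_components`, the relative-error ranking of sites and the treatment of constant sites are assumptions made for illustration. Only the overall recipe (form a rank-k approximation of O, keep the t sites it approximates best, apply PCA to those sites) is taken from the table.

```python
import numpy as np

def refactor_like_components(O, k, t, num_components=None):
    """Sketch of low-rank-based site selection followed by PCA.

    O: (m, n) methylation matrix, sites by individuals.
    k: rank of the approximation; t: number of sites to keep.
    Returns (selected site indices, (n, r) component scores).
    """
    O = np.asarray(O, dtype=float)
    # Center each site across individuals, as the notation assumes.
    O = O - O.mean(axis=1, keepdims=True)
    # Rank-k approximation of O from the truncated SVD.
    U, s, Vt = np.linalg.svd(O, full_matrices=False)
    O_k = (U[:, :k] * s[:k]) @ Vt[:k, :]
    # Relative approximation error per site (an assumed ranking rule);
    # constant sites get infinite error so they are never selected.
    norms = np.linalg.norm(O, axis=1)
    errors = np.full(O.shape[0], np.inf)
    varying = norms > 0
    errors[varying] = np.linalg.norm((O - O_k)[varying], axis=1) / norms[varying]
    selected = np.argsort(errors)[:t]
    # PCA on the selected sites: per-individual scores from the SVD.
    _, s_sel, Vt_sel = np.linalg.svd(O[selected, :], full_matrices=False)
    r = k if num_components is None else num_components
    scores = (Vt_sel[:r, :] * s_sel[:r, None]).T
    return selected, scores
```

The returned scores would play the role of the components that are added as covariates in the association test. Ranking by relative error keeps the selection from simply favoring low-variance sites, but other normalizations are equally plausible.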


SUPPLEMENTARY NOTE

Tissues consist of a variety of different cell-types, each potentially having a unique methylation pattern across the epigenome. For example, leukocytes, widely used in EWAS, are commonly divided into five main classes: lymphocytes, monocytes, neutrophils, eosinophils and basophils. Each one of these classes is composed of sub-types. For instance, lymphocytes include three main cell-types: natural killer (NK) cells, B-cells and T-cells. The latter can be further divided into several sub-types, such as memory T-cells, which in turn can be divided into at least three different sub-populations: stem memory, central memory and effector memory T-cells [6].

An ideal adjustment for tissue heterogeneity would be possible by measuring, for each individual, the cell proportions of all the cell-types composing the tissue from which methylation levels were assayed. However, cell counting at the required granularity is currently very expensive or not feasible for complex tissues. For example, cell counting for leukocytes is currently commonly used for measuring the proportions of the five main classes of cell-types only. This limitation means that correcting EWAS using cell counts performs well only when the correlation of the cell-type composition with the phenotype is driven by correlation with one or more of the measured cell-types. The same reasoning holds when using reference-based methods, which essentially estimate the cell proportions of a subset of the cell-types that compose the tissue. Therefore, adjusting only for a subset of the cell-types can lead to a large number of false discoveries when the phenotype is correlated with the proportions of one or more of the cell-types that were not measured or properly estimated.

We demonstrate this phenomenon using the GALA II dataset, for which cell counts of five cell-types were available for a subset of the samples (n = 78). For each counted cell-type we performed an EWAS analysis using the cell counts as the phenotype. In this scenario, a proper adjustment for cell-type composition is expected to result in no significant associations. In order to measure the number of false positives generated by inadequate adjustment for cell-type composition, we then counted the significant associations discovered by ReFACTor as well as by several other methods. The results in Supplementary Table 1 show that the estimates given by the available reference-based [2] approach could eliminate all of the false positives except for the case of eosinophils as a phenotype, where most of the false findings remained after the correction. This result is driven by the fact that eosinophils are currently not estimated by the reference-based method since no reference is available for this cell-type. This result demonstrates the main caveat in the reference-based approach, and is expected to affect many EWAS using estimates of just a subset of the cell-types composing the studied tissues.

We extend this analysis to consider additional cell-types as phenotypes by using cell proportions of the six cell-types given by the reference-based method. Again, any identified association would be indicative of a false positive due to confounding by cell-type composition. The results in Supplementary Table 2 show that adjusting for the cell counts does not adequately correct for confounders when the phenotype is a cell-type that was not measured (CD8T, CD4T, NK and B-cells). In contrast, there is adequate control for false positives when the estimates of monocytes are used as the phenotype, presumably since monocytes were measured. We note that the cell counts could not correct well in the granulocytes experiment, possibly due to a difference in composition between the subset of granulocytes cell-types captured in the reference and granulocytes cell-types that were captured in the cell counting procedure. Notably, applying ReFACTor yielded no false findings across the two experiments as opposed to using cell counts or reference-based estimates.

We further inspected the results of these two experiments. In case of no true associations, perfectly adjusting a phenotype for confounders is expected to yield uniformly distributed p-values under the null model. In our case the phenotypes are exactly the confounders, thus the p-values are all expected to be 1 under a perfect correction. Therefore, a second measure of the quality of correction can be the p-values distribution. Specifically, the best method for correction is expected to have a distribution that is closest to a constant distribution of ones. Supplementary Fig. 4 and Supplementary Fig. 5 help to visualize the distribution of p-values in the two experiments by counting the number of false positives as a function of the significance threshold. Overall, we observe that ReFACTor yields the best correction with respect to the expected distribution of the p-values.


References

[1] Oliver Stegle, Leopold Parts, Richard Durbin, and John Winn. A Bayesian framework to account for complex non-genetic factors in gene expression levels greatly increases power in eQTL studies. PLoS Comput Biol, 6(5):e1000770, 2010.

[2] Eugene Andres Houseman, William P Accomando, Devin C Koestler, Brock C Christensen, Carmen J Marsit, Heather H Nelson, John K Wiencke, and Karl T Kelsey. DNA methylation arrays as surrogate measures of cell mixture distribution. BMC Bioinformatics, 2012.

[3] James Zou, Christoph Lippert, David Heckerman, Martin Aryee, and Jennifer Listgarten. Epigenome-wide association studies without the need for cell-type composition. Nature Methods, 11(3):309–11, March 2014.

[4] Eugene Andres Houseman, John Molitor, and Carmen J Marsit. Reference-free cell mixture adjustments in analysis of DNA methylation data. Bioinformatics, 30(10):1431–1439, 2014.

[5] Jeffrey T Leek and John D Storey. Capturing heterogeneity in gene expression studies by surrogate variable analysis. PLoS Genet, 3(9):1724–1735, 2007.

[6] F Sallusto, A Langenkamp, J Geginat, and A Lanzavecchia. Functional subsets of memory T cells identified by CCR7 expression. In Lymphoid Organogenesis, pages 167–171. Springer, 2000.
