34
Proof-of-Concept of Environment-Wide Association Studies (EWAS) on Common Disease Chirag Patel Division of Systems Medicine, Department of Pediatrics Stanford University School of Medicine National Academy of Sciences Exposome Workshop 12/8/2011 [email protected] twitter: @chiragjp 1 Wednesday, December 14, 11

Proof-of-Concept of Environment-Wide Association …nas-sites.org/emergingscience/files/2011/12/Chirag-Patel.pdfProof-of-Concept of Environment-Wide Association Studies (EWAS) on Common

  • Upload
    others

  • View
    0

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Proof-of-Concept of Environment-Wide Association …nas-sites.org/emergingscience/files/2011/12/Chirag-Patel.pdfProof-of-Concept of Environment-Wide Association Studies (EWAS) on Common

Proof-of-Concept of Environment-Wide Association Studies (EWAS)

on Common Disease

Chirag PatelDivision of Systems Medicine, Department of Pediatrics

Stanford University School of Medicine

National Academy of Sciences Exposome Workshop12/8/2011

[email protected]: @chiragjp

1Wednesday, December 14, 11

Page 2: Proof-of-Concept of Environment-Wide Association …nas-sites.org/emergingscience/files/2011/12/Chirag-Patel.pdfProof-of-Concept of Environment-Wide Association Studies (EWAS) on Common

...but, lack methods and data to comprehensively and systematically connect the environment with common disease.

... and we’re exposed to many environmental factors...

Common disease is a function of genes and environment...

dad mom

me

2Wednesday, December 14, 11

Page 3: Proof-of-Concept of Environment-Wide Association …nas-sites.org/emergingscience/files/2011/12/Chirag-Patel.pdfProof-of-Concept of Environment-Wide Association Studies (EWAS) on Common

... and the case is different with genetics (e.g., genomics)!

dad mom

me

3Wednesday, December 14, 11

Page 4: Proof-of-Concept of Environment-Wide Association …nas-sites.org/emergingscience/files/2011/12/Chirag-Patel.pdfProof-of-Concept of Environment-Wide Association Studies (EWAS) on Common

Common disease is function of genes and environment...

... yet the target of investigation is biased toward genetics!

1970 1980 1990 2000

0200

400

600

800

1000

1200

year published

tota

l pub

licat

ions

geneticsenvironment

WHO prioritized diseases:investigating either

“genetics”or “environment”in MEDLINE

type 2 diabetescardiovascular disease

kidney diseaselung, colon, and prostate cancer

asthmaCOPD

preterm birthsAlzheimer Disease

Patel et al. (2011). Data-driven creation of hypothesis in non-genomic studies. In Review.

4Wednesday, December 14, 11

Page 5: Proof-of-Concept of Environment-Wide Association …nas-sites.org/emergingscience/files/2011/12/Chirag-Patel.pdfProof-of-Concept of Environment-Wide Association Studies (EWAS) on Common

Human Genome Project to GWAS

Sequencing of the genome

2001

HapMap project:http://hapmap.ncbi.nlm.nih.gov/

Characterize common variation

2001-2003 (ongoing)

“Variant SNP chip”~$400 for ~100,000 variants

Measurement tools

~2003 (ongoing)

ARTICLES

Genome-wide association study of 14,000cases of seven common diseases and3,000 shared controlsThe Wellcome Trust Case Control Consortium*

There is increasing evidence that genome-wide association (GWA) studies represent a powerful approach to theidentification of genes involved in common human diseases.We describe a joint GWAstudy (using the Affymetrix GeneChip500KMapping Array Set) undertaken in the British population, which has examined,2,000 individuals for each of 7 majordiseases and a shared set of ,3,000 controls. Case-control comparisons identified 24 independent association signals atP, 53 1027: 1 in bipolar disorder, 1 in coronary artery disease, 9 in Crohn’s disease, 3 in rheumatoid arthritis, 7 in type 1diabetes and 3 in type 2 diabetes. On the basis of prior findings and replication studies thus-far completed, almost all of thesesignals reflect genuine susceptibility effects. We observed association at many previously identified loci, and foundcompelling evidence that some loci confer risk for more than one of the diseases studied. Across all diseases, we identified alarge number of further signals (including 58 loci with single-point P values between 1025 and 53 1027) likely to yieldadditional susceptibility loci. The importance of appropriately large samples was confirmed by the modest effect sizesobserved at most loci identified. This study thus represents a thorough validation of the GWA approach. It has alsodemonstrated that careful use of a shared control group represents a safe and effective approach to GWA analyses ofmultiple disease phenotypes; has generated a genome-wide genotype database for future studies of common diseases in theBritish population; and shown that, provided individuals with non-European ancestry are excluded, the extent of populationstratification in the British population is generally modest. Our findings offer new avenues for exploring the pathophysiologyof these important disorders. We anticipate that our data, results and software, which will be widely available to otherinvestigators, will provide a powerful resource for human genetics research.

Despite extensive research efforts for more than a decade, the geneticbasis of common humandiseases remains largely unknown. Althoughthere have been some notable successes1, linkage and candidate geneassociation studies have often failed to deliver definitive results. Yetthe identification of the variants, genes and pathways involved inparticular diseases offers a potential route to new therapies, improveddiagnosis and better disease prevention. For some time it has beenhoped that the advent of genome-wide association (GWA) studieswould provide a successful new tool for unlocking the genetic basisof many of these common causes of humanmorbidity andmortality1.

Three recent advances mean that GWA studies that are powered todetect plausible effect sizes are now possible2. First, the InternationalHapMap resource3, which documents patterns of genome-wide vari-ation and linkage disequilibrium in four population samples, greatlyfacilitates both the design and analysis of association studies. Second,the availability of dense genotyping chips, containing sets of hundreds ofthousands of single nucleotide polymorphisms (SNPs) that providegood coverage of much of the human genome, means that for the firsttimeGWAstudies for thousandsof cases andcontrols are technically andfinancially feasible. Third, appropriately large and well-characterizedclinical samples have been assembled for many common diseases.

The Wellcome Trust Case Control Consortium (WTCCC) wasformed with a view to exploring the utility, design and analyses ofGWA studies. It brought together over 50 research groups from theUK that are active in researching the genetics of common humandiseases, with expertise ranging from clinical, through genotyping, to

informatics and statistical analysis. Here we describe the main experi-ment of the consortium: GWA studies of 2,000 cases and 3,000 sharedcontrols for 7 complex human diseases of major public health import-ance—bipolar disorder (BD), coronary artery disease (CAD), Crohn’sdisease (CD), hypertension (HT), rheumatoid arthritis (RA), type 1diabetes (T1D), and type 2 diabetes (T2D). Two further experimentsundertaken by the consortium will be reported elsewhere: a GWAstudy for tuberculosis in 1,500 cases and 1,500 controls, sampled fromThe Gambia; and an association study of 1,500 common controls with1,000 cases for each of breast cancer, multiple sclerosis, ankylosingspondylitis and autoimmune thyroid disease, all typed at around15,000 mainly non-synonymous SNPs. By simultaneously studyingseven diseases with differing aetiologies, we hoped to develop insights,not only into the specific genetic contributions to each of the diseases,but also into differences in allelic architecture across the diseases. Afurther major aim was to address important methodological issues ofrelevance to all GWA studies, such as quality control, design and ana-lysis. In addition to our main association results, we address several ofthese issues below, including the choice of controls for genetic studies,the extent of population structure within Great Britain, sample sizesnecessary to detect genetic effects of varying sizes, and improvements ingenotype-calling algorithms and analytical methods.

Samples and experimental analyses

Individuals included in the study were living within England,Scotland and Wales (‘Great Britain’) and the vast majority had

*Lists of participants and affiliations appear at the end of the paper.

Vol 447 |7 June 2007 |doi:10.1038/nature05911

661Nature ©2007 Publishing Group

WTCCC, Nature, 2008.

Comprehensive, high-throughput analyses

5Wednesday, December 14, 11

Page 6: Proof-of-Concept of Environment-Wide Association …nas-sites.org/emergingscience/files/2011/12/Chirag-Patel.pdfProof-of-Concept of Environment-Wide Association Studies (EWAS) on Common

HypothesisApplying genome-based methods to the environment

We claim that comprehensive connection of environmental

factors to disease is practicable using high-throughput analysis

methods, now common in genome-based investigations

e.g., Environment-wide Association Study (EWAS)

6Wednesday, December 14, 11

Page 7: Proof-of-Concept of Environment-Wide Association …nas-sites.org/emergingscience/files/2011/12/Chirag-Patel.pdfProof-of-Concept of Environment-Wide Association Studies (EWAS) on Common

Proof of Concept of EWAS

1. Background and Methods

2. Examples: Type 2 Diabetes, Serum Lipid Levels

3. Checking Validity and An “LD” Map of the Environment?

4. Conclusions

5. Informatics for the Environment

7Wednesday, December 14, 11

Page 8: Proof-of-Concept of Environment-Wide Association …nas-sites.org/emergingscience/files/2011/12/Chirag-Patel.pdfProof-of-Concept of Environment-Wide Association Studies (EWAS) on Common

Connecting Disease with Exposures:Drawbacks in Environmental Epidemiology

“candidate” E factors

multiple hypotheses often ignored

selective reporting

1. Ioannidis et al. Science Translational Medicine, 2009. 1 (7) p. 62. Boffetta, P., et al.,. J Natl Cancer Inst, 2008. 100(14): p. 988-95.3. Young, S.S., Int J Epidemiol, 2010. 39(3): p. 9344. Taubes, G.. Science, 1995. 269(5221): p. 164-9.

The lack of comprehension has led to a fragmented literature of environmental associations1,2,3,4

Genome-wide epidemiology has overcome some of these drawbacks1

E+ E-

diseased

non-diseased

Aim 1: EWAS Methods

8Wednesday, December 14, 11

Page 9: Proof-of-Concept of Environment-Wide Association …nas-sites.org/emergingscience/files/2011/12/Chirag-Patel.pdfProof-of-Concept of Environment-Wide Association Studies (EWAS) on Common

Genome-Wide Association Studies (GWAS)

Methods), and excluded 153 individuals on this basis. We nextlooked for evidence of population heterogeneity by studying allelefrequency differences between the 12 broad geographical regions(defined in Supplementary Fig. 4). The results for these 11-d.f. testsand associated quantile-quantile plots are shown in Fig. 2. Wide-spread small differences in allele frequencies are evident as anincreased slope of the line (Fig. 2b); in addition, a few loci showmuchlarger differences (Fig. 2a and Supplementary Fig. 6).

Thirteen genomic regions showing strong geographical variationare listed in Table 1, and Supplementary Fig. 7 shows theway in whichtheir allele frequencies vary geographically. The predominant patternis variation along a NW/SE axis. The most likely cause for thesemarked geographical differences is natural selection, most plausiblyin populations ancestral to those now in the UK. Variation due toselection has previously been implicated at LCT (lactase) and majorhistocompatibility complex (MHC)7–9, andwithin-UKdifferentiationat 4p14 has been found independently10, but others seem to be newfindings. All but three of the regions contain known genes. Aside from

evolutionary interest, genes showing evidence of natural selection areparticularly interesting for the biology of traits such as infectious dis-eases; possible targets for selection include NADSYN1 (NAD synthe-tase 1) at 11q13, which could have a role in prevention of pellagra, aswell as TLR1 (toll-like receptor 1) at 4p14, for which a role in thebiology of tuberculosis and leprosy has been suggested10.

There may be important population structure that is not wellcaptured by current geographical region of residence. Presentimplementations of strongly model-based approaches such asSTRUCTURE11,12 are impracticable for data sets of this size, and wereverted to the classical method of principal components13,14, using asubset of 197,175 SNPs chosen to reduce inter-locus linkage disequi-librium. Nevertheless, four of the first six principal componentsclearly picked up effects attributable to local linkage disequilibriumrather than genome-wide structure. The remaining two componentsshow the same predominant geographical trend from NW to SE but,perhaps unsurprisingly, London is set somewhat apart (Supplemen-tary Fig. 8).

The overall effect of population structure on our associationresults seems to be small, once recent migrants from outsideEurope are excluded. Estimates of over-dispersion of the associationtrend test statistics (usually denoted l; ref. 15) ranged from 1.03 and1.05 for RA and T1D, respectively, to 1.08–1.11 for the remainingdiseases. Some of this over-dispersion could be due to factors otherthan structure, and this possibility is supported by the fact that inclu-sion of the two ancestry informative principal components as cov-ariates in the association tests reduced the over-dispersion estimatesonly slightly (Supplementary Table 6), as did stratification by geo-graphical region. This impression is confirmed on noting thatP values with and without correction for structure are similar(Supplementary Fig. 9). We conclude that, for most of the genome,population structure has at most a small confounding effect in ourstudy, and as a consequence the analyses reported below do notcorrect for structure. In principle, apparent associations in the fewgenomic regions identified in Table 1 as showing strong geographicaldifferentiation should be interpreted with caution, but none arose inour analyses.

Disease association results

We assessed evidence for association in several ways (see Methods fordetails), drawing on both classical and bayesian statistical approaches.For polymorphic SNPs on the Affymetrix chip, we performed trendtests (1 degree of freedom16) and general genotype tests (2 degrees offreedom16, referred to as genotypic) between each case collection andthe pooled controls, and calculated analogous Bayes factors. Thereare examples from animal models where genetic effects act differentlyin males and females17, and to assess this in our data we applied a

!log

10(P

)

0

5

10

15

Chromosome

22 X212019181716151413121110987654321

3020

20

100

0

40

80

60

40

100

Obs

erve

d te

st s

tatis

tic

Expected chi-squared value

a

b

Figure 2 | Genome-wide picture of geographic variation. a, P values for the11-d.f. test for difference in SNP allele frequencies between geographicalregions, within the 9 collections. SNPs have been excluded using the projectquality control filters described inMethods. Green dots indicate SNPs with aP value,13 1025. b, Quantile-quantile plots of these test statistics. SNPs atwhich the test statistic exceeds 100 are represented by triangles at the top ofthe plot, and the shaded region is the 95% concentration band (seeMethods). Also shown in blue is the quantile-quantile plot resulting fromremoval of all SNPs in the 13 most differentiated regions (Table 1).

Table 1 | Highly differentiated SNPs

Chromosome Genes Region (Mb) SNP Position P value

2q21 LCT 135.16–136.82 rs1042712 136,379,576 5.54 3 10213

4p14 TLR1, TLR6, TLR10 38.51–38.74 rs7696175 386,43,552 1.51 3 10212

4q28 137.97–138.01 rs1460133 137,999,953 4.43 3 10208

6p25 IRF4 0.32–0.42 rs9378805 362,727 5.39 3 10213

6p21 HLA 31.10–31.55 rs3873375 31,359,339 1.07 3 10211

9p24 DMRT1 0.86–0.88 rs11790408 866,418 4.96 3 10207

11p15 NAV2 19.55–19.70 rs12295525 19,661,808 7.44 3 10208

11q13 NADSYN1, DHCR7 70.78–70.93 rs12797951 70,820,914 3.01 3 10208

12p13 DYRK4,AKAP3,NDUFA9,RAD51AP1,GALNT8

4.37–4.82 rs10774241 45,537,27 2.73 3 10208

14q12 HECTD1,AP4S1,STRN3 30.41–31.03 rs17449560 30,598,823 1.46 3 10207

19q13 GIPR,SNRPD2,QPCTL,SIX5,DMPK,DMWD,RSHL1,SYMPK,FOXA3

50.84–51.09 rs3760843 50,980,546 4.19 3 10207

20q12 38.30–38.77 rs2143877 38,526,309 1.12 3 10209

Xp22 2.06–2.08 rs6644913 2,061,160 1.23 3 10207

Properties of SNPs that show large allele frequency differences between samples of individuals from 12 regions across Great Britain. Regions showing differentiated SNPs are givenwith details of theSNPwith the smallest P value in each region for differentiation on the 11-d.f. test of differences in SNP allele frequencies between geographical regions, within the 9 collections. Cluster plots for theseSNPs have been examined visually. Signal plots appear in Supplementary Information. Positions are in NCBI build-35 coordinates.

NATURE |Vol 447 |7 June 2007 ARTICLES

663Nature ©2007 Publishing Group

WTCCC, 2007

What genetic loci are associated to disease?

AA Aa aacase

control

~100,000 - 1,000,000 association tests

Aim 1: EWAS Methods

9Wednesday, December 14, 11

Page 10: Proof-of-Concept of Environment-Wide Association …nas-sites.org/emergingscience/files/2011/12/Chirag-Patel.pdfProof-of-Concept of Environment-Wide Association Studies (EWAS) on Common

Methods), and excluded 153 individuals on this basis. We nextlooked for evidence of population heterogeneity by studying allelefrequency differences between the 12 broad geographical regions(defined in Supplementary Fig. 4). The results for these 11-d.f. testsand associated quantile-quantile plots are shown in Fig. 2. Wide-spread small differences in allele frequencies are evident as anincreased slope of the line (Fig. 2b); in addition, a few loci showmuchlarger differences (Fig. 2a and Supplementary Fig. 6).

Thirteen genomic regions showing strong geographical variationare listed in Table 1, and Supplementary Fig. 7 shows theway in whichtheir allele frequencies vary geographically. The predominant patternis variation along a NW/SE axis. The most likely cause for thesemarked geographical differences is natural selection, most plausiblyin populations ancestral to those now in the UK. Variation due toselection has previously been implicated at LCT (lactase) and majorhistocompatibility complex (MHC)7–9, andwithin-UKdifferentiationat 4p14 has been found independently10, but others seem to be newfindings. All but three of the regions contain known genes. Aside from

evolutionary interest, genes showing evidence of natural selection areparticularly interesting for the biology of traits such as infectious dis-eases; possible targets for selection include NADSYN1 (NAD synthe-tase 1) at 11q13, which could have a role in prevention of pellagra, aswell as TLR1 (toll-like receptor 1) at 4p14, for which a role in thebiology of tuberculosis and leprosy has been suggested10.

There may be important population structure that is not wellcaptured by current geographical region of residence. Presentimplementations of strongly model-based approaches such asSTRUCTURE11,12 are impracticable for data sets of this size, and wereverted to the classical method of principal components13,14, using asubset of 197,175 SNPs chosen to reduce inter-locus linkage disequi-librium. Nevertheless, four of the first six principal componentsclearly picked up effects attributable to local linkage disequilibriumrather than genome-wide structure. The remaining two componentsshow the same predominant geographical trend from NW to SE but,perhaps unsurprisingly, London is set somewhat apart (Supplemen-tary Fig. 8).

The overall effect of population structure on our associationresults seems to be small, once recent migrants from outsideEurope are excluded. Estimates of over-dispersion of the associationtrend test statistics (usually denoted l; ref. 15) ranged from 1.03 and1.05 for RA and T1D, respectively, to 1.08–1.11 for the remainingdiseases. Some of this over-dispersion could be due to factors otherthan structure, and this possibility is supported by the fact that inclu-sion of the two ancestry informative principal components as cov-ariates in the association tests reduced the over-dispersion estimatesonly slightly (Supplementary Table 6), as did stratification by geo-graphical region. This impression is confirmed on noting thatP values with and without correction for structure are similar(Supplementary Fig. 9). We conclude that, for most of the genome,population structure has at most a small confounding effect in ourstudy, and as a consequence the analyses reported below do notcorrect for structure. In principle, apparent associations in the fewgenomic regions identified in Table 1 as showing strong geographicaldifferentiation should be interpreted with caution, but none arose inour analyses.

Disease association results

We assessed evidence for association in several ways (see Methods fordetails), drawing on both classical and bayesian statistical approaches.For polymorphic SNPs on the Affymetrix chip, we performed trendtests (1 degree of freedom16) and general genotype tests (2 degrees offreedom16, referred to as genotypic) between each case collection andthe pooled controls, and calculated analogous Bayes factors. Thereare examples from animal models where genetic effects act differentlyin males and females17, and to assess this in our data we applied a

!log

10(P

)

0

5

10

15

Chromosome

22 X212019181716151413121110987654321

3020

20

100

0

40

80

60

40

100

Obs

erve

d te

st s

tatis

tic

Expected chi-squared value

a

b

Figure 2 | Genome-wide picture of geographic variation. a, P values for the11-d.f. test for difference in SNP allele frequencies between geographicalregions, within the 9 collections. SNPs have been excluded using the projectquality control filters described inMethods. Green dots indicate SNPs with aP value,13 1025. b, Quantile-quantile plots of these test statistics. SNPs atwhich the test statistic exceeds 100 are represented by triangles at the top ofthe plot, and the shaded region is the 95% concentration band (seeMethods). Also shown in blue is the quantile-quantile plot resulting fromremoval of all SNPs in the 13 most differentiated regions (Table 1).

Table 1 | Highly differentiated SNPs

Chromosome Genes Region (Mb) SNP Position P value

2q21 LCT 135.16–136.82 rs1042712 136,379,576 5.54 3 10213

4p14 TLR1, TLR6, TLR10 38.51–38.74 rs7696175 386,43,552 1.51 3 10212

4q28 137.97–138.01 rs1460133 137,999,953 4.43 3 10208

6p25 IRF4 0.32–0.42 rs9378805 362,727 5.39 3 10213

6p21 HLA 31.10–31.55 rs3873375 31,359,339 1.07 3 10211

9p24 DMRT1 0.86–0.88 rs11790408 866,418 4.96 3 10207

11p15 NAV2 19.55–19.70 rs12295525 19,661,808 7.44 3 10208

11q13 NADSYN1, DHCR7 70.78–70.93 rs12797951 70,820,914 3.01 3 10208

12p13 DYRK4,AKAP3,NDUFA9,RAD51AP1,GALNT8

4.37–4.82 rs10774241 45,537,27 2.73 3 10208

14q12 HECTD1,AP4S1,STRN3 30.41–31.03 rs17449560 30,598,823 1.46 3 10207

19q13 GIPR,SNRPD2,QPCTL,SIX5,DMPK,DMWD,RSHL1,SYMPK,FOXA3

50.84–51.09 rs3760843 50,980,546 4.19 3 10207

20q12 38.30–38.77 rs2143877 38,526,309 1.12 3 10209

Xp22 2.06–2.08 rs6644913 2,061,160 1.23 3 10207

Properties of SNPs that show large allele frequency differences between samples of individuals from 12 regions across Great Britain. Regions showing differentiated SNPs are givenwith details of theSNPwith the smallest P value in each region for differentiation on the 11-d.f. test of differences in SNP allele frequencies between geographical regions, within the 9 collections. Cluster plots for theseSNPs have been examined visually. Signal plots appear in Supplementary Information. Positions are in NCBI build-35 coordinates.

NATURE |Vol 447 |7 June 2007 ARTICLES

663Nature ©2007 Publishing Group

Environment-Wide Association Studies (EWAS)

What specific environmental “loci” are associated to disease?:ie, T2D, lipid levels, obesity, etc?

Environmental CategoryVita

mins

β-carotene

Metals

lead

Organo

phos

phate

Pestici

des

chlorpyrifos

Hydroc

arbon

s

2-hydroxyfluorene [factor]

case

control

Aim 1: EWAS Methods

10Wednesday, December 14, 11

Page 11: Proof-of-Concept of Environment-Wide Association …nas-sites.org/emergingscience/files/2011/12/Chirag-Patel.pdfProof-of-Concept of Environment-Wide Association Studies (EWAS) on Common

Methods), and excluded 153 individuals on this basis. We nextlooked for evidence of population heterogeneity by studying allelefrequency differences between the 12 broad geographical regions(defined in Supplementary Fig. 4). The results for these 11-d.f. testsand associated quantile-quantile plots are shown in Fig. 2. Wide-spread small differences in allele frequencies are evident as anincreased slope of the line (Fig. 2b); in addition, a few loci showmuchlarger differences (Fig. 2a and Supplementary Fig. 6).

Thirteen genomic regions showing strong geographical variationare listed in Table 1, and Supplementary Fig. 7 shows theway in whichtheir allele frequencies vary geographically. The predominant patternis variation along a NW/SE axis. The most likely cause for thesemarked geographical differences is natural selection, most plausiblyin populations ancestral to those now in the UK. Variation due toselection has previously been implicated at LCT (lactase) and majorhistocompatibility complex (MHC)7–9, andwithin-UKdifferentiationat 4p14 has been found independently10, but others seem to be newfindings. All but three of the regions contain known genes. Aside from

evolutionary interest, genes showing evidence of natural selection areparticularly interesting for the biology of traits such as infectious dis-eases; possible targets for selection include NADSYN1 (NAD synthe-tase 1) at 11q13, which could have a role in prevention of pellagra, aswell as TLR1 (toll-like receptor 1) at 4p14, for which a role in thebiology of tuberculosis and leprosy has been suggested10.

There may be important population structure that is not wellcaptured by current geographical region of residence. Presentimplementations of strongly model-based approaches such asSTRUCTURE11,12 are impracticable for data sets of this size, and wereverted to the classical method of principal components13,14, using asubset of 197,175 SNPs chosen to reduce inter-locus linkage disequi-librium. Nevertheless, four of the first six principal componentsclearly picked up effects attributable to local linkage disequilibriumrather than genome-wide structure. The remaining two componentsshow the same predominant geographical trend from NW to SE but,perhaps unsurprisingly, London is set somewhat apart (Supplemen-tary Fig. 8).

The overall effect of population structure on our associationresults seems to be small, once recent migrants from outsideEurope are excluded. Estimates of over-dispersion of the associationtrend test statistics (usually denoted l; ref. 15) ranged from 1.03 and1.05 for RA and T1D, respectively, to 1.08–1.11 for the remainingdiseases. Some of this over-dispersion could be due to factors otherthan structure, and this possibility is supported by the fact that inclu-sion of the two ancestry informative principal components as cov-ariates in the association tests reduced the over-dispersion estimatesonly slightly (Supplementary Table 6), as did stratification by geo-graphical region. This impression is confirmed on noting thatP values with and without correction for structure are similar(Supplementary Fig. 9). We conclude that, for most of the genome,population structure has at most a small confounding effect in ourstudy, and as a consequence the analyses reported below do notcorrect for structure. In principle, apparent associations in the fewgenomic regions identified in Table 1 as showing strong geographicaldifferentiation should be interpreted with caution, but none arose inour analyses.

Disease association results

We assessed evidence for association in several ways (see Methods fordetails), drawing on both classical and bayesian statistical approaches.For polymorphic SNPs on the Affymetrix chip, we performed trendtests (1 degree of freedom16) and general genotype tests (2 degrees offreedom16, referred to as genotypic) between each case collection andthe pooled controls, and calculated analogous Bayes factors. Thereare examples from animal models where genetic effects act differentlyin males and females17, and to assess this in our data we applied a

!log

10(P

)

0

5

10

15

Chromosome

22 X2120191817161514131211109876543213020

20

100

0

40

80

60

40

100

Obs

erve

d te

st s

tatis

tic

Expected chi-squared value

a

b

Figure 2 | Genome-wide picture of geographic variation. a, P values for the11-d.f. test for difference in SNP allele frequencies between geographicalregions, within the 9 collections. SNPs have been excluded using the projectquality control filters described inMethods. Green dots indicate SNPs with aP value,13 1025. b, Quantile-quantile plots of these test statistics. SNPs atwhich the test statistic exceeds 100 are represented by triangles at the top ofthe plot, and the shaded region is the 95% concentration band (seeMethods). Also shown in blue is the quantile-quantile plot resulting fromremoval of all SNPs in the 13 most differentiated regions (Table 1).

Table 1 | Highly differentiated SNPs

Chromosome Genes Region (Mb) SNP Position P value

2q21 LCT 135.16–136.82 rs1042712 136,379,576 5.54 3 10213

4p14 TLR1, TLR6, TLR10 38.51–38.74 rs7696175 386,43,552 1.51 3 10212

4q28 137.97–138.01 rs1460133 137,999,953 4.43 3 10208

6p25 IRF4 0.32–0.42 rs9378805 362,727 5.39 3 10213

6p21 HLA 31.10–31.55 rs3873375 31,359,339 1.07 3 10211

9p24 DMRT1 0.86–0.88 rs11790408 866,418 4.96 3 10207

11p15 NAV2 19.55–19.70 rs12295525 19,661,808 7.44 3 10208

11q13 NADSYN1, DHCR7 70.78–70.93 rs12797951 70,820,914 3.01 3 10208

12p13 DYRK4,AKAP3,NDUFA9,RAD51AP1,GALNT8

4.37–4.82 rs10774241 45,537,27 2.73 3 10208

14q12 HECTD1,AP4S1,STRN3 30.41–31.03 rs17449560 30,598,823 1.46 3 10207

19q13 GIPR,SNRPD2,QPCTL,SIX5,DMPK,DMWD,RSHL1,SYMPK,FOXA3

50.84–51.09 rs3760843 50,980,546 4.19 3 10207

20q12 38.30–38.77 rs2143877 38,526,309 1.12 3 10209

Xp22 2.06–2.08 rs6644913 2,061,160 1.23 3 10207

Properties of SNPs that show large allele frequency differences between samples of individuals from 12 regions across Great Britain. Regions showing differentiated SNPs are givenwith details of theSNPwith the smallest P value in each region for differentiation on the 11-d.f. test of differences in SNP allele frequencies between geographical regions, within the 9 collections. Cluster plots for theseSNPs have been examined visually. Signal plots appear in Supplementary Information. Positions are in NCBI build-35 coordinates.

NATURE |Vol 447 |7 June 2007 ARTICLES

663Nature ©2007 Publishing Group

Why “EWAS”?

What environmental factors are associated to disease?

Environmental Categorycomprehensive and transparent

multiplicity controlled

novel findings

(and validated)

Ioannidis JPAI, Loy EY, et al, (2009) Researching genetic vs nongenetic determinants of diseases: a comparison and proposed unification. Sci Trans Med vol. 1(7)7ps8

Aim 1: EWAS Methods

11Wednesday, December 14, 11

Page 12: Proof-of-Concept of Environment-Wide Association …nas-sites.org/emergingscience/files/2011/12/Chirag-Patel.pdfProof-of-Concept of Environment-Wide Association Studies (EWAS) on Common

NHANES:National Health and Nutrition Examination Survey1

since the 1960s: 50 years!now biannual: 1999 onwards10,000 participants per cohort

A Massive, Ongoing, and Significant Public Health Survey

Introduction

The National Health and Nutrition Examination Survey (NHANES) is a program of studiesdesigned to assess the health and nutritional status of adults and children in the United States. The survey is unique in that it com-bines interviews and physical examinations. NHANES is a major program of the National Center for Health Statistics (NCHS). NCHS is part of the Centers for Disease Control and Prevention (CDC) and has the responsibility for producing vital and health statistics for the Nation.

The NHANES program began in the early 1960s and has been conducted as a series of sur-veys focusing on different population groups or health topics. In 1999, the survey became a con-tinuous program that has a changing focus on a variety of health and nutrition measurements to meet emerging needs. The survey examines a nationally representative sample of about 5,000 persons each year. These persons are located in counties across the country, 15 of which are visited each year.

The NHANES interview includes demographic, socioeconomic, dietary, and health-related questions. The examination component consists of medical, dental, and physiological measure-ments, as well as laboratory tests administered by highly trained medical personnel.

Findings from this survey will be used to de-termine the prevalence of major diseases and risk factors for diseases. Information will be used to assess nutritional status and its associ-ation with health promotion and disease pre-���������������!��������������������������for national standards for such measurements as height, weight, and blood pressure. Data from this survey will be used in epidemiologi-cal studies and health sciences research, which help develop sound public health policy,

direct and design health programs and services, and expand the health knowl-edge for the Nation.

Survey Content

As in past health examination surveys, data will be collected on the prevalence of chron-ic conditions in the population. Estimates for previously undiagnosed conditions, as well as those known to and reported by respon-dents, are produced through the survey. Such information is a particular strength of the NHANES program.

Risk factors, those aspects of a person’s life-style, constitution, heredity, or environment that may increase the chances of developing a certain disease or condition, will be examined. Smoking, alcohol consumption, ���������� �� ���������������� �� ���!������and activity, weight, and dietary intake will be studied. Data on certain aspects of reproductive health, such as use of oral contraceptives and breastfeeding practices, will also be collected.

The diseases, medical conditions, and health indicators to be studied include:

• Anemia• Cardiovascular disease• Diabetes• Environmental exposures• Eye diseases• Hearing loss• Infectious diseases• Kidney disease• Nutrition• Obesity• Oral health• Osteoporosis

The sample for the survey is selected to represent the U.S. population of all ages. To produce reli-able statistics, NHANES over-samples persons 60 and older, African Americans, and Hispanics.

Since the United States has experienced dramatic growth in the number of older people during this century, the aging population has major impli-cations for health care needs, public policy, and research priorities. NCHS is working with public health agencies to increase the knowledge of the health status of older Americans. NHANES has a primary role in this endeavor.

All participants visit the physician. Dietary inter-views and body measurements are included for everyone. All but the very young have a blood sample taken and will have a dental screening. Depending upon the age of the participant, the rest of the examination includes tests and proce-dures to assess the various aspects of health listed above. In general, the older the individual, the more extensive the examination.

Survey Operations

Health interviews are conducted in respondents’ homes. Health measurements are performed in specially-designed and equipped mobile centers, which travel to locations throughout the country. The study team consists of a physician, medical and health technicians, as well as dietary and health interviewers. Many of the study staff are bilingual (English/Spanish).

An advanced computer system using high-end servers, desktop PCs, and wide-area networking collect and process all of the NHANES data, nearly eliminating the need for paper forms and manual coding operations. This system allows interviewers to use note-book computers with electronic pens. The staff at the mobile center can automatically transmit data into data bases through such devices as digital scales and stadiometers. Touch-sensi-tive computer screens let respondents enter their own responses to certain sensitive ques-tions in complete privacy. Survey information is available to NCHS staff within 24 hours of collection, which enhances the capability of collecting quality data and increases the speed with which results are released to the public.

In each location, local health and government ��! �������������!������������ ����������� ��Households in the study area receive a letter from the NCHS Director to introduce the survey. Local media may feature stories about the survey.

NHANES is designed to facilitate and en-courage participation. Transportation is provided to and from the mobile center if necessary. Participants receive compensation and a report ������� ���!��������������������� ������� �������All information collected in the survey is kept ���� �� � ��!������������� ��������� ����� �public laws.

Uses of the Data

Information from NHANES is made available through an extensive series of publications and ���� �������� �����! ������� ��� �����������������data users and researchers throughout the world, survey data are available on the internet and on easy-to-use CD-ROMs.

Research organizations, universities, health ���������������������� ����������!�������survey information. Primary data users are federal agencies that collaborated in the de-sign and development of the survey. The National Institutes of Health, the Food and Drug Administration, and CDC are among the agencies that rely upon NHANES to provide data essential for the implementation and evaluation of program activities. The U.S. Department of Agriculture and NCHS coop-erate in planning and reporting dietary and nutrition information from the survey.

NHANES’ partnership with the U.S. Environ-mental Protection Agency allows continued study of the many important environmental ��"��� �����������������

•�� �� ���!������������ �� ������ �������• Reproductive history and sexual behavior• Respiratory disease (asthma, chronic bron- chitis, emphysema)• Sexually transmitted diseases • Vision

environment:elimination of lead --70% decline since ‘70s

disease prevalence estimates:T2D, obesity, cardiovascular disease

growth charts for development: WHO standard

1 http://www.cdc.gov/nchs/nhanes.htm

Aim 1: EWAS Methods

12Wednesday, December 14, 11

Page 13: Proof-of-Concept of Environment-Wide Association …nas-sites.org/emergingscience/files/2011/12/Chirag-Patel.pdfProof-of-Concept of Environment-Wide Association Studies (EWAS) on Common

NHANESEnvironmental Factor “E-Chip”

Aim 1: EWAS Methods

Demographics

agerace/ethnicity

incomeeducation

10,000

examplesnumber

of participants

Questionnaires

Drug UsePhysical Activity

OccupationFood Frequency

10,000

Physical Exam and Clinical Laboratory

Measures

Blood PressureHeight, Weight

Fasting Blood GlucoseCholesterol

~3,000

Markers of Exposure(serum and urine)

Nutrients/VitaminsMetals

HydrocarbonsPhthalates

PhenolsInfectious Agents

Allergens~3,000

13Wednesday, December 14, 11

Page 14: Proof-of-Concept of Environment-Wide Association …nas-sites.org/emergingscience/files/2011/12/Chirag-Patel.pdfProof-of-Concept of Environment-Wide Association Studies (EWAS) on Common

Environmental Measures by Category and Cohort

total 182 96169 258

11 0Pesticides, Pyrethyroid 10Pesticides, Organophosphate 22 2

13 1110Pesticides, Organochlorine 00 1Pesticides, Chlorophenol 10

0 0Pesticides, Carbamate 0 10 50 0Pesticides, Atrazine

02214Volatile Compounds 2910Virus 666

Polyflourochemicals 0 10 1200 0Polybrominated Ethers 120

Phytoestrogens 6 066Phthalates 12 07 12

15 12911Phenols

0 020Perchlorate23Polychlorinated Biphenyls 26 38 0

22Vitamin E 321Vitamin D 10 11 1Vitamin C 0 0

Vitamin B 54 343Vitamin A 333

22Mineral Nutrients 2 176Carotenoid Nutrients 150

001Latex 022 2114 0Hydrocarbons

23Heavy Metals 18 18 259Furans 55 0

077Dioxins 50Diakyl 77 6

1 1 1Cotinine 11Bacterial 178 13

0Allergen Test 0 2000200Acrylamide

1999-20002001-2002

2003-20042005-2006

envi

ronm

enta

l cat

egor

ies

IgE (cat, dog, milk ragweed...)Staph. aureus, gonorrhea, chlamydia

carotenes, lutein/zeaxanthin

retinyl palmitate, retinol

bisphenol A

HIV, measles, hepatitis A-D

lead, cadmium, arsenic

DDT, trans-nonachlor

Aim 1: EWAS Methods

14Wednesday, December 14, 11

Page 15: Proof-of-Concept of Environment-Wide Association …nas-sites.org/emergingscience/files/2011/12/Chirag-Patel.pdfProof-of-Concept of Environment-Wide Association Studies (EWAS) on Common

EWAS Methodology

bisphenol APCB199β-carotenecotinine...

}{foreach:

Environmental factors:log transformed & z-standardizedreference groups “negative”

M=96-258

foreach: {1999-2000, 2001-2002, 2003-2004, 2005-2006}4 individual cohorts

1999-2000 2001-2002 2003-2004 2005-2006bisphenol A . . 0.002 0.01

PCB199 0.1 0.02 0.03 .

β-carotene NA 0.0001 0.02 0.002

cotinine 0.03 0.01 0.9 .

... ... ... ... ...

Significance tests per cohortp-value(βfactor)

zfactor

disease

βfactor

Survey Regression (GEE):adjusted for known confounding factors

age, sex, ethnicity, socioeconomic status, ...

Aim 1: EWAS Methods

15Wednesday, December 14, 11

Page 16: Proof-of-Concept of Environment-Wide Association …nas-sites.org/emergingscience/files/2011/12/Chirag-Patel.pdfProof-of-Concept of Environment-Wide Association Studies (EWAS) on Common

EWAS Methodology, cont’d

False Discovery Rate Estimation

permute labels or residuals B times

bisphenol APCB199β-carotenecotinine...

}{foreach:

M=96-258

foreach: {1999-2000, 2001-2002, 2003-2004, 2005-2006}

foreach: 1 to B

1999-2000 2001-2002 2003-2004 2005-2006bisphenol A . . 0.1 0.2

PCB199 0.1 0.02 0.03 .

β-carotene NA 0.01 0.1 0.05

cotinine 0.03 0.01 0.9 .

... ... ... ... ...

FDR(p-value)

compute FDR # [p-value(βfactor(permuted)) < p] × 1/B # [p-value(βfactor)]

Tentative ValidationFDR < threshold in 2 or greater cohorts?

AND sign(βfactor) equal for cohorts?

Aim 1: EWAS Methods

16Wednesday, December 14, 11

Page 17: Proof-of-Concept of Environment-Wide Association …nas-sites.org/emergingscience/files/2011/12/Chirag-Patel.pdfProof-of-Concept of Environment-Wide Association Studies (EWAS) on Common

Proof of Concept of EWAS

1. Background and Methods

2. Examples: Type 2 Diabetes, Serum Lipid Levels

3. Checking Validity and An “LD” Map of the Environment?

4. Conclusion

5. Informatics for the Environment

17Wednesday, December 14, 11

Page 18: Proof-of-Concept of Environment-Wide Association …nas-sites.org/emergingscience/files/2011/12/Chirag-Patel.pdfProof-of-Concept of Environment-Wide Association Studies (EWAS) on Common

What environmental factors are associated with Type 2 Diabetes?

Aim 2: EWAS examples

18Wednesday, December 14, 11

Page 19: Proof-of-Concept of Environment-Wide Association …nas-sites.org/emergingscience/files/2011/12/Chirag-Patel.pdfProof-of-Concept of Environment-Wide Association Studies (EWAS) on Common

Patel CJ, Bhattacharya J, Butte AJ, (2010) An Environment-Wide Association Study (EWAS) on T2DM. PLoS ONE vol. 5(5)

Novel Findings:heptachlor epoxide

γ-tocopherol

Known Associations:β-carotene

vitamin DPCBs

Interesting Patterns:pesticides, PCBs

EWAS on T2DM

−log

10(p

valu

e)

●●

●●

●●

●●

●●●

●●●●

●●●

● ●●

●●

●●

●●

●●

●●

●●

●●●

●●

●●● ●

●●

● ●

●●

●●

●●●

●●●●

●●

●●

●●

●●

●●

●●

●● ●

acry

lam

ide

alle

rgen

test

bact

eria

l inf

ectio

n

cotin

ine

diak

yl

diox

ins

fura

ns d

iben

zofu

ran

heav

y m

etal

s

hydr

ocar

bons

late

x

nutri

ents

car

oten

oid

nutri

ents

min

eral

snu

trien

ts v

itam

in A

nutri

ents

vita

min

Bnu

trien

ts v

itam

in C

nutri

ents

vita

min

Dnu

trien

ts v

itam

in E

pcbs

perc

hlor

ate

pest

icid

es a

trazi

nepe

stic

ides

chl

orop

heno

lpe

stic

ides

org

anoc

hlor

ine

pest

icid

es o

rgan

opho

spha

tepe

stic

ides

pyr

ethy

roid

phen

ols

phth

alat

es

phyt

oest

roge

ns

poly

brom

inat

ed e

ther

spo

lyflo

uroc

hem

ical

s

vira

l inf

ectio

n

vola

tile

com

poun

ds

01

2

1999-20002001-20022003-20042005-2006

cohort markers

FDR(α<0.02) ~ 10% “validated” factors

Fasting Blood Glucose > 125 mg/dL?BMI, SES, ethnicity, age, sex

OR: Δ 1SD of exposureN=500-2000 per cohort

Heptachlor Epoxide OR=3.2, 1.8

PCB170OR=4.5,2.3

γ-tocopherol (vitamin E)OR=1.8,1.6β-carotene

OR=0.6,0.6

Aim 2: EWAS examples

19Wednesday, December 14, 11

Page 20: Proof-of-Concept of Environment-Wide Association …nas-sites.org/emergingscience/files/2011/12/Chirag-Patel.pdfProof-of-Concept of Environment-Wide Association Studies (EWAS) on Common

What about other risk phenotypes?EWAS on Serum Lipid Levels

Risk factors for coronary heart disease (CHD)

Targets for intervention (ie, statins)

Influenced by smoking, physical activity, diet, genetics1

LDL-Cholesterol 1% increased risk2

HDL-Cholesterol 2% decreased risk3

Triglycerides (increased risk)

risk for CHD (for 1% increase in lipids)lipid type

1. Tanya M. Teslovich et al. Nature (2010) vol. 466 (7307) pp. 7072.Grundy et al. Arteriosclerosis, Thrombosis, and Vascular Biology (2004) vol. 24 (2) pp. e133. Gotto et al. Journal of the American College of Cardiology (2004) vol. 43 (5) pp. 717-24

Aim 2: EWAS examples

20Wednesday, December 14, 11

Page 21: Proof-of-Concept of Environment-Wide Association …nas-sites.org/emergingscience/files/2011/12/Chirag-Patel.pdfProof-of-Concept of Environment-Wide Association Studies (EWAS) on Common

EWAS on HDL-C1999-20002001-20022003-20042005-2006

cohort markers

−log

10(p

valu

e)

●●

●●

●●●●●

●● ●

●●

●●●

●●

● ●

●●●

●●

●●

●●

●●

●●●

●●

●●

●●

●●

●●

●●

●●

●●

● ●

●● ●●●●

●●

●●

●● ●

●●

●●●●

●●

●●

●●

●●●●

●●

●●

●●

● ●●●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●●

●●●

●●

●●

●●●

●●

● ●

acry

lam

ide

alle

rgen

test

bact

eria

l inf

ectio

n

cotin

ine

diak

yl

diox

ins

fura

ns d

iben

zofu

ran

heav

y m

etal

s

hydr

ocar

bons

late

x

nutri

ents

car

oten

oid

nutri

ents

min

eral

snu

trien

ts v

itam

in A

nutri

ents

vita

min

Bnu

trien

ts v

itam

in C

nutri

ents

vita

min

Dnu

trien

ts v

itam

in E

pcbs

perc

hlor

ate

pest

icid

es a

trazi

nepe

stic

ides

car

bam

ate

pest

icid

es c

hlor

ophe

nol

pest

icid

es o

rgan

ochl

orin

epe

stic

ides

org

anop

hosp

hate

pest

icid

es p

yret

hyro

id

phen

ols

phth

alat

es

phyt

oest

roge

ns

poly

brom

inat

ed e

ther

spo

lyflo

uroc

hem

ical

s

vira

l inf

ectio

n

vola

tile

com

poun

ds

01

23

4

FDR < 10%

carotenes Vitamin E

heavy metals

cotinine

organochlorine pesticides

Patel CJ, Cullen MR, Ioannidis JAP, Butte AJ, (2011). Non-genetic associations and correlation globes for determinants of Lipid Levels: an EWAS. In Review.

Vitamin C, D

hydrocarbons

log10(HDL-C)BMI, SES, ethnicity, age, age2, sex

N=1000-3000

Aim 2: EWAS examples

21Wednesday, December 14, 11

Page 22: Proof-of-Concept of Environment-Wide Association …nas-sites.org/emergingscience/files/2011/12/Chirag-Patel.pdfProof-of-Concept of Environment-Wide Association Studies (EWAS) on Common

cohort

2001−20022003−2004combined

2001−20022003−20042005−2006combined

2001−20022003−20042005−2006combined

1999−20002001−20022003−20042005−2006combined

2001−20022003−20042005−2006combined

1999−20002001−20022005−2006combined

1999−20002001−20022003−20042005−2006combined

1999−20002003−2004combined

2001−20022003−2004combined

1999−20002001−20022003−2004combined

1999−20002001−20022003−2004combined

2001−20022003−2004combined

1,2,3,4,7,8−hxcdf

trans−b−carotene

cis−b−carotene

Retinol

Retinyl palmitate

a−tocopherol

g−tocopherol

PCB74

PCB170

Oxychlordane

Trans−nonachlor

Enterolactone

N

534806

1735

3605323328897374

3605323325967135

29963610323328899519

3480323327558903

2981360928897140

25853579323328729194

811832

2202

1004825

2155

704986877

2131

8141001

8652228

114910732358

pvalue

0.010.005

2e−05

0.0020.01

7e−041e−08

0.010.01

0.0011e−06

0.0055e−042e−043e−046e−21

7e−042e−040.003

6e−17

0.0012e−042e−048e−20

0.010.0020.002

7e−051e−17

0.010.005

1e−06

0.010.002

4e−06

0.020.0020.003

5e−09

0.020.0020.005

1e−08

0.020.006

2e−07

effe

ct (m

g/dl

)

554830

−18−18−19−16

−9−17−15−12

2323292725

37622441

49865767

2451423941

376138

628650

53785357

42664947

−14−20−17

−20 −10 0 10 20 30 40

% change

A Triglycerides

cohort

2003−20042005−2006combined

2003−20042005−2006combined

2001−20022003−2004combined

2001−20022003−2004combined

2001−20022003−20042005−2006combined

2001−20022003−20042005−2006combined

1999−20002001−20022003−20042005−2006combined

2001−20022003−20042005−2006combined

2001−20022003−2004combined

2003−20042005−2006combined

2001−20022003−20042005−2006combined

2001−20022003−2004combined

2001−20022003−2004combined

Cotinine

Mercury, total

2−fluorene

3−fluorene

Combined Lutein/zeaxanthin

cis−b−carotene

Iron, Frozen Serum

Retinyl stearate

Folate, serum

Vitamin C

Vitamin D

g−tocopherol

Heptachlor Epoxide

N

726769599513

727369616323

233221922252

233221762243

7473679068687388

7478679062647151

63837457270625246764

7251679063378421

746872679559

679969114852

7056727369667401

742867909216

202218352108

pvalue

0.0030.02

2e−06

0.010.002

6e−07

0.010.0060.004

0.020.01

0.006

2e−042e−054e−042e−16

3e−049e−042e−043e−12

0.0090.0030.0060.002

6e−11

0.0020.0030.002

4e−05

0.0040.02

2e−05

0.0060.02

0.002

0.010.004

0.011e−06

0.0010.01

6e−06

0.010.02

0.006

effe

ct (m

g/dl

)

−2−1−1

122

−2−1−1

−2−1−1

3334

2333

22222

−1−1−2−1

111

211

1212

−1−1−1

−2−1−2

−6 −4 −2 0 2 4 6

% change

C HDL-C

cohort

2001−20022003−20042005−2006combined

2001−20022005−2006combined

2001−20022003−20042005−2006combined

2001−20022003−20042005−2006combined

2001−20022003−20042005−2006combined

2001−20022005−2006combined

1999−20002001−20022005−2006combined

2001−20022003−20042005−2006combined

trans−b−carotene

cis−b−carotene

b−cryptoxanthin

Combined Lutein/zeaxanthin

trans−lycopene

Retinyl palmitate

a−tocopherol

g−tocopherol

N

3315317428307043

331725416809

3294317428057012

3317317428307043

3315317428307043

320026988425

2734331728306665

3288317428148696

pvalue

0.0030.004

9e−042e−13

0.0020.004

5e−11

9e−046e−040.001

4e−13

0.0012e−045e−043e−15

5e−041e−042e−048e−17

8e−040.001

4e−13

0.0028e−057e−057e−19

0.0030.0020.005

3e−14

effe

ct (m

g/dl

)

8698

776

7798

98109

10101412

586

14171716

8666

0 5 10 15 20

% change

B LDL-C

cohort

2001−20022003−2004combined

2001−20022003−20042005−2006combined

2001−20022003−20042005−2006combined

1999−20002001−20022003−20042005−2006combined

2001−20022003−20042005−2006combined

1999−20002001−20022005−2006combined

1999−20002001−20022003−20042005−2006combined

1999−20002003−2004combined

2001−20022003−2004combined

1999−20002001−20022003−2004combined

1999−20002001−20022003−2004combined

2001−20022003−2004combined

1,2,3,4,7,8−hxcdf

trans−b−carotene

cis−b−carotene

Retinol

Retinyl palmitate

a−tocopherol

g−tocopherol

PCB74

PCB170

Oxychlordane

Trans−nonachlor

Enterolactone

N

534806

1735

3605323328897374

3605323325967135

29963610323328899519

3480323327558903

2981360928897140

25853579323328729194

811832

2202

1004825

2155

704986877

2131

8141001

8652228

114910732358

pvalue

0.010.005

2e−05

0.0020.01

7e−041e−08

0.010.01

0.0011e−06

0.0055e−042e−043e−046e−21

7e−042e−040.003

6e−17

0.0012e−042e−048e−20

0.010.0020.002

7e−051e−17

0.010.005

1e−06

0.010.002

4e−06

0.020.0020.003

5e−09

0.020.0020.005

1e−08

0.020.006

2e−07

effe

ct (m

g/dl

)

554830

−18−18−19−16

−9−17−15−12

2323292725

37622441

49865767

2451423941

376138

628650

53785357

42664947

−14−20−17

−20 −10 0 10 20 30 40

% change

A Triglycerides

cohort

2003−20042005−2006combined

2003−20042005−2006combined

2001−20022003−2004combined

2001−20022003−2004combined

2001−20022003−20042005−2006combined

2001−20022003−20042005−2006combined

1999−20002001−20022003−20042005−2006combined

2001−20022003−20042005−2006combined

2001−20022003−2004combined

2003−20042005−2006combined

2001−20022003−20042005−2006combined

2001−20022003−2004combined

2001−20022003−2004combined

Cotinine

Mercury, total

2−fluorene

3−fluorene

Combined Lutein/zeaxanthin

cis−b−carotene

Iron, Frozen Serum

Retinyl stearate

Folate, serum

Vitamin C

Vitamin D

g−tocopherol

Heptachlor Epoxide

N

726769599513

727369616323

233221922252

233221762243

7473679068687388

7478679062647151

63837457270625246764

7251679063378421

746872679559

679969114852

7056727369667401

742867909216

202218352108

pvalue

0.0030.02

2e−06

0.010.002

6e−07

0.010.0060.004

0.020.01

0.006

2e−042e−054e−042e−16

3e−049e−042e−043e−12

0.0090.0030.0060.002

6e−11

0.0020.0030.002

4e−05

0.0040.02

2e−05

0.0060.02

0.002

0.010.004

0.011e−06

0.0010.01

6e−06

0.010.02

0.006

effe

ct (m

g/dl

)

−2−1−1

122

−2−1−1

−2−1−1

3334

2333

22222

−1−1−2−1

111

211

1212

−1−1−1

−2−1−2

−6 −4 −2 0 2 4 6

% change

C HDL-C

cohort

2001−20022003−20042005−2006combined

2001−20022005−2006combined

2001−20022003−20042005−2006combined

2001−20022003−20042005−2006combined

2001−20022003−20042005−2006combined

2001−20022005−2006combined

1999−20002001−20022005−2006combined

2001−20022003−20042005−2006combined

trans−b−carotene

cis−b−carotene

b−cryptoxanthin

Combined Lutein/zeaxanthin

trans−lycopene

Retinyl palmitate

a−tocopherol

g−tocopherol

N

3315317428307043

331725416809

3294317428057012

3317317428307043

3315317428307043

320026988425

2734331728306665

3288317428148696

pvalue

0.0030.004

9e−042e−13

0.0020.004

5e−11

9e−046e−040.001

4e−13

0.0012e−045e−043e−15

5e−041e−042e−048e−17

8e−040.001

4e−13

0.0028e−057e−057e−19

0.0030.0020.005

3e−14

effe

ct (m

g/dl

)

8698

776

7798

98109

10101412

586

14171716

8666

0 5 10 15 20

% change

B LDL-C

cohort

2001−20022003−2004combined

2001−20022003−20042005−2006combined

2001−20022003−20042005−2006combined

1999−20002001−20022003−20042005−2006combined

2001−20022003−20042005−2006combined

1999−20002001−20022005−2006combined

1999−20002001−20022003−20042005−2006combined

1999−20002003−2004combined

2001−20022003−2004combined

1999−20002001−20022003−2004combined

1999−20002001−20022003−2004combined

2001−20022003−2004combined

1,2,3,4,7,8−hxcdf

trans−b−carotene

cis−b−carotene

Retinol

Retinyl palmitate

a−tocopherol

g−tocopherol

PCB74

PCB170

Oxychlordane

Trans−nonachlor

Enterolactone

N

534806

1735

3605323328897374

3605323325967135

29963610323328899519

3480323327558903

2981360928897140

25853579323328729194

811832

2202

1004825

2155

704986877

2131

8141001

8652228

114910732358

pvalue

0.010.005

2e−05

0.0020.01

7e−041e−08

0.010.01

0.0011e−06

0.0055e−042e−043e−046e−21

7e−042e−040.003

6e−17

0.0012e−042e−048e−20

0.010.0020.002

7e−051e−17

0.010.005

1e−06

0.010.002

4e−06

0.020.0020.003

5e−09

0.020.0020.005

1e−08

0.020.006

2e−07

effe

ct (m

g/dl

)

554830

−18−18−19−16

−9−17−15−12

2323292725

37622441

49865767

2451423941

376138

628650

53785357

42664947

−14−20−17

−20 −10 0 10 20 30 40

% change

A Triglycerides

cohort

2003−20042005−2006combined

2003−20042005−2006combined

2001−20022003−2004combined

2001−20022003−2004combined

2001−20022003−20042005−2006combined

2001−20022003−20042005−2006combined

1999−20002001−20022003−20042005−2006combined

2001−20022003−20042005−2006combined

2001−20022003−2004combined

2003−20042005−2006combined

2001−20022003−20042005−2006combined

2001−20022003−2004combined

2001−20022003−2004combined

Cotinine

Mercury, total

2−fluorene

3−fluorene

Combined Lutein/zeaxanthin

cis−b−carotene

Iron, Frozen Serum

Retinyl stearate

Folate, serum

Vitamin C

Vitamin D

g−tocopherol

Heptachlor Epoxide

N

726769599513

727369616323

233221922252

233221762243

7473679068687388

7478679062647151

63837457270625246764

7251679063378421

746872679559

679969114852

7056727369667401

742867909216

202218352108

pvalue

0.0030.02

2e−06

0.010.002

6e−07

0.010.0060.004

0.020.01

0.006

2e−042e−054e−042e−16

3e−049e−042e−043e−12

0.0090.0030.0060.002

6e−11

0.0020.0030.002

4e−05

0.0040.02

2e−05

0.0060.02

0.002

0.010.004

0.011e−06

0.0010.01

6e−06

0.010.02

0.006

effe

ct (m

g/dl

)

−2−1−1

122

−2−1−1

−2−1−1

3334

2333

22222

−1−1−2−1

111

211

1212

−1−1−1

−2−1−2

−6 −4 −2 0 2 4 6

% change

C HDL-C

cohort

2001−20022003−20042005−2006combined

2001−20022005−2006combined

2001−20022003−20042005−2006combined

2001−20022003−20042005−2006combined

2001−20022003−20042005−2006combined

2001−20022005−2006combined

1999−20002001−20022005−2006combined

2001−20022003−20042005−2006combined

trans−b−carotene

cis−b−carotene

b−cryptoxanthin

Combined Lutein/zeaxanthin

trans−lycopene

Retinyl palmitate

a−tocopherol

g−tocopherol

N

3315317428307043

331725416809

3294317428057012

3317317428307043

3315317428307043

320026988425

2734331728306665

3288317428148696

pvalue

0.0030.004

9e−042e−13

0.0020.004

5e−11

9e−046e−040.001

4e−13

0.0012e−045e−043e−15

5e−041e−042e−048e−17

8e−040.001

4e−13

0.0028e−057e−057e−19

0.0030.0020.005

3e−14

effe

ct (m

g/dl

)

8698

776

7798

98109

10101412

586

14171716

8666

0 5 10 15 20

% change

B LDL-Ccohort

2001−20022003−2004combined

2001−20022003−20042005−2006combined

2001−20022003−20042005−2006combined

1999−20002001−20022003−20042005−2006combined

2001−20022003−20042005−2006combined

1999−20002001−20022005−2006combined

1999−20002001−20022003−20042005−2006combined

1999−20002003−2004combined

2001−20022003−2004combined

1999−20002001−20022003−2004combined

1999−20002001−20022003−2004combined

2001−20022003−2004combined

1,2,3,4,7,8−hxcdf

trans−b−carotene

cis−b−carotene

Retinol

Retinyl palmitate

a−tocopherol

g−tocopherol

PCB74

PCB170

Oxychlordane

Trans−nonachlor

Enterolactone

N

534806

1735

3605323328897374

3605323325967135

29963610323328899519

3480323327558903

2981360928897140

25853579323328729194

811832

2202

1004825

2155

704986877

2131

8141001

8652228

114910732358

pvalue

0.010.005

2e−05

0.0020.01

7e−041e−08

0.010.01

0.0011e−06

0.0055e−042e−043e−046e−21

7e−042e−040.003

6e−17

0.0012e−042e−048e−20

0.010.0020.002

7e−051e−17

0.010.005

1e−06

0.010.002

4e−06

0.020.0020.003

5e−09

0.020.0020.005

1e−08

0.020.006

2e−07

effe

ct (m

g/dl

)

554830

−18−18−19−16

−9−17−15−12

2323292725

37622441

49865767

2451423941

376138

628650

53785357

42664947

−14−20−17

−20 −10 0 10 20 30 40

% change

A Triglycerides

cohort

2003−20042005−2006combined

2003−20042005−2006combined

2001−20022003−2004combined

2001−20022003−2004combined

2001−20022003−20042005−2006combined

2001−20022003−20042005−2006combined

1999−20002001−20022003−20042005−2006combined

2001−20022003−20042005−2006combined

2001−20022003−2004combined

2003−20042005−2006combined

2001−20022003−20042005−2006combined

2001−20022003−2004combined

2001−20022003−2004combined

Cotinine

Mercury, total

2−fluorene

3−fluorene

Combined Lutein/zeaxanthin

cis−b−carotene

Iron, Frozen Serum

Retinyl stearate

Folate, serum

Vitamin C

Vitamin D

g−tocopherol

Heptachlor Epoxide

N

726769599513

727369616323

233221922252

233221762243

7473679068687388

7478679062647151

63837457270625246764

7251679063378421

746872679559

679969114852

7056727369667401

742867909216

202218352108

pvalue

0.0030.02

2e−06

0.010.002

6e−07

0.010.0060.004

0.020.01

0.006

2e−042e−054e−042e−16

3e−049e−042e−043e−12

0.0090.0030.0060.002

6e−11

0.0020.0030.002

4e−05

0.0040.02

2e−05

0.0060.02

0.002

0.010.004

0.011e−06

0.0010.01

6e−06

0.010.02

0.006

effe

ct (m

g/dl

)

−2−1−1

122

−2−1−1

−2−1−1

3334

2333

22222

−1−1−2−1

111

211

1212

−1−1−1

−2−1−2

−6 −4 −2 0 2 4 6

% change

C HDL-C

cohort

2001−20022003−20042005−2006combined

2001−20022005−2006combined

2001−20022003−20042005−2006combined

2001−20022003−20042005−2006combined

2001−20022003−20042005−2006combined

2001−20022005−2006combined

1999−20002001−20022005−2006combined

2001−20022003−20042005−2006combined

trans−b−carotene

cis−b−carotene

b−cryptoxanthin

Combined Lutein/zeaxanthin

trans−lycopene

Retinyl palmitate

a−tocopherol

g−tocopherol

N

3315317428307043

331725416809

3294317428057012

3317317428307043

3315317428307043

320026988425

2734331728306665

3288317428148696

pvalue

0.0030.004

9e−042e−13

0.0020.004

5e−11

9e−046e−040.001

4e−13

0.0012e−045e−043e−15

5e−041e−042e−048e−17

8e−040.001

4e−13

0.0028e−057e−057e−19

0.0030.0020.005

3e−14

effe

ct (m

g/dl

)

8698

776

7798

98109

10101412

586

14171716

8666

0 5 10 15 20

% change

B LDL-C

cohort

2001−20022003−2004combined

2001−20022003−20042005−2006combined

2001−20022003−20042005−2006combined

1999−20002001−20022003−20042005−2006combined

2001−20022003−20042005−2006combined

1999−20002001−20022005−2006combined

1999−20002001−20022003−20042005−2006combined

1999−20002003−2004combined

2001−20022003−2004combined

1999−20002001−20022003−2004combined

1999−20002001−20022003−2004combined

2001−20022003−2004combined

1,2,3,4,7,8−hxcdf

trans−b−carotene

cis−b−carotene

Retinol

Retinyl palmitate

a−tocopherol

g−tocopherol

PCB74

PCB170

Oxychlordane

Trans−nonachlor

Enterolactone

N

534806

1735

3605323328897374

3605323325967135

29963610323328899519

3480323327558903

2981360928897140

25853579323328729194

811832

2202

1004825

2155

704986877

2131

8141001

8652228

114910732358

pvalue

0.010.005

2e−05

0.0020.01

7e−041e−08

0.010.01

0.0011e−06

0.0055e−042e−043e−046e−21

7e−042e−040.003

6e−17

0.0012e−042e−048e−20

0.010.0020.002

7e−051e−17

0.010.005

1e−06

0.010.002

4e−06

0.020.0020.003

5e−09

0.020.0020.005

1e−08

0.020.006

2e−07

effe

ct (m

g/dl

)

554830

−18−18−19−16

−9−17−15−12

2323292725

37622441

49865767

2451423941

376138

628650

53785357

42664947

−14−20−17

−20 −10 0 10 20 30 40

% change

A Triglycerides

cohort

2003−20042005−2006combined

2003−20042005−2006combined

2001−20022003−2004combined

2001−20022003−2004combined

2001−20022003−20042005−2006combined

2001−20022003−20042005−2006combined

1999−20002001−20022003−20042005−2006combined

2001−20022003−20042005−2006combined

2001−20022003−2004combined

2003−20042005−2006combined

2001−20022003−20042005−2006combined

2001−20022003−2004combined

2001−20022003−2004combined

Cotinine

Mercury, total

2−fluorene

3−fluorene

Combined Lutein/zeaxanthin

cis−b−carotene

Iron, Frozen Serum

Retinyl stearate

Folate, serum

Vitamin C

Vitamin D

g−tocopherol

Heptachlor Epoxide

N

726769599513

727369616323

233221922252

233221762243

7473679068687388

7478679062647151

63837457270625246764

7251679063378421

746872679559

679969114852

7056727369667401

742867909216

202218352108

pvalue

0.0030.02

2e−06

0.010.002

6e−07

0.010.0060.004

0.020.01

0.006

2e−042e−054e−042e−16

3e−049e−042e−043e−12

0.0090.0030.0060.002

6e−11

0.0020.0030.002

4e−05

0.0040.02

2e−05

0.0060.02

0.002

0.010.004

0.011e−06

0.0010.01

6e−06

0.010.02

0.006

effe

ct (m

g/dl

)

−2−1−1

122

−2−1−1

−2−1−1

3334

2333

22222

−1−1−2−1

111

211

1212

−1−1−1

−2−1−2

−6 −4 −2 0 2 4 6

% change

C HDL-C

cohort

2001−20022003−20042005−2006combined

2001−20022005−2006combined

2001−20022003−20042005−2006combined

2001−20022003−20042005−2006combined

2001−20022003−20042005−2006combined

2001−20022005−2006combined

1999−20002001−20022005−2006combined

2001−20022003−20042005−2006combined

trans−b−carotene

cis−b−carotene

b−cryptoxanthin

Combined Lutein/zeaxanthin

trans−lycopene

Retinyl palmitate

a−tocopherol

g−tocopherol

N

3315317428307043

331725416809

3294317428057012

3317317428307043

3315317428307043

320026988425

2734331728306665

3288317428148696

pvalue

0.0030.004

9e−042e−13

0.0020.004

5e−11

9e−046e−040.001

4e−13

0.0012e−045e−043e−15

5e−041e−042e−048e−17

8e−040.001

4e−13

0.0028e−057e−057e−19

0.0030.0020.005

3e−14

effe

ct (m

g/dl

)

8698

776

7798

98109

10101412

586

14171716

8666

0 5 10 15 20

% change

B LDL-C

cohort

2001−20022003−2004combined

2001−20022003−20042005−2006combined

2001−20022003−20042005−2006combined

1999−20002001−20022003−20042005−2006combined

2001−20022003−20042005−2006combined

1999−20002001−20022005−2006combined

1999−20002001−20022003−20042005−2006combined

1999−20002003−2004combined

2001−20022003−2004combined

1999−20002001−20022003−2004combined

1999−20002001−20022003−2004combined

2001−20022003−2004combined

1,2,3,4,7,8−hxcdf

trans−b−carotene

cis−b−carotene

Retinol

Retinyl palmitate

a−tocopherol

g−tocopherol

PCB74

PCB170

Oxychlordane

Trans−nonachlor

Enterolactone

N

534806

1735

3605323328897374

3605323325967135

29963610323328899519

3480323327558903

2981360928897140

25853579323328729194

811832

2202

1004825

2155

704986877

2131

8141001

8652228

114910732358

pvalue

0.010.005

2e−05

0.0020.01

7e−041e−08

0.010.01

0.0011e−06

0.0055e−042e−043e−046e−21

7e−042e−040.003

6e−17

0.0012e−042e−048e−20

0.010.0020.002

7e−051e−17

0.010.005

1e−06

0.010.002

4e−06

0.020.0020.003

5e−09

0.020.0020.005

1e−08

0.020.006

2e−07

effe

ct (m

g/dl

)

554830

−18−18−19−16

−9−17−15−12

2323292725

37622441

49865767

2451423941

376138

628650

53785357

42664947

−14−20−17

−20 −10 0 10 20 30 40

% change

A Triglycerides

cohort

2003−20042005−2006combined

2003−20042005−2006combined

2001−20022003−2004combined

2001−20022003−2004combined

2001−20022003−20042005−2006combined

2001−20022003−20042005−2006combined

1999−20002001−20022003−20042005−2006combined

2001−20022003−20042005−2006combined

2001−20022003−2004combined

2003−20042005−2006combined

2001−20022003−20042005−2006combined

2001−20022003−2004combined

2001−20022003−2004combined

Cotinine

Mercury, total

2−fluorene

3−fluorene

Combined Lutein/zeaxanthin

cis−b−carotene

Iron, Frozen Serum

Retinyl stearate

Folate, serum

Vitamin C

Vitamin D

g−tocopherol

Heptachlor Epoxide

N

726769599513

727369616323

233221922252

233221762243

7473679068687388

7478679062647151

63837457270625246764

7251679063378421

746872679559

679969114852

7056727369667401

742867909216

202218352108

pvalue

0.0030.02

2e−06

0.010.002

6e−07

0.010.0060.004

0.020.01

0.006

2e−042e−054e−042e−16

3e−049e−042e−043e−12

0.0090.0030.0060.002

6e−11

0.0020.0030.002

4e−05

0.0040.02

2e−05

0.0060.02

0.002

0.010.004

0.011e−06

0.0010.01

6e−06

0.010.02

0.006

effe

ct (m

g/dl

)

−2−1−1

122

−2−1−1

−2−1−1

3334

2333

22222

−1−1−2−1

111

211

1212

−1−1−1

−2−1−2

−6 −4 −2 0 2 4 6

% change

C HDL-C

cohort

2001−20022003−20042005−2006combined

2001−20022005−2006combined

2001−20022003−20042005−2006combined

2001−20022003−20042005−2006combined

2001−20022003−20042005−2006combined

2001−20022005−2006combined

1999−20002001−20022005−2006combined

2001−20022003−20042005−2006combined

trans−b−carotene

cis−b−carotene

b−cryptoxanthin

Combined Lutein/zeaxanthin

trans−lycopene

Retinyl palmitate

a−tocopherol

g−tocopherol

N

3315317428307043

331725416809

3294317428057012

3317317428307043

3315317428307043

320026988425

2734331728306665

3288317428148696

pvalue

0.0030.004

9e−042e−13

0.0020.004

5e−11

9e−046e−040.001

4e−13

0.0012e−045e−043e−15

5e−041e−042e−048e−17

8e−040.001

4e−13

0.0028e−057e−057e−19

0.0030.0020.005

3e−14

effe

ct (m

g/dl

)

8698

776

7798

98109

10101412

586

14171716

8666

0 5 10 15 20

% change

B LDL-C

Effect Sizes For Validated Factors:HDL-C

% change = Δ 1 SD in Exposure18 validated factors

combined adjusted for:BMI, SES, ethnicity, age, age2, sex,

waist circumference, diabetes (FBG > 125 mg/dL), blood pressure

comparable to genetic effect sizes1!

Aim 2: EWAS examples

22Wednesday, December 14, 11

Page 23: Proof-of-Concept of Environment-Wide Association …nas-sites.org/emergingscience/files/2011/12/Chirag-Patel.pdfProof-of-Concept of Environment-Wide Association Studies (EWAS) on Common

Proof of Concept of EWAS

1. Background and Methods

2. Examples: Type 2 Diabetes, Serum Lipid Levels

3. Checking Validity and An “LD” Map of the Environment?

4. Conclusion

5. Informatics for the Environment

23Wednesday, December 14, 11

Page 24: Proof-of-Concept of Environment-Wide Association …nas-sites.org/emergingscience/files/2011/12/Chirag-Patel.pdfProof-of-Concept of Environment-Wide Association Studies (EWAS) on Common

Assessing Validity of Estimatesexample: HDL-C

Could the disease “lead” to exposure?“Reverse causality”

γ-tocopherol

?

tocopherol (vitamin e) supplements forCHD individuals?

low HDL

Aim 3: EWAS and Validity

Could there something confounding the association?

statin use

β-carotene

confounders

high HDL

??

Are associations independent? How to untangle the web of exposure?

β-carotene hydrocarbons

γ-tocopherol

ρ

24Wednesday, December 14, 11

Page 25: Proof-of-Concept of Environment-Wide Association …nas-sites.org/emergingscience/files/2011/12/Chirag-Patel.pdfProof-of-Concept of Environment-Wide Association Studies (EWAS) on Common

Longitudinal Study: “Gold Standard” for Validation

• exposure changing through time

• reverse causality bias

• compute disease risk

• how will we use “exposome”

longitudinally? age/time

HD

L-C

hole

ster

ol

(mg/

dL)

[high]

[low]

[γ-tocopherol]

tocopherol (vitamin e) supplements forCHD individuals?

low HDL

?

γ-tocopherol

Aim 3: EWAS and Validity

25Wednesday, December 14, 11

Page 26: Proof-of-Concept of Environment-Wide Association …nas-sites.org/emergingscience/files/2011/12/Chirag-Patel.pdfProof-of-Concept of Environment-Wide Association Studies (EWAS) on Common

Addressing Confounding Bias with the Exposome

E[HDL-C] = α + β1, original * carotene

“account” for bias due to statin use confounder

statin use

β-carotene High HDLE[HDL-C] = α + β1,extended * carotene + β2 * statin use

compare β1, original and β1,extended

disease status diabetes, CHD, heart attack

drug use metformin, statins use

supplement use count of total supplements used

physical activity daily estimated metabolic equivalents

recent food intake total nutrients computed from frequency questionnaire

“source” of bias example variables

Aim 3: EWAS and validity

26Wednesday, December 14, 11

Page 27: Proof-of-Concept of Environment-Wide Association …nas-sites.org/emergingscience/files/2011/12/Chirag-Patel.pdfProof-of-Concept of Environment-Wide Association Studies (EWAS) on Common

Correlation Structure of the Exposome?

Aim 3: EWAS and Validity

Are associations independent? How to untangle the complex web of exposure?

β-carotene hydrocarbons

γ-tocopherol

ρ

Analogy: “Linkage Disequilibrium”

Identification of four novel T2DM lociOur fast-track stage 2 genotyping confirmed the reported associationfor rs7903146 (TCF7L2) on chromosome 10, and in addition iden-tified significant associations for seven SNPs representing four newT2DM loci (Table 1). In all cases, the strongest association for theMAX statistic (see Methods) was obtained with the additive model.

The most significant of these corresponds to rs13266634, a non-synonymous SNP (R325W) in SLC30A8, located in a 33-kb linkagedisequilibrium block on chromosome 8, containing only the 39 endof this gene (Fig. 2a). SLC30A8 encodes a zinc transporter expressedsolely in the secretory vesicles of b-cells and is thus implicated in thefinal stages of insulin biosynthesis, which involve co-crystallization

Table 1 | Confirmed association results

SNP Chromosome Position

(nucleotides)

Risk

allele

Major

allele

MAF

(case)

MAF

(ctrl)

Odds ratio

(het)

Odds ratio

(hom)

PAR ls Stage 2

pMAX

Stage 2 pMAX

(perm)

Stage 1

pMAX

Stage 1 pMAX

(perm)

Nearest

gene

rs7903146 10 114,748,339 T C 0.406 0.293 1.656 0.19 2.776 0.50 0.28 1.0546 1.53 10234 ,1.03 1027 3.23 10217 ,3.33 10210 TCF7L2rs13266634 8 118,253,964 C C 0.254 0.301 1.186 0.25 1.536 0.31 0.24 1.0089 6.13 1028 5.03 1027 2.13 1025 1.83 1025 SLC30A8rs1111875 10 94,452,862 G G 0.358 0.402 1.196 0.19 1.446 0.24 0.19 1.0069 3.03 1026 7.43 1026 9.13 1026 7.33 1026 HHEXrs7923837 10 94,471,897 G G 0.335 0.377 1.226 0.21 1.456 0.25 0.20 1.0065 7.53 1026 2.23 1025 3.43 1026 2.53 1026 HHEXrs7480010 11 42,203,294 G A 0.336 0.301 1.146 0.13 1.406 0.25 0.08 1.0041 1.13 1024 2.93 1024 1.53 1025 1.23 1025 LOC387761rs3740878 11 44,214,378 A A 0.240 0.272 1.266 0.29 1.466 0.33 0.24 1.0046 1.23 1024 2.83 1024 1.83 1025 1.33 1025 EXT2rs11037909 11 44,212,190 T T 0.240 0.271 1.276 0.30 1.476 0.33 0.25 1.0045 1.83 1024 4.53 1024 1.83 1025 1.33 1025 EXT2rs1113132 11 44,209,979 C C 0.237 0.267 1.156 0.27 1.366 0.31 0.19 1.0044 3.33 1024 8.13 1024 3.73 1025 2.93 1025 EXT2

Significant T2DM associations were confirmed for eight SNPs in five loci. Allele frequencies, odds ratios (with 95% confidence intervals) and PAR were calculated using only the stage 2 data. Allelefrequencies in the controls were very close to those reported for the CEU set (European subjects genotyped in the HapMap project). Induced sibling recurrent risk ratios (ls) were estimated usingstage 2 genotype counts for the control subjects and assuming a T2DM prevalence of 7% in the French population. hom, homozygous; het, heterozygous; major allele, the allele with the higherfrequency in controls; pMAX, P-value of the MAX statistic from the x2 distribution; pMAX (perm), P-value of the MAX statistic from the permutation-derived empirical distribution (pMAX andpMAX (perm) are adjusted for variance inflation); risk allele, the allele with higher frequency in cases compared with controls.

**

*

024

–log10 [P]

–log10 [P]*

4954642sr 237 3971sr 3 37 3971 sr

44 5 40 9s r 801 2261sr 334 99 41s r

88 34 29sr 20 19462sr 034 9941sr

9035 0501sr 036169sr

0415 007sr 22 25 991s r 6136642sr 8 13664 2s r 186964 6sr 8798751sr

0 492 8201sr 3 9266 42s r 5 926642s r

4 3666231sr 9926642sr 2954642 sr

0 1350 501s r 5769 646 sr 4 577187s r 4 76 964 6sr

41350 501sr 5784931sr 2173387 sr

39250 501sr 5050007s r 7 492 602 sr 12550 51sr

15 6868sr 4 373 38 7sr 4 784 931sr 750 1107sr 269 7402sr

91518711sr 646 1001sr

2925 0501sr 5889103sr 866964 6sr 0889 103sr 468839 2sr

SLC30A8 IDE HHEXKIF11

**

**

**

024

* *

5470942sr 7 602242sr

2 8178111sr 1570942sr 239442 4sr 88 38 141s r

76029511sr 37178111sr

2945 391s r 2608842sr

646 905 01sr 15379 42sr 2950249sr 0339351 sr 17088 42s r

1 95 749s r 40 3794 2sr 1137942sr 73832 97sr 5781111sr 9275 722s r 9537 197sr 63420 97sr 0383856sr 0990 707sr 418 41 97 sr

1 902 8801sr 91 25722 sr

880 28801sr 19 740 64sr 5374283sr

5 346 5221s r 6283856sr 5058573sr 3679 991sr 1118097s r 34 91242s r

46078 111sr 06078 111s r

7 912381sr 3148707sr 028 38 56s r

52078 111sr 5227373sr 049 1242sr 2 3694 12sr 2297881sr

66 2155 sr 77901 97sr

44068701sr 350 75221sr

5826807sr 7 851 092sr 940 95 22sr

–log10 [P]

–log10 [P]EXT2 ALX4

****

*** *

024

*** *

**** 5704541sr 5194283sr 9085574sr 65 68281s r 3279397sr

832 978s r 4 953 102 sr

92096701sr 72973011sr

3 601 683sr 7609497sr 2498017s r 1265 297sr 5082083sr 79755 74sr 216 249 7sr

50 07 98 s r 7829 27s r

4975 5 74sr 878 0473sr

90 97 30 11 sr 2313 11 1sr 432 5574sr 3325574sr 3700 931sr 7 87 76 02 sr

45 47 06 11 sr 88755 7 4sr 87 7421 7sr 7945846s r 2705393sr 28 7557 4sr 4 38 97 34 sr

2 8873011sr 1 8873011sr

8315397sr 64 673 2 4sr 97 8111 7sr

53 2838 01 sr 0 23 83 97s r 8083293sr 3895701sr 9 20 58 22s r 57 27013 sr 4727013sr 1404702sr

D!

0 0.2 0.4 0.6 0.8 1

****

****

024

39863011sr 1601683 sr 9501683s r 03765 74s r 627 6574sr 60886701sr

7536017sr 977 183 1s r 1258051 sr 288630 11sr 4 985 747 1sr

97 626 41sr 1 19814 21s r

6856781sr 6642682sr 3 387 3801sr

445974 7sr 72863011s r

075493 7sr 4 377 397sr 8350847sr 47 706 101 sr

1 158 53 7sr 96558221s r

7035 846sr 010 0847sr 9 1654 44 sr 1 400 039sr 0431101sr 78786701sr

498 10 9sr 01 47 27sr 4245734 sr 90 612 42 1sr

0 1688 41 sr 27773801sr

8307831s r 93678221sr

112809sr 86773 801sr

367 6061sr 8433351s r 9 925 846sr 6243101sr 51 84151sr 9 387 21 7sr 8083172sr 18210501 sr 63 663 01 1sr 3 162 47 01sr

LOC387761

a b

c d

Figure 2 | Pairwise linkage disequilibrium diagrams for four novel T2DM-associated loci. D9 was calculated from the stage 1 genotyping data as afraction of observed linkage disequilibrium over the maximal possible. Thebar graph indicates the negative logarithm of the stage 1 P-value for each

SNP. Transcriptional units are indicated by green lines, with exonshighlighted in orange. Blue asterisksmark the SNPs chosen for confirmatorystudies. a, SLC30A8; b, IDE–KIF11–HHEX; c, EXT2–ALX4; d, LOC387761.

NATURE |Vol 445 |22 February 2007 ARTICLES

883Nature ©2007 Publishing Group

Sladek et al., Nature Genetics. 2007

Correlation between occurrence of genetic loci

In GWAS, allows one to trace to the “causal” locus.

27Wednesday, December 14, 11

Page 28: Proof-of-Concept of Environment-Wide Association …nas-sites.org/emergingscience/files/2011/12/Chirag-Patel.pdfProof-of-Concept of Environment-Wide Association Studies (EWAS) on Common

Red: positive ρBlue: negative ρthickness: |ρ|

for each pair:Partial ρ

age, BMI (serum)age, creatinine (urine)

cotinine

hydrocarbons

volatile organics

virus

bacteria

phthalatesPFCs

heavy metals

pcbs

dioxins

furans

deet

organochlorine

organophosphates

phenol pesticides

pyrethroiddiakyl carotenoids

mineral nutrientsvitamin A

vitamin B

vitamin C

vitamin D

vitamin E

phytoestrogens

ρ=0.3

ρ=0.1

ρ=0.4

ρ=0.6

ρ=0.3

ρ= -0.2

ρ=0.2

ρ=0.3ρ=0ρ=0.5ρ=0.3

ρ=0.4

ρ=0.2

ρ=0.2

ρ=0.3

ρ=0.4

ρ=0.4

ρ=0.4

ρ=0.2

ρ=0.3 ρ=0.3

phenolsρ=0

Aim 3: EWAS and Validity

“LD” of Exposure Biomarkers in NHANES

permuted data to produce“null ρ”

filtered those ρ > .3 or < -0.2 (5th, 95th percentile)

sought replication in > 1 cohort

pcbs & organochlorinesρ= 0.4

(Type 2 Diabetes)

cotinine & hydrocarbonsρ= 0.6

(HDL-Cholesterol)

Patel et al. (2011). Canonical Correlation of Biomarkers for Environmental Exposures. In Review.

28Wednesday, December 14, 11

Page 29: Proof-of-Concept of Environment-Wide Association …nas-sites.org/emergingscience/files/2011/12/Chirag-Patel.pdfProof-of-Concept of Environment-Wide Association Studies (EWAS) on Common

EWAS:Conclusion and Discussion

• generalizable, comprehensive, transparent, and systematic study of environment

• novel associations for T2D and HDL-C

• effects on disease are on par with genetics, calling for large-scale exposomic study

• However: confounding, reverse causal biases: need longitudinal and follow-up studies.

• Correlative web is dense: what is the “LD” of the the exposome?

−log

10(p

valu

e)

●●

●●●

●●

●● ●●

●●●

●●●

●●

●●

●●

●●

●●●

●●●

●●●●

●●

●●●●●

●●●

●●

●●

●●

●●

●●●

●●●●

●●

●●

●●

●●

●●

●●● ●

●●

● ●

●●●●●

●●

●●● ●●

●●

●●

●●

●●

●●

●●●

●●

●●

●●●

●●

●●●●

●●

●●●

●●

●●●●●●

●●

●●

acry

lam

ide

alle

rgen

test

bact

eria

l inf

ectio

n

cotin

ine

diak

yl

diox

ins

fura

ns d

iben

zofu

ran

heav

y m

etal

s

hydr

ocar

bons

late

x

nutri

ents

car

oten

oid

nutri

ents

min

eral

snu

trien

ts v

itam

in A

nutri

ents

vita

min

Bnu

trien

ts v

itam

in C

nutri

ents

vita

min

Dnu

trien

ts v

itam

in E

pcbs

perc

hlor

ate

pest

icid

es a

trazi

nepe

stic

ides

car

bam

ate

pest

icid

es c

hlor

ophe

nol

pest

icid

es o

rgan

ochl

orin

epe

stic

ides

org

anop

hosp

hate

pest

icid

es p

yret

hyro

id

phen

ols

phth

alat

es

phyt

oest

roge

ns

poly

brom

inat

ed e

ther

spo

lyflo

uroc

hem

ical

s

vira

l inf

ectio

n

vola

tile

com

poun

ds

01

23

4

heptachlor epoxideγ-tocopherol

HDL-C: 1-10 mg/dLT2D: ~2-3 OR

Aim 4: Conclusion

29Wednesday, December 14, 11

Page 30: Proof-of-Concept of Environment-Wide Association …nas-sites.org/emergingscience/files/2011/12/Chirag-Patel.pdfProof-of-Concept of Environment-Wide Association Studies (EWAS) on Common

Proof of Concept of EWAS

1. Background and Methods

2. Examples: Type 2 Diabetes, Serum Lipid Levels

3. Checking Validity and An “LD” Map of the Environment?

4. Conclusion

5. Informatics for the Environment

30Wednesday, December 14, 11

Page 31: Proof-of-Concept of Environment-Wide Association …nas-sites.org/emergingscience/files/2011/12/Chirag-Patel.pdfProof-of-Concept of Environment-Wide Association Studies (EWAS) on Common

Human Genome Project to GWAS

Sequencing of the genome

2001

HapMap project:http://hapmap.ncbi.nlm.nih.gov/

Characterize common variation

2001-2003 (ongoing)

Measurement tools

“Variant SNP chip”~$400 for ~100,000 variants

~2003 (ongoing)

ARTICLES

Genome-wide association study of 14,000cases of seven common diseases and3,000 shared controlsThe Wellcome Trust Case Control Consortium*

There is increasing evidence that genome-wide association (GWA) studies represent a powerful approach to theidentification of genes involved in common human diseases.We describe a joint GWAstudy (using the Affymetrix GeneChip500KMapping Array Set) undertaken in the British population, which has examined,2,000 individuals for each of 7 majordiseases and a shared set of ,3,000 controls. Case-control comparisons identified 24 independent association signals atP, 53 1027: 1 in bipolar disorder, 1 in coronary artery disease, 9 in Crohn’s disease, 3 in rheumatoid arthritis, 7 in type 1diabetes and 3 in type 2 diabetes. On the basis of prior findings and replication studies thus-far completed, almost all of thesesignals reflect genuine susceptibility effects. We observed association at many previously identified loci, and foundcompelling evidence that some loci confer risk for more than one of the diseases studied. Across all diseases, we identified alarge number of further signals (including 58 loci with single-point P values between 1025 and 53 1027) likely to yieldadditional susceptibility loci. The importance of appropriately large samples was confirmed by the modest effect sizesobserved at most loci identified. This study thus represents a thorough validation of the GWA approach. It has alsodemonstrated that careful use of a shared control group represents a safe and effective approach to GWA analyses ofmultiple disease phenotypes; has generated a genome-wide genotype database for future studies of common diseases in theBritish population; and shown that, provided individuals with non-European ancestry are excluded, the extent of populationstratification in the British population is generally modest. Our findings offer new avenues for exploring the pathophysiologyof these important disorders. We anticipate that our data, results and software, which will be widely available to otherinvestigators, will provide a powerful resource for human genetics research.

Despite extensive research efforts for more than a decade, the geneticbasis of common humandiseases remains largely unknown. Althoughthere have been some notable successes1, linkage and candidate geneassociation studies have often failed to deliver definitive results. Yetthe identification of the variants, genes and pathways involved inparticular diseases offers a potential route to new therapies, improveddiagnosis and better disease prevention. For some time it has beenhoped that the advent of genome-wide association (GWA) studieswould provide a successful new tool for unlocking the genetic basisof many of these common causes of humanmorbidity andmortality1.

Three recent advances mean that GWA studies that are powered todetect plausible effect sizes are now possible2. First, the InternationalHapMap resource3, which documents patterns of genome-wide vari-ation and linkage disequilibrium in four population samples, greatlyfacilitates both the design and analysis of association studies. Second,the availability of dense genotyping chips, containing sets of hundreds ofthousands of single nucleotide polymorphisms (SNPs) that providegood coverage of much of the human genome, means that for the firsttimeGWAstudies for thousandsof cases andcontrols are technically andfinancially feasible. Third, appropriately large and well-characterizedclinical samples have been assembled for many common diseases.

The Wellcome Trust Case Control Consortium (WTCCC) wasformed with a view to exploring the utility, design and analyses ofGWA studies. It brought together over 50 research groups from theUK that are active in researching the genetics of common humandiseases, with expertise ranging from clinical, through genotyping, to

informatics and statistical analysis. Here we describe the main experi-ment of the consortium: GWA studies of 2,000 cases and 3,000 sharedcontrols for 7 complex human diseases of major public health import-ance—bipolar disorder (BD), coronary artery disease (CAD), Crohn’sdisease (CD), hypertension (HT), rheumatoid arthritis (RA), type 1diabetes (T1D), and type 2 diabetes (T2D). Two further experimentsundertaken by the consortium will be reported elsewhere: a GWAstudy for tuberculosis in 1,500 cases and 1,500 controls, sampled fromThe Gambia; and an association study of 1,500 common controls with1,000 cases for each of breast cancer, multiple sclerosis, ankylosingspondylitis and autoimmune thyroid disease, all typed at around15,000 mainly non-synonymous SNPs. By simultaneously studyingseven diseases with differing aetiologies, we hoped to develop insights,not only into the specific genetic contributions to each of the diseases,but also into differences in allelic architecture across the diseases. Afurther major aim was to address important methodological issues ofrelevance to all GWA studies, such as quality control, design and ana-lysis. In addition to our main association results, we address several ofthese issues below, including the choice of controls for genetic studies,the extent of population structure within Great Britain, sample sizesnecessary to detect genetic effects of varying sizes, and improvements ingenotype-calling algorithms and analytical methods.

Samples and experimental analyses

Individuals included in the study were living within England,Scotland and Wales (‘Great Britain’) and the vast majority had

*Lists of participants and affiliations appear at the end of the paper.

Vol 447 |7 June 2007 |doi:10.1038/nature05911

661Nature ©2007 Publishing Group

WTCCC, Nature, 2008.

Comprehensive, high-throughput analyses

31Wednesday, December 14, 11

Page 32: Proof-of-Concept of Environment-Wide Association …nas-sites.org/emergingscience/files/2011/12/Chirag-Patel.pdfProof-of-Concept of Environment-Wide Association Studies (EWAS) on Common

Human Genome Project:Informatics Contributions

http://genome.ucsc.edu

http://www.ncbi.nlm.nih.gov/projects/SNP/

http://www.ncbi.nlm.nih.gov/geo/

http://hapmap.ncbi.nlm.nih.gov/

http://www.ncbi.nlm.nih.gov/gap

32Wednesday, December 14, 11

Page 33: Proof-of-Concept of Environment-Wide Association …nas-sites.org/emergingscience/files/2011/12/Chirag-Patel.pdfProof-of-Concept of Environment-Wide Association Studies (EWAS) on Common

Gene Expression Omnibus

required submission

> 600,000 individuals (!)

enabled new investigationscollaboration & transparency

driven standardization

What if we had the same for environmental exposure measures?

http://www.ncbi.nlm.nih.gov/geo/

Informatics for the Environment?A GEO equivalent?

accessed 12/1/2011

33Wednesday, December 14, 11

Page 34: Proof-of-Concept of Environment-Wide Association …nas-sites.org/emergingscience/files/2011/12/Chirag-Patel.pdfProof-of-Concept of Environment-Wide Association Studies (EWAS) on Common

Acknowledgments

•Atul Butte

• Jay Bhattacharya

•Mark Cullen

• John Ioannidis

•Rob Tibshirani

•Biomedical Informatics at Stanford

•CDC / NCHS

•NHANES and participants!

•National Academy of Sciences• Steve Rappaport

•National Library of Medicine

•March of Dimes Preterm Birth Research Center

• Stanford Prevention Center /SPECTRUM Population Studies Innovation Grant

email: [email protected]: @chiragjp

34Wednesday, December 14, 11