Upload
others
View
0
Download
0
Embed Size (px)
Citation preview
Proof-of-Concept of Environment-Wide Association Studies (EWAS)
on Common Disease
Chirag PatelDivision of Systems Medicine, Department of Pediatrics
Stanford University School of Medicine
National Academy of Sciences Exposome Workshop12/8/2011
[email protected]: @chiragjp
1Wednesday, December 14, 11
...but, lack methods and data to comprehensively and systematically connect the environment with common disease.
... and we’re exposed to many environmental factors...
Common disease is a function of genes and environment...
dad mom
me
2Wednesday, December 14, 11
... and the case is different with genetics (e.g., genomics)!
dad mom
me
3Wednesday, December 14, 11
Common disease is function of genes and environment...
... yet the target of investigation is biased toward genetics!
1970 1980 1990 2000
0200
400
600
800
1000
1200
year published
tota
l pub
licat
ions
geneticsenvironment
WHO prioritized diseases:investigating either
“genetics”or “environment”in MEDLINE
type 2 diabetescardiovascular disease
kidney diseaselung, colon, and prostate cancer
asthmaCOPD
preterm birthsAlzheimer Disease
Patel et al. (2011). Data-driven creation of hypothesis in non-genomic studies. In Review.
4Wednesday, December 14, 11
Human Genome Project to GWAS
Sequencing of the genome
2001
HapMap project:http://hapmap.ncbi.nlm.nih.gov/
Characterize common variation
2001-2003 (ongoing)
“Variant SNP chip”~$400 for ~100,000 variants
Measurement tools
~2003 (ongoing)
ARTICLES
Genome-wide association study of 14,000cases of seven common diseases and3,000 shared controlsThe Wellcome Trust Case Control Consortium*
There is increasing evidence that genome-wide association (GWA) studies represent a powerful approach to theidentification of genes involved in common human diseases.We describe a joint GWAstudy (using the Affymetrix GeneChip500KMapping Array Set) undertaken in the British population, which has examined,2,000 individuals for each of 7 majordiseases and a shared set of ,3,000 controls. Case-control comparisons identified 24 independent association signals atP, 53 1027: 1 in bipolar disorder, 1 in coronary artery disease, 9 in Crohn’s disease, 3 in rheumatoid arthritis, 7 in type 1diabetes and 3 in type 2 diabetes. On the basis of prior findings and replication studies thus-far completed, almost all of thesesignals reflect genuine susceptibility effects. We observed association at many previously identified loci, and foundcompelling evidence that some loci confer risk for more than one of the diseases studied. Across all diseases, we identified alarge number of further signals (including 58 loci with single-point P values between 1025 and 53 1027) likely to yieldadditional susceptibility loci. The importance of appropriately large samples was confirmed by the modest effect sizesobserved at most loci identified. This study thus represents a thorough validation of the GWA approach. It has alsodemonstrated that careful use of a shared control group represents a safe and effective approach to GWA analyses ofmultiple disease phenotypes; has generated a genome-wide genotype database for future studies of common diseases in theBritish population; and shown that, provided individuals with non-European ancestry are excluded, the extent of populationstratification in the British population is generally modest. Our findings offer new avenues for exploring the pathophysiologyof these important disorders. We anticipate that our data, results and software, which will be widely available to otherinvestigators, will provide a powerful resource for human genetics research.
Despite extensive research efforts for more than a decade, the geneticbasis of common humandiseases remains largely unknown. Althoughthere have been some notable successes1, linkage and candidate geneassociation studies have often failed to deliver definitive results. Yetthe identification of the variants, genes and pathways involved inparticular diseases offers a potential route to new therapies, improveddiagnosis and better disease prevention. For some time it has beenhoped that the advent of genome-wide association (GWA) studieswould provide a successful new tool for unlocking the genetic basisof many of these common causes of humanmorbidity andmortality1.
Three recent advances mean that GWA studies that are powered todetect plausible effect sizes are now possible2. First, the InternationalHapMap resource3, which documents patterns of genome-wide vari-ation and linkage disequilibrium in four population samples, greatlyfacilitates both the design and analysis of association studies. Second,the availability of dense genotyping chips, containing sets of hundreds ofthousands of single nucleotide polymorphisms (SNPs) that providegood coverage of much of the human genome, means that for the firsttimeGWAstudies for thousandsof cases andcontrols are technically andfinancially feasible. Third, appropriately large and well-characterizedclinical samples have been assembled for many common diseases.
The Wellcome Trust Case Control Consortium (WTCCC) wasformed with a view to exploring the utility, design and analyses ofGWA studies. It brought together over 50 research groups from theUK that are active in researching the genetics of common humandiseases, with expertise ranging from clinical, through genotyping, to
informatics and statistical analysis. Here we describe the main experi-ment of the consortium: GWA studies of 2,000 cases and 3,000 sharedcontrols for 7 complex human diseases of major public health import-ance—bipolar disorder (BD), coronary artery disease (CAD), Crohn’sdisease (CD), hypertension (HT), rheumatoid arthritis (RA), type 1diabetes (T1D), and type 2 diabetes (T2D). Two further experimentsundertaken by the consortium will be reported elsewhere: a GWAstudy for tuberculosis in 1,500 cases and 1,500 controls, sampled fromThe Gambia; and an association study of 1,500 common controls with1,000 cases for each of breast cancer, multiple sclerosis, ankylosingspondylitis and autoimmune thyroid disease, all typed at around15,000 mainly non-synonymous SNPs. By simultaneously studyingseven diseases with differing aetiologies, we hoped to develop insights,not only into the specific genetic contributions to each of the diseases,but also into differences in allelic architecture across the diseases. Afurther major aim was to address important methodological issues ofrelevance to all GWA studies, such as quality control, design and ana-lysis. In addition to our main association results, we address several ofthese issues below, including the choice of controls for genetic studies,the extent of population structure within Great Britain, sample sizesnecessary to detect genetic effects of varying sizes, and improvements ingenotype-calling algorithms and analytical methods.
Samples and experimental analyses
Individuals included in the study were living within England,Scotland and Wales (‘Great Britain’) and the vast majority had
*Lists of participants and affiliations appear at the end of the paper.
Vol 447 |7 June 2007 |doi:10.1038/nature05911
661Nature ©2007 Publishing Group
WTCCC, Nature, 2008.
Comprehensive, high-throughput analyses
5Wednesday, December 14, 11
HypothesisApplying genome-based methods to the environment
We claim that comprehensive connection of environmental
factors to disease is practicable using high-throughput analysis
methods, now common in genome-based investigations
e.g., Environment-wide Association Study (EWAS)
6Wednesday, December 14, 11
Proof of Concept of EWAS
1. Background and Methods
2. Examples: Type 2 Diabetes, Serum Lipid Levels
3. Checking Validity and An “LD” Map of the Environment?
4. Conclusions
5. Informatics for the Environment
7Wednesday, December 14, 11
Connecting Disease with Exposures:Drawbacks in Environmental Epidemiology
“candidate” E factors
multiple hypotheses often ignored
selective reporting
1. Ioannidis et al. Science Translational Medicine, 2009. 1 (7) p. 62. Boffetta, P., et al.,. J Natl Cancer Inst, 2008. 100(14): p. 988-95.3. Young, S.S., Int J Epidemiol, 2010. 39(3): p. 9344. Taubes, G.. Science, 1995. 269(5221): p. 164-9.
The lack of comprehension has led to a fragmented literature of environmental associations1,2,3,4
Genome-wide epidemiology has overcome some of these drawbacks1
E+ E-
diseased
non-diseased
Aim 1: EWAS Methods
8Wednesday, December 14, 11
Genome-Wide Association Studies (GWAS)
Methods), and excluded 153 individuals on this basis. We nextlooked for evidence of population heterogeneity by studying allelefrequency differences between the 12 broad geographical regions(defined in Supplementary Fig. 4). The results for these 11-d.f. testsand associated quantile-quantile plots are shown in Fig. 2. Wide-spread small differences in allele frequencies are evident as anincreased slope of the line (Fig. 2b); in addition, a few loci showmuchlarger differences (Fig. 2a and Supplementary Fig. 6).
Thirteen genomic regions showing strong geographical variationare listed in Table 1, and Supplementary Fig. 7 shows theway in whichtheir allele frequencies vary geographically. The predominant patternis variation along a NW/SE axis. The most likely cause for thesemarked geographical differences is natural selection, most plausiblyin populations ancestral to those now in the UK. Variation due toselection has previously been implicated at LCT (lactase) and majorhistocompatibility complex (MHC)7–9, andwithin-UKdifferentiationat 4p14 has been found independently10, but others seem to be newfindings. All but three of the regions contain known genes. Aside from
evolutionary interest, genes showing evidence of natural selection areparticularly interesting for the biology of traits such as infectious dis-eases; possible targets for selection include NADSYN1 (NAD synthe-tase 1) at 11q13, which could have a role in prevention of pellagra, aswell as TLR1 (toll-like receptor 1) at 4p14, for which a role in thebiology of tuberculosis and leprosy has been suggested10.
There may be important population structure that is not wellcaptured by current geographical region of residence. Presentimplementations of strongly model-based approaches such asSTRUCTURE11,12 are impracticable for data sets of this size, and wereverted to the classical method of principal components13,14, using asubset of 197,175 SNPs chosen to reduce inter-locus linkage disequi-librium. Nevertheless, four of the first six principal componentsclearly picked up effects attributable to local linkage disequilibriumrather than genome-wide structure. The remaining two componentsshow the same predominant geographical trend from NW to SE but,perhaps unsurprisingly, London is set somewhat apart (Supplemen-tary Fig. 8).
The overall effect of population structure on our associationresults seems to be small, once recent migrants from outsideEurope are excluded. Estimates of over-dispersion of the associationtrend test statistics (usually denoted l; ref. 15) ranged from 1.03 and1.05 for RA and T1D, respectively, to 1.08–1.11 for the remainingdiseases. Some of this over-dispersion could be due to factors otherthan structure, and this possibility is supported by the fact that inclu-sion of the two ancestry informative principal components as cov-ariates in the association tests reduced the over-dispersion estimatesonly slightly (Supplementary Table 6), as did stratification by geo-graphical region. This impression is confirmed on noting thatP values with and without correction for structure are similar(Supplementary Fig. 9). We conclude that, for most of the genome,population structure has at most a small confounding effect in ourstudy, and as a consequence the analyses reported below do notcorrect for structure. In principle, apparent associations in the fewgenomic regions identified in Table 1 as showing strong geographicaldifferentiation should be interpreted with caution, but none arose inour analyses.
Disease association results
We assessed evidence for association in several ways (see Methods fordetails), drawing on both classical and bayesian statistical approaches.For polymorphic SNPs on the Affymetrix chip, we performed trendtests (1 degree of freedom16) and general genotype tests (2 degrees offreedom16, referred to as genotypic) between each case collection andthe pooled controls, and calculated analogous Bayes factors. Thereare examples from animal models where genetic effects act differentlyin males and females17, and to assess this in our data we applied a
!log
10(P
)
0
5
10
15
Chromosome
22 X212019181716151413121110987654321
3020
20
100
0
40
80
60
40
100
Obs
erve
d te
st s
tatis
tic
Expected chi-squared value
a
b
Figure 2 | Genome-wide picture of geographic variation. a, P values for the11-d.f. test for difference in SNP allele frequencies between geographicalregions, within the 9 collections. SNPs have been excluded using the projectquality control filters described inMethods. Green dots indicate SNPs with aP value,13 1025. b, Quantile-quantile plots of these test statistics. SNPs atwhich the test statistic exceeds 100 are represented by triangles at the top ofthe plot, and the shaded region is the 95% concentration band (seeMethods). Also shown in blue is the quantile-quantile plot resulting fromremoval of all SNPs in the 13 most differentiated regions (Table 1).
Table 1 | Highly differentiated SNPs
Chromosome Genes Region (Mb) SNP Position P value
2q21 LCT 135.16–136.82 rs1042712 136,379,576 5.54 3 10213
4p14 TLR1, TLR6, TLR10 38.51–38.74 rs7696175 386,43,552 1.51 3 10212
4q28 137.97–138.01 rs1460133 137,999,953 4.43 3 10208
6p25 IRF4 0.32–0.42 rs9378805 362,727 5.39 3 10213
6p21 HLA 31.10–31.55 rs3873375 31,359,339 1.07 3 10211
9p24 DMRT1 0.86–0.88 rs11790408 866,418 4.96 3 10207
11p15 NAV2 19.55–19.70 rs12295525 19,661,808 7.44 3 10208
11q13 NADSYN1, DHCR7 70.78–70.93 rs12797951 70,820,914 3.01 3 10208
12p13 DYRK4,AKAP3,NDUFA9,RAD51AP1,GALNT8
4.37–4.82 rs10774241 45,537,27 2.73 3 10208
14q12 HECTD1,AP4S1,STRN3 30.41–31.03 rs17449560 30,598,823 1.46 3 10207
19q13 GIPR,SNRPD2,QPCTL,SIX5,DMPK,DMWD,RSHL1,SYMPK,FOXA3
50.84–51.09 rs3760843 50,980,546 4.19 3 10207
20q12 38.30–38.77 rs2143877 38,526,309 1.12 3 10209
Xp22 2.06–2.08 rs6644913 2,061,160 1.23 3 10207
Properties of SNPs that show large allele frequency differences between samples of individuals from 12 regions across Great Britain. Regions showing differentiated SNPs are givenwith details of theSNPwith the smallest P value in each region for differentiation on the 11-d.f. test of differences in SNP allele frequencies between geographical regions, within the 9 collections. Cluster plots for theseSNPs have been examined visually. Signal plots appear in Supplementary Information. Positions are in NCBI build-35 coordinates.
NATURE |Vol 447 |7 June 2007 ARTICLES
663Nature ©2007 Publishing Group
WTCCC, 2007
What genetic loci are associated to disease?
AA Aa aacase
control
~100,000 - 1,000,000 association tests
Aim 1: EWAS Methods
9Wednesday, December 14, 11
Methods), and excluded 153 individuals on this basis. We nextlooked for evidence of population heterogeneity by studying allelefrequency differences between the 12 broad geographical regions(defined in Supplementary Fig. 4). The results for these 11-d.f. testsand associated quantile-quantile plots are shown in Fig. 2. Wide-spread small differences in allele frequencies are evident as anincreased slope of the line (Fig. 2b); in addition, a few loci showmuchlarger differences (Fig. 2a and Supplementary Fig. 6).
Thirteen genomic regions showing strong geographical variationare listed in Table 1, and Supplementary Fig. 7 shows theway in whichtheir allele frequencies vary geographically. The predominant patternis variation along a NW/SE axis. The most likely cause for thesemarked geographical differences is natural selection, most plausiblyin populations ancestral to those now in the UK. Variation due toselection has previously been implicated at LCT (lactase) and majorhistocompatibility complex (MHC)7–9, andwithin-UKdifferentiationat 4p14 has been found independently10, but others seem to be newfindings. All but three of the regions contain known genes. Aside from
evolutionary interest, genes showing evidence of natural selection areparticularly interesting for the biology of traits such as infectious dis-eases; possible targets for selection include NADSYN1 (NAD synthe-tase 1) at 11q13, which could have a role in prevention of pellagra, aswell as TLR1 (toll-like receptor 1) at 4p14, for which a role in thebiology of tuberculosis and leprosy has been suggested10.
There may be important population structure that is not wellcaptured by current geographical region of residence. Presentimplementations of strongly model-based approaches such asSTRUCTURE11,12 are impracticable for data sets of this size, and wereverted to the classical method of principal components13,14, using asubset of 197,175 SNPs chosen to reduce inter-locus linkage disequi-librium. Nevertheless, four of the first six principal componentsclearly picked up effects attributable to local linkage disequilibriumrather than genome-wide structure. The remaining two componentsshow the same predominant geographical trend from NW to SE but,perhaps unsurprisingly, London is set somewhat apart (Supplemen-tary Fig. 8).
The overall effect of population structure on our associationresults seems to be small, once recent migrants from outsideEurope are excluded. Estimates of over-dispersion of the associationtrend test statistics (usually denoted l; ref. 15) ranged from 1.03 and1.05 for RA and T1D, respectively, to 1.08–1.11 for the remainingdiseases. Some of this over-dispersion could be due to factors otherthan structure, and this possibility is supported by the fact that inclu-sion of the two ancestry informative principal components as cov-ariates in the association tests reduced the over-dispersion estimatesonly slightly (Supplementary Table 6), as did stratification by geo-graphical region. This impression is confirmed on noting thatP values with and without correction for structure are similar(Supplementary Fig. 9). We conclude that, for most of the genome,population structure has at most a small confounding effect in ourstudy, and as a consequence the analyses reported below do notcorrect for structure. In principle, apparent associations in the fewgenomic regions identified in Table 1 as showing strong geographicaldifferentiation should be interpreted with caution, but none arose inour analyses.
Disease association results
We assessed evidence for association in several ways (see Methods fordetails), drawing on both classical and bayesian statistical approaches.For polymorphic SNPs on the Affymetrix chip, we performed trendtests (1 degree of freedom16) and general genotype tests (2 degrees offreedom16, referred to as genotypic) between each case collection andthe pooled controls, and calculated analogous Bayes factors. Thereare examples from animal models where genetic effects act differentlyin males and females17, and to assess this in our data we applied a
!log
10(P
)
0
5
10
15
Chromosome
22 X212019181716151413121110987654321
3020
20
100
0
40
80
60
40
100
Obs
erve
d te
st s
tatis
tic
Expected chi-squared value
a
b
Figure 2 | Genome-wide picture of geographic variation. a, P values for the11-d.f. test for difference in SNP allele frequencies between geographicalregions, within the 9 collections. SNPs have been excluded using the projectquality control filters described inMethods. Green dots indicate SNPs with aP value,13 1025. b, Quantile-quantile plots of these test statistics. SNPs atwhich the test statistic exceeds 100 are represented by triangles at the top ofthe plot, and the shaded region is the 95% concentration band (seeMethods). Also shown in blue is the quantile-quantile plot resulting fromremoval of all SNPs in the 13 most differentiated regions (Table 1).
Table 1 | Highly differentiated SNPs
Chromosome Genes Region (Mb) SNP Position P value
2q21 LCT 135.16–136.82 rs1042712 136,379,576 5.54 3 10213
4p14 TLR1, TLR6, TLR10 38.51–38.74 rs7696175 386,43,552 1.51 3 10212
4q28 137.97–138.01 rs1460133 137,999,953 4.43 3 10208
6p25 IRF4 0.32–0.42 rs9378805 362,727 5.39 3 10213
6p21 HLA 31.10–31.55 rs3873375 31,359,339 1.07 3 10211
9p24 DMRT1 0.86–0.88 rs11790408 866,418 4.96 3 10207
11p15 NAV2 19.55–19.70 rs12295525 19,661,808 7.44 3 10208
11q13 NADSYN1, DHCR7 70.78–70.93 rs12797951 70,820,914 3.01 3 10208
12p13 DYRK4,AKAP3,NDUFA9,RAD51AP1,GALNT8
4.37–4.82 rs10774241 45,537,27 2.73 3 10208
14q12 HECTD1,AP4S1,STRN3 30.41–31.03 rs17449560 30,598,823 1.46 3 10207
19q13 GIPR,SNRPD2,QPCTL,SIX5,DMPK,DMWD,RSHL1,SYMPK,FOXA3
50.84–51.09 rs3760843 50,980,546 4.19 3 10207
20q12 38.30–38.77 rs2143877 38,526,309 1.12 3 10209
Xp22 2.06–2.08 rs6644913 2,061,160 1.23 3 10207
Properties of SNPs that show large allele frequency differences between samples of individuals from 12 regions across Great Britain. Regions showing differentiated SNPs are givenwith details of theSNPwith the smallest P value in each region for differentiation on the 11-d.f. test of differences in SNP allele frequencies between geographical regions, within the 9 collections. Cluster plots for theseSNPs have been examined visually. Signal plots appear in Supplementary Information. Positions are in NCBI build-35 coordinates.
NATURE |Vol 447 |7 June 2007 ARTICLES
663Nature ©2007 Publishing Group
Environment-Wide Association Studies (EWAS)
What specific environmental “loci” are associated to disease?:ie, T2D, lipid levels, obesity, etc?
Environmental CategoryVita
mins
β-carotene
Metals
lead
Organo
phos
phate
Pestici
des
chlorpyrifos
Hydroc
arbon
s
2-hydroxyfluorene [factor]
case
control
Aim 1: EWAS Methods
10Wednesday, December 14, 11
Methods), and excluded 153 individuals on this basis. We nextlooked for evidence of population heterogeneity by studying allelefrequency differences between the 12 broad geographical regions(defined in Supplementary Fig. 4). The results for these 11-d.f. testsand associated quantile-quantile plots are shown in Fig. 2. Wide-spread small differences in allele frequencies are evident as anincreased slope of the line (Fig. 2b); in addition, a few loci showmuchlarger differences (Fig. 2a and Supplementary Fig. 6).
Thirteen genomic regions showing strong geographical variationare listed in Table 1, and Supplementary Fig. 7 shows theway in whichtheir allele frequencies vary geographically. The predominant patternis variation along a NW/SE axis. The most likely cause for thesemarked geographical differences is natural selection, most plausiblyin populations ancestral to those now in the UK. Variation due toselection has previously been implicated at LCT (lactase) and majorhistocompatibility complex (MHC)7–9, andwithin-UKdifferentiationat 4p14 has been found independently10, but others seem to be newfindings. All but three of the regions contain known genes. Aside from
evolutionary interest, genes showing evidence of natural selection areparticularly interesting for the biology of traits such as infectious dis-eases; possible targets for selection include NADSYN1 (NAD synthe-tase 1) at 11q13, which could have a role in prevention of pellagra, aswell as TLR1 (toll-like receptor 1) at 4p14, for which a role in thebiology of tuberculosis and leprosy has been suggested10.
There may be important population structure that is not wellcaptured by current geographical region of residence. Presentimplementations of strongly model-based approaches such asSTRUCTURE11,12 are impracticable for data sets of this size, and wereverted to the classical method of principal components13,14, using asubset of 197,175 SNPs chosen to reduce inter-locus linkage disequi-librium. Nevertheless, four of the first six principal componentsclearly picked up effects attributable to local linkage disequilibriumrather than genome-wide structure. The remaining two componentsshow the same predominant geographical trend from NW to SE but,perhaps unsurprisingly, London is set somewhat apart (Supplemen-tary Fig. 8).
The overall effect of population structure on our associationresults seems to be small, once recent migrants from outsideEurope are excluded. Estimates of over-dispersion of the associationtrend test statistics (usually denoted l; ref. 15) ranged from 1.03 and1.05 for RA and T1D, respectively, to 1.08–1.11 for the remainingdiseases. Some of this over-dispersion could be due to factors otherthan structure, and this possibility is supported by the fact that inclu-sion of the two ancestry informative principal components as cov-ariates in the association tests reduced the over-dispersion estimatesonly slightly (Supplementary Table 6), as did stratification by geo-graphical region. This impression is confirmed on noting thatP values with and without correction for structure are similar(Supplementary Fig. 9). We conclude that, for most of the genome,population structure has at most a small confounding effect in ourstudy, and as a consequence the analyses reported below do notcorrect for structure. In principle, apparent associations in the fewgenomic regions identified in Table 1 as showing strong geographicaldifferentiation should be interpreted with caution, but none arose inour analyses.
Disease association results
We assessed evidence for association in several ways (see Methods fordetails), drawing on both classical and bayesian statistical approaches.For polymorphic SNPs on the Affymetrix chip, we performed trendtests (1 degree of freedom16) and general genotype tests (2 degrees offreedom16, referred to as genotypic) between each case collection andthe pooled controls, and calculated analogous Bayes factors. Thereare examples from animal models where genetic effects act differentlyin males and females17, and to assess this in our data we applied a
!log
10(P
)
0
5
10
15
Chromosome
22 X2120191817161514131211109876543213020
20
100
0
40
80
60
40
100
Obs
erve
d te
st s
tatis
tic
Expected chi-squared value
a
b
Figure 2 | Genome-wide picture of geographic variation. a, P values for the11-d.f. test for difference in SNP allele frequencies between geographicalregions, within the 9 collections. SNPs have been excluded using the projectquality control filters described inMethods. Green dots indicate SNPs with aP value,13 1025. b, Quantile-quantile plots of these test statistics. SNPs atwhich the test statistic exceeds 100 are represented by triangles at the top ofthe plot, and the shaded region is the 95% concentration band (seeMethods). Also shown in blue is the quantile-quantile plot resulting fromremoval of all SNPs in the 13 most differentiated regions (Table 1).
Table 1 | Highly differentiated SNPs
Chromosome Genes Region (Mb) SNP Position P value
2q21 LCT 135.16–136.82 rs1042712 136,379,576 5.54 3 10213
4p14 TLR1, TLR6, TLR10 38.51–38.74 rs7696175 386,43,552 1.51 3 10212
4q28 137.97–138.01 rs1460133 137,999,953 4.43 3 10208
6p25 IRF4 0.32–0.42 rs9378805 362,727 5.39 3 10213
6p21 HLA 31.10–31.55 rs3873375 31,359,339 1.07 3 10211
9p24 DMRT1 0.86–0.88 rs11790408 866,418 4.96 3 10207
11p15 NAV2 19.55–19.70 rs12295525 19,661,808 7.44 3 10208
11q13 NADSYN1, DHCR7 70.78–70.93 rs12797951 70,820,914 3.01 3 10208
12p13 DYRK4,AKAP3,NDUFA9,RAD51AP1,GALNT8
4.37–4.82 rs10774241 45,537,27 2.73 3 10208
14q12 HECTD1,AP4S1,STRN3 30.41–31.03 rs17449560 30,598,823 1.46 3 10207
19q13 GIPR,SNRPD2,QPCTL,SIX5,DMPK,DMWD,RSHL1,SYMPK,FOXA3
50.84–51.09 rs3760843 50,980,546 4.19 3 10207
20q12 38.30–38.77 rs2143877 38,526,309 1.12 3 10209
Xp22 2.06–2.08 rs6644913 2,061,160 1.23 3 10207
Properties of SNPs that show large allele frequency differences between samples of individuals from 12 regions across Great Britain. Regions showing differentiated SNPs are givenwith details of theSNPwith the smallest P value in each region for differentiation on the 11-d.f. test of differences in SNP allele frequencies between geographical regions, within the 9 collections. Cluster plots for theseSNPs have been examined visually. Signal plots appear in Supplementary Information. Positions are in NCBI build-35 coordinates.
NATURE |Vol 447 |7 June 2007 ARTICLES
663Nature ©2007 Publishing Group
Why “EWAS”?
What environmental factors are associated to disease?
Environmental Categorycomprehensive and transparent
multiplicity controlled
novel findings
(and validated)
Ioannidis JPAI, Loy EY, et al, (2009) Researching genetic vs nongenetic determinants of diseases: a comparison and proposed unification. Sci Trans Med vol. 1(7)7ps8
Aim 1: EWAS Methods
11Wednesday, December 14, 11
NHANES:National Health and Nutrition Examination Survey1
since the 1960s: 50 years!now biannual: 1999 onwards10,000 participants per cohort
A Massive, Ongoing, and Significant Public Health Survey
Introduction
The National Health and Nutrition Examination Survey (NHANES) is a program of studiesdesigned to assess the health and nutritional status of adults and children in the United States. The survey is unique in that it com-bines interviews and physical examinations. NHANES is a major program of the National Center for Health Statistics (NCHS). NCHS is part of the Centers for Disease Control and Prevention (CDC) and has the responsibility for producing vital and health statistics for the Nation.
The NHANES program began in the early 1960s and has been conducted as a series of sur-veys focusing on different population groups or health topics. In 1999, the survey became a con-tinuous program that has a changing focus on a variety of health and nutrition measurements to meet emerging needs. The survey examines a nationally representative sample of about 5,000 persons each year. These persons are located in counties across the country, 15 of which are visited each year.
The NHANES interview includes demographic, socioeconomic, dietary, and health-related questions. The examination component consists of medical, dental, and physiological measure-ments, as well as laboratory tests administered by highly trained medical personnel.
Findings from this survey will be used to de-termine the prevalence of major diseases and risk factors for diseases. Information will be used to assess nutritional status and its associ-ation with health promotion and disease pre-���������������!��������������������������for national standards for such measurements as height, weight, and blood pressure. Data from this survey will be used in epidemiologi-cal studies and health sciences research, which help develop sound public health policy,
direct and design health programs and services, and expand the health knowl-edge for the Nation.
Survey Content
As in past health examination surveys, data will be collected on the prevalence of chron-ic conditions in the population. Estimates for previously undiagnosed conditions, as well as those known to and reported by respon-dents, are produced through the survey. Such information is a particular strength of the NHANES program.
Risk factors, those aspects of a person’s life-style, constitution, heredity, or environment that may increase the chances of developing a certain disease or condition, will be examined. Smoking, alcohol consumption, ���������� �� ���������������� �� ���!������and activity, weight, and dietary intake will be studied. Data on certain aspects of reproductive health, such as use of oral contraceptives and breastfeeding practices, will also be collected.
The diseases, medical conditions, and health indicators to be studied include:
• Anemia• Cardiovascular disease• Diabetes• Environmental exposures• Eye diseases• Hearing loss• Infectious diseases• Kidney disease• Nutrition• Obesity• Oral health• Osteoporosis
The sample for the survey is selected to represent the U.S. population of all ages. To produce reli-able statistics, NHANES over-samples persons 60 and older, African Americans, and Hispanics.
Since the United States has experienced dramatic growth in the number of older people during this century, the aging population has major impli-cations for health care needs, public policy, and research priorities. NCHS is working with public health agencies to increase the knowledge of the health status of older Americans. NHANES has a primary role in this endeavor.
All participants visit the physician. Dietary inter-views and body measurements are included for everyone. All but the very young have a blood sample taken and will have a dental screening. Depending upon the age of the participant, the rest of the examination includes tests and proce-dures to assess the various aspects of health listed above. In general, the older the individual, the more extensive the examination.
Survey Operations
Health interviews are conducted in respondents’ homes. Health measurements are performed in specially-designed and equipped mobile centers, which travel to locations throughout the country. The study team consists of a physician, medical and health technicians, as well as dietary and health interviewers. Many of the study staff are bilingual (English/Spanish).
An advanced computer system using high-end servers, desktop PCs, and wide-area networking collect and process all of the NHANES data, nearly eliminating the need for paper forms and manual coding operations. This system allows interviewers to use note-book computers with electronic pens. The staff at the mobile center can automatically transmit data into data bases through such devices as digital scales and stadiometers. Touch-sensi-tive computer screens let respondents enter their own responses to certain sensitive ques-tions in complete privacy. Survey information is available to NCHS staff within 24 hours of collection, which enhances the capability of collecting quality data and increases the speed with which results are released to the public.
In each location, local health and government ��! �������������!������������ ����������� ��Households in the study area receive a letter from the NCHS Director to introduce the survey. Local media may feature stories about the survey.
NHANES is designed to facilitate and en-courage participation. Transportation is provided to and from the mobile center if necessary. Participants receive compensation and a report ������� ���!��������������������� ������� �������All information collected in the survey is kept ���� �� � ��!������������� ��������� ����� �public laws.
Uses of the Data
Information from NHANES is made available through an extensive series of publications and ���� �������� �����! ������� ��� �����������������data users and researchers throughout the world, survey data are available on the internet and on easy-to-use CD-ROMs.
Research organizations, universities, health ���������������������� ����������!�������survey information. Primary data users are federal agencies that collaborated in the de-sign and development of the survey. The National Institutes of Health, the Food and Drug Administration, and CDC are among the agencies that rely upon NHANES to provide data essential for the implementation and evaluation of program activities. The U.S. Department of Agriculture and NCHS coop-erate in planning and reporting dietary and nutrition information from the survey.
NHANES’ partnership with the U.S. Environ-mental Protection Agency allows continued study of the many important environmental ��"��� �����������������
•�� �� ���!������������ �� ������ �������• Reproductive history and sexual behavior• Respiratory disease (asthma, chronic bron- chitis, emphysema)• Sexually transmitted diseases • Vision
environment:elimination of lead --70% decline since ‘70s
disease prevalence estimates:T2D, obesity, cardiovascular disease
growth charts for development: WHO standard
1 http://www.cdc.gov/nchs/nhanes.htm
Aim 1: EWAS Methods
12Wednesday, December 14, 11
NHANESEnvironmental Factor “E-Chip”
Aim 1: EWAS Methods
Demographics
agerace/ethnicity
incomeeducation
10,000
examplesnumber
of participants
Questionnaires
Drug UsePhysical Activity
OccupationFood Frequency
10,000
Physical Exam and Clinical Laboratory
Measures
Blood PressureHeight, Weight
Fasting Blood GlucoseCholesterol
~3,000
Markers of Exposure(serum and urine)
Nutrients/VitaminsMetals
HydrocarbonsPhthalates
PhenolsInfectious Agents
Allergens~3,000
13Wednesday, December 14, 11
Environmental Measures by Category and Cohort
total 182 96169 258
11 0Pesticides, Pyrethyroid 10Pesticides, Organophosphate 22 2
13 1110Pesticides, Organochlorine 00 1Pesticides, Chlorophenol 10
0 0Pesticides, Carbamate 0 10 50 0Pesticides, Atrazine
02214Volatile Compounds 2910Virus 666
Polyflourochemicals 0 10 1200 0Polybrominated Ethers 120
Phytoestrogens 6 066Phthalates 12 07 12
15 12911Phenols
0 020Perchlorate23Polychlorinated Biphenyls 26 38 0
22Vitamin E 321Vitamin D 10 11 1Vitamin C 0 0
Vitamin B 54 343Vitamin A 333
22Mineral Nutrients 2 176Carotenoid Nutrients 150
001Latex 022 2114 0Hydrocarbons
23Heavy Metals 18 18 259Furans 55 0
077Dioxins 50Diakyl 77 6
1 1 1Cotinine 11Bacterial 178 13
0Allergen Test 0 2000200Acrylamide
1999-20002001-2002
2003-20042005-2006
envi
ronm
enta
l cat
egor
ies
IgE (cat, dog, milk ragweed...)Staph. aureus, gonorrhea, chlamydia
carotenes, lutein/zeaxanthin
retinyl palmitate, retinol
bisphenol A
HIV, measles, hepatitis A-D
lead, cadmium, arsenic
DDT, trans-nonachlor
Aim 1: EWAS Methods
14Wednesday, December 14, 11
EWAS Methodology
bisphenol APCB199β-carotenecotinine...
}{foreach:
Environmental factors:log transformed & z-standardizedreference groups “negative”
M=96-258
foreach: {1999-2000, 2001-2002, 2003-2004, 2005-2006}4 individual cohorts
1999-2000 2001-2002 2003-2004 2005-2006bisphenol A . . 0.002 0.01
PCB199 0.1 0.02 0.03 .
β-carotene NA 0.0001 0.02 0.002
cotinine 0.03 0.01 0.9 .
... ... ... ... ...
Significance tests per cohortp-value(βfactor)
zfactor
disease
βfactor
Survey Regression (GEE):adjusted for known confounding factors
age, sex, ethnicity, socioeconomic status, ...
Aim 1: EWAS Methods
15Wednesday, December 14, 11
EWAS Methodology, cont’d
False Discovery Rate Estimation
permute labels or residuals B times
bisphenol APCB199β-carotenecotinine...
}{foreach:
M=96-258
foreach: {1999-2000, 2001-2002, 2003-2004, 2005-2006}
foreach: 1 to B
1999-2000 2001-2002 2003-2004 2005-2006bisphenol A . . 0.1 0.2
PCB199 0.1 0.02 0.03 .
β-carotene NA 0.01 0.1 0.05
cotinine 0.03 0.01 0.9 .
... ... ... ... ...
FDR(p-value)
compute FDR # [p-value(βfactor(permuted)) < p] × 1/B # [p-value(βfactor)]
Tentative ValidationFDR < threshold in 2 or greater cohorts?
AND sign(βfactor) equal for cohorts?
Aim 1: EWAS Methods
16Wednesday, December 14, 11
Proof of Concept of EWAS
1. Background and Methods
2. Examples: Type 2 Diabetes, Serum Lipid Levels
3. Checking Validity and An “LD” Map of the Environment?
4. Conclusion
5. Informatics for the Environment
17Wednesday, December 14, 11
What environmental factors are associated with Type 2 Diabetes?
Aim 2: EWAS examples
18Wednesday, December 14, 11
Patel CJ, Bhattacharya J, Butte AJ, (2010) An Environment-Wide Association Study (EWAS) on T2DM. PLoS ONE vol. 5(5)
Novel Findings:heptachlor epoxide
γ-tocopherol
Known Associations:β-carotene
vitamin DPCBs
Interesting Patterns:pesticides, PCBs
EWAS on T2DM
−log
10(p
valu
e)
●
●
●●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●●●
●●●
●
●
●
●
●
●
● ●●
●
●
●
●●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●●
●●
●●●
●
●
●●
●
●
●
●
●
●●● ●
●●
●
●
●
● ●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●●
●
●●●
●
●
●
●
●
●●●●
●●
●
●●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●●
●●
●
●
●●
●● ●
acry
lam
ide
alle
rgen
test
bact
eria
l inf
ectio
n
cotin
ine
diak
yl
diox
ins
fura
ns d
iben
zofu
ran
heav
y m
etal
s
hydr
ocar
bons
late
x
nutri
ents
car
oten
oid
nutri
ents
min
eral
snu
trien
ts v
itam
in A
nutri
ents
vita
min
Bnu
trien
ts v
itam
in C
nutri
ents
vita
min
Dnu
trien
ts v
itam
in E
pcbs
perc
hlor
ate
pest
icid
es a
trazi
nepe
stic
ides
chl
orop
heno
lpe
stic
ides
org
anoc
hlor
ine
pest
icid
es o
rgan
opho
spha
tepe
stic
ides
pyr
ethy
roid
phen
ols
phth
alat
es
phyt
oest
roge
ns
poly
brom
inat
ed e
ther
spo
lyflo
uroc
hem
ical
s
vira
l inf
ectio
n
vola
tile
com
poun
ds
01
2
1999-20002001-20022003-20042005-2006
cohort markers
FDR(α<0.02) ~ 10% “validated” factors
Fasting Blood Glucose > 125 mg/dL?BMI, SES, ethnicity, age, sex
OR: Δ 1SD of exposureN=500-2000 per cohort
Heptachlor Epoxide OR=3.2, 1.8
PCB170OR=4.5,2.3
γ-tocopherol (vitamin E)OR=1.8,1.6β-carotene
OR=0.6,0.6
Aim 2: EWAS examples
19Wednesday, December 14, 11
What about other risk phenotypes?EWAS on Serum Lipid Levels
Risk factors for coronary heart disease (CHD)
Targets for intervention (ie, statins)
Influenced by smoking, physical activity, diet, genetics1
LDL-Cholesterol 1% increased risk2
HDL-Cholesterol 2% decreased risk3
Triglycerides (increased risk)
risk for CHD (for 1% increase in lipids)lipid type
1. Tanya M. Teslovich et al. Nature (2010) vol. 466 (7307) pp. 7072.Grundy et al. Arteriosclerosis, Thrombosis, and Vascular Biology (2004) vol. 24 (2) pp. e133. Gotto et al. Journal of the American College of Cardiology (2004) vol. 43 (5) pp. 717-24
Aim 2: EWAS examples
20Wednesday, December 14, 11
EWAS on HDL-C1999-20002001-20022003-20042005-2006
cohort markers
−log
10(p
valu
e)
●
●
●
●●
●●
●
●
●
●●●●●
●
●
●● ●
●●
●●●
●●
●
●
●
●
● ●
●
●
●
●●●
●
●●
●
●●
●
●
●
●
●
●●
●●
●
●
●
●
●
●
●●●
●
●●
●
●●
●●
●
●
●
●
●
●●
●
●
●
●
●●
●
●●
●
●
●●
●
●●
●
●
●
● ●
●
●● ●●●●
●●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●● ●
●●
●
●
●●●●
●●
●●
●
●
●
●
●●
●
●
●
●
●●●●
●
●
●●
●
●
●
●●
●
●
●●
●
●
●
●
●
●
●
● ●●●●
●
●
●
●
●
●●
●
●
●
●
●●
●
●
●●
●●
●●
●
●●
●●
●
●
●
●
●
●●
●●
●
●
●
●●
●●
●●
●
●●
●●
●
●●
●
●
●●●
●
●●●
●
●
●
●
●●
●
●●
●●●
●
●
●
●
●●
● ●
acry
lam
ide
alle
rgen
test
bact
eria
l inf
ectio
n
cotin
ine
diak
yl
diox
ins
fura
ns d
iben
zofu
ran
heav
y m
etal
s
hydr
ocar
bons
late
x
nutri
ents
car
oten
oid
nutri
ents
min
eral
snu
trien
ts v
itam
in A
nutri
ents
vita
min
Bnu
trien
ts v
itam
in C
nutri
ents
vita
min
Dnu
trien
ts v
itam
in E
pcbs
perc
hlor
ate
pest
icid
es a
trazi
nepe
stic
ides
car
bam
ate
pest
icid
es c
hlor
ophe
nol
pest
icid
es o
rgan
ochl
orin
epe
stic
ides
org
anop
hosp
hate
pest
icid
es p
yret
hyro
id
phen
ols
phth
alat
es
phyt
oest
roge
ns
poly
brom
inat
ed e
ther
spo
lyflo
uroc
hem
ical
s
vira
l inf
ectio
n
vola
tile
com
poun
ds
01
23
4
FDR < 10%
carotenes Vitamin E
heavy metals
cotinine
organochlorine pesticides
Patel CJ, Cullen MR, Ioannidis JAP, Butte AJ, (2011). Non-genetic associations and correlation globes for determinants of Lipid Levels: an EWAS. In Review.
Vitamin C, D
hydrocarbons
log10(HDL-C)BMI, SES, ethnicity, age, age2, sex
N=1000-3000
Aim 2: EWAS examples
21Wednesday, December 14, 11
cohort
2001−20022003−2004combined
2001−20022003−20042005−2006combined
2001−20022003−20042005−2006combined
1999−20002001−20022003−20042005−2006combined
2001−20022003−20042005−2006combined
1999−20002001−20022005−2006combined
1999−20002001−20022003−20042005−2006combined
1999−20002003−2004combined
2001−20022003−2004combined
1999−20002001−20022003−2004combined
1999−20002001−20022003−2004combined
2001−20022003−2004combined
1,2,3,4,7,8−hxcdf
trans−b−carotene
cis−b−carotene
Retinol
Retinyl palmitate
a−tocopherol
g−tocopherol
PCB74
PCB170
Oxychlordane
Trans−nonachlor
Enterolactone
N
534806
1735
3605323328897374
3605323325967135
29963610323328899519
3480323327558903
2981360928897140
25853579323328729194
811832
2202
1004825
2155
704986877
2131
8141001
8652228
114910732358
pvalue
0.010.005
2e−05
0.0020.01
7e−041e−08
0.010.01
0.0011e−06
0.0055e−042e−043e−046e−21
7e−042e−040.003
6e−17
0.0012e−042e−048e−20
0.010.0020.002
7e−051e−17
0.010.005
1e−06
0.010.002
4e−06
0.020.0020.003
5e−09
0.020.0020.005
1e−08
0.020.006
2e−07
effe
ct (m
g/dl
)
554830
−18−18−19−16
−9−17−15−12
2323292725
37622441
49865767
2451423941
376138
628650
53785357
42664947
−14−20−17
−20 −10 0 10 20 30 40
% change
A Triglycerides
cohort
2003−20042005−2006combined
2003−20042005−2006combined
2001−20022003−2004combined
2001−20022003−2004combined
2001−20022003−20042005−2006combined
2001−20022003−20042005−2006combined
1999−20002001−20022003−20042005−2006combined
2001−20022003−20042005−2006combined
2001−20022003−2004combined
2003−20042005−2006combined
2001−20022003−20042005−2006combined
2001−20022003−2004combined
2001−20022003−2004combined
Cotinine
Mercury, total
2−fluorene
3−fluorene
Combined Lutein/zeaxanthin
cis−b−carotene
Iron, Frozen Serum
Retinyl stearate
Folate, serum
Vitamin C
Vitamin D
g−tocopherol
Heptachlor Epoxide
N
726769599513
727369616323
233221922252
233221762243
7473679068687388
7478679062647151
63837457270625246764
7251679063378421
746872679559
679969114852
7056727369667401
742867909216
202218352108
pvalue
0.0030.02
2e−06
0.010.002
6e−07
0.010.0060.004
0.020.01
0.006
2e−042e−054e−042e−16
3e−049e−042e−043e−12
0.0090.0030.0060.002
6e−11
0.0020.0030.002
4e−05
0.0040.02
2e−05
0.0060.02
0.002
0.010.004
0.011e−06
0.0010.01
6e−06
0.010.02
0.006
effe
ct (m
g/dl
)
−2−1−1
122
−2−1−1
−2−1−1
3334
2333
22222
−1−1−2−1
111
211
1212
−1−1−1
−2−1−2
−6 −4 −2 0 2 4 6
% change
C HDL-C
cohort
2001−20022003−20042005−2006combined
2001−20022005−2006combined
2001−20022003−20042005−2006combined
2001−20022003−20042005−2006combined
2001−20022003−20042005−2006combined
2001−20022005−2006combined
1999−20002001−20022005−2006combined
2001−20022003−20042005−2006combined
trans−b−carotene
cis−b−carotene
b−cryptoxanthin
Combined Lutein/zeaxanthin
trans−lycopene
Retinyl palmitate
a−tocopherol
g−tocopherol
N
3315317428307043
331725416809
3294317428057012
3317317428307043
3315317428307043
320026988425
2734331728306665
3288317428148696
pvalue
0.0030.004
9e−042e−13
0.0020.004
5e−11
9e−046e−040.001
4e−13
0.0012e−045e−043e−15
5e−041e−042e−048e−17
8e−040.001
4e−13
0.0028e−057e−057e−19
0.0030.0020.005
3e−14
effe
ct (m
g/dl
)
8698
776
7798
98109
10101412
586
14171716
8666
0 5 10 15 20
% change
B LDL-C
cohort
2001−20022003−2004combined
2001−20022003−20042005−2006combined
2001−20022003−20042005−2006combined
1999−20002001−20022003−20042005−2006combined
2001−20022003−20042005−2006combined
1999−20002001−20022005−2006combined
1999−20002001−20022003−20042005−2006combined
1999−20002003−2004combined
2001−20022003−2004combined
1999−20002001−20022003−2004combined
1999−20002001−20022003−2004combined
2001−20022003−2004combined
1,2,3,4,7,8−hxcdf
trans−b−carotene
cis−b−carotene
Retinol
Retinyl palmitate
a−tocopherol
g−tocopherol
PCB74
PCB170
Oxychlordane
Trans−nonachlor
Enterolactone
N
534806
1735
3605323328897374
3605323325967135
29963610323328899519
3480323327558903
2981360928897140
25853579323328729194
811832
2202
1004825
2155
704986877
2131
8141001
8652228
114910732358
pvalue
0.010.005
2e−05
0.0020.01
7e−041e−08
0.010.01
0.0011e−06
0.0055e−042e−043e−046e−21
7e−042e−040.003
6e−17
0.0012e−042e−048e−20
0.010.0020.002
7e−051e−17
0.010.005
1e−06
0.010.002
4e−06
0.020.0020.003
5e−09
0.020.0020.005
1e−08
0.020.006
2e−07
effe
ct (m
g/dl
)
554830
−18−18−19−16
−9−17−15−12
2323292725
37622441
49865767
2451423941
376138
628650
53785357
42664947
−14−20−17
−20 −10 0 10 20 30 40
% change
A Triglycerides
cohort
2003−20042005−2006combined
2003−20042005−2006combined
2001−20022003−2004combined
2001−20022003−2004combined
2001−20022003−20042005−2006combined
2001−20022003−20042005−2006combined
1999−20002001−20022003−20042005−2006combined
2001−20022003−20042005−2006combined
2001−20022003−2004combined
2003−20042005−2006combined
2001−20022003−20042005−2006combined
2001−20022003−2004combined
2001−20022003−2004combined
Cotinine
Mercury, total
2−fluorene
3−fluorene
Combined Lutein/zeaxanthin
cis−b−carotene
Iron, Frozen Serum
Retinyl stearate
Folate, serum
Vitamin C
Vitamin D
g−tocopherol
Heptachlor Epoxide
N
726769599513
727369616323
233221922252
233221762243
7473679068687388
7478679062647151
63837457270625246764
7251679063378421
746872679559
679969114852
7056727369667401
742867909216
202218352108
pvalue
0.0030.02
2e−06
0.010.002
6e−07
0.010.0060.004
0.020.01
0.006
2e−042e−054e−042e−16
3e−049e−042e−043e−12
0.0090.0030.0060.002
6e−11
0.0020.0030.002
4e−05
0.0040.02
2e−05
0.0060.02
0.002
0.010.004
0.011e−06
0.0010.01
6e−06
0.010.02
0.006
effe
ct (m
g/dl
)
−2−1−1
122
−2−1−1
−2−1−1
3334
2333
22222
−1−1−2−1
111
211
1212
−1−1−1
−2−1−2
−6 −4 −2 0 2 4 6
% change
C HDL-C
cohort
2001−20022003−20042005−2006combined
2001−20022005−2006combined
2001−20022003−20042005−2006combined
2001−20022003−20042005−2006combined
2001−20022003−20042005−2006combined
2001−20022005−2006combined
1999−20002001−20022005−2006combined
2001−20022003−20042005−2006combined
trans−b−carotene
cis−b−carotene
b−cryptoxanthin
Combined Lutein/zeaxanthin
trans−lycopene
Retinyl palmitate
a−tocopherol
g−tocopherol
N
3315317428307043
331725416809
3294317428057012
3317317428307043
3315317428307043
320026988425
2734331728306665
3288317428148696
pvalue
0.0030.004
9e−042e−13
0.0020.004
5e−11
9e−046e−040.001
4e−13
0.0012e−045e−043e−15
5e−041e−042e−048e−17
8e−040.001
4e−13
0.0028e−057e−057e−19
0.0030.0020.005
3e−14
effe
ct (m
g/dl
)
8698
776
7798
98109
10101412
586
14171716
8666
0 5 10 15 20
% change
B LDL-C
cohort
2001−20022003−2004combined
2001−20022003−20042005−2006combined
2001−20022003−20042005−2006combined
1999−20002001−20022003−20042005−2006combined
2001−20022003−20042005−2006combined
1999−20002001−20022005−2006combined
1999−20002001−20022003−20042005−2006combined
1999−20002003−2004combined
2001−20022003−2004combined
1999−20002001−20022003−2004combined
1999−20002001−20022003−2004combined
2001−20022003−2004combined
1,2,3,4,7,8−hxcdf
trans−b−carotene
cis−b−carotene
Retinol
Retinyl palmitate
a−tocopherol
g−tocopherol
PCB74
PCB170
Oxychlordane
Trans−nonachlor
Enterolactone
N
534806
1735
3605323328897374
3605323325967135
29963610323328899519
3480323327558903
2981360928897140
25853579323328729194
811832
2202
1004825
2155
704986877
2131
8141001
8652228
114910732358
pvalue
0.010.005
2e−05
0.0020.01
7e−041e−08
0.010.01
0.0011e−06
0.0055e−042e−043e−046e−21
7e−042e−040.003
6e−17
0.0012e−042e−048e−20
0.010.0020.002
7e−051e−17
0.010.005
1e−06
0.010.002
4e−06
0.020.0020.003
5e−09
0.020.0020.005
1e−08
0.020.006
2e−07
effe
ct (m
g/dl
)
554830
−18−18−19−16
−9−17−15−12
2323292725
37622441
49865767
2451423941
376138
628650
53785357
42664947
−14−20−17
−20 −10 0 10 20 30 40
% change
A Triglycerides
cohort
2003−20042005−2006combined
2003−20042005−2006combined
2001−20022003−2004combined
2001−20022003−2004combined
2001−20022003−20042005−2006combined
2001−20022003−20042005−2006combined
1999−20002001−20022003−20042005−2006combined
2001−20022003−20042005−2006combined
2001−20022003−2004combined
2003−20042005−2006combined
2001−20022003−20042005−2006combined
2001−20022003−2004combined
2001−20022003−2004combined
Cotinine
Mercury, total
2−fluorene
3−fluorene
Combined Lutein/zeaxanthin
cis−b−carotene
Iron, Frozen Serum
Retinyl stearate
Folate, serum
Vitamin C
Vitamin D
g−tocopherol
Heptachlor Epoxide
N
726769599513
727369616323
233221922252
233221762243
7473679068687388
7478679062647151
63837457270625246764
7251679063378421
746872679559
679969114852
7056727369667401
742867909216
202218352108
pvalue
0.0030.02
2e−06
0.010.002
6e−07
0.010.0060.004
0.020.01
0.006
2e−042e−054e−042e−16
3e−049e−042e−043e−12
0.0090.0030.0060.002
6e−11
0.0020.0030.002
4e−05
0.0040.02
2e−05
0.0060.02
0.002
0.010.004
0.011e−06
0.0010.01
6e−06
0.010.02
0.006
effe
ct (m
g/dl
)
−2−1−1
122
−2−1−1
−2−1−1
3334
2333
22222
−1−1−2−1
111
211
1212
−1−1−1
−2−1−2
−6 −4 −2 0 2 4 6
% change
C HDL-C
cohort
2001−20022003−20042005−2006combined
2001−20022005−2006combined
2001−20022003−20042005−2006combined
2001−20022003−20042005−2006combined
2001−20022003−20042005−2006combined
2001−20022005−2006combined
1999−20002001−20022005−2006combined
2001−20022003−20042005−2006combined
trans−b−carotene
cis−b−carotene
b−cryptoxanthin
Combined Lutein/zeaxanthin
trans−lycopene
Retinyl palmitate
a−tocopherol
g−tocopherol
N
3315317428307043
331725416809
3294317428057012
3317317428307043
3315317428307043
320026988425
2734331728306665
3288317428148696
pvalue
0.0030.004
9e−042e−13
0.0020.004
5e−11
9e−046e−040.001
4e−13
0.0012e−045e−043e−15
5e−041e−042e−048e−17
8e−040.001
4e−13
0.0028e−057e−057e−19
0.0030.0020.005
3e−14
effe
ct (m
g/dl
)
8698
776
7798
98109
10101412
586
14171716
8666
0 5 10 15 20
% change
B LDL-Ccohort
2001−20022003−2004combined
2001−20022003−20042005−2006combined
2001−20022003−20042005−2006combined
1999−20002001−20022003−20042005−2006combined
2001−20022003−20042005−2006combined
1999−20002001−20022005−2006combined
1999−20002001−20022003−20042005−2006combined
1999−20002003−2004combined
2001−20022003−2004combined
1999−20002001−20022003−2004combined
1999−20002001−20022003−2004combined
2001−20022003−2004combined
1,2,3,4,7,8−hxcdf
trans−b−carotene
cis−b−carotene
Retinol
Retinyl palmitate
a−tocopherol
g−tocopherol
PCB74
PCB170
Oxychlordane
Trans−nonachlor
Enterolactone
N
534806
1735
3605323328897374
3605323325967135
29963610323328899519
3480323327558903
2981360928897140
25853579323328729194
811832
2202
1004825
2155
704986877
2131
8141001
8652228
114910732358
pvalue
0.010.005
2e−05
0.0020.01
7e−041e−08
0.010.01
0.0011e−06
0.0055e−042e−043e−046e−21
7e−042e−040.003
6e−17
0.0012e−042e−048e−20
0.010.0020.002
7e−051e−17
0.010.005
1e−06
0.010.002
4e−06
0.020.0020.003
5e−09
0.020.0020.005
1e−08
0.020.006
2e−07
effe
ct (m
g/dl
)
554830
−18−18−19−16
−9−17−15−12
2323292725
37622441
49865767
2451423941
376138
628650
53785357
42664947
−14−20−17
−20 −10 0 10 20 30 40
% change
A Triglycerides
cohort
2003−20042005−2006combined
2003−20042005−2006combined
2001−20022003−2004combined
2001−20022003−2004combined
2001−20022003−20042005−2006combined
2001−20022003−20042005−2006combined
1999−20002001−20022003−20042005−2006combined
2001−20022003−20042005−2006combined
2001−20022003−2004combined
2003−20042005−2006combined
2001−20022003−20042005−2006combined
2001−20022003−2004combined
2001−20022003−2004combined
Cotinine
Mercury, total
2−fluorene
3−fluorene
Combined Lutein/zeaxanthin
cis−b−carotene
Iron, Frozen Serum
Retinyl stearate
Folate, serum
Vitamin C
Vitamin D
g−tocopherol
Heptachlor Epoxide
N
726769599513
727369616323
233221922252
233221762243
7473679068687388
7478679062647151
63837457270625246764
7251679063378421
746872679559
679969114852
7056727369667401
742867909216
202218352108
pvalue
0.0030.02
2e−06
0.010.002
6e−07
0.010.0060.004
0.020.01
0.006
2e−042e−054e−042e−16
3e−049e−042e−043e−12
0.0090.0030.0060.002
6e−11
0.0020.0030.002
4e−05
0.0040.02
2e−05
0.0060.02
0.002
0.010.004
0.011e−06
0.0010.01
6e−06
0.010.02
0.006
effe
ct (m
g/dl
)
−2−1−1
122
−2−1−1
−2−1−1
3334
2333
22222
−1−1−2−1
111
211
1212
−1−1−1
−2−1−2
−6 −4 −2 0 2 4 6
% change
C HDL-C
cohort
2001−20022003−20042005−2006combined
2001−20022005−2006combined
2001−20022003−20042005−2006combined
2001−20022003−20042005−2006combined
2001−20022003−20042005−2006combined
2001−20022005−2006combined
1999−20002001−20022005−2006combined
2001−20022003−20042005−2006combined
trans−b−carotene
cis−b−carotene
b−cryptoxanthin
Combined Lutein/zeaxanthin
trans−lycopene
Retinyl palmitate
a−tocopherol
g−tocopherol
N
3315317428307043
331725416809
3294317428057012
3317317428307043
3315317428307043
320026988425
2734331728306665
3288317428148696
pvalue
0.0030.004
9e−042e−13
0.0020.004
5e−11
9e−046e−040.001
4e−13
0.0012e−045e−043e−15
5e−041e−042e−048e−17
8e−040.001
4e−13
0.0028e−057e−057e−19
0.0030.0020.005
3e−14
effe
ct (m
g/dl
)
8698
776
7798
98109
10101412
586
14171716
8666
0 5 10 15 20
% change
B LDL-C
cohort
2001−20022003−2004combined
2001−20022003−20042005−2006combined
2001−20022003−20042005−2006combined
1999−20002001−20022003−20042005−2006combined
2001−20022003−20042005−2006combined
1999−20002001−20022005−2006combined
1999−20002001−20022003−20042005−2006combined
1999−20002003−2004combined
2001−20022003−2004combined
1999−20002001−20022003−2004combined
1999−20002001−20022003−2004combined
2001−20022003−2004combined
1,2,3,4,7,8−hxcdf
trans−b−carotene
cis−b−carotene
Retinol
Retinyl palmitate
a−tocopherol
g−tocopherol
PCB74
PCB170
Oxychlordane
Trans−nonachlor
Enterolactone
N
534806
1735
3605323328897374
3605323325967135
29963610323328899519
3480323327558903
2981360928897140
25853579323328729194
811832
2202
1004825
2155
704986877
2131
8141001
8652228
114910732358
pvalue
0.010.005
2e−05
0.0020.01
7e−041e−08
0.010.01
0.0011e−06
0.0055e−042e−043e−046e−21
7e−042e−040.003
6e−17
0.0012e−042e−048e−20
0.010.0020.002
7e−051e−17
0.010.005
1e−06
0.010.002
4e−06
0.020.0020.003
5e−09
0.020.0020.005
1e−08
0.020.006
2e−07
effe
ct (m
g/dl
)
554830
−18−18−19−16
−9−17−15−12
2323292725
37622441
49865767
2451423941
376138
628650
53785357
42664947
−14−20−17
−20 −10 0 10 20 30 40
% change
A Triglycerides
cohort
2003−20042005−2006combined
2003−20042005−2006combined
2001−20022003−2004combined
2001−20022003−2004combined
2001−20022003−20042005−2006combined
2001−20022003−20042005−2006combined
1999−20002001−20022003−20042005−2006combined
2001−20022003−20042005−2006combined
2001−20022003−2004combined
2003−20042005−2006combined
2001−20022003−20042005−2006combined
2001−20022003−2004combined
2001−20022003−2004combined
Cotinine
Mercury, total
2−fluorene
3−fluorene
Combined Lutein/zeaxanthin
cis−b−carotene
Iron, Frozen Serum
Retinyl stearate
Folate, serum
Vitamin C
Vitamin D
g−tocopherol
Heptachlor Epoxide
N
726769599513
727369616323
233221922252
233221762243
7473679068687388
7478679062647151
63837457270625246764
7251679063378421
746872679559
679969114852
7056727369667401
742867909216
202218352108
pvalue
0.0030.02
2e−06
0.010.002
6e−07
0.010.0060.004
0.020.01
0.006
2e−042e−054e−042e−16
3e−049e−042e−043e−12
0.0090.0030.0060.002
6e−11
0.0020.0030.002
4e−05
0.0040.02
2e−05
0.0060.02
0.002
0.010.004
0.011e−06
0.0010.01
6e−06
0.010.02
0.006
effe
ct (m
g/dl
)
−2−1−1
122
−2−1−1
−2−1−1
3334
2333
22222
−1−1−2−1
111
211
1212
−1−1−1
−2−1−2
−6 −4 −2 0 2 4 6
% change
C HDL-C
cohort
2001−20022003−20042005−2006combined
2001−20022005−2006combined
2001−20022003−20042005−2006combined
2001−20022003−20042005−2006combined
2001−20022003−20042005−2006combined
2001−20022005−2006combined
1999−20002001−20022005−2006combined
2001−20022003−20042005−2006combined
trans−b−carotene
cis−b−carotene
b−cryptoxanthin
Combined Lutein/zeaxanthin
trans−lycopene
Retinyl palmitate
a−tocopherol
g−tocopherol
N
3315317428307043
331725416809
3294317428057012
3317317428307043
3315317428307043
320026988425
2734331728306665
3288317428148696
pvalue
0.0030.004
9e−042e−13
0.0020.004
5e−11
9e−046e−040.001
4e−13
0.0012e−045e−043e−15
5e−041e−042e−048e−17
8e−040.001
4e−13
0.0028e−057e−057e−19
0.0030.0020.005
3e−14
effe
ct (m
g/dl
)
8698
776
7798
98109
10101412
586
14171716
8666
0 5 10 15 20
% change
B LDL-C
cohort
2001−20022003−2004combined
2001−20022003−20042005−2006combined
2001−20022003−20042005−2006combined
1999−20002001−20022003−20042005−2006combined
2001−20022003−20042005−2006combined
1999−20002001−20022005−2006combined
1999−20002001−20022003−20042005−2006combined
1999−20002003−2004combined
2001−20022003−2004combined
1999−20002001−20022003−2004combined
1999−20002001−20022003−2004combined
2001−20022003−2004combined
1,2,3,4,7,8−hxcdf
trans−b−carotene
cis−b−carotene
Retinol
Retinyl palmitate
a−tocopherol
g−tocopherol
PCB74
PCB170
Oxychlordane
Trans−nonachlor
Enterolactone
N
534806
1735
3605323328897374
3605323325967135
29963610323328899519
3480323327558903
2981360928897140
25853579323328729194
811832
2202
1004825
2155
704986877
2131
8141001
8652228
114910732358
pvalue
0.010.005
2e−05
0.0020.01
7e−041e−08
0.010.01
0.0011e−06
0.0055e−042e−043e−046e−21
7e−042e−040.003
6e−17
0.0012e−042e−048e−20
0.010.0020.002
7e−051e−17
0.010.005
1e−06
0.010.002
4e−06
0.020.0020.003
5e−09
0.020.0020.005
1e−08
0.020.006
2e−07
effe
ct (m
g/dl
)
554830
−18−18−19−16
−9−17−15−12
2323292725
37622441
49865767
2451423941
376138
628650
53785357
42664947
−14−20−17
−20 −10 0 10 20 30 40
% change
A Triglycerides
cohort
2003−20042005−2006combined
2003−20042005−2006combined
2001−20022003−2004combined
2001−20022003−2004combined
2001−20022003−20042005−2006combined
2001−20022003−20042005−2006combined
1999−20002001−20022003−20042005−2006combined
2001−20022003−20042005−2006combined
2001−20022003−2004combined
2003−20042005−2006combined
2001−20022003−20042005−2006combined
2001−20022003−2004combined
2001−20022003−2004combined
Cotinine
Mercury, total
2−fluorene
3−fluorene
Combined Lutein/zeaxanthin
cis−b−carotene
Iron, Frozen Serum
Retinyl stearate
Folate, serum
Vitamin C
Vitamin D
g−tocopherol
Heptachlor Epoxide
N
726769599513
727369616323
233221922252
233221762243
7473679068687388
7478679062647151
63837457270625246764
7251679063378421
746872679559
679969114852
7056727369667401
742867909216
202218352108
pvalue
0.0030.02
2e−06
0.010.002
6e−07
0.010.0060.004
0.020.01
0.006
2e−042e−054e−042e−16
3e−049e−042e−043e−12
0.0090.0030.0060.002
6e−11
0.0020.0030.002
4e−05
0.0040.02
2e−05
0.0060.02
0.002
0.010.004
0.011e−06
0.0010.01
6e−06
0.010.02
0.006
effe
ct (m
g/dl
)
−2−1−1
122
−2−1−1
−2−1−1
3334
2333
22222
−1−1−2−1
111
211
1212
−1−1−1
−2−1−2
−6 −4 −2 0 2 4 6
% change
C HDL-C
cohort
2001−20022003−20042005−2006combined
2001−20022005−2006combined
2001−20022003−20042005−2006combined
2001−20022003−20042005−2006combined
2001−20022003−20042005−2006combined
2001−20022005−2006combined
1999−20002001−20022005−2006combined
2001−20022003−20042005−2006combined
trans−b−carotene
cis−b−carotene
b−cryptoxanthin
Combined Lutein/zeaxanthin
trans−lycopene
Retinyl palmitate
a−tocopherol
g−tocopherol
N
3315317428307043
331725416809
3294317428057012
3317317428307043
3315317428307043
320026988425
2734331728306665
3288317428148696
pvalue
0.0030.004
9e−042e−13
0.0020.004
5e−11
9e−046e−040.001
4e−13
0.0012e−045e−043e−15
5e−041e−042e−048e−17
8e−040.001
4e−13
0.0028e−057e−057e−19
0.0030.0020.005
3e−14
effe
ct (m
g/dl
)
8698
776
7798
98109
10101412
586
14171716
8666
0 5 10 15 20
% change
B LDL-C
Effect Sizes For Validated Factors:HDL-C
% change = Δ 1 SD in Exposure18 validated factors
combined adjusted for:BMI, SES, ethnicity, age, age2, sex,
waist circumference, diabetes (FBG > 125 mg/dL), blood pressure
comparable to genetic effect sizes1!
Aim 2: EWAS examples
22Wednesday, December 14, 11
Proof of Concept of EWAS
1. Background and Methods
2. Examples: Type 2 Diabetes, Serum Lipid Levels
3. Checking Validity and An “LD” Map of the Environment?
4. Conclusion
5. Informatics for the Environment
23Wednesday, December 14, 11
Assessing Validity of Estimatesexample: HDL-C
Could the disease “lead” to exposure?“Reverse causality”
γ-tocopherol
?
tocopherol (vitamin e) supplements forCHD individuals?
low HDL
Aim 3: EWAS and Validity
Could there something confounding the association?
statin use
β-carotene
confounders
high HDL
??
Are associations independent? How to untangle the web of exposure?
β-carotene hydrocarbons
γ-tocopherol
ρ
24Wednesday, December 14, 11
Longitudinal Study: “Gold Standard” for Validation
• exposure changing through time
• reverse causality bias
• compute disease risk
• how will we use “exposome”
longitudinally? age/time
HD
L-C
hole
ster
ol
(mg/
dL)
[high]
[low]
[γ-tocopherol]
tocopherol (vitamin e) supplements forCHD individuals?
low HDL
?
γ-tocopherol
Aim 3: EWAS and Validity
25Wednesday, December 14, 11
Addressing Confounding Bias with the Exposome
E[HDL-C] = α + β1, original * carotene
“account” for bias due to statin use confounder
statin use
β-carotene High HDLE[HDL-C] = α + β1,extended * carotene + β2 * statin use
compare β1, original and β1,extended
disease status diabetes, CHD, heart attack
drug use metformin, statins use
supplement use count of total supplements used
physical activity daily estimated metabolic equivalents
recent food intake total nutrients computed from frequency questionnaire
“source” of bias example variables
Aim 3: EWAS and validity
26Wednesday, December 14, 11
Correlation Structure of the Exposome?
Aim 3: EWAS and Validity
Are associations independent? How to untangle the complex web of exposure?
β-carotene hydrocarbons
γ-tocopherol
ρ
Analogy: “Linkage Disequilibrium”
Identification of four novel T2DM lociOur fast-track stage 2 genotyping confirmed the reported associationfor rs7903146 (TCF7L2) on chromosome 10, and in addition iden-tified significant associations for seven SNPs representing four newT2DM loci (Table 1). In all cases, the strongest association for theMAX statistic (see Methods) was obtained with the additive model.
The most significant of these corresponds to rs13266634, a non-synonymous SNP (R325W) in SLC30A8, located in a 33-kb linkagedisequilibrium block on chromosome 8, containing only the 39 endof this gene (Fig. 2a). SLC30A8 encodes a zinc transporter expressedsolely in the secretory vesicles of b-cells and is thus implicated in thefinal stages of insulin biosynthesis, which involve co-crystallization
Table 1 | Confirmed association results
SNP Chromosome Position
(nucleotides)
Risk
allele
Major
allele
MAF
(case)
MAF
(ctrl)
Odds ratio
(het)
Odds ratio
(hom)
PAR ls Stage 2
pMAX
Stage 2 pMAX
(perm)
Stage 1
pMAX
Stage 1 pMAX
(perm)
Nearest
gene
rs7903146 10 114,748,339 T C 0.406 0.293 1.656 0.19 2.776 0.50 0.28 1.0546 1.53 10234 ,1.03 1027 3.23 10217 ,3.33 10210 TCF7L2rs13266634 8 118,253,964 C C 0.254 0.301 1.186 0.25 1.536 0.31 0.24 1.0089 6.13 1028 5.03 1027 2.13 1025 1.83 1025 SLC30A8rs1111875 10 94,452,862 G G 0.358 0.402 1.196 0.19 1.446 0.24 0.19 1.0069 3.03 1026 7.43 1026 9.13 1026 7.33 1026 HHEXrs7923837 10 94,471,897 G G 0.335 0.377 1.226 0.21 1.456 0.25 0.20 1.0065 7.53 1026 2.23 1025 3.43 1026 2.53 1026 HHEXrs7480010 11 42,203,294 G A 0.336 0.301 1.146 0.13 1.406 0.25 0.08 1.0041 1.13 1024 2.93 1024 1.53 1025 1.23 1025 LOC387761rs3740878 11 44,214,378 A A 0.240 0.272 1.266 0.29 1.466 0.33 0.24 1.0046 1.23 1024 2.83 1024 1.83 1025 1.33 1025 EXT2rs11037909 11 44,212,190 T T 0.240 0.271 1.276 0.30 1.476 0.33 0.25 1.0045 1.83 1024 4.53 1024 1.83 1025 1.33 1025 EXT2rs1113132 11 44,209,979 C C 0.237 0.267 1.156 0.27 1.366 0.31 0.19 1.0044 3.33 1024 8.13 1024 3.73 1025 2.93 1025 EXT2
Significant T2DM associations were confirmed for eight SNPs in five loci. Allele frequencies, odds ratios (with 95% confidence intervals) and PAR were calculated using only the stage 2 data. Allelefrequencies in the controls were very close to those reported for the CEU set (European subjects genotyped in the HapMap project). Induced sibling recurrent risk ratios (ls) were estimated usingstage 2 genotype counts for the control subjects and assuming a T2DM prevalence of 7% in the French population. hom, homozygous; het, heterozygous; major allele, the allele with the higherfrequency in controls; pMAX, P-value of the MAX statistic from the x2 distribution; pMAX (perm), P-value of the MAX statistic from the permutation-derived empirical distribution (pMAX andpMAX (perm) are adjusted for variance inflation); risk allele, the allele with higher frequency in cases compared with controls.
**
*
024
–log10 [P]
–log10 [P]*
4954642sr 237 3971sr 3 37 3971 sr
44 5 40 9s r 801 2261sr 334 99 41s r
88 34 29sr 20 19462sr 034 9941sr
9035 0501sr 036169sr
0415 007sr 22 25 991s r 6136642sr 8 13664 2s r 186964 6sr 8798751sr
0 492 8201sr 3 9266 42s r 5 926642s r
4 3666231sr 9926642sr 2954642 sr
0 1350 501s r 5769 646 sr 4 577187s r 4 76 964 6sr
41350 501sr 5784931sr 2173387 sr
39250 501sr 5050007s r 7 492 602 sr 12550 51sr
15 6868sr 4 373 38 7sr 4 784 931sr 750 1107sr 269 7402sr
91518711sr 646 1001sr
2925 0501sr 5889103sr 866964 6sr 0889 103sr 468839 2sr
SLC30A8 IDE HHEXKIF11
**
**
**
024
* *
5470942sr 7 602242sr
2 8178111sr 1570942sr 239442 4sr 88 38 141s r
76029511sr 37178111sr
2945 391s r 2608842sr
646 905 01sr 15379 42sr 2950249sr 0339351 sr 17088 42s r
1 95 749s r 40 3794 2sr 1137942sr 73832 97sr 5781111sr 9275 722s r 9537 197sr 63420 97sr 0383856sr 0990 707sr 418 41 97 sr
1 902 8801sr 91 25722 sr
880 28801sr 19 740 64sr 5374283sr
5 346 5221s r 6283856sr 5058573sr 3679 991sr 1118097s r 34 91242s r
46078 111sr 06078 111s r
7 912381sr 3148707sr 028 38 56s r
52078 111sr 5227373sr 049 1242sr 2 3694 12sr 2297881sr
66 2155 sr 77901 97sr
44068701sr 350 75221sr
5826807sr 7 851 092sr 940 95 22sr
–log10 [P]
–log10 [P]EXT2 ALX4
****
*** *
024
*** *
**** 5704541sr 5194283sr 9085574sr 65 68281s r 3279397sr
832 978s r 4 953 102 sr
92096701sr 72973011sr
3 601 683sr 7609497sr 2498017s r 1265 297sr 5082083sr 79755 74sr 216 249 7sr
50 07 98 s r 7829 27s r
4975 5 74sr 878 0473sr
90 97 30 11 sr 2313 11 1sr 432 5574sr 3325574sr 3700 931sr 7 87 76 02 sr
45 47 06 11 sr 88755 7 4sr 87 7421 7sr 7945846s r 2705393sr 28 7557 4sr 4 38 97 34 sr
2 8873011sr 1 8873011sr
8315397sr 64 673 2 4sr 97 8111 7sr
53 2838 01 sr 0 23 83 97s r 8083293sr 3895701sr 9 20 58 22s r 57 27013 sr 4727013sr 1404702sr
D!
0 0.2 0.4 0.6 0.8 1
****
****
024
39863011sr 1601683 sr 9501683s r 03765 74s r 627 6574sr 60886701sr
7536017sr 977 183 1s r 1258051 sr 288630 11sr 4 985 747 1sr
97 626 41sr 1 19814 21s r
6856781sr 6642682sr 3 387 3801sr
445974 7sr 72863011s r
075493 7sr 4 377 397sr 8350847sr 47 706 101 sr
1 158 53 7sr 96558221s r
7035 846sr 010 0847sr 9 1654 44 sr 1 400 039sr 0431101sr 78786701sr
498 10 9sr 01 47 27sr 4245734 sr 90 612 42 1sr
0 1688 41 sr 27773801sr
8307831s r 93678221sr
112809sr 86773 801sr
367 6061sr 8433351s r 9 925 846sr 6243101sr 51 84151sr 9 387 21 7sr 8083172sr 18210501 sr 63 663 01 1sr 3 162 47 01sr
LOC387761
a b
c d
Figure 2 | Pairwise linkage disequilibrium diagrams for four novel T2DM-associated loci. D9 was calculated from the stage 1 genotyping data as afraction of observed linkage disequilibrium over the maximal possible. Thebar graph indicates the negative logarithm of the stage 1 P-value for each
SNP. Transcriptional units are indicated by green lines, with exonshighlighted in orange. Blue asterisksmark the SNPs chosen for confirmatorystudies. a, SLC30A8; b, IDE–KIF11–HHEX; c, EXT2–ALX4; d, LOC387761.
NATURE |Vol 445 |22 February 2007 ARTICLES
883Nature ©2007 Publishing Group
Sladek et al., Nature Genetics. 2007
Correlation between occurrence of genetic loci
In GWAS, allows one to trace to the “causal” locus.
27Wednesday, December 14, 11
Red: positive ρBlue: negative ρthickness: |ρ|
for each pair:Partial ρ
age, BMI (serum)age, creatinine (urine)
cotinine
hydrocarbons
volatile organics
virus
bacteria
phthalatesPFCs
heavy metals
pcbs
dioxins
furans
deet
organochlorine
organophosphates
phenol pesticides
pyrethroiddiakyl carotenoids
mineral nutrientsvitamin A
vitamin B
vitamin C
vitamin D
vitamin E
phytoestrogens
ρ=0.3
ρ=0.1
ρ=0.4
ρ=0.6
ρ=0.3
ρ= -0.2
ρ=0.2
ρ=0.3ρ=0ρ=0.5ρ=0.3
ρ=0.4
ρ=0.2
ρ=0.2
ρ=0.3
ρ=0.4
ρ=0.4
ρ=0.4
ρ=0.2
ρ=0.3 ρ=0.3
phenolsρ=0
Aim 3: EWAS and Validity
“LD” of Exposure Biomarkers in NHANES
permuted data to produce“null ρ”
filtered those ρ > .3 or < -0.2 (5th, 95th percentile)
sought replication in > 1 cohort
pcbs & organochlorinesρ= 0.4
(Type 2 Diabetes)
cotinine & hydrocarbonsρ= 0.6
(HDL-Cholesterol)
Patel et al. (2011). Canonical Correlation of Biomarkers for Environmental Exposures. In Review.
28Wednesday, December 14, 11
EWAS:Conclusion and Discussion
• generalizable, comprehensive, transparent, and systematic study of environment
• novel associations for T2D and HDL-C
• effects on disease are on par with genetics, calling for large-scale exposomic study
• However: confounding, reverse causal biases: need longitudinal and follow-up studies.
• Correlative web is dense: what is the “LD” of the the exposome?
−log
10(p
valu
e)
●
●●
●
●●●
●
●
●
●
●
●
●
●●
●
●
●● ●●
●
●●●
●
●●●
●
●●
●
●
●●
●
●
●
●
●
●
●●
●
●
●
●●
●
●●●
●●●
●
●
●●●●
●
●
●●
●
●
●
●
●
●
●
●●●●●
●
●●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●●
●●●
●
●●●●
●●
●
●
●
●●
●
●●
●●
●●
●
●●● ●
●
●
●●
●
● ●
●
●
●
●
●●●●●
●
●
●
●
●●
●
●
●
●
●●● ●●
●●
●●
●
●●
●●
●
●
●●
●●●
●●
●
●
●●
●
●
●
●
●
●●●
●
●●
●
●
●●●●
●
●
●
●
●
●
●
●
●
●●
●
●●●
●
●●
●
●
●
●●●●●●
●●
●●
●
●
●
●
●
acry
lam
ide
alle
rgen
test
bact
eria
l inf
ectio
n
cotin
ine
diak
yl
diox
ins
fura
ns d
iben
zofu
ran
heav
y m
etal
s
hydr
ocar
bons
late
x
nutri
ents
car
oten
oid
nutri
ents
min
eral
snu
trien
ts v
itam
in A
nutri
ents
vita
min
Bnu
trien
ts v
itam
in C
nutri
ents
vita
min
Dnu
trien
ts v
itam
in E
pcbs
perc
hlor
ate
pest
icid
es a
trazi
nepe
stic
ides
car
bam
ate
pest
icid
es c
hlor
ophe
nol
pest
icid
es o
rgan
ochl
orin
epe
stic
ides
org
anop
hosp
hate
pest
icid
es p
yret
hyro
id
phen
ols
phth
alat
es
phyt
oest
roge
ns
poly
brom
inat
ed e
ther
spo
lyflo
uroc
hem
ical
s
vira
l inf
ectio
n
vola
tile
com
poun
ds
01
23
4
heptachlor epoxideγ-tocopherol
HDL-C: 1-10 mg/dLT2D: ~2-3 OR
Aim 4: Conclusion
29Wednesday, December 14, 11
Proof of Concept of EWAS
1. Background and Methods
2. Examples: Type 2 Diabetes, Serum Lipid Levels
3. Checking Validity and An “LD” Map of the Environment?
4. Conclusion
5. Informatics for the Environment
30Wednesday, December 14, 11
Human Genome Project to GWAS
Sequencing of the genome
2001
HapMap project:http://hapmap.ncbi.nlm.nih.gov/
Characterize common variation
2001-2003 (ongoing)
Measurement tools
“Variant SNP chip”~$400 for ~100,000 variants
~2003 (ongoing)
ARTICLES
Genome-wide association study of 14,000cases of seven common diseases and3,000 shared controlsThe Wellcome Trust Case Control Consortium*
There is increasing evidence that genome-wide association (GWA) studies represent a powerful approach to theidentification of genes involved in common human diseases.We describe a joint GWAstudy (using the Affymetrix GeneChip500KMapping Array Set) undertaken in the British population, which has examined,2,000 individuals for each of 7 majordiseases and a shared set of ,3,000 controls. Case-control comparisons identified 24 independent association signals atP, 53 1027: 1 in bipolar disorder, 1 in coronary artery disease, 9 in Crohn’s disease, 3 in rheumatoid arthritis, 7 in type 1diabetes and 3 in type 2 diabetes. On the basis of prior findings and replication studies thus-far completed, almost all of thesesignals reflect genuine susceptibility effects. We observed association at many previously identified loci, and foundcompelling evidence that some loci confer risk for more than one of the diseases studied. Across all diseases, we identified alarge number of further signals (including 58 loci with single-point P values between 1025 and 53 1027) likely to yieldadditional susceptibility loci. The importance of appropriately large samples was confirmed by the modest effect sizesobserved at most loci identified. This study thus represents a thorough validation of the GWA approach. It has alsodemonstrated that careful use of a shared control group represents a safe and effective approach to GWA analyses ofmultiple disease phenotypes; has generated a genome-wide genotype database for future studies of common diseases in theBritish population; and shown that, provided individuals with non-European ancestry are excluded, the extent of populationstratification in the British population is generally modest. Our findings offer new avenues for exploring the pathophysiologyof these important disorders. We anticipate that our data, results and software, which will be widely available to otherinvestigators, will provide a powerful resource for human genetics research.
Despite extensive research efforts for more than a decade, the geneticbasis of common humandiseases remains largely unknown. Althoughthere have been some notable successes1, linkage and candidate geneassociation studies have often failed to deliver definitive results. Yetthe identification of the variants, genes and pathways involved inparticular diseases offers a potential route to new therapies, improveddiagnosis and better disease prevention. For some time it has beenhoped that the advent of genome-wide association (GWA) studieswould provide a successful new tool for unlocking the genetic basisof many of these common causes of humanmorbidity andmortality1.
Three recent advances mean that GWA studies that are powered todetect plausible effect sizes are now possible2. First, the InternationalHapMap resource3, which documents patterns of genome-wide vari-ation and linkage disequilibrium in four population samples, greatlyfacilitates both the design and analysis of association studies. Second,the availability of dense genotyping chips, containing sets of hundreds ofthousands of single nucleotide polymorphisms (SNPs) that providegood coverage of much of the human genome, means that for the firsttimeGWAstudies for thousandsof cases andcontrols are technically andfinancially feasible. Third, appropriately large and well-characterizedclinical samples have been assembled for many common diseases.
The Wellcome Trust Case Control Consortium (WTCCC) wasformed with a view to exploring the utility, design and analyses ofGWA studies. It brought together over 50 research groups from theUK that are active in researching the genetics of common humandiseases, with expertise ranging from clinical, through genotyping, to
informatics and statistical analysis. Here we describe the main experi-ment of the consortium: GWA studies of 2,000 cases and 3,000 sharedcontrols for 7 complex human diseases of major public health import-ance—bipolar disorder (BD), coronary artery disease (CAD), Crohn’sdisease (CD), hypertension (HT), rheumatoid arthritis (RA), type 1diabetes (T1D), and type 2 diabetes (T2D). Two further experimentsundertaken by the consortium will be reported elsewhere: a GWAstudy for tuberculosis in 1,500 cases and 1,500 controls, sampled fromThe Gambia; and an association study of 1,500 common controls with1,000 cases for each of breast cancer, multiple sclerosis, ankylosingspondylitis and autoimmune thyroid disease, all typed at around15,000 mainly non-synonymous SNPs. By simultaneously studyingseven diseases with differing aetiologies, we hoped to develop insights,not only into the specific genetic contributions to each of the diseases,but also into differences in allelic architecture across the diseases. Afurther major aim was to address important methodological issues ofrelevance to all GWA studies, such as quality control, design and ana-lysis. In addition to our main association results, we address several ofthese issues below, including the choice of controls for genetic studies,the extent of population structure within Great Britain, sample sizesnecessary to detect genetic effects of varying sizes, and improvements ingenotype-calling algorithms and analytical methods.
Samples and experimental analyses
Individuals included in the study were living within England,Scotland and Wales (‘Great Britain’) and the vast majority had
*Lists of participants and affiliations appear at the end of the paper.
Vol 447 |7 June 2007 |doi:10.1038/nature05911
661Nature ©2007 Publishing Group
WTCCC, Nature, 2008.
Comprehensive, high-throughput analyses
31Wednesday, December 14, 11
Human Genome Project:Informatics Contributions
http://genome.ucsc.edu
http://www.ncbi.nlm.nih.gov/projects/SNP/
http://www.ncbi.nlm.nih.gov/geo/
http://hapmap.ncbi.nlm.nih.gov/
http://www.ncbi.nlm.nih.gov/gap
32Wednesday, December 14, 11
Gene Expression Omnibus
required submission
> 600,000 individuals (!)
enabled new investigationscollaboration & transparency
driven standardization
What if we had the same for environmental exposure measures?
http://www.ncbi.nlm.nih.gov/geo/
Informatics for the Environment?A GEO equivalent?
accessed 12/1/2011
33Wednesday, December 14, 11
Acknowledgments
•Atul Butte
• Jay Bhattacharya
•Mark Cullen
• John Ioannidis
•Rob Tibshirani
•Biomedical Informatics at Stanford
•CDC / NCHS
•NHANES and participants!
•National Academy of Sciences• Steve Rappaport
•National Library of Medicine
•March of Dimes Preterm Birth Research Center
• Stanford Prevention Center /SPECTRUM Population Studies Innovation Grant
email: [email protected]: @chiragjp
34Wednesday, December 14, 11