Methods to enhance the validity of precision guidelines emerging from big data
Chirag J Patel
Lorenzini Foundation; Venice, Italy
06/16/16
[email protected]
@chiragjp
www.chiragjpgroup.org
Data streams in public health are getting large!
Capacity to measure and compute becoming high-throughput and cheaper.
N = 500,000; 1M genetic variants; 1000s of phenotypes
Data streams in public health are getting large!
…and alternative data sets (e.g., EMR) are omnipresent!
image: Stan Shaw (MGH)
And new concepts for discovery: the exposome, an analog of the genome!
what to measure? how to measure?
SCIENCE, Vol. 330, p. 461, 22 October 2010, Perspectives (www.sciencemag.org)
[Figure: the exposome. The external environment (RADIATION, DIET, POLLUTION, INFECTIONS, DRUGS, LIFE-STYLE, STRESS) feeds the internal chemical environment (xenobiotics, inflammation, preexisting disease, lipid peroxidation, oxidative stress, gut flora); toxicologically important classes shown: reactive electrophiles, metals, endocrine disrupters, immune modulators, receptor-binding proteins.]
critical entity for disease etiology (7). Recent discussion has focused on whether and how to implement this vision (8). Although fully characterizing human exposomes is daunting, strategies can be developed for getting “snapshots” of critical portions of a person’s exposome during different stages of life. At one extreme is a “bottom-up” strategy in which all chemicals in each external source of a subject’s exposome are measured at each time point. Although this approach would have the advantage of relating important exposures to the air, water, or diet, it would require enormous effort and would miss essential components of the internal chemical environment due to such factors as gender, obesity, inflammation, and stress. By contrast, a “top-down” strategy would measure all chemicals (or products of their downstream processing or effects, so-called read-outs or signatures) in a subject’s blood. This would require only a single blood specimen at each time point and would relate directly to the person’s internal chemical environment. Once important exposures have been identified in blood samples, additional testing could determine their sources and methods to reduce them.
To make the top-down approach feasible, the exposome would comprise a profile of the most prominent classes of toxicants that are known to cause disease, namely, reactive electrophiles, endocrine (hormone) disruptors, modulators of immune responses, agents that bind to cellular receptors, and metals. Exposures to these agents can be monitored in the blood either by direct measurement or by looking for their effects on physiological processes (such as metabolism). These processes generate products that serve as signatures and biomarkers in the blood. For example, reactive electrophiles, which constitute the largest class of toxic chemicals (6), cannot generally be measured in the blood. However, metabolites of electrophiles are detectable in serum (9), and products of their reactions with blood nucleophiles, like serum albumin, offer possible signatures (10). Estrogenic activity could be used to monitor the effect of endocrine disruptors and can be measured through serum biomarkers. Immune modulators trigger the production of cytokines and chemokines that also can be measured in serum. Chemicals that bind to cellular receptors stimulate the production of serum biomarkers that can be detected with high-throughput screens (11). Metals are readily measured in blood (12), as are hormones, antibodies to pathogens, and proteins released by cells in response to stress. The accumulation of biologically important exposures may also be detected as changes to lymphocyte gene expression or in chemical modifications of DNA (such as methylation) (13).
The environmental equivalent of a GWAS is possible when signatures and biomarkers of the exposome are characterized in humans with known health outcomes. Indeed, a relevant prototype for such a study examined associations between type 2 diabetes and 266 candidate chemicals measured in blood or urine (14). It determined that exposure to certain chemicals produced strong associations with the risk of type 2 diabetes, with effect sizes comparable to the strongest genetic loci reported in GWAS. In another study, chromosome (telomere) length in peripheral blood mononuclear cells responded to chronic psychological stress, possibly mediated by the production of reactive oxygen species (15).
Characterizing the exposome represents a technological challenge like that of the human genome project, which began when DNA sequencing was in its infancy (16). Analytical systems are needed to process small amounts of blood from thousands of subjects. Assays should be multiplexed for measuring many chemicals in each class of interest. Tandem mass spectrometry, gene and protein chips, and microfluidic systems offer the means to do this. Platforms for high-throughput assays should lead to economies of scale, again like those experienced by the human genome project. And because exposome technologies would provide feedback for therapeutic interventions and personalized medicine, they should motivate the development of commercial devices for screening important environmental exposures in blood samples.
With successful characterization of both exposomes and genomes, environmental and genetic determinants of chronic diseases can be united in high-resolution studies that examine gene-environment interactions. Such a union might even push the nature-versus-nurture debate toward resolution.
References and Notes
1. P. Lichtenstein et al., N. Engl. J. Med. 343, 78 (2000).
2. L. A. Hindorff et al., Proc. Natl. Acad. Sci. U.S.A. 106, 9362 (2009).
3. W. C. Willett, Science 296, 695 (2002).
4. P. Vineis, Int. J. Epidemiol. 33, 945 (2004).
5. I. Dalle-Donne et al., Clin. Chem. 52, 601 (2006).
6. D. C. Liebler, Chem. Res. Toxicol. 21, 117 (2008).
7. C. P. Wild, Cancer Epidemiol. Biomarkers Prev. 14, 1847 (2005).
8. http://dels.nas.edu/envirohealth/exposome.shtml
9. W. B. Dunn et al., Int. J. Epidemiol. 37 (suppl. 1), i23 (2008).
10. F. M. Rubino et al., Mass Spectrom. Rev. 28, 725 (2009).
11. T. I. Halldorsson et al., Environ. Res. 109, 22 (2009).
12. S. Mounicou et al., Chem. Soc. Rev. 38, 1119 (2009).
13. C. M. McHale et al., Mutat. Res. 10.1016/j.mrrev.2010.04.001 (2010).
14. C. J. Patel et al., PLoS ONE 5, e10746 (2010).
15. E. S. Epel et al., Proc. Natl. Acad. Sci. U.S.A. 101, 17312 (2004).
16. F. S. Collins et al., Science 300, 286 (2003).
17. Supported by NIEHS through grants U54ES016115 and P42ES04705.
Characterizing the exposome. The exposome represents the combined exposures from all sources that reach the internal chemical environment. Toxicologically important classes of exposome chemicals are shown. Signatures and biomarkers can detect these agents in blood or serum.
10.1126/science.1192603
Published by AAAS
how to analyze in health?
Wild, 2005; Rappaport and Smith, 2010
Buck-Louis and Sundaram, 2012; Miller and Jones, 2014
Patel CJ and Ioannidis JPA, 2014
Many challenges exist in the use of big data for discovery, guideline development, and causal research
INSIGHTS, SCIENCE Vol. 346, Issue 6213, p. 1054, 28 November 2014 (sciencemag.org)
In 1854, as cholera swept through London, John Snow, the father of modern epidemiology, painstakingly recorded the locations of affected homes. After long, laborious work, he implicated the Broad Street water pump as the source of the outbreak, even without knowing that a Vibrio organism caused cholera. “Today, Snow might have crunched Global Positioning System information and disease prevalence data, solving the problem within hours” (1). That is the potential impact of “Big Data” on the public’s health. But the promise of Big Data is also accompanied by claims that “the scientific method itself is becoming obsolete” (2), as next-generation computers, such as IBM’s Watson (3), sift through the digital world to provide predictive models based on massive information. Separating the true signal from the gigantic amount of noise is neither easy nor straightforward, but it is a challenge that must be tackled if information is ever to be translated into societal well-being.
The term “Big Data” refers to volumes of large, complex, linkable information (4). Beyond genomics and other “omic” fields, Big Data includes medical, environmental, financial, geographic, and social media information. Most of this digital information was unavailable a decade ago. This swell of data will continue to grow, stoked by sources that are currently unimaginable. Big Data stands to improve health by providing insights into the causes and outcomes of disease, better drug targets for precision medicine, and enhanced disease prediction and prevention. Moreover, citizen-scientists will increasingly use this information to promote their own health and wellness. Big Data can improve our understanding of health behaviors (smoking, drinking, etc.) and accelerate the knowledge-to-diffusion cycle (5).
But “Big Error” can plague Big Data. In 2013, when influenza hit the United States hard and early, analysis of flu-related Internet searches drastically overestimated peak flu levels (6) relative to those determined by traditional public health surveillance. Even more problematic is the potential for many false alarms triggered by large-scale examination of putative associations with disease outcomes. Paradoxically, the proportion of false alarms among all proposed “findings” may increase when one can measure more things (7). Spurious correlations and ecological fallacies may multiply. There are numerous such examples (8), such as “honey-producing bee colonies inversely correlate with juvenile arrests for marijuana.”

The field of genomics has addressed this problem of signal and noise by requiring replication of study findings and by asking for much stronger signals in terms of statistical significance. This requires the use of collaborative large-scale epidemiologic studies. For nongenomic associations, false alarms due to confounding variables or other biases are possible even with very large-scale studies, extensive replication, and very strong signals (9). Big Data’s strength is in finding associations, not in showing whether these associations have meaning. Finding a signal is only the first step.
Even John Snow needed to start with a plausible hypothesis to know where to look, i.e., choose what data to examine. If all he had was massive amounts of data, he might well have ended up with a correlation as spurious as the honey bee–marijuana connection. Crucially, Snow “did the experiment.” He removed the handle from the water pump and dramatically reduced the spread of cholera, thus moving from correlation to causation and effective intervention.
How can we improve the potential for Big Data to improve health and prevent disease? One priority is that a stronger epidemiological foundation is needed. Big Data analysis is currently largely based on convenient samples of people or information available on the Internet. When associations are probed between perfectly measured data (e.g., a genome sequence) and poorly measured data (e.g., administrative claims health data), research accuracy is dictated by the weakest link. Big Data are observational in nature and are fraught with many biases such as selection, confounding variables, and lack of generalizability. Big Data analysis may be embedded in epidemiologically well-characterized and representative populations. This epidemiologic approach has served the genomics community well (10) and can be extended to other types of Big Data.
There also must be a means to integrate knowledge that is based on a highly iterative process of interpreting what we know and don’t know from within and across scientific disciplines. This requires knowledge management, knowledge synthesis, and knowledge translation (11). Curation can be aided by machine learning algorithms. An example is the ClinGen project (12) that will create centralized resources of clinically annotated genes to improve interpretation of genomic variation and optimize the use of genomics in practice. And new funding, such as the Biomedical Data to Knowledge awards of the U.S. National Institutes of Health, will develop new tools and training in this arena.
From validity to utility. Big Data can improve tracking and response to infectious disease outbreaks, discovery of early warning signals of disease, and development of diagnostic tests and therapeutics.
By Muin J. Khoury (1,2) and John P. A. Ioannidis (3)
MEDICINE
(1) Office of Public Health Genomics, Centers for Disease Control and Prevention, Atlanta, GA 30333, USA. (2) Epidemiology and Genomics Research Program, National Cancer Institute, Bethesda, MD 20850, USA. (3) Stanford Prevention Research Center and Meta-Research Innovation Center at Stanford, Stanford University, Palo Alto, CA 94305, USA. E-mail: [email protected]; [email protected]
LU
ST
RA
TIO
N:
V.
AL
TO
UN
IAN
/SCIENCE
Big data meets public health
Human well-being could benefit from large-scale data if large-scale noise is minimized
Science, 2014
Many challenges exist in the use of big data for discovery, guideline development, and causal research
Thousands of hypotheses are possible. Multiplicity of hypotheses.
Big data are observational. Multiplicity of biases: confounding, selection, and reverse causality.
Millions of analytic scenarios are possible. Multiplicity of analytic methods.
Big Data offers a multiplicity of possible hypotheses!
Example: a cohort database of E exposures and P phenotypes
Hum Genet 2012 JECH 2014
Curr Env Health Rep 2016
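The E × P multiplicity above can be made concrete with a back-of-the-envelope count (the E, P, and alpha values below are hypothetical, not taken from the slides):

```python
# Hypothetical count: E exposures crossed with P phenotypes yields E * P
# candidate hypotheses, and at a fixed alpha the expected number of chance
# "findings" under the global null grows linearly with the number of tests.
E, P = 250, 1000        # hypothetical database: 250 exposures, 1000 phenotypes
alpha = 0.05
n_tests = E * P
expected_false_positives = alpha * n_tests  # if every null hypothesis were true
print(n_tests, expected_false_positives)    # 250000 12500.0
```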
[Figure: a cohort database drawn as a grid of E exposure factors (columns e-1, e-2, …) by P phenotypic factors (rows p-1, p-2, …, p), with numbered cells marking candidate associations.]
which ones to test? all? the ones in blue?
E times P possibilities! how to detect signal from noise?
Big Data offers a multiplicity of possible hypotheses!… that depends on the domain (type of measure)!
JECH, 2014
National Health and Nutrition Examination Survey (NHANES)
Interdependencies of the variables: correlation globes paint a complex view of exposure and behavior
For each pair of E: Spearman ρ (575 factors: 81,937 correlations)
Red: positive ρ; blue: negative ρ; thickness: |ρ|
Permuted data to produce “null ρ”
Sought replication in > 1 cohort
JAMA 2014; Pac Symp Biocomput. 2015; JECH 2015
National Health and Nutrition Examination Survey (NHANES)
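The correlation-globe computation above can be sketched as follows; the synthetic data, toy sample size, and single-permutation null are assumptions standing in for the 575-factor NHANES analysis:

```python
import numpy as np

# Sketch: Spearman rho for every pair of exposure variables, with a
# permutation null to decide which correlations exceed chance.
rng = np.random.default_rng(1)
n, k = 200, 6                        # 200 participants, 6 factors (toy scale)
X = rng.normal(size=(n, k))
X[:, 1] += 0.8 * X[:, 0]             # plant one real correlation

def spearman(a, b):
    """Spearman rho = Pearson correlation of the rank-transformed data."""
    ra, rb = np.argsort(np.argsort(a)), np.argsort(np.argsort(b))
    return np.corrcoef(ra, rb)[0, 1]

pairs = [(i, j) for i in range(k) for j in range(i + 1, k)]
rho = {p: spearman(X[:, p[0]], X[:, p[1]]) for p in pairs}

# Permutation null: shuffling one column breaks any real pairing.
null_rho = [spearman(rng.permutation(X[:, i]), X[:, j]) for i, j in pairs]
cutoff = np.quantile(np.abs(null_rho), 0.95)
signals = [p for p, r in rho.items() if abs(r) > cutoff]
print(signals)
```

The real analysis uses far more permutations per pair; a single shuffle per pair is kept here only to keep the sketch short.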
How to enhance the validity of precision guidelines emerging from big data?
1.) Test systematically, address multiplicity, and replicate.
2.) Consider modeling scenarios explicitly.
3.) Practice reproducible research and increase data literacy.
Test systematically and replicate. Examples: “environment-wide” or “nutrient-wide”
association studies
A search engine for robust, reproducible genotype-phenotype associations…
NATURE GENETICS, Volume 42, Number 7, July 2010, p. 580
ARTICLES
13 autosomal loci exceeded the threshold for genome-wide significance (P ranging from 2.8 × 10^−8 to 1.4 × 10^−22) with allele-specific odds ratios (ORs) between 1.06 and 1.14 (Table 1 and Fig. 2). All signals remained close to or beyond genome-wide significance thresholds (the least significant P value was 5.2 × 10^−8) when we repeated analyses after implementing a second (post meta-analysis) round of genomic control adjustment within stage 1 data (Supplementary Note).
We extended our search for susceptibility signals to the X chromosome, identifying one further signal in the stage 1 discovery samples meeting our criteria for follow-up (represented by rs5945326, near DUSP9, P = 2.3 × 10^−6). This SNP showed strong evidence for replication in 8,535 cases and 12,326 controls (OR (allowing for X-inactivation) 1.32 (95% CI 1.16–1.49), P = 2.3 × 10^−5), for a combined association P value of 3.0 × 10^−10 (OR 1.27 (95% CI 1.18–1.37)) (Table 1 and Fig. 2).
Fourteen signals reaching genome-wide significance
Two of the 14 signals reaching genome-wide significance on joint analysis (those near MTNR1B and IRS1) represent loci for which T2D associations have been recently reported in samples which partially overlap with those studied here10,14–16 (Table 1).
A third signal (rs231362) on 11p15 overlaps both intron 11 of KCNQ1 and the KCNQ1OT1 transcript that controls regional imprinting17 and influences expression of nearby genes including CDKN1C, a known regulator of beta-cell development18. This signal maps ~150 kb from T2D-associated SNPs in the 3′ end of KCNQ1 first identified in East Asian GWA scans8,9. SNPs within the 3′ signal were also detected in the current DIAGRAM+ meta-analysis (for example, rs163184, P = 6.8 × 10^−5), but they failed to meet the threshold for initiating replication. A SNP in the 3′ region (rs2237895) that was reported to reach genome-wide significance in Danish samples9 was neither typed nor imputed in the DIAGRAM+ studies. In our European-descent samples, rs231362 and SNPs in the 3′ signal were not correlated (r2 < 0.05), and conditional analyses (see below) establish these SNPs as independent (Fig. 2 and Supplementary Table 4). Further analysis in Icelandic samples has shown that both associations are restricted to the maternally transmitted allele11. Both T2D loci are independent of the common variant associations with electrocardiographic QT intervals that map at the 5′ end of KCNQ1 (r2 < 0.02, D′ < 0.35 in HapMap European CEU data)19,20 (Supplementary Table 5).
Of the remaining loci, two (near BCL11A and HNF1A) have been highlighted in previous studies7,21–23 but are now shown to reach genome-wide significance. Rare mutations in HNF1A account for a substantial proportion of cases of maturity onset diabetes of the young, and a population-specific variant (G319S) influences T2D risk in Oji-Cree Indians24. Confirmation of a common variant association at HNF1A brings to five the number of loci known to harbor both rare mutations causal for monogenic forms of diabetes and common variants predisposing to multifactorial diabetes, the others being PPARG, KCNJ11, WFS1 and HNF1B. A T2D association in the BCL11A region was suggested by the earlier DIAGRAM meta-analysis (rs10490072, P = 3 × 10^−5), but replication was inconclusive7; there is only modest linkage disequilibrium (LD) between rs10490072 and the lead SNP from the present analysis (rs243021, r2 = 0.22, D′ = 0.73 in HapMap CEU).
The remaining nine signals map near the genes HMGA2, CENTD2, KLF14, PRC1, TP53INP1, ZBED3, ZFAND6, CHCHD9 and DUSP9 (Table 1 and Figs. 1 and 2) and represent new T2D risk loci uncovered by the DIAGRAM+ meta-analysis.
Understanding the genetic architecture of type 2 diabetes
Combining newly identified and previously reported loci and assuming a multiplicative model, the sibling relative risk attributable to the 32 T2D susceptibility variants described in this paper is ~1.14. With addition of the five T2D loci recently identified by the Meta-Analysis of Glucose and Insulin-related traits Consortium (MAGIC) investigators12,13 and
[Figure 1 residue: Manhattan plots (−log10(P) vs. chromosomes 1–22 and X) with locus labels BCL11A, THADA, NOTCH2, ADAMTS9, IRS1, IGF2BP2, WFS1, ZBED3, CDKAL1, HHEX/IDE, KCNQ1 (2 signals: conditional and unconditional analysis), TCF7L2, KCNJ11, CENTD2, MTNR1B, HMGA2, ZFAND6, PRC1, FTO, HNF1B, DUSP9, TSPAN8/LGR5, HNF1A, CDC123/CAMK1D, CHCHD9, CDKN2A/2B, SLC30A8, TP53INP1, JAZF1, KLF14, PPARG. Legend: locus established previously; locus identified by current study; locus not confirmed by current study; suggestive statistical association (P < 1 × 10^−5); association in identified or established region (P < 1 × 10^−4).]
Figure 1 Genome-wide Manhattan plots for the DIAGRAM+ stage 1 meta-analysis. Top panel summarizes the results of the unconditional meta-analysis. Previously established loci are denoted in red and loci identified by the current study are denoted in green. The ten signals in blue are those taken forward but not confirmed in stage 2 analyses. The genes used to name signals have been chosen on the basis of proximity to the index SNP and should not be presumed to indicate causality. The lower panel summarizes the results of equivalent meta-analysis after conditioning on 30 previously established and newly identified autosomal T2D-associated SNPs (denoted by the dotted lines below these loci in the upper panel). Newly discovered conditional signals (outside established loci) are denoted with an orange dot if they show suggestive levels of significance (P < 10^−5), whereas secondary signals close to already confirmed T2D loci are shown in purple (P < 10^−4).
Voight et al., Nature Genetics 2010; N = 8K T2D cases, 39K controls
GWAS in Type 2 Diabetes
A prime example of systematic associations: genome-wide association studies (GWASs)
Searching for 246 exposures and behaviors associated with all-cause mortality.
NHANES 1999-2004; National Death Index linked mortality
246 behaviors and exposures (serum/urine/self-report)
Cox proportional hazards: baseline exposure and time to death
Discovery, NHANES 1999-2001: N = 330 to 6,008 (26 to 655 deaths); ~5.5 years of follow-up; false discovery rate < 5%
Replication, NHANES 2003-2004: N = 177 to 3,258 (20 to 202 deaths); ~2.8 years of follow-up; p < 0.05
Int J Epidem. 2013
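The two-stage design above (FDR < 5% in discovery, p < 0.05 in replication) can be sketched with the Benjamini–Hochberg step-up procedure; BH is a standard FDR-control method assumed here rather than taken from the paper, and the p-values below are made up:

```python
def benjamini_hochberg(pvals, q=0.05):
    """Indices surviving the Benjamini-Hochberg step-up at FDR level q."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= q * rank / m:   # compare p_(rank) to q * rank / m
            k_max = rank               # keep the largest qualifying rank
    return set(order[:k_max])          # reject everything at or below it

# Toy discovery stage: four hypothetical "exposures" tested against mortality.
discovery_p = [0.001, 0.004, 0.030, 0.900]
survivors = benjamini_hochberg(discovery_p, q=0.05)

# Replication stage: survivors must also reach p < 0.05 in the later cohort.
replication_p = {0: 0.01, 1: 0.20, 2: 0.03}
replicated = {i for i in survivors if replication_p.get(i, 1.0) < 0.05}
print(sorted(survivors), sorted(replicated))   # [0, 1, 2] [0, 2]
```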
Searching >250 environmental and behavioral factors in all-cause mortality
[Figure: volcano plots of adjusted hazard ratio (x axis, 0.4 to 2.8) vs. −log10(p-value) (y axis, 0 to 8), with the FDR < 5% threshold marked.]
Replicated factors (11): 1 physical activity; 2 does anyone smoke in home?; 3 cadmium; 4 cadmium, urine; 5 past smoker; 6 current smoker; 7 trans-lycopene
Sociodemographics (69): 1 age (10 year increment); 2 SES_1; 3 male; 4 SES_0; 5 black; 6 SES_2; 7 SES_3; 8 education_hs; 9 other_eth; 10 mexican; 11 occupation_blue_semi; 12 education_less_hs; 13 occupation_never; 14 occupation_blue_high; 15 occupation_white_semi; 16 other_hispanic
Annotated factors: age (10 years); income (quintiles 1, 2, 3); male; black; any one smoke in home?; serum and urine cadmium [1 SD]; past smoker?; current smoker?; serum lycopene [1 SD]; physical activity [low, moderate, high activity]*
*derived from METs per activity and categorized by Health.gov guidelines
R2 ~ 2%
Int J Epidem. 2013
Searching 82 dietary factors in blood pressure: INTERMAP and NHANES
criterion. In the INTERMAP test dataset, only 3 of these 11 were nominally significant (P<0.05): alcohol, urinary calcium, and urinary sodium-to-potassium ratio (Table IV in the online-only Data Supplement). For diastolic BP, 40 dietary or supplement variables entered the initial model, and 10 were selected by AIC (Table IV in the online-only Data Supplement). In the INTERMAP test dataset, only alcohol intake retained nominal significance. Thus, although we have evidence pointing to some independent effects of nutrients for systolic BP, multivariable estimates were attenuated or lost significance compared with their main effects documented above (Figures 2 and 3).

The absolute effect sizes (INTERMAP testing set) ranged from 2.06 mm Hg lower systolic BP (phosphorus) to 0.81 mm Hg lower systolic BP (non-heme iron) per 1-SD difference in nutrient variable. The effect sizes between the INTERMAP training set and testing set were not systematically different (5 estimates were higher and 8 were lower for systolic BP). The effect sizes between nutrients obtained from foods or from food and supplements combined were similar in some cases (eg, phosphorus, magnesium, fiber; Tables 1 and 2), though for some tentatively validated nutrients from foods (eg, folacin, riboflavin, and thiamin) effect sizes incorporating supplemental and food intake were attenuated and no longer reached the FDR 5% threshold (FDR 10%, 94%, and 97% for folacin, riboflavin, and thiamin, respectively, for systolic BP; Figures 2B and 3B).

NHANES
Tables 1 and 2 show the associations between the tentatively validated dietary factors with systolic BP and diastolic BP across

Figure 2. Volcano plot graphic showing the nutrient-wide associations with systolic blood pressure levels in the INTERMAP training set for nutrients received from foods and urine excretion markers (A) and for nutrients received from foods and supplements (B). y axis indicates −log10(P value) of the adjusted linear regression coefficient for each of the nutrients. Horizontal (dotted) line represents the level of significance corresponding to FDR less than 5%, and the x axis shows the effect sizes (mm Hg) per 1 SD change in the nutrient variable. Filled marks represent tentatively validated nutrients in the INTERMAP testing set (P<0.05). Analyses are adjusted for age, sex, reported special diet, use of dietary supplements, moderate or heavy physical activity (hours daily), doctor-diagnosed cardiovascular disease or diabetes mellitus, family history of hypertension, height, weight, and total energy intake. INTERMAP indicates International Collaborative Study on Macro-/Micronutrients and Blood Pressure; FDR, false discovery rate.
Tzoulaki et al A Nutrient-Wide Association Study 2459
Circulation. 2012
association size
FDR < 5%
R2 ~ 7%
Testing all associations systematically: consideration of the multiplicity of hypotheses and the correlational web!
Be explicit about the number of hypotheses tested.
False discovery rate; family-wise error rate; report database size!
Does my correlation matter? How does my new correlation compare to the family of correlations?
ρ = 0.17 (e.g., carotene and diabetes): is the average ρ much less than 0.17? greater?
JAMA 2014; JECH 2015
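Locating one correlation within the family of correlations, as the slide asks, amounts to reading off an empirical percentile; a minimal sketch with a made-up family of |ρ| values (only ρ = 0.17 comes from the slide):

```python
from statistics import median

# Hypothetical family of |rho| values from a database of prior associations.
family = [0.02, 0.03, 0.05, 0.06, 0.08, 0.10, 0.11, 0.13, 0.15, 0.22]
new_rho = 0.17  # e.g., carotene and diabetes (value from the slide)

# Empirical percentile: fraction of family correlations below the new one.
percentile = sum(r < new_rho for r in family) / len(family)
median_rho = median(family)
print(f"rho={new_rho} sits at the {percentile:.0%} percentile "
      f"(family median {median_rho:.2f})")
```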
Consideration of a multitude of modeling scenarios. Example: Vibration of Effects, the empirical distribution of effect sizes due to model choice
Significance, September 2011
Multiple modelling
This problem is akin to – but less well recognised and more poorly understood than – multiple testing. For example, consider the use of linear regression to adjust the risk levels of two treatments to the same background level of risk. There can be many covariates, and each set of covariates can be in or out of the model. With ten covariates, there are over 1000 possible models. Consider a maze as a metaphor for modelling (Figure 3). The red line traces the correct path out of the maze. The path through the maze looks simple, once it is known. Returning to a linear regression model, terms can be put into and taken out of a regression model. Once you get a p-value smaller than 0.05, the model can be frozen and the model selection justified after the fact. It is easy to justify each turn.
The combination of multiple testing and multiple modelling can lead to a very large search space, as the example of bisphenol A in Box 3 shows. Such large search spaces can give small, false positive p-values somewhere within them. Unfortunately, authors and consumers are often like a deer caught in the headlights and take a small p-value as indicating a real effect.
How can it be fixed? A new, combined strategy
It should be clear by now that more than small-scale remedies are needed. The entire system of observational studies and the claims that are made from them is no longer functional, nor is it fit for purpose. What can be done to fix this broken system? There are no principled ways in the literature for dealing with model selection, so we propose a new, composite strategy. Following Deming, it is based not upon the workers – the researchers – but on the production system managers – the funding agencies and the editors of the journals where the claims are reported.
We propose a multi-step strategy to help bring observational studies under control (see Table 2). The main technical idea is to split the data into two data sets, a modelling data set and a holdout data set. The main operational idea is to require the journal to accept or reject the paper based on an analysis of the modelling data set without knowing the results of applying the methods used for the modelling set on the holdout set and to publish an addendum to the paper giving the results of the analysis of the holdout set. We now cover the steps, one by one.
1 The data collection and clean-up should be done by a group separate from the analysis group. There can be a temptation on the part of the analyst to do some exploratory data analysis during the data clean-up. Exploratory analysis could lead to model selection bias.
Box 2. Publication bias
There is general recognition that a paper has a much better chance of acceptance if something new is found. This means that, for publication, the claim in the paper has to be based on a p-value less than 0.05. From Deming’s point of view5, this is quality by inspection. The journals are placing heavy reliance on a statistical test rather than examination of the methods and steps that lead to a conclusion. As to having a p-value less than 0.05, some might be tempted to game the system10 through multiple testing, multiple modelling or unfair treatment of bias, or some combination of the three that leads to a small p-value. Researchers can be quite creative in devising a plausible story to fit the statistical finding.
2 The data cleaning team creates a modelling data set and a holdout set and gives the modelling data set, less the item to be predicted, to the analyst for examination.
Figure 3. The path through a complex process can appear quite simple once the path is defined. Which terms are included in a multiple linear regression model? Each turn in a maze is analogous to including or not a specific term in the evolving linear model. By keeping an eye on the p-value on the term selected to be at issue, one can work towards a suitably small p-value. © ktsdesign – Fotolia
Table 2. Steps 0–7 can be used to help bring the observational study process into control. Currently researchers analysing observational data sets are under no effective oversight

Step | Process / Action
0 | Data are made publicly available
1 | Data cleaning and analysis separate
2 | Split sample: A, modelling; and B, holdout (testing)
3 | Analysis plan is written, based on modelling data only
4 | Written protocol, based on viewing predictor variables of A
5 | Analysis of A only data set
6 | Journal accepts paper based on A only
7 | Analysis of B data set gives Addendum
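Steps 0–7 above can be sketched as a data-handling protocol; the 50/50 split, record layout, and variable names are assumptions for illustration:

```python
import random

# Sketch of the split-sample protocol (steps 2-7): a modelling set the
# analyst sees without outcomes, and a holdout set revealed only for the
# post-acceptance addendum.
random.seed(42)
records = [{"id": i, "outcome": i % 2} for i in range(100)]  # toy data
random.shuffle(records)
modelling, holdout = records[:50], records[50:]

# Step 2: the analyst receives the modelling set *without* the outcome,
# so data clean-up cannot double as covert model selection.
analyst_view = [{k: v for k, v in r.items() if k != "outcome"}
                for r in modelling]

# Steps 5-7: the analysis plan is frozen on set A; only afterwards is the
# holdout outcome (set B) used to write the addendum.
assert all("outcome" not in r for r in analyst_view)
print(len(modelling), len(holdout))   # 50 50
```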
A maze of associations is one way to a fragmented literature and Vibration of Effects
Young, 2011
univariate
sex
sex & age
sex & race
sex & race & age
JCE, 2015
Distribution of associations and p-values due to model choice: estimating the Vibration of Effects (or Risk)

Variable of interest: e.g., 1 SD of log(serum Vitamin D)
Adjusting variable set (n = 13): SES [3rd tertile], education [>HS], race [white], body mass index [normal], total cholesterol, any heart disease, family heart disease, any hypertension, any diabetes, any cancer, current/past smoker [no smoking], drink 5/day, physical activity
All-subsets Cox regression: 2^13 + 1 = 8,193 models
Data source: NHANES 1999–2004; 417 variables of interest; outcome: time to death; N ≧ 1,000 (≧ 100 deaths)
Outputs per model: effect sizes and p-values.
Vibration of Effects summaries (percentiles taken across all models):
Relative Hazard Ratio (RHR) = HR99 / HR1
Range of P-value (RP) = -log10(p-value1) + log10(p-value99)
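Given the per-model hazard ratios and p-values, the two summaries are simple percentile contrasts. A stdlib-only sketch (nearest-rank percentiles; my helper names, not from the paper):

```python
import math

def percentile(sorted_vals, q):
    """Nearest-rank percentile of a pre-sorted list, q in [0, 100]."""
    idx = round(q / 100 * (len(sorted_vals) - 1))
    return sorted_vals[min(len(sorted_vals) - 1, max(0, idx))]

def vibration_summary(hazard_ratios, p_values):
    """Summarise Vibration of Effects across all fitted models:
    RHR = 99th / 1st percentile of the hazard ratio;
    RP  = spread of -log10(p) between the extreme percentiles."""
    hrs = sorted(hazard_ratios)
    ps = sorted(p_values)
    rhr = percentile(hrs, 99) / percentile(hrs, 1)
    rp = -math.log10(percentile(ps, 1)) + math.log10(percentile(ps, 99))
    return rhr, rp
```

An RHR near 1 and a small RP mean the association is stable to adjustment choice; large values mean the reported effect depends heavily on which model the analyst happened to pick.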
[Figure: Vibration of Effects for Vitamin D (1 SD of log(serum Vitamin D)), panels A–F. Each point is one adjusted Cox model, plotted as hazard ratio (x-axis) vs. −log10(p-value) (y-axis); numbers 0–13 mark the median HR/p-value for each count of adjusters k, and 1/50/99 mark percentile indicators. RHR = 1.14, RP = 4.68. JCE, 2015]
The Vibration of Effects: Vitamin D and Thyroxine and attenuated risk in mortality

[Figure: hazard ratio vs. −log10(p-value) across all adjusted models. Vitamin D (1 SD(log)): RHR = 1.14, RP = 4.68. Thyroxine (1 SD(log)): RHR = 1.15, RP = 2.90. JCE, 2015]
The Vibration of Effects: shifts in the effect size distribution due to select adjustments (e.g., adjusting cadmium levels for smoking status)

[Figure: Vitamin D (1 SD(log)): RHR = 1.14, RP = 4.68. Thyroxine (1 SD(log)): RHR = 1.15, RP = 2.90. Cadmium (1 SD(log)), with models highlighted by adjustment for drink_five_per_day and for current_past_smoking: RHR = 1.29, RP = 8.29. JCE, 2015]
[Figure gallery: Vibration of Effects (hazard ratio vs. −log10(p-value) across all adjusted models) for selected variables; values given as (RHR, RP). JCE, 2015]

Marital status (vs. married): never married (1.43, 0.49); with partner (1.69, 0.75); refused (3.87, 0.96).

Per 1 SD(log): alpha-carotene (1.15, 0.42); alcohol (1.16, 2.41); vitamin E as alpha-tocopherol (1.15, 3.17); beta-carotene (1.15, 2.34); caffeine (1.10, 1.99); calcium (1.13, 1.15); carbohydrate (1.12, 1.57); carotene (1.14, 1.53); cholesterol (1.08, 0.64); copper (1.17, 2.86); beta-cryptoxanthin (1.15, 1.39); folic acid (1.09, 0.41); folate, DFE (1.14, 2.39); food folate (1.14, 4.64); dietary fiber (1.15, 2.79); total folate (1.15, 2.11); iron (1.12, 1.91); potassium (1.14, 2.28); protein (1.11, 1.42); retinol (1.13, 0.67); SFA 4:0 (1.11, 1.29); SFA 6:0 (1.11, 1.71); SFA 8:0 (1.10, 2.55); SFA 10:0 (1.11, 1.87); SFA 12:0 (1.08, 1.79); SFA 14:0 (1.11, 1.61); SFA 16:0 (1.11, 0.84); SFA 18:0 (1.10, 0.73); selenium (1.09, 1.24); total saturated fatty acids (1.11, 0.93); sodium (1.12, 3.74); total sugars (1.13, 2.51); total fat (1.11, 0.54); theobromine (1.08, 1.19); vitamin A (1.13, 1.09); vitamin A, RAE (1.16, 1.31); retinol (1.15, 0.74).
The Vibration of Effects: beware of the Janus effect (both risk and protection?!)

Janus (two-faced) risk profile: risk and significance depend on the modeling scenario!

[Figure: plot quadrants labelled "risk", "protection", and "significant"; image: Britannica.com]
Accessible analytics tools and computing infrastructure exist to enable reproducible research

"Ability to recompute data analytic results given an observed dataset and knowledge of the pipeline…" (Leek and Peng, PNAS 2015)

(1) Raw data are available
(2) Analytics code and documentation are available
(3) Correct analysis methodology
(4) Trained data analysts to execute the research
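One concrete way to operationalise "recompute given the data and the pipeline" is to fingerprint the analytic result, so a recomputation can be checked byte for byte. A minimal sketch (the pipeline here is a hypothetical stand-in):

```python
import hashlib
import json

def run_pipeline(raw_data):
    """Stand-in for a documented analysis pipeline (hypothetical):
    deterministically computes a summary from the raw data."""
    return {"n": len(raw_data), "mean": sum(raw_data) / len(raw_data)}

def result_fingerprint(result):
    """Hash the analytic result so a recomputation from the same raw
    data and pipeline can be verified against the published value."""
    payload = json.dumps(result, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()

# Recomputing from the same inputs must reproduce the same fingerprint.
print(result_fingerprint(run_pipeline([1, 2, 3, 4])))
```

Publishing the fingerprint alongside the raw data and code gives reviewers a cheap pass/fail reproducibility check before any substantive critique.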
In conclusion: big data promise a multitude of ways to discover precision guidelines

Thousands of hypotheses are possible: a multiplicity of hypotheses.
Big data are observational: a multiplicity of biases (confounding, selection, reverse causality).
Millions of analytic scenarios are possible: a multiplicity of analytic methods.

To enhance the validity of big data results, we must:

1.) Test systematically, account for multiple hypothesis tests, and replicate.
2.) Consider modeling scenarios explicitly.
3.) Practice reproducible research and increase data literacy.
Acknowledgements

Harvard DBMI: Isaac Kohane, Susanne Churchill, Stan Shaw, Jenn Grandfield, Sunny Alvear, Michal Preminger
Harvard Chan: Hugues Aschard, Francesca Dominici
Stanford: John Ioannidis; Atul Butte (UCSF)
U Queensland: Jian Yang, Peter Visscher
Cochrane: Belinda Burford
RagGroup: Chirag Lakhani, Adam Brown, Danielle Rasooly, Arjun Manrai, Erik Corona, Nam Pho
Lorenzini Foundation: Dennis Bier, Emanuela Folco, Elena Colombo
NIH Common Fund: Big Data to Knowledge

Chirag J Patel
[email protected]
@chiragjp
www.chiragjpgroup.org