James Brophy MD FRCP PhD McGill University Health Center,
McGill University, Montreal, Quebec
Réseau Québécois de Recherche sur les Médicaments
Session II: Big Data: a Quebec gold mine to exploit
June 1, 2015
Pharmacoepidemiology – Big data, Big problems, Big solutions
Conflicts of Interest
I have no known conflicts associated with this presentation and, to the best of my knowledge,
am equally disliked by all pharmaceutical and device companies
http://www.nofreelunch.org/
Outline - agenda
• In the context of pharmacoepidemiology
• What are big data?
• What are the big problems with big data?
• Are there innovative solutions to these problems?
What is the definition of big data?
• Something that
  – doesn’t fit into Excel (65,536-row limit in older versions)
  – makes you say “wow”
  – makes you uncomfortable working with it
  – only applies to genomics
• Wikipedia
  – Big data is high-volume, high-velocity, and/or high-variety information to enable enhanced decision making, insight discovery and process optimization.
How big is big data?
Just because it’s big, is it right?
http://oig.ssa.gov/sites/default/files/audit/full/pdf/A-06-14-34030_0.pdf
Over 6 million Americans have reached the age of 112. Just 13 are claiming benefits, and 67,000 of them are WORKING.
More big data hubris
1. 2008 stock market crash – lots of economic data but incorrect models failed to predict and even facilitated the crash (Black Swan – N. Taleb)
2. Google - “…we can accurately estimate the current level of weekly influenza activity in each region of the United States, with a reporting lag of about one day.” (Nature 2009)
More big data hubris
• Google Flu was wrong for 100 out of 108 weeks since August 2011
• Error was a systematic over-estimate (Science Mar 14 2014)
So the big question…
• It is not the volume, velocity or variety of the data that is the problem, but rather its VERACITY
• Also a problem for pharmacoepidemiology?
Pharmacoepidemiology 2010
2010
• Both studies used the UK GPRD database
1996-2006 & 1995-2005
BMJ – RR 2
JAMA – RR 1.07
Me, too
2 RAMQ cohorts
NEJM RCT 2014
Problems with Big Data
• Most big data is observational -> biases (selection, information) and confounding
• Big data -> small random errors, tight CIs, small p values, but systematic errors not measured in these CIs -> false sense of precision
• Big data often leads to ignoring other pertinent evidence that should be synthesized to reach the most reasonable conclusions
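The false sense of precision can be made concrete with a minimal sketch. It assumes a hypothetical estimator with a fixed systematic error (`BIAS`) and a known standard deviation (`SIGMA`); none of these numbers come from the studies discussed in this talk. As n grows, the expected 95% CI tightens around the biased value and eventually excludes the true effect.

```python
import math

def expected_ci(true_effect, bias, sigma, n, z=1.96):
    """Expected 95% CI of an estimator centred on true_effect + bias.

    Only random error (sigma / sqrt(n)) enters the interval width;
    the systematic error shifts the centre and is invisible to the CI.
    """
    centre = true_effect + bias
    half_width = z * sigma / math.sqrt(n)
    return centre - half_width, centre + half_width

TRUE_EFFECT = 0.0  # hypothetical: the drug has no effect
BIAS = 0.05        # hypothetical systematic error (e.g. residual confounding)
SIGMA = 1.0        # hypothetical standard deviation of the outcome

for n in (100, 10_000, 1_000_000):
    low, high = expected_ci(TRUE_EFFECT, BIAS, SIGMA, n)
    covers = low <= TRUE_EFFECT <= high
    print(f"n={n:>9,}  CI=({low:+.4f}, {high:+.4f})  covers truth: {covers}")
```

With these assumed values the interval still covers the truth at n = 100 but not at n = 10,000: more data makes the wrong answer look more certain.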
Principles for working with big data
• Government
  – Privacy / accessibility
  – Integrity of the data
• Researchers
  – Privacy / security
  – Processing the data (design, analysis, model selection)
  – Interpreting the results: epistemologically important to distinguish information (data), knowledge (causal inferences) & wisdom (systematic incorporation of all knowledge)
Learning from Big Data
• More than big data, we need better data, rich in important confounders
• Need better research designs, especially experimental data
• Need better appreciation of the quantitative sciences (uncertainty, causal inference)
• Need “domain knowledge”: specific clinical information
• Must incorporate prior evidence.
  – If good prior data, use informative priors
  – If very little data, use agnostic/uniform prior beliefs
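The choice between informative and uniform priors can be sketched with a conjugate Beta-binomial update. This is only an illustration: the study counts and the Beta(5, 95) prior are hypothetical values chosen for the example, not figures from any study cited here.

```python
def update_beta(prior_a, prior_b, events, n):
    """Conjugate update: Beta(a, b) prior plus binomial data gives a Beta posterior."""
    return prior_a + events, prior_b + (n - events)

def beta_mean(a, b):
    """Mean of a Beta(a, b) distribution."""
    return a / (a + b)

# Hypothetical new study: 3 adverse events among 20 patients.
events, n = 3, 20

# Agnostic prior: Beta(1, 1) is uniform on [0, 1].
uniform_post = update_beta(1, 1, events, n)

# Informative prior (assumed): Beta(5, 95), roughly "5 events in 100 earlier patients".
informative_post = update_beta(5, 95, events, n)

print("uniform prior     -> posterior", uniform_post, "mean", round(beta_mean(*uniform_post), 3))
print("informative prior -> posterior", informative_post, "mean", round(beta_mean(*informative_post), 3))
```

With little new data the two posteriors differ substantially, which is why the source of the prior (evidence, not belief) has to be stated explicitly.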
What is the purpose of pharmacoepidemiology?
• Patterns of drug utilization
• Generating new information on drug safety
• Supplementing premarketing effectiveness studies
  – different populations, better precision
• However, the overall purpose is to provide insights or causal inferences, not merely associations generated from large data sets.
Estimating causal effects
1. Randomized Experiments
2. Natural Experiments
3. Instrumental Variables
4. Regression Discontinuity
5. Difference in Differences
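As one illustration from this list, a difference-in-differences estimate can be sketched in a few lines. The event rates below are hypothetical and chosen only to show the arithmetic; the function name is likewise an assumption for this example.

```python
def difference_in_differences(treated_before, treated_after,
                              control_before, control_after):
    """Change in the exposed group minus change in the comparison group.

    Assumes both groups would have followed parallel trends
    in the absence of the intervention.
    """
    return (treated_after - treated_before) - (control_after - control_before)

# Hypothetical event rates per 100 patients, before and after a policy change.
effect = difference_in_differences(
    treated_before=12.0, treated_after=9.0,   # region exposed to the change
    control_before=11.0, control_after=10.0,  # comparison region
)
print(f"Difference-in-differences estimate: {effect:+.1f} per 100 patients")
```

The comparison group's change (here -1.0) is treated as the background trend and subtracted from the exposed group's change (here -3.0).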
An example
Results
Problems
• Not sure of the benefit in the North American (NA) context
• Changing everyone in Quebec to ticagrelor would cost $25 million
• Doing a large conventional RCT could cost $10-50 million
• What to do?
Using big data effectively
• Most of the cost is for the follow-up
• We have excellent administrative databases with reliable measures of death and cardiac outcomes, so we could minimize costs
• Need to avoid selection bias, so we could randomize at the start and then simply observe
• New design (randomized registry) can answer the question at a reasonable cost
Conclusion
• Instead of focusing on a “big data revolution,” better is an “all data revolution” including replication
• Recognize that the critical change has been innovative designs and analytics, which can be applied to both traditional and new data
• Big data is an aid to thinking, not a substitute for thinking
• Goal of this revolution is to provide a deeper, clearer understanding of our world.
Science March 14 2014
Thank you
Learning from big data
• Must incorporate prior evidence.
  – If good prior data, use informative priors
  – If very little data, use agnostic or uniform prior beliefs
• In all cases, must be able to specify where you are and why; if an agnostic approach is used, then a validation study is needed
• Avoid confusing prior beliefs with prior evidence -> biases
How Much Data is There?
• 2.5 quintillion bytes (2.5 exabytes) of data were generated every day in 2012
• As much data is now generated in just two days as was created from the dawn of civilization until 2003.
Harvard Business Review Dec 2014
• Where things go wrong is where tools of this kind are used not as an aid to thinking but as a substitute for thinking. When the information provided is used (this was one of David Ogilvy’s favourite quotations) “… as a drunk uses a lamppost: for support rather than illumination.”
What can big data find in healthcare?
Big data & inferences
Washington Post March 21
What is the correct inference?
• Americans spend too much on gambling and too much on the important stuff of politics
• Americans spend too much on gambling and not enough on the important stuff of politics
• Americans don’t spend too much on gambling but spending on politics is out of control
Looking in detail
• Consider there are 316 MM Americans
• Basketball: 13% gambled, average bet $200
• Elections: 80% adults, average $25
• Elections: 1% of 1% of the population (31,600) spent 28% or $2 B, average contribution $64,000
• A very small sample of Americans is controlling the election process
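The figures on this slide can be checked with a few lines of arithmetic. All inputs are taken from the slide. Two quantities are derived and are not stated in the source: the total election spending implied by "28% or $2 B", and the gambling total, which assumes the 13% applies to the whole population.

```python
POPULATION = 316_000_000           # slide: 316 MM Americans
TOP_DONOR_SPEND = 2_000_000_000    # slide: $2 B
TOP_DONOR_SHARE = 0.28             # slide: 28% of election spending

# 1% of 1% of the population
donors = POPULATION // 100 // 100
average_contribution = TOP_DONOR_SPEND / donors

# Derived, not stated on the slide
implied_total_election_spend = TOP_DONOR_SPEND / TOP_DONOR_SHARE
gamblers = POPULATION * 13 // 100  # assumption: 13% of the whole population
gambling_total = gamblers * 200    # slide: average bet $200

print(f"Top donors:             {donors:,}")
print(f"Average contribution:   ${average_contribution:,.0f}")
print(f"Implied election total: ${implied_total_election_spend:,.0f}")
print(f"Implied gambling total: ${gambling_total:,}")
```

The computed average is about $63,300, which the slide rounds to $64,000.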
How unequal?
Do statins increase or decrease the risk of cancer?
NO
YES
Maybe neither
Maybe this is an isolated case and dates from 2007. Surely we are better today.
Do statins cause diabetes?
Statins & diabetes, Who do you believe?
• Both studies published in May 2013
• Both studies published in high-impact journals
• Both used validated administrative datasets
• Both published by renowned investigators
Statins & diabetes, Who do you believe?
• Even more confusing & troublesome
• Both used THE SAME validated administrative datasets (Ontario)
• Both used essentially THE SAME patients (>65, no diabetes, new statin users from 1997 (2004) - 2010)
• Both sets of authors are from THE SAME academic institution (Sunnybrook, U of T)
Adaptive randomization & ethics
• In the end, it seems doubtful that adaptive allocation generally improves risk/benefit for patients.
• It requires larger sample sizes -> more patients, more research procedures, more visits.
• Since costs scale with sample sizes, it means more resources are consumed in answering a single research question than with a fixed 1:1 design.
Adaptive randomization & ethics
• Does outcome-adaptive allocation better accommodate clinical equipoise and promote informed consent?
• Does adaptive allocation offer a “partial remedy” for the therapeutic misconception associated with fixed randomization?
Arguing against
• Hey and Kimmelman suggest that adaptive designs do not improve risk–benefit for subjects but increase total burden for both patients and research systems by demanding larger sample sizes.
• They suggest that these designs redistribute rather than dissolve tensions in informed consent
• They suggest that these designs may have validity problems
A source of bias?
• Given that the odds of receiving the better treatment will improve over the course of the trial
• It is in the best interests of patient-subjects (and the physicians advocating on their behalf) to wait and enroll as late as possible
• So later patients may be healthier (less urgency to participate) -> predictable time-trend in the study population increases the risk of bias
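Why the odds of receiving the better treatment improve over time can be seen in a small simulation. This is a sketch of one particular outcome-adaptive rule, the randomized play-the-winner urn, chosen here only for illustration; it is not necessarily the scheme discussed by Hey and Kimmelman, and the success probabilities are hypothetical.

```python
import random

def randomized_play_the_winner(p_success, n_patients, seed=0):
    """Simulate a two-arm randomized play-the-winner urn.

    Returns the arm (0 or 1) assigned to each patient in enrollment order.
    """
    rng = random.Random(seed)
    urn = [1, 1]  # start with one ball per arm
    assignments = []
    for _ in range(n_patients):
        # Draw an arm with probability proportional to its ball count.
        arm = 0 if rng.random() < urn[0] / (urn[0] + urn[1]) else 1
        assignments.append(arm)
        success = rng.random() < p_success[arm]
        # A success adds a ball for the same arm, a failure for the other arm.
        urn[arm if success else 1 - arm] += 1
    return assignments

P_SUCCESS = (0.8, 0.3)  # hypothetical: arm 0 is the better treatment
arms = randomized_play_the_winner(P_SUCCESS, n_patients=2000)

early, late = arms[:100], arms[-500:]
share_better_early = early.count(0) / len(early)
share_better_late = late.count(0) / len(late)
print(f"Share assigned to better arm, first 100 patients: {share_better_early:.2f}")
print(f"Share assigned to better arm, last 500 patients:  {share_better_late:.2f}")
```

Under this rule the allocation starts near 1:1 and drifts toward the better arm, which is exactly the incentive to enroll late described above.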
Example # 3
We have reached a threshold such that time to reperfusion no longer matters, provided < 90 minutes, and we now need to look elsewhere for improvements.
Results
16 minute improvement
No improvement, really?
• Adjusted mortality has declined from 5% to 4.7%, p=0.34, but what would the CI tell us?
• Back-of-the-envelope calculation: a 0.3% improvement with 95% CI from -0.1% to +0.7%
• In other words, this small improvement in time is consistent with up to a 7/1000 absolute survival benefit (about 2800 annually) or a 14% relative decrease in mortality, and is entirely consistent with previous research
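The conversions on this slide can be retraced in a few lines. The inputs are the slide's own figures; the annual patient volume is not stated in the source and is back-calculated here from "about 2800 annually", so it should be read as an inference.

```python
BASELINE_MORTALITY = 5.0   # %, slide: adjusted mortality before
CI_UPPER = 0.7             # percentage points, slide: upper 95% CI bound
ANNUAL_LIVES = 2800        # slide: "about 2800 annually"

# Absolute benefit at the upper CI bound, expressed per 1000 patients
lives_per_1000 = CI_UPPER * 10

# Relative decrease in mortality at the upper CI bound
relative_reduction = CI_UPPER / BASELINE_MORTALITY

# Inferred, not stated on the slide: patient volume implied by 2800 lives/year
implied_annual_patients = ANNUAL_LIVES / (CI_UPPER / 100)

print(f"Absolute benefit:  {lives_per_1000:.0f} per 1000")
print(f"Relative decrease: {relative_reduction:.0%}")
print(f"Implied patients treated per year: {implied_annual_patients:,.0f}")
```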
Consistent with other results
J Am Coll Cardiol 2006;47:2180-6
22,900 PCI in AMI NRMI
Telling it like it isn’t
MY CONCLUSIONS
This study shows that improved treatment times, even below the 90 minute threshold, are likely associated with meaningful mortality benefits that are entirely consistent with previous work and may have a huge public health impact. Efforts should continue to reduce all treatment delays.
Fundamental identity of causal inference
Outcome for treated − Outcome for untreated
= [Outcome for treated − Outcome for treated if not treated] + [Outcome for treated if not treated − Outcome for untreated]
= Impact of treatment on treated + selection bias
If treatment is randomly assigned
• Selection bias is zero.
• The treated are a random selection from the population, so impact on treated = impact on population
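The identity can be verified numerically. The sketch below uses simulated potential outcomes, so everything in it is hypothetical: a `frailty` score drives the untreated outcome, treatment lowers the outcome by 2 units for everyone, and in the confounded scenario frailer patients are more likely to be treated.

```python
import random
from statistics import mean

def decompose(y0, y1, treated):
    """Split the naive treated-vs-untreated difference into
    impact of treatment on the treated plus selection bias."""
    t = [i for i, flag in enumerate(treated) if flag]
    c = [i for i, flag in enumerate(treated) if not flag]
    naive = mean(y1[i] for i in t) - mean(y0[i] for i in c)
    impact_on_treated = mean(y1[i] - y0[i] for i in t)
    selection_bias = mean(y0[i] for i in t) - mean(y0[i] for i in c)
    return naive, impact_on_treated, selection_bias

rng = random.Random(0)
n = 10_000
frailty = [rng.random() for _ in range(n)]
y0 = [10 + 20 * f for f in frailty]  # outcome if not treated (hypothetical)
y1 = [v - 2 for v in y0]             # outcome if treated: 2 units lower

confounded = [rng.random() < f for f in frailty]     # frail patients treated more often
randomized = [rng.random() < 0.5 for _ in range(n)]  # coin flip

for label, assignment in (("confounded", confounded), ("randomized", randomized)):
    naive, impact, bias = decompose(y0, y1, assignment)
    print(f"{label}: naive={naive:+.2f}  impact on treated={impact:+.2f}  selection bias={bias:+.2f}")
```

In both scenarios the naive difference equals the impact on the treated plus the selection bias; only under randomization is the selection bias close to zero.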
Problems
• Basic problems of observational research including selection bias, information bias and confounding
  – How were patients selected?
  – How was exposure measured?
  – Were there time dependencies?
  – What were the statistical models?
  – What confounders, interactions, mediators were considered?