
Square wheels: electronic medical records for discovery research in rheumatoid arthritis

Robert M. Plenge, M.D., Ph.D.

October 30, 2009

NCRR sponsored "Using EHR Data for Discovery Research"

Harvard Medical School

Key questions

• What are the regulatory obstacles impacting your work?

• What are the resource needs required to replicate your work at other institutions?

• What are the priority short term "translational" questions in your field that would represent the most rapid payoff on investment?

Key questions

How can I implement your approach, and how much better is it?

[Diagram: genotype, phenotype, clinical care, with a bottleneck marked]

Raychaudhuri et al., in press, Nature Genetics

October 2009: >30 RA risk loci

1978: HLA-DR4

1987: “shared epitope” hypothesis

2003: PADI4

2004: PTPN22

2005: CTLA4

2007: TNFAIP3, STAT4, TRAF1-C5, IL2-IL21

2008: CD40, CCL21, CD244, IL2RB, TNFRSF14, PRKCQ, PIP4K2C, IL2RA, AFF3

2009: REL, BLK, TAGAP, CD28, TRAF6, PTPRC, FCGR2A, PRDM1, CD2-CD58

Latest GWAS in 25,000 case-control samples with replication in 20,000 additional samples: >10 new loci

Together explain ~35% of the genetic burden of disease

[Diagram: genotype, phenotype, clinical care, with a bottleneck marked]

Genetic predictors of response to anti-TNF therapy in RA

PTPRC/CD45 allele; n=1,283 patients; P=0.0001

Submitted to Arth & Rheum

How can we collect DNA and detailed clinical data on >20,000 RA patients?

What are the options for collecting clinical data and DNA for genetic studies?

Options for clinical + DNA

design          clinical data   DNA   sample size   cost
clinical trial  +++             +++   +             $$$
registry        ++              +++   ++            $$
claims data     +               n/a   +++           $
EMR             ++              +++   +++           $

Content of EMRs

• Narrative data = free-form written text
– info about symptoms, medical history, medications, exam, impression/plan

• Codified data = structured format
– age, demographics, and billing codes

EMRs are increasingly utilized!

This is not a new idea…

Gabriel (1994) Arthritis and Rheumatism

Sens: 89%; PPV: 57%

Conclusion: The sole reliance on such databases for the diagnosis of RA can result in substantial misdiagnosis.

…but EMR data are “dirty”
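The two figures quoted from Gabriel (1994) are standard confusion-matrix ratios. A minimal sketch of how they are computed follows. The counts are hypothetical, chosen only so that the ratios come out near 89% and 57%; they are not taken from the paper.

```python
def sensitivity(true_pos, false_neg):
    """Fraction of true RA patients that the database diagnosis catches."""
    return true_pos / (true_pos + false_neg)


def ppv(true_pos, false_pos):
    """Fraction of database-diagnosed patients who truly have RA."""
    return true_pos / (true_pos + false_pos)


# Hypothetical counts (NOT from Gabriel 1994), for illustration only.
true_pos, false_neg, false_pos = 89, 11, 67

print(f"sensitivity = {sensitivity(true_pos, false_neg):.0%}")
print(f"PPV         = {ppv(true_pos, false_pos):.0%}")
```

With a PPV this low, a large share of patients carrying the database diagnosis do not have RA, which is the sense in which the data are "dirty".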

Partners HealthCare: 4 million patients

Partners HealthCare: linked by EMR

Partners HealthCare: organized by i2b2

4 million patients

31,171 patients: ICD9 RA and/or CCP checked (goal = high sensitivity)

3,585 RA patients: classification algorithm (goal = high PPV)

Clinical subsets

Discarded blood for DNA

Our library of RA phenotypes

• Natural language processing (NLP)
– disease terms (e.g., RA, lupus)
– medications (e.g., methotrexate)
– autoantibodies (e.g., CCP, RF)
– radiographic erosions

• Codified data
– ICD9 disease codes
– prescription medications
– laboratory autoantibodies

Concept/term          Accuracy of concept
presence of erosion   88%
seropositive          96%
CCP positive          98.7%
RF positive           99.3%
etanercept            100%
methotrexate          100%

Qing Zeng, Shawn Murphy

‘Optimal’ algorithm to classify RA:

NLP + codified data

Regression model with a penalty parameter (to avoid over-fitting)

Codified data NLP data

Tianxi Cai, Kat Liao
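The slide describes the classifier only as a regression model with a penalty parameter. The sketch below shows one way such a model can work: a logistic regression fitted by gradient descent with an L2 (ridge) penalty. The penalty type, the two toy features, and all function names are assumptions for illustration; the slides do not say which penalty or which features were used.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_penalized_logistic(X, y, penalty=0.1, lr=0.1, epochs=2000):
    """Gradient descent on mean log-loss plus (penalty / 2) * ||w||^2.

    The intercept b is not penalized. Larger `penalty` shrinks the
    weights toward zero, which limits over-fitting.
    """
    n, p = len(X), len(X[0])
    w, b = [0.0] * p, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * p, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi))) - yi
            grad_b += err
            for j in range(p):
                grad_w[j] += err * xi[j]
        for j in range(p):
            w[j] -= lr * (grad_w[j] / n + penalty * w[j])
        b -= lr * grad_b / n
    return w, b


def predict_proba(w, b, x):
    """Predicted probability of RA for one feature vector."""
    return sigmoid(b + sum(wj * xj for wj, xj in zip(w, x)))


# Hypothetical toy data: [scaled count of ICD9 RA codes, NLP mention of RA (0/1)]
X = [[0.1, 0], [0.2, 0], [0.3, 1], [0.8, 1],
     [0.9, 1], [0.7, 0], [0.2, 1], [1.0, 1]]
y = [0, 0, 0, 1, 1, 1, 0, 1]

w, b = fit_penalized_logistic(X, y, penalty=0.01)
print(predict_proba(w, b, [1.0, 1]), predict_proba(w, b, [0.1, 0]))
```

The output of such a model is a probability of RA per patient, so a cut-off on that probability can then be chosen to favor PPV over sensitivity.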

High PPV with adequate sensitivity

392 out of 400 (98%) had definite or possible RA!

This means more patients!

~25% more subjects with the complete algorithm: 3,585 subjects (3,334 with true RA) vs. 3,046 subjects (2,680 with true RA)
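The "~25% more" figure can be checked from the four counts on the slide. This is a small arithmetic sketch. The slide does not label the second cohort, so the name `comparison` is an assumption.

```python
# Counts as given on the slide.
complete = {"selected": 3585, "true_ra": 3334}    # complete algorithm
comparison = {"selected": 3046, "true_ra": 2680}  # unlabeled second cohort


def ppv(cohort):
    """Share of selected subjects who have true RA."""
    return cohort["true_ra"] / cohort["selected"]


def relative_gain(a, b):
    """How many more true RA subjects cohort a yields than cohort b."""
    return a["true_ra"] / b["true_ra"] - 1


print(f"PPV, complete algorithm: {ppv(complete):.1%}")
print(f"PPV, comparison:         {ppv(comparison):.1%}")
print(f"gain in true RA cases:   {relative_gain(complete, comparison):.1%}")
```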

4 million patients

31,171 patients: ICD9 RA and/or CCP checked (goal = high sensitivity)

3,585 RA patients: classification algorithm (goal = high PPV)

Discarded blood for DNA

Linking the Datamart-Crimson

[Diagram: NLP data, codified data]

Status of i2b2 Crimson collection

• Over 3,000 samples collected to date
– cost = $10 per sample

• DNA extracted on >2,400 buffy coats
– cost = $20 per sample
– >90% had ≥1 µg of DNA
– >99% had ≥5 µg of DNA after WGA

Genotyping of 384 SNPs (RA risk alleles, AIMs, other) is ongoing at Broad Institute
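The per-sample costs above allow a rough projection. The sketch assumes that the two costs simply add, that they scale linearly with sample count, and that nothing else (genotyping, staff time) is included. None of these assumptions, nor the resulting totals, are stated in the slides.

```python
COLLECTION_COST_USD = 10   # per sample collected (from the slide)
EXTRACTION_COST_USD = 20   # per sample with DNA extracted (from the slide)


def projected_cost(n_patients):
    """Cost if every patient's sample is both collected and extracted."""
    return n_patients * (COLLECTION_COST_USD + EXTRACTION_COST_USD)


def spent_lower_bound(n_collected=3000, n_extracted=2400):
    """Lower bound on spend to date, using the slide's 'over 3,000' and '>2,400'."""
    return n_collected * COLLECTION_COST_USD + n_extracted * EXTRACTION_COST_USD


print(projected_cost(20_000))
print(spent_lower_bound())
```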

Status of i2b2 Crimson collection

• Measured autoantibodies from plasma
– 5 autoantibodies in ~380 RA patients
– ~85% are CCP+, ~35% ANA+, ~15% TPO+

• Question: are non-RA autoantibodies present at increased frequency in RA patients vs matched controls?

stay tuned…more data soon!


Regulatory obstacles

• IRB approval

• De-identified vs truly anonymous

• Open question: sharing of genetic data


Resources required

• Building a research DataMart
– clinical EMR ≠ research EMR
– multiple FTEs to build/maintain

• NLP expertise
– open-source software available
– iterative process for fine-tuning

• Clinical expertise
– understand nature of clinical data

Resources required (cont.)

• Statistical expertise
– simple algorithm is not sufficient
– prepare for the unexpected!
– true for narrative and codified

• Biospecimen collection, DNA extraction
– varies by institution
– Crimson
– Broad Institute

Key questions

• What are the regulatory obstacles impacting your work?

• What are the resource needs required to replicate your work at other institutions?

• What are the priority short term "translational" questions in your field that would represent the most rapid payoff on investment?


Clinical features of patients

Characteristics   i2b2 RA       CORRONA
Total number      3,585         7,971
Mean age (SD)     57.5 (17.5)   58.9 (13.4)
Female (%)        79.9          74.5
Anti-CCP (%)      63            N/A
RF (%)            74.4          72.1
Erosions (%)      59.2          59.7
MTX (%)           59.5          52.8
Anti-TNF (%)      32.6          22.6

CCP has an OR = 1.5 for predicting erosions

Subset patients in clinically meaningful ways: causes of mortality

NLP+codified data, together with statistical modeling, to define cardiovascular disease

Non-responder to anti-TNF therapy

NLP+codified data, together with statistical modeling, to define treatment response

Responder to anti-TNF therapy

NLP+codified data, together with statistical modeling, to define treatment response

Post-marketing surveillance of adverse events

NLP+codified data, together with statistical modeling, to define treatment response

pharmacovigilance

Conclusions

Options for clinical + DNA

design          clinical data   DNA   sample size   cost
clinical trial  +++             +++   +             $$$
registry        ++              +++   ++            $$
claims data     +               n/a   +++           $
EMR             ++              +++   +++           $

Conclusion: NLP + codified data, together with appropriate statistical modeling, can yield accurate clinical data.


Conclusion: We can collect DNA and plasma in a high-throughput manner.


Conclusion: The cost is reasonable...even for >20,000 RA patients!

[Diagram: genotype, phenotype, clinical care]

Acknowledgments

Zak Kohane, Susanne Churchill, Vivian Gainer, Kat Liao, Tianxi Cai, Shawn Murphy, Qing Zeng, Soumya Raychaudhuri, Beth Karlson, Pete Szolovits, Lee-Jen Wei, Lynn Bry (Crimson), Sergey Goryachev, Barbara Mawn & many others!

Namaste!

Narrative data (NLP text extractions)

Codified data (ICD9 codes, etc)

Run specific queries

Visualize results in a timeline

Identifying RA patients in our i2b2 RA DataMart

[Timeline screenshot, 1993 to 2008. Rows: signs and symptoms; diseases that mimic RA; medications specific to RA; notes (including whether seen by a rheumatologist); diagnostic codes for RA]

Shawn Murphy, Vivian Gainer, others

Identifying RA patients in our i2b2 RA DataMart

[Timeline screenshot, 1993 to 2008: signs and symptoms c/w RA; RA without other diseases; specific RA meds, including MTX; seen by rheumatology; many diagnostic codes for RA]

Probability of RA: all 31K subjects

[Histogram: x-axis, probability of RA; y-axis, frequency; groups: not RA, RA (n=3,585)]

ROC curves for algorithms

[Plot: sensitivity vs. 1 - specificity for three algorithms (codified + NLP, NLP only, codified only), with 97% specificity marked]
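The ROC slide marks a 97% specificity operating point. One way to read such a point off a set of scored, chart-reviewed patients is sketched below: lower the probability threshold for as long as specificity stays at or above the target, then report the sensitivity reached. The function name and the toy scores are hypothetical, not the study data.

```python
def sensitivity_at_specificity(scores, labels, target_specificity=0.97):
    """Return (threshold, sensitivity, specificity) for the lowest threshold
    whose specificity still meets the target, or None if no threshold does.

    A patient is called RA when score >= threshold; labels are 1 (RA) or 0.
    """
    neg = [s for s, lab in zip(scores, labels) if lab == 0]
    pos = [s for s, lab in zip(scores, labels) if lab == 1]
    best = None
    # Walk thresholds from strict to lenient: specificity can only fall.
    for t in sorted(set(scores), reverse=True):
        spec = sum(s < t for s in neg) / len(neg)
        sens = sum(s >= t for s in pos) / len(pos)
        if spec < target_specificity:
            break
        best = (t, sens, spec)
    return best


# Hypothetical toy scores and chart-review labels.
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
toy_labels = [1, 1, 0, 1, 0, 0, 0, 0]

print(sensitivity_at_specificity(toy_scores, toy_labels, target_specificity=0.8))
```

Running the same function on the scores of each algorithm (codified + NLP, NLP only, codified only) would compare them at a common specificity.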

Other algorithms to classify RA

NLP only; codified only

Portability!

Classification of RA cases (and not RA)

[Plot: probability of RA (0.00 to 1.00) for the groups Not RA, possible, and Yes RA, with the threshold at 0.29]

???

Diagnosis = Ankylosing Spondylitis

(but many RA codes)

A few signs and symptoms c/w RA

NLP with few mentions of RA Specific meds

Visits to BWH/MGH

diagnostic codes for RA

Probability RA = 0.78

Diagnosis = JRA (but many RA codes)

signs and symptoms c/w RA

NLP with “RA” and “JRA”

Specific meds

Visits to the RA Center at BWH

Many diagnostic codes for RA

Probability RA = 0.33

Diagnosis not clear initially…

signs and symptoms c/w RA

NLP without much “RA”, few specific meds (MTX x 1)

…and few diagnostic codes for RA, despite multiple LMR notes, including visits to the BWH Arthritis Center

Now the false negatives…

Diagnosed in 1992, little follow-up

For some reason few RA diagnostic codes

Probability RA = 0.11
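The example patients above can be read against the 0.29 threshold shown on the classification plot. In this minimal sketch, whether the cut-off is inclusive (`>=`) is an assumption; the slides give only the value.

```python
THRESHOLD = 0.29  # from the classification slide


def classify(prob_ra, threshold=THRESHOLD):
    """Label a patient from the model's predicted probability of RA."""
    return "RA" if prob_ra >= threshold else "not RA"


# Probabilities quoted on the example slides.
example_probs = [0.78, 0.33, 0.11]
example_labels = [classify(p) for p in example_probs]

for p, label in zip(example_probs, example_labels):
    print(f"P(RA) = {p:.2f} -> {label}")
```

The first two probabilities fall above the threshold and the last falls below it, consistent with the slides presenting the earlier cases as misclassified positives and the last as a false negative.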

Medications: codified data vs. NLP

Enbrel (etanercept)
codified: 1,628
NLP: 3,796
overlap: 1,612 (99%)

Note: review of 50 NLP occurrences shows that 38 out of 50 were actively on Enbrel
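The overlap figure appears to be expressed relative to the codified set (1,612 of 1,628). A small sketch of that comparison follows, using Python sets on hypothetical patient IDs and then the counts from the slide. The function name and the IDs are illustrative only.

```python
def overlap_stats(codified_ids, nlp_ids):
    """Compare two sets of patient IDs flagged for the same medication."""
    both = codified_ids & nlp_ids
    return {
        "overlap": len(both),
        "share_of_codified": len(both) / len(codified_ids),
        "nlp_only": len(nlp_ids - codified_ids),
        "codified_only": len(codified_ids - nlp_ids),
    }


# Hypothetical IDs, for illustration only.
toy = overlap_stats({"p1", "p2", "p3"}, {"p2", "p3", "p4", "p5"})
print(toy)

# Counts from the slide.
codified, nlp, overlap = 1628, 3796, 1612
print(f"overlap / codified = {overlap / codified:.0%}")
print(f"NLP-only mentions  = {nlp - overlap}")
print(f"actively on Enbrel in reviewed NLP sample = {38 / 50:.0%}")
```

NLP finds many more mentions than the codified data, but the chart review suggests that some mentions do not reflect current use, so the two sources are complementary.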
