AAVLD Informatics Committee: Data Standardization in the

The Case for Clinical Repositories as a Data Source for Research

Clem McDonald, MD

Director

Lister Hill National Center for Biomedical Communications

National Library of Medicine

National Advisory Research Resources Council (NARRC) Meeting

January 18, 2007

2

Clinical Repositories

What do they do?

What could they do?

3

Why Clinical Repositories

• Many sources of electronic data in an institution
  ◦ Labs, radiology, pharmacy, MD orders, EKGs, dictated reports, radiology images, etc.
• Most of these sources can deliver this data via HL7 messages to another computer
• A repository is a database that provides simple access to all of this data in a unified view

4

What Data is Available Within the Institution

• Lab data (almost always electronic)
• Medication orders for in-patients
• Radiology reports (text)
• Pathology reports (text)
• Dictation (discharge summary)
• EKGs
• Cardiac echoes
• Endoscopy
• Obstetrical ultrasound
• Nursing observation

5

More Data – “Outside” the Institution

• Event Data – Coded diagnoses and procedures

• Tumor registries – whole country

• Cardiology data bases (ACC, ATS, etc.) – whole country

• Federal ESRD database

• Outpatient medications – from pharmacy benefit managers (Rx.Hub)

• More

6

Potential Availability – Still More

• Medicaid – procedures, diagnoses and drug use

• Medicare – Diagnoses, procedures and (now) medication use

• Lots of special federal collection instruments – Nursing home, disability, Medicare introduction, etc.

7

Why Codes for Observations Are So Important

• The observation is not defined by the field, as is typical in research and policy databases

8

Flat Data Set (Analytic Conceptualization)

Pat ID   Name      Surgery date   Hgb    DBP   # of BPU   Bypass minutes   Cholest
1234-5   Doe Jane  12May95        13     95    3          80               180
9999-3   Jones T   1Aug95         12.5   88    2          90               230
8888-3   Doe Sam   4June95        16     78    0          80               205

9

Why Observation Codes Are Important

• The observation ID is defined by a “pointer” to a master table – as follows

10

Stacked Data Set (Application Conceptualization)

Pt ID   Relevant Date   Observation ID   Value   Units   Normal Range   Place        Observer
Doe J   12-May-95       Hemoglobin       13      g/dl    12.5-15        St Francis   Dr Smith
Doe J   12-May-95       Hemoglobin       11.5    g/dl    12.5-15        St Francis   Dr Smith
Doe J   12-May-95       Dias BP          95      mm Hg   80-140         St Francis   Dr Smith
Doe J   12-May-95       Dias BP          110     mm Hg   80-140         St Francis   Dr Smith
Doe J   13-May-95       Bypass minutes   80      min                    St Francis   Dr Sleepwell
Doe J   12-May-95       Cholesterol      180                            St Francis   Dr Bloodbank

Operational Data Base: One Record Per Observation
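The move from the flat layout to the stacked layout can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the field names, the `OBSERVATION_UNITS` master table, and the units listed in it are hypothetical and are not taken from any real repository schema.

```python
from datetime import date

# Hypothetical master table: observation ID -> units (illustrative values only).
OBSERVATION_UNITS = {
    "Hgb": "g/dl",
    "DBP": "mm Hg",
    "BPU": "units",
    "Bypass minutes": "min",
    "Cholest": "mg/dl",
}

def stack(flat_row, id_fields=("pat_id", "name", "surgery_date")):
    """Turn one flat record into one stacked record per observation."""
    stacked = []
    for obs_id, value in flat_row.items():
        if obs_id in id_fields or value is None:
            continue  # skip identifying fields and missing results
        stacked.append({
            "pat_id": flat_row["pat_id"],
            "relevant_date": flat_row["surgery_date"],
            "observation_id": obs_id,   # a "pointer" into the master table
            "value": value,
            "units": OBSERVATION_UNITS.get(obs_id),
        })
    return stacked

# First row of the flat data set shown earlier.
flat = {"pat_id": "1234-5", "name": "Doe Jane", "surgery_date": date(1995, 5, 12),
        "Hgb": 13, "DBP": 95, "BPU": 3, "Bypass minutes": 80, "Cholest": 180}
records = stack(flat)
```

One flat row with five result fields becomes five stacked records. Adding a new kind of observation then needs a new entry in the master table rather than a new column.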

11

For Repositories – Need to Think in a Different Data Structure

Instead of dedicating one data field (in a visit record) per result, as is the common model in clinical research:

• Dedicate a record per result
• That structure is found in every lab, repository, and pharmacy system
• You will never find a field for hemoglobin or cholesterol – or for penicillin V
• The record carries extra pieces of information about each value (date, units, normal range, place, observer)

12

Limits and Issues Depending Upon How the Data is Represented

Things to be aware of

13

Clinical Information Comes in Multiple Forms, Each with its own Issues

1. Quantitative – e.g. maximum calf circumference, serum calcium
   • Attend to units – and possible need to convert
   • Special forms (1:128, > 10, 1-5, codes, and text mixed with numerics)
2. Ordinal “measures” – e.g. Glasgow eye opening score
   • Answers likely to be “fixed” text or localized codes
3. Nominal (football jerseys) – e.g. blood culture results
   • Same issues as 2.
   • But require small amounts of labor compared to direct manual capture
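The "special forms" of quantitative results mentioned above are a common stumbling block when values are loaded for analysis. The sketch below shows one way such strings might be classified before analysis. The category names and the function `parse_result` are hypothetical, and real feeds will contain forms this does not cover.

```python
import re

NUMBER = r"\d+(?:\.\d+)?"

def parse_result(raw):
    """Classify a raw result string; returns (kind, payload)."""
    text = raw.strip()
    m = re.fullmatch(r"1:(\d+)", text)
    if m:  # titer such as 1:128
        return ("titer", int(m.group(1)))
    m = re.fullmatch(rf"([<>]=?)\s*({NUMBER})", text)
    if m:  # censored value such as "> 10"
        return ("comparator", (m.group(1), float(m.group(2))))
    m = re.fullmatch(rf"({NUMBER})\s*-\s*({NUMBER})", text)
    if m:  # range such as 1-5
        return ("range", (float(m.group(1)), float(m.group(2))))
    if re.fullmatch(rf"-?{NUMBER}", text):
        return ("numeric", float(text))
    return ("text", text)  # codes or fixed text mixed in with numbers

examples = [parse_result(s) for s in ["1:128", "> 10", "1-5", "13.5", "POS"]]
```

Keeping the kind alongside the value lets an analyst decide explicitly how to treat titers, censored values, and ranges instead of silently dropping them.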

14

Narrative Text Can be Good

• The Good:
  ◦ Easy to record/capture and use
  ◦ Can be searched for text patterns
  ◦ Some success in finding specially targeted content with simple NLP
• The Bad:
  ◦ Usually requires some human review of retrieved records
  ◦ Still light years faster than chart review
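A text-pattern search over narrative reports might look like the sketch below. It is a deliberately naive illustration: the negation word list, the 40-character look-back window, and the sample reports are all made up, and the crude negation flag is exactly why retrieved records usually still need human review.

```python
import re

# Very small, illustrative negation list; real NLP tools use far richer rules.
NEGATIONS = re.compile(r"\b(no|not|without|denies|negative for)\b", re.IGNORECASE)

def find_candidates(reports, pattern, window=40):
    """Return (report_id, snippet, negated) for each pattern hit."""
    rx = re.compile(pattern, re.IGNORECASE)
    hits = []
    for report_id, text in reports.items():
        for m in rx.finditer(text):
            start = max(0, m.start() - window)
            snippet = text[start:m.end()]
            # Flag the hit if a negation word appears shortly before it.
            negated = bool(NEGATIONS.search(text[start:m.start()]))
            hits.append((report_id, snippet, negated))
    return hits

reports = {
    "r1": "Chest film shows right lower lobe pneumonia.",
    "r2": "Lungs clear. No evidence of pneumonia.",
}
hits = find_candidates(reports, r"pneumonia")
```

Both reports match the pattern, but only the first is a likely case. The second is flagged as negated and can be set aside or reviewed by a person.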

15

Document Images, Clinical Images and Tracings

• Fast access for human review

• Access to original data – esp. with tracings

• Human assisted measurements of biologic images

16

By-Patient and Cross-Patient Access

• Clinical repositories usually justified for clinical care – so data is organized by patient for clinician review

• May lack efficient cross-patient access as needed for research. Three kinds of problems:
  ◦ They may not have the right index structures or computer power for searching
  ◦ They may not have tools for non-programmer access per query
  ◦ The data may be a mess inside – good enough for display to a human but not for automatic searching
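The difference between by-patient and cross-patient access comes down to which index exists. The sketch below builds both over the same stacked records. The record layout and function names are hypothetical, and a production repository would use database indexes rather than in-memory dictionaries.

```python
from collections import defaultdict

def build_indexes(observations):
    """Index the same stacked records two ways."""
    by_patient = defaultdict(list)      # clinician view: everything for one patient
    by_observation = defaultdict(list)  # research view: one variable across patients
    for obs in observations:
        by_patient[obs["pat_id"]].append(obs)
        by_observation[obs["observation_id"]].append(obs)
    return by_patient, by_observation

def patients_where(by_observation, observation_id, predicate):
    """Cross-patient query: IDs of patients with at least one matching value."""
    return sorted({o["pat_id"] for o in by_observation.get(observation_id, [])
                   if predicate(o["value"])})

observations = [
    {"pat_id": "1234-5", "observation_id": "Hgb", "value": 13},
    {"pat_id": "9999-3", "observation_id": "Hgb", "value": 12.5},
    {"pat_id": "8888-3", "observation_id": "Hgb", "value": 16},
    {"pat_id": "9999-3", "observation_id": "Cholest", "value": 230},
]
by_patient, by_observation = build_indexes(observations)
low_hgb = patients_where(by_observation, "Hgb", lambda v: v < 13)
```

Without the second index, the same question requires scanning every patient's record in turn, which is the situation many care-oriented repositories are in.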

17

Repositories Have Different Scopes

• Local clinical data only
• Local clinical data plus administrative data (very useful)
• Local data supplemented with “external data”, some of which may be “internal”
  ◦ Tumor registry (local – state)
  ◦ ACC data – local – more
  ◦ Social Security death tape – NO INSTITUTION SHOULD BE WITHOUT ONE
  ◦ Medicaid?
  ◦ Other
• Community-wide repository (RHIOs)

18

Research Uses

• Find potential cases for studies (local)

• Review candidates for study eligibility before trying to enroll (even with no search capabilities)

• Obtain numbers and statistical characteristics of potential candidates for grants (local)

19

Research Uses – More

• Estimate variance for sample size analysis

• Track outcomes – (labs – death) – longitudinal studies/Cost-benefit studies (local)

• Epidemiologic studies (esp. with community scope)

• Obtain tissue (through pathology reports)

• Link phenotype with genotype (if also collecting genetics)

20

Problems with Today’s Research Strategies

21

Not Enough Research Data

• Clinicians are faced with zillions of decisions
• Research helps only some of them
  ◦ Preventive decisions – but even for some of these (pneumonia vaccine) data are soft
  ◦ Many cardiovascular interventions
  ◦ Some anticoagulation interventions

• Little help with special circumstances – age, co-morbidity

• Almost no data for decisions about diagnostic testing, surgery, use of devices

• Almost no help regarding cost benefits

22

Deeper Problems

• Sample size requirements for trials become difficult/impossible when
  ◦ Event rates are small
  ◦ The difference between treatment and “control” is small: often the case when comparing a new treatment with the best existing treatment
  ◦ We want to quantify the amount of benefit accurately for cost-benefit analysis

23

Deeper Problems - More

• A study with 10% event rate and 25% difference (big difference) can require enrollment of 10,000 patients.

• To be 95% sure of observing at least one case of an event with a rate of 1/25,000, need to observe 63,000 cases (e.g. rhabdomyolysis)

• Trials can’t cover the entire waterfront
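The rare-event arithmetic can be checked with a short calculation. The sketch below assumes the simplest possible model, in which each patient independently has the same probability of the event; the function names are illustrative. Under that assumption the number of patients needed for 95% confidence at a rate of 1/25,000 comes out near 75,000 (the familiar "rule of three", n ≈ 3/p). That is somewhat above the 63,000 quoted above, which gives roughly a 92% chance under this model; the quoted figure may rest on different assumptions.

```python
import math

def patients_needed(event_rate, confidence=0.95):
    """Smallest n with P(at least one event among n patients) >= confidence.

    Assumes independent patients with an identical event probability.
    """
    return math.ceil(math.log(1.0 - confidence) / math.log1p(-event_rate))

def chance_of_seeing_one(event_rate, n):
    """Probability of observing at least one event among n patients."""
    return 1.0 - (1.0 - event_rate) ** n

rate = 1 / 25_000
n = patients_needed(rate)
p_at_63k = chance_of_seeing_one(rate, 63_000)
```

Either way the conclusion stands: numbers of this size are out of reach for most trials and are a natural fit for repository data.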

24

How to Get More for Less

• Collect less on a greater number of patients
• Use repositories
  ◦ To find patients for trials
  ◦ For retrospective analysis of rare events
  ◦ For post-marketing drug toxicities
  ◦ To supplement data collection in traditional clinical trials
  ◦ For gathering outcomes and follow-up in longitudinal studies and large simple trials (community repositories)
  ◦ To find tissue (paraffin) for study

25

Repository Examples

• Partners analytical database (Murphy SN)
  ◦ Considering labs alone – more than 125 different labs interfaced
  ◦ Uses LOINC as lingua franca for gluing different results together
  ◦ At LEAST (old data):
    2.5 million patients with clinical data
    700 million clinical facts
    750 active researchers
    7000 queries/year

26

More Examples

• The VA – mapping all of their lab tests to LOINC – so data can be pooled across hospitals.

• CRN – collaboration of 10 large “HMOs” for cancer research (Puget Sound, Kaiser, etc.); lab, radiology, and drug data available from the collaborators (Wagner et al.)

27

Community-Based Repositories

Memphis

North West Indiana

British Columbia

Pediatric hospitals in Ontario

North Jutland, Denmark

Utrecht, Netherlands

Central Indiana (Indianapolis) (INPC)

28

INPC – What Is It?

• Centralized (federated) clinical repository for central Indiana

• Data delivered from all major Indianapolis hospital systems as HL7 ver. 2.x

• Patients from each institution are treated as belonging to a separate institution

• Funded by NLM (INPC) and NCI (SPIN)

• Open source software

29

What is it For?

• Clinical care
  ◦ Eligible providers can access clinical data from all sources in one view when the patient is seeking their care
• Public health
  ◦ Bio-surveillance
  ◦ Reportable disease reporting
• Quality
• Research (today’s subject)

30

Who Contributes Data?

• Hospitals
  ◦ Five Indianapolis hospital systems (total of 15 separate hospitals)
• Stand-alone labs
• Payers
  ◦ Medicaid (whole state)
    Encounter ICD & CPT + meds
    150 M encounters, 75 M prescriptions
  ◦ WellPoint (largest healthcare company in US – more patients than Medicare)

31

Who Contributes Data? – More

• Tumor registry (de-identified research only – whole state) – 550 K cases (another “institution”)
• Death tapes (important)
  ◦ Indiana State Public Health Department
  ◦ Social Security (80 million)

32

INPC Storage Strategy

• Central community database resource and management of mapping, etc.

• Standardized data structure – all use the same software and observation codes.

• Data for each organization in its own physical files

• Combine on-the-fly when needed

• Patient linking needed – because no national ID

33

34

All Hospitals Contribute – At Least

• Lab results

• Cardiology reports

• Tumor registry data

• In-patient medication orders (committed)

• TEXT IS GOOD
  ◦ Discharge summaries/admission summaries
  ◦ Operative notes
  ◦ Radiology reports
  ◦ Pathology reports – gets you to existing tissue

• Some Contribute All

35

2006 INPC Data Flows and Content

• Flows
  ◦ More than 150 HL7 message streams
  ◦ More than 100 million separate HL7 messages per year (380 million OBXs)
  ◦ Add about 80 million results per year
  ◦ HL7 ver. 2 works!!!!
• Content
  ◦ 6 million distinct patient registration records (3 M)
  ◦ 850 million discrete results
  ◦ 50 million radiology images
  ◦ 17 million narrative reports

36

How does the Data Flow from Source to RHIO Repository?

• HL7 messages deliver most of the clinical data.

• DICOM for radiology images.

• NCPDP for outpatient pharmacy.

• LOINC – provides standard codes that define the content of each delivered result.

http://www.regenstrief.org/loinc
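To make the roles of HL7 and LOINC concrete, the sketch below splits a single HL7 version 2 OBX (observation result) segment into the pieces a repository would store. The sample segment is made up for illustration, and the parser is a minimal sketch: it assumes the default `|` and `^` delimiters and ignores escape sequences, repetitions, and the rest of the message. LOINC code 718-7 identifies hemoglobin mass concentration in blood.

```python
def parse_obx(segment):
    """Split one HL7 v2 OBX segment into the fields a repository might store."""
    fields = segment.split("|")
    if fields[0] != "OBX":
        raise ValueError("not an OBX segment")
    fields += [""] * (12 - len(fields))  # pad so short segments index safely
    # OBX-3 is a coded element: code ^ display name ^ coding system
    code, name, system = (fields[3].split("^") + ["", "", ""])[:3]
    return {
        "value_type": fields[2],       # OBX-2, e.g. NM = numeric
        "code": code,                  # OBX-3.1, the observation code
        "name": name,                  # OBX-3.2
        "coding_system": system,       # OBX-3.3, "LN" denotes LOINC
        "value": fields[5],            # OBX-5
        "units": fields[6],            # OBX-6
        "reference_range": fields[7],  # OBX-7
        "status": fields[11],          # OBX-11, e.g. F = final
    }

# Illustrative segment only, not taken from a real message stream.
obx = parse_obx("OBX|1|NM|718-7^Hemoglobin^LN||13|g/dL|12.5-15|N|||F")
```

The parsed result is essentially one row of the stacked data set: an observation code pointing to a master table, a value, units, and a normal range.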

37

38

Radiology Images - Thumbnail

39

BIG

40

BIGGER

41

BIGGEST 2800 x 2000

42

EKG Discrete Variables

43

EKG Tracings

44

Flow Sheet for Blood Count

45

46

Orders

47

Report Delivery to Office Practices

• 1300+ practices (3800 MDs) at present

• 90% of the active care providers in 9 county region

• Many opportunities to practice access for

48

Repository Research Uses

49

INPC Use for Research

• Hundreds of queries for grants per year, e.g. to estimate the number of cases available for study and to find cases.

• Pull supplemental data for many clinical trials

• Used in 80% of human subjects studies at some point in study

• Remind MDs of studies underway (recruitment)

• Database studies – the greatest:
  ◦ Erythromycin and pyloric stenosis (1)

(1) Mahon BE, Rosenman MB, Kleiman MB. Maternal and infant use of erythromycin and other macrolide antibiotics as risk factors for infantile hypertrophic pyloric stenosis. J Pediatr. 2001 Sep;139(3):380-4.

50

Tissue Access

• SPIN project
  ◦ NCI-funded collaboration among IU/Regenstrief, Harvard, University of Pittsburgh, UCLA
  ◦ Use query to find clinical cases of interest. Pathology reports provide the link to tissue – paraffin block – 4 M in Indianapolis

51

Special Query Capabilities

• Access to more than 10,000 distinct variables

• Built-in de-identification processes
  ◦ Dates truncated to year
  ◦ Forbidden fields removed
  ◦ Text reports are scrubbed (examples)

• Build cohort twin databases, then run statistical analyses.
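The three de-identification steps listed above can be sketched as follows. This is only a toy illustration and not the tool described in the talk: the forbidden-field list, the name list passed in by the caller, and the phone-number pattern are all assumptions, and real scrubbing of free text needs far more thorough and well-tested rules.

```python
import re
from datetime import date

FORBIDDEN_FIELDS = {"name", "address", "phone", "mrn"}  # illustrative list only

def scrub_text(text, known_names):
    """Replace known names and phone-like numbers with placeholders."""
    for name in known_names:
        text = re.sub(re.escape(name), "[NAME]", text, flags=re.IGNORECASE)
    return re.sub(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b", "[PHONE]", text)

def deidentify(record, known_names=()):
    """Apply the three steps to one record (a plain dict)."""
    out = {}
    for field, value in record.items():
        if field in FORBIDDEN_FIELDS:
            continue                      # forbidden fields removed
        if isinstance(value, date):
            out[field] = value.year       # dates truncated to year
        elif isinstance(value, str):
            out[field] = scrub_text(value, known_names)  # text scrubbed
        else:
            out[field] = value
    return out

record = {
    "name": "Jane Doe",
    "mrn": "1234-5",
    "surgery_date": date(1995, 5, 12),
    "hgb": 13,
    "note": "Jane Doe seen by Dr Smith; call 317-555-0100.",
}
clean = deidentify(record, known_names=["Jane Doe", "Smith"])
```

Running the statistics against records like `clean` is what lets a user analyze data without ever seeing identified values.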

52

Special Query Capabilities – More

• Each kind of text report is just another variable – Google-like searches on text, more traditional search criteria for numeric and coded variables

• Tie-in to R (RECOMMENDED) and pre-packaged statistical routines

• User can do statistical analyses without ever touching any data

53

SPIN Build Data Set Query

54

SPIN Look at Data Set

55

SPIN Look at Individual Scrubbed Report

57

How do we Glue Data Together?

• Use linking algorithms to tie patients – based on registration data

• Use LOINC codes and mapping tools to tie equivalent variables together
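Both kinds of glue can be sketched briefly. The example below uses a simple deterministic match key built from registration data and a lookup table from local test codes to LOINC codes. It is a stand-in under stated assumptions, not the linking algorithm used in any particular repository: production linkage is usually probabilistic and tolerant of typos, and the institution names and local codes here are hypothetical. The two LOINC codes are real: 718-7 (hemoglobin in blood) and 2093-3 (cholesterol in serum or plasma).

```python
from collections import defaultdict

def link_key(reg):
    """Build a simple deterministic match key from registration data."""
    return (reg["last"].strip().lower(), reg["first"].strip().lower()[:1],
            reg["dob"], reg["sex"].upper())

def link_patients(registrations):
    """Group (institution, local_id) pairs that share a match key."""
    groups = defaultdict(list)
    for reg in registrations:
        groups[link_key(reg)].append((reg["institution"], reg["local_id"]))
    return list(groups.values())

# Hypothetical local test codes mapped to LOINC codes.
LOCAL_TO_LOINC = {
    ("St Francis", "HGB"): "718-7",     # Hemoglobin [Mass/volume] in Blood
    ("Hospital B", "HB-01"): "718-7",
    ("Hospital B", "CHOL"): "2093-3",   # Cholesterol [Mass/volume] in Serum or Plasma
}

def to_loinc(institution, local_code):
    """Translate an institution's local code to LOINC, or None if unmapped."""
    return LOCAL_TO_LOINC.get((institution, local_code))

registrations = [
    {"institution": "St Francis", "local_id": "1234-5", "last": "Doe",
     "first": "Jane", "dob": "1950-03-02", "sex": "F"},
    {"institution": "Hospital B", "local_id": "A-77", "last": "DOE ",
     "first": "J", "dob": "1950-03-02", "sex": "f"},
    {"institution": "Hospital B", "local_id": "A-78", "last": "Jones",
     "first": "T", "dob": "1948-11-20", "sex": "M"},
]
linked = link_patients(registrations)
```

Once patients are linked and both hemoglobin codes resolve to 718-7, results from the two institutions can be pooled as a single variable for a single person.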

58

How do we Get There?

• Glue data from many sources together

• First from your institution

• Then other related data bases (hospital is full of them from tumor registry to heart attack database)

• Rx.Hub – 60% of the nation’s prescriptions

• Don’t forget Death tapes

• Push for community data melds – they could revolutionize clinical …

59

How do we Get There? – 2

• Force connections between clinical trials systems and institutional systems

• The current state makes no sense

• Demand HL7 bidirectional registration and resulting transmission

• Push for use of HL7 clinical trial segments in orders and reporting

60


• If combining independent sources
  ◦ Need linking routines (NIH should make good tools publicly available)

• Combine for clinical use – de-identify for research use (limited data sets)
  ◦ Make well-tested de-identification tools publicly available

61


• Develop national catalogues for variables and questionnaires. Contribute new variables to the catalogue when existing ones really won’t do.

• Use LOINC – as the catalogue – try it, you’ll like it

62

LOINC and RELMA Web Site – No Cost Downloads

� Type in LOINC at Google� Pig

63

Challenges Exist

• Each Study (and phase) needs ID => Institutional study database

• Ordering systems need option for adding trial ID and phase to the order

• HL7 has segments defined for these – not hard, just need to be articulated

64

Challenges Exist – 2

• Catch 22's – e.g., recruitment

• Defeats the efficiencies intrinsic to repositories

• Need more rational rules

65

Challenges Exist – 3

• Managing (and retrieving) consents

• Solvable with scanning and proper workflow

66

Medicare & Medicaid – Miracles

• Could follow up deaths via SS death tapes (here now)

• Find outcome events (Medicare patients) in the Medicare database

• Track medication and intervention use (Medicare patients) in the Medicare database

• Similar opportunities with Medicaid databases

67

Research Will Still be Hard

• Clinical systems will not carry all data of interest

• Repositories are not magic.

• But we could collect less if we used the available clinical data where it sufficed and focused the question on strong outcomes

68

ASIMO at CES 2007

• http://www.youtube.com/watch?v=UOWYIjbKDcc

69

The END
