Data Quality in Multi-Site Health Services and Comparative Effectiveness Research: Lessons from PHIS+
Ram Gouripeddi, University of Utah
10th Annual Utah Health Services Research Conference
Considering Data Quality in Health Services Research
Monday, March 16, 2015
Acknowledgements
• Raj Srivastava, MD, MPH
• Ron Keren, MD, MPH
• OpenFurther Team members
• PHIS+ Team members across multiple institutions
• Apelon
• FURTHeR development was supported by the NCRR and the NCATS, NIH, through Grant UL1RR025764 and supplement 3UL1RR025764-02S2. This project was funded under grant number R01 HS019862-01 from the AHRQ, U.S. Department of Health and Human Services (HHS). The opinions expressed [in this document] are those of the authors and do not reflect the official position of AHRQ or the HHS.
• PHIS+: www.childrenshospitals.org/phisplus/index.html
2
PHIS+
• Augment the Children’s Hospital Association’s (CHA) existing electronic database of administrative data, the Pediatric Health Information System (PHIS), with clinical data to conduct Comparative Effectiveness Research (CER) studies.
• UU Biomedical Informatics Core - Informatics Partners
• Agency for Healthcare Research and Quality (AHRQ) funded project.
3
PHIS+ Overview
• CER Studies (4): Pneumonia, Appendicitis, Osteomyelitis, Gastroesophageal Reflux Disease
• Data Streams (3): Laboratory, Microbiology, Radiology
• Years of Data (5): 2007 – 2011
• 2009 – Development
• 2012….
4
The PHIS+ Process
Pediatric Research in Inpatient Setting (PRIS) Sites (6)
1. Cincinnati Children’s Hospital Medical Center (CCHMC)
2. Children’s Hospital Boston (CHB)
3. Children’s Hospital of Philadelphia (CHOP)
4. Children’s Hospital of Pittsburgh (CHP)
5. Primary Children’s Medical Center, Intermountain Healthcare (PCMC)
6. Seattle Children’s Hospital (SCH)
5
OpenFurther
6
Developmental Process Overview
Narus et al., Federating Clinical Data from Six Pediatric Hospitals: Process and Initial Results from the PHIS+ Consortium. AMIA 2011
7
Modeling & Terminology Phase
• Data Model Harmonization
• Semantic Mapping
• Steps ensured quality of the data by limiting information losses arising from data transformations
8
Data Model Harmonization
• Informatics team worked with domain experts to create representative common data models for storage of different domains of data.
• The team then worked with each hospital’s IT staff to harmonize the local data models with the common data models.
9
Semantic Mapping
• Obtained detailed information about distinct local data elements using a metadata collection toolkit
• Mapped local data elements to standard biomedical terminologies.
• Doubtful mappings were discussed with the respective hospital team, including the site PI, lab, and EHR personnel.
• All mappings were peer-reviewed within the informatics team and with the contributing hospital team, and were also run through software checks.
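One kind of software check that could be run over mapped codes is a format and check-digit test. The sketch below assumes the mapped codes are LOINC codes and that the trailing digit follows the standard Mod 10 (Luhn-style) scheme; the function name `loinc_check_digit_ok` is hypothetical, and this is not necessarily one of the checks the PHIS+ team ran.

```python
import re


def loinc_check_digit_ok(code: str) -> bool:
    """Return True if `code` has the form NNNN-C with a valid Mod 10 check digit C."""
    match = re.fullmatch(r"(\d+)-(\d)", code.strip())
    if match is None:
        return False  # wrong shape, e.g. a local code such as "CRPT"
    body, check = match.group(1), int(match.group(2))
    total = 0
    # Walk the body right to left, doubling every other digit,
    # starting with the rightmost one.
    for position, char in enumerate(reversed(body)):
        digit = int(char)
        if position % 2 == 0:
            digit *= 2
            if digit > 9:
                digit -= 9  # same as adding the two digits of the product
        total += digit
    return (10 - total % 10) % 10 == check
```

Both LOINC codes that appear in this deck, 1988-5 and 2823-3, pass this test. A check like this only catches transcription errors in the code itself; it cannot tell whether a well-formed code is the right concept for a local test, which is why the peer review remains necessary.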
10
Metadata Fields                  Example
Local Battery/Panel Name/Code
Battery/Panel Description
Local Test Name                  Glucose
Local Test Code                  Glu
Test Description                 Blood Glucose
LOINC Code                       -
Test Value Type                  Numeric
Test Value Sample Data           86
Test Start Date Format
Test End Date Format
Specimen                         Serum
Units of Measure                 mg/dL
Reference Range                  80 – 120
Interpretation Codes
Test Status Codes
Comments
Differences in Local Coding Schemas
• Laboratory test: the local codes C Reactive Protein (8726), C Reactive Protein (CRPT), CRP (CRP), CRP Test (700111), C-Reactive Protein (801582) and C R Protein (801679) all map to C Reactive Protein [Mass/volume] in Serum or Plasma (1988-5).
• Laboratory Test Unit of Measure: the local units NG/DL, ng/dL, ng/dL, ng per dL, ng/Dl and ng per dL all map to Nanogram/Decilitre (258805003).
11
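The local-to-standard mapping on this slide can be sketched as two lookup tables. The local codes, local unit spellings, and the two standard codes come from the slide; the dictionary layout and the functions `map_test` and `map_unit` are hypothetical. A real mapping would presumably also be keyed by site, since the same local code can mean different things at different hospitals.

```python
from typing import Optional

# Standard concepts named on the slide.
CRP_STANDARD = "1988-5"            # C Reactive Protein [Mass/volume] in Serum or Plasma
NG_PER_DL_STANDARD = "258805003"   # Nanogram/Decilitre

# Local test codes from the slide, each mapped to the single standard code.
LOCAL_TEST_TO_STANDARD = {
    "8726": CRP_STANDARD,
    "CRPT": CRP_STANDARD,
    "CRP": CRP_STANDARD,
    "700111": CRP_STANDARD,
    "801582": CRP_STANDARD,
    "801679": CRP_STANDARD,
}

# Local unit spellings from the slide, stored lowercased with single spaces.
LOCAL_UNIT_TO_STANDARD = {
    "ng/dl": NG_PER_DL_STANDARD,
    "ng per dl": NG_PER_DL_STANDARD,
}


def map_test(local_code: str) -> Optional[str]:
    """Return the standard test code for a local code, or None if unmapped."""
    return LOCAL_TEST_TO_STANDARD.get(local_code.strip())


def map_unit(local_unit: str) -> Optional[str]:
    """Return the standard unit code for a local unit string, or None if unmapped."""
    normalized = " ".join(local_unit.lower().split())
    return LOCAL_UNIT_TO_STANDARD.get(normalized)
```

Lowercasing is enough to collapse NG/DL, ng/dL and ng/Dl here, but it is an assumption that would not hold for every unit: in case-sensitive unit systems, letter case can change the meaning. Returning `None` for unmapped values keeps unknown codes visible for manual review instead of guessing.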
Data Processing Phase
• Data collection phase: Each hospital used a combination of a data collection toolkit and data validation scripts to assess their submitted data.
• Contributed data was then processed through the OpenFurther platform for translation to selected standard terminologies and storage in common data models.
• Each row of processed data was checked for data quality issues specific to its domain.
• Errors in the data were flagged with an error taxonomy and reviewed for fixes or resubmissions.
12
Example Checks
• Is the lab test associated with a patient?
• Is there a valid lab test in each row of lab result data?
• Does the lab test have a valid result?
• Are there proper relationships between cultures, their test specimens and results?
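The first three checks above can be sketched as a function that returns error labels for one row. This is a minimal illustration and not the OpenFurther implementation: the `LabRow` fields, the function `check_lab_row`, and the error labels are all hypothetical, since the slides do not list the actual error taxonomy. The culture, specimen and result relationship check is omitted because it needs more than one row.

```python
from dataclasses import dataclass
from typing import List, Optional, Set


@dataclass
class LabRow:
    """One row of lab result data (hypothetical field names)."""
    patient_id: Optional[str]
    test_code: Optional[str]
    result_value: Optional[str]
    value_type: str = "Numeric"


def check_lab_row(row: LabRow, known_test_codes: Set[str]) -> List[str]:
    """Return the error labels raised for one lab result row (empty if clean)."""
    errors = []
    # Is the lab test associated with a patient?
    if not row.patient_id:
        errors.append("MISSING_PATIENT")
    # Is there a valid lab test in the row?
    if not row.test_code or row.test_code not in known_test_codes:
        errors.append("INVALID_TEST")
    # Does the lab test have a valid result?
    if row.result_value is None or not row.result_value.strip():
        errors.append("MISSING_RESULT")
    elif row.value_type == "Numeric":
        try:
            float(row.result_value)
        except ValueError:
            errors.append("INVALID_RESULT")
    return errors
```

Returning every label that applies, instead of stopping at the first failure, lets a flagged row be reviewed once for a fix or a resubmission.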
13
Study Specific Quality Assessment
• Individual studies have different granularities and specificities in their data requirements.
• We undertook a second set of data quality assessments at the study cohort level.
• This included a chart review of a significant sample within each study cohort.
14
[Figure: distribution of result values for 2823-3: Potassium [Moles/volume] in Serum or Plasma; count axis from 0 to 35,000]
PHIS+ CER Database – 2007-11

Laboratory
Site | Results | LOINC Lab Test Code
A | 15,011,312 | 538
B | 33,214,540 | 1,214
C | 16,868,383 | 860
D | 25,706,608 | 1,089
E | 38,422,668 | 1,016
F | 14,507,629 | 2,131
Total | 150,731,140 | *6,848 (2,992)

Microbiology
Site | Culture Results | SNOMED Specimen Code | SNOMED Culture Procedure Code | SNOMED Organism Code | RxNorm Anti-microbial Code | Susceptibility Results | LOINC Susceptibility Test Code
A | 247,933 | 114 | 70 | 113 | 57 | 487,813 | 97
B | 359,780 | 58 | 42 | 56 | 58 | 393,594 | 85
C | 231,071 | 179 | 46 | 162 | 59 | 340,100 | 99
D | 335,606 | 110 | 34 | 145 | 57 | 376,844 | 75
E | 486,315 | 130 | 56 | 160 | 59 | 605,000 | 76
F | 176,848 | 264 | 71 | 121 | 51 | 283,865 | 89
Total | 1,837,553 | *855 (451) | *319 (95) | *757 (203) | *341 (74) | 2,487,216 | *521 (136)

Radiology
Site | Reports | CPT Radiology Procedure Code
A | 445,681 | 280
B | 1,151,383 | 349
C | 635,458 | 296
D | 980,740 | 482
E | 1,098,693 | 497
F | 201,708 | 477
Total | 4,513,663 | *2,381 (714)

* The first number is the total number of standard codes, the second in parenthesis is the distinct number of standard codes across all sites.
1,854,406 Kids
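The two numbers in each starred cell can be read as a sum of per-site counts and the size of the union across sites. A minimal sketch, assuming each site's codes are available as a set; the function `code_counts` and the toy input are hypothetical and are not PHIS+ data.

```python
from typing import Dict, Set, Tuple


def code_counts(codes_by_site: Dict[str, Set[str]]) -> Tuple[int, int]:
    """Return (total, distinct): summed per-site code counts and the union size."""
    total = sum(len(codes) for codes in codes_by_site.values())
    distinct = len(set().union(*codes_by_site.values()))
    return total, distinct


# Toy input for illustration only: two sites that share one code.
toy = {"A": {"1988-5", "2823-3"}, "B": {"2823-3"}}
total, distinct = code_counts(toy)  # total is 3, distinct is 2
```

The gap between the two numbers indicates how much the sites overlap: the closer the distinct count is to the total, the fewer codes the sites have in common.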
Discussion
• We developed an infrastructure that assesses the quality of data being integrated from disparate data sources.
• Using this infrastructure we populated a database with high quality data to support HSR & CER.
• To ensure data quality, we used a combination of computerized data assessment checks within OpenFurther and manual checks.
• Global and study-specific data quality assessments were required
– To address systemic issues in data integration and study-specific issues.
16
Discussion
• Informed by the framework developed by Kahn et al. in “A Pragmatic Framework for Single-site and Multisite Data Quality Assessment in Electronic Health Record-based Clinical Research”
• Inherent dimensions such as Accuracy, Objectivity and Believability; and Conceptual dimensions such as Timeliness and Appropriate amount of data were measured.
• A software platform that complies with existing theoretical frameworks of data quality can assist this process and speed the generation of new and reproducible study results.
– A Data Model for Representation and Storage of Biomedical Data Quality, Breakout Session 3 – Strategies for Identifying Data Quality Issues
17
THANK YOU
18