Managing Research Data
Judith R. Logan, MD, MS Department of Medical Informatics & Clinical Epidemiology
• Manual abstraction from EHRs
• Clinical exams
• Electronic abstraction from EHRs
• Lab or diagnostic tests
• Monitoring devices
• Patient-reported
• Administrative data
• Registries
• National surveys and public datasets
Caveat #1
Data is never error-free
"...data strong enough to support conclusions and interpretations as equivalent to those derived from error-free data." IOM
Data Quality is Multidimensional
• Reliability
• Validity
• Timeliness
• Attribution
• Legibility
Are you asking the right questions in the right way?
Concerns of regulatory agencies (FDA)
• Correctness
• Completeness
• Consistency
• Non-ambiguity
• Granularity
• Precision
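[Figure: two data-flow diagrams. First: Primary source, CRF, Electronic Data, Analytical Dataset, with steps labeled extract, enter, refine. Second: Primary electronic source and Primary paper source, Electronic Data, Analytical Dataset, with steps labeled abstract, refine.]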
What types of errors are there?
Paper to analytical dataset by humans
eSource to analytical dataset by humans
eSource to analytical dataset by system
Random Systematic
“Copy” error (paper or eSource to analytical dataset by humans)
Example: 24.5, 24.5, 25.4 (digits transposed in the last copy)
“Calculation” error (paper or eSource to analytical dataset by humans)
Example: 102.6 °F recorded as 38.2 °C. With C = (5/9)(F - 32), the correct value is 39.2 °C.
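Conversions like this one are easy to hand to a program instead of doing them by hand. A minimal sketch in Python; the function name `fahrenheit_to_celsius` is illustrative and not part of the original slides.

```python
def fahrenheit_to_celsius(temp_f: float) -> float:
    """Convert Fahrenheit to Celsius using C = (5/9)(F - 32)."""
    return (5.0 / 9.0) * (temp_f - 32.0)


# The example on the slide: 102.6 degrees F is about 39.2 degrees C, not 38.2.
print(round(fahrenheit_to_celsius(102.6), 1))
```

Storing the value as measured and deriving the converted value in code keeps the source value available for checking.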
“Coding” error (paper or eSource to analytical dataset by humans)
Example: the source shows "Problem list: diabetes; HgbA1C = 8.2; Cr = 2.0" and the item is coded "Yes"
Caveat #2
Keep humans out of the process
But if you can't ....
Double data entry
1st copy from source to analytical dataset:
2nd copy from source to analytical dataset:
Comparison (manual or automated)
How long has the subject had diabetes? ____ years
1st copy: 5 years
2nd copy: 6 years
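The automated comparison step can be sketched in a few lines of Python. This is only an illustration under assumed data structures: each entry pass is a dictionary keyed by a hypothetical subject ID, and the field name `diabetes_years` is invented for the example.

```python
def compare_entries(first: dict, second: dict) -> list:
    """Return (subject_id, field, first_value, second_value) for every mismatch."""
    discrepancies = []
    for subject_id in sorted(set(first) | set(second)):
        rec1 = first.get(subject_id, {})
        rec2 = second.get(subject_id, {})
        for field in sorted(set(rec1) | set(rec2)):
            v1, v2 = rec1.get(field), rec2.get(field)
            if v1 != v2:
                discrepancies.append((subject_id, field, v1, v2))
    return discrepancies


# Hypothetical data: two independent entries of the same paper form.
entry_1 = {"S001": {"diabetes_years": 5}}
entry_2 = {"S001": {"diabetes_years": 6}}
print(compare_entries(entry_1, entry_2))
```

Each reported mismatch would then be resolved by going back to the source document, not by picking one of the two copies.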
Scannable paper forms
REDCap
Caveat #3
If you have to collect new data, do it right
• Correctness
• Completeness
• Consistency
• Non-ambiguity
• Granularity
• Precision
What data do you collect …
Identify the data to collect
Collect the right data to maximize quality and decrease inefficiencies
• Only data that is relevant to the purpose for which it is collected
• Extraneous data may adversely affect data quality by distracting the attention of the study personnel from the critical variables
Consider fields that are needed for statistical analysis
Collect a minimum of identifiable data
… and how do you collect it?
Keep the questions, prompts and instructions clear and concise
Include prompts and instructions for form completion
Include definitions for items that are not directly measurable
• “Does the subject have hypertension?” should be accompanied by specific ranges, lengths of time sustained or necessity of specific interventions
Collect data in a structured form
• Yes/No options, multiple-choice pick lists, check boxes, menus, etc.
• If answers are in text, may code to a set of appropriate options
CREATE A KEY
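One way to picture a key is as a small lookup table that travels with the dataset. The sketch below is an assumption-laden illustration: the smoking-status item, the code values, and the names `SMOKING_KEY`, `TEXT_TO_CODE`, and `code_answer` are all hypothetical.

```python
# Hypothetical key (codebook) for a smoking-status item.
SMOKING_KEY = {1: "never", 2: "former", 3: "current", 9: "unknown"}

# Hypothetical mapping from free-text answers to codes in the key.
TEXT_TO_CODE = {
    "never": 1, "never smoked": 1,
    "former": 2, "quit": 2, "ex-smoker": 2,
    "current": 3, "smoker": 3,
}


def code_answer(text: str) -> int:
    """Map a free-text answer to a code; unrecognized text becomes 9 (unknown)."""
    return TEXT_TO_CODE.get(text.strip().lower(), 9)


print(code_answer(" Quit "), SMOKING_KEY[code_answer(" Quit ")])
```

Sending unrecognized text to an explicit "unknown" code, rather than guessing, keeps the coded column consistent and leaves those answers visible for later review.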
Collect raw data rather than calculations
• 5 BPs at 2 minute intervals rather than an average BP
Consider the workflow
• If study personnel are extracting from a medical record, a tabular form may work
• If subjects are completing forms, group items logically and be cautious about referential questions (“skip logic”)
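The point about raw data can be illustrated with the blood pressure example: when the individual readings are stored, the average is a one-line derivation and other summaries remain possible. The readings and the name `average_bp` below are hypothetical.

```python
from statistics import mean

# Hypothetical raw readings for one subject: (systolic, diastolic) in mmHg,
# taken at 2-minute intervals.
bp_readings = [(142, 88), (138, 86), (136, 84), (134, 84), (135, 83)]


def average_bp(readings):
    """Derive the average systolic and diastolic pressure from raw readings."""
    systolic = mean(s for s, _ in readings)
    diastolic = mean(d for _, d in readings)
    return systolic, diastolic


print(average_bp(bp_readings))
# A different summary (for example, dropping the first reading) is still
# possible because the raw values were kept.
print(average_bp(bp_readings[1:]))
```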
Ask:
• Can every question always be answered without ambiguity?
• Are the choices exhaustive and exclusive?
Example: I haven’t made an appointment yet because (choose one):
• I don’t know who to call
• I didn’t feel like it
• I’m too busy
• I don’t know why
Caveat #4
Electronic Health Record data is notoriously
• dirty
• incomplete
• locked in text
Caveat #5
Administrative data is not collected for research
• powerful
• large samples, diverse demographics
• longitudinal
• inexpensive
Use of administrative data
Administrative data coding systems designed for billing
• ICD-9-CM (ICD-10-CM): diagnoses and procedures
• CPT: procedures
• limit the number of slots
• inability to distinguish comorbidities from complications (before 2007)
• coding bias
Administrative data lacks sensitivity or specificity for identifying some conditions
• omissions in documentation common
Example: identification of diabetes
• 2 or more outpatient visits OR
• 1 inpatient ICD-9 for diabetes OR
• filled prescription for diabetes med (excluding one medication)
• continuous enrollment for 11 months
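A case definition like the diabetes example can be written as a small rule so it is applied the same way to every subject. The sketch below is one possible reading, not the definition's official form: it assumes the three clinical criteria are alternatives and that continuous enrollment is required in addition, and all field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ClaimsSummary:
    """Hypothetical per-subject summary of claims; field names are illustrative."""
    outpatient_diabetes_visits: int    # outpatient visits with a diabetes ICD-9 code
    inpatient_diabetes_stays: int      # inpatient stays with a diabetes ICD-9 code
    filled_diabetes_rx: bool           # filled a qualifying diabetes prescription
    months_continuously_enrolled: int


def meets_diabetes_definition(s: ClaimsSummary) -> bool:
    """Apply the example rule; enrollment is ASSUMED to be required in addition."""
    has_evidence = (
        s.outpatient_diabetes_visits >= 2
        or s.inpatient_diabetes_stays >= 1
        or s.filled_diabetes_rx
    )
    return has_evidence and s.months_continuously_enrolled >= 11


print(meets_diabetes_definition(ClaimsSummary(2, 0, False, 12)))
```

Whichever reading is used, chart validation on a sample is what shows how well the rule actually finds the condition.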
If you are using external datasets
• The investigator must have adequate knowledge of the database structure and contents
• how complete and accurate is the data?
• Chart validation, when possible, can confirm the codes used
Caveat #6
Once you have your data, handle it with care
Spreadsheets: MS Excel
• Ubiquitous
• Easy and quick to use
• Little to no control over data integrity unless you program forms
• Great for storing but not collecting data
Databases
• Essential skill for researchers
• Can constrain the data for improved data integrity through constraints on data fields and controlled access through application interfaces
• Flexibility for growth and change
• Can query/transform the data
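To make "constraints on data fields" concrete, here is a minimal sketch using Python's built-in `sqlite3` module. The table layout, the allowed blood pressure range, and the Y/N/U codes are assumptions chosen for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE visit (
        subject_id   TEXT    NOT NULL,
        visit_date   TEXT    NOT NULL,
        systolic_bp  INTEGER CHECK (systolic_bp BETWEEN 50 AND 300),
        has_diabetes TEXT    CHECK (has_diabetes IN ('Y', 'N', 'U')),
        PRIMARY KEY (subject_id, visit_date)
    )
    """
)

conn.execute("INSERT INTO visit VALUES ('S001', '2012-03-01', 138, 'Y')")


def try_insert(row):
    """Return True if the row is accepted, False if a constraint rejects it."""
    try:
        conn.execute("INSERT INTO visit VALUES (?, ?, ?, ?)", row)
        return True
    except sqlite3.IntegrityError:
        return False


print(try_insert(("S002", "2012-03-01", 1380, "Y")))  # out-of-range BP
print(try_insert(("S001", "2012-03-01", 140, "Y")))   # duplicate subject/date
```

Both inserts are rejected by the database itself, which is the kind of control a plain spreadsheet does not provide.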
Check for data errors
• Missing CRF pages
• Missing values: not done vs. not recorded
• Data entered more than once
• Data management plan
• Outliers
• Inconsistent data
• Data dependencies
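Several of these checks can be automated. The sketch below covers duplicates, missing values, outliers, and one inconsistency rule; the records, the plausible weight range, and the function name `find_problems` are hypothetical, and real limits would come from the study's data management plan.

```python
from collections import Counter

# Hypothetical records: one dict per subject per visit; None means missing.
records = [
    {"subject_id": "S001", "visit": 1, "sex": "M", "pregnant": "N", "weight_kg": 82.0},
    {"subject_id": "S002", "visit": 1, "sex": "F", "pregnant": "N", "weight_kg": None},
    {"subject_id": "S003", "visit": 1, "sex": "M", "pregnant": "Y", "weight_kg": 820.0},
    {"subject_id": "S001", "visit": 1, "sex": "M", "pregnant": "N", "weight_kg": 82.0},
]


def find_problems(rows, weight_range=(30.0, 300.0)):
    """Return a list of (subject_id, problem) tuples for review."""
    problems = []
    counts = Counter((r["subject_id"], r["visit"]) for r in rows)
    for (sid, visit), n in counts.items():
        if n > 1:
            problems.append((sid, f"visit {visit} entered {n} times"))
    low, high = weight_range
    for r in rows:
        w = r["weight_kg"]
        if w is None:
            problems.append((r["subject_id"], "weight missing"))
        elif not low <= w <= high:
            problems.append((r["subject_id"], f"weight {w} outside {low}-{high}"))
        if r["sex"] == "M" and r["pregnant"] == "Y":
            problems.append((r["subject_id"], "inconsistent: male and pregnant"))
    return problems


for problem in find_problems(records):
    print(problem)
```

The function only flags records. Deciding whether a missing value means "not done" or "not recorded" still requires going back to the source.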
Correct errors
• Ideally, document the correction of errors
• Keep a copy of the original data
• Keep a dated copy each time corrections are made?
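One possible way to combine these points is to apply each correction to a copy and record it in a dated log. This is a sketch under assumed structures (a dictionary keyed by subject ID, a list used as the log), and `apply_correction` is a hypothetical helper.

```python
import copy
from datetime import date


def apply_correction(dataset, log, subject_id, field, new_value, reason, on=None):
    """Return a corrected copy of the dataset and append an entry to the log."""
    corrected = copy.deepcopy(dataset)          # the input is left untouched
    old_value = corrected[subject_id][field]
    corrected[subject_id][field] = new_value
    log.append({
        "date": (on or date.today()).isoformat(),
        "subject_id": subject_id,
        "field": field,
        "old": old_value,
        "new": new_value,
        "reason": reason,
    })
    return corrected


original = {"S001": {"weight_kg": 820.0}}       # hypothetical data
corrections = []
version_2 = apply_correction(
    original, corrections, "S001", "weight_kg", 82.0,
    reason="decimal point misplaced; verified against source",
)
print(original, version_2, corrections)
```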
Format your data
Data transformations
• change units
• categorize continuous variables
• add default values for nulls
One row per subject per encounter
• spreadsheet
• CSV file (text)
• SAS dataset
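The three transformations named above can be sketched on a single subject-encounter row. The column names, the age cut points, and the "unknown" default are assumptions for illustration.

```python
LB_PER_KG = 2.20462  # approximate conversion factor


def transform(row: dict) -> dict:
    """Apply three illustrative transformations to one subject-encounter row."""
    out = dict(row)
    # Change units: pounds to kilograms.
    if row.get("weight_lb") is not None:
        out["weight_kg"] = round(row["weight_lb"] / LB_PER_KG, 1)
    else:
        out["weight_kg"] = None
    # Categorize a continuous variable (cut points are hypothetical).
    age = row.get("age")
    if age is None:
        out["age_group"] = None
    elif age < 18:
        out["age_group"] = "<18"
    elif age < 65:
        out["age_group"] = "18-64"
    else:
        out["age_group"] = "65+"
    # Add a default value for nulls.
    if out.get("smoker") is None:
        out["smoker"] = "unknown"
    return out


print(transform({"weight_lb": 180.0, "age": 70, "smoker": None}))
```

The original columns are kept alongside the derived ones, so each transformation can be checked or redone later.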
Finishing your study
“Lock” your data
• Data is saved and never changed
• Copies of the locked database can be used for analysis
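For a file-based dataset, one simple way to approximate a lock is to mark the file read-only and record a checksum, so any later change can be detected. This is a sketch of that idea, not a substitute for a database's own locking features; `lock_file` and `is_unchanged` are hypothetical helper names.

```python
import hashlib
import os
import stat


def lock_file(path: str) -> str:
    """Mark a file read-only and return its SHA-256 checksum."""
    with open(path, "rb") as f:
        digest = hashlib.sha256(f.read()).hexdigest()
    os.chmod(path, stat.S_IRUSR | stat.S_IRGRP | stat.S_IROTH)
    return digest


def is_unchanged(path: str, expected_digest: str) -> bool:
    """Check that a file (or a copy of it) still matches the recorded checksum."""
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest() == expected_digest
```

The recorded checksum can be stored with the study documentation and used to confirm that an analysis copy matches the locked original.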
Caveat #7
Maintain data privacy and security
De-identifying data for research
• Name
• Telephone, FAX numbers, email addresses
• Social security numbers, medical record numbers, health plan numbers, account numbers
• Certificate/license numbers, VIN, license plate numbers
• URLs, IP addresses
• Biometric identifiers (voiceprints, fingerprints)
• Full-face photographic images (or comparable)
• All geographic units smaller than a state, except: may use first three digits of zip code as long as >20,000 people live in this area
• All elements of date except year, including birth date, date of death, admission date, discharge date, other dates of service
• May use age, except: all ages 90 and over must be aggregated into a single category
• “Anything else” that uniquely identifies the individual
However, you may code the data in such a way that the subject can be reidentified as long as
• the code is not derived from or related to the information about the individual and cannot be translated to identify the individual
• the key is kept separate from the dataset
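The zip code, date, and age rules, together with the coding rule, can be sketched as follows. This is a partial illustration only: it handles three fields and does not cover the full list of identifiers. The record layout, the `populous_zip3` set (three-digit zip areas with more than 20,000 people), and both function names are assumptions.

```python
import secrets


def deidentify(record: dict, populous_zip3: set) -> dict:
    """Reduce zip code, dates, and age as described above (a partial sketch)."""
    out = {}
    zip3 = record["zip"][:3]
    # Keep the 3-digit zip only where more than 20,000 people live.
    out["zip3"] = zip3 if zip3 in populous_zip3 else None
    # Keep only the year of each date (ISO "YYYY-MM-DD" strings assumed).
    out["admission_year"] = record["admission_date"][:4]
    # Aggregate all ages 90 and over into one category.
    out["age"] = "90+" if record["age"] >= 90 else record["age"]
    return out


def assign_codes(subject_ids):
    """Return (key, codes): random codes not derived from subject information."""
    key = {}
    for sid in subject_ids:
        code = secrets.token_hex(8)
        while code in key:            # regenerate on the unlikely collision
            code = secrets.token_hex(8)
        key[code] = sid
    codes = {sid: code for code, sid in key.items()}
    return key, codes
```

Here `key` maps each code back to the subject and would be stored apart from the dataset, while `codes` is used to label the de-identified rows. Random codes are used because a code computed from the medical record number or name would be derived from information about the individual.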
Limited Data Sets
Limited data sets may contain
• 5-digit zip code
• dates of service, birth and death
• state, county, city, precinct (not street address)
Requires a data use agreement when the data is shared (i.e. cannot be freely distributed)
Still ePHI, so subject to privacy rules such as minimum necessary standards and to security rules
Store datasets in a secure location
• Don't store ePHI datasets on your workstation or laptop (C: drive)
• After temporary storage on your workstation, use Erase to securely erase the data
• Store data on the H: drive where only you can access it
• If shared access is needed, ask Computer Access to set up a shared folder
• De-identify datasets before analysis whenever possible; if coded, keep the key in a secure location
• Store data in a format that will last: text, pdfs, ODM, etc.
Transfer your data in a secure fashion
• Secure e-mail (if the file is small enough)
• ZIP the file with a password and burn it to a CD (or send as an email attachment)
• Mail is considered secure
• FAX is considered secure as long as the data is not electronic at any time