
Managing Research Data - Healing, Teaching & · PDF fileManaging Research Data ... Spreadsheets: MS Excel ... Little to no control over data integrity unless you program formsAuthors:


Managing Research Data

Judith R. Logan, MD, MS Department of Medical Informatics & Clinical Epidemiology

Manual abstraction from EHRs Clinical exams Electronic abstraction from EHRs

Lab or diagnostic tests Monitoring devices Patient-reported

Administrative data Registries National surveys and public datasets


Caveat #1

• Data is never error-free

"...data strong enough to support conclusions and interpretations as equivalent to those derived from error-free data." (IOM)

Data Quality is Multidimensional

• Reliability
• Validity
• Timeliness
• Attribution
• Legibility

Are you asking the right questions in the right way?

Concerns of regulatory agencies (FDA)


Data Quality is Multidimensional

• Correctness
• Completeness
• Consistency
• Non-ambiguity
• Granularity
• Precision

[Figure: two data-flow diagrams. (1) Primary source → extract → CRF → enter → Electronic Data → refine → Analytical Dataset. (2) Primary paper source (abstract) and primary electronic source → Electronic Data → refine → Analytical Dataset.]


What types of errors are there?

• Paper to analytical dataset by humans
• eSource to analytical dataset by humans
• eSource to analytical dataset by system

Random Systematic

What types of errors are there?

• Paper to analytical dataset by humans
• eSource to analytical dataset by humans

“Copy” error example: 24.5, 24.5, 25.4 (digits transposed in the last copy)

What types of errors are there?

• Paper to analytical dataset by humans
• eSource to analytical dataset by humans

“Calculation” error example: 102.6 °F recorded as 38.2 °C. With C = (5/9)(F - 32), the correct value is 39.2 °C.
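A calculation error like the one in this example can be caught by recomputing the derived value from the raw one. The following is a minimal sketch; the helper names and the 0.1 °C tolerance are assumptions made for illustration, not part of the original slides.

```python
def fahrenheit_to_celsius(temp_f: float) -> float:
    """Convert Fahrenheit to Celsius using C = (5/9)(F - 32)."""
    return (5.0 / 9.0) * (temp_f - 32.0)


def conversion_matches(temp_f: float, recorded_c: float, tolerance: float = 0.1) -> bool:
    """Return True when the recorded Celsius value agrees with the recomputed one.

    The tolerance is an arbitrary choice for this sketch.
    """
    return abs(fahrenheit_to_celsius(temp_f) - recorded_c) <= tolerance


# The slide's example: 102.6 °F was recorded as 38.2 °C.
print(round(fahrenheit_to_celsius(102.6), 1))  # 39.2
print(conversion_matches(102.6, 38.2))         # False: flag for review
```

Storing the raw Fahrenheit value alongside the converted one is what makes this check possible.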


What types of errors are there?

• Paper to analytical dataset by humans
• eSource to analytical dataset by humans

“Coding” error example: the source shows "Problem list: diabetes, HgbA1C = 8.2, Cr = 2.0"; the coded answer is "Yes"

Caveat #2

• Keep humans out of the process
• But if you can't ...

Double data entry

• 1st copy from source to analytical dataset:
  How long has the subject had diabetes? __6__ years
• 2nd copy from source to analytical dataset:
  How long has the subject had diabetes? __5__ years
• Comparison (manual or automated)
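The automated comparison step of double data entry can be sketched as a field-by-field diff of the two entries. This is an illustration only; the function name and the field names are hypothetical.

```python
def compare_entries(first: dict, second: dict) -> list:
    """Return (field, first_value, second_value) for every field that differs."""
    discrepancies = []
    for field in sorted(set(first) | set(second)):
        a, b = first.get(field), second.get(field)
        if a != b:
            discrepancies.append((field, a, b))
    return discrepancies


# Two independent entries of the same form; field names are hypothetical.
entry_1 = {"subject_id": "S001", "diabetes_years": 6}
entry_2 = {"subject_id": "S001", "diabetes_years": 5}

print(compare_entries(entry_1, entry_2))
# [('diabetes_years', 6, 5)]
```

Each reported discrepancy is then resolved by going back to the source, not by picking one of the two entries.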


Scannable paper forms

REDCap



Caveat #3

• If you have to collect new data, do it right
  - Correctness
  - Completeness
  - Consistency
  - Non-ambiguity
  - Granularity
  - Precision

What data do you collect …

• Identify the data to collect
• Collect the right data to maximize quality and decrease inefficiencies
  - Only data that is relevant to the purpose for which it is collected
  - Extraneous data may adversely affect data quality by distracting the attention of the study personnel from the critical variables
• Consider fields that are needed for statistical analysis
• Collect a minimum of identifiable data

… and how do you collect it?

• Keep the questions, prompts and instructions clear and concise
• Include prompts and instructions for form completion
• Include definitions for items that are not directly measurable
  - “Does the subject have hypertension?” should be accompanied by specific ranges, lengths of time sustained or necessity of specific interventions


… and how do you collect it?

• Collect data in a structured form
  - Yes/No options, multiple-choice pick lists, check boxes, menus, etc.
  - If answers are in text, may code to a set of appropriate options
• CREATE A KEY
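Coding free-text answers to a fixed set of options, with a key that documents the codes, might look like the sketch below. The variable (smoking status), the codes, and the text mappings are all hypothetical and chosen only for illustration.

```python
# Hypothetical key: coded value -> meaning. Keep it with the dataset documentation.
SMOKING_KEY = {0: "never", 1: "former", 2: "current", 9: "unknown"}

# Hypothetical mapping from free-text answers to codes.
TEXT_TO_CODE = {
    "never": 0, "no": 0,
    "quit": 1, "former": 1, "ex-smoker": 1,
    "yes": 2, "current": 2,
}


def code_answer(text: str) -> int:
    """Map a free-text answer to a code; anything unrecognized becomes 9 (unknown)."""
    return TEXT_TO_CODE.get(text.strip().lower(), 9)


codes = [code_answer(t) for t in ["Never", " quit ", "sometimes"]]
print(codes)                            # [0, 1, 9]
print([SMOKING_KEY[c] for c in codes])  # ['never', 'former', 'unknown']
```

Unrecognized answers are coded as unknown rather than guessed, so they can be reviewed later.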

… and how do you collect it?

• CREATE A KEY
• Collect raw data rather than calculations
  - 5 BPs at 2-minute intervals rather than an average BP
• Consider the workflow
  - If study personnel are extracting from a medical record, a tabular form may work
  - If subjects are completing forms, group items logically and be cautious about referential questions (“skip logic”)
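The advantage of collecting raw data is that any summary can be computed later, while the reverse is not possible. A small sketch with made-up readings:

```python
from statistics import mean

# Five hypothetical systolic readings (mmHg) taken at 2-minute intervals.
systolic_readings = [142, 138, 136, 134, 135]

average_all = mean(systolic_readings)
print(average_all)  # 137

# With raw values stored, other summaries remain possible later,
# for example the average after dropping the first reading.
average_without_first = mean(systolic_readings[1:])
print(average_without_first)  # 135.75
```

Had only the average of 137 been recorded, the second summary could not be recovered.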


… and how do you collect it?

• Ask: Can every question always be answered without ambiguity?
• Are the choices exhaustive and exclusive?

  I haven’t made an appointment yet because (choose one):
  - I don’t know who to call
  - I didn’t feel like it
  - I’m too busy
  - I don’t know why

Caveat #4

• Electronic Health Record data is
  - notoriously dirty
  - incomplete
  - locked in text

Caveat #5

• Administrative data is not collected for research
• Powerful
  - large samples, diverse demographics
  - longitudinal
  - inexpensive


Use of administrative data

• Administrative data
  - coding systems designed for billing
    • ICD-9-CM (ICD-10-CM): diagnoses and procedures
    • CPT: procedures
  - limit on the number of code slots
  - inability to distinguish comorbidities from complications (before 2007)
  - coding bias

Use of administrative data

• Administrative data
  - lacks sensitivity or specificity for identifying some conditions
  - omissions in documentation common
• Example: identification of diabetes
  - 2 or more outpatient visits OR
  - 1 inpatient ICD-9 for diabetes OR
  - filled prescription for diabetes med (excluding one medication)
  - continuous enrollment for 11 months
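A rule like the diabetes example can be written as an explicit, testable function. The sketch below rests on two assumptions not spelled out in the slide: the outpatient visits carry a diabetes code, and continuous enrollment is required in addition to any one of the three criteria. All field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class MemberSummary:
    """Per-subject counts derived from claims; all field names are hypothetical."""
    outpatient_diabetes_visits: int
    inpatient_diabetes_stays: int
    diabetes_rx_fills: int   # fills of qualifying medications only
    months_enrolled: int     # months of continuous enrollment


def meets_diabetes_definition(m: MemberSummary) -> bool:
    has_evidence = (
        m.outpatient_diabetes_visits >= 2
        or m.inpatient_diabetes_stays >= 1
        or m.diabetes_rx_fills >= 1
    )
    # Assumption: enrollment is required in addition to any one criterion.
    return has_evidence and m.months_enrolled >= 11


print(meets_diabetes_definition(MemberSummary(2, 0, 0, 12)))  # True
print(meets_diabetes_definition(MemberSummary(1, 0, 0, 12)))  # False
print(meets_diabetes_definition(MemberSummary(0, 1, 0, 6)))   # False
```

Writing the definition down this way also makes it easy to report exactly which rule was used when the results are published.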

If you are using external datasets

• The investigator must have adequate knowledge of the database structure and contents
  - how complete and accurate is the data?
• Chart validation, when possible, can confirm the codes used


Caveat #6

• Once you have your data, handle it with care

Spreadsheets: MS Excel

• Ubiquitous
• Easy and quick to use
• Little to no control over data integrity unless you program forms
• Great for storing but not collecting data

Databases

• Essential skill for researchers
• Can constrain the data for improved data integrity through
  - constraints on data fields
  - controlled access through application interfaces
• Flexibility for growth and change
• Can query/transform the data
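Constraints on data fields can be illustrated with SQLite, which ships with Python. This is a sketch under assumptions: the table, the column names, and the plausibility range for blood pressure are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE visit (
        subject_id   TEXT    NOT NULL,
        visit_number INTEGER NOT NULL CHECK (visit_number >= 1),
        systolic_bp  INTEGER CHECK (systolic_bp BETWEEN 50 AND 300),
        hypertension TEXT    CHECK (hypertension IN ('Y', 'N')),
        PRIMARY KEY (subject_id, visit_number)
    )
    """
)

conn.execute("INSERT INTO visit VALUES ('S001', 1, 138, 'Y')")  # accepted

rejected = False
try:
    # An implausible blood pressure is rejected by the CHECK constraint.
    conn.execute("INSERT INTO visit VALUES ('S002', 1, 1380, 'N')")
except sqlite3.IntegrityError as err:
    rejected = True
    print("rejected:", err)

row_count = conn.execute("SELECT COUNT(*) FROM visit").fetchone()[0]
print(row_count)  # 1
```

The primary key also prevents the same subject and visit from being entered twice, something a plain spreadsheet does not enforce.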


Check for data errors

• Missing CRF pages
• Missing values
  - not done vs. not recorded
• Data entered more than once
• Data management plan
• Outliers
• Inconsistent data
• Data dependencies
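Several of these checks (missing values, data entered more than once, outliers) are easy to automate. A minimal sketch follows; the records, field names, and plausibility limits are hypothetical.

```python
from collections import Counter

# Hypothetical records; None marks a missing value.
records = [
    {"subject_id": "S001", "visit": 1, "weight_kg": 72.5},
    {"subject_id": "S002", "visit": 1, "weight_kg": None},
    {"subject_id": "S003", "visit": 1, "weight_kg": 725.0},
    {"subject_id": "S001", "visit": 1, "weight_kg": 72.5},
]


def find_missing(rows, field):
    """Subjects whose value for the field is missing."""
    return [r["subject_id"] for r in rows if r[field] is None]


def find_duplicates(rows, key_fields):
    """Key combinations that appear more than once."""
    counts = Counter(tuple(r[k] for k in key_fields) for r in rows)
    return [key for key, n in counts.items() if n > 1]


def find_out_of_range(rows, field, low, high):
    """Subjects whose non-missing value falls outside [low, high]."""
    return [r["subject_id"] for r in rows
            if r[field] is not None and not (low <= r[field] <= high)]


print(find_missing(records, "weight_kg"))                 # ['S002']
print(find_duplicates(records, ["subject_id", "visit"]))  # [('S001', 1)]
print(find_out_of_range(records, "weight_kg", 30, 300))   # ['S003']
```

A flagged value is a prompt to check the source, not proof of an error.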

Correct errors

• Ideally, document the correction of errors
• Keep a copy of the original data
• Keep a dated copy each time corrections are made?
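One way to document corrections while keeping the original data intact is to apply each correction to a copy and record it in a log. The sketch below is illustrative; the function, the log fields, and the example date are assumptions.

```python
import copy
from datetime import date


def apply_correction(dataset, log, subject_id, field, new_value, reason, when=None):
    """Return a corrected copy of the dataset and append an entry to the log."""
    corrected = copy.deepcopy(dataset)  # the input dataset is left untouched
    old_value = corrected[subject_id][field]
    corrected[subject_id][field] = new_value
    log.append({
        "date": (when or date.today()).isoformat(),
        "subject_id": subject_id, "field": field,
        "old": old_value, "new": new_value, "reason": reason,
    })
    return corrected


original = {"S001": {"temp_c": 38.2}}
log = []
corrected = apply_correction(original, log, "S001", "temp_c", 39.2,
                             "recalculated from 102.6 F", when=date(2020, 1, 15))

print(original["S001"]["temp_c"], corrected["S001"]["temp_c"])  # 38.2 39.2
print(log[0]["date"])  # 2020-01-15
```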


Format your data

• Data transformations
  - change units
  - categorize continuous variables
  - add default values for nulls
• One row per subject per encounter
  - spreadsheet
  - CSV file (text)
  - SAS dataset
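The three transformations and the one-row-per-subject-per-encounter layout can be sketched with the standard library. The input rows, the age cut-points, and the default value "unknown" are assumptions for illustration.

```python
import csv
import io

# Hypothetical raw rows: one per subject per encounter.
raw_rows = [
    {"subject_id": "S001", "encounter": 1, "age": 52, "weight_lb": 160.0, "smoker": "Y"},
    {"subject_id": "S001", "encounter": 2, "age": 52, "weight_lb": 220.0, "smoker": None},
]


def transform(row):
    age = row["age"]
    return {
        "subject_id": row["subject_id"],
        "encounter": row["encounter"],
        # Change units: pounds to kilograms.
        "weight_kg": round(row["weight_lb"] * 0.45359237, 1),
        # Categorize a continuous variable (cut-points are arbitrary here).
        "age_group": "<40" if age < 40 else "40-64" if age < 65 else "65+",
        # Add a default value for nulls.
        "smoker": row["smoker"] if row["smoker"] is not None else "unknown",
    }


rows = [transform(r) for r in raw_rows]

# Write the result as CSV text, one row per subject per encounter.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(rows[0]), lineterminator="\n")
writer.writeheader()
writer.writerows(rows)
print(buffer.getvalue())
```

The raw values stay in `raw_rows`; the transformed rows are a derived copy for analysis.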

Finishing your study

• “Lock” your data
  - Data is saved and never changed
• Copies of the locked database can be used for analysis
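For a file-based dataset, one possible way to approximate a lock is to make the file read-only and record a checksum, so that any copy used for analysis can be verified against it. This is an assumption about how locking might be implemented, not a description of any particular data management system; the function names and file name are hypothetical.

```python
import hashlib
import os
import stat
import tempfile


def lock_file(path):
    """Make the file read-only and return its SHA-256 digest for later verification."""
    with open(path, "rb") as handle:
        digest = hashlib.sha256(handle.read()).hexdigest()
    os.chmod(path, stat.S_IRUSR | stat.S_IRGRP)
    return digest


def is_unchanged(path, digest):
    """Return True when the file still matches the digest recorded at lock time."""
    with open(path, "rb") as handle:
        return hashlib.sha256(handle.read()).hexdigest() == digest


with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "locked.csv")
    with open(path, "w") as handle:
        handle.write("subject_id,age\nS001,52\n")
    digest = lock_file(path)
    unchanged = is_unchanged(path, digest)

print(unchanged)  # True
```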

Caveat #7

• Maintain data privacy and security


De-identifying data for research

• Name
• Telephone, FAX numbers, email addresses
• Social security numbers, medical record numbers, health plan numbers, account numbers
• Certificate/license numbers, VIN, license plate numbers
• URLs, IP addresses
• Biometric identifiers (voiceprints, fingerprints)
• Full-face photographic images (or comparable)
• All geographic units smaller than a state
  - except: may use first three digits of zip code as long as >20,000 people live in this area
• All elements of date except year, including birth date, date of death, admission date, discharge date, other dates of service
  - May use age, except: all ages 90 and over must be aggregated into a single category
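The geographic, date, and age rules can be sketched as a function that copies only permitted elements into a new record. This is an illustration, not a complete de-identification procedure: the field names are hypothetical, and the set of low-population ZIP prefixes is a placeholder that would have to come from census data.

```python
from datetime import date

# Placeholder set of 3-digit ZIP prefixes whose area has 20,000 or fewer residents.
# These values are for illustration only; a real list must come from census data.
SMALL_ZIP3 = {"036", "059"}


def deidentify(record):
    """Build a new record holding only the permitted elements; name is not copied."""
    zip3 = record["zip"][:3]
    age = record["age"]
    is_90_plus = age >= 90
    return {
        # Low-population prefixes are replaced with 000.
        "zip3": "000" if zip3 in SMALL_ZIP3 else zip3,
        # A birth year would reveal the age of someone 90 or over, so it is dropped.
        "birth_year": None if is_90_plus else record["birth_date"].year,
        "admission_year": record["admission_date"].year,
        "age": "90+" if is_90_plus else age,
    }


patient = {
    "name": "Jane Doe", "zip": "97239", "age": 90,
    "birth_date": date(1930, 5, 1), "admission_date": date(2021, 3, 14),
}
print(deidentify(patient))
# {'zip3': '972', 'birth_year': None, 'admission_year': 2021, 'age': '90+'}
```

Building a new record from an allow-list of fields is safer than deleting fields from the original, because a newly added identifier is excluded by default.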

De-identifying data for research

• “Anything else” that uniquely identifies the individual
• However, you may code the data in such a way that the subject can be reidentified as long as
  - the code is not derived from or related to the information about the individual and cannot be translated to identify the individual
  - the key is kept separate from the dataset
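A code that is not derived from information about the individual can be generated randomly, with the key held apart from the dataset. A minimal sketch, assuming hypothetical medical record numbers and an 8-character code length:

```python
import secrets


def assign_codes(subject_ids):
    """Return a key mapping a random code to each original identifier."""
    key = {}
    for subject_id in subject_ids:
        code = secrets.token_hex(4)   # random, not derived from the identifier
        while code in key:            # extremely unlikely collision; draw again
            code = secrets.token_hex(4)
        key[code] = subject_id
    return key


mrns = ["MRN-1001", "MRN-1002"]       # hypothetical identifiers
key = assign_codes(mrns)              # store this separately and securely

# The analysis dataset carries only the code, never the original identifier.
code_for = {mrn: code for code, mrn in key.items()}
dataset = [{"subject_code": code_for["MRN-1001"], "hba1c": 8.2}]
print(dataset[0]["subject_code"] in key)  # True
```

A hash of the medical record number would not satisfy the first condition, because it is derived from the identifier.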



Limited Data Sets

• Limited data sets may contain
  - 5-digit zip code
  - dates of service, birth and death
  - state, county, city, precinct (not street address)
• Requires a data use agreement when the data is shared (i.e. cannot be freely distributed)
• Still ePHI, so subject to privacy rules such as minimum necessary standards and to security rules

Store datasets in a secure location

• Don't store ePHI datasets on your workstation or laptop (C: drive)
• After temporary storage on your workstation, use Erase to securely erase the data
• Store data on the H: drive where only you can access it
  - If shared access is needed, ask Computer Access to set up a shared folder
• De-identify datasets before analysis whenever possible; if coded, keep the key in a secure location
• Store data in a format that will last: text, PDFs, ODM, etc.

Transfer your data in a secure fashion

• Secure e-mail (if the file is small enough)
• ZIP the file with a password and burn it to a CD (or send as an email attachment)
• Mail is considered secure
• FAX is considered secure as long as the data is not electronic at any time