Data Management
Pawin Numthavaj M.D.
Section for Clinical Epidemiology and Biostatistics
Ramathibodi Hospital, Mahidol University
E-mail: [email protected]
Objectives of Data Management
•To minimize errors at all stages of data collection
•To prepare data of the highest possible quality in a suitable form for statistical analysis
Data Management Process
1. Design and create case report form (CRF)
2. Collect data by CRF
3. Design and create database
4. Specify data quality control
5. Enter data into database
6. Clean and check data
Design & Create CRF
Definition of CRF
•Case report/record form (CRF) is the document used to record the data on which the eventual analysis and reporting of the clinical trial data will be based
  • Paper-based
  • Electronic
•Design of the CRF must reflect
  • Data collection
  • Data extraction
Who will use CRF?
Role: Good CRF should be
Investigator
  • Clear, unambiguous, easy to follow, complete
  • Comprehensive instruction and guidance
  • Enable investigator to ascertain subject eligibility to continue in the trial at any point
Monitor
  • Review completed CRF against protocol
  • Minimize uncertainties and facilitate entry verification
Data manager
  • Design database
  • Source for data in database
  • Clear and unambiguous response, minimizing amount of free text
Ideal CRF should
•Request the precise information and only the information required by the protocol
•Be simple, quick, unambiguous, and straightforward
•Order questions in sequence
•Have been accepted by all members of study team
Principles of CRF design
1. Understand basic questions for current research
• What are the questions/objectives of research?
• What is the type of study design?
• What variables will be involved?
• How will variables be collected?
• How often will variables be collected?
•Example:
•For a retrospective cohort study of kidney transplantation in Thailand, researchers would like to study the association between type of donor and risk of graft rejection
What are the objectives of this research?
•To study the association between type of donor and risk of graft rejection
What is the type of study design?
•Retrospective cohort study
What variables will be involved?
•Type of donor
•Graft status
How will variables be collected?
•Type of donor was classified as
  • Cadaveric donor (CDKT)
  • Living-related donor (LRKT)
•Graft status was classified as
  • Graft rejection
  • Graft non-rejection
How often will variables be collected?
•Type of donor was collected during enrollment
•Graft status was collected every 6 months during the follow-up period
Principles of CRF design
2. Consider timing of data collection
•Decide how many different CRFs should be created to collect the data
•Decide which data should be collected on which form
Example
Data requirement                             Timing of data collection
Characteristics of recipients                Enrollment
Characteristics of donors                    Enrollment
Details of kidney transplantation            Enrollment
Graft status after kidney transplantation    Follow-up (FU) every 6 months
1. Enrollment form: ID number; Part I Recipient; Part II Donor; Part III Transplantation
2. Follow-up form: ID number; Date of visit
Principle for CRF design
3. Consider sources of data collection
•Decide how many different CRFs should be created to collect the data.
•Decide which data should be collected on which form.
Example
Data requirement                             Source of data collection
Characteristics of recipients                Recipient
Characteristics of donors                    Donor
Details of kidney transplantation            Operating room
Graft status after kidney transplantation    Outpatient clinic
1. Recipient form: ID number
2. Donor form: ID number
3. Transplantation form: ID number
4. Follow-up form: ID number; Date of visit
Recommendations
• It is not always best to minimize the number of forms by trying to fit as much as possible onto one page.
• It may be better to have more forms, each with a small amount of data.
Principle for CRF design
4. Specify identifying (ID) number
• An identifying number is a unique value for each case and should be present on every CRF
•HN (hospital number): beware of revealing the patient’s identity
• ID will link all data on different forms together
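The linking role of the ID can be sketched in a few lines of Python. This is only an illustration: the field names (`id`, `donor_type`, `visit_date`, `graft_status`) and the records are hypothetical, chosen to mirror the kidney transplantation example.

```python
# Hypothetical records from two CRFs; names and values are illustrative only.
enrollment = [
    {"id": 1, "donor_type": "CDKT"},
    {"id": 2, "donor_type": "LRKT"},
]
follow_up = [
    {"id": 1, "visit_date": "2010-01-15", "graft_status": "non-rejection"},
    {"id": 1, "visit_date": "2010-07-15", "graft_status": "rejection"},
    {"id": 2, "visit_date": "2010-02-01", "graft_status": "non-rejection"},
]


def link_by_id(baseline, visits):
    """Attach each follow-up visit to its enrollment record via the shared ID."""
    by_id = {row["id"]: row for row in baseline}
    linked = []
    for visit in visits:
        base = by_id.get(visit["id"])
        if base is None:
            # A visit without an enrollment form signals a missing or mistyped ID.
            raise ValueError(f"Follow-up ID {visit['id']} has no enrollment form")
        linked.append({**base, **visit})
    return linked


linked = link_by_id(enrollment, follow_up)
```

Each linked row carries both the donor type and the graft status, which is only possible because the same ID appears on every form.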
Identifying and ensuring the integrity
•Each page of the CRF should have
  • Patient identification (Subject No, CRF No, Subject initials)
  • Identification of trial (e.g., code name or number)
  • Number or code identifying the center in which the subject was recruited
  • Visit number (if applicable)
  • Name of sponsor
  • Page number (page n of nn)
Principle for CRF design
5. Structure sequence of questions
•Related questions should be together
1. ID _ _ _
2. Sex
1) Male 2) Female 9) Missing
3. Height _ _ _._ _ cm
4. Types of treatment
1) RT 2) Chemo 9) Missing
5. Date of treatment _ _/_ _/_ _ _ _
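√ Good: related questions appear together (sex with height; type of treatment with date of treatment)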
Principle for CRF design
5. Structure sequence of questions
•Related questions should be together
1. ID _ _ _
2. Sex
1) Male 2) Female 9) Missing
3. Types of treatment
1) RT 2) Chemo 9) Missing
4. Height _ _ _._ _ cm
5. Date of treatment _ _/_ _/_ _ _ _
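X Poor: the treatment question separates sex from height, and height separates type of treatment from date of treatment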
Question formats
•Questions should be written in a simple way
•Avoid double-negative questions
  • Is the patient unable to swallow tablets?
  • Does the patient have difficulty swallowing tablets?
•Use coded tick boxes instead of writing if possible
  • 0 = No, 1 = Yes
  • Use the same coding for the rest of the CRF
•Yes/No questions should appear in one column to prevent ticking the wrong box
•State clearly if more than one box can be checked
Layout
•Easy to read and understand
•Orderly and logical fashion
•Look “good” and “attractive” to encourage careful and accurate completion
Multiple assessments
•Should be in the same format and sequence for each visit
•Assist investigator to develop a ‘visit routine’
•Assist database building and data entry
Investigator comments
•Discourage note-writing on CRF
•A separate “comment page” can be provided
Fonts and layout
•Serif fonts (Times New Roman)
•Text size around 10-12 point
  • 10 point for minor instructions, e.g. (dd/mm/yy)
•Rotate text if needed
Text entries
•Block capitals are easier than script
•Appropriate space to write
•Using a particular style (e.g., bold) for the same answer throughout (e.g., Yes) can be useful
  • Inclusion/exclusion questions
    • All “yes” for inclusion criteria
    • All “no” for exclusion criteria
Sections that are completed by subject
•Text should be at least 10 point size
•No medical jargon
•Examples of entry should be given (ex. How to write time format)
•Attractive, easy-to-use
•Minimized text entry
Principle for CRF design
6. Collecting continuous data
•The correct number of boxes for the answer should be provided.
•Any required decimal points, commas, or other punctuation should be preprinted.
1. ID _ _ _
2. Weight _ _ _ . _ _
3. Height _ _ _ . _ _
4. SBP _ _ _
5. DBP _ _ _
Example. Format for collecting continuous data
•The units to be used in recording the data should be specified.
1. ID _ _ _
2. Weight _ _ _ . _ _ kg
3. Height _ _ _ . _ _ cm
4. SBP _ _ _ mmHg
5. DBP _ _ _ mmHg
Include units of measurement on the form
• Investigator should be in no doubt about units of measurement (ex: cm. or m. or ft. or in.)
Principle for CRF design
•Avoid grouping of continuous data at data collection time.
X  3a. Age at enrollment
       1) 15-24
       2) 25-35
       3) 36-45
       4) > 45
√  3b. Age at enrollment _ _ years
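Collecting the exact age keeps every grouping option open at analysis time. The sketch below, in Python, shows one way the categories from option 3a could be derived afterwards; the function name and the handling of ages under 15 are assumptions made for illustration.

```python
def age_group(age_years):
    """Derive an analysis category from an exact age collected on the CRF."""
    if age_years < 15:
        return "<15"  # assumption: the original categories start at 15
    if age_years <= 24:
        return "15-24"
    if age_years <= 35:
        return "25-35"
    if age_years <= 45:
        return "36-45"
    return ">45"


# Hypothetical ages as entered from question 3b.
ages = [22, 37, 47, 24, 29]
groups = [age_group(a) for a in ages]
```

If different cut-points are needed later, only this function changes; data collected already grouped could not be regrouped.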
Principle for CRF design
•Do not make any calculations before data entry. Why?
•Manual calculation may cause many errors and consumes more time.
•Derived values can be calculated later in a statistical program.
Weight (kg) _ _ . _ _
Height (cm) _ _ _
BMI (kg/m2) _ _ . _ _
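A derived value such as BMI can be computed after data entry from the recorded weight and height. The following is a minimal Python sketch; the function name and the rounding to two decimals are assumptions, and any statistical package could do the same.

```python
def bmi(weight_kg, height_cm):
    """Compute body mass index (kg/m^2) from weight in kg and height in cm."""
    if height_cm <= 0:
        raise ValueError("height must be positive")
    height_m = height_cm / 100
    return round(weight_kg / height_m ** 2, 2)


# Hypothetical entered values: only weight and height come from the CRF.
records = [{"id": 1, "weight": 56, "height": 167}]
for rec in records:
    rec["bmi"] = bmi(rec["weight"], rec["height"])
```

The CRF then needs only the weight and height fields, and the BMI field can be left off the form.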
Principle for CRF design
7. Collecting categorical data
•All possible categories of categorical variables should be displayed on the form.
Please circle the right answer
What is your sex?
Male
Female
Principle for CRF design
7. Collecting categorical data
•Numerical codes should be assigned for all possible categories.
What is your sex?
Male………………………1
Female…………………..2
Principle for CRF design
•Coding conventions should be consistent for all data items.
•For example, 1=yes, 2=no for all yes-no possible answers.
Underlying disease
• DM 1. yes 2. no
• HT 1. yes 2. no
• Stroke 1. yes 2. no
• CVD 1. yes 2. no
Example
Principle for CRF design
8. Code for missing data
• It is bad practice to leave a data collection field blank on the CRF because it can lead to confusion at data entry time.
•Special codes should be assigned for missing values at data collection time.
Principle for CRF design
8. Code for missing data
•The missing data codes should not be possible valid values.
• It is common practice to use 9, 99, 999 and so on to denote missing data.
Age _ _ _ year (missing=999)
Height _ _ _ . _ _ cm (missing=999.99)
Sex
1. male 2. female 9. missing
Stage of cancer
1. I 2. II 3. III 9. missing
Example
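Missing-value codes must be converted to true missing values before analysis, otherwise 999 would be treated as a real age. A minimal Python sketch, assuming the codes shown in the example above and hypothetical field names:

```python
# Missing codes per field, taken from the example form; field names are hypothetical.
MISSING_CODES = {"age": 999, "height": 999.99, "sex": 9, "stage": 9}


def recode_missing(record):
    """Replace each field's missing code with None, leaving valid values untouched."""
    cleaned = {}
    for field, value in record.items():
        if field in MISSING_CODES and value == MISSING_CODES[field]:
            cleaned[field] = None
        else:
            cleaned[field] = value
    return cleaned


entered = {"id": 7, "age": 999, "height": 160.5, "sex": 2, "stage": 9}
cleaned = recode_missing(entered)
```

Codes are matched per field, so a value of 9 in a field where 9 is valid would not be recoded by mistake.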
Principle for CRF design
9. Collecting date
• It is important to clearly identify the date format to be used, for example,
  • Day, Month, Year (dd/mm/yyyy)
  • Month, Day, Year (mm/dd/yyyy)
Principle for CRF design
9. Collecting date
• It is important to clearly identify the year format to be used, for example,
  • Western (dd/mm/20yy)
  • Buddhist (dd/mm/25yy)
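When dates are recorded in Buddhist Era years, they usually need converting before analysis. The Python sketch below assumes the Thai Buddhist Era convention, in which the year is 543 ahead of the Western year; the function name is hypothetical.

```python
from datetime import date

BE_OFFSET = 543  # assumption: Thai Buddhist Era year = Western year + 543


def be_to_western(day, month, be_year):
    """Convert a dd/mm/yyyy date written with a Buddhist Era year to a Western date."""
    return date(be_year - BE_OFFSET, month, day)


birth = be_to_western(12, 12, 2516)
```

Building a `date` object also rejects impossible dates, such as 31/02, at conversion time.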
Example of weak CRF design
1. Have you ever been diagnosed with DM?
1. Yes 2. No 9. Missing
For female: if yes, answer the following questions
2. Did you have DM before pregnancy?
1. Yes 2. No 9. Missing
3. Did you have DM during pregnancy?
1. Yes 2. No 9. Missing
4. Have you ever taken drug for DM?
1. Yes 2. No 9. Missing
Example of strong CRF design
1. Have you ever been diagnosed with DM?
1. Yes 2. No 9. Missing
If yes, answer question number 2.
2. Have you ever taken drug for DM?
1. Yes 2. No 9. Missing
If you are female and have been pregnant, answer questions number 3 and 4; otherwise go to question number 5.
3. Did you have DM before pregnancy?
1. Yes 2. No 9. Missing
4. Did you have DM during pregnancy?
1. Yes 2. No 9. Missing
Example of weak CRF design
Have you ever taken medications for osteoporosis?
Calcium □ Start date _ _/_ _/_ _ _ _
Vitamin D □ Start date _ _/_ _/_ _ _ _
Calcitonin □ Start date _ _/_ _/_ _ _ _
Hormone □ Start date _ _/_ _/_ _ _ _
Example of strong CRF design
Have you ever taken medications for osteoporosis?
Calcium 1. Yes 2. No 9. Missing
If yes, specify start date _ _/_ _/25 _ _
Vitamin D 1. Yes 2. No 9. Missing
If yes, specify start date _ _/_ _/25 _ _
Calcitonin 1. Yes 2. No 9. Missing
If yes, specify start date _ _/_ _/25 _ _
Hormone 1. Yes 2. No 9. Missing
If yes, specify start date _ _/_ _/25 _ _
Recommendations
•The quality of the data recorded decreases when the amount of data required increases.
• It is important to take time over the design and development of the forms because the design of CRF has a direct impact on the quality of data.
Recommendation
•Collecting data without the CRFs is likely to result in incomplete and invalid data.
Database Design & Testing
Pawin Numthavaj M.D.
Section for Clinical Epidemiology and Biostatistics
Faculty of Medicine Ramathibodi Hospital
Definition
•A database consists of an organized collection of data for one or more purposes, typically in digital form.
Database File
[Figure: a database file drawn as a stack of case records (Id: 1 to Id: 5); each case holds the variables Date of birth, Age, Sex, Weight, and Height.]
Data set for database file
Id  Date of birth  Age  Sex  Weight  Height
1   12/12/1973     37   M    56      167
2   10/11/1988     22   M    78      178
3   03/08/1963     47   F    45      158
4   14/09/1986     24   M    67      169
5   23/10/1981     29   F    41      155
Database Management System (DBMS)
•The DBMS is a set of computer programs which perform a wide range of operations:
  • creating new files
  • entering new records
  • sorting, searching, and editing
  • and so on.
DBMS software package
•There are many different DBMS software packages:
  • Microsoft Access
  • dBase
  • Paradox
  • EpiData
  • And so on
Reasons for using EpiData
•Specially written for use in research studies.
•Easy to use
•Free
•Small program
•Can export data in Stata / SPSS format
Where to get EpiData
•http://www.Epidata.dk/download.php#ee
Overview of EpiData
•The EpiData screen has a standard Windows layout with one menu line and two toolbars: the work process toolbar and the editor toolbar.
Work process toolbar
1. Define Data
2. Make Data File
3. Checks
4. Enter Data
5. Document
6. Export Data
Process of creating database file with EpiData
Step                 File created
Define data          QuEStionnaire file (.qes)
Make data file       RECord file (.rec)
Add/revise checks    CHecK file (.chk)
Figure 1. Flowchart for creating a database file in EpiData: Define data (.QES file) → Make data file (.REC file) → Add checks (.CHK file)
1. Define data: QES files
[Figure callouts: Variable Name, Variable Label, Variable types]
Variable names
•Must not exceed 8 characters.
•Must not contain space/punctuation
•Has to begin with a letter, not a number.
•Can contain any sequence of letters and digits.
•Can be upper or lower case.
Examples of illegal variable names
Variable name      Why it is illegal
1date              Begins with a number
Last name          Contains a space
countryoforigin    Longer than 8 letters
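The naming rules above can be checked mechanically before a questionnaire file is written. The Python sketch below encodes only the rules listed on the slide; it is not EpiData's own validation, and the function name is hypothetical.

```python
import re

# One letter followed by up to 7 letters or digits: at most 8 characters in total,
# no spaces or punctuation, and never starting with a digit.
NAME_PATTERN = re.compile(r"[A-Za-z][A-Za-z0-9]{0,7}")


def is_legal_name(name):
    """Return True if the name satisfies the variable naming rules listed above."""
    return NAME_PATTERN.fullmatch(name) is not None


candidates = ["1date", "Last name", "countryoforigin", "dateb"]
results = {name: is_legal_name(name) for name in candidates}
```

Of the four candidates, only `dateb` passes; the other three fail for the reasons given in the table.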
Variable labels
• “Notes” for variable name
•Make data easier for others to understand
•For example,
• Variable Name: dateb
• Variable Label: date of birth
Variable types
•The variable type indicates the characteristic of the variable, such as
- Text
- Numeric
- Date
etc.
Variable types: Text
•Text variables are used for holding data consisting of letters and/or numbers
•You can enter numbers into text variables but you cannot perform any calculation with them
Variable types: Numeric
•Numerical information
•Can be used for continuous/categorical data
•Can be used for integer/real number
Variable types: Date
•Date variables are used for holding dates.
•You can perform simple arithmetic, such as addition, or subtraction of one date variable from another date variable.
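Date arithmetic of this kind can be illustrated outside EpiData with Python's standard `datetime` module. The dates and the helper name `age_in_years` below are hypothetical.

```python
from datetime import date


def age_in_years(birth, on):
    """Completed years between a date of birth and a reference date."""
    years = on.year - birth.year
    # Subtract one year if the birthday has not yet occurred in the reference year.
    if (on.month, on.day) < (birth.month, birth.day):
        years -= 1
    return years


# Subtracting one date from another gives the interval between two visits.
days_between_visits = (date(2010, 7, 15) - date(2010, 1, 15)).days
age = age_in_years(date(1973, 12, 12), date(2011, 1, 1))
```

Storing dates as a date type rather than text is what makes such subtraction possible.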
Variable types: Date
•The advantage of using date type variables is that EpiData will only allow you to enter valid dates.
•EpiData also has a special type of date variable which is updated each time a record is changed.
Examples of variable types
Variable                                Type
ID                                      Numeric
Date of birth                           Date
Age at enrollment                       Numeric
Sex                                     Text
Do you have any underlying diseases?    Numeric
Specify medications                     Numeric
Variable length
•The length of a variable defines how much data it can hold.
•A text variable with length 10 will be able to hold up to ten letters or numbers.
Variable length
•A numerical variable with length 3 will be able to hold numbers between -99 and 999.
•The length of a variable must correspond to the maximum anticipated number of letters and/or numbers.
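The range quoted for a length-3 numeric variable follows from the minus sign occupying one character. A small Python sketch of that reasoning, for integer variables only (the function name is hypothetical):

```python
def integer_range(length):
    """Smallest and largest integers that fit in `length` characters.

    A negative number spends one character on the minus sign, so it has one
    digit fewer than the largest positive number.
    """
    if length < 1:
        raise ValueError("length must be at least 1")
    largest = 10 ** length - 1
    smallest = -(10 ** (length - 1) - 1) if length > 1 else 0
    return smallest, largest


low, high = integer_range(3)
```

For length 3 this gives -99 to 999, so a variable expected to reach 1000 or more needs length 4.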
Specify variable type and length
Type            EpiData definition
Text            _ _ _ _ _ _ _ _
Numeric         ### or ###.##
Date            <dd/mm/yyyy>, <mm/dd/yyyy>
Today’s date    <today-dmy>, <today-mdy>
2. Make data file
•The second step is to create a database file based upon the database structure.
•The Make Data File function is used to create a record (.REC) file from a questionnaire (.QES) file.
Summary
•At the end of this step, you can enter the data set into the database file.
Interactive checking
• Interactive checking is checking for errors during data entry
• Interactive checking is useful in picking up typing errors
•This step can be done by using EpiData check functions
Data validation
• This involves the data being entered twice into different files by different persons.
• The resulting files are then compared to each other to see if they are the same.
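The comparison step of double data entry can be sketched in Python as follows. This is an illustration of the idea, not a specific EpiData feature; the record layout and the function name are assumptions.

```python
def compare_entries(first, second, key="id"):
    """List discrepancies between two independent entries of the same data.

    Each discrepancy is (id, field, value in first entry, value in second entry).
    """
    second_by_key = {rec[key]: rec for rec in second}
    first_keys = {rec[key] for rec in first}
    discrepancies = []
    for rec in first:
        other = second_by_key.get(rec[key])
        if other is None:
            discrepancies.append((rec[key], None, "present", "absent"))
            continue
        for field in sorted(set(rec) | set(other)):
            if rec.get(field) != other.get(field):
                discrepancies.append((rec[key], field, rec.get(field), other.get(field)))
    for rec in second:
        if rec[key] not in first_keys:
            discrepancies.append((rec[key], None, "absent", "present"))
    return discrepancies


# Hypothetical entries typed by two different persons.
entry_a = [{"id": 1, "weight": 56, "height": 167}, {"id": 2, "weight": 78, "height": 178}]
entry_b = [{"id": 1, "weight": 56, "height": 167}, {"id": 2, "weight": 87, "height": 178}]
diffs = compare_entries(entry_a, entry_b)
```

Every reported discrepancy is then resolved by going back to the paper CRF, since the comparison alone cannot tell which entry is correct.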
Interactive checking functions
•EpiData provides functions for interactive data checking, such as:
- Must enter variables
- Range and legal values
- Attach value labels to variables
- Repeated variables
- Conditional jumps
- Programmed checks
Basic checks
Advanced checks
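The logic behind must-enter, range, and legal-value checks can be sketched in plain Python. This is not EpiData .CHK syntax; the field names, the legal codes, and the blood pressure limits are hypothetical.

```python
# Hypothetical check definitions for two fields.
CHECKS = {
    "sex": {"must_enter": True, "legal": {1, 2, 9}},
    "sbp": {"must_enter": False, "range": (50, 300)},
}


def check_value(field, value):
    """Return a list of problems found for one entered value (empty if none)."""
    rule = CHECKS[field]
    if value is None:
        return ["must enter"] if rule.get("must_enter") else []
    problems = []
    if "legal" in rule and value not in rule["legal"]:
        problems.append("illegal value")
    if "range" in rule:
        low, high = rule["range"]
        if not low <= value <= high:
            problems.append("out of range")
    return problems


problems_sex = check_value("sex", 3)
problems_sbp = check_value("sbp", 120)
```

Running such checks at the moment of entry lets the operator correct a typing error while the CRF is still at hand.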
Something to consider if you do not want to use database software
•You could use spreadsheet software such as Excel to enter data
•But please consider the following restrictions for data preparation
1. Prepare data in a table format in which each row corresponds to one individual.
2. Variable names should be in English and should not contain spaces or special characters such as % & + !. You can use an underscore (_).
3. Do not enter text that is not data, such as comments, directly into the table. Use the comment function in Excel, or put it somewhere else.
4. Do not use cell color to code data. Computer programs do not see the difference between color-coded rows.
5. Try to record data as categories and use numbers to label the categories (e.g., 1/2 instead of Male/Female).
6. If no data were collected, do not type anything. Leave the cell blank.
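A spreadsheet saved as CSV can be screened against these restrictions before analysis. The Python sketch below checks the column names and treats blank cells as missing; the function name, the name pattern, and the sample data are assumptions.

```python
import csv
import io
import re

# Letters, digits, and underscores only, starting with a letter.
SAFE_NAME = re.compile(r"[A-Za-z][A-Za-z0-9_]*")


def load_table(csv_text):
    """Read CSV text, reject unsafe column names, and turn blank cells into None."""
    reader = csv.DictReader(io.StringIO(csv_text))
    bad = [name for name in (reader.fieldnames or []) if not SAFE_NAME.fullmatch(name)]
    if bad:
        raise ValueError(f"Unsafe variable names: {bad}")
    return [
        {name: (value if value != "" else None) for name, value in row.items()}
        for row in reader
    ]


# Hypothetical export: one row per individual, weight missing for ID 2.
rows = load_table("id,sex,weight_kg\n1,1,56\n2,2,\n")
```

Values are read as text here; converting them to numbers is left to the analysis step.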
Thank you