Data File Structure and Content
Joe Larson
5/6/09
Outline
What’s in a Data Set?
- File Setup
- Key Variables
Data Conventions
Fun With Demographics
File Setup
Data on the web is broken up into the forms it was collected on.
Different forms can have different collection time(s) and different participant subgroups.
Available Data is Broken up by Form
All data on the web is arranged by form
Exceptions:
- Outcomes file
- Demographics file
Variables within a data set are in the order of the questionnaire, with any computed variables at the end of the file
Different Forms…Different Participants…Different Times
Forms collected only once result in a file with one record per person
Forms collected numerous times throughout follow-up result in a file with multiple records per person
Some data is only available for specific groups of participants (e.g. DM Only, blood subsample)
Specifics for an individual file can be found in its corresponding data dictionary
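The difference between one record per person and multiple records per person can be checked directly once a file is loaded. The following is a minimal sketch in Python, assuming the file has been read into a list of dictionaries; the rows and ID values are made up for illustration.

```python
from collections import Counter

# Hypothetical rows from a form file; only the ID column matters here.
records = [
    {"ID": 100001, "F80VNUM": 0},
    {"ID": 100001, "F80VNUM": 1},
    {"ID": 100002, "F80VNUM": 0},
]

def records_per_person(rows, id_field="ID"):
    """Count how many records each participant has in a file."""
    return Counter(row[id_field] for row in rows)

def is_one_record_per_person(rows, id_field="ID"):
    """True when no participant ID appears more than once."""
    return all(n == 1 for n in records_per_person(rows, id_field).values())

counts = records_per_person(records)
```

A file that fails this check holds repeated measures, so it usually needs to be filtered or reshaped before it is merged with a one-record-per-person file.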
Key Variables
Some variables are found in every file (with the exception of the demographics and outcomes files)
- ID
- Days since randomization/enrollment
- Visit type / Visit number
- Form closest to visit
- Expected for visit
Key Variables
Let’s take a look at an actual Form 80 file
Participant ID (ID)
The ID variable is common to all of the web files.
Completely independent of the member ID that is used at the individual clinics.
Also independent of the Public and blood draw IDs.
Days Since Randomization / Enrollment (F80DAYS)
We do not give out actual dates for forms or events.
Time is calculated between randomization (CT) or enrollment (OS) and the form date.
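As an illustration of how a days-since value is derived, here is a small Python sketch. The dates are invented for the example (actual dates are not released), and the function name is hypothetical.

```python
from datetime import date

def days_since(reference_date, form_date):
    """Days from randomization (CT) or enrollment (OS) to the form date."""
    return (form_date - reference_date).days

# Hypothetical dates, chosen only to illustrate the arithmetic.
randomization = date(2000, 1, 1)
form_completed = date(2003, 4, 4)
f80days = days_since(randomization, form_completed)
```

A form completed before the reference date would come out negative under this calculation.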
Visit Type (F80VTYP) & Visit Number (F80VNUM)
These variables combine to let you know when data was collected.
For example, in the second line of the data on the previous slide we can see that the record is for “Annual Visit 3”. This matches up well with the 1189 days since randomization.
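The "matches up well" check can be written down as simple arithmetic: annual visit 3 is targeted at roughly 3 × 365 = 1095 days, and 1189 is 94 days past that. A rough Python sketch follows; the half-year tolerance is an assumption made for illustration, not a documented rule.

```python
def visit_matches_days(visit_number, days, days_per_year=365, tolerance_days=183):
    """Check that an annual visit number is plausible for a given day count.

    tolerance_days is a hypothetical window (about half a year).
    """
    target = visit_number * days_per_year
    return abs(days - target) <= tolerance_days

# Annual Visit 3 at 1189 days: target is 1095, difference is 94 days.
plausible = visit_matches_days(3, 1189)
```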
Closest to Visit Within Visit Type and Number (F80VCLO)
On rare occasions multiple forms were filled out or entered for the same participant at the same follow-up visit
This variable identifies the form closest to the target visit date. For example, a year 1 annual visit with a value of “Yes” for VCLO will be the year 1 visit that is closest to 365 days from randomization/enrollment
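To make the logic concrete, the sketch below recomputes a closest-to-visit flag in Python for annual visits. It assumes the target is the visit number times 365 days, as in the year 1 example, and uses the string "annual" as a stand-in for the real visit type code. In practice the released F80VCLO variable should be used rather than recomputed.

```python
def flag_closest(rows, days_per_year=365):
    """Return one boolean per row: True for the record closest to the
    target day within each (ID, visit type, visit number) group."""
    best = {}  # group key -> (distance from target, row index)
    for i, row in enumerate(rows):
        key = (row["ID"], row["F80VTYP"], row["F80VNUM"])
        distance = abs(row["F80DAYS"] - row["F80VNUM"] * days_per_year)
        if key not in best or distance < best[key][0]:
            best[key] = (distance, i)
    closest = {i for _, i in best.values()}
    return [i in closest for i in range(len(rows))]

# Hypothetical data: participant 1 has two year 1 forms.
forms = [
    {"ID": 1, "F80VTYP": "annual", "F80VNUM": 1, "F80DAYS": 350},
    {"ID": 1, "F80VTYP": "annual", "F80VNUM": 1, "F80DAYS": 401},
    {"ID": 2, "F80VTYP": "annual", "F80VNUM": 1, "F80DAYS": 370},
]
flags = flag_closest(forms)
```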
Expected for Visit (F80EXPC)
Sometimes forms are filled out by participants who should not be filling them out
The expected for visit flag identifies data that were expected by protocol
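One possible use of these flags is to restrict an analysis file to forms that were expected by protocol and were closest to the visit. The Python sketch below assumes that both flags are coded 1 for "Yes"; the data dictionary gives the actual coding.

```python
def analysis_records(rows, yes=1):
    """Keep records flagged as expected for the visit and closest to it.

    The coding yes=1 is an assumption for illustration.
    """
    return [
        row for row in rows
        if row["F80EXPC"] == yes and row["F80VCLO"] == yes
    ]

# Hypothetical records.
forms = [
    {"ID": 1, "F80EXPC": 1, "F80VCLO": 1},
    {"ID": 1, "F80EXPC": 1, "F80VCLO": 0},
    {"ID": 2, "F80EXPC": 0, "F80VCLO": 1},
]
kept = analysis_records(forms)
```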
File Setup / Key Variables
Files are arranged by form on the web at www.whiops.org
File structure and participant group vary by form and are documented in the data dictionary
ID, Visit Type, and other important variables can be found at the start of each file
Data Conventions
Skip patterns
Mark all that apply
Version differences
Skip Patterns
• Questions within a form are often set up with a hierarchical structure with parent questions and subquestions
• In most cases, the sub-questions are set to missing if the parent value indicates the sub-questions should not be answered. This is the application of a skip pattern
• In a few cases where the error percentage is high, the skip pattern is not applied
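A skip pattern can be sketched in a few lines of Python. This assumes a parent value of 0 means the sub-questions should not be answered and uses None to represent missing. The variable names (PET, DOG, CAT) are hypothetical stand-ins.

```python
def apply_skip_pattern(row, parent, children, skip_value=0):
    """Set sub-questions to missing when the parent says to skip them."""
    cleaned = dict(row)  # copy so the input record is left unchanged
    if cleaned.get(parent) == skip_value:
        for child in children:
            cleaned[child] = None
    return cleaned

# Hypothetical record: no pet reported, but "Cat" was marked anyway.
raw = {"PET": 0, "DOG": 0, "CAT": 1}
cleaned = apply_skip_pattern(raw, "PET", ["DOG", "CAT"])
```

Records where the parent value allows the sub-questions pass through unchanged.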
Example: Skip Pattern Applied
[Figure: two data tables with columns Pet, Dog, Cat, Bird, Fish, Other, shown before and after skip pattern QA is applied. Dog through Other are the sub-questions. Error percentage < 1%.]
If the Skip Pattern is not Applied
It will be noted in the data dictionary
Mark All That Apply
What kind of pet do you have? (mark all that apply)
Choice:     1 Dog(s)   2 Cat(s)   3 Bird(s)   4 Fish   5 Other
Indicator:  0          1          1           0        1
• One question with multiple choices is converted to separate indicator variables of 0’s and 1’s
Order  Question           Question Number  Value  Value Description
17     Do you have a pet  11               1      Yes
18     Dog                11.1
19     Cat                11.1             2      Marked
20     Bird               11.1             3      Marked
21     Fish               11.1
22     Other              11.1             5      Marked

Mark all conversion:
O17  O18  O19  O20  O21  O22
1    0    1    1    0    1
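The conversion can be sketched in Python as below. The mapping from choice codes 1 to 5 onto O18 through O22 is inferred from the pet example, and the function name is hypothetical.

```python
# Choice code -> indicator variable, following the pet example
# (1 Dog, 2 Cat, 3 Bird, 4 Fish, 5 Other). The mapping is inferred.
CHOICES = {1: "O18", 2: "O19", 3: "O20", 4: "O21", 5: "O22"}

def mark_all_to_indicators(marked, choices=CHOICES):
    """Convert the set of marked choices into 0/1 indicator variables."""
    return {name: int(code in marked) for code, name in choices.items()}

# Cat, Bird and Other were marked on the form.
indicators = mark_all_to_indicators({2, 3, 5})
```

Each choice becomes its own column, so a frequency on any single indicator counts the participants who marked that choice.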
Version Issues
Sometimes questions are not asked on all versions of a form, leading to higher percentages of missing data
This is documented in the data dictionary
Data Conventions
Some cleaning was done to the data before it reached the web
Skip patterns and mark-all-that-apply conversions were usually done
Sometimes questions were not collected on all versions of a form
In all cases, any issues are documented in the data dictionary
The Demographics File
The demographics file is the glue that pulls most analyses together
It contains important variables that are used in just about every analysis
The file has one record per person
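Because the demographics file has one record per person, it can be attached to any form file by ID. A minimal Python sketch of that merge follows; the AGE column and all values are hypothetical.

```python
def merge_on_id(form_rows, demo_rows, id_field="ID"):
    """Attach each participant's demographics to every form record."""
    demo_by_id = {row[id_field]: row for row in demo_rows}
    merged = []
    for row in form_rows:
        demo = demo_by_id.get(row[id_field], {})
        merged.append({**demo, **row})  # form values win on a name clash
    return merged

# Hypothetical data: two form records for participant 1, one for participant 2.
demographics = [{"ID": 1, "AGE": 63}, {"ID": 2, "AGE": 58}]
forms = [
    {"ID": 1, "F80VNUM": 0},
    {"ID": 1, "F80VNUM": 1},
    {"ID": 2, "F80VNUM": 0},
]
merged = merge_on_id(forms, demographics)
```

The result keeps one row per form record, with the demographic values repeated on each of a participant's rows.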
Trial Participation Flags
Trial Flags distinguish what part of the WHI a participant is in
In addition to CT and OS indicators, there are indicator variables for each clinical trial component
Basic Demographic Data
Age, ethnicity, education, and income can be found here
Because clinical center data has not been released, the “U.S. Region” variable is the best variable to use for geographic location
Trial Arms
These are the key variables for any analysis on the clinical trial.
The hormone arm variable can also be used to separate out participants in the two hormone trials
Days from CT to CaD Randomization
Key variable used to determine how far a follow-up visit is from CaD randomization
To determine days from CaD randomization:
- Start with the days from CT randomization
- Subtract the days from CT to CaD randomization
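The two steps above amount to a single subtraction. A small Python sketch, with a hypothetical function name and made-up values:

```python
def days_from_cad_randomization(days_from_ct, ct_to_cad_days):
    """Days from CaD randomization to a visit.

    Returns None when the participant has no CaD randomization value.
    """
    if ct_to_cad_days is None:
        return None
    return days_from_ct - ct_to_cad_days

# Hypothetical: a visit 1189 days after CT randomization, for a
# participant randomized to CaD 400 days after CT randomization.
cad_days = days_from_cad_randomization(1189, 400)
```

A negative result means the visit took place before CaD randomization.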
BMD Subsample Indicator
A ‘yes’ response indicates that the participant was at one of the three BMD clinics
Fun With Demographics
The demographics file is a key file used in most analyses
It includes trial participation and treatment status variables, as well as basic demographic data
Stay Tuned
Later I’ll be doing a beginning to end example:
- Going to the web
- Hunting down variables
- Downloading the data
- Loading it into SAS
- Merging files together
- Running some basic frequencies
And taking questions while I do it!