A Method for Cleaning Clinical Trial Datasets


  • 8/13/2019 A Method for Cleaning Clinical Trial Datasets


    A Method for Cleaning Clinical Trial Analysis Data Sets

    Carol R. Vaughn, Bridgewater Crossings, NJ

    ABSTRACT

    This paper presents a method for using SAS software to search SAS programs in selected directories for references to variables existing in clinical trial analysis data sets slated to be submitted to the FDA. The end product is a list of variables not used in any of the programs searched. Unused derived variables commonly result from analyses which were planned but later eliminated or significantly altered. Dropping these unused variables is highly desirable since they require unnecessary validation and serve only as clutter with dubious value.

    The method involves searching selected directories and rendering programs in those directories into a searchable working SAS data set. This working data set is then searched for the occurrence of a reference to each analysis data set variable.

    INTRODUCTION

    The first step in this process is to identify the programs in the directories to be searched. This can be accomplished by working with directory information. The next step is to render the programs searchable. This can be accomplished by treating the lines of code as lines of data and reading them into a working data set. Then, the analysis data set variables must be identified in order to search for references to them. One way to achieve this is to select them into macro variables from the SAS COLUMNS dictionary. After searching the program code, the analysis data set variables for which no reference is found in any of the programs are designated for possible deletion. Finally, the variables identified can be compared against metadata to confirm the acceptability of deleting them.

    This paper provides example code for each of the steps in this process. The process is broken down into these component steps in order to show how the functionality of each could be used for other applications as well.

    WORKING WITH DIRECTORY INFORMATION

    Methods for identifying the contents of a directory are dependent on the operating system in which SAS is running. On the UNIX operating system, this can be accomplished by using an X command within SAS to list the directory contents to a file.

    x ls -1 > dirlist.txt;

    Figure 1 below is an example of a resulting text file:

    Figure 1. Example Text File Resulting From X Command
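    Once the listing exists as a text file, it can be read like any other raw data. The following is a minimal sketch, assuming the file dirlist.txt sits in the current SAS working directory and holds one file name per line; the data set name DIR_LIST and the variable name PROGRAM are illustrative assumptions rather than names confirmed by the paper.

```sas
/* Sketch only: read the directory listing written by the X command.  */
/* DIR_LIST and PROGRAM are hypothetical names chosen for this sketch. */
data dir_list;
    infile 'dirlist.txt' truncover;
    length program $50;
    input program $50.;
    /* Keep only entries whose last dot-delimited piece is SAS */
    if upcase(scan(program, -1, '.')) = 'SAS';
run;
```

    Because an entry with no dot at all is returned whole by SCAN, an entry named simply SAS would also pass this filter, so the filter is only as reliable as the naming conventions in the directory.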


    ApplicaESUG 2006


    %mend getdir_win;

    %getdir_win(_dir_nm=SAFT,_path= 'dir "P:\Biostat\XXX1234A\3333\pg\rep\saft\"');

    %getdir_win(_dir_nm=DER,_path='dir "P:\Biostat\XXX1234A\3333\pg\der\"');

    Figure 3 below is an excerpt of the working data set ALL_DIR created from this code:

    Figure 3. Example Working Data Set Resulting From Using Directory Information in a DATA Step

    RENDERING PROGRAM FILES INTO A SEARCHABLE SAS DATA SET

    A CALL EXECUTE can then be used to loop through each program in each directory identified in this working (ALL_DIR) data set. The code below successively creates a filename called INF for each program, and then creates a working data set called PRG_SET using each successive infile INF. The lines of code in the programs themselves are treated as lines of data and are read into this working data set with an input statement. Then, each working data set for each program in ALL_DIR is appended into the shell data set called PRG.

    /* Empty shell data set that receives the code from every program */
    data prg;
        length code $200 dir_nm $5 prg_nm $50;
        delete;
    run;

    /* For each program in ALL_DIR, generate a FILENAME, a DATA step, and a PROC APPEND */
    data _null_;
        set all_dir;
        call execute("filename inf '" || trim(path) || trim(program) || "';");
        call execute("data prg_set; infile inf truncover;" ||
                     "length prg_nm $50 dir_nm $5;" ||
                     "input code $1-200;" ||
                     "if code ne '';" ||
                     "prg_nm = '" || trim(program) || "';" ||
                     "code = upcase(code);" ||
                     "dir_nm = '" || trim(dir_nm) || "';run;");
        call execute("proc append data = prg_set base = prg force; run;");
    run;

    The data set called PRG will contain every line of code from every program in ALL_DIR with its corresponding program name and directory reference. The code will be in all uppercase and left justified in order to aid in searching.
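    The same result can be approached without generating code. The sketch below swaps in a different technique, the FILEVAR= option of the INFILE statement, which lets a single DATA step open each file named in ALL_DIR in turn. It assumes, as the CALL EXECUTE code does, that PATH ends with a directory separator and that ALL_DIR carries PATH, PROGRAM and DIR_NM; the names PRG_ALT, FNAME, SRC and EOF are hypothetical.

```sas
/* Sketch only: read every program listed in ALL_DIR in one DATA step. */
/* PRG_ALT, FNAME, SRC and EOF are hypothetical names.                  */
data prg_alt (keep = code dir_nm prg_nm);
    set all_dir;
    length fname $260 code $200 prg_nm $50;
    fname  = trim(path) || trim(program);   /* full path of the current program */
    prg_nm = program;
    /* SRC is a placeholder fileref; FILEVAR= supplies the real file name */
    infile src filevar = fname end = eof truncover;
    do while (not eof);
        input code $1-200;
        code = upcase(code);
        if code ne '' then output;
    end;
run;
```

    This sketch does not guard against empty program files, which may need separate handling.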


    Figure 4 below is an excerpt of the working data set PRG created with the code:

    Figure 4. Example Working Data Set Resulting From Reading in Program Contents with CALL EXECUTE

    SEARCHING PROGRAM FILES

    The data set PRG can then be searched for text strings. A simple use for the ability to search programs for text strings would be to search for a programmer's name.

    proc sort data=prg out=searched(keep=dir_nm prg_nm) nodupkey;
        by dir_nm prg_nm;
        where index(code,"WONG")>0;
    run;

    Figure 5 below is an excerpt of the working data set SEARCHED resulting from this search:

    Figure 5. Example Working Data Set Resulting From Searching for a Programmer's Name in Programs

    It is sometimes desirable to search for a string of characters as a word and not merely a sequence of characters. For example, when searching for the string EVENT using the function INDEX, the string EVENTA will be identified as an occurrence. To circumvent this, the function INDEXW can be used. This function searches a character expression for a specified string as a word, that is, delimited by blank spaces or by the beginning or end of the expression. When searching program code, the word being searched for will often be preceded or followed by special characters such as a semicolon or equal sign. In order for INDEXW to yield the desired result, it may be necessary to replace many of these special characters in the working data set of code with spaces prior to searching with the function INDEXW. The function TRANSLATE can be used for this purpose.


    data prg; set prg;code = trim(left(translate(code," ","*+-/^=~>
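    The difference between INDEX and INDEXW, and the effect of replacing special characters with TRANSLATE, can be seen on a single made-up line of code. This is a small illustrative sketch; the sample line and the variable names are assumptions, and only two special characters are translated here.

```sas
/* Sketch only: compare INDEX and INDEXW on one hypothetical line of code */
data _null_;
    code = "IF EVENTA = 1 THEN EVENT=2;";
    pos_index  = index(code, "EVENT");     /* 4: matches inside EVENTA         */
    pos_indexw = indexw(code, "EVENT");    /* 0: EVENT=2; is not a whole word  */
    /* Replace = and ; with blanks so EVENT stands alone as a word */
    clean      = translate(code, "  ", "=;");
    pos_clean  = indexw(clean, "EVENT");   /* 20: the real reference is found  */
    put pos_index= pos_indexw= pos_clean=;
run;
```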


    Figure 7 below is an excerpt of the working data set ALL_DIR resulting from the addition of the variable DS:

    Figure 7. Example Working Data Set Resulting From The Addition of The Variable DS

    The variable DS would need to be included as a variable in the working data set PRG. This could be accomplished by adding it to the CALL EXECUTE which used ALL_DIR to create PRG.

    In order to identify the variables in the derived data sets, the SAS dictionary table COLUMNS can be used.

    proc sql noprint;

    create table vars as

    select distinct upper(memname) as ds, upper(name) as var

    from dictionary.columns

    where upper(libname) = "DDS";

    quit;

    Figure 8 below is an excerpt of the working data set VARS created with this code:

    Figure 8. Example Working Data Set Resulting From Reading in Data from DICTIONARY.COLUMNS

    By counting the number of variables, and selecting variable names into macro variables, the variables can be looped through and used successively as the string in the INDEXW function. By selecting the corresponding data set names into macro variables, the comparison can be made to the name of the data set the program created.

    proc sql noprint;

    select left(put(count(var),4.0)) into :varcnt from vars;

    select var into :var1 - :var&varcnt from vars;

    select ds into :ds1 - :ds&varcnt from vars;

    quit;
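    The numbered macro variables behave like an array that is indexed with a double ampersand. The sketch below shows the resolution in isolation, using hand-assigned values in place of the PROC SQL step; the values AGE and SEX and the macro name SHOW are hypothetical.

```sas
/* Sketch only: how &&var&i resolves when looping over numbered macro variables */
%let var1   = AGE;
%let var2   = SEX;
%let varcnt = 2;

%macro show;
    %do i = 1 %to &varcnt;
        /* &&var&i resolves first to &var1 (or &var2), then to its value */
        %put NOTE: variable &i is &&var&i;
    %end;
%mend show;

%show
```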


    The following code could be used to determine if a variable name is found in a line of code which does not have the same data set tag name (the value of the variable DS in the working data set PRG) as the analysis data set in which the variable is found (the value of the variable DS from the working data set VARS which is held as a macro variable). If this condition is met, the line of code plus the value of variable VAR, which identifies which variable reference was found in the line of code, is output to a working data set called REF_PRG.

    /* Generates one IF block per analysis data set variable */
    %macro process;
        %do i = 1 %to &varcnt;
            if indexw(code,"&&var&i") > 0 and ds ne "&&ds&i" then do;
                var = "&&var&i";
                output;
            end;
        %end;
    %mend process;

    data ref_prg;
        set prg;
        length var $20;
        %process;
    run;

    Figure 9 below is an excerpt of the working data set REF_PRG created by this code:

    Figure 9. Example Working Data Set Resulting From Searching for Reference to Analysis Dataset Variables

    Note that if a variable from the working data set VARS was not found in the program code (the working data set PRG), a record will not be written to the working data set REF_PRG.
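    The same comparison can also be expressed without macro variables. The sketch below swaps in a PROC SQL join that pairs every line of code with every analysis data set variable and keeps the pairs where the variable name appears as a word. It assumes PRG already contains the variable DS; the output name REF_PRG_SQL is hypothetical, and on large inputs this pairing of all rows may be slow.

```sas
/* Sketch only: find variable references with a join instead of a macro loop. */
/* REF_PRG_SQL is a hypothetical name; PRG is assumed to carry DS.            */
proc sql;
    create table ref_prg_sql as
    select p.dir_nm, p.prg_nm, p.code, v.var
    from prg as p, vars as v
    where indexw(p.code, trim(v.var)) > 0
      and p.ds ne v.ds;
quit;
```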

    The unique variables which were referred to in the code (the variable VAR in the working data set REF_PRG) are then compared against all analysis data set variable names in order to determine which are never referred to in code.

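    /* Unique variable names that were referenced in program code */
    proc sort data = ref_prg (keep = var) out = used nodupkey;
        by var;
    run;

    /* Unique variable names across all analysis data sets */
    proc sort data = vars (keep = var) out = all_vars nodupkey;
        by var;
    run;

    /* Keep the names that never appear among the referenced variables */
    data unused;
        merge all_vars used (in = used);
        by var;
        if not used;
    run;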


    Figure 10 below is an excerpt of the working data set UNUSED created with this code:

    Figure 10. Example Working Data Set Resulting From Identifying Unused Variables

    Identifying the unused variables is usually not the last step. It may be necessary or desirable to retain some of the variables identified as unused. For example, at times it may be desired to retain raw Case Report Form (CRF) variables in an analysis data set even though they are never used in a program. Or, perhaps there are many derived decode variables (example: a variable storing the values MALE and FEMALE, which are the decodes of a coded variable with values 1 and 2) which are never used in a program, but it is desired to retain them in the analysis data sets. In such cases, it is valuable to have the metadata for the analysis data sets in a form against which the resulting unused variables can be programmatically compared to identify which variables should be retained.
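    A comparison against metadata might look like the following sketch. Everything about the metadata here is an assumption made for illustration: the data set META, its variables DS, VAR and RETAIN_FL, the sample rows, and the output name TO_DROP are all hypothetical, since the paper does not describe the layout of its metadata.

```sas
/* Sketch only: hypothetical metadata with a flag marking variables to retain */
data meta;
    length ds $32 var $32 retain_fl $1;
    input ds $ var $ retain_fl $;
    datalines;
ADSL SEXC Y
ADSL OLDFLAG N
;
run;

/* Unused variables that the metadata does not mark for retention */
proc sql;
    create table to_drop as
    select u.var
    from unused as u
    where u.var not in
          (select upper(m.var) from meta as m where m.retain_fl = 'Y');
quit;
```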

    PREREQUISITES/CAVEATS

    This method works well as long as certain programming practices and conventions are followed:

    The line size of code in all programs should not exceed 200 characters.

    Analysis data set derivation programs should be placed in a separate directory and named following a convention which allows programmatic identification of the data set they create.

    The final subdirectory of any directory path to be searched should not be named SAS.

    Also, please note that this method does not discriminate between comments in programs and actual code. Nor does it differentiate between variable names and data set names.

    For example, if there was an analysis data set with the derived variables EVENT and BASE and these variables were never actually used, this method would identify these variable references as having been found if the following code was contained in a program searched:

    However, in practical use, these potential problems have not yet presented themselves as actual problems.
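    The two caveats can be demonstrated on made-up lines of code. The sketch below is a hypothetical illustration, not the paper's own example: one line is a comment that mentions EVENT, the other uses BASE as a data set name, and both are reported as references once the special characters are replaced with blanks.

```sas
/* Sketch only: a comment and a data set name both register as references */
data _null_;
    length code $200;
    /* Hypothetical comment line that mentions EVENT */
    code = upcase("* derive event flag later;");
    found_in_comment = indexw(translate(code, "  ", "*;"), "EVENT") > 0;
    /* Hypothetical line where BASE is a data set name, not a variable */
    code = upcase("data base; set raw; run;");
    found_as_dsname = indexw(translate(code, "  ", "*;"), "BASE") > 0;
    /* Both flags are 1: neither line uses the variable itself */
    put found_in_comment= found_as_dsname=;
run;
```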

    CONCLUSION

    The functionality of SAS to be able to take information from directory details and files other than data sets, place this information in SAS data sets, and search for references to variables contained in SAS data sets has many applications. The basic concepts presented in this paper for determining variables not used in programs could be modified to accomplish many other tasks.

    ACKNOWLEDGEMENT

    I would like to thank my colleague, Jeffery Cortez, for coming up with the idea of searching programs to identify analysis data set variables not used in programs.


    CONTACT INFORMATION

    Your comments and questions are welcome. Contact the author at:

    Author Name: Carol R. Vaughn

    Enterprise: The sanofi-aventis Group

    Address: 200 Bridgewater Crossings

    City State ZIP: Bridgewater, NJ, 08807

    Work Phone: 908-304-6298

    Email: [email protected]

    SAS

    SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration.

    WINDOWS

    Windows is a registered trademark of Microsoft Corporation in the United States and other countries.

    UNIX

    UNIX is a registered trademark of The Open Group.
