

Paper DM01

Searching for the Right Search Tool Eugene Yeh, ASG Inc., Cary, NC

Angela Ringelberg, PharmaNet, Inc., Cary, NC

ABSTRACT SAS® programmers work with many files of different file types (e.g. SAS data sets, Excel®, SAS programs, log and lst files, HTML, PDF). This paper discusses many popular search tools used to help manage all these files and compares their features. These features include scanning multiple files and multiple directories, and displaying a preview of the matching text within each file. The strengths and weaknesses of each search tool are reviewed. The best way to conduct some common searches for text phrases and numeric data in files of different file types is presented. Which search tool is best suited to handle each file type? Is there a search tool that can do it all?

INTRODUCTION SAS programmers work with many different types of files during the course of their job. Input files containing data and reference tables are commonly SAS data sets, Excel spreadsheets, or ASCII files. Instructions dictating programming goals may exist in statistical analysis plans and in mock-up files for tables, listings, and figures, as MS-Word® and Excel files. Executing SAS programs can generate SAS Log and Lst files, and output files of file type RTF, HTML, or PDF. The programming tasks supporting a single clinical report can easily include over 100 output files.

One aspect of data management is being able to find the data you want, on demand, from all your files. When many files are involved, relying on your memory to find the occurrences of a particular text phrase or data value within all these files may be impractical. You need a search application to accurately, systematically, and quickly search through the contents of all the files. Many software applications have some file search capability. What’s the right search tool for common search tasks? Can one application do all the searching for you?

This paper reviews and contrasts the capabilities and limitations of many popular search applications, and discusses how they can be used to perform some practical tasks. All these applications run on the Microsoft® Windows® operating system (OS) and are readily available to all users; most are free of cost.

Most applications developed to view and/or edit text or data of specific file types, such as MS-Word or Adobe Reader®, include a search feature. Unfortunately, these tools are often limited to searching through one file at a time, not multiple files from multiple directories. Furthermore, they can only scan file types that can be viewed by that application. Operating systems manage many files of different file types, and also commonly include search tools. Although these search tools can scan the contents of many files of different file types across many directories, they are less likely to provide a preview of the matching text within a file. Instead, they display only the name of each file containing the matching text.

SYSTEM All applications discussed run on Microsoft Windows 2000, Win2000 servers, Microsoft V5.00 Terminal Server Client, Microsoft® Office® 2000 and SAS version 9.1 in production.

SEARCH APPLICATION FEATURES An ideal search tool is not limited to what type of file can be searched and where to search. It would be able to search multiple file types and multiple directories with a single user request. It would report all the filenames containing at least one match to your search parameter. Furthermore, the tool would allow you to preview the matching text or data value within each file and hyperlink you to that exact location with a single click of the mouse. Below is a list of features to consider in any search tool:

Inputs:
1. Can search through contents of multiple file types
2. Can search through multiple directories upon a single request
3. Specify search parameter using a filename, part of a filename, matching text or data
4. Use advanced search parameters based on file timestamp, file size, other metadata

Conduct:
5. Easy to use
6. Perform search in a reasonable length of time
7. Allow you to abort a search in progress

Output:
8. Display full pathname of all files that match the search parameter
9. Allow you to sort all matching files by their metadata (e.g. timestamp, file size)
10. Hyperlink to each matching file
11. Display a preview of the matching text or data value within each file
12. Hyperlink to the location of each matching text or data value within each file

The features being considered are divided into three broad categories: input features, features dealing with the actual use of the tool, and output features. Table 1 presents a list of commonly used search tools and rates them against each other on each feature. How each tool behaves, and how it compares to the other tools, is discussed by having them all search for the same thing. This is done using two examples: in the first, the search parameter is a text phrase; in the second, it is a numeric data value.

EXAMPLE 1 – SEARCHING FOR A TEXT PHRASE Let us examine how a search for the text phrase “Chi-square” would be conducted by each application in Table 1. This text might be expected to appear in a statistical analysis plan, the table mock-up file, as a comment in a SAS program, or in the footnote of an output table.

MS-OFFICE The search menu in MS-Office offers the most flexible way to choose multiple directories to include in a given search. From the “Search in:” drop-down menu, you can drill down the directory tree and check or uncheck the box next to each folder name (see Figure 1a). The search is conducted on all folders with a checked box. You cannot limit the search by specifying any part of the filename (e.g. all files starting with “lab”). However, the files included in the search results can be limited by file type using the “Results should be:” drop-down menu. Rather than referring to file types by their filename extension (e.g. doc, xls, html), you check or uncheck a box next to file types such as “Word”, “Excel” and “Web Page” to indicate the file types included in each search (see Figure 1a). Checking “Anything” will scan the ASCII text in every file, regardless of its file type. The search result displays a simple, non-sortable list of filenames (see Figure 1b). The total number of files with matches is displayed, but the number of matches per file is not shown. Upon cursor fly-by, the full pathname is revealed, and clicking on the filename displays the file content using its default application.
Strengths: scans many file types (including all MS-Word and MS-Excel), scans multiple independent directories, allows multiple search parameters linked with Boolean operators.
Weaknesses: non-sortable search results, cannot limit search by filename, does not display number of matches per file, does not display a preview of matching text within files.

MS-EXPLORER The search menu in MS-Explorer allows you to search all files of all file types by using the wildcard symbol “*” for the filename and extension, “*.*”. Alternatively, if you want to limit the search to files whose names contain certain text, for example “lab”, you can use “*lab*.*” instead. Searches can be conducted in multiple directories only when all of them are subfolders of the single directory chosen. So, if you choose the root directory of a given drive letter, MS-Explorer can search through every folder on that drive. Since all our files of interest are in subfolders of “E:\Compound B\Study 101”, the user can scan all the folders of interest during a single search. The search results yield a list of the files containing “square” (part of “Chi-square”). In addition to the filename, the search results display columns for file metadata such as the directory name, file size, file type, and file timestamp. Clicking, or double-clicking, on any column label sorts all the files in ascending or descending order by the values in that column. Similar to MS-Office, MS-Explorer displays the count of matching files (see Figure 2, lower right corner), but does not provide the number of matches per file or a preview of the matching text within each file.
Strengths: scans many file types, allows search parameter to include metadata values (e.g. search only files created in the last 3 days), sortable search results.
Weaknesses: does not display number of matches per file or a preview of matching text within files.
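The filename wildcard and the column sorting described above can be approximated in a few lines of Python. This is only an illustrative sketch and not how MS-Explorer is implemented; the function names `find_files` and `sort_hits` and the dictionary keys are assumptions made for this example.

```python
import fnmatch
import os


def find_files(root, pattern="*.*"):
    """Walk root and its subfolders; return metadata for files whose name matches pattern."""
    hits = []
    for folder, _subdirs, names in os.walk(root):
        for name in names:
            # Case-insensitive wildcard match on the filename only, e.g. "*lab*.*".
            # Note: unlike Windows, fnmatch needs a literal dot in the name to match "*.*".
            if fnmatch.fnmatch(name.lower(), pattern.lower()):
                path = os.path.join(folder, name)
                info = os.stat(path)
                hits.append({"path": path, "size": info.st_size, "modified": info.st_mtime})
    return hits


def sort_hits(hits, column, descending=False):
    """Sort the result list by one metadata column, similar to clicking a column label."""
    return sorted(hits, key=lambda hit: hit[column], reverse=descending)
```

For instance, `sort_hits(find_files(r"E:\Compound B\Study 101", "*lab*.*"), "modified", descending=True)` would list the files with “lab” in their names, most recently modified first. Like the MS-Explorer search, it covers only the subfolders of a single starting directory.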

MS EXCEL Clicking on any filename in the MS-Explorer search results list opens that file using its default application. For example, clicking on an Excel file opens the Excel application, where all the instances of “square” within all worksheets can be found in a single search (see Figure 3). Typing control-F opens the “Find and Replace” dialog box. Choosing “Workbook” from the drop-down menu labeled “Within:” allows a single search to scan all the worksheets in that file, rather than scanning only a single worksheet. The search results display the worksheet name and cell location (e.g. $A$33) of each matching text phrase. A preview of the entire contents of each cell containing the matching text is also displayed, and clicking on the text navigates you directly to that worksheet and cell location.


Table 1 – Comparison of Search Tools (ratings: 1 = Worst, 10 = Best)

Tools, in column order:
A = Google Desktop 2 Quick Find
B = MSN Search Toolbar with Windows Desktop Search
C = MS Office Search
D = MS Explorer
E = DOS Command FINDSTR
F = Adobe Acrobat Reader v7
G = SAS Viewer
H = SAS System

Feature                                         A    B    C    D    E    F    G    H
INPUTS
File types recognized (rating)                  10   10   10   8    8    1    3    4
Search multiple files within a directory        Yes  Yes  Yes  Yes  Yes  Yes  No   No
Search multiple directories                     10   10   10   5**  5**  5**  No   No
Allows advanced search parameters               Yes  Yes  Yes  Yes  Yes  Yes  Yes  Yes
Save search parameters                          No   10   7    7    10   7    7    No
USE
Ease of use                                     10   10   10   10   5    10   10   8
Abort search in progress                        Yes  Yes  Yes  Yes  Yes  Yes  Yes  No
OUTPUTS
Display number of files, matches, other stats   5    5    5    5    1    7    1    1
Displays directory and file names               Yes  Yes  Yes  Yes  Yes  Yes  Yes  Yes
Sort list of matching files                     7    7    1    7    1    7    1    1
Links to matching file                          Yes  Yes  Yes  Yes  No   Yes  Yes  Yes
Links to matching text within file              Yes  Yes  No   No   No   Yes  Yes  No
Save search results                             No   No   No   No   Yes  No   No   No

Only seven values survive for each of the two rows below, so they are listed in their original order without column assignment:
Displays file metadata: Yes, Yes, Yes (only on fly-by), Yes, No, No, No
Display preview of matching text within file: Yes, Yes, No, No, Yes, Yes, Yes

File types recognized:
A: rtf, xls, pdf, html, txt, sas7bdat*, e-mail
B: rtf, xls, pdf***, html, txt, sas7bdat*, e-mail (and attachments)
C: e-mail, html, log, lst, rtf, doc, sas, sas7bdat*, xls
D: html, log, lst, rtf, doc, sas, sas7bdat*, xls*
E: html, log, lst, rtf, sas, sas7bdat*, xls
F: pdf
G: sas, log, lst, sas7bdat
H: sas, log, lst, sas7bdat, sas7bcat

How to start:
A: Type into ‘Search’ box, Toolbar
B: MSN Search Toolbar
C: File Menu, File, Search
D: Toolbar, ‘Search’ Icon or Control-e
E: Command Prompt, FINDSTR
F: Toolbar, ‘Search’ Icon or Control-f
G: Control-f
H: Control-f

OTHER TOPICS
Availability of application:
A: Free download
B: Free download
C: Included in all MS Office applications
D: Included with Windows
E: Included with Windows
F: Free download
G: Free download
H: $$$

Contact information:
A: www.google.com/downloads/
B: http://toolbar.msn.com/
F: www.adobe.com
G: www.sas.com
H: www.sas.com

Comments:
F: Searches one PDF file or all PDF files from one directory
G: Single file search only
H: Single file search only

Footnotes:
* = Cannot search through numeric data
** = Searches multiple directories only in subfolders of one location
*** = With PDF Plug-in Application

Figure 1a – MS-Office search feature. Checking boxes next to each directory (left) and file type (right) controls the locations and file types included in each search.

Figure 1b – MS-Office search results display filenames containing search parameter. Full pathname of each filename is displayed on cursor ‘fly-by’ and clicking on the filename opens each file.

Figure 2 – MS-Explorer search results display filename and full pathname. Also displays a column for file type, file size, and timestamp (not shown).

Figure 3 – Excel search results display matches from cells in multiple worksheets of a single file.

DOS COMMAND: FINDSTR Unlike the other, newer search tools mentioned in this paper, which use a graphical user interface (GUI), FINDSTR is executed from the DOS command line provided by the Windows OS. Executing this command can be cumbersome, requiring you to open the DOS Prompt from Windows and then type out the command. Upon opening the DOS Prompt window, full documentation on how to use this command can be viewed by typing:

F:> help findstr

To simplify use of this search tool, a DOS batch file was created to bundle a sequence of DOS commands. To search all the folders of interest and save the search results in a file (F:\find_results.txt), the following DOS batch program (FINDIT.BAT) was created, featuring the DOS command FINDSTR:

@echo off
REM PURPOSE: Search directory (and subfolders), place results in file F:\find_results.txt
echo Start search for: %1
echo Press Control-C to abort search
echo Please wait . . .
REM Start file with current directory name
cd > F:\find_results.txt
REM Use FINDSTR options, s=search subfolders, i=case insensitive, n=display text line number
findstr /s /i /n "%1" *.* >> F:\find_results.txt
echo Finished search for: %1
echo Search results written to file: F:\find_results.txt
echo Press Control-C to abort display of search results
pause
type F:\find_results.txt | more
pause
echo on

You can search for the text phrase “square” by placing this batch file in the directory of interest and typing “findit square” at the DOS prompt:

F:\Compound B\Study 101> findit square

The search results are displayed on the screen and saved in the file F:\find_results.txt (see Figure 4). All files (*.*) are searched, and any ASCII text found that matches the search parameter “square” is considered a hit. The first line in the search results file indicates the root folder that was searched. Each of the other lines displays the subfolder name, the filename, the line number of the line containing “square”, and the actual text of that line.
Strengths: scans multiple file types, provides a preview of the matching text within each file, ability to save the search results in a permanent file.
Weaknesses: command line interface is awkward compared to a GUI, does not hyperlink to files or text found in the search results.

The search result format produced by the FINDSTR command is also provided by other popular search tools. The text editor UltraEdit® includes a search feature whose results display is very similar to that of the FINDSTR command. The advantage of the UltraEdit search tool over FINDSTR is that it is activated via a GUI.

The UNIX OS has a powerful search command called GREP (global regular expression print). You use it to find text strings within multiple files. It provides search results in a format similar to that of FINDSTR. The GREP command supports the use of wildcards and regular expressions. Unlike the other search tools mentioned in this paper, there is a bit of a learning curve to master these powerful advanced features.
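The line-oriented output that FINDSTR and GREP share (filename, line number, matching line) can be illustrated with a short Python sketch. It is not a reimplementation of either command; the names `search_tree` and `format_hit` are hypothetical, and reading every file as Latin-1 text is an assumption chosen so that binary files can be scanned without decoding errors.

```python
import os
import re


def search_tree(root, pattern, ignore_case=True):
    """Yield (path, line_number, line_text) for every line under root that matches pattern."""
    regex = re.compile(pattern, re.IGNORECASE if ignore_case else 0)
    for folder, _subdirs, names in os.walk(root):
        for name in sorted(names):
            path = os.path.join(folder, name)
            try:
                # Latin-1 maps every byte to a character, so any file can be read as text.
                with open(path, "r", encoding="latin-1") as handle:
                    for number, line in enumerate(handle, start=1):
                        if regex.search(line):
                            yield path, number, line.rstrip("\r\n")
            except OSError:
                continue  # skip files that cannot be opened or read


def format_hit(root, path, number, text):
    """Render one hit as relative_path:line_number:text."""
    return f"{os.path.relpath(path, root)}:{number}:{text}"
```

A regular expression such as `chi[- ]?square` would match “Chi-square”, “chi square”, and “chisquare” in a single pass, which is the kind of flexibility that makes regular expressions worth the learning curve.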

ADOBE READER This application is used to view files in the PDF format and includes a search tool. If any file is open in Adobe Reader, the search tool allows you to search the contents of the active PDF file or of any chosen directory (and its subfolders). Many applications can create PDF files. Image files of file types TIF, JPEG, GIF, etc. can be converted to the PDF format; however, any text that is present in the image cannot be scanned by the search tool. On the other hand, PDF files containing text that was specified with fonts can be scanned. This includes the PDF files created by the SAS Output Delivery System (ODS) using ODS PDF statements.


Adobe Reader found the text phrase “square” in the footnote section of two SAS output table files (see Figure 5a). The filenames and the matching text within each file are organized in a two-level tree diagram in the search results pane. Clicking on the plus or minus sign adjacent to each filename reveals or hides, respectively, all the matching text (in bold) within that file. Clicking on any matching text in the search results pane opens the PDF file containing that text and navigates the cursor to the location of the matching text within the file.
Strengths: scans multiple PDF files, provides a preview and a link to the matching text within each file.
Weaknesses: scans only files with PDF file type.

ODS PDF, SAS 8 VERSUS SAS 9 ODS PDF is available in both SAS 8.x and 9.x, but is considered experimental in the former. Unlike SAS 9.x, SAS 8.x stores much of the text displayed in the PDF files it creates as ASCII characters. Therefore, many search tools which recognize the ASCII format will find matches in the PDF files created by SAS 8.x, but not in the PDF files created by the same SAS program executed on SAS 9.x. Although Adobe Reader can view the PDF files created by both versions of SAS, the search tool in Adobe Reader can only scan through PDF files created by SAS 8.x when they are opened for viewing. It cannot scan through these files if they are unopened and simply reside in a directory. Since opening every PDF file in a directory is generally impractical, so is using this search tool to scan through multiple PDF files created by SAS 8.x. This limitation in scanning does not exist with PDF files created by SAS 9.x. The PDF files featured in this paper were created using SAS 9.1.

EXAMPLE 2 – SEARCHING FOR NUMERIC DATA SAS data sets can store numeric data, such as dates, in numeric variables and in character variables. For example, the date 01/15/2005 can be stored as a numeric value of 16451 (the number of days after 01/01/1960) and displayed as 15JAN2005 using the DATE9. format. This same date can also be stored as a character value of “20050115”, following a date pattern of “YYYYMMDD”. Character values are stored in SAS data sets in the ASCII format, while numeric values are stored using a different format. Some data sets might have both a numeric variable and a character variable representing the date, while other data sets might have only the numeric or the character variable, but not both. Many file types store numeric data in a non-ASCII format. Therefore, searching for numeric data using the same search tools can yield quite different results compared to searching for a text phrase.

In this example, SAS programming was performed on interim analysis data using a cut-off date of 12/31/2004. Although data was collected after this date, none of it will be reported in the listings, tables, and figures in the interim analysis report. After all the SAS programs are executed, we will search all input and output files for the numeric value 2005 to verify that no data from this year was included in the report.
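The date arithmetic above, and the reason a text search misses the numeric form, can be checked with a small Python sketch. The helper names are hypothetical, and packing the value as an 8-byte little-endian IEEE double is an assumption about how a numeric value is laid out on Windows; it stands in for the value's bytes only and does not reproduce the actual structure of a SAS data set file.

```python
import struct
from datetime import date

SAS_EPOCH = date(1960, 1, 1)  # SAS date value 0
MONTHS = ("JAN", "FEB", "MAR", "APR", "MAY", "JUN",
          "JUL", "AUG", "SEP", "OCT", "NOV", "DEC")


def sas_date_value(d):
    """Number of days from 01JAN1960 to d, the numeric form of a SAS date."""
    return (d - SAS_EPOCH).days


def date9(d):
    """Render a date the way the DATE9. format displays it, e.g. 15JAN2005."""
    return f"{d.day:02d}{MONTHS[d.month - 1]}{d.year}"


jan15 = date(2005, 1, 15)
numeric_value = sas_date_value(jan15)

# Assumed storage: numeric as an 8-byte double, character as ASCII text.
stored_numeric = struct.pack("<d", float(numeric_value))
stored_character = "20050115".encode("ascii")

print(numeric_value, date9(jan15))   # 16451 15JAN2005
print(b"2005" in stored_character)   # True: an ASCII scan finds the year
print(b"2005" in stored_numeric)     # False: the digits never appear as text
```

The year only becomes visible in the numeric variable after a format such as DATE9. is applied, which is why a tool must read and format the values, not just scan the file's bytes, to find it.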

MS-OFFICE, MS-EXPLORER The search tools in both applications can scan text in the ASCII format. Therefore, they can find the year “2005” stored in character variables in SAS data sets, but are unable to find the same year as part of a date value in numeric variables. Unfortunately, they cannot display the variable name or observation number (row number) of the matching text in the data set. Instead, they simply display the filename.
Strengths: scans many files from multiple independent directories. MS-Office can find matches to numeric data in Excel files.
Weaknesses: does not display a preview of matching data within files, cannot find matches in PDF files or numeric values in SAS data sets.

SAS VIEWER, SAS SYSTEM Both applications allow you to view the contents of SAS programs, Log and Lst files, and SAS data sets. Once a given file is opened by the application, the search tool is activated by typing control-f. You can find “2005” in SAS programs, Log and Lst files, but the file must first be opened within the application.

In the SAS Viewer, you can find all the matches to “2005” stored as a character or numeric value in one given data set. Its search tool can scan through all variables and all observations with a single request. A single search reveals the variable name and observation number (row number) of each match in the data set by navigating the cursor to the location of each matching value. Unfortunately, the scan for “2005” only occurs within the one active data set. Scanning multiple data sets requires you to open and scan each data set, one at a time.

The SAS System can display the contents of data sets using a VIEWTABLE window. Its search tool is also activated by typing control-f, but it cannot scan multiple variables in a single request. Instead, you must choose one variable from the list of all variables, and the search result displays all observations containing a match. Scanning all the variables for a match requires you to search through each variable, one at a time. Similar to the SAS Viewer, a search can only be conducted on the one active data set.
Strengths: scans and displays the location of numeric values stored by character or numeric variables in SAS data sets.
Weaknesses: a single search only scans one file.

A CUSTOM SAS DATA SET SEARCH TOOL The weakness of the SAS Viewer and the SAS System, being able to scan only one data set at a time, was overcome by creating a custom SAS data set search tool (1) (2). This tool has a GUI and can search through multiple data sets from multiple directories based upon a single request. The results of searching all character variables and formatted numeric variables from multiple data sets for “2005” are displayed in Figure 6. Like the SAS Viewer, it identifies the variable name and observation number of each matching data value. Figure 6 is similar to Figure 3, where the latter displays the cell location (row and column) of each matching search result in Excel.

Figure 4 – Search summary excerpt from DOS Command FINDSTR. Each line displays subfolder name, filename, line number of the matching text, and entire text line containing matching text.

Figure 5a (left) and Figure 5b (right) – Adobe Reader search results of one directory (and its subfolders), displaying stats, matching files, matching text within the files, and on cursor ‘fly-by’, its page number (p# 15). Search results are shown for finding a text phrase (left) and numeric data (right).

Figure 6 – Search summary from a custom SAS data set search tool of all formatted numeric data values containing 2005 (red text) found in multiple SAS data sets. Item 7 from table of contents (bottom of left column) is displayed in the table shown. For each formatted numeric data value containing 2005, the same row displays its associated file location, data set name, variable name (with format attribute), variable label, observation number, and unformatted data value.

ADOBE READER When ODS PDF creates PDF files, the text representing numeric data is written into the PDF files in the same manner as character data, specified with a font style, type and size. Therefore, searching for numeric data in PDF files is similar to searching for character data. Figure 5b displays the search results of finding “2005” in all PDF files with Adobe Reader. The results show that two listing files in the PDF format contain matching text in the form of dates with the year 2005 (e.g. 06DEC2005, 02JAN2005).
Strengths: scans multiple PDF files, provides a preview and a link to the matching text within each file.
Weaknesses: scans only files with PDF file type.

DISCUSSION In March 2005, the University of Wisconsin E-Business Consortium (UWEBC) performed an independent benchmark study of 12 popular desktop search tools (3). Each tool was evaluated on six main criteria, defined as follows:

Usability: Is the desktop search tool easy to use?
Versatility: How wide-ranging a search does the tool allow? What types of documents does it support? Does it support email? Does it support multiple languages?
Accuracy: How accurate are the search results? Can you find what you are looking for?
Efficiency: Does this tool jeopardize the overall performance of your PC?
Security: How well have security mechanisms been incorporated into the tool?
Enterprise Readiness: Is this tool ready to use in an enterprise?

The study rated Copernic 1.5 Beta with Coveo the best overall desktop search tool. Yahoo! Desktop Search 1.1 Beta rated second best, followed by Likasoft Archivarius 3000, MSN Toolbar Suite, Google Desktop, and Ask Jeeves. Among all applications, some notable features are:

MSN Toolbar allows you to associate a keyword with a specific file to define ‘short-cut’ searches.
Likasoft Archivarius 3000 3.14 has a remote search function, allowing you to search your desktop remotely through a web browser.
Blinkx 3.0 has a feature called ‘Smart folder’ which ‘crawls’ the web to find information relevant to the files in your folder. Its ‘Blinkx Visualizer’ feature produces a tree view of the web search results in which linked sites are visually connected with branches.
Google has a floating bar which allows you to type keywords from anywhere on your desktop.

FALSE POSITIVE SEARCH RESULTS False positive matches are files that are reported in search results inappropriately or incorrectly. Since many search tools scan all ASCII text within files, regardless of their file type, ASCII characters that are embedded in the internal structure of a file, and that are not displayed to you, can cause false positive matches. For example, all SAS data set files stored in Windows contain the embedded ASCII text “WIN”, “FILE” and “DATA”. This causes the search tools associated with MS-Office and MS-Explorer to consider every SAS data set file they scan a match when the text you request is “WIN”, “FILE” or “DATA”. Similarly, all SAS variable names and their SAS labels are embedded as ASCII characters in SAS data sets and can cause matches in these search tools. Another example is the ASCII text “PDF” and “/Page” embedded in PDF files. When all files of a given file type appear as a match to a search, you might suspect that they are being matched because of embedded ASCII characters, rather than any text that is normally displayed to you. You can view all the embedded ASCII characters in a file by opening the file with a text editor, such as Notepad, instead of launching the default application.
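As an alternative to paging through a binary file in Notepad, the embedded ASCII text can be pulled out programmatically. The following Python sketch lists every run of printable ASCII characters in a block of bytes; the function name `embedded_ascii` is hypothetical, and the `sample` bytes are a made-up stand-in, not the contents of a real SAS data set file.

```python
import re


def embedded_ascii(data, min_length=3):
    """Return the runs of at least min_length printable ASCII characters found in data."""
    # \x20-\x7e covers the space character through the tilde.
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_length)
    return [run.decode("ascii") for run in pattern.findall(data)]


# Fabricated bytes for illustration: structural text mixed with non-printable bytes.
sample = b"\x00\x01SAS FILE\x00\x02DATA\x00WIN\x00\xff\x10AGE\x00"

print(embedded_ascii(sample))  # ['SAS FILE', 'DATA', 'WIN', 'AGE']
```

If a search term appears in this list for every file of a given type, the hits are more likely to come from the file structure than from the data you would see in the default application.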

E-MAIL Searching through emails has become a popular pastime, not because it is fun to do, but because email contains so much pertinent information and the search is needed so often. Sending and receiving emails is an important way of communicating with co-workers, colleagues, and clients. However, keeping numerous emails organized for easy access in the future can be an overwhelming and time-consuming task. Wouldn’t it be nice to be able to search through emails and their attachments? For example, you could retrieve all emails relevant to a given protocol, or search for emails pertaining to ‘computing baseline’ or ‘PT 101203’. It is also useful to be able to request mail by “From:”, “Subject:”, or “To:”. With a desktop search tool, this is done quickly and effectively. The e-mail programs Outlook and Outlook Express both offer search tools with flexible options and criteria; however, the search process can take a long time. Fortunately, other desktop search tools such as MS-Office Search, Yahoo! Desktop Search 1.2, X1 04.09, Google Desktop Search 2.0 Beta, Windows Desktop Search 2.1, and Copernic Desktop Search 1.2 can rapidly search through emails and their attachments. Some tools can even search inside zipped file attachments.

CHARACTER VERSUS NUMERIC DATA Most of the search tools discussed find matches to character data more easily than matches to numeric data. This is mostly because character data is stored in the common ASCII format, while numeric data is not. Since HTML and RTF files store data in the ASCII format, the majority of the search tools reviewed in this paper would find matches in these file types in the same manner as in the character text example. Few tools can scan through numeric data.


MS Office can search through multiple Excel files. The SAS Viewer and SAS System can easily scan numeric data within SAS data sets, but they can handle only one data set at a time. We overcame this weakness by creating a SAS data set search tool that can scan for numeric or character data in multiple data sets from multiple directories. Since all these tools use a GUI, programming is not required and they are suitable for use by personnel who are unfamiliar with SAS.
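The difference between character and numeric storage can be illustrated with a short Python sketch. It assumes, purely for illustration, that a numeric value is held as an 8-byte IEEE 754 floating-point number in little-endian byte order. This is an assumption and not a description of the actual SAS data set layout, and the helper names are hypothetical.

```python
import struct


def as_text_bytes(value):
    """Bytes of the value as it would appear when written out as ASCII text."""
    return str(value).encode("ascii")


def as_double_bytes(value):
    """Bytes of the value as an 8-byte little-endian double (assumed layout)."""
    return struct.pack("<d", float(value))


needle = b"2005"                    # what a plain text search looks for
text_form = as_text_bytes(2005)     # how the value looks in HTML, RTF or a .lst file
binary_form = as_double_bytes(2005)  # how the value might look in a binary data file

print(needle in text_form)    # True
print(needle in binary_form)  # False

# A tool that knows the storage format can search for the encoded value instead.
print(as_double_bytes(2005) in (b"\x00\x00" + binary_form + b"\x00"))  # True
```

Under this assumption, the ASCII digits “2005” never appear in the stored bytes, so a tool that only scans for ASCII text cannot find the value. A tool must understand the numeric storage format, as the SAS Viewer and SAS System do, before it can match numeric data.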

ADVANCED SEARCH PARAMETERS In both of our examples, we specified a simple search parameter, looking for matches that contained the text “chi-square” or numeric values containing “2005”. Some search tools allow you to specify advanced search parameters, such as text that contains “chi-square” or “significant”, or values between 2001 and 2005. Other tools allow you to select files based on a portion of the filename, the date the file was last modified, or the file size.
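The advanced parameters described above can be sketched in Python as follows. This is a minimal illustration and not the implementation of any tool reviewed in this paper. The function names and their parameters (`name_pattern`, `modified_after`, `min_size`) are hypothetical.

```python
import fnmatch
import os
import re


def find_files(root, name_pattern="*", modified_after=None, min_size=0):
    """Yield paths under root filtered by filename pattern, date and size.

    modified_after is a timestamp in seconds since the epoch, or None.
    """
    for folder, _dirs, files in os.walk(root):
        for name in files:
            # Compare in lower case so that "*.lst" also matches "T2.LST".
            if not fnmatch.fnmatch(name.lower(), name_pattern.lower()):
                continue
            path = os.path.join(folder, name)
            try:
                info = os.stat(path)
            except OSError:
                continue  # skip files that cannot be read
            if modified_after is not None and info.st_mtime < modified_after:
                continue
            if info.st_size < min_size:
                continue
            yield path


def text_matches(text, any_of):
    """True if text contains at least one of the phrases (case-insensitive OR)."""
    lowered = text.lower()
    return any(phrase.lower() in lowered for phrase in any_of)


def numbers_in_range(text, low, high):
    """Return the whole numbers written in text that lie between low and high."""
    values = (int(token) for token in re.findall(r"\b\d+\b", text))
    return [value for value in values if low <= value <= high]


line = "Chi-square test, data cut 2003, n=150"
print(text_matches(line, ["chi-square", "significant"]))  # True
print(numbers_in_range(line, 2001, 2005))                 # [2003]
```

For example, `find_files("C:/study", "*.lst", min_size=1024)` would list the .lst files of at least 1 KB under a hypothetical study directory, and each file's text could then be passed to `text_matches` or `numbers_in_range`.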

CONCLUSION A search application is a practical tool for SAS programmers who manage many files. We chose to discuss the search tools that you probably already have access to, or can easily obtain. The best search tool for you depends on which feature(s) you consider the most important. In Table 1, we have given you some examples of popular desktop search tools, and presented features which we think are worthy of consideration. The UWEBC study and Table 1 both rate the performance of different search tools based on many features. If you would like further information on any of the search tools discussed in this paper, see the ‘contact information’ in Table 1, or refer to the UWEBC study. Unfortunately, many search tools have shortcomings. For example, many tools cannot search through numeric data held in SAS data sets. We found the best features among all the search tools to be:

1. Ease of use
2. Ability to search through multiple directories, and multiple file types
3. Ability to use advanced search parameters
4. Links to matching file and matching text within that file

REFERENCES
(1) Yeh, E. (2004), “A SAS Database Search Engine Gives Everyone the Power to Know,” Proceedings of the PharmaSUG 2004 Conference, Paper AD02.
(2) Yeh, E., Verrill, D. (2005), “Use a SAS Database Search Engine Instead of Writing a Program,” Proceedings of the SESUG 2005 Conference, Paper SE11_05.
(3) Noda, T., Helwig, S. (April 20, 2005), “Benchmark Study of Desktop Search Tools – There’s More to Search than Google and Yahoo! – An Evaluation of 12 Leading Desktop Search Tools”. UW E-Business Consortium. http://www.uwebi.org/reports/desktop_search.pdf (accessed on 06Jan2006).
(4) Dunn, S. (April 2005), “Find Your Local Files the Web Way in Windows”. PCWorld, pages 138-140. http://www.pcworld.com/howto/article/0,aid,119583,00.asp (accessed on 015Jan2006).
(5) ‘grep’. http://www.umcs.maine.edu/~www-adm/docs/unixdosc/node55.html (accessed on 06Jan2006).

CONTACT INFORMATION Your comments and questions are valued and encouraged. Contact the authors at:

Eugene Yeh
ASG Inc.
Work Phone: 919-466-8219
Email: [email protected]

Angela Ringelberg
PharmaNet Inc.
1001 Winstead Drive, Suite 505
Cary, NC 27513
Work Phone: 609-951-6569
Email: [email protected]


SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration. Other brand and product names are trademarks of their respective companies.
