Perl Regular Expressions in SAS 9
Ruth Yuee Zhang, CFE
Jan 10, 2005
Outline
• Introduction
• SAS Syntax
• Meta-Characters
• Examples
Introduction
Regular Expressions
• A powerful tool for manipulating text data, supported by e.g. Perl, Java, PHP and Emacs.
• Locate a pattern in text strings
• Obtain the position of the pattern
• Extract a substring
• Substitute one string for another
Introduction
• SAS Regular Expressions
RX functions: RXPARSE, RXMATCH, RXCHANGE etc.
• Perl Regular Expressions
PRX functions: PRXPARSE, PRXMATCH, PRXCHANGE, PRXPOSN, PRXDEBUG, etc.
SAS Syntax: Function PRXPARSE
PRXPARSE (perl-regular-expression);
Compiles a Perl regular expression and returns a pattern identifier that the other PRX functions use later.
perl-regular-expression: the Perl regular expression to compile, e.g. "/cat/".
SAS Syntax
Data _NULL_;
   ** create the regular expression only once **;
   if _N_ = 1 then myregex = PRXPARSE("/cat/");
   ** case-sensitive match for the text "cat" anywhere in the string **;
   retain myregex;
   input string $30.;
   ** matching the regular expression **;
   position = PRXMATCH(myregex, string);
   put position=;
datalines;
It is a cat
Does not match a CAT
cat in the beginning
;
Run;
Position: 9 0 1
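The three positions above can be checked outside SAS. The following is a minimal sketch in Python, whose `re` module is used here only as a stand-in for the Perl pattern; the helper name `prxmatch_like` is hypothetical and simply mimics PRXMATCH's convention of 1-based positions with 0 for "not found".

```python
import re

def prxmatch_like(pattern: str, text: str) -> int:
    """Return the 1-based position of the first match, or 0 if there is none."""
    m = re.search(pattern, text)
    # re reports 0-based offsets, so add 1 to mirror the SAS convention
    return m.start() + 1 if m else 0

for line in ("It is a cat", "Does not match a CAT", "cat in the beginning"):
    print(prxmatch_like("cat", line))  # prints 9, then 0, then 1
```

The second line returns 0 because the pattern is case-sensitive unless an ignore-case option is given.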
Meta-Characters
Position Characters ^ and $
• “^cat”: matches “cat” at the beginning of a string. Matches “cat” and “cats” but not “the cat”.
• “cat$”: matches “cat” at the end of a string. Matches “the cat” and “cat” but not “cat in the hat”.
• “^cat$”: a string that starts and ends with “cat”, which can only be “cat” itself.
• “cat”: a string that has the text “cat” anywhere in it. Matches “cat”, “cats”, “the cat”, “catch”.
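As a quick check of the anchor rules, here is a small sketch in Python; its `re` module treats `^` and `$` the same way for these single-line strings. The helper `matches` is a hypothetical name, not a SAS function.

```python
import re

def matches(pattern: str, text: str) -> bool:
    """True if the pattern is found anywhere in text (like PRXMATCH > 0)."""
    return re.search(pattern, text) is not None

samples = ("cat", "cats", "the cat", "cat in the hat", "catch")
for pattern in ("^cat", "cat$", "^cat$", "cat"):
    # list the sample strings that each pattern accepts
    print(pattern, [s for s in samples if matches(pattern, s)])
```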
Meta-Characters
• “\d”: matches a digit 0 to 9. “\d\d\d” matches any three-digit number (“123”, “389”).
• “\w”: matches a word character: any upper- or lower-case letter, digit, or underscore (not a blank). “\w\w\w” matches any three-character word (“cat”, “a_1”).
• “\s”: matches a white-space character such as a blank or a tab. “\d\s\w” matches “1 a”, “6 x”.
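The character classes can be tried out with a short Python sketch. Python's `re` is an assumption here, standing in for the Perl engine; the `re.ASCII` flag restricts `\d`, `\w` and `\s` to their plain ASCII meanings, and `found` is a hypothetical helper.

```python
import re

def found(pattern: str, text: str) -> bool:
    """True if the pattern occurs somewhere in text."""
    return re.search(pattern, text, flags=re.ASCII) is not None

print(found(r"\d\d\d", "389"))   # True: three digits in a row
print(found(r"\w\w\w", "a_1"))   # True: letter, underscore and digit are all word characters
print(found(r"\w", " "))         # False: a blank is not a word character
print(found(r"\d\s\w", "6 x"))   # True: digit, white space, word character
```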
Meta-Characters
Quantifiers *, + and ?
• “c(at)*”: matches a string that has a “c” followed by zero or more “at” (“c”, “cat”, “catatat”);
• “c(at)+”: same, but there's at least one “at” (“cat”, “catat”, etc.);
• “c(at)?”: same, but there's zero or one “at” (“c”, “cat”);
• “c?a+t$”: an optional “c”, followed by one or more “a”, followed by a “t” at the end of the string (“cat”, “at”, “aaat”).
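A brief sketch of these three quantifiers, again using Python's `re` as a stand-in for the Perl syntax. `re.fullmatch` is used so that the whole string has to fit the pattern; the helper `whole` is a hypothetical name.

```python
import re

def whole(pattern: str, text: str) -> bool:
    """True if the entire text fits the pattern."""
    return re.fullmatch(pattern, text) is not None

print(whole(r"c(at)*", "c"), whole(r"c(at)*", "catatat"))  # True True: zero or more "at"
print(whole(r"c(at)+", "c"), whole(r"c(at)+", "catat"))    # False True: at least one "at"
print(whole(r"c(at)?", "cat"), whole(r"c(at)?", "catat"))  # True False: at most one "at"
print(whole(r"c?a+t", "aaat"), whole(r"c?a+t", "ct"))      # True False: needs at least one "a"
```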
Meta-Characters
Quantifiers {n}, {n,} and {n,m}
• “\d{3}”: matches exactly three digits and is equivalent to “\d\d\d”
• “\w{3,}”: matches three or more word characters and is equivalent to “\w\w\w+” (“cat”, “_NULL_”)
• “\w{3,5}”: matches at least three but no more than five word characters (“cat”, “cats”, “catch”)
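The counted quantifiers behave the same way in Python's `re`, which is used below purely as an illustration. The patterns are matched against the whole string so the upper and lower bounds are visible; `fits` is a hypothetical helper.

```python
import re

def fits(pattern: str, text: str) -> bool:
    """True if the entire text fits the pattern."""
    return re.fullmatch(pattern, text) is not None

print(fits(r"\d{3}", "123"), fits(r"\d{3}", "1234"))          # True False: exactly three digits
print(fits(r"\w{3,}", "_NULL_"), fits(r"\w{3,}", "ab"))       # True False: three or more
print(fits(r"\w{3,5}", "catch"), fits(r"\w{3,5}", "catcher")) # True False: three to five
```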
Meta-Characters
• “.”: matches exactly one character of any kind. “c.t” matches “cat”, “cut”, “cot”, “cit”.
• “c(a|u)t”: matches “cat”, “cut”
• “c[auo]t”: matches “cat”, “cut”, “cot”
• “[a-e]”: matches any one of the letters “a” to “e”. “c[a-e]t” matches “cat”, “cbt”, “cct”
• “[^abc]”: matches any one character except “a”, “b” or “c”. “c[^abc]t” matches “cut”, “cot” but not “cat”, “cbt”
Ex #1 A Simple Search
** create the regular expression only once **;
Retain myregex;
If _N_ = 1 then do;
   myregex = PRXPARSE("/m[ea]th[ea][dt]one?/i");
   /* "e?": zero or one "e"
      "i": ignore case when matching */
end;
** create a flag of whether matching or not **;
myflag = min(PRXMATCH(myregex, drugname), 1);
Matched: methadone, Metheton, methadon, mathatone, METHEDONE, METHADON
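To see that the pattern really accepts every spelling listed, the same regular expression can be run in Python. This is only an illustrative sketch: `re.IGNORECASE` plays the role of the trailing “i”, and the function name `flag` and the non-matching test value are made up for the example.

```python
import re

METHADONE = re.compile(r"m[ea]th[ea][dt]one?", re.IGNORECASE)

def flag(drugname: str) -> int:
    """Return 1 if the name looks like a spelling of methadone, else 0."""
    return 1 if METHADONE.search(drugname) else 0

names = ["methadone", "Metheton", "methadon", "mathatone", "METHEDONE", "METHADON"]
print([flag(n) for n in names])  # every listed spelling is flagged with 1
print(flag("morphine"))          # 0: hypothetical non-matching name
```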
Function PRXMATCH
PRXMATCH ( pattern-id, string);
Returns the first position in the string where the regular expression match is found. If the pattern is not found, it returns 0.
• pattern-id: the value returned from the PRXPARSE function.
• string: the character variable or expression to be searched.
Ex #2 Validating the format
• A sample of the data:
   Hydro-Chlorothiazide 25.5
   Ziagen 200mg
   Zerit mg
   Insulin 20 cc
   Dapsone 100 g
   Kaletra 3 tabs
• Improperly formatted data:
   Hydro-Chlorothiazide 25.5
   Zerit mg
Ex #2 Validating the format
** create the regular expression **;
myregex = PRXPARSE("/^\D+\d{1,4}\.?\d{0,4}\s?(tabs?|caps?|cc|m?g)/i");
/* "^\D+": starts with a group of non-digits
   "\d{1,4}": followed by one to four digits
   "\.?": an optional period
   "\d{0,4}": may be followed by up to four more digits
   "\s?": an optional space
   "(tabs?|caps?|cc|m?g)": units of measure: tab, tabs, cap, caps, cc, mg, g
   "/i": ignore the case */
** catch poorly formatted data **;
If PRXMATCH(myregex, medication) = 0;
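The validation rule can be replayed on the six sample records. The sketch below uses Python's `re` as a stand-in for PRXPARSE/PRXMATCH, with the same pattern text; the variable names are illustrative only.

```python
import re

# Same pattern as in the SAS step; re.IGNORECASE corresponds to the "/i" option
DOSE = re.compile(r"^\D+\d{1,4}\.?\d{0,4}\s?(tabs?|caps?|cc|m?g)", re.IGNORECASE)

medications = [
    "Hydro-Chlorothiazide 25.5",
    "Ziagen 200mg",
    "Zerit mg",
    "Insulin 20 cc",
    "Dapsone 100 g",
    "Kaletra 3 tabs",
]

# Keep only the records where no match is found (PRXMATCH = 0 in SAS)
poorly_formatted = [m for m in medications if DOSE.search(m) is None]
print(poorly_formatted)  # ['Hydro-Chlorothiazide 25.5', 'Zerit mg']
```

The first record fails because it has a dose but no unit, the second because it has a unit but no dose.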
Ex #3 Extracting Text
To extract what the patients are reporting to the investigators:
Patient reported headache and nausea. MD noticed rash.
Pt. Rptd. Backache.
Patient reported seeing spots.
Elevated pulse and labored breathing.
Headache.
Extracted field:
headache and nausea
Backache
seeing spots
Function CALL PRXPOSN
CALL PRXPOSN (pattern-id, capture-buffer-number, start <, length>);
Returns the position and length for a capture buffer. Used in conjunction with PRXPARSE and PRXMATCH.
• pattern-id: the value returned from the PRXPARSE function.
• capture-buffer-number: a number indicating which capture buffer is to be evaluated.
• start: the value of the first position where the particular capture buffer is found.
• length: the length of the text matched by that capture buffer.
Ex #3 Extracting Text
** create the regular expression **;
myregex = PRXPARSE("/(reported|rptd?\.?) (.*?\.)/i");
/* "(reported|rptd?\.?)": 1st capture buffer. Captures the word "reported", "rpt", "rpt.", "rptd" or "rptd."
   " ": a single blank separates the two capture buffers
   "(.*?\.)": 2nd capture buffer. Captures any characters up to and including the first period; without the "?", ".*" is greedy and would run to the last period in the comment
   "/i": ignore the case */
Ex #3 Extracting Text
** only call PRXPOSN if matching **;
if PRXMATCH(myregex, comments) then do;
   /* get the position and length of the match of the 2nd capture buffer */
   CALL PRXPOSN(myregex, 2, pos, len);
   /* extract the substring excluding the end period */
   pt_comments = substr(comments, pos, len - 1);
End;
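The extraction can be traced on the five sample comments with a short Python sketch. Two assumptions are made and should be read as such: a single blank is taken to separate the two capture buffers, and the second buffer is written non-greedily as `.*?` so that it stops at the first period. A greedy `.*` would run on to the last period of the first comment, as the final line shows. The name `patient_report` is hypothetical.

```python
import re

REPORT = re.compile(r"(reported|rptd?\.?) (.*?\.)", re.IGNORECASE)

def patient_report(comment: str):
    """Return the text of the 2nd capture buffer without its closing period, or None."""
    m = REPORT.search(comment)
    return m.group(2)[:-1] if m else None

comments = [
    "Patient reported headache and nausea. MD noticed rash.",
    "Pt. Rptd. Backache.",
    "Patient reported seeing spots.",
    "Elevated pulse and labored breathing.",
    "Headache.",
]
for c in comments:
    print(patient_report(c))

# For comparison: the greedy form captures through the last period
greedy = re.search(r"(reported|rptd?\.?) (.*\.)", comments[0], re.IGNORECASE).group(2)
print(greedy)
```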
Ex #4 Substitute one string for another
Replace all of the following with “Multi-vitamin”:
multivitamin
multi-vitamin
multi-vita
multivit
multi-vit
multi vitamin
Function CALL PRXCHANGE
CALL PRXCHANGE (pattern-id, times, old-string <,new-string <,result-length <, truncation-value <, number-of-changes>>>>);
Substitutes one string for another.
• times: the number of times to search for and replace a string.
• old-string: the string in which the replacement is made. If new-string is omitted, the result is written back to old-string.
Ex #4 Substitute one string for another
** create regular expression **;
myregex = PRXPARSE("s/multi[- ]?vita?(min)?/Multi-vitamin/");
/* "s/": indicates that the regular expression will be used in a substitution
   "[- ]?": optional "-" or space
   "a?": optional "a"
   "(min)?": optional "min" */
Ex #4 Substitute one string for another
** using the myregex id created above **;
CALL PRXCHANGE(myregex, -1, drugname);
/* "-1" indicates that the pattern should be changed at every occurrence */
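The substitution can be checked against all six spellings. In this sketch Python's `re.sub` stands in for CALL PRXCHANGE; by default it replaces every occurrence, which corresponds to a times value of -1. The function name `standardize` is illustrative.

```python
import re

def standardize(drugname: str) -> str:
    """Replace every multivitamin spelling variant with "Multi-vitamin"."""
    return re.sub(r"multi[- ]?vita?(min)?", "Multi-vitamin", drugname)

variants = ["multivitamin", "multi-vitamin", "multi-vita",
            "multivit", "multi-vit", "multi vitamin"]
print({standardize(v) for v in variants})          # {'Multi-vitamin'}
print(standardize("multivit and multi-vita daily"))  # both occurrences are replaced
```

The pattern is case-sensitive, so a value that is already “Multi-vitamin” is left unchanged.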
Ex #5 Finding digits in random positions
String                      x1   x2   x3   x4   x5
This 45 lines 98 has 3 s    45   98   3    .    .
None here                   .    .    .    .    .
12 34 78 90                 12   34   78   90   .
Weight 60kg 132pound        60   132  .    .    .
Function CALL PRXNEXT
CALL PRXNEXT (pattern-id, start, stop, string, position, length);
Locates successive occurrences of a pattern: each call finds the next occurrence.
• start: the starting position to begin the search; after a successful match it is updated to the position just past that match
• stop: the last position in the string for the search
• string: the string to be searched
• position: the starting position of the occurrence found (0 if there are no more)
• length: the length of the matched text
Ex #5 Finding digits in random positions
** create the regular expression **;
myregex = PRXPARSE("/\d+/");
** "\d+": look for one or more digits **;
Start = 1;
Stop = length(string);
** get the position and length of the first occurrence **;
Call PRXNEXT(myregex, start, stop, string, pos, len);
Ex #5 Finding digits in random positions
Array x[5];
** continue until no more digits are found (pos=0) **;
Do i = 1 to 5 while (pos gt 0);
   ** extract the current occurrence **;
   X[i] = input(substr(string, pos, len), 9.);
   ** get the position and length of the next occurrence **;
   Call PRXNEXT(myregex, start, stop, string, pos, len);
End;
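The loop over successive matches can be mirrored in a few lines of Python, shown here only as a sketch of the idea. `re.finditer` yields each match in turn, much like repeated calls to PRXNEXT; `None` stands in for a SAS missing value, and the name `first_numbers` is hypothetical.

```python
import re

def first_numbers(text: str, slots: int = 5) -> list:
    """Return up to `slots` numbers found in text, padded with None."""
    found = [int(m.group()) for m in re.finditer(r"\d+", text)][:slots]
    return found + [None] * (slots - len(found))

for s in ("This 45 lines 98 has 3 s", "None here",
          "12 34 78 90", "Weight 60kg 132pound"):
    print(first_numbers(s))
```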
Ex #6 Locating zip codes
String:
John Smith 12 Broad street Flemington, NJ 08822
Philip Judson Apt #1, Building 7777 Route 730 Kerrville, TX 78028
Dr. Roger Alan 44 Commonwealth Ave. Boston, MA 02116-7364
Zip_code:
08822
78028
02116-7364
Function CALL PRXSUBSTR
CALL PRXSUBSTR (pattern-id, string, start <,length>);
Returns the starting position and the length of the match.
• string: the string to be searched
• start: the starting position of the match (0 if the pattern is not found)
• length: the length of the matched substring
Ex #6 Locating zip codes
** create the regular expression **;
myregex = PRXPARSE("/ \d{5}(-\d{4})?/");
/* match a blank followed by 5 digits followed by either nothing or a dash and 4 digits
   "\d{5}": matches 5 digits
   "-": matches a dash
   "\d{4}": matches 4 digits
   "?": matches zero or one of the preceding subexpression */
Ex #6 Locating zip codes
Call PRXSUBSTR(myregex, string, start, length);
** only extract the substring if the pattern is found **;
** the match begins with the blank, so the zip code starts one position later **;
If start gt 0 then zip_code = substrn(string, start + 1, length - 1);
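A compact way to check the zip-code pattern is to run it in Python, used here as a stand-in for PRXPARSE and CALL PRXSUBSTR. Dropping the first character of the match corresponds to `start+1` and `length-1` in the SAS step; the function name `zip_code` and the last test string are made up for the example.

```python
import re

ZIP = re.compile(r" \d{5}(-\d{4})?")

def zip_code(text: str):
    """Return the first 5-digit or ZIP+4 code that follows a blank, or None."""
    m = ZIP.search(text)
    # the match starts at the blank, so skip its first character
    return m.group()[1:] if m else None

print(zip_code("John Smith 12 Broad street Flemington, NJ 08822"))
print(zip_code("Dr. Roger Alan 44 Commonwealth Ave. Boston, MA 02116-7364"))
print(zip_code("No postal code in this line"))
```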
References
• SAS Functions by Example, Ron Cody
• An Introduction to Regular Expression with Examples from Clinical Data, Richard Pless, Ovation Research Group, Highland Park, IL
• How Regular Expressions Really Work, Jack N. Shoemaker, Greensboro, NC