1 Perl Regular Expressions in SAS 9 Ruth Yuee Zhang, CFE Jan 10, 2005

1

Perl Regular Expressions in SAS 9

Ruth Yuee Zhang, CFE

Jan 10, 2005

2

Outline

• Introduction

• SAS Syntax

• Meta-Characters

• Examples

3

Introduction

Regular Expressions

• A powerful tool for manipulating text data, available in e.g. Perl, Java, PHP and Emacs.

• Locate a pattern in text strings

• Obtain the position of the pattern

• Extract a substring

• Substitute one string for another

4

Introduction

• SAS Regular Expressions

RX functions: RXPARSE, RXMATCH, RXCHANGE etc.

• Perl Regular Expressions

PRX functions: PRXPARSE, PRXMATCH, PRXCHANGE, PRXPOSN, PRXDEBUG, etc.

5

SAS Syntax: Function PRXPARSE

PRXPARSE (perl-regular-expression);

Compiles a Perl regular expression and returns a pattern identifier that the other Perl regular expression functions use later.

perl-regular-expression: define the Perl regular expression.

6

SAS Syntax

data _null_;
   ** create the regular expression only once **;
   retain myregex;
   if _N_ = 1 then myregex = PRXPARSE("/cat/");
   ** the pattern matches the lowercase text cat anywhere in the string **;

   input string $30.;
   ** matching the regular expression **;
   position = PRXMATCH(myregex, string);
   put position=;
datalines;
It is a cat
Does not match a CAT
cat in the beginning
;
run;

Position: 9 0 1

7

Meta-Characters

Position Characters ^ and $

• “^cat”: matches the beginning of a string. Matches “cat” and “cats” but not “the cat”

• “cat$”: matches the end of a string. Matches “the cat” and “cat” but not “cat in the hat”

• “^cat$”: a string that starts and ends with “cat”; that could only be “cat” itself!

• “cat”: a string that has the text “cat” in it. Matches “cat”, “cats”, “the cat”, “catch”
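The anchors above can be tried out with a short DATA step. This is only a sketch: the variable names (re_start, re_end, re_whole and the three flags) are made up for illustration. It also assumes the values are read into a fixed-length character variable, so TRIM is applied first; otherwise the trailing blanks that pad the variable would sit between “cat” and the end of the string, and “cat$” would not match.

```sas
data _null_;
   /* compile each pattern once and keep the ids across iterations */
   retain re_start re_end re_whole;
   if _n_ = 1 then do;
      re_start = prxparse('/^cat/');    /* "cat" at the beginning      */
      re_end   = prxparse('/cat$/');    /* "cat" at the end            */
      re_whole = prxparse('/^cat$/');   /* the string is exactly "cat" */
   end;

   infile datalines truncover;
   input string $30.;

   /* TRIM removes the trailing blanks that pad the variable */
   starts = (prxmatch(re_start, trim(string)) > 0);
   ends   = (prxmatch(re_end,   trim(string)) > 0);
   whole  = (prxmatch(re_whole, trim(string)) > 0);

   put string= starts= ends= whole=;
datalines;
cat
cats
the cat
cat in the hat
;
run;
```

Each flag is 1 when the pattern matches and 0 otherwise, so the log can be compared line by line with the bullets above.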

8

Meta-Characters

• “\d”: matches a digit 0 to 9. “\d\d\d” matches any three-digit number (123, 389)

• “\w”: matches a word character: any upper or lower case letter, a digit or an underscore (but not a blank). “\w\w\w” matches any three-letter word

• “\s”: matches a white space character such as a blank or a tab. “\d\s\w” matches “1 a”, “6 x”.

9

Meta-Characters

Quantifiers *, + and ?

• “c(at)*”: matches a string that has a “c” followed by zero or more “at” (“c”, “cat”, “catatat”);

• “c(at)+”: same, but there's at least one “at” (“cat”, “catat”, etc.);

• “c(at)?”: same, but there's zero or one “at” (“c”, “cat”);

• “c?a+t$”: a possible “c” followed by one or more “a” ending with “t” (“cat”, “at”, “aaat”).

10

Meta-Characters

Quantifiers

• “\d{3}”: matches any 3-digit number and is equivalent to “\d\d\d”

• “\w{3,}”: matches words of 3 or more characters and is equivalent to “\w\w\w+” (“cat”, “_NULL_”)

• “\w{3,5}”: matches words of at least 3 but no more than 5 characters (“cat”, “cats”, “catch”)
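A small DATA step can be used to compare the quantifiers on a few test words. This is a sketch, not part of the original slides: the variable names are illustrative, and every pattern is anchored with “^” and “$” (an assumption made here) so that the whole word has to fit the pattern rather than just a part of it.

```sas
data _null_;
   retain re_star re_plus re_range;
   if _n_ = 1 then do;
      re_star  = prxparse('/^c(at)*$/');    /* "c" then zero or more "at" */
      re_plus  = prxparse('/^c(at)+$/');    /* "c" then one or more "at"  */
      re_range = prxparse('/^\w{3,5}$/');   /* 3 to 5 word characters     */
   end;

   infile datalines truncover;
   input word $10.;

   /* TRIM drops the padding blanks so that "$" sees the real end */
   star  = (prxmatch(re_star,  trim(word)) > 0);
   plus  = (prxmatch(re_plus,  trim(word)) > 0);
   range = (prxmatch(re_range, trim(word)) > 0);

   put word= star= plus= range=;
datalines;
c
cat
catatat
catch
_NULL_
;
run;
```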

11

Meta-Characters

• “.”: matches exactly one character, whatever it is. “c.t” matches “cat”, “cut”, “cot”, “cit”.

• “c(a|u)t”: matches “cat”, “cut”

• “c[auo]t”: matches “cat”, “cut”, “cot”

• “[a-e]”: matches one of the letters “a” to “e”. “c[a-e]t” matches “cat”, “cbt”, “cct”

• “[^abc]”: matches any single character except “a”, “b” or “c”. “c[^abc]t” matches “cut”, “cot” but not “cat”, “cbt”
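The difference between the dot, a character class and a negated class can be checked the same way. The sketch below uses illustrative variable names and anchors each pattern to the whole (trimmed) word.

```sas
data _null_;
   retain re_dot re_set re_neg;
   if _n_ = 1 then do;
      re_dot = prxparse('/^c.t$/');        /* any one character in the middle */
      re_set = prxparse('/^c[auo]t$/');    /* a, u or o in the middle         */
      re_neg = prxparse('/^c[^abc]t$/');   /* anything but a, b or c          */
   end;

   infile datalines truncover;
   input word $10.;

   dot = (prxmatch(re_dot, trim(word)) > 0);
   set = (prxmatch(re_set, trim(word)) > 0);
   neg = (prxmatch(re_neg, trim(word)) > 0);

   put word= dot= set= neg=;
datalines;
cat
cut
cot
cbt
;
run;
```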

12

Ex #1 A Simple Search

** create the regular expression only once **;
retain myregex;
if _N_ = 1 then do;
   myregex = PRXPARSE("/m[ea]th[ea][dt]one?/i");
   /* "e?": zero or one "e"
      "i": ignore case when matching */
end;

** create a flag of whether matching or not **;
myflag = min(PRXMATCH(myregex, drugname), 1);

Matched: methadone, Metheton, methadon, mathatone, METHEDONE, METHADON
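For reference, the fragment above could sit in a complete DATA step roughly as follows. The data set name drugs and the non-matching value morphine are made up for this sketch; the other values come from the matched list on the slide.

```sas
data drugs;
   /* compile the pattern once and keep its id for every row */
   retain myregex;
   if _n_ = 1 then
      myregex = prxparse('/m[ea]th[ea][dt]one?/i');

   infile datalines truncover;
   input drugname $20.;

   /* PRXMATCH returns a position (0 when there is no match);
      MIN caps it at 1, giving a 0/1 flag */
   myflag = min(prxmatch(myregex, drugname), 1);

   drop myregex;
datalines;
methadone
Metheton
METHADON
morphine
;
run;

proc print data=drugs;
run;
```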

13

Function PRXMATCH

PRXMATCH ( pattern-id, string);

Returns the first position in the string where the regular expression match is found. If the pattern is not found, it returns 0.

• pattern-id: the value returned from the PRXPARSE function.

• string: the character variable or expression to be searched.

14

Ex #2 Validating the format

• A sample of the data:
   Hydro-Chlorothiazide 25.5
   Ziagen 200mg
   Zerit mg
   Insulin 20 cc
   Dapsone 100 g
   Kaletra 3 tabs

• Improperly formatted data:
   Hydro-Chlorothiazide 25.5
   Zerit mg

15

Ex #2 Validating the format

** create the regular expression **;
myregex = PRXPARSE("/^\D+\d{1,4}\.?\d{0,4}\s?(tabs?|caps?|cc|m?g)/i");
/* "^\D+": starts with a group of non-digits
   "\d{1,4}": followed by one to four digits
   "\.?": an optional period
   "\d{0,4}": may be followed by up to four more digits
   "\s?": an optional space
   "(tabs?|caps?|cc|m?g)": units of measure: tab, tabs, cap, caps, cc, mg, g
   "/i": ignore the case */

** catch poorly formatted data **;
if PRXMATCH(myregex, medication) = 0;
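A complete version of this check, run on the sample values from the previous slide, might look like the sketch below. The data set name bad_format is an assumption; the subsetting IF keeps only the rows that fail the pattern.

```sas
data bad_format;
   retain myregex;
   if _n_ = 1 then
      myregex = prxparse('/^\D+\d{1,4}\.?\d{0,4}\s?(tabs?|caps?|cc|m?g)/i');

   infile datalines truncover;
   input medication $40.;

   /* subsetting IF: only rows that do NOT match are written out */
   if prxmatch(myregex, medication) = 0;

   drop myregex;
datalines;
Hydro-Chlorothiazide 25.5
Ziagen 200mg
Zerit mg
Insulin 20 cc
Dapsone 100 g
Kaletra 3 tabs
;
run;

proc print data=bad_format;
run;
```

With these six rows, the two values without a number-plus-unit pair (the first and the third) should be the ones that end up in bad_format.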

16

Ex #3 Extracting Text

To extract what the patients are reporting to the investigators:

Patient reported headache and nausea. MD noticed rash.

Pt. Rptd. Backache.

Patient reported seeing spots.

Elevated pulse and labored breathing.

Headache.

Extracted field:

headache and nausea

Backache

seeing spots

17

Function CALL PRXPOSN

CALL PRXPOSN (pattern-id, capture-buffer-number, start <, length>);

Returns the position and length for a capture buffer. Used in conjunction with PRXPARSE and PRXMATCH.

• pattern-id: the value returned from the PRXPARSE function.

• capture-buffer-number: a number indicating which capture buffer is to be evaluated.

• start: the value of the first position where the particular capture buffer is found.

• length: the length of the found pattern.

18


** create the regular expression **;
myregex = PRXPARSE("/(reported|rptd?\.?)\s*(.*?\.)/i");
/* "(reported|rptd?\.?)": 1st capture buffer. Captures the word
      "reported", "rpt", "rpt.", "rptd" or "rptd."
   "\s*": skips any blanks between the two buffers
   "(.*?\.)": 2nd capture buffer. Captures any characters up to
      and including the first period that follows
   "/i": ignore the case */

19


** only call PRXPOSN if matching **;
if PRXMATCH(myregex, comments) then do;
   /* get the position and length of the matching of 2nd capture buffer */
   CALL PRXPOSN(myregex, 2, pos, len);
   /* extract the substring excluding the end period */
   pt_comments = substr(comments, pos, len-1);
end;
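Putting the pieces of Ex #3 together gives something like the sketch below. Two things in it are assumptions made for the sketch rather than taken verbatim from the slides: the data set name extracted, and the exact pattern, which uses “\s*” to skip the blank after “reported” and a non-greedy “.*?” so that the second capture buffer stops at the first period instead of the last one.

```sas
data extracted;
   length pt_comments $60;
   retain myregex;
   if _n_ = 1 then
      myregex = prxparse('/(reported|rptd?\.?)\s*(.*?\.)/i');

   infile datalines truncover;
   input comments $80.;

   /* only look at the capture buffer when the pattern matched */
   if prxmatch(myregex, comments) then do;
      call prxposn(myregex, 2, pos, len);
      /* len - 1 drops the closing period; the guard avoids a
         zero-length SUBSTR when only a period was captured */
      if len > 1 then pt_comments = substr(comments, pos, len - 1);
   end;

   drop myregex pos len;
datalines;
Patient reported headache and nausea. MD noticed rash.
Pt. Rptd. Backache.
Patient reported seeing spots.
Elevated pulse and labored breathing.
Headache.
;
run;

proc print data=extracted;
run;
```

Rows without “reported” or one of its abbreviations keep a blank pt_comments.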

20

Ex #4 Substitute one string for another

Replace all the following by “Multi-vitamin”:

multivitamin

multi-vitamin

multi-vita

multivit

multi-vit

multi vitamin

21

Function CALL PRXCHANGE

CALL PRXCHANGE (pattern-id, times, old-string <,new-string <,result-length <, truncation-value <, number-of-changes>>>>);

To substitute one string for another.

• times: the number of times to search for and replace a string.

• old-string: the string that you want to replace. If new-string is omitted, the changes are made to old-string itself.

22


** create regular expression **;
myregex = PRXPARSE("s/multi[- ]?vita?(min)?/Multi-vitamin/");
/* "s/": indicates that the regular expression will be used in a substitution
   "[- ]?": optional "-" or space
   "a?": optional "a"
   "(min)?": optional "min" */

23


** using the myregex id created above **;
CALL PRXCHANGE(myregex, -1, drugname);
/* "-1" indicates that the pattern should be changed at every occurrence */
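As a complete step, the substitution could be run on the six spellings listed earlier. The data set name vitamins is made up for this sketch. Because no new-string argument is given, CALL PRXCHANGE writes the result back into drugname.

```sas
data vitamins;
   retain myregex;
   if _n_ = 1 then
      myregex = prxparse('s/multi[- ]?vita?(min)?/Multi-vitamin/');

   infile datalines truncover;
   input drugname $30.;

   /* -1 = replace every occurrence; drugname is changed in place */
   call prxchange(myregex, -1, drugname);

   drop myregex;
datalines;
multivitamin
multi-vitamin
multi-vita
multivit
multi-vit
multi vitamin
;
run;

proc print data=vitamins;
run;
```

The variable is read with a length of 30 so that the longer replacement text fits without truncation.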

24

Ex #5 Finding digits in random positions

String                      x1   x2   x3   x4   x5

This 45 lines 98 has 3 s    45   98    3    .    .

None here                    .    .    .    .    .

12 34 78 90                 12   34   78   90    .

Weight 60kg 132pound        60  132    .    .    .

25

Function CALL PRXNEXT

CALL PRXNEXT (pattern-id, start, stop, source, position, length);

Locates the nth occurrence of a pattern. Each call to the routine identifies the next occurrence of the pattern.

• start: the starting position to begin the search; each call moves it just past the match that was found, so the next call continues from there

• stop: the last position in the string for the search

• source: the string to be searched

• position: the starting position of the nth occurrence of the pattern (0 when there are no more)

• length: the length of the matched text

26


** create the regular expression **;
myregex = PRXPARSE("/\d+/");
** \d+ : look for one or more digits **;

start = 1;
stop = length(string);

** get the position and length of the first occurrence **;
call PRXNEXT(myregex, start, stop, string, pos, len);

27


array x[5];

** continue until no more digits are found (pos=0) **;
do i = 1 to 5 while (pos gt 0);
   ** extract the current occurrence **;
   x[i] = input(substr(string, pos, len), 9.);
   ** get the position and length of the next occurrence **;
   call PRXNEXT(myregex, start, stop, string, pos, len);
end;
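The two fragments of Ex #5 combine into a step along these lines. The data set name numbers is an assumption; the input strings are the four shown in the table for this example.

```sas
data numbers;
   retain myregex;
   array x[5];
   if _n_ = 1 then myregex = prxparse('/\d+/');

   infile datalines truncover;
   input string $40.;

   /* search the whole string, from the first to the last non-blank */
   start = 1;
   stop  = length(string);

   /* first occurrence; pos is 0 when nothing is found */
   call prxnext(myregex, start, stop, string, pos, len);

   do i = 1 to 5 while (pos > 0);
      /* convert the matched digits to a number */
      x[i] = input(substr(string, pos, len), 9.);
      /* start was advanced past the match, so this finds the next one */
      call prxnext(myregex, start, stop, string, pos, len);
   end;

   drop myregex start stop pos len i;
datalines;
This 45 lines 98 has 3 s
None here
12 34 78 90
Weight 60kg 132pound
;
run;

proc print data=numbers;
run;
```

Elements of x that are never assigned stay missing, which is where the dots in the table come from.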

28

Ex #6 Locating zip codes

String:
John Smith
12 Broad street
Flemington, NJ 08822
Philip Judson
Apt #1, Building 7
777 Route 730
Kerrville, TX 78028
Dr. Roger Alan
44 Commonwealth Ave.
Boston, MA 02116-7364

Zip_code:
08822
78028
02116-7364

29

Function CALL PRXSUBSTR

CALL PRXSUBSTR (pattern-id, string, start <,length>);

Returns the starting position and the length of the match.

• string: the string to be searched

• start: the starting position of the pattern (0 if the pattern is not found)

• length: the length of the substring

30

Ex #6 Locating zip codes

** create the regular expression **;
myregex = PRXPARSE("/ \d{5}(-\d{4})?/");
/* match a blank followed by 5 digits followed by either nothing or a dash and 4 digits
   "\d{5}": matches 5 digits
   "-": matches a dash
   "\d{4}": matches 4 digits
   "?": matches zero or one of the preceding subexpression */

31

Ex #6 Locating zip codes

call PRXSUBSTR(myregex, string, start, length);

** only extract the substring if the pattern is found **;
** the start position is after the blank **;
if start gt 0 then zip_code = substrn(string, start+1, length-1);
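A runnable version of Ex #6 might look like the following sketch. The data set name zips is made up, the length is stored in a variable called len rather than length (a naming choice for this sketch, to keep it apart from the LENGTH statement), and only a few of the address lines are used as input.

```sas
data zips;
   length zip_code $10;
   retain myregex;
   if _n_ = 1 then myregex = prxparse('/ \d{5}(-\d{4})?/');

   infile datalines truncover;
   input string $80.;

   /* start is 0 when the line holds no zip code */
   call prxsubstr(myregex, string, start, len);

   if start > 0 then do;
      /* the match begins at the blank, so skip one position */
      zip_code = substrn(string, start + 1, len - 1);
      output;
   end;

   keep zip_code;
datalines;
John Smith
12 Broad street
Flemington, NJ 08822
Kerrville, TX 78028
Boston, MA 02116-7364
;
run;

proc print data=zips;
run;
```

Only the lines that contain a zip code are written to the output data set.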

32

References

• SAS Functions by Example, Ron Cody

• An Introduction to Regular Expression with Examples from Clinical Data, Richard Pless, Ovation Research Group, Highland Park, IL

• How Regular Expressions Really Work, Jack N. Shoemaker, Greensboro, NC
