Learn beginner to intermediate level regular expressions with some examples in PHP
Regular Expressions in PHP
/(?:dave@davidstockton\.com)/
Front Range PHP User Group
David Stockton
What is a regular expression?
A pattern used to describe a part of some text
“Regular” has some implications for how a pattern can be built, but that’s not really part of this presentation
Extremely powerful and useful (And often abused)
Regex Joke
A programmer says, “I have a problem that I can solve with regular expressions.”
Now, he has two problems…
How to use regex in PHP
The preg_* functions: Perl-compatible regular expressions. Probably the most common regex syntax.
The ereg_* functions: POSIX-style regular expressions. I am not covering these functions. Don’t use the ereg ones. They are deprecated as of PHP 5.3 and were removed in PHP 7.0.
How can we use regex in PHP?
preg_match( ) – Searches a subject for a match
preg_match_all( ) – Searches a subject for all matches
preg_replace( ) – Searches a subject for a pattern and replaces it with something else
preg_split( ) – Split a string into an array based on a regex delimiter
preg_filter( ) – Identical to preg_replace except it returns only the subjects where there was a match
preg_replace_callback( ) – Like preg_replace, but replacement is defined in a callback
preg_grep( ) – Returns an array of array elements that match a pattern
preg_quote( ) – Quotes regular expression characters
preg_last_error( ) – Returns the error code of the last PCRE (Perl Compatible Regular Expression) function execution
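A quick sketch of several of these calls side by side. The sample strings and variable names are made up for illustration and are not from the talk.

```php
<?php
// Sample data invented for this sketch.
$subject = 'PHP 5.3 and PHP 7.4';

// preg_match(): returns 1 on the first match, 0 if there is none.
$found = preg_match('/\d\.\d/', $subject, $first);       // $first[0] is '5.3'

// preg_match_all(): returns how many matches were found.
$count = preg_match_all('/\d\.\d/', $subject, $all);     // $all[0] is ['5.3', '7.4']

// preg_replace(): replaces every match.
$masked = preg_replace('/\d\.\d/', 'x.y', $subject);     // 'PHP x.y and PHP x.y'

// preg_split(): splits on a regex delimiter.
$parts = preg_split('/\s+and\s+/', $subject);            // ['PHP 5.3', 'PHP 7.4']

// preg_grep(): keeps the array elements that match (keys are preserved).
$kept = preg_grep('/^PHP/', ['PHP 5.3', 'Perl 5']);      // [0 => 'PHP 5.3']

// preg_quote(): escapes regex metacharacters in literal text.
$quoted = preg_quote('5.3');                             // '5\.3'
```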
How can we use regex in PHP?
Those are the function calls, and we’ll play with them later.
First, we need to learn how to create regex patterns since we need those for any function call.
Starting Pattern
/[A-Z0-9\._+=-]+@[A-Z0-9\.-]+\.[A-Z]{2,4}/i
This matches a series of letters, numbers, plus signs, dashes, dots, underscores and equals signs, followed by an “AT” (@) sign, followed by a series of letters, numbers, dots and dashes, followed by a dot, followed by 2 to 4 letters.
In other words… It matches an email address… Or rather some email addresses.
Matching Email Addresses
What about [email protected]?
What about [email protected]?
Both of those are valid email addresses, but they fail because our pattern only allows 2-4 character TLD parts for the email address.
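A small sketch of that failure. It assumes a variant of the starting pattern with a + after the domain character class and with ^ and $ added so the whole string must match (without the anchors, the pattern would happily match just the first few letters of a long TLD). The addresses are hypothetical.

```php
<?php
// Assumed variant of the starting pattern: anchored, with + on the domain part.
$pattern = '/^[A-Z0-9._+=-]+@[A-Z0-9.-]+\.[A-Z]{2,4}$/i';

// Hypothetical addresses, used only for illustration.
$shortTld = preg_match($pattern, 'dave@example.com');     // 1: "com" has 3 letters
$longTld  = preg_match($pattern, 'dave@example.museum');  // 0: "museum" has 6 letters
```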
How can we match all valid email addresses and only valid email addresses?
The “real” email address regex
(?:(?:\r\n)?[ \t])*(?:(?:(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t] )+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?: \r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:( ?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\0 31]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\ ](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+ (?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?: (?:\r\n)?[ \t])*))*|(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z |(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n) ?[ \t])*)*\<(?:(?:\r\n)?[ \t])*(?:@(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\ r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n) ?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t] )*))*(?:,@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])* )(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t] )+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*) *:(?:(?:\r\n)?[ \t])*)?(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+ |\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r \n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?: \r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t ]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031 ]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\]( ?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(? 
:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(? :\r\n)?[ \t])*))*\>(?:(?:\r\n)?[ \t])*)|(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(? :(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)? [ \t]))*"(?:(?:\r\n)?[ \t])*)*:(?:(?:\r\n)?[ \t])*(?:(?:(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]| \\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<> @,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|" (?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t] )*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\ ".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(? :[^()<>@,;:\\".\[\] \000-
The “real” email address regex cont.
\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[ \]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*|(?:[^()<>@,;:\\".\[\] \000- \031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|( ?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)*\<(?:(?:\r\n)?[ \t])*(?:@(?:[^()<>@,; :\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([ ^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\" .\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\ ]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*(?:,@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\ [\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\ r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\] |\\.)*\](?:(?:\r\n)?[ \t])*))*)*:(?:(?:\r\n)?[ \t])*)?(?:[^()<>@,;:\\".\[\] \0 00-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\ .|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@, ;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(? :[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])* (?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\". 
\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[ ^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\] ]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*\>(?:(?:\r\n)?[ \t])*)(?:,\s*( ?:(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\ ".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:( ?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[ \["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t ])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t ])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(? :\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+| \Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*|(?: [^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\ ]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)*\<(?:(?:\r\n) ?[ \t])*(?:@(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\[" ()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n) ?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<> @,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*(?:,@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@, ;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t] )*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\ ".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*)*:(?:(?:\r\n)?[ \t])*)? (?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\". 
\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(?: \r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\[ "()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t]) *))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t]) +|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\ .(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z |(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*\>(?:( ?:\r\n)?[ \t])*))*)?;\s*)
So… How do we write this?
Don’t. Much simpler patterns have been written that will match 99.9% of valid email addresses.
Use something like Zend_Validate_EmailAddress
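If a framework validator is not available, PHP's built-in filter_var( ) with FILTER_VALIDATE_EMAIL is another option. This is a swapped-in alternative, not the validator named above, and the addresses are hypothetical.

```php
<?php
// filter_var() returns the value when it validates, or false when it does not.
$good = filter_var('dave@example.museum', FILTER_VALIDATE_EMAIL);
$bad  = filter_var('not an email', FILTER_VALIDATE_EMAIL);
```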
So now the real learnin’…
Letters and numbers match… letters and numbers
/a/ - Matches a string that contains an “a”
/7/ - Matches a string that contains a 7.
More learnin’
Match a word
/regex/ - Matches a string with the word “regex” in it
You can use a pipe character to give a choice
/pizza|steak|cheeseburger/ - Matches a string with any of these foods
Delimiters
The examples so far have started with / and ended with /.
These are delimiters and let the regex engine know where the pattern starts and ends.
You can choose another delimiter if you’d like or if it’s more convenient
Match namespace:
#/My/PHP/Namespace#
If I used “/” in that example, I’d need to escape each of the forward slashes to differentiate them from the delimiter
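The same match written both ways, as a rough sketch. The file path is invented for illustration.

```php
<?php
// Invented path for this sketch.
$path = 'src/My/PHP/Namespace/Foo.php';

// With / as the delimiter, every / in the pattern has to be escaped.
$withSlash = preg_match('/\/My\/PHP\/Namespace/', $path);

// With # as the delimiter, the slashes can stay as they are.
$withHash = preg_match('#/My/PHP/Namespace#', $path);

// preg_quote() can do the escaping when given the delimiter as its second argument.
$escaped = preg_quote('/My/PHP/Namespace', '/');
```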
Character Matching Continued
You can match a selection of characters
/[Pp][Hh][Pp]/ - Matches PHP in any mixture of upper and lowercase
Ranges can be defined
/[abcdefghijklmnopqrstuvwxyz]/ - Matches any lowercase alpha character
/[a-z]/ - Matches any lowercase alpha character
Character Selection Ranges
Ranges can be combined
/[A-Za-z0-9]/ - Matches an alphanumeric character
/[A-Fa-f0-9]/ - Matches any hex character
Character selection can be negated
/[^0-9]/ - Matches any non-digit character
/[^ ]/ - Matches any non-space character
/[.!@#$%^*]/ - Matches some punctuation
Special Characters
Dot (.) matches any character (except a newline, by default)
/./ - Matches any single character
/../ - Matches any two characters
To match an actual dot character, you must escape it
/\./ - Matches a single dot character
Unless it’s in a character selection
/[.]/ - Matches a single dot character
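A few of the character selections and dot rules above, tried out in a short sketch. The test strings are made up.

```php
<?php
// Character selections.
$mixedCase = preg_match('/[Pp][Hh][Pp]/', 'I like pHp');   // 1
$isHex     = preg_match('/^[A-Fa-f0-9]+$/', 'c0ffee');     // 1
$notHex    = preg_match('/^[A-Fa-f0-9]+$/', 'coffee');     // 0: "o" is not a hex digit

// An unescaped dot matches any character, so "abc" matches too.
$dotAny     = preg_match('/a.c/', 'abc');                  // 1
// An escaped dot, or a dot inside brackets, matches only a literal dot.
$dotLiteral = preg_match('/a\.c/', 'abc');                 // 0
$dotInClass = preg_match('/a[.]c/', 'a.c');                // 1
```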
Character classes
\d means [0-9]
\D means non-digits - [^0-9]
\w means word characters - [A-Za-z0-9_]
\W means non-word characters - [^A-Za-z0-9_]
\s means a whitespace character - roughly [ \t\n\r\f]
\S means non-whitespace characters
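A rough sketch of the shorthand classes in use. The sample text is invented.

```php
<?php
// Invented sample text containing a tab character.
$text = "Room 42, Floor 7\tEast";

preg_match_all('/\d/', $text, $digits);          // every single digit
preg_match_all('/\w+/', $text, $words);          // every run of word characters

$collapsed  = preg_replace('/\s+/', ' ', $text); // whitespace runs become one space
$onlyDigits = preg_replace('/\D/', '', $text);   // strip everything that is not a digit
```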
Repeating Character Classes
Match two digits in a row
/\d\d/
/[0-9][0-9]/
/\d{2}/
/[0-9]{2}/
Match at least one digit (but as many as it can)
/\d+/
Match 0 to infinite digits
/\d*/
Repeating Character Classes cont.
* means match 0 or more
+ means match 1 or more
{x} where x is a number means match exactly x of the preceding selection
{x,} means match at least x
{x,y} means match between x and y
{0,y} means match up to y (older PCRE versions treat {,y} as literal text, not as a quantifier)
More special characters
? means the preceding selection is optional
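The quantifiers above, checked against a few made-up strings. The patterns are anchored with ^ and $ in this sketch so that the counts apply to the whole string.

```php
<?php
$exactlyTwo  = preg_match('/^\d{2}$/', '42');       // 1
$notTwo      = preg_match('/^\d{2}$/', '421');      // 0: three digits
$atLeastTwo  = preg_match('/^\d{2,}$/', '421');     // 1
$twoToThree  = preg_match('/^\d{2,3}$/', '4217');   // 0: four digits

// ? makes the "u" optional, so both spellings match.
$american = preg_match('/^colou?r$/', 'color');     // 1
$british  = preg_match('/^colou?r$/', 'colour');    // 1

// + grabs as many digits as it can.
preg_match('/\d+/', 'abc 12345 def', $plus);        // $plus[0] is '12345'

// * also succeeds on zero digits: it matches the empty string at the start.
$starResult = preg_match('/\d*/', 'abc', $star);    // 1, and $star[0] is ''
```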
Putting it together
Telephone Number
/\(?(\d{3})\)?[\s-]?(\d{3})[\s-]?(\d{4})/
Matches 720-675-7471 or (720)675-7471 or (720) 675-7471 or 7206757471 or 720 675 7471
Find a misspelled word (and get great deals on eBay)
/la[bp]top computer[s]?/
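A sketch that runs the telephone pattern over each of the formats listed above and collects the three captured parts. The variable names and the auction title are assumptions for illustration.

```php
<?php
$phonePattern = '/\(?(\d{3})\)?[\s-]?(\d{3})[\s-]?(\d{4})/';
$samples = ['720-675-7471', '(720)675-7471', '(720) 675-7471', '7206757471', '720 675 7471'];

$parsed = [];
foreach ($samples as $sample) {
    if (preg_match($phonePattern, $sample, $m)) {
        // $m[1], $m[2], $m[3] hold the area code, prefix and line number.
        $parsed[] = [$m[1], $m[2], $m[3]];
    }
}

// The misspelling pattern accepts both "laptop" and "labtop", singular or plural.
$typo = preg_match('/la[bp]top computer[s]?/', 'cheap labtop computers');
```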
Regex Anchors
Anchors allow you to specify a position, like before, after or in between characters
/^ab/ matches abcdefg but not cab
Notice the caret character: here it means start of the string, but as the first character inside square brackets it negates the character class
/ab$/ matches cab but not abcdefg
/^[a-z]+$/ will match a string that consists only of lowercase characters
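The anchor examples above as a quick sketch. The last two test words are made up.

```php
<?php
$startsAb  = preg_match('/^ab/', 'abcdefg');      // 1
$cabStart  = preg_match('/^ab/', 'cab');          // 0: "ab" is not at the start
$endsAb    = preg_match('/ab$/', 'cab');          // 1
$abcEnd    = preg_match('/ab$/', 'abcdefg');      // 0: "ab" is not at the end
$allLower  = preg_match('/^[a-z]+$/', 'regex');   // 1
$hasUpper  = preg_match('/^[a-z]+$/', 'Regex');   // 0: the "R" is uppercase
```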
Word Boundaries
\b means word boundaries
Before first character if first character is a word character
After last character if last character is a word character
Between two characters if one is a word character and the other is not
/\bfish\b/ matches fish, but not fisherman or catfish
/fish\b/ matches fish and catfish
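The fish examples, sketched as code. The sentence in the fourth check is invented.

```php
<?php
$fish       = preg_match('/\bfish\b/', 'fish');                // 1
$fisherman  = preg_match('/\bfish\b/', 'fisherman');           // 0: no boundary after "fish"
$catfish    = preg_match('/\bfish\b/', 'catfish');             // 0: no boundary before "fish"
$inSentence = preg_match('/\bfish\b/', 'one fish, two fish');  // 1

// With only the trailing \b, "catfish" matches as well.
$catfishEnd   = preg_match('/fish\b/', 'catfish');             // 1
$fishermanEnd = preg_match('/fish\b/', 'fisherman');           // 0
```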
Alternation
/cow|boy/ - Matches cow, or boy or cowboy or coward, etc
/\b(cow|boy)\b/ - Matches cow or boy but not cowboy or coward
The above example also captures the matching word due to the parens. More on this later.
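A short sketch of both alternation patterns. The test phrases are assumptions.

```php
<?php
// Without boundaries, "cow" is found inside "coward".
$loose = preg_match('/cow|boy/', 'coward');                  // 1

// With \b on both sides, only the whole words match.
$coward = preg_match('/\b(cow|boy)\b/', 'coward');           // 0
$cowboy = preg_match('/\b(cow|boy)\b/', 'cowboy');           // 0
$whole  = preg_match('/\b(cow|boy)\b/', 'a cow here', $m);   // 1, and $m[1] is 'cow'
```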
Greedy vs Lazy
By default, regular expressions are greedy…
That is, they will match as much as they can
Grab a starting html tag: /<.+>/
Applied to <h1>Welcome to FRPUG</h1>, it matches the whole string, from the first < to the last >. Not what we want.
Make it lazy: /<.+?>/
Now it matches only <h1>
Another tag matching solution
/<[^>]+>/
Literally match a less than character followed by one or more non-greater than characters followed by a greater than character
This approach eliminates the need for the engine to backtrack (almost certainly faster than the last example).
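The three tag patterns compared on the same string, as a minimal sketch.

```php
<?php
$html = '<h1>Welcome to FRPUG</h1>';

preg_match('/<.+>/', $html, $greedy);     // greedy: runs to the last ">"
preg_match('/<.+?>/', $html, $lazy);      // lazy: stops at the first ">"
preg_match('/<[^>]+>/', $html, $negated); // negated class: cannot run past a ">"
```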
Capturing part of regex (backreference)
/__(construct|destruct)/
Backreference will contain either construct or destruct so you can use it later
/([a-z]+)\1/
Matches any run of lowercase letters that is immediately repeated, such as aa or abab. Does not match a single a. In aaaaa it matches the first four characters (aa followed by aa).
/([a-z]{3})\1/ Matches words like booboo or bambam
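A sketch of what ends up in the captured group and what \1 requires. The test strings are made up.

```php
<?php
// The group captures whichever alternative matched.
preg_match('/__(construct|destruct)/', 'function __destruct()', $magic);
// $magic[1] is 'destruct'

// \1 must repeat exactly what group 1 captured.
preg_match('/([a-z]+)\1/', 'aaaaa', $run);          // $run[0] is 'aaaa', $run[1] is 'aa'
$single = preg_match('/([a-z]+)\1/', 'a');          // 0: nothing left to repeat

preg_match('/([a-z]{3})\1/', 'bambam', $word);      // $word[0] is 'bambam', $word[1] is 'bam'
```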
Backreference Continued…
Very useful when performing regex search and replace
preg_replace('/\(?(\d{3})\)?[\s-]?(\d{3})[\s-]?(\d{4})/', '(\1) \2-\3', $phone)
The above example will take any phone number from the previous example and return it formatted in (xxx) xxx-xxxx format
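The replacement, run over two of the earlier phone formats as a quick sketch. The variable names are assumptions.

```php
<?php
$pattern = '/\(?(\d{3})\)?[\s-]?(\d{3})[\s-]?(\d{4})/';

// \1, \2 and \3 in the replacement refer to the three captured groups.
$fromSpaces = preg_replace($pattern, '(\1) \2-\3', '720 675 7471');
$fromDigits = preg_replace($pattern, '(\1) \2-\3', '7206757471');
```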
More backreferences…
Replace duplicated words that that have been inadvertently left in
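One possible way to do this, as a sketch. The pattern and sample sentence are assumptions, not necessarily what the talk showed.

```php
<?php
// Invented sentence with two doubled words.
$text = 'words that that have been left in in';

// \b(\w+) captures a whole word, \s+\1\b requires the same word again right after it.
// The replacement keeps a single copy of the word.
$fixed = preg_replace('/\b(\w+)\s+\1\b/i', '$1', $text);
```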
Non-capturing groups
Match an IPv4 address
/((?:\d{1,3}\.){3}\d{1,3})/
We’re matching 1 to 3 digits followed by a dot 3 times. We don’t care (right now) about the octets, we just want to repeat the match, so ?: says to not capture the group.
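A sketch showing that the non-capturing group does not show up in the matches array. The log line is invented. Note that the pattern only checks the shape of the address, so out-of-range octets still pass.

```php
<?php
$pattern = '/((?:\d{1,3}\.){3}\d{1,3})/';

preg_match($pattern, 'Server at 192.168.0.1 is up', $m);
// $m[0] is the whole match, $m[1] is the one capturing group.
// The (?:...) group adds no entry of its own.

// Shape only: 999 is not a valid octet, but it still matches.
$outOfRange = preg_match($pattern, '999.999.999.999');
```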
Pattern Modifiers
Modifiers go after the last delimiter (now you know why there are delimiters) and affect how the regex engine works
i – case insensitive matching (matches are case-sensitive by default)
m – multiline matching (^ and $ also match at the start and end of each line)
s - dot matches all characters, including \n
x – ignore all whitespace characters except if escaped or in a character class
Pattern Modifiers Continued…
D – Anchors $ to the very end of the string only; otherwise $ also matches just before a trailing \n
Allow username to be alphabetic only
/^[A-Za-z]+$/ - This will match "dave\n" (a trailing newline slips through)
However, /^[A-Za-z]+$/D will not match
U – Invert the meaning of the greediness. With this on, matches are lazy by default and ? makes them greedy.
There are lots of other modifiers and you can see them at http://us2.php.net/manual/en/reference.pcre.pattern.modifiers.php
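Each of the modifiers above in a one-line sketch. All test strings are made up.

```php
<?php
// i: case-insensitive.
$i = preg_match('/php/i', 'PHP');                          // 1

// m: ^ and $ match at each line, so both lines are found.
$withM    = preg_match_all('/^\w+$/m', "one\ntwo", $lines); // 2
$withoutM = preg_match_all('/^\w+$/', "one\ntwo");          // 0

// s: dot also matches a newline.
$noS = preg_match('/a.b/', "a\nb");                        // 0
$s   = preg_match('/a.b/s', "a\nb");                       // 1

// x: whitespace in the pattern is ignored.
$x = preg_match('/ \d{3} - \d{4} /x', '675-7471');         // 1

// D: $ no longer matches before a trailing newline.
$noD = preg_match('/^[A-Za-z]+$/', "dave\n");              // 1
$d   = preg_match('/^[A-Za-z]+$/D', "dave\n");             // 0

// U: quantifiers become lazy by default.
preg_match('/<.+>/U', '<h1>Hi</h1>', $u);                  // $u[0] is '<h1>'
```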
Named Capture Groups
Rather than get back a numbered array of matches, get back an associative array.
If you add a new capture group, you don’t have to renumber where you use the capture group
Named Capture Groups cont…
Use (?P<named_group>pattern)
Named Capture Groups cont…
Combined numbered and associative array
Capture group 0 is the whole pattern that is matched.
If our string to match against was abcde720-675 7471foobar, $matches[0] will contain 720-675 7471
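A sketch of the telephone pattern with named groups. The group names area, prefix and line are hypothetical choices for this example.

```php
<?php
// Same telephone pattern as before, with hypothetical group names.
$pattern = '/\(?(?P<area>\d{3})\)?[\s-]?(?P<prefix>\d{3})[\s-]?(?P<line>\d{4})/';

preg_match($pattern, 'abcde720-675 7471foobar', $matches);

// $matches holds both the named keys and the numbered ones:
// 0, 'area', 1, 'prefix', 2, 'line', 3
$whole = $matches[0];
$area  = $matches['area'];
```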
Positive Look Ahead Matches
Look for a pattern followed by another pattern
/p(?=h)/ - Match a “p” followed by an “h” but don’t include the “h”
Negative Look Ahead
Look for a pattern which is not followed by some other pattern
/p(?!h)/ - p not followed by h.
Look Aheads
Positive and negative look aheads do not capture anything.
They just determine if the pattern match is possible
They are zero-width
/p[^h]/ is not the same as /p(?!h)/
/ph/ is not the same as /p(?=h)/
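Why those pairs differ, shown on the string php as a minimal sketch.

```php
<?php
// /ph/ consumes the "h"; the look ahead only checks for it.
preg_match('/ph/', 'php', $consumed);       // $consumed[0] is 'ph'
preg_match('/p(?=h)/', 'php', $peeked);     // $peeked[0] is 'p'

// /p[^h]/ needs an actual character after the "p", so the final "p" cannot match.
$needsChar = preg_match('/p[^h]/', 'php');  // 0

// /p(?!h)/ is zero-width, so the final "p" (followed by nothing) matches.
preg_match('/p(?!h)/', 'php', $last, PREG_OFFSET_CAPTURE);
// $last[0] is ['p', 2]: the match is the "p" at offset 2
```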
Look behinds
Positive look behind
/(?<=oo)d/ - d which is preceded by oo
Matches “food”, “mood”; the match only contains the “d”
Negative look behind
/(?<!oo)d/ - d which is not preceded by oo
Matches “dude”, “crude”, and “d”
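Both look behinds in a short sketch. The replacement at the end is an invented example of putting the zero-width check to use.

```php
<?php
// Positive look behind: only the "d" is part of the match.
preg_match('/(?<=oo)d/', 'food', $pos);            // $pos[0] is 'd'
$noOo = preg_match('/(?<=oo)d/', 'dude');          // 0: no "d" follows "oo"

// Negative look behind: the first "d" in "dude" qualifies.
preg_match('/(?<!oo)d/', 'dude', $neg, PREG_OFFSET_CAPTURE);
// $neg[0] is ['d', 0]
$afterOo = preg_match('/(?<!oo)d/', 'food');       // 0: the only "d" follows "oo"

// Because the "oo" is not consumed, a replacement touches only the "d".
$swapped = preg_replace('/(?<=oo)d/', 't', 'food mood dude');
```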
With great power…
Test your regular expressions before they go to production
It’s much easier to get them wrong than to get them right if you don’t test
When to not use regex
Whenever they aren’t needed.
If you can use strstr or strpos or str_replace to do the job, do that. They are much faster, much simpler and easier to do correctly.
However, if you cannot use those functions, regex may be your best bet.
Don’t use regex when you really need a parser
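For a fixed piece of text, the plain string functions do the same job as the regex versions. A sketch with an invented sample string:

```php
<?php
$haystack = 'Front Range PHP User Group';

// Plain string functions: no pattern syntax to get wrong.
$hasPhp   = strpos($haystack, 'PHP') !== false;          // true
$replaced = str_replace('PHP', 'Regex', $haystack);

// The regex equivalents work, but add nothing for fixed text.
$hasPhpRegex   = preg_match('/PHP/', $haystack) === 1;   // true
$replacedRegex = preg_replace('/PHP/', 'Regex', $haystack);
```

Note the !== false on strpos: a match at position 0 would otherwise be mistaken for no match.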
Resources
http://regular-expressions.info
http://us2.php.net/manual/en/ref.pcre.php
Spider Man from http://www.onlineseats.com/
Questions?