Regular Expressions


Satyanarayana D <[email protected]>

http://taxi.corp.yahoo.com/

Topics

• What?
• Why?
• History - Who?
• Flavou?rs
• Grammar
• Meta Chars
• Character Classes
• Shorthand Char Classes
• Anchors
• Repeaters or Quantifiers
• Grouping & Capturing
• Alternation
• Match Float
• Atomic Grouping
• Look Around
• Conditional Expr.
• Recursive Regex
• Code Evaluation
• Code Expr.
• Inline Modifiers
• Regex Tools
• Q&A

What are Regular Expressions?

• A Regular expression is a pattern describing a certain amount of text.

• A regular expression, often called a pattern, is an expression that describes a set of strings. - Wikipedia

Why do we need them?

• Regular expressions allow matching and manipulation of textual data.

• Requirements
  • Matching/Finding
  • Doing something with matched text
  • Validation of data
  • Case insensitive matching
  • Parsing data (e.g. HTML)
  • Converting data into a different form, etc.

History

Stephen Kleene: a mathematician who discovered ‘regular sets’.

Ken Thompson, 1968: Regular Expression Search Algorithm.

QED -> ed -> g/re/p

Henry Spencer, 1986: wrote a regex library in C.

Regex Flavors

BRE - Basic Regular Expressions
• \?, \+, \{, \|, \(, and \)
• ed, g/re/p, sed

ERE - Extended Regular Expressions
• ?, +, {, |, (, and )
• grep -E == egrep, awk

PCRE - Perl Compatible Regular Expressions (Philip Hazel)
• Perl, PHP, Tcl etc.

Grammar of Regex

RE = one or more non-empty ‘branches’ separated by ‘|’

Branch = one or more ‘pieces’

Piece = atom, optionally followed by a quantifier

Quantifier = ‘*’, ‘+’, ‘?’ or ‘bound’

Bound = atom{n}, atom{n,}, atom{m,n}

Atom = (RE) or
       () or
       ‘^’, ‘$’, ‘.’ or
       \ followed by one of ^.[$()|*+?{\ or
       any-char or
       ‘bracket expression’

Bracket Expression = a list of characters enclosed in ‘[ ]’

Meta Chars?

In a normal expression like:

2 + 4

Here ‘+’ has a special meaning.

Meta Chars

\    Quote the next metacharacter
^    Match the beginning of the line
.    Match any character (except newline) = [^\n]
$    Match the end of the line (or before newline at the end)
|    Alternation
( )  Grouping
[ ]  Character class
{ }  Match m to n times
*    Match 0 or more times
+    Match 1 or more times
?    Match 1 or 0 times

Non-printable Chars

\t         tab (HT, TAB)
\n         newline (LF, NL)
\r         return (CR)
\f         form feed (FF)
\a         alarm (bell) (BEL)
\e         escape (think troff) (ESC)
\033       octal char (example: ESC)
\x1B       hex char (example: ESC)
\x{263a}   long hex char (example: Unicode SMILEY)
\cK        control char (example: VT)
\N{name}   named Unicode character

Character Classes – [ ]

• Set of characters placed inside square brackets. Inside brackets, meta characters lose their meaning (except ‘] \ ^ -’).
• Requirements
  • Matches one and only one character out of the specified chars.
  • A range can be specified using ‘-’.
    • a-z matches the 26 lower case English letters
    • 0-9 matches any digit.
  • Negation can be specified using ‘^’ at the beginning of the class.
  • To match the exceptional chars above literally, either escape them or place them where they cannot be special (for example ‘-’ at the end).

[0-9]          Matches any one of 0,1,2,3,4,5,6,7,8,9.
[aeiou]        Matches one English vowel char.
[^aeiou]       Matches any non-vowel char.
[a-z-]         Matches a to z and ‘-’
[a-z0-9]       Union: matches a to z and 0 to 9.
[a-z&&[m-z]]   Intersection: matches m to z (Java syntax).
[a-z-[m-z]]    Subtraction: matches a to l (.NET syntax).
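The class examples above can be tried out with a short Python sketch. The sample strings are made up for illustration, and the intersection and subtraction forms are left out because Python's re module does not support them.

```python
import re

# Hypothetical sample text, chosen only to exercise each class.
text = "Regex 101: a-z!"

digits = re.findall(r"[0-9]", text)             # any one digit
vowels = re.findall(r"[aeiou]", text)           # any one lowercase vowel
non_vowels = re.findall(r"[^aeiou]", "cue")     # negated class
letters_or_dash = re.findall(r"[a-z-]", text)   # trailing '-' is literal

print(digits)
print(vowels)
print(non_vowels)
print(letters_or_dash)
```

Each findall call returns one list entry per matched character, which makes it easy to see that a class always consumes exactly one character.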

POSIX Character Classes – [: … :]

[^[:digit:]] = \D = [^0-9]

Shorthand Chars

\w  word character         [A-Za-z0-9_]
\d  decimal digit          [0-9]
\s  whitespace             [ \n\r\t\f]

\W  not a word character   [^A-Za-z0-9_]
\D  not a decimal digit    [^0-9]
\S  not whitespace         [^ \n\r\t\f]

Anchors/Assertions

• An anchor matches a certain position in the subject string and it won’t consume any characters.

^       Match the beginning of the line
$       Match the end of the line (or before newline at the end)
\A      Matches only at the very beginning
\z      Matches only at the very end
\Z      Matches like $ used in single-line mode
\b      Matches when the current position is a word boundary
\<, \>  Match when the current position is the start, or the end, of a word
\B      Matches when the current position is not a word boundary

^ Anchor

^ Match the beginning of the line

/^a/

String begins with ‘a’

$ Anchor

$ Match the end of the line (or before newline at the end)

/s$/

String ends with ‘s’

\A Anchor

\A Matches only at the very beginning

^ vs \A: in multi-line mode ^ also matches after every newline, while \A still matches only at the start of the string.

\z, \Z Anchors

\z Matches only at the very end
\Z Matches like $ used in single-line mode

$ vs \z, \Z: in multi-line mode $ also matches before every newline; \Z matches at the very end or before a final newline; \z matches only at the very end.
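The difference between line anchors and string anchors can be sketched in Python. Note one naming trap: Python's \Z behaves like Perl's \z (very end only), and the two-line sample text is an assumption made for the demo.

```python
import re

# Hypothetical two-line subject ending in a newline.
text = "first line\nsecond line\n"

# With MULTILINE, ^ and $ match at every line; \A and \Z stay tied to the string.
line_starts = re.findall(r"^\w+", text, re.MULTILINE)
string_start = re.findall(r"\A\w+", text, re.MULTILINE)
line_ends = re.findall(r"\w+$", text, re.MULTILINE)
string_end = re.findall(r"\w+\Z", text)        # Python \Z == Perl \z
dollar_default = re.findall(r"\w+$", text)     # $ tolerates one final newline

print(line_starts, string_start)
print(line_ends, string_end, dollar_default)
```

The empty result for \Z shows that the trailing newline blocks a strict end-of-string match, while a plain $ still matches just before it.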

\b, \B Anchors

\b = \W\w|\w\W = Matches a word boundary (also the start or end of the string next to a word character)
\B Matches when the current position is not a word boundary

Subject: $ xl2twiki file 2 > /dev/null

/\b2\b/ matches the standalone ‘2’.

/\B2\B/ matches the ‘2’ inside ‘xl2twiki’.
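A quick way to check the two boundary patterns is to list where each one matches in the command line from the slide. This is a minimal Python sketch; the variable names are illustrative.

```python
import re

command = "xl2twiki file 2 > /dev/null"

# Start offsets of every match for each pattern.
standalone = [m.start() for m in re.finditer(r"\b2\b", command)]
embedded = [m.start() for m in re.finditer(r"\B2\B", command)]

print(standalone)  # offset of the lone "2" argument
print(embedded)    # offset of the "2" inside "xl2twiki"
```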

Quantifiers (repetition)

• Why? – Because we are not sure about the text. A quantifier specifies how many times a regex component must repeat.

{m,n}       = Matches a minimum of m and a maximum of n occurrences.
*  = {0,}   = Matches zero or more occurrences (any amount).
+  = {1,}   = Matches one or more occurrences.
?  = {0,1}  = Matches zero or one occurrence (means optional).

• By default quantifiers are greedy.

/\d{2,4}/ 2010

/<.+>/ My first regex test. regex 

/\w+sion/ Expression

If the overall match fails because a greedy quantifier consumed too much, the quantifier is forced to give back as much as needed for the rest of the regex to succeed.

Non Greedy Quantifiers

{m,n}?  *?  +?  ??

To make a quantifier non-greedy, append ‘?’.

<.+?> My first regex test. 

Use negated classes

<[^>]+> My first regex test. 
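The greedy, non-greedy and negated-class forms can be compared side by side in Python. The tagged sample string below is an assumption made for the demo, not the original slide text.

```python
import re

# Hypothetical subject containing a pair of tags.
html = "My <em>first</em> regex test."

greedy = re.search(r"<.+>", html).group()       # runs to the last '>'
lazy = re.search(r"<.+?>", html).group()        # stops at the first '>'
negated = re.search(r"<[^>]+>", html).group()   # cannot cross a '>'

# Greedy quantifiers give characters back when the rest would fail.
year = re.search(r"\d{2,4}", "2010").group()
word = re.search(r"\w+sion", "Expression").group()

print(greedy, lazy, negated, year, word)
```

The negated class gives the same result as the lazy form here, without asking the engine to test the rest of the pattern after every character.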

Grouping – ( )

• Why? – To create sub patterns, so that you can apply regex operators to whole sub patterns or you can reference them by corresponding sub group numbers.

\d{2}-\d{2}-\d{2}(\d{2})?

Will match 01-01-10 and 01-01-2010 also.

• Grouping can be used for alternation.

Alternation - |

• Why? – Lets you match more than one sub-expression at the same point.

/\b(get|set)Value\b/ Match either getValue or setValue.

• Branches are tried from left to right.
• Eagerness - the first alternative that matches wins, so put the most likely pattern first.
  • (and|android) -> ‘robot and an android fight’: ‘and’ is tried first, so only the ‘and’ of ‘android’ is matched.
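The effect of branch order can be seen with a small Python sketch on the sentence from the slide; the reordered and word-bounded variants are added here for comparison.

```python
import re

text = "robot and an android fight"

# Branches are tried left to right, so "and" wins inside "android".
eager = re.findall(r"and|android", text)
# Putting the longer branch first lets it match.
reordered = re.findall(r"android|and", text)
# Word boundaries force the engine to backtrack into the second branch.
bounded = re.findall(r"\b(?:and|android)\b", text)

print(eager)
print(reordered)
print(bounded)
```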

Capturing – ( )

• Allows us to access sub-parts of the pattern for later processing.
• All captured sub patterns are stored in memory.
• Captured patterns are numbered from left to right, by opening parenthesis.

/\b((\d{2})-(\d{2})-(\d{2}(\d{2})?))\b/

Today is ‘18-08-2010’.

\1 -> date -> 18-08-2010
\2 -> day -> 18
\3 -> month -> 08
\4 -> year -> 2010
\5 -> year -> last two digits -> 10
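The group numbering above can be checked with a minimal Python sketch that applies the same pattern to the same sentence.

```python
import re

pattern = re.compile(r"\b((\d{2})-(\d{2})-(\d{2}(\d{2})?))\b")
match = pattern.search("Today is '18-08-2010'.")

# Groups are numbered by the position of their opening parenthesis.
date, day, month, year, short_year = match.groups()

print(date, day, month, year, short_year)
```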

Non-Capturing sub patterns – (?: )

• If you don’t need back references, make the sub expression non-capturing. It saves memory and processing time.

\d{2}-\d{2}-\d{2}(?:\d{2})?

Will match 01-01-10 and 01-01-2010 also.
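As a rough illustration, the capturing and non-capturing versions of the date pattern accept the same strings and differ only in what they store. This Python sketch uses the compiled pattern's groups attribute to show the difference.

```python
import re

capturing = re.compile(r"\d{2}-\d{2}-\d{2}(\d{2})?")
non_capturing = re.compile(r"\d{2}-\d{2}-\d{2}(?:\d{2})?")

dates = ["01-01-10", "01-01-2010"]

# Both patterns match the same strings in full.
same_matches = all(
    capturing.fullmatch(d) and non_capturing.fullmatch(d) for d in dates
)

# Only the capturing version allocates a numbered group.
print(same_matches, capturing.groups, non_capturing.groups)
```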

Named Capture – (?<name> )

• We can give names to sub patterns instead of numbers.

(?P<name>pattern) -> Python style, Perl 5.12
(?P=name) -> Back reference
(?<name>pattern) or (?'name'pattern) -> Perl 5.10
\k<name> or \k'name' or \g{name} -> Back reference
\g{-1}, \g{-2} -> Relative back reference.

/(?<vowel>[ai]).\k<vowel>.\1/ abracadabra !!
/(\w+)\s+\g{-1}/ "Thus joyful Troy Troy maintained the the watch of night..."

$date="18-08-2010";
$date =~ s/(?<day>\d{2})-(?<month>\d{2})-(?<year>\d{4})/$+{year}-$+{month}-$+{day}/;
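For comparison, here is a sketch of the same date rewrite and the doubled-word search using Python's named-group syntax, (?P<name>...) with (?P=name) and \g<name>. The added word boundaries are a small assumption to keep partial words from matching.

```python
import re

date = "18-08-2010"
date_pattern = re.compile(r"(?P<day>\d{2})-(?P<month>\d{2})-(?P<year>\d{4})")

# \g<name> refers to a named group in the replacement text.
iso_date = date_pattern.sub(r"\g<year>-\g<month>-\g<day>", date)

line = "Thus joyful Troy Troy maintained the the watch of night..."
# (?P=word) is a back reference to the named group inside the pattern.
doubled = re.findall(r"\b(?P<word>\w+)\s+(?P=word)\b", line)

print(iso_date)
print(doubled)
```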

Before Evaluating Regex

• Hits
  • Lines that I want to match.
• Misses
  • Lines that I don’t want to match.
• Omissions
  • Lines that I didn’t match but wanted to match.
• False alarms
  • Lines that I matched but didn’t want to match.

Matching a float number

Float number = integerpart.fractionalpart

Basic Principle – Split your task into sub tasks

Integer part = \d+ -> will match one or more digits

Literal dot = \.

Fractional part = \d+ -> will match one or more digits

Combine all of them = \d+\.\d+


/\d+\.\d+/ -> Is generic.

But it won’t match -123.45 or +123.45

/[+-]?\d+\.\d+/ -> will match.

But it won’t match - 123.45 or + 123.45

/[+-]? *\d+\.\d+/ -> will match.

But it won’t match 123. or .45

/[+-]? *(?:\d+\.\d+|\d+\.|\.\d+)/ -> will match.

But it won’t match 10e2 or 101E5

/[+-]? *(?:\d+\.\d+|\d+\.|\.\d+|\d+)(?:[eE]\d+)?/ -> will match. The bare \d+ branch is needed because these numbers have no dot.

But it won’t match 10e-2

/^[+-]? *(?:\d+\.\d+|\d+\.|\.\d+|\d+)(?:[eE][+-]?\d+)?$/ -> will match.

Match a float number

/^
  [+-]?\ *               # first, match an optional sign
  (?:                    # then match integers or f.p. mantissas:
      \d+\.\d+           # mantissa of the form a.b
     |\d+\.              # mantissa of the form a.
     |\.\d+              # mantissa of the form .b
     |\d+                # integer of the form a
  )
  (?:[eE][+-]?\d+)?      # finally, optionally match an exponent
$/x;
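The final pattern carries over to Python almost unchanged with re.VERBOSE. This is a sketch: the helper name is_float is made up here, and the literal space is written as [ ] because verbose mode ignores bare whitespace.

```python
import re

FLOAT = re.compile(
    r"""
    ^ [+-]? [ ]*            # optional sign, then optional spaces
    (?:
        \d+\.\d+            # mantissa of the form a.b
      | \d+\.               # mantissa of the form a.
      | \.\d+               # mantissa of the form .b
      | \d+                 # integer of the form a
    )
    (?: [eE] [+-]? \d+ )?   # optional exponent
    $
    """,
    re.VERBOSE,
)


def is_float(text: str) -> bool:
    """Return True when the whole string looks like a float (hypothetical helper)."""
    return FLOAT.match(text) is not None


for sample in ["-123.45", "+ 123.45", "123.", ".45", "10e-2", "1.2.3"]:
    print(repr(sample), is_float(sample))
```

Running the hits and misses from each refinement step through one helper is a cheap regression test whenever the pattern is changed again.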

Atomic Grouping – (?> )

• Before looking into atomic grouping, we need to know about backtracking.

• Backtracking – If you don’t succeed, try and try again...

Matching \d+99 against 19999:

\d     19999 -> Add 1 to match -> 1
\d+    19999 -> Add 9 to match -> 19
\d+    19999 -> Add 9 to match -> 199
\d+    19999 -> Add 9 to match -> 1999
\d+    19999 -> Add 9 to match -> 19999
\d+    19999 -> Still need to match 99
\d+99  19999 -> Give up a 9
\d+99  19999 -> Give up one more 9
\d+99  19999 -> Success

Matching \d+xx against 199Rs:

\d     199Rs -> Add 1 to match -> 1
\d+    199Rs -> Add 9 to match -> 19
\d+    199Rs -> Add 9 to match -> 199
\d+x   199Rs -> x not matched with R
\d+x   199Rs -> Give up 9, still cannot match x
\d+x   199Rs -> Give up 9, still cannot match x
\d+x   199Rs -> Cannot give up 1 due to \d+
\d+xx  199Rs -> Failure

Atomic Grouping:

• Atomic grouping disables backtracking into the group, so a failing match is rejected sooner.
• (?>pattern): here pattern is treated as an atomic token.
• (?>\d+)xx: here (?>\d+) won’t give up any digits once matched; it locks.
  • It fails right at matching x with R.
• Atomic groups are not captured and can be nested.

Possessive Quantifiers:

• Use possessive quantifiers for single items to avoid backtracking.
• Adding ‘+’ to a quantifier makes it possessive.
• (?>\d+)xx == \d++xx
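Python's re module only gained (?>...) and possessive quantifiers in version 3.11, so the sketch below emulates an atomic group in a way that also works on older versions: a lookahead captures the digit run and a back reference consumes it, and the engine never backtracks into the lookahead. This emulation is a substitute technique, not the syntax shown above, and the group name digits is made up.

```python
import re

plain = re.compile(r"\d+99")
# Emulated (?>\d+)99: the lookahead grabs the longest digit run and the
# back reference consumes exactly that run, so no digit is ever given back.
atomic_99 = re.compile(r"(?=(?P<digits>\d+))(?P=digits)99")
atomic_xx = re.compile(r"(?=(?P<digits>\d+))(?P=digits)xx")

plain_ok = plain.fullmatch("19999") is not None       # backtracking rescues it
atomic_fails = atomic_99.fullmatch("19999") is None   # the digits are locked
atomic_ok = atomic_xx.fullmatch("199xx") is not None  # nothing to give back
no_match = atomic_xx.search("199Rs") is None          # fails at x against R

print(plain_ok, atomic_fails, atomic_ok, no_match)
```

The \d+99 case shows the cost of atomicity: a pattern that relied on backtracking to succeed stops matching once the group is locked.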

Look Around

            Positive    Negative
Ahead       (?=...)     (?!...)
Behind      (?<=...)    (?<!...)

(?=...) Zero-width positive lookahead assertion

(?!...) Zero-width negative lookahead assertion

(?<=...) Zero-width positive lookbehind assertion

(?<!...) Zero-width negative lookbehind assertion

*Note: Assertions can be nested. Example: /(?<=,(?!(?<=\d,)(?=\d)))/

“I catch the housecat 'Tom-cat' with catnip”

/cat(?=\s+)/ matches the ‘cat’ in ‘housecat’

/(?<=\s)cat\w+/ matches ‘catch’ and ‘catnip’

/\bcat\b/ matches the ‘cat’ in ‘Tom-cat’

/(?<=\s)cat(?=\s)/ matches nothing: there is no isolated ‘cat’

/cat(?!\s)/ matches the ‘cat’ in ‘catch’, ‘Tom-cat’ and ‘catnip’

/(?<!\s)cat/ matches the ‘cat’ in ‘housecat’ and ‘Tom-cat’

*Note: look-behind expressions cannot be of variable length. This means you cannot use quantifiers (?, *, +, or {1,5}) or alternation of different-length items inside them.
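The lookaround examples can be verified with a short Python sketch that reports where each pattern matches in the sample sentence. The helper name starts is made up for this demo.

```python
import re

text = "I catch the housecat 'Tom-cat' with catnip"


def starts(pattern: str) -> list:
    """Start offsets of all matches of pattern in text (hypothetical helper)."""
    return [m.start() for m in re.finditer(pattern, text)]


followed_by_space = starts(r"cat(?=\s+)")        # housecat
word_after_space = re.findall(r"(?<=\s)cat\w+", text)
whole_word = starts(r"\bcat\b")                  # Tom-cat
isolated = starts(r"(?<=\s)cat(?=\s)")           # none
not_before_space = starts(r"cat(?!\s)")          # catch, Tom-cat, catnip
not_after_space = starts(r"(?<!\s)cat")          # housecat, Tom-cat

print(followed_by_space, word_after_space, whole_word)
print(isolated, not_before_space, not_after_space)
```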

Conditional expressions

• A conditional expression is a form of if-then-else statement that lets the regex choose which pattern to match, based on some condition.

• (?(condition)yes-regexp) is like an 'if () {}' statement
• (?(condition)yes-regexp|no-regexp) is like an 'if () {} else {}' statement

• The condition can be
  • The number of a sub pattern (true if that group matched)
  • A lookaround assertion
  • A recursive call

Match a (quoted)? string -> /^("|')?[^"']*(?(1)\1)$/

Matches 'blah blah'
Matches "blah blah"
Matches blah blah
Won’t match 'blah blah"
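Python's re module supports the group-number form of the conditional, so the optional-quote pattern can be sketched directly. The helper name is_well_quoted is made up for illustration.

```python
import re

# If group 1 (an opening quote) matched, require the same quote at the end.
QUOTED = re.compile(r"""^("|')?[^"']*(?(1)\1)$""")


def is_well_quoted(text: str) -> bool:
    """True for unquoted text or text wrapped in one matching pair of quotes."""
    return QUOTED.match(text) is not None


for sample in ["'blah blah'", '"blah blah"', "blah blah", "'blah blah\""]:
    print(sample, is_well_quoted(sample))
```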


/(.)\1(?(?<=AA)G|C)$/

ATGAAG
TAGBBC
GATGGC

/usr/share/dict/words -> /^(.+)(.+)?(?(2)\2\1|\1)$/

aa
baba
beriberi
maam
vetitive

Recursive Patterns – (?)

• (x(x)y(x)x)

• Palindrome -> /^((.)(?:(?1)|\w)*(\2))$/

qr/
  ^              # Start of string
  (              # Start capture group 1
    \(           # Open paren
    (?>          # Atomic subgroup
      [^()]++    # Grab all the non parens we can
    |            # or
      (?1)       # Recurse into group 1
    )*           # Zero or more times
    \)           # Close paren
  )              # End capture group 1
  $              # End of string
/x;

Code Evaluation – (?{ })

• Perl code can be evaluated inside regular expressions using the (?{ }) construct.

$x = "aaaa";
$x =~ /(a(?{print "Yow\n";}))*aa/;

produces

Yow
Yow
Yow
Yow

Pattern Code Expression – (??{ })

• Pattern code expression - the result of the code evaluation is treated as a regular expression and matched immediately.

• Construct is (??{ })

$length = 5;
$char = 'a';
$str = 'aaaaabb';
$str =~ /(??{$char x $length})/x; # matches, there are 5 of 'a'

Inline modifiers & Comments

Matching can be modified inline by placing modifiers.

(?i) enables case-insensitive mode
(?m) enables multiline matching for ^ and $
(?s) makes dot metacharacter match newline also
(?x) ignores literal whitespace
(?U) makes quantifiers ungreedy (lazy) by default (PCRE)

$answers =~ /(?i)y(?-i)(?:es)?/ -> Will match ‘y’, ‘Y’, ‘yes’, ‘Yes’ but not ‘YES’.

Comments can be inserted inline using the (?#) construct.

/^(?#begin)\d+(?#match integer part)\.(?#match dot)\d+(?#match fractional part)$/
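In Python the same yes/no check can be sketched with a scoped inline flag. Python's re has no bare (?-i) switch mid-pattern, so the scoped form (?i:...), available since Python 3.6, is used instead. Inline (?#...) comments work as shown above.

```python
import re

# Only the "y" is case-insensitive; "es" must be lowercase.
answer = re.compile(r"(?i:y)(?:es)?")
results = {s: answer.fullmatch(s) is not None
           for s in ["y", "Y", "yes", "Yes", "YES"]}

# Inline comments are ignored by the engine.
commented = re.compile(
    r"^(?#begin)\d+(?#match integer part)\.(?#match dot)\d+(?#match fractional part)$"
)
has_fraction = commented.match("3.14") is not None

print(results)
print(has_fraction)
```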

Regex Tools: Testers, Tools, Editors

Editors: Vim, TextMate, EditPad Pro, NoteTab, UltraEdit

RegexBuddy

Reggy – http://reggyapp.com

http://rubular.com (Ruby)

RegexPal (JavaScript) - http://www.regexpal.com

http://www.gskinner.com/RegExr/

http://www.spaweditor.com/scripts/regex/index.php

http://regex.larsolavtorvik.com/ (PHP, JavaScript)

http://www.nregex.com/ ( .NET )

http://www.myregexp.com/ ( Java )

http://osteele.com/tools/reanimator ( NFA Graphic repr. )

Expresso - http://www.ultrapico.com/Expresso.htm ( .NET )

Regulator - http://sourceforge.net/projects/regulator ( .NET )

RegexRenamer - http://regexrenamer.sourceforge.net/ ( .NET )

PowerGREP http://www.powergrep.com/

Windows Grep - http://www.wingrep.com/

Regex Resources

$ perldoc perlre perlretut perlreref

$ man re_format

“Mastering Regular Expressions” by Jeffrey Friedl

http://oreilly.com/catalog/9780596528126/

“Regular Expressions Cookbook” by Jan Goyvaerts & Steven Levithan

http://oreilly.com/catalog/9780596520694

Questions?


Thank Y!ou


Java Regex

import java.util.regex.*;

public class MatchTest {
    public static void main(String[] args) throws Exception {
        String date = "12/30/1969";
        // Two-digit month and day, then a two- or four-digit year
        Pattern p = Pattern.compile("^(\\d\\d)[-/](\\d\\d)[-/](\\d\\d(?:\\d\\d)?)$");
        Matcher m = p.matcher(date);

        if (m.find()) {
            String month = m.group(1);
            String day = m.group(2);
            String year = m.group(3);
            System.out.printf("Found %s-%s-%s\n", year, month, day);
        }
    }
}

PHP Regex

$date = "12/30/1969";

$p = "!^(\\d\\d)[-/](\\d\\d)[-/](\\d\\d(?:\\d\\d)?)$!";

if (preg_match($p, $date, $matches)) {
    $month = $matches[1];
    $day = $matches[2];
    $year = $matches[3];
}

$text = "Hello world. ";

$pattern = "{ }i";

echo preg_replace($pattern, " ", $text);

JavaScript Regex

var date = "12/30/1969";

var p = new RegExp("^(\\d\\d)[-/](\\d\\d)[-/](\\d\\d(?:\\d\\d)?)$");

var result = p.exec(date);

if (result != null) {
    var month = result[1];
    var day = result[2];
    var year = result[3];
}

var text = "Hello world. ";

var pattern = / /ig;

text.replace(pattern, " ");

.NET Regex

using System.Text.RegularExpressions;

class MatchTest {
    static void Main() {
        string date = "12/30/1969";
        Regex r = new Regex(@"^(\d\d)[-/](\d\d)[-/](\d\d(?:\d\d)?)$");
        Match m = r.Match(date);

        if (m.Success) {
            string month = m.Groups[1].Value;
            string day = m.Groups[2].Value;
            string year = m.Groups[3].Value;
        }
    }
}

Python Regex

import re

date = '12/30/1969'

regex = re.compile(r'^(\d\d)[-/](\d\d)[-/](\d\d(?:\d\d)?)$')

match = regex.match(date)

if match:
    month = match.group(1)  # 12
    day = match.group(2)    # 30
    year = match.group(3)   # 1969

Ruby Regex

date = '12/30/1969'

regexp = Regexp.new('^(\d\d)[-/](\d\d)[-/](\d\d(?:\d\d)?)$')

if md = regexp.match(date)
  month = md[1]  # 12
  day = md[2]    # 30
  year = md[3]   # 1969
end

Unicode Properties


Find incremental numbers?

$str="abc 123hai cde 34567 efg 1245 a132 123456789 10adf";

print "$1\n" while($str=~/\D( (\d) (?{$x=$2}) ( (??{++$x%10}) )* ) \D/gx);

Commify a number

$no=123456789;
substr($no,0,length($no)-1)=~s/(?=(?<=\d)(?:\d\d)+$)/,/g;
print $no;

Produces 12,34,56,789
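The same idea can be sketched in Python: set the last digit aside, then insert a comma at every position that has a digit before it and an even number of digits after it. The function name commify_indian is made up, and the sketch assumes a non-negative integer.

```python
import re


def commify_indian(number: int) -> str:
    """Group digits as 12,34,56,789 (assumes a non-negative integer)."""
    digits = str(number)
    head, last = digits[:-1], digits[-1]
    # Zero-width match: digit behind, pairs of digits up to the end ahead.
    head = re.sub(r"(?<=\d)(?=(?:\d\d)+$)", ",", head)
    return head + last


print(commify_indian(123456789))
```

Because both assertions are zero-width, the substitution only inserts commas and never removes or rewrites a digit.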


A non-capturing group inside a capturing group is still part of the outer capture:

perl -e '$x="cat cat cat";$x=~/(cat(?:\s+))/;print ":$1:";'
