Regular Expressions Satyanarayana D <[email protected]>


Page 1: Regular Expressions

Regular Expressions

Satyanarayana D <[email protected]>

Page 2: Regular Expressions

Topics

• What?
• Why?
• History - Who?
• Flavors
• Grammar
• Meta Chars
• Character Classes
• Shorthand Char Classes
• Anchors
• Repeaters or Quantifiers
• Grouping & Capturing
• Alternation
• Match Float
• Atomic Grouping
• Look Around
• Conditional Expr.
• Recursive Regex
• Code Evaluation
• Code Expr.
• Inline Modifiers
• Regex Tools
• Q&A

Page 3: Regular Expressions

What are Regular Expressions?

• A Regular expression is a pattern describing a certain amount of text.

• A regular expression, often called a pattern, is an expression that describes a set of strings. - Wikipedia

Page 4: Regular Expressions

Why do we need them?

• Regular expressions allow matching and manipulation of textual data.

• Requirements
  • Matching/Finding
  • Doing something with matched text
  • Validation of data
  • Case-insensitive matching
  • Parsing data (e.g. HTML)
  • Converting data into a different form, etc.

Page 5: Regular Expressions

History

Stephen Kleene

A mathematician who discovered ‘regular sets’.

Page 6: Regular Expressions

History

Ken Thompson

1968 - Regular Expression Search Algorithm.

Qed -> ed -> g/re/p

Page 7: Regular Expressions

History

Henry Spencer

1986 - Wrote a regex library in C.

Page 8: Regular Expressions

Regex Flavors

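BRE - Basic Regular Expressions
• Operators need a backslash: \?, \+, \{, \|, \(, and \)
• ed, g/re/p, sed

ERE - Extended Regular Expressions
• Operators are written bare: ?, +, {, |, (, and )
• grep -E == egrep, awk

PCRE - Perl Compatible Regular Expressions (library by Philip Hazel)
• Perl-style syntax, as used in Perl, PHP, etc.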

Page 9: Regular Expressions

Grammar of Regex

RE = one or more non-empty ‘branches’ separated by ‘|’

Branch = one or more ‘pieces’

Piece = atom, optionally followed by a quantifier

Quantifier = ‘*’, ‘+’, ‘?’ or ‘bound’

Bound = atom{n}, atom{n,}, atom{m,n}

Atom = one of:
  (RE)
  ()
  ‘^’, ‘$’, ‘.’
  \ followed by one of `^.[$()|*+?{\’
  any single character
  ‘bracket expression’

Bracket Expression = a list of characters enclosed in ‘[ ]’

Page 10: Regular Expressions

Meta Chars?

In a normal expression like:

2 + 4

Here ‘+’ has a special meaning.

Page 11: Regular Expressions

Meta Chars

\      Quote the next metacharacter
^      Match the beginning of the line
.      Match any character (except newline) = [^\n]
$      Match the end of the line (or before newline at the end)
|      Alternation
( )    Grouping
[ ]    Character class
{ }    Match m to n times
*      Match 0 or more times
+      Match 1 or more times
?      Match 1 or 0 times

Page 12: Regular Expressions

Non-printable Chars

\t          tab (HT, TAB)
\n          newline (LF, NL)
\r          return (CR)
\f          form feed (FF)
\a          alarm (bell) (BEL)
\e          escape (think troff) (ESC)
\033        octal char (example: ESC)
\x1B        hex char (example: ESC)
\x{263a}    long hex char (example: Unicode SMILEY)
\cK         control char (example: VT)
\N{name}    named Unicode character

Page 13: Regular Expressions

Character Classes – [ ]

• A set of characters placed inside square brackets. Inside the brackets, metacharacters lose their special meaning (except ‘] \ ^ -’).
• Requirements
  • Matches one and only one character out of the specified characters.
  • A range can be specified using ‘-’.
    • a-z matches the 26 lowercase English letters.
    • 0-9 matches any digit.
  • Negation can be specified using ‘^’ at the beginning of the class.
  • To match the exceptional characters above literally, either escape them or place them at the end of the class (this works for ‘-’ and ‘^’).

[0-9]          Matches any one of 0,1,2,3,4,5,6,7,8,9.
[aeiou]        Matches one English vowel char.
[^aeiou]       Matches any non-vowel char.
[a-z-]         Matches a to z and ‘-’.
[a-z0-9]       Union: matches a to z and 0 to 9.
[a-z&&[m-z]]   Intersection: matches m to z (supported in Java, for example).
[a-z-[m-z]]    Subtraction: matches a to l (supported in .NET, for example).
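The class examples above can be tried out with a minimal Python sketch using the standard `re` module. The sample string is made up for illustration, and intersection and subtraction are left out because Python's `re` does not support them.

```python
import re

text = "regex-101 rocks"  # hypothetical sample input

digits = re.findall(r"[0-9]", text)            # any single digit
vowels = re.findall(r"[aeiou]", text)          # any single lowercase vowel
non_vowels = re.findall(r"[^aeiou]", "regex")  # negated class
words = re.findall(r"[a-z-]+", text)           # trailing '-' is a literal hyphen

print(digits, vowels, non_vowels, words)
```

Each class still matches exactly one character; the `+` in the last pattern is what repeats it.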

Page 14: Regular Expressions

POSIX Character Classes – [: … :]

[^[:digit:]] = \D = [^0-9]

Page 15: Regular Expressions

Shorthand Chars

\w    word character            [A-Za-z0-9_]
\d    decimal digit             [0-9]
\s    whitespace                [ \n\r\t\f]

\W    not a word character      [^A-Za-z0-9_]
\D    not a decimal digit       [^0-9]
\S    not whitespace            [^ \n\r\t\f]

Page 16: Regular Expressions

Anchors/Assertions

• An anchor matches a certain position in the subject string and does not consume any characters.

^         Match the beginning of the line
$         Match the end of the line (or before newline at the end)
\A        Matches only at the very beginning
\z        Matches only at the very end
\Z        Matches like $ used in single-line mode
\b        Matches when the current position is a word boundary
\<, \>    Match at the start and at the end of a word (e.g. GNU grep, Vim)
\B        Matches when the current position is not a word boundary

Page 17: Regular Expressions

^ Anchor

^ Match the beginning of the line

/^a/

Matches a string that begins with ‘a’.

Page 18: Regular Expressions

$ Anchor

$ Match the end of the line (or before newline at the end)

/s$/

Matches a string that ends with ‘s’.

Page 19: Regular Expressions

\A Anchor

\A Matches only at the very beginning

^ vs \A

Page 20: Regular Expressions

\z, \Z Anchors

\z    Matches only at the very end
\Z    Matches like $ used in single-line mode

$ vs \z, \Z
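A small Python sketch of the two comparisons, assuming the standard `re` module and a made-up two-line subject. Note one flavor difference: Python's `\Z` matches only at the absolute end of the string, so it behaves like Perl's `\z`, not like Perl's `\Z`.

```python
import re

text = "alpha\nbeta\n"  # hypothetical two-line subject

# ^ with MULTILINE matches at the start of every line; \A only at the very start.
line_starts = re.findall(r"^\w+", text, re.MULTILINE)
string_start = re.findall(r"\A\w+", text, re.MULTILINE)

# $ (without MULTILINE) may match just before a trailing newline;
# Python's \Z insists on the absolute end of the string.
dollar_matches = re.search(r"beta$", text) is not None
absolute_end_matches = re.search(r"beta\Z", text) is not None

print(line_starts, string_start, dollar_matches, absolute_end_matches)
```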

Page 21: Regular Expressions

\b, \B Anchors

\b    = \W\w|\w\W = Matches a word boundary (the position between a word and a non-word character)
\B    Matches when the current position is not a word boundary

Subject: $ xl2twiki file 2 > /dev/null

/\b2\b/    -> matches the standalone ‘2’
/\B2\B/    -> matches the ‘2’ inside ‘xl2twiki’
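The same command line can be checked in Python; this is only a sketch with the standard `re` module, reporting the offset of each match so the two `2`s can be told apart.

```python
import re

command = "xl2twiki file 2 > /dev/null"

# \b2\b: a '2' with a word boundary on both sides (the standalone argument).
standalone = [m.start() for m in re.finditer(r"\b2\b", command)]

# \B2\B: a '2' with word characters on both sides (inside 'xl2twiki').
embedded = [m.start() for m in re.finditer(r"\B2\B", command)]

print(standalone, embedded)
```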

Page 22: Regular Expressions

Quantifiers (repetition)

• Why? – Because we are not sure about the text. A quantifier specifies how many times a regex component must repeat.

{m,n}         Matches a minimum of m and a maximum of n occurrences.
* = {0,}      Matches zero or more occurrences (any amount).
+ = {1,}      Matches one or more occurrences.
? = {0,1}     Matches zero or one occurrence (i.e. optional).

Page 23: Regular Expressions

Quantifiers

• By default quantifiers are greedy.

/\d{2,4}/   on  2010  -> matches 2010 (all four digits)

/<.+>/      on  My first <strong> regex </strong> test.  -> matches <strong> regex </strong>

/\w+sion/   on  Expression  -> matches Expression

• If the entire match fails because greedy quantifiers consumed too much, they are forced to give back as much as needed to make the rest of the regex succeed.
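A quick Python check of the three greedy examples, as a sketch with the standard `re` module:

```python
import re

html = "My first <strong> regex </strong> test."

# Greedy: each quantifier takes as much as it can while still allowing a match.
year = re.search(r"\d{2,4}", "2010").group()
tag_span = re.search(r"<.+>", html).group()

# \w+ first swallows the whole word, then gives back "sion" so the rest can match.
word = re.search(r"\w+sion", "Expression").group()

print(year, tag_span, word, sep="\n")
```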

Page 24: Regular Expressions

Non Greedy Quantifiers

{m,n}?   *?   +?   ??

• To make a quantifier non-greedy, append ‘?’.

<.+?>     on  My first <strong> regex </strong> test.  -> matches <strong>

• Or use negated classes.

<[^>]+>   on  My first <strong> regex </strong> test.  -> matches <strong>
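Both approaches can be compared side by side in Python (a sketch using the standard `re` module); on this input they find the same tags.

```python
import re

html = "My first <strong> regex </strong> test."

lazy = re.findall(r"<.+?>", html)       # lazy quantifier stops at the first '>'
negated = re.findall(r"<[^>]+>", html)  # negated class can never run past a '>'

print(lazy, negated)
```

The negated class usually needs less backtracking, because it cannot overshoot in the first place.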

Page 25: Regular Expressions

Grouping – ( )

• Why? – To create sub patterns, so that you can apply regex operators to whole sub patterns or you can reference them by corresponding sub group numbers.

\d{2}-\d{2}-\d{2}(\d{2})?

Will match 01-01-10 and 01-01-2010 also.

• Grouping can be used for alternation.

Page 26: Regular Expressions

Alternation - |

• Why? – Lets you match more than one sub-expression at the same point.

/\b(get|set)Value\b/    Matches either getValue or setValue.

• Branches are tried from left to right.
• Eagerness - the first branch that matches wins, so put the most likely (or the longest) pattern as the first alternative.
  • (and|android) -> in ‘robot and an android fight’, only the ‘and’ of ‘android’ is matched
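The effect of branch order can be seen in a short Python sketch (standard `re` module; the `resetValue` input is an added, hypothetical test word):

```python
import re

text = "robot and an android fight"

# Branches are tried left to right, so 'and' wins before 'android' is attempted.
short_first = re.findall(r"and|android", text)
long_first = re.findall(r"android|and", text)

# Grouping limits the alternation to get/set; \b keeps 'resetValue' out.
accessors = re.findall(r"\b(?:get|set)Value\b", "getValue setValue resetValue")

print(short_first, long_first, accessors)
```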

Page 27: Regular Expressions

Capturing – ( )

• Allows us to access sub-parts of the pattern for later processing.
• All captured sub patterns are stored in memory.
• Captured patterns are numbered from left to right, by opening parenthesis.

/\b((\d{2})-(\d{2})-(\d{2}(\d{2})?))\b/

Today is ‘18-08-2010’.

\1 -> date -> 18-08-2010
\2 -> day -> 18
\3 -> month -> 08
\4 -> year -> 2010
\5 -> year -> last two digits -> 10
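The group numbering can be verified with a small Python sketch (standard `re` module), using the same pattern and sentence:

```python
import re

pattern = re.compile(r"\b((\d{2})-(\d{2})-(\d{2}(\d{2})?))\b")
match = pattern.search("Today is '18-08-2010'.")

# Groups are numbered by the position of their opening parenthesis.
date, day, month, year, short_year = match.groups()

print(date, day, month, year, short_year)
```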

Page 28: Regular Expressions

Non-Capturing sub patterns– (?: )

• If you don’t need to refer back to a sub expression, make it non-capturing. This saves memory and processing time.

\d{2}-\d{2}-\d{2}(?:\d{2})?

Will match 01-01-10 and 01-01-2010 also.

Page 29: Regular Expressions

Named Capture – (?<> )

• We can give names to sub patterns instead of numbers.

(?P<name>pattern)                       -> Python style, Perl 5.12
(?P=name)                               -> Back reference
(?<name>pattern) or (?'name'pattern)    -> Perl 5.10
\k<name> or \k'name' or \g{name}        -> Back reference
\g{-1}, \g{-2}                          -> Relative back reference

(?<vowel>[ai]).\k<vowel>.\1    abracadabra !!

/(\w+)\s+\g{-1}/    "Thus joyful Troy Troy maintained the the watch of night..."

$date = "18-08-2010";
$date =~ s/(?<day>\d{2})-(?<month>\d{2})-(?<year>\d{4})/$+{year}-$+{month}-$+{day}/;
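For comparison, a sketch of the same two ideas in Python style, using `(?P<name>...)`, `(?P=name)` and `\g<name>` from the standard `re` module. The group name `word` is chosen here for illustration.

```python
import re

# Reorder a date by group name instead of group number.
date_pattern = re.compile(r"(?P<day>\d{2})-(?P<month>\d{2})-(?P<year>\d{4})")
iso_date = date_pattern.sub(r"\g<year>-\g<month>-\g<day>", "18-08-2010")

# Find doubled words with a named back reference.
line = "Thus joyful Troy Troy maintained the the watch of night..."
doubled = re.findall(r"\b(?P<word>\w+)\s+(?P=word)\b", line)

print(iso_date, doubled)
```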

Page 30: Regular Expressions

Before Evaluating Regex

• Hits
  • Lines that I want to match.
• Misses
  • Lines that I don’t want to match.
• Omissions
  • Lines that I didn’t match but wanted to match.
• False alarms
  • Lines that I matched but didn’t want to match.

Page 31: Regular Expressions

Matching a float number

Float number = integerpart.fractionalpart

Basic Principle – Split your task into sub tasks

Page 32: Regular Expressions

Integerpart = \d+ -> will match one or more digits

Matching a float number

Page 33: Regular Expressions

Matching a float number

Literal dot = \.

Integerpart = \d+ -> will match one or more digits

Page 34: Regular Expressions

Matching a float number

Literal dot = \.

Integerpart = \d+ -> will match one or more digits

Fractional part= \d+ -> will match one or more digits

Page 35: Regular Expressions

Integerpart = \d+

Matching a float number

Literal dot = \.

Fractional part = \d+

Combine all of them = \d+\.\d+

Page 36: Regular Expressions

Matching a float number

/\d+\.\d+/ -> Is generic.

It won’t match -123.45 or +123.45

Page 37: Regular Expressions

Matching a float number

/\d+\.\d+/ -> Is generic.

It won’t match -123.45 or +123.45

/[+-]?\d+\.\d+/ -> will match.

Page 38: Regular Expressions

Matching a float number

/[+-]?\d+\.\d+/ -> will match -123.45 and +123.45

But it won’t match - 123.45 or + 123.45

/[+-]? *\d+\.\d+/ -> will match.

But it won’t match 123. or .45

Page 39: Regular Expressions

Matching a float number

/[+-]? *(?:\d+\.\d+|\d+\.|\.\d+)/ -> will match 123. and .45

The same pattern laid out with the x modifier:

/[+-]? \ *
 (?: \d+\.\d+
   | \d+\.
   | \.\d+ )/x

Page 40: Regular Expressions

Matching a float number

But the previous pattern won’t match an exponent, as in 1.0e2 or 10.1E5

/[+-]? *(?:\d+\.\d+|\d+\.|\.\d+)(?:[eE]\d+)?/ -> will match.

/ [+-]? \ *
  (?: \d+\.\d+
    | \d+\.
    | \.\d+ )
  (?: [eE]\d+ )?
/x

Page 41: Regular Expressions

Matching a float number

But the previous pattern won’t match a signed exponent, as in 1.0e-2

/^[+-]? *(?:\d+\.\d+|\d+\.|\.\d+)(?:[eE][+-]?\d+)?$/ -> will match.

/ ^ [+-]? \ *
  (?: \d+\.\d+
    | \d+\.
    | \.\d+ )
  (?: [eE][+-]?\d+ )?
  $ /x

Page 42: Regular Expressions

Match a float number

/^
  [+-]?\ *             # first, match an optional sign
  (?:                  # then match integers or f.p. mantissas:
      \d+\.\d+         # mantissa of the form a.b
     |\d+\.            # mantissa of the form a.
     |\.\d+            # mantissa of the form .b
     |\d+              # integer of the form a
  )
  (?:[eE][+-]?\d+)?    # finally, optionally match an exponent
$/x;
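The final pattern carries over to Python almost unchanged. The sketch below uses `re.VERBOSE` in place of Perl's `/x`; the helper name `is_float` is hypothetical and only wraps the match for testing.

```python
import re

FLOAT = re.compile(r"""
    ^ [+-]? [ ]*             # optional sign, then optional spaces
    (?: \d+\.\d+             # mantissa of the form a.b
      | \d+\.                # mantissa of the form a.
      | \.\d+                # mantissa of the form .b
      | \d+                  # integer of the form a
    )
    (?: [eE][+-]?\d+ )?      # optional exponent
    $
""", re.VERBOSE)


def is_float(text):
    """Return True when the whole string looks like a float (hypothetical helper)."""
    return FLOAT.match(text) is not None


for sample in ["-123.45", "+ 123.45", "123.", ".45", "10e-2", "101E5", "1e", "."]:
    print(repr(sample), is_float(sample))
```

Inside a verbose pattern a literal space has to be written as `[ ]` or `\ `, since bare whitespace is ignored.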

Page 43: Regular Expressions

Atomic Grouping – (?> )

• Before looking into atomic grouping, we need to know about backtracking.

• Backtracking – If you don’t succeed, try and try again...

Pattern: \d+99    Subject: 19999

\d       19999 -> Add 1 to match -> 1
\d+      19999 -> Add 9 to match -> 19
\d+      19999 -> Add 9 to match -> 199
\d+      19999 -> Add 9 to match -> 1999
\d+      19999 -> Add 9 to match -> 19999
\d+      19999 -> Still need to match 99
\d+99    19999 -> Give up a 9
\d+99    19999 -> Give up one more 9
\d+99    19999 -> Success

Page 44: Regular Expressions

Atomic Grouping – (?> )

Pattern: \d+xx    Subject: 199Rs

\d       199Rs -> Add 1 to match -> 1
\d+      199Rs -> Add 9 to match -> 19
\d+      199Rs -> Add 9 to match -> 199
\d+x     199Rs -> x not matched with R
\d+x     199Rs -> Give up 9, still cannot match x
\d+x     199Rs -> Give up 9, still cannot match x
\d+x     199Rs -> Cannot give up 1, because \d+ needs at least one digit
\d+xx    199Rs -> Failure

Page 45: Regular Expressions

Atomic Grouping – (?> )

Atomic Grouping:

• Atomic grouping disables backtracking into the group and speeds up the process.
• (?>pattern): here pattern is treated as an atomic token.
• (?>\d+)xx: here (?>\d+) won’t give up any digits; it locks them.
  • The match fails right at matching x with R.
• Atomic groups are not captured and can be nested.

Possessive Quantifiers:

• Use possessive quantifiers for single items to avoid backtracking.
• Adding ‘+’ after a quantifier makes it possessive.
• (?>\d+)xx == \d++xx

Page 46: Regular Expressions

Look Around

            Ahead       Behind
Positive    (?=...)     (?<=...)
Negative    (?!...)     (?<!...)

(?=...)     Zero-width positive lookahead assertion
(?!...)     Zero-width negative lookahead assertion
(?<=...)    Zero-width positive lookbehind assertion
(?<!...)    Zero-width negative lookbehind assertion

*Note: Assertions can be nested. Example: /(?<=, (?! (?<=\d,)(?=\d) ) )/x

Page 47: Regular Expressions

Look Around

Subject: “I catch the housecat 'Tom-cat' with catnip”

/cat(?=\s+)/          -> matches ‘cat’ in ‘housecat’
/(?<=\s)cat\w+/       -> matches ‘catch’ (and ‘catnip’ with /g)
/\bcat\b/             -> matches ‘cat’ in ‘Tom-cat’
/(?<=\s)cat(?=\s)/    -> no match: no isolated ‘cat’
/cat(?!\s)/           -> first match is ‘cat’ in ‘catch’
/(?<!\s)cat/          -> first match is ‘cat’ in ‘housecat’

*Note: look-behind expressions cannot be of variable length. This means you cannot use quantifiers (?, *, +, or {1,5}) or alternation of different-length items inside them.
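The six patterns can be replayed in Python with the standard `re` module. `findall` reports every match (like Perl's `/g`), so this sketch counts matches where the same three letters would otherwise be indistinguishable.

```python
import re

sentence = "I catch the housecat 'Tom-cat' with catnip"

followed_by_space = re.findall(r"cat(?=\s+)", sentence)     # only in 'housecat'
cat_words = re.findall(r"(?<=\s)cat\w+", sentence)          # words starting with cat
bounded = re.findall(r"\bcat\b", sentence)                  # the one in 'Tom-cat'
isolated = re.findall(r"(?<=\s)cat(?=\s)", sentence)        # none
not_before_space = re.findall(r"cat(?!\s)", sentence)       # catch, Tom-cat, catnip
not_after_space = re.findall(r"(?<!\s)cat", sentence)       # housecat, Tom-cat

print(len(followed_by_space), cat_words, len(bounded),
      isolated, len(not_before_space), len(not_after_space))
```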

Page 48: Regular Expressions

Conditional expressions

• A conditional expression is a form of if-then-else statement that allows one to choose which patterns are to be matched, based on some condition.

• (?(condition)yes-regexp) is like an 'if () {}' statement
• (?(condition)yes-regexp|no-regexp) is like an 'if () {} else {}' statement

• The condition can be
  • The number of a sub pattern (true if that group matched)
  • A lookaround assertion
  • A recursive call

Match a (quoted)? string -> /^("|')?[^"']*(?(1)\1)$/

Matches 'blah blah'
Matches "blah blah"
Matches blah blah
Won’t match 'blah blah"
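Python's `re` module supports the group-number form of the condition, so the quoted-string pattern can be tried directly. This is a sketch; the helper name `is_well_quoted` is made up for the example.

```python
import re

# Group 1 is the optional opening quote; (?(1)\1) demands the same
# closing quote only when an opening quote was actually matched.
QUOTED = re.compile(r"""^("|')?[^"']*(?(1)\1)$""")


def is_well_quoted(text):
    """Return True for unquoted text or text with matching quotes (hypothetical helper)."""
    return QUOTED.match(text) is not None


for sample in ["'blah blah'", '"blah blah"', "blah blah", "'blah blah\""]:
    print(sample, is_well_quoted(sample))
```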

Page 49: Regular Expressions

Conditional expression

Lookbehind as the condition:

/(.)\1(?(?<=AA)G|C)$/

Matches:
ATGAAG
TAGBBC
GATGGC

Group as the condition, run over /usr/share/dict/words:

/^(.+)(.+)?(?(2)\2\1|\1)$/

Matches:
aa
baba
beriberi
maam
vetitive

Page 50: Regular Expressions

Recursive Patterns – (?)

• Balanced parentheses, e.g. (x(x)y(x)x)

• Palindrome -> /^((.)(?:(?1)|\w)*(\2))$/

qr/
  ^              # Start of string
  (              # Start capture group 1
    \(           # Open paren
    (?>          # Atomic (non-backtracking) subgroup
      [^()]++    # Grab all the non-parens we can
      |          # or
      (?1)       # Recurse into group 1
    )*           # Zero or more times
    \)           # Close paren
  )              # End capture group 1
  $              # End of string
/x;

Page 51: Regular Expressions

Code Evaluation – (?{ })

• Perl code can be evaluated inside regular expressions using the (?{ }) construct.

$x = "aaaa";
$x =~ /(a(?{print "Yow\n";}))*aa/;

produces

Yow
Yow
Yow
Yow

Page 52: Regular Expressions

Pattern Code Expression – (??{ })

• Pattern code expression - the result of the code evaluation is treated as a regular expression and matched immediately.

• Construct is (??{ })

$length = 5;
$char = 'a';
$str = 'aaaaabb';
$str =~ /(??{$char x $length})/x; # matches, there are 5 of 'a'

Page 53: Regular Expressions

Inline modifiers & Comments

Matching can be modified inline by placing modifiers.

(?i)    enables case-insensitive mode
(?m)    enables multiline matching for ^ and $
(?s)    makes the dot metacharacter match newline also
(?x)    ignores literal whitespace
(?U)    makes quantifiers ungreedy (lazy) by default

$answers =~ /(?i)y(?-i)(?:es)?/ -> Will match ‘y’, ‘Y’, ‘yes’, ‘Yes’ but not ‘YES’.

Comments can be inserted inline using the (?#...) construct.

/^(?#begin)\d+(?#match integer part)\.(?#match dot)\d+(?#match fractional part)$/
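A Python sketch of the same two ideas. Python's `re` has no `(?-i)` switch in the middle of a pattern, so the case-insensitive part is written as a scoped group `(?i:...)` instead (available since Python 3.6); `(?#...)` comments work as shown on the slide.

```python
import re

# Only the leading 'y' is case-insensitive; 'es' must be lowercase.
YES = re.compile(r"(?i:y)(?:es)?")
answers = ["y", "Y", "yes", "Yes", "YES"]
accepted = [a for a in answers if YES.fullmatch(a)]

# Inline comments are ignored by the engine.
SIMPLE_FLOAT = re.compile(
    r"^(?#begin)\d+(?#integer part)\.(?#dot)\d+(?#fractional part)$"
)

print(accepted, bool(SIMPLE_FLOAT.match("3.14")))
```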

Page 54: Regular Expressions

Regex Testers, Tools, Editors

Vim, TextMate, EditPad Pro, NoteTab, UltraEdit

RegexBuddy

Reggy – http://reggyapp.com

http://rubular.com (Ruby)

RegexPal (JavaScript) - http://www.regexpal.com

http://www.gskinner.com/RegExr/

http://www.spaweditor.com/scripts/regex/index.php

http://regex.larsolavtorvik.com/ (PHP, JavaScript)

http://www.nregex.com/ ( .NET )

http://www.myregexp.com/ ( Java )

http://osteele.com/tools/reanimator ( NFA Graphic repr. )

Expresso - http://www.ultrapico.com/Expresso.htm ( .NET )

Regulator - http://sourceforge.net/projects/regulator ( .NET )

RegexRenamer - http://regexrenamer.sourceforge.net/ ( .NET )

PowerGREP http://www.powergrep.com/

Windows Grep - http://www.wingrep.com/

Page 55: Regular Expressions

Regex Resources

$perldoc perlre perlretut perlreref

$man re_format

“Mastering Regular Expressions” by Jeffrey Friedl

http://oreilly.com/catalog/9780596528126/

“Regular Expressions Cookbook” by Jan Goyvaerts & Steven Levithan

http://oreilly.com/catalog/9780596520694

Page 56: Regular Expressions

Questions?


Page 57: Regular Expressions

Thank You!


Page 58: Regular Expressions

Java Regex

import java.util.regex.*;

public class MatchTest {
    public static void main(String[] args) throws Exception {
        String date = "12/30/1969";
        // Groups: 1 = month, 2 = day, 3 = two- or four-digit year
        Pattern p = Pattern.compile("^(\\d\\d)[-/](\\d\\d)[-/](\\d\\d(?:\\d\\d)?)$");
        Matcher m = p.matcher(date);

        if (m.find()) {
            String month = m.group(1);
            String day = m.group(2);
            String year = m.group(3);
            System.out.printf("Found %s-%s-%s\n", year, month, day);
        }
    }
}

Page 59: Regular Expressions

PHP Regex

$date = "12/30/1969";

$p = "!^(\\d\\d)[-/](\\d\\d)[-/](\\d\\d(?:\\d\\d)?)$!";

if (preg_match($p, $date, $matches)) {
    $month = $matches[1];
    $day = $matches[2];
    $year = $matches[3];
}

$text = "Hello world. <br>";

$pattern = "{<br>}i";

echo preg_replace($pattern, "<br />", $text);

Page 60: Regular Expressions

JavaScript Regex

var date = "12/30/1969";

var p = new RegExp("^(\\d\\d)[-/](\\d\\d)[-/](\\d\\d(?:\\d\\d)?)$");

var result = p.exec(date);

if (result != null) {
    var month = result[1];
    var day = result[2];
    var year = result[3];
}

var text = "Hello world. <br>";

var pattern = /<br>/ig;

// replace() returns a new string, so keep the result
text = text.replace(pattern, "<br />");

Page 61: Regular Expressions

.NET Regex

using System.Text.RegularExpressions;

class MatchTest {
    static void Main() {
        string date = "12/30/1969";
        Regex r = new Regex(@"^(\d\d)[-/](\d\d)[-/](\d\d(?:\d\d)?)$");
        Match m = r.Match(date);
        if (m.Success) {
            string month = m.Groups[1].Value;
            string day = m.Groups[2].Value;
            string year = m.Groups[3].Value;
        }
    }
}

Page 62: Regular Expressions

Python Regex

import re

date = '12/30/1969'

regex = re.compile(r'^(\d\d)[-/](\d\d)[-/](\d\d(?:\d\d)?)$')

match = regex.match(date)

if match:
    month = match.group(1)  # 12
    day = match.group(2)    # 30
    year = match.group(3)   # 1969

Page 63: Regular Expressions

Ruby Regex

date = '12/30/1969'

regexp = Regexp.new('^(\d\d)[-/](\d\d)[-/](\d\d(?:\d\d)?)$')

if md = regexp.match(date)
  month = md[1]  # 12
  day = md[2]    # 30
  year = md[3]   # 1969
end

Page 64: Regular Expressions

Unicode Properties

Page 65: Regular Expressions

Pattern Code Expression – (??{ })

• Pattern code expression - the result of the code evaluation is treated as a regular expression and matched immediately.

• (??{ })

Find incremental numbers:

$str="abc 123hai cde 34567 efg 1245 a132 123456789 10adf";

print "$1\n" while($str=~/\D( (\d) (?{$x=$2}) ( (??{++$x%10}) )* ) \D/gx);

Page 66: Regular Expressions

Commify a number

$no=123456789;
substr($no,0,length($no)-1)=~s/(?=(?<=\d)(?:\d\d)+$)/,/g;
print $no;

Produces 12,34,56,789
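A Python sketch of the same lookaround trick. Instead of cutting off the last digit with `substr`, the lookahead here requires one extra digit before the end. The function names are made up for the example, and the three-digit variant is added only for contrast.

```python
import re


def commify_indian(number):
    """Group the last three digits, then pairs: 123456789 -> 12,34,56,789."""
    # Insert a comma where a digit precedes and an odd number (>= 3) of digits follows.
    return re.sub(r"(?<=\d)(?=(?:\d\d)+\d$)", ",", str(number))


def commify_thousands(number):
    """Group digits in threes: 123456789 -> 123,456,789."""
    return re.sub(r"(?<=\d)(?=(?:\d{3})+$)", ",", str(number))


print(commify_indian(123456789), commify_thousands(123456789))
```

Both patterns are zero-width: nothing is consumed, the comma is simply inserted at each matching position.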

Page 67: Regular Expressions


A non-capturing group inside a capturing group won’t keep its text out of the capture (this prints ‘:cat :’):

perl -e '$x="cat cat cat";$x=~/(cat(?:\s+))/;print ":$1:";'