satyanarayana-venkata
Topics
• What?
• Why?
• History - Who?
• Flavors
• Grammar
• Meta Chars
• Character Classes
• Shorthand Char Classes
• Anchors
• Repeaters or Quantifiers
• Grouping & Capturing
• Alternation
• Match Float
• Atomic Grouping
• Look Around
• Conditional Expr.
• Recursive Regex
• Code Evaluation
• Code Expr.
• Inline Modifiers
• Regex Tools
• Q&A
What are Regular Expressions?
• A Regular expression is a pattern describing a certain amount of text.
• A regular expression, often called a pattern, is an expression that describes a set of strings. - Wikipedia
Why do we need them?
• Regular expressions allow matching and manipulation of textual data.
• Requirements
  • Matching/Finding
  • Doing something with matched text
  • Validation of data
  • Case insensitive matching
  • Parsing data (e.g. HTML)
  • Converting data into a different form, etc.
History
• Stephen Kleene - a mathematician who discovered ‘regular sets’.
• Ken Thompson, 1968 - Regular Expression Search Algorithm.
  Qed -> ed -> g/re/p
• Henry Spencer, 1986 - wrote a regex library in C.
Regex Flavors
BRE - Basic Regular Expressions
• \?, \+, \{, \|, \(, and \)
• ed, g/re/p, sed
ERE - Extended Regular Expressions
• ?, +, {, |, (, and )
• grep -E == egrep, awk
PCRE - Perl Compatible Regular Expressions, by Philip Hazel
• Perl, PHP, Tcl etc.
Grammar of Regex
RE = one or more non-empty ‘branches’ separated by ‘|’
Branch = one or more ‘pieces’
Piece = atom, optionally followed by a quantifier
Quantifier = ‘*’, ‘+’, ‘?’ or a ‘bound’
Bound = atom{n}, atom{n,}, atom{m,n}
Atom = (RE) or
       () or
       ‘^’, ‘$’, ‘.’ or
       \ followed by one of `^.[$()|*+?{\’ or
       any-char or
       ‘bracket expression’
Bracket Expression = a list of characters enclosed in ‘[ ]’
Meta Chars?
In a normal expression like:
2 + 4
Here ‘+’ has a special meaning.
Meta Chars
\     Quote the next metacharacter
^     Match the beginning of the line
.     Match any character (except newline) = [^\n]
$     Match the end of the line (or before newline at the end)
|     Alternation
( )   Grouping
[ ]   Character class
{ }   Match m to n times
*     Match 0 or more times
+     Match 1 or more times
?     Match 1 or 0 times
Non-printable Chars
\t         tab (HT, TAB)
\n         newline (LF, NL)
\r         return (CR)
\f         form feed (FF)
\a         alarm (bell) (BEL)
\e         escape (think troff) (ESC)
\033       octal char (example: ESC)
\x1B       hex char (example: ESC)
\x{263a}   long hex char (example: Unicode SMILEY)
\cK        control char (example: VT)
\N{name}   named Unicode character
Character Classes – [ ]
• Set of characters placed inside square brackets. Inside brackets, meta characters lose their meaning (except ‘] \ ^ -’).
• Requirements
  • Matches one and only one character out of the specified chars.
  • A range can be specified using ‘-’.
    • a-z matches the 26 lower case English letters.
    • 0-9 matches any digit.
  • Negation can be specified using ‘^’ at the beginning of the class.
  • To match the exceptional chars listed above literally, either escape them or, as with ‘-’ in [a-z-], place them at the end.

[0-9]          Matches any one of 0,1,2,3,4,5,6,7,8,9.
[aeiou]        Matches one English vowel char.
[^aeiou]       Matches any non-vowel char.
[a-z-]         Matches a to z and ‘-’.
[a-z0-9]       Union: matches a to z and 0 to 9.
[a-z&&[m-z]]   Intersection: matches m to z (Java-style syntax, not available in every flavor).
[a-z-[m-z]]    Subtraction: matches a to l (.NET-style syntax, not available in every flavor).
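The portable classes from the table above can be tried out with a short sketch. It uses Python's `re` module as a stand-in for the flavors on the slides; the sample string `text` is an assumption made up for illustration, and the intersection and subtraction forms are left out because Python's `re` does not provide them.

```python
import re

text = "regex-101"  # hypothetical sample string

vowels = re.findall(r"[aeiou]", text)        # one vowel per match
digits = re.findall(r"[0-9]", text)          # one digit per match
word_or_dash = re.findall(r"[a-z-]+", text)  # '-' placed last is a literal dash
others = re.findall(r"[^a-z0-9]", text)      # negated class: not a letter or digit

print(vowels, digits, word_or_dash, others)
```

Each class still consumes exactly one character per position; it is the `+` in the third pattern that lets a run of them match.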
POSIX Character Classes – [: … :]
[^[:digit:]] = \D = [^0-9]
Shorthand Chars
\w   word character            [A-Za-z0-9_]
\d   decimal digit             [0-9]
\s   whitespace                [ \n\r\t\f]
\W   not a word character      [^A-Za-z0-9_]
\D   not a decimal digit       [^0-9]
\S   not whitespace            [^ \n\r\t\f]
Anchors/Assertions
• An anchor matches a certain position in the subject string and does not consume any characters.

^        Match the beginning of the line
$        Match the end of the line (or before newline at the end)
\A       Matches only at the very beginning
\z       Matches only at the very end
\Z       Matches like $ outside multi-line mode (at the very end, or before a newline at the end)
\b       Matches when the current position is a word boundary
\<, \>   Match at the start and at the end of a word (GNU-style word boundaries)
\B       Matches when the current position is not a word boundary
^ Anchor
^ Match the beginning of the line
/^a/ -> String begins with ‘a’
$ Anchor
$ Match the end of the line (or before newline at the end)
/s$/ -> String ends with ‘s’
\A Anchor
\A Matches only at the very beginning
^ vs \A -> In multi-line mode ^ also matches after every newline; \A never does.
\z, \Z Anchors
\z Matches only at the very end
\Z Matches like $ outside multi-line mode
$ vs \z, \Z -> In multi-line mode $ also matches before every newline; \Z matches only at the end or before a final newline, and \z only at the very end.
\b, \B Anchors
\b = \W\w|\w\W = Matches a word boundary
\B Matches when the current position is not a word boundary
$ xl2twiki file 2 > /dev/null
/\b2\b/ -> matches the standalone ‘2’
/\B2\B/ -> matches the ‘2’ inside ‘xl2twiki’
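The difference between ^ and \A, and between \b and \B, can be checked with a small sketch. It uses Python's `re` module, where `re.MULTILINE` plays the role of the /m modifier; the three-line string `lines` is an assumption invented for the example, while `cmd` is the shell command from the slide. Python's own \Z behaves differently from Perl's, so it is deliberately not used here.

```python
import re

lines = "alpha\nbeta\nalso"  # hypothetical multi-line subject

start_only = re.findall(r"^a\w+", lines)                 # ^ at string start only
each_line = re.findall(r"^a\w+", lines, re.MULTILINE)    # ^ after every newline too
very_start = re.findall(r"\Aa\w+", lines, re.MULTILINE)  # \A ignores multi-line mode

cmd = "xl2twiki file 2 > /dev/null"
standalone = [m.start() for m in re.finditer(r"\b2\b", cmd)]  # the lone "2"
embedded = [m.start() for m in re.finditer(r"\B2\B", cmd)]    # the "2" in "xl2twiki"

print(start_only, each_line, very_start, standalone, embedded)
```

None of these anchors consumes text: the offsets in `standalone` and `embedded` are the positions of the digit itself.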
Quantifiers (repetition)
• Why? – Because we are not sure about the text. A quantifier specifies how many times a regex component must repeat.

{m,n}       = Matches a minimum of m and a maximum of n occurrences.
* = {0,}    = Matches zero or more occurrences (any amount).
+ = {1,}    = Matches one or more occurrences.
? = {0,1}   = Matches zero or one occurrence (means optional).
Quantifiers
• By default quantifiers are greedy.
/\d{2,4}/ on ‘2010’ -> matches ‘2010’ (all four digits, not just two)
/<.+>/ on ‘My first <strong> regex </strong> test.’ -> matches ‘<strong> regex </strong>’
/\w+sion/ on ‘Expression’ -> matches ‘Expression’
• If the entire match fails because they consumed too much, they are forced to give up as much as needed to make the rest of the regex succeed.
Non Greedy Quantifiers
{m,n}?   *?   +?   ??
• To make a quantifier non-greedy, append ‘?’.
/<.+?>/ on ‘My first <strong> regex </strong> test.’ -> matches ‘<strong>’
• Or use negated classes.
/<[^>]+>/ on ‘My first <strong> regex </strong> test.’ -> matches ‘<strong>’
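The three ways of matching a tag can be compared side by side. This is a minimal sketch in Python's `re` module, reusing the sentence from the slide; the variable names are made up for illustration.

```python
import re

html = "My first <strong> regex </strong> test."

greedy = re.search(r"<.+>", html).group()       # runs to the last '>'
lazy = re.search(r"<.+?>", html).group()        # stops at the first '>'
negated = re.search(r"<[^>]+>", html).group()   # cannot cross a '>' at all
bounded = re.search(r"\d{2,4}", "2010").group() # greedy: takes all four digits

print(greedy, lazy, negated, bounded, sep="\n")
```

The lazy and negated-class versions return the same text here; the negated class reaches it without trial and error because `[^>]` can never run past the closing bracket.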
Grouping – ( )
• Why? – To create sub patterns, so that you can apply regex operators to whole sub patterns or you can reference them by corresponding sub group numbers.
\d{2}-\d{2}-\d{2}(\d{2})?
Will match 01-01-10 and 01-01-2010 also.
• Grouping can be used for alternation.
Alternation - |
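• Why? – Lets you match more than one sub-expression at the same point.
/\b(get|set)Value\b/ -> Matches either getValue or setValue.
• Branches are tried from left to right.
• Eagerness - the first alternative that matches wins, so put the most likely pattern first.
• (and|android) on ‘robot and an android fight’ -> matches only ‘and’ inside ‘android’, because the first alternative already succeeds.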
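Eagerness is easy to see by swapping the order of the alternatives. The sketch below uses Python's `re` module on the sentence from the slide; the `getValue setValue letValue` string is an assumption added for illustration.

```python
import re

sentence = "robot and an android fight"

# 'and' is tried first and succeeds, so 'android' is never reached.
first_wins = re.findall(r"and|android", sentence)

# With the longer alternative first, the full word is found.
longer_first = re.findall(r"android|and", sentence)

# Grouping limits the alternation to the prefix only.
accessors = re.findall(r"\b(get|set)Value\b", "getValue setValue letValue")

print(first_wins, longer_first, accessors)
```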
Capturing – ( )
• Allows us to access sub-parts of the pattern for later processing.
• All captured sub patterns are stored in memory.
• Captured patterns are numbered from left to right.
/\b((\d{2})-(\d{2})-(\d{2}(\d{2})?))\b/
Today is ‘18-08-2010’.
\1 -> date -> 18-08-2010
\2 -> day -> 18
\3 -> month -> 08
\4 -> year -> 2010
\5 -> year -> last two digits -> 10
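The group numbering can be verified with a short sketch in Python's `re` module, using the same pattern and date as the slide. The second subject, with a two-digit year, is an assumption added to show what an unmatched optional group looks like.

```python
import re

date_re = re.compile(r"\b((\d{2})-(\d{2})-(\d{2}(\d{2})?))\b")

# Groups are numbered by the position of their opening parenthesis.
full = date_re.search("Today is '18-08-2010'.").groups()

# With a two-digit year, the optional group 5 does not take part in the match.
short = date_re.search("Today is '18-08-10'.").groups()

print(full)
print(short)
```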
Non-Capturing sub patterns– (?: )
• If you don’t need back references, make sub-expressions non-capturing; this saves memory and processing time.
\d{2}-\d{2}-\d{2}(?:\d{2})?
Will match 01-01-10 and 01-01-2010 also.
Named Capture – (?<> )
• We can give names to sub patterns instead of numbers.
(?P<name>pattern) -> Python style, Perl 5.12
(?P=name) -> Back reference
(?<name>pattern) or (?'name'pattern) -> Perl 5.10
\k<name> or \k'name' or \g{name} -> Back reference
\g{-1}, \g{-2} -> Relative back reference
/(?<vowel>[ai]).\k<vowel>.\1/ on ‘abracadabra !!’ -> matches ‘acada’
/(\w+)\s+\g{-1}/ on "Thus joyful Troy Troy maintained the the watch of night..." -> matches the doubled words
$date = "18-08-2010";
$date =~ s/(?<day>\d{2})-(?<month>\d{2})-(?<year>\d{4})/$+{year}-$+{month}-$+{day}/;
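Named groups and named back references look like this in the Python style. The sketch uses Python's `re` module with the date and the sentence from the slide; the group names `day`, `month`, `year` and `word` are chosen for illustration, and the \b anchors in the second pattern are an addition so that only whole words count as doubled.

```python
import re

date_re = re.compile(r"(?P<day>\d{2})-(?P<month>\d{2})-(?P<year>\d{4})")

# \g<name> in the replacement refers to a named group.
iso = date_re.sub(r"\g<year>-\g<month>-\g<day>", "18-08-2010")

# (?P=word) is a back reference to the text captured by the group 'word'.
line = "Thus joyful Troy Troy maintained the the watch of night..."
doubled = re.findall(r"\b(?P<word>\w+)\s+(?P=word)\b", line)

print(iso, doubled)
```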
Before Evaluating Regex
• Hits - lines that I want to match.
• Misses - lines that I don’t want to match.
• Omissions - lines that I didn’t match but wanted to match.
• False alarms - lines that I matched but didn’t want to match.
Matching a float number
Basic Principle – Split your task into sub tasks
Float number = integerpart.fractionalpart
Integer part = \d+ -> will match one or more digits
Literal dot = \.
Fractional part = \d+ -> will match one or more digits
Combine all of them = \d+\.\d+

/\d+\.\d+/ -> Is generic.
But it won’t match -123.45 or +123.45
/[+-]?\d+\.\d+/ -> will match.
But it won’t match - 123.45 or + 123.45
/[+-]? *\d+\.\d+/ -> will match.
But it won’t match 123. or .45
/[+-]? *(?:\d+\.\d+|\d+\.|\.\d+)/ -> will match.
But it won’t match an exponent, as in 10e2 or 101E5
/[+-]? *(?:\d+\.\d+|\d+\.|\.\d+)(?:[eE]\d+)?/ -> adds an optional exponent.
But it won’t match a signed exponent, as in 10e-2
/^[+-]? *(?:\d+\.\d+|\d+\.|\.\d+)(?:[eE][+-]?\d+)?$/ -> adds the exponent sign and anchors.
Every mantissa so far still needs a dot, so 10e2 and 10e-2 only match once the integer alternative \d+ is added, as in the final version.

Match a float number
/^
  [+-]?\ *            # first, match an optional sign
  (?:                 # then match integers or f.p. mantissas:
      \d+\.\d+        # mantissa of the form a.b
     |\d+\.           # mantissa of the form a.
     |\.\d+           # mantissa of the form .b
     |\d+             # integer of the form a
  )
  (?:[eE][+-]?\d+)?   # finally, optionally match an exponent
$/x;
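The final pattern can be exercised against the hits and misses collected along the way. This is a sketch in Python's `re` module, where `re.VERBOSE` corresponds to the /x modifier; the helper name `is_float` and the lists of samples are assumptions made up for the example.

```python
import re

FLOAT_RE = re.compile(r"""
    ^[+-]?\ *            # optional sign, then optional spaces
    (?:
        \d+\.\d+         # mantissa of the form a.b
      | \d+\.            # mantissa of the form a.
      | \.\d+            # mantissa of the form .b
      | \d+              # integer of the form a
    )
    (?:[eE][+-]?\d+)?    # optional exponent
    $
""", re.VERBOSE)

def is_float(text):
    """Return True if the whole string looks like a float (hypothetical helper)."""
    return FLOAT_RE.match(text) is not None

hits = ["123.45", "-123.45", "+ 123.45", "123.", ".45", "10e-2", "101E5"]
misses = ["abc", "1.2.3", "e5", "."]

print([is_float(s) for s in hits])
print([is_float(s) for s in misses])
```

Keeping two such lists next to the pattern is a cheap way to spot omissions and false alarms each time the regex is changed.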
Atomic Grouping – (?> )
• Before looking into atomic grouping, we need to know about backtracking.
• Backtracking – If you don’t succeed, try and try again...

Matching \d+99 against 19999:
\d+     19999 -> Add 1 to match -> 1
\d+     19999 -> Add 9 to match -> 19
\d+     19999 -> Add 9 to match -> 199
\d+     19999 -> Add 9 to match -> 1999
\d+     19999 -> Add 9 to match -> 19999
\d+     19999 -> Still need to match 99
\d+99   19999 -> Give up a 9
\d+99   19999 -> Give up one more 9
\d+99   19999 -> Success

Matching \d+xx against 199Rs:
\d+     199Rs -> Add 1 to match -> 1
\d+     199Rs -> Add 9 to match -> 19
\d+     199Rs -> Add 9 to match -> 199
\d+x    199Rs -> x not matched with R
\d+x    199Rs -> Give up 9, still cannot match x
\d+x    199Rs -> Give up 9, still cannot match x
\d+x    199Rs -> Cannot give up 1, because \d+ needs at least one digit
\d+xx   199Rs -> Failure

• Atomic grouping disables backtracking into the group and speeds up the process.
• (?>pattern) - here pattern is treated as an atomic token.
• (?>\d+)xx - here (?>\d+) won’t give up any digits; it locks them in.
  • It fails right at matching x with R.
• Atomic groups are not captured and can be nested.

Possessive Quantifiers
• Use possessive quantifiers on single items to avoid backtracking.
• Adding ‘+’ after a quantifier makes it possessive.
• (?>\d+)xx == \d++xx
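The effect of locking in the digits can be shown without relying on atomic-group syntax. The sketch below uses Python's `re` module and a common emulation, a lookahead that captures followed by a back reference, rather than (?> ) itself; Python only accepts (?>...) and `++` directly from version 3.11 on. The group name `run` is made up for illustration.

```python
import re

subject = "19999"

# Ordinary greedy \d+ backtracks: it gives up two 9s so that "99" can match.
backtracking = re.search(r"\d+99", subject)

# Emulated atomic group: the lookahead captures the longest digit run and is
# never re-entered, so the back reference cannot hand any digits back.
atomic_like = re.search(r"(?=(?P<run>\d+))(?P=run)99", subject)

print(backtracking.group() if backtracking else None)
print(atomic_like.group() if atomic_like else None)
```

The second search fails at every starting position, which is the behaviour the slides describe for (?>\d+)99: once the digits are taken, nothing is left for the trailing 99.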
Look Around
              Ahead                  Behind
        Positive   Negative    Positive   Negative
        (?=...)    (?!...)     (?<=...)   (?<!...)
(?=...)    Zero-width positive lookahead assertion
(?!...)    Zero-width negative lookahead assertion
(?<=...)   Zero-width positive lookbehind assertion
(?<!...)   Zero-width negative lookbehind assertion
*Note: Assertions can be nested. Example: /(?<=, (?! (?<=\d,)(?=\d) ) )/
Subject: “I catch the housecat 'Tom-cat' with catnip”
/cat(?=\s+)/ -> matches ‘cat’ in ‘housecat’
/(?<=\s)cat\w+/ -> matches ‘catch’ and ‘catnip’
/\bcat\b/ -> matches ‘cat’ in ‘Tom-cat’
/(?<=\s)cat(?=\s)/ -> no match: there is no isolated ‘cat’
/cat(?!\s)/ -> matches ‘cat’ in ‘catch’, ‘Tom-cat’ and ‘catnip’
/(?<!\s)cat/ -> matches ‘cat’ in ‘housecat’ and ‘Tom-cat’
*Note: Look-behind expressions cannot be of variable length. This means you cannot use quantifiers (?, *, +, or {1,5}) or alternation of different-length items inside them.
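Because lookarounds are zero-width, the clearest way to compare them is by where each ‘cat’ is found. The sketch below uses Python's `re` module on the sentence from the slide and reports match offsets; the helper `starts` is a hypothetical name introduced for the example.

```python
import re

s = "I catch the housecat 'Tom-cat' with catnip"

def starts(pattern):
    """Offsets at which the pattern matches in s (hypothetical helper)."""
    return [m.start() for m in re.finditer(pattern, s)]

followed_by_space = starts(r"cat(?=\s)")           # 'cat' in 'housecat'
after_space = re.findall(r"(?<=\s)cat\w+", s)      # words starting with 'cat'
whole_word = starts(r"\bcat\b")                    # 'cat' in 'Tom-cat'
isolated = starts(r"(?<=\s)cat(?=\s)")             # no isolated 'cat'
not_followed_by_space = starts(r"cat(?!\s)")       # catch, Tom-cat, catnip
not_preceded_by_space = starts(r"(?<!\s)cat")      # housecat, Tom-cat

print(followed_by_space, after_space, whole_word)
print(isolated, not_followed_by_space, not_preceded_by_space)
```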
Conditional expressions
• A conditional expression is a form of if-then-else statement that allows one to choose which patterns are to be matched, based on some condition.
• (?(condition)yes-regexp) is like an 'if () {}' statement
• (?(condition)yes-regexp|no-regexp) is like an 'if () {} else {}' statement
• Condition can be
  • The number of a sub pattern (true if that group matched)
  • Lookaround assertion
  • Recursive call
Match a (quoted)? string -> /^("|')?[^"']*(?(1)\1)$/
Matches 'blah blah'
Matches "blah blah"
Matches blah blah
Won’t match 'blah blah"
/(.)\1(?(?<=AA)G|C)$/ -> ATGAAG, TAGBBC, GATGGC
/usr/share/dict/words -> /^(.+)(.+)?(?(2)\2\1|\1)$/ -> aa, baba, beriberi, maam, vetitive
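The group-number form of the conditional can be tried directly. The sketch below uses Python's `re` module, which accepts a group number or name as the condition but not a lookaround, so the (?<=AA) example is left out. The sample lists are assumptions made up for illustration.

```python
import re

# If group 1 (the opening quote) matched, require the same quote at the end.
quoted = re.compile(r"""^("|')?[^"']*(?(1)\1)$""")

samples = ["'blah blah'", '"blah blah"', "blah blah", "'blah blah\""]
results = [bool(quoted.match(s)) for s in samples]

# If group 2 matched, the word must end with group 2 then group 1;
# otherwise it must simply repeat group 1.
mirror = re.compile(r"^(.+)(.+)?(?(2)\2\1|\1)$")
words = ["maam", "vetitive", "regex"]
found = [w for w in words if mirror.match(w)]

print(results, found)
```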
Recursive Patterns – (?1)
• Balanced parentheses, e.g. (x(x)y(x)x)
qr/
  ^               # Start of string
  (               # Start capture group 1
    \(            # Open paren
    (?>           # Atomic (non-backtracking) subgroup
        [^()]++   # Grab all the non parens we can
      |           # or
        (?1)      # Recurse into group 1
    )*            # Zero or more times
    \)            # Close paren
  )               # End capture group 1
  $               # End of string
/x;
• Palindrome -> /^((.)(?:(?1)|\w)*(\2))$/
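Python's standard `re` module has no (?1) recursion, so the sketch below is not a regex at all: it is a small hand-written recursive function that mirrors the structure of the pattern above, with one branch for plain characters and one that recurses on an opening parenthesis. The names `match_group` and `is_balanced` are hypothetical.

```python
def match_group(s, i=0):
    """Return the index just past a balanced '(...)' starting at s[i], or -1."""
    if i >= len(s) or s[i] != "(":
        return -1
    i += 1                           # consume the open paren
    while i < len(s) and s[i] != ")":
        if s[i] == "(":
            i = match_group(s, i)    # recurse, like (?1)
            if i == -1:
                return -1
        else:
            i += 1                   # plain character, like [^()]
    return i + 1 if i < len(s) else -1   # consume the close paren

def is_balanced(s):
    """True if the whole string is one balanced parenthesised group."""
    return match_group(s) == len(s)

print(is_balanced("(x(x)y(x)x)"), is_balanced("(x(x)"))
```

Requiring the returned index to equal the string length plays the role of the ^ and $ anchors in the Perl version.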
Code Evaluation – (?{ })
• Perl code can be evaluated inside regular expressions using the (?{ }) construct.
$x = "aaaa";
$x =~ /(a(?{print "Yow\n";}))*aa/;
produces
Yow
Yow
Yow
Yow
Pattern Code Expression – (??{ })
• Pattern code expression - the result of the code evaluation is treated as a regular expression and matched immediately.
• The construct is (??{ })
$length = 5;
$char = 'a';
$str = 'aaaaabb';
$str =~ /(??{$char x $length})/x; # matches, there are 5 of 'a'
Inline modifiers & Comments
• Matching can be modified inline by placing modifiers.
(?i) enables case-insensitive mode
(?m) enables multiline matching for ^ and $
(?s) makes the dot metacharacter match newline also
(?x) ignores literal whitespace
(?U) makes quantifiers ungreedy (lazy) by default (PCRE)
$answers =~ /(?i)y(?-i)(?:es)?/ -> Will match ‘y’, ‘Y’, ‘yes’, ‘Yes’ but not ‘YES’.
• Comments can be inserted inline using the (?# ) construct.
/^(?#begin)\d+(?#match integer part)\.(?#match dot)\d+(?#match fractional part)$/
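The same two ideas carry over to Python with a small change in spelling. Python's `re` module does not support switching a flag off again with (?-i) in the middle of a pattern, so this sketch uses the scoped form (?i:...) instead, which limits case-insensitivity to the first letter; (?#...) comments work as on the slide. The list `answers` is an assumption made up for the example.

```python
import re

# (?i:y) applies case-insensitivity to 'y' only; 'es' stays case-sensitive.
answer_re = re.compile(r"(?i:y)(?:es)?")

answers = ["y", "Y", "yes", "Yes", "YES", "no"]
accepted = [a for a in answers if answer_re.fullmatch(a)]

# Inline (?#...) comments are ignored by the engine.
float_re = re.compile(r"^(?#begin)\d+(?#integer part)\.(?#dot)\d+(?#fractional part)$")

print(accepted, bool(float_re.match("3.14")), bool(float_re.match("3.")))
```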
Regex Tools
Editors: Vim, TextMate, EditPad Pro, NoteTab, UltraEdit
Testers and tools:
RegexBuddy
Reggy – http://reggyapp.com
http://rubular.com (Ruby)
RegexPal (JavaScript) - http://www.regexpal.com
http://www.gskinner.com/RegExr/
http://www.spaweditor.com/scripts/regex/index.php
http://regex.larsolavtorvik.com/ (PHP, JavaScript)
http://www.nregex.com/ ( .NET )
http://www.myregexp.com/ ( Java )
http://osteele.com/tools/reanimator ( NFA Graphic repr. )
Expresso - http://www.ultrapico.com/Expresso.htm ( .NET )
Regulator - http://sourceforge.net/projects/regulator ( .NET )
RegexRenamer - http://regexrenamer.sourceforge.net/ ( .NET )
PowerGREP http://www.powergrep.com/
Windows Grep - http://www.wingrep.com/
Regex Resources
$perldoc perlre perlretut perlreref
$man re_format
“Mastering Regular Expressions” by Jeffrey Friedl
http://oreilly.com/catalog/9780596528126/
“Regular Expressions Cookbook” by Jan Goyvaerts & Steven Levithan
http://oreilly.com/catalog/9780596520694
Questions?
Thank You
Java Regex
import java.util.regex.*;

public class MatchTest {
    public static void main(String[] args) throws Exception {
        String date = "12/30/1969";
        // Three groups: month, day, and a two- or four-digit year
        Pattern p = Pattern.compile("^(\\d\\d)[-/](\\d\\d)[-/](\\d\\d(?:\\d\\d)?)$");
        Matcher m = p.matcher(date);
        if (m.find()) {
            String month = m.group(1);
            String day = m.group(2);
            String year = m.group(3);
            System.out.printf("Found %s-%s-%s\n", year, month, day);
        }
    }
}
PHP Regex
$date = "12/30/1969";
$p = "!^(\\d\\d)[-/](\\d\\d)[-/](\\d\\d(?:\\d\\d)?)$!";
if (preg_match($p, $date, $matches)) {
    $month = $matches[1];
    $day = $matches[2];
    $year = $matches[3];
}
$text = "Hello world. <br>";
$pattern = "{<br>}i";
echo preg_replace($pattern, "<br />", $text);
JavaScript Regex
var date = "12/30/1969";
var p = new RegExp("^(\\d\\d)[-/](\\d\\d)[-/](\\d\\d(?:\\d\\d)?)$");
var result = p.exec(date);
if (result != null) {
    var month = result[1];
    var day = result[2];
    var year = result[3];
}
var text = "Hello world. <br>";
var pattern = /<br>/ig;
text.replace(pattern, "<br />"); // returns the new string
.NET Regex
using System.Text.RegularExpressions;

class MatchTest {
    static void Main() {
        string date = "12/30/1969";
        Regex r = new Regex(@"^(\d\d)[-/](\d\d)[-/](\d\d(?:\d\d)?)$");
        Match m = r.Match(date);
        if (m.Success) {
            string month = m.Groups[1].Value;
            string day = m.Groups[2].Value;
            string year = m.Groups[3].Value;
        }
    }
}
Python Regex
import re
date = '12/30/1969'
regex = re.compile(r'^(\d\d)[-/](\d\d)[-/](\d\d(?:\d\d)?)$')
match = regex.match(date)
if match:
    month = match.group(1)  # 12
    day = match.group(2)    # 30
    year = match.group(3)   # 1969
Ruby Regex
date = '12/30/1969'
regexp = Regexp.new('^(\d\d)[-/](\d\d)[-/](\d\d(?:\d\d)?)$')
if md = regexp.match(date)
  month = md[1] # 12
  day = md[2]   # 30
  year = md[3]  # 1969
end
Unicode Properties
Pattern Code Expression – (??{ }) examples
Find incremental numbers
$str = "abc 123hai cde 34567 efg 1245 a132 123456789 10adf";
print "$1\n" while ($str =~ /\D( (\d) (?{$x=$2}) ( (??{++$x%10}) )* ) \D/gx);
Commify a number
$no = 123456789;
substr($no, 0, length($no)-1) =~ s/(?=(?<=\d)(?:\d\d)+$)/,/g;
print $no;
Produces 12,34,56,789
A non-capturing group inside a capturing group does not exclude its text from the capture:
perl -e '$x="cat cat cat"; $x=~/(cat(?:\s+))/; print ":$1:";'
prints ‘:cat :’ with the whitespace included.