8/2/2019 REGEX Extended
Metacharacters
1. The 12 punctuation characters that make regular expressions work their magic are $ ( ) * + . ? [ \ ^ { |
2. Notably absent from the list are ], - and }. The first two become metacharacters only after an unescaped [, and the } only after an unescaped {.
3. If you want your regex to match any of these characters literally, you need to escape them by placing a backslash in front of them.
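A minimal JavaScript sketch of the escaping rule above. The helper name escapeRegex is hypothetical (it is not from the slides or any library); it places a backslash in front of each of the 12 metacharacters so that arbitrary text can be matched literally.

```javascript
// Hypothetical helper: escape the 12 metacharacters
// $ ( ) * + . ? [ \ ^ { | by placing a backslash in front of each one.
function escapeRegex(text) {
  return text.replace(/[$()*+.?[\\^{|]/g, "\\$&");
}

// Build a regex that matches the text "1+1=2?" literally.
var literalPattern = new RegExp(escapeRegex("1+1=2?"));
var matchesLiteral = literalPattern.test("is 1+1=2? yes"); // true
var matchesOther = literalPattern.test("is 11=2 yes");     // false
```

Without the escaping step, the + and ? in the input would act as quantifiers, and the unescaped pattern would match 11=2.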
Matching literal string
Any regular expression that does not include any of the dozen characters $()*+.?[\^{| simply matches itself.
By default, regular expressions are case sensitive - regex matches regex but not Regex, REGEX, or ReGeX
Turn on case insensitivity by using the (?i) mode modifier in .NET, such as (?i)regex or sensitive(?i)caseless(?-i)sensitive (local mode modifiers), or by setting the /i flag when creating the regex in JavaScript.
Matching non-printable characters
Representation  Meaning          Hex   Flavors
\a              bell             0x07  .NET
\e              escape           0x1B  .NET
\f              form feed        0x0C  .NET, JScript
\n              new line         0x0A  .NET, JScript
\r              carriage return  0x0D  .NET, JScript
\t              horizontal tab   0x09  .NET, JScript
\v              vertical tab     0x0B  .NET, JScript
Variations:
Using \cA through \cZ, you can match one of the 26 control characters that occupy positions 1 through 26 in the ASCII table
A lowercase \x followed by two uppercase hexadecimal digits matches a single character in the ASCII set
Matching [$"'\n\d/\\] :
C# - "[$\"'\n\\d/\\\\]"
- In a normal string, double quotes and backslashes must be escaped with a backslash.
Note: "\n" is a string with a literal line break, which is ignored as whitespace in free-spacing mode. "\\n" is a string with the regex token \n, which matches a newline.
C# - @"[$""'\n\d/\\]"
- To include a double quote in a verbatim string, double it up.
Note: @"\n" is always the regex token \n, which matches a newline; verbatim strings do not support \n at the string level.
JavaScript - /[$"'\n\d\/\\]/
- Simply place your regular expression between two forward slashes.
- If any forward slashes occur within the regular expression itself, escape those with a backslash.
Creating Regular Expression Objects
C#:
try {
    // UserInput holds a pattern supplied at run time, so it may be invalid
    Regex regexObj = new Regex(UserInput, RegexOptions.Compiled);
} catch (ArgumentException ex) {
    //...
}
Note: a regex constructed with RegexOptions.Compiled can run up to 10 times faster than one constructed without this option (it compiles the regular expression down to CIL), at the cost of a slower construction.
JavaScript:
var myregexp = /regex pattern/;
var myregexp = new RegExp(userinput);
Match One of Many Characters
[ ] character class - matches a single character out of a list of possible characters
^ (caret) - negates the character class if you place it immediately after the opening bracket
- (hyphen) - creates a range when it is placed between two characters (order given by the ASCII or Unicode character table)
Examples:
o Hexadecimal character : [a-fA-F0-9]
o Nonhexadecimal character : [^a-fA-F0-9]
o Characters group : [aeiou]
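The three example classes can be tried out in JavaScript as follows. This is only a sketch: the helper isHexString and the sample strings are made up for illustration.

```javascript
// Character classes taken from the examples above.
var hexChar = /[a-fA-F0-9]/;
var nonHexChar = /[^a-fA-F0-9]/;
var vowel = /[aeiou]/g;

// Hypothetical helper: true when every character is a hexadecimal digit,
// that is, when the negated class finds nothing to match.
function isHexString(text) {
  return text.length > 0 && !nonHexChar.test(text);
}

// Count the vowels in a sample string by collecting all matches.
var vowelCount = ("regular expression".match(vowel) || []).length; // 7
```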
Shorthands
Six regex tokens that consist of a backslash and a letter form shorthand character classes. Each lowercase shorthand character has an associated uppercase shorthand character with the opposite meaning.
Token  Matches                     Opposite
\d     a single digit              \D (same as [^\d])
\w     a single word character     \W
\s     any whitespace character    \S
       (this includes spaces, tabs, and line breaks)
Note - In JavaScript \w is always identical to [a-zA-Z0-9_]. In .NET it also includes letters and digits from all other scripts (Cyrillic, Thai, etc.)
Matching any character
Solution  Matches                               Flavor         Notes
.         any character, except line breaks     .NET, JScript  .NET: the "dot matches line breaks" option must not be set
.         any character, including line breaks  .NET           .NET: the "dot matches line breaks" option must be set [1] - RegexOptions.Singleline
[\s\S]    any character, including line breaks  JScript [2]
[1] You can also place a mode modifier at the start of the regular expression: (?s) is the mode modifier for "dot matches line breaks" mode in .NET.
[2] An alternative solution is needed for JavaScript, which before ECMAScript 2018 (which added the /s flag) had no "dot matches line breaks" option ([\d\D] and [\w\W] have the same effect).
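A short JavaScript sketch of the [\s\S] workaround from note [2]; the subject string is made up for illustration.

```javascript
// Made-up subject text containing a line break.
var subject = "first line\nsecond line";

// The dot does not match the line break, so this attempt fails.
var dotResult = subject.match(/first.*second/);

// [\s\S] matches any character, including the line break.
var anyResult = subject.match(/first[\s\S]*second/);
var anyText = anyResult ? anyResult[0] : "";
```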
Match Something at the Start and/or
the End of a Line (1)
Anchors - ^, $, \A, \Z, and \z - match at certain positions, effectively anchoring the regular expression match at those positions:
Solution: \A
Matches: at the very start of the subject text, before the first character (to test whether the subject text begins with the text you want to match)
Flavor: .NET
Note: the A must be uppercase
Solution: ^
Matches: equivalent to \A, as long as you do not turn on the "^ and $ match at line breaks" option; otherwise it will match at the very start of each line
Flavor: .NET, JScript
Note: .NET: "^ and $ match at line breaks" option - RegexOptions.Multiline
Solution: \Z, \z
Matches: at the very end of the subject text, after the last character (to test whether the subject text ends with the text you want to match)
Flavor: .NET
Note: \Z and \z differ only when the last character in your subject text is a line break. In that case, \Z can match at the very end of the subject text, after the final line break, as well as immediately before that line break; \z matches only at the very end
Solution: $
Matches: equivalent to \Z, as long as you do not turn on the "^ and $ match at line breaks" option; otherwise it will match at the very end of each line
Flavor: .NET, JScript
Note: .NET: "^ and $ match at line breaks" option - RegexOptions.Multiline. In JavaScript, $ without /m matches only at the very end of the subject text, never before a final line break
Match Something at the Start and/or
the End of a Line (2)
Examples:
^alpha (.NET, JavaScript) - matches alpha at the start of the subject text if "^ and $ match at line breaks" is not set, or at the start of each line otherwise
\Aalpha (.NET) - matches alpha at the start of the subject text
omega$ (.NET, JavaScript) - matches omega at the end of the subject text if "^ and $ match at line breaks" is not set, or at the end of each line otherwise
omega\Z (.NET) - matches omega at the end of the subject text
Match Something at the Start and/or
the End of a Line (3)
Combining two anchors:
\A\Z matches the empty string, as well as
the string that consists of a single newline
\A\z matches only the empty string
^$ matches each empty line in the subject
text (in ^ and $ match at line breaks mode)
Note - In .NET, if you cannot turn on ^ and $ match at line breaks mode outside
the regular expression, you can place (?m) mode modifier at the start of the
regular expression
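The effect of "^ and $ match at line breaks" mode can be sketched in JavaScript with the /m flag. The four-line sample text below is made up for illustration.

```javascript
// Made-up subject: four lines, the second of which is empty.
var text = "alpha\n\nbeta\nalpha omega";

// Without /m, ^ matches only at the very start of the subject text.
var withoutM = text.match(/^alpha/g);

// With /m, ^ also matches at the start of each line.
var withM = text.match(/^alpha/gm);

// ^$ in /m mode matches each empty line.
var emptyLines = text.match(/^$/gm);
```

Here withoutM holds one match, withM holds two, and emptyLines holds the single empty line.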
Regular Expression Options (C#)
None - Specifies that no options are set.
IgnoreCase - Specifies case-insensitive matching.
Multiline - Multiline mode. Changes the meaning of ^ and $ so they match at the beginning and end, respectively, of any line, and not just the beginning and end of the entire string (Caret and dollar match at line breaks).
ExplicitCapture - Specifies that the only valid captures are explicitly named or numbered groups of the form (?<name>...). This allows unnamed parentheses to act as noncapturing groups without the syntactic clumsiness of the expression (?:...).
Compiled - Specifies that the regular expression is compiled to an assembly. This yields faster execution but increases startup time.
Singleline - Specifies single-line mode. Changes the meaning of the dot (.) so it matches every character (instead of every character except \n) (Dot matches line breaks).
IgnorePatternWhitespace - Eliminates unescaped white space from the pattern and enables comments marked with # (Free-spacing).
RightToLeft - Specifies that the search will be from right to left instead of from left to right.
ECMAScript - Enables ECMAScript-compliant behavior for the expression (JavaScript flavor). This value can be used only in conjunction with the IgnoreCase, Multiline, and Compiled values; the use of this value with any other values results in an exception. The most important effect is that with this option, \w and \d are restricted to ASCII characters, as they are in JavaScript.
CultureInvariant - Specifies that cultural differences in language are ignored.
Setting Regular Expression Options
C#:
Regex regexObj = new Regex("regex pattern",
    RegexOptions.IgnorePatternWhitespace | RegexOptions.IgnoreCase |
    RegexOptions.Singleline | RegexOptions.Multiline);
JavaScript:
var myregexp = /regex pattern/im;
Regex options in JavaScript:
1. Free-spacing: Not supported by JavaScript.
2. Case insensitive: /i
3. Dot matches line breaks: Not supported by JavaScript before ECMAScript 2018, which added /s.
4. Caret and dollar match at line breaks: /m
5. Additional language-specific options: apply a regular expression repeatedly to the same string: /g
Test Whether a Match Can Be Found
Within a Subject String
C#:
bool foundMatch = false;
try {
    foundMatch = Regex.IsMatch(subjectString, UserInput);
} catch (ArgumentNullException ex) {
    // Cannot pass null as the regular expression or subject string
} catch (ArgumentException ex) {
    // Syntax error in the regular expression
}
or
bool foundMatch = Regex.IsMatch(subjectString, "regex pattern");
Note: @"\Aregex pattern\Z" - regex matches the subject string entirely
JavaScript:
if (/regex pattern/.test(subjectString)) {
    // Successful match
} else {
    // Match attempt failed
}
Note: /^regex pattern$/.test(subjectString) - regex matches the subject string entirely
Retrieve the Matched Text
C#:
Regex regexObj = new Regex(@"\d+");
string resultString = regexObj.Match(subjectString).Value;
Note:
1. regexObj.Match("123456", 3, 2) tries to find a match in "45"
2. regexObj.Match(subjectString).Index - position of the match in the subject string
3. regexObj.Match(subjectString).Length - length of the match
JavaScript:
var result = subject.match(/\d+/);
if (result) {
    result = result[0];
} else {
    result = '';
}
var matchstart = -1;
var matchlength = -1;
var match = /\d+/.exec(subject);
if (match) {
    matchstart = match.index;
    matchlength = match[0].length;
}
Match Whole Words
\b - word boundary - matches at the start or the end of a word, in three positions:
1. before the first character in the subject, if the first character is a word character
2. after the last character in the subject, if the last character is a word character
3. between two characters in the subject, where one is a word character and the other is not
Example: \bdog\b - The first \b requires the d to occur at the very start of the string, or after a nonword character. The second \b requires the g to occur at the very end of the string, or before a nonword character (line break characters are nonword characters). It matches dog in My dog is stupid, but not in I will build a doghouse.
\B matches at every position in the subject text where \b does not match, that is, at every position that is not at the start or end of a word.
Example: \Bcat\B matches cat in scatter, but not in My cat is lazy, category, or bobcat
Note: to match cat anywhere except as a whole word, you need to use alternation to combine \Bcat and cat\B into \Bcat|cat\B
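The word boundary examples can be checked with a few JavaScript test() calls. This is a sketch using the sample sentences from the text; the variable names are made up.

```javascript
// \bdog\b: dog as a whole word only.
var wholeWord = /\bdog\b/;
var inSentence = wholeWord.test("My dog is stupid");          // true
var inCompound = wholeWord.test("I will build a doghouse");   // false

// \Bcat\B: cat with word characters on both sides.
var inside = /\Bcat\B/;
var scatter = inside.test("scatter");     // true
var category = inside.test("category");   // false

// \Bcat|cat\B: cat anywhere except as a whole word.
var notWhole = /\Bcat|cat\B/;
var bobcat = notWhole.test("bobcat");             // true
var lazyCat = notWhole.test("My cat is lazy");    // false
```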
Unicode Code Points, Properties,
Blocks, and Scripts (1)
Solution: \u2122
Matches: Unicode code point
Flavor: .NET, JScript
Note:
- a code point is one entry in the Unicode character database (\u2122 is the trademark sign)
- \u syntax requires exactly four hexadecimal digits (U+0000 through U+FFFF)
Solution: \p{Sc}
Matches: Unicode property or category
Flavor: .NET
Note:
\p{L} - Any kind of letter from any language
\p{M} - A character intended to be combined with another character (accents etc.)
\p{Z} - Any kind of whitespace or invisible separator
\p{S} - Math symbols, currency signs etc.
\p{N} - Any kind of numeric character in any script
\p{P} - Any kind of punctuation character
\p{C} - Invisible control characters and unused code points
Solution: \p{IsGreekExtended}
Matches: Unicode block
Flavor: .NET
Note: \p{IsBasicLatin}, \p{IsGreek}, \p{IsCyrillic}, \p{IsKatakana} etc. (.NET uses the Is prefix; some other flavors write \p{InBasic_Latin}, \p{InGreek_and_Coptic}, \p{InCyrillic}, \p{InKatakana})
Solution: \P{M}\p{M}*
Matches: Unicode grapheme
Flavor: .NET
Note: a grapheme is a base character plus its combining marks - à can be encoded as \u00E0 or as \u0061\u0300, and \P{M}\p{M}* matches either form as a single grapheme
Unicode Code Points, Properties,
Blocks, and Scripts (2)
The uppercase \P is the negated variant of the lowercase \p. Example: \P{Sc} matches any character that does not have the Currency Symbol Unicode property.
The JavaScript flavor described here does not support Unicode categories, blocks, or scripts; instead, you can list the characters that are in the category, block, or script in a character class. Alternative versions for:
Blocks - [\u1F00-\u1FFF] replaces \p{IsGreekExtended}
Categories and scripts - create a character class with all the code points from the specific category or script
See also: http://www.unicode.org/
Character class subtractions in .NET
General form: [class-[subtract]]
Example:
1. [a-zA-Z0-9-[g-zG-Z]] matches a single hexadecimal character.
2. [\p{IsThai}-[\P{N}]] matches any of the 10 Thai digits.
\p{IsThai} - matches any character in the Thai block
\P{N} - matches any character that doesn't have the Number property
Match One of Several Alternatives
The vertical bar, or pipe symbol, splits the regular expression
into multiple alternatives
Example: apply Mary|Jane|Sue to Mary, Jane, and Sue went to Mary's house - the match Mary is immediately found at the start of the string
The order of the alternatives in the regex matters only when two of them can match at the same position in the string. The solution is to leave the most general alternative last in the enumeration: Jane|Janet finds only Jane in Janet, whereas Janet|Jane finds Janet.
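A small JavaScript sketch of how alternation behaves, using the Mary|Jane|Sue sentence from the text. The Jane/Janet pair is an assumed illustration of two alternatives that can match at the same position.

```javascript
var subject = "Mary, Jane, and Sue went to Mary's house";

// With /g, every alternative is found in the order it occurs in the subject.
var names = subject.match(/Mary|Jane|Sue/g);

// When two alternatives can match at the same position,
// the one listed first in the regex wins.
var shortFirst = "Janet".match(/Jane|Janet/)[0];  // "Jane"
var longFirst = "Janet".match(/Janet|Jane/)[0];   // "Janet"
```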
Group and Capture Parts of the Match
A capturing group is a pair of parentheses that captures only part of the text matched by the regular expression
Example: \b(\d\d\d\d)-(\d\d)-(\d\d)\b
1. Has three capturing groups: (\d\d\d\d), (\d\d) and (\d\d)
2. During the matching process the regular expression engine stores the part of the text matched by each capturing group
Applied to the subject string 2012-10-02, the groups hold 2012, 10, and 02
Noncapturing groups:
(?: opens a noncapturing group
You can specify mode modifiers, for example (?i: ) opens a case-insensitive noncapturing group (mode modifiers are not available in the JScript flavor)
Benefits:
- You can add them to an existing regex without upsetting the references to numbered capturing groups
- Performance - a capturing group adds unnecessary overhead that you can eliminate by using a noncapturing group
Note: parts of the match can be named: \b(?<year>\d\d\d\d)-(?<month>\d\d)-(?<day>\d\d)\b or \b(?'year'\d\d\d\d)-(?'month'\d\d)-(?'day'\d\d)\b (only .NET; the group names year, month, and day are illustrative).
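A minimal JavaScript sketch of capturing versus noncapturing groups, applied to the date regex from the example. The subject string and variable names are assumptions for illustration.

```javascript
// Made-up subject text containing a date.
var subject = "Released on 2012-10-02.";

// Three capturing groups: year, month, and day end up in match[1..3].
var match = /\b(\d\d\d\d)-(\d\d)-(\d\d)\b/.exec(subject);
var year = match[1];
var month = match[2];
var day = match[3];

// Making the first group noncapturing shifts the numbering:
// group 1 now holds the month.
var partial = /\b(?:\d\d\d\d)-(\d\d)-(\d\d)\b/.exec(subject);
var firstGroup = partial[1];
```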
Match Previously Matched Text Again
Steps
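1. Capture a text in a group
2. Match the same text anywhere in the regex using a backreference (a backslash followed by a number)
Example: \b\d\d(\d\d)-\1-\1\b matches 2010-10-10, 2011-11-11, 2012-12-12 etc. (the month and day must repeat the last two digits of the year, so 2012-09-09 does not match)
Note: you can name a backreference (.NET): \b\d\d(?<magic>\d\d)-\k<magic>-\k<magic>\b (the group name magic is illustrative)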
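The numbered backreference can be tried in JavaScript as follows. This is a sketch with made-up sample dates: \1 must repeat exactly the two digits captured by the group.

```javascript
// The group captures the last two digits of the year;
// each \1 must then match the same two digits again.
var magicDate = /\b\d\d(\d\d)-\1-\1\b/g;

// Made-up list of dates separated by spaces.
var found = "2010-10-10 2012-09-09 2012-12-12".match(magicDate);
```

Only 2010-10-10 and 2012-12-12 end up in found; in 2012-09-09 the group captures 12, which does not recur.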
Retrieve Part of the Matched Text
C#:
string resultString = Regex.Match(subjectString, "http://([a-z0-9.-]+)").Groups[1].Value;
string resultString = Regex.Match(subjectString, "http://(?<domain>[a-z0-9.-]+)").Groups["domain"].Value;
JavaScript:
var result = "";
var match = /http:\/\/([a-z0-9.-]+)/.exec(subject);
if (match) {
    result = match[1];
} else {
    result = '';
}
Retrieve a List of All Matches
C#:
Regex regexObj = new Regex(@"\d+");
MatchCollection matchlist = regexObj.Matches(subjectString);
JavaScript:
var list = subject.match(/\d+/g);
Note:
- the /g flag tells the match() function to iterate over all matches in the string and put them into an array
- with the /g flag, string.match() returns only the matched text and does not provide any further details about the matches
Iterate over All Matches
C#:
Match matchResult = Regex.Match(subjectString, @"\d+");
while (matchResult.Success) {
    // Here you can process the match stored in matchResult
    matchResult = matchResult.NextMatch();
}
JavaScript:
var regex = /\d+/g;
var match = null;
while (match = regex.exec(subject)) {
    // Don't let browsers such as Firefox get stuck in an infinite loop
    if (match.index == regex.lastIndex) regex.lastIndex++;
    // Here you can process the match stored in the match variable
}
Note: exec() sets lastIndex to the first character after the match. If the match is zero characters long, the next match attempt begins at the position of the match just found, resulting in an infinite loop unless lastIndex is advanced manually.
Repeat Part of the Regex a Certain
Number of Times
\b\d{100}\b - a decimal number with 100 digits
\b[a-f0-9]{1,8}\b - A 32-bit hexadecimal number
\b[a-f0-9]{1,8}h?\b - A 32-bit hexadecimal number with an
optional h suffix
\b\d*\.\d+(e\d+)? - A floating-point number with an optional
integer part, a mandatory fractional part, and an optional
exponent
Token   Result                                        Notes
{n}     repeats the preceding regex token n times
{n,m}   variable repetition (between n and m times)   \d{0,1} matches zero or one digit (same as \d?)
{n,}    infinite repetition (at least n times)        \d{1,} matches one or more digits (same as \d+); \d{0,} matches zero or more digits (same as \d*)
+, *, ? - greedy quantifiers
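The quantifier examples above can be exercised in JavaScript like this. It is only a sketch; the sample inputs are made up.

```javascript
// A 32-bit hexadecimal number with an optional h suffix: 1 to 8 hex digits.
var hex32 = /\b[a-f0-9]{1,8}h?\b/;
var shortHex = hex32.test("ff00h");       // true
var tooLong = hex32.test("123456789");    // false: nine digits in a row

// Floating-point number: optional integer part, mandatory fraction,
// optional exponent.
var floatNum = /\b\d*\.\d+(e\d+)?/;
var withExponent = "1.5e10".match(floatNum)[0];    // "1.5e10"
var plain = "value 3.14".match(floatNum)[0];       // "3.14"

// {1,} behaves like +
var sameAsPlus = /^\d{1,}$/.test("2012") === /^\d+$/.test("2012");
```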
Choose Minimal or Maximal Repetition (1)
Lazy quantifiers
- repeat as few times as they have to, store one backtracking position, and allow the regex to continue
- the regex goes ahead only one character at a time, each time checking whether the following text can be matched
You can make any quantifier lazy by placing a question mark after it: *?, +?, ??, and {7,42}?
Example (with "dot matches line breaks" set):
<p>The very first task is to find the beginning of a paragraph.</p>
<p>Then you have to find the end of the paragraph</p>
<p>.*</p> (greedy) matches from the first <p> to the last </p>
vs. <p>.*?</p> (lazy), which stops at the first </p>
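The difference between the greedy .* and the lazy .*? can be seen in a short JavaScript sketch. The one-line HTML subject is an assumption made up for illustration.

```javascript
// Made-up subject: two paragraphs on one line.
var html = "<p>first</p><p>second</p>";

// Greedy: .* runs to the end, then backtracks to the last </p>.
var greedy = html.match(/<p>.*<\/p>/)[0];

// Lazy: .*? expands one character at a time and stops at the first </p>.
var lazy = html.match(/<p>.*?<\/p>/)[0];

// With /g the lazy version finds each paragraph separately.
var paragraphs = html.match(/<p>.*?<\/p>/g);
```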
Choose Minimal or Maximal Repetition (2)
Possessive quantifiers
- try to repeat as many times as possible
- will never give back, not even when giving back is the only way that the remainder of the regular expression could match
- do not keep backtracking positions
You can make any quantifier possessive by placing a plus sign after it: *+, ++, ?+, and {7,42}+
Possessive quantifiers vs. atomic groups (neither is available in JScript; .NET supports atomic groups but not possessive quantifiers)
An atomic group is
- a noncapturing group, with the extra job of refusing to backtrack
- opened by a bracket that simply consists of the three characters (?>
\b\d++\b is the same as \b(?>\d+)\b
\w++\d++ is the same as (?>\w+)(?>\d+)
Test for a Match Without Adding It to
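the Overall Match
Lookaround - checks whether certain text can be matched without actually matching it:
1. lookbehind
positive : (?<=a)b matches a "b" that is preceded by an "a"
negative : (?<!a)b matches a "b" that is not preceded by an "a"
2. lookahead
positive : q(?=u) matches a "q" that is followed by a "u"
negative : q(?!u) matches a "q" not followed by a "u"
Note: JavaScript supports only lookahead (lookbehind was added to the language in ECMAScript 2018)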
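A brief JavaScript sketch of the two lookahead examples. The sample words are made up; note that the text inside the lookahead never becomes part of the match.

```javascript
var positive = /q(?=u)/;
var negative = /q(?!u)/;

// The u is checked but not consumed: the match is just "q".
var match = positive.exec("quit");
var matchedText = match[0];

// "Iraq" has a q that is not followed by u; "quit" does not.
var iraq = negative.test("Iraq");   // true
var quit = negative.test("quit");   // false
```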
Match One of Two Alternatives Based
on a Condition
(?(1)then|else) - checks whether the first capturing group has already matched something (.NET; not supported by JavaScript)
Example:
1. \b(?:(?:(one)|(two)|(three))(?:,|\b)){3,}(?(1)|(?!))(?(2)|(?!))(?(3)|(?!))
(?(1)|(?!)) - if capturing group 1 has matched
- then empty regex "" (always passes)
- else empty negative lookahead (?!) (always fails)
2. (a)?b(?(1)c|d) matches abc or bd, the same as abc|bd
Insert Literal Text into the
Replacement Text (1)
Key characters:
\ - a literal character that does not need to be escaped
$ - needs to be escaped only when it is followed by a digit, &, `, ', _, +, or $; to escape a dollar sign, precede it with another dollar sign.
Example: $%\*$1\1 => $%\*$$1\1
Note: $1 is a backreference to a capturing group and $& refers to the whole regex match; \1 is a backreference only in other flavors, while .NET and JavaScript treat it as literal text
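The dollar sign rules can be sketched in JavaScript with String.replace. The subject strings below are made up for illustration.

```javascript
// $$ inserts one literal dollar sign; $1 inserts the first capturing group.
var priced = "cost: 5".replace(/(\d+)/, "$$$1");    // "cost: $5"

// A backslash is literal in the replacement text, so \1 is not a
// backreference here, and $$1 yields the literal text $1.
var literal = "x".replace(/x/, "$$1\\1");

// A dollar sign followed by any other character needs no escaping.
var unescaped = "a".replace(/a/, "$%");             // "$%"

// $& inserts the whole regex match.
var wrapped = "see http://example.com now".replace(/http:\S+/, "[$&]");
```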
Insert Literal Text into the
Replacement Text (2)
Examples:
1. Regular expression: http:\S+
Replacement: $&
2. Regular expression: \b(\d{4})(\d{3})(\d{3})\b
Replacement: ($1) $2-$3
3. Regular expression: \b(?<g1>\d{3})(?<g2>\d{3})(?<g3>\d{4})\b
Replacement: (${g1}) ${g2}-${g3}
Note: .NET and JavaScript leave backreferences to groups that don't exist as literal text in the replacement.
Replace All Matches
C#:
Regex regexObj = new Regex("pattern");
string resultString = regexObj.Replace(subjectString, replacement, count);
Example: Replace(subject, replacement, 3) replaces only the first three regular expression matches, and further matches are ignored.
JavaScript:
result = subject.replace(/before/g, "after");
Note: if you want to replace all regex matches in the string, set the /g flag when creating your regular expression object; if you don't use the /g flag, only the first match will be replaced.
Replace Matches Reusing Parts of the
Match
C#:
string resultString = Regex.Replace(subjectString, @"(\w+)=(\w+)", "$2=$1");
or
Regex regexObj = new Regex(@"(\w+)=(\w+)");
string resultString = regexObj.Replace(subjectString, "$2=$1");
With named groups:
Regex regexObj = new Regex(@"(?<left>\w+)=(?<right>\w+)");
string resultString = regexObj.Replace(subjectString, "${right}=${left}");
JavaScript:
result = subject.replace(/(\w+)=(\w+)/g, "$2=$1");
Replace Matches with Replacements
Generated in Code
C#:
Regex regexObj = new Regex(@"\d+");
string resultString = regexObj.Replace(subjectString, new MatchEvaluator(ComputeReplacement));
public String ComputeReplacement(Match matchResult) {
    int t = int.Parse(matchResult.Value) * 2;
    return t.ToString();
}
JavaScript:
var result = subject.replace(/\d+/g,
    function(match) { return match * 2; }
);
Note: the replacement function may accept one or more parameters: the first parameter will be set to the text matched by the regular expression. If the regular expression has capturing groups, the second parameter will hold the text matched by the first capturing group, the third parameter gives you the text of the second capturing group, and so on.
Split a string
C#:
string[] splitArray = Regex.Split(subjectString, "regex pattern");
JavaScript:
var list = [];
// "regex pattern" stands for the regex that matches the delimiters
var regex = /regex pattern/g;
var match = null;
var lastIndex = 0;
while (match = regex.exec(subject)) {
    // Don't let browsers such as Firefox get stuck in an infinite loop
    if (match.index == regex.lastIndex) regex.lastIndex++;
    // Add the text before the match
    list.push(subject.substring(lastIndex, match.index));
    lastIndex = match.index + match[0].length;
}
// Add the remainder after the last match
list.push(subject.substring(lastIndex));
Search Line by Line
C#:
string[] lines = Regex.Split(subjectString, "\r?\n");
Regex regexObj = new Regex("regex pattern");
for (int i = 0; i < lines.Length; i++) {
    if (regexObj.IsMatch(lines[i])) {
        // The regex matches lines[i]
    } else {
        // The regex does not match lines[i]
    }
}
JavaScript:
var lines = subject.split(/\r?\n/);
var regexp = /regex pattern/;
for (var i = 0; i < lines.length; i++) {
    if (lines[i].match(regexp)) {
        // The regex matches lines[i]
    } else {
        // The regex does not match lines[i]
    }
}
Validation and Formatting (1)
Email address (use with case-insensitive matching)
^[\w!#$%&'*+/=?`{|}~^-]+(?:\.[\w!#$%&'*+/=?`{|}~^-]+)*@(?:[A-Z0-9-]+\.)+[A-Z]{2,6}$
International Phone Numbers
^\+(?:[0-9]\x20?){6,14}[0-9]$
Validate Traditional Date Formats
^(?:(0?2)/([12][0-9]|0?[1-9])|(0?[469]|11)/(30|[12][0-9]|0?[1-9])|(0?[13578]|1[02])/(3[01]|[12][0-9]|0?[1-9]))/((?:[0-9]{2})?[0-9]{2})$
Limit the Number of Lines in Text
^(?:(?:\r\n?|\n)?[^\r\n]*){0,5}$
Validate Affirmative Responses
^(?:1|t(?:rue)?|y(?:es)?|ok(?:ay)?)$
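Two of these validation patterns wrapped in JavaScript helpers, as a sketch. The function names are hypothetical, and the /i flag on the affirmative-response regex is an assumption: the slides do not say whether case should be ignored.

```javascript
// Patterns copied from the list above.
// Assumption: /i is added so that "OK" and "Yes" are accepted too.
var affirmative = /^(?:1|t(?:rue)?|y(?:es)?|ok(?:ay)?)$/i;
var intlPhone = /^\+(?:[0-9]\x20?){6,14}[0-9]$/;

// Hypothetical helper: true for 1, t, true, y, yes, ok, okay.
function isAffirmative(text) {
  return affirmative.test(text);
}

// Hypothetical helper: + followed by 7 to 15 digits,
// each optionally followed by a single space.
function isInternationalPhone(text) {
  return intlPhone.test(text);
}
```

For example, isAffirmative("yes") and isInternationalPhone("+1 555 123 4567") both return true, while isAffirmative("no") and isInternationalPhone("+123") return false.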
Validation and Formatting (2)
Find Words Near Each Other
\b(?:word1\W+(?:\w+\W+){0,5}?word2|word2\W+(?:\w+\W+){0,5}?word1)\b
Remove Duplicate Lines
^(.*)(?:(?:\r?\n|\r)\1)+$ replaced with $1
Validating URL
^((https?|ftp)://|(www|ftp)\.)[a-z0-9-]+(\.[a-z0-9-]+)+([/?].*)?$
Extracting the Query from a URL
^[^?#]+\?([^#]+)
Validate Windows Paths
^(?:[a-z]:|\\\\[a-z0-9_.$]+\\[a-z0-9_.$]+)\\(?:[^\\/:*?"<>|\r\n]+\\)*[^\\/:*?"<>|\r\n]*$
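Two of these recipes applied in JavaScript, as a sketch. The helper names and the example URL are made up. The /gm flags are an assumption needed so that ^ and $ match at line breaks and every run of duplicates is replaced. Note that the duplicate-line regex only collapses duplicates that are adjacent to each other.

```javascript
// Hypothetical helper: collapse runs of identical adjacent lines into one.
function removeAdjacentDuplicateLines(text) {
  return text.replace(/^(.*)(?:(?:\r?\n|\r)\1)+$/gm, "$1");
}

// Hypothetical helper: return the query part of a URL, or "" if there is none.
function extractQuery(url) {
  var match = /^[^?#]+\?([^#]+)/.exec(url);
  return match ? match[1] : "";
}

var deduped = removeAdjacentDuplicateLines("a\na\nb\nb\nb\nc");
var query = extractQuery("http://example.com/page?x=1&y=2#top");
```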