javier-solis
Regular Expression Basic Syntax Reference
Characters
Character Description Example
Any character except [\^$.|?*+()
All characters except the listed special characters match a single instance of themselves. { and } are literal characters, unless they're part of a valid regular expression token (e.g. the {n} quantifier).
Example: a matches a

\ (backslash) followed by any of [\^$.|?*+(){}
A backslash escapes special characters to suppress their special meaning.
Example: \+ matches +

\Q...\E
Matches the characters between \Q and \E literally, suppressing the meaning of special characters.
Example: \Q+-*/\E matches +-*/

\xFF where FF are 2 hexadecimal digits
Matches the character with the specified ASCII/ANSI value, which depends on the code page used. Can be used in character classes.
Example: \xA9 matches © when using the Latin-1 code page.

\n, \r and \t
Match an LF character, CR character and a tab character respectively. Can be used in character classes.
Example: \r\n matches a DOS/Windows CRLF line break.

\a, \e, \f and \v
Match a bell character (\x07), escape character (\x1B), form feed (\x0C) and vertical tab (\x0B) respectively. Can be used in character classes.

\cA through \cZ
Match an ASCII character Control+A through Control+Z, equivalent to \x01 through \x1A. Can be used in character classes.
Example: \cM\cJ matches a DOS/Windows CRLF line break.
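The escapes above can be tried out with a minimal sketch in Python's re module. Python is an assumption here, since the reference names no particular flavor; re does not support \Q...\E, \e or \cA, so re.escape() stands in for \Q...\E.

```python
import re

# A backslash suppresses the special meaning of a metacharacter.
plus = re.fullmatch(r"\+", "+")

# re.escape() backslash-escapes a whole literal, similar in effect to \Q...\E.
literal = re.fullmatch(re.escape("+-*/"), "+-*/")

# \xFF-style escapes name a character by its two-digit hexadecimal code.
copyright_sign = re.fullmatch(r"\xA9", "\u00a9")

# \r\n matches a DOS/Windows CRLF line break.
crlf = re.search(r"\r\n", "line one\r\nline two")

print(bool(plus), bool(literal), bool(copyright_sign), crlf.span())
```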
Character Classes or Character Sets [abc]
Character Description Example
[ (opening square bracket)
Starts a character class. A character class matches a single character out of all the possibilities offered by the character class. Inside a character class, different rules apply. The rules in this section are only valid inside character classes. The rules outside this section are not valid in character classes, except for a few character escapes that are indicated with "can be used inside character classes".

Any character except ^-]\
All characters except the listed special characters add that character to the possible matches for the character class.
Example: [abc] matches a, b or c

\ (backslash) followed by any of ^-]\
A backslash escapes special characters to suppress their special meaning.
Example: [\^\]] matches ^ or ]

- (hyphen) except immediately after the opening [
Specifies a range of characters. (Specifies a hyphen if placed immediately after the opening [)
Example: [a-zA-Z0-9] matches any letter or digit

^ (caret) immediately after the opening [
Negates the character class, causing it to match a single character not listed in the character class. (Specifies a caret if placed anywhere except after the opening [)
Example: [^a-d] matches x (any character except a, b, c or d)

\d, \w and \s
Shorthand character classes matching digits, word characters (letters, digits, and underscores), and whitespace (spaces, tabs, and line breaks). Can be used inside and outside character classes.
Example: [\d\s] matches a character that is a digit or whitespace

\D, \W and \S
Negated versions of the above. Should be used only outside character classes. (Can be used inside, but that is confusing.)
Example: \D matches a character that is not a digit

[\b]
Inside a character class, \b is a backspace character.
Example: [\b\t] matches a backspace or tab character
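A short sketch of the character class rules, again assuming Python's re module as the flavor. The sample strings are made up for illustration.

```python
import re

# [abc] matches a single a, b or c.
plain = re.findall(r"[abc]", "cabbage")

# A leading caret negates the class: any character except a, b, c or d.
negated = re.findall(r"[^a-d]", "abcdxy")

# An escaped caret and closing bracket are literals inside the class.
escaped = re.findall(r"[\^\]]", "a^b]c")

# Shorthand classes can be combined inside a class.
digit_or_space = re.findall(r"[\d\s]", "a1 b2")

# Inside a class, \b is a backspace rather than a word boundary.
backspace_or_tab = re.findall(r"[\b\t]", "x\by\tz")

print(plain, negated, escaped, digit_or_space, backspace_or_tab)
```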
Dot
Character Description Example
. (dot)
Matches any single character except line break characters \r and \n. Most regex flavors have an option to make the dot match line break characters too.
Example: . matches x or (almost) any other character
Anchors
Character Description Example
^ (caret)
Matches at the start of the string the regex pattern is applied to. Matches a position rather than a character. Most regex flavors have an option to make the caret match after line breaks (i.e. at the start of a line in a file) as well.
Example: ^. matches a in abc\ndef. Also matches d in "multi-line" mode.

$ (dollar)
Matches at the end of the string the regex pattern is applied to. Matches a position rather than a character. Most regex flavors have an option to make the dollar match before line breaks (i.e. at the end of a line in a file) as well. Also matches before the very last line break if the string ends with a line break.
Example: .$ matches f in abc\ndef. Also matches c in "multi-line" mode.

\A
Matches at the start of the string the regex pattern is applied to. Matches a position rather than a character. Never matches after line breaks.
Example: \A. matches a in abc

\Z
Matches at the end of the string the regex pattern is applied to. Matches a position rather than a character. Never matches before line breaks, except for the very last line break if the string ends with a line break.
Example: .\Z matches f in abc\ndef

\z
Matches at the end of the string the regex pattern is applied to. Matches a position rather than a character. Never matches before line breaks.
Example: .\z matches f in abc\ndef
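The anchor examples can be reproduced with a small sketch, assuming Python's re module. Python's \Z matches only at the very end of the string (like \z in this reference), so the sketch sticks to ^, $ and \A.

```python
import re

text = "abc\ndef"

# Default mode: ^ and $ match only at the start and end of the whole string.
start_default = re.findall(r"^.", text)
end_default = re.findall(r".$", text)

# "Multi-line" mode: ^ and $ also match after and before each line break.
start_multi = re.findall(r"^.", text, re.MULTILINE)
end_multi = re.findall(r".$", text, re.MULTILINE)

# $ also matches before the very last line break of the string.
end_trailing = re.findall(r".$", "abc\ndef\n")

# \A never matches after line breaks, even in multi-line mode.
start_only = re.findall(r"\A.", text, re.MULTILINE)

print(start_default, start_multi, end_default, end_multi, end_trailing, start_only)
```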
Word Boundaries
Character Description Example
\b
Matches at the position between a word character (anything matched by \w) and a non-word character (anything matched by [^\w] or \W) as well as at the start and/or end of the string if the first and/or last characters in the string are word characters.
Example: .\b matches c in abc

\B
Matches at the position between two word characters (i.e. the position between \w\w) as well as at the position between two non-word characters (i.e. \W\W).
Example: \B.\B matches b in abc
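A brief sketch of both boundary tokens, assuming Python's re module. The "cat" sentence is a made-up sample.

```python
import re

# .\b: a character followed by a word boundary (here, the end of the string).
before_boundary = re.findall(r".\b", "abc")

# \B.\B: a character with no word boundary on either side.
inside_word = re.findall(r"\B.\B", "abc")

# \b on both sides restricts a match to a whole word.
whole_word = re.findall(r"\bcat\b", "cat concat cats")

print(before_boundary, inside_word, whole_word)
```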
Alternation
Character Description Example
| (pipe)
Causes the regex engine to match either the part on the left side, or the part on the right side. Can be strung together into a series of options.
Example: abc|def|xyz matches abc, def or xyz

| (pipe)
The pipe has the lowest precedence of all operators. Use grouping to alternate only part of the regular expression.
Example: abc(def|xyz) matches abcdef or abcxyz
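The precedence rule is easy to see side by side. This is a minimal sketch assuming Python's re module; the test strings are invented.

```python
import re

# A series of options: any one of the three alternatives matches.
options = re.findall(r"abc|def|xyz", "xyz abc def")

# Grouping limits the alternation to part of the regex.
grouped = [m.group(0) for m in re.finditer(r"abc(def|xyz)", "abcdef abcxyz abc")]

# Without grouping the pipe splits the whole regex: "abcdef" or "xyz".
ungrouped = re.fullmatch(r"abcdef|xyz", "abcxyz")

print(options, grouped, ungrouped)
```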
Quantifiers
Character Description Example
? (question mark)
Makes the preceding item optional. Greedy, so the optional item is included in the match if possible.
Example: abc? matches ab or abc

??
Makes the preceding item optional. Lazy, so the optional item is excluded from the match if possible. This construct is often excluded from documentation because of its limited use.
Example: abc?? matches ab or abc

* (star)
Repeats the previous item zero or more times. Greedy, so as many items as possible will be matched before trying permutations with fewer matches of the preceding item, up to the point where the preceding item is not matched at all.
Example: ".*" matches "def" "ghi" in abc "def" "ghi" jkl

*? (lazy star)
Repeats the previous item zero or more times. Lazy, so the engine first attempts to skip the previous item, before trying permutations with ever increasing matches of the preceding item.
Example: ".*?" matches "def" in abc "def" "ghi" jkl

+ (plus)
Repeats the previous item once or more. Greedy, so as many items as possible will be matched before trying permutations with fewer matches of the preceding item, up to the point where the preceding item is matched only once.
Example: ".+" matches "def" "ghi" in abc "def" "ghi" jkl

+? (lazy plus)
Repeats the previous item once or more. Lazy, so the engine first matches the previous item only once, before trying permutations with ever increasing matches of the preceding item.
Example: ".+?" matches "def" in abc "def" "ghi" jkl

{n} where n is an integer >= 1
Repeats the previous item exactly n times.
Example: a{3} matches aaa

{n,m} where n >= 0 and m >= n
Repeats the previous item between n and m times. Greedy, so repeating m times is tried before reducing the repetition to n times.
Example: a{2,4} matches aaaa, aaa or aa

{n,m}? where n >= 0 and m >= n
Repeats the previous item between n and m times. Lazy, so repeating n times is tried before increasing the repetition to m times.
Example: a{2,4}? matches aa, aaa or aaaa

{n,} where n >= 0
Repeats the previous item at least n times. Greedy, so as many items as possible will be matched before trying permutations with fewer matches of the preceding item, up to the point where the preceding item is matched only n times.
Example: a{2,} matches aaaaa in aaaaa

{n,}? where n >= 0
Repeats the previous item n or more times. Lazy, so the engine first matches the previous item n times, before trying permutations with ever increasing matches of the preceding item.
Example: a{2,}? matches aa in aaaaa
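Greedy and lazy quantifiers differ only in which repetition count is tried first. The sketch below, assuming Python's re module, reruns several of the examples from the table.

```python
import re

text = 'abc "def" "ghi" jkl'

# Greedy star runs to the last quote; lazy star stops at the first one.
greedy_star = re.search(r'".*"', text).group(0)
lazy_star = re.search(r'".*?"', text).group(0)

# Greedy intervals take as many repetitions as allowed, lazy ones as few.
greedy_range = re.search(r"a{2,4}", "aaaaa").group(0)
lazy_range = re.search(r"a{2,4}?", "aaaaa").group(0)
greedy_open = re.search(r"a{2,}", "aaaaa").group(0)
lazy_open = re.search(r"a{2,}?", "aaaaa").group(0)

# Greedy ? includes the optional item, lazy ?? leaves it out.
greedy_optional = re.search(r"abc?", "abc").group(0)
lazy_optional = re.search(r"abc??", "abc").group(0)

print(greedy_star, lazy_star, greedy_range, lazy_range,
      greedy_open, lazy_open, greedy_optional, lazy_optional)
```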
Regular Expression Advanced Syntax Reference
Grouping and Backreferences
Syntax Description Example
(regex)
Round brackets group the regex between them. They capture the text matched by the regex inside them that can be reused in a backreference, and they allow you to apply regex operators to the entire grouped regex.
Example: (abc){3} matches abcabcabc. First group matches abc.

(?:regex)
Non-capturing parentheses group the regex so you can apply regex operators, but do not capture anything and do not create backreferences.
Example: (?:abc){3} matches abcabcabc. No groups.

\1 through \9
Substituted with the text matched between the 1st through 9th pair of capturing parentheses. Some regex flavors allow more than 9 backreferences.
Example: (abc|def)=\1 matches abc=abc or def=def, but not abc=def or def=abc.
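A small sketch of capturing, non-capturing and backreference behaviour, assuming Python's re module. The candidate strings are the ones from the example column.

```python
import re

# A capturing group records what it matched (the last repetition here).
capturing = re.fullmatch(r"(abc){3}", "abcabcabc")

# A non-capturing group repeats the same way but creates no group.
non_capturing = re.fullmatch(r"(?:abc){3}", "abcabcabc")

# \1 must repeat exactly the text the first group captured.
pair = re.compile(r"(abc|def)=\1")
candidates = ["abc=abc", "def=def", "abc=def", "def=abc"]
accepted = [s for s in candidates if pair.fullmatch(s)]

print(capturing.groups(), non_capturing.groups(), accepted)
```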
Modifiers
Syntax Description Example
(?i)
Turn on case insensitivity for the remainder of the regular expression. (Older regex flavors may turn it on for the entire regex.)
Example: te(?i)st matches teST but not TEST.

(?-i)
Turn off case insensitivity for the remainder of the regular expression.
Example: (?i)te(?-i)st matches TEst but not TEST.

(?s)
Turn on "dot matches newline" for the remainder of the regular expression. (Older regex flavors may turn it on for the entire regex.)

(?-s)
Turn off "dot matches newline" for the remainder of the regular expression.

(?m)
Caret and dollar match after and before newlines for the remainder of the regular expression. (Older regex flavors may apply this to the entire regex.)

(?-m)
Caret and dollar only match at the start and end of the string for the remainder of the regular expression.

(?x)
Turn on free-spacing mode to ignore whitespace between regex tokens, and allow # comments.

(?-x)
Turn off free-spacing mode.

(?i-sm)
Turns on the option "i", and turns off "s" and "m" (the options after the hyphen), for the remainder of the regular expression. (Older regex flavors may apply this to the entire regex.)

(?i-sm:regex)
Matches the regex inside the span with the option "i" turned on, and "s" and "m" turned off.
Example: (?i:te)st matches TEst but not TEST.
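The sketch below tries a few modifiers, assuming Python's re module. Recent Python versions accept global inline flags such as (?s) or (?x) only at the very start of the pattern, so a mid-pattern switch like te(?i)st has to be written in the scoped form (?i:...). The phone-number pattern is a made-up sample.

```python
import re

# Scoped modifier: only "te" is case-insensitive, "st" stays case-sensitive.
scoped = re.compile(r"(?i:te)st")
mixed_case = scoped.fullmatch("TEst")
all_upper = scoped.fullmatch("TEST")

# (?s) makes the dot match line breaks as well.
dot_default = re.fullmatch(r"a.b", "a\nb")
dot_all = re.fullmatch(r"(?s)a.b", "a\nb")

# (?x) ignores whitespace between tokens and allows # comments.
free_spacing = re.fullmatch(r"""(?x)
    \d{3}   # three digits
    -       # a literal hyphen
    \d{4}   # four digits
""", "555-1234")

print(bool(mixed_case), bool(all_upper), bool(dot_default),
      bool(dot_all), bool(free_spacing))
```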
Atomic Grouping and Possessive Quantifiers
Syntax Description Example
(?>regex)
Atomic groups prevent the regex engine from backtracking back into the group (forcing the group to discard part of its match) after a match has been found for the group. Backtracking can occur inside the group before it has matched completely, and the engine can backtrack past the entire group, discarding its match entirely. Eliminating needless backtracking provides a speed increase. Atomic grouping is often indispensable when nesting quantifiers to prevent a catastrophic amount of backtracking as the engine needlessly tries pointless permutations of the nested quantifiers.
Example: x(?>\w+)x is more efficient than x\w+x if the second x cannot be matched.

?+, *+, ++ and {m,n}+
Possessive quantifiers are a limited yet syntactically cleaner alternative to atomic grouping. Only available in a few regex flavors. They behave as normal greedy quantifiers, except that they will not give up part of their match for backtracking.
Example: x++ is identical to (?>x+)
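Python's re module supports (?>...) and possessive quantifiers only from version 3.11, so this sketch emulates the atomic group with a lookahead plus backreference, (?=(\w+))\1. That idiom is a swapped-in technique, not the syntax from the table; it relies on the engine not backtracking into a lookahead once it has succeeded.

```python
import re

text = "xabcx"

# Plain greedy \w+ first swallows "abcx", then gives back the final x.
plain = re.search(r"x\w+x", text)

# Emulated atomic group: the lookahead captures "abcx" and \1 consumes it.
# The capture cannot be shortened afterwards, so the final x never matches.
atomic_like = re.search(r"x(?=(\w+))\1x", text)

print(plain.group(0), atomic_like)
```

The second search fails by design: that is the "will not give up part of their match" behaviour, which is what saves time when the overall match cannot succeed anyway.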
Lookaround
Syntax Description Example
(?=regex)
Zero-width positive lookahead. Matches at a position where the pattern inside the lookahead can be matched. Matches only the position. It does not consume any characters or expand the match. In a pattern like one(?=two)three, both two and three have to match at the position where the match of one ends.
Example: t(?=s) matches the second t in streets.

(?!regex)
Zero-width negative lookahead. Identical to positive lookahead, except that the overall match will only succeed if the regex inside the lookahead fails to match.
Example: t(?!s) matches the first t in streets.

(?<=text)
Zero-width positive lookbehind. Matches at a position to the left of which text appears. Since regular expressions cannot be applied backwards, the test inside the lookbehind can only be plain text. Some regex flavors allow alternation of plain text options in the lookbehind.
Example: (?<=s)t matches the first t in streets.

(?<!text)
Zero-width negative lookbehind. Matches at a position if the text does not appear to the left of that position.
Example: (?<!s)t matches the second t in streets.
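The four "streets" examples can be checked by looking at where each match starts. This is a minimal sketch assuming Python's re module; the helper name match_starts is made up.

```python
import re

WORD = "streets"  # the t characters sit at index 1 and index 5

def match_starts(pattern):
    """Return the start index of every match of pattern in WORD."""
    return [m.start() for m in re.finditer(pattern, WORD)]

positive_ahead = match_starts(r"t(?=s)")    # t followed by s
negative_ahead = match_starts(r"t(?!s)")    # t not followed by s
positive_behind = match_starts(r"(?<=s)t")  # t preceded by s
negative_behind = match_starts(r"(?<!s)t")  # t not preceded by s

# Lookarounds are zero-width: the matched text is just the t.
matched_text = re.search(r"t(?=s)", WORD).group(0)

print(positive_ahead, negative_ahead, positive_behind, negative_behind, matched_text)
```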
Continuing from The Previous Match
Syntax Description Example
\G
Matches at the position where the previous match ended, or the position where the current match attempt started (depending on the tool or regex flavor). Matches at the start of the string during the first match attempt.
Example: \G[a-z] first matches a, then matches b and then fails to match in ab_cd.
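Python's re module has no \G token, so the sketch below emulates it by anchoring each attempt at the end of the previous match with Pattern.match(text, pos). The helper name contiguous_matches is hypothetical, and this is a stand-in for \G rather than the token itself.

```python
import re

def contiguous_matches(pattern, text):
    """Collect matches that each start exactly where the previous one ended."""
    compiled = re.compile(pattern)
    position, found = 0, []
    while True:
        match = compiled.match(text, position)  # anchored at position, like \G
        if match is None or match.end() == position:
            break  # no match here, or an empty match that would loop forever
        found.append(match.group(0))
        position = match.end()
    return found

anchored = contiguous_matches(r"[a-z]", "ab_cd")
unanchored = re.findall(r"[a-z]", "ab_cd")

print(anchored, unanchored)
```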
Conditionals
Syntax Description Example
(?(?=regex)then|else)
If the lookahead succeeds, the "then" part must match for the overall regex to match. If the lookahead fails, the "else" part must match for the overall regex to match. Not just positive lookahead, but all four lookarounds can be used. Note that the lookahead is zero-width, so the "then" and "else" parts need to match and consume the part of the text matched by the lookahead as well.
Example: (?(?<=a)b|c) matches the second b and the first c in babxcac

(?(1)then|else)
If the first capturing group took part in the match attempt thus far, the "then" part must match for the overall regex to match. If the first capturing group did not take part in the match, the "else" part must match for the overall regex to match.
Example: (a)?(?(1)b|c) matches ab, the first c and the second c in babxcac
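Python's re module supports only the group-reference form of the conditional, not the lookaround form, so this sketch (assuming that flavor) checks the second example.

```python
import re

# If group 1 (the optional a) took part in the match, require b, else c.
conditional = re.compile(r"(a)?(?(1)b|c)")
found = [m.group(0) for m in conditional.finditer("babxcac")]

print(found)
```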
Comments
Syntax Description Example
(?#comment)
Everything between (?# and ) is ignored by the regex engine.
Example: a(?#foobar)b matches ab
printf - a quick look at Perl and Java
In this cheat sheet I'm going to show all the examples using Perl, but I thought it might help to first show one printf example using both Perl and Java. So, here's a simple Perl printf example to get us started:
printf("the %s jumped over the %s, %d times", "cow", "moon", 2);
And here are three different ways of using printf format specifier syntax with Java:
System.out.format("the %s jumped over the %s, %d times", "cow", "moon", 2);
System.err.format("the %s jumped over the %s, %d times", "cow", "moon", 2);
String result = String.format("the %s jumped over the %s, %d times", "cow", "moon", 2);
As you can see in that last String.format example, that line of code doesn't print any output, while the first line prints to standard output, and the second line prints to standard error.
In the remainder of this document I'm going to use Perl examples, but again, the actual format specifier strings can be used in many different languages.
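As one more illustration of that portability, here is a sketch in Python, which is not one of the languages used in this cheat sheet and is chosen only as an assumption. Python's % operator accepts the same printf-style format string.

```python
import sys

# Like String.format: build the string without printing it.
line = "the %s jumped over the %s, %d times" % ("cow", "moon", 2)

print(line)                    # standard output
print(line, file=sys.stderr)   # standard error
```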
A summary of the printf format specifiers
Here's a quick summary of the available print format specifiers:
%c character
%d decimal (integer) number (base 10)
%e exponential floating-point number
%f floating-point number
%i integer (base 10)
%o octal number (base 8)
%s a string of characters
%u unsigned decimal (integer) number
%x number in hexadecimal (base 16)
%% print a percent sign
\% print a percent sign
Controlling printf integer width
The "%3d" specifier means a minimum width of three spaces, which, by default, will be
right-justified. (Note: the alignment is not currently being displayed properly here.)
printf("%3d", 0); 0
printf("%3d", 123456789); 123456789
printf("%3d", -10); -10
printf("%3d", -123456789); -123456789
Left-justifying printf integer output
To left-justify those previous printf examples, just add a minus sign (-) after the % symbol,
like this:
printf("%-3d", 0); 0
printf("%-3d", 123456789); 123456789
printf("%-3d", -10); -10
printf("%-3d", -123456789); -123456789
The printf zero-fill option
To zero-fill your integer output, just add a zero (0) after the % symbol, like this:
printf("%03d", 0); 000
printf("%03d", 1); 001
printf("%03d", 123456789); 123456789
printf("%03d", -10); -10
printf("%03d", -123456789); -123456789
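Because padding spaces are hard to see in running text, the sketch below wraps each result in single quotes. It uses Python's printf-style % operator as a stand-in for Perl (an assumption); the helper name show is made up.

```python
def show(fmt, value):
    """Format value with a printf-style specifier and quote the result."""
    return "'" + (fmt % value) + "'"

right_justified = show("%3d", 0)        # padded on the left to width 3
never_truncated = show("%3d", 123456789)  # the width is only a minimum
left_justified = show("%-3d", 0)        # padded on the right
zero_filled = show("%03d", 1)           # padded with zeros
negative_zero_filled = show("%03d", -10)  # the sign counts toward the width

print(right_justified, never_truncated, left_justified,
      zero_filled, negative_zero_filled)
```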
printf - integers with formatting
Here is a collection of examples for integer printing. Several different options are shown, including a minimum width specification, left-justified, zero-filled, and also a plus sign for positive numbers.
Description Code Result
At least five-wide                     printf("'%5d'", 10);     '   10'
At least five-wide, left-justified     printf("'%-5d'", 10);    '10   '
At least five-wide, zero-filled        printf("'%05d'", 10);    '00010'
At least five-wide, with a plus sign   printf("'%+5d'", 10);    '  +10'
Five-wide, plus sign, left-justified   printf("'%-+5d'", 10);   '+10  '
Description Code Result
Print one position after the decimal                         printf("'%.1f'", 10.3456);            '10.3'
Two positions after the decimal                              printf("'%.2f'", 10.3456);            '10.35'
Eight-wide, two positions after the decimal                  printf("'%8.2f'", 10.3456);           '   10.35'
Eight-wide, four positions after the decimal                 printf("'%8.4f'", 10.3456);           ' 10.3456'
Eight-wide, two positions after the decimal, zero-filled     printf("'%08.2f'", 10.3456);          '00010.35'
Eight-wide, two positions after the decimal, left-justified  printf("'%-8.2f'", 10.3456);          '10.35   '
Printing a much larger number with that same format          printf("'%-8.2f'", 101234567.3456);   '101234567.35'
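The floating-point rows can be reproduced with a short sketch, using Python's printf-style % operator as a stand-in for Perl (an assumption).

```python
value = 10.3456

one_decimal = "%.1f" % value
two_decimals = "%.2f" % value
wide_two = "%8.2f" % value       # right-justified in eight columns
wide_four = "%8.4f" % value
zero_filled = "%08.2f" % value
left_justified = "%-8.2f" % value
too_large = "%-8.2f" % 101234567.3456  # the width is a minimum, not a limit

for result in (one_decimal, two_decimals, wide_two, wide_four,
               zero_filled, left_justified, too_large):
    print("'" + result + "'")
```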
How to print strings with printf formatting
Here are several printf formatting examples that show how to format string output
with printf format specifiers.
Description Code Result
A simple string printf("'%s'", "Hello"); 'Hello'
A string with a minimum length   printf("'%10s'", "Hello");    '     Hello'
Minimum length, left-justified   printf("'%-10s'", "Hello");   'Hello     '
Summary of special printf characters
The following character sequences have a special meaning when used inside a printf format string:
\a audible alert
\b backspace
\f form feed
\n newline, or linefeed
\r carriage return
\t tab
\v vertical tab
\\ backslash
As you can see from that last example, because the backslash character itself is treated specially, you have to print two backslash characters in a row to get one backslash character to appear in your output.
Here are a few examples of how to use these special characters:
Description Code Result
Insert a tab character in a string             printf("Hello\tworld");              Hello    world
Insert a newline character in a string         printf("Hello\nworld");              Hello
                                                                                    world
Typical use of the newline character           printf("Hello world\n");             Hello world
A DOS/Windows path with backslash characters   printf("C:\\Windows\\System32\\");   C:\Windows\System32\
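The same escape sequences behave the same way in Python string literals, which this sketch uses as a stand-in for Perl (an assumption).

```python
tabbed = "Hello\tworld"
two_lines = "Hello\nworld"
path = "C:\\Windows\\System32\\"  # each \\ in the source is one backslash

print(tabbed)
print(two_lines)
print(path)

# The doubled backslashes exist only in the source code, not in the string.
backslash_count = path.count("\\")
line_count = len(two_lines.splitlines())
```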
Algorithms: Big-Oh Notation
How time and space grow as the amount of data increases
It's useful to estimate the CPU or memory resources an algorithm requires. This "complexity analysis" attempts to characterize the relationship between the number of data elements and resource usage (time or space) with a simple formula approximation. Many programmers have had ugly surprises when they moved from small test data to large data sets. This analysis will make you aware of potential problems.
Dominant Term
Big-Oh (the "O" stands for "order of") notation is concerned with what happens for very large values of N, therefore only the largest term in a polynomial is needed. All smaller terms are dropped.
For example, the number of operations in some sorts is N^2 - N. For large values of N, the single N term is insignificant compared to N^2, therefore one of these sorts would be described as an O(N^2) algorithm.
Similarly, constant multipliers are ignored. So an O(4*N) algorithm is equivalent to O(N), which is how it should be written. Ultimately you want to pay attention to these multipliers in determining the performance, but for the first round of analysis using Big-Oh, you simply ignore constant factors.
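A quick numeric check shows why the smaller term can be dropped. This is a minimal sketch; the function name exact_ops and the sample sizes are made up for illustration.

```python
def exact_ops(n):
    """Operation count N^2 - N, as quoted for some sorts."""
    return n * n - n

# The ratio to the dominant term N^2 approaches 1 as N grows.
for n in (10, 1_000, 1_000_000):
    ratio = exact_ops(n) / (n * n)
    print(f"N = {n:>9,}   (N^2 - N) / N^2 = {ratio:.6f}")
```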
Why Size Matters
Here is a table of typical cases, showing how many "operations" would be performed for various values of N. Logarithms to base 2 (as used here) are proportional to logarithms in other bases, so this doesn't affect the big-oh formula.
N          constant  logarithmic  linear                 quadratic      cubic
           O(1)      O(log N)     O(N)       O(N log N)  O(N^2)         O(N^3)
1          1         1            1          1           1              1
2          1         1            2          2           4              8
4          1         2            4          8           16             64
8          1         3            8          24          64             512
16         1         4            16         64          256            4,096
1,024      1         10           1,024      10,240      1,048,576      1,073,741,824
1,048,576  1         20           1,048,576  20,971,520  10^12          10^18
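Rows like these can be regenerated from the formulas themselves. The sketch below assumes base-2 logarithms and powers of two for N; the function name growth_row is hypothetical.

```python
import math

def growth_row(n):
    """Operation counts for O(1), O(log N), O(N), O(N log N), O(N^2), O(N^3)."""
    log_n = int(math.log2(n))  # exact when n is a power of two
    return (1, log_n, n, n * log_n, n ** 2, n ** 3)

for n in (16, 1_024, 1_048_576):
    print(f"{n:>9,}", growth_row(n))
```

For N = 1,048,576 the last two entries are 2^40 (about 1.1 * 10^12) and 2^60 (about 1.15 * 10^18).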
Does anyone really have that much data?
It's quite common. For example, it's hard to find a digital camera that has fewer than a million pixels (1 megapixel). These images are processed and displayed on the screen. The algorithms that do this had better not be O(N^2)! If it took one microsecond (1 millionth of a second) to process each pixel, an O(N^2) algorithm would take more than a week to finish processing a 1 megapixel image, and more than three months to process a 3 megapixel image (note the rate of increase is definitely not linear).
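The arithmetic behind those estimates, as a small sketch. It assumes exactly N^2 operations at one microsecond each; the function name quadratic_days is made up.

```python
SECONDS_PER_DAY = 24 * 60 * 60

def quadratic_days(pixels, microseconds_per_operation=1.0):
    """Days needed for pixels^2 operations at the given cost per operation."""
    total_seconds = pixels ** 2 * microseconds_per_operation / 1_000_000
    return total_seconds / SECONDS_PER_DAY

one_megapixel = quadratic_days(1_000_000)    # roughly 11.6 days
three_megapixel = quadratic_days(3_000_000)  # roughly 104 days

print(f"{one_megapixel:.1f} days, {three_megapixel:.1f} days")
```

Tripling the pixel count multiplies the time by nine, which is the non-linear growth the text points out.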
Another example is sound. CD audio samples are 16 bits, sampled 44,100 times per second for each of two channels. A typical 3 minute song consists of about 8 million data points per channel. You had better choose the right algorithm to process this data.
A dictionary I've used for text analysis has about 125,000 entries. There's a big difference between a linear O(N), binary O(log N), or hash O(1) search.
Best, worst, and average cases
You should be clear about which cases big-oh notation describes. By default it usually refers to the average case, using random data. However, the characteristics for best, worst, and average cases can be very different, and the use of non-random (often more realistic) data can have a big effect on some algorithms.
Why big-oh notation isn't always useful
Complexity analysis can be very useful, but there are problems with it too.
Too hard to analyze. Many algorithms are simply too hard to analyze mathematically.
Average case unknown. There may not be sufficient information to know what the most important "average" case really is, therefore analysis is impossible.
Unknown constant. Both walking and traveling at the speed of light have a time-as-function-of-distance big-oh complexity of O(N). Although they have the same big-oh characteristics, one is rather faster than the other. Big-oh analysis only tells you how it grows with the size of the problem, not how efficient it is.
Small data sets. If there are no large amounts of data, algorithm efficiency may not be important.
Benchmarks are better
Big-oh notation can give very good ideas about performance for large amounts of data, but the only real way to know for sure is to actually try it with large data sets. There may be performance issues that are not taken into account by big-oh notation, e.g., the effect on paging as virtual memory usage grows. Although benchmarks are better, they aren't feasible during the design process, so Big-Oh complexity analysis is the choice.
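A benchmark can be as small as the sketch below, which times a membership test on a list (linear search) and on a set (hash lookup) using Python's timeit module. The data size, repeat count and function name are arbitrary assumptions, and the measured times will differ from machine to machine.

```python
import timeit

def time_membership(container, target, repeats=200):
    """Total seconds for `repeats` membership tests of target in container."""
    return timeit.timeit(lambda: target in container, number=repeats)

size = 100_000
as_list = list(range(size))
as_set = set(as_list)

# Worst case for the list: the target is the last element.
list_seconds = time_membership(as_list, size - 1)
set_seconds = time_membership(as_set, size - 1)

print(f"list: {list_seconds:.4f} s   set: {set_seconds:.6f} s")
```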
Typical big-oh values for common algorithms
Searching
Here is a table of typical cases.
Type of Search Big-Oh Comments
Linear search, array/ArrayList/LinkedList   O(N)
Binary search, sorted array/ArrayList       O(log N)   Requires sorted data.
Search balanced tree                        O(log N)
Search hash table                           O(1)
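To make the difference concrete, this sketch counts the steps each search needs on 125,000 sorted entries. The entries are synthetic placeholders, and the step-counting functions are hypothetical helpers written for this illustration.

```python
def linear_search_steps(items, target):
    """Number of elements examined by a front-to-back scan."""
    for steps, item in enumerate(items, start=1):
        if item == target:
            return steps
    return len(items)

def binary_search_steps(items, target):
    """Number of halvings a binary search needs on sorted items."""
    low, high, steps = 0, len(items), 0
    while low < high:
        steps += 1
        middle = (low + high) // 2
        if items[middle] < target:
            low = middle + 1
        else:
            high = middle
    return steps

entries = [f"word{i:06d}" for i in range(125_000)]  # already in sorted order
lookup = set(entries)                               # hash-based membership
target = entries[-1]

linear_steps = linear_search_steps(entries, target)
binary_steps = binary_search_steps(entries, target)
print(linear_steps, binary_steps, target in lookup)
```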
Other Typical Operations
Algorithm          array/ArrayList   LinkedList
access front       O(1)              O(1)
access back        O(1)              O(1)
access middle      O(1)              O(N)
insert at front    O(N)              O(1)
insert at back     O(1)              O(1)
insert in middle   O(N)              O(1) once positioned at the node; reaching the middle is O(N)
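The linked-list column is easier to read with the two costs separated: walking to a position and relinking once there. The sketch below uses a hand-written singly linked list; the Node class and helper names are hypothetical and are not Java's LinkedList.

```python
class Node:
    """One cell of a singly linked list."""
    def __init__(self, value, next_node=None):
        self.value = value
        self.next_node = next_node

def node_at(head, index):
    """Walk to the node at index; returns the node and the links followed."""
    node, steps = head, 0
    for _ in range(index):
        node = node.next_node
        steps += 1
    return node, steps

def insert_after(node, value):
    """Insert a new node after a known node by relinking: constant work."""
    node.next_node = Node(value, node.next_node)

def to_list(head):
    values = []
    while head is not None:
        values.append(head.value)
        head = head.next_node
    return values

# Build 0 -> 1 -> ... -> 9 by inserting at the front each time.
head = None
for value in range(9, -1, -1):
    head = Node(value, head)

middle, steps_taken = node_at(head, 5)  # O(N) walk to the middle
insert_after(middle, 99)                # O(1) relink once there
print(steps_taken, to_list(head))
```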
Sorting arrays/ArrayLists
Some sorting algorithms show variability in their Big-Oh performance. It is therefore interesting to look at their best, worst, and average performance. For this description "average" is applied to uniformly distributed values. The distribution of real values for any given application may be important in selecting a particular algorithm.
Type of Sort     Best         Worst        Average      Comments
BubbleSort       O(N)         O(N^2)       O(N^2)       Not a good sort, except with ideal data.
Selection sort   O(N^2)       O(N^2)       O(N^2)       Perhaps best of O(N^2) sorts.
QuickSort        O(N log N)   O(N^2)       O(N log N)   Good, but its worst case is O(N^2).
HeapSort         O(N log N)   O(N log N)   O(N log N)   Typically slower than QuickSort, but worst case is much better.
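The gap between the best and worst case of a bubble sort shows up clearly when comparisons are counted. This is a sketch of one common variant that stops after a pass with no swaps; the function name and the input sizes are assumptions.

```python
def bubble_sort(values):
    """Return (sorted copy, comparisons made), stopping after a swap-free pass."""
    items = list(values)
    comparisons = 0
    unsorted_length = len(items)
    while True:
        swapped = False
        for i in range(unsorted_length - 1):
            comparisons += 1
            if items[i] > items[i + 1]:
                items[i], items[i + 1] = items[i + 1], items[i]
                swapped = True
        if not swapped:
            break
        unsorted_length -= 1  # the largest remaining value is now in place
    return items, comparisons

_, best_case = bubble_sort(range(100))                     # already sorted
sorted_items, worst_case = bubble_sort(range(99, -1, -1))  # reversed

print(best_case, worst_case)
```

With 100 values the sorted input needs N - 1 comparisons, while the reversed input needs N(N - 1)/2.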
Example - choosing a non-optimal algorithm
I had to sort a large array of numbers. The values were almost always already in order, and even when they weren't in order there was typically only one number that was out of order. Only rarely were the values completely disorganized. I used a bubble sort because it was O(N) for my "average" data. This was many years ago when CPUs were 1000 times slower. Today I would simply use the library sort for the amount of data I had because the difference in execution time would probably be unnoticed. However, there are always data sets which are so large that a choice of algorithms really matters.
Example - O(N^3) surprise
I once wrote a text-processing program to solve some particular customer problem. After seeing how well it processed the test data, the customer produced real data, which I confidently ran the program on. The program froze -- the problem was that I had inadvertently used an O(N^3) algorithm and there was no way it was going to finish in my lifetime. Fortunately, my reputation was restored when I was able to rewrite the offending algorithm within an hour and process the real data in under a minute. Still, it was a sobering experience, illustrating dangers in ignoring complexity analysis, using unrealistic test data, and giving customer demos.
Same Big-Oh, but big differences
Although two algorithms have the same big-oh characteristics, they may differ by a factor of three (or more) in practical implementations. Remember that big-oh notation ignores constant overhead and constant factors. These can be substantial and can't be ignored in practical implementations.
Time-space tradeoffs
Sometimes it's possible to reduce execution time by using more space, or reduce space requirements by using a more time-intensive algorithm.
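Memoization is one familiar instance of spending space to save time. The sketch below, with Fibonacci numbers chosen purely as an illustration, counts function calls with and without a cache provided by functools.lru_cache.

```python
from functools import lru_cache

calls = {"plain": 0, "cached": 0}

def fib_plain(n):
    """Recompute every subproblem: no extra space, exponential time."""
    calls["plain"] += 1
    return n if n < 2 else fib_plain(n - 1) + fib_plain(n - 2)

@lru_cache(maxsize=None)
def fib_cached(n):
    """Remember every result: O(N) extra space, one call per distinct n."""
    calls["cached"] += 1
    return n if n < 2 else fib_cached(n - 1) + fib_cached(n - 2)

plain_result = fib_plain(20)
cached_result = fib_cached(20)

print(plain_result, cached_result, calls)
```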