Regular Expression Basic Syntax Reference

Characters

Character Description Example

Any character except [\^$.|?*+()

All characters except the listed special characters match a single instance of themselves. { and } are literal characters, unless they're part of a valid regular expression token (e.g. the {n} quantifier).

a matches a

\ (backslash) followed by any of [\^$.|?*+(){}

A backslash escapes special characters to suppress their special meaning.

\+ matches +

\Q...\E

Matches the characters between \Q and \E literally, suppressing the meaning of special characters.

\Q+-*/\E matches +-*/

\xFF where FF are 2 hexadecimal digits

Matches the character with the specified ASCII/ANSI value, which depends on the code page used. Can be used in character classes.

\xA9 matches © when using the Latin-1 code page.

\n, \r and \t

Match an LF character, CR character and a tab character respectively. Can be used in character classes.

\r\n matches a DOS/Windows CRLF line break.

\a, \e, \f and \v

Match a bell character (\x07), escape character (\x1B), form feed (\x0C) and vertical tab (\x0B) respectively. Can be used in character classes.

\cA through \cZ

Match an ASCII character Control+A through Control+Z, equivalent to \x01 through \x1A. Can be used in character classes.

\cM\cJ matches a DOS/Windows CRLF line break.
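
A minimal sketch of the escapes above, assuming Python's standard re module as the regex flavor (the reference itself is flavor-neutral). Python's re does not support \Q...\E, \e or \cA, so re.escape stands in for \Q...\E here and the CRLF pair is written with \x escapes instead of \cM\cJ.

```python
import re

# A backslash suppresses the special meaning of +
plus = re.search(r"\+", "1+1").group()

# re.escape backslash-escapes special characters, standing in for \Q...\E
literal = re.search(re.escape("+-*/"), "x = +-*/ y").group()

# \xFF takes two hexadecimal digits; \xA9 is the copyright sign
copyright_sign = re.search(r"\xA9", "\u00a9 2024").group()

# \r\n and \x0D\x0A both describe a DOS/Windows CRLF line break
crlf = re.search(r"\r\n", "line one\r\nline two").group()
crlf_hex = re.search(r"\x0D\x0A", "line one\r\nline two").group()
```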

Character Classes or Character Sets [abc]


[ (opening square bracket)

Starts a character class. A character class matches a single character out of all the possibilities offered by the character class. Inside a character class, different rules apply. The rules in this section are only valid inside character classes. The rules outside this section are not valid in character classes, except for a few character escapes that are indicated with "can be used inside character classes".

Any character except ^-]\

All characters except the listed special characters add that character to the possible matches for the character class.

[abc] matches a, b or c

\ (backslash) followed by any of ^-]\

A backslash escapes special characters to suppress their special meaning.

[\^\]] matches ^ or ]

- (hyphen) except immediately after the opening [

Specifies a range of characters. (Specifies a hyphen if placed immediately after the opening [)

[a-zA-Z0-9] matches any letter or digit

^ (caret) immediately after the opening [

Negates the character class, causing it to match a single character not listed in the character class. (Specifies a caret if placed anywhere except after the opening [)

[^a-d] matches x (any character except a, b, c or d)

\d, \w and \s

Shorthand character classes matching digits, word characters (letters, digits, and underscores), and whitespace (spaces, tabs, and line breaks). Can be used inside and outside character classes.

[\d\s] matches a character that is a digit or whitespace

\D, \W and \S

Negated versions of the above. Should be used only outside character classes. (Can be used inside, but that is confusing.)

\D matches a character that is not a digit

[\b]

Inside a character class, \b is a backspace character.

[\b\t] matches a backspace or tab character
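
The character class rules above can be tried out with a short sketch, again assuming Python's re module as the flavor. The sample strings are made up for illustration.

```python
import re

# [abc] matches a single character out of a, b or c
one_of = re.findall(r"[abc]", "back")

# A caret right after the opening [ negates the class
negated = re.findall(r"[^a-d]", "abcdx")

# Backslash-escaped ^ and ] are literals inside a class
caret_or_bracket = re.findall(r"[\^\]]", "a^b]c")

# Shorthand classes can be combined inside a class
digit_or_space = re.findall(r"[\d\s]", "a1 b")

# Inside a class, \b is a backspace character, not a word boundary
backspace_or_tab = re.findall(r"[\b\t]", "a\bb\tc")
```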

Dot


. (dot)

Matches any single character except line break characters \r and \n. Most regex flavors have an option to make the dot match line break characters too.

. matches x or (almost) any other character

Anchors


^ (caret)

Matches at the start of the string the regex pattern is applied to. Matches a position rather than a character. Most regex flavors have an option to make the caret match after line breaks (i.e. at the start of a line in a file) as well.

^. matches a in abc\ndef. Also matches d in "multi-line" mode.

$ (dollar)

Matches at the end of the string the regex pattern is applied to. Matches a position rather than a character. Most regex flavors have an option to make the dollar match before line breaks (i.e. at the end of a line in a file) as well. Also matches before the very last line break if the string ends with a line break.

.$ matches f in abc\ndef. Also matches c in "multi-line" mode.

\A

Matches at the start of the string the regex pattern is applied to. Matches a position rather than a character. Never matches after line breaks.

\A. matches a in abc

\Z

Matches at the end of the string the regex pattern is applied to. Matches a position rather than a character. Never matches before line breaks, except for the very last line break if the string ends with a line break.

.\Z matches f in abc\ndef

\z

Matches at the end of the string the regex pattern is applied to. Matches a position rather than a character. Never matches before line breaks.

.\z matches f in abc\ndef
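
A small sketch of the anchors, assuming Python's re module, where "multi-line" mode is the re.M flag. One flavor difference to keep in mind: in Python, \Z matches only at the very end of the string, so it behaves like the \z entry above rather than the \Z entry.

```python
import re

text = "abc\ndef"

# Default mode: ^ and $ only match at the start and end of the string
start_default = re.findall(r"^.", text)
end_default = re.findall(r".$", text)

# Multi-line mode: ^ and $ also match after and before line breaks
start_multiline = re.findall(r"^.", text, re.M)
end_multiline = re.findall(r".$", text, re.M)

# $ also matches before a final line break; Python's \Z does not
dollar_before_final_break = re.search(r"f$", "abc\ndef\n") is not None
upper_z_before_final_break = re.search(r"f\Z", "abc\ndef\n") is not None
```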

Word Boundaries


\b

Matches at the position between a word character (anything matched by \w) and a non-word character (anything matched by [^\w] or \W) as well as at the start and/or end of the string if the first and/or last characters in the string are word characters.

.\b matches c in abc

\B

Matches at the position between two word characters (i.e. the position between \w\w) as well as at the position between two non-word characters (i.e. \W\W).

\B.\B matches b in abc
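
The two examples above, plus a whole-word search, as a brief sketch in Python's re module (an assumed flavor; the sample sentence is made up):

```python
import re

# .\b needs a boundary after the character, which only exists after c
last_char = re.findall(r".\b", "abc")

# \B.\B needs a non-boundary on both sides, which only b has
inner_char = re.findall(r"\B.\B", "abc")

# \b on both sides finds cat as a whole word, skipping the cat in concat
whole_words = re.findall(r"\bcat\b", "cat concat cat.")
```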

Alternation


| (pipe)

Causes the regex engine to match either the part on the left side, or the part on the right side. Can be strung together into a series of options.

abc|def|xyz matches abc, def or xyz

| (pipe)

The pipe has the lowest precedence of all operators. Use grouping to alternate only part of the regular expression.

abc(def|xyz) matches abcdef or abcxyz
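
To make the precedence rule concrete, here is a hedged sketch in Python's re module. The candidate strings are illustrative only.

```python
import re

candidates = ("abcdef", "abcxyz", "xyz")

# Without grouping the pipe splits the whole pattern: "abcdef" or "xyz"
ungrouped = [s for s in candidates if re.fullmatch(r"abcdef|xyz", s)]

# With grouping only the part inside the parentheses alternates
grouped = [s for s in candidates if re.fullmatch(r"abc(def|xyz)", s)]
```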

Quantifiers


? (question mark)

Makes the preceding item optional. Greedy, so the optional item is included in the match if possible.

abc? matches ab or abc

??

Makes the preceding item optional. Lazy, so the optional item is excluded from the match if possible. This construct is often excluded from documentation because of its limited use.

abc?? matches ab or abc

* (star)

Repeats the previous item zero or more times. Greedy, so as many items as possible will be matched before trying permutations with fewer matches of the preceding item, up to the point where the preceding item is not matched at all.

".*" matches "def" "ghi" in abc "def" "ghi" jkl

*? (lazy star)

Repeats the previous item zero or more times. Lazy, so the engine first attempts to skip the previous item, before trying permutations with ever increasing matches of the preceding item.

".*?" matches "def" in abc "def" "ghi" jkl

+ (plus)

Repeats the previous item once or more. Greedy, so as many items as possible will be matched before trying permutations with fewer matches of the preceding item, up to the point where the preceding item is matched only once.

".+" matches "def" "ghi" in abc "def" "ghi" jkl

+? (lazy plus)

Repeats the previous item once or more. Lazy, so the engine first matches the previous item only once, before trying permutations with ever increasing matches of the preceding item.

".+?" matches "def" in abc "def" "ghi" jkl

{n} where n is an integer >= 1

Repeats the previous item exactly n times.

a{3} matches aaa

{n,m} where n >= 0 and m >= n

Repeats the previous item between n and m times. Greedy, so repeating m times is tried before reducing the repetition to n times.

a{2,4} matches aaaa, aaa or aa

{n,m}? where n >= 0 and m >= n

Repeats the previous item between n and m times. Lazy, so repeating n times is tried before increasing the repetition to m times.

a{2,4}? matches aa, aaa or aaaa

{n,} where n >= 0

Repeats the previous item at least n times. Greedy, so as many items as possible will be matched before trying permutations with fewer matches of the preceding item, up to the point where the preceding item is matched only n times.

a{2,} matches aaaaa in aaaaa

{n,}? where n >= 0

Repeats the previous item n or more times. Lazy, so the engine first matches the previous item n times, before trying permutations with ever increasing matches of the preceding item.

a{2,}? matches aa in aaaaa
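
The difference between greedy and lazy quantifiers is easiest to see side by side. The following is a sketch in Python's re module (an assumed flavor), using the same sample strings as the examples above:

```python
import re

text = 'abc "def" "ghi" jkl'

# Greedy star runs to the last quote; lazy star stops at the first one
greedy_star = re.search(r'".*"', text).group()
lazy_star = re.search(r'".*?"', text).group()

# Greedy {n,} takes everything; lazy {n,}? stops after n repetitions
greedy_open = re.search(r"a{2,}", "aaaaa").group()
lazy_open = re.search(r"a{2,}?", "aaaaa").group()

# Greedy {n,m} tries m repetitions first; lazy {n,m}? tries n first
greedy_range = re.search(r"a{2,4}", "aaaaa").group()
lazy_range = re.search(r"a{2,4}?", "aaaaa").group()
```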

Regular Expression Advanced Syntax Reference

Grouping and Backreferences

Syntax Description Example

(regex)

Round brackets group the regex between them. They capture the text matched by the regex inside them that can be reused in a backreference, and they allow you to apply regex operators to the entire grouped regex.

(abc){3} matches abcabcabc. First group matches abc.

(?:regex)

Non-capturing parentheses group the regex so you can apply regex operators, but do not capture anything and do not create backreferences.

(?:abc){3} matches abcabcabc. No groups.

\1 through \9

Substituted with the text matched between the 1st through 9th pair of capturing parentheses. Some regex flavors allow more than 9 backreferences.

(abc|def)=\1 matches abc=abc or def=def, but not abc=def or def=abc.
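
A brief sketch of capturing groups, non-capturing groups and backreferences, assuming Python's re module:

```python
import re

# \1 must repeat exactly what the first group matched
same_on_both_sides = re.compile(r"(abc|def)=\1")
results = {
    s: same_on_both_sides.fullmatch(s) is not None
    for s in ("abc=abc", "def=def", "abc=def", "def=abc")
}

# A capturing group keeps the text of its last repetition
first_group = re.fullmatch(r"(abc){3}", "abcabcabc").group(1)

# A non-capturing group leaves no groups behind
no_groups = re.fullmatch(r"(?:abc){3}", "abcabcabc").groups()
```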

Modifiers


(?i)

Turn on case insensitivity for the remainder of the regular expression. (Older regex flavors may turn it on for the entire regex.)

te(?i)st matches teST but not TEST.

(?-i)

Turn off case insensitivity for the remainder of the regular expression.

(?i)te(?-i)st matches TEst but not TEST.

(?s)

Turn on "dot matches newline" for the remainder of the regular expression. (Older regex flavors may turn it on for the entire regex.)

(?-s)

Turn off "dot matches newline" for the remainder of the regular expression.

(?m)

Caret and dollar match after and before newlines for the remainder of the regular expression. (Older regex flavors may apply this to the entire regex.)

(?-m)

Caret and dollar only match at the start and end of the string for the remainder of the regular expression.

(?x)

Turn on free-spacing mode to ignore whitespace between regex tokens, and allow # comments.

(?-x)

Turn off free-spacing mode.

(?i-sm)

Turns on the option "i" and turns off the options "s" and "m" for the remainder of the regular expression. (Older regex flavors may apply this to the entire regex.)

(?i-sm:regex)

Matches the regex inside the span with the option "i" turned on, and the options "s" and "m" turned off.

(?i:te)st matches TEst but not TEST.
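
Support for mode modifiers varies a lot between flavors. As one data point, Python's re module accepts scoped modifiers such as (?i:regex) (since Python 3.6) and whole-pattern modifiers such as (?s) and (?x) at the very start of the pattern, but not a mid-pattern (?i) or (?-i). The sketch below sticks to those supported forms; the phone-number pattern is a made-up illustration.

```python
import re

# Scoped modifier: only "te" is case insensitive
scoped = re.compile(r"(?i:te)st")
scoped_hits = [s for s in ("TEst", "test", "TEST") if scoped.fullmatch(s)]

# (?s) at the start makes the dot match line breaks too
dot_default = re.fullmatch(r"a.b", "a\nb") is not None
dot_matches_newline = re.fullmatch(r"(?s)a.b", "a\nb") is not None

# (?x) at the start ignores whitespace and allows # comments
free_spacing = re.fullmatch(r"""(?x)
    \d{3}   # three digits
    -       # a literal hyphen
    \d{4}   # four digits
""", "555-1234") is not None
```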

Atomic Grouping and Possessive Quantifiers


(?>regex)

Atomic groups prevent the regex engine from backtracking back into the group (forcing the group to discard part of its match) after a match has been found for the group. Backtracking can occur inside the group before it has matched completely, and the engine can backtrack past the entire group, discarding its match entirely. Eliminating needless backtracking provides a speed increase. Atomic grouping is often indispensable when nesting quantifiers to prevent a catastrophic amount of backtracking as the engine needlessly tries pointless permutations of the nested quantifiers.

x(?>\w+)x is more efficient than x\w+x if the second x cannot be matched.

?+, *+, ++ and {m,n}+

Possessive quantifiers are a limited yet syntactically cleaner alternative to atomic grouping. Only available in a few regex flavors. They behave as normal greedy quantifiers, except that they will not give up part of their match for backtracking.

x++ is identical to (?>x+)
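
Python's re module only gained atomic groups and possessive quantifiers in version 3.11, so the sketch below does not use them. Instead it emulates an atomic group with a lookahead plus a backreference, (?=(regex))\1, a workaround that relies on the engine not backtracking into a lookahead once it has matched. This is a stand-in for (?>regex), not the same syntax, and it is shown only to make the "will not give up part of the match" behavior visible.

```python
import re

text = "xabcx"

# Plain greedy \w+ first swallows "abcx", then backtracks to free the last x
backtracking = re.search(r"x\w+x", text)

# Emulated atomic group: \w+ keeps "abcx", so the final x can never match
atomic_like = re.search(r"x(?=(\w+))\1x", text)
```

With a real atomic group the same pattern would be written x(?>\w+)x and would fail on this input for the same reason.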

Lookaround


(?=regex)

Zero-width positive lookahead. Matches at a position where the pattern inside the lookahead can be matched. Matches only the position. It does not consume any characters or expand the match. In a pattern like one(?=two)three, both two and three have to match at the position where the match of one ends.

t(?=s) matches the second t in streets.

(?!regex)

Zero-width negative lookahead. Identical to positive lookahead, except that the overall match will only succeed if the regex inside the lookahead fails to match.

t(?!s) matches the first t in streets.

(?<=text)

Zero-width positive lookbehind. Matches at a position to the left of which text appears. Since regular expressions cannot be applied backwards, the test inside the lookbehind can only be plain text. Some regex flavors allow alternation of plain text options in the lookbehind.

(?<=s)t matches the first t in streets.

(?<!text)

Zero-width negative lookbehind. Matches at a position if the text does not appear to the left of that position.

(?<!s)t matches the second t in streets.
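
The four streets examples can be checked by looking at where each match starts. This is a sketch assuming Python's re module; the helper name match_starts is made up for the example. In streets, the first t is at index 1 and the second at index 5.

```python
import re

word = "streets"

def match_starts(pattern):
    """Return the start index of every match of pattern in word."""
    return [m.start() for m in re.finditer(pattern, word)]

followed_by_s = match_starts(r"t(?=s)")        # positive lookahead
not_followed_by_s = match_starts(r"t(?!s)")    # negative lookahead
preceded_by_s = match_starts(r"(?<=s)t")       # positive lookbehind
not_preceded_by_s = match_starts(r"(?<!s)t")   # negative lookbehind
```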

Continuing from The Previous Match


\G

Matches at the position where the previous match ended, or the position where the current match attempt started (depending on the tool or regex flavor). Matches at the start of the string during the first match attempt.

\G[a-z] first matches a, then matches b and then fails to match in ab_cd.
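
Python's re module has no \G, but the same "continue from the previous match" behavior can be approximated with Pattern.match(string, pos), which only succeeds at exactly the given position. The loop below is a sketch of that idea; the function name contiguous_matches is hypothetical.

```python
import re

def contiguous_matches(pattern, text):
    """Collect matches that each start exactly where the previous one ended."""
    found, pos = [], 0
    while True:
        m = pattern.match(text, pos)      # succeeds only at position pos
        if m is None or m.end() == pos:   # stop on failure or an empty match
            break
        found.append(m.group())
        pos = m.end()
    return found

matches = contiguous_matches(re.compile(r"[a-z]"), "ab_cd")
```

As in the \G[a-z] example, the loop finds a and b and then stops at the underscore, never reaching c or d.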

Conditionals


(?(?=regex)then|else)

If the lookahead succeeds, the "then" part must match for the overall regex to match. If the lookahead fails, the "else" part must match for the overall regex to match. Not just positive lookahead, but all four lookarounds can be used. Note that the lookahead is zero-width, so the "then" and "else" parts need to match and consume the part of the text matched by the lookahead as well.

(?(?<=a)b|c) matches the second b and the first c in babxcac

(?(1)then|else)

If the first capturing group took part in the match attempt thus far, the "then" part must match for the overall regex to match. If the first capturing group did not take part in the match, the "else" part must match for the overall regex to match.

(a)?(?(1)b|c) matches ab, the first c and the second c in babxcac
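
Python's re module supports the group-based conditional (?(1)then|else) but not the lookaround form, so this sketch only covers the second example above:

```python
import re

# If group 1 matched an a, a b must follow; otherwise a c must match
conditional = re.compile(r"(a)?(?(1)b|c)")
hits = [m.group() for m in conditional.finditer("babxcac")]
```

The a at index 5 is followed by c rather than b, so no match starts there; the second c is then matched on its own through the "else" part.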

Comments


(?#comment)

Everything between (?# and ) is ignored by the regex engine.

a(?#foobar)b matches ab

printf - a quick look at Perl and Java

In this cheat sheet I'm going to show all the examples using Perl, but I thought at first it might help to show one printf example using both Perl and Java. So, here's a simple Perl printf example to get us started:

printf("the %s jumped over the %s, %d times", "cow", "moon", 2);

And here are three different ways of using printf format specifier syntax with Java:

System.out.format("the %s jumped over the %s, %d times", "cow", "moon", 2);

System.err.format("the %s jumped over the %s, %d times", "cow", "moon", 2);

String result = String.format("the %s jumped over the %s, %d times", "cow", "moon", 2);

As you can see in that last String.format example, that line of code doesn't print any output, while the first line prints to standard output, and the second line prints to standard error.

In the remainder of this document I'm going to use Perl examples, but again, the actual format specifier strings can be used in many different languages.

A summary of the printf format specifiers

Here's a quick summary of the available print format specifiers:

%c character

%d decimal (integer) number (base 10)

%e exponential floating-point number

%f floating-point number

%i integer (base 10)

%o octal number (base 8)

%s a string of characters

%u unsigned decimal (integer) number

%x number in hexadecimal (base 16)

%% print a percent sign

\% print a percent sign

Controlling printf integer width

The "%3d" specifier means a minimum width of three characters, which, by default, will be right-justified. (Note: the results are shown in quotes here so the padding is visible; the quotes are not part of the output.)

printf("%3d", 0);             "  0"

printf("%3d", 123456789);     "123456789"

printf("%3d", -10);           "-10"

printf("%3d", -123456789);    "-123456789"

Left-justifying printf integer output

To left-justify those previous printf examples, just add a minus sign (-) after the % symbol, like this (again with the results in quotes so the padding is visible):

printf("%-3d", 0);             "0  "

printf("%-3d", 123456789);     "123456789"

printf("%-3d", -10);           "-10"

printf("%-3d", -123456789);    "-123456789"

The printf zero-fill option

To zero-fill your integer output, just add a zero (0) after the % symbol, like this:

printf("%03d", 0); 000

printf("%03d", 1); 001

printf("%03d", 123456789); 123456789

printf("%03d", -10); -10

printf("%03d", -123456789); -123456789

printf - integers with formatting

Here is a collection of examples for integer printing. Several different options are shown, including a minimum width specification, left-justified, zero-filled, and also a plus sign for positive numbers.

Description                             Code                       Result

At least five-wide                      printf("'%5d'", 10);       '   10'

At least five-wide, left-justified      printf("'%-5d'", 10);      '10   '

At least five-wide, zero-filled         printf("'%05d'", 10);      '00010'

At least five-wide, with a plus sign    printf("'%+5d'", 10);      '  +10'

Five-wide, plus sign, left-justified    printf("'%-+5d'", 10);     '+10  '
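
If you'd like to check these integer results without Perl, here's a small sketch that assumes Python, whose % string operator accepts the same width, flag, and type characters for these cases:

```python
right_justified = "%5d" % 10       # minimum width of five, padded on the left
left_justified = "%-5d" % 10       # minus sign pads on the right instead
zero_filled = "%05d" % 10          # zero pads with 0 instead of spaces
plus_sign = "%+5d" % 10            # plus sign is always shown
plus_left = "%-+5d" % 10           # plus sign and left-justified

# The width is only a minimum, so longer numbers are never truncated
too_wide = "%3d" % 123456789
negative_zero_fill = "%03d" % -10
```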


Print one position after the decimal                          printf("'%.1f'", 10.3456);            '10.3'

Two positions after the decimal                               printf("'%.2f'", 10.3456);            '10.35'

Eight-wide, two positions after the decimal                   printf("'%8.2f'", 10.3456);           '   10.35'

Eight-wide, four positions after the decimal                  printf("'%8.4f'", 10.3456);           ' 10.3456'

Eight-wide, two positions after the decimal, zero-filled      printf("'%08.2f'", 10.3456);          '00010.35'

Eight-wide, two positions after the decimal, left-justified   printf("'%-8.2f'", 10.3456);          '10.35   '

Printing a much larger number with that same format           printf("'%-8.2f'", 101234567.3456);   '101234567.35'
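
The floating-point rows can be reproduced the same way. This is a sketch assuming Python's % operator rather than Perl's printf; the format strings themselves are unchanged.

```python
value = 10.3456

one_place = "%.1f" % value           # one position after the decimal
two_places = "%.2f" % value          # rounded to two positions
eight_wide = "%8.2f" % value         # padded on the left to eight characters
four_places = "%8.4f" % value        # seven characters plus one pad
zero_filled = "%08.2f" % value       # padded with zeros
left_justified = "%-8.2f" % value    # padded on the right

# A number wider than eight characters is printed in full
much_larger = "%-8.2f" % 101234567.3456
```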

How to print strings with printf formatting

Here are several printf formatting examples that show how to format string output with printf format specifiers.

A simple string                   printf("'%s'", "Hello");       'Hello'

A string with a minimum length    printf("'%10s'", "Hello");     '     Hello'

Minimum length, left-justified    printf("'%-10s'", "Hello");    'Hello     '
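
And the same check for strings, again as a sketch that assumes Python's % operator in place of Perl's printf:

```python
simple = "'%s'" % "Hello"            # no width: printed as-is
minimum = "'%10s'" % "Hello"         # ten wide, right-justified
left = "'%-10s'" % "Hello"           # ten wide, left-justified
```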

Summary of special printf characters

The following character sequences have a special meaning when used as printf format

specifiers:

\a audible alert

\b backspace

\f form feed

\n newline, or linefeed

\r carriage return

\t tab

\v vertical tab

\\ backslash

As you can see from that last example, because the backslash character itself is treated specially, you have to print two backslash characters in a row to get one backslash character to appear in your output.

Here are a few examples of how to use these special characters:


Insert a tab character in a string              printf("Hello\tworld");               Hello   world

Insert a newline character in a string          printf("Hello\nworld");               Hello
                                                                                      world

Typical use of the newline character            printf("Hello world\n");              Hello world

A DOS/Windows path with backslash characters    printf("C:\\Windows\\System32\\");    C:\Windows\System32\

Algorithms: Big-Oh Notation

How time and space grow as the amount of data increases

It's useful to estimate the CPU or memory resources an algorithm requires. This "complexity analysis" attempts to characterize the relationship between the number of data elements and resource usage (time or space) with a simple formula approximation. Many programmers have had ugly surprises when they moved from small test data to large data sets. This analysis will make you aware of potential problems.

Dominant Term

Big-Oh (the "O" stands for "order of") notation is concerned with what happens for very large values of N, therefore only the largest term in a polynomial is needed. All smaller terms are dropped.

For example, the number of operations in some sorts is N^2 - N. For large values of N, the single N term is insignificant compared to N^2, therefore one of these sorts would be described as an O(N^2) algorithm.

Similarly, constant multipliers are ignored. So an O(4*N) algorithm is equivalent to O(N), which is how it should be written. Ultimately you want to pay attention to these multipliers in determining the performance, but for the first round of analysis using Big-Oh, you simply ignore constant factors.
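
To see how quickly the smaller term stops mattering, here is a small sketch in Python (an assumed language for illustration) that compares the exact count N^2 - N with the dominant term N^2 alone. The function name is made up for this example.

```python
def share_of_dominant_term(n):
    """Ratio of the exact operation count N^2 - N to the N^2 term alone."""
    return (n * n - n) / (n * n)

# The ratio approaches 1 as N grows, so dropping the N term loses very little
shares = {n: share_of_dominant_term(n) for n in (10, 1_000, 1_000_000)}
```

For N = 10 the N^2 estimate is off by 10%, for N = 1,000 by 0.1%, and for N = 1,000,000 by only 0.0001%.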

Why Size Matters

Here is a table of typical cases, showing how many "operations" would be performed for various values of N. Logarithms to base 2 (as used here) are proportional to logarithms in other bases, so this doesn't affect the big-oh formula.

           constant  logarithmic  linear                 quadratic  cubic

N          O(1)      O(log N)     O(N)       O(N log N)  O(N^2)     O(N^3)

1          1         1            1          1           1          1

2          1         1            2          2           4          8

4          1         2            4          8           16         64

8          1         3            8          24          64         512

16         1         4            16         64          256        4,096

1,024      1         10           1,024      10,240      1,048,576  1,073,741,824

1,048,576  1         20           1,048,576  20,971,520  10^12      10^18
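
The table values can be regenerated with a few lines of code. This sketch assumes Python and base-2 logarithms; growth_row is a made-up helper name. Note that for N = 1 the formula gives log N = 0, whereas the table shows 1 in those columns.

```python
import math

def growth_row(n):
    """Return (N, O(1), O(log N), O(N), O(N log N), O(N^2), O(N^3)) for one N."""
    log_n = math.log2(n)
    return (n, 1, log_n, n, n * log_n, n ** 2, n ** 3)

rows = [growth_row(n) for n in (1, 2, 4, 8, 16, 1024, 1_048_576)]
```

The last row gives 2^40 (about 1.1 x 10^12) for the quadratic column and 2^60 (about 1.2 x 10^18) for the cubic column.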

Does anyone really have that much data?

It's quite common. For example, it's hard to find a digital camera that has fewer than a million pixels (1 megapixel). These images are processed and displayed on the screen. The algorithms that do this had better not be O(N^2)! If it took one microsecond (1 millionth of a second) to process each pixel, an O(N^2) algorithm would take more than a week to finish processing a 1 megapixel image, and more than three months to process a 3 megapixel image (note the rate of increase is definitely not linear).
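
The arithmetic behind those estimates is easy to check. Here is a sketch in Python, assuming exactly N^2 steps at one microsecond each; quadratic_days is a made-up helper name.

```python
MICROSECONDS_PER_DAY = 24 * 60 * 60 * 1_000_000

def quadratic_days(pixels, microseconds_per_step=1):
    """Days needed for pixels^2 steps at the given cost per step."""
    return pixels ** 2 * microseconds_per_step / MICROSECONDS_PER_DAY

one_megapixel_days = quadratic_days(1_000_000)      # roughly 11.6 days
three_megapixel_days = quadratic_days(3_000_000)    # roughly 104 days
```

Tripling the number of pixels multiplies the running time by nine, which is why the second figure is about three and a half months rather than about a month.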

Another example is sound. CD audio samples are 16 bits, sampled 44,100 times per second for each of two channels. A typical 3 minute song consists of about 8 million data points per channel. You had better choose the right algorithm to process this data.

A dictionary I've used for text analysis has about 125,000 entries. There's a big difference between a linear O(N), binary O(log N), or hash O(1) search.
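
To put rough numbers on that difference, here is a sketch in Python that counts the steps each kind of search needs in the worst case. The 125,000 entries are generated placeholder strings, not a real dictionary, and the function names are made up for this example.

```python
# 125,000 placeholder entries; zero-padding keeps them in sorted order
words = [f"word{i:06d}" for i in range(125_000)]
word_set = set(words)      # hash-based lookup structure
target = words[-1]         # worst case for the linear search

def linear_search_steps(items, wanted):
    """Count comparisons made by a front-to-back scan."""
    steps = 0
    for item in items:
        steps += 1
        if item == wanted:
            break
    return steps

def binary_search_steps(items, wanted):
    """Count probes made by a binary search over sorted items."""
    lo, hi, steps = 0, len(items), 0
    while lo < hi:
        steps += 1
        mid = (lo + hi) // 2
        if items[mid] < wanted:
            lo = mid + 1
        elif items[mid] > wanted:
            hi = mid
        else:
            break
    return steps

linear_steps = linear_search_steps(words, target)
binary_steps = binary_search_steps(words, target)
found_by_hash = target in word_set   # one hash lookup on average
```

The linear scan needs 125,000 comparisons here, while the binary search needs at most 17 probes, since 2^17 is the first power of two above 125,000.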

Best, worst, and average cases

You should be clear about which cases big-oh notation describes. By default it usually refers to the average case, using random data. However, the characteristics for best, worst, and average cases can be very different, and the use of non-random (often more realistic) data can have a big effect on some algorithms.

Why big-oh notation isn't always useful

Complexity analysis can be very useful, but there are problems with it too.

Too hard to analyze. Many algorithms are simply too hard to analyze mathematically.

Average case unknown. There may not be sufficient information to know what the most important "average" case really is, therefore analysis is impossible.

Unknown constant. Both walking and traveling at the speed of light have a time-as-function-of-distance big-oh complexity of O(N). Although they have the same big-oh characteristics, one is rather faster than the other. Big-oh analysis only tells you how it grows with the size of the problem, not how efficient it is.

Small data sets. If there are no large amounts of data, algorithm efficiency may not be important.

Benchmarks are better

Big-oh notation can give very good ideas about performance for large amounts of data, but the only real way to know for sure is to actually try it with large data sets. There may be performance issues that are not taken into account by big-oh notation, e.g., the effect on paging as virtual memory usage grows. Although benchmarks are better, they aren't feasible during the design process, so Big-Oh complexity analysis is the choice.

Typical big-oh values for common algorithms

Searching

Here is a table of typical cases.

Type of Search Big-Oh Comments

Linear search array/ArrayList/LinkedList O(N)

Binary search sorted array/ArrayList O(log N) Requires sorted data.

Search balanced tree O(log N)

Search hash table O(1)

Other Typical Operations

Algorithm         array/ArrayList  LinkedList

access front      O(1)             O(1)

access back       O(1)             O(1)

access middle     O(1)             O(N)

insert at front   O(N)             O(1)

insert at back    O(1)             O(1)

insert in middle  O(N)             O(1)
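
Two of the rows above can be illustrated with a hand-rolled singly linked list. This is a sketch in Python, not Java's LinkedList; the class and method names are made up, and a Python list stands in for the array/ArrayList column.

```python
class Node:
    def __init__(self, value, next_node=None):
        self.value = value
        self.next = next_node

class SimpleLinkedList:
    def __init__(self):
        self.head = None

    def insert_front(self, value):
        # O(1): only the head reference changes, nothing is shifted
        self.head = Node(value, self.head)

    def get(self, index):
        # O(N): follow the links one by one, counting the hops taken
        node, hops = self.head, 0
        for _ in range(index):
            node = node.next
            hops += 1
        return node.value, hops

linked = SimpleLinkedList()
for value in range(1000, 0, -1):      # front inserts leave the list as 1..1000
    linked.insert_front(value)

middle_value, hops = linked.get(500)  # has to walk 500 links
array_middle = list(range(1, 1001))[500]   # direct index, no walking
```

Both structures return the same middle element, but the linked list has to walk half of its nodes to get there.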

Sorting arrays/ArrayLists

Some sorting algorithms show variability in their Big-Oh performance. It is therefore interesting to look at their best, worst, and average performance. For this description "average" is applied to uniformly distributed values. The distribution of real values for any given application may be important in selecting a particular algorithm.

Type of Sort    Best        Worst       Average     Comments

BubbleSort      O(N)        O(N^2)      O(N^2)      Not a good sort, except with ideal data.

Selection sort  O(N^2)      O(N^2)      O(N^2)      Perhaps best of O(N^2) sorts.

QuickSort       O(N log N)  O(N^2)      O(N log N)  Good, but its worst case is O(N^2).

HeapSort        O(N log N)  O(N log N)  O(N log N)  Typically slower than QuickSort, but worst case is much better.

Example - choosing a non-optimal algorithm

I had to sort a large array of numbers. The values were almost always already in order, and even when they weren't in order there was typically only one number that was out of order. Only rarely were the values completely disorganized. I used a bubble sort because it was O(N) for my "average" data. This was many years ago when CPUs were 1000 times slower. Today I would simply use the library sort for the amount of data I had because the difference in execution time would probably be unnoticed. However, there are always data sets which are so large that a choice of algorithms really matters.
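
Here is a sketch of why a bubble sort can do so well on nearly sorted data, assuming Python and the common variant that stops as soon as a pass makes no swaps. This is an illustration, not the original program.

```python
def bubble_sort(values):
    """Sort a copy of values; return (sorted_list, passes, comparisons)."""
    data = list(values)
    passes = comparisons = 0
    end = len(data) - 1
    swapped = True
    while swapped and end > 0:
        swapped = False
        passes += 1
        for i in range(end):
            comparisons += 1
            if data[i] > data[i + 1]:
                data[i], data[i + 1] = data[i + 1], data[i]
                swapped = True
        end -= 1          # the largest remaining value is now in place
    return data, passes, comparisons

in_order = list(range(1000))
one_swap = list(range(1000))
one_swap[500], one_swap[501] = one_swap[501], one_swap[500]
reverse_order = list(range(1000, 0, -1))

sorted_result = bubble_sort(in_order)
nearly_sorted_result = bubble_sort(one_swap)
reversed_result = bubble_sort(reverse_order)
```

For 1,000 values, data that is already in order takes one pass (999 comparisons), a single adjacent pair out of order takes two passes, and reversed data takes 999 passes and 499,500 comparisons.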

Example - O(N^3) surprise

I once wrote a text-processing program to solve some particular customer problem. After seeing how well it processed the test data, the customer produced real data, which I confidently ran the program on. The program froze -- the problem was that I had inadvertently used an O(N^3) algorithm and there was no way it was going to finish in my lifetime. Fortunately, my reputation was restored when I was able to rewrite the offending algorithm within an hour and process the real data in under a minute. Still, it was a sobering experience, illustrating dangers in ignoring complexity analysis, using unrealistic test data, and giving customer demos.

Same Big-Oh, but big differences

Although two algorithms have the same big-oh characteristics, they may differ by a factor of three (or more) in practical implementations. Remember that big-oh notation ignores constant overhead and constant factors. These can be substantial and can't be ignored in practical implementations.

Time-space tradeoffs

Sometimes it's possible to reduce execution time by using more space, or reduce space requirements by using a more time-intensive algorithm.
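
One familiar instance of trading space for time is caching results that would otherwise be recomputed. The sketch below assumes Python and uses the Fibonacci numbers purely as a stand-in workload; the call counters are there only to make the saving visible.

```python
from functools import lru_cache

calls = {"plain": 0, "cached": 0}

def fib_plain(n):
    """No extra memory, but the same subproblems are solved again and again."""
    calls["plain"] += 1
    return n if n < 2 else fib_plain(n - 1) + fib_plain(n - 2)

@lru_cache(maxsize=None)
def fib_cached(n):
    """Stores every result once computed, spending memory to save time."""
    calls["cached"] += 1
    return n if n < 2 else fib_cached(n - 1) + fib_cached(n - 2)

plain_result = fib_plain(20)
cached_result = fib_cached(20)
```

Both versions return 6765, but the plain version makes 21,891 calls while the cached version makes only 21, at the cost of keeping 21 results in memory.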
