regexp

7/17/2019 regexp

http://slidepdf.com/reader/full/regexp-568c64050dd30 1/24

Python 3: Regular ExpressionsBob Dowling [email protected]

29 October 2012

Prerequisites

This self-paced course assumes that you have a nowledge of !ython " e#uivalent to having completed oneor other of

• !ython "$ %ntroduction for &bsolute Beginners' or

• !ython "$ %ntroduction for Those with !rogramming ()perience

*ome e)perience beyond these courses is always useful but no other course is assumed+

The course also assumes that you now how to use a ,ni) te)t editor gedit' emacs' vi' ./+

Facilities for this sessionThe computers in this room have been prepared for these self-paced courses+ They are already logged in with course %Ds and have home directories specially prepared+ !lease do not log in under any other %D+

&t the end of this session the home directories will be cleared+ &ny files you leave in them will be deleted+!lease copy any files you want to eep+

The home directories contain a number of subdirectories one for each topic+

or this topic please enter directory regexp+ &ll wor will be completed there$

$ cd regexp

$ pwd

/home/y550/regexp

$

These courses are held in a room with two demonstrators+ %f you get stuc or confused' or if you ust have a#uestion raised by something you read' please as

These handouts and the prepared folders to go with them can be downloaded fromwww.ucs.cam.ac.uk/docs/course-notes/unix-courses/pythontopics

The formal !ython " documentation for the topics covered here can be found online atdocs.python.org/release/.!./li"rary/re.html

7/17/2019 regexp


Table of Contents!rere#uisites++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ 1acilities for this session++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ 1

3otation++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ "4arnings++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ "()ercises+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ "

()ercise 0++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ "%nput and output+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ "5eys on the eyboard++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ "6ontent of files+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ "4hat7s in this course++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ 84hat is a regular e)pression+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ 8

()ercise 1++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ :6ase insensitivity+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ :

()ercise 2++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ ;& more comple) e)ample++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ ;

()ercise "++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ 11*pecial codes for regular e)pressions+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++11

()ercise 8++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ 12<(scaping= characters++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ 12

()ercise >++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ 1"& real world e)ample+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ 18

()ercise :++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ 18()ercise ;++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ 1>()ercise ?++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ 1>

&lternation++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ 1>@roups+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ 1:Aerbose regular e)pression language+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++1:

()ercise 9++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ 1?()ercise 10++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ 1?

6omponents of matching te)t++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++1?()ercise 11++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ 19

3amed groups+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ 20()ercise 12++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ 21&mbiguity and greed++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ 21%nternal references+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ 22

()ercise 1"++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ 2"6rib sheet++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ 28

228

7/17/2019 regexp


Notation

Warnings

!4arnings are mared lie this+ These sections are used to highlight common

mistaes or misconceptions+

Exercises

Exercise

()ercises are mared lie this+ Cou are e)pected to complete all e)ercises+ *omeof them do depend on previous e)ercises being successfully completed+

nput an" outputaterial appearing in a terminal is presented lie this$

$ more lorem.txt #orem ipsum dolor sit amet consectetur adipisicing elit sed do eiusmodtempor incididunt ut la"ore et dolore magna ali%ua. &t enim ad minim veniam--'ore--(44)*

The material you type is presented lie this$ ls+ Bold face' typewriter font+/

The material the computer responds with is presented lie this$ <#orem ipsum=+ Typewriter font again but in

a normal face+/

#eys on the $eyboar"

5eys on the eyboard will be shown as the symbol on the eyboard surrounded by s#uare bracets' so the<& ey= will be written <E&F=+ 3ote that the return ey pressed at the end of every line of commands/ is written<E F=' the shift ey as <E F=' and the tab ey as <E F=+ !ressing more than one ey at the same time such as↲ ⇧ ↹pressing the shift ey down while pressing the & ey/ will be written as <E FGE&F=+ 3ote that pressing E&F⇧generates the lower case letter <a=+ To get the upper case letter <&= you need to press E FGE&F+⇧

Content of files

The content1 of files with a comment/ will be shown lie this$

#orem ipsum dolor sit amet consectetur adipisicing elit sed do eiusmod temporincididunt ut la"ore et dolore magna ali%ua. &t enim ad minim veniam %uisnostrud exercitation ullamco la"oris nisi ut ali%uip ex ea commodo conse%uat.+uis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu

,ugiat nulla pariatur. This is a comment about the line.

1 The e)ample te)t here is the famous <lorem ipsum= dummy te)t used in the printing and typesettingindustry+ %t dates bac to the 1>00s+ *ee http$www+lipsum+com for more information+

"28

7/17/2019 regexp


What%s in this course

1+ 4hat is a regular e)pression

2+ 6ase in/sensitivity

"+ & more comple) e)ample

8+ 6haracter choices

>+ 6ounting in regular e)pressions

:+ <(scaping= characters

;+ &lternation

?+ <Aerbose= regular e)pressions

9+ 6omponents of matching te)t

10+ 3amed groups

11+ &mbiguity and greed

12+ %nternal references

What is a regular expression&

& regular e)pression is simply some means to write down a pattern describing some te)t+ There is a formalmathematical definition but we7re not bothering with that here+ 4hat the computing world calls regulare)pressions and what the strict mathematical grammarians call regular e)pressions are slightly differentthings+/

or e)ample we might lie to say <a series of digits= or <a single lower case letter followed by some digits=+There are terms in regular e)pression language for all of these concepts+ The language of these patterns canseem intimidating at first$

<a series of digits= d

<a lower case letter followed by some digits= a-1d

<a mi)ture of characters e)cept for new line' followed by a full stop andone or more letters or numbers=

.w

4e will cover what these mean very soon+ 4e will start with a <trivial= regular e)pression' however' whichsimply matches a fi)ed bit of te)t+

4e will set ourselves the goal of writing a <filter script= that reads lines in from the <standard input= and writesits output to <standard output=+ 4hat this means in practice is that if we want to identify both an input andoutput file we run our scripts lie this$

$ python3 example01.py < input_file > output_file

%f we omit the <2 input3,ile= our input comes from the eyboard+ %f we omit the < output3,ile= our

output goes to the screen+

Our initial target will be to find the lines in the file names.txt which include the te)t <red=$

828

7/17/2019 regexp


4e will start building up our script with the non-regular e)pression elements$

import regular expression module

import sys 6 or the 7/8

define pattern

set up regular expression

,or line in sys.stdin9 6 :ead one line at a time

compare line to regular expression

i, the line matches9

sys.stdout.write(line* 6 ;rite out matching lines

3ow we have to add some !ython for the regular e)pression handling+

The regular e)pression module in !ython is called' simply' <re=' so our first line becomes

import re

!atterns are simply te)t strings+ 4e will cover the synta) of regular e)pressions very shortly but for the timebeing' we will loo for the te)t <red= in files+ The regular e)pression for this is simply the te)t itself' <red=' sothe line that defines the pattern becomes

pattern < =red=

3ow we have to turn that simple te)t into a !ython obect that can actually do something+ 4e are going to

tae the te)t <red= and we are going to convert it into a !ython regular e)pression obect+ 4e call thisprocess <compilation= and in a very real sense the plain te)t <red= is the source code for the mini-program we are about to create$

regexp < re.compile(pattern*

The regexp obect is the !ython obect that will actually go looing for matching lines+ *o now we need to

now how to do the comparison of a trial line of te)t against the pattern we are looing for+ The regexp

obect has a method that does precisely this$ search(*+ This method taes the te)t being #ueried as its

argument and loos for the pattern in it+ The method returns a <match obect= that describes if and how thete)t matches the pattern$

match < regexp.search(line*

This match obect can be converted to a boolean+ %f there are no matches' so the pattern does not appear in

>28

&damBrenda6harlieDawn(ric(rica

redredaredericreya@eorge.

redredarederic

7/17/2019 regexp


the te)t' then it converts to alse+ %f there is a match then it converts to >rue+ &s a result of this we can test

for a matching line very easily$

i, match9

The following bloc of code will only be run if there is a match+

!utting this together gives us our first regular e)pression !ython script that searches for <red=$

import re

import sys

pattern < =red=

regexp < re.compile(pattern*

,or line in sys.stdin9

match < regexp.search(line*

i, match9

sys.stdout.write(line*

4e can run this script' example0?.py' to chec it wors$

$ python3 example01.py < names.txt

redredarederic

$

Cou should mae sure that you understand this script before proceeding+

Exercise '

4rite a script' exercise0?.py' from scratch that loos for the string <"ert= in

the file of names+

Case insensiti(ity

Our e)ample script pulled red' reda' and rederic from the list of names' but omitted anfred+ Thee)ercise if completed correctly/ should have pulled out 3orbert' Hobert' and Hoberta but not Bertram+!ython regular e)pressions are case sensitive' lie everything else in !ython+ The te)t <red= is not the sameas the te)t <fred=+

4e can build ourselves a case insensitive regular e)pression mini-program if we want to+ There.compile(* function we saw earlier can tae a second' optional argument+ This argument is a set of

flags which modify how the regular e)pression wors+ One of these flags maes it case insensitive+

The options are set as a series of values that need to be added together+ 4e7re currently only interested inone of them' though' so we can give <re.7A8:BCDEB= the 7A8:BCDEB constant from the re module/ as

the second argument+

or those of you who dislie long options' the re.7 constant is a synonym for the re.7A8:BCDEB

constant+ 4e encourage the long forms as better reminders of what the options mean for when you comebac to this script having not looed at it for si) months$

regexp < re.compile(pattern re.7A8:BCDEB*

:28

7/17/2019 regexp


Exercise )

6opy your script exercise0?.py to exercise0!.py and edit

exercise0!.py to be case insensitive+ Hun it again+

* +ore co+plex exa+ple

! ,nfortunately' there7s a fair bit of reading to do at this point to get you into regulare)pressions+ Tae a deep breath and get ready for the ne)t four pages+

*earching for fi)ed te)t such as <red= or <"ert= is a valid tas' but regular e)pressions are capable of

much more+ 4e will now e)tend the example0?.py script to do some serous analysis of an input file+

The file <atoms.log= is the output of a set of programs which do something involving atoms of the elements+

%t7s a fictitious e)ample' so don7t obsess on the detail+/ %t has a collection of lines corresponding to howvarious runs of a program completed+

*ome are simple success lines such as the first line$

:&A 00000? C8'F#B>B+. 8&>F&> 7A 7#B hydrogen.dat.

Others have additional information indicating that things did not go so well$

:&A 0000G C8'F#B>B+. 8&>F&> 7A 7#B yttrium.dat. ? &A+B:#8; ;D:A7A.

:&A 00005H C8'F#B>B+. 8&>F&> 7A 7#B lanthanum.dat. D#8:7>I' +7+ A8> C8AJB:BD>B: ?00000 7>B:D>78AE.

:&A 0000K4 C8'F#B>B+. 8&>F&> 7A 7#B gadolinium.dat. 8JB:#8; B::8:.

Our ob will be to unpic the <good lines= from the rest+

4e will build the pattern re#uired for these good lines bit by bit+ %t helps to have some lines <in mind= whiledeveloping the pattern' and to consider which bits change between lines and which bits donIt+

Because we are going to be using some leading and trailing spaces in our strings we will mar them e)plicitlyin the e)amples$

:&A 00000? C8'F#B>B+. 8&>F&> 7A 7#B hydrogen.dat.␣ ␣ ␣␣ ␣ ␣ ␣

The fi)ed te)t is shown above+ 3ote that while the element part of the file name varies' its suffi) is constant+

The first part of the line that varies is the set of si) digits+ 3ote that we are lucy that it is always si) digits+

ore realistic output might have varying numbers of digits$ 2 digits for <1;= as in the slide but only one digitfor <9=+ The second varying part is the primary part of the file name+

;28

:&A␣ .dat.chlorine␣C8'F#B>B+. 8&>F&> 7A 7#B␣␣ ␣ ␣ ␣0000?H


si) digits se#uenceof lower-case letters

7/17/2019 regexp


4hat we have described to date matches all the lines+ They all start with that same sentence+ 4hatdistinguishes the good lines from the bad is that this is all there is+ The lines start and stop with e)actly this'no more and no less+

%t is good practice to match against as much of the line as possible as it lessens the chance of accidentallymatching a line you didn7t plan to+ Jater on it will become essential as we will be e)tracting elements from theline+

4e will be building the pattern at the bottom of the slide+ 4e start by saying that the line begins here+3othing may precede it+

The start of line is represented with the <caret= or <circumfle)= character' <L=+ This is the first of the specialregular e)pressions that mean something other than ust the literal te)t+

The <L= is nown as an anchor' because it forces the pattern to match only at a fi)ed point the start' in this

case/ of the line+ *uch patterns are called anchored patterns+ !atterns which don7t have any anchors in themare nown as surprise/ unanchored patterns+

3e)t comes some literal te)t+ 4e ust add this to the pattern as is+

There7s one gotcha we will return to later+ %t7s easy to get the wrong number of spaces or to mistae a tabstop for a space+ %n this e)ample it7s a single space' but we will learn how to cope with generic <white space=later+

?28


The line starts here..and ends here+


L

start of line mared with <L=


L:&A␣

literal te)t

7/17/2019 regexp


3e)t comes a run of si) digits+ There are two approaches we can tae here+ & digit can be regarded as acharacter between <0= and <9= in the character set used' but it is more elegant to have a pattern that e)plicitlysays <a digit=+

The se#uence <0-G1= has the meaning <one character between <0= and <9= in the character set+ 4e will

meet this use of s#uare bracets in detail in a few slides7 time+/ The se#uence <d= means e)actly <one digit=+

0-G1 <any single character between 0 and 9=

d <any digit=

Kowever' a line of si) instances of <d= is not particularly elegant+ 6an you imagine counting them if there

were si)ty rather than si)

Hegular e)pression pattern language has a solution to this inelegance+ ollowing any pattern with a numberin curly bracets <braces=/ means to iterate that pattern that many times+

3ote that <dMKN= means <si) digits in a row=+ %t does not mean <the same digit si) times=+ 4e will see how to

describe that later+

The synta) can be e)tended$

dMKN si) digits

dM5HN five' si) or seven digits

dM5N five or more digits

dMHN no more than seven digits including no digits/

928


L:&A dddddd␣

si) digits

inelegant


L:&A dMKN␣

si) digits

more elegant

7/17/2019 regexp


3e)t comes some more fi)ed te)t+ &s ever' donIt forget the leading' trailing and double spaces+

3e)t comes the name of the element+ 4e will ignore for these purposes the fact that we now these are thenames of elements+ or our purposes they are se#uences of lower case letters+

This time we will use the s#uare bracet notation+ This is identical to the wild cards used by ,ni) shells' ifyou are already familiar with that synta)+ The regular e)pression pattern <aeiou1= means <e)actly one

character which can be either an La7' an Le7' an Li7' an Lo7' or a Lu7=+

The slight variant <a-m1= means <e)actly one character between La7 and Lm7 inclusive in the character set=+ %n

the standard computing character sets with no internationalisation turned on/ the digits' the lower caseletters' and the upper case letters form uninterrupted runs+ *o <E0-9F= will match a single digit+ <a-1= will

match a single lower case letter+ <D-O1= will match a single upper case letter+

But we don7t want to match a single lower case letter+ 4e want to match an unnown number of them+

&ny pattern can be followed by a <= to mean <repeat the pattern one or more times=+ *o <a-1= matches

a se#uence of one or more lower case letters+ &gain' it does not mean <the same lower case letter multipletimes=+/ %t is e#uivalent to <a-1M?N=+

3e)t we have the closing literal te)t+ *trictly speaing the dot is a special character in regular e)pressions

but we will address that in a later slide+/

1028


L:&A dMKN C8'F#B>B+. 8&>F&> 7A 7#B␣ ␣ ␣␣ ␣ ␣ ␣

literal te)t


L:&A dMKN C8'F#B>B+. 8&>F&> 7A 7#B a-1␣ ␣ ␣␣ ␣ ␣ ␣

se#uence of lower case letters


L:&A dMKN C8'F#B>B+. 8&>F&> 7A 7#B a-1.dat.␣ ␣ ␣␣ ␣ ␣ ␣

literal te)t

7/17/2019 regexp


inally' and crucially' we identify this as the end of the line+ The lines with warnings and errors go beyond thispoint+

The dollar character' <M=' mars the end of the line+ This is another anchor' since it forces the pattern tomatch only at another fi)ed place the end/ of the line+

3ote that although we are using both N and M in our pattern' you don7t have to always use both of them in apattern+ Cou may use both' or only one' or neither' depending on what you are trying to match+

Exercise 3

6opy your script exercise0?.py to exercise0.py and edit

exercise0.py to use this e)tended regular e)pression for processing the file

atoms.log+

Test that it wors$

$ python3 exercise03.py < atoms.log

,pecial co"es for regular expressions

4e have seen a few special codes in our previous e)ample that mean something special in regulare)pressions+ $

D L &nchor start of line

O $ &nchor end of line

d 0-G1 &ny digit

+ &ny non-digit

a"c1 &ny one of <a=' <"<' or <c=

4e will build up this list as we proceed+ 3ote that there are alternative forms of various codes+

irst' we will loo more closely at what can go in s#uare bracets+

%f we have ust a set of simple characters e+g+ <aeiou1=/ then it matches any one character from that set+

3ote that the set of simple characters can include a space' e+g+ < aeiou1␣ = matches a space or an <a= or an

<e= or an <i= or an <o= or a <u=+

%f we put a dash between two characters then it means any one character from that range+ *o <a-1= is

e)actly e#uivalent to <a"cde,ghijklmnop%rstuvwxy1 =+ 4e can repeat this for multiple ranges' so

<D-Oa-1= is e#uivalent to <DPC+BI7QR#'A8FS:E>&J;TUOa"cde,ghijklmnop%rstuvwxy1 =+

%f we want one of the characters in the set to be a dash' <-=' there are two ways we can do this+ 4e can

precede the dash with a bacspace <-= to mean <include the character L-7 in the set of characters we want

to match=' e+g+ <D-Oa--1= means <match any alphabetic character or a dash=+ &lternatively' we can

mae the first character in the set a dash in which case it will be interpreted as a literal dash <-=/ rather than

indicating a range of characters' e+g+ <-D-a-1= also means <match any alphabetic character or a dash=+

The bacslash approach is more generally applicable and can also be used to include a close s#uare

bracet in the set of characters+

1128


L:&A dMKN C8'F#B>B+. 8&>F&> 7A 7#B a-1.dat.$␣ ␣ ␣␣ ␣ ␣ ␣

end of line mared with <$=

7/17/2019 regexp


aeiou1 &ny one of <a=' <e=' <i=' <o=' or <u=+

D-O1 &ny upper case letter

D-Oa-1 &ny upper or lower case letter

D-O11 &ny upper case letter or a s#uare bracet

D-O-1 &ny upper case letter or a dash

There is one last piece of regular e)pression synta) for s#uare bracets+ %n addition to specifying <any one ofthese characters= we also need to be able to specify <any one character that is not one of these=+ This isdone by immediately following the open s#uare bracet with a circumfle) character' <L=+ 3ote that this is a

second' different use of the circumfle)+ 4e have already seen it used to anchor a pattern to the start of theline+ %t can7t mean this inside s#uare bracets so there is no possibility for confusion+

Laeiou1 &ny one character that is not a lower case vowel

LD-O1 &ny one character that is not an upper case letter

LD-Oa-1 &ny one character that is not an upper or lower case letter

LD-O11 &ny one character that is not an upper case letter or a s#uare bracet

LD-O-1 &ny one character that is not an upper case letter or a dash

*o' for e)ample' the regular e)pression <LDa1Da1= will match any line that begins with two &7s in either

case/ and the e)pression <LDa1Da1= will match any line that contains a <not-a= followed by an <a= again'

in either case/+

Exercise -

There are three separate parts to this e)ercise+The file /usr/share/dict/words contains every word recognised by a simple

spell checer+(dit the pattern<. line and no others/ in exercise04.py to find every word

that matchesa/ <LDa1Da1=

b/ <LDa1Da1=

c/ <lines that have a <double-a= in them but which don7t start with an <a=

$ python3 exercise04.py

The script opens the correct file automatically+/

4e also saw that we can count in regular e)pressions+ These counting modifiers appear below after thee)ample pattern <a"c1=+ They can follow any regular e)pression pattern+

a"c1 &ny one of La7' Lb7 or Lc7+

a"c1 One or more La7' Lb7 or Lc7+

a"c1V ero or one La7' Lb7 or Lc7+

a"c1W ero or more La7' Lb7 or Lc7+

a"c1MKN ()actly : of La7' Lb7 or Lc7+

a"c1M5HN >' : or ; of La7' Lb7 or Lc7+

a"c1M5N > or more of La7' Lb7 or Lc7+

a"c1MHN ; or fewer of La7' Lb7 or Lc7+

4e saw the plus modifier' <=' meaning <one or more=+ There are a couple of related modifiers that are oftenuseful$ a #uery' <V=' means Pero or one of the pattern and asteris' <W=' means <Pero or more=+ The more

precise counting is done with curly bracets+

3ote that in shell e)pansion of file names <globbing=/ the asteris means <any string=+ %n regular e)pressionsit means nothing on its own and is purely a modifier+

.Escaping/ characters

3ow letIs pic up a few stray #uestions that might have arisen as we built our log file pattern+ %f s#uarebracets identify sets of letters to match' what matches a s#uare bracet Kow would % match the literalstring <a"cde1=' for e)ample

The way to mean <a real s#uare bracet= is to precede it with a bacslash+ @enerally speaing' if a characterhas a special meaning then preceding it with a bacslash turns off that specialness+ *o <= is special' but <

1228

7/17/2019 regexp


= means <ust an open s#uare bracet=+ *imilarly' if we want to match a bacslash we use <=+/

a"cd1 <&ny one of La7' L"7' Lc7' or Ld7+=

a"cd1 <a"cd1=

@enerally speaing' if a character has a special meaning then preceding it with a bacslash turns off thatspecialness+ *o <= is special and <= means <ust an open s#uare bracet= i+e+ it has no special meaning/+

6onversely' if a character is ust a plain character then preceding it with a bacslash can mae it special+ or

e)ample' <d= matches ust the lower case letter <d= i+e+ it7s not special/ but <d= matches any one digit and isspecial/+

There is one last very important

!%f a character has no special meaning with a bacslash in front of it then thebacslash has no effect but does not magically become a bacslash itself+ *o<␣= and <␣= both represent a single space+

Because regular e)pressions use bacslashes and so does !ython itself special care needs to be taen tostop the two clashing+ !ython has a special synta) to tell it not to treat bacslashes specially+ This allows theregular e)pression module direct access to them+ !ython7s special synta) is called <raw strings= and worsby preceding a literal string with a <r= for <raw=/+ !ython treats bacslashes in raw strings as ust an ordinary

character$

print(''!

print(r''!

4e will start using raw strings in all our future e)amples and e)ercises$

pattern < r=TTTT=

! ,se raw strings for all regular e)pression patterns' even if you don7t strictly needto for a specific e)ample+ %t7s ust a good habit to get into+

Exercise 0

The scripts exercise05a.py and exercise05".py differ only in that >b uses

a raw string to define the pattern$

pattern < r= in=

Both read from the same file' example05.txt+ Hun them both and determine

why the outputs differ+ ae sure you can understand why example5".py

e)tracts the output it does+

4e can add to the bacslash letters now$



d 0-G1 &ny digit

+ L0-G1 &ny non-digit

s &ny white space character space' tab' ./

E &ny non-white-space character

w 0-Ga-D-O31 &ny alphabetic' numeric or underscore character+ & <word character=+/

1"28

7/17/2019 regexp


; L0-Ga-D-O31 &ny non-word character+

!Our atoms.log file has a double space in each line+ @uessing how many

spaces there are is prone to error+ The e)pression <s= means <some white

space= and is very useful+

* real 1orl" exa+ple

Cou now now enough to write a significant regular e)pression+ To demonstrate that you now enough to write a real world regular e)pression this e)tended e)ercise uses a log file from the real world actually fromthe author7s worstation/ and is an e)ample of ust how you might use a regular e)pression in practice+

The file messages is part of the system log file varlogmessages/ from a worstation and contains anumber of lines indicating that a hacer attempted to brea in to the system+ Our ob will be to find ust thoselines+

The lines we are looing for loo lie this$

Qun 0 090!9?4 noether sshdG5?019 7nvalid user rpc ,rom K5.?G.?XG.?4GQun 0 090!9?K noether sshdG5?519 7nvalid user gopher ,rom K5.?G.?XG.?4G

Qul ? 0H94?9?? noether sshd?450K19 7nvalid user test ,rom !?0.5?.?H!.?KX

and the lines we can ignore loo lie this$

Qun !K !9!09 noether syslog-ng!!GX19 E>D>E9 dropped 0Qun !K !9H9? noether sshd?G4G19 Dccepted pu"lickey ,or dump ,rom??.???.4.K port 540XK ssh!Qun !K !9H9! noether logger9 starting incr /home

There is some obvious te)t to loo for' <7nvalid user=' but we will write a regular e)pression that matches

the entire line because this will lead on to our being able to e)tract ust parts of the line+ 4e have to matchthem to e)tract them+/ *o this e)ercise is split into three parts+

Exercise 2

The script exercise0K.py is a bare-bones filter script+ (dit the pattern to ust

loo for <7nvalid user= with no anchoring or anything clever$

$ python3 exercise0".py

Qun !5 !94H9 noether sshdG!HH19 7nvalid user account,rom !0H.54.?40.?!4

.

Qul 1 0;$81$"8 noether sshdE18>8:F$ %nvalid user test from 210+>1+1;2+1:?

Cou should get 1'":: lines of output$

M python3 exercise0".py # wc $l

1"::

1828

7/17/2019 regexp


Exercise

6opy exercise0K.py to exercise0H.py and edit the pattern to also match

everything beyond the <%nvalid user= te)t+ Hecall that the lines loo lie this$

Qun 0 090!9?K noether sshdG5?519 7nvalid user gopher,rom K5.?G.?XG.?4G

Cou need to e)tend the pattern to include$

• a positive number of non-white space characters for the varying username recall the counting modifiers and the bacslash specials/'

• the fi)ed te)t <,rom=' and

• the %! addresses four lots of digits separated by full stops' and

• the end of line+ <$=/

4e strongly recommend that you get the first e)tension user name/ right andthen move on to the ne)t and so on+%f successful you should get e)actly the same number of lines through$

M python3 exercise0%.py # wc $l

1"::

Exercise 4

6opy exercise0H.py to exercise0X.py and edit the pattern to everything up

to the <%nvalid user= te)t too+Kint$ wor bacwards from what you have already got woring+ &gain' we stronglysuggest that you do it one bit at a time+Cou need to e)tend the pattern to include woring bacwards/$

• the closing s#uare bracet' colon and space fi)ed te)t/'

• a se#uence of digits

• the fi)ed te)t <noether sshd=' noether is the name of the

worstation and the sshd is the service the hacers are trying to brea in

through/'• the time pairs of digits separated by colons/'

• the day of the month' 3ote that the second character is a digit+ The firstis a digit or a space+/

• the month+ Treat this as a capital letter followed by two lower caseletters+/

• the start of the line+ <L=/

%f successful you should get e)actly the same number of lines through$

M python3 exercise0&.py # wc $l

1"::

*lternation

%n the previous e)ercise we used an e)pression for the three letter month abbreviations of something lie<E&-FEa-PFEa-PF= or =E&-FEa-PFR2S=+ This is a little unsatisfactory in general+ &s well as matching <Qan= and <eb=it matches <an= and <Qeb=+ 4hat we want to say is <one of the twelve valid three letter month abbreviations=+

Hegular e)pressions have a synta) that says <this or this or this=+ *uch a choice is called an <alternation= andthe synta) for it is this$

(QanYe"Y'arYDprY'ayYQunYQulYDugYEepY8ctYAovY+ec*

The bracets hold the whole thing together as a single entity' called a <group=+ The vertical bars' pronounced<or= separate the alternatives+

1>28

7/17/2019 regexp


5roups

@roups do not have to contain alternations+ They can be useful for other reasons to group an e)pressiontogether+

The braceted groups can be followed by any of the counting e)pressions+ *o <(DPC1a"c1*M!N=

matches <D"Ca= and <PcDa= but not <DPca< or <aDcC=+ or that e)ample' we don7t need to use groups we

could have used <DPC1a"c1DPC1a"c1 =+

Kowever' we can use them with counting to e)pression patterns that cannot be e)pressed in other ways+ ore)ample' <(DPC1a"c1*= means <one or more repetitions of an upper case <D=' <P=' or <C= followed by a

lower case <a=' <"=' or <c=+ %t will match

D"DcPaDaP"CcD"CaPcD"

etc+

4e will meet groups again' when e come to loo at e)tracting components of matching strings+

6erbose regular expression language

Hecall the regular e)pression you wrote for the last e)ercise$

LD-O1a-1M!N ?! 10-G1 dd9dd9dd noether sshdd19␣ ␣ ␣ ␣ ␣ ␣

7nvalid user E ,rom dM?N.dM?N.dM?N.dM?N$␣ ␣ ␣ ␣

This is about as comple) a regular e)pression as you would want to encounter without some assistance fromthe language+ %f you struggled with the e)ercise' re-read this answer to see if you understand it+

%f regular e)pressions were a programming language in their own right' we would e)pect to be able to laythem out sensibly to mae them easier to read and to include comments+

!ython allows us to do both of these with a special option to the re.compile(* function which we will meet

now+

4e might also e)pect to have variables with names' and we will come to that in this course too+/

Our fundamental problem is that the enormous regular e)pression we have ust written runs the ris ofbecoming gibberish+ %t was a struggle to write and if you passed it to someone else it would be even more ofa struggle to read+ %t gets even worse if you are ased to maintain it after not looing at it for si) months+

4e need to be able to spread it out over several lines so that how it breas down into its component partsbecomes clearer+ %t would be nice if we had comments so we could annotate it too+

!ython7s regular e)pression system has all this as an option and calls it' rather unfairly' <verbose mode=+

! The regular e)pression language we have met so far is fairly portable+Aerbose regular e)pressions are a !ython-only version+

Jet7s start by splitting that regular e)pression over several lines for ease of reading+ This would be a !ythonraw long string+/

pattern < r===LD-O1a-1M!N␣

?! 10-G1␣ ␣

dd9dd9dd␣

noether sshd␣

d19␣

7nvalid user␣ ␣

E␣

,rom␣

dM?N.dM?N.dM?N.dM?N

1:28

7/17/2019 regexp


$

===

This is no longer a useful regular e)pression+ 4e will fi) it up one stage at a time+

%deally we would lie to add comments to this multi-line layout and still have the regular e)pression wor$

pattern < r===L

D-O1a-1M!N␣ 6 'onth?! 10-G1␣ ␣ 6 +aydd9dd9dd␣ 6 >imenoether sshd␣

d19␣ 6 Frocess 7+7nvalid user␣ ␣

E␣ 6 ake user 7+,rom␣

dM?N.dM?N.dM?N.dM?N 6 Iost address$===

4e will use the hash character <6=/ for comments inside our verbose regular e)pressions ust as we have in

!ython itself+ Of course' if a hash introduces a comment in verbose regular e)pressions' what matches ahash character %n verbose regular e)pressions we use an escaped hash i+e+ preceded by a bacslash/ tomatch a literal hash character+

The ne)t issue we hit concerns spaces+ 4e want to match on spaces' and our original regular e)pressionhad spaces in it+ Kowever' any spaces between the end of the line and the start of any putative commentsmustnIt contribute towards the matching component+

4e will need to treat spaces differently in the verbose version+ Bring on the bacslashes

4e will use <U = to represent a literal space i+e+␣ one we want to match/ and will ignore any other spaces inthe pattern e+g+ those between the regular e)pression and the comment/+

3ote that we don7t need to bacslash a space or a hash in the s#uare bracets+

*pace ignored generally ␣

Bacslash space recognised ␣

Bacslash not needed in E+++F ?! 1␣

Kash introduces a comment 6 'onth

Bacslash hash matches <V= 6

Bacslash not needed in E+++F ?!61

This gives us a pattern lie this$

pattern < r===LD-O1a-1M!N␣ 6 'onth?! 10-G1␣ ␣ 6 +ay

dd9dd9dd␣

6 >imenoether sshd␣

d19␣ 6 Frocess 7+7nvalid user␣ ␣

E␣ 6 ake user 7+,rom␣

dM?N.dM?N.dM?N.dM?N 6 Iost address$===

*o now all we have to do is to tell !ython to use this verbose mode instead of its usual one+ 4e do this as anoption on the re.compile(* function ust as we did when we told it to wor case insensitively+ There is a

!ython module constant re.JB:P8EB which we use in e)actly the same way as we did re.7A8:BCDEB+ %t

has a short name <re.T= too+

1;28

7/17/2019 regexp


%ncidentally' if you ever wanted case insensitivity and verbosity' you add the two together$regexp < re.compile(pattern re.7A8:BCDEBre.JB:P8EB*

*o how would we change a filter script in practice to use verbose regular e)pressions %tIs actually a veryeasy four step process+

1+ 6onvert your pattern string into a multi-line string+

2+ ae the bacslash tweas necessary+

"+ 6hange the re.compile(* call to have the re.JB:P8EB option+

8+ Test your script to see if it still wors

Exercise 7

6opy exercise0X.py to exercise0G.py and convert its regular e)pression

into the verbose pattern derived above+%f successful you should get e)actly the same output as from exercise0X.py+

M python3 exercise0&.py

Exercise '

6opy exercise0.py the atoms.log filter/ to exercise?0.py and convert

its regular e)pression into a verbose format with comments/+%f successful you should get e)actly the same output as from exercise0.py+

M python3 exercise10.py

Co+ponents of +atching text

4eIre almost finished with the regular e)pression synta) now+ 4e have most of what we need for this courseand can now get on with developing !ythonIs system for using it+ 4e will continue to use the verbose version

of the regular e)pressions as it is easier to read' which is helpful for courses as well as for real life

Jet7s loo at the verbose pattern for our atoms.log file again$

pattern < r===L 6 Etart o, line:&A␣

dMKN 6 Qo" num"er C8'F#B>B+. 8&>F&> 7A 7#B␣ ␣ ␣ ␣ ␣ ␣

a-1.dat 6 ile name

.$ 6 Bnd o, line

===

*uppose we have a line that matches' but we want to now the ob number and file name that were in thematching line+ Kow do we do this

4hat we will do is to label the two components in the pattern and then loo at !ythonIs mechanism to get attheir values+

4e start by changing the pattern to place parentheses round bracets/ around the two components ofinterest to mae them groups+

pattern < r===L 6 Etart o, line:&A␣

(dMKN* 6 Qo" num"er

C8'F#B>B+. 8&>F&> 7A 7#B␣ ␣ ␣ ␣ ␣ ␣

1?28

7/17/2019 regexp


(a-1.dat* 6 ile name

.$ 6 Bnd o, line

===

*odMKN W (dMKN!

a-1.dat W (a-1.dat!

3ow we are asing for certain parts of the pattern to be specially treated as <groups=/ we must turn ourattention to the result of the search to get at those groups+

To date all we have done with the results is to test them for truth or falsehood$ <does it match or not= 3ow we will dig more deeply+

regexp < re.compile(patternre.JB:P8EB*,or line in sys.stdin9

result < regexp.search(line*i, result9

...

The match obect' <result=' has a method that gives access to the groups within its defining pattern'

result.group(*+

*o if the line that matches is this line$

H,3 000001 6O!J(T(D+ O,T!,T %3 %J( hydrogen+dat+

then

result.group(?* W =00000?=

result.group(!* W =hydrogen.dat=

result.group(0* W whole line

3ote that group(0* gives the entire te)t that the pattern matched+ %t7s only because we have a pattern

anchored at the start and end of the line that this is the entire line+

*o now we can write out ust those elements of the matching lines that we are interested in$

i, result9

print(result.group(?* result.group(!**

4e have switched to print(* from sys.stdout.write(* ust to eep the lines a bit shorter+/

3ote that we still have to test the result variable to mae sure that it is not Aone i+e+ that the regular

e)pression matched the line at all/+ This is what the i,+++ test does because Aone tests alse+ 4e cannot

as for the group(* method on Aone because it doesnIt have one+ %f you mae this mistae you will get an

error message <Dttri"uteBrror9 =Aone>ype= o"ject has no attri"ute =group== and your

script will terminate abruptly+

Exercise ''

6opy your script exercise0G.py to exercise??.py and edit it to add bracets

to set groups for the account name and the remote address+ *o' for a line liethis$

Qun !5 !94H95K noether sshdGKH19 7nvalid user alex ,rom!0H.54.?40.?!4

you would capture <alex= and <!0H.54.?40.?!4=+

Then modify the output to print ust those two values+

1928

7/17/2019 regexp


Na+e" groups

@roups in regular e)pressions are good but theyIre not perfect+ They suffer from the sort of problem thatcreeps up on you only after youIve been doing !ython regular e)pressions for a bit+

*uppose you decide you need to capture another group within a regular e)pression+ %f it is inserted betweenthe first and second e)isting group' say' then the old group number 2 becomes the new number "' the old "the new 8 and so on+

ThereIs also a problem that <results.group(!*= doesnIt shout out what the second group actually was+

ThereIs a solution to this$ 4e will associate names with groups rather than ust numbers+

6onsider the regular e)pression we have been using for our atoms.log searching+ 4e currently have two

groups for the ob number and the file name+

*o how do we do this naming

4e insert some additional controls immediately after the open parenthesis+ %n general in !ython7s regulare)pression synta) <(V= introduces something special that may not even be a group though in this case it is/+

4e specify the name with the rather biParre synta) <VF2groupname=$

(dMKN* W (VF2jo"numdMKN*

(a-1.dat* W (VF2,ilenamea-1.dat*

*o we name the ob number si) digits/ <jo"num= and the file name <,ilename=+

*o the group is defined as usual by parentheses round bracets/+ 3e)t must come <!= to indicate that weare handling a named group+ Then comes the name of the group in angle bracets+ inally comes the patternthat actually does the matching+ 3one of the <!X+++Y= business is used for matching it is purely for naming+

To refer to a group by its name' you simply pass the name to the group(* method as a string+ Cou can still

also refer to the group by its number+ *o in the e)ample here' result.group(=jo"no=* is the same as

result.group(?*' since the first group is named <jo"no=+$

H,3 000001 6O!J(T(D+ O,T!,T %3 %J( hydrogen+dat+

gives

result.group(=jo"no=* W =00000?=

result.group(=,ilename=* W =hydrogen.dat=

*o we can change the print statement to use these names$

i, result9

print(result.group(=jo"no=* result.group(=,ilename=**

2028

( *a-1.dat,ilename 2VF

./X.Y

!X.Y definesa named group

group name regular e)pressionpattern

7/17/2019 regexp


Exercise ')

6opy your script exercise??.py to exercise?!.py and edit name the groups

<login= and <address=+ !rint them using these names+

Qun !5 !94H95K noether sshdGKH19 7nvalid user alex ,rom!0H.54.?40.?!4

login W alex

address W !0H.54.?40.?!4

*+biguity an" gree"

%nput$ /usr/share/dict/words

Heg+ e)p+$ L(a-1*(a-1*$

*cript$ am"iguous.py

Zuestion$ 4hat part of the word goes in group 1' and what part goes in group 2

@roups are all well and good' but are they necessarily well-defined 4hat happens if a line can fit intogroups in two different ways

or e)ample' consider the list of words in /usr/share/dict/words + The lower case words in this line all

match the regular e)pression <Ea-PFGEa-PFG= because it is a series of lower case letters followed by a series oflower case letters+ But if we assign groups to these parts' <(a-1*(a-1*=' which part of the word

goes into the first group and which in the second

Cou can find out by running the script am"iguous.py which is currently in your home directory$

$ python3 amiguous.py

aa haahe d

aahin gaah saa laali iaalii saal saardvar k

.

!ython7s implementation of regular e)pressions maes the first group <greedy= the first group swallows asmany letters as it can at the e)pense of the second+ There is no guarantee that other languages7

2128

L $(a-1*(a-1*

aaaaliaardvaraardvarks

hiks

7/17/2019 regexp


implementations will do the same' though+ Cou should always aim to avoid this sort of ambiguity+ Cou canchange the greed of various groups with yet more use of the #uery character but please note the ambiguitycaution above+ %f you find yourself wanting to play with the greediness youIre almost certainly doingsomething wrong at a deeper level+

nternal references

%n our ambiguity e)ample' am"iguous.py' we had the same pattern' <a-1=' repeated twice+ These then

matched against different strings+ The first matched against <aardvar= and the second against <k=' for

e)ample+ Kow can we say that we want the same string twice

3ow that we have groups in our regular e)pression we can use them for this purpose+ *o far the bracetingto create groups has been purely labelling' to select sections we can e)tract later+ 3ow we will use them within the e)pression itself+

4e can use a bacslash in front of a number for integers from 1 to 99/ to mean <that number group in thecurrent e)pression=+ The pattern <L(a-1*?$= matches any string which is followed by the string itself

again+

The file dou"le?.py has an e)ample of this woring$

$ python3 doule1.py

"eri "eri"on "oncan canchi chido dohots hotsma mamur murmuu muupa pa

paw pawpom pomtar tartes testu tu

2228

L $(a-1*(a-1*

aaaaliaardvaraardvarks

hiks

L $?(a-1*

"eri"oncanchido

"eri"oncanchido

L

$

(VF<hal,*

(VF2hal,a-1* 6reates a group

Refers to the group

7/17/2019 regexp


%f we have given names to our groups' then we use the special !ython synta) <(VF<groupname*= to mean

<the group groupname in the current e)pression=+ *o <L(VF2hal,a-1*(VF<hal,*$ = matches any

string which is the same se#uence of lower case letters repeated twice+

3ote that in this case the (V...* e)pression does not create a group instead' it refers to one that already

e)ists+ Observe that there is no pattern language in that second pair of parentheses+

The script dou"le!.py uses named groups+

Exercise '3

6opy the script dou"le!.py to exercise?.py and adapt it to loo for words

with the pattern &B&B&$

Cou should get <alfalfa= and <entente=+

2"28

entente e ent

ent

& &B&B

e+g+

ent

7/17/2019 regexp


Crib sheet



d 0-G1 &ny digit

+ L0-G1 &ny non-digit

s &ny white space character space' tab' ./

E &ny non-white-space characterw 0-Ga-D-O31 &ny alphabetic' numeric or underscore character+ & <word character=+/

; L0-Ga-D-O31 &ny non-word character+

a"c1 &ny one of La7' Lb7 or Lc7+

a"c1 One or more La7' Lb7 or Lc7+

a"c1V ero or one La7' Lb7 or Lc7+

a"c1W ero or more La7' Lb7 or Lc7+

a"c1MKN ()actly : of La7' Lb7 or Lc7+

a"c1M5HN >' : or ; of La7' Lb7 or Lc7+

a"c1M5N > or more of La7' Lb7 or Lc7+

a"c1MHN ; or fewer of La7' Lb7 or Lc7+

(.* & numbered group no name/(.Y.Y.* &lternation

(VF2name.* & named group

?' !' . Heference to numbered group

(VF<name* Heference to named group

2828

Documents

regexp