CMSC 723 / LING 645: Intro to Computational Linguistics

September 8, 2004: Monz

Regular Expressions and Finite State Automata (J&M 2)

Prof. Bonnie J. Dorr, Dr. Christof Monz

TA: Adam Lee



Page 2: Lecture2 B

Regular Expressions and Finite State Automata

REs: a language for specifying text strings

Search for documents containing a string

– Searching for “woodchuck”

• How much wood would a woodchuck chuck if a woodchuck would chuck wood?

– Searching for “woodchucks” with an optional final “s”

Finite-state automata (FSA) (singular: automaton)

Page 3: Lecture2 B

Regular Expressions

Basic regular expression patterns

Perl-based syntax (slightly different from other notations for regular expressions)

Disjunctions /[wW]oodchuck/

Page 4: Lecture2 B

Regular Expressions

Ranges [A-Z]

Negations [^Ss]

Page 5: Lecture2 B

Regular Expressions

Optional characters ?, * and +

– ? (0 or 1)

• /colou?r/ color or colour

– * (0 or more), the Kleene star

• /oo*h!/ oh! or ooh! or ooooh!

– + (1 or more), the Kleene plus

• /o+h!/ oh! or ooh! or ooooh!

Both * and + are named after Stephen Cole Kleene.

Wild card .

– /beg.n/ begin or began or begun
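The counters and the wild card can be checked with a short sketch. It uses Python's `re` module rather than Perl, on the assumption that these particular operators behave the same in both; the helper `full` and the word lists are made up for illustration.

```python
import re

def full(pattern, text):
    """Return True if the whole string matches the pattern (hypothetical helper)."""
    return re.fullmatch(pattern, text) is not None

# ? makes the preceding character optional (0 or 1 occurrences).
optional = [w for w in ["color", "colour", "colouur"] if full(r"colou?r", w)]

# * allows 0 or more of the preceding character; oo* still needs one o.
star = [w for w in ["h!", "oh!", "ooooh!"] if full(r"oo*h!", w)]

# + requires 1 or more of the preceding character.
plus = [w for w in ["h!", "oh!", "ooooh!"] if full(r"o+h!", w)]

# . matches exactly one arbitrary character.
wildcard = [w for w in ["begin", "began", "begun", "begn"] if full(r"beg.n", w)]

print(optional, star, plus, wildcard)
```

In this sketch /oo*h!/ and /o+h!/ accept the same strings, which is why the slide lists the same examples for both.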

Page 6: Lecture2 B

Regular Expressions

Anchors ^ and $

– /^[A-Z]/ “Ramallah, Palestine”

– /^[^A-Z]/ “¿verdad?” “really?”

– /\.$/ “It is over.”

– /.$/ ?

Boundaries \b and \B

– /\bon\b/ matches “on my way” but not “Monday”

– /\Bon\b/ “automaton”

Disjunction |

– /yours|mine/ “it is either yours or mine”
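A minimal sketch of the anchor, boundary, and disjunction examples, written in Python's `re` module on the assumption that `^`, `$`, `\b`, `\B` and `|` behave as in Perl for these inputs. The helper `found` is a hypothetical name.

```python
import re

def found(pattern, text):
    """Return True if the pattern matches anywhere in the text (hypothetical helper)."""
    return re.search(pattern, text) is not None

results = {
    "starts with capital": found(r"^[A-Z]", "Ramallah, Palestine"),
    "starts with non-capital": found(r"^[^A-Z]", "¿verdad?"),
    "ends with period": found(r"\.$", "It is over."),
    "word 'on'": found(r"\bon\b", "on my way"),
    "'on' inside Monday": found(r"\bon\b", "Monday"),
    "word-final 'on'": found(r"\Bon\b", "automaton"),
    "word-final 'on' in 'on my way'": found(r"\Bon\b", "on my way"),
    "yours or mine": found(r"yours|mine", "it is either yours or mine"),
}

for label, value in results.items():
    print(f"{label:32} {value}")
```

The two `False` rows are the instructive ones: `\b` refuses to match inside “Monday”, and `\B` refuses to match at the start of a word.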

Page 7: Lecture2 B

Disjunction, Grouping, Precedence

Column 1 Column 2 Column 3 …

How do we express this?

/Column [0-9]+ */

/(Column [0-9]+ +)*/

Precedence (highest to lowest)

– Parenthesis ()

– Counters * + ? {}

– Sequences and anchors the ^my end$

– Disjunction |

REs are greedy!
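The effect of grouping and of greedy matching can be seen in a small sketch. It uses Python's `re` module in place of Perl, and the sample strings are assumptions chosen for illustration.

```python
import re

# Sample line with a trailing space so the grouped pattern can consume all three headers.
line = "Column 1 Column 2 Column 3 "

# Without parentheses, * binds only to the preceding space:
# the pattern describes one column header followed by optional spaces.
single = re.match(r"Column [0-9]+ *", line)

# With parentheses, * applies to the whole group, so repeated headers match.
repeated = re.match(r"(Column [0-9]+ +)*", line)

# Greedy matching: [a-z]* takes the longest run of letters available.
greedy = re.match(r"[a-z]*", "once upon a time")

print(repr(single.group(0)))
print(repr(repeated.group(0)))
print(repr(greedy.group(0)))
```

The counter has higher precedence than sequencing, which is why the parentheses are needed to repeat the whole header.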

Page 8: Lecture2 B

Perl Commands

while ($line = <STDIN>) {
    # Print every input line that contains "the"
    if ($line =~ /the/) {
        print "MATCH: $line";
    }
}

Page 9: Lecture2 B

Writing correct expressions

Exercise: Write a Perl regular expression to match the English article “the”:

/the/

/[tT]he/

/\b[tT]he\b/

/[^a-zA-Z][tT]he[^a-zA-Z]/

/(^|[^a-zA-Z])[tT]he[^a-zA-Z]/
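Each pattern in this sequence repairs a weakness of the one before it. The sketch below runs all five against a few test strings; it uses Python's `re` module instead of Perl, and the sample strings and the helper `matches` are assumptions made for illustration.

```python
import re

patterns = [
    r"the",
    r"[tT]he",
    r"\b[tT]he\b",
    r"[^a-zA-Z][tT]he[^a-zA-Z]",
    r"(^|[^a-zA-Z])[tT]he[^a-zA-Z]",
]
samples = ["The cat sat.", "see the cat", "other", "the_25 rule"]

def matches(pattern, text):
    """Return True if the pattern matches anywhere in the text (hypothetical helper)."""
    return re.search(pattern, text) is not None

# One row of True/False per pattern, one column per sample string.
table = {p: [matches(p, s) for s in samples] for p in patterns}

for p, row in table.items():
    print(f"{p:32} {row}")
```

Reading down the rows: the first pattern misses the capitalized “The”, the second also fires inside “other”, the third treats the underscore in “the_25” as part of a word, the fourth cannot match at the start of a line, and the fifth handles all four samples.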

Page 10: Lecture2 B

A more complex example

Exercise: Write a regular expression that will match “any PC with more than 500MHz and 32 Gb of disk space for less than $1000”:

Price:

/\$[0-9]+/

/\$[0-9]+\.[0-9][0-9]/

/(^|\W)\$[0-9]+(\.[0-9][0-9])?\b/

/(^|\W)\$[0-9][0-9]?[0-9]?(\.[0-9][0-9])?\b/

Processor speed:

/\b[0-9]+ *([MG]Hz|[Mm]egahertz|[Gg]igahertz)\b/

Disk space:

/\b[0-9]+ *(Mb|[Mm]egabytes?)\b/

/\b[0-9]+(\.[0-9]+)? *(Gb|[Gg]igabytes?)\b/
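A minimal sketch of how the price, speed, and disk patterns could be tried on an advertisement line. It is written in Python's `re` module rather than Perl, with the dollar sign escaped so that it is read as a literal character and `(^|\W)` used before it, since `\b` cannot sit in front of a non-word character such as `$`. The sample strings are made up, and the patterns only locate the mentions; comparing the numbers against 500, 32 and 1000 would need extra code.

```python
import re

# Any dollar amount, with optional cents.
price = re.compile(r"(^|\W)\$[0-9]+(\.[0-9][0-9])?\b")
# At most three digits before the optional cents, i.e. below $1000.
under_1000 = re.compile(r"(^|\W)\$[0-9][0-9]?[0-9]?(\.[0-9][0-9])?\b")
speed = re.compile(r"\b[0-9]+ *([MG]Hz|[Mm]egahertz|[Gg]igahertz)\b")
disk = re.compile(r"\b[0-9]+(\.[0-9]+)? *(Gb|[Gg]igabytes?)\b")

ad = "Fast PC, 600 MHz, 40 Gb disk, only $999.99"   # hypothetical ad text
pricey = "Fast PC, 600 MHz, 40 Gb disk, only $1200"  # hypothetical ad text

speed_text = speed.search(ad).group(0)
disk_text = disk.search(ad).group(0)
cheap_enough = under_1000.search(ad) is not None
too_expensive = under_1000.search(pricey) is None and price.search(pricey) is not None

print(speed_text, "|", disk_text, "|", cheap_enough, "|", too_expensive)
```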

Page 11: Lecture2 B

Advanced operators

(Table of advanced operators not preserved; the slide carries the annotation “should be _”.)

Page 12: Lecture2 B

Substitutions and Memory

Substitutions

s/colour/color/

s/colour/color/g (substitute as many times as possible!)

s/colour/color/i (case-insensitive matching)

Memory ($1, $2, etc. refer back to matches; inside the pattern itself, write \1, \2)

s/([Cc])olour/$1olor/

/the (.*)er they were, the \1er they will be/

/the (.*)er they (.*), the \1er they \2/
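The substitution flags and the memory feature can be sketched as follows. This is Python rather than Perl, so some spellings differ: `re.sub` replaces every match by default (Perl's `/g`), `count=1` limits it to the first match, `re.IGNORECASE` plays the role of `/i`, and captured groups are written `\1`, `\2` in both pattern and replacement. The sample sentences are assumptions.

```python
import re

text = "Colour me surprised: the colour of colour"

# Like s/colour/color/ : only the first match is replaced.
first_only = re.sub(r"colour", "color", text, count=1)

# Like s/colour/color/g : every match is replaced.
all_matches = re.sub(r"colour", "color", text)

# Like s/colour/color/gi : case is ignored while matching.
ignore_case = re.sub(r"colour", "color", text, flags=re.IGNORECASE)

# Memory: capture the first letter and reuse it, preserving its case.
keep_case = re.sub(r"([Cc])olour", r"\1olor", text)

# A backreference inside the pattern forces the same string to appear twice.
pattern = r"the (.*)er they were, the \1er they will be"
same = re.search(pattern, "the bigger they were, the bigger they will be")
different = re.search(pattern, "the bigger they were, the faster they will be")

print(first_only, all_matches, ignore_case, keep_case, sep="\n")
print(same.group(1), different)
```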

Page 13: Lecture2 B

Eliza [Weizenbaum, 1966]

User: Men are all alike

ELIZA: IN WHAT WAY

User: They’re always bugging us about something or other

ELIZA: CAN YOU THINK OF A SPECIFIC EXAMPLE?

User: Well, my boyfriend made me come here

ELIZA: YOUR BOYFRIEND MADE YOU COME HERE

User: He says I’m depressed much of the time

ELIZA: I AM SORRY TO HEAR THAT YOU ARE DEPRESSED

Page 14: Lecture2 B

Eliza-style regular expressions

Step 1: replace first person with second person references

s/\bI('m| am)\b/YOU ARE/g

s/\bmy\b/YOUR/g

s/\bmine\b/YOURS/g

Step 2: use additional regular expressions to generate replies

s/.* YOU ARE (depressed|sad) .*/I AM SORRY TO HEAR YOU ARE \1/

s/.* YOU ARE (depressed|sad) .*/WHY DO YOU THINK YOU ARE \1/

s/.* all .*/IN WHAT WAY/

s/.* always .*/CAN YOU THINK OF A SPECIFIC EXAMPLE/

Step 3: use scores to rank possible transformations
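The three steps can be sketched as a tiny responder. This is an illustration in Python, not Weizenbaum's program: the function name `eliza_reply` is hypothetical, only a few of the slide's rules are included, the first-person patterns are written without a trailing space so that the replacement keeps the following word separate, and "first matching rule wins" stands in for the score-based ranking of Step 3.

```python
import re

# Step 2 rules, tried in order; rule order is a stand-in for ranking by score.
RULES = [
    (r".* YOU ARE (depressed|sad) .*", r"I AM SORRY TO HEAR YOU ARE \1"),
    (r".* all .*", "IN WHAT WAY"),
    (r".* always .*", "CAN YOU THINK OF A SPECIFIC EXAMPLE"),
]

def eliza_reply(utterance):
    """Return an ELIZA-style reply, or the transformed input if no rule fires."""
    text = utterance
    # Step 1: first person becomes second person (straight or curly apostrophe).
    text = re.sub(r"\bI(['’]m| am)\b", "YOU ARE", text)
    text = re.sub(r"\bmy\b", "YOUR", text)
    text = re.sub(r"\bmine\b", "YOURS", text)
    # Step 2: the first rule whose pattern matches produces the reply.
    for pattern, reply in RULES:
        if re.search(pattern, text):
            return re.sub(pattern, reply, text, count=1)
    return text

print(eliza_reply("Men are all alike"))
print(eliza_reply("They're always bugging us about something or other"))
print(eliza_reply("He says I'm depressed much of the time"))
```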

Page 15: Lecture2 B

Finite-state Automata

Finite-state automata (FSA)

Regular languages

Regular expressions

Page 16: Lecture2 B

Finite-state Automata (Machines)

/^baa+!$/

q0 --b--> q1 --a--> q2 --a--> q3 --!--> q4

q3 --a--> q3 (self-loop)

Each arrow is a state transition; q4 is the final state.

Accepts: baa! baaa! baaaa! baaaaa! ...

Page 17: Lecture2 B

Input tape: a b a ! b

The machine starts in q0, which has no transition on a.

REJECT

Page 18: Lecture2 B

Input tape: b a a a !

State sequence: q0 q1 q2 q3 q3 q4

ACCEPT

Page 19: Lecture2 B

Finite-state Automata

Q: a finite set of N states

– Q = {q0, q1, q2, q3, q4}

Σ: a finite input alphabet of symbols

– Σ = {a, b, !}

q0: the start state

F: the set of final states

– F = {q4}

δ(q,i): transition function

– Given state q and input symbol i, return new state q'

– δ(q3,!) → q4

Page 20: Lecture2 B

State-transition Tables

        Input
State   b   a   !
0       1   Ø   Ø
1       Ø   2   Ø
2       Ø   3   Ø
3       Ø   3   4
4:      Ø   Ø   Ø

(The colon marks state 4 as the final state.)

Page 21: Lecture2 B

D-RECOGNIZE

function D-RECOGNIZE(tape, machine) returns accept or reject
  index ← Beginning of tape
  current-state ← Initial state of machine
  loop
    if End of input has been reached then
      if current-state is an accept state then
        return accept
      else
        return reject
    elsif transition-table[current-state, tape[index]] is empty then
      return reject
    else
      current-state ← transition-table[current-state, tape[index]]
      index ← index + 1
  end
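D-RECOGNIZE translates almost line for line into a table-driven loop. The sketch below is one possible rendering in Python; the names `TRANSITIONS`, `ACCEPT_STATES` and `d_recognize` are assumptions, and a missing dictionary entry plays the role of an empty (Ø) table cell.

```python
# State-transition table for /baa+!/ ; absent entries correspond to Ø.
TRANSITIONS = {
    0: {"b": 1},
    1: {"a": 2},
    2: {"a": 3},
    3: {"a": 3, "!": 4},
    4: {},
}
ACCEPT_STATES = {4}

def d_recognize(tape, transitions=TRANSITIONS, start=0, accept=ACCEPT_STATES):
    """Return True (accept) or False (reject) for the input string."""
    state = start
    for symbol in tape:
        next_state = transitions[state].get(symbol)
        if next_state is None:
            # Empty table cell: no legal move, so reject immediately.
            return False
        state = next_state
    # End of input: accept only if we stopped in a final state.
    return state in accept

print(d_recognize("baaa!"))  # the tape from the ACCEPT slide
print(d_recognize("aba!b"))  # the tape from the REJECT slide
```

Because the machine is deterministic, the loop never has to reconsider a choice: each symbol either moves to exactly one next state or ends the run.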

Page 22: Lecture2 B

Adding a failing state

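q0 --b--> q1 --a--> q2 --a--> q3 --!--> q4

q3 --a--> q3 (self-loop)

Arcs into the fail state qF:

– q0 on a, !

– q1 on b, !

– q2 on b, !

– q3 on b

– q4 on b, a, !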

Page 23: Lecture2 B

Adding an “all else” arc

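q0 --b--> q1 --a--> q2 --a--> q3 --!--> q4

q3 --a--> q3 (self-loop)

The individual failing arcs are collapsed: a single “all else” arc leads to the fail state qF on any input symbol not otherwise listed.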

Page 24: Lecture2 B

Languages and Automata

Can use FSA as a generator as well as a recognizer

Formal language L: defined by machine M that both generates and recognizes all and only the strings of that language.

– L(M) = {baa!, baaa!, baaaa!, …}

Regular languages vs. non-regular languages
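Running the same machine as a generator can be sketched by walking its arcs and collecting every string that ends in a final state. The code below is an illustrative Python sketch; the function name `generate` and the length bound `max_length` are assumptions, the bound being needed because L(M) is infinite.

```python
from collections import deque

# Same transition table as the /baa+!/ recognizer.
TRANSITIONS = {
    0: {"b": 1},
    1: {"a": 2},
    2: {"a": 3},
    3: {"a": 3, "!": 4},
    4: {},
}

def generate(transitions, start, accept, max_length):
    """List every accepted string of at most max_length symbols, shortest first."""
    results = []
    queue = deque([(start, "")])  # breadth-first search over (state, string so far)
    while queue:
        state, prefix = queue.popleft()
        if state in accept:
            results.append(prefix)
        if len(prefix) < max_length:
            for symbol, next_state in sorted(transitions[state].items()):
                queue.append((next_state, prefix + symbol))
    return results

language_sample = generate(TRANSITIONS, start=0, accept={4}, max_length=6)
print(language_sample)
```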

Page 25: Lecture2 B

Languages and Automata

Deterministic vs. Non-deterministic FSAs

Epsilon (ε) transitions

Page 26: Lecture2 B

Using NFSAs to accept strings

Backup: add markers at choice points, then possibly revisit unexplored arcs at marked choice point.

Look-ahead: look ahead in input

Parallelism: look at alternatives in parallel

Page 27: Lecture2 B

Using NFSAs

        Input
State   b   a     !   ε
0       1   Ø     Ø   Ø
1       Ø   2     Ø   Ø
2       Ø   2,3   Ø   Ø
3       Ø   Ø     4   Ø
4:      Ø   Ø     Ø   Ø
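The backup strategy can be sketched with an agenda of unexplored (state, tape position) pairs: each choice point pushes all of its alternatives, and a dead end simply falls back to the next pair on the agenda. This Python sketch is an illustration under assumptions: the names `NFSA` and `nd_recognize` are hypothetical, and ε-transitions are left out because the ε column of this table is empty.

```python
# Non-deterministic table for /baa+!/ : state 2 on 'a' may stay in 2 or move to 3.
NFSA = {
    0: {"b": {1}},
    1: {"a": {2}},
    2: {"a": {2, 3}},
    3: {"!": {4}},
    4: {},
}

def nd_recognize(tape, transitions=NFSA, start=0, accept=frozenset({4})):
    """Return True if some path through the machine accepts the whole tape."""
    agenda = [(start, 0)]  # search states still to be explored
    while agenda:
        state, index = agenda.pop()
        if index == len(tape):
            if state in accept:
                return True
            continue  # dead end: back up to another marked choice
        for next_state in transitions[state].get(tape[index], ()):
            agenda.append((next_state, index + 1))
    return False

print(nd_recognize("baaa!"))
print(nd_recognize("ba!"))
```

Using a list as a stack gives depth-first search; popping from the front instead would explore the alternatives breadth-first, which is closer to the parallelism option.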

Page 28: Lecture2 B

Readings for next time

J&M Chapter 3