Upload
hisham-elsherif
View
343
Download
2
Tags:
Embed Size (px)
Citation preview
CMSC 723 / LING 645: Intro to Computational Linguistics
September 8, 2004: Monz
Regular Expressions and Finite State Automata (J&M 2)
Prof. Bonnie J. DorrDr. Christof Monz
TA: Adam Lee
Regular Expressions and Finite State Automata
REs: Language for specifying text stringsSearch for document containing a string
– Searching for “woodchuck”
• Finite-state automata (FSA)(singular: automaton)
• How much wood would a woodchuck chuck if a woodchuck would chuck wood?
– Searching for “woodchucks” with an optional final “s”
Regular Expressions
Basic regular expression patternsPerl-based syntax (slightly different from
other notations for regular expressions)Disjunctions /[wW]oodchuck/
Regular Expressions
Ranges [A-Z]
Negations [^Ss]
Regular Expressions
Optional characters ? ,* and +– ? (0 or 1)
• /colou?r/ color or colour
– * (0 or more)• /oo*h!/ oh! or Ooh! or Ooooh!
*+
Stephen Cole Kleene
– + (1 or more)
• /o+h!/ oh! or Ooh! or Ooooh!
Wild cards .- /beg.n/ begin or began or begun
Regular Expressions
Anchors ^ and $– /^[A-Z]/ “Ramallah, Palestine”
– /^[^A-Z]/ “¿verdad?” “really?”
– /\.$/ “It is over.”
– /.$/ ?
Boundaries \b and \B– /\bon\b/ “on my way” “Monday”
– /\Bon\b/ “automaton”
Disjunction |– /yours|mine/ “it is either yours or mine”
Disjunction, Grouping, Precedence
Column 1 Column 2 Column 3 …How do we express this?/Column [0-9]+ *//(Column [0-9]+ +)*/
Precedence– Parenthesis ()– Counters * + ? {}– Sequences and anchors the ^my end$– Disjunction |
REs are greedy!
Perl Commands
While ($line=<STDIN>){if ($line =~ /the/){
print “MATCH: $line”;}
}
Writing correct expressions
Exercise: Write a Perl regular expression to match the English article “the”:
/the//[tT]he//\b[tT]he\b//[^a-zA-Z][tT]he[^a-zA-Z]//(^|[^a-zA-Z])[tT]he[^a-zA-Z]/
A more complex example
Exercise: Write a regular expression that will match “any PC with more than 500MHz and 32 Gb of disk space for less than $1000”:/$[0-9]+//$[0-9]+\.[0-9][0-9]//\b$[0-9]+(\.[0-9][0-9])?\b//\b$[0-9][0-9]?[0-9]?(\.[0-9][0-9])?\b//\b[0-9]+ *([MG]Hz|[Mm]egahertz| [Gg]igahertz)\b//\b[0-9]+ *(Mb|[Mm]egabytes?)\b//\b[0-9](\.[0-9]+) *(Gb|[Gg]igabytes?)\b/
Advanced operators
should be _
Substitutions and Memory
Substitutionss/colour/color/s/colour/color/g
s/([Cc]olour)/$1olor/
/the (.*)er they were, the $1er they will be/
/the (.*)er they (.*), the $1er they $2/
Substitute as many times as possible!
Case insensitive matching
s/colour/color/i
Memory ($1, $2, etc. refer back to matches)
Eliza [Weizenbaum, 1966]
User: Men are all alike
ELIZA: IN WHAT WAY
User: They’re always bugging us about something or other
ELIZA: CAN YOU THINK OF A SPECIFIC EXAMPLE?
User: Well, my boyfriend made me come here
ELIZA: YOUR BOYFRIEND MADE YOU COME HERE
User: He says I’m depressed much of the time
ELIZA: I AM SORRY TO HEAR THAT YOU ARE DEPRESSED
Eliza-style regular expressions
s/.* YOU ARE (depressed|sad) .*/I AM SORRY TO HEAR YOU ARE \1/
s/.* YOU ARE (depressed|sad) .*/WHY DO YOU THINK YOU ARE \1/
s/.* all .*/IN WHAT WAY/
s/.* always .*/CAN YOU THINK OF A SPECIFIC EXAMPLE/
Step 1: replace first person with second person references
s/\bI(’m| am)\b /YOU ARE/g
s/\bmy\b /YOUR/g
S/\bmine\b /YOURS/g
Step 2: use additional regular expressions to generate replies
Step 3: use scores to rank possible transformations
Finite-state Automata
Finite-state automata (FSA)Regular languagesRegular expressions
Finite-state Automata (Machines)
/^baa+!$/
q0 q1 q2 q3 q4
b a a !
a
state transitionfinalstate
baa! baaa! baaaa! baaaaa! ...
Input Tape
a b a ! b
q0
0 1 2 3 4
b a a !a
REJECT
Input Tape
b a a a
q0 q1 q2 q3 q3 q4
!
0 1 2 3 4
b a a !a
ACCEPT
Finite-state Automata
Q: a finite set of N states – Q = {q0, q1, q2, q3, q4}
: a finite input alphabet of symbols = {a, b, !}
q0: the start stateF: the set of final states
– F = {q4}(q,i): transition function
– Given state q and input symbol i, return new state q' (q3,!) q4
State-transition Tables
Input
State b a !
0 1 Ø Ø
1 Ø 2 Ø
2 Ø 3 Ø
3 Ø 3 4
4: Ø Ø Ø
D-RECOGNIZE
function D-RECOGNIZE (tape, machine) returns accept or reject index Beginning of tape current-state Initial state of machine loop if End of input has been reached then if current-state is an accept state then return accept else return reject elsif transition-table [current-state, tape[index]] is empty then return reject else current-state transition-table [current-state, tape[index]] index index + 1end
Adding a failing state
q0 q1 q2 q3 q4
b a a !
a
qFa
!
b
! b ! b
b
a
!
Adding an “all else” arc
q0 q1 q2 q3 q4
b a a !
a
qF
== =
=
Languages and Automata
Can use FSA as a generator as well as a recognizer
Formal language L: defined by machine M that both generates and recognizes all and only the strings of that language. – L(M) = {baa!, baaa!, baaaa!, …}
Regular languages vs. non-regular languages
Languages and Automata
Deterministic vs. Non-deterministic FSAs
Epsilon () transitions
Using NFSAs to accept strings
Backup: add markers at choice points, then possibly revisit unexplored arcs at marked choice point.
Look-ahead: look ahead in inputParallelism: look at alternatives in parallel
Using NFSAs
Input
State b a ! 0 1 Ø Ø Ø
1 Ø 2 Ø Ø
2 Ø 2,3 Ø Ø
3 Ø Ø 4 Ø
4: Ø Ø Ø Ø
Readings for next time
J&M Chapter 3