Computational Language Finite State Machines and Regular Expressions


Plan

Regular expressions
    Introduction
    Operators
    Disjunction, precedence, substitution

Finite State Machines
    Link with regular expressions
    Deterministic FSA
    Non-deterministic FSA

Lab session
    Regular expression implementation in UNIX (egrep)

Regular Expressions

Basis of all web-based and word-processor-based searches.

Definition 1. An algebraic notation for describing a set of strings.

Definition 2. A set of rules that you can use to specify one or more items, such as words in a file, by using a single character string (Sarwar et al.)

Regular Expressions

A search needs two things: a regular expression and a text corpus to search.

The regular expression algebra has variants: Perl, Unix tools.

Unix tools: egrep, sed, awk.

Regular Expressions

Find occurrences of /Nokia/ in the text:

    egrep -n 'Nokia' nokia_corpus.txt

Output (line number, then the matching line):

    1:.Nokia shares slide after warning
    4:HELSINKI (Reuters) - Nokia has cut its sales growth forecast for
    7:markets sharply down.Nokia warned group sales would grow only
    13:better than expected first-quarter profits from Nokia,
    15:Finland's Nokia and rivals have been hit by debt-laden telecoms
    19:Nokia said in a statement. "The speed of this transition has been
    20:slower than was anticipated earlier this year." Nokia saw its market
    26:"The problem with Nokia is that it looks like its going ex-growth,"
    29:with a raft of new functions, was hurting. "Nokia had been perceived
    36:Nokia cast another shadow over the sector by slashing its forecast for
    41:be sold this year. "Nokia now believes that general weakness in all key
    43:Nokia said. The market was caught by surprise, especially as Nokia had
    46:said Nokia had been "a bit optimistic overall" in its forecasts. "We
    49:adjust to weaker demand, Nokia followed the path of rivals in announcing
    51:thousands of jobs in the group last year. Despite the bleak outlook, Nokia
    57:Nokia also warned second quarter sales would grow only between two and
    61:operating efficiencies, strong brand and leading product portfolio," Nokia
    62:said. Nokia said it expected pro forma earnings per share (EPS) of 0.18-0.20
    67:protecting the margins -- but Nokia has to be a top-line growth story as well,
    69:analyst Susan Anthony.But Nokia, known for its strength in forecasting the
    79:Nokia's own forecast. Nokia's January-March net sales came in worse than the

Regular Expressions

Suppress case distinctions: Nokia or nokia.

Set operator:

    egrep -n '[Nn]okia' nokia_corpus.txt

Regular Expressions

Suppress other features, for example singular share or plural shares.

Optional operator:

    egrep -n 'shares?' nokia_corpus.txt

Output:

    1:.Nokia shares slide after warning
    6:weak demand, sending its shares 12 percent lower and European
    62:said. Nokia said it expected pro forma earnings per share (EPS) of 0.18-0.20
    85:lion share of the company's sales and earnings, saw sales fall seven percent
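The set and optional operators can also be tried outside the shell. The following is a minimal sketch in Python, assuming that Python's `re` module treats these two simple patterns the same way egrep does; the sample lines and the helper name `grep_n` are made up for illustration and are not the real nokia_corpus.txt.

```python
import re

# Hypothetical sample lines standing in for nokia_corpus.txt (not the real corpus).
lines = [
    "Nokia shares slide after warning",
    "the nokia handset unit",
    "earnings per share (EPS)",
    "NOKIA in capitals",
]

def grep_n(pattern, lines):
    """Return (line_number, line) pairs for lines containing a match, like egrep -n."""
    regex = re.compile(pattern)
    return [(i, line) for i, line in enumerate(lines, start=1) if regex.search(line)]

nokia_hits = grep_n(r"[Nn]okia", lines)  # set operator: first letter is N or n
share_hits = grep_n(r"shares?", lines)   # optional operator: the final s may be absent

print(nokia_hits)
print(share_hits)
```

The all-capitals line is not selected by `[Nn]okia`, because the set only relaxes the first letter.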

Regular Expressions

Kleene operators:
/string*/ "zero or more occurrences of the previous character"
/string+/ "one or more occurrences of the previous character"

Regular Expressions

Wildcard operator:
/string./ "string followed by any single character"

Combining the wildcard and Kleene operators:
/string.*/ "string followed by zero or more characters of any kind"
/string.+/ "string followed by one or more characters of any kind"

Regular Expressions

    egrep -n 'profit.*' nokia_corpus.txt

Output:

    13:better than expected first-quarter profits from Nokia,
    52:remains the only profitable handset maker among the "big three" suppliers
    60:company's profitability outlook remains strong, driven by increasing
    81:Pre-tax profit was 1.31 billion euros.The company's struggling networks unit
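Because egrep prints the whole line, 'profit' and 'profit.*' select the same lines, so the effect of the Kleene and wildcard operators is easier to see when a pattern has to cover an entire word. The sketch below assumes Python's `re.fullmatch` as a stand-in for that whole-word test; the word list is invented for illustration.

```python
import re

# Hypothetical word list; each pattern must match the entire word (fullmatch).
words = ["profi", "profit", "profitt", "profits", "profitable"]

# /profit*/  : the star applies only to the t, so "profi" plus zero or more t
kleene_t = [w for w in words if re.fullmatch(r"profit*", w)]

# /profit.*/ : "profit" followed by zero or more characters of any kind
dot_star = [w for w in words if re.fullmatch(r"profit.*", w)]

# /profit.+/ : "profit" followed by at least one more character
dot_plus = [w for w in words if re.fullmatch(r"profit.+", w)]

print(kleene_t)
print(dot_star)
print(dot_plus)
```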

Regular Expressions

Anchors

Beginning of line operator: ^

    egrep '^said' nokia_corpus.txt

End of line operator: $

    egrep 'said$' nokia_corpus.txt
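A small sketch of the two anchors, assuming Python's `re` module and three invented sample lines. Each line is tested separately, as egrep does, so `^` and `$` refer to the start and end of that line.

```python
import re

# Hypothetical sample lines; each is tested on its own, as egrep would do.
lines = [
    "said. Nokia said it expected",
    "Nokia said in a statement.",
    "a statement, the company said",
]

# ^ anchors the match to the beginning of the line
starts_with_said = [line for line in lines if re.search(r"^said", line)]

# $ anchors the match to the end of the line, so it follows the word
ends_with_said = [line for line in lines if re.search(r"said$", line)]

print(starts_with_said)
print(ends_with_said)
```

The second line contains "said" but is selected by neither pattern, since the word is in the middle of the line.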

Regular Expressions

Disjunction:
Set operator: /[Ss]tring/ "a string which begins with either S or s"
Range: /[A-Z]tring/ "a string beginning with a capital letter"
Pipe |: /string1|string2/ "either string 1 or string 2"

Regular Expressions

Disjunction

    egrep -n 'weak|warning|drop' nokia_corpus.txt

    egrep -n 'weak.*|warn.*|drop.*' nokia_corpus.txt
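The difference between the two disjunctions shows up on a line such as one containing "warned". A minimal sketch, assuming Python's `re` module and invented sample lines:

```python
import re

# Hypothetical sample lines (not the real corpus).
lines = [
    "weak demand",
    "Nokia warned group sales",
    "after warning",
    "shares dropped",
    "strong brand",
]

# Whole-word alternatives: "warned" does not contain "warning"
exact = [line for line in lines if re.search(r"weak|warning|drop", line)]

# Stems plus .*: the stem "warn" is enough, so "warned" is selected too
stems = [line for line in lines if re.search(r"weak.*|warn.*|drop.*", line)]

print(exact)
print(stems)
```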

Regular Expressions

Negation:
/[^a-z]tring/ "any string that does not begin with a small letter"

Regular Expressions

Precedence (highest to lowest)
1. Parentheses ( )
2. Kleene and optional operators * + ?
3. Anchors and sequences
4. Disjunction operator |

(a) /supply|iers/ matches /supply/ or /iers/
(b) /suppl(y|iers)/ matches /supply/ or /suppliers/
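The two patterns can be checked directly. This sketch assumes Python's `re.fullmatch`, so that each candidate word must be covered completely by the pattern; the candidate list is invented for illustration.

```python
import re

candidates = ["supply", "suppliers", "iers", "supplyiers"]

# (a) | has the lowest precedence: the alternatives are "supply" and "iers"
flat = [w for w in candidates if re.fullmatch(r"supply|iers", w)]

# (b) parentheses limit the disjunction to the ending: "suppl" + ("y" or "iers")
grouped = [w for w in candidates if re.fullmatch(r"suppl(y|iers)", w)]

# The Kleene star binds tighter than sequence: in /ba*/ it applies to a only
star_on_a = bool(re.fullmatch(r"ba*", "baaa"))
star_on_a_rejects_baba = re.fullmatch(r"ba*", "baba") is None
star_on_group = bool(re.fullmatch(r"(ba)*", "baba"))

print(flat, grouped, star_on_a, star_on_a_rejects_baba, star_on_group)
```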

Regular Expressions

Substitution

    sed 's/word1/word2/' corpus.txt

Me: I am feeling a bit depressed today

    sed 's/I am/sorry to hear that you are/' corpus.txt

Eliza: sorry to hear that you are feeling a bit depressed today

Regular Expressions

Substitution

Me: I wish I could shake this depression

    sed 's/wish I/am sure you/' corpus.txt

Eliza: I am sure you could shake this depression
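The two substitutions can be combined into a tiny Eliza-style responder. This is a sketch only, assuming Python's `re.sub` in place of sed; the rule list and the function name `eliza_reply` are hypothetical and cover only the two examples above.

```python
import re

# Ordered (pattern, replacement) rules taken from the two sed examples above.
RULES = [
    (r"I am", "sorry to hear that you are"),
    (r"wish I", "am sure you"),
]

def eliza_reply(utterance):
    """Apply the first rule whose pattern occurs in the utterance.

    Only the first match is replaced, like sed's s/// without the g flag.
    If no rule applies, the utterance is returned unchanged.
    """
    for pattern, replacement in RULES:
        if re.search(pattern, utterance):
            return re.sub(pattern, replacement, utterance, count=1)
    return utterance

print(eliza_reply("I am feeling a bit depressed today"))
print(eliza_reply("I wish I could shake this depression"))
```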

Finite State Transition Networks

Finite State Automata (FSA)

Like a regular expression, an FSA is used to recognise a set of strings, e.g. the strings matched by

    egrep -n 'baa+!' corpus.txt

Finite State Transition Networks

A finite state automaton is represented as a directed graph:
a set of nodes representing states
a set of arcs (links between nodes) representing transitions between states
the arcs are labelled

Finite State Automata

How does it work?

An FSA is used to recognise a set of strings.
The candidate input string is represented as a segmented tape with one symbol in each cell.
The string is fed into the machine one symbol at a time.
If the symbol on the input matches the symbol on an arc leaving the current state, then
A) move to the next state
B) advance one symbol on the input string
C) keep going until the input ends, and accept if the machine is then in a final state
Otherwise: stop and reject the string.

Finite State Automata

State Transition Table

    State   Input
            b     a     !
    0       1     Ø     Ø
    1       Ø     2     Ø
    2       Ø     3     Ø
    3       Ø     3     4
    4:      Ø     Ø     Ø

The colon after state 4 marks it as the final (accept) state; Ø means there is no transition.

Finite State Automata

Algorithm for FSA (Jurafsky and Martin, p. 37)

    function D-RECOGNIZE(tape, machine) returns accept or reject
      index <- Beginning of tape
      current-state <- Initial state of machine
      loop
        if End of input has been reached then
          if current-state is an accept state then
            return accept
          else
            return reject
        elsif transition-table[current-state, tape[index]] is empty then
          return reject
        else
          current-state <- transition-table[current-state, tape[index]]
          index <- index + 1
      end
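The pseudocode translates almost line for line into a runnable program. The following is a sketch in Python, not the textbook's own code: the dictionary layout, the names `TRANSITIONS`, `INITIAL_STATE`, `ACCEPT_STATES` and `d_recognize`, and the use of True/False for accept/reject are all assumptions made for illustration. The transitions are those of the state transition table for the baa+! machine.

```python
# Transition table for the baa+! machine, keyed by (state, input symbol).
# A missing key plays the role of Ø (no transition).
TRANSITIONS = {
    (0, "b"): 1,
    (1, "a"): 2,
    (2, "a"): 3,
    (3, "a"): 3,
    (3, "!"): 4,
}
INITIAL_STATE = 0
ACCEPT_STATES = {4}

def d_recognize(tape, transitions=TRANSITIONS,
                initial=INITIAL_STATE, accepting=ACCEPT_STATES):
    """Return True (accept) or False (reject) for the input string `tape`."""
    index = 0
    current_state = initial
    while True:
        if index == len(tape):                 # end of input has been reached
            return current_state in accepting
        key = (current_state, tape[index])
        if key not in transitions:             # table entry is empty
            return False
        current_state = transitions[key]       # follow the arc
        index += 1                             # advance one symbol on the tape

for candidate in ["baa!", "baaaa!", "ba!", "baa", "baa!a"]:
    print(candidate, d_recognize(candidate))
```

A dictionary keyed by (state, symbol) keeps the lookup close to the `transition-table[current-state, tape[index]]` notation of the pseudocode.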

Finite State Automata

FSAs and recognition

FSAs and generation
At each transition, print out the label of the arc.
At the final state, stop printing.
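Generation can be sketched by walking the arcs and collecting their labels instead of reading a tape. The example below is an illustration under assumptions: the arc list is the baa+! machine written as a graph, the names `ARCS`, `FINAL_STATES` and `generate` are hypothetical, and a length limit is added because the self-loop would otherwise produce strings without end.

```python
from collections import deque

# The baa+! machine as a labelled graph: state -> list of (arc label, target state).
ARCS = {
    0: [("b", 1)],
    1: [("a", 2)],
    2: [("a", 3)],
    3: [("a", 3), ("!", 4)],
    4: [],
}
FINAL_STATES = {4}

def generate(max_length, arcs=ARCS, start=0, finals=FINAL_STATES):
    """Return every string of at most max_length symbols that the machine can generate."""
    results = []
    queue = deque([(start, "")])          # (current state, labels printed so far)
    while queue:
        state, output = queue.popleft()
        if state in finals:               # at the final state, stop printing
            results.append(output)
            continue
        if len(output) == max_length:     # length limit reached, abandon this path
            continue
        for label, target in arcs[state]:
            queue.append((target, output + label))  # print the label of the arc
    return results

print(generate(6))
```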

Finite State Automata

Deterministic FSAs
An FSA whose recognition behaviour is fully determined by the state it is in and the input symbol it is looking at.

Non-deterministic FSAs
An FSA with decision points, where more than one move is possible:
a state may have a self-loop and another outgoing arc with the same label
arcs may have ε transitions, which are followed without consuming an input symbol

Finite State Automata

Ways of handling the decision points of a non-deterministic FSA:
Backup: set a marker that can be returned to
Look-ahead: look ahead at the input
Parallelism: look at alternative paths in parallel

Finite State Automata

Non-deterministic FSA: state transition table

    State   Input
            b     a      !     ε
    0       1     Ø      Ø     Ø
    1       Ø     2      Ø     Ø
    2       Ø     2, 3   Ø     Ø
    3       Ø     Ø      4     Ø
    4:      Ø     Ø      Ø     Ø
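One way to run a non-deterministic table is to keep track of every state the machine could be in at once. The sketch below assumes that parallel strategy; it is not the agenda-based search of the textbook, and the names `ND_TRANSITIONS`, `epsilon_closure` and `nd_recognize` are hypothetical. Table entries are sets of target states, and a missing key stands for Ø.

```python
EPSILON = "ε"

# Non-deterministic table: (state, symbol) -> set of possible next states.
ND_TRANSITIONS = {
    (0, "b"): {1},
    (1, "a"): {2},
    (2, "a"): {2, 3},   # decision point: stay in state 2 or move on to state 3
    (3, "!"): {4},
}

def epsilon_closure(states, transitions):
    """Add every state reachable from `states` through ε arcs alone."""
    closure = set(states)
    stack = list(states)
    while stack:
        state = stack.pop()
        for target in transitions.get((state, EPSILON), ()):
            if target not in closure:
                closure.add(target)
                stack.append(target)
    return closure

def nd_recognize(tape, transitions=ND_TRANSITIONS, initial=0,
                 accepting=frozenset({4})):
    """Return True if at least one path through the machine accepts `tape`."""
    current = epsilon_closure({initial}, transitions)
    for symbol in tape:
        moved = set()
        for state in current:
            moved |= transitions.get((state, symbol), set())
        current = epsilon_closure(moved, transitions)
        if not current:               # every alternative path has failed
            return False
    return bool(current & accepting)

for candidate in ["baa!", "baaa!", "ba!", "baa"]:
    print(candidate, nd_recognize(candidate))
```

The table above has no ε arcs, so `epsilon_closure` only matters for machines that do use them.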

Finite State Automata

Formal language
A formal language is a set of strings, each composed of symbols from a finite symbol set, the alphabet.

Σ = {a, b, !}

L(m) = {baa!, baaa!, baaaa!, …} "formal language characterised by m"

m = model
L = formal language

A formal language models a fragment of a natural language.
