Upload
others
View
1
Download
0
Embed Size (px)
Citation preview
10/14/13
1
INTRO TO LEXING & PARSING
Parsing
! Two pieces conceptually: o Recognizing syntac6cally valid phrases. o Extrac6ng seman6c content from the syntax.
E.g., What is the subject of the sentence? E.g., What is the verb phrase? E.g., Is the syntax ambiguous? If so, which meaning do we take? o “2 * 3 + 4”
! In prac6ce, solve both problems at the same 6me.
10/14/13
2
Specifying Syntax
We use grammars to specify the syntax of a language.
exp →int | var | exp ‘+’ exp | exp ‘*’ exp |
‘let’ var ‘=‘ exp ‘in’ exp ‘end’
int → ….
var →alpha(alpha|digit)*
digit → ‘0’ | ‘1’ | ‘2’ | ‘3’ | ‘4’ | … | ‘9’
alpha → [a-‐zA-‐Z]
Naïve Matching
To see if a sentence is legal, start with the first non-‐terminal), and keep expanding non-‐terminals un6l you can match against the sentence.
N → ‘a’ | ‘(’ N ‘)’ “((a))” N → ‘(’N ‘)’
→ ‘(’‘(’N ‘)’ ‘)’ → ‘(’‘(’‘a’‘)’‘)’ = “((a))”
10/14/13
3
Alterna6vely
Start with the sentence, and replace phrases with corresponding non-‐terminals, repea6ng un6l you derive the start non-‐terminal.
N → ‘a’ | ‘(‘ N ‘)’ “((a))”
‘(’‘(‘ ‘a’‘)’‘)’ → "‘(‘ ‘(‘ N ‘)’ ‘)’ → ‘(‘ N ‘)’ →N
Highly Non-‐Determinis6c
! For real grammars, automa6ng this non-‐determinis6c search is non-‐trivial. o As we’ll see, naïve implementa6ons must do a lot of back-‐
tracking in the search. ! Ideally, given a grammar, we would like an efficient,
determinis6c algorithm to see if a string matches it. o There is a very general cubic 6me algorithm.
! Certain classes of grammars have much more efficient implementa6ons. o Essen6ally linear 6me with constant state (DFAs). o Or linear 6me with stack-‐like state (Pushdown Automata).
10/14/13
4
Tools in your Toolbox ! Manual parsing (say, recursive descent).
o Tedious, error prone, hard to maintain. o But fast & good error messages.
! Parsing combinators o Encode grammars as (higher-‐order) func6ons.
Basically, func6ons that generate recursive-‐descent parsers. Makes it easy to write & maintain grammars.
o But can do a lot of back-‐tracking, and requires a limited form of grammar (e.g., no lei-‐recursion.)
! Lex and Yacc o Domain-‐Specific-‐Languages that generate very efficient, table-‐driven
parsers for general classes of grammars.
REGULAR EXPRESSIONS &FINITE-‐STATE AUTOMATA
10/14/13
5
Regular Expressions
! Non-‐recursive grammars o ∅ (matches no string) o ε (epsilon – matches empty string)
o Literals (‘a’, ‘b’, ‘2’, ‘+’, etc.) drawn from alphabet o Concatena6on (R1 R2)
o Alterna6on (R1 | R2) (oien wrinen (R1 + R2) o Kleene star (R*)
Formally
! A regular expression denotes a set of strings: o [[∅ ]] = { } o [[ε]] = { “” }
o [[‘a’]] = { “a” } o [[R1 R2]] = { s | s = s1 ^ s2 & s1 in R1 & s2 in R2 }
o [[R1 | R2]] = { s | s in R1 or s in R2 } o [[R*]] = [[ε + RR* ]] = { s | s = “” or s = s1 ^ s2 & s1 in R & s2 in R* }
10/14/13
6
Example
! Suppose the only characters are 0 and 1. Here is a regular expression for strings containing
00 as a substring:
(0 | 1)*00(0 | 1)*
11011100101 0000
11111011110011111
Example
! Suppose our alphabet is a, @, and ., where a represents “some lener.”
! A regular expression for email addresses is
aa* (.aa*)* @ aa*.aa* (.aa*)*
10/14/13
7
Graphical Representa6on
a b c
d
ε
ε
ε
ε
b
a b (c | d)* b
start accept
ε
ε d
Non-‐Determinis6c, Finite-‐State Automaton
a b c
ε
ε
b
Formally: • an alphabet Σ • a set V of states • a distinguished start state • one or more accepting states • transition relation: δ : V * (Σ + ε) * V → bool
start accept
10/14/13
8
Transla6ng RegExps
! Epsilon:
! Literal ‘a’:
! R1 R2
! R1 | R2
ε
a
R1 R2 ε
R1
R2
ε
ε
ε
ε
Transla6ng RegExps
! R*
R1 ε
ε
ε
10/14/13
9
Conver6ng to Determinis6c
! Naively: o Give each state a unique ID (1,2,3,…) o Create super states
One super-‐state for each subset of all possible states in the original NDFA.
(e.g., {1},{2},{3},{1,2},{1,3},{2,3},{123}) o For each super-‐state (say {23}):
For each original state s and character c: o Find the set of accessible states (say {1,2}) skipping over epsilons. o Add an edge labeled by c from the super state to the corresponding super-‐state.
! In prac6ce, super-‐states are created lazily.
a b c
d
ε
ε
ε
b
ε
1 2 3
4
5
6 7
a b1 2 3,6
4,6,3
d5,6,3
c7
b
10/14/13
10
a b c
d
ε
ε
ε
b
ε
1 2 3
4
5
6 7
a b1 2 3,6
4,6,3
d5,6,3
c7
bb
b
c
d
a b c
d
ε
ε
ε
b
ε
1 2 3
4
5
6 7
a b1 2 3,6
4,6,3
d5,6,3
c7
bb
b
c
d
d
c
10/14/13
11
a b1 2 3,6
4,6,3
d5,6,3
c7
bb
b
c
d
d
c
Once We have a DFA
! Determinis6c Finite State automata are easy to simulate: o For each state and character, there is at most one transi6on we can take.
! Usually record the transi6on func6on as an array, indexed by states and characters.
! Lexer starts off with a variable s ini6alized to the start state: o Reads a character, uses transi6on table to find next state.
10/14/13
12
10/14/13
13
10/14/13
14
Global View
10/14/13
15
Implementa6on
Example
10/14/13
16
SCANNING IN PRACTICE
10/14/13
17
LEXING in FORTRAN
! FORTRAN rule: ! Whitespace is insignificant
o VAR1 is the same as VA R1
o A terrible design!
LEXING in PL/I
! PL/I keywords are not reserved
IF ELSE THEN THEN = ELSE; ELSE ELSE = THEN
10/14/13
18
SCANNING PYTHON
10/14/13
19
An Algebraic Approach
! There’s a purely algebraic way to compute whether a string is in the set denoted by a regular expression.
! It’s based on compu6ng the symbolic deriva/ve of a regular expression with respect to the characters in a string.
Some Algebraic Structure
Think of ε as one, ∅ as zero, concatena6on as mul6plica6on, and alterna6on as addi6on: ∅ | R = R | ∅ = R ∅ R = R ∅ = ∅ !ε R = R ε = R R | R = R ε* = ε R** = R* R2 | R1 = R1 | R2
10/14/13
20
Deriva6ves
! The deriva/ve of a regular expression R with respect to a character ‘a’ is:
[[deriv R a]] = { s | “a” ^ s in R }
! So it’s the residual strings once we remove the ‘a’ from the front. o e.g., deriv (abab) ‘a’ = bab
o e.g., deriv (abab | acde) = (bab | cde)
Symbolic Differen6a6on
deriv ∅ c = ∅
deriv ε c = ∅
deriv c c = ε
deriv c c’ = ∅
deriv (R1 | R2) c = (deriv R1 c) | (deriv R2 c)
deriv (R1 R2) c = (deriv R1 c) R2 | (empty R1) (deriv R2 c)
deriv (R*) c = (deriv R c) R*
10/14/13
21
Symbolic Empty
Think of ε as “true” and ∅ as false.
empty ∅ = ∅
empty ε = ε
empty c = ∅
emtpy (R1 | R2) = (empty R1) | (empty R2)
empty (R1 R2) = (empty R1) (empty R2)
empty (R*) = ε
Algebraic Recogni6on
! Given a regular expression R and a string of characters c1, c2, …, cn.
! We can test whether the string is in [[R]] by calcula6ng: o deriv (… deriv (deriv R c1) c2 …) cn
! And then test whether the resul6ng regular expression accepts the empty string.
10/14/13
22
Example:
Take R = (ab | ac)* Take the string to be [‘a’ ; ‘b’; ‘a’; ‘c’]
deriv R ‘a’ = (deriv (ab | ac) ‘a’) R
(deriv (ab) ‘a’) | (deriv (ac) ‘a’) R = (b | c) R
So deriv R ‘a’ = (b | c) R
Con6nuing
Now we must compute:
deriv ((b | c) R) ‘b’ =
(deriv (b | c) ‘b’) R | (empty (b | c)) (deriv R ‘b’)
10/14/13
23
Con6nuing
Now we must compute:
deriv ((b | c) R) ‘b’ =
(deriv (b | c) ‘b’) R | (empty (b | c)) (deriv R ‘b’)
empty (b | c) = (empty b) | (empty c) = | = !
Con6nuing
So this simplifies to:
deriv ((b | c) R) ‘b’ =
(deriv (b | c) ‘b’) R | (empty (b | c)) (deriv R ‘b’) =
(deriv (b | c) ‘b’) R
10/14/13
24
Con6nuing
(deriv (b | c) ‘b’) R =
(deriv b ‘b’ | deriv c ‘b’) R = (ε | ) R = ε R = R So star6ng with R = (ab | ac)* and calcula6ng the deriva6ve w.r.t. ‘a’ and then ‘b’, we are back where we started with R.
Building a DFA with Deriva6ves
! Given your regular expression R o associate it with an ini6al state S0.
! Calculate the deriva6ve of R w.r.t. each character in the alphabet, and generate new states for each unique resul6ng regular expression.
10/14/13
25
Example:
(ab | ac)*
Start State
Example:
(ab | ac)*
(b | c)((ab |ac)*)
deriv w.r.t. ‘a’
Start State
10/14/13
26
Example:
(ab | ac)*
(b | c)((ab |ac)*)
deriv w.r.t. ‘a’
deriv w.r.t. ‘b’
Start State
Example:
(ab | ac)*
(b | c)((ab |ac)*)
deriv w.r.t. ‘a’
deriv w.r.t. ‘b’
deriv w.r.t. ‘c’
Start State
10/14/13
27
Simplified
(ab | ac)*
(b | c)((ab |ac)*)
‘a’
‘b’, ‘c’
Then…
! Then con6nue calcula6ng the deriva6ves for each new state.
! You stop when you don’t generate a new state.
! (Important: must do algebraic simplifica6ons or this process won’t stop.)
10/14/13
28
Only One Non-‐Zero State
(ab | ac)*
(b | c)((ab |ac)*)
‘a’
‘b’, ‘c’
Calculate its Deriva6ves
(ab | ac)*
(b | c)((ab |ac)*)
‘a’
‘b’, ‘c’
(ab | ac)*
deriv w.r.t. ‘b’
10/14/13
29
Calculate its Deriva6ves
(ab | ac)*
(b | c)((ab |ac)*)
‘a’
‘b’, ‘c’
(ab | ac)*
(ab | ac)*
deriv w.r.t. ‘b’ deriv w.r.t. ‘c’
Calculate its Deriva6ves
(ab | ac)*
(b | c)((ab |ac)*)
‘a’
‘b’, ‘c’
(ab | ac)*
(ab | ac)*
deriv w.r.t. ‘b’ deriv w.r.t. ‘c’
deriv w.r.t. ‘a’
10/14/13
30
And then simplify – no new states
(ab | ac)*
(b | c)((ab |ac)*)
‘a’ ‘b’, ‘c’
‘a’ ‘b’,’c’
Accep6ng States
! Then figure out the accep6ng states by seeing if they accept the empty string.
! i.e., run empty on the associated regular expression and see if its simplified form is ε or .
10/14/13
31
Which states accept empty?
(ab | ac)*
(b | c)((ab |ac)*)
‘a’ ‘b’, ‘c’
‘a’ ‘b’,’c’
Just this one (our start state):
(ab | ac)*
(b | c)((ab |ac)*)
‘a’ ‘b’, ‘c’
‘a’ ‘b’,’c’