

Lecture Slides for MAT-73006

Theoretical computer science

PART Ib: Automata and Languages.

Context-Free languages

Henri Hansen

January 26, 2015


Context-free languages

• There are several very simple languages that are not regular, such as {0^n 1^n | n ≥ 0}

• They are "simple" to describe mathematically, but computationally the situation is different

• An important class of languages is context-free languages.

• We shall explore a way of describing these languages, called context-free grammars.


• An important area of application for these grammars is found in programming languages

Context-free grammar

• Let us start with an example of a grammar:

A → 0A1

A → B

B → #

• These three rules are substitution rules. The left-hand side of each rule contains a variable, and the right-hand side contains a string consisting of variables and terminal symbols


• Terminal symbols are symbols of the language that is being defined, i.e., Σ is the set of terminal symbols

• A grammar describes a language by generating the strings in the language. This happens by the following procedure:

1. Write down the start variable. Unless otherwise stated, it is the left-hand side of the topmost rule

2. Find a variable that has been written down, and a rule that has this variable as its left-hand side. Replace the written-down variable with the right-hand side of the rule

3. Repeat step 2 until no variables remain.

• For example, the example grammar can generate the string 000#111

• The sequence of substitutions that results in the string is called a derivation.

• A derivation can also have a graphic representation as a parse tree.

• The set of strings that can be generated by a given grammar is called the language of the grammar.
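The derivation procedure above can be sketched directly in code. This is a minimal illustrative sketch (the rule table and function names are my own), showing the example grammar A → 0A1 | B, B → # and one leftmost derivation of 000#111:

```python
# The example grammar, represented as a mapping from each variable
# to the list of right-hand sides of its rules.
RULES = {
    "A": ["0A1", "B"],
    "B": ["#"],
}

def apply_rule(string, var, rhs):
    """One derivation step: replace the leftmost occurrence of a
    variable with the right-hand side of one of its rules."""
    return string.replace(var, rhs, 1)

# Derivation: A => 0A1 => 00A11 => 000A111 => 000B111 => 000#111
s = "A"
for var, rhs in [("A", "0A1"), ("A", "0A1"), ("A", "0A1"), ("A", "B"), ("B", "#")]:
    s = apply_rule(s, var, rhs)

print(s)  # 000#111
```

Each loop iteration is one application of step 2 of the procedure; the loop stops once no variables remain (step 3).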

A more complicated example

〈SENTENCE〉 → 〈NOUN-PHRASE〉 〈VERB-PHRASE〉

〈NOUN-PHRASE〉 → 〈CMPLX-NOUN〉 | 〈CMPLX-NOUN〉 〈PREP-PHRASE〉

〈VERB-PHRASE〉 → 〈CMPLX-VERB〉 | 〈CMPLX-VERB〉 〈PREP-PHRASE〉

〈PREP-PHRASE〉 → 〈PREP〉 〈CMPLX-NOUN〉

〈CMPLX-NOUN〉 → 〈ARTICLE〉 〈NOUN〉

〈CMPLX-VERB〉 → 〈VERB〉 | 〈VERB〉 〈NOUN-PHRASE〉

〈ARTICLE〉 → a | the

〈NOUN〉 → boy | girl | flower

〈VERB〉 → likes | sees | touches

〈PREP〉 → with


Formal definition of CFG

• A context-free grammar is a 4-tuple (V, Σ, R, S), where

1. V is a finite set called variables

2. Σ is a finite set, disjoint from V, called terminals (also known as the alphabet)

3. R is a finite set of rules, a rule being a pair (v, σ) where v is a variable and σ is a string of variables and terminals; also written as v → σ

4. S ∈ V is the starting variable


• If u, v and w are strings of variables and terminals, and A → w is a rule of the grammar, then uAv yields the string uwv, written uAv ⇒ uwv.

• We say that u derives v, written u ⇒* v, if u = v or if there is some sequence u ⇒ u1 ⇒ u2 ⇒ · · · ⇒ uk ⇒ v

• The language of the grammar is the set {w ∈ Σ* | S ⇒* w}

Examples of CFGs.

• Often we write a CFG by simply giving the rules; the variables are the symbols that appear on left-hand sides and the others are terminals.

• S → aSb | SS | ε (think of a as "(" and b as ")")

E → E + T | T

T → T × F | F

F → (E) | n

where the alphabet is {n, +, ×, (, )}
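As a sketch of how such a grammar is put to use, the recognizer below checks membership in the language of the expression grammar. It is not the grammar verbatim: recursive descent cannot handle the left recursion E → E + T directly, so the equivalent iterative form E → T ('+' T)* is used (and likewise for T); 'x' stands in for × and 'n' for a number token. All function names are illustrative.

```python
def parse(tokens):
    """Return True iff tokens is in the language of the expression grammar."""
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def eat(tok):
        nonlocal pos
        if peek() != tok:
            raise SyntaxError(f"expected {tok!r} at position {pos}")
        pos += 1

    def parse_E():            # E -> T ('+' T)*
        parse_T()
        while peek() == "+":
            eat("+")
            parse_T()

    def parse_T():            # T -> F ('x' F)*
        parse_F()
        while peek() == "x":
            eat("x")
            parse_F()

    def parse_F():            # F -> '(' E ')' | 'n'
        if peek() == "(":
            eat("(")
            parse_E()
            eat(")")
        else:
            eat("n")

    parse_E()
    if pos != len(tokens):
        raise SyntaxError("trailing input")
    return True

print(parse(list("n+nxn")))    # True
print(parse(list("(n+n)xn")))  # True
```

The call structure of the parser mirrors the parse tree of the input, which is exactly the idea behind using CFGs for programming-language syntax.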

• A compiler of a programming language translates code into another form; CFGs are used, for instance, in describing programming language syntax

• The process by which the meaning of a string is found by relating it to a grammar is known as parsing.

Ambiguity

• Consider the grammar rule E → E + E | E × E | (E) | a. There are several derivations for strings such as a + a × a

• Definition: A grammar is ambiguous if there are two or more ways of deriving a string of its language

• Ambiguity makes (unique) parsing impossible, so obviously one should strive to describe languages unambiguously whenever possible

• Some languages are inherently ambiguous, i.e., all grammars that generate them are ambiguous


Pushdown automata

• Regular languages were defined as languages that are recognized by some finite automaton

• Context-free languages can similarly be recognized by a certain kind of automaton; due to the recursive nature of context-free languages, some form of memory is needed.

• Informally, pushdown automata are like nondeterministic finite automata, but instead of simply moving from one state to another, they use a stack to store information about what the automaton has done in the past, and this information affects what the automaton does next


• When a pushdown automaton is in a given state, it responds to the symbol that is read from the input, and to the symbol that is on top of the stack.

• Let us write Σε for the set Σ ∪ {ε} (and similarly for Γε)

• Formally: A pushdown automaton is a 6-tuple (Q, Σ, Γ, δ, q0, F), where

1. Q is the (finite) set of states

2. Σ is the input alphabet

3. Γ is the stack alphabet

4. δ : Q × Σε × Γε → 2^(Q × Γε) is the nondeterministic transition function

5. q0 ∈ Q is the start state

6. F ⊆ Q is the set of accept states

• A pushdown automaton (PDA) M = (Q, Σ, Γ, δ, q0, F) accepts an input a1 · · · an (where ai ∈ Σε) if and only if there is some sequence of states q0 q1 · · · qn and a sequence of strings g0, g1, · · · , gn in Γ* such that the following conditions are met:

1. g0 = ε, i.e., the automaton starts with an empty stack

2. for 0 ≤ i ≤ n − 1 we have (qi+1, x) ∈ δ(qi, ai+1, y), gi = yt and gi+1 = xt for some x, y ∈ Γε and t ∈ Γ*; i.e., the content of the stack is the same after the move, except possibly the topmost element

3. qn ∈ F

• To understand the transition function: if (qi+1, x) ∈ δ(qi, ai+1, y), then this transition can be executed if y is on top of the stack, the automaton is in state qi and the next input symbol read is ai+1. After it is executed, y is removed from the stack and x is put on top, and the automaton has moved to state qi+1
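The definition can be made concrete by a small simulator. This is a sketch under the conventions above: '' plays the role of ε, delta maps (state, input symbol or '', popped symbol or '') to a set of (next state, pushed symbol or '') pairs, and nondeterminism is explored by breadth-first search over configurations. All names here are my own.

```python
from collections import deque

def pda_accepts(delta, start, accepts, word, max_steps=100000):
    """Return True iff some nondeterministic branch consumes the whole
    input and ends in an accept state. The stack is a string with its
    top at the left; the automaton starts with an empty stack."""
    seen = set()
    queue = deque([(start, 0, "")])
    steps = 0
    while queue and steps < max_steps:
        steps += 1
        state, i, stack = queue.popleft()
        if (state, i, stack) in seen:
            continue
        seen.add((state, i, stack))
        if i == len(word) and state in accepts:
            return True
        for read in ("", word[i] if i < len(word) else None):
            if read is None:
                continue
            for pop in ("", stack[:1]):        # ignore the stack, or pop its top
                for nxt, push in delta.get((state, read, pop), ()):
                    queue.append((nxt, i + len(read), push + stack[len(pop):]))
    return False

# Example: a PDA for {0^n 1^n | n >= 0} with a bottom-of-stack marker $.
DELTA = {
    ("q0", "", ""):   {("q1", "$")},   # push the bottom marker
    ("q1", "0", ""):  {("q1", "0")},   # push a 0 for every 0 read
    ("q1", "", ""):   {("q2", "")},    # guess that the 0s have ended
    ("q2", "1", "0"): {("q2", "")},    # pop a 0 for every 1 read
    ("q2", "", "$"):  {("q3", "")},    # pop the marker and accept
}

print(pda_accepts(DELTA, "q0", {"q3"}, "0011"))  # True
print(pda_accepts(DELTA, "q0", {"q3"}, "0010"))  # False
```

The "guess" transition is exactly where nondeterminism enters: the automaton ε-moves to the popping phase without knowing whether the 0s have really ended, and the search accepts if any guess works out.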

Example

• Consider the language {a^i b^j c^k | i = j or i = k}, i.e., either the number of bs or the number of cs is the same as the number of as.

• Informally, it is relatively easy to construct a PDA that accepts the language: First read all as, pushing a counter onto the stack for each. Then, nondeterministically choose to count either the bs or the cs and match their number with the as.


[State diagram of the PDA, states q0–q6: from the start state, push a bottom marker (ε, ε → $), then push an a for every a read (a, ε → a). An ε-move then chooses one of two branches: one pops an a for every b read (b, a → ε), pops the marker (ε, $ → ε) and reads the remaining cs freely; the other reads the bs freely and then pops an a for every c read (c, a → ε), popping the marker to accept.]

Equivalence

• Pushdown automata and context-free grammars are equivalent in the same way as regular expressions and finite automata are:

• Theorem: A language is context-free if and only if there is a pushdown automaton that recognizes it

• First we prove one direction. Let A be a context-free language. By definition, it has a CFG, say G, that generates it


• The idea of the proof is as follows: We construct a nondeterministic PDA that, when reading an input, "guesses" what substitutions are needed for a given string.

1. Initially, the PDA puts the start variable on the stack

2. After this, the automaton always looks at the top symbol of the stack. If it is a variable, then it nondeterministically chooses a rule to apply, removes the variable and replaces it with the right-hand side of the rule (pushed in reverse order)

3. If the top symbol is a terminal, then it compares it to the next input symbol. If the symbols differ, this branch rejects; otherwise the top symbol is simply removed.

4. If the stack is empty when the input ends, the automaton accepts.
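The four steps above can be sketched as a direct simulation of the constructed PDA. This is an illustrative sketch (names my own): the stack starts with the start variable, a variable on top is nondeterministically replaced by a right-hand side, a terminal on top must match the next input symbol, and breadth-first search explores all branches. A configuration cap is included because the naive search need not terminate for every grammar (e.g. with rules like S → SS).

```python
from collections import deque

def cfg_accepts(rules, start, word, max_configs=200000):
    """Simulate the CFG-to-PDA construction: variables are the keys of
    rules, everything else on the stack is a terminal."""
    queue = deque([(start, 0)])           # (stack contents, input position)
    seen = set()
    while queue and len(seen) < max_configs:
        stack, i = queue.popleft()
        if (stack, i) in seen:
            continue
        seen.add((stack, i))
        if not stack:
            if i == len(word):            # step 4: empty stack at end of input
                return True
            continue
        top, rest = stack[0], stack[1:]
        if top in rules:                  # step 2: expand a variable
            for rhs in rules[top]:
                queue.append((rhs + rest, i))
        elif i < len(word) and top == word[i]:
            queue.append((rest, i + 1))   # step 3: match a terminal
    return False

RULES = {"A": ["0A1", "B"], "B": ["#"]}
print(cfg_accepts(RULES, "A", "000#111"))   # True
print(cfg_accepts(RULES, "A", "00#111"))    # False
```

Pushing the whole right-hand side as a string with its first symbol on top is what "in reverse order" means in step 2: the leftmost symbol of the rule must be processed first.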

• Please verify that the automaton accepts exactly the strings that are generated by the grammar!

• The other direction is proved by constructing a context-free grammar from the transition relation of a PDA

• Given a PDA P, three modifications are made:

1. It will contain only one accepting state, qa. This is not a problem, because nondeterminism is allowed

2. The automaton only accepts after it has emptied the stack. This is not a restriction either

3. Every transition either pushes a symbol (but does not remove one) or removes a symbol (but does not add one) to the stack. Again, this is not a restriction, because transitions can be "split" into two.

• The PDA is then used as a recipe for creating a grammar that generates exactly the language that is accepted by the PDA; let p be the first state and q be the last state (the unique accept state).

• When P is computing on a string, say x, conditions 2 and 3 require that the first operation pushes and the last operation pops a symbol of the stack. If the symbols are different, then the stack must have been empty at some point (why?)

• If the symbols are the same, we create the rule Apq → aArsb, where a is the input read at the first move and b at the last move.

• If the symbols are not the same, then there is some state r in which the stack is empty. We create a rule Apq → AprArq, and so on.

• To formalize the proof, let (Q, Σ, Γ, δ, q0, {qa}) be a PDA (after the modifications)

1. For each p, q, r, s ∈ Q, u ∈ Γ and a, b ∈ Σε, if δ(p, a, ε) contains (r, u) and δ(s, b, u) contains (q, ε), generate the rule Apq → aArsb in G

2. For each p, q, r ∈ Q put the rule Apq → AprArq in G

3. Finally, for each p ∈ Q put the rule App → ε in G
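The three rule schemas above can be transcribed mechanically. This is a sketch under the stated assumptions (a PDA already modified to have a single accept state, empty-stack acceptance, and push-or-pop-exactly-one transitions); delta uses the same encoding as before, '' stands for ε, and variables are written "A_pq". The representation of rules as (lhs, rhs-tuple) pairs is my own.

```python
from itertools import product

def pda_to_cfg(states, sigma, gamma, delta):
    """Generate the grammar rules A_pq per the three schemas; an '' entry
    in a right-hand side stands for an epsilon input."""
    rules = set()
    eps_inputs = list(sigma) + [""]
    # Schema 1: u is pushed moving p -> r on a, and the same u is
    # popped moving s -> q on b.
    for p, q, r, s in product(states, repeat=4):
        for u in gamma:
            for a, b in product(eps_inputs, repeat=2):
                if (r, u) in delta.get((p, a, ""), set()) and \
                   (q, "") in delta.get((s, b, u), set()):
                    rules.add((f"A_{p}{q}", (a, f"A_{r}{s}", b)))
    # Schema 2: the stack is empty at some intermediate state r.
    for p, q, r in product(states, repeat=3):
        rules.add((f"A_{p}{q}", (f"A_{p}{r}", f"A_{r}{q}")))
    # Schema 3: A_pp -> epsilon for every state p.
    for p in states:
        rules.add((f"A_{p}{p}", ()))
    return rules

# Example: a two-state PDA for {0^n 1^n | n >= 1} that already satisfies
# the modifications: p pushes a token per 0, q pops a token per 1.
DELTA = {
    ("p", "0", ""):  {("p", "x")},
    ("p", "1", "x"): {("q", "")},
    ("q", "1", "x"): {("q", "")},
}
G = pda_to_cfg({"p", "q"}, {"0", "1"}, {"x"}, DELTA)
print(("A_pq", ("0", "A_pp", "1")) in G)   # True
```

With A_pq as the start variable, the rules A_pq → 0 A_pp 1 and A_pq → 0 A_pq 1 together with A_pp → ε generate exactly 0^n 1^n for n ≥ 1, matching the PDA.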

• Lemma: If Apq generates x, then P has an execution from p (with empty stack) to q (with empty stack) reading x.

• This can be proved by induction:

1. If the derivation of x happens in one step, then the right-hand side contains no variables, only terminals. The only such rule that is generated by this construction is App → ε; hence, x must be the empty string

2. Assume the claim holds for all derivations with at most k steps. If Apq ⇒* x in k + 1 steps, the first step is either Apq → aArsb or Apq → AprArq. Both cases result in sub-derivations of at most k steps.

• Lemma: If P has an execution reading x from p to q (with an empty stack at both ends), then Apq generates x.

• This again is done by induction:

1. If the computation contains 0 steps, the automaton cannot read any symbols and x is the empty string, and the automaton stays in state p. App → ε generates x.

2. The inductive step is as before.

Non-context-free languages

• There are languages that are neither regular nor context-free.

• There is a lemma, similar to the pumping lemma, for context-free languages:

• If A is a context-free language, then there is a number p such that, if s ∈ A with |s| ≥ p, then s can be divided into 5 parts s = wvxyz such that

1. wv^i xy^i z ∈ A for every i ≥ 0

2. |vy| > 0 and

3. |vxy| ≤ p

• Proof: Let A be a CFL. Then it has a grammar G that generates it. Let s be a "very long" string of the language.

• Because s is "very long" (longer than p), its derivation will use (at least) one of the variable symbols more than once on (at least) one branch of the derivation tree (please compare to the pumping lemma!). Let this variable be called R.

• Let x be the string that is derived from the last occurrence of R, and let the occurrence before the last derive vxy.

• Then, we can replace the last occurrence of R with exactly the same subtree as the one in the second-to-last occurrence

• Therefore, instead of vxy, we derive vvxyy.

• This can be done arbitrarily many times over.

Examples of non-context-free languages

• The language {a^n b^n c^n | n ≥ 0} is not context-free.

• The language {a^i b^j c^k | 0 ≤ i ≤ j ≤ k} is not context-free

• The language {ww | w ∈ {0,1}*} is not context-free
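The pumping-lemma argument for the first of these languages can be checked mechanically for a sample value of the pumping length. This sketch (names my own, p = 4 chosen only for illustration) enumerates every legal split s = w v x y z of s = a^p b^p c^p with |vxy| ≤ p and |vy| > 0, and confirms that pumping always leaves the language:

```python
def in_language(s):
    """Membership in {a^n b^n c^n | n >= 0}."""
    n = len(s) // 3
    return s == "a" * n + "b" * n + "c" * n

def some_split_pumps(s, p):
    """True iff some split w v x y z with |vxy| <= p and |vy| > 0
    keeps both the pumped-down and a pumped-up string in the language."""
    n = len(s)
    for i in range(n + 1):                         # w = s[:i]
        for j in range(i, n + 1):                  # v = s[i:j]
            for k in range(j, n + 1):              # x = s[j:k]
                for l in range(k, min(i + p, n) + 1):   # y = s[k:l], |vxy| <= p
                    if (j - i) + (l - k) == 0:          # |vy| must be positive
                        continue
                    w, v, x, y, z = s[:i], s[i:j], s[j:k], s[k:l], s[l:]
                    if all(in_language(w + v * m + x + y * m + z)
                           for m in (0, 2)):
                        return True
    return False

p = 4
s = "a" * p + "b" * p + "c" * p
print(some_split_pumps(s, p))   # False: no split survives pumping
```

Since |vxy| ≤ p, the window vxy can touch at most two of the three letter blocks, so pumping always unbalances the counts; the exhaustive check mirrors exactly that argument.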


Deterministic CFLs.

• Deterministic and nondeterministic finite automata are equivalent, but the same does not hold for pushdown automata

• To formalize the theory, let us begin with a definition of a deterministic PDA, or DPDA.

• A deterministic pushdown automaton is a 6-tuple (Q, Σ, Γ, δ, q0, F) such that

1. Q is a finite set of states

2. Σ is the (input) alphabet

3. Γ is the stack alphabet

4. δ : Q × Σε × Γε → (Q × Γε) ∪ {∅} is the transition function

5. q0 ∈ Q is the start state

6. F ⊆ Q is the set of accept states

• The transition function is furthermore required to be nonempty for exactly one of the values

δ(q, a, x), δ(q, a, ε), δ(q, ε, x), δ(q, ε, ε)

for every q ∈ Q, a ∈ Σ, and x ∈ Γ.

• In other words, the automaton either reads an input symbol and moves (the first two) or moves without reading, and when moving, it behaves in a unique manner.

• A language accepted by a DPDA is called a deterministic context-free language.

Examples

• The language {0^n 1^n | n ≥ 0} is deterministic: a DPDA reads input 0 and pushes a counter token each time until the first 1, after which it removes a counter every time it reads a 1.

• The language {a^i b^j c^k | i = j ∨ i = k} is not deterministic.

• The language of palindromes is not deterministic
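The first example above, {0^n 1^n}, can be written directly as a deterministic one-pass recognizer. This is an illustrative sketch of the informal description (function name my own): push a token per 0, pop a token per 1, with no guessing anywhere.

```python
def dpda_0n1n(word):
    """Deterministic single-pass recognizer for {0^n 1^n | n >= 0}."""
    stack = []
    seen_one = False
    for ch in word:
        if ch == "0":
            if seen_one:
                return False      # a 0 after a 1: wrong shape
            stack.append("x")     # push one counter token per 0
        elif ch == "1":
            seen_one = True
            if not stack:
                return False      # more 1s than 0s
            stack.pop()           # match this 1 against one token
        else:
            return False          # symbol outside the alphabet
    return not stack              # accept iff every token was matched

print(dpda_0n1n("000111"))  # True
print(dpda_0n1n("00111"))   # False
```

Contrast this with the nondeterministic PDA for the same language: here every move is determined by the current symbol and the stack, which is exactly the DPDA condition.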

• Proving determinism is relatively easy: Simply give the deterministic PDA


• Proving nondeterminism is much harder, and for that we need some more theory

Properties of deterministic CFLs

• Lemma: Every deterministic PDA has an equivalent automaton that always reads the entire input string

– There are two ways in which a DPDA might fail to read the whole input: hanging, where the automaton is forced to pop an empty stack, and looping, where the automaton makes an endless loop of ε-reads.

– Hanging is prevented by putting a special symbol into the stack before the automaton starts; popping this from the stack before the input ends results in reading the rest of the input and rejecting.


– Looping is solved by identifying loops structurally: an ε-loop is then replaced by reading the entire input and rejecting

– The exception is situations where the whole input has been read: if accept states are visited in such situations, the automaton should accept.

• Theorem: The class of deterministic CFLs is closed under complementation

– Swapping accept and non-accept states works for DFAs.

– DPDAs need to solve an additional problem: if the automaton enters both accepting and non-accepting states at the end of an input, it accepts even after complementation. This is solved by requiring that only states which read input are allowed to accept.

– Swapping accept/non-accept states in such a DPDA complements the language accepted.

• This yields at least one test for nondeterminism: If the complement of a given CFL is not context-free, then the language is not deterministic.

• Sometimes it is easier to look at a modified language. Let A be a language, and let ⊥ be a symbol not in the alphabet. We call A⊥ = {w⊥ | w ∈ A} the end-marked language.

• Theorem: A is a deterministic CFL if and only if A⊥ is a deterministic CFL.

– proof of "only if": Accept states of a PDPA are replacedby a transition reading ⊥ and accepting.

– proof of "if": Let P⊥ accept A⊥. Construct P as follows:If P⊥ would accept after reading⊥ without looking at thestack, simply accept immediately. For other situations,the stack contains "two stacks" as a memory. When ⊥would be read (and possibly accept, depending on thestack) the behaviour of P⊥ is simulated and acceptedaccordingly, but if P⊥ would reject, then the stack is "re-verted".

Deterministic CFGs

• Deterministic PDAs have a counterpart in grammars, called deterministic context-free grammars.

• Deterministic CFGs and deterministic languages have some attractive properties and restrictions on how strings can be derived.

• A reduce step is a substitution in reverse; for example, if R → xyz, then xyz is reduced into R, and xyz is the reducing string. The reverse derivation of a string is called a reduction


• When a rule T → h is used backwards on a string xhy to produce xTy, we write xhy ↪→ xTy

• A reduction from u is a sequence u = u1 ↪→ u2 ↪→ · · · ↪→ uk = S, with S as the start symbol.

• The reduction is a leftmost reduction if each reducing string is reduced only after all other reducing strings that appear to its left.

• If the rule T → h is used in a leftmost reduction to produce ui ↪→ ui+1, then h (with this rule) is called the handle of ui.

• A string that appears in a leftmost reduction (for instance, ui) is called a valid string.

• If v = xhy is a valid string and h is its handle, we say that h is a forced handle if h is the unique handle of every valid string of the form xhz, where z ∈ Σ*.

• A context-free grammar is deterministic iff every valid string has a forced handle

• In other words, in deterministic grammars, reduction depends only on the leftmost part of the string.

• This does not immediately give us a way of detecting determinism, but there is one test that we can derive from it.

The DK-test

• For any CFG G we can construct a deterministic finite automaton DK that identifies handles. Specifically, DK accepts z if

1. z is a prefix of some valid string v = zy and

2. z ends with a handle of v

• We first define a nondeterministic automaton, K

1. Let J be an NFA that accepts any string that ends with the right-hand side of some grammar rule


2. In any accepting run of J, it "follows" the right-hand side of a rule. Let us denote this so-called "rule-state" by B → u′v, when the automaton has read u and v has not yet been read. Then the rule-state B → uv′ is accepting.

3. K works like J but with slight modifications.

4. For every rule-state B → u′Cv there is an ε-transition to a rule-state with C as the left-hand side, that has not read anything yet.

• Lemma: K may enter state T → u′v on reading z if and only if z = xu and xuvy is a valid string with handle uv and reducing rule T → uv, for some y ∈ Σ*.

• The proof should be obvious from the construction

• Corollary: K may enter accept state T → h′ on input z if and only if z = xh and h is a handle of some valid string xhy with reducing rule T → h.

• This gives us the DK-test: Make K deterministic and check if every accept state contains

1. Exactly one completed rule-state, and

2. no rule-state in which a terminal symbol immediately follows the marker, i.e., no rule-state of the form B → u′av, for some a ∈ Σ

• Theorem: G passes the DK-test iff G is deterministic

• If G is nondeterministic, there is some string with a handle that is not forced. If DK is run on a string that is a handle but not a forced handle, then DK must enter an accept state at the end of the handle. Because the handle is not forced, it is not unique, so the accept state contains another accepting rule-state, or some continuation of the current string leads to an accept state, and the test fails.

• If the DK-test fails, then there is a valid string with two handles: either the handle is complete or there is a continuation of the valid string with a different handle.

Practical applications of the theory

• Deterministic CFLs are very important in practice, because parsing of deterministic CFGs is efficient. That is why the syntax of most programming languages is given as deterministic CFGs.

• The requirement of forced handles is, however, sometimes too restrictive, because it restricts the use of intuition in designing grammars: it is not always easy to make sure all handles are forced.

• There is a slightly broader class of grammars, however, that is both practical and intuitive.


• The so-called LR(k) grammars use lookahead. The idea is that you are allowed to have nondeterminism, as long as you can resolve it by looking ahead no more than k symbols of the input before choosing the handle.

• Formally: if h is the handle of v = xhy, then we say that h is forced by a lookahead of k, if h is the unique handle of every string xhz, where y and z agree on the first k symbols.

• LR(0) languages are deterministic

• LR(k) grammars are grammars for which the handle of every valid string is forced by a lookahead of k.

Recommended