Transformation schemes for context-free grammars: structural, algorithmic, linguistic applications. Eli Shamir, Hebrew University of Jerusalem, Israel. ISCOL, Haifa University, September 2014.
Transformation schemes for context-free grammars
Structural, algorithmic, linguistic applications
Eli Shamir
Hebrew University of Jerusalem, Israel
ISCOL, Haifa University, September 2014
Overview
• CFG: devices producing strings and their derivation trees (with weights)
• Top-down schemes transforming the grammars
• Driven by a rotation-operations tree (BOT)
• Preserving derivation trees and semi-ring weights
• Enhancing: property tests, parsing and optimal-tree algorithms: time down to O(n^2), space to O(n)
• Decomposition of bounded-ambiguity grammars (Sam Eilenberg's question [SE])
• Non-expansive [NE] (quasi-rational) grammars
• Implications for NLP, sequence alignment, …
Schemes - simple to subtle
• Chomsky's normal form (CNF)
• Elimination of redundant symbols, ε-rules
• Greibach's normal form (GNF) (subtle): all rules are A → Tx, T terminal (lexicalization)
GNF destroys derivation trees, but has many applications (structural…)
Schemes for sub-classes of CFG (in parsing technology): deterministic, LR(k)…
Context Free Basics 1
Such a grammar G = (V, T, P, S = root) is a well-known model to derive/generate a set of terminal strings in T*. G defines a derivation relation between strings over V ∪ T:
One step x → y: y is obtained from x by rewriting a single occurrence of some A by B1..Bk, when A → B1..Bk is a production rule in P.
Several steps x ⇒* y if x → x1 → … → y.
L_A(G) = {w ∈ T* | A ⇒* w}, L(G) = L_S(G), the language generated by G.
A derivation is best described by a labeled tree in which the k sons of a node labeled A are labeled B1, .., Bk.
Ambiguity-deg(A ⇒* w) = number of distinct trees for (A ⇒* w); deg(G_A) = max deg of (A ⇒* w).
A ⇒* -B- (A derives a string containing B) defines a partial order on V ∪ T, denoted A > B. It induces a complete order on any branch of a derivation tree.
B in G is pumping if B > B' > B. Then B' is also pumping; both belong to the pumping equivalence class [B].
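The ambiguity degree defined above can be computed for a given string by running a tabular parser in the counting semiring. The following is a minimal sketch, assuming the grammar is already in Chomsky normal form; the function name `count_trees` and the rule encoding (triples for binary rules, pairs for lexical rules) are illustrative choices, not notation from the slides.

```python
from collections import defaultdict


def count_trees(word, binary_rules, lexical_rules, start):
    """Count the distinct derivation trees of `word` from `start`.

    Assumes a CNF grammar: binary_rules holds (A, B, C) for A -> B C,
    lexical_rules holds (A, a) for A -> a.
    """
    n = len(word)
    # chart[i][j][A] = number of distinct trees for A =>* word[i:j]
    chart = [[defaultdict(int) for _ in range(n + 1)] for _ in range(n + 1)]
    for i, symbol in enumerate(word):
        for lhs, terminal in lexical_rules:
            if terminal == symbol:
                chart[i][i + 1][lhs] += 1
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            j = i + length
            for k in range(i + 1, j):
                for lhs, left, right in binary_rules:
                    # counting semiring: multiply along a rule, add over alternatives
                    chart[i][j][lhs] += (
                        chart[i][k].get(left, 0) * chart[k][j].get(right, 0)
                    )
    return chart[0][n].get(start, 0)


# Hypothetical example grammar: S -> S S | a (every bracketing is a tree)
binary = [("S", "S", "S")]
lexical = [("S", "a")]
print([count_trees("a" * n, binary, lexical, "S") for n in range(1, 6)])
# -> [1, 1, 2, 5, 14]
```

Replacing the integer counts by another semiring (for example min-plus) would give optimal-tree weights with the same table.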
Context Free Basics 2
Node Type and Spread Lemma
(i) B pumping.
(ii) C pre-terminal: if NOT {C > B, B pumping}.
(iii) D spread: D is not pumping, but D > B for some pumping B.
SPREAD LEMMA
1. A pre-terminal C derives a bounded number of bounded terminal strings.
2. In each derivation tree, a spread node D derives a bounded sub-tree, the leaves of which are terminals or pump nodes.
3. In G, each spread symbol D derives a bounded number of sub-trees as described in 2.
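The three node types can be read off the relation A > B by a reachability computation. Below is a small sketch under the assumption that the grammar is given as a dictionary from each non-terminal to its right-hand sides; `classify` and the example grammar are hypothetical names used only for illustration.

```python
def classify(productions):
    """Label each non-terminal as 'pumping', 'spread' or 'pre-terminal'.

    productions: dict mapping a non-terminal to a list of right-hand
    sides, each a tuple of symbols (assumed encoding).
    """
    # direct[A] = symbols occurring on some right-hand side of A
    direct = {a: {s for rhs in rhss for s in rhs} for a, rhss in productions.items()}
    # reach[A] = all symbols X with A > X (transitive closure of `direct`)
    reach = {}
    for a in productions:
        seen, stack = set(), list(direct[a])
        while stack:
            x = stack.pop()
            if x not in seen:
                seen.add(x)
                stack.extend(direct.get(x, ()))
        reach[a] = seen
    # B is pumping iff B > B (take B' on the cycle through B)
    pumping = {a for a in productions if a in reach[a]}
    kinds = {}
    for a in productions:
        if a in pumping:
            kinds[a] = "pumping"
        elif reach[a] & pumping:
            kinds[a] = "spread"
        else:
            kinds[a] = "pre-terminal"
    return kinds


# Hypothetical grammar: S -> A D, D -> B c | c, B -> a B b | a b, A -> a | a a
grammar = {
    "S": [("A", "D")],
    "D": [("B", "c"), ("c",)],
    "B": [("a", "B", "b"), ("a", "b")],
    "A": [("a",), ("a", "a")],
}
print(classify(grammar))
```

In this example B is pumping, S and D are spread (they reach B without being on a cycle), and A is pre-terminal.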
Non Expansive Grammars
G is non-expansive (NE) if no production rule has the form B → -B'-B''- where the B's are from the same pumping class. Equivalently, no derivation B ⇒* -B-B- is possible (sideways pumping is forbidden!).
NE is the quasi-rational class, the substitution closure of linear grammars [1]. Our BOT scheme simplifies proofs of its known properties and yields new ones (parsing speed).
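The production-level form of the NE condition can be checked mechanically. The sketch below assumes the same dictionary encoding of productions as a plain Python structure and assumes the grammar has no useless symbols; `is_non_expansive` and both example grammars are illustrative, not taken from the slides.

```python
def reachable(productions):
    """reach[A] = set of symbols X with A > X."""
    direct = {a: {s for rhs in rhss for s in rhs} for a, rhss in productions.items()}
    reach = {}
    for a in productions:
        seen, stack = set(), list(direct[a])
        while stack:
            x = stack.pop()
            if x not in seen:
                seen.add(x)
                stack.extend(direct.get(x, ()))
        reach[a] = seen
    return reach


def is_non_expansive(productions):
    """True if no rule B -> -B'-B''- has B, B', B'' in one pumping class."""
    reach = reachable(productions)
    for b, rhss in productions.items():
        if b not in reach[b]:
            continue  # b is not pumping, so it has no pumping class
        for rhs in rhss:
            # occurrences on the right-hand side that lie in the class [b]
            same_class = [
                x for x in rhs
                if x in productions and x in reach[b] and b in reach[x]
            ]
            if len(same_class) >= 2:
                return False
    return True


# A linear grammar for a^m b^m and one particular grammar for Dyck strings
linear = {"B": [("a", "B", "b"), ("a", "b")]}
dyck = {"S": [("S", "S"), ("(", "S", ")"), ("(", ")")]}
print(is_non_expansive(linear), is_non_expansive(dyck))  # -> True False
```

The test only speaks about the given grammar; whether a language has some NE grammar is a separate question.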
Bounded Operation Tree (BOT)
BOT tree nodes are labeled by:
• Current grammar as a product Π = P1…Pk
• Current operation SPREAD / CYC / TTR (depending on the type of the root of P1 or Pk), which determines the children nodes and their labels
Root of BOT = #G; leaves of BOT = linear G(i)
Main Claim: each derivation tree for w w.r.t. #G is mapped onto a derivation tree for ƍw w.r.t. some G(i) (with the same weight), and vice versa.
SPREAD / CYC / TTR Operations
Type=SPREAD: Pk is split into a union ∪ Q(j); the current grammar at the j'th child is P1…Pk-1 * Q(j).
Type=CYC: Pk is terminal; the (effective) current grammar at the single child is Pk P1…Pk-1.
Type=TTR, if the root of Pk is pumping: let M = P1…Pk-1, N = Pk; the top trunk of N is rotated by 180° and mounted on M, so MN → M*N^.
Top Trunk Rotation of MN to (M*N^)
[Figure 1.1: the top trunk of N, with side parts x1, x2 and y1, y2 and EXIT node N^, is rotated by 180° and mounted on M, giving M*.]
For strings: m x1x2 … n^ … y2y1 → … y2y1 m x1x2 … n^
Figure 1.2: TTR for grammars
N grammar (top trunk)        M* grammar
B → B'C                      B' → CB
B → DB'                      B' → BD
B → B^, B^ → α               B → root(M), root(M) → α
All other productions carry over from N to M*; those of M are unchanged.
The TTR rotation is invertible, one-one and onto for the derivation trees, preserving ambiguity in the 'cyclic rotated' sense.
Termination and Correctness
TTR operations dominate the BOT scheme for NE grammars. The E-depth of N^ and of the two sides of the mounted trunk must decrease. The M* factors become taller and thinner until they become linear G(i). [without spread symbols]
Claim: each derivation tree for w w.r.t. #G is mapped to a derivation tree for ƍw w.r.t. some G(i) (with the same weight), and vice versa.
ƍw = cyclic rotation of w. This holds for each SPREAD/CYC/TTR step!
Tabular Dynamic Programming for Parsing G
(CYK / Earley algorithm.) For a terminal string w of length n, the table extends to items over rotated intervals [i+1, i+k (mod n), A → BC], at the same cost. For linear G(i) the total time cost is only O(n^2). Space cost is O(n): one or a few diagonals of width near k are kept in memory, with pointers to a few neighbors, enabling table reconstruction.
• Just membership, or the total-weight algorithm, is in the parallel class NC(1), as for finite-state transductions.
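To make the rotated-interval items concrete, here is a minimal sketch of a CYK table indexed by cyclic intervals (start position and length, with positions taken mod n). It assumes a general CNF grammar and therefore runs in cubic time; it does not implement the O(n^2) time or O(n) space refinements for linear grammars. The names `cyclic_cyk` and `accepted_rotations` and the example grammar are illustrative assumptions.

```python
def cyclic_cyk(word, binary_rules, lexical_rules):
    """CYK chart over cyclic intervals of `word` (CNF grammar assumed).

    chart[i][k] = non-terminals deriving the k symbols that start at
    position i and wrap around the end of the word if necessary.
    """
    n = len(word)
    chart = [[set() for _ in range(n + 1)] for _ in range(n)]
    for i in range(n):
        chart[i][1] = {lhs for lhs, t in lexical_rules if t == word[i]}
    for k in range(2, n + 1):
        for i in range(n):
            for split in range(1, k):
                left = chart[i][split]
                right = chart[(i + split) % n][k - split]
                for lhs, b, c in binary_rules:
                    if b in left and c in right:
                        chart[i][k].add(lhs)
    return chart


def accepted_rotations(word, binary_rules, lexical_rules, start):
    """All cyclic rotations of `word` that the grammar derives from `start`."""
    n = len(word)
    chart = cyclic_cyk(word, binary_rules, lexical_rules)
    return [word[i:] + word[:i] for i in range(n) if start in chart[i][n]]


# Hypothetical CNF grammar for a^m b^m: S -> A T | A B, T -> S B, A -> a, B -> b
binary = [("S", "A", "T"), ("S", "A", "B"), ("T", "S", "B")]
lexical = [("A", "a"), ("B", "b")]
print(accepted_rotations("baab", binary, lexical, "S"))  # -> ['aabb']
```

The table has n start positions and n lengths, so it is the same size as the usual triangular table up to a constant factor, and a single pass decides membership for every rotation of w.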
Example (from [4])
• (M)(N) = (u I u^R)(v J v^R), u, v ∈ {0,1}* = I = J, u^R = reversal of u.
• It has unbounded "direct (product) ambiguity", which increases the time of the Earley algorithm. But after one TTR step MN is rotated to
• (M*)(N^) = (v^R u I u^R v)(J), which has a linear grammar (of unbounded ambiguity degree).
And all product-ambiguity trees are rotated to a union of trees for the linear M*N^.
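On the level of strings, the TTR step in this example is a cyclic rotation that moves the trailing reversed block to the front. The short sketch below checks this for concrete strings; the middle parts standing in for I and J are passed as arbitrary strings, and all function names and sample values are assumptions made for illustration.

```python
def product_string(u, i_mid, v, j_mid):
    """A string of the product form (u I u^R)(v J v^R)."""
    return u + i_mid + u[::-1] + v + j_mid + v[::-1]


def rotated_string(u, i_mid, v, j_mid):
    """The same material arranged as (v^R u I u^R v)(J)."""
    return v[::-1] + u + i_mid + u[::-1] + v + j_mid


def is_cyclic_rotation(x, y):
    """True if y is obtained from x by a cyclic rotation."""
    return len(x) == len(y) and y in x + x


# Hypothetical sample values
u, i_mid, v, j_mid = "01", "0", "110", "1"
w = product_string(u, i_mid, v, j_mid)
rw = rotated_string(u, i_mid, v, j_mid)
print(w, rw, is_cyclic_rotation(w, rw))
```

The rotated string reads as one nested (linear) pattern around the middle part, followed by the exit factor, which is why a linear grammar suffices after the rotation.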
Decomposing Bounded Ambiguity
SE Claim: Let Ambiguity-deg(G) = l < ∞. Then L(G) is a bounded-size union of languages of deg-1 grammars. This provides a positive answer to a question Sam Eilenberg posed, c. 1970.
"Bounded size" means polynomial in |G|, the size of the grammar G, and l.
Expansive G and Ambiguity
G expansive ⇒ each pump symbol has ambiguity degree 1 or unbounded (exponential in length):
B ⇒* -B-B-B-…-B-… (k times). If deg B ≥ 2 then deg B ≥ 2^k.
This is a cornerstone in the proof of SE.
• Extending ambiguity to cyclic-closed strings is helpful (cf. last slides).
Proof of SE
We briefly sketch the scheme for proving the claim. Starting with #G, and using the SPREAD LEMMA, the claim reduces to:
LEMMA. Let Π = MN(1)…N(k), deg M = 1, deg Π = l < ∞, where the N(i) are terminals or have pump roots. Then L(M) = ∪ L(M(j)), j ∈ J, with deg M(j) = 1 and J bounded.
It suffices to prove it for a pair, starting with MN(1), after which the M(j)N(2) are decomposed, and so on.
Proof of SE (2)
For a pair MN the operation TTR is used, transforming it to M*N^. Now deg M* < l, and its ambiguity must be concentrated along the top trunk which it got from N. An easy direct argument shows that it decomposes into a bounded union of M(j) of deg 1. As for N^, its E-depth is smaller than that of N, so for M(j)N^ we can use induction on the E-depth of the second factor or, more explicitly, continue the recursive descent on N^ until it is consumed.
Approximate G by NE G’
• Easy to achieve by duplicating symbols of the pumping classes.
• Makes linguistic sense.
• Advantages of NE G' using the BOT scheme: view the linear G'i as finite-state transductions, a powerful tool in several linguistic fields.
• Applications to bioinformatics (stringology)?
• Extension of the NE condition to mildly context-sensitive models (LIG, TAG…)?
The Hardest Context Free Grammar
The concept is due to S. Greibach [3]. The simplest reduction is based on Shamir's homomorphism theorem ([1]), mapping each b in T into a finite set φ(b) of strings over the vocabulary of the Dyck language and claiming that w is in L(G) if and only if φ(w) contains a string in the Dyck language (see the description in [1]).
In fact, the categorial grammar model in the 1960 article ([2]) provides another homomorphism which makes it a hardest CFG.
However, those hardest CFG languages are inherently expansive. Indeed, an NE candidate grammar for Dyck will be negated by its BOT scheme, upon using local pump-shrinks, which for linear grammars can operate near any point of the (sufficiently long) main branch of non-terminals.
We conjecture that any hardest CFG must be expansive. Note that finding a non-expansive one would entail O(n^2) complexity of the membership test for any context-free grammar.
Ambiguity and Cyclic Rotation
Ambiguity in natural languages can be resolved (or created) by cyclic rotation. Consider the Bible verse in the book of Job, chapter 6, verse 14 (six Hebrew words). Translated to English: "a friend should extend mercy to the sufferer, even if he abandons God's fear."
• The ambiguity here is anaphoric: does the pronoun "he" refer to the sufferer or to the friend? The beautiful poetic answer is: to both.
• The rotated sentences, starting at the symbols # and $, resolve the ambiguity one way or the other.
• Politically loaded example: the policeman shot the boy with the gun.
References
1. J. Autebert, J. Berstel and L. Boasson. Context-free languages and pushdown automata. Chap. 3 in: Handbook of Formal Languages, Vol. 1, G. Rozenberg and A. Salomaa (eds.), Springer-Verlag, 1997.
2. Y. Bar-Hillel, H. Gaifman and E. Shamir. On categorial and phrase structure grammars. Bulletin of the Research Council of Israel 9F (1960), 1-16.
3. S. Greibach. The hardest context-free language. SIAM J. on Computing 2 (1973), 304-310.
4. E. Shamir. Some inherently ambiguous context-free languages. Inf. and Control 18 (1971), 355-363.