


Learning and Inference for Hierarchically Split Probabilistic Context-Free Grammars
Slav Petrov and Dan Klein

University of California, Berkeley

Evolution of the DT tag during hierarchical splitting and merging. Shown are the top three words for each subcategory and their respective probabilities.

DT: the (0.50), a (0.24), The (0.08)
  - that (0.15), this (0.14), some (0.11)
      - this (0.39), that (0.28), That (0.11)
          - this (0.52), that (0.36), another (0.04)
          - That (0.38), This (0.34), each (0.07)
      - some (0.20), all (0.19), those (0.12)
          - some (0.37), all (0.29), those (0.14)
          - these (0.27), both (0.21), Some (0.15)
  - the (0.54), a (0.25), The (0.09)
      - the (0.80), The (0.15), a (0.01)
          - the (0.96), a (0.01), The (0.01)
          - The (0.93), A (0.02), No (0.01)
      - a (0.61), the (0.19), an (0.10)
          - a (0.75), an (0.12), the (0.03)

Learning | Inference | Results

Observed categories are too coarse:

Adaptive Grammar Refinement: Split each category into k subcategories and fit the grammar with the EM algorithm.

Reference: Slav Petrov, Adam Pauls and Dan Klein, "Learning Structured Models for Phone Recognition", in EMNLP-CoNLL '07

Smoothing: Reduce overfitting by shrinking the production probabilities of each subcategory towards those of their common base category.
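One way to picture this shrinkage is as a linear interpolation between each subcategory's rule distribution and the average over all subcategories of the same base category. The sketch below assumes rule probabilities are kept in plain dictionaries and that subcategories are named like "NP-3"; the function name and the weight alpha are illustrative, not the poster's actual implementation.

```python
def smooth_subcategories(rule_probs, alpha=0.01):
    """Shrink each subcategory's rule probabilities towards the mean over
    all subcategories of the same base category.

    rule_probs: dict mapping a subcategory (e.g. "NP-3") to a dict
                mapping right-hand sides to probabilities.
    alpha:      interpolation weight towards the shared base estimate.
    """
    # Group subcategories by their base category ("NP-3" -> "NP").
    by_base = {}
    for sub in rule_probs:
        by_base.setdefault(sub.split("-")[0], []).append(sub)

    smoothed = {}
    for base, subs in by_base.items():
        # Average rule distribution over the subcategories of this base category.
        rhss = {rhs for s in subs for rhs in rule_probs[s]}
        mean = {rhs: sum(rule_probs[s].get(rhs, 0.0) for s in subs) / len(subs)
                for rhs in rhss}
        # Interpolate each subcategory with the shared mean.
        for s in subs:
            smoothed[s] = {rhs: (1 - alpha) * rule_probs[s].get(rhs, 0.0)
                                + alpha * mean[rhs]
                           for rhs in rhss}
    return smoothed
```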

Reference: Slav Petrov, Leon Barrett, Romain Thibaux and Dan Klein, "Learning accurate, compact and interpretable tree annotation", in ACL-COLING '06

Reference: Slav Petrov and Dan Klein, "Improved Inference for Unlexicalized Parsing", in NAACL-HLT '07

The three most frequent words in the subcategories of several part-of-speech tags.

VBZ:
  VBZ-0  gives, sells, takes
  VBZ-1  comes, goes, works
  VBZ-2  includes, owns, is
  VBZ-3  puts, provides, takes
  VBZ-4  says, adds, Says
  VBZ-5  believes, means, thinks
  VBZ-6  expects, makes, calls
  VBZ-7  plans, expects, wants
  VBZ-8  is, 's, gets
  VBZ-9  's, is, remains
  VBZ-10 has, 's, is
  VBZ-11 does, Is, Does

NNP:
  NNP-0  Jr., Goldman, INC.
  NNP-1  Bush, Noriega, Peters
  NNP-2  J., E., L.
  NNP-3  York, Francisco, Street
  NNP-4  Inc, Exchange, Co
  NNP-5  Inc., Corp., Co.
  NNP-6  Stock, Exchange, York
  NNP-7  Corp., Inc., Group
  NNP-8  Congress, Japan, IBM
  NNP-9  Friday, September, August
  NNP-10 Shearson, D., Ford
  NNP-11 U.S., Treasury, Senate
  NNP-12 John, Robert, James
  NNP-13 Mr., Ms., President
  NNP-14 Oct., Nov., Sept.
  NNP-15 New, San, Wall

JJS:
  JJS-0  largest, latest, biggest
  JJS-1  least, best, worst
  JJS-2  most, Most, least

DT:
  DT-0   the, The, a
  DT-1   A, An, Another
  DT-2   The, No, This
  DT-3   The, Some, These
  DT-4   all, those, some
  DT-5   some, these, both
  DT-6   That, This, each
  DT-7   this, that, each
  DT-8   the, The, a
  DT-9   no, any, some
  DT-10  an, a, the
  DT-11  a, this, the

CD:
  CD-0   1, 50, 100
  CD-1   8.50, 15, 1.2
  CD-2   8, 10, 20
  CD-3   1, 30, 31
  CD-4   1989, 1990, 1988
  CD-5   1988, 1987, 1990
  CD-6   two, three, five
  CD-7   one, One, Three
  CD-8   12, 34, 14
  CD-9   78, 58, 34
  CD-10  one, two, three
  CD-11  million, billion, trillion

PRP:
  PRP-0  It, He, I
  PRP-1  it, he, they
  PRP-2  it, them, him

RBR:
  RBR-0  further, lower, higher
  RBR-1  more, less, More
  RBR-2  earlier, Earlier, later

IN:
  IN-0   In, With, After
  IN-1   In, For, At
  IN-2   in, for, on
  IN-3   of, for, on
  IN-4   from, on, with
  IN-5   at, for, by
  IN-6   by, in, with
  IN-7   for, with, on
  IN-8   If, While, As
  IN-9   because, if, while
  IN-10  whether, if, That
  IN-11  that, like, whether
  IN-12  about, over, between
  IN-13  as, de, Up
  IN-14  than, ago, until
  IN-15  out, up, down

RB:
  RB-0   recently, previously, still
  RB-1   here, back, now
  RB-2   very, highly, relatively
  RB-3   so, too, as
  RB-4   also, now, still
  RB-5   however, Now, However
  RB-6   much, far, enough
  RB-7   even, well, then
  RB-8   as, about, nearly
  RB-9   only, just, almost
  RB-10  ago, earlier, later
  RB-11  rather, instead, because
  RB-12  back, close, ahead
  RB-13  up, down, off
  RB-14  not, Not, maybe
  RB-15  n't, not, also

A general technique for learning refined, structured models when only the trace of a complex underlying process is observed.
Learns compact and accurate grammars from a treebank without additional human input.
Gives the best known parsing accuracy on a variety of languages, while being extremely efficient.
Interactive demo and download at http://nlp.cs.berkeley.edu

Extensions

Reference: Percy Liang, Slav Petrov, Michael Jordan and Dan Klein, "The Infinite PCFG using Hierarchical Dirichlet Processes", in EMNLP-CoNLL '07

Hierarchical Dirichlet Processes as a nonparametric Bayesian alternative to split and merge:

Merging: Roll back the least useful splits in order to allocate complexity only where needed.
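A rough sketch of how such a criterion can be applied: estimate, for each recent split, how much training likelihood would be lost if it were reversed, and merge back the least useful half. The helpers likelihood_with_split and likelihood_with_split_reversed are hypothetical placeholders supplied by the caller, not the poster's actual implementation.

```python
def select_splits_to_merge(splits, grammar, treebank,
                           likelihood_with_split, likelihood_with_split_reversed,
                           keep_fraction=0.5):
    """Rank the most recent splits by how much training likelihood would be
    lost if each were undone, then merge back the least useful half.

    The two likelihood functions are hypothetical placeholders: each takes
    (grammar, treebank, split) and returns the (log-)likelihood of the
    training trees with the split kept or reversed, respectively.
    """
    gains = []
    for split in splits:
        gain = (likelihood_with_split(grammar, treebank, split)
                - likelihood_with_split_reversed(grammar, treebank, split))
        gains.append((gain, split))          # a small gain marks a merge candidate

    gains.sort(key=lambda pair: pair[0], reverse=True)
    n_keep = int(len(gains) * keep_fraction)
    kept   = [s for _, s in gains[:n_keep]]
    merged = [s for _, s in gains[n_keep:]]
    return kept, merged
```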

Grammar Extraction: From a treebank tree such as

  (S (NP (PRP She)) (VP (VBD heard) (NP (DT the) (NN noise))) (. .))

read off the grammar and lexicon with relative-frequency probabilities:

  Grammar:
    S  → NP VP .   1.0
    NP → PRP       0.5
    NP → DT NN     0.5
    ...

  Lexicon:
    PRP → She      1.0
    DT  → the      1.0
    ...
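A minimal sketch of this read-off step, assuming trees are represented as nested (label, children) tuples; the representation and names here are illustrative only.

```python
from collections import Counter, defaultdict

def extract_pcfg(trees):
    """Read off a PCFG from a treebank by counting rule occurrences and
    normalizing per left-hand side (relative-frequency estimation)."""
    rule_counts = Counter()

    def visit(node):
        label, children = node
        if isinstance(children, str):                     # preterminal: lexicon entry
            rule_counts[(label, (children,))] += 1
            return
        rule_counts[(label, tuple(c[0] for c in children))] += 1
        for child in children:
            visit(child)

    for tree in trees:
        visit(tree)

    lhs_totals = defaultdict(int)
    for (lhs, _), count in rule_counts.items():
        lhs_totals[lhs] += count
    return {rule: count / lhs_totals[rule[0]] for rule, count in rule_counts.items()}

# Example: the tree from the figure above.
tree = ("S", [("NP", [("PRP", "She")]),
              ("VP", [("VBD", "heard"),
                      ("NP", [("DT", "the"), ("NN", "noise")])]),
              (".", ".")])
print(extract_pcfg([tree]))   # e.g. ("NP", ("PRP",)) -> 0.5 and ("NP", ("DT", "NN")) -> 0.5
```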

Relative frequency of common NP expansions in different contexts:

                  NP → NP PP   NP → DT NN   NP → PRP
  All NPs             11%           9%          6%
  NPs under S          9%           9%         21%
  NPs under VP        23%           7%          4%

Hierarchical Splitting:

Repeatedly split each annotation symbol in two and retrain the grammar, initializing with the previous grammar; a sketch of this loop follows the example tree below.

  (S-x (NP-x (PRP-x She)) (VP-x (VBD-x heard) (NP-x (DT-x the) (NN-x noise))) (.-x .))
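The overall training loop can be sketched as follows; split_in_two and em_train are assumed to be supplied by the caller (splitting each symbol with a little random noise to break symmetry, and EM over the latent subcategories, respectively), so this is only a driver-level sketch rather than the poster's implementation.

```python
def hierarchical_split_train(base_grammar, treebank, split_in_two, em_train, rounds=6):
    """Driver for the split/retrain loop: split every annotation symbol in
    two, then refit with EM, initializing each round from the previous
    grammar.  split_in_two and em_train are caller-supplied (assumed here)."""
    grammar = base_grammar                      # X-bar grammar = G0
    for _ in range(rounds):
        # Duplicate every subcategory, perturbing the copied rule weights
        # slightly so that EM can break the symmetry between the two copies.
        grammar = split_in_two(grammar)
        # Refit the latent subcategories on the observed (unannotated) trees.
        grammar = em_train(grammar, treebank)
        # Merging and smoothing (described elsewhere on the poster) would be
        # interleaved here before the next round of splitting.
    return grammar
```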

To decide which splits to keep, the likelihood of a training tree with the split at a node (e.g. an NP split into NP1 and NP2, eventually NP1 ... NP8) is compared against its likelihood with the split at that node reversed.

Parsing accuracy for different models: F1 (74-90) as a function of the total number of grammar symbols (100-1100), comparing Flat Training, Hierarchical Training, 50% Merging, and 50% Merging and Smoothing.

Parsing:

Parse Efficiency: Rapidly pre-parse the sentence in a hierarchical coarse-to-fine fashion, pruning away unlikely chart items.
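A minimal sketch of the pruning step, assuming the posterior probability of each (span, category) item under a coarser grammar has already been computed; the names and the threshold are illustrative.

```python
def prune_chart(coarse_posteriors, project_to_coarse, fine_items, threshold=1e-4):
    """Keep only those fine-grammar chart items whose projection into the
    coarser grammar has sufficient posterior probability.

    coarse_posteriors: dict mapping (start, end, coarse_symbol) to the item's
                       posterior probability under the coarse grammar.
    project_to_coarse: function mapping a fine symbol (e.g. "NP-7") to its
                       coarse projection (e.g. "NP").
    fine_items:        iterable of (start, end, fine_symbol) candidates.
    """
    kept = []
    for start, end, fine_sym in fine_items:
        coarse_item = (start, end, project_to_coarse(fine_sym))
        # Items the coarse pass considers very unlikely are never built in
        # the fine pass, which is what makes parsing fast.
        if coarse_posteriors.get(coarse_item, 0.0) > threshold:
            kept.append((start, end, fine_sym))
    return kept
```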

Parse Selection: Use a variational approximation to select the tree with the maximum expected number of correct rules (since computing the best parse tree is intractable and selecting the best derivation is a poor approximation).
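The sketch below illustrates one simple instance of this idea: given posteriors for anchored rules (from the inside-outside pass), a CKY-style dynamic program picks the tree whose rules have the largest total posterior. The sum-of-posteriors objective and the names here are an illustrative simplification of the variational objective, not the exact method.

```python
def max_rule_tree(n, symbols, tag_posteriors, rule_posteriors):
    """Pick the binary tree over a sentence of length n that maximizes the
    sum of posterior probabilities of the rules it contains.

    tag_posteriors:  (position, symbol) -> posterior of that preterminal.
    rule_posteriors: (start, split, end, parent, left, right) -> posterior
                     of that anchored binary rule.
    """
    best = {}   # (start, end, symbol) -> (score, backpointer)

    # Width-1 spans: take the preterminal posteriors directly.
    for i in range(n):
        for sym in symbols:
            best[(i, i + 1, sym)] = (tag_posteriors.get((i, sym), 0.0), None)

    # Larger spans: combine the best children with the anchored-rule posterior.
    for width in range(2, n + 1):
        for start in range(n - width + 1):
            end = start + width
            for parent in symbols:
                best[(start, end, parent)] = (0.0, None)
                for split in range(start + 1, end):
                    for left in symbols:
                        for right in symbols:
                            post = rule_posteriors.get(
                                (start, split, end, parent, left, right), 0.0)
                            if post == 0.0:
                                continue
                            score = (post + best[(start, split, left)][0]
                                          + best[(split, end, right)][0])
                            if score > best[(start, end, parent)][0]:
                                best[(start, end, parent)] = (score, (split, left, right))
    return best   # the tree is read off the backpointers from the root span (0, n, root)
```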

(HDP-PCFG graphical model: top-level weights β, per-subcategory rule and emission parameters φ, and latent subcategory variables z for the nodes of each of the T observed trees.)

Automatic refinement of acoustic models learns phone-internal structure as well as phone-external context:

(Learned HMM structure for a phone, with distinct start and end substates and context-dependent internal substates.)

Learned grammars are compact and interpretable:

  (S (NP (DT The) (NNP Golden) (NNP Gate) (NN bridge)) (VP (VBZ is) (ADJP (JJ red))) (. .))

Compute grammars of intermediate complexity by projecting the most refined grammar.
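A simplified sketch of such a projection: each refined symbol is mapped to its ancestor at a coarser level of the split hierarchy, and the rule probabilities of sibling subcategories are combined, weighted by how often each refined parent occurs. The naming convention and the weighting are illustrative; the poster's projections re-estimate the coarse probabilities from the refined grammar.

```python
from collections import defaultdict

def project_symbol(symbol, fine_depth, coarse_depth):
    """Map a refined symbol such as "NP-13" at split depth fine_depth to its
    ancestor at coarse_depth.  With binary splits, subcategory k projects to
    k // 2**(fine_depth - coarse_depth); depth 0 is the unsplit X-bar symbol.
    (Illustrative naming convention, not necessarily the poster's.)"""
    base, _, idx = symbol.partition("-")
    if coarse_depth == 0 or not idx:
        return base
    return f"{base}-{int(idx) >> (fine_depth - coarse_depth)}"

def project_grammar(fine_rules, symbol_freqs, fine_depth, coarse_depth):
    """Collapse a refined grammar onto a coarser one by summing the rule
    probabilities of sibling subcategories, weighted by how often each
    refined parent occurs (symbol_freqs).  A simplified stand-in for the
    projections pi_i used on the poster."""
    numer, denom = defaultdict(float), defaultdict(float)
    for (parent, rhs), prob in fine_rules.items():
        coarse_parent = project_symbol(parent, fine_depth, coarse_depth)
        coarse_rhs = tuple(project_symbol(s, fine_depth, coarse_depth) for s in rhs)
        weight = symbol_freqs.get(parent, 1.0) * prob
        numer[(coarse_parent, coarse_rhs)] += weight
        denom[coarse_parent] += weight
    return {rule: w / denom[rule[0]] for rule, w in numer.items()}
```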

(Hierarchical coarse-to-fine pruning for "The Golden Gate bridge is red.": the sentence is pre-parsed with grammars of increasing refinement, G0 through G6.)

Learning produces an increasingly refined sequence of grammars, X-Bar = G0, G1, ..., G6 = G; the intermediate grammars used for pruning are obtained by projections π0(G), π1(G), ..., π5(G) of the most refined grammar G. The refined symbols form a hierarchy (e.g. NP splitting into NP1 and NP2, which split further), so every refined symbol has a unique projection in each coarser grammar.

Parse Tree:

  (S (NP (PRP He)) (VP (VBD was) (ADJP right)) (. .))

Parse Derivations:

  (S-2 (NP-17 (PRP-2 He)) (VP-23 (VBD-12 was) (ADJP-11 right)) (. .))
  (S-1 (NP-10 (PRP-2 He)) (VP-11 (VBD-16 was) (ADJP-14 right)) (. .))

                             ≤ 40 words    all
  Parser                         F1         F1
  ENGLISH
    Charniak & Johnson '05      90.1       89.6
    This Work                   90.6       90.1
  GERMAN
    Dubey '05                   76.3        -
    This Work                   80.8       80.1
  CHINESE
    Chiang & Bikel '02          80.0       76.6
    This Work                   86.3       83.4

Bracket posterior probabilities during coarse-to-fine decoding for the sentence: Influential members of the House Ways and Means Committee introduced legislation that would restrict how the new s&l bailout agency can raise capital ; creating another potential obstacle to the government 's sale of sick thrifts .