Shuffling Non-Constituents

Jason Eisner
with David A. Smith and Roy Tromble

ACL SSST Workshop (invited talk), June 2008

Part I: a syntactically-flavored reordering model
Part II: syntactically-flavored reordering search methods
Starting point: Synchronous alignment

Synchronous grammars are very pretty. But does parallel text actually have parallel structure?

Depends on what kind of parallel text:
- Free translations? Noisy translations?
- Were the parsers trained on parallel annotation schemes?

Depends on what kind of parallel structure:
- What kinds of divergences can your synchronous grammar formalism capture?
- E.g., wh-movement versus wh in situ
Synchronous Tree Substitution Grammar

[Figure: two training trees, showing a free translation from French to English:
"beaucoup d'enfants donnent un baiser à Sam" / "kids kiss Sam quite often".
French nodes: beaucoup ("lots"), d' ("of"), enfants ("kids"), donnent ("give"),
un ("a"), baiser ("kiss"), à ("to"), Sam; English nodes: kids, kiss, Sam, quite,
often. A possible alignment is shown in orange: the NPs pair up (enfants with
kids, Sam with Sam), the verbal material donnent un baiser à pairs with kiss,
and the extra English adverbs quite and often hang off null-aligned Adv nodes.
A much worse alignment is also shown for contrast.]

Synchronous Grammar = Set of Elementary Trees

[Figure: the aligned tree pair cut into its elementary trees: the paired NP
fragments, the paired verbal fragment (donnent un baiser à / kiss), and the
null-aligned adverb fragments (quite, often).]
But many examples are harder

[Figure: aligned dependency trees for a German-English pair.
German: "Auf diese Frage habe ich leider keine Antwort bekommen"
(gloss: to this question have I alas no answer received).
English: "I did not unfortunately receive an answer to this question"
(with NULL alignments). Successive builds highlight the hard spots:
- Displaced modifier (negation)
- Displaced argument (here, because projective parser)
- Head-swapping (here, just different annotation conventions)]
Free Translation

[Figure: aligned dependency trees for a German-English pair.
German: "Tschernobyl könnte dann etwas später an die Reihe kommen"
(gloss: Chernobyl could then something later on the queue come).
English: "Then we could deal with Chernobyl some time later"
(with NULL alignments). The divergence here is probably not systematic
(but the words are correctly aligned), and part of the parse is erroneous.]
What to do? Current practice:

- Don't try to model all systematic phenomena! Just use non-syntactic alignments (Giza++).
- Only care about the fragments that recur often:
  - phrases or gappy phrases
  - sometimes even syntactic constituents (can favor these, e.g., Marton & Resnik 2008)
- Use these (gappy) phrases in a decoder: phrase-based or hierarchical.
What to do? Current practice:

- Use non-syntactic alignments (Giza++); keep frequent phrases for a decoder.

But could syntax give us better alignments? It would have to be "loose" syntax …

Why do we want better alignments?
1. Throw away less of the parallel training data.
2. Help learn a smarter, syntactic reordering model (could help decoding: less reliance on the LM).
3. Some applications care about full alignments.
Quasi-synchronous grammar

How do we handle "loose" syntax? Translation story:
- Generate target English by a monolingual grammar.
- Any grammar formalism is okay; pick a dependency grammar formalism for now.

Example: "I did not unfortunately receive an answer to this question"
- P(PRP | no previous left children of "did")
- P(I | did, PRP)

Parsing: O(n³)
Quasi-synchronous grammar

How do we handle "loose" syntax? Translation story:
- Generate target English by a monolingual grammar, but probabilities are influenced by the source sentence.
- Each English node is aligned to some source node.
- Prefers to generate children aligned to nearby source nodes.

Example: "I did not unfortunately receive an answer to this question"

Parsing: O(n³)
QCFG Generative Story

[Figure: the observed German sentence "Auf diese Frage habe ich leider keine
Antwort bekommen" (plus NULL), aligned to the English tree being generated:
"I did not unfortunately receive an answer to this question".]

- P(PRP | no previous left children of "did", habe)
- P(I | did, PRP, ich)
- P(parent-child), P(breakage): alignment-configuration probabilities

Aligned parsing: O(m²n³)
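To make the factored story concrete, here is a minimal sketch of scoring one English dependency edge under such a model. All names (Node, CONFIG_LP, tag_lp, lex_lp, src_head) are hypothetical stand-ins, not the authors' code, and the three-way configuration split is a simplification of the talk's richer inventory of "nearby" configurations:

```python
import math
from collections import namedtuple

Node = namedtuple("Node", "word tag")

# Hypothetical toy configuration probabilities; a real model learns these.
CONFIG_LP = {"parent-child": math.log(0.6),       # the synchronous-grammar case
             "breakage": math.log(0.1),           # aligned, but not parent-child
             "none-of-the-above": math.log(0.3)}  # e.g., child aligned to nothing

def edge_logprob(child, parent, a, src_head, tag_lp, lex_lp):
    """log P of generating `child` under `parent` in the English tree, where
    a[node] is the source node each English node is aligned to (or None) and
    src_head maps each source node to its head in the source tree."""
    # Monolingual dependency factors, influenced by the aligned source words:
    lp = tag_lp[child.tag, parent.word, a[parent]]               # P(PRP | "did", habe)
    lp += lex_lp[child.word, parent.word, child.tag, a[child]]   # P(I | did, PRP, ich)
    # Alignment-configuration factor: is the child's source node "nearby"?
    if a[child] is None:
        cfg = "none-of-the-above"
    elif src_head.get(a[child]) == a[parent]:
        cfg = "parent-child"
    else:
        cfg = "breakage"
    return lp + CONFIG_LP[cfg]
```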
What's a "nearby node"?

Given the parent's alignment, where might the child be aligned?

[Figure: the possible configurations of the child's source node relative to the
parent's, including the synchronous-grammar case (parent-child), plus "none of
the above".]
Quasi-synchronous grammar

How do we handle "loose" syntax? Translation story:
- Generate target English by a monolingual grammar, but probabilities are influenced by the source sentence.

Useful analogies:
1. Generative grammar with latent word senses
2. MEMM: generate an n-gram tag sequence (target), but probabilities are influenced by the word sequence (source)
Quasi-synchronous grammar

How do we handle "loose" syntax? Translation story:
- Generate target English by a monolingual grammar, but probabilities are influenced by the source sentence.

Useful analogies:
1. Generative grammar with latent word senses
2. MEMM
3. IBM Model 1: source nodes can be freely reused or unused.
   Future work: enforce 1-to-1 to allow good decoding (NP-hard to do exactly).
Some results: Quasi-synchronous Dependency Grammar

- Alignment (D. Smith & Eisner 2006): quasi-synchronous syntax much better than
  synchronous; maybe also better than IBM Model 4.
- Question answering (Wang et al. 2007): align the question with a potential answer.
  Mean average precision: 43% (previous state of the art) → 48% (+ QG) → 60% (+ lexical features).
- Bootstrapping a parser for a new language (D. Smith & Eisner 2007 & ongoing):
  learn how parsed parallel text influences target dependencies, along with many
  other features (cf. co-training). Unsupervised: German 30% → 69%, Spanish 26% → 65%.
Summary of part I

Current practice:
- Use non-syntactic alignments (Giza++). Some bits align nicely; use the frequent bits in a decoder.

Suggestion: Let syntax influence alignments.
- So far, loose syntax methods are like IBM Model 1.
- NP-hard to enforce 1-to-1 in any interesting model.

Rest of talk: How to enforce 1-to-1 in interesting models? Can we do something smarter than beam search?
Shuffling Non-Constituents

Jason Eisner
with David A. Smith and Roy Tromble

ACL SSST Workshop, June 2008

Part II: syntactically-flavored reordering search methods
(Part I covered the syntactically-flavored reordering model.)
Motivation

MT is really easy! Just use a finite-state transducer! Phrases, morphology, the works!
Permutation search in MT

[Figure: the French sentence "Marie ne m' a pas vu" (tags NNP NEG PRP AUX NEG
VBN), in initial order 1 2 3 4 5 6, is permuted into the best order 1 4 2 5 6 3
(French′), from which "Mary hasn't seen me" is an easy transduction.]
Motivation

MT is really easy! Just use a finite-state transducer! Phrases, morphology, the works!
Just have to fix that pesky word order.

Framing it this way lets us enforce 1-to-1 exactly at the permutation step.
Deletion and fertility > 1 are still allowed in the subsequent transduction.
Often want to find an optimal permutation …

- Machine translation: reorder French to French-prime (Brown et al. 1992), so it's easier to align or translate.
- MT eval: how much do you need to rearrange MT output so it scores well under an LM derived from reference translations?
- Discourse generation, e.g., multi-doc summarization: order the output sentences (Lapata 2003) so they flow nicely.
- Reconstruct the temporal order of events after information extraction.
- Learn rule ordering or constraint ranking for phonology?
- Multi-word anagrams that score well under an LM.
Permutation search: The problem

initial order: 1 2 3 4 5 6
best order (according to some cost function): 1 4 2 5 6 3

How can we find this needle in the haystack of N! possible permutations?
Traditional approach: Beam search

Approximate the best path through a really big FSA:
- N! paths, one for each permutation, but only 2^N states.
- A state remembers what we've generated so far (but not in what order).
- Arc weight = cost of picking 5 next if we've seen {1,2,4} so far.
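As a concrete (hypothetical) rendering of that search space, here is a minimal beam search over partial orders, with an arbitrary arc-cost function standing in for the real model:

```python
import heapq

def beam_search_permutation(n, arc_cost, beam=100):
    """Approximate the min-cost permutation of 0..n-1 by beam search through
    the big FSA. A path's FSA state is really just (set seen, last item):
    paths that generated the same set in a different order reach the same
    state, which is why there are only 2^n states for the n! paths."""
    frontier = [(0.0, ())]                       # (cost so far, partial order)
    for _ in range(n):
        candidates = []
        for cost, order in frontier:
            seen = set(order)
            for nxt in range(n):
                if nxt not in seen:              # arc weight may depend on the
                    c = cost + arc_cost(seen, nxt)   # seen-set and the next item
                    candidates.append((c, order + (nxt,)))
        frontier = heapq.nsmallest(beam, candidates)   # prune to the beam width
    return min(frontier)

# toy usage: a cost that likes small items early
# cost, order = beam_search_permutation(6, lambda seen, nxt: nxt * (6 - len(seen)))
```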
An alternative: Local search ("hill climbing") with the SWAP neighborhood

[Figure: from the current permutation 1 2 3 4 5 6 (cost = 22), the SWAP
neighbors each exchange two adjacent items:
  2 1 3 4 5 6   cost = 26
  1 3 2 4 5 6   cost = 20
  1 2 4 3 5 6   cost = 19   ← best neighbor; move here
  1 2 3 5 4 6   cost = 25]
An alternative: Local search ("hill-climbing")

Like the "greedy decoder" of Germann et al. 2001: start at 1 2 3 4 5 6
(cost = 22) and repeatedly move to the best SWAP neighbor: cost = 19, 17, 16, …

- Why are the costs always going down? We pick the best swap.
- How long does it take to pick the best swap? O(N), if you're careful.
- How many swaps might you need to reach the answer? O(N²).
- What if you get stuck in a local min? Random restarts.
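A minimal sketch of that loop with a generic cost function (random restarts omitted; the careful O(N) bookkeeping mentioned on the slide is replaced by naive re-scoring):

```python
def hill_climb_swap(perm, cost):
    """Greedy local search over the SWAP neighborhood: repeatedly apply the
    best adjacent transposition until no swap lowers the cost."""
    perm = list(perm)
    best = cost(perm)
    while True:
        move, move_cost = None, best
        for i in range(len(perm) - 1):
            perm[i], perm[i + 1] = perm[i + 1], perm[i]   # try swapping i, i+1
            c = cost(perm)
            perm[i], perm[i + 1] = perm[i + 1], perm[i]   # undo
            if c < move_cost:
                move, move_cost = i, c
        if move is None:
            return perm, best            # local minimum: no swap helps
        perm[move], perm[move + 1] = perm[move + 1], perm[move]
        best = move_cost
```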
Larger neighborhood: INSERT
(well-known in the literature; works well)

[Figure: from 1 2 3 4 5 6 (cost = 22), remove one item and reinsert it
anywhere; one such move lowers the cost to 17.]

- Fewer local minima? Yes: 3 can move past 4 to get past 5.
- Graph diameter (max # moves needed)? O(N) rather than O(N²).
- How many neighbors? O(N²) rather than O(N).
- How long to find the best neighbor? O(N²) rather than O(N).
Even larger neighborhood: BLOCK

[Figure: from 1 2 3 4 5 6 (cost = 22), exchange two adjacent blocks; one such
move lowers the cost to 14.]

- Fewer local minima? Yes: 2 can get past 4 5 without having to cross 3 or move 3 first.
- Graph diameter? Still O(N).
- How many neighbors? O(N³) rather than O(N), O(N²).
- How long to find the best neighbor? O(N³) rather than O(N), O(N²).
Larger yet: Via dynamic programming??

- Fewer local minima?
- Graph diameter (max # moves needed)? Logarithmic.
- How many neighbors? Exponential.
- How long to find the best neighbor? Polynomial.
Unifying/generalizing the neighborhoods so far

Exchange two adjacent blocks, of max widths w ≤ w′. A move is defined by an
(i, j, k) triple: the blocks spanning (i, j] and (j, k] trade places.

- SWAP:   w = 1, w′ = 1:  runtime = # neighbors = O(ww′N) = O(N)
- INSERT: w = 1, w′ = N:  O(N²)
- BLOCK:  w = N, w′ = N:  O(N³)

Everything in this talk can be generalized to other values of w, w′.
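In code the move and its enumeration are tiny; a sketch with 0-indexed, half-open blocks (my convention, not the talk's):

```python
def apply_move(perm, i, j, k):
    """Exchange the adjacent blocks perm[i:j] and perm[j:k]."""
    return perm[:i] + perm[j:k] + perm[i:j] + perm[k:]

def neighbors(perm, w, wp):
    """Yield every single block-exchange move whose smaller block has width
    <= w and whose larger block has width <= wp. So w = wp = 1 gives SWAP;
    w = 1, wp = N gives INSERT (move one item past a block); w = wp = N
    gives BLOCK. The number of moves yielded is O(w * wp * N)."""
    n = len(perm)
    for i in range(n - 1):
        for j in range(i + 1, min(i + wp, n - 1) + 1):
            for k in range(j + 1, min(j + wp, n) + 1):
                if min(j - i, k - j) <= w:
                    yield (i, j, k), apply_move(perm, i, j, k)
```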
Very large-scale neighborhoods

What if we consider multiple simultaneous exchanges that are "independent"?

The DYNASEARCH neighborhood (Potts & van de Velde 1995; Congram 2000):

[Figure: a lattice over 1 2 3 4 5 6 whose paths select sets of disjoint
adjacent swaps, yielding neighbors such as 2 1 4 3 6 5 or 1 3 2 5 4 6. The
weight of a swap arc is the Δcost of that swap, e.g., of swapping (4,5),
here < 0. The lowest-cost neighbor is the lowest-cost path.]
Very large-scale neighborhoods

Lowest-cost neighbor is lowest-cost path. Why would this be a good idea?
- Help get out of bad local minima? No: they're still local minima.
- Help avoid getting into bad local minima? Yes: less greedy.

[Figure: a 4-item example, 1 2 3 4 → 2 1 4 3, with cost matrix

    B =  0  -20    0   80
         0    0  -30    0
         0    0    0  -20
         0    0    0    0

Greedy SWAP takes the single best swap, gaining -30; DYNASEARCH instead takes
the two independent swaps (1,2) and (3,4) together, gaining -20 + -20 = -40.]
Very large-scale neighborhoods

Lowest-cost neighbor is lowest-cost path. Why would this be a good idea?
- Help get out of bad local minima? No: they're still local minima.
- Help avoid getting into bad local minima? Yes: less greedy.
- More efficient? Yes! A shortest-path algorithm finds the best set of swaps in
  O(N) time, as fast as the best single swap.

Up to N moves as fast as 1 move: no penalty for "parallelism"!
Globally optimizes over exponentially many neighbors (paths).
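That shortest path is a one-dimensional dynamic program; a minimal sketch, with delta(perm, i) standing in for the model's Δcost of swapping positions i and i+1:

```python
def dynasearch_step(perm, delta):
    """Best set of disjoint adjacent swaps, found by shortest path / DP in O(N).
    delta(perm, i) = change in total cost if perm[i] and perm[i+1] are swapped."""
    n = len(perm)
    best = [0.0] * (n + 1)      # best[i]: best total delta for the prefix perm[:i]
    take = [False] * (n + 1)    # did the best prefix end with a swap of i-2, i-1?
    for i in range(2, n + 1):
        best[i], take[i] = best[i - 1], False        # leave perm[i-1] in place ...
        d = best[i - 2] + delta(perm, i - 2)         # ... or swap the last two
        if d < best[i]:
            best[i], take[i] = d, True
    out, i = list(perm), n                           # trace back the chosen swaps
    while i >= 2:
        if take[i]:
            out[i - 2], out[i - 1] = out[i - 1], out[i - 2]
            i -= 2
        else:
            i -= 1
    return out, best[n]
```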
Can we extend this idea (up to N moves in parallel by dynamic programming) to neighborhoods beyond SWAP?

Exchange two adjacent blocks, of max widths w ≤ w′; a move is an (i, j, k) triple.
- SWAP:   w = 1, w′ = 1:  O(N)
- INSERT: w = 1, w′ = N:  O(N²)
- BLOCK:  w = N, w′ = N:  O(N³)

Yes. The asymptotic runtime is always unchanged.
Let's define each neighbor by a "colored tree" (just like ITG!)

[Figure: binary trees over 1 2 3 4 5 6 in which each colored internal node
means "swap my children"; applying the swaps yields the neighboring permutation.]

This is like the BLOCK neighborhood, but with multiple block exchanges, which may be nested.
If that was the optimal neighbor … now look for its optimal neighbor: a new tree!
Repeat till we reach a local optimum.

[Figure: successive permutations, e.g., 1 4 5 6 2 3, then 5 6 1 4 2 3, each
reached from the previous one by a new colored tree.]

Each tree defines a neighbor. At each step, optimize over all possible trees by
dynamic programming (CKY parsing). Use your favorite parsing speedups (pruning,
best-first, …).
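For LOP-style pairwise costs, the per-step optimization is CKY-shaped. A sketch (my rendering, 0-indexed; B[u][v] = cost of item u preceding item v): as written, filling the Δ table costs O(N⁵), but the inclusion-exclusion trick shown later in the talk brings the whole step down to O(N³).

```python
import itertools

def best_colored_tree_step(perm, B):
    """One very-large-scale step: over all binary bracketings of perm with an
    optional child-swap at each node (colored trees), return the neighbor with
    the lowest total change in LOP cost."""
    n = len(perm)
    # delta[i,j,k]: change in cross-block cost if perm[i:j] and perm[j:k] swap.
    # (Inner reorderings don't affect it: cross pairs depend only on membership.)
    delta = {}
    for i, j, k in itertools.product(range(n + 1), repeat=3):
        if i < j < k:
            delta[i, j, k] = sum(B[perm[b]][perm[a]] - B[perm[a]][perm[b]]
                                 for a in range(i, j) for b in range(j, k))
    best = [[0.0] * (n + 1) for _ in range(n + 1)]  # best delta for span perm[i:k]
    choice = {}                                      # chosen (split j, swap?) per span
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            k = i + width
            best[i][k], choice[i, k] = min(
                (best[i][j] + best[j][k] + (delta[i, j, k] if s else 0.0), (j, s))
                for j in range(i + 1, k) for s in (False, True))
    def build(i, k):                                 # read off the best neighbor
        if k - i <= 1:
            return list(perm[i:k])
        j, s = choice[i, k]
        left, right = build(i, j), build(j, k)
        return right + left if s else left + right
    return build(0, n), best[0][n]
```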
Very-large-scale versions of SWAP, INSERT, and BLOCK: all by the algorithm we just saw …

The runtime of the algorithm we just saw was O(N³) because we considered O(N³)
distinct (i, j, k) triples. More generally, restrict to only the O(ww′N) triples
of interest to define a smaller neighborhood with runtime O(ww′N). (Yes, the
dynamic-programming recurrences go through.)
How many steps to get from here to there?

initial order: 8 4 6 2 5 3 7 1
best order:    1 2 3 4 5 6 7 8

One twisted-tree step? No: as you probably know, 3 1 4 2 → 1 2 3 4 is impossible.
Can you get to the answer in one step?

[Figure: German-English, Giza++ alignment: how often is the target order
reachable in one step? Often (yay, big neighborhood), but not always (yay,
local search); for longer sentences, usually not.]
How many steps to the answer in the worst case? (What is the diameter of the search space?)

From 8 4 6 2 5 3 7 1 to 1 2 3 4 5 6 7 8:
claim: only log2 N steps at worst (if you know where to step).
Let's sketch the proof!
Quicksort anything into, e.g., 1 2 3 4 5 6 7 8

[Figure: each quicksort pass over 8 4 6 2 5 3 7 1 is realized by one
colored-tree step (right-branching trees of nested block exchanges) that
stably partitions the current order around a pivot.]

Only log2 N steps to get to 1 2 3 4 5 6 7 8 … or to anywhere!
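Here is a small illustration of the argument (my own demo, not the talk's code): a recursive halve-and-swap is exactly a tree of nested adjacent-block exchanges, so one colored-tree step can stably partition every current run around its median; repeating this sorts any permutation of N items in ceil(log2 N) steps.

```python
def partition_one_tree_step(block, is_small):
    """Stably move the 'small' items of block before the rest, using only
    nested exchanges of adjacent sub-blocks (i.e., one colored subtree)."""
    if len(block) <= 1:
        return block
    mid = len(block) // 2
    left = partition_one_tree_step(block[:mid], is_small)    # = smalls + larges
    right = partition_one_tree_step(block[mid:], is_small)   # = smalls + larges
    i, j = sum(map(is_small, left)), sum(map(is_small, right))
    # one adjacent-block exchange: left's larges trade places with right's smalls
    return left[:i] + right[:j] + left[i:] + right[j:]

def steps_to_sort(perm):
    """Each pass partitions every current run around its median; the passes'
    subtrees all fit in one colored tree, so each pass is one search step."""
    runs, steps = [list(perm)], 0
    while any(len(r) > 1 for r in runs):
        new_runs = []
        for r in runs:
            if len(r) <= 1:
                new_runs.append(r)
                continue
            med = sorted(r)[(len(r) - 1) // 2]
            part = partition_one_tree_step(r, lambda x: x <= med)
            k = sum(x <= med for x in r)
            new_runs += [part[:k], part[k:]]
        runs, steps = new_runs, steps + 1
    return steps

assert steps_to_sort([8, 4, 6, 2, 5, 3, 7, 1]) == 3   # = log2(8) steps
```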
How can we find this needle in the haystack of N! possible permutations?

initial order: 1 2 3 4 5 6
best order (according to some cost function): 1 4 2 5 6 3

Defining "best order": What class of cost functions can we handle efficiently?
How fast can we compute a subtree's cost from its child subtrees?
Defining "best order": What class of cost functions?

"Traveling Salesperson Problem" (TSP): the cost of the order 1 4 2 5 6 3 is
a14 + a42 + a25 + a56 + a63 + a31, where

    A =   0  15  22  80   5  -7
        -30   0 -76  24  63 -44
         15  28   0 -15  71 -99
         12   8 -31   0  54  -6
          7  -9  41  24   0  82
          6   5 -22   8  93   0
Defining "best order": What class of cost functions?

"Linear Ordering Problem" (LOP): b26 = cost of 2 preceding 6. Add up the
n(n-1)/2 such costs (any order will incur either b26 or b62), where

    B =   0   5 -22  93   8   6
         12   0   8 -31  -6  54
         -7  41   0  -9  24  82
         88  17  -6   0  12 -60
         11 -17  10 -59   0  23
          5   4 -12   6  55   0
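As a sanity check on the two definitions, a minimal sketch computing both costs of a permutation from matrices like A and B above (0-indexed):

```python
def tsp_cost(order, A):
    """Sum A[u][v] over successive items, closing the tour back to the start."""
    return sum(A[u][v] for u, v in zip(order, order[1:] + order[:1]))

def lop_cost(order, B):
    """Sum B[u][v] over all n(n-1)/2 pairs where u precedes v in the order."""
    return sum(B[order[i]][order[j]]
               for i in range(len(order)) for j in range(i + 1, len(order)))

# e.g., the talk's order 1 4 2 5 6 3 is [0, 3, 1, 4, 5, 2] 0-indexed, and
# tsp_cost reproduces a14 + a42 + a25 + a56 + a63 + a31.
```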
Defining "best order": What class of cost functions?

TSP and LOP are both NP-complete. In fact, they are believed to be
inapproximable: hard even to achieve C × the optimal cost (for any C ≥ 1).

Practical approaches:
- Correct answer, typically fast: branch-and-bound, ILP, …
- Fast answer, typically close to correct: beam search, this talk, …
Defining "best order": What class of cost functions?

Cost of the order 1 4 2 5 6 3 (from initial order 1 2 3 4 5 6):
1. Does my favorite WFSA like this string of #s? (generalizes TSP)
2. Non-local pair order OK? 4 before 3 …? (LOP)
3. Non-local triple order OK? 1…2…3?

Can add these all up …
Costs are derived from source sentence features

[Figure: the initial order 1 2 3 4 5 6 over the French sentence "Marie ne m' a
pas vu" (tags NNP NEG PRP AUX NEG VBN), with its TSP matrix A and LOP matrix B:]

    A =   0  15  22  80   5  -7        B =   0   5 -22  93   8   6
        -30   0 -76  24  63 -44             12   0   8 -31  -6  54
         15  28   0 -15  71 -99             -7  41   0  -9  24  82
         12   8 -31   0  54  -6             88  17  -6   0  12 -60
          7  -9  41  24   0  82             11 -17  10 -59   0  23
          6   5 -22   8  93   0             75   4 -12   6  55   0

Example entries:
- "ne" would like to be brought adjacent to the next NEG word.
- The entry 75 (cost of vu preceding Marie) decomposes as: 50 (a verb, e.g., vu,
  shouldn't precede its subject, e.g., Marie) + 27 (words at a distance of 5
  shouldn't swap order) - 2 (words with PRP between them ought to swap) + … = 75.

Can also include phrase boundary symbols in the input!

FSA costs: a distortion model, and a language model that looks ahead to the
next step (does this order permit a good finite-state translation into good
English?).
The dynamic program must pick the tree that leads to the lowest-cost permutation.

initial order 1 2 3 4 5 6 → 1 4 2 5 6 3

Cost of this order: 1. Does my favorite WFSA like it as a string?
Scoring with a weighted FSA

This particular WFSA implements TSP scoring for N = 3: after you read 1, you're
in state 1; after you read 2, you're in state 2; after you read 3, you're in
state 3 … and this state determines the cost of the next symbol you read.

We'll handle a WFSA with Q states by using a fancier grammar, with nonterminals.
(Now the runtime goes up to O(N³Q³) …)
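One way to picture that fancier grammar, as a sketch (a standard lattice-parsing rendering, assuming a deterministic WFSA given as arc[state][word] = (next state, cost); initial/final state handling is omitted): each chart item records the WFSA states at its two ends, and two items combine only where those states meet.

```python
from collections import defaultdict

def best_neighbor_with_wfsa(perm, arc, Q):
    """CKY over colored trees where item (i, k, q, r) is the best cost of some
    reordering of perm[i:k] that drives the WFSA from state q to state r.
    With Q states the runtime is O(N^3 Q^3)."""
    n = len(perm)
    best = defaultdict(lambda: float("inf"))
    for i, word in enumerate(perm):               # preterminals = WFSA arcs
        for q in range(Q):
            r, c = arc[q][word]
            best[i, i + 1, q, r] = min(best[i, i + 1, q, r], c)
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            k = i + width
            for j in range(i + 1, k):
                for q in range(Q):
                    for s in range(Q):
                        for r in range(Q):
                            keep = best[i, j, q, s] + best[j, k, s, r]
                            swap = best[j, k, q, s] + best[i, j, s, r]
                            c = min(keep, swap)   # a colored node may swap blocks
                            if c < best[i, k, q, r]:
                                best[i, k, q, r] = c
    return min(best[0, n, q, r] for q in range(Q) for r in range(Q))
```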
Including WFSA costs via nonterminals

[Figure: the string 1 2 3 4 5 6 with preterminals 61, 42, 23, 14, I5, 56: WFSA
arcs encoding the transitions of the permuted order 5 6 1 4 2 3.]

A possible preterminal for word 2 is an arc in A that's labeled with 2. The
preterminal 42 rewrites as word 2, with a cost equal to the arc's cost.
Including WFSA costs via nonterminals

[Figure: a CKY chart combining the preterminals over 1 2 3 4 5 6. A constituent
labeled 63 spans a block, and its total cost is the total cost of the best 6→3
path through that block; the root's cost is the cost of the new permutation.]
The dynamic program must pick the tree that leads to the lowest-cost permutation.

initial order 1 2 3 4 5 6 → 1 4 2 5 6 3

Cost of this order: 1. Does my favorite WFSA like it as a string?
2. Non-local pair order OK? 4 before 3 …?
Incorporating the pairwise ordering costs

[Figure: a block exchange over 1 2 3 4 5 6 7 that puts {5,6,7} before {1,2,3,4}.]

So this hypothesis must add the costs 5<1, 5<2, 5<3, 5<4, 6<1, 6<2, 6<3, 6<4,
7<1, 7<2, 7<3, 7<4.

Uh-oh! So now it takes O(N²) time to combine two subtrees, instead of O(1) time?
Nope: dynamic programming to the rescue again!
Computing the LOP cost of a block move

[Figure: the move that puts {5,6,7} before {1,2,3,4}. Naively we would have to
add O(N²) costs just to consider this single neighbor. But by inclusion-
exclusion, the cost of a wide block move = (one narrower move) + (another
narrower move) - (their overlap) + (one new word pair). The narrower moves were
already computed at earlier steps of parsing, so the new cost comes in O(1)!]
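A sketch of that O(1) reuse with memoization (my indexing: 0-indexed, half-open blocks perm[i:j] and perm[j:k]):

```python
def make_block_delta(perm, B):
    """Return delta(i, j, k): the change in LOP cost if the adjacent blocks
    perm[i:j] and perm[j:k] exchange. Each new value costs O(1): it reuses
    three narrower moves by inclusion-exclusion and adds one new word pair."""
    memo = {}
    def delta(i, j, k):
        if i >= j or j >= k:
            return 0.0
        if (i, j, k) not in memo:
            u, v = perm[i], perm[k - 1]   # the one pair the narrower moves miss
            memo[i, j, k] = (delta(i + 1, j, k) + delta(i, j, k - 1)
                             - delta(i + 1, j, k - 1)      # subtract the overlap
                             + B[v][u] - B[u][v])          # add the new pair
        return memo[i, j, k]
    return delta
```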
Incorporating 3-way ordering costs

See the initial paper (Eisner & Tromble 2006). A little tricky, but it comes
"for free" if you're willing to accept a certain restriction on these costs;
more expensive without that restriction, but possible.
Another option: Markov chain Monte Carlo

Random walk in the space of permutations: interpret a permutation's cost as a
log-probability, p(π) = exp(-cost(π)) / Z, and sample a permutation from the
neighborhood instead of always picking the most probable.

Why?
- Simulated annealing might beat greedy-with-random-restarts.
- When learning the parameters of the distribution, we can use sampling to
  compute the feature expectations.
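A minimal Metropolis sketch of the random walk. For brevity it proposes single adjacent swaps rather than the talk's tree neighborhoods (which is exactly why the normal-form issue on the next slide doesn't arise here):

```python
import math, random

def mcmc_permutation(perm, cost, steps=10000, temp=1.0):
    """Metropolis random walk targeting p(pi) proportional to exp(-cost(pi)/temp).
    Proposals are single adjacent swaps (symmetric, so no correction needed)."""
    cur = list(perm)
    cur_cost = cost(cur)
    for _ in range(steps):
        i = random.randrange(len(cur) - 1)
        cur[i], cur[i + 1] = cur[i + 1], cur[i]          # propose a swap
        new_cost = cost(cur)
        # accept with prob min(1, exp((cur_cost - new_cost) / temp)); else undo
        if math.log(random.random()) >= (cur_cost - new_cost) / temp:
            cur[i], cur[i + 1] = cur[i + 1], cur[i]
        else:
            cur_cost = new_cost
    return cur, cur_cost
```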
Another option: Markov chain Monte Carlo

How? Pitfall: sampling a permutation ≠ sampling a tree.
- Spurious ambiguity: some permutations have many trees.
- Solution: exclude some trees, leaving 1 per permutation.
- A normal form has long been known for colored trees.
- For restricted colored trees (which limit the size of the blocks to swap),
  we've devised a more complicated normal form.
Learning the costs

Where do these costs (the matrices A and B above) come from? If we have some
examples on which we know the true permutation, we could try to learn them.
Learning the costs

More precisely, try to learn the weights θ (the knowledge that's reused across
examples), e.g.:
- 50: a verb (e.g., vu) shouldn't precede its subject (e.g., Marie)
- 27: words at a distance of 5 shouldn't swap order
- -2: words with PRP between them ought to swap
- …
Experimenting with training LOP parameters
(LOP is quite fast: O(n³) with no grammar constant)

Example German sentence with POS tags:
  PDS VMFIN PPER ADV APPR ART NN    PTKNEG VVINF $.
  Das kann  ich  so  aus  dem Stand nicht  sagen .

[Figure: the matrix entry B[7,9] highlighted: the cost of word 7 preceding word 9.]
Feature templates for the cost of swapping i, j

[Table: 22 feature templates, plus versions of all of these conjoined with the
distance j - i (binned).]

- Only LOP features so far, and they're unnecessarily simple (they don't
  examine syntactic constituency).
- And the input sequence is only words (not interspersed with syntactic brackets).
Learning LOP costs for MT

Define German′ to be German in English word order. To get German′ for the
training data, use Giza++ to align all German positions to English positions
(disallowing NULL).

Pipeline: German → (LOP reordering) → German′ → (MOSES) → English,
compared against a MOSES baseline.

(Interesting, if odd, to try to reorder with only the LOP costs.)
Learning LOP costs for MT

Easy first try: Naïve Bayes. Treat each feature in θ as independent; count and
normalize over the training data. No real improvement over the baseline.
Learning LOP costs for MT

Easy second try: Perceptron.

[Figure: local search iterates π0 → π1 → … → πn end at a local optimum; the gap
between it and the global optimum π* is search error, and the gap between π*
and the gold standard is model error. The perceptron update pushes the model
toward the gold standard.]

Note: search error can be beneficial, e.g., just take 1 step from the identity
permutation.
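A sketch of the update (the standard structured perceptron; features and local_search are hypothetical stand-ins for the real feature extractor and the local-search decoder):

```python
def perceptron_epoch(examples, theta, features, local_search, lr=1.0):
    """One training pass: decode with the current weights (possibly with
    search error), then update theta toward the gold permutation's features
    and away from the guess's. Since cost = -theta . features, this lowers
    the gold order's cost relative to the guess."""
    for src, gold in examples:           # src: source sentence; gold: true order
        guess = local_search(src, theta)
        if guess != gold:
            for f, v in features(src, gold).items():
                theta[f] = theta.get(f, 0.0) + lr * v
            for f, v in features(src, guess).items():
                theta[f] = theta.get(f, 0.0) - lr * v
    return theta
```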
Benefit from reordering

  Learning method             BLEU vs. German′   BLEU vs. English
  No reordering                    49.65              25.55
  Naïve Bayes (POS)                49.21
  Naïve Bayes (POS+lexical)        49.75
  Perceptron (POS)                 50.05              25.92
  Perceptron (POS+lexical)         51.30              26.34

Obviously, we are not yet unscrambling German: we need more features.
Contrastive estimation (Smith & Eisner 2005)

Maximize the probability of the desired permutation relative to its ITG
neighborhood (the 1-step very-large-scale neighborhood of the gold standard).
- Requires summing over all permutations in the neighborhood.
- Must use normal-form trees here.
- Stochastic gradient descent.

Alternatively, work back from the gold standard.
k-best MIRA in the neighborhood

Make the gold standard beat its local competitors (the current winners in its
1-step very-large-scale neighborhood), and beat the bad ones by a bigger margin.
- Good = close to gold in swap distance?
- Good = close to gold using BLEU?
- Good = translates into English that's close to the reference?

Alternatively, work back from the gold standard.
Alternatively, train each iterate

[Figure: at each search iterate π(0), π(1), …, π(n), update θ so that the
oracle permutation in the neighborhood of π(i) beats the model's best
permutation in that neighborhood.]

Or we could do a k-best MIRA version of this, too, and even use a loss measure
based on lookahead to π(n).
Summary of part II

- Local search is fun and easy: popular elsewhere in AI, and closely related to MCMC sampling.
- Probably useful for translation; maybe for other NP-hard problems too.
- Can efficiently use huge local neighborhoods.
- The algorithms are closely related to parsing and FSMs, and our community knows that stuff better than anyone!