
Shuffling Non-Constituents

Jason Eisner
ACL SSST Workshop, June 2008
with David A. Smith and Roy Tromble

a syntactically-flavored reordering model
syntactically-flavored reordering search methods

Starting point: Synchronous alignment

Synchronous grammars are very pretty. But does parallel text actually have parallel structure?

Depends on what kind of parallel text: Free translations? Noisy translations? Were the parsers trained on parallel annotation schemes?

Depends on what kind of parallel structure: What kinds of divergences can your synchronous grammar formalism capture? E.g., wh-movement versus wh-in-situ.

Synchronous Tree Substitution Grammar

Two training trees, showing a free translation from French to English:
“beaucoup d’enfants donnent un baiser à Sam” (“lots of kids give a kiss to Sam”) / “kids kiss Sam quite often”

[Figure, shown over several builds: the two dependency trees. A possible alignment is shown in orange, pairing elementary trees: donnent / un baiser / à (“give a kiss to”) with kiss, enfants (“kids”) with kids, Sam with Sam, and null adverbs with quite and often. A much worse alignment is shown for contrast. The grammar is this set of elementary trees.]
But many examples are harder

Auf diese Frage habe ich leider keine Antwort bekommen
(To this question have I alas no answer received)
I did not unfortunately receive an answer to this question   (a NULL token is available for unaligned words)

Shown over several builds, this pair exhibits:
- a displaced modifier (the negation)
- a displaced argument (here, because of the projective parser)
- head-swapping (here, due to different annotation conventions)

Free Translation

Tschernobyl könnte dann etwas später an die Reihe kommen
(Chernobyl could then something later on the queue come)
Then we could deal with Chernobyl some time later   (a NULL token is available for unaligned words)

- Probably not systematic (but the words are correctly aligned)
- Erroneous parse

What to do? Current practice:

Don’t try to model all systematic phenomena! Just use non-syntactic alignments (Giza++). Only care about the fragments that recur often:
- phrases or gappy phrases
- sometimes even syntactic constituents (can favor these, e.g., Marton & Resnik 2008)
Use these (gappy) phrases in a decoder, phrase-based or hierarchical.

What to do? Current practice: use non-syntactic alignments (Giza++); keep frequent phrases for a decoder.

But could syntax give us better alignments? It would have to be “loose” syntax …

Why do we want better alignments?
1. Throw away less of the parallel training data.
2. Help learn a smarter, syntactic reordering model (could help decoding: less reliance on the LM).
3. Some applications care about full alignments.

Quasi-synchronous grammar

How do we handle “loose” syntax? Translation story: generate target English by a monolingual grammar. Any grammar formalism is okay; pick a dependency grammar formalism for now.

I did not unfortunately receive an answer to this question

P(PRP | no previous left children of “did”)
P(I | did, PRP)

parsing: O(n³)

Quasi-synchronous grammar

Generate target English by a monolingual grammar, but the probabilities are influenced by the source sentence: each English node is aligned to some source node, and the model prefers to generate children aligned to source nodes near the parent’s.

I did not unfortunately receive an answer to this question

parsing: O(n³)

QCFG Generative Story (source observed)

Auf diese Frage habe ich leider keine Antwort bekommen   (+ NULL)
I did not unfortunately receive an answer to this question

P(PRP | no previous left children of “did”, habe)
P(I | did, PRP, ich)
plus alignment-configuration factors: P(parent–child), P(breakage)

aligned parsing: O(m²n³)
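One way to write this story compactly (my notation, not the talk’s): the probability of an English dependency tree and alignment a, given the source sentence, decomposes over dependency links, with each monolingual factor augmented by an alignment-configuration factor:

$$P(\text{tree}, a \mid \text{source}) \;=\; \prod_{p \to c} P(c \mid p, \, a(p)) \cdot P\big(\text{config}(a(p), a(c))\big),$$

where config(·,·) is one of the source-side configurations on the next slide (parent–child, etc., plus “none of the above”, i.e. breakage); a strict synchronous grammar would force config = parent–child everywhere.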

What’s a “nearby node”?

Given the parent’s alignment, where might the child be aligned? [Figure: the possible source-side configurations, plus “none of the above”; the parent–child configuration is the synchronous-grammar case.]

Quasi-synchronous grammar

How do we handle “loose” syntax? Translation story: generate target English by a monolingual grammar, but with probabilities influenced by the source sentence.

Useful analogies:
1. A generative grammar with latent word senses (here, the “sense” of a target node is the source node it aligns to).
2. An MEMM: generate a tag sequence by an n-gram model, but with probabilities influenced by the word sequence.

Useful analogies (continued):
3. IBM Model 1: source nodes can be freely reused or left unused. Future work: enforce a 1-to-1 alignment to allow good decoding (NP-hard to do exactly).

Some results: Quasi-synchronous Dependency Grammar

Alignment (D. Smith & Eisner 2006): quasi-synchronous is much better than synchronous, and maybe also better than IBM Model 4.

Question answering (Wang et al. 2007): align a question with a potential answer. Mean average precision: 43% (previous state of the art) → 48% (+ QG) → 60% (+ lexical features).

Bootstrapping a parser for a new language (D. Smith & Eisner 2007 & ongoing): learn how parsed parallel text influences target dependencies, along with many other features (cf. co-training). Unsupervised accuracy: German 30% → 69%, Spanish 26% → 65%.

Summary of part I

Current practice: use non-syntactic alignments (Giza++); some bits align nicely; use the frequent bits in a decoder.

Suggestion: let syntax influence alignments.

So far, loose syntax methods are like IBM Model 1: it is NP-hard to enforce 1-to-1 alignment in any interesting model.

Rest of talk: How do we enforce 1-to-1 in interesting models? Can we do something smarter than beam search?

Shuffling Non-Constituents

Jason Eisner
ACL SSST Workshop, June 2008
with David A. Smith and Roy Tromble

a syntactically-flavored reordering model
syntactically-flavored reordering search methods


Motivation

MT is really easy! Just use a finite-state transducer! Phrases, morphology, the works!

Permutation search in MT

initial order (French): Marie/NNP ne/NEG m’/PRP a/AUX pas/NEG vu/VBN   (1 2 3 4 5 6)
best order (French′): 1 4 2 5 6 3, i.e., Marie a ne pas vu m’

“Mary hasn’t seen me”: an easy transduction from the reordered French.

Motivation

MT is really easy! Just use a finite-state transducer! Phrases, morphology, the works! We just have to fix that pesky word order.

Framing it this way lets us enforce 1-to-1 exactly at the permutation step. Deletion and fertility > 1 are still allowed in the subsequent transduction.

Often we want to find an optimal permutation …

- Machine translation: reorder French into French′ (Brown et al. 1992) so that it’s easier to align or translate.
- MT evaluation: how much do you need to rearrange MT output so that it scores well under an LM derived from reference translations?
- Discourse generation, e.g., multi-document summarization: order the output sentences (Lapata 2003) so that they flow nicely.
- Reconstruct the temporal order of events after information extraction.
- Learn rule ordering or constraint ranking for phonology?
- Multi-word anagrams that score well under an LM.

Permutation search: The problem

initial order: 1 2 3 4 5 6
best order (according to some cost function): 1 4 2 5 6 3

How can we find this needle in the haystack of N! possible permutations?

Traditional approach: Beam search

Approximately find the best path through a really big FSA: N! paths, one for each permutation, but only 2^N states. A state remembers what we’ve generated so far (but not in what order); an arc weight is, e.g., the cost of picking 5 next if we’ve seen {1,2,4} so far.
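A minimal sketch of that search (my code, not the talk’s): states are bitmasks over the N words, and next_cost is an assumed callback giving the arc weight for generating word nxt after the set mask (last is the previously generated word, or -1 at the start).

```python
import heapq

def beam_search_permutation(N, next_cost, beam=100):
    """Approximate best path through the 2^N-state automaton: a state is
    the set of words generated so far (a bitmask); next_cost(mask, last, nxt)
    prices generating word nxt next. Keep the `beam` cheapest hypotheses."""
    hyps = [(0.0, 0, -1, ())]        # (cost, mask, last word, sequence so far)
    for _ in range(N):
        expanded = [
            (cost + next_cost(mask, last, nxt),
             mask | (1 << nxt), nxt, seq + (nxt,))
            for cost, mask, last, seq in hyps
            for nxt in range(N) if not (mask >> nxt) & 1   # nxt not yet used
        ]
        hyps = heapq.nsmallest(beam, expanded)              # prune to the beam
    best = min(hyps)
    return best[3], best[0]                                 # permutation, cost
```

For instance, with a pairwise cost matrix B like the LOP matrix introduced later in the talk, one could take next_cost(mask, last, nxt) to be the sum of B[nxt][v] over all still-ungenerated words v, since choosing nxt now places it before all of them.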

An alternative: Local search (“hill climbing”), like the “greedy decoder” of Germann et al. 2001

The SWAP neighborhood of 1 2 3 4 5 6 (cost=22) consists of all single adjacent transpositions:
2 1 3 4 5 6 (cost=26), 1 3 2 4 5 6 (cost=20), 1 2 4 3 5 6 (cost=19), 1 2 3 5 4 6 (cost=25), …

We move to the best neighbor and repeat: cost 22 → 19 → 17 → 16 → …

- Why do the costs always go down? Because we pick the best swap (and stop when none improves).
- How long does it take to pick the best swap? O(N), if you’re careful.
- How many swaps might you need to reach the answer? O(N²).
- What if you get stuck in a local minimum? Random restarts.
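A sketch of that loop (my code, assuming the LOP pairwise cost matrix B of part II): swapping adjacent items x, y changes the total cost by exactly B[y][x] - B[x][y], which is why the best swap is found in O(N).

```python
def swap_hill_climb(perm, B):
    """Greedy local search in the SWAP neighborhood under the LOP cost."""
    perm = list(perm)
    while True:
        # Delta for swapping each adjacent pair: pay B[y][x], stop paying B[x][y].
        deltas = [(B[perm[t + 1]][perm[t]] - B[perm[t]][perm[t + 1]], t)
                  for t in range(len(perm) - 1)]
        best_delta, t = min(deltas)
        if best_delta >= 0:              # local minimum: no swap helps
            return perm
        perm[t], perm[t + 1] = perm[t + 1], perm[t]
```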

Larger neighborhood: INSERT (well known in the literature; reportedly works well)

A single move removes one word and reinserts it anywhere, e.g., taking 1 2 3 4 5 6 (cost=22) directly to a neighbor of cost=17.

- Fewer local minima? Yes: 3 can move past 4 to get past 5 in one move.
- Graph diameter (max #moves needed)? O(N) rather than O(N²).
- How many neighbors? O(N²) rather than O(N).
- How long to find the best neighbor? O(N²) rather than O(N).

Even larger neighborhood: BLOCK

A single move exchanges two adjacent blocks, e.g., taking 1 2 3 4 5 6 (cost=22) directly to a neighbor of cost=14.

- Fewer local minima? Yes: 2 can get past 4 5 without having to cross 3 or move 3 first.
- Graph diameter? Still O(N).
- How many neighbors? O(N³) rather than O(N) or O(N²).
- How long to find the best neighbor? O(N³) rather than O(N) or O(N²).

Larger yet: Via dynamic programming??

- Fewer local minima?
- Graph diameter (max #moves needed)? Logarithmic.
- How many neighbors? Exponential.
- How long to find the best neighbor? Polynomial.

Unifying/generalizing the neighborhoods so far

Over 1 2 3 4 5 6 7 8, a move exchanges two adjacent blocks, of max widths w ≤ w′, and is defined by an (i,j,k) triple.

- SWAP: w=1, w′=1; runtime = #neighbors = O(N)
- INSERT: w=1, w′=N; O(N²)
- BLOCK: w=N, w′=N; O(N³)

In general, runtime = #neighbors = O(ww′N); everything in this talk can be generalized to other values of w, w′.

Very large-scale neighborhoods

What if we consider multiple simultaneous exchanges that are “independent”? The DYNASEARCH neighborhood (Potts & van de Velde 1995; Congram 2000): the lowest-cost neighbor is the lowest-cost path through a lattice over positions 1 … 6, where the cost of an arc is the Δcost of swapping an adjacent pair, e.g. (4,5), here < 0.

Why would this be a good idea?
- Help get out of bad local minima? No; they’re still local minima.
- Help avoid getting into bad local minima? Yes: less greedy. For example, with pairwise costs

B = [ 0 -20   0  80
      0   0 -30   0
      0   0   0 -20
      0   0   0   0 ]

SWAP greedily takes the single best swap (-30), while DYNASEARCH takes the two independent swaps worth -20 + -20 = -40.

Very large-scale neighborhoods (continued)

- More efficient? Yes! A shortest-path algorithm finds the best set of swaps in O(N) time, as fast as the best single swap. Up to N moves for the price of 1: no penalty for “parallelism”! It globally optimizes over exponentially many neighbors (paths).
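A sketch of that DP (my code, again for LOP costs, 0-indexed): best[i] is the best total improvement using only swap positions left of i, exactly the shortest-path lattice on the slide. Disjoint adjacent swaps don’t interact under LOP, so their deltas simply add.

```python
def dynasearch_step(perm, B):
    """Best *set* of non-overlapping adjacent swaps, by a Viterbi-style
    shortest path over positions -- O(N), the cost of one best swap."""
    n = len(perm)
    best = [0.0] * (n + 1)           # best[i]: best delta using positions < i
    take = [False] * (n + 1)
    for i in range(2, n + 1):
        x, y = perm[i - 2], perm[i - 1]
        delta = B[y][x] - B[x][y]    # LOP delta for swapping these neighbors
        best[i], take[i] = min((best[i - 1], False),
                               (best[i - 2] + delta, True))
    out, i = list(perm), n
    while i >= 2:                    # backtrace the chosen swaps
        if take[i]:
            out[i - 2], out[i - 1] = out[i - 1], out[i - 2]
            i -= 2
        else:
            i -= 1
    return out, best[n]              # neighbor, and its total cost change
```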

Can we extend this idea – up to N moves in parallel by dynamic programming – to neighborhoods beyond SWAP?

Recall: a move exchanges two adjacent blocks of max widths w ≤ w′ and is defined by an (i,j,k) triple; runtime = #neighbors = O(ww′N), i.e., O(N), O(N²), O(N³) for SWAP, INSERT, BLOCK.

Yes. And the asymptotic runtime is always unchanged.

Let’s define each neighbor by a “colored tree”: just like ITG!

[Figure, over several builds: a binary tree over the current order 1 2 3 4 5 6, where a colored node means “swap my children”. This is like the BLOCK neighborhood, but with multiple block exchanges, which may be nested.]

If that was the optimal neighbor (say, 1 4 5 6 2 3), now look for its optimal neighbor: a new tree! E.g., 1 4 5 6 2 3 → 5 6 1 4 2 3. Repeat till you reach a local optimum.

Each tree defines a neighbor. At each step, optimize over all possible trees by dynamic programming (CKY parsing). Use your favorite parsing speedups (pruning, best-first, …).
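Here is a compact sketch of one such step (my code, for the LOP cost): CKY over the current order, where each span either keeps or swaps its two child blocks. For brevity the block-pair cost is memoized naively; the talk’s O(N³) version instead builds it incrementally (the “= + - +” trick coming up).

```python
from functools import lru_cache

def itg_step(perm, B):
    """One very-large-scale move: over all colored trees (ITG bracketings
    of the current order, each node optionally swapping its children),
    find the lowest-LOP-cost neighbor by CKY-style dynamic programming."""
    n = len(perm)

    @lru_cache(maxsize=None)
    def pair(i, j, k, l):
        # Cost of every item of perm[i:j] preceding every item of perm[k:l].
        return sum(B[perm[u]][perm[v]]
                   for u in range(i, j) for v in range(k, l))

    @lru_cache(maxsize=None)
    def best(i, j):
        # Lowest (cost, order) achievable for the span perm[i:j].
        if j - i == 1:
            return 0.0, (perm[i],)
        options = []
        for m in range(i + 1, j):
            lc, lo = best(i, m)
            rc, ro = best(m, j)
            options.append((lc + rc + pair(i, m, m, j), lo + ro))  # keep order
            options.append((lc + rc + pair(m, j, i, m), ro + lo))  # swap blocks
        return min(options)

    cost, order = best(0, n)
    return list(order), cost
```

Repeating itg_step until the cost stops improving is the local-search loop of these slides; note the identity tree is always a candidate, so the cost never increases.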

Very-large-scale versions of SWAP, INSERT, and BLOCK all follow from the algorithm we just saw.

Its runtime was O(N³) because we considered O(N³) distinct (i,j,k) triples. More generally, restrict attention to only the O(ww′N) triples of interest; this defines a smaller neighborhood with a runtime of O(ww′N). (Yes, the dynamic-programming recurrences go through.)

How many steps to get from here to there?

initial order: 8 4 6 2 5 3 7 1 → best order: 1 2 3 4 5 6 7 8

In one twisted-tree step? No: as you probably know, 3 1 4 2 → 1 2 3 4 is impossible.

Can you get to the answer in one step? (German–English, Giza++ alignments)

Often (yay, big neighborhood), but not always (yay, local search).

How many steps to the answer in the worst case? (What is the diameter of the search space?)

E.g., 8 4 6 2 5 3 7 1 → 1 2 3 4 5 6 7 8.

Claim: only log₂ N steps at worst (if you know where to step). Let’s sketch the proof!

Quicksort anything into, e.g., 1 2 3 4 5 6 7 8

[Figure: from 8 4 6 2 5 3 7 1, a single right-branching colored tree makes quicksort-style progress; from 1 7 2 4 3 8 5 6, a sequence of right-branching trees completes the sort.]

Only log₂ N steps to get to 1 2 3 4 5 6 7 8 … or to anywhere!

Defining “best order”

initial order 1 2 3 4 5 6 → best order (according to some cost function) 1 4 2 5 6 3

What class of cost functions can we handle efficiently? That is, how fast can we compute a subtree’s cost from its child subtrees?

What class of cost functions? Example: the “Traveling Salesperson Problem” (TSP).

best order: 1 4 2 5 6 3
cost = a₁₄ + a₄₂ + a₂₅ + a₅₆ + a₆₃ (+ a₃₁ to close the tour), where

A = [  0  15  22  80   5  -7
     -30   0 -76  24  63 -44
      15  28   0 -15  71 -99
      12   8 -31   0  54  -6
       7  -9  41  24   0  82
       6   5 -22   8  93   0 ]

What class of cost functions? Example: the “Linear Ordering Problem” (LOP).

best order: 1 4 2 5 6 3
b₂₆ = the cost of 2 preceding 6; add up all n(n-1)/2 such pairwise costs (any order incurs either b₂₆ or b₆₂), where

B = [  0   5 -22  93   8   6
      12   0   8 -31  -6  54
      -7  41   0  -9  24  82
      88  17  -6   0  12 -60
      11 -17  10 -59   0  23
       5   4 -12   6  55   0 ]
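For concreteness, here is how the two scoring functions read in code (a sketch in 0-indexed Python; the slides index words from 1):

```python
def tsp_cost(perm, A, cyclic=True):
    """TSP-style score: sum adjacent-pair costs A[u][v] along the order,
    optionally adding the arc that closes the tour (the a31 term above)."""
    cost = sum(A[u][v] for u, v in zip(perm, perm[1:]))
    return cost + (A[perm[-1]][perm[0]] if cyclic else 0.0)

def lop_cost(perm, B):
    """LOP score: sum B[u][v] over all n(n-1)/2 pairs with u before v."""
    return sum(B[perm[s]][perm[t]]
               for s in range(len(perm))
               for t in range(s + 1, len(perm)))
```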

TSP and LOP are both NP-complete; in fact, they are believed to be inapproximable: it is hard even to achieve C × the optimal cost, for any C ≥ 1.

Practical approaches:
- correct answer, typically fast: branch-and-bound, ILP, …
- fast answer, typically close to correct: beam search, this talk, …

What class of cost functions? Can add these all up …

initial order 1 2 3 4 5 6 → 1 4 2 5 6 3; the cost of this order:
1. Does my favorite WFSA like this string of #s? (generalizes TSP)
2. Is each non-local pair order OK? (“4 before 3 …?”) (generalizes LOP)
3. Is each non-local triple order OK? (“1…2…3?”)

Costs are derived from source sentence features

initial order (French): Marie/NNP ne/NEG m’/PRP a/AUX pas/NEG vu/VBN, with TSP matrix A and LOP matrix B as before.

Example LOP entry: b₆₁ = 75 =
  50 (a verb, e.g. vu, shouldn’t precede its subject, e.g. Marie)
+ 27 (words at a distance of 5 shouldn’t swap order)
-  2 (words with a PRP between them ought to swap).

Other entries encode, e.g., that ne would like to be brought adjacent to the next NEG word. Can also include phrase boundary symbols in the input!

FSA costs: a distortion model; a language model that looks ahead to the next step (does this order permit a good finite-state translation into good English?).

The dynamic program must pick the tree that leads to the lowest-cost permutation (initial order 1 2 3 4 5 6 → 1 4 2 5 6 3). First component of the cost: does my favorite WFSA like it as a string?

Scoring with a weighted FSA

A particular WFSA can implement TSP scoring for N=3: after you read 1, you’re in state 1; after you read 2, state 2; after you read 3, state 3. This state determines the cost of the next symbol you read.

We’ll handle a WFSA with Q states by using a fancier grammar, with nonterminals. (Now the runtime goes up to O(N³Q³) …)

Including WFSA costs via nonterminals

A possible preterminal for word 2 is an arc in the WFSA that is labeled with 2. The preterminal rewrites as word 2, with a cost equal to that arc’s cost.

[Figure: each constituent over a reordered span is labeled with a pair of WFSA states; the constituent’s total cost is the cost of the best path between those states while reading the constituent’s yield, and the root constituent’s cost is the cost of the new permutation.]
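In recurrence form (my notation, not the talk’s): an item $[q \to r]_{i:j}$ scores the best permitted reordering of words $i..j$ whose WFSA reading takes state $q$ to state $r$; two child spans combine in either order through a midpoint state $s$:

$$[q \to r]_{i:j} \;=\; \min_{m,\,s} \min\Big( [q \to s]_{i:m} + [s \to r]_{m:j},\;\; [q \to s]_{m:j} + [s \to r]_{i:m} \Big),$$

with O(N²) spans, O(N) midpoints, and O(Q³) state triples per combination; hence the O(N³Q³) runtime mentioned above.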

The dynamic program must pick the tree that leads to the lowest-cost permutation. Cost of the order 1 4 2 5 6 3 so far:
1. Does my favorite WFSA like it as a string?
2. Is each non-local pair order OK? (“4 before 3 …?”)

Incorporating the pairwise ordering costs

Suppose a block exchange puts {5,6,7} before {1,2,3,4}. This hypothesis must add the costs of the pairs 5<1, 5<2, 5<3, 5<4, 6<1, 6<2, 6<3, 6<4, 7<1, 7<2, 7<3, 7<4.

Uh-oh! So now it takes O(N²) time to combine two subtrees, instead of O(1) time? Nope: dynamic programming to the rescue again!

Computing the LOP cost of a block move

Putting {5,6,7} before {1,2,3,4} seems to require adding O(N²) pairwise costs just to consider this single neighbor. Instead, reuse the work from other, “narrower” block moves, already computed at earlier steps of parsing: the grid of pairwise costs for the wide block pair can be assembled by inclusion–exclusion (“= + - +”), so each new cost is computed in O(1)!
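One concrete way to realize that “= + - +” picture (my notation): precompute the 2-D prefix sums $P(r,c) = \sum_{u \le r} \sum_{v \le c} B[u][v]$ over the words in their current order; then the cost of putting every word of block $i..j$ before every word of block $k..l$ is

$$S(i{:}j,\;k{:}l) \;=\; P(j,l) - P(i{-}1,l) - P(j,k{-}1) + P(i{-}1,k{-}1),$$

an O(1) lookup after O(N²) precomputation. (The talk instead grows these block-pair costs incrementally during parsing, which achieves the same O(1) per combination.)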

Incorporating 3-way ordering costs

See Eisner & Tromble (2006). It is a little tricky, but comes “for free” if you’re willing to accept a certain restriction on these costs; it is more expensive without that restriction, but possible.

Another option: Markov chain Monte Carlo

Do a random walk in the space of permutations: interpret a permutation’s cost as a log-probability, and sample a permutation from the neighborhood instead of always picking the most probable.

Why? Simulated annealing might beat greedy-with-random-restarts, and when learning the parameters of the distribution, sampling can be used to compute the feature expectations.
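A sketch of the sampling step (my code; lop_cost as in the earlier sketch, and the small SWAP neighborhood standing in for the very-large-scale one). For a proper MCMC stationary distribution one would add a Metropolis–Hastings acceptance test; this shows only the proportional draw itself.

```python
import math, random

def adjacent_swaps(perm):
    """Enumerate the SWAP neighborhood (including staying put)."""
    yield list(perm)
    for t in range(len(perm) - 1):
        q = list(perm)
        q[t], q[t + 1] = q[t + 1], q[t]
        yield q

def sample_neighbor(perm, B, temperature=1.0):
    """Draw a neighbor with probability proportional to exp(-cost/T),
    instead of greedily taking the argmin as in hill climbing."""
    cands = list(adjacent_swaps(perm))
    logw = [-lop_cost(p, B) / temperature for p in cands]
    m = max(logw)                                 # subtract max for stability
    weights = [math.exp(w - m) for w in logw]
    return random.choices(cands, weights=weights)[0]
```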

How? Pitfall: sampling a permutation ≠ sampling a tree. Spurious ambiguity: some permutations have many trees. Solution: exclude some trees, leaving one per permutation. A normal form has long been known for colored trees; for restricted colored trees (which limit the size of the blocks to swap), we have devised a more complicated normal form.

Learning the costs

Where do the cost matrices A and B come from? If we have some examples on which we know the true permutation, we can try to learn them. More precisely, we learn the feature weights θ (the knowledge that is reused across examples), such as:
  50: a verb (e.g., vu) shouldn’t precede its subject (e.g., Marie)
  27: words at a distance of 5 shouldn’t swap order
  -2: words with a PRP between them ought to swap

Experimenting with training LOP parameters (LOP is quite fast: O(n³), with no grammar constant)

Das kann ich so aus dem Stand nicht sagen .   (roughly: “I can’t say that offhand.”)
PDS VMFIN PPER ADV APPR ART NN PTKNEG VVINF $.

[Figure: e.g., the entry B[7,9] scores the pairwise order of words 7 and 9.]

LOP feature templates

Only LOP features so far, and they’re unnecessarily simple: they don’t examine syntactic constituency, and the input sequence is only words (not interspersed with syntactic brackets).

Learning LOP costs for MT

Define German′ to be German in English word order. To get German′ for the training data, use Giza++ to align all German positions to English positions (disallowing NULL).

Pipeline: German →(LOP)→ German′ →(MOSES)→ English, versus the MOSES baseline German →(MOSES)→ English.

(It is interesting, if odd, to try to reorder with only the LOP costs.)

Easy first try: Naïve Bayes. Treat each feature in θ as independent; count and normalize over the training data. No real improvement over the baseline.

Easy second try: Perceptron

[Figure: local search walks from the identity permutation π(0) through π(1), …, to a local optimum π(n); the gap between that local optimum and the model’s global optimum is search error, and the gap between the global optimum and the gold standard is model error. Update the weights toward the gold standard.]

Note: search error can be beneficial, e.g., just take one step from the identity permutation.
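A sketch of the update (my code; extract_features is a hypothetical feature extractor returning sparse dicts whose dot product with θ yields the A and B costs):

```python
def perceptron_update(theta, gold_perm, pred_perm, sentence, rate=1.0):
    """Structured-perceptron step: after local search returns pred_perm
    (a local optimum), move theta toward the gold permutation's features
    and away from the prediction's."""
    gold_feats = extract_features(gold_perm, sentence)   # hypothetical helper
    pred_feats = extract_features(pred_perm, sentence)
    for f, v in gold_feats.items():
        theta[f] = theta.get(f, 0.0) + rate * v
    for f, v in pred_feats.items():
        theta[f] = theta.get(f, 0.0) - rate * v
    return theta
```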

Benefit from reordering

Learning method              BLEU vs. German′    BLEU vs. English
No reordering                     49.65               25.55
Naïve Bayes—POS                   49.21
Naïve Bayes—POS+lexical           49.75
Perceptron—POS                    50.05               25.92
Perceptron—POS+lexical            51.30               26.34

Obviously we are not yet unscrambling German: we need more features.

Contrastive estimation (Smith & Eisner 2005)

Maximize the probability of the desired (gold-standard) permutation relative to its one-step very-large-scale ITG neighborhood. This requires summing over all permutations in the neighborhood, so we must use normal-form trees here. Train by stochastic gradient descent.

Alternatively, work back from the gold standard.
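Written out (my notation), with π* the gold permutation and 𝒩(π*) its one-step neighborhood, the per-example objective is

$$\ell(\theta) \;=\; \log \frac{\exp\{-\mathrm{cost}_\theta(\pi^*)\}}{\sum_{\pi \in \mathcal{N}(\pi^*)} \exp\{-\mathrm{cost}_\theta(\pi)\}},$$

where the denominator is computed by the same dynamic program with min replaced by log-sum-exp; this is exactly why spurious ambiguity must first be removed by the normal-form trees above.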

k-best MIRA in the neighborhood

Make the gold standard beat its local competitors (the current winners in its one-step very-large-scale neighborhood), and beat the bad ones by a bigger margin. But what counts as good?
- Good = close to the gold standard in swap distance?
- Good = close to the gold standard under BLEU?
- Good = translates into English that’s close to the reference?

Alternatively, work back from the gold standard.

Alternatively, train each iterate

[Figure: at each local-search step π(0) → π(1) → … → π(n), update the model so that the oracle permutation in the neighborhood of π(i) beats the model’s best permutation in that neighborhood.]

Or do a k-best MIRA version of this, too; one could even use a loss measure based on lookahead to π(n).

Summary of part II

- Local search is fun and easy: it is popular elsewhere in AI and closely related to MCMC sampling.
- It is probably useful for translation, and maybe for other NP-hard problems too.
- We can efficiently use huge local neighborhoods; the algorithms are closely related to parsing and FSMs, and our community knows that stuff better than anyone!