Shuffling Non-Constituents

Jason Eisner
with David A. Smith and Roy Tromble

ACL SSST Workshop (invited talk), June 2008

Part I: a syntactically-flavored reordering model
Part II: syntactically-flavored reordering search methods
Starting point: Synchronous alignment

Synchronous grammars are very pretty. But does parallel text actually have parallel structure?

Depends on what kind of parallel text:
- Free translations? Noisy translations?
- Were the parsers trained on parallel annotation schemes?

Depends on what kind of parallel structure:
- What kinds of divergences can your synchronous grammar formalism capture?
- E.g., wh-movement versus wh in situ
Synchronous Tree Substitution Grammar

[Figure: two training trees, showing a free translation from French to English:
"beaucoup d'enfants donnent un baiser à Sam" / "kids kiss Sam quite often".
French nodes: beaucoup ("lots"), d' ("of"), enfants ("kids"), donnent ("give"),
un ("a"), baiser ("kiss"), à ("to"), Sam; English nodes: kids, kiss, Sam, quite,
often. A possible alignment is shown in orange: the NPs pair up (enfants with
kids, Sam with Sam), the verbal material donnent un baiser à pairs with kiss,
and the extra English adverbs quite and often hang off null-aligned Adv nodes.
A much worse alignment is also shown for contrast.]

Synchronous Grammar = Set of Elementary Trees

[Figure: the aligned tree pair cut into its elementary trees: the paired NP
fragments, the paired verbal fragment (donnent un baiser à / kiss), and the
null-aligned adverb fragments (quite, often).]
But many examples are harder

[Figure: aligned dependency trees for a German-English pair.
German: "Auf diese Frage habe ich leider keine Antwort bekommen"
(gloss: to this question have I alas no answer received).
English: "I did not unfortunately receive an answer to this question"
(with NULL alignments). Successive builds highlight the hard spots:
- Displaced modifier (negation)
- Displaced argument (here, because projective parser)
- Head-swapping (here, just different annotation conventions)]
Free Translation

[Figure: aligned dependency trees for a German-English pair.
German: "Tschernobyl könnte dann etwas später an die Reihe kommen"
(gloss: Chernobyl could then something later on the queue come).
English: "Then we could deal with Chernobyl some time later"
(with NULL alignments). The divergence here is probably not systematic
(but the words are correctly aligned), and part of the parse is erroneous.]
What to do? Current practice:

- Don't try to model all systematic phenomena! Just use non-syntactic alignments (Giza++).
- Only care about the fragments that recur often:
  - phrases or gappy phrases
  - sometimes even syntactic constituents (can favor these, e.g., Marton & Resnik 2008)
- Use these (gappy) phrases in a decoder: phrase-based or hierarchical.
What to do? Current practice:

- Use non-syntactic alignments (Giza++); keep frequent phrases for a decoder.

But could syntax give us better alignments? It would have to be "loose" syntax …

Why do we want better alignments?
1. Throw away less of the parallel training data.
2. Help learn a smarter, syntactic reordering model (could help decoding: less reliance on the LM).
3. Some applications care about full alignments.
Quasi-synchronous grammar

How do we handle "loose" syntax? Translation story:
- Generate target English by a monolingual grammar.
- Any grammar formalism is okay; pick a dependency grammar formalism for now.

Example: "I did not unfortunately receive an answer to this question"
- P(PRP | no previous left children of "did")
- P(I | did, PRP)

Parsing: O(n³)
Quasi-synchronous grammar

How do we handle "loose" syntax? Translation story:
- Generate target English by a monolingual grammar, but probabilities are influenced by the source sentence.
- Each English node is aligned to some source node.
- Prefers to generate children aligned to nearby source nodes.

Example: "I did not unfortunately receive an answer to this question"

Parsing: O(n³)
QCFG Generative Story

[Figure: the observed German sentence "Auf diese Frage habe ich leider keine
Antwort bekommen" (plus NULL), aligned to the English tree being generated:
"I did not unfortunately receive an answer to this question".]

- P(PRP | no previous left children of "did", habe)
- P(I | did, PRP, ich)
- P(parent-child), P(breakage): alignment-configuration probabilities

Aligned parsing: O(m²n³)
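To make the factored story concrete, here is a minimal sketch of scoring one English dependency edge under such a model. All names (Node, CONFIG_LP, tag_lp, lex_lp, src_head) are hypothetical stand-ins, not the authors' code, and the three-way configuration split is a simplification of the talk's richer inventory of "nearby" configurations:

```python
import math
from collections import namedtuple

Node = namedtuple("Node", "word tag")

# Hypothetical toy configuration probabilities; a real model learns these.
CONFIG_LP = {"parent-child": math.log(0.6),       # the synchronous-grammar case
             "breakage": math.log(0.1),           # aligned, but not parent-child
             "none-of-the-above": math.log(0.3)}  # e.g., child aligned to nothing

def edge_logprob(child, parent, a, src_head, tag_lp, lex_lp):
    """log P of generating `child` under `parent` in the English tree, where
    a[node] is the source node each English node is aligned to (or None) and
    src_head maps each source node to its head in the source tree."""
    # Monolingual dependency factors, influenced by the aligned source words:
    lp = tag_lp[child.tag, parent.word, a[parent]]               # P(PRP | "did", habe)
    lp += lex_lp[child.word, parent.word, child.tag, a[child]]   # P(I | did, PRP, ich)
    # Alignment-configuration factor: is the child's source node "nearby"?
    if a[child] is None:
        cfg = "none-of-the-above"
    elif src_head.get(a[child]) == a[parent]:
        cfg = "parent-child"
    else:
        cfg = "breakage"
    return lp + CONFIG_LP[cfg]
```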
What's a "nearby node"?

Given the parent's alignment, where might the child be aligned?

[Figure: the possible configurations of the child's source node relative to the
parent's, including the synchronous-grammar case (parent-child), plus "none of
the above".]
Quasi-synchronous grammar

How do we handle "loose" syntax? Translation story:
- Generate target English by a monolingual grammar, but probabilities are influenced by the source sentence.

Useful analogies:
1. Generative grammar with latent word senses
2. MEMM: generate an n-gram tag sequence (target), but probabilities are influenced by the word sequence (source)
Quasi-synchronous grammar

How do we handle "loose" syntax? Translation story:
- Generate target English by a monolingual grammar, but probabilities are influenced by the source sentence.

Useful analogies:
1. Generative grammar with latent word senses
2. MEMM
3. IBM Model 1: source nodes can be freely reused or unused.
   Future work: enforce 1-to-1 to allow good decoding (NP-hard to do exactly).
Some results: Quasi-synchronous Dependency Grammar

- Alignment (D. Smith & Eisner 2006): quasi-synchronous syntax much better than
  synchronous; maybe also better than IBM Model 4.
- Question answering (Wang et al. 2007): align the question with a potential answer.
  Mean average precision: 43% (previous state of the art) → 48% (+ QG) → 60% (+ lexical features).
- Bootstrapping a parser for a new language (D. Smith & Eisner 2007 & ongoing):
  learn how parsed parallel text influences target dependencies, along with many
  other features (cf. co-training). Unsupervised: German 30% → 69%, Spanish 26% → 65%.
Summary of part I

Current practice:
- Use non-syntactic alignments (Giza++). Some bits align nicely; use the frequent bits in a decoder.

Suggestion: Let syntax influence alignments.
- So far, loose syntax methods are like IBM Model 1.
- NP-hard to enforce 1-to-1 in any interesting model.

Rest of talk: How to enforce 1-to-1 in interesting models? Can we do something smarter than beam search?
Shuffling Non-Constituents

Jason Eisner
with David A. Smith and Roy Tromble

ACL SSST Workshop, June 2008

Part II: syntactically-flavored reordering search methods
(Part I covered the syntactically-flavored reordering model.)
Motivation

MT is really easy! Just use a finite-state transducer! Phrases, morphology, the works!
Permutation search in MT

[Figure: the French sentence "Marie ne m' a pas vu" (tags NNP NEG PRP AUX NEG
VBN), in initial order 1 2 3 4 5 6, is permuted into the best order 1 4 2 5 6 3
(French′), from which "Mary hasn't seen me" is an easy transduction.]
Motivation

MT is really easy! Just use a finite-state transducer! Phrases, morphology, the works!
Just have to fix that pesky word order.

Framing it this way lets us enforce 1-to-1 exactly at the permutation step.
Deletion and fertility > 1 are still allowed in the subsequent transduction.
Often want to find an optimal permutation …

- Machine translation: reorder French to French-prime (Brown et al. 1992), so it's easier to align or translate.
- MT eval: how much do you need to rearrange MT output so it scores well under an LM derived from reference translations?
- Discourse generation, e.g., multi-doc summarization: order the output sentences (Lapata 2003) so they flow nicely.
- Reconstruct the temporal order of events after information extraction.
- Learn rule ordering or constraint ranking for phonology?
- Multi-word anagrams that score well under an LM.
Permutation search: The problem

initial order: 1 2 3 4 5 6
best order (according to some cost function): 1 4 2 5 6 3

How can we find this needle in the haystack of N! possible permutations?
Traditional approach: Beam search

Approximate the best path through a really big FSA:
- N! paths, one for each permutation, but only 2^N states.
- A state remembers what we've generated so far (but not in what order).
- Arc weight = cost of picking 5 next if we've seen {1,2,4} so far.
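As a concrete (hypothetical) rendering of that search space, here is a minimal beam search over partial orders, with an arbitrary arc-cost function standing in for the real model:

```python
import heapq

def beam_search_permutation(n, arc_cost, beam=100):
    """Approximate the min-cost permutation of 0..n-1 by beam search through
    the big FSA. A path's FSA state is really just (set seen, last item):
    paths that generated the same set in a different order reach the same
    state, which is why there are only 2^n states for the n! paths."""
    frontier = [(0.0, ())]                       # (cost so far, partial order)
    for _ in range(n):
        candidates = []
        for cost, order in frontier:
            seen = set(order)
            for nxt in range(n):
                if nxt not in seen:              # arc weight may depend on the
                    c = cost + arc_cost(seen, nxt)   # seen-set and the next item
                    candidates.append((c, order + (nxt,)))
        frontier = heapq.nsmallest(beam, candidates)   # prune to the beam width
    return min(frontier)

# toy usage: a cost that likes small items early
# cost, order = beam_search_permutation(6, lambda seen, nxt: nxt * (6 - len(seen)))
```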
An alternative: Local search ("hill climbing") with the SWAP neighborhood

[Figure: from the current permutation 1 2 3 4 5 6 (cost = 22), the SWAP
neighbors each exchange two adjacent items:
  2 1 3 4 5 6   cost = 26
  1 3 2 4 5 6   cost = 20
  1 2 4 3 5 6   cost = 19   ← best neighbor; move here
  1 2 3 5 4 6   cost = 25]
An alternative: Local search ("hill-climbing")

Like the "greedy decoder" of Germann et al. 2001: start at 1 2 3 4 5 6
(cost = 22) and repeatedly move to the best SWAP neighbor: cost = 19, 17, 16, …

- Why are the costs always going down? We pick the best swap.
- How long does it take to pick the best swap? O(N), if you're careful.
- How many swaps might you need to reach the answer? O(N²).
- What if you get stuck in a local min? Random restarts.
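A minimal sketch of that loop with a generic cost function (random restarts omitted; the careful O(N) bookkeeping mentioned on the slide is replaced by naive re-scoring):

```python
def hill_climb_swap(perm, cost):
    """Greedy local search over the SWAP neighborhood: repeatedly apply the
    best adjacent transposition until no swap lowers the cost."""
    perm = list(perm)
    best = cost(perm)
    while True:
        move, move_cost = None, best
        for i in range(len(perm) - 1):
            perm[i], perm[i + 1] = perm[i + 1], perm[i]   # try swapping i, i+1
            c = cost(perm)
            perm[i], perm[i + 1] = perm[i + 1], perm[i]   # undo
            if c < move_cost:
                move, move_cost = i, c
        if move is None:
            return perm, best            # local minimum: no swap helps
        perm[move], perm[move + 1] = perm[move + 1], perm[move]
        best = move_cost
```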
Larger neighborhood: INSERT
(well-known in the literature; works well)

[Figure: from 1 2 3 4 5 6 (cost = 22), remove one item and reinsert it
anywhere; one such move lowers the cost to 17.]

- Fewer local minima? Yes: 3 can move past 4 to get past 5.
- Graph diameter (max # moves needed)? O(N) rather than O(N²).
- How many neighbors? O(N²) rather than O(N).
- How long to find the best neighbor? O(N²) rather than O(N).
Even larger neighborhood: BLOCK

[Figure: from 1 2 3 4 5 6 (cost = 22), exchange two adjacent blocks; one such
move lowers the cost to 14.]

- Fewer local minima? Yes: 2 can get past 4 5 without having to cross 3 or move 3 first.
- Graph diameter? Still O(N).
- How many neighbors? O(N³) rather than O(N), O(N²).
- How long to find the best neighbor? O(N³) rather than O(N), O(N²).
Larger yet: Via dynamic programming??

- Fewer local minima?
- Graph diameter (max # moves needed)? Logarithmic.
- How many neighbors? Exponential.
- How long to find the best neighbor? Polynomial.
Unifying/generalizing the neighborhoods so far

Exchange two adjacent blocks, of max widths w ≤ w′. A move is defined by an
(i, j, k) triple: the blocks spanning (i, j] and (j, k] trade places.

- SWAP:   w = 1, w′ = 1:  runtime = # neighbors = O(ww′N) = O(N)
- INSERT: w = 1, w′ = N:  O(N²)
- BLOCK:  w = N, w′ = N:  O(N³)

Everything in this talk can be generalized to other values of w, w′.
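In code the move and its enumeration are tiny; a sketch with 0-indexed, half-open blocks (my convention, not the talk's):

```python
def apply_move(perm, i, j, k):
    """Exchange the adjacent blocks perm[i:j] and perm[j:k]."""
    return perm[:i] + perm[j:k] + perm[i:j] + perm[k:]

def neighbors(perm, w, wp):
    """Yield every single block-exchange move whose smaller block has width
    <= w and whose larger block has width <= wp. So w = wp = 1 gives SWAP;
    w = 1, wp = N gives INSERT (move one item past a block); w = wp = N
    gives BLOCK. The number of moves yielded is O(w * wp * N)."""
    n = len(perm)
    for i in range(n - 1):
        for j in range(i + 1, min(i + wp, n - 1) + 1):
            for k in range(j + 1, min(j + wp, n) + 1):
                if min(j - i, k - j) <= w:
                    yield (i, j, k), apply_move(perm, i, j, k)
```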
Very large-scale neighborhoods

What if we consider multiple simultaneous exchanges that are "independent"?

The DYNASEARCH neighborhood (Potts & van de Velde 1995; Congram 2000):

[Figure: a lattice over 1 2 3 4 5 6 whose paths select sets of disjoint
adjacent swaps, yielding neighbors such as 2 1 4 3 6 5 or 1 3 2 5 4 6. The
weight of a swap arc is the Δcost of that swap, e.g., of swapping (4,5),
here < 0. The lowest-cost neighbor is the lowest-cost path.]
Very large-scale neighborhoods

Lowest-cost neighbor is lowest-cost path. Why would this be a good idea?
- Help get out of bad local minima? No: they're still local minima.
- Help avoid getting into bad local minima? Yes: less greedy.

[Figure: a 4-item example, 1 2 3 4 → 2 1 4 3, with cost matrix

    B =  0  -20    0   80
         0    0  -30    0
         0    0    0  -20
         0    0    0    0

Greedy SWAP takes the single best swap, gaining -30; DYNASEARCH instead takes
the two independent swaps (1,2) and (3,4) together, gaining -20 + -20 = -40.]
Very large-scale neighborhoods

Lowest-cost neighbor is lowest-cost path. Why would this be a good idea?
- Help get out of bad local minima? No: they're still local minima.
- Help avoid getting into bad local minima? Yes: less greedy.
- More efficient? Yes! A shortest-path algorithm finds the best set of swaps in
  O(N) time, as fast as the best single swap.

Up to N moves as fast as 1 move: no penalty for "parallelism"!
Globally optimizes over exponentially many neighbors (paths).
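That shortest path is a one-dimensional dynamic program; a minimal sketch, with delta(perm, i) standing in for the model's Δcost of swapping positions i and i+1:

```python
def dynasearch_step(perm, delta):
    """Best set of disjoint adjacent swaps, found by shortest path / DP in O(N).
    delta(perm, i) = change in total cost if perm[i] and perm[i+1] are swapped."""
    n = len(perm)
    best = [0.0] * (n + 1)      # best[i]: best total delta for the prefix perm[:i]
    take = [False] * (n + 1)    # did the best prefix end with a swap of i-2, i-1?
    for i in range(2, n + 1):
        best[i], take[i] = best[i - 1], False        # leave perm[i-1] in place ...
        d = best[i - 2] + delta(perm, i - 2)         # ... or swap the last two
        if d < best[i]:
            best[i], take[i] = d, True
    out, i = list(perm), n                           # trace back the chosen swaps
    while i >= 2:
        if take[i]:
            out[i - 2], out[i - 1] = out[i - 1], out[i - 2]
            i -= 2
        else:
            i -= 1
    return out, best[n]
```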
Can we extend this idea (up to N moves in parallel by dynamic programming) to neighborhoods beyond SWAP?

Exchange two adjacent blocks, of max widths w ≤ w′; a move is an (i, j, k) triple.
- SWAP:   w = 1, w′ = 1:  O(N)
- INSERT: w = 1, w′ = N:  O(N²)
- BLOCK:  w = N, w′ = N:  O(N³)

Yes. The asymptotic runtime is always unchanged.
Let's define each neighbor by a "colored tree" (just like ITG!)

[Figure: binary trees over 1 2 3 4 5 6 in which each colored internal node
means "swap my children"; applying the swaps yields the neighboring permutation.]

This is like the BLOCK neighborhood, but with multiple block exchanges, which may be nested.
If that was the optimal neighbor … now look for its optimal neighbor: a new tree!
Repeat till we reach a local optimum.

[Figure: successive permutations, e.g., 1 4 5 6 2 3, then 5 6 1 4 2 3, each
reached from the previous one by a new colored tree.]

Each tree defines a neighbor. At each step, optimize over all possible trees by
dynamic programming (CKY parsing). Use your favorite parsing speedups (pruning,
best-first, …).
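For LOP-style pairwise costs, the per-step optimization is CKY-shaped. A sketch (my rendering, 0-indexed; B[u][v] = cost of item u preceding item v): as written, filling the Δ table costs O(N⁵), but the inclusion-exclusion trick shown later in the talk brings the whole step down to O(N³).

```python
import itertools

def best_colored_tree_step(perm, B):
    """One very-large-scale step: over all binary bracketings of perm with an
    optional child-swap at each node (colored trees), return the neighbor with
    the lowest total change in LOP cost."""
    n = len(perm)
    # delta[i,j,k]: change in cross-block cost if perm[i:j] and perm[j:k] swap.
    # (Inner reorderings don't affect it: cross pairs depend only on membership.)
    delta = {}
    for i, j, k in itertools.product(range(n + 1), repeat=3):
        if i < j < k:
            delta[i, j, k] = sum(B[perm[b]][perm[a]] - B[perm[a]][perm[b]]
                                 for a in range(i, j) for b in range(j, k))
    best = [[0.0] * (n + 1) for _ in range(n + 1)]  # best delta for span perm[i:k]
    choice = {}                                      # chosen (split j, swap?) per span
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            k = i + width
            best[i][k], choice[i, k] = min(
                (best[i][j] + best[j][k] + (delta[i, j, k] if s else 0.0), (j, s))
                for j in range(i + 1, k) for s in (False, True))
    def build(i, k):                                 # read off the best neighbor
        if k - i <= 1:
            return list(perm[i:k])
        j, s = choice[i, k]
        left, right = build(i, j), build(j, k)
        return right + left if s else left + right
    return build(0, n), best[0][n]
```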
Very-large-scale versions of SWAP, INSERT, and BLOCK: all by the algorithm we just saw …

The runtime of the algorithm we just saw was O(N³) because we considered O(N³)
distinct (i, j, k) triples. More generally, restrict to only the O(ww′N) triples
of interest to define a smaller neighborhood with runtime O(ww′N). (Yes, the
dynamic-programming recurrences go through.)
How many steps to get from here to there?

initial order: 8 4 6 2 5 3 7 1
best order:    1 2 3 4 5 6 7 8

One twisted-tree step? No: as you probably know, 3 1 4 2 → 1 2 3 4 is impossible.
Can you get to the answer in one step?

[Figure: German-English, Giza++ alignment: how often is the target order
reachable in one step? Often (yay, big neighborhood), but not always (yay,
local search); for longer sentences, usually not.]
How many steps to the answer in the worst case? (What is the diameter of the search space?)

From 8 4 6 2 5 3 7 1 to 1 2 3 4 5 6 7 8:
claim: only log2 N steps at worst (if you know where to step).
Let's sketch the proof!
Quicksort anything into, e.g., 1 2 3 4 5 6 7 8

[Figure: each quicksort pass over 8 4 6 2 5 3 7 1 is realized by one
colored-tree step (right-branching trees of nested block exchanges) that
stably partitions the current order around a pivot.]

Only log2 N steps to get to 1 2 3 4 5 6 7 8 … or to anywhere!
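Here is a small illustration of the argument (my own demo, not the talk's code): a recursive halve-and-swap is exactly a tree of nested adjacent-block exchanges, so one colored-tree step can stably partition every current run around its median; repeating this sorts any permutation of N items in ceil(log2 N) steps.

```python
def partition_one_tree_step(block, is_small):
    """Stably move the 'small' items of block before the rest, using only
    nested exchanges of adjacent sub-blocks (i.e., one colored subtree)."""
    if len(block) <= 1:
        return block
    mid = len(block) // 2
    left = partition_one_tree_step(block[:mid], is_small)    # = smalls + larges
    right = partition_one_tree_step(block[mid:], is_small)   # = smalls + larges
    i, j = sum(map(is_small, left)), sum(map(is_small, right))
    # one adjacent-block exchange: left's larges trade places with right's smalls
    return left[:i] + right[:j] + left[i:] + right[j:]

def steps_to_sort(perm):
    """Each pass partitions every current run around its median; the passes'
    subtrees all fit in one colored tree, so each pass is one search step."""
    runs, steps = [list(perm)], 0
    while any(len(r) > 1 for r in runs):
        new_runs = []
        for r in runs:
            if len(r) <= 1:
                new_runs.append(r)
                continue
            med = sorted(r)[(len(r) - 1) // 2]
            part = partition_one_tree_step(r, lambda x: x <= med)
            k = sum(x <= med for x in r)
            new_runs += [part[:k], part[k:]]
        runs, steps = new_runs, steps + 1
    return steps

assert steps_to_sort([8, 4, 6, 2, 5, 3, 7, 1]) == 3   # = log2(8) steps
```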
How can we find this needle in the haystack of N! possible permutations?

initial order: 1 2 3 4 5 6
best order (according to some cost function): 1 4 2 5 6 3

Defining "best order": What class of cost functions can we handle efficiently?
How fast can we compute a subtree's cost from its child subtrees?
Defining "best order": What class of cost functions?

"Traveling Salesperson Problem" (TSP): the cost of the order 1 4 2 5 6 3 is
a14 + a42 + a25 + a56 + a63 + a31, where

    A =   0  15  22  80   5  -7
        -30   0 -76  24  63 -44
         15  28   0 -15  71 -99
         12   8 -31   0  54  -6
          7  -9  41  24   0  82
          6   5 -22   8  93   0
Defining "best order": What class of cost functions?

"Linear Ordering Problem" (LOP): b26 = cost of 2 preceding 6. Add up the
n(n-1)/2 such costs (any order will incur either b26 or b62), where

    B =   0   5 -22  93   8   6
         12   0   8 -31  -6  54
         -7  41   0  -9  24  82
         88  17  -6   0  12 -60
         11 -17  10 -59   0  23
          5   4 -12   6  55   0
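As a sanity check on the two definitions, a minimal sketch computing both costs of a permutation from matrices like A and B above (0-indexed):

```python
def tsp_cost(order, A):
    """Sum A[u][v] over successive items, closing the tour back to the start."""
    return sum(A[u][v] for u, v in zip(order, order[1:] + order[:1]))

def lop_cost(order, B):
    """Sum B[u][v] over all n(n-1)/2 pairs where u precedes v in the order."""
    return sum(B[order[i]][order[j]]
               for i in range(len(order)) for j in range(i + 1, len(order)))

# e.g., the talk's order 1 4 2 5 6 3 is [0, 3, 1, 4, 5, 2] 0-indexed, and
# tsp_cost reproduces a14 + a42 + a25 + a56 + a63 + a31.
```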
Defining "best order": What class of cost functions?

TSP and LOP are both NP-complete. In fact, they are believed to be
inapproximable: hard even to achieve C × the optimal cost (for any C ≥ 1).

Practical approaches:
- Correct answer, typically fast: branch-and-bound, ILP, …
- Fast answer, typically close to correct: beam search, this talk, …
Defining "best order": What class of cost functions?

Cost of the order 1 4 2 5 6 3 (from initial order 1 2 3 4 5 6):
1. Does my favorite WFSA like this string of #s? (generalizes TSP)
2. Non-local pair order OK? 4 before 3 …? (LOP)
3. Non-local triple order OK? 1…2…3?

Can add these all up …
Costs are derived from source sentence features

[Figure: the initial order 1 2 3 4 5 6 over the French sentence "Marie ne m' a
pas vu" (tags NNP NEG PRP AUX NEG VBN), with its TSP matrix A and LOP matrix B:]

    A =   0  15  22  80   5  -7        B =   0   5 -22  93   8   6
        -30   0 -76  24  63 -44             12   0   8 -31  -6  54
         15  28   0 -15  71 -99             -7  41   0  -9  24  82
         12   8 -31   0  54  -6             88  17  -6   0  12 -60
          7  -9  41  24   0  82             11 -17  10 -59   0  23
          6   5 -22   8  93   0             75   4 -12   6  55   0

Example entries:
- "ne" would like to be brought adjacent to the next NEG word.
- The entry 75 (cost of vu preceding Marie) decomposes as: 50 (a verb, e.g., vu,
  shouldn't precede its subject, e.g., Marie) + 27 (words at a distance of 5
  shouldn't swap order) - 2 (words with PRP between them ought to swap) + … = 75.

Can also include phrase boundary symbols in the input!

FSA costs: a distortion model, and a language model that looks ahead to the
next step (does this order permit a good finite-state translation into good
English?).
The dynamic program must pick the tree that leads to the lowest-cost permutation.

initial order 1 2 3 4 5 6 → 1 4 2 5 6 3

Cost of this order: 1. Does my favorite WFSA like it as a string?
Scoring with a weighted FSA

This particular WFSA implements TSP scoring for N = 3: after you read 1, you're
in state 1; after you read 2, you're in state 2; after you read 3, you're in
state 3 … and this state determines the cost of the next symbol you read.

We'll handle a WFSA with Q states by using a fancier grammar, with nonterminals.
(Now the runtime goes up to O(N³Q³) …)
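One way to picture that fancier grammar, as a sketch (a standard lattice-parsing rendering, assuming a deterministic WFSA given as arc[state][word] = (next state, cost); initial/final state handling is omitted): each chart item records the WFSA states at its two ends, and two items combine only where those states meet.

```python
from collections import defaultdict

def best_neighbor_with_wfsa(perm, arc, Q):
    """CKY over colored trees where item (i, k, q, r) is the best cost of some
    reordering of perm[i:k] that drives the WFSA from state q to state r.
    With Q states the runtime is O(N^3 Q^3)."""
    n = len(perm)
    best = defaultdict(lambda: float("inf"))
    for i, word in enumerate(perm):               # preterminals = WFSA arcs
        for q in range(Q):
            r, c = arc[q][word]
            best[i, i + 1, q, r] = min(best[i, i + 1, q, r], c)
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            k = i + width
            for j in range(i + 1, k):
                for q in range(Q):
                    for s in range(Q):
                        for r in range(Q):
                            keep = best[i, j, q, s] + best[j, k, s, r]
                            swap = best[j, k, q, s] + best[i, j, s, r]
                            c = min(keep, swap)   # a colored node may swap blocks
                            if c < best[i, k, q, r]:
                                best[i, k, q, r] = c
    return min(best[0, n, q, r] for q in range(Q) for r in range(Q))
```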
Including WFSA costs via nonterminals

[Figure: the string 1 2 3 4 5 6 with preterminals 61, 42, 23, 14, I5, 56: WFSA
arcs encoding the transitions of the permuted order 5 6 1 4 2 3.]

A possible preterminal for word 2 is an arc in A that's labeled with 2. The
preterminal 42 rewrites as word 2, with a cost equal to the arc's cost.
Including WFSA costs via nonterminals

[Figure: a CKY chart combining the preterminals over 1 2 3 4 5 6. A constituent
labeled 63 spans a block, and its total cost is the total cost of the best 6→3
path through that block; the root's cost is the cost of the new permutation.]
The dynamic program must pick the tree that leads to the lowest-cost permutation.

initial order 1 2 3 4 5 6 → 1 4 2 5 6 3

Cost of this order: 1. Does my favorite WFSA like it as a string?
2. Non-local pair order OK? 4 before 3 …?
Incorporating the pairwise ordering costs

[Figure: a block exchange over 1 2 3 4 5 6 7 that puts {5,6,7} before {1,2,3,4}.]

So this hypothesis must add the costs 5<1, 5<2, 5<3, 5<4, 6<1, 6<2, 6<3, 6<4,
7<1, 7<2, 7<3, 7<4.

Uh-oh! So now it takes O(N²) time to combine two subtrees, instead of O(1) time?
Nope: dynamic programming to the rescue again!
Computing the LOP cost of a block move

[Figure: the move that puts {5,6,7} before {1,2,3,4}. Naively we would have to
add O(N²) costs just to consider this single neighbor. But by inclusion-
exclusion, the cost of a wide block move = (one narrower move) + (another
narrower move) - (their overlap) + (one new word pair). The narrower moves were
already computed at earlier steps of parsing, so the new cost comes in O(1)!]
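A sketch of that O(1) reuse with memoization (my indexing: 0-indexed, half-open blocks perm[i:j] and perm[j:k]):

```python
def make_block_delta(perm, B):
    """Return delta(i, j, k): the change in LOP cost if the adjacent blocks
    perm[i:j] and perm[j:k] exchange. Each new value costs O(1): it reuses
    three narrower moves by inclusion-exclusion and adds one new word pair."""
    memo = {}
    def delta(i, j, k):
        if i >= j or j >= k:
            return 0.0
        if (i, j, k) not in memo:
            u, v = perm[i], perm[k - 1]   # the one pair the narrower moves miss
            memo[i, j, k] = (delta(i + 1, j, k) + delta(i, j, k - 1)
                             - delta(i + 1, j, k - 1)      # subtract the overlap
                             + B[v][u] - B[u][v])          # add the new pair
        return memo[i, j, k]
    return delta
```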
Incorporating 3-way ordering costs

See the initial paper (Eisner & Tromble 2006). A little tricky, but it comes
"for free" if you're willing to accept a certain restriction on these costs;
more expensive without that restriction, but possible.
Another option: Markov chain Monte Carlo

Random walk in the space of permutations: interpret a permutation's cost as a
log-probability, p(π) = exp(-cost(π)) / Z, and sample a permutation from the
neighborhood instead of always picking the most probable.

Why?
- Simulated annealing might beat greedy-with-random-restarts.
- When learning the parameters of the distribution, we can use sampling to
  compute the feature expectations.
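A minimal Metropolis sketch of the random walk. For brevity it proposes single adjacent swaps rather than the talk's tree neighborhoods (which is exactly why the normal-form issue on the next slide doesn't arise here):

```python
import math, random

def mcmc_permutation(perm, cost, steps=10000, temp=1.0):
    """Metropolis random walk targeting p(pi) proportional to exp(-cost(pi)/temp).
    Proposals are single adjacent swaps (symmetric, so no correction needed)."""
    cur = list(perm)
    cur_cost = cost(cur)
    for _ in range(steps):
        i = random.randrange(len(cur) - 1)
        cur[i], cur[i + 1] = cur[i + 1], cur[i]          # propose a swap
        new_cost = cost(cur)
        # accept with prob min(1, exp((cur_cost - new_cost) / temp)); else undo
        if math.log(random.random()) >= (cur_cost - new_cost) / temp:
            cur[i], cur[i + 1] = cur[i + 1], cur[i]
        else:
            cur_cost = new_cost
    return cur, cur_cost
```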
Another option: Markov chain Monte Carlo

How? Pitfall: sampling a permutation ≠ sampling a tree.
- Spurious ambiguity: some permutations have many trees.
- Solution: exclude some trees, leaving 1 per permutation.
- A normal form has long been known for colored trees.
- For restricted colored trees (which limit the size of the blocks to swap),
  we've devised a more complicated normal form.
Learning the costs

Where do these costs (the matrices A and B above) come from? If we have some
examples on which we know the true permutation, we could try to learn them.
Learning the costs

More precisely, try to learn the weights θ (the knowledge that's reused across
examples), e.g.:
- 50: a verb (e.g., vu) shouldn't precede its subject (e.g., Marie)
- 27: words at a distance of 5 shouldn't swap order
- -2: words with PRP between them ought to swap
- …
Experimenting with training LOP parameters
(LOP is quite fast: O(n³) with no grammar constant)

Example German sentence with POS tags:
  PDS VMFIN PPER ADV APPR ART NN    PTKNEG VVINF $.
  Das kann  ich  so  aus  dem Stand nicht  sagen .

[Figure: the matrix entry B[7,9] highlighted: the cost of word 7 preceding word 9.]
Feature templates for the cost of swapping i, j

[Table: 22 feature templates, plus versions of all of these conjoined with the
distance j - i (binned).]

- Only LOP features so far, and they're unnecessarily simple (they don't
  examine syntactic constituency).
- And the input sequence is only words (not interspersed with syntactic brackets).
Learning LOP costs for MT

Define German′ to be German in English word order. To get German′ for the
training data, use Giza++ to align all German positions to English positions
(disallowing NULL).

Pipeline: German → (LOP reordering) → German′ → (MOSES) → English,
compared against a MOSES baseline.

(Interesting, if odd, to try to reorder with only the LOP costs.)
Learning LOP costs for MT

Easy first try: Naïve Bayes. Treat each feature in θ as independent; count and
normalize over the training data. No real improvement over the baseline.
Learning LOP costs for MT

Easy second try: Perceptron.

[Figure: local search iterates π0 → π1 → … → πn end at a local optimum; the gap
between it and the global optimum π* is search error, and the gap between π*
and the gold standard is model error. The perceptron update pushes the model
toward the gold standard.]

Note: search error can be beneficial, e.g., just take 1 step from the identity
permutation.
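A sketch of the update (the standard structured perceptron; features and local_search are hypothetical stand-ins for the real feature extractor and the local-search decoder):

```python
def perceptron_epoch(examples, theta, features, local_search, lr=1.0):
    """One training pass: decode with the current weights (possibly with
    search error), then update theta toward the gold permutation's features
    and away from the guess's. Since cost = -theta . features, this lowers
    the gold order's cost relative to the guess."""
    for src, gold in examples:           # src: source sentence; gold: true order
        guess = local_search(src, theta)
        if guess != gold:
            for f, v in features(src, gold).items():
                theta[f] = theta.get(f, 0.0) + lr * v
            for f, v in features(src, guess).items():
                theta[f] = theta.get(f, 0.0) - lr * v
    return theta
```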
Benefit from reordering

  Learning method             BLEU vs. German′   BLEU vs. English
  No reordering                    49.65              25.55
  Naïve Bayes (POS)                49.21
  Naïve Bayes (POS+lexical)        49.75
  Perceptron (POS)                 50.05              25.92
  Perceptron (POS+lexical)         51.30              26.34

Obviously, we are not yet unscrambling German: we need more features.
Contrastive estimation (Smith & Eisner 2005)

Maximize the probability of the desired permutation relative to its ITG
neighborhood (the 1-step very-large-scale neighborhood of the gold standard).
- Requires summing over all permutations in the neighborhood.
- Must use normal-form trees here.
- Stochastic gradient descent.

Alternatively, work back from the gold standard.
k-best MIRA in the neighborhood

Make the gold standard beat its local competitors (the current winners in its
1-step very-large-scale neighborhood), and beat the bad ones by a bigger margin.
- Good = close to gold in swap distance?
- Good = close to gold using BLEU?
- Good = translates into English that's close to the reference?

Alternatively, work back from the gold standard.
Alternatively, train each iterate

[Figure: at each search iterate π(0), π(1), …, π(n), update θ so that the
oracle permutation in the neighborhood of π(i) beats the model's best
permutation in that neighborhood.]

Or we could do a k-best MIRA version of this, too, and even use a loss measure
based on lookahead to π(n).
Summary of part II

- Local search is fun and easy: popular elsewhere in AI, and closely related to MCMC sampling.
- Probably useful for translation; maybe for other NP-hard problems too.
- Can efficiently use huge local neighborhoods.
- The algorithms are closely related to parsing and FSMs, and our community knows that stuff better than anyone!