Machine Translation
Discriminative Word Alignment
Stephan Vogel, Spring Semester 2011
Stephan Vogel - Machine Translation
Generative Alignment Models
- Generative word alignment models: P(f, a | e), with the alignment a as a hidden variable
  - The actual word alignment is not observed, so we sum over all alignments
- Well-known examples: IBM models 1-5, HMM, ITG
  - They model lexical association, distortion, and fertility
- It is difficult to incorporate additional information:
  - POS of words (used in the distortion model, not as direct link features)
  - Manual dictionaries
  - Syntax information
  - …
Discriminative Word Alignment
- Model the alignment directly: p(a | f, e); find the alignment that maximizes p(a | f, e)
- A well-suited framework: maximum entropy
  - A set of feature functions h_m(a, f, e), m = 1, …, M
  - A set of model parameters (feature weights) λ_m, m = 1, …, M
- Decision rule:

  â = argmax_a p(a | f, e) = argmax_a Σ_m λ_m h_m(a, f, e)
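Because the normalizer is constant for a given sentence pair, the decision rule reduces to comparing log-linear scores across candidate alignments. A minimal sketch (the feature functions, weights, and candidate set below are illustrative, not from the lecture):

```python
def score(alignment, f, e, features, weights):
    """Log-linear score: sum_m lambda_m * h_m(a, f, e)."""
    return sum(w * h(alignment, f, e) for h, w in zip(features, weights))

def best_alignment(candidates, f, e, features, weights):
    """Decision rule: argmax over candidate alignments.
    The normalizer Z(f, e) is the same for every candidate,
    so it cancels in the argmax."""
    return max(candidates, key=lambda a: score(a, f, e, features, weights))
```

Alignments are represented as sets of (j, i) link pairs; any feature function over such a set plugs in unchanged.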
Tasks
- Modeling: design feature functions that capture cross-lingual divergences
- Search: find the alignment with the highest probability
- Training: find optimal feature weights
  - Minimize alignment errors given some gold-standard alignments (notice: the alignments are no longer hidden!)
  - Supervised training, i.e. we evaluate against a gold standard
- Notice: feature functions may themselves result from some training procedure
  - E.g. use a statistical dictionary resulting from IBM-model alignment, trained on a large corpus
  - Here an additional training step on a small (hand-aligned) corpus (similar to MERT for the decoder)
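Gold-standard alignments typically distinguish sure (S) and possible (P) links, and alignment quality is usually measured with the alignment error rate (AER), which also reappears later as a tuning criterion. A sketch of the standard formula:

```python
def aer(hypothesis, sure, possible):
    """Alignment Error Rate:
        AER = 1 - (|A ∩ S| + |A ∩ P|) / (|A| + |S|)
    where A is the hypothesis link set, S the sure links, and P the
    possible links (S is conventionally a subset of P)."""
    a, s = set(hypothesis), set(sure)
    p = set(possible) | s           # ensure S ⊆ P
    if not a and not s:
        return 0.0
    return 1.0 - (len(a & s) + len(a & p)) / (len(a) + len(s))
```

A perfect alignment gives AER 0; missing or spurious links push it toward 1.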
2005 – Year of DWA
- Yang Liu, Qun Liu, and Shouxun Lin. 2005. Log-linear Models for Word Alignment.
- Abraham Ittycheriah and Salim Roukos. 2005. A Maximum Entropy Word Aligner for Arabic-English Machine Translation.
- Ben Taskar, Simon Lacoste-Julien, and Dan Klein. 2005. A Discriminative Matching Approach to Word Alignment.
- Robert C. Moore. 2005. A Discriminative Framework for Bilingual Word Alignment.
- Necip Fazil Ayan, Bonnie J. Dorr, and Christof Monz. 2005. NeurAlign: Combining Word Alignments Using Neural Networks.
Yang Liu et al. 2005
- Start out with features used in generative alignment
- Lexicons, e.g. IBM model 1
  - Use both directions: p(f_j | e_i) and p(e_i | f_j) => symmetrical alignment model
  - And/or a symmetric model
- Fertility model: p(φ_i | e_i)
More Features
- Cross count: number of crossings in the alignment
- Neighbor count: number of links in the immediate neighborhood
- Exact match: number of src/tgt pairs where src = tgt
- Linked word count: total number of links (to influence density)
- Link types: how many 1-1, 1-m, m-1, n-m alignments
- Sibling distance: if a word is aligned to multiple words, add the distance between these aligned words
- Link co-occurrence count: given multiple alignments (e.g. Viterbi alignments from IBM models), count how often links co-occur
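Two of the structural features above, the cross count and the neighbor count, can be computed directly from a set of (j, i) links. This is a straightforward reading of the slide's definitions, not the paper's exact code:

```python
def cross_count(links):
    """Number of crossing link pairs: (j, i) and (j', i') cross
    when j < j' but i > i' (a signal of reordering/distortion)."""
    ordered = sorted(links)
    return sum(1 for a in range(len(ordered))
                 for b in range(a + 1, len(ordered))
                 if ordered[a][1] > ordered[b][1])

def neighbor_count(links):
    """For each link, count the links present in its immediate
    3x3 neighborhood of the alignment matrix."""
    s = set(links)
    offsets = [(dj, di) for dj in (-1, 0, 1) for di in (-1, 0, 1)
               if (dj, di) != (0, 0)]
    return sum(1 for (j, i) in s
                 for (dj, di) in offsets if (j + dj, i + di) in s)
```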
Search
- Greedy search based on the gain from adding a link
- For each of the features the gain can be calculated, e.g. for IBM model 1
- Algorithm:

  Start with empty alignment
  Loop until no additional gain:
    Loop over all (j, i) not in set:
      if gain(j, i) > best_gain then store as (j', i')
    Set link(j', i')
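The greedy loop above can be sketched as follows; `gain` is assumed to be a callable returning the model-score change from adding link (j, i) to the current link set:

```python
def greedy_align(J, I, gain):
    """Greedy link addition following the pseudocode above: start with
    an empty alignment and repeatedly add the single link with the
    highest positive gain until no link improves the score.
    J, I: source/target sentence lengths; gain(links, j, i): assumed
    score change from adding (j, i)."""
    links = set()
    while True:
        best, best_gain = None, 0.0
        for j in range(J):
            for i in range(I):
                if (j, i) in links:
                    continue
                g = gain(links, j, i)
                if g > best_gain:
                    best, best_gain = (j, i), g
        if best is None:          # no additional gain
            return links
        links.add(best)
```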
Moore 2005
- Log-likelihood-based model
  - Measures word-association strength
  - Values can get large
- Conditional-link-probability-based model
  - Estimated probability of two words being linked
  - Uses a simpler alignment model to establish links
  - Adds simple smoothing
- Additional features: one-to-one, one-to-many, non-monotonicity
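A sketch of a log-likelihood-ratio association score of the kind Moore uses to measure word-association strength (the paper's exact parameterization may differ; counts are over sentence pairs):

```python
import math

def llr(c_fe, c_f, c_e, n):
    """Log-likelihood-ratio association between words f and e.
    c_fe: sentence pairs containing both f and e; c_f, c_e: pairs
    containing f resp. e; n: total number of sentence pairs."""
    def l(k, m, p):
        # log-likelihood of k successes in m Bernoulli(p) trials
        p = min(max(p, 1e-12), 1.0 - 1e-12)
        return k * math.log(p) + (m - k) * math.log(1.0 - p)
    p = c_e / n                      # P(e) under independence
    p1 = c_fe / c_f                  # P(e | f)
    p2 = (c_e - c_fe) / (n - c_f)    # P(e | not f)
    return 2.0 * (l(c_fe, c_f, p1) + l(c_e - c_fe, n - c_f, p2)
                  - l(c_fe, c_f, p) - l(c_e - c_fe, n - c_f, p))
```

Strongly associated pairs get large positive scores; independent pairs score near zero, which is why the slide notes the values can get large.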
Training
- Finding the optimal alignment is non-trivial
  - Adding a link can affect the non-monotonicity and one-to-many features
  - Dynamic programming does not work
- Beam search could be used; requires pruning
- Parameter optimization: modified version of averaged perceptron learning

  λ_i ← λ_i + (h_i(a_ref, f, e) − h_i(a, f, e))
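The update moves each weight toward the gold alignment's feature value and away from the model's current best guess. A small averaged-perceptron sketch (the data layout and explicit candidate enumeration are illustrative assumptions, not Moore's implementation, which searches for its own best guess):

```python
def perceptron_train(data, features, epochs=5, lr=1.0):
    """Averaged perceptron for alignment feature weights.
    data: (f, e, gold, candidates) tuples, alignments as link sets;
    features: list of h_m(a, f, e) functions.
    Returns the averaged weight vector."""
    w = [0.0] * len(features)
    total = [0.0] * len(features)
    steps = 0
    for _ in range(epochs):
        for f, e, gold, candidates in data:
            # model's current best guess under weights w
            guess = max(candidates,
                        key=lambda a: sum(wi * h(a, f, e)
                                          for wi, h in zip(w, features)))
            if guess != gold:
                # w_i += lr * (h_i(a_ref, f, e) - h_i(a, f, e))
                for m, h in enumerate(features):
                    w[m] += lr * (h(gold, f, e) - h(guess, f, e))
            total = [t + wi for t, wi in zip(total, w)]
            steps += 1
    return [t / steps for t in total]   # averaging reduces overfitting
```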
Modeling Alignment with CRF
- A CRF is an undirected graphical model
  - Each vertex (node) represents a random variable whose distribution is to be inferred
  - Each edge represents a dependency between two random variables
  - The distribution of each discrete random variable Y in the graph is conditioned on an input sequence X
  - Cliques: sets of nodes in the graph that are fully connected
- In our case:
  - Features derived from source and target words are the input sequence X
  - Alignment links are the random variables Y
- Different ways to model alignment:
  - Blunsom & Cohn (2006): many-to-one word alignments, where each source word is aligned with zero or one target words (-> asymmetric)
  - Niehues & Vogel (2008): model not a sequence but the entire alignment matrix (-> symmetric)
Modeling Alignment Matrix
- Random variables y_ji for all possible alignment links
  - Two values, 0/1: the word in position j is not linked/linked to the word in position i
- Represented as nodes in a graph
Modeling Alignment Matrix
- Factored nodes x representing features (observables)
  - Linked to the random variables
  - Define a potential for each y_ji
Probability of Alignment
p(y | x) = (1/Z(x)) Π_{c∈F} Φ_c(y_c, x)
         = (1/Z(x)) Π_{c∈F} exp(λ_c · f_c(y_c, x))

where
- F: the set of factored nodes
- c: a set of connected nodes (a clique)
- f_c(y_c, x): a feature vector
- λ_c: a weight vector
- Φ_c(y_c, x) = exp(λ_c · f_c(y_c, x)): the potential function
- Z(x): normalization constant (partition function), summing the product of potentials over all assignments y
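For a toy alignment matrix the distribution above can be evaluated by brute force, enumerating all 0/1 assignments to compute Z. This is only a didactic sketch: real alignment matrices are far too large, which is why belief propagation is needed later.

```python
import itertools
import math

def crf_prob(y, factors):
    """p(y | x) = (1/Z) * prod_c exp(lambda_c . f_c(y_c, x)) for a
    tiny, fully enumerable model.  `factors` is a list of
    (nodes, potential) pairs; potential(values) returns the log
    potential lambda_c . f_c for that clique's variable values.
    `y` maps each variable (e.g. a (j, i) position) to 0 or 1."""
    variables = sorted({v for nodes, _ in factors for v in nodes})

    def log_score(assign):
        return sum(pot(tuple(assign[v] for v in nodes))
                   for nodes, pot in factors)

    # partition function: sum over all binary assignments
    z = sum(math.exp(log_score(dict(zip(variables, values))))
            for values in itertools.product((0, 1), repeat=len(variables)))
    return math.exp(log_score(y)) / z
```

With a single link variable and one potential, p(link = 1) reduces to a sigmoid of the log potential, which matches the logistic form of the model.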
Features
- Local features, e.g. lexical, POS, …
- Fertility features
- First-order features: capturing the relation between links
- Phrase features: interaction between word and phrase alignment
Local Features
- Local information about link probability
  - Features derived from positions j and i only
  - Factored node connected to only one random variable
- Features:
  - Lexical probabilities, also normalized to (f, e)
  - Word identity (e.g. for numbers, names)
  - Word similarity (e.g. cognates)
  - Relative position distance
  - Link indicator feature: is (j, i) linked in the Viterbi alignment from a generative alignment model
  - POS: indicator feature for every src/tgt POS pair
  - High-frequency word indicator feature for every src/tgt word pair among the most frequent words
Fertility Features
- Model word fertility on the src and tgt side
  - Link to all nodes in a row/column
  - Constraint: model fertility only up to a maximum fertility N
- Indicator features:
  - One for each fertility n <= N
  - One for all fertilities n > N
- Alternative: use fertility probabilities from IBM4 training
  - Now different for different words
First Order Features
- Links depend on the links of neighboring words
- Each factored node connects exactly 2 random-variable nodes
- Different features for different directions: (1,1), (1,2), (2,1), (1,0), …
- Captures distortions, similar to HMM and IBM4 alignment
- Indicator features that fire if both links are set
- Also a POS first-order feature: indicator feature for link(j, i), (POS_j, POS_i), and link(j+k, i+l)
Inference – Finding the Best Alignment
- A word alignment corresponds to an assignment of the random variables
  => Find the most probable variable assignment
- Problem: complex model structure with many loops; no exact inference possible
- Solution: belief propagation algorithm, i.e. inference by message passing
- Runtime is exponential in the number of connected nodes
Belief Propagation
- Messages are sent from random-variable nodes to factored nodes, and also in the opposite direction
- Start with some initial values, e.g. uniform
- In each iteration:
  - Calculate messages from hidden node (j, i) and send them to factored node c
  - Calculate messages from factored node c and send them to hidden node (j, i)
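The two message directions can be sketched as sum-product updates on a small binary factor graph. This is a generic loopy-BP sketch under the factor representation used earlier (lists of (nodes, log-potential) pairs), not the authors' implementation:

```python
import itertools
import math

def loopy_bp(variables, factors, iters=10):
    """Sum-product message passing for binary variables.  Messages
    flow variable -> factor and factor -> variable each iteration;
    the normalized product of incoming factor messages at a variable
    is its belief (approximate posterior)."""
    msg = {}
    for fi, (nodes, _) in enumerate(factors):
        for v in nodes:                       # uniform initialization
            msg[('v', v, fi)] = [1.0, 1.0]
            msg[('f', fi, v)] = [1.0, 1.0]
    for _ in range(iters):
        # variable -> factor: product of the other factors' messages
        for fi, (nodes, _) in enumerate(factors):
            for v in nodes:
                m = [1.0, 1.0]
                for fj, (nodes2, _) in enumerate(factors):
                    if fj != fi and v in nodes2:
                        m = [a * b for a, b in zip(m, msg[('f', fj, v)])]
                s = sum(m) or 1.0
                msg[('v', v, fi)] = [x / s for x in m]
        # factor -> variable: marginalize the factor over the others
        for fi, (nodes, pot) in enumerate(factors):
            for v in nodes:
                others = [u for u in nodes if u != v]
                m = [0.0, 0.0]
                for val in (0, 1):
                    for rest in itertools.product((0, 1), repeat=len(others)):
                        assign = dict(zip(others, rest))
                        assign[v] = val
                        w = math.exp(pot(tuple(assign[u] for u in nodes)))
                        for u in others:
                            w *= msg[('v', u, fi)][assign[u]]
                        m[val] += w
                s = sum(m) or 1.0
                msg[('f', fi, v)] = [x / s for x in m]
    beliefs = {}
    for v in variables:
        b = [1.0, 1.0]
        for fi, (nodes, _) in enumerate(factors):
            if v in nodes:
                b = [a * c for a, c in zip(b, msg[('f', fi, v)])]
        s = sum(b)
        beliefs[v] = [x / s for x in b]
    return beliefs
```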
Getting the Probability
- After several iterations, a belief value is calculated from the messages sent to the hidden nodes
- The belief value can be interpreted as a posterior probability
Training
- Maximize the log-likelihood of the correct alignment
  - Use gradient descent to find the optimum
- Train towards minimum alignment error
  - Need a smoothed version of AER
  - Express AER in terms of link indicator functions
  - Use a sigmoid of the link probability
- Can use a 2-step approach:
  1. Optimize towards ML
  2. Optimize towards AER
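One way to realize the sigmoid-smoothing idea: replace each 0/1 link indicator in the AER counts with a sigmoid of its link score, so |A|, |A ∩ S|, and |A ∩ P| become differentiable sums. This is an illustrative sketch of the idea; the exact smoothed form used in the lecture's system may differ:

```python
import math

def smoothed_aer(scores, sure, possible):
    """Differentiable AER surrogate.  scores: dict mapping each
    candidate link (j, i) to a real-valued model score; sure/possible:
    gold link sets.  Soft link probabilities sigmoid(score) replace
    the hard 0/1 indicators of the exact AER."""
    sig = lambda x: 1.0 / (1.0 + math.exp(-x))
    p = {link: sig(s) for link, s in scores.items()}
    a = sum(p.values())                                  # soft |A|
    a_s = sum(p.get(l, 0.0) for l in sure)               # soft |A ∩ S|
    a_p = sum(p.get(l, 0.0) for l in set(possible) | set(sure))
    return 1.0 - (a_s + a_p) / (a + len(sure))
```

As the scores become confident (large magnitude), the smoothed value approaches the exact AER, so minimizing it by gradient descent approximately minimizes alignment error.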
Some Results: Spanish-English
- Features: IBM1 and IBM4 lexicons, fertilities, link indicator feature, POS features, phrase features
- Impact on translation quality (Bleu scores):

             Dev     Eval
  Baseline   40.04   47.73
  DWA        41.62   48.13
Summary
- In the last 5 years, new efforts in word alignment: discriminative word alignment
  - Integrates many features
  - Needs a small amount of hand-aligned data to tune (train) the feature weights
- Different variants:
  - Log-linear modeling
  - Conditional random fields: sequence models and alignment-matrix models
- Significant improvements in word alignment error rate
  - Not always improvements in translation quality
  - Different density of alignment -> different phrase table size
  - Need to adjust phrase extraction algorithms?