Machine Translation
Discriminative Word Alignment
Stephan Vogel, Spring Semester 2011
Stephan Vogel - Machine Translation
Generative Alignment Models
- Generative word alignment models: P(f, a | e), with the alignment a as a hidden variable
  - The actual word alignment is not observed, so we sum over all alignments
- Well-known examples: IBM models 1-5, HMM, ITG
  - They model lexical association, distortion, and fertility
- It is difficult to incorporate additional information:
  - POS of words (used in the distortion model, not as direct link features)
  - Manual dictionaries
  - Syntax information
  - …
Discriminative Word Alignment
- Model the alignment directly: p(a | f, e); find the alignment that maximizes p(a | f, e)
- A well-suited framework: maximum entropy
  - A set of feature functions h_m(a, f, e), m = 1, …, M
  - A set of model parameters (feature weights) λ_m, m = 1, …, M
- Decision rule:

  â = argmax_a p(a | f, e) = argmax_a Σ_m λ_m h_m(a, f, e)
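Because the normalizer is constant for a given sentence pair, the decision rule reduces to comparing log-linear scores across candidate alignments. A minimal sketch (the feature functions, weights, and candidate set below are illustrative, not from the lecture):

```python
def score(alignment, f, e, features, weights):
    """Log-linear score: sum_m lambda_m * h_m(a, f, e)."""
    return sum(w * h(alignment, f, e) for h, w in zip(features, weights))

def best_alignment(candidates, f, e, features, weights):
    """Decision rule: argmax over candidate alignments.
    The normalizer Z(f, e) is the same for every candidate,
    so it cancels in the argmax."""
    return max(candidates, key=lambda a: score(a, f, e, features, weights))
```

Alignments are represented as sets of (j, i) link pairs; any feature function over such a set plugs in unchanged.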
Tasks
- Modeling: design feature functions that capture cross-lingual divergences
- Search: find the alignment with the highest probability
- Training: find optimal feature weights
  - Minimize alignment errors given some gold-standard alignments (notice: the alignments are no longer hidden!)
  - Supervised training, i.e. we evaluate against a gold standard
- Notice: feature functions may themselves result from some training procedure
  - E.g. use a statistical dictionary resulting from IBM-model alignment, trained on a large corpus
  - Here an additional training step on a small (hand-aligned) corpus (similar to MERT for the decoder)
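Gold-standard alignments typically distinguish sure (S) and possible (P) links, and alignment quality is usually measured with the alignment error rate (AER), which also reappears later as a tuning criterion. A sketch of the standard formula:

```python
def aer(hypothesis, sure, possible):
    """Alignment Error Rate:
        AER = 1 - (|A ∩ S| + |A ∩ P|) / (|A| + |S|)
    where A is the hypothesis link set, S the sure links, and P the
    possible links (S is conventionally a subset of P)."""
    a, s = set(hypothesis), set(sure)
    p = set(possible) | s           # ensure S ⊆ P
    if not a and not s:
        return 0.0
    return 1.0 - (len(a & s) + len(a & p)) / (len(a) + len(s))
```

A perfect alignment gives AER 0; missing or spurious links push it toward 1.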
2005 – Year of DWA
- Yang Liu, Qun Liu, and Shouxun Lin. 2005. Log-linear Models for Word Alignment.
- Abraham Ittycheriah and Salim Roukos. 2005. A Maximum Entropy Word Aligner for Arabic-English Machine Translation.
- Ben Taskar, Simon Lacoste-Julien, and Dan Klein. 2005. A Discriminative Matching Approach to Word Alignment.
- Robert C. Moore. 2005. A Discriminative Framework for Bilingual Word Alignment.
- Necip Fazil Ayan, Bonnie J. Dorr, and Christof Monz. 2005. NeurAlign: Combining Word Alignments Using Neural Networks.
Yang Liu et al. 2005
- Start out with features used in generative alignment
- Lexicons, e.g. IBM model 1
  - Use both directions: p(f_j | e_i) and p(e_i | f_j) => symmetrical alignment model
  - And/or a symmetric model
- Fertility model: p(φ_i | e_i)
More Features
- Cross count: number of crossings in the alignment
- Neighbor count: number of links in the immediate neighborhood
- Exact match: number of src/tgt pairs where src = tgt
- Linked word count: total number of links (to influence density)
- Link types: how many 1-1, 1-m, m-1, n-m alignments
- Sibling distance: if a word is aligned to multiple words, add the distance between these aligned words
- Link co-occurrence count: given multiple alignments (e.g. Viterbi alignments from IBM models), count how often links co-occur
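Two of the structural features above, the cross count and the neighbor count, can be computed directly from a set of (j, i) links. This is a straightforward reading of the slide's definitions, not the paper's exact code:

```python
def cross_count(links):
    """Number of crossing link pairs: (j, i) and (j', i') cross
    when j < j' but i > i' (a signal of reordering/distortion)."""
    ordered = sorted(links)
    return sum(1 for a in range(len(ordered))
                 for b in range(a + 1, len(ordered))
                 if ordered[a][1] > ordered[b][1])

def neighbor_count(links):
    """For each link, count the links present in its immediate
    3x3 neighborhood of the alignment matrix."""
    s = set(links)
    offsets = [(dj, di) for dj in (-1, 0, 1) for di in (-1, 0, 1)
               if (dj, di) != (0, 0)]
    return sum(1 for (j, i) in s
                 for (dj, di) in offsets if (j + dj, i + di) in s)
```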
Search
- Greedy search based on the gain from adding a link
- For each of the features the gain can be calculated, e.g. for IBM model 1
- Algorithm:

  Start with empty alignment
  Loop until no additional gain:
    Loop over all (j, i) not in set:
      if gain(j, i) > best_gain then store as (j', i')
    Set link(j', i')
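The greedy loop above can be sketched as follows; `gain` is assumed to be a callable returning the model-score change from adding link (j, i) to the current link set:

```python
def greedy_align(J, I, gain):
    """Greedy link addition following the pseudocode above: start with
    an empty alignment and repeatedly add the single link with the
    highest positive gain until no link improves the score.
    J, I: source/target sentence lengths; gain(links, j, i): assumed
    score change from adding (j, i)."""
    links = set()
    while True:
        best, best_gain = None, 0.0
        for j in range(J):
            for i in range(I):
                if (j, i) in links:
                    continue
                g = gain(links, j, i)
                if g > best_gain:
                    best, best_gain = (j, i), g
        if best is None:          # no additional gain
            return links
        links.add(best)
```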
Moore 2005
- Log-likelihood-based model
  - Measures word-association strength
  - Values can get large
- Conditional-link-probability-based model
  - Estimated probability of two words being linked
  - Uses a simpler alignment model to establish links
  - Adds simple smoothing
- Additional features: one-to-one, one-to-many, non-monotonicity
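A sketch of a log-likelihood-ratio association score of the kind Moore uses to measure word-association strength (the paper's exact parameterization may differ; counts are over sentence pairs):

```python
import math

def llr(c_fe, c_f, c_e, n):
    """Log-likelihood-ratio association between words f and e.
    c_fe: sentence pairs containing both f and e; c_f, c_e: pairs
    containing f resp. e; n: total number of sentence pairs."""
    def l(k, m, p):
        # log-likelihood of k successes in m Bernoulli(p) trials
        p = min(max(p, 1e-12), 1.0 - 1e-12)
        return k * math.log(p) + (m - k) * math.log(1.0 - p)
    p = c_e / n                      # P(e) under independence
    p1 = c_fe / c_f                  # P(e | f)
    p2 = (c_e - c_fe) / (n - c_f)    # P(e | not f)
    return 2.0 * (l(c_fe, c_f, p1) + l(c_e - c_fe, n - c_f, p2)
                  - l(c_fe, c_f, p) - l(c_e - c_fe, n - c_f, p))
```

Strongly associated pairs get large positive scores; independent pairs score near zero, which is why the slide notes the values can get large.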
Training
- Finding the optimal alignment is non-trivial
  - Adding a link can affect the non-monotonicity and one-to-many features
  - Dynamic programming does not work
- Beam search could be used; requires pruning
- Parameter optimization: modified version of averaged perceptron learning

  λ_i ← λ_i + (h_i(a_ref, f, e) − h_i(a, f, e))
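The update moves each weight toward the gold alignment's feature value and away from the model's current best guess. A small averaged-perceptron sketch (the data layout and explicit candidate enumeration are illustrative assumptions, not Moore's implementation, which searches for its own best guess):

```python
def perceptron_train(data, features, epochs=5, lr=1.0):
    """Averaged perceptron for alignment feature weights.
    data: (f, e, gold, candidates) tuples, alignments as link sets;
    features: list of h_m(a, f, e) functions.
    Returns the averaged weight vector."""
    w = [0.0] * len(features)
    total = [0.0] * len(features)
    steps = 0
    for _ in range(epochs):
        for f, e, gold, candidates in data:
            # model's current best guess under weights w
            guess = max(candidates,
                        key=lambda a: sum(wi * h(a, f, e)
                                          for wi, h in zip(w, features)))
            if guess != gold:
                # w_i += lr * (h_i(a_ref, f, e) - h_i(a, f, e))
                for m, h in enumerate(features):
                    w[m] += lr * (h(gold, f, e) - h(guess, f, e))
            total = [t + wi for t, wi in zip(total, w)]
            steps += 1
    return [t / steps for t in total]   # averaging reduces overfitting
```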
Modeling Alignment with CRF
- A CRF is an undirected graphical model
  - Each vertex (node) represents a random variable whose distribution is to be inferred
  - Each edge represents a dependency between two random variables
  - The distribution of each discrete random variable Y in the graph is conditioned on an input sequence X
  - Cliques: sets of nodes in the graph that are fully connected
- In our case:
  - Features derived from source and target words are the input sequence X
  - Alignment links are the random variables Y
- Different ways to model alignment:
  - Blunsom & Cohn (2006): many-to-one word alignments, where each source word is aligned with zero or one target words (-> asymmetric)
  - Niehues & Vogel (2008): model not a sequence but the entire alignment matrix (-> symmetric)
Modeling Alignment Matrix
- Random variables y_ji for all possible alignment links
  - Two values, 0/1: the word in position j is not linked/linked to the word in position i
- Represented as nodes in a graph
Modeling Alignment Matrix
- Factored nodes x representing features (observables)
  - Linked to the random variables
  - Define a potential for each y_ji
Probability of Alignment
p(y | x) = (1/Z(x)) Π_{c∈F} Φ_c(y_c, x)
         = (1/Z(x)) Π_{c∈F} exp(λ_c · f_c(y_c, x))

where
- F: the set of factored nodes
- c: a set of connected nodes (a clique)
- f_c(y_c, x): a feature vector
- λ_c: a weight vector
- Φ_c(y_c, x) = exp(λ_c · f_c(y_c, x)): the potential function
- Z(x): normalization constant (partition function), summing the product of potentials over all assignments y
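For a toy alignment matrix the distribution above can be evaluated by brute force, enumerating all 0/1 assignments to compute Z. This is only a didactic sketch: real alignment matrices are far too large, which is why belief propagation is needed later.

```python
import itertools
import math

def crf_prob(y, factors):
    """p(y | x) = (1/Z) * prod_c exp(lambda_c . f_c(y_c, x)) for a
    tiny, fully enumerable model.  `factors` is a list of
    (nodes, potential) pairs; potential(values) returns the log
    potential lambda_c . f_c for that clique's variable values.
    `y` maps each variable (e.g. a (j, i) position) to 0 or 1."""
    variables = sorted({v for nodes, _ in factors for v in nodes})

    def log_score(assign):
        return sum(pot(tuple(assign[v] for v in nodes))
                   for nodes, pot in factors)

    # partition function: sum over all binary assignments
    z = sum(math.exp(log_score(dict(zip(variables, values))))
            for values in itertools.product((0, 1), repeat=len(variables)))
    return math.exp(log_score(y)) / z
```

With a single link variable and one potential, p(link = 1) reduces to a sigmoid of the log potential, which matches the logistic form of the model.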
Features
- Local features, e.g. lexical, POS, …
- Fertility features
- First-order features: capturing the relation between links
- Phrase features: interaction between word and phrase alignment
Local Features
- Local information about link probability
  - Features derived from positions j and i only
  - Factored node connected to only one random variable
- Features:
  - Lexical probabilities, also normalized to (f, e)
  - Word identity (e.g. for numbers, names)
  - Word similarity (e.g. cognates)
  - Relative position distance
  - Link indicator feature: is (j, i) linked in the Viterbi alignment from a generative alignment model
  - POS: indicator feature for every src/tgt POS pair
  - High-frequency word indicator feature for every src/tgt word pair among the most frequent words
Fertility Features
- Model word fertility on the src and tgt side
  - Link to all nodes in a row/column
  - Constraint: model fertility only up to a maximum fertility N
- Indicator features:
  - One for each fertility n <= N
  - One for all fertilities n > N
- Alternative: use fertility probabilities from IBM4 training
  - Now different for different words
First Order Features
- Links depend on the links of neighboring words
- Each factored node connects exactly 2 random-variable nodes
- Different features for different directions: (1,1), (1,2), (2,1), (1,0), …
- Captures distortions, similar to HMM and IBM4 alignment
- Indicator features that fire if both links are set
- Also a POS first-order feature: indicator feature for link(j, i), (POS_j, POS_i), and link(j+k, i+l)
Inference – Finding the Best Alignment
- A word alignment corresponds to an assignment of the random variables
  => Find the most probable variable assignment
- Problem: complex model structure with many loops; no exact inference possible
- Solution: belief propagation algorithm, i.e. inference by message passing
- Runtime is exponential in the number of connected nodes
Belief Propagation
- Messages are sent from random-variable nodes to factored nodes, and also in the opposite direction
- Start with some initial values, e.g. uniform
- In each iteration:
  - Calculate messages from hidden node (j, i) and send them to factored node c
  - Calculate messages from factored node c and send them to hidden node (j, i)
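The two message directions can be sketched as sum-product updates on a small binary factor graph. This is a generic loopy-BP sketch under the factor representation used earlier (lists of (nodes, log-potential) pairs), not the authors' implementation:

```python
import itertools
import math

def loopy_bp(variables, factors, iters=10):
    """Sum-product message passing for binary variables.  Messages
    flow variable -> factor and factor -> variable each iteration;
    the normalized product of incoming factor messages at a variable
    is its belief (approximate posterior)."""
    msg = {}
    for fi, (nodes, _) in enumerate(factors):
        for v in nodes:                       # uniform initialization
            msg[('v', v, fi)] = [1.0, 1.0]
            msg[('f', fi, v)] = [1.0, 1.0]
    for _ in range(iters):
        # variable -> factor: product of the other factors' messages
        for fi, (nodes, _) in enumerate(factors):
            for v in nodes:
                m = [1.0, 1.0]
                for fj, (nodes2, _) in enumerate(factors):
                    if fj != fi and v in nodes2:
                        m = [a * b for a, b in zip(m, msg[('f', fj, v)])]
                s = sum(m) or 1.0
                msg[('v', v, fi)] = [x / s for x in m]
        # factor -> variable: marginalize the factor over the others
        for fi, (nodes, pot) in enumerate(factors):
            for v in nodes:
                others = [u for u in nodes if u != v]
                m = [0.0, 0.0]
                for val in (0, 1):
                    for rest in itertools.product((0, 1), repeat=len(others)):
                        assign = dict(zip(others, rest))
                        assign[v] = val
                        w = math.exp(pot(tuple(assign[u] for u in nodes)))
                        for u in others:
                            w *= msg[('v', u, fi)][assign[u]]
                        m[val] += w
                s = sum(m) or 1.0
                msg[('f', fi, v)] = [x / s for x in m]
    beliefs = {}
    for v in variables:
        b = [1.0, 1.0]
        for fi, (nodes, _) in enumerate(factors):
            if v in nodes:
                b = [a * c for a, c in zip(b, msg[('f', fi, v)])]
        s = sum(b)
        beliefs[v] = [x / s for x in b]
    return beliefs
```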
Getting the Probability
- After several iterations, a belief value is calculated from the messages sent to the hidden nodes
- The belief value can be interpreted as a posterior probability
Training
- Maximize the log-likelihood of the correct alignment
  - Use gradient descent to find the optimum
- Train towards minimum alignment error
  - Need a smoothed version of AER
  - Express AER in terms of link indicator functions
  - Use a sigmoid of the link probability
- Can use a 2-step approach:
  1. Optimize towards ML
  2. Optimize towards AER
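One way to realize the sigmoid-smoothing idea: replace each 0/1 link indicator in the AER counts with a sigmoid of its link score, so |A|, |A ∩ S|, and |A ∩ P| become differentiable sums. This is an illustrative sketch of the idea; the exact smoothed form used in the lecture's system may differ:

```python
import math

def smoothed_aer(scores, sure, possible):
    """Differentiable AER surrogate.  scores: dict mapping each
    candidate link (j, i) to a real-valued model score; sure/possible:
    gold link sets.  Soft link probabilities sigmoid(score) replace
    the hard 0/1 indicators of the exact AER."""
    sig = lambda x: 1.0 / (1.0 + math.exp(-x))
    p = {link: sig(s) for link, s in scores.items()}
    a = sum(p.values())                                  # soft |A|
    a_s = sum(p.get(l, 0.0) for l in sure)               # soft |A ∩ S|
    a_p = sum(p.get(l, 0.0) for l in set(possible) | set(sure))
    return 1.0 - (a_s + a_p) / (a + len(sure))
```

As the scores become confident (large magnitude), the smoothed value approaches the exact AER, so minimizing it by gradient descent approximately minimizes alignment error.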
Some Results: Spanish-English
- Features: IBM1 and IBM4 lexicons, fertilities, link indicator feature, POS features, phrase features
- Impact on translation quality (Bleu scores):

             Dev     Eval
  Baseline   40.04   47.73
  DWA        41.62   48.13
Summary
- In the last 5 years, new efforts in word alignment: discriminative word alignment
  - Integrates many features
  - Needs a small amount of hand-aligned data to tune (train) the feature weights
- Different variants:
  - Log-linear modeling
  - Conditional random fields: sequence models and alignment-matrix models
- Significant improvements in word alignment error rate
  - Not always improvements in translation quality
  - Different density of alignment -> different phrase table size
  - Need to adjust phrase extraction algorithms?