ACL 2005 WORKSHOP ON BUILDING AND USING PARALLEL TEXTS (WPT-05), Ann Arbor, MI. June 2005
Competitive Grouping in Integrated Segmentation and Alignment Model
Ying Zhang Stephan Vogel
Language Technologies Institute
School of Computer Science
Carnegie Mellon University
Integrated Segmentation and Alignment Model
• Phrase alignment models (Och et al., 1999; Marcu and Wong, 2002; Koehn et al., 2003)
– Many of these models rely on a pre-calculated word alignment.
– They use different heuristics to extract phrase pairs from the Viterbi word alignment path.
• Integrated Segmentation and Alignment model (Zhang 2003)
– No such word alignments needed
– Segment source and target sentences into phrases and align them simultaneously
– Use chi-square(f, e) instead of the conditional probability P(f|e) for word pair associations
– Greedy search for phrase pairs
– Key idea: competitive grouping algorithm
– Inspired by the competitive linking algorithm (Melamed 1997) for word alignment
Competitive Linking Algorithm
• A greedy word alignment algorithm.
• The word pair with the highest likelihood L(f, e) “wins” the competition.
• One-to-one assumption: once the pair {f, e} is “linked”, neither f nor e can be aligned with any other word.
• Example:
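As a hedged illustration of the greedy procedure described above (not the slide's original example), here is a minimal Python sketch of competitive linking. The function name `competitive_linking`, the `threshold` parameter, and the toy scores are assumptions made for illustration; the words come from the sentence pair used later in the slides, and each word is assumed to occur only once in its sentence.

```python
def competitive_linking(scores, threshold=0.0):
    """Greedy one-to-one word alignment.

    scores: dict mapping (f, e) word pairs to an association score L(f, e).
    threshold: hypothetical cut-off; pairs scoring at or below it are ignored.
    """
    linked_f, linked_e = set(), set()
    links = []
    # Visit candidate pairs from the highest to the lowest score.
    for (f, e), score in sorted(scores.items(), key=lambda kv: -kv[1]):
        if score <= threshold:
            break
        # One-to-one assumption: a linked word cannot be linked again.
        if f in linked_f or e in linked_e:
            continue
        links.append((f, e))
        linked_f.add(f)
        linked_e.add(e)
    return links


# Toy scores, invented for illustration only.
toy_scores = {
    ("Je", "I"): 9.0,
    ("déclare", "declare"): 8.0,
    ("session", "session"): 7.0,
    ("reprise", "resumed"): 6.0,
    ("la", "session"): 5.5,
    ("la", "the"): 5.0,
    ("reprise", "declare"): 3.0,
}
links = competitive_linking(toy_scores)
```

In this toy run, the pair ("la", "session") loses because "session" is already linked to a stronger partner, so "la" falls back to "the".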
Competitive Grouping Algorithm
• Discard the one-to-one assumption of competitive linking, which makes the algorithm less greedy.
• When a pair {e, f} wins the competition, it invites the neighboring pairs to join the “winner’s club”.
• Introduce the locality assumption: a source phrase of adjacent words can only be aligned to a target phrase of adjacent words.
– Words inside the aligned phrase pairs cannot be aligned to other words
Expanding the Phrase Pair Aligned
• Two criteria have to be satisfied to expand the seed word pair into a phrase pair:
1. If a new source word f is to be grouped, the best e that f is associated with should not be “blocked” by this expansion; similarly for grouping a new target word.
2. The highest word pair likelihood value in the expanded area needs to be “similar” to the seed value.
• According to the locality assumption, words in the aligned phrase pairs cannot be aligned with other words again.
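The two expansion criteria can be sketched in Python under some loudly stated assumptions. The slides do not define "blocked" or "similar" precisely, so this sketch reads criterion 1 as "the best partner of every newly grouped word must fall inside the expanded phrase pair" and criterion 2 as "the best new score is at least `ratio` times the seed score". The function `expand_seed`, the `ratio` parameter, the set of candidate moves, and the toy matrix are all hypothetical, and bookkeeping for words aligned by earlier groupings is omitted.

```python
# Candidate growth steps for the box (i_lo, i_hi, j_lo, j_hi):
# four single-side moves and four diagonal moves (an assumption).
MOVES = [
    (-1, 0, 0, 0), (0, 1, 0, 0), (0, 0, -1, 0), (0, 0, 0, 1),
    (-1, 0, -1, 0), (-1, 0, 0, 1), (0, 1, -1, 0), (0, 1, 0, 1),
]


def expand_seed(L, seed, ratio=0.5):
    """Grow a rectangular phrase pair around seed = (i, j).

    L[i][j] is the score of source word i and target word j.
    Returns inclusive spans (i_lo, i_hi, j_lo, j_hi).
    """
    n_src, n_tgt = len(L), len(L[0])
    seed_score = L[seed[0]][seed[1]]
    box = (seed[0], seed[0], seed[1], seed[1])

    def acceptable(old, new):
        i_lo, i_hi, j_lo, j_hi = new
        if i_lo < 0 or j_lo < 0 or i_hi >= n_src or j_hi >= n_tgt:
            return False
        new_rows = [i for i in range(i_lo, i_hi + 1) if not old[0] <= i <= old[1]]
        new_cols = [j for j in range(j_lo, j_hi + 1) if not old[2] <= j <= old[3]]
        # Criterion 1 (as interpreted here): a newly grouped word must
        # keep its best partner inside the expanded phrase pair.
        for i in new_rows:
            best_j = max(range(n_tgt), key=lambda j: L[i][j])
            if not j_lo <= best_j <= j_hi:
                return False
        for j in new_cols:
            best_i = max(range(n_src), key=lambda i: L[i][j])
            if not i_lo <= best_i <= i_hi:
                return False
        # Criterion 2 (as interpreted here): the best score among the
        # newly added cells must be close to the seed score.
        new_cells = [L[i][j]
                     for i in range(i_lo, i_hi + 1)
                     for j in range(j_lo, j_hi + 1)
                     if i in new_rows or j in new_cols]
        return max(new_cells) >= ratio * seed_score

    changed = True
    while changed:
        changed = False
        for move in MOVES:
            candidate = tuple(b + d for b, d in zip(box, move))
            if acceptable(box, candidate):
                box = candidate
                changed = True
    return box


# Toy scores (invented). Rows: Je, déclare, reprise, la, session;
# columns: I, declare, resumed, the, session.
toy_L = [
    [9, 1, 0, 0, 0],
    [1, 8, 1, 0, 0],
    [0, 1, 6, 1, 0],
    [0, 0, 1, 5, 2],
    [0, 0, 0, 2, 7],
]
short_pair = expand_seed(toy_L, (0, 0), ratio=0.8)
long_pair = expand_seed(toy_L, (0, 0), ratio=0.5)
```

With these toy numbers, a stricter `ratio` stops the expansion at "Je déclare" / "I declare", while a looser one groups the whole sentence pair, which is one way to picture how criterion 2 controls granularity.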
Exploring All Possible Phrase Pairs
• Criterion 2 is used to control the granularity of the phrase pairs aligned
– Two short phrase pairs
– Or one long phrase pair
• Short phrases give better coverage for unseen test data
• Long phrases encapsulate more context, e.g. local reordering, word sense, etc.
• It is hard to decide on the optimal granularity without knowing the test data
• Solution: for each grouping, try all possible granularities
Exploring All Possible Phrase Pairs
French: Je déclare reprise la session
English: I declare resumed the session
The Likelihood of Word Associations
• The chi-square statistic is used to measure the likelihood of word associations for a pair {e, f}
• For each word pair {e, f}, the null hypothesis is that e and f are independent of each other.
• Calculate χ² to measure how well this hypothesis holds
• Construct the contingency table using the counts from the corpus given the current alignment, e.g. uniform alignment
– O11: number of times e and f are aligned
– O12: number of times e is aligned with other f
– O21: number of times f is aligned with other e
– O22: number of times other f are aligned with other e

       f      ~f
e      O11    O12
~e     O21    O22
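The slides do not spell out the formula, so the following is only a sketch that assumes the standard Pearson chi-square statistic for a 2×2 contingency table, without any continuity correction. The function name `chi_square_2x2` and the example counts are hypothetical.

```python
def chi_square_2x2(o11, o12, o21, o22):
    """Pearson chi-square for a 2x2 table of observed counts.

    Uses the closed form N * (O11*O22 - O12*O21)^2 divided by the
    product of the row and column totals. Returns 0.0 when a marginal
    total is zero (a convention assumed here).
    """
    n = o11 + o12 + o21 + o22
    row1, row2 = o11 + o12, o21 + o22
    col1, col2 = o11 + o21, o12 + o22
    denom = row1 * row2 * col1 * col2
    if denom == 0:
        return 0.0
    return n * (o11 * o22 - o12 * o21) ** 2 / denom


# Invented counts: e and f always co-occur, versus no association at all.
strong = chi_square_2x2(10, 0, 0, 10)
independent = chi_square_2x2(5, 5, 5, 5)
```

A larger value means the counts deviate more from what independence of e and f would predict, so the pair is more strongly associated.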
In WPT-05
• Submitted results for all four languages
• Training data as provided
• Language model as provided
• Decoder (Pharaoh) as provided
BLEU        German   Spanish   Finnish   French
Dev-test    18.63    26.20     12.88     26.20
Test        18.93    26.14     12.66     26.71
Conclusion
• Competitive grouping algorithm at the core of the ISA model
• Simple and efficient model
• Results comparable to those of other phrase alignment models
The Evolution of ISA
Matrix of the Likelihood
Expanding the Phrase Pairs