21
The current status of Chinese-English EBMT research -where are we now Joy, Ralf Brown, Robert Frederking, Erik Peterson Aug 2001

The current status of Chinese-English EBMT research -where are we now Joy, Ralf Brown, Robert Frederking, Erik Peterson Aug 2001

  • View
    217

  • Download
    0

Embed Size (px)

Citation preview

Page 1: The current status of Chinese-English EBMT research -where are we now Joy, Ralf Brown, Robert Frederking, Erik Peterson Aug 2001

The current status of Chinese-English EBMT research

-where are we now

Joy, Ralf Brown, Robert Frederking, Erik Peterson

Aug 2001

Page 2: The current status of Chinese-English EBMT research -where are we now Joy, Ralf Brown, Robert Frederking, Erik Peterson Aug 2001

Language Technologies InstituteSchool of Computer Science, Carnegie Mellon University

2

Overview of Ch-En EBMT• Adapting EBMT to Chinese

– Segmentation of Chinese

• Corpus used– Hong Kong legal code (from LDC)– Hong Kong news articles (from LDC)

• In this project:– Robert Frederking, Ralf Brown, Joy, Erik Peterson,

Stephan Vogel, Alon Lavie, Lori Levin,

Page 3: The current status of Chinese-English EBMT research -where are we now Joy, Ralf Brown, Robert Frederking, Erik Peterson Aug 2001

Language Technologies InstituteSchool of Computer Science, Carnegie Mellon University

3

Corpus Statistics• Hong Kong Legal Code:

Chinese: 23 MB

English: 37.8 MB

• Hong Kong News (After cleaning): 7622 DocumentsDev-test: Size: 1,331,915 byte , 4,992 sentence pairs

Final-test: Size: 1,329,764 byte, 4,866 sentence pairs

Training: Size: 25,720,755 byte, 95,752 sentence pairs

• Corpus Cleaning– Converted from Big5 to GB

– Divided into Training set (90%), Dev-test (5%) and test set (5%)

– Sentence level alignment, using Church & Gale Method (by ISI)

– Cleaned

– Convert two-byte Chinese characters to their cognates

Page 4: The current status of Chinese-English EBMT research -where are we now Joy, Ralf Brown, Robert Frederking, Erik Peterson Aug 2001

Language Technologies InstituteSchool of Computer Science, Carnegie Mellon University

4

Chinese Segmentation• There are no spaces between Chinese words in written Chinese.

• The segmentation problem: Given a sentence with no spaces, break it into words. Definition of Chinese word is vague.

Page 5: The current status of Chinese-English EBMT research -where are we now Joy, Ralf Brown, Robert Frederking, Erik Peterson Aug 2001

Language Technologies InstituteSchool of Computer Science, Carnegie Mellon University

5

Our Definition of Words/Phrases/Terms

• Chinese Characters– The smallest unit in written Chinese is a character, which is represented

by 2 bytes in GB-2312 code.

• Chinese Words– A word in natural language is the smallest reusable unit which can be used

in isolation.

• Chinese Phrases– We define a Chinese phrase as a sequence of Chinese words. For each

word in the phrase, the meaning of this word is the same as the meaning when the word appears by itself.

• Terms– A term is a meaningful constituent. It can be either a word or a phrase.

Page 6: The current status of Chinese-English EBMT research -where are we now Joy, Ralf Brown, Robert Frederking, Erik Peterson Aug 2001

Language Technologies InstituteSchool of Computer Science, Carnegie Mellon University

6

Complicated Constructions• Transliterated foreign words and names

• Abbreviations

• Chinese Names

• Chinese Numbers

Page 7: The current status of Chinese-English EBMT research -where are we now Joy, Ralf Brown, Robert Frederking, Erik Peterson Aug 2001

Language Technologies InstituteSchool of Computer Science, Carnegie Mellon University

7

Segmenter

• Approaches– Statistical approaches:

• Idea: Building collocation models for Chinese characters, such as first-order HMM. Place the space at the place where two characters rarely co-occur.

• Cons:– Data sparseness

– Cross boundary

Page 8: The current status of Chinese-English EBMT research -where are we now Joy, Ralf Brown, Robert Frederking, Erik Peterson Aug 2001

Language Technologies InstituteSchool of Computer Science, Carnegie Mellon University

8

Segmenter (2)– Dictionary-based approaches

• Idea: Use a dictionary to find the words in the sentence

• Forward maximum match / backward maximum match/ or both direction

• Cons:– The size and quality of the dictionary used are of great

importance: New words, Named-entity

– Maximum (greedy) match may cause mis-segmentations

Page 9: The current status of Chinese-English EBMT research -where are we now Joy, Ralf Brown, Robert Frederking, Erik Peterson Aug 2001

Language Technologies InstituteSchool of Computer Science, Carnegie Mellon University

9

Segmenter (3)

– A combination of dictionary and linguistic knowledge

• Ideas: Using morphology, POS, grammar and heuristics to aid disambiguation

• Pros: high accuracy (possible)

• Cons: – Require a dictionary with POS and word-frequency

– Computationally expensive

Page 10: The current status of Chinese-English EBMT research -where are we now Joy, Ralf Brown, Robert Frederking, Erik Peterson Aug 2001

Language Technologies InstituteSchool of Computer Science, Carnegie Mellon University

10

Segmenter (4)

• We first used LDC’s segmenter

• Currently we are using a forward/backward maximum match segmenter for baseline. The word frequency dictionary is from LDC

• The word frequency dictionary from LDC: 43,959 entries

• For HLT 2001, we augmented the frequency dictionary with new words found from the corpus by statistical method

Page 11: The current status of Chinese-English EBMT research -where are we now Joy, Ralf Brown, Robert Frederking, Erik Peterson Aug 2001

Language Technologies InstituteSchool of Computer Science, Carnegie Mellon University

11

Two-threshold method

• Two-threshold for tokenization (finding new words from the corpus) : for MT Summit VIII

Page 12: The current status of Chinese-English EBMT research -where are we now Joy, Ralf Brown, Robert Frederking, Erik Peterson Aug 2001

Language Technologies InstituteSchool of Computer Science, Carnegie Mellon University

12

For PI Meeting• Baseline System

– Using LDC’s frequency word dictionary

• Full System– Tokenize new words from the pre-segmented corpus using two-

threshold method, augment the frequency dictionary with new words to re-segment the corpus

– Bracket English– Using feedback from statDict to adjust segmentation/bracketing

• Baseline + Named-Entity– Named-entity tagger by Erik Peterson

• Multi-corpora System– Cluster the documents into sub-corpora according to their topics

Page 13: The current status of Chinese-English EBMT research -where are we now Joy, Ralf Brown, Robert Frederking, Erik Peterson Aug 2001

Language Technologies InstituteSchool of Computer Science, Carnegie Mellon University

13

Evaluation Issues

• Automatic Measures– EBMT Source Match– EBMT Source Coverage– EBMT Target Coverage– MEMT (EBMT+DICT) Unigram Coverage– MEMT (EBMT+DICT) PER

Page 14: The current status of Chinese-English EBMT research -where are we now Joy, Ralf Brown, Robert Frederking, Erik Peterson Aug 2001

Language Technologies InstituteSchool of Computer Science, Carnegie Mellon University

14

Evaluation Issues (2)• Human Evaluations

– 4-5 graders each time

– 6 categories

Page 15: The current status of Chinese-English EBMT research -where are we now Joy, Ralf Brown, Robert Frederking, Erik Peterson Aug 2001

Language Technologies InstituteSchool of Computer Science, Carnegie Mellon University

15

After PI Meeting (0)• Study of results reported in PI meeting(http://pizza.is.cs.cmu.edu/research/internal/ebmt/tokenLen/index.htm)

– The quality of Named-Entity (Cleaned by Erik)

– Performance difference of EBMT while changing the average length of Chinese word token (by changing segmentation)

– How to evaluate the performance of the system

• Experiment of G-EBMT– Word clustering

Page 16: The current status of Chinese-English EBMT research -where are we now Joy, Ralf Brown, Robert Frederking, Erik Peterson Aug 2001

Language Technologies InstituteSchool of Computer Science, Carnegie Mellon University

16

After PI Meeting (1)

• Changing the average length of Chinese token– No bracket on English– Use a subset of LDC’s frequency dictionary for

segmentation– Study the performance of EBMT system on

different average Chinese token length

Page 17: The current status of Chinese-English EBMT research -where are we now Joy, Ralf Brown, Robert Frederking, Erik Peterson Aug 2001

Language Technologies InstituteSchool of Computer Science, Carnegie Mellon University

17

After PI Meeting (2)• Avg. Token Len. Vs. PER

Page 18: The current status of Chinese-English EBMT research -where are we now Joy, Ralf Brown, Robert Frederking, Erik Peterson Aug 2001

Language Technologies InstituteSchool of Computer Science, Carnegie Mellon University

18

After PI Meeting (3)• Type-Token curve of Chinese and English

Page 19: The current status of Chinese-English EBMT research -where are we now Joy, Ralf Brown, Robert Frederking, Erik Peterson Aug 2001

Language Technologies InstituteSchool of Computer Science, Carnegie Mellon University

19

Future Research Plan

• Generalized EBMT– Word-clustering– Grammar Induction

• Using Machine Learning to optimize the parameters used in MEMT

• Better Alignment Model: Integrating segmentation, brackting and alignment

Page 20: The current status of Chinese-English EBMT research -where are we now Joy, Ralf Brown, Robert Frederking, Erik Peterson Aug 2001

Language Technologies InstituteSchool of Computer Science, Carnegie Mellon University

20

New Alignment Model (1)• Using both monolingual and bilingual collocation information to

segment and align corpus

Page 21: The current status of Chinese-English EBMT research -where are we now Joy, Ralf Brown, Robert Frederking, Erik Peterson Aug 2001

Language Technologies InstituteSchool of Computer Science, Carnegie Mellon University

21

References

• Tom Emerson, “Segmentation of Chinese Text”. In #38 Volume 12 Issue2 of MultiLingual Computing & Technology published by MultiLingual Computing, Inc.

• Ying Zhang, Ralf D. Brown, and Robert E. Frederking. "Adapting an Example-Based Translation System to Chinese". To appear in Proceedings of Human Language Technology Conference 2001 (HLT-2001).

• Ying Zhang, Ralf D. Brown, Robert E. Frederking and Alon Lavie. "Pre-processing of Bilingual Corpora for Mandarin-English EBMT". Accepted in MT Summit VIII (Santiago de Compostela, Spain, Sep. 2001)