Mandarin-English Information (MEI): Investigating Translingual Speech Retrieval. Johns Hopkins University Summer Workshop 2000. Presented at the ANLP-NAACL 2000 Embedded Machine Translation Systems Workshop. The MEI Team.

Page 1: Presented at the ANLP-NAACL 2000   Embedded Machine Translation Systems Workshop The MEI Team

Mandarin-English Information (MEI): Investigating Translingual Speech Retrieval

Johns Hopkins University Summer Workshop 2000

Presented at the ANLP-NAACL 2000 Embedded Machine Translation Systems Workshop

The MEI Team

Page 2

MEI Team

• Senior Members
– Helen Meng, Chinese University of Hong Kong
– Erika Grams, Advanced Analytic Tools
– Sanjeev Khudanpur, Johns Hopkins University
– Gina-Anne Levow, University of Maryland
– Douglas Oard, University of Maryland
– Patrick Schone, US Department of Defense
– Hsin-Min Wang, Academia Sinica, Taiwan

• Students
– Berlin Chen, National Taiwan University
– Wai-Kit Lo, Chinese University of Hong Kong
– Karen Tang, Princeton University
– Jianqiang Wang, University of Maryland

Page 3

Outline

• Audio indexing

• MEI Project overview

• Research challenges

• System architecture

• Collaboration opportunities

Page 4

Motivation

• Speech retrieval applications are emerging
– e.g., http://speechbot.research.compaq.com

• Internet-accessible radio and television stations
– [Chart: English vs. other languages; values shown: 529 and 1367; source: www.real.com, Feb 2000]

Page 5

The Big Picture

[Diagram labels: Speech-to-Speech Translation; Translingual Audio Browsing; Translingual Audio Search; English Query; English Audio; Select; Examine; MEI]

Page 6

Related Work

• TREC Spoken Document Retrieval
– Close coupling of recognition and retrieval

• TREC Cross-Language Retrieval
– Close coupling of translation and retrieval

• TDT-3 Topic Tracking
– Coupling recognition, translation and retrieval

• Using speech recognition transcripts

Page 7

The MEI Project

[Diagram: query by example; English example newswire stories are used to search a Mandarin audio collection]

• Closely couple recognition and translation
– For the purpose of retrieval

• Using English examples, find Mandarin audio

Page 8

Research Challenges

• Multi-scale audio indexing
– Multiple feature sets capture more information

• Multi-scale translation
– Lexicon and pronunciation are complementary

• Multi-scale retrieval
– Combination of evidence can add robustness

Page 9

Multi-scale Mandarin Audio Indexing

[Figure: a Mandarin syllable decomposed at several subword scales (Initial/Final, Preme/Core Final, Preme/Toneme); segment labels shown: /j/, /ji/, /iang/, /ang/, /i/, /a/, /ng/]

Page 10

Multi-scale Translation

• Word-scale
– Dictionary-based [Levow & Oard 00]
– Parallel corpora [Nie 99]
– Comparable corpora [Fung 98]

• Subword-scale [Knight & Graehl 97]
– Cross-language phonetic mapping
– /bei2 ai4 er3 lan2/
– Kosovo (/ke1-suo3-wo4/, /ke1-suo3-fo2/, /ke1-suo3-fu1/, /ke1-suo3-fu2/)

Page 11

Cross-Language Phonetic Mapping

• Syllabify English spelling
– e.g. Jiang Zemin, Shandong Province

• Map English pronunciation to Mandarin
– Convert phonemes to pinyin
– e.g. /k ow s ax v ow/ to /ke1-suo3-wo4/
– Plan to investigate alternative techniques: rule-based, statistical mapping
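
The phoneme-to-pinyin step above could be sketched as a rule-based lookup. This is only an illustration under stated assumptions: the table `PHONE_TO_PINYIN` and the function `phones_to_pinyin` are hypothetical, the three rules were written by hand to cover the Kosovo example from the slide, and a greedy longest-match strategy is assumed rather than taken from the MEI project.

```python
# Hypothetical toy table: English phone sequences -> Mandarin pinyin syllables.
# The entries are illustrative assumptions, not the MEI project's actual rules.
PHONE_TO_PINYIN = {
    ("k", "ow"): "ke1",
    ("s", "ax"): "suo3",
    ("v", "ow"): "wo4",
}


def phones_to_pinyin(phones, table=PHONE_TO_PINYIN):
    """Greedy longest-match conversion of a phone list into pinyin syllables."""
    max_len = max(len(key) for key in table)
    syllables = []
    i = 0
    while i < len(phones):
        # Try the longest phone span first, then shorter ones.
        for span in range(min(max_len, len(phones) - i), 0, -1):
            key = tuple(phones[i:i + span])
            if key in table:
                syllables.append(table[key])
                i += span
                break
        else:
            # No rule covers this phone; skip it rather than guess.
            i += 1
    return syllables


kosovo = phones_to_pinyin("k ow s ax v ow".split())
print("-".join(kosovo))
```

A statistical mapping, the other alternative named on the slide, would replace the fixed table with probabilities estimated from data and could return several candidate syllable strings per name.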

Page 12

Multi-scale Retrieval

• Word-scale exploits lexical knowledge
– Enhances precision

• Subwords can achieve complete coverage
– Enhances recall

• Combination of evidence may be best
– If a good merging strategy can be found

Page 13

Multi-scale Retrieval Techniques

• Subword-scale
– Syllable lattice matching [Chen, Wang & Lee 00]
– Overlapping syllable n-grams [Meng et al. 99]
– Syllable confusion matrix [Meng et al. 99]

• Word-scale
– Structured queries [Pirkola 98]
– Structured translation [Sperer & Oard 00]
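
As a rough illustration of overlapping syllable n-grams, the sketch below splits a syllable sequence into bigrams and counts how many a query shares with a document. The function names and the overlap score are assumptions made for this example; they are not the indexing or scoring used in [Meng et al. 99].

```python
from collections import Counter


def syllable_ngrams(syllables, n=2):
    """Return the overlapping syllable n-grams of a sequence, in order."""
    return [tuple(syllables[i:i + n]) for i in range(len(syllables) - n + 1)]


def ngram_overlap(query_syllables, doc_syllables, n=2):
    """Count n-gram occurrences shared by query and document (multiset intersection)."""
    query_counts = Counter(syllable_ngrams(query_syllables, n))
    doc_counts = Counter(syllable_ngrams(doc_syllables, n))
    return sum((query_counts & doc_counts).values())


query = "bei2 ai4 er3 lan2".split()
document = "zai4 bei2 ai4 er3 lan2 de5".split()  # made-up document syllables
print(syllable_ngrams(query))
print(ngram_overlap(query, document))
```

Because adjacent n-grams overlap, a single misrecognized syllable damages at most n of them, which is one reason subword n-grams can help recall.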

Page 14

Merging Strategies

• Loose coupling
– Separate retrieval runs
– Merge ranked lists [Voorhees 95]

• Tight coupling [Ng 00]
– Unified indexing of words and subwords
– Single ranked list
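
A minimal sketch of loose coupling, assuming each retrieval run is a mapping from document id to score. The merge shown here is a weighted sum of min-max normalized scores, chosen only for simplicity; it is not the method of [Voorhees 95], and `normalize`, `merge_runs`, and `word_weight` are hypothetical names.

```python
def normalize(scores):
    """Min-max normalize a {doc_id: score} mapping into the range [0, 1]."""
    if not scores:
        return {}
    low, high = min(scores.values()), max(scores.values())
    if high == low:
        return {doc: 1.0 for doc in scores}
    return {doc: (score - low) / (high - low) for doc, score in scores.items()}


def merge_runs(word_run, subword_run, word_weight=0.5):
    """Merge a word-scale run and a subword-scale run into one ranked list."""
    word_norm = normalize(word_run)
    subword_norm = normalize(subword_run)
    merged = {}
    for doc in set(word_norm) | set(subword_norm):
        merged[doc] = (word_weight * word_norm.get(doc, 0.0)
                       + (1.0 - word_weight) * subword_norm.get(doc, 0.0))
    # Sort by descending merged score; break ties by document id.
    return sorted(merged, key=lambda doc: (-merged[doc], doc))


word_run = {"d1": 2.0, "d2": 1.0}       # made-up word-scale scores
subword_run = {"d2": 0.9, "d3": 0.1}    # made-up subword-scale scores
print(merge_runs(word_run, subword_run, word_weight=0.4))
```

Tight coupling would instead index words and subwords together, so that a single retrieval run produces the ranked list and no post-hoc merge is needed.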

Page 15

Robust Retrieval

• Multiple causes
– Speech recognition errors
– Translation ambiguity
– Transliteration ambiguity

• Possible solutions
– Weighted n-best indexing [Levow & Oard 00]
– Syllable lattice indexing [Chen, Wang & Lee 00]
– Syllable confusion expansion [Meng et al. 99]
– Structured queries [Pirkola 98]
– Document expansion [Levow & Oard 00]
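
One way to picture syllable confusion expansion is to attach likely misrecognitions to each query syllable, weighted by how often the recognizer confuses them. The sketch below assumes a hypothetical confusion table of the form {syllable: {confusable syllable: probability}} with invented values; it does not reproduce the procedure of [Meng et al. 99].

```python
def expand_with_confusions(syllables, confusion, threshold=0.1):
    """Map each query syllable to weighted alternatives, itself included at 1.0."""
    expanded = []
    for syllable in syllables:
        alternatives = {syllable: 1.0}
        for other, probability in confusion.get(syllable, {}).items():
            # Keep only confusions likely enough to be worth matching.
            if probability >= threshold and other not in alternatives:
                alternatives[other] = probability
        expanded.append(alternatives)
    return expanded


# Invented confusion probabilities, for illustration only.
confusion = {"fo2": {"fu2": 0.3, "wo4": 0.05}}
print(expand_with_confusions(["ke1", "suo3", "fo2"], confusion))
```

The threshold trades recall against noise: a lower value matches more recognition errors but also admits more spurious syllables into the query.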

Page 16

System Architecture Overview

[Diagram components: English Example; Word Translation; Phonetic Transcription; Known Terms; Translation Lexicon; Corpus Statistics; Syllable n-gram Generation; Words; Syllable n-grams; Mandarin Documents; Retrieval System; Relevance Judgments; Eval Code; Average Precision]
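
The evaluation step in the architecture takes a ranked list and relevance judgments and reports average precision. A minimal sketch follows, assuming the common uninterpolated definition (the mean of the precision values at the rank of each relevant document); the slides do not say which variant the evaluation code uses, and the function name is hypothetical.

```python
def average_precision(ranked_docs, relevant_docs):
    """Uninterpolated average precision of one ranked list."""
    relevant = set(relevant_docs)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc in enumerate(ranked_docs, start=1):
        if doc in relevant:
            hits += 1
            precision_sum += hits / rank
    # Relevant documents that were never retrieved contribute zero.
    return precision_sum / len(relevant)


ranking = ["d1", "d2", "d3", "d4"]   # made-up ranked list
judged_relevant = {"d1", "d3"}       # made-up relevance judgments
print(average_precision(ranking, judged_relevant))
```

Averaging this value over all topics gives mean average precision, a common single-number summary for comparing retrieval configurations.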

Page 17

The TDT Collections

[Figure: timeline of the TDT collections with marks at Jan 98, Mar 98, Jun 98, Oct 98 and Dec 98, divided into Development Test (TDT-2) and Evaluation (TDT-3). Labels shown: Associated Press (APW) and New York Times (NYT) English newswire; Voice of America (VOA) Mandarin audio, 121 hours and 41 hours; 20 topics and 59 topics; Story Boundaries Known condition]

• Four stories per topic in each language
– Each reporting on some aspect of one event

Page 18

MEI Project Schedule

[Timeline: Dec, Feb, Apr, Jun, Aug. Milestones shown: Workshop 00 Planning Meeting; First Team Planning Meeting; Second Team Planning Meeting; Six Weeks at Hopkins]

Page 19

Things We Need

• Ideas
– To sharpen our focus

• Connections
– To build a community of interest

• Resources
– To build on what others have done

Page 20

For More Information

• MEI Project
– http://www.glue.umd.edu/~meiweb

• Translingual Retrieval
– http://www.clis.umd.edu/dlrg/clir

• Speech Retrieval
– http://www.clis.umd.edu/dlrg/speech

• Hopkins Summer Workshop Series
– http://www.clsp.jhu.edu/workshops

Page 21

Detailed Query Processing (1)

[Flow diagram labels: white-space separated text with named entity tags; Stopping; Stemming; Phrase Extraction; list of translatable words and phrases; terms that have Mandarin translations; named entities; terms with no Mandarin translations]

Page 22

Detailed Query Processing (2)

[Flow diagram labels: inputs are terms that have Mandarin translations, terms with no Mandarin translations, and named entities (example: Northern Ireland); processing steps are term translation (English-Mandarin translation lexicon), named entity parsing (named entity parsing rules for transliteration), and phonetic expansion (English pronunciation lexicon); outputs are a bag of Mandarin terms and a bag of English phone sequences (examples: "eh n t ih t iy", "t eh k s t")]
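
The split into a bag of Mandarin terms and a bag of English phone sequences might look like the sketch below. It covers only the two simplest paths (terms with and without a lexicon translation) and leaves out named entity parsing. Both lexicons are tiny hypothetical stand-ins: the phone strings are the ones shown on the slide, and pinyin strings stand in for Mandarin terms.

```python
def process_query_terms(terms, translation_lexicon, pronunciation_lexicon):
    """Split query terms into Mandarin translations and English phone sequences."""
    mandarin_terms = []
    phone_sequences = []
    for term in terms:
        if term in translation_lexicon:
            # Term translation: keep every listed Mandarin translation.
            mandarin_terms.extend(translation_lexicon[term])
        elif term in pronunciation_lexicon:
            # Phonetic expansion: fall back to the English pronunciation.
            phone_sequences.append(pronunciation_lexicon[term].split())
        # Terms found in neither lexicon are dropped in this sketch.
    return mandarin_terms, phone_sequences


# Hypothetical toy lexicons, for illustration only.
translation_lexicon = {"northern ireland": ["bei2 ai4 er3 lan2"]}
pronunciation_lexicon = {"entity": "eh n t ih t iy", "text": "t eh k s t"}

terms = ["northern ireland", "entity", "zzz"]
print(process_query_terms(terms, translation_lexicon, pronunciation_lexicon))
```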

Page 23

Detailed Query Processing (3)

[Flow diagram labels: inputs are a bag of Mandarin terms and a bag of English phone sequences; processing steps are English phone strings to Mandarin syllables (Mandarin syllabification rules), syllabic expansion (Mandarin pronunciation lexicon), and a "Retain this term?" test against ASR insertion-prone words and ASR substitution/deletion-prone words (yes: keep; no: to trash, or downweight); outputs are two bags of Mandarin syllable sequences and a smaller bag of Mandarin terms]

Page 24

Detailed Query Processing (4)

[Flow diagram labels: inputs are Mandarin syllable sequences from likely recognition errors, Mandarin syllable sequences from unknown words, and a bag of Mandarin lexical terms; processing steps are syllabic expansion (Mandarin pronunciation lexicon) and syllable n-gram generation, applied to each source; outputs are three bags of Mandarin syllable n-grams from different sources and a bag of high-confidence Mandarin terms]