Mandarin–English Information (MEI): investigating translingual speech retrieval

Helen M. Meng a,*, Berlin Chen b, Sanjeev Khudanpur c, Gina-Anne Levow d, Wai-Kit Lo a, Douglas Oard d, Patrick Schone e, Karen Tang f, Hsin-min Wang b, Jianqiang Wang d

a Department of Systems Engineering and Engineering Management, Human-Computer Communications Laboratory, The Chinese University of Hong Kong, Shatin, NT, Hong Kong

b Academia Sinica, Taiwan

c Johns Hopkins University, USA

d University of Maryland at College Park, USA

e Department of Defense, USA

f Princeton University, USA

Computer Speech and Language (2004)

Received 30 May 2001; received in revised form 8 August 2003; accepted 15 September 2003


Outline

• Abstract
• Introduction
• Previous work
• The TDT collection
• The multi-scale paradigm
  • Multi-scale query formulation
    • Query term selection
    • Query translation
    • Named entity transliteration
    • Multi-scale query construction
  • Multi-scale audio indexing
  • Multi-scale retrieval
• Experiments
  • Evaluation criterion
  • Tuning with the development test set
  • Experimental results
    • Phrase-based translation
    • Multi-scale retrieval
    • Subword translation
• Conclusions


Abstract

• First English–Chinese CL-SDR system.
• The Mandarin–English Information (MEI) project investigated the problem of cross-language spoken document retrieval (CL-SDR).
• Cross-language, cross-media retrieval task: an English news story (text) serves as the query, and relevant Chinese broadcast news stories (audio) are retrieved from the document collection.
• English queries are translated into Chinese by a dictionary-based approach that combines phrase-based translation with word-by-word translation.
• Untranslatable named entities are transliterated by a novel subword translation technique.


Introduction

• A multi-scale paradigm for English–Chinese CL-SDR: phrases, words and subwords (Chinese characters and syllables).

Problems related to English–Chinese CL-SDR:

• Multiplicity in translation:
  • multiple translation alternatives
  • no translations (e.g., for proper names)
• Ambiguity in Chinese homophones (義異同音):
  • word-level confusions, e.g., 富庶, 負數, 複數, 覆述
• Ambiguity in Chinese word tokenization:
  • word-level mismatches between queries and documents
  • Example: 這一晚會如常舉行, a single character string that admits several word segmentations
• Speech recognition errors:
  • OOV (out-of-vocabulary) words
  • acoustic confusions among in-vocabulary words


Previous Work

Retrieval of German spoken news documents with French text (Sheridan and Ballerini, 1996)

Retrieval of Serbo-Croatian news broadcasts with English text (Hauptmann et al., 1998)

CL-SDR integrates speech recognition, machine translation and information retrieval technologies to accomplish the task.

Cross-language information retrieval (CLIR) crosses the language barrier for retrieval by means of query translation, document translation, interlingual techniques and cognate matching.

• Buckley et al. (1997) performed English–French CLIR.
• Spoken document retrieval: a popular approach (Garofolo et al., 2000).
• The out-of-vocabulary (OOV) problem in recognition (Woodland et al., 2000).
• Audio indexed with phoneme n-grams (Ng, 2000; Wechsler and Schauble, 1995).


The TDT collection

This work uses the Topic Detection and Tracking (TDT) collection.

TDT is a DARPA-sponsored program in which participating sites tackle tasks such as:

• identifying the first time a story is reported on a given topic
• grouping similar topics from audio and textual streams of newswire data

In recent years, TDT has focused primarily on performing such tasks in both English and Mandarin Chinese.

Most of the Mandarin audio data are transcribed by the Dragon automatic speech recognition system.


The TDT collection

The TDT-2 corpus serves as our development test set, and TDT-3 as our evaluation test set.


The multi-scale paradigm - Multi-scale query formulation (1/6)

Query term selection. In the MEI task:

• The query consists of an entire English news story.
• Queries tend to be long, and not all query terms are important for retrieval.

Stopwords are excluded first, and all remaining terms in the exemplar, including multi-word terms, are listed. A χ² test is then used to select among these terms, with the following notation:

• N: the number of documents
• r denotes relevant documents, and n denotes non-relevant documents
• + means the current term appears in the document, and − means it does not
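The notation above (N, r, n, +, −) matches a 2×2 contingency table of term presence against relevance. A minimal sketch, assuming the standard χ² statistic for such a table; the function names, the toy counts and the top-k cutoff are illustrative assumptions, not values from the MEI experiments:

```python
def chi_square(r_plus: int, r_minus: int, n_plus: int, n_minus: int) -> float:
    """Chi-square statistic for a 2x2 term-presence vs. relevance table.

    r_plus / r_minus: relevant documents with / without the term.
    n_plus / n_minus: non-relevant documents with / without the term.
    """
    total = r_plus + r_minus + n_plus + n_minus  # N, the number of documents
    denominator = (
        (r_plus + r_minus)
        * (n_plus + n_minus)
        * (r_plus + n_plus)
        * (r_minus + n_minus)
    )
    if denominator == 0:
        return 0.0  # a degenerate table carries no evidence either way
    return total * (r_plus * n_minus - r_minus * n_plus) ** 2 / denominator


def select_terms(counts: dict, top_k: int) -> list:
    """Keep the top_k terms with the highest chi-square score."""
    ranked = sorted(counts, key=lambda term: chi_square(*counts[term]), reverse=True)
    return ranked[:top_k]


# Hypothetical counts per term: (r+, r-, n+, n-)
toy_counts = {
    "kosovo": (10, 0, 0, 10),
    "said": (5, 5, 5, 5),
}
selected = select_terms(toy_counts, top_k=1)
```

A term that occurs equally often in relevant and non-relevant documents scores zero, so it is dropped before terms that discriminate between the two sets.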



Query translation: bilingual term list (BTL)

• The list is formed by combining LDC's English–Chinese bilingual term list with translations extracted from the CETA (Chinese-English Translation Assistance) dictionaries.
• The BTL covers 200,000 distinct English terms. Some of these terms have multiple translations, for a total of approximately 400,000 English-to-Chinese translation pairs.

This approach preserves term frequency information in the query.

Translation proceeds on the phrasal scale, word scale, as well as the subword scale.
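Dictionary-based translation that tries phrases before single words can be sketched as below. This is a simplified illustration: the greedy longest-match strategy, the function name and the toy entries in `TOY_BTL` are assumptions made for the example and are not taken from the actual BTL. Counting every emitted translation keeps the term frequency of the original query.

```python
from collections import Counter

# Toy bilingual term list; entries are illustrative, not taken from the BTL.
TOY_BTL = {
    "human rights": ["人權"],
    "human": ["人類", "人"],
    "rights": ["權利"],
    "talks": ["會談", "談判"],
}


def translate_query(tokens, btl, max_phrase_len=3):
    """Translate tokens phrase-first, falling back to word-by-word lookup.

    Returns a Counter of Chinese terms (term frequency preserved) and the
    list of tokens that the term list could not translate.
    """
    translated = Counter()
    untranslated = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate phrase first, then shorter ones.
        for length in range(min(max_phrase_len, len(tokens) - i), 0, -1):
            term = " ".join(tokens[i:i + length])
            if term in btl:
                for target in btl[term]:
                    translated[target] += 1
                i += length
                break
        else:
            # No entry at any length: leave the token for transliteration.
            untranslated.append(tokens[i])
            i += 1
    return translated, untranslated


query = ["human", "rights", "talks", "kosovo", "human", "rights"]
translated, untranslated = translate_query(query, TOY_BTL)
```

Here "human rights" is matched as a phrase twice, "talks" contributes both of its alternatives, and "kosovo" is returned as untranslatable.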



Named entity transliteration

• There will inevitably be terms that the BTL cannot translate (names of people, places and organizations).
• Named entities need to be salvaged since they tend to be important for retrieval.
• For example, ‘‘Kosovo’’ has several Chinese renderings:
  • 科索沃 /ke-suo-wo/
  • 科索佛 /ke-suo-fo/
  • 科索夫 /ke-suo-fu/
  • 科索伏 /ke-suo-fu/
  • 柯索佛 /ke-suo-fo/
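The ‘‘Kosovo’’ renderings listed above show why a syllable scale is useful: several distinct character strings share a syllable sequence. A small sketch that groups the five renderings by their syllables; the data come from the list above, while the function name is an illustrative assumption:

```python
from collections import defaultdict

# Renderings of "Kosovo" and their syllable sequences, as listed above.
KOSOVO_VARIANTS = {
    "科索沃": "ke-suo-wo",
    "科索佛": "ke-suo-fo",
    "科索夫": "ke-suo-fu",
    "科索伏": "ke-suo-fu",
    "柯索佛": "ke-suo-fo",
}


def group_by_syllables(variants):
    """Group character strings that share the same syllable sequence."""
    groups = defaultdict(list)
    for characters, syllables in variants.items():
        groups[syllables].append(characters)
    return dict(groups)


groups = group_by_syllables(KOSOVO_VARIANTS)
```

The five character strings fall into three syllable sequences, so a syllable-level match can succeed where a character-level match between two renderings fails.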

Subword translation:

• We applied our knowledge of acoustic–phonetics and phonology related to both English and Chinese.
• We also applied machine learning techniques and other techniques used in speech recognition.



Named entities in the query exemplars are tagged by the BBN IdentiFinder system, and those absent from our BTL are processed by our transliteration system.

Chinese names are often represented in English by “pinyin” transcription.

An English pronunciation is produced with letter-to-sound rules acquired by the transformation-based error-driven learning technique.

A hand-designed set of cross-lingual phonological rules partially transforms the English pronunciation into a Chinese pronunciation.



Hypothesized syllable sequence is generated from the English name spelling by our transliteration procedure.

Reference syllable sequence is obtained by pronunciation lexicon lookup based on the Chinese name characters.



Multi-scale query construction

Query construction process: the bag of English query terms passes through multi-scale query construction, which yields:

• translated phrases
• named entities
• individual translated words
• translated syllables


The multi-scale paradigm - Multi-scale audio indexing

The Dragon large-vocabulary continuous speech recognizer (Zhan et al., 1999) provided Chinese word transcriptions for our Mandarin audio collections (TDT-2 and TDT-3).

For a fraction of the TDT-2 test set (about 23 h):

• word error rate: 18.0%
• character error rate: 12.1%
• syllable error rate: 7.9%

For a fraction of the TDT-3 test set (about 27 h):

• word error rate: 19.1%
• character error rate: 13.0%
• syllable error rate: 8.6%
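The figures above do not say how the error rates were computed. A common convention, assumed here, is the edit distance between reference and recognized sequences divided by the reference length. The sketch below applies that convention at three scales to a toy utterance built around the homophones 富庶 and 負數; the utterance and its hand-written syllable sequences are illustrative assumptions.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance between two sequences of tokens."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_token in enumerate(reference, start=1):
        current = [i]
        for j, hyp_token in enumerate(hypothesis, start=1):
            current.append(
                min(
                    previous[j] + 1,  # deletion
                    current[j - 1] + 1,  # insertion
                    previous[j - 1] + (ref_token != hyp_token),  # substitution
                )
            )
        previous = current
    return previous[-1]


def error_rate(reference, hypothesis):
    """Edit distance normalized by the reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


# Toy utterance: the recognizer confuses two homophonous words.
ref_words = ["富庶", "的", "城市"]
hyp_words = ["負數", "的", "城市"]
# Hand-written syllable sequences for the same two transcriptions.
ref_syllables = ["fu", "shu", "de", "cheng", "shi"]
hyp_syllables = ["fu", "shu", "de", "cheng", "shi"]

word_rate = error_rate(ref_words, hyp_words)
char_rate = error_rate(list("".join(ref_words)), list("".join(hyp_words)))
syllable_rate = error_rate(ref_syllables, hyp_syllables)
```

In this toy case a homophone confusion counts as an error at the word and character scales but not at the syllable scale, which agrees with the ordering of the three rates reported above.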


The multi-scale paradigm - Multi-scale retrieval (1/2)

Retrieval uses the InQuery retrieval engine developed by the University of Massachusetts (Callan et al., 1992), which uses a probabilistic belief network as the main data structure behind its query language.

A key feature is the ‘‘balanced query’’ mechanism:

• Suppose the English query consists of the terms E1, E2, ..., En.
• Suppose E1 has three possible Chinese translations: C11, C12 and C13.
• With balanced translation, the belief value for E1 in a Chinese document is the mean of the belief values for C11, C12 and C13 in that document.

Repeating the process for additional terms produces a set of belief values for each English query term with respect to every Chinese document.
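The averaging step can be sketched as follows. This is an illustration of the arithmetic only and does not call the retrieval engine; the function name, the belief values and the `default` belief for a translation missing from the document are assumptions made for the example.

```python
from statistics import mean


def balanced_beliefs(translations, doc_beliefs, default=0.0):
    """Map each English term to the mean belief of its Chinese translations.

    translations: English term -> non-empty list of Chinese translations.
    doc_beliefs: Chinese term -> belief value in one document.
    default: belief assumed for a translation absent from doc_beliefs.
    """
    return {
        english: mean(doc_beliefs.get(term, default) for term in chinese)
        for english, chinese in translations.items()
    }


# Hypothetical translations and belief values for a single document.
translations = {"E1": ["C11", "C12", "C13"], "E2": ["C21", "C22"]}
doc_beliefs = {"C11": 0.25, "C12": 0.5, "C13": 0.75, "C21": 0.5}
beliefs = balanced_beliefs(translations, doc_beliefs)
```

Because each English term contributes one averaged value, a term with many translation alternatives does not outweigh a term with a single translation.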


The multi-scale paradigm - Multi-scale retrieval (2/2)

The balanced translation of the query would be represented as:

#sum(#sum(C11, C12, C13) #sum(C21, C22) ... #sum(Cn1, Cn2, Cn3))

• The outer #sum operator is the typical way of combining belief values across query terms.
• The inner #sum operators implement balanced translation.
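Assembling the nested expression from per-term translation lists is mechanical. A minimal sketch that only builds the string; the function name and the `separator` default are assumptions, and the concrete syntax accepted by the retrieval engine may differ from the notation used here:

```python
def balanced_query(translations, separator=", "):
    """Build a nested #sum expression from per-term translation lists.

    translations: one list of Chinese alternatives per English query term.
    Terms with no translation are skipped rather than emitted as "#sum()".
    """
    inner = [
        f"#sum({separator.join(alternatives)})"
        for alternatives in translations
        if alternatives
    ]
    return f"#sum({' '.join(inner)})"


query_string = balanced_query([["C11", "C12", "C13"], ["C21", "C22"]])
```

Each inner group corresponds to one English term, so the grouping in the string mirrors the per-term averaging of balanced translation.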

Multi-scale retrieval proceeds on each scale individually. Each scale (words, characters and syllables) produces its own ranked list of documents.


Experiments

Evaluation criterion:

• L is the number of topics
• Mi is a sample of the exemplars for topic i
• Ni is the number of relevant documents for topic i
• rankijk is the rank of the kth relevant document retrieved for exemplar j of topic i
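These definitions are consistent with mean average precision, averaged first over the exemplars of a topic and then over topics. A sketch under that assumption, in which each relevant document retrieved at rankijk contributes k / rankijk; the function names and the toy ranks are illustrative:

```python
def average_precision(relevant_ranks, num_relevant):
    """Average precision for one exemplar.

    relevant_ranks: ranks at which relevant documents were retrieved.
    num_relevant: N_i, the number of relevant documents for the topic.
    """
    ranks = sorted(relevant_ranks)
    # The k-th relevant document found at rank r contributes precision k / r.
    return sum(k / rank for k, rank in enumerate(ranks, start=1)) / num_relevant


def mean_average_precision(topics):
    """Average over the exemplars of each topic, then over the L topics.

    topics: list of (N_i, ranks_per_exemplar) pairs, one pair per topic.
    """
    per_topic = []
    for num_relevant, ranks_per_exemplar in topics:
        scores = [average_precision(r, num_relevant) for r in ranks_per_exemplar]
        per_topic.append(sum(scores) / len(scores))
    return sum(per_topic) / len(per_topic)


# Two hypothetical topics: the first has two exemplars, the second has one.
toy_topics = [
    (2, [[1, 3], [2, 4]]),
    (1, [[1]]),
]
score = mean_average_precision(toy_topics)
```

Averaging within a topic before averaging across topics keeps a topic with many exemplars from dominating the overall score.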


Experiments

Statistical significance is assessed with a paired two-tailed t-test on the means across exemplars of each topic, with p < 0.05.
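For two retrieval configurations scored on the same topics, the paired t statistic is computed from the per-topic differences. A minimal sketch using only the standard library; the function name and the per-topic scores are hypothetical, and the p-value lookup against the t distribution is left out:

```python
from math import sqrt
from statistics import mean, stdev


def paired_t_statistic(scores_a, scores_b):
    """Return the paired t statistic and its degrees of freedom."""
    if len(scores_a) != len(scores_b):
        raise ValueError("both systems must be scored on the same topics")
    differences = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(differences)
    if n < 2:
        raise ValueError("at least two paired observations are required")
    spread = stdev(differences)  # sample standard deviation
    if spread == 0:
        raise ValueError("t is undefined when all differences are equal")
    return mean(differences) / (spread / sqrt(n)), n - 1


# Hypothetical per-topic scores for two configurations.
t_value, degrees_of_freedom = paired_t_statistic([3.0, 5.0, 7.0], [2.0, 3.0, 4.0])
```

For a two-tailed test, the absolute value of the statistic is compared with the critical value of the t distribution at the returned degrees of freedom.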


Conclusion

• An English news story (text) serves as the query, and relevant Chinese broadcast news stories (audio) are retrieved from the document collection. This is a cross-language and cross-media retrieval task.
• We experimented with the TDT collections, which contain English newswire from the New York Times and Associated Press, and Mandarin Chinese radio news broadcasts from Voice of America. The radio news is transcribed by Dragon's large-vocabulary continuous speech recognizer.
• The multi-scale approach is promising and applicable to the English–Chinese CL-SDR task. It should also be possible to leverage our experience in other translingual settings that involve SDR across other language pairs.
