“Searching to Translate”, and “Translating to Search”: When Information Retrieval Meets Machine Translation
Ferhan Ture
Dissertation defense, May 24, 2013
Department of Computer Science
University of Maryland at College Park
Motivation
• Fact 1: People want to access information, e.g., web pages, videos, restaurants, products, …
• Fact 2: Lots of data out there… but also lots of noise, redundancy, and different languages
• Goal: Find ways to efficiently and effectively
  - Search complex, noisy data
  - Deliver content in an appropriate form
[Diagram: multi-lingual text → user's native language; forum posts → clustered summaries]
Information Retrieval
Information retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers). In our work, we assume that the material is a collection of documents written in natural language, and the information need is provided in the form of a query, ranging from a few words to an entire document. A typical approach in IR is to represent each document as a vector of weighted terms, where a term usually means either a word or its stem. A pre-determined list of stop words (e.g., ``the'', ``an'', ``my'') may be removed from the set of terms, since they have been found to create noise in the search process. Documents are scored, relative to the query, usually by scoring each query term independently and aggregating these term-document scores.
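The vector representation and term-at-a-time scoring described above can be sketched in a few lines (a minimal tf-idf illustration with made-up toy documents; real systems use more refined weighting):

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Represent each tokenized document as a vector of tf-idf weighted terms."""
    n = len(docs)
    df = Counter(t for doc in docs for t in set(doc))  # document frequencies
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vectors

def score(query, vector):
    """Score one document: score each query term independently and aggregate."""
    return sum(vector.get(t, 0.0) for t in query)

# toy collection (already stop-worded and stemmed)
docs = [["nobel", "prize", "book"], ["book", "review"], ["nobel", "laureate"]]
vecs = tfidf_vectors(docs)
ranking = sorted(range(len(docs)),
                 key=lambda i: score(["nobel", "prize"], vecs[i]), reverse=True)
```
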
Resulting term weights (stemmed): queri:11.69, ir:11.39, vector:7.93, document:7.09, nois:6.92, stem:6.56, score:5.68, weight:5.67, word:5.46, materi:5.42, search:5.06, term:5.03, text:4.87, comput:4.73, need:4.61, collect:4.48, natur:4.36, languag:4.12, find:3.92, repres:3.58
Cross-Language Information Retrieval
Example German document (the German Wikipedia article on Information Retrieval, translated): Information Retrieval (IR), occasionally and imprecisely called information acquisition, is a field concerned with computer-supported search for complex content (i.e., not single words), and falls within information science, computer science, and computational linguistics. As the meaning of the word retrieval suggests, complex texts or image data stored in large databases are at first not accessible or retrievable to outsiders. Information retrieval is about finding existing information, not discovering new structures (as in Knowledge Discovery in Databases, which includes data mining and text mining).
Machine Translation
German (MT output): Maschinelle Übersetzung (MT) ist, um Text in einer Ausgangssprache in entsprechenden Text in der Zielsprache geschrieben übersetzen.
English: Machine translation (MT) is to translate text written in a source language into corresponding text in a target language.
Motivation
[Diagram revisited: MT translates multi-lingual text into the user's native language; cross-language IR searches multi-lingual text directly.]
Outline
• Introduction
• Searching to Translate (IR → MT)
  - Cross-Lingual Pairwise Document Similarity (Ture et al., SIGIR'11)
  - Extracting Parallel Text From Comparable Corpora (Ture and Lin, NAACL'12)
• Translating to Search (MT → IR)
  - Context-Sensitive Query Translation (Ture et al., SIGIR'12; Ture et al., COLING'12; Ture and Lin, SIGIR'13)
• Conclusions
Extracting Parallel Text from the Web
[Pipeline diagram —
Phase 1: the source collection F and the target collection E are preprocessed into doc vectors; signature generation and a sliding window algorithm yield cross-lingual document pairs.
Phase 2: candidate sentence pairs are generated from these document pairs and fed to a 2-step parallel text classifier, producing aligned bilingual sentence pairs (F-E parallel text).]
Pairwise Similarity
• Pairwise similarity: finding similar pairs of documents in a large collection
• Challenges:
  - quadratic search space
  - measuring similarity effectively and efficiently
• Focus on recall and scalability
Locality-Sensitive Hashing

[Diagram: Ne English articles → (preprocess) → Ne English document vectors, e.g. <nobel=0.324, prize=0.227, book=0.01, …> → (signature generation) → Ne signatures, e.g. [0111000010...] → (sliding window algorithm) → similar article pairs]
• LSH(vector) = signature
  - faster similarity computation, s.t. similarity(vector pair) ≈ similarity(signature pair)
  - e.g., ~20 times faster than computing (cosine) similarity from vectors, with similarity error ≈ 0.03
• Sliding window algorithm
  - approximate similarity search based on LSH
  - linear run-time
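The signature trick is random-hyperplane LSH: each bit records the side of a random hyperplane, and the fraction of agreeing bits approximates cosine similarity. A minimal sketch (the vectors, dimensions, and 64-bit signature length here are illustrative assumptions):

```python
import random

def lsh_signature(vec, hyperplanes):
    """One bit per random hyperplane: 1 iff the vector lies on its positive side."""
    return [1 if sum(vec.get(d, 0.0) * h.get(d, 0.0) for d in vec) >= 0 else 0
            for h in hyperplanes]

def hamming_similarity(sig_a, sig_b):
    """Fraction of agreeing bits; approximates cosine similarity of the vectors."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

random.seed(0)
dims = ["nobel", "prize", "book", "laureate"]
planes = [{d: random.gauss(0, 1) for d in dims} for _ in range(64)]

v1 = {"nobel": 0.324, "prize": 0.227, "book": 0.01}
v2 = {"nobel": 0.310, "prize": 0.250}   # near-duplicate of v1
v3 = {"book": 0.8, "laureate": 0.1}     # unrelated document

s1, s2, s3 = (lsh_signature(v, planes) for v in (v1, v2, v3))
```

Comparing 64-bit signatures is far cheaper than a full dot product over sparse vectors, which is where the ~20x speedup quoted above comes from.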
Sliding window algorithm (LSH-based; Ravichandran et al., 2005)

Generating tables (MapReduce): each signature is permuted under Q random permutations p1 … pQ; for each permutation, the permuted signatures are sorted into a table (table1 … tableQ).

Example (# bits = 11, # tables = 2, window size = 2):
  Signatures: (1, 11011011101), (2, 01110000101), (3, 10101010000)
  Each sorted table is scanned with a sliding window; Hamming distance is computed only for signatures that fall into the same window:
  table1: Distance(3,2) = 7 ✗, Distance(2,1) = 5 ✓
  table2: Distance(2,3) = 7 ✗, Distance(3,1) = 6 ✓
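The permute-sort-scan procedure can be sketched compactly; Q and B follow the slide's parameter names, and this in-memory version stands in for the MapReduce implementation:

```python
import random

def sliding_window_pairs(signatures, Q=4, B=2, seed=0):
    """signatures: dict id -> list of bits. Returns candidate pairs whose
    permuted signatures land in the same window of size B in some sorted table."""
    rng = random.Random(seed)
    nbits = len(next(iter(signatures.values())))
    candidates = set()
    for _ in range(Q):
        perm = list(range(nbits))
        rng.shuffle(perm)
        # permute every signature, then sort lexicographically -> one "table"
        table = sorted(signatures,
                       key=lambda i: [signatures[i][b] for b in perm])
        for pos in range(len(table)):
            # compare only with the B-1 following entries in the window
            for nxt in table[pos + 1 : pos + B]:
                candidates.add(tuple(sorted((table[pos], nxt))))
    return candidates

sigs = {1: [1,1,0,1,1,0,1,1,1,0,1],
        2: [0,1,1,1,0,0,0,0,1,0,1],
        3: [1,0,1,0,1,0,1,0,0,0,0]}
pairs = sliding_window_pairs(sigs)
```

Each of the Q tables costs one sort plus a linear scan, which is how the approach avoids the quadratic all-pairs comparison.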
Cross-lingual Pairwise Similarity

[Diagram: MT approach — German Doc A is translated into English with MT, then its doc vector vA is compared with the doc vector vB of English Doc B. CLIR approach — Doc A's German doc vector vA is translated directly with CLIR, then compared with vB.]
MT vs. CLIR for Pairwise Similarity
[Score distributions for positive and negative pairs (clir-neg, clir-pos, mt-neg, mt-pos): similarity values are low overall, but positives and negatives are clearly separated. MT is slightly better than CLIR, but 600 times slower!]
Locality-Sensitive Hashing for Cross-Lingual Pairwise Similarity

[Diagram: as in the monolingual case, Ne English articles are preprocessed into English document vectors; additionally, Nf German articles are translated with CLIR into English document vectors. The combined Ne+Nf vectors go through signature generation and the sliding window algorithm to produce similar article pairs.]
Evaluation
• Experiments with De/Es/Cs/Ar/Zh/Tr to En Wikipedia
• Collection: 3.44m En + 1.47m De Wikipedia articles
• Task: for each German Wikipedia article, find {all English articles s.t. cosine similarity > 0.30}
• Parameters: # bits (D) = 1000, # tables (Q) = 100-1500, window size (B) = 100-2000
Scalability
[Scalability charts omitted.]

Evaluation: two sources of error

[Diagram: a brute-force search over signatures yields the upper bound on similar article pairs; a brute-force search over document vectors yields the ground truth; signature generation followed by the sliding window algorithm yields the algorithm output, which is evaluated against both.]
• 95% recall at 39% of the cost
• 99% recall at 70% of the cost
• 95% recall at 40% of the cost
• 99% recall at 62% of the cost
• 100% recall with no savings = no free lunch!
Outline
• Introduction
• Searching to Translate (IR → MT)
  - Cross-Lingual Pairwise Document Similarity (Ture et al., SIGIR'11)
  - Extracting Parallel Text From Comparable Corpora (Ture and Lin, NAACL'12)
• Translating to Search (MT → IR)
  - Context-Sensitive Query Translation (Ture et al., SIGIR'12; Ture et al., COLING'12; Ture and Lin, SIGIR'13)
• Conclusions
Phase 2: Extracting Parallel Text

Approach:
1. Generate candidate sentence pairs from each document pair
2. Classify each candidate as 'parallel' or 'not parallel'

Challenge: 10s of millions of doc pairs ≈ 100s of billions of sentence pairs
Solution: a 2-step classification approach
  - a simple classifier efficiently filters out irrelevant pairs
  - a complex classifier effectively classifies the remaining pairs
Parallel Text (Bitext) Classifier — features:
• cosine similarity of the two sentences
• sentence length ratio: the ratio of the lengths of the two sentences
• word translation ratio: the ratio of words in the source (target) sentence with a translation in the target (source) sentence
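These three features can be computed as follows; the whitespace tokens and tiny translation lexicon are hypothetical stand-ins for the learned alignment model used in the thesis:

```python
import math
from collections import Counter

def cosine(a_tokens, b_tokens):
    """Cosine similarity between bag-of-words count vectors."""
    a, b = Counter(a_tokens), Counter(b_tokens)
    dot = sum(a[t] * b.get(t, 0) for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def length_ratio(src, tgt):
    """Ratio of the shorter to the longer sentence length."""
    return min(len(src), len(tgt)) / max(len(src), len(tgt))

def translation_ratio(src, tgt, lexicon):
    """Fraction of source words with at least one translation in the target."""
    tgt_set = set(tgt)
    covered = sum(1 for w in src if lexicon.get(w, set()) & tgt_set)
    return covered / len(src)

# hypothetical lexicon entries for illustration only
lexicon = {"leave": {"congé", "laisser"}, "maternal": {"maternité"},
           "europe": {"europe"}}
src = ["maternal", "leave", "in", "europe"]
tgt = ["congé", "de", "maternité", "en", "europe"]
features = [cosine(src, tgt), length_ratio(src, tgt),
            translation_ratio(src, tgt, lexicon)]
```

The simple classifier can threshold cheap features like these; the complex classifier combines them (and richer ones) in a trained model.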
Bitext Extraction Algorithm

[MapReduce pipeline: for each cross-lingual document pair, sentence detection + tf-idf yields the sentences and sentence vectors of the source and target documents (MAP); their cartesian product yields candidate sentence pairs; simple classification produces bitext S1, and complex classification (REDUCE) produces bitext S2.

Running times: candidate generation 2.4 hours, shuffle & sort 1.3 hours, simple classification 4.1 hours, complex classification 0.5 hours; the number of pairs shrinks from 400 billion to 214 billion to 132 billion along the way.]
Extracting Bitext from Wikipedia

Size                     English  German     Spanish    Chinese     Arabic      Czech      Turkish
Documents                4.0m     1.42m      0.99m      0.59m       0.25m       0.26m      0.23m
Similar doc pairs        -        35.9m      51.5m      14.8m       5.4m        9.1m       17.1m
Sentences                ~90m     42.3m      19.9m      5.5m        2.6m        5.1m       3.5m
Candidate sentence pairs -        530b       356b       62b         48b         101b       142b
S1                       -        292m       178m       63m         7m          203m       69m
S2                       -        0.2-3.3m   0.9-3.3m   50k-290k    130-320k    0.5-1.6m   8-250k
Baseline training data   -        2.1m       2.1m       303k        3.4m        0.78m      53k
Dev/Test set             -        WMT-11/12  WMT-11/12  NIST-06/08  NIST-06/08  WMT-11/12  held-out
Baseline BLEU            -        24.50      33.44      25.38       63.15       23.11      27.22
Evaluation on MT

[BLEU charts comparing the baseline MT system with the system trained on baseline + extracted bitext, per language pair.]
Conclusions (Part I)

• Summary
  - Scalable approach to extract parallel text from a comparable corpus
  - Improvements over state-of-the-art MT baseline
  - General algorithm applicable to any data format
• Future work
  - Domain adaptation
  - Experimenting with larger web collections
Outline
• Introduction
• Searching to Translate (IR → MT)
  - Cross-Lingual Pairwise Document Similarity (Ture et al., SIGIR'11)
  - Extracting Parallel Text From Comparable Corpora (Ture and Lin, NAACL'12)
• Translating to Search (MT → IR)
  - Context-Sensitive Query Translation (Ture et al., SIGIR'12; Ture et al., COLING'12; Ture and Lin, SIGIR'13)
• Conclusions
Cross-Language Information Retrieval
• Information Retrieval (IR): given an information need, find relevant material
• Cross-language IR (CLIR): query and documents in different languages
  - "Why does China want to import technology to build Maglev Railway?" ➡ relevant information in Chinese documents
  - "Maternal Leave in Europe" ➡ relevant information in French, Spanish, German, etc.
Machine Translation for CLIR

[Diagram of a statistical MT system mapping a query to (ranked) documents: a token aligner produces token alignments from a sentence-aligned parallel corpus, and token translation probabilities are derived from them; a grammar extractor builds the translation grammar; the decoder, using the grammar and a language model, turns the query "maternal leave in Europe" into n-best translations and the 1-best translation "congé de maternité en Europe".]
Token-based CLIR
• Token translation formula: Pr(f | e), estimated from token alignments in the parallel corpus, e.g.:

  … most leave their children in …
  … aim of extending maternity leave to …
  … la plupart laisse leurs enfants …
  … l'objectif de l'extension des congés de maternité à …

Token-based probabilities for "leave" in the query "Maternal leave in Europe":
  1. laisser (Eng. forget) 49%
  2. congé (Eng. time off) 17%
  3. quitter (Eng. quit) 9%
  4. partir (Eng. disappear) 7%
  …
Document Retrieval
• How to score a document, given a query?

Query q1: "maternal leave in Europe", with maternal ➡ [maternité : 0.74, maternel : 0.26]

[Diagram: each document d1 in the collection is scored using the projected term statistics tf(maternité), tf(maternel), df(maternité), df(maternel), …]
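One way to use these translation probabilities in scoring, in the spirit of probabilistic structured queries (Darwish and Oard): project each English term's tf and df through its translation distribution, then score as usual. A simplified sketch with made-up statistics, not the thesis's exact scoring function:

```python
import math

def projected_stats(term, prob, tf_doc, df):
    """Project an English term's statistics into the document language:
    tf(e, d) = sum_f Pr(f|e) * tf(f, d), and likewise for df(e)."""
    tf_e = sum(p * tf_doc.get(f, 0) for f, p in prob[term].items())
    df_e = sum(p * df.get(f, 0) for f, p in prob[term].items())
    return tf_e, df_e

def score(query, prob, tf_doc, df, n_docs):
    """Simple tf-idf over projected statistics, one query term at a time."""
    total = 0.0
    for e in query:
        tf_e, df_e = projected_stats(e, prob, tf_doc, df)
        if tf_e > 0 and df_e > 0:
            total += tf_e * math.log(n_docs / df_e)
    return total

# hypothetical translation probabilities and collection statistics
prob = {"maternal": {"maternité": 0.74, "maternel": 0.26}}
tf_doc = {"maternité": 3, "maternel": 1}   # counts in one French document
df = {"maternité": 50, "maternel": 200}    # document frequencies
n_docs = 10000
s = score(["maternal"], prob, tf_doc, df, n_docs)
```
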
Token-based CLIR assigns "leave" the same distribution (laisser 49%, congé 17%, quitter 9%, partir 7%, …) regardless of the surrounding query.

Context-Sensitive CLIR

This talk: MT for context-sensitive CLIR. In the context of "Maternal leave in Europe", the distribution shifts:
  1. laisser (Eng. forget) 12%
  2. congé (Eng. time off) 70%
  3. quitter (Eng. quit) 6%
  4. partir (Eng. disappear) 5%
  …
Previous approach: token-based CLIR, treating MT as a black box.
Our approach: looking inside the box.

MT for Context-Sensitive CLIR

[Diagram: inside the statistical MT system, three intermediate products are exposed for CLIR: the token translation probabilities (from the token aligner's alignments), the translation grammar (from the grammar extractor), and the n-best derivations (from the decoder), alongside the 1-best translation "congé de maternité en Europe".]
CLIR from the translation grammar

• Token translation formula: grammar-based probabilities, read off the Synchronous Context-Free Grammar (SCFG) [Chiang, 2007] rules used in translating the query:

  S → [X : X], 1.0
  X → [X1 leave in europe : congé de X1 en europe], 0.9
  X → [maternal : maternité], 0.9
  X → [X1 leave : congé de X1], 0.74
  X → [leave : congé], 0.17
  X → [leave : laisser], 0.49
  …

[Diagram: a synchronous hierarchical derivation pairing the English side ("maternal leave in Europe") with the French side ("congé de maternité en Europe").]
CLIR from n-best derivations
• Token translation formula: translation-based probabilities, computed from the n-best derivations, each weighted by its score:
  t(1): { 1st-best derivation, 0.8 }
  t(2): { 2nd-best derivation, 0.11 }
  …
  t(k): { kth-best derivation, score(t(k)|s) }

[Diagram: the two best synchronous derivations of "maternal leave in Europe", both yielding "congé de maternité en Europe" through different bracketings.]
MT for Context-Sensitive CLIR — summary

[Diagram: ambiguity is preserved and context sensitivity increases across the models: token-based (Pr_token, from the token alignments), grammar-based (Pr_SCFG, from the translation grammar), translation-based (Pr_nbest, from the n-best derivations), and 1-best MT (the 1-best translation), all read off the MT pipeline built from sentence-aligned bitext.]
Combining Evidence

• For best results, we compute an interpolated probability distribution. Example for "leave", with weights 40% Pr_token, 35% Pr_SCFG, 25% Pr_nbest:

  Pr_token:  laisser 0.14, congé 0.70, quitter 0.06, …
  Pr_SCFG:   laisser 0.72, congé 0.10, quitter 0.09, …
  Pr_nbest:  laisser 0.09, congé 0.90, quitter 0.11, …
  Pr_interp: laisser 0.33, congé 0.54, quitter 0.08, …

• With weights 100% Pr_SCFG and 0% for the others, Pr_interp reduces to Pr_SCFG (laisser 0.72, congé 0.10, quitter 0.09, …).
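The interpolation itself is a weighted mixture of the three distributions. A minimal sketch, using the distributions from the worked example and the weight assignment (40/35/25) consistent with its numbers:

```python
def interpolate(dists, weights):
    """Weighted mixture of translation distributions for one query token."""
    assert abs(sum(weights) - 1.0) < 1e-9
    out = {}
    for dist, w in zip(dists, weights):
        for tok, p in dist.items():
            out[tok] = out.get(tok, 0.0) + w * p
    return out

pr_token = {"laisser": 0.14, "congé": 0.70, "quitter": 0.06}
pr_scfg  = {"laisser": 0.72, "congé": 0.10, "quitter": 0.09}
pr_nbest = {"laisser": 0.09, "congé": 0.90, "quitter": 0.11}

pr_interp = interpolate([pr_token, pr_scfg, pr_nbest], [0.40, 0.35, 0.25])
# laisser: 0.40*0.14 + 0.35*0.72 + 0.25*0.09 ≈ 0.33
```
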
Experiments
• Three tasks:
  1. TREC 2002 English-Arabic CLIR task: 50 English queries, 383,872 Arabic documents
  2. NTCIR-8 English-Chinese ACLIA task: 73 English queries, 388,859 Chinese documents
  3. CLEF 2006 English-French CLIR task: 50 English queries, 177,452 French documents
• Implementation
  - cdec MT system [Dyer et al., 2010]
  - Hiero-style grammars, GIZA++ for token alignments
Comparison of Models

[Bar charts of Mean Average Precision (MAP), from 0.00 to 0.30, for English-French CLEF 2006, English-Arabic TREC 2002, and English-Chinese NTCIR-8, comparing token-based, grammar-based, translation-based (10-best), 1-best MT, and the best interpolation.]

Interpolated is significantly better than token-based and 1-best MT in all three cases.
Conclusions (Part II)

• Summary
  - A novel framework for context-sensitive and ambiguity-preserving CLIR
  - Interpolation of proposed models works best
  - Significant improvements in MAP for three tasks
• Future work
  - Robust parameter optimization
  - Document vs. query translation with MT
Contributions

[Diagram, built up over several slides: comparable corpora feed the Bitext Extraction pipeline, which builds on token-based CLIR; the extracted bitext is added to the MT pipeline's baseline bitext, giving higher BLEU for 5 language pairs. The resulting MT translation model feeds a CLIR translation model, enabling context-sensitive CLIR with higher MAP for 3 language pairs. Feeding the improved CLIR translation model back into bitext extraction produces more bitext and higher BLEU after an additional iteration.]
• LSH-based MapReduce approach to pairwise similarity
• Exploration of the parameter space for the sliding window algorithm
• MapReduce algorithm to generate candidate sentence pairs
• 2-step classification approach to bitext extraction
  ➡ bitext from Wikipedia: improvement over state-of-the-art MT
• Set of techniques for context-sensitive CLIR using MT
  ➡ combination of evidence works best
• Framework for better integration of MT and IR
• Bootstrapping approach to show feasibility
• All code and data available as part of the Ivory project (www.ivory.cc)

Thank you!