“Searching to Translate”, and “Translating to Search”: When Information Retrieval Meets Machine Translation
Ferhan Ture
Dissertation defense, May 24, 2013
Department of Computer Science
University of Maryland at College Park
Motivation
• Fact 1: People want to access information, e.g., web pages, videos, restaurants, products, …
• Fact 2: Lots of data out there… but also lots of noise, redundancy, and different languages
• Goal: Find ways to efficiently and effectively
  - Search complex, noisy data
  - Deliver content in an appropriate form
[Diagram: multi-lingual text → user's native language; forum posts → clustered summaries]
Information Retrieval
Information retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers). In our work, we assume that the material is a collection of documents written in natural language, and the information need is provided in the form of a query, ranging from a few words to an entire document. A typical approach in IR is to represent each document as a vector of weighted terms, where a term usually means either a word or its stem. A pre-determined list of stop words (e.g., ``the'', ``an'', ``my'') may be removed from the set of terms, since they have been found to create noise in the search process. Documents are scored, relative to the query, usually by scoring each query term independently and aggregating these term-document scores.
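The vector representation and term-at-a-time scoring described above can be sketched in a few lines (a minimal tf-idf illustration with made-up toy documents; real systems use more refined weighting):

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Represent each tokenized document as a vector of tf-idf weighted terms."""
    n = len(docs)
    df = Counter(t for doc in docs for t in set(doc))  # document frequencies
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vectors

def score(query, vector):
    """Score one document: score each query term independently and aggregate."""
    return sum(vector.get(t, 0.0) for t in query)

# toy collection (already stop-worded and stemmed)
docs = [["nobel", "prize", "book"], ["book", "review"], ["nobel", "laureate"]]
vecs = tfidf_vectors(docs)
ranking = sorted(range(len(docs)),
                 key=lambda i: score(["nobel", "prize"], vecs[i]), reverse=True)
```
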
Resulting term weights (stemmed): queri:11.69, ir:11.39, vector:7.93, document:7.09, nois:6.92, stem:6.56, score:5.68, weight:5.67, word:5.46, materi:5.42, search:5.06, term:5.03, text:4.87, comput:4.73, need:4.61, collect:4.48, natur:4.36, languag:4.12, find:3.92, repres:3.58
Cross-Language Information Retrieval
Example German document (the German Wikipedia article on Information Retrieval, translated): Information Retrieval (IR), occasionally and imprecisely called information acquisition, is a field concerned with computer-supported search for complex content (i.e., not single words), and falls within information science, computer science, and computational linguistics. As the meaning of the word retrieval suggests, complex texts or image data stored in large databases are at first not accessible or retrievable to outsiders. Information retrieval is about finding existing information, not discovering new structures (as in Knowledge Discovery in Databases, which includes data mining and text mining).
Machine Translation
German (MT output): Maschinelle Übersetzung (MT) ist, um Text in einer Ausgangssprache in entsprechenden Text in der Zielsprache geschrieben übersetzen.
English: Machine translation (MT) is to translate text written in a source language into corresponding text in a target language.
Motivation
[Diagram revisited: MT translates multi-lingual text into the user's native language; cross-language IR searches multi-lingual text directly.]
Outline
• Introduction
• Searching to Translate (IR → MT)
  - Cross-Lingual Pairwise Document Similarity (Ture et al., SIGIR'11)
  - Extracting Parallel Text From Comparable Corpora (Ture and Lin, NAACL'12)
• Translating to Search (MT → IR)
  - Context-Sensitive Query Translation (Ture et al., SIGIR'12; Ture et al., COLING'12; Ture and Lin, SIGIR'13)
• Conclusions
Extracting Parallel Text from the Web
[Pipeline diagram —
Phase 1: the source collection F and the target collection E are preprocessed into doc vectors; signature generation and a sliding window algorithm yield cross-lingual document pairs.
Phase 2: candidate sentence pairs are generated from these document pairs and fed to a 2-step parallel text classifier, producing aligned bilingual sentence pairs (F-E parallel text).]
Pairwise Similarity
• Pairwise similarity: finding similar pairs of documents in a large collection
• Challenges:
  - quadratic search space
  - measuring similarity effectively and efficiently
• Focus on recall and scalability
Locality-Sensitive Hashing

[Diagram: Ne English articles → (preprocess) → Ne English document vectors, e.g. <nobel=0.324, prize=0.227, book=0.01, …> → (signature generation) → Ne signatures, e.g. [0111000010...] → (sliding window algorithm) → similar article pairs]
• LSH(vector) = signature
  - faster similarity computation, s.t. similarity(vector pair) ≈ similarity(signature pair)
  - e.g., ~20 times faster than computing (cosine) similarity from vectors, with similarity error ≈ 0.03
• Sliding window algorithm
  - approximate similarity search based on LSH
  - linear run-time
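The signature trick is random-hyperplane LSH: each bit records the side of a random hyperplane, and the fraction of agreeing bits approximates cosine similarity. A minimal sketch (the vectors, dimensions, and 64-bit signature length here are illustrative assumptions):

```python
import random

def lsh_signature(vec, hyperplanes):
    """One bit per random hyperplane: 1 iff the vector lies on its positive side."""
    return [1 if sum(vec.get(d, 0.0) * h.get(d, 0.0) for d in vec) >= 0 else 0
            for h in hyperplanes]

def hamming_similarity(sig_a, sig_b):
    """Fraction of agreeing bits; approximates cosine similarity of the vectors."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

random.seed(0)
dims = ["nobel", "prize", "book", "laureate"]
planes = [{d: random.gauss(0, 1) for d in dims} for _ in range(64)]

v1 = {"nobel": 0.324, "prize": 0.227, "book": 0.01}
v2 = {"nobel": 0.310, "prize": 0.250}   # near-duplicate of v1
v3 = {"book": 0.8, "laureate": 0.1}     # unrelated document

s1, s2, s3 = (lsh_signature(v, planes) for v in (v1, v2, v3))
```

Comparing 64-bit signatures is far cheaper than a full dot product over sparse vectors, which is where the ~20x speedup quoted above comes from.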
Sliding window algorithm (LSH-based; Ravichandran et al., 2005)

Generating tables (MapReduce): each signature is permuted under Q random permutations p1 … pQ; for each permutation, the permuted signatures are sorted into a table (table1 … tableQ).

Example (# bits = 11, # tables = 2, window size = 2):
  Signatures: (1, 11011011101), (2, 01110000101), (3, 10101010000)
  Each sorted table is scanned with a sliding window; Hamming distance is computed only for signatures that fall into the same window:
  table1: Distance(3,2) = 7 ✗, Distance(2,1) = 5 ✓
  table2: Distance(2,3) = 7 ✗, Distance(3,1) = 6 ✓
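The permute-sort-scan procedure can be sketched compactly; Q and B follow the slide's parameter names, and this in-memory version stands in for the MapReduce implementation:

```python
import random

def sliding_window_pairs(signatures, Q=4, B=2, seed=0):
    """signatures: dict id -> list of bits. Returns candidate pairs whose
    permuted signatures land in the same window of size B in some sorted table."""
    rng = random.Random(seed)
    nbits = len(next(iter(signatures.values())))
    candidates = set()
    for _ in range(Q):
        perm = list(range(nbits))
        rng.shuffle(perm)
        # permute every signature, then sort lexicographically -> one "table"
        table = sorted(signatures,
                       key=lambda i: [signatures[i][b] for b in perm])
        for pos in range(len(table)):
            # compare only with the B-1 following entries in the window
            for nxt in table[pos + 1 : pos + B]:
                candidates.add(tuple(sorted((table[pos], nxt))))
    return candidates

sigs = {1: [1,1,0,1,1,0,1,1,1,0,1],
        2: [0,1,1,1,0,0,0,0,1,0,1],
        3: [1,0,1,0,1,0,1,0,0,0,0]}
pairs = sliding_window_pairs(sigs)
```

Each of the Q tables costs one sort plus a linear scan, which is how the approach avoids the quadratic all-pairs comparison.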
Cross-lingual Pairwise Similarity

[Diagram: MT approach — German Doc A is translated into English with MT, then its doc vector vA is compared with the doc vector vB of English Doc B. CLIR approach — Doc A's German doc vector vA is translated directly with CLIR, then compared with vB.]
MT vs. CLIR for Pairwise Similarity
[Score distributions for positive and negative pairs (clir-neg, clir-pos, mt-neg, mt-pos): similarity values are low overall, but positives and negatives are clearly separated. MT is slightly better than CLIR, but 600 times slower!]
Locality-Sensitive Hashing for Cross-Lingual Pairwise Similarity

[Diagram: as in the monolingual case, Ne English articles are preprocessed into English document vectors; additionally, Nf German articles are translated with CLIR into English document vectors. The combined Ne+Nf vectors go through signature generation and the sliding window algorithm to produce similar article pairs.]
Evaluation
• Experiments with De/Es/Cs/Ar/Zh/Tr to En Wikipedia
• Collection: 3.44m En + 1.47m De Wikipedia articles
• Task: for each German Wikipedia article, find {all English articles s.t. cosine similarity > 0.30}
• Parameters: # bits (D) = 1000, # tables (Q) = 100-1500, window size (B) = 100-2000
Scalability
[Scalability charts omitted.]

Evaluation: two sources of error

[Diagram: a brute-force search over signatures yields the upper bound on similar article pairs; a brute-force search over document vectors yields the ground truth; signature generation followed by the sliding window algorithm yields the algorithm output, which is evaluated against both.]
• 95% recall at 39% of the cost
• 99% recall at 70% of the cost
• 95% recall at 40% of the cost
• 99% recall at 62% of the cost
• 100% recall with no savings = no free lunch!
Outline
• Introduction
• Searching to Translate (IR → MT)
  - Cross-Lingual Pairwise Document Similarity (Ture et al., SIGIR'11)
  - Extracting Parallel Text From Comparable Corpora (Ture and Lin, NAACL'12)
• Translating to Search (MT → IR)
  - Context-Sensitive Query Translation (Ture et al., SIGIR'12; Ture et al., COLING'12; Ture and Lin, SIGIR'13)
• Conclusions
Phase 2: Extracting Parallel Text

Approach:
1. Generate candidate sentence pairs from each document pair
2. Classify each candidate as 'parallel' or 'not parallel'

Challenge: 10s of millions of doc pairs ≈ 100s of billions of sentence pairs
Solution: a 2-step classification approach
  - a simple classifier efficiently filters out irrelevant pairs
  - a complex classifier effectively classifies the remaining pairs
Parallel Text (Bitext) Classifier — features:
• cosine similarity of the two sentences
• sentence length ratio: the ratio of the lengths of the two sentences
• word translation ratio: the ratio of words in the source (target) sentence with a translation in the target (source) sentence
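These three features can be computed as follows; the whitespace tokens and tiny translation lexicon are hypothetical stand-ins for the learned alignment model used in the thesis:

```python
import math
from collections import Counter

def cosine(a_tokens, b_tokens):
    """Cosine similarity between bag-of-words count vectors."""
    a, b = Counter(a_tokens), Counter(b_tokens)
    dot = sum(a[t] * b.get(t, 0) for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def length_ratio(src, tgt):
    """Ratio of the shorter to the longer sentence length."""
    return min(len(src), len(tgt)) / max(len(src), len(tgt))

def translation_ratio(src, tgt, lexicon):
    """Fraction of source words with at least one translation in the target."""
    tgt_set = set(tgt)
    covered = sum(1 for w in src if lexicon.get(w, set()) & tgt_set)
    return covered / len(src)

# hypothetical lexicon entries for illustration only
lexicon = {"leave": {"congé", "laisser"}, "maternal": {"maternité"},
           "europe": {"europe"}}
src = ["maternal", "leave", "in", "europe"]
tgt = ["congé", "de", "maternité", "en", "europe"]
features = [cosine(src, tgt), length_ratio(src, tgt),
            translation_ratio(src, tgt, lexicon)]
```

The simple classifier can threshold cheap features like these; the complex classifier combines them (and richer ones) in a trained model.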
Bitext Extraction Algorithm

[MapReduce pipeline: for each cross-lingual document pair, sentence detection + tf-idf yields the sentences and sentence vectors of the source and target documents (MAP); their cartesian product yields candidate sentence pairs; simple classification produces bitext S1, and complex classification (REDUCE) produces bitext S2.

Running times: candidate generation 2.4 hours, shuffle & sort 1.3 hours, simple classification 4.1 hours, complex classification 0.5 hours; the number of pairs shrinks from 400 billion to 214 billion to 132 billion along the way.]
Extracting Bitext from Wikipedia

Size                     English  German     Spanish    Chinese     Arabic      Czech      Turkish
Documents                4.0m     1.42m      0.99m      0.59m       0.25m       0.26m      0.23m
Similar doc pairs        -        35.9m      51.5m      14.8m       5.4m        9.1m       17.1m
Sentences                ~90m     42.3m      19.9m      5.5m        2.6m        5.1m       3.5m
Candidate sentence pairs -        530b       356b       62b         48b         101b       142b
S1                       -        292m       178m       63m         7m          203m       69m
S2                       -        0.2-3.3m   0.9-3.3m   50k-290k    130-320k    0.5-1.6m   8-250k
Baseline training data   -        2.1m       2.1m       303k        3.4m        0.78m      53k
Dev/Test set             -        WMT-11/12  WMT-11/12  NIST-06/08  NIST-06/08  WMT-11/12  held-out
Baseline BLEU            -        24.50      33.44      25.38       63.15       23.11      27.22
Evaluation on MT

[BLEU charts comparing the baseline MT system with the system trained on baseline + extracted bitext, per language pair.]
Conclusions (Part I)

• Summary
  - Scalable approach to extract parallel text from a comparable corpus
  - Improvements over state-of-the-art MT baseline
  - General algorithm applicable to any data format
• Future work
  - Domain adaptation
  - Experimenting with larger web collections
Outline
• Introduction
• Searching to Translate (IR → MT)
  - Cross-Lingual Pairwise Document Similarity (Ture et al., SIGIR'11)
  - Extracting Parallel Text From Comparable Corpora (Ture and Lin, NAACL'12)
• Translating to Search (MT → IR)
  - Context-Sensitive Query Translation (Ture et al., SIGIR'12; Ture et al., COLING'12; Ture and Lin, SIGIR'13)
• Conclusions
Cross-Language Information Retrieval
• Information Retrieval (IR): given an information need, find relevant material
• Cross-language IR (CLIR): query and documents in different languages
  - "Why does China want to import technology to build Maglev Railway?" ➡ relevant information in Chinese documents
  - "Maternal Leave in Europe" ➡ relevant information in French, Spanish, German, etc.
Machine Translation for CLIR

[Diagram of a statistical MT system mapping a query to (ranked) documents: a token aligner produces token alignments from a sentence-aligned parallel corpus, and token translation probabilities are derived from them; a grammar extractor builds the translation grammar; the decoder, using the grammar and a language model, turns the query "maternal leave in Europe" into n-best translations and the 1-best translation "congé de maternité en Europe".]
Token-based CLIR
• Token translation formula: Pr(f | e), estimated from token alignments in the parallel corpus, e.g.:

  … most leave their children in …
  … aim of extending maternity leave to …
  … la plupart laisse leurs enfants …
  … l'objectif de l'extension des congés de maternité à …

Token-based probabilities for "leave" in the query "Maternal leave in Europe":
  1. laisser (Eng. forget) 49%
  2. congé (Eng. time off) 17%
  3. quitter (Eng. quit) 9%
  4. partir (Eng. disappear) 7%
  …
Document Retrieval
• How to score a document, given a query?

Query q1: "maternal leave in Europe", with maternal ➡ [maternité : 0.74, maternel : 0.26]

[Diagram: each document d1 in the collection is scored using the projected term statistics tf(maternité), tf(maternel), df(maternité), df(maternel), …]
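One way to use these translation probabilities in scoring, in the spirit of probabilistic structured queries (Darwish and Oard): project each English term's tf and df through its translation distribution, then score as usual. A simplified sketch with made-up statistics, not the thesis's exact scoring function:

```python
import math

def projected_stats(term, prob, tf_doc, df):
    """Project an English term's statistics into the document language:
    tf(e, d) = sum_f Pr(f|e) * tf(f, d), and likewise for df(e)."""
    tf_e = sum(p * tf_doc.get(f, 0) for f, p in prob[term].items())
    df_e = sum(p * df.get(f, 0) for f, p in prob[term].items())
    return tf_e, df_e

def score(query, prob, tf_doc, df, n_docs):
    """Simple tf-idf over projected statistics, one query term at a time."""
    total = 0.0
    for e in query:
        tf_e, df_e = projected_stats(e, prob, tf_doc, df)
        if tf_e > 0 and df_e > 0:
            total += tf_e * math.log(n_docs / df_e)
    return total

# hypothetical translation probabilities and collection statistics
prob = {"maternal": {"maternité": 0.74, "maternel": 0.26}}
tf_doc = {"maternité": 3, "maternel": 1}   # counts in one French document
df = {"maternité": 50, "maternel": 200}    # document frequencies
n_docs = 10000
s = score(["maternal"], prob, tf_doc, df, n_docs)
```
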
Token-based CLIR assigns "leave" the same distribution (laisser 49%, congé 17%, quitter 9%, partir 7%, …) regardless of the surrounding query.

Context-Sensitive CLIR

This talk: MT for context-sensitive CLIR. In the context of "Maternal leave in Europe", the distribution shifts:
  1. laisser (Eng. forget) 12%
  2. congé (Eng. time off) 70%
  3. quitter (Eng. quit) 6%
  4. partir (Eng. disappear) 5%
  …
Previous approach: token-based CLIR, treating MT as a black box.
Our approach: looking inside the box.

MT for Context-Sensitive CLIR

[Diagram: inside the statistical MT system, three intermediate products are exposed for CLIR: the token translation probabilities (from the token aligner's alignments), the translation grammar (from the grammar extractor), and the n-best derivations (from the decoder), alongside the 1-best translation "congé de maternité en Europe".]
CLIR from the translation grammar

• Token translation formula: grammar-based probabilities, read off the Synchronous Context-Free Grammar (SCFG) [Chiang, 2007] rules used in translating the query:

  S → [X : X], 1.0
  X → [X1 leave in europe : congé de X1 en europe], 0.9
  X → [maternal : maternité], 0.9
  X → [X1 leave : congé de X1], 0.74
  X → [leave : congé], 0.17
  X → [leave : laisser], 0.49
  …

[Diagram: a synchronous hierarchical derivation pairing the English side ("maternal leave in Europe") with the French side ("congé de maternité en Europe").]
CLIR from n-best derivations
• Token translation formula: translation-based probabilities, computed from the n-best derivations, each weighted by its score:
  t(1): { 1st-best derivation, 0.8 }
  t(2): { 2nd-best derivation, 0.11 }
  …
  t(k): { kth-best derivation, score(t(k)|s) }

[Diagram: the two best synchronous derivations of "maternal leave in Europe", both yielding "congé de maternité en Europe" through different bracketings.]
MT for Context-Sensitive CLIR — summary

[Diagram: ambiguity is preserved and context sensitivity increases across the models: token-based (Pr_token, from the token alignments), grammar-based (Pr_SCFG, from the translation grammar), translation-based (Pr_nbest, from the n-best derivations), and 1-best MT (the 1-best translation), all read off the MT pipeline built from sentence-aligned bitext.]
Combining Evidence

• For best results, we compute an interpolated probability distribution. Example for "leave", with weights 40% Pr_token, 35% Pr_SCFG, 25% Pr_nbest:

  Pr_token:  laisser 0.14, congé 0.70, quitter 0.06, …
  Pr_SCFG:   laisser 0.72, congé 0.10, quitter 0.09, …
  Pr_nbest:  laisser 0.09, congé 0.90, quitter 0.11, …
  Pr_interp: laisser 0.33, congé 0.54, quitter 0.08, …

• With weights 100% Pr_SCFG and 0% for the others, Pr_interp reduces to Pr_SCFG (laisser 0.72, congé 0.10, quitter 0.09, …).
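The interpolation itself is a weighted mixture of the three distributions. A minimal sketch, using the distributions from the worked example and the weight assignment (40/35/25) consistent with its numbers:

```python
def interpolate(dists, weights):
    """Weighted mixture of translation distributions for one query token."""
    assert abs(sum(weights) - 1.0) < 1e-9
    out = {}
    for dist, w in zip(dists, weights):
        for tok, p in dist.items():
            out[tok] = out.get(tok, 0.0) + w * p
    return out

pr_token = {"laisser": 0.14, "congé": 0.70, "quitter": 0.06}
pr_scfg  = {"laisser": 0.72, "congé": 0.10, "quitter": 0.09}
pr_nbest = {"laisser": 0.09, "congé": 0.90, "quitter": 0.11}

pr_interp = interpolate([pr_token, pr_scfg, pr_nbest], [0.40, 0.35, 0.25])
# laisser: 0.40*0.14 + 0.35*0.72 + 0.25*0.09 ≈ 0.33
```
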
Experiments
• Three tasks:
  1. TREC 2002 English-Arabic CLIR task: 50 English queries, 383,872 Arabic documents
  2. NTCIR-8 English-Chinese ACLIA task: 73 English queries, 388,859 Chinese documents
  3. CLEF 2006 English-French CLIR task: 50 English queries, 177,452 French documents
• Implementation
  - cdec MT system [Dyer et al., 2010]
  - Hiero-style grammars, GIZA++ for token alignments
Comparison of Models

[Bar charts of Mean Average Precision (MAP), from 0.00 to 0.30, for English-French CLEF 2006, English-Arabic TREC 2002, and English-Chinese NTCIR-8, comparing token-based, grammar-based, translation-based (10-best), 1-best MT, and the best interpolation.]

Interpolated is significantly better than token-based and 1-best MT in all three cases.
Conclusions (Part II)

• Summary
  - A novel framework for context-sensitive and ambiguity-preserving CLIR
  - Interpolation of proposed models works best
  - Significant improvements in MAP for three tasks
• Future work
  - Robust parameter optimization
  - Document vs. query translation with MT
Contributions

[Diagram, built up over several slides: comparable corpora feed the Bitext Extraction pipeline, which builds on token-based CLIR; the extracted bitext is added to the MT pipeline's baseline bitext, giving higher BLEU for 5 language pairs. The resulting MT translation model feeds a CLIR translation model, enabling context-sensitive CLIR with higher MAP for 3 language pairs. Feeding the improved CLIR translation model back into bitext extraction produces more bitext and higher BLEU after an additional iteration.]
• LSH-based MapReduce approach to pairwise similarity
• Exploration of the parameter space for the sliding window algorithm
• MapReduce algorithm to generate candidate sentence pairs
• 2-step classification approach to bitext extraction
  ➡ bitext from Wikipedia: improvement over state-of-the-art MT
• Set of techniques for context-sensitive CLIR using MT
  ➡ combination of evidence works best
• Framework for better integration of MT and IR
• Bootstrapping approach to show feasibility
• All code and data available as part of the Ivory project (www.ivory.cc)

Thank you!