DCU@FIRE 2012: Monolingual and Crosslingual SMS-based FAQ Retrieval
Johannes Leveling
CNGL, School of Computing, Dublin City University, Ireland
Outline
Motivation
System Setup and Changes
Monolingual Experiments
Crosslingual Experiments
SMT system
Training data
Translation results
OOV Reduction
FAQ Retrieval Results
Conclusions and Future Work
Motivation
Task: Given an SMS query, find FAQ documents answering the query
Last year’s DCU system:
SMS correction and normalisation
In-Domain retrieval: Three approaches (SOLR, Lucene, Term Overlap)
Out-of-domain (OOD) detection: Three approaches (term overlap, normalized BM25 scores, ML)
Combination of ID retrieval and OOD results
Motivation
This year’s system:
Same SMS correction and normalisation
One more spelling correction resource (manually created)
Single retrieval approach: Lucene with BM25 retrieval model
Single OOD detection approach: IB-1 classification using TiMBL (Machine Learning)
additional features for term overlap and normalized BM25 scores
Trained statistical machine translation system for document translation (Hindi to English)
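The OOD detection step above classifies each query with IB-1, which is essentially nearest-neighbour classification with k = 1. A minimal sketch of that idea follows. It is not TiMBL itself: the two features (term overlap and a normalized BM25 score), the toy training instances, and the use of plain Euclidean distance are all assumptions made for illustration, and TiMBL's own distance metric and feature weighting may differ.

```python
import math

def ib1_predict(train, query):
    """Return the label of the training instance closest to `query` (1-NN)."""
    def dist(a, b):
        # Plain Euclidean distance; an assumption, not TiMBL's configured metric.
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    _features, label = min(train, key=lambda inst: dist(inst[0], query))
    return label

# Hypothetical training instances:
# (term overlap, normalized BM25 score of the top-ranked FAQ) -> class
train = [
    ((0.90, 0.95), "ID"),
    ((0.75, 0.80), "ID"),
    ((0.10, 0.20), "OOD"),
    ((0.05, 0.30), "OOD"),
]

print(ib1_predict(train, (0.80, 0.85)))
print(ib1_predict(train, (0.08, 0.25)))
```

A query whose top hit shares many terms with it and scores highly lands near the in-domain instances, while a query with little overlap lands near the OOD instances.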
Questions
Investigate:
the influence of OOD detection on system performance
the influence of out-of-vocabulary (OOV) words on crosslingual performance
Collection Statistics
Language          Documents  Training (rel/non_rel)  Test (rel/non_rel)
English           7251       4476 (3047/1429)        1733 (726/1007)
Hindi             1994       554 (173/381)           579 (200/379)
English to Hindi  1994       554 (173/381)           431 (75/356)
Monolingual Experiments (Setup)
Experiments for English and Hindi
Processing steps:
Normalize SMS and FAQ documents
Correct SMS queries
Retrieve answers
Detect OOD queries (or not), e.g. “NONE” queries
Produce final result
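The processing steps above can be sketched as a small pipeline. This is an illustrative sketch only: the correction lexicon, the FAQ texts, and the threshold are hypothetical. Term overlap stands in for the Lucene BM25 retrieval, and a fixed score threshold stands in for the machine-learned OOD classifier.

```python
import re

# Hypothetical correction lexicon (SMS form -> standard form); not from the slides.
CORRECTIONS = {"hw": "how", "2": "to", "u": "you", "pls": "please"}

def normalize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def correct(tokens):
    """Replace known SMS abbreviations with their standard forms."""
    return [CORRECTIONS.get(t, t) for t in tokens]

def retrieve(query_tokens, faqs):
    """Rank FAQs by the fraction of query terms they contain (stand-in for BM25)."""
    q = set(query_tokens)
    scored = []
    for faq_id, text in faqs.items():
        d = set(normalize(text))
        score = len(q & d) / len(q) if q else 0.0
        scored.append((score, faq_id))
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return scored

def answer(sms, faqs, threshold=0.5):
    """Return the best FAQ id, or "NONE" if the query looks out-of-domain."""
    ranked = retrieve(correct(normalize(sms)), faqs)
    if not ranked or ranked[0][0] < threshold:
        return "NONE"
    return ranked[0][1]

faqs = {
    "FAQ1": "How to apply for a passport?",
    "FAQ2": "How can you open a bank account?",
}
print(answer("hw 2 apply 4 passport", faqs))
print(answer("wats d score", faqs))
```

The first query matches FAQ1 on four of its five corrected terms. The second shares no terms with any FAQ, so the sketch answers "NONE".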
Crosslingual Experiments (Setup)
Experiments for English to Hindi
Additional translation step to translate Hindi FAQ documents into English
Translation is based on newly trained statistical machine translation system (SMT)
Problems:
sparse training data → combination of different training resources
out-of-vocabulary (OOV) words → OOV reduction
Crosslingual Experiments (SMT System)
Training an SMT system:
Data preparation: tokenization/normalization scripts
Data alignment: GIZA++ for word-level alignment
Phrase extraction: Moses MT toolkit
Training a language model: SRILM for trigram LM with Kneser-Ney smoothing
Tuning: minimum error rate training (MERT)
Crosslingual Experiments (Training Data)
Agro (agricultural domain): 246 sentences
Crowdsourced HI-EN data: 50k sentences
EILMT (tourism domain): 6700 sentences
ICON: 7000 sentences
TIDES: 50k sentences
FIRE ad-hoc queries: 200 titles, 200 descriptions
Interlanguage Wikipedia links: 27k entries
OPUS/KDE: 97k entries
UWdict: 128k entries
Translation Results (Hindi to English)
Data                Training  Test  Development  BLEU
TIDES               49,504    697   988          13.30
Crowdsourced EN-HI  41,396    8000  4000         7.04
ICON                7000      500   500          25.38
OOV Reduction
Problem: 15.4% untranslated words in translation output
Idea: modify untranslated words to obtain a translation
OOV reduction is based on two resources:
UWdict
Manually created transliteration lexicon (TRL): 639 entries
OOV Reduction
Word modifications:
Character normalization, e.g. replace Chandrabindu with Bindu, delete the Virama character, replace long vowels with short vowels
Stemming: Lucene Hindi stemmer
Transliteration: ITRANS transliteration rules, plus rules for cleaning up ITRANS results
Decompounding: a word is split at every position into candidate constituents, and it is decompounded if both constituents have a translation
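Two of the word modifications above, character normalization and decompounding, can be sketched as follows. The exact character mappings used in the system are not given on the slides, so the long-to-short vowel table here is an assumption, as are the toy lexicon and the choice to return the first valid split.

```python
CHANDRABINDU = "\u0901"
BINDU = "\u0902"   # Anusvara
VIRAMA = "\u094D"

# Assumed long -> short vowel mapping (independent vowels and vowel signs).
LONG_TO_SHORT = {
    "\u0908": "\u0907",  # letter II -> letter I
    "\u090A": "\u0909",  # letter UU -> letter U
    "\u0940": "\u093F",  # vowel sign II -> vowel sign I
    "\u0942": "\u0941",  # vowel sign UU -> vowel sign U
}

def normalize(word):
    """Replace Chandrabindu with Bindu, delete Virama, shorten long vowels."""
    word = word.replace(CHANDRABINDU, BINDU).replace(VIRAMA, "")
    return "".join(LONG_TO_SHORT.get(ch, ch) for ch in word)

def decompound(word, lexicon):
    """Try every split position; return the first split in which both
    constituents have a translation in `lexicon`, else None."""
    for i in range(1, len(word)):
        left, right = word[:i], word[i:]
        if left in lexicon and right in lexicon:
            return left, right
    return None

# Hypothetical lexicon entries, written as escapes to keep code points explicit.
rel = "\u0930\u0947\u0932"               # "rel" (rail)
gadi = "\u0917\u093E\u0921\u093C\u0940"  # "gadi" (vehicle)
lexicon = {rel: "rail", gadi: "vehicle"}

print(normalize("\u0939\u0942\u0901"))   # long U sign and Chandrabindu replaced
print(decompound(rel + gadi, lexicon))
print(decompound(rel, lexicon))
```

Splitting works on Unicode code points here, so some candidate constituents begin with a dependent vowel sign. Such fragments simply fail the lexicon lookup.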
OOV Reduction Results (Hindi to English)
Lookup form              Lookup data  Count   % Reduction
original term            UWdict       4,728   14.5
original term            TRL          83      0.3
normalized term          UWdict       419     1.3
normalized term          TRL          24      0.1
stemmed term             UWdict       1,413   4.4
stemmed term             TRL          14      0.0
stemmed normalized term  UWdict       135     0.4
stemmed normalized term  TRL          0       0.0
compound constituents    UWdict       721     2.2
transliteration          N/A          24,973  76.8
FAQ Retrieval Results
Run  Language  OOD detection  OOV reduction  ID correct  OOD correct  MRR
1    EN        N              -              661/726     19/1007      0.937
2    EN        Y              -              595/726     981/1007     0.949
1    HI        N              -              77/379      13/379       0.473
2    HI        Y              -              26/379      375/379      0.880
1    EN2HI     N              N              29/75       41/1007      0.450
2    EN2HI     N              Y              22/75       60/1007      0.365
3    EN2HI     Y              Y              4/75        989/1007     0.444
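The MRR column reports mean reciprocal rank: for each query, the reciprocal of the rank of the first correct answer, averaged over all queries. A minimal sketch of the standard computation follows. The query and document ids are made up, and how the task scores OOD ("NONE") queries is not stated on the slides, so that case is not modelled here.

```python
def mean_reciprocal_rank(ranked_lists, relevant):
    """Average, over all queries, of 1/rank of the first relevant result.
    A query with no relevant result in its ranking contributes 0."""
    total = 0.0
    for qid, ranking in ranked_lists.items():
        for rank, doc_id in enumerate(ranking, start=1):
            if doc_id in relevant[qid]:
                total += 1.0 / rank
                break
    return total / len(ranked_lists)

# Hypothetical rankings and gold answers.
runs = {
    "q1": ["d3", "d1", "d7"],  # first correct answer at rank 2
    "q2": ["d2", "d9"],        # first correct answer at rank 1
    "q3": ["d5", "d6"],        # no correct answer retrieved
}
gold = {"q1": {"d1"}, "q2": {"d2"}, "q3": {"d8"}}

print(mean_reciprocal_rank(runs, gold))  # (1/2 + 1 + 0) / 3 = 0.5
```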
Conclusions
Monolingual experiments:
Good performance for English and Hindi
OOD detection improves MRR (but reduces number of correct ID queries)
Crosslingual experiments:
Lower performance
OOD detection reduces MRR
OOV reduction reduces MRR
Future work
Further analysis of our results needed:
Normalization issues for MT training data?
Unbalanced OOD training data for Hindi and English?
Is there Hindi textese (e.g. abbreviations etc.)?
Does the training data match the test data? (manually or automatically created)
Improve transliteration approach
Comparison to other submissions
10q 4 ur @ensn