AQUAINT
Building an Initial Cross-lingual Question Answering System:
English Question -> Chinese Collection
Ralph Weischedel, Ana Licuanan, Jinxi Xu
6 October 2004
Phase 2 Objectives
End-to-End System; Multi-lingual Data
• Find appropriate information in a second language
• Be organized to maximize performance
– Analyze, then translate?
– Translate, then analyze?
• Focus on complex questions (e.g. definitional & biographical questions), rather than on factoid questions
• Determine whether two statements across languages convey
– The same information,
– Inconsistent information, or
– Novel/complementary information
(To be addressed later)
(To be addressed later)
Approach
• Trained, language-independent algorithms for core NLP problems, e.g.,
– passage retrieval,
– name tagging,
– parsing and
– co-reference
• Plug-and-play architecture for alternative MT systems for question & document translation
• Controlled experiments to measure and optimize QA performance
– BBN’s AQUA system for English as monolingual baseline
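One way to read the plug-and-play point is that each MT system sits behind the same narrow interface, so that question translation and document translation can be swapped independently. The sketch below is only an illustration of that idea. The names `Translator`, `lexicon_translator` and `CrossLingualQA` are hypothetical and are not taken from the AQUA system, and the one-entry lexicons stand in for real MT engines.

```python
from typing import Callable, Dict

# Any MT system is modelled as a function from source text to target text.
Translator = Callable[[str], str]


def lexicon_translator(lexicon: Dict[str, str]) -> Translator:
    """Toy word-for-word translator standing in for a real MT engine."""
    def translate(text: str) -> str:
        return " ".join(lexicon.get(token, token) for token in text.split())
    return translate


class CrossLingualQA:
    """Holds the two translation slots; the rest of the pipeline is omitted."""

    def __init__(self, question_mt: Translator, document_mt: Translator) -> None:
        self.question_mt = question_mt
        self.document_mt = document_mt

    def translate_question(self, question: str) -> str:
        return self.question_mt(question)

    def translate_passage(self, passage: str) -> str:
        return self.document_mt(passage)


# Hypothetical one-entry lexicons, for illustration only.
en_to_zh = lexicon_translator({"Powell": "鲍威尔"})
zh_to_en = lexicon_translator({"鲍威尔": "Powell"})

qa = CrossLingualQA(question_mt=en_to_zh, document_mt=zh_to_en)
print(qa.translate_question("Who is Powell"))
print(qa.translate_passage("鲍威尔"))

# Swapping in a different engine touches only the constructor argument.
qa_identity = CrossLingualQA(question_mt=lambda text: text, document_mt=zh_to_en)
```

With this shape, a controlled experiment that compares MT systems changes one argument and leaves retrieval and extraction untouched.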
Mono-Lingual System 12/2003
[Figure: mono-lingual system diagram. Processing boxes: Question; Question Classification; Document Retrieval; Linguistic Processing & Extraction of Kernel Facts; Kernel Fact Ranking; Redundancy Removal; List of Responses. Linguistically Motivated Components of SERIF: Name Annotation / Name Tagging; Treebank Parsing; Co-reference; Proposition Finding; Relation Extraction. Other labels: Question Profile; Background Model; Hand-crafted Patterns; Surface Structure Matching.]
AQUA Cross-Lingual
• Architecture today implemented for English questions against Chinese data bases via analysis in Chinese
• Expansion later for
– Arabic documents
– Merging of answers from English, Arabic, and Chinese sources
[Figure: cross-lingual architecture diagram. Two phases: Analysis during Indexing; Responding to Questions. Labels: User Interface; Transliteration; Answer Generation; Database; Document Processing; SERIF; Machine Translation; Chinese Text; Translated English Text; Chinese Extraction Output; English & Arabic (Later).]
Outline
• Transliterating names from English to foreign language (part of question analysis)
• Performance of Chinese analysis components
• Machine translation currently used
• Example output
Problems with Dialects in Transliteration
• Examples
– George Bush
• 乔治 布什 (PRC)
• 乔治 布希 (Taiwan)
– Blair
• 布莱尔 (PRC)
• 贝理雅 (Hong Kong)
• Even within a single dialect, there can be multiple transliterations in use
• Currently we use the PRC style of transliteration
Transliteration Algorithm
• Given an English name E, the algorithm (Al-Onaizan, 2002) finds C that maximizes P(E|C)*P(C)
– English name E is segmented into phonemes (character sequences)
– Probabilities of phoneme mappings P(E|C) are learned from human-transliterated names
– Language model probability P(C) is compiled from a Chinese corpus
• Transliteration Training Data
– Person proper names
– Mandarin training data
– ~500k name pairs taken from Chinese - English Name Entity Lists (LDC2003E01 v1.beta)
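The search for the C that maximizes P(E|C)*P(C) can be sketched as a small dynamic program over segmentations of the English name. This is a minimal illustration under stated assumptions, not the cited algorithm. It uses a unigram stand-in for P(C), where the slide says only that a language model is compiled from a Chinese corpus. The `LM` values and the `波` channel entry are hypothetical, and the remaining channel values are the ones shown on the examples slide.

```python
import math

# Channel model P(english segment | chinese segment).
# Values are from the examples slide, except the entry marked hypothetical.
CHANNEL = {
    "al": {"奥尔": 0.1648},
    "b": {"布": 0.5292},
    "righ": {"赖": 0.0113},
    "t": {"特": 0.5657},
    "po": {"鲍": 0.0069, "波": 0.0050},  # "波" is a hypothetical competitor
    "well": {"威尔": 0.0351},
}

# Hypothetical unigram stand-in for the language model P(C).
LM = {"奥尔": 0.01, "布": 0.02, "赖": 0.01, "特": 0.02,
      "鲍": 0.02, "波": 0.01, "威尔": 0.01}
LM_FLOOR = 1e-6  # probability assumed for segments missing from LM


def transliterate(name):
    """Return (log score, Chinese string) for the best analysis, or None."""
    name = name.lower()
    n = len(name)
    # best[i] holds the best (log score, Chinese string) covering name[:i].
    best = [None] * (n + 1)
    best[0] = (0.0, "")
    for i in range(1, n + 1):
        for j in range(i):
            if best[j] is None:
                continue
            for chinese, p_channel in CHANNEL.get(name[j:i], {}).items():
                score = (best[j][0] + math.log(p_channel)
                         + math.log(LM.get(chinese, LM_FLOOR)))
                if best[i] is None or score > best[i][0]:
                    best[i] = (score, best[j][1] + chinese)
    return best[n]


for english in ("Albright", "Powell"):
    print(english, transliterate(english))
```

Working in log space avoids underflow when many small probabilities are multiplied. A name with no covering segmentation in the table returns `None`.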
Statistical Transliteration: Examples
• Albright: 奥尔布赖特 5.5 * 10^-4
– 奥尔 :al 0.1648
– 布 :b 0.5292
– 赖 :righ 0.0113
– 特 :t 0.5657
• Powell: 鲍威尔 2.4 * 10^-4
– 鲍 :po 0.0069
– 威尔 :well 0.0351
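The overall scores appear to be the product of the per-segment mapping probabilities. The slide does not state this, and it does not say whether P(C) is included, so the check below is an inference from the numbers shown:

```latex
\begin{aligned}
\text{Albright:}\quad & 0.1648 \times 0.5292 \times 0.0113 \times 0.5657 \approx 5.57 \times 10^{-4} \\
\text{Powell:}\quad   & 0.0069 \times 0.0351 \approx 2.42 \times 10^{-4}
\end{aligned}
```

Both products agree with the reported 5.5 * 10^-4 and 2.4 * 10^-4 if the reported figures are truncated to two significant digits.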
Current Chinese Component Performance
Component       | Test Set                  | Recall | Precision | F/Value
Names           | ACE evaluation, TDT4 data | 80%    | 77%       | 78.26 F
Descriptions    | ACE evaluation, TDT4 data | 60%    | 76%       | 66.82 F
Entity Mentions | ACE evaluation, TDT4 data | -      | -         | 78.7 (value)
Entities        | ACE evaluation, TDT4 data | -      | -         | 72.4 (value)
Parsing         | Chinese Treebank          | 82.8%  | 81.3%     | 82.04 F
Represents state-of-the-art performance
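Assuming the F column is the balanced F-measure (the harmonic mean of recall and precision), the reported figures can be checked with a few lines. The function name `f_measure` is chosen here for illustration.

```python
def f_measure(recall, precision):
    """Balanced F-measure: harmonic mean of recall and precision."""
    if recall + precision == 0:
        return 0.0
    return 2 * recall * precision / (recall + precision)


# (recall, precision) in percent, as reported for each component.
rows = {
    "Names": (80.0, 77.0),
    "Descriptions": (60.0, 76.0),
    "Parsing": (82.8, 81.3),
}

for component, (recall, precision) in rows.items():
    print(f"{component}: F = {f_measure(recall, precision):.2f}")
```

The parsing row reproduces the reported 82.04. The names and descriptions rows come out at 78.47 and 67.06 rather than the reported 78.26 and 66.82, which would be consistent with recall and precision having been rounded to whole percentages for display.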
Machine Translation
• Statistical MT learns to translate new text based on existing text translated by humans
• Model of translation trained by GIZA++
– Freely available at www.informatik.rwthaachen.de/Colleagues/och/software/GIZA++.html
• Language Model trained using CMU Language Modelling Toolkit v2
• Translation was done by USC/ISI’s ReWrite decoder, version 1.0.0a
– Downloaded from http://www.isi.edu/licensed-sw/rewrite-decoder/
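The three components fit the usual noisy-channel view of statistical MT, which mirrors the transliteration objective P(E|C)*P(C). The slide does not give an equation, so the form below is the standard formulation and only an assumption about how this system is configured. Here c is the Chinese input, e a candidate English translation, P(c|e) the translation model trained by GIZA++, and P(e) the English language model:

```latex
\hat{e} = \arg\max_{e} \; P(c \mid e) \, P(e)
```

The decoder's job is the search over e, while the two probability models are trained separately from parallel text and from monolingual English text.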
MT Training Data
• Translation
– ~315k sentence pairs (~11m Chinese characters)
– Corpora:
• MTC-1 (Multiple Translation Chinese Corpus)
• Chinese-English Lexicon
• Chinese Treebank
• Hong Kong News
• Hong Kong Hansards (proceedings of the Legislative Council of the HKSAR)
• Language Model
– Trigram language model
– ~60m English words
– Corpora:
• TDT-4 (English portion)
• North American News Text Corpus
Steps to Improving MT Model
• Using GIZA++ & ReWrite
– Take advantage of full UN Parallel Corpus
– Tune training and decoding parameters
• Consider other MT systems
Example Answer Nuggets
• Who is Colin Powell?
– Nugget from Copulas and Appositives
前任 美国 参谋长 连席 会议 主席 鲍威尔
“former Chairman of the Joint Chiefs of Staff”
• Who is Kofi Annan?
– Nugget from Propositions
他 建议 由 两 族 轮流 选派 总统 , 即 希腊族人 每 担任 两 任 总统 , 土耳其族人 担任 一任 总统 。
“He proposed that Greece and Turkey alternately hold the presidency (of Cyprus)”
– Nugget from Relations
安理会 主席
“Chairman of the UN Security Council”
User Interface
Former Chairman of the Joint Chiefs (app)
Soon-to-be secretary of state, retired general Powell (app)
A candidate acceptable to Republicans and Democrats (copula)
The most likely candidate (copula)
National Security Advisor to President Reagan (prop)
General Powell will become the first black to be Secretary of State in US History (prop)
Powell served as commander of US forces in South Korea from 1973 to 1974. (prop)
U.N. Secretary-General (app)
Anan is in Jerusalem for a diplomatic mission … (app)
U.N. Secretary-General (app)
Became the first U.N. Secretary-General to make a statement at the refugee meeting. (prop)
Proposed that Greece and Turkey alternately hold the presidency of Cyprus (prop)
Concluding Comments
• Initial question answering from Chinese corpus implemented
• Opportunities for improvement in all components, including
– Transliteration
– Machine translation
– Passage retrieval
– Answer finding and generation
• Positive experience in transitioning English AQUA to
– AQUAINT testbed at MITRE
– Fairfield experiment
• Baseline participant in relationship pilot study
– No work on answering relationship questions
• Proposed pilot evaluation in spring, 2005
• First step toward full goal of answer merging across
– English
– Arabic
– Chinese