AQUAINT User Testbedand System Integration Activities
Marc Light, John Burger, Clarence Huff MITRE
In the beginning was the BAA
“As independent activities, but in close collaboration with AQUAINT contractors, the Government intends … to establish a testbed for the demonstration of emerging capabilities.”
“ARDA and its AQUAINT Program Committee, with technical advice and support from a separately solicited and procured system integration contractor, will throughout the entire duration of the AQUAINT program, be reviewing and evaluating each of the programs and projects for components, subsystems and full systems that could be successfully integrated ... demonstrate application against existing operational problems …”
Models of Technology Transfer
Research Transfer
System Integrator’s Role
MITRE will advise and support with respect to which systems could be part of the user testbed, and how
Research
Transfer
MITRE will advise and support with respect to user-centric evaluations and data and performance analysis, and report back to researchers
Other User Testbed Details
• User-centric evaluation
– E.g., task time studies, quality-of-solution studies, user satisfaction surveys
• Intelligence analysts are the user group
– Many different kinds of analysts
• User testbed will be in both unclassified and classified environments
More on MITRE’s Role
• What we will do
– Advise on and enable testbed participation
– User-centric evaluations
– Data and system performance analysis
– Provide an open source reference QA system
– Provide integration infrastructure (glue)
• What we won't do
– Act as a gatekeeper
– Force a one-size-fits-all architecture
What is MITRE?
• MITRE is a set of FFRDCs
– Federally Funded Research and Development Centers
• Does not sell products, must not compete
• Charged to act in the public interest
– Unbiased advisor, system integrator, evaluator, etc.
Okay, so what is this user testbed thing?
Researchers
Integration
User Eval& Analysis
User
Separation between parallel efforts of research systems and user testbed
User Testbed
User-Centric Evaluation
• Select users, design tasks
• Timing measures
– How fast are users able to accomplish a task?
• Task performance measures
– Quality of user’s solution to task
• User satisfaction
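The timing measure described above could be summarized along these lines — a toy sketch in Python, where the task names and completion times are invented for illustration:

```python
from statistics import mean

# Hypothetical per-user task completion times in seconds,
# as a timing study might collect them.
task_times = {
    "find-biography": [412.0, 380.5, 455.2],
    "timeline-report": [1290.0, 1105.4],
}

def summarize(task_times):
    """Mean completion time per task, rounded for reporting.

    Comparing these means across conditions (e.g., with and
    without a QA system) is the basic timing measure.
    """
    return {task: round(mean(times), 1) for task, times in task_times.items()}

print(summarize(task_times))  # {'find-biography': 415.9, 'timeline-report': 1197.7}
```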
Example Survey Questions
Please take a few moments to provide us with feedback on your experiences with AQUAINT. Your responses to the questions below will help us improve the system. Keep in mind that we are evaluating the system – not you.
Gender: ___M ___F Age group: ___20-29 ___30-39 ___40-49 ___50+
How often do you use a computer for your work?
___never ___once a week ___a few times a week ___a few hours a day ___all day
Which keyword search engine(s) do you normally use to find information? _____
Have you ever used AQUAINT before or seen a demonstration of AQUAINT? _____
Using AQUAINT to answer my questions was
___very easy ___easy ___neither easy nor hard ___hard ___very hard ___no opinion
AQUAINT worked as expected
___all of the time ___most of the time ___neutral ___some of the time ___never ___no expectations
I had to rephrase my questions to get the answer I wanted
___all of the time ___most of the time ___some of the time ___never ___no opinion
AQUAINT’s response time in returning answers to my questions was
___very fast ___fast ___slightly fast ___neutral ___slightly slow ___slow ___very slow ___no opinion
The first answer AQUAINT returned for each question I asked was the one I wanted
___all of the time ___most of the time ___neutral ___some of the time ___never ___not sure
At least one of the answers AQUAINT returned for each question I asked was what I wanted
___strongly agree ___agree ___slightly agree ___neutral ___slightly disagree ___disagree ___strongly disagree
AQUAINT provided accurate answers to my questions
___all of the time ___most of the time ___neutral ___some of the time ___never ___not sure
Analysis Activity
• Inherent properties of the data
– What percentage of questions are definitions?
• System-level feature analysis
– Is there a correlation between system performance and answer redundancy?
• Bounds
– How much ambiguity remains after answer typing …
– What is the limit of weighted word overlap approaches?
– Related to error analysis
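A data-property question like "what percentage of questions are definitions?" reduces to pattern matching over the question set. A minimal sketch, with an invented question list and a deliberately crude regular expression standing in for a real question classifier:

```python
import re

# Hypothetical question set for illustration.
questions = [
    "What is the Palestine Liberation Organization?",
    "Who is Yasser Arafat?",
    "When was the PLO founded?",
    "What does FFRDC stand for?",
]

# Crude surface pattern for definition questions ("What is X?", "Who are Y?").
DEFINITION_PATTERN = re.compile(r"^(what|who) (is|are|was|were)\b", re.IGNORECASE)

def definition_fraction(questions):
    """Fraction of questions whose surface form matches the definition pattern."""
    hits = sum(bool(DEFINITION_PATTERN.match(q)) for q in questions)
    return hits / len(questions)

print(definition_fraction(questions))  # 0.5
```

A real analysis would of course use a proper question typology rather than one regex, but the shape of the computation is the same.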
Example Analysis: Answer Multiplicity
[Figure: % of systems with at least one correct response (0–100%) vs. number of answer occurrences per question (0–70); one point per question, plus mean correct per occurrence count.]
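The quantity plotted — how system success varies with how often an answer occurs in the collection — could be computed along these lines. The judgment records here are invented; assume each record says whether one system answered one question correctly, and how many times that question's answer occurs in the corpus:

```python
from collections import defaultdict

# Hypothetical judgments: (question_id, system_id, correct?, answer occurrences).
judgments = [
    ("q1", "sysA", True, 12),
    ("q1", "sysB", False, 12),
    ("q2", "sysA", True, 3),
    ("q2", "sysB", True, 3),
]

def multiplicity_curve(judgments):
    """Fraction of system responses judged correct, keyed by how many
    times the answer occurs in the collection (one point per count)."""
    by_count = defaultdict(list)
    for _qid, _sys, correct, n_occurrences in judgments:
        by_count[n_occurrences].append(correct)
    return {n: sum(flags) / len(flags) for n, flags in sorted(by_count.items())}

print(multiplicity_curve(judgments))  # {3: 1.0, 12: 0.5}
```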
Integration Activities
• Inserting complete systems into testbed
• Connecting components
• Answer collation
• Reference system
Testbed “Glue” Wish List
• Arbitrary system topologies
• Distributed systems
• Scalability
• Delivery of only relevant inputs to components
• Caching and archiving
• Fault tolerance
MITRE’s Catalyst has many of these characteristics
Catalyst Characteristics
• Data model
– Standoff annotations
– Flexible restructuring, renaming, indexing
• Processing model
– Components connected by streams of annotations
– Stream operations (merge, split, …)
– Flexible system topologies
– Synchronization
• In use for DARPA’s TIDES Program
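The core idea of the data model — standoff annotations — is that components attach labeled spans by character offset rather than marking up the text inline, so many annotation layers can coexist over one untouched signal. A minimal illustrative sketch (not Catalyst's actual API):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Annotation:
    start: int   # character offset, inclusive
    end: int     # character offset, exclusive
    layer: str   # e.g. "token", "entity"
    label: str   # e.g. "ORGANIZATION"

# The signal is never modified; annotations stand off from it.
text = "The PLO met in Gaza."
annotations = [
    Annotation(4, 7, "entity", "ORGANIZATION"),
    Annotation(15, 19, "entity", "LOCATION"),
]

def spans(text, annotations, layer):
    """Recover the surface strings for one annotation layer."""
    return [text[a.start:a.end] for a in annotations if a.layer == layer]

print(spans(text, annotations, "entity"))  # ['PLO', 'Gaza']
```

Because layers only share offsets, a downstream component can restructure, rename, or index annotations without touching the text or other components' output.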
Three ways to integrate a language processor into a Catalyst system
• Go native: language processor uses Catalyst stream operators directly
• Wrap: write a separate, native process to mediate with existing non-Catalyst LP
• Middle ground: if existing LP uses Expat (or equivalent) XML parser, replace with Catalyst’s Expat-like API
Existing APIs in C, Lisp, and Python; Java and Perl in progress.
Answer Collation
• Problem
– Many systems means many answers
– Answers will duplicate, overlap, cluster
– Combining rankings is problematic
• As system integrator, MITRE will develop an answer collation module
Answer Collation Issues
• Merging
– PLO = Palestine Liberation Organization
– Last Christmas = 2000-12-25
– Issues of approximation, partial orders ...
• Clustering
– {Yasser Arafat, PLO, Hamas}
• Ranking
– Answer type, number of occurrences, source quality
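The merging and ranking steps above could be sketched as follows — a toy version in which normalization collapses surface variants before groups are ranked by how many systems proposed them. The alias table is invented for illustration; a real collation module would merge via richer equivalences (abbreviation expansion, date normalization, partial orders over spans):

```python
# Hypothetical alias table mapping surface variants to a canonical form.
ALIASES = {
    "plo": "palestine liberation organization",
}

def normalize(answer):
    """Map an answer string to a canonical key."""
    key = answer.strip().lower()
    return ALIASES.get(key, key)

def collate(answers):
    """Group answers from many systems by normalized form and rank
    groups by how many raw answers each one absorbed."""
    groups = {}
    for ans in answers:
        groups.setdefault(normalize(ans), []).append(ans)
    return sorted(groups.items(), key=lambda kv: len(kv[1]), reverse=True)

answers = ["PLO", "Palestine Liberation Organization", "Hamas"]
print(collate(answers)[0][0])  # 'palestine liberation organization'
```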
Reference System
• Open-source QA system from MITRE
• Components include a question analyzer, passage filter, tokenizer, entity taggers, answer selector ...
• Uses Catalyst for glue
– Some components can communicate using inline XML
• Intent:
– Contractors without end-to-end systems can insert their own components
– Possibly useful for baseline evaluation
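The reference system's component chain can be pictured as plain functions composed in sequence — question analyzer, passage filter, answer selector. The sketch below is purely illustrative (the component logic is invented, and the real system connects components through Catalyst streams rather than direct calls):

```python
def analyze_question(question):
    """Guess the expected answer type from the question word (toy rule)."""
    word = question.split()[0].lower()
    return {"who": "PERSON", "when": "DATE", "where": "LOCATION"}.get(word, "OTHER")

def filter_passages(passages, question):
    """Keep passages sharing at least one word with the question (toy filter)."""
    qwords = {w.lower().strip("?") for w in question.split()}
    return [p for p in passages if qwords & {w.lower() for w in p.split()}]

def select_answer(passages, answer_type):
    """Stub selector: return the first surviving passage."""
    return passages[0] if passages else None

question = "Who leads the PLO?"
passages = ["Yasser Arafat leads the PLO.", "Gaza is on the coast."]
atype = analyze_question(question)
candidates = filter_passages(passages, question)
print(atype, "->", select_answer(candidates, atype))
```

The point of the plug-in design is that a contractor with only, say, a better answer selector can swap in that one stage and reuse the rest.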
Summary of User Testbed Activities
• Parallel research and transition efforts
• MITRE to assist in transition effort
• User testbed activities include
– Selecting users and defining tasks
– User-centric evaluation
– Analysis
– Integration