AQUAINT User Testbedand System Integration Activities
Marc Light, John Burger, Clarence Huff MITRE
In the beginning was the BAA
“As independent activities, but in close collaboration with AQUAINT contractors, the Government intends … to establish a testbed for the demonstration of emerging capabilities.”
“ARDA and its AQUAINT Program Committee, with technical advice and support from a separately solicited and procured system integration contractor, will throughout the entire duration of the AQUAINT program, be reviewing and evaluating each of the programs and projects for components, subsystems and full systems that could be successfully integrated ... demonstrate application against existing operational problems …”
Models of Technology Transfer
Research Transfer
System Integrator’s Role
MITRE will advise and support with respect to which systems could be part of the user testbed, and how
Research
Transfer
MITRE will advise and support with respect to user-centric evaluations and data and performance analysis, and report back to researchers
Other User Testbed Details
• User-centric evaluation
– E.g., task time studies, quality-of-solution studies, user satisfaction surveys
• Intelligence analysts are the user group
– Many different kinds of analysts
• User testbed will be in both unclassified and classified environments
More on MITRE’s Role
• What we will do
– Advise on and enable testbed participation
– User-centric evaluations
– Data and system performance analysis
– Provide an open source reference QA system
– Provide integration infrastructure (glue)
• What we won't do
– Act as a gatekeeper
– Force a one-size-fits-all architecture
What is MITRE?
• MITRE is a set of FFRDCs
– Federally Funded Research and Development Centers
• Does not sell products, must not compete
• Charged to act in the public interest
– Unbiased advisor, system integrator, evaluator, etc.
Okay, so what is this user testbed thing?
Researchers
Integration
User Eval& Analysis
User
Separation between parallel efforts of research systems and user testbed
User Testbed
User-Centric Evaluation
• Select users, design tasks
• Timing measures
– How fast are users able to accomplish a task?
• Task performance measures
– Quality of user’s solution to task
• User satisfaction
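The timing measure described above could be summarized along these lines — a toy sketch in Python, where the task names and completion times are invented for illustration:

```python
from statistics import mean

# Hypothetical per-user task completion times in seconds,
# as a timing study might collect them.
task_times = {
    "find-biography": [412.0, 380.5, 455.2],
    "timeline-report": [1290.0, 1105.4],
}

def summarize(task_times):
    """Mean completion time per task, rounded for reporting.

    Comparing these means across conditions (e.g., with and
    without a QA system) is the basic timing measure.
    """
    return {task: round(mean(times), 1) for task, times in task_times.items()}

print(summarize(task_times))  # {'find-biography': 415.9, 'timeline-report': 1197.7}
```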
Example Survey Questions
Please take a few moments to provide us with feedback on your experiences with AQUAINT. Your responses to the questions below will help us improve the system. Keep in mind that we are evaluating the system – not you.
Gender: ___M ___F Age group: ___20-29 ___30-39 ___40-49 ___50+
How often do you use a computer for your work?
___never ___once a week ___a few times a week ___a few hours a day ___all day
Which keyword search engine(s) do you normally use to find information? _____
Have you ever used AQUAINT before or seen a demonstration of AQUAINT? _____
Using AQUAINT to answer my questions was
___very easy ___easy ___neither easy nor hard ___hard ___very hard ___no opinion
AQUAINT worked as expected
___all of the time ___most of the time ___neutral ___some of the time ___never ___no expectations
I had to rephrase my questions to get the answer I wanted
___all of the time ___most of the time ___some of the time ___never ___no opinion
AQUAINT’s response time in returning answers to my questions was
___very fast ___fast ___slightly fast ___neutral ___slightly slow ___slow ___very slow ___no opinion
The first answer AQUAINT returned for each question I asked was the one I wanted
___all of the time ___most of the time ___neutral ___some of the time ___never ___not sure
At least one of the answers AQUAINT returned for each question I asked was what I wanted
___strongly agree ___agree ___slightly agree ___neutral ___slightly disagree ___disagree ___strongly disagree
AQUAINT provided accurate answers to my questions
___all of the time ___most of the time ___neutral ___some of the time ___never ___not sure
Analysis Activity
• Inherent properties of the data
– What percentage of questions are definitions?
• System-level feature analysis
– Is there a correlation between system performance and answer redundancy?
• Bounds
– How much ambiguity remains after answer typing …
– What is the limit of weighted word overlap approaches?
– Related to error analysis
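A data-property question like "what percentage of questions are definitions?" reduces to pattern matching over the question set. A minimal sketch, with an invented question list and a deliberately crude regular expression standing in for a real question classifier:

```python
import re

# Hypothetical question set for illustration.
questions = [
    "What is the Palestine Liberation Organization?",
    "Who is Yasser Arafat?",
    "When was the PLO founded?",
    "What does FFRDC stand for?",
]

# Crude surface pattern for definition questions ("What is X?", "Who are Y?").
DEFINITION_PATTERN = re.compile(r"^(what|who) (is|are|was|were)\b", re.IGNORECASE)

def definition_fraction(questions):
    """Fraction of questions whose surface form matches the definition pattern."""
    hits = sum(bool(DEFINITION_PATTERN.match(q)) for q in questions)
    return hits / len(questions)

print(definition_fraction(questions))  # 0.5
```

A real analysis would of course use a proper question typology rather than one regex, but the shape of the computation is the same.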
Example Analysis: Answer Multiplicity
[Figure: % of systems with at least one correct response (0–100%) vs. number of answer occurrences per question (0–70); one point per question, plus mean correct per occurrence count.]
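The quantity plotted — how system success varies with how often an answer occurs in the collection — could be computed along these lines. The judgment records here are invented; assume each record says whether one system answered one question correctly, and how many times that question's answer occurs in the corpus:

```python
from collections import defaultdict

# Hypothetical judgments: (question_id, system_id, correct?, answer occurrences).
judgments = [
    ("q1", "sysA", True, 12),
    ("q1", "sysB", False, 12),
    ("q2", "sysA", True, 3),
    ("q2", "sysB", True, 3),
]

def multiplicity_curve(judgments):
    """Fraction of system responses judged correct, keyed by how many
    times the answer occurs in the collection (one point per count)."""
    by_count = defaultdict(list)
    for _qid, _sys, correct, n_occurrences in judgments:
        by_count[n_occurrences].append(correct)
    return {n: sum(flags) / len(flags) for n, flags in sorted(by_count.items())}

print(multiplicity_curve(judgments))  # {3: 1.0, 12: 0.5}
```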
Integration Activities
• Inserting complete systems into testbed
• Connecting components
• Answer collation
• Reference system
Testbed “Glue” Wish List
• Arbitrary system topologies
• Distributed systems
• Scalability
• Delivery of only relevant inputs to components
• Caching and archiving
• Fault tolerance
MITRE’s Catalyst has many of these characteristics
Catalyst Characteristics
• Data model
– Standoff annotations
– Flexible restructuring, renaming, indexing
• Processing model
– Components connected by streams of annotations
– Stream operations (merge, split, …)
– Flexible system topologies
– Synchronization
• In use for DARPA’s TIDES Program
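The core idea of the data model — standoff annotations — is that components attach labeled spans by character offset rather than marking up the text inline, so many annotation layers can coexist over one untouched signal. A minimal illustrative sketch (not Catalyst's actual API):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Annotation:
    start: int   # character offset, inclusive
    end: int     # character offset, exclusive
    layer: str   # e.g. "token", "entity"
    label: str   # e.g. "ORGANIZATION"

# The signal is never modified; annotations stand off from it.
text = "The PLO met in Gaza."
annotations = [
    Annotation(4, 7, "entity", "ORGANIZATION"),
    Annotation(15, 19, "entity", "LOCATION"),
]

def spans(text, annotations, layer):
    """Recover the surface strings for one annotation layer."""
    return [text[a.start:a.end] for a in annotations if a.layer == layer]

print(spans(text, annotations, "entity"))  # ['PLO', 'Gaza']
```

Because layers only share offsets, a downstream component can restructure, rename, or index annotations without touching the text or other components' output.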
Three ways to integrate a language processor into a Catalyst system
• Go native: language processor uses Catalyst stream operators directly
• Wrap: write a separate, native process to mediate with existing non-Catalyst LP
• Middle ground: if existing LP uses Expat (or equivalent) XML parser, replace with Catalyst’s Expat-like API
Existing APIs in C, Lisp, and Python; Java and Perl in progress.
Answer Collation
• Problem
– Many systems means many answers
– Answers will duplicate, overlap, cluster
– Combining rankings is problematic
• As system integrator, MITRE will develop an answer collation module
Answer Collation Issues
• Merging
– PLO = Palestine Liberation Organization
– Last Christmas = 2000-12-25
– Issues of approximation, partial orders ...
• Clustering
– {Yasser Arafat, PLO, Hamas}
• Ranking
– Answer type, number of occurrences, source quality
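The merging and ranking steps above could be sketched as follows — a toy version in which normalization collapses surface variants before groups are ranked by how many systems proposed them. The alias table is invented for illustration; a real collation module would merge via richer equivalences (abbreviation expansion, date normalization, partial orders over spans):

```python
# Hypothetical alias table mapping surface variants to a canonical form.
ALIASES = {
    "plo": "palestine liberation organization",
}

def normalize(answer):
    """Map an answer string to a canonical key."""
    key = answer.strip().lower()
    return ALIASES.get(key, key)

def collate(answers):
    """Group answers from many systems by normalized form and rank
    groups by how many raw answers each one absorbed."""
    groups = {}
    for ans in answers:
        groups.setdefault(normalize(ans), []).append(ans)
    return sorted(groups.items(), key=lambda kv: len(kv[1]), reverse=True)

answers = ["PLO", "Palestine Liberation Organization", "Hamas"]
print(collate(answers)[0][0])  # 'palestine liberation organization'
```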
Reference System
• Open-source QA system from MITRE
• Components include a question analyzer, passage filter, tokenizer, entity taggers, answer selector ...
• Uses Catalyst for glue
– Some components can communicate using inline XML
• Intent:
– Contractors without end-to-end systems can insert their own components
– Possibly useful for baseline evaluation
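The reference system's component chain can be pictured as plain functions composed in sequence — question analyzer, passage filter, answer selector. The sketch below is purely illustrative (the component logic is invented, and the real system connects components through Catalyst streams rather than direct calls):

```python
def analyze_question(question):
    """Guess the expected answer type from the question word (toy rule)."""
    word = question.split()[0].lower()
    return {"who": "PERSON", "when": "DATE", "where": "LOCATION"}.get(word, "OTHER")

def filter_passages(passages, question):
    """Keep passages sharing at least one word with the question (toy filter)."""
    qwords = {w.lower().strip("?") for w in question.split()}
    return [p for p in passages if qwords & {w.lower() for w in p.split()}]

def select_answer(passages, answer_type):
    """Stub selector: return the first surviving passage."""
    return passages[0] if passages else None

question = "Who leads the PLO?"
passages = ["Yasser Arafat leads the PLO.", "Gaza is on the coast."]
atype = analyze_question(question)
candidates = filter_passages(passages, question)
print(atype, "->", select_answer(candidates, atype))
```

The point of the plug-in design is that a contractor with only, say, a better answer selector can swap in that one stage and reuse the rest.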
Summary of User Testbed Activities
• Parallel research and transition efforts
• MITRE to assist in transition effort
• User testbed activities include
– Selecting users and defining tasks
– User-centric evaluation
– Analysis
– Integration