Overview of Hybrid Architecture in Project Halo
Jesse Wang, Peter Clark
March 18, 2013
Status of Hybrid Architecture
Goals, Modularity, Dispatcher, Evaluation
Hybrid System Near-Term Goals
• Set up the infrastructure to communicate with existing reasoners
• Reliably dispatch questions and collect answers
• Create related tools and resources
o Question generation/selection, answer evaluation, report analysis, etc.
• Experiment with ways to choose answers from the available reasoners, acting as a hybrid solver
[Diagram: Dispatcher routing questions to the AURA, CYC, and TEQA reasoners and collecting their answers]
Focus Areas of the Hybrid Framework (through mid-2013)
• Modularity: loose coupling, high cohesion, data-exchange protocols
• Dispatching: send requests and handle the responses
• Evaluation: get ratings on answers and report results
Hybrid System Core Components
[Diagram: DirectQA (AURA, CYC, TEQA), Retrieval (IR, SQDB), suggested-question components (AURA SQs, CYC SQs, TEQA SQs, Hybrid SQs), and an Evaluation/Report stage, operating over a filtered set of Find-A-Value questions in Campbell Chapter 7; yellow outline marks new or updated components]
Legend: SQs = suggested questions; SQA = QA with suggested questions; TEQA = Textual Entailment QA; IR = Information Retrieval
Infrastructure: Dispatchers
[Diagram: Dispatcher routing Live Single QA, Suggested QA, and Batch QA requests to AURA, CYC, TEQA, and IR]
Dispatcher Features
• Asynchronous batch mode and single/experiment mode
• Parallel dispatching to reasoners
o Functional UI: live progress indicator, question-file viewer, logs
o Exception and error handling
• Retry a question when the server is busy
• The batch service continues to completion even if the client dies
o Canceling/stopping the batch process is also available
• Input and output support both XML and CSV/TSV formats
o Pipeline support: accepts Question-Selector input
• Configurable dispatchers: select reasoners
o Collect answers and compute basic statistics
(A dispatch sketch follows this list.)
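As an illustration of the parallel dispatch-and-retry behavior above, here is a minimal Python sketch. The reasoner names come from the slides, but `ask_reasoner`, the retry policy, and all signatures are illustrative assumptions, not the actual Halo implementation.

```python
# A sketch of parallel dispatch with busy-server retry; illustrative, not Halo code.
import concurrent.futures
import time

REASONERS = ["AURA", "CYC", "TEQA"]  # reasoner names from the slides

def ask_reasoner(reasoner: str, question: str) -> str:
    # Placeholder for the real web-service call to a reasoner.
    return f"[{reasoner}] answer to: {question}"

def ask_with_retry(reasoner: str, question: str, retries: int = 3, backoff: float = 2.0):
    # Retry the question when the server is busy, backing off between attempts.
    for attempt in range(retries):
        try:
            return ask_reasoner(reasoner, question)
        except Exception:
            time.sleep(backoff * (attempt + 1))
    return None  # give up; record the question as unanswered

def dispatch(question: str) -> dict:
    # Send the question to all reasoners in parallel and collect the answers.
    with concurrent.futures.ThreadPoolExecutor(max_workers=len(REASONERS)) as pool:
        futures = {pool.submit(ask_with_retry, r, question): r for r in REASONERS}
        return {futures[f]: f.result() for f in concurrent.futures.as_completed(futures)}

print(dispatch("What does ribosome make?"))
```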
Question-Answering via Suggested Questions
• Similar features to Live/Direct QA
• Aggregates suggested questions' answers as a solver
• Unique features:
o Interactively browse the suggested-questions database
o Filter on certain facets
o Use Q/A concepts, question types, etc. to improve relevance
o Automatic comparison of filtered and non-filtered results by chapter
Question and Answer Handling
• Handling and parsing each reasoner's returned results
o Customized programming per reasoner
• Information on execution: details and summary
• Report generation
o Automatic evaluation
• Question Selector (a selection sketch follows this list)
o Supports multiple facets/filters
o Question banks
o Dynamic UI to pick questions
o Hidden-tag support
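A minimal sketch of the facet-based question selection described above; the facet fields and question-bank shape are illustrative assumptions.

```python
# Hypothetical faceted filter over a question bank; fields are assumptions.
question_bank = [
    {"id": 1, "type": "FIND-A-VALUE", "chapter": 7, "tags": ["hidden"]},
    {"id": 2, "type": "WHY", "chapter": 4, "tags": []},
]

def select_questions(bank, **facets):
    # Keep questions whose fields match every requested facet value.
    return [q for q in bank if all(q.get(k) == v for k, v in facets.items())]

print(select_questions(question_bank, type="FIND-A-VALUE", chapter=7))
```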
Automatic Evaluation: Status as of March 2013
• Automatic result-evaluation features
• Web UI/service
• Algorithms to score exact and variable answers
– Brevity/clarity
– Relevance: correctness + completeness
– Overall score
• Generate reports
– Summary and details
– Graph plots
• Improving evaluation accuracy using basic text-processing tricks (stop words, stemming, trigram similarity, etc.), answer location, answer length, biology concepts, concept counts, referenced chapters, question types, and answer type (see the scoring sketch below)
• Experiments and analysis (several rounds, work in progress)
[Chart: user overall ratings vs. AutoEval overall scores, counts 0–120]
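A minimal sketch of the trigram-similarity trick mentioned above, assuming a character-trigram Jaccard measure; the actual AutoEval scorer combines this with many more features.

```python
# Hypothetical sketch: character-trigram Jaccard similarity between a system
# answer and a gold answer, one ingredient of an automatic answer score.
def trigrams(text: str) -> set:
    t = text.lower()
    return {t[i:i + 3] for i in range(len(t) - 2)}

def trigram_similarity(answer: str, gold: str) -> float:
    a, g = trigrams(answer), trigrams(gold)
    if not a or not g:
        return 0.0
    return len(a & g) / len(a | g)  # Jaccard overlap in [0, 1]

print(trigram_similarity("carbon binds four atoms",
                         "a carbon atom can bond to four other atoms"))
```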
Hybrid Performance
How we evaluate, and how we can improve, overall system performance
Caveats: Question Generation and Selection
• Generated by a small group of SMEs (senior biology students)
• Written in natural language, without the textbook (only the syllabus)
Question Set Facets
[Chart: question distribution over chapters 0 and 4–12]
Question Types:
FIND-A-VALUE 46%, IS-IT-TRUE-THAT 9%, HAVE-RELATIONSHIP 8%, HOW 7%, PROPERTY 6%, WHY 5%, HOW-MANY 5%, WHERE 5%, WHAT-DOES-X-DO 3%, WHAT-IS-A 3%, HAVE-SIMILARITIES 2%, X-OR-Y 2%, FUNCTION-OF-X 1%, HAVE-DIFFERENCES 1%
Caveat: Evaluation Criteria
• We provided a clear guideline, but grading remains subjective:
o A (4.0) = correct, complete answer, no major weaknesses
o B (3.0) = correct, complete answer with small cosmetic issues
o C (2.0) = partially correct or complete answer, with some big issues
o D (1.0) = somewhat relevant answer or information, or poor presentation
o F (0.0) = wrong, irrelevant, conflicting, or hard-to-locate answers
• Only 3 users rated the answers, under a tight timeline
[Chart: User Preferences (ratings 0–3) by reasoner: Aura, Cyc, Text QA]
Evaluation Example
Q: What is the maximum number of different atoms a carbon atom can bind at once?
More Evaluation Samples (Snapshot)
Reasoner Quality Overview
[Chart: answer counts (0–160) over rating (0.00–4.00) for Aura, Cyc, and Text QA]
Performance Numbers
[Charts: precision, recall, and F1 of Aura, Cyc, and Text QA, on all ratings (0–4) and on "good" answers (rating ≥ 3.0)]
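For reference, the precision, recall, and F1 plotted here follow the standard definitions; a minimal sketch, assuming each question is either answered or not and each answer is judged good or bad:

```python
# Standard precision/recall/F1 over a rated answer set; a sketch, not Halo code.
def prf1(num_good: int, num_answered: int, num_questions: int):
    precision = num_good / num_answered if num_answered else 0.0
    recall = num_good / num_questions if num_questions else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Example: 30 good answers out of 45 attempted, on a 60-question set.
print(prf1(num_good=30, num_answered=45, num_questions=60))
```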
Answers Over Question Types
[Charts: answer overall rating (0.00–4.00) and count of answered questions, by question type (FIND-A-VALUE, HOW, HOW-MANY, PROPERTY, WHAT-DOES-X-DO, WHAT-IS-A, X-OR-Y, IS-IT-TRUE-THAT, HAVE-DIFFERENCES, HAVE-SIMILARITIES, HAVE-RELATIONSHIP), for Aura, Cyc, and Text QA]
Answer Distribution Over Chapters
[Charts: answer distribution and answer quality (0.00–4.00) over chapters 0 and 4–12, for Aura, Cyc, and Text QA]
Answers on Questions with E/V (Exact/Various) Answer Type
[Charts: Exact/Various answer quality (0.00–3.00) and answer counts for Aura, Cyc, Text QA, and the average]
Improve Performance: Hybrid Solver – Combine!
• Random selector (dumbest; baseline)
o Total questions answered correctly should beat the best single solver
• Priority selector (less dumb)
o Pick a reasoner following a fixed preference order (e.g., Aura > Cyc > Text QA)
o Expected performance: better than the best individual
• Trained selector: feature- and rule-based (smarter)
o Decision-tree (CTree, …) learning over question type, chapter, …
o Expected performance: slightly better than the above
• Theoretical best selector: MAX, the upper limit (smartest)
o Assumes we can always pick the best-performing reasoner
(See the selector sketch below.)
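A minimal sketch of the random, priority, and MAX selectors, assuming each reasoner's answer carries a gold rating the MAX oracle can consult; the names and signatures are illustrative, not the actual implementation.

```python
# Hypothetical sketch of the hybrid answer selectors; not the actual Halo code.
import random

PRIORITY = ["Aura", "Cyc", "Text QA"]  # preference order from the slide

def random_selector(answers: dict):
    """answers maps reasoner name -> answer (or None). Pick any answer at random."""
    options = [a for a in answers.values() if a is not None]
    return random.choice(options) if options else None

def priority_selector(answers: dict):
    """Take the first reasoner in the preference order that answered."""
    for reasoner in PRIORITY:
        if answers.get(reasoner) is not None:
            return answers[reasoner]
    return None

def max_selector(answers: dict, ratings: dict):
    """Oracle upper bound: always pick the answer with the best gold rating."""
    answered = {r: a for r, a in answers.items() if a is not None}
    if not answered:
        return None
    best = max(answered, key=lambda r: ratings.get(r, 0.0))
    return answered[best]
```

The trained decision-tree selector would replace the fixed order with rules learned over features such as question type and chapter.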
Performance (F1) with Hybrid Solvers
[Chart: F1 of Aura, Cyc, Text QA, Random, Priority, D-Tree, and Max solvers on good answers (rating ≥ 3.0)]
Conclusion
• Each reasoner has its own strengths and weaknesses
o Some aspects are not handled well by AURA and CYC
o Low-hanging fruit: IS-IT-TRUE-THAT for all, WHAT-IS-A for CYC, …
• Aggregated performance easily beats the best individual (Text QA)
o The random solver does a good job (mean F1 = 0.609): F1(MAX) − F1(Random) ≈ 2.5%
• Little room remains for better performance via answer selection
o F1(MAX) − F1(D-Tree) ≈ 0.5%
o Better to focus on MORE and/or BETTER solvers
Future Work and Discussion
Near-Future Plans
• Include SQDB-based answers as a “solver”
o Helps alleviate reasoners' question-interpretation problems
• Include Information Retrieval-based answers as a “solver”
o Helps us understand the extra power reasoners can have over search
• Improve the evaluation mechanism
• Extract more features from questions and answers to enable a better solver, and see how close we can get to the upper limit (MAX)
• Improve the question selector to support multiple sources and automatic update/merge of question metadata
• Find ways to handle question-bank evolution
Further Technical Directions (June 2013+)
• Get more, better reasoners
• Machine learning, evidence combination
o Extract and use more features to select the best answers
o Evidence collection and weighing
• Analytics and tuning
o Easier exploration of individual results and diagnosis of failures
o Support for tuning and optimizing performance over target question-answer datasets
• Inter-solver communication
o Support shared data and shared answers
o Subgoaling: allow reasoners to call each other for subgoals
• Open Data, Open Services, Open Environment
Open *Data*
• Requirements: clear semantics, common format (standard), easy to access, persistent (available)
• Data sources: question bank, training sets, knowledge base, protocols for intermediate and final data exchange
• Open data access layer: design and implement protocols and services for data I/O (a format sketch follows)
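As one possible shape for such a data-exchange protocol, a minimal sketch of a question-answer record in a common, machine-readable format; the field names are illustrative assumptions, not a specification from the project.

```python
# Hypothetical question/answer exchange record; field names are assumptions.
import json

record = {
    "question_id": "q-0042",
    "question": "What does ribosome make?",
    "question_type": "WHAT-DOES-X-DO",
    "chapter": 7,
    "answers": [
        {"solver": "Aura", "text": "Proteins", "confidence": 0.9},
        {"solver": "Text QA", "text": "Ribosomes make proteins.", "confidence": 0.7},
    ],
}
print(json.dumps(record, indent=2))  # serialize for exchange between solvers
```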
Open *Services*
• Two categories: purely machine/algorithm-based, and human computation (social, crowdsourcing)
• Requirements: communicate with open data, generate metadata; more reliable, scalable, reusable
• Goal: process and refine data, converting raw, noisy, inaccurate data into refined, structured, useful data
Open *Environment*
• Definition: an AI development environment to facilitate collaboration, efficiency, and scalability
• Operation: like an MMPOG, each “player” earns credits: contribution, resource consumption; interest, loans; ratings…
• Opportunities: self-organized projects, growth potential, encouraging collaboration, a grand prize
Thank you for the opportunity! On to Q&A.
Backup slides follow.
IBM Watson’s “DeepQA” Hybrid Architecture
DeepQA Answer Merging and Ranking Module
Wolfram Alpha Hybrid Architecture
• Data Curation
• Computation
• Linguistic components
• Presentation
Answer Distribution (Density)
[Chart: answer counts over average user rating (0.00–4.00) for Aura, Cyc, and Text QA; x-axis: Average User Rating, y-axis: Count of Answers]
Data Table for Answer Quality Distribution
Work Performed
• Created web-based dispatcher infrastructure
o For both Live Direct QA and Live Suggested Questions
o Batch mode to handle larger volumes
• Built a web UI for UW students to rate answers to questions (HEF)
o Coherent UI, duplicate removal, queued tasks
• Established automatic ways to evaluate and compare results
• Employed initial file and data exchange formats and protocols
• Set up a faceted browsing and search (retrieval) UI
o Plus web services for third-party consumption
• Carried out many rounds of relevance studies and analysis
First Evaluation via the Halo Evaluation Framework
• We sent each individual QA result set to UW students for evaluation
• First-round hybrid system evaluation:
o Cyc SQA: 9 best (3 ties), 14 good, 15/60 answered
o Aura QA: 1 best, 9 good, 14/60 answered
o Aura SQA: 4 best (3 ties), 7 good, 8/60 answered
o Text QA: 27 best, 29 good; Text SQA: 3 best, 5 good, 7/60 answered
o Best scenario: 41/60 answered
o Note: Cyc Live was not included
o SQA = answering via suggested questions
Live Direct QA Dispatcher Service
[Screenshot: ask a question (“What does ribosome make?”), wait for answers, answers returned]
Live Suggested QA Dispatcher Service
Batch QA Dispatcher Service
Results are automatically downloaded once finished
Live Solver Service Dispatchers
Direct Live QA: What does ribosome make?
Direct Live QA: What does ribosome make? (continued)
Suggested Questions Dispatcher
Results for Suggested Question Dispatcher
Batch Mode QA Dispatcher
Batch QA Progress Bar
Results are automatically downloaded once finished
Suggested Questions Database Browser
Faceted Search on Suggested Questions
Tuning the Suggested Question Recommendation
Accomplished:
• Indexed the suggested-questions database
– Concepts, questions, answers
• Created a web service for uploading new sets of suggested questions
• Extracted chapter information from answer text (TEXT)
• Analyzed question types
– Pattern-based
• Experimented with some basic retrieval criteria
Not yet implemented:
• Parsing the questions
• More experiments (heuristics) on retrieval/ranking criteria
– Manual
• Getting SMEs to generate training data for evaluation
– Automatic
• More feature extraction
Parsing, Indexing, and Ranking
In place:
• New local concept-extraction service
• Concepts extracted and indexed
• Both sentences and paragraphs are in the index
• Basic sentence types identified
• Chapter and section information included
• Several ranking approaches evaluated
Not yet implemented:
• More sentence features
– Content type: question, figure, header, regular, review…
– Previous and next concepts
– Count of concepts
– Clauses
– Universal truth
– Relevance or not
• Question parsing
• More refinement of ranking
• Learning to Rank?
Browse Hybrid System
WIP: Ranking Experiments (Ablation Study)

Feature                       | Only (Easy) | Without (Easy) | Only (Hard) | Without (Hard)
Sentence Text                 | 139/201     |                | 31/146      |
Sentence Concept              | 79/201      |                | 13/146      |
Prev/Next Sentence Concept    | –           |                | –           |
Locality info (Chapter, etc.) | –           |                | –           |
Stopword list                 | –           |                | –           |
Stemming comparison           | –           |                | –           |
Other features (type…)        | –           |                | –           |
Weighting (variations)        |             |                |             |
Automatic Evaluation of IR Results
• Inexpensive, consistent results for tuning
o Always using human judgments would be expensive and somewhat inconsistent
• Quick turnaround
• Works with both “easy” and “difficult” question-answer sets
• Validated by UW students as trustworthy
o 95% accuracy on average, with a threshold
First UW Students’ Evaluation on AutoEval
• Notation (see the labeling sketch below):
o 0 = right on; 100% is right, 0% is wrong
o -1 = false positive: we gave a high score (>50%), but the retrieved text does NOT contain or imply the answer
o +1 = false negative: we gave a low score (<50%), but the retrieved text actually DOES contain or imply the answer
• We gave each of 4 students:
o 15 questions, i.e., 15×5 = 75 sentences and scores to rank
o 5 of the questions are shared; 10 are unique to each student
o 23/45 questions from the “hard” set, 22/45 from the “easy” set
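A minimal sketch of the 0 / -1 / +1 verification labels described above, assuming AutoEval emits a score in [0, 1] and a human judges whether the retrieved text contains the answer:

```python
# Hypothetical sketch of the 0 / -1 / +1 verification labels; threshold at 50%.
def verify_label(auto_score: float, human_says_contains: bool,
                 threshold: float = 0.5) -> int:
    predicted_contains = auto_score > threshold
    if predicted_contains == human_says_contains:
        return 0   # right on: AutoEval agrees with the human judgment
    return -1 if predicted_contains else +1  # -1 false positive, +1 false negative

print(verify_label(0.9, human_says_contains=False))  # -1: false positive
```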
Results: Auto-Evaluation Validity Verification
[Chart: per-student agreement (students 1–4, scale 0–1) at thresholds of 50% and 80%]
The “Easy” QA Set *
• Task: automatically evaluate whether retrieved sentences contain the answer
• Scoring: max score and Mean Average Precision (MAP); see the MAP sketch below
• Results using max score (threshold at 80%), over 193 regular questions and 8 yes/no questions (via concept overlap):
o Sentence text only: 139 (69.2%)
o Peter’s test set: 149 (74.1%)
o Peter’s more refined: 158 (78.6%)
o (Lower) upper bound for IR: 170 (84.2%)
o Jesse’s best: ??
* The evaluation is for the IR portion ONLY; no answer pinpointing
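A minimal sketch of Mean Average Precision over ranked retrieved sentences, assuming binary relevance judgments per sentence; this is the generic definition, not the project's exact scorer.

```python
# Generic Mean Average Precision (MAP) over ranked retrieval results.
def average_precision(relevant_flags):
    """relevant_flags: ranked list of booleans, True = sentence contains answer."""
    hits, score = 0, 0.0
    for rank, rel in enumerate(relevant_flags, start=1):
        if rel:
            hits += 1
            score += hits / rank  # precision at this rank
    return score / hits if hits else 0.0

def mean_average_precision(all_queries):
    """all_queries: list of relevant_flags lists, one per question."""
    return sum(average_precision(q) for q in all_queries) / len(all_queries)

# Example: answers found at ranks 1 and 3 of five retrieved sentences.
print(average_precision([True, False, True, False, False]))  # (1/1 + 2/3) / 2 ≈ 0.833
```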
“Easy” QA Set Auto-Evaluation
[Chart: auto-evaluation results (0–0.9) for Q-text-only, Vulcan Basic, Vulcan Refined, BaseIR, and the current upper bound]
Best Upper Bound for the Hard Set as of Today
With weighting on answer text, answer concepts, question text, and question concepts; matching over sentence text, concepts, concepts from previous and next sentences, and sentence type; and comparison via keyword overlap, concept overlap, stopword removal, and smart stemming: 56/146 = 38.4% (a weighted-matching sketch follows)
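A minimal sketch of the weighted feature-matching idea described above; the feature set and weights are illustrative assumptions, and the real system tunes them over the target question-answer sets.

```python
# Hypothetical weighted matching score combining several overlap features.
def overlap(a: set, b: set) -> float:
    return len(a & b) / len(a | b) if a or b else 0.0

# Illustrative feature weights; not the project's tuned values.
WEIGHTS = {
    "answer_text": 0.4,
    "answer_concepts": 0.3,
    "question_text": 0.2,
    "question_concepts": 0.1,
}

def match_score(features: dict) -> float:
    """features maps feature name -> (query_items, sentence_items) as sets."""
    return sum(w * overlap(*features[name]) for name, w in WEIGHTS.items())

features = {
    "answer_text": ({"four", "atoms"}, {"carbon", "binds", "four", "atoms"}),
    "answer_concepts": ({"carbon"}, {"carbon", "atom"}),
    "question_text": (set(), set()),
    "question_concepts": (set(), set()),
}
print(match_score(features))
```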
Sharing the Data and Knowledge
• Information we want, and each solver may also want:
• Everyone’s results
• Everyone’s confidence in its results
• Everyone’s supporting evidence
o From textbook sentences, reviews, homework sections, figures…
o From related web material, e.g., biology Wikipedia
o From common world knowledge: ParaPara, WordNet, …
• Training data, for offline use
More Timeline Details for First Integration
We are in control:
• AURA: now
• Text: before 12/7
• Vulcan IR Baseline: before 12/15
• Initial Hybrid System Output: before 12/21
– Without a unified data format
– With limited (possibly outdated) suggested questions
Partners:
• Cyc: ? hopefully before EOY 2012
• JHU: ?? hopefully before EOY 2012
• ReVerb: ??? EOM January 2013
Rounds of Improvements
Analysis (evaluation):
• Evaluation with humans
• With each solver plus the hybrid system
OpenHalo
[Diagram: Vulcan Hybrid System connected to CYC QA, SILK QA, Other QA, TEQA, and AURA through data-service collaboration]