Scaling Answer Type Detection to Large Hierarchies Kirk Roberts and Andrew Hickl {kirk,andy}@languagecomputer.com May 29, 2008


Page 1

Scaling Answer Type Detection to Large Hierarchies

Kirk Roberts and Andrew Hickl
{kirk,andy}@languagecomputer.com

May 29, 2008

Page 2

Introduction

• Work in factoid question-answering (Q/A) has long leveraged answer type detection (ATD) systems in order to identify the semantic class (or answer type) of the entities, words, or phrases most likely to correspond to the exact answer of a question.

[Answer Type Hierarchy (ATH) diagram: HUMAN branches into INDIVIDUAL, GROUP, and ORGANIZATION; INDIVIDUAL branches into ACTOR, ARTIST, AWARD, and ATHLETE; ATHLETE branches into BASEBALL PLAYER, CRICKET PLAYER, and SOCCER PLAYER. Example: "Who wears #23 for the Los Angeles Galaxy?" maps to SOCCER PLAYER.]


Page 4

Answer Types and Entity Types

• While articulated ATHs clearly have value for question-answering applications, most work in ATD has been limited by the number of types recognized by current named entity recognition systems.
– ACE Guidelines: ~35 entity types
– Typical Commercial Offering: ~50 entity types
– LCC’s CiceroLite™: > 350 entity types

• But are more types really better? Or do they make for a tougher learning problem?

Page 5

Four Challenges

Page 6

Challenge 1: Creating an Answer Type Hierarchy

• First (published) Answer Type Hierarchy (Li and Roth 2002, et seq.):
– 2-tiered structure: 6 “coarse” answer types, ~50 “fine” answer types

– LAND VEHICLE (7): Automobile, Truck, Mass Transport, Train, Military Vehicle, Industrial Vehicle
– WATER VEHICLE (4): Ships, Submarines, Civilian Watercraft, Other Watercraft
– AIR VEHICLE (4): Commercial Airliner, Military Plane, Other Aircraft, Blimp
– SPACE VEHICLE (3): Spacecraft, Satellite, Fictional Spacecraft

• Questions to answer:
– Why not just use an entity hierarchy as the answer type hierarchy?
– What is the right set of leaf nodes for an ATH?
– What is the right set of non-terminals for an ATH?

Page 7

Why not just use the entity hierarchy?

• Short answer: Entity hierarchies aren’t organized according to the potential information needs expressed by natural language questions.

• Entity Types are semantic categories assigned to phrases found in text:
– David Beckham was born on February 9, 1976. [ENTITY TYPE: DATE]
– David Beckham was 33 years old in 2008. [ENTITY TYPE: AGE]
– David Beckham (1976 - ) plays for the LA Galaxy. [ENTITY TYPE: YEAR_RANGE]
– David Beckham is one year older than Luis Figo. [ENTITY TYPE: RELATIVE_AGE]
– David Beckham, 33, was scratched by Capello. [ENTITY TYPE: GENERIC_NUMBER]
– David Beckham has been living for 33 years. [ENTITY TYPE: DURATION]

• Answer Types are semantic categories sought by a question:
– How old is David Beckham?

• Answer Type is AGE, but valid entity types include:
– AGE
– RELATIVE_AGE
– GENERIC_NUMBER
– DURATION
– DATE / YEAR_RANGE
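This one-to-many relationship between an answer type and its valid entity types can be sketched as a simple lookup; the `accepts` helper and the type inventory below are illustrative, not the actual CiceroLite types:

```python
# Hypothetical sketch: one answer type may be satisfied by several entity
# types, so answer extraction must accept any of them.

# Entity types a candidate phrase may carry for the AGE answer type
VALID_ENTITY_TYPES = {
    "AGE": {"AGE", "RELATIVE_AGE", "GENERIC_NUMBER", "DURATION",
            "DATE", "YEAR_RANGE"},
}

def accepts(answer_type: str, entity_type: str) -> bool:
    """Return True if a phrase tagged entity_type can answer answer_type."""
    return entity_type in VALID_ENTITY_TYPES.get(answer_type, set())

# "How old is David Beckham?" -> answer type AGE
assert accepts("AGE", "DURATION")     # "33 years" is an acceptable candidate
assert not accepts("AGE", "COMPANY")  # a company name is not
```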

Page 8

Constructing an ATH from an ETH

• Step 1: Initialize.
– Create the initial ATH as a direct clone of the existing ETH.

• Step 2: Consolidate Similar Nodes.
– Combine similar nodes under “abstract” parent nodes corresponding to a possible Q-stem:
• SOCCER PLAYER, BASEBALL PLAYER, CRICKET PLAYER → ATHLETE (Which player?)
• CITY, FACILITY, GEOPOLITICAL ENTITY → LOCATION (Where?)
• POEM, BOOK, MOVIE, GOVERNMENT DOCUMENT → AUTHORED_WORK (What work?)

• Step 3: Separate Existing Nodes into Subtypes.
– Create multiple answer types for a single entity type when it belongs under different parents:
• AIRPORT → AIRPORT_LOC and AIRPORT_ORG

• Step 4: Repeat (as necessary).
– Perform Steps 2 and 3 until all “merge-able” types are included in the ATH.
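The consolidate and split steps can be sketched over a toy hierarchy stored as a child-to-parent dict; the function names and data layout here are our own illustration, not the actual construction tooling:

```python
# Illustrative sketch of Steps 2 and 3 on a child -> parent mapping.
# Type names follow the slide; the helper functions are assumptions.

def consolidate(parents: dict, children: list, new_parent: str, old_parent: str):
    """Step 2: group similar nodes under a new abstract parent node."""
    parents[new_parent] = old_parent
    for c in children:
        parents[c] = new_parent

def split(parents: dict, node: str, variants: dict):
    """Step 3: replace one entity type with per-parent answer subtypes."""
    del parents[node]
    for new_node, parent in variants.items():
        parents[new_node] = parent

ath = {"SOCCER PLAYER": "INDIVIDUAL", "BASEBALL PLAYER": "INDIVIDUAL",
       "CRICKET PLAYER": "INDIVIDUAL", "AIRPORT": "LOCATION"}

consolidate(ath, ["SOCCER PLAYER", "BASEBALL PLAYER", "CRICKET PLAYER"],
            new_parent="ATHLETE", old_parent="INDIVIDUAL")
split(ath, "AIRPORT", {"AIRPORT_LOC": "LOCATION", "AIRPORT_ORG": "ORGANIZATION"})

assert ath["SOCCER PLAYER"] == "ATHLETE"       # now sits under the abstract node
assert "AIRPORT" not in ath                    # replaced by its two subtypes
```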

Page 9

Resultant Answer Type Hierarchy

• 11 coarse answer types (UIUC Hierarchy: 6)
– HUMAN, LOCATION, NUMERIC, ABBREVIATION, ENTITY, COMPLEX, WORK, TEMPORAL, TITLE, CONTACT-INFO, OTHER-VALUE*

• 296 fine types (UIUC Hierarchy: ~50)
– Examples: CASINO, MUSEUM, CITY, COUNTRY, STATE, ACTOR, BASEBALL PLAYER, MILITARY PERSON, COMPANY, UNIVERSITY, BASEBALL TEAM, ISLAND, PLANET, RIVER, ALBUM, SONG, BOOK, WRESTLER, SOCCER PLAYER, SPACE LOCATION, MOON, etc.

• Average depth: 3.8 levels
• Average number of “sisters”: 4.2 nodes

* Corresponds to a UIUC coarse type
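Summary statistics like average depth and average number of sisters can be computed from a hierarchy stored as a parent-to-children dict; the tree below is a toy fragment, not the real 296-type ATH, so the numbers differ from the slide's:

```python
# Toy hierarchy fragment (parent -> children) for illustrating the statistics.
tree = {"ROOT": ["HUMAN", "LOCATION"],
        "HUMAN": ["INDIVIDUAL", "GROUP"],
        "INDIVIDUAL": ["ATHLETE", "ACTOR"],
        "ATHLETE": ["SOCCER PLAYER", "BASEBALL PLAYER"]}

def leaf_depths(tree, node="ROOT", depth=0):
    """Yield the depth of every leaf node below `node`."""
    children = tree.get(node, [])
    if not children:
        yield depth
    for c in children:
        yield from leaf_depths(tree, c, depth + 1)

depths = list(leaf_depths(tree))
avg_depth = sum(depths) / len(depths)               # mean leaf depth
avg_sisters = sum(len(c) for c in tree.values()) / len(tree)  # mean branching

assert abs(avg_depth - 2.8) < 1e-9   # leaves at depths 4, 4, 3, 2, 1
assert avg_sisters == 2.0            # every internal node has 2 children
```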

Page 10

Four Challenges

Page 11

Annotation Methodology

• We experimented with three different annotation methodologies:

– Method #1: One Pass
• A traditional one-pass annotation where the annotator assigns the final “fine” AT

– Method #2: Two Passes
• Annotators first select a “coarse” answer type, then select a “fine” answer type

– Method #3: Multiple Passes
• Annotators annotate each question according to each decision point in the hierarchy
• Annotators can STOP annotation at any level in the hierarchy

Method 1: “Who wears #23 for the Los Angeles Galaxy?” → SOCCER PLAYER
Method 2: “Who wears #23 for the Los Angeles Galaxy?” → HUMAN → SOCCER PLAYER
Method 3: “Who wears #23 for the Los Angeles Galaxy?” → HUMAN → INDIVIDUAL → ATHLETE → SOCCER PLAYER

Page 12

One-Pass Annotation

[Hierarchy diagram: coarse types (Named, Value, Complex, Abbrev, Human, Location, Work, Other) branch into fine types (e.g. Individual, Group, Organization; Actor, Artist, Writer, Coach, Athlete). “Who is the owner of the Los Angeles Galaxy?” is mapped directly to a fine type in a single pass.]

Page 13

Two-Pass Annotation

[Same hierarchy diagram: “Who is the owner of the Los Angeles Galaxy?” is first mapped to a coarse type (Human), then to a fine type in a second pass.]

Page 14

Multi-Pass Annotation

[Same hierarchy diagram with STOP options: “Who is the owner of the Los Angeles Galaxy?” is annotated at each decision point in turn (Human → Individual → …), and the annotator may STOP at any level.]

Page 15

Annotating Questions

• Annotated a corpus of 10,000 questions using all three annotation methods

• UIUC Set: Factoid and Definition Questions (Li and Roth 2002)
– How many villi are found in the small intestine?

• Web Crawl Set: “What” questions taken from on-line FAQs
– What is the e-mail address for the mayor of Miami?

• Ferret Log: Factoid and Complex Questions taken from previous experiments with LCC’s Ferret question-answering system (Hickl et al. 2006)
– What is the relationship between Iran and Hezbollah?
– What power plants are in Baluchestan?

UIUC Train & Test: 5,952
Web Crawl: 3,485
FERRET Log: 563
Total: 10,000

Page 16

Experimental Methodology

• Manually annotated 10,000 factoid questions
– “Warm-up” Set: 1000 Questions (Method 1)
– Method 2: 4000 Questions
– Method 3: 4000 Questions
– “Cool Down” Set: 1000 Questions (Method 1)

• Each set annotated by 2 different pairs of annotators
• Annotators tasked with annotating 1K questions per session
• Differences between annotators resolved after each session; differences between annotator pairs were resolved after all questions were annotated
• Average agreement between individuals per session:

Method 1: 47.4% (initial 1K), 72.3% (final 1K)
Method 2: 86.2% (coarse), 79.4% (fine)
Method 3 (per decision level): 99.8%, 85.3%, 84.7%, 91.5%, 97.0%

Page 17

Four Challenges

Page 18

Performing Answer Type Detection

• Heuristic (Harabagiu et al. 2001, Harabagiu et al. 2002):
– Used lexicosemantic features (e.g. WordNet synsets) to map between question terms and answer types
– Performance dependent on number of synset mappings

• “Flat” Classification:
– Classifier used to directly map to one of n fine answer types
– Performance degrades as n increases

• (Pure) “Hierarchical” Classification (Li and Roth 02, Das et al. 05, Hickl et al. 06):
– Recursively identifies best “child” node for each answer type
– Only the children of the current type are considered as outcomes at every branching point in the hierarchy
– Proceeds until no more branching points, or until a STOP type has been selected

• “Hierarchical” Classifier + Heuristics (Hickl et al. 07, Roberts & Hickl 08):
– Uses classifier to identify best “child” node for selected sets of answer types
– Uses heuristics to map to some terminal nodes
– Proceeds until:
• No more branching points
• No heuristics available for mapping to finer types
• A STOP type has been selected
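The hybrid hierarchical descent can be sketched as a loop over branching points; the toy tree, classifier, and heuristic below are stand-ins for illustration, not the real Ferret components:

```python
# Minimal sketch of hierarchical descent with a heuristic fallback.
STOP = "STOP"

def hybrid_atd(question, children, classify, heuristic):
    """Walk down the ATH from the root: at each branching point, prefer a
    heuristic mapping if one exists, otherwise ask the classifier; stop at
    a leaf or an explicit STOP decision."""
    node = "ROOT"
    while children.get(node):
        step = heuristic(question, node)
        if step is None:
            step = classify(question, node)
        if step == STOP:
            break
        node = step
    return node

# Toy hierarchy and components for illustration
children = {"ROOT": ["HUMAN", "LOCATION"], "HUMAN": ["INDIVIDUAL", "GROUP"],
            "INDIVIDUAL": ["ATHLETE", "ACTOR"],
            "ATHLETE": ["SOCCER PLAYER", "BASEBALL PLAYER"]}
classify = lambda q, node: children[node][0]          # dummy: pick first child
heuristic = lambda q, node: ("SOCCER PLAYER"          # pattern-based shortcut
                             if node == "ATHLETE" and "Galaxy" in q else None)

result = hybrid_atd("Who wears #23 for the Los Angeles Galaxy?",
                    children, classify, heuristic)
assert result == "SOCCER PLAYER"
```

Descent stops as soon as a leaf is reached or the STOP outcome is chosen, so coarser types remain possible final answers.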

Page 19

Performance of ATD

• Compared performance of 3 classification-based approaches:
– “Flat” Classification
– Pure “Hierarchical” Classification
– “Hierarchical” Classification + Heuristics

• All approaches trained / tested on the same questions:
– UIUC Hierarchy: 4000 train / 2000 test
– LCC Hierarchy: 8000 train / 2000 test

ATD Method          | Coarse Type | Fine Type
Flat                | --          | 79.8%
Pure Hierarchical   | 92.5%       | 86.7%
Hybrid Hierarchical | 92.5%       | 89.5%

Page 20

Four Challenges

Page 21

Architecture of Ferret

• We used a “baseline” version of LCC’s question-answering system, Ferret (Hickl et al. 2006), in order to evaluate the impact that an expanded ATH could have on Q/A performance.

[Ferret pipeline: Question Processing → ATD → Document Retrieval → Passage Retrieval → Answer Extraction → Answer Ranking → Answer Validation]
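A modular pipeline like this can be sketched as a chain of functions over a shared state dict; the stage bodies below are placeholders for illustration, not Ferret's actual modules:

```python
# Sketch of a modular Q/A pipeline: each stage reads and extends a state dict.

def run_pipeline(question, stages):
    """Thread a state dict through each stage in order."""
    state = {"question": question}
    for stage in stages:
        state = stage(state)
    return state

# Placeholder stages (assumptions): ATD tags the question with an answer type;
# later stages would retrieve documents, extract, rank, and validate answers.
def question_processing(state):
    state["keywords"] = state["question"].split()
    return state

def atd(state):
    state["answer_type"] = "SOCCER PLAYER"  # stand-in for a real ATD decision
    return state

state = run_pipeline("Who wears #23 for the Los Angeles Galaxy?",
                     [question_processing, atd])
assert state["answer_type"] == "SOCCER PLAYER"
```

Because each stage only depends on the shared state, swapping in a new ATD module (e.g. one trained on a larger ATH) leaves the rest of the pipeline untouched.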

Page 22

Impact on Question-Answering

• Used Ferret on a set of 188 factoid questions taken from past TREC QA evaluations which had known answer types in both the UIUC and LCC ATHs
– Document Collection: AQUAINT-2 Newswire Corpus (2 GB)
– Answers judged by hand based on TREC QA keys
– Question considered answered correctly (“Top 1”) if a valid answer was returned in first position
– Question considered answered correctly (“Top 5”) if a valid answer was returned in any of the top 5 answers returned by the system

Q/A Method          | Top 1 Performance | Top 5 Performance
Coarse Only (11)    | 31.3% (+8.8%)     | 34.4% (+7.9%)
Coarse + Fine (307) | 38.6% (+10.3%)    | 48.5% (+17.1%)
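The Top-1 / Top-5 scoring scheme corresponds to top-k accuracy; a minimal sketch assuming gold answers can be checked by string membership (the actual judging was done by hand against TREC keys):

```python
# Top-k accuracy over ranked answer lists (toy data, not the TREC results).

def top_k_accuracy(ranked_answers, gold, k):
    """Fraction of questions whose gold answer appears in the top k returns."""
    hits = sum(1 for answers, g in zip(ranked_answers, gold) if g in answers[:k])
    return hits / len(gold)

ranked = [["Beckham", "Donovan"],   # question 1: correct at rank 1
          ["Tokyo", "Kyoto"],       # question 2: correct at rank 2
          ["1976", "1975"]]         # question 3: no correct answer returned
gold = ["Beckham", "Kyoto", "1974"]

assert top_k_accuracy(ranked, gold, 1) == 1 / 3   # only question 1 hits at rank 1
assert top_k_accuracy(ranked, gold, 5) == 2 / 3   # questions 1 and 2 hit in top 5
```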

Page 23

Conclusions

• Annotated a corpus of more than 10,000 factoid questions with appropriate “fine” answer types with nearly 90% inter-annotator agreement

• Constructed classifier-based ATD models capable of associating questions with their appropriate answer type with nearly 90% accuracy

• Incorporated new ATD system into a baseline Q/A system; showed improvement of more than 10% over system using previous ATH

Page 24

Talk Overview

• Introduction
• Four Challenges:
– Challenge 1: Organization. Can we organize a large entity hierarchy into a workable answer type hierarchy?
– Challenge 2: Annotation. Can we reliably annotate questions with fine-grained types from a large ATH? What’s the best way to perform annotation?
– Challenge 3: Learning. Can we learn models for performing fine-grained ATD? How do they compare with current ATD models?
– Challenge 4: Implementation. How do we incorporate ATD into a Q/A system (without sacrificing performance)?
• Conclusions