
Improving Answer Precision and Recall of List Questions

Kian Wei Kor

Master of Science
School of Informatics
University of Edinburgh
2005


Abstract

This thesis presents a novel approach to answering natural language list questions, or questions that have more than a single correct answer. It is based on the hypothesis that the set of answers to a list question will often occur in a similar context. By analyzing candidate answers produced by an existing Question Answering system, it is possible to identify the common context shared by two or more candidate answers. Once the common context is identified, it is then possible to extrapolate from this common context to identify more answer candidates previously not found by the original Question Answering system.


Acknowledgements

Many thanks to Bonnie Webber, Johan Bos, Kisuh Ahn and Malvina Nissim for your invaluable advice and guidance. To Alok Mishra, June Tee and Sasithorn Parinamosot for being great company this past year. And finally, to my parents who have made this all possible.


Declaration

I declare that this thesis was composed by myself, that the work contained herein is my own except where explicitly stated otherwise in the text, and that this work has not been submitted for any other degree or professional qualification except as specified.

(Kian Wei Kor)


Table of Contents

1 Introduction
2 Background
  2.1 Question Answering
    2.1.1 Text REtrieval Conference (TREC)
  2.2 Relations and Patterns in Question Answering
    2.2.1 Dual Iterative Pattern Relation Extraction
    2.2.2 DIPRE and Question Answering
  2.3 String Matching and String Alignment Algorithms
    2.3.1 Smith-Waterman-Gotoh Algorithm
3 The LiQED System
  3.1 External Libraries
    3.1.1 Apache Lucene Search Engine
    3.1.2 OpenNLP Toolkit
    3.1.3 LingPipe Natural Language Toolkit
    3.1.4 JAligner
    3.1.5 Google Web API
    3.1.6 Amazon Web Services API
  3.2 DIPRE Implementation
    3.2.1 Preprocessing
    3.2.2 Pattern Generation
    3.2.3 Extracting Answers
4 LiQED in TREC
  4.1 Improved Answer Extraction
    4.1.1 Hypernyms for Answer Extraction
    4.1.2 Publication Answers
  4.2 Google Reranking
  4.3 Answer Merging
5 Evaluation and Analysis
  5.1 Training Set - TREC 2004
  5.2 Test Set - TREC 2005
    5.2.1 Evaluation
    5.2.2 Event Questions
    5.2.3 Semantically Deep Questions
    5.2.4 Answer Extraction
6 Conclusion
  6.1 Conclusion
  6.2 Future Enhancements
    6.2.1 Alternative Question Targets
    6.2.2 Fine-Grained Named Entity Recognition
    6.2.3 Anaphora Resolution
    6.2.4 Answer Verification
A TREC 2004 List Questions
B TREC 2005 List Questions
C Sample LiQED Pattern File
Bibliography


Chapter 1

Introduction

Statistical Natural Language technologies are becoming increasingly important in today's world. As the world wide web continues to grow, there is an increasing amount of textual information and knowledge waiting to be mined. While search engine technologies currently fill this information gap, it is increasingly clear that simply retrieving relevant documents may not be the best solution to this problem. What users are searching for is information and knowledge. Users do not want to sift through documents returned by search engines, looking for relevant nuggets of information. What users want are answers: precise information nuggets and evidence to support the validity of these information nuggets.

Question Answering (QA) systems are an attempt to fulfill this need for exact answers. Unlike a search engine that returns a ranked list of relevant documents for a query, a question answering system takes a user-posed question and returns what it believes is the correct answer to the question. Open domain Question Answering as a research area took off in 1999, mainly due to the introduction of a Question Answering Track in the Text REtrieval Conference (TREC), see Voorhees (1999), held by the American National Institute of Standards and Technology (NIST). The first TREC QA systems showed that by combining Natural Language Processing and Information Retrieval techniques, it is computationally possible to extract answers to questions from a large corpus of news reports.

Question Answering systems have grown increasingly sophisticated over the past six years.

A typical current-day QA system contains components for natural language processing, statistical and probabilistic machine learning, and logic inference, and incorporates various sources of world knowledge. Yet despite all this increased complexity, Voorhees (2004) shows that the best QA system to date has a balanced F-score of 0.770. There is still room for much improvement.

Question Answering research has mainly been focused on factoid questions, that is, questions that have a single, concise fact as an answer. Examples of factoid questions include "In what sea did the Russian submarine Kursk sink?" and "Who won the Miss Universe 2000 crown?". One area of Question Answering that has been neglected in the past few years is the list question. List questions are similar to factoid questions, except there is more than one correct and distinct answer to the question. For example, "Which countries expressed regret about the loss of the Russian submarine Kursk?" and "Name the contestants of Miss Universe 2000.".

Most research thus far treats list questions simply as shorthand for asking the same factoid question multiple times. The set of all correct, distinct answers in the document collection that satisfy the factoid question is the correct answer to the list question. In some instances, the answer to a list question is simply the top N distinct answers found by the factoid question answering system.

This may not necessarily be the only way or the best way to answer list questions. This dissertation presents a supplementary approach to answering list questions. It is based on the hypothesis that the set of answers to a list question often appears in similar contexts. By analyzing candidate answers produced by an existing factoid QA system, it is possible to identify the common context in which two or more answer candidates appear. Once the common context is identified, it is then possible to extrapolate from this common context to identify more answer candidates previously not found by the original factoid QA system.

The common context described above can be expressed in several different forms, for example as common syntactic constituents or semantic structures. In this thesis, words that frequently co-occur with two or more answer candidates are used as the common context.

Chapter 2 will provide some background information on some of the techniques and algorithms used.

Chapter 3 details the implementation of the LiQED (short for List QED) system, which takes a set of candidate answers from an existing QA system, Edinburgh University's QED system by Leidner et al. (2003), and expands the answer set with new and distinct candidate answers. Chapter 4 wraps up the discussion of the LiQED system with some of the challenges involved in using LiQED as the list question answering module for the TREC conference. Chapter 5 presents the results of applying LiQED in the TREC 2005 Question Answering evaluation, and Chapter 6 concludes with analysis and potential areas for enhancement.


Chapter 2

Background

This chapter presents some of the background work that directly relates to the work done in this thesis. The first part of the chapter gives some basic background on the field of Question Answering and on the de facto arena for the evaluation of Question Answering systems, the Text REtrieval Conference. The second part of the chapter covers prior work that relates to the thesis, including the idea of using patterns and relations by Brin (1999), and the work of Ravichandran and Hovy (2002) on using automatically generated text patterns for Question Answering. The chapter ends with a brief look at two string matching algorithms, one of which will be used in the thesis.

2.1 Question Answering

A Question Answering (QA) system is a computer system that has access to one or more sources of information and is capable of using these information sources to find answers to natural language questions posed by human users.

A question answering system is quite similar to a search engine. Both provide a user interface that allows human users to find information. However, there are several differences that distinguish a search engine from a question answering system. The most salient difference is that a search engine relies on the human user to pose queries, not questions.

A query is a specially formatted string that contains keywords and search engine commands. The output of a search engine is typically a set of documents that match the user's query. A Question Answering system, on the other hand, takes a natural language question as input and typically returns a specific answer as its output.

The earliest Question Answering systems, such as BASEBALL by Green et al. (1961), were developed in the 1960s. BASEBALL provided a natural language user interface to a database of baseball facts and figures. Another early Question Answering system is LUNAR by Woods (1973), which allowed NASA geologists to ask questions of a database containing information on and analyses of the lunar rock and soil samples gathered from the Apollo 11 lunar expedition mission.

Today, there are many different types of question answering systems. Generally, these systems fall into two classes: open domain Question Answering systems and closed domain Question Answering systems. A closed-domain Question Answering system is designed to answer questions that fall within a specific specialist domain. For example, there is a medicine question answering system by Yun and Graeme (2004) and an aircraft maintenance question answering system by Rinaldi et al. (2003). Open-domain systems deal with generalist questions about nearly everything. An example of an open-domain question answering system is MIT's START web-based system (Katz, 1997), which can be found at http://www.ai.mit.edu/projects/infolab/.

2.1.1 Text REtrieval Conference (TREC)

Open domain Question Answering as a research area took off in 1999, due to the introduction of a Question Answering Track in the Text REtrieval Conference (TREC), see Voorhees (1999), held by the American National Institute of Standards and Technology (NIST). The first TREC QA systems showed that by combining Natural Language Processing and Information Retrieval techniques, it is computationally possible to extract answers to questions from a large corpus of news reports.

2.1.1.1 Aquaint Corpus

The corpus used in the Question Answering Track in TREC is the AQUAINT corpus.

AQUAINT consists of English newswire articles stored as text data and is drawn from three sources: the Xinhua News Service (People's Republic of China), the New York Times News Service, and the Associated Press Worldstream News Service.

The corpus contains 1,033,461 news articles from the year 1996 to 2000. The corpus is 3 gigabytes in size, containing roughly 25 million sentences and 375 million words.

2.2 Relations and Patterns in Question Answering

This section explores some prior work that has been done to exploit the relationships inherent between two concepts. Concepts are represented in natural language as words. The recent work on the semantic web by Berners-Lee et al. (2001) aims to capture the relationship between these concepts via the way concept words relate to each other. In question answering, there is a clear and direct relationship between a question and its answers. Both the question (or more specifically the question topic) and the answers are in some sense related. Most QA systems try to leverage these relationships in some way.

This thesis takes a look at list questions because these questions offer an opportunity to directly examine the relationship between a question and its answers. Given a question and some correct answers, is it possible to identify the relationship between the question topic and an answer? Can we then use this relationship to expand the set of correct answers? Given a set of answers from a QA system which may contain false positives, can relationships be reliably identified in the presence of such noise?

There exists a wide range of natural language processing techniques applicable to this task. For this thesis, we will implement and examine one possible technique.

2.2.1 Dual Iterative Pattern Relation Extraction

Brin (1999) suggested that it is possible to extract pairs of related concepts, for example authors and their book titles, from the web starting from just a small set of seed samples. The general algorithm he proposed, called Dual Iterative Pattern Relation Extraction (DIPRE), works as follows.


1. Start with a small seed set of (author, title) pairs.

2. Find all occurrences of those pairs on the web.

3. Identify patterns for the citations of the books from these occurrences.

4. Search the web for these patterns to recognize more new (author, title) pairs.

5. Repeat the steps with the new (author, title) pairs to find even more (author, title) pairs.

Brin shows that using this method, he successfully found over 15,000 (author, title) pairs from an initial seed of just 5 author and book title pairs. DIPRE has also been used by Yi and Sundaresan (1999) to identify acronyms of organizations on the web.

2.2.2 DIPRE and Question Answering

While Brin's DIPRE algorithm has been successfully applied to the web, it is not certain whether such an algorithm, or a similar one, will be successful within a question answering context. Thus far the algorithm has only been applied to simple, clearly defined and commonly occurring relationship types such as (author, title) pairs or (organization, acronym) pairs.

In separate work, Hovy et al. (2002) applied a similar algorithm in question answering systems. By examining past TREC factoid questions, six commonly occurring relationship types were identified. They showed that it is possible to extract patterns for such relationships from a corpus. Table 2.1 shows the six relationship types and an example of the type of pattern picked up by Hovy and Ravichandran's system.

Relationship Type         Sample Pattern
<PERSON>-<BIRTHYEAR>      <PERSON> was born on <BIRTHYEAR>
<PERSON>-<INVENTION>      the <INVENTION> was invented by <PERSON>
<DISCOVERY>-<PERSON>      discovery of <DISCOVERY> by <PERSON>
<ENTITY>-<TYPE>           , a form of <TYPE>, <ENTITY>
<PERSON>-<FAME>           the famous <FAME>, <PERSON>,
<NAME>-<LOCATION>         at the <NAME> in <LOCATION>

Table 2.1: Six common relationship types found in factoid questions

Unlike Hovy and Ravichandran, this work focuses on relationships specific to each individual question rather than specialized relationships applicable only to certain classes of questions. This is achieved by adapting the DIPRE algorithm to operate in a question answering context.

The adapted DIPRE algorithm for question answering treats the Aquaint corpus as a bag of sentences. Given a question and a set of potential answers, the adapted algorithm works as follows:

1. Find sentences that contain the relationship.

2. Identify surface text patterns common to two or more sentences.

3. Find other sentences that match these surface text patterns, extract answers.

The above algorithm can be applied iteratively like the original DIPRE algorithm. However, there is a danger of false positive answers causing a feedback loop and amplifying the number of false positives with each iteration. Thus, for this thesis, it was decided not to apply the algorithm in an iterative fashion.
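To make this concrete, the following is a minimal Java sketch of the adapted, non-iterative control flow (Java being the language LiQED is implemented in, as described in Chapter 3). The SentenceIndex and PatternInducer interfaces and all method names are hypothetical placeholders used for illustration, not LiQED's actual classes.

import java.util.*;

// Minimal sketch of the non-iterative, DIPRE-style procedure for list questions:
// seed (target, answer) pairs -> shared surface patterns -> new candidate answers.
// SentenceIndex and PatternInducer are hypothetical interfaces, not LiQED's real API.
interface SentenceIndex {
    List<String> sentencesContaining(String target, String answer); // step 1
    List<String> sentencesMatching(String pattern);                 // step 3
}

interface PatternInducer {
    List<String> inducePatterns(List<String> sentences, String target, Set<String> answers); // step 2
    Set<String> extractAnswers(String pattern, String sentence);    // step 3 (extraction)
}

public class DipreForListQuestions {
    public static Set<String> expandAnswers(String target, Set<String> seedAnswers,
                                            SentenceIndex index, PatternInducer inducer) {
        // Step 1: find sentences that contain both the question target and a seed answer.
        List<String> support = new ArrayList<>();
        for (String answer : seedAnswers) {
            support.addAll(index.sentencesContaining(target, answer));
        }
        // Step 2: induce surface text patterns common to two or more of these sentences.
        List<String> patterns = inducer.inducePatterns(support, target, seedAnswers);

        // Step 3: match the patterns against the corpus and extract new candidate answers.
        Set<String> expanded = new LinkedHashSet<>(seedAnswers);
        for (String pattern : patterns) {
            for (String sentence : index.sentencesMatching(pattern)) {
                expanded.addAll(inducer.extractAnswers(pattern, sentence));
            }
        }
        // No further iteration: feeding the expanded set back in would amplify false positives.
        return expanded;
    }
}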

2.2.2.1 The Question - Answer Relationship

Table 2.2 shows a TREC 2005 list question and the set of all correct answers. It is possible to say that the question target, OPEC, is in some way related to the answer set; the relationship is between the organization OPEC and its member countries. Likewise, each answer is also in some way related to the other answers in the answer set.

Target     OPEC
Question   What are OPEC countries?
Answers    Brazil, Algeria, Indonesia, Iran, Iraq, Kuwait, Libya,
           Nigeria, Qatar, Saudi Arabia, United Arab Emirates

Table 2.2: Question on OPEC countries

We thus have two possible question-specific relationship types that can be exploited in list questions: the Question to Answer relationship and the Answer to Answer relationship. Both relationships were initially explored. However, initial experimentation showed that there were disadvantages to using the Answer to Answer relationship.

1. Simple Patterns. Pattern generation experiments were conducted using known Answer to Answer pairs from the TREC 2004 list question set. The majority of the generated text patterns were simply answer pairs delimited by a comma "," or the word "and". In other words, the patterns generated correspond precisely to what one would expect to match a sequence of answers. These are common and intuitive patterns that do not require a sophisticated method for extraction.

2. Computational Complexity. Given a set of N answers, it takes O(N(N-1)/2) time to perform pairwise comparisons between all distinct answer pairs. In comparison, Question to Answer relationship pairs only require O(N) comparisons. In light of the simple patterns produced by Answer to Answer pairs, it was determined that using Answer to Answer relations would not be helpful.

3. Identifying Conjunctions. It was found that patterns generated using Question to Answer relationships can also identify and extract a list of sequential answers in a single sentence, thus removing the need to use Answer to Answer relationships to identify a list of answers.


For these reasons, the Answer to Answer relationship was not used in the final build of the system. Only the question target and the answer set are used. For the remainder of this thesis, angle brackets will be used to represent words related to a certain concept; examples can already be found in Table 2.1. Specifically, <QUESTION> will refer to words in the question target, while <ANSWER> refers to words that make up an answer.

2.3 String Matching and String Alignment Algorithms

In order to identify common text patterns in <QUESTION>-<ANSWER> relationship pairs, it is important to have an algorithm that is able to pick up surface text patterns common to two or more of these pairs of words. Hovy et al. (2002) borrowed the idea of suffix trees from computational biology for this purpose.

A suffix tree is a tree-like data structure for storing a string. Suffix trees are used to solve the exact string matching problem in linear time, achieving about the same worst-case bound as the Knuth et al. (1977) and the Boyer and Moore (1997) algorithms. See Gusfield (1997) and Nelson (1996) for more information on suffix trees.

A suffix tree T for a string S (with n = |S|) is a rooted, labeled tree with a leaf for each non-empty suffix of S. Furthermore, a suffix tree satisfies the following properties:

• Each internal node, other than the root, has at least two children;

• Each edge leaving a particular node is labeled with a non-empty substring of S of which the first symbol is unique among all first symbols of the edge labels of the edges leaving this particular node;

• For any leaf in the tree, the concatenation of the edge labels on the path from the root to this leaf exactly spells out a non-empty suffix of S.

By concatenating a pair of sentences into a single string and constructing its suffix tree, the longest common substring can be identified in O(n + m) time, where n is the length of the first string and m is the length of the second string.

• Mozart (1756-1791) was a genius.


• The great Mozart (1756-1791) achieved fame at a young age.

In the above two sentences, a suffix tree would be able to identify that the longest common substring is Mozart (1756-1791), which nicely matches Hovy's <NAME>-<BIRTHYEAR> text pattern. However, if either sentence has some form of the long distance dependency that often occurs in natural language, a suffix tree would fail to find the text pattern.

For example, a suffix tree will only pick up the fragment author of Silent Spring from the two sentences below. This fragment would not result in a positive match with any <BOOK>-<AUTHOR> patterns, as the fragment does not contain an author name.

• Rachel Carson is the author of Silent Spring.

• Rachel Carson, founder of the contemporary environmental movement, author of Silent Spring, died on April 14 1964.

During the course of experimentation while working on this thesis, it was discovered that a significant majority of the sentences that contain a list question relationship pair also separate the question topic and answer via some form of long distance dependency. Thus, an algorithm that is able to perform substring matching on sentences with long distance dependencies is required.

2.3.1 Smith-Waterman-Gotoh Algorithm

Besides the suffix tree algorithm, the field of computational biology also uses a number of other algorithms for matching DNA or protein sequences. One such algorithm is the Smith-Waterman algorithm by Smith and Waterman (1981). It is a dynamic programming algorithm that works by computing the optimal local alignment between two sequences or strings.

Suppose we have two sequences A = (a_1, a_2, a_3, ..., a_n) and B = (b_1, b_2, b_3, ..., b_m), and a scoring or similarity matrix s, where s(a_i, b_j) is a measure of the similarity between element a_i in sequence A and element b_j in sequence B. The algorithm computes the optimal alignment using a score function, H(i, j), that measures the degree of optimal alignment between the two sequences ending at elements a_i and b_j:


H(i, j) = max { 0,
                H(i-1, j-1) + s(a_i, b_j),
                H(i-1, j) + d,
                H(i, j-1) + d }

where d is the open gap penalty value. In biology, d is usually a function which allows biologists to vary the penalty costs for different gap sizes. For the purposes of this thesis, the gap penalty is constant, thus d = 1. The original Smith-Waterman algorithm stores the intermediate values of H(i, j) as a two-dimensional n by m matrix, which takes O(m^2 n) time to compute. There is an improved version of the algorithm by Gotoh (1982) that reduces the computation time to O(mn). Once this matrix is computed, the optimal alignment can be found by retracing steps through the matrix.

To illustrate the results of the algorithm, here is an example from biology where the algorithm is used to compare human and mouse DNA.

Human DNA  QWEFTEDPGGDEAFT...
               |-|  ||||-|||...
Mouse DNA     EEEET PGGDFAFT...

The algorithm identifies similar parts of the two DNA sequences and finds the optimal alignment, which is indicated by the vertical bars |. Gaps are indicated by spaces, and the horizontal bar - indicates a single DNA (character) substitution.

By replacing DNA sequences with natural language word sequences, it is possible to use this algorithm to identify matching substrings even if there are long-distance dependencies in the sentences. The Smith-Waterman-Gotoh algorithm is thus the algorithm of choice for this thesis.
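To illustrate how local alignment carries over to word sequences, here is a compact Java sketch of the basic Smith-Waterman recurrence applied to token arrays. It is the plain quadratic-space textbook form rather than Gotoh's optimized variant or the JAligner library used later in the thesis, it only returns the best local-alignment score (a full system would also trace back the aligned fragment), and it applies the gap penalty as a negative score; the scoring constants are illustrative values.

// Compact sketch of Smith-Waterman local alignment over word tokens. This is the
// basic O(nm) dynamic-programming recurrence, not JAligner or Gotoh's affine-gap
// variant; the scoring constants below are illustrative, not the thesis's settings.
public class TokenAligner {
    static final double MATCH = 2.0, MISMATCH = -1.0, GAP = -1.0;

    static double score(String a, String b) {
        return a.equalsIgnoreCase(b) ? MATCH : MISMATCH;
    }

    // Returns the best local-alignment score between two token sequences.
    public static double align(String[] a, String[] b) {
        double[][] h = new double[a.length + 1][b.length + 1];
        double best = 0.0;
        for (int i = 1; i <= a.length; i++) {
            for (int j = 1; j <= b.length; j++) {
                double diag = h[i - 1][j - 1] + score(a[i - 1], b[j - 1]);
                double up = h[i - 1][j] + GAP;     // gap in sequence b
                double left = h[i][j - 1] + GAP;   // gap in sequence a
                h[i][j] = Math.max(0.0, Math.max(diag, Math.max(up, left)));
                best = Math.max(best, h[i][j]);
            }
        }
        return best;
    }

    public static void main(String[] args) {
        String[] s1 = "Rachel Carson is the author of Silent Spring .".split(" ");
        String[] s2 = ("Rachel Carson , founder of contemporary environmental movement , "
                + "author of Silent Spring , died on April 14 1964 .").split(" ");
        System.out.println("Local alignment score: " + align(s1, s2));
    }
}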


Chapter 3

The LiQED System

The DIPRE algorithm itself is really just generalized pseudo-code that prescribes an approach for finding new instances of a relationship type from some relationship samples. This chapter and the next go into the details of how this algorithm is fleshed out into a system capable of answering TREC list questions.

Written in Java, the complete system is called LiQED, short for List QED, because it only answers list questions. The system draws its initial set of candidate answers from Edinburgh's QED Question Answering system.

As mentioned in Chapter 2, the system only uses <QUESTION>-<ANSWER> relation pairs to extract answers for a list question. <QUESTION> is the question target associated with every question and <ANSWER> is a candidate answer generated by QED.

Using purely shallow methods, LiQED first searches for sentences that contain both the question target and a candidate answer. From these sentences, LiQED generates text patterns which are then used to search for more sentences. Next, an answer extraction step identifies new candidate answers and finally the answers are reranked to give the final set of answers.


3.1 External Libraries

Before going into the details of LiQED, here is a brief overview of the external libraries used in the system.

3.1.1 Apache Lucene Search Engine

Apache Lucene is a high-performance, full-featured text search engine library written entirely in Java. The library is released by the Apache Software Foundation under the Apache Software License. Lucene, being a search engine API, allows the creation of a vector-space model index which can then be searched with a large variety of query tools, including boolean, phrase, wildcard and fuzzy match queries. Of particular use is Lucene's capability for hierarchical "span phrase" queries, which are used to search for sentences with long range dependencies.

3.1.2 OpenNLP Toolkit

The OpenNLP Toolkit and OpenNLP MaxEnt projects by Tom Morton provide a set of open source, Java-based Natural Language Processing tools including sentence detection, tokenization, part-of-speech tagging, chunking, parsing and named-entity detection. These tools have been trained using the Maximum Entropy model provided by OpenNLP MaxEnt.

While the toolkit itself is still very much a work in progress, most of the tools, such as sentence detection, tokenization, POS tagging and chunking, are stable and sufficiently accurate for the purpose of this thesis. However, the toolkit's named entity tagger requires an inordinate amount of memory and is very slow. Thus, for LiQED, the toolkit is only used for tokenization, POS tagging and chunking.

3.1.3 LingPipe Natural Language Toolkit

LingPipe is a commercial Natural Language Processing library that can be licensed at no cost under a non-commercial use license.

Like the OpenNLP toolkit, it provides sentence detection, tokenization, part-of-speech tagging, chunking, parsing and named-entity detection.

LingPipe is used mainly for its named entity tagger, as OpenNLP's tagger is highly inefficient.

3.1.4 JAligner

JAligner is an open source Java implementation of the Smith and Waterman (1981) algorithm, with the improvements Gotoh (1982) made to the algorithm. The algorithm was originally designed as a fast tool for protein and nucleic acid sequence comparison. However, it works equally well for linguistic purposes, as it is based on dynamic programming string comparison. The algorithm is used to generate the text patterns that will be used to identify answers.

3.1.5 Google Web API

The Google Web API is a SOAP based web service API provided by Google. Using this API enables a Java program to issue queries and get results from the Google search engine. The Google API is used in two different areas of the system: reranking the final set of answers and identifying hypernyms.

3.1.6 Amazon Web Services API

The Amazon Web Services API is a SOAP based web service API provided by Amazon.com. This API allows a program to query the Amazon.com database of books and music. The Amazon API is used mainly as an external information source to verify answers for publication (books, songs and movies) questions.

3.2 DIPRE Implementation

There are two distinct phases in the DIPRE algorithm. The first phase involves automatically generating patterns from a set of initial relation pairs.

In the context of question answering, a relation pair is the question topic and a candidate answer.

The next phase takes the generated patterns and matches them against sentences in the corpus to find new potential answers. A final answer extraction step then picks out what the system believes are answers to the question.

3.2.1 Preprocessing

Both the Pattern Generation and the Answer Extraction phase require searching over sentences. To speed up sentence searching, a sentence-level index of the Aquaint corpus is created using the Lucene search engine library. In addition, every word in every sentence is tagged with part-of-speech and chunk information. This tag information is stored together with the original sentence in the Lucene index and is retrieved together with its associated sentence.

3.2.1.1 The Aquaint Index

The following steps were applied to all documents in the Aquaint corpus to construct the Lucene search index.

1. Break each news article into constituent sentences using LingPipe's sentence detector.

2. For each sentence,

   (a) Tokenize the sentence with the OpenNLP Tokenizer.

   (b) Stem each word in the sentence with Lucene's Porter Stemmer.

   (c) Label each stemmed word with a part-of-speech tag using OpenNLP's POS Tagger.

   (d) Label each stemmed word with a chunk tag using OpenNLP's Chunk Tagger.

   (e) Store in Lucene the DOCID of the source article, the line number, the original sentence, the tokenized, stemmed and lowercased version of the sentence, the POS tags and the chunk tags.
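A minimal sketch of this indexing step is shown below. It targets a recent Lucene release, whose field classes differ from the 2005-era Lucene used for LiQED, and the stemming and tagging calls are hidden behind hypothetical placeholder methods rather than the actual OpenNLP, LingPipe or Porter-stemmer APIs.

import java.nio.file.Paths;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.*;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;

// Minimal sketch of building the sentence-level index with a recent Lucene API.
// stemAndLowercase(), posTags() and chunkTags() stand in for the Porter-stemming,
// OpenNLP and chunk-tagging calls; they are hypothetical helpers, not real library methods.
public class SentenceIndexer {
    public static void indexSentence(IndexWriter writer, String docId, int lineNo,
                                     String original) throws Exception {
        Document doc = new Document();
        doc.add(new StringField("docid", docId, Field.Store.YES));
        doc.add(new StoredField("line", lineNo));
        doc.add(new StoredField("original", original));
        // Searchable field: the tokenized, stemmed, lowercased sentence text.
        doc.add(new TextField("text", stemAndLowercase(original), Field.Store.YES));
        // Tag layers stored alongside the sentence so they can be retrieved with it.
        doc.add(new StoredField("pos", posTags(original)));
        doc.add(new StoredField("chunks", chunkTags(original)));
        writer.addDocument(doc);
    }

    public static void main(String[] args) throws Exception {
        try (IndexWriter writer = new IndexWriter(FSDirectory.open(Paths.get("aquaint-index")),
                new IndexWriterConfig(new StandardAnalyzer()))) {
            indexSentence(writer, "APW19980601.0001", 1,
                    "Bing Crosby and Fred Astaire star in \"Holiday Inn\".");
        }
    }

    // Placeholders for the preprocessing steps described above.
    static String stemAndLowercase(String s) { return s.toLowerCase(); }
    static String posTags(String s) { return ""; }
    static String chunkTags(String s) { return ""; }
}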


All punctuation and stop words have been preserved in the sentences, as they prove to be very important during the pattern generation phase. This will be discussed in detail in the next section.

The end result is a sentence-level index of all news articles in the Aquaint corpus. The index takes up 13 GB of disk space. Table 3.1 shows the index size for each news source and year.

Agency   Year   Date                Sub-Index Size
APW      1998   Jun 01 to Dec 31    1,000 MB
APW      1999   Jan 01 to Nov 01    1,007 MB
APW      2000   Jan 01 to Sep 30      711 MB
NYT      1998   Jun 01 to Dec 31    1,800 MB
NYT      1999   Jan 01 to Dec 31    3,000 MB
NYT      2000   Jan 01 to Sep 29    2,000 MB
XIE      1996   Jun 01 to Dec 31      469 MB
XIE      1997   Jan 01 to Dec 31      496 MB
XIE      1998   Jan 01 to Dec 31      537 MB
XIE      1999   Jan 01 to Dec 31      545 MB
XIE      2000   Jan 01 to Sep 30      425 MB

Total disk space: 13.246 GB

Table 3.1: Size of the indexed Aquaint corpus

3.2.2 Pattern Generation

Given a question target and a set of candidate answers, the pattern generation phase seeks to identify a chain of words, a text pattern, that will identify potential answers. The end result of this phase is a pattern file that contains a list of text patterns that can be used to identify more answers. Appendix C exhibits one such sample pattern file, for the question "What movies was he (Bing Crosby) in?", which will be used for illustration purposes in this chapter.

Page 24: Improving Answer Precision and Recall of List Questions Kian Wei Kor

Chapter 3. The LiQED System 18

3.2.2.1 Identifying Relevant Words

The first step in creating text patterns is to identify sentences that contain the question target and one of the candidate answers. In our example question, "What movies was Bing Crosby in?", the question target is Bing Crosby, and QED identified six candidate answers, which are listed in Table 3.2.

Of these candidate answers, only High Society is a movie starring Bing Crosby. Road refers to a series of movies by Bing Crosby and White Christmas is a song, also by Bing Crosby. Southern Man, Mr. Tambourine Man and Looking Forward are songs or albums by David Crosby, who is not in any way related to Bing Crosby.

A series of queries on the Aquaint corpus picks up around 30 sentences that contain both the question target and one candidate answer. Table 3.3 shows some of these sentences with the question target and QED's candidate answers highlighted in bold.

Given sentences like the above, the first step in constructing text patterns is to identify terms that are, in some sense, relevant to the question and its answers. Relevant terms here include not just words, but also stop words and even punctuation marks and symbols. The reason for including punctuation in list questions is that Yang et al. (2003) and others note that multiple answers appearing in the same sentence are often delimited by punctuation marks. As can be seen later, it is important to capture these punctuation marks in the generated text patterns.

Every term found in the matching sentences is tested for relevance. Relevance is defined in terms of distance from the question target and the candidate answer. For each term in a sentence, its relevance is defined by:

Relevance(term) = w_q (1 - Distance(term, T_q) / Maxspan(T_q)) + w_a (1 - Distance(term, T_a) / Maxspan(T_a))

where T_q is the set of words in the question target, T_a is the set of answer words that occur in the sentence, and w_q and w_a are weights such that w_q + w_a = 1. Distance(term, T) is the shortest distance between the term and any word in the set T. Maxspan(T), which serves as a normalization function, is the longest distance between any words in the set T, or the sentence boundary.

As an example, the formula is applied to every term in the sentence: 60 years ago : Bob Hope and Bing Crosby starred in " Road to Singapore " .


White Christmas    Mr. Tambourine Man    Looking Forward
High Society       Road                  Southern Man

Table 3.2: QED's candidate answers for "What movies was Bing Crosby in?"

• The record was previously held by Bing Crosby's " White Christmas . "

• It has sold an estimated 15 million copies and is the second best-selling single in history , runner-up to Bing Crosby's " White Christmas . "

• Bing Crosby and Fred Astaire star in " Holiday Inn , " the 1942 musical featuring the classic song " White Christmas , " for $ 9.98 .

• " High Society " with Bing Crosby and Grace Kelly , " Anchors Aweigh " with Gene Kelly and Kathryn Grayson and " On the Town " with Gene Kelly and Betty Garrett .

• " High Society , " TCM Saturday at 8 : Bing Crosby , Grace Kelly and Frank Sinatra star in this 1956 remake of " The Philadelphia Story . "

• 60 years ago : Bob Hope and Bing Crosby starred in " Road to Singapore . "

• are having more fun on the road than Bob Hope and Bing Crosby in " The Road to Morocco . "

• " The Enchanted Cottage , " 1944 starring Robert Young ; " Road to Utopia " 1945 starring Bob Hope and Bing Crosby ; and Alfred Hitchcock 's " The Man Who Knew Too Much " 1956 .

Table 3.3: Sentences containing both the question target and a candidate answer for the question "What movies was Bing Crosby in?"


Let w_q = 0.5, w_a = 0.5, T_q = {Bing, Crosby}, T_a = {Road}, Maxspan(T_q) = 8, Maxspan(T_a) = 12.

Relevance(60)        = 0.5 (1 - 7/8) + 0.5 (1 - 12/12) = 0.0625
Relevance(years)     = 0.5 (1 - 6/8) + 0.5 (1 - 11/12) = 0.1667
Relevance(ago)       = 0.5 (1 - 5/8) + 0.5 (1 - 10/12) = 0.2708
Relevance(:)         = 0.5 (1 - 4/8) + 0.5 (1 - 9/12)  = 0.3750
Relevance(Bob)       = 0.5 (1 - 3/8) + 0.5 (1 - 8/12)  = 0.4792
Relevance(Hope)      = 0.5 (1 - 2/8) + 0.5 (1 - 7/12)  = 0.5833
Relevance(and)       = 0.5 (1 - 1/8) + 0.5 (1 - 6/12)  = 0.6875
Relevance(Bing)        (question-target term)
Relevance(Crosby)      (question-target term)
Relevance(starred)   = 0.5 (1 - 1/8) + 0.5 (1 - 3/12)  = 0.8125
Relevance(in)        = 0.5 (1 - 2/8) + 0.5 (1 - 2/12)  = 0.7917
Relevance(")         = 0.5 (1 - 3/8) + 0.5 (1 - 1/12)  = 0.7708
Relevance(Road)        (answer term)
Relevance(to)        = 0.5 (1 - 5/8) + 0.5 (1 - 1/12)  = 0.6458
Relevance(Singapore) = 0.5 (1 - 6/8) + 0.5 (1 - 2/12)  = 0.5417
Relevance(")         = 0.5 (1 - 7/8) + 0.5 (1 - 3/12)  = 0.4375

In the example sentence, the terms starred, in and " have the highest relevance because they appear between the question target and the candidate answer. By averaging the relevance scores of all occurrences of the same term, we can then rank all terms according to their relevance to both T_q and T_a. Using a threshold as a filter, we can select the set of most relevant terms, R. Table 3.4 lists the set R for the example question.

to     the    and    's
with   bob    hope   in
.      ``     ''     ,

Table 3.4: Set of terms relevant to the question "What movies was Bing Crosby in?"
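A minimal sketch of this relevance computation is given below, assuming whitespace-delimited tokens and w_q = w_a = 0.5. The maxSpan() helper implements one reading of the normalization term (the farthest any sentence position lies from the set); it reproduces the values in the worked example above but is illustrative rather than LiQED's actual code.

import java.util.*;

// Sketch of the term-relevance score from Section 3.2.2.1: closeness to the
// question-target terms (Tq) and the answer terms (Ta), each normalized by the
// maximum span and weighted by wq and wa. Whitespace tokens are assumed, and
// maxSpan() is one reading of the normalization; this is illustrative only.
public class TermRelevance {

    public static double relevance(String[] tokens, int pos, Set<String> tq, Set<String> ta,
                                   double wq, double wa) {
        return wq * (1.0 - distance(tokens, pos, tq) / maxSpan(tokens, tq))
             + wa * (1.0 - distance(tokens, pos, ta) / maxSpan(tokens, ta));
    }

    // Shortest token distance from position pos to any word of the set.
    static double distance(String[] tokens, int pos, Set<String> set) {
        int best = tokens.length;
        for (int i = 0; i < tokens.length; i++)
            if (set.contains(tokens[i])) best = Math.min(best, Math.abs(pos - i));
        return best;
    }

    // Normalizer: the largest shortest-distance of any sentence position to the set.
    static double maxSpan(String[] tokens, Set<String> set) {
        double max = 1.0;
        for (int i = 0; i < tokens.length; i++) max = Math.max(max, distance(tokens, i, set));
        return max;
    }

    public static void main(String[] args) {
        String[] s = "60 years ago : Bob Hope and Bing Crosby starred in `` Road to Singapore '' .".split(" ");
        Set<String> tq = new HashSet<>(Arrays.asList("Bing", "Crosby"));
        Set<String> ta = new HashSet<>(Collections.singletonList("Road"));
        // Relevance of "starred" (position 9); reproduces 0.8125 from the worked example.
        System.out.println(relevance(s, 9, tq, ta, 0.5, 0.5));
    }
}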

It was found that the most relevant terms generally fall into one of three classes:

Contextual Terms add context information that improves the precision of locating relevant sentences. For example, Bob and Hope from the example question fall into this category because the actor Bob Hope regularly co-starred with Bing Crosby. Thus Bob Hope can help in disambiguating random movies from movies starring Bing Crosby.

Chunk Markers identify a position in a sentence where an answer can be found. Chunk markers are typically punctuation marks, determiners and prepositions. These chunk markers are especially helpful when the answer and the question target are separated by a span of non-relevant words due to long distance dependency.

Sequence Markers are really a subset of chunk markers because they also identify locations within a sentence that contain answers. These are conjunction terms such as "and" and commas. The presence of these terms is a good indicator that multiple answers can often be found within a single sentence.

3.2.2.2 Identifying Patterns

Having identified a set of relevant terms, the next step is to use these relevant terms to construct surface text patterns. The sentences need to be compared using a diff-like algorithm that looks for similar terms within a pair of sentences. This similarity search is performed using the Smith-Waterman-Gotoh algorithm described in Chapter 2. Before sentences can be compared, they are first cleaned and simplified to ensure that the algorithm can reliably identify good text patterns.

To ensure that the algorithm only focuses on terms near the question target and candidate answer, all terms falling outside a three-chunk window are dropped. The two BIO chunk-tagged sentences in Table 3.5 illustrate this process. In the two examples, our question target is Bing Crosby and the candidate answer is White Christmas. Terms falling within the three-chunk window, highlighted in boldface, are retained while the remaining terms are dropped.

The BNP record INP was BVP previously IVP held IVP by BPP

Bing BNP Crosby INP ’s BNP ‘‘ INP White INP Christmas INP

. O ’’ O

[ Bing BNP Crosby INP and INP Fred INP Astaire INP star INP

in BPP ‘‘ O ] Holiday BNP Inn INP , O ’’ O the BNP 1942 INP

musical INP featuring BVP [ the BNP classic INP song INP ‘‘ O

White BNP Christmas INP , O ’’ O ] for BPP $ BNP 9.98 INP . O

Table 3.5: Chunk-tagged sentences; retained terms are highlighted in boldface.

If there are long distance dependencies between the question target and candidate answer, there will be two sentence fragments. Square brackets, [ and ], will be used to delineate the two sentence fragments. If there is no long distance dependency, there will only be a single, longer sentence fragment.

The removal of noisy terms is performed by a term replacement function. The function determines whether each term in the remaining sequence of terms is a member of the set of question target terms, T_q, the set of candidate answer terms, T_a, the set of relevant terms, R, or does not fall within any set. The function then replaces the term with a representative label depending on the term's membership. Table 3.3 shows the original sentences and Table 3.6 shows the cleaned and simplified sentence fragments.

Replace(t) = <QUESTION>   if t ∈ T_q
             <ANSWER>     if t ∈ T_a
             t            if t ∈ R
             *            otherwise
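Rendered directly in Java, the replacement function might look like the following sketch; the set names mirror the definition above and the code is illustrative only.

import java.util.*;

// Sketch of the Replace(t) function: map each term to <QUESTION>, <ANSWER>, itself,
// or the wildcard *, depending on which set it belongs to. Illustrative only.
public class TermReplacer {
    public static String replace(String t, Set<String> tq, Set<String> ta, Set<String> relevant) {
        if (tq.contains(t)) return "<QUESTION>";
        if (ta.contains(t)) return "<ANSWER>";
        if (relevant.contains(t)) return t;
        return "*";
    }

    // Applies the replacement to a whole (windowed) sentence fragment.
    public static List<String> simplify(List<String> fragment, Set<String> tq, Set<String> ta,
                                        Set<String> relevant) {
        List<String> out = new ArrayList<>();
        for (String t : fragment) out.add(replace(t, tq, ta, relevant));
        return out;
    }
}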

A pairwise comparison between these sentence fragments is then performed using the Smith-Waterman-Gotoh algorithm to create our surface text patterns. Not all generated text patterns are usable, as some patterns do not contain a <QUESTION> or <ANSWER> tag. Other patterns contain both tags but consist only of wildcard, *, terms and stop words, which would pick up too many false positives. Both types of unusable patterns are removed, leaving only text patterns that contain both a <QUESTION> and an <ANSWER> tag and do not consist solely of wildcards and stop words.


* * <QUESTION> ’s ‘‘ <ANSWER> . ’’

* to <QUESTION> ’s ‘‘ <ANSWER> . ’’

[ <QUESTION> and * * * in ‘‘ ] [ ‘‘ <ANSWER> , ’’ ]

‘‘ <ANSWER> ’’ with <QUESTION> and * *

[ ‘‘ <ANSWER> , ’’ * * ] [ * * <QUESTION> , * * ]

bob hope and <QUESTION> * in ’’ <ANSWER> to *

bob hope and <QUESTION> in ‘‘ The <ANSWER> to *

[ * ‘‘ <ANSWER> to * ] [ * * bob hope and <QUESTION> * and ]

Table 3.6: Cleaned and simplified sentence fragments.

Table 3.7 shows some of the final surface text patterns that can be used to identify new and distinct answers. For a full list of the surface text patterns generated for this example question, please refer to Appendix C.

<QUESTION> ’s ‘‘ <ANSWER> ’’

[ ‘‘ <ANSWER> to ] [ bob hope and <QUESTION> ]

bob hope and <QUESTION> * ‘‘ <ANSWER>

bob hope and <QUESTION> * <ANSWER>

the ‘‘ <ANSWER> * ’’ <QUESTION>

Table 3.7: Answer-finding text patterns for ”What movies was Bing Crosby in?”

3.2.3 Extracting Answers

Now that text patterns have been generated for each question, it is a simple matter of searching the Aquaint corpus for sentences that match these text patterns. For every matching sentence found, the text pattern also identifies a window of between one and three chunks that potentially contains an answer. This level of granularity is not sufficient for the purpose of question answering, as the exact answer has not been provided. An additional step is required to extract the exact answer from these answer chunks.


3.2.3.1 Named Entity Based Answer Extraction

The QED system provides a fine-grained expected answer type, which is the system's prediction of the type of answer required for a question. A series of simple regular expressions is used to map QED's expected answer type to a named entity type. The LingPipe named entity tagger is then used to sift through the answer chunks, looking for labeled named entities of the correct type. These named entities are the final answer candidates generated by LiQED. Table 3.8 shows the mapping from QED's expected answer types to LingPipe's named entity types.
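The mapping can be sketched as a handful of regular expressions over the expected answer type string. The patterns below paraphrase Table 3.8 and are illustrative, not the exact expressions used in LiQED.

import java.util.regex.Pattern;

// Sketch of mapping QED's fine-grained expected answer type onto the coarse
// LingPipe named-entity types. The regular expressions paraphrase Table 3.8;
// they are illustrative, not the exact expressions used in LiQED.
public class AnswerTypeMapper {
    private static final Pattern PERSON = Pattern.compile(
            "person|people|citizen|man|men|woman|women|mortal|adult|child|male|human|name");
    private static final Pattern LOCATION = Pattern.compile(
            "location|city|metropolis|town|village|county|province|state|country|nation");
    private static final Pattern ORGANIZATION = Pattern.compile(
            "organi[sz]ation|company|business|institution");

    // Returns the named-entity type expected for a QED answer type, or null if unmapped.
    public static String namedEntityType(String qedAnswerType) {
        String t = qedAnswerType.toLowerCase();
        if (PERSON.matcher(t).find()) return "PERSON";
        if (LOCATION.matcher(t).find()) return "LOCATION";
        if (ORGANIZATION.matcher(t).find()) return "ORGANIZATION";
        return null; // answer types LiQED cannot handle with the NE tagger alone
    }

    public static void main(String[] args) {
        System.out.println(namedEntityType("general:leader"));   // null
        System.out.println(namedEntityType("location:country")); // LOCATION
    }
}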

There are limitations to using a coarse-grained named entity tagger such as LingPipe to identify answer strings. Most significantly, LiQED is restricted to answering only questions that expect the answer to be a person, location or organization.

This limitation is caused by the choice of a simple answer extraction mechanism, and is not a limitation of the approach itself. Ideally, a fine-grained named entity tagger that is able to identify a larger variety of named entity types would better suit the task. However, a suitable fine-grained named entity tagger could not be found and integrated in time for the TREC evaluation.

Chapter 4 will detail further enhancements that expand LiQED beyond answering only person, location and organization type questions.


QED Expected Answer Type      Named Entity Type

person PERSON

people PERSON

citizen PERSON

man PERSON

men PERSON

woman PERSON

women PERSON

mortal PERSON

adult PERSON

child PERSON

male PERSON

human PERSON

name PERSON

location LOCATION

city LOCATION

metropolis LOCATION

town LOCATION

village LOCATION

county LOCATION

province LOCATION

state LOCATION

country LOCATION

nation LOCATION

organization ORGANIZATION

organisation ORGANIZATION

company ORGANIZATION

business ORGANIZATION

institution ORGANIZATION

Table 3.8: Mapping QED’s expected answer type to LingPipe’s named entity type


Chapter 4

LiQED in TREC

This chapter describes some further modifications and optimizations made to LiQED just two weeks prior to the release of the 2005 TREC Question Answering Track main task test set. The focus of this phase of work is to enhance and optimize the basic system.

4.1 Improved Answer Extraction

As alluded to in Chapter 3, the simple named entity answer extraction mechanism fails if the answer type of a question is not a person, location or organization. Most of the time had been spent developing and fine-tuning LiQED's implementation of the DIPRE algorithm, the core of this thesis, leaving less than two weeks to address this limitation in answer extraction. Thus, quick and easy to implement solutions were required.

The key issue in extracting answers is that LiQED by itself is only able to identify a sentence fragment that may contain an answer. The system does not have any information to assist it in identifying an answer within those sentence fragments. LiQED has to rely on external sources to extract correct answers.

For person, location and organization answer types, a named entity tagger is used. The assumption here is that if LiQED identifies that a sentence fragment contains an answer and the named entity tagger identifies an answer of the correct type within the sentence fragment, then there is a high chance that the answer identified by both LiQED and the tagger is a correct answer. This is essentially a simple form of ensemble learning or boosting (see Meir and Ratsch (2003)).

This simple ensemble learning technique can be applied to other information sources.

4.1.1 Hypernyms for Answer Extraction

Table 4.1 lists the expected answer types of all list questions in the TREC 2005 test set. These answer types are essentially class labels or hypernyms, words whose meaning denotes a superordinate or superclass. Conversely, hyponyms are words that denote membership of a class. For example, animal is a hypernym of dog and dog is a hyponym of animal.

date:date general:award general:character general:child

general:company general:competitor general:contestant general:course

general:eyewitness general:festival general:graduate general:group

general:holding general:horse general:individual general:leader

general:legionnaire general:manufacturer general:medal general:member

general:nationality general:occupation general:off general:official

general:opponent general:organization general:people general:person

general:personnel general:player general:position general:product

general:program general:puppet general:ship general:show

general:species general:student general:submarine general:team

general:theme general:thing general:variety general:victim

general:work location:country location:location name:name

publication:book publication:movie publication:song publication:title

Table 4.1: QED Expected Answer Types

Hearst (1992) defines a simple surface text pattern that is able to reliably identify such hypernym-hyponym class relationships. The surface text pattern is "X such as Y", where X is the class label and Y is a member of class X. This text pattern can be applied to a web search engine to identify members of a specific answer type.


Specifically for LiQED, this surface text pattern is expressed as a query to the Google search engine via the Google Web API. The first part of the query is a phrase search for "X such as", where X is the hypernym, i.e. the expected answer type. The Google query also includes the question target as context to ensure that potential answers are relevant to the list question. Hyponyms are then extracted from the snippets returned by Google using a series of simple rules. Table 4.2 shows the query and the first ten hyponyms extracted for question 136.7, "What Shiite leaders were killed in Pakistan?". While not all extracted hyponyms are correct (the 9th hyponym in Table 4.2 is incorrect), the output is sufficient for the task.

Question Target      Shiite
Answer Type          general:leader
Google Query         "Shiite" AND "leaders such as"
Extracted Hyponyms   Abu Mazen
                     Ahmad Chalabi
                     Ahmad Shah Masud
                     Asi
                     Ayatollah Khomeini
                     Ayatollah Sistani
                     Ayman al-Zawahiri
                     Baburam
                     Blair who
                     Burhanuddin Rabbani

Table 4.2: Shiite leaders found by Google

The extracted hyponyms are cached and compared against the sentence fragments identified by LiQED as containing an answer. If a hyponym is found within a LiQED sentence fragment, it is flagged as an answer candidate. This technique allows LiQED to expand beyond answering only questions that expect a person, location or organization answer type.
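The query construction and snippet filtering can be sketched as follows. The SnippetSource interface stands in for the web-search call (the original system used the Google SOAP API), and the single capitalised-word extraction rule is a simplified stand-in for the series of rules actually used.

import java.util.*;
import java.util.regex.*;

// Sketch of hypernym-driven answer extraction with a Hearst pattern ("X such as Y").
// SnippetSource abstracts the web-search call; the single regex rule below is a
// simplified stand-in for LiQED's extraction rules.
public class HyponymExtractor {
    public interface SnippetSource {
        List<String> snippetsFor(String query);
    }

    // Capitalized word sequence immediately following "<class label>s such as".
    private static Pattern cuePattern(String classLabel) {
        return Pattern.compile(Pattern.quote(classLabel) + "s? such as ((?:[A-Z][\\w.-]+ ?){1,4})");
    }

    public static List<String> extract(String target, String classLabel, SnippetSource search) {
        String query = "\"" + target + "\" AND \"" + classLabel + "s such as\"";
        Pattern cue = cuePattern(classLabel);
        List<String> hyponyms = new ArrayList<>();
        for (String snippet : search.snippetsFor(query)) {
            Matcher m = cue.matcher(snippet);
            while (m.find()) hyponyms.add(m.group(1).trim());
        }
        return hyponyms;
    }

    public static void main(String[] args) {
        SnippetSource fake = q -> Collections.singletonList(
                "Shiite leaders such as Ayatollah Sistani have called for calm.");
        System.out.println(extract("Shiite", "leader", fake)); // [Ayatollah Sistani]
    }
}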


4.1.2 Publication Answers

Amazon.com provides a web service API (Application Program Interface) that allows programs to query Amazon's database of books and music albums. For list questions that expect a publication answer type, a query on the question target is made to Amazon via the web service API. The results of the query are concatenated to the list of hyponyms already found in the previous section. Again, any hyponyms identified in the sentence fragments are flagged as LiQED's answer candidates.

Out of the 93 list questions in the TREC 2005 test set, there were 6 publication questions, so the Amazon web service API proved to be useful. The publication list questions are:

Q76.7:  What movies was Bing Crosby in?
Q97.5:  List the Counting Crows' record titles.
Q108.6: Name movies released by Sony Pictures Entertainment (SPE).
Q113.5: Name some of Paul Newman's movies.
Q114.4: Name movies/TV shows Jesse Ventura appeared in.
Q121.3: What books did Rachel Carson write?

Table 4.3: Publication questions in the TREC 2005 test set

4.2 Google Reranking

A set of answers generated by a list question answering system such as LiQED can be treated as a ranked list of answers. For example, a threshold can be applied to answers ranked by confidence, which ensures that precision can be improved while still retaining good recall.

Since LiQED uses surface text pattern matching, a sentence fragment is deemed to contain an answer if it matches any of the automatically generated text patterns. There is no notion of confidence with pattern matching. However, by counting the number of times the same answer is picked up by the automatically generated text patterns, a rough level of confidence can be assigned.


To improve on this confidence, Google is used to rerank the answers by the degree of correlation between the answer and the question target. The idea is to improve the ranking of answers that are more related to the question target and decrease the ranking of answers that do not seem to be related. A simple correlation formula is used:

Correlation(x, y) = 0.5 ( Count(x ∧ y) / Count(x) + Count(x ∧ y) / Count(y) )

where x is the question target and y is an answer. Count(x) is the number of relevant documents Google returns when the query is x. The computed correlation score is then equally weighted with a simple linear rank score of the answer:

Confidence(x) = 0.5 ( Correlation(target, x) + (NumAnswers - Rank(x)) / NumAnswers )

where NumAnswers is the number of distinct answers LiQED found and Rank(x) is LiQED's original ranking for answer x. The answers are reranked in descending order of this confidence score. Through experimentation, it was found that answers with a confidence score of 0.5 or less were usually wrong, and these are removed. The remaining answers constitute the final reranked list of answers.
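A sketch of the reranking computation is given below. The HitCounter interface abstracts the hit-count lookups (Google result counts in the original system), ranks are taken as zero-based, and the Math.max(1, ...) guards are merely a safeguard against empty result counts; none of this is LiQED's actual code.

import java.util.*;

// Sketch of the Google-based reranking: a co-occurrence correlation between the
// question target and each answer, averaged with a linear rank score, then a 0.5
// cut-off. HitCounter abstracts the web hit-count lookups.
public class AnswerReranker {
    public interface HitCounter {
        long count(String query); // number of documents returned for the query
    }

    static double correlation(String target, String answer, HitCounter hits) {
        double both = hits.count(target + " " + answer);
        double cx = Math.max(1, hits.count(target));
        double cy = Math.max(1, hits.count(answer));
        return 0.5 * (both / cx + both / cy);
    }

    // Returns the answers whose confidence exceeds 0.5, in descending confidence order.
    public static List<String> rerank(String target, List<String> rankedAnswers, HitCounter hits) {
        int n = rankedAnswers.size();
        Map<String, Double> confidence = new HashMap<>();
        for (int rank = 0; rank < n; rank++) {
            String answer = rankedAnswers.get(rank);
            double conf = 0.5 * (correlation(target, answer, hits) + (n - rank) / (double) n);
            confidence.put(answer, conf);
        }
        List<String> kept = new ArrayList<>();
        for (String a : rankedAnswers) if (confidence.get(a) > 0.5) kept.add(a);
        kept.sort((a, b) -> Double.compare(confidence.get(b), confidence.get(a)));
        return kept;
    }
}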

4.3 Answer Merging

The answers generated by QED tend to be different from the answers generated by LiQED. This is because the two systems apply very different techniques in generating their answer sets. To get the best recall performance, the best answers from both systems need to be merged.

Simply combining all answers from both systems would result in a large number of answers, potentially adversely affecting the precision score. Thus there is a need to strike a balance between precision and recall. After several rounds of trial-and-error testing, the following procedure was used to create the final answer set for LiQED.

1. Select the top 10 answers from QED.

2. Add the top 10 answers from LiQED that are not identical to any of the top 10 QED answers.
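A minimal sketch of this merging procedure is shown below. It assumes, as a simplification, that "identical" means exact string equality; the real system may normalize answers before comparison.

import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Sketch: take QED's top 10 answers, then append LiQED's top 10 answers that
 *  are not already present. Not the actual implementation. */
public class AnswerMerger {

    public static List<String> merge(List<String> qedAnswers, List<String> liqedAnswers) {
        Set<String> merged = new LinkedHashSet<>();

        // Step 1: the top 10 answers from QED.
        for (String a : qedAnswers.subList(0, Math.min(10, qedAnswers.size()))) {
            merged.add(a);
        }

        // Step 2: the top 10 LiQED answers not identical to any QED answer already kept.
        for (String a : liqedAnswers.subList(0, Math.min(10, liqedAnswers.size()))) {
            merged.add(a);   // the LinkedHashSet silently drops exact duplicates
        }

        return new ArrayList<>(merged);
    }
}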


This procedure creates the final set of LiQED answers, which was submitted for evaluation as run 2 in TREC 2005. The set of answers generated by QED itself was submitted as run 1. This setup allows easy comparison between the two systems. The next chapter analyzes the performance of LiQED and compares it with QED.


Chapter 5

Evaluation and Analysis

Chapters 3 and 4 detailed the construction of the LiQED system. This chapter exam-

ines and analyzes the performance of LiQED on the TREC 2005 question answering

test set.

The official TREC evaluations allow a maximum of three distinct runs, and the answers from each run are evaluated separately. For list questions, we submitted the top 12 answers from QED as the first run. The second run gave the top 10 answers from QED and LiQED, merged as described in Chapter 4. The final run included the top 7 answers from QED and LiQED together with the top 7 answers from TOQA, a topic-based question answering system by Kisuh Ahn.

This chapter focuses mainly on comparing the QED answers in run 1 against run 2, which combines the top answers from QED with additional answers from LiQED. The third run, with TOQA, will not be discussed here, as the addition of a third system introduces too many new variables and makes it hard to analyze the behavior of the three combined systems.

5.1 Training Set - TREC 2004

The 56 list questions from TREC 2004 were used as a training set to tune LiQED. Table 5.1 shows the precision and recall scores of LiQED on the training set. The precision and recall formulas used here are those for instance precision and instance recall as defined in the TREC QA task; see Voorhees (2004):

A system's response to a list question was scored using instance precision (IP) and instance recall (IR) based on the list of known instances. Let S be the number of known instances, D be the number of correct, distinct responses returned by the system, and N be the total number of responses returned by the system. Then IP = D/N and IR = D/S. Precision and recall were then combined using the F measure with equal weight given to recall and precision, F = (2 × IP × IR) / (IP + IR).
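As a worked illustration of these formulas (the numbers here are invented purely for the example and do not come from the evaluation): if a question has $S = 10$ known instances and a system returns $N = 8$ responses of which $D = 4$ are correct and distinct, then

\[
IP = \frac{D}{N} = \frac{4}{8} = 0.5, \qquad
IR = \frac{D}{S} = \frac{4}{10} = 0.4, \qquad
F = \frac{2 \times 0.5 \times 0.4}{0.5 + 0.4} \approx 0.44.
\]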

The list of known instances for a question is the set of all correct answers found by

systems that have participated in the year’s QA task. Table 5.1 shows the performance

of both QED and LiQED on the 2004 question set. Since LiQED operates by taking

QED’s answers as examples to look for more answers of a similar type, the recall score

is indicative of the degree to which LiQED improves QED's performance. There is a 0.0390 improvement in recall and a 0.0081 decrease in precision. Overall, run 2 improves on run 1's F-score by 0.0083, which would have been sufficient to place the system in 4th position for list questions in last year's evaluation.

Run Precision Recall F-Score

QED 0.1927 0.2419 0.2145

QED + LiQED 0.1846 0.2809 0.2228

QED + LiQED + TOQA 0.1719 0.2761 0.2119

Table 5.1: QED and LiQED performance on TREC 2004 Training Set

5.2 Test Set - TREC 2005

The TREC 2005 question set was released on July 21, 2005, and participating systems were required to submit their answers a week later, on July 28, 2005 (an extension of one day was given due to a temporary hardware failure of the NIST webserver that was hosting the questions and the answer submission form). The question set has a total of 600 questions covering 75 different topics (question targets). Each topic asks between zero and two list questions. In total, there were 93 list questions in the test set.


The number of surface text patterns generated by LiQED can vary from none for

questions where QED provided no answers, to over a thousand patterns. An average

of 182.2 surface text patterns were generated for each question and in total, LiQED

generated 16,944 text patterns.

From these patterns, LiQED found 824 distinct answers or an average of 8.9 an-

swers per question. After reranking and filtering the answers using Google, the system

was left with 379 distinct answers or an average of 4.08 answers per question. These

remaining answers were then added to QED’s top 10 answers, resulting in 956 submit-

ted answers or 10.3 answers per question.

5.2.1 Evaluation

The official evaluation results for TREC 2005 will only be released in November 2005,

too late to be included in this report. Instead of the official evaluation score, these are

preliminary scores from a personal evaluation performed on the answers generated by

both QED and LiQED. At this point, I need to add a cautionary note to the reader. I

have been as unbiased as possible in my evaluation of my own system. However, one is

only human and thus the results shown below should be seen only as interim results and

will be superseded by the official TREC evaluation results. Also, the precision score

for list questions is dependent on the set of all correct answers given by all systems

that submitted answers to TREC for evaluation. Obviously, the set of correct answers

given by all systems will not be known until the official evaluation results are revealed.

Thus only recall scores will be evaluated.

Out of the 93 list questions in the test set, there was some ambiguity between question Q128.3, "What countries constitute the OPEC committee?", and question Q128.5, "What are OPEC countries?". The two questions ask for very similar answers, and the few people we polled were unable to identify any difference between an OPEC country and an OPEC committee country. We therefore presume that both questions are essentially the same question, phrased in different forms. For that reason one question, Q128.3, was dropped from evaluation, leaving 92 questions in the question set.

Table 5.2 shows only the recall scores of QED and LiQED, for the 92 list questions as well as for a subset of 64 questions that will be discussed in a later section. As with the training set, there is an improvement in recall, here of 0.0140. While this improvement is smaller than that achieved on the training set, it is not unexpected, for reasons discussed in the next sections.

Run Recall, 64 Questions Recall, 92 Questions

QED 0.1271 0.1445

QED + LiQED 0.1529 0.1585

QED + LiQED + TOQA 0.1507 0.1409

Table 5.2: QED and LiQED performance on TREC 2005 Test Set

5.2.2 Event Questions

Unlike the TREC 2004 questions, this year’s questions included questions on tempo-

ral events. The events mainly focused on current-day events like the 1998 Nagano Olympic Games and the Port Arthur Massacre. However, there were also questions on past events like the Hindenburg disaster. From the perspective of a Question Answering system, events can generally be classified into one of three types:

Named Event A significant, one-off event that is important enough to be given a name. Examples include the Hindenburg Disaster and the Port Arthur Massacre.

Unnamed Event A minor, one-off event not important enough to be named. Examples include the 1998 indictment and trial of Susan McDougal, the first 2000 Bush-Gore presidential debate, and a plane clipping cable wires in an Italian resort. The difficulty with an unnamed event lies in identifying articles and passages that are relevant to the event. Often, a named event may actually start as an unnamed event. For example, an event involving a plane crashing into the World Trade Center in 2001 was not called the "9-11 Attack" until it was established to be the work of terrorists.


Periodic Event An event that occurs periodically, perhaps quarterly or annually. These include holidays and major festivals such as Christmas, Halloween and the Edinburgh Fringe Festival. Sporting events like the Olympics, the Super Bowl and Wimbledon are also periodic events. Similar to unnamed events, periodic events can be difficult to disambiguate: after all, the 2000 Olympics in Sydney are different from the 1996 Olympics in Atlanta.

Event-based questions were particularly difficult for LiQED, for two reasons. Firstly, event questions are new, and until a week before the actual questions were revealed there were no example questions on which to train or tune LiQED. The more critical issue for LiQED is that some of the events do not have a proper name. LiQED relies heavily on a well-defined question target to identify relevant sentences. When the question target is an unnamed event like "France wins World Cup in soccer" or a periodic event like the Olympic Games, LiQED has difficulty identifying relevant sentences and thus fails to identify new answers. Table 5.3 lists the event targets in this year's question set. Unnamed events are marked with an asterisk (∗) and periodic events with a cross (+).

1980 Mount St. Helens eruption
1998 Nagano Olympic Games +
1998 Baseball World Series +
1998 indictment and trial of Susan McDougal ∗
1999 North American International Auto Show +
Boston Big Dig
Crash of EgyptAir Flight 990 ∗
first 2000 Bush-Gore presidential debate ∗
France wins World Cup in soccer ∗
Hindenburg disaster
Kip Kinkel school shooting ∗
Miss Universe 2000 crowned +
Plane clips cable wires in Italian resort ∗
Port Arthur Massacre
Preakness 1998 +
return of Hong Kong to Chinese sovereignty ∗
Russian submarine Kursk sinks ∗
Super Bowl XXXIV +

Table 5.3: Event-based question targets in TREC 2005

Out of the 93 list questions in the test set, 20 questions were based on events. Of

the 20 event questions, LiQED only found additional answers for one question. In

comparison, out of the 92 evaluated questions LiQED found additional answers for 14

questions.

5.2.3 Semantically Deep Questions

Unlike the TREC 2004 question set, the TREC 2005 question set contained more challenging questions that require a Question Answering system with some form of semantic reasoning or inference module. Table 5.4 lists some of these questions.

Q67.6 (Miss Universe 2000 crowned): Name other contestants (besides Miss Universe).
Q77.6 (George Foreman): Name opponents who Foreman defeated.
Q77.7 (George Foreman): Name opponents who defeated Foreman.
Q81.2 (Preakness 1998): List other horses who won the Kentucky Derby and Preakness but not the Belmont.
Q100.7 (Sammy Sosa): Name the pitchers off of which Sosa homered.
Q119.4 (Harley-Davidson): What other products (beside motorcycles) do they produce?
Q123.5 (Vicente Fox): What countries did Vicente Fox visit after election?
Q126.3 (Pope Pius XII): What official positions did he hold prior to becoming Pius XII?
Q133.3 (Hurricane Mitch): As of the time of Hurricane Mitch, what previous hurricanes had higher death totals?
Q137.3 (Kinmen Island): What other island groups are controlled by this government (Taiwan)?

Table 5.4: Questions that require semantic inferencing

As LiQED uses purely shallow methods to identify answers, it is unable to answer

any of these questions correctly. For these questions, LiQED either found no new an-

swers or found many incorrect answers. This is perfectly illustrated by Table 5.5, which lists the answers to the two questions Q77.6, "Name opponents who Foreman defeated", and Q77.7, "Name opponents who defeated Foreman". Ideally, these two sets of answers should be disjoint; in the case of LiQED, the two answer sets are identical.

Opponents who Foreman defeated: George Foreman, Joe Frazier, Ken Norton, Sonny, Archie Moore
Opponents who defeated Foreman: George Foreman, Joe Frazier, Ken Norton, Sonny, Archie Moore

Table 5.5: LiQED answers for the two questions on boxer George Foreman.

Combined, the event questions and semantically deep questions make up 28 of the 92 questions in the test set. In other words, LiQED is unable to answer, or has difficulty answering, nearly a third of the list questions. Discounting these questions leaves 64 questions in the question set. Table 5.2 shows the recall scores for this subset of 64 questions. As anticipated, LiQED's recall performance is better on this subset, improving QED's score by a respectable 0.0258.

Comparing the recall scores between the 64-question set and the 92-question set, it is clear that QED did not have problems with the 28 event or inference-type questions, since its recall score increased from 0.1271 to 0.1445. Conversely, LiQED's recall score increased by a mere 0.0056, indicating that it failed to identify answers for most of the event or inference-type questions.

5.2.4 Answer Extraction

The majority of the work on this thesis was focused on implementing the adapted DIPRE question answering algorithm. As a consequence, not as much effort was placed on answer extraction, and this has had some negative effect on the overall performance of LiQED. Typically, the automatically generated text patterns identify several hundred relevant sentences, and due to the amount of time required to manually examine every sentence, a strict evaluation of all questions is not possible. However, just examining the example question used in Chapter 3, "What movies was Bing Crosby in?", clearly shows that answer extraction can be improved. Table 5.6 lists the correct

answers identified by the text patterns compared with the actual set of extracted an-

swers that constitute the final answer set. In total, 14 correct answers were identified

by patterns but only 5 answers were extracted.

Answers found by patterns: Birth of the Blues, East Side of Heaven, Going My Way, High Society, Holiday Inn, Legend of Sleepy Hollow, Pennies From Heaven, Rhythm on the Range, Road to Morocco, Road to Singapore, Road to Utopia, Road to Zanzibar, Waikiki Wedding, White Christmas

Answers extracted: Going My Way, High Society, Holiday Inn, Road to Zanzibar, White Christmas

Table 5.6: Identified and extracted answers for the question, "What movies was Bing Crosby in?"


Chapter 6

Conclusion

6.1 Conclusion

This thesis set out to examine whether it is possible to extrapolate from an existing set of answers to identify more answers. Through the use of surface text patterns automatically generated from commonality found within the initial answer set, new answers were indeed found. The basic LiQED system was able to automatically capture question-specific relationships instead of the pre-determined, broad-coverage relationships used in prior work by Hovy and Ravichandran. Table 6.1 shows some of the question-specific relationships where LiQED is able to successfully extrapolate from the initial set of answers to identify new answers. These relations tend to be deeper and more specific than the six relationship types used by Hovy et al. (2002).

<ACTOR>-<MOVIE>
<BOXER>-<OPPONENT>
<MUSEUM>-<ARTWORK>
<PARENT>-<CHILD>
<GOLFER>-<OPPONENT>
<GOLFER>-<GOLF COURSE>
<BAND>-<ALBUM>
<SINGER>-<SONG>
<LOCATION>-<PERSON>
<ORGANIZATION>-<MEMBER>
<PROJECT>-<ORGANIZATION>

Table 6.1: Some relationships identified by LiQED

In general, the system requires at least two correct answers to be provided before it is able to identify new correct answers. This is expected, as a minimum of two correct answers is required to find commonality. So long as there are two or more correct answers, the generated patterns can and do pick up new answers. However, the current answer extraction implementation needs to be improved, as many correct answers were not extracted.

Besides using surface text patterns, other information can also be used as context

information. For example, part-of-speech can be included in the text pattern. Al-

ternatively, sentence structure can be used as context. In this case, new answers would be extracted from sentences that conform to commonly occurring branches of the parse trees of answers in the initial answer set.

Several secondary goals were also achieved. These include:

1. A novel, shallow approach that enables the generation of surface patterns that are able to match sentences containing long-distance dependencies. This is achieved by using word alignment, via the Smith-Waterman-Gotoh algorithm, instead of exact word matching for pattern generation. One caveat is that this technique does not ensure coordination between the two constituents in a sentence.

2. An observation on the importance of prepositions, determiners and punctuation symbols, especially conjunction symbols, in identifying answers to list questions. Typically these are treated as stop words and ignored. However, they have proven to be useful, as they tend to prefix answers.

In conclusion, the basic LiQED system has shown that the hypothesis is sound and can be applied to identify new answers. The remainder of this chapter covers possible areas of enhancement to this basic system.


6.2 Future Enhancements

While the basic concept has been shown to be feasible, there is still considerable room for improvement. The following are areas where further work would improve LiQED's performance.

6.2.1 Alternative Question Targets

One of the main issues in LiQED is its over-reliance on the question target. If LiQED is unable to find sentences containing the question target, it is unable to generate patterns and thus unable to extract new answers. The introduction of unnamed and periodic events in TREC 2005 further exacerbates this issue. One possible method to alleviate the problem is to identify alternative question targets that are synonymous with the original question target. For example, the 1998 Nagano Olympic Games is often referred to as the Nagano Olympics or the 1998 Olympics. Such alternative question targets would enable LiQED to find matching sentences more reliably.

6.2.2 Fine-Grained Named Entity Recognition

As detailed in Chapters 3 and 4, one of the issues with the LiQED system is that it is only able to identify a sentence fragment that contains an answer, and it requires an external information source to confirm the extracted answers. A fine-grained named entity tagger, able to identify more than just persons, locations and organizations, would help the system identify answers to more question types.

6.2.3 Anaphora Resolution

Currently, LiQED only constructs text patterns that search for sentences containing words from both the question target and a candidate answer. However, not all sentences that contain an answer will also contain words from the question target. It is quite likely that the question target is mentioned in one sentence while answers are found in succeeding sentences. In cases like these, it may be possible to apply anaphora resolution to extend pattern matching beyond a single sentence, by linking mentions in subsequent sentences back to the question target introduced earlier. This would be especially useful for questions on unnamed events.

6.2.4 Answer Verification

LiQED uses purely shallow methods to identify potential answers. Because of its reliance on context, the answers generated by LiQED tend to be members of the same class as the correct answers rather than necessarily correct answers themselves. This is evident from the two questions that ask for boxers whom George Foreman defeated and boxers who defeated George Foreman: LiQED gave exactly the same answers to both questions, as it could distinguish boxers from non-boxers but could not determine who the winner was. In other words, LiQED itself is unable to determine whether a potential answer is actually a correct answer. Ideally, some form of post-hoc inference module should be used to verify that an answer produced by LiQED is indeed correct. Such a module would also address the need to answer inference-based questions.

If implemented, the improvements discussed in this section should improve both the question coverage and the recall of LiQED.


Appendix A

TREC 2004 List Questions

1. Crips

1.3 Which cities have Crip gangs?

2. Fred Durst

2.3 What are titles of the group’s releases?

3. Hale Bopp comet

3.3 In what countries was the comet visible on its last return?

4. James Dean

4.4 What movies did he appear in?

5. AARP

5.5 What companies has AARP endorsed?

6. Rhodes scholars

6.3 Name famous people who have been Rhodes scholars.

6.4 What countries have Rhodes scholars come from?

7. agouti

7.3 In what countries are they found?

8. Black Panthers

8.4 Who have been members of the organization?


9. Insane Clown Posse

9.1 Who are the members of this group?

9.2 What albums have they made?

10. prions

10.3 What diseases are prions associated with?

10.4 What researchers have worked with prions?

11. the band Nirvana

11.2 Who are the band members?

11.5 What are their albums?

15. Rat Pack

15.1 Who are the members of the Rat Pack?

16. cataract

16.3 Who are doctors that have performed cataract surgery?

18. boxer Floyd Patterson

18.6 List the names of boxers he fought.

20. Concorde

20.2 What airlines have Concordes in their fleets?

21. Club Med

21.2 List the spots in the United States.

22. Franz Kafka

22.4 What books did he author?

24. architect Frank Gehry

24.4 What prizes or awards has he won?

24.5 What buildings has he designed?

25. Harlem Globe Trotters

25.4 What countries have they played in?


26. Ice-T

26.5 What are names of his albums?

30. minstrel Al Jolson

30.5 What songs did he sing?

31. Jean Harlow

31.7 What movies did she appear in?

31.8 What leading men did she star opposite of?

32. Wicca

32.4 What festivals does it have?

34. Amtrak

34.5 Name cities that have an Amtrak terminal.

36. Khmer Rouge

36.4 Who were leaders of the Khmer Rouge?

37. Wiggles

37.2 Who are the members’ names?

37.4 List the Wiggles’ songs.

38. quarks

38.4 What are the different types of quarks?

39. The Clash

39.3 Name their songs.

41. Teapot Dome scandal

41.4 Who were the major players involved in the scandal?

43. Nobel prize

43.2 What are the different categories of Nobel prizes?

45. International Finance Corporation (IFC)

45.3 What countries has the IFC financed projects in?


47. Bashar Assad

47.5 What schools did he attend?

48. Abu Nidal

48.4 In what countries has he operated from?

50. Cassini space probe

50.4 What planets will it pass?

51. Kurds

51.3 What other countries do Kurds live in?

52. Burger King

52.5 What countries is Burger King located in?

53. Conde Nast

53.4 What magazines does Conde Nast publish?

54. Eileen Marie Collins

54.6 What schools did she attend?

55. Walter Mosley

55.4 What books has he written?

56. Good Friday Agreement

56.3 What groups are affected by it?

56.4 Who were the key players in negotiating the agreement?

58. philanthropist Alberto Vilar

58.2 What organizations has he donated money to?

58.4 What companies has he invested in?

61. Muslim Brotherhood

61.4 What countries does it operate in?

61.5 Name members of the group.

62. Berkman Center for Internet and Society


62.4 Name members of the center.

63. boll weevil

63.3 What states have had problems with boll weevils?

64. Johnny Appleseed

64.5 In what states did he plant trees?

65. space shuttles

65.1 What are the names of the space shuttles?


Appendix B

TREC 2005 List Questions

66. Russian submarine Kursk sinks

66.5 Which countries expressed regret about the loss?

66.7 Which U.S. submarines were reportedly in the area?

67. Miss Universe 2000 crowned

67.6 Name other contestants.

68. Port Arthur Massacre

68.6 What were the names of the victims?

68.7 What were the nationalities of the victims?

69. France wins World Cup in soccer

69.7 Name players on the French team.

70. Plane clips cable wires in Italian resort

70.7 Who were on-ground witnesses to the accident?

71. F16

71.6 What countries besides U.S. fly F16s?

72. Bollywood

72.6 Who are some of the Bollywood stars?


73. Viagra

73.6 In what countries could Viagra be obtained on the black market?

74. DePauw University

74.6 Name graduates of the university.

75. Merck & Co.

75.5 Name companies that are business competitors.

75.7 Name products manufactured by Merck.

76. Bing Crosby

76.7 What movies was he in?

77. George Foreman

77.6 Name opponents who Foreman defeated.

77.7 Name opponents who defeated Foreman.

78. Akira Kurosawa

78.7 What were some of his Japanese film titles?

79. Kip Kinkel school shooting

79.3 List students who were shot by Kip Kinkel.

80. Crash of EgyptAir Flight 990

80.6 Identify the nationalities of passengers on Flight 990.

81. Preakness 1998

81.2 List other horses who won the Kentucky Derby and Preakness but not the Belmont.

82. Howdy Doody Show

82.3 Name the various puppets used in the ”Howdy Doody Show”.

82.4 Name the characters in the show.

83. Louvre Museum

83.4 Name the works of art that have been stolen from the Louvre.


84. meteorites

84.7 Provide a list of names or identifications given to meteorites.

85. Norwegian Cruise Lines (NCL)

85.1 Name the ships of the NCL.

85.6 Name so-called theme cruises promoted by NCL.

86. Sani Abacha

86.5 Name the children of Sani Abacha.

87. Enrico Fermi

87.4 List things named in honor of Enrico Fermi.

88. United Parcel Service (UPS)

88.4 In what foreign countries does the UPS operate?

89. Little League Baseball

89.3 What Little League teams have won the World Series?

90. Virginia wine

90.1 What grape varieties are Virginia wines made from?

90.5 Name the Virginia wine festivals.

91. Cliffs Notes

91.3 Give the titles of Cliffs Notes Condensed Classics.

92. Arnold Palmer

92.3 What players has Arnold competed against in the Skins Games?

92.4 Which golf courses were designed by Arnold?

93. first 2000 Bush-Gore presidential debate

93.7 Who helped the candidates prepare?

94. 1998 indictment and trial of Susan McDougal

94.4 Who testified for Mrs. McDougal’s defense?

95. return of Hong Kong to Chinese sovereignty


95.5 What other countries formally congratulated China on the return?

96. 1998 Nagano Olympic Games

96.3 Who won gold medals in Nagano?

97. Counting Crows

97.5 List the Crows’ record titles.

97.6 List the Crows’ band members.

98. American Legion

98.5 List Legionnaires.

99. Woody Guthrie

99.1 List Woody Guthrie’s songs.

100. Sammy Sosa

100.7 Name the pitchers off of which Sosa homered.

101. Michael Weiss

101.7 List Michael Weiss’s competitors.

102. Boston Big Dig

102.6 List individuals associated with the Big Dig.

103. Super Bowl XXXIV

103.6 List players who scored touchdowns in the game.

104. 1999 North American International Auto Show

104.4 List auto manufacturers in the show.

105. 1980 Mount St. Helens eruption

105.6 List names of eyewitnesses of the eruption.

106. 1998 Baseball World Series

106.6 Name the players in the series.

107. Chunnel


107.6 List dates of Chunnel closures.

108. Sony Pictures Entertainment (SPE)

108.4 Name movies released by SPE.

108.5 Name TV shows by the SPE.

109. Telefonica of Spain

109.5 Name companies involved in mergers with Telefonica of Spain.

110. Lions Club International

110.5 Name officials of the club.

110.6 Name programs sponsored by the Lions Club.

111. AMWAY

111.4 Name the officials of the company.

112. McDonald’s Corporation

112.5 Name the corporation’s top officials.

112.6 Name the non-hamburger restaurant holdings of the corporation.

113. Paul Newman

113.4 Name the camps started under his Hole in the Wall Foundation.

113.5 Name some of his movies.

114. Jesse Ventura

114.3 List his various occupations.

114.4 Name movies/TV shows he appeared in.

115. Longwood Gardens

115.7 List personnel of the gardens.

116. Camp David

116.6 Who are some world leaders that have met there?

117. kudzu

117.4 What are other names it is known by?


118. U.S. Medal of Honor

118.4 What Medal of Honor recipients are in Congress?

119. Harley-Davidson

119.4 What other products do they produce?

120. Rose Crumb

120.5 What awards has she received?

121. Rachel Carson

121.3 What books did she write?

122. Paul Revere

122.7 What were some of his occupations?

123. Vicente Fox

123.5 What countries did Vicente Fox visit after election?

124. Rocky Marciano

124.6 Who were some of his opponents?

125. Enrico Caruso

125.1 What operas has Caruso sung?

126. Pope Pius XII

126.3 What official positions did he hold prior to becoming Pius XII?

127. U.S. Naval Academy

127.6 List people who have attended the Academy.

128. OPEC

128.3 What countries constitute the OPEC committee?

128.5 List OPEC countries.

129. NATO

129.4 Which countries were the original signers?


130. tsunami

130.5 What countries has it struck?

131. Hindenburg disaster

131.7 Name individuals who witnessed the disaster.

132. Kim Jong Il

132.4 What posts has Kim Jong Il held in the government of this country?

133. Hurricane Mitch

133.3 As of the time of Hurricane Mitch, what previous hurricanes had higher death totals?

133.4 What countries offered aid for this hurricane?

134. genome

134.2 List species whose genomes have been sequenced.

134.3 List the organizations that sequenced the Human genome.

135. Food-for-Oil Agreement

135.5 What countries participated in this agreement by providing food or medicine?

136. Shiite

136.7 What Shiite leaders were killed in Pakistan?

137. Kinmen Island

137.3 What other island groups are controlled by this government?

138. International Bureau of Universal Postal Union (UPU)

138.3 Where were UPU congresses held?

139. Organization of Islamic Conference (OIC)

139.2 Which countries are members of the OIC?

139.3 Who has served as Secretary General of the OIC?

140. PBGC

140.4 Employees of what companies are receiving benefits from this organization?


Appendix C

Sample LiQED Pattern File

The following file, liqed.list.pat, was generated by the Pattern Generation phase for question 76.7, "What movies was he in?" (he, in this case, being Bing Crosby). The text patterns use a series of symbolic representations and markers. <QUESTION> refers to the question target, Bing Crosby, and <ANSWER> refers to the location of a possible answer. An asterisk, *, is a wildcard that can match any single word. The square brackets, [ ], signify a contiguous sentence fragment or chunk.
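To illustrate how such a pattern can be applied, the sketch below compiles a pattern into a regular expression over whitespace-separated tokens and extracts the <ANSWER> span. This is a simplified reconstruction rather than LiQED's actual matcher: chunk brackets are ignored, and the input sentence is assumed to be tokenised in the same way as the corpus text.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SurfacePatternMatcher {

    /** Compile a surface pattern into a regex and return the text matched by <ANSWER>,
     *  or null if the pattern does not match the (pre-tokenised) sentence. */
    public static String findCandidate(String surfacePattern, String target, String sentence) {
        StringBuilder regex = new StringBuilder();
        for (String token : surfacePattern.trim().split("\\s+")) {
            if (token.equals("[") || token.equals("]")) continue;     // chunk markers ignored in this sketch
            if (regex.length() > 0) regex.append("\\s+");
            if (token.equals("<QUESTION>"))      regex.append(Pattern.quote(target));
            else if (token.equals("<ANSWER>"))   regex.append("(.+?)");   // lazy capture of the answer span
            else if (token.equals("*"))          regex.append("\\S+");    // wildcard: exactly one word
            else                                 regex.append(Pattern.quote(token));
        }
        Matcher m = Pattern.compile(regex.toString(), Pattern.CASE_INSENSITIVE).matcher(sentence);
        return m.find() ? m.group(1).trim() : null;
    }

    public static void main(String[] args) {
        String pattern = "<QUESTION> 's `` <ANSWER> ''";
        String sentence = "the songs featured in Bing Crosby 's `` Holiday Inn '' were hits";
        System.out.println(findCandidate(pattern, "Bing Crosby", sentence));   // prints: Holiday Inn
    }
}

The full pattern file generated for this question follows.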

% Q76.7: What movies was he in ?
%
% Question topic: Bing Crosby (233 sentences)
% Potential answer: White Christmas (190 sentences)
% Potential answer: Mr. Tambourine Man (23 sentences)
% Potential answer: Looking Forward (10092 sentences)
% Potential answer: High Society (129 sentences)
% Potential answer: Road (77370 sentences)
% Potential answer: Southern Man (32 sentences)
%
% "Bing Crosby" and "White Christmas" (21 sentences)
% "Bing Crosby" and "Mr. Tambourine Man" (0 sentences)
% "Bing Crosby" and "Looking Forward" (0 sentences)
% "Bing Crosby" and "High Society" (2 sentences)
% "Bing Crosby" and "Road" (12 sentences)
% "Bing Crosby" and "Southern Man" (0 sentences)


%
% Most Relevant Terms :
% "to", "the", "and", "’s", "’’", "with",
% ".", "bob", "hope", "in", "‘‘", ","
%
% Patterns:
, ‘‘ <ANSWER> * * * ’’ with <QUESTION>
, ‘‘ <ANSWER> * ’’ * <QUESTION> *
, <QUESTION> ‘‘ <ANSWER> * ’’
<ANSWER> ’’ * <QUESTION>
<ANSWER> ’’ * <QUESTION> .
<ANSWER> ’’ with <QUESTION>
<ANSWER> * ’’ * * <QUESTION>
<ANSWER> * ’’ * <QUESTION>
<ANSWER> * ’’ with <QUESTION> .
<QUESTION> ’s ‘‘ <ANSWER>
<QUESTION> ’s ‘‘ <ANSWER> ’’
<QUESTION> ’s ‘‘ <ANSWER> ’’ and
<QUESTION> ’s ‘‘ <ANSWER> * ’’
<QUESTION> ’s ‘‘ <ANSWER> . ’’
<QUESTION> * ‘‘ * <ANSWER>
<QUESTION> * ‘‘ * <ANSWER> * to
<QUESTION> * ‘‘ <ANSWER>
<QUESTION> * ‘‘ <ANSWER> ’’
<QUESTION> * ‘‘ <ANSWER> ’’ * the
<QUESTION> * ‘‘ <ANSWER> * ’’
<QUESTION> * ‘‘ <ANSWER> * to
<QUESTION> * ‘‘ <ANSWER> . ’’
<QUESTION> ‘‘ <ANSWER>
<QUESTION> ‘‘ <ANSWER> * ’’
<QUESTION> ‘‘ <ANSWER> * ,
<QUESTION> ‘‘ <ANSWER> . ’’
[ * ‘‘ * <ANSWER> to ] [ bob hope and <QUESTION> ]
[ * ‘‘ <ANSWER> to ] [ bob hope and <QUESTION> , and ]
[ , * ‘‘ * <ANSWER> to ] [ bob hope and <QUESTION> ]
[ <ANSWER> to ] [ * * * * * bob hope and <QUESTION> * ]
[ ‘‘ <ANSWER> to ] [ bob hope and <QUESTION> ]
[ ‘‘ <ANSWER> to ] [ bob hope and <QUESTION> * and ]
‘‘ <ANSWER> ’’ * <QUESTION>
‘‘ <ANSWER> ’’ <QUESTION>
‘‘ <ANSWER> * ’’ * * <QUESTION>
‘‘ <ANSWER> * ’’ * <QUESTION>


‘‘ <ANSWER> * ’’ * <QUESTION> .
‘‘ <ANSWER> * ’’ <QUESTION>
‘‘ <ANSWER> * ’’ with <QUESTION>
‘‘ <ANSWER> , ’’ * * <QUESTION>
‘‘ <ANSWER> , ’’ * * <QUESTION> ,
‘‘ <ANSWER> , ’’ * <QUESTION> * ]
‘‘ <ANSWER> , ’’ <QUESTION> ,
bob hope and <QUESTION> * * <ANSWER>
bob hope and <QUESTION> * * <ANSWER> to
bob hope and <QUESTION> * <ANSWER>
bob hope and <QUESTION> * ‘‘ * <ANSWER>
bob hope and <QUESTION> * ‘‘ <ANSWER>
bob hope and <QUESTION> in * * <ANSWER> to
hope and <QUESTION> * * * ‘‘ * <ANSWER> to
hope and <QUESTION> * * * ‘‘ the <ANSWER> to
the * bob hope and <QUESTION> * ‘‘ * <ANSWER> to
the ‘‘ <ANSWER> * ’’ <QUESTION>
to <QUESTION> ’s ‘‘ <ANSWER> * ’’


Bibliography

Berners-Lee, T., Hendler, J., and Lassila, O. (2001). The semantic web. Scientific American.

Boyer, R. S. and Moore, J. S. (1997). A fast string search algorithm. Communications of the Association for Computing Machinery, 20:762–772.

Brin, S. (1999). Extracting patterns and relations from the world wide web. In WebDB '98: Selected Papers from the International Workshop on The World Wide Web and Databases, pages 172–183, London, UK. Springer-Verlag.

Gotoh, O. (1982). An improved algorithm for matching biological sequences. Journal of Molecular Biology, 162:705–708.

Green, B., Wolf, A., Chomsky, C., and Laughery, K. (1961). Baseball: an automatic question answerer. In Proceedings of the Western Joint Computer Conference, pages 219–224.

Gusfield, D. (1997). Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology. Cambridge University Press.

Hearst, M. A. (1992). Automatic acquisition of hyponyms from large text corpora. In Proceedings of the Fourteenth International Conference on Computational Linguistics, Nantes, France.

Hovy, E., Hermjakob, U., and Ravichandran, D. (2002). A question/answer typology with surface text patterns. In Proceedings of the Human Language Technology Conference, Seattle, CA.

Katz, B. (1997). From sentence processing to information access on the world wide web. In Proceedings of the AAAI Spring Symposium on Natural Language Processing for the World Wide Web.

Knuth, D. E., Morris, J. H., and Pratt, V. R. (1977). Fast pattern matching in strings. SIAM Journal on Computing, 6:323–350.

Leidner, J., Bos, J., Dalmas, T., Curran, J. R., Clark, S., Bannard, C. J., Webber, B., and Steedman, M. (2003). QED: The Edinburgh TREC-2003 question answering system. In Text REtrieval Conference (TREC).

Meir, R. and Ratsch, G. (2003). An introduction to boosting and leveraging. In Mendelson, S. and Smola, A., editors, Advanced Lectures on Machine Learning, LNCS, pages 119–184. Springer.

Nelson, M. (1996). Fast string searching with suffix trees. Dr. Dobb's Journal.

Ravichandran, D. and Hovy, E. H. (2002). Learning surface text patterns for a question answering system. In Proceedings of ACL, pages 41–47.

Rinaldi, F., Dowdall, J., Kaljurad, K., Hess, M., and Molla, D. (2003). Exploiting paraphrases in a question answering system. In Proceedings of the Second International Workshop on Paraphrasing, pages 25–32.

Smith, T. F. and Waterman, M. S. (1981). Identification of common molecular subsequences. Journal of Molecular Biology, 147:195–197.

Voorhees, E. M. (1999). The TREC-8 question answering track report. In Text REtrieval Conference (TREC).

Voorhees, E. M. (2004). Overview of the TREC 2004 question answering track. In Text REtrieval Conference (TREC).

Woods, W. (1973). Progress in natural language understanding: An application to lunar geology. In AFIPS Conference Proceedings, volume 42, pages 441–450.

Yang, H., Cui, H., Kan, M.-Y., Maslennikov, M., Qiu, L., and Chua, T.-S. (2003). QUALIFIER in TREC-12 QA main task. In Text REtrieval Conference (TREC), page 480.

Yi, J. and Sundaresan, N. (1999). Mining the web for acronyms using the duality of patterns and relations. In WIDM '99: Proceedings of the 2nd International Workshop on Web Information and Data Management, pages 48–52, New York, NY, USA. ACM Press.

Yun, N. and Graeme, H. (2004). Analysis of semantic classes in medical text for question answering. In Aliod, D. M. and Vicedo, J. L., editors, ACL 2004: Question Answering in Restricted Domains, pages 54–61, Barcelona, Spain. Association for Computational Linguistics.