AMADEUS TUTORIAL - University of York

UDA – Ubiquitous Digital Agents

AMADEUS: Architectures Machines And Devices for Efficient Ubiquitous Systems

Combining Question Answering and Information Retrieval

Suresh Manandhar and Thimal Jayasooriya
University of York



Search Engines

Question: How tall is Mt Everest?

– IR could give the following answers:

• “Mt Everest was first climbed by Hillary”
• “Mt Everest is part of the Himalayan range”
• “Susan Armstrong the 28 yr old rep from New York climbed the 8800m high Mt Everest”
• … plus a large number of irrelevant answers

• Search engines are not so good at reasoning with syntax and semantics


Beyond Search Engines

• Want shallow understanding of Natural Language to support a range of applications:
– Document clustering
– Topic based searching
– Question based searching
– Interfacing with DB backends
– Linking multiple related documents

• Intelligent searching to aid:
– scientific discovery, document organisation, automatic extraction of knowledge from textual data


Ingredients for building QA systems

• Spelling correction/checking

• PoS tagging

• Shallow Parsing

• Deep parsing and Logical form generation

• Lexical reasoning

• Matching

• … other kinds of reasoning


Part of Speech (POS) tagging

• PoS tagging is the pre-step to syntactic analysis

• Given “I went to the bank”, output:

I-pronoun went-verb-past to-prep the-det bank-noun-sg

• Notice “bank” can be a verb or a noun

• “I” can be a numeral or a pronoun

• And there can be unknown words:

I went to Pokhara

• Task of PoS tagger is to assign the “correct” PoS tags
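The tagging task can be illustrated with a very small baseline. This is only a sketch: the lexicon, its counts and the unknown-word rule are invented for illustration, and a most-frequent-tag lookup is far simpler than the taggers discussed in this tutorial.

```python
from collections import Counter

# Hypothetical toy lexicon: word -> how often each tag was seen.
# The counts are made up for illustration, not taken from any corpus.
LEXICON = {
    "i": Counter({"pronoun": 98, "numeral": 2}),
    "went": Counter({"verb-past": 50}),
    "to": Counter({"prep": 80}),
    "the": Counter({"det": 200}),
    "bank": Counter({"noun-sg": 30, "verb": 5}),
}


def guess_unknown(word):
    """Crude fallback for words that are missing from the lexicon."""
    if word[:1].isupper():
        return "proper-noun"
    return "noun-sg"


def tag(sentence):
    """Assign each word its most frequent tag, or a guess if it is unknown."""
    tagged = []
    for word in sentence.split():
        counts = LEXICON.get(word.lower())
        if counts:
            best = counts.most_common(1)[0][0]
        else:
            best = guess_unknown(word)
        tagged.append((word, best))
    return tagged


print(tag("I went to the bank"))
print(tag("I went to Pokhara"))
```

The baseline always picks “noun” for “bank” and “pronoun” for “I”, whatever the context. That is exactly the weakness that context-sensitive taggers address.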


POS tagging

• current taggers employ HMMs trained on large corpora

• trigram-based p(t0 | t-2 t-1) taggers are the current state-of-the-art - accuracy > 96%

• Q-A systems: lots of unknown words, or known words used unusually, e.g. “what i want is ... ?”

• low tagging accuracy on unknown words - best systems ~ 84% to 88%

• far less tagged data on questions
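The trigram quantity p(t0 | t-2 t-1) mentioned above can be estimated from tagged data by relative frequency. The sketch below assumes a tiny hand-made set of tag sequences and uses no smoothing, which a real tagger would need. The function name and the `<s>` padding symbol are illustrative choices.

```python
from collections import Counter


def trigram_model(tag_sequences):
    """Build a function estimating p(t0 | t-2 t-1) by relative frequency."""
    trigrams = Counter()
    contexts = Counter()
    for tags in tag_sequences:
        # Pad with two start symbols so the first tags also have a context.
        padded = ["<s>", "<s>"] + list(tags)
        for i in range(2, len(padded)):
            context = (padded[i - 2], padded[i - 1])
            trigrams[context + (padded[i],)] += 1
            contexts[context] += 1

    def prob(t0, t_minus2, t_minus1):
        context = (t_minus2, t_minus1)
        if contexts[context] == 0:
            return 0.0  # unseen context; no smoothing in this sketch
        return trigrams[context + (t0,)] / contexts[context]

    return prob


# Hypothetical tag sequences standing in for a tagged corpus.
TOY_CORPUS = [
    ["pronoun", "verb-past", "prep", "det", "noun-sg"],
    ["pronoun", "verb-past", "prep", "proper-noun"],
    ["det", "noun-sg", "verb-past", "det", "noun-sg"],
]

p = trigram_model(TOY_CORPUS)
print(p("det", "verb-past", "prep"))  # 0.5
```

With so little data most contexts are unseen and get probability zero, which hints at why accuracy drops on unknown words and on question-style input.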


Parsing and Grammar formalisms

• constructing canonical logical forms from sentences

• a good grammatical formalism will allow mapping “John bought a toy” / “A toy was bought by John” to roughly the same semantic representation:

exists(Y): toy(Y) & buy(‘John’,Y)

• common syntactic phenomena include:
– relative clauses – The man that Mary liked went home.
– co-ordinations – Bill and Mary got married.
– question constructions – What book did John buy?
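The idea of mapping an active and a passive sentence to the same logical form can be sketched with two surface patterns. This is a deliberately naive stand-in for a grammar: the regular expressions and the `VERB_BASE` table are assumptions made for this example, and they cover only sentences of exactly these two shapes.

```python
import re

# Hypothetical mini-lexicon mapping verb forms to a base predicate.
VERB_BASE = {"bought": "buy", "liked": "like"}

# "John bought a toy"
ACTIVE = re.compile(r"^(\w+) (\w+) an? (\w+)$")
# "A toy was bought by John"
PASSIVE = re.compile(r"^An? (\w+) was (\w+) by (\w+)$", re.IGNORECASE)


def logical_form(sentence):
    """Map a simple active or passive sentence to one canonical form."""
    text = sentence.strip().rstrip(".")
    match = PASSIVE.match(text)
    if match:
        noun, verb, agent = match.groups()
    else:
        match = ACTIVE.match(text)
        if not match:
            raise ValueError("sentence shape not covered by this sketch")
        agent, verb, noun = match.groups()
    predicate = VERB_BASE[verb.lower()]
    return f"exists(Y): {noun}(Y) & {predicate}('{agent}',Y)"


print(logical_form("John bought a toy"))
print(logical_form("A toy was bought by John"))
```

Both calls print the same string. Relative clauses, co-ordination and questions are well beyond this kind of pattern matching, which is why a proper grammatical formalism is needed.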


Parsing… Pure CFGs not sufficient

• most automatically extracted grammars employ pure CFGs

• need information on moved phrases to generate semantics

• dependency information is crucial

• most machine learnt grammars focus on raw crossing bracket rates


Parsing Issues: PP attachment ambiguity

• prepositional phrase attachment ambiguity e.g. “I drank scotch on ice”

• better to leave PPs unattached rather than guessing wrong

• combine both position-based and meaning-based matching

• hybrid representations are needed that combine logical form, syntactic structure and string/position based representations


Recollect: Search Engines

Question: How tall is Mt Everest?

– IR could give the following answers:

• “Mt Everest was first climbed by Hillary”
• “Mt Everest is part of the Himalayan range”
• “Susan Armstrong the 28 yr old rep from New York climbed the 8800m high Mt Everest”
• … plus a large number of irrelevant answers

• Search engines are not so good at reasoning with syntax and semantics


Matching: Getting the right answer

Matching answers with questions

• “Susan Armstrong the 28 yr old rep from New York climbed the 8800m high Mt Everest”

Reasoning using logical form and lexical relations:
– Meaning representation: X = Mt_Everest & tall(X, ?Y)
– “How tall …” – asking for a numeric measure
– “tall” is related to “height/high”
– “the 8800m high Mt Everest”:

high(X,Y) & Y=8800m & X = Mt_Everest
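The matching step can be sketched as a lookup over facts already extracted from candidate answers. Everything here is an assumption for illustration: facts are written by hand as (predicate, entity, value) triples, and `RELATED` is a hypothetical table standing in for real lexical relations.

```python
# Hypothetical lexical relations: question predicate -> acceptable answer predicates.
RELATED = {"tall": {"tall", "high", "height"}}


def find_answer(question, facts):
    """Return the value of the first fact that matches the question.

    question: (predicate, entity), e.g. ("tall", "Mt_Everest")
    facts:    list of (predicate, entity, value) triples
    """
    q_predicate, q_entity = question
    accepted = RELATED.get(q_predicate, {q_predicate})
    for predicate, entity, value in facts:
        if entity == q_entity and predicate in accepted:
            return value
    return None


# Facts written by hand from the candidate answers.
FACTS = [
    ("part_of", "Mt_Everest", "Himalayan_range"),
    ("high", "Mt_Everest", "8800m"),
]

print(find_answer(("tall", "Mt_Everest"), FACTS))  # 8800m
```

Without the tall/high relation the question predicate would match none of the facts, which is the gap a pure keyword search cannot close.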


Matching – Lexical Relations

• Lexical relations are semantic relations between words:
– Synonym: (human – person)
– Antonym: (tall – short)
– Hyponym: (BMW – car)
– Meronym: (door – house)
– Entailment: (fire – smoke)
– … plus many more

• Matching algorithm computes the semantic distance between the Question and the Answer.
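One simple way to compute a semantic distance is to count the relation links on the shortest path between two words. The sketch below assumes a tiny hand-built relation graph. The last two edges are added for illustration, and path length is only one of several possible distance measures.

```python
from collections import deque

# Hand-built relation graph; the last two edges are illustrative additions.
EDGES = [
    ("human", "person", "synonym"),
    ("tall", "short", "antonym"),
    ("BMW", "car", "hyponym"),
    ("door", "house", "meronym"),
    ("fire", "smoke", "entailment"),
    ("car", "vehicle", "hyponym"),
    ("tall", "high", "synonym"),
]


def build_graph(edges):
    """Turn labelled edges into an undirected adjacency map."""
    graph = {}
    for a, b, _relation in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def semantic_distance(graph, source, target):
    """Number of relation links on the shortest path, or None if unconnected."""
    if source not in graph or target not in graph:
        return None
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        word, dist = queue.popleft()
        if word == target:
            return dist
        for neighbour in graph[word]:
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, dist + 1))
    return None


GRAPH = build_graph(EDGES)
print(semantic_distance(GRAPH, "BMW", "vehicle"))  # 2
print(semantic_distance(GRAPH, "BMW", "smoke"))    # None
```

Treating every relation type as one step is a simplification. A weighted variant could make antonym links cost more than synonym links.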


Matching and reasoning

• reasoning crucial to Q-A

• WordNet provides:
– hypernym, synonym, antonym, meronym, etc.

• (but) common relations required for Q-A tasks are missing:
– noun-adjective (benefit-beneficial)
– verb-noun (punish-punishment)
– entailment (penalty-punishment)
– telic (hammer-break)

• limited current research on learning of semantic relations


Search engines

Current state of the art

– Google uses backlinks to determine the most relevant pages

– Most of the other search engines use keyword scanning techniques

– Online directories, such as DMOZ, use human editors to sort and rank content
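Ranking by backlinks can be sketched with a simplified PageRank-style iteration, one well-known way of scoring pages from their incoming links. The damping factor, iteration count and the four-page link graph are assumptions for illustration. This is not a description of any search engine's production system.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Score pages from a link graph given as {page: [pages it links to]}."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Each page shares its rank equally among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank evenly.
                for other in pages:
                    new_rank[other] += damping * rank[page] / n
        rank = new_rank
    return rank


# Hypothetical link graph: pages a, b and d all link to c.
LINKS = {"a": ["c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(LINKS)
print(max(ranks, key=ranks.get))  # c
```

Page c has the most backlinks and ends up with the highest score. Page a comes second because its only backlink is from the highly ranked c.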


Beyond Search Engines

Essential ingredients

– A better syntactic understanding of document contents

– More efficient means of grouping “similar” or related elements together

– A better understanding of “relevance” to the user … A better query interface ?


The ideal situation

• Error free disambiguation of natural language in documents

• Categorization of documents by subject and intent of the author, rather than by scanned keywords

• The opportunity to clarify the information needs of individual users, to match more closely what they want


Possible dimensions for queries …

Who is the First Lord of the Treasury of the United Kingdom ?

• First Lords of the Treasury
• Prime Ministers
• Head of State in the UK


Are Document Dimensions an answer ?

In data warehousing terminology, a dimension is

“a structure that categorizes data in order to enable end users to answer questions”

Charles Bachman urged programmers to think in terms of multi-dimensional space as far back as 1973

Mothè experimented with document metadata in dimensional space (2001 and 2003)

Roelleke used the accessibility dimension to determine relevance within a document


How are dimensions created ?

• Cleanse and tokenize source data

• Shallow parse the source data to resolve some syntactic ambiguity

• Extract a series of unique terms, words or phrases

• Determine the similarity between individual terms

• Organize similar terms into dimensions, groups of semantically related elements
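The first and third steps above (cleansing, tokenizing and extracting unique terms) can be sketched as follows. The cleansing rule and the stopword list are assumptions chosen for brevity. Shallow parsing and similarity are left out of this sketch.

```python
import re
from collections import Counter

# Small illustrative stopword list, not a standard one.
STOPWORDS = frozenset({"the", "a", "of", "and", "in", "to"})


def cleanse(text):
    """Replace punctuation (except apostrophes and hyphens) and collapse whitespace."""
    text = re.sub(r"[^\w\s'-]", " ", text)
    return re.sub(r"\s+", " ", text).strip()


def tokenize(text):
    """Lower-case the cleansed text and split it on whitespace."""
    return cleanse(text).lower().split()


def unique_terms(documents, stopwords=STOPWORDS):
    """Count each distinct non-stopword term across all documents."""
    counts = Counter()
    for document in documents:
        counts.update(t for t in tokenize(document) if t not in stopwords)
    return counts


terms = unique_terms(["The duck, the DUCK!", "A duck in the pond."])
print(terms)
```

The resulting term counts are the input to the similarity and grouping steps. Phrases rather than single words would need the shallow parse to identify them.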


Analysing source documents

The source data was the sample newswire articles from TREC-11, 3 gigabytes of XML formatted data, consisting of around 20,000 articles

• Stripping XML formatting

• Detecting sentence boundaries in articles

• POS tagging individual sentences

• Named Entity and Coreference annotation
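The first two preprocessing steps can be sketched with the Python standard library. The element names in the sample (`DOC`, `HEADLINE`, `TEXT`) are hypothetical and not the actual TREC markup. The sentence splitter is a naive rule that will fail on abbreviations such as “Mr. Smith”.

```python
import re
import xml.etree.ElementTree as ET


def strip_xml(xml_string):
    """Return the text content of an XML document with the markup removed."""
    root = ET.fromstring(xml_string)
    text = " ".join(root.itertext())
    return re.sub(r"\s+", " ", text).strip()


def split_sentences(text):
    """Naive rule: split after . ! or ? when followed by a capital letter."""
    return re.split(r"(?<=[.!?])\s+(?=[A-Z])", text)


# Hypothetical article; the tag names are invented for this example.
SAMPLE = (
    "<DOC><HEADLINE>Ducks return to the pond.</HEADLINE>"
    "<TEXT>The ducks arrived on Monday. They stayed all week.</TEXT></DOC>"
)

for sentence in split_sentences(strip_xml(SAMPLE)):
    print(sentence)
```

Each sentence produced this way would then be passed to the POS tagger and the named entity annotator.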


NE annotation

Kenneth Joseph Lenihan, a New York research sociologist who helped refine the scientific methods used in criminology, died May 25 at his home in Manhattan.

• Named Entity (Person name): Kenneth Joseph Lenihan
• Named Entity (Location name): New York
• Named Entity (Temporal entity): May 25
• Named Entity (Location name): Manhattan



Semantic distance

• Semantic distance uses the concept of relatedness, or the semantic similarity between two lexical concepts

• Grouping synonyms together seems intuitive, e.g. humans ≈ people ≈ beings

• But surprisingly, other lexical concepts such as meronyms, hyponyms, hypernyms, troponyms and even antonyms can also be semantically close.

• Different semantic distance algorithms for WordNet quantify “relatedness” in different ways (Budanitsky 2001)


Semantic distance – continued

Dimensions are found by setting an inclusion distance, an experimentally derived figure for semantic distance

The inclusion distance differs between algorithms, and can sometimes even differ depending on the dataset

All terms which are within the specified inclusion distance are grouped in the same dimension

Terms within a dimension serve as a starting point for searching related concepts
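Grouping by inclusion distance can be sketched as a simple threshold test around a seed term. The distances below are invented for illustration, and building a dimension around a single seed is an assumption of this sketch. Any real semantic distance function could be passed in instead of `toy_distance`.

```python
def build_dimension(seed, terms, distance, inclusion_distance):
    """Return the seed plus every term within the inclusion distance of it."""
    members = [seed]
    for term in terms:
        if term == seed:
            continue
        d = distance(seed, term)
        if d is not None and d <= inclusion_distance:
            members.append(term)
    return members


# Hypothetical pairwise distances; missing pairs count as unrelated.
TOY_DISTANCES = {
    frozenset({"duck", "avoid"}): 1,
    frozenset({"duck", "sidestep"}): 1,
    frozenset({"duck", "straighten"}): 2,
    frozenset({"duck", "bank"}): 6,
}


def toy_distance(a, b):
    """Look up a distance in the toy table, or None if the pair is unknown."""
    return TOY_DISTANCES.get(frozenset({a, b}))


candidates = ["avoid", "sidestep", "straighten", "bank", "pond"]
print(build_dimension("duck", candidates, toy_distance, inclusion_distance=2))
```

Raising the inclusion distance to 6 would also pull “bank” into the dimension, which shows how sensitive the grouping is to this experimentally chosen figure.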


Dimension example

#dimension(duck:verb)

Synonyms Antonyms Troponyms Hypernyms

avoid, move, crouch, elude

sidestep

straighten, unbend

avoid, quibble


Issues with dimensions

• Search space explosion (at least three times as many documents are returned)

• Stored semantic knowledge is not sufficiently granular:
– “duck” has 85 different entries in Roget’s thesaurus; 47 verb definitions, 21 noun definitions and 18 uses as an adjective.
– However, the term database stores only the part of speech tag. Thus, all 47 uses of “duck” as a verb are clumped together

• WordNet is not sufficiently rich in lexical relations, nor sufficiently inclusive of modern language idioms


The way forward

• Adding Natural language processing techniques to search is the answer
– Processing capabilities allow NLP techniques to be included without significant degradation of speed
– Richer lexicons and language resources are being developed
– People are continually asking harder questions of available information resources; keyword searches no longer satisfy end users!

• IBM’s WebFountain and the MOMINIS research project are two of several research initiatives to bring focused crawling and natural language processing techniques to search


Thanks for listening
