

Page 1: AMADEUS AMADEUS TUTORIAL - University of York

UDA – Ubiquitous Digital Agents

Architectures Machines And Devices for Efficient Ubiquitous Systems (AMADEUS)

AMADEUS TUTORIAL

Combining Question Answering and Information Retrieval

Suresh Manandhar and Thimal Jayasooriya
University of York


Search Engines

Question: How tall is Mt Everest?

– IR could give the following answers:
  • “Mt Everest was first climbed by Hillary”
  • “Mt Everest is part of the Himalayan range”
  • “Susan Armstrong the 28 yr old rep from New York climbed the 8800m high Mt Everest”
  • … plus a large number of irrelevant answers

• Search engines are not so good at reasoning with syntax and semantics


Beyond Search Engines

• Want shallow understanding of Natural Language to support a range of applications:
  – Document clustering
  – Topic based searching
  – Question based searching
  – Interfacing with DB backends
  – Linking multiple related documents

• Intelligent searching to aid:
  – scientific discovery, document organisation, automatic extraction of knowledge from textual data


Ingredients for building QA systems

• Spelling correction/checking

• PoS tagging

• Shallow Parsing

• Deep parsing and Logical form generation

• Lexical reasoning

• Matching

• … other kinds of reasoning


Part of Speech (POS) tagging

• PoS tagging is the pre-step to syntactic analysis

• Given “I went to the bank”, output:

I-pronoun went-verb-past to-prep the-det bank-noun-sg

• Notice that “bank” can be a verb or a noun

• “I” can be a numeral or a pronoun

• And there can be unknown words:

I went to Pokhara

• Task of PoS tagger is to assign the “correct” PoS tags


POS tagging

• current taggers employ HMMs trained on large corpora

• trigram-based taggers, which model p(t_0 | t_-2, t_-1), are the current state of the art, with accuracy > 96%

• Q-A systems: lots of unknown words, or known words used unusually, e.g. “what i want is ... ?”

• low tagging accuracy on unknown words: the best systems reach ~84% to 88%

• far less tagged data on questions
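As a rough illustration of the trigram idea, the sketch below scores each candidate tag sequence as the product of p(t_0 | t_-2, t_-1) and p(word | tag), estimated from a tiny hand-made corpus with add-one smoothing. The corpus, tag names and smoothing choice are assumptions made for this example, and the brute-force search stands in for the Viterbi decoding a real HMM tagger would use.

```python
from collections import Counter
from itertools import product

# Toy tagged corpus (hypothetical data, for illustration only).
CORPUS = [
    [("I", "PRON"), ("went", "VERB"), ("to", "PREP"), ("the", "DET"), ("bank", "NOUN")],
    [("I", "PRON"), ("bank", "VERB"), ("at", "PREP"), ("the", "DET"), ("branch", "NOUN")],
    [("the", "DET"), ("bank", "NOUN"), ("closed", "VERB")],
]
START = "<s>"

trigram_counts = Counter()   # counts of (t-2, t-1, t0)
context_counts = Counter()   # counts of (t-2, t-1)
emission_counts = Counter()  # counts of (word, tag)
tag_counts = Counter()
lexicon = {}                 # word -> set of tags seen with it

for sentence in CORPUS:
    tags = [START, START] + [tag for _, tag in sentence]
    for i in range(2, len(tags)):
        trigram_counts[tuple(tags[i - 2:i + 1])] += 1
        context_counts[tuple(tags[i - 2:i])] += 1
    for word, tag in sentence:
        emission_counts[(word, tag)] += 1
        tag_counts[tag] += 1
        lexicon.setdefault(word, set()).add(tag)

TAGS = sorted(tag_counts)
VOCAB_SIZE = len(lexicon)


def transition(t2, t1, t0):
    """Add-one smoothed estimate of p(t0 | t-2, t-1)."""
    return (trigram_counts[(t2, t1, t0)] + 1) / (context_counts[(t2, t1)] + len(TAGS))


def emission(word, tag):
    """Add-one smoothed estimate of p(word | tag); unknown words get a small mass."""
    return (emission_counts[(word, tag)] + 1) / (tag_counts[tag] + VOCAB_SIZE + 1)


def tag_sentence(words):
    """Return the highest-scoring tag sequence by exhaustive search."""
    # Known words are limited to the tags seen in the corpus; unknown words may take any tag.
    candidates = [sorted(lexicon.get(word, TAGS)) for word in words]
    best_score, best_tags = -1.0, None
    for tags in product(*candidates):
        history = (START, START)
        score = 1.0
        for word, tag in zip(words, tags):
            score *= transition(history[0], history[1], tag) * emission(word, tag)
            history = (history[1], tag)
        if score > best_score:
            best_score, best_tags = score, tags
    return list(zip(words, best_tags))


print(tag_sentence("I went to the bank".split()))
```

On this toy data the ambiguous word “bank” is resolved by its tag context. An unknown word such as “Pokhara” has no lexicon entry, so every tag is a candidate and the decision rests on the tag context alone, which is one reason accuracy drops on unknown words.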


Parsing and Grammar formalisms

• constructing canonical logical forms from sentences

• a good grammatical formalism will allow mapping:
  “John bought a toy” / “A toy was bought by John”
  …to roughly the same semantic representation:
  exists(Y): toy(Y) & buy(‘John’,Y)

• common syntactic phenomena include:
  – relative clauses: The man that Mary liked went home.
  – co-ordinations: Bill and Mary got married.
  – question constructions: What book did John buy?
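The mapping of an active and a passive sentence to one logical form can be sketched with two surface patterns. This is only a toy under stated assumptions: the regular expressions, the `LEMMAS` table and the `logical_form` function are hypothetical and cover just this sentence shape, whereas a real system would derive the form from a parse.

```python
import re

# Hypothetical mini-lexicon mapping inflected verbs to their base form.
LEMMAS = {"bought": "buy"}

# Two surface patterns for the same meaning (illustrative, not a grammar).
ACTIVE = re.compile(r"^(?P<agent>[A-Z]\w*) (?P<verb>\w+) an? (?P<noun>\w+)$")
PASSIVE = re.compile(r"^An? (?P<noun>\w+) was (?P<verb>\w+) by (?P<agent>[A-Z]\w*)$")


def logical_form(sentence):
    """Map an active or passive sentence to one canonical logical form."""
    text = sentence.strip().rstrip(".")
    for pattern in (ACTIVE, PASSIVE):
        match = pattern.match(text)
        if match:
            verb = LEMMAS.get(match["verb"], match["verb"])
            return f"exists(Y): {match['noun']}(Y) & {verb}('{match['agent']}',Y)"
    return None  # sentence shape not covered by this sketch


print(logical_form("John bought a toy"))
print(logical_form("A toy was bought by John"))
```

Both calls yield the same string, which is the property a question answering system needs in order to match a question against differently worded answer sentences.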


Parsing… Pure CFGs not sufficient

• most automatically extracted grammars employ pure CFGs

• need information on moved phrases to generate semantics

• dependency information is crucial

• most machine learnt grammars focus on raw crossing bracket rates


Parsing Issues: PP attachment ambiguity

• prepositional phrase attachment ambiguity e.g. “I drank scotch on ice”

• better to leave PPs unattached rather than guessing wrong

• combine both position-based and meaning-based matching

• hybrid representations that combine logical form, syntactic structure and string/position based representations are needed
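One possible shape for such a hybrid representation is sketched below: the token list keeps string positions, the logical form covers the attached core of the sentence, and prepositional phrases are stored unattached together with their position. The class, its field names and the `pp_candidates` helper are assumptions for illustration, not the tutorial's data structure.

```python
from dataclasses import dataclass, field


@dataclass
class HybridRepresentation:
    tokens: list                 # surface string; list index = token position
    logical_form: str            # meaning of the unambiguous core
    unattached_pps: list = field(default_factory=list)  # (preposition, noun, start position)


def pp_candidates(rep, word, max_gap=2):
    """Position-based matching: PPs that start within max_gap tokens after `word`."""
    position = rep.tokens.index(word)
    return [
        (prep, noun)
        for prep, noun, start in rep.unattached_pps
        if 0 < start - position <= max_gap
    ]


REP = HybridRepresentation(
    tokens=["I", "drank", "scotch", "on", "ice"],
    logical_form="drink('I', scotch)",
    unattached_pps=[("on", "ice", 3)],
)

# The PP stays available to both possible heads instead of being attached by guesswork.
print(pp_candidates(REP, "scotch"))
print(pp_candidates(REP, "drank"))
```

Leaving the PP unattached means a later matching step can combine the logical form with the position information rather than inheriting a wrong attachment.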


Recollect: Search Engines

Question: How tall is Mt Everest?

– IR could give the following answers:
  • “Mt Everest was first climbed by Hillary”
  • “Mt Everest is part of the Himalayan range”
  • “Susan Armstrong the 28 yr old rep from New York climbed the 8800m high Mt Everest”
  • … plus a large number of irrelevant answers

• Search engines are not so good at reasoning with syntax and semantics


Matching: Getting the right answer

Matching answers with questions
• “Susan Armstrong the 28 yr old rep from New York climbed the 8800m high Mt Everest”

Reasoning using logical form and lexical relations:
– Meaning representation: X = Mt_Everest & tall(X, ?Y)
– “How tall …”: asking for a numeric measure
– “tall” is related to “height/high”
– “the 8800m high Mt Everest”:
  high(X,Y) & Y=8800m & X = Mt_Everest
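The reasoning on this slide can be sketched as a small matcher: the question predicate is compared with each fact extracted from the answer sentence, a lexical relation bridges “tall” and “high”, and the answer must look like a numeric measure. The tuple encoding of predicates, the `RELATED` table and the `MEASURE` pattern are assumptions made for this example.

```python
import re

# Hypothetical lexical relations; a real system would draw these from a lexicon.
RELATED = {frozenset(("tall", "high")), frozenset(("tall", "height"))}

# "How tall ..." asks for a numeric measure such as "8800m".
MEASURE = re.compile(r"^\d+(?:\.\d+)?\s*[A-Za-z]+$")


def related(p, q):
    """True if two predicate names are identical or lexically related."""
    return p == q or frozenset((p, q)) in RELATED


def match(question, facts):
    """Bind ?Y in (predicate, entity, ?Y) against (predicate, entity, value) facts."""
    q_pred, q_entity, _unknown = question
    for pred, entity, value in facts:
        if entity == q_entity and related(q_pred, pred) and MEASURE.match(value):
            return value
    return None


# Facts a parser might extract from the answer sentence (encoding is illustrative).
FACTS = [
    ("climb", "Susan_Armstrong", "Mt_Everest"),
    ("high", "Mt_Everest", "8800m"),
]
QUESTION = ("tall", "Mt_Everest", "?Y")

print(match(QUESTION, FACTS))
```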


Matching – Lexical Relations

• Lexical relations are semantic relations between words:
  – Synonym: (human – person)
  – Antonym: (tall – short)
  – Hyponym: (BMW – car)
  – Meronym: (door – house)
  – Entailment: (fire – smoke)
  – … plus many more

• Matching algorithm computes the semantic distance between the Question and the Answer.


Matching and reasoning

• reasoning crucial to Q-A

• WordNet provides:
  – hypernym, synonym, antonym, meronym, etc.

• (but) common relations required for Q-A tasks are missing:
  – noun-adjective (benefit-beneficial)
  – verb-noun (punish-punishment)
  – entailment (penalty-punishment)
  – telic (hammer-break)

• limited current research on learning of semantic relations


Search engines

Current state of the art

– Google uses backlinks to determine the most relevant pages

– Most of the other search engines use keyword scanning techniques

– Online directories, such as DMOZ, use human editors to sort and rank content


Beyond Search Engines

Essential ingredients

– A better syntactic understanding of document contents

– More efficient means of grouping “similar” or related elements together

– A better understanding of “relevance” to the user … A better query interface?


The ideal situation

• Error free disambiguation of natural language in documents

• Categorization of documents by subject and intent of the author, rather than by scanned keywords

• The opportunity to clarify the information needs of individual users, to match more closely what they want


Possible dimensions for queries …

Who is the First Lord of the Treasury of the United Kingdom?

• First Lords of the Treasury
• Prime Ministers
• Head of State in the UK


Are Document Dimensions an answer?

In data warehousing terminology, a dimension is “a structure that categorizes data in order to enable end users to answer questions”

Charles Bachman urged programmers to think in terms of multi-dimensional space as far back as 1973

Mothè experimented with document metadata in dimensional space (2001 and 2003)

Roelleke used the accessibility dimension to determine relevance within a document


How are dimensions created?

• Cleanse and tokenize source data

• Shallow parse the source data to resolve some syntactic ambiguity

• Extract a series of unique terms, words or phrases

• Determine the similarity between individual terms

• Organize similar terms into dimensions, groups of semantically related elements
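The first and third steps above (cleansing/tokenizing and extracting unique terms) can be sketched as follows. The stop-word list, the tokenization rule and the function names are assumptions for illustration; shallow parsing and the similarity and grouping steps are left out of this sketch.

```python
import re
from collections import Counter

# Hypothetical stop-word list; a real system would use a fuller one.
STOP_WORDS = {"the", "a", "an", "of", "to", "in", "and", "is", "was", "by"}


def cleanse(text):
    """Replace non-printable characters with spaces and collapse whitespace."""
    printable = "".join(ch if ch.isprintable() else " " for ch in text)
    return " ".join(printable.split())


def tokenize(text):
    """Lowercase the text and split it into alphabetic or numeric tokens."""
    return re.findall(r"[a-z]+|\d+", text.lower())


def unique_terms(tokens):
    """Count each distinct non-stop-word term."""
    return Counter(token for token in tokens if token not in STOP_WORDS)


raw = "The\tduck \x00 ducked.\n"
terms = unique_terms(tokenize(cleanse(raw)))
print(sorted(terms))
```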


Analysing source documents

The source data were the sample newswire articles from TREC-11: 3 gigabytes of XML-formatted data, consisting of around 20,000 articles. Processing steps:

• Stripping XML formatting

• Detecting sentence boundaries in articles

• POS tagging individual sentences

• Named Entity and Coreference annotation
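The first two steps, stripping XML formatting and detecting sentence boundaries, might look like the sketch below. The element names (`DOC`, `HEADLINE`, `TEXT`, `P`) and the sample document are assumptions, not the actual TREC-11 schema, and the punctuation-based splitter is a deliberately naive stand-in for a real sentence boundary detector.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical article markup; the real collection's element names may differ.
SAMPLE = (
    "<DOC><HEADLINE>Obituary</HEADLINE><TEXT>"
    "<P>Kenneth Joseph Lenihan died May 25 at his home in Manhattan. "
    "He was a research sociologist.</P>"
    "</TEXT></DOC>"
)

# Split on whitespace that follows ., ! or ? and precedes a capital letter.
SENTENCE_END = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")


def strip_xml(xml_string):
    """Return the plain text of each <P> element, with whitespace normalised."""
    root = ET.fromstring(xml_string)
    return [" ".join("".join(p.itertext()).split()) for p in root.iter("P")]


def split_sentences(paragraph):
    """Naive sentence boundary detection (abbreviations will fool it)."""
    return [s for s in SENTENCE_END.split(paragraph) if s]


sentences = [s for paragraph in strip_xml(SAMPLE) for s in split_sentences(paragraph)]
print(sentences)
```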


NE annotation

Kenneth Joseph Lenihan, a New York research sociologist who helped refine the scientific methods used in criminology, died May 25 at his home in Manhattan.

Kenneth Joseph Lenihan: Named Entity (Person name)

New York: Named Entity (Location name)

May 25: Named Entity (Temporal entity)

Manhattan: Named Entity (Location name)
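Annotations like these are commonly stored as character spans over the sentence. The sketch below produces such spans with a tiny gazetteer and a date pattern; both are hypothetical stand-ins for a trained named entity recogniser, chosen only so that the example runs on this one sentence.

```python
import re

SENTENCE = (
    "Kenneth Joseph Lenihan, a New York research sociologist who helped refine "
    "the scientific methods used in criminology, died May 25 at his home in Manhattan."
)

# Hypothetical gazetteer; a real annotator would not rely on a fixed list.
GAZETTEER = {
    "Kenneth Joseph Lenihan": "Person name",
    "New York": "Location name",
    "Manhattan": "Location name",
}
MONTHS = "January|February|March|April|May|June|July|August|September|October|November|December"
DATE = re.compile(rf"\b(?:{MONTHS}) \d{{1,2}}\b")


def annotate(sentence):
    """Return (start, end, text, label) spans sorted by position."""
    spans = []
    for name, label in GAZETTEER.items():
        for m in re.finditer(re.escape(name), sentence):
            spans.append((m.start(), m.end(), m.group(), label))
    for m in DATE.finditer(sentence):
        spans.append((m.start(), m.end(), m.group(), "Temporal entity"))
    return sorted(spans)


for start, end, text, label in annotate(SENTENCE):
    print(f"{start:3d}-{end:3d}  {text}  [{label}]")
```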


Semantic distance

• Semantic distance uses the concept of relatedness, or the semantic similarity between two lexical concepts

• Grouping synonyms together seems intuitive, e.g. humans ≈ people ≈ beings

• But surprisingly, other lexical concepts such as meronyms, hyponyms, hypernyms, troponyms and even antonyms can also be semantically close.

• Different semantic distance algorithms for WordNet quantify “relatedness” in different ways (Budanitsky 2001)
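One simple way to quantify relatedness is to count edges on the shortest path between two terms in a graph of lexical relations. The sketch below does this with breadth-first search over a hand-made graph. The edges are invented for illustration, and edge counting is only one of several possible measures, not necessarily the one used in the work described here.

```python
from collections import deque

# Hypothetical relation graph: each pair is linked by some lexical relation.
EDGES = [
    ("human", "person"),   # synonym
    ("person", "being"),   # synonym
    ("BMW", "car"),        # hyponym
    ("car", "vehicle"),    # hyponym
    ("door", "house"),     # meronym
]


def build_graph(edges):
    """Build an undirected adjacency map from relation pairs."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def semantic_distance(graph, source, target):
    """Number of edges on the shortest path, or None if the terms are unconnected."""
    if source == target:
        return 0
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        node, dist = queue.popleft()
        for neighbour in graph.get(node, ()):
            if neighbour == target:
                return dist + 1
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, dist + 1))
    return None


GRAPH = build_graph(EDGES)
print(semantic_distance(GRAPH, "human", "being"))
```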


Semantic distance – continued

Dimensions are found by setting an inclusion distance, an experimentally derived figure for semantic distance

The inclusion distance differs between algorithms, and can sometimes even differ depending on the dataset

All terms which are within the specified inclusion distance are grouped in the same dimension

Terms within a dimension serve as a starting point for searching related concepts
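Grouping by inclusion distance can be sketched as a simple threshold filter around a head term. The distance values in the table below are invented for illustration; in practice they would come from whichever semantic distance algorithm is in use, and the threshold would be tuned experimentally as described above.

```python
# Hypothetical pairwise semantic distances (smaller means more closely related).
DISTANCES = {
    ("duck", "avoid"): 1,
    ("duck", "elude"): 1,
    ("duck", "sidestep"): 2,
    ("duck", "straighten"): 2,
    ("duck", "publish"): 6,
}


def distance(a, b):
    """Symmetric lookup; unknown pairs count as infinitely far apart."""
    if a == b:
        return 0
    return DISTANCES.get((a, b), DISTANCES.get((b, a), float("inf")))


def build_dimension(head, terms, inclusion_distance):
    """All terms within the inclusion distance of the head term."""
    return sorted(t for t in terms if distance(head, t) <= inclusion_distance)


TERMS = ["avoid", "elude", "sidestep", "straighten", "publish", "invoice"]
print(build_dimension("duck", TERMS, inclusion_distance=2))
print(build_dimension("duck", TERMS, inclusion_distance=1))
```

Raising the inclusion distance widens the dimension, which is the trade-off behind the need to tune it per algorithm and sometimes per dataset.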


Dimension example

#dimension(duck:verb)

Columns: Synonyms | Antonyms | Troponyms | Hypernyms

Entries: avoid, move, crouch, elude, sidestep, straighten, unbend, avoid, quibble


Issues with dimensions

• Search space explosion (at least three times as many documents are returned)

• Stored semantic knowledge is not sufficiently granular:
  – “duck” has 85 different entries in Roget’s thesaurus; 47 verb definitions, 21 noun definitions and 18 uses as an adjective.
  – However, the term database stores only the part of speech tag. Thus, all 47 uses of “duck” as a verb are clumped together

• WordNet is not sufficiently rich in lexical relations, nor sufficiently inclusive of modern language idioms
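The granularity problem can be made concrete by comparing two ways of keying the term database. In the sketch below, indexing by (term, part of speech) merges every verb sense of “duck”, while adding a sense identifier keeps them apart. The entries, glosses and function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical sense-tagged entries: (term, part of speech, sense gloss, related terms).
ENTRIES = [
    ("duck", "verb", "avoid", ["elude", "sidestep"]),
    ("duck", "verb", "lower the body", ["crouch"]),
    ("duck", "noun", "waterfowl", ["mallard"]),
]


def index_by_pos(entries):
    """Key on (term, POS) only: all senses sharing a POS are clumped together."""
    index = defaultdict(list)
    for term, pos, _sense, related in entries:
        index[(term, pos)].extend(related)
    return dict(index)


def index_by_sense(entries):
    """Key on (term, POS, sense): each sense keeps its own related terms."""
    return {(term, pos, sense): list(related) for term, pos, sense, related in entries}


print(index_by_pos(ENTRIES)[("duck", "verb")])
print(index_by_sense(ENTRIES)[("duck", "verb", "avoid")])
```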


The way forward

• Adding natural language processing techniques to search is the answer
  – Processing capabilities allow NLP techniques to be included without significant degradation of speed
  – Richer lexicons and language resources are being developed
  – People are continually asking harder questions of available information resources; keyword searches no longer satisfy end users!

• IBM’s WebFountain and the MOMINIS research project are two of several research initiatives to bring focused crawling and natural language processing techniques to search


Thanks for listening