UDA – Ubiquitous Digital Agents
AMADEUS: Architectures Machines And Devices for Efficient Ubiquitous Systems
AMADEUS TUTORIAL
Combining Question Answering and Information Retrieval
Suresh Manandhar and Thimal Jayasooriya
University of York
Search Engines
Question: How tall is Mt Everest?
– IR could give the following answers:
• “Mt Everest was first climbed by Hillary”
• “Mt Everest is part of the Himalayan range”
• “Susan Armstrong the 28 yr old rep from New York climbed the 8800m high Mt Everest”
• … plus a large number of irrelevant answers
• Search engines are not so good at reasoning with syntax and semantics
Beyond Search Engines
• Want shallow understanding of Natural Language to support a range of applications:
– Document clustering
– Topic based searching
– Question based searching
– Interfacing with DB backends
– Linking multiple related documents
• Intelligent searching to aid scientific discovery, document organisation, and automatic extraction of knowledge from textual data
Ingredients for building QA systems
• Spelling correction/checking
• PoS tagging
• Shallow Parsing
• Deep parsing and Logical form generation
• Lexical reasoning
• Matching
• … other kinds of reasoning
Part of Speech (POS) tagging
• PoS tagging is the pre-step to syntactic analysis
• Given “I went to the bank”, output:
I-pronoun went-verb-past to-prep the-det bank-noun-sg
• Notice bank can be verb/noun
• I can be a numeral or a pronoun
• And there can be unknown words:
I went to Pokhara
• Task of PoS tagger is to assign the “correct” PoS tags
POS tagging
• current taggers employ HMMs trained on large corpora
• trigram-based p(t0 | t-2 t-1) taggers are the current state-of-the-art - accuracy > 96%
• Q-A systems: lots of unknown words or known words used unusually…e.g. “what i want is ... ?”
• low tagging accuracy on unknown words - best systems ~ 84% to 88%
• far less tagged data on questions
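To make the HMM idea concrete, here is a minimal sketch of Viterbi decoding for a tagger. It is deliberately simplified: it uses a bigram model p(t0 | t-1) rather than the trigram model named above, and every probability in the tables is invented for illustration rather than estimated from a corpus. The names TAGS, START, TRANS, EMIT and FLOOR are assumptions of this sketch.

```python
import math

# Hypothetical toy model: every number below is invented for illustration.
TAGS = ["pronoun", "verb", "prep", "det", "noun"]
START = {"pronoun": 0.6, "det": 0.2, "noun": 0.1, "verb": 0.05, "prep": 0.05}
TRANS = {  # p(tag | previous tag)
    "pronoun": {"verb": 0.8, "noun": 0.1, "prep": 0.1},
    "verb": {"prep": 0.4, "det": 0.4, "noun": 0.2},
    "prep": {"det": 0.7, "noun": 0.3},
    "det": {"noun": 0.9, "verb": 0.1},
    "noun": {"verb": 0.5, "prep": 0.5},
}
EMIT = {  # p(word | tag)
    "pronoun": {"i": 0.5},
    "verb": {"went": 0.3, "bank": 0.05},
    "prep": {"to": 0.6},
    "det": {"the": 0.7},
    "noun": {"bank": 0.2, "i": 0.01},
}
FLOOR = 1e-6  # crude smoothing for unseen transitions and unknown words


def logp(table, key):
    """Log-probability with a floor for events missing from the table."""
    return math.log(table.get(key, FLOOR))


def viterbi(words):
    """Return the most probable tag sequence under the bigram HMM."""
    words = [w.lower() for w in words]
    # best[tag] = (log score of best path ending in tag, that path)
    best = {t: (logp(START, t) + logp(EMIT[t], words[0]), [t]) for t in TAGS}
    for word in words[1:]:
        new_best = {}
        for tag in TAGS:
            score, path = max(
                (best[prev][0] + logp(TRANS[prev], tag), best[prev][1])
                for prev in TAGS
            )
            new_best[tag] = (score + logp(EMIT[tag], word), path + [tag])
        best = new_best
    return max(best.values())[1]


print(viterbi("I went to the bank".split()))
```

An unknown word such as Pokhara receives the floor probability under every tag, so only the transition model decides its tag. That is one way to see why accuracy drops on unknown words.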
Parsing and Grammar formalisms
• constructing canonical logical forms from sentences
• a good grammatical formalism will allow mapping:
“John bought a toy” / “A toy was bought by John”
… to roughly the same semantic representation
exists(Y): toy(Y) & buy(‘John’,Y)
• common syntactic phenomena include:
– relative clauses – The man that Mary liked went home.
– co-ordinations – Bill and Mary got married.
– question constructions – What book did John buy?
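The active/passive mapping can be illustrated with a toy sketch. This is not a grammar formalism: it uses two hand-written regular expressions that only cover sentences shaped like the two examples, and the one-entry lemma table LEMMA is a made-up stand-in for a real lexicon. It only shows what "roughly the same semantic representation" means in practice.

```python
import re

# Hypothetical one-entry lexicon mapping inflected verbs to lemmas.
LEMMA = {"bought": "buy"}

# Patterns that cover only the two example sentence shapes.
ACTIVE = re.compile(r"^(?P<agent>\w+) (?P<verb>\w+) an? (?P<obj>\w+)$", re.I)
PASSIVE = re.compile(
    r"^an? (?P<obj>\w+) was (?P<verb>\w+) by (?P<agent>\w+)$", re.I
)


def logical_form(sentence):
    """Map an active or passive sentence to one canonical logical form."""
    text = sentence.strip().rstrip(".")
    match = PASSIVE.match(text) or ACTIVE.match(text)
    if match is None:
        raise ValueError("sentence shape not covered by this toy sketch")
    agent = match.group("agent")
    obj = match.group("obj").lower()
    verb = match.group("verb").lower()
    verb = LEMMA.get(verb, verb)
    return f"exists(Y): {obj}(Y) & {verb}('{agent}',Y)"


print(logical_form("John bought a toy"))
print(logical_form("A toy was bought by John"))
```

Both calls print the same string, which is the point of a canonical form: later matching steps do not need to know which surface construction was used.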
Parsing… Pure CFGs not sufficient
• most automatically extracted grammars employ pure CFGs
• need information on moved phrases to generate semantics
• dependency information is crucial
• most machine learnt grammars focus on raw crossing bracket rates
Parsing Issues: PP attachment ambiguity
• prepositional phrase attachment ambiguity e.g. “I drank scotch on ice”
• better to leave PPs unattached rather than guessing wrong
• combine both position-based and meaning-based matching
• hybrid representations that combine logical form, syntactic structure and string/position based representations are needed
Recollect: Search Engines
Question: How tall is Mt Everest?
– IR could give the following answers:
• “Mt Everest was first climbed by Hillary”
• “Mt Everest is part of the Himalayan range”
• “Susan Armstrong the 28 yr old rep from New York climbed the 8800m high Mt Everest”
• … plus a large number of irrelevant answers
• Search engines are not so good at reasoning with syntax and semantics
Matching: Getting the right answer
Matching answers with questions
• “Susan Armstrong the 28 yr old rep from New York climbed the 8800m high Mt Everest”
Reasoning using logical form and lexical relations:
– Meaning representation: X = Mt_Everest & tall(X, ?Y)
– “How tall …” – asking for a numeric measure
– “tall” is related to “height/high”
– “the 8800m high Mt Everest”:
high(X,Y) & Y=8800m & X = Mt_Everest
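The reasoning steps above can be sketched as a small matcher. This is an illustration under stated assumptions, not the tutorial's system: facts are hand-written triples standing in for logical forms, RELATED is a hypothetical lexical relation table, and is_measure is a made-up helper that accepts only a few length units.

```python
import re

# Hypothetical lexical relation table: "tall" is related to "high"/"height".
RELATED = {"tall": {"tall", "high", "height"}}

# Hand-written (predicate, X, Y) triples standing in for extracted logical forms.
FACTS = [
    ("climb", "Susan_Armstrong", "Mt_Everest"),
    ("part_of", "Mt_Everest", "Himalayan_range"),
    ("high", "Mt_Everest", "8800m"),
]


def is_measure(value):
    """True if the value looks like a numeric length, e.g. '8800m'."""
    return re.fullmatch(r"\d+(?:\.\d+)?\s*(?:m|km|ft)", value) is not None


def answer(predicate, entity, facts):
    """Resolve predicate(entity, ?Y) where ?Y must be a numeric measure."""
    accepted = RELATED.get(predicate, {predicate})
    for fact_pred, x, y in facts:
        if fact_pred in accepted and x == entity and is_measure(y):
            return y
    return None


# "How tall is Mt Everest?"  ->  tall(Mt_Everest, ?Y)
print(answer("tall", "Mt_Everest", FACTS))
```

The keyword-only answers ("first climbed by…", "part of the…") are rejected because their predicates are not related to "tall" and their second argument is not a measure.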
Matching – Lexical Relations
• Lexical relations are semantic relations between words:
– Synonym: (human – person)
– Antonym: (tall – short)
– Hyponym: (BMW – car)
– Meronym: (door – house)
– Entailment: (fire – smoke)
– … plus many more
• Matching algorithm computes the semantic distance between the Question and the Answer.
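One simple way to turn lexical relations into a distance is to count the relation links on the shortest path between two words. The sketch below assumes that definition; the slides do not say which measure the matching algorithm uses. The first five relation pairs come from the list above, and the last two are added purely for illustration.

```python
from collections import deque

# (word, word, relation label); the label is kept for readability only.
RELATIONS = [
    ("human", "person", "synonym"),
    ("tall", "short", "antonym"),
    ("BMW", "car", "hyponym"),
    ("door", "house", "meronym"),
    ("fire", "smoke", "entailment"),
    ("tall", "high", "related"),    # illustrative addition
    ("high", "height", "related"),  # illustrative addition
]


def build_graph(relations):
    """Undirected graph: every relation links both words to each other."""
    graph = {}
    for a, b, _label in relations:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def semantic_distance(graph, source, target):
    """Number of relation links on the shortest path, or None if unrelated."""
    if source not in graph or target not in graph:
        return None
    queue = deque([(source, 0)])
    seen = {source}
    while queue:
        word, dist = queue.popleft()
        if word == target:
            return dist
        for neighbour in graph[word]:
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, dist + 1))
    return None


GRAPH = build_graph(RELATIONS)
print(semantic_distance(GRAPH, "tall", "height"))
print(semantic_distance(GRAPH, "fire", "car"))
```

Treating every relation type as one link is a design choice of this sketch; a real measure would likely weight synonymy differently from, say, meronymy.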
Matching and reasoning
• reasoning crucial to Q-A
• WordNet provides:
– hypernym, synonym, antonym, meronym, etc.
• (but) common relations required for Q-A tasks are missing:
– noun-adjective (benefit-beneficial)
– verb-noun (punish-punishment)
– entailment (penalty-punishment)
– telic (hammer-break)
• limited current research on learning of semantic relations
Search engines
Current state of the art
– Google uses backlinks to determine the most relevant pages
– Most of the other search engines use keyword scanning techniques
– Online directories, such as DMOZ, use human editors to sort and rank content
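As a rough illustration of ranking by backlinks, the sketch below runs a PageRank-style power iteration over a tiny hand-made link graph. The slides only say that backlinks are used; the damping factor, iteration count and page names here are assumptions of this sketch, not a description of any production system.

```python
def pagerank(links, damping=0.85, iterations=50):
    """PageRank-style scores for a dict mapping page -> list of linked pages."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    rank = {page: 1.0 / len(pages) for page in pages}
    for _ in range(iterations):
        # Every page starts each round with the "random jump" share.
        new_rank = {page: (1.0 - damping) / len(pages) for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                for target in pages:
                    new_rank[target] += damping * rank[page] / len(pages)
        rank = new_rank
    return rank


# Hypothetical three-page web: both a and b link to c, and c links back to a.
scores = pagerank({"a": ["c"], "b": ["c"], "c": ["a"]})
print(sorted(scores, key=scores.get, reverse=True))
```

Page c ranks first because it has two backlinks, and page b ranks last because nothing links to it.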
Beyond Search Engines
Essential ingredients
– A better syntactic understanding of document contents
– More efficient means of grouping “similar” or related elements together
– A better understanding of “relevance” to the user … A better query interface ?
The ideal situation
• Error free disambiguation of natural language in documents
• Categorization of documents by subject and intent of the author; rather than by scanned keywords
• The opportunity to clarify the information needs of individual users, to match more closely what they want
Possible dimensions for queries …
Who is the First Lord of the Treasury of the United Kingdom?
First Lords of the Treasury
Prime Ministers
Head of State in the UK
Are Document Dimensions an answer ?
In data warehousing terminology, a dimension is “a structure that categorizes data in order to enable end users to answer questions”
Charles Bachman urged programmers to think in terms of multi-dimensional space as far back as 1973
Mothè experimented with document metadata in dimensional space (2001 and 2003)
Roelleke used the accessibility dimension to determine relevance within a document
How are dimensions created ?
• Cleanse and tokenize source data
• Shallow parse the source data to resolve some syntactic ambiguity
• Extract a series of unique terms, words or phrases
• Determine the similarity between individual terms
• Organize similar terms into dimensions, groups of semantically related elements
Analysing source documents
The source data was the sample newswire articles from TREC-11, 3 gigabytes of XML formatted data, consisting of around 20,000 articles
• Stripping XML formatting
• Detecting sentence boundaries in articles
• POS tagging individual sentences
• Named Entity and Coreference annotation
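The first two steps can be sketched with the Python standard library. The element names DOC, HEADLINE and TEXT and the sample article are hypothetical, since the slides do not show the TREC markup, and the sentence splitter is a naive regular expression rather than a trained boundary detector. POS tagging and Named Entity annotation are not covered here.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical article; element names are assumptions, not the TREC schema.
SAMPLE = (
    "<DOC><HEADLINE>Obituary</HEADLINE>"
    "<TEXT>Kenneth Joseph Lenihan died May 25 at his home in Manhattan. "
    "He was a research sociologist.</TEXT></DOC>"
)


def strip_xml(xml_string, body_tag="TEXT"):
    """Return the plain text of the body element with whitespace normalised."""
    root = ET.fromstring(xml_string)
    body = root.find(body_tag)
    if body is None:
        return ""
    return " ".join("".join(body.itertext()).split())


def split_sentences(text):
    """Naive split: after . ! or ? when followed by space and a capital."""
    return re.split(r"(?<=[.!?])\s+(?=[A-Z])", text)


for sentence in split_sentences(strip_xml(SAMPLE)):
    print(sentence)
```

A rule this simple breaks on abbreviations (it would split after "Mt." in "Mt. Everest"), which is why sentence boundary detection is listed as a step in its own right.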
NE annotation
Kenneth Joseph Lenihan, a New York research sociologist who helped refine the scientific methods used in criminology, died May 25 at his home in Manhattan.
Kenneth Joseph Lenihan: Named Entity (Person Name)
New York: Named Entity (Location name)
May 25: Named Entity (Temporal entity)
Manhattan: Named Entity (Location name)
Semantic distance
• Semantic distance uses the concept of relatedness, or the semantic similarity between two lexical concepts
• Grouping synonyms together seems intuitive, e.g. humans ≈ people ≈ beings
• But surprisingly, other lexical concepts such as meronyms, hyponyms, hypernyms, troponyms and even antonyms can also be semantically close.
• Different semantic distance algorithms for WordNet quantify “relatedness” in different ways (Budanitsky 2001)
Semantic distance – continued
Dimensions are found by setting an inclusion distance, an experimentally derived figure for semantic distance
The inclusion distance differs between algorithms; and can sometimes even differ depending on the dataset
All terms which are within the specified inclusion distance are grouped in the same dimension
Terms within a dimension serve as a starting point for searching related concepts
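The grouping step can be sketched as follows. Two things are assumed here and not stated in the slides: the pairwise distances in DIST are invented numbers standing in for a real semantic distance algorithm, and "within the inclusion distance" is read as single-link grouping, so a term joins a dimension if it is close enough to any one member.

```python
from itertools import combinations

# Hypothetical pairwise distances (smaller means more closely related).
DIST = {
    frozenset({"humans", "people"}): 0.10,
    frozenset({"people", "beings"}): 0.20,
    frozenset({"humans", "beings"}): 0.30,
    frozenset({"duck", "sidestep"}): 0.20,
    frozenset({"duck", "elude"}): 0.25,
    frozenset({"sidestep", "elude"}): 0.30,
}


def distance(a, b):
    """Look up the distance; unlisted pairs are treated as unrelated (1.0)."""
    return 0.0 if a == b else DIST.get(frozenset({a, b}), 1.0)


def build_dimensions(terms, inclusion_distance):
    """Group terms whose distance is within the inclusion distance."""
    parent = {term: term for term in terms}

    def find(term):
        # Follow parent links to the group representative (with path halving).
        while parent[term] != term:
            parent[term] = parent[parent[term]]
            term = parent[term]
        return term

    for a, b in combinations(terms, 2):
        if distance(a, b) <= inclusion_distance:
            parent[find(a)] = find(b)

    groups = {}
    for term in terms:
        groups.setdefault(find(term), []).append(term)
    return sorted(sorted(group) for group in groups.values())


TERMS = ["humans", "people", "beings", "duck", "sidestep", "elude"]
print(build_dimensions(TERMS, inclusion_distance=0.25))
```

Lowering the inclusion distance splits the dimensions into smaller groups and raising it merges them, which is why the figure has to be tuned per algorithm and sometimes per dataset.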
Dimension example
#dimension(duck:verb)
Synonyms | Antonyms | Troponyms | Hypernyms
avoid, move
crouch, elude, sidestep
straighten, unbend
avoid, quibble
Issues with dimensions
• Search space explosion (at least three times as many documents are returned)
• Stored semantic knowledge is not sufficiently granular:
– “duck” has 85 different entries in Roget’s thesaurus; 47 verb definitions, 21 noun definitions and 18 uses as an adjective.
– However, the term database stores only the part of speech tag. Thus, all 47 uses of “duck” as a verb are clumped together
• WordNet is not sufficiently rich in lexical relations, nor sufficiently inclusive of modern language idioms
The way forward
• Adding Natural language processing techniques to search is the answer
– Processing capabilities allow NLP techniques to be included without significant degradation of speed
– Richer lexicons and language resources are being developed
– People are continually asking harder questions of available information resources; keyword searches no longer satisfy end users!
• IBM’s WebFountain and the MOMINIS research project are two of several research initiatives to bring focused crawling and natural language processing techniques to search
Thanks for listening