Intelligent Natural Language System
MANISH JOSHI   RAJENDRA AKERKAR
Open Domain Question Answering
What is Question Answering?
How is QA related to IR, IE?
Some issues related to QA
Question taxonomies
General approach to QA
8 July, 2007 - ENLIGHT sys
Question Answering Systems
These types of systems try to provide exact information as an answer in response to the natural language query raised by the user.
Motivation: given a question, the system should provide an answer instead of requiring the user to search for the answer in a set of documents
Example:
Q: What year was Mozart born? A: Mozart was born in 1756.
Information Retrieval
Document is the unit of information
Answers questions indirectly: one has to search within the document
Results: (ranked) list based on estimated relevance
Effective approaches are predominantly statistical (“bag of words”)
QA = (very short) passage retrieval with natural language questions (not queries)
Information Extraction
Task
Identify messages that fall under a number of specific topics
Extract information according to pre-defined templates
Place the information into frame-like database records
Limitations
Templates are hand-crafted by human experts
Templates are domain dependent and not easily portable
Issues
Applications
Source of the answers
Structured data: natural language queries on databases
A fixed collection or book: encyclopedia
Web data
Domain-independent vs. domain-specific
Users
Casual users vs. Regular users — Profile, History, etc.
May be maintained for regular users
Question Taxonomy
Factual questions: answer is often found in a text snippet
from one or more documents
Questions that may have yes/no answers
wh questions (who, where, when, etc.)
what, which questions are hard
Questions may be phrased as requests or commands
Questions requiring simple reasoning: some world knowledge and elementary reasoning may be required to relate the question to the answer; why, how questions
e.g. How did Socrates die? (by) drinking poisoned wine.
Question Taxonomy
Context questions: questions have to be answered in the context of previous interactions with the user
Who assassinated Indira Gandhi?
When did this happen?
List questions: Fusion of partial answers scattered over
several documents is necessary
Ex. - List 3 major rice producing nations.
How do I assemble a bicycle?
QA System Architecture
General Approach
Question analysis: Find the type of object that answers the question: "when" - time, date; "who" - person, organization, etc.
Document collection preprocessing: Prepare documents for real-time query processing
Document retrieval (IR): Using (augmented) question, retrieve set of possible relevant documents/passages using IR
General Approach
Document processing (IE): Search documents for entities of the desired type and in appropriate relations using NLP
Answer extraction and ranking: Extract and rank candidate answers from the documents
Answer construction: Provide (links to) context, evidence, etc.
Question Analysis
Identify semantic type of the entity sought by the question: when, where, who are easy to handle; which, what are ambiguous
e.g. What was the Beatles’ first hit single?
Determine additional constraints on the answer entity:
keywords that will be used to locate candidate answer-bearing sentences
relations (syntactic/semantic) that should hold between a candidate answer entity and other entities mentioned in the question
Document Processing
Preprocessing: Detailed analysis of all texts in the corpus
may be done a priori
one group annotates terms with one of 50 semantic
tags which are indexed along with terms
Retrieval: Initial set of candidate answer-bearing documents
are selected from a large collection
Boolean retrieval methods may be used profitably
Passage retrieval may be more appropriate
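A minimal illustration of Boolean-style passage retrieval as described above: keep passages that contain enough distinct question keywords, ranked by hit count. The function name and example passages are hypothetical.

```python
import re

def retrieve_passages(passages, keywords, min_hits=2):
    """Boolean-style retrieval: keep passages containing at least
    min_hits distinct question keywords, ranked by hit count."""
    scored = []
    for p in passages:
        words = set(re.findall(r"[a-z0-9]+", p.lower()))
        hits = sum(1 for k in keywords if k.lower() in words)
        if hits >= min_hits:
            scored.append((hits, p))
    return [p for hits, p in sorted(scored, key=lambda x: -x[0])]

docs = [
    "Mozart was born in 1756 in Salzburg.",
    "Beethoven admired Mozart.",
    "Salzburg is a city in Austria.",
]
print(retrieve_passages(docs, ["Mozart", "born"]))
```

Only the first passage contains both keywords, so it alone survives the Boolean filter.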
Document Processing
Analysis:
Part of speech tagging
Named entity identification: recognizes multi-word strings as names of companies/persons, locations/addresses, quantities, etc.
Shallow/deep syntactic analysis: Obtains information about syntactic relations, semantic roles
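For illustration only, the named-entity step can be crudely approximated by collecting runs of capitalized words; real systems use trained recognizers, so this sketch is an assumption, not the described method:

```python
import re

def naive_named_entities(sentence):
    # Runs of capitalized words approximate multi-word names of
    # persons, organizations, and locations (a deliberate simplification).
    return re.findall(r"(?:[A-Z][a-z]+\s)+[A-Z][a-z]+|[A-Z][a-z]+", sentence)

print(naive_named_entities("George Bush visited President of India"))
```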
History
MURAX (Kupiec, 1993) was designed to answer questions from the Trivial Pursuit general-knowledge board game, drawing answers from Grolier’s on-line encyclopaedia (1990).
Text Retrieval Conference (TREC): TREC was started in 1992 with the aim of supporting information retrieval research by providing the infrastructure necessary for large-scale evaluation of text retrieval methodologies.
The QA track was first included as part of TREC in 1999, with seventeen research groups entering one or more systems.
Techniques for performing open-domain question answering:
Manual and automatically constructed question analysers,
Document retrieval specifically for question answering,
Semantic type answer extraction,
Answer extraction via automatically acquired surface matching text patterns,
Principled target processing combined with document retrieval for definition questions,
and various approaches to sentence simplification which aid in the generation of concise definitions.
Answer Extraction
Look for strings whose semantic type matches that of the expected answer; matching may include subsumption (incorporating something under a more general category)
Check additional constraints:
Select a window around the matching candidate and calculate word overlap between window and query; OR
Check how many distinct question keywords are found in a matching sentence, order of occurrence, etc.
Check syntactic/semantic role of matching candidate
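The distinct-keyword check above can be sketched as follows (a minimal illustration; the keyword list is hypothetical):

```python
def score_candidate(question_keywords, sentence):
    """Count how many distinct question keywords occur in a candidate sentence."""
    words = set(sentence.lower().split())
    return sum(1 for k in question_keywords if k.lower() in words)

print(score_candidate(["year", "Mozart", "born"], "Mozart was born in 1756"))  # 2
```

Candidates are then ranked by this overlap score before the syntactic/semantic role checks.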
Semantic Symmetry
Ambiguous Modification
Semantic Symmetry
Question – Who killed militants?
Militants killed five innocents in Doda District.
After a 6-hour-long encounter, army soldiers killed 3 militants.
We are looking for sentences containing the word ‘militants’ as subject, but we got a sentence where ‘militants’ acts as object (second sentence).
Semantic symmetry is a linguistic phenomenon which occurs when an entity acts as subject in some sentences and as object in other sentences.
Example
The following example illustrates the phenomenon of semantic symmetry and demonstrates the problems it causes.
Question : Who visited President of India?
Candidate Answer 1: George Bush visited President of India
Candidate Answer 2: President of India visited flood-affected area of Mumbai.
The two sentences are similar at the word level, but they have very different meanings.
Some more examples showing semantic symmetry
(1) The snake ate the bird. / The birds ate the snake. (What does the snake eat?)
(2) Communists in India are supporting the UPA government. / Small parties are supporting Communists in Kerala. (Whom are the Communists supporting?)
Ambiguous Modification
It is a linguistic phenomenon which occurs when an adjective in the sentence may modify more than one noun.
Question: What is the largest volcano in the Solar System?
Candidate Answer 1: In the Solar System, the largest planet Jupiter has several volcanoes. ---- Wrong
Candidate Answer 2: Olympus Mons, the largest volcano in the solar system. --- Correct
In the first sentence ‘largest’ modifies the word ‘planet’, whereas in the second sentence ‘largest’ modifies the word ‘volcano’.
Approaches to tackle the problem
Boris Katz and James Lin of MIT developed a system SAPERE that handles problems occurring due to semantic symmetry and ambiguous modification.
These problems occur at the semantic level.
To deal with problems occurring at the semantic level, detailed information at the syntactic level is gathered in all these approaches.
The system developed by Katz and Lin gives results after utilizing syntactic relations. These typical S-V-O ternary relations are obtained after processing the information gathered by the Minipar functional dependency parser.
Our Approach
To deal with problems at the semantic level, most of the available approaches need to obtain and work on information gathered at the syntactic level.
We have proposed a new approach to deal with the problems caused by the linguistic phenomena of semantic symmetry and ambiguous modification.
The algorithms based on our approach remove wrong sentences from the answer with the help of information obtained at the lexical level (lexical analysis).
Algorithm for Handling Semantic Symmetry
Rule 1 - If (sequence of keywords in question and candidate answer matches) then
    If (POS of verb keyword is the same) then
        Candidate answer is correct
Rule 2 - If (sequence of keywords in question and candidate answer does not match) then
    If (POS of verb keyword is not the same) then
        Candidate answer is correct
Otherwise - Candidate answer is wrong
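Assuming the keyword sequences and verb POS tags (e.g. from a tagger such as QTAG) have already been extracted, Rules 1 and 2 can be sketched directly as:

```python
def check_semantic_symmetry(q_keyword_seq, a_keyword_seq, q_verb_pos, a_verb_pos):
    """Accept a candidate answer when keyword order and verb POS both
    agree (Rule 1) or both disagree (Rule 2); otherwise reject it."""
    seq_match = q_keyword_seq == a_keyword_seq
    pos_match = q_verb_pos == a_verb_pos
    if seq_match and pos_match:
        return True          # Rule 1
    if not seq_match and not pos_match:
        return True          # Rule 2
    return False             # otherwise: wrong answer

# Question: "Who killed militants?" -> keywords in order: [killed, militants]
# Candidate 1: "Militants killed five innocents" -> [militants, killed], same verb POS
print(check_semantic_symmetry(["killed", "militants"],
                              ["militants", "killed"], "VBD", "VBD"))  # False
# Candidate 2: "Army soldiers killed 3 militants" -> [killed, militants]
print(check_semantic_symmetry(["killed", "militants"],
                              ["killed", "militants"], "VBD", "VBD"))  # True
```

This correctly rejects the sentence where ‘militants’ swaps from object to subject while the verb form stays the same.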
Algorithm for Handling Ambiguous Modification
We have identified the adjective as Adj, the scope-defining noun as SN and the identifier noun as IN.
Rules - If the sentence contains keywords in the following order:
    Adj α SN    (where α indicates a string of zero or more keywords)
Then
    Rule 1-a: If α is IN == Correct answer, or
    Rule 1-b: If α is blank == Correct answer
Else
    Rule 2: If α is otherwise == Wrong answer
Algorithm for Handling Ambiguous Modification (Cont.)
If the sentence contains keywords in the following order:
    SN α Adj β IN    (where α and β indicate strings of zero or more keywords)
Then
    Rule 3: If β is blank == Correct answer (the value of α does not matter)
Else
    Rule 4: If β is otherwise == Wrong answer
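The four rules can be sketched over keyword sequences as below. This is a simplified reading: SN is treated as a single keyword, and α/β as the keyword spans between Adj, SN and IN, so it is an assumption about the algorithm rather than its exact implementation:

```python
def check_ambiguous_modification(keywords, adj, sn, iden):
    """Rules 1-4: `keywords` is the candidate sentence's keyword sequence;
    adj, sn, iden are the adjective, scope-defining noun, identifier noun."""
    if adj in keywords and sn in keywords:
        a, s = keywords.index(adj), keywords.index(sn)
        if a < s:                            # pattern: Adj α SN
            alpha = keywords[a + 1:s]
            return alpha == [] or alpha == [iden]   # Rules 1-a / 1-b, else Rule 2
        if iden in keywords[a:]:             # pattern: SN α Adj β IN
            i = keywords.index(iden, a)
            beta = keywords[a + 1:i]
            return beta == []                # Rule 3, else Rule 4
    return False

# "Olympus Mons, the largest volcano in the solar system" -> correct
print(check_ambiguous_modification(
    ["olympus", "mons", "largest", "volcano", "solar", "system"],
    "largest", "solar", "volcano"))  # True
# "In the Solar System, the largest planet Jupiter has ... volcano" -> wrong
print(check_ambiguous_modification(
    ["solar", "system", "largest", "planet", "jupiter", "volcano"],
    "largest", "solar", "volcano"))  # False
```

In the wrong candidate, β = [planet, jupiter] is non-blank, so Rule 4 rejects it: ‘largest’ is modifying ‘planet’, not the identifier noun ‘volcano’.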
Working System - ENLIGHT
We have developed a system that answers questions using ‘keyword based matching paradigm’.
We have incorporated the newly formulated algorithms in the system and obtained good results.
ENLIGHT System Architecture
Preprocessing
This module prepares the platform for the intelligent and effective interface.
It transforms raw-format data into a well-organized corpus with the help of the following activities:
Keyword Extraction
Sentence Segmentation
Handling of Abbreviations and Punctuation Marks
Tokenization
Stemming
Identifying Groups of Words with Specific Meaning
Shallow Parsing
Reference Resolution
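The keyword-extraction core of this pipeline (tokenization, stopword removal, stemming) can be sketched as follows. The stopword list and suffix stemmer are illustrative stand-ins; ENLIGHT's actual preprocessing (abbreviation handling, shallow parsing, reference resolution) is richer than this:

```python
import re

# Illustrative stopword list, not the system's real one.
STOPWORDS = {"the", "a", "an", "of", "in", "is", "was", "who", "what"}

def stem(word):
    # Crude suffix stripping as a stand-in for a real stemmer.
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(sentence):
    """Tokenize, drop stopwords, and stem: the keyword-extraction steps."""
    tokens = re.findall(r"[a-z0-9]+", sentence.lower())
    return [stem(t) for t in tokens if t not in STOPWORDS]

print(preprocess("Who killed the Prime Minister?"))  # ['kill', 'prime', 'minister']
```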
Question Analysis
Question Tokenization
Question Classification
Corpus Management
Various database tables are created to manage the vast data:
InfoKeywords, QuestionKeyword, QuestionAnswer, CorpusSentences, Abbreviations, Apostrophes, StopWords
Answer Retrieval
Answer Searching
Answer Generation
Answer Rescoring
Handling problems caused by linguistic phenomena using shallow-parsing-based algorithms
Intelligence Incorporation
Semantic Symmetry
Ambiguous Modification
Learning
Rote Learning
Feedback
Can Improve / Satisfactory / Wrong Answer / Loose criterion
Automated Classification
Results
Preciseness
Response Time
Adaptability
Preciseness
                                                 ENLIGHT   Basic Keyword Matching
Average number of sentences returned as answer   3         34.6
Average number of correct sentences              2.63      6
Average precision                                84 %      32 %
Response Time (ENLIGHT vs Sapere)

Type of data and no. of words               QTAG (used in ENLIGHT)   Minipar (used in Sapere)
News extract, Times of India, 202 words     1.71 s                   2.88 s
Reply, START QA System, 251 words           1.89 s                   3.11 s
Google Search Engine result                 1.55 s                   2.86 s
Yahoo Search Engine results                 1.67 s                   3.13 s
AVERAGE                                     1.705 s                  2.995 s
Adaptability
Handling Additional Keywords
Questions like ‘Who killed the Prime Minister?’ can also be handled by the ENLIGHT system.
Use of synonyms
If the question and the answer contain synonyms, the ENLIGHT system can associate the two words using the learning phase.
References
L. Hirschman, R. Gaizauskas, Natural language question answering: the view from here, Natural Language Engineering, 7(4), December 2001.
Manish Joshi, Rajendra Akerkar, The ENLIGHT System, Intelligent Natural Language System, Journal of Digital Information Management, June 2007.