Abstract – Search engines return ranked documents in response to
a query, and the user must then navigate through them to find
the correct answer. This wastes the user’s time, and the need for
automated question answering systems therefore becomes more
urgent. What is needed is a system capable of returning an exact
and concise answer to a question posed in natural language.
Question answering systems (QAS) address this problem. The basic
aim of a QAS is to provide a short and correct answer, saving the
user navigation time. Natural Language Processing plays an
important role in developing any QAS. This paper presents
implementation approaches for various categories of QAS, namely
closed domain QAS, open domain QAS, web based QAS, Information
Retrieval or Information Extraction (IR/IE) based QAS, and rule
based QAS, which should be helpful for new directions of research
in this area.
Keywords – Information Extraction (IE), Information Retrieval
(IR), Natural Language Processing (NLP), Question Answering
System (QAS), Search Engines.
I. INTRODUCTION
With advances in technology it has become very easy to fetch
required information with a single mouse click. Search engines
serve their users efficiently, but their drawback lies in the
form of the results they provide. Users must search for the
exact information that answers their query among the many
results returned for it. Question answering systems (QAS) can be
used to tackle this issue. The first QAS were developed in the
1960s. A QAS can be developed for a closed domain such as
medicine, education, or construction, or for the open domain,
where the user can get an answer to almost any query. The
systems developed in the 1990s were mostly domain specific and
used a natural language (NL) interface to improve their
efficiency. Today’s systems, on the other hand, mostly use
various NLP techniques. The basic motive behind the development
of any question answering system is to provide a concise
response to a question posed in natural language in order to
save the user’s time and effort.
Nowadays ample information is available on the internet and can
be easily accessed. This information is useful to users, but
selecting the most relevant item is confusing, and it is
likewise a problem for computer applications to select the most
suitable item from a list of responses. Information extraction
and retrieval from various databases, the WWW, and various web
sites is the most important application area for QAS. In this
paper we focus on the implementation approaches for various
types of QAS: closed domain QAS, web based QAS, Information
Retrieval or Information Extraction (IR/IE) based QAS, rule
based QAS, and open domain QAS.
II. SYSTEM OVERVIEW
QAS is one of the most important applications of information
retrieval. The basic motive is to retrieve the correct answer to
a question posed in natural language from a collection of
documents (such as the WWW or a local database collection). An
efficient QAS requires more complex natural language processing
(NLP) techniques than other information retrieval systems such
as document retrieval, and hence it can be regarded as the next
step beyond search engines [1][2]. QAS research attempts to deal
with a wide range of question types, including factoid, long
answer, definition, how, wh-type, semantically complex, and
multi-lingual questions. QAS are classified into two main types
[14]: open domain QAS and closed domain QAS.
Open domain question answering deals with questions about
nearly anything, relying on broad world knowledge and a general
ontology. On the other hand, these systems must handle a huge
amount of data to extract the most relevant answer.
Closed-domain question answering deals with questions within a
specific domain (for example, medicine or construction) and can
be seen as an easier task than open domain question answering
because NLP systems can exploit domain-specific knowledge, which
is frequently formalized in ontologies. Alternatively,
closed-domain might refer to a situation where only limited
types of questions are accepted, such as questions asking for
descriptive rather than procedural information [1][2].
Implementation Approaches for Various
Categories of Question Answering System
Ms. Pooja P. Walke, Mr. Shivkumar Karale
Dept of Computer Technology
Yeshwantrao Chavan College of Engg.
Nagpur, India
Proceedings of 2013 IEEE Conference on Information and Communication Technologies (ICT 2013)
978-1-4673-5758-6/13/$31.00 © 2013 IEEE 402
A QAS consists of three main modules, each of which plays a
vital role: the question classification module, the information
retrieval module, and the answer extraction module.
A. Question Classification Module
The question classification module plays a primary role in a
QAS: it categorizes the question by type. When extracting an
answer from a large collection of databases and texts, it is
very important for the system to know what it is looking for, so
it needs to classify questions by type [4]. Broadly, questions
can be classified as factoid, long answer, definition, how,
wh-type, semantically complex, multi-lingual, and hypothetical
questions (the question types are listed in Table 1 of [6]).
Classifying the question correctly is very important; if it goes
wrong, it affects the working of the other modules. Once
classification is done, the module derives the expected answer
type, extracts the most relevant keywords, and reformulates the
question into multiple semantically equivalent questions.
Reformulating a query into queries of similar meaning is also
known as query expansion, and it boosts the recall of the
information retrieval system [6].
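The three outputs of this module (expected answer type, keywords, and reformulated queries) can be sketched in a few lines. This is only an illustrative sketch, not the implementation of any cited system: the answer-type mapping, the stop-word list, and the function names are assumptions introduced here, and the synonym table is supplied by the caller.

```python
import re

# Hypothetical mapping from question word to expected answer type.
ANSWER_TYPES = {
    "who": "PERSON",
    "where": "LOCATION",
    "when": "DATE",
    "why": "REASON",
    "how": "MANNER",
    "what": "DEFINITION",
}

# Small illustrative stop-word list; a real system would use a fuller one.
STOP_WORDS = {"who", "where", "when", "why", "how", "what", "is", "was",
              "the", "a", "an", "of", "in", "did", "does", "do", "to"}


def classify_question(question):
    """Return the expected answer type for the first question word found."""
    for token in re.findall(r"[a-z]+", question.lower()):
        if token in ANSWER_TYPES:
            return ANSWER_TYPES[token]
    return "OTHER"


def extract_keywords(question):
    """Keep the content words of the question, in their original order."""
    tokens = re.findall(r"[a-z0-9]+", question.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def expand_query(keywords, synonyms):
    """Return the original keyword list plus one variant per synonym."""
    queries = [list(keywords)]
    for i, word in enumerate(keywords):
        for alternative in synonyms.get(word, []):
            queries.append(keywords[:i] + [alternative] + keywords[i + 1:])
    return queries
```

For a question such as "Who invented the telephone?", the sketch yields the answer type PERSON, the keywords "invented" and "telephone", and one extra query per synonym supplied for either keyword.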
B. Information Retrieval Module
The IR module performs several operations. First, it selects
the paragraphs that are considered relevant to the input
question. It then filters the paragraphs or documents in order
to narrow the search area. A quality check can also be applied
to these documents to verify whether the selected paragraphs or
documents contain the correct answer. One can then use a sorting
algorithm such as radix sort to order the paragraphs, so that
the most appropriate paragraph, where the exact answer is
assumed to be available, comes first. Finally, the ordered
paragraphs are passed to the answer extraction module.
Information retrieval (IR) system recall is very important for
question answering. If no correct answer is present in the
retrieved documents, no further processing can find an answer.
The precision and ranking of candidate passages also affect
question answering performance in the IR phase. This module is
illustrated in Fig. 1 [7].
IR systems:
Use statistical methods
Rely on frequency of words in query, document, and
collection
Retrieve complete documents
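The selection, filtering, and ordering steps described above can be illustrated with a small frequency-based ranker. The sketch below assumes a smoothed TF-IDF weighting, which is one common statistical choice and not necessarily the one used in [7]; the function names are hypothetical, and Python's built-in sort stands in for the ordering step.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def rank_paragraphs(query, paragraphs):
    """Rank paragraphs by a simple TF-IDF score against the query terms.

    Returns (paragraph_index, score) pairs, best first. Paragraphs that
    share no term with the query are filtered out.
    """
    term_counts = [Counter(tokenize(p)) for p in paragraphs]
    total = len(term_counts)
    query_terms = set(tokenize(query))
    scored = []
    for index, counts in enumerate(term_counts):
        score = 0.0
        for term in query_terms:
            document_frequency = sum(1 for c in term_counts if term in c)
            if document_frequency == 0:
                continue
            # Smoothed inverse document frequency over the collection.
            idf = math.log((total + 1) / (document_frequency + 1)) + 1.0
            score += counts[term] * idf
        if score > 0:
            scored.append((index, score))
    # Highest score first; ties keep the original paragraph order.
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored
```

The score uses exactly the three frequencies listed above: the words of the query, their counts in each paragraph, and how many paragraphs of the collection contain them.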
C. Answer Extraction Module
Answer extraction is the final component of a question
answering system and is what distinguishes it from a text
retrieval system in the usual sense. Answer extraction
technology is an influential and decisive factor in the final
results of a question answering system. It is therefore treated
as a module of the question answering system in its own right.
There are various ways to extract answers, but feature-based
ranking methods have been the mainstream of answer extraction
technology in recent years; for instance, neural networks [21],
maximum entropy [22], SVM [23], and logistic regression [24]
play a vital role. The development of semantic features in NLP
has been somewhat slow, and hence feature-based technology
dominates the work. One of the open issues of QAS is improving
the correctness of answer extraction under existing technology.
Answers can be extracted via one of two approaches: 1) the
system centered approach and 2) the answer centered approach
[25].
The tasks that an answer extraction module needs to perform are
as follows. As the final phase in the QA architecture, the
answer processing module is responsible for identifying,
extracting, and validating answers from the set of ordered
paragraphs passed to it from the information retrieval module.
It is required to:
1) Identify the answer candidates within the filtered, ordered
paragraphs through parsing. A POS tagger can be used for this
after parsing the question. Heuristic measures are also a good
option.
2) Extract the answer by choosing only the word or phrase that
answers the submitted question, using a set of heuristics.
Researchers have presented various heuristic measures for
extracting the correct answer from the answer candidates.
Extraction can be based on the distance between keywords, the
number of keywords matched, and other similar heuristic metrics.
3) Validate the answer by providing confidence in its
correctness. There are several ways to validate the final
answer, and doing so is always recommended. One can use a
lexical resource such as WordNet to verify the correctness of
the final answer. Domain-specific knowledge sources are another
option and can be used to check questions belonging to a
specific domain. Web search is also a good option for validating
the correctness of domain-specific knowledge. The simplest and
most attractive technique exploits the redundancy of the web,
validating answers based on frequency counts of question-answer
collocations.
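The extraction heuristics in step 2 can be made concrete with a small scoring function. The sketch below is an assumption-laden illustration: it combines the number of distinct keywords matched with the mean token distance between a candidate and those keywords, using a formula chosen here for simplicity rather than taken from the cited work. The `is_candidate` predicate is a hypothetical stand-in for the candidate identification of step 1.

```python
import re


def score_candidates(paragraph, keywords, is_candidate):
    """Score candidate tokens by keywords matched and distance to them.

    keywords is a set of lowercase question keywords; is_candidate is a
    caller-supplied predicate that decides which tokens may be answers.
    Returns (token, score) pairs, best first.
    """
    tokens = re.findall(r"[A-Za-z0-9]+", paragraph)
    lowered = [t.lower() for t in tokens]
    keyword_positions = [i for i, t in enumerate(lowered) if t in keywords]
    if not keyword_positions:
        return []
    # Number of distinct question keywords present in the paragraph.
    matched = len({lowered[i] for i in keyword_positions})
    scored = []
    for i, token in enumerate(tokens):
        if lowered[i] in keywords or not is_candidate(token):
            continue
        mean_distance = (sum(abs(i - k) for k in keyword_positions)
                         / len(keyword_positions))
        # More matched keywords and a smaller distance give a higher score.
        scored.append((token, matched / (1.0 + mean_distance)))
    return sorted(scored, key=lambda pair: -pair[1])
```

For a WHEN question, passing `str.isdigit` as `is_candidate` restricts the candidates to numeric tokens, so the year closest to the question keywords is ranked first.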
III. CHARACTERISTICS OF QAS
QAS can broadly be categorized into two groups. The first
relies on various information retrieval methods and NLP, while
the second depends on reasoning along with natural language.
These QAS have distinct characteristics, which are compared
along dimensions such as the techniques used, the domains, the
responses, the questions dealt with, and so on. Table-I provides
the details of the comparison of these QAS [5].
Table-I Characterization of QA System
IV. CLOSED DOMAIN QAS
A closed-domain QAS works on a document collection restricted
in subject and volume. It has some characteristics that
distinguish it from other categories of QAS, especially
open-domain QA, which works over a large document collection,
including the WWW. In closed-domain QA, the database is quite
small and specific to the targeted domain, so correct answers to
a query may often be found in only a very few documents; the
system does not have a large retrieval set abundant in good
candidates for selection. The QAS needs to answer all types of
questions, whether simple or complex, if it is to be used as a
question answering system for a company or organization. The
system should return a complete answer, which can be long and
complex, because it has to, e.g., clarify the context of the
problem posed in the question, explain the options of a service,
give instructions or procedures, etc.
Closed-domain QAS has a long history, beginning with systems
working over databases (e.g., BASEBALL (Green et al., 1961) and
LUNAR (Woods, 1973)). One way to implement a closed domain QAS
is to follow the steps described below.
The question is posed in natural language. Once the question is
submitted, it is first examined to determine its type, since the
approach depends on the type of question. To determine the type,
the question is first parsed using an appropriate parser. For
example, if the question is in Hindi, then a Hindi parser is
used to parse it properly. The simplest way of parsing is POS
tagging. After parsing, the question is transformed into a query
by query formulation, and the query is finally fed into the
retrieval engine to extract answers. Query formulation can be
done in various ways as needed. One may use an entity file to
recognize the domain-specific entities in the question. This
approach uses a hash table to compare individual words in the
question with the data in the file. While searching the data,
semantically related terms should also be considered, and for
this purpose the query needs to be expanded. Query expansion
enhances the search by including semantically related terms so
as to retrieve texts in which the query terms do not explicitly
appear. It is advantageous to have a dictionary of one’s own for
the specified domain; otherwise one can use an existing
dictionary such as WordNet. A dictionary is advantageous because
it provides multiple words with a similar meaning, that is,
synonyms, for the word being compared. As a result, the semantic
structure of the question can be tapped more effectively.
Finally, the answer extraction process is carried out. To
extract the answer from the collection of documents, an
information retrieval engine is needed that can analyze the
keywords and passages in detail. The answers to a query are
locations in the text where there is local similarity to the
query, and the similarity is assessed by a mechanism that
employs the distance between keywords as one of its parameters
[5] [20].
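The entity file lookup and the dictionary-based expansion described above can be sketched as follows. The example assumes a tiny medical domain; the entity table, the synonym dictionary, and all function names are hypothetical and stand in for a real entity file and a domain dictionary or WordNet. A Python dict plays the role of the hash table.

```python
import re

# Hypothetical contents of a domain entity file, loaded into a hash table.
ENTITY_TABLE = {
    "aspirin": "DRUG",
    "fever": "SYMPTOM",
    "headache": "SYMPTOM",
}

# Hypothetical domain dictionary mapping a term to its synonyms.
DOMAIN_SYNONYMS = {
    "fever": {"pyrexia"},
    "aspirin": {"acetylsalicylic"},
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def recognize_entities(question):
    """Look each question word up in the entity hash table."""
    return {t: ENTITY_TABLE[t] for t in tokenize(question) if t in ENTITY_TABLE}


def expand_terms(terms):
    """Add the known synonyms of every term to the set of query terms."""
    expanded = set(terms)
    for term in terms:
        expanded |= DOMAIN_SYNONYMS.get(term, set())
    return expanded


def retrieve(question, documents):
    """Return the documents containing an entity term or one of its synonyms."""
    terms = expand_terms(recognize_entities(question))
    return [d for d in documents if terms & set(tokenize(d))]
```

With expansion, a question about "fever" also retrieves a document that mentions only "pyrexia", which is the case where the query term does not explicitly appear in the text.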
V. OPEN DOMAIN QAS
The aim of an open domain question answering system is to
respond to the user’s question. The reply is mostly a short text
rather than a lengthy list of relevant documents. This type of
system makes use of multiple techniques from computational
linguistics, information retrieval, and knowledge
representation to search for answers.
As in other types of QAS, the query is accepted in the form of
a question in natural language. First the type of question is
identified, and then an information retrieval system is used to
find a set of documents containing the correct keywords.
A tagger and an NP/verb group chunker can also be used to
determine whether the right entities and relations are
mentioned in the retrieved documents. To find the correct person
or location for questions such as “Who” or “Where”, one can use
a named entity recognizer, which identifies candidate answers in
the retrieved documents or database. The paragraphs that are
relevant to the answer are then selected for ranking.
A vector space model [12] can be used as a strategy for
classifying the candidate answers. The system also needs to
check whether the answer is of the correct type, as determined
in the question type analysis stage. Several techniques can be
used to validate the candidate answers. A score is then given to
each candidate according to the number of question words it
contains and how close these words are to the candidate; the
more words and the closer they are, the better. The answer is
then translated into a compact and meaningful representation by
parsing. Finally, the answer is passed to the user as the
response to the query.
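As an illustration of the vector space model, the sketch below represents the question and each candidate passage as raw term-count vectors and ranks passages by cosine similarity. Raw counts are an assumption made for brevity; weighted variants such as TF-IDF are equally common, and the function names are introduced here only for the example.

```python
import math
import re
from collections import Counter


def to_vector(text):
    """Represent a text as a sparse vector of term counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-count vectors."""
    dot = sum(a[term] * b[term] for term in a if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_candidates(question, passages):
    """Return (similarity, passage) pairs, most similar passage first."""
    question_vector = to_vector(question)
    scored = [(cosine_similarity(question_vector, to_vector(p)), p)
              for p in passages]
    return sorted(scored, key=lambda pair: -pair[0])
```

A passage sharing no term with the question scores 0, and a passage with the same term distribution as the question scores 1, so the ranking reflects how many question words a passage contains relative to its length.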
The most important challenge for an open domain system is its
database. The efficiency of any system depends on how well the
database is arranged and maintained, and this holds especially
for an open domain QAS, as it aims to answer questions about
almost everything.
VI. WEB BASED QAS
Nowadays the internet is becoming a giant source of
information. A tremendous amount of information is available
online, making the Web an ideal source of answers to a large
variety of questions. The most important property of any web
based QAS is its “snippet-tolerant” property, which allows it to
provide correct responses while searching for answers through
search engines such as Google and Yahoo. Whenever a query is
passed to a search engine, it returns a list of results in the
form of web documents. Each result usually carries the URL, the
title, and some string segments of the related web document. The
title and the string segments are the “snippets”. The
snippet-tolerant property is important for any web based QAS
because it is an online question answering system, whose
efficiency depends on the time required to download the web
documents and then analyze them. This time should be as short as
possible.
The user submits the query to the system in natural language.
The system first identifies the type of question; it then
submits the question to the search engine and grabs the top
search results. Each search result is a snippet. The system may
use a Support Vector Machine (SVM) [9] to classify the
questions. After the question type has been identified, the
system extracts all information of that type from the snippets
as plausible answers. For this one may use an HMM-based named
entity recognizer [11] or any other suitable technique, together
with some heuristic rules. For answer selection one can use
snippet clustering [13]. With the standard vector space model it
has been observed that, in the search results for a question,
the count of the correct answer is usually greater than that of
incorrect ones [13]. Finally, the final answer is evaluated. The
LAMP system [10] and AskMSR [15] are examples of such systems.
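The observation that the correct answer tends to occur more often than incorrect ones across the snippets suggests a simple voting scheme. The sketch below is an assumption-based illustration, not the method of LAMP or AskMSR: snippets are passed in as plain strings (no search engine is called), and `extract_years` is a hypothetical extractor for DATE-type questions standing in for a named entity recognizer.

```python
import re
from collections import Counter


def extract_years(snippet):
    """Hypothetical extractor for DATE-type questions: four-digit years."""
    return re.findall(r"\b(?:1[0-9]{3}|20[0-9]{2})\b", snippet)


def vote_on_answers(snippets, extract_candidates):
    """Count in how many snippets each candidate answer occurs.

    Each candidate is counted once per snippet so that a single long
    snippet cannot dominate the vote. Returns (candidate, votes) pairs,
    most frequent first.
    """
    votes = Counter()
    for snippet in snippets:
        votes.update(set(extract_candidates(snippet)))
    return votes.most_common()
```

Because only the snippets are analyzed, no full web document has to be downloaded, which is the practical benefit of the snippet-tolerant property described above.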
VII. INFORMATION RETRIEVAL OR INFORMATION
EXTRACTION (IR/IE) BASED QAS
Question answering, the process of extracting answers to
natural language questions, is not simply Information Retrieval
(IR) or Information Extraction (IE) but much more. IR systems
are used to locate relevant documents that relate to a query,
but they fail to specify exactly where the answer is. IR uses a
query keyword matching approach to fetch documents from an
indexed document collection. IE systems, on the other hand, are
used to extract the required information from the fetched
documents, provided the domain of extraction is well defined.
Information extracted by IE systems is in the form of slot
fillers for predefined templates. QAS technology is one step
ahead of IR and IE systems. It uses both IR and IE and provides
exact, concise answers formulated naturally [16].
There is a difference between IR and IE systems. An IR system
works on the interaction between human and computer when used to
search for the answer to a posed query. The efficiency of an IR
system depends on how well the machine is programmed to match
the user’s query with the available documents so as to provide
the most relevant ones. IR systems retrieve the most relevant
documents from the available database, but IR alone cannot give
the exact answer. This is where IE comes in: IE systems are used
to extract the correct answer from the retrieved documents,
using various natural language processing techniques. Both kinds
of system demand a well arranged and maintained database. IR
systems also face various challenges in proving their efficiency
[18].
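The division of labour between IR and IE can be illustrated with a toy pipeline: a keyword match selects a document, and a predefined template is then filled from it. The template below (an "invention" event with inventor, invention, and year slots), the regular expression, and the function names are all hypothetical and serve only to show what slot fillers look like.

```python
import re

# Hypothetical template for an "invention" event with three slots.
INVENTION_TEMPLATE = re.compile(
    r"(?P<inventor>[A-Z][a-z]+(?: [A-Z][a-z]+)*) invented the "
    r"(?P<invention>[a-z ]+?) in (?P<year>\d{4})"
)


def fill_template(document):
    """IE step: return the slot fillers found in the document, or None."""
    match = INVENTION_TEMPLATE.search(document)
    return match.groupdict() if match else None


def answer_question(slot, keyword, documents):
    """IR step (keyword match) followed by the IE step (slot filling)."""
    for document in documents:
        if keyword.lower() in document.lower():
            filled = fill_template(document)
            if filled:
                return filled[slot]
    return None
```

The IR step alone would return the whole matching document; it is the template that turns the document into an exact answer such as a year or a name.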
VIII. RULE BASED QAS
This kind of system is one of the most important and efficient
QAS. Its basic application is reading comprehension. In the
United States the reading ability of children is generally
evaluated by giving them reading comprehension tests. What is a
comprehension test? In such a test a short story is given to
children as a paragraph. They need to understand it and then
answer the questions that follow the story. Children need to
understand the aspects of the story to answer the questions.
Understanding the story is an easy task for children compared
with a computer system, because a computer system is ultimately
an electronic device that needs to be programmed to perform any
required task. So when we want a computer system to take a
comprehension test, we first need to supply a program that makes
the system understand the aspects of the story so that it
answers the questions correctly.
The program that makes this possible uses natural language
processing together with lexical and semantic heuristics, an
understanding that is difficult to achieve with broad-coverage
techniques. These comprehension tests are difficult and
challenging because they can cover almost any topic.
Developing a rule based QAS is a challenging task, as the
developer needs to consider virtually all the possible topics on
which the system may be tested. The basic types of questions
that can be asked are familiar [5]. The question types generally
covered are WHO, WHAT, WHEN, WHERE, and WHY. For a rule based
QAS the developer needs to treat each type as a separate group
and implement separate rules for each, because each type of
question looks for a different kind of answer. For example, a
WHO question looks for a PERSON NAME as the answer, while a
WHERE question looks for a LOCATION.
Once a question is posed to a rule based system, the very first
task is to parse it using a parser. Syntactic analysis is
optional. The system then applies NLP techniques such as
morphological analysis, part-of-speech tagging, semantic class
tagging, and entity recognition. Hand-crafted rules can be used
to find the correct answer in the given story. These rules are
applied to every sentence in the story, including the title; the
title, however, is not considered for WHY questions. Rules such
as dateline (for WHEN and WHERE questions) and the wordmatch
function can be applied to the sentences as necessary. Each rule
awards a predefined number of points to each sentence as a
score. Finally, the sentence with the highest score is returned
as the answer. Writing rules is also a tough job, as there are
many ways to write them [19].
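The scoring scheme described above can be sketched as follows. This is a simplified illustration in the spirit of [19], not a reproduction of its rule set: the point values, the name list, the time-word list, and the individual rules are assumptions made here, and simple word lists stand in for semantic class tagging and entity recognition. The dateline rule and the special handling of the title are omitted.

```python
import re

STOP_WORDS = {"the", "a", "an", "did", "do", "does", "who", "what", "when",
              "where", "why", "it", "to", "in", "on", "she", "he", "is", "was"}

# Hypothetical stand-ins for semantic class tagging and entity recognition.
KNOWN_NAMES = {"maria"}
TIME_WORDS = {"monday", "tuesday", "wednesday", "thursday", "friday",
              "saturday", "sunday", "yesterday", "today", "tomorrow"}

# Illustrative point values awarded by a rule when it fires.
GOOD_CLUE, CONFIDENT = 4, 6


def content_words(text):
    """Lowercased words of the text with stop words removed."""
    return {w for w in re.findall(r"[a-z]+", text.lower())
            if w not in STOP_WORDS}


def word_match(question, sentence):
    """One point for every content word shared by question and sentence."""
    return len(content_words(question) & content_words(sentence))


def who_rule(sentence):
    """WHO questions look for a sentence that mentions a person name."""
    return CONFIDENT if content_words(sentence) & KNOWN_NAMES else 0


def where_rule(sentence):
    """WHERE questions look for a location preposition before a proper noun."""
    return GOOD_CLUE if re.search(r"\b(?:in|at|near) [A-Z]", sentence) else 0


def when_rule(sentence):
    """WHEN questions look for a sentence containing a time expression."""
    return GOOD_CLUE if content_words(sentence) & TIME_WORDS else 0


RULES = {"who": who_rule, "where": where_rule, "when": when_rule}


def answer(question, sentences):
    """Return the story sentence with the highest total score."""
    question_type = question.split()[0].lower()
    rule = RULES.get(question_type, lambda sentence: 0)
    return max(sentences, key=lambda s: word_match(question, s) + rule(s))
```

Keeping one rule function per question type mirrors the requirement that each type be treated as a separate group, and new rules can be added to `RULES` without touching the scoring loop.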
REFERENCES
[1] Demner-Fushman, Dina, "Complex Question Answering Based on Semantic Domain Model of Clinical Medicine", OCLC's Experimental
Thesis Catalog, College Park, Md.: University of Maryland (United States),
2006.
[2] Doan-Nguyen Hai, Leila Kosseim, "The Problem of Precision in
Restricted-Domain Question Answering. Some Proposed Methods of Improvement", In Proceedings of the ACL 2004 Workshop on Question
Answering in Restricted Domains, Barcelona, Spain, Publisher of
Association for Computational Linguistics, July 2004, PP.8-15.
[3] Green, W., Chomsky, C., Laugherty, K. "BASEBALL: An automatic question-answerer". Proceedings of the Western Joint Computer
Conference, 1961, PP. 219-224.
[4] Figueira, H. Martins, A. Mendes, A.Mendes, P.Pinto, C. Vidal, D
,"Priberam's Question Answering System in a Cross-Language Environment”, LECTURE NOTES IN COMPUTER SCIENCE, Volume
4730, 2007,PP. 300-309.
[5] Vanitha Guda , Suresh Kumar Sanampudi, I.Lakshmi Manikyamba, “Approaches for Question Answering Systems”, International Journal
of Engineering Science and Technology (IJEST), 2011.
[6] Mohammad Reza Kangavari, Samira Ghandchi, Manak Golpour “A New
Model for Question Answering Systems“, World Academy of Science, Engineering and Technology 18 2008
[7] Mukul Aggarwal,”Information Retrieval and Question Answering NLP Approach: An Artificial Intelligence Application”, International Journal
of soft Computing and Engineering (IJSCE) ISSN: 2231-2307, Volume-1, Issue-NCAI2011, June 2011, NCAI2011, 13-14 May 2011,
Jaipur, India
[8] Hai Doan-Nguyen & Leila Kosseim, “Improving the Precision of a
Closed-Domain Question-Answering System with Semantic Information”
[9] N. Cristianini and J. Shawe-Taylor. An Introduction to Support Vector Machines. Cambridge University Press, Cambridge, UK, 2000.
[10] Dell Zhang and Wee Sun Lee “A Web-based Question Answering
System”
[11] D. Bikel, R. Schwartz, and R. Weischedel. “An Algorithm that Learns
What's in a Name”. Machine learning, 34(1-3) pp. 211--231, 1999.
[12] R. Baeza-Yates and B. Ribeiro-Neto. Modern Information
Retrieval. Addison Wesley, 1999.
[13] S. Dumais, M. Banko, E. Brill, J. Lin and A. Ng. “Web Question Answering: Is More Always Better?” In Proceedings of SIGIR'02,
pp.291-298, Aug 2002.
[14] Hai Doan-Nguyen, Leila Kosseim: “Improving the Precision of a
Closed-Domain Question-Answering System with Semantic Information”, ACL 2004 Workshop on Question Answering in
Restricted Domain, 2004- acl.ldc.upenn.edu
[15] E. Brill, S. Dumais and M. Banko (2002). An analysis of the AskMSR
question-answering system. In Proceedings Of 2002 Conference on Empirical Methods in Natural Language Processing (EMNLP 2002).
[16] Dan Moldovan, Mihai Surdeanu “On the Role of Information Retrieval and Information Extraction in Question Answering Systems”.
[17] Barry Schiffman, Kathleen R. McKeown “Question Answering using
Integrated Information Retrieval and Information Extraction”
[18] James Allan (editor), Jay Aslam, Nicholas Belkin, Chris Buckley, Jamie
Callan, Bruce Croft (editor), Sue Dumais, Norbert Fuhr, Donna Harman,“Challenges in Information Retrieval and Language
Modeling”, Report of a Workshop held at the Center for Intelligent
Information Retrieval, University of Massachusetts Amherst, September 2002
[19] Ellen Riloff and Michael Thelen, “A Rule-based Question Answering
System for Reading Comprehension Tests”
[20] Shalini Stalin 1, Rajeev Pandey 2, Raju Barskar , “Web Based
Application for Hindi Question Answering System”, International Journal of Electronics and Computer Science Engineering, ISSN-
2277-1956
[21] Marius A Pasca. High performance, open-domain question answering
from large text collections[D]. USA; University of Southern Methodist, 2001.
[22] Abraham Ittycheriah. Trainable question answering systems[D]. USA: The State University of New Jersey, 2001
[23] Jun Suzuki, Yutaka Sasaki, Eisaku Maeda. SVM answer selection for
open-domain question answering[A]. 19th International Conference on Computational Linguistics (Coling-2002) [C] Taipei: Howard
International House, 2002. 974-980.
[24] Peng Li, Yi Guan, Xiao-Iong Wang. Answer extraction based on system
similarity model and stratified sampling logistic regression in rare data International Journal of Computer Science and Network Security,
2006,6(3):189-196
[25] Muthukrishnan Ramprasath1 and Shanmugasundaram Hariharan2 , “A
Survey on Question Answering System ”, International Journal of Research and Reviews in Information Sciences (IJRRIS) Vol. 2, No. 1,
March 2012, ISSN: 2046-6439