[IEEE 2013 IEEE Conference on Information & Communication Technologies (ICT) - Thuckalay, Tamil Nadu, India (2013.04.11-2013.04.12)] 2013 IEEE CONFERENCE ON INFORMATION AND COMMUNICATION

Abstract— Search engines return ranked documents in response to a
query, and the user must navigate through them to find the correct
answer. This wastes the user's time and makes the need for automated
question answering systems more urgent. What is needed is a system
that can reply to a question posed in natural language with an exact
and concise answer. Question answering systems (QAS) address this
problem. The basic aim of a QAS is to provide a short and correct
answer to the user, saving his or her navigation time. Natural
Language Processing plays an important role in developing any QAS.
This paper presents implementation approaches for various categories
of QAS, namely Closed Domain QAS, Open Domain QAS, Web based QAS,
Information Retrieval or Information Extraction (IR/IE) based QAS,
and Rule based QAS, which will be helpful for new directions of
research in this area.

Keywords – Information Extraction (IE), Information Retrieval

(IR), Natural Language Processing (NLP), Question Answering

System (QAS), Search Engines.

I. INTRODUCTION

With the advancement of technology, it has become very easy to fetch
required information with a single mouse click. Search engines serve
their users efficiently, but their drawback lies in the results they
provide: users must pick out the information that suits their query
from the many results returned in response. Question answering
systems (QAS) can be used to tackle this issue. The first QAS were
developed in the 1960s. A QAS can be developed for a closed domain
such as medicine, education, or construction, or for the open domain,
where the user can get an answer to almost any query. The systems
developed in the 1990s were mostly domain specific and relied on
natural language (NL) interfaces, whereas today's systems mostly use
various NLP techniques. The basic motive behind any question
answering system is to provide a concise response to a question posed
in natural language, in order to save the user's time and effort.
Nowadays ample information is available on the Internet and can be
easily accessed. This information is useful to users, but selecting
the most relevant item is confusing, and it is equally difficult for
computer applications to select the most suitable response from a
list of candidates. Information extraction and retrieval from
databases, the WWW, and other sources is the most important
application area for QAS. In this paper we focus on implementation
approaches for various types of QAS: Closed domain QAS, Open domain
QAS, Web based QAS, Information Retrieval or Information Extraction
(IR/IE) based QAS, and Rule based QAS.

II. SYSTEM OVERVIEW

QAS is one of the most important applications of information
retrieval. Its basic motive is to retrieve the correct answer to a
question posed in natural language from a collection of documents
(such as the WWW or a local database). An efficient QAS requires more
complex natural language processing (NLP) techniques than other
information retrieval systems such as document retrieval, and hence
it can be regarded as the next step beyond search engines [1][2]. QAS
research attempts to deal with a wide range of question types,
including factoid, long-answer, definition, how, wh-type,
semantically complex, and multilingual questions. QAS are classified
into two main types [14]: open domain QAS and closed domain QAS.

Open domain question answering deals with questions about nearly
anything and relies on extensive world knowledge and general
ontologies. On the other hand, these systems must process huge
amounts of data to extract the most relevant answer.

Closed-domain question answering deals with questions within a
specific domain (for example, medicine or construction) and can be
seen as an easier task than open domain QAS because NLP systems can
exploit domain-specific knowledge, which is frequently formalized in
ontologies. Alternatively, closed-domain might refer to a situation
where only limited types of questions are accepted, such as questions
asking for descriptive rather than procedural information [1][2].

Implementation Approaches for Various Categories of Question Answering System

Ms. Pooja P. Walke, Mr. Shivkumar Karale

Dept of Computer Technology

Yeshwantrao Chavan College of Engg.

Nagpur, India

Proceedings of 2013 IEEE Conference on Information and Communication Technologies (ICT 2013)

978-1-4673-5758-6/13/$31.00 © 2013 IEEE 402

A QAS consists of three main modules, each of which plays a vital
role: the question classification module, the information retrieval
module, and the answer extraction module.

A. Question Classification Module

The question classification module plays a primary role in a QAS: it
categorizes the question based on its type. When extracting an answer
from a large collection of databases and texts, the system must know
what it is looking for, so it needs to classify questions by type
[4]. Broadly, questions can be classified as factoid, long-answer,
definition, how, wh-type, semantically complex, multilingual, and
hypothetical questions (the types of questions are listed in Table 1
of [6]). Classifying the question correctly is very important; if
classification goes wrong, it affects the working of the other
modules. Once classification is done, the module derives the expected
answer type, extracts the most relevant keywords, and reformulates
the question into multiple semantically equivalent questions.
Reformulating a query into queries of similar meaning is also known
as query expansion, and it boosts the recall of the information
retrieval system [6].
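The classification step described above can be illustrated with a minimal Python sketch. It is not the method of any cited system: the mapping `ANSWER_TYPES`, the stop-word list, and the function name `classify_question` are assumptions introduced here purely for illustration, and a practical classifier would be considerably richer.

```python
import re

# Hypothetical mapping from question word to expected answer type.
ANSWER_TYPES = {
    "who": "PERSON",
    "where": "LOCATION",
    "when": "DATE",
    "why": "REASON",
    "how": "MANNER",
    "what": "DEFINITION",
}

# Small illustrative stop-word list (an assumption, not from the paper).
STOP_WORDS = {"the", "a", "an", "is", "was", "of", "in", "did", "do", "does", "to"}


def classify_question(question):
    """Return (expected_answer_type, keywords) for a natural-language question."""
    tokens = re.findall(r"[a-z0-9]+", question.lower())
    answer_type = "OTHER"
    for token in tokens:
        if token in ANSWER_TYPES:
            answer_type = ANSWER_TYPES[token]  # first question word decides the type
            break
    # Keywords are the remaining content words, in their original order.
    keywords = [t for t in tokens if t not in STOP_WORDS and t not in ANSWER_TYPES]
    return answer_type, keywords
```

For example, `classify_question("Who invented the telephone?")` yields the expected answer type `PERSON` together with the keywords `invented` and `telephone`, which can then be passed on to the retrieval module.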

B. Information Retrieval Module

The IR module performs several operations. First, it selects the
paragraphs that are considered relevant to the input question. It
then filters the paragraphs or documents in order to narrow the
search area. A quality check can also be applied to these documents
to verify whether the selected paragraphs or documents contain the
correct answer. The paragraphs can then be ordered, for example with
radix sort, so that the paragraph most likely to contain the exact
answer comes first. Finally, the ordered paragraphs are passed to the
answer extraction module.

The recall of the information retrieval (IR) system is very important
for question answering: if no correct answer is present in the
retrieved documents, no further processing can find one. The
precision and ranking of candidate passages in the IR phase also
affect question answering performance. This module is illustrated in
Fig. 1 [7].

IR systems:

Use statistical methods

Rely on frequency of words in query, document, and

collection

Retrieve complete documents
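As a rough illustration of frequency-based retrieval, the following Python sketch selects, filters, and orders paragraphs using a simple TF-IDF overlap score. The weighting formula, the function names, and the use of a general-purpose sort in place of radix sort are assumptions made for this example; they are not taken from [7].

```python
import math
import re
from collections import Counter


def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def rank_paragraphs(question, paragraphs):
    """Order paragraphs by a simple TF-IDF overlap score with the question."""
    docs = [Counter(tokenize(p)) for p in paragraphs]
    n_docs = len(docs)
    query_terms = set(tokenize(question))

    def idf(term):
        # Smoothed inverse document frequency over the paragraph collection.
        df = sum(1 for d in docs if term in d)
        return math.log((1 + n_docs) / (1 + df)) + 1.0

    scored = []
    for index, doc in enumerate(docs):
        score = sum(doc[t] * idf(t) for t in query_terms)
        scored.append((score, index))
    # Highest score first; ties keep the original paragraph order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    # Filtering step: paragraphs sharing no term with the question are dropped.
    return [(paragraphs[i], s) for s, i in scored if s > 0]
```

Terms that occur in few paragraphs receive a higher weight, so a paragraph mentioning a rare query term is ranked above one that only shares common words with the question.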

C. Answer Extraction Module

Answer extraction is the final component of a question answering
system and is what distinguishes a question answering system from a
conventional text retrieval system. Answer extraction technology has
a decisive influence on the final results of a question answering
system, and it is therefore treated as a separate module.

There are various ways to extract answers, but feature-based ranking
methods have been the mainstream of answer extraction technology in
recent years; for instance, neural networks [21], maximum entropy
[22], SVM [23], and logistic regression [24] play a vital role. The
development of semantic features in NLP has been somewhat slow, and
hence feature-based techniques dominate the work. One of the open
issues in QAS is improving the correctness of answer extraction under
existing technology. An answer can be extracted via either of two
approaches: 1) the system centered approach and 2) the answer
centered approach [25].

The tasks an answer extraction module needs to perform are as
follows. As the final phase in the QA architecture, the answer
processing module is responsible for identifying, extracting, and
validating answers from the set of ordered paragraphs passed to it by
the information retrieval module. It is required to:

1) Identify the answer candidates within the filtered, ordered
paragraphs through parsing. A POS tagger can be used for this after
parsing the question. Heuristic measures are also a good option.

2) Extract the answer by choosing only the word or phrase that
answers the submitted question, through a set of heuristics.
Researchers have presented miscellaneous heuristic measures for
extracting the correct answer from the answer candidates. Extraction
can be based on the distance between keywords, the number of keywords
matched, and other similar heuristic metrics.

3) Validate the answer by providing confidence in its correctness.
There are several ways to validate the final answer, and doing so is
always recommended. One option is to use a lexical resource such as
WordNet to verify the correctness of the final answer. Another is to
use specific knowledge sources, which can also check questions
belonging to a specific domain. Web search is also a good option for
validating the correctness of domain specific knowledge. The most
attractive and simplest technique uses the redundancy of the Web to
validate answers based on frequency counts of question-answer
collocations.
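The extraction heuristics in step 2 can be sketched as follows. This is a minimal Python illustration under stated assumptions: candidates are single tokens, and the score adds `1 / (1 + distance)` for every question keyword found in the paragraph, so it rewards both the number of keywords matched and their closeness to the candidate. The scoring formula and function names are hypothetical and are not taken from any cited system.

```python
import re


def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def score_candidate(paragraph_tokens, candidate, keywords):
    """Score a single-token candidate by matched keywords and their proximity."""
    candidate = candidate.lower()
    if candidate not in paragraph_tokens:
        return 0.0
    position = paragraph_tokens.index(candidate)
    score = 0.0
    for keyword in keywords:
        distances = [abs(i - position)
                     for i, token in enumerate(paragraph_tokens)
                     if token == keyword.lower()]
        if distances:
            # Each matched keyword contributes more the closer it is.
            score += 1.0 / (1 + min(distances))
    return score


def extract_answer(paragraph, candidates, keywords):
    """Return the candidate with the highest heuristic score."""
    tokens = tokenize(paragraph)
    return max(candidates, key=lambda c: score_candidate(tokens, c, keywords))
```

Given the paragraph "Alexander Graham Bell invented the telephone in 1876 while living in Boston.", the keywords `invented` and `telephone`, and the candidates `Bell` and `Boston`, the sketch prefers `Bell` because both keywords lie closer to it.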

III. CHARACTERISTICS OF QAS

QAS can be broadly categorized into two groups. The first relies on
various information retrieval methods and NLP, while the other
depends on reasoning with natural language. These QAS have distinct
characteristics, which are compared along dimensions such as the
techniques used, the domains covered, and the responses and questions
they deal with. Table I provides the details of this comparison [5].

Table-I Characterization of QA System

IV. CLOSED DOMAIN QAS

A closed-domain QAS works on a document collection restricted in
subject and volume. This kind of QAS has characteristics that
distinguish it from other categories, especially open-domain QA,
which works over a large document collection, including the WWW. In
closed-domain QA, the database is quite small and specific to the
targeted domain, so for a given query the correct answers may often
be found in only a very few documents; the system does not have a
large retrieval set abundant in good candidates for selection. To be
usable as the question answering system of a company or organization,
the QAS needs to answer all types of questions, whether simple or
complex. The system should return a complete answer, which can be
long and complex, because it has to, e.g., clarify the context of the
problem posed in the question, explain the options of a service, or
give instructions or procedures. Closed-domain QAS has a long
history, beginning with systems working over databases (e.g.,
BASEBALL (Green et al., 1961) and LUNAR (Woods, 1973)). One way to
implement a closed domain QAS is to follow the steps below.

The question is posed in natural language. Once the question is
submitted, it is first examined to determine its type, since the
approach depends on the type of question. To determine the type, the
question is parsed with an appropriate parser. For example, if the
question is in Hindi, a Hindi parser is used. The simplest way to
parse is POS tagging. After parsing, the question is transformed into
a query by query formulation, and the query is finally fed into the
retrieval engine to extract answers. Query formulation can be done in
various ways as needed. One may use an entity file to recognize the
domain specific entities in the question; this approach uses a hash
table to compare the individual words of the question with the data
in the file. The search should also consider semantically related
terms, and for this purpose the query needs to be expanded. Query
expansion enhances the search by including semantically related
terms, so that it retrieves texts in which the query terms do not
appear explicitly. It is advantageous to have a dedicated dictionary
for the specified domain; otherwise an existing dictionary such as
WordNet can be used. A dictionary is useful because it provides
multiple words with a similar meaning, that is, synonyms, for the
word being compared. In this way the semantic structure of the
question can be tapped more effectively. Finally, the answer
extraction process is carried out. To extract an answer from the
collection of documents, an information retrieval engine is needed
that can analyze the keywords and passages in detail. The answers to
a query are locations in the text where there is local similarity to
the query, and the similarity is assessed by a mechanism that employs
the distance between keywords as one of its parameters [5][20].
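Query formulation with an entity file and a domain dictionary, as described above, might look like the following Python sketch. The entity set and synonym table are hypothetical, hard-coded stand-ins for a real entity file and a domain dictionary (or WordNet); Python's built-in `set` and `dict` are hash-based, which corresponds to the hash table lookup mentioned in the text.

```python
# Hypothetical contents of a domain entity file, held in a hash-based set.
DOMAIN_ENTITIES = {"aspirin", "fever", "dosage"}

# Hypothetical hand-built domain dictionary mapping a term to its synonyms.
SYNONYMS = {
    "fever": ["pyrexia", "temperature"],
    "dosage": ["dose"],
}


def formulate_query(question_tokens):
    """Return (domain entities found, expanded query terms)."""
    # Hash lookup of each question word against the entity file.
    entities = [t for t in question_tokens if t in DOMAIN_ENTITIES]
    expanded = []
    for term in entities:
        expanded.append(term)
        # Query expansion: add semantically related terms when known.
        expanded.extend(SYNONYMS.get(term, []))
    return entities, expanded
```

With the tokens of "what is the dosage of aspirin for fever", the sketch recognizes `dosage`, `aspirin`, and `fever` as domain entities and expands the query with `dose`, `pyrexia`, and `temperature`, so that passages using those words can also be retrieved.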

V. OPEN DOMAIN QAS

The aim of an open domain question answering system is to respond to
the user’s question with a short text rather than a lengthy list of
relevant documents. This type of system makes use of multiple
techniques from computational linguistics, information retrieval, and
knowledge representation to search for answers.

As in other types of QAS, the query is accepted in the form of a
question in natural language. First the type of question is
identified, and then an information retrieval system is used to find
a set of documents containing the correct keywords.

A tagger and an NP/verb group chunker can also be used to determine
whether the right entities and relations are mentioned in the
retrieved documents. To find the correct person or location for
questions such as “Who” or “Where”, one can use a named entity
recognizer, which identifies candidate answers in the retrieved
documents or database. The paragraphs that are relevant to the answer
are then selected for ranking.

A vector space model [12] can be used as a strategy for classifying
the candidate answers. The system also needs to check whether the
answer is of the correct type, as determined in the question type
analysis stage. Several techniques can be used to validate the
candidate answers. A score is then given to each candidate according
to the number of question words it contains and how close these words
are to the candidate; the more words and the closer they are, the
better. The answer is then translated into a compact and meaningful
representation by parsing. Finally, the answer is passed to the user
as the response to the query.
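To make the vector space model concrete, the following Python sketch represents the question and each passage as a term-frequency vector and orders passages by cosine similarity. Raw term counts are used instead of a weighted scheme, and the function names are assumptions of this example; it is a simplified illustration rather than the model as presented in [12].

```python
import math
import re
from collections import Counter


def term_vector(text):
    """Represent a text as a bag-of-words term-frequency vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(vec_a, vec_b):
    dot = sum(vec_a[t] * vec_b[t] for t in vec_a if t in vec_b)
    norm_a = math.sqrt(sum(v * v for v in vec_a.values()))
    norm_b = math.sqrt(sum(v * v for v in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_candidates(question, passages):
    """Order candidate passages by similarity to the question, best first."""
    q = term_vector(question)
    return sorted(passages,
                  key=lambda p: cosine_similarity(q, term_vector(p)),
                  reverse=True)
```

For the question "where is the eiffel tower", a passage such as "The Eiffel Tower is in Paris" shares four terms with the question and is therefore ranked above "Mount Fuji is in Japan", which shares only one.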

The most important challenge for an open domain system is its
database. The efficiency of any system depends on how well the
database is arranged and maintained, and this holds especially for an
open domain QAS, since it aims to answer questions on nearly
everything.

VI. WEB BASED QAS

Nowadays the Internet is a giant store of information. A tremendous
amount of information is available online, making the Web an ideal
source of answers to a large variety of questions. The most important
property of a web based QAS is that it is “snippet-tolerant”: it can
provide correct answers while searching through search engines such
as Google and Yahoo using only snippets. Whenever a query is passed
to a search engine, the engine returns a list of results in the form
of web documents. Each result usually carries the URL, the title, and
some string segments of the related web document. The title and the
string segments are the “snippets”. The snippet-tolerant property is
important for a web based QAS because it is an online system whose
efficiency depends on the time required to download the web documents
and then analyze them. This time should be as short as possible.

The user submits the query to the system in natural language. The
system first identifies the type of question; it may use a Support
Vector Machine (SVM) [9] to classify the questions. It then submits
the question to the search engine and grabs the top search results,
each of which is a snippet. After the question type has been
identified, the system extracts all information of that type from the
snippets as plausible answers. For this, one may use an HMM-based
named entity recognizer [11] or any other suitable technique,
together with some heuristic rules. For answer selection, snippet
clustering can be used [13]. With the standard vector space model, it
has been observed that the count of the correct answer to a question
is usually greater than that of incorrect ones in the search results
for that question [13]. Finally, the final answer is evaluated. The
LAMP system [10] and AskMSR [15] are examples of such systems.
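The redundancy observation above can be sketched in a few lines of Python. The example assumes the snippets have already been fetched (no real search engine API is called) and that candidates of the expected answer type can be found with a regular expression; a four-digit year pattern stands in for a named entity recognizer on a WHEN question. The function name and the pattern are illustrative assumptions.

```python
import re
from collections import Counter


def most_redundant_answer(snippets, candidate_pattern):
    """Return (answer, count) for the candidate found in the most snippets."""
    counts = Counter()
    for snippet in snippets:
        # Count each candidate at most once per snippet.
        found = {match.lower() for match in re.findall(candidate_pattern, snippet)}
        counts.update(found)
    if not counts:
        return None, 0
    return counts.most_common(1)[0]


# Hypothetical stand-in for a named entity recognizer: four-digit years.
YEAR_PATTERN = r"\b(?:1\d{3}|20\d{2})\b"
```

Applied to snippets returned for "When was the telephone patented?", the year mentioned by the largest number of snippets is selected as the answer; the returned count can also serve as a crude confidence value during validation.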

VII. INFORMATION RETRIEVAL OR INFORMATION EXTRACTION (IR/IE) BASED QAS

Question answering, the process of extracting answers to natural
language questions, is not simply Information Retrieval (IR) or
Information Extraction (IE) but much more. IR systems locate
documents that are relevant to a query, but they fail to specify
exactly where the answer is. IR uses a query keyword matching
approach to fetch documents from an indexed document collection. IE
systems, on the other hand, extract the required information from the
fetched documents, provided the domain of extraction is well defined.
Information extracted by IE systems takes the form of slot fillers
for predefined templates. QAS technology is one step ahead of IR and
IE systems: it uses both IR and IE and provides exact, concise
answers formulated naturally [16].

There is a difference between IR and IE systems. An IR system works
through the interaction between human and computer when it is used to
search for the answer to a posed query. The efficiency of an IR
system depends on how well the machine is programmed to match the
user’s query against the available documents and return the most
relevant ones. IR systems retrieve the most relevant documents from
the available database, but IR alone cannot give the exact answer.
This is where IE comes in: IE systems extract the correct answer from
the retrieved documents, using various natural language processing
techniques. Both kinds of system demand a well arranged and
maintained database. IR systems face various challenges in proving
their efficiency [18].

VIII. RULE BASED QAS

This kind of system is one of the most important and efficient QAS.
Its basic application is reading comprehension. In the United States,
the reading ability of children is generally evaluated by giving them
reading comprehension tests. In such a test, a short story is given
to the children as a paragraph. They need to understand it and answer
the questions that follow the story.

Understanding the story is an easy task for children compared with a
computer system, because a computer is an electronic device that must
be programmed to perform any required task. So when we want a
computer system to take a comprehension test, we first need to supply
a program that enables it to understand the aspects of the story and
answer the questions correctly.

Such a program uses natural language processing along with lexical
and semantic heuristics, an understanding that is difficult to
achieve with broad-coverage techniques. These comprehension tests are
quite difficult and challenging because they can cover nearly any
topic.

Developing a rule based QAS is a challenging task, as the developer
needs to consider virtually all the possible topics on which the
system may be tested. The basic types of questions that can be asked
are well known [5]; the types generally covered are WHO, WHAT, WHEN,
WHERE, and WHY. For a rule based QAS, the developer needs to treat
each type as a separate group and implement separate rules for each
one, because each type of question seeks a different kind of answer.
For example, a WHO question looks for a PERSON NAME as the answer,
while a WHERE question looks for a LOCATION.

Once a question is posed to a rule based system, the very first task
is to parse it using a parser. Syntactic analysis is optional. After
this, the system applies NLP techniques such as morphological
analysis, part-of-speech tagging, semantic class tagging, and entity
recognition. Hand-crafted rules can then be used to find the correct
answer in the given story. These rules are applied to every sentence
in the story, including the title, although the title is not
considered for WHY questions. Rules such as the dateline rule (for
WHEN and WHERE questions) and the WordMatch function can be applied
to the sentences as necessary. Each rule awards a predefined number
of points to each sentence as a score. Finally, the sentence with the
highest score is returned as the answer. Writing rules is also a
tough job, as there are many ways to write them [19].
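The scoring scheme described above can be sketched in Python as follows. The rules shown are simplified inventions for illustration and are not the rules of [19]: the point value `GOOD_CLUE`, the stop-word list, and the prepositional and temporal patterns are all assumptions. Only the overall mechanism follows the text, namely that a word match function and type-specific rules award points to each sentence and the highest-scoring sentence is returned.

```python
import re

# Illustrative point value (an assumption; the values used in [19] are not given here).
GOOD_CLUE = 4

# Small illustrative stop-word list that also excludes the question words.
STOP_WORDS = {"the", "a", "an", "did", "do", "does", "is", "was", "to", "of",
              "who", "what", "when", "where", "why"}


def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def word_match(question, sentence):
    """Count content words of the question that also occur in the sentence."""
    question_words = set(tokenize(question)) - STOP_WORDS
    return len(question_words & set(tokenize(sentence)))


def score_where(question, sentence):
    score = word_match(question, sentence)
    # Hypothetical rule: a location preposition is a good clue for WHERE.
    if re.search(r"\b(in|at|near|inside)\b", sentence.lower()):
        score += GOOD_CLUE
    return score


def score_when(question, sentence):
    score = word_match(question, sentence)
    # Hypothetical rule: a year or a temporal word is a good clue for WHEN.
    if re.search(r"\b(\d{4}|today|yesterday|tomorrow)\b", sentence.lower()):
        score += GOOD_CLUE
    return score


RULES = {"where": score_where, "when": score_when}


def answer(question, sentences):
    """Return the story sentence with the highest rule-based score."""
    tokens = tokenize(question)
    rule = RULES.get(tokens[0], word_match) if tokens else word_match
    return max(sentences, key=lambda s: rule(question, s))
```

Question types without a dedicated rule set fall back to plain word matching in this sketch; a fuller system would add separate rule sets for WHO, WHAT, and WHY in the same manner.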

REFERENCES

[1] Dina Demner-Fushman, "Complex Question Answering Based on Semantic Domain Model of Clinical Medicine", OCLC's Experimental Thesis Catalog, College Park, Md.: University of Maryland (United States), 2006.

[2] Hai Doan-Nguyen, Leila Kosseim, "The Problem of Precision in Restricted-Domain Question Answering. Some Proposed Methods of Improvement", in Proceedings of the ACL 2004 Workshop on Question Answering in Restricted Domains, Barcelona, Spain, Association for Computational Linguistics, July 2004, pp. 8-15.

[3] Green, W., Chomsky, C., Laughery, K., "BASEBALL: An automatic question answerer", Proceedings of the Western Joint Computer Conference, 1961, pp. 219-224.

[4] Figueira, H., Martins, A., Mendes, A., Mendes, P., Pinto, C., Vidal, D., "Priberam's Question Answering System in a Cross-Language Environment", Lecture Notes in Computer Science, Volume 4730, 2007, pp. 300-309.

[5] Vanitha Guda, Suresh Kumar Sanampudi, I. Lakshmi Manikyamba, "Approaches for Question Answering Systems", International Journal of Engineering Science and Technology (IJEST), 2011.

[6] Mohammad Reza Kangavari, Samira Ghandchi, Manak Golpour, "A New Model for Question Answering Systems", World Academy of Science, Engineering and Technology 18, 2008.

[7] Mukul Aggarwal, "Information Retrieval and Question Answering NLP Approach: An Artificial Intelligence Application", International Journal of Soft Computing and Engineering (IJSCE), ISSN: 2231-2307, Volume-1, Issue-NCAI2011, June 2011, NCAI2011, 13-14 May 2011, Jaipur, India.

[8] Hai Doan-Nguyen, Leila Kosseim, "Improving the Precision of a Closed-Domain Question-Answering System with Semantic Information".

[9] N. Cristianini and J. Shawe-Taylor, An Introduction to Support Vector Machines, Cambridge University Press, Cambridge, UK, 2000.

[10] Dell Zhang and Wee Sun Lee, "A Web-based Question Answering System".

[11] D. Bikel, R. Schwartz, and R. Weischedel, "An Algorithm that Learns What's in a Name", Machine Learning, 34(1-3), pp. 211-231, 1999.

[12] R. Baeza-Yates and B. Ribeiro-Neto, Modern Information Retrieval, Addison Wesley, 1999.

[13] S. Dumais, M. Banko, E. Brill, J. Lin, and A. Ng, "Web Question Answering: Is More Always Better?", in Proceedings of SIGIR'02, pp. 291-298, Aug 2002.

[14] Hai Doan-Nguyen, Leila Kosseim, "Improving the Precision of a Closed-Domain Question-Answering System with Semantic Information", ACL 2004 Workshop on Question Answering in Restricted Domain, 2004, acl.ldc.upenn.edu.

[15] E. Brill, S. Dumais, and M. Banko, "An analysis of the AskMSR question-answering system", in Proceedings of the 2002 Conference on Empirical Methods in Natural Language Processing (EMNLP 2002), 2002.

[16] Dan Moldovan, Mihai Surdeanu, "On the Role of Information Retrieval and Information Extraction in Question Answering Systems".

[17] Barry Schiffman, Kathleen R. McKeown, "Question Answering using Integrated Information Retrieval and Information Extraction".

[18] James Allan (editor), Jay Aslam, Nicholas Belkin, Chris Buckley, Jamie Callan, Bruce Croft (editor), Sue Dumais, Norbert Fuhr, Donna Harman, "Challenges in Information Retrieval and Language Modeling", Report of a Workshop held at the Center for Intelligent Information Retrieval, University of Massachusetts Amherst, September 2002.

[19] Ellen Riloff and Michael Thelen, "A Rule-based Question Answering System for Reading Comprehension Tests".

[20] Shalini Stalin, Rajeev Pandey, Raju Barskar, "Web Based Application for Hindi Question Answering System", International Journal of Electronics and Computer Science Engineering, ISSN: 2277-1956.

[21] Marius A. Pasca, High performance, open-domain question answering from large text collections [D], USA: Southern Methodist University, 2001.

[22] Abraham Ittycheriah, Trainable question answering systems [D], USA: The State University of New Jersey, 2001.

[23] Jun Suzuki, Yutaka Sasaki, Eisaku Maeda, "SVM answer selection for open-domain question answering" [A], 19th International Conference on Computational Linguistics (Coling-2002) [C], Taipei: Howard International House, 2002, pp. 974-980.

[24] Peng Li, Yi Guan, Xiao-long Wang, "Answer extraction based on system similarity model and stratified sampling logistic regression in rare data", International Journal of Computer Science and Network Security, 2006, 6(3): 189-196.

[25] Muthukrishnan Ramprasath and Shanmugasundaram Hariharan, "A Survey on Question Answering System", International Journal of Research and Reviews in Information Sciences (IJRRIS), Vol. 2, No. 1, March 2012, ISSN: 2046-6439.

