
[IEEE 2011 International Symposium on Innovations in Intelligent Systems and Applications (INISTA) - Istanbul, Turkey (2011.06.15-2011.06.18)] 2011 International Symposium on Innovations


Formalization of Natural Language Queries

HASNA BOUMECHAAL (1), ZIZETTE BOUFAIDA (2)

(1, 2) LIRE Laboratory, Mentouri University, Constantine, Algeria

{boumechaal.h, zboufaida}@gmail.com

Abstract—Most existing information retrieval systems are limited to keyword search based on the syntactic content of documents. This lack of semantics forces users into very complex query formulations and makes it difficult to access relevant information on the web. In this paper, we implement a conversion tool that takes a query expressed in natural language and an ontology as input and returns the appropriate formal queries. The generated queries are then sent to the reasoner to query the knowledge bases.

Keywords: Information retrieval; semantic search; ontology; nRQL language

I. INTRODUCTION

Web pages represent a mass of knowledge that is as enormous as it is heterogeneous. This mass grows ceaselessly, as does the number of users who want to easily find the information they are looking for.

Existing information retrieval systems are limited to keyword search based on the syntactic content of documents. This lack of semantics leads to very complex query formulations, which makes it difficult to access relevant information on the web.

The Semantic Web [1] is an extension of the current Web in which information is given well-defined meaning, better enabling computers and people to work in cooperation. It thus aims to structure the information available on the web. However, structured information alone is not enough to satisfy users' information needs: we also need semantic search systems that interpret the meaning of both queries and documents, in order to give users what they are really looking for.

Semantic search plays an important role in realizing this goal, as it promises to produce precise answers to users' queries by taking advantage of the explicit semantics of the available information.

The main objective of our work is to facilitate querying the knowledge base by implementing a conversion tool which takes queries expressed in natural language and an ontology as input and returns the appropriate formal queries. The generated queries are then sent to the reasoner for querying the knowledge bases.

The paper is structured as follows. Section 2 gives an overview of existing systems for querying ontologies. Section 3 describes some features of the developed system and presents its overall architecture. Section 4 gives an overview of the Animal Ontology, which is used in our validation. Section 5 illustrates a case study processed by the system. Section 6 describes the evaluation. Finally, Section 7 concludes the paper and discusses some prospects arising from the work.

II. RELATED WORK

Natural language interfaces to databases have been a significant research focus, especially during the 70s and 80s [2]. However, only recently has the topic of natural language interfaces to ontologies been seriously investigated. As part of the semantic web, a number of systems providing access to ontologies have been created.

AquaLog [3] is a portable question-answering system which takes queries expressed in natural language and an ontology as input and returns answers drawn from the available semantic markup. It uses a controlled language for querying ontologies with the addition of a learning mechanism, so that its performance improves over time in response to the vocabulary used by the end users.

PowerAqua [4] is a multi-ontology-based Question Answering (QA) system that takes as input queries expressed in natural language and returns answers drawn from relevant distributed resources on the Semantic Web.

Querix [5] is another ontology-based question answering system that translates generic natural language queries into SPARQL. In case of ambiguities, Querix relies on clarification dialogues with users.

SemSearch [6] is a concept-based system which aims to have a Google-like Query Interface. It requires a list of concepts (classes or instances) as an input query.

ONLI (Ontology Natural Language Interaction) [7] takes as input questions in unrestricted natural language and translates them into nRQL, an extension to the RACER ontology query language; it then returns the answers produced by the RACER ontology reasoning server.

PANTO [8] is a portable natural language interface to ontologies that accepts generic natural language queries and outputs SPARQL queries. With special consideration given to nominal phrases, it adopts a triple-based data model to interpret the parse trees output by an off-the-shelf parser.

978-1-61284-922-5/11/$26.00 ©2011 IEEE

Finally, the QuestIO system [9] is a natural language interface for accessing structured information. It aims to bring the simplicity of Google's search interface to conceptual retrieval by automatically converting short conceptual queries into formal ones, which can then be executed against any semantic repository.

To summarise, the performance of the systems discussed above is strongly influenced by the natural language processing (NLP) techniques they use, because these techniques serve to: (1) generate the syntax tree of the query, (2) extract the linguistic triples from this tree, and (3) compare the linguistic triples of the query with the triples of the ontology. These systems therefore assume that every query can be written in the linguistic triple form <term, relation, term>, although relations are often missing from user queries. In addition, most existing systems map the terms of the query to entities in the ontology rather than exploiting the semantic relationships between those terms.

III. THE PROPOSED APPROACH

A. Motivation

Our idea in this work is to build a system that takes a natural language query and an ontology as input and returns formal queries expressed in nRQL [10,11] (new RacerPro Query Language), the query language of the RACER inference engine. These queries are then sent to the RACER reasoner to retrieve the related resources. However, accessing ontologies in this way is not a simple task and raises two major obstacles:

• First, the ambiguity and complexity of arbitrary natural language make it difficult for a machine to understand.

• Second, the query must be correctly formalized in the query language of the inference engine. Even after the linguistic analysis of a natural language query, many challenges remain in translating it into a correct formal query.

To address the first obstacle, we used the electronic version of WordNet (2.1), which is freely available for download; the WordNet dictionary [12] is the most widely used resource for disambiguating natural language words. If a query term does not belong to the ontology, we look up its synonyms in WordNet and select those that belong to the ontology. Consequently, each term of the query may be associated with one or more entities of the ontology. The result is a sequence of the ontology entities that occur in the query.
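The synonym fallback described above can be sketched as follows. This is a minimal illustration, not the authors' Java implementation: the entity set mirrors the AnimalOntology example used later in the paper, and the `SYNONYMS` dictionary is a hypothetical stand-in for a WordNet 2.1 lookup (its entries are invented for the example).

```python
from typing import Dict, List

# Hypothetical toy data: entity names follow the paper's AnimalOntology example.
ONTOLOGY_ENTITIES = {"animal", "feather", "wings", "meat", "eat"}

# Stand-in for a WordNet lookup; these synonym lists are illustrative only.
SYNONYMS: Dict[str, List[str]] = {
    "consumes": ["ingest", "devour", "eat", "use up"],
}


def map_term(term: str) -> List[str]:
    """Return the ontology entities associated with a query term."""
    if term in ONTOLOGY_ENTITIES:
        return [term]
    # The term is not in the ontology: keep only synonyms that are.
    return [s for s in SYNONYMS.get(term, []) if s in ONTOLOGY_ENTITIES]


print(map_term("consumes"))
```

A term with no synonym in the ontology maps to an empty list, so the caller can decide whether to drop it or report it to the user.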

To address the second obstacle, we first represent this sequence as a set of triples of the form <argument, relation, argument>, where each argument can be a class, an instance, a literal, or empty, and the relation can be an object_property, a data_type_property, or empty. The main reason for adopting a triple-based data model is that the knowledge representation (KR) formalisms of the semantic web, such as RDF and OWL, also subscribe to this binary relational model. We then apply an nRQL query generation algorithm [13], which takes the triples of the query as input and returns the appropriate nRQL query based on the semantic relationships between all query components. The algorithm first enriches the query with the newly found relationships, because relationships between the resources of the knowledge base must be stated explicitly in nRQL queries. It then uses specification rules to remove redundancies and optimize the query. Finally, it uses generation rules to produce the final nRQL query.

B. Architecture

The system consists of two main phases (Figure 1), in addition to a user interface through which the user interacts with the system. The interface allows the user to enter natural language queries and choose the ontology to be queried. After executing a query, it displays the results and the generated nRQL query to the user.

In the first phase, the linguistic analysis of queries, the query decomposer splits the user query into its components. Next, the mapping module eliminates the stopwords and identifies the set of entities that correspond to the terms of the user query, using not only the ontology but also general dictionaries such as WordNet.
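As a rough sketch of the decomposition and stopword removal in this phase, assuming whitespace tokenization and a small illustrative stopword list (neither is specified in the paper):

```python
from typing import List

# Illustrative stopword list; the list used by the actual system is not given.
STOPWORDS = {"with", "and", "the", "a", "of"}


def decompose(query: str) -> List[str]:
    """Split a natural language query into lowercase components."""
    return query.lower().split()


def remove_stopwords(components: List[str]) -> List[str]:
    """Drop components that carry no ontological meaning."""
    return [c for c in components if c not in STOPWORDS]


components = decompose("animal with feather and wings consumes meat")
print(components)
print(remove_stopwords(components))
```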

In the nRQL query generation phase, the triples extractor translates the resulting sequence into a set of triples <argument, relation, argument>. Next, the nRQL generator converts the extracted triples into nRQL using the nRQL query generation algorithm.


Mapping of the query components in the case study (Section V, step C):

animal → class (animal ∈ AnimalOntology)
with → stopword
feather → instance (feather ∈ AnimalOntology)
and → stopword
wings → instance (wings ∈ AnimalOntology)
consumes → consumes ∉ AnimalOntology; we must check whether one of its synonyms belongs to the ontology
meat → instance (meat ∈ AnimalOntology)

Figure 1. System architecture

IV. ANIMAL ONTOLOGY

The development of an ontology is always linked to a design methodology, a building tool, and a representation language. Many methodologies have been developed, but none has been standardised. Such a methodology includes a set of stages that occur when building ontologies, guidelines and principles to assist in the different stages, and an ontology life cycle that indicates the relationships among stages. In this work, we followed the building process specified in [14] to design our ontology, which was developed to validate our approach.

Many tools have been proposed to support the manual building of ontologies; they allow editing an ontology and adding concepts and relations, and they integrate different formalization languages (RDF, OWL). We used Protégé OWL, which allows us to formulate the ontology in the OWL representation language and to verify it using the RACER reasoner (computing the subsumption relationships between concepts and checking the consistency of all concepts).

AnimalOntology is an ontology written in OWL. It models knowledge of the animal domain. It is used in the linguistic analysis phase to check whether the initial query terms exist in the ontology or, when they do not, whether their synonyms do.

It is also used in the nRQL query generation phase to check for semantic relations between query terms and to enrich the query with all the relations found.

Figure 2. Animal ontology

V. CASE STUDY

In the following, we present a query processed by the system to illustrate the steps taken during the conversion of natural language queries to nRQL.

A. The user enters the query « animal with feather and wings consumes meat » with the Animal Ontology.

B. The query decomposer decomposes the query into seven components: animal / with / feather / and / wings / consumes / meat

C. The mapping module eliminates the stopwords and identifies the entities that correspond to each component of the user query.

Using WordNet 2.1, we found 18 synonyms of consumes (Figure 3). Among these synonyms we selected eat, which is the only one belonging to the ontology. Consequently, the term consumes is replaced by its counterpart in the ontology, eat.

[Figure 1 labels: Natural language query → Query decomposer → Query components → Mapping module → Mapped components → Triples extractor → Query triples → nRQL generator → nRQL queries. Phases: Linguistic analysis of queries; Generation of nRQL queries. Resources: Ontology (OWL); WordNet 2.1.]


Figure 3. Synonyms of consumes in WordNet

D. The triples extractor identifies all possible triples in the query.

TABLE 1. THE QUERY TRIPLES

Triple                    Type of triple
<animal, _, feather>      <class, empty, instance>
<animal, _, wings>        <class, empty, instance>
<animal, _, meat>         <class, empty, instance>
<feather, _, wings>       <instance, empty, instance>
<feather, _, meat>        <instance, empty, instance>
<wings, _, meat>          <instance, empty, instance>
<animal, eat, feather>    <class, object_property, instance>
<animal, eat, wings>      <class, object_property, instance>
<animal, eat, meat>       <class, object_property, instance>
<feather, eat, wings>     <instance, object_property, instance>
<feather, eat, meat>      <instance, object_property, instance>
<wings, eat, meat>        <instance, object_property, instance>
<animal, eat, _>          <class, object_property, empty>
<feather, eat, _>         <instance, object_property, empty>
<wings, eat, _>           <instance, object_property, empty>
<meat, eat, _>            <instance, object_property, empty>
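One way to enumerate the sixteen candidate triples of Table 1 is sketched below. The enumeration rule (ordered pairs of arguments with an empty relation, the same pairs with each relation, then each argument with a relation and an empty object) is inferred from the table, not stated by the authors; `None` stands for the empty slot written `_` in the paper.

```python
from itertools import combinations
from typing import List, Optional, Tuple

Triple = Tuple[str, Optional[str], Optional[str]]


def candidate_triples(arguments: List[str], relations: List[str]) -> List[Triple]:
    """Enumerate candidate <argument, relation, argument> triples."""
    pairs = list(combinations(arguments, 2))
    # Pairs of arguments whose relation is not mentioned in the query.
    triples: List[Triple] = [(a, None, b) for a, b in pairs]
    for rel in relations:
        # Pairs of arguments linked by a relation that appears in the query.
        triples += [(a, rel, b) for a, b in pairs]
        # A single argument with the relation and an empty second argument.
        triples += [(a, rel, None) for a in arguments]
    return triples


triples = candidate_triples(["animal", "feather", "wings", "meat"], ["eat"])
for t in triples:
    print(t)
```

With four arguments and one relation this yields 6 + 6 + 4 = 16 triples, in the same order as the table.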

E. Finally, the nRQL generator converts the extracted triples into nRQL by applying the nRQL query generation algorithm:

1) Link the components of each triple and put it in its valid form:

a) The triple <animal, _, feather> has an empty relation; one of its arguments is a class and the other an instance, and they are linked by the object property covered_with. The valid form of the triple after enrichment is therefore <animal, covered_with, feather>.

b) The triple <animal, _, wings> has an empty relation; one of its arguments is a class and the other an instance, and they are linked by the object property has_membres. The valid form of the triple after enrichment is therefore <animal, has_membres, wings>.

c) The triple <animal, _, meat> has an empty relation; one of its arguments is a class and the other an instance, and they are linked by the object property eat. The valid form of the triple after enrichment is therefore <animal, eat, meat>.

d) The triple <animal, eat, meat> contains an object property whose domain is animal and whose range is the class of meat, so the triple is valid in the ontology.

e) The triple <animal, eat, _> contains an object property whose domain is animal, so its valid form is <animal, eat, variable>.

f) The triple <meat, eat, _> contains an object property whose range is meat, so its valid form is <variable, eat, meat>.

The other triples are deleted because there is no semantic relation between their components.
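The enrichment and validation cases a) to f) can be sketched as a single function. This is an illustrative simplification, not the algorithm of [13]: the `PROPERTIES` table is a hypothetical stand-in for the ontology, and it stores the instance directly as the range of each property instead of the instance's class.

```python
from typing import Dict, Optional, Tuple

Triple = Tuple[str, Optional[str], Optional[str]]
VAR = "?var"  # placeholder for the paper's "variable"

# Hypothetical toy ontology: property name -> (domain, range).
PROPERTIES: Dict[str, Tuple[str, str]] = {
    "covered_with": ("animal", "feather"),
    "has_membres": ("animal", "wings"),
    "eat": ("animal", "meat"),
}


def enrich(triple: Triple) -> Optional[Triple]:
    """Return the valid form of a triple, or None if it must be deleted."""
    subject, relation, obj = triple
    if relation is None:
        # Cases a) to c): look for a property linking the two arguments.
        for name, (domain, rng) in PROPERTIES.items():
            if (domain, rng) == (subject, obj):
                return (subject, name, obj)
        return None
    if relation not in PROPERTIES:
        return None
    domain, rng = PROPERTIES[relation]
    if obj is None:
        # Cases e) and f): fill the empty slot with a variable.
        if subject == domain:
            return (subject, relation, VAR)
        if subject == rng:
            return (VAR, relation, subject)
        return None
    # Case d): the triple is kept only if it respects domain and range.
    return triple if (subject, obj) == (domain, rng) else None


print(enrich(("animal", None, "feather")))
print(enrich(("meat", "eat", None)))
```

Triples for which `enrich` returns `None` correspond to those deleted for lack of a semantic relation between their components.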

2) Conjunction of all triples and elimination of redundant links:

<animal, covered_with, feather> and <animal, has_membres, wings> and <animal, eat, meat> and <animal, eat, meat> and <animal, eat, variable> and <variable, eat, meat>

The triple <animal, eat, meat> appears twice and is more restrictive than <animal, eat, variable> and <variable, eat, meat>, so we retain a single copy of it and drop the other two. The final optimized form of the query is:

<animal, covered_with, feather> and <animal, has_membres, wings> and <animal, eat, meat>
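The redundancy elimination of this step can be sketched as follows, assuming that a triple is redundant when it is an exact duplicate or when replacing its variable slots yields another triple of the conjunction. The `"?var"` marker and the subsumption test are assumptions made for the illustration; the paper's specification rules are not detailed here.

```python
from typing import List, Tuple

Triple = Tuple[str, str, str]
VAR = "?var"  # placeholder for the paper's "variable"


def optimize(triples: List[Triple]) -> List[Triple]:
    """Remove duplicate triples and triples subsumed by a more specific one."""
    unique = list(dict.fromkeys(triples))  # drops duplicates, keeps order

    def subsumed(t: Triple) -> bool:
        # t is redundant if another triple matches it on every non-variable slot.
        return any(
            u != t and all(a == VAR or a == b for a, b in zip(t, u))
            for u in unique
        )

    return [t for t in unique if not subsumed(t)]


conjunction = [
    ("animal", "covered_with", "feather"),
    ("animal", "has_membres", "wings"),
    ("animal", "eat", "meat"),
    ("animal", "eat", "meat"),
    ("animal", "eat", VAR),
    (VAR, "eat", "meat"),
]
print(optimize(conjunction))
```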

3) Translation to nRQL by applying the generation rules:

<animal, covered_with, feather> → ?x|animal| |covered_with| |feather|
<animal, has_membres, wings> → ?x|animal| |has_membres| |wings|
<animal, eat, meat> → ?x|animal| |eat| |meat|
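The generation rule shown above can be mimicked with simple string formatting. The sketch below only reproduces the intermediate notation used in this step; the complete nRQL query of Figure 4 is not reproduced in the text, so its exact syntax is not assumed here.

```python
from typing import Tuple

Triple = Tuple[str, str, str]


def to_pattern(triple: Triple, variable: str = "?x") -> str:
    """Render a valid triple in the notation used by the generation rules."""
    subject, relation, obj = triple
    return f"{variable}|{subject}| |{relation}| |{obj}|"


for t in [
    ("animal", "covered_with", "feather"),
    ("animal", "has_membres", "wings"),
    ("animal", "eat", "meat"),
]:
    print(to_pattern(t))
```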


Figure 4. nRQL query

VI. EVALUATION

To evaluate the performance of the system, we implemented a prototype in Java, using WordNet 2.1 and a small ontology in the animal domain (AnimalOntology). Once the ontology is validated, it is ready to be queried. To facilitate this, our system provides an interface that allows users to express their information needs as natural language queries and then presents the result of the conversion.

The users of the system are informed of the content and domain of the ontology, but they do not necessarily know the names of its relations or its structure.

Consequently, our tool decomposes the query, deletes stopwords, loads the ontology, finds for each query component its counterpart in the ontology by exploiting WordNet, determines the types of the components in the ontology (class, property, literal), extracts the triples of the query, finds the semantic relationships between the components of each triple, enriches the query with the newly found relationships, forms the conjunction of the query triples, removes redundancies, and finally generates the final nRQL query.

To develop and test the system, we used 30 different queries. The metric we used is recall, defined here as the percentage of queries in the total test set for which the system produced a correct output.
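Under this definition, recall is a simple ratio. The sketch below shows the computation; the counts in the usage line are hypothetical and are not the paper's test data.

```python
def recall(correct_outputs: int, total_queries: int) -> float:
    """Percentage of test queries for which a correct output was produced."""
    if total_queries <= 0:
        raise ValueError("the test set must contain at least one query")
    return 100.0 * correct_outputs / total_queries


# Hypothetical counts, for illustration only.
print(recall(17, 20))
```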

In this first test our system achieved encouraging conversion results, with a recall of 85%. Future work will focus on improving performance, specifically response time, and on evaluation with large ontologies.

VII. CONCLUSION

In this work we discussed the problem of building formal queries and presented an approach to solve it. Our system converts natural language queries into nRQL queries, using the semantic restrictions imposed by the ontology to map terms in the query to concepts and roles in the ontology. The translation into nRQL is then carried out by an algorithm based on the semantic relationships between all the mapped terms of the query. The generated query is finally sent to the reasoner to query the knowledge bases.

Possible improvements to the system include combining several ontologies in the conversion of queries, extending its scope to queries that contain structured operators (NOT, OR, etc.), and integrating linguistic methods to improve the linguistic processing of queries.

REFERENCES

[1] T. Berners-Lee, J. Hendler, O. Lassila, “The Semantic Web”, Scientific American, 2001.
[2] I. Androutsopoulos, G.D. Ritchie, P. Thanisch, “Natural Language Interfaces to Databases - An Introduction”, Journal of Natural Language Engineering, Cambridge University Press, pp. 29-81, 1995.
[3] V. Lopez, M. Pasin, E. Motta, “Aqualog: An ontology-portable question answering system for the semantic web”, European Semantic Web Conference (ESWC), 2005.
[4] V. Lopez, E. Motta, V. Uren, “Poweraqua: Fishing the semantic web”, European Semantic Web Conference (ESWC), 2006.
[5] E. Kaufmann, A. Bernstein, R. Zumstein, “Querix: A natural language interface to query ontologies based on clarification dialogs”, 5th International Semantic Web Conference (ISWC 2006), 2006.
[6] Y. Lei, V. Uren, E. Motta, “Semsearch: a search engine for the semantic web”, Managing Knowledge in a World of Networks, 2006.
[7] L. Kosseim, R. Siblini, C. Baker, S. Bergler, “Using Selectional Restrictions to Query an OWL Ontology”, International Conference on Formal Ontology in Information Systems (FOIS 2006), 2006.
[8] C. Wang, M. Xiong, Q. Zhou, Y. Yu, “PANTO: A Portable Natural Language Interface to Ontologies”, 4th European Semantic Web Conference (ESWC), 2007.
[9] V. Tablan, D. Damljanovic, K. Bontcheva, “A Natural Language Query Interface to Structured Information”, 5th European Semantic Web Conference (ESWC 2008), Tenerife, Spain, pp. 361-375, June 2008.
[10] V. Haarslev, R. Möller, M. Wessel, “Querying the Semantic Web with Racer + nRQL”, The KI-04 Workshop on Applications of Description Logics, 2004.
[11] M. Wessel, R. Möller, “A High Performance Semantic Web Query Answering Engine”, International Workshop on Description Logics, 2005.
[12] C. Fellbaum, “WordNet, An Electronic Lexical Database”, Bradford Books, 1998.
[13] H. Boumechaal, S. Allioua, Z. Boufaida, “A new approach to query an OWL ontology”, International Conference On Applied Informatics ICAI09, Bordj Bou Arréridj, Algeria, 2009.
[14] M. Hemam, “Un processus de développement d’ontologies dans le cadre du web sémantique”, Magister thesis, Mentouri University, Constantine, Algeria, 2005.
