WikiOnto: A System For Semi-automatic Extraction And Modeling Of Ontologies Using Wikipedia XML Corpus

Lalindra De Silva
University of Colombo School of Computing, Colombo, Sri Lanka

Lakshman Jayaratne
University of Colombo School of Computing, Colombo, Sri Lanka
Abstract—This paper introduces WikiOnto: a system that assists in the semi-automatic extraction and modeling of topic ontologies using a preprocessed document corpus of one of the largest knowledge bases in the world, Wikipedia. Based on the Wikipedia XML Corpus, we present a three-tiered framework for rapidly extracting topic ontologies and a modeling environment for refining them. Using Natural Language Processing (NLP) and other Machine Learning (ML) techniques along with a very rich document corpus, this system offers a solution to a task that is generally considered extremely cumbersome. Initial results from the prototype suggest that the system has strong potential for ontology extraction and modeling, and they motivate further research on extracting ontologies from other semi-structured document corpora as well.
Keywords-Ontology, Wikipedia XML Corpus, Ontology Modeling, Ontology Extraction
I. INTRODUCTION
With the growing problem of information overload and the emergence of the semantic web, the importance of ontologies has come under the spotlight in recent years. Ontologies, commonly defined as "an explicit specification of a conceptualization", provide an agreed-upon representation of the concepts of a domain for easier knowledge management [1]. However, modeling ontologies is generally considered a painstaking task, often requiring the knowledge and expertise of an ontology engineer who is well versed in the domain concerned.
Previous efforts at building ontology development environments succeeded in the comprehensiveness of the tools they provided, but building ontologies with these environments still involved a great deal of manual work [2]. Additionally, most efforts to extract ontologies have focused on textual data as their sources. In this research, we have therefore investigated extracting and modeling ontologies using the Wikipedia XML Corpus [3] as the source, with a view to extending this research to other semi-structured data domains as sources for building ontologies1.
The research efforts, and the applications that have resulted from them, in the areas of ontology development, extraction and ontology learning are many and diverse. In light of these attempts, section 2 reviews related prior research on constructing topic ontologies from various input sources. In section 3, we describe our source, the Wikipedia XML Corpus, and the structure of the documents in the corpus, along with the reasons why it is an ideal source for such research. The framework of our system consists of three major layers (Figure 1); section 4 examines in detail how this three-tiered framework is implemented. Finally, we present the ontology development environment with both the current facilities and those proposed to be made available to users of the system.
II. RELATED WORKS
Research efforts to extract ontologies have spanned multifarious domains in terms of the input sources they use. A large number of these efforts have focused on textual data as input, while fewer have focused on semi-structured data and relational databases. [4] presents a method in which the words that appear in the source texts are ontologically annotated using WordNet [5] as the lexical database; the authors primarily worked on assigning verbs that appear in source articles to ontological classes. Several other researchers have utilized XML schemas as a starting point for the creation of ontologies. In [6], the authors used the ORA-SS (Object-Relationship-Attribute Model for Semi-structured Data) model first to determine the nature of the relationships between the elements of the XML schema and subsequently to create a generic model for organizing the ontology. [7] gives a detailed survey of some of the work that has been carried out on learning ontologies for the semantic web. [8] presents a framework for ontology
1This work is based on the continuing undergraduate thesis of the first author, titled 'A Machine Learning Approach To Ontology Extraction and Evaluation Using Semi-structured Data'.
2009 IEEE International Conference on Semantic Computing
978-0-7695-3800-6/09 $26.00 © 2009 IEEE
DOI 10.1109/ICSC.2009.93
extraction for document classification. In this work, the author experimented on the Reuters collection and the Wikipedia database dump as sources and produced a methodology by which concepts can be extracted from such large document bases. Other tools for extracting ontologies from text, such as ASIUM [9], OntoLearn [10] and TextToOnto [11], have also been available. However, the biggest motivation for our research stems from OntoGen [12]. OntoGen is a tool that enables semi-automatic generation of topic ontologies using textual data as the initial source. We have utilized the best features of the OntoGen methodology while extending our research into the semi-structured data domain. Consequently, we have proposed to enhance our system with additional concept extraction mechanisms, such as lexico-syntactic pattern matching, which the OntoGen project has not incorporated.
III. WIKIPEDIA XML CORPUS
Wikipedia, being one of the largest knowledge bases in the world and the most popular reference work on the current World Wide Web, has inspired many people and projects in diverse fields, including knowledge management, information retrieval and ontology development. The reliability of Wikipedia's articles is often criticized, as it is a knowledge base that can be accessed and modified by anyone in the world. For the most part, however, Wikipedia is considered reliable enough to be used in non-critical applications.
The Wikipedia XML Corpus is a collection of documents derived from Wikipedia that is used in a large number of current Information Retrieval and Machine Learning research tasks. The corpus contains articles in eight languages, and the latest collection contains approximately 660,000 articles in English that correspond to articles in Wikipedia. Each XML document in the corpus is uniquely named (e.g. 12345.xml) and has the following uniform element structure (e.g. 3928.xml, representing the article on "Ball"):
<article>
  <name id="3928">Ball</name>
  <body>
    <p>A <emph3>ball</emph3> is a round object that is used most often in
      <collectionlink xlink:type="simple" xlink:href="26853.xml">sport</collectionlink>s and
      <collectionlink xlink:type="simple" xlink:href="11970.xml">game</collectionlink>s.</p>
    ...
    <section>
      <title>Popular ball games</title>
      There are many popular games...
      <normallist>
        <item>
          <collectionlink xlink:type="simple" xlink:href="3850.xml">Baseball</collectionlink>
        </item>
        <item>
          <collectionlink xlink:type="simple" xlink:href="3812.xml">Basketball</collectionlink>
        </item>
      </normallist>
    </section>
    ...
    <languagelink lang="cs">M?</languagelink>
    <languagelink lang="de">Ball</languagelink>
    ...
  </body>
</article>
Each document is contained within the 'article' tag, within which lies a limited number of predefined element tags that correspond to specific relations between text segments in the document. The 'name' element carries the title of the article, while the 'title' element within each 'section' element corresponds to the title of a subsection. The 'normallist' elements contain lists of information, while the 'collectionlink' and 'unknownlink' elements correspond to other articles referred to from within the document.
In addition to the documents, the corpus provides category information that hierarchically organizes the articles into the relevant categories. Using this category information, we were able to develop the initial segment of our system, in which the user can choose a certain number of articles from the domain of interest to be input to the system. Selecting such a domain-specific document set enables better-targeted and faster ontology creation than using the entire corpus as the input.
IV. WIKIONTO FRAMEWORK
As previously mentioned, the WikiOnto system is imple-
mented in a three-tiered framework. The following sections
explain each layer in detail.
A. Concept Extraction Using Document Structure
In this layer, we make use of the structure of the docu-
ments to deduce substantial information about the taxonomic
hierarchy of possible concepts in that document. After
careful review of many documents in the corpus, we have
established the following assumptions in extracting concepts
from the documents.
• Word phrases contained within the 'name', 'title', 'item', 'collectionlink' and 'unknownlink' elements are proposed as concepts to the user (e.g. in the previous example of the XML file, the words 'Ball', 'Sport', 'Game', 'Popular Ball Games', 'Baseball' and 'Basketball' are all suggested to the user as concepts)
Figure 1. Design of the WikiOnto System
• The word phrases contained within 'title' elements in each of the first-level sections are suggested as sub-concepts of the concept within the 'name' element of the document (e.g. 'Popular Ball Games' is a sub-concept of the concept 'Ball')
• A word phrase contained within a 'title' element nested within several 'section' elements is suggested as a sub-concept of the 'title' concept of the immediately enclosing section
• Any word phrase wrapped inside 'collectionlink' or 'unknownlink' elements is suggested as a sub-concept of the concept immediately above it in the structure of the XML document (e.g. the concepts 'Baseball' and 'Basketball' are sub-concepts of the concept 'Popular Ball Games', while the concepts 'Sport' and 'Game' are sub-concepts of the concept 'Ball')
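As an illustration, the structural assumptions above can be sketched as follows. The element names follow the corpus example in section 3, but the parsing code is our illustrative reconstruction (the actual system is written in C#), and only first-level sections are handled, for brevity:

```python
# A sketch of the structural concept-extraction assumptions above.
# Element names follow the Wikipedia XML Corpus; the parsing logic is an
# illustrative reconstruction, not the authors' actual code.
import xml.etree.ElementTree as ET

ARTICLE = """\
<article><name id="3928">Ball</name><body>
<section><title>Popular ball games</title>
<normallist>
<item><collectionlink>Baseball</collectionlink></item>
<item><collectionlink>Basketball</collectionlink></item>
</normallist>
</section>
</body></article>"""

def extract_concepts(xml_text):
    """Return (concept, parent-concept) pairs per the assumptions above."""
    root = ET.fromstring(xml_text)
    article_concept = root.findtext("name").strip()
    pairs = []
    for section in root.find("body").iter("section"):
        title = section.findtext("title")
        if title:                        # section title -> sub-concept of article
            pairs.append((title.strip(), article_concept))
            for link in section.iter("collectionlink"):
                pairs.append((link.text.strip(), title.strip()))
    return pairs

print(extract_concepts(ARTICLE))
```

On the toy article above, this yields 'Popular ball games' as a sub-concept of 'Ball' and the two linked games as sub-concepts of 'Popular ball games'.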
During initialization of the system, the user is given the opportunity to define the maximum number of words that a potential concept may contain. Once the user inputs the documents to be used as sources for concept extraction (either using the domain-specific document selector explained earlier, choosing documents manually, or inputting the entire corpus2), the system iterates through all the documents to extract the potential concepts and their relationships according to the assumptions listed earlier. The description of the ontology modeling environment (section 5) explains how the user can refine these relationships after examining the concepts (i.e. how the user can label the relationships as hyponymic, hypernymic, meronymic relations, etc.).
In order to validate the concepts that are extracted and suggested, we have incorporated WordNet [5] into our system. When all the concepts have been extracted from the documents as per the above assumptions, each concept is matched morphologically, word by word, against WordNet. If even a single word in a word phrase cannot be morphologically matched with WordNet (owing to reasons such as being a foreign word, a person's name, a place name, etc.), the whole word phrase is withheld from the concept collection and is presented to the user at the end of the processing stage along with the rest of such concepts.
2Due to computer resource constraints, the system is yet to be tested with the full corpus (approx. 660,000 articles) as input.
The user has full discretion to decide whether these concepts should be added to the concept collection. At the end of this processing stage, the user is able to query for a concept in the collection and start building an ontology with the queried concept as the root. The system automatically populates the first level of concepts and, according to the user's needs, will populate additional levels as well (i.e. initially the ontology includes only the immediate concepts that have a sub-concept relationship with the root concept; at the user's discretion, the system will then treat each concept in the first level as a parent concept and add its child concepts to it).
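The word-by-word morphological validation step can be sketched as follows. The real system consults WordNet [5]; here a tiny hypothetical lexicon and a crude suffix-stripping rule stand in for WordNet so the sketch stays self-contained:

```python
# A sketch of the word-by-word validation step. LEXICON and the suffix
# rules are hypothetical stand-ins for WordNet's morphological lookup.
LEXICON = {"ball", "game", "sport", "popular", "baseball"}

def morph_match(word, lexicon=LEXICON):
    """Very rough stand-in for a WordNet morphological lookup."""
    w = word.lower()
    if w in lexicon:
        return True
    for suffix in ("s", "es", "ing", "ed"):   # crude inflection stripping
        if w.endswith(suffix) and w[: -len(suffix)] in lexicon:
            return True
    return False

def validate(phrase):
    """Accept a phrase only if every word matches; otherwise it is
    withheld and deferred to the user, as described above."""
    return all(morph_match(w) for w in phrase.split())

print(validate("Popular Ball Games"))   # True: every word matches
print(validate("Colombo Ball Games"))   # False: 'Colombo' fails, so deferred
```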
B. Concepts From Keyword Clustering
As the second layer of our framework, we have implemented a keyword clustering method to identify the concepts related to a given concept. There are several well-documented approaches and metrics for extracting keywords and measuring their relevance to documents. We have used the well-known TFIDF measure [13], which is generally accepted as a good measure of the relevance of a term to a document, and have represented each document in the vector-space model using these keywords. The TFIDF measure is the product of the Term Frequency (TF) and the Inverse Document Frequency (IDF), defined as follows:
TF_{t,d} = n_{t,d} / Σ_k n_{k,d}

where n_{t,d} is the number of times term t appears in document d, and the denominator sums the counts n_{k,d} over the k distinct terms in document d.

IDF_i = log ( |D| / |{d : t_i ∈ d}| )

where |D| is the number of documents in the corpus and |{d : t_i ∈ d}| is the number of documents in which the term t_i appears.

TFIDF_{i,d} = TF_{i,d} × IDF_i
In selecting keywords from the document collection, we defined and removed all 'stopwords' (words that appear commonly in sentences but carry little meaning with regard to the ontologies, such as prepositions and definite and indefinite articles) and extracted all the distinct words in the document collection to build the vector-space model for each document. Afterwards, by calculating the TFIDF measure for every word in the word list within the individual documents and imposing a threshold value, we were able to identify the keywords that correspond to a given document.
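The keyword-selection procedure above can be sketched as follows; the stopword list, the threshold value of 0.2 and the toy documents are illustrative assumptions, not the system's actual settings:

```python
# A sketch of TFIDF keyword selection following the formulas above.
import math
from collections import Counter

STOPWORDS = {"a", "an", "the", "is", "in", "and", "of"}

def tfidf_keywords(docs, threshold=0.2):
    """docs: list of token lists. Returns one keyword set per document."""
    counts = [Counter(t for t in d if t not in STOPWORDS) for d in docs]
    df = Counter()                      # document frequency per term
    for c in counts:
        df.update(c.keys())
    keywords = []
    for c in counts:
        total = sum(c.values())         # sum_k n_{k,d}
        kws = set()
        for term, n in c.items():
            tf = n / total              # TF_{t,d} = n_{t,d} / sum_k n_{k,d}
            idf = math.log(len(docs) / df[term])
            if tf * idf > threshold:    # keep only sufficiently relevant terms
                kws.add(term)
        keywords.append(kws)
    return keywords

docs = [["ball", "is", "a", "round", "object"],
        ["baseball", "is", "a", "ball", "game"],
        ["chess", "is", "a", "board", "game"]]
print(tfidf_keywords(docs))
```

Note how words common across the toy collection ('ball', 'game') fall below the threshold, while document-specific words survive as keywords.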
With this vector-space representation of each document, the user is then given the opportunity to group the documents into a desired number of clusters. This is achieved through the k-means clustering algorithm [14], described below.
k-means Algorithm:
1) Choose k cluster centers randomly from the data points
2) Assign each data point to the closest cluster center based on the similarity measure
3) Re-compute the cluster centers
4) If the assignment of data points has changed (i.e. the process has not converged), repeat from step 2
The similarity measure used in our approach is the cosine similarity [15] between two vectors. If A and B are two vectors in our vector space, it is defined as follows:

cosine similarity = cos(θ) = (A · B) / (||A|| ||B||)

Once these clusters have been formed, the user selects a
concept in the raw ontology that is taking shape and the
system provides suggestions for that concept (excluding the
concepts already added as sub-concepts of that concept). In
achieving this, the system looks for the cluster that contains
the highest TFIDF value for that word (in the case where a
concept consists of more than one word, the system looks
for all the clusters where each word is located) and suggests
the keywords using two criteria.
1) Suggest the individual keywords of the document
vectors of that cluster
2) Suggest the highest-valued keywords in the centroid vector of that cluster. The centroid vector is the single vector obtained by summing the respective elements of all the vectors in the cluster
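The clustering procedure described in this section, combining the k-means steps with cosine similarity as the closeness measure, can be sketched as follows. Plain-Python vectors keep the sketch self-contained; the toy vectors are illustrative:

```python
# A sketch of k-means with cosine similarity, following the steps above.
import math
import random

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def kmeans(vectors, k, seed=0):
    random.seed(seed)
    centers = random.sample(vectors, k)              # 1) random initial centers
    assignment = None
    while True:
        new = [max(range(k), key=lambda c: cosine(v, centers[c]))
               for v in vectors]                     # 2) assign to closest center
        if new == assignment:                        # 4) converged?
            return new, centers
        assignment = new
        for c in range(k):                           # 3) recompute centers
            members = [v for v, a in zip(vectors, assignment) if a == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]

# two roughly x-direction and two roughly y-direction document vectors
vectors = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.8]]
labels, centers = kmeans(vectors, k=2)
print(labels)
```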
C. Concepts From Sentence Pattern Matching
In enhancing the accuracy and the comprehensiveness of
the ontologies constructed through our system, we have
proposed to implement a sentence analyzing module to
work alongside the previous two layers. Several research
attempts at identifying lexical patterns within text files have
been proposed and we intend to utilize the best of these
methods to enable the users of our system to enhance
the ontologies they are constructing. A popular method for
acquiring hyponymic relations was presented by [16] and we
have begun to implement a similar approach in our system.
The motivation for such syntactic processing comes from the
fact that there is a significant number of common sentence
patterns appearing in the Wikipedia XML Corpus and this
Figure 2. WikiOnto Ontology Construction Environment
will allow the system to provide better suggestions to the
user in constructing the ontology.
In implementing this layer, we will incorporate a Part-Of-Speech (POS) tagger, and with the help of the POS-tagged text, we intend to extract relations as follows.
Sentences like "A ball is a round object that is used most often ...", which appeared in the example XML file earlier, are evidence of common sentence patterns in the corpus. In this instance, "A ball is a round object" matches the pattern "{NP} is a {NP}" in the POS-tagged text, where 'NP' refers to a noun phrase. Several other patterns, such as "{NP} including {NP,}* and {NP}" (e.g. "All third world countries including Sri Lanka, India and Pakistan"), are candidates for very obvious relations, and this should enable us to make more comprehensive suggestions to the user. Again, these candidate word phrases will be validated against WordNet, and the decision to add them to the ontology will lie entirely at the user's discretion.
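The pattern-matching idea above can be sketched as follows, in the spirit of Hearst [16]. For brevity this sketch matches surface strings with regular expressions rather than true POS-tagged noun phrases, so both patterns are simplified stand-ins:

```python
# A sketch of lexico-syntactic pattern matching on raw sentences.
# The regexes approximate the "{NP} is a {NP}" and
# "{NP} including {NP,}* and {NP}" patterns discussed above.
import re

# "{NP} is a {NP}"  ->  (hyponym, hypernym)
ISA = re.compile(r"(?:a |an |the )?([a-z]+) is an? ([a-z ]+?)(?:\.|,| that)")

# "{NP} including {NP,}* and {NP}"  ->  (hypernym, [hyponyms])
INCLUDING = re.compile(r"([a-z ]+?) including ((?:[a-z ]+, )*[a-z ]+ and [a-z ]+)")

def extract_isa(sentence):
    m = ISA.search(sentence.lower())
    return (m.group(1), m.group(2)) if m else None

def extract_including(sentence):
    m = INCLUDING.search(sentence.lower())
    if not m:
        return None
    hyponyms = [h.strip() for h in re.split(r",| and ", m.group(2)) if h.strip()]
    return m.group(1), hyponyms

print(extract_isa("A ball is a round object that is used most often in sports."))
print(extract_including("All third world countries including Sri Lanka, India and Pakistan"))
```

On the two example sentences from this section, the first pattern yields ('ball', 'round object') and the second yields the three countries as hyponyms of 'all third world countries'.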
V. THE GRAPHICAL MODELING ENVIRONMENT
We have used C# as our language of choice for this system and Piccolo2D for the visualization of the ontology editor. The user is given the flexibility to add and delete concepts from the taxonomic hierarchy, as well as the capability to rename relations and concepts at their discretion (Figure 2).
At the beginning of the ontology construction process, the user is able to query for a concept, regardless of whether that concept exists in the initially extracted concept collection. The system will generate the OWL definition for the ontology being built, and the user is able to export the ontology in this standard format.
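As an illustration of the export step, a taxonomy of (sub-concept, super-concept) pairs can be serialized in OWL's RDF/XML form along the following lines; the namespace URI is a hypothetical placeholder, and this is not the system's actual exporter (which is written in C#):

```python
# A sketch of serializing a concept taxonomy as OWL (RDF/XML syntax).
# The base URI is a hypothetical placeholder.
def to_owl(pairs, base="http://example.org/wikionto#"):
    def uri(name):
        return base + name.replace(" ", "_")
    lines = ['<?xml version="1.0"?>',
             '<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"',
             '         xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"',
             '         xmlns:owl="http://www.w3.org/2002/07/owl#">']
    for child, parent in pairs:
        # each sub-concept relation becomes an rdfs:subClassOf axiom
        lines += ['  <owl:Class rdf:about="%s">' % uri(child),
                  '    <rdfs:subClassOf rdf:resource="%s"/>' % uri(parent),
                  '  </owl:Class>']
    lines.append('</rdf:RDF>')
    return "\n".join(lines)

print(to_owl([("Baseball", "Popular Ball Games"),
              ("Popular Ball Games", "Ball")]))
```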
VI. EVALUATION AND FUTURE WORK
Because the project is still ongoing, a thorough evaluation of the system remains distant. However, the initial prototype was tested with undergraduates and several faculty members of the authors' university, and we received commendable feedback as well as several requests for additional features. Additionally, since the ontology being generated depends, for the most part, on the user's choices, a standard evaluation mechanism seems impractical. The most well-known evaluation methods, such as comparison with a gold standard (an ontology accepted as defining the targeted domain accurately) or testing the resulting ontology in an application, seem inappropriate given the system's dependence upon the user. Because our system is a facilitator rather than a fully-fledged ontology generator, we plan to evaluate the system through user trials towards the end of the project.
As the project has progressed, we have raised our expectations of the potential applications that can make use of a system that takes semi-structured data and extracts topic ontologies from it. In particular, we are focusing on how the proposed project can be extended to other document sources, such as the Reuters Collection [17].
VII. CONCLUSION
In this paper, we have introduced WikiOnto: a system for extracting and modeling topic ontologies using the Wikipedia XML Corpus. Through detailed explanations, we have presented the methodology of our system, in which we propose a three-tiered approach to concept and relation extraction from the corpus, as well as a development environment for modeling the ontology. The project is still ongoing and is expected to produce successful outcomes in the area of ontology extraction, as well as to inspire further research in ontology extraction and modeling.
We expect to make the system and its source code freely available. The extensibility to other document sources is still being tested, which is why it was left out of this paper. We hope to announce the results in time for the final system to be available to this paper's intended audience.
REFERENCES
[1] T. R. Gruber, "A translation approach to portable ontology specifications," Knowl. Acquis., vol. 5, no. 2, pp. 199–220, 1993.
[2] Ontoprise, "OntoStudio," http://www.ontoprise.de/, 2007 (accessed 2009-05-08).
[3] L. Denoyer and P. Gallinari, "The Wikipedia XML Corpus," SIGIR Forum, 2006.
[4] S. Tratz, M. Gregory, P. Whitney, C. Posse, P. Paulson, B. Baddeley, R. Hohimer, and A. White, "Ontological annotation with WordNet."
[5] C. Fellbaum, Ed., WordNet: An Electronic Lexical Database. Cambridge, MA; London: The MIT Press, 1998. [Online]. Available: http://mitpress.mit.edu/catalog/item/default.asp?ttype=2&tid=8106
[6] C. Li and T. W. Ling, "From XML to semantic web," in 10th International Conference on Database Systems for Advanced Applications, 2005, pp. 582–587.
[7] B. Omelayenko, "Learning of ontologies for the web: the analysis of existent approaches," in Proceedings of the International Workshop on Web Dynamics, 2001.
[8] N. Kozlova, "Automatic ontology extraction for document classification," master's thesis, Saarland University, Germany, Tech. Rep., February 2005.
[9] D. Faure, C. Nédellec, and C. Rouveirol, "Acquisition of semantic knowledge using machine learning methods: The system 'ASIUM'," Université Paris Sud, Tech. Rep., 1998.
[10] P. Velardi, R. Navigli, A. Cucchiarelli, and F. Neri, "Evaluation of OntoLearn, a methodology for automatic population of domain ontologies," in Ontology Learning from Text: Methods, Applications and Evaluation, P. Buitelaar, P. Cimiano, and B. Magnini, Eds. IOS Press, 2006.
[11] A. Maedche and S. Staab, "Semi-automatic engineering of ontologies from text," in Proceedings of the 12th International Conference on Software Engineering and Knowledge Engineering, 2000.
[12] "OntoGen: Semi-automatic ontology editor," in Human Interface, Part II, HCII 2007, M. Smith and G. Salvendy, Eds., 2007, pp. 309–318.
[13] G. Salton and C. Buckley, "Term-weighting approaches in automatic text retrieval," Information Processing and Management, 1988, pp. 513–523.
[14] A. K. Jain, M. N. Murty, and P. J. Flynn, "Data clustering: A review," 1999.
[15] G. Salton and C. Buckley, "Term-weighting approaches in automatic text retrieval," Information Processing and Management, 1988, pp. 513–523.
[16] M. A. Hearst, "Automatic acquisition of hyponyms from large text corpora," in Proceedings of the 14th International Conference on Computational Linguistics, 1992, pp. 539–545.
[17] M. Sanderson, "Reuters test collection," in BCS IRSG, 1994. [Online]. Available: citeseer.ist.psu.edu/sanderson94reuters.html