WikiOnto: A System For Semi-automatic Extraction And Modeling Of Ontologies Using Wikipedia XML Corpus

Lalindra De Silva
University of Colombo School of Computing, Colombo, Sri Lanka

Lakshman Jayaratne
University of Colombo School of Computing, Colombo, Sri Lanka
Abstract—This paper introduces WikiOnto: a system that assists in the semi-automatic extraction and modeling of topic ontologies using a preprocessed document corpus of one of the largest knowledge bases in the world, Wikipedia. Based on the Wikipedia XML Corpus, we present a three-tiered framework for rapidly extracting topic ontologies and a modeling environment for refining them. Using Natural Language Processing (NLP) and other Machine Learning (ML) techniques along with a very rich document corpus, this system offers a solution to a task that is generally considered extremely cumbersome. Initial results from the prototype suggest that the system has strong potential for ontology extraction and modeling, and they motivate further research on extracting ontologies from other semi-structured document corpora as well.
Keywords-Ontology, Wikipedia XML Corpus, Ontology Modeling, Ontology Extraction
I. INTRODUCTION
With the growing problem of information overload and the emergence of the semantic web, the importance of ontologies has come under the spotlight in recent years. Ontologies, commonly defined as "an explicit specification of a conceptualization", provide an agreed-upon representation of the concepts of a domain for easier knowledge management [1]. However, modeling ontologies is generally considered a painstaking task, often requiring the knowledge and expertise of an ontology engineer who is well versed in the domain concerned.
Previous efforts at building ontology development environments succeeded in the comprehensiveness of the tools they provided, but building ontologies with these environments still involved a great deal of manual work [2]. Additionally, most efforts to extract ontologies have focused on textual data as their sources. In this research, we have therefore investigated extracting and modeling ontologies using the Wikipedia XML Corpus [3] as the source, with a view to extending this research to other semi-structured data domains as sources for building ontologies1.
The research efforts, and the applications that have resulted from them, in the areas of ontology development, extraction and ontology learning are many and diverse. In light of these attempts, section 2 reviews related prior research on constructing topic ontologies from various input sources. In section 3, we describe our source, the Wikipedia XML Corpus, and the structure of the documents in the corpus, along with the reasons why it is an ideal source for such research. The framework of our system consists of three major layers (Figure 1); section 4 examines in detail how this three-tiered framework is implemented. Finally, we present the ontology development environment with both the current facilities and those proposed to be made available to users of the system.
II. RELATED WORKS
Research efforts to extract ontologies have spanned multifarious domains in terms of the input sources they use. A large number of these efforts have focused on textual data as input, while fewer have focused on semi-structured data and relational databases. [4] presents a method in which the words that appear in the source texts are ontologically annotated using WordNet [5] as the lexical database; the authors primarily worked on assigning verbs that appear in source articles to ontological classes. Several other researchers have utilized XML schemas as a starting point for the creation of ontologies. In [6], the authors used the ORA-SS (Object-Relationship-Attribute Model for Semi-structured Data) model first to determine the nature of the relationships between the elements of the XML schema and subsequently to create a generic model for organizing the ontology. [7] gives a detailed survey of some of the work that has been carried out on learning ontologies for the semantic web. [8] presents a framework for ontology
1This work is based on the continuing undergraduate thesis of the first author, titled 'A Machine Learning Approach To Ontology Extraction and Evaluation Using Semi-structured Data'.
2009 IEEE International Conference on Semantic Computing
978-0-7695-3800-6/09 $26.00 © 2009 IEEE
DOI 10.1109/ICSC.2009.93
extraction for document classification. In this work, the author experimented on the Reuters collection and the Wikipedia database dump as sources and produced a methodology by which concepts can be extracted from such large document bases. Other tools for extracting ontologies from text, such as ASIUM [9], OntoLearn [10] and TextToOnto [11], have also been available. However, the biggest motivation for our research stems from OntoGen [12]. OntoGen is a tool that enables semi-automatic generation of topic ontologies using textual data as the initial source. We have utilized the best features of the OntoGen methodology while extending our research into the semi-structured data domain. Consequently, we have proposed to enhance our system with additional concept extraction mechanisms, such as lexico-syntactic pattern matching, which the OntoGen project has not incorporated.
III. WIKIPEDIA XML CORPUS
Wikipedia, being one of the largest knowledge bases in the world and the most popular reference work on the current World Wide Web, has inspired many people and projects in diverse fields, including knowledge management, information retrieval and ontology development. The reliability of Wikipedia's articles is often criticized, as it is a knowledge base that can be accessed and modified by anyone in the world. For the most part, however, Wikipedia is considered reliable enough to be used in non-critical applications.
The Wikipedia XML Corpus is a collection of documents derived from Wikipedia that is used in a large number of current Information Retrieval and Machine Learning research tasks. The corpus contains articles in eight languages, and the latest collection contains approximately 660,000 articles in English that correspond to articles in Wikipedia. Each XML document in the corpus is uniquely named (e.g. 12345.xml) and has the following uniform element structure (e.g. 3928.xml, representing the article on "Ball"):
<article>
  <name id="3928">Ball</name>
  <body>
    <p>A <emph3>ball</emph3> is a round object that is used most often in
      <collectionlink xlink:type="simple" xlink:href="26853.xml">sport</collectionlink>s and
      <collectionlink xlink:type="simple" xlink:href="11970.xml">game</collectionlink>s.</p>
    ...
    <section>
      <title>Popular ball games</title>
      There are many popular games...
      <normallist>
        <item>
          <collectionlink xlink:type="simple" xlink:href="3850.xml">Baseball</collectionlink>
        </item>
        <item>
          <collectionlink xlink:type="simple" xlink:href="3812.xml">Basketball</collectionlink>
        </item>
      </normallist>
    </section>
    ...
    <languagelink lang="cs">M?</languagelink>
    <languagelink lang="de">Ball</languagelink>
    ...
  </body>
</article>
Each document is contained within the 'article' tag, within which lies a limited number of predefined element tags that correspond to specific relations between text segments in the document. The 'name' element carries the title of the article, while the 'title' element within each 'section' element corresponds to the title of a subsection. The 'normallist' elements contain lists of information, while the 'collectionlink' and 'unknownlink' elements correspond to other articles referred to from within the document.
In addition to the documents, the corpus provides category information that hierarchically organizes the articles into the relevant categories. Using this category information, we were able to develop the initial segment of our system, in which the user can choose a certain number of articles from the domain of interest to be input to the system. Selecting such a domain-specific document set enables better-targeted and faster ontology creation than using the entire corpus as the input.
IV. WIKIONTO FRAMEWORK
As previously mentioned, the WikiOnto system is imple-
mented in a three-tiered framework. The following sections
explain each layer in detail.
A. Concept Extraction Using Document Structure
In this layer, we make use of the structure of the docu-
ments to deduce substantial information about the taxonomic
hierarchy of possible concepts in that document. After
careful review of many documents in the corpus, we have
established the following assumptions in extracting concepts
from the documents.
• Word phrases contained within the 'name', 'title', 'item', 'collectionlink' and 'unknownlink' elements are proposed as concepts to the user (e.g. in the previous example of the XML file, the words 'Ball', 'Sport', 'Game', 'Popular Ball Games', 'Baseball' and 'Basketball' are all suggested to the user as concepts)
Figure 1. Design of the WikiOnto System
• The word phrases contained within 'title' elements in each of the first-level sections are suggested as sub-concepts of the concept within the 'name' element of the document (e.g. 'Popular Ball Games' is a sub-concept of the concept 'Ball')
• A word phrase contained within a 'title' element nested within several 'section' elements is suggested as a sub-concept of the 'title' concept of the immediately enclosing section
• Any word phrase wrapped inside 'collectionlink' or 'unknownlink' elements is suggested as a sub-concept of the concept immediately above it in the structure of the XML document (e.g. the concepts 'Baseball' and 'Basketball' are sub-concepts of the concept 'Popular Ball Games', while the concepts 'Sport' and 'Game' are sub-concepts of the concept 'Ball')
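As an illustration, the structural assumptions above can be sketched as follows. The element names follow the corpus example in section 3, but the parsing code is our illustrative reconstruction (the actual system is written in C#), and only first-level sections are handled, for brevity:

```python
# A sketch of the structural concept-extraction assumptions above.
# Element names follow the Wikipedia XML Corpus; the parsing logic is an
# illustrative reconstruction, not the authors' actual code.
import xml.etree.ElementTree as ET

ARTICLE = """\
<article><name id="3928">Ball</name><body>
<section><title>Popular ball games</title>
<normallist>
<item><collectionlink>Baseball</collectionlink></item>
<item><collectionlink>Basketball</collectionlink></item>
</normallist>
</section>
</body></article>"""

def extract_concepts(xml_text):
    """Return (concept, parent-concept) pairs per the assumptions above."""
    root = ET.fromstring(xml_text)
    article_concept = root.findtext("name").strip()
    pairs = []
    for section in root.find("body").iter("section"):
        title = section.findtext("title")
        if title:                        # section title -> sub-concept of article
            pairs.append((title.strip(), article_concept))
            for link in section.iter("collectionlink"):
                pairs.append((link.text.strip(), title.strip()))
    return pairs

print(extract_concepts(ARTICLE))
```

On the toy article above, this yields 'Popular ball games' as a sub-concept of 'Ball' and the two linked games as sub-concepts of 'Popular ball games'.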
During initialization of the system, the user is given the opportunity to define the maximum number of words that a potential concept may contain. Once the user inputs the documents to be used as sources for concept extraction (either using the domain-specific document selector explained earlier, choosing documents manually, or inputting the entire corpus2), the system iterates through all the documents to extract the potential concepts and their relationships according to the assumptions listed earlier. The description of the ontology modeling environment (section 5) explains how the user can refine these relationships after examining the concepts (i.e. how the user can label the relationships as hyponymic, hypernymic, meronymic relations, etc.).
In order to validate the concepts that are extracted and suggested, we have incorporated WordNet [5] into our system. When all the concepts have been extracted from the documents as per the above assumptions, each concept is matched morphologically, word by word, against WordNet. If even a single word in a word phrase cannot be morphologically matched with WordNet (owing to reasons such as being a foreign word, a person's name, a place name, etc.), the whole word phrase is withheld from the concept collection and is presented to the user at the end of the processing stage along with the rest of such concepts.
2Due to computer resource constraints, the system is yet to be tested with the full corpus (approx. 660,000 articles) as input.
The user has full discretion to decide whether these concepts should be added to the concept collection. At the end of this processing stage, the user is able to query for a concept in the collection and start building an ontology with the queried concept as the root. The system automatically populates the first level of concepts and, according to the user's needs, will populate additional levels as well (i.e. initially the ontology includes only the immediate concepts that have a sub-concept relationship with the root concept; at the user's discretion, the system will then treat each concept in the first level as a parent concept and add its child concepts to it).
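The word-by-word morphological validation step can be sketched as follows. The real system consults WordNet [5]; here a tiny hypothetical lexicon and a crude suffix-stripping rule stand in for WordNet so the sketch stays self-contained:

```python
# A sketch of the word-by-word validation step. LEXICON and the suffix
# rules are hypothetical stand-ins for WordNet's morphological lookup.
LEXICON = {"ball", "game", "sport", "popular", "baseball"}

def morph_match(word, lexicon=LEXICON):
    """Very rough stand-in for a WordNet morphological lookup."""
    w = word.lower()
    if w in lexicon:
        return True
    for suffix in ("s", "es", "ing", "ed"):   # crude inflection stripping
        if w.endswith(suffix) and w[: -len(suffix)] in lexicon:
            return True
    return False

def validate(phrase):
    """Accept a phrase only if every word matches; otherwise it is
    withheld and deferred to the user, as described above."""
    return all(morph_match(w) for w in phrase.split())

print(validate("Popular Ball Games"))   # True: every word matches
print(validate("Colombo Ball Games"))   # False: 'Colombo' fails, so deferred
```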
B. Concepts From Keyword Clustering
As the second layer of our framework, we have implemented a keyword clustering method to identify the concepts related to a given concept. There are several well-documented approaches and metrics for extracting keywords and measuring their relevance to documents. We have used the well-known TFIDF measure [13], which is generally accepted as a good measure of the relevance of a term to a document, and have represented each document in the vector-space model using these keywords. The TFIDF measure is the product of the Term Frequency (TF) and the Inverse Document Frequency (IDF), defined as follows:
TF_{t,d} = n_{t,d} / Σ_k n_{k,d}

where n_{t,d} is the number of times term t appears in document d, and the denominator sums the counts n_{k,d} over the k distinct terms in document d.

IDF_i = log ( |D| / |{d : t_i ∈ d}| )

where |D| is the number of documents in the corpus and |{d : t_i ∈ d}| is the number of documents in which the term t_i appears.

TFIDF_{i,d} = TF_{i,d} × IDF_i
In selecting keywords from the document collection, we defined and removed all 'stopwords' (words that appear commonly in sentences but carry little meaning with regard to the ontologies, such as prepositions and definite and indefinite articles) and extracted all the distinct words in the document collection to build the vector-space model for each document. Afterwards, by calculating the TFIDF measure for every word in the word list within the individual documents and imposing a threshold value, we were able to identify the keywords that correspond to a given document.
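The keyword-selection procedure above can be sketched as follows; the stopword list, the threshold value of 0.2 and the toy documents are illustrative assumptions, not the system's actual settings:

```python
# A sketch of TFIDF keyword selection following the formulas above.
import math
from collections import Counter

STOPWORDS = {"a", "an", "the", "is", "in", "and", "of"}

def tfidf_keywords(docs, threshold=0.2):
    """docs: list of token lists. Returns one keyword set per document."""
    counts = [Counter(t for t in d if t not in STOPWORDS) for d in docs]
    df = Counter()                      # document frequency per term
    for c in counts:
        df.update(c.keys())
    keywords = []
    for c in counts:
        total = sum(c.values())         # sum_k n_{k,d}
        kws = set()
        for term, n in c.items():
            tf = n / total              # TF_{t,d} = n_{t,d} / sum_k n_{k,d}
            idf = math.log(len(docs) / df[term])
            if tf * idf > threshold:    # keep only sufficiently relevant terms
                kws.add(term)
        keywords.append(kws)
    return keywords

docs = [["ball", "is", "a", "round", "object"],
        ["baseball", "is", "a", "ball", "game"],
        ["chess", "is", "a", "board", "game"]]
print(tfidf_keywords(docs))
```

Note how words common across the toy collection ('ball', 'game') fall below the threshold, while document-specific words survive as keywords.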
With this vector-space representation of each document, the user is then given the opportunity to group the documents into a desired number of clusters. This is achieved through the k-means clustering algorithm [14], described below.
k-means Algorithm:
1) Choose k cluster centers randomly from the data points
2) Assign each data point to the closest cluster center based on the similarity measure
3) Re-compute the cluster centers
4) If the assignment of data points has changed (i.e. the process has not converged), repeat from step 2
The similarity measure used in our approach is the cosine similarity [15] between two vectors. If A and B are two vectors in our vector space, it is defined as follows:

cosine similarity = cos(θ) = (A · B) / (||A|| ||B||)

Once these clusters have been formed, the user selects a
concept in the raw ontology that is taking shape and the
system provides suggestions for that concept (excluding the
concepts already added as sub-concepts of that concept). In
achieving this, the system looks for the cluster that contains
the highest TFIDF value for that word (in the case where a
concept consists of more than one word, the system looks
for all the clusters where each word is located) and suggests
the keywords using two criteria.
1) Suggest the individual keywords of the document
vectors of that cluster
2) Suggest the highest-valued keywords in the centroid vector of that cluster. The centroid vector is the single vector obtained by summing the respective elements of all the vectors in the cluster
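The clustering procedure described in this section, combining the k-means steps with cosine similarity as the closeness measure, can be sketched as follows. Plain-Python vectors keep the sketch self-contained; the toy vectors are illustrative:

```python
# A sketch of k-means with cosine similarity, following the steps above.
import math
import random

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def kmeans(vectors, k, seed=0):
    random.seed(seed)
    centers = random.sample(vectors, k)              # 1) random initial centers
    assignment = None
    while True:
        new = [max(range(k), key=lambda c: cosine(v, centers[c]))
               for v in vectors]                     # 2) assign to closest center
        if new == assignment:                        # 4) converged?
            return new, centers
        assignment = new
        for c in range(k):                           # 3) recompute centers
            members = [v for v, a in zip(vectors, assignment) if a == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]

# two roughly x-direction and two roughly y-direction document vectors
vectors = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.8]]
labels, centers = kmeans(vectors, k=2)
print(labels)
```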
C. Concepts From Sentence Pattern Matching
In enhancing the accuracy and the comprehensiveness of
the ontologies constructed through our system, we have
proposed to implement a sentence analyzing module to
work alongside the previous two layers. Several research
attempts at identifying lexical patterns within text files have
been proposed and we intend to utilize the best of these
methods to enable the users of our system to enhance
the ontologies they are constructing. A popular method for
acquiring hyponymic relations was presented by [16] and we
have begun to implement a similar approach in our system.
The motivation for such syntactic processing comes from the
fact that there is a significant number of common sentence
patterns appearing in the Wikipedia XML Corpus and this
Figure 2. WikiOnto Ontology Construction Environment
will allow the system to provide better suggestions to the
user in constructing the ontology.
In implementing this layer, we will incorporate a Part-Of-Speech (POS) tagger, and with the help of the POS-tagged text, we intend to extract relations as follows.
Sentences like "A ball is a round object that is used most often ...", which appeared in the example XML file earlier, are evidence of common sentence patterns in the corpus. In this instance, "A ball is a round object" matches the pattern "{NP} is a {NP}" in the POS-tagged text, where 'NP' refers to a noun phrase. Several other patterns, such as "{NP} including {NP,}* and {NP}" (e.g. "All third world countries including Sri Lanka, India and Pakistan"), are candidates for very obvious relations, and this should enable us to make more comprehensive suggestions to the user. Again, these candidate word phrases will be validated against WordNet, and the decision to add them to the ontology will lie entirely at the user's discretion.
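The pattern-matching idea above can be sketched as follows, in the spirit of Hearst [16]. For brevity this sketch matches surface strings with regular expressions rather than true POS-tagged noun phrases, so both patterns are simplified stand-ins:

```python
# A sketch of lexico-syntactic pattern matching on raw sentences.
# The regexes approximate the "{NP} is a {NP}" and
# "{NP} including {NP,}* and {NP}" patterns discussed above.
import re

# "{NP} is a {NP}"  ->  (hyponym, hypernym)
ISA = re.compile(r"(?:a |an |the )?([a-z]+) is an? ([a-z ]+?)(?:\.|,| that)")

# "{NP} including {NP,}* and {NP}"  ->  (hypernym, [hyponyms])
INCLUDING = re.compile(r"([a-z ]+?) including ((?:[a-z ]+, )*[a-z ]+ and [a-z ]+)")

def extract_isa(sentence):
    m = ISA.search(sentence.lower())
    return (m.group(1), m.group(2)) if m else None

def extract_including(sentence):
    m = INCLUDING.search(sentence.lower())
    if not m:
        return None
    hyponyms = [h.strip() for h in re.split(r",| and ", m.group(2)) if h.strip()]
    return m.group(1), hyponyms

print(extract_isa("A ball is a round object that is used most often in sports."))
print(extract_including("All third world countries including Sri Lanka, India and Pakistan"))
```

On the two example sentences from this section, the first pattern yields ('ball', 'round object') and the second yields the three countries as hyponyms of 'all third world countries'.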
V. THE GRAPHICAL MODELING ENVIRONMENT
We have used C# as our language of choice for this system and Piccolo2D for the visualization of the ontology editor. The user is given the flexibility to add and delete concepts from the taxonomic hierarchy, as well as the capability to rename relations and concepts at their discretion (Figure 2).
At the beginning of the ontology construction process, the user is able to query for a concept, regardless of whether that concept exists in the initially extracted concept collection. The system will generate the OWL definition for the ontology being built, and the user is able to export the ontology in this standard format.
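As an illustration of the export step, a taxonomy of (sub-concept, super-concept) pairs can be serialized in OWL's RDF/XML form along the following lines; the namespace URI is a hypothetical placeholder, and this is not the system's actual exporter (which is written in C#):

```python
# A sketch of serializing a concept taxonomy as OWL (RDF/XML syntax).
# The base URI is a hypothetical placeholder.
def to_owl(pairs, base="http://example.org/wikionto#"):
    def uri(name):
        return base + name.replace(" ", "_")
    lines = ['<?xml version="1.0"?>',
             '<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"',
             '         xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"',
             '         xmlns:owl="http://www.w3.org/2002/07/owl#">']
    for child, parent in pairs:
        # each sub-concept relation becomes an rdfs:subClassOf axiom
        lines += ['  <owl:Class rdf:about="%s">' % uri(child),
                  '    <rdfs:subClassOf rdf:resource="%s"/>' % uri(parent),
                  '  </owl:Class>']
    lines.append('</rdf:RDF>')
    return "\n".join(lines)

print(to_owl([("Baseball", "Popular Ball Games"),
              ("Popular Ball Games", "Ball")]))
```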
VI. EVALUATION AND FUTURE WORK
Because the project is still ongoing, a thorough evaluation of the system remains distant. However, the initial prototype was tested with undergraduates and several faculty members of the authors' university, and we received commendable feedback as well as several requests for additional features. Additionally, since the ontology being generated depends, for the most part, on the user's choices, a standard evaluation mechanism seems impractical. The most well-known evaluation methods, such as comparison with a gold standard (an ontology accepted as defining the targeted domain accurately) or testing the resulting ontology in an application, seem inappropriate given the system's dependence upon the user. Because our system is a facilitator rather than a fully-fledged ontology generator, we plan to evaluate the system through user trials towards the end of the project.
As the project has progressed, we have raised our expectations of the potential applications that can make use of a system that takes semi-structured data and extracts topic ontologies from it. In particular, we are focusing on how the proposed project can be extended to other document sources, such as the Reuters Collection [17].
VII. CONCLUSION
In this paper, we have introduced WikiOnto: a system for extracting and modeling topic ontologies using the Wikipedia XML Corpus. Through detailed explanations, we have presented the methodology of our system, in which we propose a three-tiered approach to concept and relation extraction from the corpus, as well as a development environment for modeling the ontology. The project is still ongoing and is expected to produce successful outcomes in the area of ontology extraction, as well as to inspire further research in ontology extraction and modeling.
We expect to make the system and its source code freely available. The extensibility to other document sources is still being tested, which is why it was left out of this paper. We hope to announce the results in time for the final system to be available to this paper's intended audience.
REFERENCES
[1] T. R. Gruber, "A translation approach to portable ontology specifications," Knowl. Acquis., vol. 5, no. 2, pp. 199–220, 1993.
[2] Ontoprise, "OntoStudio," http://www.ontoprise.de/, 2007 (accessed 2009-05-08).
[3] L. Denoyer and P. Gallinari, "The Wikipedia XML Corpus," SIGIR Forum, 2006.
[4] S. Tratz, M. Gregory, P. Whitney, C. Posse, P. Paulson, B. Baddeley, R. Hohimer, and A. White, "Ontological annotation with WordNet."
[5] C. Fellbaum, Ed., WordNet: An Electronic Lexical Database. Cambridge, MA; London: The MIT Press, 1998. [Online]. Available: http://mitpress.mit.edu/catalog/item/default.asp?ttype=2&tid=8106
[6] C. Li and T. W. Ling, "From XML to semantic web," in 10th International Conference on Database Systems for Advanced Applications, 2005, pp. 582–587.
[7] B. Omelayenko, "Learning of ontologies for the web: the analysis of existent approaches," in Proceedings of the International Workshop on Web Dynamics, 2001.
[8] N. Kozlova, "Automatic ontology extraction for document classification," master's thesis, Saarland University, Germany, Tech. Rep., February 2005.
[9] D. Faure, C. Nédellec, and C. Rouveirol, "Acquisition of semantic knowledge using machine learning methods: The system 'ASIUM'," Université Paris Sud, Tech. Rep., 1998.
[10] P. Velardi, R. Navigli, A. Cucchiarelli, and F. Neri, "Evaluation of OntoLearn, a methodology for automatic population of domain ontologies," in Ontology Learning from Text: Methods, Applications and Evaluation, P. Buitelaar, P. Cimiano, and B. Magnini, Eds. IOS Press, 2006.
[11] A. Maedche and S. Staab, "Semi-automatic engineering of ontologies from text," in Proceedings of the 12th International Conference on Software Engineering and Knowledge Engineering, 2000.
[12] "OntoGen: Semi-automatic ontology editor," in Human Interface, Part II, HCII 2007, M. Smith and G. Salvendy, Eds., 2007, pp. 309–318.
[13] G. Salton and C. Buckley, "Term-weighting approaches in automatic text retrieval," Information Processing and Management, 1988, pp. 513–523.
[14] A. K. Jain, M. N. Murty, and P. J. Flynn, "Data clustering: A review," 1999.
[15] G. Salton and C. Buckley, "Term-weighting approaches in automatic text retrieval," Information Processing and Management, 1988, pp. 513–523.
[16] M. A. Hearst, "Automatic acquisition of hyponyms from large text corpora," in Proceedings of the 14th International Conference on Computational Linguistics, 1992, pp. 539–545.
[17] M. Sanderson, "Reuters test collection," in BCS IRSG, 1994. [Online]. Available: citeseer.ist.psu.edu/sanderson94reuters.html