Eliciting ontology components from semantic specific-domain maps: towards the next generation web

JRG Pulido, Telematics Faculty, University of Colima, Colima, Mexico, [email protected]
SBF Flores, Telematics Faculty, University of Colima, Colima, Mexico
RCM Ramírez, Electrical Engineering Department, UAM-Iztapalapa, DF, Mexico
RA Díaz, Telematics Faculty, University of Colima, Colima, Mexico, [email protected]
Abstract—The present-day web can be broken into smaller pieces that slowly but surely will be transformed into semantic web pieces. This paper describes an approach for eliciting ontology components by using specific-domain maps. The knowledge contained in a particular domain, any kind of digital archive, is portrayed by assembling and displaying its ontology components. The novelty of our approach is that it offers clustering and visualization features not present in other techniques, helping in the semi-automatic construction of ontologies by identifying components from digital archives. We describe a case study that applies our approach to an academic domain. The specific-domain maps generated are presented, and a conceptualized ontology created with the ontology components is introduced. Further processing may be carried out on the extracted knowledge so that it can be embedded in semantic web pages for software agents to use. The ultimate kind of semantic web applications we will be seeing are intelligent software agents searching the web to infer and extract new knowledge from digital archives.

Keywords—Semantic web; ontology learning; self-organizing maps
I. INTRODUCTION
The present-day web and its ever-growing number of web pages [1], [2] must be broken into smaller pieces that slowly but surely will be transformed into semantic web pieces. Reaching specific web pages is a massive challenge, given that current search engines index only a small percentage of the documents on the web. Furthermore, these reachable documents are unstructured, meaning that software agents understand nothing about their actual content. In other words, these documents can be read but not understood [3]. It would be useful to develop representations of the information contained in digital archives and to create intelligent systems that expose pieces of ontology components [4], [5], [6], [7]. In this paper we describe an approach for helping in the eliciting process of ontology components for such web sites.

The remainder of this paper is organized as follows. In section II some related work is introduced. Our approach is outlined in section III. Results are presented in section IV, and conclusions and further work in section V.
II. RELATED WORK
One of the most important challenges that the semantic web poses to the research community is the mapping of large amounts of on-line unstructured knowledge, suitable for humans, to formal representations of knowledge [10], [11]. In the next subsections we take a brief look at some work done on ontologies as well as on the so-called semantic maps.
A. Knowledge representation
Representing knowledge about a domain as an ontology is a challenging process, not only because it requires a lot of computing power, but also because it is difficult to achieve in a consistent and rigorous way. It is easy to lose consistency and to introduce ambiguity and confusion [12]. Nevertheless, ontologies are a useful form of knowledge representation which may be used to support the design and development of intelligent software applications and expert systems. Recently, ontologies have been used for enhancing learning objects [14] and for helping with instructional design tasks [15]. However, web ontologies can take rather different forms. In [16] the use of the so-called Simple HTML Ontology Extension (SHOE) in a real-world internet application was described. This approach allowed authors to add semantic content to web pages, relating the context to common ontologies that provide contextual information about the domain. Most tag-annotated web pages tend to categorize concepts, so there is no need for complex inference rules to perform automatic classification. One of the most exciting uses of an ontology, in the context of the semantic web, is to support the development of agent-based systems for web searching [17], [18]. New approaches, including advanced ontology languages, have been proposed, such as OIL and DAML, which later evolved into OWL; the latter is now a standard [19], [20], [21], [22], [23].
B. Towards specific-domain maps
Self-organizing maps (SOM) are the artificial neural network technique we have used to produce specific-domain
2009 Latin American Web Congress. 978-0-7695-3856-3/09 © 2009 IEEE. DOI 10.1109/LA-WEB.2009.27
Figure 1. (Left) Basic taxonomy for an academic domain [8], with entities such as University, School, Person, Researcher, Research group, Publication, Article, and Book linked by is-a, part-of, and has relationships. (Right) Embedded knowledge into a web page using an early ontology approach [9].
maps. SOM are of benefit to quite a few areas. For instance, they have been used to carry out genome sequence mapping tasks [24], to classify volcanic events [25], and for document organization [26]. Perhaps the best-known SOM project is [27], [28], which describes the results of applying WEBSOM2, a document organization, searching, and browsing system, to a set of about 7 million electronic patent abstracts. In this case, a document map is presented as a series of HTML pages facilitating exploration. In [29] a distributed architecture for the extraction of meta-data from WWW documents was proposed; it is particularly suited for repositories of historical publications. This information extraction system is based on semi-structured data analysis. The system output is a meta-data object containing a concise representation of the corresponding publication and its components. In that research, gatherers were designed as a combination of a parser, based on a context-free grammar, and a web robot, which navigates the links contained in the basic document type to infer the document structure of the entire site. These meta-data objects can be interchanged with other web agents, then classified and organized.
III. METHODS
The idea of combining ontologies and knowledge maps has motivated our work. For the semantic web to become a reality, we need to transform the current web into a web where software agents are able to negotiate and carry out trivial tasks for us. Doing this manually would create a bottleneck for the semantic web. We need software tools that help us accomplish this enterprise. Our software is written in Java, which offers robust, multiplatform, and easy networking functionality. Being an object-oriented programming language, it also facilitates reuse. Java and its various APIs are powerful enough for constructing ontology software systems.
Our suite of ontology learning tools consists basically of two applications: Spade and Grubber [8]. The former pre-processes HTML pages and creates a document space. The latter is fed with the document space and produces knowledge maps that allow us to visualize the ontology components contained in a digital archive. These may later be organized as a set of Entities, Relations, and Functions. Semantic problem solvers use this triad for inferring new data from knowledge bases [30], [31], [32], [33].
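The docuspace built by Spade can be pictured as a term-frequency matrix over the archive's pages. The sketch below is only illustrative: the tag stripping and tokenization rules are our assumptions, not the tool's actual behaviour.

```python
from collections import Counter
import re

def build_docuspace(pages):
    """Build a simple term-frequency document space from raw HTML strings.
    Rows are documents (URLs of the domain), columns are vocabulary terms."""
    tokenized = []
    for html in pages:
        text = re.sub(r"<[^>]+>", " ", html).lower()          # strip HTML tags
        tokenized.append(Counter(re.findall(r"[a-z]{3,}", text)))
    vocab = sorted(set().union(*tokenized))
    # one feature vector per document
    return vocab, [[doc[t] for t in vocab] for doc in tokenized]

vocab, vectors = build_docuspace(["<p>research projects</p>",
                                  "<p>lectures and modules</p>"])
```

A real pre-processor would also remove stop words and stem terms (the stemmed labels visible in the attribute maps suggest the authors do something along these lines), but the matrix shape is the point: one feature vector per page, one dimension per term.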
A. The Algorithm
SOM can be viewed as a model of unsupervised learning and an adaptive knowledge representation scheme. Adaptive means that at each iteration a single sample is used to update the weight vectors of a neighbourhood of neurons [34], [35]. Adaptation of the model vectors takes place according to the following equation:
mi(t + 1) = mi(t) + hci(t)[x(t) − mi(t)] (1)
where t ∈ ℕ is the discrete time coordinate, mi ∈ ℝⁿ is a node, and hci(t) is a neighbourhood function. The latter has a central role, as it acts as a smoothing kernel defined over the lattice points and defines the stiffness of the surface to be fitted to the data points. This function may be constant for all the cells in the neighbourhood and zero elsewhere. A common neighbourhood kernel that describes a natural mapping and that is used for this purpose can be written in terms of the Gaussian function:
hci(t) = α(t) exp(−‖rc − ri‖² / (2σ²(t))) (2)
where rc, ri ∈ ℝ² are the locations of the winner and a neighbouring node on the grid, α(t) is the learning rate (0 ≤
Figure 2. Entity maps. (Left) Full entity map. (Right) Reduced entity map. Visible cluster labels include: department, people, research, modules, lectures, slides, committee, projects, staff, papers, dissertations, coursework, tutorials, theses, meetings, teaching, research projects, proposals, surveys, publications, labs.
α(t) ≤ 1), and σ(t) is the width of the kernel. Both α(t) and σ(t) decrease monotonically. The major phases of our approach are as follows:
1) Produce a document space. A document space (docuspace) of feature vectors is created by Spade¹. This docuspace is produced with the data coming from the specific-domain digital archive [8].
2) Construct the SOM. A second software tool, Grubber², is fed and trained with the docuspaces, and knowledge maps are then created.
Ontology components can be visualized, clustered together, in the knowledge maps created. Preliminary results were surprisingly close to our intuitive expectations. After this, other ontology tools such as editors can be used to organize this knowledge. Then it can be embedded into the digital archive it was extracted from (fig. 1) by means of any of the existing ontology languages [20].
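The adaptation rule in equations (1) and (2) can be sketched in a few lines. This is a minimal illustration of the standard SOM update, not the authors' Grubber code; the linear decay schedules and the constants are our assumptions.

```python
import numpy as np

def som_step(nodes, grid, x, t, t_max, alpha0=0.5, sigma0=2.0):
    """One SOM adaptation step, eqs. (1)-(2): move every node mi toward
    the sample x, weighted by a Gaussian neighbourhood around the winner."""
    alpha = alpha0 * (1.0 - t / t_max)            # learning rate α(t), decreases monotonically
    sigma = sigma0 * (1.0 - t / t_max) + 1e-3     # kernel width σ(t), decreases monotonically
    c = int(np.argmin(np.linalg.norm(nodes - x, axis=1)))  # winner (best match to x)
    d2 = np.sum((grid - grid[c]) ** 2, axis=1)    # squared grid distances ‖rc − ri‖²
    h = alpha * np.exp(-d2 / (2.0 * sigma ** 2))  # neighbourhood kernel hci(t), eq. (2)
    return nodes + h[:, None] * (x - nodes)       # mi(t+1) = mi(t) + hci(t)[x − mi(t)], eq. (1)
```

Training iterates this step over the feature vectors of the docuspace; as α(t) and σ(t) shrink, the map stiffens and nearby grid nodes end up representing similar documents or terms, which is what produces the visible clusters.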
IV. THE ACADEMIC DOMAIN CASE
In this section the results of applying our approach to an academic³ domain are presented. It also includes a brief discussion of the entity maps and the attribute maps created by Grubber. Later, an example of how to use the identified ontology components is outlined.
A. Entity map
These maps are created with the transposed view of the docuspace. In this view, the columns (URLs of the domain) of the dataset determine the dimensions of the maps. In figure 2 the entity maps, full and reduced, created by Grubber are presented.
¹http://docente.ucol.mx/~jrgp/en/ol-approach.html
²http://docente.ucol.mx/~jrgp/en/ol-approach.html
³http://www.nottingham.ac.uk/cs
The full entity map: A few clusters can be identified at a glance by means of the main feature and subfeatures section of Grubber and background knowledge. For instance: department, lectures, modules, research, tutorial, papers, and some other clusters related to the domain. A deeper analysis reveals that other clusters exist within the maps, but they are hard to find because the maps do not exhibit a significant coloring level. Dark areas in the maps are the result of training the artificial neural network with feature input vectors containing a large number of zeros. These do not contribute to an informative coloring of the maps.
B. The attribute map
These maps are created with the front view of the docuspace. In this view, the columns (relevant terms from the domain) of the dataset determine the dimensions of the maps. In figure 3 the attribute maps, full and reduced, created by Grubber are presented.
The full attribute map: A few clusters can again be identified at a glance by means of the main feature and subfeatures section of Grubber and background knowledge. For instance: people, department, research, teaching, and others belonging to this domain. Note that the coloring of these maps is more informative in comparison with the entity maps. Keep in mind that for the attribute maps we are using the columns of the front view of the docuspace, whose dimensionality is smaller than that of the transposed view used earlier by the entity maps. In other words, there are fewer URLs in this case than terms in the dataset. A significant coloring has to do with the number of zeros in the datasets used to train the network.

It must be said that identifying these clusters has not been easy. In order to obtain better coloring from the knowledge maps, a dimensionality reduction was also carried out. The results are described in turn.
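The difference between the two views can be shown with a toy docuspace; the term names, URLs, and counts here are purely hypothetical.

```python
import numpy as np

# Toy docuspace, front view: one row per URL, one column per term.
# All names and counts are illustrative only.
terms = ["research", "lectures", "modules", "people"]
urls = ["/research.html", "/teaching.html"]
front = np.array([[5, 0, 1, 2],    # /research.html
                  [0, 4, 3, 1]])   # /teaching.html

# Attribute maps train on the front view: one input vector per URL,
# with as many dimensions as there are terms.
assert front.shape == (len(urls), len(terms))

# Entity maps train on the transposed view: one input vector per term,
# with as many dimensions as there are URLs.
transposed = front.T
assert transposed.shape == (len(terms), len(urls))
```

Transposing swaps which axis supplies the feature dimensions, and since a real domain has far more terms than URLs, the two views differ sharply in dimensionality and sparsity, which is what drives the coloring differences discussed above.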
Figure 3. Attribute maps. (a) Full attribute map, whose visible labels are stemmed term pairs. (b) Reduced attribute map, with cluster labels such as: computer science, functional programming languages, visual image pattern, head consultor assistant, industry europe, academic institution, artificial intelligence, group supervisor member, student develop, timetable scheduling, plan school, professor phd, unix, nottingham university, binary search tree, activities goals, java, math, document text education, faculty department, guide roadmap, conference journal, database, method class object, lectures projects tutorials teaching.
C. Ontology components from reduced docuspace maps
The following clusters can be identified from the maps by means of the main feature and subfeatures from Grubber, domain experts, and background knowledge (figs. 2, 3). It is clearly seen that an improvement in knowledge map coloring has been achieved by reducing the dimensionality of the docuspace. The clusters of the specific-domain semantic maps are described in turn.
Department: There are a number of clusters on the map about this topic. For example, terms such as science, computer make up one cluster. University, Nottingham form another cluster on the map. Education, faculty, department constitute another cluster. Academic, institution constitute yet another. Other related clusters include terms such as guide, road, map, and turn, plan, school.

Roles: Some clusters about the roles that people play in the school are bracketed together. Terms such as head, consultor, assistant compose a cluster. Group, supervisor, member constitute another cluster. Professor, phd form another cluster. The terms student, develop form another cluster.

Teaching: A number of clusters on this topic have been identified. For example, clusters about Java, programming, data structures, mathematics, databases, and artificial intelligence have been spotted all over the map. As mentioned, teaching relates to a number of other clusters such as: modules, lectures, coursework, labs, projects.

Research: Terms related to some of the interests of the research groups of the school have been identified: visual, image, pattern, functional, programming, languages, scheduling, timetable, heuristics, document, text, xml, interface.

Publications: Two nearby clusters about publications have been spotted. One of them includes terms such as conference, international. The other cluster includes terms such as journal, heuristics.

Industry: A cluster with the terms industry, europe, management has also been identified.

The experiments we have presented in this section show that ours is an effective approach to analyse domains and to identify ontology components. With the aid of domain experts, we are then able to conceptualize domain-specific ontologies. We have reported the use of our approach on other domains [36]. The next step is to use other ontology tools to organize and embed this knowledge into web pages. Figure 4 shows a basic taxonomy conceptualized with the ontology components that have been identified from the actual domain by means of the specific-domain semantic maps, background knowledge, and, most importantly, domain experts.
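Once identified, the components can be organized as the (Entities, Relations, Functions) triad mentioned in section III. Below is a minimal sketch using relationships drawn from the conceptualized taxonomy of figure 4; the Python representation itself is ours, not the output format of the tools.

```python
# Entities and relations drawn from the conceptualized taxonomy (fig. 4);
# this triple-based encoding is illustrative only.
entities = {"AcademicInstitution", "School", "ResearchGroup", "Person",
            "Professor", "Student", "Publication"}
relations = [
    ("School", "part-of", "AcademicInstitution"),
    ("ResearchGroup", "part-of", "School"),
    ("Professor", "is-a", "Person"),
    ("Student", "is-a", "Person"),
    ("ResearchGroup", "has", "Publication"),
]

def related(entity, rel):
    """Entities reachable from `entity` via relation `rel` (a trivial
    stand-in for the inference that semantic problem solvers perform)."""
    return {o for s, r, o in relations if s == entity and r == rel}

assert related("School", "part-of") == {"AcademicInstitution"}
```

An ontology editor would then serialize such triples into one of the ontology languages [20] for embedding into the web pages of the archive.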
V. CONCLUSIONS
The present-day web can be broken into smaller pieces that slowly but surely will be transformed into semantic web pieces. Should that be done manually, the semantic web will not become a reality in the next couple of decades due to this bottleneck. Ontology learning tools are essential for the realization of the semantic web, for the job to be done is quite complex [37], [38]. This paper is an example of how ontology learning tools, along with some other ontology tools, can be used to construct specific-domain ontologies. Experts do not have to construct specific-domain ontologies from scratch anymore. Rather, they can now use appropriate ontology learning tools, such as the knowledge maps we have introduced, to help in the eliciting process of specific-domain ontology components. This kind of software tool helps speed up the ontology creation process, and other software tools such as ontology editors help in validating the created ontology, always under the supervision of experts and the ontology engineer [39]. It must be said that domain experts are always required in order to obtain a desirable level of accuracy in the ontology.
Figure 4. Conceptualized ontology. Entities such as academic institution, school, computer science, Nottingham university, people, professor, student, and research group are linked by is-a, part-of, has, and other relationships; classes carry attributes and instances, attributes may be shared between instances, instances may belong to several groups, and other academic institutions and schools exist.
Further research avenues we are working on involve the use of hybrid systems, in such a way that by combining clustering techniques with the already trained feature vectors we may refine the classification of the knowledge components from the domain [40], [41]. The ultimate kind of software applications that we will be seeing are intelligent software agents searching the semantic web to infer and extract new knowledge from the digital archives.
REFERENCES
[1] P. Schneider and D. Fensel, “Layering the Semantic Web:Problems and directions,” in 1st Int.Semantic Web Conf., ser.LNCS, I. Horrocks and J. Hendler, Eds., vol. 2342. Springer-Verlag, Berlin, 2002, pp. 16–29.
[2] Y. Yang et al., “A study of approaches to hypertext catego-rization,” J.Intelligent Information Systems, vol. 18, no. 2/3,pp. 219–241, 2002.
[3] T. Berners-Lee et al., “The Semantic Web,” Scientific Amer-ican, vol. 284, no. 5, pp. 34–43, May 2001.
[4] JRG Pulido et al., “Identifying ontology components fromdigital archives for the semantic web,” in IASTED Advancesin Computer Science and Technology (ACST), 2006, pp. 1–6,CD edition.
[5] B. Berent et al., “A roadmap for web mining: from webto semantic web,” in First European Web Mining Forum(EWMF), ser. LNCS, B. Berent et al., Eds., vol. 3209.Springer, 2004, pp. 1–22.
[6] JRG Pulido et al., “Semi-automatic derivation of specific-domain ontologies for the semantic web,” in 5th Mexican Intl Conf on Artificial Intelligence, Gelbukh and Reyes-García, Eds. IEEE Computer Soc. Press, Los Alamitos, 2006, pp. 253–261.
[7] F. Ciravegna et al., “Learning to harvest information for thesemantic web,” in The Semantic Web: Research and Applica-tions: First European Semantic Web Symposium, ESWS 2004,ser. LNCS, C. Bussler et al., Eds., vol. 3053. Springer, 2004,pp. 312–326.
[8] D. Elliman and JRG Pulido, “Visualizing ontology com-ponents through self-organizing maps,” in 6th InternationalConference on Information Visualization (IV02), London, UK,D. Williams, Ed. IEEE Computer Soc.Press, Los Alamitos,2002, pp. 434–438.
[9] R. Benjamins et al., “(KA)2: Building ontologies for theinternet: A midterm report,” Int.J.Human-Computer Studies,vol. 51, no. 3, pp. 687–712, 1999.
[10] L. Crow and N. Shadbolt, “Extracting focused knowledgefrom the Semantic Web,” Int.J.Human-Computer Studies,vol. 54, pp. 155–184, 2001.
[11] D. Mladenic and M. Grobelnick, “Mapping documents ontoweb page ontology,” in Web Mining: From Web to Seman-tic Web: First European Web Mining Forum, ser. LNCS,B. Berendt et al., Eds., vol. 3209. Springer, 2004, pp. 77–96.
[12] R. Brachman, “What is-a and isn’t: An analysis of taxonomiclinks in semantic networks,” IEEE Computer, vol. 16, no. 10,pp. 10–36, 1983.
[13] M. Uschold and M. Gruninger, “Ontologies: Principles,methods, and applications,” Knowledge Engineering Review,vol. 11, no. 2, pp. 93–155, 1996.
[14] A. Zouaq and R. Nkambou, “Enhancing learning objects withan ontology-based memory,” IEEE Trans.on Knowledge AndData Engineering, vol. 21, no. 6, pp. 881–893, 2009.
[15] W. Wei et al., “Probabilistic topic models for learningterminological ontologies,” IEEE Trans.on KnowledgeAnd Data Engineering, 2009. [Online]. Available:http://doi.ieeecomputersociety.org/10.1109/TKDE.2009.122
[16] J. Heflin and J. Hendler, “Dynamic ontologies on the web,” inAmerican Association For Artificial Intelligence Conf. AAAIPress, California, 2000, pp. 251–254.
[17] S. Geffner et al., “Browsing large digital library collectionsusing classification hierarchies,” in 8th Int. Conf. on Infor-mation Knowledge Management, CIKM’99, S. Gauch, Ed.ACM, New York, 1999, pp. 195–201.
[18] J. McCormack and B. Wohlschlaeger, “Harnessing agenttechnologies for data mining and knowledge discovery,” inData Mining and Knowledge Discovery: Theory, Tools andTechnology II, vol. 4057, 2000, pp. 393–400.
[19] L. Lacy, OWL: Representing Information Using the WebOntology Language. Trafford Publishing, USA, 2005.
[20] JRG Pulido et al., “Ontology languages for the semantic web: A never completely updated review,” Knowledge-Based Systems, vol. 19, no. 7, pp. 489–497, 2006.
[21] I. Horrocks et al., “From SHIQ and RDF to OWL: The mak-ing of a web ontology language,” Journal of web semantics,vol. 1, no. 1, pp. 7–26, 2003.
[22] A. Gomez and O. Corcho, “Ontology languages for theSemantic Web,” IEEE Intelligent Systems, 2002.
[23] D. Fensel et al., “OIL in a nutshell,” in Proc.EuropeanKnowledge Acquisition Conf., ser. LNAI, R. Ding et al., Eds.Springer-Verlag, Berlin, 2000.
[24] H. Dozono and T. Takahashi, “Mapping of the genomesequence using two-stage self organizing maps,” in 6th IntlWorkshop on Self-Organizing Maps (WSOM’07), H. Ritteret al., Eds. Neuroinformatics group, Bielefeld University,Germany, 2007, pp. 1–7, CD edition.
[25] JRG Pulido et al., “A novel approach to the analysis ofvolcanic-domain data using self-organizing maps: A prelimi-nary study on the volcano of colima,” Research in computerscience, IPN, Mexico, vol. 40, pp. 49–59, 2008.
[26] V. Guerrero et al., “Document organization using Kohonen’salgorithm,” Information Processing and Management, vol. 38,pp. 79–89, 2002.
[27] T. Kohonen et al., “Self organization of a massive textdocument collection,” in Kohonen Maps, E. Oja and S. Kaski,Eds. Elsevier Sci, Amsterdam, 1999, pp. 171–182.
[28] S. Kaski et al., “WEBSOM - Self-organizing maps of doc-ument collections,” Neurocomputing, vol. 6, pp. 101–117,1998.
[29] I. Sanz et al., “Gathering metadata from web-based repos-itories of historical publications,” in 9th Int. Workshop onDatabase and Expert Systems Apps, A. Tjoa and R. Wagner,Eds. IEEE Computer Soc.Press, Los Alamitos, 1998, pp.473–478.
[30] A. Gomez et al., “Knowledge maps: An essential tech-nique for conceptualisation,” Data & Knowledge Engineering,vol. 33, pp. 169–190, 2000.
[31] J. Gordon, “Creating knowledge maps by exploiting depen-dent relationships,” Knowledge-Based Systems, pp. 71–79,2000.
[32] A. Waterson and A. Preece, “Verifying ontological commit-ment in knowledge-based systems,” Knowledge-Based Sys-tems, vol. 12, pp. 45–54, 1999.
[33] E. Motta et al., “Ontology-driven document enrichment:principles, tools and applications,” Int.J.Human-ComputerStudies, vol. 52, pp. 1071–1109, 2000.
[34] T. Kohonen, Self-Organizing Maps, 3rd ed., ser. InformationSciences Series. Springer-Verlag, Berlin, 2001.
[35] H. Ritter and T. Kohonen, “Self-organizing semantic maps,”Biological Cybernetics, vol. 61, pp. 241–254, 1989.
[36] JRG Pulido et al., “On the finding process of volcano-domainontology components using self-organizing maps,” in 7th IntlWorkshop on Self-Organizing Maps (WSOM’09), ser. LNCS,J. Principe and R. Miikkulainen, Eds., vol. 5629. Springer-Verlag, Berlin, 2009, pp. 255–263.
[37] JRG Pulido et al., “Artificial learning approaches for the next generation web: part II,” Ingeniería Investigación y Tecnología, UNAM (CONACyT), Mexico, vol. 9, no. 2, pp. 161–170, 2008.
[38] JRG Pulido et al., “Artificial learning approaches for the next generation web: part I,” Ingeniería Investigación y Tecnología, UNAM (CONACyT), Mexico, vol. 9, no. 1, pp. 67–76, 2008.
[39] JRG Pulido et al., “In the quest of specific-domain ontol-ogy components for the semantic web,” in 6th Intl Work-shop on Self-Organizing Maps (WSOM’07), H. Ritter et al.,Eds. Neuroinformatics group, Bielefeld University, Germany,2007, pp. 1–7, CD edition.
[40] S. Legrand and JRG Pulido, “A hybrid approach to wordsense disambiguation: Neural clustering with class labeling,”in Workshop on knowledge discovery and ontologies, 15thEuropean Conference on Machine Learning (ECML), Pisa,Italy, P. Buitelaar et al., Eds., September 2004, pp. 127–132.
[41] A. Maedche and V. Zacharias, “Clustering ontology-basedmetadata in the semantic web,” in Principles of Data Miningand Knowledge Discovery: 6th European Conference, PKDD2002, ser. LNCS, T. Elomaa et al., Eds., vol. 2431. Springer,2002, pp. 348–360.