Web Semantics: Science, Services and Agents on the World Wide Web 9 (2011) 247–265

Contents lists available at ScienceDirect

Web Semantics: Science, Services and Agents on the World Wide Web

journal homepage: http://www.elsevier.com/locate/websem

Concept drift and how to identify it

Shenghui Wang ⇑, Stefan Schlobach, Michel Klein

Vrije Universiteit Amsterdam, The Netherlands

Article info

Article history: Available online 17 May 2011

Keywords: Concept drift; Semantics; KOS; Ontology change

1570-8268/$ - see front matter © 2011 Elsevier B.V. All rights reserved.
doi:10.1016/j.websem.2011.05.003

⇑ Corresponding author.
E-mail addresses: [email protected] (S. Wang), schlo[email protected] (S. Schlobach), [email protected] (M. Klein).

Abstract

This paper studies concept drift over time. We first define the meaning of a concept in terms of intension, extension and label. Then we study concept drift over time using two theories: one based on concept identity and one based on concept morphing. A qualitative toolkit for analysing concept drift is proposed to detect concept shift and stability when concept identity is available, and concept split and strength of morphing chain if using the morphing theory. We apply our framework in four case-studies: a political vocabulary in SKOS, the DBpedia ontology in RDFS, the LKIF-Core ontology in OWL and a few biomedical ontologies in OBO. We describe ways of identifying interesting changes in the meaning of concepts within given application contexts. These case-studies illustrate the feasibility of our framework in analysing concept drift in knowledge organisation schemas of varying expressiveness.

1. Introduction

Knowledge organisation systems (KOS), such as formal ontologies (e.g. modelled in OWL), thesauri or taxonomies (e.g. described in SKOS) or other term classification schemes, play a crucial role in providing semantic interoperability in many domains and use cases. They have become critical to the Web of Data for structured access to documents in libraries or to patient records based on diagnostic information, and for many more applications. In almost all modern types of KOS, concepts are the central constructs that are used to describe sets of objects with shared characteristics. Although it is widely recognised to be an oversimplification, most current systems consider their underlying KOS to be stable over time. For many applications this is becoming a critical problem, and this paper attempts to provide a theory of what we call concept drift.1 As the world is continuously changing, concepts also change over time. For example, a concept may refer to different objects at different points in time: the term Government of the Netherlands refers to different people in 1999 and in 2009. Or consider the concept Middle class, which has very different properties in various periods of time. Concept drift is known to be a problem in data mining and machine learning, where learned models lose their predictive power over time [50]. Here, we deal with concepts that are explicitly and symbolically represented in some KOS, and where the drift needs to be made explicit as well. In this sense our contribution is orthogonal and complementary to the work in the machine learning literature.

1 Concept drift will be our name for change in meaning of a concept over time. This drift in meaning occurs not only over time, but also over location, culture, etc. For ease of presentation we will mostly refer to drift in time, but significant parts of the framework should extend to other kinds of “contexts.”

In this paper we will formally define what concept drift in a KOS is, and how to study its impact, given two different ontological paradigms, and without committing to a particular type of knowledge organisation scheme. These definitions are not intended to provide new philosophical insights, but aim at making existing accepted notions of intension, extension and labelling applicable in the context of dynamics of semantics.

Semantic drift. Throughout the theoretical part of the paper we will use the generic example of the European Union (the concept European Union) and the countries of the EU (EUCountry) as our concepts of reference.2 The historical overview provided by the EU in [5] gives an interesting discussion of its development. Starting in the early Fifties “the European Union is set up” with “Belgium, France, Germany, Italy, Luxembourg and the Netherlands” as founding members. Within the next six decades many other countries joined, and the European Union slowly transformed from an economic community to an “international organisation governing common economic, social, and security policies” [4]. The following list gives some definitions for the concept of the European Union over time.

2 We use two different concepts, EUCountry and European Union, as they exemplify two different kinds of change: the (intensional) definition of the EU is highly unstable and changes throughout history, but as a relatively abstract concept it is not so clear what its instances (extension) are. For EUCountry it is the opposite: the set of instances changes regularly, but its intensional definition as countries of the EU is relatively stable.

(1979) The European Community is a common denominator for the European Economic (EEC), the European Coal and Steel Community (ECSC), and the European Atomic Energy Community (EAEC) [3].

(1999) The European Community is the new stage in the implementation of increasing the Union of the European people [2].

(2003) The European Union or EU is an international organisation of European states, established by the Treaty on European Union [52].

(2006) The European Union (EU) is a supranational and intergovernmental union of 25 independent, democratic member states [53].

(2010) The European Union (EU) is an economic and political union of 27 member countries, located primarily in Europe [51].

Most of the aspects of the change of meaning of a concept over time that we introduce in this paper can be identified in this example. The instances of the set of countries of the European Union, also called its extension, change (e.g. from 25 to 27 between 2006 and 2007). The label of the concept European Union changes from European Community to European Union. From a judicial perspective those are different concepts. However, the European Union website [5] considers the EU and EC to be the same concept. Note that they use the label European Union to signify the concept that existed in 1952. There was no European Union at the time, however, only the European Community. In our opinion the website refers to an abstract concept of an organisational unit which stands for the European idea, which we now refer to as the European Union. Of course, the properties of this European Union have significantly changed as well: from a union bound by economic treaties to a full political, military and social organisation. Ontologically, we say that the intension of the concept European Union has changed. These three types of changes will be discussed as three different aspects of semantic drift: extensional, label and intensional drift respectively.

Identity based concept drift. The impact of such drift is difficult to measure, and for that purpose we would like to identify more drastic and qualitative changes in meaning. In the example of the EU countries there are some of these more drastic changes. Recall that the EU started out with just 6 countries, all of which were Central European. The effect of the expansion of the EU is that the original meaning of the concept EUCountry in terms of its members in 1950 is now far closer to the concept CES of Central European states than to the concept EUCountry. We will call such changes shift. Another way of looking at drift is to study the similarity between the meaning over time. The following table shows, e.g., the number of European member countries of NATO and of the EU:

Year     1949   1952   1955   1973   1981   1986   1995   1999   2004   2007
EU       –      6      9      9      10     12     15     15     25     27
diff     –      –      50%    0%     11.1%  20%    25%    0%     66.7%  8.0%
E-NATO   9      10     11     11     12     12     12     15     22     22
diff     –      11.1%  10%    0%     9.1%   0%     0%     25%    46.7%  0%

Studying the similarity between these sets, e.g. in this simple example as the ratio of the number of new versus old instances, indicates the moments when the most drastic changes occurred (in both cases between 1995 and 2004). We say that we measure stability of the concept as compared to other concepts. In principle, there are two lessons: for each concept we can identify the most unstable moments in a temporal chain, but we can also compare the average or overall instability of a concept over time. An interesting comparison can be made with the concept of European Countries within NATO, which has also seen expansion over time. Note that the latter is more extensionally stable, as countries joined more gradually.
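The diff rows of the table can be reproduced with a few lines of Python. This is only a sketch of the simple ratio used in this example (new instances divided by the previous number of instances); the function names are ours and are not part of the toolkit described later in the paper.

```python
def relative_growth(sizes):
    """Percentage growth between consecutive extension sizes.

    None marks steps where no earlier size exists to compare against.
    """
    diffs = []
    previous = None
    for size in sizes:
        if size is None or previous is None:
            diffs.append(None)
        else:
            diffs.append(round(100.0 * (size - previous) / previous, 1))
        if size is not None:
            previous = size
    return diffs


def most_unstable_year(years, diffs):
    """Year in which the relative growth is largest."""
    scored = [(d, y) for y, d in zip(years, diffs) if d is not None]
    return max(scored)[1]


def mean_growth(diffs):
    """Average relative growth, ignoring undefined steps."""
    values = [d for d in diffs if d is not None]
    return sum(values) / len(values)


# Counts copied from the table above.
years = [1949, 1952, 1955, 1973, 1981, 1986, 1995, 1999, 2004, 2007]
eu = [None, 6, 9, 9, 10, 12, 15, 15, 25, 27]
e_nato = [9, 10, 11, 11, 12, 12, 12, 15, 22, 22]

eu_diff = relative_growth(eu)
nato_diff = relative_growth(e_nato)
```

Comparing `mean_growth(eu_diff)` with `mean_growth(nato_diff)` gives one crude way to express the observation that the NATO concept is the more extensionally stable of the two.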

Morphing based concept drift. The previous analysis identifying shift and stability of the meaning of a concept makes sense under the assumption, or ontological commitment, that the concept of the European Union in 1995 is considered to be the same concept as the European Union in 2007. This is just one possible interpretation, and we do not want to restrict our framework to just this one world-view. The discussion on identity is probably as old as philosophy itself, and has played a role in ontology engineering as well. We try to keep out of this discussion by providing methods that work both with identity and without it. For the latter we consider concepts to be pertaining to just one moment in time and that they evolve/morph into new, but highly similar, concepts at each moment in time. Meaning drift is then to be defined in terms of the degree of dissimilarity of these maximally similar concepts over time. Again, we will study drift with respect to intension, extension and labels.

In this alternative conceptualisation of concept drift we also want to identify some qualitative notions to describe more drastic notions of drift. One example is the notion of split, which occurs when the different semantic elements (intension, extension or label) morph into different concepts. When a series of concepts at different moments are linked by this morphing relation, then a morphing chain is formed. The strength of the morphing chain, i.e., how similar the morphed concepts are, is another notion to study.
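To make the idea of a morphing chain more concrete, the following sketch links each concept to its most similar successor in the next snapshot. Everything specific here is an assumption made for illustration: Jaccard similarity over extensions stands in for whatever similarity measure is chosen, the minimum link similarity is one possible reading of chain strength, and the snapshots (including the concept Nordic) are toy data. The formal definitions are given in Section 4.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets (1.0 when both are empty)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def morph_target(extension, next_snapshot):
    """(name, similarity) of the most similar concept in the next snapshot."""
    return max(
        ((name, jaccard(extension, ext)) for name, ext in next_snapshot.items()),
        key=lambda pair: pair[1],
    )


def morphing_chain(start, snapshots):
    """Follow the most-similar-successor relation through all snapshots."""
    chain = [(start, None)]
    current = snapshots[0][start]
    for snapshot in snapshots[1:]:
        name, similarity = morph_target(current, snapshot)
        chain.append((name, similarity))
        current = snapshot[name]
    return chain


# Toy snapshots: concept name -> extension at that moment (hypothetical data).
founders = {"BE", "FR", "DE", "IT", "LU", "NL"}
t0 = {"EUCountry": founders, "Nordic": {"DK", "NO", "SE"}}
t1 = {"EUCountry": founders | {"DK", "IE", "UK"}, "Nordic": {"NO", "SE"}}
t2 = {"EUCountry": founders | {"DK", "IE", "UK", "GR"}, "Nordic": {"NO", "SE"}}

chain = morphing_chain("EUCountry", [t0, t1, t2])
# Assumed reading of "strength": the weakest link in the chain.
strength = min(similarity for _, similarity in chain[1:])
```

If the label and the extension of a concept were matched separately and ended up with different successors, that would correspond to the notion of split described above; the sketch only follows the extension.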

Research questions. To our knowledge there has been no formalisation of what concept drift actually means and implies. In order to identify different types of changes in concepts and to understand the impact of concept drift, such a formalisation is critical. Therefore, this paper focuses on the following research questions:

(1) What is concept drift, and how to formalise it?
(2) Can we identify the impact of concept drift?

It is important to note that we want to develop a generic framework which allows us to study concept drift without a priori choosing one of the above described ontological commitments (identity versus morphing) or a particular type of KOS.

Methodology. We provide a generic formalisation of the meaning of concepts in terms of label, intension and extension. Depending on the ontological commitment towards an identity- or morphing-based model, concept drift is defined differently, and the qualitative follow-up notions are therefore defined differently as well.

For the former, the instability over a time period and concept shift between time points (where part of the meaning of a concept shifts to some other concept) are crucial notions. For the latter, the notion of split and the morphing strength will be introduced.

As our proposal is meant in a very generic way, we need to discuss the notions of intension and extension in more detail.

Case-studies. We instantiate our framework in four case-studies, studying concept drift in our motivating case, a political ontology in SKOS [17] used by communication scientists for political analysis, as well as a general purpose RDFS [30] ontology, DBpedia [25], a legal OWL [31] ontology, LKIF-Core [15], and three OBO biomedical ontologies [41]. We investigate the introduced mechanisms for studying concept drift in these four different KR models. Our experiments show the feasibility of both the formalisation and identification mechanisms by pointing to some examples of concept (in)stability, shift, split, and morphing strength which were identified as relevant by collaborating domain experts. We are aware that those case-studies do not constitute a formal evaluation, let alone empirical proof of the validity of our approach. However, they indicate the broad coverage and potential of the toolkit (although not all tools are useful in all scenarios). Furthermore, selected domain experts have in three of the four cases confirmed the usefulness of some of the results, the fourth being general enough to allow an informal assessment without expert input.

Contributions. This paper operationalises established (philosophical) insights into a general pragmatic framework for dealing with concept drift and develops a general toolkit to detect drift in different domains. We believe that we contribute to a better understanding of temporal change of meaning in formal knowledge organisation schemes and its impact in practical applications. We motivate and define the crucial notions of shift, split, stability and morphing strength for different ontological commitments and different types of knowledge organisation systems. The four case-studies illustrate the feasibility of our framework in analysing concept drift in knowledge organisation schemas of varying expressiveness. This is a significantly extended version of our previous paper [47] in which some of the ideas were introduced. We now provide a far more detailed analysis and formalisation, including an alternative model for dealing with changes based on concept morphing (with the related notions of split and morphing strength). We also provide a far more thorough evaluation including new datasets.

Structure of the paper. Section 2 gives a general introduction to the four types of knowledge organisation schemas investigated and discusses the most relevant related work. In Section 3, we provide the basic formalisation of the meaning of concepts on which the two theories of concept drift, namely identity-based and morphing-based, are built. These two theories are defined in Section 4 together with a practical toolkit for analysing concept drift. In Section 5, we apply our general framework in four different case studies involving four different KOSs. Finally, Section 6 concludes the paper.

2. Background

In this section, we describe some general notions that are necessary for understanding the remainder of the paper. In addition, we describe what has been done on formalizing concept drift in knowledge systems.

2.1. Knowledge Organisation Schemas

Knowledge Organisation Schema (KOS) is a broad notion used to refer to schemes meant to describe information. This notion encompasses term lists and glossaries, classification schemes such as subject headings or taxonomies, and semantically richer structures such as thesauri and ontologies. More specifically, in [14] the notion of a KOS is explained as follows:

The term knowledge organization systems is intended to encompass all types of schemes for organizing information and promoting knowledge management. Knowledge organization systems include classification and categorization schemes that organize materials at a general level, subject headings that provide more detailed access, and authority files that control variant versions of key information such as geographic names and personal names. Knowledge organization systems also include highly structured vocabularies, such as thesauri, and less traditional schemes, such as semantic networks and ontologies. Because knowledge organization systems are mechanisms for organizing information, they are at the heart of every library, museum, and archive.

Over the last decades, several standards for representing knowledge structures have been developed. For example, the International Organization for Standardization (ISO) has developed a guideline for monolingual thesauri (ISO-2788) and a guideline for multilingual thesauri (ISO-5964), and the NISO has developed guidelines for the construction, format, and management of monolingual controlled vocabularies (ANSI/NISO Z39.19). In this paper, we focus on representation used in the context of the Semantic Web, i.e. SKOS and OWL.

2.2. RDF Schema

RDF Schema (RDFS) is a simple type system for the Resource Description Framework (RDF). It provides a mechanism to define domain-specific properties and classes of resources to which those properties can be applied. The basic modeling primitives in RDFS are class definitions (rdfs:Class) and subclass-of statements (rdfs:subClassOf), which together allow the definition of class hierarchies. RDFS also includes property definitions and subproperty-of (rdfs:subPropertyOf) statements to build property hierarchies, as well as domain and range statements (rdfs:domain and rdfs:range) to restrict the possible combinations of properties and classes. Inherited from RDF, type statements (rdf:type) are used to declare a resource as an instance of a specific class. In addition, RDFS comes with a specific labeling relation, rdfs:label, which can be used to add natural language terms to a class.
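As a rough illustration of how these primitives interact, the sketch below stores a handful of statements as plain (subject, predicate, object) tuples and derives class membership through the subclass hierarchy. The `ex:` resources are hypothetical, and no RDF library is used; a real application would rely on an RDF store and an RDFS reasoner.

```python
RDF_TYPE = "rdf:type"
SUBCLASS_OF = "rdfs:subClassOf"

# Hypothetical example statements.
triples = {
    ("ex:PoliticalEntity", RDF_TYPE, "rdfs:Class"),
    ("ex:Country", RDF_TYPE, "rdfs:Class"),
    ("ex:EUCountry", RDF_TYPE, "rdfs:Class"),
    ("ex:Country", SUBCLASS_OF, "ex:PoliticalEntity"),
    ("ex:EUCountry", SUBCLASS_OF, "ex:Country"),
    ("ex:EUCountry", "rdfs:label", "EU country"),
    ("ex:France", RDF_TYPE, "ex:EUCountry"),
}


def superclasses(cls, triples):
    """All classes reachable from cls via rdfs:subClassOf (transitively)."""
    found, todo = set(), [cls]
    while todo:
        current = todo.pop()
        for s, p, o in triples:
            if s == current and p == SUBCLASS_OF and o not in found:
                found.add(o)
                todo.append(o)
    return found


def instances(cls, triples):
    """Resources typed as cls or as any (transitive) subclass of cls."""
    return {
        s
        for s, p, o in triples
        if p == RDF_TYPE and (o == cls or cls in superclasses(o, triples))
    }
```

Here `instances("ex:PoliticalEntity", triples)` contains `ex:France` even though no triple states this directly, because `ex:EUCountry` is a subclass of `ex:Country`, which in turn is a subclass of `ex:PoliticalEntity`.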

2.3. OWL

The Web Ontology Language (OWL) is a family of knowledge representation languages for authoring ontologies endorsed by the World Wide Web Consortium. They are characterised by formal semantics and a number of different serialisations, e.g. RDF/XML. OWL is built upon RDF and RDFS. An OWL class is defined as an RDF resource of type owl:Class. The mechanism to define class hierarchies is inherited from RDFS. Furthermore, properties are used to describe resources and specify class characteristics. Properties may possess logical capabilities such as being transitive, symmetric, inverse and functional. The W3C has recently endorsed a new version of OWL as a standard, referred to as OWL2 [32], which extends the previous version with more expressiveness and fixes some previous problems.

2.4. SKOS

SKOS (Simple Knowledge Organization System) is a common data model for sharing and linking knowledge organization systems via the Web, published as a W3C recommendation [18]. Its aim is to capture much of the similarity between thesauri, classification schemes, taxonomies, subject-heading systems, or any other type of structured controlled vocabulary. SKOS is built upon RDF and RDFS.

The core of the SKOS model defines the classes and properties needed for representing the features most commonly found in thesauri and other types of structured controlled vocabularies. The primitive objects in SKOS are abstract concepts, which are defined as RDF resources of type skos:Concept. These concepts can be further described via RDF properties by:

• labels (i.e. terms), especially the preferred labels skos:prefLabel and alternative labels skos:altLabel; these properties are subproperties of rdfs:label, and are used to link a skos:Concept to an RDF literal (a character string) optionally combined with a language tag (e.g. “en-US”);
• semantic relations, such as skos:broader, skos:narrower and skos:related;
• documentary notes, such as definitions, scope notes and examples, represented by (subproperties of) skos:note.

Using the semantic relationships, the concepts can be organized in hierarchies using broader–narrower relationships, or linked by non-hierarchical (“related”) relationships. Moreover, skos:Concept resources can be grouped into schemes, using the skos:inScheme property.
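A minimal sketch of such a vocabulary, again using plain tuples rather than an RDF library, could look as follows. The `ex:` concepts, their labels and the scheme are hypothetical; language-tagged literals are modelled as (text, language) pairs, and `broader_chain` assumes that each concept has at most one broader concept.

```python
LABEL_PROPERTIES = ("skos:prefLabel", "skos:altLabel")

# Hypothetical SKOS-style statements.
triples = [
    ("ex:EU", "rdf:type", "skos:Concept"),
    ("ex:EU", "skos:prefLabel", ("European Union", "en")),
    ("ex:EU", "skos:altLabel", ("EU", "en")),
    ("ex:EU", "skos:prefLabel", ("Europese Unie", "nl")),
    ("ex:EU", "skos:broader", "ex:IntOrg"),
    ("ex:EU", "skos:inScheme", "ex:politics"),
    ("ex:IntOrg", "rdf:type", "skos:Concept"),
    ("ex:IntOrg", "skos:prefLabel", ("international organisation", "en")),
    ("ex:IntOrg", "skos:broader", "ex:Organisation"),
    ("ex:Organisation", "rdf:type", "skos:Concept"),
]


def labels(concept, lang):
    """Preferred and alternative labels of a concept in one language."""
    return sorted(
        o[0]
        for s, p, o in triples
        if s == concept and p in LABEL_PROPERTIES and o[1] == lang
    )


def broader_chain(concept):
    """Concepts reached by repeatedly following skos:broader."""
    chain, current = [], concept
    while True:
        parents = [o for s, p, o in triples if s == current and p == "skos:broader"]
        if not parents or parents[0] in chain:  # stop at the top or on a cycle
            return chain
        current = parents[0]
        chain.append(current)
```

Keeping the labels as data attached to the concept, rather than as its identifier, is what later allows label drift to be studied separately from changes in intension and extension.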

2.5. OBO

The OBO format is widely used to represent a significant number of biomedical ontologies, including the Gene Ontology (GO).3 Originating along with the Gene Ontology, the OBO “flat file format” has evolved to support the needs of the biomedical ontologies that fall under the Open Biomedical Ontologies (OBO) umbrella. According to its authors, the main aims of the OBO flat file format are to provide (1) human readability, (2) ease of parsing, (3) extensibility and (4) minimal redundancy. The OBO format currently forms the backbone of most GO based annotation and data analysis tools.

[Term]
id: HP:0000023
name: Inguinal hernia
namespace: medical_genetics
exact_synonym: "Inguinal hernias" []
is_a: HP:0000035 ! Abnormality of the testis
is_a: HP:0004299 ! Hernia of the abdominal wall

As the above example shows, an OBO ontology is a collection of stanzas, each of which describes one element of the ontology. A stanza is introduced by a line containing a stanza name that identifies the type of element being described. The rest of the stanza consists of lines, each of which contains a tag followed by a colon, a value, and an optional comment introduced by “!”.
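The stanza structure just described is simple enough to read with a few lines of code. The following is a deliberately simplified sketch, not a full OBO parser: it ignores header lines before the first stanza, escape sequences, and the possibility of a “!” inside a quoted value.

```python
def parse_obo(text):
    """Parse OBO stanzas into a list of (stanza_name, {tag: [values]}) pairs."""
    stanzas = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("[") and line.endswith("]"):
            # A new stanza, e.g. [Term]
            current = (line[1:-1], {})
            stanzas.append(current)
            continue
        if current is None:
            continue  # header lines are skipped in this sketch
        tag, _, rest = line.partition(":")
        value = rest.split("!", 1)[0].strip()  # drop the optional comment
        current[1].setdefault(tag.strip(), []).append(value)
    return stanzas


example = """
[Term]
id: HP:0000023
name: Inguinal hernia
namespace: medical_genetics
exact_synonym: "Inguinal hernias" []
is_a: HP:0000035 ! Abnormality of the testis
is_a: HP:0004299 ! Hernia of the abdominal wall
"""

stanza_name, tags = parse_obo(example)[0]
```

Tags are collected into lists because a tag such as `is_a` may occur several times within one stanza.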

2.6. Related work

Although concept drift is widely recognised as an important issue in Semantic Web applications,4 there is not yet a generally accepted way of dealing with changing concepts over time. However, there is a fair amount of related research (e.g. see [11,27]) that discusses aspects of coping with concept drift. Below, an overview is given of the research about concept change in different areas.

Let us first look at research in other domains that is related to our notion of concept drift. In historical linguistics, semantic shift describes the evolution of word usage. Each word has multiple senses and connotations which can be added, removed or altered over time. Semantic change is a change in one meaning of a word. Semantic shift can be triggered by different forces and have different types [1]. In this interpretation of “semantic”, there is no explicit relation with the meaning and change of an “underlying concept”, as the main entity of concern is the word itself.

3 http://www.geneontology.org/

4 E.g. see the fact that the topic is mentioned in both the discussion about the design of SKOS (http://www.w3.org/2001/sw/wiki/SKOS/Issues/ConceptEvolution) and the previous version of OWL (http://www.w3.org/TR/2004/REC-owl-guide/), while there is no fine-grained solution yet in OWL2.

In machine learning, concept drift addresses a similar problem [45]. The term concept refers to the quantity that a learning model is trying to predict, i.e. the target variable. Concept drift is the situation in which the statistical properties of the target concept change over time. This requires regular updates of the predicting model itself. A special case is virtual concept drift [49] (or sampling shift), in which the meaning of a concept does not change; instead, the data that this concept classifies changes. An example is the concept spam, for which the meaning does not change, but the data distribution (i.e. the relative frequency of the properties) is changing.

In 1994, Klenner and Hahn [24] discussed exactly the problem of concept drift caused by notions evolving over time, however, not in the context of Semantic Web applications but for technical standards. As a mechanism for updating static value restrictions or integrity constraints, they propose an automatic procedure. This generates a generational stratification of the underlying level of generic concepts in terms of concept versions; single instances are then related to their associated concept version. The procedure exploits a so-called progress model, provided by an expert, which describes in qualitative terms the regularities of foreseeable changes of attributes in a domain. Versions are then detected by measuring the change in values of attributes of instances.

With the goal of detecting concept drift and the occurrence of new concepts in a domain, Fanizzi et al. describe the use of a conceptual clustering technique based on unsupervised learning [6]. In their approach, a clustering method is used to hierarchically organize groups of similar instances. Concept drift is detected by finding new individuals that are too far apart from existing clusters, but that together do not form a new cluster. If the unclustered instances do form a cluster, a new concept has occurred.

In the Semantic Web community, the problem addressed here is related to a broader problem of ontology evolution and change management, which refers to the “problem of deciding the modifications to perform upon an ontology in response to a certain need for change as well as the implementation of these modifications and the management of their effects in depending data, services, applications, agents or other elements” [7]. The existing research on ontology evolution and versioning often addresses this problem at the macro ontological level, that is, the effect of certain change operations over the ontology elements, including concepts, relations and instances, as well as the interoperability issue between different variants (versions) over time. For example, Heflin and Pan [13] formally define ontology perspectives, which describe the relation between versions of an ontology and its extension.

For analyzing interoperability between different versions of an ontology, change detection for individual concepts is a relevant issue [34]. Different techniques for detecting such changes have been developed. For example, Plessers and De Troyer introduce a change detection mechanism [37,38] that is able to find meaningful changes in sets of changes, based on the formal definition of these changes. This approach, however, assumes a fixed identity of concepts. As such, it is of limited use in cases where this assumption does not hold. A syntactic change detection approach for ontologies is presented in [23]. In this approach, the semantic implication of the change has to be made explicit by a human user. PromptDiff [35] is an approach and tool for comparing ontology versions that also detects changes in individual concepts. In [22], an interpretation of the effect of such changes on different types of ontological compatibility is given. This approach does not assume fixed identifiers, but tries to match different ontological elements using a set of heuristics focussing on the structural aspects of an ontology. It does not take the extension of concepts into account, nor does it look at the intension of a change. In contrast, in [16], the intensional change of concepts in different versions of ontologies is studied. This approach again assumes a fixed identity.

Fig. 1. An example ontology representing the EU in 2003.

In the context of integrating ontologies, there has been a lot of work on mechanisms and measures for calculating concept similarity. An overview can be found in [20]. Such measures are related to our research, as our framework also makes use of different similarity measures. However, the aim of our work is not to introduce another similarity measure, but a framework that allows different aspects of concept similarity to be combined to describe concept change.

A whole body of research in ontology change management investigates methodologies and techniques for change management in collaborative working environments [28,12,21,36,33,42–44]. In these papers, methods for the coordination of the changes made by different authors are developed and implemented, such as locking, change detection and approval mechanisms. Although very relevant for ontology development, these approaches focus on a specific use case.

There has not been much study on the specific meaning of change for specific concepts. In [10] a series of “concept signatures” extracted from the textual definitions of the same concept at different times are used to detect drifts. This definition-based method is applicable if there are rich definitions of concepts and the definitions are constantly modified. In the Ontoclean framework [9] the meta-properties identity and rigidity are defined with respect to the stability of a concept.

In this article, we do not focus on a specific application of concept change, but we present a generic framework to describe and analyse concept drift within two common ontological paradigms. This framework helps in understanding and analyzing specific strategies for detecting and coping with concept change in knowledge structures.

Fig. 2. An example ontology representing the EU in 2010.

3. Two theories of concept drift

Arguably, the meaning of concepts changes over time. In this section we define precisely what this actually means, as there are different philosophical views on the matter. The two core alternatives can briefly be summarised upfront:

• Although possibly changing its meaning, a concept can exist over periods of time. Different variants of the same concept can differ in meaning.
• Concepts only exist at specific moments in time, and evolve gradually into some other concepts (possibly with almost the same meaning) at the next moment in time.

We will build a framework that works on both views, and provide qualitative notions of concept drift for both views.

To explain and motivate our ideas we will use the following simple information, which is freely inspired by the data available through DBpedia.5 Let us consider the following ontology (depicted in Fig. 1), which contains some information valid in 2003 about, among others, the resources dbpedia:EUCountry, cyc:Political Entity, and cyc:European Union [29] from several different sources, related using different relations. The concept dbpedia:EUCountry is defined as countries, which are political entities. The EU was at the time a political entity which also was an economy. There are some example countries as instances of the concept dbpedia:EUCountry, and some articles art1, art2 and art3 about the European Union. Fig. 2, which will be introduced later in the paper, describes the European Union in 2010. This example, and some extensions provided later, are used in the further discussions to explain our general ideas.

5 http://wiki.dbpedia.org/

3.1. The meaning of concepts

Let us first commit to some basic definitions regarding the meaning of concepts, which are common to both views. The intension of a concept is the set of properties implied by it, the extension the set of things it extends to. We also consider the labelling as a part of the meaning of a concept, because the way people reference a concept is also crucial in studying concept drift. Labels do not refer to a unique identifier but to a natural language description used to convey the meaning of a concept from one human to another. We believe this proposal to be consistent with the most common philosophical approaches. We apply and formalise the standard distinction between intension and extension which goes back to [8]. We include, somewhat more unconventionally, the labelling in the meaning of a concept (in the tradition of the signifier [39]).

Our definition of the meaning of a concept, and of its drift, is intended to be generic enough to be applied in different ontological frameworks. In this paper, we apply our ideas to a SKOS vocabulary and a few RDFS, OWL and OBO ontologies. We start out from a set of objects referred to as the universe of the domain, and a set of properties (unary predicates). Both universe and properties depend on the application and the formalism used.

Definition 1. The meaning $C^t$ of a concept $C$ at some moment in time $t$ is a triple $(\mathit{label}_t(C), \mathit{int}_t(C), \mathit{ext}_t(C))$, where $\mathit{label}_t(C)$ is a String, $\mathit{int}_t(C)$ a set of properties (the intension of $C$), and $\mathit{ext}_t(C)$ a subset of the universe (the extension of $C$).

We will refer to intension, extension and label of a concept as the aspects of the concept.
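As an illustration, the triple of Definition 1 could be held in a small immutable record. The following Python sketch is not taken from the toolkit described in Section 4.4; the class name `Meaning` and the example intension are assumptions made for illustration, while the three instance names follow the running example.

```python
from dataclasses import dataclass
from typing import FrozenSet, Hashable


@dataclass(frozen=True)
class Meaning:
    """Meaning C^t of a concept at one moment in time (Definition 1)."""
    label: str                       # label_t(C)
    intension: FrozenSet[Hashable]   # int_t(C): a set of properties
    extension: FrozenSet[Hashable]   # ext_t(C): a subset of the universe


# Illustrative values only; the intension shown here is an assumption.
eu_country_2003 = Meaning(
    label="EUCountry",
    intension=frozenset({("rdfs:subClassOf", "dbpedia:Country")}),
    extension=frozenset({"Germany", "Netherlands", "France"}),
)
```

Keeping the record immutable makes it natural to store one `Meaning` per time point and to compare variants aspect by aspect.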


3.1.1. A closer look at extension and intension

The example ontology introduced in Fig. 1 shows that Definition 1 is far from unambiguous. Depending on the semantics underlying a particular ontological representation language, the extension, intension and even labels can have different definitions.

OWL and RDFS are the simplest cases, as the model theory of DL semantics provides an immediate mapping to the definition of extension as a subset of the domain. For the concept EUCountry in Fig. 1, the extension is defined as the resources in the subject position of an rdf:type triple. For abstract concepts, this view on extension is hardly applicable. In knowledge organisation schemes such as SKOS, with less rigid formal semantics, the notion of extension is far less clear cut. In many applications there is a clearly defined domain of discourse which we can take as the semantic universe for defining the extension of a concept. In Definition 1, the meaning of concepts has been defined in a rather non-committal way in order to accommodate these weaker semantic structures as well. For example, European Union is a skos:Concept which is used to annotate a number of articles. We argue that, in certain applications, it is acceptable to use the articles as the extension of the concept European Union. In our political study in Section 5.1 we will define the set of sentences annotated with a particular concept as its extension.

For the aspect of intension, there is no once-and-for-all interpretation either. The definition is deliberately left vague to allow broad coverage of our theory of concept drift. In a First-Order Logic sense, the properties of a concept C correspond to the binary predicates that hold for C. Which predicates to consider then depends on the formalisms used for representing the ontology. If we interpret the ontology in Fig. 1 in RDF only, the intension of the concept dbpedia:EUCountry could be defined as just the set of triples it occurs in (except those with rdf:type as the predicate), i.e.,

(dbpedia:EUCountry rdfs:subClassOf dbpedia:Country)
(cyc:European_Union skos:broader dbpedia:EUCountry)

If interpreted in RDFS, the semantic closure could be taken into account, i.e., the triple (dbpedia:EUCountry rdfs:subClassOf cyc:PoliticalEntity) is probably relevant as well and could be part of the intension. For OWL and Description Logics formalisations, the intension could also contain implicitly true predicates that can be expressed. In our example, the concept dbpedia:EUCountry is also a subclass of the class of all the things which have at least one instance which is a Central European Country (or ∃rdf:type.cyc:Central European Country in a Description Logic-like notation).
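The RDF and RDFS readings above can be sketched over a plain set of triples. This is a minimal illustration in Python, not the implementation of Section 4.4: the toy graph is an assumed reading of Fig. 1, and the instance resources (`dbpedia:Germany` and so on) and the helper names are hypothetical.

```python
RDF_TYPE = "rdf:type"
SUBCLASS = "rdfs:subClassOf"


def extension(concept, triples):
    """Resources in the subject position of an rdf:type triple for `concept`."""
    return {s for (s, p, o) in triples if p == RDF_TYPE and o == concept}


def intension_rdf(concept, triples):
    """All triples the concept occurs in, except those with rdf:type as predicate."""
    return {(s, p, o) for (s, p, o) in triples
            if p != RDF_TYPE and concept in (s, o)}


def rdfs_superclasses(concept, triples):
    """Transitive closure of rdfs:subClassOf, as one ingredient of an RDFS intension."""
    found, frontier = set(), {concept}
    while frontier:
        frontier = {o for (s, p, o) in triples
                    if p == SUBCLASS and s in frontier} - found
        found |= frontier
    return found


# Assumed toy version of the 2003 ontology in Fig. 1.
triples = {
    ("dbpedia:EUCountry", SUBCLASS, "dbpedia:Country"),
    ("dbpedia:Country", SUBCLASS, "cyc:PoliticalEntity"),
    ("cyc:European_Union", "skos:broader", "dbpedia:EUCountry"),
    ("dbpedia:Germany", RDF_TYPE, "dbpedia:EUCountry"),
    ("dbpedia:Netherlands", RDF_TYPE, "dbpedia:EUCountry"),
    ("dbpedia:France", RDF_TYPE, "dbpedia:EUCountry"),
}

ext_2003 = extension("dbpedia:EUCountry", triples)
int_2003 = intension_rdf("dbpedia:EUCountry", triples)
supers = rdfs_superclasses("dbpedia:EUCountry", triples)
```

Under the RDF-only reading the intension contains exactly the two triples listed above, whereas the closure additionally reaches cyc:PoliticalEntity.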

The above examples show that there is no clear-cut definition of intension and extension, and even the definition of the label of a concept can depend on the application: often there are alternative labels and we could even take glosses or other descriptions into account. In our view, a framework on the change of meaning of a concept over time should be applicable to all those types of ontological commitments as long as the three aspects of the meaning are unambiguously formalised.

3.1.2. A closer look at time

It was stated earlier in the paper that we do not intend to provide a formal study of time and logical formalisations of change over time. Instead, we commit to the simplest temporal model, where time is defined as linear, discrete and complete. This allows us to refer to particular points in time by subscripts $i, j$ such that $t_i \le t_j$ whenever $i \le j$. We will use $\le$ to denote both the ordering on subscripts and in time, which we expect not to cause any confusion. A further and deeper study of different temporal models is out of the scope of this paper.

3.1.3. Similarity

Investigating changes in the meaning of a concept means studying changes in intension, extension and label. Given the above discussion, it should not come as a surprise that the differences between the aspects over time do not come in a single form either. The only notion which is required is a notion of similarity between the aspects of two concepts or two variants of the same concept at two moments in time. There will be intensional similarity, $\mathit{sim}_{int}$, which is calculated between sets of predicates, extensional similarity, $\mathit{sim}_{ext}$, between sets of objects, and label similarity, $\mathit{sim}_{label}$, between strings. Each similarity is a function with the range [0,1], where a similarity value of 1 indicates equality. We will show different examples of how to define the similarity between two aspects in the discussion of our case-studies, and will only briefly discuss some generic problems here.

For given notions of similarity between the meaning of two concepts according to a particular aspect, we can extend this to the similarity of a series of meanings of concepts over time:

$$\mathit{sim}_{asp}(C_1^{t_1}, C_2^{t_2}, \ldots, C_n^{t_n}) = \frac{\sum_{i=1}^{n-1} \mathit{sim}_{asp}(C_i^{t_i}, C_{i+1}^{t_{i+1}})}{n-1}$$

where $asp$ is an aspect, $asp \in \{int, ext, label\}$. The maximal similarity between the meaning of two concepts for a series of concepts over time is defined as:

$$\mathit{max\_sim}_{asp}(C_1^{t_1}, C_n^{t_n}) = \frac{\sum_{i=1}^{n-1} \mathit{sim}_{asp}(C_i^{t_i}, C_{i+1}^{t_{i+1}})}{n-1} \tag{1}$$

where $\mathit{sim}_{asp}(C_i^{t_i}, C_{i+1}^{t_{i+1}}) \ge \mathit{sim}_{asp}(C_i^{t_i}, D_{i+1}^{t_{i+1}})$ for the meaning of any other concept $D_{i+1}$ at time $t_{i+1}$. We will call the sequence of maximally similar concepts $C_{i+1}$ the max-sim-chain between $C_1$ and $C_n$ between $t_1$ and $t_n$ for aspect $asp$, referred to by $\mathit{maxsim}_{asp}(C_1, C_n, t_1, t_n)$.

In the simplest cases, calculating similarity between two versions of an aspect means applying a standard notion of similarity directly to the aspects. For labels, being defined as Strings, there are standard distance measures, such as Levenshtein's [26] edit distance. For sets of predicates, variants of Jaccard similarity [19] are the obvious candidates. The situation can, however, be more intricate depending on the ontological commitment w.r.t. extension and intension. For example, if the extension of a concept is defined as the set of articles annotated with it, the overlap between extensions at different moments in time is usually empty. In our case study on political journalism we will later first define similarity between the instances of the extension and aggregate over this similarity in order to determine the overall similarity between the extensions. As stated before, similarity needs to be defined explicitly per application. Again, we believe that a framework for studying concept drift should be generic enough to cater for particular definitions of similarity.
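The standard measures named above, and the series similarity defined earlier in this section, can be sketched as follows. This is a minimal Python illustration, not the toolkit code: normalising the edit distance by the length of the longer label is an assumption (Section 5.1 uses a different transformation), and all function names are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # deletion
                current[j - 1] + 1,                    # insertion
                previous[j - 1] + (char_a != char_b),  # substitution
            ))
        previous = current
    return previous[-1]


def label_similarity(a: str, b: str) -> float:
    """Edit similarity in [0, 1]; this normalisation is an assumption."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def jaccard(x, y) -> float:
    """Jaccard similarity of two sets; two empty sets count as equal."""
    x, y = set(x), set(y)
    return 1.0 if not (x or y) else len(x & y) / len(x | y)


def series_similarity(variants, sim) -> float:
    """Average similarity of neighbouring meanings C_1^{t_1}, ..., C_n^{t_n}."""
    if len(variants) < 2:
        raise ValueError("at least two variants are required")
    steps = [sim(a, b) for a, b in zip(variants, variants[1:])]
    return sum(steps) / (len(variants) - 1)
```

Because `series_similarity` takes the similarity function as a parameter, the same averaging applies to any of the three aspects.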

3.2. Meaning drift based on concept identity over time

All aspects of the meaning of a concept can change. We claim that the meaning of a concept drifts if there are two variants that differ at different points in time. If a concept at different times has the same meaning, there is no concept drift. This definition of drift is based on the idea that a concept retains its identity over time, i.e., remains the same at least temporarily. In order to define drift, we need a workable definition of a concept and variants of such concepts.

Given the philosophical assumption that concepts can exist at different moments in time, concepts can naturally also have different meanings at different times.


Definition 2. The meaning $C^t$ of a concept $C$ at time $t$ is a variant of $C$.

This definition assumes the stability of a concept over time even though different variants might semantically differ; in other words, the concept itself retains its identity over time. Identity allows us to compare two variants of the same concept at different moments in time even if the meaning (label, extension and possibly parts of its intension) has changed. We will assume that there is always only one variant of a concept at one moment in time, i.e., only one concept at a moment can be identical to a concept at another moment.

Definition 3. The ordered set of all variants of a concept $C$ between two moments $t_1$ and $t_n$ is called a chain, abbreviated as

$$\mathit{chain}(C, t_1, t_n) = C^{t_1} \rightarrow C^{t_2} \rightarrow \ldots \rightarrow C^{t_n}$$

for all time-points between $t_1$ and $t_n$.

Given our "meaning of a concept" and the notion of identity, we can define concept drift per aspect as the basic notion of change in the meaning of a concept.

Definition 4. A concept $C$ has extensionally drifted (according to identity semantics) between time $t_i$ and $t_j$ ($t_i \le t_j$) if and only if $\mathit{sim}_{ext}(C^{t_i}, C^{t_j}) \ne 1$. Intensional and label drift are defined similarly. The meaning of a concept has drifted if one of the aspects has drifted.

Before we go on to discuss an alternative way of defining concept drift based on concept morphing, let us discuss the concept of identity in more detail.

3.2.1. A closer look at identity

Our definition of concept, its meaning and the subsequent notion of identity is rather circular and non-committal. However, an exhaustive discussion on the nature of what a concept is, and whether it remains itself over time, is not at the core of this paper. What matters for this model is the assumption of identity and what we can do whenever we can identify identical concepts over time.

First, let us give one possible explanation of what identity of a concept could mean. This suggestion is based on the idea of rigidity of the intension of a concept.6 Formally, we assume that the intension of a concept $C$ is the disjoint union of a rigid and a non-rigid set of properties (i.e., $\mathit{int}_r(C) \cup \mathit{int}_{nr}(C)$). This separation between rigid and non-rigid properties does not need to be explicitly specified, but rigidity is one way to define the identity of a concept over time. Intuitively, this amounts to the assumption that a concept is uniquely identified through some core properties that do not change over time. Formally, this is

Conjecture 1. Two concepts $C_1$ and $C_2$ are considered identical if their rigid intensions are equivalent, i.e., $\mathit{int}_r(C_1) = \mathit{int}_r(C_2)$.

This assumption implies that if the rigid core of a concept changes, the new concept will be a different concept. An example is demagogue, which used to denote the concept Political Leader while the same label now refers to a different concept Populist Demagogue.

How to determine identity?

Although we provide a possible explanation and motivation for the notion of identity based on the rigid part of the intension, this definition is by no means practical, as the rigid core is unknown in almost all applications. Since all three aspects of the meaning of a concept can drift over time, identifying identity is crucial and non-trivial.

6 Rigidity is discussed in Ontoclean [9] in a slightly different way. There rigidity is a meta-property of a concept that modellers should make explicit. We use it in a stronger way as an intrinsic and the only stable part of the meaning of a concept which otherwise can drift in various ways.

In order to render practical algorithms, we need to study ways of determining identity of concepts. There are several methods for doing so: using oracles, domain knowledge or automatic techniques. Although identity resolution between concepts over time is not at the core of this paper, we need to discuss some of these options to make credible our claim that identity-based analysis of meaning drift is practically applicable.

The simplest way of determining identity of concepts over time is to ask an oracle. Such an oracle usually comes in the form of a human domain expert who can determine whether two variants have the same rigid properties based on his/her domain knowledge. Unfortunately, such oracles are difficult to find and usually very expensive.

A cheaper way of determining identity is based on expert knowledge about the process and structure of the ontology in question. It is often the case that labels were explicitly kept stable over time to enforce a priori identity of concepts. Close knowledge of such processes can help to decide the determining factor in identifying identity.

Finally, there are attempts to detect identity automatically, for example, by assuming that most similar concepts are identical, or by conjecturing that concepts in a similarity chain between two variants are identical. The advantage is that those methods are simple and cheap to apply. The disadvantage, however, is that identity and drift are based on the same information.

In fact, this observation can be taken to the extreme of defining concept drift without the underlying notion of identity in the first place. We call this approach concept morphing.

3.3. Meaning drift based on concept morphing

In this alternative conceptualisation of concept drift, there is no underlying notion of identity we can rely on. This means that two concepts at two different moments in time are never identical. Instead, a concept $C_1$ morphs over time into a new concept $C_2$.

Let us first define morphing of the three aspects of the semantics and of the meaning of a concept in general.

Definition 5. A concept $C_i$ extensionally morphs into a concept $C_j$ from $t$ to $t+1$ if and only if $\mathit{ext}_{t+1}(C_j)$ is maximally similar to $\mathit{ext}_t(C_i)$. Formally, we write:

$$\mathit{morph\_ext}_{t,t+1}(C_i, C_j) \ \text{ iff } \ \arg\max_k \, \mathit{sim}_{ext}(C_i^{t}, C_k^{t+1}) = j$$

and similarly for intensional and label morphing.

A concept $C_i$ morphs into a concept $C_j$ from $t$ to $t+1$ if and only if all semantic aspects, i.e., extension, intension and label of $C_i$, morph into $C_j$.

Let us consider the EU in 2010 and in 2003 in our example ontology from Figs. 1 and 2. Now, the concepts with label EUCountry are different concepts. There is label morphing from the concept dbpedia:EUCountry in 2003 to the concept with the same name in 2010, as the label stays the same. However, given that the extension of concept dbpedia:EUCountry in 2003 is more similar to the extension of concept cyc:Central European Country in 2010, the former morphs extensionally into the latter.

Note that an aspect of a concept $C$ can morph into several concepts if the maximal similarity between $C$ and those concepts is the same w.r.t. this aspect.
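Definition 5 and the note on ties can be sketched as a function that returns every maximally similar concept at $t+1$. This is a minimal Python illustration under stated assumptions: the two countries added in 2010 are placeholders (`Country4`, `Country5`), since the example does not name them here, and the function names are hypothetical.

```python
def jaccard(x, y) -> float:
    """Jaccard similarity of two sets; two empty sets count as equal."""
    x, y = set(x), set(y)
    return 1.0 if not (x or y) else len(x & y) / len(x | y)


def morph_targets(aspect_at_t, candidates_at_t1, sim):
    """Names of all concepts at t+1 whose aspect is maximally similar.

    More than one name is returned when several candidates tie.
    """
    scores = {name: sim(aspect_at_t, aspect)
              for name, aspect in candidates_at_t1.items()}
    best = max(scores.values())
    return {name for name, score in scores.items() if score == best}


# Extension of dbpedia:EUCountry in 2003 (running example).
ext_2003 = {"Germany", "Netherlands", "France"}

# Extensions in 2010; Country4 and Country5 are placeholder names.
ext_2010 = {
    "dbpedia:EUCountry": {"Germany", "Netherlands", "France",
                          "Country4", "Country5"},
    "cyc:Central European Country": {"Germany", "Netherlands", "France"},
}

targets = morph_targets(ext_2003, ext_2010, jaccard)
```

Returning a set rather than a single name keeps the tie case visible to the caller, which matters for the split analysis in Section 4.2.2.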

Definition 6. A concept $C$ has extensionally drifted (according to morphing semantics) between $t$ and $t+1$, if and only if $\mathit{sim}_{ext}(C, C_1) \ne 1$ for $\mathit{morph\_ext}_{t,t+1}(C, C_1)$. Intensional and label drift are defined similarly. The meaning of a concept has drifted if one of the aspects has drifted.

Fig. 3. Concept shift (identity-based).

In the previous section, we argued that the dbpedia:EUCountry of 2010, although having changed in some aspects, was still the same concept as the one in 2003, and even as the one in 1979. An alternative is to consider the label EUCountry to refer to different concepts at different times. In this morphing view, the concept dbpedia:EUCountry in 2003 morphs into the concept dbpedia:EUCountry in 2010 on the label aspect, although neither extension nor intension stayed the same.

In the case of morphing, we call the max-sim-chain between $C_1$ and $C_n$ between $t_1$ and $t_n$ for aspect $asp$ the asp-morphing chain. The similarity between the concepts in such a maximal morphing chain will later be used to indicate the strength of a morphing chain.

4. A qualitative toolkit for analysing concept drift

In the previous section, we have formally defined concept drift as the change of aspects of the meaning of a concept over time. Concept drift happens regularly. Even if it can be measured, it is often difficult to grasp its impact, which probably results in a far too fine-grained analysis. In this section, we will define some more practical notions that describe changes in meaning: our toolkit for analysing concept drift.

This toolkit takes both possible notions of drift into account, and we will define identity-based observations and morphing-based ones. Practically, it does not matter whether one of our tools is identity or morphing based, but, conceptually, we will introduce them separately.

4.1. Identity based observations

Based on identity-based concept drift we define two notions: stability and shift.

4.1.1. Concept (in)stability

The more the meaning of a concept drifts, the more unstable it becomes. Although there is no indication of when an unstable concept becomes critically unstable, stability can still be an interesting notion to study.

A first alternative is to define a threshold based on experience. Every similarity between two variants below this threshold signifies some interesting semantic change. A typical example is label similarity, which is often defined using edit distance. A high edit distance signals label instability.

Another way of analysing concept drift over time is to compare the (average) stability of concepts. As a relative measure, it does not require any a priori commitment (such as a threshold).

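Definition 7. A concept $C$ between time $t_1$ and $t_n$ is more stable than a concept $C_1$ on aspect $asp$ if, and only if,

$$\mathrm{avg}_{1 \le i,j \le n,\; i \ne j}\big(\mathit{sim}_{asp}(C^{t_i}, C^{t_j})\big) > \mathrm{avg}_{1 \le i,j \le n,\; i \ne j}\big(\mathit{sim}_{asp}(C_1^{t_i}, C_1^{t_j})\big).$$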

Relative stability is the most interesting use of stability: to order the concepts by stability values. In our experience, the sets of the most stable and the most unstable concepts are usually of interest to domain experts. In our example ontology about the European Union from Figs. 1 and 2 the extension of the concept dbpedia:EUCountry grows from 3 to 5 countries, while that of concept cyc:Political Entity consists of just one (explicit) instance cyc:European Union in both variants. Jaccard similarity between the extension of dbpedia:EUCountry in 2003 and 2010 is thus $\frac{3}{5} = 0.6$, the one for cyc:Political Entity is 1, which makes the latter the more stable concept.
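The ranking by relative stability can be sketched as follows. This Python fragment is only an illustration of Definition 7: it assumes a symmetric similarity function, so that averaging over unordered pairs equals the average over all $i \ne j$, and the two countries added in 2010 are placeholders.

```python
from itertools import combinations


def jaccard(x, y) -> float:
    """Jaccard similarity of two sets; two empty sets count as equal."""
    x, y = set(x), set(y)
    return 1.0 if not (x or y) else len(x & y) / len(x | y)


def stability(variants, sim) -> float:
    """Average similarity over all pairs of distinct variants (Definition 7)."""
    pairs = list(combinations(variants, 2))
    if not pairs:
        raise ValueError("at least two variants are required")
    return sum(sim(a, b) for a, b in pairs) / len(pairs)


# Extensions in 2003 and 2010; Country4 and Country5 are placeholder names.
extensions = {
    "dbpedia:EUCountry": [
        {"Germany", "Netherlands", "France"},
        {"Germany", "Netherlands", "France", "Country4", "Country5"},
    ],
    "cyc:Political Entity": [
        {"cyc:European Union"},
        {"cyc:European Union"},
    ],
}

scores = {name: stability(chain, jaccard) for name, chain in extensions.items()}
ranking = sorted(scores, key=scores.get, reverse=True)  # most stable first
```

With these inputs the scores reproduce the values in the text: 0.6 for dbpedia:EUCountry and 1 for cyc:Political Entity.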

The second use of stability is to track the change of similarity between variants of a concept. The most unstable moments are those when the similarity is the lowest.

Definition 8. In a chain $\mathit{chain}(C, t_1, t_n)$ the concept $C$ is more stable at time $t_i$ than at $t_j$ on aspect $asp$ if

$$\mathit{sim}_{asp}(C^{t_i}, C^{t_{i+1}}) > \mathit{sim}_{asp}(C^{t_j}, C^{t_{j+1}}).$$

We can use this ordering in our use-cases to identify the most unstable and the most stable moments in a chain.
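Locating the least stable moment of a chain amounts to finding the step with the lowest similarity. A minimal Python sketch follows; the chain of extensions is toy data and the function names are hypothetical.

```python
def jaccard(x, y) -> float:
    """Jaccard similarity of two sets; two empty sets count as equal."""
    x, y = set(x), set(y)
    return 1.0 if not (x or y) else len(x & y) / len(x | y)


def step_similarities(chain, sim):
    """Similarity between each variant C^{t_i} and its successor C^{t_{i+1}}."""
    return [sim(a, b) for a, b in zip(chain, chain[1:])]


def least_stable_step(chain, sim) -> int:
    """Index i of the step t_i -> t_{i+1} with the lowest similarity."""
    steps = step_similarities(chain, sim)
    return min(range(len(steps)), key=steps.__getitem__)


# Toy extensional chain over four time points.
toy_chain = [{"a", "b", "c"}, {"a", "b", "c"}, {"a"}, {"a"}]
steps = step_similarities(toy_chain, jaccard)
worst = least_stable_step(toy_chain, jaccard)
```

When several steps share the minimum, `min` returns the earliest one, which is a design choice of this sketch rather than part of Definition 8.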

4.1.2. Concept shift

A special case of instability is when a concept becomes so unstable that part of its meaning is more representative for a different concept rather than for itself. We call this concept shift.

Definition 9. The meaning of a concept $C$ extensionally shifts between two of its variants $C^{t_i}$ and $C^{t_j}$ if the extension of $C^{t_j}$ is more similar to the extension of a non-identical concept than to the extension of $C^{t_i}$. Intensional and label shift are defined similarly.

Concept shift can have drastic consequences on the use of a concept in an application, because some other concept has in fact taken over its meaning. Concept shift is explained in Fig. 3, where ovals denote the meaning of a concept, with each circle a particular aspect. Arrows stand for the most similar aspects of the new variant. From the moment $t_1$ to $t_2$, there is shift in terms of the middle aspect, because, in this aspect, the concept $C_2$ at $t_2$ is more similar to $C_1$ at $t_1$ than $C_1$ at $t_2$ is, while the other two aspects do not shift.

Consider the extension of concept dbpedia:EUCountry in 2003 with the three countries Germany, Netherlands and France. In 2010 this set of countries is more similar to the concept cyc:Central European Country than to the original dbpedia:EUCountry, which means that the extensional meaning of the concept has shifted.
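A shift check can be sketched by comparing the earlier variant with every concept at the later time point. The Python fragment below follows the reading given in the explanation of Fig. 3 (another concept at $t_2$ is more similar to $C_1$ at $t_1$ than $C_1$ at $t_2$ is); it is an illustration only, and the two countries added in 2010 are placeholders.

```python
def jaccard(x, y) -> float:
    """Jaccard similarity of two sets; two empty sets count as equal."""
    x, y = set(x), set(y)
    return 1.0 if not (x or y) else len(x & y) / len(x | y)


def has_shifted(concept, aspect_before, aspects_after, sim) -> bool:
    """True if some other concept's later aspect is closer to `aspect_before`
    than the later aspect of `concept` itself.

    `aspects_after` maps concept names (including `concept`) to their aspect
    at the later time point.
    """
    own = sim(aspect_before, aspects_after[concept])
    return any(sim(aspect_before, aspect) > own
               for name, aspect in aspects_after.items()
               if name != concept)


ext_2003 = {"Germany", "Netherlands", "France"}
ext_2010 = {  # Country4 and Country5 are placeholder names
    "dbpedia:EUCountry": {"Germany", "Netherlands", "France",
                          "Country4", "Country5"},
    "cyc:Central European Country": {"Germany", "Netherlands", "France"},
}

shifted = has_shifted("dbpedia:EUCountry", ext_2003, ext_2010, jaccard)
```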

4.2. Morphing based observations

Morphing-based concept drift comes with two notions: morphing strength and split.

4.2.1. Strength of morphing chains

Without an explicit definition of identity, it does not make sense to define stability of a concept over time. A closely related notion is strength of the morphing chain, which indicates how similar the concepts in this (maximal) morphing chain between two time-points are.

Definition 10. A morphing chain $\mathit{morphchain}_{asp}(C_b, C_e, t_b, t_e)$ is stronger than another morphing chain $\mathit{morphchain}_{asp}(C_{b'}, C_{e'}, t_b, t_e)$ on aspect $asp$ if and only if

$$\mathit{max\_sim}_{asp}(C_b^{t_b}, C_e^{t_e}) > \mathit{max\_sim}_{asp}(C_{b'}^{t_b}, C_{e'}^{t_e})$$

where $\mathit{max\_sim}_{asp}(C_b^{t_b}, C_e^{t_e})$ has been defined in Eq. (1) in Section 3.

We already pointed out that in our example ontology concept dbpedia:EUCountry extensionally morphs into concept cyc:Central European Country. This similarity equals 1 (complete overlap), which makes this a very strong chain (though of course rather short). Concept cyc:European Union extensionally morphs into the concept of the same name, but there is only an overlap of 1 out of 5 articles, which is 0.2 according to standard Jaccard similarity. This means dbpedia:EUCountry → cyc:Central European Country is a stronger extensional morph chain than cyc:European Union → cyc:European Union.

Fig. 5. Chainsplit (morphing-based).

Relative strength is the most interesting use of strength we will apply in our case-study: to order the chains by strength. In our experience, the sets of strongest and weakest concepts are usually of interest to domain experts.
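Building a morphing chain and computing its strength according to Eq. (1) can be sketched greedily: at each step, follow the most similar concept at the next time point and average the step similarities. This Python fragment is an illustration under stated assumptions, not the toolkit code; ties are broken by dictionary order, and `Country4`, `Country5`, `art4` and `art5` are placeholder names.

```python
def jaccard(x, y) -> float:
    """Jaccard similarity of two sets; two empty sets count as equal."""
    x, y = set(x), set(y)
    return 1.0 if not (x or y) else len(x & y) / len(x | y)


def morphing_chain(start, snapshots, sim):
    """Follow the maximally similar concept through consecutive snapshots.

    `snapshots` is a list of dicts (concept name -> aspect), one per time
    point. Returns the chain of names and its strength as in Eq. (1).
    """
    if len(snapshots) < 2:
        raise ValueError("at least two snapshots are required")
    chain, total = [start], 0.0
    for now, nxt in zip(snapshots, snapshots[1:]):
        current = now[chain[-1]]
        best = max(nxt, key=lambda name: sim(current, nxt[name]))
        total += sim(current, nxt[best])
        chain.append(best)
    return chain, total / (len(snapshots) - 1)


snapshot_2003 = {
    "dbpedia:EUCountry": {"Germany", "Netherlands", "France"},
    "cyc:European Union": {"art1", "art2", "art3"},
}
snapshot_2010 = {
    "dbpedia:EUCountry": {"Germany", "Netherlands", "France",
                          "Country4", "Country5"},
    "cyc:Central European Country": {"Germany", "Netherlands", "France"},
    "cyc:European Union": {"art3", "art4", "art5"},
}
snapshots = [snapshot_2003, snapshot_2010]

country_chain, country_strength = morphing_chain("dbpedia:EUCountry", snapshots, jaccard)
eu_chain, eu_strength = morphing_chain("cyc:European Union", snapshots, jaccard)
```

The two strengths, 1 and 0.2, match the values discussed in the text, so the first chain is the stronger one.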

4.2.2. Concept split

Concept split is very similar to concept shift discussed in the previous section. It is defined based on the fact that not all aspects of a concept at $t_i$ are the most similar to the same concept at $t_{i+1}$. As there is no identity as the reference to talk about shift, we can only say that the meaning of a concept splits into several new concepts. This can happen in two ways (as shown in Fig. 4): either one of the aspects morphs into the aspect of a different concept from the one that the other aspects morph to (on the left-hand side), or an aspect is equally similar to the same aspect of more than one other concept (on the right-hand side of Fig. 4).

Definition 11. The meaning of a concept $C$ splits at time $t$ if, and only if, either

(1) it does not morph into another concept, or
(2) one aspect morphs into more than one concept, which happens when the similarity between those concepts is the same according to at least one aspect.

Remember that the notion of morphing of a concept is defined through simultaneous morphing of all aspects to the same concept. This implies that if the concept does not morph there must be an aspect that morphs into a different concept from the one(s) the other two aspects morph to.

In our ontology in Figs. 1 and 2 the concept dbpedia:EUCountry in 2003 1-way splits as it label-morphs into concept dbpedia:EUCountry with the same name, but extensionally morphs into concept cyc:Central European Country.
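The two cases of Definition 11 can be sketched by computing the morph targets per aspect and then checking for ties and for a common target. The Python fragment below is a simplified illustration: it uses only the label and extension aspects of the running example (the intension is omitted for brevity), exact string equality as label similarity, and placeholder names for the two countries added in 2010.

```python
def jaccard(x, y) -> float:
    """Jaccard similarity of two sets; two empty sets count as equal."""
    x, y = set(x), set(y)
    return 1.0 if not (x or y) else len(x & y) / len(x | y)


def exact(a, b) -> float:
    """Crude label similarity: 1 for identical strings, 0 otherwise."""
    return 1.0 if a == b else 0.0


def morph_targets(aspect, candidates, sim):
    """All candidate names whose aspect is maximally similar to `aspect`."""
    scores = {name: sim(aspect, value) for name, value in candidates.items()}
    best = max(scores.values())
    return {name for name, score in scores.items() if score == best}


def splits(source, candidates, sims):
    """Return (split?, targets per aspect) following Definition 11."""
    targets = {
        asp: morph_targets(source[asp],
                           {name: c[asp] for name, c in candidates.items()},
                           sim)
        for asp, sim in sims.items()
    }
    ambiguous = any(len(t) > 1 for t in targets.values())   # case (2)
    no_common = not set.intersection(*targets.values())     # case (1)
    return ambiguous or no_common, targets


eu_country_2003 = {"label": "EUCountry",
                   "ext": {"Germany", "Netherlands", "France"}}
concepts_2010 = {  # Country4 and Country5 are placeholder names
    "dbpedia:EUCountry": {
        "label": "EUCountry",
        "ext": {"Germany", "Netherlands", "France", "Country4", "Country5"},
    },
    "cyc:Central European Country": {
        "label": "Central European Country",
        "ext": {"Germany", "Netherlands", "France"},
    },
}

is_split, targets = splits(eu_country_2003, concepts_2010,
                           {"label": exact, "ext": jaccard})
```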

Finally, there is one additional possibility for the meaning of a concept to split (depicted in Fig. 5): remember that a morphing chain $\mathit{morphchain}_{asp}(C_b, C_e, t_b, t_e)$ is defined as a chain of concepts between $C_b$ (begin) and $C_e$ (end) over time points $t_b$ and $t_e$ where the similarity between the aspect $asp$ of the concepts in a chain for every time-point is maximal. A morphing chain split now occurs when the endpoint of such a chain is less similar to the original concept at the start of the chain than some other concept.

Fig. 4. 1-way concept split (left) and synonymy split (right) (morphing-based).

Definition 12. A concept has a morphing-based chain split for aspect $asp$ if the similarity between a concept $C_b$ at the beginning, and $C_e$ at the end of the chain $\mathit{morphchain}_{asp}(C_b, C_e, t_b, t_e)$ is smaller than the similarity between the same aspect of concept $C_b$ and some other concept $C'_e$ at $t_e$.

4.3. Applying the framework

To apply our framework for concept drift in a specific use-case, the following steps are required:

(1) to define intension, extension and a labelling function,
(2) to define similarity functions over intension, extension and labels.

Given that the mission of the Semantic Web includes giving meaning to resources on the Web, it could come as a surprise that defining intensions, extensions and even labelling functions is by no means trivial. The usual model-theoretic notions of extension and intension, e.g., for RDF(S) or OWL semantics, are slightly misleading here, as they refer to specific models, whereas ontologies usually represent classes of models. In practice, one needs to define the relevant notions per use-case, where each such definition is an ontological commitment. In the following section, we will give such commitments for different case-studies. It should be understood that such a commitment is never uncontroversial.

4.4. Implementation of our toolkit

This toolkit has been implemented in Python and is available at https://www.few.vu.nl/~swang/data/concept_drift/getConceptDrift.py. This toolkit currently requires different versions of the ontology to be queryable through independent SPARQL endpoints. It first collects all the information of intension, extension and labels of the concepts, and then calculates the similarity between variants of the same concepts. Based on the calculated similarity it automatically identifies concept shift and split, and provides a ranking based on concept stability and morphing strength.


5. Case-studies

In this section we will discuss four different case-studies as examples of how to use our methods, and to give some evidence of the potential of the ideas introduced in Section 3. The purpose of these case-studies is twofold: first, we want to indicate how to practically apply our analysis and toolkit by exemplifying the usage in a number of very different scenarios, based on a variety of different knowledge organisation principles. Secondly, we believe that we gather evidence of the potential usefulness of the results in a number of different domains. Although the presented results do not amount to proofs, nor are empirically representative, our experience with legal and medical experts and communication scientists shows promisingly that our approach can produce non-trivial and interesting results.

5.1. Case study 1: concept shift in political reporting

Communication scientists annotate various media content with concepts from controlled vocabularies (of increasing expressiveness) in order to study the relationship between the Media and the political processes. The idea is to code the meaning of sentences and articles in a formalised graph representation, using the method of Network analysis of Evaluative Texts (NET) [46]. Take the following sentence as an example.

Het Openbaar Ministerie (OM) wil de komende vier jaar mensenhandel uitroeien. (The Justice Department (OM) wants to eliminate human trafficking within the next four years.)

7 In [48] we have shown that this method produces reasonable results.

The sentence is coded as ⟨om, -1, human trafficking⟩ where om is an actor concept and human trafficking an issue concept, and -1 indicates the Justice Department is negative about human trafficking. Here, the concepts are taken from controlled vocabularies which generally consist of all related political actors and issues. They have recently been represented using the SKOS model [18].

In this case study, we focus on five variants of a political vocabulary used during the five most recent Dutch national election campaigns, which took place in 1994, 1998, 2002, 2003 and 2006. The size of the vocabulary increased from 101 concepts in 1994 to 580 concepts in 2006. Newspaper articles on Dutch politics during these campaign periods were manually annotated with the concepts from the particular variant of that year. In our case, the annotated sentences of these articles are considered as the instantiation of the abstract political concepts which annotated them. In 1994, we gathered 1502 such annotated sentences while in 2006, there were 14,572 annotated sentences.

Our case study in political reporting was supported by an expert from the Dept. of Communication Science who provided us with pairs of identical concepts over time and who performed a small-scale qualitative evaluation of the produced results. This expert has extensive experience in annotation and encoding of political statements over time, and has recently produced a manual mapping between the vocabularies used to describe the election campaigns for a qualitative comparison of the respective newspaper articles.

Political concepts and their meaning. We now formally define our problem: for each election campaign $t \in \{1994, 1998, 2002, 2003, 2006\}$ we have a set $D_t$ of sentences annotated by concepts from a SKOS vocabulary $V_t$.

The label of a concept is obtained using the SKOS labelling property skos:prefLabel. The extension $\mathit{ext}_s(C^t)$ of a concept $C^t \in V_t$ at time $t$ is the set of all sentences annotated by $C^t$, i.e.,

$$\mathit{ext}_s(C^t) = \{ s \in D_t \mid \mathit{annotatedBy}\ C^t \}.$$

It is most difficult to formally define the intension of a concept, as there is no explicit intensional definition of the concepts available.

We construct an explicit intension based on co-occurrence of concepts in annotations. For each concept $C$, we calculate the top $K$ concepts $\mathit{topK}_{use}(C)$ which co-occur the most in the sentences they code in one moment in time. The properties we use to define the intension are based on "topicality," i.e., a property $P_C(D)$ is true if, and only if, $D$ is in the $\mathit{topK}_{use}$ relation with $C$. The intension of $C$ is then the set of properties $\mathit{int}_s(C) = \{P_C(D) = \mathit{true}\}$; in other words, the intension is in fact determined by all associated concepts. In our following experiments, we took $K = 3$. Note that this is an empirical choice which gave sensible results.
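The sentence-based extension and the co-occurrence-based intension can be sketched with a counter over annotations. The Python fragment below is an illustration only: the annotated sentences are invented toy data (each sentence is represented by the set of concepts annotating it), the function names are hypothetical, and ties at the cut-off are resolved arbitrarily by `Counter.most_common`.

```python
from collections import Counter


def extension(concept, annotated_sentences):
    """Indices of all sentences annotated by `concept`."""
    return {i for i, concepts in enumerate(annotated_sentences)
            if concept in concepts}


def top_k_intension(concept, annotated_sentences, k=3):
    """The K concepts that co-occur most often with `concept`."""
    counts = Counter()
    for concepts in annotated_sentences:
        if concept in concepts:
            counts.update(c for c in concepts if c != concept)
    return {c for c, _ in counts.most_common(k)}


# Toy annotations for one campaign year; not taken from the real corpus.
sentences = [
    {"democracy", "referendum"},
    {"democracy", "referendum", "islam"},
    {"democracy", "referendum", "islam", "bureaucracy"},
    {"democracy", "islam", "bureaucracy", "sharia"},
    {"islam", "sharia"},
]

ext_democracy = extension("democracy", sentences)
int_democracy = top_k_intension("democracy", sentences, k=3)
```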

Similarity of intension, extensions and labels. Similarity of labels can be determined through standard Levenshtein edit distance. In our case we define similarity as 1 minus the hyperbolic tangent of the original edit distance.
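The label similarity described above can be sketched directly. The Python fragment below is a minimal illustration, not the toolkit code; it assumes the unnormalised character-level edit distance as input to the hyperbolic tangent, and the function names are hypothetical.

```python
import math


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # deletion
                current[j - 1] + 1,                    # insertion
                previous[j - 1] + (char_a != char_b),  # substitution
            ))
        previous = current
    return previous[-1]


def label_similarity(a: str, b: str) -> float:
    """1 minus the hyperbolic tangent of the edit distance."""
    return 1.0 - math.tanh(levenshtein(a, b))


same = label_similarity("corruptie", "corruptie")
one_edit = label_similarity("cat", "cut")
far_apart = label_similarity("corruptie", "belangenverstrengeling")
```

Since tanh saturates quickly, this measure is 1 for identical labels, drops to roughly 0.24 after a single edit, and is close to 0 for labels that differ by more than a few characters.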

Extensional similarity is usually determined by calculating the overlap of the extensions. In our case, this is not possible, as the sets of sentences in different years are disjoint. In order to use disjoint extensions to measure the similarity of concepts from different years, we applied the mapping tool which was developed in [40]. For each sentence in Year1, the mapping tool first looks for the most similar sentence in Year2. Then this sentence of Year1 is considered to be coded by the concept(s) with which its most similar sentence of Year2 is coded. In this way, two disjoint extensions become dually annotated, and we then measure the similarity between two concepts using their extensions.7

Since the intension of a concept is determined by its associated concepts, the intensional similarity between two concepts is determined by the set similarity between the sets of concepts with which they are associated. We use the Jaccard similarity for this purpose.

Once the similarity is calculated, we can study concept drift using each of the two theories introduced above.

5.1.1. Study concept drift based on identity

The first step of studying concept drift based on identity is to determine concept identity. The identity problem is solved, in this case, by the manual concept mapping provided by a communication science expert. This enables us to investigate the concept shift and stability in terms of the label, intension and extension.

Measuring concept stability. Since our domain expert has already indicated the identical variants of the same concepts over the years, we can easily rank the concepts by their stability. Here, the stability is calculated as the average similarity between the variants of neighbouring time-points.

Fig. 6 shows the histogram of the stability values in terms of the three aspects. We can see that many concepts have very stable labels ($sim_{label}(C) = 1$) over time. For example, the concept Asielzoekers (asylum seekers) has exactly the same label across all the years. However, some concepts do have very unstable labels. For instance, according to the domain expert, the following 6 concepts are intensionally identical:

1994_sjo_creawetsto → 1998_wcorruptie (corruption) → 1998_rbeursfraude (stock fraud) → 2002_belangenverstrengeling (conflict of interest) → 2003_corruptie (corruption) → 2006_fraude_en_corruptie (fraud and corruption)

In Fig. 6(b) and (c), we find that most concepts have rather unstable intension and extension over the years ($sim_{int}(C)$ and $sim_{ext}(C)$ are low). On one hand, this suggests that the label of concepts is more stable than their intension, which means the concept label could be used as a pseudo-identity in practice if an oracle is

Fig. 6. Stability value distribution: (a) label stability, (b) intensional stability, (c) extensional stability. The X-axis gives the average corresponding similarity over the years between variants of the same concept (0 to 1); the Y-axis gives the number of concepts with this average (0 to 45).

Fig. 7. Intension of concept Democracy in 3 years (2002, 2003, 2006), with average drift of $sim_{int} = 0.02$.

Fig. 8. Intension of concept Employers in 3 years (2002, 2003, 2006), with average drift of $sim_{int} = 0.15$.

8 For convenience, we translate all the Dutch labels into English.


not available. On the other hand, we use an approximation of the real extension; therefore, the reliability of the calculated extensional similarity is not fully guaranteed. The instability of intension is far beyond our expectation, which suggests that we should look for another way of formalising concept intension, as the reliability of the results provided by the current formalisation is doubtful.

Figs. 7 and 8 give examples of an intensionally very unstable and a very stable concept, respectively. Here, the red links are the identical concepts provided by domain experts, the black links are the association links (i.e., the concepts which co-occur most often in annotated sentences) and the numbers next to the links are the strengths of the association. As the $sim_{int}$ value indicates, Concept Employers is rather stable over the years, while Concept Democracy seems to shift its meaning in different years. These observations are also consistent with the political reality. Although in this particular case this is related to the definition of the intension, it is still interesting to observe from these two figures that, when one concept is stable, its closely associated concepts tend to be stable too, while the concepts closely associated with an unstable concept tend to be unstable as well.

Identifying concept shift. Even with the defects of our commitments, we can still identify interesting concept shifts.

Label shift. As shown in Fig. 9(a),⁸ 1994_Military has the same label as 1998_Military. However, according to our domain expert, 1998_Military (militairen) is actually identical to 2006_Dutch military deployment, while 1994_Military is identical to 2006_Military. These two concepts with the same label "Military" are actually different concepts. Therefore, Concept 1994_Military has a shift in label as, according to Definition 9, its label shifts to another concept.

Extension shift. Fig. 9(b) suggests an extensional shift. According to the domain expert, 2003_Childcare is the same as 2006_Childcare. However, 2006_Free_childcare is more similar to 2003_Childcare in terms of their extension. In this case, we say 2003_Childcare has shifted its meaning extensionally towards a more specific (narrower) topic, namely free childcare. This is also confirmed by the post-hoc analysis of our domain experts.


Fig. 9. Example of label shift and extension shift, where the red links indicate the two concepts are identical according to our domain experts, while the blue links are the most similar concepts in terms of the corresponding aspect. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

Fig. 10. Example of morphing chains, where the red links are consistent with concept identity provided by the domain expert. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)


5.1.2. Study concept drift based on morphing

Let us now study concept drift without relying on the existence of concept identities. As we introduced earlier, morphing-based concept drift involves two notions: morphing strength and concept split.

Measuring morphing strength. When each concept in a series of concepts can always morph into a concept at the next time point, a morphing chain is formed. Starting from each concept, we find its max-sim-chain, as defined in Section 3.1.3. Even if we take only the most similar concept as the morphed one, these morphing chains each have a different strength (see Definition 10). We therefore rank these morphing chains based on their strength.
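A greedy construction of such a chain can be sketched as follows. Section 3.1.3 and Definition 10 are not repeated here, so two things are assumptions of this sketch: the chain always follows the single most similar concept into the next snapshot, and the strength of a chain is taken to be the mean of its link similarities. The function names are hypothetical.

```python
def max_sim_chain(start, later_snapshots, similarity):
    """Follow the most similar concept into each later snapshot.

    Returns the chain of concepts and the similarity of every link.
    """
    chain, link_similarities = [start], []
    current = start
    for snapshot in later_snapshots:
        if not snapshot:
            break  # nothing to morph into, so the chain ends here
        best = max(snapshot, key=lambda candidate: similarity(current, candidate))
        link_similarities.append(similarity(current, best))
        chain.append(best)
        current = best
    return chain, link_similarities


def chain_strength(link_similarities):
    """Assumed strength measure: mean link similarity (0.0 for an empty chain)."""
    if not link_similarities:
        return 0.0
    return sum(link_similarities) / len(link_similarities)
```

Ranking the chains then means computing `chain_strength` for the chain that starts at each concept and sorting by that value.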

In Fig. 10, we list three morphing chains. The top one is actually consistent with the identity provided by our domain expert. This morphing chain also has the highest strength. Clearly, this chain makes more sense than the bottom one, which is the weakest morphing chain.

Detecting concept split. If the three aspects do not morph unambiguously into one concept, or at least one aspect morphs to multiple concepts, there is a concept split. In Fig. 11, Concept Free Childcare has a 1-way concept split from 1998 to 2002, as defined in 4.2.2. Concept Capitalism intensionally splits into three concepts which are also different from its extensional morph. Unfortunately, almost all concepts have a morphing-based chain split, which is not really surprising. As we discussed above, our commitments to concept intension and extension in this case are both problematic; therefore the reliability of the similarity is not guaranteed. As a consequence, the detected splits are also doubtful.
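The split test can be sketched as below, assuming one similarity function per aspect. The formal k-way split of Section 4.2.2 is not reproduced; this sketch only reports whether the aspects agree on a single morph target. Ignoring candidates with zero similarity is our assumption, and the names are hypothetical.

```python
def morph_targets(concept, next_snapshot, similarity):
    """Concepts in the next snapshot that tie for the highest similarity."""
    scores = {candidate: similarity(concept, candidate) for candidate in next_snapshot}
    if not scores:
        return set()
    best = max(scores.values())
    if best <= 0:
        return set()  # assumption: zero similarity means no morph target
    return {candidate for candidate, score in scores.items() if score == best}


def detect_split(concept, next_snapshot, similarity_by_aspect):
    """Return (is_split, targets per aspect) for one concept and one time step.

    similarity_by_aspect maps an aspect name ("label", "intension",
    "extension") to the similarity function used for that aspect.
    """
    targets_by_aspect = {
        aspect: morph_targets(concept, next_snapshot, similarity)
        for aspect, similarity in similarity_by_aspect.items()
    }
    all_targets = set().union(*targets_by_aspect.values())
    return len(all_targets) > 1, targets_by_aspect
```

A split is reported both when different aspects point to different concepts and when a single aspect has several equally similar targets.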

9 http://wiki.dbpedia.org/
10 http://linkeddata.org/
11 The terminology is called differently in the different DBpedia versions, but usually something like dbpedia-ontoloy.owl.

5.1.3. Identify concept drift moments

We can also measure at which moment(s) the concepts are the most stable or unstable. At time point $t_i$, we measure the similarity of concepts between time points $t_i$ and $t_{i+1}$. When concept identity can be defined, we take the average similarity between two variants of the same concepts. When using morphing theory, we take the average similarity between all concepts at $t_i$ and their morphed ones at the next time point $t_{i+1}$. Fig. 12 compares the analysis using these two theories of concept drift, where the error bars indicate the corresponding standard deviations over all the morphing strengths.

The three aspects show the same trend. Both theories indicate that the concepts are the most stable between 2002 and 2003, as the similarity is the highest. Between 1998 and 2002, by contrast, there were big changes in terms of intension, especially using the morphing theory. This indicates that the concepts were associated with quite different concepts in these two years.
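The per-moment analysis can be sketched as follows. The input is assumed to be, for every transition between neighbouring time points, the list of similarities of all concept pairs considered (identity pairs or morphing pairs). Using the population standard deviation for the error bars is our assumption, and the function names are hypothetical.

```python
from statistics import mean, pstdev


def drift_profile(similarities_per_transition):
    """Map each transition (t_i, t_i+1) to (mean, standard deviation)."""
    return {
        transition: (mean(values), pstdev(values))
        for transition, values in similarities_per_transition.items()
        if values  # transitions without any concept pair are skipped
    }


def most_stable_transition(profile):
    """Transition with the highest average similarity."""
    return max(profile, key=lambda transition: profile[transition][0])
```

The least stable moment is found in the same way with `min` instead of `max`.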

5.2. Case-study 2: concept drift in DBpedia

DBpedia⁹ is probably the most successful ontology currently linked within the Linked-Open Data (LOD) cloud.¹⁰ It combines a hand-crafted class hierarchy with automatically generated instance data taken from the Wikipedia effort. Through its huge coverage, DBpedia is now the most strongly linked dataset within the LOD. For the sake of this research we consider the DBpedia ontology in RDFS [30], i.e., we ignore the (very few) OWL operators used in the model. RDF(S) and its underlying semantics is a very common modeling framework, which makes DBpedia an interesting object of study as it is almost exclusively modeled in RDF(S).

RDF(S) concepts and their meaning. RDFS comes with a specific labeling relation rdfs:label. Furthermore, RDF is equipped with an rdf:type relation that relates objects with classes. It seems natural to define the extension of an RDF class to be the set of all instances in the rdf:type relation. The intension of a class is on the other hand not specifically defined in RDF(S). We have chosen a simple approach which focusses on the "semantic" operators in RDFS with fixed semantics, more precisely rdfs:subClassOf, rdfs:range and rdfs:domain. The intension of a concept C is then simply the set of all triples with C in the subject or object position of these three types of triples.

Let us define the meaning of a DBpedia concept formally. We will call the combination of the DBpedia terminology T¹¹ and the explicit type information as well as the relations translated from Wikipedia, the DBpedia ontology.

Fig. 11. Examples of the detected concept split.

Fig. 12. Concept stability over time, using two formalisations (political vocabulary): (a) label stability, (b) intensional stability, (c) extensional stability, (d) whole concept stability.

Definition 13. Let $O$ be the DBpedia ontology, i.e., a set of triples $(s, p, o)$, and $O^{*}$ the semantic closure of $O$. The rdf-label $lab_r(C)$ of $C$ is defined as the object of $(C, \text{rdfs:label}, o)$. The rdf-extension $ext_r(C)$ of $C$ is defined as the set of resources $r$ such that $(r, \text{rdf:type}, C) \in O^{*}$. The rdf-intension $int_r(C)$ of $C$ is defined as the set of all triples $(C, p, o) \in O^{*}$ where $p = \text{rdfs:subClassOf}$, and all triples $(s, p, C) \in O^{*}$ where $p \in \{\text{rdfs:subClassOf}, \text{rdfs:domain}, \text{rdfs:range}\}$.

Please note that this definition is just one possible choice of ontological commitment regarding the meaning of a concept in RDF(S).

Similarity relations between labels are defined on the basis of string similarity. Similarity between intensions and between extensions, which are just sets of triples and resources respectively, can easily be defined through the set similarity (Jaccard).
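The three aspects of an RDF(S) concept can be read off a set of triples as sketched below. The sketch assumes the triples are plain `(subject, predicate, object)` string tuples with prefixed names, and that the semantic closure has already been computed by the repository; it follows the prose description (C in subject or object position of the three RDFS predicates). The function names are hypothetical.

```python
INTENSION_PREDICATES = {"rdfs:subClassOf", "rdfs:domain", "rdfs:range"}


def rdf_label(concept, triples):
    """Object of the first (concept, rdfs:label, o) triple, or None."""
    return next((o for s, p, o in triples if s == concept and p == "rdfs:label"), None)


def rdf_extension(concept, triples):
    """All resources r with (r, rdf:type, concept) among the triples."""
    return {s for s, p, o in triples if p == "rdf:type" and o == concept}


def rdf_intension(concept, triples):
    """Triples over the three RDFS predicates with the concept as subject or object."""
    return {
        (s, p, o)
        for s, p, o in triples
        if p in INTENSION_PREDICATES and concept in (s, o)
    }
```

Two versions of the same class can then be compared by applying a set similarity such as Jaccard to the two extensions and to the two intensions.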

Experiments to study concept drift in DBpedia. We studied the four latest versions of the DBpedia ontology, namely DBpedia 3.5, 3.4, 3.3 and 3.2. Table 1 gives some general information about these four versions. These versions were uploaded into independent RDF repositories.

5.2.1. Study concept drift based on identity

DBpedia concepts have their URI references which remain stable over the versions. Therefore we use the URIrefs as the identities of concepts. Again this is a rather arbitrary decision that can be argued for or against. In our case, a DBpedia concept refers to its website, rather than an underlying abstract concept. If this website suddenly refers to some other topic (which happens occasionally), this change will be detected as intensional shift.


Table 2. The top five most stable and last five least stable DBpedia concepts in terms of their extension and intension (of the 167 concepts present in all four versions).

Rank   Extensional        Intensional
1      Planet             SportsEvent
2      Road               FormulaOneRacer
3      Infrastructure     WineRegion
4      Cyclist            Cleric
5      LunarCrater        WrestlingEvent
...    ...                ...
163    OfficeHolder       Vein
164    Politician         BasketballPlayer
165    City               EthnicGroup
166    College            Band
167    ChemicalCompound   BritishRoyalty

Table 1. Four versions of the DBpedia ontology.

Version   #Concept   #Resource
3.5       255        1,477,377
3.4       204        1,161,678
3.3       174        1,054,199
3.2       174        875,273

13 http://ontology.leibnizcenter.org/trac/wiki/LKIFCore


Although most DBpedia concepts define their label using rdfs:label, these labels are mostly equivalent to the localnames of their URIrefs. This means that labels are strongly correlated to the identity, and it is thus less interesting to study label drift.

For each concept, we built its extensional representation (i.e., the set of instances) and its intensional representation (i.e., the set of related triples). The Jaccard similarity measure was applied to measure the extensional and intensional similarity between the different variants of the same concept. In the end, we averaged over all similarities to indicate the general level of stability of this concept, in terms of its extensional and intensional semantics, respectively. Table 2 gives the five most stable and five least stable concepts in terms of these two aspects.

Concept Politician is considered extensionally very unstable. This can easily be confirmed by the change in the sheer number of instances. In Version 3.2, it has only 476 instances, while in Version 3.5, it already has 19,285 instances. The extension of this concept clearly has expanded significantly. However, low stability does not necessarily lead to concept shift. For example, although growing, the extension of Politician is always the most similar to the extension of its following variant.

Concept City is also very unstable extensionally. From Version 3.4 to Version 3.5 it has indeed shifted to another concept, Settlement. These two concepts share more than 73% of their instances, which causes a high extensional similarity between them, higher than the similarity between the two variants of City. We found that Settlement appeared only in Version 3.5. For some reason, most of the instances of City in Version 3.4 have been transferred to Settlement. As DBpedia is of course based on the community effort of Wikipedia, this poses the interesting question of whether this was an intentional decision or just happened.¹²
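The difference between instability and shift can be made concrete with a small sketch. It assumes concepts are identified by name across versions (as with URIrefs here) and reports a shift when the most similar extension in the next version belongs to a different concept. The function names and the toy data in the usage note are hypothetical; the formal notion of shift (Definition 9) is not reproduced.

```python
def jaccard(set_a, set_b):
    """Jaccard similarity between two sets (0.0 when both are empty)."""
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


def extensional_shift(concept, extension_now, next_extensions):
    """Name of the concept this one shifts to, or None if there is no shift.

    next_extensions maps each concept name in the next version to its set of
    instances. Ties are resolved by dictionary order, which is an arbitrary
    choice of this sketch.
    """
    if not next_extensions:
        return None
    best = max(next_extensions, key=lambda name: jaccard(extension_now, next_extensions[name]))
    return best if best != concept else None
```

With toy data in which most instances of City move to a new concept, `extensional_shift("City", {"a", "b", "c", "d"}, {"City": {"a"}, "Settlement": {"a", "b", "c", "e"}})` reports `"Settlement"`, whereas a concept whose extension only grows is still most similar to its own next variant and yields `None`.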

Similarly, studying the intensional stability and shifts also gives insight into the evolution of the intensional semantics of a concept. For example, the intensionally very unstable concept EthnicGroup has been involved in the rdfs:domain and rdfs:range of a continuously changing set of properties. This indicates that this concept is related to different concepts in different versions, which contributes to the intensional instability. Furthermore, real shifts happened to some concepts. The identified extensional shift from City to Settlement is also found to be an intensional shift, that is, these two concepts not only share a lot of instances, but their intensional definitions are very similar too. This double confirmation is valuable because it may well indicate a genuine concept shift.

12 The Wikipedia logs unfortunately do not give clear insights into this question.

5.2.2. Study concept drift based on morphing

If we do not constrain ourselves to use explicit identities, what can the morphing theory tell us?

Concept SportsEvent has the strongest morphing chain, while Concept ChemicalCompound has the weakest morphing chain. Stronger morphing chains can involve concept shift if the identities are known, such as the red links in Fig. 13.

Let us look at the morphing-based chain split. Out of 104 morphing chains from Version 3.2 to Version 3.5, only 6 concepts have such a chain split. This indicates that the propagation of the similarity along maximum morphing pairs of concepts to a great extent preserves the identity of the concepts, although for a very small number of concepts, such propagation leads to a relatively more drastic drift.

5.2.3. Identify concept drift moments

As shown in Fig. 14, the concept label is indeed the most stable aspect of the concept meaning. There are relatively big intensional changes between Versions 3.3 and 3.4, while the concepts are rather stable in terms of their extensions. As opposed to Fig. 12, the two theories give rather similar analyses. One reason is that RDFS provides much more explicit definitions for concept intension and extension, which leads to more reliable and consistent analysis.

5.3. Case-study 3: concept drift in LKIF-Core

In this section we look at the evolution of concepts in LKIF-Core, an OWL ontology of basic legal concepts.¹³ It consists of 15 modules, each of which describes a set of closely related concepts from both legal and common sense domains. This ontology has also been continuously developed, and uses most of OWL's expressiveness.

Our case study on a legal ontology was supported by an expert from the Leibniz Center for Law of the University of Amsterdam, who performed a small-scale qualitative evaluation of the produced results and gave advice on the intended meaning of the concepts in LKIF. This expert is one of the leading developers of LKIF.

OWL concepts and their meaning. The formal meaning of a concept in an OWL DL ontology is often far more explicitly defined than in other formalisms. The intension of a concept could potentially be defined as the set of all possible DL concepts that are equivalent to it w.r.t. the ontology. Unfortunately, this is only possible in the so-called definitorical terminologies, and difficult to calculate even if possible. We will approximate this set in our experiments (rather coarsely) by using the OWLIM¹⁴ interpretation of LKIF, and consider finite sets of consequences (triple chains). As OWL ontologies are specifications of sets of possible models, there is no unique notion of the extension of concepts. However, once committed to a particular model, DL semantics provide the formal instance-of relation to specify the extension of a concept.

Let us define the meaning of concepts in LKIF.

Fig. 13. Examples of morphing chains.

Fig. 14. Concept stability over time, using two formalisations (DBpedia): (a) label stability, (b) intensional stability, (c) extensional stability, (d) whole concept stability.

Definition 14. Let $O$ be the LKIF-Core ontology and $O^{*}$ denote the OWLIM inferred semantic closure. The owl-label $lab_o(C)$ of $C$ is defined as the object of $(C, \text{rdfs:label}, o)$. The owl-extension $ext_o(C)$ of $C$ is defined as the set of individuals $i$ such that $(i, \text{rdf:type}, C) \in O^{*}$. The owl-intension $int_o(C)$ of $C$ is defined as:

(1) all triples $(C, p, o) \in O^{*}$ and $(s, p, C) \in O^{*}$,

(2) all triples in chains $\{(C, p_1, o_1), (s_2, p_2, o_2), \ldots, (s_n, p_n, o_n)\}$ where $s_k = o_{k-1}$, plus

(3) all triples in chains $\{(s_1, p_1, o_1), (s_2, p_2, o_2), \ldots, (s_n, p_n, C)\}$ where $s_{k+1} = o_k$ being blank nodes.

14 OWLIM is an RDF database management system, supporting the semantics of RDFS, OWL Horst and OWL 2 RL, see http://www.ontotext.com/owlim/. For our experiments we used OWLIM version 3.

Again, the above definition is only one possible ontological commitment regarding the meaning of an OWL concept. Based on this definition, the similarity of label, intension and extension can be calculated in the same way as for the DBpedia case.
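One way to collect such triple chains is sketched below. It assumes triples are `(subject, predicate, object)` string tuples, that blank nodes are written with the `_:` prefix, and that chains in both directions are only continued through blank nodes (the definition states this explicitly for case (3) only, so applying it to case (2) is our reading). The function names and the example identifiers are hypothetical.

```python
def is_blank(node):
    """Blank nodes are assumed to be written as '_:name'."""
    return node.startswith("_:")


def owl_intension(concept, triples):
    """Triples touching the concept directly or via chains of blank nodes."""
    by_subject, by_object = {}, {}
    for triple in triples:
        by_subject.setdefault(triple[0], set()).add(triple)
        by_object.setdefault(triple[2], set()).add(triple)

    intension = set()
    # Forward chains: start at the concept, continue through blank-node objects.
    frontier, seen = [concept], {concept}
    while frontier:
        node = frontier.pop()
        for triple in by_subject.get(node, ()):
            intension.add(triple)
            if is_blank(triple[2]) and triple[2] not in seen:
                seen.add(triple[2])
                frontier.append(triple[2])
    # Backward chains: end at the concept, continue through blank-node subjects.
    frontier, seen = [concept], {concept}
    while frontier:
        node = frontier.pop()
        for triple in by_object.get(node, ()):
            intension.add(triple)
            if is_blank(triple[0]) and triple[0] not in seen:
                seen.add(triple[0])
                frontier.append(triple[0])
    return intension
```

In OWL serialised as RDF, class restrictions are typically anonymous (blank) nodes, so following blank nodes pulls the full restriction into the intension without wandering off into the definitions of other named classes.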

Experiments to study concept drift in LKIF. We studied four major versions of LKIF, namely 1.0, 1.0.2, 1.0.3 and 1.1. As before, we take the localname in the URIrefs as the identity of concepts and then study the concept drift based on identity. We also study concept drift using the morphing theory. Unfortunately, rdfs:label was rarely used (only four concepts) and the LKIF ontology does not come with any instance data. This limits us to the intensional shift, stability, split and morphing strength.

Table 3. Top five stable and unstable concepts.

Most stable concepts               Most unstable concepts
norm.owl#Custom                    legal-action.owl#Mandate
expression.owl#Promise             legal-action.owl#Public_Law
norm.owl#Potestative_Expression    legal-action.owl#Asignment
norm.owl#Hohfeldian_Power          legal-action.owl#Act_of_Law
relative-places.owl#Place          legal-action.owl#Delegation

Table 4. Examples of confirmed intensional shift in LKIF-Core.

lkif1.0:action.owl#Speech_Act              lkif1.0.2:expression.owl#Speech_Act
lkif1.0:action.owl#Termination             lkif1.0.2:process.owl#Termination
lkif1.0.2:lkif-top.owl#Mental_Concept      lkif1.0.3:lkif-top.owl#Mental_Entity
lkif1.0.2:lkif-top.owl#Physical_Concept    lkif1.0.3:lkif-top.owl#Physical_Entity

17 As shown in previous cases, two theories provide similar results. However, the

5.3.1. Study concept drift based on identity

All concepts were ranked according to their intensional stability; see Table 3 for some examples. The ranked list has been confirmed by one of the developers of LKIF to be consistent with his expectations.

By comparing the intension of concepts between different versions, we were also able to find true concept shifts, listed in Table 4.

As Table 4 shows, some intensional shifts correspond to a move between modules. For example, Speech_Act is no longer a general action; instead, it belongs to the expression module, which describes propositions and propositional attitudes (belief and intention), qualifications, statements and media. The other kind of shift, for example Mental_Concept to Mental_Entity, was confirmed to be a renaming operation.

5.3.2. Study concept drift based on morphing

Some shifts (found based on identity semantics) can also be recognised when looking at the morphing chains. In Fig. 15, the shift from action.owl#Speech_Act in Version 1.0 to expression.owl#Speech_Act in Version 1.0.2 is the first link (in red) of the morphing chain. This morphing chain is consistent with the actual operations on this ontology. Another morphing chain starting from Concept lkif1.0:action.owl#Termination is also confirmed by our domain expert, but its strength is weaker.

Similar to the DBpedia case, out of 98 morphing chains from Version 1.0 to Version 1.1, only 7 concepts have a morphing-based chain split. As we know that the labels act almost as unique identifiers in LKIF, this indicates further that the max-sim-chain on intension can be used as a reliable strategy for identifying concept identities.

5.3.3. Study concept drift moments

Again we compare the stability of chains, either based on identity or morphing. As visible in Fig. 16, there are fewer changes between Versions 1.0.2 and 1.0.3. Similar to the DBpedia case, the two theories gave quite similar analyses.

5.4. Case-study 4: concept drift in NCBO biomedical ontologies

We also studied the concept drift in a few biomedical ontologies from the NCBO Bioportal.¹⁵ These ontologies are in the OBO format.¹⁶ Unfortunately, there is also no instance data available from the Bioportal. We therefore only study concept drift in terms of label and intension.

15 http://bioportal.bioontology.org/
16 http://www.geneontology.org/GO.format.obo-1_2.shtml

Our case study on biomedical ontologies was supported by NCBO in Stanford. Through the contacts at NCBO, small-scale qualitative analyses of the results were done by one of the curators of the Gene Ontology, and by main contributors to the HPO.

OBO concepts and their meaning. The meaning of an OBO concept is defined by a term stanza which consists of information given as tag: value pairs. Each term has its identity defined using the tag id and its label using the tag name. Its intension is then defined by all the remaining tag: value pairs. The instances of an OBO ontology are defined by instance stanzas, each of which is associated with its concept using the tag instance_of. Therefore, we define the meaning of an OBO concept formally as follows:

Definition 15. Let $O$ be an OBO ontology. The obo-label $lab_o(C)$ of concept $C$ is defined as the value of the tag name. The obo-extension $ext_o(C)$ of $C$ is the set of instances whose value of the tag instance_of is $C$. The obo-intension $int_o(C)$ of $C$ is the set of the remaining tag: value pairs of $C$ (i.e., all but the name tag).
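Reading these aspects from a term stanza can be sketched as follows. The parser below is a simplification written for illustration (it is not a full OBO parser and keeps trailing `!` comments as part of the value). Following the prose above, both the id and the name tag are left out of the intension; this is our reading, since the definition itself only excludes the name tag.

```python
def parse_term_stanza(stanza):
    """Split one OBO [Term] stanza into (id, label, intension).

    The intension is the set of (tag, value) pairs other than id and name.
    """
    pairs = []
    for line in stanza.strip().splitlines():
        line = line.strip()
        if not line or line.startswith("[") or line.startswith("!"):
            continue  # skip blank lines, the stanza header and comment lines
        tag, _, value = line.partition(":")
        pairs.append((tag.strip(), value.strip()))
    term_id = next((value for tag, value in pairs if tag == "id"), None)
    label = next((value for tag, value in pairs if tag == "name"), None)
    intension = {(tag, value) for tag, value in pairs if tag not in ("id", "name")}
    return term_id, label, intension
```

Since only the first colon separates tag and value, identifiers such as `HP:0005730` survive intact as values.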

Based on this definition, the similarity of label, intension and extension can be calculated in the same way as for the previous DBpedia and LKIF cases.

Study concept drift in OBO ontologies. The Human Phenotype Ontology (HPO) has 28 versions, with the number of concepts increasing from 8773 to 9559. Among these versions, we identified 18 label shifts. These label shifts occur mainly because the concept of the previous version was deleted and replaced by a new concept with the same label but a new id number. For example, Concept HP:0005730 (Small epiphyses) in Version 1.66 is substituted by Concept HP:0010585 in Version 1.66.

We also identified 6707 intensional shifts. Some intensional changes are due to changes in the ontology structure. For example, the property synonym in Version 1.1 was changed into exact_synonym in Version 1.2, which then was changed back to synonym. The current similarity measure is sensitive to such changes; therefore, many false intensional shifts were identified. In Fig. 17, the stability between Versions 1.54 and 1.59 drops to around 0.5, because a new property xref was introduced and most of the concepts were enriched with UMLS references using this tag. The intension of the concepts was therefore enlarged at that moment, which should be considered a crucial moment in the evolution of the concepts. However, this of course relies on the commitments we made about which information should be considered part of the intension.
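The sensitivity to tag renaming is easy to reproduce on a toy intension. The sketch below first shows how a pure rename lowers the Jaccard similarity, and then one possible mitigation that is not part of the experiments reported here: mapping known tag aliases to a canonical tag before comparing. The alias table and the function names are hypothetical.

```python
def jaccard(set_a, set_b):
    """Jaccard similarity between two sets (1.0 when both are empty)."""
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 1.0


def normalise_tags(intension, aliases):
    """Rewrite (tag, value) pairs so that aliased tags share one canonical name."""
    return {(aliases.get(tag, tag), value) for tag, value in intension}


# Toy intensions of one concept before and after the tag was renamed.
before = {("synonym", "X"), ("is_a", "P")}
after = {("exact_synonym", "X"), ("is_a", "P")}

raw_similarity = jaccard(before, after)  # the rename alone lowers the similarity
aliases = {"exact_synonym": "synonym"}   # hypothetical alias table
normalised_similarity = jaccard(normalise_tags(before, aliases),
                                normalise_tags(after, aliases))
```

Whether such a normalisation is appropriate depends on the same commitment discussed above, namely which information is taken to be part of the intension.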

According to the HPO ontology developers, it is not surprising that the early days of ontology development were more dynamic and that concepts were less stable than in the later versions. With the quality of the ontology improving, the similarity between versions is expected to become higher, which is consistent with our results.

We also looked at the evolution of the Gene Ontology (23 versions), see Fig. 18. For the Gene Ontology, we only took sample versions from the Bioportal repository on a monthly basis from June 2008 to May 2010.¹⁷ There are different types of edits, including term name changes, term definition updates, updates to relationships used by terms, and extensive changes to external references. These all contribute to the label and intensional drift. The GO developers did find that more changes had been made at the less stable moments. As shown in Fig. 18, the labels were changed the most between September and November 2008, while the most intensionally

analysis based on morphing is computationally very expensive, as a full similarity matrix between all pairs of concepts needs to be computed. The number of terms in our sampled versions grows from 27,542 to 33,243, which makes the computation practically infeasible. Therefore, we only show the results based on identity.


Fig. 15. Examples of morphing chains—LKIF.

Fig. 16. Concept stability over time, using two formalisations—LKIF.


unstable moment is between June and July 2009, which the GO expert we consulted confirmed to be due to substantial changes in the SDB (Society for Developmental Biology) terms.

6. Conclusion

More and more applications critically depend on some kind of concept schemes for the semantic interoperability of their data. However, although it is recognised by many as a critical problem, the continuous change in meaning of concepts (called drift in this paper) has not yet received the attention it deserves in the ontology modelling community. Despite the significant efforts that have gone into topics such as ontology evolution, semantic versioning or temporal modelling and reasoning, most tools are still based on static representations. The existing ontology versioning frameworks focus on the interoperability between versions and data. There is not yet a formal framework for concept drift, nor an implementation for identifying significant concept drift.

This paper attempts to close this gap by introducing a theoretical foundation for concept drift. We study concept drift based on two theories: one based on concept identity and one based on concept morphing. We also propose a generic qualitative toolkit to study concept drift through some more practical notions: shift and stability, split and morphing strength, over time. We apply this general formalisation in practical applications modelled in SKOS, RDFS, OWL and OBO. These plausibility tests are still preliminary, but encouraging: although intensional drift is difficult to study because the concepts are often not formally defined, the detected concept drift gives useful information for the domain experts, for example, identifying substantial ontology design changes. Additionally, our experiments also suggest that studying morphing chains is a promising strategy in practice for identifying concept identities.

Our case-studies also indicate that in few realistic scenarios all of the proposed methods give useful insights, or can even be meaningfully applied. In some cases, one or even two of the aspects of the meaning were missing; in others one of the aspects was almost functionally coupled with identity. For that reason we introduced our mechanisms as a toolkit, for the knowledge engineer and domain expert to try out, and in the end to choose those that are most likely to produce meaningful insights.

Future research will go in two directions: first, in cooperation with Communication Scientists working on political reporting and legal experts, we will apply the proposed methods to detect changes of meaning in real-world scenarios. We are also involved in a close cooperation with NCBO which will give us more insights into the problem. In at least two of these domains we will have to do a more detailed, larger-scale usability study to assess the validity of our proposed notions beyond the slightly anecdotal evidence provided in this paper. With the experience gained, we plan to create applications; the ultimate goal is of course to improve the usability of our toolkit for easier analysis of drift in dynamic environments, and to extend the semantics of concept drift into query or reasoning tasks.

Fig. 17. Concept stability over time, using two formalisations (Human Phenotype Ontology): (a) label stability, (b) intensional stability.

Fig. 18. Concept stability (Gene Ontology).

Acknowledgements

We are especially grateful to Janet Takens, Jan Kleinnijenhuis, Sebastian Köhler, Peter Robinson, Sandra Dölken, Paea LePendu, Rinke Hoekstra and Doug Howe, who participated in our use case study and provided crucial feedback.

References

[1] L. Bloomfield, Language, Allen and Unwin, 1933.

[2] Brockhaus: Europaeische gemeinschaft, translated, 1999.

[3] DTV, DTV Atlas, 1979.

[4] Encyclopedia britannica, online version visited 27 May 2010, 2010.

[5] European Union, The history of the european union, Website, <http://europa.eu/abc/history/index_en.htm>, 2010.

[6] N. Fanizzi, C. d'Amato, F. Esposito, Conceptual clustering and its application to concept drift and novelty detection, in: ESWC'08: Proceedings of the 5th European Semantic Web Conference on the Semantic Web, Springer-Verlag, Berlin, Heidelberg, 2008.

[7] G. Flouris, D. Manakanatas, H. Kondylakis, D. Plexousakis, G. Antoniou, Ontology change: classification and survey, Knowledge Eng. Rev. 23 (2) (2008) 117–152.

[8] F.L.G. Frege, Über sinn und bedeutung, Z. Philosophie Philosophische Kritik 100 (1892) 25–50.

[9] N. Guarino, C. Welty, Evaluating ontological decisions with ontoclean, Commun. ACM 45 (2) (2002) 61–65.

[10] J.A. Gulla, G. Solskinnsbakk, P. Myrseth, V. Haderlein, O. Cerrato, Semantic drift in ontologies, in: Proceedings of Sixth International Conference on Web Information Systems and Technologies, Valencia, Spain.

[11] P. Haase, F. Harmelen, Z. Huang, H. Stuckenschmidt, Y. Sure, A framework for handling inconsistency in changing ontologies, in: The Semantic Web–ISWC 2005, 2005, pp. 353–367.

[12] P. Haase, A. Hotho, L. Schmidt-Thieme, Y. Sure, Collaborative and usage-driven evolution of personal ontologies, in: A. Gómez-Pérez, J. Euzenat (Eds.), The Semantic Web: Research and Applications, Lecture Notes in Computer Science, vol. 3532, Springer, Berlin/Heidelberg, 2005, pp. 486–499, <http://dx.doi.org/10.1007/11431053_33>.

[13] J. Heflin, Z. Pan, A model theoretic semantics for ontology versioning, in: Third International Semantic Web Conference, Springer, 2004.

[14] G. Hodge, Systems of knowledge organization for digital libraries: beyond traditional authority files, Tech. Rep., Council on Library and Information Resources, April 2000.

[15] R. Hoekstra, J. Breuker, M.D. Bello, A. Boer, The lkif core ontology of basic legal concepts, in: P. Casanovas, M.A. Biasiotti, E. Francesconi, M.T. Sagri (Eds.), Proceedings of the Workshop on Legal Ontologies and Artificial Intelligence Techniques, 2007.

[16] Z. Huang, H. Stuckenschmidt, Reasoning with multi-version ontologies: a temporal logic approach, in: Proceeding of the 4th International Semantic Web Conference ISWC, 2005.

[17] A. Isaac, E. Summers, Skos simple knowledge organization system primer, <http://www.w3.org/TR/skos-primer/>, 2008.

[18] A. Isaac, E. Summers, SKOS Primer, W3C Group Note, URL: <http://www.w3.org/TR/skos-primer/>, 2009.

[19] P. Jaccard, Étude comparative de la distribution florale dans une portion desalpes et des jura, Bull. SociTtT Vaudoise Sci. Naturelles 37 (1901) 547–579.

[20] Y. Kalfoglou, M. Schorlemmer, Ontology mapping: the state of the art,Knowledge Eng. Rev. 18 (01) (2003) 1–31.

[21] A. Kalyanpur, B. Parsia, E. Sirin, B. Grau, J. Hendler, Swoop: a web ontologyediting browser, Web Semantics: Sci Services AgentsWorld Wide Web 4 (2)(2006) 144–153.

[22] M. Klein, Change management for distributed ontologies, Ph.D. Thesis, VrijeUniversiteit Amsterdam, URL <http://www.cs.vu.nl/mcaklein/thesis/>, Aug.2004.

[23] M.C.A. Klein, D. Fensel, A. Kiryakov, D. Ognyanov, Ontology versioning andchange detection on the web, in: Proceedings of the 13th InternationalConference on Knowledge Engineering and Knowledge Management.Ontologies and the Semantic Web, EKAW ’02, Springer-Verlag, London, UK,2002. URL <http://portal.acm.org/citation.cfm?id=645362.756436>.

[24] M. Klenner, U. Hahn, Concept versioning: A methodology for trackingevolutionary concept drift in dynamic concept systems, in: Proc. of ECAI1994, Wiley, 1994.

[25] J. Lehmann, C. Bizer, G. Kobilarov, S. Auer, C. Becker, R. Cyganiak, S. Hellmann,Dbpedia – a crystallization point for the web of data, J. Web Semantics 7 (3)(2009) 154–165.

[26] V. Levenshtein, Binary codes capable of correcting deletions, insertions andreversals, Soviet Phys. Dokl. 10 (1966) 707.

[27] Y. Liang, H. Alani, N. Shadbolt, Changing ontology breaks queries, in: I. Cruz, S.Decker, D. Allemang, C. Preist, D. Schwabe, P. Mika, M. Uschold, L. Aroyo (Eds.),The Semantic Web – ISWC 2006, Lecture Notes in Computer Science, vol. 4273,Springer, Berlin/Heidelberg, 2006, pp. 982–985. URL: <http://dx.doi.org/10.1007/11926078_79>.

[28] Y. Liang, H. Alani, N. Shadbolt, et al., Change management: the core task ofontology versioning and evolution, in: Proceedings of Postgraduate ResearchConference in Electronics, Photonics, Communications and Networks, andComputing Science, 2005.

[29] C. Matuszek, J. Cabral, M. Witbrock, J. Deoliveira, An introduction to the syntaxand content of cyc, in: Proceedings of the 2006 AAAI Spring Symposium onFormalizing and Compiling Background Knowledge and its Applications toKnowledge Representation and Question Answering, 2006.

[30] B. McBride, Rdf vocabulary description language 1.0: Rdf schema, <http://www.w3.org/TR/rdf-schema/>, 2004.

[31] D.L. McGuinness, F. van Harmelen, Owl web ontology language overview,<http://www.w3.org/TR/owl-features/>, January 2004.

Page 19: Concept drift and how to identify it

S. Wang et al. / Web Semantics: Science, Services and Agents on the World Wide Web 9 (2011) 247–265 265

[32] B. Motik, B. CuencaGrau, I. Horrocks, Z. Wu, A. Fokoue, C. Lutz, Owl 2 webontology language profiles, <http://www.w3.org/TR/owl2-profiles/>, 2009.

[33] N. Noy, A. Chugh, W. Liu, M. Musen, A framework for ontology evolution incollaborative environments, The Semantic Web-ISWC 2006, 2006, pp. 544–558.

[34] N.F. Noy, M. Klein, Ontology evolution: not the same as schema evolution,Knowledge Inform. Syst. 6 (2004) 428–440. URL <http://dx.doi.org/10.1007/s10115-003-0137-2>.

[35] N.F. Noy, M.A. Musen, Promptdiff: a fixed-point algorithm for comparingontology versions, in: 18th national conference on Artificial intelligence,American Association for Artificial Intelligence, Menlo Park, CA, USA, 2002,URL <http://portal.acm.org/citation.cfm?id=777092.777207>.

[36] N.F. Noy, M.A. Musen, Ontology versioning in an ontology managementframework, IEEE Intell. Syst. 19 (2004) 6–13. URL <http://dx.doi.org/10.1109/MIS.2004.33>.

[37] P. Plessers, O. DeTroyer, Ontology change detection using a version log, in: Y.Gil, E. Motta, V. Benjamins, M. Musen (Eds.), The Semantic Web ISWC 2005,Lecture Notes in Computer Science, vol. 3729, Springer, Berlin, Heidelberg,2005, pp. 578–592. URL <http://dx.doi.org/10.1007/11574620_42>.

[38] P. Plessers, O.D. Troyer, S. Casteleyn, Understanding ontology evolution: achange detection approach, Web Semantics: Sci. Services Agents World WideWeb 5 (1) (2007) 39–49. selected Papers from the International Semantic WebConference (ISWC2005), URL <http://www.sciencedirect.com/science/article/B758F-4MX4VXJ-1/2/a6eaf2835f9d1ca7799bcc27980d2973>.

[39] F.D. Saussure, Course in General Linguistics., Open Court Classics, 1986,reedition.

[40] B. Schopman, S. Wang, S. Schlobach, Deriving concept mappings throughinstance mappings, in: Proceedings of the 3rd Asian Semantic Web Conference,Bangkok, Thailand, 2008.

[41] B. Smith, M. Ashburner, C. Rosse, J. Bard, W. Bug, W. Ceusters, L.J. Goldberg, K.Eilbeck, A. Ireland, C.J. Mungall, The OBI consortium, N. Leontis, P. Rocca-Serra,A. Ruttenberg, S.-A. Sansone, R.H. Scheuermann, N.Shah, P.L. Whetzel, S. Lewis,The obo foundry: coordinated evolution of ontologies to support biomedicaldata integration, Nat. Biotechnol. (25) (2007) 1251–1255.

[42] L. Stojanovic, A. Maedche, B. Motik, N. Stojanovic, User-driven ontologyevolution management, Knowledge Eng. Knowledge Manage: OntologiesSemantic Web (2002) 133–140.

[43] L. Stojanovic, B. Motik, Ontology evolution within ontology editors, in:Proceedings of the OntoWeb-SIG3 Workshop, Citeseer, 2002.

[44] Y. Sure, M. Erdmann, J. Angele, S. Staab, R. Studer, D. Wenke, OntoEdit:Collaborative ontology development for the semantic web, The Semantic WebISWC 2002, 2002, pp. 221–235.

[45] A. Tsymbal, The problem of concept drift: definitions and related work, Tech.Rep. TCD-CS-2004-15, Computer Science Department, Trinity College Dublin,Ireland, 2004.

[46] J. VanCuilenburg, J. Kleinnijenhuis, J. DeRidder, Towards a graph theory ofjournalistic texts., Eur. J. Commun. 1 (1986) 65–96.

[47] S. Wang, S. Schlobach, M. Klein, Concept drift and how to identify it, in:European Knowledge Acquisition Workshop (EKAW), 2010.

[48] S. Wang, S. Schlobach, J. Takens1, W. van Atteveldt, Mapping-chains forstudying concept shift in political ontologies, in: Proceedings of the OntologyMatching workshop (OM 2009), Washington, USA, 2009.

[49] G. Widmer, M. Kubat, Effective learning in dynamic environments by explicitcontext tracking, in: ECML ’93: Proceedings of the European Conference onMachine Learning, Springer-Verlag, London, UK, 1993.

[50] G. Widmer, M. Kubat, Learning in the presence of concept drift and hiddencontexts, Mach. Learn. 23 (1) (1996) 69–101.

[51] Wikipedia: European union, Timeback machine: visited, URL <http://en.wikipedia.org/w/index.php?title=European_Unio&n&oldid=391894227>,2010.

[52] Wikipedia: European union, Timeback machine: visited, URL <http://en.wikipedia.org/w/index.php?title=European_Unio&n&oldid=1433926>,2003.

[53] Wikipedia: European union, Timeback machine: visited, URL <http://en.wikipedia.org/w/index.php?title=European_Unio&n&oldid=84415101>,2006