
Mining Complex Data

Proceedings of a Workshop held in Conjunction with 2005 IEEE International Conference on Data Mining

Houston, Texas, USA, November 27, 2005

Edited by

Zbigniew W. Ras
Department of Computer Science, University of North Carolina, Charlotte, USA

Shusaku Tsumoto
Shimane University School of Medicine, Japan

Djamel A. Zighed
ERIC Laboratory, University of Lyon 2, France

ISBN 0-9738918-8-2


Preface

Data mining and knowledge discovery, as stated in their early definitions, can today be considered stable fields, with numerous efficient methods and studies proposed for extracting knowledge from data. Nevertheless, the famous golden nugget remains elusive. The context has evolved since the first definition of the KDD process was given, and knowledge now has to be extracted from data that are becoming more and more complex.

In the framework of data mining, many software solutions have been developed for the extraction of knowledge from tabular data (typically obtained from relational databases). Methodological extensions were proposed to deal with data initially obtained from other sources, for example natural language (text mining) and images (image mining). KDD has thus evolved following a unimodal scheme instantiated according to the type of the underlying data (tabular data, text, images, etc.), which in the end always leads to working on the classical double-entry tabular format.

However, in a large number of application domains, this unimodal approach appears too restrictive. Consider for instance a corpus of medical files. Each file can contain tabular data such as the results of biological analyses, textual data coming from clinical reports, and image data such as radiographies, echograms, or electrocardiograms. In a decision-making framework, treating each type of information separately has serious drawbacks. It therefore appears more and more necessary to consider these different data simultaneously, thereby encompassing all their complexity.

Hence a natural question arises: how can one combine information of different natures and associate it with the same semantic unit, for instance the patient? On a methodological level, one may also wonder how to compare such complex units via similarity measures. The classical approach consists in aggregating partial dissimilarities computed on components of the same type. However, this approach tends to superpose layers of information: it considers the whole entity to be the sum of its components. By analogy with the analysis of complex systems, it appears that knowledge discovery in complex data cannot simply consist of the concatenation of the partial information obtained from each part of the object. The aim is rather to discover more "global" knowledge that gives a meaning to the components and associates them with the semantic unit. This fundamental information cannot be extracted by currently considered approaches and available tools.


The new data mining strategies shall take into account the specificities of complex objects (the units with which the complex data are associated). These specificities are summarized hereafter:

- The data associated with an object are of different types. Besides classical numerical, categorical or symbolic descriptors, text, image or audio/video data are often available.

- The data come from different sources. As shown in the context of medical files, the collected data can come from surveys filled in by doctors, textual reports, measures acquired from medical equipment, radiographies, echograms, etc.

- It often happens that the same object is described according to the same characteristics at different times or different places. For instance, a patient may often consult several doctors, each one of them producing specific information. These different data are associated with the same subject.

Intelligent data mining should also take into account external information, also called expert knowledge, which could be incorporated by means of ontologies. The association of different data sources at different moments multiplies the points of view and therefore the number of potential descriptors. The resulting high dimensionality is the cause of both algorithmic and methodological difficulties. The difficulty of knowledge discovery in complex data lies in all these specificities.

Many people approach the field of mining complex data from different and interesting angles. They come from various communities such as data mining, classification, knowledge discovery and engineering. We believe it is now time to establish and enhance communication between these communities. The aim of this workshop is to address issues related to the concepts of mining complex data.

Twenty-nine (29) papers from ten countries were submitted and reviewed. Fifteen were selected, eleven of them as long papers and the remaining as short papers. We would like to thank the ICDM conference chairs for their support and for giving us the opportunity to organize this workshop. Thanks to all PC members for their valuable work: Paula Brito (Portugal); Francisco Carvalho (Brazil); Odej Kao (Germany); Donato Malerba (Italy); Rokia Missaoui (Canada); Dunja Mladenic (Slovenia); Zbigniew Ras (USA); Jan Rauch (Czech Republic); Lorenza Saitta (Italy); Einoshin Suzuki (Japan); Brigitte Trousse (France); Shusaku Tsumoto (Japan); Rosana Verde (Italy); Osmar Zaiane (Canada); Ning Zhong (Japan); Djamel A. Zighed (France). Thanks to Omar Boussaid (France) and Florent Masseglia (France) for their help in the organisation. Finally, we wish to express our special thanks to Hakim Hacid (France) for his great contribution.

Zbigniew W. Ras, Department of Computer Science, University of North Carolina, Charlotte ([email protected])
Shusaku Tsumoto, Shimane University School of Medicine ([email protected])
Djamel A. Zighed, University of Lyon 2 ([email protected])


Table of Contents

Hierarchical Density-Based Clustering for Multi-Represented Objects.....................................9 Elke Achtert, Hans-Peter Kriegel, Alexey Pryakhin, Matthias Schubert

Mining Multiple Video Clips of a Soccer Game ......................................................................17 Kenji Aoki, Hiroyuki Mano, Yukihiro Nakamura, Shin Ando, Einoshin Suzuki

Improving the Knowledge Discovery Process Using Ontologies.............................................25 Laurent Brisson, Martine Collard, Nicolas Pasquier

Pre-Processing and Clustering Complex Data in E-Commerce Domain..................................33 Sergiu Chelcea, Alzennyr Da Silva, Yves Lechevallier, Doru Tanasa, Brigitte Trousse

Semi-Supervised Clustering with Partial Background Information .........................................41 Jing Gao, Haibin Cheng, Pang-Ning Tan

Neighborhood Graphs for Image Databases Indexing and Content-Based Retrieval...............45 Hakim Hacid, Djamel Zighed

Mining Data Streams for Frequent Sequences Extraction........................................................53 Alice Marascu, Florent Masseglia

CaptionSearch: Finding Images in Publications .......................................................................61 Brigitte Mathiak, Andreas Kupfer, Richard Muench, Claudia Taeubner, Silke Eckstein

Decision Tree Construction by Chunkingless Graph-Based Induction for Graph-Structured Data ...................................................................................................................65 Phu Chien Nguyen, Kouzou Ohara, Akira Mogi, Hiroshi Motoda, Takashi Washio

Unusual Condition Detection of Bearing Vibration in Hydroelectric Power Plants ................73 Takashi Onoda, Norihiko Ito, Kenji Shimizu

Automatic juridical texts classification and relevance feedback ..............................................81 Vincent Pisetta, Hakim Hacid, Djamel Zighed

Mining E-Action Rules .............................................................................................................85 Zbigniew W. Ras, Li-Shiang Tsay, Agnieszka Dardzinska

Mining Statistical Association Rules to Select the Most Relevant Medical Image Features ...............................................................................................................................91 Marcela Ribeiro, Andre Balan, Joaquim Felipe, Agma Traina, Caetano Traina

Mining Multidimensional Sequential Rules: A Characterization Approach .............................99 Lionel Savary, Karine Zeitouni

MB3-Miner: efficiently mining eMBedded subTREEs using Tree Model Guided candidate generation .........................................................................................................103 Henry Tan, Tharam Dillon, Fedja Hadzic, Ling Feng, Elizabeth Chang


Author Index ...........................................................................................................................111


Hierarchical Density-Based Clustering for Multi-Represented Objects

Elke Achtert, Hans-Peter Kriegel, Alexey Pryakhin, Matthias Schubert

Institute for Computer Science, University of Munich
achtert, kriegel, pryakhin, [email protected]

Abstract

In recent years, the complexity of data objects in data mining applications has increased as well as their plain numbers. As a result, there exist various feature transformations and thus multiple object representations. For example, an image can be described by a text annotation, a color histogram and some texture features. To cluster these multi-represented objects, dedicated data mining algorithms have been shown to achieve improved results. In this paper, we will therefore introduce a method for hierarchical density-based clustering of multi-represented objects which is insensitive w.r.t. the choice of parameters. Furthermore, we will introduce a theoretical model that allows us to draw conclusions about the interaction of representations. Additionally, we will show how these conclusions can be used for defining a suitable combination method for multiple representations. To back up the usability of our proposed method, we present encouraging results for clustering a real world image data set that is described by 4 different representations.

1. Introduction

In modern data mining applications, the data objects are getting more and more complex. Thus, the extraction of meaningful feature representations yields a variety of different views on the same set of data objects. Each of these views or representations might focus on a different aspect and may offer another notion of similarity. However, in almost any application there is no universal feature representation that can be used to express similarity between all possible objects in a meaningful way. Thus, recent data mining approaches employ multiple representations to achieve more general results that are based on a variety of aspects. An example application for multi-represented objects is data mining in protein data. A protein can be described by multiple feature transformations based upon its amino acid sequence, its secondary or its three-dimensional structure. Another example is data mining in image data, which might be represented by texture features, color histograms or text annotations.

Mining multi-represented objects yields advantages because more information can be incorporated into the mining process. On the other hand, the additional information has to be used carefully since too much information might distort the derived patterns. Basically, we can distinguish two problems when clustering multi-represented objects: comparability and semantics.

The comparability problem subsumes several effects when comparing features, distances or statements from different representations. For example, a distance value of 1,000 might indicate similarity in some feature space and a distance of 0.5 might indicate dissimilarity in another space. Thus, directly comparing the distances is not advisable.

In contrast to the comparability problem, the semantics problem is caused by differences between the knowledge that can be derived from each representation. For example, two images described by very similar text annotations are very likely to be very similar as well. On the other hand, if the words describing two images are completely disjoint, the implication that both images are dissimilar is rather weak, because it is possible to describe the same object using a completely different set of words. Another type of semantics can be found in color histograms. An image of a plane in blue skies might provide the same color distribution as a sailing boat in the water. However, if two color images have completely different colors, it is usually a strong hint that the images are really dissimilar.

To cluster multi-represented objects with respect to both problems, [1] described a multi-represented version of the density-based clustering algorithm DBSCAN. However, this approach is still very sensitive with respect to its parametrization. Therefore, we adapt the density-based, hierarchical clustering algorithm OPTICS to the setting of multi-represented objects. This new version of OPTICS is far less sensitive to parameter selection.

Z.W. Ras, S. Tsumoto and D.A. Zighed (Eds): IEEE MCD05, published by Department of Mathematics and Computing Science, Saint Mary's University, Pages 9-16, ISBN 0-9738918-8-2, Nov. 2005


Another problem of density-based multi-represented clustering is the handling of the semantics problem for multiple representations. In this paper, we will give a theoretical discussion of possible semantics and demonstrate how to combine the already proposed basic methods for more than two representations. Furthermore, we will explain the need for domain knowledge for judging each representation. To demonstrate the applicability of our approach, we will show the results of several experiments on an image data set that is given by 4 different representations.

The rest of this paper is organized as follows. Section 2 surveys related work in the areas of density-based clustering and multi-represented or multi-view clustering. In Section 3 the extension of OPTICS to multi-represented data is described. Section 4 starts with a formal discussion of the semantics problem and derives a theory for constructing meaningful distance functions for multi-represented objects. In our experimental evaluation in Section 5, it is shown how the cluster quality can be improved using the proposed method. Finally, we conclude the paper in Section 6 with a short summary and some ideas for future work.

2. Related Work

DBSCAN [2] is a density-based clustering algorithm where clusters are considered as dense areas that are separated by sparse areas. Based on two input parameters (ε ∈ R and k ∈ N), DBSCAN defines dense regions by means of core objects. An object o ∈ DB is called a core object if its ε-neighborhood contains at least k objects. DBSCAN is able to detect arbitrarily shaped clusters by one single pass over the data. To do so, DBSCAN uses the fact that a cluster can be detected by finding one of its core objects o and computing all objects which are "density-reachable" from o. OPTICS [3] extends the density-connected clustering notion of DBSCAN by hierarchical concepts. In contrast to DBSCAN, OPTICS does not assign cluster memberships but computes a cluster order in which the objects are processed and additionally generates the information which would be used by an extended DBSCAN algorithm to assign cluster memberships. This information consists of only two values for each object, the core distance and the reachability distance. If the ε-neighborhood of an object o contains at least k objects, the core distance of o is defined as the k-nearest neighbor distance of o. Otherwise, the core distance is undefined. The reachability distance of an object p from o is an asymmetric distance measure that is defined as the maximum value of the core distance of o and the distance between p and o. Using these distances, OPTICS computes a "walk" through the data set and assigns to each object o its core distance and the smallest reachability distance w.r.t. all objects considered before o in the walk. In each step, OPTICS selects the object o having the minimum reachability distance to any already processed object. A special order of the database according to its density-based clustering structure is generated, the so-called cluster order, which can be displayed in a reachability plot. A reachability plot consists of the reachability distances on the y-axis of all objects plotted according to the cluster order on the x-axis. The "valleys" in the plot represent the clusters, since objects within a cluster have lower reachability distances than objects outside a cluster.
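For readers who want to see these definitions operationally, the following is a minimal, illustrative sketch of the standard single-representation OPTICS quantities described above. It loosely follows [3]; the data structures and parameter handling are assumptions, not code from this paper.

```python
import heapq

def core_distance(db, dist, o, eps, k):
    """Core distance of o: its k-nearest-neighbor distance if the
    eps-neighborhood of o contains at least k objects, else infinity."""
    neighbor_dists = sorted(dist(o, p) for p in db if p is not o)
    if len([d for d in neighbor_dists if d <= eps]) < k:
        return float("inf")
    return neighbor_dists[k - 1]

def optics_order(db, dist, eps, k):
    """Compute the OPTICS cluster order: each step expands the object with
    the smallest reachability distance w.r.t. all already processed objects."""
    reach = {o: float("inf") for o in db}
    processed, order = set(), []
    for start in db:
        if start in processed:
            continue
        heap = [(reach[start], id(start), start)]
        while heap:
            r, _, o = heapq.heappop(heap)
            if o in processed:
                continue
            processed.add(o)
            order.append((o, r))                       # r is plotted on the y-axis
            core = core_distance(db, dist, o, eps, k)
            if core == float("inf"):
                continue
            for p in db:
                if p in processed:
                    continue
                new_r = max(core, dist(o, p))          # reachability of p from o
                if new_r < reach[p]:
                    reach[p] = new_r
                    heapq.heappush(heap, (new_r, id(p), p))
    return order
```

Plotting the second component of each entry of the returned order yields the reachability plot whose "valleys" correspond to clusters.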

In [1] the idea of DBSCAN has been adapted to multi-represented objects. Two different methods have been proposed to decide whether a multi-represented object is a core object: the union and the intersection method. The union method assumes an object to be a core object if at least k objects are found within the union of its local ε-neighborhoods of each representation. The intersection method requires that at least k objects are within the intersection of all local ε-neighborhoods of each representation of a core object. In [4] an algorithm for spectral clustering of multi-represented objects is proposed. The author proposes to calculate the clustering in a way that the disagreement between the cluster models in each representation is minimized. In [5] a version of Expectation Maximization (EM) clustering was introduced. Additionally, the authors proposed a multi-view version of agglomerative clustering. However, this second approach did not display any benefit over clustering single representations. [6] introduces the framework of reinforcement clustering, which is applicable to multi-represented objects. All of these approaches do not consider any semantic aspect of the underlying data spaces. The proposed approaches result in partitioning clusterings of the data spaces, which makes the maximization of the agreement between local models a beneficial goal to optimize. However, in a density-based setting, there is an arbitrary number of clusters and no explicit clustering models that can be optimized to agree with each other.

3. Hierarchical Clustering of Multi-Represented Objects

3.1. Normalization

In order to obtain the comparability of distances derived from different feature spaces, we perform a normalization of the distances for each representation.


Let D be a set of n objects and let R := {R1, . . . , Rm} be a set of m different representations existing for the objects in D.

The normalized distance between two objects o, q ∈ D w.r.t. Ri is denoted by di(o, q) and can be calculated by applying one of the following normalization methods:

• Mean normalization: normalize the distance with regard to the mean value µi^orig of the original distance di^orig in representation Ri. The mean value can be calculated by sampling a small set of objects from the current representation Ri:

di(o, q) = di^orig(o, q) / µi^orig

• Range normalization: normalize the distance into the range [0, 1]:

di(o, q) = (di^orig(o, q) − min_{r,s∈D} di^orig(r, s)) / (max_{r,s∈D} di^orig(r, s) − min_{r,s∈D} di^orig(r, s))

• Studentize: normalize the distance around the mean µi^orig and standard deviation σi^orig of the distances in representation Ri:

di(o, q) = (di^orig(o, q) − µi^orig) / σi^orig

Since the normalization factors can be calculated on a database sample, we employed the first method (mean normalization) in our experiments.
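As a minimal illustration of these three normalization schemes (a sketch, not the authors' code; the sampling strategy and all names are assumptions), the distances of one representation could be rescaled as follows:

```python
import random

def normalization_factors(objects, dist, sample_size=100):
    """Estimate mean, min, max and standard deviation of a distance
    function from a random sample of object pairs."""
    pairs = [(random.choice(objects), random.choice(objects))
             for _ in range(sample_size)]
    values = [dist(o, q) for o, q in pairs]
    mean = sum(values) / len(values)
    std = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5
    return {"mean": mean, "min": min(values), "max": max(values), "std": std}

def normalize(d, f, method="mean"):
    """Normalize a raw distance d using precomputed factors f."""
    if method == "mean":        # d / mean
        return d / f["mean"]
    if method == "range":       # map into [0, 1]
        return (d - f["min"]) / (f["max"] - f["min"])
    if method == "studentize":  # (d - mean) / std
        return (d - f["mean"]) / f["std"]
    raise ValueError(method)
```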

3.2. Multi-represented OPTICS

The algorithm OPTICS [3] works like an extended DBSCAN algorithm, computing the density-connected clusters w.r.t. all parameters εi that are smaller than a generic value of ε. Since we handle multi-represented objects, we have not only one ε-neighborhood of an object o but several ε-neighborhoods, one for each representation Ri. In the following, we will call the ε-neighborhood of an object o in representation Ri its local ε-neighborhood w.r.t. Ri.

Definition 1 (local ε-neighborhood w.r.t. Ri)
Let o ∈ D, ε ∈ IR+, Ri ∈ R, and di the distance function of Ri. The local ε-neighborhood w.r.t. Ri of o, denoted by N_ε^Ri(o), is defined as the set of objects around o with distances in representation Ri less than or equal to ε, formally

N_ε^Ri(o) = {x ∈ D | di(o, x) ≤ ε}.

In contrast to DBSCAN, OPTICS does not assign cluster memberships, but stores the order in which the objects have been processed and the information which would be used by an extended DBSCAN algorithm to assign cluster memberships. This information consists of two values for each object, its core distance and its reachability distance. To compute this information during a run of the OPTICS algorithm on multi-represented objects, we must adapt the core distance and reachability distance predicates of OPTICS to our multi-represented approach. In the next two subsections, we will show how we can use the concepts of union and intersection of the local ε-neighborhoods of each representation to handle multi-represented objects.

Union of different representations. The union method characterizes an object o to be density-reachable from another object p if o is density-reachable from p in at least one of the representations Ri. If an object is placed in a dense region of at least one representation, it is already union density-reachable, regardless of in how many other representations it is density-reachable. In the following, we adapt some of the definitions of OPTICS in order to apply OPTICS on the union of different representations.

In the union approach, the (global) union ε-neighborhood of an object o ∈ D is defined by the union of all of its local ε-neighborhoods N_ε^Ri(o) in each representation Ri.

Definition 2 (union ε-neighborhood)
Let ε ∈ IR+ and o ∈ D. The union ε-neighborhood of o, denoted by N_ε^∪(o), is defined by

N_ε^∪(o) = ⋃_{Ri ∈ R} N_ε^Ri(o).

Since the core distance predicate of OPTICS is based on the concept of k-nearest neighbor distances, we have to redefine the k-nearest neighbor distance of an object o in the union approach. Assume that all objects p ∈ D are ranked according to their minimum distance dmin(o, p) = min_{i=1...m} di(o, p) in each representation Ri to o. Then, the union k-nearest neighbor distance of o ∈ D, short nn-dist_k^∪(o), is the distance dmin to its k-nearest neighbor in this ranking. The union k-nearest neighbor distance is formally defined as follows:

Definition 3 (union k-NN distance)
Let k ∈ IN, |D| ≥ k and o ∈ D. The union k-nearest neighbors of o is the smallest set NN_k^∪(o) ⊆ D that contains (at least) k objects and for which the following condition holds:

∀p ∈ NN_k^∪(o), ∀q ∈ D − NN_k^∪(o) : min_{i=1...m} di(p, o) < min_{i=1...m} di(q, o).

The union k-nearest neighbor distance of o, denoted by nn-dist_k^∪(o), is defined as follows:

nn-dist_k^∪(o) = max { min_{i=1...m} di(o, q) | q ∈ NN_k^∪(o) }.

Now, we can adapt the core distance definition from OPTICS to our union approach: if the union ε-neighborhood of an object o contains at least k objects, the union core distance of o is defined as the union k-nearest neighbor distance of o. Otherwise, the union core distance is undefined.

Definition 4 (union core distance)
The union core distance of an object o ∈ D w.r.t. ε ∈ IR+ and k ∈ IN is defined as

Core_{ε,k}^∪(o) = nn-dist_k^∪(o) if |N_ε^∪(o)| ≥ k, and ∞ otherwise.

The union reachability distance of an object p ∈ D from o ∈ D is an asymmetric distance measure that is defined as the maximum value of the union core distance of o and the minimum distance in each representation Ri between p and o.

Definition 5 (union reachability distance)
The union reachability distance of an object o ∈ D relative to another object p ∈ D w.r.t. ε ∈ IR+ and k ∈ IN is defined as

Reach_{ε,k}^∪(p, o) = max { Core_{ε,k}^∪(p), min_{i=1...m} di(o, p) }.

Intersection of different representations. The intersection method denotes an object o to be density-reachable from another object p if o is density-reachable from p in all representations. The intersection ε-neighborhood N_ε^∩(o) of an object o ∈ D is defined by the intersection of all of its local ε-neighborhoods N_ε^Ri(o) in each representation Ri.

Definition 6 (intersection ε-neighborhood)
Let ε ∈ IR+ and o ∈ D. The intersection ε-neighborhood of o, denoted by N_ε^∩(o), is defined by

N_ε^∩(o) = ⋂_{Ri ∈ R} N_ε^Ri(o).

Analogously to the union method, we define the intersection k-nearest neighbor distance of an object o. In the intersection approach, all objects p ∈ D are ranked according to their maximum distance dmax(o, p) = max_{i=1...m} di(o, p) in each representation Ri to o. Then, the intersection k-nearest neighbor distance of o ∈ D, short nn-dist_k^∩(o), is the distance dmax to its k-nearest neighbor in this ranking.

Definition 7 (intersection k-NN distance)
Let k ∈ IN, |D| ≥ k and o ∈ D. The intersection k-nearest neighbors of o is the smallest set NN_k^∩(o) ⊆ D that contains (at least) k objects and for which the following condition holds:

∀p ∈ NN_k^∩(o), ∀q ∈ D − NN_k^∩(o) : max_{i=1...m} di(p, o) < max_{i=1...m} di(q, o).

The intersection k-nearest neighbor distance of o, denoted by nn-dist_k^∩(o), is defined as follows:

nn-dist_k^∩(o) = max { max_{i=1...m} di(o, q) | q ∈ NN_k^∩(o) }.

In the following, we define the intersection core distance and the intersection reachability distance analogously to the union method. If the intersection ε-neighborhood of an object o contains at least k objects, the intersection core distance of o is defined as the intersection k-nearest neighbor distance of o. Otherwise, the intersection core distance is undefined.

Definition 8 (intersection core distance)
The intersection core distance of an object o ∈ D w.r.t. ε ∈ IR+ and k ∈ IN is defined as

Core_{ε,k}^∩(o) = nn-dist_k^∩(o) if |N_ε^∩(o)| ≥ k, and ∞ otherwise.

The intersection reachability distance of an object p ∈ D from o ∈ D is an asymmetric distance measure that is defined as the maximum value of the intersection core distance of o and the maximum distance in each representation Ri between p and o.

Definition 9 (intersection reachability distance)
The intersection reachability distance of an object o ∈ D relative to another object p ∈ D w.r.t. ε ∈ IR+ and k ∈ IN is defined as

Reach_{ε,k}^∩(p, o) = max { Core_{ε,k}^∩(p), max_{i=1...m} di(o, p) }.

By first normalizing the distances within the representations, we are now able to use OPTICS applying either the union or the intersection method. In the next section, we will discuss the proper choice of one of these basic methods and introduce a heuristic for a meaningful combination of both methods for multiple representations.
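The union and intersection variants differ only in how the per-representation distances are aggregated (minimum versus maximum). The following is a minimal sketch of the resulting core and reachability distances, assuming a list of already normalized per-representation distance functions; it is an illustration, not the authors' implementation, and the handling of the object itself in the neighborhood count is simplified.

```python
def combined_dist(dists, o, p, mode):
    """Aggregate the normalized per-representation distances of one pair:
    minimum for the union method, maximum for the intersection method."""
    values = [d(o, p) for d in dists]
    return min(values) if mode == "union" else max(values)

def mr_core_distance(db, dists, o, eps, k, mode):
    """Union/intersection core distance of o (Definitions 4 and 8): the k-th
    smallest aggregated distance, or infinity if fewer than k objects lie in
    the combined eps-neighborhood of o."""
    ranked = sorted(combined_dist(dists, o, p, mode) for p in db if p is not o)
    if len([v for v in ranked if v <= eps]) < k:
        return float("inf")
    return ranked[k - 1]

def mr_reachability_distance(db, dists, p, o, eps, k, mode):
    """Union/intersection reachability distance of o from p (Definitions 5 and 9)."""
    return max(mr_core_distance(db, dists, p, eps, k, mode),
               combined_dist(dists, o, p, mode))
```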

4. Handling Semantics

In this section, we will discuss the handling of semantics. Therefore, we will first of all introduce a model that will help us to understand the interaction between different representations.


4.1. A Model for Local Semantics

Since feature spaces are usually not a perfect model of the intuitive notion of similarity, a small distance in the feature space does not always indicate true object similarity. Therefore, we denote two objects that a human user would classify as similar as truly similar. To formalize the semantics problem, we first of all distinguish two characteristics of representation spaces:

Definition 10 (Precision Space)
A precision space is a data space Ri where for each data object o there exists a σ-neighborhood N_σ^Ri(o) in which the percentage of data objects in N_σ^Ri(o) that are considered to be truly similar exceeds a given value π. Formally, a precision space Ri is defined as:

∃σ ∈ IR+, ∀o ∈ D : |N_σ^Ri(o) ∩ sim(o)| / |N_σ^Ri(o)| ≥ π

where sim(o) denotes all truly similar objects in D for object o.

Definition 11 (Recall Space)
A recall space is a data space Ri where for each data object o there exists a σ-neighborhood N_σ^Ri(o) in which the percentage of all truly similar data objects that are contained in N_σ^Ri(o) exceeds a given value ρ. Formally, a recall space Ri is defined as:

∃σ ∈ IR+, ∀o ∈ D : |N_σ^Ri(o) ∩ sim(o)| / |sim(o)| ≥ ρ

where sim(o) denotes all truly similar objects in D for object o.

A precision space (recall space, respectively) is called optimal iff there exists a σ for which π = 1 (ρ = 1, respectively). Let us note that these definitions treat similarity as a boolean function instead of using continuous similarity. However, density-based clustering employs ε-neighborhoods N_ε^D(o) for finding dense regions, and within these dense regions the objects should be similar to o. Thus, boolean similarity should be sufficient for discussing density-based clustering. Figure 1 displays a maximal σp-neighborhood for object o if Ri were an optimal precision space. Additionally, the figure displays the minimum σr-neighborhood of o if Ri is an optimal recall space as well. Note that the σp-neighborhood is a subset of the σr-neighborhood in all optimal precision and recall spaces.
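Although π and ρ cannot be determined without knowledge about true similarity, they can be estimated on a labeled sample. The following sketch illustrates such a check; the similarity oracle and the sampling are assumptions and are not part of the paper.

```python
def precision_recall_at_sigma(sample, dist, truly_similar, sigma):
    """Estimate the precision-space value pi and recall-space value rho of one
    representation at radius sigma, over a labeled sample of objects.
    `truly_similar(o)` returns the set of objects truly similar to o."""
    pis, rhos = [], []
    for o in sample:
        neighborhood = {x for x in sample if x is not o and dist(o, x) <= sigma}
        sim = truly_similar(o) & set(sample)
        if neighborhood:
            pis.append(len(neighborhood & sim) / len(neighborhood))
        if sim:
            rhos.append(len(neighborhood & sim) / len(sim))
    # Since the definitions quantify over all objects, take the minimum:
    # the representation behaves as a precision (recall) space at this sigma
    # if the returned value exceeds the chosen threshold pi (rho).
    return min(pis) if pis else 0.0, min(rhos) if rhos else 0.0
```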

Though it is possible that a representation space is both a good precision and a good recall space, most real-world feature spaces are usually suited to fulfill only one of these conditions. An example of a precision space is text vectors. Since we can assume that two very similar text annotations indicate that the described data objects are very similar as well, text annotations usually provide a good precision space. However, descriptions of two very similar objects do not have to use the same words. An object representation that is often well-suited for providing a good recall space is color histograms. If the color histograms of two color images are quite different from each other, the images are unlikely to display the same object. On the other hand, two images having similar color histograms are not necessarily displaying the same motif.

[Figure 1. Maximal σp-neighborhood and minimum σr-neighborhood of an optimal precision and recall space. Legend: + truly similar objects, − truly dissimilar objects.]

When combining optimal precision and recall spaces for density-based clustering, our goal is to find maximum density-connected clusters where each object has only truly similar objects in its global neighborhood. In general, we can derive the following useful observations:

1. A data space that is both an optimal precision and an optimal recall space for the same value of σ is already optimal w.r.t. our goal and thus does not need to be combined with any other representation.

2. A set of optimal precision spaces should always be combined by the union method because the union method improves the recall but not the precision. If there is at least one representation for any similar object in which the object is placed in the σ-neighborhood, the resulting combination is optimal w.r.t. recall.

3. A set of recall spaces should always be combined by the intersection method because the intersection method improves the precision but not the recall. If there exists no dissimilar data object that is part of the σ-neighborhoods in all representations, the resulting intersection is optimal w.r.t. precision.

4. Combining an optimal recall space with an optimal precision space with either the union or the intersection method does not make any sense. For this combination the objects in the σ-neighborhood of the precision space are always a subset of the objects in the σ-neighborhood of the recall space. As a result, applying the union method is equivalent to only using the recall space, and applying the intersection method is equivalent to only using the precision space.

The observations that are made in this model are not directly applicable to the combination of representations using the multi-represented version of OPTICS described in the previous section. To apply the conclusions of our model to OPTICS, we have to fulfill two requirements. The normalization has to be done in a proper way, meaning that the normalization factors should have the same ratio as the σ values in each representation. The second requirement is that ε > σi in each representation Ri ∈ R. If both requirements are satisfied, it is guaranteed that there is some level in the OPTICS plot representing the σ-values guaranteeing optimality.

Another aspect is that the derived statements only hold for optimal precision and recall spaces. Since a representation is always both a precision space and a recall space to some degree, the observations generally do not hold for the non-optimal case. For example, it might make sense to combine a very good recall space with a very good precision space if the recall space has a good quality as a precision space as well at some other σ level. However, the implications for the general case are strong enough to derive useful heuristics.

A final problem for applying our model is the fact that it is not possible to determine π and ρ values for the given representations without additional information about true similarity. Thus, we have to employ domain knowledge when deriving some heuristics for building a well-suited combination of representations for clustering.

4.2. Combining Multiple Representations

Though we might not be able to exactly determine the parametrization for which a representation fulfills the precision and recall space conditions in the best possible way, we can still reason about the suitability of a representation for each of the two conditions. As in our running example of text vectors and color histograms, we can derive the suitability of a representation for being a good precision or a good recall space by analyzing the underlying feature transformations. If it is likely that two dissimilar objects have very similar feature representations, the data space still might provide a useful recall space. If it is possible that very similar objects are mapped to some far away feature representations, the data space might still provide a useful precision space.

[Figure 2. Combination tree of the image data set.]

The most important implication of our model is that combining a good precision space (recall space, respectively) with a rather bad precision space (recall space, respectively) will not increase the overall quality of clustering. Considering only two representations, there are only three options: use the union method for two precision spaces, the intersection method for two recall spaces, or cluster only the more reliable representation in case of a mixture.

For more than two representations, the combination of precision and recall spaces still can make sense. The idea is to combine these representations on different levels. Since the intersection method increases the precision and the union method increases the recall, we are able to construct recall spaces or precision spaces from a subset of the representations. To formalize this method, we will now define the so-called combination tree:

Definition 12 (Combination Tree)
A combination tree is a tree of arbitrary degree fulfilling the following conditions:

• The leaves are labeled with representations.

• The inner nodes are labeled with either the union or the intersection operator.

A good combination according to our heuristics can be described by a combination tree where the children of each intersection node are all reasonable recall spaces and the children of each union node are all reasonable precision spaces. After we have derived the combination tree, we can modify the core distance and reachability distance of OPTICS in a top-down order to implement the semantics described by the combination tree.
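As an illustration of how such a tree could be evaluated over the local ε-neighborhoods of one object (a schematic sketch with assumed data structures, not the authors' implementation):

```python
class Leaf:
    """A leaf holds the index of one representation."""
    def __init__(self, rep_index):
        self.rep_index = rep_index

    def neighborhood(self, local_neighborhoods):
        return set(local_neighborhoods[self.rep_index])

class Node:
    """An inner node combines its children by union or intersection."""
    def __init__(self, op, children):
        self.op = op              # "union" or "intersection"
        self.children = children

    def neighborhood(self, local_neighborhoods):
        sets = [c.neighborhood(local_neighborhoods) for c in self.children]
        if self.op == "union":
            return set().union(*sets)
        return set.intersection(*sets)

# Tree mirroring the combination used in the experiments of Section 5:
# the three image representations are intersected, the result is united
# with the text representation.
tree = Node("union", [Node("intersection", [Leaf(0), Leaf(1), Leaf(2)]), Leaf(3)])
```

Here local_neighborhoods is the list of per-representation ε-neighborhood sets of one object; calling tree.neighborhood(local_neighborhoods) yields the combined neighborhood that the adapted core and reachability distances would be based on.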


[Figure 3. OPTICS reachability plots for the image data set: (a) using only color histograms, with a representative sample set for one of the clusters; (b) using the intersection of color histograms and both texture representations, where the displayed cluster shows promising precision; (c) using only text annotations, where the displayed cluster has high precision but is incomplete; (d) using the combination of all representations, where the precise cluster observed in the text representation is completed with similar images.]

Figure 2 displays the combination tree of the image data set we used in our experiments. R1, R2 and R3 represent the content-based feature representations expressing texture features and color distributions. In each of these representations, a small distance between the feature vectors does not necessarily indicate that the underlying images are truly similar. Therefore, we use all of these 3 representations as recall spaces. R4 consists of text annotations. As mentioned before, text annotations usually provide good precision spaces but not necessarily good recall spaces. Thus, we use the text annotations as a precision space. The combination of the representations is now done in the following way. We first of all combine our recall spaces R1, R2 and R3 using the intersection method. Due to the increased precision resulting from applying the intersection method, the suitability of the result for being a precision space should be sufficient for applying the union method with the remaining text annotations R4.

5. Performance Evaluation

In order to show the capability of our method, we implemented the proposed clustering algorithm in Java 1.4. All experiments were processed on a workstation with a 2.4 GHz Pentium IV processor and 2 GB main memory. We used a data set containing 500 images manually annotated with a short text. From each image, we extracted 3 representations, namely a color histogram and two textural feature vectors. We used the HSV color space and calculated 32-dimensional color histograms based on 8 ranges of hue and 4 ranges of saturation. The textural features were generated from 16 gray-scale conversions of the images. We computed contrast and inverse difference moment using the co-occurrence matrix [7]. For comparing text annotations, we applied the cosine coefficient and used the Euclidean distance in the rest of the representations. Since OPTICS does not generate a partitioning clustering but only a cluster order, we did not apply a quantitative quality measure. To verify the results of the found clustering, we visually inspected the similarity of images in each cluster instead. To demonstrate the results of multi-represented OPTICS with the combination method described above, we ran OPTICS on each single representation. Additionally, we examined the clustering for the combination of color histograms and texture features using the intersection method as proposed in the combination tree. Finally, we ran OPTICS using the complete combination of image and text features. For all clusterings, we used k = 3 and ε = 10. Normalization was achieved using the average distances between two objects in the data set.

The result for the text annotations provided a very precise clustering. However, due to the fact that some annotations used different languages for describing the image, some of the clusters were incomplete. Figure 3(c) displays the result of clustering the text annotations; the observed cluster displays only similar objects. The cluster order derived for color histograms found some clusters. However, though the images within the clusters had similar colors, the objects were not necessarily similar. Figure 3(a) displays the cluster order using color histograms and an image cluster containing two similar groups of images and some noise. Let us note that the clustering of the two texture representations performed similarly; however, due to space limitations, we do not display the corresponding plots. In Figure 3(b) the clustering of all 3 image feature spaces using the intersection method is displayed. Though the number of clusters decreased, the quality of the remaining clusters increased considerably, as expected. The cluster shown in Figure 3(b) contained exclusively very similar images. Finally, Figure 3(d) displays the result on all representations. The cluster observed for text annotations displayed in Figure 3(c) was extended with additional similar images that are described in German instead of English. To conclude, examining the complete clustering, the overall quality of clustering was improved by using all 4 representations.

6. Conclusions

In this paper, we discussed the problem of hierarchical density-based clustering of multi-represented objects. A multi-represented object is described by a tuple of feature representations belonging to different feature spaces. We adapted the hierarchical clustering algorithm OPTICS to multi-represented objects by introducing the union (intersection) core distance and the union (intersection) reachability distance for multi-represented objects. Since the union and intersection methods might not be suitable for combining an arbitrarily large number of representations, we proposed a theoretical model distinguishing so-called precision and recall spaces. Based on these concepts, we observed that combining good precision spaces using the union method increases the completeness of clusters, while applying the intersection method to good recall spaces increases the pureness of clusters. Finally, we concluded that combining a good precision (recall) space with a bad one brings no benefit. To use these conclusions for combination problems with multiple representations, we introduced combination trees that describe valid combinations of precision and recall spaces. In our experimental evaluation, we described the improvement of clustering results for an image data set that is described by 4 representations.

For future work, we plan to find a reasonable way to quantify the usability of representations as precision or recall spaces. Additionally, we are currently working on a theory for describing optimal combination trees.

References

[1] Kailing, K., Kriegel, H.P., Pryakhin, A., Schubert, M.: Clustering Multi-Represented Objects with Noise. In: PAKDD (2004) 394–403

[2] Ester, M., Kriegel, H.P., Sander, J., Xu, X.: A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise. In: Proc. KDD'96, Portland, OR, AAAI Press (1996) 291–316

[3] Ankerst, M., Breunig, M.M., Kriegel, H.P., Sander, J.: OPTICS: Ordering Points to Identify the Clustering Structure. In: Proc. ACM SIGMOD Int. Conf. on Management of Data (SIGMOD'99), Philadelphia, PA (1999) 49–60

[4] de Sa, V.R.: Spectral Clustering with Two Views. In: Proceedings of the ICML 2005 Workshop on Learning With Multiple Views, Bonn, Germany (2005) 20–27

[5] Bickel, S., Scheffer, T.: Multi-View Clustering. In: ICDM (2004) 19–26

[6] Wang, J., Zeng, H., Chen, Z., Lu, H., Tao, L., Ma, W.: ReCoM: Reinforcement Clustering of Multi-Type Interrelated Data Objects. In: Proc. SIGIR 2003, July 28 – August 1, Toronto, CA, ACM (2003) 274–281

[7] Haralick, R.M., Shanmugam, K., Dinstein, I.: Textural Features for Image Classification. IEEE Transactions on Systems, Man, and Cybernetics 3 (1973) 610–621


Mining Multiple Video Clips of a Soccer Game

Kenji Aoki, Hiroyuki Mano, Yukihiro Nakamura, Shin Ando, and Einoshin Suzuki

Division of Electrical and Computer Engineering, Yokohama National University, 79-5, Tokiwadai, Hodogaya, Yokohama, 240-8501, Japan
Telephone: +81-45-339-4148, Fax: +81-45-339-4148

Abstract— In this paper, we propose an approach for learning decision trees for good/bad plays and defensive/attack roles from video clips of a soccer game filmed with seven fixed video camera recorders. Our approach transforms the video clips into a sequence of coordinates of players and the ball, obtains attribute-value data from the sequence, and learns a decision tree from the data. We perform two studies in which we investigate the effectiveness of high-level attributes and the generation of many attributes. Experiments using a high-school soccer game show promising results. We consider that our endeavor represents a step toward a novel research domain of sport informatics, where mining complex data, especially multiple video clips, plays an important role.

I. INTRODUCTION

Data mining, which has originated from the analysis of simple data sets typically in the form of a table with propositional attributes, has steadily increased the types of the data sets to be analyzed. Among such data sets, we believe that a video clip represents an important type because we humans rely on visual information considerably. With the rapid price reduction of video camera recorders (VCRs), a scene filmed with multiple VCRs from different angles in multiple video clips is becoming common. We think that mining such multiple video clips represents a challenge for researchers interested in mining complex data.

Soccer is said to be the world's most popular team sport. For instance, the World Cup is known to be the biggest sport event, as the number of TV viewers in the world is larger than that for the Olympic Games. Soccer can be considered as an application domain which boosts machine learning research and has recently attracted researchers in various fields related to machine learning [1], [2], [3], [4], [5], [6], [7].

Most of these pieces of work [2], [3], [4], [5], [6], [7] mainly try to track players and the ball from a single or multiple video clips and do not consider mining complex data¹. An exception is [1], which tries to discover interesting passes using hierarchical clustering. However, the input data represents a sequence of coordinates and player identifiers and thus is relatively simple.

The objective of this paper is to mine complex data of multiple video clips of a soccer game and obtain complex concepts specific to soccer. In our approach, we transform video clips of a soccer game into time-aligned positions of players and the ball, then learn a classifier from such data. The two concepts represent good/bad plays and defensive/attack roles, which are learnt with C4.5 [8].

¹ Among them, [6], [7] recognize events such as a play break, a pass, and a goal, but questions remain whether such tasks should be called data mining or image understanding.

II. TRANSFORMING VIDEO CLIPS TO SPATIO-TEMPORAL DATA

A. Flow of the Proposed Method

We generate a sequence of coordinate data from video clips of a soccer game as shown in Figure 1. A game is filmed with seven VCRs, each of which is fixed. A video clip of 1 second is transformed into 25 frames, each of which represents a bitmap image of size 350 × 240 pixels. In other words, the input to our method is seven video clips which are first transformed into seven sequences S1, S2, . . . , S7 of bitmap images, where Si = (si1, si2, . . . , siT), sit is a bitmap image taken by VCR i at time t, and the time tick is from 1 to T.

In our spatio-temporal data, a player is represented as a point on the 2-dimensional field and so is the ball. If a player occupies multiple pixels in an image, his/her (x, y) coordinates represent the horizontal average and the bottom values, respectively. Each player is allocated an identifier, from which his/her team can also be identified. This representation neglects various information including the posture of a player but still contains information essential to a soccer game such as trajectories and space². The output of our method is a sequence (c1, c2, . . . , cT) of positions of players and the ball, where ct = (α1t, α2t, . . . , α11t, β1t, β2t, . . . , β11t, γt), αit and βit represent the (x, y) coordinates of player i at time t in team A and B, respectively, and γt represents the (x, y) coordinates of the ball at time t.
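One possible in-memory layout for one element c_t of this sequence (a hypothetical sketch; the type and field names are not from the paper) is:

```python
from dataclasses import dataclass
from typing import List, Tuple

Point = Tuple[float, float]  # (x, y) position on the 2-dimensional field

@dataclass
class FrameCoordinates:
    """One element c_t of the output sequence: 11 players per team plus the ball."""
    t: int                 # time tick, 1 <= t <= T
    team_a: List[Point]    # alpha_1t ... alpha_11t
    team_b: List[Point]    # beta_1t ... beta_11t
    ball: Point            # gamma_t
```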

B. Processing Images

For each of the 7 video clips, we extract the field, recognize objects, and identify teams and the ball. Extracting the field can be done manually because it suffices to work on one image for each VCR, which is fixed. The original image sit is in RGB, each element of which is represented with 8 bits, and we transform sit into an HSV image s′it, which we call a frame³.

² One possible extension is to employ a 3-dimensional point for the ball to represent an air ball, but we leave this for future work.

Z.W. Ras, S. Tsumoto and D.A. Zighed (Eds): IEEE MCD05, published by Department of Mathematics and Computing Science, Saint Mary's University, Pages 17-24, ISBN 0-9738918-8-2, Nov. 2005


[Fig. 1. Positions of VCRs (top left), 7 images as input (right), coordinates as output (bottom left).]

Though we are aware that there exist advanced techniques in computer vision such as [9], we have employed a rather simple method to extract players and the ball because our main interest is in machine learning. We have chosen background subtraction, which detects foreground objects as the difference between the current frame and an image of the scene's static background. More specifically, a pixel is recognized as a foreground object if the absolute difference of its element value and the corresponding value in the background image exceeds a given threshold.

We have investigated the number W of images for building a background image by changing the value of W from 25 to 400. By visual inspection, the difference of the results was negligible, thus W was settled to 300 arbitrarily. We have also investigated the thresholds for detecting the foreground objects and settled them to 20, 25, and 25 for H, S, and V, respectively, by visual inspection.

To identify a pixel, we have investigated the HSV values of the teams and the ball, in which the colors of the uniforms were pink and blue. Let δ, ε, and ζ represent the HSV values of an object. For a player with a pink uniform, we employ −30 ≤ δ ≤ 80, 35 ≤ ε ≤ 55, 135 ≤ ζ. For a player with a blue uniform, we employ 90 ≤ δ ≤ 220, 40 ≤ ε ≤ 90, 105 ≤ ζ ≤ 150. For the ball, we employ 70 ≤ δ ≤ 90, 40 ≤ ε ≤ 60, 200 ≤ ζ ≤ 250. These values were settled by visual inspection in experiments using several hundreds of images.

³ HSV represents a color space which is natural for humans and stands for Hue, Saturation, and Value. H describes the shade of color and is represented as an angle from 0 to 360 degrees of a color circle, S describes the purity of the hue with respect to a white reference and is represented as a radius from 0 to 255 of the circle. V describes brightness and is represented as a value from 0 (black) to 255 (white).
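A minimal sketch of this per-pixel rule is given below. The thresholds and color ranges are taken from the text above; the decision logic, the handling of hue wrap-around and all names are assumptions.

```python
def classify_pixel(hsv, background_hsv):
    """Label one pixel as 'pink', 'blue', 'ball' or None, after a simple
    background subtraction with the thresholds 20/25/25 for H/S/V."""
    diffs = [abs(a - b) for a, b in zip(hsv, background_hsv)]
    if not (diffs[0] > 20 or diffs[1] > 25 or diffs[2] > 25):
        return None                      # background pixel
    h, s, v = hsv
    # Hue wrap-around (330-360 degrees mapping to -30..0) is ignored here.
    if -30 <= h <= 80 and 35 <= s <= 55 and v >= 135:
        return "pink"                    # team with pink uniforms
    if 90 <= h <= 220 and 40 <= s <= 90 and 105 <= v <= 150:
        return "blue"                    # team with blue uniforms
    if 70 <= h <= 90 and 40 <= s <= 60 and 200 <= v <= 250:
        return "ball"
    return None
```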

C. Transformation of Coordinates

We transform the results obtained in the previous section into coordinates individually, agglomerate the seven sets of coordinates, and correct then revise the set of coordinates. We transform an obtained coordinate (X, Y) in an image filmed with a VCR to a coordinate (x, y) on the field. We assume that a filmed image is projected on a screen located between the field and a VCR as shown in Figure 2. This transformation represents a combination of an affine transformation and a perspective projection and thus can be expressed as follows.

X = (ax + by + c) / (px + qy + r),   Y = (dx + ey + f) / (px + qy + r)

where a, b, c, d, e, f, p, q, r represent coefficients which can be calculated with the parameters of the transformation.

In order to calculate these coefficients, we select a rectangle P1P2P3P4 from the xy plane and assume the corresponding square Q1Q2Q3Q4 in the XY plane, as shown in Figure 3. A simple calculation shows that we can assume r = 0; thus we have obtained the formula for transforming (X, Y) to (x, y).
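As a generic illustration of applying such a planar projective mapping once the coefficients are known (not the authors' calibration code), the image-to-field transformation can be computed by solving a 2x2 linear system per point:

```python
def image_to_field(X, Y, coeffs):
    """Invert X=(ax+by+c)/(px+qy+r), Y=(dx+ey+f)/(px+qy+r) for one point."""
    a, b, c, d, e, f, p, q, r = coeffs
    # Rearranging the two equations gives:
    #   (a - pX)x + (b - qX)y = rX - c
    #   (d - pY)x + (e - qY)y = rY - f
    a11, a12, b1 = a - p * X, b - q * X, r * X - c
    a21, a22, b2 = d - p * Y, e - q * Y, r * Y - f
    det = a11 * a22 - a12 * a21
    x = (b1 * a22 - a12 * b2) / det
    y = (a11 * b2 - a21 * b1) / det
    return x, y
```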

Next we combine the coordinates on the xy plane obtained with the seven VCRs. In this procedure, we use a procedure agglo(cat, cbt) for agglomerating coordinates cat, cbt of frames sat, sbt filmed at time t with cameras a and b, respectively.


Fig. 2. Image as a projection on a screen

Fig. 3. Points for calculating parameters

Let c_it = {p_it1, p_it2, . . . , p_itm(it)}, where p_itj represents a position, i.e. the coordinates of player j in c_it, and m(it) represents the number of players identified in c_it. We obtain a Euclidean distance for each pair (p_itα, p_jtβ) for α = 1, 2, . . . , m(it) and β = 1, 2, . . . , m(jt) and consider that a pair represents different players if the value is no smaller than R. The correspondence which shows the minimum value for the sum of the distances is chosen with depth-first search. If a corresponding player is not found in a set, the coordinates in the other set are used.
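A sketch of how such an agglomeration step could look (plain Python with hypothetical names; penalising an unmatched point by R is an assumption, since the paper does not spell out how unmatched points enter the minimised sum):

import math

def agglo(ca, cb, R):
    # ca, cb: lists of (x, y) positions detected at time t by two cameras.
    # Pairs at distance >= R are treated as different players; among the
    # remaining correspondences, the one with minimum total distance is
    # chosen by depth-first search.
    def d(p, q):
        return math.hypot(p[0] - q[0], p[1] - q[1])

    best = {"cost": float("inf"), "match": {}}

    def search(i, used, cost, match):
        if cost >= best["cost"]:
            return                                     # prune
        if i == len(ca):
            best["cost"], best["match"] = cost, dict(match)
            return
        for j, q in enumerate(cb):
            if j not in used and d(ca[i], q) < R:
                match[i] = j
                search(i + 1, used | {j}, cost + d(ca[i], q), match)
                del match[i]
        search(i + 1, used, cost + R, match)           # ca[i] stays unmatched

    search(0, frozenset(), 0.0, {})

    merged = []
    for i, p in enumerate(ca):
        if i in best["match"]:
            q = cb[best["match"][i]]
            merged.append(((p[0] + q[0]) / 2, (p[1] + q[1]) / 2))  # average matched pair
        else:
            merged.append(p)                                        # no correspondent found
    matched_js = set(best["match"].values())
    merged.extend(q for j, q in enumerate(cb) if j not in matched_js)
    return merged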

For two VCRs located side by side in Figure 1, such as VCRs 1 and 2 or VCRs 5 and 6, we identify each player and the ball with this method and use the average value if the values differ. More precisely, we use agglo(agglo(c_1t, c_2t), agglo(c_3t, c_4t)) and agglo(agglo(c_5t, c_6t), c_7t) as the agglomerated coordinates for VCRs 1-4 and VCRs 5-7, respectively, where we use R = 12 pixels.

The agglomeration of the two sets of coordinates above needs to handle discrepancies and occlusions. We use R = 16 pixels and, in case of discrepancies, we use the x coordinate of VCRs 1-4 and the y coordinate of VCRs 5-7 to exploit the data of better precision. In case of occlusions, we avoid conflicts in the agglomeration process by using the same x or y coordinate of an occluding player for the occluded player. We denote the set of coordinates obtained with this procedure by c'_t.

Since an object can be surrounded by other objects and the coordinate agglomeration can fail, the number of identified players varies during a game. To circumvent this problem, we use the following method to correct c'_t into c'''_t and keep the number always at 22. Let t be the identifier of the starting frame.

1) Give the set c''_t of coordinates of all players at time t manually.
2) c'''_t = agglo(c'_t, c''_t)
3) c'''_{t+1} = agglo(c'_{t+1}, c'''_t)
4) Repeat procedures 2 and 3 by incrementing t.
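In other words, the manually given set for the first frame is propagated forward, frame by frame, through the agglomeration procedure; a sketch reusing the agglo function sketched above (the value of R used in this correction step is an assumption):

def correct_sequence(c_prime, c_manual_first, R=16):
    # c_prime[t]: agglomerated coordinates c'_t for frame t (t = 0, 1, ..., T-1).
    # c_manual_first: manually given coordinates c''_t of all 22 players for the first frame.
    corrected = [agglo(c_prime[0], c_manual_first, R)]        # steps 1 and 2
    for t in range(1, len(c_prime)):
        corrected.append(agglo(c_prime[t], corrected[-1], R))  # step 3, repeated (step 4)
    return corrected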

The transformation task is difficult even for sophisticated methods in computer vision for various reasons, such as the size of the ball, which can be as small as three pixels. Thus we consider manual revision mandatory and have developed an interface which transforms the obtained coordinates c'''_1, c'''_2, . . . , c'''_T into a tgif4 file. The final sequence (c_1, c_2, . . . , c_T) is obtained after this manual correction procedure.

III. EVALUATION OF PLAYERS

A. Motivation and experiments

In this section, we try to learn a model which discriminates good plays from bad plays as an example of a concept specific to soccer. Here we investigate the effectiveness of a high-level attribute, which requires knowledge specific to soccer for calculating its value. For discrimination, other attributes are called low-level attributes. The main objective of this section is to validate our hypothesis that the good/bad play concept5 can be learned accurately as a decision tree and that high-level attributes are useful in learning.

In the experiments, we generated attributes using those shown in Table I. Except for the "Score" attribute, each attribute is measured at time t, t−15, and t−25, where t is the time for evaluating each player. Since there are 23 attributes in the Table, 67 attributes were used in the experiments. We call each of the 22 × 3 = 66 attributes which is dependent on time a temporal attribute.

TABLE I

ATTRIBUTES USED IN THE FIRST SERIES OF EXPERIMENTS

Kind  Attributes
High  Make a through pass, Beaten by a forward (FW) with the ball, Number of marking players, Level of marking, Marked player has the ball, Marked player makes a pass, Win ball possession, Team which possesses the ball
Low   Number of opponents ahead, Number of allies ahead, Distance to the closest player, Score, Distance to the ball, Angle to the goal, Distance to the goal, X coordinate of the player with the ball, Y coordinate of the player with the ball, X element of the velocity vector of the player with the ball, Y element of the velocity vector of the player with the ball, X coordinate of the player to be evaluated, Y coordinate of the player to be evaluated, X element of the velocity vector of the player to be evaluated, Y element of the velocity vector of the player to be evaluated

Several high-level attributes require explanation. We judge that a player possesses the ball if s/he is the closest to the ball, their distance is no larger than 4 pixels, and the velocity of the ball is no larger than 7 pixels/frame, where 1 pixel corresponds to 0.4 m. In case the player makes a pass, s/he is considered to possess the ball until another player possesses the ball.

4 http://bourbon.usc.edu:8001/tgif/
5 A bad play is defined as either giving a chance to the opponents or missing a chance for the allies. Players who make no mistakes are classified as good.


A through pass is a pass between DFs (defenders) into the open space between the fullbacks and the goalkeeper with the idea that a forward will beat the DFs to the ball6, as shown in Figure 4. We define the levels (i.e. degrees of importance) of a through pass as 0 ≤ L < 50 → level 3, 50 ≤ L < 100 → level 2, and 100 ≤ L → level 1, where L represents the distance to the goal mouth from the point (B) at which a player receives the pass. We also define levels of marking as shown in Figure 5, because we consider that a player outside of the range is doing covering.
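As a hedged illustration, these two judgments could be computed from the spatio-temporal data roughly as follows (a sketch with hypothetical names; the thresholds are the ones stated above):

import math

def possesses_ball(player_pos, ball_pos, ball_speed, is_closest_to_ball):
    # A player possesses the ball if s/he is the closest player to it, their
    # distance is at most 4 pixels and the ball speed is at most 7 pixels/frame
    # (1 pixel corresponds to 0.4 m).
    d = math.hypot(player_pos[0] - ball_pos[0], player_pos[1] - ball_pos[1])
    return is_closest_to_ball and d <= 4 and ball_speed <= 7

def through_pass_level(L):
    # L: distance from the goal mouth to the point B where the pass is received.
    if L < 50:
        return 3
    if L < 100:
        return 2
    return 1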

Fig. 4. Example of a through pass

Fig. 5. Range and levels of marking

The class of an example was determined by a questionnaire followed by discussion by two people who are experienced in soccer7. The class ratio is good 84.0% and bad 16.0%. C4.5 [8] was employed as the decision tree induction system. We used 10-fold cross validation to measure predictive accuracies and tree size, which represents the number of nodes.

We filmed two semi-final games of the Kanagawa Prefecture preliminaries of the 2004 National High-School Championship Tournament under permission of the organizers. We selected 6 scenes (approximately 60 seconds) in which a player of Zama high school makes a pass into an open space of the opponents of Nippon University Fujisawa high school. 345 frames were selected for these experiments by choosing 1 image per 5 images in the 6 relevant scenes8, and were converted to spatio-temporal data with the transformation method. As an example represents a player at a time tick, the number of examples is 345 (images) × 22 (players) = 7590. We use three types of control: a method which does not use high-level attributes, a method which does not use temporal attributes, and a method which uses neither of them.

6 Technical terms related to soccer are explained in WWW sites such as http://www.soccerhelp.com/Soccer Tips Dictionary Terms T.shtml.
7 Both of them have played in official teams of their schools at least for six years.

B. Results and analysis

We show the experimental results in Table II, from which we see that the decision trees exhibit high accuracies. We report accuracy for each class since the 0/1-loss assumption does not hold. The Table shows that high-level attributes and temporal attributes reduce the size of the decision trees considerably. On the other hand, neither of them significantly improves accuracy. We consider that the high-level attributes concerning a mark, a pass, and a through pass simplified the decision trees because these attributes are useful in describing bad players. Predictive accuracy, however, is related to bias and variance and thus could not enjoy such improvement due to its complex nature [10]. Overall, we believe that good/bad play prediction, which is specific to soccer, can be achieved accurately.

We also investigated a part of the decision tree learned with our method from the whole data. The part was selected since it corresponds to the case in which the FWs have an advantage while the DFs still function. From the analysis, we have found the decision tree interesting, as it expresses complex conditions such as marking and covering in a meaningful way. We admit that there are several problems, including the lack of attributes such as space, side attack, and offside trap. Increasing the number of examples by transforming more scenes is also important. Overall, however, we are satisfied with demonstrating the feasibility of decision tree induction in learning concepts essential to soccer games.

IV. PREDICTION OF ROLES

A. Motivation and experiments

As we have described in Section I, the other feasibility study is to predict the defensive/attack roles of each player. More specifically, we assume the following classes: class 1: attacking, class 2: ready to attack, class 3: ready to attack and defend, class 4: ready to defend, and class 5: defending.

In the previous section, we demonstrated the effectiveness of high-level attributes but experienced several difficulties in defining them. For this problem, we explore another approach in which we generate a large number of attributes by combining functions. The approach generates values, functions, and logical formulae, and some of the obtained values are used as attribute values in decision tree induction. We show the algorithm in Figure 6 and explain it below.

8 The number 5 was chosen because it makes the problem challenging by removing similar scenes.


TABLE II
RESULTS ON GOOD/BAD PLAY PREDICTION. KIND CONCERNS HIGH-LEVEL AND LOW-LEVEL ATTRIBUTES WHILE TIME REPRESENTS TEMPORAL ATTRIBUTES

Kind          Time  Class  Accuracy  Size
High and Low  yes   good   97.9%     141
                    bad    88.7%
High and Low  no    good   97.8%     167
                    bad    89.4%
Low           yes   good   97.3%     195
                    bad    84.3%
Low           no    good   97.1%     209
                    bad    82.2%

Algorithm: attribute generation(V, F, L)
Input: values V, functions F, logical formulae L.
Return value: set of attribute values.

V = (P, PS, X, Y, D, N, LOV)
P = (set of the points of players and the ball)
PS = (set of points of all players and the ball), (set of points of all allies), (set of points of all opponents), (set of points of other allies)
X = Y = D = N = φ, LOV = 10, 20, 40
F = (SF, PF, OF, MF, RO, LF)
SF = subset(l, s), size(s), average(f, s)
PF = x(p), y(p), distance(p1, p2)
OF = minus x(x1, x2), absolute x(x), minus y(y1, y2), absolute y(y), minus d(d1, d2), absolute d(d), minus n(n1, n2), absolute n(n)
RO = < for x coordinates, < for y coordinates, < for distances
MF = LF = φ
generate attributes(PF, V, F, L)
update MF(V, F, L)
generate LF(V, F, L)
generate attributes(OF, V, F, L)
generate attributes(SF, V, F, L)

Fig. 6. Algorithm for generating attributes

In our method, we assume 7 types of values: point P of a player or the ball, point set PS, threshold LOV, X coordinate X, Y coordinate Y, distance D, and number N of players. The initial values for LOV are 10, 20, and 40 and concern a value χ in X, Y, and D. This means that a ∈ LOV is employed as χ < a or a < χ. Initially, each of the sets X, Y, D, and N is empty and is filled with values generated by our method. The type of a value included in these sets is used as an attribute in decision tree induction.

In our method, we assume four types of functions: set function SF, point function PF, generated value function OF, and generated function MF. Besides, we define a type of comparative operators RO and a type of logical formula LF. For SF, the function subset(l, s) returns a set s′ ∈ PS each of whose elements belongs to a set s ∈ PS of points and satisfies a logical formula l ∈ LF. The function size(s) returns the cardinality |s| ∈ N of a set s ∈ PS of points. The function average(f, s) returns the average value v of the return values of function f ∈ PF for the elements of a set s ∈ PS of points. For PF, the function x(p) returns the X coordinate x′ ∈ X of a point p ∈ P while the function y(p) returns its Y coordinate y′ ∈ Y. The function distance(p1, p2) returns the distance d ∈ D between a pair of points p1 ∈ P and p2 ∈ P. For OF, a minus χ(a1, a2) function returns a1 − a2 while an absolute χ(a) function returns |a|, where the types of the arguments and the return value correspond to that of χ.

Algorithm: generate attributes(FS, V, F, L)
Input: set of functions FS, values V, functions F, logical formulae L.
Output: set of functions FS, values V.

foreach f ∈ FS
  if f takes one argument
    foreach v ∈ V
      add f(v) to the corresponding set of values in V
  else if f takes two arguments
    foreach (v, v′) where v, v′ ∈ V
      add f(v, v′), f(v′, v) to the corresponding set of values in V

Fig. 7. Procedure for generating values with a set of functions FS

For generating attributes, we use generate attributes(FS, V, F, L), where FS corresponds to either of SF, PF, and OF, as shown in Figure 7. An attribute is generated when an output value is added, where we define that an output value is an element of X ∪ Y ∪ D ∪ N. For instance, the X coordinate of the point p ∈ P of the player to be evaluated is generated from the function x(p) and the point p. Because x(p) is an output value, it is added as a new attribute.
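A rough Python rendering of this procedure, assuming each function is stored together with its argument types and return type, and that V is a dictionary of typed value sets (all names are hypothetical):

def generate_attributes(FS, V):
    # FS: list of (function, argument_types, return_type) triples, e.g.
    # (distance, ('P', 'P'), 'D'). V: dict mapping a type name ('P', 'X',
    # 'Y', 'D', 'N', ...) to the set of values of that type generated so far.
    # Every value added to X, Y, D or N is an output value, i.e. a new attribute.
    for func, arg_types, ret_type in FS:
        if len(arg_types) == 1:
            for v in list(V[arg_types[0]]):
                V[ret_type].add(func(v))
        elif len(arg_types) == 2:
            t1, t2 = arg_types
            for v in list(V[t1]):
                for w in list(V[t2]):
                    # when t1 == t2, both argument orders arise from the double loop
                    V[ret_type].add(func(v, w))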

Algorithm: update MF(V, F, L)
Input: values V, functions F, logical formulae L.
Output: values V.

foreach f ∈ (set of two-argument functions in PF)
  foreach p′ ∈ P
    add f(p, p′), f(p′, p) to MF
foreach f ∈ OF
  if f takes one argument
    foreach g ∈ PF ∪ MF
      add f(g(p)) to MF
  else if f takes two arguments
    foreach g ∈ PF ∪ MF
      foreach v ∈ (set of values which belong to the type of an argument of f)
        add f(g(p), v), f(v, g(p)) to MF

Fig. 8. Procedure for generating functions in MF, where p is a variable


A generated function belongs to MF and is used as an argument of a logical formula. Our method generates a function with update MF(V, F, L) as shown in Figure 8. For instance, using distance(p1, p2) ∈ PF for points p1, p2 ∈ P and the point p0 of the player, two functions "m1(p) = distance(p0, p)" and "m2(p) = distance(p, p0)" are generated, where p represents one of its arguments.

In generate LF(V, F, L), logical formulae in LF are generated with RO, and the arguments belong to either of PF, MF, and LOV. The procedure is similar to those above, thus we omit the pseudo code. For instance, from the function x(p) ∈ PF, the initial value 10 ∈ LOV, and the initial value (< for x coordinate) ∈ RO, the logical formulae x(p) < 10 and 10 < x(p) are generated.

B. Results and analysis

Our procedure in the previous section produced 2188 attributes. The class ratio is class 1 (7.8%), class 2 (18.4%), class 3 (32.1%), class 4 (29.6%), and class 5 (12.1%). We have used the same data as in the previous section, and the decision trees obtained with C4.5 were evaluated with 100 × 10-fold cross validation. C4.5 allows specification of the minimum number of examples in a split test with the "-m" option, which is expected to be a countermeasure to overfitting [8]. In order to investigate the effects of overfitting, we used the options "-m2", "-m10", "-m30", "-m60", and "-m100". The average results for these 5 cases are shown in Figure 9 and those for each class are shown in Figures 10 - 14. In the Figures, an average size which exceeds 450 is plotted on the line of 450 for visibility. For accuracy, we also show the ranges of the average value ± 2 × standard deviations. A dotted line represents the results of the control, which employs the following attributes: X and Y coordinates of the player, X and Y relative coordinates of the player to the ball, distances from the last line of the same and the opponent teams, numbers of allies who have larger and smaller X coordinates, numbers of opponents who have larger and smaller X coordinates, numbers of allies and opponents who are within 30 pixels, and numbers of allies and opponents who are within 30 pixels along the X axis.

From the Figures, we see that our method outperforms the control both in accuracy and in the size of the decision trees in most cases. This good performance can be attributed to the effectiveness of our attribute generation. It is also noteworthy that our method produces smaller trees, which is important from the viewpoint of comprehensibility of the results. We believe that our method learns the defensive/attack roles concept accurately. Especially for classes 2, 3, and 4, our method achieves more than 85% accuracy even with small trees.

For classes 3, 4, and 5, the results for accuracy do not always increase monotonically with size. We consider that some of the small trees achieved high accuracies by overcoming overfitting in these cases. Results for class 1 are worst in both accuracy and the size of the decision trees. This can be explained by the fact that its class ratio is smallest and thus the number of examples is small. For class 3, our method and the control exhibit a small difference in accuracies. This can be explained by the fact that many players in the midfield belong to this class and can be described with relatively simple attributes such as the numbers of allies and opponents ahead or behind.

From these results, we conclude that our method has produced useful attributes for this learning problem9. We have investigated the learned decision trees, especially the attributes used near their roots. We have found that some of the combinations of attributes used in the control were generated as new attributes by our method. We are again satisfied with demonstrating the feasibility of decision tree induction in learning a concept essential in a soccer game.

V. CONCLUSIONS

In this paper, we have demonstrated the feasibility of learning concepts essential in a soccer game from video clips. Our main contributions are a semi-automatic method for transforming video clips into spatio-temporal data, high-level attributes which simplify decision trees, and automatic generation of attributes which improves both accuracy and readability. We believe that this paper represents an important step toward mining complex data with the world's most popular team sport as a novel application domain.

Our endeavor in this paper can be regarded as computer science applied to sport, a domain we call sport informatics. We believe that sport informatics has a considerable impact on academia and society. We plan to make our spatio-temporal data public for academic and educational use in order to contribute to extending sport informatics to a wide range of researchers.

ACKNOWLEDGMENT

This work was partially supported by the grant-in-aid for scientific research on fundamental research (B) 16300042 from the Japanese Ministry of Education, Culture, Sports, Science and Technology. We are grateful to the Kanagawa Football Association for kindly letting us record the soccer games and permitting academic and educational use. We thank Kazuhisa Sakakibara for negotiating with the association.

REFERENCES

[1] S. Hirano and S. Tsumoto, "Finding interesting pass patterns from soccer game records," in Principles and Practice of Knowledge Discovery in Databases, LNAI 3202 (PKDD), 2004, pp. 209-218.
[2] S. Iwase and H. Saito, "Tracking soccer players based on homography among multiple views," in Proc. Visual Communications and Image Processing (VCIP), SPIE, 2003, vol. 5150, pp. 283-292.
[3] T. Misu, M. Naemura, W. Zheng, Y. Izumi, and K. Fukui, "Robust tracking of soccer players based on data fusion," in Proc. 16th International Conference on Pattern Recognition (ICPR), 2002, vol. 1, pp. 556-561.
[4] T. Misu, S. Gohshi, Y. Izumi, Y. Fujita, and M. Naemura, "Robust tracking of athletes using multiple features of multiple views," Journal of the International Conference in Central Europe on Computer Graphics and Visualization (WSCG), vol. 12, no. 2, pp. 285-292, 2004.
[5] C. J. Needham and R. D. Boyle, "Tracking multiple sports players through occlusion, congestion and scale," in Proc. British Machine Vision Conference 2001 (BMVC), 2001.

9 It should be noted that some of the attributes have a natural interpretation.


Figures 9-14 plot accuracy [%] against decision tree size, comparing the generated attributes with the given attributes of the control.

Fig. 9. Results for all classes
Fig. 10. Results for class 1
Fig. 11. Results for class 2
Fig. 12. Results for class 3
Fig. 13. Results for class 4
Fig. 14. Results for class 5


[6] L. Xie, P. Xu, S.-F. Chang, A. Divakaran, and H. Sun, "Structure analysis of soccer video with domain knowledge and hidden Markov models," Pattern Recognition Letters, vol. 25, no. 7, pp. 767-775, 2004.
[7] A. Yamada, Y. Shirai, and J. Miura, "Tracking players and a ball in video image sequence and estimating camera parameters for 3D interpretation of soccer games," in Proc. 16th International Conference on Pattern Recognition (ICPR), 2002, vol. 1, pp. 303-306.
[8] J. R. Quinlan, C4.5: Programs for Machine Learning. San Mateo, Calif.: Morgan Kaufmann, 1993.
[9] K. Nummiaro, E. Koller-Meier, and L. J. V. Gool, "An adaptive color-based particle filter," Image and Vision Computing, vol. 21, no. 1, pp. 99-110, 2003.
[10] J. H. Friedman, "On bias, variance, 0/1-loss, and the curse-of-dimensionality," Data Mining and Knowledge Discovery, vol. 1, no. 1, pp. 55-77, 1997.


Improving the Knowledge Discovery Process Using Ontologies

Laurent Brisson, Martine Collard, Nicolas Pasquier
Laboratoire I3S
Université de Nice
2000 route des Lucioles
06903 Sophia-Antipolis, France
Email: brisson,mcollard,[email protected]

Abstract— In this paper, we present the new ontology-based methodology ExCIS (Extraction using a Conceptual Information System) for integrating expert prior knowledge in a data mining process. This methodology describes guidelines for a data mining process like CRISP-DM. Its originality is to build a specific conceptual information system related to the application domain in order to improve datasets preparation and results interpretation. In this paper we specially present the CIS construction, which consists of creating an ontology by information extraction from an initial raw database and building the data to be mined.

I. INTRODUCTION

One important challenge in data mining is to extract interesting knowledge and information useful for expert users. Numerous algorithms have been built for extracting the best models according to quality criteria like accuracy, ROC area, lift and other indexes. Numerous quality measures have been proposed for objective and quantitative interest. Other works have focused on more semantic approaches for evaluating the subjective quality of discovered models. For instance, a dependency rule, association rule or classification rule may be defined as interesting if it is either surprising or actionable [8], [13]. Thus the use of prior knowledge may significantly enhance the discovery of interesting patterns by considering the interestingness according to expert beliefs. In most data mining projects, prior knowledge is implicit only or it is not organized as a structured conceptual system. ExCIS is dedicated to data mining situations in which the expert knowledge is crucial for the interpretation of mined patterns, no conceptual representation of this knowledge is stored and only operational databases are given to the data mining process. In this approach, an ontology of the domain is built by analyzing existing databases with the collaboration of expert users, who play a central role. The design process of the ontology is directed in order to facilitate the preparation of datasets and the interpretation of extracted patterns. The main objective in ExCIS is to propose a framework in which the extraction process makes use of a well-formed conceptual information system (CIS) for improving the quality of mined knowledge. We consider the paradigm of CIS as defined by [17]. The CIS provides the structure of information useful for further mining tasks. It contains a conceptual schema defining an Ontology enhanced with a Knowledge Base (a set of factual information) on the specific domain of interest and a Mining-Oriented relational DataBase (MODB). Both ontology elicitation and database construction are mining-oriented; in this context, mining-oriented means that they are driven in order to facilitate knowledge discovery tasks.

Ontologies [4] are generally used to specify and communicate domain knowledge. They are formal, explicit specifications of shared conceptualizations of a given domain. Ontologies are very useful for structuring and defining the meaning of the metadata terms that are currently collected inside a domain community. They are a popular research topic in knowledge engineering, natural language processing, intelligent information integration and multi-agent systems. Ontologies are also applied in the World Wide Web community, where they provide the conceptual underpinning for making the semantics of metadata machine understandable. While ontologies may be useful for conducting extraction in data mining tasks for discovering patterns, interpreting rules or conceptual clustering, they are not really integrated in current knowledge discovery projects.

In ExCIS, the ontology provides a conceptual representation of the application domain mainly elicited by analyzing the existing operational databases. The main characteristics of ExCIS are:

• Prior knowledge conceptualization: the CIS is specially designed for data mining tasks with the ontology, the prior expert knowledge base, and the mining-oriented database.

• Adaptation of the CRISP-DM methodology with:

– CIS based preparation of data sets to be mined.
– CIS based post-processing of mined knowledge in order to extract surprising and/or actionable knowledge.
– Incremental evolution of the expert knowledge stored in the CIS.

In this paper, we present a kind of Customer Relationship Management (CRM) case study in which we apply the ExCIS methodology. This project deals with data from the 'family' branch of the French national health care system (CAF: Caisse Nationale d'Allocations Familiales). In this system, beneficiaries receive allowances depending on their social situation. The issue we address is to improve the relationships between beneficiaries and the CAF organism.


In this case study, we had two sources of information: an operational database storing data on beneficiaries and their contacts with the CAF, and the prior knowledge of expert users aware of the business processes, behaviors and habits in the organism.

The paper is organized as follows. Section 2 gives an overview of the ExCIS approach. In Section 3, we study related works. Section 4 describes the underlying conceptual structures of the ontology. In Section 5, we give a detailed description of the CIS construction. Section 6 concludes the paper.

II. OVERVIEW OF THE EXCIS APPROACH

ExCIS integrates prior knowledge all along the mining process: the first step structures and organizes the knowledge in the CIS, and further steps exploit it and enrich it too.

Fig. 1. Hierarchical clustering of conditions.

The global ExCIS process presented in Figure 1 shows:

• The CIS construction, where:
– The ontology is extracted by analyzing operational databases and by interacting with expert users.
– The knowledge base, a set of factual information, is obtained in a first step from dialogs with expert users.
– The new generic MODB is built.
• The pre-processing step, where specific datasets may be built for specific mining tasks.
• The standard mining step, which extracts patterns from these datasets.
• The post-processing step, where discovered patterns may be interpreted and/or filtered according to both the prior knowledge stored in the CIS and the individual user attempt.

The MODB is said to be generic since it will be used as a kind of basic data repository from which any task-specific dataset may be generated. The underlying idea in the CIS is to build structures which will provide more flexibility not only for pre-processing the data to be mined, but also for filtering and interpreting discovered patterns in a post-processing step. Hierarchical structures and generalization/specialization links between ontological concepts play a central role:

• They allow reducing the volume of extracted patterns, like sets of rules, which are often very large.
• They also provide a tool for interpreting results obtained by clustering algorithms.

For numerical or categorical data, they provide different granularity levels which are useful in the pre-processing and the post-processing steps.

a) Example 1: If we assume that 10, 11, 12 and 13 are defined in the ontology as concepts (day numbers) which inherit (are more specialized) from the more general concept WeekNumber = 2, then Rule1, Rule2, Rule3 and Rule4 may be replaced by the single rule Rule5, whose left hand side is more general.
Rule1: If DAYNUMBER = 10 and REASON = 'Entering Call' then OBJECTIVE = 'To Be Paid'
Rule2: If DAYNUMBER = 11 and REASON = 'Entering Call' then OBJECTIVE = 'To Be Paid'
Rule3: If DAYNUMBER = 12 and REASON = 'Entering Call' then OBJECTIVE = 'To Be Paid'
Rule4: If DAYNUMBER = 13 and REASON = 'Entering Call' then OBJECTIVE = 'To Be Paid'
Rule5: If WEEKNUMBER = 2 and REASON = 'Entering Call' then OBJECTIVE = 'To Be Paid'

b) Example 2: Let us consider the following example on categorical data: if 'Student Housing Allowance' and 'Family Housing Allowance' are defined in the ontology as instances of the concept 'Housing Allowances' and if the two following association rules Rule6 and Rule7 are extracted, a generalization may be applied to replace these two rules by the more general rule Rule8.
Rule 6: If PLACE = 'Main Office' and REASON = 'Documents' and RESULT = 'Intervention Dossier' then OBJECTIVE = 'Student Housing Allowance'
Rule 7: If PLACE = 'Main Office' and REASON = 'Documents' and RESULT = 'Intervention Dossier' then OBJECTIVE = 'Family Housing Allowance'
Rule 8: If PLACE = 'Main Office' and REASON = 'Documents' and RESULT = 'Intervention Dossier' then OBJECTIVE = 'Housing Allowances'
Techniques for integrating prior knowledge as illustrated by the previous examples may be introduced at the post-processing steps of the data mining process.
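A small sketch of this post-processing idea, assuming rules are stored as (conditions, conclusion) pairs and that the ontology exposes a parent() lookup returning the more general concept of a value (all names are hypothetical):

def generalize_rules(rules, parent):
    # rules: iterable of (conditions, conclusion), where conditions is a frozenset
    # of (attribute, value) pairs and conclusion is one (attribute, value) pair.
    # parent(attribute, value) returns the more general (attribute, value) concept
    # from the ontology, or None if there is none; e.g.
    # parent('DAYNUMBER', 12) -> ('WEEKNUMBER', 2),
    # parent('OBJECTIVE', 'Student Housing Allowance') -> ('OBJECTIVE', 'Housing Allowances').
    merged = {}
    for conditions, conclusion in rules:
        lifted_conditions = frozenset(parent(a, v) or (a, v) for a, v in conditions)
        lifted_conclusion = parent(*conclusion) or conclusion
        merged.setdefault((lifted_conditions, lifted_conclusion), []).append((conditions, conclusion))
    # Rules that collapse onto the same generalized form (Rule1..Rule4 -> Rule5,
    # Rule6..Rule7 -> Rule8) are replaced by that single rule; a stricter variant
    # would only merge when all child concepts of the parent are covered.
    return list(merged.keys())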

III. RELATED WORKS

A. Interestingness Measures

Numerous works have focused on indexes that measure the interestingness of a mined pattern [5], [9]. They generally distinguish objective and subjective interest. Among these indexes there are quantitative measures of objective interestingness such as confidence, coverage, lift and success rate, while unexpectedness and actionability are proposed as subjective criteria.


Since our work deals with user interestingness, we focus this state of the art on the latter. According to the actionability criterion, a model is interesting if the user can start some action depending on it [14]. On the other hand, unexpected models are considered interesting since they contradict user expectations, which depend on his beliefs.

User expectations is a method developed by Liu [9]. The first approach dealt neither with unexpectedness nor actionability. The user had to specify a set of patterns according to his previous knowledge and intuitive feelings. Patterns had to be expressed in the same way as mined patterns. Then Liu defined a fuzzy algorithm which matches these patterns. In order to find actionable patterns, the user has to specify all actions that he could take. Then, for each action he specifies the situation under which he is likely to run the action. Finally, the system matches each discovered pattern against the patterns specified by the user using a fuzzy matching technique.

Silberschatz and Tuzhilin [13] proposed a method to define unexpectedness via belief systems. In this approach, there are two kinds of beliefs: soft beliefs that the user is willing to change if new patterns are discovered, and hard beliefs which are constraints that cannot be changed with newly discovered knowledge. Consequently this approach assumes that we can believe in certain statements only partially, and some degree or confidence factor is assigned to each belief. A pattern is said to be interesting relative to some belief system if it 'affects' this system, and the more it 'affects' it, the more interesting it is. However, the interestingness of a pattern also depends on the kind of belief.

Piatetsky-Shapiro [11] presented KEFIR, a discovery system for data analysis and report generation from relational databases. This system embodies a generic approach based on the discovery technique of deviation detection [10] for uncovering 'key findings', and dependency networks for explaining the causes of these findings. Central to KEFIR's methodology is its ability to rank deviations according to some measure of interestingness. Interestingness in KEFIR refers to the degree to which a discovered pattern is of interest to the user of the system and is driven by factors such as novelty, utility, relevance and statistical significance [3].

B. Databases and Ontologies

Ontologies provide a formal support to express beliefs and prior knowledge on a domain. Domain ontologies are not always available; they have to be built specially by querying expert users or by analyzing existing data. Extracting ontological structures from data is very similar to the process of retrieving a conceptual schema from legacy databases. Different methods [7], [6], [16], [12] were proposed. They are based on the assumption that sufficient knowledge is stored in databases for producing an intelligent guide for ontology construction. They generally apply a matching between ontological concepts and relational tables such that the extracted ontology is very close to the conceptual database schema.

C. Ontologies and Data Mining

For the last ten years, ontologies have been extensively used for knowledge representation and analysis, mainly in two domains: bioinformatics and web content management. Biological knowledge is nowadays most often represented in 'bio-ontologies' that are formal representations of knowledge areas in which the essential terms are combined with structuring rules that describe relationships between the terms. Bio-ontologies are constructed according to textual descriptions of biological activities. One of the most popular bio-ontologies is the Gene Ontology1, which contains more than 18 thousand terms. It describes the molecular function of a gene product, the biological process in which the gene product participates, and the cellular component where the gene product can be found. Results of data mining processes can then be linked to structured knowledge within bio-ontologies in order to explicate discovered knowledge, for instance to identify the biological functions of genes within a cluster. Interesting surveys of ontology usage for bioinformatics can be found in [1], [15]. A successful project of a data mining application using bio-ontologies is described in [18].

In the domain of web content management, OWL (Ontology Web Language)2 is a Semantic Web standard that provides a framework for the management, the integration, the sharing and the reuse of data on the Web. The Semantic Web aims at the sharing and processing of web data by automated tools as well as by people. It can be used to explicitly represent the meaning of terms in vocabularies and the relationships between those terms, i.e. an ontology. Web ontologies can be used to enrich and explain extracted patterns in many knowledge discovery applications to the Web, such as web usage profiling [2].

IV. CONCEPTUAL STRUCTURES OF THE ONTOLOGY

A. Ontology

In ExCIS the domain ontology is an essential means both for improving data mining processes and for interpreting data mining results. The ontology is defined by a set of concepts and relationships among them which are discovered by analyzing existing data. It provides support both in the pre-processing step for building the MODB and in the post-processing steps for refining mined results.

As shown in section II, generalization/specialization relationships between ontological concepts provide valuable information since they may be used intensively for reducing and interpreting results. For instance, a set of dependency rules (attribute-value rules) may be reduced by generalization on attributes (see example 1 above) or by generalization on values (see example 2 above). Thus the guidelines in the ontology construction are:

• To distinguish attribute-concepts and value-concepts.

1 http://www.geneontology.org/
2 http://www.w3.org/TR/owl-ref/


• To establish a matching between source attributes and attribute-concepts and a matching between source values and value-concepts.
• To define concept hierarchies between concepts.

In ExCIS, according to the data mining orientation, the ontology has two important characteristics:

• The ontology does not contain any instance since source values are organized in hierarchies and considered as concepts; instances are only present in the final relational database MODB.
• Each concept has only two generic properties.

The MODB is a relational database whose role is to store the most fine-grain data elicited from the operational database. MODB attributes are those which are identified as relevant for the data mining task, and MODB instances are tuples of the most fine-grain values.

B. Concept definition

In an ExCIS ontology, a concept has to refer to a domain paradigm useful in the data mining process. A concept in ExCIS is characterized by the two following properties: its role in an extracted pattern (attribute/value) and a boolean property (abstract/concrete) which indicates its presence in the final MODB. An attribute-concept is a data property, and a value-concept represents values of a data property. For instance, in figure 2 "Children number" is an abstract attribute-concept and "3 children" is a concrete value-concept.

Fig. 2. Legend

A concept (attribute-concept or value-concept) which is in the MODB is called a concrete concept. A concept that is not concrete but is useful during the post-processing step is called an abstract concept.

C. Relationship definition

A relationship is an oriented link between 2 concepts. In ExCIS there are 4 different kinds of concepts, and we distinguish relationships between concepts of the same hierarchy and concepts of different hierarchies: thus there are 32 different kinds of relationships between concepts. Among them we can set up 3 different categories:

Fig. 3. Children related concepts

• Generalization/specialization relationships between two value-concepts, which are "is-a-kind-of" links between two value-concepts (see relationship 2 in figure 3)
• Generalization/specialization relationships between two attribute-concepts, which are "is-a-kind-of" links between two attribute-concepts (see relationship 8 in figure 3)
• Generalization/specialization relationships between a value-concept and an attribute-concept, which are "is-a-value-of" links (see relationship 2 in figure 3)

All relationships between concepts within the same hierarchy are in Table I, and relationships between concepts in different hierarchies are in Table II. Numbers are relation references and forbidden relationships are indicated by letters.

TABLE I
CONCEPT RELATIONSHIPS WITHIN THE SAME HIERARCHY

Concept              Concrete Value  Abstract Value  Concrete Attribute  Abstract Attribute
Concrete Value       1               2               3                   D
Abstract Value       B               2               6                   5
Concrete Attribute   C               C               9                   7,8
Abstract Attribute   C               C               A                   10

TABLE II
CONCEPT RELATIONSHIPS IN DIFFERENT HIERARCHIES

Concept              Concrete Value  Abstract Value  Concrete Attribute  Abstract Attribute
Concrete Value       4               4               D                   5
Abstract Value       B               4               D                   5
Concrete Attribute   C               C               D                   D
Abstract Attribute   C               C               A                   D

D. Description and use of ontology relationships

First of all, two kinds of relationships are forbidden in ExCIS:


generalization/specialization relationships from an abstract concept toward a concrete concept with the same role (attribute or value), since we want abstract concepts to be more general than concrete concepts (see relationship A or B in figure 4), and generalization/specialization relationships from an attribute-concept toward a value-concept (see relationship C in figure 4), since the relationship "is a value of" has no meaning in this situation.

Fig. 4. Forbidden relationships

1) Relationships between value-concepts: Generalization or specialization relationships between value-concepts (see relationship 2 in figure 3) are useful in order to generalize patterns during the post-processing step. Furthermore, relationships between two concrete value-concepts of the same hierarchy are essential since they also allow selecting the data granularity in datasets generated from the MODB. If in a data mining session we are more interested in the kinds of allowances than in specific allowances, the dataset granularity will be set to the "Housing Location" level (see relationship 1 in figure 5).

Fig. 5. Allowances related concepts

2) Relationships between attribute-concepts: Generalization or specialization relationships between attribute-concepts are useful in order to generalize models during the post-processing step. However, we must be careful during this generalization since sometimes we can switch an attribute with a more general one (see relationship 7 in figure 6) and other times we have to compute a new value (see relationship 8 in figure 3). For instance, in figure 3 "Children number" is the sum of all of the values of its subconcepts.

Fig. 6. Location related concepts

Relationships between two concrete attribute-concepts of the same hierarchy are specific because they have to be checked during dataset generation: indeed, these attributes cannot be in the same dataset, to avoid redundancy.

The ExCIS method forbids relationships between attribute-concepts of different hierarchies. Relationships between attribute-concepts are only generalization relations; attribute-concepts which are semantically close have to be located together in the same hierarchy. For instance, the "Home Location" and "Agency Location" concepts are in the "Location" hierarchy (see relationship 7 in figure 6).

3) Relationships between value-concepts and attribute-concepts: These relationships are essential in order to build data or to provide different semantic views during the post-processing step (see relationship 5 in figure 6). For instance, "98001" is both a "Home Location" and a "Zip Code" (see relationship 3 in figure 6).

Each value-concept is linked with attribute-concepts in the same hierarchy. ExCIS forbids relationships between a value-concept and concrete attribute-concepts of different hierarchies: indeed, if such a relationship exists, it means that a value-concept "is a value of" two different attribute-concepts. If these attribute-concepts are semantically close they must be in the same hierarchy, and if they are totally different they cannot be in relationship with the same value-concepts.

V. CONCEPTUAL INFORMATION SYSTEM CONSTRUCTION

Let A be the set of source database attributes, C the set of ontology concepts and C_z the set of concepts associated to an attribute z ∈ A. C is defined by C = ⋃_{z∈A} C_z.

ExCIS differs from CRISP-DM mainly in the data preparation step. In this step CRISP-DM describes 5 tasks: select, clean, construct, integrate and format data. Selection and format are identical in both methods, but in ExCIS cleaning, construction and integration are merged in order to elicit the ontological concepts and to build the MODB.

A. Scope Definition and Source Attribute Selection

The first steps of the ExCIS method are related to the Business Understanding and the Data Understanding steps of the CRISP-DM method.


They need an important interaction with expert users.

1. Determine objectives: in our case study, the objectives are to improve "relationships with beneficiaries".

2. Define themes: analysis of the data allows gathering it into semantic sets called themes. For example, we created 3 themes: allowance beneficiaries profiles, contacts (by phone, by mail, in the agency, . . . ) and events (holidays, school starts, birth, wedding, . . . ).

3. For each theme, select a set of source attributes with expert users.

B. Data Analysis and Attribute-Concept Elicitation

4. For each selected attribute z:

5. Examine its name and description in order to:
• Associate n concepts to the attribute.
• Into C, clean homonyms (different concepts with the same name), synonyms (same concepts with different names, like age and date of birth) and useless attributes according to the objectives.

6. Examine its values (distribution, missing values, duplicate values, . . . ) in order to:
• Refine C_z (add or delete concepts) according to the information collected in step 6.
• Clean again homonyms, synonyms and useless attributes. For example, by analyzing values we realized that 'allowances' was in fact 2 homonym concepts. Thus we created the 'allowance amount' concept and the 'allowance beneficiary' concept.

7. For each concept associated to z, create the method which generates value-concepts.

In step 7, if the attribute-concept does not exist, we have to create a 4-field record table. These fields are the attribute associated to the concept, the name of the attribute table in the source database, the attribute domain value and the reference to the procedure which may generate value-concepts. There is only one procedure per record in the table. A domain value can be a distinct value or a regular expression and is the input of the procedure. The procedure output provides references to value-concepts. The procedure might be an SQL request (SELECT or specific computation) or an external program (script, shell, C, . . . ). However, if the attribute-concept already exists, we just have to add a record in the table and create a new procedure.
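One possible concrete shape for such a record table, written here as a plain Python structure with hypothetical attribute, table and concept names (in ExCIS the procedure could equally be an SQL query or an external program):

import re

def children_number_to_concept(raw_value):
    # Hypothetical procedure: maps a raw children count read from the source
    # database to a reference to the corresponding value-concept.
    return f"{int(raw_value)} children"

# One record per domain value of the attribute-concept: the source attribute,
# the source table, the domain value (a distinct value or a regular expression,
# which is the procedure input) and the procedure generating value-concepts.
ATTRIBUTE_CONCEPT_TABLE = {
    "Children number": [
        {"source_attribute": "NB_CHILDREN",
         "source_table": "BENEFICIARY",
         "domain_value": re.compile(r"\d+"),
         "procedure": children_number_to_concept},
    ],
}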

C. Value-Concept Elicitation

At this point, all of the methods to generate value-concepts are created.

8. Give a name to each value-concept.
9. Clean homonyms and synonyms among value-concepts.

D. Ontology Structuration

10. Identify generalization relationships among value-concepts (see figure 5).
11. Create abstract concepts and reorganize the ontology with these new concepts. For instance, the 'Location' concept in figure 6.
12. Create relationships between value-concepts of different hierarchies (see relation 4 in figure 6).

E. Generation of the Mining Oriented Database

13. Generate the database by using the procedures defined in step 7.

In this final step, a program reads the tables created for each attribute-concept and calls the procedures in order to generate the MODB.

VI. CONCLUSION

We have given a global presentation of the new methodology ExCIS for the integration of prior knowledge in a data mining process. The main objective is to improve the quality of extracted knowledge and facilitate its interpretation. ExCIS is based on a conceptual information system (CIS) which stores the expert knowledge. The CIS plays a central role in the methodology since it is used for dataset construction before mining, for filtering and interpreting mined patterns and for updating expert knowledge with validated mined knowledge. This paper focuses on the CIS structure and on its construction only. We have presented its ontological structures, and we have discussed the choices made for identifying ontology concepts and relations by analyzing existing operational data. Further work will be dedicated to the pre- and post-processing steps. We will study techniques for efficiently exploiting information from the domain ontology in order to conduct the preparation of datasets to be mined and the interpretation of mined results.

VII. ACKNOWLEDGEMENTS

We would like to thank the CAF (the family branch of the French national health care system) and especially Pierre Bourgeot, Cyril Broilliard, Jacques Faveeuw, Hugues Saniel and the BGPEO for supporting this work.

REFERENCES

[1] J.B. Bard and S.Y. Rhee. Ontologies in Biology: Design, Applications and Future Challenges. Nature Review Genetics, 5(3):213-222, March 2004.
[2] H. Dai and B. Mobasher. Using Ontologies to Discover Domain-level Web Usage Profiles. Proceedings 2nd ECML/PKDD Semantic Web Mining workshop, August 2002.
[3] W.J. Frawley, G. Piatetsky-Shapiro and C.J. Matheus. Knowledge Discovery in Databases: An Overview. In Knowledge Discovery in Databases, pp. 1-27, AAAI/MIT Press, 1991. Reprinted in AI Magazine, 13(3), 1992.
[4] T. Gruber. What is an Ontology?, January 2002. http://www-ksl.stanford.edu/kst/what-is-an-ontology.htm.
[5] R.J. Hilderman and H.J. Hamilton. Evaluation of Interestingness Measures for Ranking Discovered Knowledge. Proceedings 5th PAKDD conference, Lecture Notes in Computer Science 2035:247-259, April 2001.
[6] P. Johannesson. A Method for Transforming Relational Schemas into Conceptual Schemas. Proceedings 10th ICDE conference, M. Rusinkiewicz editor, pp. 115-122, IEEE Press, February 1994.
[7] V. Kashyap. Design and Creation of Ontologies for Environmental Information Retrieval. Proceedings 12th workshop on Knowledge Acquisition, Modelling and Management, October 1999.
[8] B. Liu, W. Hsu and S. Chen. Using General Impressions to Analyze Discovered Classification Rules. Proceedings 3rd KDD conference, pp. 31-36, August 1997.


[9] B. Liu, W. Hsu, L.-F. Mun and H.-Y. Lee. Finding Interesting Patterns using User Expectations. Knowledge and Data Engineering, 11(6):817-832, 1999.
[10] C.J. Matheus, P.K. Chan and G. Piatetsky-Shapiro. Systems for Knowledge Discovery in Databases. IEEE Transactions on Knowledge and Data Engineering, 5(6):903-913, December 1993.
[11] G. Piatetsky-Shapiro and C. Matheus. The Interestingness of Deviations. Proceedings of the AAAI-94 workshop on Knowledge Discovery in Databases, 1994.
[12] D.L. Rubin, M. Hewett, D.E. Oliver, T.E. Klein and R.B. Altman. Automatic Data Acquisition into Ontologies from Pharmacogenetics Relational Data Sources using Declarative Object Definitions and XML. Proceedings 7th Pacific Symposium on Biocomputing, pp. 88-99, January 2002.
[13] A. Silberschatz and A. Tuzhilin. On Subjective Measures of Interestingness in Knowledge Discovery. Proceedings 1st KDD conference, pp. 275-281, August 1995.
[14] A. Silberschatz and A. Tuzhilin. What Makes Patterns Interesting in Knowledge Discovery Systems. IEEE Transactions On Knowledge And Data Engineering, 8(6):970-974, December 1996.
[15] R. Stevens, C.A. Goble and S. Bechhofer. Ontology-based Knowledge Representation for Bioinformatics. Brief Bioinformatics, 1(4):398-414, November 2000.
[16] L. Stojanovic, N. Stojanovic and R. Volz. Migrating Data-intensive Web Sites into the Semantic Web. Proceedings 17th ACM Symposium on Applied Computing, pp. 1100-1107, ACM Press, 2002.
[17] G. Stumme. Conceptual On-Line Analytical Processing. K. Tanaka, S. Ghandeharizadeh and Y. Kambayashi editors. Information Organization and Databases, chpt. 14, Kluwer Academic Publishers, pp. 191-203, 2000.
[18] N. Tiffin, J.F. Kelso, A.R. Powell, H. Pan, V.B. Bajic and W.A. Hide. Integration of Text- and Data-Mining using Ontologies Successfully Selects Disease Gene Candidates. Nucleic Acids Research, 33(5):1544-1552, March 2005.


Pre-Processing and Clustering Complex Data in E-Commerce Domain

Sergiu Chelcea (1), Alzennyr Da Silva (1,2), Yves Lechevallier (2), Doru Tanasa (1), Brigitte Trousse (1)

(1) AxIS, INRIA Sophia-Antipolis, 2004, Route des Lucioles, B.P. 93, 06902 Sophia Antipolis Cedex, France
(2) AxIS, INRIA Rocquencourt, Domaine de Voluceau, Rocquencourt, B.P. 105, 78153 Le Chesnay Cedex, France

Sergiu.Chelcea,Doru.Tanasa,Brigitte.Trousse,Alzennyr.Da_Silva,[email protected]
http://www-sop.inria.fr/axis/

Abstract

This paper presents our preprocessing and clustering method on a clickstream dataset from the e-commerce domain. The main contributions of this article are twofold. First, after presenting the clickstream dataset, we show how we build a rich data warehouse based on an advanced preprocessing method. We take into account the intersite aspects of the given e-commerce domain, which offers an interesting data structuration. A preliminary statistical analysis based on such complex data, i.e. time-period clickstreams, is given, emphasizing the importance of intersite user visits in such a context. Secondly, we describe our crossed-clustering method, which is applied on data generated from our data warehouse. Our preliminary results are interesting and promising, illustrating the benefits of our WUM methods, even if more investigations are needed on the same dataset.

1. Introduction

This article deals with complex data from the Web in the e-commerce domain. The data complexity results from its characteristics: large, multi-source (intersite logs), heterogeneous (product description tables, logs) and temporal (time period based clickstreams).

Indeed, the daily accesses to an Internet Web site can today easily rise to millions of page accesses, executed by a large number of users spread all over the world. Different contents and heterogeneous necessities constitute the reality of the Web. All this complexity results in extremely rich and varied usage patterns. With this explosive growth of data available on the Internet, the discovery and analysis of useful information from the Web becomes a necessity. Web usage mining [14] is the application of data mining technologies on large log files collected from Web server accesses. This is also known as Web log mining, and represents the process of knowledge extraction from Web access log files. Examples of such applications include: improvements of web site design, system performance analyses as well as network communications, understanding user reaction and motivation, web personalization, and building adaptive web sites [11][12][24][16][17].

E-commerce organizations are especially interested in the insight that web usage mining provides [13][17]. Such insight helps not only to improve their web sites, but also their services and marketing strategies (promotions, banners, etc.).

This paper aims at analyzing a chosen clickstream dataset from the e-commerce domain. We also adopt for our analysis the vendor/webmaster point of view. According to different time periods, we aim at discovering the usage and the server charge in the context of multiple Web sites in the e-commerce domain. The rest of this paper is organized as follows. Section 2 presents the analyzed dataset and our intersite advanced data preprocessing for structuring the data in terms of multishop consumer visits. Then Section 3 gives first results issued from a preliminary statistical analysis of this dataset. Next, before concluding in Section 5, Section 4 describes our two clustering methods, the used data and the respective analyses.

Z.W. Ras, S. Tsumoto and D.A. Zighed (Eds): IEEE MCD05, published by Department of Mathematics and Computing Science, Saint Mary's University, Pages 33-40, ISBN 0-9738918-8-2, Nov. 2005


2. Advanced intersite pre-processing

2.1. Used clickstream dataset description

The clickstream dataset used here, provided for the first time in the PKDD Challenge 2005, consists of 576 large log files with a total of 3,617,171 requests for page views. The data was multi-source, with requests made on seven different e-commerce Web sites from the Czech Republic (see Table 1). Each log file contains all the requests recorded during one hour on the seven e-commerce Web sites, and together the files cover a continuous 24-day period starting from 09:00 AM on the 20th of January until 08:59 AM on the 13th of February 2004.

The log files were in CSV format, with each line containing a request for a page on one of the 7 e-commerce Web servers and having the 6 following fields (Table 2):

- ShopID: the ID of the e-commerce Web server (also denoted as shop) that received the request;
- Date: the Unix time of the request (seconds count since 00:00:00, 1st January 1970);
- IP address: the IP address of the computer of the user making the request;
- SessionID: a PHP session ID automatically generated for each new visit on each server (unique IDs);
- Page: the requested resource (page) on the server;
- Referrer: the referrer of the requested page.

Along with these log files, information on the data structure was given in different description tables. A table containing the seven shop names and their IDs was provided (see Table 1, first two columns). For confidentiality reasons, the names of the seven e-commerce Web sites have been anonymized in this table as well as in the Referrer field when present.

Table 1. Number of requests per shop
ShopID   Site name (shop)   #Requests
10       www.shop1.cz         509,688
11       www.shop2.cz         400,045
12       www.shop3.cz         645,724
14       www.shop4.cz       1,290,870
15       www.shop5.cz         308,367
16       www.shop6.cz         298,030
17       www.shop7.cz         164,447

The Web pages on these servers are interconnected, meaning users can navigate from one shop to another using only the links in the pages (clicks). However, since the SessionID is generated when first entering a page of a Web shop, a user will get a new SessionID when he/she switches to a yet unvisited shop.

The Page field contains the path, on the server corresponding to the ShopID field, to the requested page. There are 21 "page types" corresponding to the distinct 21 first-level syntactic topics of all pages (see Table 3 below).

Table 3. Page types
ID   Page type          Description                                              #Requests      %
1    /ct                Product category                                           228,991   6.33
2    /ls                Product sheet                                            1,363,187  37.68
3    /dt                Detail of product                                        1,233,570  34.10
4    /znacka            List of brand names or brand detail                         88,189   2.43
5    /akce              Actual offers                                               26,260   0.72
6    /df                Comparing product parameters                                57,939   1.60
7    /findf             Fulltext search for products and accessories                55,139   1.52
8    /findp             Parameters-based search                                     93,455   2.58
9    /setp              Setting displayed parameters                                11,752   0.32
10   /poradna           On-line advice                                             107,711   2.97
11   /kosik             Shopping cart, details of contract, submitting order        35,487   0.98
12   /                  Main page                                                  219,218   6.06
13   /obchody-elektro   List of shops with electronics                              10,926   0.30
14   /kontakt           Contact info                                                 6,104   0.16
15   /faq               Frequently-asked questions                                     861   0.02
16   /onakupu           Info about shopping                                          6,659   0.18
17   /splatky           Variants of hire-purchase                                    2,846   0.07
18   /mailc             Availability of products                                     6,680   0.18
19   /mailp             Send this page                                               6,905   0.19
20   /mailf             Send feedback                                                1,855   0.05
21   /mailr             Complaint form                                                 494   0.01
     Total                                                                       3,564,228  98.45

Using these page types and their provided descriptions, we can thus find out if the user has made a request for a specific product, category or theme, if some filters were applied, etc. For example, in the second entry presented in Table 2, the user has requested the products from the Earphones category (code 148). To describe the variables present in the requested pages, another four tables were made available:

Table 2. Format of page requests
ShopID   Date         IP address        SessionID               Page          Referrer
11       1074585663   213.151.91.186    939dad92c4…84208dca     /
11       1074585670   213.151.91.186    87ee02ddcff…7655bb9e    /ct/?c=148    http://www.shop2.cz




1. kategorie: contains 60 descriptions of product categories; hereafter denoted category table;
2. list: contains 157 descriptions of products; hereafter denoted product table;
3. znacka: contains 197 descriptions of product brands; hereafter denoted brand table;
4. tema: contains 36 descriptions of product themes; hereafter denoted theme table.

The Referrer field represents the URL of the Web page containing the link that the user followed to get to the current page. This field can sometimes be empty (address entered manually, blocked referrer, etc.).

2.2. Intersite pre-processing method

In order to prepare the dataset for our analyses, we used a recently proposed methodology for multi-site log data preprocessing [19][20], which extends Cooley's previous work [6].

Unfortunately, the provided raw data was not formatted in the CLF (Common Log Format) [21], and some fields were not available (i.e. the status code, the user agent, the logname). Thus, we rely only on the SessionID to identify a user's visit.

The data preprocessing was done in four steps: data fusion, data cleaning, data structuration and data summarization.

Generally, in the data fusion step, the log files from different Web servers are merged into a single log file. This was already done, so we only merged the 579 log files. We also changed the Date format into Gregorian time in order to facilitate the interpretation of our analyses, and merged the Page and ShopID fields into the URL field (see Table 4) in order to have the same format as the Referrer field.

During the data cleaning step, the non-relevant resources are eliminated (e.g. jpg, js files). Here, this was also already done, but since the status code field is missing, we assume that all requests have succeeded (code 200).

In the data structuration step, requests are grouped by user, user session and visit. On this dataset, as mentioned previously, a user changing shops can have multiple SessionIDs during a single visit, one on each shop. For this reason, we decided to group such SessionIDs that belong to a single user (same IP) into a group of sessions, corresponding to the user's actual visit. This was done by comparing the Referrer with the URLs previously accessed (in a reasonable time window) each time the user moves to another shop. If the Referrer (actually a page on another shop) was previously accessed, we group the two SessionIDs together (the current one and the one on the previous shop). We thus grouped the existing 522,410 SessionIDs into 397,629 groups, equivalent to a 23.88% reduction in the number of user visits. For example, in Table 4 we grouped the two SessionIDs into the same session group because the Referrer of the second line was recently requested from the same IP address and recorded in the URL field of the first line. We thus obtained cross-server user visits which can be used to perform global analyses on all the shops.
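To make the Referrer-based merging concrete, the following minimal Python sketch illustrates the idea under simplified assumptions: log records are already sorted by timestamp, each record is a dict with the fields described above, and the "reasonable time window" is a hypothetical parameter we introduce for illustration. This is not the authors' implementation.

from collections import defaultdict

TIME_WINDOW = 30 * 60  # assumed 30-minute window, not specified in the paper

def group_sessions(records):
    """Map each SessionID to a visit (session-group) identifier."""
    visit_of = {}                  # session_id -> visit id
    recent = defaultdict(list)     # ip -> [(timestamp, url, session_id)]
    next_visit = 0

    for r in records:
        sid = r['session_id']
        if sid not in visit_of:
            # Look for a recently requested URL from the same IP that equals
            # the Referrer: if found, the new SessionID continues that visit.
            match = None
            for ts, url, old_sid in reversed(recent[r['ip']]):
                if r['timestamp'] - ts > TIME_WINDOW:
                    break
                if r['referrer'] and url == r['referrer']:
                    match = old_sid
                    break
            if match is not None:
                visit_of[sid] = visit_of[match]
            else:
                visit_of[sid] = next_visit
                next_visit += 1
        recent[r['ip']].append((r['timestamp'], r['url'], sid))

    return visit_of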

However, nowadays, for privacy reasons, a large number of security software products (firewalls, anti-spyware, etc.) block the user's cookies (used for the SessionID) and/or the URL's Referrer on each request. In this case, blocking the cookies and the Referrer results in a different SessionID for each request that a user makes, even on the same shop. This also happens in the case of a Web robot [1] used to index the pages of these Web sites, or in the case of a download manager. We thus found that 2.54% of the IPs (2,020 out of 79,526) have 141,976 distinct requests (3.92%) corresponding to 141,976 distinct SessionIDs (27.18%). Knowing this, we could further reduce the total number of user visits to achieve a more accurate dataset.

Finally, after identifying each variable in the accessed URL and their corresponding descriptions, we defined a relational database model to use (see Fig. 1). Next, during the data summarization step, the preprocessed data was stored in a relational DB. As in [19][20], this model can be further extended by adding new tables or new attributes to the existing tables. When analyzing this data, we can select only the information which interests us. Table LOG is the main table in this model and contains, on each line, information about one request for a page view from the provided log files.

During the pre-processing step and the statistical analysis (see Tables 1 and 3), we found several inconsistency problems in the dataset. For example, we found missing and redundant entries in the given description tables, missing variable information, malformed SessionIDs, etc. Also, to allow a better data analysis, more information such as the User Agent, the Status Code, the Login Name (if available) and the general PHP session configuration should be provided.

Table 4. Transformed log lines
Datetime              IP               SessionID               URL                              Referrer
2004-01-20 09:01:03   213.151.91.186   939dad92c4…84208dca     http://www.shop2.cz/             -
2004-01-20 09:01:10   213.151.91.186   87ee02ddcff…7655bb9e    http://www.shop2.cz/ct/?c=148    http://www.shop2.cz/




Fig. 1. Relational model of the clickStream database

In the next two sections, we illustrate the usefulness of the structured pre-processed data and the flexibility of our DB model via a statistical analysis and two clustering methods. In order to analyze the traffic charge on the seven shop sites, we decided to use time units. For the first statistical analysis, we used the classical day and hour time units, whilst for the two clustering methods we used slices of date and hour, which we call Time Periods.

3. Statistical data analysis based on time periods

The objective of this section is to show the kind of results we obtained based on an intersite analysis and on the previously introduced notion of "Group of SessionIDs", which actually represents the user visits.

Table 5 shows that Wednesdays are the most important days in terms of new visits (visits are grouped by their start date). Moreover, we observe that on Wednesdays and Sundays the reduction rates of SessionIDs to visits are the highest (more than 35%). We believe this is caused by frequent user re-connections, a fact also confirmed by the high ratio of SessionIDs per visit. We note that the multi-shop analysis was not significant (the multi-shop visits percentage varied from 2.72% to 4.49% per day).

Table 5. Days analysis
Day     #Requests    #SesIDs   #Visits   Red.%   #MS Visits   #Ses/MS Visits
Mon       551,138     73,700    53,373   27.58        2,903        7
Tues      675,984     80,649    66,565   17.46        3,753        3.75
Wed       677,243    112,580    73,126   35.04        3,576       11.03
Thu       612,158     76,211    64,338   15.57        3,410        3.48
Fri       461,607     64,737    57,065   11.85        2,430        3.15
Sat       296,334     51,018    44,706   12.37        1,601        3.94
Sun       342,707     63,515    38,456   39.45        1,859       13.47
Total   3,617,171    522,410   397,629               19,532

As for the hourly analysis, Table 6 shows that four time periods (7-8, 12-13, 13-14 and 20-21) are very important in terms of the server charge (re-connections), while Fig. 2 presents the global visit distribution per hour.

Table 6. Hour analysis
Hour    #Req      #SesIDs   #Visits   Red.%   #MS Visits   #Sess/Visits
0-1      59,205    12,804     9,407   26.53        274        12.39
1-2      32,110     9,352     8,309   11.15        165         6.32
2-3      19,183     6,628     6,376    3.80         90         2.80
3-4      13,302     5,937     5,815    2.05         58         2.10
4-5      14,082     6,999     6,743    3.65         51         5.01
5-6      15,691     7,772     7,265    6.52         65         7.80
6-7      43,459    11,178    10,161    9.09        258         3.94
7-8     103,445    22,827    14,589   36.08        521        15.81
8-9     156,642    24,805    17,913   27.78        899         7.66
9-10    200,170    24,746    21,006   15.11      1,152         3.24
10-11   228,906    26,685    23,112   13.38      1,286         2.77
11-12   246,296    29,661    22,967   22.56      1,332         5.02
12-13   264,805    43,493    24,598   43.44      1,424        13.26
13-14   275,854    36,981    24,767   33.02      1,454         8.40
14-15   264,876    31,040    24,788   20.14      1,433         4.36
15-16   242,962    29,220    23,092   20.97      1,304         4.69
16-17   206,331    23,841    20,531   13.88      1,179         2.80
17-18   179,601    21,241    18,259   14.03        957         3.11
18-19   185,730    23,714    20,086   15.29      1,068         3.39
19-20   187,077    22,933    19,158   16.46      1,051         3.59
20-21   219,793    37,663    20,599   45.30      1,174        14.53
21-22   200,343    27,394    19,612   28.40      1,048         7.42
22-23   154,582    20,645    15,462   25.10        762         6.80
23-0    102,726    14,851    13,014   12.36        527         3.48
Total                       397,629              19,532


Fig. 2. Visits per hour (x-axis: hours 0-23; y-axis: percentage of visits)

Fig. 3a (global visits) and 3b (multi-shop visits) clearly show the low number of new customer visits on Saturdays and Sundays during lunch time, and the high number on Tuesdays and Wednesdays.

Fig. 3. Visits per day and hour: (a) globally, (b) multi-shop (x-axis: hours 0-23; y-axis: number of visits/groups; one curve per weekday)

4. Crossed clustering approach

4.1. Dimension problem

Appropriate use of a clustering algorithm is often a useful first step in extracting knowledge from a database. Clustering, in fact, leads to a classification, i.e. the identification of homogeneous and distinct subgroups in data [7][4], where the definition of homogeneous and distinct depends on the particular algorithm used: this is indeed a simple structure which, in the absence of a priori knowledge about the multidimensional shape of the data, may be a reasonable starting point towards the discovery of richer, more complex structures.

In spite of the great wealth of clustering algorithms, the rapid accumulation of large databases of increasing complexity poses a number of new problems that traditional algorithms are not equipped to address. One important feature of modern data collection is the ever-increasing size of a typical database: it is not unusual to work with databases containing from a few thousand to a few million individuals and hundreds or thousands of variables. However, most clustering algorithms of the traditional type are severely limited in the number of individuals they can comfortably handle (from a few hundred to a few thousand).

To analyze the traffic charge on the seven shop sites, we grouped the requests in terms of Time Periods. We also limited our clustering analysis to the 1,363,187 requests registered on the page type /ls for the first analysis, and to the 490,883 requests registered on shop 4 with the same page type /ls for the second analysis.

4.2. Method

In accordance with our aim of obtaining a partition of the rows, a classification of the complex descriptors is accomplished. Some authors [8][9] proposed the maximization of the chi-squared criterion between the rows and the columns of a contingency table. The main advantage of using our algorithm is to obtain a tool for comparing and clustering aggregated and structured data. Our approach preserves the flexibility and the generality of the dynamic algorithm in Knowledge Discovery. The cluster characterization is based on the distributions of the multi-valued descriptors and on the interactions between descriptors and individuals.

As in classical clustering algorithms, the optimized criterion is based on the best fit between the classes of objects and their representation. In our context of analyzing the relations between time periods and products, we propose to represent the classes by prototypes which summarize the whole information of the time periods belonging to each of them. Each prototype is itself modeled as an object described by multi-category variables with associated distributions. In this context, several distances and dissimilarity functions could be proposed for the assignment. In particular, if the objects and the prototypes are described by multi-category variables, the dissimilarity measure can be chosen as a classical distance between distributions (e.g. chi-squared).


The convergence of the algorithm to a stationary value of the criterion is guaranteed by the best fit between the type of representation of the classes and the properties of the allocation function. Different algorithms, even if following the same scheme, have been proposed according to the type of descriptors and to the choice of the allocation function.

The generalized dynamic algorithm [15][26] on objects has been proposed in different contexts of analysis, for example: to cluster archaeological data described by multi-categorical variables [25]; or to compare socio-economic characteristics of different geographical areas with respect to the distributions of some variables (e.g. economic activities, income distributions, worked hours, etc.).
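To make the dynamic scheme concrete, here is a simplified numpy sketch, assuming rows are time periods and columns are products of a contingency table: prototypes are class distributions and the allocation uses a chi-squared distance between row profiles. The full crossed-clustering method of [15][26] also partitions the columns; this sketch only illustrates the row side and is not the authors' implementation.

import numpy as np

def chi2_row_clustering(counts, k, n_iter=50, seed=0):
    counts = np.asarray(counts, dtype=float)                 # (periods, products)
    profiles = counts / counts.sum(axis=1, keepdims=True)    # row profiles
    col_weight = counts.sum(axis=0) / counts.sum()           # column margins
    rng = np.random.default_rng(seed)
    protos = profiles[rng.choice(len(profiles), k, replace=False)]

    for _ in range(n_iter):
        # allocation step: chi-squared distance of each profile to each prototype
        d2 = (((profiles[:, None, :] - protos[None, :, :]) ** 2)
              / col_weight).sum(axis=2)
        labels = d2.argmin(axis=1)
        # representation step: prototype = distribution of the merged class counts
        for j in range(k):
            if np.any(labels == j):
                merged = counts[labels == j].sum(axis=0)
                protos[j] = merged / merged.sum()
    return labels, protos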

4.3. Analysis

4.3.1. Time periods / product data generation from the DB

Here, we have considered a crossed table where each line describes an individual covering a couple (weekday, hour) of requests for the page /ls on shop 4 (the most visited one according to Table 1), and the column describes one multi-categorical variable which represents the number of products requested by users in a specific time slice (see Table 8, where we have 7 x 24 individuals). We limited our clustering analysis to the requests registered on shop 4, although the same analysis can be made for all the other shops.
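A short pandas sketch of how such a crossed table could be generated from the preprocessed requests; the column names ('datetime', 'product', 'shop_id', 'page_type') are assumptions about the relational model, not the actual schema.

import pandas as pd

def build_time_period_table(requests: pd.DataFrame, shop_id=14) -> pd.DataFrame:
    """Return a (weekday x hour) by product table of request counts."""
    df = requests[(requests['shop_id'] == shop_id) &
                  (requests['page_type'] == '/ls')].copy()
    df['period'] = (df['datetime'].dt.day_name() + '_' +
                    df['datetime'].dt.hour.astype(str))
    # one row per weekday x hour (168 rows), one column per product
    return pd.crosstab(df['period'], df['product'])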

4.3.2. Results

We now return to the subset of the time-period dataset describing the products accessed on shop 4 (see Table 8). The 168 time periods summarize 490,883 requests on all products from shop 4. Table 9 presents the results obtained after applying the crossed clustering method [15], specifying 7 classes of periods and 5 classes of products.

Table 8. Quantity of products requested by weekday x hour and registered on shop 4
Weekday x Hour   Product (number of requests)
Monday_0         Built-in electric hobs (10), Built-in dish washers 60cm (64), Corner single sinks (50), ...
Monday_1         Free standing combi refrigerators (44), Corner single sinks (50), Built-in hoods (60), ...
...              ...
Sunday_22        Built-in microwave ovens (27), Built-in dish washers 45cm (38), Built-in dish washers 60cm (85), ...
Sunday_23        Built-in freezers (56), Kitchen taps with shower (45), Garbage disposers (32), ...

Table 9. Confusion table
         Prod_1   Prod_2   Prod_3   Prod_4   Prod_5    Total
Per_1      2847     5084     3284     2265     2471    15951
Per_2     11305    31492    12951     1895     9610    67253
Per_3     33107    55652    36699     5345    20370   151173
Per_4     22682    46322    30200     5165    27659   132028
Per_5      9576    20477    19721     2339     7551    59664
Per_6      1783     3515     2549      392    11240    19479
Per_7     15019    14297     8608     1397     6014    45335
Total     96319   176839   114012    18798    84915   490883

Surprisingly, product class 5 was defined by only one product, namely Free standing combi refrigerators, which was consulted predominantly on Fridays between 17:00 and 20:00. It is important to note that although this product class (see Table 7) is responsible for only 17.3% of the total requests on shop 4, 57.7% of the requests occurring in its associated time-period class (see Table 10) concern this product. In other words, the product Free standing combi refrigerators is the most requested one on Fridays from 17:00 to 20:00. Such information could be used in marketing strategies, like cross-selling involving this product.

Table 7. Product clustering
Product_5 (cardinal: 1): /product/Free standing combi refrigerators

Table 10. Time period clustering
Period_6 (cardinal: 8): Friday_2, Friday_6, Friday_17, Friday_18, Friday_19, Friday_20, Saturday_5, Tuesday_4

5. Conclusions and future work

In this paper, we first proposed a pre-processing method in the context of a multi-shop e-commerce domain. Our analysis of the proposed PKDD Web logs showed the great flexibility offered by the built data warehouse and the benefits of an enriched data structuration in terms of customer visits. Secondly, we presented two original clustering analyses based on efficient clustering methods applied to Web time-period-based clickstreams: our first results on this dataset, obtained in a short time due to the PKDD challenge deadline, are promising. Such analyses allow us to identify the best hours for marketing strategies, such as fast promotions, on-line advice and banner publishing. Other analyses could be planned in the future (cf. Table 3), for example exploiting the link between consumer activities and time periods per shop, or focusing on multi-shop user visits.

References

1. ABCInteractive.com: Spiders and Robots. http://www.abcinteractiveaudits.com/abci_iab_spidersandrobots/.
2. Ambroise, C., Sèze, G., Badran, F., Thiria, S.: Hierarchical clustering of Self-Organizing Maps for cloud classification. Neurocomputing, 30, pp 47-52, 2000.
3. Arnoux, M., Lechevallier, Y., Tanasa, D., Trousse, B., Verde, R.: Automatic Clustering for the Web Usage Mining. In D. Petcu, D. Zaharie, V. Negru and T. Jebeleanu (eds.), Proceedings of the 5th Intl. Workshop on Symbolic and Numeric Algorithms for Scientific Computing (SYNASC03), pp 54-66, Editura Mirton, Timisoara, 1-4 October 2003.
4. Bock, H. H.: Classification and clustering: Problems for the future. In: E. Diday, Y. Lechevallier, M. Schader, P. Bertrand, B. Burtschy (eds.): New Approaches in Classification and Data Analysis. Springer, Heidelberg, pp 3-24, 1993.
5. Ciampi, A., Lechevallier, Y.: Clustering large: An approach based on Kohonen Self Organizing Maps. Proceedings of PKDD 2000, Zighed, D. A., Komorowski, J., Zytkow, J. (Eds), Springer-Verlag, Heidelberg, pp 353-358, 2000.
6. Cooley, R.: Web Usage Mining: Discovery and Application of Interesting Patterns From Web Data. PhD Thesis, Dept of Computer Science, Univ. of Minnesota, 2000.
7. Gordon, A. D.: Classification: Methods for the Exploratory Analysis of Multivariate Data. Chapman & Hall, London, 1981.
8. Govaert, G.: Algorithme de classification d'un tableau de contingence. In Proc. of First International Symposium on Data Analysis and Informatics, INRIA, Versailles, pp 487-500, 1977.
9. Govaert, G., Nadif, M.: Clustering with block mixture models. Pattern Recognition, Elsevier Science Publishers, 36, pp 463-473, 2003.
10. Hébrail, G., Debregeas, A.: Interactive interpretation of Kohonen maps applied to curves. Proceedings of the 4th International Conference on Knowledge Discovery and Data Mining, AAAI Press, Menlo Park, pp 179-183, 1998.
11. Jaczynski, M.: Scheme and Object-Oriented Framework for Case Indexing By Behavioural Situations: Application in Assisted Web Browsing. Doctoral Thesis, University of Sophia-Antipolis (in French), December 1998.
12. Jaczynski, M., Trousse, B.: WWW Assisted Browsing by Reusing Past Navigations of a Group of Users. In Advances in Case-Based Reasoning, 4th European Workshop on Case-Based Reasoning, Lecture Notes in Artificial Intelligence, 1488, pp 160-171, 1998.
13. Kohavi, R.: Mining E-Commerce Data. KDD'01, San Francisco, CA, USA.
14. Kohonen, T.: Self-Organizing Maps. Springer, New York, 1997.
15. Lechevallier, Y., Verde, R.: Crossed Clustering method: An efficient Clustering Method for Web Usage Mining. In: Complex Data Analysis, Pekin, China, October 2004.
16. Mobasher, B.: Mining Web Usage Data for Automatic Site Personalization. In: Proc. 24th Annual Conference of the Gesellschaft für Klassifikation e.V., University of Passau, March 2000, pp 15-17.
17. Perkowitz, M., Etzioni, O.: Adaptive sites: Automatically learning from user access patterns. In: Proc. 6th Int'l World Wide Web Conf., Santa Clara, California, April 1997.
18. Srivastava, J. et al.: Web usage mining: Discovery and applications of usage patterns from Web data. ACM SIGKDD Explorations, Vol. 1, N. 2, January 2000.
19. Tanasa, D., Trousse, B.: Advanced Data Preprocessing for Intersites Web Usage Mining. IEEE Intelligent Systems, Vol. 19(2):59-65, April 2004.
20. Tanasa, D.: Web Usage Mining: Contributions to Intersites Logs Preprocessing and Sequential Pattern Extraction with Low Support. PhD Thesis, University of Sophia Antipolis, June 2005.
21. W3C: Logging Control In W3C httpd. http://www.w3.org/Daemon/User/Config/Logging.html#common-logfile-format, July 1995.
22. Sauberlich, F., Huber, K.-P.: A Framework for Web Usage Mining on Anonymous Logfile Data. In: Schwaiger, M. and Opitz, O. (Eds.): Exploratory Data Analysis in Empirical Research, Springer-Verlag, Heidelberg, pp 309-318, 2001.
23. Srinivasa Raghavan, N.R.: Data Mining in e-commerce: A survey. Sadhana (Academy Proceedings in Engineering Sciences), Vol. 30, parts 2 & 3, April/June 2005, Indian Academy of Sciences, pp 275-289.
24. Trousse, B., Jaczynski, M., Kanawati, K.: Using User Behavior Similarity for Recommendation Computation: The Broadway Approach. In Proceedings of the 8th International Conference on Human-Computer Interaction (HCI'99), Munich, August 1999.
25. Verde, R., De Carvalho, F.A.T., Lechevallier, Y.: A Dynamical Clustering Algorithm for Multi-Nominal Data. In: H.A.L. Kiers, J.-P. Rasson, P.J.F. Groenen and M. Schader (Eds.): Data Analysis, Classification, and Related Methods, Springer-Verlag, Heidelberg, pp 387-394, 2000.
26. Verde, R., Lechevallier, Y.: Crossed Clustering method on Symbolic Data tables. In: M. Vichi, P. Monari, S. Migneni, A. Montanari (Eds.): New Developments in Classification and Data Analysis, Springer-Verlag, Heidelberg, pp 87-96, 2003.

Heidelberg, pp 87—96, 2003.


Semi-Supervised Clustering with Partial Background Information

Jing Gao, Pang-Ning Tan, Haibin Cheng
Dept. of Computer Science and Engineering, Michigan State University
East Lansing, MI 48824
gaojing2, ptan, [email protected]

Abstract

Incorporating background knowledge into unsupervised clustering algorithms has been the subject of extensive research in recent years. Nevertheless, existing algorithms implicitly assume that the background information, typically specified in the form of labeled examples or pairwise constraints, has the same feature space as the unlabeled data to be clustered. In this paper, we are concerned with a new problem of incorporating partial background knowledge into clustering, where the labeled examples have moderate overlapping features with the unlabeled data. We formulate this as a constrained optimization problem, and propose two learning algorithms to solve the problem, based on hard and fuzzy clustering methods. An empirical study performed on a variety of real data sets shows that our proposed algorithms improve the quality of clustering results with limited labeled examples.

1. Introduction

The topic of semi-supervised clustering has attracted considerable interest among researchers in the data mining and machine learning community [1, 2, 3, 7]. The goal of semi-supervised clustering is to obtain a better partitioning of data by incorporating background knowledge. Although current semi-supervised algorithms have shown significant improvements over unsupervised algorithms, they assume that the background knowledge is given in the same feature space as the unlabeled data. They are therefore inapplicable when the background knowledge is provided by another source with a different feature space.

There are many examples in which the background knowledge does not share the same feature space as the data to be mined. For example, in computational biology, analysts are often interested in the clustering of genes or proteins. Extensive background knowledge is available in the biomedical literature that might be useful to aid the clustering task. In these examples, the background knowledge and the data to be mined may share some common features, but their overall feature sets are not exactly identical. We term this task semi-supervised clustering with partial background information.

The key challenge of semi-supervised clustering with partial background knowledge is to determine how to utilize both the shared and non-shared features while performing clustering on the full feature set of the unlabeled data. The semi-supervised algorithm also has to recognize the possibility that the shared features might be useful for identifying only certain clusters but not others.

To address these challenges, we propose a novel approach for incorporating partial background knowledge into the clustering algorithm. The idea is to first assign confidence-rated labels to the unlabeled examples by using a classifier built from the shared features of the labeled set. We then apply a constrained clustering algorithm to the unlabeled data, where the constraints are given by the unlabeled examples whose predicted labels have confidence greater than some specified threshold, δ. The algorithm also incorporates a weight factor for each predicted class, in proportion to the ability of the shared feature set to discriminate between the predicted class and the other classes in the labeled data. We also propose hard and soft (fuzzy) clustering algorithms to efficiently solve the constrained optimization problem. Extensive experiments using real data sets demonstrate that our proposed algorithms produce better clusters than both unsupervised and supervised¹ algorithms.

2. Incorporating Background Information

Let U be a set of n unlabeled examples drawn from a d-dimensional feature space, and let L be a set of s labeled examples drawn from a d′-dimensional feature space. Without loss of generality, we assume that the first p features of the labeled and unlabeled data are identical; these are known as the shared feature set of U and L.

¹ The supervised algorithm builds a classifier from the shared feature set and applies it to the unlabeled data.

Z.W. Ras, S. Tsumoto and D.A. Zighed (Eds): IEEE MCD05, published by Department of Mathematics and Computing Science, Saint Mary's University, Pages 41-44, ISBN 0-9738918-8-2, Nov. 2005


Each labeled example l_j ∈ L has a corresponding label y_j ∈ {1, 2, ..., q}, where q is the number of classes. The objective of clustering is to assign each of the unlabeled examples to one of the k clusters. We use the notation C = {c_1, c_2, ..., c_k} to represent the k cluster centroids, where ∀i : c_i ∈ R^d.

We now describe our proposed framework for incorporating partial background information into the clustering task. Our framework consists of the following steps:

Step 1: Build a classifier C using the shared feature set of the labeled examples, L.

Step 2: Apply the classifier C to predict the labels of the examples in U.

Step 3: Cluster the examples in U subject to the constraints given by the predicted labels from Step 2.

The details of these steps are explained in the next three subsections. To better understand the overall process, we shall explain the clustering step first before discussing the classification steps.

2.1. Constraint-based Clustering with Partial Background Information

The objective function for clustering has the following form:

    Q = \sum_{i=1}^{n} \sum_{j=1}^{k} (t_{ij})^m d_{ij}^2 + R \sum_{i=1}^{n'} \sum_{j=1}^{k} w_j |t_{ij} - f_{ij}|^m    (1)

The first term is equivalent to the objective function of the standard k-means algorithm. The second term penalizes any violation of the constraints on the predicted labels of examples in U. R is a regularization parameter that determines the tradeoff between the distance to the centroids and the violation of constraints. w_j is the weight factor assigned to cluster j and represents how informative the shared features are in terms of discriminating cluster j from the other clusters. Finally, f_{ij} is a class membership function that indicates whether u_i is predicted to be in cluster j by the classifier C.

For hard clustering, m = 1 and f_{ij} is a discrete variable in {0, 1} that satisfies the constraint \sum_{j=1}^{k} f_{ij} = 1. For fuzzy clustering, m = 2 and f_{ij} is a continuous variable that satisfies \sum_{j=1}^{k} f_{ij} = 1. The next subsection explains how the confidence-rated labeling is obtained.
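For clarity, here is a minimal numpy sketch of how the penalized objective (1) can be evaluated for a given configuration; the array shapes and the treatment of filtered (unconstrained) examples as zero rows of F are our assumptions for illustration only.

import numpy as np

def objective_Q(U, centroids, T, F, w, R, m):
    """Evaluate eq. (1); m=1 is the hard variant, m=2 the fuzzy one."""
    d2 = ((U[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)   # d_ij^2
    constrained = F.sum(axis=1) > 0          # the n' examples kept after filtering
    distortion = ((T ** m) * d2).sum()                                # first term
    penalty = (w * np.abs(T[constrained] - F[constrained]) ** m).sum()  # second term
    return distortion + R * penalty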

2.2. Classification of Unlabeled Examples

We choose the nearest-neighbor classifier for labeling the examples in U. For each example u_i ∈ U, we compute NN(i, j), which is the distance between u_i and its nearest neighbor in L from class j. The distance is computed in terms of the shared feature set. If min_j NN(i, j) ≥ δ, then the example is ignored in the second term of the objective function. δ is the threshold for filtering unnecessary constraints due to examples that have low confidence values.

Once we have computed the nearest-neighbor distances, the class membership function f_{ij} is determined in the following way. For hard clustering, each example u_i ∈ U is labeled as follows:

    f_{ij} = \begin{cases} 1 & \text{if } j = \arg\min_{1 \le l \le k} NN(i, l) \\ 0 & \text{otherwise} \end{cases}    (2)

For fuzzy clustering, f_{ij} measures the confidence that u_i belongs to the class c_j. In [4], several confidence measures are introduced for nearest-neighbor classification algorithms. We adopt the following confidence measure for the classification output:

    M(i, j) = \frac{1}{\alpha + NN(i, j)},

where α is a smoothing parameter. The value of f_{ij} is then computed as follows:

    f_{ij} = \frac{M(i, j)}{Z}    (3)

where Z is a normalization factor so that \sum_{j=1}^{k} f_{ij} = 1.

To incorporate f_{ij} into the objective function Q, we need to identify the class label associated with each cluster j. We assume that k = q, so a one-to-one correspondence between the classes and clusters is established during cluster initialization.
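A compact numpy sketch of this labeling step, following equations (2) and (3); the label encoding (classes 0..q-1), default parameter values and the handling of filtered examples are assumptions made for illustration.

import numpy as np

def class_membership(U_shared, L_shared, y, q, alpha=1.0, delta=None, fuzzy=True):
    """Compute f_ij from nearest-neighbor distances on the shared features."""
    n = len(U_shared)
    NN = np.full((n, q), np.inf)
    for j in range(q):
        Lj = L_shared[y == j]
        diff = U_shared[:, None, :] - Lj[None, :, :]
        NN[:, j] = np.sqrt((diff ** 2).sum(axis=2)).min(axis=1)

    keep = np.ones(n, dtype=bool)
    if delta is not None:
        keep = NN.min(axis=1) < delta          # drop low-confidence examples

    if fuzzy:
        M = 1.0 / (alpha + NN)                 # confidence measure
        F = M / M.sum(axis=1, keepdims=True)   # eq. (3)
    else:
        F = np.zeros((n, q))
        F[np.arange(n), NN.argmin(axis=1)] = 1.0   # eq. (2)
    return F, keep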

2.3. Cluster Weighting

The proposed objective function should also take into account how informative the shared feature set is. We use the weight w_j to reflect the importance of the shared feature set in terms of discriminating cluster j from the other clusters. The concept of feature weighting has been used in clustering by Frigui et al. [6].

Let D(d, j) denote the discriminating ability of the feature set d for cluster j:

    D(d, j) = \frac{\sum_{u_i \in c_j} \sum_{l \ne j} d(u_i, c_l; d)}{\sum_{u_i \in c_j} d(u_i, c_j; d)}    (4)

where d(u_i, c_j; d) is the distance between u_i and the centroid c_j based on the feature set d. Intuitively, D(d, j) is the ratio of the between-cluster distance to the within-cluster distance. The higher the ratio, the more informative the feature set is in terms of discriminating cluster j from the other clusters.


The weight of cluster j is then determined as follows:

    w_j = D(p, j) / D(d, j)    (5)

where p is the number of shared features and d is the total number of features in U.
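A possible numpy sketch of the weight computation (eqs. 4-5), assuming X holds the unlabeled examples, 'centroids' the current centres, 'labels' the current hard assignment and p the number of shared (leading) features; these interface choices are ours, not the paper's.

import numpy as np

def discriminating_ability(X, centroids, labels, feat):
    """D(feat, j): between-cluster / within-cluster distance ratio on 'feat'."""
    k = len(centroids)
    D = np.zeros(k)
    for j in range(k):
        Xj = X[labels == j][:, feat]
        within = np.linalg.norm(Xj - centroids[j][feat], axis=1).sum()
        between = sum(np.linalg.norm(Xj - centroids[l][feat], axis=1).sum()
                      for l in range(k) if l != j)
        D[j] = between / within if within > 0 else 0.0
    return D

def cluster_weights(X, centroids, labels, p):
    shared = np.arange(p)
    full = np.arange(X.shape[1])
    D_shared = discriminating_ability(X, centroids, labels, shared)
    D_full = discriminating_ability(X, centroids, labels, full)
    return D_shared / D_full            # w_j = D(p, j) / D(d, j), eq. (5)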

3. Algorithms

This section presents the SemiHard and SemiFuzzy clustering algorithms for solving the constrained optimization problem posed in the previous section. The EM algorithm provides an iterative method for solving the optimization problem. The basic idea of the algorithm is as follows: first, an initial set of centroids is chosen. During the E-step, the cluster centroids are fixed while the configuration matrix is varied until the objective function is minimized. During the M-step, the configuration matrix is fixed while the centroids are recomputed to minimize the objective function. The procedure is repeated until the objective function converges.

3.1. SemiHard Clustering

In SemiHard clustering, t_{ij} takes values in {0, 1}. We define A_i = {a_{ij}}_{j=1}^{k} as follows:

    a_{ij} = \begin{cases} d_{ij}^2 - R w_j & \text{if } f_{ij} = 1 \\ d_{ij}^2 + R w_j & \text{otherwise} \end{cases}    (6)

During the E-step, we minimize the contribution of each point to the objective function Q. Clearly, minimizing Q is a linear programming problem. From [5], the minimum can be achieved by setting t_{ij} as follows:

    t_{ij} = \begin{cases} 1 & \text{if } j = \arg\min_l a_{il} \\ 0 & \text{otherwise} \end{cases}    (7)

During the M-step, since the configuration matrix is fixed, the second term of the objective function is unchanged. Minimizing Q is therefore equivalent to minimizing the first term. The centroid update formula is:

    c_j = \frac{\sum_{i=1}^{n} t_{ij} u_i}{\sum_{i=1}^{n} t_{ij}}    (8)
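A minimal numpy sketch of one SemiHard iteration following equations (6)-(8); treating examples with an all-zero row of F as unconstrained (pure distance-based assignment) is our assumption about how the filtered examples are handled.

import numpy as np

def semihard_step(U, centroids, F, w, R):
    """One E-step (eqs. 6-7) and M-step (eq. 8) of SemiHard clustering."""
    d2 = ((U[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    # E-step: a_ij = d_ij^2 - R*w_j if f_ij = 1, else d_ij^2 + R*w_j
    A = d2 + R * w * (1.0 - 2.0 * F)
    unconstrained = F.sum(axis=1) == 0
    A[unconstrained] = d2[unconstrained]      # filtered examples: distance only
    labels = A.argmin(axis=1)                 # eq. (7)
    T = np.zeros_like(A)
    T[np.arange(len(U)), labels] = 1.0
    # M-step: recompute centroids as the mean of the assigned points (eq. 8)
    new_centroids = np.array([
        U[T[:, j] == 1].mean(axis=0) if T[:, j].any() else centroids[j]
        for j in range(centroids.shape[0])])
    return T, new_centroids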

3.2. SemiFuzzy Clustering

In fuzzy clustering, the t_{ij} are continuous variables. During the E-step, we minimize Q with respect to T subject to the constraints. Therefore,

    t_{ij} = \frac{\lambda_i + 2 R w_j f_{ij}}{2 (d_{ij}^2 + R w_j)}    (9)

The solution for λ_i is:

    \lambda_i = \left(1 - \sum_{j=1}^{k} \frac{R w_j f_{ij}}{d_{ij}^2 + R w_j}\right) \Big/ \sum_{j=1}^{k} \frac{1}{2 (d_{ij}^2 + R w_j)}

During the M-step, the objective function once again depends only on the first term. Therefore,

    c_j = \frac{\sum_{i=1}^{n} t_{ij}^2 u_i}{\sum_{i=1}^{n} t_{ij}^2}    (10)
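The corresponding SemiFuzzy update, again as an illustrative numpy sketch of equations (9)-(10) with the same array-shape assumptions as above.

import numpy as np

def semifuzzy_step(U, centroids, F, w, R):
    """One E-step (eq. 9, with the closed-form lambda_i) and M-step (eq. 10)."""
    d2 = ((U[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    denom = 2.0 * (d2 + R * w)                                   # 2(d_ij^2 + R w_j)
    lam = (1.0 - (R * w * F / (d2 + R * w)).sum(axis=1)) / (1.0 / denom).sum(axis=1)
    T = (lam[:, None] + 2.0 * R * w * F) / denom                 # eq. (9)
    # M-step: centroids weighted by the squared memberships        (eq. 10)
    W2 = T ** 2
    new_centroids = (W2.T @ U) / W2.sum(axis=0)[:, None]
    return T, new_centroids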

4. Experiments

The following data sets are used to evaluate the performance of our algorithms:

Table 1. Data set descriptions.
Dataset         # of instances   # of attributes   # of classes
Waveform              5000              21               3
Shuttle               4916               9               2
Web                   8500              26               2
Physiological        13239              10               3

We compared the performance of the SemiHard and SemiFuzzy algorithms against the unsupervised clustering algorithm k-means as well as the supervised learning algorithm Nearest Neighbor (NN). The evaluation metric we use is an external cluster validation measure, i.e. the error rate of each algorithm.

4.1 Comparison with Baseline Methods

Table 2 summarizes the results of our experiments when comparing SemiHard and SemiFuzzy to the k-means and NN algorithms. In all of these experiments, 1% of the data set is labeled while the remaining 99% is unlabeled. Furthermore, 10% of the features are randomly selected as shared features. Clearly, both the SemiHard and SemiFuzzy algorithms outperform their unsupervised and supervised learning counterparts even at 1% labeled examples.

4.2 Variation in the Amount of Labeled Examples

In this experiment, we vary the amount of labeled examples from 1% up to 20% and compare the performances of the four algorithms. We use the Waveform data set for this experiment. The percentage of shared features is set to 10%. Figure 1 shows the results of our experiment. The error rate of SemiFuzzy is the lowest, followed by the SemiHard algorithm.


Table 2. Performance comparison for various algorithms.
Dataset         SemiHard   SemiFuzzy   Kmeans     NN
Waveform          0.4534      0.4186   0.4700   0.4706
Shuttle           0.2054      0.1617   0.2168   0.2316
Web               0.4766      0.4317   0.4994   0.4597
Physiological     0.4930      0.4719   0.5828   0.5009

Figure 1. Error rate for various percentages of labeled examples (x-axis: ratio of labeled samples, 0 to 0.2; y-axis: error rate; curves: SemiHard, SemiFuzzy, NN, Kmeans).

4.3 Variation in the Amount of Common Features

Figure 2 shows the effect of varying the percentage of shared features from 10% to 60%. We use the Waveform data set for this experiment and fix the percentage of labeled examples at 1%. It can be seen that adding more shared features improves the error rates of the SemiHard, SemiFuzzy and NN algorithms. This is because the more features are shared by the labeled and unlabeled data, the more informative they are to aid the clustering task. In all of these experiments, SemiFuzzy gives the lowest error rates, followed by SemiHard and the NN algorithm.

5. Conclusions

In this paper, we propose a principled approach for incorporating partial background information into a clustering algorithm. The beauty of this approach is that the background information can be specified in a different feature space than the unlabeled data, as long as there are sufficient overlapping features. We illustrate how the objective function for k-means clustering can be modified to incorporate the constraints due to partially labeled information. In principle, our methodology is applicable to any base classifier.

Figure 2. Error rate for various percentages of shared features (x-axis: ratio of common features, 0.1 to 0.6; y-axis: error rate; curves: SemiHard, SemiFuzzy, NN, Kmeans).

We present both hard and fuzzy clustering solutions to the constrained optimization problem. Using a variety of real data sets, we demonstrate the effectiveness of our approach over both k-means clustering and nearest-neighbor classification algorithms.

References

[1] S. Basu. Semi-supervised clustering: Learning with limited user feedback. Department of Computer Sciences, University of Texas at Austin, 2003.

[2] S. Basu, A. Banerjee, and R. J. Mooney. Semi-supervised clustering by seeding. In ICML '02: Proceedings of the International Conference on Machine Learning, pages 27-34, 2002.

[3] S. Basu, M. Bilenko, and R. J. Mooney. A probabilistic framework for semi-supervised clustering. In KDD '04: Proceedings of the 2004 ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 59-68. ACM Press, 2004.

[4] S. J. Delany, P. Cunningham, and D. Doyle. Generating estimates of classification confidence for a case-based spam filter. In Proceedings of the 6th International Conference on Case-Based Reasoning, page to appear, 2005.

[5] R. Fletcher. Practical Methods of Optimization. John Wiley & Sons, Chichester, second edition, 1986.

[6] H. Frigui and O. Nasraoui. Unsupervised learning of prototypes and attribute weights. Pattern Recognition, 37(3):943-952, 2004.

[7] K. Wagstaff, C. Cardie, S. Rogers, and S. Schrodl. Constrained k-means clustering with background knowledge. In Proceedings of the Eighteenth International Conference on Machine Learning (ICML 2001), pages 577-584, 2001.


Neighborhood Graphs for Image Databases Indexing and Content-Based Retrieval

Hakim HACID
Lyon 2 University
ERIC Laboratory - 5, avenue Pierre Mendes-France
69676 Bron cedex - France
Email: [email protected]

Abdelkader Djamel ZIGHED
Lyon 2 University
ERIC Laboratory - 5, avenue Pierre Mendes-France
69676 Bron cedex - France
Email: [email protected]

Abstract— Retrieving hidden information in multimedia data is a difficult task because of their complex structure and the subjectivity related to their interpretation. In this situation the use of an index is primordial. A multimedia index makes it possible to group data according to similarity criteria. In this paper we propose an improvement to an existing content-based image retrieval approach. We propose an effective method for locally updating neighborhood graphs, which constitute our multimedia index. This method is based on an intelligent way of locating points in a multidimensional space. Promising results were obtained after experiments on various databases.

I. INTRODUCTION

Information retrieval in image databases is still a challenge. For the human being, access to the meaning of an image is natural and not explicit. Hence, the meaning comes out of the image without an explicit cognitive process. In computer vision, there are several levels of interpretation. The lowest is the pixels and the highest is the scene; between them lie many levels of abstraction. The challenge, then, is how to fill the gap between the pixels and the meaning. There are at least two intermediate issues we are interested in. The first one is the vector representation, which is named indexing. It consists of extracting some features (components of a vector) from the low-level representation (pixels): for example, color histograms, moment values, shape parameters, etc. The second issue is the set of labels associated with an image. It is provided by humans by means of words, adjectives, or any set of symbolic attributes. Labels are understandable and better handled in natural cognition. The meaning may be considered as the outcome of the processing of the symbolic attributes related to the image.

To give the computer the ability to mimic the human being in scene analysis, we need to make explicit the process by which it moves up from the lowest level to the highest one. Image processing tools give many ways to transform an image into a vector. For instance, the MPEG-7 protocol associates a set of quantitative attributes with each image. The computation of these features is fully integrated and automatic in many software platforms. In return, the labels are basically given by the user because they are issued from the human language.

In image retrieval, the user is more interested in finding images that have the meaning he expects, regardless of the vector of features. To access the suitable images, the user may express his/her request using either keywords (or natural language, similar to an Internet search) or sample images. Today, textual annotation is a manual and expensive process; this is why the vector approach is mostly used.

The relevance of the image retrieval process depends on the vector of features. Nevertheless, if we assume that the features are relevant in the representation space, which is supposed to be R^p, images that are neighbors should have very similar meanings. This is based on the general principle that says "birds of a feather flock together".

We propose a framework using low-level features, automatically extracted from images, as navigation support for the browsing process. In that way we can access the semantic aspect of an image based on the neighbor images in the representation space. Hence, the user may retrieve visually similar images. If the images in the neighborhood of a sample image are labeled, then the query image may inherit the present labels. This process is called instance-based learning. The best-known algorithm in this category is the k-nearest neighbors. However, the k-nearest neighbors algorithm has many drawbacks, underlined in our previous paper [21]. We prefer to use proximity graphs [24], which have very interesting properties for this purpose.

The advantages of the proximity graph approach were shown in [21]. However, there were some limitations, basically due to the complexity of the algorithm. Indeed, for each new query we have to rebuild the whole proximity graph. Given that the fastest algorithm has a complexity of O(n^3), where n is the number of items, the problem becomes intractable. In this paper, we propose a new algorithm able to compute locally the exact neighborhood of a new point. This provides a new advantage to our approach.

The next section discusses the works related to the specified problems. Then we present the proximity graph concept in Section III. Our contribution, dealing with neighborhood graph optimization and its application to multimedia indexing, is described in Section IV. We describe our experiments and the evaluation of the method in Section V. Concluding remarks and future work are presented in Section VI.

II. RELATED WORKS

The interrogation of a multimedia database includes three main functions, namely: search, navigation and browsing [1]:

Z.W. Ras, S. Tsumoto and D.A. Zighed (Eds): IEEE MCD05, published by Department of Mathematics and Computing Science, Saint Mary's University, Pages 45-52, ISBN 0-9738918-8-2, Nov. 2005


• Search (Retrieval): The objective of the search is to provide the user with a list, as short as possible, of pertinent results. Rowe et al. [19] identified three types of information which can be useful for search:
  – Bibliographical data: includes all information relating to the multimedia data, such as author, title, producer, etc.
  – Structural data: this type of information is related to the structure of the multimedia data, for example the number of color channels in an image.
  – Semantic data: information based on the semantic content of the multimedia data, like textual annotations.
• Navigation: It is a continuation of the preceding function. A user can submit his query either in textual form, by describing the wished result, or by providing an image or a video sequence with the aim of finding similar images/videos. From the obtained results, the user can move through the similar images/videos in the database until finding what he/she wishes.
• Browsing: Browsing the content of a database can be very useful, because the search sometimes cannot give satisfactory results, or sometimes it can be difficult to express a query that meets the user's needs well. Indeed, the user is not always able to express his/her query clearly; he/she has only a rather vague idea of the wished result. The standard today for browsing is storyboard browsing; its principle is the condensation of the database into summaries. For example, Zhang et al. [27] presented a hierarchical browsing form for video data: the root is represented by N images of the video, each one corresponding to a sequence in the video, and each level of the hierarchy corresponds to a more detailed level in the video.

The interrogation of multimedia databases can be done at two levels [20]: the low level (structural level) and the high level (semantic level). The low level corresponds to the characteristics which can be extracted from the multimedia data, like color, texture, shape, etc. The high level generally consists of a list of keywords associated with each multimedia datum, which serves to describe its semantic content. The use of textual annotation presents two main disadvantages: the first one is its slowness and its expensiveness; the second one is related to the subjectivity of the persons annotating the multimedia data. Indeed, for example, two different persons can annotate the same image in two different ways. Because of these disadvantages, the interrogation is generally done using the low-level characteristics.

In all the multimedia data interrogation approaches (low-level approach or high-level approach), each datum is localized by its coordinates in a multidimensional space. A vector of features (low-level characteristics or textual annotations) is associated with each item. In [20], Rui and Huang estimate that content-based retrieval can be made effective by combining the two levels (low level and high level). However, this can raise the problem of annotation subjectivity, which is an important problem and can considerably deteriorate the performances of a content-based information retrieval system.

In order to capture semantic aspects starting from the low-level characteristics, the use of a multimedia index is necessary. An index makes it possible to group or to bring closer together items which have similar characteristics.

Several content-based information retrieval systems are based on the k-nearest neighbors principle [5][26]. This principle relies on sorting items by taking their distances into account; a predefined number k of nearest items is returned by the system when a query is submitted. For example, the QBIC system, in its implementation for the Hermitage museum¹ [4], returns the 12 nearest images to the query submitted by the user.

The structuration model of a multimedia database is (or can be seen as) a graph based on similarity relations between the items (for example the k-NN graph [13] or the relative neighborhood graph [21]). The objective is to explore an image database through the similarity between images. The structuration model is very important because the performances of a content-based information retrieval system depend strongly on the representation structure (indexing structure) which manages the data.

Several multimedia information retrieval systems have been proposed. Content-based image retrieval systems are more widespread than those for video retrieval. We find, for example, QBIC [6][15], CANDID [10], CHABOT [16], VIRAGE [16], PhotoBook [17], BlobWorld [2], VisualSeek [3][22] and RETIN [7] for images, and CVEPS [3], JAKOB [11], VISION [12] and SWIN [27] for video.

From now on, we will consider the context of content-based image retrieval in large image databases to illustrate the proposals of this paper.

III. NEIGHBORHOOD GRAPHS

Neighborhood graphs are very much used in various sys-tems. Their popularity is due to the fact that the neighborhoodis determined by coherent functions which reflect, from somepoint of view, the mechanism of the human intuition. Their useis varied from information retrieval systems to geographicalinformation systems.

However, several problems concerning the neighborhoodgraphs are under research and require a lot of work to solvethem. These problems are primarily related to their high costof construction and to their update difficulties. For this reasons,optimizations are necessary for their construction and update.

In order to avoid some problems related to the use of the k-NN (symmetry problem, subjectivity related to the determination of k), the use of another structuring model based on neighborhood graphs was proposed in [21]. This proposition has many advantages, and this is the reason why we adopt the same approach (the use of neighborhood graphs) for content-based image retrieval. In the following, we introduce neighborhood graphs.

Neighborhood graphs, or proximity graphs, are geometrical structures which use the concept of neighborhood to determine the closest points to a given point. For that, they are based on proximity measures [25]. We will use the following notations throughout this paper:

Let Ω be a set of points in a multidimensional space Rd. A graph G(Ω, ϕ) is composed of the set of points Ω and a set of edges ϕ. Then, to any graph we can associate a binary relation R upon Ω, in which two points (α, β) ∈ Ω² are in binary relation if and only if the pair (α, β) ∈ ϕ. In other words, (α, β) are in binary relation if and only if they are directly connected in the graph G. From this, the neighborhood N(α) of a point α in the graph G can be considered as a sub-graph which contains the point α and all the points which are directly connected to it.

Several possibilities were proposed for building neighborhood graphs. Among them we can quote the Delaunay triangulation [18], the relative neighborhood graph [24], the Gabriel graph [8], and the minimum spanning tree [18]. In this paper, we consider only one of them, the relative neighborhood graph (RNG). The main motivation for this choice is its simplicity and its wide usage. For illustration, we describe hereafter two examples of neighborhood graphs, namely, the relative neighborhood graph (RNG) and the Gabriel graph (GG).

A. Relative neighborhood graphs

In a relative neighborhood graph Grng(Ω, ϕ), two points (α, β) ∈ Ω² are neighbors if they satisfy the relative neighborhood property defined hereafter.

Let H(α, β) be the hyper-sphere of radius δ(α, β) centered on α, and let H(β, α) be the hyper-sphere of radius δ(β, α) centered on β, where δ(α, β) and δ(β, α) are the dissimilarity measures between the two points α and β, with δ(α, β) = δ(β, α). Then, α and β are neighbors if and only if the lune A(α, β) formed by the intersection of the two hyper-spheres H(α, β) and H(β, α) is empty [24]. Formally:

A(α, β) = H(α, β) ∩ H(β, α)
Then (α, β) ∈ ϕ iff A(α, β) ∩ Ω = ∅

Figure 1 illustrates the relative neighborhood graph.
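To make the property concrete, here is a minimal sketch (not the authors' implementation) of the relative neighborhood test and of a naive construction, assuming points are coordinate tuples and the dissimilarity δ is the Euclidean distance:

import math

def delta(a, b):
    # Euclidean dissimilarity between two points given as coordinate tuples
    # (an assumption of this sketch; any metric could be used instead).
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def rng_neighbors(a, b, points):
    # (a, b) are RNG neighbors iff the lune A(a, b) is empty, i.e. no other
    # point is strictly closer to both a and b than they are to each other.
    d_ab = delta(a, b)
    return all(not (delta(p, a) < d_ab and delta(p, b) < d_ab)
               for p in points if p != a and p != b)

def build_rng(points):
    # Naive O(n^3) construction: test every pair against all other points.
    return [(a, b)
            for i, a in enumerate(points)
            for b in points[i + 1:]
            if rng_neighbors(a, b, points)]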

B. Gabriel graph

This graph was introduced by Gabriel and Sokal [8] in the context of the measurement of geographic variation. Let H(α, β) be the hyper-sphere of diameter δ(α, β) (cf. Figure 2). Then, α is a neighbor of β if and only if the hyper-sphere H(α, β) is empty. Formally:

(α, β) ∈ ϕ iff H(α, β) ∩ Ω = ∅
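Under the same assumptions as the previous sketch (coordinate tuples and the Euclidean delta defined there), the Gabriel test can be written as:

def gabriel_neighbors(a, b, points):
    # a and b are Gabriel neighbors iff the hyper-sphere of diameter
    # delta(a, b), centered on their midpoint, contains no other point.
    d_ab = delta(a, b)
    mid = tuple((x + y) / 2.0 for x, y in zip(a, b))
    return all(delta(p, mid) >= d_ab / 2.0
               for p in points if p != a and p != b)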

As introduced in the previous sections, the main problem related to neighborhood graphs is their high complexity.

Fig. 1. Relative neighbourhood graph

Fig. 2. Gabriel graph

For their usage in the context of database indexing, their exploitation needs some optimizations.

We can consider two situations when dealing with the optimization problem of neighborhood graphs. The first situation is when the graph is already built. In this situation, if one uses an approximation method, one risks obtaining another graph with some bad neighbors (bad neighbors in the sense that the obtained neighborhood differs from the one in the graph built using the basic algorithm): we can obtain more or fewer neighbors for some items. In this case, we have to find a solution for effectively updating the graph when inserting or deleting a point without rebuilding it. The second situation is the one where the graph is not built yet. In this situation we can apply an approximation method to obtain a graph which is as close as possible to the one which we could obtain using a standard algorithm. In this paper we are interested in the first case.

C. Neighborhood graphs construction algorithms

Several algorithms for neighborhood graph construction were proposed. The algorithms which we quote hereafter relate to the construction of the relative neighborhood graph.

One of the approaches common to the various neighborhood graph construction algorithms is the use of refinement techniques. In this approach, the graph is built in steps. Each graph is built starting from the previous graph, containing all connections, by eliminating some edges which do not satisfy the neighborhood property of the graph to be built. Pruning (edge elimination) is generally done by taking into account the construction function of the graph or through geometrical properties.


The construction principle of neighborhood graphs consists in checking, for each point, whether the other points in the space are in its proximity. The cost of this operation is of complexity O(n³) (n is the number of points in the space). Toussaint [25] proposed an algorithm of complexity O(n²): he deduces the RNG starting from a Delaunay triangulation [18]. Using the octant neighbors, Katajainen [9] also proposed an algorithm of the same complexity. Smith [23] proposed an algorithm of complexity O(n^(23/12)), which is significantly less than O(n³).

As far as our work is concerned, and as stated in the introduction, it can be viewed as an improvement of the proposal in [21]. That approach is focused mainly on a local updating technique for neighborhood graphs. However, the method does not really offer the anticipated results. Indeed, in this method the graph is not really updated: the neighbors of each query point are simply considered to be those of its nearest neighbor. Formally, consider α to be a query point and β its nearest neighbor. If the set of neighbors of β is {β1, β2, β3}, then the neighbors of α are {β, β1, β2, β3}.

This approach, as we can see, is not correct, and using it the graph will deteriorate. We propose in the following an effective local updating method which is stable, insensitive to the effects of the data dimension, and which returns very satisfactory results (in terms of response time and answer quality).

IV. NEIGHBORHOOD GRAPHS BASED RETRIEVAL APPROACH

Image database interrogation is generally made by submitting a query to the system; this query is generally an image which the system pre-processes (segments, equalizes, etc.), producing a vector of features representing a point in a multidimensional space. This point is inserted in the representation structure (indexing structure) and its neighbors are then returned as an answer to the query.

In our case, a naive approach when using neighborhood graphs would be to rebuild the neighborhood graph containing the data already existing in the database after adding the new query item. This approach is unfortunately not suitable, because it is very expensive, especially when the number of items in the database is large. Another approach would be to locally update the neighborhood graph, i.e., to find a procedure such that only the potential items which are in the neighborhood of the inserted point will be affected by the update.

The local update of a neighborhood graph requires locating the inserted (or removed) point as well as the points which can be affected by the update. To achieve this, we proceed in two main stages: initially, we look for an optimal surface which can contain a maximum number of points potentially closest to the query point. The second stage aims at filtering the items found beforehand in order to recover the points really closest to the query by applying an adequate neighborhood property. This last stage causes the effective update of the neighborhood relations between the concerned points.

The main stage in this method is the determination of the "search area". This can be considered as the question of determining a hyper-sphere centered on α (the query point) which maximizes the chance of containing the neighbors of α while minimizing the number of items that it contains.

We take advantage of the general structure of neighborhood graphs in order to establish the radius of the hyper-sphere. We focus especially on the nearest neighbor and the farthest neighbor concepts. Two observations in connection with these two concepts seem interesting to us:

• The neighbors of the nearest neighbor are potential candidates for the neighborhood of the query point α. From this, by generalization, we can deduce that:

• All the neighbors of a point are also candidates for the neighborhood of a query point to which that point is a neighbor.

With regard to the first step, the radius of the hyper-sphere which satisfies the above properties is the one including all the neighbors of the first nearest neighbor of the query point. So, considering that the hyper-sphere is centered on α, its radius will be the sum of the distance between α and its nearest neighbor and the distance between this nearest neighbor and its farthest neighbor.

The content of the hyper-sphere is processed in order to see whether it contains some neighbors (or all the neighbors). The second step constitutes a reinforcement step and aims at eliminating the risk of losing neighbors or including bad ones. This step takes advantage of the second observation: we take all the true neighbors of the query point recovered beforehand (those returned in the first step) as well as their neighbors, and update the neighborhood relations between these points.

That is, let α be the query point and β its nearest neighbor at a distance δ1. Let λ be the farthest neighbor of β, at a distance δ2. The radius R of the hyper-sphere can then be expressed as:

R = δ1 + δ2 + ε

ε is a relaxation parameter; it can be defined according to the state of the data (their dispersion for example) or from domain knowledge. We experimentally set this parameter to 1.
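The two stages can be summarized by the following sketch. It is only an illustration of the principle, under the assumptions that the graph is kept as a dictionary mapping each point to the set of its current neighbors, and that delta and rng_neighbors are the functions from the earlier sketches; the re-checking of edges is restricted to the points found in the search area and their own neighbors.

EPSILON = 1.0  # relaxation parameter, set experimentally to 1

def insert_point(alpha, graph):
    points = list(graph.keys())
    # Stage 1: radius of the search area = distance to the nearest neighbor
    # plus the distance from that neighbor to its farthest neighbor.
    beta = min(points, key=lambda p: delta(alpha, p))
    delta1 = delta(alpha, beta)
    delta2 = max((delta(beta, q) for q in graph[beta]), default=0.0)
    radius = delta1 + delta2 + EPSILON
    candidates = [p for p in points if delta(alpha, p) <= radius]

    # Stage 2: keep only the candidates (and their own neighbors) that really
    # satisfy the neighborhood property, and update the affected relations.
    affected = set(candidates)
    for c in candidates:
        affected |= graph[c]
    graph[alpha] = set()
    scope = list(affected | {alpha})
    for p in affected:
        if rng_neighbors(alpha, p, scope):
            graph[alpha].add(p)
            graph[p].add(alpha)
        for q in list(graph[p]):
            if q != alpha and not rng_neighbors(p, q, scope):
                graph[p].discard(q)
                graph[q].discard(p)
    return graph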

The complexity of this method is very low and meets perfectly our starting aims (locating the neighborhood of points in as short a time as possible). It is expressed by:

O(2n + n′²)

where
• n: the number of items in the database;
• n′: the number of items in the hyper-sphere (n′ ≪ n).

This complexity includes the two stages described previously: the search for the radius of the hyper-sphere and the search for the true neighbors lying in this hyper-sphere correspond to the term O(2n). The second term corresponds to the time necessary for the effective update of the neighborhood relations between the real neighbors, which is very small taking into account the number of candidates returned.


Fig. 3. Illustration of the proposed method

This complexity constitutes the maximum complexity and can be optimized in several ways. The most obvious way is to use a fast nearest neighbor algorithm. The example hereafter illustrates and summarizes the principle of the method.

V. EXPERIMENTS AND RESULTS

The interest of neighborhood graphs for content-based information retrieval is shown and discussed in [21]. The comparison tests done with several different configurations of k-NN were interesting and showed the utility of these structures in this field.

Here, we are interested in two different tests: testing the validity of the results obtained by the use of the suggested method, and testing the response times. We used an image database [14]. This database contains 7200 images representing 100 different objects taken from various views.

To perform our experiments, we pre-process the database by applying some image processing algorithms to extract some characteristics. We especially use color, texture and shape features. Each image is represented as a vector of features and the graph is built on the obtained vectors. The interrogation is also done in the same way, by using a vector of features. So, each image is represented as a point in a multidimensional space. Our interest in performing these experiments is to show that our method gives the same results as the ones obtained by building the graph with all items, and to ensure that we do not include wrong neighbors or miss good neighbors of the query item.

To check the validity of the obtained results, we need a reference graph. For that, we take the whole image database (vectors of descriptors) and we build a relative neighborhood graph. After building the reference graph, we can start the validity checking tests. We arbitrarily take an item from the data set and we build a new relative neighborhood graph on the remaining items (the n−1 items). After that, we insert the item taken beforehand from the dataset into the graph by using the proposed method. The neighbors of the inserted point are compared with the neighbors of this same item in the reference graph. This operation is repeated several times, taking different items at each iteration.
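A possible test harness for this protocol, reusing build_rng and insert_point from the earlier sketches (the function and variable names are ours, not the authors'):

import random

def as_adjacency(edges):
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph

def check_insertion(dataset):
    # dataset: list of feature vectors given as tuples (hashable points)
    reference = as_adjacency(build_rng(dataset))      # reference graph on all n items
    item = random.choice(dataset)
    rest = [p for p in dataset if p != item]
    partial = as_adjacency(build_rng(rest))           # graph on the n-1 remaining items
    insert_point(item, partial)                       # local insertion of the removed item
    return partial[item] == reference[item]           # same neighborhood as in the reference?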

Table I summarizes the obtained neighborhoods of randomly taken items from the image collection. Each item in the dataset is identified by a unique ID. The first column of each half of the table contains the item concerned by the neighborhood (in the second half it is the item which constitutes our query). We can clearly see that the results obtained after the insertion of the removed item (Graph 2) are exactly the same as the ones obtained using the standard algorithm (Graph 1). So we do not miss neighbors. Figure 4 gives a visual illustration of the result obtained using the neighborhood graph indexing technique.

The validity of the obtained results being established, we are now interested in the response times of the method, i.e. the time that the method takes to insert a query item into an existing structure. We use the same protocol as the one for the validity checking, but instead of recovering the result items, we take into consideration only the response time of each query.


TABLE I
COMPARISON OF THE OBTAINED NEIGHBORHOOD USING THE REFERENCE GRAPH (GRAPH 1) AND THE UPDATED GRAPH (GRAPH 2)

Standard Graph (Graph 1)          Updated Graph (Graph 2)
Query Image   Found Images        Query Image   Found Images
              IMG74                             IMG74
              IMG87                             IMG87
IMG85         IMG118              IMG85         IMG118
              IMG131                            IMG131
              IMG3665                           IMG3665
              IMG484                            IMG484
              IMG623                            IMG623
              IMG7170                           IMG7170
IMG4804       IMG4763             IMG4804       IMG4763
              IMG4803                           IMG4803
              IMG4805                           IMG4805
              IMG6532                           IMG6532
              IMG4345                           IMG4345
              IMG6781                           IMG6781
              IMG6791                           IMG6791
IMG6807       IMG6795             IMG6807       IMG6795
              IMG6825                           IMG6825
              IMG6830                           IMG6830
              IMG74                             IMG74
              IMG87                             IMG87
IMG7196       IMG7171             IMG7196       IMG7171
              IMG7195                           IMG7195
              IMG7197                           IMG7197

We used for these experiments a machine with an Intel Pentium 4 processor (2.80 GHz) and 512 MB of memory. The response times for 10 items, again arbitrarily taken from the same dataset, are shown in the graph in Figure 5.

The response times (expressed in milliseconds) are low considering the volume of the dataset used. They vary from one item to another, mainly because a different number of candidate items for neighborhood determination is involved at each iteration. Generally speaking, the insertion of a new item into the graph using our method requires only 50 milliseconds on average, whereas it requires at least 40 minutes using the standard algorithm under the same conditions.

VI. CONCLUSION AND FUTURE WORK

Content-based multimedia information retrieval is a complex task because of, primarily, the nature of multimedia data and the subjectivity attached to their interpretation. The use of an adequate multimedia index structure is important. In this paper we considered a neighborhood graph based indexing structure and we proposed a method for locally updating neighborhood graphs, with the aim of improving an existing approach. Our method is based on the localization of the potential items which can be affected by the updating task. The experiments performed on different datasets show the effectiveness and the utility of the proposed method.

As future work, we plan to address the problem of determining the relaxation parameter by setting up an automatic determination function taking into account some statistical indicators on the data, such as dispersion. We also plan to extend this algorithm into an incremental version for a fast and effective way of building neighborhood graphs.

Fig. 4. Obtained neighborhood using a standard algorithm with the whole set of images

The integration of this approach and its application to more functions related to multimedia data, like automatic annotation and classification, is our objective.

REFERENCES

[1] R. M. Bolle, B.-L. Yeo, and M. M. Yeung. Video query: Beyond keywords. Research Report RC 20586, IBM, 1996.

[2] C. Carson, M. Thomas, S. Belongie, J. M. Hellerstein, and M. Jitendra. Blobworld: A system for region based image indexing and retrieval. Visual information and information systems, proceedings of the third international conference VISUAL'99, Amsterdam, The Netherlands, June 1999.

[3] S.-F. Chang, J. R. Smith, and J. Meng. Efficient techniques for feature-based image/video access and manipulation. In Data Processing Clinic, pages 0–, 1996.


Fig. 5. Execution time variations of 10 local insertions

[4] C. Faloutsos, R. Barber, M. Flickner, J. Hafner, W. Niblack, D. Petkovic, and W. Equitz. Efficient and effective querying by image content. J. Intell. Inf. Syst., 3(3/4):231–262, 1994.

[5] E. Fix and J. L. Hodges. Discriminatory analysis - non parametric discrimination: consistency properties. Technical Report 21-49-004, USAF School of Aviation Medicine, Randolph Field, Texas, 1951.

[6] M. Flickner, H. S. Sawhney, J. Ashley, Q. Huang, B. Dom, M. Gorkani, J. Hafner, D. Lee, D. Petkovic, D. Steele, and P. Yanker. Query by image and video content: The QBIC system. IEEE Computer, 28(9):23–32, 1995.

[7] J. Fournier, M. Cord, and S. Philipp-Foliguet. RETIN: A content-based image indexing and retrieval system. Pattern Anal. Appl., 4(2-3):153–173, 2001.

[8] K. R. Gabriel and R. R. Sokal. A new statistical approach to geographic variation analysis. Systematic Zoology, 18:259–278, 1969.

[9] J. Katajainen. The region approach for computing relative neighborhood graphs in the Lp metric. Computing, 40:147–161, 1988.

[10] P. M. Kelly, T. M. Cannon, and D. R. Hush. Query by image example: The comparison algorithm for navigating digital image databases (CANDID) approach. In Storage and Retrieval for Image and Video Databases (SPIE), pages 238–248, 1995.

[11] M. La Cascia and E. Ardizzone. Jakob: Just a content based query system for video databases. ICASSP, Atlanta, May 1996.

[12] W. Li, S. Gauch, J. M. Gauch, and K. M. Pua. VISION: A digital video library. In Digital Libraries, pages 19–27, 1996.

[13] T. M. Mitchell. Machine learning meets natural language. In EPIA, page 391, 1997.

[14] S. A. Nene, S. K. Nayar, and H. Murase. Columbia Object Image Library (COIL-100). Technical Report CUCS-006-96, February 1996.

[15] W. Niblack, R. Barber, W. Equitz, M. Flickner, E. H. Glasman, D. Petkovic, P. Yanker, C. Faloutsos, and G. Taubin. The QBIC project: Querying images by content, using color, texture, and shape. In Storage and Retrieval for Image and Video Databases (SPIE), pages 173–187, 1993.

[16] V. E. Ogle and M. Stonebraker. Chabot: Retrieval from a relational database of images. IEEE Computer, 28(9):40–48, 1995.

[17] A. Pentland, R. W. Picard, and S. Sclaroff. Photobook: Tools for content-based manipulation of image databases. In Storage and Retrieval for Image and Video Databases (SPIE), pages 34–47, 1994.

[18] F. Preparata and M. I. Shamos. Computational Geometry: An Introduction. Springer-Verlag, New York, 1985.

[19] L. A. Rowe, J. S. Boreczky, and C. A. Eads. Indexes for user access to large video databases. In Storage and Retrieval for Image and Video Databases (SPIE), pages 150–161, 1994.

[20] Y. Rui and T. S. Huang. Image retrieval: current techniques, promising directions, and open issues. Journal of Visual Communication and Image Representation, 10(1):39–62, March 1999.

[21] M. Scuturici, J. Clech, V. M. Scuturici, and D. Zighed. Topological representation model for image database query. Journal of Experimental and Theoretical Artificial Intelligence (JETAI), pages 145–160, 2004.

[22] J. R. Smith and S. F. Chang. Querying by color regions using the VisualSEEk content-based visual query system. M. T. Maybury, editor, Intelligent Multimedia Information Retrieval. AAAI Press, 1997.

[23] W. D. Smith. Studies in computational geometry motivated by mesh generation. PhD thesis, Princeton University, 1989.

[24] G. T. Toussaint. The relative neighbourhood graph of a finite planar set. Pattern Recognition, 12:261–268, 1980.

[25] G. T. Toussaint. Some unsolved problems on proximity graphs. D. W. Dearholt and F. Harary, editors, Proceedings of the First Workshop on Proximity Graphs. Memoranda in Computer and Cognitive Science MCCS-91-224, Computing Research Laboratory, New Mexico State University, Las Cruces, 1991.

[26] R. C. Veltkamp and M. Tanase. Content-based image retrieval systems: A survey. Technical Report UU-CS-2000-34, Department of Computing Science, Utrecht University, 2000.

[27] H. J. Zhang, C. Y. Low, S. W. Smoliar, and J. H. Wu. Video parsing, retrieval and browsing: An integrated and content-based solution. Proc. of ACM Multimedia'95, San Francisco, November 1995.


Mining Data Streams for Frequent Sequences Extraction

Alice Marascu and Florent Masseglia

INRIA Sophia Antipolis
AxIS Project/Team
2004 route des Lucioles - BP 93, 06902 Sophia Antipolis, France
Email: amarascu,[email protected]

Abstract— In recent years, emerging applications have introduced new constraints for data mining methods. These constraints are particularly linked to new kinds of data that can be considered as complex data. One typical kind of such data is known as data streams. In data stream processing, memory usage is restricted, new elements are generated continuously and have to be considered as fast as possible, no blocking operator can be performed and the data can be examined only once. At this time and to the best of our knowledge, no method has been proposed for mining sequential patterns in data streams. We argue that the main reason is the combinatorial phenomenon related to sequential pattern mining. In this paper, we propose an algorithm based on sequence alignment for mining approximate sequential patterns in Web usage data streams. To meet the constraint of one scan, a greedy clustering algorithm associated with an alignment method is proposed. We will show that our proposal is able to extract relevant sequences with very low thresholds.
Keywords: data streams, sequential patterns, web usage mining, clustering, sequence alignment.

I. INTRODUCTION

The problem of mining sequential patterns from a large static database has been widely addressed [1], [10], [13], [15], [9]. The extracted relationships are known to be useful for various applications such as decision analysis, marketing, usage analysis, etc. In recent years, emerging applications such as network traffic analysis, intrusion and fraud detection, web clickstream mining or analysis of sensor data (to name a few) introduced new constraints for data mining methods. Beside those applications appeared the concept of complex data. Among those new kinds of complex data, an increasing amount of work is being done on data streams. Data stream processing has to satisfy the following constraints: memory usage is restricted, new elements are generated continuously and have to be considered as fast as possible, no blocking operator can be performed and the data can be examined only once. Hence, many methods have been proposed for mining items or patterns from data streams [5], [2], [4]. One potential characteristic of complex data is their low reliability and the approximation that has to be accepted in some cases. For the field of mining data streams, the main problem was at first to satisfy the constraints of the data stream environment and provide efficient methods for extracting patterns as fast as possible. For this purpose,

approximation has been recognized as a key feature [6]. Then, recent methods [3], [7], [14] introduced different principles for managing the history of the extracted pattern frequencies. The main idea is that people are often more interested in recent changes. [7] introduced the logarithmic tilted time window for storing pattern frequencies with a fine granularity for recent changes and a coarse granularity for long term changes. In [14] the frequencies are represented by a regression-based scheme and a particular technique is proposed for segment tuning and relaxation (merging old segments to save main memory). However, at this time and to the best of our knowledge, no method has been proposed for mining sequential patterns in data streams. We argue that the main reason is the combinatorial phenomenon related to sequential pattern mining. Actually, if itemset mining relies on a finite set of possible results (the set of combinations between items recorded in the data), this is not the case for sequential patterns, where the set of results is infinite. In fact, due to the temporal aspect of sequential patterns, an item can be repeated without limitation, leading to an infinite number of potential frequent sequences. FTP-DS [14], for instance, is designed for mining inter-transaction patterns from data streams. Actually, the patterns extracted in their framework are itemsets, and this work does not address the extraction of sequences as we propose to do. In this paper, we propose the SMDS (Sequence Mining in Data Streams) algorithm, which is based on sequence alignment (such as [9], [8] have already proposed for static databases) for mining approximate sequential patterns in data streams. The goal of this paper is first to show that classic sequential pattern mining methods cannot be included in a data stream environment because of their complexity, and then to propose a solution. The proposed algorithm is implemented and tested over real and synthetic datasets. Our data come from the access log files of Inria Sophia-Antipolis. We will thus show the efficiency of our mining scheme for Web usage data streams, though our method might be applied to any kind of sequential data. Analyzing the behavior of a Web site's users, also known as Web Usage Mining, is a research field which consists in adapting the data mining methods to the records of access log files. These files collect data such as the IP address of the connected host, the requested URL, the date and other information regarding the navigation of the user.


Web Usage Mining techniques provide knowledge about the behavior of the users in order to extract relationships from the recorded data. Among the available techniques, sequential patterns are particularly well adapted to log study. Extracting sequential patterns from a log file is supposed to provide this kind of relationship: "On the Inria's Web Site, 10% of users visited consecutively the homepage, the available positions page, the ET1 offers, the ET missions and finally the past ET competitive selection". We want to extract typical behaviors from clickstream data and show that our algorithm meets the time constraints of a data stream environment and can be included in a data stream process at a negligible cost. The rest of this paper is organized as follows. The definitions of Sequential Pattern Mining and Web Usage Mining are given in Section II.

II. DEFINITIONS

In this section we define the sequential pattern mining problem in large databases and give an illustration. Then we explain the goals and techniques of Web Usage Mining with sequential patterns.

A. Sequential Pattern Mining

The problem of mining sequential patterns from static databases is defined as follows [1]:

Definition 1: Let I = {i1, i2, ..., im} be a set of m literals (items). I is a k-itemset where k is the number of items in I. A sequence is an ordered list of itemsets denoted by < s1 s2 . . . sn > where sj is an itemset. The data-sequence of a customer c is the sequence in D corresponding to customer c. A sequence < a1 a2 . . . an > is a subsequence of another sequence < b1 b2 . . . bm > if there exist integers i1 < i2 < . . . < in such that a1 ⊆ bi1, a2 ⊆ bi2, . . . , an ⊆ bin.

Example 1: Let C be a client and S = < (c) (d e) (h) > be that client's purchases. S means that "C bought item c, then he bought d and e at the same moment (i.e. in the same transaction) and finally bought item h".

Definition 2: The support of a sequence s, also called supp(s), is defined as the fraction of total data-sequences that contain s. If supp(s) ≥ minsupp, with a minimum support value minsupp given by the user, s is considered as a frequent sequential pattern.

The problem of sequential pattern mining is thus to find all the frequent sequential patterns as stated in definition 2.

B. From Web Usage to Data Streams

For classic Web usage mining methods, the general idea is similar to the principle proposed in [11]. Raw data is collected in access log files by Web servers. Each input in the log file illustrates a request from a client machine to the server (http daemon).

Definition 3: Let Log be a set of server access log entries. An entry g, g ∈ Log, is a tuple g = < ipg, ([lg1.URL, lg1.time] ... [lgm.URL, lgm.time]) > such that, for 1 ≤ k ≤ m, lgk.URL is the item asked for by the user g at time lgk.time and, for all 1 ≤ j < k, lgk.time > lgj.time.

1 ET: Engineers, Technicians

TABLE I
FILE OBTAINED AFTER A PRE-PROCESSING STEP

Client   d1   d2   d3   d4   d5
1        a    c    d    b    c
2        a    c    b    f    c
3        a    g    c    b    c

The structure of a log file, as described in definition 3, is close to the "Client-Time-Item" structure used by sequential pattern algorithms. In order to extract frequent behaviors from a log file, for each g in the log file, we first have to transform ipg into a client number and, for each record k in g, lgk.time is transformed into a time number and lgk.URL is transformed into an item number.

Table I gives an example of a file obtained after that pre-processing. To each client corresponds a series of times and the URL requested by the client at each time. For instance, client 2 requested the URL "f" at time d4. The goal is thus, according to definition 2 and by means of a data mining step, to find the sequential patterns in the file that can be considered as frequent. The result may be, for instance, <(a)(c)(b)(c)> (with the file illustrated in table I and a minimum support given by the user of 100%). Such a result, once mapped back into URLs, strengthens the discovery of a frequent behavior common to n users (with n the threshold given for the data mining process) and also gives the sequence of events composing that behavior.
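As an illustration of Definitions 1 and 2 on this example, the subsequence test and the support computation could be sketched as follows (representing a sequence as a list of itemsets):

def is_subsequence(small, big):
    # small is a subsequence of big if its itemsets can be mapped, in order,
    # to itemsets of big that contain them (greedy matching is sufficient here).
    i = 0
    for itemset in big:
        if i < len(small) and small[i] <= itemset:
            i += 1
    return i == len(small)

def support(candidate, data_sequences):
    # Fraction of data-sequences that contain the candidate sequence.
    return sum(1 for s in data_sequences if is_subsequence(candidate, s)) / len(data_sequences)

# The three client sequences of Table I and the pattern <(a)(c)(b)(c)>:
clients = [
    [frozenset("a"), frozenset("c"), frozenset("d"), frozenset("b"), frozenset("c")],
    [frozenset("a"), frozenset("c"), frozenset("b"), frozenset("f"), frozenset("c")],
    [frozenset("a"), frozenset("g"), frozenset("c"), frozenset("b"), frozenset("c")],
]
pattern = [frozenset("a"), frozenset("c"), frozenset("b"), frozenset("c")]
assert support(pattern, clients) == 1.0   # the 100% minimum support is met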

Nevertheless, most methods which were designed for mining patterns from access log files cannot be applied to a data stream coming from web usage data (such as clickstreams). In our context, we consider that large volumes of usage data are arriving at a rapid rate. Sequences of data elements are continuously generated and we aim at identifying representative behaviors. We assume that the mapping of URLs and clients as well as the data stream management are performed simultaneously. Furthermore, as stated by [12], sequential pattern extraction, when applied to Web access data, is effective only if the support is very low. A low support means a long response time, and the authors proposed a divisive approach to extract sequential patterns on similar navigations (in order to get highly significant patterns). Our goal in this work is close to that of [12], since we will provide a navigation clustering scheme designed to facilitate the discovery of interesting sequences, while meeting the need for rapid execution times involved in data stream processing.

III. SMDS: MOTIVATION AND PRINCIPLE

Our method relies on a batch environment (widely inspired from [7]) and the prefix tree structure of PSP [10] (for managing frequent sequences). We first study the limitations of a sequential pattern mining algorithm that would be integrated in a data stream context. Then, we propose our framework, based on a sequence alignment principle.


A. Limits of Sequential Pattern Mining

Fig. 1. Limits of a framework involving PSP

Our method will process the data stream as batches of fixed size. Let B1, B2, ..., Bn be the batches, where Bn is the most recent batch of transactions. The principle of SMDS will be to extract frequent sequential patterns from each batch b in [B1..Bn] and to store the frequent approximate sequences in a prefix tree structure (inspired from [10]). Let us consider that the frequent sequences are extracted with a classic exhaustive method (designed for a static transaction database). We argue that such a method will have at least one drawback, leading to a blocking operator. Let us consider the example of the PSP [10] algorithm. We have tested this algorithm on databases containing only two sequences (s1 and s2). Both sequences are equal and contain repetitions of itemsets having length one. The first database contains 11 repetitions of the itemsets (1)(2) (i.e. s1 = < (1)(2)(1)(2)...(1)(2) >, length(s1) = 22, and s2 = s1). The number of candidates generated at each scan is reported in figure 1. Figure 1 also reports the number of candidates for databases of sequences having length 24, 26 and 28. For the base of sequences having length 28, the memory was exceeded and the process could not succeed. We made the same observation for PrefixSpan² [13], where the number of intermediate sequences was similar to that of PSP on the same mere databases. If this phenomenon is not blocking for methods extracting the whole exact result (one can select the appropriate method depending on the dataset), the integration of such a method in a data stream process is impossible because the worst case can appear in any batch³.

B. Principle

The outline of our method is the following: for each batch of transactions, we discover clusters of users (grouped by behavior) and then analyze their navigations by means of a sequence alignment process. This allows us to obtain clusters of behaviors representing the current usage of the Web site. For each cluster having size greater than minSize (specified by the user) we store only the summary of the cluster.

² http://www-sal.cs.uiuc.edu/~hanj/software/prefixspan.htm
³ In a web usage pattern, for instance, numerous repetitions of requests for pdf or php files are usual.

Step 1:
S1   : <(a,c) (e) () (m,n)>
S2   : <(a,d) (e) (h) (m,n)>
SA12 : (a:2, c:1, d:1):2 (e:2):2 (h:1):1 (m:2, n:2):2

Step 2:
SA12 : (a:2, c:1, d:1):2 (e:2):2 (h:1):1 (m:2, n:2):2
S3   : <(a,b) (e) (i,j) (m)>
SA13 : (a:3, b:1, c:1, d:1):3 (e:3):3 (h:1, i:1, j:1):2 (m:3, n:2):3

Step 3:
SA13 : (a:3, b:1, c:1, d:1):3 (e:3):3 (h:1, i:1, j:1):2 (m:3, n:2):3
S4   : <(b) (e) (h,i) (m)>
SA14 : (a:3, b:2, c:1, d:1):4 (e:4):4 (h:2, i:2, j:1):3 (m:4, n:2):4

Fig. 2. Different steps of the alignment

This summary is given by the aligned sequence obtained from the sequences of that cluster.

A Greedy Approach for Clustering Sequences

The clustering scheme we have developed is a straightforward, naive algorithm. It is based on the fact that navigations on a Web site are usually either very similar to or very different from each other. Basically, users interested in the pages related to job opportunities are usually not likely to request pages related to the next seminar organized by the Inria unit of Sophia-Antipolis. In order to cluster the navigations as fast as possible, our greedy approach is thus based on the following scheme: the algorithm is initialized with one cluster containing the first navigation. Then, for each navigation s in the batch, s is compared to each cluster c. As soon as s is found to be similar to at least one sequence of c, s is inserted in c. If s has been inserted in no cluster, a new cluster is created and s is inserted in it. The similarity between two sequences is given in definition 4 (s is inserted in c if the following condition holds: ∃ sc ∈ c such that sim(s, sc) ≥ minSim, with minSim given by the user).

Definition 4: Let s1 and s2 be two sequential patterns. Let |LCS(s1, s2)| be the length of the longest common subsequence between s1 and s2. The similarity sim(s1, s2) between s1 and s2 is defined as follows:

sim(s1, s2) = 2 × |LCS(s1, s2)| / (length(s1) + length(s2)).
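A minimal sketch of this similarity and of the greedy clustering scheme, assuming each navigation is given as a list of hashable items (e.g. URL identifiers) and that a cluster grows when the similarity reaches at least minSim:

def lcs_length(s1, s2):
    # Classic dynamic-programming longest common subsequence length.
    prev = [0] * (len(s2) + 1)
    for x in s1:
        cur = [0]
        for j, y in enumerate(s2, 1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]

def sim(s1, s2):
    return 2.0 * lcs_length(s1, s2) / (len(s1) + len(s2))

def greedy_clustering(batch, min_sim):
    # Assign each navigation to the first cluster containing a similar sequence,
    # or open a new cluster when none matches.
    clusters = []
    for nav in batch:
        for cluster in clusters:
            if any(sim(nav, member) >= min_sim for member in cluster):
                cluster.append(nav)
                break
        else:
            clusters.append([nav])
    return clusters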

Sequences Alignment

The clustering algorithm ends with clusters of similar sequences, which is a key element for sequence alignment. The alignment of sequences leads to a weighted sequence represented as follows: SA = < I1 : n1, I2 : n2, ..., Ir : nr > : m. In this representation, m stands for the total number of sequences involved in the alignment. Ip (1 ≤ p ≤ r) is an itemset represented as (xi1 : mi1, ..., xit : mit), where mit is the number of sequences containing the item xit at the p-th position in the aligned sequences. Finally, np is the number of occurrences of itemset Ip in the alignment. Example 2 describes the alignment process on 4 sequences. Starting from two sequences, the alignment begins with the insertion of empty itemsets (at the beginning, the end or inside the sequence) until both sequences contain the same number of itemsets.

Example 2: Let us consider the following sequences:


S1 = < (a,c) (e) (m,n) >, S2 = < (a,d) (e) (h) (m,n) >, S3 = < (a,b) (e) (i,j) (m) >, S4 = < (b) (e) (h,i) (m) >. The steps leading to the alignment of those sequences are detailed in Figure 2. At first, an empty itemset is inserted in S1. Then S1 and S2 are aligned in order to provide SA12. The alignment process is then applied to SA12 and S3. The alignment method goes on processing two sequences at each step. At the end of the alignment process, the aligned sequence (SA14 in figure 2) is a summary of the corresponding cluster. The approximate sequential pattern can be obtained by specifying k: the number of occurrences an item needs in order to be displayed. For instance, with the sequence SA14 from figure 2 and k = 2, the filtered aligned sequence will be <(a,b)(e)(h,i)(m,n)> (corresponding to items having a number of occurrences greater than or equal to k).
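The filtering step can be sketched as follows, representing a weighted sequence as a list of item-to-count mappings (a representation we choose only for illustration):

def filter_aligned(weighted_sequence, k):
    # Keep, in each position, only the items occurring at least k times.
    filtered = []
    for itemset_counts in weighted_sequence:
        kept = sorted(item for item, count in itemset_counts.items() if count >= k)
        if kept:
            filtered.append(kept)
    return filtered

# SA14 from Figure 2, filtered with k = 2, yields <(a,b)(e)(h,i)(m,n)>.
sa14 = [{'a': 3, 'b': 2, 'c': 1, 'd': 1}, {'e': 4}, {'h': 2, 'i': 2, 'j': 1}, {'m': 4, 'n': 2}]
assert filter_aligned(sa14, 2) == [['a', 'b'], ['e'], ['h', 'i'], ['m', 'n']]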

Sequences Storage and Management

The aligned sequences obtained from the previous step are stored in a prefix tree similar to that of [10]. If a new sequence s has been discovered, then the tree is modified to store this new sequence. Otherwise, if s is already in the tree, then the support of s is updated. Figure 3 gives an example of a prefix tree where transaction cutting is captured by using labelled edges. Each path from the root to any node in the tree stands for an extracted sequence. The tree from figure 3 contains 6 sequences (<(a c)>, <(a d)>, <(b)>, <(c d)>, <(c)(e)>, <(d)(a)>). Any path from the root to a leaf stands for a sequence and, considering a single branch, each node at depth l (k ≥ l) captures the l-th item of the sequence. Transaction cutting is captured by using labelled edges. For instance, the dashed link between nodes c and e in figure 3 illustrates the fact that e is not in the same itemset as c. Each node is provided with k, the filter used to obtain this aligned sequence from the corresponding cluster. Example 3 gives an illustration of the sequence support management.

Example 3: Let us consider the sequence s1 = <(a)> from figure 3. The support of the aligned sequence s1 is unknown (−1). This means that s1 has been extracted in more than one cluster. Let us consider the sequence s2 = <(d)(a)>. The support of the aligned sequence s2 is 3 (meaning that the corresponding cluster contains this aligned sequence with a filter k = 3).

Fig. 3. Example of a prefix tree
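One possible realization of such a prefix tree with labelled edges is sketched below; the node layout and the support update policy (here, summing) are our assumptions, not a description of the PSP structure itself:

class PrefixTreeNode:
    def __init__(self):
        self.children = {}   # (edge_label, item) -> PrefixTreeNode
        self.support = None  # support of the sequence ending at this node, if any

def insert_sequence(root, sequence, support):
    # sequence is a list of itemsets; an edge labelled "same" keeps the child in
    # the parent's itemset, "next" marks a transaction cut (dashed edge in Fig. 3).
    node = root
    for itemset in sequence:
        for pos, item in enumerate(sorted(itemset)):
            label = "same" if pos > 0 else "next"
            node = node.children.setdefault((label, item), PrefixTreeNode())
    node.support = support if node.support is None else node.support + support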

As we wrote in section I, the first challenge of mining data streams was to extract patterns as fast as possible in order to adapt to the speed of the streams. Then the history of frequencies was considered and tilted time windows were proposed [7], [3]. However, no particular effort has been made on extracting temporal relationships between items in data streams (sequences, sequential patterns). Even if our main goal was to show that such patterns could be extracted with SMDS, we have provided our method with logarithmic tilted time windows. Let fS(i, j) denote the frequency of a sequence S in B(i,j) = ∪k=i..j Bk, with Bk the k-th batch of transactions. Let Bn be the current batch; the logarithmic tilted time windows allow us to store the set of frequencies [f(n, n); f(n−1, n−1); f(n−2, n−3); f(n−4, n−7), ...] and to save main memory. Frequencies are shifted in the tilted time window when updating with a new batch B. For this purpose, fS(B) replaces f(n, n), f(n, n) replaces f(n−1, n−1) and so on. When two windows are old enough, their frequencies are merged into a single window. This principle gives more importance to recent events (their frequencies are stored at a fine level of granularity) and less importance to past events (their frequencies are stored with a precision getting less and less fine). Tail pruning is also implemented: in our version, tail frequencies (the oldest records) are dropped when their timestamp is older than a fixed timestamp given by the user (e.g. we only store frequencies for the last 100,000 batches, which requires log2(100,000) ≈ 17 units of time).
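A simple way to maintain such a logarithmic tilted time window is sketched below; the exact merging and pruning policy of [7] differs in its details, so this is only an approximation of the scheme described above: each level i keeps frequencies over 2^i batches and, when it holds more than two of them, the two oldest are merged and promoted to the next level.

class TiltedTimeWindow:
    def __init__(self, max_levels=17):          # log2(100,000) ~ 17 units of time
        self.levels = [[] for _ in range(max_levels)]

    def add(self, frequency):
        # Insert the frequency observed on the newest batch (most recent first).
        self.levels[0].insert(0, frequency)
        for i in range(len(self.levels) - 1):
            if len(self.levels[i]) > 2:
                merged = self.levels[i].pop() + self.levels[i].pop()  # two oldest
                self.levels[i + 1].insert(0, merged)
            else:
                break
        # Tail pruning: frequencies pushed beyond the last level are dropped.
        if len(self.levels[-1]) > 2:
            self.levels[-1].pop()

    def frequencies(self):
        return [f for level in self.levels for f in level]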

The SMDS algorithm described in this paper is given below.

Algorithm 1 (SMDS)
Input: B = ∪i=0..∞ Bi: an infinite set of batches of transactions; minSize: the minimum size of a cluster that has to be summarized; minSim: the minimum similarity between two sequences in order to consider growing a cluster; k: the filter for the sequence alignment method.
Output: the updated prefix tree structure of approximate frequent sequences.

While (B) Do
    b ← NextBatch();
    // 1) Obtain clusters of size > minSize
    C ← Clustering(b, minSize, minSim);
    // 2) Summarize each cluster with filter k
    Foreach (c ∈ C) Do
        SAc ← Alignment(c, k);
        // 3) Store frequent sequences
        If (SAc) Then PrefixTree ← PrefixTree + SAc;
    Done
    // 4) Update tilted time windows and delete obsolete sequences
    TailPruning(PrefixTree);
Done (end Algorithm SMDS)

Complexity

Let n be the number of sequences in the batch. In the worst case, the clustering algorithm has a time complexity of O(n²). Actually, in the worst case, LCS is called once for the first sequence, twice for the second, and so on. The complexity is thus O(n·(n+1)/2) = O(n²). Even if our algorithm's complexity (in the worst case) might be improved, it is well adapted to Web navigation patterns, and the results obtained (see Section IV) on real datasets (access logs of Inria Sophia-Antipolis) show its effectiveness. Let us consider that all the sequences in a cluster C have length m and that |C| = p. The complexity of the alignment algorithm for a batch is then O(p·m²).

IV. EXPERIMENTS

The SMDS algorithm is written in Java and runs on a Pentium (2.1 GHz) PC under a Linux Fedora system. We evaluated our proposal on both real and synthetic data (available at http://www.almaden.ibm.com/cs/quest).

A. Feasibility and Scalability of SMDS

In order to show the efficiency of the SMDS algorithm, we report in figure 4 the time needed to extract the longest approximate sequential pattern on each batch, for the Web usage data (figure 4, top) and the synthetic data (figure 4, bottom). For the Inria Web site, the data were collected over a period of 14 months, for a size of 14 Gb. The total number of navigations is 3.5 million and the total number of items is 300,000. We cut the log into batches of 4500 transactions (an average amount of 1500 navigation sequences). For those experiments, the filter k was fixed to 30% (note that this filter has an impact on response time, since the sequences managed in the prefix tree are longer when k is low). In our experiment we have injected "parasitic" navigations into the batches. The first batch was not modified. Ten sequences containing repetitions of 2 items and having length 2 (as s1 and s2 described in Section III-A) were added to the second batch. Ten such sequences having length 3 were added to the third batch, and so on, up to ten sequences having length 30 in batch number 30. The goal is to show that a classic method (PSP, PrefixSpan) blocks the data stream whereas SMDS goes on performing the mining task. We can observe that the response time of SMDS varies from 1800 ms to 3100 ms. PSP performs very well on the first batches and is finally penalized by the noise added to the data stream (i.e. batch 19). The test has also been conducted with PrefixSpan. The execution times were greater than 6000 ms and PrefixSpan had the same exponential behavior because of the noise injected into the data stream. For both PSP and PrefixSpan, the specified minimum support was just enough to find the injected sequences of repetitions (10 sequences). We added to figure 4 the number of sequences involved in each batch in order to explain the different execution times of SMDS. We can observe, for instance, that batch number 1 contains 1500 sequences and SMDS needs 2700 ms to extract the approximate sequential patterns. For the synthetic data we generated batches of 10,000 transactions (corresponding to 500 sequences on average). The average length of the sequences was 10 and the number of items was 200,000. The filter k was fixed to 30%. We report in figure 4 (bottom) the response time and the number of sequences corresponding to each batch. We can observe that SMDS is able to handle 10,000 transactions in less than 4 seconds (e.g. batch number 2).

Fig. 4. SMDS execution time

B. Patterns Extracted on Real Data

The list of behaviors discovered by SMDS covers more than 100 navigation goals (clusters of navigation sequences) on the Web site of Inria Sophia-Antipolis. Most of the discovered patterns can be considered as "rare" but very "confident" (their support is low with respect to the global number of sequences in the batch, but the filter k used for each cluster is high). We report here a sample of two discovered behaviors:
A) k = 30%, cluster size = 13, prefix = "http://www-sop.inria.fr/omega/":
< (MC2QMC2004) (personnel/Denis.Talay/moi.html) (MC2QMC2004/presentation.html) (MC2QMC2004/dates.html) (MC2QMC2004/Call for papers.html) >
This sequence has been found on batches corresponding to June 2004, when the conference MCQMC was organized by a team of Inria Sophia Antipolis. This behavior was shared by up to 13 users (cluster size).
B) k = 30%, cluster size = 10, prefix = "http://www-sop.inria.fr/acacia/personnel/itey/Francais/Cours/":
< (programmation-fra.html) (PDF/chapitre-cplus.pdf) (cours-programmation-fra.html) (programmation-fra.html) >
This behavior corresponds to requests made for a document about programming lessons written by a member of a team from Inria Sophia Antipolis. It has been found on a batch corresponding to April 2004.


For the usage sequences of Inria Sophia Antipolis, we also observed that SMDS is able to detect the parasitic sequences that we added to the batches (long sequences containing multiple repetitions of 2 items). Those sequences are gathered together in one cluster.

C. Size of the Batches

Fig. 5. Size of the batches

Since the complexity of our algorithm depends on the number of sequences in the batch, we conducted a study of the impact of the size of the batches on the response time. For this purpose we report in figure 5 the response time when S, the size of the batch, varies from 100 to 5000 sequences. "time" stands for the response time. "clusters" stands for the number of clusters extracted by SMDS. "clusters 1%" stands for the number of clusters c such that |c| > 0.01 × S. Indeed, we argue that the number of clusters has to be filtered. We thus propose to consider only the clusters whose size is greater than a particular ratio of the number of sequences (large clusters). With 1% and a batch of 1000 sequences, for instance, a cluster c such that |c| < 10 will not be considered. "time 1%" stands for the response time needed to process only the large clusters. We can observe that the number of clusters grows linearly, but the number of large clusters remains the same after a small number of iterations. The response time is obviously linked to the size of the batches, but it is reasonable to say that the final user can choose the size of the batches depending on the desired response time.

D. Analyzing the Quality of the Clusters

In order to give a measure of the quality of our clusters, our main tool will be the distance between two sequences. Let s1 and s2 be two sequences; the distance dist(s1, s2) between s1 and s2 is based on sim(s1, s2), the similarity given in definition 4, and is such that dist(s1, s2) = 1 − sim(s1, s2). Thus, dist(s1, s2) ∈ [0..1], and dist(s1, s2) = 0 means that the sequences are the same whereas dist(s1, s2) = 1 means that s1 and s2 do not share any item. We used two main measures. The first one is the diameter of a cluster C. It stands for the largest distance between two sequences of C. A diameter of 0% shows that the cluster contains only equal sequences, whereas a diameter of 100% shows that the cluster contains at least two sequences that do not share any item. During our experiments the average diameter at the end of each batch varied from 2% to 3%.

The second measure is the "double average". It is based on the center of the cluster, which is given as follows. Let C be a cluster; the center of C is a sequence c such that:

∀s ∈ C, Σx∈C dist(s, x) ≥ Σy∈C dist(c, y).

We are thus able to give, for C, the average distance (AD) from c to all the remaining sequences of C:

AD = (Σx∈C dist(x, c)) / |C|.

We report in figure 6 some of the worst average distances obtained during our experiments. For each sequence added to a cluster, we report the new value of AD for this cluster. For instance, when the last sequence is added to cluster 1, the value of AD for cluster 1 is 22%. We can observe that AD varies from 0 (when |C| = 1, the center is the only sequence) to 50%. AD thus decreases rapidly to a value between 20% and 35%, which is a good result considering that figure 6 reports the rare clusters that are not really homogeneous. The remaining clusters are very well formed and thus give a good framework for the alignment algorithm. Indeed, we report in figure 7 the "double average" value (DA) after processing each sequence of the batch. DA is given as follows: let N be the set of clusters,

DA = (Σi∈N Σx∈Ci dist(x, ci)) / |N|,

with ci the center of Ci (the i-th cluster). We can observe in figure 7 that, for the second batch, DA rapidly grows up to almost 2% after sequence 220, then DA slightly increases to 3.7%. The final value of DA at the end of each batch is given in figure 8. We can observe that the value of DA is always between 2% and 9%. At the end of the process, the average value of DA is 4.3% (an average cluster quality of 95.7%).
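Reusing sim from the earlier clustering sketch, the two quality measures can be written as:

def dist(s1, s2):
    return 1.0 - sim(s1, s2)

def diameter(cluster):
    # Largest distance between two sequences of the cluster.
    return max(dist(a, b) for a in cluster for b in cluster)

def center(cluster):
    # The sequence minimizing the sum of distances to all the other members.
    return min(cluster, key=lambda c: sum(dist(c, x) for x in cluster))

def average_distance(cluster):
    # AD: average distance from the center to the sequences of the cluster.
    c = center(cluster)
    return sum(dist(x, c) for x in cluster) / len(cluster)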

Fig. 6. Worst clusters step by step

V. CONCLUSION

In this paper, we proposed the SMDS algorithm for extracting sequential patterns in data streams. Our method has two major features. Our first study was intended to show the main limits and problems that have to be identified and solved prior to proposing a method for this subject. Then we proposed a principle designed for rapidly processing the sequences of a data stream and extracting a meaningful summary. Our algorithm relies on a clustering and alignment method associated with an efficient structure for managing the extracted sequences and their history.


Fig. 7. Global distance step by step

Fig. 8. Global distance batch by batch

First, batches of transactions are summarized by means of a sequence alignment method. This alignment process relies on a greedy clustering algorithm that considers the main characteristics of Web usage sequences in order to provide distinct clusters of sequences. Second, the frequent sequences obtained by SMDS are stored in a prefix tree structure. Thanks to this mining scheme, SMDS is able to detect frequent behaviors shared by very small numbers of users (e.g. 13 users, or 0.5%), which is close to the difficult problem of mining sequential patterns with a very low support. Furthermore, our experiments have shown that SMDS performs fast enough to be integrated in a data stream environment at a negligible cost. We also reported the results of our experiments showing that SMDS provides clusters of quality and extracts the significant patterns of a Web usage analysis.

REFERENCES

[1] R. Agrawal and R. Srikant. Mining Sequential Patterns. In Proceedings of the 11th International Conference on Data Engineering (ICDE'95), Taiwan, March 1995.

[2] Joong Hyuk Chang and Won Suk Lee. Finding recent frequent itemsets adaptively over online data streams. In KDD '03: Proceedings of the ninth international conference on Knowledge discovery and data mining, pages 487–492, 2003.

[3] Y. Chen, G. Dong, J. Han, B. Wah, and J. Wang. Multidimensional regression analysis of time-series data streams, 2002.

[4] Graham Cormode and S. Muthukrishnan. What's hot and what's not: tracking most frequent items dynamically. ACM Trans. Database Syst., 30(1):249–278, 2005.

[5] Mayur Datar, Aristides Gionis, Piotr Indyk, and Rajeev Motwani. Maintaining stream statistics over sliding windows. In Proceedings of the thirteenth annual ACM-SIAM symposium on Discrete algorithms (SODA), pages 635–644, 2002.

[6] Minos Garofalakis, Johannes Gehrke, and Rajeev Rastogi. Querying and mining data streams: you only get one look, a tutorial. In SIGMOD '02: Proceedings of the 2002 ACM SIGMOD international conference on Management of data, 2002.

[7] C. Giannella, J. Han, J. Pei, X. Yan, and P. S. Yu. Mining Frequent Patterns in Data Streams at Multiple Time Granularities. In H. Kargupta, A. Joshi, K. Sivakumar, and Y. Yesha (eds.), Next Generation Data Mining. AAAI/MIT, 2003.

[8] Birgit Hay, Geert Wets, and Koen Vanhoof. Web Usage Mining by Means of Multidimensional Sequence Alignment Method. In WEBKDD, pages 50–65, 2002.

[9] H. Kum, J. Pei, W. Wang, and D. Duncan. ApproxMAP: Approximate mining of consensus sequential patterns. In Proceedings of SIAM Int. Conf. on Data Mining, San Francisco, CA, 2003.

[10] F. Masseglia, F. Cathala, and P. Poncelet. The PSP Approach for Mining Sequential Patterns. In Proceedings of the 2nd European Symposium on Principles of Data Mining and Knowledge Discovery, Nantes, France, September 1998.

[11] F. Masseglia, P. Poncelet, and R. Cicchetti. An efficient algorithm for web usage mining. Networking and Information Systems Journal (NIS), April 2000.

[12] F. Masseglia, Doru Tanasa, and Brigitte Trousse. Web usage mining: Sequential pattern extraction with a very low support. In 6th Asia-Pacific Web Conference, APWeb, Hangzhou, China, 2004.

[13] J. Pei, J. Han, B. Mortazavi-Asl, H. Pinto, Q. Chen, U. Dayal, and M.-C. Hsu. PrefixSpan: Mining Sequential Patterns Efficiently by Prefix-Projected Pattern Growth. In 17th International Conference on Data Engineering (ICDE), 2001.

[14] Wei-Guang Teng, Ming-Syan Chen, and Philip S. Yu. A Regression-Based Temporal Pattern Mining Scheme for Data Streams. In VLDB, pages 93–104, 2003.

[15] J. Wang and J. Han. BIDE: Efficient Mining of Frequent Closed Sequences. In Proceedings of the International Conference on Data Engineering (ICDE'04), Boston, MA, March 2004.


CaptionSearch: Mining Images from Publications

Brigitte Mathiak, Andreas Kupfer, Claudia Täubner, Silke Eckstein
Institut für Informationssysteme, TU Braunschweig, Germany
b.mathiak, a.kupfer, c.taeubner, [email protected]

Richard Münch
Institut für Mikrobiologie, TU Braunschweig, Germany
[email protected]

Abstract

One way to deal with complex data is to break it down into classes and then to define these classes' semantic connections to each other. For example, a scientific publication consists, among other things, of figures and captions, with the captions giving semantic annotations of the figures. In biology, figures are searched for because they contain the results of experiments. These results are used to assess the quality of the information. By identifying the captions and their corresponding figure counterparts, we were able to implement a search engine which searches for pictures containing experiments, thereby saving time needed for quality assessment of these experiments.

1 Introduction

A scientific document is more complex than it seems. While readers can easily deduce structure and semantics of the different characters and pictures on a page, most of this structure information is not stored and available when accessing the publication as PDF.

The basic problem of handling PDFs is that the text information is not freely available. While an HTML file stripped of its tags usually delivers legible text, even the simple task of text extraction from a PDF is rather complicated. Down to the basics, PDF is foremost a visual medium, describing for each glyph (= character or picture) where it should be printed on the page [1]. Most PDF converters simply emulate this glyph-by-glyph positioning in ASCII, HTML or other formats [2], but information about reading order and the overall semantic connection of the text is lost.

Still, since the position of all glyphs is known, the original layout can be deduced and the semantic connection can be restored. For HTML, layout information has successfully been used to improve the classification of web pages [8]. We extract the same layout information for PDF documents. To prove the viability of our approach, we used the layout information to implement a special search for pictures embedded in PDFs of scientific publications. Since the figure captions contain much information about the figure, we implemented a search engine, similar to Google Images, to find images not in regular HTML web pages, but in PDFs. We use layout information to associate the picture with the caption. So, if the user wishes to find pictures containing UML diagrams, he can just enter "UML" to find what he needs.

For evaluation, we took a problem from bioinformatics: quality assessment of pictures documenting experiments. The special problem posed by experimental data is that keyword searches in abstracts or even full text are ambiguous. The experimental procedures are rarely mentioned in the abstract, and in the full text, experimental methods belonging to the overall topic are often referenced, e.g. in the literature overview. By using the figure captions we can accurately find pictures for each experiment.

In bioinformatics, text mining is usually done by inspecting abstracts, since they are most easily available. Yet, more information can be obtained by searching through full-text papers [5]. In [16], it has been observed that analyzing the figure caption is of great value. Classifying the different sections of a paper to analyse them separately, as in [13], has been successfully attempted, but curiously, the figure captions have not been examined.

The paper starts by introducing the criteria we use to identify the captions of the figures. Section 3 describes the overall process of the example application CaptionSearch, and in Section 4 we present some results. In the last section we draw some conclusions and discuss further improvements of our search engine.

2 Finding the Caption

Assuming we have the text and the position of the text and also the images and the position of the images, we still


have the task at hand to find out which text exactly comprises the caption of a given image.

To achieve this, every text block (roughly a paragraph) is weighted according to two factors: ydiff, the y-proximity in pixels from the bottom line of the picture, and xdiff, the x-proximity from the left border of the picture. The weighting of these factors can be seen from the following equation:

weight ω = 10 · ydiff + xdiff

In order to find captions that are next to the figure or above it, we also introduced a semantic criterion. Figure captions traditionally begin with "Figure 1:" or something similar. The block that we found with the method described above is first checked for whether or not it has such a denotation. If not, we look at the other blocks in proximity and check them. If there is no "Figure" block to be found, we stick with the block right below. This is the case, for example, when the caption is prefaced with a filler (dots or a special symbol), or when the image is not scientific, for example a logo.
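As an illustration of the weighting and the "Figure" check just described, a simplified scorer might look as follows; the block and picture field names (top, bottom, left, text) are hypothetical and not those of the actual implementation.

```python
# Simplified caption-selection sketch (hypothetical field names, not the
# CaptionSearch implementation): score blocks with weight = 10*ydiff + xdiff
# and prefer a nearby block that starts like a figure caption.

def caption_weight(block, picture):
    ydiff = abs(block["top"] - picture["bottom"])  # distance to the picture's bottom line
    xdiff = abs(block["left"] - picture["left"])   # distance to the picture's left border
    return 10 * ydiff + xdiff

def find_caption(blocks, picture, candidates=5):
    ranked = sorted(blocks, key=lambda b: caption_weight(b, picture))
    for block in ranked[:candidates]:
        if block["text"].lstrip().lower().startswith(("figure", "fig.")):
            return block
    return ranked[0]  # otherwise stick with the closest block below the picture
```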

Finding the end of the caption is much less deterministic. Fortunately, most captions have a significant gap before the main text begins again. Also, captions are normally single-column even if the text is two-column. Although we do keep track of the font, and the captions are usually written in another font, we do not use this information, since there are just too many publications that use the same font and font size for both purposes. Instead we keep strictly to the layout information on where an untypically large gap between lines is.

3 CaptionSearch: Our example application

As a simple demonstration of how the layout information can be used, we decided to write a web-based search engine for images. It is called CaptionSearch, since we are technically not searching through the images, but through the caption text beneath them.

The process consists of 6 steps from the download to the actual query execution (see Fig. 1), all of which are currently implemented in Java.

Step 1 is straightforwardly downloading any kind of literature that might be interesting into a file pool. Step 2 handles the caption extraction. In order to allow for future extensions, we split the process into three parts. Part A works like any PDF to text converter and in fact is based on the Java converter PDFBox [10]. We made a few changes, like refining the coordinate calculations and changing the space detection algorithm. We also keep a lot of the meta-information that is usually lost in the process, like the bounding boxes of all the text blocks, the fonts and the font sizes. That information could also be used by a different algorithm.

Figure 1. Overview of the CaptionSearch dataflow (1. download of literature, 2. caption extraction with parts A: text extraction, B: picture extraction, C: caption identifying, 3. XML, 4. index files, 5. webserver, 6. executing user queries)

Part B extracts the pictures into files and adds fake picture text blocks that contain information on the position of the picture and the filename as a link. In part C we perform the algorithm outlined in Section 2 to get pairs of picture blocks and text blocks for each PDF.

In step 3 these pairs are written to an XML file (see the example below).

1 <?xml version="1.0" encoding="iso-8859-2"?>
2 <pdf src="10094677.pdf">
3   <img src="pics/10094677.Im4.jpg">
4     <caption>FIG. 4. DNase I footprint analysis of ...
5     </caption></img></pdf>

Each publication is represented by one file containing a PDF element (lines 2-5), which may contain many image elements (like the one in lines 3-5), which have the link to the picture as an attribute (line 3) and the caption as a further element (lines 4-5).

In step 4 we use the Digester package [4] to extract the information from the XML files into the indexer. For each figure, a virtual document is put into the indexer containing the figure caption to be indexed and the image link and PDF link as meta-information. For the indexing itself, we use the Lucene package [7], which offers fast, Java-based indexing, but also some additional functionality, like a built-in query parser and several so-called analyzers that allow us to vary how exactly the captions are indexed and what defines a term.
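The real system delegates indexing and querying to Lucene; purely to illustrate the "one virtual document per figure" idea, a toy in-memory index over the caption records from step 3 could look like this (hypothetical record layout, not the actual CaptionSearch code):

```python
from collections import defaultdict

# Toy caption index (illustration only; CaptionSearch itself uses Lucene).
# Each record is a dict with 'caption', 'image' and 'pdf' keys, i.e. the
# virtual document plus its metadata.

def build_index(records):
    index = defaultdict(set)               # term -> ids of records containing it
    for rec_id, record in enumerate(records):
        for term in record["caption"].lower().split():
            index[term].add(rec_id)
    return index

def search(index, records, keywords):
    """Return the records whose captions contain all of the given keywords."""
    hits = set.intersection(*(index.get(k.lower(), set()) for k in keywords))
    return [records[i] for i in sorted(hits)]

records = [{"caption": "FIG. 4. DNase I footprint analysis of ...",
            "image": "pics/10094677.Im4.jpg", "pdf": "10094677.pdf"}]
print(search(build_index(records), records, ["footprint"]))
```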

For step 5 we set up a Tomcat webserver [14], using Java servlets [3] to produce the website and to present the query results. In step 6, all queries are executed by a servlet that uses Lucene to fetch the results from the index files and builds a new web page to display the results according to the pre-selected schema (see Fig. 2).


Figure 2. A screenshot from our search engine

4 Results

In the last few years, the number of biological databases has grown exponentially, from 548 in January 2004 to 719 in January 2005 [6]. Yet, one of the most time-consuming tasks when setting up new databases is the annotation of literature. There already exist a number of very interesting search engines. Via PubMed [11], for example, almost every abstract ever written in the life sciences can be searched. Unfortunately, abstracts often do not mention the exact methods that were used, so for databases that contain experimental data, like the PRODORIC database [9], literature annotation becomes the proverbial search for the needle in the haystack.

The PRODORIC database contains very special data like DNA binding sites of prokaryotic transcriptional regulators. This data is generated via specific experiments like DNAse I footprints or ElectroMobility gel Shift Assays (EMSA) that are generally not mentioned in the abstracts of scientific literature. The search for general keywords classifying this comprehensive field, like "gene regulation", "promoter" or "binding site", results in over 150,000 hits, and even with additional refinement only 10-20% contain appropriate data. Therefore it is necessary to screen all the hits manually to obtain literature references suitable for database annotation. Of these, those are especially valuable that contain pictures

of the DNAse I footprint or EMSA assay, because they represent verified information of high quality. This quality assessment can be important for further exploration of the subject.

Out of 188 papers that were known to contain information about DNA binding sites, 170 did contain extractable pictures. All in all there were 1416 pictures in the PDFs, of which 586 could not be extracted, but we identified and indexed the caption anyhow. The relatively high number of pictures is mostly due to the fact that some PDF producing programs use pictures for symbols like list bullets. A random sample of 236 captions showed that 15% of the found captions were just random text pieces, like page numbers or single sentences, mostly due to the said symbol pictures. 6% were wrong text, that is, whole paragraphs of text just not belonging to the figure. The main reasons here were again symbol pictures, which naturally have no genuine caption, and also some cases of figure captions written left or right of the picture. We had 3 cases of duplication, where one figure was internally composed of several pictures, all of which rightly claimed the same caption. We had text conversion problems with only 3 out of these 236 captions. In one no spaces were found, in the second one some spaces were missing, and in the third one some symbols like ∆ were converted into wrong strings. We had only one case where the end of the caption was not found correctly. For a summary confer Table 1.

No. of papers: 188                     No. of papers containing pictures: 170 (90%)
No. of pictures: 1416                  No. of extractable pictures: 830 (59%)
No. of captions in sample: 236         Short text pieces: 37 (15%)   Wrong text: 17 (7%)   Wrong conversion: 3 (1%)

Table 1. Results of Evaluation

To search for DNAse I footprints, we used the keywords "footprint", "footprinting" and "DNAse". Overall, 184 hits were scored, of which 163 actually showed experimental data. As a byproduct, the thumbnails mostly sufficed to make a fast quality assessment. Another positive effect was that the data was available much faster than with the usual method of opening each PDF independently.

The search for EMSAs was a little bit more difficult, since there exists a wide range of naming possibilities. The most significant terms in those names were "shift", "mobility", "EMSA" and "EMS" to catch "EMS assay". We had 91 hits, of which 81 were genuine.

Recall could not be tested thoroughly, due to the sheer number of pictures and the limited time of experts, but the random sample did not include pictures that would not have


been found by the keywords, which suggests a rather highrecall.

We are still in an early test phase with regard to user acceptance. For legal reasons we cannot just put the information on the Web and see what is coming; instead, we only have our local biologist work group as users, who supply us with the literature anyway.

5 Conclusion and Future Work

Although we have just started to explore the possibilities of layout-enhanced analysis of PDF files, our first application looks promising. To bypass the legal problems posed by the copyrights, we plan on publicizing a demo version that does not link to the full text, but to the PubMed entry instead. We also plan to cross-check whether or not the full texts are available on the Internet and then give appropriate addresses, so users can download from the source. This demo version should then be able to give some data about user behaviour.

We are in the process of finding more areas of application for our search engine as we broaden our spectrum of functionalities, especially those mentioned in the sections above, like finding the context a certain figure is mentioned in (e.g. "...as you can see in Fig. 2...") to add more text to be searched through and presented to the user, and the extraction of more pictures.

The groundwork of knowing the layout of the publication can also be used for other purposes. In the long term, we are working on reading order recognition to improve shallow parsing, which is still a problem in text mining applications [12]. Also, we are investigating the feasibility of a "table search engine" similar to what has already been investigated by [15] for HTML web pages. The overall goal is to make PDF a multi-functional format that can be used with any kind of text mining application just as easily as HTML or plain text.

References

[1] PDF Reference, Fourth Edition. Adobe Network Solutions, http://partners.adobe.com/asn/acrobat/sdk/publicdocs/PDFReference15v6.pdf, 2004.

[2] BCL Jade. http://www.bcltechnologies.com/document/products/jade/jade.htm, 2004.

[3] D. Coward and Y. Yoshida. Java Servlet Specification. http://jcp.org/aboutJava/communityprocess/final/jsr154/index.html, 2003.

[4] Digester. http://jakarta.apache.org/commons/digester/, 2005.

[5] L. C. Faulstich, P. F. Stadler, C. Thurner, and C. Witwer. LitSift: Automated text categorization in bibliographic search. In Data Mining and Text Mining for Bioinformatics, Workshop at the ECML/PKDD 2003, 2003.

[6] M. Y. Galperin. The Molecular Biology Database Collection: 2005 update. Nucleic Acids Research, 33(Database-Issue):5–24, 2005.

[7] E. Hatcher and O. Gospodnetic. Lucene in Action. Manning Publications, 2004.

[8] M. Kovacevic, M. Diligenti, M. Gori, and V. Milutinovic. Visual Adjacency Multigraphs - a Novel Approach to Web Page Classification. In Proceedings of the SAWM04 workshop, ECML 2004, 2004.

[9] R. Munch, K. Hiller, H. Barg, H. Heldt, S. Linz, E. Wingender, and D. Jahn. PRODORIC: prokaryotic database of gene regulation. Nucleic Acids Research, 31(1):266–269, 2003.

[10] PDFBox. http://www.pdfbox.org/ or http://sourceforge.net/, 2004.

[11] PubMed. http://www.ncbi.nlm.nih.gov/pubmed/, 2004.

[12] S. Schmeier, J. Hakenberg, A. Kowald, E. Klipp, and U. Leser. Text mining for systems biology using statistical learning methods. In 3. Workshop des Arbeitskreises Knowledge Discovery, 2003.

[13] P. Shah, C. Perez-Iratxeta, P. Bork, and M. Andrade. Information extraction from full text scientific articles: Where are the keywords? BMC Bioinformatics, 2003.

[14] Tomcat. http://jakarta.apache.org/tomcat/, 2005.

[15] Y. Wang, I. T. Phillips, and R. M. Haralick. Table detection via probability optimization. In Proceedings of the 5th IAPR Workshop on Document Analysis Systems (DAS 2002), pages 272–283, 2002.

[16] A. Yeh, L. Hirschman, and A. Morgan. Evaluation of text data mining for database curation: lessons learned from the KDD Challenge Cup. Bioinformatics, 19(1), 2003.


Decision Tree Construction by Chunkingless Graph-Based Induction for Graph-Structured Data

Phu Chien Nguyen, Kouzou Ohara, Akira Mogi, Hiroshi Motoda, and Takashi Washio
Institute of Scientific and Industrial Research, Osaka University
8-1 Mihogaoka, Ibaraki, Osaka 567-0047, Japan
E-mail: {chien, ohara, mogi, motoda, washio}@ar.sanken.osaka-u.ac.jp

Abstract

A decision tree is an effective means of data classification from which one can obtain rules that are easy to understand. However, decision trees cannot be conventionally constructed for data which are not explicitly expressed with attribute-value pairs, such as graph-structured data. We have proposed a novel algorithm, named Chunkingless Graph-Based Induction (Cl-GBI), for extracting typical patterns from graph-structured data. Cl-GBI is an improved version of Graph-Based Induction (GBI), which employs stepwise pair expansion (pairwise chunking) to extract typical patterns from graph-structured data, and can find overlapping patterns that cannot be found by GBI. In this paper, we further propose an algorithm for constructing a decision tree for graph-structured data using Cl-GBI. This decision tree construction algorithm, now called Decision Tree Chunkingless Graph-Based Induction (DT-ClGBI), can construct a decision tree from graph-structured data while simultaneously constructing attributes useful for classification using Cl-GBI internally. Since patterns (subgraphs) extracted by Cl-GBI are considered as attributes of a graph, and their existence/non-existence is used as attribute values in DT-ClGBI, DT-ClGBI can be conceived as a tree generator equipped with feature construction capability. Experiments were conducted on both synthetic and real-world graph-structured datasets, showing the usefulness and effectiveness of the algorithm.

1 Introduction

Over the last few years there has been much research work on data mining, seeking better performance. Better performance includes mining from structured data, which is a new challenge. Since structure is represented by proper relations and a graph can easily represent relations, knowledge discovery from graph-structured data poses a general problem for mining from structured data.

On one hand, from this background, discovering frequent patterns of graph-structured data, i.e., frequent subgraph mining or simply graph mining, has attracted much research interest in recent years. AGM [6] and a number of other methods including AcGM [7], FSG [8], gSpan [16], etc. have been developed for the purpose of enumerating all frequent subgraphs of a graph database. However, the computation time increases exponentially with input graph size and minimum support. This is because the kernel of frequent subgraph mining is subgraph isomorphism, which is known to be NP-complete [3].

To avoid the complex subgraph isomorphism problem, heuristic algorithms which are not guaranteed to find the complete set of frequent subgraphs, such as SUBDUE [2] and GBI (Graph-Based Induction) [17], have also been proposed. They tend to find an extremely small number of patterns based on greedy search. GBI extracts typical patterns from graph-structured data by recursively chunking two adjoining nodes. Later, an improved version called B-GBI (Beam-wise Graph-Based Induction) [10], adopting beam search, was proposed to increase the search space, thus extracting more discriminative patterns while keeping the computational complexity within a tolerable level. Since the search in GBI is greedy and no backtracking is made, which patterns are extracted by GBI depends on which pairs are selected for chunking. This means that patterns that partially overlap can no longer be extracted, and thus there can be many patterns which are not extracted by GBI. B-GBI can help alleviate this problem, but cannot solve it completely because the chunking process is still involved.

To overcome the problem of overlapping patterns incurred by GBI and B-GBI, we have proposed an algorithm to extract typical patterns from graph-structured data, called Chunkingless Graph-Based Induction (Cl-GBI) [12]. Although Cl-GBI is an improved version of B-GBI, it does not employ the pair-wise chunking strategy. Instead, the most frequent pairs are regarded as new


nodes and given new node labels in the subsequent steps, but none of them is chunked. In other words, they are used as pseudo nodes, thus allowing extraction of overlapping subgraphs. It was experimentally shown that Cl-GBI can extract more typical substructures than B-GBI [12].

On the other hand, a majority of the methods widely used for data mining are for data that do not have structure and that are represented by attribute-value pairs. Decision trees [13, 14] and induction rules [11, 1] relate attribute values to target classes. Association rules often used in data mining also use this attribute-value pair representation. These methods can induce rules that are easy to understand. However, the attribute-value pair representation is not suitable to represent a more general data structure such as graph-structured data. This means that most of the useful methods in data mining are not directly applicable to graph-structured data.

In this paper, we propose an algorithm to construct a decision tree for graph-structured data using Cl-GBI. This decision tree construction algorithm, called Decision Tree Chunkingless Graph-Based Induction (DT-ClGBI), is a revised version of our previous algorithm called Decision Tree Graph-Based Induction (DT-GBI) [4, 5], and can construct a decision tree for graph-structured data while simultaneously constructing substructures used as attributes for the classification task by means of Cl-GBI instead of the B-GBI adopted in DT-GBI. In this context, substructures mean subgraphs or patterns that appear in a given graph database. Patterns extracted by Cl-GBI are regarded as attributes of graphs and their existence/non-existence is used as attribute values. Namely, DT-ClGBI does not require the user to define available substructures in advance. Since attributes (features) are constructed while a classifier is being constructed, DT-ClGBI can be conceived as a method for feature construction. Using synthetic and real-world datasets, we experimentally show that DT-ClGBI can construct decision trees from graph-structured data that achieve reasonably good predictive accuracy.

This paper is organized as follows: Section 2 briefly describes the framework of GBI, the problem caused by the nature of chunking in GBI, and the summary of the Cl-GBI algorithm. Section 3 explains DT-ClGBI and its working mechanism of how a decision tree is constructed, using a simple example. The performance of DT-ClGBI is experimentally evaluated and reported in Section 4. Finally, Section 5 concludes the paper.

2 Graph-Based Induction Revisited

2.1 Graph-Based Induction (GBI)

GBI contracts the graph by chunking the most frequent patterns into single nodes. However, this process is not executed in one step; rather, it follows a pairwise chunking (stepwise pair expansion) strategy. In the original GBI, an

Figure 1. Missing patterns due to chunking order

Figure 2. A pattern is found in one input graph but not in the other

assumption is made that typical patterns represent some concepts/substructures and "typicality" is characterized by the pattern's frequency or the value of some evaluation function of its frequency. We can use statistical indices as an evaluation function, such as frequency itself, Information Gain [13], or Gain Ratio [14], all of which are based on frequency. It is of interest to note that GBI was later improved to use two criteria, one based on frequency measures for chunking and the other for finding discriminative patterns after chunking. The reader is referred to [17, 10] for a detailed mathematical treatment of GBI.

It is possible to extract typical patterns of various sizes by repeating this chunking process. Since the search in GBI is greedy and no backtracking is allowed, which patterns (subgraphs) are extracted depends on which pair is selected for chunking. This is because in enumerating pairs no pattern which has been chunked into one node is restored to the original pattern. Therefore, not all the "typical patterns" that exist in the input graph are necessarily extracted, and patterns that partially overlap are never generated, i.e., any two patterns are either disjoint or in perfect inclusion. The problem of extracting all the isomorphic subgraphs is known to be NP-complete. Thus, GBI aims at extracting only meaningful typical patterns of a certain size. Its objective is neither finding all the typical patterns nor finding all the frequent patterns.

Fig. 1 shows an example of incomplete search in GBI. Here, if the pair B–C is selected for chunking beforehand, there is no way to extract the substructure A–B–D even if it is a typical pattern. Moreover, any subgraph that GBI can find is along the way in the chunking process. Thus, it happens that a pattern found in one input graph cannot be found in another input graph even if it does exist in that graph. An example is shown in Fig. 2, where even if the pair A–B is selected for chunking and the substructure D


–A–B–C exists in the input graphs, we may not find that substructure because an unexpected pair A–B is chunked (see Fig. 2(b)). This causes a serious problem in counting the frequency of a pattern.

An improved version of GBI, called Beam-wise Graph-Based Induction (B-GBI), adopting a beam search, was proposed to relax the problem of overlapping subgraphs described above [10]. Though the beam search helps increase the search space, thus resulting in more discriminative patterns extracted by B-GBI than by GBI, it cannot solve this problem completely because the chunking process is still involved.

2.2 Chunkingless Graph-Based Induction (Cl-GBI)

We have introduced an algorithm, named Chunkingless Graph-Based Induction (Cl-GBI) [12], to cope with the aforementioned problem of overlapping subgraphs incurred by both GBI and B-GBI. Cl-GBI employs a "chunkingless chunking" strategy, where frequent pairs are never chunked but used as pseudo nodes in the subsequent steps, thus allowing extraction of overlapping subgraphs. As in B-GBI, the Cl-GBI approach can handle both directed and undirected graphs as well as both general and induced subgraphs. It can also extract typical patterns in either a single large graph or a graph database. The algorithm of Cl-GBI is briefly described as follows.

Given a graph database, two natural numbers b (beam width) and Ne, and a frequency threshold θ, the new "chunkingless chunking" strategy repeats the following three steps Ne times, each of which is referred to as a level. Ne is thus the number of levels.

Step 1 Extract all the pairs consisting of two connected nodes in the graphs, register their positions using node id (identifier) sets, and count their frequencies. From the 2nd level on, extract all the pairs consisting of two connected nodes with at least one node being a new pseudo node.

Step 2 Select the b most frequent pairs from among the pairs extracted at Step 1 (from the 2nd level on, from among the unselected pairs in the previous levels and the newly extracted pairs). Each of the b selected pairs is registered as a new node. If either or both nodes of the selected pair are not original nodes but pseudo nodes, they are restored to the original patterns before registration.

Step 3 Assign a new label to each pair selected at Step 2, but do not rewrite the graphs. Go back to Step 1.

All the pairs extracted at Step 1 in all the levels (i.e., level 1 to level Ne), including those that are not used as pseudo nodes, are ranked based on a typicality criterion using a discriminative function such as Information Gain

Figure 3. Two different pairs representing identical patterns

Figure 4. Example of frequency counting

or Gain Ratio. Those pairs that have a frequency count below θ are eliminated, which means that there are three parameters, b, Ne, and θ, to control the search in Cl-GBI.

GBI assigns a new label to each newly chunked pair. Because it recursively chunks pairs, it happens that new pairs that have different labels turn out to be the same pattern, as illustrated in Fig. 3. B-GBI identifies whether different pairs represent the same pattern using their canonical labels [3]. Two pairs are considered to be identical only when their labels are the same. Unlike B-GBI, to count the number of occurrences of a pattern in a graph transaction, not only the canonical labeling but also the node id set is employed in Cl-GBI. If both the canonical label and the node id set are identical for two subgraphs, we regard them as the same and count them once. Without information on the node id set, it can happen that the substructure N → B in Fig. 4(a) is incorrectly counted twice as shown in Fig. 4(b), due to the presence of the two pseudo nodes N1 and N2.

The output of the Cl-GBI algorithm is a set of ranked typical patterns, each of which comes together with the positions of every occurrence of the pattern in each transaction of the graph database (given by the node id sets) as well as the number of occurrences.
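For concreteness, the following is a much simplified sketch of the pair enumeration and selection performed at level 1, using document frequency as in the working example of Section 3.3. It is an assumption-laden illustration, not the authors' implementation: edge labels, pseudo-node bookkeeping for later levels, and canonical labeling are all omitted.

```python
from collections import defaultdict

# Simplified level-1 sketch of Cl-GBI (illustration only): a graph is
# (node_labels, edges) with node_labels = {node_id: label} and edges a list
# of (src, dst) tuples.  Edge labels and pseudo-node handling are omitted.

def clgbi_level1(graphs, b, theta):
    freq = defaultdict(int)        # (src_label, dst_label) -> document frequency
    positions = defaultdict(list)  # pair -> list of (graph_idx, (src, dst))
    for g_idx, (node_labels, edges) in enumerate(graphs):
        pairs_here = set()
        for src, dst in edges:
            pair = (node_labels[src], node_labels[dst])
            positions[pair].append((g_idx, (src, dst)))
            pairs_here.add(pair)
        for pair in pairs_here:
            freq[pair] += 1
    # Step 2: drop pairs below the frequency threshold, keep the b most frequent.
    kept = sorted((p for p, f in freq.items() if f / len(graphs) >= theta),
                  key=lambda p: -freq[p])[:b]
    # Step 3 would assign a new pseudo-node label to each kept pair without
    # rewriting the graphs; that bookkeeping is not shown here.
    return kept, positions
```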

3 Decision Tree Chunkingless Graph-Based Induction (DT-ClGBI)

3.1 Decision Tree for Graph-Structured Data

As mentioned in Section 1, the attribute-value pair representation is not suitable for graph-structured data, although both attributes and their values are essential for a classification or prediction task because a class is related to some attribute values in most cases. In a decision tree, each node and a branch connecting the node to its child node correspond to an attribute and one of its attribute values, respectively. Thus, to formulate the construction of a decision tree for graph-structured data, we define attributes and their values as follows:


Figure 5. Decision tree for classifying graph-structured data

DT-ClGBI(D)
INPUT D: a graph database
begin
  Create a node DT for D
  if termination condition reached
    return DT
  else
    P := Cl-GBI(D) (with b, Ne, and θ specified)
    Select the most discriminative pattern p from P
    Divide D into Dy (with p) and Dn (without p)
    for Di := Dy, Dn
      DTi := DT-ClGBI(Di)
      Augment DT by attaching DTi as its child along the yes/no branch
    return DT
end

Figure 6. Algorithm of DT-ClGBI

• attribute: a pattern/subgraph in graph-structured data,

• value of an attribute: existence/non-existence of the pattern in each graph.

Since the value of an attribute is either yes (the pattern corresponding to the attribute exists in the graph) or no (the pattern does not exist), the resulting decision tree is represented as a binary tree. Namely, data (graphs) are divided into two groups: one consists of graphs with the pattern, and the other consists of graphs without it. Fig. 5 illustrates a decision tree constructed based on this approach. One remaining question is how to determine the patterns which are used as attributes for graph-structured data. Our approach to this question is described in the next subsection.

3.2 Feature Construction by Cl-GBI

The algorithm we propose here, called Decision Tree Chunkingless Graph-Based Induction (DT-ClGBI), utilizes Cl-GBI to extract patterns from graph-structured data and uses them as attributes for a classification task, whereas our previous algorithm, Decision Tree Graph-Based Induction (DT-GBI), adopted B-GBI to extract patterns. Namely, DT-ClGBI invokes Cl-GBI at each node of a decision tree, and selects the most discriminative pattern from those extracted by Cl-GBI. Then the data (graphs) are divided into two groups, i.e., one with the pattern and the other without the pattern, as described above. For each group, the same process is recursively applied until the group contains graphs of a single class, like ordinary decision tree construction methods such as C4.5 [14]. The algorithm of DT-ClGBI is summarized in Fig. 6.

In DT-ClGBI, each of the parameters of Cl-GBI, b, Ne, and θ, can be set to different values at different nodes in a decision tree. All patterns extracted at a node are inherited by its descendant nodes to prevent a pattern that has already been extracted in the node from being extracted again in its descendants. This means that, as the construction of a decision tree progresses, the number of patterns to be considered at a node progressively increases, and the size of a newly extracted pattern can be larger than the existing patterns. Thus, although the initial patterns at the start of the search consist of two nodes and the link between them, attributes useful for the classification task can gradually grow into larger patterns (subgraphs) by applying Cl-GBI recursively. In this sense, DT-ClGBI can be conceived as a method for feature construction, since features, i.e., attributes (patterns) useful for the classification task, are constructed during the application of DT-ClGBI.

However, recursive partitioning of the data until each subset in the partition contains data of a single class often results in overfitting to the training data and thus degrades the predictive accuracy of the resulting decision trees. To avoid overfitting and improve predictive accuracy, DT-ClGBI incorporates the "pessimistic pruning" used in C4.5 [14], which prunes an overfitted tree based on the confidence interval for the binomial distribution. This pruning is a postprocess that follows the algorithm in Fig. 6.

Note that the criterion for selecting a pair that becomes a pseudo node in Cl-GBI and the criterion for selecting a discriminative pattern in DT-ClGBI can be different. In the following experiments, the frequency of a pair is used as the former criterion, and the information gain of a pattern is used as the latter criterion.¹
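As an illustration of the latter criterion, a minimal information-gain computation over the yes/no split induced by one pattern might look as follows; has_pattern is a hypothetical predicate that, in DT-ClGBI, would be answered from the occurrence lists returned by Cl-GBI.

```python
import math

# Minimal sketch of the split criterion at a DT-ClGBI node (illustration only):
# each graph is reduced to "does pattern p occur in it?" and the pattern with
# the highest information gain over the class labels is selected.

def entropy(labels):
    total = len(labels)
    counts = {}
    for c in labels:
        counts[c] = counts.get(c, 0) + 1
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

def information_gain(classes, graphs, pattern, has_pattern):
    yes = [c for g, c in zip(graphs, classes) if has_pattern(g, pattern)]
    no = [c for g, c in zip(graphs, classes) if not has_pattern(g, pattern)]
    if not yes or not no:
        return 0.0
    remainder = (len(yes) * entropy(yes) + len(no) * entropy(no)) / len(classes)
    return entropy(classes) - remainder

def most_discriminative(classes, graphs, patterns, has_pattern):
    return max(patterns,
               key=lambda p: information_gain(classes, graphs, p, has_pattern))
```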

3.3 Working Example of DT-ClGBI

Suppose DT-ClGBI receives the set of 4 graphs in the upper left-hand side of Fig. 7. Both the beam width b and the number of levels Ne of Cl-GBI are set to 1 at every node of a decision tree to simplify the working of DT-ClGBI in this example, and the frequency threshold θ is set to 0%. Cl-GBI, called inside of DT-ClGBI, enumerates all the pairs in these graphs and extracts 11 kinds of pairs from the data. These pairs are: a→a, a→b, a→c, a→d, b→a, b→b, b→c, b→d, d→a, d→b, d→c. The existence/non-existence of

¹ We did not use information gain ratio because DT-ClGBI constructs a binary tree.


Figure 7. Example of decision tree construction by DT-ClGBI

Figure 8. Attribute-value pairs at the first step

the pairs in each graph is converted into the ordinary table representation of attribute-value pairs, as shown in Fig. 8. For instance, graph 1, graph 2 and graph 3 have the pair a→a but graph 4 does not have it. This is shown in the first column of Fig. 8.

Then Cl-GBI selects the most frequent pair "a→a", assigns the new label "e" to it to generate a pseudo node as shown in the upper right-hand side of Fig. 7, and terminates. It is worth noting that Cl-GBI can calculate the frequency of a pattern based on either the number of graphs that include the pattern (document frequency) or the total number of occurrences of the pattern in all the graphs (total frequency). In this example, document frequency is employed. Next, DT-ClGBI selects the discriminative pattern, i.e., the pattern (pair) with the highest evaluation for classification (i.e., information gain) from the enumerated pairs, and uses it to divide the data into two groups at the root node. In this example, the pair "a→a" is selected. As a result, the input data is divided into two groups: one consisting of graph 1, graph 2, and graph 3, and the other consisting of only graph 4.

The above process is recursively applied at each nodeto grow up the decision tree while constructing attributes(patterns) useful for the classification task at the sametime. In this example, since the former group consists ofgraphs belonging to different classes, again Cl-GBI is ap-plied to it, while the latter group is no longer divided be-cause it contains a single graph of class C. For the formergroup, pairs in graph 1, graph 2 and graph 3 are enumer-

Figure 9. Attribute-value pairs at the second step

ated and the attribute-value table is updated as shown in Fig. 9. Note that the pairs included in the table for the parent node are inherited. In this case, the pair "a→d" is selected by Cl-GBI as the most frequent pair to become a pseudo node "f", while the pair "e→d" is selected as the most discriminative pattern by DT-ClGBI. Consequently, the graphs are separated into two partitions, each of which contains graphs of a single class, as shown in the lower left-hand side of Fig. 7. The constructed decision tree is shown in the lower right-hand side of Fig. 7.

3.4 Classification using the Constructed Decision Tree

Unseen new graph data must be classified once the decision tree has been constructed. Here again, the problem of subgraph isomorphism arises when testing if the input graph contains the pattern (subgraph) specified in the test node of the tree. To alleviate this problem, we utilize Cl-GBI again. Theoretically, if the test pattern actually exists in the input graph, Cl-GBI can find it by setting the beam width b and the number of levels Ne large enough and by setting the frequency threshold to 0. However, note that nodes and links that never appear in the test pattern are never used to form the test pattern in Cl-GBI. Therefore, we can remove such nodes and links from the input graph before applying Cl-GBI to reduce its running time. This approach is summarized as follows:

Step 1 Remove nodes and links that never appear in the test pattern from the input graph.

Step 2 Apply Cl-GBI to the resulting input graph, setting the parameters b and Ne large enough, while setting the parameter θ to 0.

Step 3 Test if one of the canonical labels of the extracted patterns with the same size as the test pattern is equal to the canonical label of the test pattern.

In general, Step 1 results in a small graph and Cl-GBI can run very quickly without any constraints on Ne and b. However, if we need to set these constraints, we may not be able to obtain the correct answer because we don't


Figure 10. Example of 4 basic subgraphs

know how large these parameters should be. In that sense, this procedure can be regarded as an approximate solution to the subgraph isomorphism problem.
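A rough sketch of Step 1, under the same simplified graph representation as in the earlier Cl-GBI sketch (node labels only, no edge labels), is shown below; the Cl-GBI search of Steps 2 and 3, with b and Ne set large and θ = 0, is not reproduced here.

```python
# Illustration of Step 1 of the classification procedure: restrict the input
# graph to the node labels that occur in the test pattern before searching.

def restrict_to_pattern_labels(graph, pattern_labels):
    node_labels, edges = graph
    kept_nodes = {n: lbl for n, lbl in node_labels.items() if lbl in pattern_labels}
    kept_edges = [(s, d) for s, d in edges if s in kept_nodes and d in kept_nodes]
    return kept_nodes, kept_edges
```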

4 Experimental Evaluation of DT-ClGBI

To evaluate the performance of DT-ClGBI, we conducted some experiments on both synthetic and real-world datasets consisting of directed graphs.

As for the synthetic datasets, two kinds of datasets having an average size of 30 and 40 in terms of the number of nodes in a graph were randomly generated. We denote them by T30 and T40, respectively. Every dataset has 300 connected graphs as transactions. The links were attached randomly with the probability 20%. The numbers of node and link labels are 5 and 10, respectively, and each label was randomly assigned with equal probability. Then we equally divided each dataset into two classes, namely "active" and "inactive", and randomly embedded similarly generated 4 kinds of basic patterns in each transaction of the class "active", which are connected subgraphs having an average size of 4. The number of basic patterns to be embedded in a transaction of the class "active" was selected in the range between 1 and 4, each of which was in turn chosen from the set of 4 basic patterns with equal probability. Thus, each transaction of the class "active" includes from 1 to 4 basic patterns, and some of them may happen to be the same.
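As an illustration of this generation procedure (an assumption-based sketch, not the authors' generator), one transaction could be produced roughly as follows; connectivity and the embedding of the basic patterns into "active" transactions are not enforced here.

```python
import random

# Rough sketch of generating one random directed labelled transaction with a
# 20% link probability, 5 node labels and 10 link labels (illustration only;
# connectivity and basic-pattern embedding are not handled).

def random_transaction(n_nodes, link_prob=0.2, n_node_labels=5, n_link_labels=10):
    node_labels = {i: random.randrange(n_node_labels) for i in range(n_nodes)}
    edges = {}
    for src in range(n_nodes):
        for dst in range(n_nodes):
            if src != dst and random.random() < link_prob:
                edges[(src, dst)] = random.randrange(n_link_labels)
    return node_labels, edges
```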

We also check whether any basic subgraph is included in a transaction of the class "inactive" by Cl-GBI, as described in Section 3.4. If there is, the involved node and link labels are changed in such a way that the basic pattern no longer exists in the transaction. In other words, basic subgraphs are those which discriminate the two classes. This does not necessarily mean that any subgraph of the basic patterns is not discriminative enough. We have not checked whether all the subgraphs of the basic patterns appear in both active and inactive data. Fig. 10 shows the 4 basic subgraphs which were embedded in the transactions of the class "active".

For these two synthetic datasets, the classification task is to classify the two classes "active" and "inactive" using DT-ClGBI by a single run of 10-fold cross validation (CV). The final prediction error rate was evaluated as the average of 10 estimates of the prediction error (a total of 10 decision trees).

The first experiment was conducted to confirm that the most discriminative patterns can be extracted by Cl-GBI not only at the root node, but also at each internal node itself. To this end, we compared the predictive accuracy and the tree size obtained by two different settings for DT-ClGBI, described as follows. In the first setting, i.e., setting 1, a decision tree is constructed by applying Cl-GBI at the root node only, with Ne = 2. At the other nodes, we simply recalculate the information gain for those patterns that have already been discovered at the root node. In the other case, i.e., setting 2, Cl-GBI is invoked at the root node with Ne = 2 and at the other nodes with Ne = 1. In addition, the total number of levels of Cl-GBI in the second setting is limited to 6 to keep the computation time at a tolerable level. Whenever the total number of levels reaches this limit, Cl-GBI is no longer used for extracting patterns. Instead, only the existing patterns are employed for constructing the decision tree thereafter. Note that the beam width is set to 5 in both settings.

Results of the first experiment are summarized in Table 1, and they show that the second setting obtains higher predictive accuracy. Moreover, we observe that the decision trees constructed by the second setting have smaller sizes in most cycles of the 10-fold CV for both datasets. The result reveals that invoking Cl-GBI at internal nodes is needed to improve the predictive accuracy of DT-ClGBI, as well as to reduce the tree size. Intuitively, the search space increases by applying Cl-GBI at the internal nodes in addition to the root node. As a result, more discriminative patterns which have not been discovered in the previous steps are discovered at these nodes. In other words, applying Cl-GBI at only the root node cannot help enumerate all the necessary patterns unless Ne and b are set large enough. For example, in the decision tree constructed by the first run of 10-fold CV on the dataset T30 using the second setting, the classifying pattern for a node at the third level was not found at the root node but at its parent node. If Ne is set large enough, the necessary pattern should be able to be found at the root node. This pattern, if found at the root node, would give smaller information gain at the root node, but Cl-GBI retains it and passes it down to the lower node. The question is how to find this pattern where it is needed without running Cl-GBI using all the dataset.

The second experiment focused on comparisons between DT-ClGBI and DT-GBI [4, 5], also in terms of the predictive accuracy and the tree size. Here the beam width is


Table 1. Comparisons of different settings for DT-ClGBI

                    Setting 1                                     Setting 2
Dataset   Training error   Test error   Avg. tree size   Training error   Test error   Avg. tree size
T30       0.22%            1.33%        17.2             0%               0%           9.2
T40       0.37%            5%           18               0.15%            3.33%        12.8

also set to 5 in both cases. For DT-GBI, the number of levels of B-GBI at any node of a decision tree is kept fixed at 4. It should be noted that, whenever being invoked for constructing a decision tree by DT-GBI, B-GBI starts extracting typical patterns from the beginning, i.e., no inheritance is employed, because the graphs that pass down to the yes branch have been chunked by the test pattern. On the other hand, the number of levels of Cl-GBI is 4 at the root node and 1 at the other nodes of a decision tree in the case of DT-ClGBI. In addition, the total number of levels of Cl-GBI is limited to 8, which means that the number of levels performed by the feature construction tool in DT-ClGBI is much less than that in DT-GBI. As mentioned earlier, this limitation helps reduce the computation time required for constructing the decision trees by DT-ClGBI, however, at the expense of a decrease in the search space of Cl-GBI.

Table 2 shows the results of the second experiment. It can be seen that DT-ClGBI achieves lower prediction error for both datasets. We also observe that, for each dataset, the decision trees constructed by DT-ClGBI have smaller sizes in most cycles of the 10-fold CV. The higher predictive accuracy of DT-ClGBI and the simpler decision trees obtained by this method can be explained by the improvement of Cl-GBI over B-GBI, and by the inheritance of previously extracted patterns at an internal node (in a decision tree) from its predecessors. It is known that Cl-GBI resolves the problem of overlapping patterns incurred by B-GBI, thus resulting in more typical patterns extracted by Cl-GBI.

Additionally, it should be noted that the size of the embedded graphs in these two datasets is 4 or 5. Setting Ne = 2 as in the first experiment means that the maximum size of patterns we can get at the root node is 4. Considering the beam width, it is unlikely that the embedded patterns are found at the root node. Even with Ne = 4 as in the second experiment, these basic patterns cannot be found. However, the substructures of the embedded graphs are discriminative enough, as shown in the two aforementioned experiments. Due to the downward closure property, these substructures were embedded in the transactions of the class "active".

Finally, we verified whether DT-ClGBI can construct decision trees that achieve reasonably good predictive accuracy also on a real-world dataset. For that purpose, we used the hepatitis dataset as in [5]. The classification task here is to classify patients into two classes, "LC" (Liver Cirrhosis) and "nonLC" (non Liver Cirrhosis), based on their fibrosis stages, which are categorized into five stages in the dataset: F0 (normal), F1, F2, F3, and F4 (severe = Liver Cirrhosis). All 43 patients at the F4 stage were used as the class "LC", while all 4 patients at the F0 stage and 61 patients at the F1 stage were used as the class "nonLC". This ratio of "LC" to "nonLC" was determined based on [15]. The records for each patient were converted into a directed graph as described in [5]. The resulting graph database has 108 graph transactions. The average size of a graph transaction in this database is 316.2 in terms of the number of nodes and 386.4 in terms of the number of links.

Through some preliminary experiments on this database using DT-ClGBI, we found that the existence of some graphs often makes the resulting decision tree too complicated and worsens the predictive accuracy. This has led us to adopt a two-step approach: first, to divide the patients into "typical" and "non-typical", and second, to construct a decision tree for each group of patients. To divide the patients in the first step, we ran 10-fold cross validation of DT-ClGBI on this database, varying its parameters b and Ne in the ranges of {5, 6, 8, 10} and {6, 8, 10, 12}, respectively. Note that these values are only for the root node and we did not run Cl-GBI at the succeeding nodes. The frequency threshold θ was fixed to 10%. Namely, we conducted 10-fold cross validation 16 times with different combinations of these parameters, and obtained 160 decision trees in total in this step. Then we classified graphs whose average error rate is 0% into "typical", and the others into "non-typical". As a result, for the class "LC", 28 graphs are classified into the subset "typical" and the other 15 graphs into "non-typical", while for the class "nonLC", 48 graphs are classified into "typical" and 17 graphs into "non-typical".

In the second step, we applied DT-ClGBI to each subset again, adopting the best parameter setting in the first step with respect to the predictive accuracy, where b = 8 and Ne = 10. The predictive accuracy (average of 10-CV) for the subset "typical" is 97.4%, and that for "non-typical" is 78.1%. The overall accuracy is 91.7%, which is much better than the accuracy obtained by applying DT-ClGBI to the original dataset with b = 8 and Ne = 10, i.e., 83.4%. We can find typical features for a patient with Liver Cirrhosis in the extracted patterns, such as "GOT is High" or "PLT is Low". From these results, we can say that DT-ClGBI can achieve reasonably good predictive accuracy on a real-world dataset and extract discriminative features embedded in the dataset as subpatterns.


Table 2. Comparisons of DT-ClGBI and DT-GBI

                    DT-GBI                                        DT-ClGBI
Dataset   Training error   Test error   Avg. tree size   Training error   Test error   Avg. tree size
T30       1.41%            7.67%        24               0%               0%           9
T40       3.15%            7.67%        18.2             0%               0.67%        9

5 Conclusions

In this paper, we have proposed an algorithm called DT-ClGBI, which can construct a decision tree for graph-structured data using Cl-GBI. In DT-ClGBI, substructures, or patterns, useful for a classification task are constructed on the fly by means of Cl-GBI during the construction process of a decision tree. The experimental results using synthetic datasets showed that decision trees constructed by DT-ClGBI achieve good predictive accuracy for graph-structured data. The good predictive accuracy of DT-ClGBI is mainly attributed to the fact that Cl-GBI can give the correct number of occurrences of a pattern due to its capability of extracting overlapping patterns, which is very useful for algorithms such as DT-ClGBI that need correct counting. Also, the inheritance of previously extracted patterns at an internal node from its predecessors was shown to be helpful.

For future work, we plan to employ some heuristics to speed up the Cl-GBI algorithm so that larger typical subgraphs can be extracted at an early stage in the search process. This could also improve the performance of DT-ClGBI.

References

[1] Breiman, L., Friedman, J. H., Olshen, R. A., Stone, C. J. 1989. The CN2 Induction Algorithm, Machine Learning, 3: 261–283.

[2] Cook, D. J. and Holder, L. B. 1994. Substructure Discovery Using Minimum Description Length and Background Knowledge, Artificial Intelligence Research, 1: 231–255.

[3] Fortin, S. 1996. The Graph Isomorphism Problem, Technical Report TR96-20, Dept. of Computer Science, University of Alberta, Edmonton, Canada.

[4] Geamsakul, W., Matsuda, T., Yoshida, T., Motoda, H., and Washio, T. 2003. Classifier Construction by Graph-Based Induction for Graph-Structured Data, In Proc. PAKDD 2003, pp. 52–62.

[5] Geamsakul, W., Yoshida, T., Ohara, K., Motoda, H., Washio, T., Takabayashi, K., Yokoi, H. 2005. Constructing a Decision Tree for Graph-Structured Data and Its Applications. Fundamenta Informaticae, 66(1-2): 131–160.

[6] Inokuchi, A., Washio, T., and Motoda, H. 2003. Complete Mining of Frequent Patterns from Graphs: Mining Graph Data, Machine Learning, 50(3): 321–354.

[7] Inokuchi, A., Washio, T., Nishimura, K., and Motoda, H. 2002. A Fast Algorithm for Mining Frequent Connected Subgraphs, IBM Research Report RT0448, Tokyo Research Laboratory.

[8] Kuramochi, M. and Karypis, G. 2004. An Efficient Algorithm for Discovering Frequent Subgraphs, IEEE Trans. Knowledge and Data Engineering, 16(9): 1038–1051.

[9] Kuramochi, M. and Karypis, G. 2004. GREW – A Scalable Frequent Subgraph Discovery Algorithm, In Proc. ICDM 2004, pp. 439–442.

[10] Matsuda, T., Motoda, H., Yoshida, T., and Washio, T. 2002. Mining Patterns from Structured Data by Beam-wise Graph-Based Induction, In Proc. DS 2002, pp. 422–429.

[11] Michalski, R. S. 1990. Learning Flexible Concepts: Fundamental Ideas and a Method Based on Two-Tiered Representation, In Machine Learning: An Artificial Intelligence Approach, 3: 63–102.

[12] Nguyen, P. C., Ohara, K., Motoda, H., and Washio, T. 2005. Cl-GBI: A Novel Approach for Extracting Typical Patterns from Graph-Structured Data, In Proc. PAKDD 2005, pp. 639–649.

[13] Quinlan, J. R. 1986. Induction of Decision Trees, Machine Learning, 1: 81–106.

[14] Quinlan, J. R. 1993. C4.5: Programs for Machine Learning, Morgan Kaufmann Publishers.

[15] Yamada, Y., Suzuki, E., Yokoi, H., and Takabayashi, K. 2003. Decision-tree Induction from Time-series Data Based on a Standard-example Split Test, In Proc. ICML 2003, pp. 840–847.

[16] Yan, X. and Han, J. 2002. gSpan: Graph-Based Substructure Pattern Mining, In Proc. ICDM 2002, pp. 721–724.

[17] Yoshida, K. and Motoda, H. 1995. CLIP: Concept Learning from Inference Patterns, Artificial Intelligence, 75(1): 63–92.


Unusual Condition Detection of Bearing Vibration in Hydroelectric Power Plants

Takashi Onoda
CRIEPI
2-11-1, Iwado Kita, Komae-shi, Tokyo 201-8511 JAPAN
[email protected]

Norihiko Ito
CRIEPI
2-11-1, Iwado Kita, Komae-shi, Tokyo 201-8511 JAPAN
[email protected]

Kenji Shimizu
Kyushu Electric Power Co., Inc.
2-1-82, Watanabe-Dori, Chuo-Ku, Fukuoka 810-8720 JAPAN
kenji [email protected]

Abstract

Kyushu Electric Power Co., Inc. collects various sensor data and weather information to maintain the safety of hydroelectric power plants while the plants are running. However, abnormal conditions and trouble conditions occur very rarely in power equipment, and it is hard to construct an experimental power generation plant or hydroelectric power plant to collect abnormal and trouble condition data, because the cost is very high. In this situation, we have to find abnormal condition signs, and we consider that an abnormal condition sign may appear as an unusual condition. This paper shows unusual condition patterns of bearing vibration detected from the collected sensor data and weather information by using a one-class support vector machine. The results show that our approach may be useful for detecting unusual condition patterns in bearing vibration and for maintaining hydroelectric power plants.

1 Introduction

Recently, electric power companies have begun to try to shift from Time Based Maintenance (TBM) to Condition Based Maintenance (CBM) for electric equipment management, in order to realize efficient maintenance and reduce its cost [2]. TBM is to check and replace the equipment based on the guaranteed term recommended by the makers. CBM is to check, repair and replace the equipment based on the state of the equipment. The state consists of the present state of the equipment, the operation term of the equipment, the load in operation, and so on.

In order to realize CBM, it is important for electric power companies to collect data about the equipment. These data consist of normal operation information, abnormal operation information, and so on. In particular, to reduce the maintenance cost by improving the accuracy of equipment management and maintenance based on CBM, it is most important to collect much data that can be utilized for management and maintenance.

It is necessary to collect and analyze past abnormal condition data and past trouble condition data in order to discover abnormal condition signs of power equipment. For instance, there is the discovery of an abnormal condition sign from sensor information of the hydroelectric power plant bearing vibration, and the discovery of an abnormal condition sign from the operation data of the power generation plant. However, abnormal conditions and trouble conditions occur very rarely in power equipment, and it is hard to construct an experimental power generation plant or hydroelectric power plant to collect such data, because the cost is very high.

In the above situation, Kyushu Electric Power Co., Inc. has been analyzing the causal relation between the bearing vibration and various sensor information, in order to establish a detection method of abnormal condition signs for the bearing vibration in the hydroelectric power plant. However, it is hard to discover the causal relation between the bearing vibration and the various sensor information, because this relation is too complex, it is impossible to acquire abnormal and trouble condition data, and we can measure the normal condition data only.


In order to discover abnormal condition signs of hydroelectric power plants, Kyushu Electric Power Co., Inc. and the Central Research Institute of Electric Power Industry are now researching a detection method of unusual conditions for the bearing vibration using various sensor information. In our research, we consider that the unusual condition patterns nearly coincide with the abnormal condition patterns, because we can measure only normal condition patterns from a hydroelectric power plant in regular operation. We therefore developed a detection method of unusual conditions for bearing vibration from regular condition data in the hydroelectric power plant.

In this paper, we briefly describe the measured sensor data in the next section. In the third section, we briefly explain our proposed approach to detect abnormal condition signs for bearing vibration and our proposed detection method of unusual conditions for bearing vibration from regular operation condition data. The experimental results are shown in the fourth section. Finally, we conclude with the concept of abnormal sign discovery for hydroelectric power plant bearing vibration based on the detection method of unusual patterns.

2 Measurement Data

Table 1 shows the outline of the hydroelectric power plant. The hydroelectric power plant has various sensors to measure data related to bearing vibration. In this paper, the measurement data collected from this hydroelectric power plant are analyzed by our proposed method. The measurement data related to bearing vibration were collected from March 16, 2004 to November 23, 2004.

Each data record is composed of the sensor and weather information on the 38 measurement items shown in Table 2, collected at a measurement interval of five seconds. All measurement data are normal condition data and do not include abnormal conditions such as accidental conditions, trouble conditions, and so on.

Table 2 shows one example of the measured data values.

3 Abnormal Condition Sign Detection

In this section, we describe the outline of the approach of detecting abnormal condition signs using unusual condition patterns.

Generally, the discovery of an abnormal sign means detecting a peculiar case that appears only before an existing abnormal condition, by comparing normal conditions and abnormal conditions. However, there is in fact little data on abnormal conditions in electric power equipment, because electric power plants are designed with a high safety factor and maintained appropriately.

Figure 1. Image of unusual condition detection. The oblique-line area denotes the normal condition patterns; the general condition patterns cover 99% of the normal condition patterns, and the remaining 1% are exceptions (unusual condition patterns).

Currently, our bearing vibration data of the hydroelectric power plant also contain no abnormal conditions or accidental conditions. Therefore, it is impossible to detect a peculiar case before abnormal conditions and accidental conditions happen by comparing abnormal and normal conditions. We therefore regard the relation between a peculiar condition before an abnormal condition happens (hereafter, the abnormal condition sign) and unusual conditions as follows:

The abnormal condition sign ≈ The unusual condition.

It is thus possible to recast the discovery of the abnormal condition sign as the detection of unusual conditions within the normal conditions. In other words, we consider that an unusual condition with a low probability of appearing in the normal condition data has a high probability of being an abnormal condition sign.

Figure 1 shows the concept of detecting unusual condition patterns in the normal condition data. In this figure, the gray oblique-line area denotes the normal condition pattern area. In this research, the unusual condition patterns are detected from these normal condition patterns. As shown in Figure 1, if we can find a hyper-sphere that covers 99% of the normal condition patterns, we can regard the remaining 1% of the patterns as unusual condition patterns. These 99% of normal condition patterns are called "general condition patterns". In Figure 1, the inside of the circle drawn with a black solid line is the general condition pattern area, and black stars denote unusual condition patterns. Therefore, if we can find the boundary of the α% area of the normal condition patterns correctly, it is possible to detect unusual condition patterns that do not belong to this α% area. We adopt the One Class Support Vector Machine (hereafter One Class SVM) to find this boundary correctly [1].


Table 1. Outline of the targeted hydroelectric power plant
Generated Output: 18,000 kW
Working Water: 45 m³/s
Effective Head: 46.3 m
Turbine Type: Vertical Shaft Francis Turbine
Rated Revolutions Per Minute: 240 rpm
Bearing Type:
  Upper Bearing: Oil Self-Contained Type Segment Bearing (Natural Cooling)
  Bottom Bearing: Oil Self-Contained Type Segment Bearing (Natural Cooling)
  Turbine Bearing: Oil Self-Contained Type Cylindrical Bearing (Natural Cooling)
  Thrust Bearing: Oil Self-Contained Type Pivot Spring Bearing (Natural Cooling)
Operation Pattern: Process Control Operation (an operation pattern is generated every day)

Table 2. An example of online collected sensor information
Time: 04:20:45
Generated Output (MW): 16.26
Generated Reactive Power (Mvar): -0.708
Generator Voltage (V-W) (kV): 10.901
Generator Current (R) (A): 893
Guide Vane Opening (%): 85
Revolutions Per Minute (rpm): 242
Generator Stator Winding Temperature (U Phase) (°C): 74.4
Generator Stator Winding Temperature (V Phase) (°C): 75.0
Generator Stator Winding Temperature (W Phase) (°C): 74.8
Dam Water Level (m): -2.49
Surge Tank Water Level (m): 128.73
Tail Water Outlet (m): 85.08
Outside Temperature (°C): 25.9
Room Temperature (°C): 28.6
Water Temperature (°C): 27.4
Oil Cooler Inlet Air Temperature (°C): 31.2
Oil Cooler Outlet Air Temperature (°C): 40.6
Upper Bearing Temperature (°C): 50.7
Bottom Bearing Temperature (°C): 51.0
Upper Bearing Oil Temperature (°C): 47.6
Turbine Bearing Temperature (°C): 44.1
Turbine Bearing Oil Temperature (°C): 37.4
Thrust Bearing Temperature (°C): 59.8
Bottom Oil Tank Oil Temperature (°C): 47.8
Air Duct Outlet Air Temperature (°C): 50.7
Bottom Bearing Inlet Air Temperature (°C): 29.0
Pressured Oil Tank Oil Temperature (°C): 35.6
Generator Shaft Vibration (X axis) (μm): 22
Turbine Shaft Vibration (X axis) (μm): 51
Air Tank Pressure (MPa): 3.167
Pressured Oil Tube Oil Pressure (MPa): 2.732
Generator Bottom Bearing Oil Tank Oil Surface (mm): -4
Generator Upper Thrust Bearing Oil (mm): 3
Turbine Bearing Oil Surface (mm): -6
Pressured Oil Tank Oil Surface (mm): -58
Pressured Oil Tube Oil Surface (mm): 17
Main Shaft Sealed Water Volume (ℓ/min): -0.44

4 One Class SVM

Schölkopf et al. suggested a method of adapting the SVM methodology to the one-class classification problem [1]. Essentially, after transforming the features via a kernel, they treat the origin as the only member of the second class. Then, using "relaxation parameters", they separate the image of the one class from the origin, and the standard two-class SVM techniques are employed.

One Class SVM [1] returns a function f that takes the value +1 in a "small" region capturing most of the training data points, and -1 elsewhere.

The algorithm can be summarized as mapping the data into a feature space H using an appropriate kernel function, and then trying to separate the mapped vectors from the origin with maximum margin (see Figure 2).

Let the training data be

    x_1, ..., x_ℓ,    (1)

belonging to one class X, where X is a compact subset of R^N and ℓ is the number of observations. Let Φ : X → H be a kernel map which transforms the training examples to feature space. The dot product in the image of Φ can be computed by evaluating some simple kernel

    k(x, y) = (Φ(x) · Φ(y)),    (2)

such as the Gaussian kernel

    k(x, y) = exp(−‖x − y‖² / c).    (3)

The strategy is to map the data into the feature space corresponding to the kernel, and to separate them from the origin with maximum margin. To separate the data set from the origin, one needs to solve the following quadratic program:

    min_{w ∈ H, ξ ∈ R^ℓ, ρ ∈ R}   (1/2)‖w‖² + (1/(νℓ)) Σ_i ξ_i − ρ
    subject to   (w · Φ(x_i)) ≥ ρ − ξ_i,   ξ_i ≥ 0.    (4)


Figure 2. One Class SVM classifier: the origin is the only original member of the second class.

Here, ν ∈ (0, 1) is an upper bound on the fraction of outliers, and a lower bound on the fraction of Support Vectors.

Since nonzero slack variables ξ_i are penalized in the objective function, we can expect that if w and ρ solve this problem, then the decision function

    f(x) = sgn((w · Φ(x)) − ρ)    (5)

will be positive for most examples x_i contained in the training set, while the SV-type regularization term ‖w‖ will still be small. The actual trade-off between these two is controlled by ν. For a new point x, the value f(x) is determined by evaluating which side of the hyperplane it falls on, in feature space.

Using multipliers α_i, β_i ≥ 0, we introduce a Lagrangian

    L(w, ξ, ρ, α, β) = (1/2)‖w‖² + (1/(νℓ)) Σ_i ξ_i − ρ − Σ_i α_i ((w · Φ(x_i)) − ρ + ξ_i) − Σ_i β_i ξ_i    (6)

and set the derivatives with respect to the primal variables w, ξ_i, ρ equal to zero, yielding

    w = Σ_i α_i Φ(x_i),    (7)

    α_i = 1/(νℓ) − β_i ≤ 1/(νℓ),   Σ_i α_i = 1.    (8)

In Eqn. (7), all patterns x_i with α_i > 0, i ∈ [ℓ], are called Support Vectors. Using Eqn. (2), the SV expansion transforms the decision function Eqn. (5) into

    f(x) = sgn(Σ_i α_i k(x_i, x) − ρ).    (9)

Substituting Eqn. (7) and Eqn. (8) into Eqn. (6), we obtain the dual problem:

    min_α   (1/2) Σ_{i,j} α_i α_j k(x_i, x_j)    (10)

    subject to   0 ≤ α_i ≤ 1/(νℓ),   Σ_i α_i = 1.    (11)

One can show that at the optimum the two inequality constraints of Eqn. (4) become equalities if α_i and β_i are nonzero, i.e. if 0 < α_i < 1/(νℓ). Therefore, we can recover ρ by exploiting that, for any such α_i, the corresponding pattern x_i satisfies

    ρ = (w · Φ(x_i)) = Σ_j α_j k(x_j, x_i).    (12)

Note that if ν approaches 0, the upper bounds on the Lagrange multipliers tend to infinity, i.e. the second inequality constraint in Eqn. (11) becomes void. The problem then resembles the corresponding hard-margin algorithm, since the penalization of errors becomes infinite, as can be seen from the primal objective function Eqn. (4). It is still a feasible problem, since we have placed no restriction on ρ, so ρ can become a large negative number in order to satisfy Eqn. (4). If we had required ρ ≥ 0 from the start, we would have ended up with the constraint Σ_i α_i ≥ 1 instead of the corresponding equality constraint in Eqn. (11), and the multipliers α_i could have diverged.

In our research we used LIBSVM, an integrated tool for support vector classification and regression which can handle One Class SVM using the Schölkopf et al. algorithm. LIBSVM is available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.
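As an aside not drawn from the paper, the same kind of one-class boundary can be fit with scikit-learn's OneClassSVM, which wraps the LIBSVM implementation of the Schölkopf et al. algorithm; the synthetic data and the ν value below are placeholders:

    import numpy as np
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import OneClassSVM

    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 16))            # placeholder for the 16 measurement items

    X_std = StandardScaler().fit_transform(X)  # mean 0, variance 1 per item, as in Sec. 5.1

    # nu upper-bounds the fraction of training points treated as outliers (cf. Eq. (4)).
    clf = OneClassSVM(kernel="rbf", gamma="scale", nu=0.05).fit(X_std)

    scores = clf.decision_function(X_std)      # sum_i alpha_i k(x_i, x) - rho, without sgn
    unusual = X_std[scores < 0]                # patterns falling outside the learned region
    print(len(unusual), "candidate unusual condition patterns")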

5 Unusual Pattern Detection Experiment

In this section, we describe our experiment using the measurement data explained in Section 2. In particular, we briefly introduce our experimental setup, how we verify our experimental results, the experimental results themselves, and their evaluation.

5.1 Experimental Setup

Our experiment analyzed the measurement data explained in Section 2. The measurement data are composed of 38 measurement items. However, in order to detect the unusual condition patterns, we extracted the measurement items related to the bearing vibration from all measurement items. Thus, 16 measurement items were selected, based on the bearing vibration expertise of the experts, to analyze the unusual condition patterns. Table 3 shows these selected 16 measurement items.


Table 3. Measurement items for detection analysis
A. Generated Output (MW)
B. Revolutions Per Minute (rpm)
C. Room Temperature (°C)
D. Water Temperature (°C)
E. Oil Cooler Inlet Air Temperature (°C)
F. Oil Cooler Outlet Air Temperature (°C)
G. Upper Bearing Temperature (°C)
H. Bottom Bearing Temperature (°C)
I. Upper Bearing Oil Temperature (°C)
J. Turbine Bearing Temperature (°C)
K. Turbine Bearing Oil Temperature (°C)
L. Thrust Bearing Temperature (°C)
M. Bottom Oil Tank Oil Temperature (°C)
N. Bottom Bearing Inlet Air Temperature (°C)
O. Generator Shaft Vibration (X axis) (μm)
P. Turbine Shaft Vibration (X axis) (μm)

Table 4. The number of patterns in each condition
Stopping condition: 433,935
Starting condition: 120
Parallel operation condition: 2,368,756
Parallel off condition: 132
Total: 2,804,113

The starting condition patterns and the parallel off condition patterns are relatively very few, while the parallel operation condition patterns are very numerous. If we analyzed all the measurement data together to detect the unusual condition patterns, the detected condition patterns would simply be the starting condition patterns or the parallel off condition patterns. This is not a good situation for our analysis. Therefore, all measurement data are divided into the following four groups by the expertise of the experts.

Starting condition group: Generator Voltage (V-W) < 10 kV and Guide Vane Opening ≥ 10% and Revolutions Per Minute ≥ 200 rpm.

Parallel operation condition group: Generator Voltage (V-W) ≥ 10 kV and Revolutions Per Minute ≥ 200 rpm.

Parallel off condition group: Generator Voltage (V-W) < 10 kV and Guide Vane Opening < 10% and Revolutions Per Minute ≥ 200 rpm.

Stopping condition group: Otherwise.

These groups were defined by the expertise of the experts. Table 4 shows the number of patterns in each group.

In the stopping condition group, the bearing does not rotate, so the patterns of this group were omitted from the analyzed patterns. In other words, the unusual condition patterns were detected within each of the remaining groups: the starting condition, the parallel operation condition, and the parallel off condition. In order to ignore the different measurement units, the measurement data are normalized to average 0 and variance 1 for each measurement item.
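The following is a minimal sketch (not from the paper) of this grouping and per-item standardization; the column names, thresholds and toy records simply mirror the expert rules quoted above:

    import pandas as pd

    def assign_group(row):
        # Expert rules quoted above; thresholds are those given in the text.
        if row["Revolutions Per Minute"] >= 200:
            if row["Generator Voltage (V-W)"] >= 10.0:
                return "parallel operation"
            if row["Guide Vane Opening"] >= 10.0:
                return "starting"
            return "parallel off"
        return "stopping"

    # Placeholder records standing in for the real 5-second sensor samples.
    df = pd.DataFrame({
        "Generator Voltage (V-W)":   [10.9, 0.0, 10.8, 0.0],
        "Guide Vane Opening":        [85.0, 0.0, 80.0, 12.0],
        "Revolutions Per Minute":    [242, 0, 241, 230],
        "Upper Bearing Temperature": [50.7, 21.0, 50.5, 46.4],
    })
    df["group"] = df.apply(assign_group, axis=1)

    # Per-item standardization (mean 0, variance 1) on the analysed items,
    # after dropping the stopping-condition patterns.
    items = ["Upper Bearing Temperature"]
    analysed = df[df["group"] != "stopping"].copy()
    analysed[items] = (analysed[items] - analysed[items].mean()) / analysed[items].std()
    print(analysed)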

In this section, the hydroelectric power plant operations are defined by the expertise of the experts. These operations are the starting condition, the parallel operation condition, the parallel off condition and the stopping condition. In practice, these operations should be defined by the operation record of the hydroelectric power plant. However, in order to verify our proposed approach, we intentionally defined the four groups by using the expertise of the experts.

5.2 Detection Results and Evaluation

The unusual condition patterns were detected in each group of data (the starting condition, the parallel operation condition and the parallel off condition) by applying the One Class SVM introduced in Section 4. Our experiment included trials that change the number of detected unusual condition patterns. In this section, we report, for each group of data, the analysis result with the number of unusual condition patterns that the experts judged to be appropriate.

5.2.1 Detection of the unusual condition patterns in the starting condition patterns

The starting condition patterns were analyzed with One Class SVM to detect unusual condition patterns in the starting condition patterns. The parameter ν, which determines the rate of unusual condition patterns in the analyzed patterns, was set to 0.05. In this case, ν = 0.05 means that the number of unusual condition patterns is about six.

We could detect 9 unusual condition patterns by the analysis. Table 5 shows the typical unusual condition patterns detected by our experiment.

Figure 3 shows the existence probability of the starting condition patterns. In this figure, the x axis denotes the value of the discriminant function and the y axis denotes the cumulative probability. The discriminant function values of the unusual condition patterns shown in Table 5 are smaller than 0. In Figure 3, the existence probabilities of the unusual condition patterns, whose discriminant function values are smaller than 0, are smaller than 0.05; they are very rare condition patterns.
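For illustration only, a cumulative-probability curve of this kind can be obtained from the decision values as follows; the placeholder scores stand in for the output of a fitted one-class model:

    import numpy as np

    rng = np.random.default_rng(1)
    scores = rng.normal(loc=0.5, scale=0.5, size=120)   # placeholder decision values
                                                        # (e.g. OneClassSVM.decision_function)
    order = np.sort(scores)
    cumulative_probability = np.arange(1, len(order) + 1) / len(order)

    # Plotting cumulative_probability against order gives the kind of curve in Figure 3;
    # patterns with negative decision values are the detected unusual condition patterns.
    fraction_unusual = np.mean(order < 0.0)
    print("fraction with negative decision value:", fraction_unusual)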

In Table 5, cases 1 and 2 should actually belong to the parallel operation condition. Table 6 shows the data shortly before and after cases 1 and 2. In this table, we can recognize that cases 1 and 2 should belong to the parallel operation condition.


Table 5. The typical unusual condition patterns in the starting condition patterns
Measurement Item                                Case 1   Case 2   Case 3
A. Generated Output (MW)                        -0.192   -0.192   -0.162
B. Revolutions Per Minute (rpm)                 243      243      330
C. Room Temperature (°C)                        23.1     23.1     28.8
D. Water Temperature (°C)                       21.1     21.1     27.4
E. Oil Cooler Inlet Air Temperature (°C)        25.9     25.9     30.9
F. Oil Cooler Outlet Air Temperature (°C)       35.0     34.8     40.5
G. Upper Bearing Temperature (°C)               46.4     46.5     50.8
H. Bottom Bearing Temperature (°C)              44.3     44.3     50.9
I. Upper Bearing Oil Temperature (°C)           42.9     42.8     47.5
J. Turbine Bearing Temperature (°C)             41.1     41.1     43.8
K. Turbine Bearing Oil Temperature (°C)         33.1     33.1     37.0
L. Thrust Bearing Temperature (°C)              54.6     54.8     59.7
M. Bottom Oil Tank Oil Temperature (°C)         40.5     40.4     47.8
N. Bottom Bearing Inlet Air Temperature (°C)    24.3     24.5     29.2
O. Generator Shaft Vibration (X axis) (μm)      12       8        23
P. Turbine Shaft Vibration (X axis) (μm)        34       27       33

This situation shows that the expertise of the experts cannot describe the starting condition completely. However, our proposed approach detected cases 1 and 2, which should belong to the parallel operation condition, as unusual condition patterns. This fact shows that our approach can find the appropriate unusual condition patterns that should be detected in the starting condition. In Table 5, case 3 is an accidental interruption. This fact was verified by using the operation record, the daily report, and so on. An accidental interruption denotes a parallel off of the power plant when the power system has an accident, and is a very rare case.

5.2.2 Detection of the unusual condition patterns in the parallel operation condition patterns

The parallel operation condition patterns were analyzed with One Class SVM to detect unusual condition patterns in the parallel operation condition patterns. The parameter ν, which determines the rate of unusual condition patterns in the analyzed patterns, was set to 5 × 10⁻⁶. In this case, ν = 5 × 10⁻⁶ means that the number of unusual condition patterns is about twelve.

Figure 3. The relation between classification value and existence probability for starting condition patterns (x axis: discriminant function value, from −2 to 2; y axis: cumulative probability, from 0.001 to 1.000).

This analysis can detect the unusual condition patterns whose existence probabilities are smaller than 5 × 10⁻⁶.

We could detect 14 unusual condition patterns by the analysis. Table 7 shows the typical unusual condition patterns detected by our experiment. In Table 7, the three typical cases (cases 1-3) of these unusual condition patterns were test operations. This fact was verified by using the operation record, the daily report, and so on. A test operation denotes a short-term operation to check the electric power plant, and is a very rare case. This fact shows that our approach can find the appropriate unusual condition patterns that should be detected in the parallel operation condition.

5.2.3 Detection of the unusual condition patterns in the parallel off condition patterns

The parallel off condition patterns were analyzed with One Class SVM to detect unusual condition patterns in the parallel off condition patterns. The parameter ν, which determines the rate of unusual condition patterns in the analyzed patterns, was set to 0.05. In this case, ν = 0.05 means that the number of unusual condition patterns is about twelve. This analysis can detect the unusual condition patterns whose existence probabilities are smaller than 0.05.

We could detect 11 unusual condition patterns by the analysis. Table 8 shows the typical unusual condition patterns detected by our experiment. In Table 8, cases 1 and 2 of these unusual condition patterns were an accidental interruption, and the other two cases (cases 3 and 4) were a test operation. These facts were verified by using the operation record, the daily report, and so on. The test operation and the accidental interruption are very rare cases. These facts show that our approach can find the appropriate unusual condition patterns that should be detected in the parallel off condition.


Table 6. Data before and after Cases 4 and 5
Measurement Item                                −5 sec.   Case 4   Case 5   +10 sec.
A. Generated Output (MW)                        4.452     -0.192   -0.192   4.26
B. Revolutions Per Minute (rpm)                 243       243      243      243
C. Room Temperature (°C)                        23.1      23.1     23.1     23.1
D. Water Temperature (°C)                       21.1      21.1     21.1     21.2
E. Oil Cooler Inlet Air Temperature (°C)        25.9      25.9     25.9     26.1
F. Oil Cooler Outlet Air Temperature (°C)       34.9      35.0     34.8     34.9
G. Upper Bearing Temperature (°C)               46.5      46.4     46.5     46.3
H. Bottom Bearing Temperature (°C)              44.3      44.3     44.3     44.5
I. Upper Bearing Oil Temperature (°C)           42.8      42.9     42.8     43.0
J. Turbine Bearing Temperature (°C)             41.1      41.1     41.1     41.1
K. Turbine Bearing Oil Temperature (°C)         33.2      33.1     33.1     33.0
L. Thrust Bearing Temperature (°C)              54.8      54.6     54.8     54.6
M. Bottom Oil Tank Oil Temperature (°C)         40.4      40.5     40.4     40.6
N. Bottom Bearing Inlet Air Temperature (°C)    24.5      24.3     24.5     24.6
O. Generator Shaft Vibration (X axis) (μm)      9         12       8        11
P. Turbine Shaft Vibration (X axis) (μm)        26        34       27       31


5.3 Inexplainable Unusual Condition Patterns

In our detection experiment, we carried out trials that change the number of extracted unusual condition patterns, and the detected unusual condition patterns were shown to the experts. It is easy for the experts to explain that the unusual condition patterns detected beyond the numbers described above are normal condition patterns.

However, there are some inexplainable unusual condition patterns among the unusual condition patterns described above. In our experiment, our approach could detect nine unusual condition patterns in the starting condition. Two of these patterns should belong to the parallel operation condition, and one other pattern is the accidental interruption; however, the remaining six patterns are inexplainable unusual condition patterns. For the parallel operation condition patterns, our approach could detect fourteen unusual condition patterns. Three patterns could be explained as test operations, but the remaining eleven unusual patterns are inexplainable unusual condition patterns.

Table 7. The typical unusual condition patterns in the parallel operation condition patterns
Measurement Item                                Case 1   Case 2   Case 3
A. Generated Output (MW)                        -0.192   -0.192   -0.192
B. Revolutions Per Minute (rpm)                 232      231      229
C. Room Temperature (°C)                        26.8     26.8     26.7
D. Water Temperature (°C)                       25.6     25.6     25.6
E. Oil Cooler Inlet Air Temperature (°C)        28.1     28.0     28.0
F. Oil Cooler Outlet Air Temperature (°C)       35.3     35.3     35.3
G. Upper Bearing Temperature (°C)               45.6     45.6     45.8
H. Bottom Bearing Temperature (°C)              43.5     43.7     43.8
I. Upper Bearing Oil Temperature (°C)           42.4     42.6     42.4
J. Turbine Bearing Temperature (°C)             39.9     40.0     39.9
K. Turbine Bearing Oil Temperature (°C)         31.1     31.1     31.1
L. Thrust Bearing Temperature (°C)              52.9     52.9     53.1
M. Bottom Oil Tank Oil Temperature (°C)         40.3     40.2     40.2
N. Bottom Bearing Inlet Air Temperature (°C)    26.8     26.7     26.8
O. Generator Shaft Vibration (X axis) (μm)      23       18       12
P. Turbine Shaft Vibration (X axis) (μm)        70       54       47

For the parallel off condition patterns, our approach could detect eleven unusual condition patterns. Four patterns could be explained as test operations or accidental interruptions, but the remaining seven unusual patterns are inexplainable unusual condition patterns. These inexplainable unusual condition patterns are very important for the discovery of abnormal condition signs.

It is easy for the experts to manage the explainable unusual condition patterns, because the experts understand the reason for the generation of these unusual condition patterns. However, it is difficult to manage the inexplainable unusual condition patterns, because the experts cannot understand the reason for their generation without experience of the corresponding abnormal conditions. Therefore, the following two parts are important for the discovery of abnormal condition signs for hydroelectric power plants.

• The first part is to discover the unusual condition patterns by a system, to evaluate the discovered unusual condition patterns, and to select the inexplainable unusual condition patterns by human experts.


Table 8. The typical unusual condition patterns in the parallel off condition patterns
Measurement Item                                Case 1   Case 2   Case 3   Case 4
A. Generated Output (MW)                        -0.18    -0.192   -0.192   -0.192
B. Revolutions Per Minute (rpm)                 261      208      225      236
C. Room Temperature (°C)                        28.8     28.8     26.7     27.6
D. Water Temperature (°C)                       27.6     27.6     25.7     25.6
E. Oil Cooler Inlet Air Temperature (°C)        31.1     31.1     28.0     28.6
F. Oil Cooler Outlet Air Temperature (°C)       40.6     40.5     35.3     30.8
G. Upper Bearing Temperature (°C)               50.6     50.7     45.6     36.6
H. Bottom Bearing Temperature (°C)              50.7     50.9     43.7     32.5
I. Upper Bearing Oil Temperature (°C)           47.6     47.7     42.6     34.6
J. Turbine Bearing Temperature (°C)             44.0     43.9     40.0     26.8
K. Turbine Bearing Oil Temperature (°C)         37.0     37.0     31.1     25.3
L. Thrust Bearing Temperature (°C)              59.7     59.7     53.1     37.2
M. Bottom Oil Tank Oil Temperature (°C)         47.4     47.9     40.3     30.9
N. Bottom Bearing Inlet Air Temperature (°C)    29.2     29.4     26.6     28.2
O. Generator Shaft Vibration (X axis) (μm)      17       10       12       9
P. Turbine Shaft Vibration (X axis) (μm)        37       37       34       23


• The second part is to classify the inexplainable unusual condition patterns as abnormal condition signs in unseen data and to trace the trend of generation of the inexplainable unusual condition patterns.

Figure 4 shows our concept of the discovery of abnormal condition signs for hydroelectric power plants. In the upper part of this figure, the system discovers the unusual condition patterns and presents them to human experts. The human experts then evaluate the patterns, select the inexplainable unusual condition patterns as abnormal condition signs, and give the selected unusual condition patterns back to the system. In the lower part of Figure 4, the system classifies the abnormal condition signs in unseen data by using the selected unusual condition patterns, and traces the trend of the generation of abnormal condition signs. The human experts then obtain the trend of the generation of abnormal condition signs from the system and decide how to maintain the hydroelectric power plants.

Figure 4. The concept of discovery of abnormal condition signs (block diagram: an Unusual Condition Patterns Discovery System takes maintenance data as input and presents the discovered unusual condition patterns to experts for checking; the abnormal condition signs selected by the experts are fed back and stored; an Abnormal Condition Sign Generation Trend Tracing System then classifies abnormal condition signs and traces the trend of their generation, and presents it for decision making by experts for maintenance).

6 Conclusion

In this paper, we described an unusual condition detection method for bearing vibration in hydroelectric power plants, aimed at the discovery of abnormal condition signs. We proposed an unusual condition detection method for bearing vibration in hydroelectric power plants based on the One Class SVM, and we showed that our proposed approach could find the unusual condition patterns in hydroelectric power plants by using the expertise of experts. Moreover, we proposed a concept for the discovery of abnormal condition signs by using the detected unusual condition patterns, and introduced a system image for abnormal condition sign discovery. However, it is hard to evaluate the effectiveness of our approach for this problem, because abnormal conditions and trouble conditions occur very rarely in hydroelectric power plants in Japan, and we can use only normal condition data. From this point of view, we think our problem is very complex.

In our future work, we plan to develop our proposed system and evaluate its effectiveness for the discovery of abnormal condition signs in hydroelectric power plants.

References

[1] B. Schölkopf, A. Smola, R. Williamson, and P. Bartlett. New support vector algorithms. Neural Computation, 12:1083–1121, 2000.

[2] M. Yamana, H. Murata, T. Onoda, T. Oohashi, and S. Kato. Comparison of pattern classification methods in system for crossarm reuse judgement on the basis of rust images. In Proceedings of Artificial Intelligence and Applications 2005, pages 439–444, February 2003.


Automatic juridical texts classification and relevance feedback

Vincent Pisetta, Hakim Hacid and Djamel A. Zighed
University of Lyon 2
ERIC Laboratory – 5, av. Pierre Mendès-France – 69767 Bron – France
[email protected] ; [email protected] ; [email protected]

Abstract— Text retrieval is an important field in automatic information processing and natural language processing. A large variety of conferences, like TREC, are dedicated to it. The goal of text retrieval systems is to process textual data in order to make their retrieval easy. In this paper, we are interested in text retrieval and especially in automatic juridical text classification as a way to facilitate retrieval. We use linguistic tools (terminology extraction) in order to determine the concepts present in the corpus. Those concepts are used to create a low-dimensionality space, which enables us to use automatic learning algorithms based on similarity measures, such as proximity graphs. We compare results extracted from the graph with a more classical method: C4.5. The originality of this work, in addition to the use of real data, is the use of two methods for classifying our data. Promising results are obtained using our methodology, and a comparison between proximity graphs and decision trees is shown.

Index Terms— text retrieval, automatic text classification, decision trees, proximity graph.

I. INTRODUCTION

The general framework of machine learning starts from a training file containing n rows and p columns. The rows represent the items and the columns the features, quantitative or qualitative, observed for each item. In this context, one also supposes that the training sample is relatively large compared to the number of features. Generally speaking, the sample size should be about 10 times the number of features in order to hope for stability, i.e. a generalization error not much higher than the training error. Moreover, the feature to be predicted is generally supposed to have a unique value. It is a feature with real values in the case of regression and with discrete values, called classes, in the case of classification. In this paper we describe a training situation which deviates significantly from the traditional framework described above. Indeed, the experimental context does not enable us to immediately have a large training sample, each item can belong to several classes simultaneously, and each item, instead of being described by an attribute-value unit, is described by a text in natural English language.

Before describing the approach that we recommend for learning in this context, we first of all point out the problems of the application concerned (section 2). In section 3, we describe the adopted methodological approach. In section 4, we describe the stages implemented to format the data and, in particular, the linguistic analysis strategy implemented to extract the main concepts which will play the role of features. We then describe, in section 5, the topological models containing proximity graphs which enable us to manage the multiclass problem. For comparison purposes, we also implement a method based on decision trees, which is useful to better identify the discriminant concepts. In section 6, we introduce the results obtained from the linguistic analysis and the machine learning, and detail the obtained performances. In section 7, we conclude and detail the future issues of this work, in particular the use of association rules.

II. EXPERIMENTS FRAMEWORK

A. Problematics and corpus description

This work falls under a project in collaboration with the International Labour Organization (ILO). Several countries signed conventions with the ILO which bind them to international labour law. More concretely, the agreement relates to two conventions drawn up by the ILO, Convention n°87 and Convention n°98¹. The conventions contain several law articles which the signatories commit themselves to respecting. The countries are subjected, once per year, to an inspection whose goal is to check the respect of the conventions' rules. At the end of each inspection, the experts of the ILO deliver a report to the concerned country. The report shows the observed violations and underlines the efforts to be made in each country in order to be in adequacy with the conventions. Datasets (conventions) are expressed in free text without a specific coding of the violations and are written in English by lawyers elected by the ILO.

The goal of our work is to define and set up data mining methods and tools making it possible to analyse these corpora more effectively and more quickly, since manual analysis, as said before, is no longer suitable. For these reasons, the experts of the ILO wish to have tools allowing the automatic location of the texts containing the violation of one or more rules (articles) by the concerned countries.

¹ The contents of these conventions and the list of the signatories are accessible at http://www.ilo.org/ilolex/cgi-lex/convde.pl?C087 and http://www.ilo.org/ilolex/cgi-lex/convde.pl?C098



After the classification, the experts will be able to synthesize more quickly the difficulties that the various countries meet in the application of these conventions.

In order to achieve this, the ILO provided a database containing 1315 texts: 834 texts relative to Convention n°87, covering 17 violations (rules), and 481 texts relative to Convention n°98, describing 10 violations. Each text corresponds to an annual report addressed by the experts of the ILO about a specific country. Each text describes the violated rules related to each convention and the methods of this violation. The texts are sorted by convention (n°87 and n°98), date and country. Let us note that a text can report the violation of several rules.

III. METHODOLOGY

A. General principle

In order to set up methods able to identify the violations in a corpus, we use machine learning techniques. Our choice is motivated by the state of the data. Indeed, we are in a supervised learning case in which the predictive features represent the coordinates of each text (extracted by text mining techniques) and the classes to be predicted are represented by the violations assigned to each text. Thus, we can build a prediction model which, once evaluated and considered to be acceptable by the user, can be used as an automatic categorization tool for new texts.

In order to build a prediction model in a supervised learning context, the principle consists in providing to the training algorithm a dataset containing pre-classified texts, called the training set. In our case, because each text can violate one or more rules, we are in a multiclass learning problem. More formally, let ω be a text of the total corpus Ω and c_i, i = 1, ..., k, the convention rules liable to be violated. Then C(ω) = {c_1, c_2, ..., c_k} expresses that the text contains violations relating to the rules c_1, c_2, ..., c_k of convention v_a. Let us note in this context that it is difficult and extremely expensive to carry out this categorization manually in only one pass. We suggested to the experts to annotate about seventy texts per convention (exactly 71 for Convention n°87 and 65 for Convention n°98) to be used as the initial learning datasets. Let us call this first corpus Ω_1.

Let Ψ be a training algorithm. The result of the training is a model, noted M, and a generalization error ε estimated on a test sample or by cross-validation:

    Ψ(Ω, C) = (M, ε).

The application of the model M on a new unknown dataset Ω′ of relatively modest size allows us to predict, for each unknown text, the rules which would be violated, for example M(ω′) = {c_1, ..., c_k}. The relevance feedback allows the user to evaluate each labelled text belonging to the unknown dataset. The user U can then validate completely or partially the labelling suggested by the model. If, for a text ω′, the prediction is considered to be erroneous by the user U, then the concerned text is inserted in the training sample, Ω_{i+1} = Ω_i ∪ {ω′}, for another iteration. This operation is repeated each time an item is considered badly labelled. We reiterate the training process with the new sample thus created. We obtain a new model M′ for which one hopes for a lower error rate (ε′ < ε). A new unknown sample of modest size, to facilitate manual checking, is then made up. The reiteration of this process should lead to an iterative improvement of the model. The question which arises consequently is the choice of an algorithm able to manage multi-labelling. Proximity graphs [17], which belong to the instance-based learning methods, allow that. In order to use these methods, the texts have to be transformed into a vector representation. Each text can then be considered as a point in a multidimensional space R^p. The coordinates of a text in this space will be X(ω) = (X_1(ω), X_2(ω), ..., X_p(ω)). What do these variables represent, and how are they extracted?
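Purely as an illustration of this relevance-feedback loop (not the authors' implementation), the sketch below assumes hypothetical train(), predict() and expert_review() helpers, and an arbitrary batch size and stopping criterion:

    def relevance_feedback_loop(labelled, unlabelled, train, predict, expert_review,
                                batch_size=20, iterations=5):
        """Iteratively grow the training corpus with expert-corrected texts."""
        corpus = list(labelled)                      # Omega_1: initial annotated texts
        for _ in range(iterations):
            model = train(corpus)                    # Psi(Omega_i, C) -> M_i
            batch = unlabelled[:batch_size]          # small unknown sample Omega'
            unlabelled = unlabelled[batch_size:]
            for text in batch:
                predicted_rules = predict(model, text)        # M_i(omega')
                corrected_rules = expert_review(text, predicted_rules)
                if corrected_rules != predicted_rules:        # badly labelled item
                    corpus.append((text, corrected_rules))    # Omega_{i+1} = Omega_i U {omega'}
        return train(corpus)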

IV. REPRESENTATION SPACE

A. Terminology extraction

We chose to create our representation space by concept extraction. One of the advantages of this technique compared to other methods, such as N-grams or word co-occurrence matrices, is the significant reduction of the dimensionality, allowing the use of classifiers based on similarity measures. Various applications based on this principle gave interesting results [9]. Two different methods exist for concept construction: by training or by extraction.

The first one is a statistical method and aims to search for the most discriminant words according to a class to be predicted. The words are then gathered into concepts on the basis of their co-occurrences or thanks to association rules [9]. The second method is a linguistic method and consists in extracting the terminology from the corpus and gathering the extracted terms according to their semantic proximity. Our preference goes towards the linguistic techniques.

The choice of linguistic analysis is justified by the fact that this method makes it possible to prevent polysemy and to resolve ambiguities related to the context [6]. It also deals with small textual units [11]. Moreover, since our learning database contains few examples, it seems difficult to use the training techniques described above. Our work is carried out in collaboration with experts of the legal field, which is an additional reason to use the linguistic techniques. We use a data processing sequence inspired by [1].

After a standardization phase (conservation or not of capital letters, proper names, etc.), we carry out a grammatical tagging of the corpus using BRILL [3].


The goal of this operation is to associate to each word its grammatical tag. The next operation is the terminology extraction, for which we use a tool called EXIT [14]. Terminology extraction proceeds by searching for candidate terms. Candidate terms are sets of two or more adjacent units (lexical, syntactic or semantic). We then group the extracted candidate terms according to their semantic proximity, in order to locate the concepts present in the texts. In order to choose the candidate terms and to perform concept creation, we use the same methodology as [2]. We distinguish two stages:

1) First, we study the terms whose appearance frequency is higher than a threshold l. Initially, we fix a high threshold. This preliminary stage makes it possible to locate the main conceptual axes.

2) The semantically close candidate terms are gathered using tools such as WordNet [10]. The recourse to the expert is of primary importance here.

Phases 1 and 2 are iterative; we then enrich the representation very quickly by examining the candidate terms whose appearance frequency in the corpus is lower than the threshold l.

B. Vector Encoding

This level raises the problem of the choice of the representation model. We chose the vector model [15], which appears to us more adapted than the Boolean model, since it seems simplistic to apply a binary logic to an information search. The vector model proposes to represent a document on dimensions represented by the words. We adapted it to represent a document by a vector of concepts. And, instead of representing a document according to the frequency of each concept in the document, we use the TF × IDF weighting [16]:

    TF × IDF_{X_i}(ω) = TF_{X_i}(ω) × log( n / DF_{X_i} )

With:
o X_i a concept;
o ω a document;
o TF_{X_i}(ω) the absolute appearance frequency of the concept X_i in the document ω;
o DF_{X_i} the number of documents in the learning dataset containing the concept X_i;
o n the number of documents in the learning dataset.

At the end of this operation, we obtain a learning dataset which we represent in a tabular form. We will now use the TF × IDF scores (which can be calculated on all the texts) in order to predict violations in unlabelled texts.
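As a small numeric illustration of this weighting (the concept counts below are invented):

    import math

    # Toy corpus: per-document absolute frequencies of two concepts.
    concept_counts = [
        {"trade union rights": 3, "collective bargaining": 0},
        {"trade union rights": 1, "collective bargaining": 2},
        {"trade union rights": 0, "collective bargaining": 1},
    ]
    n = len(concept_counts)

    def tfidf(doc, concept):
        tf = doc[concept]
        df = sum(1 for d in concept_counts if d[concept] > 0)
        return tf * math.log(n / df) if df else 0.0

    # One TFxIDF vector of concepts per document.
    vectors = [[tfidf(doc, c) for c in sorted(doc)] for doc in concept_counts]
    print(vectors)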

V. MODELLING AND GENERALIZATION TOOLS

We have defined the representation space of our documents. The objective is now to use training algorithms with the aim of automatically classifying the documents. In our study, we have to classify multi-labelled texts. Two possibilities are then offered: a global approach, and a binary approach consisting in dividing the global problem into m subproblems. We implemented both approaches. We use the relative neighborhood graph in the global approach and decision trees in the binary approach.

A. Global approach

The representation space created is a low-dimensionality space. It therefore seems reasonable to use classifiers based on similarity measures. We chose proximity graphs, which come from computational geometry [12]. Proximity graphs have some advantages over k-NN: they define better the proximity between two items [5]. There are several graph models. We chose the relative neighborhood graph, which is a compromise between the number of neighbors and algorithmic complexity.

We apply a simple decision function. The main idea is that an unlabelled text inherits the properties of its neighbors contained in the training database. Let K be the number of neighbors of the text ω′ to be labelled and c_i a rule. The probability that the new text to be labelled contains a violation of c_i is:

    P(c_i | ω′) = (number of neighbors of ω′ which contain a violation of c_i) / K
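As an illustrative sketch (not the authors' code), the relative neighborhood graph and this neighbor-vote rule can be computed directly from the concept vectors; the toy vectors and labels are placeholders:

    import numpy as np

    def rng_neighbors(X):
        """Relative neighborhood graph: (i, j) are neighbors iff no third point k
        satisfies max(d(i, k), d(j, k)) < d(i, j) (Toussaint, 1980)."""
        d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
        n = len(X)
        neigh = {i: set() for i in range(n)}
        for i in range(n):
            for j in range(i + 1, n):
                if not any(max(d[i, k], d[j, k]) < d[i, j]
                           for k in range(n) if k not in (i, j)):
                    neigh[i].add(j)
                    neigh[j].add(i)
        return neigh

    # Toy TFxIDF vectors; the last row is the unlabelled text omega'.
    X = np.array([[1.0, 0.0], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9], [0.6, 0.5]])
    violations = [{"c1"}, {"c1"}, {"c2"}, {"c1", "c2"}, None]   # None = unlabelled

    neigh = rng_neighbors(X)
    labelled_neighbors = [j for j in neigh[4] if violations[j] is not None]
    K = len(labelled_neighbors)
    for rule in ("c1", "c2"):
        votes = sum(1 for j in labelled_neighbors if rule in violations[j])
        print(rule, votes / K if K else 0.0)       # P(c_i | omega')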

B. Binary approach

The objective here is to predict the presence or the absence of each rule. We consequently create as many trees as there are rules. More formally, we consider each rule as a Boolean attribute c_i ∈ {0, 1}. If the convention v_a contains k rules, we create k trees. Each tree is then a model M_i predicting the presence or the absence of a rule c_i. We thus obtain k models which return c_i if rule i is considered violated, and ∅ otherwise. Discrimination is carried out by the C4.5 algorithm [13].
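A rough sketch of this binary approach, using scikit-learn's CART-style DecisionTreeClassifier as a stand-in for C4.5 (the feature matrix and rule labels are invented):

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    rng = np.random.default_rng(0)
    X = rng.random((65, 12))                 # TFxIDF concept vectors (placeholder)
    Y = rng.integers(0, 2, size=(65, 10))    # one Boolean column per rule c_1..c_10 (placeholder)

    models = {}
    for i in range(Y.shape[1]):              # one tree per rule of the convention
        models[i] = DecisionTreeClassifier(min_samples_leaf=2).fit(X, Y[:, i])

    new_text = rng.random((1, 12))
    predicted = {f"c{i + 1}" for i, m in models.items() if m.predict(new_text)[0] == 1}
    print("predicted violations:", predicted or "none")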

VI. RESULTS AND METHODS

We will illustrate the obtained results on Convention n°98. Our labelled dataset is composed of only 65 texts. We consequently decided to entirely preserve it during the execution of the training algorithms. In order to have a more precise idea of the real error rate, we carry out 200 bootstrap replications of the training sample. The results are illustrated hereafter.
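One common way to carry out such bootstrap replications is sketched below (train on a resample, score on the out-of-bag texts); this is illustrative and not necessarily the authors' exact protocol, and the data are placeholders:

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    rng = np.random.default_rng(0)
    X = rng.random((65, 12))                       # placeholder concept vectors
    y = rng.integers(0, 2, size=65)                # presence/absence of one violation

    accuracies = []
    for _ in range(200):                           # 200 bootstrap replications
        idx = rng.integers(0, len(X), size=len(X))         # sample with replacement
        oob = np.setdiff1d(np.arange(len(X)), idx)         # out-of-bag texts
        if len(oob) == 0:
            continue
        model = DecisionTreeClassifier(min_samples_leaf=2).fit(X[idx], y[idx])
        accuracies.append(model.score(X[oob], y[oob]))
    print("estimated % well classified:", 100 * np.mean(accuracies))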

A. C4.5 results

TABLE 1. C4.5 RESULTS
                      Learning                                  200 bootstrap
              % well classified  recall  precision   % well classified  recall  precision
Violation 1        95%            100%      91%           94%            93%       94%
Violation 2        98%             96%     100%           98%            96%       99%
Violation 3       100%            100%     100%           95%            42%      100%
Violation 4        98%             86%     100%           89%            38%       96%
Violation 5        91%             91%      90%           87%            86%       87%
Violation 6        94%             93%      95%           80%            78%       81%
Violation 7        98%             80%     100%           91%            36%       95%
Violation 8        89%             94%      83%           86%            90%       82%
Violation 9        94%             79%     100%           89%            82%       92%
Violation 10      100%            100%     100%           96%            30%       99%

We observe excellent classification rates in training (between 89% and 100%). Sometimes, one observes a collapse of the recall on the bootstrap samples. The reason for this collapse is simple: the training dataset used here comprises few examples, so it happens that some violations are poorly represented in the dataset. It is consequently difficult for C4.5 to determine the "profiles" of texts containing these violations. On the other hand, the results are very encouraging for the recognition of violations where the presence-absence counts are balanced enough. The good classification rates remain stable on the bootstrapped samples.

B. Comparison of the two classifiers

We do not have a test sample with labelled texts, for the previously quoted reasons; however, we can note a strong similarity between C4.5 and RNG (Table 2). To evaluate the predictive similarity between the two methods, we extracted 20 unlabelled texts and applied the RNG and C4.5 with the aim of predicting the violations contained in these texts.

TABLE 2. COMPARISON BETWEEN RNG AND C4.5
% agreement*       100%   80%   75%   66%   50%   33%
Nb of countries      9     1     3     3     2     2
* There is agreement if the two methods identify the same violation in the same text.

In spite of the low size of our learning dataset, the two methods show rather similar results (sixty percent of the texts have a percentage of agreement equal to or higher than 66%), which is encouraging. We consequently have strong reasons to think that an increase in the number of examples will result in an even more significant agreement.

VII. CONCLUSION AND FUTURE WORK

The finality of this work is to propose a predictive model able to determine the violations committed by several countries concerning two conventions of labour law. A machine learning approach was adopted. Initially (data preparation), we extracted, thanks to linguistic techniques, a set of candidate terms which then allowed us to create concepts related to the studied corpus. The purpose of this operation is to reduce the dimensionality of the representation space of the texts. We were thus able to use proximity graphs, in addition to C4.5. The results seem to be interesting insofar as the two prediction methods give similar results in spite of a training dataset comprising few examples. We now plan to increase the learning database size with the aim of improving the prediction and of leading to more robust results.

REFERENCES

[1] Amrani, A., Azé, J., Heitz, T., Kodratoff, Y., Roche, M. (2004). From the texts to the concepts they contain: a chain of linguistic treatments. TREC 2004, p. 1-5.

[2] Baneyx, A., Charlet, J., Jaulent, M.C. (2005). Construction d'ontologies médicales fondée sur l'extraction terminologique à partir de textes : application au domaine de la pneumologie. Journées Francophones d'Informatique Médicale (JFIM) 2005, p. 1-6.

[3] Brill, E. (1995). Transformation-Based Error-Driven Learning and Natural Language Processing: A Case Study in Part-of-Speech Tagging. Computational Linguistics, 21(4): 543-565.

[4] Bourigault, D., Jacquemin, C. (1999). Term Extraction + Term Clustering: An Integrated Platform for Computer-Aided Terminology. Proceedings of the European Chapter of the Association for Computational Linguistics (EACL'99), Bergen, pages 15-22.

[5] Clech, J. (2004). Contribution Méthodologique à la Fouille de Données Complexes. Thèse, Laboratoire ERIC, Université Lumière Lyon 2.

[6] Fluhr, C. (2000). Indexation et recherche d'information textuelle. Ingénierie des langues. Hermes, 2000.

Institute, University of Calgary.

[8] Hajek, P., Havranek, T., Chytil, M. (1983). GUHA Method. Academia, Prague.

[9] Kumps, N., Francq, P., Delchambre, A. (2004). Création d'un espace conceptuel par analyse de données contextuelles. JADT 2004 (International Conference on Statistical Analysis of Textual Data), p. 683-691.

[10] Miller, G. A., Fellbaum, C., Tengi, R., Wolff, S., Wakefield, P., Langone, H., Haskell, B. (2005). WordNet 2.1, Cognitive Science Laboratory, Princeton University.

[11] Pouliquen, B., Delamarre, D., Le Beux, P. (2002). Indexation de textes médicaux par extraction de concepts, et ses utilisations. JADT 2002, p. 617-627.

[12] Preparata, F., Shamos, M. (1985). Computational Geometry: An Introduction. New York, Springer.

[13] Quinlan, J.R. (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann, San Mateo, CA.

[14] Roche, M., Heitz, T., Matte-Tailliez, O., Kodratoff, Y. (2004). EXIT: Un système itératif pour l'extraction de la terminologie du domaine à partir de corpus spécialisés. Proceedings of JADT 2004 (International Conference on Statistical Analysis of Textual Data), volume 2, p. 946-956.

[15] Salton, G. (1971). The SMART retrieval system. Experiment in automatic document processing. Prentice Hall, Englewood Cliffs, New Jersey.

[16] Salton, G., Buckley, C. (1988). Term weighting approaches in automatic text retrieval. Information Processing and Management, 24(5), p. 513-523.

[17] Toussaint, G.T. (1980). The relative neighborhood graph of a finite planar set. Pattern Recognition, 12: 261-268.

                        Learning                              200 bootstrap
               % well classified  recall  precision   % well classified  recall  precision
Violation 1          95%           100%      91%            94%            93%      94%
Violation 2          98%            96%     100%            98%            96%      99%
Violation 3         100%           100%     100%            95%            42%     100%
Violation 4          98%            86%     100%            89%            38%      96%
Violation 5          91%            91%      90%            87%            86%      87%
Violation 6          94%            93%      95%            80%            78%      81%
Violation 7          98%            80%     100%            91%            36%      95%
Violation 8          89%            94%      83%            86%            90%      82%
Violation 9          94%            79%     100%            89%            82%      92%
Violation 10        100%           100%     100%            96%            30%      99%



Mining E-Action Rules

Zbigniew W. Ras 1,2
2 Polish-Japanese Institute of IT, Department of Computer Science, Koszykowa 86, 02-008 Warsaw, [email protected]

Li-Shiang Tsay 1
1 University of North Carolina, Department of Computer Science, Charlotte, N.C. 28223, [email protected]

Agnieszka Dardzinska 3,1
3 Bialystok Technical University, Department of Mathematics, 15-351 Bialystok, [email protected]

Abstract

One of the main goals in Knowledge Discovery is to find interesting associations between values of attributes, those that are meaningful in a domain of interest. This task may be viewed as searching a space of possible actionable rules [1] and for relations among them. Because classical knowledge discovery algorithms are unable to determine if a rule is truly actionable for a given user, we focus on a new class of action rules [12], called E-action rules, that can be used not only for automatic analysis of discovered classification rules but also as hints on how to reclassify some objects in a data set from one state into another, more desired one. Actionability is closely linked with the availability of flexible attributes [14] used to describe data and with the feasibility and cost [18] of the desired re-classifications of objects. Some of them are easy to achieve. Some, initially seen as unreachable within the constraints set up by a user, can still be successfully achieved if additional attributes are available [18]. So, if a system is distributed and the collaborating sites agree on the ontology [5], [6] of their common attributes, the availability of additional data from remote sites is sometimes a necessity to achieve certain re-classifications of objects at a server site. The action tree algorithm, presented in this paper, guarantees a quicker and more effective process of E-action rule discovery than the algorithms proposed in [12], [13]. It was implemented as the system DEAR 2.2 and tested on several public domain databases. Support and confidence of E-action rules are introduced and used to prune a large number of generated candidates which are irrelevant, spurious, and insignificant.

1 Introduction

Finding useful rules is an important task of knowledge discovery in data. Most researchers on knowledge discovery have focused on techniques for generating patterns, such as classification rules, association rules, etc., from a data set. They assumed that it is the user's responsibility to analyze these patterns and infer actionable solutions for specific problems within a given domain. Classical knowledge discovery algorithms have the potential to identify an enormous number of significant patterns from data. Therefore, people are overwhelmed by a large number of uninteresting patterns which are very difficult to analyze and use to form timely solutions. In other words, not all discovered classification rules are interesting enough to be presented and used. So, there is a need to look for new techniques and tools with the ability to assist people in identifying rules with useful knowledge. E-action rule mining is a technique that intelligently and automatically assists humans in acquiring useful information from data. This information can be turned into actions which can benefit users.

An E-action rule is a significant improvement over a classical action rule [12] and an extended action rule [13], introducing a well-defined notion of its supporting class of objects. E-action rules automatically assist humans in acquiring actionable knowledge. They are constructed from certain pairs of classification rules. These rules can be used not only for evaluating discovered patterns but also for reclassifying some objects in the data set from one state into a new desired state. Given a set of classification rules extracted from data, they can be used for predicting the class label of new data objects. For example, classification rules found



from a bank's data can be very useful to describe who is a good client (to whom to offer additional services) and who is a bad client (whom to watch carefully to minimize the bank's losses). However, if bank managers need to improve their understanding of customers and seek specific actions to improve the services, mere classification rules are not sufficient. In this paper, we propose to use classification rules to build a strategy of action based on their condition features in order to get a desired effect on their decision feature. Going back to the bank example, the strategy of action would consist of modifying some condition features in order to improve our understanding of customers' behavior and then improve the services. E-action rules are useful in many other fields, including medical diagnosis. In medical diagnosis, classification rules can explain the relationships between symptoms and sickness and help in predicting the diagnosis of a new patient. E-action rules are useful in providing a hint to a doctor as to what symptoms have to be modified in order to recover a certain group of patients from a given illness.

There are two types of interestingness measures: objective and subjective (see [7], [1], [15], [16]). Subjective interestingness measures include unexpectedness [15] and actionability [1]. When a rule contradicts the user's prior belief about the domain, uncovers new knowledge, or surprises him, it is classified as unexpected. A rule is deemed actionable if the user can take action to gain an advantage based on this rule. Domain experts basically look at a rule and say that this rule can be converted into an appropriate action. In this paper, we are interested in tackling the actionability issue by presenting E-action rules to automatically analyze classification rules and effectively develop workable strategies.

The Action Tree algorithm is presented for generating E-action rules and is implemented as the system DEAR 2.2. The algorithm follows a top-down strategy that searches for a solution in a part of the search space. At each stage it seeks a stable attribute that has the smallest number of values. Then, the set of rules is split recursively using that attribute. When all stable attributes are processed, the final subsets are split further based on the decision attribute. This strategy generates an action tree which is used to construct E-action rules from the leaf nodes of the same parent.

2 Information System and E-Action Rules

An information system is used for representing knowledge. Its definition, presented here, is due to Pawlak [9].

By an information system we mean a pair S = (U, A), where:

• U is a nonempty, finite set of objects,

Table 1. Decision System

       a  b  c  d
x1     2  1  2  L
x2     2  1  2  L
x3     1  1  0  H
x4     1  1  0  H
x5     2  3  2  H
x6     2  3  2  H
x7     2  1  1  L
x8     2  1  1  L
x9     2  2  1  L
x10    2  3  0  L
x11    1  1  2  H
x12    1  1  1  H

• A is a nonempty, finite set of attributes, i.e., each a ∈ A is a function a : U → Va, where Va is called the domain of a.

Elements of U are called objects. In this paper, for the purpose of clarity, objects are interpreted as customers. Attributes are interpreted as features such as offers made by a bank, characteristic conditions, etc.

We consider a special case of information systems called decision tables [9]. In any decision table, together with the set of attributes, a partition of that set into conditions and decisions is given. Additionally, we assume that the set of conditions is partitioned into stable conditions and flexible conditions. For simplicity, we assume that there is only one decision attribute. Date of birth is an example of a stable attribute. The interest rate on any customer account is an example of a flexible attribute, as the bank can adjust rates. We adopt the following definition of a decision table:

A decision table is any information system of the form S = (U, ASt ∪ AFl ∪ {d}), where d ∉ ASt ∪ AFl is a distinguished attribute called the decision. The elements of ASt are called stable conditions, whereas the elements of AFl are called flexible conditions.

As an example of a decision table we take S = ({x1, x2, x3, x4, x5, x6, x7, x8, x9, x10, x11, x12}, {a, c} ∪ {b} ∪ {d}) represented by Table 1. The set {a, c} lists the stable attributes, b is a flexible attribute and d is a decision attribute. Also, we assume that H denotes a high profit and L denotes a low one.
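For illustration, the decision table of Table 1 can be encoded as a small Python data structure (a sketch of our own; the names DECISION_TABLE, STABLE, FLEXIBLE and DECISION are illustrative and not part of the paper):

# Table 1 as a Python dictionary: object id -> attribute values.
# Stable attributes: a, c; flexible attribute: b; decision attribute: d.
DECISION_TABLE = {
    "x1":  dict(a=2, b=1, c=2, d="L"),  "x2":  dict(a=2, b=1, c=2, d="L"),
    "x3":  dict(a=1, b=1, c=0, d="H"),  "x4":  dict(a=1, b=1, c=0, d="H"),
    "x5":  dict(a=2, b=3, c=2, d="H"),  "x6":  dict(a=2, b=3, c=2, d="H"),
    "x7":  dict(a=2, b=1, c=1, d="L"),  "x8":  dict(a=2, b=1, c=1, d="L"),
    "x9":  dict(a=2, b=2, c=1, d="L"),  "x10": dict(a=2, b=3, c=0, d="L"),
    "x11": dict(a=1, b=1, c=2, d="H"),  "x12": dict(a=1, b=1, c=1, d="H"),
}
STABLE, FLEXIBLE, DECISION = {"a", "c"}, {"b"}, "d"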

In order to induce rules in which the THEN part consists of the decision attribute d and the IF part consists of attributes belonging to ASt ∪ AFl, the system LERS [4], for instance, can be used for rule extraction.

Alternatively, we can extract rules from sub-tables (U, B ∪ {d}) of S, where B is a d-reduct (see [8]) of S,



to improve the efficiency of the algorithm when the number of attributes is large. The set B is called a d-reduct in S if there is no proper subset C of B such that d depends on C. The concept of a d-reduct in S was introduced to induce rules from S efficiently, describing values of the attribute d depending on minimal subsets of ASt ∪ AFl.

By L(r) we mean all attributes listed in the IF part of a rule r. For example, if r1 = [(a1, 2) ∧ (a2, 1) ∧ (a3, 4) → (d, 8)] is a rule, then L(r1) = {a1, a2, a3}.

By d(r1) we denote the decision value of that rule. In our example d(r1) = 8. If r1, r2 are rules and B ⊆ ASt ∪ AFl is a set of attributes, then r1/B = r2/B means that the conditional parts of the rules r1, r2 restricted to the attributes B are the same. For example, if r2 = [(a2, 1) ∧ (a3, 4) → (d, 1)], then r1/{a2, a3} = r2/{a2, a3}.
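As a sketch of this notation (our own illustration, with rules represented as plain Python dictionaries; the helper names are hypothetical), L(r), d(r) and the restriction r/B can be written as:

# A rule is (conditions, (decision_attribute, decision_value)).
r1 = ({"a1": 2, "a2": 1, "a3": 4}, ("d", 8))
r2 = ({"a2": 1, "a3": 4}, ("d", 1))

def L(rule):
    """Attributes listed in the IF part of the rule."""
    return set(rule[0])

def d(rule):
    """Decision value of the rule."""
    return rule[1][1]

def restrict(rule, B):
    """Conditional part of the rule restricted to the attributes in B."""
    return {a: v for a, v in rule[0].items() if a in B}

assert L(r1) == {"a1", "a2", "a3"} and d(r1) == 8
assert restrict(r1, {"a2", "a3"}) == restrict(r2, {"a2", "a3"})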

In our example, we get the following certain rules with support greater than or equal to 2:

(b, 3) ∧ (c, 2) → (d, H), (a, 2) ∧ (b, 1) → (d, L), (a, 2) ∧ (c, 1) → (d, L), (b, 1) ∧ (c, 0) → (d, H), (a, 1) → (d, H).

Now, let us assume that (a, v → w) denotes the fact that the value of attribute a has been changed from v to w. Similarly, the term (a, v → w)(x) means that a(x) = v has been changed to a(x) = w. In other words, the property (a, v) of object x has been changed to the property (a, w).

Let S = (U, ASt ∪ AFl ∪ {d}) be a decision table and let r1, r2 be rules extracted from S. The notion of an extended action rule was given in [17]. Its definition is given below. We assume here that:

• BSt is a maximal subset of ASt such that r1/BSt = r2/BSt,

• d(r1) = k1, d(r2) = k2 and k1 ≤ k2,

• (∀a ∈ [ASt ∩ L(r1) ∩ L(r2)]) [a(r1) = a(r2)],

• (∀i ≤ q)(∀ei ∈ [ASt ∩ [L(r2) − L(r1)]]) [ei(r2) = ui],

• (∀i ≤ r)(∀ci ∈ [AFl ∩ [L(r2) − L(r1)]]) [ci(r2) = ti],

• (∀i ≤ p)(∀bi ∈ [AFl ∩ L(r1) ∩ L(r2)]) [[bi(r1) = vi] & [bi(r2) = wi]].

Let ASt ∩ L(r1) ∩ L(r2) = B. By a (r1, r2)-E-action rule on x ∈ U we mean the expression r:

[⋀{a = a(r1) : a ∈ B} ∧ (e1 = u1) ∧ (e2 = u2) ∧ ... ∧ (eq = uq) ∧ (b1, v1 → w1) ∧ (b2, v2 → w2) ∧ ... ∧ (bp, vp → wp) ∧ (c1, → t1) ∧ (c2, → t2) ∧ ... ∧ (cr, → tr)](x) ⇒ [(d, k1 → k2)](x)

An object x ∈ U supports the (r1, r2)-E-action rule r in S = (U, ASt ∪ AFl ∪ {d}) if the following conditions are satisfied:

• (∀i ≤ p)[bi ∈ L(r)][bi(x) = vi] ∧ d(x) = k1

• (∀i ≤ p)[bi ∈ L(r)][bi(y) = wi] ∧ d(y) = k2

• (∀j ≤ p)[aj ∈ (ASt ∩ L(r2))][aj(x) = uj ]

• (∀j ≤ p)[aj ∈ (ASt ∩ L(r2))][aj(y) = uj ]

• object x supports rule r1

• object y supports rule r2

By the support of an E-action rule r in S, denoted by SupS(r), we mean the set of all objects in S supporting r. In other words, this set is defined as:

{x : [a1(x) = u1] ∧ [a2(x) = u2] ∧ ... ∧ [aq(x) = uq] ∧ [b1(x) = v1] ∧ [b2(x) = v2] ∧ ... ∧ [bp(x) = vp] ∧ [d(x) = k1]}.

By the confidence of r in S, denoted by ConfS(r), we mean

ConfS(r) = [card(SupS(r)) / card(SupS(L(r)))] × Conf(r2).

That is, to find the confidence of a (r1, r2)-E-action rule in S, we divide the number of objects supporting the (r1, r2)-E-action rule in S by the number of objects supporting the left-hand side of the (r1, r2)-E-action rule, and multiply by the confidence of the classification rule r2 in S.

3 Discovering E-Action Rules

In this section we present a new algorithm, called the Action-Tree algorithm, for discovering E-action rules. Basically, we partition the set of classification rules R discovered from a decision system S = (U, ASt ∪ AFl ∪ {d}), where ASt is the set of stable attributes, AFl is the set of flexible attributes and Vd = {d1, d2, ..., dk} is the set of decision values, into subsets of rules having the same values of stable attributes in their classification parts and defining the same value of the decision attribute. Classification rules can be extracted from S using, for instance, the discovery system LERS [2].

The Action-Tree algorithm for extracting E-action rules from a decision system S is as follows:

i. Build the action tree

a. Partition the set of classification rules R in such a way that two rules are in the same class if their stable attributes are the same.

1. Find the cardinality of the domain Vvi for each stable attribute vi in S.

2. Take the vi whose card(Vvi) is smallest as the splitting attribute and divide R into subsets, each of which contains rules having the same value of the stable attribute vi.


3. For each subset obtained in step 2, determine if it contains rules with different decision values and different values of flexible attributes. If it does, go to step 2. If it does not, there is no need to split the subset further and we place a mark.

b. Partition each resulting subset into new subsets, each of which contains only rules having the same decision value.

c. Each leaf of the resulting tree represents a set of rules which do not contradict on stable attributes and which uniquely defines a decision value di. The path from the root to that leaf gives the description of the objects supported by these rules.

ii. Generate E-action rules

a. Form E-action rules by comparing all unmarked leaf nodes of the same parent.

b. Calculate the support and confidence of each newly formed E-action rule. If the support and confidence meet the thresholds set by the user, output the rule.

The algorithm starts at the root node of the tree, called the E-action tree, representing all classification rules extracted from S. A stable attribute is selected to partition these rules. For each value of that attribute an outgoing edge from the root node is created, and the corresponding subset of rules that have the attribute value assigned to that edge is moved to the newly created child node. This process is repeated recursively for each child node. When we are done with stable attributes, the last split is based on the decision attribute for each current leaf of the E-action tree. If at any time all classification rules representing a node have the same decision value, then we stop constructing that part of the tree. We still have to explain which stable attribute is chosen to split the classification rules representing a node of the E-action tree. The algorithm selects any stable attribute which has the smallest number of possible values among all the remaining stable attributes. This step is justified by the need to apply a heuristic strategy (see [11]) which minimizes the number of edges in the resulting tree and thereby lowers the time complexity of the algorithm.

We have two types of nodes: leaf nodes and non-leaf nodes. At a non-leaf node, the set of rules is partitioned along the branches and each child node gets its corresponding subset of rules. Every path to a decision attribute node, one level above the leaf node, represents a subset of the extracted classification rules for which the stable attributes have the same value. Each leaf node represents a set of rules which do not contradict on stable attributes and which define a decision value di. The path from the root to that leaf gives the description of the objects supported by these rules.
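The following Python sketch illustrates our reading of this recursive partitioning; it is not the authors' implementation. It assumes rules are (conditions, decision) pairs, splits on the stable attribute with the smallest domain, copies rules that do not mention that attribute into every branch, and marks nodes that cannot yield a workable strategy:

from collections import defaultdict

def smallest_domain_attr(rules, stable_attrs):
    """Stable attribute with the fewest distinct values among the rules that mention it."""
    cards = {}
    for a in stable_attrs:
        values = {conds[a] for conds, _ in rules if a in conds}
        if values:
            cards[a] = len(values)
    return min(cards, key=cards.get) if cards else None

def is_marked(rules, flexible_attrs):
    """Marked nodes are not split further: all rules share the decision value,
    or no flexible attribute takes more than one value across the rules."""
    if len({dec for _, dec in rules}) <= 1:
        return True
    return not any(len({conds[a] for conds, _ in rules if a in conds}) > 1
                   for a in flexible_attrs)

def build_action_tree(rules, stable_attrs, flexible_attrs):
    """Return the groups of sibling leaves (decision value -> rules) of the action tree."""
    if is_marked(rules, flexible_attrs):
        return []
    attr = smallest_domain_attr(rules, stable_attrs)
    if attr is None:                      # no stable attribute left: split on the decision
        leaves = defaultdict(list)
        for conds, dec in rules:
            leaves[dec].append((conds, dec))
        return [dict(leaves)]
    values = {conds[attr] for conds, _ in rules if attr in conds}
    branches = defaultdict(list)
    for conds, dec in rules:
        for v in ([conds[attr]] if attr in conds else values):
            branches[v].append((conds, dec))  # rules silent on attr go to every branch
    remaining = [a for a in stable_attrs if a != attr]
    groups = []
    for subset in branches.values():
        groups.extend(build_action_tree(subset, remaining, flexible_attrs))
    return groups

# The five certain rules of the running example, as (conditions, decision) pairs.
R = [({"a": 1}, "H"), ({"a": 2, "b": 1}, "L"), ({"a": 2, "c": 1}, "L"),
     ({"b": 1, "c": 0}, "H"), ({"b": 3, "c": 2}, "H")]
print(build_action_tree(R, ["a", "c"], ["b"]))

On the five certain rules of the running example the sketch returns a single pair of sibling leaves, one containing (a, 2) ∧ (b, 1) → (d, L) and the other (b, 3) ∧ (c, 2) → (d, H), which is exactly the pair from which the E-action rule of the worked example below is constructed.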

Table 2. Set of rules R with supporting objects

Objects              a    b    c    d
x3, x4, x11, x12     1              H
x1, x2, x7, x8       2    1         L
x7, x8, x9           2         1    L
x3, x4                    1    0    H
x5, x6                    3    2    H

Instead of splitting the set of rules R first by stable attributes and next by the decision attribute, we can also start the partitioning algorithm from the decision attribute. For instance, if the decision attribute has 3 values, we get 3 initial sub-trees. In the next step of the algorithm, we start splitting these sub-trees by stable attributes, following the same strategy as the one presented for E-action trees. This new algorithm is called the action-forest algorithm.

Now, let us take Table 1 as an example of a decision system S. Attributes a, c are stable and b is flexible; d is the decision attribute. Assume now that our plan is to re-classify some objects from the class d⁻¹(di) into the class d⁻¹(dj). In our example, we also assume that di = (d, L) and dj = (d, H).

First, we represent the set R of certain rules extracted from S as a table (see Table 2). The first column of this table shows the objects in S supporting the rules from R (each row represents a rule). For instance, the row supported by x7, x8, x9 represents the rule [[(a, 2) ∧ (c, 1)] ⇒ (d, L)]. The construction of an action tree starts with the set R as a table (see Table 2) representing the root of the tree (T1 in Fig. 1). The root node selection is based on a stable attribute with the smallest number of values among all stable attributes. The same strategy is used for child node selection. After labelling the nodes of the tree by all stable attributes, the tree is split based on the value of the decision attribute. Referring back to the example in Table 1, we use the stable attribute a to split that table into two sub-tables defined by the values {1, 2} of attribute a. The domain of attribute a is {1, 2} and the domain of attribute c is {0, 1, 2}. Clearly, card[Va] is less than card[Vc], so we partition the table into two: one table with rules containing a = 1 and another with rules containing a = 2. The corresponding edges are labelled by the values of attribute a. All rules in sub-table T2 have the same decision value, so we cannot construct an E-action rule from that sub-table, which means it is not divided any further. Because the rules in sub-table T3 contain different decision values and a stable attribute c, T3 is partitioned into three sub-tables: one with rules containing c = 0, one with rules containing c = 1, and one with rules containing c = 2. Now, the rules in each of these sub-tables do not contain any stable attributes.



Figure 1. Action tree

Sub-table T6 is not split any further, for the same reason as sub-table T2. All objects in sub-table T4 have the same value of the flexible attribute b; there is no way to form a workable strategy from this sub-table, so it is not partitioned any further. Sub-table T5 is divided into two new sub-tables. Each leaf represents a set of rules which do not contradict on stable attributes and which define a decision value di.

The path from the root of the tree to that leaf gives the description of the objects supported by these rules. Following the path labelled by the values [a = 2], [c = 2], and [d = L], we get table T7. Following the path labelled by the values [a = 2], [c = 2], and [d = H], we get table T8. Because T7 and T8 are sibling nodes, we can directly compare pairs of rules belonging to these two tables and construct one E-action rule such as:

[[(a, 2) ∧ (b, 1 → 3)] ⇒ (d, L → H)].

After the rule is formed, we evaluate it by checking its support and its confidence (sup = 4, conf = 100%).
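A minimal sketch of this support and confidence computation, assuming the decision table of Table 1 (repeated inline so the snippet is self-contained) and helper names of our own, reproduces the values sup = 4 and conf = 100% quoted above:

# Table 1 repeated here so the sketch is self-contained.
TABLE = {
    "x1":  dict(a=2, b=1, c=2, d="L"),  "x2":  dict(a=2, b=1, c=2, d="L"),
    "x3":  dict(a=1, b=1, c=0, d="H"),  "x4":  dict(a=1, b=1, c=0, d="H"),
    "x5":  dict(a=2, b=3, c=2, d="H"),  "x6":  dict(a=2, b=3, c=2, d="H"),
    "x7":  dict(a=2, b=1, c=1, d="L"),  "x8":  dict(a=2, b=1, c=1, d="L"),
    "x9":  dict(a=2, b=2, c=1, d="L"),  "x10": dict(a=2, b=3, c=0, d="L"),
    "x11": dict(a=1, b=1, c=2, d="H"),  "x12": dict(a=1, b=1, c=1, d="H"),
}

def matches(obj, conds):
    return all(obj[attr] == val for attr, val in conds.items())

def support_and_confidence(table, lhs_from, d_from, d_to, r2_lhs):
    """lhs_from: stable part plus the 'from' values of the flexible attributes;
    r2_lhs: the IF part of the source classification rule r2."""
    # objects supporting the E-action rule: left-hand side plus d = d_from
    sup = [x for x, o in table.items() if matches(o, lhs_from) and o["d"] == d_from]
    # objects matching the left-hand side of the E-action rule
    lhs = [x for x, o in table.items() if matches(o, lhs_from)]
    # confidence of the classification rule r2 in the table
    r2_cov = [o for o in table.values() if matches(o, r2_lhs)]
    conf_r2 = sum(o["d"] == d_to for o in r2_cov) / len(r2_cov)
    return len(sup), (len(sup) / len(lhs)) * conf_r2

# E-action rule [[(a,2) ∧ (b,1 → 3)] ⇒ (d,L → H)], built from
# r1 = (a,2) ∧ (b,1) → (d,L) and r2 = (b,3) ∧ (c,2) → (d,H).
sup, conf = support_and_confidence(TABLE, {"a": 2, "b": 1}, "L", "H", {"b": 3, "c": 2})
print(sup, conf)   # 4 and 1.0, i.e. sup = 4 and conf = 100%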

This new algorithm (called DEAR 2.2) was implemented and tested on several data sets from the UCI Machine Learning Repository. In all cases, the action-tree algorithm was more efficient than the action-forest algorithm. The E-action rules generated by both algorithms are the same. The confidence of E-action rules is higher than the confidence of action rules.

4 Conclusion

E-action rules are structures that represent actionability in an objective way. The strategy used to generate them is data-driven and domain-independent, because it does not depend on domain knowledge. Although the definition of E-action rules is purely objective, we still cannot get rid of some degree of subjectivity in determining how actions can be implemented. To build E-action rules, we divide all attributes into two subsets, stable and flexible. Obviously, this partition has to be done by users, who decide which attributes are stable and which are flexible. This is a purely subjective decision. A stable attribute has no influence on change, but any flexible attribute may influence changes. Users have to be careful when judging which attributes are stable and which are flexible. If we apply an E-action rule to objects, then it shows how the values of their flexible features should be changed in order to achieve the desired re-classification. Stable features will always remain the same. Basically, any E-action rule identifies a class of objects that can be reclassified from an undesired state to a desired state by properly changing some of the values of their flexible features. How to implement these changes often depends on the user. If the attribute is the interest rate on a banking account, then the bank can take the appropriate action as the rule states (e.g., change a lower interest rate to 4.75%). In this case, it is a purely objective action. However, if the attribute is a



fever, then doctors may lower the temperature by following a number of different actions. So, this is a purely subjective concept. Basically, we cannot eliminate a certain amount of subjectivity in the process of E-action rule construction and implementation.

5 Acknowledgement

This research was partially supported by the National Science Foundation under grant IIS-0414815.

References

[1] Adomavicius G, Tuzhilin A (1997) Discovery of actionable patterns in databases: the action hierarchy approach. In: Proceedings of KDD97 Conference, Newport Beach, CA, AAAI Press

[2] Chmielewski M R, Grzymala-Busse J W, Peterson N W, Than S (1993) The rule induction system LERS - a version for personal computers. In: Foundations of Computing and Decision Sciences, Vol. 18, No. 3-4, Institute of Computing Science, Technical University of Poznan, Poland: 181–212

[3] Geffner H, Wainer J (1998) Modeling action, knowledge and control. In: ECAI 98, Proceedings of the 13th European Conference on AI, (Ed. H. Prade), John Wiley & Sons, 532–536

[4] Grzymala-Busse J (1997) A new version of the rule induction system LERS. In: Fundamenta Informaticae, Vol. 31, No. 1, 27–39

[5] Guarino N (1998) Formal Ontology in Information Systems, IOS Press, Amsterdam

[6] Guarino N, Giaretta P (1995) Ontologies and knowledge bases, towards a terminological clarification, in Towards Very Large Knowledge Bases: Knowledge Building and Knowledge Sharing, IOS Press, Amsterdam

[7] Liu B, Hsu W, Chen S (1997) Using general impressions to analyze discovered classification rules. In: Proceedings of KDD97 Conference, Newport Beach, CA, AAAI Press

[8] Pawlak Z (1991) Rough sets - theoretical aspects of reasoning about data. Kluwer, Dordrecht

[9] Pawlak Z (1981) Information systems - theoretical foundations. In: Information Systems Journal, Vol. 6, 205–218

[10] Polkowski L, Skowron A (1998) Rough sets in knowledge discovery. In: Studies in Fuzziness and Soft Computing, Physica-Verlag, Springer

[11] Ras Z (1999) Discovering rules in information trees. In: Principles of Data Mining and Knowledge Discovery, (Eds. J. Zytkow, J. Rauch), Proceedings of PKDD'99, Prague, Czech Republic, LNAI, No. 1704, Springer, 518–523

[12] Ras Z, Wieczorkowska A (2000) Action rules: how to increase profit of a company. In: Principles of Data Mining and Knowledge Discovery, (Eds. D.A. Zighed, J. Komorowski, J. Zytkow), Proceedings of PKDD'00, Lyon, France, LNCS/LNAI, No. 1910, Springer-Verlag, 587–592

[13] Ras Z W, Tsay L-S (2003) Discovering extended action-rules (System DEAR). In: Intelligent Information Systems 2003, Proceedings of the IIS'2003 Symposium, Zakopane, Poland, Advances in Soft Computing, Springer-Verlag, 293–300

[14] Ras Z W, Tzacheva A, Tsay L-S (2005) Action rules. In: Encyclopedia of Data Warehousing and Mining, (Ed. J. Wang), Idea Group Inc., 1–5

[15] Silberschatz A, Tuzhilin A (1995) On subjective measures of interestingness in knowledge discovery. In: Proceedings of KDD95 Conference, AAAI Press

[16] Silberschatz A, Tuzhilin A (1996) What makes patterns interesting in knowledge discovery systems. In: IEEE Transactions on Knowledge and Data Engineering, Vol. 5, No. 6

[17] Tsay L-S, Ras Z W (2005) Action rules discovery: System DEAR2, method and experiments. In: the Special Issue on Knowledge Discovery, Journal of Experimental and Theoretical Artificial Intelligence, Taylor & Francis, Vol. 17, No. 1-2, 119–128

[18] Tzacheva A, Ras Z W (2005) Action rules mining. In: the Special Issue on Knowledge Discovery, International Journal of Intelligent Systems, Wiley, Vol. 20, No. 7, 719–736



Mining Statistical Association Rules to Select the Most Relevant Medical Image Features

Marcela X. Ribeiro1, André G. R. Balan1, Joaquim C. Felipe2,

Agma J. M. Traina1, Caetano Traina Jr1

1Computer Science Department, University of Sao Paulo at Sao Carlos - Brazil [mxavier, agrbalan,agma,caetano]@icmc.usp.br

2Department of Physics and Mathematics, University of São Paulo at Ribeirão Preto - Brazil [email protected]

Abstract

This paper presents a new algorithm, called StARMiner (Statistical Association Rule Miner), that aims at finding the real "essence" of medical images, identifying the most relevant features among those extracted from each image, taking advantage of statistical association rules. After describing the proposed algorithm, we show two case studies employing features extracted from segmented and non-segmented medical images. The first case study shows the behavior of StARMiner on feature vectors that condense the texture information of segmented images in just 30 features (attributes). The second case study employs a more traditional feature vector of 256 attributes from non-segmented images, obtained by Zernike moments. The case studies aim at showing and analyzing the proposed algorithm in two different situations, regarding its ability to discriminate images into categories (benign and malignant tumors, for instance) and to retrieve them. We compared the attributes selected by StARMiner with those selected by the well-known C4.5, and found that the attributes selected by StARMiner better preserve the image retrieval ability. In fact, StARMiner achieved a reduction of 50% in the number of features for case study one, and 85% for case study two. It is interesting to observe that, using the reduced feature vectors, the precision achieved in the similarity queries was in some situations higher than that obtained with the original ones.

1. Introduction

Nowadays many computational applications need to deal with complex data, such as images, video, time series, fingerprints and DNA sequences, among others. Focusing on one of the most studied complex data – images, and more particularly on medical images – two kinds of systems are now widely used: the Picture Archiving and Communication Systems (PACS) and the Computer-Aided Diagnosis (CAD) systems.

Correctly classifying images into categories, aiming at supporting medical diagnosis, has been sought by the CAD community for several years. The development of PACS supports the effective use of images for diagnosis and medical teaching [10]. In order to be effectively useful, the image retrieval process in PACS and CAD should be fast and consistent with the judgment of specialists.

The volume of images generated by medical exams grows at an exponential pace, demanding faster and more effective retrieval and analysis methods. In fact, the amount of images generated in hospitals and research centers amounts to several Terabytes per day in a medium-size hospital [17], requiring efficient mechanisms to store and retrieve them. Therefore, content-based image retrieval (CBIR) techniques and similarity search queries have been intensively researched in recent years [3].

CBIR uses image processing algorithms to extract relevant characteristics (features) from the images, organizing them as feature vectors. The feature vectors are indexed in place of the images to allow fast and efficient indexing and retrieval. CBIR techniques normally use intrinsic visual features of images, such as color, shape and texture [21], leading to vectors with hundreds or even thousands of features. Contrary to what one might think, having a large number of features is a problem. As the number of extracted features grows, indexing, retrieving, comparing and analyzing the images become much more time consuming [1, 12, 13]. In many situations, a large number of the features are correlated, so they do not bring new information about the images and do not improve the ability to discriminate them. The large number of features often leads CBIR-based systems to face the problem called the "dimensionality curse" [1]. Beyer et al. [9] proved that increasing the number of features (and consequently the dimensionality of the data) leads to losing the significance of each feature value. Thus, to avoid decreasing the discrimination accuracy, it is important to keep the number of features as low as possible, establishing a tradeoff between the discrimination power and the feature vector size.



In some applications, the features are employed to classify the images. An example is classifying tumoral masses detected in mammograms as benign or malignant. Initially, the radiologist classifies images based on the shape of the lesion. Malignant tumors generally infiltrate the surrounding tissue, resulting in an irregular or hard-to-distinguish contour, while benign ones have a smooth contour (Figure 1).

Figure 1: Typical breast tumor masses: benign (left) and malignant (right)

This paper presents a new algorithm (the StARMiner – Statistical Association Rule Miner) to determine a minimal representative set of features. By mining statistical association rules, it identifies the most relevant features needed to maintain high discrimination power when classifying images into categories (for instance, showing benign and malignant tumors). The algorithm uses statistical measurements, which describe the attributes’ behavior considering the image categories, to find interesting rules. Thus, this paper proposes to employ statistical association rules to perform attribute selection on image feature vectors, finding the most meaningful attributes that keep the majority of the data information.

To validate the proposed algorithm, we performed similarity queries over the best features selected by StARMiner, and compared with those selected by the well-known C4.5 decision tree inducer, regarding the quality of the answer given to the queries executed. As shown in section 4, StARMiner presented the best results. Analyzing the experiments, we claim that mining statistical association rules aiming at selecting relevant features is an effective approach to dimensionality reduction in medical images.

The remainder of this paper is structured as follows. Section 2 presents the background needed to understand the paper. Section 3 presents the proposed method, and Section 4 discusses the experiments and results reached with the developed method. Finally, Section 5 presents the conclusions of this work.

2. Background and Techniques

Mining of association rules [2] is one of the most investigated areas in data mining. It was initially motivated by business applications, such as catalog design, store layout, and customer categorization [5]. However, finding associations has also been widely

used in many other applications such as data classification and summarization [4, 15, 16, 22].

Mining images requires extracting their main features according to specific criteria. Once extracted, the feature vectors and image descriptions are submitted to the mining process, as illustrated in Figure 2.

Performing exact searches on image datasets is not useful, since searching for the same data already under analysis has very few applications. Therefore, the retrieval of complex data is mainly performed regarding similarity. The most well-known and useful types of similarity queries are k-nearest neighbor queries (for instance: "given the Thorax-XRay of John Doe, find in the image database the 5 images most similar to it") and range queries (for instance: "given the Thorax-XRay of John Doe, find in the image database the images that differ from it by up to 3 units"). Similarity search is performed by comparing the feature vectors using a distance (or dissimilarity) function to quantify how close (or similar) each pair of vectors is.

We focus our work on medical images, more specifically on the feature vectors employed to compare and retrieve the images by similarity. The motivation is to reduce the usually large number of features extracted, because for PACS and CAD systems, it is important to gather as many image characteristics as possible, leading to high-dimensional feature vectors, comprising much redundant information. Consequently, it is necessary to “sift” the features that keep the majority of image information, making the image retrieval processing more efficient. Notice that the proposed approach can be straightforwardly extended to work on other types of complex data beyond images, since similarity queries are the most suitable for complex data.

A standard approach to evaluate the accuracy of information retrieved in a query is the precision and recall (P&R) graph [7]. Precision and recall are defined as:

Figure 2 -Steps of Image Mining



precision = TRS / TS

recall = TRS / TR

where: TR = total number of relevant images in the database; TRS = total number of relevant images in the query result; TS = total number of images in the query result. We use precision and recall (P&R) graphs in Section 4 to discuss the experiments performed to analyze the proposed algorithm.

Among the several datasets evaluated, we show here results from two meaningful ones. The first is a succinct but very effective feature vector (only 30 elements), and the second is a more traditional one, containing 256 elements. Both datasets are explained as follows.

Dataset for Case Study 1 – Features extracted from segmented images: the SegMRF dataset

The SegMRF dataset consists of 704 medical images obtained from the Clinical Hospital at Ribeirão Preto of the University of São Paulo. The SegMRF dataset is classified into eight categories (see Table 1) and segmented using Markov Random Fields (MRF), which have been shown to be an adequate segmentation model for textured images [11]. The MRF approach segments an image using local features, assigning each pixel to a region based on its relationship to the neighboring pixels. The final segmentation is achieved by minimizing the expected value of the number of misclassified pixels. The segmentation algorithm employed [8] is an improved version of the EM/MPM method [11].

Since MRFs express only local properties of the images, it is also important to extract global properties to discriminate them well. The global description is achieved by estimating the fractal properties of each segmented region. For each region segmented based on texture, six features were extracted: the mass (m); the centroid (xo and yo); the average gray level (µ); the fractal dimension (D); and the linear coefficient used to estimate D (b). Therefore, when the image is segmented into L classes, the feature vector has L * 6 elements. Figure 3 illustrates the feature vector described.

It is important to stress that considering just five classes (as illustrated in Figure 3), the feature vector

generated (and used in the experiments of this paper) is quite compact. The feature vector can discriminate the images well, as shown in Section 4, but even so, StARMiner demonstrated that it still has superfluous information that does not need to be stored.

Figure 3: Segmentation classes and the feature vector

Dataset for Case Study 2 – Features extracted from non-segmented images: the Zernike dataset

The Zernike dataset consists of 250 images obtained from the Clinical Hospital at Ribeirão Preto of the University of São Paulo. The images are ROIs (Regions of Interest) comprising tumoral masses, taken from mammograms. The images are classified either as benign or malignant, according to a prior analysis done by radiologists and eventually confirmed by complementary exams. The kind of tumor is strongly related to the image shape: benign masses have well-defined smooth contours, while malignant ones have contours spread over the mammary parenchyma.

Zernike polynomials provide a precise mathematical model that captures the global shape of images while preserving enough information, given by local harmonics. They form a complete orthogonal basis on the unit disk x² + y² ≤ 1. The Zernike moments of an image are the projections of the image pixel map onto the basis functions, and can be made invariant with respect to the image size, rotation and translation. Moments and functions of moments can be employed as pattern features to represent images. The features capture global information about the image and do not require closed boundaries, as boundary-based methods such as the Fourier descriptors do [14].

The equations related to Zernike moments can be found in [20]. The order of a moment corresponds to its capability to describe details (high spatial frequency components) of the pixel distribution of the image.

There is abundant literature regarding image descriptors used to build image feature vectors. Zhang and Lu [23] compared shape descriptors that have been widely adopted for CBIR, such as Fourier descriptors, curvature scale space descriptors, Zernike moments and grid descriptors. Their strengths and limitations

Table 1 - Summary of image categories from the SegMRF dataset

Image Category        # of Images
Angiogram                      36
MR Axial Pelvis                86
MR Axial Head                 155
MR Sagittal Head              258
MR Coronal Abdomen             23
MR Sagittal Spine              59
MR Axial Abdomen               51
MR Coronal Head                36



were analyzed against properties such as affine invariance, robustness, compactness, low computational complexity and perceptual similarity measurement. Other experiments were made on Zernike moments, always achieving impressive results [19].

The literature supports the choice of Zernike moments as a very suitable descriptor of image shape. In this paper we evaluated how many moments are actually needed to characterize images well, aiming at CBIR and similarity queries. For polynomial order 30, a set of 256 moments is generated for each image, so the feature vector has 256 elements. In Section 4 we show that actually fewer features are enough to discriminate the images without losing accuracy.

3. Proposed Idea: Relevant Features Selection

Feature vectors describe the images quantitatively. Thus, a suitable approach to find association rules should consider quantitative data. The goal of StARMiner is to find statistical rules to select a minimal set of features that preserves the discrimination of images into categorical classes. The method proposed here extends statistical association rule mining techniques proposed in [6].

Let x be a category of an image and Ai an image attribute (feature). The rules returned by the algorithm have the format x → Ai, and a rule is identified only if the following conditions are satisfied:

• The attribute Ai must have a behavior in images from category x different from its behavior in images from all the other categories;

• The attribute Ai must present a uniform behavior in every image from category x.

The two previous conditions are implemented in the StARMiner algorithm by incorporating restrictions of interest in the mining process, in the way described as follows.

Let T be a database of images, x an image category, Tx ⊂ T the subset of images of category x, Ai an attribute, and (Ai)j the value of Ai for an image j. The restrictions of interest implemented in the StARMiner algorithm are:

1) |µAi(Tx) − µAi(T−Tx)| ≥ ∆µmin, where:

µAi(V) = ( Σ j∈V (Ai)j ) / |V|

• µAi(V) is the average of the attribute Ai values in the subset of images V.
• ∆µmin is the input parameter indicating the minimum allowed difference between the average of Ai in images from category x and the average of Ai in the remaining images of the database.

2) σAi(Tx) ≤ σmax, where:

σAi(V) = sqrt( ( Σ j∈V ((Ai)j − µAi(V))² ) / |V| )

• σAi(V) is the standard deviation of the values of attribute Ai in the subset V.
• σmax is the input parameter that indicates the maximum standard deviation of the Ai values allowed in images from category x.

3) Hypothesis test. The null hypothesis H0 should be rejected with confidence γmin in favor of the alternative hypothesis H1, where:

H0: µAi(Tx) = µAi(T−Tx)
H1: µAi(Tx) ≠ µAi(T−Tx)

• γmin is an input parameter that indicates the minimum confidence required to reject the hypothesis H0. Rejecting H0 implies, with confidence γmin, that the averages µAi(Tx) and µAi(T−Tx) are statistically different.
• To reject H0 with confidence γmin, the Z value, computed as

Z = ( µAi(Tx) − µAi(T−Tx) ) / ( σAi(Tx) / √|Tx| ),

must lie in the rejection region (see Figure 4). The critical Z values z1 and z2 depend on the value of γmin, as shown in Table 2.

The StARMiner algorithm makes it possible to find rules in an image dataset that allow the images to be categorized well. That is, the algorithm spots the attributes with the highest discrimination power, since they have a particular and uniform behavior in images of a given category. This is important because attributes that present a uniform behavior for every image in the dataset, independently of the image category, do not contribute to categorizing them and should be eliminated. Figure 5 gives a description of the StARMiner algorithm.

To perform StARMiner, the database is scanned twice. The first scan calculates the averages of each attribute (lines 2 to 5). The second database scan (lines 6 to 13) calculates the standard deviation for each attribute

Figure 4: Rejection Regions

Table 2 - Critical Z values

γmin    0.90    0.95    0.99
z1      1.64    1.96    2.58
z2     -1.64   -1.96   -2.58



and the Z value used in the hypothesis test. The restrictions of interest are processed in lines 11 and 12. A rule is returned only if it satisfies the input thresholds (∆µmin, σmax and γmin).
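Before the original pseudocode (Figure 5, below), here is a compact sketch of the same two-scan idea in Python/NumPy (our own illustration, not the authors' implementation; it uses a Z statistic of the form given above and the critical values of Table 2):

import numpy as np

def starminer(features, labels, d_mu_min, sigma_max, gamma_min=0.95):
    """Return (category, feature_index) rules satisfying the three restrictions.
    Critical values follow Table 2 (two-sided test): 0.9 -> 1.64, 0.95 -> 1.96, 0.99 -> 2.58."""
    z_crit = {0.9: 1.64, 0.95: 1.96, 0.99: 2.58}[gamma_min]
    rules = []
    for x in np.unique(labels):
        in_x = labels == x
        Tx, rest = features[in_x], features[~in_x]
        mu_x, mu_rest = Tx.mean(axis=0), rest.mean(axis=0)
        sigma_x = Tx.std(axis=0)
        # Z statistic comparing the mean of Ai in Tx against the remaining images
        z = (mu_x - mu_rest) / (sigma_x / np.sqrt(len(Tx)) + 1e-12)
        for i in range(features.shape[1]):
            if (abs(mu_x[i] - mu_rest[i]) >= d_mu_min
                    and sigma_x[i] <= sigma_max
                    and abs(z[i]) > z_crit):
                rules.append((x, i))          # rule  x -> A_i
    return rules

# Toy usage: 6 images, 3 features, two categories.
feats = np.array([[0.9, 0.1, 0.5], [0.8, 0.2, 0.5], [0.9, 0.1, 0.4],
                  [0.1, 0.9, 0.5], [0.2, 0.8, 0.6], [0.1, 0.9, 0.5]])
labs = np.array(["A", "A", "A", "B", "B", "B"])
print(starminer(feats, labs, d_mu_min=0.15, sigma_max=0.15))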

Algorithm StARMiner
input:  image database instances structured as (x, A1, A2, ..., AN), where x is the image category and Ai are the image features; thresholds ∆µmin, σmax and γmin
output: the mined rules
1.  begin
2.    scan database
3.      for each feature Ai
4.        for each existing category x
5.          calculate µAi(Tx) and µAi(T−Tx)
6.    scan database
7.      for each feature Ai
8.        for each existing category x
9.          calculate σAi(Tx)
10.         calculate the Z value
11.         if |µAi(Tx) − µAi(T−Tx)| ≥ ∆µmin and
12.            σAi(Tx) ≤ σmax and (Z < z2 or Z > z1)
13.              write(x → Ai)
14. end

Figure 5: StARMiner Algorithm

The attributes present in the set of rules generated by the StARMiner algorithm are selected as the most relevant ones.

4. Experiments

The procedure used in the experiments was composed of three steps, described as follows:

Step 1 - Extraction of features: An image set is submitted to the feature extractor, generating the feature vector for each image.

Step 2 - Definition of relevant features: The feature vectors, together with the previously known class of each training image, are the input to StARMiner, which produces a set of statistical association rules. The rules mined identify the most relevant image features, generating a reduced feature vector. Thereafter, the reduced feature vectors are used to index the images in a CBIR environment and to answer similarity queries, aiming at reducing the computational cost of executing them.

Step 3 - Executing similarity queries: Considering the size of the dataset, k-nearest neighbor queries were applied to the indexed image database. The image centers are taken randomly, and the measurements are averaged for each value of k, for both feature vectors (original and reduced) generated in Steps 1 and 2, respectively. We calculated the precision and recall values for each query issued.

The average of the precision and recall values was used to generate the P&R curves. As a rule of thumb when analyzing P&R curves, the closer the curve is to the top of the graph, the better the retrieval technique.

The three steps were repeated to perform several experiments. Due to space limitations, we only present here two meaningful cases, showing the ability of StARMiner to select the most relevant features.

Experiment 1: The first experiment was performed on the SegMRF dataset of 704 images from the 8 categories described in Table 1. The feature vector generated in Step 1 was composed of 30 texture features extracted from 5 regions of the segmented images, as explained in Section 2. In Step 2, StARMiner was executed over the image feature vectors, generating 15 rules. We evaluated various threshold values, and the best results were achieved using ∆µmin = 0.15, σmax = 0.15 and γmin = 0.9. The rule

angiogram → region 3 gray level average µ

is an example of a rule obtained. It indicates the importance of the average gray level µ of region 3 in discriminating images in the angiogram category.

Table 3 shows, in white background, the attributes selected by StARMiner as the most discriminating, and, in gray background, the attributes that can be eliminated.

Region 1   Region 2   Region 3   Region 4   Region 5
D          D          D          D          D
xo         xo         xo         xo         xo
yo         yo         yo         yo         yo
m          m          m          m          m
µ          µ          µ          µ          µ
b          b          b          b          b

Table 3: The best attributes, as selected by StARMiner, are shown in white background. Attributes that should be eliminated are depicted in gray background.

The results in Table 3 show that the features from regions 4 and 5 are more important to discriminate the images than the features from regions 1, 2 and 3. The results also indicate that the feature m has a very low contribution to discriminate the images.

To validate the results, Step 3 executed k-nearest neighbor queries using the 30 initial features and the 15 features selected by StARMiner. The precision vs. recall graph obtained by averaging the queries over each image of the dataset is shown in Figure 6. To guarantee that we selected the minimum set of relevant features that maintains the precision results, we also executed the same k-nearest neighbor queries using 14 features obtained by randomly removing one feature from the 15 selected by StARMiner; as a result, the P&R (precision vs. recall) graph always



worsened. Analyzing the graphs of Figure 6, we can conclude that the results obtained with 15 features are quite similar to those obtained with all 30 features. Hence, the dimensionality reduction of 50% achieved with StARMiner nearly maintains the same precision values for recall values under 0.7. Thus, although only approximately half of the processing effort is required, the precision of content-based similarity queries is maintained.

The selected features shown in Table 3 point out that the feature m has a very low contribution to discriminating the images. We redid the similarity queries removing the feature m of region 1 from the set of 15 selected features to check whether it lowers the precision values. The results of this test are shown in Figure 7.

Figure 6: Precision vs. Recall using 30 features, 15 features selected by StARMiner and 14 features obtained by randomly removing another one from those 15 selected by StARMiner.

Comparing the third and fourth curves of the graph in Figure 7, it is possible to note that, when removing the feature m from the 15 features selected by StARMiner, the precision values decline much less than when removing any other feature. So, although meaningful, feature m of region 1 is the one that brings the smallest contribution to discriminating images, compared to the other features analyzed (D, xo, yo, µ and b).

It is interesting to visualize the images returned when applying similarity queries using the whole set of features and the reduced set selected by StARMiner. We asked for the 20-nearest neighbors over the same image center (shown on the left side of the screen of Figures 8 and 9). Figure 8 shows the results when using the 30 original features, while Figure 9 shows the results using only the 15 selected features.

Figure 7: Precision vs. Recall using 30 features, 15 features selected by StARMiner, 14 features obtained by randomly removing another from the 15 selected by StARMiner and 14 features obtained by removing the m mass feature from the 15 selected by StARMiner.

Surprisingly, the features selected by StARMiner work better than the 30 original features for the executed query. The query performed using the original 30 features (Figure 8) achieved a precision of 75%, while the query performed using the 15 selected features achieved a precision of 85% (Figure 9). This happened in several situations, corroborating previous claims that keeping correlated attributes, instead of improving the retrieval process, actually worsens it.

Figure 8: Example of 20-nearest neighbor query using the 30 original features.



Figure 9: Example of 20-nearest neighbor query using the 15 selected features.

Experiment 2:

The second experiment was performed on the Zernike dataset. The 256 Zernike moments were extracted from the images in Step 1 (see Section 2). Step 2 employed 89 images from the dataset as training images, and was executed twice: once using the well-known C4.5 decision tree generator [18], and a second time using StARMiner. This was done to compare the efficacy of both algorithms in selecting the most discriminating features. C4.5 generated a decision tree where 38 moments were selected as relevant. StARMiner parameters were calibrated to also generate the same number of features. However, StARMiner selected a different set of 38 moments. An example of a rule obtained by StARMiner is:

malign tumor → 9th Zernike moment,

indicating the importance of the 9th Zernike moment in discriminating malignant tumors.

Figure 10 presents the P&R graphs obtained with

256 moments (Step 1), the 38 moments selected by C4.5, and the 38 moments selected by StARMiner (Step 2).

Analyzing Figure 10 we can conclude that:

• The results obtained with 38 moments are better than with 256 moments, for both StARMiner and C4.5. The dimensionality reduction of Step 2 provides an important gain in precision in the region comprising recall values under 0.5. The lower recall regions are the most important in a PACS environment, because k-nearest neighbor queries usually do not require high values of k. These results confirm that the dimensionality curse really damages the results: irrelevant features disturb the clear tendency of the relevant ones.

• StARMiner achieves better performance than C4.5. It maintains higher precision, with values above 0.7 for recalls smaller than 0.5, while C4.5 sustains a precision of 0.7 only for recalls up to 0.3.

5. Conclusions

A new approach to select the most relevant image features has been presented, consequently allowing dimensionality reduction of medical image features. The proposed method uses statistical association rules to select the most relevant characteristics. The proposed mining algorithm, StARMiner, finds rules involving the attributes that most contribute to discriminate medical images.

The accuracy of the method was verified in several experiments, and two representative ones were discussed in the paper. The first employs a succinct set of image features extracted from segmented images (SegMRF dataset), and the second works on a high-dimensional set of shape features extracted from non-segmented images (Zernike dataset). The experiments performed k-nearest neighbor queries to measure the ability of the proposed technique in reducing the number of features needed to perform similarity queries maintaining the accuracy of the results.

The results show that a significant reduction in the number of features can be obtained without degrading the retrieval efficacy of the features: 50% for the SegMRF dataset and 85% for the Zernike dataset. Moreover, the Zernike dataset achieved better precision using the selected features. The experiments over the SegMRF dataset confirmed that the proposed technique allows detecting the most important segmented regions for discriminating images. The experiments also showed that StARMiner selects features more relevant for discriminating images than those selected by C4.5. Furthermore, the results indicate that mining statistical association rules to select relevant features is an effective approach for dimensionality reduction in medical image databases.

Acknowledgments

Figure 10: Precision vs. recall with 256 moments and with the 38 moments selected by StARMiner and by C4.5 (recall on the horizontal axis, precision on the vertical axis).

Acknowledgments

The research presented in this paper was supported, in part, by the Sao Paulo State Research Foundation (FAPESP) under grants 04/02215-5 and 03/01769-4, and by the Brazilian National Research Council (CNPq) under grants 52.1685/98-6, 860.068/00-7 and 35.0852/94-4.

References

[1] C. C. Aggarwal, "On the Effects of Dimensionality Reduction on High Dimensional Similarity Search," Proc. ACM PODS, 2001, pp. 1-11.
[2] R. Agrawal, T. Imielinski, and A. N. Swami, "Mining Association Rules between Sets of Items in Large Databases," Proc. ACM SIGMOD Intl. Conf. on Management of Data, 1993, pp. 207-216.
[3] S. Antani, R. Kasturi, and R. Jain, "A survey on the use of pattern recognition methods for abstraction, indexing and retrieval of images and video," Pattern Recognition, Vol. 35, No. 4, April 2002, pp. 945-965.
[4] M.-L. Antonie, O. R. Zaïane, and A. Coman, "Application of Data Mining Techniques for Medical Image Classification," Proc. Second Intl. Workshop on Multimedia Data Mining (MDM/KDD'2001), 2001, pp. 94-101.
[5] C. Apte, B. Liu, E. P. D. Pednault, and P. Smyth, "Business Applications of Data Mining," Communications of the ACM, Vol. 45, No. 8, August 2002, pp. 49-53.
[6] Y. Aumann and Y. Lindell, "A statistical theory for quantitative association rules," Proc. Fifth ACM SIGKDD Intl. Conf. on Knowledge Discovery and Data Mining, San Diego, California, USA, 1999, pp. 261-270.
[7] R. Baeza-Yates and B. A. Ribeiro-Neto, Modern Information Retrieval. Wokingham, UK: Addison-Wesley, 1999.
[8] A. G. R. Balan, A. J. M. Traina, C. Traina Jr., and P. M. d. A. Marques, "Fractal Analysis of Image Textures for Indexing and Retrieval by Content," Proc. 18th IEEE Intl. Symposium on Computer-Based Medical Systems (CBMS), 2005, pp. 581-586.
[9] K. Beyer, J. Goldstein, R. Ramakrishnan, and U. Shaft, "When is "Nearest Neighbor" Meaningful?," Proc. Intl. Conf. on Database Theory (ICDT), 1999, pp. 217-235.
[10] P. Cao, M. Hashiba, K. Akazawa, T. Yamakawa, and T. Matsuto, "An integrated medical image database and retrieval system using a web application server," Intl. Journal of Medical Informatics, Vol. 71, No. 1, August 2003, pp. 51-55.
[11] M. L. Comer and E. J. Delp, "The EM/MPM Algorithm for Segmentation of Textured Images: Analysis and Further Experimental Results," IEEE Transactions on Image Processing, Vol. 9, No. 10, October 2000, pp. 1731-1744.
[12] Ö. Egecioglu, H. Ferhatosmanoglu, and U. Ogras, "Dimensionality Reduction and Similarity Computation by Inner-Product Approximations," IEEE Transactions on Knowledge and Data Engineering (TKDE), Vol. 16, No. 6, June 2004, pp. 714-726.
[13] S. H. Huang, "Dimensionality Reduction in Automatic Knowledge Acquisition: A Simple Greedy Search Approach," IEEE Transactions on Knowledge and Data Engineering (TKDE), Vol. 15, No. 6, November/December 2003, pp. 1364-1373.
[14] A. Khotanzad and Y. H. Hong, "Invariant Image Recognition by Zernike Moments," IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 12, No. 5, 1990, pp. 489-497.
[15] B. Liu, W. Hsu, and Y. Ma, "Integrating Classification and Association Rule Mining," Proc. Fourth Intl. Conf. on Knowledge Discovery and Data Mining, 1998, pp. 80-86.
[16] H. Mannila and H. Toivonen, "Multiple Uses of Frequent Sets and Condensed Representations," Proc. Second Intl. Conf. on Knowledge Discovery and Data Mining, 1996, pp. 189-194.
[17] H. Müller, N. Michoux, D. Bandon, and A. Geissbuhler, "A review of content-based image retrieval systems in medical applications - clinical benefits and future directions," Intl. Journal of Medical Informatics, Vol. 73, No. 1, February 2004, pp. 1-23.
[18] J. R. Quinlan, "Simplifying Decision Trees," Intl. Journal of Man-Machine Studies, Vol. 27, pp. 221-231.
[19] C.-H. Teh and R. T. Chin, "On Image Analysis by the Methods of Moments," IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 10, No. 4, 1988, pp. 496-513.
[20] M. D. Twa, S. Parthasarathy, T. W. Raasch, and M. A. Bullimore, "Decision Tree Classification of Spatial Data Patterns From Videokeratography Using Zernike Polynomials," Proc. SIAM Intl. Conf. on Data Mining, 2003, pp. 1-10.
[21] A. Vailaya, M. A. T. Figueiredo, A. K. Jain, and H.-J. Zhang, "Image classification for content-based indexing," IEEE Transactions on Image Processing, Vol. 10, No. 1, January 2001, pp. 117-130.
[22] O. R. Zaïane, M.-L. Antonie, and A. Coman, "Mammography Classification by an Association Rule-based Classifier," Proc. Third Intl. Workshop on Multimedia Data Mining (MDM/KDD'2002), 2002, pp. 62-69.
[23] D. Zhang and G. Lu, "Content-Based Shape Retrieval Using Different Shape Descriptors: A Comparative Study," Proc. IEEE Intl. Conf. on Multimedia and Expo (ICME'01), 2001, pp. 317-320.


Mining Multidimensional Sequential Rules: A Characterization Based Approach

Lionel Savary, Karine Zeitouni
45 Avenue des Etats-Unis, 78035 Versailles Cedex, FRANCE
Lionel.Savary, [email protected]

Abstract—Sequential pattern mining has been broadly studied and many efficient algorithms have been proposed. However, most of them only allow extracting frequent subsequences from a sequence database, without taking into account the multidimensional information associated with the sequential data. In this paper, we propose a characterization-based approach to multidimensional sequential pattern mining. The idea behind this method is to first cluster sequences by similarity, and then to characterize each class using the multidimensional properties describing the sequences. The clusters are built around the frequent sequential patterns. Thus, the whole process results in rules characterizing sequential patterns using multidimensional information.

Index Terms— multidimensional sequential rules, characterization, similarity.

I. INTRODUCTION

Right now, sequence mining mainly focuses on extracting sequential patterns and deriving rules in the fields of market basket analysis and Web usage mining. Existing works only allow extracting information on the sequences themselves [1] [7], without considering the other information related to these data, whereas it could be interesting to determine to which category of individuals a given purchase corresponds, or to discover to which category of individuals a given path traversal pattern corresponds [12]. Our objective is to discover individual profiles for the most frequent sequential patterns. This leads, first, to retrieving such frequent patterns, and then to characterizing those sequential patterns using related attributes.

In order to solve this problem, we propose a characterization based approach where a whole sequence is considered as a complex attribute. Thus, it makes sense to integrate reasoning on sequences (frequent patterns, similarity, grouping) while other dimensions are considered as descriptive of each sequence group. Briefly, our approach is based on two steps:

1. Gather all database sequences around the most similar sequential pattern in order to derive classes of sequences represented by their sequential patterns.

2. Describe these classes (and their sequential patterns) by the multidimensional attribute values characterizing them.

As a result, this process allows discovering characteristic rules of the form (a1, a2, …, ai, …, an) => (s1, s2, …, si, …, sn), where (s1, s2, …, si, …, sn) stands for a sequential pattern and (a1, a2, …, ai, …, an) for multidimensional values. The sequential patterns should fulfil a given support threshold, and the rule should be satisfied with a given confidence threshold. This rule means that the sequence part (i.e. the conclusion) is a frequent pattern in the database, and that the individuals having this sequence or similar sequences are characterized by the properties given in the left part of the rule. Thus, each rule characterizes groups that are homogeneous with respect to their sequential attribute.

The extraction of such knowledge raises the following problems:

1. How to determine that a sequence or a sub-sequence is similar to another?

2. How to group multidimensional sequences with a given sequential pattern?

3. How to determine the most characteristic properties for a group of sequences?

This paper is organized as follows: Section II presents some concepts and definitions used in the sequel. Section III gives an overview of related work. Section IV presents our approach for mining rules in multidimensional sequence databases. Section V shows some results, and finally Section VI outlines the main contributions of the paper.

II. CONCEPTS AND DEFINITIONS

In the proposed approach, we consider a database composed of sequences s and attributes Ai (with values ai) describing the sequences (Table 1).

Definition 1: Let I be a set of items. A sequence s = <s1, s2, …, si, …, sn> is defined as an ordered list of elements si ∈ I. si is also called a sequence element.

Note that the main difference with some existing works such as [8] is that we consider each element of a sequence as a simple item, not an itemset.

Definition 2: A sequence database S is composed of a set of tuples (sid, s) where sid denotes the identity of the sequence s.

Definition 3: A sequence s = <s1, s2, …, sm> is called a sub-sequence of another sequence t = <t1, t2, …, tn>, denoted s ⊆ t, if and only if there exist integers 1 ≤ j1 < j2 < ... < jm ≤ n such that s1 = tj1, s2 = tj2, ..., sm = tjm.

In other words, a sequence s is included in a sequence t if


the ordered list of elements of s is included in the ordered list of elements of t.

Definition 4: The support of a sequence α in a sequence database S is the number of tuples in the database containing α, i.e., support(α) = |{<sid, s> | (<sid, s> ∈ S) ∧ (α ⊆ s)}|. It can simply be denoted as support(α) when the sequence database is clear from the context.

Definition 5: Given a positive integer min_support as the support threshold, a sequence α is called a sequential pattern in the sequence database S if it is contained in at least min_support tuples of the database, i.e., support(α) ≥ min_support.

Another useful concept is the similarity of sequences, needed in order to compare them. Sequence similarity is a well-known problem in the field of bioinformatics, where it aims at determining whether a DNA sequence is similar to another. The most popular algorithms are BLAST [2], FASTA [9] and LCSS (Longest Common Subsequence) [5]. We use the latter in our definition of sequence similarity: it measures the minimum number of insertions and deletions needed to transform one sequence into the other. This method is well suited to our context and can be implemented easily.

Definition 6: Given two sequences s = <s1, s2, ..., si, …, sm> and t = <t1, t2, …, tj, …, tn> (with i ∈ [1, m], j ∈ [1, n], and n, m > 0), let lcs(s, t) be the length of their longest common sub-sequence. The dissimilarity distance between s and t is then defined as: d(s, t) = n + m − 2·lcs(s, t).
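As an illustration only (this code is not part of the original paper), a straightforward dynamic-programming sketch of the LCSS-based dissimilarity of Definition 6, with sequences represented as plain Python lists of items:

def lcs_length(s, t):
    """Length of the longest common sub-sequence of two item sequences."""
    m, n = len(s), len(t)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if s[i - 1] == t[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[m][n]

def dissimilarity(s, t):
    """Definition 6: d(s, t) = n + m - 2 * lcs(s, t)."""
    return len(s) + len(t) - 2 * lcs_length(s, t)

# With the activity codes used in Section V: HMRH vs. HMMRH differ by one insertion.
print(dissimilarity(list("HMRH"), list("HMMRH")))  # -> 1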

A sequence s is said to be similar to a sequence t if their dissimilarity distance is lower than a pre-defined threshold.

Definition 7: A multidimensional sequence database has schema (RID, S, A1, …, Am), where RID is a primary key, A1, …, Am are dimensions, and S is the domain of sequences. A multidimensional sequence is defined as (s, a1, …, am), where ai ∈ Ai for 1 ≤ i ≤ m and s is a sequence (see Table 1).

Definition 8: We define a Class of a Sequence S, denoted SC, as the cluster composed of the multidimensional sequences whose sequences are similar to the sequential pattern associated with SC.

In order to define characteristic rules, we adopt and formalise the definition given by Han et al. [6]. They define the characterization of a subset as the property descriptions specific to this subset, compared to all objects in the database.

Definition 9: Let se be a subset of the database DB, prop a multidimensional property (ai1, …, aik), freq_se(prop) the number of objects in se that meet the property prop, and card(se) the cardinality of se. The significance of prop in the subset se is defined as:

F^DB_se(prop) = (freq_se(prop) / card(se)) / (freq_DB(prop) / card(DB))

Definition 10: Given a real R standing for the significance threshold, prop is said to be characteristic of se, denoted prop → se [significance], if and only if: F^DB_se(prop) = significance ≥ R.

Definition 11: Let Sc be the class of a sequential pattern SPc. We define a multidimensional sequential rule as: prop → SPc [significance]. This multidimensional sequential rule means that the multidimensional property prop is characteristic of the sequential pattern SPc with the computed significance.
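The following is a small sketch of the significance measure of Definitions 9 and 10 (our illustration, not the authors' implementation); the database and the subset se are modelled as lists of dictionaries mapping attribute names to values, and the attribute names in the usage example are hypothetical.

def significance(se, db, prop):
    """F^DB_se(prop) = (freq_se(prop) / card(se)) / (freq_DB(prop) / card(DB)),
    where prop is a set of (attribute, value) pairs.
    Assumes prop occurs at least once in DB (otherwise the ratio is undefined)."""
    def freq(objects):
        return sum(all(o.get(a) == v for a, v in prop) for o in objects)
    return (freq(se) / len(se)) / (freq(db) / len(db))

def is_characteristic(se, db, prop, r_threshold):
    """Definition 10: prop -> se [significance] holds iff the significance >= R."""
    return significance(se, db, prop) >= r_threshold

# Hypothetical usage with a class of 2 individuals out of a database of 4:
db = [{"gender": "F", "age": 35}, {"gender": "F", "age": 55},
      {"gender": "M", "age": 65}, {"gender": "M", "age": 30}]
se = db[:2]
print(significance(se, db, {("gender", "F")}))  # -> 2.0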

TABLE 1: Multidimensional Sequence Database

RID | S                        | A1 | A2 | ... | Ai | ... | Am
1   | <s1, s2, .., si, .., sn> | a1 | a2 | ... | ai | ... | am
... | ...                      | ...
k   | ...                      | ...

The example of Table 1 shows a multidimensional sequence database. The tuple (1, <s1, s2, .., si, .., sn>, a1, …, am) stands for a multidimensional sequence of the database.

III. RELATED WORK

Mining multidimensional sequences has been studied only recently and partially. The main works deal with frequent patterns that mix sequence items and dimensions. The main contribution is from Pinto et al. [8]. They have defined three algorithms for mining multidimensional sequential patterns: UniSeq uses PrefixSpan [7] to mine multidimensional sequential patterns over sequences extended with dimensional information; Dim-Seq uses the BUC-like algorithm [3] to first mine multidimensional patterns, and then PrefixSpan to mine the sequential patterns associated with the multidimensional patterns; Seq-Dim first mines sequential patterns using PrefixSpan, and then mines the associated multidimensional patterns using the BUC-like algorithm. Notice that the three algorithms produce the same result. However, they do not make a real distinction between multidimensional values and sequential items. Moreover, mining multidimensional sequential patterns only considers a support threshold, and thus does not allow extracting characteristic rules. Finally, they do not use specific methods to select the sequences, while some multidimensional patterns are likely to be shared by groups of similar sequences.

IV. THE PROPOSED APPROACH

In order to extract relevant knowledge from a multidimensional sequence database, we propose a characterization-based approach. This process combines the mining of sequential patterns, sequence similarity, and characterization. It produces multidimensional sequential rules as defined in Section II.

A. Method Overview

Our approach is sequence centred. Indeed, it aims to characterise the most frequent sequences according to the other dimensions (attributes). It is divided into three steps:

1. First, sequential patterns are mined.

2. Then these sub-sequences are used to represent classes composed of subsets of the database sequences. In order to determine to which class a sequence belongs, an algorithm of similarity between sequences is used. The database sequences that are the most similar to a


given sequential pattern are gathered in the same class. At the end of this process, a set of classes, composed of database sequences and represented by their sequential patterns, is defined. In order to reduce the intersections between classes, only maximal sequential patterns are considered, i.e. sequential patterns that are not included in another one (see Section II).

3. Once the classes have been built, a characterization algorithm is used to discover the dimension values characterizing them.

B. General Algorithm

Based on the three steps previously described, the general algorithm for mining multidimensional sequential patterns is depicted in Fig. 1. The input parameters of this algorithm are:

• A multidimensional sequence database DB.
• A support threshold FT for mining sequential patterns.
• A dissimilarity threshold DT for finding the sequences most similar to a given sequential pattern.
• A characterization threshold CT and a set of attributes and associated values E(Ai, Vj).

The algorithm first mines the maximal sequential patterns and places the result in D (line 03). Each pattern becomes the model of a class of sequences. Individuals whose sequences are the most similar to a given maximal sequential pattern are placed in the corresponding class Sc (line 05). The characterization algorithm (line 08) determines which attribute values characterize the subset Sc among E(Ai, Vj).

FIG. 1 General Algorithm

Mining_MSP(DB, FT, DT, CT, E(Ai, Vj))
00 Begin
01   D = ∅
02   R = ∅
03   D = Maximal_Sequential_Patterns(DB, FT)
04   For each S ∈ D
05     Sc = Cluster(DB, S, DT)
06   End For
07   For each Sc
08     R = Characterization(Sc, DB, CT, E(Ai, Vj))
09   End For
10 End

C. Sequential Pattern Mining

Many algorithms for mining sequential patterns exist [11] [7] and could be applied. Our choice goes to the IBM algorithm [10], which is better adapted and gives better performance for this type of data: it only makes one scan of the database and provides an efficient data structure, saving runtime and memory space.

D. Sequence Similarity

The similarity computation between two sequences aims at determining the number of common ordered items between the two candidates. However, most algorithms rather compute a dissimilarity distance (see Section II): depending on the selected algorithm, the smaller the dissimilarity distance, the more similar the candidates, i.e. the more common ordered items they have. We adopt the LCSS algorithm because the presence of each item in the sequence is here more relevant than its probability of occurrence. This algorithm finds the longest common sub-sequence between two sequences, i.e. the transformation with the minimum number of deletions and insertions.

In order to assign a sequence of the database to a given class, a dissimilarity threshold has been defined. This threshold is important since it determines the amount of overlapping between classes, i.e. the number of intersections between subsets of sequences. The higher this threshold, the higher the overlapping probability. Conversely, if this threshold is too small, too many database sequences may not be clustered (Section II).

E. Class Generation

Frequent patterns are likely to have many more similar sequences in the database than non-frequent ones. Therefore, our class generation approach starts from those frequent patterns to determine the classes' centers. We call them models. However, not all sequential patterns can be considered as models, because some are very similar to each other and therefore do not maximise the dissimilarity between different classes. In particular, since the sub-sequences of frequent patterns are all frequent, their similarity is high.

F. Characterization and Knowledge Discovery

The first two steps of the approach, defined in Sections IV.C and IV.D, allow determining: the class models, and the database sequences composing each class. The final step is to discover the main characteristics of the individuals (profiles) of each class, i.e. the dimension values specific to a given class. In order to extract such knowledge, we propose an algorithm based on the definition of characterization proposed by Han [6]. The algorithm is detailed in Fig. 2 hereafter:

FIG. 2 Characterization Algorithm

Characterization(class se, database DB, real S, set E of (attribute A, value V))
00 se.characterization = ∅ ;
01 compute freq_DB(prop) for all properties prop = (attribute, value) ;
02 For each attribute A
03   For each value V of A
04     compute freq_se(prop) for the property prop = (attribute, value) ;
05     If F^DB_se(prop) ≥ S
06     Then Add(se.characterization, prop) ;
07     End If
08   End For
09 End For


V. EXPERIMENTS

The target application aims to analyse population mobility, and more precisely daily displacements. A daily displacement can be seen as a sequence of activities, also called an activity program [14]. For example, during a day, an individual can leave home, drive children to school, go to work, pick the children up from school and come back home. This sequence can be described as (Home, School, Work, School, Home), or HSWSH. Other activities are Market (denoted M), Restaurant (R), Leisure (L), etc. Our approach allows discovering frequent activity sequences (patterns) and the groups of surveyed individuals that approximately share those patterns. Besides, the survey holds other information about individuals, such as their age, gender, profession, and so on. Each group having a similar activity pattern is then likely to have specific attribute values. Hence, it is very relevant to characterise those groups and their corresponding sequential patterns.

The experiments have been made on real data from the Lille urban mobility survey. The dataset contains 10840 individuals, with the dimensions gender, age (10 age classes) and work type (about 9 categories). The total number of distinct activity sequences is about 3429.

The tests have been realized using the following parameters:

support threshold = 0.1; dissimilarity threshold = 1; characterization threshold = 8.

The algorithm has found 17 classes. For instance, the sequential pattern HMRH has been mined with a support equal to 0.14. Its similar sequences are: HWMRH, HMHR, HMRH, HMMRH, HRH, HLMRH, HMH, and HMRRH. This set of sequences composes the class of HMRH. The following rules are then mined for this class:

(gender = F) ∧ (30 < age < 40) ∧ (job category = TE8) → HMRH [9.064]
(gender = F) ∧ (50 < age < 60) ∧ (job category = TE7) → HMRH [10.413]
(gender = F) ∧ (50 < age < 60) ∧ (job category = TE5) → HMRH [12.466]
(gender = H) ∧ (60 < age < 70) ∧ (job category = TE6) → HMRH [10.407]

The first rule means that the individuals having an activity sequence similar to HMRH, standing for (Home, Market, Restaurant, Home), are frequently (with a significance of 9.064) women between 30 and 40 years old working as TE8, where TE8 stands for liberal professions.

VI. CONCLUSION

This paper has proposed a novel approach for mining multidimensional sequential rules. This approach is sequence centred since it characterizes the main sequences. It proceeds in three steps. First, it mines sequential patterns and sets them as class models. In the second step, a similarity algorithm is used to gather around each model the similar sequences of the database, forming the classes. Then, in the third step, a characterization algorithm is used to extract the attribute values characterizing each class.

Compared to existing works, our characterization process:

1. Allows extracting rules of the form (a1, a2, …, ai, …, an) → <s1, s2, …, si, …, sn> [R], where the attribute values (a1, a2, …, ai, …, an) are characteristic of the objects whose sequences are similar to <s1, s2, …, si, …, sn>, with a significance R.

2. Considers similar sequences rather than their exact values, producing more relevant knowledge for better decision-making.

In perspective, this approach will be extended to multidimensional spatial and temporal sequences in order to characterize the trajectories of moving objects. To this end, the similarity search of trajectories may use the approach of Vlachos et al. [13], which is also based on LCSS.

REFERENCES

[1] R. Agrawal and R. Srikant, "Fast Algorithms for Mining Association Rules," Proc. 20th Intl. Conf. on Very Large Data Bases (VLDB), Santiago, Chile, September 1994.
[2] S. F. Altschul, W. Gish, W. Miller, E. W. Myers and D. J. Lipman, "Basic Local Alignment Search Tool," J. Mol. Biol., Vol. 215, 1990, pp. 403-410.
[3] K. Beyer and R. Ramakrishnan, "Bottom-up computation of sparse and iceberg cubes," Proc. 1999 ACM SIGMOD Intl. Conf. on Management of Data (SIGMOD'99), Philadelphia, PA, June 1999, pp. 359-370.
[4] T. Brijs, K. Vanhoof and G. Wets, "Reducing redundancy in characteristic rule discovery by using IP-techniques," Workshop on Pre- and Post-processing in Machine Learning: Theoretical Aspects and Applications, Chania, Crete, July 1999, pp. 18-27.
[5] V. Freschi and A. Bogliolo, "Longest Common Subsequence between Run-Length-Encoded Strings: a New Algorithm with Improved Parallelism," Information Processing Letters, Vol. 90, No. 4, 2004, pp. 167-173.
[6] J. Han and M. Kamber, Data Mining: Concepts and Techniques. Morgan Kaufmann Publishers, ISBN 1-55860-489-8, 2001.
[7] J. Pei, J. Han, B. Mortazavi-Asl, and H. Pinto, "PrefixSpan: Mining sequential patterns efficiently by prefix-projected pattern growth," Proc. Intl. Conf. on Data Engineering (ICDE), 2001, pp. 215-224.
[8] H. Pinto, J. Han, J. Pei, K. Wang, Q. Chen and U. Dayal, "Multi-dimensional sequential pattern mining," Proc. ACM CIKM, 2001, pp. 81-88.
[9] W. Pearson and D. Lipman, "Improved Tools for Biological Sequence Analysis," Proc. National Academy of Sciences, Vol. 85, 1988, pp. 2444-2448.
[10] L. Savary and K. Zeitouni, "Indexed Bit Map (IBM) for Mining Frequent Sequences," Proc. 9th European Conf. on Principles and Practice of Knowledge Discovery in Databases (PKDD 2005), Porto, October 2005.
[11] R. Srikant and R. Agrawal, "Mining Sequential Patterns: Generalizations and Performance Improvements," Proc. 5th EDBT, Avignon, France, March 1996, pp. 3-17.
[12] J. Srivastava, R. Cooley, M. Deshpande and P.-N. Tan, "Web Usage Mining: Discovery and Applications of Usage Patterns from Web Data," ACM SIGKDD Explorations, Vol. 1, No. 2, 2000, pp. 12-23.
[13] M. Vlachos, G. Kollios and D. Gunopulos, "Discovering Similar Multidimensional Trajectories," Proc. 18th Intl. Conf. on Data Engineering (ICDE), San Jose, CA, 2002, pp. 673-684.
[14] D. Wang and T. Cheng, "A Spatio-Temporal Data Model for Activity-Based Transport Demand Modelling," Intl. Journal of Geographical Information Science, Vol. 15, No. 6, 2001, pp. 561-585.


MB3-Miner: mining eMBedded subTREEs using Tree Model Guided candidate generation

Henry Tan1, Tharam S. Dillon1, Fedja Hadzic1, Ling Feng2 and Elizabeth Chang3 1Faculty of Information Technology, University of Technology Sydney, Australia

e-mail: (henryws, tharam, fhadzic)@it.uts.edu.au 2University of Twente, Netherlands

e-mail: [email protected] 3School of Information Systems, Curtin University of Technology Perth, Australia

e-mail: [email protected]

Abstract

Tree mining has many useful applications in areas such as bioinformatics, XML mining, Web mining, etc. In general, most of the formally represented information in these domains is in a tree-structured form. In this paper we focus on mining frequent embedded subtrees from databases of rooted labeled ordered subtrees. We propose a novel and unique embedding list representation that is suitable for describing embedded subtrees. This representation is completely different from the string-like or conventional adjacency list representations previously utilized for trees. We present the mathematical model of a breadth-first-search Tree Model Guided (TMG) candidate generation approach previously introduced in [8]. The key characteristic of the TMG approach is that, as opposed to the join approach, it enumerates fewer candidates by ensuring that only valid candidates that conform to the structural aspects of the data are generated. Our experiments with both synthetic and real-life datasets provide comparisons against one of the state-of-the-art algorithms, TreeMiner [15], and demonstrate the effectiveness and efficiency of the technique.

Keywords: treeminer, tree mining, frequent tree mining, embedded subtree.

1. Introduction

Research in both the theory and applications of data mining is expanding, driven by a need to consider more complex structures, relationships and semantics expressed in the data [3,4,7,8,12,13,15]. As the complexity of the structures to be discovered increases, more informative patterns can be extracted [13]. Trees, as a special type of graph, have attracted a considerable amount of interest [2,3,4,8,9,10,11,12,13,15]. Recently, XML has become very popular, and there is a growing number of XML-enabled multimedia data encodings. VoiceXML [5] was developed to enable the rapid deployment of voice applications. X3D [6] is used to enable real-time communication of 3D data. Ling et al. [4] have initiated an XML-enabled association rule framework. It extends the notion of associated items to XML fragments to present associations among trees. Tan et al. [8] suggested that one of the important XML mining tasks is to mine frequent tree patterns in a collection of XML documents. Wang and Liu [11] developed an algorithm to mine frequently occurring induced subtrees in XML documents. In general, most of the formally represented information in these domains is in a tree-structured form. Frequent structure mining (FSM) refers to an important aspect of mining that deals with pattern extraction in massive databases and involves discovering frequent complex structures [15].

Frequent subtree mining, as one instance of FSM, has many useful applications in areas such as bioinformatics, XML mining, Web mining, etc. One practical application of frequent subtree mining in the above areas is to discover associations between data entities in tree-structured form. This is known as association mining, which consists of two important problems, i.e. frequent itemset discovery and rule construction. The former task is considered to be a more difficult problem to solve than the latter. Our study is mainly focused on developing an efficient approach for frequent embedded subtree discovery from a database of rooted ordered labeled subtrees.

We propose a novel and unique embedding list representation that is suitable for describing embedded subtrees. This representation is adapted from the conventional adjacency list format by relaxing the adjacency notion into an embedding notion, in order to capture the embedding relationships between nodes in trees. The list not only contains the adjacent nodes (children), but takes all descendants into an embedded list in pre-order ordering. For speed considerations, each embedded list is represented in a string-like format so that each item in the list can be accessed in O(1) time. This representation is completely different from the string-like [2,3,8,10,12,15] or adjacency list [7] representations utilized previously for trees. There are different strategies for efficient candidate generation, such as join and enumeration by extension [3]. The idea of utilizing an XML schema to guide the candidate generation appeared in [12]. That approach generates


candidates that conform to the schema. Recently, Tree Model Guided (TMG) candidate generation for mining embedded rooted ordered subtrees was proposed in [8]. TMG generalizes the concept of schema-guided enumeration into tree-model-guided enumeration, so that the enumeration model can be applied to any data in tree-structured form, especially for the enumeration of embedded subtrees. This non-redundant, systematic enumeration model ensures that only valid candidates are generated, which conform to the actual tree structure of the data. An example of a tree model would be the structural aspects of a document in an XML schema, and a valid candidate would conform to this. In general, TMG is applicable to any area with structural models with clearly defined semantics that have tree-like structures. In contrast to the join approach used for relational data [1,3,8,15], TMG generates fewer candidates. Even though fewer candidates are generated, the TMG enumeration still ensures the generation of a complete candidate set. Thus, the TMG approach is an optimal enumeration model for mining embedded, rooted and ordered subtrees. This is in contrast to an incomplete method such as TreeFinder [9], which can miss many frequent subtrees, especially when the support is lowered or when different trees in the database have common node labels. Independently, XSpanner [10] extends the Pattern-Growth concept to tree-structured data, and its enumeration model also generates only valid candidates.

The occurrences of candidate subtrees need to be counted in order to determine whether they are frequent, whilst the infrequent ones are pruned. As the number of candidates to be counted can be enormous, an efficient and fast counting approach is extremely important. The efficiency of candidate counting is highly determined by the data structure used. More conventional approaches use a direct checking approach: for each candidate generated, its frequency is increased by one if it exists in the transaction. A hash-tree [1,3] data structure can be used to accelerate direct checking. Another approach projects each candidate generated into a vertical representation [8,14,15], which associates an occurrence list with each candidate subtree. If transaction-based support [3,15] is used, the vertical format consists of the transaction IDs of the transactions that support the candidate. In contrast, if the occurrence match [3,8] or weighted support definition [15] is used, each list entry corresponds to a candidate occurrence in the whole database of trees [8]. Occurrence match support takes the repetition of items in a transaction into account, whilst transaction-based support only checks for the existence of items in a transaction. With the vertical representation approach, the frequency of a candidate subtree corresponds to the size of its occurrence list. With the advantage of being able to determine the support count of each candidate directly, the vertical format has been reported to be faster than the direct checking approach [14,15]. In this paper, we improve over the vertical list format previously utilized in our earlier approach [8] in two ways. First, the performance is expedited by storing only the hyperlinks [10] of subtrees in the tree database instead of creating a local copy for each generated subtree. The format is different from the scope-list [15] representation, as our vertical list does not store any scope information. Secondly, we transform and map the tree structure data into integer indexes as opposed to directly processing time-consuming string labels. Representing labels as integers as opposed to string labels has performance and space advantages. For instance, a node with label '123' would require 3 bytes of characters if its label is represented as a string, whereas only 1 byte is required if it is represented as an integer. Therefore, when a hashtable is used for candidate frequency counting, hashing integer labels rather than string labels can have a significant impact on the overall candidate counting performance.

Our experiments with both synthetic and real-life data sets provide comparisons against one of the state-of-the-art algorithms, TreeMiner [15], and they demonstrate the effectiveness and efficiency of the technique. The paper is organized as follows. In Section 2 the problem decomposition is given. Section 3 describes the details of the algorithm. The mathematical model of the TMG approach is provided in Section 4. We empirically evaluate the performance of the algorithms and study their scale-up properties in Section 5, and the paper is concluded in Section 6.

2. Problem Definitions

General tree concepts and definitions. A tree is an acyclic connected graph with one node defined as the root. A tree can be denoted as T(v0, V, L, E), where (1) v0 ∈ V is the root vertex; (2) V is the set of vertices or nodes; (3) L is the set of labels of vertices, where for any vertex v ∈ V, L(v) is the label of v; and (4) E is the set of edges in the tree. The root is the topmost node in the tree. In a labeled tree, there is a labeling function mapping vertices to a set of labels, so that a label can be shared among many vertices. The parent of a node v is defined as the predecessor of v; there is only one parent for each v in the tree. A node v can have one or more children, which are defined as its successors. A node without any child is a leaf node; otherwise, it is an internal node. If, for each internal node, all the children are ordered, then the tree is an ordered tree. In an ordered tree, the rightmost child is referred to as the last child. The number of children of a node is commonly termed the fan-out or degree of the node. A path from vertex vi to vj is defined as a finite sequence of edges that connects vi to vj. The length of a path p is the number of edges in p. If a path exists from node p to node q, then p is an ancestor of q and q is a descendant of p. The height of a node is the length of the path from that node to its furthest leaf. The rightmost path of T is defined as the path connecting the rightmost leaf with the root node. The height of a tree is defined as the height of its root node. The depth (level) of a node is the length of the path from the


root to that node. The size of a tree is the number of nodes in the tree. A uniform tree T(n,r) is a tree with height equal to n in which all internal nodes have degree r. The closed form of an arbitrary tree is defined as a uniform tree with degree equal to the maximum degree of the internal nodes in the arbitrary tree. In this paper, all trees we consider are ordered, labeled, and rooted trees. We are concerned with mining embedded subtrees. An embedded subtree [3,8,10,15] is a generalization of an induced subtree [2,3], in which ancestor-descendant relationships are preserved in addition to parent-child relationships. By extracting embedded subtrees, patterns hidden deeply within large tree structures can be found. Mining embedded subtrees is more complex than mining induced subtrees, as induced subtree ⊆ embedded subtree [3]. In figure 1, T2 and T4 are examples of induced subtrees of T, while T1-T4 are examples of embedded subtrees of T. In the case of the induced subtrees T2 and T4, only the parent-child relationship of each node is preserved, while for the embedded subtrees T1-T4 the ancestor-descendant relationship is also preserved. For each node in T (figure 1), its label is shown as a single-quoted symbol inside the circle, whereas its position is shown as an index at the left/right side of the circle.

Figure 1. Example of induced (T2, T4) and embedded (T1, T3) subtrees

String encoding (φ). In a database of labeled subtrees, many subtrees can have the same string encoding (φ) [2,8,10,15]. We denote the encoding of a subtree T as φ(T). From figure 1, φ(T1): '2 6 / 7 /'; φ(T3): '2 5 6 / 7 /', etc. We can omit the backtrack symbols after the last node, i.e. φ(T1): '2 6 / 7'. We refer to a group of subtrees with the same encoding L as the candidate subtree CL. A subtree with k nodes is denoted as a k-subtree. Throughout the paper, the '+' operator is used to conceptualize an operation of appending two or more tree encodings. However, this operator should be contrasted with the conventional string append operator, as in the encoding used the backtrack symbols need to be computed accordingly.
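As a concrete illustration (ours, not the authors' code), a minimal sketch of this pre-order string encoding with backtrack symbols, where a tree is represented as a nested (label, children) tuple:

def encode(tree):
    """Pre-order string encoding of a (label, children) tree: emit the label of
    each node, and a backtrack symbol '/' whenever the traversal moves back up."""
    tokens = []

    def visit(node):
        label, children = node
        tokens.append(str(label))
        for child in children:
            visit(child)
            tokens.append('/')

    visit(tree)
    while tokens and tokens[-1] == '/':  # backtracks after the last node can be omitted
        tokens.pop()
    return ' '.join(tokens)

# T1 from figure 1: a node '2' with children '6' and '7'.
print(encode(('2', [('6', []), ('7', [])])))  # -> '2 6 / 7'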

Mining frequent subtrees. Given a tree database Tdb consisting of N transactions of trees, the task of frequent subtree mining from Tdb with a given minimum support (σ) is to find all the candidate subtrees that occur at least σ times in Tdb. Unless otherwise stated, the occurrence match (weighted) support definition is used [3,8,15]. Based on the downward-closure lemma [1], every sub-pattern of a frequent pattern is also frequent. In relational data, given a frequent itemset, all its subsets are also frequent. A question however arises as to whether the same principle applies to tree-structured data when the occurrence match support definition is used. To show that the same principle does not apply, we need to find a counter-example where the relation does not hold for tree-structured data.

Lemma 1. Given a tree database Tdb, if there exist candidate subtrees CL and CL', where CL ⊆ CL', such that CL' is frequent and CL is infrequent, we say that CL' is a pseudo-frequent candidate subtree.

Lemma 2. Given an infrequent candidate subtree CL and a subtree T where φ(T): L, let Vrm(T) be the set of nodes on the rightmost path of T. A pseudo-frequent candidate subtree CL' with support m, where L': L + l, can be generated from T by attaching m children with the same encoding l to any node ∈ Vrm(T), with m ≥ the minimum support σ, such that there are m subtrees T1, …, Tm with encoding L' and T ⊆ T1, …, Tm.

Lemma 3. Given an infrequent embedded candidate subtree CL and a subtree T with root node vr where φ(T): L, let v1, …, vm be a set of nodes with the same encoding l, where v1 is a parent of v2, ..., vm-1 is a parent of vm, and m ≥ the minimum support σ. A pseudo-frequent candidate subtree CL' with support m, where L': l + L, can be obtained by connecting v1, …, vm to T such that there is a path of length ≥ 1 from vm to vr, and m embedded subtrees T1, …, Tm with encoding L' can be generated where T ⊆ T1, …, Tm.

Theorem 1. The antimonotone property of frequent patterns states that the frequency of a super-pattern is less than or equal to the frequency of a sub-pattern. Lemmas 2 and 3 say that pseudo-frequent candidate subtrees can be generated from an infrequent subtree. Thus, the antimonotone property does not always hold in frequent subtree mining when occurrence match support is considered.

In the light of the downward closure lemma, pseudo-frequent candidate subtrees are treated as infrequent candidate subtrees. From figure 1, if σ is set to 2, the subtrees with encoding '2 5 6 / 7' are examples of pseudo-frequent candidate subtrees: although the support of '2 5 6 / 7' is 2, it is a pseudo-frequent candidate subtree since '5 6 / 7' is infrequent. Thus, '2 5 6 / 7' should be pruned.

3. MB3-Miner Algorithm

3.1. Generating Candidate Subtrees

We are concerned with a systematic way of generating candidate subtrees. An optimal enumeration method should generate each subtree at most once and should only generate valid candidates according to the tree model. It should also be complete, i.e. it should generate all possible candidate subtrees from a given database of trees. Our candidate generation approach utilizes the embedding list representation to guide the enumeration of embedded subtrees.

Database scanning. The process of frequent subtree mining is initiated by scanning a tree database, Tdb, and


generating a global sequence D in memory. We refer to this sequence as a dictionary. The dictionary consists of each node in Tdb following the pre-order traversal indexing. For each node, its position, label, right-most descendant position (scope) [8,10,15], and parent position are stored. Thus each dictionary item is defined as a tuple ⟨pos, l, s, p⟩ of position (pos), label (l), scope (s), and parent (p). We refer to an item in the dictionary at position i as dictionary[i]. Unless otherwise mentioned, the notion of the position of an item refers to its index position in the dictionary. At the same time, when generating the dictionary, we compute all the frequent 1-subtrees, F1. After the in-memory database (dictionary) is constructed, our approach does not require further database scanning.

Constructing the Embedding List (EL). For each frequent internal node in F1, a list is generated which stores the hyperlinks [10] of its descendant nodes in pre-order traversal ordering, such that the embedding relationships between nodes are preserved. The notion of hyperlinks of nodes refers here to the positions of the nodes in the dictionary. For a given internal node at position i, this ordering reflects the enumeration sequence for generating the 2-subtree candidates rooted at i (see figure 2). Thus, the EL construction is equivalent to the process of enumerating all 2-subtree candidates from a database of trees. Hereafter, we call this list the embedding list (EL). As there can be more than one EL, we use the notation i-EL to refer to the embedding list of the node at position i. The position of an item in an EL is referred to as a slot, as opposed to position when referring to an item in the dictionary. Thus, i-EL[n] refers to the item in the list at slot n, whereas |i-EL| refers to the size of the embedding list of the node at position i. In figure 2, 0-EL for example refers to the list 0: [1, 2, 3, 4, 5, 6, 7], with 0-EL[0] = 1 and 0-EL[6] = 7. Figure 2 illustrates the EL representation of the tree T (figure 1) with encoding '2 2 5 6 7 / / / 5 6 / 7'.

0: [1, 2, 3, 4, 5, 6, 7]
1: [2, 3, 4, 5, 6, 7]
2: [3, 4]
3: [4]
5: [6, 7]

Figure 2. The EL representation of tree T in figure 1
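A compact sketch (our illustration under the definitions above, not the authors' implementation) of the dictionary and embedding-list construction; the frequency filtering of F1 is omitted and trees are again nested (label, children) tuples:

def build_dictionary(tree):
    """Pre-order scan producing one entry (pos, l, s, p) per node, where the
    scope s is the position of the node's right-most descendant."""
    dictionary = []

    def visit(node, parent):
        label, children = node
        pos = len(dictionary)
        dictionary.append({'pos': pos, 'l': label, 's': pos, 'p': parent})
        for child in children:
            visit(child, pos)
        dictionary[pos]['s'] = len(dictionary) - 1  # right-most descendant position

    visit(tree, -1)
    return dictionary

def build_embedding_lists(dictionary):
    """i-EL lists the positions of all descendants of internal node i in
    pre-order, i.e. the positions i+1 .. scope(i)."""
    return {d['pos']: list(range(d['pos'] + 1, d['s'] + 1))
            for d in dictionary if d['s'] > d['pos']}

# Tree T from figure 1, with encoding '2 2 5 6 7 / / / 5 6 / 7'.
T = ('2', [('2', [('5', [('6', [('7', [])])]),
                  ('5', [('6', []), ('7', [])])])])
el = build_embedding_lists(build_dictionary(T))
print(el[0], el[3])  # -> [1, 2, 3, 4, 5, 6, 7] [4]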

Occurrence Coordinate (OC). We have adopted extension by a single item at a time. When generating k-subtree candidates from a (k-1)-subtree, we consider only frequent (k-1)-subtrees for extension. Each occurrence of a k-subtree in Tdb is encoded as an occurrence coordinate r:[e1, …, ek-1]; r refers to the k-subtree root position and e1, …, ek-1 refer to slots in r-EL. Each ei corresponds to node (i+1) of the k-subtree, and e1 < ek-1. We refer to ek-1 as the tail slot. From figures 1 and 2, the OC of the 3-subtree (T2) with encoding '5 6 7' is encoded as 2:[0,1]; the two 4-subtrees (T3 and T4) with encoding '2 5 6 / 7' are encoded as 0:[4,5,6] and 1:[3,4,5] respectively, and so on. Each OC of a subtree describes an instance of each occurrence of the subtree in Tdb. Hence, each candidate instance has an OC associated with it.

The scope of a node. The EL representation preserves the ordering as well as the embedding relationships of the nodes in a tree. i-EL defines the scope of node i, such that the scope of i spans from i-EL[0] to i-EL[j] where j = |i-EL| - 1. We refer to the first scope position as the leftmost scope and to the last scope position as the rightmost scope. Consequently, given a 4-subtree T with occurrence coordinate 1:[3,4,5], the leftmost scope of T is defined by 1-EL[3] and the rightmost scope of T is defined by 1-EL[5]. The occurrence coordinate of a valid candidate is defined by r:[m, …, n] where m < n. Thus, a valid candidate has an increasing scope ordering such that r-EL[m] < r-EL[n].

TMG enumeration formulation. An enumeration approach guided by the tree model (TMG) was introduced in our previous work [8]. The TMG enumeration approach extends a candidate (k-1)-subtree by one node at a time, starting from the last node of its rightmost path up to its root. We have constructed the EL in such a way that the pre-order ordering of the embedding relationships of nodes is preserved. As a corollary, given the tail position of a (k-1)-subtree, the enumeration sequence provided by the EL, starting from the next slot after the tail to the end of the EL, follows the correct rightmost extension ordering. Thus the TMG enumeration is formulated as follows. Let l(i) denote the labeling function of the node at position i. Given a frequent (k-1)-subtree tk-1 with φ(tk-1): L, root position r, tail position t, leftmost scope m, rightmost scope n, and occurrence coordinate r:[m, …, n], k-subtrees are generated by extending tk-1 with j ∈ r-EL such that t < j ≤ |r-EL| - 1. Its occurrence coordinate then becomes r:[m, …, n, j] and its encoding becomes L': L + l(i), where i = r-EL[j] and m < n < j.
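The following small sketch (ours) shows only the occurrence-coordinate part of this extension step; computing the extended encoding L' = L + l(r-EL[j]) with the correct backtrack symbols is omitted. Here el is the embedding-list dictionary from the previous sketch.

def tmg_extend(oc, el):
    """TMG extension of a (k-1)-subtree occurrence coordinate r:[m,...,n]:
    one new k-subtree occurrence r:[m,...,n,j] for every slot j of r-EL
    with n < j <= |r-EL| - 1."""
    r, slots = oc
    tail = slots[-1]
    return [(r, slots + [j]) for j in range(tail + 1, len(el[r]))]

# 0-EL of tree T in figure 1 is [1, 2, 3, 4, 5, 6, 7]; extending the 3-subtree
# occurrence 0:[4, 5] (encoding '2 5 6') yields the 4-subtree occurrence 0:[4, 5, 6],
# i.e. the subtree with encoding '2 5 6 / 7'.
print(tmg_extend((0, [4, 5]), {0: [1, 2, 3, 4, 5, 6, 7]}))  # -> [(0, [4, 5, 6])]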

Pruning. Apriori theory says that a pattern is frequent if and only if all of its sub-patterns are frequent. Theorem 1 suggests that this property does not always hold for tree structures, and as such a more specialized approach is needed when mining frequent subtrees. In mining frequent subtrees, this problem may occur because the semantics of a tree structure are determined by both its values and its hierarchical structure. Hence, each individual node may be frequent while the structural relationships between the nodes are infrequent. The approach taken is to prune those candidates that have one or more infrequent subtrees. Thus, (k-1) full pruning [15] must be performed when generating k-subtrees. This implies that at most (k-1) of the (k-1)-subtrees need to be generated from the currently expanding k-subtree. The expanding k-subtree is pruned when at least one (k-1)-subtree is infrequent; otherwise it is added to the frequent k-subtree set. This ensures that the method generates no pseudo-frequent subtrees and is correct, as opposed to the opportunistic pruning utilized in DFS methods such as VTreeMiner [15]. As for each k-subtree candidate there can be (k-1) checks involved in determining whether all its (k-1)-subtrees are frequent, the


process can be quite time consuming and expensive. Fortunately, some time is saved by checking whether a candidate is already a part of the frequent k-subtree set. This way, if a candidate is already in the frequent k-subtree set, it is known that all its subtrees are frequent, and hence the (k-1) full pruning can be accelerated, as only one comparison is required.

3.2. Candidate Subtree Counting

In the candidate enumeration step, the process utilizes the notion of occurrence coordinates. To determine whether a subtree is frequent, we count the occurrences of that subtree and check whether the count is greater than or equal to the specified minimum support σ. In a database of labeled trees, many instances of subtrees can occur with the same encoding. Hence, the notion of encoding is utilized in the candidate counting process. We say that a subtree has frequency n if there are n instances of subtrees with the same encoding, i.e. we group subtree occurrences by their encoding.

Vertical Occurrence List (VOL). Each occurrence of a subtree is stored as an occurrence coordinate, as previously described. The vertical occurrence list of a subtree groups the occurrence coordinates of that subtree by its encoding. Hence, the frequency of a subtree can be easily determined from the size of the VOL. We use the notation VOL(L) to refer to the vertical occurrence list of a subtree with encoding L. Consequently, the frequency of a subtree with encoding L is denoted as |VOL(L)|. As an example, the frequency of the subtree of tree T with encoding '2 5 6', |VOL('2 5 6')|, is equal to 4.

'2 5 6': (1, 5, 6), (0, 5, 6), (1, 2, 3), (0, 2, 3)    size: 4

Figure 3. VOL representation of the subtree with encoding '2 5 6' of tree T in figure 1
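A minimal sketch (ours, not the MB3-Miner code) of vertical occurrence lists: occurrence coordinates are simply grouped by encoding, and the support of an encoding is the length of its list.

from collections import defaultdict

vol = defaultdict(list)  # encoding -> list of occurrence coordinates

def insert(encoding, oc):
    vol[encoding].append(oc)

def support(encoding):
    return len(vol[encoding])

# The four occurrences of '2 5 6' in tree T of figure 1, as root:[slot, slot] coordinates.
for oc in [(0, [1, 2]), (0, [4, 5]), (1, [0, 1]), (1, [3, 4])]:
    insert('2 5 6', oc)
print(support('2 5 6'))  # -> 4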

The cost of the frequency counting process comes from at least two main areas. First, it comes from the VOL construction itself: with numerous occurrences of subtrees, the lists can grow very large. Secondly, for each candidate generated, its encoding needs to be computed, and constructing the encoding of a long tree pattern can be very expensive. An efficient and fast encoding construction can be obtained by building the encoding step-wise, so that at each step the previously computed value is remembered and reused in the next step. This way, a constant processing cost that is independent of the length of the encoding is achieved, and fast candidate counting is obtained. Overall, our algorithm can be described by the following pseudo-code:

Inputs  : Tdb (tree database), σ (minimum support)
Outputs : frequent subtrees (Fk), D (dictionary)

D, F1 = DatabaseScanning(Tdb)
EL, F2 = ConstructEmbeddedList(F1, D)
k = 3
while( |Fk-1| > 0 )
    Fk = GenerateCandidateSubtrees(Fk-1)
    k = k + 1

GenerateCandidateSubtrees(Fk-1):
    for each frequent (k-1)-subtree tk-1 ∈ Fk-1
        Lk-1 = GetEncoding(tk-1)
        VOL-tk-1 = GetVOL(tk-1)
        for each occurrence coordinate ock-1 (r:[m,…,n]) ∈ VOL-tk-1
            for ( j = n+1 to |r-EL|-1 )
                ock, Lk = TMG-extend(ock-1, Lk-1, j)
                If( Contains(Lk, Fk) )
                    Insert(h(Lk), ock, Fk)
                else If( (k-1)Pruning(Lk) == false )
                    Insert(h(Lk), ock, Fk)
    return Fk

Figure 4. MB3-Miner algorithm pseudo-code

4. TMG Mathematical Model

In this section we develop the mathematical model of the TMG approach for mining embedded subtrees. Such a model allows us to calculate the worst-case complexity of enumerating all possible candidates from data in tree-structured form. There is no simple way to parameterize a tree structure unless it is specified as a uniform tree. The size of a uniform tree follows a geometric series: the size of a uniform tree T(n,r) can be computed using the geometric series formula (1 - r^(n+1))/(1 - r). Alternatively, when the root is omitted, the formula r(r^n - 1)/(r - 1) is used. When r = 1, the size of the uniform tree is equal to its height n.

The complexity of enumeration using the TMG approach is bounded by the actual tree structure. By the definition of the closed form of an arbitrary tree, the worst-case scenario of enumerating candidates from an arbitrary tree is bounded by the enumeration complexity of its closed form. We have developed a knowledge representation called the embedding list (EL) to represent any arbitrary tree and to enumerate candidates using the TMG approach in a systematic way. The TMG enumeration mathematical model is formulated as follows. Given a uniform tree T with height n and degree r, the worst-case complexity of candidate generation for T is expressed mathematically in terms of its height n and degree r. We define the cost of enumeration as the number of candidate instances enumerated throughout the candidate generation process, as opposed to the number of candidate subtrees generated (Section 2).

Complexity of 1-subtree & 2-subtree enumeration

1T &

2T . The complexity of 1-subtree enumeration is

equal to the size of the tree |T|. Previously we have developed a corollary that the construction of EL

size : 4

MB3-Miner: efficiently mining eMBedded subTREEs using Tree Model Guided candidate generation

107

Page 108: Mining Complex Data - cs.stmarys.cacs.stmarys.ca/~pawan/icdm05/proceedings/mcd05_proceedings.pdf · Mining Complex Data Proceedings of a Workshop held in Conjunction with 2005 IEEE

representation from a database of trees reflects the enumeration sequence of generating 2-subtree candidates. From figure 2, the visualization of EL representation of tree T suggests that the number of generated candidate instances is equal to the sum of the size of the lists in EL. Let s be a set with n objects. Combinations of k objects from this set s (sCk) are subsets of s having k elements each (where the order of listing the elements does not distinguish two subsets). sCk formula is given by s!/(s-k)!k!. Thus, for 2-subtrees enumeration the following relation exists. Let r-EL consist of l number of slots where each slot is denoted by j. The number of all generated valid 2-subtree candidates (r:[j]) rooted at r is equal to the number of combinations of l nodes from r-EL having 1 element each. As the corollary, complexity of 2-subtrees enumeration of tree T with size |T| is equal to the sum of all generated 2-subtree candidates from each node in T is given by eq 1 below.

$$ {}^{2}T \;=\; \sum_{r=1}^{|T|} \binom{|r\text{-}EL|}{1} \qquad \text{(eq. 1)} $$

Complexity of k-subtree enumeration, ${}^{k}T$. The generalization of the 2-subtree enumeration complexity can be formulated as follows. Let r-EL consist of l items, each denoted by j. The number of all generated valid k-subtree candidates (r:[e1,…,ek-1]) rooted at r equals the number of combinations of the l nodes in r-EL taken (k-1) at a time. As noted in section 3, a valid occurrence coordinate of a valid candidate has the property that e1 < … < ek-1; thus all valid combinations list their (k-1) elements in increasing order. As a corollary, the complexity of k-subtree enumeration of a tree T with size |T| equals the sum of all generated k-subtree candidates:

$$ {}^{k}T \;=\; \sum_{r=1}^{|T|} \binom{|r\text{-}EL|}{k-1} \qquad \text{(eq. 2)} $$
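To make eqs. 1 and 2 concrete, the following small sketch (an illustrative addition; it simply takes the per-node EL sizes as its input) sums the combination terms over every node:

from math import comb

def k_subtree_candidates(el_sizes, k):
    # Eq. 1 (k = 2) and eq. 2 (general k): the number of k-subtree candidate
    # instances enumerated by TMG is the sum, over every node r of the tree,
    # of the number of ways to choose k-1 slots from r-EL in increasing order.
    return sum(comb(size, k - 1) for size in el_sizes)

# Per-node EL sizes of the uniform tree T(2,2): the root sees 6 descendants,
# each depth-1 node sees 2, and the 4 leaves see none.
el = [6, 2, 2, 0, 0, 0, 0]
print(k_subtree_candidates(el, 2))  # eq. 1: 6 + 2 + 2 = 10
print(k_subtree_candidates(el, 3))  # eq. 2, k = 3: C(6,2) + C(2,2) + C(2,2) = 17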

In eq. 1 and eq. 2 the size of each EL (r-EL) is unknown. If we consider T to be a uniform tree T(n,r), a relationship can be derived between the height n and degree r of the uniform tree and the size of each node's EL.

Determining rδn-d of a uniform tree T(n,r). rδn-d denotes the size of the embedded list, |EL|, of a node at depth d in T(n,r). A close look at the EL definition in section 2 and at |T(n,r)| shows that rδn-d is given by the geometric series formula r(r^(n-d)-1)/(r-1). In a uniform tree T(n,r) there are r^d nodes at each level d; thus each level d of T(n,r) contains r^d lists of the same size rδn-d, giving a total of

$$ r^{d} \cdot {}_{r}\delta_{n-d} \qquad \text{(eq. 3)} $$

EL entries at that level.

Using the fact that at each level d of T(n,r) there are r^d lists of the same size rδn-d, eq. 2 can be expressed as shown below.

$$ {}^{k}T(n,r) \;=\; r^{0}\binom{{}_{r}\delta_{n}}{k-1} \;+\; r^{1}\binom{{}_{r}\delta_{n-1}}{k-1} \;+\; \dots \;+\; r^{n}\binom{{}_{r}\delta_{0}}{k-1} \qquad \text{(eq. 4)} $$

Further, eq. 4 can be written as follows:

$$ {}^{k}T(n,r) \;=\; \sum_{i=0}^{n} r^{i}\binom{{}_{r}\delta_{n-i}}{k-1}, \qquad \text{for } {}_{r}\delta_{n-i} \ge (k-1) \qquad \text{(eq. 5)} $$

Note that whenever |EL| < (k-1) no candidate subtrees can be generated; the constraint rδn-i ≥ (k-1) takes care of this condition. Hence, using the equations developed above, the complexity of generating all k-subtree candidates from a uniform tree T(n,r) for k = 1, …, |T(n,r)| is given by:

$$ T(n,r) \;=\; {}^{1}T(n,r) \;+\; \sum_{k=2}^{|T(n,r)|} {}^{k}T(n,r) \qquad \text{(eq. 6)} $$

where ${}^{1}T(n,r) = |T(n,r)|$ and each ${}^{k}T(n,r)$, k ≥ 2, is given by eq. 5.
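As a quick worked check of eqs. 3-6, take the uniform tree T(2,2): its size is $|T(2,2)| = 7$, and the EL sizes are ${}_{2}\delta_{2} = 6$ (one list, at the root), ${}_{2}\delta_{1} = 2$ (two lists, at depth 1) and ${}_{2}\delta_{0} = 0$ (four leaves). Eq. 6 then gives
$$ T(2,2) \;=\; 7 + \sum_{k=2}^{7}\left[\binom{6}{k-1} + 2\binom{2}{k-1}\right] \;=\; 7 + (2^{6}-1) + 2\,(2^{2}-1) \;=\; 76, $$
which agrees with the TMG enumeration cost reported for T(2,2) in figure 10.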

Thus, given an arbitrary tree T and its closed form T'(n,r), the worst-case complexity of enumerating embedded subtrees from T using the TMG approach can be computed with eq. 6, where n is the height of T' and r is the degree of T'. Due to space limitations, a more detailed discussion of complexity issues is reserved for future work.

5. Results and Discussions

This section provides some comparisons between the MB3-Miner (MB3), X3-Miner (X3), VTreeMiner (VTM) and PatternMatcher (PM) algorithms. We use synthetic databases of trees with varying size (s), maximum height (hmax), maximum fan-out (fmax), and number of transactions (|Tr|). We use the shorthand XXX-T, XXX-C, and XXX-F to denote, respectively, the total execution time (including data preprocessing, variable declarations, etc.), the number of subtree candidates generated, and the number of frequent subtrees generated with approach XXX. The minimum support σ is denoted as (sxx), where xx is the minimum frequency. Experiments were run on a 3 GHz Intel CPU with 2 GB RAM under Mandrake 10.2 Linux, with each algorithm run exclusively. The source code of each algorithm was compiled with GNU g++ version 3.4.3 using the -g and -O3 parameters. We ran TreeMiner with the -u parameter (weighted support).

[Figure 5. Scalability test: (a) time performance, (b) number of subtrees generated. Panel (a) plots time in seconds (log scale) against the T100K-S25, T500K-S125 and T1000K-S250 datasets for MB3-T, VTM-T and PM-T. Panel (b), for T100K-S25, shows MB3-C: 115,257 and VTM-C: 818,559 candidates, and MB3-F: 33,307 and VTM-F: 38,865 frequent subtrees.]

Scalability (s:10, hmax:3, fmax:3). In this experiment, |Tr|

was varied over 100K, 500K and 1000K, with σ of 25, 125 and 250, respectively. From figure 5a we can see that all three algorithms scale well and that MB3


outperforms the others with respect to time. Figure 5b compares MB3 with VTM in terms of the number of candidates generated versus the number of frequent subtrees determined. It can be seen that, using the join approach, VTM generates more candidates (VTM-C). These extra candidates are in fact invalid, in the sense that they do not conform to the tree model.

Deep (s:28, hmax:17, fmax:3) vs wide (s:428, hmax:3, fmax:50) trees. This experiment was conducted to verify the weaker performance of the DFS approach (VTM) on deep trees and of the BFS approaches (MB3 & PM) on wide trees. The deep tree data has |Tr|:10,000 with a total of 273,090 nodes. In figure 6a we can see that the performance of VTM degrades significantly as the support is lowered; it struggles to finish within a reasonable time and jumps to 7,177.85 seconds at σ ≤ 150. Overall, the PM algorithm performs better than VTM, while the MB3 algorithm enjoys the best performance and stability even at very low support.

[Figure 6. (a) Deep tree, (b) Wide tree: time in seconds (log scale) vs minimum support for MB3-T, VTM-T and PM-T; supports s300, s200, s150, s80 in (a) and s10, s8, s7, s6, s5 in (b). Data labels in (a) include 217.071, 1,574.52 and 7,177.85 seconds.]

[Figure 7. Mixed dataset: time in seconds (log scale) vs minimum support (s300, s150, s100) for MB3-T, VTM-T and PM-T. Data labels include 265.041, 885.033 and 7,304.18 seconds.]

For the wide tree data, |Tr| = 6,000 with a total of 1,303,424 nodes. As expected, a DFS based approach like VTM outperforms MB3 on this dataset. However, when the support threshold was decreased below 7, VTM failed to finish the task. Since the DFS based and BFS based approaches suffer from deep and wide trees respectively, we also tested the performance on a mixed dataset (s:428, hmax:17, fmax:50, |Tr|:76,000). MB3 performs the best in this case, as shown in figure 7. Uniform trees (s:20, hmax:3, fmax:4). As most real-world tree-structured datasets are not complete trees, we created an artificial dataset that represents a uniform tree. In this dataset, |Tr|:20,000 with a total of 246,110 nodes.

[Figure 8. Uniform trees: (a) time performance graph, (b) number of frequent subtrees graph, both against minimum supports s50, s30, s20 and s12 for MB3-T, VTM-T and PM-T. Data labels in (b) include 4,716,636 and 388,860 frequent subtrees.]

Figure 8a shows that MB3 has the best time performance. As illustrated in figure 8b, the number of frequent candidates generated by VTM (VTM-F) becomes substantially larger (~12x) than MB3-F, X3-F and PM-F as the support decreases. Performing full (k-1) pruning is a challenge in a DFS based approach [15]; a DFS based approach such as VTM has to rely on opportunistic pruning, which results in many pseudo-frequent candidate subtrees that should have been pruned.

[Figure 9. 54% of the transactions of the original CSLogs data: time in seconds (log scale) vs minimum support (S3000, S1500, S1000, S500, S200); series labelled MB3-T, VTM-T and HTM-T.]

CSLogs (s:214, hmax:28, fmax:21). This real-world dataset was previously used by Zaki in [15] to test VTM using the transactional support definition. When we tried to use it with occurrence-match support, all the tested algorithms had problems returning frequent subtrees. We started to see interesting results when we randomly cut |Tr| from 56,291 to 32,241. With this partial dataset, VTM had problems returning results at σ ≤ 500. From figure 9 it can be seen that MB3 again has the best performance.

Enumeration complexity. We created 4 datasets of uniform trees T(2,2), T(3,2), T(2,3) and T(2,4) in which all nodes have distinct labels, and specified σ:1. Figure 10 shows that the enumeration cost of the join approach (VTM) is higher than that of the TMG approach (MB3, X3). For data with higher complexity, such as T(3,3) and T(4,3), all of the algorithms abort. Using eq. 6 we verify that the enumeration cost is indeed very large: 549,755,826,275 and 1.33x10^36 respectively. Furthermore, the enumeration costs shown in figure 10 can be verified using eq. 6. A program to calculate the enumeration complexity of a



complete tree T(n,r) can be requested from the authors. Thus, the TMG approach is a predictable enumeration model: its enumeration cost can be measured and verified mathematically, allowing one to isolate difficult cases in advance.
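The authors' program is not reproduced here, but a minimal Python sketch of how eqs. 3-6 can be evaluated for a uniform tree T(n,r) is given below; it reproduces the T(3,3) cost of 549,755,826,275 quoted above and the figure 10 cost for T(2,2):

from math import comb

def el_size(n, r, d):
    # |EL| of a node at depth d in the uniform tree T(n, r):
    # r * (r**(n-d) - 1) / (r - 1) descendants, or n - d when r == 1.
    return (n - d) if r == 1 else r * (r ** (n - d) - 1) // (r - 1)

def tree_size(n, r):
    # |T(n, r)| including the root.
    return (n + 1) if r == 1 else (r ** (n + 1) - 1) // (r - 1)

def enumeration_cost(n, r):
    # Eq. 6: total number of candidate instances enumerated by TMG from
    # T(n, r), summed over k = 1, ..., |T(n, r)|.
    size = tree_size(n, r)
    cost = size  # k = 1: one candidate per node
    for k in range(2, size + 1):
        # eq. 5: r**d nodes at depth d, each contributing C(|EL|, k-1)
        cost += sum(r ** d * comb(el_size(n, r, d), k - 1) for d in range(n + 1))
    return cost

print(enumeration_cost(2, 2))  # 76, as in figure 10
print(enumeration_cost(3, 3))  # 549755826275, as quoted above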

[Figure 10. VTM, MB3, & X3 enumeration complexity, plotted on a log scale for T(2,2), T(2,3), T(3,2) and T(2,4). Data labels include TMG (MB3, X3) enumeration costs of 76, 4,129, 16,536 and 1,048,656, and VTM costs of 403, 23,282, 90,320 and 6,046,653.]

Overall Discussion. MB3 demonstrates high performance and scalability. In general, the performance gain comes from the efficient use of the EL representation and the optimal TMG candidate generation approach, which ensures that only valid candidate subtrees are enumerated. Figure 5b shows that the number of invalid subtrees generated by the join approach can be enormous, which degrades performance. Furthermore, MB3 performs the expensive full (k-1) pruning and therefore produces no pseudo-frequent candidate subtrees, whereas VTM uses the cheaper opportunistic pruning and, as a trade-off, generates many pseudo-frequent candidate subtrees. This can be seen in figure 8b: at σ:20 VTM generates ~12x more frequent subtrees than MB3, PM and X3. One consequence is that this greatly degrades VTM performance; in figures 6, 7 and 9, VTM even failed to provide results within a reasonable time at very low minimum support σ. In the context of association mining, regardless of which approach is used, the frequent patterns discovered from a given dataset at a given minimum support σ should be identical and consistent. Since pseudo-frequent subtrees are in fact infrequent, techniques that do not perform full pruning generate pseudo-frequent subtrees and therefore have limited applicability to association rule mining.

6. Conclusions

In this study we have provided detailed discussions of various theoretical and performance issues of the different approaches. We proposed a novel embedded list (EL) representation and showed the strength of the TMG enumeration approach, which was formalized mathematically. The high performance and scalability of the MB3 algorithm were demonstrated in our experiments by contrasting it with the state-of-the-art algorithm TreeMiner.

Acknowledgement

A special thanks to Prof. M. J. Zaki [15] for providing us with the TreeMiner source code and for discussing the results obtained from it with us.

References

[1]. R. Agrawal, R. Srikant, "Fast Algorithms for Mining Association Rules," In Proc. of the 20th VLDB, pp. 487-499, 1994.
[2]. K. Abe, S. Kawasoe, T. Asai, H. Arimura, and S. Arikawa, "Optimized Substructure Discovery for Semi-structured Data," In Proc. of PKDD'02, pp. 1-14, LNAI 2431, 2002.
[3]. Y. Chi, S. Nijssen, R. R. Muntz, J. N. Kok, "Frequent Subtree Mining - An Overview," Fundamenta Informaticae, Special Issue on Graph and Tree Mining, 2005.
[4]. L. Feng, T. S. Dillon, H. Weigand, E. Chang, "An XML-Enabled Association Rule Framework," In Proc. of DEXA'03, pp. 88-97, 2003.
[5]. J. Ferrans, B. Lucas, K. Rehor, B. Porter, A. Hunt, S. McGlashan, et al., "Voice Extensible Markup Language (VoiceXML) Version 2.0," W3C Technical Report, March 2004.
[6]. V. Geroimenko, C. Chen, Visualizing Information Using SVG and X3D, Springer, 2004.
[7]. M. Kuramochi and G. Karypis, "An Efficient Algorithm for Discovering Frequent Subgraphs," IEEE Transactions on Knowledge and Data Engineering, vol. 16, no. 9, pp. 1038-1051, 2004.
[8]. H. Tan, T. S. Dillon, L. Feng, E. Chang, F. Hadzic, "X3-Miner: Mining Patterns from XML Database," In Proc. of Data Mining '05, Skiathos, Greece, 2005.
[9]. Termier, M.-C. Rousset, and M. Sebag, "TreeFinder: A First Step Towards XML Data Mining," In Proc. of ICDM'02, 2002.
[10]. Wang, M. Hong, J. Pei, H. Zhou, W. Wang and B. Shi, "Efficient Pattern-Growth Methods for Frequent Tree Pattern Mining," In Proc. of PAKDD'04, 2004.
[11]. K. Wang and H. Liu, "Discovering Typical Structures of Documents: A Road Map Approach," In Proc. of the ACM SIGIR Conf. on Information Retrieval, 1998.
[12]. L. H. Yang, M. L. Lee, and W. Hsu, "Efficient Mining of XML Query Patterns for Caching," In Proc. of the 29th VLDB Conf., 2003.
[13]. J. Zhang, T. W. Ling, R. M. Bruckner, A. M. Tjoa, H. Liu, "On Efficient and Effective Association Rule Mining from XML Data," In Proc. of DEXA'04, pp. 497-507, 2004.
[14]. M. J. Zaki, "Fast Vertical Mining Using Diffsets," In Proc. of SIGKDD'03, 2003.
[15]. M. J. Zaki, "Efficiently Mining Frequent Trees in a Forest: Algorithms and Applications," IEEE Transactions on Knowledge and Data Engineering, vol. 17, no. 8, pp. 1021-1035, 2005.


Author Index

Achtert, Elke 9
Ando, Shin 17
Aoki, Kenji 17
Balan, Andre 91
Brisson, Laurent 25
Chang, Elizabeth 103
Chelcea, Sergiu 33
Cheng, Haibin 41
Collard, Martine 25
Dardzinska, Agnieszka 85
Da-Silva, Alzennyr 33
Dillon, Tharam 103
Eckstein, Silke 61
Felipe, Joaquim 91
Feng, Ling 103
Gao, Jing 41
Hacid, Hakim 45, 81
Hadzic, Fedja 103
Ito, Norihiko 73
Karine, Zeitouni 99
Kriegel, Hans-Peter 9
Kupfer, Andreas 61
Lechevallier, Yves 33
Mano, Hiroyuki 17
Marascu, Alice 53
Masseglia, Florent 53
Mathiak, Brigitte 61
Mogi, Akira 65
Motoda, Hiroshi 65
Muench, Richard 61
Nakamura, Yukihiro 17
Nguyen, Phu-Chien 65
Ohara, Kouzou 65
Onoda, Takashi 73
Pasquier, Nicolas 25
Pisetta, Vincent 81
Pryakhin, Alexey 9
Ras, Zbigniew-W. 85
Ribeiro, Marcela 91
Savary, Lionel 99
Schubert, Matthias 9
Shimizu, Kenji 73
Suzuki, Einoshin 17
Taeubner, Claudia 61
Tan, Pang-Ning 41
Tan, Henry 103
Tanasa, Doru 33
Traina, Agma 91
Traina, Caetano 91
Trousse, Brigitte 33
Tsay, Li-Shiang 85
Washio, Takashi 65
Zighed, Djamel 45, 81


Published by Department of Mathematics and Computing Science

Technical Report Number: 2005-12, November 2005

ISBN 0-9738918-8-2