
    IEEE TRANSACTIONS ON NANOBIOSCIENCE, VOL. 9, NO. 2, JUNE 2010 77

Discovering Interesting Molecular Substructures for Molecular Classification

    Winnie W. M. Lam and Keith C. C. Chan*, Member, IEEE

Abstract—Given a set of molecular structure data preclassified into a number of classes, the molecular classification problem is concerned with the discovery of interesting structural patterns in the data so that unseen molecules not originally in the dataset can be accurately classified. To tackle the problem, interesting molecular substructures have to be discovered, and this is done typically by first representing molecular structures as molecular graphs, and then using graph-mining algorithms to discover frequently occurring subgraphs in them. These subgraphs are then used to characterize different classes for molecular classification. While such an approach can be very effective, it should be noted that a substructure that occurs frequently in one class may also occur frequently in another. The discovery of frequent subgraphs for molecular classification may, therefore, not always be the most effective. In this paper, we propose a novel technique called mining interesting substructures in molecular data for classification (MISMOC) that can discover interesting frequent subgraphs not just for the characterization of a molecular class but also for distinguishing it from the others. Using a test statistic, MISMOC screens each frequent subgraph to determine if it is interesting. For those that are interesting, their degrees of interestingness are determined using an information-theoretic measure. When classifying an unseen molecule, its structure is matched against the interesting subgraphs in each class, and a total interestingness measure for the unseen molecule to be classified into a particular class is determined based on the interestingness of each matched subgraph. The performance of MISMOC is evaluated using both artificial and real datasets, and the results show that it can be an effective approach for molecular classification.

Index Terms—Frequent subgraph, graph mining, interestingness, molecular classification, molecular structures.

    I. INTRODUCTION

THE SIZE and number of molecular structure databases have grown rapidly in recent years, due to advances in X-ray diffraction and nuclear magnetic resonance (NMR) technologies [1]. Molecular databases of nucleotides, genomes, proteins, nucleic acids, etc., such as NCBI, MINT, SwissMod, and FSSP in EMBL [2]-[5], have been made available online. These databases continue to grow in size and diversity, and there is an increasing need for techniques to mine these data for interesting patterns [6]. There have been, for example, attempts to discover such patterns for molecular classification [1], [7].

Manuscript received March 25, 2009; revised October 26, 2009. Date of current version June 3, 2010. Asterisk indicates corresponding author.

W. W. M. Lam is with the Department of Computing, Hong Kong Polytechnic University, Hung Hom, Hong Kong (e-mail: [email protected]).

*K. C. C. Chan is with the Department of Computing, Hong Kong Polytechnic University, Hung Hom, Hong Kong (e-mail: [email protected]).

Digital Object Identifier 10.1109/TNB.2010.2042609

Given a set of molecular structure data preclassified into a number of classes, the molecular classification problem is concerned with the discovery of interesting structural patterns in the data so that unseen molecules not originally in the dataset can be accurately classified. Effective molecular classification can uncover relationships between structures and functions, and has applications in many areas, such as drug discovery [8], protein folding [9], comparative genomics [10], cancer-risk assessment [11], and gene evolution [12].

To tackle the molecular classification problem, two types of approaches have been used. The first is the more traditional approach of using what is called the quantitative structure-activity relationship (QSAR) or the quantitative structure-property relationship (QSPR) model [35] to derive descriptors from chemical compounds for classification. The second, more recent approach is to represent molecular structures as molecular graphs [13] and to discover frequently occurring subgraphs [14] in them for classification. Both approaches aim to extract attributes that can best represent the structure of chemical compounds. The latter approach has recently become more popular, as it has been shown that using frequent subgraph analysis for molecular classification can be better than using the QSAR/QSPR models [36], [37]. This is because QSAR/QSPR models cannot map chemical structure directly to attribute-based descriptions, such as the internal organization of chemical compounds. Besides, compared with frequent subgraph analysis, QSAR/QSPR requires much more user intervention and domain knowledge. For this reason, graph-mining algorithms that can discover frequently occurring subgraphs in larger graphs have recently become popular (e.g., WARMR [17], Frequent SubGraph discovery (FSG) [18], Graph-based Substructure PAtterN mining (gSpan) [19], and GrAph/Sequence/Tree extractiON (GASTON) [42]). These frequently occurring subgraphs are subgraphs that occur frequently enough in different classes. The idea of finding frequent subgraphs in different classes of molecular data has been proposed previously and has been shown to be effective [43]. However, it should be noted that, as subgraphs that occur frequently in one class may also occur frequently in another, the discovery of frequent subgraphs for molecular classification may not always be the most effective approach. It does not explicitly find discriminative subgraphs to allow one class to be easily discriminated from another. There have been some recent attempts to find such subgraphs between classes, defined as subgraphs that appear more frequently in a certain positive class than in a negative class [15], [16]. However, how much more frequently these subgraphs should appear for them to be considered discriminative is not explicitly stated.

1536-1241/$26.00 © 2010 IEEE


In this paper, we propose a novel graph-mining algorithm for molecular classification. This algorithm, which is called mining interesting substructures in molecular data for classification (MISMOC), can discover interesting frequent subgraphs for the characterization of a molecular class and for the discrimination of it from one or more of the other classes. The classification problem that MISMOC can tackle is, therefore, not restricted to binary classification. MISMOC performs its tasks by first filtering out subgraphs that do not occur frequently enough for the purpose of classification. Using a test statistic, it then filters out those frequently occurring subgraphs that only appear as frequently as expected. Those that remain are subgraphs that are interesting in the sense that they not only characterize a class of molecular graphs, but also allow them to be discriminated from the others. For each interesting subgraph, MISMOC determines a degree of interestingness based on an information-theoretic measure. When classifying an unseen molecule that is not in the original dataset, this molecule's structure is matched against the interesting subgraphs in each class, and a total interestingness measure for the unseen molecule to be classified into a particular class is then determined for the purpose of classification.

The performance of MISMOC is evaluated with both artificial and real data. The experimental results show that MISMOC can discover interesting frequent subgraphs that can both characterize and distinguish molecules of one class from the others. It can also reduce the number of subgraphs that need to be considered for graph classification by filtering out those subgraphs that are not interesting for classification.

The rest of this paper is organized as follows. Section II presents a review of existing graph-mining algorithms that can be used for classifying molecular structures. Using an illustrative example, Section III describes how frequently occurring subgraphs can be discovered. Section IV presents the details of our proposed approach, MISMOC. For illustration, Section V makes use of an example to demonstrate how MISMOC can effectively perform molecular classification tasks. In Section VI, we describe how the performance of MISMOC was evaluated and present the results of the experiments that were carried out. Finally, Section VII summarizes the work and discusses possible directions for future research.

    II. RELATED WORK

Many graph-mining algorithms have been developed to discover interesting subgraphs in data with complex structures. Given such data represented in the form of graphs, these algorithms can be used to mine frequent subgraphs in them. These frequent subgraphs can then be used to tackle the classification problem [20], [21].

Graph-mining algorithms based on inductive logic programming (ILP), for example, have been used to discover frequent subgraphs for classification [22]. An ILP-based algorithm called WARMR [17] is able to mine frequent subgraphs in graph data represented as first-order predicate logic. ILP-based approaches to graph mining, being based on predicate logic, have the disadvantage that they may not be very robust to noisy data. Also, when dealing with real-world databases, which tend to be very large, the computational complexity of these algorithms can be too high to handle. These approaches have to perform many tests for equivalence in order to prune infrequent and semantically redundant subgraphs.

Other than the ILP-based algorithms, there are quite a number of other graph-mining algorithms that can be used to discover frequent subgraphs. FSG [18], for example, adopts an edge-based subgraph generation strategy for this purpose. It expands subgraphs using a level-by-level approach [23]: first enumerating all frequent single- and double-edge subgraphs, and then generating larger subgraphs iteratively by adding one more edge to those generated in the previous iteration. For FSG to perform its tasks, it has to rely on canonical labeling to check whether a particular subgraph satisfies a support threshold. If two graphs are isomorphic, their canonical labels are identical. This canonical labeling process for the determination of graph isomorphism is memory consuming for large databases.

Other than FSG, gSpan [19] is also a popular graph-mining algorithm that has been used for graph classification. gSpan searches for frequent subgraphs over graph canonical forms using a depth-first search (DFS) strategy. It does so by starting from a randomly chosen vertex, then visiting and marking the vertices to which this chosen vertex is connected. This process of visiting and marking vertices continues repeatedly until a full DFS tree is built. For each graph searched, more than one tree may be built with DFS, depending on the order in which the vertices are visited. By means of DFS, gSpan is able to discover all frequent subgraphs without generating candidate subgraphs and pruning false positives.

Another algorithm for mining frequent subgraphs is called Gaston [42]. It discovers such subgraphs by first finding frequent paths, then trees, and then cyclic graphs. It stores all occurrences of these graphs in an embedding list so that the frequency of occurrence of a subgraph can be determined by scanning the embedding list, thereby improving the speed of the graph-mining process [43].

MoFa [16] has been used to find frequent subgraphs in graph data by maintaining parallel embeddings for both vertices and edges. Like Gaston, each such embedding consists of a set of references to a molecule that point to the atoms and bonds that form a subgraph. Such embeddings can be extended so that larger subgraphs can be formed iteratively [16]. MoFa was later enhanced to discover discriminative subgraphs [40], [41] with relatively higher support, and these subgraphs make MoFa a more suitable approach for graph classification.

Subdue [15] is another graph-mining algorithm that discovers frequent subgraphs. It makes use of the minimum description length principle to narrow down possible outcomes when trying to identify subgraphs that best compress the original graph [45].

The graph-mining algorithms described earlier discover frequent subgraphs by building on smaller subgraphs edge by edge. Subgraph isomorphism for graph matching is required as part of the kernels of these algorithms, and this process is known to be nondeterministic polynomial time (NP)-complete. The discovery of frequent subgraphs using existing graph-mining


algorithms requires a frequency threshold to be supplied. If the threshold is set too large, one may not be able to discover enough frequent subgraphs to allow graph classes to be distinguished from each other. If the threshold is set too small, one may discover too many frequent subgraphs that are irrelevant for classification. As subgraphs that appear frequently in one class of graphs may also do so in another, the discovery of frequent subgraphs may not always be useful for graph classification. What is needed for the task is a way to discover subgraphs in a class that make it distinguishable from other classes.

In the following, we propose a graph-mining technique called MISMOC for this purpose. Given a set of frequent subgraphs, MISMOC can screen out frequent subgraphs that are not useful for classification and retain those that are useful for the characterization of molecular classes and the discrimination of one class from another.

    III. ILLUSTRATIVE EXAMPLE

To explain why the discovery of frequent subgraphs may not always be useful for graph classification, let us consider an example. We are given the three classes of artificial molecular data shown in Fig. 1.

Each of these three classes contains ten molecules, and each molecule consists of atoms connected with bonds. These molecules are generated in such a way that the atoms are chosen from 30 possible atoms, including carbon (C), oxygen (O), iridium (Ir), nobelium (No), and thorium (Th), and the bond types from three possible types: single, double, and triple bonds. These molecules can be represented as labeled molecular graphs, with each node representing an atom and each edge a bond.

Given the set of graph data shown in Fig. 1, frequent subgraphs can be discovered in each of classes 1, 2, and 3 using a graph-mining algorithm, such as FSG, and the class of the unknown molecule given in Fig. 2 can be predicted. These algorithms require a threshold to be given by the user to define how frequently a subgraph should appear for it to be considered frequent.
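As a toy illustration of how such a support threshold screens candidate substructures, the sketch below counts the fraction of molecules containing a candidate pattern. All names and the edge-set representation are illustrative assumptions, not the paper's implementation; in particular, containment is a simple set test here, whereas a real miner needs subgraph isomorphism.

```python
from typing import FrozenSet, List, Tuple

# A bond is (atom label, bond type, atom label); a molecule is a set of bonds.
# This edge-set view ignores topology -- a real check needs subgraph
# isomorphism -- but it suffices to show how a support threshold works.
Edge = Tuple[str, str, str]
Molecule = FrozenSet[Edge]

def support(pattern: Molecule, molecules: List[Molecule]) -> float:
    """Fraction of molecules whose bond set contains every bond of the pattern."""
    return sum(1 for m in molecules if pattern <= m) / len(molecules)

def frequent(candidates: List[Molecule], molecules: List[Molecule],
             threshold: float) -> List[Molecule]:
    """Keep only candidates whose support meets the user-supplied threshold."""
    return [p for p in candidates if support(p, molecules) >= threshold]

# Toy class of four molecules; the N=O double bond appears in three of them.
mols = [
    frozenset({("N", "double", "O"), ("C", "single", "C")}),
    frozenset({("N", "double", "O"), ("C", "single", "O")}),
    frozenset({("N", "double", "O")}),
    frozenset({("C", "triple", "C")}),
]
n_double_o: Molecule = frozenset({("N", "double", "O")})
print(support(n_double_o, mols))  # 0.75, so frequent at a 70% threshold
```

With these toy molecules, the N=O pattern survives a 70% threshold but is pruned at 80%.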

For the purpose of illustration, we choose FSG here, as graph-mining algorithms such as gSpan do not perform subgraph pruning. FSG can discover maximal frequent subgraphs and can better avoid the problems caused by the discovery of subgraphs that are too fragmented.

By setting a support threshold of 80% (i.e., any subgraph that occurs in at least eight out of ten graphs), a number of frequent subgraphs can be found, and they are given in Table I. It should be noted that the same frequent subgraph, a nitrogen atom double-bonded to an oxygen atom (i.e., N=O), appears in 80% of the graphs in each of the three classes (see Table I).

Since this choice of threshold does not allow any unique frequent subgraph to be discovered for each class, we lower the support threshold by 10%. The results are shown in Table II.

More frequent subgraphs are discovered when the support threshold is lowered to 70%. However, the newly discovered frequent subgraphs for classes 2 and 3 are still the same, and a graph with such subgraphs may be classified into either class 2 or 3. This means that the discovered frequent subgraphs do not allow graphs in class 2 to be easily discriminated from class 3.

Fig. 1. Training molecular data.

When the support threshold is further lowered to 60%, more frequent subgraphs are discovered, and they are shown in Table III. Unfortunately, the newly discovered frequent subgraphs for each of the three classes still overlap with each other. A graph characterized by these subgraphs can be classified


Fig. 2. Unknown molecule.

TABLE I. MAXIMAL FREQUENT SUBGRAPHS (SUPPORT THRESHOLD = 80%)

TABLE II. MAXIMAL FREQUENT SUBGRAPHS (SUPPORT THRESHOLD = 70%)

TABLE III. MAXIMAL FREQUENT SUBGRAPHS (SUPPORT THRESHOLD = 60%)

into one or more classes. For example, if a graph G is characterized by the subgraph , it can be classified into either class 2 or 3. If G is characterized by the subgraph , it can be classified into either class 1 or 2. If G is characterized by both and , then there is a chance that it can be classified into any of classes 1, 2, or 3, as  appears six times in classes 1 and 2, and  appears seven times in classes 2 and 3.

TABLE IV. MAXIMAL FREQUENT SUBGRAPHS (SUPPORT THRESHOLD = 50%)

To find more interesting and useful frequent subgraphs for classification, the support threshold is further lowered to 50%. Using FSG again, the frequent subgraphs discovered are shown in Table IV. This time, many more frequent subgraphs


    Fig. 3. Classifying the unseen molecule in Fig. 2 with FSG.

are discovered, and some of the subgraphs discovered in each of S^(1), S^(2), and S^(3) do not overlap with each other.

If we are to classify the testing sample in Fig. 2, it should be noted that this graph is characterized by three frequent subgraphs, , , and , from each of S^(1), S^(2), and S^(3), respectively (see Fig. 3). It is, therefore, hard to decide to which class this graph should be assigned based on these subgraphs that it contains. If one takes a closer look at the frequency of appearance of each of these three subgraphs, , , and , in each class, one may discover that even though  is not frequent enough in classes 2 and 3, it appears in 40% of the graphs in these classes. This is also the case with . Although it only appears in 10% of the graphs in class 1, it appears in 40% of the graphs in class 2. Of these three subgraphs,  is the most interesting and unique in the sense that, while it appears in 50% of the graphs in class 2, it only appears in 10% of the graphs in both classes 1 and 3. In other words, this subgraph provides more evidence for a graph it characterizes to be classified into class 2 than the other subgraphs do. In fact, it is for this reason that the graph in Fig. 2 more likely belongs to class 2 than to any other class.

In order to discover more frequent subgraphs that may be useful for classifying the unseen molecule, the support threshold is further reduced to 40%, and new frequent subgraphs are discovered, as shown in Table V. The newly discovered subgraphs are S_8^(1), S_9^(1), and S_10^(1) in class 1; S_7^(2), S_8^(2), S_9^(2), and S_10^(2) in class 2; and S_7^(3), S_8^(3), and S_9^(3) in class 3. Although the support threshold is lowered to 40%, these subgraphs all also appear frequently in the other classes; for example, S_8^(1) was previously discovered as frequent subgraph S_2^(2) in class 2 and S_2^(3) in class 3. The same holds for the others. We tried to further reduce the support threshold to 30%, but the situation remained the same: each newly discovered subgraph had already been found at a higher threshold value.

The actual relative frequency of appearance of each frequent subgraph in each class may therefore provide useful information for classification. The idea that MISMOC uses to filter out uninteresting and irrelevant frequent subgraphs, so as to allow molecular classification to be performed effectively, is therefore to take such information into consideration and measure the

TABLE V. MAXIMAL FREQUENT SUBGRAPHS (SUPPORT THRESHOLD = 40%)


interestingness of each frequent subgraph relative to the others.

IV. MISMOC: A GRAPH-MINING TECHNIQUE FOR MOLECULAR CLASSIFICATION

The molecular classification problem that this paper addresses can be stated more formally as follows. Given a set of molecular structure data G containing n molecules preclassified into p classes, the molecular classification problem is concerned with the discovery of interesting patterns in the data to allow unseen graphs not originally in G to be correctly classified into one of the p classes.

The n molecules in G can be represented as n molecular graphs G_1, G_2, ..., G_n, where G_i = G_i(V_i, E_i), i ∈ {1, ..., n}, is a labeled graph with vertices representing atoms and edges representing bonds between atoms.

For applications in bioinformatics, the molecular graphs can be generalized so that the vertices can represent molecules, such as amino acids, and the edges can represent the chemical bonds that connect them. The p classes into which the n molecules and their corresponding molecular graphs are classified can be represented as Ω^(1), ..., Ω^(p), where Ω^(i) = {G_1^(i), ..., G_{c_i}^(i)} ⊆ G, i = 1, ..., p.
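The labeled-graph representation G_i = (V_i, E_i) described above can be sketched as a small data structure. The class and field names below are illustrative assumptions, not from the paper:

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

@dataclass
class MolecularGraph:
    """A labeled graph G_i = (V_i, E_i): vertices carry atom labels,
    edges carry bond types."""
    atoms: Dict[int, str]  # vertex id -> atom label
    bonds: List[Tuple[int, int, str]] = field(default_factory=list)  # (u, v, bond type)

    def add_bond(self, u: int, v: int, bond: str) -> None:
        self.bonds.append((u, v, bond))

# A dataset preclassified into p = 3 classes: class index -> list of graphs.
dataset: Dict[int, List[MolecularGraph]] = {1: [], 2: [], 3: []}

g = MolecularGraph(atoms={0: "N", 1: "O"})
g.add_bond(0, 1, "double")  # the N=O substructure from Section III
dataset[2].append(g)
print(len(dataset[2]))  # 1
```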

In the following, we present the details of the MISMOC technique, which can be used to effectively improve the accuracy of graph classification. MISMOC performs its tasks in several stages. It first searches for frequent subgraphs using an existing algorithm, such as FSG or gSpan. Since a subgraph that appears frequently in one class may also do so in another, not all frequent subgraphs are useful and interesting for classification. To screen out the uninteresting ones, MISMOC makes use of a test statistic to distinguish interesting subgraphs from uninteresting ones.

Once the interesting frequent subgraphs are identified, the interestingness of each of these frequent subgraphs is measured based on an information-theoretic measure called the weight of evidence. These measures can be combined to form an overall total interestingness measure for the purpose of classifying an unseen graph.

    A. Discovering Frequent Subgraphs

To discover frequent subgraphs in a graph database, there are several graph-mining algorithms to choose from. For MISMOC, users can choose between two commonly used graph-mining algorithms: FSG [18] and gSpan [19]. Given the dataset G = {G_1, ..., G_i, ..., G_n} as described earlier, one can use either of these algorithms to discover a set of frequent subgraphs Θ^(1), ..., Θ^(i), ..., Θ^(p), where Θ^(i) = {S_1^(i), ..., S_{n_i}^(i)}, i = 1, ..., p, for each of the corresponding p classes Ω^(1), ..., Ω^(i), ..., Ω^(p).

1) FSG Algorithm: The FSG algorithm can find all frequent

subgraphs in each class of molecular graphs using the Apriori algorithm [23]. It does so by treating edges in the graphs as items in transactions, so that the Apriori algorithm can be used to discover frequent subgraphs just as it is used to discover frequent itemsets; i.e., in the same way that the Apriori algorithm increases the size of frequent itemsets by adding a single item at a time, the FSG algorithm increases the size of frequent subgraphs by adding one edge at a time.

Briefly, FSG can be described as follows. For each Ω^(i), i = 1, ..., p, FSG first finds a set of frequent one-edge subgraphs and a set of frequent two-edge subgraphs. Then, based on these two sets of intermediate subgraphs, it starts to iteratively generate candidate subgraphs whose size is greater than the previous frequent subgraphs by one edge. FSG then counts the frequency of each of these candidates and prunes subgraphs that do not satisfy the support threshold. The qualified subgraphs are further expanded, and their frequencies are verified against the same support condition to prune the lattice of frequent subgraphs. The final set of frequent subgraphs Θ^(1), ..., Θ^(i), ..., Θ^(p), where Θ^(i) contains all frequent k-subgraphs, is generated for each class. Let g_k be a subgraph with k edges, C_k be a set of candidate subgraphs with k edges, and F_k^(i) be a set of frequent k-subgraphs for class Ω^(i); the FSG algorithm is summarized in Fig. 4 [18].

Fig. 4. Algorithm of FSG.
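The level-wise loop above can be sketched as follows. This is a simplification under stated assumptions: patterns are treated as plain edge sets, so the candidate join, connectivity checks, and canonical labeling that real FSG performs are all skipped; only the Apriori skeleton (grow by one edge, prune by support) is shown.

```python
from typing import FrozenSet, List, Set, Tuple

Edge = Tuple[str, str, str]   # (atom label, bond type, atom label)
Graph = FrozenSet[Edge]       # a molecule as a set of labeled edges

def support_count(pattern: Graph, graphs: List[Graph]) -> int:
    return sum(1 for g in graphs if pattern <= g)

def fsg_like(graphs: List[Graph], min_support: int) -> List[Graph]:
    """Apriori-style level-wise growth: frequent k-edge patterns are extended
    by one edge to form (k+1)-edge candidates, and infrequent candidates are
    pruned at every level before the next expansion."""
    all_edges = {e for g in graphs for e in g}
    level: Set[Graph] = {frozenset({e}) for e in all_edges
                         if support_count(frozenset({e}), graphs) >= min_support}
    frequent: List[Graph] = []
    while level:
        frequent.extend(level)
        candidates = {p | frozenset({e}) for p in level for e in all_edges if e not in p}
        level = {c for c in candidates if support_count(c, graphs) >= min_support}
    return frequent

mols = [
    frozenset({("C", "single", "C"), ("N", "double", "O")}),
    frozenset({("C", "single", "C"), ("N", "double", "O")}),
    frozenset({("C", "single", "C"), ("C", "triple", "C")}),
]
patterns = fsg_like(mols, min_support=2)
print(len(patterns))  # 3: {C-C}, {N=O}, and {C-C, N=O}
```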

2) gSpan Algorithm: The gSpan algorithm [19] discovers a set of frequent subgraphs for each graph class by mapping each graph in the class to a unique minimum DFS code as its canonical label. First, gSpan sorts all vertices and edges in the set of graph transactions in each class according to their frequency of occurrence and removes the infrequent vertices and edges from Ω^(i). The remaining vertices and edges are relabeled and sorted in descending order of frequency. F_1^(i) is then formed from all frequent one-edge subgraphs, and it acts as the seed for generating more children. A subprocedure called SubgraphMiner expands each one-edge frequent subgraph in F_1^(i) by adding one edge at a time. In SubgraphMiner, if s is the minimum DFS code of the graph it represents, s is added to the frequent subgraph set Θ^(i); all potential children are then generated with a one-edge growth, and SubgraphMiner is run recursively for each child. After this, the edge is removed from each graph in Ω^(i) once all the descendants of this one-edge graph have been searched. When all frequent k-subgraphs and their descendants have been generated, the final set of frequent subgraphs Θ^(i), i = 1, ..., p, is obtained for each class. The gSpan algorithm is summarized in Fig. 5 [19].

Fig. 5. Algorithm of gSpan.
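For contrast with FSG's level-wise loop, the depth-first pattern growth can be sketched as below. This is only gSpan-flavored: a `seen` set of edge-set patterns stands in for the minimum-DFS-code check, and DFS-code generation, isomorphism testing, and the relabeling step are all omitted.

```python
from typing import FrozenSet, List, Set, Tuple

Edge = Tuple[str, str, str]
Graph = FrozenSet[Edge]

def support_count(pattern: Graph, graphs: List[Graph]) -> int:
    return sum(1 for g in graphs if pattern <= g)

def grow(pattern: Graph, graphs: List[Graph], min_support: int,
         seen: Set[Graph], out: List[Graph]) -> None:
    """Depth-first pattern growth: extend the current pattern by one edge and
    recurse, instead of generating whole candidate levels. The `seen` set
    prunes duplicate growth paths, which gSpan does far more cheaply with
    minimum DFS codes."""
    if pattern in seen:
        return
    seen.add(pattern)
    out.append(pattern)
    extensions = {e for g in graphs if pattern <= g for e in g} - pattern
    for e in extensions:
        child = pattern | frozenset({e})
        if support_count(child, graphs) >= min_support:
            grow(child, graphs, min_support, seen, out)

def gspan_like(graphs: List[Graph], min_support: int) -> List[Graph]:
    seen: Set[Graph] = set()
    out: List[Graph] = []
    for e in {e for g in graphs for e in g}:   # frequent one-edge seeds
        seed = frozenset({e})
        if support_count(seed, graphs) >= min_support:
            grow(seed, graphs, min_support, seen, out)
    return out

mols = [
    frozenset({("C", "single", "C"), ("N", "double", "O")}),
    frozenset({("C", "single", "C"), ("N", "double", "O")}),
    frozenset({("C", "single", "C"), ("C", "triple", "C")}),
]
patterns = gspan_like(mols, min_support=2)
print(len(patterns))  # 3
```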

B. Discovering Interesting Frequent Subgraphs Using MISMOC

FSG and gSpan aim at discovering the frequent subgraphs Θ^(i) = {S_1^(i), ..., S_j^(i), ..., S_{n_i}^(i)}, i = 1, ..., p, in each of the corresponding graph classes Ω^(1), ..., Ω^(i), ..., Ω^(p). These algorithms were not originally developed for graph classification. Hence, while the discovered frequent subgraphs can characterize each graph class, they may not be very useful in discriminating one class from another. This is because a frequent subgraph that appears frequently in one class may also do so in another, and such frequent subgraphs are not interesting for classification. In this section, we present the methodology that MISMOC uses to identify subgraphs that are interesting and useful for classification. This methodology is based on the use of a test statistic [24]-[26], and its details are given in Fig. 6.

Fig. 6. Algorithm of MISMOC.

Once the sets of frequent subgraphs Θ^(i), i = 1, ..., p, are discovered for each of Ω^(i), i = 1, ..., p, respectively, the probability that a graph G is in Ω^(i), i ∈ {1, ..., p}, given that G is characterized by a frequent subgraph S_j^(i) ∈ Θ^(i), j ∈ {1, ..., n_i}, can be determined as follows:

    Pr(G ∈ Ω^(i) | G is characterized by S_j^(i))
        = (total no. of graphs in Ω^(i) that are characterized by S_j^(i))
          / (total no. of graphs in G that are characterized by S_j^(i)).    (1)
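Under the same illustrative edge-set representation used earlier (a stand-in for true subgraph isomorphism), Eq. (1) reduces to a ratio of two counts; a minimal sketch:

```python
from typing import FrozenSet, List, Tuple

Edge = Tuple[str, str, str]
Graph = FrozenSet[Edge]

def characterized_by(graph: Graph, subgraph: Graph) -> bool:
    # Containment as a stand-in for the paper's subgraph-isomorphism test.
    return subgraph <= graph

def pr_class_given_subgraph(subgraph: Graph, class_graphs: List[Graph],
                            all_graphs: List[Graph]) -> float:
    """Eq. (1): the number of graphs in class i characterized by the subgraph,
    over the number of graphs in the whole dataset characterized by it."""
    in_class = sum(1 for g in class_graphs if characterized_by(g, subgraph))
    overall = sum(1 for g in all_graphs if characterized_by(g, subgraph))
    return in_class / overall if overall else 0.0

# Toy counts echoing Section III: the subgraph appears in 5 graphs of class 2
# and in 1 graph of each of classes 1 and 3 (10 graphs per class).
s = frozenset({("N", "single", "C")})
other = frozenset({("C", "single", "C")})
class1 = [s] * 1 + [other] * 9
class2 = [s] * 5 + [other] * 5
class3 = [s] * 1 + [other] * 9
everything = class1 + class2 + class3
print(round(pr_class_given_subgraph(s, class2, everything), 3))  # 0.714 (= 5/7)
```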

If Pr(G ∈ Ω^(i) | G is characterized by S_j^(i)) is not much different from Pr(G ∈ Ω^(i)), i.e., whether or not G is characterized by S_j^(i) makes very little difference, then S_j^(i) should not be considered very interesting in determining whether G should be classified into Ω^(i). Otherwise, S_j^(i) can be very interesting.

To objectively determine whether the two probabilities are different, we make use of a test statistic [24]-[26], d_ji, defined as follows:

    d_ji = z_ji / sqrt(γ_ji)    (2)

where

    z_ji = [n Pr(G ∈ Ω^(i), G is characterized by S_j^(i))
            − n Pr(G ∈ Ω^(i)) Pr(G is characterized by S_j^(i))]
           / sqrt(n Pr(G ∈ Ω^(i)) Pr(G is characterized by S_j^(i)))    (3)

and γ_ji is the maximum likelihood estimate of the variance of


z_ji and is given by

    γ_ji = (1 − Pr(G ∈ Ω^(i))) (1 − Pr(G is characterized by S_j^(i))).    (4)

    Based on [24], if|dj i | > 1.96, we can conclude that the differ-ence between Pr(G

    (i)

    |G is characterized by S

    (i)j ) is signifi-

    cantly different from Pr(G (i) ), and therefore, the subgraphS

    (i)j is interesting and useful for classification. Ifdj i > +1.96, it

    implies that the presence ofS(i)

    j in a graph G provides evidence

    supporting G to be classifiedinto (i) , otherwiseifdj i < 1.96,it implies that thepresenceof thefrequent subgraph S

    (i)j provides

    negative evidence against G to be classified into (i) . In either

    case, S(i)

    j can be considered as interesting frequent subgraph.

    With the use of this test statistic, MISMOC screens each set of frequent subgraphs $\Phi^{(i)} = \{S^{(i)}_1, \ldots, S^{(i)}_j, \ldots, S^{(i)}_{n_i}\}$, $i = 1, \ldots, p$, to retain only those that are interesting. The set of interesting frequent subgraphs discovered for each of $\Pi^{(1)}, \ldots, \Pi^{(i)}, \ldots, \Pi^{(p)}$ is denoted by $\Gamma^{(i)} = \{S^{(i)}_1, \ldots, S^{(i)}_j, \ldots, S^{(i)}_{n'_i}\}$, $i = 1, \ldots, p$, with $n'_i < n_i$, respectively.

    C. Interestingness Measure as a Function of the Weight of Evidence

    The interesting frequent subgraphs provide positive or nega-

    tive evidence supporting or refuting the classification of a graph

    into a particular class. MISMOC measures how interesting these

    frequent subgraphs are with the use of an interestingness mea-

    sure defined in terms of an information-theoretic weight of evi-

    dence measure.

    The more interesting a frequent subgraph is for a class, the greater the difference between the two probabilities $\Pr(G \in \Pi^{(i)} \mid G \text{ is characterized by } S^{(i)}_j)$ and $\Pr(G \in \Pi^{(i)})$. Hence, the interestingness measure is again defined as a function of these two probabilities. Specifically, the more interesting $S^{(i)}_j$ is, the greater is the ratio between $\Pr(G \in \Pi^{(i)} \mid G \text{ is characterized by } S^{(i)}_j)$ and $\Pr(G \in \Pi^{(i)})$. This ratio can be measured with a mutual information measure $I(G \in \Pi^{(i)} : G \text{ is characterized by } S^{(i)}_j)$, between $G \in \Pi^{(i)}$ and $G$ is characterized by $S^{(i)}_j$, as follows:

    $$I(G \in \Pi^{(i)} : G \text{ is characterized by } S^{(i)}_j) = \log \frac{\Pr(G \in \Pi^{(i)} \mid G \text{ is characterized by } S^{(i)}_j)}{\Pr(G \in \Pi^{(i)})}. \qquad (5)$$

    Based on the mutual information measure, the weight of evidence provided by $S^{(i)}_j$ for or against the classification of $G$ into $\Pi^{(i)}$ can be defined as follows:

    $$W^{(i)}(G \mid S^{(i)}_j) = W(G \in \Pi^{(i)} / G \notin \Pi^{(i)} \mid G \text{ is characterized by } S^{(i)}_j) = I(G \in \Pi^{(i)} : G \text{ is characterized by } S^{(i)}_j) - I(G \notin \Pi^{(i)} : G \text{ is characterized by } S^{(i)}_j). \qquad (6)$$

    $W^{(i)}(G \mid S^{(i)}_j)$ can be interpreted as a measure of the difference in the gain in information when a graph $G$ that contains $S^{(i)}_j$ is classified into $\Pi^{(i)}$, as opposed to other classes. $W^{(i)}(G \mid S^{(i)}_j)$ is positive if $S^{(i)}_j$ provides positive evidence supporting the classification of $G$ into $\Pi^{(i)}$; otherwise, it is negative.

    D. Classification Using a Total Interestingness Measure

    Given the interesting frequent subgraphs $\Gamma^{(1)}, \ldots, \Gamma^{(i)}, \ldots, \Gamma^{(p)}$ discovered for the corresponding $p$ classes $\Pi^{(1)}, \ldots, \Pi^{(i)}, \ldots, \Pi^{(p)}$, an unseen graph $G$ not originally in the dataset can be classified by matching it against the subgraphs in each of $\Gamma^{(i)}$, $i = 1, \ldots, p$.

    For every subgraph $S^{(i)}_j \in \Gamma^{(i)}$ that $G$ matches, there is some evidence $W^{(i)}(G \mid S^{(i)}_j)$ provided by it for or against the classification of $G$ into $\Pi^{(i)}$. Assuming that $G$ matches $m_i \le n'_i$ interesting frequent subgraphs in $\Gamma^{(i)}$, $s^{(i)}_1, \ldots, s^{(i)}_j, \ldots, s^{(i)}_{m_i} \in \Gamma^{(i)}$, MISMOC then computes a total interestingness measure for $G$ to be classified into $\Pi^{(i)}$. This total interestingness measure is defined as the summation of the weights of evidence provided by each individual interesting frequent subgraph $s^{(i)}_j$ for or against $G$ being classified into $\Pi^{(i)}$, as follows:

    $$W^{(i)}(G) = W(G \in \Pi^{(i)} / G \notin \Pi^{(i)} \mid G \text{ is characterized by } s^{(i)}_1, \ldots, s^{(i)}_j, \ldots, s^{(i)}_{m_i}) = \sum_{j=1}^{m_i} W(G \in \Pi^{(i)} / G \notin \Pi^{(i)} \mid G \text{ is characterized by } s^{(i)}_j). \qquad (7)$$

    The value of $W^{(i)}(G)$ increases with the number and strength of the matched subgraphs in $s^{(i)}_1, \ldots, s^{(i)}_j, \ldots, s^{(i)}_{m_i}$ that provide positive evidence supporting $G$ being classified into $\Pi^{(i)}$, whereas the value of $W^{(i)}(G)$ decreases if some matched subgraphs provide negative evidence refuting the classification of $G$ into $\Pi^{(i)}$. The total interestingness measure for $G$ to be classified into each of $\Pi^{(1)}, \ldots, \Pi^{(i)}, \ldots, \Pi^{(p)}$ is determined, and MISMOC assigns $G$ to the class that gives the greatest total interestingness measure.
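    As a sketch of (5)-(7) (again our own illustrative code, not the paper's implementation; estimating the probabilities from training counts, and all identifiers, are our assumptions), the weight of evidence of a matched subgraph and the total-interestingness classification rule can be written as:

```python
import math

def weight_of_evidence(n_joint, n_class, n_subgraph, n_total):
    """Weight of evidence a matched subgraph provides for/against class i,
    per (5)-(6). Uses the same counts as the screening step.
    Note: a subgraph occurring only inside (or only outside) the class
    would need smoothing to avoid log(0); omitted here for brevity."""
    p_class = n_class / n_total                  # Pr(G in class i)
    p_class_given_s = n_joint / n_subgraph       # Pr(G in class i | G has S_j)
    i_in = math.log(p_class_given_s / p_class)   # I(G in class : S_j), (5)
    i_out = math.log((1 - p_class_given_s) / (1 - p_class))
    return i_in - i_out                          # (6)

def classify(matched_counts_per_class, n_total):
    """Total interestingness (7): sum the weights of evidence of the matched
    interesting subgraphs for each class and pick the class with the largest
    total. matched_counts_per_class maps a class id to a list of count tuples."""
    totals = {c: sum(weight_of_evidence(*m, n_total) for m in matches)
              for c, matches in matched_counts_per_class.items()}
    return max(totals, key=totals.get), totals
```

    A matched subgraph over-represented in a class contributes a positive term to that class's total; one under-represented in the class contributes a negative term, so the winning class is the one with the strongest net evidence.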

    Compared to algorithms that classify graphs by considering only frequent subgraphs, MISMOC has the advantage that it discovers frequent subgraphs that are considered interesting according to an objective statistical evidence measure. Instead of relying solely on the appearance of frequent subgraphs during classification, MISMOC takes into consideration only those that are useful and interesting. These frequent subgraphs are unique and can have biological meaning. Other frequent graph-mining algorithms, such as FSG and gSpan, can only handle a single class of data; if there are two or more classes, the comparative effect of a subgraph across all classes is ignored. There is always a chance that two or more classes share the same frequent subgraph. With the interestingness measure, we can distinguish interesting frequent subgraphs from uninteresting ones across multiple classes.


    V. ILLUSTRATIVE EXAMPLE CONTINUED

    To illustrate how MISMOC works, let us consider the example in Section III again. Given the frequent subgraphs discovered using FSG at a support threshold of 50%, MISMOC obtains the frequency of occurrence of each of the 15 frequent subgraphs in each class (see Table V). It then screens for all frequent subgraphs that are interesting using the test statistic given by (2). The values of the test statistic for each frequent subgraph in each class are also given in Table VI.

    As described in the last section, subgraphs with $|d_{ji}| < 1.96$ will be filtered out, and the remaining subgraphs will form a set of interesting subgraphs for graph classification. Since $d_{41}$, $d_{51}$, $d_{62}$, $d_{72}$, $d_{83}$, and $d_{93}$ are greater than 1.96, we conclude that, of all 15 frequent subgraphs discovered, only $S^{(1)}_4$ and $S^{(1)}_5$, $S^{(2)}_6$ and $S^{(2)}_7$, and $S^{(3)}_8$ and $S^{(3)}_9$ are interesting frequent subgraphs for classes 1, 2, and 3, respectively.

    Given these interesting frequent subgraphs, we can classify the test graph shown in Fig. 2 by computing the total interestingness measure for it to be classified into each class. Using (1) to (7), we obtain $W^{(1)}(G) = -1.5018$, $W^{(2)}(G) = 2.2288$, and $W^{(3)}(G) = -1.5732$. As the value of $W^{(2)}(G)$ for class 2 is the largest of all, we can conclude that the new sample belongs to class 2. Besides, there is negative evidence against the test graph being classified into classes 1 and 3; therefore, the new sample is not likely to belong to class 1 or 3.

    VI. EXPERIMENTS AND RESULTS

    To evaluate the effectiveness of MISMOC, we tested it using both artificial and real data. We compared its performance with that of two graph classification algorithms based on FSG and gSpan. For experimentation, we used the executable files of these algorithms available from [27] and [28], respectively. The classification results were obtained using tenfold cross-validation with an implementation of the support vector machine (SVM) available at [29].

    A. Performance Evaluation

    The performance of a classifier is usually evaluated by the use of the average classification accuracy, and the results are typically presented in a confusion matrix (see Table VII), which has four entries: the number of true positive cases (TP), true negative cases (TN), false positive cases (FP), and false negative cases (FN). The average accuracy is calculated as follows [30]:

    $$\text{Average accuracy} = \frac{TP + TN}{TP + FN + FP + TN}. \qquad (8)$$

    While evaluation based on the use of the classification accuracy measure may be popular, it may not always be very appropriate for classification problems involving imbalanced class distributions.

    TABLE VI: INTERESTINGNESS MEASURE OF FREQUENT SUBGRAPHS (SUPPORT THRESHOLD = 50%)

    TABLE VII: CONFUSION MATRIX

    When TN is much greater than TP, (FP + TN) is also much greater than (TP + FN). In such a case, the successfully predicted cases in the minority positive class will play a

    role that can be too insignificant when the average accuracy rate

    is determined and the minority cases will be treated as noise

    even if they are supposed to be important. In order to overcome

    this problem, the true positive and false positive rates need tobe monitored separately, using (9) and (10) when test data are

    being classified

    $$\text{True positive rate} = \frac{TP}{TP + FN} \qquad (9)$$

    $$\text{False positive rate} = \frac{FP}{FP + TN}. \qquad (10)$$

    These rates measure the performance of a classifier for each

    class and the objective is to keep the true positive rate as high

    as possible and the false positive rate as low as possible. Some-

    times, the true positive rate is called recall or sensitivity, and the

    false positive rate is called the false alarm rate. In order to transform this multiobjective problem into a single-objective equivalent,

    the receiver operating characteristic (ROC) analysis [31] has

    been proposed and is becoming more and more popular when

    the training data size for different classes of data are very dif-

    ferent. With the ROC analysis, the true positive rate is plotted

    along the y-axis against the false positive rate along the x-axis to form a ROC curve, and the objective is to maximize the value

    of AUC, which stands for the area under the ROC curve. The

    value of AUC is always between 0.0 and 1.0. An area of 1 repre-

    sents a perfect classification, whereas an area of 0.5 represents a

    worthless classification that is equivalent to a random guess in a

    two-class classification problem. The AUC reflects very well the probability that a classifier ranks a randomly chosen positive

    instance higher than a randomly chosen negative instance. In

    this paper, as the datasets that we use differ significantly in class

    sizes, we use the AUC to evaluate the performance of different

    classifiers on different datasets.
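    As a brief sketch (our own illustrative code; the function names are assumptions), the rates in (9) and (10) and the AUC, computed directly through its rank-statistic interpretation, can be obtained as follows:

```python
def rates(tp, fn, fp, tn):
    """True positive rate (9) and false positive rate (10)
    from the four confusion-matrix entries."""
    return tp / (tp + fn), fp / (fp + tn)

def auc(pos_scores, neg_scores):
    """AUC as the probability that a randomly chosen positive instance
    is ranked above a randomly chosen negative one (ties count 1/2)."""
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos_scores for n in neg_scores)
    return wins / (len(pos_scores) * len(neg_scores))
```

    A classifier that scores every positive above every negative attains an AUC of 1.0, while a classifier whose scores are uninformative hovers around 0.5, matching the interpretation above.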

    B. Datasets

    The first dataset is a set of binary-class artificial data that

    are generated with GraphGen [32]. The artificial datasets are

    generated with a set of parameters: 1) the total number of trans-

    actions (-ngraphs); 2) the average size of each graph (-size);

    3) the number of unique node labels (-nnodel); 4) the number

    of unique edge labels (-nedgel); 5) the average density of each

    graph (-density); 6) the number of unique edges in the whole

    dataset (-nedges); and 7) the average edge ratio of each graph

    (-edger). Parameters 1, 4, 5, 6, and 7 are fixed at 5000, 10, 0.3, 100, and 0.2, respectively, and we vary the remaining parameters to generate four datasets, as given in Table VIII, with the properties given next.

    The second dataset is collected from predictive toxicology

    challenge (PTC) [33] that contains the carcinogenicity of 417

    chemical compounds on four types of rodents: male rats (MR),

    female rats (FR), male mice (MM), and female mice (FM). Each

    of these datasets can be considered as consisting of two classes

    TABLE VIII: ARTIFICIAL DATASETS WITH DIFFERENT PARAMETERS

    TABLE IX: PROPERTIES OF THE EXPERIMENTAL DATASETS

    TABLE X: CLASSIFICATION PERFORMANCE FOR FSG AND MISMOC

    of data [39]: those with positive evidence of cancerous growth

    and those with negative evidence.

    The third dataset is collected from the Estrogen Receptor

    Binding (NCTR ER) Database in the Distributed Structure-

    Searchable Toxicity (DSSTox) Public Database Network of the

    National Center for Toxicological Research [34]. The database

    covers most known estrogenic classes and it is a structurally di-

    verse set of estrogens. The NCTR ER database consists of 224

    chemical compounds, each classified as "active" or "inactive" with respect to the attribute ActivityOutcome_NCTRER. A compound is "active" if the measured activity of the compound is strong, medium, or weak. It is "inactive" if there is no activity for that compound. The properties of the datasets we

    used in our experiments are listed in Table IX.

    C. Performance Analysis

    For performance comparison, we first tested all datasets using the two algorithms FSG and gSpan, and then compared their performance when MISMOC is used. Tables X and XI show the performance of the different algorithms on the different datasets. For easier comparison, we use a single misclassification cost value of 3.0, as suggested in [38], for the SVM classifier.


    TABLE XI: CLASSIFICATION PERFORMANCE FOR gSPAN AND MISMOC

    For our experiments, as a high threshold may result in too few and a low threshold may result in too many of the frequently occurring subgraphs being discovered, and as the support threshold affects the runtime and memory consumption [43], [44], we tried different support thresholds ranging from 90% to 2% and decided to settle at 3% for the artificial datasets, 5% for the PTC dataset, and 10% for the NCTR ER dataset for both the experiments with FSG and gSpan. These settings allow us to obtain a good number of subgraphs (i.e., $50 \le n \le 500$) for the identification of the interesting ones.

    Given these settings of the support thresholds, the average AUC for each algorithm is determined and shown in the tables. From these results, we can see that the classification performance (average AUC) of FSG and gSpan is similar: the average AUCs for them are 0.683 and 0.691, respectively. After applying MISMOC to these frequent-subgraph discovery algorithms, their average AUCs improved by 14.44% and 14.05%, respectively.

    These results show that the performance of FSG and gSpan

    can be improved with the two-phase approach that MISMOC

    adopts. The subgraphs discovered by many graph-mining al-

    gorithms may appear frequently in a class, but they may not

    uniquely represent a class. Subgraphs that may not appear very

    frequently can play an important role in discriminating one class

    from another. With MISMOC, the relative frequency of each

    subgraph is considered and how useful they are for classifica-

    tion are determined with a measure. The measure is then used

    when a graph is classified. This makes MISMOC more effective

    as a graph-classification algorithm.

    The datasets D1 to D4 are the artificial datasets with varied sizes of graph samples and numbers of unique node labels. When the number of unique node labels is increased from 5 to 10, we can see that the classification performance is higher for D2, with more unique node labels, than for D1, with fewer unique node labels; the case is the same for D3 and D4. The reason is that there will be fewer combinations of discovered frequent subgraphs if the number of unique node labels is small. For example, if there are only two node labels $v_1$ and $v_2$ in the dataset, we have only three combinations ($v_1 v_1$, $v_1 v_2$, and $v_2 v_2$) for a graph with two vertices and one edge; if there are five node labels $v_1, \ldots, v_5$ in the dataset, we can have 15 combinations. In the case with fewer unique node labels, many frequent subgraphs will be the same for both the positive and negative classes. These frequent subgraphs are uninteresting and not useful in discriminating the graph samples into different classes. With MISMOC, we can filter out these uninteresting frequent subgraphs to increase the classification performance. Hence, we can observe from the results that the average AUC of D1 is lower than that of D2, and that the AUC increases more significantly in D1 than in D2 after applying MISMOC. When the size of the graph samples is increased from 10 to 30, we can see that the classification performance is lower for D4, with the larger graph size, than for D2, with the smaller graph size; the case is the same for D1 and D3. The reason for this is that a large graph will contain more noise than a small graph, as the interesting subgraph(s) usually contribute only a small part of a graph. From the results, we can see that the average AUC of D4 is lower than that of D2, and MISMOC helps to remove these noisy frequent subgraphs and increases the AUC more significantly in D4 than in D2, as the graph size in D4 is larger than that in D2.
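    The label-combination counts quoted above follow from elementary counting: a one-edge subgraph is an unordered pair of node labels with repetition allowed (assuming a single edge label), so with $v$ distinct labels the number of possibilities is

```latex
\binom{v}{2} + v \;=\; \frac{v(v+1)}{2},
\qquad
\frac{2 \cdot 3}{2} = 3,
\qquad
\frac{5 \cdot 6}{2} = 15 .
```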

    The PTC dataset contains four datasets: MR, FR, MM, and FM. The average AUC of FM is the highest, and that of FR is the lowest. This may be due to the percentage of positive samples in FM (38.1%) being higher than that in FR (31.1%).

    The overall AUC for the PTC dataset is 0.58 when applying FSG

    and gSpan, and this value has increased to 0.63 with MISMOC.

    The overall AUC is still relatively low even when MISMOC

    is used, and this may be due to some structural features in the test set not being present in the training set. This is the main

    reason that the classification performance is quite low. This

    phenomenon is also mentioned in the evaluation report of [33].

    The NCTR ER dataset has the highest AUC among all the experimental datasets. The average AUC for FSG and gSpan is 0.844, and this is increased to 0.939 with MISMOC. This means that the ER compounds contain distinguishing structures

    for active and inactive classes. The discovered interesting fre-

    quent subgraphs can be used to characterize a class of estrogen

    as well as to discriminate it from other classes. From the per-

    centage of improvement in AUC, we can observe that the noisy

    and uninteresting frequent subgraphs are effectively ignored by

    MISMOC, and the AUC is increased when it is used with FSG

    and gSpan.

    VII. CONCLUSION

    In this paper, we introduced a new graph-mining technique

    called MISMOC to discover interesting frequent subgraphs from

    graph databases. It is evaluated with both artificial and real

    datasets, and the experimental results show that MISMOC can

    work very well with large and complex datasets and can improve

    the classification performance of the existing graph-mining

    algorithms.

    The frequent subgraphs of real biological datasets usually

    contain many common vertices [e.g., carbon (C) and oxygen

    (O)] and edges (e.g., single hydrogen bond). For this reason,

    both positive and negative samples may contain the same set

    of frequent subgraphs. The frequent subgraphs discovered by

    existing graph-mining algorithms may, therefore, not be very

    useful for molecular classification. MISMOC is able to achieve


    a higher accuracy as it aims to discover interesting subgraphs

    that do not just occur more frequently but can also allow graph

    classes to be better discriminated from one another. MISMOC

    can better handle the problem of having too many frequent subgraphs when support thresholds are lowered. As with other graph-mining algorithms, the size and number of graphs that MISMOC can handle can be very large; they are limited mainly by the computing hardware.

    The next version of MISMOC will include an algorithm that

    can discover interesting subgraphs that may not occur frequently

    enough. However, it will not be relying on a frequent-subgraph-

    mining algorithm in the first phase. In order to facilitate un-

    derstanding, it will also try to better identify graphs that are

    maximal and less fragmented. In addition, it will represent

    graphs in a more flexible structure so that graphs that are similar can be represented by the same subgraph. The next release of

    MISMOC is expected to take into consideration topological in-

    dexes of the discovered structure so as to allow graph classes to

    be distinguished more easily from each other.

    REFERENCES

    [1] D. Conklin, S. Fortier, and J. Glasgow, "Knowledge discovery in molecular databases," IEEE Trans. Knowl. Data Eng., vol. 5, no. 6, pp. 985–987, Dec. 1993.

    [2] T. Barrett, T. O. Suzek, D. B. Troup, S. E. Wilhite, W. C. Ngau, P. Ledoux, D. Rudnev, A. E. Lash, W. Fujibuchi, and R. Edgar, "NCBI GEO: Mining millions of expression profiles - database and tools," Nucleic Acids Res., vol. 33, pp. D562–D566, 2005.

    [3] A. Zanzoni, L. Montecchi-Palazzi, M. Quondam, G. Ausiello, M. Helmer-Citterich, and G. Cesareni, "MINT: A Molecular INTeraction database," FEBS Lett., vol. 513, no. 1, pp. 135–140, 2002.

    [4] K. Arnold, L. Bordoli, J. Kopp, and T. Schwede, "The SWISS-MODEL Workspace: A web-based environment for protein structure homology modeling," Bioinformatics, vol. 22, pp. 195–201, 2006.

    [5] L. Holm, C. Ouzounis, C. Sander, G. Tuparev, and G. Vriend, "A database of protein structure families with common folding motifs," Protein Sci., vol. 1, pp. 1691–1698, 1992.

    [6] M. Ebeling and S. Suhai, "Molecular databases on the internet," J. Mol. Med., vol. 75, pp. 620–623, 1997.

    [7] A. Sperduti and A. Starita, "Supervised neural networks for the classification of structures," IEEE Trans. Neural Netw., vol. 8, no. 3, pp. 714–735, May 1997.

    [8] C. A. Lipinski, F. Lombardo, B. W. Dominy, and P. J. Feeney, "Experimental and computational approaches to estimate solubility and permeability in drug discovery and development settings," Adv. Drug Del. Rev., vol. 46, pp. 3–26, 2001.

    [9] L. A. Mirny and E. I. Shakhnovich, "Universally conserved positions in protein folds: Reading evolutionary signals about stability, folding kinetics and function," J. Mol. Biol., vol. 291, no. 1, pp. 177–196, 1999.

    [10] A. Kallioniemi, O. P. Kallioniemi, D. Sudar, D. Rutovitz, J. W. Gray, F. Waldman, and D. Pinkel, "Comparative genomic hybridization for molecular cytogenetic analysis of solid tumors," Science, vol. 258, no. 5083, pp. 818–821, 1992.

    [11] M. G. Dunlop, S. M. Farrington, A. D. Carothers, A. H. Wyllie, L. Sharp, J. Burn, B. Liu, K. W. Kinzler, and B. Vogelstein, "Cancer risk associated with germline DNA mismatch repair gene mutations," Hum. Mol. Genet., vol. 6, pp. 105–110, 1997.

    [12] L. Nakhleh, T. Warnow, C. R. Linder, and K. St. John, "Reconstructing reticulate evolution in species - theory and practice," J. Comput. Biol., vol. 12, no. 6, pp. 796–811, 2005.

    [13] J. A. Bondy, Graph Theory With Applications. New York: Elsevier, 1976.

    [14] Y. Yoshida, Y. Ohta, K. Kobayashi, and N. Yugami, "Mining interesting patterns using estimated frequencies from subpatterns and superpatterns," Lecture Notes in Computer Science, vol. 2843, pp. 494–501, 2003.

    [15] L. B. Holder, D. J. Cook, and S. Djoko, "Substructure discovery in the SUBDUE system," in Proc. AAAI Workshop Knowl. Discov. Databases, 1994, pp. 169–180.

    [16] C. Borgelt and M. R. Berthold, "Mining molecular fragments: Finding relevant substructures of molecules," in Proc. 2nd IEEE Int. Conf. Data Mining (ICDM), 2002, pp. 51–58.

    [17] R. D. King, A. Srinivasan, and L. Dehaspe, "Warmr: A data mining tool for chemical data," J. Comput.-Aided Mol. Des., vol. 15, no. 2, pp. 173–181, 2001.

    [18] M. Kuramochi and G. Karypis, "Frequent sub-graph discovery," in Proc. 1st IEEE Int. Conf. Data Mining (ICDM), 2001, pp. 313–320.

    [19] X. Yan and J. Han, "gSpan: Graph-based substructure pattern mining," in Proc. IEEE Int. Conf. Data Mining, 2002, pp. 721–724.

    [20] I. Fischer and T. Meinl, "Graph-based molecular data mining - an overview," in Proc. IEEE Int. Conf. Syst., Man Cybern., 2004, vol. 5, pp. 4578–4582.

    [21] M. Deshpande, M. Kuramochi, N. Wale, and G. Karypis, "Frequent substructure-based approaches for classifying chemical compounds," IEEE Trans. Knowl. Data Eng., vol. 17, no. 8, pp. 1036–1050, Aug. 2005.

    [22] S. H. Muggleton, "Inductive logic programming," N. Gen. Comput., vol. 8, no. 4, pp. 295–318, 1991.

    [23] A. Inokuchi, T. Washio, and H. Motoda, "An apriori-based algorithm for mining frequent substructures from graph data," in Proc. 4th Eur. Conf. Principles Pract. Knowl. Discov. Databases (PKDD), 2000, pp. 13–23.

    [24] K. C. C. Chan, A. K. C. Wong, and D. K. Y. Chiu, "Learning sequential patterns for probabilistic inductive prediction," IEEE Trans. Syst., Man Cybern., vol. 24, no. 10, pp. 1532–1547, Oct. 1994.

    [25] K. C. C. Chan and A. K. C. Wong, "APACS: A system for automated pattern analysis and classification," Comput. Intell.: Int. J., vol. 6, pp. 119–131, 1990.

    [26] P. C. H. Ma and K. C. C. Chan, "UPSEC: An algorithm for classifying unaligned protein sequences into functional families," J. Comput. Biol., vol. 15, no. 4, pp. 431–443, 2008.

    [27] FSG, Karypis Lab, version 1.0.1. (2003). [Online]. Available: http://www-users.cs.umn.edu/karypis/pafi

    [28] gSpan, Illimine, version 1.1.1. (2006). [Online]. Available: http://illimine.cs.uiuc.edu/download/index.php

    [29] C. C. Chang and C. J. Lin. (2001). LIBSVM: A library for support vector machines. [Online]. Available: http://www.csie.ntu.edu.tw/cjlin/libsvm

    [30] S. Daskalaki, I. Kopanas, and N. Avouris, "Evaluation of classifiers for an uneven class distribution problem," Appl. Artif. Intell., vol. 20, no. 5, pp. 381–417, 2006.

    [31] T. Fawcett, "An introduction to ROC analysis," Pattern Recogn. Lett., vol. 27, pp. 861–874, 2006.

    [32] J. Cheng, Y. Ke, and W. Ng. (2006). GraphGen: A graph synthetic generator. [Online]. Available: http://www.cse.ust.hk/graphgen/

    [33] A. Srinivasan, R. D. King, S. H. Muggleton, and M. Sternberg, "The predictive toxicology evaluation challenge," presented at the 15th IJCAI, Los Angeles, CA, 1997.

    [34] W. Tong, H. Fang, C. R. Williams, J. M. Burch, and A. M. Richard. (2008). DSSTox FDA National Center for Toxicological Research Estrogen Receptor Binding Database (NCTRER): SDF files and website documentation, NCTRER_v4b_232_15Feb2008. [Online]. Available: www.epa.gov/ncct/dsstox/sdf_nctrer.html

    [35] J. Devillers and A. T. Balaban, Topological Indices and Related Descriptors in QSAR and QSPR. Boca Raton, FL: CRC Press, 1999.

    [36] R. D. King, S. H. Muggleton, A. Srinivasan, and M. J. E. Sternberg, "Structure-activity relationships derived by machine learning: The use of atoms and their bond connectivities to predict mutagenicity by inductive logic programming," Proc. Nat. Acad. Sci., vol. 93, pp. 438–442, 1996.

    [37] A. Srinivasan and R. King, "Feature construction with inductive logic programming: A study of quantitative predictions of biological activity aided by structural attributes," J. Knowl. Discov. Data Mining, vol. 3, pp. 37–57, 1999.

    [38] M. Deshpande and G. Karypis, "Automated approaches for classifying structure," in Proc. 2nd ACM SIGKDD Workshop Data Mining Bioinf., 2002, pp. 11–18.

    [39] S. Menchetti, F. Costa, and P. Frasconi, "Weighted decomposition kernels," in Proc. 22nd Int. Conf. Mach. Learning, Bonn, Germany, 2005, pp. 585–592.

    [40] T. Meinl, C. Borgelt, and M. R. Berthold, "Discriminative closed fragment mining and perfect extensions in MoFa," in Proc. 2nd Starting AI Res. Symp. (STAIRS), Valencia, Spain, 2004, pp. 3–14.

    [41] C. Borgelt, H. Hofer, and M. Berthold, "Finding discriminative molecular fragments," presented at the Workshop Inf. Mining Navigat. Large Heterogen. Spaces Multimedia Inf., German Conf. Artif. Intell., Hamburg, Germany, 2003.

    [42] S. Nijssen and J. N. Kok, "Frequent graph mining and its application to molecular databases," in Proc. IEEE Conf. Syst., Man Cybern. (SMC), W. Thissen, P. Wieringa, M. Pantic, and M. Ludema, Eds., Den Haag, The Netherlands, 2004, pp. 4571–4577.

    [43] M. Worlein, T. Meinl, I. Fischer, and M. Philippsen, "A quantitative comparison of the subgraph miners MoFa, gSpan, FFSM, and Gaston," in Proc. 9th Eur. Conf. Principles Pract. Knowl. Discov. Databases (PKDD), Porto, Portugal (Lecture Notes in Computer Science), A. Jorge, L. Torgo, P. Brazdil, R. Camacho, and J. Gama, Eds. Berlin, Germany: Springer-Verlag, 2005, pp. 392–403.

    [44] S. Nijssen and J. N. Kok, "A quickstart in frequent structure mining can make a difference," in Proc. Int. Conf. Knowl. Discov. Data Mining, 2004, pp. 647–652.

    [45] R. Chittimoori, L. B. Holder, and D. J. Cook, "Applying the subdue substructure discovery system to the chemical toxicity domain," presented at the AAAI Spring Symp. Predictive Toxicol. Chem.: Exp. Impact AI Tools, Menlo Park, CA, 1999.

    Winnie W. M. Lam received the B.Sc. (Hons.) degree in information technology from Hong Kong Polytechnic University, Hung Hom, Hong Kong. She is currently working toward the Ph.D. degree in the Department of Computing, Hong Kong Polytechnic University.

    She has been involved in several large-scale commercial projects, including the ESDlife electronic system of the Government of the Hong Kong Special Administrative Region (HKSAR), the system migration project in the Hong Kong Exchanges and Clearing Limited, the data mining development in the Kowloon-Canton Railway Corporation and Immigration Department, and the consultancy project in SPSS Inc. Her research interests include data mining, bioinformatics, and artificial intelligence.

    Keith C. C. Chan (M'94) received the B.Math. degree in computer science and statistics, and the M.A.Sc. and Ph.D. degrees in systems design engineering from the University of Waterloo, Waterloo, ON, Canada, in 1984, 1985, and 1989, respectively.

    He joined the IBM Canada Laboratory as a Senior Analyst and was involved in the design and development of image and multimedia, and software engineering tools. In 1993, he joined as an Associate Professor in the Department of Electrical and Computer Engineering, Ryerson University, Toronto, ON. In 1994, he joined the Department of Computing, Hong Kong Polytechnic University, Hung Hom, Hong Kong, where he is currently a Professor. He has been a Consultant in various companies and other parts of Asia and Europe. His research interests include data mining, bioinformatics, software engineering, and pervasive computing.