
    IEEE TRANSACTIONS ON NANOBIOSCIENCE, VOL. 9, NO. 2, JUNE 2010 77

Discovering Interesting Molecular Substructures for Molecular Classification

    Winnie W. M. Lam and Keith C. C. Chan*, Member, IEEE

Abstract—Given a set of molecular structure data preclassified into a number of classes, the molecular classification problem is concerned with the discovery of interesting structural patterns in the data so that unseen molecules not originally in the dataset can be accurately classified. To tackle the problem, interesting molecular substructures have to be discovered, and this is done typically by first representing molecular structures as molecular graphs, and then using graph-mining algorithms to discover frequently occurring subgraphs in them. These subgraphs are then used to characterize different classes for molecular classification. While such an approach can be very effective, it should be noted that a substructure that occurs frequently in one class may also occur frequently in another. The discovery of frequent subgraphs for molecular classification may, therefore, not always be the most effective. In this paper, we propose a novel technique called mining interesting substructures in molecular data for classification (MISMOC) that can discover interesting frequent subgraphs not just for the characterization of a molecular class but also for distinguishing it from the others. Using a test statistic, MISMOC screens each frequent subgraph to determine if it is interesting. For those that are interesting, their degrees of interestingness are determined using an information-theoretic measure. When classifying an unseen molecule, its structure is matched against the interesting subgraphs in each class, and a total interestingness measure for the unseen molecule to be classified into a particular class is determined based on the interestingness of each matched subgraph. The performance of MISMOC is evaluated using both artificial and real datasets, and the results show that it can be an effective approach for molecular classification.

Index Terms—Frequent subgraph, graph mining, interestingness, molecular classification, molecular structures.

    I. INTRODUCTION

THE SIZE and number of molecular structure databases have grown rapidly in recent years, due to advances in X-ray diffraction and nuclear magnetic resonance (NMR) technologies [1]. Molecular databases of nucleotides, genomes, proteins, nucleic acids, etc., such as NCBI, MINT, SwissMod, and FSSP in EMBL [2]-[5], have been made available online. These databases continue to grow in size and diversity, and there is an increasing need for techniques to mine these data for interesting patterns [6]. There have been, for example, attempts to discover such patterns for molecular classification [1], [7].

Manuscript received March 25, 2009; revised October 26, 2009. Date of current version June 3, 2010. Asterisk indicates corresponding author.

W. W. M. Lam is with the Department of Computing, Hong Kong Polytechnic University, Hung Hom, Hong Kong (e-mail: [email protected]).

*K. C. C. Chan is with the Department of Computing, Hong Kong Polytechnic University, Hung Hom, Hong Kong (e-mail: [email protected]).

Digital Object Identifier 10.1109/TNB.2010.2042609

Given a set of molecular structure data preclassified into a number of classes, the molecular classification problem is concerned with the discovery of interesting structural patterns in the data so that unseen molecules not originally in the dataset can be accurately classified. Effective molecular classification can uncover relationships between structures and functions, and has applications in many areas, such as drug discovery [8], protein folding [9], comparative genomics [10], cancer-risk assessment [11], and gene evolution [12].

To tackle the molecular classification problem, two types of approaches have been used. The first is the more traditional approach of using what is called the quantitative structure-activity relationship (QSAR) or the quantitative structure-property relationship (QSPR) model [35] to derive descriptors from chemical compounds for classification. The second, more recent approach is to represent molecular structures as molecular graphs [13] and to discover frequently occurring subgraphs [14] in them for classification. Both approaches aim to extract attributes that can best represent the structure of chemical compounds. The latter approach has recently become more popular, as it has been shown that using frequent subgraph analysis for molecular classification can be better than using the QSAR/QSPR models [36], [37]. This is because QSAR/QSPR models cannot map chemical structure directly to attribute-based descriptions, such as the internal organization of chemical compounds. Besides, compared with frequent subgraph analysis, QSAR/QSPR requires much more user intervention and domain knowledge. For this reason, graph-mining algorithms that can discover frequently occurring subgraphs in larger graphs have recently become popular (e.g., WARMR [17], Frequent SubGraph discovery (FSG) [18], Graph-based Substructure PAtterN mining (gSpan) [19], and GrAph/Sequence/Tree extractiON (GASTON) [42]). These frequently occurring subgraphs are subgraphs that occur frequently enough in different classes. The idea of finding frequent subgraphs in different classes of molecular data has been proposed previously and has been shown to be effective [43]. However, it should be noted that, as subgraphs that occur frequently in one class may also occur frequently in another, the discovery of frequent subgraphs for molecular classification may not always be the most effective approach. It does not explicitly find discriminative subgraphs to allow one class to be easily discriminated from another. There have been some recent attempts to find such subgraphs between classes, defined as subgraphs that appear more frequently in a certain positive class than in a negative class [15], [16]. However, how much more frequently these subgraphs should appear for them to be considered discriminative is not explicitly stated.

1536-1241/$26.00 © 2010 IEEE


In this paper, we propose a novel graph-mining algorithm for molecular classification. This algorithm, which is called mining interesting substructures in molecular data for classification (MISMOC), can discover interesting frequent subgraphs for the characterization of a molecular class and for the discrimination of it from one or more of the other classes. The classification problem that MISMOC can tackle is, therefore, not restricted to binary classification. MISMOC performs its tasks by first filtering out subgraphs that do not occur frequently enough for the purpose of classification. Using a test statistic, it then filters out those frequently occurring subgraphs that only appear as frequently as expected. Those that remain are subgraphs that are interesting in the sense that they not only characterize a class of molecular graphs, but also allow them to be discriminated from the others. For each interesting subgraph, MISMOC determines a degree of interestingness based on an information-theoretic measure. When classifying an unseen molecule that is not in the original dataset, this molecule's structure is matched against the interesting subgraphs in each class, and a total interestingness measure for the unseen molecule to be classified into a particular class is then determined for the purpose of classification.

The performance of MISMOC is evaluated with both artificial and real data. The experimental results show that MISMOC can discover interesting frequent subgraphs that can both characterize and distinguish molecules of one class from the others. It can also reduce the number of subgraphs that need to be considered for graph classification by filtering out those subgraphs that are not interesting for classification.

The rest of this paper is organized as follows. Section II presents a review of existing graph-mining algorithms that can be used for classifying molecular structures. Using an illustrative example, Section III describes how frequently occurring subgraphs can be discovered. Section IV presents the details of our proposed approach, MISMOC. For illustration, Section V makes use of an example to demonstrate how MISMOC can effectively perform molecular classification tasks. In Section VI, we describe how the performance of MISMOC was evaluated and present the results of the experiments that were carried out. Finally, Section VII summarizes the work and discusses possible directions for future research.

    II. RELATED WORK

Many graph-mining algorithms have been developed to discover interesting subgraphs in data with complex structures. Given such data represented in the form of graphs, these algorithms can be used to mine frequent subgraphs in them. These frequent subgraphs can then be used to tackle the classification problem [20], [21].

Graph-mining algorithms based on inductive logic programming (ILP), for example, have been used to discover frequent subgraphs for classification [22]. An ILP-based algorithm called WARMR [17] is able to mine frequent subgraphs in graph data represented as first-order predicate logic. ILP-based approaches to graph mining, being based on predicate logic, have the disadvantage that they may not be very robust to noisy data. Also, when dealing with real-world databases, which tend to be very large, the computational complexity of these algorithms can be too high to handle. These approaches have to perform many tests for equivalence in order to prune infrequent and semantically redundant subgraphs.

Other than the ILP-based algorithms, there are quite a number of other graph-mining algorithms that can be used to discover frequent subgraphs. FSG [18], for example, adopts an edge-based subgraph generation strategy for this purpose. It expands subgraphs using a level-by-level approach [23]: first enumerating all frequent single- and double-edge subgraphs, and then generating larger subgraphs iteratively by adding one more edge to those generated in the previous iteration. For FSG to perform its tasks, it has to rely on canonical labeling to check whether a particular subgraph satisfies a support threshold. If two graphs are isomorphic, their canonical labels are identical. This canonical labeling process for the determination of graph isomorphism is memory consuming for large databases.

Other than FSG, gSpan [19] is also a popular graph-mining algorithm that has been used for graph classification. gSpan searches for frequent subgraphs over graph canonical forms using a depth-first search (DFS) strategy. It does so by starting from a randomly chosen vertex, then visiting and marking the vertices to which this chosen vertex is connected. This process of visiting and marking vertices continues repeatedly until a full DFS tree is built. For each graph searched, more than one tree may be built with DFS, depending on the order in which the vertices are visited. By means of DFS, gSpan is able to discover all frequent subgraphs without generating candidate subgraphs and pruning false positives.

Another algorithm for mining frequent subgraphs is called Gaston [42]. It discovers such subgraphs by first finding frequent paths, then trees, and then cyclic graphs. It stores all occurrences of these graphs in an embedding list so that the frequency of occurrence of a subgraph can be determined by scanning the embedding list, thereby improving the speed of the graph-mining process [43].

MoFa [16] has been used to find frequent subgraphs in graph data by maintaining parallel embeddings for both vertices and edges. Like Gaston, each such embedding consists of a set of references to a molecule that point to the atoms and bonds that form a subgraph. Such embeddings can be extended so that larger subgraphs can be formed iteratively [16]. MoFa was later enhanced to discover discriminative subgraphs [40], [41] with relatively higher support, and these subgraphs make MoFa a more suitable approach for graph classification.

Subdue [15] is another graph-mining algorithm that discovers frequent subgraphs. It makes use of the minimum description length principle to narrow down possible outcomes when trying to identify subgraphs that best compress the original graph [45].

The graph-mining algorithms described earlier discover frequent subgraphs by building on smaller subgraphs edge by edge. Subgraph isomorphism for graph matching is required as part of the kernels of these algorithms, and this process is known to be nondeterministic polynomial time (NP)-complete. The discovery of frequent subgraphs using existing graph-mining


algorithms requires a frequency threshold to be supplied. If the threshold is set too large, one may not be able to discover enough frequent subgraphs to allow graph classes to be distinguished from each other. If the threshold is set too small, one may discover too many frequent subgraphs that are irrelevant for classification. As subgraphs that appear frequently in one class of graphs may also do so in another, the discovery of frequent subgraphs may not always be useful for graph classification. What is needed for the task is a way to discover subgraphs in a class that make it distinguishable from other classes.

In the following, we propose a graph-mining technique called MISMOC for this purpose. Given a set of frequent subgraphs, MISMOC can screen out frequent subgraphs that are not useful for classification and retain those that are useful for the characterization of molecular classes and the discrimination of one class from another.

    III. ILLUSTRATIVE EXAMPLE

To explain why the discovery of frequent subgraphs may not always be useful for graph classification, let us consider an example. We are given the three classes of artificial molecular data shown in Fig. 1.

Each of these three classes contains ten molecules, and each molecule consists of atoms connected with bonds. These molecules are generated in such a way that the atoms are chosen from 30 possible atoms, including carbon (C), oxygen (O), iridium (Ir), nobelium (No), and thorium (Th), and the bond types from three possible types: single, double, and triple bonds. These molecules can be represented as labeled molecular graphs, with each node representing an atom and each edge a bond.

Given the set of graph data shown in Fig. 1, frequent subgraphs can be discovered in each of classes 1, 2, and 3 using a graph-mining algorithm, such as FSG, and the class of the unknown molecule given in Fig. 2 can be predicted. These algorithms require a threshold to be given by the user to define how frequently a subgraph should appear for it to be considered frequent.
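As a toy illustration of how such a support threshold screens candidate substructures, the sketch below counts the fraction of molecules containing a candidate pattern. All names and the edge-set representation are illustrative assumptions, not the paper's implementation; in particular, containment is a simple set test here, whereas a real miner needs subgraph isomorphism.

```python
from typing import FrozenSet, List, Tuple

# A bond is (atom label, bond type, atom label); a molecule is a set of bonds.
# This edge-set view ignores topology -- a real check needs subgraph
# isomorphism -- but it suffices to show how a support threshold works.
Edge = Tuple[str, str, str]
Molecule = FrozenSet[Edge]

def support(pattern: Molecule, molecules: List[Molecule]) -> float:
    """Fraction of molecules whose bond set contains every bond of the pattern."""
    return sum(1 for m in molecules if pattern <= m) / len(molecules)

def frequent(candidates: List[Molecule], molecules: List[Molecule],
             threshold: float) -> List[Molecule]:
    """Keep only candidates whose support meets the user-supplied threshold."""
    return [p for p in candidates if support(p, molecules) >= threshold]

# Toy class of four molecules; the N=O double bond appears in three of them.
mols = [
    frozenset({("N", "double", "O"), ("C", "single", "C")}),
    frozenset({("N", "double", "O"), ("C", "single", "O")}),
    frozenset({("N", "double", "O")}),
    frozenset({("C", "triple", "C")}),
]
n_double_o: Molecule = frozenset({("N", "double", "O")})
print(support(n_double_o, mols))  # 0.75, so frequent at a 70% threshold
```

With these toy molecules, the N=O pattern survives a 70% threshold but is pruned at 80%.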

For the purpose of illustration, we choose FSG here, as graph-mining algorithms such as gSpan do not perform subgraph pruning. FSG can discover maximal frequent subgraphs and can better avoid the problems caused by the discovery of subgraphs that are too fragmented.

By setting a support threshold of 80% (i.e., any subgraph that occurs in at least eight out of ten graphs), a number of frequent subgraphs can be found, and they are given in Table I. It should be noted that the same frequent subgraph, a nitrogen atom double-bonded to an oxygen atom (i.e., N=O), appears in 80% of the graphs in each of the three classes (see Table I).

Since this choice of threshold does not allow any unique frequent subgraph to be discovered for each class, we lower the support threshold by 10%. The results are shown in Table II.

More frequent subgraphs are discovered when the support threshold is lowered to 70%. However, the newly discovered frequent subgraphs for classes 2 and 3 are still the same, and a graph with such subgraphs may be classified into either class 2 or 3. This means that the discovered frequent subgraphs do not allow graphs in class 2 to be easily discriminated from class 3.

Fig. 1. Training molecular data.

When the support threshold is further lowered to 60%, more frequent subgraphs are discovered, and they are shown in Table III. Unfortunately, the newly discovered frequent subgraphs for each of the three classes still overlap with each other. A graph characterized by these subgraphs can be classified


Fig. 2. Unknown molecule.

TABLE I. MAXIMAL FREQUENT SUBGRAPHS (SUPPORT THRESHOLD = 80%)

TABLE II. MAXIMAL FREQUENT SUBGRAPHS (SUPPORT THRESHOLD = 70%)

TABLE III. MAXIMAL FREQUENT SUBGRAPHS (SUPPORT THRESHOLD = 60%)

into one or more classes. For example, if a graph G is characterized by the subgraph , it can be classified into either class 2 or 3. If G is characterized by the subgraph , it can be classified into either class 1 or 2. If G is characterized by both and , then there is a chance that it can be classified into any of classes 1, 2, or 3, as  appears six times in classes 1 and 2, and  appears seven times in classes 2 and 3.

TABLE IV. MAXIMAL FREQUENT SUBGRAPHS (SUPPORT THRESHOLD = 50%)

To find more interesting and useful frequent subgraphs for classification, the support threshold is further lowered to 50%. Using FSG again, the frequent subgraphs discovered are shown in Table IV. This time, many more frequent subgraphs


    Fig. 3. Classifying the unseen molecule in Fig. 2 with FSG.

are discovered, and some of the subgraphs discovered in each of S^(1), S^(2), and S^(3) do not overlap with each other.

If we are to classify the testing sample in Fig. 2, it should be noted that this graph is characterized by three frequent subgraphs, , , and , from each of S^(1), S^(2), and S^(3), respectively (see Fig. 3). It is, therefore, hard to decide to which class this graph should be assigned based on these subgraphs that it contains. If one takes a closer look at the frequency of appearance of each of these three subgraphs, , , and , in each class, one may discover that even though  is not frequent enough in classes 2 and 3, it appears in 40% of the graphs in these classes. This is also the case with . Although it only appears in 10% of the graphs in class 1, it appears in 40% of the graphs in class 2. Of these three subgraphs,  is the most interesting and unique in the sense that, while it appears in 50% of the graphs in class 2, it only appears in 10% of the graphs in both classes 1 and 3. In other words, this subgraph provides more evidence for a graph it characterizes to be classified into class 2 than the other subgraphs do. In fact, it is for this reason that the graph in Fig. 2 more likely belongs to class 2 than to any other class.

In order to discover more frequent subgraphs that may be useful for classifying the unseen molecule, the support threshold is further reduced to 40%, and new frequent subgraphs are discovered, as shown in Table V. The newly discovered subgraphs are S_8^(1), S_9^(1), and S_10^(1) in class 1; S_7^(2), S_8^(2), S_9^(2), and S_10^(2) in class 2; and S_7^(3), S_8^(3), and S_9^(3) in class 3. Although the support threshold is lowered to 40%, these subgraphs all also appear frequently in the other classes; for example, S_8^(1) was previously discovered as frequent subgraph S_2^(2) in class 2 and S_2^(3) in class 3. The same holds for the others. We tried to further reduce the support threshold to 30%, but the situation remained the same: each newly discovered subgraph had already been found at a higher threshold value.

The actual relative frequency of appearance of each frequent subgraph in each class may therefore provide useful information for classification. The idea that MISMOC uses to filter out uninteresting and irrelevant frequent subgraphs, so as to allow molecular classification to be performed effectively, is therefore to take such information into consideration and measure the

TABLE V. MAXIMAL FREQUENT SUBGRAPHS (SUPPORT THRESHOLD = 40%)


interestingness of each frequent subgraph relative to the others.

IV. MISMOC: A GRAPH-MINING TECHNIQUE FOR MOLECULAR CLASSIFICATION

The molecular classification problem that this paper addresses can be stated more formally as follows. Given a set of molecular structure data G containing n molecules preclassified into p classes, the molecular classification problem is concerned with the discovery of interesting patterns in the data to allow unseen graphs not originally in G to be correctly classified into one of the p classes.

The n molecules in G can be represented as n molecular graphs G_1, G_2, ..., G_n, where G_i = G_i(V_i, E_i), i ∈ {1, ..., n}, is a labeled graph with vertices representing atoms and edges representing bonds between atoms.

For applications in bioinformatics, the molecular graphs can be generalized so that the vertices can represent molecules, such as amino acids, and the edges can represent the chemical bonds that connect them. The p classes into which the n molecules and their corresponding molecular graphs are classified can be represented as Ω^(1), ..., Ω^(p), where Ω^(i) = {G_1^(i), ..., G_{c_i}^(i)} ⊆ G, i = 1, ..., p.
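The labeled-graph representation G_i = (V_i, E_i) described above can be sketched as a small data structure. The class and field names below are illustrative assumptions, not from the paper:

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

@dataclass
class MolecularGraph:
    """A labeled graph G_i = (V_i, E_i): vertices carry atom labels,
    edges carry bond types."""
    atoms: Dict[int, str]  # vertex id -> atom label
    bonds: List[Tuple[int, int, str]] = field(default_factory=list)  # (u, v, bond type)

    def add_bond(self, u: int, v: int, bond: str) -> None:
        self.bonds.append((u, v, bond))

# A dataset preclassified into p = 3 classes: class index -> list of graphs.
dataset: Dict[int, List[MolecularGraph]] = {1: [], 2: [], 3: []}

g = MolecularGraph(atoms={0: "N", 1: "O"})
g.add_bond(0, 1, "double")  # the N=O substructure from Section III
dataset[2].append(g)
print(len(dataset[2]))  # 1
```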

In the following, we present the details of the MISMOC technique, which can be used to effectively improve the accuracy of graph classification. MISMOC performs its tasks in several stages. It first searches for frequent subgraphs using an existing algorithm, such as FSG or gSpan. Since a subgraph that appears frequently in one class may also do so in another, not all frequent subgraphs are useful and interesting for classification. To screen out the uninteresting ones, MISMOC makes use of a test statistic to distinguish interesting subgraphs from uninteresting ones.

Once the interesting frequent subgraphs are identified, the interestingness of each of these frequent subgraphs is measured based on an information-theoretic measure called the weight of evidence. These measures can be combined to form an overall total interestingness measure for the purpose of classifying an unseen graph.

    A. Discovering Frequent Subgraphs

To discover frequent subgraphs in a graph database, there are several graph-mining algorithms to choose from. For MISMOC, users can choose between two commonly used graph-mining algorithms: FSG [18] and gSpan [19]. Given the dataset G = {G_1, ..., G_i, ..., G_n} as described earlier, one can use either of these algorithms to discover a set of frequent subgraphs Θ^(1), ..., Θ^(i), ..., Θ^(p), where Θ^(i) = {S_1^(i), ..., S_{n_i}^(i)}, i = 1, ..., p, for each of the corresponding p classes Ω^(1), ..., Ω^(i), ..., Ω^(p).

1) FSG Algorithm: The FSG algorithm can find all frequent

subgraphs in each class of molecular graphs using the Apriori algorithm [23]. It does so by treating edges in the graphs as items in transactions, so that the Apriori algorithm can be used to discover frequent subgraphs just as it is used to discover frequent itemsets; i.e., in the same way that the Apriori algorithm increases the size of frequent itemsets by adding a single item at a time, the FSG algorithm increases the size of frequent subgraphs by adding one edge at a time.

Briefly, FSG can be described as follows. For each Ω^(i), i = 1, ..., p, FSG first finds a set of frequent one-edge subgraphs and a set of frequent two-edge subgraphs. Then, based on these two sets of intermediate subgraphs, it starts to iteratively generate candidate subgraphs whose size is greater than the previous frequent subgraphs by one edge. FSG then counts the frequency of each of these candidates and prunes subgraphs that do not satisfy the support threshold. The qualified subgraphs are further expanded, and their frequencies are verified against the same support condition to prune the lattice of frequent subgraphs. The final set of frequent subgraphs Θ^(1), ..., Θ^(i), ..., Θ^(p), where Θ^(i) contains all frequent k-subgraphs, is generated for each class. Let g_k be a subgraph with k edges, C_k be a set of candidate subgraphs with k edges, and F_k^(i) be a set of frequent k-subgraphs for class Ω^(i); the FSG algorithm is summarized in Fig. 4 [18].

Fig. 4. Algorithm of FSG.
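The level-wise loop above can be sketched as follows. This is a simplification under stated assumptions: patterns are treated as plain edge sets, so the candidate join, connectivity checks, and canonical labeling that real FSG performs are all skipped; only the Apriori skeleton (grow by one edge, prune by support) is shown.

```python
from typing import FrozenSet, List, Set, Tuple

Edge = Tuple[str, str, str]   # (atom label, bond type, atom label)
Graph = FrozenSet[Edge]       # a molecule as a set of labeled edges

def support_count(pattern: Graph, graphs: List[Graph]) -> int:
    return sum(1 for g in graphs if pattern <= g)

def fsg_like(graphs: List[Graph], min_support: int) -> List[Graph]:
    """Apriori-style level-wise growth: frequent k-edge patterns are extended
    by one edge to form (k+1)-edge candidates, and infrequent candidates are
    pruned at every level before the next expansion."""
    all_edges = {e for g in graphs for e in g}
    level: Set[Graph] = {frozenset({e}) for e in all_edges
                         if support_count(frozenset({e}), graphs) >= min_support}
    frequent: List[Graph] = []
    while level:
        frequent.extend(level)
        candidates = {p | frozenset({e}) for p in level for e in all_edges if e not in p}
        level = {c for c in candidates if support_count(c, graphs) >= min_support}
    return frequent

mols = [
    frozenset({("C", "single", "C"), ("N", "double", "O")}),
    frozenset({("C", "single", "C"), ("N", "double", "O")}),
    frozenset({("C", "single", "C"), ("C", "triple", "C")}),
]
patterns = fsg_like(mols, min_support=2)
print(len(patterns))  # 3: {C-C}, {N=O}, and {C-C, N=O}
```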

2) gSpan Algorithm: The gSpan algorithm [19] discovers a set of frequent subgraphs for each graph class by mapping each graph in the class to a unique minimum DFS code as its canonical label. First, gSpan sorts all vertices and edges in the set of graph transactions in each class according to their frequency of occurrence and removes the infrequent vertices and edges from Ω^(i). The remaining vertices and edges are relabeled and sorted in descending order of frequency. F_1^(i) is then formed from all frequent one-edge subgraphs, and it acts as the seed for generating more children. A subprocedure called SubgraphMiner expands each one-edge frequent subgraph in F_1^(i) by adding one edge at a time. In SubgraphMiner, if s is the minimum DFS code of the graph it represents, s is added to the frequent subgraph set Θ^(i); all potential children are then generated with a one-edge growth, and SubgraphMiner is run recursively for each child. After this, the edge is removed from each graph in Ω^(i) once all the descendants of this one-edge graph have been searched. When all frequent k-subgraphs and their descendants have been generated, the final set of frequent subgraphs Θ^(i), i = 1, ..., p, is obtained for each class. The gSpan algorithm is summarized in Fig. 5 [19].

Fig. 5. Algorithm of gSpan.
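For contrast with FSG's level-wise loop, the depth-first pattern growth can be sketched as below. This is only gSpan-flavored: a `seen` set of edge-set patterns stands in for the minimum-DFS-code check, and DFS-code generation, isomorphism testing, and the relabeling step are all omitted.

```python
from typing import FrozenSet, List, Set, Tuple

Edge = Tuple[str, str, str]
Graph = FrozenSet[Edge]

def support_count(pattern: Graph, graphs: List[Graph]) -> int:
    return sum(1 for g in graphs if pattern <= g)

def grow(pattern: Graph, graphs: List[Graph], min_support: int,
         seen: Set[Graph], out: List[Graph]) -> None:
    """Depth-first pattern growth: extend the current pattern by one edge and
    recurse, instead of generating whole candidate levels. The `seen` set
    prunes duplicate growth paths, which gSpan does far more cheaply with
    minimum DFS codes."""
    if pattern in seen:
        return
    seen.add(pattern)
    out.append(pattern)
    extensions = {e for g in graphs if pattern <= g for e in g} - pattern
    for e in extensions:
        child = pattern | frozenset({e})
        if support_count(child, graphs) >= min_support:
            grow(child, graphs, min_support, seen, out)

def gspan_like(graphs: List[Graph], min_support: int) -> List[Graph]:
    seen: Set[Graph] = set()
    out: List[Graph] = []
    for e in {e for g in graphs for e in g}:   # frequent one-edge seeds
        seed = frozenset({e})
        if support_count(seed, graphs) >= min_support:
            grow(seed, graphs, min_support, seen, out)
    return out

mols = [
    frozenset({("C", "single", "C"), ("N", "double", "O")}),
    frozenset({("C", "single", "C"), ("N", "double", "O")}),
    frozenset({("C", "single", "C"), ("C", "triple", "C")}),
]
patterns = gspan_like(mols, min_support=2)
print(len(patterns))  # 3
```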

B. Discovering Interesting Frequent Subgraphs Using MISMOC

FSG and gSpan aim at discovering the frequent subgraphs Θ^(i) = {S_1^(i), ..., S_j^(i), ..., S_{n_i}^(i)}, i = 1, ..., p, in each of the corresponding graph classes Ω^(1), ..., Ω^(i), ..., Ω^(p). These algorithms were not originally developed for graph classification. Hence, while the discovered frequent subgraphs can characterize each graph class, they may not be very useful in discriminating one class from another. This is because a frequent subgraph that appears frequently in one class may also do so in another, and such frequent subgraphs are not interesting for classification. In this section, we present the methodology that MISMOC uses to identify subgraphs that are interesting and useful for classification. This methodology is based on the use of a test statistic [24]-[26], and its details are given in Fig. 6.

Fig. 6. Algorithm of MISMOC.

Once the sets of frequent subgraphs Θ^(i), i = 1, ..., p, are discovered for each of Ω^(i), i = 1, ..., p, respectively, the probability that a graph G is in Ω^(i), i ∈ {1, ..., p}, given that G is characterized by a frequent subgraph S_j^(i) ∈ Θ^(i), j ∈ {1, ..., n_i}, can be determined as follows:

    Pr(G ∈ Ω^(i) | G is characterized by S_j^(i))
        = (total no. of graphs in Ω^(i) that are characterized by S_j^(i))
          / (total no. of graphs in G that are characterized by S_j^(i)).    (1)
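Under the same illustrative edge-set representation used earlier (a stand-in for true subgraph isomorphism), Eq. (1) reduces to a ratio of two counts; a minimal sketch:

```python
from typing import FrozenSet, List, Tuple

Edge = Tuple[str, str, str]
Graph = FrozenSet[Edge]

def characterized_by(graph: Graph, subgraph: Graph) -> bool:
    # Containment as a stand-in for the paper's subgraph-isomorphism test.
    return subgraph <= graph

def pr_class_given_subgraph(subgraph: Graph, class_graphs: List[Graph],
                            all_graphs: List[Graph]) -> float:
    """Eq. (1): the number of graphs in class i characterized by the subgraph,
    over the number of graphs in the whole dataset characterized by it."""
    in_class = sum(1 for g in class_graphs if characterized_by(g, subgraph))
    overall = sum(1 for g in all_graphs if characterized_by(g, subgraph))
    return in_class / overall if overall else 0.0

# Toy counts echoing Section III: the subgraph appears in 5 graphs of class 2
# and in 1 graph of each of classes 1 and 3 (10 graphs per class).
s = frozenset({("N", "single", "C")})
other = frozenset({("C", "single", "C")})
class1 = [s] * 1 + [other] * 9
class2 = [s] * 5 + [other] * 5
class3 = [s] * 1 + [other] * 9
everything = class1 + class2 + class3
print(round(pr_class_given_subgraph(s, class2, everything), 3))  # 0.714 (= 5/7)
```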

If Pr(G ∈ Ω^(i) | G is characterized by S_j^(i)) is not much different from Pr(G ∈ Ω^(i)), i.e., whether or not G is characterized by S_j^(i) makes very little difference, then S_j^(i) should not be considered very interesting in determining whether G should be classified into Ω^(i). Otherwise, S_j^(i) can be very interesting.

To objectively determine whether the two probabilities are different, we make use of a test statistic [24]-[26], d_ji, defined as follows:

    d_ji = z_ji / sqrt(γ_ji)    (2)

where

    z_ji = [n Pr(G ∈ Ω^(i), G is characterized by S_j^(i))
            − n Pr(G ∈ Ω^(i)) Pr(G is characterized by S_j^(i))]
           / sqrt(n Pr(G ∈ Ω^(i)) Pr(G is characterized by S_j^(i)))    (3)

and γ_ji is the maximum likelihood estimate of the variance of


z_ji and is given by

    γ_ji = (1 − Pr(G ∈ Ω^(i))) (1 − Pr(G is characterized by S_j^(i))).    (4)

    Based on [24], if|dj i | > 1.96, we can conclude that the differ-ence between Pr(G

    (i)

    |G is characterized by S

    (i)j ) is signifi-

    cantly different from Pr(G (i) ), and therefore, the subgraphS

    (i)j is interesting and useful for classification. Ifdj i > +1.96, it

    implies that the presence ofS(i)

    j in a graph G provides evidence

    supporting G to be classifiedinto (i) , otherwiseifdj i < 1.96,it implies that thepresenceof thefrequent subgraph S

    (i)j provides

    negative evidence against G to be classified into (i) . In either

    case, S(i)

    j can be considered as interesting frequent subgraph.

    With the use of this test statistic, MISMOC screens each set of frequent subgraphs $\Phi^{(i)} = \{S^{(i)}_1, \ldots, S^{(i)}_j, \ldots, S^{(i)}_{n_i}\}$, $i = 1, \ldots, p$, to retain only those that are interesting. The set of interesting frequent subgraphs discovered for each of $\Pi^{(1)}, \ldots, \Pi^{(i)}, \ldots, \Pi^{(p)}$ is denoted by $\Gamma^{(i)} = \{S^{(i)}_1, \ldots, S^{(i)}_j, \ldots, S^{(i)}_{n'_i}\}$, $i = 1, \ldots, p$, with $n'_i < n_i$, respectively.

    C. Interestingness Measure as a Function of the Weight of Evidence

    The interesting frequent subgraphs provide positive or nega-

    tive evidence supporting or refuting the classification of a graph

    into a particular class. MISMOC measures how interesting these

    frequent subgraphs are with the use of an interestingness mea-

    sure defined in terms of an information-theoretic weight of evi-

    dence measure.

    The more interesting a frequent subgraph is for a class, the greater the difference between the two probabilities $\Pr(G \in \Pi^{(i)} \mid G \text{ is characterized by } S^{(i)}_j)$ and $\Pr(G \in \Pi^{(i)})$. Hence, the interestingness measure is again defined as a function of these two probabilities. Specifically, the more interesting $S^{(i)}_j$ is, the greater is the ratio between $\Pr(G \in \Pi^{(i)} \mid G \text{ is characterized by } S^{(i)}_j)$ and $\Pr(G \in \Pi^{(i)})$. This ratio can be measured with a mutual information measure $I(G \in \Pi^{(i)} : G \text{ is characterized by } S^{(i)}_j)$, between $G \in \Pi^{(i)}$ and $G$ is characterized by $S^{(i)}_j$, as follows:

    $$I(G \in \Pi^{(i)} : G \text{ is characterized by } S^{(i)}_j) = \log \frac{\Pr(G \in \Pi^{(i)} \mid G \text{ is characterized by } S^{(i)}_j)}{\Pr(G \in \Pi^{(i)})}. \qquad (5)$$

    Based on the mutual information measure, the weight of evidence provided by $S^{(i)}_j$ for or against the classification of $G$ into $\Pi^{(i)}$ can be defined as follows:

    $$W^{(i)}(G \mid S^{(i)}_j) = W(G \in \Pi^{(i)} / G \notin \Pi^{(i)} \mid G \text{ is characterized by } S^{(i)}_j) = I(G \in \Pi^{(i)} : G \text{ is characterized by } S^{(i)}_j) - I(G \notin \Pi^{(i)} : G \text{ is characterized by } S^{(i)}_j). \qquad (6)$$

    $W^{(i)}(G \mid S^{(i)}_j)$ can be interpreted as a measure of the difference in the gain in information when a graph $G$ that contains $S^{(i)}_j$ is classified into $\Pi^{(i)}$, as opposed to other classes. $W^{(i)}(G \mid S^{(i)}_j)$ is positive if $S^{(i)}_j$ provides positive evidence supporting the classification of $G$ into $\Pi^{(i)}$; otherwise, it is negative.

    D. Classification Using a Total Interestingness Measure

    Given the interesting frequent subgraphs $\Gamma^{(1)}, \ldots, \Gamma^{(i)}, \ldots, \Gamma^{(p)}$ discovered for the corresponding $p$ classes $\Pi^{(1)}, \ldots, \Pi^{(i)}, \ldots, \Pi^{(p)}$, an unseen graph $G$ not originally in the dataset can be classified by matching it against the subgraphs in each of $\Gamma^{(i)}$, $i = 1, \ldots, p$.

    For every subgraph $S^{(i)}_j \in \Gamma^{(i)}$ that $G$ matches, there is some evidence $W^{(i)}(G \mid S^{(i)}_j)$ provided by it for or against the classification of $G$ into $\Pi^{(i)}$. Assuming that $G$ matches $m_i \le n'_i$ interesting frequent subgraphs in $\Gamma^{(i)}$, $s^{(i)}_1, \ldots, s^{(i)}_j, \ldots, s^{(i)}_{m_i} \in \Gamma^{(i)}$, MISMOC then computes a total interestingness measure for $G$ to be classified into $\Pi^{(i)}$. This total interestingness measure is defined as the summation of the weights of evidence provided by each individual interesting frequent subgraph $s^{(i)}_j$ for or against $G$ being classified into $\Pi^{(i)}$, as follows:

    $$W^{(i)}(G) = W(G \in \Pi^{(i)} / G \notin \Pi^{(i)} \mid G \text{ is characterized by } s^{(i)}_1, \ldots, s^{(i)}_j, \ldots, s^{(i)}_{m_i}) = \sum_{j=1}^{m_i} W(G \in \Pi^{(i)} / G \notin \Pi^{(i)} \mid G \text{ is characterized by } s^{(i)}_j). \qquad (7)$$

    The value of $W^{(i)}(G)$ increases with the number and strength of the matched subgraphs in $s^{(i)}_1, \ldots, s^{(i)}_j, \ldots, s^{(i)}_{m_i}$ that provide positive evidence supporting $G$ being classified into $\Pi^{(i)}$, whereas the value of $W^{(i)}(G)$ decreases if some matched subgraphs provide negative evidence refuting the classification of $G$ into $\Pi^{(i)}$. The total interestingness measure for $G$ to be classified into each of $\Pi^{(1)}, \ldots, \Pi^{(i)}, \ldots, \Pi^{(p)}$ is determined, and MISMOC assigns $G$ to the class that gives the greatest total interestingness measure.
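    As a sketch of (5)-(7) (again our own illustrative code, not the paper's implementation; estimating the probabilities from training counts, and all identifiers, are our assumptions), the weight of evidence of a matched subgraph and the total-interestingness classification rule can be written as:

```python
import math

def weight_of_evidence(n_joint, n_class, n_subgraph, n_total):
    """Weight of evidence a matched subgraph provides for/against class i,
    per (5)-(6). Uses the same counts as the screening step.
    Note: a subgraph occurring only inside (or only outside) the class
    would need smoothing to avoid log(0); omitted here for brevity."""
    p_class = n_class / n_total                  # Pr(G in class i)
    p_class_given_s = n_joint / n_subgraph       # Pr(G in class i | G has S_j)
    i_in = math.log(p_class_given_s / p_class)   # I(G in class : S_j), (5)
    i_out = math.log((1 - p_class_given_s) / (1 - p_class))
    return i_in - i_out                          # (6)

def classify(matched_counts_per_class, n_total):
    """Total interestingness (7): sum the weights of evidence of the matched
    interesting subgraphs for each class and pick the class with the largest
    total. matched_counts_per_class maps a class id to a list of count tuples."""
    totals = {c: sum(weight_of_evidence(*m, n_total) for m in matches)
              for c, matches in matched_counts_per_class.items()}
    return max(totals, key=totals.get), totals
```

    A matched subgraph over-represented in a class contributes a positive term to that class's total; one under-represented in the class contributes a negative term, so the winning class is the one with the strongest net evidence.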

    Compared to algorithms that classify graphs by considering only frequent subgraphs, MISMOC has the advantage that it discovers frequent subgraphs that are considered interesting according to an objective statistical evidence measure. Instead of relying solely on the appearance of frequent subgraphs during classification, MISMOC takes into consideration only those that are useful and interesting. These frequent subgraphs are unique and can have biological meaning. Other frequent graph-mining algorithms, such as FSG and gSpan, can only handle a single class of data; if there are two or more classes, the comparative effect of a subgraph across all classes is ignored. There is always a chance that two or more classes share the same frequent subgraph. With the interestingness measure, we can distinguish interesting frequent subgraphs from uninteresting ones across multiple classes.


    V. ILLUSTRATIVE EXAMPLE CONTINUED

    To illustrate how MISMOC works, let us consider the example in Section III again. Given the frequent subgraphs discovered using FSG at a support threshold of 50%, MISMOC obtains the frequency of occurrence of each of the 15 frequent subgraphs in each class (see Table V). It then screens for all frequent subgraphs that are interesting using the test statistic given by (2). The values of the test statistic for each frequent subgraph in each class are also given in Table VI.

    As described in the last section, subgraphs with $|d_{ji}| < 1.96$ will be filtered out, and the remaining subgraphs will form a set of interesting subgraphs for graph classification. Since $d_{41}$, $d_{51}$, $d_{62}$, $d_{72}$, $d_{83}$, and $d_{93}$ are greater than 1.96, we conclude that, of all 15 frequent subgraphs discovered, only $S^{(1)}_4$ and $S^{(1)}_5$, $S^{(2)}_6$ and $S^{(2)}_7$, and $S^{(3)}_8$ and $S^{(3)}_9$ are interesting frequent subgraphs for classes 1, 2, and 3, respectively.

    Given these interesting frequent subgraphs, we can classify the test graph shown in Fig. 2 by computing the total interestingness measure for it to be classified into each class. Using (1) to (7), we obtain $W^{(1)}(G) = -1.5018$, $W^{(2)}(G) = 2.2288$, and $W^{(3)}(G) = -1.5732$. As the value of $W^{(2)}(G)$ for class 2 is the largest of all, we can conclude that the new sample belongs to class 2. Besides, there is negative evidence against the test graph being classified into classes 1 and 3; therefore, the new sample is not likely to belong to class 1 or 3.

    VI. EXPERIMENTS AND RESULTS

    To evaluate the effectiveness of MISMOC, we tested it using both artificial and real data. We compared its performance with that of two graph classification algorithms based on FSG and gSpan. For experimentation, we used the executable files of these algorithms available from [27] and [28], respectively. The classification results were obtained using tenfold cross-validation with an implementation of the support vector machine (SVM) available at [29].

    A. Performance Evaluation

    The performance of a classifier is usually evaluated by the use of the average classification accuracy, and the results are typically presented in a confusion matrix (see Table VII), which has four entries: the number of true positive cases (TP), true negative cases (TN), false positive cases (FP), and false negative cases (FN). The average accuracy is calculated as follows [30]:

    $$\text{Average accuracy} = \frac{TP + TN}{TP + FN + FP + TN}. \qquad (8)$$

    While evaluation based on the use of the classification accuracy measure may be popular, it may not always be very appropriate for classification problems involving imbalanced class distributions.

    TABLE VI: INTERESTINGNESS MEASURE OF FREQUENT SUBGRAPHS (SUPPORT THRESHOLD = 50%)

    TABLE VII: CONFUSION MATRIX

    When TN is much greater than TP, (FP + TN) is also much greater than (TP + FN). In such a case, the successfully predicted cases in the minority positive class will play a

    role that can be too insignificant when the average accuracy rate

    is determined and the minority cases will be treated as noise

    even if they are supposed to be important. In order to overcome

    this problem, the true positive and false positive rates need tobe monitored separately, using (9) and (10) when test data are

    being classified

    $$\text{True positive rate} = \frac{TP}{TP + FN} \qquad (9)$$

    $$\text{False positive rate} = \frac{FP}{FP + TN}. \qquad (10)$$

    These rates measure the performance of a classifier for each

    class and the objective is to keep the true positive rate as high

    as possible and the false positive rate as low as possible. Some-

    times, the true positive rate is called recall or sensitivity, and the

    false positive rate is called the false alarm rate. In order to transform this multiobjective problem into a single-objective equivalent,

    the receiver operating characteristic (ROC) analysis [31] has

    been proposed and is becoming more and more popular when

    the training data size for different classes of data are very dif-

    ferent. With the ROC analysis, the true positive rate is plotted

    along the y-axis against the false positive rate along the x-axis to form a ROC curve, and the objective is to maximize the value

    of AUC, which stands for the area under the ROC curve. The

    value of AUC is always between 0.0 and 1.0. An area of 1 repre-

    sents a perfect classification, whereas an area of 0.5 represents a

    worthless classification that is equivalent to a random guess in a

    two-class classification problem. The AUC reflects very well the probability that a classifier ranks a randomly chosen positive

    instance higher than a randomly chosen negative instance. In

    this paper, as the datasets that we use differ significantly in class

    sizes, we use the AUC to evaluate the performance of different

    classifiers on different datasets.
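    As a brief sketch (our own illustrative code; the function names are assumptions), the rates in (9) and (10) and the AUC, computed directly through its rank-statistic interpretation, can be obtained as follows:

```python
def rates(tp, fn, fp, tn):
    """True positive rate (9) and false positive rate (10)
    from the four confusion-matrix entries."""
    return tp / (tp + fn), fp / (fp + tn)

def auc(pos_scores, neg_scores):
    """AUC as the probability that a randomly chosen positive instance
    is ranked above a randomly chosen negative one (ties count 1/2)."""
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos_scores for n in neg_scores)
    return wins / (len(pos_scores) * len(neg_scores))
```

    A classifier that scores every positive above every negative attains an AUC of 1.0, while a classifier whose scores are uninformative hovers around 0.5, matching the interpretation above.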

    B. Datasets

    The first dataset is a set of binary-class artificial data that

    are generated with GraphGen [32]. The artificial datasets are

    generated with a set of parameters: 1) the total number of trans-

    actions (-ngraphs); 2) the average size of each graph (-size);

    3) the number of unique node labels (-nnodel); 4) the number

    of unique edge labels (-nedgel); 5) the average density of each

    graph (-density); 6) the number of unique edges in the whole

    dataset (-nedges); and 7) the average edge ratio of each graph

    (-edger). Parameters 1, 4, 5, 6, and 7 are fixed at 5000, 10, 0.3, 100, and 0.2, respectively, and we vary the remaining parameters to generate four datasets, as given in Table VIII, with the properties given next.

    The second dataset is collected from predictive toxicology

    challenge (PTC) [33] that contains the carcinogenicity of 417

    chemical compounds on four types of rodents: male rats (MR),

    female rats (FR), male mice (MM), and female mice (FM). Each

    of these datasets can be considered as consisting of two classes

    TABLE VIII: ARTIFICIAL DATASETS WITH DIFFERENT PARAMETERS

    TABLE IX: PROPERTIES OF THE EXPERIMENTAL DATASETS

    TABLE X: CLASSIFICATION PERFORMANCE FOR FSG AND MISMOC

    of data [39]: those with positive evidence of cancerous growth

    and those with negative evidence.

    The third dataset is collected from the Estrogen Receptor

    Binding (NCTR ER) Database in the Distributed Structure-

    Searchable Toxicity (DSSTox) Public Database Network of the

    National Center for Toxicological Research [34]. The database

    covers most known estrogenic classes and it is a structurally di-

    verse set of estrogens. The NCTR ER database consists of 224

    chemical compounds, each classified as "active" or "inactive" with respect to the attribute ActivityOutcome_NCTRER. A compound is "active" if the measured activity of the compound is strong, medium, or weak. It is "inactive" if there is no activity for that compound. The properties of the datasets we

    used in our experiments are listed in Table IX.

    C. Performance Analysis

    For performance comparison, we first tested all datasets using the two algorithms FSG and gSpan, and then compared their performance when MISMOC is used. Tables X and XI show the performance of the different algorithms on the different datasets. For easier comparison, we use a single misclassification cost value of 3.0, as suggested in [38], for the SVM classifier.


    TABLE XI: CLASSIFICATION PERFORMANCE FOR gSPAN AND MISMOC

    For our experiments, as a high threshold may result in too few and a low threshold may result in too many of the frequently occurring subgraphs being discovered, and as the support threshold affects the runtime and memory consumption [43], [44], we tried different support thresholds ranging from 90% to 2% and decided to settle at 3% for the artificial datasets, 5% for the PTC dataset, and 10% for the NCTR ER dataset for both the experiments with FSG and gSpan. These settings allow us to obtain a good number of subgraphs (i.e., $50 \le n \le 500$) for the identification of the interesting ones.

    Given these settings of the support thresholds, the average AUC for each algorithm is determined and shown in the tables. From these results, we can see that the classification performance (average AUC) of FSG and gSpan is similar: the average AUCs for them are 0.683 and 0.691, respectively. After applying MISMOC to these frequent-subgraph discovery algorithms, their average AUCs improved by 14.44% and 14.05%, respectively.

    These results show that the performance of FSG and gSpan

    can be improved with the two-phase approach that MISMOC

    adopts. The subgraphs discovered by many graph-mining al-

    gorithms may appear frequently in a class, but they may not

    uniquely represent a class. Subgraphs that may not appear very

    frequently can play an important role in discriminating one class

    from another. With MISMOC, the relative frequency of each

    subgraph is considered and how useful they are for classifica-

    tion are determined with a measure. The measure is then used

    when a graph is classified. This makes MISMOC more effective

    as a graph-classification algorithm.

    The datasets D1 to D4 are the artificial datasets with varied sizes of graph samples and numbers of unique node labels. When the number of unique node labels is increased from 5 to 10, we can see that the classification performance is higher for D2, with more unique node labels, than for D1, with fewer unique node labels; the case is the same for D3 and D4. The reason is that there will be fewer combinations of discovered frequent subgraphs if the number of unique node labels is small. For example, if there are only two node labels $v_1$ and $v_2$ in the dataset, we have only three combinations ($v_1 v_1$, $v_1 v_2$, and $v_2 v_2$) for a graph with two vertices and one edge; if there are five node labels $v_1, \ldots, v_5$ in the dataset, we can have 15 combinations. In the case with fewer unique node labels, many frequent subgraphs will be the same for both the positive and negative classes. These frequent subgraphs are uninteresting and not useful in discriminating the graph samples into different classes. With MISMOC, we can filter out these uninteresting frequent subgraphs to increase the classification performance. Hence, we can observe from the results that the average AUC of D1 is lower than that of D2, and that the AUC increases more significantly in D1 than in D2 after applying MISMOC. When the size of the graph samples is increased from 10 to 30, we can see that the classification performance is lower for D4, with the larger graph size, than for D2, with the smaller graph size; the case is the same for D1 and D3. The reason for this is that a large graph will contain more noise than a small graph, as the interesting subgraph(s) usually contribute only a small part of a graph. From the results, we can see that the average AUC of D4 is lower than that of D2, and MISMOC helps to remove these noisy frequent subgraphs and increases the AUC more significantly in D4 than in D2, as the graph size in D4 is larger than that in D2.
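    The label-combination counts quoted above follow from elementary counting: a one-edge subgraph is an unordered pair of node labels with repetition allowed (assuming a single edge label), so with $v$ distinct labels the number of possibilities is

```latex
\binom{v}{2} + v \;=\; \frac{v(v+1)}{2},
\qquad
\frac{2 \cdot 3}{2} = 3,
\qquad
\frac{5 \cdot 6}{2} = 15 .
```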

    The PTC dataset contains four datasets: MR, FR, MM, and FM. The average AUC of FM is the highest, and that of FR is the lowest. This may be due to the percentage of positive samples in FM (38.1%) being higher than that in FR (31.1%).

    The overall AUC for the PTC dataset is 0.58 when applying FSG

    and gSpan, and this value has increased to 0.63 with MISMOC.

    The overall AUC is still relatively low even when MISMOC

    is used, and this may be due to some structural features in the test set not being present in the training set. This is the main

    reason that the classification performance is quite low. This

    phenomenon is also mentioned in the evaluation report of [33].

    The NCTR ER dataset has the highest AUC among all the experimental datasets. The average AUC for FSG and gSpan is 0.844, and this is increased to 0.939 with MISMOC. This means that the ER compounds contain distinguishing structures

    for active and inactive classes. The discovered interesting fre-

    quent subgraphs can be used to characterize a class of estrogen

    as well as to discriminate it from other classes. From the per-

    centage of improvement in AUC, we can observe that the noisy

    and uninteresting frequent subgraphs are effectively ignored by

    MISMOC, and the AUC is increased when it is used with FSG

    and gSpan.

    VII. CONCLUSION

    In this paper, we introduced a new graph-mining technique

    called MISMOC to discover interesting frequent subgraphs from

    graph databases. It is evaluated with both artificial and real

    datasets, and the experimental results show that MISMOC can

    work very well with large and complex datasets and can improve

    the classification performance of the existing graph-mining

    algorithms.

    The frequent subgraphs of real biological datasets usually

    contain many common vertices [e.g., carbon (C) and oxygen

    (O)] and edges (e.g., single hydrogen bond). For this reason,

    both positive and negative samples may contain the same set

    of frequent subgraphs. The frequent subgraphs discovered by

    existing graph-mining algorithms may, therefore, not be very

    useful for molecular classification. MISMOC is able to achieve


    a higher accuracy as it aims to discover interesting subgraphs

    that do not just occur more frequently but can also allow graph

    classes to be better discriminated from one another. MISMOC

    can better handle the problem of having too many frequent subgraphs when support thresholds are lowered. As with other graph-mining algorithms, the size and number of graphs that MISMOC can handle can be very large; they are limited mainly by the computing hardware.

    The next version of MISMOC will include an algorithm that

    can discover interesting subgraphs that may not occur frequently

    enough. However, it will not be relying on a frequent-subgraph-

    mining algorithm in the first phase. In order to facilitate un-

    derstanding, it will also try to better identify graphs that are

    maximal and less fragmented. In addition, it will represent

    graphs in a more flexible structure so that graphs that are similar can be represented by the same subgraph. The next release of

    MISMOC is expected to take into consideration topological in-

    dexes of the discovered structure so as to allow graph classes to

    be distinguished more easily from each other.

    REFERENCES

    [1] D. Conklin, S. Fortier, and J. Glasgow, "Knowledge discovery in molecular databases," IEEE Trans. Knowl. Data Eng., vol. 5, no. 6, pp. 985–987, Dec. 1993.

    [2] T. Barrett, T. O. Suzek, D. B. Troup, S. E. Wilhite, W. C. Ngau, P. Ledoux, D. Rudnev, A. E. Lash, W. Fujibuchi, and R. Edgar, "NCBI GEO: Mining millions of expression profiles - database and tools," Nucleic Acids Res., vol. 33, pp. D562–D566, 2005.

    [3] A. Zanzoni, L. Montecchi-Palazzi, M. Quondam, G. Ausiello, M. Helmer-Citterich, and G. Cesareni, "MINT: A Molecular INTeraction database," FEBS Lett., vol. 513, no. 1, pp. 135–140, 2002.

    [4] K. Arnold, L. Bordoli, J. Kopp, and T. Schwede, "The SWISS-MODEL Workspace: A web-based environment for protein structure homology modeling," Bioinformatics, vol. 22, pp. 195–201, 2006.

    [5] L. Holm, C. Ouzounis, C. Sander, G. Tuparev, and G. Vriend, "A database of protein structure families with common folding motifs," Protein Sci., vol. 1, pp. 1691–1698, 1992.

    [6] M. Ebeling and S. Suhai, "Molecular databases on the internet," J. Mol. Med., vol. 75, pp. 620–623, 1997.

    [7] A. Sperduti and A. Starita, "Supervised neural networks for the classification of structures," IEEE Trans. Neural Netw., vol. 8, no. 3, pp. 714–735, May 1997.

    [8] C. A. Lipinski, F. Lombardo, B. W. Dominy, and P. J. Feeney, "Experimental and computational approaches to estimate solubility and permeability in drug discovery and development settings," Adv. Drug Del. Rev., vol. 46, pp. 3–26, 2001.

    [9] L. A. Mirny and E. I. Shakhnovich, "Universally conserved positions in protein folds: Reading evolutionary signals about stability, folding kinetics and function," J. Mol. Biol., vol. 291, no. 1, pp. 177–196, 1999.

    [10] A. Kallioniemi, O. P. Kallioniemi, D. Sudar, D. Rutovitz, J. W. Gray, F. Waldman, and D. Pinkel, "Comparative genomic hybridization for molecular cytogenetic analysis of solid tumors," Science, vol. 258, no. 5083, pp. 818–821, 1992.

    [11] M. G. Dunlop, S. M. Farrington, A. D. Carothers, A. H. Wyllie, L. Sharp, J. Burn, B. Liu, K. W. Kinzler, and B. Vogelstein, "Cancer risk associated with germline DNA mismatch repair gene mutations," Hum. Mol. Genet., vol. 6, pp. 105–110, 1997.

    [12] L. Nakhleh, T. Warnow, C. R. Linder, and K. St. John, "Reconstructing reticulate evolution in species - theory and practice," J. Comput. Biol., vol. 12, no. 6, pp. 796–811, 2005.

    [13] J. A. Bondy, Graph Theory With Applications. New York: Elsevier, 1976.

    [14] Y. Yoshida, Y. Ohta, K. Kobayashi, and N. Yugami, "Mining interesting patterns using estimated frequencies from subpatterns and superpatterns," Lecture Notes in Computer Science, vol. 2843, pp. 494–501, 2003.

    [15] L. B. Holder, D. J. Cook, and S. Djoko, "Substructure discovery in the SUBDUE system," in Proc. AAAI Workshop Knowl. Discov. Databases, 1994, pp. 169–180.

    [16] C. Borgelt and M. R. Berthold, "Mining molecular fragments: Finding relevant substructures of molecules," in Proc. 2nd IEEE Int. Conf. Data Mining (ICDM), 2002, pp. 51–58.

    [17] R. D. King, A. Srinivasan, and L. Dehaspe, "Warmr: A data mining tool for chemical data," J. Comput.-Aided Mol. Des., vol. 15, no. 2, pp. 173–181, 2001.

    [18] M. Kuramochi and G. Karypis, "Frequent sub-graph discovery," in Proc. 1st IEEE Int. Conf. Data Mining (ICDM), 2001, pp. 313–320.

    [19] X. Yan and J. Han, "gSpan: Graph-based substructure pattern mining," in Proc. IEEE Int. Conf. Data Mining, 2002, pp. 721–724.

    [20] I. Fischer and T. Meinl, "Graph-based molecular data mining - an overview," in Proc. IEEE Int. Conf. Syst., Man Cybern., 2004, vol. 5, pp. 4578–4582.

    [21] M. Deshpande, M. Kuramochi, N. Wale, and G. Karypis, "Frequent substructure-based approaches for classifying chemical compounds," IEEE Trans. Knowl. Data Eng., vol. 17, no. 8, pp. 1036–1050, Aug. 2005.

    [22] S. H. Muggleton, "Inductive logic programming," N. Gen. Comput., vol. 8, no. 4, pp. 295–318, 1991.

    [23] A. Inokuchi, T. Washio, and H. Motoda, "An apriori-based algorithm for mining frequent substructures from graph data," in Proc. 4th Eur. Conf. Principles Pract. Knowl. Discov. Databases (PKDD), 2000, pp. 13–23.

    [24] K. C. C. Chan, A. K. C. Wong, and D. K. Y. Chiu, "Learning sequential patterns for probabilistic inductive prediction," IEEE Trans. Syst., Man Cybern., vol. 24, no. 10, pp. 1532–1547, Oct. 1994.

    [25] K. C. C. Chan and A. K. C. Wong, "APACS: A system for automated pattern analysis and classification," Comput. Intell.: Int. J., vol. 6, pp. 119–131, 1990.

    [26] P. C. H. Ma and K. C. C. Chan, "UPSEC: An algorithm for classifying unaligned protein sequences into functional families," J. Comput. Biol., vol. 15, no. 4, pp. 431–443, 2008.

    [27] FSG, Karypis Lab, version 1.0.1. (2003). [Online]. Available: http://www-users.cs.umn.edu/karypis/pafi

    [28] gSpan, Illimine, version 1.1.1. (2006). [Online]. Available: http://illimine.cs.uiuc.edu/download/index.php

    [29] C. C. Chang and C. J. Lin. (2001). LIBSVM: A library for support vector machines. [Online]. Available: http://www.csie.ntu.edu.tw/cjlin/libsvm

    [30] S. Daskalaki, I. Kopanas, and N. Avouris, "Evaluation of classifiers for an uneven class distribution problem," Appl. Artif. Intell., vol. 20, no. 5, pp. 381–417, 2006.

    [31] T. Fawcett, "An introduction to ROC analysis," Pattern Recogn. Lett., vol. 27, pp. 861–874, 2006.

    [32] J. Cheng, Y. Ke, and W. Ng. (2006). GraphGen: A graph synthetic generator. [Online]. Available: http://www.cse.ust.hk/graphgen/

    [33] A. Srinivasan, R. D. King, S. H. Muggleton, and M. Sternberg, "The predictive toxicology evaluation challenge," presented at the 15th IJCAI, Los Angeles, CA, 1997.

    [34] W. Tong, H. Fang, C. R. Williams, J. M. Burch, and A. M. Richard. (2008). DSSTox FDA National Center for Toxicological Research Estrogen Receptor Binding Database (NCTRER): SDF files and website documentation, NCTRER_v4b_232_15Feb2008. [Online]. Available: www.epa.gov/ncct/dsstox/sdf_nctrer.html

    [35] J. Devillers and A. T. Balaban, Topological Indices and Related Descriptors in QSAR and QSPR. Boca Raton, FL: CRC Press, 1999.

    [36] R. D. King, S. H. Muggleton, A. Srinivasan, and M. J. E. Sternberg, "Structure-activity relationships derived by machine learning: The use of atoms and their bond connectivities to predict mutagenicity by inductive logic programming," Proc. Nat. Acad. Sci., vol. 93, pp. 438–442, 1996.

    [37] A. Srinivasan and R. King, "Feature construction with inductive logic programming: A study of quantitative predictions of biological activity aided by structural attributes," J. Knowl. Discov. Data Mining, vol. 3, pp. 37–57, 1999.

    [38] M. Deshpande and G. Karypis, "Automated approaches for classifying structure," in Proc. 2nd ACM SIGKDD Workshop Data Mining Bioinf., 2002, pp. 11–18.

    [39] S. Menchetti, F. Costa, and P. Frasconi, "Weighted decomposition kernels," in Proc. 22nd Int. Conf. Mach. Learning, Bonn, Germany, 2005, pp. 585–592.

    [40] T. Meinl, C. Borgelt, and M. R. Berthold, "Discriminative closed fragment mining and perfect extensions in MoFa," in Proc. 2nd Starting AI Res. Symp. (STAIRS), Valencia, Spain, 2004, pp. 3–14.

    [41] C. Borgelt, H. Hofer, and M. Berthold, "Finding discriminative molecular fragments," presented at the Workshop Inf. Mining Navigat. Large Heterogen. Spaces Multimedia Inf., German Conf. Artif. Intell., Hamburg, Germany, 2003.

    [42] S. Nijssen and J. N. Kok, "Frequent graph mining and its application to molecular databases," in Proc. IEEE Conf. Syst., Man Cybern. (SMC), W. Thissen, P. Wieringa, M. Pantic, and M. Ludema, Eds., Den Haag, The Netherlands, 2004, pp. 4571–4577.

    [43] M. Worlein, T. Meinl, I. Fischer, and M. Philippsen, "A quantitative comparison of the subgraph miners MoFa, gSpan, FFSM, and Gaston," in Proc. 9th Eur. Conf. Principles Pract. Knowl. Discov. Databases (PKDD), Porto, Portugal (Lecture Notes in Computer Science), A. Jorge, L. Torgo, P. Brazdil, R. Camacho, and J. Gama, Eds. Berlin, Germany: Springer-Verlag, 2005, pp. 392–403.

    [44] S. Nijssen and J. N. Kok, "A quickstart in frequent structure mining can make a difference," in Proc. Int. Conf. Knowl. Discov. Data Mining, 2004, pp. 647–652.

    [45] R. Chittimoori, L. B. Holder, and D. J. Cook, "Applying the subdue substructure discovery system to the chemical toxicity domain," presented at the AAAI Spring Symp. Predictive Toxicol. Chem.: Exp. Impact AI Tools, Menlo Park, CA, 1999.

    Winnie W. M. Lam received the B.Sc. (Hons.) degree in information technology from Hong Kong Polytechnic University, Hung Hom, Hong Kong. She is currently working toward the Ph.D. degree in the Department of Computing, Hong Kong Polytechnic University.

    She has been involved in several large-scale commercial projects, including the ESDlife electronic system of the Government of the Hong Kong Special Administrative Region (HKSAR), the system migration project in the Hong Kong Exchanges and Clearing Limited, the data mining development in the Kowloon-Canton Railway Corporation and Immigration Department, and the consultancy project in SPSS Inc. Her research interests include data mining, bioinformatics, and artificial intelligence.

    Keith C. C. Chan (M'94) received the B.Math. degree in computer science and statistics, and the M.A.Sc. and Ph.D. degrees in systems design engineering from the University of Waterloo, Waterloo, ON, Canada, in 1984, 1985, and 1989, respectively.

    He joined the IBM Canada Laboratory as a Senior Analyst and was involved in the design and development of image and multimedia, and software engineering tools. In 1993, he joined as an Associate Professor in the Department of Electrical and Computer Engineering, Ryerson University, Toronto, ON. In 1994, he joined the Department of Computing, Hong Kong Polytechnic University, Hung Hom, Hong Kong, where he is currently a Professor. He has been a Consultant in various companies and other parts of Asia and Europe. His research interests include data mining, bioinformatics, software engineering, and pervasive computing.