

DATA MINING

Graph-Based Data Mining
Diane J. Cook and Lawrence B. Holder, University of Texas at Arlington

THE LARGE AMOUNT OF DATA collected today is quickly overwhelming researchers' abilities to interpret the data and discover interesting patterns in it. In response to this problem, researchers have developed techniques and systems for discovering concepts in databases.1-3 Much of the collected data, however, has an explicit or implicit structural component (spatial or temporal), which few discovery systems are designed to handle.4 So, in addition to the need to accelerate data mining of large databases, there is an urgent need to develop scalable tools for discovering concepts in structural databases.

One method for discovering knowledge in structural data is the identification of common substructures within the data. Substructure discovery is the process of identifying concepts describing interesting and repetitive substructures within structural data. The discovered substructure concepts allow abstraction from the detailed data structure and provide relevant attributes for interpreting the data.

The substructure discovery method is the basis of Subdue, which performs data mining on databases represented as graphs. The system performs two key data-mining techniques: unsupervised pattern discovery and supervised concept learning from examples. Our test applications have demonstrated the scalability and effectiveness of these techniques on a variety of structural databases.

Unsupervised concept discovery

Subdue discovers substructures that compress the original data and represent structural concepts in the data. The substructure discovery system represents structural data as a labeled graph. Objects in the data map to vertices or small subgraphs in the graph, and relationships between objects map to directed or undirected edges in the graph. A substructure is a connected subgraph in the graphical representation. An instance of a substructure is a set of vertices and edges from the input graph that match, graph theoretically, the substructure's graphical representation. This graphical representation serves as input to the substructure discovery system. Figure 1 shows a geometric example of a database. It also shows the graph representation of the discovered substructure and highlights one of the four instances of the substructure.
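To make the representation concrete, the following is a minimal sketch, in Python, of how the Figure 1 fragment might be encoded as a labeled graph; the LabeledGraph class and the specific vertex names are illustrative stand-ins, not Subdue's own data structures.

class LabeledGraph:
    def __init__(self):
        self.vertex_labels = {}      # vertex id -> label
        self.edges = []              # (source id, target id, edge label) triples

    def add_vertex(self, vid, label):
        self.vertex_labels[vid] = label

    def add_edge(self, src, dst, label):
        self.edges.append((src, dst, label))

g = LabeledGraph()
# One "triangle on square" instance from Figure 1: two object vertices related
# by an "on" edge, each linked by a "shape" edge to its shape vertex.
g.add_vertex("T1", "object")
g.add_vertex("S1", "object")
g.add_vertex("t1", "triangle")
g.add_vertex("s1", "square")
g.add_edge("T1", "t1", "shape")
g.add_edge("S1", "s1", "shape")
g.add_edge("T1", "S1", "on")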

Subdue's substructure discovery algorithm is a beam search. Figure 2 shows the algorithm. The first step initializes ParentList (substructures to be expanded), ChildList (substructures that have been expanded), and BestList (the highest-valued substructures found so far) as empty. It also sets ProcessedSubs (the number of substructures expanded so far) to 0. Each list is a linked list of substructures, sorted in nonincreasing order by substructure value. For each unique vertex label, Subdue assembles a substructure whose definition is a vertex with that label and whose instances are all the vertices in input graph G with that label. Each substructure is inserted in ParentList.

The inner while loop is the algorithm's core. It removes each substructure in turn from the head of ParentList, and extends each of the substructure's instances in all possible ways. It does this either by adding a new edge and vertex in G to the instance or by adding a new edge between two vertices if both vertices are already part of the instance.


The first instance of each unique expansion becomes a definition for a new child substructure. All the child instances that were expanded in the same way (by adding the same new edge or new edge with new vertex to the same old vertex) become instances of that child substructure. In addition, child instances generated by different expansions that match the child substructure definition within the match-cost threshold (described later) become instances of the child substructure.
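The grouping of extended instances can be illustrated with a short sketch that reuses the LabeledGraph class above. It keys each extended instance on the labels involved in the expansion; this is a simplification of Subdue's actual extension step, which also handles edge-only extensions and inexact grouping.

from collections import defaultdict

def extend_instances(graph, instances):
    # instances: iterable of frozensets of vertex ids in the input graph
    children = defaultdict(list)   # expansion signature -> extended instances
    for inst in instances:
        for (src, dst, elabel) in graph.edges:
            for inside, outside in ((src, dst), (dst, src)):
                if inside in inst and outside not in inst:
                    sig = (graph.vertex_labels[inside], elabel,
                           graph.vertex_labels[outside])
                    children[sig].append(inst | {outside})
    return children

# Seed instances are the single vertices sharing a label, as in the algorithm:
seeds = [frozenset({v}) for v, lbl in g.vertex_labels.items() if lbl == "object"]
child_substructures = extend_instances(g, seeds)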

Subdue then evaluates each child, using the minimum description length (MDL) heuristic, and inserts each child in ChildList in order of its heuristic value. The algorithm enforces the search's beam width by controlling ChildList's length. After inserting a new child in ChildList, if ChildList's length exceeds BeamWidth, the system destroys the substructure at the end of the list. The parent substructure is inserted in BestList, and the same pruning mechanism limits BestList to be no longer than MaxBest. When ParentList is empty, the algorithm switches ParentList and ChildList, so that ParentList now holds the next generation of substructures to be expanded.


Data Mining: A Long-Term Dream Continues

In this continuation of our special issue on data mining, we present four articles that deal with mining nontraditional forms of data. Two articles deal with text mining. Sholom Weiss, Brian White, Chidanand Apte, and Fredrick Damerau show that simple and fast document matching can effectively assist help-desk applications in matching problem descriptions to relevant stored solution descriptions. Kurt Bollacker, Steve Lawrence, and C. Lee Giles present their Web-based CiteSeer system that finds user-specific scientific documents among documents that are preprocessed and stored in a local database. The system learns the user profile from the interactions and uses it as an agent to monitor new documents that might interest the user.

Two articles describe new techniques. Neal Lesh, Mohammed Zaki, and Mitsunori Ogihara find frequent subsequence patterns that help classify a sequence, such as DNA, into different classes of sequences. Their FeatureMine system efficiently examines subsequences to select a drastically pruned feature set. The article by Diane Cook and Lawrence Holder deals with data mining graph-structured data, such as CAD diagrams and chemical structures. Their Subdue system uses a beam search to discover substructures (features) that efficiently describe the given set of structures based on the MDL principle. Both of these new techniques discover, from source data that is not the flat tables typically found in popular databases, useful features that discriminate or describe such data.

The earlier issue

In the first part of this two-part special issue on data mining, which ran in our November-December 1999 issue, Gregory Piatetsky-Shapiro described the growing community of data mining in his "Expert Opinion" column. Data mining is becoming an indispensable decision-making tool in the ever more competitive business world. And challenging applications inspire new techniques and affirm their utility.

Two articles by Simon Kasif and Steven Salzberg discussed data mining in the exciting field of computational biology, where frequent pattern discovery, clustering, and classification all play crucial roles in understanding protein structures and their functions.

Several articles represented innovative data-mining applications in more traditional domains with conceptually flat data tables. Chidanand Apte, Edna Grossman, Edwin Pednault, Barry Rosen, Fateh Tipu, and Brian White have developed a specialized probabilistic model for auto insurance pure premium (that is, the expected claim amount for each policy holder) that meets the strict actuarial requirement. Special attention is given to missing values and the model's scalability. Sylvain Létourneau, Fazel Famili, and Stan Matwin described the entire process of modeling failing aircraft components, from gathering data and generating serious models to evaluating them using a domain-specific scoring function. Extensive experiments seem to show that the nearest-neighbor method does best for most parts. Philip Chan, Wei Fan, Andreas Prodromidis, and Salvatore Stolfo apply a specialized boosting technique to credit card fraud detection for scalability and enhanced utility of the model. The cost savings thus achieved demonstrate the improved accuracy of multiple models while providing scalable, fast model generation on a very large data set.

—David Waltz and Se June Hong, Guest Editors

David Waltz has been vice president of the NEC Research Institute's Computer Science Research Division and adjunct professor of computer science at Brandeis University since 1993, and will become president of the NEC Research Institute in April 2000. Contact him at NEC Research Institute, 4 Independence Way, Princeton, NJ 08540; [email protected].

Se June Hong is a research staff member working on data-mining technology at the IBM T.J. Watson Research Center in Yorktown Heights, New York. His research interests have included error-correcting code, fault-tolerant computing, design automation, and knowledge-based systems. Contact him at IBM T.J. Watson Research Center, P.O. Box 218, Yorktown Heights, NY 10598; [email protected].

Figure 1. Example substructure in graph form: (a) input graph; (b) substructure.



The BeamWidth and Limit (a user-defined limit on the number of substructures to process) parameters, along with computational constraints on the inexact graph match algorithm, constrain Subdue to a polynomial running time.

Another user-specified parameter determines whether the algorithm prunes the discovery search space by discarding a child substructure whose heuristic value is not greater than its parent's heuristic value. Early in the discovery process, the number of a given substructure's instances is very large, and it dominates the value of any heuristic that uses the number of instances as a parameter. As the substructure grows, its number of instances decreases quickly because the substructure is becoming more specific. This means that the heuristic value also usually decreases early in the discovery process. If it decreases for a sufficiently long period, all the child substructures will be discarded, ParentList will empty, and discovery will halt. Disabling pruning during discovery keeps ChildList full, even for substructure values that do not increase monotonically as the substructure definition grows.

Because ParentList never empties, another method for halting the program is needed. If the user specifies a maximum substructure size (MaxSubSize), larger child substructures will not be inserted in ChildList. As a result, ParentList and ChildList will eventually empty, and the discovery algorithm will halt.

Once a substructure is discovered, Subdue uses it to simplify the data by replacing instances of the substructure with a pointer to the newly discovered substructure. Discovered substructures allow abstraction from detailed structures in the original data. Through iteration of the substructure discovery and replacement process, Subdue constructs a hierarchical description of the structural data in terms of the discovered substructures. This hierarchy provides varying interpretation levels that can be accessed on the basis of the data analysis's specific goals.
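A rough sketch of this replacement step, again assuming the LabeledGraph class sketched earlier: every instance collapses to a single vertex labeled with the substructure's name, and edges that crossed an instance boundary are re-attached to that vertex. Subdue's real routine additionally records the hierarchy so later iterations can refer back to the definition.

def compress(graph, instances, sub_name):
    compressed = LabeledGraph()
    owner = {}                                   # original vertex id -> replacement id
    for k, inst in enumerate(instances):
        new_id = sub_name + "_" + str(k)
        compressed.add_vertex(new_id, sub_name)  # pointer to the discovered substructure
        for v in inst:
            owner[v] = new_id
    for v, label in graph.vertex_labels.items():
        if v not in owner:                       # vertices outside any instance survive
            compressed.add_vertex(v, label)
            owner[v] = v
    for s, d, label in graph.edges:
        if owner[s] != owner[d]:                 # edges internal to an instance disappear
            compressed.add_edge(owner[s], owner[d], label)
    return compressed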

The MDL heuristic. Subdue's heuristic for evaluating substructures is based on the MDL principle: The best theory for describing a data set is the theory that minimizes the data set's description length.5 The model for description length calculation is a local computer sending the description of a concept to a remote computer. The local computer encodes the concept as a bit string, and the remote computer decodes the string to restore the original concept. The concept's description length is the number of bits in the string.

Subdue implements the MDL principle in the context of graph compression using a substructure. Thus, the best substructure in a graph is one that minimizes DL(S) + DL(G|S). Here S is the discovered substructure, G is the input graph, and DL(S) is the number of bits required to encode the discovered substructure. DL(G|S) is the number of bits required to encode G after it has been compressed using S. An earlier article describes the exact graph-description-length computation used in Subdue.6
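The following toy sketch shows the shape of the evaluation; the bit counts are crude stand-ins (a fixed-width code per label and per edge endpoint), whereas the exact encoding appears in reference 6. Because DL(G) is fixed for a given input graph, minimizing DL(S) + DL(G|S) is equivalent to maximizing the compression ratio the sketch reports.

import math

def toy_description_length(num_vertices, num_edges, num_labels):
    # Crude stand-in for the MDL encoding in reference 6: each vertex label and
    # each edge (two endpoints plus a label) is coded with fixed-width bits.
    label_bits = math.log2(max(num_labels, 2))
    vertex_bits = math.log2(max(num_vertices, 2))
    return num_vertices * label_bits + num_edges * (2 * vertex_bits + label_bits)

def substructure_value(dl_graph, dl_sub, dl_graph_given_sub):
    # Minimizing DL(S) + DL(G|S) is equivalent to maximizing this ratio,
    # since DL(G) is constant for a given input graph.
    return dl_graph / (dl_sub + dl_graph_given_sub)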

Inexact graph match. Because substructure instances can appear in different forms throughout the database, Subdue uses an inexact graph match to identify them.7 In this approach, the user assigns a cost to each distortion of a graph. A distortion consists of basic transformations such as deletion, insertion, and substitution of vertices and edges. In determining the distortion costs, the user can bias the match for or against particular types of distortions.

Given graphs g1 with n vertices and g2 with m vertices, m being greater than or equal to n, the complexity of the full inexact graph match is on the order of n^(m+1). Because the discovery process uses this routine heavily, its complexity can significantly degrade system performance. To improve the algorithm's performance, we have it search through the space of possible partial mappings using a uniform cost search. The cost from the root of the tree to a given node is calculated as the cost of all distortions corresponding to that node's partial mapping. The algorithm considers vertices from the matched graphs in order, from the most heavily connected to the least connected. Because the uniform cost search guarantees an optimal solution, it ends as soon as it finds the first complete mapping.

In addition, the user can limit the number of search nodes (defined as a function of the number of vertices in the first graph) considered by the branch-and-bound procedure. When the number of nodes expanded in the search tree reaches the defined limit, the search resorts to hill climbing, using the mapping cost so far as the measure for choosing the best node at a given level. Our earlier article provides a complete description of Subdue's polynomial inexact graph match.6
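A simplified sketch of the uniform-cost search over partial mappings follows, assuming the LabeledGraph class from earlier. It charges user-supplied costs for vertex substitutions, vertex deletions, and missing edges, and treats edges as undirected; Subdue's full match also handles insertions, directed edges, the node limit, and the hill-climbing fallback described above.

import heapq
import itertools

def inexact_match_cost(g1, g2, sub_cost=1.0, del_cost=1.0, edge_cost=1.0):
    # Order g1's vertices from most to least connected, as in the article.
    degree = {v: 0 for v in g1.vertex_labels}
    for s, d, _ in g1.edges:
        degree[s] += 1
        degree[d] += 1
    order = sorted(g1.vertex_labels, key=lambda v: -degree[v])

    # Undirected view of g2's edges for this simplified sketch.
    edges2 = set()
    for s, d, label in g2.edges:
        edges2.add((s, d, label))
        edges2.add((d, s, label))

    tiebreak = itertools.count()
    frontier = [(0.0, next(tiebreak), 0, {})]    # (cost, tiebreak, depth, partial mapping)
    while frontier:
        cost, _, depth, mapping = heapq.heappop(frontier)
        if depth == len(order):
            return cost                          # first complete mapping is optimal
        v1 = order[depth]
        used = set(mapping.values())
        for v2 in [u for u in g2.vertex_labels if u not in used] + [None]:
            new_map = dict(mapping)
            new_map[v1] = v2                     # None means v1 is deleted
            step = del_cost if v2 is None else (
                0.0 if g1.vertex_labels[v1] == g2.vertex_labels[v2] else sub_cost)
            # Charge for g1 edges that just became fully mapped but have no
            # counterpart in g2 (an edge-deletion distortion).
            for s, d, label in g1.edges:
                if v1 in (s, d) and s in new_map and d in new_map:
                    ms, md = new_map[s], new_map[d]
                    if ms is None or md is None or (ms, md, label) not in edges2:
                        step += edge_cost
            heapq.heappush(frontier, (cost + step, next(tiebreak), depth + 1, new_map))
    return float("inf")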



Figure 2. Subdue's discovery algorithm.

Subdue(Graph, BeamWidth, MaxBest, MaxSubSize, Limit)
  ParentList = {}
  ChildList = {}
  BestList = {}
  ProcessedSubs = 0
  Create a substructure from each unique vertex label and its single-vertex
    instances; insert the resulting substructures in ParentList
  while ProcessedSubs <= Limit and ParentList is not empty do
    while ParentList is not empty do
      Parent = RemoveHead(ParentList)
      Extend each instance of Parent in all possible ways
      Group the extended instances into Child substructures
      foreach Child do
        if SizeOf(Child) <= MaxSubSize then
          Evaluate the Child
          Insert Child in ChildList in order by value
          if Length(ChildList) > BeamWidth then
            Destroy the substructure at the end of ChildList
      ProcessedSubs = ProcessedSubs + 1
      Insert Parent in BestList in order by value
      if Length(BestList) > MaxBest then
        Destroy the substructure at the end of BestList
    Switch ParentList and ChildList
  return BestList


Bounds on the number of substructures considered (L) and the number of partial mappings considered during an inexact graph match (g) constrain Subdue to run in polynomial time. The system's worst-case runtime is the product of the number of generated substructures (bounded by L), the number of instances of each substructure (polynomial in v, the number of vertices in the input graph), and the number of partial mappings considered during each graph match (bounded by g). An earlier article provides the derivation of this expression.7

Discovery system applications

We have successfully applied Subdue, with and without domain knowledge, to databases in domains including image analysis, CAD circuit analysis, Chinese characters, program source code, and chemical reaction chains.7,8

CAD circuit analysis. Figure 3 shows the substructures Subdue discovered in a CAD circuit representing a sixth-order bandpass leapfrog ladder (boxed substructures were discovered in previous iterations). The figure also tabulates an evaluation based on compression obtained with the substructure, time required to process the database, a human rating of the discovered concept's interestingness, and the number of substructure instances found in the database. For the interestingness rating, eight domain experts rated each substructure on a scale from 1 to 5, with 1 meaning the substructure does not represent useful CAD information and 5 meaning the substructure is very useful.

In this experiment, Subdue generated substructures in three ways: using no background knowledge, using background knowledge in the form of graph match rules customized for this domain, and using both graph match rules and information about specific models likely to occur in this domain. As the figure shows, Subdue discovered substructures that perform well at compressing the database and represent functional CAD concepts. Using general background knowledge improved the concept's functional rating. But the MDL principle alone was effective in discovering the operational amplifier substructure, which the experts determined is highly interesting and functional in this domain.

Protein structure analysis. More recently, we applied Subdue to several large databases containing data requiring scientific interpretation. For example, we applied the unsupervised-discovery system to the July 1997 release of the Protein Data Bank (PDB). Our goal was to identify structural patterns in the primary, secondary, and tertiary structures of three protein categories: hemoglobin, myoglobin, and ribonuclease A. These patterns would act as signatures distinguishing proteins in the category from other types of proteins and providing a classification mechanism.

Using Subdue, we converted primary structure information from the primary sequence specified in each PDB file by representing each amino acid in the sequence as a graph vertex. The vertex numbers increase in the sequence order from N-terminus to C-terminus, and the vertex label is the name of the amino acid. We added an edge labeled "bond" between adjacent amino acids in a sequence.

We extracted secondary structure by listing occurrences of helices and strands along the primary sequence. A graph vertex labeled "h" followed by the helix type and length represents each helix. A graph vertex labeled "s" followed by the strand's orientation and length represents each strand. Edges between two consecutive vertices are labeled "sh" if they belong to the same PDB file. To represent the protein's 3D features, we used the x, y, and z coordinates of each atom in the protein. We represented each amino acid α-carbon as a graph vertex. If the distance between two α-carbons was greater than 6 Å, we discarded the information. Otherwise, we created edges between two α-carbons and labeled them "vs" (very short, distance ≤ 4 Å) or "s" (short).
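As an illustration of the primary-structure encoding only, and assuming the LabeledGraph class sketched earlier, the conversion of one sequence might look like this; the residue names are a short hemoglobin-like fragment used purely as an example.

def primary_structure_graph(residues):
    # residues: amino-acid names ordered from N-terminus to C-terminus
    g = LabeledGraph()
    for i, name in enumerate(residues):
        g.add_vertex(i, name)                 # vertex numbers follow sequence order
    for i in range(len(residues) - 1):
        g.add_edge(i, i + 1, "bond")          # "bond" edge between adjacent residues
    return g

fragment = ["VAL", "LEU", "SER", "PRO", "ALA", "ASP"]   # illustrative fragment
protein_graph = primary_structure_graph(fragment)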

Subdue indeed found a structural pattern for each protein category. Using primary structure information, it identified patterns unique to each class of protein but occurring in 63 of the 65 hemoglobin cases, 67 of the 103 myoglobin cases, and 59 of the 68 ribonuclease A cases.

Figure 4 summarizes one of the findings for hemoglobin secondary structure, presenting an overall view of the protein, the portion where the discovered pattern exists, and schematic views of the best pattern.



Figure 3. Discovered substructures in a leapfrog circuit (boxed substructures were discovered in previous iterations). The discovered substructures themselves are circuit schematics and are not reproduced here; the figure's evaluation table follows.

Usage of domain knowledge               Compression   Nodes expanded   Instances   Human rating [std. dev.]
No domain knowledge                     0.63          571,370          9           4.2 [1.2]
                                        0.68          46,422           3           2.7 [1.2]
                                        0.72          59,886           2           2.7 [1.0]
Graph match rules                       0.43          145,175          6           2.7 [1.7]
                                        0.51          39,881           3           2.7 [1.0]
                                        0.62          425,772          2           1.5 [0.8]
Model knowledge and graph match rules   0.53          24,273           9           4.3 [1.2]
                                        0.66          12,960           4           4.5 [0.8]
                                        0.72          7,432            2           4.5 [0.8]


The patterns discovered for each sample category covered a majority of proteins in that category. That is, 33 of the 50 analyzed hemoglobin proteins, 67 of the 89 myoglobin proteins, and 35 of the 52 ribonuclease A proteins contained the discovered patterns.

There are many possible reasons that some proteins did not show a pattern. Many factors can affect a protein's structure: sample quality, experimental conditions, and human error. Discrepancies can also result from physiological and biochemical factors. The same protein molecule's structure might differ from one species to another. The protein might be genetically defective. For example, sickle-cell anemia is the classic example of a genetic hemoglobin disease in which the protein lacks the structure necessary to perform its normal function.

We mapped the secondary structural patterns of the hemoglobin, myoglobin, and ribonuclease A proteins back into the PDB files. After mapping, we found that one discovered hemoglobin pattern belongs to the α chains and the other to the β chains of a hemoglobin molecule (a hemoglobin molecule contains two α and two β chains). The discovered myoglobin pattern appears in a majority of the myoglobin proteins in the data set. Finally, upon mapping the discovered ribonuclease A patterns back to the PDB files, we observed that several ribonuclease S proteins have the same patterns as those in ribonuclease A proteins. This is consistent with the fact that ribonuclease S is a complex consisting of two fragments (S-peptide and S-protein) of the ribonuclease A proteins. The pattern in the ribonuclease S comes from the S-protein fragment.

The secondary-structure patterns discovered are also distinct to each protein category. Subdue searched the global data set to identify the possible existence of the discovered pattern from each protein category. The results indicated that there is no exact match of the best patterns of one category in other categories.

Steve Sprang, a molecular biologist at the University of Texas Southwestern Medical Center, evaluated the patterns discovered by the Subdue system. He reviewed the original database and the discovered substructures to determine whether the discovered concepts represented the data accurately and pointed to interesting discoveries.


Figure 4. Hemoglobin secondary structure: (a) overall view, (b) discovered pattern, and (c) schematic views of the pattern.

Related work

Researchers have proposed a variety of unsupervised-discovery approaches for structural data.1,2 One approach is to use a knowledge base of concepts to classify the structural data. Systems using this approach learn concepts from examples and then categorize observed data. Such systems represent examples as distinct objects and process individual objects one at a time. In contrast, Subdue stores the entire database (with embedded objects) as one graph and processes the graph as a whole.

Scientific discovery systems that use domain knowledge have also been developed, but they target a single application domain. An example is Mechem,3 which relies on domain knowledge to discover chemistry hypotheses. In contrast, Subdue performs general-purpose, automated discovery with or without domain knowledge and hence can be applied to many structural domains.

Logic-based systems have dominated relational concept learning, especially inductive logic programming (ILP) systems. However, first-order logic can also be represented as a graph and, in fact, is a subset of what graphs can represent. Therefore, learning systems using graphical representations potentially can learn richer concepts if they can handle the larger hypothesis space.

FOIL,4 the ILP system discussed in this article, executes a top-down approach to learning relational concepts (theories) represented as an ordered sequence of function-free definite clauses. Given extensional background knowledge including relations and examples of the target concept relation, FOIL begins with the most general theory. Then it follows a set-covering approach, repeatedly adding a clause that covers some positive examples and few negative examples. Then FOIL removes the positive examples covered by the clause and iterates the process on the reduced set of positive examples and all negative examples until the theory covers all the positive examples. To avoid overcomplex clauses, FOIL ensures that a clause's description length does not exceed the description length of the examples the clause covers. In addition to the applications discussed here, as well as applications in numerous recursive and nonrecursive logical domains, FOIL has been applied to learning search-control rules and patterns in hypertext.

References

1. D. Conklin, "Machine Discovery of Protein Motifs," Machine Learning, Vol. 21, Nos. 1/2, 1995, pp. 125-150.

2. K. Thompson and P. Langley, "Concept Formation in Structured Domains," Concept Formation: Knowledge and Experience in Unsupervised Learning, D.H. Fisher and M. Pazzani, eds., Morgan Kaufmann, San Mateo, Calif., 1991.

3. R.E. Valdes-Perez, "Conjecturing Hidden Entities by Means of Simplicity and Conservation Laws: Machine Discovery in Chemistry," Artificial Intelligence, Vol. 65, No. 2, 1994, pp. 247-280.

4. R.M. Cameron-Jones and J.R. Quinlan, "Efficient Top-Down Induction of Logic Programs," SIGART Bulletin, Vol. 5, No. 1, 1994, pp. 33-42.


Sprang found that Subdue discovered an interesting, previously unknown pattern that suggests new information about the microevolution of such proteins in mammals.

We are continuing experimental applications of Subdue's unsupervised-discovery capabilities in the domains of biochemistry, geology, program source code, and aviation data.

Scalability

A barrier to integrating scientific discovery into practical data-mining approaches is that discovery systems lack scalability. Many system developers evaluate a discovery method's correctness without regard to its scalability. Another barrier is that some scientific-discovery systems deal with rich data representations that degrade scalability. For example, Subdue's discovery relies on computationally expensive procedures such as subgraph isomorphism. Although Subdue's algorithm is polynomially constrained, the system still spends a considerable amount of computation on this task.

We are researching the use of distributed hardware to improve Subdue's scalability. In our approach, we partition the data among n individual processors and process each partition in parallel.9 Each processor performs the sequential version of Subdue on its local graph partition and broadcasts its best substructures to the other processors. The processors then evaluate the communicated substructures on their local partitions. Once all evaluations are complete, a master processor gathers the results and determines the global best discoveries.

Sequential Subdue's runtime is nonlinear with respect to the graph's size. So decreasing the input's size by partitioning the graph among multiple processors sometimes results in a speedup greater than the number of processors. However, the serial algorithm analyzes the entire graph and therefore does not overlook important relationships of its parts. Partitioning the graph among processors might remove essential information (in our case, edges along boundary lines), and neighboring information can no longer be used to discover concepts.

The Metis graph-partitioning algorithm (http://www-users.cs.umn.edu/~karypis/metis) lets us divide the input graph into n partitions in a way that minimizes the number of edges shared by partitions and thus reduces information loss. We modified this algorithm to allow a small amount of overlap between partitions, which recovers some of the information lost at the partition boundaries.
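The overall scheme can be summarized in a short sketch. Here a multiprocessing pool stands in for the PVM processes, and discover and evaluate are placeholders for a sequential Subdue run and its compression measure; the real system evaluates candidates on each processor's own partition rather than serially at the master.

from multiprocessing import Pool

def distributed_subdue(partitions, discover, evaluate, workers=8):
    # discover(partition) -> list of locally best substructures (a Subdue run)
    # evaluate(substructure, partition) -> compression value on that partition
    # (discover must be a module-level function so the pool can pickle it)
    with Pool(workers) as pool:
        local_best = pool.map(discover, partitions)      # one run per partition
    candidates = [sub for subs in local_best for sub in subs]
    # "Broadcast" step: every candidate is scored against every partition and
    # the substructure with the best global score is reported.
    def global_score(sub):
        return sum(evaluate(sub, part) for part in partitions)
    return max(candidates, key=global_score)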

Figure 5 graphs the runtime of distributed Subdue on two classes of graphs as the number of processors increases. The two classes are

• graphs representing CAD circuits (each labeled "nCAD," where n represents the size factor, and 1CAD contains 8,441 vertices and 19,206 edges) and
• graphs generated artificially (each labeled "nART," where n represents the size factor, and 1ART contains 2,000 vertices and 5,000 edges).

As predicted, the speedup is usually close to linear and sometimes greater than the number of processors. Increasing the number of partitions results in improved speedup until the number of partitions approaches the number of vertices in the graph.


Figure 5. Distributed Subdue runtime on (a) CAD and (b) ART graphs. Both plots show runtime in seconds versus the number of processors (2 to 16), with one curve per graph size (1CAD through 8CAD; 2ART through 32ART).


Because each processor handles only a portion of the overall database, some degradation in the quality of discovered substructures might result from this parallel version of Subdue. In practice, we find that as the number of processors increases, quality measured as compression initially improves because a greater number of substructures can be considered in the allotted time. As the number of partitions approaches the number of vertices in the graph, the quality of discovered substructures degrades.

By partitioning the database effectively, distributed Subdue proves to be a highly scalable system. One of our tested databases, representing a satellite image, contains 2 million vertices and 5 million edges. It could not be effectively processed on one machine, given the memory constraints. But distributed Subdue processed the database in less than three hours using eight processors. Requiring a minimal amount of communication and using PVM (Parallel Virtual Machine) for communication, distributed Subdue is available for widespread use and runs on a variety of heterogeneous distributed networks.

Supervised concept learning

We have extended Subdue to act not only as an unsupervised-discovery system, but also to perform supervised, graph-based, relational concept learning. Few general-purpose learning methods use a formal graph representation of knowledge, perhaps because of a graph's arbitrary expressiveness. Another reason is the inherent NP-hardness of typical learning routines, such as covers and least-general generalization, which both require subgraph isomorphism tests in the general case. Pattern-recognition-learning methods in chemical domains have been successful because of the natural graphical description of chemical compounds, but no domain-independent concept-learning systems use a graph representation.

Our main challenge in adding a concept-learning capability to the graph-based discovery system was including a negative graph in the process. Substructures that occur often in the positive graph but infrequently in the negative graph are likely to represent the target concept. Therefore, the Subdue concept learner (SubdueCL) accepts both a positive and a negative graph and evaluates substructures as to their compression of the positive and lack of compression of the negative.


Figure 6. (a) The house domain; (b) SubdueCL and FOIL results for house1; (c) results for house2.

Results for house1 (b):
  SubdueCL: [0/0]
  FOIL (80%): :- house(A, B, C). [3/0]
  FOIL (50%): house(A, B, C) :- shape(A, triangle). [0/1]

Results for house2 (c):
  FOIL (80%): house(A, B, C) :- shape(A, triangle), shape(C, triangle). [3/0]
  FOIL (50%): house(A, B, C) :- shape(A, triangle). [0/2]
  SubdueCL: Same as for house1



SubdueCL searches for the substructure S that minimizes the cost of describing substructure S in positive graph Gp and negative graph Gn. This cost is expressed

value(Gp, Gn, S) = DL(Gp, S) + DL(S) + DL(Gn) − DL(Gn, S).

DL(G, S) is the description length, according to the MDL encoding, of graph G after being compressed using substructure S, and DL(G) is the description length of graph G. This cost represents the information needed to represent Gp using substructure S, plus the information needed to represent the portion of Gn that was compressed using S. Thus, SubdueCL prefers substructures that compress the positive graph but not the negative graph.
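The cost above translates directly into a small helper; dl and dl_given stand for the description-length routines of reference 6, which are assumed rather than implemented here.

def subduecl_value(dl, dl_given, positive_graph, negative_graph, sub):
    # dl(x): description length of a graph or substructure
    # dl_given(g, s): description length of g after compression with s
    # Lower values are better: sub should compress the positive graph
    # (small dl_given(positive_graph, sub)) while leaving the negative graph
    # largely uncompressed (dl(negative_graph) - dl_given(negative_graph, sub) small).
    return (dl_given(positive_graph, sub) + dl(sub)
            + dl(negative_graph) - dl_given(negative_graph, sub))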

Another challenge was the discovery system's bias toward finding only one good substructure in the entire input graph. Inductive logic programming (ILP) systems have a distinct advantage over Subdue; they typically find theories composed of many rules, whereas Subdue finds essentially one rule. Subdue's iterative, hierarchical capabilities somewhat address this problem. But the substructures found in later iterations are typically defined in terms of previously discovered substructures and are therefore only specializations of the earlier, more general rule. To avoid this tendency, SubdueCL discards any substructure that contains substructures discovered during previous iterations. SubdueCL iterates until it can find no substructure that compresses the positive graph more than the negative graph.

Comparing learning systems

We compared SubdueCL to the ILP system FOIL (First-Order Inductive Learning)10 and to the decision-tree induction system C4.5.3 First, we compared the relational learners SubdueCL and FOIL on a simple artificial domain to examine qualitative differences. Then we compared all three systems on relational domains.

An artificial domain. Figure 6a depicts the two data sets of the house domain, house1 and house2, whose target concept is "triangle on square." House1 contains the six examples on the left, and house2 contains the two additional examples on the right. Each object has a shape and is related to other objects by the binary relation on. Figure 6b shows the results of running SubdueCL and FOIL on house1.

The bracketed numbers in Figure 6b are the number of false negatives and false positives. We ran FOIL twice on each data set, once with the default minimum individual-clause accuracy of 80%, and once with a minimum of 50%. At 80%, FOIL returned the concept that nothing was a house, misclassifying the three positive examples. At 50%, FOIL described houses as any example with a triangle on top. However, to relate the triangle and square objects in a connected substructure, SubdueCL includes the on relation. Therefore, SubdueCL recognized that the triangle should be on top of the square.

The house2 data set adds two examples to house1 to emphasize the need for the on relation. Figure 6c shows the results of running the two systems on house2. Again, FOIL has trouble identifying the correct concept.

These examples reveal an advantage of graph representation over logic for such position-invariant (and other relational) concepts. However, logic is better suited for some concepts. For example, FOIL can easily represent the concept "top and bottom objects have same shape" as follows:

house(A,B,C) :- shape(A,S), shape(C,S).
or
house(A,B,C) :- shape(A,S1), shape(C,S2), S1 = S2.

In contrast, using the graph representation defined for the house domain, SubdueCL would need to learn five different substructures, one for each shape. Of course, if we could foresee such a concept, we could change the graph representation to include an intermediate shape vertex connected to another vertex with the actual shape. Then, we could add equal edges between these shape vertices to represent equivalent shapes independent of the actual shape. Furthermore, although both systems handle numeric values (vertex labels), a similar representational transformation is necessary for learning concepts involving general equalities and inequalities between numbers.

Relational domains. The relational domains we used in our tests included illegal chess endgames, tic-tac-toe endgames, and musical excerpts. The chess domain consists of 12,957 row-column positions for a white king, white rook, and black king such that the black king is (positive) or is not (negative) in check. FOIL extensionally defined adjacency relations between chessboard positions and less-than relations between row and column numbers. We provided C4.5 with the same information by adding features that relate each piece's row and column values as equal, not equal, or adjacent.


Figure 7. (a) A chess domain example; SubdueCL's (b) graphical representation and (c) discovered substructures for the example.


Figure 7 shows SubdueCL's representation of a chess domain example. Each piece is represented by two vertices corresponding to the piece's row and column, connected by a position relation (for example, WKC stands for white king column). Instead of less-than relations, we used eq and noteq relations between all such rows and columns.
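One board position could be encoded along these lines, reusing the LabeledGraph sketch; the helper and the sample coordinates are illustrative, and the adj edges follow our reading of Figure 7 (adjacent rows or columns are also marked noteq).

def chess_example_graph(wk, wr, bk):
    # wk, wr, bk: (row, column) coordinates of the white king, white rook,
    # and black king for one example board
    g = LabeledGraph()
    coords = {}
    for name, (row, col) in (("WK", wk), ("WR", wr), ("BK", bk)):
        g.add_vertex(name + "R", name + "R")     # e.g. WKR = white king row
        g.add_vertex(name + "C", name + "C")     # e.g. WKC = white king column
        g.add_edge(name + "R", name + "C", "pos")
        coords[name + "R"], coords[name + "C"] = row, col
    rows = [v for v in coords if v.endswith("R")]
    cols = [v for v in coords if v.endswith("C")]
    for group in (rows, cols):
        for i in range(len(group)):
            for j in range(i + 1, len(group)):
                a, b = group[i], group[j]
                if coords[a] == coords[b]:
                    g.add_edge(a, b, "eq")
                else:
                    g.add_edge(a, b, "noteq")
                    if abs(coords[a] - coords[b]) == 1:
                        g.add_edge(a, b, "adj")
    return g

example = chess_example_graph(wk=(0, 1), wr=(1, 2), bk=(2, 0))   # illustrative position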

We tested the three systems for predictive accuracy on this domain, using threefold cross validation, with significance values gathered from a paired Student t-test. The accuracy results were 99.8% for FOIL, 99.77% for C4.5, and 99.21% for SubdueCL. The accuracy difference between FOIL and SubdueCL is significant at the 0.19 level (that is, the probability that the difference is insignificant is 0.19), and the difference between C4.5 and SubdueCL is significant at the 0.23 level.

FOIL learned six rules, SubdueCL learned five rules (substructures), and C4.5 learned 43 rules. All three systems discovered four rules that described, in each case, approximately 2,000 of the 2,118 positive examples. Figure 7c shows two of these rules described as substructures. The remaining rules differed among systems and proved to be more powerful for FOIL and C4.5 because of these systems' ability to learn numeric ranges.

Next, we tested the systems on a complete set of 958 possible board configurations at the end of tic-tac-toe games. The target concept is "a win for x." We supplied the three systems with the values (X, O, and blank) for each of the nine board positions. Unlike the other systems, SubdueCL does not key on individual position values but uses relational information between board positions to learn the three win concepts: three-in-a-row, three-in-a-column, and three-in-a-diagonal. The accuracy results are therefore 100% for SubdueCL, 92.35% for FOIL (the difference is significant at the 0.21 level), and 96.03% for C4.5 (the difference is significant at the 0.03 level).

In the final experiment, we attempted to differentiate Beethoven works from Bach works, using a set of musical excerpts described by a sequence of pitch values. We obtained the 100 Bach chorales from the UC Irvine repository and randomly selected the Beethoven works from the composer's collected works. To provide examples for the three learning systems, we selected a musical theme from each of 50 Beethoven examples and repeated it 10 times, with a varying pitch offset and surrounded by random pitch values. We tested the learning systems both by using absolute pitch values and by using the relative difference between one pitch value and the next.




All three systems performed better with pitch-difference values. Unlike FOIL and C4.5, SubdueCL used the relational information between successive notes to learn the embedded musical sequences for the positive examples. The accuracy values were 100% for SubdueCL, 85.71% for FOIL (the difference is significant at the 0.06 level), and 82% for C4.5 (the difference is significant at the 0.00 level).
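The pitch-difference encoding that worked best can be sketched as follows, again assuming the LabeledGraph class from earlier; whether the interval is carried on the edge label (as here) or elsewhere is our own illustrative choice, since the article does not spell out that detail.

def pitch_difference_graph(pitches):
    # pitches: a sequence of MIDI-style pitch numbers for one excerpt
    g = LabeledGraph()
    for i in range(len(pitches)):
        g.add_vertex(i, "note")
    for i in range(len(pitches) - 1):
        # edge label is the interval, so a theme matches at any transposition
        g.add_edge(i, i + 1, str(pitches[i + 1] - pitches[i]))
    return g

theme = [67, 67, 67, 63]          # G, G, G, E-flat as MIDI pitches (illustrative)
excerpt_graph = pitch_difference_graph(theme)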

OUR EXPERIMENTAL RESULTS indicate that SubdueCL, the graph-based relational concept learner, is competitive with logic-based relational concept learners on a variety of domains. This comparison has identified a number of avenues for enhancements. SubdueCL would benefit from the ability to identify ranges of numbers. We could accomplish this by utilizing the system's existing capability to find similar but not exact matches of a substructure in the input graph. Numeric values within the instances could be generalized to the encompassing range. A graph-based learner also needs the ability to represent recursion, which plays a central part in many logic-based concepts. More research is needed to identify representational enhancements for describing recursive structures (for example, graph grammars). Our future work will also focus on extending Subdue to handle other forms of learning, such as clustering.

We are continuing our testing of Subdue in real-world applications. In biochemistry, for example, we are applying Subdue to data from the Human Genome Project to find patterns in the DNA sequence that indicate the presence of a gene-transcription-factor site. Unlike other approaches to finding patterns in gene data,11 Subdue uses a graph to represent structural information in the sequence. We hope that the discovered patterns will point to genes in uncharted areas of the DNA sequence.

In another area of chemistry, we are applying SubdueCL to the Predictive Toxicology Challenge data. This data contains the structural descriptions of more than 300 chemical compounds that have been analyzed for carcinogenicity. Each compound (except for about 30 held out for future testing) is labeled as either cancer-causing or not. Our goal is to find a pattern in the cancerous compounds that does not occur in the noncancerous compounds. So far, SubdueCL has found several promising patterns, which are currently under evaluation in the University of Texas at Arlington's Department of Chemistry.

In addition, we are applying Subdue to a number of other databases, including the Aviation Safety Reporting System database, US Geological Survey earthquake data, and software call graphs. Subdue has discovered several interesting patterns in the ASRS database. Burke Burkart of UTA's Department of Geology evaluated Subdue's results on the geology data and found that Subdue correctly identified patterns dependent on earthquake depth, often the distinguishing factor among earthquake types. These and other results show that Subdue discovers relevant knowledge in structural data and that it scales to large databases.

Acknowledgments
Our work was supported by NSF grant IRI-9615272 and THECB grant 003656-045.

References

1. P. Cheeseman and J. Stutz, "Bayesian Classification (AutoClass): Theory and Results," Advances in Knowledge Discovery and Data Mining, U.M. Fayyad et al., eds., MIT Press, Cambridge, Mass., 1996, pp. 153-180.

2. D. Fisher, "Knowledge Acquisition via Incremental Conceptual Clustering," Machine Learning, Vol. 2, No. 2, 1987, pp. 139-172.

3. J.R. Quinlan, C4.5: Programs for Machine Learning, Morgan Kaufmann, San Mateo, Calif., 1993.

4. U.M. Fayyad, G. Piatetsky-Shapiro, and P. Smyth, "From Data Mining to Knowledge Discovery: An Overview," Advances in Knowledge Discovery and Data Mining, U.M. Fayyad et al., eds., MIT Press, Cambridge, Mass., 1996, pp. 1-34.

5. J. Rissanen, Stochastic Complexity in Statistical Inquiry, World Scientific Publishing, Singapore, 1989.

6. D.J. Cook and L.B. Holder, "Substructure Discovery Using Minimum Description Length and Background Knowledge," J. Artificial Intelligence Research, Vol. 1, 1994, pp. 231-255.

7. D.J. Cook, L.B. Holder, and S. Djoko, "Scalable Discovery of Informative Structural Concepts Using Domain Knowledge," IEEE Expert, Vol. 11, No. 5, Sept./Oct. 1996, pp. 59-68.

8. S. Djoko, D.J. Cook, and L.B. Holder, "An Empirical Study of Domain Knowledge and Its Benefits to Substructure Discovery," IEEE Trans. Knowledge and Data Engineering, Vol. 9, No. 4, 1997, pp. 575-586.

9. G. Galal, D.J. Cook, and L.B. Holder, "Improving Scalability in a Scientific Discovery System by Exploiting Parallelism," Proc. Int'l Conf. Knowledge Discovery and Data Mining, AAAI Press, Menlo Park, Calif., 1997, pp. 171-174.

10. R.M. Cameron-Jones and J.R. Quinlan, "Efficient Top-Down Induction of Logic Programs," SIGART Bulletin, Vol. 5, No. 1, 1994, pp. 33-42.

11. M.W. Craven and J.W. Shavlik, "Machine Learning Approaches to Gene Recognition," IEEE Expert, Vol. 9, No. 2, Mar./Apr. 1993, pp. 2-10.

Diane J. Cook is an associate professor in the Computer Science and Engineering Department at the University of Texas at Arlington, where she is a director of the Learning and Planning Laboratory. Her research interests include artificial intelligence, machine planning, machine learning, robotics, and parallel algorithms for artificial intelligence. Cook received her BS from Wheaton College and her MS and PhD from the University of Illinois. Contact her at the Dept. of Computer Science and Engineering, Univ. of Texas at Arlington, Box 19015, Arlington, TX 76019-0015; [email protected]; http://www-cse.uta.edu/~cook.

Lawrence B. Holder is an associate professor and associate chair in the Department of Computer Science and Engineering at the University of Texas at Arlington, where he is also a codirector of the Learning and Planning Laboratory. His research interests include artificial intelligence and machine learning. Holder received his BS in computer engineering and his MS and PhD in computer science, all from the University of Illinois at Urbana-Champaign. He is a member of the AAAI, the ACM, and the IEEE. Contact him at the Dept. of Computer Science and Engineering, Univ. of Texas at Arlington, Box 19015, Arlington, TX 76019-0015; [email protected]; http://www-cse.uta.edu/~holder.
