Applications of knowledge discovery to molecular biology: Identifying structural regularities in proteins
Shaobing Su
Supervisor: Dr. Lawrence B. Holder
Committee: Dr. Diane J. Cook, Dr. Edward Bellion
Outline
Motivation and goal of the research
SUBDUE knowledge discovery system
Proteins and PDB
Methods and results
Discussion and conclusion
Future research
Motivation and Goal
The explosive amount of molecular biology information needs to be analyzed to help understand the underlying structure-function relationships in proteins and other macromolecules.
Apply SUBDUE to the Brookhaven Protein Data Bank (PDB) to identify biologically meaningful patterns.
SUBDUE knowledge discovery system
SUBDUE discovers patterns (substructures) in structural data sets
SUBDUE represents data as a labeled graph
Inputs: vertices and edges
Outputs: discovered patterns and instances
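Example
[Figure: example labeled graph with object vertices, triangle and square attribute vertices, and edges labeled shape and on; 4 instances of the discovered substructure]
Vertices: objects or attributes
Edges: relationships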
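A minimal Python sketch of the labeled-graph input described above. The `v <id> <label>` and `e <source> <target> <label>` line layout follows the graphic input shown on the later representation slides; the class name `LabeledGraph`, its methods, and the vertex numbering of the triangle-on-square example are illustrative assumptions, not SUBDUE's own API.

```python
from dataclasses import dataclass, field


@dataclass
class LabeledGraph:
    """Hypothetical container for a SUBDUE-style labeled graph."""
    vertices: dict = field(default_factory=dict)  # vertex id -> label
    edges: list = field(default_factory=list)     # (source id, target id, label)

    def add_vertex(self, label):
        vid = len(self.vertices) + 1  # ids start at 1, as on the slides
        self.vertices[vid] = label
        return vid

    def add_edge(self, src, dst, label):
        self.edges.append((src, dst, label))

    def to_lines(self):
        """Emit all vertex lines, then all edge lines."""
        lines = [f"v {vid} {label}" for vid, label in self.vertices.items()]
        lines += [f"e {s} {d} {label}" for s, d, label in self.edges]
        return lines


def triangle_on_square():
    """One instance of the pattern: a triangle object on a square object."""
    g = LabeledGraph()
    tri_obj = g.add_vertex("object")
    tri = g.add_vertex("triangle")
    sq_obj = g.add_vertex("object")
    sq = g.add_vertex("square")
    g.add_edge(tri_obj, tri, "shape")
    g.add_edge(sq_obj, sq, "shape")
    g.add_edge(tri_obj, sq_obj, "on")
    return g


print("\n".join(triangle_on_square().to_lines()))
```

Objects and attributes both become vertices here, so a relationship such as "on" and an attribute link such as "shape" are handled uniformly as labeled edges.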
SUBDUE’s search algorithm
Minimum Description Length (MDL) principle: the best theory to describe a set of data is the one that minimizes the description length (DL) of the entire data set
DL of the graph: the number of bits necessary to completely describe the graph
Search for the substructure that results in the maximum compression
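The compression idea can be illustrated with a deliberately crude sketch. It assumes description length is approximated by graph size (vertices plus edges) instead of SUBDUE's actual bit-level encoding, that instances do not overlap, and that compression is scored as a ratio; `graph_size` and `compression_value` are hypothetical names.

```python
def graph_size(num_vertices, num_edges):
    """Crude stand-in for description length: vertices plus edges."""
    return num_vertices + num_edges


def compression_value(graph, substructure, num_instances):
    """Return size(G) / (size(S) + size(G|S)); values above 1 mean compression.

    graph and substructure are (num_vertices, num_edges) pairs.
    G|S is G with every non-overlapping instance collapsed to one vertex.
    """
    g_v, g_e = graph
    s_v, s_e = substructure
    compressed_v = g_v - num_instances * s_v + num_instances
    compressed_e = g_e - num_instances * s_e
    return graph_size(g_v, g_e) / (
        graph_size(s_v, s_e) + graph_size(compressed_v, compressed_e)
    )


# Four instances of a 4-vertex, 3-edge pattern joined by 3 extra edges.
value = compression_value(graph=(16, 15), substructure=(4, 3), num_instances=4)
print(round(value, 3))
```

Under this proxy, a substructure is worth keeping only when describing it once plus the collapsed graph is smaller than the original graph.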
Inexact graph match approach
Find instances with a slight distortion: insertion, deletion, and substitution of edges/vertices.
Threshold parameter: specifies the amount of distortion allowed.
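A hedged sketch of threshold-limited matching, restricted to chain-shaped patterns such as the secondary structure sequences used later in the deck. It swaps in plain Levenshtein edit distance over vertex labels as a stand-in for SUBDUE's inexact graph match, and it assumes the allowed distortion is the threshold times the size (vertices plus edges) of the larger chain. Both choices, and the function names, are assumptions made for illustration. The two label sequences are the hemoglobin patterns listed on the results slides.

```python
def edit_distance(a, b):
    """Levenshtein distance between two label sequences."""
    previous = list(range(len(b) + 1))
    for i, label_a in enumerate(a, start=1):
        current = [i]
        for j, label_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                         # deletion
                current[j - 1] + 1,                      # insertion
                previous[j - 1] + (label_a != label_b),  # substitution
            ))
        previous = current
    return previous[-1]


def matches(pattern, candidate, threshold):
    """True if the candidate is within the allowed distortion of the pattern."""
    # A chain of n vertices has n - 1 edges, so its size is 2n - 1.
    size = 2 * max(len(pattern), len(candidate)) - 1
    return edit_distance(pattern, candidate) <= threshold * size


pattern = "h_1_14 h_1_15 h_1_6 h_1_6 h_1_19 h_1_8 h_1_18 h_1_20".split()
candidate = "h_1_15 h_1_15 h_1_6 h_1_1 h_1_19 h_1_8 h_1_18 h_1_20".split()
for threshold in (0.0, 0.1, 0.2):
    print(threshold, matches(pattern, candidate, threshold))
```

With a threshold of 0.0 only exact copies match; raising the threshold admits candidates that differ by a few label substitutions.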
Overview of proteins
Most important biomolecule
Composed of 20 amino acids
Structural hierarchy
Very diverse structure and function
Structural hierarchy in proteins
Primary structure (sequence of protein)
Secondary structure (helix, sheet, random)
Tertiary structure (3-D)
Primary Structure of proteins
Average 100-150 residues (a.a.) linked head to tail
N-terminus and C-terminus
Peptide bond, alpha-carbon
[Figure: dipeptide backbone +H3N - C1(R1) - C(=O) - N(H) - C2(R2) - C(=O) - O-, with the first a.a. at the N-terminus, the second a.a. at the C-terminus, and the peptide bond between C and N]
Secondary structure elements
Ordered backbone arrangement: helix and sheet
Helix (0 % to 90 %; average 11 a.a.; several types)
Sheet (2 to 15 strands per sheet; parallel and anti-parallel; average 6 a.a. per strand)
[Figures: right-handed α-helix; two-stranded parallel β-sheet; two-stranded anti-parallel β-sheet]
Tertiary Structure of protein
Highly complicated 3-D arrangement
Folding of its secondary structure elements
Brookhaven Protein Data Bank (PDB)
Brookhaven National Laboratory
Over 6000 experimentally determined 3-D structures of biomolecules
Majority: protein structures
Contents of PDB
SEQRES: sequence of a.a. (three-letter code)
HELIX: start, end, and type
SHEET: start, end, and sense
ATOM: (x, y, z) coordinates for each atom in the protein
Applications of SUBDUE to PDB - Methods and Results
July 1997 PDB™ release (6000 PDB)
Global data set (4000 PDB)
Category data sets: hemoglobin, myoglobin, ribonuclease A
Flowchart of Research
[Flowchart. Preprocessing stage: Brookhaven PDB, graphic representation, inputs to SUBDUE. Application stage: patterns in category, instance mapping, patterns in global others.]
Preprocessing
Compile PDB list for each category
model.c: extract first model
seq.c: extract sequence info and convert to graphic format
secondary.c: extract secondary structure info and convert to graphic format
coor.c: extract 3D coordinates and convert to graphic format
Primary structure and its representation
Sample PDB lines:
SEQRES 1 150 ALA ASN LYS THR 1ASH 139
SEQRES 2 150 LYS SER LEU GLU 1ASH 140
Sequence (N-terminus to C-terminus): ALA ASN LYS THR LYS SER LEU GLU
SUBDUE graphic input (ALA ASN):
v 1 ALA - - - ALA residue
v 2 ASN - - - ASN residue
e 1 2 bond - - - a peptide bond between ALA and ASN
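The mapping from a residue sequence to graphic input can be sketched as follows. This is an illustrative reimplementation, not the deck's seq.c; the function name `sequence_to_subdue` is hypothetical, and it takes an already extracted list of three-letter residue codes instead of parsing SEQRES records.

```python
def sequence_to_subdue(residues):
    """Turn a residue list (N- to C-terminus) into vertex and edge lines."""
    # One vertex per residue, numbered from 1 and labeled with its name.
    lines = [f"v {i} {name}" for i, name in enumerate(residues, start=1)]
    # One "bond" edge for each peptide bond between consecutive residues.
    lines += [f"e {i} {i + 1} bond" for i in range(1, len(residues))]
    return lines


sequence = "ALA ASN LYS THR LYS SER LEU GLU".split()
print("\n".join(sequence_to_subdue(sequence)))
```

A sequence of n residues yields n vertices and n - 1 edges, so the primary structure is always a simple chain.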
Secondary structure and its representation - HELIX
Sample PDB lines (starting, ending, type):
HELIX 1 ASN 1 HIS 13 1
HELIX 2 ASN 20 ASN 36 1
vertex: h_type_length
Helix length: Hlength = SeqNum(last a.a.) - SeqNum(first a.a.)
SUBDUE graphic input:
v 1 h_1_12 - - - helix 1, type 1, length 12
v 2 h_1_16 - - - helix 2, type 1, length 16
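A short sketch of the helix labeling rule, assuming the simplified whitespace-separated record layout shown on the slide (not the fixed-column layout of real PDB files). The function name `helix_vertex_label` is hypothetical.

```python
def helix_vertex_label(record):
    """Build an h_type_length label from a simplified HELIX record.

    Expected fields: HELIX, serial, first residue name, first sequence
    number, last residue name, last sequence number, helix type.
    """
    fields = record.split()
    if len(fields) != 7 or fields[0] != "HELIX":
        raise ValueError(f"not a simplified HELIX record: {record!r}")
    first_seq, last_seq, helix_type = int(fields[3]), int(fields[5]), int(fields[6])
    # Hlength = SeqNum(last a.a.) - SeqNum(first a.a.)
    return f"h_{helix_type}_{last_seq - first_seq}"


for line in ("HELIX 1 ASN 1 HIS 13 1", "HELIX 2 ASN 20 ASN 36 1"):
    print(helix_vertex_label(line))
```

Because the length is a difference of sequence numbers, a helix spanning residues 1 through 13 gets length 12, matching the slide.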
Secondary structure and its representation - SHEET
Sample PDB lines (sense, length):
SHEET 1 TYR 284 ILE 286 0
SHEET 2 HIS 292 THR 294 -1
vertex: s_sense_length
SUBDUE graphic input:
v 1 s_0_2 - - - strand 1, sense 0, length 2
v 2 s_-1_2 - - - strand 2, sense -1, length 2
Overall secondary structure representation
PDB lines:
HELIX 1 THR 3 MET 13 1
HELIX 2 ASN 24 ASN 34 1
HELIX 3 SER 50 GLN 60 1
SHEET 1 LYS 41 HIS 48 0
SHEET 2 MET 79 THR 87 -1
SUBDUE graphic input (elements ordered by position along the chain):
v 1 h_1_10
v 2 h_1_10
v 3 s_0_7
v 4 h_1_10
v 5 s_-1_8
e 1 2 sh
e 2 3 sh
e 3 4 sh
e 4 5 sh
Sequential relationship is represented as edge “sh”
[Figure: visualization of the elements from N-terminus to C-terminus]
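The overall representation can be sketched as below. This is an illustration, not the deck's secondary.c: it assumes elements are ordered by their starting sequence number (which reproduces the vertex order of the slide's example), and the `Element` type and function name are hypothetical.

```python
from typing import NamedTuple


class Element(NamedTuple):
    kind: str   # "h" for a helix, "s" for a strand
    tag: int    # helix type or strand sense
    start: int  # sequence number of the first residue
    end: int    # sequence number of the last residue


def secondary_structure_graph(elements):
    """Order elements along the chain and link neighbours with "sh" edges."""
    ordered = sorted(elements, key=lambda e: e.start)
    lines = [
        f"v {i} {e.kind}_{e.tag}_{e.end - e.start}"
        for i, e in enumerate(ordered, start=1)
    ]
    lines += [f"e {i} {i + 1} sh" for i in range(1, len(ordered))]
    return lines


# The five records from the slide, in PDB file order.
records = [
    Element("h", 1, 3, 13),
    Element("h", 1, 24, 34),
    Element("h", 1, 50, 60),
    Element("s", 0, 41, 48),
    Element("s", -1, 79, 87),
]
print("\n".join(secondary_structure_graph(records)))
```

Sorting by start position is what places the first strand (residues 41 to 48) between the second and third helices, even though the PDB file lists all HELIX records before the SHEET records.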
Tertiary structure and its representation
Sample PDB lines (last three columns are X, Y, Z):
ATOM CA ALA 1 10.369 0.997 10.519
ATOM CA ASN 2 6.691 0.239 9.830
vertex: backbone carbon; edge: distance (vs, s)
Distance (Å): $\text{distance} = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2 + (z_2 - z_1)^2}$
v 1 CA_ALA
v 2 CA_ASN
e 1 2 vs - - - very short distance
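A minimal sketch of the distance edge for the two sample atoms. The 4 Å and 6 Å cut-offs are read from the deck's later rationale slide on tertiary structure; treating both bounds as inclusive and emitting no edge beyond 6 Å are assumptions, as are the function names.

```python
import math


def ca_distance(a, b):
    """Euclidean distance between two (x, y, z) points, in Å."""
    return math.sqrt(sum((q - p) ** 2 for p, q in zip(a, b)))


def distance_label(d, very_short=4.0, short=6.0):
    """Map a distance to the symbolic edge labels vs / s."""
    if d <= very_short:
        return "vs"
    if d <= short:
        return "s"
    return None  # assumption: farther pairs get no edge in this sketch


ala = (10.369, 0.997, 10.519)  # CA of ALA 1
asn = (6.691, 0.239, 9.830)    # CA of ASN 2
d = ca_distance(ala, asn)
print(f"e 1 2 {distance_label(d)}  ({d:.2f} Å)")
```

Binning distances into symbolic labels keeps the graph independent of the coordinate origin and gives SUBDUE discrete edge labels to match.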
Rationale for representation choice - Criteria
Patterns identified by SUBDUE must be representative of each category
Patterns discovered by SUBDUE should discriminate one category from others
Primary sequence
vertex - a.a. residue name
edge - peptide bond
[Figure: ARG - bond - GLU - bond - ALA]
v 1 ARG
v 2 GLU
v 3 ALA
e 1 2 bond
e 2 3 bond
Secondary structure elements
Type of the helix
Starting and ending points (a.a. name and seq number)
[Figure: Helix 1 with type 1 and length 12, starting at ASN and ending at HIS, drawn from N-terminus to C-terminus]
Other ways of representing helix
Separate type and length
Combine type and length
[Figure: Helix 1 with separate type (1) and length (12) vertices, compared with a single combined vertex Helix_1_12]
Tertiary structure
(x, y, z) coordinates vary with the choice of origin
Avoid numeric values; use vs (≤ 4 Å), s (4 Å < dist ≤ 6 Å)
[Figure: C1 at (10.4, 1.0, 10.5) and C2 at (6.7, 0.2, 9.8), each linked to its x, y, z values, joined by a vs edge]
Results: Primary structure patterns
Hemo_seq (63/65)
Hemo_sequence: THR LYS THR TYR PHE PRO HIS PHE ASP LEU SER HIS GLY SER ALA GLN VAL LYS GLY HIS GLY LYS LYS VAL ALA ASP ALA LEU THR ASN ALA VAL ALA HIS VAL ASP ASP MET PRO ASN ALA LEU SER ALA LEU SER THR LEU ALA ALA HIS LEU PRO LAL GLU PHE THR PRO ALA VAL HIS ALA SER LEU ASP LYS PHE LEU ALA SER VAL SER THR VAL LEU THR SER LYS TYR
Myo_seq (67/103)
Myoglo_sequence: VAL LEU SER GLU GLY GLU TRP GLN LEU VAL LEU HIS VAL TRP ALA LYS VAL GLU ALA ASP VAL ALA GLY HIS GLY GLN ASP ILE LEU ILE ARG LEU PHE LYS SER HIS PRO GLU THR LEU GLU LYS PHE ASP ARG
Ribo_A (59/68)
Ribonuclease_A_sequence: GLY GLN THR ASN CYS TYR GLN SER TYR SER THR MET SER ILE THR ASP CYS ARG GLU THR GLY SER SER LYS TYR PRO ASN CYS ALA TYR LYS THR THR GLN ALA ASN LYS HIS ILE ILE VAL ALA CYS GLU GLY ASN PRO TYR VAL PRO VAL HIS PHE ASP ALA SER VAL
Primary structure patterns
Unique to each sample category
Hemoglobin and myoglobin proteins share little sequence similarity
Results: Hemo secondary structure patterns
The secondary structure patterns discovered in hemoglobin (Hemo)
Each cell gives the pattern name, its note number in brackets, and (# of instances in Hemo / Global_Other).
Exp. parameter   Pattern 1                      Pattern 2                       Pattern 3
Threshold 0.0    Hemo_s_1_0.0 [1] (50 / 0)      Hemo_s_2_0.0 [2] (52 / 0)       Hemo_s_3_0.0 [3] (50 / NA)
Threshold 0.1    Hemo_s_1_0.1 [4] (51 / NA)     Hemo_s_2_0.1 [5] (58 / NA)      Hemo_s_3_0.1 [6] (52 / NA)
Threshold 0.2    Hemo_s_1_0.2 [7] (90 / NA)     Hemo_s_2_0.2 [8] (98 / NA)      Hemo_s_3_0.2 [9] (92 / NA)
Threshold 0.3    Hemo_s_1_0.3 [10] (95 / NA)    Hemo_s_2_0.3 [11] (107 / NA)    Hemo_s_3_0.3 [12] (100 / NA)
[1]: h_1_14 -> h_1_15 -> h_1_6 -> h_1_6 -> h_1_19 -> h_1_8 -> h_1_18 -> h_1_20
[7]: h_1_15 -> h_1_15 -> h_1_6 -> h_1_1 -> h_1_19 -> h_1_8 -> h_1_18 -> h_1_20
Results: Myo secondary structure patterns
The secondary structure patterns discovered in myoglobin (Myo)
Each cell gives the pattern name, its note number in brackets, and (# of instances in Myo / Global_Other).
Exp. parameter   Pattern 1                     Pattern 2                     Pattern 3
Threshold 0.0    Myo_s_1_0.0 [1] (81 / 0)      Myo_s_2_0.0 [2] (82 / 0)      Myo_s_3_0.0 [3] (81 / 0)
Threshold 0.1    Myo_s_1_0.1 [4] (81 / NA)     Myo_s_2_0.1 [5] (84 / NA)     Myo_s_3_0.1 [6] (81 / NA)
Threshold 0.2    Myo_s_1_0.2 [7] (83 / NA)     Myo_s_2_0.2 [8] (84 / NA)     Myo_s_3_0.2 [9] (83 / NA)
Threshold 0.3    Myo_s_1_0.3 [10] (83 / NA)    Myo_s_2_0.3 [11] (84 / NA)    Myo_s_3_0.3 [12] (84 / NA)
[1]: h_1_15 -> h_1_15 -> h_1_6 -> h_1_6 -> h_1_19 -> h_1_9 -> h_1_18 -> h_1_25
Results: Ribo_A secondary structure patterns
The secondary structural patterns discovered in ribonuclease A (Ribo_A)
Each cell gives the pattern name, its note number in brackets, and (# of instances in Ribo_A / Global_Other).
Exp. parameter   Pattern 1                        Pattern 2                        Pattern 3
Threshold 0.0    Ribo_A_s_1_0.0 [1] (25 / 0)      Ribo_A_s_2_0.0 [2] (25 / 0)      Ribo_A_s_3_0.0 [3] (25 / 0)
Threshold 0.1    Ribo_A_s_1_0.1 [4] (27 / NA)     Ribo_A_s_2_0.1 [5] (27 / NA)     Ribo_A_s_3_0.1 [6] (27 / NA)
Threshold 0.2    Ribo_A_s_1_0.2 [7] (27 / NA)     Ribo_A_s_2_0.2 [8] (27 / NA)     Ribo_A_s_3_0.2 [9] (27 / NA)
Threshold 0.3    Ribo_A_s_1_0.3 [10] (36 / NA)    Ribo_A_s_2_0.3 [11] (36 / NA)    Ribo_A_s_3_0.3 [12] (36 / NA)
[1]: h_1_10 -> h_1_10 -> s_0_7 -> s_0_7 -> h_1_10 -> s_0_3 -> s_0_3 -> s_-1_4 -> s_-1_4 -> s_-1_8 -> s_-1_1 -> s_-1_10 -> s_-1_10 -> s_-1_8 -> s_-1_8 -> s_-1_5 -> s_-1_3
[10]: h_1_10 -> h_1_10 -> s_0_7 -> h_1_10 -> s_0_3 -> s_-1_4 -> s_-1_8 -> s_-1_8 -> s_-1_6
Results: Tertiary structural patterns
SUBDUE finds small patterns (2 or 3 a.a.)
Not unique for each category of proteins
Not biologically meaningful
Visualization of secondary structure patterns - hemoglobin
[Figure: complete hemoglobin structure and 2 instances of the pattern, N-terminus to C-terminus]
Visualization of secondary structure patterns - myoglobin
[Figure: complete myoglobin structure and 1 instance of the pattern, N-terminus to C-terminus]
Visualization of secondary structure patterns - ribonuclease_A
[Figure: complete ribonuclease_A structure and 1 instance of the pattern, N-terminus to C-terminus]
Discussion - Hemoglobin
Hemoglobin: A, B, C, D chains
Two types of patterns identified by SUBDUE: one for A, C chains, the other for B, D chains
Patterns exist in a majority of hemoglobin proteins
No instances of the best hemoglobin pattern found in other proteins in the global data set
Occurrence of hemo patterns
The occurrences of the best hemoglobin patterns
PDB Name      Occurrence                                Species
pdb2hhb.ent   B, D chains [1]; A, C chains [7]          human
pdb1sdl.ent   NO                                        human
pdb1bbb.ent   B chain [1]                               human
pdb4hhb.ent   B, D chains [1]; A, C chains [7]          human
pdb1thb.ent   A, C, B, D chains [7]                     human
pdb3hhb.ent   B chain [1]; A chain [7]                  human
pdb1sdk.ent   NO                                        human
pdb1cbm.ent   A, B, C, D chains [1]                     human
pdb1cls.ent   NO                                        human
pdb1hbb.ent   B, D chains [1]; A, C chains [7]          human
pdb1hba.ent   B, D chains [1]; A, C chains [7]          human
pdb2hbc.ent   B chain [1]; A chain [7]                  human
pdb1cbl.ent   A, B, C, D chains [1]                     human
pdb1hga.ent   N/A                                       human
pdb1hgb.ent   N/A                                       human
pdb1hgc.ent   N/A                                       human
pdb2hbd.ent   B chain [1]; A chain [7]                  human
pdb2hbf.ent   B chain [1]; A chain [7]                  human
pdb1hho.ent   B chain [1]; A chain [7]                  human
pdb1nih.ent   B, D chains [1]; A, C chains [7]          human
pdb1coh.ent   B, D chains [1]; A, C chains [7]          human
pdb1fdh.ent   G chain [1]                               human
pdb2hco.ent   B chain [1]; A chain [7]                  human
pdb1cmy.ent   B, D chains [1]                           human
pdb1hbs.ent   B,D,F,H chains [1]; A,C,E,G chains [7]    human
pdb1hco.ent   B chain [1]; A chain [7]                  human
N/A: secondary structure information not available
[1] instance found with a threshold of 0.0 (default): Hemo_s_1_0.0
[4] instance found with a threshold of 0.1: Hemo_s_1_0.1
[7] instance found with a threshold of 0.2: Hemo_s_1_0.2
[10] instance found with a threshold of 0.3: Hemo_s_1_0.3
Occurrence of hemo patterns - continued
The occurrences of the best hemoglobin patterns
PDB Name      Occurrence                          Species
pdb1bab.ent   B chain [1]; A, C chains [7]        human
pdb1dxu.ent   B, D chains [1]; A, C chains [7]    human
pdb1dxv.ent   B, D chains [1]; A, C chains [7]    human
pdb1dxt.ent   B, D chains [1]; A, C chains [7]    human
pdb1gbu.ent   B, D chains [10]                    human
pdb1gbv.ent   B, D chains [10]                    human
pdb1hdb.ent   NO                                  human
pdb1dsh.ent   NO                                  human
pdb2hhe.ent   N/A                                 human
pdb1gli.ent   B, D chains [1]; A, C chains [7]    human
pdb2hbe.ent   B chain [1]                         human
pdb1ibe.ent   NO                                  horse
pdb2mhb.ent   NO                                  horse
pdb2dhb.ent   B chain [4]; A chain [7]            horse
pdb1hds.ent   N/A                                 deer
pdb1hda.ent   B, D chains [1]                     bovine
pdb2pgh.ent   B, D chains [1]                     pig
pdb1out.ent   NO                                  trout
pdb1ouu.ent   NO                                  trout
pdb1pbx.ent   B chain [1]                         antarctic fish
pdb1hbh.ent   NO                                  antarctic fish
pdb1ith.ent   NO                                  innkeeper worm
N/A: secondary structure information not available
[1] instance found with a threshold of 0.0 (default): Hemo_s_1_0.0
[4] instance found with a threshold of 0.1: Hemo_s_1_0.1
[7] instance found with a threshold of 0.2: Hemo_s_1_0.2
[10] instance found with a threshold of 0.3: Hemo_s_1_0.3
Discussion - Myoglobin
Myoglobin: one chain
One dominant pattern identified by SUBDUE
Patterns exist in most myoglobin proteins
No instances of the best myoglobin pattern found in other proteins in the global data set
Discussion: Hemoglobin and Myoglobin
Similar secondary structure patterns
Hemoglobin B, D chains (from N- to C-terminus)
h_1_14 -> h_1_15 -> h_1_6 -> h_1_6 -> h_1_19 -> h_1_8 -> h_1_18 -> h_1_20
Myoglobin chain (from N- to C-terminus)
h_1_15 -> h_1_15 -> h_1_6 -> h_1_6 -> h_1_19 -> h_1_9 -> h_1_18 -> h_1_25
Hemoglobin A, C chains (from N- to C-terminus)
h_1_15 -> h_1_15 -> h_1_6 -> h_1_1 -> h_1_19 -> h_1_8 -> h_1_18 -> h_1_20
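The similarity of the three patterns above can be made concrete with a small position-by-position comparison. This is only an illustrative sketch: the helper names are hypothetical, and it assumes every label has the `h_type_length` form and that the patterns have equal length.

```python
def helix_lengths(pattern):
    """Extract the length field from each h_type_length label."""
    return [int(label.rsplit("_", 1)[1]) for label in pattern.split(" -> ")]


def length_differences(a, b):
    """Map 1-based helix position -> (length in a, length in b) where they differ."""
    pairs = zip(helix_lengths(a), helix_lengths(b))
    return {i: (x, y) for i, (x, y) in enumerate(pairs, start=1) if x != y}


hemo_bd = "h_1_14 -> h_1_15 -> h_1_6 -> h_1_6 -> h_1_19 -> h_1_8 -> h_1_18 -> h_1_20"
hemo_ac = "h_1_15 -> h_1_15 -> h_1_6 -> h_1_1 -> h_1_19 -> h_1_8 -> h_1_18 -> h_1_20"
myo = "h_1_15 -> h_1_15 -> h_1_6 -> h_1_6 -> h_1_19 -> h_1_9 -> h_1_18 -> h_1_25"

print("B, D chains vs myoglobin:", length_differences(hemo_bd, myo))
print("A, C chains vs myoglobin:", length_differences(hemo_ac, myo))
```

Each hemoglobin pattern differs from the myoglobin pattern in three of eight helices, and in both comparisons the final helix is 20 residues in hemoglobin against 25 in myoglobin.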
Discussion: Hemoglobin and Myoglobin
Consistent with the genetic studies
Hemoglobin and myoglobin share one ancestral gene
Divergence occurred in the course of evolution. One copy of the gene for myoglobin, four copies for hemoglobin.
The last helix of hemoglobin is shorter; one of the helices in the hemoglobin A, C chains almost disappears, allowing conformational change
Discussion: Ribonuclease A proteins
All patterns have three helices of the same size
Several strands appear twice, indicating participation in the formation of two sheets.
Ribonuclease S protein (S-protein fragment) also has the pattern.
Conclusion of the results
Secondary structure patterns discovered by SUBDUE are representative of each category
Secondary structure patterns discovered by SUBDUE are distinct for each category
SUBDUE has the ability to discover biologically interesting patterns from PDB and other similar MB databases
Comparison with other related studies
Different graphic representation
Predefined patterns with exact or inexact graph match
Not applied systematically to PDB or other DB
SUBDUE would perform a similar task if the inexact graph match routine is incorporated
Conclusions of the study
Abstraction of 3D structure to its secondary structural elements is suitable for discovery
The secondary structure patterns SUBDUE discovered for each category can be used as a signature for its class
Inexact graph match is useful for finding similar patterns
SUBDUE is suitable for knowledge discovery in MB structural DB
Future Research
More consistent and detailed description of secondary structure
Add relative positions of the secondary structural elements to represent spatial relationships
Investigate alternative representations: more suitable 3D coordinate representation; weighting on different edges
Inexact graph match in predefined substructure
More collaboration with domain scientists