Volos, 11-13 November 2005
Intelligent Systems and Software Engineering Lab (ISSEL) – ECE – AUTH
10th Panhellenic Conference in Informatics
Machine Learning and Knowledge Discovery Group (MLKD) – CSD – AUTH
Protein Classification with Multiple Algorithms
S. Diplaris, G. Tsoumakas, P. A. Mitkas, I. Vlahavas
18/04/23 Protein Classification with Multiple Algorithms 2
Aristotle University of Thessaloniki
Outline
• Introduction
• Motif-based protein classification
• Combining classification methods
• Experiments
• Results and Discussion
• Conclusions
Introduction
• The number of protein sequences in public biological databases is constantly increasing.
• These will be dwarfed by the sequences from the environmental sequencing projects currently underway.
• There is a growing imbalance between the number of sequences in databases and the information about their structure and function.
• Protein function prediction can save time and money.
Discovering protein functionality
Identification of a protein’s biological effect can be accomplished in two ways:
• Time-consuming and expensive experiments, which are not always applicable (in vitro)
• Computational methods, such as data mining (in silico)
Protein families
• According to their functionality, proteins are categorized into families.
• Proteins belonging to the same family are structurally related and thus have similar properties.
Protein Motifs and Profiles
• The behavior of a protein is a function of many motifs and profiles, some of which dominate others.
• Profiles are computational representations of multiple sequence alignments using hidden Markov models.
• Motifs are short conserved sub-sequences that usually correspond to active or functional sites.
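To make the idea of a motif concrete, the sketch below scans a sequence for the commonly cited CCHC zinc-knuckle consensus C-x(2)-C-x(4)-H-x(4)-C with a regular expression. The regex and the `find_motif` helper are assumptions made for this illustration and are not taken from the slides; the sequence is the protein fragment shown on the Data Preprocessing (2/2) slide.

```python
import re

# Assumed consensus for a CCHC zinc knuckle: C-x(2)-C-x(4)-H-x(4)-C.
# This regex is illustrative only; it is not copied from any database entry.
CCHC_PATTERN = re.compile(r"C.{2}C.{4}H.{4}C")


def find_motif(sequence, pattern=CCHC_PATTERN):
    """Return (start_index, matched_text) for each non-overlapping match."""
    return [(m.start(), m.group()) for m in pattern.finditer(sequence)]


# Fragment shown on the Data Preprocessing (2/2) slide, with spaces removed.
FRAGMENT = (
    "VLAEAMSQVT NSATIMMQRG NFRNQRKIVK "
    "CFNCGKEGHT ARNCRAPRKK GCWKCGKEGH"
).replace(" ", "")

if __name__ == "__main__":
    for start, text in find_motif(FRAGMENT):
        print(start, text)
```

The fragment ends before a second occurrence of the pattern could complete, so only one match is reported for it.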
Problem Description
[Figure: an unknown protein sequence shown next to known-family proteins, with the question "What are its properties?"]
Background
• Databases
  • Protein databases (Prosite, Swiss-Prot)
  • Motif/profile databases (Pfam, Prints)
• Direct knowledge of protein function from motifs is impossible
  • Use machine learning methods to discover similarities in protein chains
  • Classify proteins into families using their motifs and profiles
Protein Classification
[Figure: classification workflow with boxes labeled motif-based protein data, classification algorithm, classifier induction, 10-fold validation, algorithm evaluation, unknown protein, and prediction of protein function]
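The 10-fold validation step in this workflow can be sketched as follows. This is a minimal, unstratified splitter written for illustration; the function names and the `train_fn` / `predict_fn` callables are hypothetical, and the slides do not say which tool or splitting scheme was actually used.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_idx, test_idx) pairs; each sample is tested exactly once.

    Assumes n_samples >= k so that no fold is empty.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th shuffled index starting at offset i.
    folds = [indices[i::k] for i in range(k)]
    for i, test_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx


def cross_validated_error(train_fn, predict_fn, X, y, k=10):
    """Mean error rate over k folds for a hypothetical train/predict pair."""
    errors = []
    for train_idx, test_idx in k_fold_indices(len(X), k):
        model = train_fn([X[i] for i in train_idx], [y[i] for i in train_idx])
        wrong = sum(predict_fn(model, X[i]) != y[i] for i in test_idx)
        errors.append(wrong / len(test_idx))
    return sum(errors) / len(errors)
```

Averaging the per-fold error rates gives a single mean error rate per algorithm, which is the kind of figure reported in the results slides.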
Data Preprocessing (1/2)
[Figure: GenMiner takes PATTERNS, PROFILES, and PROTEIN CLASSES as input and produces an OUTPUT FILE. Example record:]
Protein: P04591 (Gag polyprotein)
Belongs in class: PDOC50158
Contains motif: PS50158 (ZF_CCHC)
Data Preprocessing (2/2)
Protein Χ: VLAEAMSQVT NSATIMMQRG NFRNQRKIVK CFNCGKEGHT ARNCRAPRKK GCWKCGKEGH
Motifs in protein Χ: m3 m5 m6
N-bit binary pattern: 0 0 1 0 1 0 1 0 ...
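The encoding step, turning the set of motifs found in a protein into an N-bit binary pattern, might look like the sketch below. The `encode_motifs` helper, the 8-motif vocabulary, and the convention that bit i stands for motif m_i are all assumptions for illustration; the slides do not specify the bit ordering.

```python
def encode_motifs(present_motifs, vocabulary):
    """Return a list of 0/1 flags, one per motif, in vocabulary order."""
    present = set(present_motifs)
    return [1 if motif in present else 0 for motif in vocabulary]


# Hypothetical 8-motif vocabulary m1..m8, used only to keep the example small.
VOCABULARY = [f"m{i}" for i in range(1, 9)]

if __name__ == "__main__":
    # Under the assumed ordering, motifs m3, m5 and m6 set bits 3, 5 and 6.
    print(encode_motifs(["m3", "m5", "m6"], VOCABULARY))
```

Each protein then becomes one fixed-length row of a training table, which is the input format most classification algorithms expect.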
Combining Multiple Classification Algorithms
• Motivation: accuracy improvement
• Algorithms show different:
  • Biases for generalizing from examples
  • Knowledge representation
• Each algorithm tends to err on different parts of the instance space
• Solution: efficient combination of algorithms to correct uncorrelated errors
  • Classifier selection
  • Classifier fusion
Classifier Selection
• Select a single algorithm for classifying a new instance
• Known approaches:
  • SelectBest: evaluation and selection (ES)
  • Select upon performance on similar learning domains: estimate performance in k-nearest neighbors and rank algorithms
  • Dynamic Selection: use of a different algorithm in different parts of the instance space
  • Dynamic Weighting: local performance around the meta-instance space
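The simplest of these approaches, SelectBest, amounts to estimating each algorithm's error and keeping the one with the lowest estimate. A minimal sketch, assuming the error estimates have already been computed (for example by cross-validation on the training data); the `select_best` name and the numbers are hypothetical:

```python
def select_best(validation_errors):
    """Return the name of the algorithm with the lowest estimated error.

    If several algorithms tie, the first one in the mapping is returned.
    """
    return min(validation_errors, key=validation_errors.get)


if __name__ == "__main__":
    # Hypothetical error estimates, not results from the slides.
    estimates = {"tree": 0.12, "rules": 0.15, "svm": 0.08}
    print(select_best(estimates))
```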
Classifier Fusion
• Fuse decisions from all algorithms
• Known approaches:
  • Voting (V)
    • Each model outputs a class value, the majority class wins
  • Weighted Voting (WV)
    • Each model votes with a coefficient based on its accuracy
  • Stacking with Multi-Response Model Trees (SMT)
    • Learn a meta-level model that predicts the correct class based on the base-level classifiers
    • Most accurate classifier of the Stacking family
  • Selective Fusion (SF)
    • Use statistical procedures to select the best subgroup of classifiers
    • Use WV in this subgroup to decide
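The two voting schemes can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, and using each model's accuracy directly as its vote weight is only one possible choice of coefficient.

```python
from collections import Counter, defaultdict


def majority_vote(predictions):
    """Return the class predicted by the most models (ties: first seen)."""
    return Counter(predictions).most_common(1)[0][0]


def weighted_vote(predictions, accuracies):
    """Add each model's accuracy to the class it predicts; highest total wins."""
    totals = defaultdict(float)
    for label, accuracy in zip(predictions, accuracies):
        totals[label] += accuracy
    return max(totals, key=totals.get)


if __name__ == "__main__":
    # Hypothetical case: one accurate model disagrees with two weak ones.
    predictions = ["A", "B", "B"]
    accuracies = [0.98, 0.30, 0.40]
    print(majority_vote(predictions))              # plain vote
    print(weighted_vote(predictions, accuracies))  # accuracy-weighted vote
```

In this hypothetical case plain voting follows the two weak models, while weighted voting sides with the single accurate one, which shows how poorly performing models can hurt unweighted voting.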
Experiments
• Dataset:
  • 10 most important protein families
  • 662 proteins
  • 1182 motifs
  • Some proteins belonged to more than one class
    • Create separate classes for these groups of proteins
    • Finally: 32 different classes
• Evaluation: 10-fold cross-validation
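Creating separate classes for proteins that belong to more than one family can be sketched as building one composite label per distinct set of families. The helper name, the "+" separator, and the sample data are hypothetical; the slides do not describe how the 32 classes were actually named or constructed.

```python
def combine_labels(protein_classes):
    """Map each protein to one composite label built from its sorted class set."""
    return {
        protein: "+".join(sorted(set(classes)))
        for protein, classes in protein_classes.items()
    }


if __name__ == "__main__":
    # Hypothetical proteins and family names, for illustration only.
    sample = {"p1": ["F1"], "p2": ["F2", "F1"], "p3": ["F1", "F2"]}
    print(combine_labels(sample))
```

Sorting the families before joining them means that the same group always receives the same label regardless of the order in which its families are listed, so a multi-label problem becomes an ordinary single-label one.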
Experiments
• 9 classification algorithms
  • DT, C4.5: decision trees
  • RIPPER (JRip), PART: rule learning
  • K-nearest neighbor (IBk)
  • K*: instance-based algorithm with entropic distance measure
  • Naïve Bayes algorithm (NB)
  • SMO: Sequential Minimal Optimization for training a support vector classifier using polynomial kernels
  • RBF: Radial basis function network
• 5 classifier combination methods: SMT, V, WV, SB, SF
Results 1
Individual classifier comparison [bar chart: mean error rate per classifier]
Classifier   Mean error rate
RBF          0.739
NB           0.61
JRip         0.035
PART         0.026
K*           0.024
C4.5         0.024
DT           0.024
IBk          0.023
SMO          0.021
Discussion 1
• The reputation of SVM as a state-of-the-art classification method is verified
• Decision trees and instance-based learning also perform well
• Naïve Bayes and RBF exhibit quite low performance
• A biologist could use SMO, but
  • The rest of the well-performing algorithms could generalize better
  • By combining all these algorithms or a subset of them we get…
Results 2
Comparison of ensemble methods [bar chart: mean error rate per ensemble method]
Ensemble method   Mean error rate
SMT               0.558
V                 0.195
SB                0.024
WV                0.021
SF                0.019
Discussion 2
• WV performed better than SB
  • The voting procedure corrects uncorrelated errors
• V did not perform well
  • Poorly performing models took part in the vote
• State-of-the-art SMT performed very badly
  • Large number of classes led to high dimensionality in the meta-level dataset
• SF performed best
  • Selected the best subset of models and fused them
  • 6.3 models selected on average over the 10 folds
• Combination of multiple algorithms using appropriate selection results in error reduction
Conclusions
• Comparative study of different classification algorithms and algorithm combination methods in motif-based protein classification
• To successfully apply the algorithms we need:
  • Multiple algorithms
  • A proper method to discard bad-performing algorithms and combine the best