Protein Classification with Multiple Algorithms

S. Diplaris, G. Tsoumakas, P. A. Mitkas, I. Vlahavas

Intelligent Systems and Software Engineering Lab (ISSEL) – ECE – AUTH
Machine Learning and Knowledge Discovery Group (MLKD) – CSD – AUTH

10th Panhellenic Conference in Informatics
Volos, 11-13 November 2005



18/04/23 Protein Classification with Multiple Algorithms 2

Aristotle University of Thessaloniki

Outline

• Introduction
• Motif-based protein classification
• Combining classification methods
• Experiments
• Results and Discussion
• Conclusions


Introduction

• The number of protein sequences in public biological databases is constantly increasing.
  • These will be dwarfed by the sequences from the environmental sequencing projects currently underway.
• There is a growing imbalance between the number of sequences in databases and the information about their structure and function.
• Protein function prediction can save time and money.


Discovering protein functionality

• Identification of a protein's biological effect can be accomplished in two ways:
  • Time-consuming and expensive experiments that are not always applicable (in vitro)
  • Computational methods, such as data mining (in silico)


Protein families

• According to their functionality, proteins are categorized into families.
• Proteins belonging to the same family are structurally related and therefore have similar properties.


Protein Motifs and Profiles

The behavior of a protein is a function of many motifs and profiles, where some overpower others.

Profiles are computational representations of multiple sequence alignments using hidden Markov models.

Motifs are short conserved sub-sequences that usually correspond to active or functional sites.


Problem Description

[Diagram labels: Unknown protein sequence; Known-family proteins; "What are its properties?"]


Background

• Databases
  • Protein databases (Prosite, Swiss-Prot)
  • Motif/profile databases (Pfam, Prints)
• Direct knowledge of protein function from motifs is impossible
  • Use machine learning methods to discover similarities in protein chains
  • Classify proteins into families using their motifs and profiles


Protein Classification

[Diagram labels: Motif-based protein data; Classification algorithm; Classifier induction; 10-fold validation; Algorithm evaluation; Unknown protein; Prediction of protein function]
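The evaluation step in the diagram can be illustrated with a minimal sketch of 10-fold cross-validation in plain Python. The function names (k_fold_indices, cross_validated_error, train_majority) are hypothetical, and the majority-class learner is only a stand-in for a real classification algorithm; the slides do not say how the folds were built.

```python
import random
from collections import Counter

def k_fold_indices(n_instances, k=10, seed=0):
    """Yield (train, test) index lists; each instance is tested exactly once.

    Assumes n_instances >= k so that no fold is empty.
    """
    indices = list(range(n_instances))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

def cross_validated_error(instances, labels, train_fn, k=10, seed=0):
    """Mean error rate over k folds; train_fn(X, y) returns a predict function."""
    fold_errors = []
    for train_idx, test_idx in k_fold_indices(len(instances), k, seed):
        model = train_fn([instances[i] for i in train_idx],
                         [labels[i] for i in train_idx])
        wrong = sum(model(instances[i]) != labels[i] for i in test_idx)
        fold_errors.append(wrong / len(test_idx))
    return sum(fold_errors) / len(fold_errors)

def train_majority(X, y):
    """Placeholder learner (an assumption): always predicts the majority class."""
    majority = Counter(y).most_common(1)[0][0]
    return lambda x: majority

# Toy data: 20 identical instances with a single class, so the error is 0.
print(cross_validated_error([[0]] * 20, ["A"] * 20, train_majority))
```

Any learner that follows the same train_fn(X, y) convention could be plugged in instead of the placeholder.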


Data Preprocessing (1/2)

[Diagram labels: PATTERNS; PROFILES; PROTEIN CLASSES; GenMiner; OUTPUT FILE]

Protein: P04591 (Gag polyprotein)

Belongs in class: PDOC50158

Contains motif: PS50158 (ZF_CCHC)


Data Preprocessing (2/2)

VLAEAMSQVT NSATIMMQRG NFRNQRKIVK CFNCGKEGHT ARNCRAPRKK GCWKCGKEGH

m3 m5 m6

0 0 1 0 1 0 1 0 ...

Protein Χ

Motifs in protein Χ

N-bit binary pattern
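As a rough illustration of the N-bit binary pattern, the sketch below sets one bit per known motif when that motif occurs in the protein. The function name encode_motifs, the 8-motif vocabulary and the assumption that bit i corresponds to motif m(i) are all hypothetical; the slides do not specify how motifs are ordered in the pattern.

```python
def encode_motifs(protein_motifs, motif_vocabulary):
    """Return an N-bit list: bit i is 1 if the protein contains motif i."""
    present = set(protein_motifs)
    return [1 if motif in present else 0 for motif in motif_vocabulary]

# Hypothetical vocabulary m1..m8; the example protein contains m3, m5 and m6.
vocabulary = [f"m{i}" for i in range(1, 9)]
pattern = encode_motifs(["m3", "m5", "m6"], vocabulary)
print(pattern)
```

With the 1182 motifs of the experimental dataset, each protein would become a 1182-bit vector under this scheme.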


Combining Multiple Classification Algorithms

• Motivation: accuracy improvement
• Algorithms show different:
  • Biases for generalizing from examples
  • Knowledge representation
• Each algorithm tends to err on different parts of the instance space
• Solution: efficient combination of algorithms to correct uncorrelated errors
  • Classifier selection
  • Classifier fusion


Classifier Selection

• Select a single algorithm for classifying a new instance
• Known approaches:
  • SelectBest: evaluation and selection (ES)
  • Selection based on performance on similar learning domains: estimate performance in k-nearest neighbors and rank algorithms
  • Dynamic Selection: use of a different algorithm in different parts of the instance space
  • Dynamic Weighting: local performance around the meta-instance space
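The SelectBest idea (evaluate every algorithm, then keep the one with the lowest estimated error) can be sketched as follows. The function name select_best, the algorithm names and the per-fold error values are hypothetical and are not taken from the experiments.

```python
def select_best(fold_errors):
    """Return (name, mean_error) of the algorithm with the lowest mean error."""
    mean_error = {name: sum(errs) / len(errs) for name, errs in fold_errors.items()}
    best = min(mean_error, key=mean_error.get)
    return best, mean_error[best]

# Hypothetical per-fold error rates for three algorithms.
fold_errors = {
    "alg_a": [0.10, 0.20],
    "alg_b": [0.05, 0.05],
    "alg_c": [0.30, 0.10],
}
best_name, best_error = select_best(fold_errors)
print(best_name, best_error)
```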


Classifier Fusion

• Fuse decisions from all algorithms
• Known approaches:
  • Voting (V)
    • Each model outputs a class value; the majority class wins
  • Weighted Voting (WV)
    • Each model votes with a coefficient based on its accuracy
  • Stacking with Multi-Response Model Trees (SMT)
    • Learn a meta-level model that predicts the correct class based on the base-level classifiers
    • Most accurate classifier of the Stacking family
  • Selective Fusion (SF)
    • Use statistical procedures to select the best sub-group of classifiers
    • Use WV in this subgroup to decide
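The difference between plain and weighted voting can be illustrated with a small sketch. The model names, predictions and accuracies are hypothetical, and using each model's accuracy directly as its voting coefficient is only one simple choice; the slides do not state how the coefficient is derived.

```python
from collections import Counter, defaultdict

def majority_vote(predictions):
    """Plain voting: the class predicted by most models wins.

    Ties go to the class that was encountered first.
    """
    return Counter(predictions.values()).most_common(1)[0][0]

def weighted_vote(predictions, accuracies):
    """Weighted voting: each model adds its accuracy to the class it predicts."""
    scores = defaultdict(float)
    for model, predicted_class in predictions.items():
        scores[predicted_class] += accuracies[model]
    return max(scores, key=scores.get)

# Hypothetical ensemble: three weak models agree on "A", two strong ones on "B".
predictions = {"m1": "A", "m2": "B", "m3": "B", "m4": "A", "m5": "A"}
accuracies = {"m1": 0.3, "m2": 0.9, "m3": 0.9, "m4": 0.3, "m5": 0.3}

print(majority_vote(predictions))              # plain vote follows the weak majority
print(weighted_vote(predictions, accuracies))  # weighted vote follows the accurate models
```

In this toy case the plain vote returns "A" while the weighted vote returns "B", which shows how poorly performing models can dominate an unweighted vote.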


Experiments

• Dataset:
  • 10 most important protein families
  • 662 proteins
  • 1182 motifs
  • Some proteins belonged to more than one class
    • Create separate classes for these groups of proteins
    • Finally: 32 different classes
• Evaluation: 10-fold cross-validation
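One way to read "create separate classes for these groups of proteins" is that every distinct combination of families becomes its own class label. The sketch below shows that transformation under this assumption; the function name combine_labels, the protein names and the family names are hypothetical.

```python
def combine_labels(protein_classes):
    """Map each protein's set of family labels to one combined class name."""
    return {
        protein: "+".join(sorted(classes))
        for protein, classes in protein_classes.items()
    }

# Hypothetical proteins; prot_b and prot_c belong to the same two families.
protein_classes = {
    "prot_a": {"F1"},
    "prot_b": {"F1", "F2"},
    "prot_c": {"F2", "F1"},
    "prot_d": {"F3"},
}
combined = combine_labels(protein_classes)
print(combined)
print(len(set(combined.values())), "distinct classes")
```

Applied to the dataset described above, a transformation of this kind would explain how 10 families can yield 32 classes.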


Experiments

• 9 classification algorithms
  • DT, C4.5: decision trees
  • RIPPER (JRip), PART: rule learning
  • K-nearest neighbor (IBk)
  • K*: instance-based algorithm with entropic distance measure
  • Naïve-Bayes algorithm (NB)
  • SMO: Sequential Minimal Optimization for training a support vector classifier using polynomial kernels
  • RBF: Radial basis function network
• 5 classifier combination methods: SMT, V, WV, SB, SF


Results 1: Individual classifier comparison

[Bar chart: mean error rate per classifier]

Classifier    Mean error rate
RBF           0.739
NB            0.61
JRip          0.035
PART          0.026
K*            0.024
C4.5          0.024
DT            0.024
IBk           0.023
SMO           0.021


Discussion 1

• The reputation of SVM as a state-of-the-art classification method is verified
  • Decision Trees and Instance-Based learning also perform well
  • Naïve-Bayes and RBF exhibit quite low performance
• A biologist could use SMO, but
  • The rest of the well-performing algorithms could generalize better
  • By combining all these algorithms or a subset of them we get…


Results 2: Comparison of ensemble methods

[Bar chart: mean error rate per ensemble method]

Ensemble method    Mean error rate
SMT                0.558
V                  0.195
SB                 0.024
WV                 0.021
SF                 0.019


Discussion 2

• WV performed better than SB
  • The voting procedure corrects uncorrelated errors
• V did not perform well
  • Poorly performing models
• State-of-the-art SMT performed very badly
  • Large number of classes led to high dimensionality in the meta-level dataset
• SF performed best
  • Selected the best subset of models and fused them
  • 6.3 models on average over the 10 folds
• Combination of multiple algorithms using appropriate selection results in error reduction


Conclusions

• Comparative study of different classification algorithms and algorithm combination methods in motif-based protein classification
• To successfully apply the algorithms we need:
  • Multiple algorithms
  • A proper method to discard bad-performing algorithms and combine the best


Future issues

• Multi-label instances
• Effectiveness of alternative representations of the problem
• Discover new profiles directly from the protein sequences