

Artificial Intelligence Research Laboratory
Bioinformatics and Computational Biology Program
Computational Intelligence, Learning, and Discovery Program
Department of Computer Science

RECOMB 2007

Acknowledgements: This work is supported in part by a grant from the National Institutes of Health (GM 066387) to Vasant Honavar & Drena Dobbs

Glycosylation Site Prediction using Machine Learning Approaches
Cornelia Caragea, Jivko Sinapov, Adrian Silvescu, Drena Dobbs and Vasant Honavar

Biological Motivation
Glycosylation is one of the most complex post-translational modifications (PTMs). It is the site-specific enzymatic addition of saccharides to proteins and lipids. Most proteins in eukaryotic cells undergo glycosylation.

Types of Glycosylation

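[Figure: an example amino acid sequence, M K L I T I L C F L S R L L P S L T Q E S S Q E I D, drawn from the N-terminus (H3N+) to the C-terminus (COO-), with candidate residues marked "O-Glycosylated?" or "Non O-Glycosylated?"]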

Problem: Predict glycosylation sites from amino acid sequence

Previous Approaches
• Trained neural networks, used in the NetOGlyc prediction server (Hansen et al., 1995)
  • Dataset: mucin-type O-linked glycosylation sites in mammalian proteins

• Trained SVMs based on physical properties, a 0/1 encoding system, and a combination of the two (Li et al., 2006)
  • Dataset: mucin-type O-linked glycosylation sites in mammalian proteins
  • Negative examples extracted from sequences with no known glycosylated sites
  • Trained and tested using different ratios of positive and negative sites

Our Approach
• We investigate three types of glycosylation (N-, C-, and O-linked) and use an ensemble classifier approach
• Dataset: N-, C-, and O-linked glycosylation sites in proteins from several different species: human, rat, mouse, insect, worm, horse, etc.
• Negative examples extracted from sequences with at least one experimentally verified glycosylated site
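The negative-example rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `extract_windows`, the window half-width, the padding character, and the 0-based position convention are all hypothetical and are not taken from the poster.

```python
from typing import List, Set, Tuple


def extract_windows(
    sequence: str,
    glyco_positions: Set[int],
    targets: str = "ST",      # candidate residues, e.g. S/T for O-linked sites
    half_width: int = 5,      # hypothetical window half-width (assumption)
    pad: str = "-",           # hypothetical padding symbol (assumption)
) -> Tuple[List[str], List[str]]:
    """Return (positive, negative) windows centred on candidate residues.

    glyco_positions holds 0-based indices of experimentally verified sites.
    A sequence with no verified site contributes no examples at all, so every
    negative comes from a protein known to be glycosylated somewhere.
    """
    if not glyco_positions:
        return [], []
    padded = pad * half_width + sequence + pad * half_width
    positives: List[str] = []
    negatives: List[str] = []
    for i, residue in enumerate(sequence):
        if residue not in targets:
            continue
        # Residue i sits at index i + half_width in the padded string.
        window = padded[i : i + 2 * half_width + 1]
        if i in glyco_positions:
            positives.append(window)
        else:
            negatives.append(window)
    return positives, negatives
```

For N-linked or C-linked sites the same sketch would be called with `targets="N"` or `targets="W"`.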

Dataset
O-GlycBase v6.00: O-, N-, and C-glycosylated proteins, with 242 glycosylated entries, available at http://www.cbs.dtu.dk/databases/OGLYCBASE/Oglyc.base.html

Glycosylation Type    Positive Sites    Negative Sites
O-Linked (S/T)        2098              11623
N-Linked (N)          251               1430
C-Linked (W)          47                73
Total                 2366              13126

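[Figure: Training an ensemble classifier. The training database (Train DB) is sampled into subsets S1, S2, S3, ..., Sk; a classifier Ci is trained on each subset Si, giving a bag of trained classifiers C1, C2, C3, ..., Ck. Their predictions on the test database (Test DB) are combined by weighted majority vote.]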
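The sample, train, and vote pipeline in the figure can be sketched as follows. The poster does not specify how subsets are drawn or how votes are weighted, so both are assumptions here: each subset keeps all positives plus an equal-sized random draw of negatives, and each classifier is weighted by its accuracy on the training data. The base learner is a small position-wise Naïve Bayes over identity windows, and every name is hypothetical.

```python
import math
import random
from collections import Counter
from typing import Callable, List, Sequence, Tuple

Example = Tuple[str, int]            # (window, label), label in {0, 1}
Classifier = Callable[[str], int]


def train_naive_bayes(data: Sequence[Example]) -> Classifier:
    """Position-wise Naive Bayes over identity windows, add-one smoothing."""
    width = len(data[0][0])
    class_counts = Counter(label for _, label in data)
    counts = {c: [Counter() for _ in range(width)] for c in (0, 1)}
    for window, label in data:
        for pos, aa in enumerate(window):
            counts[label][pos][aa] += 1

    def predict(window: str) -> int:
        scores = {}
        for c in (0, 1):
            score = math.log((class_counts[c] + 1) / (len(data) + 2))
            for pos, aa in enumerate(window):
                # 21 = 20 amino acids plus one padding symbol (assumption).
                score += math.log((counts[c][pos][aa] + 1) / (class_counts[c] + 21))
            scores[c] = score
        return max(scores, key=scores.get)

    return predict


def train_ensemble(
    train: Sequence[Example],
    k: int = 5,
    seed: int = 0,
    base: Callable[[Sequence[Example]], Classifier] = train_naive_bayes,
) -> List[Tuple[Classifier, float]]:
    """Train k classifiers on sampled subsets; return (classifier, weight) pairs."""
    rng = random.Random(seed)
    pos = [ex for ex in train if ex[1] == 1]
    neg = [ex for ex in train if ex[1] == 0]
    members = []
    for _ in range(k):
        # Subset S_i: all positives plus an equal number of sampled negatives
        # (assumed sampling scheme, not stated on the poster).
        subset = pos + rng.sample(neg, min(len(neg), len(pos)))
        clf = base(subset)
        # Vote weight: accuracy on the full training set (assumed scheme).
        weight = sum(clf(w) == y for w, y in train) / len(train)
        members.append((clf, weight))
    return members


def predict_ensemble(members: List[Tuple[Classifier, float]], window: str) -> int:
    """Weighted majority vote: each member votes +weight or -weight."""
    vote = sum(w if clf(window) == 1 else -w for clf, w in members)
    return 1 if vote > 0 else 0
```

Any other base learner with the same `train -> predict` shape, such as an SVM or a decision tree, could be passed in through the `base` argument.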

Classifiers
• SVM
  • 0/1 String Kernel
  • Substitution Matrix Kernel
  • PSI-BLAST PSSM with Polynomial Kernel
• Decision Tree
• Naïve Bayes
• Identity windows
• Identity plus additional information

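$$K(x, y) = \left( \sum_{i=1}^{|w|} S(x_i, y_i) \right)^{e}$$

where
• 0/1 String Kernel: $S(x_i, y_i) = 1$ if $x_i = y_i$ and $0$ otherwise
• Substitution Matrix Kernel: $S(x_i, y_i)$ = entry $(x_i, y_i)$ in the BLOSUM62 matrix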
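A minimal sketch of the two window kernels, assuming the formula is read as the sum of per-position scores over a window of length |w|, raised to an exponent e. The function names are hypothetical. The substitution table below is only a six-entry excerpt of BLOSUM62 for illustration; a real run would load the full matrix.

```python
from typing import Callable, Dict, Tuple

Score = Callable[[str, str], float]


def identity_score(a: str, b: str) -> float:
    """0/1 string kernel score: 1 if the residues match, 0 otherwise."""
    return 1.0 if a == b else 0.0


def matrix_score(matrix: Dict[Tuple[str, str], float]) -> Score:
    """Build a score function from a symmetric substitution matrix."""
    def score(a: str, b: str) -> float:
        if (a, b) in matrix:
            return matrix[(a, b)]
        return matrix[(b, a)]   # raises KeyError if the pair is missing
    return score


def window_kernel(x: str, y: str, score: Score = identity_score, exponent: int = 1) -> float:
    """K(x, y) = (sum_i S(x_i, y_i)) ** exponent for equal-length windows."""
    if len(x) != len(y):
        raise ValueError("windows must have the same length")
    return sum(score(a, b) for a, b in zip(x, y)) ** exponent


# Excerpt of BLOSUM62 (A, S, T only), for illustration.
BLOSUM62_EXCERPT: Dict[Tuple[str, str], float] = {
    ("A", "A"): 4, ("S", "S"): 4, ("T", "T"): 5,
    ("A", "S"): 1, ("A", "T"): 0, ("S", "T"): 1,
}
```

With the identity score the kernel counts matching positions; with the matrix score, similar but non-identical residues still contribute, which is the motivation for the substitution matrix variant.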

[Figure: Types of glycosylation]
• N-linked glycosylation: N-acetylglucosamine (N-GlcNAc)
• O-linked glycosylation: O-N-acetylgalactosamine (O-GalNAc), O-N-acetylglucosamine (O-GlcNAc), O-fucose, O-glucose, O-mannose, O-hexose, O-xylose
• C-mannosylation: C-mannose
• GPI anchor

Results
[Figures: ROC curves for N-linked, O-linked, and C-linked glycosylation site prediction; comparison of ROC curves for single and ensemble classifiers]

Conclusion
In this work we addressed the problem of predicting O-, N-, and C-linked glycosylation sites from protein sequences. We trained and evaluated an ensemble classifier in conjunction with SVM, Naïve Bayes, and Decision Tree models. Our experiments show that the ensemble classifier approach achieves low generalization error and can outperform a single trained classifier.