

Artificial Intelligence Research Laboratory
Bioinformatics and Computational Biology Program
Computational Intelligence, Learning, and Discovery Program
Department of Computer Science

RECOMB 2007

Acknowledgements: This work is supported in part by a grant from the National Institutes of Health (GM 066387) to Vasant Honavar & Drena Dobbs

Glycosylation Site Prediction using Machine Learning Approaches

Cornelia Caragea, Jivko Sinapov, Adrian Silvescu, Drena Dobbs and Vasant Honavar

Biological Motivation

Glycosylation is one of the most complex post-translational modifications (PTMs). It is the site-specific enzymatic addition of saccharides to proteins and lipids. Most proteins in eukaryotic cells undergo glycosylation.

Types of Glycosylation

[Figure: an amino acid sequence drawn from the N-terminus (H3N+) to the C-terminus (COO-), with candidate residues marked "O-Glycosylated?" or "Non O-Glycosylated?"]

Problem: Predict glycosylation sites from amino acid sequence

Previous Approaches

• Neural networks trained for the NetOGlyc prediction server (Hansen et al., 1995)
    • Dataset: mucin-type O-linked glycosylation sites in mammalian proteins

• SVMs trained on physical properties, a 0/1 system, and a combination of the two (Li et al., 2006)
    • Dataset: mucin-type O-linked glycosylation sites in mammalian proteins
    • Negative examples extracted from sequences with no known glycosylated sites
    • Trained and tested using different ratios of positive and negative sites

Our Approach

• We investigate three types of glycosylation and use an ensemble classifier approach
• Dataset: N-, C- and O-linked glycosylation sites in proteins from several species: human, rat, mouse, insect, worm, horse, etc.
• Negative examples extracted from sequences with at least one experimentally verified glycosylated site
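The site-extraction step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function name `extract_windows`, the window half-width of 3, and the `-` padding character are all hypothetical choices. The sketch collects a window around every S/T residue (the O-linked case), labels it positive when the position is a verified site, and skips sequences with no verified site so that negatives come only from proteins with at least one known glycosylated site.

```python
from typing import Iterable, List, Set, Tuple


def extract_windows(
    sequence: str,
    positive_positions: Set[int],
    targets: Iterable[str] = ("S", "T"),
    half_width: int = 3,   # assumed window half-width, not from the poster
    pad: str = "-",        # assumed padding symbol for sequence ends
) -> List[Tuple[str, int]]:
    """Return (window, label) pairs for every candidate residue.

    positive_positions holds 0-based indices of verified glycosylated sites.
    Sequences without any verified site contribute no examples.
    """
    if not positive_positions:
        return []
    target_set = set(targets)
    # Pad both ends so that windows near the termini keep a fixed length.
    padded = pad * half_width + sequence + pad * half_width
    examples = []
    for i, residue in enumerate(sequence):
        if residue in target_set:
            window = padded[i : i + 2 * half_width + 1]
            examples.append((window, 1 if i in positive_positions else 0))
    return examples
```

For N-linked or C-linked sites the same function could be called with `targets=("N",)` or `targets=("W",)`, matching the residues listed in the dataset table.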

Dataset

O-GlycBase v6.00: O-, N- and C-glycosylated proteins, with 242 glycosylated entries, available at http://www.cbs.dtu.dk/databases/OGLYCBASE/Oglyc.base.html

Glycosylation Type    Positive Sites    Negative Sites
O-Linked (S/T)                  2098             11623
N-Linked (N)                     251              1430
C-Linked (W)                      47                73
Total                           2366             13126

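Training an ensemble classifier

[Figure: the training database is sampled into subsets S1, S2, S3, ..., Sk; a classifier Ci is trained on each subset Si, giving a bag of trained classifiers C1, C2, C3, ..., Ck; their predictions on the test database are combined by weighted majority vote.]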
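The training scheme in the diagram can be sketched in a few lines. The poster does not state how the subsets are sampled or how the votes are weighted, so this sketch assumes sampling with replacement and weights each classifier by its accuracy on the full training set. The base learner `train_position_counts` is a hypothetical toy stand-in for the SVM, Naïve Bayes and Decision Tree models, included only to keep the example self-contained.

```python
import random
from collections import Counter
from typing import Callable, List, Sequence, Tuple

Example = Tuple[str, int]          # (sequence window, label in {0, 1})
Classifier = Callable[[str], int]


def train_position_counts(examples: Sequence[Example]) -> Classifier:
    """Toy base learner (hypothetical): per-class, per-position residue counts."""
    counts = {0: Counter(), 1: Counter()}
    totals = Counter()
    for window, label in examples:
        totals[label] += 1
        for pos, residue in enumerate(window):
            counts[label][(pos, residue)] += 1

    def predict(window: str) -> int:
        scores = {}
        for label in (0, 1):
            n = totals[label]
            # Sum of add-one smoothed residue frequencies for this class.
            scores[label] = sum(
                (counts[label][(pos, r)] + 1) / (n + 2)
                for pos, r in enumerate(window)
            )
        return 1 if scores[1] > scores[0] else 0

    return predict


def train_ensemble(
    examples: Sequence[Example],
    train_fn: Callable[[Sequence[Example]], Classifier],
    k: int = 5,
    seed: int = 0,
) -> List[Tuple[Classifier, float]]:
    """Train k classifiers C1..Ck on sampled subsets S1..Sk of the training set."""
    rng = random.Random(seed)
    members = []
    for _ in range(k):
        # Assumed sampling scheme: draw with replacement, same size as the input.
        subset = [rng.choice(examples) for _ in examples]
        clf = train_fn(subset)
        # Assumed weighting: accuracy on the full training set.
        accuracy = sum(clf(w) == y for w, y in examples) / len(examples)
        members.append((clf, accuracy))
    return members


def weighted_majority_vote(
    members: Sequence[Tuple[Classifier, float]], window: str
) -> int:
    """Combine member predictions; each vote counts with the member's weight."""
    votes = {0: 0.0, 1: 0.0}
    for clf, weight in members:
        votes[clf(window)] += weight
    return 1 if votes[1] > votes[0] else 0
```

Passing the learner as `train_fn` keeps the ensemble logic independent of the model type, so any of the classifiers listed on the poster could in principle be plugged in the same way.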

Classifiers

• SVM
    • 0/1 String Kernel
    • Substitution Matrix Kernel
    • PSI-Blast PSSM - Polynomial Kernel
• Decision Tree
• Naïve Bayes
• Identity windows
• Identity plus additional information

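0/1 String Kernel:

\[
K(x, y) = \left( \sum_{i=1}^{|w|} S(x_i, y_i) \right)^{e},
\qquad
S(x_i, y_i) =
\begin{cases}
1 & \text{if } x_i = y_i \\
0 & \text{otherwise}
\end{cases}
\]

Substitution Matrix Kernel: the same form, with

\[
S(x_i, y_i) = \text{the entry for } (x_i, y_i) \text{ in the BLOSUM62 matrix}
\]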
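The two window kernels can be illustrated with a short sketch. The function names are hypothetical, and `BLOSUM62_EXCERPT` holds only a handful of BLOSUM62 entries so that the example stays self-contained; a real run would need the full matrix. Both kernels share one routine that sums a per-position score S over the window and raises the sum to the power e.

```python
from typing import Callable, Dict, Tuple


def window_kernel(
    x: str, y: str, score: Callable[[str, str], float], e: int = 1
) -> float:
    """Sum score(x_i, y_i) over aligned window positions, raised to the power e."""
    if len(x) != len(y):
        raise ValueError("windows must have the same length")
    return sum(score(a, b) for a, b in zip(x, y)) ** e


def identity_score(a: str, b: str) -> float:
    """S for the 0/1 string kernel: 1 if the residues match, 0 otherwise."""
    return 1.0 if a == b else 0.0


# A few BLOSUM62 entries for illustration only (the matrix is symmetric).
BLOSUM62_EXCERPT: Dict[Tuple[str, str], int] = {
    ("A", "A"): 4, ("S", "S"): 4, ("T", "T"): 5,
    ("A", "S"): 1, ("A", "T"): 0, ("S", "T"): 1,
}


def make_matrix_score(
    matrix: Dict[Tuple[str, str], int]
) -> Callable[[str, str], float]:
    """Build S for the substitution matrix kernel from a symmetric lookup table."""
    def score(a: str, b: str) -> float:
        if (a, b) in matrix:
            return matrix[(a, b)]
        return matrix[(b, a)]   # raises KeyError for pairs outside the excerpt
    return score
```

For example, `window_kernel("AST", "ATT", identity_score, e=2)` counts two matching positions and squares the count, while `window_kernel("AST", "ATT", make_matrix_score(BLOSUM62_EXCERPT))` adds the substitution scores 4 + 1 + 5.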

Types of glycosylation (diagram):

Glycosylation
• N-linked glycosylation
    • N-acetylglucosamine (N-GlcNAc)
• O-linked glycosylation
    • O-N-acetylgalactosamine (O-GalNAc)
    • O-N-acetylglucosamine (O-GlcNAc)
    • O-fucose
    • O-glucose
    • O-mannose
    • O-hexose
    • O-xylose
• C-mannosylation
    • C-mannose
• GPI anchor

Results

[Figures: ROC curves for N-Linked, O-Linked, and C-Linked glycosylation site prediction; comparison of ROC curves for single and ensemble classifiers.]
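For readers unfamiliar with ROC curves, the following sketch shows the standard way such a curve and its area can be computed from classifier scores. It is not the evaluation code behind the poster's figures; the function names and the toy inputs in the usage note are illustrative assumptions.

```python
from typing import List, Sequence, Tuple


def roc_points(
    labels: Sequence[int], scores: Sequence[float]
) -> List[Tuple[float, float]]:
    """Return (false positive rate, true positive rate) points.

    The decision threshold is swept from the highest score to the lowest;
    examples with tied scores are added together before a point is emitted.
    """
    pos = sum(labels)
    neg = len(labels) - pos
    if pos == 0 or neg == 0:
        raise ValueError("need at least one positive and one negative example")
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    for i, (score, label) in enumerate(ranked):
        if label == 1:
            tp += 1
        else:
            fp += 1
        last_of_tie = i + 1 == len(ranked) or ranked[i + 1][0] != score
        if last_of_tie:
            points.append((fp / neg, tp / pos))
    return points


def auc(points: Sequence[Tuple[float, float]]) -> float:
    """Area under the ROC curve by the trapezoidal rule."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )
```

With labels `[1, 0, 1, 0]` and scores `[0.9, 0.8, 0.7, 0.1]`, one negative outranks one positive, and `auc(roc_points(...))` evaluates to 0.75. A curve closer to the top-left corner, and so a larger area, indicates better separation of glycosylated from non-glycosylated sites.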

Conclusion

In this work we addressed the problem of predicting O-, N-, and C-linked glycosylation sites from protein sequences. We trained and evaluated an ensemble classifier in conjunction with SVM, Naïve Bayes and Decision Tree models. Our experiments show that an ensemble classifier approach achieves low generalization error and can outperform a single trained classifier.
