
Training and applying hidden Markov models and support vector machines for prediction of T-cell epitopes

Van Hai Van, Cao Thi Ngoc Phuong, Tran Linh Thuoc



Faculty of Biology, University of Natural Sciences, VNU-HCMC, Vietnam

Sixth International Conference on Bioinformatics, InCoB2007

Epitope prediction

“Epitope is the portion of an antigen that is recognized by the antigen receptor on lymphocytes”

Molecular Biology

Epitope prediction:

Computational methods aid the development of epitope-based vaccines against human pathogens for which no vaccines currently exist

http://www.scripps.edu/newsandviews/e_20050228/hiv.html

T-cell epitope prediction

• T-cell epitopes are a subset of MHC-binding peptides, so predicting which peptides bind to MHC is essential for the design of peptide-based vaccines

• HLA-A0201

Sequence

Binding motifs

Quantitative matrices

Decision tree

Artificial neural networks

Hidden Markov models

Support vector machines

Molecular Biology

HMMs & SVMs

HMMs (Hidden Markov Models): statistical models that can capture complex relationships in data sets.

SVMs (Support Vector Machines): learning machines that can find the optimal separating hyperplane.

Epitope prediction for dengue virus

Tropical disease

• Dengue fever

• Dengue hemorrhagic fever

• Dengue shock syndrome

Hypotheses of pathogenesis

• Antibody-dependent enhancement

• Virus virulence

No dengue vaccine is available

In our research:

• Develop a procedure for automatically building T-cell epitope prediction models

• Find in silico candidates for multivalent vaccines against the 4 types of dengue virus

Building models for predicting T-cell epitopes & applying these models to dengue virus

Building effective prediction models?

The predictive ability of HMM and SVM models depends on:

• Experimentally determined peptides binding to MHC molecules

• Partition of the peptides into a training set and a testing set

• Encoding method

A system should find the best prediction model easily and quickly when the type of MHC molecule or the number of binding peptides changes
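To make the "encoding method" item concrete, here is a minimal sketch of binary (one-hot) encoding of a 9-mer, written out in the sparse `label index:value` input format that SVM_light reads. The residue ordering, the function names, and the example peptide are illustrative assumptions; the slides do not specify how the authors laid out their feature vectors.

```python
# Assumed residue order: the 20 standard amino acids, alphabetical by one-letter code.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def binary_encode(peptide):
    """Return a flat 0/1 list with 20 slots per residue (one-hot per position)."""
    vector = []
    for residue in peptide:
        one_hot = [0] * len(AMINO_ACIDS)
        # .index raises ValueError for non-standard residues such as 'X'
        one_hot[AMINO_ACIDS.index(residue)] = 1
        vector.extend(one_hot)
    return vector

def to_svmlight_line(label, vector):
    """Format one example as '<label> <index>:<value> ...' (1-based, non-zero entries only)."""
    features = " ".join(f"{i}:{v}" for i, v in enumerate(vector, start=1) if v != 0)
    return f"{label:+d} {features}"

encoded = binary_encode("LMRRGDLPV")   # 9 residues x 20 slots = 180 features
line = to_svmlight_line(1, encoded)    # +1 marks a binding peptide in this sketch
```

A Blosum-62 encoding would follow the same layout, replacing each one-hot block with the residue's row of the substitution matrix.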

Processing MHC-binding experimental peptides

Create training and testing sets
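The slides do not define how homologous groups are formed or how they feed into the split, so the following is only one possible reading, sketched under explicit assumptions: two equal-length peptides are treated as homologous when they share at least `k` identical residues at the same positions, groups are built by single linkage, and whole groups are assigned to either the training or the testing set. All function names and the threshold semantics are hypothetical.

```python
import random

def shared_positions(a, b):
    """Count positions at which two equal-length peptides carry the same residue."""
    return sum(x == y for x, y in zip(a, b))

def homologous_groups(peptides, k):
    """Single-linkage grouping: peptides sharing >= k identical positions are linked."""
    parent = list(range(len(peptides)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving keeps lookups short
            i = parent[i]
        return i

    for i in range(len(peptides)):
        for j in range(i + 1, len(peptides)):
            if shared_positions(peptides[i], peptides[j]) >= k:
                parent[find(i)] = find(j)

    groups = {}
    for i, pep in enumerate(peptides):
        groups.setdefault(find(i), []).append(pep)
    return list(groups.values())

def split_by_group(groups, test_fraction, seed):
    """Assign whole groups to testing until it holds about test_fraction of all peptides."""
    rng = random.Random(seed)
    shuffled = groups[:]
    rng.shuffle(shuffled)
    total = sum(len(g) for g in shuffled)
    train, test = [], []
    for group in shuffled:
        target = test if len(test) < test_fraction * total else train
        target.extend(group)
    return train, test

peptides = ["LMRRGDLPV", "LMRRGDLPA", "KTVWFVPSI", "AISGDDCVV"]
groups = homologous_groups(peptides, k=7)
train, test = split_by_group(groups, test_fraction=0.25, seed=0)
```

Keeping each group on one side of the split avoids testing on near-duplicates of training peptides; whether the authors used the groups this way is not stated in the slides.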

Training & testing procedure

HMMs (HMMer), SVMs (SVM_light)

Experiment 1

Databases: MHCBN, MHCPEP

Homology: 7-amino acid

No. homologous groups: binding seq.: 11, non-binding seq.: 3

| No. peptides | HMMs, binding | HMMs, non-binding | SVMs, binding | SVMs, non-binding |
|---|---|---|---|---|
| Training set | 623 | - | 25 | 20 |
| Testing set | 80 | 30 | 678 | 30 |

| Method | HMMs | SVMs |
|---|---|---|
| Training times | 200 | 200 |
| Parameters | E-value = 0 to 10 | Linear kernel, c = 0; encoding: binary, Blosum-62, physical-chemical method |

Result of the training by HMMs

HMM.7.136: AROC=0.914

Parameters chosen from HMM.7.136, at point E=3.4, S=-8.5: SE=0.91, SP=0.86, AROC=0.885

[Figure: AROC (0 to 1) plotted against E-value (0 to 9.6)]

Result of the training by SVMs

Binary encoding: AROC = 0.42 to 0.77

Blosum-62 encoding: AROC = 0.47 to 0.87

Chemical-physical encoding: AROC = 0.41 to 0.71

With Blosum-62 encoding, data set SVM.7.blo62.46:

SE=0.83, SP=0.90, AROC=0.87
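For readers unfamiliar with the reported measures, the sketch below shows how sensitivity (SE), specificity (SP), and the area under the ROC curve (AROC) can be computed from prediction scores. It assumes higher scores mean "more likely to bind" and that labels are 1 for binders and 0 for non-binders; the scores are toy values, not data from the experiments.

```python
def sensitivity_specificity(scores, labels, threshold):
    """SE = TP / (TP + FN), SP = TN / (TN + FP) at a given score threshold."""
    tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= threshold)
    fn = sum(1 for s, y in zip(scores, labels) if y == 1 and s < threshold)
    tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < threshold)
    fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= threshold)
    return tp / (tp + fn), tn / (tn + fp)

def aroc(scores, labels):
    """Area under the ROC curve as the fraction of (binder, non-binder) pairs
    in which the binder scores higher; ties count as half a win."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy scores: three binders followed by three non-binders.
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.1]
labels = [1, 1, 1, 0, 0, 0]
se, sp = sensitivity_specificity(scores, labels, threshold=0.5)
area = aroc(scores, labels)
```

SE and SP depend on the chosen threshold, while AROC summarises performance over all thresholds, which is why a model can be quoted with one AROC for the whole curve and different values at a selected operating point.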

Experiment 2

Databases: MHCBN, MHCPEP, IEDB

Homology: 7-amino acid, 6-amino acid, 5-amino acid

| Method | HMMs | SVMs |
|---|---|---|
| Training times | 200 | 100 |
| Parameters | E-value = 40 to 80 | Linear kernel, c = 0; encoding: binary, Blosum-62, Binary-Blosum-62 method |

Result of the training by HMMs

| Homology | 5-amino acid | 6-amino acid | 7-amino acid |
|---|---|---|---|
| Kind of peptide | Binding | Binding | Binding |
| No. homologous groups | 82 | 139 | 84 |
| No. sequences in homologous groups | 1232 | 551 | 374 |
| Total peptides, training set | 1189 | 1165 | 1188 |
| Total peptides, testing set | 632 | 656 | 633 |
| AROC | 0.832 to 0.877 | 0.835 to 0.883 | 0.828 to 0.876 |

The best HMM profile: HMM.6.78

Training in 6-amino acid homologous groups

[Figure: AROC values (0.8 to 0.9) plotted against the training time (1 to 191)]

Parameters of HMM.6.78:

At point: E=42, S=-9.2,

SE=0.91, SP= 0.84, AROC=0.875

HMM.6.78: AROC=0.883

[Figure: AROC values (0.8 to 0.9) plotted against E value (40 to 80)]

Result of the training by SVMs

| Homology | 5-amino acid, binding | 5-amino acid, non-binding | 6-amino acid, binding | 6-amino acid, non-binding | 7-amino acid, binding | 7-amino acid, non-binding |
|---|---|---|---|---|---|---|
| Total homologous groups | 82 | 176 | 139 | 45 | 84 | 21 |
| Sequences in homologous groups | 1232 | 540 | 551 | 116 | 374 | 60 |
| Total sequences, training set | 1189 | 1282 | 1165 | 1365 | 1188 | 1367 |
| Total sequences, testing set | 632 | 557 | 656 | 474 | 633 | 472 |

| AROC | 5-amino acid | 6-amino acid | 7-amino acid |
|---|---|---|---|
| Binary encoding (1) | 0.847 to 0.884 | 0.845 to 0.880 | 0.838 to 0.882 |
| Blosum-62 encoding (2) | 0.843 to 0.884 | 0.846 to 0.883 | 0.838 to 0.894 |
| Binary-Blosum-62 encoding (3) | 0.849 to 0.879 | 0.847 to 0.889 | 0.850 to 0.891 |

Chosen set: SVM.blo62.7.85

Training in 7-amino acid homologous groups

At SVM.2.7.85: SE=0.93, SP=0.86, AROC=0.894

[Figure: AROC values (0.8 to 0.9) plotted against the training time (1 to 96) for binary, Blosum-62, and Binary-Blosum-62 encodings]

Epitope prediction procedure for dengue virus

1. Perform multiple sequence alignment

2. Extract consensus sequences of at least 9 amino acids

3. Create overlapping 9-mer sequences

4. Predict peptides binding to MHC with the HMM profile or the SVM model
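Steps 2 and 3 of the procedure above can be sketched as follows. The sketch assumes that a "consensus sequence" means a gap-free run of alignment columns that are identical in every aligned sequence; the slides do not state the exact criterion. The toy alignment and function names are illustrative only.

```python
def conserved_segments(aligned, min_len=9):
    """Return (start_column, segment) for gap-free runs of columns that are
    identical across all aligned sequences and at least min_len long."""
    length = len(aligned[0])
    segments, run_start = [], None
    for col in range(length + 1):  # one extra step closes a run at the end
        conserved = (
            col < length
            and aligned[0][col] != "-"
            and all(seq[col] == aligned[0][col] for seq in aligned)
        )
        if conserved and run_start is None:
            run_start = col
        elif not conserved and run_start is not None:
            if col - run_start >= min_len:
                segments.append((run_start, aligned[0][run_start:col]))
            run_start = None
    return segments

def overlapping_kmers(segment, k=9):
    """Slide a window of length k over the segment, one residue at a time."""
    return [segment[i:i + k] for i in range(len(segment) - k + 1)]

# Toy alignment of four sequences (one per dengue type); '-' marks a gap.
aligned = [
    "AALMRRGDLPVWLSA",
    "GALMRRGDLPVWLTA",
    "AALMRRGDLPVWL-A",
    "TALMRRGDLPVWLSA",
]
segments = conserved_segments(aligned)
candidates = [kmer for _, seg in segments for kmer in overlapping_kmers(seg)]
```

Each candidate 9-mer would then be scored by the trained HMM profile or SVM model in step 4.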

Experiment 1

| Proteins (1,2,3,4) | Epitope sequences | Methods |
|---|---|---|
| 537NS3, 536NS3, 2010DV3_gp1, 536NS3 | LMRRGDLPVWL | HMMs, SVMs |
| 763NS5, 764NS5, 515NS5, 765NS5 | LMYFHRRDLRL | HMMs, SVMs |
| 358NS3, 357NS3, 2HELICc, 357NS3 | KTVWFVPSI | SVMs |
| 658NS5, 659NS5, 410NS5, 660NS5 | AISGDDCVV | SVMs |
| 472NS5, 473NS5, 223NS5, 473NS5 | AIWYMWLGA | SVMs |
| 101E, 99E, 99glycoprot, 99E | RGWGNGCGL | SVMs |
| 194NS1, 194NS1, 193NS1, 194NS1 | VHADMGYWI | SVMs |
| 352NS5, 353NS5, 103NS5, 353NS5 | RVFKEKVDT | SVMs |
| 13NS1, 13NS1, 12NS1, 13NS1 | LKCGSGIFV | SVMs |
| 26NS1, 26NS1, 25NS1, 26NS1 | HTWTEQYKF | SVMs |
| 230NS1, 230NS1, 229NS1, 230NS1 | TLWSNGVLES | SVMs |
| 327NS1, 327NS1, 326NS1, 327NS1 | DGCWYGMEIRP | SVMs |
| 148NS3, 148NS3, 142Pep_S7, 148NS3 | GLYGNGVVT | SVMs |
| 256NS3, 255NS3, 67DEXHc, 255NS3 | EIVDLMCHA | SVMs |
| 297NS3, 296NS3, 108DEXHc, 296NS3 | ARGYISTRV | SVMs |
| 410NS3, 409NS3, 54HELICc, 409NS3 | DISEMGANF | SVMs |
| 36NS4B, 35NS4B, 35NS4B, 32NS4B | ASAWTLYAV | SVMs |
| 118NS4B, 117NS4B, 117NS4B, 114NS4B | HYAIIGPGLQA | SVMs |
| 142NS4B, 141NS4B, 141NS4B, 138NS4B | IMKNPTVDGI | SVMs |
| 224NS4B, 223NS4B, 223NS4B, 220NS4B | NIFRGSYLAGA | SVMs |
| 81NS5, 81NS5, 27FtsJ, 81NS5 | GCGRGGWSY | SVMs |
| 529NS5, 530NS5, 280NS5, 530NS5 | MYADDTAGW | SVMs |
| 602NS5, 603NS5, 353NS5, 603NS5 | QVGTYGLNT | SVMs |
| 606NS5, 607NS5, 357NS5, 607NS5 | YGLNTFTNM | SVMs |
| 682NS5, 683NS5, 434NS5, 684NS5 | DMGKVRKDI | SVMs |
| 745NS5, 746NS5, 497NS5, 747NS5 | WSLRETACLG | SVMs |
| 788NS5, 789NS5, 540NS5, 790NS5 | PTSRTTWSI | SVMs |

| Proteins (1,2,3,4) | Epitope sequences | Methods |
|---|---|---|
| 537NS3, 536NS3, 2010DV3_gp1, 536NS5 | LMRRGDLPV | HMMs |
| 763NS5, 764NS5, 515NS5, 765NS5 | LMYFHRRDLRL | HMMs |
| 358NS3, 357NS3, 2HELICc, 357NS3 | KTVWFVPSI | HMMs |
| 658NS5, 659NS5, 410NS5, 660NS5 | AISGDDCVV | HMMs |
| 469NS5, 470NS5, 220NS5, 470NS5 | GSRAIWYMWLGAR | HMMs |
| 103E, 101E, 101DV3_gp1, 101E | WGNGCGLFG | SVMs |
| 193NS1, 193NS1, 192NS1, 193NS1 | AVHADMGYWIES | SVMs |
| 348NS5, 349NS5, 99NS5, 349NS5 | FGQQRVFKE | SVMs |
| 568NS5, 569NS5, 319NS5, 569NS5 | FKLTYQNKV | HMMs |

Experiment 2

Result of epitope prediction (prediction of peptides binding to HLA-A0201):

Join overlapping 9-amino acid peptides predicted to bind HLA-A0201 molecules
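The joining step can be sketched as a merge of overlapping windows. The sketch assumes each predicted binder is reported with its start position in the same source sequence; the positions below are illustrative, chosen so that two overlapping 9-mers combine into the 11-residue sequence LMRRGDLPVWL listed in the results table. The function name is hypothetical.

```python
def join_overlapping(hits):
    """hits: list of (start, peptide) pairs taken from the same sequence.
    Windows that overlap by at least one residue are merged into one longer peptide."""
    merged = []
    for start, pep in sorted(hits):
        if merged and start < merged[-1][0] + len(merged[-1][1]):
            prev_start, prev_seq = merged[-1]
            overlap = prev_start + len(prev_seq) - start  # residues already covered
            merged[-1] = (prev_start, prev_seq + pep[overlap:])
        else:
            merged.append((start, pep))
    return merged

# Illustrative hits: two overlapping 9-mers and one isolated 9-mer.
hits = [(539, "RRGDLPVWL"), (537, "LMRRGDLPV"), (763, "LMYFHRRDL")]
joined = join_overlapping(hits)
```

Merging by position rather than by string matching avoids joining two peptides that merely share a suffix and prefix but come from different regions.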

Result of prediction

• The HMM profile is stable, and its predictive ability increases when additional data sets are added.

• The SVM model performs well, but its predictive ability decreases as the amount of training data increases.

http://www.biology.hcmuns.edu.vn/epitope

Conclusion

• Successfully built a system for training hidden Markov models and support vector machines

• Generating training and testing data by separating the data set into homologous groups gives good results.

• Consensus epitopes for the 4 types of dengue virus could be predicted from data on peptides binding to HLA-A0201

Future plans

• Try other kernels with the SVM method

• Survey other encoding methods for sequences of variable length

• Survey other methods for classifying MHC data into homologous groups

• Automate the procedure for collecting and updating MHC-binding peptide data from databases

Thank you very much!