Conditional Random Fields for the Prediction of Signal Peptide Cleavage Sites

M.W. Mak and S.Y. Kung, ICASSP’09

Conditional Random Fields for the Prediction of Signal Peptide Cleavage Sites

M.W. Mak, The Hong Kong Polytechnic University

S.Y. Kung, Princeton University

Contents

1. Introduction
• Proteins and Their Subcellular Locations
• Importance of Protein Cleavage-Site Prediction
• Information in Amino Acid Sequences
• Existing Approaches to Cleavage Site Prediction

2. Conditional Random Field (CRF)
• CRF for Cleavage Site Prediction

3. Experiments and Results
• Effectiveness of Different Feature Functions
• Effect of Varying Window Size
• Fusion with SignalP

Proteins and Their Destination

• A protein consists of a sequence of amino acids.

• Newly synthesized proteins need to cross intracellular membranes to reach their destination.

http://redpoll.pharmacy.ualberta.ca

Signal Peptide

Source: S. R. Goodman, Medical Cell Biology, Elsevier, 2008.

• A short segment of 20 to 100 amino acids, known as the signal peptide, contains information about the destination (address) of the protein.

• The signal peptide is cleaved off from the resulting mature protein when it passes across the membrane.

[Figure: signal peptide and mature protein, with the signal peptide cleavage site marked. Source: http://nobelprize.org]

Importance of Cleavage Site Prediction

• Defects in the protein sorting process can cause serious diseases, e.g., kidney stones.

Source: http://nobelprize.org/nobel_prizes/medicine/laureates/1999/illpres/diseases.html

Importance of Cleavage Site Prediction

• Many proteins (e.g., insulin) are produced in living cells. For these proteins to be secreted from the cell, they are given a signal peptide.

Source: http://nobelprize.org/nobel_prizes/medicine/laureates/1999/illpres/diseases.html

[Figure: bioreactor]

Information in Sequences

• Signal peptides contain some regular patterns.

• Although the patterns exhibit substantial variation, they can be detected by machine learning tools.

[Figure labels: cleavage site; region rich in hydrophobic amino acids (AA)]

Existing Methods

• Weight matrices (PrediSi)

• Neural Networks (SignalP 1.1)

• HMMs (SignalP 3.0)

Weight Matrices

M A R S S L F T F L C L A V F I N G C L S Q I E Q Q

Score at position t = 16+0+8+6+78+7+7+13+10+6+8+6+0+6+7=178

[Figure: a weight matrix of 20 amino acids (AA) by 15 positions, with sequence positions t-1, t, and t+1 marked]
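The scoring rule on this slide (one weight per window position, summed into a score at position t) can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the matrix values are made-up placeholders rather than PrediSi's trained weights, the helper names (`make_placeholder_matrix`, `score_at`, `best_position`) are hypothetical, and the alignment of the 15-position window relative to t (the `start_offset` parameter) is assumed because the slide does not state it.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids
WINDOW = 15                            # number of matrix columns (positions)

def make_placeholder_matrix(seed=0):
    """Build a 20 x 15 matrix of made-up integer weights (NOT trained values)."""
    rng = random.Random(seed)
    return {aa: [rng.randint(0, 20) for _ in range(WINDOW)] for aa in AMINO_ACIDS}

def score_at(sequence, t, matrix, start_offset=-12):
    """Sum one weight per window position around index t (0-based).

    start_offset is an assumed alignment: the window covers
    sequence[t + start_offset : t + start_offset + WINDOW].
    Returns None when the window does not fit inside the sequence.
    """
    begin = t + start_offset
    if begin < 0 or begin + WINDOW > len(sequence):
        return None
    window = sequence[begin:begin + WINDOW]
    return sum(matrix[aa][k] for k, aa in enumerate(window))

def best_position(sequence, matrix, start_offset=-12):
    """Return the index t with the highest score, or None if no window fits."""
    scored = [(score_at(sequence, t, matrix, start_offset), t)
              for t in range(len(sequence))]
    scored = [(s, t) for s, t in scored if s is not None]
    return max(scored)[1] if scored else None
```

With a trained matrix in place of the placeholder, `best_position` would return the position whose surrounding window looks most like a cleavage site under this scoring scheme.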

SignalP-HMM

Source: Nielsen and Krogh

[Figure: HMM with a signal peptide part and a mature protein part]

Contents

1. Introduction
• Proteins and Their Subcellular Locations
• Importance of Protein Cleavage-Site Prediction
• Information in Amino Acid Sequences
• Existing Approaches to Cleavage Site Prediction

2. Conditional Random Field (CRF)
• CRF for Cleavage Site Prediction

3. Experiments and Results
• Effectiveness of Amino Acid Properties
• Effectiveness of Different Feature Functions
• Fusion with SignalP

Conditional Random Fields

• Given a sequence of observations (e.g., words), a CRF attempts to find the most likely label sequence, i.e., it gives a label for each of the observations.

• Conditional Random Fields (CRFs) were originally designed for sequence labeling tasks such as Part-of-Speech (POS) tagging.
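Finding the most likely label sequence in a linear-chain model is commonly done with the Viterbi algorithm. The sketch below is an illustration, not the authors' code: it assumes per-position state scores and label-to-label transition scores are already available, and the toy labels and numbers are invented.

```python
def viterbi(state_scores, transition_scores, labels):
    """Return the highest-scoring label sequence.

    state_scores: list of dicts, state_scores[t][label] -> score at position t
    transition_scores: dict, transition_scores[(prev, cur)] -> score
    """
    # best[t][label] = best total score of any path ending in `label` at t
    best = [{lab: state_scores[0][lab] for lab in labels}]
    back = [{}]
    for t in range(1, len(state_scores)):
        best.append({})
        back.append({})
        for cur in labels:
            prev = max(labels,
                       key=lambda p: best[t - 1][p] + transition_scores[(p, cur)])
            best[t][cur] = (best[t - 1][prev] + transition_scores[(prev, cur)]
                            + state_scores[t][cur])
            back[t][cur] = prev
    # Trace back from the best final label
    last = max(labels, key=lambda lab: best[-1][lab])
    path = [last]
    for t in range(len(state_scores) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return path[::-1]

# Toy POS-style example with invented scores
labels = ["N", "V"]
state = [{"N": 2.0, "V": 0.0}, {"N": 0.0, "V": 1.0}, {"N": 1.0, "V": 0.5}]
trans = {("N", "N"): 0.0, ("N", "V"): 1.0, ("V", "N"): 1.0, ("V", "V"): -1.0}
print(viterbi(state, trans, labels))  # ['N', 'V', 'N']
```

The dynamic program visits each position once and considers every pair of adjacent labels, so its cost grows linearly with the sequence length.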

Advantages of CRF

• Avoids computing the likelihood p(observation|label); the posterior p(label|observation) is computed directly.

• Can model long-range dependencies without making the inference problem intractable.

• Guarantees a global optimum.

[Figure: example sequence annotated with "Depends on" to illustrate long-range dependency]

M A R S S L F T F L C L A V F I N G C L S Q I E Q Q

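CRF for Cleavage Site Prediction

• Label set: $L = \{S, C, M\}$

[Equation: CRF model, annotated with its transition features, state features, and weights. Positions run over $t = 1, \ldots, T$, where $T$ is the length of the sequence. Features use n-grams of amino acids. The cleavage site is marked in the accompanying figure.]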
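For reference, a linear-chain CRF with transition features, state features, and weights is commonly written in the form below. The symbols $f_i$, $g_j$, $\lambda_i$, $\mu_j$, and $Z(\mathbf{x})$ are standard textbook notation assumed here, and may differ from the notation the authors used.

```latex
p(\mathbf{y} \mid \mathbf{x})
  = \frac{1}{Z(\mathbf{x})}
    \exp\left(
      \sum_{t=1}^{T} \left[
        \sum_{i} \lambda_i \, f_i(y_{t-1}, y_t, \mathbf{x}, t)
        + \sum_{j} \mu_j \, g_j(y_t, \mathbf{x}, t)
      \right]
    \right),
\qquad
Z(\mathbf{x})
  = \sum_{\mathbf{y}'}
    \exp\left(
      \sum_{t=1}^{T} \left[
        \sum_{i} \lambda_i \, f_i(y'_{t-1}, y'_t, \mathbf{x}, t)
        + \sum_{j} \mu_j \, g_j(y'_t, \mathbf{x}, t)
      \right]
    \right)
```

Here $f_i$ are transition features over adjacent labels, $g_j$ are state features over a single label and the observations, $\lambda_i$ and $\mu_j$ are their weights, and $Z(\mathbf{x})$ normalizes over all label sequences of length $T$.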

CRF for Cleavage Site Prediction

e.g., bi-gram and query sequence $x$ = T Q T W A G S H S . . . : $b(x, 5) = \text{WA}$

e.g., $y_{t-1} = C$ and $y_t = M$
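The bi-gram example on this slide can be reproduced with a small helper. This is a sketch under assumptions: the names `ngram_at`, `state_feature`, and `transition_feature` are hypothetical, and the convention that an n-gram ends at the 1-based position t is inferred from the example (position 5 of T Q T W A G S H S yielding WA), not taken from the authors' code.

```python
def ngram_at(x, t, n):
    """Return the n-gram of x that ends at 1-based position t, or None if it does not fit."""
    if n < 1 or t < n or t > len(x):
        return None
    return x[t - n:t]

def state_feature(x, t, y_t, pattern, label):
    """Binary state feature: 1 if the n-gram ending at t equals `pattern` and y_t equals `label`."""
    return int(ngram_at(x, t, len(pattern)) == pattern and y_t == label)

def transition_feature(y_prev, y_t, prev_label, label):
    """Binary transition feature: 1 if the adjacent labels equal (prev_label, label)."""
    return int(y_prev == prev_label and y_t == label)

query = "TQTWAGSHS"
print(ngram_at(query, 5, 2))  # WA
```

In a CRF, each such binary feature is multiplied by a learned weight, so an informative n-gram near the cleavage site can raise the score of the matching label.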

Experiments

• Data: 1937 protein sequences extracted from Swissprot 56.5. The cleavage site locations of these sequences were biologically determined.

• Ten-fold cross validation.

• For 1st-order state features, up to 5-grams of amino acids.

• For 2nd-order state features, up to bi-grams of amino acids.

• CRF++ software was used.
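Ten-fold cross validation over the 1937 sequences can be sketched as follows. The slides do not say how the folds were formed, so the random shuffle, the fixed seed, and the function names (`k_fold_indices`, `cross_validation_splits`) are assumptions made for illustration.

```python
import random

def k_fold_indices(n_items, k=10, seed=0):
    """Shuffle item indices and split them into k nearly equal folds."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def cross_validation_splits(n_items, k=10, seed=0):
    """Yield (train_indices, test_indices) pairs, one per fold."""
    folds = k_fold_indices(n_items, k, seed)
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

# Each sequence is held out for testing exactly once
for train, test in cross_validation_splits(1937):
    assert len(train) + len(test) == 1937
```

Each fold trains on roughly nine tenths of the data and tests on the remaining tenth; accuracy would then be pooled or averaged over the ten held-out folds.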

Results

Effectiveness of Different Feature Functions:

[Figure: results for two configurations, (Transition only) and (Transition + State)]

Observations:
(1) The transition feature by itself performs poorly.
(2) Once it is combined with state features, performance improves.

Results

Effect of Varying the Window Size:

[Equation: Window Size = max{·}, defined in terms of n and d]

e.g., query sequence = T Q T W A G S H S . . . , Window Size = 5

Results

Compared with Other Predictors

Predictor                   Accuracy
SignalP (HMM and NN)        81.88%
PrediSi (Weight matrix)     77.06%
CRF                         82.19%
CRF + SignalP               85.03%

Observations:
(1) CRF is slightly better than SignalP.
(2) CRF is complementary to SignalP.

Web Server

http://158.132.148.85:8080/CSitePred/faces/Page1.jsp

Available in May 2009
