M.W. Mak and S.Y. Kung, ICASSP’09
Conditional Random Fields for the Prediction of Signal Peptide Cleavage Sites
M.W. Mak, The Hong Kong Polytechnic University
S.Y. Kung, Princeton University
Contents
1. Introduction
• Proteins and Their Subcellular Locations
• Importance of Protein Cleavage-Site Prediction
• Information in Amino Acid Sequences
• Existing Approaches to Cleavage Site Prediction
2. Conditional Random Field (CRF)
• CRF for Cleavage Site Prediction
3. Experiments and Results
• Effectiveness of Different Feature Functions
• Effect of Varying Window Size
• Fusion with SignalP
Proteins and Their Destination
• A protein consists of a sequence of amino acids.
• Newly synthesized proteins need to cross intracellular membranes to reach their destination.
http://redpoll.pharmacy.ualberta.ca
Signal Peptide
Source: S. R. Goodman, Medical Cell Biology, Elsevier, 2008.
• A short segment of 20 to 100 amino acids, known as the signal peptide, contains information about the destination (address) of the protein.
• The signal peptide is cleaved off, leaving the mature protein, when the protein passes across the membrane.
http://nobelprize.org
[Figure labels: Mature protein; Signal Peptide; Cleavage Site]
Importance of Cleavage Site Prediction
• Defects in the protein sorting process can cause serious diseases, e.g., kidney stones.
Source: http://nobelprize.org/nobel_prizes/medicine/laureates/1999/illpres/diseases.html
Importance of Cleavage Site Prediction
• Many proteins (e.g., insulin) are produced in living cells. To have these proteins secreted from the cell, they are given a signal peptide.
Source: http://nobelprize.org/nobel_prizes/medicine/laureates/1999/illpres/diseases.html
[Figure label: Bioreactor]
Information in Sequences
• Signal peptides contain some regular patterns.
• Although the patterns exhibit substantial variation, they can be detected by machine learning tools.
[Figure labels: Cleavage Site; Rich in hydrophobic AA]
Existing Methods
• Weight matrices (PrediSi)
• Neural Networks (SignalP 1.1)
• HMMs (SignalP 3.0)
Weight Matrices
M A R S S L F T F L C L A V F I N G C L S Q I E Q Q
Score at position t = 16+0+8+6+78+7+7+13+10+6+8+6+0+6+7=178
[Figure: weight matrix of 20 amino acids × 15 positions, with positions t-1, t, t+1 marked on the sequence]
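The scoring step on this slide can be sketched in a few lines of Python: one weight is looked up per window position and the 15 weights are summed. This is only an illustration under stated assumptions. The matrix values are random placeholders (PrediSi's actual weights are not given here), and the names `make_toy_matrix`, `window_score` and `best_window`, as well as the way a window is aligned to position t, are hypothetical.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"   # the 20 standard amino acids
N_POSITIONS = 15                        # matrix width, as on the slide
SLIDE_TERMS = [16, 0, 8, 6, 78, 7, 7, 13, 10, 6, 8, 6, 0, 6, 7]

def make_toy_matrix(seed=0):
    """Hypothetical 20 x 15 weight matrix filled with random integers."""
    rng = random.Random(seed)
    return {aa: [rng.randint(0, 20) for _ in range(N_POSITIONS)]
            for aa in AMINO_ACIDS}

def window_score(matrix, sequence, start):
    """Sum one weight per position for the 15-residue window at `start`."""
    window = sequence[start:start + N_POSITIONS]
    if len(window) != N_POSITIONS:
        raise ValueError("window runs past the end of the sequence")
    return sum(matrix[aa][k] for k, aa in enumerate(window))

def best_window(matrix, sequence):
    """0-based start of the highest-scoring window (assumed decision rule)."""
    starts = range(len(sequence) - N_POSITIONS + 1)
    return max(starts, key=lambda s: window_score(matrix, sequence, s))

sequence = "MARSSLFTFLCLAVFINGCLSQIEQQ"   # sequence shown on the slide
matrix = make_toy_matrix()
print(sum(SLIDE_TERMS))                    # 178, the slide's total
print(best_window(matrix, sequence))       # depends on the toy weights
```

With a trained matrix in place of the random one, the highest-scoring window would indicate the predicted cleavage site.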
SignalP-HMM
Source: Nielsen and Krogh
[Figure labels: Signal Peptide; Mature protein]
Conditional Random Fields
• Given a sequence of observations (e.g., words), a CRF attempts to find the most likely label sequence, i.e., it assigns a label to each observation.
• Conditional Random Fields (CRFs) were originally designed for sequence labeling tasks such as part-of-speech (POS) tagging.
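Finding the most likely label sequence in a linear-chain model is usually done with Viterbi decoding. The sketch below is a minimal illustration, not the authors' implementation: the toy POS lexicon, the transition scores and all function names are made up for the example, and scores are simply added (as log-potentials would be).

```python
def viterbi(observations, labels, state_score, transition_score):
    """Return the highest-scoring label sequence for a linear chain."""
    if not observations:
        return []
    # best[t][y]: best total score of any labelling of positions 0..t ending in y
    best = [{y: state_score(y, observations, 0) for y in labels}]
    back = []  # back[t-1][y]: best previous label when position t has label y
    for t in range(1, len(observations)):
        scores, pointers = {}, {}
        for y in labels:
            prev = max(labels,
                       key=lambda p: best[-1][p] + transition_score(p, y))
            scores[y] = (best[-1][prev] + transition_score(prev, y)
                         + state_score(y, observations, t))
            pointers[y] = prev
        best.append(scores)
        back.append(pointers)
    # Trace back from the best final label.
    path = [max(labels, key=lambda y: best[-1][y])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]

# Toy scores (hypothetical): "barks" is ambiguous on its own.
LEXICON = {("the", "DET"): 2.0, ("dog", "NOUN"): 2.0,
           ("barks", "VERB"): 1.0, ("barks", "NOUN"): 1.0}
TRANSITIONS = {("DET", "NOUN"): 1.0, ("NOUN", "VERB"): 1.0}

def state_score(y, obs, t):
    return LEXICON.get((obs[t], y), 0.0)

def transition_score(prev, y):
    return TRANSITIONS.get((prev, y), 0.0)

tags = viterbi(["the", "dog", "barks"], ["DET", "NOUN", "VERB"],
               state_score, transition_score)
print(tags)  # ['DET', 'NOUN', 'VERB']
```

In this toy run the word "barks" scores equally as NOUN or VERB in the lexicon; the NOUN-to-VERB transition score is what settles the label.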
Advantages of CRF
• Avoids computing the likelihood p(observation | label); the posterior p(label | observation) is computed directly.
• Able to model long-range dependencies without making the inference problem intractable.
• Guarantees a globally optimal solution.
[Figure: example sequence M A R S S L F T F L C L A V F I N G C L S Q I E Q Q with a "Depends on" arrow illustrating long-range dependency]
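CRF for Cleavage Site Prediction
• Labels: $y_t \in L = \{S, C, M\}$, where C marks the cleavage site
• [Equation annotations: transition features; state features (n-grams of amino acids); weights; sum over $t = 1, \dots, T$, where $T$ is the length of the sequence]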
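The labels on this slide (transition features, state features, weights, a sum over positions 1 to T) correspond to the terms of a linear-chain CRF posterior. One standard way to write it is given below; the symbols $\lambda_i$, $\mu_j$, $f_i$, $g_j$ and $Z(\mathbf{x})$ are chosen here for illustration and are not taken from the slide.

```latex
p(\mathbf{y} \mid \mathbf{x}) =
  \frac{1}{Z(\mathbf{x})}
  \exp\!\left(
    \sum_{t=1}^{T} \left[
      \sum_{i} \lambda_i \, f_i(y_{t-1}, y_t, \mathbf{x}, t)
      + \sum_{j} \mu_j \, g_j(y_t, \mathbf{x}, t)
    \right]
  \right),
\qquad
Z(\mathbf{x}) =
  \sum_{\mathbf{y}'}
  \exp\!\left(
    \sum_{t=1}^{T} \left[
      \sum_{i} \lambda_i \, f_i(y'_{t-1}, y'_t, \mathbf{x}, t)
      + \sum_{j} \mu_j \, g_j(y'_t, \mathbf{x}, t)
    \right]
  \right)
```

Here $f_i$ play the role of transition features, $g_j$ the role of state features (built from n-grams of amino acids), $\lambda_i$ and $\mu_j$ are the weights, and $T$ is the length of the sequence.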
CRF for Cleavage Site Prediction
• e.g., for bi-grams and query sequence x = T Q T W A G S H S . . . : $b(x, 5) = \text{WA}$
• e.g., $y_{t-1} = C$ and $y_t = M$
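The bi-gram example above (position 5 of T Q T W A G S H S gives WA) suggests that the n-gram at position t is the n residues ending at t, counting from 1. The sketch below encodes that reading and builds simple indicator features from it. The function names and the indicator form are assumptions made for illustration, not the authors' definitions.

```python
def ngram_ending_at(sequence, t, n):
    """n-gram whose last residue sits at 1-based position t, or None."""
    if n < 1 or t < n or t > len(sequence):
        return None
    return sequence[t - n:t]

def state_feature(ngram, label):
    """Indicator: fires when y_t == label and the n-gram at t matches."""
    def feature(y_t, sequence, t):
        hit = ngram_ending_at(sequence, t, len(ngram)) == ngram
        return 1 if (y_t == label and hit) else 0
    return feature

def transition_feature(prev_label, label):
    """Indicator: fires on the label pair (y_{t-1}, y_t)."""
    def feature(y_prev, y_t):
        return 1 if (y_prev, y_t) == (prev_label, label) else 0
    return feature

query = "TQTWAGSHS"
print(ngram_ending_at(query, 5, 2))   # WA

wa_at_cleavage = state_feature("WA", "C")   # hypothetical feature
c_then_m = transition_feature("C", "M")     # hypothetical feature
print(wa_at_cleavage("C", query, 5))  # 1
print(c_then_m("C", "M"))             # 1
```

In a trained CRF each such indicator would carry its own learned weight.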
Experiments
• Data: 1937 protein sequences extracted from Swissprot 56.5. The cleavage-site locations of these sequences were biologically determined.
• Ten-fold cross-validation.
• For 1st-order state features, up to 5-grams of amino acids.
• For 2nd-order state features, up to bi-grams of amino acids.
• CRF++ software was used.
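The ten-fold cross-validation over the 1937 sequences can be sketched as follows. How the folds were actually drawn (shuffling, seed, stratification) is not stated in the slides, so the shuffled round-robin split and the name `k_fold_indices` are assumptions.

```python
import random

def k_fold_indices(n_items, k=10, seed=0):
    """Yield (train, test) index lists for k-fold cross-validation."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)         # fixed seed: repeatable split
    folds = [indices[i::k] for i in range(k)]    # round-robin into k folds
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i
                 for idx in fold]
        yield train, test

N_SEQUENCES = 1937   # dataset size from the slide
splits = list(k_fold_indices(N_SEQUENCES))
print(sorted({len(test) for _, test in splits}))   # [193, 194]
```

Each sequence lands in exactly one test fold, so every sequence is predicted once by a model that never saw it during training.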
Results
Effectiveness of Different Feature Functions:
Observations:
(1) Transition features by themselves perform poorly.
(2) Once combined with state features, performance improves.
[Figure legend: (Transition only); (Transition + State)]
Results
Effect of Varying the Window Size:
[Equation: Window Size defined as a maximum, max{·}, over terms in n and d]
e.g., query sequence = T Q T W A G S H S . . . , Window Size = 5
Results
Compared with Other Predictors

Predictor                  Accuracy
SignalP (HMM and NN)       81.88%
PrediSi (Weight matrix)    77.06%
CRF                        82.19%
CRF + SignalP              85.03%

Observations:
(1) CRF is slightly better than SignalP.
(2) CRF is complementary to SignalP.
Web Server
http://158.132.148.85:8080/CSitePred/faces/Page1.jsp
Available in May 2009