Upload
ryu
View
43
Download
0
Embed Size (px)
DESCRIPTION
PhD Thesis Proposal. Conditional Graphical Models for Protein Structure Prediction. Yan Liu Language Technologies Institute Carnegie Mellon University. Thesis Committee Jaime Carbonell (Chair) John Lafferty Eric P. Xing Vanathi Gopalakrishnan (Univ. of Pittsburgh). - PowerPoint PPT Presentation
Citation preview
Carnegie MellonSchool of Computer Science
1
Conditional Graphical Models for Protein Structure Prediction
Yan LiuLanguage Technologies Institute
Carnegie Mellon University
Thesis Committee
Jaime Carbonell (Chair)John Lafferty Eric P. Xing
Vanathi Gopalakrishnan (Univ. of Pittsburgh)
PhD Thesis Proposal
Carnegie MellonSchool of Computer Science
2
Proteins in Our LifeNobelprize.org
Predict protein structures from
sequences
Carnegie MellonSchool of Computer Science
3
Protein Structure Hierarchy
• Our focus: predicting the topology of the structures
Carnegie MellonSchool of Computer Science
4
Previous work
• General approaches for structural motif recognition Sequence similarity searches, e.g. PSI-BLAST [Altschul et al, 1997]
Profile HMM, .e.g. HMMER [Durbin et al, 1998] and SAM [Karplus et al, 1998]
Homology modeling or threading, e.g. Threader [Jones, 1998]
Window-based methods, e.g. PSI_pred [Jones, 2001]
• Methods of careful design for specific structure motifs Example: αα- and ββ- hairpins, β-turn and β-helix
- Structural similarity without clear sequence similarity- Long-range interactions
- Hard to generalize
Carnegie MellonSchool of Computer Science
5
Missing pieces
• Informative features without clear interpretation
• No principled models to formulate the structured properties of proteins
Discriminative models
Graphical models
Conditional graphical models for protein structure prediction
Carnegie MellonSchool of Computer Science
6
Conditional Graphical Models (CGM)
• Structural prediction is reduced to segmentation and labeling problem
• Define segment semantics structural modules W = (M, {Wi}), M: # of segments, Wi: configuration of segment i
• The conditional probability of a possible segmentation W given the observation x is defined as
1
1( | ) exp( ( , ))
G
K
k k cc C k
fP W wZ
x x
Carnegie MellonSchool of Computer Science
7
Advantages of Conditional Graphical Models
• Flexible feature definition and kernelization
• Different loss functions and regularizers
• Convex functions for globally optimal solutions
• Structured information, especially the long-range interactions
Carnegie MellonSchool of Computer Science
8
Training and Testing
• Training phase : learn the model parameters
Minimizing regularized log loss
Seek the direction whose empirical values agree with the expectation
Iterative searching algorithms have to be applied
• Testing phase: search the segmentation that maximizes P(w|x)
1
( , ) log ( )G
K
k k cc C k
L f W Z
x
( | )( ( , ) [ ( , )]) ( ) 0G
k c p W x k cc Ck
Lf w E f W
x x
Carnegie MellonSchool of Computer Science
9
Thesis WorkConditional graphical models for protein structure prediction
Carnegie MellonSchool of Computer Science
10
Model Roadmap
Conditional random fields
Kernel CRFs
Segmentation CRFs
Chain graph model
Factorial segmentation CRFs
Kernelized
Segmentation
Locally normalized
Factorized
Carnegie MellonSchool of Computer Science
11
OutlineConditional graphical models for protein structure prediction
Carnegie MellonSchool of Computer Science
12
Task Definition and Evaluation Materials
• Given a protein sequence, predict its secondary structure assignments Three class: helix (C), sheets (E) and coil (C)
APAFSVSPASGACGPECA
CCEEEEECCCCCHHHCCC
Protein Secondary Structure Prediction
Carnegie MellonSchool of Computer Science
13
CGM on Secondary Structure Prediction
[Liu, Carbonell, Klein-Seetharaman and Gopalakrishnan, Bioinformatics 2004]
• Structural component Individual residues
• Segmentation definition W = (n, Y), n: # of residues, Yi = H, E
or C
• Specific models Conditional random fields (CRFs)
Kernel conditional random fields (kCRFs)
• where
Protein Secondary Structure Prediction
11 1
1( | ) exp( ( , , , ))
N K
k k i ii k
P f i y yZ
y x x
*1( | ) exp( ( , , ))
G
cc C
P y f c yZ
x x* ( )
1.. | |
( , ( , , ))c
cG cj
jy c
j L c C y Y
f K c y
x
Carnegie MellonSchool of Computer Science
14
Target Directions
Protein Secondary Structure Prediction
Input Features(PSI-BLAST profile)
Prediction Combination(CRFs)
Feature Exploration(KCRFs)
Learning Algorithm(SVM)
Beta-sheet Detection
Carnegie MellonSchool of Computer Science
15
Prediction Combination
• Previous work Window-based label combination
• Rule-based algorithm Window-based score combination
• SVM, Neural networks
• Other graphical models for score combination Maximum entropy Markov model (MEMM)
Higher-order MEMMs (HOMEMM)
Pseudo state duration MEMMs (PSMEMM)
Protein Secondary Structure Prediction
MEMM
HOMEMM
Carnegie MellonSchool of Computer Science
16
Experiment Results
Protein Secondary Structure Prediction - Prediction Combination
• Graphical models are consistently better than the window-based approaches• CRFs perform the best among the four graphical models
Carnegie MellonSchool of Computer Science
17
Experiment Results
• Offset from the correct register for the correctly predicted beta-strand pairs (CB513)
Protein Secondary Structure Prediction
- Beta-sheet detection
Considerable improvement in prediction performance
Prediction accuracy for beta-sheets
Carnegie MellonSchool of Computer Science
18
Experiment Results
• Prediction accuracy for KCRFs with vertex cliques (V) and edge cliques (E) PSI-BLAST profile with RBF kernel
Protein Secondary Structure Prediction
- Feature exploration
Carnegie MellonSchool of Computer Science
19
Summary• Conditional graphical model for protein secondary structure
prediction Conditional random fields (CRFs) Kernel conditional random fields (KCRFs)
Protein Secondary Structure Prediction
Carnegie MellonSchool of Computer Science
20
OutlineConditional graphical models for protein structure prediction
Carnegie MellonSchool of Computer Science
21
Task Definition and Evaluation Materials
..APAFSVSPASGACGPECA..
Contains the structural motif?
..NNEEEEECCCCCHHHCCC..
Structural motif recognition
• Structural motif Regular arrangement of secondary structural elements Super-secondary structure, or protein fold
Yes
Carnegie MellonSchool of Computer Science
22
CGM for Structural Motif Recognition
• Structural component Secondary structure elements
• Protein structural graph Nodes for the states of secondary structural elements of unknown lengths Edges for interactions between nodes in 3-D
• Example: β-α-β motif
• Tradeoff between fidelity of the model and graph complexity
Structural Motif Recognition
Carnegie MellonSchool of Computer Science
23
CGM for Structural Motif Recognition
• Segmentation definition Yi: state of segment i
Si: the length of segment iM: number of segments
• Segmentation conditional random fields (SCRFs) For any graph, we have
For a simplified graph, we have
1 1
1( | ) exp( ( , , ))
i
M K
k k ii k
P W f W WZ
x x
1
1( | ) exp( ( , ))
G
K
k k cc C k
P W f wZ
x x
Structural Motif Recognition
Carnegie MellonSchool of Computer Science
24
Protein Folds with Structural Repeats
• Prevalent in proteins and important in functions
• Each repeats consists of structural motifs and insertions
• Challenge Low sequence similarity in
structural motifs Long-range interactions
Structural Motif Recognition
Carnegie MellonSchool of Computer Science
25
CGM for Structural Motif Recognition
• Chain graph A graph consisting of directed and
undirected graphs Given a variable set V that forms
multiple subgraphs U
• Segmentation defintion Two layer segmentation W = {M, {Ξi},
T}M: # of envelops Ξi: the segmentation of envelops
Ti: the state of envelop i = repeat, non-repeat
• Chain graph model
( ) ( | parents( ))u U
P V P u u
Structural Motif Recognition
1 1 11
( | ) ( ,{ }, | ) ( ) ( | , , ) ( | , , )M
i i i i i i ii
P W P M T P M P T x T P T
x x x
Carnegie MellonSchool of Computer Science
26
Experiments
• Right-handed β-helix fold An elongated helix-like structures whose
successive rungs composed of three parallel β-strands (B1, B2, B3 strands)
T2 turn: a unique two-residue turn Low sequence identity
• Leucine-rich repeats (LLR) Solenoid-like regular arrangement of beta-
strands and an alpha-helix, connected by coils
Relatively high sequence identity (many Leucines)
Structural Motif Recognition
Carnegie MellonSchool of Computer Science
27
Experiment Results
• Cross-family validation for classifying β-helix proteins SCRFs can score all known β-helices higher than non β-helices
Structural Motif Recognition SCRFs model
Carnegie MellonSchool of Computer Science
28
Experiment Results
• Predicted Segmentation for known Beta-helices
Structural Motif Recognition SCRFs model
Carnegie MellonSchool of Computer Science
29
Experiment Results
• Histograms of scores for known β-helices against PDB-minus dataset 18 non β-helix proteins have a
score higher than 0
13 from β-class and 5 from α/β class
Most confusing proteins: β-sandwiches and left-handed β-helix
5
Structural Motif Recognition SCRFs model
Carnegie MellonSchool of Computer Science
30
Experiment Results
• Verification on recently crystallized structures
Successfully identify 1YP2: Potato Tuber ADP-Glucose Pyrophosphorylase
• score 10.47
1PXZ: Jun A 1, The Major Allergen From Cedar Pollen• score 32.35
GP14 of Shigella bacteriophage as a β-helix protein with scoring 15.63
Structural Motif Recognition SCRFs model
Carnegie MellonSchool of Computer Science
31
Experiment Results
• Cross-family validation for classifying β-helix proteins Chain graph model can score all known β-helices higher than non
β-helices
Structural Motif Recognition Chain graph model
Carnegie MellonSchool of Computer Science
32
Experiment Results
• Cross-family validation for classifying LLR proteins Chain graph model can score all known LLR higher than non-LLR
Structural Motif Recognition Chain graph model
Carnegie MellonSchool of Computer Science
33
Experiment Results
• Predicted Segmentation for known Beta-helices and LLRs
Structural Motif Recognition Chain graph model
Carnegie MellonSchool of Computer Science
34
Further Experiments
• Virus proteins Noncellular biological entity that
can reproduce only within a host cell
Dynamical properties
Various kinds of viruses• Adenovirus (common cold)• Bacteriophage (which infects
bacteria)• DNA virus, RNA virus
Structural Motif Recognition
Gp 41: core protein of HIV virus (1AIK)
Carnegie MellonSchool of Computer Science
35
Summary• Segmented conditional graphical models for structural
motif recognition Segmentation conditional random fields Chain graph model
• Successful applications Right-handed beta-helices Leucine-rich repeats
• Further verfication Virus spike folds and others
Structural Motif Recognition
Carnegie MellonSchool of Computer Science
36
OutlineConditional graphical models for protein structure prediction
Carnegie MellonSchool of Computer Science
37
Quaternary Structure Prediction
• Quaternary structures Multiple chains associated together
through noncovalent bonds or disulfide bonds
• Classes of quaternary structures Based on the number of subunits Based on the identity of the subunits
• Homo-oligomers* and hetero-oligomers
• Very few related research work
..APAFSVSPASGACGPECA..
Contains the quaternary structures?
..NNEEEEECCCCCHHHCCC..
Yes
Carnegie MellonSchool of Computer Science
38
Proposed Work
• Structural component Secondary structure elements Super-secondary structure elements
• Protein structural graph Nodes for the states of secondary structural
or super-secondary structural elements of unknown lengths
Edges for interactions between nodes in 3-D
Protocol: nodes representing secondary structure elements must involve long-range interactions
Quaternary structure prediction
Carnegie MellonSchool of Computer Science
39
Proposed Work
• Segmentation definition W = (M, {Wi}),
M: # of segmentsWi: configuration of segment i
• Factorial segmentation conditional random fields
R: # of chains in quaternary structures
Quaternary structure prediction
(1) ( )
1
1( ,..., | ) exp( ( , ))
G
KR
k k cc C k
P W W f wZ
x x
Carnegie MellonSchool of Computer Science
40
Experiment Design• Triple beta-spirals
Described by van Raaij et al. in Nature (1999) Clear sequence repeats Two proteins with crystallized structures and
about 20 without structure annotation
• Tripe beta-helices Described by van Raaij et al. in JMB (2001) Structurally similar to beta-helix without clear
sequence similarity Two proteins with crystallized structures Information from beta-helices can be used
• Both folds characterized by unusual stability to heat, protease, and detergent
Quaternary structure prediction
Carnegie MellonSchool of Computer Science
41
SummaryConditional graphical models for protein structure
prediction
Carnegie MellonSchool of Computer Science
42
Expected Contribution• Computational contribution
Conditional graphical models for structured data prediction “First effective models for the long-range interactions”
• Biological contribution Improvement in protein structure prediction Hypothesis on potential proteins with biological important
folds Aids for function analysis and drug design
Carnegie MellonSchool of Computer Science
43
Timeline• Jun, 2005 -- Aug, 2005
Data collection for triple beta-spirals and triple beta-helices Preliminary investigation for virus protein folds
• Sep, 2005 -- Nov, 2005 Implementation and Testing the model on synthetic data and real data Determining the specific virus proteins to work on
• Nov, 2005 -- Jan, 2006 Virus protein fold recognition Analysis the properties for virus protein folds
• Feb, 2006 -- June, 2006: Virus protein fold recognition Investigation the possibility for protein function prediction or information
extraction
• July, 2006 -- Aug, 2006 Writing up thesis
Carnegie MellonSchool of Computer Science
44
Carnegie MellonSchool of Computer Science
45
Features• Node features
Regular expression template, HMM profiles Secondary structure prediction scores Segment length
• Inter-node features β-strand Side-chain alignment scores Preferences for parallel alignment scores Distance between adjacent B23 segments
• Features are general and easy to extend
Structural Motif Recognition
Carnegie MellonSchool of Computer Science
46
Evaluation Measure
• Q3 (accuracy)
• Precision, Recall
• Segment Overlap quantity (SOV)
• Matthew’s Correlation coefficients
)(
)1()2;1(
)2;1()2;1(1)2,1(
iS
SLENSSMAXOV
SSDELTASSMINOV
NSSSOV
))(()()( iiiiiiii
iiiii
onunopup
ounpC
P + P-
T + P u
T - o nuonp
npQ
3
op
pQ pre
up
pQ
Carnegie MellonSchool of Computer Science
47
Local Information• PSI-blast profile
Position-specific scoring matrices (PSSM) Linear transformation[Kim & Park, 2003]
• SVM classifier with RBF kernel • Feature#1 (Si): Prediction score for each
residue Ri
)5(0.1
)55(1.05.0
)5(0
)(
xif
xifx
xif
xf
Carnegie MellonSchool of Computer Science
48
Previous Work
• Huge literature over decades Window-based methods Hidden Markov models
• Major breakthroughs in recent years Combine the predictions from neighboring residues or
various methods (Cuff & Barton, 1999)
Explore evolutionary information from sequences (Jones, 1999)
Specific algorithm for beta-sheets or infer the paring of beta-strands (Meiler & Baker, 2003)
Protein Secondary Structure Prediction