DESCRIPTION
Template-based modeling, including homology modeling and protein threading, is the most reliable method for protein 3D structure prediction. However, alignment errors and template selection are still the main bottlenecks for current template-based modeling methods, especially when the proteins under consideration are distantly related. We present a novel context-specific alignment potential for protein threading, covering both alignment and template selection. Our alignment potential measures the log-odds ratio of one alignment being generated from two related proteins to being generated from two unrelated proteins, by integrating both local and global context-specific information.
Protein Threading Using Context-Specific Alignment Potential
Sheng Wang
http://raptorx.uchicago.edu
Toyota Technological Institute at Chicago
Joint work with Jianzhu Ma, Feng Zhao and Jinbo Xu
ISMB 2013 Jul 22, ICC Berlin, Germany
Outline
• Where we are @ template-based modeling
• What’s our work
• What’s the problem
• What’s our solution
• Welcome to our server
Template-based Modeling (or, Threading)
• Observation
– ~50,000 non-redundant structures in PDB
– ~1,200 unique structure folds (SCOP)
• Methodology
– Use known structures to predict a new one
[Figure: pairwise alignments between the query sequence and template sequences from a database]
DDVYILDQAEEG / DE-FIVD-PDEH
DDVYILDQAEEG / SPCKR---ADEG
DDVYILDQAEEG / E--IFVDQADDS
DDVYILDQAEEG / NMCVFGQWERTY
Template-based Modeling Procedures
• Easy: similar sequences → similar structures
– Sequence-based methods, e.g., BLAST, FASTA
– Work only for close homologs (>70% sequence identity)
• Medium: similar profiles → similar structures
– A protein profile is a matrix that represents a multiple sequence alignment of similar proteins
– Profile-based methods, e.g., PSI-BLAST, HMMER, HHpred
– Work for relatively remote homologs (>40% sequence identity)
• Challenge: dissimilar profiles → similar structures
– Add structural or context-specific information to sequence/profile-based methods
– Threading methods, e.g., MUSTER, RAPTOR, CS-BLAST
– Work for distantly related homologs (<40% sequence identity)
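The sequence-identity thresholds above can be made concrete with a small helper. This is a minimal sketch, assuming identity is counted over columns where neither row has a gap (other definitions exist); the function name `sequence_identity` is hypothetical, and the rows are the example alignments shown on the earlier slide.

```python
def sequence_identity(row_a, row_b, gap="-"):
    """Fraction of identical residues among columns without a gap.

    Assumption: the denominator is the number of columns where both
    rows carry a residue; other conventions use the full length.
    """
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    aligned = [(a, b) for a, b in zip(row_a, row_b) if a != gap and b != gap]
    if not aligned:
        return 0.0
    return sum(a == b for a, b in aligned) / len(aligned)


partners = ["DE-FIVD-PDEH", "SPCKR---ADEG", "E--IFVDQADDS", "NMCVFGQWERTY"]
for partner in partners:
    identity = sequence_identity("DDVYILDQAEEG", partner)
    print(partner, f"{identity:.0%}")
```

Under this definition, every example pair falls at or below the 40% mark, which is the regime where the slide places threading methods.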
Our Work
• CNFpred: transforms the template-sequence alignment problem into a machine learning problem in order to calculate an alignment’s probability.
• DeepAlign: prepares high-quality structural alignments as training data.
• CNF model: a combined machine learning model that incorporates a Conditional Random Field (CRF) and a Neural Network (NN).
Protein Alignment Model
[Figure: alignment matrix of S A L R Q against L P L S E, with the cells on the alignment path marked M]
Template: L P L S - E
Sequence: S A - L R Q
States:   M M Is M It M
• Match state (M)
• Insertion at sequence (Is)
• Insertion at template (It)
The structural alignments generated by DeepAlign are used as training data.
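The state labeling on this slide can be sketched as a small function. It assumes the first aligned row is the template and the second is the sequence, so that Is marks a column with a gap in the sequence row and It a column with a gap in the template row; that reading is inferred from the slide, and the function name `alignment_states` is hypothetical.

```python
def alignment_states(template_row, sequence_row, gap="-"):
    """Label each alignment column as M, Is, or It.

    Assumption: Is = gap in the sequence row, It = gap in the
    template row, as the slide's example suggests.
    """
    if len(template_row) != len(sequence_row):
        raise ValueError("aligned rows must have equal length")
    states = []
    for t, s in zip(template_row, sequence_row):
        if t == gap and s == gap:
            raise ValueError("a column cannot be a gap in both rows")
        if s == gap:
            states.append("Is")
        elif t == gap:
            states.append("It")
        else:
            states.append("M")
    return states


print(alignment_states("LPLS-E", "SA-LRQ"))
```

For the slide's example this yields the path M M Is M It M.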
DeepAlign for Structure Alignment
DeepAlign scores a residue pair using:
• evolutionary information
• local sub-structure similarity
• angular similarity for hydrogen bonding
Score(i,j) = ( max(0, BLOSUM(i,j)) + CLESUM(i,j) ) * v(i,j) * d(i,j)
(slide annotations on the formula: local similarity, global similarity)
• BLOSUM is the local amino acid substitution matrix.
• CLESUM is the local sub-structure substitution matrix.
• v(i,j) measures the angular similarity for hydrogen bonding.
• d(i,j) measures the spatial proximity of two aligned residues.
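The scoring formula can be written directly as a function. This is only a sketch of the arithmetic: the four inputs are assumed to be already looked up or computed for a residue pair (i, j), and the function name `deepalign_score` and the numbers in the example are hypothetical.

```python
def deepalign_score(blosum, clesum, v, d):
    """Score(i,j) = (max(0, BLOSUM) + CLESUM) * v * d for one residue pair.

    Negative BLOSUM entries are clipped to zero, so an unfavorable
    substitution never outweighs the sub-structure term on its own.
    """
    return (max(0.0, blosum) + clesum) * v * d


# Hypothetical values: an unfavorable substitution (BLOSUM = -2) with
# similar local sub-structure (CLESUM = 3).
print(deepalign_score(blosum=-2, clesum=3, v=0.5, d=0.8))
```

Because v(i,j) and d(i,j) multiply the whole sum, a pair with poor hydrogen-bonding agreement or poor spatial proximity scores low even when the substitution terms are favorable.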
CNF-based Alignment Model
Given an alignment $A = \{a_1, a_2, \ldots, a_L\}$, $a_i \in \{M, I_t, I_s\}$, between sequence S and template T, define a conditional probability
$$P(A \mid S,T) = \exp\Big(\sum_i E(a_i, a_{i-1}, S, T)\Big) / Z(S,T)$$
where
• E: a neural network estimating the log-likelihood of a state transition
• Z(S,T): normalization factor
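The conditional probability above can be illustrated on a toy problem. The sketch below replaces the neural network E with a hand-written stand-in (`toy_energy`, hypothetical) and computes Z(S,T) with a forward recursion over all state paths; it assumes M consumes one residue from each protein, Is one template residue only, and It one sequence residue only. It illustrates the normalization and is not the CNFpred implementation.

```python
import math
from functools import lru_cache

STATES = ("M", "Is", "It")
# (sequence residues, template residues) consumed by each state.
# Assumption: Is leaves a gap in the sequence row, It in the template row.
STEP = {"M": (1, 1), "Is": (0, 1), "It": (1, 0)}


def toy_energy(state, prev_state, i, j, seq, tem):
    """Hypothetical stand-in for the neural network E(a_i, a_{i-1}, S, T)."""
    if state == "M":
        return 2.0 if seq[i - 1] == tem[j - 1] else -1.0
    return -0.5 if prev_state == state else -2.0  # gap extend vs. gap open


def logsumexp(values):
    top = max(values)
    if top == -math.inf:
        return -math.inf
    return top + math.log(sum(math.exp(v - top) for v in values))


def path_score(states, seq, tem, energy=toy_energy):
    """Sum of E over the state path; i and j count consumed residues."""
    i = j = 0
    prev, total = None, 0.0
    for state in states:
        di, dj = STEP[state]
        i, j = i + di, j + dj
        total += energy(state, prev, i, j, seq, tem)
        prev = state
    if (i, j) != (len(seq), len(tem)):
        raise ValueError("the path must consume both proteins completely")
    return total


def log_partition(seq, tem, energy=toy_energy):
    """log Z(S,T): forward recursion over every complete state path."""

    @lru_cache(maxsize=None)
    def forward(i, j, state):
        di, dj = STEP[state]
        pi, pj = i - di, j - dj
        if pi < 0 or pj < 0:
            return -math.inf  # this state cannot end at (i, j)
        if (pi, pj) == (0, 0):
            return energy(state, None, i, j, seq, tem)  # first column
        return logsumexp(
            [forward(pi, pj, p) + energy(state, p, i, j, seq, tem) for p in STATES]
        )

    return logsumexp([forward(len(seq), len(tem), s) for s in STATES])


def alignment_probability(states, seq, tem, energy=toy_energy):
    """P(A | S, T) = exp(sum of E over the path) / Z(S, T)."""
    return math.exp(path_score(states, seq, tem, energy) - log_partition(seq, tem, energy))


print(alignment_probability(["M", "M"], "AC", "AG"))
```

Working in log space keeps the recursion numerically stable, and dividing by Z(S,T) is what makes the probabilities of all possible alignments of the pair sum to one.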
Context-Specific Comprehensive Features
Sequence S: MTYKLILN--GKTKGETTTEAVDAATAEKVFQYANDNGVDGEWTYTE
Template T: MTYKLILNSTVRTKSDTVTDAVP---ADKICSFAQQLPWEREWSF--
• EAA: how similar the two residues are
• Esp, Epp: how similar the query’s sequence and profile are to the template’s profile
• Ess3, Ess8: how similar the template’s secondary structure is to the sequence’s predicted secondary structure (3-class and 8-class)
• Esa: how similar the query’s solvent accessibility is to the template’s solvent accessibility
• Ediso: for disordered regions, no structure information is used
The total scoring function is a non-linear combination of these features:
E( ai, ai-1, EAA, Esp, Epp, Ediso, Ess3, Ess8, Esa )
What’s the problem?
• The model describes only the alignment probability, not a log-odds potential relative to a background.
• It incorporates only local information and lacks global information.
Our solution
Propose a protein alignment potential
• With an elaborately designed reference state.
• Can be generalized to sequence-sequence, sequence-structure, and structure-structure alignment.
Incorporate both local and global terms
• For the local term, the CNFpred potential is applied.
• For the global term, the EPAD potential is employed.
Protein alignment potential
BLOSUM62 Potential
Given two amino acids a and b, their mutation potential is defined as
$$u(a \sim b) = \log \frac{P(a \sim b)}{P_{ref}(a \sim b)} = \log \frac{P(a \sim b)}{P(a)\,P(b)}$$
Alignment Potential
Similarly, given one alignment A between sequence S and template T, we define the potential of A as
$$u(A \mid S,T) = \log \frac{P(A \mid S,T)}{P_{ref}(A)} = \log \frac{P(A \mid S,T)}{\sqrt[N]{\prod_{i=1}^{N} P(A \mid x_i, y_i)}}$$
where each $(x_i, y_i)$ is a pair of random proteins with the same lengths as S and T, respectively.
Assumption: the alignment that maximizes the potential is optimal.
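The alignment probability given sequence S and template T can be modeled as
$$P(A \mid S,T) = \exp\big(F(A \mid S,T) + G(A \mid S,T)\big) / Z(S,T)$$
where $F(A \mid S,T)$ is the local term, $G(A \mid S,T)$ is the global term, and $Z(S,T)$ is the partition function, which sums over all alignments $A'$ of S and T:
$$Z(S,T) = \sum_{A'} \exp\big(F(A' \mid S,T) + G(A' \mid S,T)\big)$$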
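Both potentials on this slide are log-odds against a reference state, which a few lines of code can make concrete. The sketch assumes natural logarithms and reads the alignment reference state as the geometric mean of the alignment's probability over N sampled random protein pairs (an interpretation of the slide's formula); the function names and the numbers are hypothetical.

```python
import math


def mutation_potential(p_pair, p_a, p_b):
    """Log-odds of observing a and b aligned versus independently."""
    if min(p_pair, p_a, p_b) <= 0:
        raise ValueError("probabilities must be positive")
    return math.log(p_pair / (p_a * p_b))


def alignment_potential(p_related, p_random_pairs):
    """log P(A|S,T) minus the log of the geometric-mean reference.

    The log of a geometric mean is the arithmetic mean of the logs,
    which is why the reference turns into an expectation term.
    """
    log_ref = sum(math.log(p) for p in p_random_pairs) / len(p_random_pairs)
    return math.log(p_related) - log_ref


# Hypothetical numbers: a pair seen twice as often as chance predicts,
# and an alignment twice as probable as under random protein pairs.
print(mutation_potential(0.02, 0.1, 0.1))
print(alignment_potential(0.2, [0.1, 0.1]))
```

A positive value means the observation is more likely under the related-protein model than under the reference; a negative value means the opposite.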
Protein alignment potential
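$$
\begin{aligned}
u(A \mid S,T) &= \log \frac{P(A \mid S,T)}{\sqrt[N]{\prod_{i=1}^{N} P(A \mid x_i, y_i)}} \\
&= \log \frac{\exp\big(F(A \mid S,T) + G(A \mid S,T)\big) / Z(S,T)}{\sqrt[N]{\prod_{i=1}^{N} \exp\big(F(A \mid x_i,y_i) + G(A \mid x_i,y_i)\big) / Z(x_i,y_i)}} \\
&= F(A \mid S,T) - \mathrm{EXP}_{x,y}\,F(A \mid x,y) + G(A \mid S,T) - \mathrm{EXP}_{x,y}\,G(A \mid x,y) + c(S,T)
\end{aligned}
$$
• $\mathrm{EXP}_{x,y}\,F(A \mid x,y)$ and $\mathrm{EXP}_{x,y}\,G(A \mid x,y)$: expected scores, which can be calculated in advance by sampling.
• $c(S,T)$: independent of any specific alignment.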
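The step from the ratio to the final line can be spelled out. The derivation below is a sketch that assumes the reference state is the N-th root of the product over N sampled pairs $(x_i, y_i)$ and that $\mathrm{EXP}_{x,y}$ denotes the average over those samples; the explicit form of $c(S,T)$ follows from that reading and is not stated on the slide.

```latex
\begin{aligned}
u(A \mid S,T)
  &= \big[F(A \mid S,T) + G(A \mid S,T) - \log Z(S,T)\big]
   - \frac{1}{N}\sum_{i=1}^{N}\big[F(A \mid x_i,y_i) + G(A \mid x_i,y_i) - \log Z(x_i,y_i)\big] \\
  &= F(A \mid S,T) - \mathrm{EXP}_{x,y}\,F(A \mid x,y)
   + G(A \mid S,T) - \mathrm{EXP}_{x,y}\,G(A \mid x,y)
   + \underbrace{\mathrm{EXP}_{x,y}\,\log Z(x,y) - \log Z(S,T)}_{c(S,T)}
\end{aligned}
```

Since $c(S,T)$ contains only partition functions, it takes the same value for every alignment of a given pair, so it can be ignored when searching for the alignment that maximizes the potential.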
Protein alignment potential
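Model the local potential
From CNFpred, we use a context-specific linear-chain model:
$$F(A \mid S,T) = \sum_i E(a_i, a_{i-1}, S, T)$$
The local potential is defined as
$$U_{local}(A \mid S,T) = F(A \mid S,T) - \mathrm{EXP}_{x,y}\,F(A \mid x,y)$$
The expectation term can be calculated by uniformly sampling a few thousand protein pairs, so the local potential is
$$U_{local}(A \mid S,T) = \sum_i \big(E(a_i, a_{i-1}, S, T) - E(a_i, a_{i-1})\big)$$
where $E(a_i, a_{i-1})$ is the expected transition score estimated from the sampled pairs.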
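The local potential is a per-transition difference between a context-specific score and its sampled expectation. The sketch below assumes the expectation is tabulated once per (state, previous state) transition by averaging sampled scores; the function names, data layout, and numbers are all hypothetical.

```python
from statistics import mean


def expected_transition_scores(sampled_scores):
    """Average the sampled E values for each (state, previous state) pair."""
    return {transition: mean(values) for transition, values in sampled_scores.items()}


def local_potential(scores, transitions, expected):
    """Sum over alignment positions of E(a_i, a_{i-1}, S, T) - E(a_i, a_{i-1})."""
    if len(scores) != len(transitions):
        raise ValueError("one score is needed per transition")
    return sum(score - expected[t] for score, t in zip(scores, transitions))


# Hypothetical sampled scores from random protein pairs.
sampled = {("M", "M"): [1.0, 3.0], ("Is", "M"): [-2.0, -4.0]}
expected = expected_transition_scores(sampled)

# Hypothetical context-specific scores along one alignment.
print(local_potential([2.5, -1.0], [("M", "M"), ("Is", "M")], expected))
```

Because the expectations do not depend on S or T, they can be computed once in advance and reused for every query-template pair.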
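What’s the difference between maximizing on probability and maximizing on potential?
[Figure: two Sequence-versus-Template alignment matrices, one per objective]
• Maximize on probability: the alignment is long but less informative and prone to false positives. Good for building models.
• Maximize on potential: the alignment is short but relevant and highly significant. Good for ranking templates.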
Model the global potential
From EPAD, we use a context-specific distance-dependent model:
$$G(A \mid S,T) = \sum_{i,j} \log P(d^{T}_{ij} \mid s_i, s_j)$$
The global potential is defined as
$$U_{global}(A \mid S,T) = G(A \mid S,T) - \mathrm{EXP}_{x,y}\,G(A \mid x,y)$$
The expectation term can be calculated by uniformly sampling a few thousand residue pairs from templates, so the global potential is
$$U_{global}(A \mid S,T) = \sum_{i,j} \big(\log P(d^{T}_{ij} \mid s_i, s_j) - \log P(d^{T}_{ij})\big)$$
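The global potential compares, for each aligned residue pair, how probable the template distance is given the sequence context against how probable it is in general. The sketch below assumes distances are discretized into bins and that both distributions are supplied as plain lists of bin probabilities; the binning, the function name, and the numbers are hypothetical and not taken from EPAD.

```python
import math


def global_potential(template_bins, context_probs, background_probs):
    """Sum of log P(d_ij | s_i, s_j) - log P(d_ij) over aligned pairs.

    template_bins: {(i, j): distance bin observed in the template}
    context_probs: {(i, j): bin probabilities given the sequence context}
    background_probs: bin probabilities sampled from template residue pairs
    """
    total = 0.0
    for pair, bin_index in template_bins.items():
        p_context = context_probs[pair][bin_index]
        p_background = background_probs[bin_index]
        total += math.log(p_context) - math.log(p_background)
    return total


# Hypothetical three-bin example with two aligned residue pairs.
background = [0.2, 0.3, 0.5]
template_bins = {(0, 3): 0, (1, 4): 2}
context = {(0, 3): [0.4, 0.3, 0.3], (1, 4): [0.1, 0.4, 0.5]}
print(global_potential(template_bins, context, background))
```

A pair contributes positively only when the sequence context makes the template's distance more likely than the background does, so pairs that merely sit at common distances add nothing.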
What’s global information given an alignment?
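$$G(A \mid S,T) = \sum_{i,j} \log P(d^{T}_{ij} \mid s_i, s_j)$$
[Figure: residues s_i and s_j of Sequence S are aligned to positions i and j of Template T, which are separated by the distance d^T_ij]
If the alignment is good, the distance between a pair of sequence residues should match that of their aligned template residue pair.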
Result on 1000*6000
CNFpred (local + global potential) compared to:
• HHpred
• CNFpred (local potential)
Welcome to our server
http://raptorx.uchicago.edu/
[Figure: server screenshots labeled Binding and Contact]
Thank you
Jinbo Xu
Feng Zhao
Jianzhu Ma
National Institutes of Health (R01GM0897532)
National Science Foundation (DBI-0960390)
NSF CAREER award CCF-1149811
Alfred P. Sloan Research Fellowship