Protein Threading Using Context-Specific Alignment Potential. Sheng Wang, http://raptorx.uchicago.edu. Toyota Technological Institute at Chicago. Joint work with Jianzhu Ma, Feng Zhao and Jinbo Xu. ISMB 2013, Jul 22, ICC Berlin, Germany


DESCRIPTION

Template-based modeling, including homology modeling and protein threading, is the most reliable method for protein 3D structure prediction. However, alignment errors and template selection remain the main bottlenecks for current template-based modeling methods, especially when the proteins under consideration are distantly related. We present a novel context-specific alignment potential for protein threading, covering both alignment and template selection. Our alignment potential measures the log-odds ratio of an alignment being generated from two related proteins versus being generated from two unrelated proteins, by integrating both local and global context-specific information.


Page 1: Protein threading using context specific alignment potential ismb-2013

Protein Threading Using Context-Specific Alignment Potential

Sheng Wang
http://raptorx.uchicago.edu

Toyota Technological Institute at Chicago
Joint work with Jianzhu Ma, Feng Zhao and Jinbo Xu

ISMB 2013 Jul 22, ICC Berlin, Germany

Page 2: Protein threading using context specific alignment potential ismb-2013

Outline

• Where we are @ template-based modeling
• What’s our work
• What’s the problem
• What’s our solution
• Welcome to our server

Page 3: Protein threading using context specific alignment potential ismb-2013

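Template-based Modeling (or, Threading)

• Observation
– ~50,000 non-redundant structures in PDB
– ~1,200 unique structure folds (SCOP)

• Methodology
– Use known structures to predict a new one

[Figure: the query sequence is aligned against each template sequence in the database]

Query sequence:    DDVYILDQAEEG
Template sequence: DE-FIVD-PDEH

Query sequence:    DDVYILDQAEEG
Template sequence: SPCKR---ADEG

Query sequence:    DDVYILDQAEEG
Template sequence: E--IFVDQADDS

Query sequence:    DDVYILDQAEEG
Template sequence: NMCVFGQWERTY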

Page 4: Protein threading using context specific alignment potential ismb-2013

Template-based Modeling Procedures

Easy: similar sequences → similar structures
• Sequence-based methods, e.g., BLAST, FASTA
• Work only for close homologs (>70% sequence identity)

Medium: similar profiles → similar structures
• A protein profile is a matrix that represents a multiple sequence alignment of similar proteins
• Profile-based methods, e.g., PSI-BLAST, HMMER, HHpred
• Work for relatively remote homologs (>40% sequence identity)

Challenge: dissimilar profiles → similar structures
• Add structural or context-specific information to sequence/profile-based methods
• Threading methods, e.g., MUSTER, RAPTOR, CS-BLAST
• Work for distantly related homologs (<40% sequence identity)
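The sequence-identity thresholds quoted on this slide can be made concrete with a small calculation. The sketch below is illustrative only: it assumes identity is measured over columns where neither sequence has a gap (other definitions exist), the function name `percent_identity` is hypothetical, and the toy alignments are the ones shown on the earlier Template-based Modeling slide.

```python
def percent_identity(query_aln: str, template_aln: str) -> float:
    """Percent identity over columns where neither sequence has a gap.

    Assumption: gapped columns are excluded from the denominator.
    """
    if len(query_aln) != len(template_aln):
        raise ValueError("aligned strings must have equal length")
    aligned = [(q, t) for q, t in zip(query_aln, template_aln)
               if q != "-" and t != "-"]
    if not aligned:
        return 0.0
    matches = sum(1 for q, t in aligned if q == t)
    return 100.0 * matches / len(aligned)


# Toy alignments taken from the slide; they are far too short to be realistic.
query = "DDVYILDQAEEG"
templates = ["DE-FIVD-PDEH", "SPCKR---ADEG", "E--IFVDQADDS", "NMCVFGQWERTY"]
identities = {t: percent_identity(query, t) for t in templates}
```

Under this definition all four toy templates fall at or below 40% identity, which is the regime where the slide says threading methods are needed.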

Page 5: Protein threading using context specific alignment potential ismb-2013

Our Work

• CNFpred: Transforms the template-sequence alignment problem into a machine learning problem in order to calculate an alignment’s probability.

• DeepAlign: Prepares high-quality training data from structural alignments.

• CNF model: A combined machine learning model that incorporates a Conditional Random Field (CRF) and a Neural Network (NN).

Page 6: Protein threading using context specific alignment potential ismb-2013

Protein Alignment Model

[Figure: alignment matrix of sequence S A L R Q against template L P L S E, with match states (M) marked along the alignment path]

Template: L P L S - E
Sequence: S A - L R Q
States:   M M Is M It M

• Match states (M)
• Insertion at sequence (Is)
• Insertion at template (It)

The structural alignments generated by DeepAlign are used as training data.

Page 7: Protein threading using context specific alignment potential ismb-2013

DeepAlign for Structure Alignment

• evolutionary information
• local sub-structure similarity
• angular similarity for hydrogen bonding

Score(i,j) = ( max(0, BLOSUM(i,j)) + CLESUM(i,j) ) * v(i,j) * d(i,j)

BLOSUM is the local amino acid substitution matrix; CLESUM is the local sub-structure substitution matrix; v(i,j) measures the angular similarity for hydrogen bonding; d(i,j) measures the spatial proximity of two aligned residues.

[Figure labels on the formula: local similarity, global similarity]
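The scoring formula can be written as a one-line function. This is a minimal sketch, not DeepAlign's implementation: the four inputs are assumed to have been looked up or computed elsewhere for residue pair (i, j), and the function name `deepalign_score` is hypothetical.

```python
def deepalign_score(blosum: float, clesum: float, v: float, d: float) -> float:
    """Score(i,j) = (max(0, BLOSUM(i,j)) + CLESUM(i,j)) * v(i,j) * d(i,j).

    blosum, clesum: substitution scores for the residue pair (assumed inputs).
    v: angular similarity for hydrogen bonding.
    d: spatial proximity of the two aligned residues.
    """
    return (max(0.0, blosum) + clesum) * v * d


# A negative BLOSUM entry is clipped to zero, so only CLESUM contributes here.
example = deepalign_score(blosum=-2.0, clesum=3.0, v=0.5, d=0.8)
```

Clipping BLOSUM at zero means an unfavourable amino acid substitution cannot by itself cancel out a good local sub-structure match.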

Page 8: Protein threading using context specific alignment potential ismb-2013

CNF-based Alignment Model

Given an alignment A = {a_1, a_2, ..., a_L}, a_i ∈ {M, I_t, I_s}, between sequence S and template T, define a conditional probability

$$P(A \mid S, T) = \exp\Big(\sum_{i} E(a_i, a_{i-1}, S, T)\Big) \Big/ Z(S, T)$$

where

E: a neural network estimating the log-likelihood of a state transition (context-specific)

Z(S,T): normalization factor
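To see how the normalization factor Z(S,T) turns transition scores into a probability, here is a toy sketch. Everything in it is an assumption made for illustration: `toy_energy` is a hand-written stand-in for the neural network E, the `STEP` table encodes one possible reading of the states (Is puts a gap in the sequence, It puts a gap in the template), and Z is computed with a standard forward dynamic program over the alignment lattice.

```python
import math

STATES = ("M", "Is", "It")
# Assumed convention: (sequence residues consumed, template residues consumed).
STEP = {"M": (1, 1), "Is": (0, 1), "It": (1, 0)}


def toy_energy(prev, cur, i, j, seq, tpl):
    """Hypothetical stand-in for E; the real model is a neural network."""
    if cur == "M":
        return 1.0 if seq[i - 1] == tpl[j - 1] else -0.5
    return -0.2 if prev == cur else -1.0  # gap extension vs. gap opening


def alignment_score(states, seq, tpl):
    """Sum of transition energies along one alignment path."""
    i = j = 0
    prev, total = "START", 0.0
    for cur in states:
        i, j = i + STEP[cur][0], j + STEP[cur][1]
        total += toy_energy(prev, cur, i, j, seq, tpl)
        prev = cur
    return total


def partition_function(seq, tpl):
    """Z(S,T): sum of exp(score) over all full alignments (forward algorithm)."""
    m, n = len(seq), len(tpl)
    fwd = {(0, 0, "START"): 1.0}
    for i in range(m + 1):
        for j in range(n + 1):
            for cur in STATES:
                pi, pj = i - STEP[cur][0], j - STEP[cur][1]
                if pi < 0 or pj < 0:
                    continue
                total = 0.0
                for prev in STATES + ("START",):
                    weight = fwd.get((pi, pj, prev), 0.0)
                    if weight:
                        total += weight * math.exp(
                            toy_energy(prev, cur, i, j, seq, tpl))
                if total:
                    fwd[(i, j, cur)] = total
    return sum(fwd.get((m, n, s), 0.0) for s in STATES)


def alignment_probability(states, seq, tpl):
    """P(A|S,T) = exp(sum_i E) / Z(S,T); states must consume both sequences."""
    return math.exp(alignment_score(states, seq, tpl)) / partition_function(seq, tpl)


# State path from the Protein Alignment Model slide, scored with the toy energy.
p = alignment_probability(("M", "M", "Is", "M", "It", "M"), "SALRQ", "LPLSE")
```

The forward recursion visits each lattice cell once, so Z costs time proportional to the product of the two lengths rather than the exponential number of alignments.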

Page 9: Protein threading using context specific alignment potential ismb-2013

Comprehensive Features

Sequence S: MTYKLILN--GKTKGETTTEAVDAATAEKVFQYANDNGVDGEWTYTE
Template T: MTYKLILNSTVRTKSDTVTDAVP---ADKICSFAQQLPWEREWSF--

• How similar two residues are: EAA
• How similar the query’s sequence and profile are to the template’s profile: Esp, Epp
• How similar the template’s secondary structure is to the sequence’s predicted secondary structure (3-class and 8-class): Ess3, Ess8
• How similar the query’s solvent accessibility is to the template’s solvent accessibility: Esa
• For disordered regions, Ediso: no structure information is used.

The total scoring function is a non-linear combination of:

E( ai, ai-1, EAA, Esp, Epp, Ediso, Ess3, Ess8, Esa )

Page 10: Protein threading using context specific alignment potential ismb-2013

What’s the problem?

• Only the alignment probability is described, rather than a log-odds potential relative to a background.

• Only local information is incorporated; global information is not sufficiently used.

Page 11: Protein threading using context specific alignment potential ismb-2013

Our solution

Propose a protein alignment potential
• With an elaborately designed reference state.
• Can be generalized to sequence-sequence, sequence-structure and structure-structure alignment.

Incorporate both local and global terms
• For the local term, the CNFpred potential is applied.
• For the global term, the EPAD potential is employed.

Page 12: Protein threading using context specific alignment potential ismb-2013

Protein alignment potential

Given two amino acids a and b, their mutation potential (the BLOSUM62 potential) is defined as follows.

$$u(a, b) = \log \frac{P(a, b)}{P_{ref}(a, b)} = \log \frac{P(a, b)}{P(a)\,P(b)}$$

Similarly, given one alignment A between sequence S and template T, we define the potential of A (the alignment potential) as follows.

$$u(A \mid S, T) = \log \frac{P(A \mid S, T)}{P_{ref}(A)} = \log \frac{P(A \mid S, T)}{\sqrt[N]{\prod_{i=1}^{N} P(A \mid x_i, y_i)}}$$

Each x_i and y_i are two random proteins with the same length as S and T, respectively.

Assumption: the alignment maximizing the potential is the optimal.
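Both potentials on this slide are log-odds ratios against a reference state, which a short sketch can make concrete. The probabilities below are made-up numbers, the function names are hypothetical, and the alignment reference is taken to be the geometric mean of the alignment's probability over N random protein pairs, which is one reading of the slide's formula.

```python
import math


def mutation_potential(p_ab: float, p_a: float, p_b: float) -> float:
    """u(a, b) = log( P(a, b) / (P(a) * P(b)) ): BLOSUM-style log-odds."""
    return math.log(p_ab / (p_a * p_b))


def alignment_potential(p_align: float, p_random_pairs: list) -> float:
    """u(A|S,T) = log P(A|S,T) - mean_i log P(A|x_i, y_i).

    Assumption: the reference probability is the geometric mean over the
    sampled random pairs, so its log is the arithmetic mean of the logs.
    """
    log_ref = sum(math.log(p) for p in p_random_pairs) / len(p_random_pairs)
    return math.log(p_align) - log_ref


# Made-up numbers: the pair (a, b) co-occurs 4x more often than chance.
u_ab = mutation_potential(p_ab=0.01, p_a=0.05, p_b=0.05)
# Made-up numbers: the alignment is twice as probable as under random pairs.
u_A = alignment_potential(0.2, [0.1, 0.1])
```

A positive value means the observation is more likely under the related-proteins model than under the reference, which is why the slide treats the alignment with the largest potential as optimal.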

Page 13: Protein threading using context specific alignment potential ismb-2013

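Protein alignment potential

The alignment probability given sequence S and template T could be modeled as follows,

$$P(A \mid S, T) = \exp\big(F(A \mid S, T) + G(A \mid S, T)\big) \big/ Z(S, T)$$

where F is the local term, G is the global term, and Z(S,T) is the partition function, which sums the unnormalized weight over all alignments A:

$$Z(S, T) = \sum_{A} \exp\big(F(A \mid S, T) + G(A \mid S, T)\big)$$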

Page 14: Protein threading using context specific alignment potential ismb-2013

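Protein alignment potential

$$\begin{aligned}
u(A \mid S, T) &= \log \frac{P(A \mid S, T)}{\sqrt[N]{\prod_{i=1}^{N} P(A \mid x_i, y_i)}} \\
&= \log \frac{\exp\big(F(A \mid S, T) + G(A \mid S, T)\big) / Z(S, T)}{\sqrt[N]{\prod_{i=1}^{N} \exp\big(F(A \mid x_i, y_i) + G(A \mid x_i, y_i)\big) / Z(x_i, y_i)}} \\
&= F(A \mid S, T) - \mathrm{EXP}_{x,y}\, F(A \mid x, y) + G(A \mid S, T) - \mathrm{EXP}_{x,y}\, G(A \mid x, y) + c(S, T)
\end{aligned}$$

• EXP_{x,y} F(A|x,y) and EXP_{x,y} G(A|x,y): expected scores, which can be calculated in advance by sampling.
• c(S,T): collects the partition-function terms and is independent of any specific alignment.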

Page 15: Protein threading using context specific alignment potential ismb-2013

Model the local potential

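From CNFpred, we use a context-specific linear-chain model as,

$$F(A \mid S, T) = \sum_{i} E(a_i, a_{i-1}, S, T)$$

The local potential is defined as,

$$U_{local}(A \mid S, T) = F(A \mid S, T) - \mathrm{EXP}_{x,y}\, F(A \mid x, y)$$

The expectation term can be calculated by uniformly sampling a few thousand protein pairs, so the local potential is

$$U_{local}(A \mid S, T) = \sum_{i} \big( E(a_i, a_{i-1}, S, T) - E(a_i, a_{i-1}) \big)$$

where E(a_i, a_{i-1}), without S and T, is the sampled expectation of the transition score.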
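Subtracting a sampled background from each transition score can be sketched as follows. This is an illustration under assumptions, not CNFpred's code: the energies are made-up numbers, the background is assumed to be a per-transition average over energies collected from random protein pairs, and the names `reference_energies` and `local_potential` are hypothetical.

```python
from collections import defaultdict


def reference_energies(samples):
    """Average energy per (state, previous state) transition.

    samples: iterable of ((a_i, a_prev), energy) collected from randomly
    sampled protein pairs (assumed to be precomputed elsewhere).
    """
    totals, counts = defaultdict(float), defaultdict(int)
    for transition, energy in samples:
        totals[transition] += energy
        counts[transition] += 1
    return {t: totals[t] / counts[t] for t in totals}


def local_potential(observed, reference):
    """Sum over positions of E(a_i, a_{i-1}, S, T) minus the reference energy."""
    return sum(energy - reference[transition] for transition, energy in observed)


# Made-up background samples and a made-up two-position alignment.
background = [(("M", "M"), 0.2), (("M", "M"), 0.4), (("Is", "M"), -1.0)]
reference = reference_energies(background)
observed = [(("M", "M"), 1.3), (("Is", "M"), -0.5)]
u_local = local_potential(observed, reference)
```

Because the reference is computed once from the samples, scoring a new alignment only requires a table lookup per position.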

Page 16: Protein threading using context specific alignment potential ismb-2013

What’s the difference between maximizing on probability and maximizing on potential?

Maximize on probability
• Long but less informative and prone to false positives.
• Good for building models.

Maximize on potential
• Short but relevant and highly significant.
• Good for ranking templates.

[Figure: two panels, each with Template and Sequence axes]

Page 17: Protein threading using context specific alignment potential ismb-2013

Model the global potential

From EPAD, we use a context-specific distance-dependent model as,

$$G(A \mid S, T) = \sum_{i,j} \log P(d^{T}_{ij} \mid s_i, s_j)$$

The global potential is defined as,

$$U_{global}(A \mid S, T) = G(A \mid S, T) - \mathrm{EXP}_{x,y}\, G(A \mid x, y)$$

The expectation term can be calculated by uniformly sampling a few thousand residue pairs from templates, so the global potential is

$$U_{global}(A \mid S, T) = \sum_{i,j} \big( \log P(d^{T}_{ij} \mid s_i, s_j) - \log P(d^{T}_{ij}) \big)$$
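The global term compares a context-specific distance probability with a background distance probability for every aligned residue pair. The sketch below uses small lookup tables with made-up values purely for illustration; in EPAD the conditional distribution is predicted from sequence context rather than read from a table, and the names `global_potential`, `p_cond` and `p_background` are hypothetical.

```python
import math


def global_potential(pairs, p_cond, p_background):
    """Sum over pairs of log P(d | s_i, s_j) - log P(d).

    pairs: iterable of (distance_bin, context_i, context_j), where
    distance_bin is the template distance of the aligned residue pair.
    p_cond, p_background: assumed lookup tables of probabilities.
    """
    return sum(
        math.log(p_cond[(d, si, sj)]) - math.log(p_background[d])
        for d, si, sj in pairs
    )


# Made-up tables: two distance bins and single-letter context labels.
p_cond = {("near", "H", "H"): 0.6, ("far", "H", "E"): 0.2}
p_background = {"near": 0.3, "far": 0.4}

# One pair is twice as likely as background, the other half as likely.
u_global = global_potential(
    [("near", "H", "H"), ("far", "H", "E")], p_cond, p_background)
```

Pairs whose template distance agrees with what the sequence context predicts contribute positive terms, and mismatched pairs contribute negative ones.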

Page 18: Protein threading using context specific alignment potential ismb-2013

What’s global information given an alignment?

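$$G(A \mid S, T) = \sum_{i,j} \log P(d^{T}_{ij} \mid s_i, s_j)$$

[Figure: residues i and j of Sequence S, with contexts s_i and s_j, are aligned to two residues of Template T separated by distance d^T_ij]

If the alignment is good, the distance between a pair of sequence residues should match well with the distance between their aligned template residues.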

Page 19: Protein threading using context specific alignment potential ismb-2013

Result on 1000*6000

CNFpred (local+global potential) compared to:
• HHpred
• CNFpred (local potential)

Page 21: Protein threading using context specific alignment potential ismb-2013

Thank you

Jinbo Xu
Feng Zhao
Jianzhu Ma

National Institutes of Health (R01GM0897532)
National Science Foundation (DBI-0960390)
NSF CAREER award CCF-1149811
Alfred P. Sloan Research Fellowship