Protein threading using context specific alignment potential ismb-2013

Protein Threading Using Context-Specific Alignment Potential

Sheng Wang
http://raptorx.uchicago.edu

Toyota Technological Institute at Chicago
Joint work with Jianzhu Ma, Feng Zhao and Jinbo Xu

ISMB 2013 Jul 22, ICC Berlin, Germany


Outline

• Where we are @ template-based modeling
• What’s our work
• What’s the problem
• What’s our solution
• Welcome to our server

Template-based Modeling (or, Threading)

• Observation
– ~50,000 non-redundant structures in PDB
– ~1,200 unique structure folds (SCOP)

• Methodology
– Use known structures to predict a new one

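[Figure: the query sequence is aligned against each template sequence in the database]

Query sequence:    DDVYILDQAEEG
Template sequence: DE-FIVD-PDEH

Query sequence:    DDVYILDQAEEG
Template sequence: SPCKR---ADEG

Query sequence:    DDVYILDQAEEG
Template sequence: E--IFVDQADDS

Query sequence:    DDVYILDQAEEG
Template sequence: NMCVFGQWERTY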

Template-based Modeling Procedures

Easy: similar sequences → similar structures
• Sequence-based methods, e.g., BLAST, FASTA
• Work only for close homologs (>70% sequence identity)

Medium: similar profiles → similar structures
• A protein profile is a matrix that represents a multiple sequence alignment of similar proteins
• Profile-based methods, e.g., PSI-BLAST, HMMER, HHpred
• Work for relatively remote homologs (>40% sequence identity)

Challenge: dissimilar profiles → similar structures
• Add structural or context-specific information to sequence/profile-based methods
• Threading methods, e.g., MUSTER, RAPTOR, CS-BLAST
• Work for distantly related homologs (<40% sequence identity)
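The sequence-identity thresholds above can be made concrete with a small sketch. It assumes one common definition of identity (identical residues divided by the number of gap-free aligned columns); other definitions use a different denominator, and the function name `sequence_identity` is illustrative only.

```python
def sequence_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Fraction of gap-free columns in which both rows carry the same residue."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    # Keep only the columns where neither row has a gap.
    columns = [(a, b) for a, b in zip(aligned_a, aligned_b) if a != gap and b != gap]
    if not columns:
        return 0.0
    matches = sum(1 for a, b in columns if a == b)
    return matches / len(columns)


if __name__ == "__main__":
    query = "DDVYILDQAEEG"
    for template in ["DE-FIVD-PDEH", "SPCKR---ADEG", "E--IFVDQADDS", "NMCVFGQWERTY"]:
        print(template, f"{100 * sequence_identity(query, template):.1f}%")
```

Under this definition, an aligned pair would fall into the "easy", "medium" or "challenge" regime depending on which side of the 70% and 40% thresholds its identity lands.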

Our Work

• CNFpred: Transforms the template-sequence alignment problem into a machine learning problem that calculates the alignment’s probability.

• DeepAlign: Prepares high-quality structural alignments as training data.

• CNF (Conditional Neural Field) model: A combined machine learning model that incorporates a Conditional Random Field (CRF) and a Neural Network (NN).

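Protein Alignment Model

[Figure: alignment path through a matrix with the sequence (S A L R Q) along one axis and the template (L P L S E) along the other]

Template: L  P  L  S  -  E
Sequence: S  A  -  L  R  Q
States:   M  M  Is M  It M

• Match state (M)
• Insertion at sequence (Is)
• Insertion at template (It)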

The structural alignment generated by DeepAlign is used for training data
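A minimal sketch of how a gapped alignment maps to the M / Is / It state sequence. The convention is inferred from the example on this slide (a gap in the sequence row is labelled Is, a gap in the template row is labelled It); the function name is hypothetical.

```python
from __future__ import annotations


def alignment_states(template_row: str, sequence_row: str, gap: str = "-") -> list[str]:
    """Label each alignment column as M, Is or It."""
    if len(template_row) != len(sequence_row):
        raise ValueError("both rows of the alignment must have equal length")
    states = []
    for t, s in zip(template_row, sequence_row):
        if t == gap and s == gap:
            raise ValueError("a column cannot be a gap in both rows")
        if s == gap:
            states.append("Is")   # template residue faces a gap in the sequence
        elif t == gap:
            states.append("It")   # sequence residue faces a gap in the template
        else:
            states.append("M")    # both residues are aligned
    return states


if __name__ == "__main__":
    print(alignment_states("LPLS-E", "SA-LRQ"))
```

Applied to a DeepAlign structural alignment, the resulting state sequence is the kind of label a supervised alignment model would be trained on.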

DeepAlign for Structure Alignment

• evolutionary information
• local sub-structure similarity
• angular similarity for hydrogen bonding

Score(i,j) = ( max(0, BLOSUM(i,j)) + CLESUM(i,j) ) * v(i,j) * d(i,j)

(formula annotations on the slide: local similarity, global similarity)

• BLOSUM is the local amino acid substitution matrix;
• CLESUM is the local sub-structure substitution matrix;
• v(i,j) measures the angular similarity for hydrogen bonding;
• d(i,j) measures the spatial proximity of two aligned residues.
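The scoring formula can be written out directly. This is only a sketch of the arithmetic: the four inputs are assumed to be precomputed numbers for one residue pair (i, j), and nothing here reproduces how DeepAlign actually derives them.

```python
def deepalign_score(blosum: float, clesum: float, v: float, d: float) -> float:
    """Score(i,j) = (max(0, BLOSUM) + CLESUM) * v * d for one residue pair."""
    # Negative BLOSUM values are clipped to zero before CLESUM is added.
    return (max(0.0, blosum) + clesum) * v * d


if __name__ == "__main__":
    # Hypothetical values for a single residue pair.
    print(deepalign_score(blosum=-2, clesum=3, v=0.5, d=0.8))
```

Because v and d multiply the sum, a pair with poor angular similarity or poor spatial proximity is scored low even when the substitution terms are favourable.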

CNF-based Alignment Model

Given an alignment $A = \{a_1, a_2, \ldots, a_L\}$, $a_i \in \{M, I_t, I_s\}$, between Sequence S and Template T

Define a conditional probability

$$P(A \mid S, T) = \exp\Big( \sum_{i} E(a_i, a_{i-1}, S, T) \Big) \Big/ Z(S, T)$$

Where,

E: a neural network estimating the log-likelihood of state transition

Z(S,T): normalization factor

Context-Specific: E depends on S and T
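To illustrate the normalization, the sketch below enumerates every state path for a tiny sequence/template pair and converts path scores into probabilities. Everything in it is an assumption for illustration: `toy_E` is a hand-written stand-in for the neural network, brute-force enumeration replaces the dynamic programming a real implementation would use, and the state convention (Is consumes a template residue, It consumes a sequence residue) follows the example alignment shown earlier.

```python
import math


def enumerate_alignments(m, n):
    """Yield every state path that consumes m template and n sequence residues."""
    def extend(i, j, path):
        if i == m and j == n:
            yield tuple(path)
            return
        if i < m and j < n:
            yield from extend(i + 1, j + 1, path + ["M"])
        if i < m:
            yield from extend(i + 1, j, path + ["Is"])
        if j < n:
            yield from extend(i, j + 1, path + ["It"])
    yield from extend(0, 0, [])


def toy_E(prev, cur, S, T, i, j):
    """Hypothetical transition score; the real E is a trained neural network."""
    if cur == "M":
        return 1.0 if T[i] == S[j] else -0.5
    return -0.3 if prev == cur else -1.0   # cheaper to extend a gap than to open one


def path_score(path, S, T):
    """Sum of transition scores along one alignment path."""
    i = j = 0
    prev, total = None, 0.0
    for cur in path:
        total += toy_E(prev, cur, S, T, i, j)
        if cur in ("M", "Is"):
            i += 1          # a template residue was consumed
        if cur in ("M", "It"):
            j += 1          # a sequence residue was consumed
        prev = cur
    return total


def alignment_probabilities(S, T):
    """P(A|S,T) = exp(score(A)) / Z, computed over all enumerated paths."""
    paths = list(enumerate_alignments(len(T), len(S)))
    scores = [path_score(p, S, T) for p in paths]
    top = max(scores)                       # subtract the maximum for numerical stability
    weights = [math.exp(s - top) for s in scores]
    Z = sum(weights)                        # partition function, scaled by exp(-top)
    return {p: w / Z for p, w in zip(paths, weights)}


if __name__ == "__main__":
    probs = alignment_probabilities("AC", "AC")
    best = max(probs, key=probs.get)
    print(best, round(probs[best], 3))
```

The number of paths grows very quickly with length, which is why enumeration is only viable for toy inputs like this one.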

Comprehensive Features

Sequence S: MTYKLILN--GKTKGETTTEAVDAATAEKVFQYANDNGVDGEWTYTE
Template T: MTYKLILNSTVRTKSDTVTDAVP---ADKICSFAQQLPWEREWSF--

• How similar the two residues are: E_AA
• How similar the query’s sequence and profile are to the template’s profile: E_sp, E_pp
• How similar the template’s secondary structure is to the sequence’s predicted secondary structure (3-class and 8-class): E_ss3, E_ss8
• How similar the query’s solvent accessibility is to the template’s solvent accessibility: E_sa
• For disordered regions (E_diso), no structure information is used.

Total scoring function is a non-linear combination of:

E( a_i, a_{i-1}, E_AA, E_sp, E_pp, E_diso, E_ss3, E_ss8, E_sa )

What’s the problem?

• Only the alignment probability is described, rather than a log-odds potential relative to a background.

• Only local information is incorporated; global information is missing.

Our solution

Propose a protein alignment potential
• With an elaborately designed reference state.
• Can be generalized to sequence-sequence, sequence-structure as well as structure-structure alignment.

Incorporate both local and global terms
• For the local term, the CNFpred potential is applied.
• For the global term, the EPAD potential is employed.

Protein alignment potential

BLOSUM62 Potential

Given 2 AAs a and b, their mutation potential is defined as follows.

$$u(a, b) = \log \frac{P(a, b)}{P_{ref}(a, b)} = \log \frac{P(a, b)}{P(a)\,P(b)}$$

Alignment Potential

Similarly, given one alignment A between sequence S and template T, we define the potential of A as follows.

$$u(A \mid S, T) = \log \frac{P(A \mid S, T)}{P_{ref}(A)} = \log \frac{P(A \mid S, T)}{\sqrt[N]{\prod_{i=1}^{N} P(A \mid x_i, y_i)}}$$

x_i and y_i are two random proteins with the same length as S and T, respectively; the product runs over N such pairs.
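The mutation potential for two amino acids is a plain log-odds ratio, which the sketch below computes from three probabilities. The numbers in the example are made up, and the natural logarithm is an assumption (the slide does not state a base or scaling).

```python
import math


def mutation_potential(p_ab: float, p_a: float, p_b: float) -> float:
    """u(a,b) = log( P(a,b) / (P(a) * P(b)) ), natural log assumed."""
    return math.log(p_ab / (p_a * p_b))


if __name__ == "__main__":
    # Hypothetical frequencies: the pair is seen twice as often as chance predicts.
    print(mutation_potential(p_ab=0.02, p_a=0.1, p_b=0.1))
    # A pair observed exactly at the background rate scores zero.
    print(mutation_potential(p_ab=0.01, p_a=0.1, p_b=0.1))
```

A positive value means the pair is observed more often than the reference state predicts, which is the same reading the alignment potential is meant to have for a whole alignment.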

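Assumption: the alignment maximizing the potential is the optimal.

The alignment probability given sequence S and template T could be modeled as follows,

$$P(A \mid S, T) = \exp\big( F(A \mid S, T) + G(A \mid S, T) \big) \big/ Z(S, T)$$

where F is the local term, G is the global term, and Z is the partition function:

$$Z(S, T) = \sum_{A} \exp\big( F(A \mid S, T) + G(A \mid S, T) \big)$$

Substituting into the alignment potential:

$$
\begin{aligned}
u(A \mid S, T) &= \log \frac{P(A \mid S, T)}{\sqrt[N]{\prod_{i=1}^{N} P(A \mid x_i, y_i)}} \\
&= \log \frac{\exp\big( F(A \mid S, T) + G(A \mid S, T) \big) / Z(S, T)}{\sqrt[N]{\prod_{i=1}^{N} \exp\big( F(A \mid x_i, y_i) + G(A \mid x_i, y_i) \big) / Z(x_i, y_i)}} \\
&= F(A \mid S, T) - \mathrm{EXP}_{x,y}\, F(A \mid x, y) + G(A \mid S, T) - \mathrm{EXP}_{x,y}\, G(A \mid x, y) + c(S, T)
\end{aligned}
$$

• $\mathrm{EXP}_{x,y}$ terms: expected score, can be calculated in advance by sampling.
• $c(S, T) = \mathrm{EXP}_{x,y} \log Z(x, y) - \log Z(S, T)$: independent of any specific alignment.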
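The step from the reference state to an expected score relies on the fact that the logarithm of a geometric mean equals the arithmetic mean of the logarithms. A small numerical check, with made-up probabilities standing in for P(A|S,T) and for the sampled reference pairs:

```python
import math


def potential_via_geometric_mean(p_target, p_samples):
    """log( P / N-th root of the product of the sampled probabilities )."""
    geo_mean = math.prod(p_samples) ** (1.0 / len(p_samples))
    return math.log(p_target / geo_mean)


def potential_via_expected_log(p_target, p_samples):
    """log P minus the average log-probability over the samples."""
    expected_log = sum(math.log(p) for p in p_samples) / len(p_samples)
    return math.log(p_target) - expected_log


if __name__ == "__main__":
    p_target = 0.2                      # hypothetical P(A | S, T)
    p_samples = [0.1, 0.05, 0.2]        # hypothetical P(A | x, y) for sampled pairs
    print(potential_via_geometric_mean(p_target, p_samples))
    print(potential_via_expected_log(p_target, p_samples))
```

The second form is the practical one: averages of log-scores over sampled protein pairs can be tabulated once and reused for every alignment.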


Model the local potential

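From CNFpred, we use a context-specific linear chain model as,

$$F(A \mid S, T) = \sum_{i} E(a_i, a_{i-1}, S, T)$$

The local potential is defined as,

$$U_{local}(A \mid S, T) = F(A \mid S, T) - \mathrm{EXP}_{x,y}\, F(A \mid x, y)$$

The expectation term can be calculated by uniformly sampling a few thousand protein pairs, so the local potential is

$$U_{local}(A \mid S, T) = \sum_{i} \big( E(a_i, a_{i-1}, S, T) - E(a_i, a_{i-1}) \big)$$

where $E(a_i, a_{i-1})$ is the expected transition score over the sampled protein pairs.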
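A sketch of the local potential as a sum of transition scores minus their sampled expectations. The data layout is an assumption: each transition is a `(prev_state, cur_state, score)` tuple whose score would come from the CNFpred network, and the expectation is taken as a per-transition-type average over sampled scores.

```python
from collections import defaultdict


def expected_transition_scores(samples):
    """Average score for each (prev_state, cur_state) over sampled transitions."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for prev, cur, score in samples:
        totals[(prev, cur)] += score
        counts[(prev, cur)] += 1
    return {key: totals[key] / counts[key] for key in totals}


def local_potential(scored_path, expected):
    """Sum over the alignment of (context-specific score - expected score)."""
    return sum(score - expected[(prev, cur)] for prev, cur, score in scored_path)


if __name__ == "__main__":
    # Hypothetical scores sampled from random protein pairs.
    samples = [("M", "M", 1.0), ("M", "M", 3.0), ("M", "Is", -1.0)]
    expected = expected_transition_scores(samples)
    # Hypothetical scores for one alignment between S and T.
    path = [("M", "M", 2.5), ("M", "Is", -0.5)]
    print(local_potential(path, expected))
```

Only transitions that score better than their background average contribute positively, which matches the log-odds reading of the potential.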

What’s the difference between maximizing on probability and maximizing on potential?

Maximize on probability
• Long, but less informative and prone to false positives.
• Good for building models.

Maximize on potential
• Short, but relevant and highly significant.
• Good for ranking templates.

[Figure: two Template-versus-Sequence alignment matrices, one per objective]

Model the global potential

From EPAD, we use a context-specific distance-dependent model as,

$$G(A \mid S, T) = \sum_{i,j} \log P(d^{T}_{ij} \mid s_i, s_j)$$

The global potential is defined as,

$$U_{global}(A \mid S, T) = G(A \mid S, T) - \mathrm{EXP}_{x,y}\, G(A \mid x, y)$$

The expectation term can be calculated by uniformly sampling a few thousand residue pairs from templates, so the global potential is

$$U_{global}(A \mid S, T) = \sum_{i,j} \big( \log P(d^{T}_{ij} \mid s_i, s_j) - \log P(d^{T}_{ij}) \big)$$
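A sketch of the global potential as a sum of log-odds over aligned residue pairs. The lookup tables and the use of discrete distance bins are assumptions for illustration: in practice the conditional probabilities would come from EPAD and the background from sampled template residue pairs.

```python
import math


def global_potential(aligned_pairs, conditional, background):
    """Sum of log P(d | s_i, s_j) - log P(d) over aligned residue pairs.

    aligned_pairs: list of (s_i, s_j, distance_bin), where distance_bin is the
    binned distance between the template residues aligned to i and j.
    """
    total = 0.0
    for s_i, s_j, d_bin in aligned_pairs:
        total += math.log(conditional[(s_i, s_j)][d_bin]) - math.log(background[d_bin])
    return total


if __name__ == "__main__":
    # Hypothetical probability tables over two distance bins.
    conditional = {("A", "L"): {0: 0.4, 1: 0.6}}
    background = {0: 0.2, 1: 0.8}
    pairs = [("A", "L", 0), ("A", "L", 1)]
    print(global_potential(pairs, conditional, background))
```

A pair contributes positively when the template distance is more likely for that sequence context than for a random residue pair.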

What’s global information given an alignment?

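$$G(A \mid S, T) = \sum_{i,j} \log P(d^{T}_{ij} \mid s_i, s_j)$$

[Figure: positions i and j of Sequence S (labelled s_i and s_j) are aligned to two residues of Template T that lie a distance d^T_ij apart]

If the alignment is good, the distance of a sequence residue pair shall match well with that of their aligned template residue pair.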

Result on 1000*6000

CNFpred (local + global potential) compared to:
• HHpred
• CNFpred (local potential)

Welcome to our server

http://raptorx.uchicago.edu/

• Structure prediction: http://raptorx.uchicago.edu/StructurePrediction/predict/
• Binding: http://raptorx.uchicago.edu/BindingSite/myjobs/720709/
• Contact: http://raptorx.uchicago.edu/ContactMap/myjobs/984588/

Thank you

Jinbo Xu

Feng Zhao

Jianzhu Ma

National Institutes of Health (R01GM0897532)
National Science Foundation (DBI-0960390)
NSF CAREER award CCF-1149811
Alfred P. Sloan Research Fellowship
