31
Prediction of protein Prediction of protein disorder disorder Zsuzsanna Dosztányi MTA-ELTE Momentum Bioinformatics Group Department of Biochemistry Eotvos Lorand University, Budapest, Hungary [email protected]

Prediction of protein disorder Zsuzsanna Dosztányi MTA-ELTE Momentum Bioinformatics Group Department of Biochemistry Eotvos Lorand University, Budapest,

Embed Size (px)

Citation preview

Prediction of protein Prediction of protein disorderdisorder

Zsuzsanna DosztányiMTA-ELTE Momentum Bioinformatics GroupDepartment of Biochemistry Eotvos Lorand University, Budapest, [email protected]

Protein Structure/Function Paradigm

Dominant view: 3D structure is a prerequisite for protein function

But….

Heat stability Protease sensitivity Failed attempts to crystallize Lack of NMR signals Increased molecular volume “Freaky” sequences …

IDPs

Intrinsically disordered proteins/regions (IDPs/IDRs)

Do not adopt a well-defined structure in isolation under native-like conditions

Highly flexible ensembles Functional proteins

p53 tumor suppressortransactivation DNA-binding tetramerizationregulation

Wells et al. PNAS 2008; 105: 5762

TAD DBD TD RD

Disordered region Disordered region

Bioinformatics of protein disorder

Part 1 Prediction of protein disorder Databases Prediction of protein disorder

Part 2 Biology of disordered proteins Prediction of functional regions within IDPs

Datasets Ordered proteins in the PDB

over 100000 structures few 1000s folds

Some structures in the PDB classify as disordered! only adopt a well-defined structure in complex

in crystals, with cofactors, proteins, …

Disorder in the PDB Missing electron density regions from the PDB

NMR structures with large structural variations

Less than 10% of all positions

Usually short (<10 residues), often at the termini

Disprot

www.disprot.org

Current release: 6.02Release date: 05/24/2013Number of proteins: 694Number of disordered regions: 1539

Experimentally verified disordered

proteins collected from literature

(X-ray, NMR, CD, proteolysis, SAXS,

heat stability, gel filtration, …)

Additional databases Combining experiments and predictions

Genome level annotations

MobiDB: http://mobidb.bio.unipd.it D2P2: http://d2p2.pro IDEAL: http://www.ideal.force.cs.is.nagoya-u.ac.jp/IDEAL

Sequence properties of disordered proteins Amino acid compositional bias High proportion of polar and charged amino acids

(Gln, Ser, Pro, Glu, Lys) Low proportion of bulky, hydrophobhic amino acids

(Val, Leu, Ile, Met, Phe, Trp, Tyr) Low sequence complexity Signature sequences identifying disordered proteins

Protein disorder is encoded in the amino acid sequence

Amino acid compositions

He et al. Cell Res. 2009; 19: 929

Prediction methods for protein disorder

Based on amino acid propensity scales or on simplified

biophysical models GlobPlot, FoldIndex, FoldUnfold, IUPred, UCON

Machine learning approaches PONDR VL-XT, VL3, VSL2; Disopred; POODLE S and L ; DisEMBL;

DisPSSMP; PrDOS, DisPro, OnD-CRF, POODLE-W, RONN

Over 50 methods

1.Amino acid propensity scaleGlobPlotCompare the tendency of amino acids:

to be in coil (irregular) structure. to be in regular secondary structure elements

Linding (2003) NAR 31, 3701

GlobPlot

From position specific predictions

Where are the ordered domains?

Longer disordered segments?

Noise vs. real data

GlobPlot: http://globplot.embl.de/

downhill regions correspond to putative domains (GlobDom)

up-hill regions correspond to predicted protein disorder

Globular proteins

Large entropy penalty

Large number of inter-residue contacts

2. Physical principles

IUPred

If a residue cannot form enough favorable interactions within its sequential environment, it will not adopt a well defined structure it will be disordered

Dosztanyi (2005) JMB 347, 827

Based on an energy estimation method

Parameters calculated from statistics of globular proteins

No training on disordered proteins

IUPred The algorithm:

…PSVEPPLSQETFSDL WKLLPENNVLSPLPSQAMDDLMLSPDDIEQWFTEDPGPDEAPRMPEAAPRVA PAPAAPTPAA...

Based only on the composition of environment of D’swe try to predict if it is in a disordered region or not:

Amino acid composition of environ-ment:

A – 10%C – 0%D – 12 %E – 10 %F – 2 % etc…

Estimate the interaction energy between the residue and its environment

Decide the probability of the residue being disordered based on this

3. Machine learning approaches INPUT

OUTPUT.ATVQLSMIWQSTR.

D

O

DISOPRED2: …..AMDDLMLSPDDIEQWFTED…..

Assign label: D or O D O

F(inp)SVM with linear kernel

Ward (2004) JMB 337, 635

DISOPRED2

Cutoff value!

PONDR VSL2

Differences in short and long disorder amino acid composition methods trained on one type of dataset tested on

other dataset resulted in lower efficiencies

PONDR VSL2: separate predictors for short and long disorder combined

length independent predictions

Peng (2006) BMC Bioinformatics 7, 208

4. Metaservers:

Sequence

Disorder prediction methods

PONDR VLXT

PONDR VL3

PONDR VSL2

IUPred

FoldIndex

TopIDP

PredictionANN

Meta-predictor

Xue et al. Biochem Biophys Acta. 2010; 180: 996

Disordered regions and secondary structure Coil is an ordered, irregular structural element Disordered proteins usually do not contain stable secondary

structural elements (e.g. by CD)

They can contain transient secondary structure elements (by NMR)

Pure random coil never occurs Use secondary structure predictions methods for disordered

proteins with extreme caution Long segments without predicted secondary structure may

indicate proteins disorder (NORsnet)

Accuracy

•True positive: Disordered residues are predicted as

disordered

•False positive: Ordered residues predicted as disordered

•True negative: Ordered residues predicted as ordered

•False negative: Disordered residues predicted as ordered

75-90%

Prediction of protein disorder Disordered residues can be predicted from

the amino acid sequence ~ 80% at the residue level

Methods can be specific to certain type of disorder accordingly, accuracies vary depending on

datasets Predictions are based on binary

classification of disorder

Heterogeneity in protein disorder

Flexible loop

RC-like Compact

Transient structures

Modularity in proteins

Many proteins contains multiple domains

Composed of ordered and disordered segments

Average length of a PDB chain is < 300

Average length of a human proteins ~ 500

Average length of cancer-related proteins > 900

Structural properties of full length proteins …

Practical