Validation of Time Series Technique for Prediction of Conformational States of Amino Acids. Dr. Sangeeta Sawant, Bioinformatics Centre, UoP, Pune (Guide); Dr. Mohan Kale, Dept. of Statistics, UoP, Pune (Co-guide)


Project presentation in partial fulfillment of the M.Sc. (Bioinformatics) at the Bioinformatics Centre, University of Pune, Pune


Page 1: BIM_2010_20_Bioinformatics_Project

Validation of Time Series Technique for Prediction of Conformational States of Amino Acids

Dr. Sangeeta Sawant, Bioinformatics Centre, UoP, Pune (Guide)

Dr. Mohan Kale, Dept. of Statistics, UoP, Pune (Co-guide)

Page 2: BIM_2010_20_Bioinformatics_Project

Concepts Used

Ramachandran Plot

Time series

AR, ARMA, ARIMA models

AIC (Akaike information criterion)

Euclidean distance

Potential values for AA residues

Feynman Problem Solving Algorithm

Page 3: BIM_2010_20_Bioinformatics_Project

Ramachandran Plot

Page 4: BIM_2010_20_Bioinformatics_Project

Time Series

A sequence of data points (a set of observations), typically measured at successive instants spaced at uniform time intervals.

Patterns, variations

Forecasting

Page 5: BIM_2010_20_Bioinformatics_Project

Time Series Models (probability models)

Autoregressive (AR) models

Autoregressive moving average (ARMA) models

Autoregressive integrated moving average (ARIMA) models

- Each value depends linearly on previous data points

Page 6: BIM_2010_20_Bioinformatics_Project

Materials & Methods

Language: R

Editors: RStudio, Tinn-R

R packages: bio3d, itsmr, forecast, tseries, timsac, wordcloud

ITSM 2000 (standalone)

Help forums: R Nabble, BioStars, stats.stackexchange

Page 7: BIM_2010_20_Bioinformatics_Project

Methods

A) Calculation of Potential values for AA residues

B) Forecasting of AA states

C) Clustering

Page 8: BIM_2010_20_Bioinformatics_Project

Calculation of Potential values for AA residues

Dataset-I

Assignment of conformational state 1, 2, or 3 to each amino-acid residue, according to the region (I, II, or III) of the Ramachandran plot in which its phi-psi values fall

Phi-psi values: computed with torsion.pdb() of “bio3d” and verified via PDBGoodies (IISc, Bangalore) and the Protein Angle Descriptor utility (IIT, Delhi)

Structures checked for chain breaks and for CA-only records

Experimental method: X-ray; R-factor: 0-0.25 (best-resolved structures)

3829 proteins selected from the PDB (Protein Data Bank) using the PDBSelect dataset list (25% sequence similarity)

Page 9: BIM_2010_20_Bioinformatics_Project

Figure No. 2: Ramachandran plot showing the three conformational regions I, II and III

Region I: closely/tightly packed conformations, Phi -140 to 0, Psi -100 to 0
Region II: extended conformations, Phi -180 to 0, Psi 80 to 180
Region III: all remaining conformations
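The region boundaries above translate directly into a small classifier. The following is a minimal Python sketch (the project itself used R); the function name `conformational_state` and the choice to treat the boundaries as inclusive are assumptions made for illustration, not details taken from the slides.

```python
def conformational_state(phi, psi):
    """Return conformational state 1, 2 or 3 for a (phi, psi) pair in degrees.

    Boundaries follow the slide; treating them as inclusive is an assumption.
    """
    if -140.0 <= phi <= 0.0 and -100.0 <= psi <= 0.0:
        return 1  # region I: closely/tightly packed conformations
    if -180.0 <= phi <= 0.0 and 80.0 <= psi <= 180.0:
        return 2  # region II: extended conformations
    return 3      # region III: everything else


# Hypothetical (phi, psi) pairs, one per region.
angles = [(-60.0, -45.0), (-120.0, 130.0), (60.0, 45.0)]
states = [conformational_state(phi, psi) for phi, psi in angles]
print(states)  # [1, 2, 3]
```

Region I is tested first, so a residue is assigned to region II only when its psi value lies in the 80 to 180 band; the two rectangles do not overlap, so the order does not change the result.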

Page 10: BIM_2010_20_Bioinformatics_Project

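Frequencies of single residues in the three states were calculated and normalized following Kolaskar, A.S. & Sawant, S.V. (1996):

$$P_{ik} = \frac{n_{ik}\, N}{n_i\, n_k}$$

n_ik: number of times the AA of type i occurs in state k (k = 1-3)

n_i, n_k: total occurrences of AA type i, and total number of residues in state k (not defined on the slide; inferred from the normalization)

N: total number of residues

P_ik: potential value of the AA of type i in state k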
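As an illustration of the normalization, here is a hedged Python sketch that computes a potential for every (amino acid, state) pair from a list of labeled residues. It assumes the ratio form P_ik = (n_ik / n_i) / (n_k / N); the function name `state_potentials` and the toy data are hypothetical.

```python
from collections import Counter


def state_potentials(residues):
    """residues: iterable of (aa_type, state) pairs.

    Returns {(aa_type, state): P_ik}, assuming P_ik = (n_ik / n_i) / (n_k / N).
    """
    residues = list(residues)
    n_total = len(residues)                          # N
    n_ik = Counter(residues)                         # count per (type, state)
    n_i = Counter(aa for aa, _ in residues)          # count per AA type
    n_k = Counter(state for _, state in residues)    # count per state
    return {
        (aa, k): (count / n_i[aa]) / (n_k[k] / n_total)
        for (aa, k), count in n_ik.items()
    }


# Toy data (hypothetical): alanine favours state 1, glycine favours state 3.
toy = [("A", 1), ("A", 1), ("A", 2), ("G", 3), ("G", 2), ("G", 3)]
potentials = state_potentials(toy)
print(round(potentials[("A", 1)], 2))  # 2.0
```

A value above 1 means the residue type occurs in that state more often than expected from the overall state frequency; a value below 1 means it is under-represented.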

Potential values in pdf

Page 11: BIM_2010_20_Bioinformatics_Project

Potential values

Page 12: BIM_2010_20_Bioinformatics_Project
Page 13: BIM_2010_20_Bioinformatics_Project

Time Series

Page 14: BIM_2010_20_Bioinformatics_Project

ACF Plot

Page 15: BIM_2010_20_Bioinformatics_Project

ACF: stationary vs. non-stationary

Non-stationary

Stationary
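To make the contrast concrete, the sketch below computes the sample autocorrelation function (ACF) in plain Python for a white-noise series and for a series with a linear trend. The helper name `sample_acf` and both test series are illustrative assumptions; the slides show plots only.

```python
import random


def sample_acf(x, max_lag):
    """Sample autocorrelation of x at lags 0..max_lag."""
    n = len(x)
    mean = sum(x) / n
    dev = [v - mean for v in x]
    c0 = sum(d * d for d in dev) / n  # lag-0 autocovariance
    return [
        sum(dev[t] * dev[t + h] for t in range(n - h)) / n / c0
        for h in range(max_lag + 1)
    ]


rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in range(500)]  # stationary example
trend = [float(t) for t in range(100)]             # non-stationary example

acf_noise = sample_acf(noise, 10)
acf_trend = sample_acf(trend, 10)

# Approximate 95% band for white noise: +/- 1.96 / sqrt(n).
band = 1.96 / len(noise) ** 0.5
print([round(r, 2) for r in acf_noise[:4]], round(band, 3))
print([round(r, 2) for r in acf_trend[:4]])
```

For the stationary series the ACF drops close to zero after lag 0, while for the trending series it decays very slowly, which is the visual cue used to separate the two groups.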

Page 16: BIM_2010_20_Bioinformatics_Project

Time Series

[Figure: time series panels labeled Stationary, Non-stationary, Stationary, with the corresponding ACF plot]

Page 17: BIM_2010_20_Bioinformatics_Project

Stationary TS

Page 18: BIM_2010_20_Bioinformatics_Project

TS model building...

AR (p)

ARMA (p, q)

ARIMA (p, d, q)

Page 19: BIM_2010_20_Bioinformatics_Project

Best model Selection

AR (p)

ARMA (p, q)

ARIMA (p, d, q)

AIC
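Selection by minimum AIC can be sketched for the AR case as follows. This Python example fits AR(0) to AR(5) by Yule-Walker (via the Levinson-Durbin recursion) and scores each with the Gaussian approximation AIC ≈ n·ln(σ̂²) + 2p. That approximation, the function names, and the simulated series are assumptions for illustration; the R packages used in the project may compute the criterion differently.

```python
import math
import random


def autocovariances(x, max_lag):
    """Sample autocovariances of x at lags 0..max_lag."""
    n = len(x)
    mean = sum(x) / n
    dev = [v - mean for v in x]
    return [
        sum(dev[t] * dev[t + h] for t in range(n - h)) / n
        for h in range(max_lag + 1)
    ]


def ar_aic_table(x, max_order):
    """Return [(order, coefficients, aic)] for AR(0)..AR(max_order)."""
    n = len(x)
    r = autocovariances(x, max_order)
    phi, sigma2 = [], r[0]
    table = [(0, [], n * math.log(sigma2))]
    for k in range(1, max_order + 1):
        # Levinson-Durbin step: reflection coefficient for order k.
        kappa = (r[k] - sum(phi[j] * r[k - 1 - j] for j in range(k - 1))) / sigma2
        phi = [phi[j] - kappa * phi[k - 2 - j] for j in range(k - 1)] + [kappa]
        sigma2 *= 1.0 - kappa * kappa  # innovation variance for order k
        table.append((k, list(phi), n * math.log(sigma2) + 2 * k))
    return table


# Simulated AR(1) series with phi = 0.7 (hypothetical data).
rng = random.Random(1)
series = [0.0]
for _ in range(1999):
    series.append(0.7 * series[-1] + rng.gauss(0.0, 1.0))

table = ar_aic_table(series, max_order=5)
best_order, best_coeffs, best_aic = min(table, key=lambda row: row[2])
print(best_order, [round(c, 2) for c in best_coeffs])
```

The penalty term 2p is what keeps higher orders from winning automatically: each extra coefficient must reduce the innovation variance enough to pay for itself.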

Page 20: BIM_2010_20_Bioinformatics_Project

Forecasting of AA states for best models

Page 21: BIM_2010_20_Bioinformatics_Project

Forecasting of AA states for best models….

e.g. for an AR(1) process,

$$X_t = \phi X_{t-1} + Z_t, \quad t = 0, \pm 1, \ldots$$

where $\{Z_t\} \sim \mathrm{WN}(0, \sigma^2)$ and $|\phi| < 1$

The observed potential values of the residues are the data points and the residue index plays the role of t; prediction starts from the 2nd position and runs up to the last index

using forecast() from the “itsmr” package
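The one-step-ahead scheme described above (predict position t from position t-1, starting at the 2nd residue) can be sketched in Python for the AR(1) case. This is not the itsmr implementation: the lag-1 Yule-Walker estimate of φ, the mean-centering, the function names, and the potential values are all assumptions made for illustration.

```python
def fit_ar1(x):
    """Return (mean, phi) with phi estimated as the lag-1 sample autocorrelation."""
    n = len(x)
    mean = sum(x) / n
    dev = [v - mean for v in x]
    phi = sum(dev[t] * dev[t - 1] for t in range(1, n)) / sum(d * d for d in dev)
    return mean, phi


def one_step_forecasts(x, mean, phi):
    """Forecast x[t] from x[t-1] for t = 1..len(x)-1 (2nd position to last)."""
    return [mean + phi * (x[t - 1] - mean) for t in range(1, len(x))]


# Hypothetical potential values along a short stretch of sequence.
potentials = [1.21, 0.84, 1.10, 0.95, 1.30, 0.78, 1.05]
mean, phi = fit_ar1(potentials)
forecasts = one_step_forecasts(potentials, mean, phi)

for index, (observed, predicted) in enumerate(zip(potentials[1:], forecasts), start=2):
    print(index, observed, round(predicted, 3))
```

The slide's model is written for a zero-mean series; the sketch subtracts the sample mean before applying φ and adds it back afterwards, which reduces to the slide's form when the mean is zero.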

Page 22: BIM_2010_20_Bioinformatics_Project

Similarly for ARMA(1,1) (and for ARIMA after differencing),

$$X_t = \phi X_{t-1} + Z_t + \theta Z_{t-1}, \quad \theta + \phi \neq 0$$

Forecasting quality is measured by the coefficient of determination (R²):

$$R^2 = 1 - \frac{\sum_i (Y_i - F_i)^2}{\sum_i (Y_i - \bar{Y})^2}$$

Y_i = true (observed) value

F_i = forecasted (predicted) value
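A direct Python transcription of the coefficient of determination is shown below as a small sketch; the function name `r_squared` and the example numbers are hypothetical.

```python
def r_squared(observed, forecast):
    """R^2 = 1 - sum((Y_i - F_i)^2) / sum((Y_i - mean(Y))^2)."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((y - f) ** 2 for y, f in zip(observed, forecast))
    ss_tot = sum((y - mean_obs) ** 2 for y in observed)
    return 1.0 - ss_res / ss_tot


observed = [1.0, 2.0, 3.0, 4.0]
forecast = [1.1, 1.9, 3.2, 3.8]
score = r_squared(observed, forecast)
print(round(score, 2))  # 0.98
```

R² equals 1 for a perfect forecast and can become negative when the forecast is worse than simply predicting the mean of the observed values.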

Page 23: BIM_2010_20_Bioinformatics_Project

Clustering

Dataset-II

SCOP domain-specific PDB-style files (ATOM & HETATM records) downloaded from the ASTRAL Compendium for Sequence and Structure Analysis, release 1.75 (June 2009)

Files scanned for chain breaks and for CA-only records; files with breaks were set aside

Page 24: BIM_2010_20_Bioinformatics_Project

Domain length: 100-110 AA residues; example file: 10gsa1_a_133_pot.txt

Page 25: BIM_2010_20_Bioinformatics_Project

Potential values (time series): each domain classified as a stationary (506) or non-stationary (1692) process

Non-stationary data set aside for further transformations

AR, ARMA and ARIMA models fitted

Best model chosen by the minimum AIC criterion

Best models: AR (22), ARMA (484), ARIMA (no model)

AR(p), ARMA(p,q): distance matrix (Euclidean distance)

Dendrogram: neighbour-joining (PHYLIP package)
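The distance-matrix step can be sketched as follows in Python. The slides do not say which model quantities entered the distance; this example assumes each domain is represented by its fitted coefficient vector, zero-padded to a common length, and the domain names and coefficients are hypothetical.

```python
import math


def euclidean_distance_matrix(vectors):
    """Pairwise Euclidean distances; shorter vectors are zero-padded (an assumption)."""
    width = max(len(v) for v in vectors)
    padded = [list(v) + [0.0] * (width - len(v)) for v in vectors]
    return [[math.dist(a, b) for b in padded] for a in padded]


# Hypothetical fitted AR coefficients for three domains.
models = {
    "domain_1": [0.5],         # AR(1)
    "domain_2": [0.5, 0.3],    # AR(2)
    "domain_3": [-0.2, 0.4],   # AR(2)
}
names = list(models)
matrix = euclidean_distance_matrix([models[name] for name in names])

for name, row in zip(names, matrix):
    print(name, [round(d, 3) for d in row])
```

Zero-padding treats a lower-order model as a higher-order one whose extra coefficients are zero, which keeps models of different order comparable. The resulting symmetric matrix is the kind of input a neighbour-joining program expects.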

Page 26: BIM_2010_20_Bioinformatics_Project

Dendrogram_TS –AR models-22

Page 27: BIM_2010_20_Bioinformatics_Project

Dendrogram_TS –ARMA models-484

• Phylowidget link

Page 28: BIM_2010_20_Bioinformatics_Project

Results & Discussion

For each AA of all the proteins, the 3D Cartesian coordinates were transformed into 2D information, i.e. the conformational state of the AA. Potential values were then computed and used to build a position-dependent statistical model (a time series in which the AA index plays the role of time) for forecasting purposes.

Page 29: BIM_2010_20_Bioinformatics_Project

AR values

Autoregressive order (p) ranges from 1 to 18

Short- and long-range dependence variations in protein structural arrangements

This variation in order reflects the diversity expressed through structural components

Page 30: BIM_2010_20_Bioinformatics_Project

SCOP class (no. of models) | AA seq. Max (%) | AA seq. Min (%) | States Max (%) | States Min (%)
All α (a), 12 | 26.82 | 2.41 | 55.68 | 21.77
All β (b), 5 | 16.30 | 8.88 | 51.11 | 44.76
α/β (c), 9 | 27.77 | 1.47 | 54.76 | 30.64
α+β (d), 13 | 28.57 | 7.04 | 51.70 | 19.04
Small proteins (g), 1 | 19.51 | - | 48.78 | -
Coiled-coil (h), 3 | 22.5 | 5.88 | 26 | 15
Designed proteins (k), 1 | 29.03 | - | 26.88 | -

Table No. II: Forecasting results for AR models (44) out of the best 90 models (Note: for 46 models, class information was not found in the SCOP database). All values are % accuracy; classes with a single model list one value only.

Accuracy for conformational states exceeds accuracy for AA residues, owing to the low resolution of the forecasted potential values

Page 31: BIM_2010_20_Bioinformatics_Project

SCOP class (no. of models) | AA seq. Max (%) | AA seq. Min (%) | States Max (%) | States Min (%)
All α (a), 123 | 32.55 | 2.63 | 65.77 | 8.06
All β (b), 146 | 32.81 | 3.96 | 65.01 | 17.94
α/β (c), 120 | 43.47 | 5 | 62.89 | 8.97
α+β (d), 127 | 37.96 | 2.70 | 68.15 | 11.11
Multi-domain proteins (e), 13 | 24.39 | 6.034 | 50 | 17.80
Membrane & cell surface (f), 3 | 12.65 | 7.01 | 34.33 | 11.42
Small proteins (g), 17 | 30.64 | 6.60 | 64.51 | 14.28

Table No. III: Forecasting results for ARMA models (557) out of the best 1239 models (Note: for 682 models, class information was not found in the SCOP database). All values are % accuracy.

Because the dataset is non-representative and class information is incomplete, it cannot be concluded that any particular class (i) shows higher or lower prediction accuracy or (ii) mostly follows an ARMA process

Page 32: BIM_2010_20_Bioinformatics_Project

Discussion

TS graphs open a new door in the scientific visualization of proteins (without 3D structure information): a specific AA can be visualized on a line plot with a value proportional to its frequency of occurrence in the allowed regions of the Ramachandran plot.

The potential value of each AA provides a new candidate feature for feature selection in machine learning techniques.

The order of the AR model tells how the current value is linearly related to the past p values.

Intra-dependency of AAs is shown by TS models, e.g. AR(4), ARMA(1,3).

Page 33: BIM_2010_20_Bioinformatics_Project

CONCLUSIONS

A new way of looking at protein structure prediction was found.

Application of the TS technique to predict conformational states from conformational state potentials, instead of secondary structure, has been attempted.

Accuracy of prediction of conformational states of AAs using time series is higher than that of prediction of AA residues.

To increase prediction accuracy, a multivariate time series may be more useful than a univariate one.

Fluctuations within proteins due to AA arrangement can be traced through the stationary and non-stationary groups.

Page 34: BIM_2010_20_Bioinformatics_Project

FUTURE WORK

AR and MA orders of TS models could serve as a source of genetic information (distances) for predicting evolutionary relationships between different proteins.

The TS concept can be used to predict conformational states of missing residues in PDB data files.

Hierarchical clustering/classification of protein TS gives rise to a new concept of time-dependent clustering (pseudo-clustering) and pseudo-phylogeny.

Development of synthetic proteins to combat seasonal diseases and to tackle chemical warfare attacks.

TS fluctuations for a specific class of proteins can be used as a “pattern” for data analysis and pattern-dependent classification of proteins.

Page 35: BIM_2010_20_Bioinformatics_Project

References

Blundell, T.L., Sibanda, B.L., Sternberg, M.J., Thornton, J.M. (1987). Knowledge-based prediction of protein structures and the design of novel molecules. Nature, 326(6111), 347-352. Review.

Kolaskar, A.S., Sawant, S.V. (1996). Prediction of conformational states of amino acids using a Ramachandran plot. Int. J. Peptide Protein Res., 110-116.

Alessandro G., Romualdo B. (2000). Nonlinear Methods in the Analysis of Protein Sequences: A Case Study in Rubredoxins. Biophysical Journal, 136-148.

Page 36: BIM_2010_20_Bioinformatics_Project

Questions

Page 37: BIM_2010_20_Bioinformatics_Project

Thank You !