Statistical Inference Problems with Applications to ...users.monash.edu/~parthan/publications/talk2.pdfStatistical Inference Problems with Applications to Computational Structural

Statistical Inference Problems with Applications toComputational Structural Biology

Parthan KasarapuSupervisors:

Arun Konagurthu & Maria Garcia de la Banda

6 Aug 2015

Parthan Kasarapu Statistical Modelling of Protein Structures 6 Aug 2015 1 / 33

Presentation Outline

� Thesis Overview

� Motivation� Research Summary

I Statistical modellingI Applications to protein structural biology

� Thesis contributions

� Conclusion


Thesis Overview


Motivation: How many clusters?

Principal component 1-4 -3 -2 -1 0 1 2 3 4 5

Prin

cip

al co

mp

on

en

t 2

-2

-1.5

-1

-0.5

0

0.5

1

1.5

setosa

versicolor

virginica




Princip

al com

ponent 2

-2

-1.5

-1

-0.5

0

0.5

1

1.5

setosa

versicolor

virginica




Princip

al com

ponent 2

-2

-1.5

-1

-0.5

0

0.5

1

1.5

setosa

versicolor

virginica

(a) 2-cluster model


Princip

al com

ponent 2

-2

-1.5

-1

-0.5

0

0.5

1

1.5

setosa

versicolor

virginica

(b) 3-cluster model


Prin

cip

al co

mp

on

en

t 2

-2

-1.5

-1

-0.5

0

0.5

1

1.5

setosa

versicolor

virginica

(c) 4-cluster model

Statistical model selection is important.


Model selection and inference

� Several candidate models: which one to choose?I A criterion to compare models ...I Based on the model’s complexity and the goodness-of-fit


Princip

al com

ponent 2

-2

-1.5

-1

-0.5

0

0.5

1

1.5

setosa

versicolor

virginica

complexity: 2 means + 2 covariance matrices + cluster weights


The typical model selection criteria ...

� Various model selection criteria are commonly used ...I AIC, BIC, MDL, ...

� An example of model selection ...

(a) Normal model (b) Laplace model

I Two parameters for each model (µ & σ)I Considered to have the same model complexity (limitation)


Minimum Message Length (MML) Framework

Encoding of model (hypothesis) H and data DI (H&D) = I (H)︸︷︷︸

First part

+ I (D|H)︸︷︷︸Second part

� Two-part message:I I (H): model complexityI I (D|H): goodness-of-fit

� Total message length I (H&D) is used to compare models.

Model with the least message length is optimal


Problem: Modelling the protein main chain

(a) True structure (Cα − Cα data)

X1

X2

X3

φ

Pi

θ

Pi+1

Pi−1

Pi−2

(b) Cα orientations


Problem: Modelling the protein main chain

(a) True structure (Cα − Cα data)

X1

X2

X3

φ

Pi

θ

Pi+1

Pi−1

Pi−2

(b) Cα orientations

Empirical distribution of (θ, φ)


Modelling of empirical distribution of directional data

� Mixture modelling (Clustering)I Data is multi-modalI Ideal to find data clusters ...I Modelling using directional probability distributions


Mixture modelling (Clustering)

Challenges:� Determination of the number of components

I Proposed a search method

� Ability to generalize to any probability distributionI No assumptions in terms of the nature of data or distribution

P. Kasarapu, L. Allison, Minimum message length estimation of mixturesof multivariate Gaussian and von Mises–Fisher distributions, MachineLearning (2015) Vol. 100, No. 2-3, Pages 333-378


Proposed method to determine clusters of data

Basic idea to determine number of clusters

Perturb a K -component mixture through a series of operations so that themixture escapes a sub-optimal state to reach an improved state.

� Operations include ...I SplitI DeleteI Merge


Illustrative example of the search method

X

-4 -2 0 2 4

Y

-4

-3

-2

-1

0

1

2

3

4

Original mixture with threecomponents.



X

-4 -2 0 2 4

Y

-4

-3

-2

-1

0

1

2

3

4

Original mixture with threecomponents.

X

-4 -2 0 2 4

Y

-4

-3

-2

-1

0

1

2

3

4

Begin with a one-componentmixture.



X

-4 -2 0 2 4

Y

-4

-3

-2

-1

0

1

2

3

4

(a) I = 22793 bitsX

-4 -2 0 2 4

Y

-4

-3

-2

-1

0

1

2

3

4

(b) SplittingX

-4 -2 0 2 4

Y

-4

-3

-2

-1

0

1

2

3

4

(c) I = 22673 bits

Split operation

A parent component is split to find locally optimal children leading to a(K + 1)-component mixture.



X

-4 -2 0 2 4

Y

-4

-3

-2

-1

0

1

2

3

4

(a) Initial meansX

-4 -2 0 2 4

Y

-4

-3

-2

-1

0

1

2

3

4

(b) I = 22691 bitsX

-4 -2 0 2 4

Y

-4

-3

-2

-1

0

1

2

3

4

(c) I = 22460 bits



X

-4 -2 0 2 4

Y

-4

-3

-2

-1

0

1

2

3

4

(a) DeletingX

-4 -2 0 2 4

Y

-4

-3

-2

-1

0

1

2

3

4

(b) I = 25599 bitsX

-4 -2 0 2 4

Y

-4

-3

-2

-1

0

1

2

3

4

(c) I = 22793 bits

Delete operation

A component is deleted to find an optimal (K − 1)-component mixture.



X

-4 -2 0 2 4

Y

-4

-3

-2

-1

0

1

2

3

4

(a) MergingX

-4 -2 0 2 4

Y

-4

-3

-2

-1

0

1

2

3

4

(b) InitializationX

-4 -2 0 2 4

Y

-4

-3

-2

-1

0

1

2

3

4

(c) I = 22793 bits

Merge operation

A pair of close components are merged to find an optimal(K − 1)-component mixture.


Evolution of the mixture model

22.35

22.40

22.45

22.50

22.55

22.60

22.65

22.70

22.75

22.80

2 4 6 8 10 12 140.02

0.04

0.06

0.08

0.10

0.12

0.14

0.16

0.18

0.20

0.22

Mes

sag

e le

ng

th (

in t

ho

usa

nd

s o

f b

its)

Number of components

first partsecond part

total

Variation of the individual parts of the total message length with increasingnumber of components (clusters).


Models of protein data

(a) Data (b) von Mises-Fisher (c) Kent


Models of protein data

(a) vMF: β = 0 (b) Kent: β > 0

Kent probability density function

∝ exp{ κγγγT1 x︸︷︷︸linear term

+β(γγγT2 x)2 − β(γγγT3 x)2︸︷︷︸

non-linear term

}


Modelling using Kent distributions

Challenges:� Complex mathematical form

I Parameter estimation is a difficult task

� Mixture modellingI Cluster data on the spherical surface


vMF and Kent mixtures of protein directional data

Longitude φ

0 60 120 180 240 300 360

Co

-la

titu

de

θ

15

35

55

75

95

115

15

14

2221

13

2019

12

18

11

17

10

16

876

543

9

21

(a) Uncorrelated (35 vMF clusters)

Longitude φ

0 60 120 180 240 300 360

Co

-la

titu

de

θ

15

35

55

75

95

115

11

10

13

9

8

12

7

654

32

1

(b) Correlated (23 Kent clusters)

How are these models useful?� Discovery of frequently occuring patterns

I Dedicated clusters for helices, strands, etc.� Clustering profile can be related to protein function

I Structurally similar proteins will have similar clusters� Ab initio protein structure prediction

I Random protein generation, homology modelling, template structures, etc.


vMF and Kent mixtures of protein directional data

Longitude φ

0 60 120 180 240 300 360

Co

-la

titu

de

θ

15

35

55

75

95

115

15

14

2221

13

2019

12

18

11

17

10

16

876

543

9

21

(a) Uncorrelated (35 vMF clusters)

Longitude φ

0 60 120 180 240 300 360

Co

-la

titu

de

θ

15

35

55

75

95

115

11

10

13

9

8

12

7

654

32

1

(b) Correlated (23 Kent clusters) -optimal!

ModelTotal message length Bits per

(millions of bits) residue

Uniform 6.895 27.434vMF mixture 6.449 25.656Kent mixture 6.442 25.630


Problem: Modelling of protein dihedral angles

� Modelling protein dihedral angles (φ, ψ)I φ, ψ ∈ [0, 2π) represents a point on the torusI Cannot be modelled using vMF or KentI Modelled using mixtures of bivariate von Mises (BVM) distributions



� Modelling protein dihedral angles (φ, ψ)I φ, ψ ∈ [0, 2π)

represents a point on the torusI Cannot be modelled using vMF or KentI Modelled using mixtures of bivariate von Mises (BVM) distributions



� Modelling protein dihedral angles (φ, ψ)I φ, ψ ∈ [0, 2π) represents a point on the torus

I Cannot be modelled using vMF or KentI Modelled using mixtures of bivariate von Mises (BVM) distributions



� Modelling protein dihedral angles (φ, ψ)I φ, ψ ∈ [0, 2π) represents a point on the torusI Cannot be modelled using vMF or KentI Modelled using mixtures of bivariate von Mises (BVM) distributions


Distribution of protein dihedral angles


Distribution of protein dihedral angles

Example BVM distributionsParthan Kasarapu Statistical Modelling of Protein Structures 6 Aug 2015 25 / 33

Bivariate von Mises (BVM) clusters of dihedral angle data

φ

-180 -120 -60 0 60 120 180

ψ

-180

-120

-60

0

60

120

180

25

22

24

23

2120

11

4

18

17

16

15

5

14

10

6

13

19

7

9

12

3

2

8

1

No correlation (32 clusters)

φ

-180 -120 -60 0 60 120 180

ψ

-180

-120

-60

0

60

120

180

16

15

14

8

4

13

5

7

12

11

10

3

9

2

6

1

BVM (21 clusters) - optimal!


Problem: Abstraction of protein folding patterns

Motivation� Rapid protein structure comparison

I Achieved by effective summarization of folding patterns

� Determine functionally similar proteinsI Achieved by unique representations


A novel method to abstract protein folding patterns

(a) True structure (b) Commonly used

(c) Illustrative non-linear representations (which is optimal?)


Optimal representation

� MML balances the trade-off betweenI Maximize economy of description (compression)I Minimize loss of structural information (preservation of geometry)


Merits of this abstraction

� Does not rely on secondary structure assignment

� Applications in protein structure comparisonI Database searchI Comparing the representations - fast

P. Kasarapu, M. G. de la Banda, A. S. Konagurthu, On representingprotein folding patterns using non-linear parametric curves, IEEE/ACMTransactions on Computational Biology and Bioinformatics,11(6):1218-1228 (2014)



� Does not rely on secondary structure assignment� Applications in protein structure comparison

I Database searchI Comparing the representations

- fast




� Does not rely on secondary structure assignment� Applications in protein structure comparison

I Database searchI Comparing the representations - fast



Main contributions of my thesis

Theoretical:� MML-based statistical inference

I Multivariate von Mises-Fisher (hypersphere)I Kent (3D-sphere)I Bivariate von Mises (3D-torus)

� Mixture modelling (clustering)

� Non-linear abstractions (regression)

Applications:

� Structural bioinformatics

� High-dimensional text clustering using vMF mixtures

� Analytical tools for biologists and statisticians


Conclusion

� Data analysis and statistical modelling go hand-in-handI Rigorous models are useful

� Scope for improving the existing methodologiesI Extend the current machine learning algorithms

� My research has practical implications in data mining, structuralbiology, etc.


Thank you.


Documents

Statistical Inference Problems with Applications to ...users.monash.edu/~parthan/publications/talk2.pdfStatistical Inference Problems with Applications to Computational Structural