Upload
others
View
10
Download
0
Embed Size (px)
Citation preview
Statistical Inference Problems with Applications toComputational Structural Biology
Parthan KasarapuSupervisors:
Arun Konagurthu & Maria Garcia de la Banda
6 Aug 2015
Parthan Kasarapu Statistical Modelling of Protein Structures 6 Aug 2015 1 / 33
Presentation Outline
� Thesis Overview
� Motivation� Research Summary
I Statistical modellingI Applications to protein structural biology
� Thesis contributions
� Conclusion
Parthan Kasarapu Statistical Modelling of Protein Structures 6 Aug 2015 2 / 33
Thesis Overview
Parthan Kasarapu Statistical Modelling of Protein Structures 6 Aug 2015 3 / 33
Motivation: How many clusters?
Principal component 1-4 -3 -2 -1 0 1 2 3 4 5
Prin
cip
al co
mp
on
en
t 2
-2
-1.5
-1
-0.5
0
0.5
1
1.5
setosa
versicolor
virginica
Parthan Kasarapu Statistical Modelling of Protein Structures 6 Aug 2015 4 / 33
Motivation: How many clusters?
Principal component 1-4 -3 -2 -1 0 1 2 3 4 5
Prin
cip
al co
mp
on
en
t 2
-2
-1.5
-1
-0.5
0
0.5
1
1.5
setosa
versicolor
virginica
Parthan Kasarapu Statistical Modelling of Protein Structures 6 Aug 2015 4 / 33
Motivation: How many clusters?
Principal component 1-4 -3 -2 -1 0 1 2 3 4 5
Prin
cip
al co
mp
on
en
t 2
-2
-1.5
-1
-0.5
0
0.5
1
1.5
setosa
versicolor
virginica
Parthan Kasarapu Statistical Modelling of Protein Structures 6 Aug 2015 4 / 33
Motivation: How many clusters?
Principal component 1-4 -3 -2 -1 0 1 2 3 4 5
Princip
al com
ponent 2
-2
-1.5
-1
-0.5
0
0.5
1
1.5
setosa
versicolor
virginica
Parthan Kasarapu Statistical Modelling of Protein Structures 6 Aug 2015 4 / 33
Motivation: How many clusters?
Principal component 1-4 -3 -2 -1 0 1 2 3 4 5
Princip
al com
ponent 2
-2
-1.5
-1
-0.5
0
0.5
1
1.5
setosa
versicolor
virginica
(a) 2-cluster model
Principal component 1-4 -3 -2 -1 0 1 2 3 4 5
Princip
al com
ponent 2
-2
-1.5
-1
-0.5
0
0.5
1
1.5
setosa
versicolor
virginica
(b) 3-cluster model
Principal component 1-4 -3 -2 -1 0 1 2 3 4 5
Prin
cip
al co
mp
on
en
t 2
-2
-1.5
-1
-0.5
0
0.5
1
1.5
setosa
versicolor
virginica
(c) 4-cluster model
Statistical model selection is important.
Parthan Kasarapu Statistical Modelling of Protein Structures 6 Aug 2015 5 / 33
Model selection and inference
� Several candidate models: which one to choose?I A criterion to compare models ...I Based on the model’s complexity and the goodness-of-fit
Principal component 1-4 -3 -2 -1 0 1 2 3 4 5
Princip
al com
ponent 2
-2
-1.5
-1
-0.5
0
0.5
1
1.5
setosa
versicolor
virginica
complexity: 2 means + 2 covariance matrices + cluster weights
Parthan Kasarapu Statistical Modelling of Protein Structures 6 Aug 2015 6 / 33
Model selection and inference
� Several candidate models: which one to choose?I A criterion to compare models ...I Based on the model’s complexity and the goodness-of-fit
Principal component 1-4 -3 -2 -1 0 1 2 3 4 5
Princip
al com
ponent 2
-2
-1.5
-1
-0.5
0
0.5
1
1.5
setosa
versicolor
virginica
complexity: 2 means + 2 covariance matrices + cluster weights
Parthan Kasarapu Statistical Modelling of Protein Structures 6 Aug 2015 6 / 33
The typical model selection criteria ...
� Various model selection criteria are commonly used ...I AIC, BIC, MDL, ...
� An example of model selection ...
(a) Normal model (b) Laplace model
I Two parameters for each model (µ & σ)I Considered to have the same model complexity (limitation)
Parthan Kasarapu Statistical Modelling of Protein Structures 6 Aug 2015 7 / 33
The typical model selection criteria ...
� Various model selection criteria are commonly used ...I AIC, BIC, MDL, ...
� An example of model selection ...
(a) Normal model (b) Laplace model
I Two parameters for each model (µ & σ)I Considered to have the same model complexity (limitation)
Parthan Kasarapu Statistical Modelling of Protein Structures 6 Aug 2015 7 / 33
The typical model selection criteria ...
� Various model selection criteria are commonly used ...I AIC, BIC, MDL, ...
� An example of model selection ...
(a) Normal model (b) Laplace model
I Two parameters for each model (µ & σ)I Considered to have the same model complexity (limitation)
Parthan Kasarapu Statistical Modelling of Protein Structures 6 Aug 2015 7 / 33
Minimum Message Length (MML) Framework
Encoding of model (hypothesis) H and data DI (H&D) = I (H)︸︷︷︸
First part
+ I (D|H)︸ ︷︷ ︸Second part
� Two-part message:I I (H): model complexityI I (D|H): goodness-of-fit
� Total message length I (H&D) is used to compare models.
Model with the least message length is optimal
Parthan Kasarapu Statistical Modelling of Protein Structures 6 Aug 2015 8 / 33
Minimum Message Length (MML) Framework
Encoding of model (hypothesis) H and data DI (H&D) = I (H)︸︷︷︸
First part
+ I (D|H)︸ ︷︷ ︸Second part
� Two-part message:I I (H): model complexityI I (D|H): goodness-of-fit
� Total message length I (H&D) is used to compare models.
Model with the least message length is optimal
Parthan Kasarapu Statistical Modelling of Protein Structures 6 Aug 2015 8 / 33
Minimum Message Length (MML) Framework
Encoding of model (hypothesis) H and data DI (H&D) = I (H)︸︷︷︸
First part
+ I (D|H)︸ ︷︷ ︸Second part
� Two-part message:I I (H): model complexityI I (D|H): goodness-of-fit
� Total message length I (H&D) is used to compare models.
Model with the least message length is optimal
Parthan Kasarapu Statistical Modelling of Protein Structures 6 Aug 2015 8 / 33
Problem: Modelling the protein main chain
(a) True structure (Cα − Cα data)
X1
X2
X3
φ
Pi
θ
Pi+1
Pi−1
Pi−2
(b) Cα orientations
Parthan Kasarapu Statistical Modelling of Protein Structures 6 Aug 2015 9 / 33
Problem: Modelling the protein main chain
(a) True structure (Cα − Cα data)
X1
X2
X3
φ
Pi
θ
Pi+1
Pi−1
Pi−2
(b) Cα orientations
Empirical distribution of (θ, φ)
Parthan Kasarapu Statistical Modelling of Protein Structures 6 Aug 2015 9 / 33
Modelling of empirical distribution of directional data
� Mixture modelling (Clustering)I Data is multi-modalI Ideal to find data clusters ...I Modelling using directional probability distributions
Parthan Kasarapu Statistical Modelling of Protein Structures 6 Aug 2015 10 / 33
Mixture modelling (Clustering)
Challenges:� Determination of the number of components
I Proposed a search method
� Ability to generalize to any probability distributionI No assumptions in terms of the nature of data or distribution
P. Kasarapu, L. Allison, Minimum message length estimation of mixturesof multivariate Gaussian and von Mises–Fisher distributions, MachineLearning (2015) Vol. 100, No. 2-3, Pages 333-378
Parthan Kasarapu Statistical Modelling of Protein Structures 6 Aug 2015 11 / 33
Mixture modelling (Clustering)
Challenges:� Determination of the number of components
I Proposed a search method
� Ability to generalize to any probability distributionI No assumptions in terms of the nature of data or distribution
P. Kasarapu, L. Allison, Minimum message length estimation of mixturesof multivariate Gaussian and von Mises–Fisher distributions, MachineLearning (2015) Vol. 100, No. 2-3, Pages 333-378
Parthan Kasarapu Statistical Modelling of Protein Structures 6 Aug 2015 11 / 33
Proposed method to determine clusters of data
Basic idea to determine number of clusters
Perturb a K -component mixture through a series of operations so that themixture escapes a sub-optimal state to reach an improved state.
� Operations include ...I SplitI DeleteI Merge
Parthan Kasarapu Statistical Modelling of Protein Structures 6 Aug 2015 12 / 33
Illustrative example of the search method
X
-4 -2 0 2 4
Y
-4
-3
-2
-1
0
1
2
3
4
Original mixture with threecomponents.
Parthan Kasarapu Statistical Modelling of Protein Structures 6 Aug 2015 13 / 33
Illustrative example of the search method
X
-4 -2 0 2 4
Y
-4
-3
-2
-1
0
1
2
3
4
Original mixture with threecomponents.
X
-4 -2 0 2 4
Y
-4
-3
-2
-1
0
1
2
3
4
Begin with a one-componentmixture.
Parthan Kasarapu Statistical Modelling of Protein Structures 6 Aug 2015 13 / 33
Illustrative example of the search method
X
-4 -2 0 2 4
Y
-4
-3
-2
-1
0
1
2
3
4
(a) I = 22793 bitsX
-4 -2 0 2 4
Y
-4
-3
-2
-1
0
1
2
3
4
(b) SplittingX
-4 -2 0 2 4
Y
-4
-3
-2
-1
0
1
2
3
4
(c) I = 22673 bits
Split operation
A parent component is split to find locally optimal children leading to a(K + 1)-component mixture.
Parthan Kasarapu Statistical Modelling of Protein Structures 6 Aug 2015 14 / 33
Illustrative example of the search method
X
-4 -2 0 2 4
Y
-4
-3
-2
-1
0
1
2
3
4
(a) Initial meansX
-4 -2 0 2 4
Y
-4
-3
-2
-1
0
1
2
3
4
(b) I = 22691 bitsX
-4 -2 0 2 4
Y
-4
-3
-2
-1
0
1
2
3
4
(c) I = 22460 bits
Parthan Kasarapu Statistical Modelling of Protein Structures 6 Aug 2015 15 / 33
Illustrative example of the search method
X
-4 -2 0 2 4
Y
-4
-3
-2
-1
0
1
2
3
4
(a) DeletingX
-4 -2 0 2 4
Y
-4
-3
-2
-1
0
1
2
3
4
(b) I = 25599 bitsX
-4 -2 0 2 4
Y
-4
-3
-2
-1
0
1
2
3
4
(c) I = 22793 bits
Delete operation
A component is deleted to find an optimal (K − 1)-component mixture.
Parthan Kasarapu Statistical Modelling of Protein Structures 6 Aug 2015 16 / 33
Illustrative example of the search method
X
-4 -2 0 2 4
Y
-4
-3
-2
-1
0
1
2
3
4
(a) MergingX
-4 -2 0 2 4
Y
-4
-3
-2
-1
0
1
2
3
4
(b) InitializationX
-4 -2 0 2 4
Y
-4
-3
-2
-1
0
1
2
3
4
(c) I = 22793 bits
Merge operation
A pair of close components are merged to find an optimal(K − 1)-component mixture.
Parthan Kasarapu Statistical Modelling of Protein Structures 6 Aug 2015 17 / 33
Evolution of the mixture model
22.35
22.40
22.45
22.50
22.55
22.60
22.65
22.70
22.75
22.80
2 4 6 8 10 12 140.02
0.04
0.06
0.08
0.10
0.12
0.14
0.16
0.18
0.20
0.22
Mes
sag
e le
ng
th (
in t
ho
usa
nd
s o
f b
its)
Number of components
first partsecond part
total
Variation of the individual parts of the total message length with increasingnumber of components (clusters).
Parthan Kasarapu Statistical Modelling of Protein Structures 6 Aug 2015 18 / 33
Models of protein data
(a) Data (b) von Mises-Fisher (c) Kent
Parthan Kasarapu Statistical Modelling of Protein Structures 6 Aug 2015 19 / 33
Models of protein data
(a) vMF: β = 0 (b) Kent: β > 0
Kent probability density function
∝ exp{ κγγγT1 x︸ ︷︷ ︸linear term
+β(γγγT2 x)2 − β(γγγT3 x)2︸ ︷︷ ︸
non-linear term
}
Parthan Kasarapu Statistical Modelling of Protein Structures 6 Aug 2015 20 / 33
Modelling using Kent distributions
Challenges:� Complex mathematical form
I Parameter estimation is a difficult task
� Mixture modellingI Cluster data on the spherical surface
Parthan Kasarapu Statistical Modelling of Protein Structures 6 Aug 2015 21 / 33
Modelling using Kent distributions
Challenges:� Complex mathematical form
I Parameter estimation is a difficult task
� Mixture modellingI Cluster data on the spherical surface
Parthan Kasarapu Statistical Modelling of Protein Structures 6 Aug 2015 21 / 33
vMF and Kent mixtures of protein directional data
Longitude φ
0 60 120 180 240 300 360
Co
-la
titu
de
θ
15
35
55
75
95
115
15
14
2221
13
2019
12
18
11
17
10
16
876
543
9
21
(a) Uncorrelated (35 vMF clusters)
Longitude φ
0 60 120 180 240 300 360
Co
-la
titu
de
θ
15
35
55
75
95
115
11
10
13
9
8
12
7
654
32
1
(b) Correlated (23 Kent clusters)
How are these models useful?� Discovery of frequently occuring patterns
I Dedicated clusters for helices, strands, etc.� Clustering profile can be related to protein function
I Structurally similar proteins will have similar clusters� Ab initio protein structure prediction
I Random protein generation, homology modelling, template structures, etc.
Parthan Kasarapu Statistical Modelling of Protein Structures 6 Aug 2015 22 / 33
vMF and Kent mixtures of protein directional data
Longitude φ
0 60 120 180 240 300 360
Co
-la
titu
de
θ
15
35
55
75
95
115
15
14
2221
13
2019
12
18
11
17
10
16
876
543
9
21
(a) Uncorrelated (35 vMF clusters)
Longitude φ
0 60 120 180 240 300 360
Co
-la
titu
de
θ
15
35
55
75
95
115
11
10
13
9
8
12
7
654
32
1
(b) Correlated (23 Kent clusters)
How are these models useful?� Discovery of frequently occuring patterns
I Dedicated clusters for helices, strands, etc.� Clustering profile can be related to protein function
I Structurally similar proteins will have similar clusters� Ab initio protein structure prediction
I Random protein generation, homology modelling, template structures, etc.
Parthan Kasarapu Statistical Modelling of Protein Structures 6 Aug 2015 22 / 33
vMF and Kent mixtures of protein directional data
Longitude φ
0 60 120 180 240 300 360
Co
-la
titu
de
θ
15
35
55
75
95
115
15
14
2221
13
2019
12
18
11
17
10
16
876
543
9
21
(a) Uncorrelated (35 vMF clusters)
Longitude φ
0 60 120 180 240 300 360
Co
-la
titu
de
θ
15
35
55
75
95
115
11
10
13
9
8
12
7
654
32
1
(b) Correlated (23 Kent clusters) -optimal!
ModelTotal message length Bits per
(millions of bits) residue
Uniform 6.895 27.434vMF mixture 6.449 25.656Kent mixture 6.442 25.630
Parthan Kasarapu Statistical Modelling of Protein Structures 6 Aug 2015 23 / 33
Problem: Modelling of protein dihedral angles
� Modelling protein dihedral angles (φ, ψ)I φ, ψ ∈ [0, 2π) represents a point on the torusI Cannot be modelled using vMF or KentI Modelled using mixtures of bivariate von Mises (BVM) distributions
Parthan Kasarapu Statistical Modelling of Protein Structures 6 Aug 2015 24 / 33
Problem: Modelling of protein dihedral angles
� Modelling protein dihedral angles (φ, ψ)I φ, ψ ∈ [0, 2π)
represents a point on the torusI Cannot be modelled using vMF or KentI Modelled using mixtures of bivariate von Mises (BVM) distributions
Parthan Kasarapu Statistical Modelling of Protein Structures 6 Aug 2015 24 / 33
Problem: Modelling of protein dihedral angles
� Modelling protein dihedral angles (φ, ψ)I φ, ψ ∈ [0, 2π) represents a point on the torus
I Cannot be modelled using vMF or KentI Modelled using mixtures of bivariate von Mises (BVM) distributions
Parthan Kasarapu Statistical Modelling of Protein Structures 6 Aug 2015 24 / 33
Problem: Modelling of protein dihedral angles
� Modelling protein dihedral angles (φ, ψ)I φ, ψ ∈ [0, 2π) represents a point on the torusI Cannot be modelled using vMF or KentI Modelled using mixtures of bivariate von Mises (BVM) distributions
Parthan Kasarapu Statistical Modelling of Protein Structures 6 Aug 2015 24 / 33
Distribution of protein dihedral angles
Parthan Kasarapu Statistical Modelling of Protein Structures 6 Aug 2015 25 / 33
Distribution of protein dihedral angles
Example BVM distributionsParthan Kasarapu Statistical Modelling of Protein Structures 6 Aug 2015 25 / 33
Bivariate von Mises (BVM) clusters of dihedral angle data
φ
-180 -120 -60 0 60 120 180
ψ
-180
-120
-60
0
60
120
180
25
22
24
23
2120
11
4
18
17
16
15
5
14
10
6
13
19
7
9
12
3
2
8
1
No correlation (32 clusters)
φ
-180 -120 -60 0 60 120 180
ψ
-180
-120
-60
0
60
120
180
16
15
14
8
4
13
5
7
12
11
10
3
9
2
6
1
BVM (21 clusters) - optimal!
Parthan Kasarapu Statistical Modelling of Protein Structures 6 Aug 2015 26 / 33
Problem: Abstraction of protein folding patterns
Motivation� Rapid protein structure comparison
I Achieved by effective summarization of folding patterns
� Determine functionally similar proteinsI Achieved by unique representations
Parthan Kasarapu Statistical Modelling of Protein Structures 6 Aug 2015 27 / 33
A novel method to abstract protein folding patterns
(a) True structure (b) Commonly used
(c) Illustrative non-linear representations (which is optimal?)
Parthan Kasarapu Statistical Modelling of Protein Structures 6 Aug 2015 28 / 33
Optimal representation
� MML balances the trade-off betweenI Maximize economy of description (compression)I Minimize loss of structural information (preservation of geometry)
Parthan Kasarapu Statistical Modelling of Protein Structures 6 Aug 2015 29 / 33
Optimal representation
� MML balances the trade-off betweenI Maximize economy of description (compression)I Minimize loss of structural information (preservation of geometry)
Parthan Kasarapu Statistical Modelling of Protein Structures 6 Aug 2015 29 / 33
Merits of this abstraction
� Does not rely on secondary structure assignment
� Applications in protein structure comparisonI Database searchI Comparing the representations - fast
P. Kasarapu, M. G. de la Banda, A. S. Konagurthu, On representingprotein folding patterns using non-linear parametric curves, IEEE/ACMTransactions on Computational Biology and Bioinformatics,11(6):1218-1228 (2014)
Parthan Kasarapu Statistical Modelling of Protein Structures 6 Aug 2015 30 / 33
Merits of this abstraction
� Does not rely on secondary structure assignment� Applications in protein structure comparison
I Database searchI Comparing the representations
- fast
P. Kasarapu, M. G. de la Banda, A. S. Konagurthu, On representingprotein folding patterns using non-linear parametric curves, IEEE/ACMTransactions on Computational Biology and Bioinformatics,11(6):1218-1228 (2014)
Parthan Kasarapu Statistical Modelling of Protein Structures 6 Aug 2015 30 / 33
Merits of this abstraction
� Does not rely on secondary structure assignment� Applications in protein structure comparison
I Database searchI Comparing the representations - fast
P. Kasarapu, M. G. de la Banda, A. S. Konagurthu, On representingprotein folding patterns using non-linear parametric curves, IEEE/ACMTransactions on Computational Biology and Bioinformatics,11(6):1218-1228 (2014)
Parthan Kasarapu Statistical Modelling of Protein Structures 6 Aug 2015 30 / 33
Merits of this abstraction
� Does not rely on secondary structure assignment� Applications in protein structure comparison
I Database searchI Comparing the representations - fast
P. Kasarapu, M. G. de la Banda, A. S. Konagurthu, On representingprotein folding patterns using non-linear parametric curves, IEEE/ACMTransactions on Computational Biology and Bioinformatics,11(6):1218-1228 (2014)
Parthan Kasarapu Statistical Modelling of Protein Structures 6 Aug 2015 30 / 33
Main contributions of my thesis
Theoretical:� MML-based statistical inference
I Multivariate von Mises-Fisher (hypersphere)I Kent (3D-sphere)I Bivariate von Mises (3D-torus)
� Mixture modelling (clustering)
� Non-linear abstractions (regression)
Applications:
� Structural bioinformatics
� High-dimensional text clustering using vMF mixtures
� Analytical tools for biologists and statisticians
Parthan Kasarapu Statistical Modelling of Protein Structures 6 Aug 2015 31 / 33
Main contributions of my thesis
Theoretical:� MML-based statistical inference
I Multivariate von Mises-Fisher (hypersphere)I Kent (3D-sphere)I Bivariate von Mises (3D-torus)
� Mixture modelling (clustering)
� Non-linear abstractions (regression)
Applications:
� Structural bioinformatics
� High-dimensional text clustering using vMF mixtures
� Analytical tools for biologists and statisticians
Parthan Kasarapu Statistical Modelling of Protein Structures 6 Aug 2015 31 / 33
Conclusion
� Data analysis and statistical modelling go hand-in-handI Rigorous models are useful
� Scope for improving the existing methodologiesI Extend the current machine learning algorithms
� My research has practical implications in data mining, structuralbiology, etc.
Parthan Kasarapu Statistical Modelling of Protein Structures 6 Aug 2015 32 / 33
Conclusion
� Data analysis and statistical modelling go hand-in-handI Rigorous models are useful
� Scope for improving the existing methodologiesI Extend the current machine learning algorithms
� My research has practical implications in data mining, structuralbiology, etc.
Parthan Kasarapu Statistical Modelling of Protein Structures 6 Aug 2015 32 / 33
Conclusion
� Data analysis and statistical modelling go hand-in-handI Rigorous models are useful
� Scope for improving the existing methodologiesI Extend the current machine learning algorithms
� My research has practical implications in data mining, structuralbiology, etc.
Parthan Kasarapu Statistical Modelling of Protein Structures 6 Aug 2015 32 / 33
Thank you.
Parthan Kasarapu Statistical Modelling of Protein Structures 6 Aug 2015 33 / 33