Identifying co-regulation using
Probabilistic Relational Models
by Christoforos Anagnostopoulos
BA Mathematics, Cambridge University
MSc Informatics, Edinburgh University

supervised by Dirk Husmeier
The General Problem
Bringing together disparate data sources:

Promoter sequence data
...ACGTTAAGCCAT...
...GGCATGAATCCC...

Gene expression data (mRNA)
gene 1: overexpressed
gene 2: overexpressed
...

Protein interaction data
protein 1   protein 2   ORF 1     ORF 2
AAC1        TIM10       YMR056C   YHR005CA
AAD6        YNL201C     YFL056C   YNL201C
Our data
Promoter sequence data
...ACGTTAAGCCAT...
...GGCATGAATCCC...

Gene expression data (mRNA)
gene 1: overexpressed
gene 2: overexpressed
...
Bayesian Modelling Framework
Bayesian Networks
Conditional Independence Assumptions
Factorisation of the Joint Probability Distribution
UNIFIED TRAINING
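The factorisation referred to above is the standard Bayesian-network one (a textbook fact, not taken from the slides): the joint distribution decomposes into one conditional per variable given its parents in the graph,

```latex
P(X_1, \ldots, X_n) \;=\; \prod_{i=1}^{n} P\big(X_i \mid \mathrm{Pa}(X_i)\big)
```

which is exactly what the conditional independence assumptions license, and what allows all components of the model to be trained jointly.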
Aims for this presentation:
1. Briefly present the Segal model and the main criticisms offered in the thesis
2. Briefly introduce PRMs
3. Outline directions for future work
The Segal Model
How to determine P(M = 1) for a gene?

Module 1 Motif Profile:
motif 3: active
motif 4: very active
motif 16: very active
motif 29: slightly active

Predicted Expression Levels:
Array 1: overexpressed
Array 2: overexpressed
Array 3: underexpressed
...
Learning via hard EM
Initialise hidden variables
Set parameters to Maximum Likelihood
Set hidden values to their most probable value given the parameters (hard EM)
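The loop above can be sketched as follows. This is a minimal illustrative implementation (a k-means-style hard EM under an isotropic Gaussian noise model); the variable names and the Gaussian assumption are invented here, not taken from the thesis:

```python
import numpy as np

def hard_em(X, K, n_iter=50, seed=0):
    """Hard EM: alternate ML parameter fits with hard assignments.

    X : (n_genes, n_arrays) expression matrix; K : number of modules.
    Returns hard module assignments and per-module mean profiles.
    """
    rng = np.random.default_rng(seed)
    # Initialise hidden variables (module labels) at random.
    z = rng.integers(0, K, size=len(X))
    for _ in range(n_iter):
        # M-step: set parameters to their maximum-likelihood values.
        means = np.array([X[z == k].mean(axis=0) if np.any(z == k)
                          else X[rng.integers(len(X))] for k in range(K)])
        # Hard E-step: set hidden values to their most probable value
        # given the parameters (nearest mean under isotropic Gaussian).
        d = ((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
        z_new = d.argmin(axis=1)
        if np.array_equal(z_new, z):
            break  # assignments stable: converged
        z = z_new
    return z, means
```

On well-separated expression profiles this converges in a handful of iterations; the hard assignments are what make each M-step a simple per-module maximum-likelihood fit.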
Motif Model
OBJECTIVE: Learn the motif so as to discriminate between genes for which the Regulation variable is “on” (r = 1) and genes for which it is “off” (r = 0).
Motif Model – scoring scheme

high score: ...CATTCC...
low score: ...TGACAA...

Promoter sequence scoring:
...AGTCCATTCCGCCTCAAG...
high-scoring subsequences vs. low-scoring (background) subsequences
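The scoring scheme can be sketched as a log-odds scan: a position-specific scoring matrix (PSSM) is slid along the promoter, and each window is scored against a background model. The PSSM in the test and the uniform background are illustrative assumptions, not the motifs learnt in the thesis:

```python
import numpy as np

BASES = "ACGT"

def pssm_scan(seq, pssm, background=None):
    """Slide a PSSM over a promoter sequence and return per-position
    log-odds scores (motif model vs. background model).

    seq  : promoter string, e.g. "AGTCCATTCCGCCTCAAG"
    pssm : (motif_len, 4) array of per-position base probabilities.
    """
    if background is None:
        background = np.full(4, 0.25)   # uniform background model
    w = len(pssm)
    idx = [BASES.index(b) for b in seq]
    scores = []
    for start in range(len(seq) - w + 1):
        window = idx[start:start + w]
        # log-odds: high for subsequences resembling the motif,
        # low (negative) for background-like subsequences
        s = sum(np.log(pssm[j, window[j]] / background[window[j]])
                for j in range(w))
        scores.append(s)
    return np.array(scores)
```

High-scoring windows then play the role of the motif occurrences in the slide, and everything else is treated as background.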
Motif Model
SCORING SCHEME: P( g.r = true | g.S, w )
w: parameter set, which can be taken to represent motifs
Maximum Likelihood setting yields the most discriminatory motif
Motif Model – overfitting
typical motif: ...TTT.CATTCC... (high score)
[Figure: true PSSM vs. inferred PSSM]
Can triple the score!
Regulation Model
For each module m and each motif i, we estimate the association u_mi.
P( g.M = m | g.R ) is proportional to:
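A plausible reconstruction of the right-hand side, consistent with the inner-product classification criterion of the geometric interpretation (an assumption, not taken verbatim from the thesis):

```latex
P(g.M = m \mid g.R) \;\propto\; \exp\!\Big( \sum_{i} u_{mi} \, g.R_i \Big)
```

so that modules compete via the inner product between the association vector (u_mi)_i and the gene's regulation vector.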
Regulation Model: Geometrical Interpretation
The (u_mi)_i define separating hyperplanes.
Classification criterion is the inner product: each datapoint is given the label of the hyperplane it is furthest away from, on its positive side.
Regulation Model: Divergence and Overfitting
Pairwise linear separability leads to overconfident classification.
Method A: dampen the parameters (e.g. a Gaussian prior)
Method B: make the dataset linearly inseparable by augmentation
Erroneous interpretation of the parameters
Segal et al claim that:
• When u_mi = 0, motif i is inactive in module m
• When u_mi > 0 for all i, m, only the presence of motifs is significant, not their absence
These claims contradict the normalisation conditions!
Sparsity
Sparsity can be understood as pruning.
Pruning can improve generalisation performance (it deals with overfitting both by damping and by decreasing the degrees of freedom).
Reconceptualise the problem: pruning ought not be seen as a combinatorial problem, but can be dealt with using appropriate prior distributions.
Sparsity: the Laplacian
How to prune using a prior:
• choose a prior with a simple discontinuity at the origin, so that the penalty term does not vanish near the origin
• every time a parameter crosses the origin, establish whether it will escape the origin or is trapped in Brownian motion around it
• if trapped, force both its gradient and value to 0 and freeze it
• one can actively look for nearby zeros to accelerate the pruning rate
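The pruning rule above can be sketched as one step of penalised gradient ascent under a Laplacian (L1) prior. The trapped test used here (penalty strength exceeding the data gradient at the origin) is a simplified stand-in for the Brownian-motion criterion; all names are illustrative:

```python
import numpy as np

def l1_prune_step(u, data_grad, lam, lr=0.1, frozen=None):
    """One gradient-ascent step under a Laplacian (L1) prior with pruning.

    u         : parameter vector (e.g. module-motif associations u_mi)
    data_grad : gradient of the log-likelihood at u
    lam       : prior strength; the penalty lam*|u| does not vanish at 0
    frozen    : boolean mask of already-pruned parameters
    """
    if frozen is None:
        frozen = np.zeros_like(u, dtype=bool)
    # penalised gradient: the L1 subgradient is lam * sign(u) away from 0
    g = data_grad - lam * np.sign(u)
    u_new = u + lr * g
    # detect zero crossings: candidate parameters trapped at the origin
    crossed = (np.sign(u_new) != np.sign(u)) & (u != 0)
    # trapped test: if |data gradient| < lam the penalty dominates and
    # the parameter cannot escape the origin -> freeze it at exactly 0
    trapped = crossed & (np.abs(data_grad) < lam)
    u_new[trapped | frozen] = 0.0
    frozen = frozen | trapped
    return u_new, frozen
```

Iterating this step drives weakly supported parameters to exactly zero (pruning) while leaving well-supported ones merely damped.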
Results: interpretability
[Figure: true module structure; learnt weights of the default model; learnt weights of the Laplacian prior model]
Aims for this presentation:
1. Briefly present the Segal model and the main criticisms offered in the thesis
2. Briefly introduce PRMs
3. Outline directions for future work
Probabilistic Relational Models
How to model context-specific regulation? Need to cluster the experiments...
Probabilistic Relational Models
Variability with experiments as required, provided we constrain the parameters of the probability distributions P(E|A) to be equal.
Probabilistic Relational Models
Resulting BN is essentially UNIQUE. But its derivation is VAGUE, COMPLICATED and UNSYSTEMATIC.
Probabilistic Relational Models

GENES
g.S1, g.S2, ...
g.R1, g.R2, ...
g.M
g.E1, g.E2, ...

The expression variable cannot be considered an attribute of a gene, because it has attributes of its own that are gene-independent.

EXPERIMENTS
e.Cycle_Phase
e.Dye_Type

An expression measurement is an attribute of both a gene and an experiment.

MEASUREMENTS
m(e,g).Level
Probabilistic Relational Models
PRM = { BN1, BN2, BN3, ... }
given Dataset1, PRM = BN1
given Dataset2, PRM = BN2
Relational schema: higher-level description of data
PRM: higher-level description of BNs
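The "PRM as a higher-level description of BNs" idea can be sketched by unrolling a template into a ground network: one BN node per (object, attribute) pair, with the same template dependency repeated for every instantiation. The class and attribute names follow the slides, but the dependency structure chosen here is only illustrative:

```python
# Sketch: unrolling a PRM template into a ground Bayesian network.
# Assumed template dependency (illustrative, not from the thesis):
#   m(g, e).Level <- g.M, e.Cycle_Phase

def unroll_prm(genes, experiments):
    """Instantiate one BN node per (object, attribute) pair.

    Returns a dict mapping each ground node to its list of parents.
    """
    nodes = {}
    for g in genes:
        nodes[(g, "M")] = []                 # module assignment, no parents
    for e in experiments:
        nodes[(e, "Cycle_Phase")] = []       # experiment attribute
    for g in genes:
        for e in experiments:
            # every measurement node repeats the template's parent set
            nodes[((g, e), "Level")] = [(g, "M"), (e, "Cycle_Phase")]
    return nodes
```

Two different datasets (different sets of genes and experiments) thus yield two different ground BNs from the same template, which is exactly the PRM = { BN1, BN2, ... } picture above.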
Probabilistic Relational Models
Relational vs flat data structures:
• Natural generalisation – knowledge carries over
• Expandability
• Richer semantics – better interpretability
• No loss in coherence
Personal opinion (not tested yet):
• Not entirely natural as a generalisation
• Some loss in interpretability
• Some loss in coherence
Aims for this presentation:
1. Briefly present the Segal model and the main criticisms offered in the thesis
2. Briefly introduce PRMs
3. Outline directions for future work
Future research
1. Improve the learning algorithm
• ‘soften’ it by exploiting sparsity
• systematise dynamic addition / deletion
Future research
2. Model Selection Techniques
• improve interpretability
• learn the optimal number of modules in our model
Are such methods consistent? Do they carry over just as well to PRMs?
Future research
4. The choice of encoding the question into a BN/PRM is only partly determined by the domain.
Are there any general ‘rules’ about how to restrict the choice so as to promote interpretability?
Future research
5. Explore methods to express structural, non-quantifiable prior beliefs about the biological domain using Bayesian tools.