
Identifying co-regulation using

Probabilistic Relational Models

by Christoforos Anagnostopoulos
BA Mathematics, Cambridge University
MSc Informatics, Edinburgh University

supervised by Dirk Husmeier


General Problematic

Bringing together disparate data sources:

Promoter sequence data
...ACGTTAAGCCAT...
...GGCATGAATCCC...

mRNA
Gene expression data
gene 1: overexpressed
gene 2: overexpressed
...

Proteins

Protein interaction data
protein 1   protein 2   ORF 1     ORF 2
---------------------------------------------
AAC1        TIM10       YMR056C   YHR005CA
AAD6        YNL201C     YFL056C   YNL201C

Our data

Promoter sequence data
...ACGTTAAGCCAT...
...GGCATGAATCCC...

mRNA
Gene expression data
gene 1: overexpressed
gene 2: overexpressed
...


Bayesian Modelling Framework

Bayesian Networks

Conditional Independence Assumptions

Factorisation of the Joint Probability Distribution

UNIFIED TRAINING


Probabilistic Relational Models

Aims for this presentation:

1. Briefly present the Segal model and the main criticisms offered in the thesis

2. Briefly introduce PRMs

3. Outline directions for future work

The Segal Model

Cluster genes into transcriptional modules...

[Diagram: a gene is assigned to Module 1 or Module 2 with probabilities P(M = 1) and P(M = 2)]

The Segal Model

How to determine P(M = 1)?

Module 1 Motif Profile
motif 3: active
motif 4: very active
motif 16: very active
motif 29: slightly active

Predicted Expression Levels
Array 1: overexpressed
Array 2: overexpressed
Array 3: underexpressed
...

[Diagram: the gene's motif profile and predicted expression levels jointly determine P(M = 1)]

The Segal model

PROMOTER SEQUENCE → MOTIF PRESENCE (MOTIF MODEL)
MOTIF PRESENCE → MODULE ASSIGNMENT (REGULATION MODEL)
MODULE ASSIGNMENT → EXPRESSION DATA (EXPRESSION MODEL)

Learning via hard EM

The module assignments are HIDDEN.

Initialise hidden variables

Set parameters to Maximum Likelihood

Set hidden values to their most probable value given the parameters (hard EM)

Iterate the last two steps until convergence.
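The loop above can be sketched in a few lines. The Gaussian-cluster setting below is an illustrative stand-in for hard EM over module assignments, not the Segal model itself:

```python
def hard_em(data, k, iters=20):
    """Hard EM: alternate (1) ML parameters given the hidden assignments,
    (2) most probable hidden assignment given the parameters."""
    # Initialise hidden variables (round-robin, for reproducibility).
    z = [i % k for i in range(len(data))]
    for _ in range(iters):
        # M-step: set parameters (cluster means) to their ML values.
        means = []
        for m in range(k):
            pts = [x for x, zi in zip(data, z) if zi == m]
            means.append(sum(pts) / len(pts) if pts else data[m % len(data)])
        # Hard E-step: set each hidden value to its most probable value.
        z = [min(range(k), key=lambda m: (x - means[m]) ** 2) for x in data]
    return means, z

means, z = hard_em([0.1, 0.2, 0.0, 5.1, 4.9, 5.0], k=2)
```

Unlike soft EM, the E-step commits to a single assignment per datapoint, which is exactly why the algorithm is cheap but can get stuck in poor local optima.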

Motif Model

OBJECTIVE: Learn the motif so as to discriminate between genes for which the Regulation variable is "on" (r = 1) and genes for which it is "off" (r = 0).

Motif Model – scoring scheme

high score: ...CATTCC...
low score: ...TGACAA...

Scoring the promoter sequence:

...AGTCCATTCCGCCTCAAG...

high scoring subsequences vs. low scoring (background) subsequences


Motif Model

SCORING SCHEME P ( g.r = true | g.S, w )

w: parameter set

can be taken to represent motifs

Maximum Likelihood setting → most discriminatory motif
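As a concrete illustration of the scoring scheme: a toy position-specific scoring matrix (PSSM) applied to the slide's example sequences. The probabilities are invented for illustration; they are not the thesis's learned parameters.

```python
import math

# Toy PSSM for a width-6 motif with consensus CATTCC
# (probabilities invented for illustration).
pssm = [
    {'A': 0.1, 'C': 0.7, 'G': 0.1, 'T': 0.1},
    {'A': 0.7, 'C': 0.1, 'G': 0.1, 'T': 0.1},
    {'A': 0.1, 'C': 0.1, 'G': 0.1, 'T': 0.7},
    {'A': 0.1, 'C': 0.1, 'G': 0.1, 'T': 0.7},
    {'A': 0.1, 'C': 0.7, 'G': 0.1, 'T': 0.1},
    {'A': 0.1, 'C': 0.7, 'G': 0.1, 'T': 0.1},
]
background = {'A': 0.25, 'C': 0.25, 'G': 0.25, 'T': 0.25}

def score(subseq):
    """Log-odds score of one width-6 window against the background model."""
    return sum(math.log(pssm[i][b] / background[b]) for i, b in enumerate(subseq))

def best_window(promoter, width=6):
    """Slide the motif along the promoter; return the best-scoring window."""
    windows = [promoter[i:i + width] for i in range(len(promoter) - width + 1)]
    return max(windows, key=score)

print(best_window("AGTCCATTCCGCCTCAAG"))  # prints CATTCC
```

Subsequences matching the consensus score well above the background (high scoring), mismatching ones well below it (low scoring), matching the picture on the previous slide.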

Motif Model – overfitting

typical motif: ...TTT.CATTCC... → high score

TRUE PSSM vs. INFERRED PSSM: the inferred PSSM can triple the score!

Regulation Model

For each module m and each motif i, we estimate the association u_mi

P( g.M = m | g.R ) ∝ exp( Σ_i u_mi · g.R_i )

Regulation Model: Geometrical Interpretation

The (u_mi)_i define separating hyperplanes.

Classification criterion is the inner product: each datapoint is given the label of the hyperplane it is the furthest away from, on its positive side.
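Read this way, module assignment is a softmax over the inner products with each module's weight vector. A minimal sketch, with weights u_mi invented for illustration (not values from the thesis):

```python
import math

def module_posterior(r, u):
    """P(M = m | r) proportional to exp(<u_m, r>): a softmax over the
    inner products of the regulation vector r with each module's
    hyperplane normal u_m."""
    scores = [sum(w * x for w, x in zip(u_m, r)) for u_m in u]
    z = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - z) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Two modules, three motifs; the weights u_mi are invented for illustration.
u = [[2.0, 0.0, -1.0],   # module 1 rewards motif 1, penalises motif 3
     [0.0, 1.5, 1.5]]    # module 2 rewards motifs 2 and 3
r = [1, 0, 0]            # regulation vector: only motif 1 is present
p = module_posterior(r, u)
```

The datapoint gets the label of the module whose hyperplane gives the largest (most positive) inner product, and scaling all weights up makes the softmax arbitrarily confident, which is exactly the divergence problem of the next slide.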

Regulation Model: Divergence and Overfitting

pairwise linear separability → overconfident classification

Method A: dampen the parameters (e.g. with a Gaussian prior)

Method B: make the dataset linearly inseparable by augmentation

Erroneous interpretation of the parameters

Segal et al claim that:

when u_mi = 0, motif i is inactive in module m

when u_mi > 0 for all i, m, only the presence of motifs is significant, not their absence

Both claims contradict the normalisation conditions!

Sparsity

[Diagram: TRUE PROCESS vs. INFERRED PROCESS]

Sparsity

Sparsity can be understood as pruning

Pruning can improve generalisation performance (deals with overfitting both by damping and by decreasing the degrees of freedom)

Pruning ought not be seen as a combinatorial problem, but can be dealt with by appropriate prior distributions.

Reconceptualise the problem:

Sparsity: the Laplacian

How to prune using a prior:

choose a prior with a simple discontinuity at the origin, so that the penalty term does not vanish near the origin

every time a parameter crosses the origin, establish whether it will escape the origin or is trapped in Brownian motion around it

if trapped, force both its gradient and its value to 0 and freeze it

One can actively look for nearby zeros to accelerate the pruning rate.
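A minimal sketch of this recipe on a toy problem: gradient descent with an L1 (Laplacian) penalty, origin-crossing detection, and freezing of trapped weights. The quadratic loss, step size, and penalty strength are stand-ins for illustration, not the thesis's settings.

```python
def train_with_laplacian_prior(w, grad_loss, lam=0.5, lr=0.1, steps=500):
    """Gradient descent under a Laplacian prior (penalty lam * |w_i|).
    When a weight crosses the origin and the data gradient there cannot
    overcome the penalty, the weight is frozen at exactly zero (pruned)."""
    frozen = [False] * len(w)
    for _ in range(steps):
        for i in range(len(w)):
            if frozen[i]:
                continue
            g = grad_loss(w, i)
            sign = 1.0 if w[i] > 0 else -1.0
            new_w = w[i] - lr * (g + lam * sign)
            if new_w * w[i] < 0:  # the weight crossed the origin
                # Trapped if the penalty dominates the data gradient at 0.
                at_zero = [wj if j != i else 0.0 for j, wj in enumerate(w)]
                if abs(grad_loss(at_zero, i)) <= lam:
                    w[i] = 0.0
                    frozen[i] = True  # force value and gradient to 0
                    continue
            w[i] = new_w
    return w

# Quadratic loss 0.5 * sum((w_i - target_i)^2): weight 0 has a weak signal
# (target 0.05, gets pruned), weight 1 a strong one (target 3.0, survives).
target = [0.05, 3.0]
grad = lambda w, i: w[i] - target[i]
w = train_with_laplacian_prior([1.0, 1.0], grad)
```

The surviving weight is damped (it converges to 2.5 rather than 3.0, since the penalty shifts the optimum by lam) while the weak weight is set to an exact zero, illustrating how the prior both regularises and prunes.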

Results: generalisation performance

Synthetic Dataset with 49 motifs, 20 modules and 1800 datapoints

Results: interpretability

TRUE MODULE STRUCTURE

DEFAULT MODEL: LEARNT WEIGHTS

LAPLACIAN PRIOR MODEL: LEARNT WEIGHTS

Regrets: BIOLOGICAL DATA


Probabilistic Relational Models

How to model context-specific regulation?
Need to cluster the experiments...

Probabilistic Relational Models

Variable A can vary with genes but not with experiments

Probabilistic Relational Models

We now have variability with experiments, but also with genes!

Probabilistic Relational Models

Variability with experiments as required, but too many dependencies

Probabilistic Relational Models

Variability with experiments as required, provided we constrain the parameters of the probability distributions P(E|A) to be equal

Probabilistic Relational Models

The resulting BN is essentially UNIQUE.
But the derivation is VAGUE, COMPLICATED, UNSYSTEMATIC.

Probabilistic Relational Models

GENES
g.S1, g.S2, ...
g.R1, g.R2, ...
g.M
g.E1, g.E2, ...

The expression variables cannot be considered attributes of a gene, because they have attributes of their own that are gene-independent: an expression measurement is an attribute of both a gene and an experiment.

EXPERIMENTS
e.Cycle_Phase
e.Dye_Type

MEASUREMENTS
m(e,g).Level
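The three-class schema above translates directly into code. A sketch, with attribute names taken from the slide and everything else (types, toy values) illustrative:

```python
from dataclasses import dataclass

# Relational schema mirroring the slide: an expression measurement is an
# attribute of both a gene and an experiment, so it gets its own class.

@dataclass
class Gene:
    name: str
    S: list   # promoter sequence attributes g.S1, g.S2, ...
    R: list   # motif regulation variables g.R1, g.R2, ...
    M: int    # module assignment g.M

@dataclass
class Experiment:
    name: str
    cycle_phase: str   # e.Cycle_Phase
    dye_type: str      # e.Dye_Type

@dataclass
class Measurement:
    gene: Gene              # m(e, g).Level belongs to the *pair*,
    experiment: Experiment  # not to the gene or the experiment alone
    level: float

g = Gene("YMR056C", S=[0.3], R=[True], M=1)
e = Experiment("array1", cycle_phase="G1", dye_type="Cy5")
m = Measurement(g, e, level=1.7)
```

Putting the expression level on a Measurement class rather than on Gene is precisely the modelling move the slide argues for.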

Examples of PRMs - 1

Segal et al, “From Promoter Sequence to Gene Expression”


Examples of PRMs - 2

Segal et al, “Decomposing gene expression into cellular processes”


Probabilistic Relational Models

PRM = { BN1, BN2, BN3, ... }

given Dataset1, the PRM instantiates to BN1

given Dataset2, the PRM instantiates to BN2

Relational schema: higher level description of data

PRM: higher level description of BNs
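The "one PRM, many BNs" view can be sketched as a template that, given a dataset, unrolls into a ground network whose conditional distributions share one parameter set (parameter tying). The representation below is a deliberate simplification, not a real PRM library:

```python
# A PRM-like template: one conditional distribution P(E | A), shared by
# every (gene, experiment) pair.  Given a dataset (a list of pairs), the
# template unrolls into a ground BN: one node per pair, tied parameters.

def unroll(template_cpd, pairs):
    """Instantiate one ground node per (gene, experiment) pair.
    Every node points at the *same* CPD object: parameter tying."""
    return {(g, e): template_cpd for g, e in pairs}

cpd = {"params": [0.2, 0.8]}  # illustrative shared parameters

bn1 = unroll(cpd, [("gene1", "exp1"), ("gene1", "exp2")])
bn2 = unroll(cpd, [("gene2", "exp1")])  # different dataset, different BN
```

Different datasets yield different ground BNs, yet every node in every instantiation refers to the single shared parameter set, which is what makes the PRM a higher-level description.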

Probabilistic Relational Models

Relational vs flat data structures:

• Natural generalisation – knowledge carries over

• Expandability

• Richer semantics – better interpretability

• No loss in coherence

Personal opinion (not tested yet):

• Not entirely natural as a generalisation
• Some loss in interpretability
• Some loss in coherence


Future research

1. Improve the learning algorithm

‘soften’ it by exploiting sparsity

systematise dynamic addition / deletion


Future research

2. Model Selection Techniques
improve interpretability
learn the optimal number of modules in our model

Are such methods consistent? Do they carry over just as well in PRMs?

Future research

3. Fine-tune the Laplacian regulariser to fit the skewing of the model

Future research

4. The choice of encoding the question into a BN/PRM is only partly determined by the domain

Are there any general 'rules' about how to restrict the choice so as to promote interpretability?

Future research

5. Explore methods to express structural, non-quantifiable prior beliefs about the biological domain using Bayesian tools.

Summary:

1. Briefly presented the Segal model and the main observations offered in the thesis

2. Briefly introduced PRMs

3. Hinted towards directions for future work