
Identifying co-regulation using

Probabilistic Relational Models

by Christoforos Anagnostopoulos
BA Mathematics, Cambridge University
MSc Informatics, Edinburgh University

supervised by Dirk Husmeier


General Problematic

Bringing together disparate data sources:

Promoter sequence data
...ACGTTAAGCCAT...
...GGCATGAATCCC...

mRNA
Gene expression data
gene 1: overexpressed
gene 2: overexpressed
...

Proteins

Protein interaction data
protein 1   protein 2   ORF 1     ORF 2
---------------------------------------------
AAC1        TIM10       YMR056C   YHR005CA
AAD6        YNL201C     YFL056C   YNL201C

Our data

Promoter sequence data
...ACGTTAAGCCAT...
...GGCATGAATCCC...

mRNA
Gene expression data
gene 1: overexpressed
gene 2: overexpressed
...


Bayesian Modelling Framework

Bayesian Networks

Conditional Independence Assumptions

Factorisation of the Joint Probability Distribution

UNIFIED TRAINING


Probabilistic Relational Models

Aims for this presentation:

1. Briefly present the Segal model and the main criticisms offered in the thesis

2. Briefly introduce PRMs

3. Outline directions for future work

The Segal Model

Cluster genes into transcriptional modules...

[Diagram: a gene is assigned to Module 1 or Module 2 with probabilities P(M = 1) and P(M = 2)]

The Segal Model

How to determine P(M = 1)?

Module 1 Motif Profile
motif 3: active
motif 4: very active
motif 16: very active
motif 29: slightly active

Predicted Expression Levels
Array 1: overexpressed
Array 2: overexpressed
Array 3: underexpressed
...

[Diagram: the gene's motif profile and predicted expression levels jointly determine P(M = 1)]

The Segal model

PROMOTER SEQUENCE → MOTIF PRESENCE (MOTIF MODEL)
MOTIF PRESENCE → MODULE ASSIGNMENT (REGULATION MODEL)
MODULE ASSIGNMENT → EXPRESSION DATA (EXPRESSION MODEL)

Learning via hard EM

The module assignments are HIDDEN.

Initialise hidden variables

Set parameters to Maximum Likelihood

Set hidden values to their most probable value given the parameters (hard EM)

Iterate the last two steps until convergence.
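The loop above can be sketched in a few lines. The Gaussian-cluster setting below is an illustrative stand-in for hard EM over module assignments, not the Segal model itself:

```python
def hard_em(data, k, iters=20):
    """Hard EM: alternate (1) ML parameters given the hidden assignments,
    (2) most probable hidden assignment given the parameters."""
    # Initialise hidden variables (round-robin, for reproducibility).
    z = [i % k for i in range(len(data))]
    for _ in range(iters):
        # M-step: set parameters (cluster means) to their ML values.
        means = []
        for m in range(k):
            pts = [x for x, zi in zip(data, z) if zi == m]
            means.append(sum(pts) / len(pts) if pts else data[m % len(data)])
        # Hard E-step: set each hidden value to its most probable value.
        z = [min(range(k), key=lambda m: (x - means[m]) ** 2) for x in data]
    return means, z

means, z = hard_em([0.1, 0.2, 0.0, 5.1, 4.9, 5.0], k=2)
```

Unlike soft EM, the E-step commits to a single assignment per datapoint, which is exactly why the algorithm is cheap but can get stuck in poor local optima.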

Motif Model

OBJECTIVE: Learn the motif so as to discriminate between genes for which the Regulation variable is "on" (r = 1) and genes for which it is "off" (r = 0).

Motif Model – scoring scheme

high score: ...CATTCC...
low score: ...TGACAA...

Scoring the promoter sequence:

...AGTCCATTCCGCCTCAAG...

high scoring subsequences vs. low scoring (background) subsequences


Motif Model

SCORING SCHEME P ( g.r = true | g.S, w )

w: parameter set

can be taken to represent motifs

Maximum Likelihood setting → most discriminatory motif
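As a concrete illustration of the scoring scheme: a toy position-specific scoring matrix (PSSM) applied to the slide's example sequences. The probabilities are invented for illustration; they are not the thesis's learned parameters.

```python
import math

# Toy PSSM for a width-6 motif with consensus CATTCC
# (probabilities invented for illustration).
pssm = [
    {'A': 0.1, 'C': 0.7, 'G': 0.1, 'T': 0.1},
    {'A': 0.7, 'C': 0.1, 'G': 0.1, 'T': 0.1},
    {'A': 0.1, 'C': 0.1, 'G': 0.1, 'T': 0.7},
    {'A': 0.1, 'C': 0.1, 'G': 0.1, 'T': 0.7},
    {'A': 0.1, 'C': 0.7, 'G': 0.1, 'T': 0.1},
    {'A': 0.1, 'C': 0.7, 'G': 0.1, 'T': 0.1},
]
background = {'A': 0.25, 'C': 0.25, 'G': 0.25, 'T': 0.25}

def score(subseq):
    """Log-odds score of one width-6 window against the background model."""
    return sum(math.log(pssm[i][b] / background[b]) for i, b in enumerate(subseq))

def best_window(promoter, width=6):
    """Slide the motif along the promoter; return the best-scoring window."""
    windows = [promoter[i:i + width] for i in range(len(promoter) - width + 1)]
    return max(windows, key=score)

print(best_window("AGTCCATTCCGCCTCAAG"))  # prints CATTCC
```

Subsequences matching the consensus score well above the background (high scoring), mismatching ones well below it (low scoring), matching the picture on the previous slide.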

Motif Model – overfitting

typical motif: ...TTT.CATTCC... → high score

TRUE PSSM vs. INFERRED PSSM: the inferred PSSM can triple the score!

Regulation Model

For each module m and each motif i, we estimate the association u_mi

P( g.M = m | g.R ) ∝ exp( Σ_i u_mi · g.R_i )

Regulation Model: Geometrical Interpretation

The (u_mi)_i define separating hyperplanes.

Classification criterion is the inner product: each datapoint is given the label of the hyperplane it is the furthest away from, on its positive side.
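Read this way, module assignment is a softmax over the inner products with each module's weight vector. A minimal sketch, with weights u_mi invented for illustration (not values from the thesis):

```python
import math

def module_posterior(r, u):
    """P(M = m | r) proportional to exp(<u_m, r>): a softmax over the
    inner products of the regulation vector r with each module's
    hyperplane normal u_m."""
    scores = [sum(w * x for w, x in zip(u_m, r)) for u_m in u]
    z = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - z) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Two modules, three motifs; the weights u_mi are invented for illustration.
u = [[2.0, 0.0, -1.0],   # module 1 rewards motif 1, penalises motif 3
     [0.0, 1.5, 1.5]]    # module 2 rewards motifs 2 and 3
r = [1, 0, 0]            # regulation vector: only motif 1 is present
p = module_posterior(r, u)
```

The datapoint gets the label of the module whose hyperplane gives the largest (most positive) inner product, and scaling all weights up makes the softmax arbitrarily confident, which is exactly the divergence problem of the next slide.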

Regulation Model: Divergence and Overfitting

pairwise linear separability → overconfident classification

Method A: dampen the parameters (e.g. with a Gaussian prior)

Method B: make the dataset linearly inseparable by augmentation

Erroneous interpretation of the parameters

Segal et al claim that:

when u_mi = 0, motif i is inactive in module m

when u_mi > 0 for all i, m, only the presence of motifs is significant, not their absence

Both claims contradict the normalisation conditions!

Sparsity

[Diagram: TRUE PROCESS vs. INFERRED PROCESS]

Sparsity

Sparsity can be understood as pruning

Pruning can improve generalisation performance (deals with overfitting both by damping and by decreasing the degrees of freedom)

Pruning ought not be seen as a combinatorial problem, but can be dealt with by appropriate prior distributions.

Reconceptualise the problem:

Sparsity: the Laplacian

How to prune using a prior:

choose a prior with a simple discontinuity at the origin, so that the penalty term does not vanish near the origin

every time a parameter crosses the origin, establish whether it will escape the origin or is trapped in Brownian motion around it

if trapped, force both its gradient and its value to 0 and freeze it

One can actively look for nearby zeros to accelerate the pruning rate.
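A minimal sketch of this recipe on a toy problem: gradient descent with an L1 (Laplacian) penalty, origin-crossing detection, and freezing of trapped weights. The quadratic loss, step size, and penalty strength are stand-ins for illustration, not the thesis's settings.

```python
def train_with_laplacian_prior(w, grad_loss, lam=0.5, lr=0.1, steps=500):
    """Gradient descent under a Laplacian prior (penalty lam * |w_i|).
    When a weight crosses the origin and the data gradient there cannot
    overcome the penalty, the weight is frozen at exactly zero (pruned)."""
    frozen = [False] * len(w)
    for _ in range(steps):
        for i in range(len(w)):
            if frozen[i]:
                continue
            g = grad_loss(w, i)
            sign = 1.0 if w[i] > 0 else -1.0
            new_w = w[i] - lr * (g + lam * sign)
            if new_w * w[i] < 0:  # the weight crossed the origin
                # Trapped if the penalty dominates the data gradient at 0.
                at_zero = [wj if j != i else 0.0 for j, wj in enumerate(w)]
                if abs(grad_loss(at_zero, i)) <= lam:
                    w[i] = 0.0
                    frozen[i] = True  # force value and gradient to 0
                    continue
            w[i] = new_w
    return w

# Quadratic loss 0.5 * sum((w_i - target_i)^2): weight 0 has a weak signal
# (target 0.05, gets pruned), weight 1 a strong one (target 3.0, survives).
target = [0.05, 3.0]
grad = lambda w, i: w[i] - target[i]
w = train_with_laplacian_prior([1.0, 1.0], grad)
```

The surviving weight is damped (it converges to 2.5 rather than 3.0, since the penalty shifts the optimum by lam) while the weak weight is set to an exact zero, illustrating how the prior both regularises and prunes.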

Results: generalisation performance

Synthetic Dataset with 49 motifs, 20 modules and 1800 datapoints

Results: interpretability

TRUE MODULE STRUCTURE

DEFAULT MODEL: LEARNT WEIGHTS

LAPLACIAN PRIOR MODEL: LEARNT WEIGHTS

Regrets: BIOLOGICAL DATA


Probabilistic Relational Models

How to model context-specific regulation?
Need to cluster the experiments...

Probabilistic Relational Models

Variable A can vary with genes but not with experiments

Probabilistic Relational Models

We now have variability with experiments, but also with genes!

Probabilistic Relational Models

Variability with experiments as required, but too many dependencies

Probabilistic Relational Models

Variability with experiments as required, provided we constrain the parameters of the probability distributions P(E|A) to be equal

Probabilistic Relational Models

The resulting BN is essentially UNIQUE.
But the derivation is VAGUE, COMPLICATED, UNSYSTEMATIC.

Probabilistic Relational Models

GENES
g.S1, g.S2, ...
g.R1, g.R2, ...
g.M
g.E1, g.E2, ...

The expression variables cannot be considered attributes of a gene, because they have attributes of their own that are gene-independent: an expression measurement is an attribute of both a gene and an experiment.

EXPERIMENTS
e.Cycle_Phase
e.Dye_Type

MEASUREMENTS
m(e,g).Level
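The three-class schema above translates directly into code. A sketch, with attribute names taken from the slide and everything else (types, toy values) illustrative:

```python
from dataclasses import dataclass

# Relational schema mirroring the slide: an expression measurement is an
# attribute of both a gene and an experiment, so it gets its own class.

@dataclass
class Gene:
    name: str
    S: list   # promoter sequence attributes g.S1, g.S2, ...
    R: list   # motif regulation variables g.R1, g.R2, ...
    M: int    # module assignment g.M

@dataclass
class Experiment:
    name: str
    cycle_phase: str   # e.Cycle_Phase
    dye_type: str      # e.Dye_Type

@dataclass
class Measurement:
    gene: Gene              # m(e, g).Level belongs to the *pair*,
    experiment: Experiment  # not to the gene or the experiment alone
    level: float

g = Gene("YMR056C", S=[0.3], R=[True], M=1)
e = Experiment("array1", cycle_phase="G1", dye_type="Cy5")
m = Measurement(g, e, level=1.7)
```

Putting the expression level on a Measurement class rather than on Gene is precisely the modelling move the slide argues for.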

Examples of PRMs - 1

Segal et al, “From Promoter Sequence to Gene Expression”


Examples of PRMs - 2

Segal et al, “Decomposing gene expression into cellular processes”


Probabilistic Relational Models

PRM = { BN1, BN2, BN3, ... }

given Dataset1, the PRM instantiates to BN1

given Dataset2, the PRM instantiates to BN2

Relational schema: higher level description of data

PRM: higher level description of BNs
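The "one PRM, many BNs" view can be sketched as a template that, given a dataset, unrolls into a ground network whose conditional distributions share one parameter set (parameter tying). The representation below is a deliberate simplification, not a real PRM library:

```python
# A PRM-like template: one conditional distribution P(E | A), shared by
# every (gene, experiment) pair.  Given a dataset (a list of pairs), the
# template unrolls into a ground BN: one node per pair, tied parameters.

def unroll(template_cpd, pairs):
    """Instantiate one ground node per (gene, experiment) pair.
    Every node points at the *same* CPD object: parameter tying."""
    return {(g, e): template_cpd for g, e in pairs}

cpd = {"params": [0.2, 0.8]}  # illustrative shared parameters

bn1 = unroll(cpd, [("gene1", "exp1"), ("gene1", "exp2")])
bn2 = unroll(cpd, [("gene2", "exp1")])  # different dataset, different BN
```

Different datasets yield different ground BNs, yet every node in every instantiation refers to the single shared parameter set, which is what makes the PRM a higher-level description.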

Probabilistic Relational Models

Relational vs flat data structures:

• Natural generalisation – knowledge carries over

• Expandability

• Richer semantics – better interpretability

• No loss in coherence

Personal opinion (not tested yet):

• Not entirely natural as a generalisation
• Some loss in interpretability
• Some loss in coherence


Future research

1. Improve the learning algorithm

‘soften’ it by exploiting sparsity

systematise dynamic addition / deletion


Future research

2. Model Selection Techniques
improve interpretability
learn the optimal number of modules in our model

Are such methods consistent? Do they carry over just as well in PRMs?

Future research

3. Fine-tune the Laplacian regulariser to fit the skewing of the model

Future research

4. The choice of encoding the question into a BN/PRM is only partly determined by the domain

Are there any general 'rules' about how to restrict the choice so as to promote interpretability?

Future research

5. Explore methods to express structural, non-quantifiable prior beliefs about the biological domain using Bayesian tools.

Summary:

1. Briefly presented the Segal model and the main observations offered in the thesis

2. Briefly introduced PRMs

3. Hinted towards directions for future work