22
Introduction Poisson Dirichlet process Results Summary Estimating the size of the transcriptome Statistical Modeling and Machine Learning in Computational Systems Biology June 22-26, 2009, Tampere, Finland Ole Winther Technical University of Denmark (DTU) & University of Copenhagen (KU) June 24, 2009 Ole Winther DTU & KU

Estimating the size of the transcriptome - Aalto · Estimating the size of the transcriptome Statistical Modeling and Machine Learning in Computational Systems Biology June 22-26,

Embed Size (px)

Citation preview

Page 1: Estimating the size of the transcriptome - Aalto · Estimating the size of the transcriptome Statistical Modeling and Machine Learning in Computational Systems Biology June 22-26,

Introduction Poisson Dirichlet process Results Summary

Estimating the size of the transcriptomeStatistical Modeling and Machine Learning in Computational

Systems Biology June 22-26, 2009, Tampere, Finland

Ole Winther

Technical University of Denmark (DTU) & University of Copenhagen (KU)

June 24, 2009

Ole Winther DTU & KU

Page 2: Estimating the size of the transcriptome - Aalto · Estimating the size of the transcriptome Statistical Modeling and Machine Learning in Computational Systems Biology June 22-26,

Introduction Poisson Dirichlet process Results Summary

Overview

All lectures

1 Introduction to graphical models and Bayesian networks2 Estimating the size of the transcriptome3 Using biological prior information in motif discovery4 Learning linear Bayes networks with sparse Bayesian

models

Common theme:• Complex Bayesian model building possible and

advantageous• Model checking – prediction, marginal- and test-likelihood

Ole Winther DTU & KU

Page 3: Estimating the size of the transcriptome - Aalto · Estimating the size of the transcriptome Statistical Modeling and Machine Learning in Computational Systems Biology June 22-26,

Introduction Poisson Dirichlet process Results Summary

Overview

Lecture 2

• How many species? or in a genomic content• Estimating the size of the transcriptome• High throughput sequencing• Non-parametric Bayesian model• The model is always wrong (and Bayes can’t tell)• Model checking with cross-validation.

Ole Winther DTU & KU

Page 4: Estimating the size of the transcriptome - Aalto · Estimating the size of the transcriptome Statistical Modeling and Machine Learning in Computational Systems Biology June 22-26,

Introduction Poisson Dirichlet process Results Summary

Motivation

How many species?

Ole Winther DTU & KU

Page 5: Estimating the size of the transcriptome - Aalto · Estimating the size of the transcriptome Statistical Modeling and Machine Learning in Computational Systems Biology June 22-26,

Introduction Poisson Dirichlet process Results Summary

DNA sequence tags - CAGE

Carninci et. al. Nat. Gen. 2006

Ole Winther DTU & KU

Page 6: Estimating the size of the transcriptome - Aalto · Estimating the size of the transcriptome Statistical Modeling and Machine Learning in Computational Systems Biology June 22-26,

Introduction Poisson Dirichlet process Results Summary

DNA sequence tags - CAGE

5 10 15 20 25 300

200

400

600

800

1000

1200

1400

1600

1800

m(n

)

n200 400 600 800 1000 1200

0

10

20

30

40

50

60

m(n

)

n

Cerebellum library - frequency of frequency plot.

Ole Winther DTU & KU

Page 7: Estimating the size of the transcriptome - Aalto · Estimating the size of the transcriptome Statistical Modeling and Machine Learning in Computational Systems Biology June 22-26,

Introduction Poisson Dirichlet process Results Summary

DNA sequence tags - CAGE

5 10 15 20 25 300

200

400

600

800

1000

1200

1400

1600

1800

2000

m(n

)

n5000 10000 15000

0

10

20

30

40

50

60

m(n

)

n

Embryo library

Ole Winther DTU & KU

Page 8: Estimating the size of the transcriptome - Aalto · Estimating the size of the transcriptome Statistical Modeling and Machine Learning in Computational Systems Biology June 22-26,

Introduction Poisson Dirichlet process Results Summary

High-throughput sequencing technologies

Solexa and Solid sequencing offer 106 − 108 reads of length20-60 nt at a price comparable to a micro-array.• CAGE = cap analysis gene expression. 5’ end of mRNAs.

Pinpoints transcription start sites (TSSs). High throughput.• EST = expressed sequence tag. Relatively low throughput.

Used for gene identification.• SAG = Serial analysis of gene expression. Medium

throughput and longer reads.• RNA seq or whole transcriptome shotgun sequencing.

High throughput and longer reads.

Ole Winther DTU & KU

Page 9: Estimating the size of the transcriptome - Aalto · Estimating the size of the transcriptome Statistical Modeling and Machine Learning in Computational Systems Biology June 22-26,

Introduction Poisson Dirichlet process Results Summary

Non-parametric statistics

• Non-parametric statistics from Wikipedia:

“distribution free methods which do not rely onassumptions that the data are drawn from a givenprobability distribution. As such it is the opposite ofparametric statistics. It includes non-parametric statisticalmodels, inference and statistical tests. . . .”

• Non-parametric models have parameters that are learnedin the same way as in parametric statistics.

• Lijoi, Mena and Prünster, in a series of papers, recentlyapplied the Poisson-Dirichlet process (chinese restaturant)in the context of genomic tag data.

Ole Winther DTU & KU

Page 10: Estimating the size of the transcriptome - Aalto · Estimating the size of the transcriptome Statistical Modeling and Machine Learning in Computational Systems Biology June 22-26,

Introduction Poisson Dirichlet process Results Summary

Non-parametric statistics

Setting up the problem• Library of n tags (reads)• A sequence of genomic coordinates (c1, c2, . . . , cn).• Contains k unique TSSs with counts n = (n1, . . . ,nk ),

n =∑k

j=1 nj .• Label the tags in order of their arrival such that

ci ∈ {1, . . . , k}.• The n + 1th tag may either be one of the k previously seen

TSSs or a new one. to belong to one:cn+1 ∈ {1, . . . , k + 1}.

Ole Winther DTU & KU

Page 11: Estimating the size of the transcriptome - Aalto · Estimating the size of the transcriptome Statistical Modeling and Machine Learning in Computational Systems Biology June 22-26,

Introduction Poisson Dirichlet process Results Summary

Chinese restaurant process - Yor-Pitman sampling formula

Observing new species given counts n = n1, . . . ,nk in k bins:

p(cn+1 = k + 1|n, σ, θ) =θ + kσn + θ

withk∑

i=1

ni = n

Re-observing j :

P(cn+1 = j |n, σ, θ) =nj − σn + θ

Exchangeability – invariant to re-ordering

E ,E ,M,T ,T : p1 =θ

θ

1− σ1 + θ

θ + σ

2 + θ

θ + 2σ3 + θ

1− σ4 + θ

M,E ,T ,T ,E : p2 =θ

θ

θ + σ

1 + θ

θ + 2σ2 + θ

1− σ3 + θ

1− σ4 + θ

= . . . = p1

Ole Winther DTU & KU

Page 12: Estimating the size of the transcriptome - Aalto · Estimating the size of the transcriptome Statistical Modeling and Machine Learning in Computational Systems Biology June 22-26,

Introduction Poisson Dirichlet process Results Summary

Chinese restaurant process - Yor-Pitman sampling formula

• Likelihood function, e.g. E ,E ,M,T ,T

p(n|σ, θ) =θ

θ

1− σ1 + θ

θ + σ

2 + θ

θ + 2σ3 + θ

1− σ4 + θ

=1∏n−1

i=1 (i + θ)

k−1∏j=1

(θ + jσ)k∏

i ′=1

ni′−1∏j ′=1

(j ′ − σ)

• Flat prior for σ ∈ [−θ/k ,1] and θ ≥ 0 pseudo-countparameter.

• Predictions – simulate new sequence cn+1, cn+2, . . . , cn+n′ :

p(cn+1, . . . , cn+n′ |n) =

∫p(cn+1, . . . , cn+n′ |n, σ, θ) p(σ, θ) dσdθ

with Gibbs sampling (σ, θ) and Yor-Pitman sampling forcn+1, . . ..

• 95% highest posterior density (HPD) intervals.Ole Winther DTU & KU

Page 13: Estimating the size of the transcriptome - Aalto · Estimating the size of the transcriptome Statistical Modeling and Machine Learning in Computational Systems Biology June 22-26,

Introduction Poisson Dirichlet process Results Summary

Averaging and maximum likelihood

0.15 0.16 0.17 0.18

1160

1180

1200

1220

1240

1260

1280

1300

1320

1340

−48.25−46.94

−45.64−44.33

−43.03−41.73

−40.42−39.12

−39.12

−37.82

−37.82

−36.51

−36.51

−35.21

−35.21

−33.9

−33.9

−32.6

−32.6

−31.3

−31.3

−29.99

−29.99

−28.69

−28.69

−27.38

−27.38

−26.08

−26.08

−24.78

−24.78

−23.47

−23.47

−22.17

−22.17

−20.86

−20.86

−19.56

−19.56

−18.26

−18.26

−16.95

−16.95

−15.65

−15.65

−14.34

−14.34

−14.34

−13.04

−13.04

−13.04

−11.74

−11.74

−11.74

−11.74

−10.43

−10.43

−10.43

−10.43

−9.13

−9.13−9.13

−9.13

−7.82

−7.82

−7.82

−7.82

−6.52

−6.52−6.52

−6.52

−5.22

−5.22

−5.22

−5.22

−3.91

−3.91

−3.91

−3.91

−2.61

−2.61

−2.61

−2.61

−1.3

−1.3

−1.3

embryo log L − log LML

contours

σ

θ

0.03 0.04 0.05 0.06

2020

2040

2060

2080

2100

2120

2140

2160

2180

−39.7−38.66

−36.57−35.52−34.48

−34.48

−33.43

−33.43

−32.39

−32.39

−31.35

−31.35

−30.3

−30.3

−29.26

−29.26

−28.21

−28.21

−27.17

−27.17

−26.12

−26.12

−25.08

−25.08

−24.03

−24.03

−22.99

−22.99

−21.94

−21.94

−20.9

−20.9

−19.85

−19.85

−18.81

−18.81

−17.76

−17.76

−16.72

−16.72

−15.67

−15.67

−15.67

−14.63

−14.63

−14.63

−14.63

−13.58

−13.58

−13.58

−13.58

−12.54

−12.54

−12.54

−12.54

−11.49

−11.49

−11.49

−11.49

−10.45

−10.45

−10.45

−10.45

−9.4

−9.4

−9.4

−9.4

−8.36

−8.36

−8.36

−8.36

−7.31

−7.31

−7.31

−7.31

−6.27

−6.27

−6.27

−6.27−5.22

−5.22

−5.22

−5.22

−4.18

−4.18

−4.18

−4.18

−4.18

−3.13

−3.13

−3.13

−3.13

−3.13

−2.09

−2.09

−2.09

−2.09

−2.09

−1.04

−1.04−1.04

−1.04

cerebellum log L − log LML

contours

σ

θ

Ole Winther DTU & KU

Page 14: Estimating the size of the transcriptome - Aalto · Estimating the size of the transcriptome Statistical Modeling and Machine Learning in Computational Systems Biology June 22-26,

Introduction Poisson Dirichlet process Results Summary

Predictions

0 1 2 3 4

x 106

0

0.5

1

1.5

2

2.5

3

x 104

n

k

0 1 2 3 4

x 106

0

0.5

1

1.5

2

2.5x 10

4

n

k 2

0 1 2 3

x 105

0

200

400

600

800

1000

n

k 30

0 1 2 3 4 5

x 106

0

5

10

15

x 104

n

k 2

0 1 2 3 4 5

x 106

0

2000

4000

6000

8000

10000

12000

14000

16000

n

k 30

0 1 2 3 4 5

x 106

0

1

2

3

4

5

6x 10

5

n

k

0 1 2 3 4 5

x 106

0

2

4

6

8

10

12

14x 10

5

n

k

0 1 2 3 4 5

x 106

0

0.5

1

1.5

2

2.5

3

3.5x 10

5

n

k 2

0 1 2 3 4 5

x 106

0

0.5

1

1.5

2

2.5x 10

4

n

k 30

cerebellum embryo hcamp liver lung macrophages som_cortex visual_cortex

cerebellum embryo hcamp liver lung macrophages som_cortex visual_cortex

Sin

gle

nu

cleo

tid

e TS

S le

vel(C

TSS)

Co

re p

rom

ote

r lev

el (

TC)

Gen

e le

vel

≥1 tags ≥2 tags ≥30 tags

A B C

D E F

G H I

Ole Winther DTU & KU

Page 15: Estimating the size of the transcriptome - Aalto · Estimating the size of the transcriptome Statistical Modeling and Machine Learning in Computational Systems Biology June 22-26,

Introduction Poisson Dirichlet process Results Summary

Predictions

Cross validation – notice anything funny?

2 4 6 8 10

x 105

0

0.5

1

1.5

2

2.5

3

3.5

4x 10

5

n

k

cerebellum

1.34

1.35

1.36

x 105

2.8

3

3.2x 10

4

253234400

420

440

0.5 1 1.5 2 2.5

x 106

0

0.5

1

1.5

2

2.5

3

3.5

4x 10

5

n

k

embryo

1.38

1.4

1.42x 10

5

3.2

3.4

3.6x 10

4

6321971450

1500

1 2 3 4 5

x 106

0

1

2

3

4

5

6

7

x 105

n

k

hcamp

2.9833.023.043.063.08x 10

5

1

1.05

1.1x 10

5

1.38824e+06

5150

5200

5250

1 2 3 4

x 106

0

1

2

3

4

5

6

7

8x 10

5

n

k

liver

2.822.842.862.88

x 105

7.5

8x 10

4

1.21871e+06

3750

3800

3850

1 2 3 4 5

x 106

0

2

4

6

8

10

x 105

n

k

lung

3.95

4x 10

5

0.9811.021.041.061.08x 10

5

1.29824e+06

435044004450

2 4 6

x 105

0

2

4

6

8

10

12

14

16

18

x 104

n

k

macrophage

6.6

6.7

6.8

x 104

1.7

1.8

1.9x 10

4

169919420

440

460

480

2 4 6 8

x 105

0

0.5

1

1.5

2

2.5

3

3.5x 10

5

n

k

som cortex

1.16

1.17

1.18x 10

5

2.2

2.4

x 104

208943

310320330340350

2 4 6 8

x 105

0

0.5

1

1.5

2

2.5

3

3.5

x 105

n

k

vis cortex

1.24

1.25

1.26x 10

5

2.6

2.8

x 104

232690

400

420

Ole Winther DTU & KU

Page 16: Estimating the size of the transcriptome - Aalto · Estimating the size of the transcriptome Statistical Modeling and Machine Learning in Computational Systems Biology June 22-26,

Introduction Poisson Dirichlet process Results Summary

Predictions

Cross validation – cerebellum

2 4 6 8 10

x 105

0

0.5

1

1.5

2

2.5

3

3.5

4x 10

5

n

k

cerebellum

1.34

1.35

1.36

x 105

2.8

3

3.2x 10

4

253234400

420

440

Ole Winther DTU & KU

Page 17: Estimating the size of the transcriptome - Aalto · Estimating the size of the transcriptome Statistical Modeling and Machine Learning in Computational Systems Biology June 22-26,

Introduction Poisson Dirichlet process Results Summary

Predictions

Cross validation – embryo

0.5 1 1.5 2 2.5

x 106

0

0.5

1

1.5

2

2.5

3

3.5

4x 10

5

n

k

embryo

1.38

1.4

1.42x 10

5

3.2

3.4

3.6x 10

4

6321971450

1500

Ole Winther DTU & KU

Page 18: Estimating the size of the transcriptome - Aalto · Estimating the size of the transcriptome Statistical Modeling and Machine Learning in Computational Systems Biology June 22-26,

Introduction Poisson Dirichlet process Results Summary

Coverage

• We can actually predict more than just k .• Each observed species in the library has a certain

frequency.• We estimate a probability for each species based upon

model and observed frequencynj − σn + θ

Probabilities don’t add up to one.• The coverage says how we have already seen.• Coverage (weight species by their observation

probabilities):

Coverage =k∑

j=1

nj − σn + θ

= 1− θ + kσn + θ

.

Ole Winther DTU & KU

Page 19: Estimating the size of the transcriptome - Aalto · Estimating the size of the transcriptome Statistical Modeling and Machine Learning in Computational Systems Biology June 22-26,

Introduction Poisson Dirichlet process Results Summary

Coverage

Coverage predictions

Library sample stats parameters 90.0 % coverage predictionsCTSS n k coverage σML θML coverage n k

cerebellum 253234 133923 0.592 0.728 9714 0.849 10000000 2068648embryo 632197 138190 0.836 0.749 475 0.900 4576200 610448hcamp 1388237 299038 0.859 0.639 6217 0.900 3633400 560111liver 1218713 282555 0.831 0.722 2027 0.900 8032400 1109792lung 1298243 393792 0.776 0.727 5441 0.872 10000000 1756587macrophages 169919 66257 0.716 0.699 2620 0.900 5529000 787295som cortex 208943 115480 0.565 0.748 7980 0.834 10000000 2206260visual cortex 232690 123847 0.587 0.736 8431 0.845 10000000 2090184

99.9 % coverage predictionsgene n k coverage σML θML coverage n k

cerebellum 203481 15079 0.980 0.069 3090 0.990 436800 18312embryo 510662 16869 0.989 0.254 1161 0.990 554400 17320hcamp 983532 18665 0.995 0.210 1292 already reachedliver 1015553 18711 0.995 0.223 1192 already reachedlung 1015700 23121 0.994 0.142 2418 already reachedmacrophages 141236 10643 0.974 0.233 1223 0.990 495400 16020som cortex 170823 13813 0.977 0.093 2720 0.990 434000 17675visual cortex 189515 14080 0.979 0.086 2744 0.990 420000 17302

Ole Winther DTU & KU

Page 20: Estimating the size of the transcriptome - Aalto · Estimating the size of the transcriptome Statistical Modeling and Machine Learning in Computational Systems Biology June 22-26,

Introduction Poisson Dirichlet process Results Summary

Parametric approach

Comparison TSSs: True, non-parametric and parametric Zhuet. al., PLoS ONE, 2008.

Ole Winther DTU & KU

Page 21: Estimating the size of the transcriptome - Aalto · Estimating the size of the transcriptome Statistical Modeling and Machine Learning in Computational Systems Biology June 22-26,

Introduction Poisson Dirichlet process Results Summary

Parametric approach

Comparison genes: True, non-parametric and parametric.

Ole Winther DTU & KU

Page 22: Estimating the size of the transcriptome - Aalto · Estimating the size of the transcriptome Statistical Modeling and Machine Learning in Computational Systems Biology June 22-26,

Introduction Poisson Dirichlet process Results Summary

Summary

• Non-parametric Poisson Dirichlet (PD) process for complexdata.

• The model is always wrong! (as revealed in modelchecking with sufficient data).

• We need more complex model. With so much data onlyunderfitting is a problem. Go beyond PD.

• Joint w Albin Sandelin, Eivind Valen and Anders Krogh, KUbuilding upon Lijoi, Mena and Prünster.

Ole Winther DTU & KU