Estimating the size of the transcriptome - Aalto …users.ics.aalto.fi/harrila/teaching/summers09/winther_part2.pdf · Estimating the size of the transcriptome ... Ole Winther Technical

Introduction Poisson Dirichlet process Results Summary

Estimating the size of the transcriptomeStatistical Modeling and Machine Learning in Computational

Systems Biology June 22-26, 2009, Tampere, Finland

Ole Winther

Technical University of Denmark (DTU) & University of Copenhagen (KU)

June 24, 2009

Ole Winther DTU & KU


Overview

All lectures

1 Introduction to graphical models and Bayesian networks2 Estimating the size of the transcriptome3 Using biological prior information in motif discovery4 Learning linear Bayes networks with sparse Bayesian

models

Common theme:• Complex Bayesian model building possible and

advantageous• Model checking – prediction, marginal- and test-likelihood



Overview

Lecture 2

• How many species? or in a genomic content• Estimating the size of the transcriptome• High throughput sequencing• Non-parametric Bayesian model• The model is always wrong (and Bayes can’t tell)• Model checking with cross-validation.



Motivation

How many species?



DNA sequence tags - CAGE

Carninci et. al. Nat. Gen. 2006




5 10 15 20 25 300

200

400

600

800

1000

1200

1400

1600

1800

m(n

)

n200 400 600 800 1000 1200

0

10

20

30

40

50

60

m(n

)

n

Cerebellum library - frequency of frequency plot.




5 10 15 20 25 300

200

400

600

800

1000

1200

1400

1600

1800

2000

m(n

)

n5000 10000 15000

0

10

20

30

40

50

60

m(n

)

n

Embryo library



High-throughput sequencing technologies

Solexa and Solid sequencing offer 106 − 108 reads of length20-60 nt at a price comparable to a micro-array.• CAGE = cap analysis gene expression. 5’ end of mRNAs.

Pinpoints transcription start sites (TSSs). High throughput.• EST = expressed sequence tag. Relatively low throughput.

Used for gene identification.• SAG = Serial analysis of gene expression. Medium

throughput and longer reads.• RNA seq or whole transcriptome shotgun sequencing.

High throughput and longer reads.



Non-parametric statistics

• Non-parametric statistics from Wikipedia:

“distribution free methods which do not rely onassumptions that the data are drawn from a givenprobability distribution. As such it is the opposite ofparametric statistics. It includes non-parametric statisticalmodels, inference and statistical tests. . . .”

• Non-parametric models have parameters that are learnedin the same way as in parametric statistics.

• Lijoi, Mena and Prünster, in a series of papers, recentlyapplied the Poisson-Dirichlet process (chinese restaturant)in the context of genomic tag data.



Non-parametric statistics

Setting up the problem• Library of n tags (reads)• A sequence of genomic coordinates (c1, c2, . . . , cn).• Contains k unique TSSs with counts n = (n1, . . . ,nk ),

n =∑k

j=1 nj .• Label the tags in order of their arrival such that

ci ∈ {1, . . . , k}.• The n + 1th tag may either be one of the k previously seen

TSSs or a new one. to belong to one:cn+1 ∈ {1, . . . , k + 1}.



Chinese restaurant process - Yor-Pitman sampling formula

Observing new species given counts n = n1, . . . ,nk in k bins:

p(cn+1 = k + 1|n, σ, θ) =θ + kσn + θ

withk∑

i=1

ni = n

Re-observing j :

P(cn+1 = j |n, σ, θ) =nj − σn + θ

Exchangeability – invariant to re-ordering

E ,E ,M,T ,T : p1 =θ

θ

1− σ1 + θ

θ + σ

2 + θ

θ + 2σ3 + θ

1− σ4 + θ

M,E ,T ,T ,E : p2 =θ

θ

θ + σ

1 + θ

θ + 2σ2 + θ

1− σ3 + θ

1− σ4 + θ

= . . . = p1



Chinese restaurant process - Yor-Pitman sampling formula

• Likelihood function, e.g. E ,E ,M,T ,T

p(n|σ, θ) =θ

θ

1− σ1 + θ

θ + σ

2 + θ

θ + 2σ3 + θ

1− σ4 + θ

=1∏n−1

i=1 (i + θ)

k−1∏j=1

(θ + jσ)k∏

i ′=1

ni′−1∏j ′=1

(j ′ − σ)

• Flat prior for σ ∈ [−θ/k ,1] and θ ≥ 0 pseudo-countparameter.

• Predictions – simulate new sequence cn+1, cn+2, . . . , cn+n′ :

p(cn+1, . . . , cn+n′ |n) =

∫p(cn+1, . . . , cn+n′ |n, σ, θ) p(σ, θ) dσdθ

with Gibbs sampling (σ, θ) and Yor-Pitman sampling forcn+1, . . ..

• 95% highest posterior density (HPD) intervals.Ole Winther DTU & KU


Averaging and maximum likelihood

0.15 0.16 0.17 0.18

1160

1180

1200

1220

1240

1260

1280

1300

1320

1340

−48.25−46.94

−45.64−44.33

−43.03−41.73

−40.42−39.12

−39.12

−37.82

−37.82

−36.51

−36.51

−35.21

−35.21

−33.9

−33.9

−32.6

−32.6

−31.3

−31.3

−29.99

−29.99

−28.69

−28.69

−27.38

−27.38

−26.08

−26.08

−24.78

−24.78

−23.47

−23.47

−22.17

−22.17

−20.86

−20.86

−19.56

−19.56

−18.26

−18.26

−16.95

−16.95

−15.65

−15.65

−14.34

−14.34

−14.34

−13.04

−13.04

−13.04

−11.74

−11.74

−11.74

−11.74

−10.43

−10.43

−10.43

−10.43

−9.13

−9.13−9.13

−9.13

−7.82

−7.82

−7.82

−7.82

−6.52

−6.52−6.52

−6.52

−5.22

−5.22

−5.22

−5.22

−3.91

−3.91

−3.91

−3.91

−2.61

−2.61

−2.61

−2.61

−1.3

−1.3

−1.3

embryo log L − log LML

contours

σ

θ

0.03 0.04 0.05 0.06

2020

2040

2060

2080

2100

2120

2140

2160

2180

−39.7−38.66

−36.57−35.52−34.48

−34.48

−33.43

−33.43

−32.39

−32.39

−31.35

−31.35

−30.3

−30.3

−29.26

−29.26

−28.21

−28.21

−27.17

−27.17

−26.12

−26.12

−25.08

−25.08

−24.03

−24.03

−22.99

−22.99

−21.94

−21.94

−20.9

−20.9

−19.85

−19.85

−18.81

−18.81

−17.76

−17.76

−16.72

−16.72

−15.67

−15.67

−15.67

−14.63

−14.63

−14.63

−14.63

−13.58

−13.58

−13.58

−13.58

−12.54

−12.54

−12.54

−12.54

−11.49

−11.49

−11.49

−11.49

−10.45

−10.45

−10.45

−10.45

−9.4

−9.4

−9.4

−9.4

−8.36

−8.36

−8.36

−8.36

−7.31

−7.31

−7.31

−7.31

−6.27

−6.27

−6.27

−6.27−5.22

−5.22

−5.22

−5.22

−4.18

−4.18

−4.18

−4.18

−4.18

−3.13

−3.13

−3.13

−3.13

−3.13

−2.09

−2.09

−2.09

−2.09

−2.09

−1.04

−1.04−1.04

−1.04

cerebellum log L − log LML

contours

σ

θ



Predictions

0 1 2 3 4

x 106

0

0.5

1

1.5

2

2.5

3

x 104

n

k

0 1 2 3 4

x 106

0

0.5

1

1.5

2

2.5x 10

4

n

k 2

0 1 2 3

x 105

0

200

400

600

800

1000

n

k 30

0 1 2 3 4 5

x 106

0

5

10

15

x 104

n

k 2

0 1 2 3 4 5

x 106

0

2000

4000

6000

8000

10000

12000

14000

16000

n

k 30

0 1 2 3 4 5

x 106

0

1

2

3

4

5

6x 10

5

n

k

0 1 2 3 4 5

x 106

0

2

4

6

8

10

12

14x 10

5

n

k

0 1 2 3 4 5

x 106

0

0.5

1

1.5

2

2.5

3

3.5x 10

5

n

k 2

0 1 2 3 4 5

x 106

0

0.5

1

1.5

2

2.5x 10

4

n

k 30

cerebellum embryo hcamp liver lung macrophages som_cortex visual_cortex

cerebellum embryo hcamp liver lung macrophages som_cortex visual_cortex

Sin

gle

nu

cleo

tid

e TS

S le

vel(C

TSS)

Co

re p

rom

ote

r lev

el (

TC)

Gen

e le

vel

≥1 tags ≥2 tags ≥30 tags

A B C

D E F

G H I



Predictions

Cross validation – notice anything funny?

2 4 6 8 10

x 105

0

0.5

1

1.5

2

2.5

3

3.5

4x 10

5

n

k

cerebellum

1.34

1.35

1.36

x 105

2.8

3

3.2x 10

4

253234400

420

440

0.5 1 1.5 2 2.5

x 106

0

0.5

1

1.5

2

2.5

3

3.5

4x 10

5

n

k

embryo

1.38

1.4

1.42x 10

5

3.2

3.4

3.6x 10

4

6321971450

1500

1 2 3 4 5

x 106

0

1

2

3

4

5

6

7

x 105

n

k

hcamp

2.9833.023.043.063.08x 10

5

1

1.05

1.1x 10

5

1.38824e+06

5150

5200

5250

1 2 3 4

x 106

0

1

2

3

4

5

6

7

8x 10

5

n

k

liver

2.822.842.862.88

x 105

7.5

8x 10

4

1.21871e+06

3750

3800

3850

1 2 3 4 5

x 106

0

2

4

6

8

10

x 105

n

k

lung

3.95

4x 10

5

0.9811.021.041.061.08x 10

5

1.29824e+06

435044004450

2 4 6

x 105

0

2

4

6

8

10

12

14

16

18

x 104

n

k

macrophage

6.6

6.7

6.8

x 104

1.7

1.8

1.9x 10

4

169919420

440

460

480

2 4 6 8

x 105

0

0.5

1

1.5

2

2.5

3

3.5x 10

5

n

k

som cortex

1.16

1.17

1.18x 10

5

2.2

2.4

x 104

208943

310320330340350

2 4 6 8

x 105

0

0.5

1

1.5

2

2.5

3

3.5

x 105

n

k

vis cortex

1.24

1.25

1.26x 10

5

2.6

2.8

x 104

232690

400

420



Predictions

Cross validation – cerebellum

2 4 6 8 10

x 105

0

0.5

1

1.5

2

2.5

3

3.5

4x 10

5

n

k

cerebellum

1.34

1.35

1.36

x 105

2.8

3

3.2x 10

4

253234400

420

440



Predictions

Cross validation – embryo

0.5 1 1.5 2 2.5

x 106

0

0.5

1

1.5

2

2.5

3

3.5

4x 10

5

n

k

embryo

1.38

1.4

1.42x 10

5

3.2

3.4

3.6x 10

4

6321971450

1500



Coverage

• We can actually predict more than just k .• Each observed species in the library has a certain

frequency.• We estimate a probability for each species based upon

model and observed frequencynj − σn + θ

Probabilities don’t add up to one.• The coverage says how we have already seen.• Coverage (weight species by their observation

probabilities):

Coverage =k∑

j=1

nj − σn + θ

= 1− θ + kσn + θ

.



Coverage

Coverage predictions

Library sample stats parameters 90.0 % coverage predictionsCTSS n k coverage σML θML coverage n k

cerebellum 253234 133923 0.592 0.728 9714 0.849 10000000 2068648embryo 632197 138190 0.836 0.749 475 0.900 4576200 610448hcamp 1388237 299038 0.859 0.639 6217 0.900 3633400 560111liver 1218713 282555 0.831 0.722 2027 0.900 8032400 1109792lung 1298243 393792 0.776 0.727 5441 0.872 10000000 1756587macrophages 169919 66257 0.716 0.699 2620 0.900 5529000 787295som cortex 208943 115480 0.565 0.748 7980 0.834 10000000 2206260visual cortex 232690 123847 0.587 0.736 8431 0.845 10000000 2090184

99.9 % coverage predictionsgene n k coverage σML θML coverage n k

cerebellum 203481 15079 0.980 0.069 3090 0.990 436800 18312embryo 510662 16869 0.989 0.254 1161 0.990 554400 17320hcamp 983532 18665 0.995 0.210 1292 already reachedliver 1015553 18711 0.995 0.223 1192 already reachedlung 1015700 23121 0.994 0.142 2418 already reachedmacrophages 141236 10643 0.974 0.233 1223 0.990 495400 16020som cortex 170823 13813 0.977 0.093 2720 0.990 434000 17675visual cortex 189515 14080 0.979 0.086 2744 0.990 420000 17302



Parametric approach

Comparison TSSs: True, non-parametric and parametric Zhuet. al., PLoS ONE, 2008.



Parametric approach

Comparison genes: True, non-parametric and parametric.



Summary

• Non-parametric Poisson Dirichlet (PD) process for complexdata.

• The model is always wrong! (as revealed in modelchecking with sufficient data).

• We need more complex model. With so much data onlyunderfitting is a problem. Go beyond PD.

• Joint w Albin Sandelin, Eivind Valen and Anders Krogh, KUbuilding upon Lijoi, Mena and Prünster.


Documents

Estimating the size of the transcriptome - Aalto …users.ics.aalto.fi/harrila/teaching/summers09/winther_part2.pdf · Estimating the size of the transcriptome ... Ole Winther Technical