88
Transcript expression estimation, normalisation, differential expression Ernest Turro University of Cambridge 14 Sep 2016

Transcript expression estimation ... - University of Malta

  • Upload
    others

  • View
    1

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Transcript expression estimation ... - University of Malta

Transcript expression estimation,normalisation, differential expression

Ernest Turro

University of Cambridge

14 Sep 2016

Page 2: Transcript expression estimation ... - University of Malta

Gene expression

An important aim in genomics is the characterisation of RNAsamples. Specifically:

1. What is the sequence of each distinct RNA in a sample?

2. What is the concentration of each RNA in a sample?

6/103/101/10

NB: in general only relative proportions available

Page 3: Transcript expression estimation ... - University of Malta

Gene expression

An important aim in genomics is the characterisation of RNAsamples. Specifically:

1. What is the sequence of each distinct RNA in a sample?

2. What is the concentration of each RNA in a sample?

6/103/101/10

NB: in general only relative proportions available

Page 4: Transcript expression estimation ... - University of Malta

Gene expressionDifferent kinds of RNAs (tRNAs, rRNAs, mRNAs, other ncRNAs...).

Messenger RNAs of particular interest as they code for proteins.

haplotype

{exon

Paternal chromosome

Maternal chromosome

Mutant alleleReference allele

Page 5: Transcript expression estimation ... - University of Malta

Gene expressionDifferent kinds of RNAs (tRNAs, rRNAs, mRNAs, other ncRNAs...).

Messenger RNAs of particular interest as they code for proteins.

1. Alternative isoforms have distinct sequences2. Two versions of each isoform sequence in diploid organisms

Paternally transcribed mRNAs

Maternally transcribed mRNAs

concentrationμP1

μP2

μP3

μM1

μM2

μM3

Isoforms

Isoforms

{

{μG

+

Paternal chromosome

Maternal chromosome

Mutant alleleReference allele

Page 6: Transcript expression estimation ... - University of Malta

RNA-seq read counts

To infer concentrations, we need to identify

• the set of transcript sequences in the sample

• the set of reads (potentially) emanating from each transcript

Approaches:

• Select transcript sequences from a database (e.g. Ensembl)and align reads to them

• Align reads to genome and infer transcript sequences

• Assemble reads into contigs

In any case, we get a mapping of reads to features of interest(e.g. genes, isoforms, haplotype-specific isoforms).

How do we model the alignments?

Page 7: Transcript expression estimation ... - University of Malta

RNA-seq read counts

To infer concentrations, we need to identify

• the set of transcript sequences in the sample

• the set of reads (potentially) emanating from each transcript

Approaches:

• Select transcript sequences from a database (e.g. Ensembl)and align reads to them

• Align reads to genome and infer transcript sequences

• Assemble reads into contigs

In any case, we get a mapping of reads to features of interest(e.g. genes, isoforms, haplotype-specific isoforms).

How do we model the alignments?

Page 8: Transcript expression estimation ... - University of Malta

RNA-seq read counts

To infer concentrations, we need to identify

• the set of transcript sequences in the sample

• the set of reads (potentially) emanating from each transcript

Approaches:

• Select transcript sequences from a database (e.g. Ensembl)and align reads to them

• Align reads to genome and infer transcript sequences

• Assemble reads into contigs

In any case, we get a mapping of reads to features of interest(e.g. genes, isoforms, haplotype-specific isoforms).

How do we model the alignments?

Page 9: Transcript expression estimation ... - University of Malta

RNA-seq read counts

To infer concentrations, we need to identify

• the set of transcript sequences in the sample

• the set of reads (potentially) emanating from each transcript

Approaches:

• Select transcript sequences from a database (e.g. Ensembl)and align reads to them

• Align reads to genome and infer transcript sequences

• Assemble reads into contigs

In any case, we get a mapping of reads to features of interest(e.g. genes, isoforms, haplotype-specific isoforms).

How do we model the alignments?

Page 10: Transcript expression estimation ... - University of Malta

The Poisson distributionIf independent events occur at a known given rate, then thenumber of such events follows a Poisson distribution.

Examples:• Number of cars crossing a milestone every hour• Number of raindrops falling on a rooftop every minute

Single rate parameter λ (pets rate = cats rate + dogs rate).Mean = variance = rate.

●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●

0 5 10 15 20

0.0

0.1

0.2

0.3

0.4

0.5

0.6

Counts

Pro

babi

lity

rate=0.5

● ●

●● ● ● ● ● ● ● ● ● ● ● ● ● ●

0 5 10 15 20

0.00

0.05

0.10

0.15

0.20

0.25

Counts

Pro

babi

lity

rate=2

● ●

●● ● ● ● ● ● ● ● ●

0 5 10 15 20

0.00

0.05

0.10

0.15

Counts

Pro

babi

lity

rate=5

Page 11: Transcript expression estimation ... - University of Malta

The Poisson distributionIf independent events occur at a known given rate, then thenumber of such events follows a Poisson distribution.

Examples:• Number of cars crossing a milestone every hour• Number of raindrops falling on a rooftop every minute

Single rate parameter λ (pets rate = cats rate + dogs rate).Mean = variance = rate.

●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●

0 5 10 15 20

0.0

0.1

0.2

0.3

0.4

0.5

0.6

Counts

Pro

babi

lity

rate=0.5

● ●

●● ● ● ● ● ● ● ● ● ● ● ● ● ●

0 5 10 15 20

0.00

0.05

0.10

0.15

0.20

0.25

Counts

Pro

babi

lity

rate=2

● ●

●● ● ● ● ● ● ● ● ●

0 5 10 15 200.

000.

050.

100.

15

Counts

Pro

babi

lity

rate=5

Page 12: Transcript expression estimation ... - University of Malta

Basic Poisson model for expression quantification

Number of reads aligning to a transcript increases with

• Total number of reads

• Length of transcript

• Abundance of transcript

Number of reads from gene g captured by Poisson model (Marioniet al. 2008):

rg ∼ Poisson(bµg lg),

• µg: concentration of RNAs from gene g

• lg: effective length of the gene

• b: normalisation constant (e.g. total no. of reads)

Page 13: Transcript expression estimation ... - University of Malta

Basic Poisson model for expression quantification

Number of reads aligning to a transcript increases with

• Total number of reads

• Length of transcript

• Abundance of transcript

Number of reads from gene g captured by Poisson model (Marioniet al. 2008):

rg ∼ Poisson(bµg lg),

• µg: concentration of RNAs from gene g

• lg: effective length of the gene

• b: normalisation constant (e.g. total no. of reads)

Page 14: Transcript expression estimation ... - University of Malta

Basic Poisson model for expression quantification

Basic model is useful but:

• “gene length” ambiguous — fragments from several isoformswith different lengths are sequenced

• reads counts not always observed due to sequence sharing(e.g. paralogous families)

Can we estimate expression for each isoform?

• Isoform read counts in general not observed:

?

Isoform 1

Isoform 2

• We need a read count model for isoforms

Page 15: Transcript expression estimation ... - University of Malta

Basic Poisson model for expression quantification

Basic model is useful but:

• “gene length” ambiguous — fragments from several isoformswith different lengths are sequenced

• reads counts not always observed due to sequence sharing(e.g. paralogous families)

Can we estimate expression for each isoform?

• Isoform read counts in general not observed:

?

Isoform 1

Isoform 2

• We need a read count model for isoforms

Page 16: Transcript expression estimation ... - University of Malta

Basic Poisson model for expression quantification

Recall that sequencing allows us to distinguish alleles atheterozygous positions.

paternal

maternal

AT

Can we use RNA-seq to detect allelic imbalance?

We need a read count model for alleles

Page 17: Transcript expression estimation ... - University of Malta

The binomial distributionA Bernoulli trial is an experiment in which “success” occurs withprobability p and “failure” occurs with probability 1 − p.

The number of successes given n Bernoulli trials follows abinomial distribution with parameters n and p. E(X) = np.

Examples:• Number of heads after n coin tosses. p , 0.5 if not fair• Number of times you win the lottery (tiny p (but £££))

●● ● ● ● ● ● ● ● ● ● ● ● ● ●

0 5 10 15 20

0.00

0.05

0.10

0.15

0.20

0.25

Successes

Pro

babi

lity

n=20,p=0.1

● ●●

●● ● ● ● ● ●

0 5 10 15 20

0.00

0.05

0.10

0.15

Successes

Pro

babi

lity

n=20,p=0.4

●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●

●●

0 20 40 60 80 100

0.00

0.05

0.10

0.15

Successes

Pro

babi

lity

n=100,p=0.95

Page 18: Transcript expression estimation ... - University of Malta

The binomial distributionA Bernoulli trial is an experiment in which “success” occurs withprobability p and “failure” occurs with probability 1 − p.

The number of successes given n Bernoulli trials follows abinomial distribution with parameters n and p. E(X) = np.

Examples:• Number of heads after n coin tosses. p , 0.5 if not fair• Number of times you win the lottery (tiny p (but £££))

●● ● ● ● ● ● ● ● ● ● ● ● ● ●

0 5 10 15 20

0.00

0.05

0.10

0.15

0.20

0.25

Successes

Pro

babi

lity

n=20,p=0.1

● ●●

●● ● ● ● ● ●

0 5 10 15 20

0.00

0.05

0.10

0.15

Successes

Pro

babi

lity

n=20,p=0.4

●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●

●●

0 20 40 60 80 100

0.00

0.05

0.10

0.15

Successes

Pro

babi

lity

n=100,p=0.95

Page 19: Transcript expression estimation ... - University of Malta

The multinomial distributionA Bernoulli trial is an experiment in which “success” occurs withprobability p and “failure” occurs with probability 1 − p.

The number of successes given n Bernoulli trials follows abinomial distribution with parameters n and p.

Examples:• Number of heads after n coin tosses. p , 0.5 if it is unfair• Number of times you hit the bullseye out of n shots

If there are > 2 categories, the per-category counts follow amultinomial distribution with parameters n and (p1, p2, . . .).

Example:• Number of 1s, 2s, 3s, 4s, 5s, 6s if you roll a die n times. If{pi} ,

16 then the die is not fair.

Page 20: Transcript expression estimation ... - University of Malta

Basic Binomial model for allelic imbalance• Reads permit discrimination between two copies of an isoform

paternal

maternal

AT

• Binomial test:r0∑

r=0P(r |p = 0.5, n = r0 + r1) < α? (Degner et

al. 2009)

. E.g. suppose r0 = 2; r1 = 6:

P(r = 0|p = 0.5, n = 8) = 0.00390625

P(r = 1|p = 0.5, n = 8) = 0.03125

P(r = 2|p = 0.5, n = 8) = 0.1093752∑

r=0

P(r |p = 0.5, n = 8) = 0.1445312 (not significant)

Page 21: Transcript expression estimation ... - University of Malta

Basic Binomial model for allelic imbalance• Reads permit discrimination between two copies of an isoform

paternal

maternal

AT

• Binomial test:r0∑

r=0P(r |p = 0.5, n = r0 + r1) < α? (Degner et

al. 2009). E.g. suppose r0 = 2; r1 = 6:

P(r = 0|p = 0.5, n = 8) = 0.00390625

P(r = 1|p = 0.5, n = 8) = 0.03125

P(r = 2|p = 0.5, n = 8) = 0.1093752∑

r=0

P(r |p = 0.5, n = 8) = 0.1445312 (not significant)

Page 22: Transcript expression estimation ... - University of Malta

Basic Binomial model for allelic imbalance

• What if there are multiple SNPs and isoforms?

paternal

maternal

A

T

maternal

paternal

T

G

• Binomial test not appropriate

• We need a read count model for haplotype-specificisoforms

Page 23: Transcript expression estimation ... - University of Malta

Multi-mapping reads

• Align reads back to reference transcript sequences withBowtie (Langmead et al. 2009), allowing multiple alignmentsper read

• Multi-mapping structure between reads and transcripts

A

B

C

D

Reads Transcripts

Page 24: Transcript expression estimation ... - University of Malta

Multi-mapping reads

• Obtain transcript sets, such that each read maps to only 1 set

• Transcripts may belong to more than one set

• Read counts per set can be observed

• Transcripts can be isoforms sharing exons or from multiplegenes

A

B

C

D

A

B

D

A

B

C

Reads Transcripts

Reads Transcripts Transcriptsets

1

2

3

Page 25: Transcript expression estimation ... - University of Malta

Poisson model for transcript set reads counts

Model reads per transcript set instead of per gene (Turro et al.2011).

Define Mit =

{1 if transcript t in set i,0 otherwise.

Now model for reads counts is:

ki ∼ Poisson(bsi

∑t

Mitµt),

where si is the effective length shared by transcripts in set i.

Page 26: Transcript expression estimation ... - University of Malta

Latent variables for read counts

A

B

D

A

B

C

Reads Transcripts Transcriptsets

1

2

3

M =

⇤⇧

1 1 0 01 1 1 00 0 0 1

⌅⌃

X =

⇤⇧

X11 X12 0 0X21 X22 X23 00 0 0 X44

⌅⌃

r =�

r1 r2 r3 r4⇥

M =

⇤⇧

1 1 0 01 1 1 00 0 0 1

⌅⌃

X =

⇤⇧

X11 X12 0 0X21 X22 X23 00 0 0 X44

⌅⌃

r =�

r1 r2 r3 r4⇥

M =

⇤⇧

1 1 0 01 1 1 00 0 0 1

⌅⌃

X =

⇤⇧

X11 X12 0 0X21 X22 X23 00 0 0 X44

⌅⌃

r =�

r1 r2 r3 r4⇥

M =

⇤⇧

1 1 0 01 1 1 00 0 0 1

⌅⌃

X =

⇤⇧

X11 X12 0 0X21 X22 X23 00 0 0 X44

⌅⌃

r =�

r1 r2 r3 r4⇥

M =

⇤⇧

1 1 0 01 1 1 00 0 0 1

⌅⌃

X =

⇤⇧

X11 X12 0 0X21 X22 X23 00 0 0 X44

⌅⌃

r =�

r1 r2 r3 r4⇥

M =

⇤⇧

1 1 0 01 1 1 00 0 0 1

⌅⌃

X =

⇤⇧

X11 X12 0 0X21 X22 X23 00 0 0 X44

⌅⌃

r =�

r1 r2 r3 r4⇥

M =

⇤⇧

1 1 0 01 1 1 00 0 0 1

⌅⌃

X =

⇤⇧

X11 X12 0 0X21 X22 X23 00 0 0 X44

⌅⌃

r =�

r1 r2 r3 r4⇥

M =

⇤⇧

1 1 0 01 1 1 00 0 0 1

⌅⌃

X =

⇤⇧

X11 X12 0 0X21 X22 X23 00 0 0 X44

⌅⌃

r =�

r1 r2 r3 r4⇥

M =

⇤⇧

1 1 0 01 1 1 00 0 0 1

⌅⌃

X =

⇤⇧

X11 X12 0 0X21 X22 X23 00 0 0 X44

⌅⌃

r =�

r1 r2 r3 r4⇥

M =

⇤⇧

1 1 0 01 1 1 00 0 0 1

⌅⌃

X =

⇤⇧

X11 X12 0 0X21 X22 X23 00 0 0 X44

⌅⌃

r =�

r1 r2 r3 r4⇥

M =

⇤⇧

1 1 0 01 1 1 00 0 0 1

⌅⌃

X =

⇤⇧

X11 X12 0 0X21 X22 X23 00 0 0 X44

⌅⌃

r =�

r1 r2 r3 r4⇥

M =

⇤⇧

1 1 0 01 1 1 00 0 0 1

⌅⌃

X =

⇤⇧

X11 X12 0 0X21 X22 X23 00 0 0 X44

⌅⌃

r =�

r1 r2 r3 r4⇥

M =

⇤⇧

1 1 0 01 1 1 00 0 0 1

⌅⌃

X =

⇤⇧

X11 X12 0 0X21 X22 X23 00 0 0 X44

⌅⌃

r =�

r1 r2 r3 r4⇥

Latent variables for read counts

A

B

D

A

B

C

Reads Transcripts Transcriptsets

1

2

3

Latent variables for read counts

M =

�⇤

1 1 0 01 1 1 00 0 0 1

⇥⌅

X =

�⇤

X11 X12 0 0X21 X22 X23 00 0 0 X34

⇥⌅

ki � Poisson(bsi�

t

Mitµt)

Xit � Pois(bsiMitµt)

Similar model has been used by Richard et al. 2010; Jiang and Wong 2009(defining transcript sets by exon-sharing, don’t use reads mapping tomultiple genes),also by Li et al. 2010.

None of these have used haplotype-specific isoforms.Alex Lewin (Imperial) Haplotype-specific isoform expression 12 April 2011 8 / 20

A B C D

123

Poissonmodelfortranscriptsetreadscounts

Modelreadspertranscriptsetinsteadofpergene.

DefineMit = �1iftranscripttinseti,0otherwise.

Nowmodelforreadscountsis:

ki �Poisson(bsi ⇥t Mit µt ),

wheresi isthee�ectivelengthsharedbytranscriptsinseti.

Transcripts

Transcriptsets

Poissonmodelfortranscriptsetreadscounts

Modelreadspertranscriptsetinsteadofpergene.

DefineMit=

�1iftranscripttinseti,0otherwise.

Nowmodelforreadscountsis:

ki�Poisson(bsi⇥

t

Mitµt),

wheresiisthee�ectivelengthsharedbytranscriptsinseti.

Latent variables for read counts

M =

�⇤

1 1 0 01 1 1 00 0 0 1

⇥⌅

X =

�⇤

X11 X12 0 0X21 X22 X23 00 0 0 X34

⇥⌅

ki � Poisson(bsi�

t

Mitµt)

Xit � Pois(bsiMitµt)

Similar model has been used by Richard et al. 2010; Jiang and Wong 2009(defining transcript sets by exon-sharing, don’t use reads mapping tomultiple genes),also by Li et al. 2010.

None of these have used haplotype-specific isoforms.Alex Lewin (Imperial) Haplotype-specific isoform expression 12 April 2011 8 / 20

A B C D

123

Poissonmodelfortranscriptsetreadscounts

Modelreadspertranscriptsetinsteadofpergene.

DefineMit = �1iftranscripttinseti,0otherwise.

Nowmodelforreadscountsis:

ki �Poisson(bsi ⇥t Mit µt ),

wheresi isthee�ectivelengthsharedbytranscriptsinseti.

Transcripts

Transcriptsets

Poissonmodelfortranscriptsetreadscounts

Modelreadspertranscriptsetinsteadofpergene.

DefineMit=

�1iftranscripttinseti,0otherwise.

Nowmodelforreadscountsis:

ki�Poisson(bsi⇥

t

Mitµt),

wheresiisthee�ectivelengthsharedbytranscriptsinseti.

Latent variables for read counts

M =

�⇤

1 1 0 01 1 1 00 0 0 1

⇥⌅

X =

�⇤

X11 X12 0 0X21 X22 X23 00 0 0 X34

⇥⌅

ki � Poisson(bsi�

t

Mitµt)

Xit � Pois(bsiMitµt)

Similar model has been used by Richard et al. 2010; Jiang and Wong 2009(defining transcript sets by exon-sharing, don’t use reads mapping tomultiple genes),also by Li et al. 2010.

None of these have used haplotype-specific isoforms.Alex Lewin (Imperial) Haplotype-specific isoform expression 12 April 2011 8 / 20

Latent variables for read counts

A

B

D

A

B

C

Reads Transcripts Transcriptsets

1

2

3

Latent variables for read counts

M =

�⇤

1 1 0 01 1 1 00 0 0 1

⇥⌅

X =

�⇤

X11 X12 0 0X21 X22 X23 00 0 0 X34

⇥⌅

ki � Poisson(bsi�

t

Mitµt)

Xit � Pois(bsiMitµt)

Similar model has been used by Richard et al. 2010; Jiang and Wong 2009(defining transcript sets by exon-sharing, don’t use reads mapping tomultiple genes),also by Li et al. 2010.

None of these have used haplotype-specific isoforms.Alex Lewin (Imperial) Haplotype-specific isoform expression 12 April 2011 8 / 20

A B C D

123

Poissonmodelfortranscriptsetreadscounts

Modelreadspertranscriptsetinsteadofpergene.

DefineMit = �1iftranscripttinseti,0otherwise.

Nowmodelforreadscountsis:

ki �Poisson(bsi ⇥t Mit µt ),

wheresi isthee�ectivelengthsharedbytranscriptsinseti.

Transcripts

Transcriptsets

Poissonmodelfortranscriptsetreadscounts

Modelreadspertranscriptsetinsteadofpergene.

DefineMit=

�1iftranscripttinseti,0otherwise.

Nowmodelforreadscountsis:

ki�Poisson(bsi⇥

t

Mitµt),

wheresiisthee�ectivelengthsharedbytranscriptsinseti.

Latent variables for read counts

M =

�⇤

1 1 0 01 1 1 00 0 0 1

⇥⌅

X =

�⇤

X11 X12 0 0X21 X22 X23 00 0 0 X34

⇥⌅

ki � Poisson(bsi�

t

Mitµt)

Xit � Pois(bsiMitµt)

Similar model has been used by Richard et al. 2010; Jiang and Wong 2009(defining transcript sets by exon-sharing, don’t use reads mapping tomultiple genes),also by Li et al. 2010.

None of these have used haplotype-specific isoforms.Alex Lewin (Imperial) Haplotype-specific isoform expression 12 April 2011 8 / 20

A B C D

123

Poissonmodelfortranscriptsetreadscounts

Modelreadspertranscriptsetinsteadofpergene.

DefineMit = �1iftranscripttinseti,0otherwise.

Nowmodelforreadscountsis:

ki �Poisson(bsi ⇥t Mit µt ),

wheresi isthee�ectivelengthsharedbytranscriptsinseti.

Transcripts

Transcriptsets

Poissonmodelfortranscriptsetreadscounts

Modelreadspertranscriptsetinsteadofpergene.

DefineMit=

�1iftranscripttinseti,0otherwise.

Nowmodelforreadscountsis:

ki�Poisson(bsi⇥

t

Mitµt),

wheresiisthee�ectivelengthsharedbytranscriptsinseti.

Latent variables for read counts

M =

�⇤

1 1 0 01 1 1 00 0 0 1

⇥⌅

X =

�⇤

X11 X12 0 0X21 X22 X23 00 0 0 X34

⇥⌅

ki � Poisson(bsi�

t

Mitµt)

Xit � Pois(bsiMitµt)

Similar model has been used by Richard et al. 2010; Jiang and Wong 2009(defining transcript sets by exon-sharing, don’t use reads mapping tomultiple genes),also by Li et al. 2010.

None of these have used haplotype-specific isoforms.Alex Lewin (Imperial) Haplotype-specific isoform expression 12 April 2011 8 / 20

k =

�⇤

212

⇥⌅

r =�

r1 r2 r3 r4⇥Xit � Poisson(bsiMitµt),

ki � Poisson(bsi⌃

t

Mitµt),

{X1t , . . . , Xmt}|rt � Mult

⇤rt ,

M1ts1⇧i Mitsi

, . . . ,Mmtsm⇧

i Mitsi

⌅.

A B C D

Poissonmodelfortranscriptsetreadscounts

Modelreadspertranscriptsetinsteadofpergene.

DefineMit = �1iftranscripttinseti,0otherwise.

Nowmodelforreadscountsis:

ki �Poisson(bsi ⇥t Mit µt ),

wheresi isthee�ectivelengthsharedbytranscriptsinseti.

Transcripts

Transcriptsets

123

Poissonmodelfortranscriptsetreadscounts

Modelreadspertranscriptsetinsteadofpergene.

DefineMit=

�1iftranscripttinseti,0otherwise.

Nowmodelforreadscountsis:

ki�Poisson(bsi⇥

t

Mitµt),

wheresiisthee�ectivelengthsharedbytranscriptsinseti.

Transcriptsets

123

Poissonmodelfortranscriptsetreadscounts

Modelreadspertranscriptsetinsteadofpergene.

DefineMit=

�1iftranscripttinseti,0otherwise.

Nowmodelforreadscountsis:

ki�Poisson(bsi⇥

t

Mitµt),

wheresiisthee�ectivelengthsharedbytranscriptsinseti.

Observedset counts

Poissonmodelfortranscriptsetreadscounts

Modelreadspertranscriptsetinsteadofpergene.

DefineMit = �1iftranscripttinseti,0otherwise.

Nowmodelforreadscountsis:

ki �Poisson(bsi ⇥t Mit µt ),

wheresi isthee�ectivelengthsharedbytranscriptsinseti. Unobservedtranscript counts

Poisson model for transcript set reads counts

Model reads per transcript set instead of per gene.

Define Mit =

�1 if transcript t in set i ,0 otherwise.

Now model for reads counts is:

ki � Poisson(bsi⇥

t

Mitµt),

where si is the e�ective length shared by transcripts in set i .

3

Xit ∼ Poisson(bsiMitµt),

ki ∼ Poisson(bsi

∑t

Mitµt),

rt ∼ Poisson(bµt

∑i

Mitsi) = Poisson(bltµt),

{Xi1, . . . ,Xin}|{µ1, . . . , µn}, ki ∼ Mult(ki ,Mi1µ1∑t Mitµt

, . . . ,Minµn∑t Mitµt

).

Page 27: Transcript expression estimation ... - University of Malta

Latent variables for read counts

A

B

D

A

B

C

Reads Transcripts Transcriptsets

1

2

3

M =

⇤⇧

1 1 0 01 1 1 00 0 0 1

⌅⌃

X =

⇤⇧

X11 X12 0 0X21 X22 X23 00 0 0 X44

⌅⌃

r =�

r1 r2 r3 r4⇥

M =

⇤⇧

1 1 0 01 1 1 00 0 0 1

⌅⌃

X =

⇤⇧

X11 X12 0 0X21 X22 X23 00 0 0 X44

⌅⌃

r =�

r1 r2 r3 r4⇥

M =

⇤⇧

1 1 0 01 1 1 00 0 0 1

⌅⌃

X =

⇤⇧

X11 X12 0 0X21 X22 X23 00 0 0 X44

⌅⌃

r =�

r1 r2 r3 r4⇥

M =

⇤⇧

1 1 0 01 1 1 00 0 0 1

⌅⌃

X =

⇤⇧

X11 X12 0 0X21 X22 X23 00 0 0 X44

⌅⌃

r =�

r1 r2 r3 r4⇥

M =

⇤⇧

1 1 0 01 1 1 00 0 0 1

⌅⌃

X =

⇤⇧

X11 X12 0 0X21 X22 X23 00 0 0 X44

⌅⌃

r =�

r1 r2 r3 r4⇥

M =

⇤⇧

1 1 0 01 1 1 00 0 0 1

⌅⌃

X =

⇤⇧

X11 X12 0 0X21 X22 X23 00 0 0 X44

⌅⌃

r =�

r1 r2 r3 r4⇥

M =

⇤⇧

1 1 0 01 1 1 00 0 0 1

⌅⌃

X =

⇤⇧

X11 X12 0 0X21 X22 X23 00 0 0 X44

⌅⌃

r =�

r1 r2 r3 r4⇥

M =

⇤⇧

1 1 0 01 1 1 00 0 0 1

⌅⌃

X =

⇤⇧

X11 X12 0 0X21 X22 X23 00 0 0 X44

⌅⌃

r =�

r1 r2 r3 r4⇥

M =

⇤⇧

1 1 0 01 1 1 00 0 0 1

⌅⌃

X =

⇤⇧

X11 X12 0 0X21 X22 X23 00 0 0 X44

⌅⌃

r =�

r1 r2 r3 r4⇥

M =

⇤⇧

1 1 0 01 1 1 00 0 0 1

⌅⌃

X =

⇤⇧

X11 X12 0 0X21 X22 X23 00 0 0 X44

⌅⌃

r =�

r1 r2 r3 r4⇥

M =

⇤⇧

1 1 0 01 1 1 00 0 0 1

⌅⌃

X =

⇤⇧

X11 X12 0 0X21 X22 X23 00 0 0 X44

⌅⌃

r =�

r1 r2 r3 r4⇥

M =

⇤⇧

1 1 0 01 1 1 00 0 0 1

⌅⌃

X =

⇤⇧

X11 X12 0 0X21 X22 X23 00 0 0 X44

⌅⌃

r =�

r1 r2 r3 r4⇥

M =

⇤⇧

1 1 0 01 1 1 00 0 0 1

⌅⌃

X =

⇤⇧

X11 X12 0 0X21 X22 X23 00 0 0 X44

⌅⌃

r =�

r1 r2 r3 r4⇥

Latent variables for read counts

A

B

D

A

B

C

Reads Transcripts Transcriptsets

1

2

3

Latent variables for read counts

M =

�⇤

1 1 0 01 1 1 00 0 0 1

⇥⌅

X =

�⇤

X11 X12 0 0X21 X22 X23 00 0 0 X34

⇥⌅

ki � Poisson(bsi�

t

Mitµt)

Xit � Pois(bsiMitµt)

Similar model has been used by Richard et al. 2010; Jiang and Wong 2009(defining transcript sets by exon-sharing, don’t use reads mapping tomultiple genes),also by Li et al. 2010.

None of these have used haplotype-specific isoforms.Alex Lewin (Imperial) Haplotype-specific isoform expression 12 April 2011 8 / 20

A B C D

123

Poissonmodelfortranscriptsetreadscounts

Modelreadspertranscriptsetinsteadofpergene.

DefineMit = �1iftranscripttinseti,0otherwise.

Nowmodelforreadscountsis:

ki �Poisson(bsi ⇥t Mit µt ),

wheresi isthee�ectivelengthsharedbytranscriptsinseti.

Transcripts

Transcriptsets

Poissonmodelfortranscriptsetreadscounts

Modelreadspertranscriptsetinsteadofpergene.

DefineMit=

�1iftranscripttinseti,0otherwise.

Nowmodelforreadscountsis:

ki�Poisson(bsi⇥

t

Mitµt),

wheresiisthee�ectivelengthsharedbytranscriptsinseti.

Latent variables for read counts

M =

�⇤

1 1 0 01 1 1 00 0 0 1

⇥⌅

X =

�⇤

X11 X12 0 0X21 X22 X23 00 0 0 X34

⇥⌅

ki � Poisson(bsi�

t

Mitµt)

Xit � Pois(bsiMitµt)

Similar model has been used by Richard et al. 2010; Jiang and Wong 2009(defining transcript sets by exon-sharing, don’t use reads mapping tomultiple genes),also by Li et al. 2010.

None of these have used haplotype-specific isoforms.Alex Lewin (Imperial) Haplotype-specific isoform expression 12 April 2011 8 / 20

A B C D

123

Poissonmodelfortranscriptsetreadscounts

Modelreadspertranscriptsetinsteadofpergene.

DefineMit = �1iftranscripttinseti,0otherwise.

Nowmodelforreadscountsis:

ki �Poisson(bsi ⇥t Mit µt ),

wheresi isthee�ectivelengthsharedbytranscriptsinseti.

Transcripts

Transcriptsets

Poissonmodelfortranscriptsetreadscounts

Modelreadspertranscriptsetinsteadofpergene.

DefineMit=

�1iftranscripttinseti,0otherwise.

Nowmodelforreadscountsis:

ki�Poisson(bsi⇥

t

Mitµt),

wheresiisthee�ectivelengthsharedbytranscriptsinseti.

Latent variables for read counts

M =

�⇤

1 1 0 01 1 1 00 0 0 1

⇥⌅

X =

�⇤

X11 X12 0 0X21 X22 X23 00 0 0 X34

⇥⌅

ki � Poisson(bsi�

t

Mitµt)

Xit � Pois(bsiMitµt)

Similar model has been used by Richard et al. 2010; Jiang and Wong 2009(defining transcript sets by exon-sharing, don’t use reads mapping tomultiple genes),also by Li et al. 2010.

None of these have used haplotype-specific isoforms.Alex Lewin (Imperial) Haplotype-specific isoform expression 12 April 2011 8 / 20

Latent variables for read counts

A

B

D

A

B

C

Reads Transcripts Transcriptsets

1

2

3

Latent variables for read counts

M =

�⇤

1 1 0 01 1 1 00 0 0 1

⇥⌅

X =

�⇤

X11 X12 0 0X21 X22 X23 00 0 0 X34

⇥⌅

ki � Poisson(bsi�

t

Mitµt)

Xit � Pois(bsiMitµt)

Similar model has been used by Richard et al. 2010; Jiang and Wong 2009(defining transcript sets by exon-sharing, don’t use reads mapping tomultiple genes),also by Li et al. 2010.

None of these have used haplotype-specific isoforms.Alex Lewin (Imperial) Haplotype-specific isoform expression 12 April 2011 8 / 20

A B C D

123

Poissonmodelfortranscriptsetreadscounts

Modelreadspertranscriptsetinsteadofpergene.

DefineMit = �1iftranscripttinseti,0otherwise.

Nowmodelforreadscountsis:

ki �Poisson(bsi ⇥t Mit µt ),

wheresi isthee�ectivelengthsharedbytranscriptsinseti.

Transcripts

Transcriptsets

Poissonmodelfortranscriptsetreadscounts

Modelreadspertranscriptsetinsteadofpergene.

DefineMit=

�1iftranscripttinseti,0otherwise.

Nowmodelforreadscountsis:

ki�Poisson(bsi⇥

t

Mitµt),

wheresiisthee�ectivelengthsharedbytranscriptsinseti.

Latent variables for read counts

M =

�⇤

1 1 0 01 1 1 00 0 0 1

⇥⌅

X =

�⇤

X11 X12 0 0X21 X22 X23 00 0 0 X34

⇥⌅

ki � Poisson(bsi�

t

Mitµt)

Xit � Pois(bsiMitµt)

Similar model has been used by Richard et al. 2010; Jiang and Wong 2009(defining transcript sets by exon-sharing, don’t use reads mapping tomultiple genes),also by Li et al. 2010.

None of these have used haplotype-specific isoforms.Alex Lewin (Imperial) Haplotype-specific isoform expression 12 April 2011 8 / 20

A B C D

123

Poissonmodelfortranscriptsetreadscounts

Modelreadspertranscriptsetinsteadofpergene.

DefineMit = �1iftranscripttinseti,0otherwise.

Nowmodelforreadscountsis:

ki �Poisson(bsi ⇥t Mit µt ),

wheresi isthee�ectivelengthsharedbytranscriptsinseti.

Transcripts

Transcriptsets

Poissonmodelfortranscriptsetreadscounts

Modelreadspertranscriptsetinsteadofpergene.

DefineMit=

�1iftranscripttinseti,0otherwise.

Nowmodelforreadscountsis:

ki�Poisson(bsi⇥

t

Mitµt),

wheresiisthee�ectivelengthsharedbytranscriptsinseti.

Latent variables for read counts

M =

�⇤

1 1 0 01 1 1 00 0 0 1

⇥⌅

X =

�⇤

X11 X12 0 0X21 X22 X23 00 0 0 X34

⇥⌅

ki � Poisson(bsi�

t

Mitµt)

Xit � Pois(bsiMitµt)

Similar model has been used by Richard et al. 2010; Jiang and Wong 2009(defining transcript sets by exon-sharing, don’t use reads mapping tomultiple genes),also by Li et al. 2010.

None of these have used haplotype-specific isoforms.Alex Lewin (Imperial) Haplotype-specific isoform expression 12 April 2011 8 / 20

k =

�⇤

212

⇥⌅

r =�

r1 r2 r3 r4⇥Xit � Poisson(bsiMitµt),

ki � Poisson(bsi⌃

t

Mitµt),

{X1t , . . . , Xmt}|rt � Mult

⇤rt ,

M1ts1⇧i Mitsi

, . . . ,Mmtsm⇧

i Mitsi

⌅.

A B C D

Poissonmodelfortranscriptsetreadscounts

Modelreadspertranscriptsetinsteadofpergene.

DefineMit = �1iftranscripttinseti,0otherwise.

Nowmodelforreadscountsis:

ki �Poisson(bsi ⇥t Mit µt ),

wheresi isthee�ectivelengthsharedbytranscriptsinseti.

Transcripts

Transcriptsets

123

Poissonmodelfortranscriptsetreadscounts

Modelreadspertranscriptsetinsteadofpergene.

DefineMit=

�1iftranscripttinseti,0otherwise.

Nowmodelforreadscountsis:

ki�Poisson(bsi⇥

t

Mitµt),

wheresiisthee�ectivelengthsharedbytranscriptsinseti.

Transcriptsets

123

Poissonmodelfortranscriptsetreadscounts

Modelreadspertranscriptsetinsteadofpergene.

DefineMit=

�1iftranscripttinseti,0otherwise.

Nowmodelforreadscountsis:

ki�Poisson(bsi⇥

t

Mitµt),

wheresiisthee�ectivelengthsharedbytranscriptsinseti.

Observedset counts

Poissonmodelfortranscriptsetreadscounts

Modelreadspertranscriptsetinsteadofpergene.

DefineMit = �1iftranscripttinseti,0otherwise.

Nowmodelforreadscountsis:

ki �Poisson(bsi ⇥t Mit µt ),

wheresi isthee�ectivelengthsharedbytranscriptsinseti. Unobservedtranscript counts

Poisson model for transcript set reads counts

Model reads per transcript set instead of per gene.

Define Mit =

�1 if transcript t in set i ,0 otherwise.

Now model for reads counts is:

ki � Poisson(bsi⇥

t

Mitµt),

where si is the e�ective length shared by transcripts in set i .

3

Xit ∼ Poisson(bsiMitµt),

ki ∼ Poisson(bsi

∑t

Mitµt),

rt ∼ Poisson(bµt

∑i

Mitsi) = Poisson(bltµt),

{Xi1, . . . ,Xin}|{µ1, . . . , µn}, ki ∼ Mult(ki ,Mi1µ1∑t Mitµt

, . . . ,Minµn∑t Mitµt

).

Page 28: Transcript expression estimation ... - University of Malta

Concrete example

e1 e2 e3

e1 e3

d1 d2 d3

d1 d3d4

A

t1

t2

ε-1

ε-1

Page 29: Transcript expression estimation ... - University of Malta

Concrete example

BRIEF ARTICLE

THE AUTHOR

M =

1 11 00 1

k =

641

s =

d1 + d3

d2

d4

=

e1 + e3 − 2(�− 1)e2 + �− 1

�− 1

d1 = e1 − � + 1(1)

d2 = e2 + �− 1(2)

d3 = e3 − � + 1(3)

l1 = s1 + s2 = d1 + d3 + d2(4)

l2 = s1 + s3 = d1 + d3 + �− 1(5)

l1 = s1 + s2 = e1 + e2 + e3 − (�− 1)(6)

l2 = s1 + s3 = e1 + e3 − (�− 1)(7)

s =

d1 + d3

d2

d2

1

BRIEF ARTICLE

THE AUTHOR

M =

1 11 00 1

k =

641

s =

d1 + d3

d2

d4

=

e1 + e3 − 2(�− 1)e2 + �− 1

�− 1

d1 = e1 − � + 1(1)

d2 = e2 + �− 1(2)

d3 = e3 − � + 1(3)

l1 = s1 + s2 = d1 + d3 + d2(4)

l2 = s1 + s3 = d1 + d3 + �− 1(5)

l1 = s1 + s2 = e1 + e2 + e3 − (�− 1)(6)

l2 = s1 + s3 = e1 + e3 − (�− 1)(7)

1

t2

t1

t1 t2

e1 e2 e3

e1 e3

d1 d2 d3

d1 d3d4

A

t1

t2

ε-1

ε-1

t1,t2 t1 t1,t2t1t1,t2 t1,t2

t1t1,t2 t1,t2

Page 30: Transcript expression estimation ... - University of Malta

Concrete example

BRIEF ARTICLE

THE AUTHOR

M =

1 11 00 1

k =

641

s =

d1 + d3

d2

d4

=

e1 + e3 − 2(�− 1)e2 + �− 1

�− 1

d1 = e1 − � + 1(1)

d2 = e2 + �− 1(2)

d3 = e3 − � + 1(3)

l1 = s1 + s2 = d1 + d3 + d2(4)

l2 = s1 + s3 = d1 + d3 + �− 1(5)

l1 = s1 + s2 = e1 + e2 + e3 − (�− 1)(6)

l2 = s1 + s3 = e1 + e3 − (�− 1)(7)

s =

d1 + d3

d2

d2

1

BRIEF ARTICLE

THE AUTHOR

M =

1 11 00 1

k =

641

s =

d1 + d3

d2

d4

=

e1 + e3 − 2(�− 1)e2 + �− 1

�− 1

d1 = e1 − � + 1(1)

d2 = e2 + �− 1(2)

d3 = e3 − � + 1(3)

l1 = s1 + s2 = d1 + d3 + d2(4)

l2 = s1 + s3 = d1 + d3 + �− 1(5)

l1 = s1 + s2 = e1 + e2 + e3 − (�− 1)(6)

l2 = s1 + s3 = e1 + e3 − (�− 1)(7)

s =

d1 + d3

d2

d2

1

BRIEF ARTICLE

THE AUTHOR

M =

1 11 00 1

k =

641

s =

d1 + d3

d2

d4

=

e1 + e3 − 2(�− 1)e2 + �− 1

�− 1

d1 = e1 − � + 1(1)

d2 = e2 + �− 1(2)

d3 = e3 − � + 1(3)

l1 = s1 + s2 = d1 + d3 + d2(4)

l2 = s1 + s3 = d1 + d3 + �− 1(5)

l1 = s1 + s2 = e1 + e2 + e3 − (�− 1)(6)

l2 = s1 + s3 = e1 + e3 − (�− 1)(7)

1

BRIEF ARTICLE

THE AUTHOR

M =

1 11 00 1

k =

641

s =

d1 + d3

d2

�− 1

=

e1 + e3 − 2(�− 1)e2 + �− 1

�− 1

d1 = e1 − � + 1(1)

d2 = e2 + �− 1(2)

d3 = e3 − � + 1(3)

l1 = s1 + s2 = d1 + d3 + d2(4)

l2 = s1 + s3 = d1 + d3 + �− 1(5)

l1 = s1 + s2 = e1 + e2 + e3 − (�− 1)(6)

l2 = s1 + s3 = e1 + e3 − (�− 1)(7)

1

t2

t1

t1 t2

e1 e2 e3

e1 e3

d1 d2 d3

d1 d3d4

A

t1

t2

ε-1

ε-1

t1,t2 t1 t1,t2t1t1,t2 t1,t2

t1t1,t2 t1,t2

Page 31: Transcript expression estimation ... - University of Malta

Heterozygotes and haplo-isoformsB

t1ACG

d1 d3d2 ε-1

t1B

Page 32: Transcript expression estimation ... - University of Malta

Heterozygotes and haplo-isoformsB

t1ACG

d1 d3d2 ε-1

t1B

t1A,t1Bt1A,t1B

t1A,t1B

t1At1A

t1A,t1B

t1Bt1B

Page 33: Transcript expression estimation ... - University of Malta

Same model structure for isoforms and haplo-isoforms

BRIEF ARTICLE

THE AUTHOR

M =

1 11 00 1

k =

641

s =

d1 + d3

d2

d4

=

e1 + e3 − 2(�− 1)e2 + �− 1

�− 1

d1 = e1 − � + 1(1)

d2 = e2 + �− 1(2)

d3 = e3 − � + 1(3)

l1 = s1 + s2 = d1 + d3 + d2(4)

l2 = s1 + s3 = d1 + d3 + �− 1(5)

l1 = s1 + s2 = e1 + e2 + e3 − (�− 1)(6)

l2 = s1 + s3 = e1 + e3 − (�− 1)(7)

s =

d1 + d3

d2

d2

M =

1 11 00 1

k =

422

1

BRIEF ARTICLE

THE AUTHOR

M =

1 11 00 1

k =

641

s =

d1 + d3

d2

d4

=

e1 + e3 − 2(�− 1)e2 + �− 1

�− 1

d1 = e1 − � + 1(1)

d2 = e2 + �− 1(2)

d3 = e3 − � + 1(3)

l1 = s1 + s2 = d1 + d3 + d2(4)

l2 = s1 + s3 = d1 + d3 + �− 1(5)

l1 = s1 + s2 = e1 + e2 + e3 − (�− 1)(6)

l2 = s1 + s3 = e1 + e3 − (�− 1)(7)

s =

d1 + d3

d2

d2

1

BRIEF ARTICLE

THE AUTHOR

M =

1 11 00 1

k =

641

s =

d1 + d3

d2

d4

=

e1 + e3 − 2(�− 1)e2 + �− 1

�− 1

d1 = e1 − � + 1(1)

d2 = e2 + �− 1(2)

d3 = e3 − � + 1(3)

l1 = s1 + s2 = d1 + d3 + d2(4)

l2 = s1 + s3 = d1 + d3 + �− 1(5)

l1 = s1 + s2 = e1 + e2 + e3 − (�− 1)(6)

l2 = s1 + s3 = e1 + e3 − (�− 1)(7)

s =

d1 + d3

d2

d2

1

BRIEF ARTICLE

THE AUTHOR

M =

1 11 00 1

k =

641

s =

d1 + d3

d2

d4

=

e1 + e3 − 2(�− 1)e2 + �− 1

�− 1

d1 = e1 − � + 1(1)

d2 = e2 + �− 1(2)

d3 = e3 − � + 1(3)

l1 = s1 + s2 = d1 + d3 + d2(4)

l2 = s1 + s3 = d1 + d3 + �− 1(5)

l1 = s1 + s2 = e1 + e2 + e3 − (�− 1)(6)

l2 = s1 + s3 = e1 + e3 − (�− 1)(7)

1

BRIEF ARTICLE

THE AUTHOR

M =

1 11 00 1

k =

641

s =

d1 + d3

d2

�− 1

=

e1 + e3 − 2(�− 1)e2 + �− 1

�− 1

d1 = e1 − � + 1(1)

d2 = e2 + �− 1(2)

d3 = e3 − � + 1(3)

l1 = s1 + s2 = d1 + d3 + d2(4)

l2 = s1 + s3 = d1 + d3 + �− 1(5)

l1 = s1 + s2 = e1 + e2 + e3 − (�− 1)(6)

l2 = s1 + s3 = e1 + e3 − (�− 1)(7)

1

t2

t1

t1 t2

e1 e2 e3

e1 e3

d1 d2 d3

d1 d3d4

A

t1

t2

ε-1

ε-1

t1,t2 t1 t1,t2t1t1,t2 t1,t2

t1t1,t2 t1,t2

B

t1ACG

d1 d3d2 ε-1

t1B

t1At1B

t1A,t1Bt1A,t1B

t1A,t1B

t1At1A

t1A,t1B

t1Bt1B

BRIEF ARTICLE

THE AUTHOR

M =

1 11 00 1

k =

641

s =

d1 + d3

d2

d4

=

e1 + e3 − 2(�− 1)e2 + �− 1

�− 1

d1 = e1 − � + 1(1)

d2 = e2 + �− 1(2)

d3 = e3 − � + 1(3)

l1 = s1 + s2 = d1 + d3 + d2(4)

l2 = s1 + s3 = d1 + d3 + �− 1(5)

l1 = s1 + s2 = e1 + e2 + e3 − (�− 1)(6)

l2 = s1 + s3 = e1 + e3 − (�− 1)(7)

s =

d1 + d3

d2

d2

M =

1 11 00 1

k =

422

1

Heterozygotes can be treatedlike alternative exons!

Page 34: Transcript expression estimation ... - University of Malta

Remarks on expression estimation

• Poisson distribution captures the unavoidable variance due tocounting independent events

• The mapping of a read or read pair to a feature can beambiguous

• Deconvolution methods help quantify expression of differentisoforms and even haplotype-specific isoforms

• This really sets RNA-seq and microarrays apart!

Page 35: Transcript expression estimation ... - University of Malta

Normalisation

Normalisation aims to ensure our expression estimates are:

• comparable across features (genes, isoforms, etc)

• comparable across libraries (different samples)

• on a human-friendly scale (interpretable magnitude)

Necessary for valid inference about DE

• between transcripts within samples

• between samples belonging to different biological conditions

Page 36: Transcript expression estimation ... - University of Malta

Normalisation

Normalisation aims to ensure our expression estimates are:

• comparable across features (genes, isoforms, etc)

• comparable across libraries (different samples)

• on a human-friendly scale (interpretable magnitude)

Necessary for valid inference about DE

• between transcripts within samples

• between samples belonging to different biological conditions

Page 37: Transcript expression estimation ... - University of Malta

Basic Poisson model

Number of reads from gene g in library i can be captured by aPoisson model (Marioni et al. 2008):

rig ∼ Poisson(kigµig),

=⇒ E(rig) = kigµig

where µig is the concentration of RNA in the library and kig is anormalisation constant.

µig =rig

kig

Page 38: Transcript expression estimation ... - University of Malta

RPKM normalisation

Normalisation is procedure for setting kig such that the estimates ofµig are comparable between genes and across libraries.

µig =rig

kig

The number of reads rig is roughly proportional to

• the length of the gene, lg• the total number of reads in the library, Ni

Thus it is natural to include them in the normalisation constant.

If kig = 10−9Ni lg, the units of µig are Reads Per Kilobase per Millionmapped reads (RPKM) (Mortazavi et al. 2008).

This is the most elementary form of normalisation.

Page 39: Transcript expression estimation ... - University of Malta

RPKM normalisation

Normalisation is procedure for setting kig such that the estimates ofµig are comparable between genes and across libraries.

µig =rig

kig

The number of reads rig is roughly proportional to

• the length of the gene, lg• the total number of reads in the library, Ni

Thus it is natural to include them in the normalisation constant.

If kig = 10−9Ni lg, the units of µig are Reads Per Kilobase per Millionmapped reads (RPKM) (Mortazavi et al. 2008).

This is the most elementary form of normalisation.

Page 40: Transcript expression estimation ... - University of Malta

RPKM normalisation

Normalisation is procedure for setting kig such that the estimates ofµig are comparable between genes and across libraries.

µig =rig

kig

The number of reads rig is roughly proportional to

• the length of the gene, lg• the total number of reads in the library, Ni

Thus it is natural to include them in the normalisation constant.

If kig = 10−9Ni lg, the units of µig are Reads Per Kilobase per Millionmapped reads (RPKM) (Mortazavi et al. 2008).

This is the most elementary form of normalisation.

Page 41: Transcript expression estimation ... - University of Malta

RPKM normalisation

• RPKM works well for technical and some biological replicates

• µig ' µjg for all libraries i and j

• RPKM units obtained by scaling of counts by N−1i

Log counts

Dens

ity

Log RPKM

Dens

ity

Page 42: Transcript expression estimation ... - University of Malta

Sample to sample normalisation• Between different biological samples, homogeneity

assumption does not hold• Why is this a problem?

Number of reads is limitedE.g. counts from very highly expressed genes leave less realestate available for counts from lowly expressed genes

A B

• Suppose you have two RNApopulations A and B sequenced atsame depth

• A and B are identical except half ofgenes in B are unexpressed in A

• Only ∼ half of reads from B comefrom shared gene set

• Estimates for shared genes differby factor of ∼ 2!

Robinson and Oslack 2010

Page 43: Transcript expression estimation ... - University of Malta

Sample to sample normalisation• Between different biological samples, homogeneity

assumption does not hold• Why is this a problem?

Number of reads is limitedE.g. counts from very highly expressed genes leave less realestate available for counts from lowly expressed genes

A B

• Suppose you have two RNApopulations A and B sequenced atsame depth

• A and B are identical except half ofgenes in B are unexpressed in A

• Only ∼ half of reads from B comefrom shared gene set

• Estimates for shared genes differby factor of ∼ 2!

Robinson and Oslack 2010

Page 44: Transcript expression estimation ... - University of Malta

Sample to sample normalisation• Between different biological samples, homogeneity

assumption does not hold• Why is this a problem?

Number of reads is limitedE.g. counts from very highly expressed genes leave less realestate available for counts from lowly expressed genes

A B

• Suppose you have two RNApopulations A and B sequenced atsame depth

• A and B are identical except half ofgenes in B are unexpressed in A

• Only ∼ half of reads from B comefrom shared gene set

• Estimates for shared genes differby factor of ∼ 2!

Robinson and Oslack 2010

Page 45: Transcript expression estimation ... - University of Malta

Poisson approximation to Binomial• Total RNA output,

∑g µig lg, inversely affects read counts rig

(for fixed µig)• RPKM normalisation assumes implicitly that total RNA output

(unknown) is the same for all libraries:

rig ∼ Binomial(Ni ,

µig lg∑g µig lg

)∼ Poisson

(Ni

µig lg∑g µig lg

)as N → ∞

=⇒ E(rig) = Niµig lg∑g µig lg

• RPKM assumption: ∀i,∑

g µig lg = 109 (so µig =rig

10−9Ni lg)

• Better assumption: output between samples for a core setonly of genes G is similar:

∑g∈G

µig lg =∑

g∈Gµjg lg

Page 46: Transcript expression estimation ... - University of Malta

Poisson approximation to Binomial• Total RNA output,

∑g µig lg, inversely affects read counts rig

(for fixed µig)• RPKM normalisation assumes implicitly that total RNA output

(unknown) is the same for all libraries:

rig ∼ Binomial(Ni ,

µig lg∑g µig lg

)∼ Poisson

(Ni

µig lg∑g µig lg

)as N → ∞

=⇒ E(rig) = Niµig lg∑g µig lg

• RPKM assumption: ∀i,∑

g µig lg = 109 (so µig =rig

10−9Ni lg)

• Better assumption: output between samples for a core setonly of genes G is similar:

∑g∈G

µig lg =∑

g∈Gµjg lg

Page 47: Transcript expression estimation ... - University of Malta

Poisson approximation to Binomial• Total RNA output,

∑g µig lg, inversely affects read counts rig

(for fixed µig)• RPKM normalisation assumes implicitly that total RNA output

(unknown) is the same for all libraries:

rig ∼ Binomial(Ni ,

µig lg∑g µig lg

)∼ Poisson

(Ni

µig lg∑g µig lg

)as N → ∞

=⇒ E(rig) = Niµig lg∑g µig lg

• RPKM assumption: ∀i,∑

g µig lg = 109 (so µig =rig

10−9Ni lg)

• Better assumption: output between samples for a core setonly of genes G is similar:

∑g∈G

µig lg =∑

g∈Gµjg lg

Page 48: Transcript expression estimation ... - University of Malta

TMM normalisationThe naive MLE is proportional to the normalised counts:

µjg =rjg

kjg=

110−9lg

rjg

Nj

If∑

g∈Gµig lg ,

∑g∈G

µjg lg, the MLEs for all genes need to be adjusted.

Calculate scaling factor for sample j relative to reference sample i:∑g∈G

rig

Ni' S(i,j)

∑g∈G

rjg

Nj.

Adjust the MLEs for sample j for all genes:

µjg =rjg

kjg=

rjg

10−9Nj lg· S(i,j).

Robinson and Oslack 2010

Page 49: Transcript expression estimation ... - University of Malta

TMM normalisationThe naive MLE is proportional to the normalised counts:

µjg =rjg

kjg=

110−9lg

rjg

Nj

If∑

g∈Gµig lg ,

∑g∈G

µjg lg, the MLEs for all genes need to be adjusted.

Calculate scaling factor for sample j relative to reference sample i:∑g∈G

rig

Ni' S(i,j)

∑g∈G

rjg

Nj.

Adjust the MLEs for sample j for all genes:

µjg =rjg

kjg=

rjg

10−9Nj lg· S(i,j).

Robinson and Oslack 2010

Page 50: Transcript expression estimation ... - University of Malta

TMM normalisationHow to choose the subset G used to calculate S(i,j)?

• For pair of libraries (i, j) determine log fold change ofnormalised counts

M(i,j)g = log

rig

Ni− log

rjg

Nj.

• and the mean of the log normalised counts

A (i,j)g =

12

[log

rig

Ni+ log

rjg

Nj

].

• Set G to genes remaining after trimming upper and lower x%of the {Ag} and {Mg}. I.e. genes in G have unexceptional

values of A (i,j)g and M(i,j)

g

Robinson and Oslack 2010

Page 51: Transcript expression estimation ... - University of Malta

TMM normalisationHow to choose the subset G used to calculate S(i,j)?

• For pair of libraries (i, j) determine log fold change ofnormalised counts

M(i,j)g = log

rig

Ni− log

rjg

Nj.

• and the mean of the log normalised counts

A (i,j)g =

12

[log

rig

Ni+ log

rjg

Nj

].

• Set G to genes remaining after trimming upper and lower x%of the {Ag} and {Mg}. I.e. genes in G have unexceptional

values of A (i,j)g and M(i,j)

g

Robinson and Oslack 2010

Page 52: Transcript expression estimation ... - University of Malta

TMM normalisationHow to choose the subset G used to calculate S(i,j)?

• For pair of libraries (i, j) determine log fold change ofnormalised counts

M(i,j)g = log

rig

Ni− log

rjg

Nj.

• and the mean of the log normalised counts

A (i,j)g =

12

[log

rig

Ni+ log

rjg

Nj

].

• Set G to genes remaining after trimming upper and lower x%of the {Ag} and {Mg}. I.e. genes in G have unexceptional

values of A (i,j)g and M(i,j)

g

Robinson and Oslack 2010

Page 53: Transcript expression estimation ... - University of Malta

TMM normalisation (with edgeR)• Compute summary of {M(i,j)

g } for genes in G (weighted mean)

• Let S(i,j) be the exponential of this summary• Adjust µjg by a factor of S(i,j) for all genes g

Other datasetsThe global shift in log-fold-change caused by RNA com-position differences occurs at varying degrees in otherRNA-seq datasets. For example, an M versus A plot forthe Cloonan et al. [12] dataset (Figure S3 in Additionalfile 1) gives an estimated TMM scaling factor of 1.04between the two samples (embryoid bodies versusembryonic stem cells), sequenced on the SOLiD™ sys-tem. The M versus A plot for this dataset also highlightsan interesting set of genes that have lower overall

expression, but higher in embryoid bodies. This explainsthe positive shift in log-fold-changes for the remaininggenes. The TMM scale factor appears close to the med-ian log-fold-changes amongst a set of approximately 500mouse housekeeping genes (from [17]). As anotherexample, the Li et al. [18] dataset, using the llumina 1GGenome Analyzer, exhibits a shift in the overall distri-bution of log-fold-changes and gives a TMM scaling fac-tor of 0.904 (Figure S4 in Additional file 1). However,there are sequencing-based datasets that have quitesimilar RNA outputs and may not need a significantadjustment. For example, the small-RNA-seq data fromKuchenbauer et al. [19] exhibits only a modest bias inthe log-fold-changes (Figure S5 in Additional file 1).Spike-in controls have the potential to be used for

normalization. In this scenario, small but knownamounts of RNA from a foreign organism are added toeach sample at a specified concentration. In order touse spike-in controls for normalization, the ratio of theconcentration of the spike to the sample must be keptconstant throughout the experiment. In practice, this isdifficult to achieve and small variations will lead tobiased estimation of the normalization factor. For exam-ple, using the spiked-in DNA from the Mortazavi et al.data set [11] would lead to unrealistic normalization fac-tor estimates (Figure S6 in Additional file 1). As with

Figure 1 Normalization is required for RNA-seq data. Data from [6] comparing log ratios of (a) technical replicates and (b) liver versuskidney expression levels, after adjusting for the total number of reads in each sample. The green line shows the smoothed distribution of log-fold-changes of the housekeeping genes. (c) An M versus A plot comparing liver and kidney shows a clear offset from zero. Green pointsindicate 545 housekeeping genes, while the green line signifies the median log-ratio of the housekeeping genes. The red line shows theestimated TMM normalization factor. The smear of orange points highlights the genes that were observed in only one of the liver or kidneytissues. The black arrow highlights the set of prominent genes that are largely attributable for the overall bias in log-fold-changes.

Table 1 Number of genes called differentially expressedbetween liver and kidney at a false discovery rate <0.001using different normalization methods

Library sizenormalization

TMMnormalization

Overlap

Higher in liver 2,355 4,293 2,355

Higher inkidney

8,332 4,935 4,935

Total 10,867 9,228 7,290

House keepinggenes (545)

Higher in liver 45 137 45

Higher inkidney

376 220 220

Total 421 357 265

TMM, trimmed mean of M values.

Robinson and Oshlack Genome Biology 2010, 11:R25http://genomebiology.com/2010/11/3/R25

Page 4 of 9

Robinson and Oslack 2010

Page 54: Transcript expression estimation ... - University of Malta

Median log deviation normalisation (with DESeq)

An alternative normalisation provided in DESeq package

• For each gene g in sample i, calculate deviation of log rig fromthe mean log rig over all libraries: dig = log rig −

1I∑

i log rig.

• Calculate median over all genes: log S(i) = mediani(dig)

• Adjust µig by a factor of S(i) for all genes g

edgeR and DESeq are both robust across genes (weighted meanof core set vs. median of all genes)

Call Ni =NiSi

the “adjusted library size”.

Anders and Huber 2010

Page 55: Transcript expression estimation ... - University of Malta

Median log deviation normalisation (with DESeq)

An alternative normalisation provided in DESeq package

• For each gene g in sample i, calculate deviation of log rig fromthe mean log rig over all libraries: dig = log rig −

1I∑

i log rig.

• Calculate median over all genes: log S(i) = mediani(dig)

• Adjust µig by a factor of S(i) for all genes g

edgeR and DESeq are both robust across genes (weighted meanof core set vs. median of all genes)

Call Ni =NiSi

the “adjusted library size”.

Anders and Huber 2010

Page 56: Transcript expression estimation ... - University of Malta

Normalisation between genes

• So far we have looked at library-level scaling to make theexpression of a given gene comparable across libraries

• In other words, we have been seeking to account for factorsaffecting all genes in a library similarly

• Are there factors affecting different genes differently?

• Recall normalisation equation:

µig =rig

kig

Consider the decomposition of kig = kkikg

• k : global scaling to get more convenient units. E.g. 10−9.

• ki : library-specific normalisation factors. E.g. Ni = Ni/S(i)

• kg: gene-specific normalisation factors. E.g. lg

Page 57: Transcript expression estimation ... - University of Malta

Normalisation between genes

• So far we have looked at library-level scaling to make theexpression of a given gene comparable across libraries

• In other words, we have been seeking to account for factorsaffecting all genes in a library similarly

• Are there factors affecting different genes differently?

• Recall normalisation equation:

µig =rig

kig

Consider the decomposition of kig = kkikg

• k : global scaling to get more convenient units. E.g. 10−9.

• ki : library-specific normalisation factors. E.g. Ni = Ni/S(i)

• kg: gene-specific normalisation factors. E.g. lg

Page 58: Transcript expression estimation ... - University of Malta

Normalisation between genes

• So far we have looked at library-level scaling to make theexpression of a given gene comparable across libraries

• In other words, we have been seeking to account for factorsaffecting all genes in a library similarly

• Are there factors affecting different genes differently?

• Recall normalisation equation:

µig =rig

kig

Consider the decomposition of kig = kkikg

• k : global scaling to get more convenient units. E.g. 10−9.

• ki : library-specific normalisation factors. E.g. Ni = Ni/S(i)

• kg: gene-specific normalisation factors. E.g. lg

Page 59: Transcript expression estimation ... - University of Malta

Normalisation between genesWhere does the lg factor come from anyway?

Underlying assumption: constant Poisson rate across bases.

μg μg μg μg μg μg μg μg μg μg 00 0 0

lg

rigp ∼ Pois(kkiµg)

rig =

lg∑p=1

rigp

rig ∼ Pois(kki

lg∑p=1

µg)

∼ Pois(kki lgµg)

∼ Pois(10−9Ni lgµig)

Page 60: Transcript expression estimation ... - University of Malta

Normalisation between genesWhere does the lg factor come from anyway?

Underlying assumption: constant Poisson rate across bases.

μg μg μg μg μg μg μg μg μg μg 00 0 0

lg

rigp ∼ Pois(kkiµg)

rig =

lg∑p=1

rigp

rig ∼ Pois(kki

lg∑p=1

µg)

∼ Pois(kki lgµg)

∼ Pois(10−9Ni lgµig)

Page 61: Transcript expression estimation ... - University of Malta

Normalisation between genes

−20 −10 0 10 20

−0.

20.

00.

20.

40.

6

position

coef

ficie

nts

3125_2 coefficients, red−T, green−A, blue−C, black−G

● ●

● ●

● ● ● ●

●●

●●

●●

●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●

● ●●

●●

● ●

●●

●●

●●

●●

●●

●●

●●

●●

●● ●

● ● ●

● ●

● ●

●●

● ●

−20 −10 0 10 20

−0.

20.

00.

20.

4

position

coef

ficie

nts

3125_7 coefficients, red−T, green−A, blue−C, black−G

● ●●

●●

● ● ●●

● ●

●●

● ●●

● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●

●●

● ●

● ●

●●

●●

●●

●●

● ●

●●

● ●

●●

● ● ●●

● ●

● ●

● ●

● ●

● ●

● ●

−20 −10 0 10 20

−0.

4−

0.2

0.0

0.2

0.4

0.6

position

coef

ficie

nts

3122_7 coefficients, red−T, green−A, blue−C, black−G

● ●● ●

● ●

●●

● ●

●●

● ●

● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ●

●●

● ●

● ●●

● ●

● ●●

● ●

● ●

●● ●

● ●

● ●

●●

● ●●

● ●

● ●

● ●

There are in fact localsequence-specific biases (Li et al.2010, Hansen et al. 2010) (non-randomamplification?).

This suggests a variable-rate modelwith weights αgp :

αg1μg

lg

0 0αg2μgαg3μgαg4μgαg5μgαg6μgαg7μgαg8μg

rig ∼ Pois(kki

lg∑p=1

αgpµig)

∼ Pois(kki lgµig)

∼ Pois(10−9Ni lgµig)

Page 62: Transcript expression estimation ... - University of Malta

Accounting for sequencing biases with mseq

−0.6 −0.4 −0.2 0.0 0.2 0.4

−2

−1

01

23

Log FC transcript length

Log

FC

exp

ress

ion

Page 63: Transcript expression estimation ... - University of Malta

Normalisation between genes (adjust for insert sizedistro)

lt = 6

lf = 3: 4 positions

lf = 4: 3 positions

lf = 5: 2 positions

lf = 6: 1 position

lr = 2

lf = 2: 5 positions

lt =lt∑

lf=lr

p(lf |lt)(lt − lf + 1)

(assuming each position equallylikely)

lt =lt∑

lf=lr

p(lf |lt)lt−lf+1∑

p=1

α(p, t , lf )

(weight α(p, t , lf ) for fragments oflength lf at position p, transcript t)

If pre-selection fragments roughlyuniform up to lt within insert sizedistribution, then p(lf |lt) ' p(lf )

Glaus et al 2012

Page 64: Transcript expression estimation ... - University of Malta

Normalisation between genes (adjust for insert sizedistro)

lt = 6

lf = 3: 4 positions

lf = 4: 3 positions

lf = 5: 2 positions

lf = 6: 1 position

lr = 2

lf = 2: 5 positions

lt =lt∑

lf=lr

p(lf |lt)(lt − lf + 1)

(assuming each position equallylikely)

lt =lt∑

lf=lr

p(lf |lt)lt−lf+1∑

p=1

α(p, t , lf )

(weight α(p, t , lf ) for fragments oflength lf at position p, transcript t)

If pre-selection fragments roughlyuniform up to lt within insert sizedistribution, then p(lf |lt) ' p(lf )

Glaus et al 2012

Page 65: Transcript expression estimation ... - University of Malta

Normalisation between genes (adjust for insert sizedistro)

lt = 6

lf = 3: 4 positions

lf = 4: 3 positions

lf = 5: 2 positions

lf = 6: 1 position

lr = 2

lf = 2: 5 positions

lt =lt∑

lf=lr

p(lf |lt)(lt − lf + 1)

(assuming each position equallylikely)

lt =lt∑

lf=lr

p(lf |lt)lt−lf+1∑

p=1

α(p, t , lf )

(weight α(p, t , lf ) for fragments oflength lf at position p, transcript t)

If pre-selection fragments roughlyuniform up to lt within insert sizedistribution, then p(lf |lt) ' p(lf )

Glaus et al 2012

Page 66: Transcript expression estimation ... - University of Malta

Normalisation between genes (adjust for insert sizedistro)

lt = 6

lf = 3: 4 positions

lf = 4: 3 positions

lf = 5: 2 positions

lf = 6: 1 position

lr = 2

lf = 2: 5 positions

lt =lt∑

lf=lr

p(lf |lt)(lt − lf + 1)

(assuming each position equallylikely)

lt =lt∑

lf=lr

p(lf |lt)lt−lf+1∑

p=1

α(p, t , lf )

(weight α(p, t , lf ) for fragments oflength lf at position p, transcript t)

If pre-selection fragments roughlyuniform up to lt within insert sizedistribution, then p(lf |lt) ' p(lf )

Glaus et al 2012

Page 67: Transcript expression estimation ... - University of Malta

Differential expression

We have obtained library and gene specific normalisation factors tomake counts/concentration estimates as comparable as possible.

This allows us to:

• obtain reasonably unbiased log fold changes between twogroups of samples

• obtain p-values under the null hypothesis of no differentialexpression

Recall hypothesis testing:

• define a function of the data, T (the test statistic)

• derive distribution of T under the null (e.g. no DE)

• define critical regions of T

• compute observed value t from actual data

• reject null if t is in a critical region

Page 68: Transcript expression estimation ... - University of Malta

Differential expression

We have obtained library and gene specific normalisation factors tomake counts/concentration estimates as comparable as possible.

This allows us to:

• obtain reasonably unbiased log fold changes between twogroups of samples

• obtain p-values under the null hypothesis of no differentialexpression

Recall hypothesis testing:

• define a function of the data, T (the test statistic)

• derive distribution of T under the null (e.g. no DE)

• define critical regions of T

• compute observed value t from actual data

• reject null if t is in a critical region

Page 69: Transcript expression estimation ... - University of Malta

Differential expression

Option 1. H0 : µ1g = µ2g

vs. ?

Option 2. H0 : µ1g = µ2g = µ3g = µ(A)g = µ4g = µ5g = µ6g = µ

(B)g

vs. ?

Both options are inadequate!

µig is the RNA concentration parameter for library i, which variesacross biological replicates.

Page 70: Transcript expression estimation ... - University of Malta

Differential expression

Option 1. H0 : µ1g = µ2g

vs. ?

Option 2. H0 : µ1g = µ2g = µ3g = µ(A)g = µ4g = µ5g = µ6g = µ

(B)g

vs. ?

Both options are inadequate!

µig is the RNA concentration parameter for library i, which variesacross biological replicates.

Page 71: Transcript expression estimation ... - University of Malta

Differential expression

Option 1. H0 : µ1g = µ2g

vs. ?

Option 2. H0 : µ1g = µ2g = µ3g = µ(A)g = µ4g = µ5g = µ6g = µ

(B)g

vs. ?

Both options are inadequate!

µig is the RNA concentration parameter for library i, which variesacross biological replicates.

Page 72: Transcript expression estimation ... - University of Malta

Negative binomial distribution

Negative binomial distribution

k3gµ3gDi↵erential expression

Option 1. H0 : µ1g = µ2g

vs. ?

Option 2. H0 : µ1g = µ2g = µ3g = µ(A)g , µ4g = µ5g = µ6g = µ

(B)g

vs. ?

Both options are inadequate!

µig is the RNA concentration parameter for library i , which variesacross biological replicates with some mean and variance.

Di↵erential expression

Option 1. H0 : µ1g = µ2g

vs. ?

Option 2. H0 : µ1g = µ2g = µ3g = µ(A)g , µ4g = µ5g = µ6g = µ

(B)g

vs. ?

Both options are inadequate!

µig is the RNA concentration parameter for library i , which variesacross biological replicates with some mean and variance.

Negative binomial distribution

µ(A)g �

2(A)g

vs. ?

Option 2. H0 : µ1g = µ2g = µ3g = µ(A)g , µ4g = µ5g = µ6g = µ

(B)g

Both options are inadequate!

µig is the RNA concentration parameter for library i , which variesacross biological replicates with some mean and variance.

Negative binomial distribution

µ(A)g �

2(A)g

vs. ?

Option 2. H0 : µ1g = µ2g = µ3g = µ(A)g , µ4g = µ5g = µ6g = µ

(B)g

Both options are inadequate!

µig is the RNA concentration parameter for library i , which variesacross biological replicates with some mean and variance.

Gamma

Poisson Poisson PoissonNegative binomial distribution

r1g r2g r3g

vs. ?

Option 2. H0 : µ1g = µ2g = µ3g = µ(A)g , µ4g = µ5g = µ6g = µ

(B)g

Both options are inadequate!

µig is the RNA concentration parameter for library i , which variesacross biological replicates with some mean and variance.

Negative binomial distribution

r1g r2g r3g

vs. ?

Option 2. H0 : µ1g = µ2g = µ3g = µ(A)g , µ4g = µ5g = µ6g = µ

(B)g

Both options are inadequate!

µig is the RNA concentration parameter for library i , which variesacross biological replicates with some mean and variance.

Negative binomial distribution

r1g r2g r3g

vs. ?

Option 2. H0 : µ1g = µ2g = µ3g = µ(A)g , µ4g = µ5g = µ6g = µ

(B)g

Both options are inadequate!

µig is the RNA concentration parameter for library i , which variesacross biological replicates with some mean and variance.

Di↵erential expression

Option 1. H0 : µ1g = µ2g

vs. ?

Option 2. H0 : µ1g = µ2g = µ3g = µ(A)g , µ4g = µ5g = µ6g = µ

(B)g

vs. ?

Both options are inadequate!

µig is the RNA concentration parameter for library i , which variesacross biological replicates with some mean and variance.

Negative binomial distribution

k2gµ2gDi↵erential expression

Option 1. H0 : µ1g = µ2g

vs. ?

Option 2. H0 : µ1g = µ2g = µ3g = µ(A)g , µ4g = µ5g = µ6g = µ

(B)g

vs. ?

Both options are inadequate!

µig is the RNA concentration parameter for library i , which variesacross biological replicates with some mean and variance.

Di↵erential expression

Option 1. H0 : µ1g = µ2g

vs. ?

Option 2. H0 : µ1g = µ2g = µ3g = µ(A)g , µ4g = µ5g = µ6g = µ

(B)g

vs. ?

Both options are inadequate!

µig is the RNA concentration parameter for library i , which variesacross biological replicates with some mean and variance.

Negative binomial distribution

µ(A)g �

2(A)g

vs. ?

Option 2. H0 : µ1g = µ2g = µ3g = µ(A)g , µ4g = µ5g = µ6g = µ

(B)g

Both options are inadequate!

µig is the RNA concentration parameter for library i , which variesacross biological replicates with some mean and variance.

Negative binomial distribution

µ(A)g �

2(A)g

vs. ?

Option 2. H0 : µ1g = µ2g = µ3g = µ(A)g , µ4g = µ5g = µ6g = µ

(B)g

Both options are inadequate!

µig is the RNA concentration parameter for library i , which variesacross biological replicates with some mean and variance.

Gamma

Poisson Poisson PoissonNegative binomial distribution

r1g r2g r3g

vs. ?

Option 2. H0 : µ1g = µ2g = µ3g = µ(A)g , µ4g = µ5g = µ6g = µ

(B)g

Both options are inadequate!

µig is the RNA concentration parameter for library i , which variesacross biological replicates with some mean and variance.

Negative binomial distribution

r1g r2g r3g

vs. ?

Option 2. H0 : µ1g = µ2g = µ3g = µ(A)g , µ4g = µ5g = µ6g = µ

(B)g

Both options are inadequate!

µig is the RNA concentration parameter for library i , which variesacross biological replicates with some mean and variance.

Negative binomial distribution

r1g r2g r3g

vs. ?

Option 2. H0 : µ1g = µ2g = µ3g = µ(A)g , µ4g = µ5g = µ6g = µ

(B)g

Both options are inadequate!

µig is the RNA concentration parameter for library i , which variesacross biological replicates with some mean and variance.

Di↵erential expression

Option 1. H0 : µ1g = µ2g

vs. ?

Option 2. H0 : µ1g = µ2g = µ3g = µ(A)g , µ4g = µ5g = µ6g = µ

(B)g

vs. ?

Both options are inadequate!

µig is the RNA concentration parameter for library i , which variesacross biological replicates with some mean and variance.

Negative binomial distribution

k1gµ1gDi↵erential expression

Option 1. H0 : µ1g = µ2g

vs. ?

Option 2. H0 : µ1g = µ2g = µ3g = µ(A)g , µ4g = µ5g = µ6g = µ

(B)g

vs. ?

Both options are inadequate!

µig is the RNA concentration parameter for library i , which variesacross biological replicates with some mean and variance.

Di↵erential expression

Option 1. H0 : µ1g = µ2g

vs. ?

Option 2. H0 : µ1g = µ2g = µ3g = µ(A)g , µ4g = µ5g = µ6g = µ

(B)g

vs. ?

Both options are inadequate!

µig is the RNA concentration parameter for library i , which variesacross biological replicates with some mean and variance.

Negative binomial distribution

µ(A)g �

2(A)g

vs. ?

Option 2. H0 : µ1g = µ2g = µ3g = µ(A)g , µ4g = µ5g = µ6g = µ

(B)g

Both options are inadequate!

µig is the RNA concentration parameter for library i , which variesacross biological replicates with some mean and variance.

Negative binomial distribution

µ(A)g �

2(A)g

vs. ?

Option 2. H0 : µ1g = µ2g = µ3g = µ(A)g , µ4g = µ5g = µ6g = µ

(B)g

Both options are inadequate!

µig is the RNA concentration parameter for library i , which variesacross biological replicates with some mean and variance.

Gamma

Poisson Poisson PoissonNegative binomial distribution

r1g r2g r3g

vs. ?

Option 2. H0 : µ1g = µ2g = µ3g = µ(A)g , µ4g = µ5g = µ6g = µ

(B)g

Both options are inadequate!

µig is the RNA concentration parameter for library i , which variesacross biological replicates with some mean and variance.

Negative binomial distribution

r1g r2g r3g

vs. ?

Option 2. H0 : µ1g = µ2g = µ3g = µ(A)g , µ4g = µ5g = µ6g = µ

(B)g

Both options are inadequate!

µig is the RNA concentration parameter for library i , which variesacross biological replicates with some mean and variance.

Negative binomial distribution

r1g r2g r3g

vs. ?

Option 2. H0 : µ1g = µ2g = µ3g = µ(A)g , µ4g = µ5g = µ6g = µ

(B)g

Both options are inadequate!

µig is the RNA concentration parameter for library i , which variesacross biological replicates with some mean and variance.

Di↵erential expression

Option 1. H0 : µ1g = µ2g

vs. ?

Option 2. H0 : µ1g = µ2g = µ3g = µ(A)g , µ4g = µ5g = µ6g = µ

(B)g

vs. ?

Both options are inadequate!

µig is the RNA concentration parameter for library i , which variesacross biological replicates with some mean and variance.

Negative binomial distribution

µ(A)g �

2(A)g

vs. ?

Option 2. H0 : µ1g = µ2g = µ3g = µ(A)g , µ4g = µ5g = µ6g = µ

(B)g

Both options are inadequate!

µig is the RNA concentration parameter for library i , which variesacross biological replicates with some mean and variance.

Negative binomial distribution

µ(A)g �

2(A)g

vs. ?

Option 2. H0 : µ1g = µ2g = µ3g = µ(A)g , µ4g = µ5g = µ6g = µ

(B)g

Both options are inadequate!

µig is the RNA concentration parameter for library i , which variesacross biological replicates with some mean and variance.

Gamma

Poisson Poisson PoissonNegative binomial distribution

r1g r2g r3g

vs. ?

Option 2. H0 : µ1g = µ2g = µ3g = µ(A)g , µ4g = µ5g = µ6g = µ

(B)g

Both options are inadequate!

µig is the RNA concentration parameter for library i , which variesacross biological replicates with some mean and variance.

Negative binomial distribution

r1g r2g r3g

vs. ?

Option 2. H0 : µ1g = µ2g = µ3g = µ(A)g , µ4g = µ5g = µ6g = µ

(B)g

Both options are inadequate!

µig is the RNA concentration parameter for library i , which variesacross biological replicates with some mean and variance.

Negative binomial distribution

r1g r2g r3g

vs. ?

Option 2. H0 : µ1g = µ2g = µ3g = µ(A)g , µ4g = µ5g = µ6g = µ

(B)g

Both options are inadequate!

µig is the RNA concentration parameter for library i , which variesacross biological replicates with some mean and variance.

If the rate parameter of the Poisson distribution is not fixed, butvaries according to a Gamma distribution, then the counts follow anegative binomial distribution.

Unlike the Poisson, the variance is greater than the mean.

Page 73: Transcript expression estimation ... - University of Malta

Negative binomial distribution

Negative binomial distribution

k3gµ3gDi↵erential expression

Option 1. H0 : µ1g = µ2g

vs. ?

Option 2. H0 : µ1g = µ2g = µ3g = µ(A)g , µ4g = µ5g = µ6g = µ

(B)g

vs. ?

Both options are inadequate!

µig is the RNA concentration parameter for library i , which variesacross biological replicates with some mean and variance.

Di↵erential expression

Option 1. H0 : µ1g = µ2g

vs. ?

Option 2. H0 : µ1g = µ2g = µ3g = µ(A)g , µ4g = µ5g = µ6g = µ

(B)g

vs. ?

Both options are inadequate!

µig is the RNA concentration parameter for library i , which variesacross biological replicates with some mean and variance.

Negative binomial distribution

µ(A)g �

2(A)g

vs. ?

Option 2. H0 : µ1g = µ2g = µ3g = µ(A)g , µ4g = µ5g = µ6g = µ

(B)g

Both options are inadequate!

µig is the RNA concentration parameter for library i , which variesacross biological replicates with some mean and variance.

Negative binomial distribution

µ(A)g �

2(A)g

vs. ?

Option 2. H0 : µ1g = µ2g = µ3g = µ(A)g , µ4g = µ5g = µ6g = µ

(B)g

Both options are inadequate!

µig is the RNA concentration parameter for library i , which variesacross biological replicates with some mean and variance.

Gamma

Poisson Poisson PoissonNegative binomial distribution

r1g r2g r3g

vs. ?

Option 2. H0 : µ1g = µ2g = µ3g = µ(A)g , µ4g = µ5g = µ6g = µ

(B)g

Both options are inadequate!

µig is the RNA concentration parameter for library i , which variesacross biological replicates with some mean and variance.

Negative binomial distribution

r1g r2g r3g

vs. ?

Option 2. H0 : µ1g = µ2g = µ3g = µ(A)g , µ4g = µ5g = µ6g = µ

(B)g

Both options are inadequate!

µig is the RNA concentration parameter for library i , which variesacross biological replicates with some mean and variance.

Negative binomial distribution

r1g r2g r3g

vs. ?

Option 2. H0 : µ1g = µ2g = µ3g = µ(A)g , µ4g = µ5g = µ6g = µ

(B)g

Both options are inadequate!

µig is the RNA concentration parameter for library i , which variesacross biological replicates with some mean and variance.

Di↵erential expression

Option 1. H0 : µ1g = µ2g

vs. ?

Option 2. H0 : µ1g = µ2g = µ3g = µ(A)g , µ4g = µ5g = µ6g = µ

(B)g

vs. ?

Both options are inadequate!

µig is the RNA concentration parameter for library i , which variesacross biological replicates with some mean and variance.

Negative binomial distribution

k2gµ2gDi↵erential expression

Option 1. H0 : µ1g = µ2g

vs. ?

Option 2. H0 : µ1g = µ2g = µ3g = µ(A)g , µ4g = µ5g = µ6g = µ

(B)g

vs. ?

Both options are inadequate!

µig is the RNA concentration parameter for library i , which variesacross biological replicates with some mean and variance.

Di↵erential expression

Option 1. H0 : µ1g = µ2g

vs. ?

Option 2. H0 : µ1g = µ2g = µ3g = µ(A)g , µ4g = µ5g = µ6g = µ

(B)g

vs. ?

Both options are inadequate!

µig is the RNA concentration parameter for library i , which variesacross biological replicates with some mean and variance.

Negative binomial distribution

µ(A)g �

2(A)g

vs. ?

Option 2. H0 : µ1g = µ2g = µ3g = µ(A)g , µ4g = µ5g = µ6g = µ

(B)g

Both options are inadequate!

µig is the RNA concentration parameter for library i , which variesacross biological replicates with some mean and variance.

Negative binomial distribution

µ(A)g �

2(A)g

vs. ?

Option 2. H0 : µ1g = µ2g = µ3g = µ(A)g , µ4g = µ5g = µ6g = µ

(B)g

Both options are inadequate!

µig is the RNA concentration parameter for library i , which variesacross biological replicates with some mean and variance.

Gamma

Poisson Poisson PoissonNegative binomial distribution

r1g r2g r3g

vs. ?

Option 2. H0 : µ1g = µ2g = µ3g = µ(A)g , µ4g = µ5g = µ6g = µ

(B)g

Both options are inadequate!

µig is the RNA concentration parameter for library i , which variesacross biological replicates with some mean and variance.

Negative binomial distribution

r1g r2g r3g

vs. ?

Option 2. H0 : µ1g = µ2g = µ3g = µ(A)g , µ4g = µ5g = µ6g = µ

(B)g

Both options are inadequate!

µig is the RNA concentration parameter for library i , which variesacross biological replicates with some mean and variance.

Negative binomial distribution

r1g r2g r3g

vs. ?

Option 2. H0 : µ1g = µ2g = µ3g = µ(A)g , µ4g = µ5g = µ6g = µ

(B)g

Both options are inadequate!

µig is the RNA concentration parameter for library i , which variesacross biological replicates with some mean and variance.

Di↵erential expression

Option 1. H0 : µ1g = µ2g

vs. ?

Option 2. H0 : µ1g = µ2g = µ3g = µ(A)g , µ4g = µ5g = µ6g = µ

(B)g

vs. ?

Both options are inadequate!

µig is the RNA concentration parameter for library i , which variesacross biological replicates with some mean and variance.

Negative binomial distribution

k1gµ1gDi↵erential expression

Option 1. H0 : µ1g = µ2g

vs. ?

Option 2. H0 : µ1g = µ2g = µ3g = µ(A)g , µ4g = µ5g = µ6g = µ

(B)g

vs. ?

Both options are inadequate!

µig is the RNA concentration parameter for library i , which variesacross biological replicates with some mean and variance.

Di↵erential expression

Option 1. H0 : µ1g = µ2g

vs. ?

Option 2. H0 : µ1g = µ2g = µ3g = µ(A)g , µ4g = µ5g = µ6g = µ

(B)g

vs. ?

Both options are inadequate!

µig is the RNA concentration parameter for library i , which variesacross biological replicates with some mean and variance.

Negative binomial distribution

µ(A)g �

2(A)g

vs. ?

Option 2. H0 : µ1g = µ2g = µ3g = µ(A)g , µ4g = µ5g = µ6g = µ

(B)g

Both options are inadequate!

µig is the RNA concentration parameter for library i , which variesacross biological replicates with some mean and variance.

Negative binomial distribution

µ(A)g �

2(A)g

vs. ?

Option 2. H0 : µ1g = µ2g = µ3g = µ(A)g , µ4g = µ5g = µ6g = µ

(B)g

Both options are inadequate!

µig is the RNA concentration parameter for library i , which variesacross biological replicates with some mean and variance.

Gamma

Poisson Poisson PoissonNegative binomial distribution

r1g r2g r3g

vs. ?

Option 2. H0 : µ1g = µ2g = µ3g = µ(A)g , µ4g = µ5g = µ6g = µ

(B)g

Both options are inadequate!

µig is the RNA concentration parameter for library i , which variesacross biological replicates with some mean and variance.

Negative binomial distribution

r1g r2g r3g

vs. ?

Option 2. H0 : µ1g = µ2g = µ3g = µ(A)g , µ4g = µ5g = µ6g = µ

(B)g

Both options are inadequate!

µig is the RNA concentration parameter for library i , which variesacross biological replicates with some mean and variance.

Negative binomial distribution

r1g r2g r3g

vs. ?

Option 2. H0 : µ1g = µ2g = µ3g = µ(A)g , µ4g = µ5g = µ6g = µ

(B)g

Both options are inadequate!

µig is the RNA concentration parameter for library i , which variesacross biological replicates with some mean and variance.

Di↵erential expression

Option 1. H0 : µ1g = µ2g

vs. ?

Option 2. H0 : µ1g = µ2g = µ3g = µ(A)g , µ4g = µ5g = µ6g = µ

(B)g

vs. ?

Both options are inadequate!

µig is the RNA concentration parameter for library i , which variesacross biological replicates with some mean and variance.

Negative binomial distribution

µ(A)g �

2(A)g

vs. ?

Option 2. H0 : µ1g = µ2g = µ3g = µ(A)g , µ4g = µ5g = µ6g = µ

(B)g

Both options are inadequate!

µig is the RNA concentration parameter for library i , which variesacross biological replicates with some mean and variance.

Negative binomial distribution

µ(A)g �

2(A)g

vs. ?

Option 2. H0 : µ1g = µ2g = µ3g = µ(A)g , µ4g = µ5g = µ6g = µ

(B)g

Both options are inadequate!

µig is the RNA concentration parameter for library i , which variesacross biological replicates with some mean and variance.

Gamma

Poisson Poisson PoissonNegative binomial distribution

r1g r2g r3g

vs. ?

Option 2. H0 : µ1g = µ2g = µ3g = µ(A)g , µ4g = µ5g = µ6g = µ

(B)g

Both options are inadequate!

µig is the RNA concentration parameter for library i , which variesacross biological replicates with some mean and variance.

Negative binomial distribution

r1g r2g r3g

vs. ?

Option 2. H0 : µ1g = µ2g = µ3g = µ(A)g , µ4g = µ5g = µ6g = µ

(B)g

Both options are inadequate!

µig is the RNA concentration parameter for library i , which variesacross biological replicates with some mean and variance.

Negative binomial distribution

r1g r2g r3g

vs. ?

Option 2. H0 : µ1g = µ2g = µ3g = µ(A)g , µ4g = µ5g = µ6g = µ

(B)g

Both options are inadequate!

µig is the RNA concentration parameter for library i , which variesacross biological replicates with some mean and variance.

If the rate parameter of the Poisson distribution is not fixed, butvaries according to a Gamma distribution, then the counts follow anegative binomial distribution.

Unlike the Poisson, the variance is greater than the mean.

Page 74: Transcript expression estimation ... - University of Malta

Negative binomial distribution

0 20 40 60 80 100

0.00

0.02

0.04

0.06

Gamma distribution

x

f(x)

0 20 40 60 80 100

0.00

0.04

0.08

0.12

Poisson distribution

x

f(x)

● ●

●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●

● ●

●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ●

● ●

●●

● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ●●

● ●

●●

● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ●●

●● ●

●●

● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ●●

●● ●

●●

●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ●

●●

●●

● ●●

●●

●●

● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●●

●●

●●

● ● ●●

●●

●●

● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●●

●●

●●

● ● ● ●●

●●

●●

●●

●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●

●●

●●

●●

●● ● ● ●

●●

●●

●●

●●

●●

●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●

●●

●●

●●

●●

●●

● ● ● ●●

●●

●●

●●

●●

●●

● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●●

●●

●●

●●

●●

●●

● ● ● ● ●●

●●

●●

●●

●●

●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●

●●

●●

●●

●●

●●

● ● ● ● ● ●●

●●

●●

●●

●●

●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●

●●

●●

●●

●●

●●

● ● ● ● ● ●●

●●

●●

●●

●●

●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●

●●

●●

●●

●●

●●

● ● ● ● ● ● ●●

●●

●●

●●

●●

● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●●

●●

●●

●●

●●

● ● ● ● ● ● ● ●●

●●

●●

●●

●●

● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●●

●●

●●

●●

●●

● ● ● ● ● ● ● ●●

●●

●●

●●

●●

● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●●

●●

●●

●●

●●

● ● ● ● ● ● ● ● ●●

●●

●●

●●

●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●

●●

●●

●●

●●

● ● ● ● ● ● ● ● ● ●●

●●

●●

●●

●● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●

●●

●●

●●

●●

● ● ● ● ● ● ● ● ● ● ●●

●●

●●

●● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●

●●

●●

●●

●● ● ● ● ● ● ● ● ● ● ● ●

●●

●●

●●

● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●

●●

●●

●●

●● ● ● ● ● ● ● ● ● ● ● ● ●

●●

●●

●● ● ● ●

● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●●

●●

●●

● ● ● ● ● ● ● ● ● ● ● ● ● ● ●●

●●

● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●●

●●

●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●

● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●●

●● ● ● ● ● ● ● ● ● ● ● ● ●

● ● ● ● ● ● ● ● ●●

●●

●●

●●

●●

●●

●● ● ● ● ● ● ● ● ● ● ● ● ● ●

●●

●●

●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●

0 20 40 60 80 100

0.00

0.02

0.04

0.06

Negative binomial distribution

x

f(x)

BiologicalvarianceXXXPoissonnoiseXXXRead countvariance

Mean(NB) = mean(Gamma) = E(mean(Poisson))Var(NB) = var(Gamma) + E(var(Poisson))

Page 75: Transcript expression estimation ... - University of Malta

Negative binomial distribution

Number of reads from gene g in library i of condition c can becaptured by a negative binomial model:

rcig = NB(kigµcg, scg)

where µcg and scg are, respectively, the mean and dispersion forreads from gene g in condition c.

The variance has two components:

σ2cg = kigµcg + k 2

igµ2cgscg

Poisson noise Overdispersion

• Notice there are now two parameters to estimate• How do we obtain precise estimates of the dispersion if we

have a small number of libraries per condition?

Page 76: Transcript expression estimation ... - University of Malta

Negative binomial distribution

Number of reads from gene g in library i of condition c can becaptured by a negative binomial model:

rcig = NB(kigµcg, scg)

where µcg and scg are, respectively, the mean and dispersion forreads from gene g in condition c.

The variance has two components:

σ2cg = kigµcg + k 2

igµ2cgscg

Poisson noise Overdispersion

• Notice there are now two parameters to estimate• How do we obtain precise estimates of the dispersion if we

have a small number of libraries per condition?

Page 77: Transcript expression estimation ... - University of Malta

DESeqHow do we estimate the variance robustly?

Assumption: dispersion is a smooth function of the mean.

σ2cg = kigµcg + k 2

igµ2cgs(µcg)

Poisson noise Overdispersion

1 100 10000

1e−0

51e−0

31e−0

1

rowMeans(counts(cds, normalized = TRUE))

fitIn

fo(c

ds)$

perG

eneD

ispE

sts

Figure 1: Empirical (black dots) and fitted (red lines) dispersion values plotted against meanexpression strength.

> plotDispEsts( cds )

The plot in Figure 1 is doubly logarithmic; this may be helpful or misleading, and it is worthexperimenting with other plotting styles.

As we estimated the dispersion from only two samples, we should expect the estimates toscatter with quite some sampling variance around their true values. Hence, we DESeq should notuse the per-gene estimates directly in the test, because using too low dispersion values leads tofalse positives. Many of the values below the red line are likely to be underestimates of the truedispersions, and hence, it is prudent to instead rather use the fitted value. On the othe hand,not all of he values above the red line are overestimations, and hence, the conservative choice isto keep them instead of replacing them with their fitted values. If you do not like this defaultbehaviour, you can change it with the option sharingMode of estimateDispersions. Note thatDESeq orginally (as described in [1]) only used the fitted values (sharingMode="fit-only").The current default (sharingMode="maximum") is more conservative.

Another di↵erence of the current DESeq version to the original method described in thepaper is the way how the mean-dispersion relation is fitted. By default, estimateDispersion

now performs a parametric fit: Using a gamma-family GLM, two coe�cients ↵0,↵1 are found toparametrize the fit as ↵ = ↵0 + ↵1/µ. (The values of the two coe�cients can be found in thefitInfo object, as attribute coefficients to dispFunc.) For some data sets, the parametricfit may give bad results, in which case one should try a local fit (the method described in thepaper), which is available via the option fitType="local" to estimateDispersions.

6

DESeqHow do we estimate the variance robustly?

µcg Assumption: variance is a smooth function of the mean.

�2cg = kigµcg + k2

ig⌫2(µcg )

Poisson noise Overdispersion

1 100 10000

1e−0

51e−0

31e−0

1

rowMeans(counts(cds, normalized = TRUE))

fitIn

fo(c

ds)$

perG

eneD

ispE

sts

Figure 1: Empirical (black dots) and fitted (red lines) dispersion values plotted against meanexpression strength.

> plotDispEsts( cds )

The plot in Figure 1 is doubly logarithmic; this may be helpful or misleading, and it is worthexperimenting with other plotting styles.

As we estimated the dispersion from only two samples, we should expect the estimates toscatter with quite some sampling variance around their true values. Hence, we DESeq should notuse the per-gene estimates directly in the test, because using too low dispersion values leads tofalse positives. Many of the values below the red line are likely to be underestimates of the truedispersions, and hence, it is prudent to instead rather use the fitted value. On the othe hand,not all of he values above the red line are overestimations, and hence, the conservative choice isto keep them instead of replacing them with their fitted values. If you do not like this defaultbehaviour, you can change it with the option sharingMode of estimateDispersions. Note thatDESeq orginally (as described in [1]) only used the fitted values (sharingMode="fit-only").The current default (sharingMode="maximum") is more conservative.

Another di�erence of the current DESeq version to the original method described in thepaper is the way how the mean-dispersion relation is fitted. By default, estimateDispersion

now performs a parametric fit: Using a gamma-family GLM, two coe�cients ↵0,↵1 are found toparametrize the fit as ↵ = ↵0 + ↵1/µ. (The values of the two coe�cients can be found in thefitInfo object, as attribute coefficients to dispFunc.) For some data sets, the parametricfit may give bad results, in which case one should try a local fit (the method described in thepaper), which is available via the option fitType="local" to estimateDispersions.

6

Use fitted values (or valuesabove the line) instead ofraw estimates. XXThis is a form of pool-ing (sharing of informationacross genes) to stabilisethe estimates.

Anders and Huber 2010

Di↵

eren

tial

expr

ession

with

DESeq

Let

the

obse

rved

conditio

n-s

pec

ific

counts

be

q⇤ cg

=P

ir c

ig.

The

prob

abili

tyof

the

dat

aunder

the

null

isP⇤

=P

(q⇤ 1g,q

⇤ 2g|µ

g,�

2 g).

Obta

inge

ne-

wise

p-v

alues

:

pg

=

Pq

1g,q

2g:P

(q1g,q

2g|µ

g,�

2 g)<

P⇤ ^

q1g+

q2g=

q⇤ 1g+

q⇤ 2g

P(q

1g,q

2g|µ

g,�

2 g)

Pq

1g,q

2g:q

1g+

q2g=

q⇤ 1g+

q⇤ 2g

P(q

1g,q

2g|µ

g,�

2 g)

,

wher

gan

d�

2 gar

ees

tim

ates

for

the

mea

nan

dva

rian

ceunder

the

null.

s(µ

cg)

Ander

sand

Huber

2010

Use fitted values (or valuesabove the line) instead ofraw estimates. XXThis is a form of pool-ing (sharing of informationacross genes) to stabilisethe estimates.

Anders and Huber 2010

Page 78: Transcript expression estimation ... - University of Malta

Differential expression with DESeq

Back to hypothesis testing...

rcig = NB(kigµcg, scg)

H0 : µ1g = µ2g.

Perform a negative binomial exact test.

How extreme is the partitioning of counts between the twoconditions under the null?

Page 79: Transcript expression estimation ... - University of Malta

Differential expression with DESeq

Back to hypothesis testing...

rcig = NB(kigµcg, scg)

H0 : µ1g = µ2g.

Perform a negative binomial exact test.

How extreme is the partitioning of counts between the twoconditions under the null?

Page 80: Transcript expression estimation ... - University of Malta

Differential expression with DESeq

Let the observed condition-specific counts be q∗cg =∑

i rcig.

The probability of the data under the null isP∗ = P(q∗1g, q

∗2g |µg, σ

2g).

Obtain gene-wise exact p-values:

pg =

∑q1g ,q2g:P(q1g ,q2g |µg ,σ

2g)<P∗∧q1g+q2g=q∗1g+q∗2g

P(q1g, q2g |µg, σ2g)∑

q1g ,q2g:q1g+q2g=q∗1g+q∗2g

P(q1g, q2g |µg, σ2g)

,

where µg and σ2g are estimates for the mean and variance under

the null.

Anders and Huber 2010

Page 81: Transcript expression estimation ... - University of Malta

Differential expression with DESeq

Let the observed condition-specific counts be q∗cg =∑

i rcig.

The probability of the data under the null isP∗ = P(q∗1g, q

∗2g |µg, σ

2g).

Obtain gene-wise exact p-values:

pg =

∑q1g ,q2g:P(q1g ,q2g |µg ,σ

2g)<P∗∧q1g+q2g=q∗1g+q∗2g

P(q1g, q2g |µg, σ2g)∑

q1g ,q2g:q1g+q2g=q∗1g+q∗2g

P(q1g, q2g |µg, σ2g)

,

where µg and σ2g are estimates for the mean and variance under

the null.

Anders and Huber 2010

Page 82: Transcript expression estimation ... - University of Malta

Differential isoform expression

• At the gene level, counts are often observed (however bewareof isoform switching)

• At other levels (isoforms, haplo-isoforms) counts almostalways have to be estimated (e.g. with MMSEQ) becausereads map to multiple overlapping transcripts

• Count-based methods such as DESeq can be used to obtaindifferential isoform expression results by using estimatedcounts instead of observed counts

• A more powerful approach is to take into account posterioruncertainty in expression estimates (MMDIFF; Turro et al2014)

Page 83: Transcript expression estimation ... - University of Malta

Closing remarks

• Variation in total RNA output per sample leads to biases inexpression estimates (limited real estate)

• Variation in sequence composition of genes leads to biases(non-random hexamer priming)

• Normalisation seeks to correct for these biases

• Biological and Poisson variability can be modelled with anegative binomial distribution

• Variance of negative binomial hard to estimate gene-by-gene(best to share information acrosss genes)

• Negative binomial exact test produces p-values under the nullof no differential expression

Page 84: Transcript expression estimation ... - University of Malta

Closing remarks

• Variation in total RNA output per sample leads to biases inexpression estimates (limited real estate)

• Variation in sequence composition of genes leads to biases(non-random hexamer priming)

• Normalisation seeks to correct for these biases

• Biological and Poisson variability can be modelled with anegative binomial distribution

• Variance of negative binomial hard to estimate gene-by-gene(best to share information acrosss genes)

• Negative binomial exact test produces p-values under the nullof no differential expression

Page 85: Transcript expression estimation ... - University of Malta

Closing remarks

• Variation in total RNA output per sample leads to biases inexpression estimates (limited real estate)

• Variation in sequence composition of genes leads to biases(non-random hexamer priming)

• Normalisation seeks to correct for these biases

• Biological and Poisson variability can be modelled with anegative binomial distribution

• Variance of negative binomial hard to estimate gene-by-gene(best to share information acrosss genes)

• Negative binomial exact test produces p-values under the nullof no differential expression

Page 86: Transcript expression estimation ... - University of Malta

Closing remarks

• Variation in total RNA output per sample leads to biases inexpression estimates (limited real estate)

• Variation in sequence composition of genes leads to biases(non-random hexamer priming)

• Normalisation seeks to correct for these biases

• Biological and Poisson variability can be modelled with anegative binomial distribution

• Variance of negative binomial hard to estimate gene-by-gene(best to share information acrosss genes)

• Negative binomial exact test produces p-values under the nullof no differential expression

Page 87: Transcript expression estimation ... - University of Malta

Closing remarks

• Variation in total RNA output per sample leads to biases inexpression estimates (limited real estate)

• Variation in sequence composition of genes leads to biases(non-random hexamer priming)

• Normalisation seeks to correct for these biases

• Biological and Poisson variability can be modelled with anegative binomial distribution

• Variance of negative binomial hard to estimate gene-by-gene(best to share information acrosss genes)

• Negative binomial exact test produces p-values under the nullof no differential expression

Page 88: Transcript expression estimation ... - University of Malta

Further reading (transcriptome-based analysis)

Turro E, Su S-Y, Goncalves A , Coin LJM, Richardson S, Lewin A.Haplotype and isoform specific expression estimation usingmulti-mapping RNA-seq reads. Genome Biology, 2011 Feb;12:R13.

Turro E, Astle WJ, Tavare S. Flexible analysis of RNA-seq datausing mixed effects models. Bioinformatics, 2014 Jan;30(2):180-188.

MMSEQ, MMDIFF

• Haplotype-specific, isoform, gene-level expression estimation

• Flexible model comparison (polytomous model selection)

https://github.com/eturro/mmseq