RNA-seq data analysis with Bioconductor

Ângela Filimon Gonçalves

Functional Genomics Team at the European Bioinformatics Institute - EMBL

Tuesday, 11 September 2012

Analysis pipeline overview

From raw sequence files to expression levels:

1. Quality control and filtering

2. Read mapping

3. Quantifying expression

4. Normalisation

Downstream analysis:

5. E.g. Differential expression

Read mapping

Which reads come from which regions of a genome or transcript?

Reads can be:

• assembled de novo to find the transcriptome

or mapped to

• a genome

• a transcriptome.

(The options are listed in order of decreasing complexity.)

De novo read assembly

Objective:

• find the set of longest possible contiguously expressed regions (contigs) without using a reference - by exploiting the overlaps between reads

One of the most prevalent algorithmic approaches uses a de Bruijn graph representation

• implemented in methods such as:

• Trinity (Grabherr, 2011)

• Trans-ABySS (Robertson, 2010)

• or Oases (Schulz, 2012).

De novo read assembly

E.g. we sequence a transcript in three reads:

    ATGGCGTGCA   (transcript)
    ATGGC        (read 1)
       GCGTG     (read 2)
         GTGCA   (read 3)

We choose a length k (e.g. k = 4) and create two sets:

• one with all substrings of length k found in the reads (known as k-mers):

ATGG, TGGC, GCGT, CGTG, GTGC, TGCA

• and another with substrings of length k-1:

ATG, TGG, GGC, GCG, CGT, GTG, TGC, GCA

De novo read assembly

Build a graph, with the (k-1)-mers as nodes and the k-mers as directed edges:

edges: ATGG, TGGC, GCGT, CGTG, GTGC, TGCA

nodes: ATG, TGG, GGC, GCG, CGT, GTG, TGC, GCA

    ATG -(ATGG)-> TGG

De novo read assembly

    ATG -(ATGG)-> TGG -(TGGC)-> GGC -> ??

What now?

We cannot extend the contig, but we can create a second one:

    ATG -(ATGG)-> TGG -(TGGC)-> GGC
    GCG -(GCGT)-> CGT -(CGTG)-> GTG -(GTGC)-> TGC -(TGCA)-> GCA

This gives two contigs: ATGGC and GCGTGCA.

De novo read assembly

Why was the transcript split into 2?

Two reads are only joined in the graph if they share a (k-1)-mer. The minimum overlap between reads is only 2, so our choice of k (4) is too high.

Try k = 3 (more edges, fewer nodes):

edges: ATG, TGG, GGC, GCG, CGT, GTG, TGC, GCA

nodes: AT, TG, GG, GC, CG, GT, CA

De novo read assembly

    AT -(ATG)-> TG
    TG -(TGG)-> GG
    GG -(GGC)-> GC
    GC -(GCG)-> CG
    CG -(CGT)-> GT
    GT -(GTG)-> TG
    TG -(TGC)-> GC
    GC -(GCA)-> CA

De novo read assembly

Smaller k:

• increases the chance of observing overlaps -> more complex graphs

Larger k:

• decreases the chance of observing overlaps -> simpler graphs

There is a balance between sensitivity and specificity determined by k -> k has a significant influence on the quality of the assembly.

The best k depends on the coverage, read length and error rate, but it is hard to estimate.

Often a small range of values is tested to find the optimal k.
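The effect of k can be checked on the three example reads with a small sketch. This is an illustration only, not how Trinity, Trans-ABySS or Oases are implemented; the function names `de_bruijn` and `contigs` are hypothetical, and contigs are taken to be the maximal unbranched paths of the graph.

```python
from collections import defaultdict


def de_bruijn(reads, k):
    """Map each (k-1)-mer node to the set of its successor nodes.

    Every distinct k-mer in the reads contributes one directed edge
    from its (k-1)-prefix to its (k-1)-suffix.
    """
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
            graph[kmer[1:]]  # touch the suffix so that sink nodes are stored too
    return graph


def contigs(graph):
    """Spell out the maximal unbranched paths of the graph.

    Assumption of this sketch: isolated cycles (where every node has
    exactly one predecessor and one successor) are ignored.
    """
    indegree = defaultdict(int)
    for successors in graph.values():
        for node in successors:
            indegree[node] += 1

    def is_simple(node):
        return indegree[node] == 1 and len(graph[node]) == 1

    paths = []
    for node in sorted(graph):
        if is_simple(node):
            continue  # paths start only at branch points, sources and sinks
        for successor in sorted(graph[node]):
            path = node + successor[-1]
            current = successor
            while is_simple(current):
                current = next(iter(graph[current]))
                path += current[-1]
            paths.append(path)
    return paths


reads = ["ATGGC", "GCGTG", "GTGCA"]
for k in (4, 3):
    g = de_bruijn(reads, k)
    n_edges = sum(len(successors) for successors in g.values())
    print(k, len(g), n_edges, contigs(g))
```

With k = 4 the graph has 8 nodes and 6 edges and falls apart into the two contigs seen earlier. With k = 3 it has 7 nodes and 8 edges and is connected, but the nodes TG and GC branch, which is the extra graph complexity that a smaller k brings.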

De novo read assembly

For genome assembly the coverage is reasonably even, so a single k works well.

For RNA-seq, coverage varies with expression level:

• highly expressed genes -> more overlap between reads -> the optimal k is larger -> allows long contigs and produces simpler graphs

• lowly expressed genes -> less overlap -> optimal k is smaller -> complex graphs

So: use a range of values of k and merge the resulting contigs.

De novo read assembly

[Figure: the k = 3 de Bruijn graph from the example above]

Additional complexity arises from:

• heterozygosity

• sequencing errors

• alternative splicing!!!

De novo read assembly

How to assemble the contigs into gene models? There are many methods (Robertson, 2010):

• contig merging and expansion

• connecting contigs using paired-end reads

• alignment of the contigs to the genome and comparison with known annotation

• etc.

De novo read assembly

Methods using de Bruijn graphs can be used to assemble transcriptomes de novo.

Con

• the assembly is difficult and only the most abundant transcripts are likely to be fully assembled

Pro

• useful when there is no reference genome for the species in question and/or the annotation is of poor quality.

Read mapping

[Figure: a read aligned to the genome, with 1st and 2nd mates, and a read aligned to a transcript]

Read mapping

TopHat - a hybrid approach¹

• First, align to the genome contiguously

• then assemble the aligned reads into putative exons

• map the remaining reads to a set of pairings between those exons.

1. Trapnell et al., Bioinformatics, 2009

Read mapping

Gene model assembly with Cufflinks - find the smallest possible set of transcripts that explains all the observed reads by:

1. constructing a directed acyclic graph where the reads are nodes and two nodes are connected if

• the reads overlap with one another and

• they are not incompatible

2. finding the minimum number of paths that partition the graph, using Dilworth’s theorem.
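The second step can be illustrated with a short sketch. It is not Cufflinks' own code: it only shows the standard reduction in which the minimum number of paths (chains) partitioning a directed acyclic graph equals the number of nodes minus the size of a maximum bipartite matching. The function name `min_chain_cover` and the toy graph are hypothetical, and the edge list is assumed to be transitively closed already (if a can reach c through b, the edge a -> c is listed too).

```python
def min_chain_cover(n_nodes, edges):
    """Minimum number of chains needed to partition a DAG.

    Assumption: `edges` is transitively closed. Each matched pair (u, v)
    means "v directly follows u in the same chain", so every matched edge
    saves one chain.
    """
    successors = [[] for _ in range(n_nodes)]
    for u, v in edges:
        successors[u].append(v)

    predecessor_of = [None] * n_nodes  # predecessor_of[v] = u if v follows u

    def try_extend(u, seen):
        # Augmenting-path search (Kuhn's algorithm) for a maximum matching.
        for v in successors[u]:
            if v in seen:
                continue
            seen.add(v)
            if predecessor_of[v] is None or try_extend(predecessor_of[v], seen):
                predecessor_of[v] = u
                return True
        return False

    matched = sum(try_extend(u, set()) for u in range(n_nodes))
    return n_nodes - matched


# Toy example: reads 1 and 2 are mutually incompatible (e.g. they support
# alternative exons), while read 0 precedes both and read 3 follows both.
edges = [(0, 1), (0, 2), (0, 3), (1, 3), (2, 3)]
print(min_chain_cover(4, edges))
```

In the toy graph the two incompatible reads cannot share a path, so at least two transcripts are needed, which is the number the sketch reports.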

Read mapping

[Figure: inferred transcript models]

Read mapping

Gene model assembly with Cufflinks:

the minimum number of paths is not necessarily unique -> need to phase distant exons together

Cufflinks uses the expression levels to weight the edges and find the most likely path

Read mapping

Gene model assembly with Cufflinks:

• because it’s a de novo method it requires high coverage (high read overlap) to properly assemble transcripts

• one caveat is that all paths are maximally extended -> impossible to detect some instances of alternative transcription start and end sites

• finds the minimum set of transcripts that explains the data (Occam’s Razor principle), which is not necessarily what evolution comes up with...

Read mapping

Mapping to the genome

Pros

• discovery of novel transcribed regions

• de novo assembly of gene models for species with no annotation

Cons

• gapped mapping algorithms have limitations - even when allowing gaps in mapping to the genome quite a few of the reads that don’t map would do so contiguously to the transcriptome

• the analysis of the result in order to obtain transcript expression is more complex

Read mapping

Mapping to the transcriptome

Pros

• less complexity in mapping - reads map contiguously

• the output is easily interpretable

• faster than mapping to a genome

Cons

• only for annotated regions -> you must be reasonably confident in the annotation

• multiple transcripts for the same gene (reads might map equally well to these because of shared sequence)

Quantifying expression

“Easy”: count the number of reads that aligned to a gene, exon, splice-junction...

What about multi-mapping reads?

[Figure: a read that aligns to two annotated genes, g1 and g2, on the genome]

And what about isoform levels?

[Figure: a read that aligns to two annotated transcripts, t1 and t2, of the same gene g1]

Quantifying expression

Discard multi-mapping reads:

• good at the gene, exon, splice junction level

• not so good at the isoform level, use an alternative method

Assign multi-mapping reads probabilistically, according to the uniquely mapping reads in their neighbourhood (this still doesn’t work at the isoform level).

Use a method that estimates the expression levels most likely to explain the data:

• MMSEQ (Turro, 2011) - alignment to the transcriptome

• Cufflinks (Trapnell, 2010) - alignment to the genome + de novo gene model assembly

• and several others (POEM, MISO, RSEM...)
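The simplest of these strategies, discarding multi-mapping reads, can be sketched as follows. The input format (a list of `(read_id, gene)` pairs, one per reported alignment) and the function name `count_unique_reads` are assumptions made for illustration; they do not correspond to the output of any particular aligner.

```python
from collections import Counter, defaultdict


def count_unique_reads(alignments):
    """Count reads per gene, discarding reads that align to several genes.

    `alignments` is an iterable of (read_id, gene) pairs (hypothetical format).
    Returns (counts per gene, number of discarded multi-mapping reads).
    """
    genes_per_read = defaultdict(set)
    for read_id, gene in alignments:
        genes_per_read[read_id].add(gene)

    counts = Counter()
    discarded = 0
    for genes in genes_per_read.values():
        if len(genes) == 1:
            counts[next(iter(genes))] += 1
        else:
            discarded += 1  # the read is ambiguous between genes
    return counts, discarded


alignments = [
    ("r1", "g1"),
    ("r2", "g1"),
    ("r3", "g1"), ("r3", "g2"),  # r3 maps to both genes
    ("r4", "g2"),
]
counts, discarded = count_unique_reads(alignments)
print(dict(counts), discarded)
```

The same idea fails at the isoform level because most reads from a gene with several isoforms fall in shared exons and would all be discarded, which is why the model-based methods listed above are used instead.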

Quantifying expression

MMSEQ¹

• map reads to transcript sequences with Bowtie

Multi-mapping reads

• Obtain transcript sets, such that each read maps to only 1 set

• Transcripts may belong to more than one set

• Read counts per set can be observed

• Transcripts can be isoforms sharing exons or from multiple genes

[Figure: reads mapped to transcripts A, B, C and D, and the resulting transcript sets 1, 2 and 3]

1. Turro et al., Genome Biology, 2011

Quantifying expressionMMSEQLatent variables for read counts

A

B

D

A

B

C

Reads Transcripts Transcriptsets

1

2

3

M =

1 1 0 01 1 1 00 0 0 1

X =

X11 X12 0 0X21 X22 X23 00 0 0 X44

r =�r1 r2 r3 r4

M =

1 1 0 01 1 1 00 0 0 1

X =

X11 X12 0 0X21 X22 X23 00 0 0 X44

r =�r1 r2 r3 r4

M =

1 1 0 01 1 1 00 0 0 1

X =

X11 X12 0 0X21 X22 X23 00 0 0 X44

r =�r1 r2 r3 r4

M =

1 1 0 01 1 1 00 0 0 1

X =

X11 X12 0 0X21 X22 X23 00 0 0 X44

r =�r1 r2 r3 r4

M =

1 1 0 01 1 1 00 0 0 1

X =

X11 X12 0 0X21 X22 X23 00 0 0 X44

r =�r1 r2 r3 r4

M =

1 1 0 01 1 1 00 0 0 1

X =

X11 X12 0 0X21 X22 X23 00 0 0 X44

r =�r1 r2 r3 r4

M =

1 1 0 01 1 1 00 0 0 1

X =

X11 X12 0 0X21 X22 X23 00 0 0 X44

r =�r1 r2 r3 r4

M =

1 1 0 01 1 1 00 0 0 1

X =

X11 X12 0 0X21 X22 X23 00 0 0 X44

r =�r1 r2 r3 r4

M =

1 1 0 01 1 1 00 0 0 1

X =

X11 X12 0 0X21 X22 X23 00 0 0 X44

r =�r1 r2 r3 r4

M =

1 1 0 01 1 1 00 0 0 1

X =

X11 X12 0 0X21 X22 X23 00 0 0 X44

r =�r1 r2 r3 r4

M =

1 1 0 01 1 1 00 0 0 1

X =

X11 X12 0 0X21 X22 X23 00 0 0 X44

r =�r1 r2 r3 r4

M =

1 1 0 01 1 1 00 0 0 1

X =

X11 X12 0 0X21 X22 X23 00 0 0 X44

r =�r1 r2 r3 r4

M =

1 1 0 01 1 1 00 0 0 1

X =

X11 X12 0 0X21 X22 X23 00 0 0 X44

r =�r1 r2 r3 r4

Latent variables for read counts

A

B

D

A

B

C

Reads Transcripts Transcriptsets

1

2

3

Latent variables for read counts

M =

1 1 0 0

1 1 1 0

0 0 0 1

X =

X11 X12 0 0

X21 X22 X23 0

0 0 0 X34

ki ∼ Poisson(bsi�

t

Mitµt)

Xit ∼ Pois(bsiMitµt)

Similar model has been used by Richard et al. 2010; Jiang and Wong 2009(defining transcript sets by exon-sharing, don’t use reads mapping tomultiple genes),also by Li et al. 2010.

None of these have used haplotype-specific isoforms.Alex Lewin (Imperial) Haplotype-specific isoform expression 12 April 2011 8 / 20

A B C D

123

Poissonmodelfortranscriptsetreadscounts

Modelreadspertranscriptsetinsteadofpergene.

DefineMit = �1iftranscripttinseti,0otherwise.

Nowmodelforreadscountsis:

ki ∼Poisson(bsi �t Mit µt ),

wheresi istheeffectivelengthsharedbytranscriptsinseti.

Transcripts

Transcriptsets

Poissonmodelfortranscriptsetreadscounts

Modelreadspertranscriptsetinsteadofpergene.

DefineMit=

�1iftranscripttinseti,0otherwise.

Nowmodelforreadscountsis:

ki∼Poisson(bsi�

t

Mitµt),

wheresiistheeffectivelengthsharedbytranscriptsinseti.

Latent variables for read counts

M =

1 1 0 0

1 1 1 0

0 0 0 1

X =

X11 X12 0 0

X21 X22 X23 0

0 0 0 X34

ki ∼ Poisson(bsi�

t

Mitµt)

Xit ∼ Pois(bsiMitµt)

Similar model has been used by Richard et al. 2010; Jiang and Wong 2009(defining transcript sets by exon-sharing, don’t use reads mapping tomultiple genes),also by Li et al. 2010.

None of these have used haplotype-specific isoforms.Alex Lewin (Imperial) Haplotype-specific isoform expression 12 April 2011 8 / 20

A B C D

123

Poissonmodelfortranscriptsetreadscounts

Modelreadspertranscriptsetinsteadofpergene.

DefineMit = �1iftranscripttinseti,0otherwise.

Nowmodelforreadscountsis:

ki ∼Poisson(bsi �t Mit µt ),

wheresi istheeffectivelengthsharedbytranscriptsinseti.

Transcripts

Transcriptsets

Poissonmodelfortranscriptsetreadscounts

Modelreadspertranscriptsetinsteadofpergene.

DefineMit=

�1iftranscripttinseti,0otherwise.

Nowmodelforreadscountsis:

ki∼Poisson(bsi�

t

Mitµt),

wheresiistheeffectivelengthsharedbytranscriptsinseti.

Latent variables for read counts

M =

1 1 0 0

1 1 1 0

0 0 0 1

X =

X11 X12 0 0

X21 X22 X23 0

0 0 0 X34

ki ∼ Poisson(bsi�

t

Mitµt)

Xit ∼ Pois(bsiMitµt)

Similar model has been used by Richard et al. 2010; Jiang and Wong 2009(defining transcript sets by exon-sharing, don’t use reads mapping tomultiple genes),also by Li et al. 2010.

None of these have used haplotype-specific isoforms.Alex Lewin (Imperial) Haplotype-specific isoform expression 12 April 2011 8 / 20

Latent variables for read counts

A

B

D

A

B

C

Reads Transcripts Transcriptsets

1

2

3

Latent variables for read counts

M =

1 1 0 0

1 1 1 0

0 0 0 1

X =

X11 X12 0 0

X21 X22 X23 0

0 0 0 X34

ki ∼ Poisson(bsi�

t

Mitµt)

Xit ∼ Pois(bsiMitµt)

Similar model has been used by Richard et al. 2010; Jiang and Wong 2009(defining transcript sets by exon-sharing, don’t use reads mapping tomultiple genes),also by Li et al. 2010.

None of these have used haplotype-specific isoforms.Alex Lewin (Imperial) Haplotype-specific isoform expression 12 April 2011 8 / 20

A B C D

123

Poissonmodelfortranscriptsetreadscounts

Modelreadspertranscriptsetinsteadofpergene.

DefineMit = �1iftranscripttinseti,0otherwise.

Nowmodelforreadscountsis:

ki ∼Poisson(bsi �t Mit µt ),

wheresi istheeffectivelengthsharedbytranscriptsinseti.

Transcripts

Transcriptsets

Poissonmodelfortranscriptsetreadscounts

Modelreadspertranscriptsetinsteadofpergene.

DefineMit=

�1iftranscripttinseti,0otherwise.

Nowmodelforreadscountsis:

ki∼Poisson(bsi�

t

Mitµt),

wheresiistheeffectivelengthsharedbytranscriptsinseti.

Latent variables for read counts

M =

1 1 0 0

1 1 1 0

0 0 0 1

X =

X11 X12 0 0

X21 X22 X23 0

0 0 0 X34

ki ∼ Poisson(bsi�

t

Mitµt)

Xit ∼ Pois(bsiMitµt)

Similar model has been used by Richard et al. 2010; Jiang and Wong 2009(defining transcript sets by exon-sharing, don’t use reads mapping tomultiple genes),also by Li et al. 2010.

None of these have used haplotype-specific isoforms.Alex Lewin (Imperial) Haplotype-specific isoform expression 12 April 2011 8 / 20

A B C D

123

Poissonmodelfortranscriptsetreadscounts

Modelreadspertranscriptsetinsteadofpergene.

DefineMit = �1iftranscripttinseti,0otherwise.

Nowmodelforreadscountsis:

ki ∼Poisson(bsi �t Mit µt ),

wheresi istheeffectivelengthsharedbytranscriptsinseti.

Transcripts

Transcriptsets

Poissonmodelfortranscriptsetreadscounts

Modelreadspertranscriptsetinsteadofpergene.

DefineMit=

�1iftranscripttinseti,0otherwise.

Nowmodelforreadscountsis:

ki∼Poisson(bsi�

t

Mitµt),

wheresiistheeffectivelengthsharedbytranscriptsinseti.

Latent variables for read counts

M =

1 1 0 0

1 1 1 0

0 0 0 1

X =

X11 X12 0 0

X21 X22 X23 0

0 0 0 X34

ki ∼ Poisson(bsi�

t

Mitµt)

Xit ∼ Pois(bsiMitµt)

Similar model has been used by Richard et al. 2010; Jiang and Wong 2009(defining transcript sets by exon-sharing, don’t use reads mapping tomultiple genes),also by Li et al. 2010.

None of these have used haplotype-specific isoforms.Alex Lewin (Imperial) Haplotype-specific isoform expression 12 April 2011 8 / 20

k =

212

r =�r1 r2 r3 r4

�Xit ∼ Poisson(bsiMitµt),

ki ∼ Poisson(bsi�

t

Mitµt),

{X1t , . . . ,Xmt}|rt ∼ Mult

�rt ,

M1ts1�i Mitsi

, . . . ,Mmtsm�i Mitsi

�.

A B C D

Poissonmodelfortranscriptsetreadscounts

Modelreadspertranscriptsetinsteadofpergene.

DefineMit = �1iftranscripttinseti,0otherwise.

Nowmodelforreadscountsis:

ki ∼Poisson(bsi �t Mit µt ),

wheresi istheeffectivelengthsharedbytranscriptsinseti.

Transcripts

Transcriptsets

123

Poissonmodelfortranscriptsetreadscounts

Modelreadspertranscriptsetinsteadofpergene.

DefineMit=

�1iftranscripttinseti,0otherwise.

Nowmodelforreadscountsis:

ki∼Poisson(bsi�

t

Mitµt),

wheresiistheeffectivelengthsharedbytranscriptsinseti.

Transcriptsets

123

Poissonmodelfortranscriptsetreadscounts

Modelreadspertranscriptsetinsteadofpergene.

DefineMit=

�1iftranscripttinseti,0otherwise.

Nowmodelforreadscountsis:

ki∼Poisson(bsi�

t

Mitµt),

wheresiistheeffectivelengthsharedbytranscriptsinseti.

Observedset counts

Poissonmodelfortranscriptsetreadscounts

Modelreadspertranscriptsetinsteadofpergene.

DefineMit = �1iftranscripttinseti,0otherwise.

Nowmodelforreadscountsis:

ki ∼Poisson(bsi �t Mit µt ),

wheresi istheeffectivelengthsharedbytranscriptsinseti. Unobservedtranscript counts

Poisson model for transcript set reads counts

Model reads per transcript set instead of per gene.

Define Mit =

�1 if transcript t in set i ,0 otherwise.

Now model for reads counts is:

ki ∼ Poisson(bsi�

t

Mitµt),

where si is the effective length shared by transcripts in set i .Xit ∼ Poisson(bsiMitµt),

ki ∼ Poisson(bsi�

t

Mitµt),

rt ∼ Poisson(bµt

i

Mitsi ) = Poisson(bltµt),

{X1t , . . . ,Xmt}|rt ∼ Mult

�rt ,

M1ts1�i Mitsi

, . . . ,Mmtsm�i Mitsi

�.

observed number of reads in set i

Latent variables for read counts

A

B

D

A

B

C

Reads Transcripts Transcriptsets

1

2

3

M =

1 1 0 01 1 1 00 0 0 1

X =

X11 X12 0 0X21 X22 X23 00 0 0 X44

r =�r1 r2 r3 r4

M =

1 1 0 01 1 1 00 0 0 1

X =

X11 X12 0 0X21 X22 X23 00 0 0 X44

r =�r1 r2 r3 r4

M =

1 1 0 01 1 1 00 0 0 1

X =

X11 X12 0 0X21 X22 X23 00 0 0 X44

r =�r1 r2 r3 r4

M =

1 1 0 01 1 1 00 0 0 1

X =

X11 X12 0 0X21 X22 X23 00 0 0 X44

r =�r1 r2 r3 r4

M =

1 1 0 01 1 1 00 0 0 1

X =

X11 X12 0 0X21 X22 X23 00 0 0 X44

r =�r1 r2 r3 r4

M =

1 1 0 01 1 1 00 0 0 1

X =

X11 X12 0 0X21 X22 X23 00 0 0 X44

r =�r1 r2 r3 r4

M =

1 1 0 01 1 1 00 0 0 1

X =

X11 X12 0 0X21 X22 X23 00 0 0 X44

r =�r1 r2 r3 r4

M =

1 1 0 01 1 1 00 0 0 1

X =

X11 X12 0 0X21 X22 X23 00 0 0 X44

r =�r1 r2 r3 r4

M =

1 1 0 01 1 1 00 0 0 1

X =

X11 X12 0 0X21 X22 X23 00 0 0 X44

r =�r1 r2 r3 r4

M =

1 1 0 01 1 1 00 0 0 1

X =

X11 X12 0 0X21 X22 X23 00 0 0 X44

r =�r1 r2 r3 r4

M =

1 1 0 01 1 1 00 0 0 1

X =

X11 X12 0 0X21 X22 X23 00 0 0 X44

r =�r1 r2 r3 r4

M =

1 1 0 01 1 1 00 0 0 1

X =

X11 X12 0 0X21 X22 X23 00 0 0 X44

r =�r1 r2 r3 r4

M =

1 1 0 01 1 1 00 0 0 1

X =

X11 X12 0 0X21 X22 X23 00 0 0 X44

r =�r1 r2 r3 r4

Latent variables for read counts

A

B

D

A

B

C

Reads Transcripts Transcriptsets

1

2

3

Latent variables for read counts

M =

1 1 0 0

1 1 1 0

0 0 0 1

X =

X11 X12 0 0

X21 X22 X23 0

0 0 0 X34

ki ∼ Poisson(bsi�

t

Mitµt)

Xit ∼ Pois(bsiMitµt)

Similar model has been used by Richard et al. 2010; Jiang and Wong 2009(defining transcript sets by exon-sharing, don’t use reads mapping tomultiple genes),also by Li et al. 2010.

None of these have used haplotype-specific isoforms.Alex Lewin (Imperial) Haplotype-specific isoform expression 12 April 2011 8 / 20

A B C D

123

Poissonmodelfortranscriptsetreadscounts

Modelreadspertranscriptsetinsteadofpergene.

DefineMit = �1iftranscripttinseti,0otherwise.

Nowmodelforreadscountsis:

ki ∼Poisson(bsi �t Mit µt ),

wheresi istheeffectivelengthsharedbytranscriptsinseti.

Transcripts

Transcriptsets

Poissonmodelfortranscriptsetreadscounts

Modelreadspertranscriptsetinsteadofpergene.

DefineMit=

�1iftranscripttinseti,0otherwise.

Nowmodelforreadscountsis:

ki∼Poisson(bsi�

t

Mitµt),

wheresiistheeffectivelengthsharedbytranscriptsinseti.

Latent variables for read counts

M =

1 1 0 0

1 1 1 0

0 0 0 1

X =

X11 X12 0 0

X21 X22 X23 0

0 0 0 X34

ki ∼ Poisson(bsi�

t

Mitµt)

Xit ∼ Pois(bsiMitµt)

Similar model has been used by Richard et al. 2010; Jiang and Wong 2009(defining transcript sets by exon-sharing, don’t use reads mapping tomultiple genes),also by Li et al. 2010.

None of these have used haplotype-specific isoforms.Alex Lewin (Imperial) Haplotype-specific isoform expression 12 April 2011 8 / 20

A B C D

123

Poissonmodelfortranscriptsetreadscounts

Modelreadspertranscriptsetinsteadofpergene.

DefineMit = �1iftranscripttinseti,0otherwise.

Nowmodelforreadscountsis:

ki ∼Poisson(bsi �t Mit µt ),

wheresi istheeffectivelengthsharedbytranscriptsinseti.

Transcripts

Transcriptsets

Poissonmodelfortranscriptsetreadscounts

Modelreadspertranscriptsetinsteadofpergene.

DefineMit=

�1iftranscripttinseti,0otherwise.

Nowmodelforreadscountsis:

ki∼Poisson(bsi�

t

Mitµt),

wheresiistheeffectivelengthsharedbytranscriptsinseti.

Latent variables for read counts

M =

1 1 0 0

1 1 1 0

0 0 0 1

X =

X11 X12 0 0

X21 X22 X23 0

0 0 0 X34

ki ∼ Poisson(bsi�

t

Mitµt)

Xit ∼ Pois(bsiMitµt)

Similar model has been used by Richard et al. 2010; Jiang and Wong 2009(defining transcript sets by exon-sharing, don’t use reads mapping tomultiple genes),also by Li et al. 2010.

None of these have used haplotype-specific isoforms.Alex Lewin (Imperial) Haplotype-specific isoform expression 12 April 2011 8 / 20

Latent variables for read counts

A

B

D

A

B

C

Reads Transcripts Transcriptsets

1

2

3

Latent variables for read counts

M =

1 1 0 0

1 1 1 0

0 0 0 1

X =

X11 X12 0 0

X21 X22 X23 0

0 0 0 X34

ki ∼ Poisson(bsi�

t

Mitµt)

Xit ∼ Pois(bsiMitµt)

Similar model has been used by Richard et al. 2010; Jiang and Wong 2009(defining transcript sets by exon-sharing, don’t use reads mapping tomultiple genes),also by Li et al. 2010.

None of these have used haplotype-specific isoforms.Alex Lewin (Imperial) Haplotype-specific isoform expression 12 April 2011 8 / 20

A B C D

123

Poissonmodelfortranscriptsetreadscounts

Modelreadspertranscriptsetinsteadofpergene.

DefineMit = �1iftranscripttinseti,0otherwise.

Nowmodelforreadscountsis:

ki ∼Poisson(bsi �t Mit µt ),

wheresi istheeffectivelengthsharedbytranscriptsinseti.

Transcripts

Transcriptsets

Poissonmodelfortranscriptsetreadscounts

Modelreadspertranscriptsetinsteadofpergene.

DefineMit=

�1iftranscripttinseti,0otherwise.

Nowmodelforreadscountsis:

ki∼Poisson(bsi�

t

Mitµt),

wheresiistheeffectivelengthsharedbytranscriptsinseti.

Latent variables for read counts

M =

1 1 0 0

1 1 1 0

0 0 0 1

X =

X11 X12 0 0

X21 X22 X23 0

0 0 0 X34

ki ∼ Poisson(bsi�

t

Mitµt)

Xit ∼ Pois(bsiMitµt)

Poisson model for transcript set read counts

Model reads per transcript set instead of per gene.

Define $M_{it} = 1$ if transcript $t$ is in set $i$, and $0$ otherwise.

The model for the read counts is then:

$$k_i \sim \mathrm{Poisson}\Big(b\, s_i \sum_t M_{it}\, \mu_t\Big),$$

where $s_i$ is the effective length shared by the transcripts in set $i$.

[Figure: transcripts A, B, C, D and transcript sets 1, 2, 3; observed set counts $k$ and unobserved transcript counts $r$]

Latent variables for read counts

$$M = \begin{pmatrix} 1 & 1 & 0 & 0 \\ 1 & 1 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix}, \qquad X = \begin{pmatrix} X_{11} & X_{12} & 0 & 0 \\ X_{21} & X_{22} & X_{23} & 0 \\ 0 & 0 & 0 & X_{34} \end{pmatrix}, \qquad k = \begin{pmatrix} 2 \\ 1 \\ 2 \end{pmatrix}, \qquad r = \begin{pmatrix} r_1 & r_2 & r_3 & r_4 \end{pmatrix}$$

$$X_{it} \sim \mathrm{Poisson}(b\, s_i M_{it}\, \mu_t),$$

$$k_i \sim \mathrm{Poisson}\Big(b\, s_i \sum_t M_{it}\, \mu_t\Big),$$

$$r_t \sim \mathrm{Poisson}\Big(b\, \mu_t \sum_i M_{it} s_i\Big) = \mathrm{Poisson}(b\, l_t\, \mu_t), \qquad l_t = \sum_i M_{it} s_i,$$

$$\{X_{1t}, \ldots, X_{mt}\} \mid r_t \sim \mathrm{Mult}\Big(r_t,\ \frac{M_{1t} s_1}{\sum_i M_{it} s_i},\ \ldots,\ \frac{M_{mt} s_m}{\sum_i M_{it} s_i}\Big).$$

• $\mu_t$: unknown real expression of transcript $t$; can be estimated with an EM algorithm
• $b$, $s_i$: normalisation terms

A similar model has been used by Richard et al. 2010 and Jiang and Wong 2009 (defining transcript sets by exon-sharing; they don't use reads mapping to multiple genes), and also by Li et al. 2010.

None of these have used haplotype-specific isoforms.

Figure source: Alex Lewin (Imperial), Haplotype-specific isoform expression, 12 April 2011.
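The slide notes that the transcript expression can be estimated with an EM algorithm. The following is a minimal Python sketch of one such EM scheme for the Poisson model above, not the implementation used by any particular package. The matrix `M` and the counts `k` are read off the slide (taking the flattened "212" as k = (2, 1, 2)); the effective lengths `s` and the constant `b` are hypothetical values chosen only for illustration.

```python
def em_transcript_expression(k, M, s, b, n_iter=200):
    """Estimate mu_t from set counts k_i ~ Poisson(b * s_i * sum_t M_it * mu_t)."""
    n_sets, n_tx = len(M), len(M[0])
    # l_t = sum_i M_it * s_i, the effective length of transcript t
    l = [sum(M[i][t] * s[i] for i in range(n_sets)) for t in range(n_tx)]
    mu = [1.0] * n_tx  # arbitrary positive starting values
    for _ in range(n_iter):
        r = [0.0] * n_tx  # expected unobserved transcript counts r_t
        for i in range(n_sets):
            denom = sum(M[i][t] * mu[t] for t in range(n_tx))
            if denom == 0:
                continue
            for t in range(n_tx):
                # E-step: E[X_it | k_i] = k_i * M_it * mu_t / sum_t' M_it' * mu_t'
                r[t] += k[i] * M[i][t] * mu[t] / denom
        # M-step: mu_t = r_t / (b * l_t)
        mu = [r[t] / (b * l[t]) for t in range(n_tx)]
    return mu


M = [[1, 1, 0, 0],
     [1, 1, 1, 0],
     [0, 0, 0, 1]]
k = [2, 1, 2]             # observed set counts, as read from the slide
s = [100.0, 50.0, 200.0]  # hypothetical effective lengths of the three sets
b = 0.01                  # hypothetical normalisation constant
mu = em_transcript_expression(k, M, s, b)
print(mu)
```

Transcript D belongs only to set 3, so its estimate is simply k_3 / (b s_3). Transcripts A and B have identical columns in `M`, so the data cannot separate them and the sketch leaves them equal to each other.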


Normalisation
Do we need normalisation?

• longer transcripts produce more reads than shorter ones: divide by transcript length

• read count is proportional to the depth of sampling: divide by the total number of reads
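The two divisions above can be combined into a single scaled value. The sketch below uses the RPKM convention (reads per kilobase of transcript per million reads); the scale factors 1e3 and 1e6 and the example numbers are assumptions for illustration and do not come from the slide.

```python
def rpkm(count, length_bp, total_reads):
    """Divide a read count by transcript length (in kb) and by sequencing depth (in millions)."""
    return count / (length_bp / 1e3) / (total_reads / 1e6)


# Hypothetical gene: 500 reads, 2,000 bp long, in a library of 10 million reads.
value = rpkm(count=500, length_bp=2_000, total_reads=10_000_000)
print(value)
```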


Normalisation

                        Sample 1      Norm Sample 1   Sample 2      Norm Sample 2
Gene 1                  1,000,000     0.1             1,000,000     0.05
Gene 2                  0             0               1,000,000     0.05
:                       :             :               :             :
Gene 19,999             20,000        0.002           20,000        0.001
Gene 20,000             0             0               20,000        0.001
Total number of reads   10,000,000                    20,000,000
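The normalised columns in the table are each count divided by the sample's total number of reads. The short sketch below reproduces them for the four genes shown. It also shows the side effect: genes 1 and 19,999 have identical raw counts in both samples, yet appear halved in sample 2 after scaling.

```python
counts = {
    "Gene 1": (1_000_000, 1_000_000),
    "Gene 2": (0, 1_000_000),
    "Gene 19,999": (20_000, 20_000),
    "Gene 20,000": (0, 20_000),
}
totals = (10_000_000, 20_000_000)  # total number of reads per sample, from the table

# Divide each count by the total number of reads of its sample.
normalised = {
    gene: tuple(c / n for c, n in zip(per_sample, totals))
    for gene, per_sample in counts.items()
}

# Apparent fold change (sample 2 / sample 1) for a gene whose raw count did not change.
fold_change_gene1 = normalised["Gene 1"][1] / normalised["Gene 1"][0]

for gene, values in normalised.items():
    print(gene, values)
print("Gene 1 apparent fold change:", fold_change_gene1)
```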


Normalisation
Robinson and Oshlack¹ propose a method (TMM, the trimmed mean of M-values) that, after trimming the data, calculates a weighted global fold change between the two samples, which is used as a normalisation factor.

This and other similar approaches²³ can better account for differences in read count between samples than scaling by the total number of reads alone.

1. Robinson and Oshlack, Genome Biology, 2010
2. Bullard et al., BMC Bioinformatics, 2010
3. Anders and Huber, Genome Biology, 2010
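To make the idea of a trimmed global fold change concrete, here is a deliberately simplified Python sketch. It is not the published method: the published method also weights genes and trims on overall expression level, whereas this version takes an unweighted mean of the log2 ratios after trimming the extremes. The function name, the 30% trim and the toy counts are assumptions for illustration.

```python
import math


def simple_trimmed_fold_change(sample, ref, trim=0.3):
    """Unweighted trimmed mean of log2 ratios of read proportions, returned as a fold change."""
    n_s, n_r = sum(sample), sum(ref)
    # M-values: log2 ratio of proportions, for genes observed in both samples
    m_values = sorted(
        math.log2((y_s / n_s) / (y_r / n_r))
        for y_s, y_r in zip(sample, ref)
        if y_s > 0 and y_r > 0
    )
    if not m_values:
        raise ValueError("no gene has reads in both samples")
    cut = int(len(m_values) * trim)
    kept = m_values[cut:len(m_values) - cut] or m_values
    return 2 ** (sum(kept) / len(kept))


# Toy data: ten genes; sample 1 expresses only the first five, sample 2 all ten.
sample1 = [100] * 5 + [0] * 5
sample2 = [100] * 10
factor = simple_trimmed_fold_change(sample1, sample2)
print(factor)
```

In this toy case the shared genes have the same raw counts, and the factor of 2 captures exactly the distortion that dividing by the total number of reads would introduce.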


Normalisation - an additional note on normalising across genes

• recent studies found position- and sequence-dependent biases¹²³⁴ that affect which fragments are sequenced

• e.g. 3' biases due to fragmentation, depletion of reads near exon-exon junctions because of difficulties in alignment, sequence preferences introduced by random hexamer priming, etc.

1. Mortazavi et al., Nature Methods, 2008
2. Wang et al., Nature Reviews Genetics, 2009
3. Bohnert and Ratsch, Nucleic Acids Research, 2010
4. Hansen et al., BMC Bioinformatics, 2009


Normalisation - an additional note on normalising across genes

• we may not be able to understand these biases, but several authors propose methods that correct for them by learning from the data¹²³

• is this kind of correction necessary? It depends, but the improvement in the estimates appears to be only slight³⁴.

1. Hansen et al., BMC Bioinformatics, 2009
2. Li et al., Genome Biology, 2010 (Bioconductor package mseq)
3. Roberts et al., Genome Biology, 2011
4. Turro et al., Genome Biology, 2011


Differential expression
Objective: comparing read counts between different biological conditions.

We want to use a statistical test to decide whether, for a given gene, an observed difference in read counts is significant.


Differential expression
edgeR (Robinson et al., 2010)

• Normalisation: quantile-adjusted method (or TMM)

• Model & test: negative binomial, exact test to obtain a p-value

DESeq (Anders & Huber, 2010)

• Normalisation: size factor estimation (for each sample, the median across genes of the ratio of a gene's read count in that sample to the gene's geometric mean across all samples)

• Model & test: negative binomial, exact test to obtain a p-value

baySeq (Hardcastle & Kelly, 2010)

• Normalisation: none

• Model & test: Bayesian approach with Poisson or negative binomial; estimates the posterior likelihood of each model and ranks the models rather than obtaining a p-value; not restricted to pairwise comparisons
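The size factor normalisation listed for DESeq can be sketched in a few lines of plain Python. This is an illustration of the median-of-ratios idea only, not the package's code: the function name and toy counts are hypothetical, and genes with a zero count in any sample are simply skipped because their geometric mean is zero.

```python
import math
from statistics import median


def size_factors(counts):
    """counts: one row per gene, one column per sample. Returns one size factor per sample."""
    n_samples = len(counts[0])
    log_ratios = [[] for _ in range(n_samples)]
    for gene in counts:
        if min(gene) <= 0:
            continue  # geometric mean would be zero, so the ratio is undefined
        log_geo_mean = sum(math.log(c) for c in gene) / n_samples
        for j, c in enumerate(gene):
            # log of (count in sample j / geometric mean of the gene across samples)
            log_ratios[j].append(math.log(c) - log_geo_mean)
    # The size factor of a sample is the median of its ratios across genes.
    return [math.exp(median(r)) for r in log_ratios]


# Toy data: sample 2 was sequenced twice as deeply; the last gene is a strong outlier.
toy_counts = [[100, 200], [50, 100], [30, 60], [10, 1000]]
sf = size_factors(toy_counts)
print(sf)
```

Because the median is used, the outlier gene does not change the estimated factors, which is the motivation for preferring this to the total number of reads.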


Differential expression
DESeq, edgeR and baySeq have been compared using simulated data¹². Overall, the studies found that the three methods perform similarly, with baySeq performing marginally better at ranking genes according to their significance.

DEGSeq (Wang et al., 2009)

• Normalisation: none, loess, median

• Model & test: MA-plot-based methods, likelihood ratio test, Fisher's exact test

1. Kvam et al., American Journal of Botany, 2012
2. Glaus et al., Bioinformatics, 2012


Acknowledgments
Ernest Turro, Cambridge University

John Marioni, EBI, Cambridge

Alvis Brazma, EBI, Cambridge


Practical


In the practical we will...

• align the reads to the genome with TopHat

• import aligned reads into R

• discard reads that map multiple times to the genome

• import annotation from files or public databases (e.g. Ensembl)

• count reads over genes and exons

• find novel transcribed regions

• test for differential expression with DESeq


SAM/BAM alignment files
All aligners output their results in the SAM/BAM format:

SRR034782.85 16 EG:4_group3 5036819 25 30M * 0 0 AAAACAATAAATTTCACTATTATTTCAAAA <<<<<<<<<<<<<<<<<<<<<<<<<<<<<< XT:A:U NM:i:2 X0:i:1 X1:i:0 XM:i:2 XO:i:0 XG:i:0 MD:Z:28G0G0

QNAME   ID of the read (“query”)
FLAG    alignment flags
RNAME   ID of the reference (typically: chromosome name)
POS     position in reference (1-based, left side)
MAPQ    mapping quality (as Phred score)
CIGAR   alignment description (gaps etc.) in CIGAR format
MRNM    mate reference sequence name [for paired-end data]
MPOS    mate position [for paired-end data]
ISIZE   inferred insert size [for paired-end data]
SEQ     sequence of the read
QUAL    quality string of the read
extra fields
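As an illustration of the field layout, the following Python sketch splits the example record above into its eleven mandatory fields and its extra tags. The function name is hypothetical, and the record is rebuilt with tab separators here because real SAM files are tab-delimited whereas the slide shows spaces.

```python
SAM_FIELDS = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
              "MRNM", "MPOS", "ISIZE", "SEQ", "QUAL"]


def parse_sam_line(line):
    """Split one SAM alignment line into named mandatory fields plus a dict of extra tags."""
    cols = line.rstrip("\n").split("\t")
    record = dict(zip(SAM_FIELDS, cols[:11]))
    for name in ("FLAG", "POS", "MAPQ", "MPOS", "ISIZE"):
        record[name] = int(record[name])
    tags = {}
    for field in cols[11:]:
        tag, value_type, value = field.split(":", 2)  # e.g. "NM:i:2"
        tags[tag] = int(value) if value_type == "i" else value
    record["tags"] = tags
    return record


slide_record = (
    "SRR034782.85 16 EG:4_group3 5036819 25 30M * 0 0 "
    "AAAACAATAAATTTCACTATTATTTCAAAA " + "<" * 30 + " "
    "XT:A:U NM:i:2 X0:i:1 X1:i:0 XM:i:2 XO:i:0 XG:i:0 MD:Z:28G0G0"
)
rec = parse_sam_line("\t".join(slide_record.split()))
is_reverse_strand = bool(rec["FLAG"] & 0x10)  # flag bit 0x10: read maps to the reverse strand
print(rec["QNAME"], rec["RNAME"], rec["POS"], rec["CIGAR"], rec["tags"]["NM"], is_reverse_strand)
```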


SAM/BAM CIGAR strings

    Reference  ATGTTAGATAA**GATAGCTGTGCTAGTAGGCAGTCAGCGCCAT
    Read 1          AGATAAAGGATA*CT
    Read 2                          aaaTAGTAGGCAGT
    Read 3                   ATAGCT..............TCAGCGCC
    Read 4                                   GCAGTCAACGCCAT

Read 1   CIGAR string: 6M 2I 4M 1D 2M
Read 2   CIGAR string: 3S 11M
Read 3   CIGAR string: 6M 14N 8M
Read 4   CIGAR string: 14M; extra tags: NM (number of mismatches) = 1
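A CIGAR string is a sequence of (length, operation) pairs, and it determines how many read bases and how many reference positions an alignment covers. The sketch below parses the four strings from the slides. The function names are hypothetical; the operation semantics follow the SAM convention (M, I, S, = and X consume read bases; M, D, N, = and X consume reference positions).

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def parse_cigar(cigar):
    """Return a list of (length, operation) pairs; spaces between operations are ignored."""
    return [(int(n), op) for n, op in CIGAR_OP.findall(cigar.replace(" ", ""))]


def cigar_lengths(cigar):
    """Return (number of read bases, number of reference positions spanned)."""
    ops = parse_cigar(cigar)
    read_len = sum(n for n, op in ops if op in "MIS=X")
    ref_len = sum(n for n, op in ops if op in "MDN=X")
    return read_len, ref_len


for cigar in ["6M 2I 4M 1D 2M", "3S 11M", "6M 14N 8M", "14M"]:
    print(cigar, cigar_lengths(cigar))
```

All four reads are 14 bases long, but the spliced read (6M 14N 8M) spans 28 reference positions, which is why a spliced-alignment-aware representation matters for RNA-seq.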


SAM/BAM extra tags

NM   Edit distance to the reference, including ambiguous bases but excluding clipping
NH   Number of reported alignments that contain the query in the current record
H0   Number of perfect hits
H1   Number of 1-difference hits (see also NM)
H2   Number of 2-difference hits
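The NH tag is one way to discard reads that map multiple times to the genome: keep only records that report a single alignment. The sketch below assumes the aligner writes NH and that records are already parsed into dictionaries; the record layout, read names and function name are hypothetical.

```python
def is_uniquely_mapped(tags):
    """True only if the NH tag is present and reports exactly one alignment."""
    return tags.get("NH") == 1


# Hypothetical parsed records; in practice these would come from a SAM/BAM reader.
records = [
    {"QNAME": "read1", "tags": {"NH": 1, "NM": 0}},
    {"QNAME": "read2", "tags": {"NH": 3, "NM": 1}},  # reported at three locations
    {"QNAME": "read3", "tags": {"NM": 2}},           # NH missing: treated as not unique here
]
unique = [r["QNAME"] for r in records if is_uniquely_mapped(r["tags"])]
print(unique)
```

Treating a missing NH tag as "not unique" is a conservative design choice made for this sketch; a different pipeline might prefer to keep such reads.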


SAM/BAM alignment files
The BAM format can be read easily into R.

ShortRead package:

• can read the proprietary formats of almost every sequencer manufacturer

• loads most information (read DNA sequence, base quality, some alignment information) but does not import the extra tags, so it is not flexible

• stores data in an efficient object of class AlignedRead

• provides sequence manipulation and QA reports

• is not aware of spliced alignments.


SAM/BAM alignment files
The BAM format can be read easily into R.

Rsamtools package:

• can read all BAM fields

• gives the flexibility to choose which fields and which subset of the data to import (for example by genomic location, by alignment type, etc.)

• reads all data into lists, which is not very efficient in memory or speed

• is a low-level interface with no data manipulation functions.


SAM/BAM alignment files
The BAM format can be read easily into R.

GenomicRanges package:

• builds on top of Rsamtools, so it is also flexible

• by default discards some information for the sake of memory usage and speed (read sequence, quality string, etc.)

• stores alignments in an efficient GappedAlignments object

• provides useful functions for finding overlaps and coverage

• is aware of spliced alignments

• does not import extra tag fields.


Importing annotation
rtracklayer package:

• reads in annotation from GFF files

biomaRt:

• interface to a growing collection of databases such as Ensembl, Uniprot and HapMap for retrieving annotation and reference sequences

• returns a data.frame

GenomicRanges:

• represents the features as a GRanges object that can be overlapped with the aligned reads (in the GappedAlignments object)
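To show what "annotation as ranges" means in practice, here is a small Python sketch that parses one GFF line into an interval and tests it for overlap with a read. It is only an illustration of the data involved, not the rtracklayer or GenomicRanges interface. The example line, the gene name and the function names are hypothetical; the nine tab-separated columns and the 1-based inclusive coordinates follow the GFF convention.

```python
def parse_gff_line(line):
    """Parse one 9-column GFF line into a dict; the attributes column is kept as raw text."""
    seqid, source, ftype, start, end, score, strand, phase, attributes = (
        line.rstrip("\n").split("\t")
    )
    return {"seqid": seqid, "type": ftype, "start": int(start), "end": int(end),
            "strand": strand, "attributes": attributes}


def overlaps(a_start, a_end, b_start, b_end):
    """Overlap test for two 1-based, inclusive intervals on the same sequence."""
    return a_start <= b_end and b_start <= a_end


# Hypothetical annotation line and read coordinates.
gff_line = "chr4\tensembl\texon\t1000\t1200\t.\t+\t.\tgene_id \"geneA\";"
exon = parse_gff_line(gff_line)
read_start, read_end = 1190, 1219  # a 30 bp read
hit = overlaps(read_start, read_end, exon["start"], exon["end"])
print(exon["type"], exon["start"], exon["end"], hit)
```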


Quantifying expression
IRanges and GenomicRanges:

• utilities for overlapping aligned reads and genomic features (genes, exons, etc.)

• aware of spliced alignments

• several modes for counting: union, intersection strict, intersection non-empty

We will end up with a table of counts per gene and per exon, which can be used as the input for DE packages such as DESeq, edgeR, baySeq...


Quantifying expression

[Figure: a read aligned to the genome at a locus with annotated genes g1 and g2]

We have already removed multi-mapping reads, but what if a read overlaps several features in the same locus?

There is no optimal solution.


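How to avoid double counting?

[Figure: five scenarios of a READ overlapping GENE A, or GENE A and GENE B, with the assignment made by each counting mode]

Scenario (figure labels)                        Union       IntStrict   IntNonEmpty
1. READ, GENE A                                 A           A           A
2. READ, GENE A                                 A           -           A
3. READ (two segments), GENE A (two segments)   A           A           A
4. READ, GENE A, GENE B                         ambiguous   A           A
5. READ, GENE A, GENE B                         ambiguous   ambiguous   ambiguous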
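The three counting modes can be expressed as set operations on the features found at each position of a read. The Python sketch below follows the usual definitions of union, intersection-strict and intersection-nonempty counting and is not the GenomicRanges implementation. The coordinates of the five scenarios are assumptions chosen to be consistent with the table above, since the figure geometry is not recoverable from the slide text.

```python
def features_at(pos, genes):
    """Set of gene names whose exons (half-open intervals) cover this position."""
    return frozenset(name for name, exons in genes.items()
                     if any(start <= pos < end for start, end in exons))


def assign(read_segments, genes, mode):
    """Assign a read (a list of half-open segments) to a gene, '-' or 'ambiguous'."""
    sets = [features_at(p, genes) for s, e in read_segments for p in range(s, e)]
    if mode == "union":
        result = frozenset().union(*sets)
    elif mode == "intersection_strict":
        result = frozenset.intersection(*sets)
    elif mode == "intersection_nonempty":
        non_empty = [x for x in sets if x]
        result = frozenset.intersection(*non_empty) if non_empty else frozenset()
    else:
        raise ValueError(f"unknown mode: {mode}")
    if len(result) == 1:
        return next(iter(result))
    return "-" if not result else "ambiguous"


MODES = ("union", "intersection_strict", "intersection_nonempty")
gene_a_only = {"A": [(0, 100)]}
gene_a_two_exons = {"A": [(0, 50), (60, 100)]}
genes_a_and_b = {"A": [(0, 100)], "B": [(50, 100)]}

scenarios = [
    ([(10, 20)], gene_a_only),               # read inside gene A
    ([(90, 110)], gene_a_only),              # read extends past the end of gene A
    ([(40, 50), (60, 70)], gene_a_two_exons),  # spliced read over two exons of gene A
    ([(40, 60)], genes_a_and_b),             # read inside A, partly inside B
    ([(60, 70)], genes_a_and_b),             # read inside both A and B
]
table = [[assign(read, genes, mode) for mode in MODES] for read, genes in scenarios]
for row in table:
    print(row)
```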


Defining novel transcribed regions
IRanges and GenomicRanges:

• define coverage

• find “islands” of expression

• define potential “exons” by setting thresholds (in the figure: > 54 bp, > 3 reads)

• retain only regions that do not overlap annotated features
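The four steps above can be sketched in plain Python on a toy sequence. This is an illustration of the logic rather than the IRanges/GenomicRanges interface. The thresholds are taken from the slide and interpreted here as "coverage above 3 reads over more than 54 bp", which is an assumption; the read and annotation coordinates and the function names are hypothetical.

```python
def coverage(reads, length):
    """Per-base read depth for reads given as half-open (start, end) intervals."""
    cov = [0] * length
    for start, end in reads:
        for p in range(start, end):
            cov[p] += 1
    return cov


def islands(cov, min_depth=3, min_width=54):
    """Maximal runs with depth > min_depth that are wider than min_width bases."""
    found, start = [], None
    for p, depth in enumerate(cov + [0]):  # trailing 0 closes a run at the sequence end
        if depth > min_depth and start is None:
            start = p
        elif depth <= min_depth and start is not None:
            if p - start > min_width:
                found.append((start, p))
            start = None
    return found


def not_annotated(regions, annotated):
    """Keep only regions that overlap no annotated feature (half-open intervals)."""
    return [(s, e) for s, e in regions
            if not any(s < a_end and a_start < e for a_start, a_end in annotated)]


# Toy data: three piles of four identical reads on a 300 bp sequence.
reads = [(10, 80)] * 4 + [(100, 130)] * 4 + [(200, 270)] * 4
annotated = [(0, 90)]  # one known feature

cov = coverage(reads, 300)
candidate_exons = islands(cov)
novel_regions = not_annotated(candidate_exons, annotated)
print(candidate_exons, novel_regions)
```

The middle pile is deep enough but only 30 bp wide, so it is dropped by the width threshold; the first island is dropped because it overlaps the annotated feature.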


What’s going on right now?

Read and do the practical till the end of section 3.4.
