RNA-seq data analysis with Bioconductor

Ângela Filimon Gonçalves

Functional Genomics Team at the European Bioinformatics Institute - EMBL

Tuesday, 11 September 2012

Analysis pipeline overview

From raw sequence files to expression levels:

1. Quality control and filtering

2. Read mapping

3. Quantifying expression

4. Normalisation

Downstream analysis:

5. E.g. Differential expression






Read mapping

Which reads come from which regions of a genome or transcript?

Reads can be:

• assembled de novo to find the transcriptome

or mapped to

• a genome

• a transcriptome.

decreasing complexity






De novo read assembly

Objective:

• find the set of longest possible contiguously expressed regions (contigs) without using a reference - by exploiting the overlaps between reads

One of the most prevalent algorithmic approaches uses a de Bruijn graph representation

• implemented in methods such as:

• Trinity (Grabherr, 2011)

• Trans-ABySS (Robertson, 2010)

• or Oases (Schulz, 2012).


De novo read assembly

E.g. we sequence a transcript (ATGGCGTGCA) in three reads:

ATGGC, GCGTG, GTGCA

We choose a length k (e.g. k = 4) and create two subsets:

• one with all possible substrings of length k (known as k-mers):

ATGG, TGGC, GCGT, CGTG, GTGC, TGCA

• and another with substrings of length k-1:

ATG, TGG, GGC, GCG, CGT, GTG, TGC, GCA


De novo read assembly

Build a graph, with the k-1-mers as nodes and the k-mers as directed edges:

ATG -> TGG (edge ATGG)

TGG -> GGC (edge TGGC)

What now?

No k-mer starts with GGC, so we cannot extend the contig, but we can create a second one:

GCG -> CGT (edge GCGT)

CGT -> GTG (edge CGTG)

GTG -> TGC (edge GTGC)

TGC -> GCA (edge TGCA)

Contigs: ATGGC and GCGTGCA



Why was the transcript split into 2?

Minimum overlap between reads is only 2, so our choice of k (4) is too high.

Try k = 3 (more edges, fewer nodes):

nodes: AT, TG, GG, GC, CG, GT, CA

edges: ATG (AT -> TG), TGG (TG -> GG), GGC (GG -> GC), GCG (GC -> CG), CGT (CG -> GT), GTG (GT -> TG), TGC (TG -> GC), GCA (GC -> CA)
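The toy example can be reproduced with a few lines of Python. This is only a minimal sketch: the function names (kmers, de_bruijn, simple_contigs) are made up for illustration, and the contig walk simply follows unambiguous paths, which is far simpler than what assemblers such as Trinity or Oases actually do.

```python
from collections import defaultdict

def kmers(reads, k):
    """Return the set of all substrings of length k found in the reads."""
    return {read[i:i + k] for read in reads for i in range(len(read) - k + 1)}

def de_bruijn(reads, k):
    """Map each (k-1)-mer node to the set of (k-1)-mer nodes it points to."""
    graph = defaultdict(set)
    for kmer in kmers(reads, k):
        graph[kmer[:-1]].add(kmer[1:])   # each k-mer is one directed edge
    return graph

def simple_contigs(reads, k):
    """Walk unambiguous paths, starting at nodes with no incoming edge."""
    graph = de_bruijn(reads, k)
    indegree = defaultdict(int)
    for targets in graph.values():
        for target in targets:
            indegree[target] += 1
    nodes = set(graph) | set(indegree)
    contigs = []
    for start in sorted(nodes):
        if indegree[start] != 0:
            continue
        contig, node = start, start
        # extend while there is exactly one way forward and no other way in
        while len(graph.get(node, ())) == 1:
            (nxt,) = graph[node]
            if indegree[nxt] != 1:
                break
            contig += nxt[-1]
            node = nxt
        contigs.append(contig)
    return contigs

reads = ["ATGGC", "GCGTG", "GTGCA"]
contigs_k4 = simple_contigs(reads, 4)      # the two contigs from the slides

graph_k3 = de_bruijn(reads, 3)
nodes_k3 = set(graph_k3) | {t for ts in graph_k3.values() for t in ts}
edges_k3 = sum(len(ts) for ts in graph_k3.values())
```

With k = 3 the graph contains branching nodes (TG and GC each have two outgoing edges), so the simple walk above is no longer enough to recover the transcript; resolving such branches is where real assemblers spend most of their effort.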


De novo read assembly

Smaller k:

• increases the chance of observing overlaps -> more complex graphs

Larger k:

• decreases the chance of observing overlaps -> simpler graphs

There is a balance between sensitivity and specificity determined by k -> k has a significant influence on the quality of the assembly.

The best k depends on the coverage, read length and error rate, but it is hard to estimate in advance.

Often a small range of values is tested to find the optimal k.


De novo read assembly

For genome assembly the coverage is reasonably even, so a single k works well.

For RNA-seq the coverage depends on the expression level:

• highly expressed genes -> more overlap between reads -> the optimal k is larger -> allows long contigs and produces simpler graphs

• lowly expressed genes -> less overlap -> optimal k is smaller -> complex graphs

-> use a range of k values and merge the resulting contigs.



Additional complexity arises from:

• heterozygosity

• sequencing errors

• alternative splicing!


De novo read assembly

How to assemble the contigs into gene models? There are many methods (e.g. Robertson, 2010):

contig merging and expansion

connecting contigs using paired-end reads

alignment of the contigs to the genome and comparison with known annotation

etc...


De novo read assembly

Methods using de Bruijn graphs can be used to assemble transcriptomes de novo.

Con

• the assembly is difficult and only the most abundant transcripts are likely to be fully assembled

Pro

• useful for when there is no reference genome for the species in question and/or the annotation is of poor quality.






Read mapping

[Figure: a paired-end read (1st mate, 2nd mate) mapped to the genome and to a transcript.]


Read mapping

TopHat - a hybrid approach¹

• first, align the reads to the genome contiguously

• then assemble the aligned reads into putative exons

• map the remaining reads to the set of possible pairings between those exons.

1. Trapnell et al., Bioinformatics, 2009

Read mapping

Gene model assembly with Cufflinks - find the smallest possible set of transcripts that explains all the observed reads by:

1. constructing a directed acyclic graph where the reads are nodes and the nodes are connected if

• the reads overlap with one another and

• they are not incompatible

2. partitioning the graph into the minimum number of paths, using Dilworth’s theorem.
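A standard way to compute such a minimum cover on a small example is through bipartite matching: for a transitively closed compatibility relation on n reads, the minimum number of chains equals n minus the size of a maximum matching, and by Dilworth’s theorem this also equals the size of the largest set of mutually incompatible reads. The sketch below illustrates that idea only; the function name and the toy input are hypothetical and this is not Cufflinks' actual code.

```python
def min_chain_cover(n, compatible):
    """Minimum number of chains needed to cover n reads.

    `compatible` lists pairs (u, v) meaning read u can be followed by read v
    in the same transcript. The relation is assumed to be transitively closed.
    """
    successors = {u: [] for u in range(n)}
    for u, v in compatible:
        successors[u].append(v)

    matched_to = {}  # read v -> the read u currently placed directly before it

    def try_augment(u, seen):
        # Kuhn's augmenting-path search for bipartite matching
        for v in successors[u]:
            if v in seen:
                continue
            seen.add(v)
            if v not in matched_to or try_augment(matched_to[v], seen):
                matched_to[v] = u
                return True
        return False

    matching = sum(try_augment(u, set()) for u in range(n))
    return n - matching

# Toy example (hypothetical): reads 1 and 2 are incompatible with each other
# (e.g. they support different splice variants), but both are compatible
# with reads 0 and 3.
pairs = [(0, 1), (0, 2), (0, 3), (1, 3), (2, 3)]
n_transcripts = min_chain_cover(4, pairs)
```

Here the largest set of mutually incompatible reads is {1, 2}, so at least two transcripts are needed to explain the data.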


Read mapping

inferred transcript models


Read mapping

Gene model assembly with Cufflinks:

the minimum number of paths is not necessarily unique -> need to phase distant exons together

Cufflinks uses the expression levels to weight the edges and find the most likely path


Read mapping

Gene model assembly with Cufflinks:

• because it’s a de novo method it requires high coverage (high read overlap) to properly assemble transcripts

• one caveat is that all paths are maximally extended -> impossible to detect some instances of alternative transcription start and end sites

• finds the minimum set of transcripts that explains the data (Occam’s Razor principle), which is not necessarily what evolution comes up with...


Read mapping

Mapping to the genome

Pros

• discovery of novel transcribed regions

• de novo assembly of gene models for species with no annotation

Cons

• gapped mapping algorithms have limitations - even when allowing gaps in mapping to the genome quite a few of the reads that don’t map would do so contiguously to the transcriptome

• the analysis of the result in order to obtain transcript expression is more complex


Read mapping

Mapping to the transcriptome

Pros

• less complexity in mapping - reads map contiguously

• the output is easily interpretable

• faster than mapping to a genome

Cons

• only for annotated regions -> you must be reasonably confident in the annotation

• multiple transcripts for the same gene (reads might map equally well to these because of shared sequence)








Quantifying expression

“Easy”: count the number of reads that aligned to a gene, exon, splice-junction...

What about multi-mapping reads?

[Figure: a read aligned to the genome, with annotated transcripts g1 and g2.]

Quantifying expression

[Figure: a read aligned to the genome, with annotated transcripts g1 t1 and g1 t2.]

And what about isoform levels?


Quantifying expression

Discard multi-mapping reads:

• good at the gene, exon, splice junction level

• not so good at the isoform level, use an alternative method

Assign the reads probabilistically according to uniquely mapping neighbouring reads (still doesn’t work for isoform levels).

Use a method for estimating the expression most likely to explain the data

• MMSEQ (Turro, 2011) - alignment to the transcriptome

• Cufflinks (Trapnell, 2010) - alignment to the genome + de novo gene model assembly

• and several others (POEM, MISO, RSEM...)


Quantifying expression

Multi-mapping reads

• Obtain transcript sets, such that each read maps to only 1 set

• Transcripts may belong to more than one set

• Read counts per set can be observed

• Transcripts can be isoforms sharing exons or from multiple genes

[Figure: reads mapped to transcripts A, B, C and D; the transcripts are grouped into transcript sets 1, 2 and 3.]

MMSEQ¹

• map reads to transcript sequences with Bowtie

1. Turro et al., Genome Biology, 2011

Quantifying expression

MMSEQ: latent variables for read counts

[Figure: transcripts A, B, C and D and transcript sets 1, 2 and 3.]

Poisson model for transcript set read counts

Model reads per transcript set instead of per gene.

Define

$$M_{it} = \begin{cases} 1 & \text{if transcript } t \text{ is in set } i, \\ 0 & \text{otherwise.} \end{cases}$$

For the example in the figure (rows are transcript sets 1 to 3, columns are transcripts A to D):

$$M = \begin{pmatrix} 1 & 1 & 0 & 0 \\ 1 & 1 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix}, \qquad X = \begin{pmatrix} X_{11} & X_{12} & 0 & 0 \\ X_{21} & X_{22} & X_{23} & 0 \\ 0 & 0 & 0 & X_{34} \end{pmatrix}$$

$$k = \begin{pmatrix} 2 \\ 1 \\ 2 \end{pmatrix}, \qquad r = \begin{pmatrix} r_1 & r_2 & r_3 & r_4 \end{pmatrix}$$

• k: observed set counts (k_i is the observed number of reads in set i, the sum of row i of X)

• r: unobserved transcript counts (r_t is the sum of column t of X)

Now the model for the read counts is:

$$k_i \sim \text{Poisson}\Big(b\, s_i \sum_t M_{it}\,\mu_t\Big),$$

where s_i is the effective length shared by transcripts in set i, b and s_i are normalisation terms, and µ_t is the unknown real expression of transcript t.

$$X_{it} \sim \text{Poisson}(b\, s_i\, M_{it}\, \mu_t),$$

$$r_t \sim \text{Poisson}\Big(b\, \mu_t \sum_i M_{it}\, s_i\Big) = \text{Poisson}(b\, l_t\, \mu_t),$$

$$\{X_{1t}, \ldots, X_{mt}\} \mid r_t \sim \text{Mult}\left(r_t,\ \frac{M_{1t}\, s_1}{\sum_i M_{it}\, s_i},\ \ldots,\ \frac{M_{mt}\, s_m}{\sum_i M_{it}\, s_i}\right).$$

µ_t can be estimated with an EM algorithm.

A similar model has been used by Richard et al. 2010 and Jiang and Wong 2009 (defining transcript sets by exon-sharing; they don’t use reads mapping to multiple genes), and also by Li et al. 2010.

None of these have used haplotype-specific isoforms.

Slide: Alex Lewin (Imperial), Haplotype-specific isoform expression, 12 April 2011
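One way to make the EM estimation concrete for the model k_i ~ Poisson(b s_i Σ_t M_it µ_t) is the following minimal sketch. The E-step splits each observed set count k_i among the transcripts in the set in proportion to their current µ_t, and the M-step sets µ_t to the expected transcript count divided by b l_t, where l_t = Σ_i M_it s_i. The function name, the values b = 1 and s_i = 1, the starting values and the fixed iteration count are all assumptions made for illustration; this is not the MMSEQ implementation.

```python
def em_expression(M, k, s, b=1.0, n_iter=200):
    """Estimate transcript expression mu_t from transcript-set read counts."""
    m, n = len(M), len(M[0])
    # l_t: effective length of transcript t, summed over the sets containing it
    l = [sum(M[i][t] * s[i] for i in range(m)) for t in range(n)]
    mu = [1.0] * n                     # arbitrary positive starting values
    for _ in range(n_iter):
        r = [0.0] * n                  # expected (unobserved) transcript counts
        for i in range(m):
            denom = sum(M[i][t] * mu[t] for t in range(n))
            if denom <= 0:
                continue
            for t in range(n):
                if M[i][t]:
                    # E-step: share of set i's reads attributed to transcript t
                    r[t] += k[i] * mu[t] / denom
        # M-step: maximum-likelihood update given the expected counts
        mu = [r[t] / (b * l[t]) for t in range(n)]
    return mu

M = [[1, 1, 0, 0],
     [1, 1, 1, 0],
     [0, 0, 0, 1]]
k = [2, 1, 2]            # observed set counts from the example
s = [1.0, 1.0, 1.0]      # hypothetical effective lengths
mu = em_expression(M, k, s)
```

In this example transcripts A and B belong to exactly the same sets, so the data cannot tell them apart and the estimates stay equal to each other; transcript D is the only member of set 3, so its estimate is simply k_3 / (b s_3).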








Normalisation

Do we need normalisation?

• longer transcripts produce more reads than shorter ones - divide by length

• and read count is proportional to the depth of sampling - divide by total number of reads
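Dividing by both quantities gives a measure of the RPKM type (reads per kilobase of transcript per million mapped reads). A minimal sketch, with a hypothetical function name and made-up numbers:

```python
def rpkm(read_count, transcript_length_bp, total_reads):
    """Reads per kilobase of transcript per million mapped reads."""
    per_kilobase = read_count / (transcript_length_bp / 1_000)
    return per_kilobase / (total_reads / 1_000_000)

# Hypothetical example: 1,000 reads on a 2,000 bp transcript in a library
# of 10 million mapped reads.
value = rpkm(1_000, 2_000, 10_000_000)
```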


Normalisation

                         Sample 1      Sample 2
Gene 1                  1,000,000     1,000,000
Gene 2                          0     1,000,000
:                               :             :
Gene 19,999                20,000        20,000
Gene 20,000                     0        20,000
Total Number of Reads  10,000,000    20,000,000


Normalisation

                         Sample 1   Norm Sample 1     Sample 2   Norm Sample 2
Gene 1                  1,000,000             0.1    1,000,000            0.05
Gene 2                          0               0    1,000,000            0.05
:                               :               :            :               :
Gene 19,999                20,000           0.002       20,000           0.001
Gene 20,000                     0               0       20,000           0.001
Total Number of Reads  10,000,000                   20,000,000
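The normalised columns are simply the counts divided by the total number of reads in each sample. The short sketch below reproduces them and shows the resulting artefact: Gene 1 has identical raw counts in both samples, yet after total-count scaling it appears to be halved in Sample 2. The variable names are illustrative only.

```python
counts = {
    "Gene 1": (1_000_000, 1_000_000),
    "Gene 2": (0, 1_000_000),
    "Gene 19,999": (20_000, 20_000),
    "Gene 20,000": (0, 20_000),
}
totals = (10_000_000, 20_000_000)   # total number of reads per sample

# Divide every count by the total number of reads of its sample
normalised = {
    gene: tuple(c / total for c, total in zip(per_sample, totals))
    for gene, per_sample in counts.items()
}

# Apparent fold change of Gene 1 between Sample 2 and Sample 1
gene1_fold_change = normalised["Gene 1"][1] / normalised["Gene 1"][0]
```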


Normalisation

Robinson and Oshlack¹ propose a method that, after trimming the data, calculates a weighted global fold change between the two samples, which is used as a normalisation factor.

This, and other similar approaches²³, can better account for differences in read count between samples.

1. Robinson and Oshlack, Genome Biology, 2010
2. Bullard et al., BMC Bioinformatics, 2010
3. Anders and Huber, Genome Biology, 2010


Normalisation - an additional note on normalising across genes

• recent studies found that there are position and sequence-dependent biases¹²³⁴ that affect which fragments are sequenced

• e.g. 3’ biases due to fragmentation, depletion of reads near exon-exon junctions because of difficulties in alignment, sequence preferences introduced by random hexamer priming, etc...

1. Mortazavi et al., Nature Methods, 2008
2. Wang et al., Nature Reviews Genetics, 2009
3. Bohnert and Ratsch, Nucleic Acids Research, 2010
4. Hansen et al., BMC Bioinformatics, 2009


Normalisation - an additional note on normalising across genes

• we may not be able to understand these biases, but several authors propose methods for correcting for them by learning from the data¹²³

• is this kind of correction necessary? It depends, but the improvement in the estimates appears to be only slight³⁴.

1. Hansen et al., BMC Bioinformatics, 2009
2. Li et al., Genome Biology, 2010 -> Bioconductor package mseq
3. Roberts et al., Genome Biology, 2011
4. Turro et al., Genome Biology, 2011








Differential expression

Objective: comparing read counts between different biological conditions.

We want to use a statistical test to decide whether, for a given gene, an observed difference in read counts is significant.


Differential expression

edgeR (Robinson et al., 2010)

• Normalisation: quantile adjusted method (or TMM)

• Model & test: Negative binomial, exact test to obtain a p-value

DESeq (Anders & Huber, 2010)

• Normalisation: size factor estimation (for sample i, the median over genes of the ratio of each gene's read count in sample i to that gene's geometric mean across all samples)

• Model: Negative binomial, exact test to obtain a p-value

baySeq (Hardcastle & Kelly, 2010)

• Normalisation: none

• Model & test: Bayesian approach with Poisson or Negative binomial, estimate posterior likelihood of each model and rank the models rather than obtaining a p-value, not restricted to pairwise comparisons
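As an illustration of the DESeq-style size factor described above, here is a minimal sketch in plain Python. For each gene it takes the geometric mean of the counts across samples, and for each sample it takes the median, over genes, of the ratio between the gene's count and that geometric mean. Genes with a zero count in any sample are skipped, and the median is taken on the log scale. The function name and the count matrix are hypothetical; in practice the DESeq package computes this for you.

```python
import math
from statistics import median

def size_factors(counts):
    """counts: one row per gene, each row holding the per-sample read counts."""
    n_samples = len(counts[0])
    # log geometric mean of each gene across samples (None if any count is 0)
    log_geo_means = [
        sum(math.log(c) for c in row) / n_samples if all(c > 0 for c in row) else None
        for row in counts
    ]
    factors = []
    for j in range(n_samples):
        log_ratios = [
            math.log(row[j]) - lgm
            for row, lgm in zip(counts, log_geo_means)
            if lgm is not None
        ]
        factors.append(math.exp(median(log_ratios)))
    return factors

# Hypothetical counts: sample 2 was sequenced twice as deeply as sample 1,
# and the last gene is only detected in sample 2.
counts = [[10, 20], [100, 200], [5, 10], [0, 7]]
factors = size_factors(counts)
```

Because the median is used, a handful of genes that are only expressed in one sample (like the last row) do not distort the factors, unlike scaling by the total number of reads.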


Differential expression

DESeq, edgeR and baySeq have been compared using simulated data¹². Overall, the studies found that the three methods perform similarly, with baySeq performing marginally better in ranking genes according to their significance.

DEGSeq (Wang et al., 2009)

• Normalisation: none, loess, median

• Model & test: MA-plot based methods, Likelihood Ratio Test, Fisher’s Exact Test

1. Kvam et al., American Journal of Botany, 2012
2. Glaus et al., Bioinformatics, 2012






Acknowledgments

Ernest Turro, Cambridge University

John Marioni, EBI, Cambridge

Alvis Brazma, EBI, Cambridge


Practical


In the practical we will...

• align the reads to the genome with TopHat

• import aligned reads into R

• discard reads that map multiple times to the genome

• import annotation from files or public databases (e.g. Ensembl)

• count reads over genes and exons

• find novel transcribed regions

• differential expression with DESeq


SAM/BAM alignment files

All aligners output their results in the SAM/BAM format:

SRR034782.85 16 EG:4_group3 5036819 25 30M * 0 0 AAAACAATAAATTTCACTATTATTTCAAAA <<<<<<<<<<<<<<<<<<<<<<<<<<<<<< XT:A:U NM:i:2 X0:i:1 X1:i:0 XM:i:2 XO:i:0 XG:i:0 MD:Z:28G0G0

QNAME   ID of the read (“query”)
FLAG    alignment flags
RNAME   ID of the reference (typically: chromosome name)
POS     Position in reference (1-based, left side)
MAPQ    Mapping quality (as Phred score)
CIGAR   Alignment description (gaps etc.) in CIGAR format
MRNM    Mate reference sequence name [for paired end data]
MPOS    Mate position [for paired end data]
ISIZE   inferred insert size [for paired end data]
SEQ     sequence of the read
QUAL    quality string of the read
extra fields
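To make the field layout concrete, the sketch below splits the example record into the eleven mandatory fields and the optional extra tags. It assumes the fields are tab-separated, as in a SAM file (the slide shows them separated by spaces), and uses the field names from the list above. The function name is made up; in the practical the files are read with Bioconductor packages rather than parsed by hand.

```python
SAM_FIELDS = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
              "MRNM", "MPOS", "ISIZE", "SEQ", "QUAL"]
INTEGER_FIELDS = {"FLAG", "POS", "MAPQ", "MPOS", "ISIZE"}

def parse_sam_line(line):
    """Split one SAM alignment line into a dict of fields plus extra tags."""
    columns = line.rstrip("\n").split("\t")
    record = {
        name: int(value) if name in INTEGER_FIELDS else value
        for name, value in zip(SAM_FIELDS, columns)
    }
    tags = {}
    for item in columns[11:]:
        tag, value_type, value = item.split(":", 2)   # e.g. NM:i:2
        tags[tag] = int(value) if value_type == "i" else value
    record["tags"] = tags
    return record

example = "\t".join([
    "SRR034782.85", "16", "EG:4_group3", "5036819", "25", "30M", "*", "0", "0",
    "AAAACAATAAATTTCACTATTATTTCAAAA", "<" * 30,
    "XT:A:U", "NM:i:2", "X0:i:1", "X1:i:0", "XM:i:2", "XO:i:0", "XG:i:0",
    "MD:Z:28G0G0",
])
record = parse_sam_line(example)

# Bit 0x10 of FLAG is set when the read aligned to the reverse strand
is_reverse_strand = bool(record["FLAG"] & 0x10)
```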


SAM/BAM CIGAR strings

Reference: ATGTTAGATAA**GATAGCTGTGCTAGTAGGCAGTCAGCGCCAT

Read                             CIGAR string
AGATAAAGGATA*CT                  6M 2I 4M 1D 2M
aaaTAGTAGGCAGT                   3S 11M
ATAGCT..............TCAGCGCC     6M 14N 8M
GCAGTCAACGCCAT                   14M

(* marks a gap in the alignment, lower case marks soft-clipped bases and dots mark a skipped region of the reference.)

For the last read, extra tags: NM (number of mismatches) = 1
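A CIGAR string can be decoded with a small regular expression. In the SAM format the operations M, I, S, = and X consume bases of the read, while M, D, N, = and X consume positions of the reference; the sketch below uses this to compute the read length and the reference span for the four examples. The helper names are made up for illustration.

```python
import re

CIGAR_OPERATION = re.compile(r"(\d+)([MIDNSHP=X])")

def parse_cigar(cigar):
    """Turn e.g. '6M 2I 4M 1D 2M' into [(6, 'M'), (2, 'I'), ...]."""
    return [(int(n), op) for n, op in CIGAR_OPERATION.findall(cigar.replace(" ", ""))]

def read_length(cigar):
    """Number of read bases described by the CIGAR string."""
    return sum(n for n, op in parse_cigar(cigar) if op in "MIS=X")

def reference_span(cigar):
    """Number of reference positions covered by the alignment."""
    return sum(n for n, op in parse_cigar(cigar) if op in "MDN=X")

examples = ["6M 2I 4M 1D 2M", "3S 11M", "6M 14N 8M", "14M"]
summary = {c: (read_length(c), reference_span(c)) for c in examples}
```

All four reads are 14 bases long, but they cover 13, 11, 28 and 14 reference positions respectively; the spliced read (14N) spans the most because the skipped intron counts towards the reference span.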


SAM/BAM extra tags

NM   Edit distance to the reference, including ambiguous bases but excluding clipping
NH   Number of reported alignments that contain the query in the current record
H0   Number of perfect hits
H1   Number of 1-difference hits (see also NM)
H2   Number of 2-difference hits









SAM/BAM alignment files

The BAM format can be read easily into R.

ShortRead package

• is able to read the proprietary formats of almost every sequencer manufacturer

• loads most information (read DNA sequence, base quality, some alignment information, does not import the extra tags...) - not flexible

• stores data into an efficient object of class AlignedRead

• sequence manipulation and QA reports

• not aware of spliced alignments.



Rsamtools package:

• can read ALL bam fields

• flexibility to choose which fields and which subset of the data to import (for example by genomic location, by alignment type, etc...)

• reads all data into lists, not very efficient for memory or speed

• low level interface with no data manipulation functions.



GenomicRanges package

• builds on top of Rsamtools so it’s also flexible

• by default discards some information for the sake of memory usage and speed (read sequence, quality string, etc...)

• stores alignments into an efficient GappedAlignments object

• useful functions for finding overlaps and coverage

• is aware of spliced alignments

• does not import extra tag fields.




















Importing annotation

rtracklayer package:

• read in annotation from GFF files

biomaRt:

• interface to a growing collection of databases such as Ensembl, Uniprot and HapMap for retrieving annotation and reference sequences

• returns a data.frame

GenomicRanges:

• represent the features as a GRanges object that can be overlapped with the aligned reads (in the GappedAlignments object)










Quantifying expression

IRanges and GenomicRanges:

• utilities for overlapping aligned reads and genomic features (genes, exons, etc...)

• aware of spliced alignments

• several modes for counting: union, intersection strict, intersection non-empty

We will end up with a table of counts per gene and per exon which can be used as the input for DE packages such as DESeq, edgeR, baySeq...


Quantifying expression

[Figure: a read aligned to the genome, overlapping annotated genes g1 and g2.]

We already removed multi-mapping reads, but what if the read overlaps several features in the same locus?

There is no optimal solution.


How to avoid double counting?

Read position                                  Union       IntStrict   IntNonEmpty
read within gene A                             A           A           A
read partly outside gene A                     A           -           A
spliced read, both parts within gene A         A           A           A
read within gene A, partly overlapping gene B  ambiguous   A           A
read within both gene A and gene B             ambiguous   ambiguous   ambiguous
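The three counting modes can be described position by position: for every base of the read, collect the set of features covering it, then combine the sets. Union takes the union of all sets, intersection strict takes the intersection of all sets, and intersection non-empty takes the intersection of the non-empty sets only. A read is assigned when exactly one feature remains, is not counted ("-") when none remains, and is ambiguous otherwise. The sketch below follows this description with made-up coordinates (half-open intervals); it is not the GenomicRanges implementation used in the practical.

```python
def features_at(position, genes):
    """Names of all genes with an exon covering this position."""
    return {name for name, exons in genes.items()
            if any(start <= position < end for start, end in exons)}

def assign_read(blocks, genes, mode="union"):
    """Assign a read (list of aligned blocks) to a gene under one counting mode."""
    per_base = [features_at(p, genes) for start, end in blocks for p in range(start, end)]
    if mode == "union":
        hits = set().union(*per_base)
    elif mode == "intersection-strict":
        hits = set.intersection(*per_base) if per_base else set()
    elif mode == "intersection-nonempty":
        non_empty = [f for f in per_base if f]
        hits = set.intersection(*non_empty) if non_empty else set()
    else:
        raise ValueError(f"unknown mode: {mode}")
    if not hits:
        return "-"
    return next(iter(hits)) if len(hits) == 1 else "ambiguous"

MODES = ("union", "intersection-strict", "intersection-nonempty")

def all_modes(blocks, genes):
    return tuple(assign_read(blocks, genes, mode) for mode in MODES)

# Hypothetical coordinates reproducing the five rows of the table
gene_a = {"A": [(100, 200)]}
row1 = all_modes([(120, 150)], gene_a)
row2 = all_modes([(90, 120)], gene_a)
row3 = all_modes([(180, 200), (300, 320)], {"A": [(100, 200), (300, 400)]})
row4 = all_modes([(120, 160)], {"A": [(100, 200)], "B": [(140, 200)]})
row5 = all_modes([(120, 150)], {"A": [(100, 200)], "B": [(100, 200)]})
```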










> 54 bp

> 3 reads

Defining novel transcribed regions

IRanges and GenomicRanges:

• define coverage

• find “islands” of expression

• define potential “exons” by setting thresholds...

• retain only regions that do not overlap annotated features
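The steps above can be sketched in plain Python: compute the per-base coverage, find the "islands" where the coverage exceeds a depth threshold, keep the islands that are wide enough, and drop those that overlap annotated features. The default thresholds (more than 3 reads, wider than 54 bp) follow the figure labels, but how exactly they are applied here is an assumption, as are the function names and the toy coordinates. In the practical the same steps are carried out with IRanges and GenomicRanges.

```python
def coverage(reads, length):
    """Per-base read depth over a region of the given length (half-open reads)."""
    diff = [0] * (length + 1)
    for start, end in reads:
        diff[start] += 1
        diff[end] -= 1
    depth, cov = 0, []
    for change in diff[:length]:
        depth += change
        cov.append(depth)
    return cov

def islands(cov, min_depth):
    """Half-open (start, end) runs where the coverage is above min_depth."""
    runs, start = [], None
    for position, depth in enumerate(cov):
        if depth > min_depth and start is None:
            start = position
        elif depth <= min_depth and start is not None:
            runs.append((start, position))
            start = None
    if start is not None:
        runs.append((start, len(cov)))
    return runs

def novel_regions(reads, length, annotated, min_depth=3, min_width=54):
    """Expressed islands that are wide enough and overlap no annotated feature."""
    return [
        (start, end)
        for start, end in islands(coverage(reads, length), min_depth)
        if end - start > min_width
        and not any(start < a_end and a_start < end for a_start, a_end in annotated)
    ]

# Toy data: three islands, of which one is too narrow and one is annotated
reads = [(10, 80)] * 4 + [(100, 120)] * 4 + [(190, 260)] * 4
annotated = [(200, 250)]
regions = novel_regions(reads, 300, annotated)
```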










What’s going on right now?

Read and do the practical till the end of section 3.4.

