33
R / Bioconductor for High-Throughput Sequence Analysis Nicolas Delhomme 1 21 October - 26 October, 2013 1 [email protected]

R Bioconductor for High-Throughput Sequence Analysis · PDF fileR / Bioconductor for High-Throughput Sequence Analysis Nicolas ... 2.2 Creating GAlignment objects from BAM les

  • Upload
    lenga

  • View
    227

  • Download
    3

Embed Size (px)

Citation preview

Page 1: R Bioconductor for High-Throughput Sequence Analysis · PDF fileR / Bioconductor for High-Throughput Sequence Analysis Nicolas ... 2.2 Creating GAlignment objects from BAM les

R / Bioconductor for High-Throughput Sequence Analysis

Nicolas Delhomme1

21 October - 26 October, 2013

[email protected]

Page 2: R Bioconductor for High-Throughput Sequence Analysis · PDF fileR / Bioconductor for High-Throughput Sequence Analysis Nicolas ... 2.2 Creating GAlignment objects from BAM les

Contents

1 Day2 of the workshop 21.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21.2 Main Bioconductor packages of interest for the day . . . . . . . . . . . . . . . . . . . . . . 21.3 A word on High-throughput sequence analysis . . . . . . . . . . . . . . . . . . . . . . . . . 21.4 A word on Integrated Development Environment (IDE) . . . . . . . . . . . . . . . . . . . 21.5 Today’s schedule . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2

2 Prelude 42.1 Purpose . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42.2 Creating GAlignment objects from BAM files . . . . . . . . . . . . . . . . . . . . . . . . . 42.3 Processing the files in parallel . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42.4 Processing the files one chunk at a time . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52.5 Pros and cons of the current solution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6

2.5.1 Pros . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62.5.2 Cons . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6

3 Sequences and Short Reads 73.1 Alignments and Bioconductor packages . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7

3.1.1 The pasilla data set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73.1.2 Alignments and the ShortRead package . . . . . . . . . . . . . . . . . . . . . . . . 83.1.3 Alignments and the Rsamtools package . . . . . . . . . . . . . . . . . . . . . . . . 93.1.4 Alignments and other Bioconductor packages . . . . . . . . . . . . . . . . . . . . . 133.1.5 Resources . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17

4 Interlude 18

5 Estimating Expression over Genes and Exons 205.1 Counting reads over known genes and exons . . . . . . . . . . . . . . . . . . . . . . . . . . 20

5.1.1 The alignments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 205.2 Discovering novel transcribed regions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 235.3 Using easyRNASeq . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 275.4 Where to from here . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29

1

Page 3: R Bioconductor for High-Throughput Sequence Analysis · PDF fileR / Bioconductor for High-Throughput Sequence Analysis Nicolas ... 2.2 Creating GAlignment objects from BAM les

Chapter 1

Day2 of the workshop

1.1 Introduction

This portion of the workshop introduces use of R [15] and Bioconductor [5] for analysis of high-throughputsequence (HTS) data; specifically the manipulation of HTS reads alignment and how to estimate expres-sion over exons, transcripts and genes using these. The workshop is structured as a series of short remarksfollowed by group exercises. The exercises explore the diversity of tasks for which R / Bioconductor areappropriate, but are far from comprehensive.

The goals of that workshop part are to: (1) develop familiarity with R / Bioconductor packages forhigh-throughput analysis; (2) specifically for those necessary for manipulating HTS reads alignment filesand for devising expression over genic features; and (3) provide inspiration and a framework for furtherindependent exploration.

1.2 Main Bioconductor packages of interest for the day

Bioconductor is a collection of R packages for the analysis and comprehension of high-throughput ge-nomic data. Among these, we will focus on three of them principally: ShortRead, Rsamtools andGenomicRanges.

1.3 A word on High-throughput sequence analysis

Recent technological developments introduce high-throughput sequencing approaches. A variety of ex-perimental protocols and analysis workflows address gene expression, regulation, and encoding of geneticvariants. Experimental protocols produce a large number (tens of millions per sample) of short (e.g. ,35-250, single or paired-end) nucleotide sequences. These are aligned to a reference or other genome.Analysis workflows use the alignments to infer levels of gene expression (RNA-seq), binding of regulatoryelements to genomic locations (ChIP-seq), or prevalence of structural variants (e.g. , SNPs, short indels,large-scale genomic rearrangements). Sample sizes range from minimal replication (e.g,. 2 samples pertreatment group) to thousands of individuals.

1.4 A word on Integrated Development Environment (IDE)

There are numerous tools to support developing programs and softwares in R. For this course, we haveselected one of them: the RStudio environment, which provides a feature-full, user-friendly, cross-platformenvironment for working with R.

1.5 Today’s schedule

2

Page 4: R Bioconductor for High-Throughput Sequence Analysis · PDF fileR / Bioconductor for High-Throughput Sequence Analysis Nicolas ... 2.2 Creating GAlignment objects from BAM les

Table 1.1: EMBO2013 AHTSD workshop day2 Schedule

Time Description09:00 Lecture: Representing and manipulating alignments09:45 Practical: Representing and manipulating alignments

10:30 Coffee break

10:45 Practical c’ed: Representing and manipulating alignments

12:30 Lunch

13:30 Lecture: Estimating expression over genes and exons14:30 Practical: Estimating expression over genes and exons

15:30 Coffee break

15:45 Lecture: Working without a ”reference” genome16:30 Practical: Discovering novel transcribed regions17:30 Question and Answer session - preferably at the Red Lion

18:30 Dinner

3

Page 5: R Bioconductor for High-Throughput Sequence Analysis · PDF fileR / Bioconductor for High-Throughput Sequence Analysis Nicolas ... 2.2 Creating GAlignment objects from BAM les

Chapter 2

Prelude

2.1 Purpose

Before getting familiar with the Bioconductor packages functionalities that were presented in the lecture,we will first sublimate the knowledge you’ve gathered so far into adressing the computationaal challengesfaced when using HTS data: i.e. resources and time consumption.

In the lecture, the readGAlignmentsFromBam function from the Rsamtools package was introduced andused to extract a GAlignment object. However, most of the times, an experiment will NOT consist of asingle sample (of only 2.5M reads!) and an obvious way to speed up the process is to parallelize. In thefollowing three sections, we will see how to perform this before ultimately discussing the pros and consof the implemented method.

2.2 Creating GAlignment objects from BAM files

Exercise 1First of all, locate the BAM files and implement a function to read them sequentially. Have a look atthe lapply function man page for doing so.

Solution:

> library(Rsamtools)

> bamfiles <- dir(system.file("bigdata","bam",package="EMBO2013Day2"),

+ pattern="*.bam$",full.names=TRUE)

> gAlns <- lapply(bamfiles,readGAlignmentsFromBam)

Nothing complicated so far - or if, raise your voice. We proceed both files sequentially and get a listof GAlignments objects stored in the gAlns object. Apart from the coding enhancement - with one line,we can process all our samples - there is no other gains.

2.3 Processing the files in parallel

Modern laptop CPUs possess several cores that can perform tasks independently, commonly 2 to 4.Computational servers usually have many CPUs (commonly 8) each having several cores. An obviousenhancement to our previous solution is to take advantage of this CPU architecture and to process oursample in parallel.

Exercise 2Have a look at the parallel package and in particular at the mclapply function to re-implement theprevious function in a parallel manner.

Solution:

4

Page 6: R Bioconductor for High-Throughput Sequence Analysis · PDF fileR / Bioconductor for High-Throughput Sequence Analysis Nicolas ... 2.2 Creating GAlignment objects from BAM les

> library(parallel)

> gAlns <- mclapply(bamfiles,readGAlignmentsFromBam)

Exercise 3Could you figure out how many cores were used in parallel when running the previous line? Can youexplain why that was so?

Solution: It is NOT because there were 2 files to proceed. The mclapply has a number of defaultparameters - see ?mclapply for details - including the mc.cores one that defaults to 2. If you want toproceed more samples in parallel, set that parameter value accordingly.

This new implementation has the obvious advantage to be X times faster (with X being the numberof CPU used, or almost so as parallelization comes with a slight processing cost), but it put a differentstrain on the system. As several files are being processed in parallel, the memory requirement alsoincrease by a factor X (assuming files of almost equivalent size are to be processed). This might be fineon a computational server but given the constant increase in sequencing reads being produced per run,this will eventually be challenged.

Exercise 4Can you think of the way this memory issue could be adressed? i.e. what could we modify in the waywe read/process the file to limit the memory required at a given moment?

Solution: No, buying more memory is usually not an option. And anyway, at the moment, the increaserate of reads sequenced per run is faster than the memory doubling time. So, let us just move to thenext section to have a go at adressing the issue.

2.4 Processing the files one chunk at a time

To limit the memory required at any moment, one approach would be to proceed the file not as a whole,but chunk-wise. As we can assume that reads are stored independently in BAM files (or almost so, thinkof how Paired-End data is stored!), we simply can decide to parse, e.g. 1, 000, 000 reads at a time. Thiswill of course require to have a new way to represent a BAM file in R, i.e. not just as a character stringas we had it until now in our bamfiles object.

Exercise 5The Rsamtools package again comes in handy. Lookup the ?BamFile package and try to scheme how wecould take advantage of the BamFile or BamFileList classes for our purpose.

Solution: The yieldSize parameter of either class looks like exactly what we want. Let us recode ourbamfiles character object into a BamFileList .

> bamFileList <- BamFileList(bamfiles,yieldSize=10^6)

Now that we have the BAM files described in a way that we can process them chunk-wise, let us doso. The paradigm is as follow:

> open(bamFile)

> while(length(chunk <- readGAlignmentsFromBam(bamFile))){

+ message(length(chunk))

+ }

> close(bamFile)

5

Page 7: R Bioconductor for High-Throughput Sequence Analysis · PDF fileR / Bioconductor for High-Throughput Sequence Analysis Nicolas ... 2.2 Creating GAlignment objects from BAM les

Exercise 6In the paradigm above, we process one BAM file chunk wise and report the sizes of the chunks. i.e.these would be 1M reads - in our case - apart for the last one, which would be smaller or equal to 1M(it is unlikely that a sequencing file contains an exact multiple of our chink size).

Now, try to implement the above paradigm in the function we implemented previously - see solu-tion 2.3 page 4 - so as to process both our BAM files in parallel chunk-wise.

Solution:

> gAlns <- mclapply(bamFileList,function(bamFile){

+ open(bamFile)

+ gAln <- GAlignments()

+ while(length(chunk <- readGAlignmentsFromBam(bamFile))){

+ gAln <- c(gAln,chunk)

+ }

+ close(bamFile)

+ return(gAln)

+ })

2.5 Pros and cons of the current solutionExercise 7Before reading my comments below, take the time to jot down what you think are the advantages anddrawbacks of the method implemented above. My own comments below are certainly not extensive andI would be curious to hear yours that are not matched with mine.

Solution:

2.5.1 Pros

a. We have written a streamlined piece of code, using up to date functionalities from other packages.Hence, it is both easily maintanable and updatable.

b. With regards to time consumption, we have reduced it by a factor 2 and that can be reducedfurther by using computer with more CPUs or a compute farm even - obviously if we have morethan 2 samples to process.

c. We have implemented the processing of the BAM files by chunk

2.5.2 Cons

a. There’s only one big cons really: we have NOT addressed the memory requirement issue satisfyingly.We do proceed the BAM files by chunks, but then we simply aggregate these chunks without furtherprocessing, so we eventually end up using the same amount of memory. This is the best we cando so far given the introduced Bioconductor functionalities, so let us move to the next step in thepipeline that will help us resolve that - see Chapter 4 page 18 if you are impatient - but first weshould recap the usage of the Bioconductor packages for obtaining and manipulating sequencingread information in R, which is next chapter’s topic.

6

Page 8: R Bioconductor for High-Throughput Sequence Analysis · PDF fileR / Bioconductor for High-Throughput Sequence Analysis Nicolas ... 2.2 Creating GAlignment objects from BAM les

Chapter 3

Sequences and Short Reads

Most down-stream analysis of short read sequences is based on reads aligned to reference genomes. Thereare many aligners available, including BWA [13, 12], Bowtie2 [9], GSNAP[21], STAR[4],etc. ; merits ofthese are discussed in the literature. There are also alignment algorithms implemented in Bioconductor(e.g., matchPDict in the Biostrings package and the gmapR, Rbowtie, Rsubread packages); matchPDict isparticularly useful for flexible alignment of moderately sized subsets of data.

3.1 Alignments and Bioconductor packages

The following sections introduce core tools for working with high-throughput sequence data; key packagesfor representing reads and alignments are summarized in Table 3.1.

Moreover,Martin introduced yesterday resources for annotating sequences, that will come handy inthe next two chapters of this tutorial (Chapter 4, page 18 and Chapter 5, page 20)

Exercise 8Read the man page of the GAlignments and GAlignmentPairs classes and pay attention to the veryimportant comments on multi-reads and paired-end processing.

Solution: Really just ?GAlignments. However, KEEP these details in mind as they essential and likelysource of erroneous conclusion. Remember the example of this morning lecture about RNA editing.

3.1.1 The pasilla data set

As a running example, we use the pasilla data set, derived from [2]. The authors investigate conservationof RNA regulation between D. melanogaster and mammals. Part of their study used RNAi and RNA-seq to identify exons regulated by Pasilla (ps), the D. melanogaster ortholog of mammalian NOVA1 and

Table 3.1: Selected Bioconductor packages for extracting and manipulating sequence reads alignments.

Package DescriptionShortRead In addition to the functionalities described yesterday to manipulate raw

read files, e.g. the ShortReadQ class and functions for manipulating fastq

files; this package offers the possibility to load numerous HTS formatsclasses. These are mostly sequencer manufacturer specific e.g. sff for454 or pre-BAM aligner proprietary formats, e.g. MAQ or bowtie. Thesefunctionalities rely heavily on Biostrings and somewhat on Rsamtools.

GenomicRanges GAlignments and GAlignmentPairs store single- and paired-end alignedreads.

Rsamtools Provides access to BAM alignment and other large sequence-related files.rtracklayer Input and output of bed, wig and similar files

7

Page 9: R Bioconductor for High-Throughput Sequence Analysis · PDF fileR / Bioconductor for High-Throughput Sequence Analysis Nicolas ... 2.2 Creating GAlignment objects from BAM les

NOVA2. Briefly, their experiment compared gene expression as measured by RNAseq in S2-DRSC cellscultured with, or without, a 444bp dsRNA fragment corresponding to the ps mRNA sequence. Theirassessment investigated differential exon use, but our worked example will focus on gene-level differences.

In the following sections, we look at a subset of the ps data, corresponding to reads obtained fromlanes of their RNA-seq experiment, and aligned to a D. melanogaster reference genome. These are thesame reads that were used yesterday for the demonstration of the raw read based functionalities of theShortRead package. As a side note, reads were retrieved from GEO and the Short Read Archive (SRA),and were aligned to the D. melanogaster reference genome dm3 as described in the pasilla experimentdata package.

3.1.2 Alignments and the ShortRead package

Yesterday, Martin introduced the ShortRead to manipulate raw reads and to perform Quality Assessment(QA) on raw data files e.g. fastq formatted files. These are not the only functionalities from theShortRead package, which offers as well the possibility to read in alignments files in many differentformats.

Exercise 9Two files of the pasilla dataset have been aligned using bowtie [9], locate them in the bigdata folder ofthe EMBO2013Day2 package.

Solution:

> bwtFiles <- dir(path=system.file("bigdata","bowtie",package="EMBO2013Day2"),

+ pattern="*.bwt$",full.names=TRUE)

As we will be accessing this bigdata folder frequently, we create a function called bigdata to do somore conveniently.

> library(EMBO2013Day2)

> bigdata <- function()

+ system.file("bigdata",package="EMBO2013Day2")

Exercise 10Have a pick at one of the file and try to decipher its format. Hint: it is a tab delimited format, so checkthe read.delim function. As you may not want to read all the lines to get an idea, lookup an appropriateargument for that.

Solution: You might want to check http://bowtie-bio.sourceforge.net/manual.shtml#default-bowtie-output

for checking whether your guesses were correct. Here is how to read 10 lines of the first file.

> read.delim(file=file.path(bigdata(),"bowtie","SRR074430.bwt"),

+ header=FALSE,nrows=10)

Exercise 11now, as was presented in the lecture use the readAligned function to read in the bowtie alignment files.

Solution:

> alignedRead <- readAligned(dirPath=file.path(bigdata(),"bowtie"),

+ pattern="*.bwt$",type="Bowtie")

8

Page 10: R Bioconductor for High-Throughput Sequence Analysis · PDF fileR / Bioconductor for High-Throughput Sequence Analysis Nicolas ... 2.2 Creating GAlignment objects from BAM les

Exercise 12What is peculiar about the returned object? Determine its class. Can you tell where the data from bothinput files are?

Solution: We obtained a single object of the AlignedRead class. By looking at the documentation, i.e.?readAligned in the Value section, we are told that all files are concatenated in a single object with NOguarantee of order in which files are read. This is convenient when we want to merge several sequencingruns of the same sample but we need to be cautions and process independent sample by individuallycalling the readAligned function for every sample.

Exercise 13Finally, taking another look at the lecture, select only the reads that align to chromosome 2L. Hint, usethe appropriate SRFilter filter.

Solution:

> alignedRead2L <- readAligned(dirPath=file.path(bigdata(),"bowtie"),

+ pattern="*.bwt$",type="Bowtie",

+ filter=chromosomeFilter("2L"))

This concludes the overview of the ShortRead package. As the BAM format has become a de-factostandard, it is more unlikely that you end up using that package to process reads in R over the Rsamtoolspackage that you will be using next.

3.1.3 Alignments and the Rsamtools package

Alignment formats Most main-stream aligners produce output in SAM (text-based) or BAM format.A SAM file is a text file, with one line per aligned read, and fields separated by tabs. Here is an exampleof a single SAM line, split into fields.

> fl <- system.file("extdata", "ex1.sam", package="Rsamtools")

> strsplit(readLines(fl, 1), "\t")[[1]]

[1] "B7_591:4:96:693:509"

[2] "73"

[3] "seq1"

[4] "1"

[5] "99"

[6] "36M"

[7] "*"

[8] "0"

[9] "0"

[10] "CACTAGTGGCTCATTGTAAATGTGTGGTTTAACTCG"

[11] "<<<<<<<<<<<<<<<;<<<<<<<<<5<<<<<;:<;7"

[12] "MF:i:18"

[13] "Aq:i:73"

[14] "NM:i:0"

[15] "UQ:i:0"

[16] "H0:i:1"

[17] "H1:i:0"

Fields in a SAM file are summarized in Table 3.2. We recognize from the FASTQ file introducedyesterday, the identifier string, read sequences and qualities. The alignment is to a chromosome ‘seq1’starting at position 1. The strand of alignment is encoded in the ‘flag’ field. The alignment record alsoincludes a measure of mapping quality, and a CIGAR string describing the nature of the alignment. Inthis case, the CIGAR is 36M, indicating that the alignment consisted of 36 Matches or mismatches, withno indels or gaps; indels are represented by I and D; gaps (e.g., from alignments spanning introns) by N.

9

Page 11: R Bioconductor for High-Throughput Sequence Analysis · PDF fileR / Bioconductor for High-Throughput Sequence Analysis Nicolas ... 2.2 Creating GAlignment objects from BAM les

Table 3.2: Fields in a SAM record. From http://samtools.sourceforge.net/samtools.shtml

Field Name Value1 QNAME Query (read) NAME2 FLAG Bitwise FLAG, e.g., strand of alignment3 RNAME Reference sequence NAME4 POS 1-based leftmost POSition of sequence5 MAPQ MAPping Quality (Phred-scaled)6 CIGAR Extended CIGAR string7 MRNM Mate Reference sequence NaMe8 MPOS 1-based Mate POSition9 ISIZE Inferred insert SIZE10 SEQ Query SEQuence on the reference strand11 QUAL Query QUALity12+ OPT OPTional fields, format TAG:VTYPE:VALUE

Note that mismatches are not represented in the CIGAR string but might be detailed in the additionalattributes; this depends on the aligner used.

BAM files encode the same information as SAM files, but in a format that is more efficiently parsedby software; BAM files are the primary way in which aligned reads are imported in to R.

Aligned reads in R As introduced - c.f. section 3.1 - there are three different packages to readalignments in R:

• ShortRead

• GenomicRanges

• Rsamtools

The last two will be described in more details in the next paragraphs.

GenomicRanges The readGAlignments function from the GenomicRanges package reads essentialinformation from a BAM file into R. The result is an instance of the GAlignments class. The GAlignmentsclass has been designed to allow useful manipulation of many reads (e.g., 20 million) under moderatememory requirements (e.g., 4 GB).

Note that the the readGAlignments function and the GAlignments class have replaced the readGappedAlign-

ments function and the GappedAlignments class, respectively, from the previous releases of Bioconductor;this as of version 2.13 released beginning of October this year.

Exercise 14Use the readGAlignments to read in the ”ex1.bam” that can be found in the ”extdata” folder of theRsamtools package.

Solution:

> alnFile <- system.file("extdata", "ex1.bam", package="Rsamtools")

> aln <- readGAlignments(alnFile)

> head(aln, 3)

The readGAlignments function takes an additional argument, param, allowing the user to specify regionsof the BAM file (e.g., known gene coordinates) from which to extract alignments.

A GAlignments instance is like a data frame, but with accessors as suggested by the column names.It is easy to query, e.g., the distribution of reads aligning to each strand, the width of reads, or the cigarstrings

10

Page 12: R Bioconductor for High-Throughput Sequence Analysis · PDF fileR / Bioconductor for High-Throughput Sequence Analysis Nicolas ... 2.2 Creating GAlignment objects from BAM les

Exercise 15Summarize the strand, width and CIGAR information from that file.

Solution:

> table(strand(aln))

+ - *

1647 1624 0

> table(width(aln))

30 31 32 33 34 35 36 38 40

2 21 1 8 37 2804 285 1 112

> head(sort(table(cigar(aln)), decreasing=TRUE))

35M 36M 40M 34M 33M 14M4I17M

2804 283 112 37 6 4

Rsamtools The Rsamtools readGAlignmentsFromBam function - introduced earlier, see Chapter 2page 4 - as the GenomicRanges readGAlignments function only parse some of the fields of a BAM file,and that may not be appropriate for all uses. In these cases the scanBam function in Rsamtools providesgreater flexibility. The idea is to view BAM files as a kind of data base. Particular regions of interestcan be selected, and the information in the selection restricted to particular fields. These operations aredetermined by the values of a ScanBamParam object, passed as the named param argument to scanBam.

Exercise 16Consult the help page for ScanBamParam, and construct an object that restricts the information returnedby a scanBam query to the aligned read DNA sequence. Your solution will use the what parameter to theScanBamParam function.

Use the ScanBamParam object to query a BAM file, and calculate the GC content of all aligned reads.Summarize the GC content as a histogram (Figure 3.1).

Solution:

> param <- ScanBamParam(what="seq")

> seqs <- scanBam(bamfiles[[1]], param=param)

> readGC <- gcFunction(seqs[[1]][["seq"]])

> hist(readGC)

Advanced Rsamtools usage The Rsamtools package has more advanced functionalities:

1. function to count, index, filter, sort BAM files

2. function to access the header only

3. the possibility to access SAM attributes (tags)

4. manipulate the CIGAR string

5. create BAM libraries to represent a study set (BamViews)

6. . . .

Exercise 17Find out the function that permit to scan the BAM header and retrieve the header of the first BAM filein the bigdata() bam subfolder. What information does it contain?

11

Page 13: R Bioconductor for High-Throughput Sequence Analysis · PDF fileR / Bioconductor for High-Throughput Sequence Analysis Nicolas ... 2.2 Creating GAlignment objects from BAM les

Figure 3.1: GC content in aligned reads

Solution: It contains the reference sequence length and names as well as the name, version and commandline of the tool used for performing the alignments.

> scanBamHeader(bamfiles[1])

Exercise 18The SAM/BAM format contains a tag: “NH” that defines the total number of valid alignments reportedfor a read. How can you extract that information from the same first bam file and plot it as an histogram?

Solution:

> param <- ScanBamParam(tag="NH")

> nhs <- scanBam(bamfiles[[1]], param=param)[[1]]$tag$NH

So it seems a majority of our reads have multiple alignments! Some processing might be required todeal with these; e.g. if reads were aligned to the transcriptome there exist tools that can deconvoluatethe transcript specific expression, for example MMSEQ [20], BitSeq [6], that last one existing as an Rpackage too: BitSeq. Otherwise if reads were aligned to the genome, one should consider filtering thesemultiple alignments to avoid introducing artifactual noise in the subsequent analyses.

Exercise 19The CIGAR string contains interesting information, in particular, whether or not a given match containindels. Using the first bam file, can you get a matrix of all seven CIGAR operations? And find out theintron size distribution?

Solution:

> param <- ScanBamParam(what="cigar")

> cigars <- scanBam(bamfiles[[1]], param=param)[[1]]$cigar

12

Page 14: R Bioconductor for High-Throughput Sequence Analysis · PDF fileR / Bioconductor for High-Throughput Sequence Analysis Nicolas ... 2.2 Creating GAlignment objects from BAM les

> cigar.matrix <- cigarOpTable(cigars)

> intron.size <- cigar.matrix[,"N"]

> intron.size[intron.size>0]

> plot(density(intron.size[intron.size>0]))

> histogram(log10(intron.size[intron.size>0]),xlab="intron size (log10 bp)")

Exercise 20Look up the documentation for the BamViews and using the leeBamViews package, create a BamViewsinstance. Afterwards, use some of the accessors of that object to retrieve e.g. the file paths or the samplenames

Solution:

> library(leeBamViews)

> bpaths = dir(system.file("bam", package="leeBamViews"), full=TRUE, patt="bam$")

> gt<-do.call(rbind,strsplit(basename(bpaths),"_"))[,1]

> geno<-substr(gt,1,nchar(gt)-1)

> lane<-substr(gt,nchar(gt),nchar(gt))

> pd = DataFrame(geno=geno, lane=lane, row.names=paste(geno,lane,sep="."))

> bs1 = BamViews(bamPaths=bpaths, bamSamples=pd,

+ bamExperiment=list(annotation="org.Sc.sgd.db"))

> bamPaths(bs1)

> bamSamples(bs1)

Exercise 21Finally, extract the coverage for the locus 861250:863000 on chromosome “Scchr13” for every sample inthe bs1 object

Solution:

> sel <- GRanges(seqnames = "Scchr13", IRanges(start = 861250, end = 863000),strand="+")

> covex = RleList(lapply(bamPaths(bs1), function(x) coverage(readGAlignments(x))[[1]]))

This offer an interesting way to process multiple sample at the same time when you’re interested ina particular locus.

3.1.4 Alignments and other Bioconductor packages

In the following, an excerpt of additional functionalities offered by Bioconductor packages is presented.It is far from being a complete overview, and as such only aims at giving a feel for what’s out there.

Retrieving data using SRAdb Most journals require the raw data to be deposited in a publicrepository, such as GEO, SRA or ENA. The SRAdb package offers the possibility to list the content ofthese archives, and to retrieve raw (fastq or sra) files.

Exercise 22Using the pasilla package, retrieve the submission accession of that dataset (check out that packagevignette)

Solution:

> vignette(package="pasilla")

> vignette("create_objects")

> geo.acc <- "GEO: GSE18508"

13

Page 15: R Bioconductor for High-Throughput Sequence Analysis · PDF fileR / Bioconductor for High-Throughput Sequence Analysis Nicolas ... 2.2 Creating GAlignment objects from BAM les

Now that as we only have the GEO ID, we need to convert it to an SRA ID. You can either use theGEO, SRA or ENA website for this or if you are slightly familiar with SQL, just use the SRAdb package.

Exercise 23Look into the SRAdb package vignette to figure out how to do this.

Solution: Accessing the vignette and reading it tells us

> library(SRAdb)

> vignette("SRAdb")

a. we need to download the SRAdb sqlfile

b. we need to create a connection to the locally downloaded database

c. we need to query that database with our submission alias: “GEO: GSE18508” to retrieve the SRAsubmission accession.

The first step requires the download of a 280+Mb compressed large file, so to avoid the downloadingtime, connect to the file on the shared folder

> sqlfile <- "replace-with-your-path-to-the-SRAmetadb.sqlite-file"

> sra_con <- dbConnect(SQLite(),sqlfile)

> sra.acc <- dbGetQuery(sra_con,paste("select submission_accession ",

+ "from submission ",

+ 'where submission_alias = "',+ geo.acc,';"',sep=""))

To download the file, the command to use is getSRAdbFile

The retrieved sra.acc is: “SRA010243”.

Now that we have that accession, the vignette tells us how to get every experiment, sample, run, . . .associated with this submission accession.

Exercise 24There are at least two possibilities to do so, one using an SQL query and the other one using a functionof the packages. What would be that function?

Solution: For those that like SQL:

> run.acc <- dbGetQuery(sra_con,paste("select run_accession ",

+ "from run ",

+ 'where submission_accession = "',+ sra.acc,'";',sep=""))$run_accession

For those that like functions:

> sraConvert(sra.acc,sra_con=sra_con)

> run.acc <- sraConvert(sra.acc,"run",sra_con=sra_con)$run

Exercise 25Now that we’ve got the list of runs, it would be interesting to get more information about the corre-sponding fastq file.

Solution:

> info <- getFASTQinfo(run.acc,srcType="ftp")

And the final step would be to download the fastq file(s) of interest.

14

Page 16: R Bioconductor for High-Throughput Sequence Analysis · PDF fileR / Bioconductor for High-Throughput Sequence Analysis Nicolas ... 2.2 Creating GAlignment objects from BAM les

Exercise 26Retrieve the shortest fastq file from that particular submission.

Solution:

> getSRAfile(in_acc=info[which.min(info[,"run.read.count"]),"run"],

+ sra_con, destDir = getwd(),

+ fileType = 'fastq', srcType = 'ftp' )

Well, that’s almost it. As we are tidy people, we clean after ourselves.

> dbDisconnect( sra_con )

Demultiplexing using easyRNASeq Note: This section does not apply to all datasets but only tomultiplexed ones. Since the data we loaded so far into R was not multiplexed we will use a differentdataset here.

Nowadays, NGS machines produces so many reads (e.g. 40M for Illumina GAIIx, 100M for ABISOLiD4 and 160M for an Illumina HiSeq), that the coverage obtained per lane for the transcriptome oforganisms with small genomes, is very high. Sometimes it’s more valuable to sequence more samples withlower coverage than sequencing only one to very high coverage, so techniques have been optimised for se-quencing several samples in a single lane using 4-6bp barcodes to uniquely identify the sample within thelibrary[10]. This is called multiplexing and one can on average sequence 12 yeast samples at 30X coveragein a single lane of an Illumina GenomeAnalyzer GAIIx (100bp read, single end). This approach is veryadvantageous for researchers, especially in term of costs, but it adds an additional layer of pre-processingthat is not as trivial as one would think. Extracting the barcodes would be fairly straightforward, butfor the average 0.1-1 percent sequencing error rate that introduces a lot of multiplicity in the actualbarcodes present in the samples. A proper design of the barcodes, maximising the Hamming distance [8]is an essential step for proper de-multiplexing.

The data we loaded into R in the previous section was not mutiplexed, so we now load a differ-ent FASTQ file where the 4 different samples sequenced were identified by the barcodes ”ATGGCT”,”TTGCGA”, ”ACACTG” and ”ACTAGC”.

> reads <- readFastq(file.path(bigdata(),"multiplex","multiplex.fq.gz"))

> # filter out reads with more than 2 Ns

> filter <- nFilter(threshold=2)

> reads <- reads[filter(reads)]

> # access the read sequences

> seqs <- sread(reads)

> # this is the length of your adapters

> barcodeLength <- 6

> # get the first 6 bases of each read

> seqs <- narrow(seqs, end=barcodeLength)

> seqs

> length(table(as.character(seqs)))

So it seems we have 1953 barcodes instead of 6 . . .

Exercise 27Which barcode is most represented in this library? Plot the relative frequency of the top 20 barcodes.Try:

• using the function table to count how many times each barcode occurs in the library, you can’t applythis function to seqs directly you must convert it first to a character vector with the as.characterfunction

• sort the counts object you just created with the function sort, use decreasing=TRUE as an argumentfor sort so that the elements are sorted from high to low (use sort( ..., decreasing=TRUE ))

15

Page 17: R Bioconductor for High-Throughput Sequence Analysis · PDF fileR / Bioconductor for High-Throughput Sequence Analysis Nicolas ... 2.2 Creating GAlignment objects from BAM les

• look at the first element of your sorted counts object to find out with barcode is most represented

• find out what the relative frequency of each barcode is by dividing your counts object by the totalnumber of reads (the function sum might be useful)

• plot the relative frequency of the top 20 barcodes by adapting these function calls:

> # set up larger margins for the plot so we can read the barcode names

> par(mar=c(5, 5, 4, 2))

> barplot(..., horiz=T, las=1, col="orange" )

Solution:

> barcount = sort(table(as.character(seqs)), decreasing=TRUE)

> barcount[1:10] # TTGCGA

> barcount = barcount/sum(barcount)

> par( mar=c(5, 5, 4, 2))

> barplot(barcount[1:20], horiz=TRUE, las=1, col="orange" )

Exercise 28The designed barcodes (”ATGGCT”, ”TTGCGA”, ”ACACTG” and ”ACTAGC”) seem to be equally dis-tributed, what is the percentage of reads that cannot be assigned to a barcode?

Solution:

> signif((1-sum(barcount[1:4]))*100,digits=2) # ~6.4%

We will now iterate over the 4 barcodes, split the reads between them and save a new fastq file foreach:

> barcodes = c("ATGGCT", "TTGCGA", "ACACTG", "ACTAGC")

> # iterate through each of these top 10 adapters and write

> # output to fastq files

> for(barcode in barcodes) {

+ seqs <- sread(reads) # get sequence list

+ qual <- quality(reads) # get quality score list

+ qual <- quality(qual) # strip quality score type

+ mismatchVector <- 0 # allow no mismatches

+

+ # trim sequences looking for a 5' pattern

+ # gets IRanges object with trimmed coordinates

+ trimCoords <- trimLRPatterns(Lpattern=barcode,

+ subject=seqs, max.Lmismatch=mismatchVector, ranges=T)

+

+ # generate trimmed ShortReadQ object

+ seqs <- DNAStringSet(seqs, start=start(trimCoords),

+ end=end(trimCoords))

+ qual <- BStringSet(qual, start=start(trimCoords),

+ end=end(trimCoords))

+

+ # use IRanges coordinates to trim sequences and quality scores

+ qual <- SFastqQuality(qual) # reapply quality score type

+ trimmed <- ShortReadQ(sread=seqs, quality=qual, id=id(reads))

+

+ # rebuild reads object with trimmed sequences and quality scores

16

Page 18: R Bioconductor for High-Throughput Sequence Analysis · PDF fileR / Bioconductor for High-Throughput Sequence Analysis Nicolas ... 2.2 Creating GAlignment objects from BAM les

+ # keep only reads which trimmed the full barcode

+ trimmed <- trimmed[start(trimCoords) == barcodeLength + 1]

+

+ # write reads to Fastq file

+ outputFileName <- paste(barcode, ".fq", sep="")

+ writeFastq(trimmed, outputFileName)

+

+ # export filtered and trimmed reads to fastq file

+ print(paste("wrote", length(trimmed),

+ "reads to file", outputFileName))

+ }

You should have four new FASTQ files: ACACTG.fq, ACTAGC.fq ATGGCT.fq and TTGCGA.fq withthe reads (the barcodes have been trimmed) corresponding to each mutiplexed sampled. The next stepwould be to align these reads against your reference genome.

Aligning reads using Rsubread Note that since last week and the latest release of Bioconductori.e. version 2.13, I have encountered weird errors using Rsubread on Mac OSX 10.6.8. If that occurs onthe course machines too, read the code and feel free to ask me any question.

> library(Rsubread)

> library(BSgenome.Dmelanogaster.UCSC.dm3)

> chr4 <- DNAStringSet(unmasked(Dmelanogaster[["chr4"]]))

> names(chr4) <- "chr4"

> writeXStringSet(chr4,file="dm3-chr4.fa")

> ## create the indexes

> dir.create("indexes")

> buildindex(basename=file.path("indexes","dm3-chr4"),

+ reference="dm3-chr4.fa",memory=1000)

> ## align the reads

> sapply(dir(pattern="*\\.fq$"),function(fil){

+ ## align

+ align(index=file.path("indexes","dm3-chr4"),

+ readfile1=sub("\\.fq$","",fil),

+ nsubreads=2,TH1=1,

+ output_file=sub("\\.fq$","\\.sam",fil)

+ )

+

+ ## create bam files

+ asBam(file=sub("\\.fq$","\\.sam",fil),

+ destination=sub("\\.fq$","",fil),

+ indexDestination=TRUE)

+ })

And that’s it you have filtered, demultiplexed and aligned your reads!

3.1.5 Resources

There are extensive vignettes for Biostrings and GenomicRanges packages. A useful on-line resource isfrom Thomas Girke’s group.

17

Page 19: R Bioconductor for High-Throughput Sequence Analysis · PDF fileR / Bioconductor for High-Throughput Sequence Analysis Nicolas ... 2.2 Creating GAlignment objects from BAM les

Chapter 4

Interlude

Now that we have seen the GenomicRanges functionalities to find count or summarize overlaps betweenreads and annotations, we can refine our prefered function. We had left it as:

> gAlns <- mclapply(bamFileList,function(bamFile){

+ open(bamFile)

+ gAln <- GAlignments()

+ while(length(chunk <- readGAlignmentsFromBam(bamFile))){

+ gAln <- c(gAln,chunk)

+ }

+ close(bamFile)

+ return(gAln)

+ })

Exercise 29Using the synthetic transcript annotation prepared during the lecture: dmel_synthetic_transcript_r5-52.rda, implement the count by chunks.

Solution:

> load("~/Day2/dmel_synthetic_transcript_r5-52.rda")

> count.list <- mclapply(bamFileList,function(bamFile){

+ open(bamFile)

+ counts <- vector(mode="integer",length=length(annot))

+ while(length(chunk <- readGAlignmentsFromBam(bamFile))){

+ counts <- counts + assays(summarizeOverlaps(annot,chunk,mode="Union"))$counts

+ }

+ close(bamFile)

+ return(counts)

+ })

This gives us a list of counts per sample, to get a count matrix do:

> count.table <- do.call("cbind",count.list)

> head(count.table)

reads reads

FBgn0000008.0 1 3

FBgn0000014.0 0 0

FBgn0000015.0 0 0

FBgn0000017.0 107 110

FBgn0000018.0 5 11

FBgn0000022.0 0 0

18

Page 20: R Bioconductor for High-Throughput Sequence Analysis · PDF fileR / Bioconductor for High-Throughput Sequence Analysis Nicolas ... 2.2 Creating GAlignment objects from BAM les

Such a count.table object is the minimal input that downstream analysis softwares - e.g. DESeq2,edgeR, etc. uses.

A similar function to this is probably all you’ll need to process your read and get a count tablefrom a standard Illumina based RNA-Seq experiment. However, you might want more flexibility for youprojects and certainly Bioconductor offer the possibility to do that; examples of which are given in thenext chapter.

19

Page 21: R Bioconductor for High-Throughput Sequence Analysis · PDF fileR / Bioconductor for High-Throughput Sequence Analysis Nicolas ... 2.2 Creating GAlignment objects from BAM les

Chapter 5

Estimating Expression over Genesand Exons

This chapter1 describes part of an RNA-Seq analysis use-case. RNA-Seq [14] was introduced as a newmethod to perform Gene Expression Analysis, using the advantages of the high throughput of Next-Generation Sequencing (NGS) machines.

5.1 Counting reads over known genes and exons

The goal of this use-case is to generate a count table for the selected genic features of interest, i.e. exons,transcripts, gene models, etc.

To achieve this, we need to take advantage of all the steps performed previously in the workshopDay1 and Day2.

1. the alignments information has to be retrieved

2. the corresponding annotation need to be fetched and possibly adapted e.g. as was done in thepreceeding lecture.

3. the read coverage per genic feature of interest determined

Exercise 30Can you associate at least a Bioconductor package to every of these tasks?

Solution: There are numerous choices, as an example in the following we will go for the following setof packages:

a. Rsamtools

b. genomeIntervals - this was already done during the lecture

c. GenomicRanges

5.1.1 The alignments

This was introduced in section 3.1.3, page 9. In this section we will import the data using the RsamtooslreadGAlignmentsFromBam. This will create a GAlignments object that contains only the reads that alignedto the genome.

Exercise 31Using what was introduced in section 3.1.3, read in the first bam file from the bigdata() bam folder.Remember that the protocol used was not strand-specific.

1The author want to thank Angela Goncalves for parts of the present chapter

20

Page 22: R Bioconductor for High-Throughput Sequence Analysis · PDF fileR / Bioconductor for High-Throughput Sequence Analysis Nicolas ... 2.2 Creating GAlignment objects from BAM les

Figure 5.1: Overlap modes; Image from the HTSeq package developed by Simon Anders.

Solution: First we scan the bam directory:

> bamfiles <- dir(file.path(bigdata(), "bam"), ".bam$", full=TRUE)

> names(bamfiles) <- sub("_.*", "", basename(bamfiles))

Then we read the first file:

> aln <- readGAlignments(bamfiles[1])

> strand(aln) <- "*"

As we have seen, many of these reads actually align to multiple locations. In a first basic analysis -i.e. to get a feel for the data - such reads could be ignored.

Exercise 32Filter the multiple alignment reads. Think of the “NH” tag.

Solution:

> param <- ScanBamParam(tag="NH")

> nhs <- scanBam(bamfiles[[1]], param=param)[[1]]$tag$NH

> aln <- aln[nhs==1,]

Now that we have the alignments (aln object) and the synthetic transcript annotation (annot object)- the one from the lecture; the same used in the Interlude 4, page 18, we can quantify gene expression bycounting reads over all exons of a gene and summing them together. One thing to keep in mind is thatspecial care must be taken in dealing with reads that overlap more than one feature (e.g. overlappinggenes, isoforms), and thus might be counted several times in different features. To deal with this we canuse any of the approaches summarised in Figure 5.1:

The GenomicRanges summarizeOverlaps offer different possibilities to summarize reads per features:

21

Page 23: R Bioconductor for High-Throughput Sequence Analysis · PDF fileR / Bioconductor for High-Throughput Sequence Analysis Nicolas ... 2.2 Creating GAlignment objects from BAM les

> load("~/Day2/dmel_synthetic_transcript_r5-52.rda")

> counts1 <- summarizeOverlaps(annot, aln, mode="Union")

> counts2 <- summarizeOverlaps(annot, aln, mode="IntersectionStrict")

> counts3 <- summarizeOverlaps(annot, aln, mode="IntersectionNotEmpty")

Exercise 33Create a data.frame or a matrix of the results above and figure out if any differences can be observed.E.g check for difference in the row standard deviation (using the apply and sd functions).

Solution:

> synthTrxCountsTable <- data.frame(

+ assays(counts1)$counts,

+ assays(counts2)$counts,

+ assays(counts3)$counts)

> colnames(synthTrxCountsTable) <- c("union","intStrict","intNotEmpty")

> rownames(synthTrxCountsTable) <- rownames(counts1)

> sds <- apply(synthTrxCountsTable,1,sd)

> sum(sds!=0)

> sum(sds!=0)/length(sds)

> synthTrxCountsTable[which.max(sds),]

> annot[which.max(sds),]

So it appears that we have 3, 872 cases where these counting generate different results (28% of the total!!),and that the synthetic transcript “FBgn0003942.0” shows the largest difference.

For a detailled analysis, it would be important to adequatly choose one of the intersection modesabove, however for the remainder of this section, we will use the “union” set. As before for reads aligningto multiple places in the genome, choosing to take the union when reads overlap several features is asimplification we may not want to do. There are several methods that probabilistically estimate theexpression of overlapping features [11, 19, 20].

This concludes that section on counting reads per known features. In the next section, we will lookat how novel transcribed regions could be identified.

22

Page 24: R Bioconductor for High-Throughput Sequence Analysis · PDF fileR / Bioconductor for High-Throughput Sequence Analysis Nicolas ... 2.2 Creating GAlignment objects from BAM les

5.2 Discovering novel transcribed regions

One main advantage of RNA-seq experiments over microarrays is that they can be used to identify anytranscribed molecule, including unknown transcripts and isoforms, as well as other regulatory transcribedelements. To identify such new elements, several methods are available to recreate and annotate tran-scripts, e.g. Cufflinks[19], Oases[17], Trinity[7], to mention some of them. We can use Bioconductor toolsas well, to identify loci and quantify counts without prior annotation knowledge. The example here isvery crude and is really just a proof of concept of what one could do in a few commands i.e. R rules.

Nonetheless to make the results more precise, the reads have been realigned using STAR [4], avery fast and accurate aligner that use the recent approach of Maximum Exact Matches (MEMs), seehttps://code.google.com/p/rna-star/ for more details. This MEM approach allow STAR to identifyexon-exon junctions without prior knowledge e.g. no need for an annotation gff. To start, we re-readone of the sample alignments using the Rsamtools readGAlignmentsFromBam function.

> aln <- readGAlignmentsFromBam(

+ BamFile(file.path(bigdata(),"STAR","SRR074431.bam")))

Defining transcribed regions The process begins with calculating the coverage, using the methodfrom the GenomicRanges package:

> cover <- coverage(aln)

> cover

> # this object is compressed to save space. It is an RLE (Running Length Encoding)

> # we can look at a section of chromosome 4 say between bp 1 and 1000

> # which gives us the number of read overlapping each of those bases

> as.vector(cover[["3R"]])[1:1000]

The coverage shows us how many reads overlap every single base in the genome. It is actually splitper chromosomes.

The next step is to define, “islands” of expression. These can be created using the slice function.The peak height for the islands can be determined with the viewMaxs function and the island widths canbe found using the width function:

> islands <- slice(cover, 1)

> islandPeakHeight <- viewMaxs(islands)

> islandWidth <- width(islands)

While some more sophisticated approaches can be used to find exons de novo, we can use a simpleapproach whereby we select islands whose maximum peak height is 2 or more and whose width is 114 bp(150% of the read size) or more to be candidate exons. The elementLengths function shows how manyof these candidate exons appear on each chromosome:

> candidateExons <- islands[islandPeakHeight >= 2L & islandWidth >=114L]

> candidateExons[["3R"]]

Remember that we used an aligner which is capable of mapping reads across splice junctions in thegenome.

> sum(cigarOpTable(cigar(aln))[,"N"] > 0)

[1] 99677

There are 99, 677 reads that span exon-exon junctions (EEJs).Let’s look up such a potential EEJ:

> aln[cigarOpTable(cigar(aln))[,"N"] > 0 & seqnames(aln) == "3R",]

23

Page 25: R Bioconductor for High-Throughput Sequence Analysis · PDF fileR / Bioconductor for High-Throughput Sequence Analysis Nicolas ... 2.2 Creating GAlignment objects from BAM les

GAlignments with 22070 alignments and 0 metadata columns:

seqnames strand cigar qwidth start end width

<Rle> <Rle> <character> <integer> <integer> <integer> <integer>

[1] 3R - 58M68N18M 76 452 595 144

[2] 3R + 40M75N36M 76 20556 20706 151

[3] 3R + 39M75N37M 76 20557 20707 151

[4] 3R + 38M75N38M 76 20558 20708 151

[5] 3R - 69M174N7M 76 23216 23465 250

... ... ... ... ... ... ... ...

[22066] 3R - 43M7183N33M 76 27884825 27892083 7259

[22067] 3R - 29M7183N47M 76 27884839 27892097 7259

[22068] 3R - 29M7183N47M 76 27884839 27892097 7259

[22069] 3R + 17M7183N59M 76 27884851 27892109 7259

[22070] 3R + 15M7183N61M 76 27884853 27892111 7259

ngap

<integer>

[1] 1

[2] 1

[3] 1

[4] 1

[5] 1

... ...

[22066] 1

[22067] 1

[22068] 1

[22069] 1

[22070] 1

---

seqlengths:

YHet ... XHet

347038 ... 204112

> aln[cigarOpTable(cigar(aln))[,"N"] > 0 & seqnames(aln) == "3R",][2:10,]

GAlignments with 9 alignments and 0 metadata columns:

seqnames strand cigar qwidth start end width

<Rle> <Rle> <character> <integer> <integer> <integer> <integer>

[1] 3R + 40M75N36M 76 20556 20706 151

[2] 3R + 39M75N37M 76 20557 20707 151

[3] 3R + 38M75N38M 76 20558 20708 151

[4] 3R - 69M174N7M 76 23216 23465 250

[5] 3R + 63M174N13M 76 23222 23471 250

[6] 3R + 45M174N31M 76 23240 23489 250

[7] 3R + 41M174N35M 76 23244 23493 250

[8] 3R + 40M174N36M 76 23245 23494 250

[9] 3R + 32M174N44M 76 23253 23502 250

ngap

<integer>

[1] 1

[2] 1

[3] 1

[4] 1

[5] 1

[6] 1

[7] 1

[8] 1

[9] 1

24

Page 26: R Bioconductor for High-Throughput Sequence Analysis · PDF fileR / Bioconductor for High-Throughput Sequence Analysis Nicolas ... 2.2 Creating GAlignment objects from BAM les

---

seqlengths:

YHet ... XHet

347038 ... 204112

There are respectively 3 and 7 reads spanning what looks like introns of 75 and 174 bp respectively.Note that the GenomicRanges Galignments package is aware of splicing junctions. Have a look at thecoverage for the first intron:

> cover[["3R"]][20556:20706]

integer-Rle of length 151 with 5 runs

Lengths: 1 1 38 75 36

Values : 1 2 3 0 3

Now, if we select a few of these EEJs, we can have a look if we can identify a specific motif.

> splice.reads <- aln[cigarOpTable(cigar(aln))[,"N"] > 0 & seqnames(aln) == "3R",]

> read.start <- start(splice.reads)[c(2,5,16,30,37)]

> donor.pos <- read.start - 1 +

+ as.integer(sapply(strsplit(cigar(splice.reads)[c(2,5,16,30,37)],"M"),"[",1))

> acceptor.pos <- read.start - 1 +

+ sapply(

+ lapply(

+ lapply(strsplit(cigar(splice.reads)[c(2,5,16,30,37)],"M|N"),"[",1:2),

+ as.integer),

+ sum)

Now read the chromosome 3R sequence (actually just a subset of the first 30, 000bp)

> chr3R <- readDNAStringSet(file.path(bigdata(),"..",

+ "fasta",

+ "dmel-chromosome-3R-1-30000bp.fasta"))

Now locate the acceptor and donor sites, but think of the strand! Let’s just look at the one on theplus strand.

> sel <- as.logical(strand(splice.reads)[c(2,5,16,30,37)] == "+")

> plus.donor <- Views(subject=chr3R[[1]],start=donor.pos[sel]-8,

+ end=donor.pos[sel]+11)

> plus.acceptor <- Views(subject=chr3R[[1]],start=acceptor.pos[sel]-10,

+ end=acceptor.pos[sel]+9)

Let’s see if there’s a consensus in the sequences of 20bp centered around the potential acceptor anddonor sites. Note that you might have to install the seqLogo package

> library(seqLogo)

> pwm <- makePWM(cbind(

+ alphabetByCycle(DNAStringSet(plus.donor))[c("A","C","G","T"),]/3,

+ alphabetByCycle(DNAStringSet(plus.acceptor))[c("A","C","G","T"),]/3)

+ )

> seqLogo(pwm)

Clearly the logo - Figure 5.2 - is not exceptional, but from only 3 EEJs, we can already see that thedonor site at position 10 − 11 is GT and the acceptor site at position 30 − 31 is AG, i.e. the canonicalsites. Moreover, we can see a relative - or at least I want to see it because I know it must be there- enrichment for Ts in the intron sequence, a known phenomenon. Hence, using a de-novo approachcomplemented by additional criteria can prove very efficient.

25

Page 27: R Bioconductor for High-Throughput Sequence Analysis · PDF fileR / Bioconductor for High-Throughput Sequence Analysis Nicolas ... 2.2 Creating GAlignment objects from BAM les

Figure 5.2: GC content in aligned reads

This concludes the section on summarizing counts. As you could realize, juggling with the differentpackage for manipulating the alignment and annotation requires some coding. To facilitate this a numberof “workflow” package are available at Bioconductor. The next section gives a brief introduction ofeasyRNASeq (obviously, a biased selection . . . )

26

Page 28: R Bioconductor for High-Throughput Sequence Analysis · PDF fileR / Bioconductor for High-Throughput Sequence Analysis Nicolas ... 2.2 Creating GAlignment objects from BAM les

5.3 Using easyRNASeq

Let us redo what was done in the previous section. Note that most of the RNAseq object slots are optional.However, it is advised to set them, especially the readLength and the organismName; to help having aproper documentation of your analysis. The organismName slot is actually mandatory if you want toget genomic annotation using biomaRt. In that case, you need to provide the name as specified in thecorresponding BSgenome package, i.e. “Dmelanogaster” for the BSgenome.Dmelanogaster.UCSC.dm3package.

> ## load the library

> library("easyRNASeq")

> count.table <- easyRNASeq(filesDirectory=dirname(bamfiles[1]),

+ filenames=basename(bamfiles),

+ organism="Dmelanogaster",

+ readLength=76L,

+ annotationMethod="gff",

+ annotationFile=file.path("~/Day2",

+ "dmel_synthetic_transcript_r5-52.gff3"),

+ format="bam",

+ gapped=TRUE,

+ count="transcripts")

> head(count.table)

> dim(count.table)

That is all. In one command, you got the count table for your 2 samples!

Warnings As you could see when running the previous example, warnings were emitted and quiterightly so.

1. about the annotation: Although we have created synthetic transcripts (sometimes called genemodels), the annotation we are using here is still redundant, as genes located on opposing strandoverlap. Therefore counting reads using these annotation will result in counting some of the readsseveral times. As this can be a very significant source of error, all the examples here will emitthis warning. The ideal solution is to further refine the annotation object so that it contains nooverlapping features or validate the affected genes either in-silico - e.g. by looking at the raw readcoverage in a genome browser or by an appropriate wet-lab method.

2. about potential naming issue in the input file: It is (sadly) very frequent that the sequencingfacilities use different naming conventions for the chromosomes they report in the alignment files.It is therefore very frequent that the annotation provided to easyRNASeq uses different chromosomenames than the alignment file. These warnings are there to inform you about this issue.

Details The easyRNASeq function currently accepts the following annotationMethods:

• “biomaRt” use biomaRt to retrieve the annotation

• “env” use a RangedData or GRanges class object present in the environment

• “gff” reads in a gff version 3 file

• “gtf” reads in a gtf file

• “rda” load an RData object. The object needs to be named gAnnot and of class RangedData orGRanges.

The reads can be read in from BAM files or any format supported by ShortRead.The reads can be summarized by:

• exons

27

Page 29: R Bioconductor for High-Throughput Sequence Analysis · PDF fileR / Bioconductor for High-Throughput Sequence Analysis Nicolas ... 2.2 Creating GAlignment objects from BAM les

• features (any features such as introns, enhancers, etc.)

• transcripts

• geneModels (a geneModel is the set of non overlapping loci (i.e. synthetic exons) that representsall the possible exons and UTRs of a gene. Such geneModels are essential when counting readsas they ensure that no reads will be accounted for several times. E.g., a gene can have differentisoforms, using different exons, overlapping exons, in which case summarizing by exons might resultin counting a read several times, once per overlapping exon. N.B. Assessing differential expressionbetween transcripts, based on synthetic exons is something possible since the release 2.11 of R,using the DEXSeq package available from Bioconductor. Note that this geneModels approach isactually an older implementation of the one we have taken during the lecture to create the synthetictranscripts. This last one should be the prefered one. In the coming version of easyRNASeq,geneModels will be deprecated in favor of the synthetic transcripts generation approach.

The results can be exported in five different formats:

• count table (the default, a n (features) x m (samples) matrix).

• a DESeq [1] countDataSet class object. Useful to perform further analyses using the DESeq package.

• an edgeR [16] DGEList class object. Useful to perform further analyses using the edgeR package.

• an RNAseq class object. Useful for performing additional pre-processing without re-loading thereads and annotations.

• an SummarizedExperiment class object. This should be the output of choice and will be madedefault in the easyRNASeq version to be released with Bioconductor version 2.14 next spring.

The obtained results can optionally be corrected as Reads per Kilobase of feature per Million readsin the library (RPKM, [14]) or normalized using the DESeq or edgeR packages.

For more details and a complete overview of the easyRNASeq package capabilities, have a look atthe easyRNASeq vignette.

> vignette("easyRNASeq")

Exercise 34From the same input files and annotations, generate an object of class SummarizedExperiment .

Solution:Note that recent change to the Bioconductor API have affected the following functionality. I’ve just

realized it yesterday, so I have not got the time to devise a fix, but I will do so asap. Sorry. Well, anywayit should be the end of the day when you reach that point so you probably will not mind so much I hope.Especially since it is not something essential.

> sumExp <- easyRNASeq(filesDirectory=dirname(bamfiles[1]),

+ filenames=basename(bamfiles),

+ organism="Dmelanogaster",

+ readLength=76L,

+ annotationMethod="gff",

+ annotationFile=file.path("~/Day2",

+ "dmel_synthetic_transcript_r5-52.gff3"),

+ format="bam",

+ gapped=TRUE,

+ count="transcripts",

+ outputFormat="SummarizedExperiment")

See the GenomicRange package SummarizedExperiment class for more details on last three accessorsused in the following.

28

Page 30: R Bioconductor for High-Throughput Sequence Analysis · PDF fileR / Bioconductor for High-Throughput Sequence Analysis Nicolas ... 2.2 Creating GAlignment objects from BAM les

> ## the counts

> assays(sumExp)

> ## the sample info

> colData(sumExp)

> ## the 'features' info

> rowData(sumExp)

Caveats easyRNASeq is still under active development and as such the current version still lacks someessential data processing (e.g. strand specific sequencing is not yet supported). The new version to bereleased with Bioconductor 2.13, in early October this year, fill in these gaps:

1. The easyRNASeq function used above actually gets deprecated in favor of the simpleRNASeq functionwhich takes advantage of numerous core Bioconductor packages, e.g. the use of BamFile andBamFileList objects from the Rsamtools package to locate and access BAM formatted files.

2. Secondly, it benefits from the RNA-Seq field standardization in the sense that the number ofnecessary arguments to be provided by default has plummeted. It benefits as well from a refactoringof how these arguments are provided; they are indeed abstracted and combined in a way similar tothe ScanBamParam parameter of the Rsamtools package scanBam.

3. Then, its performance - e.g. memory management - have been optimized through parallelization.

4. In addition, advanced checks are conducted on the data provided by the user to ensure the overallprocess suitability. More comprehensive warnings or errors are thrown, should it be necessary.

5. The concerns raised by the analysis reported there https://stat.ethz.ch/pipermail/bioc-devel/2012-September/003608.html by Robinson et al. have been adressed too. Both the originaleasyRNASeq method and the GenomicRanges approach are provided, the later one being thedefault.

6. And last but not least, it provides access to the latest tools for Differential Expression expres-sion analysis such as DESeq2 and DEXSeq. Planned is an integration of the limma for enablingthe voom+limma paradigm. Ideally, easyRNASeq would select the most appropriate analysis to beconducted based on the report by Soneson and Delorenzi [18].

5.4 Where to from here

After obtaining the count table, numerous downstream analyses are available. Most often, such counttables are generated in a differential expression experimental setup. In that case, packages such as DESeq,DEXSeq, edgeR, limma (see voom+limma in the limma vignette), etc. are some of the possibilitiesavailable in Bioconductor. Have a look at [3] and [18] to decide which tool/approach is the best suitedfor your experimental design. But, of course, counts can as well be used for other purposes such asvisualization, using e.g. the rtracklayer and GViz packages.Actually, there’s no real limitation of what one can achieve with a count table and it does not needbe an RNA-Seq experiment; look at the DiffBind package for an example of using ChIP-Seq data fordifferential binding analyses.

29

Page 31: R Bioconductor for High-Throughput Sequence Analysis · PDF fileR / Bioconductor for High-Throughput Sequence Analysis Nicolas ... 2.2 Creating GAlignment objects from BAM les

Acknowledgments

1. Thanks to the Workshop organizers, in particular Gabriella Rustici

2. Thanks to the other lecturers, it is always fun around you.

3. Finally, thanks to you the reader - whatever the support you’re reading this on - for having madeit that far.

30

Page 32: R Bioconductor for High-Throughput Sequence Analysis · PDF fileR / Bioconductor for High-Throughput Sequence Analysis Nicolas ... 2.2 Creating GAlignment objects from BAM les

Bibliography

[1] S. Anders and W. Huber. Differential expression analysis for sequence count data. Genome Biology,11:R106, 2010.

[2] A. N. Brooks, L. Yang, M. O. Duff, K. D. Hansen, J. W. Park, S. Dudoit, S. E. Brenner, and B. R.Graveley. Conservation of an RNA regulatory map between Drosophila and mammals. GenomeResearch, pages 193–202, 2011.

[3] M.-A. Dillies, A. Rau, J. Aubert, C. Hennequet-Antier, M. Jeanmougin, N. Servant, C. Keime,G. Marot, D. Castel, J. Estelle, G. Guernec, B. Jagla, L. Jouneau, D. Laloe, C. L. Gall, B. Schaeffer,S. L. Crom, M. Guedj, F. Jaffrezic, and on behalf of The French StatOmique Consortium. Acomprehensive evaluation of normalization methods for illumina high-throughput rna sequencingdata analysis. Brief Bioinformatics, Sep 2012.

[4] A. Dobin, C. A. Davis, F. Schlesinger, J. Drenkow, C. Zaleski, S. Jha, P. Batut, M. Chaisson, andT. R. Gingeras. Star: ultrafast universal rna-seq aligner. Bioinformatics, 29(1):15–21, Jan 2013.

[5] R. C. Gentleman et al. Bioconductor: open software development for computational biology andbioinformatics. Genome Biology 2010 11:202, 5(10):R80, Jan 2004.

[6] Glaus, Peter, Honkela, Antti, Rattray, and Magnus. Identifying differentially expressed transcriptsfrom rna-seq data with biological variation. Bioinformatics, 28(13):1721–1728, 2012.

[7] M. G. Grabherr, B. J. Haas, M. Yassour, J. Z. Levin, D. A. Thompson, I. Amit, X. Adiconis, L. Fan,R. Raychowdhury, Q. Zeng, Z. Chen, E. Mauceli, N. Hacohen, A. Gnirke, N. Rhind, F. D. Palma,B. W. Birren, C. Nusbaum, K. Lindblad-Toh, N. Friedman, and A. Regev. Full-length transcriptomeassembly from rna-seq data without a reference genome. Nat Biotechnol, 29(7):644–652, May 2011.

[8] R. W. Hamming. Error detecting and error correcting codes. The Bell System Technical Journal,XXIX(2):1–14, Nov 1950.

[9] B. Langmead, C. Trapnell, M. Pop, and S. L. Salzberg. Ultrafast and memory-efficient alignmentof short DNA sequences to the human genome. Genome Biol., 10:R25, 2009.

[10] P. Lefrancois, G. M. Euskirchen, R. K. Auerbach, J. Rozowsky, T. Gibson, C. M. Yellman, M. Ger-stein, and M. Snyder. Efficient yeast chip-seq using multiplex short-read dna sequencing. BMCgenomics, 10(1):37, Jan 2009.

[11] B. Li, V. Ruotti, R. M. Stewart, J. A. Thomson, and C. N. Dewey. Rna-seq gene expressionestimation with read mapping uncertainty. Bioinformatics, 26(4):493–500, Feb 2010.

[12] H. Li and R. Durbin. Fast and accurate short read alignment with Burrows-Wheeler transform.Bioinformatics, 25:1754–1760, Jul 2009.

[13] H. Li and R. Durbin. Fast and accurate long-read alignment with Burrows-Wheeler transform.Bioinformatics, 26:589–595, Mar 2010.

[14] A. Mortazavi et al. Mapping and quantifying mammalian transcriptomes by rna-seq. Nature Meth-ods, 5(7):621–8, Jul 2008.

31

Page 33: R Bioconductor for High-Throughput Sequence Analysis · PDF fileR / Bioconductor for High-Throughput Sequence Analysis Nicolas ... 2.2 Creating GAlignment objects from BAM les

[15] R Development Core Team. R: A Language and Environment for Statistical Computing. R Foun-dation for Statistical Computing, Vienna, Austria, 2009. ISBN 3-900051-07-0.

[16] M. D. Robinson, D. J. McCarthy, and G. K. Smyth. edgeR: a Bioconductor package for differentialexpression analysis of digital gene expression data. Bioinformatics, 26:139–140, Jan 2010.

[17] M. H. Schulz, D. R. Zerbino, M. Vingron, and E. Birney. Oases: robust de novo rna-seq assemblyacross the dynamic range of expression levels. Bioinformatics, 28(8):1086–92, Apr 2012.

[18] C. Soneson and M. Delorenzi. A comparison of methods for differential expression analysis of rna-seqdata. BMC Bioinformatics, 14:91, Jan 2013.

[19] C. Trapnell, B. A. Williams, G. Pertea, A. Mortazavi, G. Kwan, M. J. van Baren, S. L. Salzberg,B. J. Wold, and L. Pachter. Transcript assembly and quantification by rna-seq reveals unannotatedtranscripts and isoform switching during cell differentiation. Nat Biotechnol, 28(5):511–5, May 2010.

[20] E. Turro, S.-Y. Su, A. Goncalves, L. J. M. Coin, S. Richardson, and A. Lewin. Haplotype andisoform specific expression estimation using multi-mapping rna-seq reads. Genome Biol, 12(2):R13,Jan 2011.

[21] T. D. Wu and C. K. Watanabe. Gmap: a genomic mapping and alignment program for mrna andest sequences. Bioinformatics, 21(9):1859–75, May 2005.

32