Getting the computer setup

• Follow the directions on the handout to log in to the server.

• Type “qsub -I” to get a compute node.

• The data you will be using is stored in ../shared/

Mapping RNA-seq data

Matthew Young, Alicia Oshlack, Bernie Pope

Illumina Sequencing Technology

[Figure: workflow from DNA (0.1-1.0 ug) through sample preparation, single molecule array, cluster growth, sequencing (cycles 1 2 3 4 5 6 7 8 9), image acquisition, and base calling (T G C T A C G A T …)]

Slide courtesy of G Schroth, Illumina

Raw data

• Short sequence reads

• Quality scores = -10log10(p) or similar…

@SEQ_ID
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65

Handout Reference: Page 2
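
The quality score formula above can be tried out with a short Python sketch. It assumes the quality string uses the Phred+33 (Sanger) ASCII offset; that is an assumption, since some Illumina pipelines of this era used an offset of 64 instead. The function names are illustrative only.

```python
def decode_quality(qual, offset=33):
    """Convert a FASTQ quality string into a list of Phred scores.

    offset=33 is an assumption (Sanger encoding); pass offset=64 for
    older Illumina encodings.
    """
    return [ord(ch) - offset for ch in qual]


def error_probability(q):
    """Invert Q = -10 * log10(p) to recover the error probability p."""
    return 10 ** (-q / 10)


if __name__ == "__main__":
    qual = "!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65"
    scores = decode_quality(qual)
    # Show the first few scores next to their error probabilities.
    for q in scores[:5]:
        print(q, error_probability(q))
```

A score of 20 corresponds to a 1 in 100 chance that the base call is wrong, and a score of 30 to 1 in 1000.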

Quality checks

• Base composition

• Quality score

• PCR Artifacts

Sequencing transcripts not the genome

[Figure: a gene (CDS CDS CDS CDS) and its transcript (CDS CDS CDS CDS)]

The difficulty here is that reads spanning an exon-exon junction may not get mapped when mapping to the genome.

One strategy: Supplement the reference genome with sequences that span all known or possible junctions.

[Figure: reads aligned to a transcript (CDS CDS CDS CDS); legend: Coding Sequence, Exons, Introns, Splice Junctions]

Aggregate reads based on:

• Exons

• Exons + junctions

• All reads start to end of transcript

• De novo methods

Mapping tools

Type                          Name                            Link
General aligner               GMAP/GSNAP                      http://research-pub.gene.com/gmap/
                              BFAST                           http://sourceforge.net/apps/mediawiki/bfast/index.php
                              BOWTIE                          http://bowtie-bio.sourceforge.net/index.shtml
                              CloudBurst                      http://sourceforge.net/apps/mediawiki/cloudburst-bio/index.php
                              GNUmap                          http://dna.cs.byu.edu/gnumap/index.shtml
                              MAQ/BWA                         http://maq.sourceforge.net/
                              Perm                            http://code.google.com/p/perm/
                              RazerS                          http://www.seqan.de/projects/razers.html
                              Mrfast/mrsfast                  http://mrfast.sourceforge.net/manual.html
                              SOAP/SOAP2                      http://soap.genomics.org.cn/
                              SHRiMP                          http://compbio.cs.toronto.edu/shrimp/
De Novo annotator             QPALMA/GenomeMapper/PALMapper   http://www.fml.tuebingen.mpg.de/raetsch/suppl/palmapper
                              SpliceMap                       http://www.stanford.edu/group/wonglab/SpliceMap/
                              SOAPals                         http://soap.genomics.org.cn/
                              G-Mo.R-Se                       http://www.genoscope.cns.fr/externe/gmorse/
                              TopHat                          http://tophat.cbcb.umd.edu/
                              SplitSeek                       http://solidsoftwaretools.com/gf/project/splitseek
De Novo transcript assembler  Oases                           http://www.ebi.ac.uk/~zerbino/oases/
                              MIRA                            http://sourceforge.net/apps/mediawiki/mira-assembler/index.php

Familiarizing yourself with bowtie

• The minimum information bowtie needs is your reads (i.e., the fastq file from the machine) and a reference.

• The reference is the genome or transcriptome we are trying to map to, transformed using a Burrows-Wheeler transform to allow fast searching.
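
To give a feel for what the Burrows-Wheeler transform does, here is a toy Python sketch. It builds the transform by sorting all rotations of the text and then counts exact occurrences of a pattern with a backward search. This is for illustration only and is not how bowtie builds or stores its index: real indexes use far more memory-efficient construction and precomputed lookup tables. The function names and the "$" end marker are choices made for this example.

```python
def bwt(text, sentinel="$"):
    """Naive Burrows-Wheeler transform: sort all rotations, keep last column."""
    text = text + sentinel  # sentinel sorts before A, C, G, T
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rot[-1] for rot in rotations)


def count_occurrences(bwt_str, pattern):
    """Count exact matches of pattern using backward search on the BWT."""
    # first[ch] = row where the block of rotations starting with ch begins
    counts = {}
    for ch in bwt_str:
        counts[ch] = counts.get(ch, 0) + 1
    first, total = {}, 0
    for ch in sorted(counts):
        first[ch] = total
        total += counts[ch]

    lo, hi = 0, len(bwt_str)  # half-open range of matching rows
    for ch in reversed(pattern):
        if ch not in first:
            return 0
        # Slicing and counting is O(n) per step; a real index precomputes this.
        lo = first[ch] + bwt_str[:lo].count(ch)
        hi = first[ch] + bwt_str[:hi].count(ch)
        if lo >= hi:
            return 0
    return hi - lo


if __name__ == "__main__":
    transformed = bwt("GATTACAGATTACA")
    print(transformed)
    print(count_occurrences(transformed, "TTAC"))
```

The search walks the pattern from its last character to its first, so the work depends on the pattern length rather than on scanning the whole reference.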

• Many optional parameters to tweak alignment.

Handout Reference: Pages 2-7

How do you map 10^9 76 bp sequences to a 10^9 bp reference?

• Ideally we’d test every position on the genome for its suitability as a match and assign it a score based on # mismatches, indels etc.

• However, this is computationally infeasible for data sets of this size, so we have to come up with something else.
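
A back-of-the-envelope estimate shows why. The Python sketch below counts base comparisons for testing every read at every reference position, with mismatches only and no indels. The rate of 10^9 comparisons per second is a made-up round number for illustration, not a measurement.

```python
def brute_force_years(n_reads, read_len, ref_len, comparisons_per_second):
    """Estimate the run time, in years, of checking every read at every position."""
    positions = ref_len - read_len + 1          # start positions on one strand
    comparisons = n_reads * positions * read_len
    seconds = comparisons / comparisons_per_second
    return seconds / (3600 * 24 * 365)


if __name__ == "__main__":
    # 10^9 reads of 76 bp against a 10^9 bp reference, as on this slide.
    # The comparison rate is a hypothetical value.
    years = brute_force_years(10**9, 76, 10**9, comparisons_per_second=10**9)
    print(f"about {years:.0f} years")
```

Under these assumptions the estimate comes out in the thousands of years, before even considering indels or the reverse strand.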


Lots of aligners, one general strategy

• Quick “heuristic” is performed to cut down the number of candidate alignment regions for each read.

• More precise algorithm is employed to decide which of these candidates is a valid alignment.


A fragment as seen by an aligner

• Heuristic acts only on the seed. Putative mapping locations identified.

• A more precise algorithm extends each seed to the full read and ranks them.

• The fragment is not sequenced directly; its sequence is inferred from the mapping of the read.

[Figure: a fragment, the read within it, and the seed within the read; labels: “Burrows Wheeler acts on this”, “Smith-Waterman acts on this”]
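
The two-step strategy can be sketched in a few lines of Python. Note the substitutions: this toy version uses a plain dictionary of k-mers for the seed lookup instead of a Burrows-Wheeler index, and a simple mismatch count instead of Smith-Waterman, so it only handles substitutions and exact seeds. All names and parameters are illustrative.

```python
from collections import defaultdict


def build_seed_index(reference, seed_len):
    """Map every substring of length seed_len to its start positions."""
    index = defaultdict(list)
    for pos in range(len(reference) - seed_len + 1):
        index[reference[pos:pos + seed_len]].append(pos)
    return index


def count_mismatches(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(1 for x, y in zip(a, b) if x != y)


def map_read(read, reference, index, seed_len, max_mismatches):
    """Step 1: look up the seed. Step 2: check the full read at each hit."""
    seed = read[:seed_len]
    hits = []
    for pos in index.get(seed, []):
        window = reference[pos:pos + len(read)]
        if len(window) < len(read):
            continue  # read would run off the end of the reference
        mismatches = count_mismatches(read, window)
        if mismatches <= max_mismatches:
            hits.append((mismatches, pos))
    return sorted(hits)  # best (fewest mismatches) first


if __name__ == "__main__":
    reference = "ACGTACGTTTGACCA"
    index = build_seed_index(reference, seed_len=4)
    print(map_read("ACGTT", reference, index, seed_len=4, max_mismatches=1))
```

Only the positions where the seed matches are ever examined in full, which is what keeps the second, more expensive step affordable.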


Our test data set

• 76bp single end reads.

• Sequenced using the Illumina Genome Analyzer.

• RNA taken from Mouse myoblast cell line.

• We only look at a subset of one lane, full data available here http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE2084




A strict mapping

bowtie -t -p 1 -n 0 -e 1 --sam --best --un Strict_Unmapped.fastq ../shared/BowtieIndexes/mm9 ../shared/Sample_reads.fastq Strict_Mapped.sam

• -n sets the maximum number of mismatches allowed in the seed

• The sum of quality scores at ALL mismatches (not just the seed) must not exceed the value given to -e.

• --un saves the unmapped reads to the file specified

• -t prints timing information

• -p sets the number of simultaneous threads

• --sam makes bowtie output in SAM format

• --best ensures the best map is returned
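
The -e rule can be illustrated with a small Python sketch. It sums the quality scores at every mismatched position and compares the total with the limit. Two assumptions are made here: the limit is treated as inclusive, and bowtie's own handling of quality values before summing (for example any rounding described in its manual) is ignored. The function name is illustrative.

```python
def passes_e_limit(read, ref_window, quals, e):
    """Return True if the summed quality at mismatches stays within e.

    read and ref_window are equal-length strings; quals is a list of
    Phred scores, one per base. Treating the limit as inclusive is an
    assumption made for this sketch.
    """
    total = sum(q for r, g, q in zip(read, ref_window, quals) if r != g)
    return total <= e


if __name__ == "__main__":
    read, ref = "ACGT", "ACGA"      # one mismatch at the last base
    quals = [30, 30, 30, 20]
    print(passes_e_limit(read, ref, quals, e=1))    # strict setting
    print(passes_e_limit(read, ref, quals, e=70))   # looser setting
```

With -e 1, a single mismatch at a base of quality 20 is already enough to reject the alignment, which is why this mapping is called strict.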


Using the defaults

bowtie -t -p 1 --best --sam ../shared/BowtieIndexes/mm9 ../shared/Sample_reads.fastq test.sam

• -t prints timing information

• -p sets the number of simultaneous threads

• --sam makes bowtie output in SAM format

• --best ensures the best map is returned

• --un saves the unmapped reads to the file specified (not used in the command above)


Allowing some mismatches

bowtie -t -p 1 -n 3 -e 200 --sam --best --un Loose_Unmapped.fastq ../shared/BowtieIndexes/mm9 Strict_Unmapped.fastq Loose_Mapped.sam

• Set the number of seed mismatches to the maximum.

• Set -e to a value more appropriate for our read length.

• How many reads do you map?
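
Besides reading bowtie's own summary, the number of mapped reads can be counted from the SAM output. The Python sketch below relies on the SAM convention that bit 0x4 of the FLAG column marks an unmapped read. It assumes one output line per read, which should be checked for your own bowtie settings. The function name and the example records are made up for illustration.

```python
def count_mapped(sam_lines):
    """Return (mapped, unmapped) counts from an iterable of SAM lines."""
    mapped = unmapped = 0
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and header lines
        flag = int(line.split("\t")[1])  # FLAG is the second column
        if flag & 4:                     # bit 0x4: read is unmapped
            unmapped += 1
        else:
            mapped += 1
    return mapped, unmapped


if __name__ == "__main__":
    # Made-up records; in practice pass an open file such as
    # open("Loose_Mapped.sam") instead of this list.
    example = [
        "@HD\tVN:1.0",
        "r1\t0\tchr1\t100\t255\t4M\t*\t0\t0\tACGT\tIIII",
        "r2\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",
    ]
    print(count_mapped(example))
```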


A closer look at our data set: FastQC



Trimming reads

• Bowtie allows you to trim reads before it attempts to map them.

• You can trim from the left (5’) end of the read with the --trim5 option.

• You can trim from the right (3’) end of the read with the --trim3 option.
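
What trimming does to a single read can be shown with a short Python sketch. It mimics the effect of the two options on one FASTQ record; it is not bowtie's code, and the function name is illustrative.

```python
def trim_read(seq, qual, trim5=0, trim3=0):
    """Drop trim5 bases from the 5' end and trim3 bases from the 3' end.

    The quality string is trimmed identically so it stays in step with
    the sequence.
    """
    end = max(len(seq) - trim3, 0)
    return seq[trim5:end], qual[trim5:end]


if __name__ == "__main__":
    seq = "ACGT" * 19          # a 76 bp read
    qual = "I" * 76
    trimmed_seq, trimmed_qual = trim_read(seq, qual, trim5=15, trim3=25)
    print(len(trimmed_seq))    # 76 - 15 - 25 bases remain
```

Trimming 15 bases from the 5’ end and 25 from the 3’ end of a 76 bp read leaves 36 bp to map.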


Sequencing transcripts not the genome

[Figure: a gene (CDS CDS CDS CDS), its transcript (CDS CDS CDS CDS), and reads]

A library containing these bits of sequence (which do not appear in the genome) can help map junction reads. This is called an exon-junction library.

A reference built from this library is in BowtieIndexes, called mm9.UCSC.knownGene.junctions (named for the annotation it was built from).
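
One way such a library could be put together is sketched below in Python. For each pair of neighbouring exons it joins the end of the upstream exon to the start of the downstream exon, taking read length minus one bases from each side so that any read placed on the result must cross the junction. How the mm9.UCSC.knownGene.junctions index was actually built is not described here, so the flank length, the restriction to neighbouring exons, and the function name are all assumptions.

```python
def junction_sequences(exons, read_len):
    """Build one junction sequence per pair of neighbouring exons.

    exons is a list of exon sequences in transcript order.
    """
    if read_len < 2:
        raise ValueError("read_len must be at least 2")
    flank = read_len - 1
    junctions = []
    for upstream, downstream in zip(exons, exons[1:]):
        # Short exons contribute their whole sequence.
        junctions.append(upstream[-flank:] + downstream[:flank])
    return junctions


if __name__ == "__main__":
    exons = ["AAAAAA", "CCCCCC", "GGGGGG"]   # made-up exon sequences
    for seq in junction_sequences(exons, read_len=4):
        print(seq)
```

To cover “possible” as well as known junctions, the same idea could be applied to every ordered pair of exons in a gene rather than only neighbours.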


Options to map more reads…

• Trim some bases from the ends of the reads using --trim5 and/or --trim3.

• Map to the junction library instead of the mouse genome using the mm9.UCSC.knownGene.junctions index.


Trim

[myoung@bionode11 Starting]$ bowtie -t -p 1 -n 2 -e 70 --trim5 15 --trim3 25 --sam --best --un Trimmed_Unmapped.fastq ../shared/BowtieIndexes/mm9 Loose_Unmapped.fastq Trimmed_Mapped.sam
Time loading forward index: 00:00:02
Time loading mirror index: 00:00:02
Seeded quality full-index search: 00:00:47
# reads processed: 531414
# reads with at least one reported alignment: 164256 (30.91%)
# reads that failed to align: 367158 (69.09%)
Reported 164256 alignments to 1 output stream(s)
Time searching: 00:00:53
Overall time: 00:00:54


Then map to junctions

[myoung@bionode11 Starting]$ bowtie -t -p 1 -n 3 -e 200 --sam --best --un Junctions_Unampped.fastq ../shared/BowtieIndexes/mm9.UCSC.knownGene.junctions Trimmed_Unmapped.fastq Junctions_Mapped.sam
Time loading forward index: 00:00:35
Time loading mirror index: 00:00:34
Seeded quality full-index search: 00:00:24
# reads processed: 367158
# reads with at least one reported alignment: 128756 (35.07%)
# reads that failed to align: 238402 (64.93%)
Reported 128756 alignments to 1 output stream(s)
Time searching: 00:01:43
Overall time: 00:01:45


Number of mapped reads

Mapping strategy  Command line options        No. Mapped Reads    No. Unmapped Reads  Reference
Strict            -n 0 -e 1                   1,049,050 (47.32%)  1,167,992 (52.68%)  Genome
Loose             -n 3 -e 200                 1,783,048 (80.42%)  433,994 (19.58%)    Genome
Trimming          -n 3 --trim5 15 --trim3 25  1,912,003 (86.24%)  305,039 (13.76%)    Genome
Junctions         -n 3 -e 200                 2,007,627 (90.55%)  209,415 (9.45%)     Junction Library
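
The counts in this table are cumulative: each strategy is run on the reads left unmapped by the previous one. The Python sketch below recomputes the percentages and the number of reads gained at each step from the counts in the table. The variable and function names are illustrative.

```python
# (strategy, mapped, unmapped) taken from the table above
RESULTS = [
    ("Strict", 1049050, 1167992),
    ("Loose", 1783048, 433994),
    ("Trimming", 1912003, 305039),
    ("Junctions", 2007627, 209415),
]


def percent(count, total):
    """Percentage of total, rounded to two decimal places."""
    return round(100 * count / total, 2)


def gains(results):
    """Reads newly mapped at each step, relative to the previous step."""
    mapped = [m for _, m, _ in results]
    return [later - earlier for earlier, later in zip(mapped, mapped[1:])]


if __name__ == "__main__":
    for name, mapped, unmapped in RESULTS:
        total = mapped + unmapped
        print(name, total, percent(mapped, total), percent(unmapped, total))
    print(gains(RESULTS))
```

Every row sums to the same total number of reads, which is a quick check that no reads were lost between steps.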


Further options

Type                          Name                            Link
General aligner               GMAP/GSNAP                      http://research-pub.gene.com/gmap/
                              BFAST                           http://sourceforge.net/apps/mediawiki/bfast/index.php
                              BOWTIE                          http://bowtie-bio.sourceforge.net/index.shtml
                              CloudBurst                      http://sourceforge.net/apps/mediawiki/cloudburst-bio/index.php
                              GNUmap                          http://dna.cs.byu.edu/gnumap/index.shtml
                              MAQ/BWA                         http://maq.sourceforge.net/
                              Perm                            http://code.google.com/p/perm/
                              RazerS                          http://www.seqan.de/projects/razers.html
                              Mrfast/mrsfast                  http://mrfast.sourceforge.net/manual.html
                              SOAP/SOAP2                      http://soap.genomics.org.cn/
                              SHRiMP                          http://compbio.cs.toronto.edu/shrimp/
De Novo annotator             QPALMA/GenomeMapper/PALMapper   http://www.fml.tuebingen.mpg.de/raetsch/suppl/palmapper
                              SpliceMap                       http://www.stanford.edu/group/wonglab/SpliceMap/
                              SOAPals                         http://soap.genomics.org.cn/
                              G-Mo.R-Se                       http://www.genoscope.cns.fr/externe/gmorse/
                              TopHat                          http://tophat.cbcb.umd.edu/
                              SplitSeek                       http://solidsoftwaretools.com/gf/project/splitseek
De Novo transcript assembler  Oases                           http://www.ebi.ac.uk/~zerbino/oases/
                              MIRA                            http://sourceforge.net/apps/mediawiki/mira-assembler/index.php

