Upload
nay
View
22
Download
0
Embed Size (px)
DESCRIPTION
Getting the computer setup. Follow directions on handout to login to server. Type “ qsub - I ” to get a compute node. The data you will be using is stored in ../shared/. Mapping RNA-seq data. Matthew Young Alicia Oshlack Bernie Pope. Illumina Sequencing Technology. 3’. 5’. G. C. T. - PowerPoint PPT Presentation
Citation preview
Getting the computer setup
• Follow directions on handout to login to server.
• Type “qsub -I” to get a compute node.• The data you will be using is stored in
../shared/
Mapping RNA-seq data
Matthew YoungAlicia OshlackBernie Pope
DNA(0.1-1.0 ug)
Single molecule arraySample
preparation Cluster growth5’
5’3’
G
T
C
A
G
T
C
A
G
T
C
A
C
A
G
TC
A
T
C
A
C
C
TAG
CG
TA
GT
1 2 3 7 8 94 5 6
Image acquisition Base calling
T G C T A C G A T …
Sequencing
Illumina Sequencing Technology
Slide courtesy of G Schroth, Illumina
Raw data
• Short sequence reads
• Quality scores = -10log10(p) or similar…
@SEQ_IDGATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT+!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
Handout Reference: Page 2
Quality checks• Base composition• Quality score• PCR Artifacts
CDS CDS CDS CDS
CDS CDS CDS CDS
Gene
transcript
Sequencing transcripts not the genome
Difficulty here is that reads spanning a exon-exon junction may not get mapped when mapping to the genome.
One strategy: Supplement the reference genome with sequences that span all known or possible junctions.
Reads
Coding Sequence Exons Introns Splice Junctions
CDS CDS CDS CDS
Aggregate reads based on: exons Exons + junctions All reads start to end of transcript De novo methods
Mapping toolsType Name Link
General aligner GMAP/GSNAP http://research-pub.gene.com/gmap/
BFAST http://sourceforge.net/apps/mediawiki/bfast/index.php
BOWTIE http://bowtie-bio.sourceforge.net/index.shtml
CloudBurst http://sourceforge.net/apps/mediawiki/cloudburst-bio/index.php
GNUmap http://dna.cs.byu.edu/gnumap/index.shtml
MAQ/BWA http://maq.sourceforge.net/
Perm http://code.google.com/p/perm/
RazerS http://www.seqan.de/projects/razers.html
Mrfast/mrsfast http://mrfast.sourceforge.net/manual.html
SOAP/SOAP2 http://soap.genomics.org.cn/
SHRiMP http://compbio.cs.toronto.edu/shrimp/
De Novo annotator QPALMA/GenomeMapper/PALMapper http://www.fml.tuebingen.mpg.de/raetsch/suppl/palmapper
SpliceMap http://www.stanford.edu/group/wonglab/SpliceMap/
SOAPals http://soap.genomics.org.cn/
G-Mo. R-Se http://www.genoscope.cns.fr/externe/gmorse/
TopHat http://tophat.cbcb.umd.edu/
SplitSeek http://solidsoftwaretools.com/gf/project/splitseek
De Novo transcript assembler
Oases http://www.ebi.ac.uk/~zerbino/oases/
MIRA http://sourceforge.net/apps/mediawiki/mira-assembler/index.php
Getting the computer setup
• Follow directions on handout to login to server.
• Type “qsub -I” to get a compute node.• The data you will be using is stored in
../shared/
Familiarizing yourself with bowtie
• The minimum information bowtie needs is your reads (i.e., the fastq file from the machine) and a reference.
• The reference is the genome or transcriptome we are trying to map to, transformed using a Burrows Wheeler Transform to allow fast searching.
• Many optional parameters to tweak alignment.
Handout Reference: Pages 2-7
How do you map 10^9, 76bp sequences, to a 10^9 bp reference
• Ideally we’d test every position on the genome for its suitability as a match and assign it a score based on # mismatches, indels etc.
• However, this is computationally impossible, so we have to come up with something else.
Handout Reference: Pages 2-7
Lots of aligners, one general strategy
• Quick “heuristic” is performed to cut down the number of candidate alignment regions for each read.
• More precise algorithm is employed to decide which of these candidates is a valid alignment.
Handout Reference: Pages 2-7
A fragment as seen by an aligner
• Heuristic acts only on the seed. Putative mapping locations identified.
• A more precise algorithm extends each seed to the full read and ranks them.
• The fragment is not sequenced directly, it’s sequence is inferred based on mapping of the read.
;
ReadSeed
Fragment
Burrows Wheeler acts on thisSmith-Waterman acts on this
Handout Reference: Pages 2-7
Our test data set
• 76bp single end reads. • Sequenced using the Illumina Genome
Analyzer• RNA taken from Mouse myoblast cell line.• We only look at a subset of one lane, full data
available here http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE2084
Getting the computer setup
• Follow directions on handout to login to server.
• Type “qsub -I” to get a compute node.• The data you will be using is stored in
../shared/
A strict mappingbowtie -t -p 1 -n 0 -e 1 --sam --best --un Strict_Unmapped.fastq
../shared/BowtieIndexes/mm9 ../shared/Sample_reads.fastq Strict_Mapped.sam
• -n sets the number of mismatches in the seed• The sum of quality scores at ALL mismatches (not just the seed) must
be less than -e.• --un saves the unmapped reads to the file specified• -t prints timing information• -p sets the number of simultaneous threads• --sam makes bowtie output in SAM format• --best ensures the best map is returned
Handout Reference: Page 8
Using the defaults
bowtie -t -p 1 --best --sam ../shared/BowtieIndexes/mm9 ../shared/Sample_reads.fastq test.sam
• -t prints timing information• -p sets the number of simultaneous threads• --sam makes bowtie output in SAM format• --best ensures the best map is returned• --un saves the unmapped reads to the file specified
Handout Reference: Page 8
Allowing some mismatches
Bowtie -t -p 1 -n 3 -e 200 --sam –best --un Loose_Unmapped.fastq ../shared/BowtieIndexes/mm9 Strict_Unmapped.fastq Loose_Mapped.sam
• Set the number of seed mismatches to the maximum.• Set -e to a value more appropriate for our read
length.• How many reads do you map?
Handout Reference: Pages 8-9
A closer look at our data set: fastQC
Handout Reference: Pages 9-10
Handout Reference: Pages 9-10
Trimming reads
• Bowtie allows you to trim reads before it attempts to map them.
• You can trim from the left (5’) end of the read with the --trim5 option.
• You can trim from the right (3’) end of the read with the --trim3 option.
Handout Reference: Page 10
CDS CDS CDS CDS
CDS CDS CDS CDS
Gene
transcript
Sequencing transcripts not the genomeReads
A library containing these bits of sequence (which do not appear in the genome) can help map junction reads. This is called an exon-junction library.
A reference built from this library is in BowtieIndexes, called mm9.UCSC.knownGene.junctions (named for the annotation it was built from).
Handout Reference: Pages 10-12
Options to map more reads…
• Trim some bases from the end of the reads using --trim5 and/or --trim3.
• Map to the junction library instead of the mouse genome using the mm9.UCSC.knownGene.junctions index.
Handout Reference: Pages 9-12
Trim[myoung@bionode11 Starting]$ bowtie -t -p 1 -n 2 -e 70 --trim5 15 --trim3 25 --
sam --best --un Trimmed_Unmapped.fastq ../shared/BowtieIndexes/mm9 Loose_Unmapped.fastq Trimmed_Mapped.sam
Time loading forward index: 00:00:02Time loading mirror index: 00:00:02Seeded quality full-index search: 00:00:47# reads processed: 531414# reads with at least one reported alignment: 164256 (30.91%)# reads that failed to align: 367158 (69.09%)Reported 164256 alignments to 1 output stream(s)Time searching: 00:00:53Overall time: 00:00:54
Handout Reference: Page 10
Then map to junctions[myoung@bionode11 Starting]$ bowtie -t -p 1 -n 3 -e 200 --sam --best --un
Junctions_Unampped.fastq ../shared/BowtieIndexes/mm9.UCSC.knownGene.junctions Trimmed_Unmapped.fastq Junctions_Mapped.sam
Time loading forward index: 00:00:35Time loading mirror index: 00:00:34Seeded quality full-index search: 00:00:24# reads processed: 367158# reads with at least one reported alignment: 128756 (35.07%)# reads that failed to align: 238402 (64.93%)Reported 128756 alignments to 1 output stream(s)Time searching: 00:01:43Overall time: 00:01:45
Handout Reference: Page 11
Number of mapped readsMapping strategy
Command line options
No. Mapped Reads No. Unmapped Reads
Reference
Strict -n 0 -e 1 1,049,050 (47.32%) 1,167,992 (52.68%) Genome
Loose -n 3 -e 200 1,783,048 (80.42%) 433,994 (19.58%) Genome
Trimming -n 3 --trim5 15 --trim3 25
1,912,003 (86.24%) 305,039 (13.76%) Genome
Junctions -n 3 -e 200 2,007,627 (90.55%) 209,415 (9.45%) Junction Library
Handout Reference: Page 12
Further optionsType Name Link
General aligner GMAP/GSNAP http://research-pub.gene.com/gmap/
BFAST http://sourceforge.net/apps/mediawiki/bfast/index.php
BOWTIE http://bowtie-bio.sourceforge.net/index.shtml
CloudBurst http://sourceforge.net/apps/mediawiki/cloudburst-bio/index.php
GNUmap http://dna.cs.byu.edu/gnumap/index.shtml
MAQ/BWA http://maq.sourceforge.net/
Perm http://code.google.com/p/perm/
RazerS http://www.seqan.de/projects/razers.html
Mrfast/mrsfast http://mrfast.sourceforge.net/manual.html
SOAP/SOAP2 http://soap.genomics.org.cn/
SHRiMP http://compbio.cs.toronto.edu/shrimp/
De Novo annotator QPALMA/GenomeMapper/PALMapper http://www.fml.tuebingen.mpg.de/raetsch/suppl/palmapper
SpliceMap http://www.stanford.edu/group/wonglab/SpliceMap/
SOAPals http://soap.genomics.org.cn/
G-Mo. R-Se http://www.genoscope.cns.fr/externe/gmorse/
TopHat http://tophat.cbcb.umd.edu/
SplitSeek http://solidsoftwaretools.com/gf/project/splitseek
De Novo transcript assembler
Oases http://www.ebi.ac.uk/~zerbino/oases/
MIRA http://sourceforge.net/apps/mediawiki/mira-assembler/index.php
Handout Reference: Pages 12-13