Genetics 211 - 2014 Lecture 2 - Stanford...

Genetics 211 - 2014 Lecture 2

High Throughput Sequencing
Gavin Sherlock
gsherloc@stanford.edu
January 14th 2014

Applications of Next-Gen Sequencing

[Diagram: applications arranged around three targets: genome, chromatin, transcriptome]

Applications:
•  transcript identity
•  transcript abundance
•  RNA editing
•  SNPs
•  Allele specific expression
•  Regulation
•  Nucleosome positioning
•  3D genome architecture
•  Active promoters
•  interactions between nucleic acids and proteins
•  chromatin modifications
•  genome variability
•  metagenomics
•  genome modifications
•  detection of mutations
•  association studies
•  phylogeny
•  evolution

Workflows:
•  de novo sequencing: assembly, annotation
•  resequencing: mapping, detection of variants
•  Hi-C: mapping, 3D reconstruction
•  ChIP-Seq: mapping, detection of binding sites
•  RNA-Seq: mapping, transcript detection and quantification
•  ATAC-Seq: mapping, identify open chromatin
How do we make an Illumina Genomic DNA library?

Starting from genomic DNA:

1.  Fragment (Covaris)
2.  Polish, add dA overhang
3.  Add adaptors, size select
4.  Sequence

Making fragments asymmetric

Fragmented, end polished, phosphorylated, dA ligated DNA sample:

5'-pNNNN.........NNNNA-3'
3'-ANNNN.........NNNNp-5'

Genomic Y-adapter (drawn as a Y on the slide: the two strands pair only near the ligation end):

5'-ACACTCTTTCCCTACACGACGCTCTTCCGATCT-3'
3'-GTTCGTCTTCTGCCGTATGCTCGAGAAGGCTAGp-5'

Ligate:

5'-ACACTCTTTCCCTACACGACGCTCTTCCGATCTNNNN.........NNNNAGATCGGAAGAGCTCGTATGCCGTCTTCTGCTTG-3'
3'-GTTCGTCTTCTGCCGTATGCTCGAGAAGGCTAGANNNN.........NNNNTCTAGCCTTCTCGCAGCACATCCCTTTCTCACA-5'

[Ligation product is gel purified, selecting only those products in a certain size range]

Making our genomic DNA library asymmetric

Round 1 of PCR

Template (ligation product):

5'-ACACTCTTTCCCTACACGACGCTCTTCCGATCTNNNN.........NNNNAGATCGGAAGAGCTCGTATGCCGTCTTCTGCTTG-3'
3'-GTTCGTCTTCTGCCGTATGCTCGAGAAGGCTAGANNNN.........NNNNTCTAGCCTTCTCGCAGCACATCCCTTTCTCACA-5'

Primer, annealed to the 3' end of the top strand:

3'-TCTAGCCTTCTCGAGCATACGGCAGAAGACGAAC-5'

The same primer, annealed to the 3' end of the bottom strand:

5'-CAAGCAGAAGACGGCATACGAGCTCTTCCGATCT-3'

Products of first round:

5'-ACACTCTTTCCCTACACGACGCTCTTCCGATCTNNNN.........NNNNAGATCGGAAGAGCTCGTATGCCGTCTTCTGCTTG-3'
3'-TGTGAGAAAGGGATGTGCTGCGAGAAGGCTAGANNNN.........NNNNTCTAGCCTTCTCGAGCATACGGCAGAAGACGAAC-5'

5'-CAAGCAGAAGACGGCATACGAGCTCTTCCGATCTNNNN.........NNNNAGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT-3'
3'-GTTCGTCTTCTGCCGTATGCTCGAGAAGGCTAGANNNN.........NNNNTCTAGCCTTCTCGCAGCACATCCCTTTCTCACA-5'

Finishing and Sequencing the Library

Template (product of round 1):

5'-ACACTCTTTCCCTACACGACGCTCTTCCGATCTNNNN.........NNNNAGATCGGAAGAGCTCGTATGCCGTCTTCTGCTTG-3'
3'-TGTGAGAAAGGGATGTGCTGCGAGAAGGCTAGANNNN.........NNNNTCTAGCCTTCTCGAGCATACGGCAGAAGACGAAC-5'

Primers for rounds 2-18:

3'-TCTAGCCTTCTCGAGCATACGGCAGAAGACGAAC-5'
5'-AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT-3'

Product of PCR amplification:

5'-AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCTNNNN.........NNNNAGATCGGAAGAGCTCGTATGCCGTCTTCTGCTTG-3'
3'-TTACTATGCCGCTGGTGGCTCTAGATGTGAGAAAGGGATGTGCTGCGAGAAGGCTAGANNNN.........NNNNTCTAGCCTTCTCGAGCATACGGCAGAAGACGAAC-5'

[Anneal to flow cell. Perform cluster generation]

Genomic DNA Sequencing Primer:

5'-ACACTCTTTCCCTACACGACGCTCTTCCGATCT-3'

How Much Sequence?

•  HiSeq 2500 can give ~250 million reads/lane of paired end 100bp reads
•  This is 50Gb of sequence
•  This is ~4000x coverage of yeast (12Mb).
•  This is an obvious waste of resources (it’s also ~500x C. elegans, and ~500x D. melanogaster)
•  How can we sequence on a HiSeq and not waste all these resources when sequencing smaller genomes?
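The arithmetic behind these numbers can be sketched as follows. This is a minimal illustration, assuming that "250 million reads" counts read pairs (the reading that reproduces the 50Gb figure); the function names and the 50x target coverage are hypothetical, chosen only for the example.

```python
def lane_yield_and_coverage(read_pairs, read_length_bp, genome_size_bp, paired=True):
    """Return (total bases sequenced, fold coverage) for one lane."""
    ends = 2 if paired else 1
    total_bases = read_pairs * ends * read_length_bp
    return total_bases, total_bases / genome_size_bp


def samples_per_lane(lane_coverage, target_coverage):
    """How many equally pooled samples fit in a lane at a target coverage."""
    return int(lane_coverage // target_coverage)


# Numbers from the slide: 250 million pairs, 100 bp reads, 12 Mb yeast genome.
total_bases, yeast_coverage = lane_yield_and_coverage(250_000_000, 100, 12_000_000)
print(f"{total_bases / 1e9:.0f} Gb per lane, {yeast_coverage:.0f}x yeast coverage")

# Assumed target of 50x per sample (hypothetical, not from the lecture).
print(samples_per_lane(yeast_coverage, 50), "yeast samples per lane at 50x")
```

Dividing the lane coverage by the coverage actually needed per sample gives the number of samples that could share a lane, which is the motivation for barcoding.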

Barcode Sequencing

•  Two ways to perform barcode sequencing
    –  In-line barcodes
        •  Barcode is read as part of the normal sequencing read
    –  Multiplex barcodes
        •  Barcode is read as a third, short sequencing run (also known as index reads)
•  Can be used to run multiple samples from any particular origin on the same lane of a HiSeq, with the barcodes allowing the samples to be de-convoluted afterwards.
•  Barcodes should be designed so that they are balanced in GC content, and as dissimilar as possible.
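The two design criteria and the de-convolution step can be sketched in a few lines. This is a toy illustration for in-line barcodes only: the barcode sequences, sample names and function names are made up for the example, and matching is exact (real demultiplexers usually tolerate a mismatch).

```python
from itertools import combinations


def gc_fraction(seq):
    """Fraction of G and C bases in a barcode."""
    return (seq.count("G") + seq.count("C")) / len(seq)


def hamming(a, b):
    """Number of positions at which two equal-length barcodes differ."""
    if len(a) != len(b):
        raise ValueError("barcodes must have equal length")
    return sum(x != y for x, y in zip(a, b))


def min_pairwise_distance(barcodes):
    """Smallest Hamming distance between any two barcodes in the set."""
    return min(hamming(a, b) for a, b in combinations(barcodes, 2))


def demultiplex(reads, barcodes):
    """Bin reads by exact in-line barcode prefix and strip the barcode."""
    bins = {name: [] for name in barcodes}
    bins["unassigned"] = []
    for read in reads:
        for name, barcode in barcodes.items():
            if read.startswith(barcode):
                bins[name].append(read[len(barcode):])
                break
        else:
            bins["unassigned"].append(read)
    return bins


# Hypothetical barcode set: every barcode is 50% GC and differs from the
# others at all six positions.
barcodes = {"sample_A": "ACGTAC", "sample_B": "TGCATG", "sample_C": "GATCCA"}
reads = ["ACGTACTTGACC", "TGCATGAAAC", "NNNNNNGGGG"]
bins = demultiplex(reads, barcodes)
```

The larger the minimum pairwise distance, the more sequencing errors a barcode can absorb before it is mistaken for another sample's barcode.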

In-line Barcode Sequencing

Multiplex, or Index barcoding

Random barcoding

•  During the PCR step, each template gets amplified many times

•  If your library is of insufficient complexity, or you overamplify, you may have PCR duplicates

•  You want to make independent observations, not redundant observations

•  When sequencing to high coverage, you may have identical, but non-redundant observations.

•  Want to be able to distinguish these.
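One way to picture how a random barcode separates the two cases: reads that share a mapping position, a sequence and a random barcode are treated as copies of one template, while reads that share position and sequence but carry different barcodes count as independent observations. The sketch below assumes each read has already been reduced to a (position, random barcode, sequence) tuple; the tuple layout and function name are illustrative.

```python
from collections import Counter


def collapse_pcr_duplicates(observations):
    """Count copies of each (position, random_barcode, sequence) combination.

    Each distinct key is one independent observation; its count is the
    number of PCR copies that were sequenced.
    """
    return Counter(observations)


observations = [
    (100, "AACG", "GATTACA"),  # template 1
    (100, "AACG", "GATTACA"),  # PCR duplicate of template 1
    (100, "TTGC", "GATTACA"),  # identical read, but a different template
    (250, "AACG", "CCGGTTA"),  # different position altogether
]
collapsed = collapse_pcr_duplicates(observations)
independent = len(collapsed)
```

Without the random barcode, the first three reads would be indistinguishable and would all be collapsed into a single observation.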

Random Barcoding

What are the data?

•  Illumina produces data in fastq format.

@SEQ_ID
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65

Line 1: ‘@’ followed by a sequence identifier
Line 2: The sequence
Line 3: ‘+’, optionally followed by a sequence identifier
Line 4: The quality scores
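The four-line record structure can be read with a short generator. This is a minimal sketch that assumes strict four-line records (no wrapped sequence lines) and a Phred offset of 33, which is what the example record above uses ('!' is quality 0); some older Illumina pipelines used an offset of 64, so the offset is left as a parameter. The function names are illustrative.

```python
import io


def read_fastq(handle):
    """Yield (identifier, sequence, quality string) from 4-line FASTQ records."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of file
        sequence = handle.readline().rstrip("\n")
        plus = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed record near {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        yield header[1:], sequence, quality


def phred_scores(quality, offset=33):
    """Decode a quality string into integer Phred scores."""
    return [ord(char) - offset for char in quality]


def error_probability(phred):
    """Probability that a base call with this Phred score is wrong."""
    return 10 ** (-phred / 10)


EXAMPLE = (
    "@SEQ_ID\n"
    "GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT\n"
    "+\n"
    "!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65\n"
)
records = list(read_fastq(io.StringIO(EXAMPLE)))
name, sequence, quality = records[0]
scores = phred_scores(quality)
```

A Phred score Q corresponds to an error probability of 10^(-Q/10), so Q20 means a 1 in 100 chance that the base call is wrong.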

Example of Illumina SeqID

@HWUSI-EAS100R:6:73:941:1973#0/1

HWUSI-EAS100R   The unique instrument name
6               Flowcell lane
73              Tile number within the flowcell
941             'x'-coordinate of the cluster within the tile
1973            'y'-coordinate of the cluster within the tile
#0              index number for a multiplexed sample (0 for no indexing)
/1              the member of a pair, /1 or /2 (paired-end or mate-pair reads only)
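The fields above can be pulled apart with a regular expression. This sketch handles only the identifier layout shown on this slide (other Illumina software versions write different headers); the class and function names are illustrative.

```python
import re
from typing import NamedTuple, Optional


class IlluminaSeqId(NamedTuple):
    instrument: str
    lane: int
    tile: int
    x: int
    y: int
    index: Optional[str]        # text after '#', if present
    pair_member: Optional[int]  # 1 or 2, if present


_SEQ_ID = re.compile(
    r"^@?(?P<instrument>[^:]+):(?P<lane>\d+):(?P<tile>\d+)"
    r":(?P<x>\d+):(?P<y>\d+)"
    r"(?:#(?P<index>[^/]+))?(?:/(?P<member>[12]))?$"
)


def parse_seq_id(header):
    """Split an identifier like @HWUSI-EAS100R:6:73:941:1973#0/1 into fields."""
    match = _SEQ_ID.match(header.strip())
    if match is None:
        raise ValueError(f"unrecognised identifier: {header!r}")
    member = match.group("member")
    return IlluminaSeqId(
        instrument=match.group("instrument"),
        lane=int(match.group("lane")),
        tile=int(match.group("tile")),
        x=int(match.group("x")),
        y=int(match.group("y")),
        index=match.group("index"),
        pair_member=int(member) if member else None,
    )


parsed = parse_seq_id("@HWUSI-EAS100R:6:73:941:1973#0/1")
```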

Assessing Quality

FastQC

http://www.bioinformatics.bbsrc.ac.uk/projects/fastqc/

HTQC

[Figure: panels A-G]

https://sourceforge.net/projects/htqc

De novo assembly

•  Several methods available
•  Short reads require long overlaps
    –  e.g., 33 bp reads must overlap by 20 bp
    –  end-trimming helps, to remove low quality bases.

•  Most de novo short read assemblers use a k-mer hashing based approach.

•  The central challenge of genome assembly is resolving repeat regions.

De novo Assembly Strategies

•  Many, many different algorithms and open source (as well as closed source) software for short read sequence assembly.

•  Choice of tool depends on exactly what you are trying to assemble:
    –  Genome size
    –  Genome complexity
    –  Level of polymorphism
    –  Genome vs. transcriptome vs.
    –  Sequence coverage you have (more is generally better)
    –  Paired-end vs. single end (you should really have paired-end data)
•  E.g.
    –  SSAKE (Warren et al, 2007)
        •  Uses DNA prefix tree to find k-mer matches
    –  Edena (Hernandez et al, 2008)
        •  Overlap layout algorithm plus error correction
    –  Velvet (Zerbino and Birney, 2008)
        •  Uses de Bruijn graph algorithm plus error correction

Example of Velvet de novo Assembly

TAGTCGAGGCTTTAGATCCGATGAGGCTTTAGAGACAG

AGTCGAG CTTTAGA CGATGAG CTTTAGA GTCGGG TTAGATC ATGAGGC GAGACAG GAGGCTC ATCCGAT AGGCTTT GAGACAG AGTCGAG TAGATCC ATGAGGC TAGAGAA

TAGTCGA CTTTAGA CCGATGA TTAGAGA CGAGGCT AGATCCG TGAGGCT AGAGACA

TAGTCGA GCTTTAG TCCGATG GCTCTAG TCGACGC GATCCGA GAGGCTT AGAGACA TAGTCGA TTAGATC GATGAGG TTTAGAG

GTCGAGG TCTAGAT ATGAGGC TAGAGAC AGGCTTT ATCCGAT AGGCTTT GAGACAG AGTCGAG TTAGATT ATGAGGC AGAGACA

GGCTTTA TCCGATG TTTAGAG CGAGGCT TAGATCC TGAGGCT GAGACAG AGTCGAG TTTAGATC ATGAGGC TTAGAGA GAGGCTT GATCCGA GAGGCTT GAGACAG

Sequence (7bp reads)

Hashing (k = 4)

TAGTCGAGGCTTTAGATCCGATGAGGCTTTAGAGACAG

k-mers found in the reads, with the number of times each was observed. Each line is one chain of overlapping k-mers:

TAGT (3x)  AGTC (7x)  GTCG (9x)  TCGA (10x)  CGAG (8x)  GAGG (16x)  AGGC (16x)  GGCT (16x)  GCTT (11x)
AGAG (9x)  GAGA (12x)  AGAC (9x)  GACA (8x)  ACAG (5x)
CTTC (1x)  TTCA (2x)  TCAG (2x)  CAGA (1x)
CTTT (8x)  TTTA (8x)  TTAG (12x)  TAGA (16x)
CGAC (1x)  GACG (1x)  ACGC (1x)
AGAT (8x)  GATC (8x)  ATCC (7x)  TCCG (7x)  CCGA (7x)  CGAT (6x)  GATG (5x)  ATGA (8x)  TGAG (9x)
GATT (1x)
AGAA (1x)
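The hashing step amounts to counting every k-mer in every read and then linking k-mers that overlap by k-1 bases. The sketch below is a simplified illustration using four of the 7bp reads from the slide; it ignores reverse complements and error correction, so it is not a description of how Velvet itself is implemented, and the function names are made up for the example.

```python
from collections import Counter, defaultdict


def count_kmers(reads, k):
    """Count every k-mer observed across all reads."""
    counts = Counter()
    for read in reads:
        for start in range(len(read) - k + 1):
            counts[read[start:start + k]] += 1
    return counts


def build_graph(kmers):
    """Link k-mer a to k-mer b when the last k-1 bases of a start b."""
    by_prefix = defaultdict(list)
    for kmer in kmers:
        by_prefix[kmer[:-1]].append(kmer)
    return {kmer: sorted(by_prefix.get(kmer[1:], [])) for kmer in kmers}


# Three reads that agree with the sequence, plus one (TCGACGC) that does not.
reads = ["TAGTCGA", "AGTCGAG", "GTCGAGG", "TCGACGC"]
counts = count_kmers(reads, 4)
graph = build_graph(counts)

for kmer in sorted(counts):
    print(kmer, f"({counts[kmer]}x)", "->", graph[kmer])
```

In this toy input the k-mer TCGA has two successors, CGAG and CGAC. The CGAC branch is supported by a single read, which is the kind of low-coverage side path that later shows up as a tip.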

Graph Building

TAGTCGAGGCTTTAGATCCGATGAGGCTTTAGAGACAG

[Figure: graph with nodes TAGTCGA, CGAG, CGACGC, GCTCTAG, GCTTTAG, GATCCGATGAG, AGAT, AGAA, GATT, GAGGCT, TAGA, AGAGA, AGACAG]


Simplification of Linear Stretches

TAGTCGAGGCTTTAGATCCGATGAGGCTTTAGAGACAG

[Figure: simplified graph with nodes TAGTCGAG, GAGGCTTTAGA, AGAGACAG and AGATCCGATGAG]

Error (tip and bubble) removal

[Figure: the graph before simplification, with the error structures labelled "Tips" and "Bubble". Nodes: TAGTCGA, CGAG, CGACGC, GCTCTAG, GCTTTAG, GATCCGATGAG, AGAT, AGAA, GATT, GAGGCT, TAGA, AGAGA, AGACAG]

TAGTCGAGGCTTTAGATCCGATGAGGCTTTAGAGACAG

De novo Assembler Performance

•  All three programs run with default parameters on the same dataset
    –  Input: 8.6 million reads
    –  Platform: 64-bit Opteron, 4 CPUs, 32 GB memory

Program   Version   CPU time   Wall clock
SSAKE     3.0       2:24:59    5:08:59
Edena     2.11      0:28:31    28:58
Velvet    0.5       0:08:48    10:36

De novo assemblies

Program   # Contigs >200 bp   N50 (bp)   Sum (bp)    Singletons
SSAKE     12,532              549        6,090,567   3,164,495
Edena     8,316               902        5,759,209   3,955,865
Velvet    7,382               1,252      6,474,426   1,273,164

Program   # Contigs   N50 (bp)   Sum (bp)     Max contig
SSAKE     185,030     87         14,287,079   5,490
Edena     11,180      837        6,175,460    11,300
Velvet    10,684      1,184      6,841,458    16,239
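N50, used in the tables above, is the contig length L such that contigs of length L or longer contain at least half of the assembled bases. A minimal sketch, with made-up contig lengths:

```python
def n50(contig_lengths):
    """Length L such that contigs of length >= L hold at least half the bases."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise ValueError("no contigs given")


# Hypothetical assembly: 300 bp in total, so the halfway point is 150 bp.
# 100 + 80 = 180 >= 150, so the N50 is 80.
example_n50 = n50([100, 80, 50, 30, 20, 10, 10])
```

Because N50 depends on the total, including or excluding very short contigs changes both the sum and the N50, which is why the two tables above report different values for the same assemblies.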

Assembly Limitations

•  Common repeat regions are typically missing/collapsed
    –  Han Chinese genome missing ~420Mbp of repeats
•  Same is true for segmental duplications
    –  Han Chinese genome only contains ~10Mbp of ~150Mbp of segmental duplications.
•  You typically get very large numbers of contigs, which range in size from very small, to sometimes quite large.

Recent Assembler Comparisons

•  Earl et al (2011). Assemblathon 1: A competitive assessment of de novo short read assembly methods. Genome Research 21: 2224-2241.
    –  Used a simulated dataset for all competitors to assemble
•  Salzberg et al (2012). GAGE: A critical evaluation of genome assemblies and assembly algorithms. Genome Research 22(3):557-67.
    –  Applied several assembly algorithms to their own datasets, for several different sized genomes
•  Bradnam et al. (2013). Assemblathon 2: evaluating de novo methods of genome assembly in three vertebrate species. Gigascience 2(1):10.
    –  See http://assemblathon.org/
•  If you have an assembly problem, you should read these papers to gain some insights into strengths and weaknesses of different assemblers

Improving de novo Assemblies

•  Need to generate additional long range continuity to be able to orient and order contigs

•  Mate pairs
•  Hybrid approach using PacBio Reads
•  Synthetic long reads (aka Moleculo)

Mate-pair libraries

•  Goal is to have the equivalent of 2-5kb insert libraries.

•  However, technology is limited to ~700 bp fragments
    –  Means you have to use some molecular biology to accomplish the equivalent.

Starting from genomic DNA:

1.  Fragment
2.  Size Select (2-5kb)
3.  Biotinylate
4.  Circularize
5.  Fragment (400-600bp)
6.  Enrich Biotinylated fragments
7.  Standard Paired End Illumina Sequencing
8.  Incorporate Mate-pair information into assembly

Leveraging Multiple Technologies

•  Illumina is great, because you can get a ton of data
    –  BUT read length is short
•  PacBio is great, because read lengths are long
    –  BUT the data quality is terrible
•  Two approaches have been used:
    1.  Hybrid error correction
        •  Use short reads to perform correction of long PacBio reads, and then assemble those
    2.  Use PacBio reads to improve existing (short read or Sanger based) assemblies
        •  E.g. With 24× mapped coverage of PacBio long-reads applied to a D. pseudoobscura assembly, 99% of gaps were addressed, with 69% being closed and a further 12% improved.

Long “Synthetic Reads” aka Moleculo

Starting from genomic DNA:

1.  Fragment
2.  Size Select (10kb)
3.  Polish, ligate amplification adaptors (~10 kb DNA)
4.  Dilute to 500 molecules per well
5.  Amplify, fragment, add sequencing adaptors
6.  Pool
7.  Sequence
8.  Separate, based on bar code
9.  Remove barcodes, assemble 10kb fragments
10.  Assemble genome from 10kb fragments


Mapping Short Reads

•  Many options; often a trade off between speed, resources and sensitivity.

•  Several open source projects to solve this problem, continually improving in speed and memory requirements.

•  New features being added all the time.
•  When dealing with short read data, make sure you have the very latest versions of the software you’re using, as some are updated frequently.

Alignment

@HWI-EAS412_4:1:1:1376:380
AATGATACGGCGACCACCGAGATCTA

Approaches to Short Read Alignment

•  Hash-Based mapping
    –  Hashing of reads (E.g. Maq, Eland, SHRiMP)
    –  Hashing of genome (E.g. novoalign, SHRiMP2)

•  Indexing using Suffix Array/Burrows-Wheeler Transform (BWT) (E.g. bowtie, bwa)

SHRiMP2

•  Uses hash of genome to find alignment seeds, then performs Smith-Waterman
    –  SW, while slow, is accelerated; requires x86_64 processor (which most macs have nowadays)
•  Can detect indels, as well as mismatches
•  As of v2.2, takes into account quality scores
•  Takes longer than bowtie and bwa, but is more sensitive than both
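The Smith-Waterman step can be illustrated with a plain dynamic-programming scorer. This is a textbook sketch with a linear gap penalty, not SHRiMP2's accelerated implementation; the scoring values (match +2, mismatch -1, gap -2) are assumptions chosen for the example.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between a and b (linear gap penalty)."""
    previous = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        current = [0]
        for j in range(1, len(b) + 1):
            diagonal = previous[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score = max(
                0,                   # start a new local alignment here
                diagonal,            # align a[i-1] with b[j-1]
                previous[j] + gap,   # gap in b
                current[j - 1] + gap # gap in a
            )
            current.append(score)
            best = max(best, score)
        previous = current
    return best


# A read missing one base relative to the reference still aligns end to end:
# seven matches (14) minus one gap (2).
indel_score = smith_waterman_score("ACGTACGT", "ACGTCGT")
```

Because gaps are scored explicitly, an aligner that finishes with this step can report indels, which an ungapped seed match alone cannot.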

MAQ

•  Much faster than SHRiMP, at the cost of accuracy (cannot find indels)

•  Uses hashing technique to index the reads (the genome is then scanned against that index)
•  Guaranteed to find alignments with up to 2 mismatches
•  Can take advantage of paired end reads
•  Uses sequence quality scores to determine best alignments
•  Generally no longer used

http://sourceforge.net/projects/maq/

How does “hashing” work?

•  A hash function simply converts a string (“key”) to an integer (“value”).

•  The integer is then used as an index in an array, for fast look up.

•  In MAQ, the reads are “hashed”, using 6 different permutations of the first 28bp.

•  The genome is then looked through, in 28bp chunks, to see if they match, via the hash, to reads.
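The read-hashing idea can be sketched with a Python dictionary, which is itself a hash table. This is a deliberately simplified illustration: it indexes one exact seed per read and verifies candidates by exact comparison, whereas MAQ uses several templates over the first 28bp so that mismatches are tolerated. The seed length, reads and function names are chosen for the example; the genome is the example sequence from the assembly slides.

```python
from collections import defaultdict


def index_reads(reads, seed_length):
    """Hash each read by its first seed_length bases."""
    index = defaultdict(list)
    for read_id, read in enumerate(reads):
        index[read[:seed_length]].append(read_id)
    return index


def scan_genome(genome, reads, seed_length):
    """Slide along the genome; look each window up in the read index."""
    index = index_reads(reads, seed_length)
    hits = []
    for position in range(len(genome) - seed_length + 1):
        window = genome[position:position + seed_length]
        for read_id in index.get(window, []):
            read = reads[read_id]
            # Seed matched: verify the full read (exact match only here).
            if genome[position:position + len(read)] == read:
                hits.append((read_id, position))
    return hits


genome = "TAGTCGAGGCTTTAGATCCGATGAGGCTTTAGAGACAG"
reads = ["GGCTTTAG", "ATCCGATG", "CCCCCCCC"]
hits = scan_genome(genome, reads, seed_length=4)
```

Read 0 is reported at two positions because GGCTTTAG occurs twice in this genome, a small reminder that repeats make read placement ambiguous.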

Bowtie

•  Similar to MAQ, in that it uses quality scores to find best alignments.

•  Uses “Burrows-Wheeler index” to keep its memory footprint small.

•  Can find alignments with up to 3 mismatches in the first L bases of the read.

•  Only ungapped alignments
•  Also supports paired end reads.
•  Bowtie2 supports gapped alignments too.

http://bowtie-bio.sourceforge.net/

Bowtie Algorithm

@HWI-EAS412_4:1:1:1376:380
AATGATACGGCGACCACCGAGATCTA

Bowtie Algorithm

@HWI-EAS412_4:1:1:1376:380
AATAATACGGCGACCACCGAGATCTA
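The core of a Burrows-Wheeler index is backward search: the read is matched one base at a time, from its last base to its first, narrowing a range of rows in the sorted rotation matrix. The sketch below is a toy version for exact matching only. It builds the transform by sorting rotations and counts characters by scanning, which is far too slow for a real genome, and it does not include the backtracking Bowtie uses to allow mismatches. The function names are illustrative, and the text is the example sequence from the assembly slides.

```python
def burrows_wheeler_transform(text):
    """Last column of the sorted rotations of text + '$' (toy construction)."""
    text += "$"  # '$' sorts before A, C, G, T and marks the end of the text
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rotation[-1] for rotation in rotations)


def count_occurrences(bwt, pattern):
    """Count exact occurrences of pattern using backward search."""
    # first_row[c]: first row of the sorted matrix that starts with c.
    first_row = {}
    for row, char in enumerate(sorted(bwt)):
        first_row.setdefault(char, row)

    def occurrences_before(char, row):
        """Number of times char appears in bwt[:row]."""
        return bwt[:row].count(char)

    low, high = 0, len(bwt)  # half-open range of matching rows
    for char in reversed(pattern):
        if char not in first_row:
            return 0
        low = first_row[char] + occurrences_before(char, low)
        high = first_row[char] + occurrences_before(char, high)
        if low >= high:
            return 0
    return high - low


bwt = burrows_wheeler_transform("TAGTCGAGGCTTTAGATCCGATGAGGCTTTAGAGACAG")
```

Each step costs the same regardless of how long the indexed text is, and the index is about the size of the text itself, which is how this family of aligners combines speed with a small memory footprint.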

BWA

•  From the author of Maq, but now uses Burrows-Wheeler transform to significantly speed it up.

•  Can also find small indels, in contrast to both Maq and Bowtie.

•  Is slightly slower than bowtie, but the ability to find indels makes it more useful if SNVs are important to you.

Comparison

•  PC: 2.4 GHz Intel Core 2, 2 GB RAM
•  Server: 2.4 GHz AMD Opteron, 32 GB RAM
•  Bowtie v0.9.6, Maq v0.6.6, SOAP v1.10
•  SOAP not run on PC due to memory constraints
•  Reads: FASTQ 8.84 M reads from 1000 Genomes (Acc: SRR001115)
•  Reference: Human (NCBI 36.3, contigs)

Program                CPU time      Wall clock time   Reads per hour   Peak virtual memory footprint   Bowtie speedup   Reads aligned (%)
Bowtie -v 2 (server)   15m:07s       15m:41s           33.8 M           1,149 MB                        -                67.4
SOAP (server)          91h:57m:35s   91h:47m:46s       0.08 M           13,619 MB                       351x             67.3
Bowtie (PC)            16m:41s       17m:57s           29.5 M           1,353 MB                        -                71.9
Maq (PC)               17h:46m:35s   17h:53m:07s       0.49 M           804 MB                          59.8x            74.7
Bowtie (server)        17m:58s       18m:26s           28.8 M           1,353 MB                        -                71.9
Maq (server)           32h:56m:53s   32h:58m:39s       0.27 M           804 MB                          107x             74.7

Comparison

•  Bowtie delivers about 30 million alignments per CPU hour


Comparison

•  Bowtie and Maq have memory footprints compatible with a typical workstation with 2 GB of RAM

•  SOAP requires a computer with >13 GB of RAM
•  SOAP2 claims to be “super-fast”, and require less RAM (also uses BW Transform).
•  Your choice will be dictated by your needs (sensitivity, genome size, number of reads) and your computing resources, and may change over time.


More Comparison Data

[Figure: Precision and recall by amount of variation for 4 datasets, by polymorphism: (number of SNPs, Indel size).]

David M, Dzamba M, Lister D, Ilie L, Brudno M. (2011). SHRiMP2: sensitive yet practical SHort Read Mapping. Bioinformatics 27(7):1011-2.

Current Practice

•  Most people use bwa for mapping their short read data if they want to discover variants

•  If not interested in variants, people use bowtie for speed
•  Most people don’t determine whether the tool they are using is the best for their purpose
•  There is no standard benchmark dataset, though see:
    –  Holtgrewe et al. (2011). A novel and well-defined benchmarking method for second generation read mapping. BMC Bioinformatics 12:210.

•  It doesn’t hurt to experiment

Recommended Reading

Mapping
•  Li, H., Ruan, J. and Durbin, R. (2008). Mapping short DNA sequencing reads and calling variants using mapping quality scores. Genome Res. 18(11):1851-8. (MAQ)
•  David, M., Dzamba, M., Lister, D., Ilie, L. and Brudno, M. (2011). SHRiMP2: sensitive yet practical SHort Read Mapping. Bioinformatics 27(7):1011-2.
•  Langmead, B., Trapnell, C., Pop, M. and Salzberg, S.L. (2009). Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol. 10(3):R25. (Bowtie)
•  Langmead, B. and Salzberg, S. (2012). Fast gapped-read alignment with Bowtie 2. Nature Methods 9:357-359.
•  Li, R., Yu, C., Li, Y., Lam, T.W., Yiu, S.M., Kristiansen, K. and Wang, J. (2009). SOAP2: an improved ultrafast tool for short read alignment. Bioinformatics 25(15):1966-7.
•  Li, H. and Durbin, R. (2009). Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 25(14):1754-60. (BWA)

Assembly
•  Zerbino, D.R. and Birney, E. (2008). Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome Res. 18(5):821-9.
•  Zerbino, D.R., McEwen, G.K., Margulies, E.H. and Birney, E. (2009). Pebble and rock band: heuristic resolution of repeats and scaffolding in the velvet short-read de novo assembler. PLoS One 4(12):e8407.
•  Simpson, J.T. and Durbin, R. (2010). Efficient construction of an assembly string graph using the FM-index. Bioinformatics 26(12):i367-73.
•  Simpson, J.T. and Durbin, R. (2012). Efficient de novo assembly of large genomes using compressed data structures. Genome Research 22(3):549-56. (SGA)
•  Earl, D. et al. (2011). Assemblathon 1: a competitive assessment of de novo short read assembly methods. Genome Research 21(12):2224-41.
•  Bradnam, K.R. et al. (2013). Assemblathon 2: evaluating de novo methods of genome assembly in three vertebrate species. Gigascience 2(1):10.
•  English, A.C., Richards, S., Han, Y., Wang, M., Vee, V., Qu, J., Qin, X., Muzny, D.M., Reid, J.G., Worley, K.C., Gibbs, R.A. (2012). Mind the gap: upgrading genomes with Pacific Biosciences RS long-read sequencing technology. PLoS One 7(11):e47768.

Moleculo
•  Voskoboynik, A., et al. (2013). The genome sequence of the colonial chordate, Botryllus schlosseri. Elife 2:e00569.