Genetics 211 - 2014 Lecture 2: High Throughput Sequencing
Gavin Sherlock [email protected]
January 14th 2014


Applications of Next-Gen Sequencing

Sequencing targets: genome, chromatin, transcriptome

Questions that can be addressed:

•  transcript identity
•  transcript abundance
•  RNA editing
•  SNPs
•  allele specific expression
•  regulation
•  nucleosome positioning
•  3D genome architecture
•  active promoters
•  interactions between nucleic acids and proteins
•  chromatin modifications
•  genome variability
•  metagenomics
•  genome modifications
•  detection of mutations
•  association studies
•  phylogeny
•  evolution

Workflows:

•  de novo sequencing: assembly, annotation
•  resequencing: mapping, detection of variants
•  Hi-C: mapping, 3D reconstruction
•  ChIP-Seq: mapping, detection of binding sites
•  RNA-Seq: mapping, transcript detection and quantification
•  ATAC-Seq: mapping, identify open chromatin

How do we make an Illumina Genomic DNA library?

1.  Genomic DNA
2.  Fragment (Covaris)
3.  Polish, add dA overhang
4.  Add adaptors, size select
5.  Sequence

Making fragments asymmetric

Fragmented, end polished, phosphorylated, dA ligated DNA sample:

5'-pNNNN.........NNNNA-3'
3'-ANNNN.........NNNNp-5'

Genomic Y-adapter:

5'-ACACTCTTTCCCTACACGACGCTCTTCCGATCT-3'
3'-GTTCGTCTTCTGCCGTATGCTCGAGAAGGCTAGp-5'

Ligate:

5'-ACACTCTTTCCCTACACGACGCTCTTCCGATCTNNNN.........NNNNAGATCGGAAGAGCTCGTATGCCGTCTTCTGCTTG-3'
3'-GTTCGTCTTCTGCCGTATGCTCGAGAAGGCTAGANNNN.........NNNNTCTAGCCTTCTCGCAGCACATCCCTTTCTCACA-5'

[Ligation product is gel purified, selecting only those products in a certain size range]

Making our genomic DNA library asymmetric

Round 1 of PCR. Template (the ligation product):

5'-ACACTCTTTCCCTACACGACGCTCTTCCGATCTNNNN.........NNNNAGATCGGAAGAGCTCGTATGCCGTCTTCTGCTTG-3'
3'-GTTCGTCTTCTGCCGTATGCTCGAGAAGGCTAGANNNN.........NNNNTCTAGCCTTCTCGCAGCACATCCCTTTCTCACA-5'

Primer (the same oligo, drawn in each orientation):

3'-TCTAGCCTTCTCGAGCATACGGCAGAAGACGAAC-5'
5'-CAAGCAGAAGACGGCATACGAGCTCTTCCGATCT-3'

Products of first round:

5'-ACACTCTTTCCCTACACGACGCTCTTCCGATCTNNNN.........NNNNAGATCGGAAGAGCTCGTATGCCGTCTTCTGCTTG-3'
3'-TGTGAGAAAGGGATGTGCTGCGAGAAGGCTAGANNNN.........NNNNTCTAGCCTTCTCGAGCATACGGCAGAAGACGAAC-5'

5'-CAAGCAGAAGACGGCATACGAGCTCTTCCGATCTNNNN.........NNNNAGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT-3'
3'-GTTCGTCTTCTGCCGTATGCTCGAGAAGGCTAGANNNN.........NNNNTCTAGCCTTCTCGCAGCACATCCCTTTCTCACA-5'

Finishing and Sequencing the Library

Rounds 2-18. Template:

5'-ACACTCTTTCCCTACACGACGCTCTTCCGATCTNNNN.........NNNNAGATCGGAAGAGCTCGTATGCCGTCTTCTGCTTG-3'
3'-TGTGAGAAAGGGATGTGCTGCGAGAAGGCTAGANNNN.........NNNNTCTAGCCTTCTCGAGCATACGGCAGAAGACGAAC-5'

Primers:

3'-TCTAGCCTTCTCGAGCATACGGCAGAAGACGAAC-5'
5'-AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT-3'

Product of PCR amplification:

5'-AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCTNNNN.........NNNNAGATCGGAAGAGCTCGTATGCCGTCTTCTGCTTG-3'
3'-TTACTATGCCGCTGGTGGCTCTAGATGTGAGAAAGGGATGTGCTGCGAGAAGGCTAGANNNN.........NNNNTCTAGCCTTCTCGAGCATACGGCAGAAGACGAAC-5'

[Anneal to flow cell. Perform cluster generation]

Genomic DNA Sequencing Primer:

5'-ACACTCTTTCCCTACACGACGCTCTTCCGATCT-3'

How Much Sequence?

•  HiSeq 2500 can give ~250 million reads/lane of paired end 100bp reads
•  This is 50Gb of sequence
•  This is ~4000x coverage of yeast (12Mb).
•  This is an obvious waste of resources (it’s also ~500x C. elegans, and ~500x D. melanogaster)
•  How can we sequence on a HiSeq and not waste all these resources when sequencing smaller genomes?
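The arithmetic on this slide can be checked with a few lines of Python. This is a minimal sketch, not part of the lecture: the function names are made up for illustration, it assumes the "250 million reads" figure counts clusters that each yield two 100 bp reads (which is what the 50Gb total implies), and the 50x target coverage is an arbitrary example value.

```python
def fold_coverage(n_clusters, read_length_bp, genome_size_bp, paired=True):
    """Average coverage = total sequenced bases / genome size."""
    reads_per_cluster = 2 if paired else 1
    total_bases = n_clusters * reads_per_cluster * read_length_bp
    return total_bases / genome_size_bp


def samples_per_lane(n_clusters, read_length_bp, genome_size_bp,
                     target_coverage, paired=True):
    """How many equally sized samples fit in one lane at a target coverage."""
    lane = fold_coverage(n_clusters, read_length_bp, genome_size_bp, paired)
    return int(lane // target_coverage)


# Figures from the slide: 250 million clusters, 100 bp paired end, 12 Mb yeast.
yeast = fold_coverage(250_000_000, 100, 12_000_000)
print(f"yeast coverage per lane: ~{yeast:.0f}x")

# Assumed target of 50x per sample (illustrative only).
print(samples_per_lane(250_000_000, 100, 12_000_000, target_coverage=50))
```

The second number is the motivation for barcoding: at modest per-sample coverage, dozens of yeast genomes fit in a single lane.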

Barcode Sequencing

•  Two ways to perform barcode sequencing
    –  In-line barcodes
        •  Barcode is read as part of the normal sequencing read
    –  Multiplex barcodes
        •  Barcode is read as a third, short sequencing run (also known as index reads)
•  Can be used to run multiple samples from any particular origin on the same lane of a HiSeq, with the barcodes allowing the samples to be deconvoluted afterwards.
•  Barcodes should be designed so that they are balanced in GC content, and as dissimilar as possible.
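The two design criteria in the last bullet (balanced GC content, maximal dissimilarity) are easy to check programmatically. The sketch below is illustrative only: the barcode sequences and sample names are invented, Hamming distance is used as one possible measure of dissimilarity, and the one-mismatch tolerance when assigning reads is an assumption rather than anything stated in the lecture.

```python
from itertools import combinations


def gc_fraction(seq):
    """Fraction of bases that are G or C."""
    return sum(base in "GC" for base in seq) / len(seq)


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def min_pairwise_distance(barcodes):
    """Smallest Hamming distance between any two barcodes in the set."""
    return min(hamming(a, b) for a, b in combinations(barcodes, 2))


def assign_sample(read_barcode, barcode_to_sample, max_mismatches=1):
    """Return the sample whose barcode matches unambiguously, else None."""
    hits = [sample for barcode, sample in barcode_to_sample.items()
            if hamming(read_barcode, barcode) <= max_mismatches]
    return hits[0] if len(hits) == 1 else None


# Hypothetical barcode set, for illustration only.
barcodes = {"ACGTAC": "sample_1", "TGCATG": "sample_2", "GATCCA": "sample_3"}
print([gc_fraction(bc) for bc in barcodes])
print(min_pairwise_distance(list(barcodes)))
print(assign_sample("ACGTAA", barcodes))  # one sequencing error in the barcode
```

A larger minimum pairwise distance lets more sequencing errors in the barcode be tolerated before a read is assigned to the wrong sample.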

In-line Barcode Sequencing

[Figure: in-line barcode sequencing]

Multiplex, or Index barcoding

[Figure: multiplex (index) barcoding]

Random barcoding

•  During the PCR step, each template gets amplified many times
•  If your library is of insufficient complexity, or you overamplify, you may have PCR duplicates
•  You want to make independent observations, not redundant observations
•  When sequencing to high coverage, you may have identical, but non-redundant observations.
•  You want to be able to distinguish these.
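One way to picture how a random barcode separates PCR duplicates from identical but independent observations is to group aligned reads by mapping position plus barcode. The sketch below assumes each read has already been aligned and its random barcode extracted; the tuple layout, the read names, and the exact-match grouping rule are all assumptions made for illustration.

```python
from collections import defaultdict


def group_by_molecule(alignments):
    """Group reads that share position, strand and random barcode.

    alignments: iterable of (chrom, start, strand, random_barcode, read_id).
    Reads in the same group are treated as PCR copies of one template.
    """
    groups = defaultdict(list)
    for chrom, start, strand, barcode, read_id in alignments:
        groups[(chrom, start, strand, barcode)].append(read_id)
    return groups


# Hypothetical reads: three map to the same place, two share a barcode.
reads = [
    ("chrI", 1000, "+", "ACGT", "read_a"),
    ("chrI", 1000, "+", "ACGT", "read_b"),  # PCR duplicate of read_a
    ("chrI", 1000, "+", "TTAG", "read_c"),  # same position, different template
    ("chrI", 2500, "-", "ACGT", "read_d"),
]
molecules = group_by_molecule(reads)
print(len(reads), "reads from", len(molecules), "independent templates")
```

Without the barcode, read_c would be indistinguishable from a duplicate of read_a and would be discarded by position-only duplicate removal.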

[Figure: random barcoding]

What are the data?

•  Illumina produces data in fastq format.

@SEQ_ID
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65

Each record has four lines:

1.  ‘@’ followed by a sequence identifier
2.  The sequence
3.  ‘+’, optionally followed by a sequence identifier
4.  The quality scores
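The four-line record structure can be read with a short parser. This is a sketch under stated assumptions: it expects exactly four lines per record (no line-wrapped sequences), and it decodes the quality string with an ASCII offset of 33 (the Sanger convention), which the slide does not specify; older Illumina pipelines used an offset of 64.

```python
def parse_fastq(lines):
    """Yield (identifier, sequence, quality_string) from 4-line FASTQ records."""
    stripped = [line.rstrip("\n") for line in lines if line.strip()]
    if len(stripped) % 4 != 0:
        raise ValueError("FASTQ input is not a multiple of four lines")
    for i in range(0, len(stripped), 4):
        header, seq, plus, qual = stripped[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed record starting at line {i + 1}")
        if len(seq) != len(qual):
            raise ValueError("sequence and quality lengths differ")
        yield header[1:], seq, qual


def phred_scores(quality_string, offset=33):
    """Convert quality characters to integer scores (offset is an assumption)."""
    return [ord(char) - offset for char in quality_string]


record = [
    "@SEQ_ID",
    "GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT",
    "+",
    "!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65",
]
for name, seq, qual in parse_fastq(record):
    scores = phred_scores(qual)
    print(name, len(seq), min(scores), max(scores))
```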

Example of Illumina SeqID

@HWUSI-EAS100R:6:73:941:1973#0/1

HWUSI-EAS100R   The unique instrument name
6               Flowcell lane
73              Tile number within the flowcell
941             'x'-coordinate of the cluster within the tile
1973            'y'-coordinate of the cluster within the tile
#0              Index number for a multiplexed sample (0 for no indexing)
/1              The member of a pair, /1 or /2 (paired-end or mate-pair reads only)
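The fields listed above can be pulled apart with a regular expression. The pattern below is a sketch written against this one example identifier; the field names are chosen here for readability, and newer Illumina software writes identifiers in a different layout that this pattern does not cover.

```python
import re

# Pattern for the older-style identifier shown on the slide (assumed layout).
SEQ_ID_PATTERN = re.compile(
    r"^@?(?P<instrument>[^:]+)"
    r":(?P<lane>\d+):(?P<tile>\d+):(?P<x>\d+):(?P<y>\d+)"
    r"#(?P<index>[^/]+)/(?P<member>[12])$"
)


def parse_seq_id(seq_id):
    """Split an identifier into its named fields; numeric fields become ints."""
    match = SEQ_ID_PATTERN.match(seq_id)
    if match is None:
        raise ValueError(f"unrecognised identifier: {seq_id!r}")
    fields = match.groupdict()
    for key in ("lane", "tile", "x", "y", "member"):
        fields[key] = int(fields[key])
    return fields


print(parse_seq_id("@HWUSI-EAS100R:6:73:941:1973#0/1"))
```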

Assessing Quality

FastQC

http://www.bioinformatics.bbsrc.ac.uk/projects/fastqc/

HTQC

[Figure: HTQC output, panels A-G]

https://sourceforge.net/projects/htqc

De novo assembly

•  Several methods available
•  Short reads require long overlaps
    –  e.g., 33 bp reads must overlap by 20 bp
    –  end-trimming helps, to remove low quality bases.
•  Most de novo short read assemblers use a k-mer hashing based approach.
•  The central challenge of genome assembly is resolving repeat regions.

De novo Assembly Strategies

•  Many, many different algorithms and open source (as well as closed source) software for short read sequence assembly.
•  Choice of tool depends on exactly what you are trying to assemble:
    –  Genome size
    –  Genome complexity
    –  Level of polymorphism
    –  Genome vs. transcriptome vs.
    –  Sequence coverage you have (more is generally better)
    –  Paired-end vs. single end (you should really have paired-end data)
•  E.g.
    –  SSAKE (Warren et al, 2007)
        •  Uses DNA prefix tree to find k-mer matches
    –  Edena (Hernandez et al, 2008)
        •  Overlap layout algorithm plus error correction
    –  Velvet (Zerbino and Birney, 2008)
        •  Uses de Bruijn graph algorithm plus error correction

Example of Velvet de novo Assembly

Sequence (7bp reads)

TAGTCGAGGCTTTAGATCCGATGAGGCTTTAGAGACAG

AGTCGAG CTTTAGA CGATGAG CTTTAGA GTCGGG TTAGATC ATGAGGC GAGACAG GAGGCTC ATCCGAT AGGCTTT GAGACAG AGTCGAG TAGATCC ATGAGGC TAGAGAA
TAGTCGA CTTTAGA CCGATGA TTAGAGA CGAGGCT AGATCCG TGAGGCT AGAGACA
TAGTCGA GCTTTAG TCCGATG GCTCTAG TCGACGC GATCCGA GAGGCTT AGAGACA TAGTCGA TTAGATC GATGAGG TTTAGAG
GTCGAGG TCTAGAT ATGAGGC TAGAGAC AGGCTTT ATCCGAT AGGCTTT GAGACAG AGTCGAG TTAGATT ATGAGGC AGAGACA
GGCTTTA TCCGATG TTTAGAG CGAGGCT TAGATCC TGAGGCT GAGACAG AGTCGAG TTTAGATC ATGAGGC TTAGAGA GAGGCTT GATCCGA GAGGCTT GAGACAG

Hashing (k = 4)
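The hashing step amounts to sliding a window of length k along every read and tallying each k-mer in a hash table. A minimal sketch using Python's built-in dictionary-based `Counter` (the function name is an illustrative choice, and this is not how any particular assembler stores its k-mers):

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer (substring of length k) across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


# Two of the 7bp reads from the slide, hashed with k = 4.
counts = count_kmers(["AGTCGAG", "TAGTCGA"], k=4)
for kmer, n in sorted(counts.items()):
    print(kmer, f"({n}x)")
```

k-mers that come from sequencing errors tend to show up with very low counts, which is what later makes tips and bubbles recognisable in the graph.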

Graph Building

k-mers found in the reads, with the number of times each was observed (the arrows connecting them in the figure were lost in extraction):

TAGT (3x), AGTC (7x), GTCG (9x), TCGA (10x), CGAG (8x), GAGG (16x), AGGC (16x), GGCT (16x), GCTT (11x)
AGAG (9x), GAGA (12x), AGAC (9x), GACA (8x), ACAG (5x)
CTTC (1x), TTCA (2x), TCAG (2x), CAGA (1x)
CTTT (8x), TTTA (8x), TTAG (12x), TAGA (16x)
CGAC (1x), GACG (1x), ACGC (1x)
TGAG (9x), ATGA (8x), GATG (5x), CGAT (6x), CCGA (7x), TCCG (7x), ATCC (7x), GATC (8x), AGAT (8x)
GATT (1x)
AGAA (1x)

Simplification of Linear Stretches

Nodes after merging linear stretches of k-mers:

TAGTCGA, CGAG, CGACGC, GCTCTAG, GCTTTAG, GATCCGATGAG, AGAT, AGAA, GATT, GAGGCT, TAGA, AGAGA, AGACAG

Error (tip and bubble) removal

[Figure: the simplified graph with its tips and its bubble marked]

Nodes remaining after the tips and the bubble are removed:

TAGTCGAG
GAGGCTTTAGA
AGATCCGATGAG
AGAGACAG
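The graph building and simplification steps can be sketched in a few lines. This is a toy illustration of the idea, not Velvet's implementation: k-mers are nodes, consecutive k-mers within a read are joined by an edge, and unbranched chains are merged into one node. It performs no error correction, so it is run here on the error-free sequence from the slides rather than on the noisy reads, and it ignores reverse complements and isolated cycles.

```python
from collections import defaultdict


def build_graph(reads, k):
    """Nodes are k-mers; an edge joins k-mers that are adjacent in a read."""
    nodes, succ, pred = set(), defaultdict(set), defaultdict(set)
    for read in reads:
        kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
        nodes.update(kmers)
        for a, b in zip(kmers, kmers[1:]):
            succ[a].add(b)
            pred[b].add(a)
    return nodes, succ, pred


def simplify(nodes, succ, pred):
    """Merge each unbranched chain of k-mers into a single sequence."""
    def starts_chain(node):
        if len(pred[node]) != 1:
            return True
        (parent,) = pred[node]
        return len(succ[parent]) != 1

    merged = []
    for node in sorted(nodes):
        if not starts_chain(node):
            continue
        sequence, current = node, node
        while len(succ[current]) == 1:
            (nxt,) = succ[current]
            if len(pred[nxt]) != 1:
                break  # nxt is a junction, so the chain ends here
            sequence += nxt[-1]
            current = nxt
        merged.append(sequence)
    return merged


genome = "TAGTCGAGGCTTTAGATCCGATGAGGCTTTAGAGACAG"
print(simplify(*build_graph([genome], k=4)))
```

On this input the merged nodes are the same four sequences shown on the slide, with the repeated stretch GAGGCTTTAGA appearing once as a shared node.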

De novo Assembler Performance

•  All three programs run with default parameters on the same dataset
    –  Input: 8.6 million reads
    –  Platform: 64-bit Opteron, 4 CPUs, 32 GB memory

Program   Version   CPU time   Wall clock
SSAKE     3.0       2:24:59    5:08:59
Edena     2.11      0:28:31    28:58
Velvet    0.5       0:08:48    10:36

De novo assemblies

Program   # Contigs >200 bp   N50 (bp)   Sum (bp)     Singletons
SSAKE     12,532              549        6,090,567    3,164,495
Edena     8,316               902        5,759,209    3,955,865
Velvet    7,382               1,252      6,474,426    1,273,164

Program   # Contigs   N50 (bp)   Sum (bp)      Max contig
SSAKE     185,030     87         14,287,079    5,490
Edena     11,180      837        6,175,460     11,300
Velvet    10,684      1,184      6,841,458     16,239
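The tables report N50, which the slides do not define. Under the usual definition, N50 is the length of the contig at which, going from longest to shortest, the running total first reaches half of the assembly's summed length. A small sketch with made-up contig lengths:

```python
def n50(contig_lengths):
    """Length L such that contigs of length >= L hold at least half the total."""
    if not contig_lengths:
        raise ValueError("need at least one contig")
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Hypothetical contig lengths (bp), for illustration only.
print(n50([2, 3, 4, 5, 6, 7, 8, 9, 10]))
```

Because N50 depends on which contigs are counted, it changes when short contigs are filtered out, which is one reason the two tables above give different values for the same assembler.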

Assembly Limitations

•  Common repeat regions are typically missing/collapsed
    –  Han Chinese genome missing ~420Mbp of repeats
•  Same is true for segmental duplications
    –  Han Chinese genome only contains ~10Mbp of ~150Mbp of segmental duplications.
•  You typically get very large numbers of contigs, which range in size from very small, to sometimes quite large.

Recent Assembler Comparisons

•  Earl et al. (2011). Assemblathon 1: A competitive assessment of de novo short read assembly methods. Genome Research 21: 2224-2241.
    –  Used a simulated dataset for all competitors to assemble
•  Salzberg et al. (2012). GAGE: A critical evaluation of genome assemblies and assembly algorithms. Genome Research 22(3):557-67.
    –  Applied several assembly algorithms to their own datasets, for several different sized genomes
•  Bradnam et al. (2013). Assemblathon 2: evaluating de novo methods of genome assembly in three vertebrate species. Gigascience 2(1):10.
    –  See http://assemblathon.org/
•  If you have an assembly problem, you should read these papers to gain some insights into strengths and weaknesses of different assemblers

Improving de novo Assemblies

•  Need to generate additional long range continuity to be able to orient and order contigs
    –  Mate pairs
    –  Hybrid approach using PacBio Reads
    –  Synthetic long reads (aka Moleculo)

Mate-pair libraries

•  Goal is to have the equivalent of 2-5kb insert libraries.
•  However, technology is limited to ~700 bp fragments
    –  Means you have to use some molecular biology to accomplish the equivalent.

[Figure sequence: mate-pair library construction]

1.  Genomic DNA
2.  Fragment
3.  Size Select (2-5kb)
4.  Biotinylate
5.  Circularize
6.  Fragment (400-600bp)
7.  Enrich Biotinylated fragments
8.  Standard Paired End Illumina Sequencing
9.  Incorporate Mate-pair information into assembly

Leveraging Multiple Technologies

•  Illumina is great, because you can get a ton of data
    –  BUT read length is short
•  PacBio is great, because read lengths are long
    –  BUT the raw reads have a high error rate
•  Two approaches have been used:
    1.  Hybrid error correction
        •  Use short reads to perform correction of long PacBio reads, and then assemble those
    2.  Use PacBio reads to improve existing (short read or Sanger based) assemblies
        •  E.g. with 24× mapped coverage of PacBio long reads applied to a D. pseudoobscura assembly, 99% of gaps were addressed, with 69% being closed and a further 12% improved.

Long “Synthetic Reads” aka Moleculo

1.  Genomic DNA
2.  Fragment
3.  Size Select (10kb)
4.  Polish, ligate amplification adaptors
5.  Dilute the ~10 kb DNA to 500 molecules per well
6.  Amplify, fragment, add sequencing adaptors
7.  Pool
8.  Sequence
9.  Separate, based on bar code
10.  Remove barcodes, assemble 10kb fragments
11.  Assemble genome from 10kb fragments


Mapping Short Reads

•  Many options; often a trade off between speed, resources and sensitivity.
•  Several open source projects to solve this problem, continually improving in speed and memory requirements.
•  New features being added all the time.
•  When dealing with short read data, make sure you have the very latest versions of the software you’re using, as some are updated frequently.

Alignment

[Figure: alignment of read @HWI-EAS412_4:1:1:1376:380, sequence AATGATACGGCGACCACCGAGATCTA]

Approaches to Short Read Alignment

•  Hash-Based mapping
    –  Hashing of reads (E.g. Maq, Eland, SHRiMP)
    –  Hashing of genome (E.g. novoalign, SHRiMP2)
•  Indexing using Suffix Array/Burrows-Wheeler Transform (BWT) (E.g. bowtie, bwa)

SHRiMP2

•  Uses hash of genome to find alignment seeds, then performs Smith-Waterman
    –  SW, while slow, is accelerated; requires x86_64 processor (which most macs have nowadays)
•  Can detect indels, as well as mismatches
•  As of v2.2, takes into account quality scores
•  Takes longer than bowtie and bwa, but is more sensitive than both

MAQ

•  Much faster than SHRiMP, at the cost of accuracy (cannot find indels)
•  Uses a hashing technique to index the reads
•  Guaranteed to find alignments with up to 2 mismatches
•  Can take advantage of paired end reads
•  Uses sequence quality scores to determine best alignments
•  Generally no longer used

http://sourceforge.net/projects/maq/

How does “hashing” work?

•  A hash function simply converts a string (“key”) to an integer (“value”).
•  The integer is then used as an index in an array, for fast look up.
•  In MAQ, the reads are “hashed”, using 6 different permutations of the first 28bp.
•  The genome is then looked through, in 28bp chunks, to see if they match, via the hash, to reads.
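The read-hashing idea can be sketched with a Python dictionary, which hashes its string keys internally. This is a simplified illustration, not MAQ's algorithm: it indexes only the exact read prefix, so it finds perfect seed matches only, whereas the several permutations mentioned above are what allow mismatches to be tolerated. The read names and the short seed length are made up to suit the toy sequence.

```python
from collections import defaultdict


def index_reads(reads, seed_length):
    """Hash table from each read's first seed_length bases to read names."""
    table = defaultdict(list)
    for name, sequence in reads.items():
        if len(sequence) >= seed_length:
            table[sequence[:seed_length]].append(name)
    return table


def scan_genome(genome, table, seed_length):
    """Slide along the genome and look each window up in the read table."""
    hits = []
    for position in range(len(genome) - seed_length + 1):
        window = genome[position:position + seed_length]
        for name in table.get(window, ()):
            hits.append((name, position))
    return hits


genome = "TAGTCGAGGCTTTAGATCCGATGAGGCTTTAGAGACAG"
reads = {"r1": "GATCCGATG", "r2": "AGAGACAG", "r3": "CCCCCCCC"}  # hypothetical
table = index_reads(reads, seed_length=8)
print(scan_genome(genome, table, seed_length=8))
```

Each genome window costs one hash lookup regardless of how many reads are indexed, which is where the speed comes from.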

Bowtie

•  Similar to MAQ, in that it uses quality scores to find best alignments.
•  Uses “Burrows-Wheeler index” to keep its memory footprint small.
•  Can find alignments with up to 3 mismatches in the first L bases of the read.
•  Only ungapped alignments
•  Also supports paired end reads.
•  Bowtie2 supports gapped alignments too.

http://bowtie-bio.sourceforge.net/
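To give a feel for what a Burrows-Wheeler index does, the sketch below builds one for a toy reference and finds exact matches by backward search, consuming the read from its last base to its first. It is a teaching example under simplifying assumptions, not Bowtie's code: the suffix array is built by naive sorting, the full suffix array and occurrence table are kept in memory uncompressed, and mismatches, quality scores and reverse complements are ignored.

```python
def build_fm_index(reference):
    """Return (suffix_array, first_column_start, occurrence_table)."""
    text = reference + "$"  # "$" sorts before A, C, G and T
    suffix_array = sorted(range(len(text)), key=lambda i: text[i:])
    # The BWT is the character preceding each sorted suffix.
    bwt = "".join(text[i - 1] for i in suffix_array)
    alphabet = sorted(set(text))
    # first_column_start[c]: number of characters in text smaller than c.
    first_column_start, total = {}, 0
    for char in alphabet:
        first_column_start[char] = total
        total += text.count(char)
    # occurrence[c][i]: number of times c appears in bwt[:i].
    occurrence = {char: [0] * (len(bwt) + 1) for char in alphabet}
    for i, bwt_char in enumerate(bwt):
        for char in alphabet:
            occurrence[char][i + 1] = occurrence[char][i] + (bwt_char == char)
    return suffix_array, first_column_start, occurrence


def find_exact(pattern, index):
    """Backward search: return sorted 0-based positions of exact matches."""
    suffix_array, first_column_start, occurrence = index
    low, high = 0, len(suffix_array)  # half-open range of matching suffixes
    for char in reversed(pattern):
        if char not in first_column_start:
            return []
        low = first_column_start[char] + occurrence[char][low]
        high = first_column_start[char] + occurrence[char][high]
        if low >= high:
            return []
    return sorted(suffix_array[i] for i in range(low, high))


reference = "TAGTCGAGGCTTTAGATCCGATGAGGCTTTAGAGACAG"
index = build_fm_index(reference)
print(find_exact("GAGGCTTTAGA", index))
```

Each base of the read narrows the range of candidate suffixes with two table lookups, so search time depends on read length rather than genome size.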

Bowtie Algorithm

[Figure sequence over 16 slides: step-by-step matching of read @HWI-EAS412_4:1:1:1376:380, sequence AATGATACGGCGACCACCGAGATCTA. The last two slides show the sequence AATAATACGGCGACCACCGAGATCTA, which differs from the read at the fourth base.]

BWA

•  From the author of Maq, but now uses the Burrows-Wheeler transform to significantly speed it up.
•  Can also find small indels, in contrast to both Maq and Bowtie.
•  Is slightly slower than bowtie, but its ability to find indels makes it more useful if SNVs are important to you.

Comparison

•  PC: 2.4 GHz Intel Core 2, 2 GB RAM
•  Server: 2.4 GHz AMD Opteron, 32 GB RAM
•  Bowtie v0.9.6, Maq v0.6.6, SOAP v1.10
•  SOAP not run on PC due to memory constraints
•  Reads: FASTQ 8.84 M reads from 1000 Genomes (Acc: SRR001115)
•  Reference: Human (NCBI 36.3, contigs)

Program                CPU time      Wall clock time   Reads per hour   Peak virtual memory footprint   Bowtie speedup   Reads aligned (%)
Bowtie -v 2 (server)   15m:07s       15m:41s           33.8 M           1,149 MB                        -                67.4
SOAP (server)          91h:57m:35s   91h:47m:46s       0.08 M           13,619 MB                       351x             67.3
Bowtie (PC)            16m:41s       17m:57s           29.5 M           1,353 MB                        -                71.9
Maq (PC)               17h:46m:35s   17h:53m:07s       0.49 M           804 MB                          59.8x            74.7
Bowtie (server)        17m:58s       18m:26s           28.8 M           1,353 MB                        -                71.9
Maq (server)           32h:56m:53s   32h:58m:39s       0.27 M           804 MB                          107x             74.7
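Two of the derived columns can be recomputed from the timings. The sketch below assumes that "Reads per hour" is the 8.84 M input reads divided by wall clock time and that "Bowtie speedup" is the ratio of wall clock times; the slide does not state either definition, but the Bowtie and Maq rows come out consistent with them. The helper names are illustrative.

```python
def to_seconds(hours=0, minutes=0, seconds=0):
    """Convert an h:m:s timing from the table into seconds."""
    return 3600 * hours + 60 * minutes + seconds


def reads_per_hour(n_reads, wall_clock_seconds):
    """Throughput, assuming it is measured against wall clock time."""
    return n_reads / (wall_clock_seconds / 3600)


def speedup(other_wall_clock_seconds, bowtie_wall_clock_seconds):
    """How many times faster Bowtie finished than the other tool."""
    return other_wall_clock_seconds / bowtie_wall_clock_seconds


N_READS = 8.84e6  # from the slide

bowtie_v2_server = to_seconds(minutes=15, seconds=41)
print(f"{reads_per_hour(N_READS, bowtie_v2_server) / 1e6:.1f} M reads per hour")

maq_pc = to_seconds(17, 53, 7)
bowtie_pc = to_seconds(minutes=17, seconds=57)
print(f"{speedup(maq_pc, bowtie_pc):.1f}x")
```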


•  Bowtie delivers about 30 million alignments per CPU hour



•  Bowtie and Maq have memory footprints compatible with a typical workstation with 2 GB of RAM
•  SOAP requires a computer with >13 GB of RAM
•  SOAP2 claims to be “super-fast”, and require less RAM (also uses BW Transform).
•  Your choice will be dictated by your needs (sensitivity, genome size, number of reads) and your computing resources, and may change over time.


More Comparison Data

[Figure: Precision and recall by amount of variation for 4 datasets, by polymorphism: (number of SNPs, Indel size).]

David M, Dzamba M, Lister D, Ilie L, Brudno M. (2011). SHRiMP2: sensitive yet practical SHort Read Mapping. Bioinformatics 27(7):1011-2.

Current Practice

•  Most people use bwa for mapping their short read data if they want to discover variants
•  If not interested in variants, people use bowtie for speed
•  Most people don’t determine whether the tool they are using is the best for their purpose
•  There is no standard benchmark dataset, though see:
    –  Holtgrewe et al. (2011). A novel and well-defined benchmarking method for second generation read mapping. BMC Bioinformatics 12:210.
•  It doesn’t hurt to experiment

Recommended Reading

Mapping

•  Li, H., Ruan, J. and Durbin, R. (2008). Mapping short DNA sequencing reads and calling variants using mapping quality scores. Genome Res. 18(11):1851-8. [MAQ]
•  David, M., Dzamba, M., Lister, D., Ilie, L. and Brudno, M. (2011). SHRiMP2: sensitive yet practical SHort Read Mapping. Bioinformatics 27(7):1011-2.
•  Langmead, B., Trapnell, C., Pop, M. and Salzberg, S.L. (2009). Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol. 10(3):R25. [Bowtie]
•  Langmead, B. and Salzberg, S. (2012). Fast gapped-read alignment with Bowtie 2. Nature Methods 9:357-359.
•  Li, R., Yu, C., Li, Y., Lam, T.W., Yiu, S.M., Kristiansen, K. and Wang, J. (2009). SOAP2: an improved ultrafast tool for short read alignment. Bioinformatics 25(15):1966-7.
•  Li, H. and Durbin, R. (2009). Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 25(14):1754-60. [BWA]

Assembly

•  Zerbino, D.R. and Birney, E. (2008). Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome Res. 18(5):821-9.
•  Zerbino, D.R., McEwen, G.K., Margulies, E.H. and Birney, E. (2009). Pebble and rock band: heuristic resolution of repeats and scaffolding in the velvet short-read de novo assembler. PLoS One 4(12):e8407.
•  Simpson, J.T. and Durbin, R. (2010). Efficient construction of an assembly string graph using the FM-index. Bioinformatics 26(12):i367-73.
•  Simpson, J.T. and Durbin, R. (2012). Efficient de novo assembly of large genomes using compressed data structures. Genome Research 22(3):549-56. [SGA]
•  Earl, D. et al. (2011). Assemblathon 1: a competitive assessment of de novo short read assembly methods. Genome Research 21(12):2224-41.
•  Bradnam, K.R. et al. (2013). Assemblathon 2: evaluating de novo methods of genome assembly in three vertebrate species. Gigascience 2(1):10.
•  English, A.C., Richards, S., Han, Y., Wang, M., Vee, V., Qu, J., Qin, X., Muzny, D.M., Reid, J.G., Worley, K.C., Gibbs, R.A. (2012). Mind the gap: upgrading genomes with Pacific Biosciences RS long-read sequencing technology. PLoS One 7(11):e47768.

Moleculo

•  Voskoboynik, A., et al. (2013). The genome sequence of the colonial chordate, Botryllus schlosseri. Elife 2:e00569.