Genetics 211 - 2016 Lecture 2

High Throughput Sequencing
Gavin Sherlock
gsherloc@stanford.edu
January 12th 2015

Differences in Throughput

Parameter                        Sanger (AB 3730)    Illumina (NextSeq 500)
Read length (bp)                 800                 2 x 150
Number of reads per run [days]   96 [<1]             400,000,000 [~1 day]
Throughput                       6 Mb/day            ~120 Gb/day
SNP error rate                   low                 high (~0.5%)
Indel error rate                 low                 low
Cost                             $500/Mb             <$0.05/Mb
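
The Illumina throughput figure follows from the read count and read length in the table. A minimal Python sketch of that arithmetic, assuming the "~1 day" run takes exactly one day and that each of the 400,000,000 reads is a 2 x 150 bp pair (the function and parameter names are illustrative, not from the lecture):

```python
def bases_per_day(n_reads, read_length_bp, reads_per_fragment=1, run_days=1.0):
    """Return the number of bases sequenced per day for a single run."""
    return n_reads * read_length_bp * reads_per_fragment / run_days


# Values from the table: 400,000,000 reads, 2 x 150 bp, ~1 day per run.
illumina = bases_per_day(400_000_000, 150, reads_per_fragment=2, run_days=1.0)
print(f"Illumina NextSeq 500: ~{illumina / 1e9:.0f} Gb/day")
```

The Sanger column cannot be checked the same way from the table alone, because the number of 96-read runs per day is not stated.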

Illumina: Flow Cells with “Molecular Colonies”

•  flow cell with randomly spaced molecular clusters
•  spacing depends on initial seeding of the single molecules onto the flow cell

[Image: clusters on a flow cell; scale bar labeled 1µM]


Detection, Chemistry

•  Massively Parallel Detection on immobilized “molecular colonies”

•  Means you have to measure (image) every cycle, instead of the Sanger model (letting reaction go to completion and then separating products by size)

•  Requires specially designed chemistry, using reversible dye-terminators and a polymerase

Illumina Sequencing Technology: Robust Reversible Terminator Chemistry Foundation

[Figure: workflow from DNA (0.1-1.0 ug) through sample preparation, single molecule array, cluster growth, sequencing over successive cycles (1-9), image acquisition and base calling, giving a read such as T G C T A C G A T …]

Illumina Sequence Visualization

250+ Million Clusters Per Flow Cell

[Image: clusters on a flow cell at two magnifications; scale bars 20 microns and 100 microns]

Illumina Sequencing: Reversible Terminators

[Figure: one sequencing cycle. Incorporate: a nucleotide is added whose 3’ OH is blocked and which carries a fluorophore attached through a cleavage site. Detection: the fluorophore is imaged. Deblock and cleave off dye: this leaves a free 3’ end OH. Ready for next cycle.]

Image Processing, Base Calling

•  Image processing algorithms find signals in each panel, align signals from different panels, etc.
   –  Machines ship with server or small cluster that does image analysis while run is happening
•  Sequence data after base calling much reduced in size (tens of gigabytes) => more manageable but still large amounts that add up over time
•  Unsustainable to keep image data; people discard the images, and just keep the sequences (fastq format).

Recent Illumina Innovations

•  Patterned flow cells (HiSeq 3000/HiSeq 4000 systems)
   –  Allows denser cluster spacing
   –  Avoids cluster overlap
   –  Image analysis easier
•  Two-Channel SBS (NextSeq)
   –  Two, rather than four, colors
   –  Leads to faster sequencing times
•  Synthetic Long Reads
   –  We’ll discuss later in lecture
•  Coming soon
   –  MiniSeq
   –  Project Firefly

Pacific Biosciences

•  Single Molecule Real Time DNA Sequencing
•  Read lengths now averaging ~10-15kb, max >40kb
•  Strobe sequencing
•  Observation of DNA modifications
•  Throughput per run is low (~1 million reads on PacBio SEQUEL machine), but run time is short (30 mins – 6 hours)
•  Error rate is high, though hybrid approaches can significantly improve assemblies generated by short reads alone.

Oxford Nanopore

•  MinION and GridION products
•  Not yet on market, but in early release
•  DNA “sequenced” as it is dragged through a nanopore
•  Very high error rate (5-40%)
•  Reads from 5-50kb (as long as 100kb?)
•  Some data published, more likely on the way – keep an eye on the AGBT meeting next month
•  This year might actually be the year that nanopore breaks through

Applications of Next-Gen Sequencing

genome
•  genome variability
•  metagenomics
•  genome modifications
•  detection of mutations
•  association studies
•  phylogeny
•  evolution

chromatin
•  Nucleosome positioning
•  3D genome architecture
•  Active promoters
•  interactions between nucleic acids and proteins
•  chromatin modifications

transcriptome
•  interactions between nucleic acids and proteins
•  transcript identity
•  transcript abundance
•  RNA editing
•  SNPs
•  Allele specific expression
•  Regulation

Methods and their analysis steps
•  de novo sequencing: assembly, annotation
•  resequencing: mapping, detection of variants
•  ATAC-Seq: mapping, identify open chromatin
•  ChIP-Seq: mapping, detection of binding sites
•  RNA-Seq: mapping, transcript detection and quantification
•  Hi-C: mapping, 3D reconstruction

How do we make an Illumina Genomic DNA library?

[Figure: workflow. Double-stranded genomic DNA → Fragment (Covaris) → Polish, add dA overhang → Add adaptors, size select → Sequence]

Making fragments asymmetric

Fragmented, end polished, phosphorylated, dA overhang DNA sample

5'-pNNNN.........NNNNA-3'
3'-ANNNN.........NNNNp-5'

Genomic Y-adapter

5'-ACACTCTTTCCCTACACGACGCTCTTCCGATCT-3'
3'-GTTCGTCTTCTGCCGTATGCTCGAGAAGGCTAGp-5'

Ligate

5'-ACACTCTTTCCCTACACGACGCTCTTCCGATCTNNNN.........NNNNAGATCGGAAGAGCTCGTATGCCGTCTTCTGCTTG-3'
3'-GTTCGTCTTCTGCCGTATGCTCGAGAAGGCTAGANNNN.........NNNNTCTAGCCTTCTCGCAGCACATCCCTTTCTCACA-5'

[Ligation product is gel purified, selecting only those products in a certain size range]

Making our genomic DNA library asymmetric

Round 1 of PCR

5'-ACACTCTTTCCCTACACGACGCTCTTCCGATCTNNNN.........NNNNAGATCGGAAGAGCTCGTATGCCGTCTTCTGCTTG-3'
3'-GTTCGTCTTCTGCCGTATGCTCGAGAAGGCTAGANNNN.........NNNNTCTAGCCTTCTCGCAGCACATCCCTTTCTCACA-5'

3'-TCTAGCCTTCTCGAGCATACGGCAGAAGACGAAC-5'
5'-CAAGCAGAAGACGGCATACGAGCTCTTCCGATCT-3'

Products of first round:

5'-ACACTCTTTCCCTACACGACGCTCTTCCGATCTNNNN.........NNNNAGATCGGAAGAGCTCGTATGCCGTCTTCTGCTTG-3'
3'-TGTGAGAAAGGGATGTGCTGCGAGAAGGCTAGANNNN.........NNNNTCTAGCCTTCTCGAGCATACGGCAGAAGACGAAC-5'

5'-CAAGCAGAAGACGGCATACGAGCTCTTCCGATCTNNNN.........NNNNAGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT-3'
3'-GTTCGTCTTCTGCCGTATGCTCGAGAAGGCTAGANNNN.........NNNNTCTAGCCTTCTCGCAGCACATCCCTTTCTCACA-5'

Finishing and Sequencing the Library

Rounds 2-18

5'-ACACTCTTTCCCTACACGACGCTCTTCCGATCTNNNN.........NNNNAGATCGGAAGAGCTCGTATGCCGTCTTCTGCTTG-3'
3'-TGTGAGAAAGGGATGTGCTGCGAGAAGGCTAGANNNN.........NNNNTCTAGCCTTCTCGAGCATACGGCAGAAGACGAAC-5'

3'-TCTAGCCTTCTCGAGCATACGGCAGAAGACGAAC-5'
5'-AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT-3'

Product of PCR amplification

5'-AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCTNNNN.........NNNNAGATCGGAAGAGCTCGTATGCCGTCTTCTGCTTG-3'
3'-TTACTATGCCGCTGGTGGCTCTAGATGTGAGAAAGGGATGTGCTGCGAGAAGGCTAGANNNN.........NNNNTCTAGCCTTCTCGAGCATACGGCAGAAGACGAAC-5'

[Anneal to flow cell. Perform cluster generation]

Genomic DNA Sequencing Primer

5'-AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCTNNNN.........NNNNAGATCGGAAGAGCTCGTATGCCGTCTTCTGCTTG-3'
3'-TTACTATGCCGCTGGTGGCTCTAGATGTGAGAAAGGGATGTGCTGCGAGAAGGCTAGANNNN.........NNNNTCTAGCCTTCTCGAGCATACGGCAGAAGACGAAC-5'

5'-ACACTCTTTCCCTACACGACGCTCTTCCGATCT-3'

Nextera Library Preparation

[Figure: transposomes carrying Adaptor 1 and Adaptor 2 are combined with genomic DNA; tagmentation fragments the DNA and attaches the adaptors; (suppression) PCR with Primer 1 and Primer 2 follows]

How Much Sequence?

•  HiSeq 2500 can give ~250 million reads/lane of paired end 100bp reads
•  This is 50Gb of sequence
•  This is ~4,000x coverage of yeast (12Mb).
•  This is an obvious waste of resources (it’s also ~500x C. elegans, and ~500x D. melanogaster)
•  How can we sequence on a HiSeq and not waste all these resources when sequencing smaller genomes?
•  HiSeq 3000/HiSeq 4000
   –  Patterned flow cells (not random clusters)
   –  Almost twice as much data, half the time
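
The coverage figures above are total sequenced bases divided by genome size. A minimal Python sketch of that calculation for the yeast case, using the lane yield and genome size quoted on the slide; the 100x target coverage per sample is an assumption made only to illustrate how many samples one lane could hold, and all names are illustrative:

```python
def coverage(n_reads, read_length_bp, genome_size_bp, paired=True):
    """Mean fold coverage: total sequenced bases divided by genome size."""
    reads_per_cluster = 2 if paired else 1
    return n_reads * reads_per_cluster * read_length_bp / genome_size_bp


LANE_READS = 250_000_000    # ~250 million paired-end reads per lane (from the slide)
READ_LENGTH = 100           # bp per read (from the slide)
YEAST_GENOME = 12_000_000   # 12 Mb (from the slide)

lane_bases = LANE_READS * 2 * READ_LENGTH
yeast_coverage = coverage(LANE_READS, READ_LENGTH, YEAST_GENOME)

TARGET_COVERAGE = 100       # assumption for illustration, not from the lecture
samples_per_lane = int(yeast_coverage // TARGET_COVERAGE)

print(f"{lane_bases / 1e9:.0f} Gb per lane")
print(f"~{yeast_coverage:,.0f}x yeast coverage from one lane")
print(f"{samples_per_lane} yeast samples per lane at {TARGET_COVERAGE}x each")
```

Under these assumptions a single lane holds dozens of yeast genomes, which is the motivation for running several samples on one lane.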

Multiplexed Sequencing, using Barcodes

•  Two ways to perform barcode sequencing
   –  In-line barcodes
      •  Barcode is read as part of the normal sequencing read
   –  Index barcodes
      •  Barcode is read as a third, short sequencing run (also known as index reads)
•  Can be used to run multiple samples from any particular origin on the same lane of a HiSeq, with the barcodes allowing the samples to be de-convoluted afterwards.
•  Barcodes should be designed so that they are balanced in GC content, and as dissimilar as possible. (Hamming distance > 2).
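
The design rule above can be checked mechanically. A minimal Python sketch, assuming equal-length barcodes; the four barcodes are hypothetical examples, not a real index set, and `assign` is an illustrative helper that tolerates one mismatch, which stays unambiguous when every pair of barcodes differs at more than two positions:

```python
from itertools import combinations


def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


def min_pairwise_distance(barcodes):
    """Smallest Hamming distance between any two barcodes in the set."""
    return min(hamming(a, b) for a, b in combinations(barcodes, 2))


def gc_fraction(seq):
    """Fraction of bases that are G or C."""
    return (seq.count("G") + seq.count("C")) / len(seq)


def assign(observed, barcodes, max_mismatches=1):
    """Return the single barcode within max_mismatches of the observed
    sequence, or None if there is no match or the match is ambiguous."""
    hits = [b for b in barcodes if hamming(observed, b) <= max_mismatches]
    return hits[0] if len(hits) == 1 else None


# Hypothetical barcode set, for illustration only.
barcodes = ["ACGTAC", "CATGCA", "GTACGT", "TGCATG"]

print("minimum pairwise distance:", min_pairwise_distance(barcodes))
print("GC fractions:", [gc_fraction(b) for b in barcodes])
print("ACGTAA assigned to:", assign("ACGTAA", barcodes))
```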

In-line Barcode Sequencing

Index barcoding

Unique Molecular Identifiers (UMIs)

•  During the PCR step, each template gets amplified many times

•  If your library is of insufficient complexity, or you overamplify, you may have PCR duplicates

•  You want to make independent observations, not redundant observations

•  When sequencing to high coverage, you may have identical, but non-redundant observations.

•  Want to be able to distinguish these.
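
One way to separate PCR duplicates from identical but independent observations is to tag each template with a random barcode (UMI) before amplification and count each distinct (position, UMI) combination once. A minimal Python sketch, assuming reads have already been aligned and reduced to hypothetical `(chrom, start, strand, umi)` tuples; it ignores sequencing errors within the UMI itself:

```python
from collections import Counter


def count_unique_molecules(alignments):
    """Count distinct UMIs seen at each (chrom, start, strand).

    Reads sharing both position and UMI are treated as PCR duplicates and
    counted once; reads sharing a position but carrying different UMIs are
    treated as independent molecules.
    """
    distinct = set(alignments)  # collapses exact (position, UMI) repeats
    return Counter((chrom, start, strand) for chrom, start, strand, _umi in distinct)


# Hypothetical aligned reads, for illustration only.
alignments = [
    ("chrI", 100, "+", "ACGT"),
    ("chrI", 100, "+", "ACGT"),  # PCR duplicate of the read above
    ("chrI", 100, "+", "TTAG"),  # same position, different molecule
    ("chrI", 250, "-", "ACGT"),
]

molecules = count_unique_molecules(alignments)
print(molecules[("chrI", 100, "+")], "independent molecules at chrI:100(+)")
```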

Unique Molecular Identifiers (UMIs) using Random Barcodes

Longer, and/or more Accurate Reads

[Plot: distribution of insert sizes; x-axis Insert Size, 0 to 500]

•  Insert sizes are a distribution
•  Some inserts not necessarily longer than twice the read length
•  What does this mean for paired end reads?
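
When the insert is shorter than twice the read length, the two reads of a pair overlap and can be merged into one longer read. A minimal Python sketch, assuming error-free reads, exact matching, and an insert at least as long as one read; the 38 bp fragment and 25 bp read length are illustrative values, and the dedicated read-merging tools handle mismatches and quality scores, which this does not:

```python
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def overlap_length(insert_size, read_length):
    """Expected overlap (bp) between the two reads of a pair."""
    return max(0, 2 * read_length - insert_size)


def merge_pair(read1, read2, min_overlap=10):
    """Merge read1 with the reverse complement of read2 at the longest exact
    suffix/prefix overlap; return None if no overlap of min_overlap is found."""
    r2 = reverse_complement(read2)
    for n in range(min(len(read1), len(r2)), min_overlap - 1, -1):
        if read1[-n:] == r2[:n]:
            return read1 + r2[n:]
    return None


# Illustrative 38 bp fragment sequenced with 25 bp paired-end reads.
fragment = "TAGTCGAGGCTTTAGATCCGATGAGGCTTTAGAGACAG"
read_length = 25
read1 = fragment[:read_length]
read2 = reverse_complement(fragment)[:read_length]

print("expected overlap:", overlap_length(len(fragment), read_length), "bp")
print("merged equals fragment:", merge_pair(read1, read2) == fragment)
```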


Read Error Correction

•  Many approaches, and lots of available tools
•  Most rely on the idea of looking for rare k-mers:
   –  Build up a table of all k-mers, and their frequencies.
   –  Consecutive k-mers that cover an error in a read should be at lower frequency, given sufficient coverage
   –  Can use this to recognize errors in reads and correct them
   –  If done without deference to quality scores, assumes homogeneous sample
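
The rare k-mer idea can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: a toy 38 bp sequence, error-free 7 bp reads at 5x redundancy plus a single read with one substitution, k = 4, and a hypothetical threshold `min_count`. It flags suspicious k-mers but does not attempt the correction step:

```python
from collections import Counter


def kmers(seq, k):
    """All overlapping k-mers of a sequence, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def kmer_table(reads, k):
    """Table of k-mer frequencies over all reads."""
    counts = Counter()
    for read in reads:
        counts.update(kmers(read, k))
    return counts


def rare_kmer_starts(read, counts, k, min_count):
    """Start positions in the read of k-mers seen fewer than min_count times."""
    return [i for i, km in enumerate(kmers(read, k)) if counts[km] < min_count]


# Toy data: every 7 bp window of the sequence, five copies each,
# plus one read (AGTCGAG) carrying a single substitution (A -> G).
sequence = "TAGTCGAGGCTTTAGATCCGATGAGGCTTTAGAGACAG"
good_reads = kmers(sequence, 7) * 5
bad_read = "AGTCGGG"

counts = kmer_table(good_reads + [bad_read], k=4)
print("suspicious k-mer starts:", rare_kmer_starts(bad_read, counts, k=4, min_count=2))
```

The flagged k-mers are consecutive and all cover the substituted base, which is the signal a corrector would use to locate the error.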

What are the data?

•  Illumina produces data in fastq format.

@SEQ_ID
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65

Line 1: ‘@’ followed by a sequence identifier
Line 2: The sequence
Line 3: ‘+’, optionally followed by a sequence identifier
Line 4: The quality scores
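
A minimal Python sketch of reading such four-line records and decoding the quality line. It assumes unwrapped records (exactly four lines each) and the Phred+33 encoding, in which each quality character encodes Q = ord(char) - 33 and the error probability is 10^(-Q/10); older Illumina pipelines used a different offset, so the `offset` parameter is an assumption to check against your data:

```python
import io


def read_fastq(handle):
    """Yield (identifier, sequence, quality) from an unwrapped FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of file
        seq = handle.readline().rstrip("\n")
        plus = handle.readline().rstrip("\n")
        qual = handle.readline().rstrip("\n")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("malformed FASTQ record")
        if len(seq) != len(qual):
            raise ValueError("sequence and quality lengths differ")
        yield header[1:], seq, qual


def phred_scores(qual, offset=33):
    """Decode a quality string into integer Phred scores."""
    return [ord(c) - offset for c in qual]


def error_probability(q):
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)


EXAMPLE = (
    "@SEQ_ID\n"
    "GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT\n"
    "+\n"
    "!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65\n"
)

records = list(read_fastq(io.StringIO(EXAMPLE)))
name, seq, qual = records[0]
scores = phred_scores(qual)
print(name, len(seq), "bases; first and last quality scores:", scores[0], scores[-1])
```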

Example of Illumina SeqID

@HWUSI-EAS100R:6:73:941:1973#0/1

HWUSI-EAS100R   The unique instrument name
6               Flowcell lane
73              Tile number within the flowcell
941             'x'-coordinate of the cluster within the tile
1973            'y'-coordinate of the cluster within the tile
#0              index number for a multiplexed sample (0 for no indexing)
/1              the member of a pair, /1 or /2 (paired-end or mate-pair reads only)
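
A minimal Python sketch that splits an identifier of this form into its fields. The regular expression and the `IlluminaReadID` type are illustrative and cover only the layout shown above; identifiers written by other pipeline versions use a different layout and would need a different pattern:

```python
import re
from typing import NamedTuple, Optional


class IlluminaReadID(NamedTuple):
    instrument: str
    lane: int
    tile: int
    x: int
    y: int
    index: str
    pair: Optional[int]


# Pattern for identifiers shaped like instrument:lane:tile:x:y#index/pair
READ_ID = re.compile(
    r"^@?(?P<instrument>[^:]+):(?P<lane>\d+):(?P<tile>\d+)"
    r":(?P<x>\d+):(?P<y>\d+)#(?P<index>[^/]+)(?:/(?P<pair>[12]))?$"
)


def parse_read_id(text):
    """Parse an identifier such as @HWUSI-EAS100R:6:73:941:1973#0/1."""
    m = READ_ID.match(text.strip())
    if m is None:
        raise ValueError(f"unrecognised read identifier: {text!r}")
    pair = int(m["pair"]) if m["pair"] else None
    return IlluminaReadID(
        m["instrument"], int(m["lane"]), int(m["tile"]),
        int(m["x"]), int(m["y"]), m["index"], pair,
    )


read_id = parse_read_id("@HWUSI-EAS100R:6:73:941:1973#0/1")
print(read_id)
```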

Assessing Quality

FastQC

http://www.bioinformatics.bbsrc.ac.uk/projects/fastqc/

HTQC

https://sourceforge.net/projects/htqc

De novo Assembly of Short Reads

•  Several methods available
•  Short reads require long overlaps
   –  e.g., 33 bp reads must overlap by 20 bp
   –  end-trimming helps, to remove low quality bases.
•  Most de novo short read assemblers use a k-mer hashing based approach and de Bruijn graphs.
•  The central challenge of genome assembly is resolving repeat regions.

De novo Assembly Strategies

•  Many, many different algorithms and open source (as well as closed source) software for short read sequence assembly.
•  Choice of tool depends on exactly what you are trying to assemble:
   –  Genome size
   –  Genome complexity
   –  Level of polymorphism
   –  Genome vs. transcriptome
   –  Sequence coverage you have (more is generally better)
   –  Paired-end vs. single end (you should really have paired-end data)
•  E.g.
   –  Velvet (Zerbino and Birney, 2008)
      •  Uses de Bruijn graph algorithm plus error correction
   –  SGA (Simpson and Durbin, 2010)
      •  Uses a string graph: lower memory requirements, but takes longer
   –  SOAPdenovo2 (Luo et al, 2012)
      •  Also uses de Bruijn graphs with error correction

Example of Velvet de novo Assembly

TAGTCGAGGCTTTAGATCCGATGAGGCTTTAGAGACAG

Sequence (7bp reads)

AGTCGAG CTTTAGA CGATGAG CTTTAGA GTCGGG TTAGATC ATGAGGC GAGACAG GAGGCTC ATCCGAT AGGCTTT GAGACAG AGTCGAG TAGATCC ATGAGGC TAGAGAA
TAGTCGA CTTTAGA CCGATGA TTAGAGA CGAGGCT AGATCCG TGAGGCT AGAGACA
TAGTCGA GCTTTAG TCCGATG GCTCTAG TCGACGC GATCCGA GAGGCTT AGAGACA TAGTCGA TTAGATC GATGAGG TTTAGAG
GTCGAGG TCTAGAT ATGAGGC TAGAGAC AGGCTTT ATCCGAT AGGCTTT GAGACAG AGTCGAG TTAGATT ATGAGGC AGAGACA
GGCTTTA TCCGATG TTTAGAG CGAGGCT TAGATCC TGAGGCT GAGACAG AGTCGAG TTTAGATC ATGAGGC TTAGAGA GAGGCTT GATCCGA GAGGCTT GAGACAG

Hashing (k = 4)

Graph Building

TAGTCGAGGCTTTAGATCCGATGAGGCTTTAGAGACAG

[Figure: graph of 4-mer nodes; each node is listed with its observed count]

TAGT (3x)  AGTC (7x)  GTCG (9x)  TCGA (10x)  CGAG (8x)  GAGG (16x)  AGGC (16x)  GGCT (16x)  GCTT (11x)  AGAG (9x)  GAGA (12x)  AGAC (9x)  GACA (8x)  ACAG (5x)
CTTC (1x)  TTCA (2x)  TCAG (2x)  CAGA (1x)
CTTT (8x)  TTTA (8x)  TTAG (12x)  TAGA (16x)
CGAC (1x)  GACG (1x)  ACGC (1x)
TGAG (9x)  ATGA (8x)  GATG (5x)  CGAT (6x)  CCGA (7x)  TCCG (7x)  ATCC (7x)  GATC (8x)  AGAT (8x)
GATT (1x)
AGAA (1x)

Simplification of Linear Stretches

TAGTCGAGGCTTTAGATCCGATGAGGCTTTAGAGACAG

[Figure: the 4-mer graph with its linear stretches merged into single nodes]

Merged nodes: TAGTCGA, CGAG, CGACGC, GCTCTAG, GCTTTAG, GATCCGATGAG, AGAT, AGAA, GATT, GAGGCT, TAGA, AGAGA, AGACAG
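
Graph building and the merging of linear stretches can be sketched in Python. This is a simplified illustration, not Velvet's implementation: nodes are k-mers, an edge joins two k-mers that are observed consecutively in an input sequence, and unbranched chains are merged into contigs. It is run on the error-free example sequence itself rather than on the reads, so there are no tips or bubbles to remove; reverse complements and isolated cycles are ignored.

```python
from collections import defaultdict


def build_graph(seqs, k):
    """Nodes are k-mers; edges join k-mers observed consecutively in a sequence."""
    nodes, succ, pred = set(), defaultdict(set), defaultdict(set)
    for s in seqs:
        kms = [s[i:i + k] for i in range(len(s) - k + 1)]
        nodes.update(kms)
        for a, b in zip(kms, kms[1:]):
            succ[a].add(b)
            pred[b].add(a)
    return nodes, succ, pred


def compact(nodes, succ, pred):
    """Merge unbranched chains of nodes into contigs."""

    def starts_contig(n):
        # A node continues a chain only if it has exactly one predecessor
        # and that predecessor has exactly one successor.
        p = pred.get(n, set())
        if len(p) != 1:
            return True
        (q,) = p
        return len(succ.get(q, set())) != 1

    contigs = []
    for n in sorted(nodes):
        if not starts_contig(n):
            continue
        path, cur = n, n
        while len(succ.get(cur, ())) == 1:
            (nxt,) = succ[cur]
            if len(pred[nxt]) != 1:
                break  # nxt is a merge point and starts its own contig
            path += nxt[-1]
            cur = nxt
        contigs.append(path)
    return contigs


sequence = "TAGTCGAGGCTTTAGATCCGATGAGGCTTTAGAGACAG"
nodes, succ, pred = build_graph([sequence], k=4)
contigs = compact(nodes, succ, pred)
print(contigs)
```

The repeated stretch GAGGCTTTAGA is entered from two places and left in two directions, so the graph cannot be merged into one sequence; this is the repeat-resolution problem in miniature.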

Error (tip and bubble) removal

TAGTCGAGGCTTTAGATCCGATGAGGCTTTAGAGACAG

[Figure: the simplified graph with its tips and a bubble marked]

Nodes before removal: TAGTCGA, CGAG, CGACGC, GCTCTAG, GCTTTAG, GATCCGATGAG, AGAT, AGAA, GATT, GAGGCT, TAGA, AGAGA, AGACAG

Resulting nodes: TAGTCGAG, GAGGCTTTAGA, AGAGACAG, AGATCCGATGAG

De novo Short Read Assembler Performance

Assembler      Num    N50 (kb)    Errors
ABySS          302    29.2        19
ALLPATHS-LG    60     96.7        20
Bambus2        109    50.2        190
MSR-CA         94     59.2        34
SGA            252    4.0         10
SOAPdenovo     107    288.2       65
Velvet         162    48.4        42

Assemblies of S. aureus (genome size 2,872,915). Taken from GAGE paper (Salzberg et al, 2012).
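
N50 is the contig length such that contigs of that length or longer contain at least half of the assembled bases. A minimal Python sketch using hypothetical contig lengths; these are not the GAGE data, and evaluations sometimes compute the statistic against the genome size rather than the assembly size:

```python
def n50(lengths):
    """Length L such that contigs of length >= L hold at least half of the
    total assembled bases."""
    if not lengths:
        raise ValueError("no contigs given")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Hypothetical contig lengths (kb), for illustration only.
contig_lengths = [80, 70, 50, 40, 30, 20]
print("N50:", n50(contig_lengths), "kb")
```

A high N50 alone does not make an assembly good, which is why the table also reports errors.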

Short Read Assembly Limitations

•  Common repeat regions are typically missing/collapsed
   –  Han Chinese genome missing ~420Mbp of repeats
•  Same is true for segmental duplications
   –  Han Chinese genome only contains ~10Mbp of ~150Mbp of segmental duplications.
•  Even for microbial genomes, you typically get very large numbers of contigs, which range in size from very small, to sometimes quite large.
   –  (need reads of ~7kb to completely assemble bacterial genomes)

Recent Short Read Assembler Comparisons

•  Earl et al (2011). Assemblathon 1: A competitive assessment of de novo short read assembly methods. Genome Research 21: 2224-2241.
   –  Used a simulated dataset for all competitors to assemble
•  Salzberg et al (2012). GAGE: A critical evaluation of genome assemblies and assembly algorithms. Genome Research 22(3):557-67.
   –  Applied several assembly algorithms to their own datasets, for several different sized genomes
•  Bradnam et al. (2013). Assemblathon 2: evaluating de novo methods of genome assembly in three vertebrate species. Gigascience 2(1):10.
   –  See http://assemblathon.org/
•  If you have an assembly problem, you should read these papers to gain some insights into strengths and weaknesses of different assemblers
•  Also see: Vezzi, F., Narzisi, G., and Mishra B. (2012). Reevaluating assembly evaluations with feature response curves: GAGE and assemblathons. PLoS One 7(12):e52210.

Improving de novo Assemblies

•  Need to generate additional long range continuity to be able to orient and order contigs
•  Mate pair libraries
•  Hybrid approach using either PacBio or Oxford Nanopore data, plus Illumina data
   –  Though there is a paper on assembling E. coli solely from Oxford Nanopore Data
•  Synthetic long reads (Illumina TruSeq Synthetic Reads (aka Moleculo))
•  CPT-SEQ – discuss on Thursday
   –  Similar to 10X Genomics
•  Hi-C contact maps
   –  Similar to Dovetail Genomics

Mate-pair libraries

•  Goal is to have the equivalent of 2-5kb insert libraries, or even up to 10-12kb.
•  However, Illumina flow cell technology is limited to ~700 bp fragments that can be successfully amplified
   –  Means you have to use some molecular biology to accomplish the equivalent.

[Figure: mate-pair library construction, in order]

•  Genomic DNA
•  Fragment
•  Size Select (up to 12kb)
•  Biotinylate
•  Circularize
•  Fragment (400-600bp)
•  Capture Biotinylated fragments
•  Standard Paired End Illumina Sequencing
•  Incorporate Mate-pair information into assembly

Recommended Reading

Nextera
•  Adey, A., Morrison, H.G., Asan, Xun, X., Kitzman, J.O., Turner, E.H., Stackhouse, B., MacKenzie, A.P., Caruccio, N.C., Zhang, X., Shendure, J. (2010). Rapid, low-input, low-bias construction of shotgun fragment libraries by high-density in vitro transposition. Genome Biol. 11(12):R119.
•  Syed, F., Grunenwald, H., Caruccio, N. (2009). Next-generation sequencing library preparation: simultaneous fragmentation and tagging using in vitro transposition. Nature Methods 6, Applications Note.
•  Caruccio, N. (2011). Preparation of next-generation sequencing libraries using Nextera™ technology: simultaneous DNA fragmentation and adaptor tagging by in vitro transposition. Methods Mol Biol. 733:241-55.
•  Marine, R., Polson, S.W., Ravel, J., Hatfull, G., Russell, D., Sullivan, M., Syed, F., Dumas, M., Wommack, K.E. (2011). Evaluation of a transposase protocol for rapid generation of shotgun high-throughput sequencing libraries from nanogram quantities of DNA. Appl Environ Microbiol. 77(22):8071-9.

UMIs
•  Kivioja, T., Vähärautio, A., Karlsson, K., Bonke, M., Enge, M., Linnarsson, S. and Taipale, J. (2011). Counting absolute numbers of molecules using unique molecular identifiers. Nat Methods. 9(1):72-4.

Adapter Trimming
•  Kong, Y. (2011). Btrim: a fast, lightweight adapter and quality trimming program for next-generation sequencing technologies. Genomics 98(2):152-3.
•  Martin, M. (2011). Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet.journal 17:10-12.
•  Lindgreen, S. (2012). AdapterRemoval: easy cleaning of next-generation sequencing reads. BMC Res Notes 5:337.
•  Jiang, H., Lei, R., Ding, S.W. and Zhu, S. (2014). Skewer: a fast and accurate adapter trimmer for next-generation sequencing paired-end reads. BMC Bioinformatics 15:182.

Recommended Reading

Read Merging
•  Rodrigue, S., Materna, A.C., Timberlake, S.C., Blackburn, M.C., Malmstrom, R.R., Alm, E.J., Chisholm, S.W. (2010). Unlocking short read sequencing for metagenomics. PLoS One 5(7):e11840.
•  Magoč, T. and Salzberg, S.L. (2011). FLASH: fast length adjustment of short reads to improve genome assemblies. Bioinformatics 27(21):2957-63.
•  Masella, A.P., Bartram, A.K., Truszkowski, J.M., Brown, D.G. and Neufeld, J.D. (2012). PANDAseq: paired-end assembler for illumina sequences. BMC Bioinformatics 13:31.
•  Liu, B., Yuan, J., Yiu, S.M., Li, Z., Xie, Y., Chen, Y., Shi, Y., Zhang, H., Li, Y., Lam, T.W. and Luo, R. (2012). COPE: an accurate k-mer-based pair-end reads connection tool to facilitate genome assembly. Bioinformatics 28(22):2870-4.
•  Zhang, J., Kobert, K., Flouri, T. and Stamatakis, A. (2014). PEAR: a fast and accurate Illumina Paired-End reAd mergeR. Bioinformatics 30(5):614-20.
•  Kwon, S., Lee, B. and Yoon, S. (2014). CASPER: context-aware scheme for paired-end reads from high-throughput amplicon sequencing. BMC Bioinformatics 15 Suppl 9:S10.

Error Correction
•  Heo, Y., Wu, X.L., Chen, D., Ma, J. and Hwu, W.M. (2014). BLESS: bloom filter-based error correction solution for high-throughput sequencing reads. Bioinformatics 30(10):1354-62.
•  Lim, E.C., Müller, J., Hagmann, J., Henz, S.R., Kim, S.T. and Weigel, D. (2014). Trowel: a fast and accurate error correction module for Illumina sequencing reads. Bioinformatics 30(22):3264-5.
•  Greenfield, P., Duesing, K., Papanicolaou, A. and Bauer, D.C. (2014). Blue: correcting sequencing errors using consensus and context. Bioinformatics 30(19):2723-32.

File Formats
•  Cock, P.J., Fields, C.J., Goto, N., Heuer, M.L. and Rice, P.M. (2010). The Sanger FASTQ file format for sequences with quality scores, and the Solexa/Illumina FASTQ variants. Nucleic Acids Res. 38(6):1767-71.

Quality
•  Yang, X., Liu, D., Liu, F., Wu, J., Zou, J., Xiao, X., Zhao, F. and Zhu, B. (2013). HTQC: a fast quality control toolkit for Illumina sequencing data. BMC Bioinformatics 14:33.

Recommended Reading

Assemblers
•  Zerbino, D.R. and Birney, E. (2008). Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome Res. 18(5):821-9.
•  Zerbino, D.R., McEwen, G.K., Margulies, E.H. and Birney, E. (2009). Pebble and rock band: heuristic resolution of repeats and scaffolding in the velvet short-read de novo assembler. PLoS One 4(12):e8407.
•  Simpson, J.T. and Durbin, R. (2010). Efficient construction of an assembly string graph using the FM-index. Bioinformatics 26(12):i367-73.
•  Simpson, J.T. and Durbin, R. (2012). Efficient de novo assembly of large genomes using compressed data structures. Genome Research 22(3):549-56. [SGA]

Assembly of Long Reads
•  Loman, N.J., Quick, J., Simpson, J.T. (2015). A complete bacterial genome assembled de novo using only nanopore sequencing data. Nat Methods 12(8):733-5.
•  Stadermann, K.B., Weisshaar, B., Holtgräwe, D. (2015). SMRT sequencing only de novo assembly of the sugar beet (Beta vulgaris) chloroplast genome. BMC Bioinformatics 16:295.
•  Chin, C.S., Alexander, D.H., Marks, P., Klammer, A.A., Drake, J., Heiner, C., Clum, A., Copeland, A., Huddleston, J., Eichler, E.E., Turner, S.W., Korlach, J. (2013). Nonhybrid, finished microbial genome assemblies from long-read SMRT sequencing data. Nat Methods 10(6):563-9.

Moleculo/Illumina TruSeq synthetic reads
•  Voskoboynik, A., Neff, N.F., Sahoo, D., Newman, A.M., Pushkarev, D., Koh, W., Passarelli, B., Fan, H.C., Mantalas, G.L., Palmeri, K.J., Ishizuka, K.J., Gissi, C., Griggio, F., Ben-Shlomo, R., Corey, D.M., Penland, L., White, R.A., Weissman, I.L. and Quake, S.R. (2013). The genome sequence of the colonial chordate, Botryllus schlosseri. Elife 2:e00569.
•  McCoy, R.C., Taylor, R.W., Blauwkamp, T.A., Kelley, J.L., Kertesz, M., Pushkarev, D., Petrov, D.A. and Fiston-Lavier, A.S. (2014). Illumina TruSeq synthetic long-reads empower de novo assembly and resolve complex, highly-repetitive transposable elements. PLoS One 9(9):e106689.
•  Sharon, I., Kertesz, M., Hug, L.A., Pushkarev, D., Blauwkamp, T.A., Castelle, C.J., Amirebrahimi, M., Thomas, B.C., Burstein, D., Tringe, S.G., Williams, K.H., Banfield, J.F. (2015). Accurate, multi-kb reads resolve complex populations and detect rare microorganisms. Genome Res. 25(4):534-43.
•  Kuleshov, V., Jiang, C., Zhou, W., Jahanbani, F., Batzoglou, S., Snyder, M. (2016). Synthetic long-read sequencing reveals intraspecies diversity in the human microbiome. Nat Biotechnol. 34(1):64-9.

Read Clouds
•  Bishara, A., Liu, Y., Weng, Z., Kashef-Haghighi, D., Newburger, D.E., West, R., Sidow, A., Batzoglou, S. (2015). Read clouds uncover variation in complex regions of the human genome. Genome Res. 25(10):1570-80.

Recommended Reading

CPT-SEQ
•  Adey, A., Kitzman, J.O., Burton, J.N., Daza, R., Kumar, A., Christiansen, L., Ronaghi, M., Amini, S., Gunderson, K.L., Steemers, F.J. and Shendure, J. (2014). In vitro, long-range sequence information for de novo genome assembly via transposase contiguity. Genome Res. 24(12):2041-9.
•  Amini, S., Pushkarev, D., Christiansen, L., Kostem, E., Royce, T., Turk, C., Pignatelli, N., Adey, A., Kitzman, J.O., Vijayan, K., Ronaghi, M., Shendure, J., Gunderson, K.L. and Steemers, F.J. (2014). Haplotype-resolved whole-genome sequencing by contiguity-preserving transposition and combinatorial indexing. Nat Genet. 46(12):1343-9.

Hi-C Method
•  Lieberman-Aiden, E., van Berkum, N.L., Williams, L., Imakaev, M., Ragoczy, T., Telling, A., Amit, I., Lajoie, B.R., Sabo, P.J., Dorschner, M.O., Sandstrom, R., Bernstein, B., Bender, M.A., Groudine, M., Gnirke, A., Stamatoyannopoulos, J., Mirny, L.A., Lander, E.S. and Dekker, J. (2009). Comprehensive mapping of long-range interactions reveals folding principles of the human genome. Science 326(5950):289-93.
•  Belton, J.M., McCord, R.P., Gibcus, J.H., Naumova, N., Zhan, Y. and Dekker, J. (2012). Hi-C: a comprehensive technique to capture the conformation of genomes. Methods 58(3):268-76.
•  Rao, S.S., Huntley, M.H., Durand, N.C., Stamenova, E.K., Bochkov, I.D., Robinson, J.T., Sanborn, A.L., Machol, I., Omer, A.D., Lander, E.S., Lieberman-Aiden, E. (2014). A 3D Map of the Human Genome at Kilobase Resolution Reveals Principles of Chromatin Looping. Cell 159(7):1665-80.

Hi-C Assisted Assembly
•  Burton, J.N., Adey, A., Patwardhan, R.P., Qiu, R., Kitzman, J.O. and Shendure, J. (2013). Chromosome-scale scaffolding of de novo genome assemblies based on chromatin interactions. Nat Biotechnol. 31(12):1119-25.
•  Marie-Nelly, H., Marbouty, M., Cournac, A., Liti, G., Fischer, G., Zimmer, C. and Koszul, R. (2014). Filling annotation gaps in yeast genomes using genome-wide contact maps. Bioinformatics 30(15):2105-13.
•  Marie-Nelly, H., Marbouty, M., Cournac, A., Flot, J.F., Liti, G., Parodi, D.P., Syan, S., Guillén, N., Margeot, A., Zimmer, C. and Koszul, R. (2014). High-quality genome (re)assembly using chromosomal contact data. Nat Commun. 5:5695.

Hi-C Assisted Metagenomic Assembly
•  Burton, J.N., Liachko, I., Dunham, M.J. and Shendure, J. (2014). Species-level deconvolution of metagenome assemblies with Hi-C-based contact probability maps. G3 (Bethesda) 4(7):1339-46.
•  Marbouty, M., Cournac, A., Flot, J.F., Marie-Nelly, H., Mozziconacci, J., and Koszul, R. (2014). Metagenomic chromosome conformation capture (meta3C) unveils the diversity of chromosome organization in microorganisms. Elife 3:03318.