Genetics 211 - 2016 Lecture 2
High Throughput Sequencing Gavin Sherlock [email protected] January 12th 2015
Differences in Throughput
Parameter                        Sanger (AB 3730)   Illumina (NextSeq 500)
Read length (bp)                 800                2 x 150
Number of reads per run [days]   96 [<1]            400,000,000 [~1 day]
Throughput                       6 Mb/day           ~120 Gb/day
SNP error rate                   low                high (~0.5%)
Indel error rate                 low                low
Costs                            $500/Mb            <$0.05/Mb
Illumina: Flow Cells with “Molecular Colonies”
• flow cell with randomly spaced molecular clusters
• spacing depends on initial seeding of the single molecules onto the flow cell
[Figure: image of clusters on a flow cell; label: 1µM]
Detection, Chemistry
• Massively Parallel Detection on immobilized “molecular colonies”
• Means you have to measure (image) every cycle, instead of the Sanger model (letting reaction go to completion and then separating products by size)
• Requires specially designed chemistry, using reversible dye-terminators and a polymerase
[Figure: Illumina sequencing workflow. Stages shown: DNA (0.1-1.0 ug); sample preparation; single molecule array; cluster growth; sequencing; image acquisition; base calling (T G C T A C G A T …)]
Illumina Sequencing Technology Robust Reversible Terminator Chemistry Foundation
250+ Million Clusters Per Flow Cell
[Figure: images of clusters on a flow cell; scale bars: 20 Microns and 100 Microns]
Illumina Sequence Visualization
Illumina Sequencing: Reversible Terminators
[Figure: a reversible terminator nucleotide with the fluorophore attached through a cleavage site and the 3’ OH blocked. Cycle shown: incorporate; detection; deblock and cleave off dye, leaving a free 3’ end OH; ready for next cycle.]
Image Processing, Base Calling
• Image processing algorithms find signals in each panel, align signals from different panels, etc.
  – Machines ship with server or small cluster that does image analysis while run is happening
• Sequence data after base calling much reduced in size (tens of gigabytes) => more manageable but still large amounts that add up over time
• Unsustainable to keep image data; people discard the images, and just keep the sequences (fastq format).
Recent Illumina Innovations
• Patterned flow cells (HiSeq 3000/HiSeq 4000 systems)
  – Allows denser cluster spacing
  – Avoids cluster overlap
  – Image analysis easier
• Two-Channel SBS (NextSeq)
  – Two, rather than 4 colors
  – Leads to faster sequencing times
• Synthetic Long Reads
  – We’ll discuss later in lecture
• Coming soon
  – MiniSeq
  – Project Firefly
Pacific Biosciences
• Single Molecule Real Time DNA Sequencing
• Read lengths now averaging ~10-15kb, max >40kb
• Strobe sequencing
• Observation of DNA modifications
• Throughput per run is low (~1 million reads on PacBio SEQUEL machine), but run time is short (30 mins to 6 hours)
• Error rate is high, though hybrid approaches can significantly improve assemblies generated by short reads alone.
Oxford Nanopore
• MinION and GridION products
• Not yet on market, but in early release
• DNA “sequenced” as it is dragged through a nanopore
• Very high error rate (5-40%)
• Reads from 5-50kb (as long as 100kb?)
• Some data published, more likely on the way; keep an eye on the AGBT meeting next month
• This year might actually be the year that nanopore breaks through
Applications of Next-Gen Sequencing
• transcript identity
• transcript abundance
• RNA editing
• SNPs
• Allele specific expression
• Regulation
• Nucleosome positioning
• 3D genome architecture
• Active promoters
• interactions between nucleic acids and proteins
• chromatin modifications
• genome variability
• metagenomics
• genome modifications
• detection of mutations
• association studies
• phylogeny
• evolution
[Figure: applications grouped under genome, chromatin, and transcriptome]
• de novo sequencing: assembly, annotation
• resequencing: mapping, detection of variants
• ATAC-Seq: mapping, identify open chromatin
• ChIP-Seq: mapping, detection of binding sites
• RNA-Seq: mapping, transcript detection and quantification
• Hi-C: mapping, 3D reconstruction
How do we make an Illumina Genomic DNA library?
[Figure: library preparation workflow, in order]
• Double-stranded genomic DNA
• Fragment (Covaris)
• Polish, add dA overhang
• Add adaptors, size select
• Sequence
Making fragments asymmetric
5'-pNNNN.........NNNNA-3'
3'-ANNNN.........NNNNp-5'
Fragmented, end polished, phosphorylated, dA overhang DNA sample
Genomic Y-adapter
5'-ACACTCTTTCCCTACACGAC
GCTCTTCCGATCT-3'
3'-GTTCGTCTTCTGCCGTATGCTCGAGAAGGCTAGp-5'
Ligate
5'-ACACTCTTTCCCTACACGAC
GCTCTTCCGATCTNNNN.........NNNNAGATCGGAAGAGCTCGTATGCCGTCTTCTGCTTG-3'
3'-GTTCGTCTTCTGCCGTATGCTCGAGAAGGCTAGANNNN.........NNNNTCTAGCCTTCTCG
CAGCACATCCCTTTCTCACA-5'
[Ligation product is gel purified, selecting only those products in a certain size range]
Making our genomic DNA library asymmetric
Round 1 of PCR
5'-ACACTCTTTCCCTACACGAC
GCTCTTCCGATCTNNNN.........NNNNAGATCGGAAGAGCTCGTATGCCGTCTTCTGCTTG-3'
3'-GTTCGTCTTCTGCCGTATGCTCGAGAAGGCTAGANNNN.........NNNNTCTAGCCTTCTCG
CAGCACATCCCTTTCTCACA-5'
TCTAGCCTTCTCGAGCATACGGCAGAAGACGAAC-5'
5'-CAAGCAGAAGACGGCATACGAGCTCTTCCGATCT
Products of first round:
5'-ACACTCTTTCCCTACACGACGCTCTTCCGATCTNNNN.........NNNNAGATCGGAAGAGCTCGTATGCCGTCTTCTGCTTG-3'
3'-TGTGAGAAAGGGATGTGCTGCGAGAAGGCTAGANNNN.........NNNNTCTAGCCTTCTCGAGCATACGGCAGAAGACGAAC-5'

5'-CAAGCAGAAGACGGCATACGAGCTCTTCCGATCTNNNN.........NNNNAGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT-3'
3'-GTTCGTCTTCTGCCGTATGCTCGAGAAGGCTAGANNNN.........NNNNTCTAGCCTTCTCGCAGCACATCCCTTTCTCACA-5'
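The primer annealing shown for round 1 can be checked with a few lines of Python. This is a minimal sketch: the function name `reverse_complement` and the variable names are illustrative choices, not part of the lecture, and the two sequences are copied from the slide with the unknown insert (NNNN...) left out. It tests whether the PCR primer is the reverse complement of the 3' adapter tail of the ligated top strand.

```python
# Complement table for DNA; N (unknown base) is mapped to N.
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence written 5'->3'."""
    return seq.translate(COMPLEMENT)[::-1]


# 3' adapter tail of the ligated top strand (copied from the slide, 5'->3').
top_strand_tail = "AGATCGGAAGAGCTCGTATGCCGTCTTCTGCTTG"
# PCR primer used in round 1 (copied from the slide, 5'->3').
pcr_primer = "CAAGCAGAAGACGGCATACGAGCTCTTCCGATCT"

# True if the primer can base-pair along the whole adapter tail.
primer_matches = reverse_complement(pcr_primer) == top_strand_tail
print(primer_matches)
```

The same helper can be pointed at any of the other primer and adapter pairs on these slides to see which strand each primer anneals to.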
Finishing and Sequencing the Library
5'-ACACTCTTTCCCTACACGACGCTCTTCCGATCTNNNN.........NNNNAGATCGGAAGAGCTCGTATGCCGTCTTCTGCTTG-3'
3'-TGTGAGAAAGGGATGTGCTGCGAGAAGGCTAGANNNN.........NNNNTCTAGCCTTCTCGAGCATACGGCAGAAGACGAAC-5'
TCTAGCCTTCTCGAGCATACGGCAGAAGACGAAC-5'
5’-AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT
Rounds 2-18
Product of PCR amplification
5'-AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCTNNNN.........NNNNAGATCGGAAGAGCTCGTATGCCGTCTTCTGCTTG-3'
3'-TTACTATGCCGCTGGTGGCTCTAGATGTGAGAAAGGGATGTGCTGCGAGAAGGCTAGANNNN.........NNNNTCTAGCCTTCTCGAGCATACGGCAGAAGACGAAC-5'
[Anneal to flow cell. Perform cluster generation]
Genomic DNA Sequencing Primer
5'-AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCTNNNN.........NNNNAGATCGGAAGAGCTCGTATGCCGTCTTCTGCTTG-3'
3'-TTACTATGCCGCTGGTGGCTCTAGATGTGAGAAAGGGATGTGCTGCGAGAAGGCTAGANNNN.........NNNNTCTAGCCTTCTCGAGCATACGGCAGAAGACGAAC-5'
5'-ACACTCTTTCCCTACACGACGCTCTTCCGATCT
Nextera Library Preparation
[Figure: transposomes loaded with Adaptor 1 and Adaptor 2 are combined with genomic DNA; steps shown: tagmentation, then (suppression) PCR with Primer 1 and Primer 2]
How Much Sequence?
• HiSeq 2500 can give ~250 million reads/lane of paired end 100bp reads
• This is 50Gb of sequence
• This is ~4,000x coverage of yeast (12Mb).
• This is an obvious waste of resources (it’s also ~500x C. elegans, and ~500x D. melanogaster)
• How can we sequence on a HiSeq and not waste all these resources when sequencing smaller genomes?
• HiSeq 3000/HiSeq 4000
  – Patterned flow cells (not random clusters)
  – Almost twice as much data, half the time
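The arithmetic behind these numbers can be written out as a short Python sketch. The function name `coverage` is illustrative; the lane yield, read length and yeast genome size are the figures quoted on the slide, and "250 million reads" is taken here to mean 250 million read pairs, which is the interpretation that reproduces the 50Gb total.

```python
def coverage(read_pairs: int, read_length: int, genome_size: int) -> float:
    """Mean fold coverage from paired-end reads (two reads per pair)."""
    total_bases = read_pairs * 2 * read_length
    return total_bases / genome_size


lane_pairs = 250_000_000    # ~250 million paired-end reads per lane (slide)
read_len = 100              # 100 bp reads (slide)
yeast_genome = 12_000_000   # 12 Mb (slide)

total_bases = lane_pairs * 2 * read_len
yeast_fold = coverage(lane_pairs, read_len, yeast_genome)

print(total_bases)          # bases per lane
print(round(yeast_fold))    # fold coverage of the yeast genome
```

Dividing the fold coverage by the coverage actually needed per sample gives a rough idea of how many samples could share one lane, which motivates the multiplexing approaches that follow.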
Multiplexed Sequencing, using Barcodes
• Two ways to perform barcode sequencing
  – In-line barcodes
    • Barcode is read as part of the normal sequencing read
  – Index barcodes
    • Barcode is read as a third, short sequencing run (also known as index reads)
• Can be used to run multiple samples from any particular origin on the same lane of a HiSeq, with the barcodes allowing the samples to be de-convoluted afterwards.
• Barcodes should be designed so that they are balanced in GC content, and as dissimilar as possible. (Hamming distance > 2).
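The two design criteria in the last bullet can be checked programmatically. The following Python sketch uses a hypothetical set of four 6 bp barcodes (not from the lecture) and illustrative function names; it reports the smallest pairwise Hamming distance in the set and the GC fraction of each barcode.

```python
from itertools import combinations


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length barcodes differ."""
    if len(a) != len(b):
        raise ValueError("barcodes must have the same length")
    return sum(x != y for x, y in zip(a, b))


def gc_fraction(seq: str) -> float:
    """Fraction of bases in the barcode that are G or C."""
    return (seq.count("G") + seq.count("C")) / len(seq)


def min_pairwise_distance(barcodes: list[str]) -> int:
    """Smallest Hamming distance between any two barcodes in the set."""
    return min(hamming(a, b) for a, b in combinations(barcodes, 2))


# Hypothetical barcode set, for illustration only.
barcodes = ["ACGTAC", "TGCATG", "CATGCA", "GTACGT"]

min_dist = min_pairwise_distance(barcodes)
gc_values = [gc_fraction(b) for b in barcodes]

print(min_dist)     # should exceed 2 for a usable set
print(gc_values)    # should be similar across barcodes
```

A set that fails the distance check would allow a single sequencing error to turn one barcode into something close to another, so samples could be misassigned.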
In-line Barcode Sequencing
Index barcoding
Unique Molecular Identifiers (UMIs)
• During the PCR step, each template gets amplified many times
• If your library is of insufficient complexity, or you overamplify, you may have PCR duplicates
• You want to make independent observations, not redundant observations
• When sequencing to high coverage, you may have identical, but non-redundant observations.
• Want to be able to distinguish these.
Unique Molecular Identifiers (UMIs) using Random Barcodes
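The idea of separating PCR duplicates from identical but independent observations can be sketched in Python. The read tuples, names and UMI sequences below are hypothetical; the sketch assumes reads have already been aligned and that UMIs are compared by exact match, which ignores sequencing errors within the UMI itself.

```python
from collections import defaultdict

# Hypothetical aligned reads as (chromosome, start position, UMI) tuples.
reads = [
    ("chrI", 1000, "ACGTTG"),
    ("chrI", 1000, "ACGTTG"),  # same position and UMI: PCR duplicate
    ("chrI", 1000, "TTGCAA"),  # same position, different UMI: independent molecule
    ("chrI", 2500, "ACGTTG"),  # different position: independent molecule
]


def count_molecules(reads):
    """Count how often each (chromosome, position, UMI) combination was seen."""
    seen = defaultdict(int)
    for chrom, pos, umi in reads:
        seen[(chrom, pos, umi)] += 1
    return seen


molecules = count_molecules(reads)
print(len(reads), "reads,", len(molecules), "independent observations")
```

Without the UMI, the first three reads would all look like duplicates of one another; with it, only the second read is treated as redundant.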
Longer, and/or more Accurate Reads
[Figure: distribution of insert sizes; x-axis: Insert Size, 0 to 500]
• Insert sizes are a distribution
• Some inserts not necessarily longer than twice the read length
• What does this mean for paired end reads?
[Figure: paired reads on an insert]
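When the insert is shorter than twice the read length, the two reads of a pair overlap and can be merged into one longer read. The Python sketch below illustrates the principle only: `merge_pair` and `min_overlap` are hypothetical names, the overlap must match exactly, and quality scores and adapter read-through (inserts shorter than one read) are ignored. The read-merging tools listed in the Recommended Reading handle these cases properly.

```python
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence written 5'->3'."""
    return seq.translate(COMPLEMENT)[::-1]


def merge_pair(read1: str, read2: str, min_overlap: int = 10):
    """Merge a read pair whose ends overlap exactly; return None otherwise."""
    # Read 2 comes from the opposite strand, so flip it onto read 1's strand.
    r2 = reverse_complement(read2)
    # Try the longest possible overlap first.
    for olap in range(min(len(read1), len(r2)), min_overlap - 1, -1):
        if read1[-olap:] == r2[:olap]:
            return read1 + r2[olap:]
    return None


# A 30 bp insert sequenced with 20 bp paired-end reads (10 bp overlap).
fragment = "TAGTCGAGGCTTTAGATCCGATGAGGCTTT"
read1 = fragment[:20]
read2 = reverse_complement(fragment[-20:])

merged = merge_pair(read1, read2)
print(merged == fragment)
```

In the overlapping region each base has been observed twice, which is why merged reads can be both longer and more accurate than either read alone.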
Read Error Correction
• Many approaches, and lots of available tools
• Most rely on the idea of looking for rare k-mers:
  – Build up a table of all k-mers, and their frequencies.
  – Consecutive k-mers that cover an error in a read should be at lower frequency, given sufficient coverage
  – Can use this to recognize errors in reads and correct them
  – If done without deference to quality scores, assumes homogeneous sample
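The rare k-mer idea can be illustrated with a small Python sketch. The reads, the threshold `min_count` and the function names are hypothetical, and the sketch only detects the suspicious region; actual error-correction tools go on to choose a replacement base. Ten copies of a correct read and one copy with a single substitution are used to build the k-mer table.

```python
from collections import Counter


def kmers(seq: str, k: int) -> list[str]:
    """All overlapping k-mers of a sequence, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def kmer_table(reads: list[str], k: int) -> Counter:
    """Frequency table of every k-mer across all reads."""
    table = Counter()
    for read in reads:
        table.update(kmers(read, k))
    return table


def weak_positions(read: str, table: Counter, k: int, min_count: int) -> list[int]:
    """Start positions of k-mers in the read seen fewer than min_count times."""
    return [i for i, km in enumerate(kmers(read, k)) if table[km] < min_count]


correct = "TAGTCGAGGCTTTAGATCCG"
erroneous = "TAGTCGAGGCATTAGATCCG"   # substitution at position 10 (0-based)
reads = [correct] * 10 + [erroneous]

table = kmer_table(reads, k=4)
weak = weak_positions(erroneous, table, k=4, min_count=2)
print(weak)   # consecutive rare k-mers; they all cover the erroneous base
```

The run of k consecutive rare k-mers is the signature of a single-base error: the only position shared by all of them is the substituted base.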
What are the data?
• Illumina produces data in fastq format.
@SEQ_ID
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65

Line 1: ‘@’ followed by a sequence Identifier
Line 2: The sequence
Line 3: ‘+’, optionally followed by a sequence Identifier
Line 4: The quality scores
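A minimal Python sketch of reading such a record is shown here. It assumes the simple four-lines-per-record layout and Phred+33 quality encoding (the Sanger FASTQ variant); some older Illumina pipelines used a different offset, so the offset is an assumption that must match the data. The function name `read_fastq` is illustrative.

```python
import io


def read_fastq(handle):
    """Yield (identifier, sequence, quality scores) for 4-line FASTQ records."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            break
        seq = handle.readline().rstrip("\n")
        handle.readline()  # the '+' separator line
        qual = handle.readline().rstrip("\n")
        # Assumption: Phred+33 encoding, so '!' corresponds to quality 0.
        scores = [ord(c) - 33 for c in qual]
        yield header[1:], seq, scores


record = (
    "@SEQ_ID\n"
    "GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT\n"
    "+\n"
    "!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65\n"
)

name, seq, scores = next(read_fastq(io.StringIO(record)))
print(name, len(seq), scores[:5])
```

Each quality character maps to one base, so the sequence and score lists always have the same length in a well-formed record.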
Example of Illumina SeqID
@HWUSI-EAS100R:6:73:941:1973#0/1
HWUSI-EAS100R   The unique instrument name
6               Flowcell lane
73              Tile number within the flowcell
941             'x'-coordinate of the cluster within the tile
1973            'y'-coordinate of the cluster within the tile
#0              index number for a multiplexed sample (0 for no indexing)
/1              the member of a pair, /1 or /2 (paired-end or mate-pair reads only)
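The identifier layout above can be pulled apart with a regular expression. This Python sketch covers only the format shown on this slide; the pattern, the field names and the function `parse_seq_id` are illustrative, and newer Illumina software writes identifiers in a different layout that this pattern would reject.

```python
import re

# Pattern for identifiers of the form @instrument:lane:tile:x:y#index/member
SEQ_ID = re.compile(
    r"^@(?P<instrument>[^:]+):(?P<lane>\d+):(?P<tile>\d+):(?P<x>\d+):(?P<y>\d+)"
    r"#(?P<index>[^/]+)/(?P<member>[12])$"
)


def parse_seq_id(line: str) -> dict:
    """Split an identifier line into named fields; numeric fields become ints."""
    match = SEQ_ID.match(line)
    if match is None:
        raise ValueError(f"unrecognised identifier: {line!r}")
    fields = match.groupdict()
    for key in ("lane", "tile", "x", "y", "member"):
        fields[key] = int(fields[key])
    return fields


fields = parse_seq_id("@HWUSI-EAS100R:6:73:941:1973#0/1")
print(fields)
```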
Assessing Quality
FastQC
http://www.bioinformatics.bbsrc.ac.uk/projects/fastqc/
HTQC
https://sourceforge.net/projects/htqc
De novo Assembly of Short Reads
• Several methods available
• Short reads require long overlaps
  – e.g., 33 bp reads must overlap by 20 bp
  – end-trimming helps, to remove low quality bases.
• Most de novo short read assemblers use a k-mer hashing based approach and de Bruijn graphs.
• The central challenge of genome assembly is resolving repeat regions.
De novo Assembly Strategies
• Many, many different algorithms and open source (as well as closed source) software for short read sequence assembly.
• Choice of tool depends on exactly what you are trying to assemble:
  – Genome size
  – Genome complexity
  – Level of polymorphism
  – Genome vs. transcriptome
  – Sequence coverage you have (more is generally better)
  – Paired-end vs. single end (you should really have paired-end data)
• E.g.
  – Velvet (Zerbino and Birney, 2008)
    • Uses de Bruijn graph algorithm plus error correction
  – SGA (Simpson and Durbin, 2010)
    • Uses a string graph; lower memory requirements, but takes longer
  – SOAPdenovo2 (Li et al, 2012)
    • Also uses de Bruijn graphs with error correction
Example of Velvet de novo Assembly
TAGTCGAGGCTTTAGATCCGATGAGGCTTTAGAGACAG
AGTCGAG CTTTAGA CGATGAG CTTTAGA GTCGGG TTAGATC ATGAGGC GAGACAG GAGGCTC ATCCGAT AGGCTTT GAGACAG AGTCGAG TAGATCC ATGAGGC TAGAGAA
TAGTCGA CTTTAGA CCGATGA TTAGAGA CGAGGCT AGATCCG TGAGGCT AGAGACA
TAGTCGA GCTTTAG TCCGATG GCTCTAG TCGACGC GATCCGA GAGGCTT AGAGACA TAGTCGA TTAGATC GATGAGG TTTAGAG
GTCGAGG TCTAGAT ATGAGGC TAGAGAC AGGCTTT ATCCGAT AGGCTTT GAGACAG AGTCGAG TTAGATT ATGAGGC AGAGACA
GGCTTTA TCCGATG TTTAGAG CGAGGCT TAGATCC TGAGGCT GAGACAG AGTCGAG TTTAGATC ATGAGGC TTAGAGA GAGGCTT GATCCGA GAGGCTT GAGACAG
Sequence (7bp reads)
Hashing (k = 4)
k-mers and their observed counts:
TAGT (3x), AGTC (7x), GTCG (9x), TCGA (10x), CGAG (8x), GAGG (16x), AGGC (16x), GGCT (16x), GCTT (11x), AGAG (9x), GAGA (12x), AGAC (9x), GACA (8x), ACAG (5x)
CTTC (1x), TTCA (2x), TCAG (2x), CAGA (1x)
CTTT (8x), TTTA (8x), TTAG (12x), TAGA (16x)
CGAC (1x), GACG (1x), ACGC (1x)
TGAG (9x), ATGA (8x), GATG (5x), CGAT (6x), CCGA (7x), TCCG (7x), ATCC (7x), GATC (8x), AGAT (8x)
GATT (1x)
AGAA (1x)
Graph Building
[Figure: graph linking the k-mers listed under Hashing, drawn beneath the sequence TAGTCGAGGCTTTAGATCCGATGAGGCTTTAGAGACAG]
Simplification of Linear Stretches
[Figure: graph after linear stretches are merged; node labels: TAGTCGA, CGAG, CGACGC, GCTCTAG, GCTTTAG, GAGGCT, TAGA, AGAT, GATCCGATGAG, AGAGA, AGACAG, AGAA, GATT]
Error (tip and bubble) removal
[Figure: the merged graph with Tips and a Bubble marked, and the graph after their removal; node labels: TAGTCGAG, GAGGCTTTAGA, AGATCCGATGAG, AGAGACAG]
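The hashing and graph-building steps can be sketched in Python as follows. This is a simplified illustration, not Velvet's implementation: it assumes error-free 7 bp reads, one starting at every position of the example sequence, and it links two k-mers only when they are adjacent within a read. The function name `build_graph` is illustrative.

```python
from collections import Counter, defaultdict


def build_graph(reads: list[str], k: int):
    """Count k-mers and record which k-mer follows which within the reads."""
    counts = Counter()
    edges = defaultdict(set)
    for read in reads:
        kms = [read[i:i + k] for i in range(len(read) - k + 1)]
        counts.update(kms)
        # Consecutive k-mers in a read overlap by k-1 bases: add an edge.
        for a, b in zip(kms, kms[1:]):
            edges[a].add(b)
    return counts, edges


genome = "TAGTCGAGGCTTTAGATCCGATGAGGCTTTAGAGACAG"
# Assumption: error-free 7 bp reads tiled at every start position.
reads = [genome[i:i + 7] for i in range(len(genome) - 6)]

counts, edges = build_graph(reads, k=4)

# Nodes with more than one successor are where the graph branches.
branching = {km: sorted(nxt) for km, nxt in edges.items() if len(nxt) > 1}
print(branching)
```

With error-free reads the only branch point is the end of the repeated stretch GAGGCTTTAGA; sequencing errors add the low-count tips and bubbles that the error removal step prunes.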
De novo Short Read Assembler Performance
Assembler      Num   N50 (kb)   Errors
ABySS          302   29.2       19
ALLPATHS-LG    60    96.7       20
Bambus2        109   50.2       190
MSR-CA         94    59.2       34
SGA            252   4.0        10
SOAPdenovo     107   288.2      65
Velvet         162   48.4       42
Assemblies of S. aureus (genome size 2,872,915). Taken from GAGE paper (Salzberg et al, 2012).
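For readers unfamiliar with the N50 statistic used in this comparison, here is a small Python sketch under the common definition: the contig length such that contigs of that length or longer contain at least half of the assembled bases. The contig lengths are hypothetical, and published comparisons sometimes compute the statistic relative to the known genome size instead of the assembly size, so values are not always directly comparable.

```python
def n50(contig_lengths: list[int]) -> int:
    """Length L such that contigs of length >= L hold at least half the bases."""
    if not contig_lengths:
        raise ValueError("no contigs given")
    total = sum(contig_lengths)
    running = 0
    # Walk from the longest contig down until half the assembly is covered.
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length


# Hypothetical contig lengths in kb, for illustration only.
lengths = [80, 70, 50, 40, 30, 20, 10]
assembly_n50 = n50(lengths)
print(assembly_n50)
```

A high N50 on its own does not indicate a good assembly: contigs that are wrongly joined raise N50 while adding errors, which is why the comparison reports both.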
Short Read Assembly Limitations
• Common repeat regions are typically missing/collapsed
  – Han Chinese genome missing ~420Mbp of repeats
• Same is true for segmental duplications
  – Han Chinese genome only contains ~10Mbp of ~150Mbp of segmental duplications.
• Even for microbial genomes, you typically get very large numbers of contigs, which range in size from very small, to sometimes quite large.
  – (need reads of ~7kb to completely assemble bacterial genomes)
Recent Short Read Assembler Comparisons
• Earl et al (2011). Assemblathon 1: A competitive assessment of de novo short read assembly methods. Genome Research 21: 2224-2241.
  – Used a simulated dataset for all competitors to assemble
• Salzberg et al (2012). GAGE: A critical evaluation of genome assemblies and assembly algorithms. Genome Research 22(3):557-67.
  – Applied several assembly algorithms to their own datasets, for several different sized genomes
• Bradnam et al. (2013). Assemblathon 2: evaluating de novo methods of genome assembly in three vertebrate species. Gigascience 2(1):10.
  – See http://assemblathon.org/
• If you have an assembly problem, you should read these papers to gain some insights into strengths and weaknesses of different assemblers
• Also see: Vezzi, F., Narzisi, G., and Mishra B. (2012). Reevaluating assembly evaluations with feature response curves: GAGE and assemblathons. PLoS One 7(12):e52210.
Improving de novo Assemblies
• Need to generate additional long range continuity to be able to orient and order contigs
• Mate pair libraries
• Hybrid approach using either PacBio or Oxford Nanopore data, plus Illumina data
  – Though there is a paper on assembling E. coli solely from Oxford Nanopore Data
• Synthetic long reads (Illumina Tru-Seq Synthetic Reads (aka Moleculo))
• CPT-SEQ – discuss on Thursday
  – Similar to 10X Genomics
• Hi-C contact maps
  – Similar to Dovetail Genomics
Mate-pair libraries
• Goal is to have the equivalent of 2-5kb insert libraries, or even up to 10-12kb.
• However, Illumina flow cell technology is limited to ~700 bp fragments that can be successfully amplified
  – Means you have to use some molecular biology to accomplish the equivalent.
[Figure: mate-pair library workflow, in order]
• Genomic DNA
• Fragment
• Size Select (up to 12kb)
• Biotinylate
• Circularize
• Fragment (400-600bp)
• Capture Biotinylated fragments
• Standard Paired End Illumina Sequencing
• Incorporate Mate-pair information into assembly
Recommended Reading
Nextera
• Adey, A., Morrison, H.G., Asan, Xun, X., Kitzman, J.O., Turner, E.H., Stackhouse, B., MacKenzie, A.P., Caruccio, N.C., Zhang, X., Shendure, J. (2010). Rapid, low-input, low-bias construction of shotgun fragment libraries by high-density in vitro transposition. Genome Biol. 11(12):R119.
• Syed, F., Grunenwald, H., Caruccio, N. (2009). Next-generation sequencing library preparation: simultaneous fragmentation and tagging using in vitro transposition. Nature Methods 6, Applications Note.
• Caruccio, N. (2011). Preparation of next-generation sequencing libraries using Nextera™ technology: simultaneous DNA fragmentation and adaptor tagging by in vitro transposition. Methods Mol Biol. 733:241-55.
• Marine, R., Polson, S.W., Ravel, J., Hatfull, G., Russell, D., Sullivan, M., Syed, F., Dumas, M., Wommack, K.E. (2011). Evaluation of a transposase protocol for rapid generation of shotgun high-throughput sequencing libraries from nanogram quantities of DNA. Appl Environ Microbiol. 77(22):8071-9.
UMIs
• Kivioja, T., Vähärautio, A., Karlsson, K., Bonke, M., Enge, M., Linnarsson, S. and Taipale, J. (2011). Counting absolute numbers of molecules using unique molecular identifiers. Nat Methods. 9(1):72-4.
Adapter Trimming
• Kong, Y. (2011). Btrim: a fast, lightweight adapter and quality trimming program for next-generation sequencing technologies. Genomics 98(2):152-3.
• Martin, M. (2011). Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet.journal 17:10-12.
• Lindgreen, S. (2012). AdapterRemoval: easy cleaning of next-generation sequencing reads. BMC Res Notes 5:337.
• Jiang, H., Lei, R., Ding, S.W. and Zhu, S. (2014). Skewer: a fast and accurate adapter trimmer for next-generation sequencing paired-end reads. BMC Bioinformatics 15:182.
Recommended Reading
Read Merging
• Rodrigue, S., Materna, A.C., Timberlake, S.C., Blackburn, M.C., Malmstrom, R.R., Alm, E.J., Chisholm, S.W. (2010). Unlocking short read sequencing for metagenomics. PLoS One 5(7):e11840.
• Magoč, T. and Salzberg, S.L. (2011). FLASH: fast length adjustment of short reads to improve genome assemblies. Bioinformatics 27(21):2957-63.
• Masella, A.P., Bartram, A.K., Truszkowski, J.M., Brown, D.G. and Neufeld, J.D. (2012). PANDAseq: paired-end assembler for illumina sequences. BMC Bioinformatics 13:31.
• Liu, B., Yuan, J., Yiu, S.M., Li, Z., Xie, Y., Chen, Y., Shi, Y., Zhang, H., Li, Y., Lam, T.W. and Luo, R. (2012). COPE: an accurate k-mer-based pair-end reads connection tool to facilitate genome assembly. Bioinformatics 28(22):2870-4.
• Zhang, J., Kobert, K., Flouri, T. and Stamatakis, A. (2014). PEAR: a fast and accurate Illumina Paired-End reAd mergeR. Bioinformatics 30(5):614-20.
• Kwon, S., Lee, B. and Yoon, S. (2014). CASPER: context-aware scheme for paired-end reads from high-throughput amplicon sequencing. BMC Bioinformatics 15 Suppl 9:S10.
Error Correction
• Heo, Y., Wu, X.L., Chen, D., Ma, J. and Hwu, W.M. (2014). BLESS: bloom filter-based error correction solution for high-throughput sequencing reads. Bioinformatics 30(10):1354-62.
• Lim, E.C., Müller, J., Hagmann, J., Henz, S.R., Kim, S.T. and Weigel, D. (2014). Trowel: a fast and accurate error correction module for Illumina sequencing reads. Bioinformatics 30(22):3264-5.
• Greenfield, P., Duesing, K., Papanicolaou, A. and Bauer, D.C. (2014). Blue: correcting sequencing errors using consensus and context. Bioinformatics 30(19):2723-32.
File Formats
• Cock, P.J., Fields, C.J., Goto, N., Heuer, M.L. and Rice, P.M. (2010). The Sanger FASTQ file format for sequences with quality scores, and the Solexa/Illumina FASTQ variants. Nucleic Acids Res. 38(6):1767-71.
Quality
• Yang, X., Liu, D., Liu, F., Wu, J., Zou, J., Xiao, X., Zhao, F. and Zhu, B. (2013). HTQC: a fast quality control toolkit for Illumina sequencing data. BMC Bioinformatics 14:33.
Recommended Reading
Assemblers
• Zerbino, D.R. and Birney, E. (2008). Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome Res. 18(5):821-9.
• Zerbino, D.R., McEwen, G.K., Margulies, E.H. and Birney, E. (2009). Pebble and rock band: heuristic resolution of repeats and scaffolding in the velvet short-read de novo assembler. PLoS One 4(12):e8407.
• Simpson, J.T. and Durbin, R. (2010). Efficient construction of an assembly string graph using the FM-index. Bioinformatics 26(12):i367-73.
• Simpson, J.T. and Durbin, R. (2012). Efficient de novo assembly of large genomes using compressed data structures. Genome Research 22(3):549-56. SGA
Assembly of Long Reads
• Loman, N.J., Quick, J., Simpson, J.T. (2015). A complete bacterial genome assembled de novo using only nanopore sequencing data. Nat Methods 12(8):733-5.
• Stadermann, K.B., Weisshaar, B., Holtgräwe, D. (2015). SMRT sequencing only de novo assembly of the sugar beet (Beta vulgaris) chloroplast genome. BMC Bioinformatics 16:295.
• Chin, C.S., Alexander, D.H., Marks, P., Klammer, A.A., Drake, J., Heiner, C., Clum, A., Copeland, A., Huddleston, J., Eichler, E.E., Turner, S.W., Korlach, J. (2013). Nonhybrid, finished microbial genome assemblies from long-read SMRT sequencing data. Nat Methods 10(6):563-9.
Moleculo/Illumina TruSeq synthetic reads
• Voskoboynik, A., Neff, N.F., Sahoo, D., Newman, A.M., Pushkarev, D., Koh, W., Passarelli, B., Fan, H.C., Mantalas, G.L., Palmeri, K.J., Ishizuka, K.J., Gissi, C., Griggio, F., Ben-Shlomo, R., Corey, D.M., Penland, L., White, R.A., Weissman, I.L. and Quake, S.R. (2013). The genome sequence of the colonial chordate, Botryllus schlosseri. Elife 2:e00569.
• McCoy, R.C., Taylor, R.W., Blauwkamp, T.A., Kelley, J.L., Kertesz, M., Pushkarev, D., Petrov, D.A. and Fiston-Lavier, A.S. (2014). Illumina TruSeq synthetic long-reads empower de novo assembly and resolve complex, highly-repetitive transposable elements. PLoS One 9(9):e106689.
• Sharon, I., Kertesz, M., Hug, L.A., Pushkarev, D., Blauwkamp, T.A., Castelle, C.J., Amirebrahimi, M., Thomas, B.C., Burstein, D., Tringe, S.G., Williams, K.H., Banfield, J.F. (2015). Accurate, multi-kb reads resolve complex populations and detect rare microorganisms. Genome Res. 25(4):534-43.
• Kuleshov, V., Jiang, C., Zhou, W., Jahanbani, F., Batzoglou, S., Snyder, M. (2016). Synthetic long-read sequencing reveals intraspecies diversity in the human microbiome. Nat Biotechnol. 34(1):64-9.
Read Clouds
• Bishara, A., Liu, Y., Weng, Z., Kashef-Haghighi, D., Newburger, D.E., West, R., Sidow, A., Batzoglou, S. (2015). Read clouds uncover variation in complex regions of the human genome. Genome Res. 25(10):1570-80.
Recommended Reading
CPT-SEQ
• Adey, A., Kitzman, J.O., Burton, J.N., Daza, R., Kumar, A., Christiansen, L., Ronaghi, M., Amini, S., Gunderson, K.L., Steemers, F.J. and Shendure, J. (2014). In vitro, long-range sequence information for de novo genome assembly via transposase contiguity. Genome Res. 24(12):2041-9.
• Amini, S., Pushkarev, D., Christiansen, L., Kostem, E., Royce, T., Turk, C., Pignatelli, N., Adey, A., Kitzman, J.O., Vijayan, K., Ronaghi, M., Shendure, J., Gunderson, K.L. and Steemers FJ. (2014). Haplotype-resolved whole-genome sequencing by contiguity-preserving transposition and combinatorial indexing. Nat Genet. 46(12):1343-9.
Hi-C Method
• Lieberman-Aiden, E., van Berkum, N.L., Williams, L., Imakaev, M., Ragoczy, T., Telling, A., Amit, I., Lajoie, B.R., Sabo, P.J., Dorschner, M.O., Sandstrom, R., Bernstein, B., Bender, M.A., Groudine, M., Gnirke, A., Stamatoyannopoulos, J., Mirny, L.A., Lander, E.S. and Dekker, J. (2009). Comprehensive mapping of long-range interactions reveals folding principles of the human genome. Science 326(5950):289-93.
• Belton, J.M., McCord, R.P., Gibcus, J.H., Naumova, N., Zhan, Y and Dekker, J. (2012). Hi-C: a comprehensive technique to capture the conformation of genomes. Methods 58(3):268-76.
• Rao, S.S., Huntley, M.H., Durand, N.C., Stamenova, E.K., Bochkov, I.D., Robinson, J.T., Sanborn, A.L., Machol, I., Omer, A.D., Lander, E.S., Lieberman-Aiden, E (2014). A 3D Map of the Human Genome at Kilobase Resolution Reveals Principles of Chromatin Looping. Cell 159(7):1665-80.
Hi-C Assisted Assembly
• Burton, J.N., Adey, A., Patwardhan, R.P., Qiu, R., Kitzman, J.O. and Shendure, J. (2013). Chromosome-scale scaffolding of de novo genome assemblies based on chromatin interactions. Nat Biotechnol. 31(12):1119-25.
• Marie-Nelly, H., Marbouty, M., Cournac, A., Liti, G., Fischer, G., Zimmer, C. and Koszul, R. (2014). Filling annotation gaps in yeast genomes using genome-wide contact maps. Bioinformatics 30(15):2105-13.
• Marie-Nelly, H., Marbouty, M., Cournac, A., Flot, J.F., Liti, G., Parodi, D.P., Syan, S., Guillén, N., Margeot, A., Zimmer, C. and Koszul, R. (2014). High-quality genome (re)assembly using chromosomal contact data. Nat Commun. 5:5695.
Hi-C Assisted Metagenomic Assembly
• Burton, J.N., Liachko, I., Dunham, M.J. and Shendure, J. (2014). Species-level deconvolution of metagenome assemblies with Hi-C-based contact probability maps. G3 (Bethesda) 4(7):1339-46.
• Marbouty, M., Cournac, A., Flot, J.F., Marie-Nelly, H., Mozziconacci, J., and Koszul, R. (2014). Metagenomic chromosome conformation capture (meta3C) unveils the diversity of chromosome organization in microorganisms. Elife 3:03318.