How to sequence a large eukaryotic genome and how we sequenced the cod genome Lex Nederbragt Norwegian High-Throughput Sequencing Centre (NSC) and Centre for Ecological and Evolutionary Synthesis (CEES)


How to sequence a large eukaryotic genome - and how we sequenced the cod genome. A seminar I gave for the Computational Life Science (Univ. of Oslo) seminar series, September 28, 2011


Page 1: How to sequence a large eukaryotic genome

How to sequence a large eukaryotic genome

and how we sequenced the cod genome

Lex Nederbragt
Norwegian High-Throughput Sequencing Centre (NSC)
and
Centre for Ecological and Evolutionary Synthesis (CEES)

Page 2: How to sequence a large eukaryotic genome
Page 3: How to sequence a large eukaryotic genome

What is a genome assembly?

A hierarchical data structure

that maps the sequence data

to a putative reconstruction of the target

Miller et al 2010, Genomics 95 (6): 315-327

Page 4: How to sequence a large eukaryotic genome

Hierarchical structure

Page 5: How to sequence a large eukaryotic genome

Sequence data

Reads

http://www.cbcb.umd.edu/research/assembly_primer.shtml

Page 6: How to sequence a large eukaryotic genome

Reads!

http://www.sciencephoto.com/media/210915/enlarge

Page 7: How to sequence a large eukaryotic genome

Contigs

Building contigs

Page 8: How to sequence a large eukaryotic genome

Contigs

Building contigs

Repeat copy 1 Repeat copy 2

http://www.cbcb.umd.edu/research/assembly_primer.shtml

Collapsed repeat consensus

Contig orientation? Contig order?

Page 9: How to sequence a large eukaryotic genome

Mate pairs

Other read type

Repeat copy 1 Repeat copy 2

mate pair reads

(much) longer fragments

Page 10: How to sequence a large eukaryotic genome

Scaffolds

Ordered, oriented contigs

contigs

mate pairs

gap size estimate
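
The gap size estimate can be illustrated with a little arithmetic: a mate pair spans a roughly known fragment length, so whatever part of that length is not accounted for by the two contigs is the gap. This is a minimal sketch, not how any particular scaffolder works. The function names, the coordinate convention and the 3 kb library size in the usage example are assumptions made for illustration.

```python
from statistics import median


def gap_from_pair(insert_size, left_contig_len, left_read_start, right_read_end):
    """Gap implied by one mate pair that links two contigs.

    insert_size      -- expected outer distance of the pair (library-dependent)
    left_contig_len  -- length of the left contig
    left_read_start  -- 0-based start of the first read on the left contig
    right_read_end   -- 0-based exclusive end of the second read on the right contig
    """
    covered_left = left_contig_len - left_read_start   # bases up to the contig end
    covered_right = right_read_end                     # bases from the contig start
    return insert_size - covered_left - covered_right


def estimate_gap(pairs, insert_size, left_contig_len):
    """Median gap over all linking pairs, to dampen insert-size variation."""
    gaps = [gap_from_pair(insert_size, left_contig_len, start, end)
            for start, end in pairs]
    return median(gaps)


# Hypothetical example: three pairs from a 3 kb library link a 10 kb contig
# to its neighbour.
pairs = [(9000, 1200), (9500, 1650), (8800, 1050)]
print(estimate_gap(pairs, insert_size=3000, left_contig_len=10000))
```

A negative estimate would suggest that the two contigs actually overlap.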

Page 11: How to sequence a large eukaryotic genome

Hierarchical structure

Page 12: How to sequence a large eukaryotic genome

Algorithms

All are graph-based

Read 1 Read 2

Overlap

Graph-theory!

Page 13: How to sequence a large eukaryotic genome

Algorithms

Hamiltonian path
– a path that visits every node exactly once

http://www.cbcb.umd.edu/research/assembly_primer.shtml
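
To make the Hamiltonian path idea concrete, here is a brute-force sketch: reads are nodes, an edge exists when the end of one read matches the start of another, and we look for an ordering that uses every read exactly once. The reads, the exact-match `overlap` helper and the `min_length` value are assumptions for illustration. Trying every permutation only works for a handful of reads, which is why real assemblers simplify the graph instead.

```python
from itertools import permutations


def overlap(a, b, min_length=3):
    """Length of the longest suffix of a that exactly matches a prefix of b."""
    for length in range(min(len(a), len(b)), min_length - 1, -1):
        if a.endswith(b[:length]):
            return length
    return 0


def hamiltonian_paths(reads, min_length=3):
    """Yield every ordering in which each consecutive pair of reads overlaps."""
    for order in permutations(reads):
        if all(overlap(x, y, min_length) > 0 for x, y in zip(order, order[1:])):
            yield order


# Hypothetical reads cut from the sequence ACGCGATTCAGGTTACCACG.
reads = ["TTACCACG", "ACGCGATT", "GATTCAGG", "CAGGTTAC"]
for path in hamiltonian_paths(reads, min_length=4):
    print(" -> ".join(path))
```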

Page 14: How to sequence a large eukaryotic genome

Algorithms

Overlap calculation (alignment)
– computationally intensive

Read 1 Read 2

Overlap
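
A minimal sketch of an overlap calculation, assuming error-free reads so that an exact suffix-prefix match is enough (real assemblers align the reads and tolerate mismatches). The function name and `min_length` parameter are illustrative. Because every read has to be compared with every other read, the naive cost grows with the square of the number of reads, which is where the computational burden comes from.

```python
def overlap(a, b, min_length=3):
    """Length of the longest suffix of read a that equals a prefix of read b."""
    for length in range(min(len(a), len(b)), min_length - 1, -1):
        if a.endswith(b[:length]):
            return length
    return 0


def all_overlaps(reads, min_length=3):
    """Compare every ordered pair of reads: n * (n - 1) comparisons."""
    result = {}
    for i, a in enumerate(reads):
        for j, b in enumerate(reads):
            if i != j:
                length = overlap(a, b, min_length)
                if length:
                    result[(i, j)] = length
    return result


read_1 = "ACGCGATTCAGG"
read_2 = "TCAGGTTACCACG"
print(overlap(read_1, read_2))  # the two reads share TCAGG
```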

Page 15: How to sequence a large eukaryotic genome

Algorithms

Path through the graph → contig

Read 1 Read 2

Overlap

Read 3

Overlap

Read 4

Overlap

Page 16: How to sequence a large eukaryotic genome

Greedy extension

Oldest

http://www.cbcb.umd.edu/research/assembly_primer.shtml
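
Greedy extension can be sketched in a few lines: repeatedly merge the two sequences with the largest overlap until nothing overlaps any more. This is an illustrative toy under the assumption of error-free reads and exact overlaps; the function names and example reads are made up. Because the choice is purely local, a greedy assembler can join reads across different copies of a repeat.

```python
def overlap(a, b, min_length=3):
    """Length of the longest suffix of a that exactly matches a prefix of b."""
    for length in range(min(len(a), len(b)), min_length - 1, -1):
        if a.endswith(b[:length]):
            return length
    return 0


def greedy_assemble(reads, min_length=3):
    """Merge the best-overlapping pair until no overlaps remain."""
    reads = list(reads)
    while len(reads) > 1:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    length = overlap(a, b, min_length)
                    if length > best_len:
                        best_len, best_i, best_j = length, i, j
        if best_len == 0:
            break  # the remaining sequences stay as separate contigs
        merged = reads[best_i] + reads[best_j][best_len:]
        reads = [r for k, r in enumerate(reads) if k not in (best_i, best_j)]
        reads.append(merged)
    return reads


reads = ["TTACCACG", "ACGCGATT", "GATTCAGG", "CAGGTTAC"]
print(greedy_assemble(reads))
```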

Page 17: How to sequence a large eukaryotic genome

Overlap-Layout-Consensus

Typical for Sanger-type reads
– also used by newbler from 454 Life Sciences

Steps
– Overlap computation
– Layout: graph simplification
– Consensus: sequence

Page 18: How to sequence a large eukaryotic genome

Overlap-Layout-Consensus

Overlap phase:
– K-mer seeds initiate overlap

ACGCGATTCAGGTTACCACG
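
The idea behind k-mer seeds is to avoid aligning every read against every other read: only pairs that share at least one exact k-mer are worth aligning. The sketch below shows such a seed index. The function names, the choice of k = 5 and the example reads are assumptions, and it is not a description of how newbler implements this step.

```python
from collections import defaultdict
from itertools import combinations


def kmers(seq, k):
    """All substrings of length k, in order."""
    return (seq[i:i + k] for i in range(len(seq) - k + 1))


def candidate_pairs(reads, k):
    """Pairs of read indices that share at least one k-mer."""
    index = defaultdict(set)            # k-mer -> ids of reads containing it
    for read_id, seq in enumerate(reads):
        for kmer in kmers(seq, k):
            index[kmer].add(read_id)
    pairs = set()
    for ids in index.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


reads = ["ACGCGATTCAGG", "TCAGGTTACCACG", "GGGGGGGGGGGG"]
print(candidate_pairs(reads, k=5))  # only the first two reads share a seed
```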

Page 19: How to sequence a large eukaryotic genome

de Bruijn graphs

Developed outside of DNA-related work
– Best solution for very short reads ≤100 nt

Read: GACCTACA

K-mers (K=3): GAC ACC CCT CTA TAC ACA

K-1 bases overlap

de Bruijn graph
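
A minimal sketch of building such a graph, using the read GACCTACA with K=3: each K-mer is a node, and consecutive K-mers, which overlap by K-1 bases, are joined by an edge. The function name and the dictionary representation are illustrative choices, not taken from any particular assembler.

```python
from collections import defaultdict


def de_bruijn(reads, k):
    """Map each k-mer to the set of k-mers that follow it in some read."""
    graph = defaultdict(set)
    for read in reads:
        kmer_list = [read[i:i + k] for i in range(len(read) - k + 1)]
        for left, right in zip(kmer_list, kmer_list[1:]):
            graph[left].add(right)      # left and right share k-1 bases
    return dict(graph)


graph = de_bruijn(["GACCTACA"], k=3)
for node, successors in graph.items():
    print(node, "->", ", ".join(sorted(successors)))
```

Reads never have to be compared with each other here; identical K-mers from different reads simply land on the same node.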

Page 20: How to sequence a large eukaryotic genome

Graphs

Schatz M C et al. Genome Res. 2010;20:1165-1173

Page 21: How to sequence a large eukaryotic genome

Graphs

Simplify the graph

Add scaffolding information

Page 22: How to sequence a large eukaryotic genome

Sequence data

Sequencing errors
– add complexity to graph
– create new k-mers

Correction of errors
– k-mer frequency

Kelley et al. Genome Biology 2010 11:R116
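
A sketch of the k-mer frequency idea: with enough coverage, genuine k-mers occur many times, while a sequencing error creates up to k new k-mers that are seen only once or twice. The code below only flags such rare k-mers; it does not correct them, and it is not the method of the cited paper. The threshold, the value of k and the reads are assumptions.

```python
from collections import Counter


def kmer_counts(reads, k):
    """Count every k-mer across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def suspect_kmers(reads, k, min_count=2):
    """K-mers seen fewer than min_count times: likely caused by errors."""
    counts = kmer_counts(reads, k)
    return {kmer for kmer, n in counts.items() if n < min_count}


# Five error-free copies of a read plus one copy with a single substitution.
reads = ["ACGCGATTCAGG"] * 5 + ["ACGCGATACAGG"]
print(sorted(suspect_kmers(reads, k=4)))
```

In this example the single substitution produces exactly four (k) rare 4-mers.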

Page 23: How to sequence a large eukaryotic genome

How to sequence a genome

human: 1990s
cod 1: 2009 - 2011
cod 2: 2011 - 2012

Page 24: How to sequence a large eukaryotic genome

Human genome

Public effort
– BAC-by-BAC sequencing
– hierarchical shotgun sequencing

Genome

BACs

Select BACs

http://www.cbcb.umd.edu/research/assembly_primer.shtml

shotgun sequencing

100-150 kb

Page 25: How to sequence a large eukaryotic genome

Human genome

Celera: shotgun sequencing
– entire genome shotgun
– use of mate pairs

Page 26: How to sequence a large eukaryotic genome

How to sequence a genome

Preparations

BAC-by-BAC

Add shotgun and mate pairs

Most projects

Double haploid individual ✔

OR Inbred line ✔

BAC library ✔

Genetic map ✔

Physical map ✔

Page 27: How to sequence a large eukaryotic genome

The cod genome project

Preparations                 Most projects   Cod project
Double haploid individual    ✔               -
OR Inbred line               ✔               -
BAC library                  ✔               ✔*
Genetic map                  ✔               -
Physical map                 ✔               -

* From a different individual

Page 28: How to sequence a large eukaryotic genome

Cod: strategy

‘454 only’
– NO subcloning
– Pure ‘shotgun’ approach
– 454 specific paired end libraries

Supplementary
– BAC ends using Sanger sequencing

Page 29: How to sequence a large eukaryotic genome

Cod: sequencing

Page 30: How to sequence a large eukaryotic genome

Cod: assembly

Input for assembly
– 84 million reads
– 28 billion bases (Gb)
  • 34x coverage

Assembly program
– Newbler from 454
– Celera from Venter Inst.

Computing nodes
– 24 cpus
– 128 GB of memory
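
Coverage is the total number of sequenced bases divided by the genome size, so the input figures can be cross-checked with simple arithmetic. The slide does not state the genome size; the sketch below only derives the size that 28 Gb at 34x implies, plus the mean read length. The function names are illustrative.

```python
def implied_genome_size(total_bases, coverage):
    """Genome size implied by coverage = total_bases / genome_size."""
    return total_bases / coverage


def mean_read_length(total_bases, n_reads):
    """Average number of bases per read."""
    return total_bases / n_reads


total_bases = 28e9   # 28 billion bases
n_reads = 84e6       # 84 million reads
coverage = 34        # 34x

print(f"implied genome size: {implied_genome_size(total_bases, coverage) / 1e6:.0f} Mb")
print(f"mean read length: {mean_read_length(total_bases, n_reads):.0f} bases")
```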

Page 31: How to sequence a large eukaryotic genome

Cod: assembly

611 Mb in 6 467 scaffolds
– but 35% gap bases
– short contigs
– incomplete genes

Page 32: How to sequence a large eukaryotic genome

Cod: gaps

Polymorphic contig 2    Polymorphic contig 2

Polymorphic contig 3    Polymorphic contig 3

Contig 4

Contig 1

Heterozygosity

Short Tandem Repeats

ACACACACACACACACACACACACACACACACACACACACACACACACACACACACACAACACACACACACACACACACACACACACACACACACACACACACACACACACACACACAACACACACACACACACACACACACACACACACACACACACACACACACACACACACACAACACACACACACACACACACACACACACACACACACACACACACACACACACACACACA
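
Short tandem repeats like the AC run above are easy to locate with a regular expression that matches a short unit followed by copies of itself. This is a sketch for spotting them in a sequence, not part of the cod assembly pipeline. The function name, the default unit length and the minimum copy number are assumptions.

```python
import re


def find_tandem_repeats(seq, unit_len=2, min_copies=5):
    """Return (start, unit, copies) for each run of a repeated unit."""
    # One unit, followed by at least (min_copies - 1) further copies of it.
    pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (unit_len, min_copies - 1))
    return [(m.start(), m.group(1), len(m.group(0)) // unit_len)
            for m in pattern.finditer(seq)]


seq = "TTG" + "AC" * 8 + "GGT"
print(find_tandem_repeats(seq))
```

A single extra base inside a run, as in the sequence on the slide, splits it into separate runs under this exact-match definition.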

Page 33: How to sequence a large eukaryotic genome

Cod: annotation

Ensembl
– 'repair' genes based on stickleback sequence
– ~22 000 genes

http://pre.ensembl.org/Gadus_morhua/

Page 34: How to sequence a large eukaryotic genome
Page 35: How to sequence a large eukaryotic genome

Cod 2: 2011-2012

Close the gaps
– increase contig size

Pseudochromosomes
– genetic linkage map
– scaffolds to 'chromosomes'
  • anchoring
  • ordering and orienting
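
How a genetic linkage map can turn scaffolds into pseudochromosomes is sketched below, under simplifying assumptions: each marker has a known position on a scaffold and on a linkage group, a scaffold is anchored to the linkage group of its markers, scaffolds are ordered by mean map position, and orientation follows from whether map position rises or falls along the scaffold. The marker data, the names and the tuple layout are hypothetical.

```python
from collections import defaultdict


def anchor_scaffolds(markers):
    """markers: iterable of (scaffold, scaffold_pos, linkage_group, cM)."""
    by_scaffold = defaultdict(list)
    for scaffold, pos, group, cm in markers:
        by_scaffold[scaffold].append((pos, group, cm))

    placed = defaultdict(list)
    for scaffold, hits in by_scaffold.items():
        groups = {group for _, group, _ in hits}
        if len(groups) != 1:
            continue                      # conflicting markers: leave unplaced
        hits.sort()                       # order markers along the scaffold
        first_cm, last_cm = hits[0][2], hits[-1][2]
        if last_cm > first_cm:
            orientation = "+"
        elif last_cm < first_cm:
            orientation = "-"
        else:
            orientation = "?"             # one marker (or a tie) cannot orient
        mean_cm = sum(cm for _, _, cm in hits) / len(hits)
        placed[groups.pop()].append((mean_cm, scaffold, orientation))

    return {group: [(scaffold, orientation)
                    for _, scaffold, orientation in sorted(entries)]
            for group, entries in placed.items()}


markers = [
    ("scafA", 100, "LG1", 12.0), ("scafA", 900, "LG1", 15.0),
    ("scafB", 50, "LG1", 5.0), ("scafB", 700, "LG1", 2.0),
    ("scafC", 10, "LG2", 1.0),
]
print(anchor_scaffolds(markers))
```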

Page 36: How to sequence a large eukaryotic genome

Cod 2: strategy

New data
– Illumina reads
– longer 454 reads ~700 bases
– PacBio reads?

Improved programs
– newbler

New programs
– assembly
– gap closing

Page 37: How to sequence a large eukaryotic genome

Many programs to choose from

Page 38: How to sequence a large eukaryotic genome

Assembly competitions

Assemblathon 1
– simulated datasets
– ALLPATHS_LG – Broad Institute MIT (US)
– SOAPdenovo – BGI (China)
– SGA – Sanger Institute (UK)

Page 39: How to sequence a large eukaryotic genome

Assembly competitions

Assemblathon 2
– real datasets
  • snake – Illumina only
  • cichlid fish – Illumina only
  • parrot
    – Illumina
    – 454 FLX+
    – PacBio

http://assemblathon.org/

Page 40: How to sequence a large eukaryotic genome

How to sequence a genome

In 2011

Most projects

Double haploid individual ✔

OR Inbred line ✔

BAC library ✔

Genetic map ✔

Physical map ✔

Cheap alternative: RAD-tag sequencing

Page 41: How to sequence a large eukaryotic genome

How to sequence a genome

Foundation of Illumina data
– 100x coverage Paired End reads (2x100 bp)
– several Mate Pair libraries
  • 2 kb, 3 kb, 8 kb, 10 kb, bigger?
– this is now very cheap!

Fill gaps with long reads
– 454 or PacBio

Page 42: How to sequence a large eukaryotic genome

How to sequence a genome

Add lots of bioinformatics...

http://cores.montana.edu/index.php?page=bioinformatics-core-facility

Page 43: How to sequence a large eukaryotic genome

Thank you!

[email protected]

www.sequencing.uio.no
