How to sequence a large eukaryotic genome

and how we sequenced the cod genome

Lex Nederbragt

Norwegian High-Throughput Sequencing Centre (NSC) and Centre for Ecological and Evolutionary Synthesis (CEES)

What is a genome assembly?

A hierarchical data structure

that maps the sequence data

to a putative reconstruction of the target

Miller et al 2010, Genomics 95 (6): 315-327

Hierarchical structure

Sequence data

Reads

http://www.cbcb.umd.edu/research/assembly_primer.shtml

Reads!

http://www.sciencephoto.com/media/210915/enlarge

Contigs

Building contigs

Repeat copy 1 Repeat copy 2


Collapsed repeat consensus

Contig orientation? Contig order?

Mate pairs

Other read type

Repeat copy 1 Repeat copy 2

mate pair reads: (much) longer fragments

Scaffolds

Ordered, oriented contigs

contigs

mate pairs

gap size estimate
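The gap size estimate can be illustrated with a small sketch. It assumes a mate pair library with a known mean insert size, and it assumes that for every pair linking two contigs we know how many bases of the fragment lie on each contig. The function name `estimate_gap` and the input layout are hypothetical, not taken from any particular scaffolder.

```python
from statistics import mean

def estimate_gap(insert_size, links):
    """Estimate the gap between two contigs from linking mate pairs.

    insert_size: assumed mean fragment length of the mate pair library.
    links: list of (bases_on_contig_a, bases_on_contig_b), i.e. how much
           of each fragment is accounted for by the two contigs.
    Whatever part of the fragment is not on either contig is attributed
    to the gap; averaging over pairs smooths insert size variation.
    """
    return mean(insert_size - a - b for a, b in links)

# Two pairs from a 3 kb library spanning the same gap
print(estimate_gap(3000, [(1000, 1500), (1200, 1400)]))  # 450
```

A negative estimate would suggest that the two contigs actually overlap.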

Hierarchical structure

Algorithms

All are graph-based

Read 1 Read 2

Overlap

Graph-theory!

Algorithms

Hamiltonian path
– a path that visits every node exactly once


Algorithms

Overlap calculation (alignment)
– computationally intensive

Read 1 Read 2

Overlap
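A minimal sketch of the overlap step, assuming exact matches only. Real assemblers compute overlaps by alignment and tolerate mismatches and indels, which is where most of the computational cost comes from. The function name `overlap` and the `min_len` parameter are illustrative choices.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of `a` that equals a prefix of `b`.

    Exact matching only. Returns 0 when no overlap of at least
    `min_len` bases exists.
    """
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0

print(overlap("ACGCGATTCAGG", "TTCAGGTTACC"))  # 6, the shared "TTCAGG"
```

Comparing every read against every other read this way scales with the square of the number of reads, which is one reason the overlap phase dominates run time.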

Algorithms

Path through the graph: contig

[Figure: Read 1 to Read 4 chained by successive overlaps]

Greedy extension

Oldest
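Greedy extension can be sketched as repeatedly merging the two sequences with the longest overlap. This is a toy version that assumes error-free reads from one strand and exact overlaps; the helper names are hypothetical.

```python
from itertools import permutations

def overlap(a, b, min_len):
    """Longest exact suffix of `a` matching a prefix of `b` (0 if none)."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def greedy_assemble(reads, min_len=3):
    """Merge the best-overlapping pair until no overlaps remain."""
    reads = list(dict.fromkeys(reads))  # drop duplicate reads, keep order
    while len(reads) > 1:
        best_len, best_a, best_b = 0, None, None
        for a, b in permutations(reads, 2):
            n = overlap(a, b, min_len)
            if n > best_len:
                best_len, best_a, best_b = n, a, b
        if best_len == 0:
            break  # nothing left to merge: remaining sequences are contigs
        reads.remove(best_a)
        reads.remove(best_b)
        reads.append(best_a + best_b[best_len:])
    return reads

print(greedy_assemble(["ACGCGATTCAG", "GATTCAGGTTA", "CAGGTTACCACG"]))
# ['ACGCGATTCAGGTTACCACG']
```

Because each merge is chosen locally, a greedy assembler can take a wrong turn at a repeat, which is the weakness the graph-based methods try to address.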


Overlap-Layout-Consensus

Typical for Sanger-type reads
– also used by newbler from 454 Life Sciences

Steps
– Overlap computation
– Layout: graph simplification
– Consensus: sequence
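The consensus step can be illustrated with a column-wise majority vote. This sketch assumes the layout step has already placed each read at an offset, that reads are gap-free, and that every column is covered by at least one read. The `(offset, read)` input format is an assumption made for the example.

```python
from collections import Counter

def consensus(layout):
    """Majority-vote consensus from a list of (offset, read) placements."""
    length = max(offset + len(read) for offset, read in layout)
    columns = [Counter() for _ in range(length)]
    for offset, read in layout:
        for i, base in enumerate(read):
            columns[offset + i][base] += 1
    # Take the most frequent base in each column
    return "".join(col.most_common(1)[0][0] for col in columns)

layout = [
    (0, "ACGCGATT"),
    (2, "GCGATTCA"),
    (4, "GATTCAGG"),
    (4, "GATACAGG"),  # carries one sequencing error (A instead of T)
]
print(consensus(layout))  # ACGCGATTCAGG
```

The erroneous base is outvoted by the three reads that agree, which is why higher coverage gives a more accurate consensus.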

Overlap-Layout-Consensus

Overlap phase:
– K-mer seeds initiate overlap

ACGCGATTCAGGTTACCACG
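A rough sketch of k-mer seeding: instead of aligning every pair of reads, index all k-mers and only consider pairs that share at least one. The function names and the choice of k are illustrative assumptions.

```python
from collections import defaultdict
from itertools import combinations

def kmers(seq, k):
    """All overlapping substrings of length k."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def candidate_pairs(reads, k):
    """Pairs of read indices that share at least one k-mer."""
    index = defaultdict(set)
    for read_id, seq in enumerate(reads):
        for kmer in kmers(seq, k):
            index[kmer].add(read_id)
    pairs = set()
    for ids in index.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs

reads = ["ACGCGATTCAG", "GATTCAGGTTA", "TTTTTTTTTTT"]
print(candidate_pairs(reads, k=5))  # {(0, 1)}
```

Only the candidate pairs then go through the expensive alignment; the third read shares no 5-mer with the others and is never compared.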

de Bruijn graphs

Developed outside of DNA-related work
– Best solution for very short reads ≤100 nt

Read: GACCTACA

K-mers (K=3): GAC ACC CCT CTA TAC ACA

K-1 bases overlap

de Bruijn graph
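A minimal sketch of building such a graph, assuming the example read is GACCTACA and K=3. It follows the convention used here, where nodes are k-mers and an edge joins two k-mers that follow each other in a read and therefore overlap by K-1 bases. Many assemblers use the equivalent formulation with (K-1)-mers as nodes and k-mers as edges.

```python
from collections import defaultdict

def de_bruijn(reads, k):
    """Map each k-mer to the set of k-mers that directly follow it."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k):
            graph[read[i:i + k]].add(read[i + 1:i + k + 1])
    return graph

graph = de_bruijn(["GACCTACA"], k=3)
for node, successors in graph.items():
    print(node, "->", sorted(successors))
# GAC -> ['ACC']
# ACC -> ['CCT']
# CCT -> ['CTA']
# CTA -> ['TAC']
# TAC -> ['ACA']
```

No read-against-read alignment is needed: identical k-mers from different reads simply land on the same node.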

Graphs

Schatz M C et al. Genome Res. 2010;20:1165-1173

Graphs

Simplify the graph

Add scaffolding information

Sequence data

Sequencing errors
– add complexity to graph
– create new k-mers

Correction of errors
– k-mer frequency

Kelley et al. Genome Biology 2010 11:R116
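The idea behind k-mer frequency correction can be sketched as follows: k-mers from the true genome are seen many times at high coverage, while a k-mer created by a sequencing error is usually seen once. The threshold `min_count` and the function names are assumptions for illustration; real tools choose the cutoff from the k-mer frequency distribution.

```python
from collections import Counter

def kmer_counts(reads, k):
    """Count every k-mer across all reads."""
    return Counter(r[i:i + k] for r in reads for i in range(len(r) - k + 1))

def weak_kmers(reads, k, min_count=2):
    """K-mers seen fewer than `min_count` times: likely error k-mers."""
    counts = kmer_counts(reads, k)
    return {kmer for kmer, n in counts.items() if n < min_count}

# Three correct copies of a read and one copy with a T->G error
reads = ["ACGTAC", "ACGTAC", "ACGTAC", "ACGGAC"]
print(sorted(weak_kmers(reads, k=4)))  # ['ACGG', 'CGGA', 'GGAC']
```

A single wrong base creates up to K new k-mers, which is how errors inflate the graph.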

How to sequence a genome

human: 1990s
cod 1: 2009 - 2011
cod 2: 2011 - 2012

Human genome

Public effort
– BAC-by-BAC sequencing
– hierarchical shotgun sequencing

Genome

BACs

Select BACs


shotgun sequencing

100-150 kb

Human genome

Celera: shotgun sequencing
– entire genome shotgun
– use of mate pairs


Preparations

BAC-by-BAC

Add shotgun and mate pairs

Most projects

Double haploid individual ✔

OR Inbred line ✔

BAC library ✔

Genetic map ✔

Physical map ✔

The cod genome project

Preparations

                            Most projects   Cod project
Double haploid individual   ✔               -
OR Inbred line              ✔               -
BAC library                 ✔               ✔*
Genetic map                 ✔               -
Physical map                ✔               -

* From a different individual

Cod: strategy

‘454 only’
– NO subcloning
– Pure ‘shotgun’ approach
– 454 specific paired end libraries

Supplementary
– BAC ends using Sanger sequencing

Cod: sequencing

Cod: assembly

Input for assembly
– 84 million reads
– 28 billion bases (28 Gb)
• 34x coverage

Assembly program
– Newbler from 454
– Celera from Venter Inst.

Computing nodes
– 24 cpus
– 128 GB of memory
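The input figures above are related by simple arithmetic: coverage is total sequenced bases divided by genome size. The sketch below derives the mean read length and the genome size that these numbers imply. Both derived values are back-of-the-envelope results, not figures stated in the talk.

```python
def mean_read_length(total_bases, n_reads):
    """Average number of bases per read."""
    return total_bases / n_reads

def implied_genome_size(total_bases, coverage):
    """Genome size consistent with the given coverage."""
    return total_bases / coverage

reads = 84_000_000
bases = 28_000_000_000

print(round(mean_read_length(bases, reads)))            # 333 bases per read
print(round(implied_genome_size(bases, 34) / 1e6))      # 824 (Mb)
```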

Cod: assembly

611 Mb in 6 467 scaffolds
– but 35% gap bases
– short contigs
– incomplete genes
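Statistics like the gap fraction and contig lengths can be computed directly from scaffold sequences. This sketch assumes gaps are written as runs of `N`, the usual convention in scaffold FASTA files; the function names and toy sequences are hypothetical.

```python
import re

def gap_fraction(scaffolds):
    """Fraction of all scaffold bases that are gap (N) bases."""
    total = sum(len(s) for s in scaffolds)
    gaps = sum(s.upper().count("N") for s in scaffolds)
    return gaps / total

def contig_lengths(scaffolds):
    """Lengths of the contigs obtained by splitting scaffolds at gaps."""
    return [len(c) for s in scaffolds for c in re.split(r"[Nn]+", s) if c]

scaffolds = ["ACGTNNNNACGTAC", "NNACGT"]
print(gap_fraction(scaffolds))    # 0.3
print(contig_lengths(scaffolds))  # [4, 6, 4]
```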

Cod: gaps

[Figure: Contig 1, two copies each of Polymorphic contig 2 and Polymorphic contig 3, Contig 4]

Heterozygosity

Short Tandem Repeats

ACACACACACACACACACACACACACACACACACACACACACACACACACACACACACAACACACACACACACACACACACACACACACACACACACACACACACACACACACACACAACACACACACACACACACACACACACACACACACACACACACACACACACACACACACAACACACACACACACACACACACACACACACACACACACACACACACACACACACACACA
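Perfect short tandem repeats like the AC run above can be located with a back-referencing regular expression. This is a simple sketch: the parameters `unit_len` and `min_copies` are illustrative, and a perfect-match pattern will split a repeat at interruptions such as the occasional ACAACA in the sequence above.

```python
import re

def find_tandem_repeats(seq, unit_len=2, min_copies=5):
    """Return (start, unit, copies) for each perfect tandem repeat.

    Note: a homopolymer run such as AAAAAAAAAA also matches,
    reported with the unit "AA".
    """
    pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (unit_len, min_copies - 1))
    return [
        (m.start(), m.group(1), len(m.group(0)) // unit_len)
        for m in pattern.finditer(seq)
    ]

seq = "TTG" + "AC" * 8 + "TTG"
print(find_tandem_repeats(seq))  # [(3, 'AC', 8)]
```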

Cod: annotation

Ensembl
– 'repair' genes based on stickleback sequence
– ~22 000 genes

http://pre.ensembl.org/Gadus_morhua/

Cod 2: 2011-2012

Close the gaps
– increase contig size

Pseudochromosomes
– genetic linkage map
– scaffolds to 'chromosomes'
• anchoring
• ordering and orienting
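Anchoring, ordering and orienting can be sketched as follows, assuming each genetic marker has been located on a scaffold and has a linkage group and a map position in centimorgans. The tuple layout, the majority-vote rule and the use of the mean map position are simplifying assumptions for this example, not the method used in the cod project.

```python
from collections import Counter, defaultdict
from statistics import mean

def anchor(markers):
    """markers: (scaffold, position_on_scaffold, linkage_group, cM).

    Returns (linkage_group, scaffold, orientation) in map order.
    """
    by_scaffold = defaultdict(list)
    for scaffold, pos, group, cm in markers:
        by_scaffold[scaffold].append((pos, group, cm))

    placed = []
    for scaffold, hits in by_scaffold.items():
        # Anchoring: assign the scaffold to its most common linkage group
        group = Counter(g for _, g, _ in hits).most_common(1)[0][0]
        hits = sorted((p, cm) for p, g, cm in hits if g == group)
        # Orienting: does map position rise or fall along the scaffold?
        first_cm, last_cm = hits[0][1], hits[-1][1]
        if last_cm > first_cm:
            orientation = "+"
        elif last_cm < first_cm:
            orientation = "-"
        else:
            orientation = "?"  # one marker, or no recombination between markers
        placed.append((group, mean(cm for _, cm in hits), scaffold, orientation))

    placed.sort()  # Ordering: by linkage group, then by mean map position
    return [(g, s, o) for g, _, s, o in placed]

markers = [
    ("s2", 100, "LG1", 12.0), ("s2", 9000, "LG1", 10.0),
    ("s1", 500, "LG1", 2.0), ("s1", 8000, "LG1", 5.0),
    ("s3", 50, "LG2", 1.0),
]
print(anchor(markers))
# [('LG1', 's1', '+'), ('LG1', 's2', '-'), ('LG2', 's3', '?')]
```

A scaffold needs at least two markers at different map positions before it can be oriented, which is why marker density matters for building pseudochromosomes.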

Cod 2: strategy

New data
– Illumina reads
– longer 454 reads ~700 bases
– PacBio reads?

Improved programs
– newbler

New programs
– assembly
– gap closing

Many programs to choose from

Assembly competitions

Assemblathon 1
– simulated datasets
– ALLPATHS-LG: Broad Institute MIT (US)
– SOAPdenovo: BGI (China)
– SGA: Sanger Institute (UK)

Assembly competitions

Assemblathon 2
– real datasets
• snake: Illumina only
• cichlid fish: Illumina only
• parrot: Illumina, 454 FLX+, PacBio

http://assemblathon.org/


In 2011

Most projects

Double haploid individual ✔

OR Inbred line ✔

BAC library ✔

Genetic map ✔

Physical map ✔

Cheap alternative: RAD-tag sequencing


Foundation of Illumina data
– 100x coverage Paired End reads (2x100 bp)
– several Mate Pair libraries
• 2 kb, 3 kb, 8 kb, 10 kb, bigger?
– this is now very cheap!

Fill gaps with long reads
– 454 or PacBio


Add lots of bioinformatics...

http://cores.montana.edu/index.php?page=bioinformatics-core-facility

Thank you!

[email protected]

www.sequencing.uio.no
