How to sequence a large eukaryotic genome

and how we sequenced the cod genome

Lex Nederbragt

Norwegian High-Throughput Sequencing Centre (NSC)

and Centre for Ecological and Evolutionary Synthesis (CEES)

What is a genome assembly?

A hierarchical data structure

that maps the sequence data

to a putative reconstruction of the target

Miller et al 2010, Genomics 95 (6): 315-327

Hierarchical structure

Sequence data

http://www.cbcb.umd.edu/research/assembly_primer.shtml

Reads!

http://www.sciencephoto.com/media/210915/enlarge

Contigs

Building contigs

Repeat copy 1 Repeat copy 2

Collapsed repeat consensus

Contig orientation? Contig order?

Mate pairs

Other read type

Repeat copy 1 Repeat copy 2

Mate pair reads: (much) longer fragments

Scaffolds

Ordered, oriented contigs

contigs

mate pairs

gap size estimate
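A gap size estimate from mate pairs can be sketched as follows. This is a simplified illustration, assuming a single known library insert size and hypothetical per-pair distances; real scaffolders model the full insert size distribution.

```python
from statistics import median


def gap_estimate(insert_size, links):
    """Estimate the gap between contig A and contig B from spanning mate pairs.

    insert_size: assumed (hypothetical) mean fragment length of the library.
    links: one tuple per mate pair, (a, b), where
        a = distance from the outer end of the read on contig A to A's end,
        b = distance from B's start to the outer end of the read on contig B.
    Whatever part of the fragment is not covered by a + b must lie in the gap.
    """
    return median(insert_size - a - b for a, b in links)


# Three hypothetical mate pairs from a 3 kb library spanning the same gap.
links = [(1000, 1500), (1200, 1350), (900, 1550)]
print(gap_estimate(3000, links))
```

Using the median rather than the mean is a design choice here: it keeps a single mis-mapped pair from dominating the estimate.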

Hierarchical structure

Algorithms

All are graph-based

Read 1 Read 2

Overlap

Graph-theory!

Algorithms

Hamiltonian path
– a path that visits every node exactly once
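To make the Hamiltonian path idea concrete, here is a brute-force sketch on a tiny overlap graph. The node names and edges are hypothetical, and trying every ordering is only feasible for a handful of reads; it is shown to illustrate the definition, not as a practical assembler.

```python
from itertools import permutations


def hamiltonian_path(nodes, edges):
    """Return one ordering of nodes that uses each node exactly once and
    follows a directed edge at every step, or None if no such ordering exists."""
    edge_set = set(edges)
    for order in permutations(nodes):
        if all((a, b) in edge_set for a, b in zip(order, order[1:])):
            return list(order)
    return None


# Hypothetical overlap graph: an edge (x, y) means "read x overlaps into read y".
nodes = ["read1", "read2", "read3"]
edges = [("read2", "read3"), ("read1", "read2")]
print(hamiltonian_path(nodes, edges))
```

The number of orderings grows factorially with the number of reads, which is one reason real assemblers simplify the graph instead of searching it exhaustively.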

Algorithms

Overlap calculation (alignment)
– computationally intensive

Read 1 Read 2

Overlap

Algorithms

Path through the graph → contig

Read 1 Read 2

Overlap

Overlap

Overlap

Greedy extension

Oldest
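Greedy extension can be sketched as repeatedly merging the two reads with the longest exact suffix-prefix overlap. This is a toy version under stated assumptions: error-free reads, a single strand, and a hypothetical minimum overlap of 3 bases. The three example reads are cut by hand from the 20-base example sequence shown later in the deck.

```python
def overlap_len(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b (0 if < min_len)."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads, min_len=3):
    """Merge the best-overlapping pair until no overlaps of min_len remain."""
    reads = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(reads) > 1:
        best = (0, None, None)
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    n = overlap_len(a, b, min_len)
                    if n > best[0]:
                        best = (n, i, j)
        n, i, j = best
        if n == 0:
            break  # nothing left to merge: remaining reads are separate contigs
        merged = reads[i] + reads[j][n:]
        reads = [r for k, r in enumerate(reads) if k not in (i, j)] + [merged]
    return reads


reads = ["ACGCGATTCA", "GATTCAGGTT", "AGGTTACCACG"]
print(greedy_assemble(reads))
```

Because the choice is purely local, a greedy assembler can take a wrong turn at a repeat, which is where the graph-based methods on the following slides come in.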

Overlap-Layout-Consensus

Typical for Sanger-type reads
– also used by Newbler from 454 Life Sciences

Steps
– Overlap computation
– Layout: graph simplification
– Consensus: sequence

Overlap-Layout-Consensus

Overlap phase:
– K-mer seeds initiate overlap

ACGCGATTCAGGTTACCACG
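One way to picture k-mer seeding: index every k-mer by the reads that contain it, then only align read pairs that share at least one k-mer. The sketch below assumes a hypothetical seed length of 5 and three reads cut by hand from the sequence above; it finds candidate pairs only and does not perform the alignment itself.

```python
from collections import defaultdict
from itertools import combinations


def kmers(seq, k):
    """All overlapping substrings of length k."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def candidate_pairs(reads, k):
    """Pairs of read indices sharing at least one k-mer (the 'seed')."""
    index = defaultdict(set)
    for read_id, seq in enumerate(reads):
        for kmer in kmers(seq, k):
            index[kmer].add(read_id)
    pairs = set()
    for read_ids in index.values():
        pairs.update(combinations(sorted(read_ids), 2))
    return pairs


reads = ["ACGCGATTCA", "GATTCAGGTT", "AGGTTACCACG"]
print(sorted(candidate_pairs(reads, k=5)))
```

The first and third read share no 5-mer, so they are never compared. Skipping such pairs is what makes the overlap phase tractable for millions of reads.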

de Bruijn graphs

Developed outside of DNA-related work
– Best solution for very short reads ≤100 nt

GACCTACAGAC ACC CCT CTA TAC ACA

K-mers (K=3)

K-1 bases overlap

de Bruijn graph
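The construction above can be sketched in a few lines: each k-mer is a node, and an edge joins two k-mers that follow each other in a read, so they overlap by K-1 bases. This is a minimal illustration using the example sequence and K=3 from the slide; it ignores reverse complements and k-mer multiplicity, which real assemblers track.

```python
from collections import defaultdict


def de_bruijn(reads, k):
    """Map each k-mer to the set of k-mers that directly follow it in some read."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k):
            left = read[i:i + k]           # k-mer starting at position i
            right = read[i + 1:i + k + 1]  # next k-mer, shares k-1 bases with left
            graph[left].add(right)
    return dict(graph)


graph = de_bruijn(["GACCTACAGAC"], k=3)
for node, successors in graph.items():
    print(node, "->", sorted(successors))
```

Note that no read-to-read alignment is needed: overlaps are implicit in shared k-mers, which is why the approach scales to very large numbers of short reads.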

Graphs

Schatz M C et al. Genome Res. 2010;20:1165-1173

Graphs

Simplify the graph

Add scaffolding information

Sequence data

Sequencing errors
– add complexity to graph
– create new k-mers

Correction of errors
– k-mer frequency

Kelley et al. Genome Biology 2010 11:R116
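The k-mer frequency idea can be illustrated as follows: with enough coverage, a true k-mer is seen many times, while a k-mer created by a sequencing error is seen only once or twice. This sketch only flags low-frequency k-mers and is not the method of any particular tool; the cutoff of 2 and the example reads are assumptions made for illustration.

```python
from collections import Counter


def kmer_counts(reads, k):
    """Count every k-mer across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def weak_kmers(reads, k, min_count=2):
    """k-mers seen fewer than min_count times: likely caused by errors."""
    return {kmer for kmer, n in kmer_counts(reads, k).items() if n < min_count}


# Three correct copies of a read plus one copy with a single T->A error.
reads = ["ACGTAC", "ACGTAC", "ACGTAC", "ACGAAC"]
print(sorted(weak_kmers(reads, k=3)))
```

A single wrong base here introduces three new 3-mers, which is how errors inflate the graph; a corrector would then look for the base change that turns the weak k-mers back into frequent ones.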

How to sequence a genome

human: 1990s

cod 1: 2009 - 2011

cod 2: 2011 - 2012

Human genome

Public effort
– BAC-by-BAC sequencing
– hierarchical shotgun sequencing

Genome

Select BACs

shotgun sequencing

100-150 kb

Human genome

Celera: shotgun sequencing
– entire genome shotgun
– use of mate pairs

Preparations

BAC-by-BAC

Add shotgun and mate pairs

Most projects

Double haploid individual ✔

OR Inbred line ✔

BAC library ✔

Genetic map ✔

Physical map ✔

The cod genome project

Preparations

                            Most projects   Cod project
Double haploid individual   ✔               -
OR Inbred line              ✔               -
BAC library                 ✔               ✔*
Genetic map                 ✔               -
Physical map                ✔               -

* From a different individual

Cod: strategy

‘454 only’
– NO subcloning
– Pure ‘shotgun’ approach
– 454-specific paired end libraries

Supplementary
– BAC ends using Sanger sequencing

Cod: sequencing

Cod: assembly

Input for assembly
– 84 million reads
– 28 billion bases (28 Gb)
  • 34x coverage

Assembly program
– Newbler from 454
– Celera Assembler from the J. Craig Venter Institute

Computing nodes
– 24 CPUs
– 128 GB of memory
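The input figures above are related by simple arithmetic: coverage is total bases divided by genome size. The sketch below works backwards from the slide's rounded numbers. The genome size it prints is not stated on the slide; it is only what 28 Gb at 34x implies, so treat it as approximate.

```python
def mean_read_length(total_bases, n_reads):
    """Average read length in bases."""
    return total_bases / n_reads


def coverage(total_bases, genome_size):
    """Average number of times each genome position is sequenced."""
    return total_bases / genome_size


def implied_genome_size(total_bases, cov):
    """Genome size implied by a given amount of data and a stated coverage."""
    return total_bases / cov


total_bases = 28e9  # 28 billion bases (from the slide)
n_reads = 84e6      # 84 million reads (from the slide)

print(round(mean_read_length(total_bases, n_reads)))        # bases per read
print(round(implied_genome_size(total_bases, 34) / 1e6))    # Mb, derived value
```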

Cod: assembly

611 Mb in 6 467 scaffolds
– but 35% gap bases
– short contigs
– incomplete genes
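A figure like the gap percentage above can be computed directly from scaffold sequences, assuming the common convention that gaps are written as runs of N. The sketch below uses two short made-up scaffolds, not cod data, and also shows how a scaffold splits back into its contigs.

```python
import re


def split_contigs(scaffold):
    """Contigs are the stretches of sequence between runs of N (gaps)."""
    return [contig for contig in re.split(r"[Nn]+", scaffold) if contig]


def gap_fraction(scaffolds):
    """Fraction of all scaffold bases that are gap (N) bases."""
    total = sum(len(s) for s in scaffolds)
    gaps = sum(s.upper().count("N") for s in scaffolds)
    return gaps / total if total else 0.0


# Hypothetical toy scaffolds.
scaffolds = ["ACGTNNNNACGTAC", "GGNNNNNNCC"]
print(split_contigs(scaffolds[0]))
print(f"{gap_fraction(scaffolds):.0%} gap bases")
```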

Cod: gaps

[Figure labels: Contig 1; Polymorphic contig 2 (shown twice); Polymorphic contig 3 (shown twice); Contig 4]

Heterozygosity

Short Tandem Repeats

ACACACACACACACACACACACACACACACACACACACACACACACACACACACACACAACACACACACACACACACACACACACACACACACACACACACACACACACACACACACAACACACACACACACACACACACACACACACACACACACACACACACACACACACACACAACACACACACACACACACACACACACACACACACACACACACACACACACACACACACA
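Runs like the one above can be located with a back-referencing regular expression. This is a small sketch assuming a fixed repeat unit length and a hypothetical minimum of 5 copies; it only finds perfect repeats, so an interrupted run (such as one with an extra A, as in the example above) is reported as separate runs.

```python
import re


def find_strs(seq, unit_len=2, min_repeats=5):
    """Find perfect tandem repeats of a unit of unit_len bases.

    Returns a list of (start position, repeat unit, number of copies).
    """
    # ([ACGT]{unit_len}) captures one unit; \1{n,} requires at least n more copies.
    pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (unit_len, min_repeats - 1))
    return [
        (m.start(), m.group(1), len(m.group(0)) // unit_len)
        for m in pattern.finditer(seq)
    ]


# Hypothetical sequence: 8 copies of AC flanked by unique bases.
seq = "TTG" + "AC" * 8 + "GGA"
print(find_strs(seq))
```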

Cod: annotation

Ensembl
– 'repair' genes based on stickleback sequence
– ~22 000 genes

http://pre.ensembl.org/Gadus_morhua/

Cod 2: 2011-2012

Close the gaps
– increase contig size

Pseudochromosomes
– genetic linkage map
– scaffolds to 'chromosomes'
  • anchoring
  • ordering and orienting

Cod 2: strategy

New data
– Illumina reads
– longer 454 reads, ~700 bases
– PacBio reads?

Improved programs
– Newbler

New programs
– assembly
– gap closing

Many programs to choose from

Assembly competitions

Assemblathon 1
– simulated datasets
– ALLPATHS-LG – Broad Institute MIT (US)
– SOAPdenovo – BGI (China)
– SGA – Sanger Institute (UK)

Assembly competitions

Assemblathon 2
– real datasets
  • snake – Illumina only
  • cichlid fish – Illumina only
  • parrot
    – Illumina
    – 454 FLX+
    – PacBio

http://assemblathon.org/

In 2011

Most projects

Double haploid individual ✔

OR Inbred line ✔

BAC library ✔

Genetic map ✔

Physical map ✔

Cheap alternative: RAD-tag sequencing

Foundation of Illumina data
– 100x coverage Paired End reads (2x100bp)
– several Mate Pair libraries
  • 2kb, 3kb, 8kb, 10kb, bigger?
– this is now very cheap!

Fill gaps with long reads
– 454 or PacBio

Add lots of bioinformatics...

http://cores.montana.edu/index.php?page=bioinformatics-core-facility

Thank you!

lex.nederbragt@bio.uio.no

www.sequencing.uio.no
