Genome Assembly: the art of trying to make one BIG thing from millions of very small things Keith Bradnam @kbradnam Image from Wellcome Trust v1.1 June 2015

Genome assembly: the art of trying to make one big thing from millions of very small things (v1.1 - with notes)



This was a talk given at UC Davis on 15th June 2015 as part of a Bioinformatics Core teaching workshop.

Author: Keith Bradnam, Genome Center, UC Davis

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.


flickr.com/incrediblehow/

Overview


1. What is genome assembly?

2. Why is it difficult?

3. Why is it important?

4. How do we know if an assembly is any good?


What is genome assembly?

A genome assembly is an attempt to accurately represent an entire genome sequence from a large set of very short DNA sequences.


Using a piece of bioinformatics software is just like running an experiment. Just because you get an answer, it doesn't mean it will be the right answer. You should always be prepared to tweak some parameters and re-run the experiment.


The ideal goal would be to end up with complete sequences for each chromosome at each level of ploidy. E.g. diploid genomes would be assembled as two sets of genome sequences.


'Large' is a relative term. We would expect that advances in sequencing technology would mean that the number of sequences needed to assemble a genome is only ever going to decrease.


'Short' is also a relative term. As technology improves, we expect to see our input sequences get longer and longer until the steps of sequencing and assembly essentially merge into one process.


It's a bit like trying to do the hardest jigsaw puzzle you can imagine!


This is a jigsaw that I did for the benefit of your education! There are lots of analogies that can be made between assembling genomes, and assembling jigsaws.


Sometimes we assemble regions of jigsaws that are locally accurate, but globally misplaced (the top region circled in red). Sometimes we also assemble regions and leave them to one side as we don't know where they should go. Many 'finished' genome assemblies include sets of 'unanchored' sequences that are not positioned on any chromosome.


Let's keep working on our jigsaw.


Repetitive regions are a big problem for genome assembly


The hardest parts of a jigsaw tend to be repetitive regions (skies, sea, forests etc.). The same is true for genome assemblies.
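To make the repeat problem concrete, here is a minimal sketch using a made-up toy genome (the sequences and the helper name `find_all` are hypothetical, and this is not how any real assembler works). A read that lies entirely inside a repeat matches the genome in more than one place, so its true origin is ambiguous; a read that extends into unique flanking sequence matches only once.

```python
def find_all(genome: str, read: str) -> list:
    """Return every 0-based position at which `read` occurs in `genome`."""
    positions = []
    start = genome.find(read)
    while start != -1:
        positions.append(start)
        start = genome.find(read, start + 1)
    return positions


# Toy genome (hypothetical): the same 8 bp repeat appears twice.
REPEAT = "ACGTACGT"
genome = "TTGCA" + REPEAT + "GGATCC" + REPEAT + "CATTG"

read_in_repeat = "GTACG"         # lies entirely inside the repeat
read_with_flank = "TACGTGGA"     # runs from the repeat into unique sequence

print(find_all(genome, read_in_repeat))   # two candidate positions
print(find_all(genome, read_with_flank))  # one position
```

This is why reads that are shorter than a repeat are so hard to place: nothing in the read itself says which copy it came from.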


Certain information can help pair together regions


Sometimes we can use information to pair together two different completed sections of a jigsaw. In this case, we can use our understanding of what a bridge looks like to give us an approximate spacing between the two completed sections at the top of this puzzle. We do similar things with genome assemblies and also end up inserting approximately sized gaps between regions of sequence.
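As a rough illustration of inserting an approximately sized gap, the sketch below joins two assembled regions with a run of unknown bases (Ns). The function name, the toy sequences, and the gap size of 10 are all assumptions made for this example; how the gap size gets estimated in practice is outside the scope of this sketch.

```python
def join_with_gap(left: str, right: str, estimated_gap: int) -> str:
    """Join two assembled regions, padding the estimated gap with Ns."""
    if estimated_gap < 0:
        raise ValueError("estimated_gap must be non-negative")
    return left + "N" * estimated_gap + right


# Hypothetical toy sequences and gap estimate.
region_a = "ACGTTGCA"
region_b = "GGATCCTA"
joined = join_with_gap(region_a, region_b, estimated_gap=10)

print(joined)
print(len(joined))
```

The Ns record that something sits between the two regions, and roughly how much, without claiming to know what the sequence is.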


Is this good enough?


For a jigsaw, we would never ever call this 'finished', but for a genome assembly this would represent an almost perfect sequence! All of the main details are present, you can identify what the picture is showing (San Francisco), the edges are detailed enough that we can accurately calculate the size of the jigsaw, and the parts that are missing are mostly minor details.


We often end up with some missing pieces


We often try to fit pieces in the wrong way


Jigsaws often end up with a few missing pieces meaning that it is impossible to complete the puzzle. Genome assemblies also end up with missing pieces because they were never in the input set of sequences to begin with. This is because not all sequencing technologies capture all locations in a genome.


We never get to this point with genome assembly!

With the exception of bacterial genomes, we never reach this point with genome assembly. All published eukaryotic genomes are incomplete and contain errors. Yeast (Saccharomyces cerevisiae) and worm (Caenorhabditis elegans) are perhaps the best examples we have of a near-complete reference genome for a eukaryotic species.


Why is it difficult?

World's largest assembled genome

• Loblolly pine (Pinus taeda)

• 22 Gbp genome!

• ~80% repetitive

• 64x coverage

from tulsalandscape.com


This gargantuan effort featured the work of many people at UC Davis, led by the efforts of David Neale's group.


What does 64x coverage mean?

Over 1.4 trillion bp of DNA were sequenced!

I.e. they had to use 64 times as much input DNA as they ended up with in the final output. Imagine if baking a cake was like this, and you had to use 64x as many ingredients in order to make one cake.

Some genome assembly projects are done with >100x coverage.
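The arithmetic behind the "1.4 trillion bp" figure can be checked with a short sketch. The genome size and coverage come from the slide; the function names are just illustrative.

```python
def bases_needed(genome_size_bp: int, coverage: int) -> int:
    """Total bases that must be sequenced to reach a given coverage."""
    return genome_size_bp * coverage


def coverage_from_bases(total_bases: int, genome_size_bp: int) -> float:
    """Coverage implied by a given amount of sequence data."""
    return total_bases / genome_size_bp


PINE_GENOME_BP = 22_000_000_000  # 22 Gbp, from the slide

total = bases_needed(PINE_GENOME_BP, 64)
print(f"{total:,} bp")                             # 1,408,000,000,000 bp
print(coverage_from_bases(total, PINE_GENOME_BP))  # 64.0
```

At 100x coverage the same genome would need 2.2 trillion bp of input.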

Biological challenges for genome assembly

Problem | Description
Repeats | Many plant and animal genomes mostly consist of repetitive sequences, some of which are longer than the length of sequencing reads.
Ploidy | For many species, you have at least two copies of the genome present. Level of heterozygosity is important.
Lack of reference genome | Reference-assisted assembly is a much easier problem than de novo assembly. Even having a genome from a closely related species can help.


Ploidy is often a much bigger problem for plant genomes. E.g. some wheat species are hexaploid. Genome assembly is sometimes performed on a genome for which we already have a reference (e.g. if you sequenced your own genome, you could align it to the human reference sequence). Otherwise, we are talking about de novo assembly which is much, much harder.


from amazon.com


Returning to the jigsaw analogy…every jigsaw puzzle comes with a picture of the puzzle on the box. This is a luxury not always available to genome assemblers.


When we are doing de novo assembly, it is a bit like doing a jigsaw without knowing what it will look like.


Even with de novo assembly, we may have a distant relative with a known genome sequence that can help with the assembly. A bit like assembling a jigsaw using a blurred picture as a guide.


Jigsaws tell you how many pieces are in the puzzle (and what the dimensions of the puzzle will be). We don't always know this for genome assembly. There are measures for determining how big a genome might be, but these methods can sometimes be misleading.

[Figure: C-value (pg) estimates, on an axis from 2.0 to 4.0, for studies dated ?, 1949, 1959, 1971, 1972, 1980, 1981, 1983, 1985, 1990, 1994 and 1998. Data from genomesize.com]


These are experimental estimates of the mouse genome size (taken from the animal genome size database). There is a lot of variation! Many organisms only have one experimental estimate of how big their genome is.
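C-values are masses in picograms, so it can help to convert them into base pairs. The sketch below uses the widely quoted conversion of 1 pg of DNA to roughly 978 Mbp; that factor is an assumption brought in for illustration and does not come from the talk. The values converted are the axis ticks of the plot, not the individual estimates.

```python
BP_PER_PG = 0.978e9  # assumed conversion: 1 pg of DNA is about 978 Mbp


def c_value_to_gbp(c_value_pg: float) -> float:
    """Convert a C-value in picograms to an approximate genome size in Gbp."""
    return c_value_pg * BP_PER_PG / 1e9


# Axis ticks from the plot, not the actual published estimates.
for c_value in (2.0, 2.5, 3.0, 3.5, 4.0):
    print(f"{c_value:.1f} pg is about {c_value_to_gbp(c_value):.2f} Gbp")
```

Under this conversion, the spread between the lowest and highest tick on the plot is close to 2 Gbp, which shows how much uncertainty a single experimental estimate can hide.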

Other challenges for genome assembly

Problem | Description
Cost | In 2014 Illumina claimed the $1,000 genome barrier had been broken (if you first spend ~$10 million on hardware).
Library prep | A critical, and often overlooked, step in the process.
Sequence diversity | Illumina, 454, Ion Torrent, PacBio, Oxford Nanopore: which mix of sequence data will you be using?
Hardware | Some genome assemblers have very high CPU/RAM requirements. Might need specialized cluster.
Expertise | Not always easy to even get assembly software installed, let alone understand how to run it properly.
Software | There is a lot of choice out there.

The PRICE genome assembler has 52 command-line options!!!


This is probably not the most complex, nor the most simple, genome assembler that is out there. But how much time do you have to explore some of those 52 parameters that could affect the resulting genome assembly?


You may need more than one tool

via Shaun Jackman

Modern genome assembly pipelines don't always rely on a single tool. This pipeline consists of many different programs.



There are over 125 different tools available to help assemble a genome!

Not all of these are comprehensive genome assemblers; some are tools to help with specific aspects of the assembly process, or to help evaluate genome assemblies.

Still, this represents a bewildering amount of choice.


bambus2, Ray, Celera, MIRA, ALLPATHS-LG, SGA, Curtain, Metassembler, Phusion, ABySS, Amos, Arapan, CLC, Cortex, DNAnexus, DNA Dragon, Edena, Forge, Geneious, IDBA, Newbler, PRICE, PADENA, PASHA, Phrap, TIGR, Sequencher, SeqMan NGen, SHARCGS, SOPRA, SSAKE, SPAdes, Taipan, VCAKE, Velvet, Arachne, PCAP, GAM, Monument, Atlas, ABBA, Anchor, ATAC, Contrail, DecGPU, GenoMiner, Lasergene, PE-Assembler, Pipeline Pilot, QSRA, SeqPrep, SHORTY, fermi, Telescoper, Quast, SCARPA, Hapsembler, HapCompass, HaploMerger, SWiPS, GigAssembler, MSR-CA, MaSuRCA, GARM, Cerulean, TIGRA, ngsShoRT, PERGA, SOAPdenovo, REAPR, FRCBam, EULER-SR, SSPACE, Opera, mip, gapfiller, image, PBJelly, HGAP, FALCON, Dazzler, GGAKE, A5, CABOG, SHRAP, SR-ASM, SuccinctAssembly, SUTTA, Ragout, Tedna, Trinity, SWAP-Assembler, SILP3, AutoAssemblyD, KGBAssembler, MetAMOS, iMetAMOS, MetaVelvet-SL, KmerGenie, Nesoni, Pilon, Platanus, CGAL, GAGM, Enly, BESST, Khmer, GRIT, IDBA-MTP, dipSPAdes, WhatsHap, SHEAR, ELOPER, OMACC, Omega, GABenchToB, HiPGA, SAGE, HyDA-Vista, MHAP, Mapsembler 2, GAML, SAT-Assembler, RAMPART, VICUNA, CloudBrush

Which tool will you use?


This slide was made in 2014, and so is already out of date!


These six assembly tools were published in one month in 2014!


Before you assemble…

• You should remove adapter contamination

• You should remove sequence contamination

• You should trim sequences for low quality regions


After we have generated the raw sequence data, we still must run a few basic steps to clean up our data prior to assembly. How straightforward are these steps?
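To give a feel for what "trimming for low quality regions" involves, here is a deliberately simple sketch that removes low-quality bases from the 3' end of a single read. It assumes Phred+33 quality encoding and a cutoff of Q20; both are illustrative assumptions, the function name is hypothetical, and real trimming tools typically apply more elaborate rules than this.

```python
def trim_3prime(seq: str, qual: str, min_q: int = 20, offset: int = 33):
    """Drop trailing bases whose Phred quality score is below min_q.

    Assumes `qual` is an ASCII-encoded quality string (Phred+33 by default).
    """
    if len(seq) != len(qual):
        raise ValueError("sequence and quality strings differ in length")
    end = len(seq)
    while end > 0 and ord(qual[end - 1]) - offset < min_q:
        end -= 1
    return seq[:end], qual[:end]


# Hypothetical read: 'I' encodes Q40, '#' encodes Q2, '!' encodes Q0.
seq = "ACGTACGTAC"
qual = "IIIIIIII#!"

trimmed_seq, trimmed_qual = trim_3prime(seq, qual)
print(trimmed_seq)
print(trimmed_qual)
```

Even in this toy version there are two parameters to choose (the cutoff and the encoding), which hints at why these "basic" steps are less straightforward than they sound.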



Tools for removing adapter contamination from sequences

There are at least 34 different tools!

One of these tools has 27 different command-line options


Even the first step of removing adapter contamination is something for which you could spend a lot of time researching different software choices.


Why is it important?


Saccharomyces cerevisiae

• 12 Mbp genome

• Published in 1997

• First eukaryotic genome sequence

Not the first published genome (several bacterial genomes were sequenced in the preceding couple of years), but this was the first eukaryotic genome sequence. Furthermore, this genome sequence has undergone continual improvements and corrections since publication (the last set of changes was in 2011).

Caenorhabditis elegans

• ~100 Mbp genome

• Published in 1998

• First animal genome sequence

Arabidopsis thaliana

• First plant genome sequence

• Published in 2000

• Size?

• 2000 = 125 Mbp

• 2007 = 157 Mbp

• 2012 = 135 Mbp

As alluded to earlier, we don't always know for sure how big (or small) a genome is. The Arabidopsis genome size has been corrected upwards and downwards since publication. The amount of sequenced information as of today is about 119 Mbp. And this is for the best understood plant genome that we know about!

Homo sapiens

• ~3 Gbp genome

• Finished?

• 'working draft' announced in 2000

• 'working draft' published in 2001

• completion announced in 2003

• complete sequence published in 2004

The human genome has also undergone improvements since the (many) announcements regarding its completion (or near completion). There are only a small number of species for which there is a dedicated group of people who seek to continually improve the genome sequence and get closer to 'the truth'.


The 100,000 genomes project

There are lots of ongoing genome sequencing projects

i5k Insect and other Arthropod Genome Sequencing Initiative

Bigger numbers must be better, right? Some projects sequence genomes to align back to a reference to look for the differences, others seek to characterize genomes for which we have very little genomic information. The 100,000 genomes project in England heralds the start of the mass sequencing of patients to understand disease.


We no longer have one genome per species

• We have genome sequences representing different strains and varieties of a species

• We have genome sequences from multiple individuals of a species

• We have multiple genomes from different tissues of the same individual (e.g. cancer genomes)


Also, in the near future we can imagine having your genome sequenced at birth (from different tissues) and getting 'genome health checks' throughout your life.


There is no point sequencing so many genomes if we can't accurately assemble them!


Sequencing genomes is relatively easy. Putting that information together in a meaningful way so as to make it useful to others…that's not so easy.


Bad genome assemblies #1

Length of 10 shortest sequences: 100, 100, 99, 88, 87, 76, 73, 63, 12, and 3 bp!

The average vertebrate gene is about 25,000 bp


Everyone wants long sequences in a genome assembly. This may not always matter, but in most cases they should hopefully be long enough to contain at least one gene.

These data are from a vertebrate genome sequence that someone asked me to look at. Over half of the genome assembly was represented by sequences less than 150 bp! This is not much use to anyone.
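A check like "how much of the assembly sits in very short sequences?" is easy to run on any assembly. The sketch below is one way to do it, given a list of sequence lengths. The ten shortest lengths are the ones from the slide; the remaining lengths are invented purely so the example runs, and the 150 bp threshold mirrors the figure quoted above.

```python
def fraction_in_short_sequences(lengths, threshold):
    """Fraction of the total assembly length held in sequences < threshold bp."""
    total = sum(lengths)
    if total == 0:
        return 0.0
    short = sum(n for n in lengths if n < threshold)
    return short / total


# Ten shortest lengths from the slide, plus made-up lengths for the rest.
slide_lengths = [100, 100, 99, 88, 87, 76, 73, 63, 12, 3]
lengths = [5000] + [120] * 50 + slide_lengths

print(sorted(lengths)[:10])
print(f"{fraction_in_short_sequences(lengths, 150):.1%} of the assembly is in sequences under 150 bp")
```

For comparison, a single average vertebrate gene (about 25,000 bp) would not fit in any sequence of this toy assembly.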


Bad genome assemblies #2

Ns = 91% !!!

Genome sequences usually contain unknown bases (Ns)


From another assembly that I was asked to look at. Even the 9% of the genome which wasn't an 'N' was split into tiny little fragments. Completely unusable information.
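The proportion of unknown bases is another quick check worth running before trusting an assembly. Here is a minimal sketch that works on sequences already loaded as strings; the toy sequences and the function name are hypothetical.

```python
def percent_n(sequences):
    """Percentage of all bases across `sequences` that are N (unknown)."""
    total = sum(len(s) for s in sequences)
    if total == 0:
        return 0.0
    n_count = sum(s.upper().count("N") for s in sequences)
    return 100.0 * n_count / total


# Hypothetical toy assembly: 20 of its 30 bases are Ns.
toy_assembly = ["ACGTNNNNNN", "NNNNNNNNNN", "NNNNACGTAC"]
print(f"Ns = {percent_n(toy_assembly):.1f}%")
```

A high percentage on its own does not say how fragmented the known sequence is, so it is worth looking at this alongside the lengths of the stretches between the Ns.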


Has anyone compared different assemblers to work out which is the best?


Assemblathons


A genome assembly competition


This was a genome assembly assessment exercise that I was involved with.


@assemblathon

It spawned a sequel.


Coming soon to a cinema near you!

Work is currently underway to organize a third Assemblathon effort.

Assemblathon 2

Published in GigaScience, 2013

Bird

Snake

Fish

Page 100: Genome assembly: the art of trying to make one big thing from millions of very small things (v1.1 - with notes)

Bird

Snake

Fish

Three species were used in Assemblathon 2. A budgie, a Lake Malawi cichlid fish, and a boa constrictor snake.

Page 102: Genome assembly: the art of trying to make one big thing from millions of very small things (v1.1 - with notes)

Assemble this!

Species  Estimated genome size  Illumina             Roche 454          PacBio
Bird     1.2 Gbp                285x (14 libraries)  16x (3 libraries)  10x (2 libraries)
Fish     1.0 Gbp                192x (8 libraries)   -                  -
Snake    1.6 Gbp                125x (4 libraries)   -                  -

Lots of sequence data were provided, especially for the bird.

Page 103: Genome assembly: the art of trying to make one big thing from millions of very small things (v1.1 - with notes)

3 species 21 teams

43 assemblies 52 Gbp of sequence!

Page 104: Genome assembly: the art of trying to make one big thing from millions of very small things (v1.1 - with notes)

Goals

• Assess 'quality' of genome assemblies

• Identify the best assemblers

• First need to define quality!

Page 106: Genome assembly: the art of trying to make one big thing from millions of very small things (v1.1 - with notes)

Who makes the best pizza in Davis?

An easy question to ask, but maybe not as straightforward as it seems…

Page 108: Genome assembly: the art of trying to make one big thing from millions of very small things (v1.1 - with notes)

Who makes the best pizza in Davis?

Freshest?

Cheapest?

Biggest?

Gluten free?

Healthiest?

Choice of toppings?

Free sodas?

Delivery time?

Tastiest?

'Best' is subjective. If you are intolerant to gluten, then the best pizza place will be the one that makes gluten-free pizzas.

Even if you focus on who makes the best 'tasting' pizzas, this is still very subjective.

Page 112: Genome assembly: the art of trying to make one big thing from millions of very small things (v1.1 - with notes)

Who makes the best genome assembly?

Image from flickr.com/dullhunk/

But surely this is not such a subjective topic when it comes to genome assembly?

Page 114: Genome assembly: the art of trying to make one big thing from millions of very small things (v1.1 - with notes)

Who makes the best genome assembly?

Longest contigs?

Fewest errors?

Lowest CPU demands?

Best deals with repeats?

Contains most genes?

Fastest?

Best resolves heterozygosity?

Easiest to install?

Longest scaffolds?

Image from flickr.com/dullhunk/

It is less subjective, but there are still many different ways we can think of when trying to determine what makes a good genome assembly.

The best assembler in the world may be no use to anyone if people can't get it installed and understand how it should be run.

Page 117: Genome assembly: the art of trying to make one big thing from millions of very small things (v1.1 - with notes)

flickr.com/incrediblehow/

Metrics

Page 119: Genome assembly: the art of trying to make one big thing from millions of very small things (v1.1 - with notes)

Metric                            Notes
Assembly size                     How does it compare to expected size?
Number of sequences               How fragmented is your assembly?
N50 length (contigs & scaffolds)  Making contigs and making scaffolds are two different skills.
NG50 scaffold length              Becoming more common to see this used.
Coverage                          How much of some reference sequence is present in your assembly?
Errors                            Errors in alignment of assembly to reference sequence or to input read data.
Number of genes                   From comparison to reference transcriptome and/or set of known genes

This is a very brief summary that lists just some of the ways in which you could describe your genome assembly.

Page 121: Genome assembly: the art of trying to make one big thing from millions of very small things (v1.1 - with notes)

Assembly size

[Bar chart: total assembly size in bp (axis from 0 to 2,000,000,000) for each of the Assemblathon 2 bird genome assemblies, labeled A to M]

In Assemblathon 2, one assembly of the bird genome (a parrot) was very, very small. Conversely, one assembly was almost twice the size of the estimated genome (~1.2 Gbp). Bigger is not always better when it comes to assembly size.

Page 122: Genome assembly: the art of trying to make one big thing from millions of very small things (v1.1 - with notes)

Using core genes

• All genomes perform some core functions (transcription, replication, translation etc.)

• Proteins involved tend to be highly conserved

• They should be present in every genome

Page 124: Genome assembly: the art of trying to make one big thing from millions of very small things (v1.1 - with notes)

CEGMA

This was an approach developed by our lab, originally to find a handful of genes in a newly sequenced genome which could be used to train a species-specific gene finder. We then adapted the technique to assess the gene space of a draft genome.

Page 126: Genome assembly: the art of trying to make one big thing from millions of very small things (v1.1 - with notes)

What is CEGMA?

• CEGMA (Core Eukaryotic Gene Mapping Approach)

• defines a set of 248 'Core Eukaryotic Genes' (CEGs)

• CEGs identified from genomes of: S. cerevisiae, S. pombe, A. thaliana, C. elegans, D. melanogaster, and H. sapiens

• How many full-length CEGs are present in an assembly?

We expect these 248 genes to be present in all eukaryotes. CEGMA uses a combination of software tools to find these genes. The number of core genes present is assumed to reflect the proportion of all genes that are present in the assembly. Sometimes genes are split across contigs or scaffolds; CEGMA can find some of these and reports them as partial matches.

Page 128: Genome assembly: the art of trying to make one big thing from millions of very small things (v1.1 - with notes)

Here are N50 scaffold lengths and number of core genes present in a variety of genomes that I have looked at. There is a lot of variation. Some assemblies might give you longer sequences (higher N50 values), but this is no guarantee that those assemblies will contain more gene sequences. Likewise, assemblies with more gene sequences may not necessarily have longer sequences.

Page 130: Genome assembly: the art of trying to make one big thing from millions of very small things (v1.1 - with notes)

Should you use CEGMA?

• CEGMA is not easy to install

• It is old and somewhat out of date

• You could use other transcript/protein data sets instead of CEGMA

The principle of CEGMA could be used with a variety of different data. Maybe there are a small number of full-length mRNAs available for your species of interest. If you have multiple genome assemblies, you could simply see how they differ with respect to the presence of those genes.

Page 132: Genome assembly: the art of trying to make one big thing from millions of very small things (v1.1 - with notes)

BUSCO is a recently developed tool that works along similar lines to CEGMA.

Page 134: Genome assembly: the art of trying to make one big thing from millions of very small things (v1.1 - with notes)

Other tools for evaluating assemblies

FRCbam (2012) REAPR (2013) kPAL (2014)

Just as it seems increasingly popular to develop new genome assemblers, there is a growing demand (and supply) for tools to evaluate genome assemblies. Here are three recent ones.

Page 137: Genome assembly: the art of trying to make one big thing from millions of very small things (v1.1 - with notes)

102 metrics per assembly

10 key metrics

1 final ranking

Starting from 102 metrics per assembly, the entries were ultimately judged on 10 'key' metrics that largely captured different aspects of an assembly's 'quality'. The results from these 10 were combined into a single overall ranking (for each species).

Page 139: Genome assembly: the art of trying to make one big thing from millions of very small things (v1.1 - with notes)

And the winner is…

• No winner!

• Some assemblers seemed to work well for one species, but not for other species

• Some assemblies were good, as measured by one metric, but not when measured by others

This result was disappointing to many who were hoping that we would provide a resounding endorsement for assembler 'X'.

Page 141: Genome assembly: the art of trying to make one big thing from millions of very small things (v1.1 - with notes)

Assembly  Number of core genes  Rank  Z-score
CRACS     438                   1     +0.68
SYMB      436                   2     +0.59
PHUS      435                   3     +0.54
BCM       434                   4     +0.49
SGA       433                   5     +0.44
MERAC     430                   6     +0.30
ABYSS     429                   7     +0.25
SOAP      428                   8     +0.21
RAY       422                   9     –0.08
GAM       415                   10    –0.41
CURT      360                   11    –3.02

Here are the CEGMA results. As well as ranking the assemblies by each metric, we calculated a Z-score for each metric (how many standard deviations each assembly was from the average) and summed the Z-scores to generate the final rankings.
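
As a rough illustration of the Z-score idea, here is a minimal Python sketch that recomputes the Z-score column from the core gene counts in the table. It assumes the population standard deviation was used (that choice is an assumption, although it appears to reproduce the values shown on the slide); the function name z_scores is made up for this example.

```python
from statistics import mean, pstdev

# Core gene counts copied from the CEGMA results table.
core_genes = {
    "CRACS": 438, "SYMB": 436, "PHUS": 435, "BCM": 434, "SGA": 433,
    "MERAC": 430, "ABYSS": 429, "SOAP": 428, "RAY": 422, "GAM": 415,
    "CURT": 360,
}


def z_scores(values):
    """Return (value - mean) / standard deviation for each named entry.

    Assumption: population standard deviation (pstdev), not the sample one.
    """
    numbers = list(values.values())
    mu = mean(numbers)
    sigma = pstdev(numbers)
    return {name: (value - mu) / sigma for name, value in values.items()}


z = z_scores(core_genes)
for name in sorted(z, key=z.get, reverse=True):
    print(f"{name:6s} {core_genes[name]:4d} {z[name]:+.2f}")
```

Because a Z-score is unitless, scores from metrics measured on very different scales (gene counts, scaffold lengths, error rates) can be added together, which is what makes a summed ranking possible.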

Page 143: Genome assembly: the art of trying to make one big thing from millions of very small things (v1.1 - with notes)

The SGA team initially produced what looked like a clear winner for the snake competition. Error bars show the maximum and minimum Z-scores that would be produced if any combination of 9 of the 10 metrics were used.
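
One way to picture those error bars is as a leave-one-out calculation: drop each metric in turn and re-sum the remaining nine Z-scores. The sketch below is only an illustration of that idea. The Z-score values and the names metric_1 to metric_10 are invented, and it assumes the per-metric Z-scores are simply summed with no rescaling.

```python
def summed_z_range(z_by_metric):
    """Return (total, lowest, highest) summed Z-score for one assembly.

    total   : sum over all metrics
    lowest  : smallest sum obtained when any single metric is left out
    highest : largest sum obtained when any single metric is left out
    """
    total = sum(z_by_metric.values())
    leave_one_out = [total - z for z in z_by_metric.values()]
    return total, min(leave_one_out), max(leave_one_out)


# Invented Z-scores for one hypothetical assembly across 10 metrics.
example = {
    "metric_1": 1.5, "metric_2": 0.5, "metric_3": -0.25, "metric_4": 1.0,
    "metric_5": 0.75, "metric_6": 0.25, "metric_7": 1.25, "metric_8": 2.0,
    "metric_9": 0.5, "metric_10": 1.0,
}

total, low, high = summed_z_range(example)
print(f"all 10 metrics: {total}, any 9 metrics: {low} to {high}")
```

A narrow range suggests the overall result does not hinge on any single metric; a wide one suggests that one metric is doing most of the work.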

Page 145: Genome assembly: the art of trying to make one big thing from millions of very small things (v1.1 - with notes)

Assemblathon 2 metric                         Rank of snake SGA assembly
NG50 scaffold length                          2
NG50 contig length                            5
Amount of assembly in 'gene-sized' scaffolds  7
Number of 'core genes' present                5
Fosmid coverage                               2
Fosmid validity                               2
Short-range scaffold accuracy                 3
Optical map: level 1                          2
Optical map: levels 1–3                       1
REAPR summary score                           2

Even though the SGA entry ranked 1st overall, it only ranked 1st in one individual metric. So it was a good assembler, on average.

Page 146: Genome assembly: the art of trying to make one big thing from millions of very small things (v1.1 - with notes)

flickr.com/incrediblehow/

The long and short of it

Page 148: Genome assembly: the art of trying to make one big thing from millions of very small things (v1.1 - with notes)

Technology           Date        Typical read lengths
Sanger               ~1970–2000  750–1,000 bp
Solexa/Illumina      ~2005       ~25 bp
Illumina             ~2014       ~150–250 bp
Pacific Biosciences  ~2014       10–15 Kbp
Oxford Nanopore      ~2014       5–??? Kbp
Revolocity           2015        28 bp

Different technologies produce reads with very different length distributions, and these technologies also increase the length of reads over time. Perhaps more importantly, different technologies have different error profiles (where errors occur in reads and types of error).

Page 150: Genome assembly: the art of trying to make one big thing from millions of very small things (v1.1 - with notes)

N50 length

The most widely used statistic for genome assemblies

First described in human genome paper (2001)

The N50 length is the length of the sequence that takes the running total past 50% of the total assembly size, when sequence lengths are summed from longest to shortest.
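
The definition is easier to follow as a few lines of code. This is a minimal sketch rather than a reference implementation: the function name n50 and the toy lengths are made up, and it treats reaching exactly 50% as enough (some tools may handle that tie differently).

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    Lengths are summed from longest to shortest; the N50 is the length of
    the sequence at which the running total reaches half the assembly size.
    """
    ordered = sorted(lengths, reverse=True)
    half_of_assembly = sum(ordered) / 2
    running_total = 0
    for length in ordered:
        running_total += length
        if running_total >= half_of_assembly:
            return length
    raise ValueError("no sequence lengths were given")


# Toy assembly: total size 300, so the 50% mark is at 150.
toy_lengths = [20, 100, 60, 80, 40]
print(n50(toy_lengths))  # 100 + 80 = 180, which passes 150, so N50 is 80
```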

Page 152: Genome assembly: the art of trying to make one big thing from millions of very small things (v1.1 - with notes)

NG50 length

Use NG50 when making comparisons between genome assemblies because N50 can be biased

Be warned…some people obsess over N50!

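In the Assemblathon contests, we used a new measure (NG50) which enables a fairer comparison between different assemblies (of the same genome). It is calculated against the estimated genome size rather than against each assembly's own total size.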
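
A small sketch of why this matters, using invented toy numbers: the hypothetical function ng50 below takes the genome size as a parameter, so passing an assembly's own total size gives the ordinary N50. Returning None when an assembly covers less than half of the genome is a choice made for this example.

```python
def ng50(lengths, genome_size):
    """Return the NG50: like N50, but the 50% mark is half the genome size.

    Returns None if the assembly is smaller than half the genome size.
    """
    half_of_genome = genome_size / 2
    running_total = 0
    for length in sorted(lengths, reverse=True):
        running_total += length
        if running_total >= half_of_genome:
            return length
    return None


# Two toy assemblies of the same (invented) 400 bp genome.
assembly_a = [100, 80, 60, 40, 20]   # total 300
assembly_b = [100, 50]               # total 150, most of the genome missing

for name, lengths in (("A", assembly_a), ("B", assembly_b)):
    plain_n50 = ng50(lengths, sum(lengths))  # own size as reference = N50
    print(name, "N50:", plain_n50, "NG50:", ng50(lengths, 400))
```

In this toy case assembly B has the higher N50 only because it is so incomplete, whereas its NG50 cannot even be calculated. That is the kind of bias NG50 is meant to remove.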

Page 153: Genome assembly: the art of trying to make one big thing from millions of very small things (v1.1 - with notes)

flickr.com/incrediblehow/

What you can do

Page 154: Genome assembly: the art of trying to make one big thing from millions of very small things (v1.1 - with notes)

My #1 piece of advice

flickr.com/julia_manzerova

Page 156: Genome assembly: the art of trying to make one big thing from millions of very small things (v1.1 - with notes)

flickr.com/thomashawk

Look at your data!

Before you do anything with your assembly, look at it closely. Look at the distribution of lengths (not just N50). Look at the %N. Are the sequences long enough to contain genes? Are the shortest sequences just unassembled reads?
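
As a starting point for this kind of inspection, here is a minimal, standard-library-only Python sketch. The function names, the toy FASTA text, and the bin size are all invented for illustration; for a real assembly you would open the FASTA file instead of the in-memory string and use a much larger bin size (for example 1,000 bp).

```python
import io
from collections import Counter


def read_fasta(handle):
    """Yield (name, sequence) pairs from FASTA-formatted text."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:].strip(), []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def summarize(records, bin_size=10):
    """Collect basic statistics: counts, sizes, %N and a length histogram."""
    lengths, n_bases = [], 0
    for _, seq in records:
        lengths.append(len(seq))
        n_bases += seq.upper().count("N")
    total = sum(lengths)
    return {
        "sequences": len(lengths),
        "total_bp": total,
        "percent_N": 100 * n_bases / total if total else 0.0,
        "shortest": min(lengths, default=0),
        "longest": max(lengths, default=0),
        # Histogram keyed by the lower edge of each length bin.
        "length_bins": Counter((n // bin_size) * bin_size for n in lengths),
    }


toy_fasta = (
    ">scaffold_1\nACGTACGTNNNNACGTACGT\n"
    ">scaffold_2\nACGTAC\nGTAC\n"
    ">scaffold_3\nNNNNN\n"
)
stats = summarize(read_fasta(io.StringIO(toy_fasta)))
for key, value in stats.items():
    print(key, value)
```

A pile-up of sequences in the shortest length bin, at around the read length, is one hint that the shortest sequences may just be unassembled reads.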

Page 157: Genome assembly: the art of trying to make one big thing from millions of very small things (v1.1 - with notes)

flickr.com/incrediblehow/

The future of genome assembly

Page 159: Genome assembly: the art of trying to make one big thing from millions of very small things (v1.1 - with notes)

As new sequencing technologies mature, the associated tools also get developed. People have recently published a de novo (bacterial) genome assembly using data from the very new Oxford Nanopore MinION platform.

Page 160: Genome assembly: the art of trying to make one big thing from millions of very small things (v1.1 - with notes)

Long read technologies

Page 162: Genome assembly: the art of trying to make one big thing from millions of very small things (v1.1 - with notes)

Other companies are developing promising long-range technologies, which will be a great resource for genome assemblers.

Page 164: Genome assembly: the art of trying to make one big thing from millions of very small things (v1.1 - with notes)

There are more companies out there, waiting to make their big entrance on the world stage of genome sequencing and assembly.

Page 166: Genome assembly: the art of trying to make one big thing from millions of very small things (v1.1 - with notes)

Many of these companies promise the same key features.

Page 167: Genome assembly: the art of trying to make one big thing from millions of very small things (v1.1 - with notes)

flickr.com/incrediblehow/

Summary

Page 168: Genome assembly: the art of trying to make one big thing from millions of very small things (v1.1 - with notes)

In conclusion…

• Genome assembly is not a solved problem

• If possible, try different genome assemblers

• Don't rely on one metric to assess quality

• Different metrics assess different aspects of quality

• Look at your genome assembly!

Page 169: Genome assembly: the art of trying to make one big thing from millions of very small things (v1.1 - with notes)

Other resources

Page 171: Genome assembly: the art of trying to make one big thing from millions of very small things (v1.1 - with notes)

http://acgt.me

@assemblathon

I frequently blog about some of the issues raised in this talk. I also use the @assemblathon Twitter account to publish links to lots of papers and other resources that are related to this field.

Page 173: Genome assembly: the art of trying to make one big thing from millions of very small things (v1.1 - with notes)

Lex Nederbragt, @lexnederbragt, flxlexblog.wordpress.com

Nick Loman, @pathogenomenick, pathogenomic.bham.ac.uk/blog

Mick Watson, @BioMickWatson, biomickwatson.wordpress.com

Keith Robison, @OmicsOmicsBlog, omicsomics.blogspot.com

These people have a lot of useful things to say about genome sequencing and assembly. Their blogs and twitter feeds are useful resources.

Page 174: Genome assembly: the art of trying to make one big thing from millions of very small things (v1.1 - with notes)

The end