Towards Gapless, Chromosome Scale, Haplotype Assemblies Matt Settles UC Davis Bioinformatics Core



Human Genome

• In 1990, the National Institutes of Health (NIH) and the Department of Energy joined with international partners to sequence the human genome.
• In April 2003, researchers successfully completed the Human Genome Project, under budget ($2.7B) and more than two years ahead of schedule.
• Thousands of people contributed to the Human Genome Project.
• Even so, there remain ~400 gaps in the human reference sequence assembly, representing hundreds of millions of bases.


Slowdown at UCSC of adding genomes

[Figure: Number of new genomes/versions added over time (UCSC); x-axis: Year, 2000 to 2015; y-axis: count, 0 to 20]


Renewed focus on genomes

• Sequencing has become more democratic. For example, it took more than 50 people, around a dozen centers, $50 million and half a decade to generate a draft chimpanzee genome, published in 2005. This year, Eichler's lab completed a gorilla sequence for about $70,000. "That, to me, is a big deal," he says.
• Also a big deal, says Eichler, is the quality of their sequences. An earlier version of a gorilla genome was published in 2012, but that was done with shorter pieces of DNA, and therefore left hundreds of thousands of gaps. His team used long-read technology, closed 90 percent of those gaps, and was able to complete many genes that were only partially sequenced in the first attempt.

Source: "Speed-reading the genome: Cheaper methods of sequencing are opening up doors for new research and new career paths." http://www.nature.com/naturejobs/science/articles/10.1038/nj0492


Gorilla Genome Assembly

                         2012 Illumina Assembly    2016 Pacific Biosciences Assembly
Total length             3,041,976,159 bp          3,080,414,926 bp
Contigs                  465,847                   16,073
Total contig length      2,829,670,843 bp          3,080,414,926 bp
Placed contig length     2,712,844,129 bp          2,790,620,487 bp
Unplaced contig length   116,826,714 bp            289,794,439 bp
Max. contig length       191,556 bp                36,219,563 bp
Contig N50               11.6 kb                   9.6 Mb
Scaffolds                22,164                    554
Max. scaffold length     10,247,101 bp             110,018,866 bp
Scaffold N50             914 kb                    23.1 Mb

2012 Assembly: ABI capillary sequence and short 35 bp Illumina sequence + BAC PE data
2015 Assembly: PacBio SMRT sequence + BAC PE data, INDEL corrected with Illumina sequence
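
The contig and scaffold N50 values in this comparison follow the usual definition: sort the sequences from longest to shortest and report the length at which the running total first reaches half of the assembled bases. A minimal sketch of that calculation is shown here; the function name `nx` and the toy lengths are made up for illustration and are not taken from either gorilla assembly.

```python
def nx(lengths, fraction=0.5):
    """Return the Nx statistic for a list of contig (or scaffold) lengths.

    Nx is the length L such that sequences of length >= L together hold
    at least `fraction` of all assembled bases (fraction=0.5 gives N50).
    """
    if not lengths:
        raise ValueError("no lengths given")
    threshold = fraction * sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= threshold:
            return length
    return min(lengths)  # only reached if fraction > 1


# Toy, made-up contig lengths (total 300 bases).
toy_contigs = [100, 80, 60, 40, 20]
n50 = nx(toy_contigs, 0.5)  # running totals 100, 180 -> first >= 150 at 80
n90 = nx(toy_contigs, 0.9)  # running totals 100, 180, 240, 280 -> first >= 270 at 40
```

Because N50 depends only on the length distribution, a jump from kilobase to megabase N50 says the assembly is far more contiguous, not that it is more correct.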


Advances in high-noise, long-read assembly algorithms

• Summer of 2015
• Pacific Biosciences Falcon assembler for SMRT assembly of large genomes
• Canu, a fork of Celera Assembler for single-molecule, high-noise sequences

• Key features:
  • Discard all reads shorter than X bp before loading into the overlapper; this step significantly reduces the number of reads being analyzed.
  • Self-correct reads from all-by-all overlaps (takes advantage of a cluster environment).
  • Build a graph based on high quality, long corrected reads.
  • "Polish" the resulting assembly using all reads; 60x coverage produces high quality final contigs.
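
One common way to pick the cutoff X in the first key feature is to keep only the longest reads that together reach a chosen coverage of the genome. The sketch below illustrates that idea under stated assumptions; `length_cutoff` is a hypothetical helper, not the actual Falcon or Canu code, and the read lengths are invented.

```python
def length_cutoff(read_lengths, genome_size, target_coverage):
    """Largest cutoff X such that reads of length >= X reach the target coverage.

    Hypothetical helper for illustration only. Returns 0 when the data set
    cannot reach the target, meaning every read should be kept.
    """
    needed_bases = target_coverage * genome_size
    running = 0
    for length in sorted(read_lengths, reverse=True):
        running += length
        if running >= needed_bases:
            return length
    return 0


# Invented read lengths (bp) against an invented 10 kb "genome".
toy_reads = [12000, 9000, 7000, 4000, 2000, 1000]
cutoff_2x = length_cutoff(toy_reads, 10_000, 2)    # 12000 + 9000 >= 20000
cutoff_3x = length_cutoff(toy_reads, 10_000, 3)    # needs reads down to 4000
cutoff_10x = length_cutoff(toy_reads, 10_000, 10)  # not enough data
```

Raising the target coverage lowers the cutoff, which trades a larger overlap computation for more reads available to correction.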


The 'Next, Next' Generation Sequencers (3rd Generation)

• 2009 – Single Molecule Real Time (SMRT) sequencing by Pacific Biosciences, the most successful third generation sequencing platform: RSII ~1 Gb/SMRT cell, new PacBio Sequel ~7 Gb/SMRT cell, read lengths near 100 Kb possible.


PacBio Advances (RSII vs Sequel)

California Condor data (~1.2 Gbp genome) based on 4 SMRT cells in Jan 2017

[Figure: read length histograms for RS2 data and Sequel data; x-axis: read length (0 to 100,000 bp); y-axis: frequency]

                    RS2        Sequel
Read count          448,767    1,947,684
N50                 10,426     4,293
Longest read        82,366     102,310
# reads > 12 Kb     217,691    754,157
Coverage > 12 Kb    3.64       12.165

Assembly (60 SMRT cells)

                       Falcon + Quiver polishing    Canu
Total assembly size    1,239,863,868                1,240,661,679
N90                    1,106,390                    1,080,915
N50                    17,286,884                   14,278,087
Number of contigs      1,164                        1,004
Largest contig         77,968,233                   45,704,690
Smallest contig        2,802                        1,812


Oxford Nanopore

• 2015 – Another 3rd generation sequencer, from a company founded in 2005. The sequencer uses nanopore technology developed in the 90's to sequence single molecules. Throughput is about 5-10 Gb per flowcell, capable of near 200 kb reads.

SmidgION: nanopore sensing for use with mobile devices

MinION, PromethION


Towards Gapless assemblies

Promise
• Continued progress on DNA input and the resulting PacBio/Oxford Nanopore read lengths and read depth will result in longer N50/N90 and fewer resulting contigs.
• Algorithms are still young and have room for improvement.

Issues
• Some mis-assemblies are still present; chimeric reads are an issue.
• Small INDELs are an issue and require cleanup (Illumina reads), especially within genes.


Scaffolding Options

Borrowed from Sergey Koren's talk at the PacBio Informatics Developer Meeting in Jan 2017


Bionano Irys/Saphyr

• The Irys/Saphyr system provides optical genome mapping, so there is no more waiting for months to get a physical genome map. Bionano Next-Generation Mapping (NGM) provides long-range information to reveal true genome structure. It helps extend genome assemblies to near chromosome-arm scale.

Not sequencing based


Dovetail and Hi-C (Cross Linking) on Illumina

Hi-C

Proximity Guided Scaffolding

Dovetail and Phase Genomics

Dovetail Chicago Libraries


10x Genomics on Illumina

• 10x Genomics, Linked reads technology
• Illumina machines, Sequencing by Synthesis, ~120 Gb/lane, 2x150 bp reads.

10x has its own assembler, Supernova


10x Genomics phasing + high quality Illumina data


10x Genomics, Supernova Genome Stats

Genome               Size (Gb)   DNA size (Kb)   N50 contig (Kb)   N50 scaffold (Mb)   N50 phase block (Mb)
NA12878              3.2         95.5            85.0              12.8                2.8
NA24385              3.2         111.3           90.0              10.4                3.9
HGP                  3.2         138.8           104.9             19.4                4.6
Yoruban              3.2         126.9           100.5             16.1                11.4
Komodo dragon        1.8         85.4            95.3              10.2                0.4
Spotted owl          1.5         72.2            118.3             10.1                0.2
Hummingbird          1.0         86.2            87.6              12.5                10.1
Monk seal            2.6         92.3            93.8              14.8                0.6
Chili pepper         3.5         53.3            84.7              4.0                 2.1
Cowpea               0.38        46.5            28.3              0.83                0.35
Walnut               0.89        55.0            48.0              0.60                0.25
California Condor    1.19        67.0            147.5             17.9                1.0


Genome Assembly is converging on more standardized data models

• Trend is to consider sample, data generation and bioinformatics together.
• ALLPATHS-LG started with specific requirements for sequencing libraries.
• DISCOVAR: 250 bp paired-end, PCR-free Illumina reads. No other libraries are required.


The Kitchen Sink

• Available Technologies
  • Long Reads: Pacific Biosciences/Nanopore (long contigs)
  • Optical Maps: BioNano (scaffolding)
  • Linked Reads: 10x Genomics (high base quality and phasing)
  • Cross Linking: Hi-C/Dovetail Chicago (scaffolding)

• What is the best combination? Are all necessary? As algorithms improve, which become unnecessary?
• Genome 10K project: sequence 10,000 vertebrates


Goat Genome

                       CHIR_2.0 (BGI), 2012                ARS1, 2016
Data                   14 Illumina PE libraries + OpGen    PacBio + Bionano + Hi-C
Coverage               175x                                69x (@ 5.1 Kb mean read length)
Assembly length        2.8 Gb                              2.9 Gb
Number of contigs      173,141                             3,074
Contig N50             73.5 Kb                             18.7 Mb
Number of scaffolds    103,494                             31 (chromosomes)
Scaffold N50           9 Mb                                87.3 Mb

Adding in the optical maps from the Irys system reduced the total number of contigs to 1,780, with a contig N50 of 10.2 megabases. "The optical mapping increased the quality and confidence of the initial scaffolds," Phillippy said. The three technologies (PacBio, Bionano, and Hi-C) ended up being complementary to each other, he added. Finally, Illumina data is used to polish and make error corrections at the base level.

Source: GenomeWeb, "Goat Genome Demonstrates Benefits of Combining Technologies for De Novo Assembly", Mar 07, 2017


Focus of the Future

• To some extent we are limited by being able to generate enough high quality, high molecular weight DNA.
• Continued improvement to sequencing chemistries for consistent and longer reads; quality improvement has become secondary.
• Incremental improvement of the computational algorithms, including improved alignment of error-prone reads (GFA2).
• Scaffolding algorithms that merge multiple data types/sources
• Polyploidy??
• Haplotyping – How to really use the data


Graphical Fragment Assembly - GFA2

• Assembly is a pipeline
  • Overlap
  • Layout
  • Consensus

• Tools share a common input (fastq) and a common output (fasta) but no common intermediate file format, which causes duplication of effort.
• GFA2 - Common file format for assembly graph representation
  • Direct graph visualization, manipulation
  • Modular assembly tools (heterozygous/mis-assembled contigs)
  • Modular scaffolding tools
  • Graph aware annotation
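
To make the idea of a shared intermediate format concrete, here is a minimal sketch of reading a GFA2 graph. It handles only the tab-separated segment (`S`) and edge (`E`) records and ignores tags, orientations and coordinates; `parse_gfa2` and the two-contig example are made up for illustration, so treat it as a reading aid rather than a complete parser.

```python
from collections import defaultdict


def parse_gfa2(text):
    """Collect segment lengths and segment-to-segment links from GFA2 text.

    Assumes the GFA2 layouts:
      S <sid> <slen> <sequence> ...
      E <eid> <sid1 ref> <sid2 ref> <beg1> <end1> <beg2> <end2> <alignment> ...
    where each ref is a segment id followed by an orientation sign (+ or -).
    """
    segments = {}             # segment id -> declared length
    links = defaultdict(set)  # segment id -> ids of segments it overlaps
    for line in text.splitlines():
        fields = line.split("\t")
        if fields[0] == "S" and len(fields) >= 4:
            segments[fields[1]] = int(fields[2])
        elif fields[0] == "E" and len(fields) >= 9:
            first = fields[2][:-1]   # strip the trailing orientation sign
            second = fields[3][:-1]
            links[first].add(second)
            links[second].add(first)
    return segments, dict(links)


# Made-up example: the last 4 bases of ctg1 overlap the first 4 of ctg2.
EXAMPLE = "\n".join([
    "H\tVN:Z:2.0",
    "S\tctg1\t8\tACGTACGT",
    "S\tctg2\t6\tACGTTT",
    "E\te1\tctg1+\tctg2+\t4\t8$\t0\t4\t*",
])
segments, links = parse_gfa2(EXAMPLE)
```

A visualizer, a scaffolder and an annotation tool could each start from the same `segments` and `links` view instead of re-deriving the graph from assembler-specific output.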


Annotation – PacBio Iso-Seq

Produce full-length transcripts without assembly

The isoform sequencing (Iso-Seq) application generates full-length cDNA sequences, from the 5' end of transcripts to the poly-A tail. The circular consensus sequence (CCS) algorithm then produces high quality isoforms.


10x Genomes [Linked reads]

• Genome – Genome Resequencing
  • Call the full spectrum of variants (particularly long INDELs/CNV and structural variants) and unlock previously inaccessible regions from a single library at equivalent coverage as standard genome resequencing projects

• Exome – Subselect reads using capture techniques (Agilent)
  • Enable phasing of genes and detection of structural and copy number variation
  • Agilent SureSelect baits improve gene phasing by closing gaps and recovering hard-to-map loci in the genome (future kits to include previously failed regions)


In a nutshell


Laboratory Workflow

Shear?


The Math

1 ng input DNA = 300 copies of the genome

Calculations imply that about 50% of all possible fragments end up in a bead


At recommended loading
• Each locus will have 150 molecules
• Each locus will have 30x read depth

• ~35 fragments per molecule @ 50 Kb molecules = 0.2x/molecule
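
The figures on the last two slides fit together as simple arithmetic, sketched below. Two inputs are assumptions not stated on the slides: an average base-pair mass of about 650 g/mol, and reading "fragments" as 2x150 bp read pairs (300 sequenced bases each). The function names are hypothetical.

```python
AVOGADRO = 6.022e23
BP_MASS_G_PER_MOL = 650.0  # assumed average mass of one DNA base pair


def genome_copies(input_ng, genome_bp):
    """Approximate number of genome copies in `input_ng` nanograms of DNA."""
    genome_mass_g = genome_bp * BP_MASS_G_PER_MOL / AVOGADRO
    return input_ng * 1e-9 / genome_mass_g


def linked_read_budget(copies, capture_fraction, read_depth, molecule_bp, pair_bp):
    """Return (molecules per locus, depth per molecule, read pairs per molecule)."""
    molecules_per_locus = copies * capture_fraction
    depth_per_molecule = read_depth / molecules_per_locus
    pairs_per_molecule = depth_per_molecule * molecule_bp / pair_bp
    return molecules_per_locus, depth_per_molecule, pairs_per_molecule


# 1 ng of human DNA (3.2 Gb genome) is roughly 290 copies, rounded to 300 on the slide.
copies = genome_copies(1.0, 3.2e9)

# 300 copies, 50% captured in beads, 30x depth, 50 Kb molecules, 300 bp per read pair.
molecules, depth, pairs = linked_read_budget(300, 0.5, 30, 50_000, 300)
```

With these inputs the sketch gives 150 molecules per locus and 0.2x per molecule, and about 33 read pairs per molecule, close to the ~35 quoted on the slide.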


Analysis – Biological Questions

• At recommended specs (for human genome)
  • Get ~30x coverage, adequate for standard variant analysis (SNPs, small INDELs)
  • Increased mappability to difficult regions [multi-mapped reads can be resolved by considering linked reads information], revealing variants previously undetermined.
  • Detect large SV and CNV
  • Phased information

Detection of SV and CNV requires advanced computational techniques.
Phasing has been used extensively in GWAS (usually imputed) to enhance analysis and inferences, but has been slow to reach sequence based data.
Potential applications for the technology are likely still to come.
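
The bracketed note about multi-mapped reads can be illustrated with a toy rule: among the candidate positions for an ambiguous read, prefer the one with the most uniquely mapped reads that carry the same barcode nearby. This is only a sketch of the intuition, not the method used by any 10x Genomics software; `place_with_barcode`, the 50 Kb window and the coordinates are all assumptions made for the example.

```python
def place_with_barcode(candidates, barcode_mates, window=50_000):
    """Choose among candidate (chrom, pos) placements using same-barcode reads.

    `barcode_mates` holds (chrom, pos) of uniquely mapped reads sharing the
    barcode. Returns the best supported candidate, or None when there is no
    support or the top two candidates tie.
    """
    def support(candidate):
        chrom, pos = candidate
        return sum(1 for c, p in barcode_mates
                   if c == chrom and abs(p - pos) <= window)

    scored = sorted(((support(c), c) for c in candidates), reverse=True)
    if not scored or scored[0][0] == 0:
        return None
    if len(scored) > 1 and scored[0][0] == scored[1][0]:
        return None
    return scored[0][1]


# Invented example: a read maps equally well to chr1 and chr7.
candidates = [("chr1", 1_000_000), ("chr7", 5_000_000)]
mates = [("chr7", 5_010_000), ("chr7", 4_990_000), ("chr1", 9_000_000)]
best = place_with_barcode(candidates, mates)
unresolved = place_with_barcode(candidates, [])
```

The window is set near the molecule length because reads with the same barcode mostly come from the same long input molecule.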


Increased Mappability


Linked Reads


Capture – linked genes

Enrich reads of interest instead of random selection

Depending on size of capture, can pool more samples/lane


Thus, this TP53 C>T mutation is in trans with the deleted allele haplotype. As a result, the tumor contains only a single, inactivated copy of TP53. Taken together, this result shows the unambiguous biallelic inactivation of TP53.
