Towards Gapless, Chromosome Scale, Haplotype Assemblies

Matt Settles, UC Davis Bioinformatics Core

Human Genome

• In 1990, the National Institutes of Health (NIH) and the Department of Energy joined with international partners to sequence the human genome.
• In April 2003, researchers successfully completed the Human Genome Project, under budget ($2.7B) and more than two years ahead of schedule.
• Thousands of people contributed to the Human Genome Project.
• Even so, there remain ~400 gaps in the human reference sequence assembly, representing hundreds of millions of bases.

Slowdown at UCSC of adding genomes

[Figure: Number of new genomes/versions added over time (UCSC); x-axis: year, 2000 to 2015]

Renewed focus on genomes

• Sequencing has become more democratic. For example, it took more than 50 people, around a dozen centers, $50 million and half a decade to generate a draft chimpanzee genome, published in 2005. This year, Eichler's lab completed a gorilla sequence for about $70,000. “That, to me, is a big deal,” he says.
• Also a big deal, says Eichler, is the quality of their sequences. An earlier version of a gorilla genome was published in 2012, but that was done with shorter pieces of DNA, and therefore left hundreds of thousands of gaps. His team used long-read technology, closed 90 percent of those gaps, and was able to complete many genes that were only partially sequenced in the first attempt.

Speed-reading the genome: Cheaper methods of sequencing are opening up doors for new research and new career paths. http://www.nature.com/naturejobs/science/articles/10.1038/nj0492

Gorilla Genome Assembly

| Metric | 2012 Illumina Assembly | 2016 Pacific Biosciences Assembly |
|---|---|---|
| Total length | 3,041,976,159 bp | 3,080,414,926 bp |
| Contigs | 465,847 | 16,073 |
| Total contig length | 2,829,670,843 bp | 3,080,414,926 bp |
| Placed contig length | 2,712,844,129 bp | 2,790,620,487 bp |
| Unplaced contig length | 116,826,714 bp | 289,794,439 bp |
| Max. contig length | 191,556 bp | 36,219,563 bp |
| Contig N50 | 11.6 Kb | 9.6 Mb |
| Scaffolds | 22,164 | 554 |
| Max. scaffold length | 10,247,101 bp | 110,018,866 bp |
| Scaffold N50 | 914 Kb | 23.1 Mb |

2012 Assembly: ABI capillary sequence and short 35 bp Illumina sequence + BAC PE data
2015 Assembly: PacBio SMRT sequence + BAC PE data, INDEL corrected with Illumina sequence

Advances in high-noise, long-read assembly algorithms

• Summer of 2015
• Pacific Biosciences Falcon assembler for SMRT assembly of large genomes
• Canu, a fork of Celera Assembler for single-molecule, high-noise sequences

• Key features:
  – Discard all reads shorter than X bp before loading into the overlapper; this step significantly reduces the number of reads being analyzed.
  – Self-correct reads from all-by-all overlaps (takes advantage of a cluster environment).
  – Build a graph based on high-quality, long corrected reads.
  – “Polish” the resulting assembly using all reads; 60x coverage produces high-quality final contigs.
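One way to think about the "discard reads shorter than X bp" step is to pick X so that the longest reads alone still cover the genome to some target depth. The sketch below illustrates that idea only; it is not the actual Falcon or Canu code, and the function name, the toy read lengths, and the target coverage are all assumptions made for the example.

```python
def length_cutoff_for_coverage(read_lengths, genome_size, target_coverage):
    """Return the largest cutoff X such that reads of length >= X
    still provide at least `target_coverage` of the genome.

    Returns None when all reads together fall short of the target.
    """
    needed_bases = genome_size * target_coverage
    total = 0
    # Walk from the longest read down, accumulating bases until the target is met.
    for length in sorted(read_lengths, reverse=True):
        total += length
        if total >= needed_bases:
            return length
    return None


# Toy data (hypothetical): a 10 kb "genome" and seven reads.
reads = [2_000, 5_000, 8_000, 12_000, 15_000, 20_000, 30_000]
cutoff = length_cutoff_for_coverage(reads, genome_size=10_000, target_coverage=5)
kept = [r for r in reads if r >= cutoff]
print(cutoff, len(kept), sum(kept) / 10_000)
```

With these toy numbers only the two longest reads are needed to reach 5x, so the other five never enter the overlapper.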

The ‘Next, Next’ Generation Sequencers (3rd Generation)

• 2009 – Single Molecule, Real-Time (SMRT) sequencing by Pacific Biosciences, the most successful third-generation sequencing platform: RS II ~1 Gb per SMRT cell, new PacBio Sequel ~7 Gb per SMRT cell, read lengths near 100 Kb possible.

PacBio Advances (RS II vs Sequel)

California Condor data (~1.2 Gbp genome), based on 4 SMRT cells in Jan 2017

[Figure: Read length histograms, two panels (RS2 data, Sequel data); x-axis: read length, 0 to 100,000 bp; y-axis: frequency]

| Metric | RS2 | Sequel |
|---|---|---|
| Read count | 448,767 | 1,947,684 |
| N50 | 10,426 | 4,293 |
| Longest read | 82,366 | 102,310 |
| # reads > 12 Kb | 217,691 | 754,157 |
| Coverage > 12 Kb | 3.64 | 12.165 |

| Assembly (60 SMRT cells) | Total assembly size | N90 | N50 | Number of contigs | Largest contig | Smallest contig |
|---|---|---|---|---|---|---|
| Falcon + Quiver polishing | 1,239,863,868 | 1,106,390 | 17,286,884 | 1,164 | 77,968,233 | 2,802 |
| Canu | 1,240,661,679 | 1,080,915 | 14,278,087 | 1,004 | 45,704,690 | 1,812 |
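For readers less familiar with the N50 and N90 columns: Nx is the contig length L such that contigs of length L or longer contain at least x percent of the assembled bases. A minimal sketch of that definition follows; the function name and the toy contig lengths are made up for illustration.

```python
def nx(lengths, fraction):
    """Length L such that contigs of length >= L hold at least
    `fraction` of the total assembled bases (fraction=0.5 gives N50)."""
    ordered = sorted(lengths, reverse=True)
    threshold = sum(ordered) * fraction
    running = 0
    for length in ordered:
        running += length
        if running >= threshold:
            return length
    raise ValueError("lengths must be non-empty")


# Toy contig lengths (hypothetical), 300 bases in total.
contigs = [100, 80, 60, 40, 20]
n50 = nx(contigs, 0.5)  # 100 + 80 = 180 bases, which passes the 150-base halfway mark
n90 = nx(contigs, 0.9)  # 100 + 80 + 60 + 40 = 280 bases, which passes 270
print(n50, n90)
```

Because N90 has to account for more of the assembly, it is never larger than N50, which matches the tables above.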

Oxford Nanopore

• 2015 – Another 3rd generation sequencer; the company was founded in 2005. The sequencer uses nanopore technology developed in the 90's to sequence single molecules. Throughput is about 5-10 Gb per flow cell, with reads of nearly 200 kb possible.

SmidgION: nanopore sensing for use with mobile devices

MinION PromethION

Towards Gapless assemblies

Promise
• Continued progress on DNA input, and on the resulting PacBio/Oxford Nanopore read lengths and read depth, will result in longer N50/N90 and fewer resulting contigs.
• Algorithms are still young and have room for improvement.

Issues
• Some mis-assemblies are still present; chimeric reads are an issue.
• Small INDELs are an issue and require cleanup (Illumina reads), especially within genes.

Scaffolding Options

Borrowed from Sergey Koren's talk at the PacBio Informatics Developer Meeting in Jan 2017

Bionano Irys/Saphyr

• The Irys/Saphyr system provides optical genome mapping. No more waiting for months to get a physical genome map. Bionano Next-Generation Mapping (NGM) provides long-range information to reveal true genome structure. It helps scaffold genome assemblies to nearly chromosome-arm scale.

Not sequencing based

Dovetail and Hi-C (Cross Linking) on Illumina

Hi-C

Proximity Guided Scaffolding

Dovetail and Phase Genomics

Dovetail Chicago Libraries

10x Genomics on Illumina

• 10x Genomics, Linked-Reads technology
• Illumina machines, Sequencing by Synthesis: ~120 Gb/lane, 2x150 bp reads

10x has its own assembler, Supernova

10x Genomics phasing + high-quality Illumina data

10x Genomics, Supernova Genome Stats

| Genome | Size (Gb) | DNA size (Kb) | N50 contig (Kb) | N50 scaffold (Mb) | N50 phase block (Mb) |
|---|---|---|---|---|---|
| NA12878 | 3.2 | 95.5 | 85.0 | 12.8 | 2.8 |
| NA24385 | 3.2 | 111.3 | 90.0 | 10.4 | 3.9 |
| HGP | 3.2 | 138.8 | 104.9 | 19.4 | 4.6 |
| Yoruban | 3.2 | 126.9 | 100.5 | 16.1 | 11.4 |
| Komodo dragon | 1.8 | 85.4 | 95.3 | 10.2 | 0.4 |
| Spotted owl | 1.5 | 72.2 | 118.3 | 10.1 | 0.2 |
| Hummingbird | 1.0 | 86.2 | 87.6 | 12.5 | 10.1 |
| Monk seal | 2.6 | 92.3 | 93.8 | 14.8 | 0.6 |
| Chili pepper | 3.5 | 53.3 | 84.7 | 4.0 | 2.1 |
| Cowpea | 0.38 | 46.5 | 28.3 | 0.83 | 0.35 |
| Walnut | 0.89 | 55.0 | 48.0 | 0.60 | 0.25 |
| California Condor | 1.19 | 67.0 | 147.5 | 17.9 | 1.0 |

Genome Assembly is converging on more standardized data models

• The trend is to consider sample, data generation and bioinformatics together.
• ALLPATHS-LG started with a specific requirement for the sequencing libraries.
• DISCOVAR: 250 bp paired-end, PCR-free Illumina reads. No other libraries are required.

The Kitchen Sink

• Available Technologies
  – Long Reads: Pacific Biosciences/Nanopore: long contigs
  – Optical Maps: BioNano: scaffolding
  – Linked Reads: 10x Genomics: high base quality and phasing
  – Cross Linking: Hi-C/Dovetail Chicago: scaffolding

• What is the best combination, and are all of them necessary? As algorithms improve, which become unnecessary?
• Genome 10K project: sequence 10,000 vertebrates

Goat Genome

| Metric | CHIR_2.0 (BGI) - 2012 | ARS1 - 2016 |
|---|---|---|
| Data | 14 Illumina PE libraries + OpGen | PacBio + Bionano + Hi-C |
| Coverage | 175x | 69x (@ 5.1 Kb mean read length) |
| Assembly length | 2.8 Gb | 2.9 Gb |
| Number of contigs | 173,141 | 3,074 |
| Contig N50 | 73.5 Kb | 18.7 Mb |
| Number of scaffolds | 103,494 | 31 (chromosomes) |
| Scaffold N50 | 9 Mb | 87.3 Mb |

Adding in the optical maps from the Irys system reduced the total number of contigs to 1,780, with a contig N50 of 10.2 megabases. "The optical mapping increased the quality and confidence of the initial scaffolds," Phillippy said. The three technologies (PacBio, Bionano, and Hi-C) ended up being complementary to each other, he added. Finally, Illumina data is used to polish and make error corrections at the base level.

GenomeWeb, “Goat Genome Demonstrates Benefits of Combining Technologies for De Novo Assembly”, Mar 07, 2017

Focus of the Future

• To some extent we are limited by being able to generate enough high-quality, high-molecular-weight DNA.
• Continued improvement to sequencing chemistries for consistent and longer reads; quality improvement has become secondary.
• Incremental improvement of the computational algorithms, including improved alignment of error-prone reads (GFA2).
• Scaffolding algorithms that merge multiple data types/sources
• Polyploidy??
• Haplotyping – how to really use the data

Graphical Fragment Assembly - GFA2

• Assembly is a pipeline
  – Overlap
  – Layout
  – Consensus

• Assemblers share a common input (fastq) and a common output (fasta) but no common intermediate file format, which causes a duplication of effort.
• GFA2 - a common file format for assembly graph representation
  – Direct graph visualization and manipulation
  – Modular assembly tools (heterozygous/mis-assembled contigs)
  – Modular scaffolding tools
  – Graph-aware annotation
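To make the idea of a shared graph format concrete, here is a minimal sketch that reads segment (S) and edge (E) records from GFA2-style text into a dictionary-based graph. It assumes the tab-separated layouts `S sid slen sequence` and `E eid sid1 sid2 beg1 end1 beg2 end2 alignment`, with a trailing `+` or `-` orientation on each segment reference, as I understand the GFA2 specification. The example records are invented, positions are not validated, and all other record types are ignored.

```python
from collections import defaultdict

# Hypothetical GFA2-style records (tab-separated), for illustration only.
EXAMPLE_GFA2 = "\n".join([
    "H\tVN:Z:2.0",
    "S\tctg1\t8\tACGTACGT",
    "S\tctg2\t6\tCGTTGA",
    "S\tctg3\t5\tTTGAC",
    "E\te1\tctg1+\tctg2+\t5\t8$\t0\t3\t*",
    "E\te2\tctg2+\tctg3-\t2\t6$\t0\t4\t*",
])


def parse_gfa2(text):
    """Return (segments, edges) from GFA2-style text.

    segments: {segment_id: (length, sequence)}
    edges:    {segment_id: [(orientation, neighbour_id, neighbour_orientation)]}
    """
    segments = {}
    edges = defaultdict(list)
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split("\t")
        if fields[0] == "S":
            segments[fields[1]] = (int(fields[2]), fields[3])
        elif fields[0] == "E":
            ref1, ref2 = fields[2], fields[3]
            # The last character of each reference is its orientation sign.
            edges[ref1[:-1]].append((ref1[-1], ref2[:-1], ref2[-1]))
    return segments, dict(edges)


segments, edges = parse_gfa2(EXAMPLE_GFA2)
print(sorted(segments), edges)
```

Once overlap, layout, and scaffolding tools all read and write a structure like this, each stage can be swapped out independently, which is the modularity argument made above.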

Annotation – PacBio Iso-Seq

Produce full-length transcripts without assembly. The isoform sequencing (Iso-Seq) application generates full-length cDNA sequences, from the 5’ end of transcripts to the poly-A tail. Afterwards, the Circular Consensus Sequence (CCS) algorithm produces high-quality isoforms.

10x Genomes [Linked reads]

• Genome – Genome Resequencing
  – Call the full spectrum of variants (particularly long INDELs/CNVs and structural variants) and unlock previously inaccessible regions from a single library, at coverage equivalent to standard genome resequencing projects.

• Exome – Subselect reads using capture techniques (Agilent)
  – Enable phasing of genes and detection of structural and copy number variation.
  – Agilent SureSelect baits improve gene phasing by closing gaps and recovering hard-to-map loci in the genome (future kits to include previously failed regions).

In a nutshell

Laboratory Workflow

Shear?

The Math

1 ng input DNA = ~300 copies of the genome

Calculations imply that about 50% of all possible fragments end up in a bead

@ Recommended Loading
• Each locus will have 150 molecules
• Each locus will have 30x read depth

• ~35 fragments per molecule @ 50 Kb molecules = 0.2x/molecule
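The numbers on this slide can be reproduced with a few lines of arithmetic. The sketch below assumes a 3.2 Gb genome (the human genome size used in the Supernova table), an average mass of about 650 g/mol per base pair, and that a "fragment" here means one 2x150 bp read pair (300 bp). These are my assumptions for the worked example, not values stated on the slide.

```python
AVOGADRO = 6.02214076e23      # molecules per mole
BP_MASS_G_PER_MOL = 650.0     # assumed average mass of one DNA base pair
GENOME_SIZE_BP = 3.2e9        # assumed human genome size


def genome_copies(input_ng, genome_size_bp=GENOME_SIZE_BP):
    """Number of genome copies contained in `input_ng` nanograms of DNA."""
    grams_per_copy = genome_size_bp * BP_MASS_G_PER_MOL / AVOGADRO
    return input_ng * 1e-9 / grams_per_copy


copies = genome_copies(1.0)                      # roughly 300 copies in 1 ng
molecules_per_locus = 0.5 * copies               # about half end up in a bead
coverage_per_molecule = 30.0 / molecules_per_locus   # 30x spread over those molecules
read_pairs_per_molecule = coverage_per_molecule * 50_000 / 300  # 50 Kb molecule, 300 bp per pair

print(round(copies), round(molecules_per_locus),
      round(coverage_per_molecule, 2), round(read_pairs_per_molecule))
```

The point of the calculation is that each individual molecule is sequenced very sparsely (about 0.2x), so the long-range information comes from the shared barcode rather than from contiguous read coverage.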

Analysis – Biological Questions

• At recommended specs (for human genome)
  – Get ~30x coverage, adequate for standard variant analysis: SNPs, small INDELs
  – Increased mappability to difficult regions [multi-mapped reads can be resolved by considering linked-reads information], revealing variants that were previously undetermined
  – Detect large SVs and CNVs
  – Phased information
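The bracketed remark about multi-mapped reads can be illustrated with a toy rule: among the candidate positions for an ambiguous read, prefer the one that sits near uniquely mapped reads carrying the same barcode. This is a deliberately simplified sketch, not the algorithm used by 10x Genomics software; the function name, the 50 Kb window, and the coordinates are all hypothetical.

```python
def resolve_multimapped(candidates, barcode_mates, window=50_000):
    """Choose among candidate (chrom, pos) placements of one read.

    `barcode_mates` holds (chrom, pos) of uniquely mapped reads that share
    the read's barcode. The candidate with the most mates within `window`
    bases wins; ties or zero support return None (still ambiguous).
    """
    scored = []
    for chrom, pos in candidates:
        support = sum(
            1 for mate_chrom, mate_pos in barcode_mates
            if mate_chrom == chrom and abs(mate_pos - pos) <= window
        )
        scored.append((support, (chrom, pos)))
    scored.sort(key=lambda item: item[0], reverse=True)
    if not scored or scored[0][0] == 0:
        return None
    if len(scored) > 1 and scored[0][0] == scored[1][0]:
        return None
    return scored[0][1]


# Hypothetical read that aligns equally well to two loci.
candidates = [("chr1", 1_000_000), ("chr5", 2_000_000)]
mates = [("chr5", 1_990_000), ("chr5", 2_020_000), ("chr1", 9_000_000)]
print(resolve_multimapped(candidates, mates))
```

Here two same-barcode reads lie within 50 Kb of the chr5 candidate and none near the chr1 candidate, so the read is placed on chr5.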

• Detection of SVs and CNVs requires advanced computational techniques.
• Phasing has been used extensively in GWAS (usually imputed) to enhance analysis and inferences, but has been slow to reach sequence-based data.
• Potential applications for the technology are likely still yet to come.

Increased Mappability

Linked Reads

Capture – linked genes

Enrich reads of interest instead of random selection

Depending on size of capture, can pool more samples/lane

Thus, this TP53 C>T mutation is in trans with the deleted allele haplotype. As a result, the tumor contains only a single, inactivated copy of TP53. Taken together, this result shows the unambiguous biallelic inactivation of TP53.
