Towards Gapless, Chromosome-Scale, Haplotype Assemblies
Matt Settles, UC Davis Bioinformatics Core
Human Genome
• In 1990, the National Institutes of Health (NIH) and the Department of Energy joined with international partners to sequence the human genome.
• In April 2003, researchers successfully completed the Human Genome Project, under budget ($2.7B) and more than two years ahead of schedule.
• Thousands of people contributed to the Human Genome Project.
• Even so, ~400 gaps remain in the human reference sequence assembly, representing hundreds of millions of bases.
Slowdown at UCSC of adding genomes
[Figure: number of new genomes/versions added over time (UCSC), by year, 2000 to 2015.]
Renewed focus on genomes
• Sequencing has become more democratic. For example, it took more than 50 people, around a dozen centers, $50 million and half a decade to generate a draft chimpanzee genome, published in 2005. This year, Eichler's lab completed a gorilla sequence for about $70,000. “That, to me, is a big deal,” he says.
• Also a big deal, says Eichler, is the quality of their sequences. An earlier version of a gorilla genome was published in 2012, but that was done with shorter pieces of DNA, and therefore left hundreds of thousands of gaps. His team used long-read technology, closed 90 percent of those gaps, and was able to complete many genes that were only partially sequenced in the first attempt.
Speed-reading the genome: Cheaper methods of sequencing are opening up doors for new research and new career paths. http://www.nature.com/naturejobs/science/articles/10.1038/nj0492
Gorilla Genome Assembly

                          2012 Illumina Assembly    2016 Pacific Biosciences Assembly
Total length              3,041,976,159 bp          3,080,414,926 bp
Contigs                   465,847                   16,073
Total contig length       2,829,670,843 bp          3,080,414,926 bp
Placed contig length      2,712,844,129 bp          2,790,620,487 bp
Unplaced contig length    116,826,714 bp            289,794,439 bp
Max. contig length        191,556 bp                36,219,563 bp
Contig N50                11.6 kb                   9.6 Mb
Scaffolds                 22,164                    554
Max. scaffold length      10,247,101 bp             110,018,866 bp
Scaffold N50              914 kb                    23.1 Mb

2012 Assembly: ABI capillary sequence and short 35 bp Illumina sequence + BAC PE data
2015 Assembly: PacBio SMRT sequence + BAC PE data, indel-corrected with Illumina sequence
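The contig and scaffold N50 values in assembly tables like this one can be computed from a list of sequence lengths. The following is a minimal sketch of the usual definition (the length L such that sequences of length L or longer contain at least half of the assembled bases); the function name `nx` and the toy lengths are hypothetical and not taken from any assembly in this talk.

```python
def nx(lengths, fraction=0.5):
    """Return the Nx statistic for a list of contig/scaffold lengths.

    Nx is the length L such that sequences of length >= L together
    contain at least `fraction` of the total assembled bases
    (fraction=0.5 gives N50, fraction=0.9 gives N90).
    """
    if not lengths:
        raise ValueError("lengths must not be empty")
    ordered = sorted(lengths, reverse=True)
    threshold = fraction * sum(ordered)
    running = 0
    for length in ordered:
        running += length
        if running >= threshold:
            return length
    return ordered[-1]


# Toy contig lengths (hypothetical values, for illustration only).
toy_contigs = [100, 80, 60, 40, 20]
n50 = nx(toy_contigs, 0.5)  # 100 + 80 = 180 >= 150, so N50 is 80
n90 = nx(toy_contigs, 0.9)  # 100 + 80 + 60 + 40 = 280 >= 270, so N90 is 40
```

Because a few very long contigs dominate the running sum, N50 rises sharply when long reads merge many short contigs into a few long ones, which is the pattern the two columns above show.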
Advances in high-noise, long-read assembly algorithms
• Summer of 2015
• Pacific Biosciences Falcon assembler for SMRT assembly of large genomes
• Canu, a fork of Celera Assembler for single-molecule, high-noise sequences
• Key features:
  • Discard all reads shorter than X bp before loading into the overlapper; this step significantly reduces the number of reads being analyzed.
  • Self-correct reads from all-by-all overlaps (takes advantage of a cluster environment).
  • Build a graph based on high-quality, long corrected reads.
  • “Polish” the resulting assembly using all reads; 60x coverage produces high-quality final contigs.
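The first key feature, dropping reads below a length cutoff before overlapping, can be sketched as follows. This is an illustration only and not how Falcon or Canu implement the step; the function names, the cutoff and the toy read lengths are all hypothetical.

```python
def filter_reads(read_lengths, min_length):
    """Keep only reads at least `min_length` bp long (the 'X bp' cutoff)."""
    return [n for n in read_lengths if n >= min_length]


def coverage(read_lengths, genome_size):
    """Fold coverage: total sequenced bases divided by genome size."""
    return sum(read_lengths) / genome_size


# Toy data: six read lengths and a 100 kb "genome" (hypothetical values).
toy_reads = [2_000, 5_000, 9_000, 12_000, 15_000, 30_000]
kept = filter_reads(toy_reads, min_length=10_000)
cov_all = coverage(toy_reads, genome_size=100_000)
cov_kept = coverage(kept, genome_size=100_000)
```

In this toy case half of the reads are discarded, yet the retained reads still carry 57,000 of the 73,000 sequenced bases, which is why a length cutoff shrinks the all-by-all overlap problem much more than it shrinks coverage.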
The ‘Next, Next’ Generation Sequencers (3rd Generation)
• 2009 – Single Molecule Real-Time (SMRT) sequencing by Pacific Biosciences, the most successful third-generation sequencing platform: RS II ~1 Gb per SMRT cell, new PacBio Sequel ~7 Gb per SMRT cell, read lengths near 100 kb possible.
PacBio Advances (RS II vs Sequel)
California Condor data (~1.2 Gbp genome) based on 4 SMRT cells in Jan 2017
[Figure: read length histograms (frequency vs. read length, 0 to 100,000 bp) for RS2 data and Sequel data.]

                     RS2        Sequel
Read count           448,767    1,947,684
N50                  10,426     4,293
Longest read         82,366     102,310
# reads > 12 Kb      217,691    754,157
Coverage > 12 Kb     3.64       12.165
Assembly (60 SMRT cells)

                             Total assembly size   N90         N50          Number of contigs   Largest contig   Smallest contig
Falcon + Quiver polishing    1,239,863,868         1,106,390   17,286,884   1,164               77,968,233       2,802
Canu                         1,240,661,679         1,080,915   14,278,087   1,004               45,704,690       1,812
Oxford Nanopore
• 2015 – Another third-generation sequencer, from a company founded in 2005. The sequencer uses nanopore technology developed in the 1990s to sequence single molecules. Throughput is about 5-10 Gb per flow cell, with reads of nearly 200 kb possible.
SmidgION: nanopore sensing for use with mobile devices
MinION, PromethION
Towards Gapless assemblies
• Promise
  • Continued progress on DNA input and the resulting PacBio/Oxford Nanopore read lengths and read depth will result in longer N50/N90 and fewer contigs.
  • Algorithms are still young and have room for improvement.
• Issues
  • Some mis-assemblies are still present; chimeric reads are an issue.
  • Small indels are an issue and require cleanup (with Illumina reads), especially within genes.
Scaffolding Options
Borrowed from Sergey Koren's talk at the PacBio Informatics Developer Meeting in Jan 2017
Bionano Irys/Saphyr
• The Irys/Saphyr system provides optical genome mapping, with no more waiting for months to get a physical genome map. Bionano Next-Generation Mapping (NGM) provides long-range information to reveal true genome structure. It helps bring genome assemblies to near chromosome-arm scale.
Not sequencing based
Dovetail and Hi-C (Cross Linking) on Illumina
Hi-C
Proximity Guided Scaffolding
Dovetail and Phase Genomics
Dovetail Chicago Libraries
10x Genomics on Illumina
• 10x Genomics, Linked Reads technology
• Illumina machines, Sequencing by Synthesis, ~120 Gb/lane, 2x150 bp reads
10x has its own assembler, Supernova
10x Genomics phasing + high-quality Illumina data
10x Genomics, Supernova Genome Stats

Genome               Size (Gb)   DNA size (Kb)   N50 contig (Kb)   N50 scaffold (Mb)   N50 phase block (Mb)
NA12878              3.2         95.5            85.0              12.8                2.8
NA24385              3.2         111.3           90.0              10.4                3.9
HGP                  3.2         138.8           104.9             19.4                4.6
Yoruban              3.2         126.9           100.5             16.1                11.4
Komodo dragon        1.8         85.4            95.3              10.2                0.4
Spotted owl          1.5         72.2            118.3             10.1                0.2
Hummingbird          1.0         86.2            87.6              12.5                10.1
Monk seal            2.6         92.3            93.8              14.8                0.6
Chili pepper         3.5         53.3            84.7              4.0                 2.1
Cowpea               0.38        46.5            28.3              0.83                0.35
Walnut               0.89        55.0            48.0              0.60                0.25
California Condor    1.19        67.0            147.5             17.9                1.0
Genome Assembly is converging on more standardized data models
• The trend is to consider sample, data generation and bioinformatics together.
• ALLPATHS-LG started with specific requirements for sequencing libraries.
• DISCOVAR: 250 bp paired-end PCR-free Illumina reads. No other libraries are required.
The Kitchen Sink
• Available Technologies
  • Long Reads: Pacific Biosciences/Nanopore (long contigs)
  • Optical Maps: Bionano (scaffolding)
  • Linked Reads: 10x Genomics (high base quality and phasing)
  • Cross Linking: Hi-C/Dovetail Chicago (scaffolding)
• What is the best combination? Are all necessary? As algorithms improve, which become unnecessary?
• Genome 10K project: sequence 10,000 vertebrates
Goat Genome

                       CHIR_2.0 (BGI) - 2012               ARS1 - 2016
Data                   14 Illumina PE libraries + OpGen    PacBio + Bionano + Hi-C
Coverage               175x                                69x (@ 5.1 Kb mean read length)
Assembly length        2.8 Gb                              2.9 Gb
Number of contigs      173,141                             3,074
Contig N50             73.5 Kb                             18.7 Mb
Number of scaffolds    103,494                             31 (chromosomes)
Scaffold N50           9 Mb                                87.3 Mb
Adding in the optical maps from the Irys system reduced the total number of contigs to 1,780, with a contig N50 of 10.2 megabases. "The optical mapping increased the quality and confidence of the initial scaffolds," Phillippy said. The three technologies (PacBio, Bionano, and Hi-C) ended up being complementary to each other, he added. Finally, Illumina data is used to polish and make error corrections at the base level.
GenomeWeb, “Goat Genome Demonstrates Benefits of Combining Technologies for De Novo Assembly”, Mar 07, 2017
Focus of the Future
• To some extent we are limited by our ability to generate enough high-quality, high-molecular-weight DNA.
• Continued improvement to sequencing chemistries for consistent and longer reads; quality improvement has become secondary.
• Incremental improvement of the computational algorithms, including improved alignment of error-prone reads (GFA2).
• Scaffolding algorithms that merge multiple data types/sources
• Polyploidy??
• Haplotyping – how to really use the data
Graphical Fragment Assembly - GFA2
• Assembly is a pipeline
  • Overlap
  • Layout
  • Consensus
• Having a common input (fastq) and common output (fasta), but no common intermediate file format, causes a duplication of effort.
• GFA2 - Common file format for assembly graph representation
  • Direct graph visualization, manipulation
  • Modular assembly tools (heterozygous/mis-assembled contigs)
  • Modular scaffolding tools
  • Graph-aware annotation
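To make the idea of a shared assembly-graph format concrete, here is a minimal sketch that reads segment (S) and edge (E) records from GFA2-style text and builds a neighbor map. The field layout follows a reading of the GFA2 specification (S: id, length, sequence; E: id, two oriented segment references, four positions, alignment) and should be checked against the spec before real use. The toy graph and the names `parse_gfa2` and `neighbors` are hypothetical, and tags, fragments, gaps and groups are ignored.

```python
from collections import defaultdict

# Toy three-contig graph (hypothetical); fields are tab-separated.
TOY_GFA2 = "\n".join([
    "H\tVN:Z:2.0",
    "S\tctg1\t8\tACGTACGT",
    "S\tctg2\t8\tACGTTTGA",
    "S\tctg3\t6\tTTGACC",
    "E\te1\tctg1+\tctg2+\t4\t8$\t0\t4\t*",
    "E\te2\tctg2+\tctg3+\t4\t8$\t0\t4\t*",
])


def parse_gfa2(text):
    """Parse S and E records; all other record types are skipped."""
    segments = {}
    edges = []
    for line in text.splitlines():
        fields = line.split("\t")
        if fields[0] == "S":
            segments[fields[1]] = {"length": int(fields[2]),
                                   "sequence": fields[3]}
        elif fields[0] == "E":
            ref1, ref2 = fields[2], fields[3]
            edges.append({
                "id": fields[1],
                # A reference is a segment id followed by '+' or '-'.
                "seg1": ref1[:-1], "orient1": ref1[-1],
                "seg2": ref2[:-1], "orient2": ref2[-1],
            })
    return segments, edges


def neighbors(edges):
    """Undirected adjacency: segment id -> set of linked segment ids."""
    adjacency = defaultdict(set)
    for edge in edges:
        adjacency[edge["seg1"]].add(edge["seg2"])
        adjacency[edge["seg2"]].add(edge["seg1"])
    return adjacency


segments, edges = parse_gfa2(TOY_GFA2)
graph = neighbors(edges)
```

Once the graph is in a structure like this, a visualizer, a scaffolder or an annotation tool can each work from the same intermediate file, which is the modularity the bullets above describe.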
Annotation – PacBio Iso-Seq
Produce full-length transcripts without assembly. The isoform sequencing (Iso-Seq) application generates full-length cDNA sequences, from the 5’ end of transcripts to the poly-A tail. Afterwards, the circular consensus sequence (CCS) algorithm produces high-quality isoforms.
10x Genomes [Linked Reads]
• Genome – Genome Resequencing
  • Call the full spectrum of variants (particularly long indels/CNV and structural variants) and unlock previously inaccessible regions from a single library at coverage equivalent to standard genome resequencing projects
• Exome – Subselect reads using capture techniques (Agilent)
  • Enable phasing of genes and detection of structural and copy number variation
  • Agilent SureSelect baits improve gene phasing by closing gaps and recovering hard-to-map loci in the genome (future kits to include previously failed regions)
In a nutshell
Laboratory Workflow
Shear?
The Math
1 ng input DNA = 300 copies of the genome
Calculations imply that about 50% of all possible fragments end up in a bead
@ Recommended Loading
• Each locus will have 150 molecules
• Each locus will have 30x read depth
• ~35 fragments per molecule @ 50 Kb molecules = 0.2x/molecule
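The arithmetic on this slide can be reproduced with a short calculation. This is a sketch under stated assumptions: an average base-pair mass of about 650 g/mol, a 3.2 Gb human genome (the size listed in the Supernova table), 2x150 bp read pairs (from the Illumina slide), and reading "fragments per molecule" as read pairs per molecule. The function names are hypothetical.

```python
AVOGADRO = 6.022e23
BP_MASS_G_PER_MOL = 650.0  # assumption: average mass of one base pair


def genome_copies(mass_ng, genome_bp):
    """Approximate number of genome copies in `mass_ng` nanograms of DNA."""
    grams = mass_ng * 1e-9
    return grams * AVOGADRO / (genome_bp * BP_MASS_G_PER_MOL)


def linked_read_math(copies=300, captured_fraction=0.5, read_depth=30.0,
                     molecule_bp=50_000, pair_bp=2 * 150):
    """Reproduce the slide's per-locus and per-molecule figures."""
    molecules_per_locus = copies * captured_fraction
    depth_per_molecule = read_depth / molecules_per_locus
    # Assumption: one "fragment" is one 2x150 bp read pair.
    pairs_per_molecule = depth_per_molecule * molecule_bp / pair_bp
    return molecules_per_locus, depth_per_molecule, pairs_per_molecule


copies_in_1ng = genome_copies(1.0, 3.2e9)        # roughly 290, i.e. ~300
molecules, depth, pairs = linked_read_math()     # 150 molecules, 0.2x, ~33 pairs
```

Under these assumptions each 50 Kb molecule is sampled by roughly 33 read pairs, close to the ~35 quoted above. Each molecule is therefore covered only sparsely (0.2x), and the 30x depth at a locus comes from pooling reads across about 150 barcoded molecules.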
Analysis – Biological Questions
• At recommended specs (for human genome)
  • Get ~30x coverage, adequate for standard variant analysis of SNPs and small indels
  • Increased mappability to difficult regions [multi-mapped reads can be resolved by considering linked-reads information], revealing variants previously undetermined
  • Detect large SV and CNV
  • Phased information
Detection of SV and CNV requires advanced computational techniques.
Phasing has been used extensively in GWAS (usually imputed) to enhance analysis and inferences, but has been slow to reach sequence-based data.
Potential applications for the technology are likely still yet to come.
Increased Mappability
Linked Reads
Capture – linked genes
Enrich reads of interest instead of random selection
Depending on the size of the capture, more samples can be pooled per lane
Thus, this TP53 C>T mutation is in trans with the deleted allele haplotype. As a result, the tumor contains only a single, inactivated copy of TP53. Taken together, this result shows the unambiguous biallelic inactivation of TP53.