Improved Hybrid Assembly of two strains of P. ramorum
Mathu Malar C., Jennifer Yuzon, Takao Kasuga and Sucheta Tripathy
UC Davis, CA, USA.
CSIR-Indian Institute of Chemical Biology, Kolkata, India.
Background
Phytophthora ramorum is a highly destructive pathogen with a wide host range that causes Sudden Oak Death in western North America and Sudden Larch Death in the UK.
P. ramorum was first reported in 1995, and the origin of the pathogen is still unclear.
P. ramorum can spread over several miles in mists, air currents, watercourses and rain splash. Phytophthora pathogens are also known to spread on footwear, dogs' paws, bicycle wheels, tools and equipment.
Parke, J. L., and S. Lucas. 2008. Sudden oak death and ramorum blight. The Plant Health Instructor. DOI: 10.1094/PHI-I-2008-0227-01
https://sites.google.com/site/phytophthoragenomicslab/home
Input Datasets

For strain Pr102
Platform   No. of reads generated   Total reads used for assembly   Organism   Read coverage
PacBio     435399                   33% and 47%                     Pr102      25X
Illumina   20942377                 20942377 (100%)                 Pr102      10X

For ND886
Platform   No. of reads generated   Total reads used for assembly   Organism   Read coverage
PacBio     402170                   285487 (70%)                    ND886      50X
Illumina   43676830                 43676830 (100%)                 ND886      50X
Assembly versions of P. ramorum (Pr102)
V1 assembly (Tyler BM et al., 2006): Sanger sequencing, 65 Mb, genome coverage 7.7X, total gaps 12 Mb.
V2 assembly (September 2015)
V3 assembly (December 2015)
V4 assembly (March 2016)
V5 assembly (April 2016)
Improved 3-way error correction protocol
PacBio Pr102: 435399 raw reads
Step 1: ECTools with Sanger unitigs from the 2006 Phyra V1 assembly
  Corrected: 147429 reads (33%)
  Uncorrected: 287970 reads (67%)
Step 2: ECTools with a mock intermediate assembly (Illumina reads + unitigs (V1) derived 6K, 20K simulated libraries, using ALLPATHS)
  Corrected: 1418 reads (0.49%)
  Uncorrected: 286552 reads (66.50%)
Step 3: PBCR auto error correction assembly used as input to ECTools for error correction
  Corrected: 57640 reads (13.2%)
  Uncorrected: 228912 reads (52%)
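The read counts in the three correction rounds can be tallied with a short script. This is only a bookkeeping sketch using the numbers reported on the slide: it assumes each round is run on the reads left uncorrected by the previous round, and the function and variable names are illustrative, not part of ECTools or PBCR.

```python
# Bookkeeping sketch for the three correction rounds (counts taken from the slide).
RAW_READS = 435_399

# (round label, number of reads corrected in that round)
ROUNDS = [
    ("ECTools + V1 Sanger unitigs", 147_429),
    ("ECTools + mock intermediate assembly", 1_418),
    ("PBCR assembly used as ECTools reference", 57_640),
]


def tally(raw_reads, rounds):
    """Return (total corrected, reads still uncorrected, per-round rows)."""
    remaining = raw_reads
    total_corrected = 0
    rows = []
    for label, corrected in rounds:
        remaining -= corrected          # assumption: each round only sees leftovers
        total_corrected += corrected
        percent_of_raw = 100 * corrected / raw_reads
        rows.append((label, corrected, remaining, percent_of_raw))
    return total_corrected, remaining, rows


total_corrected, remaining, rows = tally(RAW_READS, ROUNDS)
for label, corrected, left, pct in rows:
    print(f"{label}: corrected {corrected} ({pct:.2f}% of raw), uncorrected {left}")
print(f"Total corrected: {total_corrected} of {RAW_READS}")
```

The running total of corrected reads, 206487, is the figure quoted later as the input to the Celera assembly.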
An overview of assemblers and tools used in this study
ECTools
  Input: PacBio reads with a reference dataset (unitigs) for read error correction
  Function: Corrects errors in PacBio reads
PBCR (PacBioToCA)
  Input: PacBio reads
  Function: Error correction and assembly
Canu
  Input: PacBio reads
  Function: Successor of the PBCR assembler
SSPACE (stand-alone scaffolder of pre-assembled contigs using paired-read data)
  Input: Pre-assembled contigs, short reads (paired end and mate pair)
  Function: Not a de novo assembler; used for scaffolding and extending contigs
SSPACE Long Reads
  Input: Pre-assembled contigs and PacBio reads, especially long reads
  Function: Successor of SSPACE; performs better on a case-by-case basis
Dedupe
  Input: Sequence reads
  Function: Removes PCR duplicates and identical sequences prior to mapping
Redundans
  Input: Hybrid datasets
  Function: Recently developed (2016); specifically effective for heterozygous genomes
Improved Error corrected reads (49%)
Illumina reads
Dedupe
Redundans: 2325 scaffolds, 76 Mb, largest = 781884, N50 = 65030
Canu
Largest scaffold = 655506, smallest = 3055, total scaffolds = 920, N50 = 116386, size = 61 Mb
V3 Assembly
Celera
minimus
SSPACE
SSPACE Long Reads
1114 scaffolds, largest = 886281, smallest = 15009, total length = 79285078, gaps = 450621
Previous error correction protocol (33%)
V2 Assembly
Other Assembly Protocols
minimus
SSPACE
SSPACE Long Reads
SSPACE
SSPACE Long Reads
Improved Error corrected reads (49%)
V4 Assembly
Total error corrected reads 206487
Celera assembly with length cutoff 10k (2735 contigs, 77 Mb)
Input data for Redundans
Library                      No. of reads                    Read length
Illumina                     R1 = 10157419, R2 = 10784958    varies from 50 nt to 100 nt
V1 unitigs MP 20k            R1 = 28379, R2 = 28379          101
PacBio corrected MP 10k      R1 = 5234, R2 = 5234            150
PacBio corrected MP 6k       R1 = 59180, R2 = 59180          101
V1 unitigs (2006 assembly)   7589 (unitigs)                  variable
Protocol for V5 assembly
Redundans assembly: 65 Mb, 1825 scaffolds, N50 = 76861, largest = 781884, smallest = 2000
Comparison with Phyra unitigs using MUMmer
CAP3 on unmapped sequences from V1 unitigs, appended back to the assembly
No. of scaffolds = 2005, largest scaffold = 781884, smallest scaffold = 2000, N50 = 76032, total length = 67996746, gaps = 220 bases
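N50 is quoted for most assemblies in this deck. As a reminder of what the number means, here is a minimal sketch of the usual definition: the length of the scaffold at which the running sum of lengths, sorted from longest to shortest, first reaches half of the total assembly length. The function name and the toy lengths are illustrative only; real values would come from the scaffold FASTA.

```python
def n50(lengths):
    """Return the N50 of a list of scaffold (or contig) lengths."""
    if not lengths:
        raise ValueError("need at least one scaffold length")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # First scaffold at which the cumulative length covers half the assembly.
        if 2 * running >= total:
            return length


# Toy example with made-up lengths (not data from the assemblies above).
toy_lengths = [2, 3, 4, 5, 6, 7, 8, 9, 10]
print(n50(toy_lengths))
```

Because N50 depends on the total length, two assemblies of different size are only roughly comparable by N50 alone, which is why scaffold counts, largest scaffold and gap totals are reported alongside it.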
Assembly Statistics
[Figure: bar charts comparing assemblies V1 to V5. Panels: assembly sizes in Mb; number of scaffolds; gaps in nucleotides.]
Assembly validation using QUAST
[Figure: gaps filled in the new PacBio genome assembly. Annotations: gaps filled; scaffolds broken; mis-assemblies.]
Gene prediction statistics using Augustus and mapping statistics
[Figure: bar charts for assemblies V1 to V5. Series: number of genes; average gene length; largest gene.]
Repeat Regions captured in the genome
[Figure: bar chart for assemblies V1 to V5, vertical axis 0% to 30%. Series: bases masked; total interspersed repeats; LTR/Gypsy elements.]
CEGMA comparisons among all assemblies
[Figure: one panel per assembly, labelled V5, V1, V2, V3 and V4.]
Genome assembly completeness
Assembly version   Core proteins (of 248 highly conserved CEGs)   % completeness   Core genes present in genome (of 458)   Unique gene
V1                 236                                            95.16            412                                     KOG0948 Nuclear exosomal RNA helicase MTR4
V2                 237                                            95.56            412                                     KOG0434 Isoleucyl-tRNA synthetase
V3                 236                                            95.16            413                                     KOG0734 AAA+-type ATPase containing the peptide
V4                 237                                            95.56            416                                     KOG2311 NAD/FAD-utilizing protein
V5                 238                                            95.97            414                                     KOG1158 NADP/FAD dependent oxidoreductase
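The completeness percentages follow directly from the core protein counts: each is the number of complete core proteins found divided by the 248 highly conserved CEGs. A small sketch of that arithmetic, using the counts reported for the Pr102 assemblies (the function name is illustrative, not part of CEGMA):

```python
CEG_TOTAL = 248  # size of the highly conserved CEG set used on the slide

# Complete core proteins found per Pr102 assembly, as reported above.
core_counts = {"V1": 236, "V2": 237, "V3": 236, "V4": 237, "V5": 238}


def completeness(found, total=CEG_TOTAL):
    """Percentage of the core set recovered, rounded to two decimals."""
    return round(100 * found / total, 2)


for version, found in core_counts.items():
    print(f"{version}: {found}/{CEG_TOTAL} = {completeness(found)}%")
```

The spread between versions is two proteins at most, which is consistent with the deck's later point that assembly improvements changed the gene space very little.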
Effector Prediction Pipeline (V5 Assembly)
SignalP-predicted protein sequences (7159)
Removed proteins with transmembrane domains. RXLR motifs on the N terminus (373 sequences)
Motif prediction with MEME (W, Y domain)
343 sequences were detected in MEME
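The RXLR screening step can be illustrated with a simple pattern search. This is a hedged sketch only: the slide does not state how the motif search was implemented or how far into the protein it looked, so the 100-residue N-terminal window, the function name and the toy sequences below are all assumptions made for illustration.

```python
import re

# R, any residue, L, R: the RXLR motif named on the slide.
RXLR = re.compile(r"R.LR")


def has_nterm_rxlr(protein, nterm_len=100):
    """True if an RXLR motif occurs within the first `nterm_len` residues.

    `nterm_len` is a hypothetical cutoff, not a value given in the deck.
    """
    return RXLR.search(protein[:nterm_len]) is not None


# Toy sequences (made up, not real P. ramorum proteins).
candidates = {
    "toy_effector": "M" * 40 + "RSLR" + "A" * 100,   # motif near the N terminus
    "toy_late_motif": "A" * 150 + "RSLR",            # motif too far downstream
    "toy_no_motif": "MKLAVTS" * 20,                  # no motif at all
}
hits = [name for name, seq in candidates.items() if has_nterm_rxlr(seq)]
print(hits)
```

In the pipeline described above this filter would run only on proteins that already passed the SignalP and transmembrane-domain steps, with MEME used afterwards for the W and Y motifs.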
ND886 genome assembly
ND886 error correction and read statistics
PacBio raw reads: 402170
ECTools with Sanger unitigs (V1 assembly)
  Corrected: 285487 reads (70.9%)
  Uncorrected: 29.1%
ND886 assembly
Total error corrected reads 285487
Celera Assembly
Minimus
Dedupe
Read statistics
Library                   No. of reads                    Read length
Illumina                  R1 = 28389986, R2 = 28334221    varies from 50 nt to 100 nt
PacBio corrected MP 10k   R1 = 91555, R2 = 91555          101
PacBio corrected MP 6k    R1 = 131703, R2 = 131703        101
SSPACE (with Illumina reads): total contigs = 6443, largest contig = 648889, smallest contig = 2098, assembly size = 150 Mb
Redundans: no. of scaffolds = 2225, largest = 648906, smallest = 2745, N50 = 48161, total length = 92877686, gaps = 4133
Assembly   No. of core proteins (of 248)   % completeness   No. of core genes (of 458)
ND886      234                             94.35            410
Comparison of ND886 against Pr102 2006 assembly
[Figure: comparison plot. Labels: P. ramorum ND886; P. ramorum Pr102 (2006).]
Summary of work
De novo assemblers alone are not enough for a good genome assembly.
PacBio reads are error-prone, and a single error correction protocol does not always produce the best result.
Hybrid assembly combined with scaffolders and duplicate removers is effective for assembly.
No single protocol worked best for both genomes; protocols have to be mixed and matched.
Assembly improvement does not necessarily change the gene space; rather, it works better for repetitive regions and for correcting the assembly.
Acknowledgements
Repeat Regions captured in the genome
Assembly name                       V1       V2       V3       V4       V5
Bases masked                        11.77%   20.68%   27.00%   21.06%   24.34%
Small RNA                           0.01%    0.00%    0.45%    0.11%    0.07%
Simple repeats                      0.36%    0.44%    0.75%    0.53%    0.49%
Low complexity                      0.03%    0.04%    0.12%    0.05%    0.06%
GC content                          53.86%   54.32%   52.40%   54.09%   53.98%
Total interspersed repeats          11.37%   20.20%   25.70%   20.37%   23.73%
LINE [R2/R4/NeSL]                   0.13%    0.32%    0.39%    0.29%    0.37%
LTR elements: Ty1/Copia             10.01%   5.64%    23.85%   18.70%   21.89%
LTR elements: Gypsy/DIRS1           1.23%    1.48%    1.46%    1.38%    1.47%
DNA transposon: PiggyBac            0.16%    0.16%    0.17%    0.17%    0.15%
DNA transposon: Tourist/Harbinger   0.01%    0.01%    0.01%    0.01%    0.01%
Why PacBio sequencing?
Long reads, ranging from 14,000 to 48,000 base pairs, longer than Sanger and NGS reads.
Shortest run time (30 mins).
Least GC bias.
No amplification bias.
Handles highly repetitive genomes and can fill gaps efficiently.
Reference: http://www.pacificbiosciences.com/products/smrt-technology/smrt-sequencing-advantage/
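Repeat Regions captured in the genome
P. ramorum 2006
  Bases masked: 7847064 bp (11.77%)
  Small RNA: 11 (6033 bp) 0.01%
  Simple repeats: 5336 (242077 bp) 0.36%
  Low complexity: 422 (20747 bp) 0.03%
  GC content: 53.86%
  Total interspersed repeats: 7580618 bp (11.37%)
  LINE [R2/R4/NeSL]: 53 (88470 bp) 0.13%
  LTR elements, Ty1/Copia: 5972 (6669143 bp) 10.01%
  LTR elements, Gypsy/DIRS1: 1174 (823005 bp) 1.23%
  DNA transposon, PiggyBac: 200 (104977 bp) 0.16%
  DNA transposon, Tourist/Harbinger: 12 (5609 bp) 0.01%
Protocol 1b
  Bases masked: 16553511 bp (24.34%)
  Small RNA: 75 (49453 bp) 0.07%
  Simple repeats: 7122 (331100 bp) 0.49%
  Low complexity: 816 (40373 bp) 0.06%
  GC content: 53.98%
  Total interspersed repeats: 16138229 bp (23.73%)
  LINE [R2/R4/NeSL]: 87 (250632 bp) 0.37%
  LTR elements, Ty1/Copia: 8822 (14885437 bp) 21.89%
  LTR elements, Gypsy/DIRS1: 1419 (1002160 bp) 1.47%
  DNA transposon, PiggyBac: 198 (104684 bp) 0.15%
  DNA transposon, Tourist/Harbinger: 13 (5809 bp) 0.01%
Protocol 2
  Bases masked: 21185972 bp (27.00%)
  Small RNA: 605 (349328 bp) 0.45%
  Simple repeats: 11702 (586604 bp) 0.75%
  Low complexity: 1787 (91389 bp) 0.12%
  GC content: 52.40%
  Total interspersed repeats: 20163370 bp (25.70%)
  LINE [R2/R4/NeSL]: 112 (308607 bp) 0.39%
  LTR elements, Ty1/Copia: 11127 (18710327 bp) 23.85%
  LTR elements, Gypsy/DIRS1: 1756 (1144436 bp) 1.46%
  DNA transposon, PiggyBac: 231 (129697 bp) 0.17%
  DNA transposon, Tourist/Harbinger: 12 (5417 bp) 0.01%
Protocol 3
  Bases masked: 12854764 bp (21.06%)
  Small RNA: 64 (69255 bp) 0.11%
  Simple repeats: 6801 (323105 bp) 0.53%
  Low complexity: 679 (33221 bp) 0.05%
  GC content: 54.09%
  Total interspersed repeats: 12434182 bp (20.37%)
  LINE [R2/R4/NeSL]: 64 (176881 bp) 0.29%
  LTR elements, Ty1/Copia: 6752 (11415393 bp) 18.70%
  LTR elements, Gypsy/DIRS1: 1133 (841908 bp) 1.38%
  DNA transposon, PiggyBac: 191 (105824 bp) 0.17%
  DNA transposon, Tourist/Harbinger: 12 (6211 bp) 0.01%
Bangalore meeting
  Bases masked: 16192690 bp (20.68%)
  Small RNA: 8 (3317 bp) 0.00%
  Simple repeats: 7549 (340933 bp) 0.44%
  Low complexity: 699 (33372 bp) 0.04%
  GC content: 54.32%
  Total interspersed repeats: 15819353 bp (20.20%)
  LINE [R2/R4/NeSL]: 92 (250092 bp) 0.32%
  LTR elements, Ty1/Copia: 2498 (4413567 bp) 5.64%
  LTR elements, Gypsy/DIRS1: 1560 (1155118 bp) 1.48%
  DNA transposon, PiggyBac: 228 (126831 bp) 0.16%
  DNA transposon, Tourist/Harbinger: 15 (6376 bp) 0.01%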
Genome assembly and error correction
Error correction
Gene prediction statistics using Augustus and mapping of predicted genes against all genome assemblies
Assembly   No. of genes predicted   Average gene length   Largest gene   Mapping with V1   Mapping with V2   Mapping with V3   Mapping with V4   Mapping with V5
V1         16134                    1673                  21479          NA                15978             15645             15855             16072
V2         20741                    2162.78               31832          20739             NA                20377             20519             20675
V3         15110                    2005.05               46572          15055             15019             NA                14990             15073
V4         17311                    1821.26               47518          17307             17245             16906             NA                17277
V5         19278                    1829.68               31832          19273             19167             18861             19051             NA
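The mapping columns are easier to compare as fractions of the genes predicted on each assembly. The sketch below does this for the V5 row only, assuming each "Mapping with" value is the number of V5-predicted genes that map to the named assembly; that reading of the table, and the names used in the code, are assumptions rather than something the slide states.

```python
V5_PREDICTED = 19278  # genes predicted by Augustus on the V5 assembly

# V5-predicted genes that map to each of the other assemblies (from the table).
v5_mapped = {"V1": 19273, "V2": 19167, "V3": 18861, "V4": 19051}


def mapped_percent(mapped, predicted):
    """Share of predicted genes that map, as a percentage with two decimals."""
    return round(100 * mapped / predicted, 2)


v5_percentages = {
    assembly: mapped_percent(count, V5_PREDICTED)
    for assembly, count in v5_mapped.items()
}
for assembly, pct in v5_percentages.items():
    print(f"V5 genes mapping to {assembly}: {pct}%")
```

Under that reading, nearly all V5 gene models are also found in the 2006 V1 assembly, in line with the summary statement that assembly improvement did not substantially change the gene space.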
Comparison plot of V1 vs V5 Assembly
[Figure: NUCmer and PROmer comparison plots. Labels: Pr102 2006 assembly; Pr102 2016 PacBio assembly.]