Improved Hybrid Assembly of two strains of P. ramorum Mathu Malar C., Jennifer Yuzon, Takao Kasuga and Sucheta Tripathy UC Davis, CA, USA. CSIR- Indian Institute of Chemical Biology, Kolkata, India.


Page 1: Ramorum2016 final

Improved Hybrid Assembly of two strains of P. ramorum

Mathu Malar C., Jennifer Yuzon, Takao Kasuga and Sucheta Tripathy

UC Davis, CA, USA.

CSIR-Indian Institute of Chemical Biology, Kolkata, India.

Page 2: Ramorum2016 final

Background

Phytophthora ramorum is a highly destructive pathogen with a wide host range that causes Sudden Oak Death in western North America and Sudden Larch Death in the UK.

P. ramorum was first reported in 1995, and the origin of the pathogen is still unclear.

P. ramorum can spread over several miles in mist, air currents, watercourses and rain splash. Phytophthora pathogens are also known to spread on footwear, dogs’ paws, bicycle wheels, tools and equipment.

Parke, J. L., and S. Lucas. 2008. Sudden oak death and ramorum blight. The Plant Health Instructor. DOI: 10.1094/PHI-I-2008-0227-01

https://sites.google.com/site/phytophthoragenomicslab/home

Page 3: Ramorum2016 final

Input Datasets

For strain Pr102

Platform | No of reads generated | Total reads used for assembly | Organism | Read coverage
PacBio | 435399 | 33% and 47% | Pr102 | 25X
Illumina | 20942377 | 20942377 (100%) | Pr102 | 10X

For ND886

Platform | No of reads generated | Total reads used for assembly | Organism | Read coverage
PacBio | 402170 | 285487 (70%) | ND886 | 50X
Illumina | 43676830 | 43676830 (100%) | ND886 | 50X

Page 4: Ramorum2016 final

Assembly versions of P. ramorum (Pr102)

V1 assembly (Tyler BM et al., 2006): Sanger sequencing, 65 Mb, genome coverage 7.7X, total gaps 12 Mb.
V2 assembly (September 2015)
V3 assembly (December 2015)
V4 assembly (March 2016)
V5 assembly (April 2016)

Page 5: Ramorum2016 final

Improved 3-way error correction protocol

PacBio Pr102: 435399 raw reads

Step 1: ECTools with Sanger unitigs from the 2006 Phyra V1 assembly
Corrected: 147429 reads (33%)
Uncorrected: 287970 reads (67%)

Step 2: ECTools with mock intermediate assembly (Illumina reads + unitigs (V1) derived 6K, 20K simulated libraries, using ALLPATHS)
Corrected: 1418 reads (0.49%)
Uncorrected: 286552 reads (66.50%)

Step 3: PBCR auto error correction assembly used as input to ECTools for error correction
Corrected: 57640 reads (13.2%)
Uncorrected: 228912 reads (52%)
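The read counts on this slide can be cross-checked with a short tally. The sketch below is only an illustration in Python: the counts are copied from the slide, while the shortened pass labels and the function name `tally` are hypothetical. It assumes each pass is applied to the reads left uncorrected by the previous pass, which is consistent with the uncorrected counts shown.

```python
# Read counts copied from the slide; pass labels are shortened for illustration.
RAW_READS = 435399

corrected_per_pass = {
    "ECTools + Sanger unitigs (V1)": 147429,
    "ECTools + mock intermediate assembly": 1418,
    "ECTools + PBCR auto-corrected assembly": 57640,
}


def tally(raw_reads, corrected):
    """Return one row per pass: (name, corrected so far, percent of raw, still uncorrected)."""
    rows = []
    running = 0
    for name, n in corrected.items():
        running += n
        rows.append((name, running, 100.0 * running / raw_reads, raw_reads - running))
    return rows


for name, total, pct, remaining in tally(RAW_READS, corrected_per_pass):
    print(f"{name}: {total} corrected so far ({pct:.1f}%), {remaining} still uncorrected")
```

Under that assumption the running total after the third pass is 206487 reads, about 47% of the raw reads, which agrees with the "Total error corrected reads 206487" figure used for the V5 protocol.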

Page 6: Ramorum2016 final

An Overview of Assemblers and tools used in this study

Tool | Input type | Function
ECTools | PacBio reads with a reference dataset (unitigs) for read error correction | Correcting errors in PacBio reads
PBCR (PacBioToCA) | PacBio reads | Error correction and assembly
Canu | PacBio reads | Successor of the PBCR assembler
SSPACE (stand-alone scaffolder of pre-assembled contigs using paired-read data) | Pre-assembled contigs, short reads (paired end and mate pair) | Not a de novo assembler; used for scaffolding and extending contigs
SSPACE Long Reads | Pre-assembled contigs plus long reads, especially PacBio reads | Successor of SSPACE; performs better on a case-by-case basis
Dedupe | Sequence reads | Removes PCR duplicates and identical sequences prior to mapping
Redundans | Hybrid datasets | Recently developed (2016); specifically effective for heterozygous genomes

Page 7: Ramorum2016 final

Other Assembly Protocols

[Flow diagram; box labels in extraction order]

Inputs: previous error corrected protocol (33%); improved error corrected reads (49%); Illumina reads
Tools: Celera, minimus, SSPACE, SSPACE Long Reads, Canu, Dedupe, Redundans
Outputs: V2 Assembly, V3 Assembly, V4 Assembly

Statistics shown in the diagram:
Redundans: 2325 scaffolds, 76 Mb, largest = 781884, N50 = 65030
Canu: total scaffolds = 920, largest scaffold = 655506, smallest = 3055, N50 = 116386, size = 61 Mb
1114 scaffolds, largest = 886281, smallest = 15009, total length = 79285078, gaps = 450621

Page 8: Ramorum2016 final

Protocol for V5 assembly

Total error corrected reads: 206487

Celera assembly with length cut-off 10k (2735 contigs, 77 Mb)

Input data for Redundans

Library | No reads | Read length
Illumina | R1=10157419, R2=10784958 | varies from 50 nt to 100 nt
V1 unitigs MP 20k | R1=28379, R2=28379 | 101
PacBio corrected MP 10k | R1=5234, R2=5234 | 150
PacBio corrected MP 6k | R1=59180, R2=59180 | 101
V1 unitigs (2006 assembly) | 7589 (unitigs) | variable

Redundans assembly: 65 Mb, 1825 scaffolds, N50 = 76861, largest = 781884, smallest = 2000

Comparison with Phyra unitigs using MUMmer; CAP3 on unmapped sequences from V1 unitigs, appended back to the assembly

No of scaffolds = 2005, largest scaffold = 781884, smallest scaffold = 2000, N50 = 76032, total length = 67996746, gaps = 220 bases
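N50 is quoted for most of the assemblies on these slides. As a reminder of what the number means, here is a minimal Python sketch of the usual definition: the scaffold length at which the running sum of lengths, sorted longest first, reaches half of the total assembly length. The function name `n50` and the toy lengths are hypothetical and are not taken from the assemblies.

```python
def n50(lengths):
    """Length L such that scaffolds of length >= L cover at least half of the total length."""
    if not lengths:
        raise ValueError("no scaffold lengths given")
    half = sum(lengths) / 2
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= half:
            return length


# Toy scaffold lengths (hypothetical), total = 290, half = 145.
toy = [100, 80, 50, 30, 20, 10]
print(n50(toy))  # 100 + 80 = 180 >= 145, so this prints 80
```

In practice the lengths would be read from the scaffold FASTA file; tools such as QUAST report the same statistic.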

Page 9: Ramorum2016 final

Assembly Statistics

[Bar charts for V1 to V5: assembly sizes in Mb; number of scaffolds; gaps in nucleotides]

Page 10: Ramorum2016 final

Gaps filled in the new PacBio genome assembly

[Figure; annotations: gaps filled, scaffolds broken, mis-assemblies]

Page 11: Ramorum2016 final

Assembly validation using Quast

Page 12: Ramorum2016 final

Gene prediction statistics using Augustus and mapping statistics

[Bar charts for V1 to V5: number of genes; average gene length; largest gene]

Page 13: Ramorum2016 final

Repeat Regions captured in the genome

[Bar chart for V1 to V5, 0% to 30%: bases masked; total interspersed repeats; LTR / Gypsy elements]

Page 14: Ramorum2016 final

CEGMA comparisons among all assemblies

[Figure comparing V1, V2, V3, V4 and V5]

Page 15: Ramorum2016 final

Genome assembly completeness

Assembly version | No of core proteins (out of 248 complete, highly conserved CEGs) | Unique gene | % completeness | Core genes present in genome (out of 458)
V1 | 236 | KOG0948 Nuclear exosomal RNA helicase MTR4 | 95.16 | 412
V2 | 237 | KOG0434 Isoleucyl-tRNA synthetase | 95.56 | 412
V3 | 236 | KOG0734 AAA+-type ATPase containing the peptide | 95.16 | 413
V4 | 237 | KOG2311 NAD/FAD-utilizing protein | 95.56 | 416
V5 | 238 | KOG1158 NADP/FAD dependent oxidoreductase | 95.97 | 414
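The "% completeness" values appear to be the number of core proteins found divided by the 248 CEGs. The Python sketch below checks that reading against the counts on the slides; the names `CORE_SET_SIZE`, `complete_cegs` and `completeness` are hypothetical, and the ND886 count is taken from the ND886 assembly slide.

```python
CORE_SET_SIZE = 248  # highly conserved CEGs, as stated on the slide

# Core proteins found per assembly, copied from the slides.
complete_cegs = {"V1": 236, "V2": 237, "V3": 236, "V4": 237, "V5": 238, "ND886": 234}


def completeness(found, total=CORE_SET_SIZE):
    """Percent of the core set that was found, rounded to two decimals."""
    return round(100.0 * found / total, 2)


for name, found in complete_cegs.items():
    print(f"{name}: {found}/{CORE_SET_SIZE} = {completeness(found):.2f}%")
```

The computed values (95.16, 95.56, 95.16, 95.56, 95.97 and 94.35) agree with the percentages reported on the slides.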

Page 16: Ramorum2016 final

Effector Prediction Pipeline (V5 Assembly)

SignalP-predicted protein sequences (7159)
Removed proteins with transmembrane domains; RXLR motifs on the N terminus (373 sequences)
Motif prediction with MEME (W Y domain)
343 sequences were detected in MEME
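The RXLR step of the pipeline can be pictured as a pattern search near the N terminus of each secreted protein. The Python sketch below is a simplified illustration only, not the pipeline used in the study: it does not run SignalP, a transmembrane predictor or MEME, and the signal peptide length (20) and search window (40) are assumed values chosen for the example. The function name and both toy sequences are hypothetical.

```python
import re

# R-x-L-R: arginine, any residue, leucine, arginine.
RXLR = re.compile(r"R.LR")


def has_nterminal_rxlr(protein, signal_len=20, window=40):
    """True if an RxLR motif starts within `window` residues after the assumed signal peptide.

    signal_len and window are illustrative assumptions, not values from the slides.
    """
    # Extend the slice by 3 so a motif starting at the last window position still fits.
    region = protein[signal_len:signal_len + window + 3]
    return RXLR.search(region) is not None


# Hypothetical toy sequences.
with_motif = "M" + "K" * 19 + "TTSRSLRGG" + "A" * 50
motif_too_far = "M" + "K" * 19 + "A" * 60 + "RSLR"

print(has_nterminal_rxlr(with_motif))     # True
print(has_nterminal_rxlr(motif_too_far))  # False
```

A real screen would take the signal peptide cleavage site from the SignalP output for each protein instead of a fixed length.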

Page 17: Ramorum2016 final

ND886 genome assembly

Page 18: Ramorum2016 final

ND886 error correction and read statistics

PacBio raw reads: 402170
ECTools with Sanger unitigs (V1 assembly)
Corrected: 285487 reads (70.9%)
Uncorrected: (29.1%)

Page 19: Ramorum2016 final

ND886 assembly

Total error corrected reads: 285487

Tools: Celera assembly, Minimus, Dedupe

Read statistics

Library | No reads | Read length
Illumina | R1=28389986, R2=28334221 | varies from 50 nt to 100 nt
PacBio corrected MP 10k | R1=91555, R2=91555 | 101
PacBio corrected MP 6k | R1=131703, R2=131703 | 101

SSPACE [with Illumina reads]: total contigs = 6443, largest contig = 648889, smallest contig = 2098, assembly size = 150 Mb

Redundans: No of scaffolds = 2225, largest = 648906, smallest = 2745, N50 = 48161, total length = 92877686, gaps = 4133

Assembly | No of core proteins (out of 248) | % completeness | No of core genes (out of 458)
ND886 | 234 | 94.35 | 410

Page 20: Ramorum2016 final

Comparison of ND886 against Pr102 2006 assembly

[Comparison plot; labels: P. ramorum ND886, P. ramorum Pr102 (2006)]

Page 21: Ramorum2016 final

Summary of work

De novo assemblers alone are not enough for a good genome assembly.

PacBio reads are marred with errors, and one error correction protocol alone does not always produce the best result.

Hybrid assembly in combination with scaffolders and duplicate removers is effective for assembly.

No single protocol worked best for both genomes; the tools had to be mixed and matched.

Assembly improvement does not necessarily change the gene space; rather, it works better for repetitive regions and for correcting the assembly.

Page 22: Ramorum2016 final

Acknowledgements

Page 23: Ramorum2016 final

Repeat Regions captured in the genome

Assembly name | V1 | V2 | V3 | V4 | V5
Bases masked | 11.77% | 20.68% | 27.00% | 21.06% | 24.34%
Small RNA | 0.01% | 0.00% | 0.45% | 0.11% | 0.07%
Simple repeats | 0.36% | 0.44% | 0.75% | 0.53% | 0.49%
Low complexity | 0.03% | 0.04% | 0.12% | 0.05% | 0.06%
GC content | 53.86% | 54.32% | 52.40% | 54.09% | 53.98%
Total interspersed repeats | 11.37% | 20.20% | 25.70% | 20.37% | 23.73%
LINE [R2/R4/NeSL] | 0.13% | 0.32% | 0.39% | 0.29% | 0.37%
LTR elements (Ty1/Copia, Gypsy/DIRS1) | 10.01% | 5.64% | 23.85% | 18.70% | 21.89%
DNA transposon | 1.23% | 1.48% | 1.46% | 1.38% | 1.47%
PiggyBac | 0.16% | 0.16% | 0.17% | 0.17% | 0.15%
Tourist/Harbinger | 0.01% | 0.01% | 0.01% | 0.01% | 0.01%

Page 24: Ramorum2016 final

Why PacBio sequencing?

Long reads, ranging from 14,000 to 48,000 base pairs, longer than Sanger and NGS reads.

Shortest run time (30 mins).

Least GC bias.

No amplification bias.

Handles highly repetitive genomes and can fill gaps efficiently.

Reference: http://www.pacificbiosciences.com/products/smrt-technology/smrt-sequencing-advantage/

Page 25: Ramorum2016 final

Repeat Regions captured in the genome

Each cell gives the number of elements, the total length in bp, and the percentage of the assembly, where reported.

Assembly name | P. ramorum 2006 | Protocol 1b | Protocol 2 | Protocol 3 | Bangalore meeting
Bases masked | 7847064 bp (11.77%) | 16553511 bp (24.34%) | 21185972 bp (27.00%) | 12854764 bp (21.06%) | 16192690 bp (20.68%)
Small RNA | 11 (6033 bp) 0.01% | 75 (49453 bp) 0.07% | 605 (349328 bp) 0.45% | 64 (69255 bp) 0.11% | 8 (3317 bp) 0.00%
Simple repeats | 5336 (242077 bp) 0.36% | 7122 (331100 bp) 0.49% | 11702 (586604 bp) 0.75% | 6801 (323105 bp) 0.53% | 7549 (340933 bp) 0.44%
Low complexity | 422 (20747 bp) 0.03% | 816 (40373 bp) 0.06% | 1787 (91389 bp) 0.12% | 679 (33221 bp) 0.05% | 699 (33372 bp) 0.04%
GC content | 53.86% | 53.98% | 52.40% | 54.09% | 54.32%
Total interspersed repeats | 7580618 bp (11.37%) | 16138229 bp (23.73%) | 20163370 bp (25.70%) | 12434182 bp (20.37%) | 15819353 bp (20.20%)
LINE [R2/R4/NeSL] | 53 (88470 bp) 0.13% | 87 (250632 bp) 0.37% | 112 (308607 bp) 0.39% | 64 (176881 bp) 0.29% | 92 (250092 bp) 0.32%
LTR elements (Ty1/Copia, Gypsy/DIRS1) | 5972 (6669143 bp) 10.01% | 8822 (14885437 bp) 21.89% | 11127 (18710327 bp) 23.85% | 6752 (11415393 bp) 18.70% | 2498 (4413567 bp) 5.64%
DNA transposon | 1174 (823005 bp) 1.23% | 1419 (1002160 bp) 1.47% | 1756 (1144436 bp) 1.46% | 1133 (841908 bp) 1.38% | 1560 (1155118 bp) 1.48%
PiggyBac | 200 (104977 bp) 0.16% | 198 (104684 bp) 0.15% | 231 (129697 bp) 0.17% | 191 (105824 bp) 0.17% | 228 (126831 bp) 0.16%
Tourist/Harbinger | 12 (5609 bp) 0.01% | 13 (5809 bp) 0.01% | 12 (5417 bp) 0.01% | 12 (6211 bp) 0.01% | 15 (6376 bp) 0.01%

Page 26: Ramorum2016 final

Genome assembly and error correction

Error correction

Page 27: Ramorum2016 final
Page 28: Ramorum2016 final

Gene prediction statistics using Augustus and mapping of predicted genes against all genome assemblies

Assembly | No of genes predicted | Average gene length | Largest gene | Mapping with V1 assembly | Mapping with V2 assembly | Mapping with V3 assembly | Mapping with V4 assembly | Mapping with V5 assembly
V1 | 16134 | 1673 | 21479 | NA | 15978 | 15645 | 15855 | 16072
V2 | 20741 | 2162.78 | 31832 | 20739 | NA | 20377 | 20519 | 20675
V3 | 15110 | 2005.05 | 46572 | 15055 | 15019 | NA | 14990 | 15073
V4 | 17311 | 1821.26 | 47518 | 17307 | 17245 | 16906 | NA | 17277
V5 | 19278 | 1829.68 | 31832 | 19273 | 19167 | 18861 | 19051 | NA
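The mapping columns are easier to compare as fractions of the genes predicted on each assembly. The Python sketch below assumes each row gives the number of genes predicted on that assembly that could be mapped to the assembly named in the column; that reading is an assumption, and the names `predicted`, `mapped` and `mapped_percent` are hypothetical. The counts are copied from the table.

```python
# Genes predicted on each assembly (from the table).
predicted = {"V1": 16134, "V2": 20741, "V3": 15110, "V4": 17311, "V5": 19278}

# mapped[source][target]: genes predicted on `source` that mapped to `target` (assumed reading).
mapped = {
    "V1": {"V2": 15978, "V3": 15645, "V4": 15855, "V5": 16072},
    "V2": {"V1": 20739, "V3": 20377, "V4": 20519, "V5": 20675},
    "V3": {"V1": 15055, "V2": 15019, "V4": 14990, "V5": 15073},
    "V4": {"V1": 17307, "V2": 17245, "V3": 16906, "V5": 17277},
    "V5": {"V1": 19273, "V2": 19167, "V3": 18861, "V4": 19051},
}


def mapped_percent(source, target):
    """Percent of genes predicted on `source` that mapped to `target`."""
    return 100.0 * mapped[source][target] / predicted[source]


for source, targets in mapped.items():
    for target, count in targets.items():
        print(f"{source} genes on {target}: {count} ({mapped_percent(source, target):.2f}%)")
```

Under this reading, every pairwise fraction is above 96%, in line with the summary point that assembly improvement does not necessarily change the gene space.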

Page 29: Ramorum2016 final

Comparison plot of V1 vs V5 assembly

[Plots: NUCmer and PROmer; labels: Pr102 2006 assembly, Pr102 2016 PacBio assembly]