Using BioNano Maps to Improve an Insect Genome Assembly


Sue Brown Oct 23, 2014

Genetic model organism for developmental, physiological, and toxicological studies

• Easily cultured
• Short generation time
• Small genome size
• Molecular and visible marker genetic maps
• Genetic tools: balancers, deficiencies
• Genomic libraries: lambda and BAC
• cDNA libraries
• Mutant analysis and RNAi
• Transformation
• 7x Sanger draft genome (Nature, 2008)

Tribolium castaneum, the red flour beetle

• Genome size: 200 Mb (Cot analysis)
• 9 autosomes, X and Y
• Low methylation
• Long period interspersion

Jeff Stuart, Purdue University

Tribolium Genome

Molecular map markers were used to anchor scaffolds to chromosome builds

Few X markers, no Y markers, variable marker density

Tcas 3.0 Reference Genome stats from NCBI

Input file          N50 (Mb)   Number   Cumulative Length (Mb)
Genome contigs      0.04       8814     160.74
Genome scaffolds    0.98       481      152.53

Unmapped scaffolds: 352
Unmapped contigs: 1884
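The N50 values reported in these tables can be reproduced from a list of sequence lengths. The sketch below uses the common definition of N50 (the length at which the sorted, cumulative total first reaches half of the assembly); the function name and the toy lengths are illustrative assumptions, not values from the Tribolium assembly.

```python
def n50(lengths):
    """Return the N50: the length L such that sequences of length >= L
    together contain at least half of the total assembled bases."""
    if not lengths:
        raise ValueError("need at least one sequence length")
    half_total = sum(lengths) / 2
    running = 0
    # Walk from the longest sequence down until half the total is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= half_total:
            return length


# Hypothetical scaffold lengths in bp (not from the Tribolium assembly).
toy_lengths = [4_000_000, 3_000_000, 2_000_000, 1_000_000]
toy_n50 = n50(toy_lengths)
```

With these toy lengths the longest scaffold alone covers 4 of 10 Mb, so the second scaffold is the one that crosses the halfway point.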

Algorithms and filters used to improve the Tribolium draft assembly with physical maps based on imaging ultra-long single DNA molecules

Jennifer Shelton 2014

Data formats

BNX - text file of molecules
CMAP - text file of consensus maps
in silico CMAP - CMAP generated from the genome FASTA
BNG CMAP - CMAP from assembled molecules
XMAP - text file of the alignment of two CMAPs

[Figure: in silico CMAP 1 and in silico CMAP 2 (from the genome FASTA) aligned to BNG CMAP 1 and BNG CMAP 2 (from assembled molecules).]

3) Use the sequence reference to adjust molecule stretch for each scan

Assembly Pipeline

Assembly workflow:
1) The Irys produces TIFF files that are converted into BNX text files.
2) Each chip produces one BNX file for each of two flowcells.
3) BNX files are split by scan and aligned to the sequence reference. Stretch (bases per pixel) is recalculated from the alignment.
4) Quality check graphs are created for each pre-adjusted flowcell BNX.
5) Adjusted flowcell BNXs are merged.
6) The first assemblies are run with a variety of p-value thresholds (strict, default, and relaxed).
7) The best of the first assemblies (red oval) is chosen and a version of this assembly is produced with a variety of minimum molecule length filters (strict and relaxed).

[Figure: pipeline diagram. Flowcell BNX files are split into scan BNX files, adjusted (adjBNX), and merged (mergeBNX) before the first and second assemblies.]

In recent datasets, when SNR is low and alignment is good, we see a spike in bases per pixel (bpp) in the first scan, followed by a plateau and then a lower plateau

Assembly Pipeline

[Figure: the first scan in a flow cell is marked.]

5) Use sequence reference to determine assembly noise parameters. Estimated genome size is used to set the p-value threshold.

Assembly Pipeline


6/7) Variants of the starting p-value and default minimum molecule length are explored in nine assemblies.

Assembly Pipeline


223 scaffolds from the sequence-based assembly were longer than 20 kb with more than 5 labels and were converted into in silico CMAPs

Current Tribolium sequence-based assembly

Input file                  N50 (Mb)   Number of Scaffolds   Cumulative Length (Mb)
Genome FASTA                1.16       2240                  160.74
in silico CMAP from FASTA   1.20       223                   152.53
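The scaffold filter described on this slide (longer than 20 kb, more than 5 labels) can be sketched as below. This is a minimal illustration, not the conversion tool used for the talk: the slides do not name the labeling motif, so `MOTIF` is a placeholder assumption, and the function names are hypothetical.

```python
MIN_LENGTH_BP = 20_000  # from the slide: longer than 20 kb
MIN_LABELS = 5          # from the slide: more than 5 labels
MOTIF = "GCTCTTC"       # assumption: placeholder motif, not stated in the slides

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def label_positions(seq, motif=MOTIF):
    """Return sorted 0-based start positions of the motif on either strand."""
    seq = seq.upper()
    positions = set()
    for pattern in {motif, reverse_complement(motif)}:
        start = seq.find(pattern)
        while start != -1:
            positions.add(start)
            start = seq.find(pattern, start + 1)
    return sorted(positions)


def passes_filter(seq, motif=MOTIF):
    """True if the scaffold is long enough and carries enough labels."""
    return len(seq) > MIN_LENGTH_BP and len(label_positions(seq, motif)) > MIN_LABELS


# Toy scaffold: six evenly spaced labels in 21 kb of otherwise plain sequence.
toy_scaffold = ("A" * 3493 + MOTIF) * 6
toy_passes = passes_filter(toy_scaffold)
```

Scanning both strands matters because a label site can sit on either strand of the scaffold sequence.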

BNG assembled molecules had a higher N50 and longer cumulative length than the sequence assembly

The estimated size of the Tribolium genome is ~200 Mb

Assembly Results

Input file                                     N50 (Mb)   Number   Cumulative Length (Mb)
Genome FASTA                                   1.16       2240     160.74
in silico CMAP from FASTA                      1.20       223      152.53
CMAP from assembled BNG molecules (BNG CMAP)   1.35       216      200.47

Breadth of alignment coverage for in silico CMAP: 2.1 Mb
Total alignment length for in silico CMAP: 2.1 Mb

Breadth of alignment coverage for BNG CMAP: 2.4 Mb
Total alignment length for BNG CMAP: 2.4 Mb

Simplest XMAP alignment description

[Figure: in silico CMAP 1 and in silico CMAP 2 (from the genome FASTA) aligned to CMAPs from assembled molecules; segment labels 1 Mb, 1.1 Mb, 1.1 Mb, and 1.3 Mb.]

Breadth of alignment coverage for in silico CMAP: 1 Mb
Total alignment length for in silico CMAP: 2 Mb

Breadth of alignment coverage for BNG CMAP: 2.4 Mb
Total alignment length for BNG CMAP: 2.4 Mb

Complex XMAP alignment description

[Figure: in silico CMAP 1 (from the genome FASTA) aligned to two CMAPs from assembled molecules; segment labels 1 Mb, 1.1 Mb, and 1.3 Mb.]

Breadth of alignment coverage compared to total aligned length can indicate relevant relationships between assemblies

In this example, differences between "breadth" and "total" length could be due to:

• Genomic duplications in the sample the molecules were extracted from
• Assembly of alternate haplotypes
• Mis-assembly creating redundant contigs
• Collapsed repeat in the sequence assembly
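The difference between the two metrics can be made concrete with a small sketch. It assumes each alignment is reduced to a (start, end) interval on one CMAP; the function names are illustrative and are not taken from the Irys-scaffolding scripts. Total alignment length sums every interval, while breadth counts each covered position once.

```python
def total_alignment_length(intervals):
    """Sum of all aligned interval lengths, counting overlaps repeatedly."""
    return sum(end - start for start, end in intervals)


def breadth_of_coverage(intervals):
    """Length of the CMAP covered by at least one alignment."""
    covered = 0
    current_start = current_end = None
    for start, end in sorted(intervals):
        if current_end is None or start > current_end:
            # Close the previous merged block and open a new one.
            if current_end is not None:
                covered += current_end - current_start
            current_start, current_end = start, end
        else:
            # Overlapping or touching interval: extend the current block.
            current_end = max(current_end, end)
    if current_end is not None:
        covered += current_end - current_start
    return covered


# The complex example: one 1 Mb in silico CMAP aligned in full to two BNG CMAPs.
complex_example = [(0, 1_000_000), (0, 1_000_000)]
complex_total = total_alignment_length(complex_example)
complex_breadth = breadth_of_coverage(complex_example)
```

For the complex example this gives a total of 2 Mb and a breadth of 1 Mb, matching the values on the slide.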

Alignment of CMAPs


Close to 4% of the alignment of the in silico CMAP appears to be redundant

Overall 81% of the in silico CMAP aligns to the BNG consensus map

Alignment of BNG assembly to reference genome

CMAP name                                      Breadth of alignment coverage (Mb)   Total alignment length (Mb)   Percent of CMAP aligned
in silico CMAP from FASTA                      124.04                               132.40                        81
CMAP from assembled BNG molecules (BNG CMAP)   131.64                               132.34                        67

Typically, where redundant alignments occur, two BNG consensus maps aligned, suggesting they represent haplotypes, although this has not been verified

Alignment of BNG assembly to reference genome

min confidence 10

ChLG9 currently (2150 scaffold_133 aligns but is not visible in IrysView?) why is Super_scaffold_65 backwards?

[Figure: ChLG 9 scaffolds (130 131 133 134 132 129 135 127 136 137 139 138 140 141 142 143 144 145) and the ChLG 9 super scaffold, each aligned to BNG consensus maps.]

Potential haplotypes where overlapping BNG cmaps align

min confidence 10


[Figure: ChLG 9 scaffolds (128 130 131 133 134 132 129 135 127 136 137 139 138 140 141 142 143 144 145) aligned to BNG consensus maps.]

Stitch.pl estimates super scaffolds from alignments between scaffolds and assembled BNG molecules, made with BNG RefAligner

Alignment of BNG assembly to reference genome was used to super-scaffold the Tribolium scaffolds

The in silico CMAP is aligned as the reference. The alignment is then inverted and used as input for stitch.

Alignments are filtered based on confidence and on alignment length relative to the total possible alignment length.

[Figure: in silico CMAPs 1 to 4, each with a + or - orientation, aligned to BNG CMAPs.]

Stitch.pl checks alignment length against potential alignment lengths to find relevant global rather than local alignments





[Figure: + in silico CMAP 1 aligned to BNG CMAP 1.]

The alignment passes because the alignment length is greater than 30% of the potential alignment length.
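One way to picture the check is sketched below. The slides do not define "potential alignment length", so this sketch assumes it is the span over which the two maps would overlap if the alignment were extended to the nearest map ends at the same offset. The function names and coordinates are illustrative, only the + orientation is handled, and this is not presented as the stitch.pl implementation.

```python
def potential_alignment_length(len_a, len_b, start_a, end_a, start_b, end_b):
    """Overlap span of maps A and B at this alignment's offset (+ orientation).

    Assumption: the aligned block is extended left and right until either
    map runs out; the result is the longest alignment possible at this offset.
    """
    aligned = end_b - start_b
    left_overhang = min(start_a, start_b)
    right_overhang = min(len_a - end_a, len_b - end_b)
    return left_overhang + aligned + right_overhang


def percent_of_potential(len_a, len_b, start_a, end_a, start_b, end_b):
    """Aligned length as a percentage of the potential alignment length."""
    potential = potential_alignment_length(
        len_a, len_b, start_a, end_a, start_b, end_b
    )
    return 100 * (end_b - start_b) / potential


# End overlap: the last 400 kb of a 1 Mb map A meets the first 400 kb of map B.
end_overlap = percent_of_potential(1_000_000, 1_300_000, 600_000, 1_000_000, 0, 400_000)

# Local hit: 100 kb aligned in the middle of both maps.
local_hit = percent_of_potential(1_000_000, 1_300_000, 400_000, 500_000, 600_000, 700_000)
```

Under this assumption the end overlap uses all of its potential length and would pass a 30% threshold, while the local hit uses a tenth of it and would fail.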

[Figure: + in silico CMAP 2 against BNG CMAP 1, and - in silico CMAP 2 against BNG CMAP 2.]

The alignment fails because the alignment length is less than 30% of the potential alignment length.

[Figure: + in silico CMAP 2, - in silico CMAP 3, and + in silico CMAP 4 checked against BNG CMAP 2 by alignment length.]




High quality scaffolding alignments are filtered for the longest and highest confidence alignment for each in silico CMAP.

Passing alignments are used to super scaffold.

Stitch is iterated and additional super scaffolding alignments are found, until all super scaffolds are joined.

[Figure: joined super scaffold containing + in silico CMAP 1, + in silico CMAP 2, - in silico CMAP 3, and + in silico CMAP 4.]
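The per-CMAP filter can be sketched as a simple "keep the best record per key" pass. The slides say "longest and highest confidence" without stating how the two criteria are ranked, so this sketch assumes length is compared first and confidence breaks ties; the dictionary keys are hypothetical field names, not XMAP column names.

```python
def best_alignment_per_cmap(alignments):
    """Keep one alignment per in silico CMAP.

    Assumption: alignments are ranked by length first, then by confidence.
    """
    best = {}
    for aln in alignments:
        key = aln["in_silico_id"]
        rank = (aln["length"], aln["confidence"])
        if key not in best:
            best[key] = aln
            continue
        current = best[key]
        if rank > (current["length"], current["confidence"]):
            best[key] = aln
    return best


# Hypothetical alignments: in silico CMAP 1 has two candidates.
toy_alignments = [
    {"in_silico_id": 1, "bng_id": 1, "length": 100_000, "confidence": 10.0},
    {"in_silico_id": 1, "bng_id": 2, "length": 200_000, "confidence": 9.0},
    {"in_silico_id": 2, "bng_id": 2, "length": 50_000, "confidence": 20.0},
]
toy_best = best_alignment_per_cmap(toy_alignments)
```

Swapping the order of the tuple in `rank` would rank by confidence first instead, which is an equally plausible reading of the slide.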

If a gap length is estimated to be negative, the gap is represented by a 100 bp spacer
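A minimal sketch of this spacer rule follows. It assumes gaps are written as runs of N, which is the usual FASTA convention but is not stated in the slides; the function names and toy sequences are illustrative.

```python
SPACER_BP = 100  # from the slide: negative gaps become 100 bp spacers


def gap_sequence(estimated_gap_bp):
    """Return the run of N used for one gap (assumes N-filled gaps)."""
    length = SPACER_BP if estimated_gap_bp < 0 else estimated_gap_bp
    return "N" * length


def join_scaffolds(scaffolds, gaps):
    """Concatenate scaffolds in order, inserting one gap between each pair."""
    if not scaffolds or len(gaps) != len(scaffolds) - 1:
        raise ValueError("need exactly one gap estimate between each pair")
    parts = [scaffolds[0]]
    for gap, scaffold in zip(gaps, scaffolds[1:]):
        parts.append(gap_sequence(gap))
        parts.append(scaffold)
    return "".join(parts)


# Toy example: a 3 bp gap, then a negative estimate that becomes a spacer.
toy_super_scaffold = join_scaffolds(["ACGT", "TTTT", "GG"], [3, -500])
```

The second join keeps both scaffolds intact and separates them with the fixed spacer, rather than trimming the apparent overlap.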

Gap lengths

In the automated stitch.pl Tribolium super-scaffolds, 66 gaps had known lengths and 26 had negative lengths (set to 100 bp)

In the manually edited Tribolium super-scaffolds, 66 gaps had known lengths and 24 had negative lengths (set to 100 bp)

[Figure: Distribution of gap lengths for automated output. Histogram of count against gap length (bp), from -1,500,000 to 1,000,000, with negative and positive gap lengths distinguished.]


Negative gap lengths

The longest negative gap length is from a BNG consensus map joining in silico maps 23 and 136

Is part of scaffold_23 connected to 136?

I went with the second alignment (21-26 together and 136-137 together) because it is supported by genetic maps, but we should check these assemblies.

In the bottom alignment of 136 you can see that a large section of BNG map 32 (which joins 23 to 136) is a duplicate in the BNG assembly?

[Figure: alignments involving scaffolds 22, 23, 129, 136, and 137.]


Because the same region of 136 aligns to another BNG consensus map that aligns to its chromosome linkage group, this alignment was rejected and stitch was re-run


Two new super scaffolds were created and the sequence similarity is being evaluated

min confidence 10

[Figure: ChLG 9 scaffolds (128 130 131 133 134 132 129 135 127 136 137 139 138 140 141 142 143 144 145) aligned to BNG consensus maps.]

[Figure: ChLG 2 scaffolds (U 18 14 16 19 20 21 22 23 24 25 26 27 28 30) aligned to BNG consensus maps.]

Gap lengths

This negative alignment also indicated a potential assembly issue




This negative gap length is from a BNG consensus map joining in silico maps 81, 102, and 103

Half of scaffold_81 aligns with ChLG7


Because the other half of 81 aligns to another BNG consensus map that aligns to its chromosome linkage group, this alignment was rejected and stitch was re-run

The BNG maps suggest a mis-assembly of in silico 81 at the sequence level

[Figure: alignments of scaffolds 79, 80, 81, 82, and 83.]


Gap lengths

All extremely small negative gap lengths, < -20,000 bp (shaded), were independently flagged as potential sequence mis-assemblies to be checked at the sequence level


Gap lengths

All gaps from the shaded regions were also manually rejected and stitch.pl was rerun without them for the current super-scaffolded assembly

We suspect extremely small negative gap sizes may be useful in locating sequence mis-assemblies

stitch.pl version 1.4.5 rejects alignments if negative gap lengths are < -20,000 bp but lists them in the data summary
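The rejection rule can be sketched as a partition of candidate joins by estimated gap length. The -20,000 bp cutoff is from the slide; the function name, the tuple layout, and the toy joins are assumptions for illustration and do not reproduce the stitch.pl code.

```python
REJECT_BELOW_BP = -20_000  # from the slide: reject if gap length < -20,000 bp


def partition_gaps(gaps):
    """Split (join_name, gap_bp) pairs into kept and rejected lists.

    Rejected joins are returned rather than dropped so that they can
    still be listed in a summary and checked at the sequence level.
    """
    kept, rejected = [], []
    for name, gap_bp in gaps:
        if gap_bp < REJECT_BELOW_BP:
            rejected.append((name, gap_bp))
        else:
            kept.append((name, gap_bp))
    return kept, rejected


# Hypothetical joins and gap estimates, not values from the talk.
toy_gaps = [("A-B", 12_000), ("B-C", -150), ("C-D", -450_000), ("D-E", -20_000)]
toy_kept, toy_rejected = partition_gaps(toy_gaps)
```

A gap of exactly -20,000 bp is kept here because the slide states the rule with a strict inequality.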

N50 of the super-scaffolded genome was ~4 times greater than the original

Super-scaffolds tend to agree with the Tribolium genetic map

Tribolium super-scaffolds

Input file             N50 (Mb)   Number of Scaffolds   Cumulative Length (Mb)
genome FASTA           1.16       2240                  160.74
super-scaffold FASTA   4.46       2150                  165.92

For Tribolium:
first minimum percent aligned = 30%
first minimum confidence = 13
second minimum percent aligned = 90%
second minimum confidence = 8

Lower quality alignments were manually selected if genetic map also supported the order

Complex scaffolds were broken manually for sequence level evaluation
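The two pairs of thresholds listed for Tribolium can be read as two alternative filters: a confident alignment needs less of the potential length to align, and a near-complete alignment can pass with lower confidence. The sketch below assumes the pairs combine as "first OR second" and that the minimums are inclusive; neither detail is stated on the slide, and the function name is illustrative.

```python
FIRST_MIN_PERCENT_ALIGNED = 30
FIRST_MIN_CONFIDENCE = 13
SECOND_MIN_PERCENT_ALIGNED = 90
SECOND_MIN_CONFIDENCE = 8


def passes_thresholds(percent_aligned, confidence):
    """True if an alignment satisfies either threshold pair (assumed OR)."""
    first = (
        percent_aligned >= FIRST_MIN_PERCENT_ALIGNED
        and confidence >= FIRST_MIN_CONFIDENCE
    )
    second = (
        percent_aligned >= SECOND_MIN_PERCENT_ALIGNED
        and confidence >= SECOND_MIN_CONFIDENCE
    )
    return first or second


# Hypothetical alignments as (percent aligned, confidence).
toy_results = [
    passes_thresholds(35, 14),  # passes the first pair
    passes_thresholds(95, 9),   # passes the second pair only
    passes_thresholds(35, 9),   # too little aligned for its confidence
]
```

Alignments that fail both pairs are the ones the slide describes as candidates for manual selection when the genetic map supports the order.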





ChLG X was reduced from 13 scaffolds to 2 with one scaffold being moved to ChLG 3


min confidence 10

From ChLGX, 11 of the previous 13 scaffolds were joined with two unplaced scaffolds (U) into one super scaffold.

[Figure: ChLG X scaffolds (U 3 4 5 6 7 U 8 9 10 11 12 13) and the ChLG X super scaffold, each aligned to BNG consensus maps.]

The second scaffold from ChLG X aligned to scaffolds from a portion of ChLG 3


min confidence 10

[Figure: ChLG 3 scaffolds (51 U 43 45 44 46 47 U U 152 48 49 50 52 53 54 U 57 55 and 32 33 34 35 36 2 37 38 39 40 41 42) aligned to BNG consensus maps.]

Two unplaced scaffolds aligned to ChLG X


min confidence 10


[Figure: ChLG X scaffolds (U 3 4 5 6 7 U 8 9 10 11 12 13) aligned to BNG consensus maps.]

The 4% redundancy in alignment may be from assembly of haplotypes (generally observed as two BNG consensus maps aligning to the same in silico map)

min confidence 10

Potential haplotypes where overlapping BNG cmaps align

min confidence 10



For ChLG 9, 21 scaffolds were reduced to 9

Tribolium super-scaffolds

min confidence 10

[Figure: ChLG 9 scaffolds (128 130 131 133 134 132 129 135 127 136 137 139 138 140 141 142 143 144 145) aligned to BNG consensus maps.]

For ChLG 5, 17 scaffolds were reduced to 4

Tribolium super-scaffolds

min confidence 10

[Figure: ChLG 5 scaffolds (69 68 70 71 72 73 74 U 75 76 77 78 79 80 81 82 83) aligned to BNG consensus maps.]

Future directions: Structural Variant (SV)

Use SV-detect pipelines to resize existing gaps in scaffolds and identify mis-assemblies

Acknowledgements

K-INBRE Bioinformatics Core
Susan Brown - PI
Nic Herndon - script development
Nanyan Lu - manual evaluation
Michelle Coleman - extractions and running the Irys
Zachary Sliefert - metric summaries

Bionano Genomics
Ernest Lam - assembly pipeline best practices assistance
Weiping Wang - assistance with data formats
Palak Sheth - collaboration to standardize analysis

Script availability
https://github.com/i5K-KINBRE-script-share/Irys-scaffolding
BNG scripts available by request from BNG

Slide availability
http://www.slideshare.net/kstatebioinformatics/using-bionano-maps-to-improve-an-insect-genome-assembly

This project was supported by grants from the National Center for Research Resources (5P20RR016475) and the National Institute of General Medical Sciences (8P20GM103418) from the National Institutes of Health.

Improving the Tribolium draft Assembly with Physical Maps Based on Imaging Ultra-Long Single DNA Molecules

Jennifer M. Shelton (Kansas State University), Michelle Coleman (Kansas State University), Nic Herndon (Kansas State University), Warren Andrews (BioNano Genomics), Weiping Wang (BioNano Genomics), Ernest Lam (BioNano Genomics), Susan J. Brown (Kansas State University)

We use the red flour beetle, a pest of stored grain, as a genetic model organism for developmental studies. As members of the i5k, we join scientists around the globe who are gearing up to sequence 5000 insect genomes to improve human welfare and understand key ecosystem services that insects provide. By investigating insect genomes, we can take a fresh look at how insects transmit some of the most devastating diseases of humans, livestock, and plants on one hand, yet also serve as medical models for cancer, obesity, alcoholism, and neurological disease on the other.

Genome sequencing is becoming very affordable, but genome assembly is still challenging. Most assemblies are essentially drafts of the genome, and even heavily curated reference assemblies contain misassemblies and truncations or gaps in repetitive regions. We are using a form of optical mapping to validate and extend the contigs and scaffolds that constitute a genome assembly. The 7x draft assembly of the genome of the red flour beetle, Tribolium castaneum, is based on paired-end Sanger sequencing of 4-6 Kb insert plasmid libraries, scaffolded with paired-end reads from 40 Kb fosmid and ~130 Kb BAC clones (1). The total assembled length of ~156 Mb represents 75% of the estimated genome (200 Mb) and presumably lacks a significant portion of repetitive DNA. Superscaffolds or chromosome linkage group builds (ChLG 2-10 and X) were constructed by mapping molecular markers from the genetic recombination map to the assembly scaffolds, anchoring greater than 90% of the assembled sequence.

To improve this draft assembly, we constructed physical maps of the T. castaneum genome using the Irys system designed by BioNano Genomics (http://www.bionanogenomics.com/). Consensus maps of the imaged molecules were compared with in silico maps generated from the assembly sequence. Here we report our progress on using these comparisons to validate the assembly in regions where they agree and reanalyze the assembly in regions where they do not. Additional scaffolds have been anchored to the chromosomes, order and orientation of scaffolds have been corrected, and scaffolds have been extended by spanning repetitive regions.

1) Nature 2008 452:949-55.
2) Baylor College of Medicine Human Genome Sequencing Center 2012 https://www.hgsc.bcm.edu/software/
3) Genome Biology 2012 13:R56
4) BMC Bioinformatics 2013, 14(Suppl 7):S6

T. castaneum 5.1
Sequence scaffolds were aligned to BNG maps with IrysView. The alignments were filtered and used to create new scaffolds.
length (Mb): 164.60
scaffolds: 2157
scaffold N50 (Mb): 4.00

T. castaneum 5.0
Illumina long distance jump-libraries extended scaffolds using Atlas gap-link (2) and Base Clear gapfiller (3).

T. castaneum 3.0
Baylor Sanger 7x draft assembly and molecular genetic map

Figure 1 Generation of consensus physical maps
a) Ultra long molecules (Mb) were nicked on one strand and labeled with fluorescent nucleotides.
b) Individual molecules were imaged on a massively parallel scale in nanochannels printed on silicon chips.
c) Consensus maps were assembled from populations of overlapping molecules.

Figure 2 Genome refinements using consensus physical maps

Figure 3 Super scaffolding alignments. T. castaneum assembly before (5.0) and after (5.1) super scaffolding. Assembly contigs (top in green) aligned with BNG consensus maps (bottom in blue) with BNG map molecule coverage in dark blue. Arrows indicate newly placed scaffolds.

Table 1 Super scaffolding of T. castaneum 5.0 with BNG consensus maps. Assembly metrics for T. castaneum assembly 5.0 and for T. castaneum assembly 5.1 after BNG maps were used to super scaffold.

Figure 4 Comparative Genomics. Comparison of T. castaneum 5.1 (top in green) with BNG consensus maps for T. freemani (bottom in blue, with BNG map molecule coverage in dark blue).
