Algorithms and filters used to improve the Tribolium draft assembly with physical maps based on imaging ultra-long single DNA molecules. A video of the webinar is available on the BioNano Genomics website (http://www.bionanogenomics.com/bionano-community/webinars/) as "Using BioNano Maps to Improve an Insect Genome Assembly".
Using BioNano Maps to Improve an Insect Genome Assembly
Sue Brown, Oct 23, 2014
Tribolium castaneum, the red flour beetle
Genetic model organism for developmental, physiology, and toxicology studies
• Easily cultured
• Short generation time
• Small genome size
• Molecular and visible marker genetic maps
• Genetic tools: balancers, deficiencies
• Genomic libraries: lambda and BAC
• cDNA libraries
• Mutant analysis and RNAi
• Transformation
• 7x Sanger draft genome (Nature, 2008)
Tribolium Genome
• Genome size: 200 Mb
• Cot analysis
• 9 autosomes, X and Y
• Low methylation
• Long period interspersion
Jeff Stuart, Purdue University
Molecular map markers used to anchor scaffolds to Chromosome builds
Few X markers, no Y, variable marker density
Tcas 3.0 Reference Genome stats from NCBI
Input file          N50 (Mb)   Number   Cumulative Length (Mb)
Genome contigs      0.04       8814     160.74
Genome scaffolds    0.98       481      152.53
Unmapped scaffolds: 352
Unmapped contigs: 1884
Algorithms and filters used to improve the Tribolium draft Assembly with Physical Maps Based on Imaging Ultra-Long Single DNA Molecules
Jennifer Shelton 2014
Data formats
• BNX - text file of molecules
• CMAP - text file of consensus maps
  • in silico CMAP - from genome FASTA
  • BNG CMAP - from assembled molecules
• XMAP - text file of alignment of two CMAPs
[Figure: a BNX molecule, in silico CMAPs 1 and 2, BNG CMAPs 1 and 2, and an alignment between the in silico and BNG CMAPs]
Assembly Pipeline
3) Use sequence reference to adjust molecule stretch for each scan
Assembly workflow:
1) The Irys produces TIFF files that are converted into BNX text files.
2) Each chip produces one BNX file for each of two flowcells.
3) BNX files are split by scan and aligned to the sequence reference. Stretch (bases per pixel) is recalculated from the alignment.
4) Quality check graphs are created for each pre-adjusted flowcell BNX.
5) Adjusted flowcell BNXs are merged.
6) The first assemblies are run with a variety of p-value thresholds (relaxed, default, and strict).
7) The best of the first assemblies (red oval) is chosen and a version of this assembly is produced with a variety of minimum molecule length filters (relaxed and strict).
[Figure: workflow diagram showing flowcell BNX files split into scan BNX files, adjusted BNX files, the merged BNX file, first assemblies, and second assemblies]
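The stretch adjustment in step 3 can be sketched as below. This is a minimal illustration and not the pipeline's own code: the nominal value of 500 bases per pixel, the function names, and the input layout (pairs of observed and reference lengths taken from the alignment of one scan) are all assumptions made for the example.

```python
NOMINAL_BPP = 500.0  # assumed nominal bases per pixel; not stated in the slides


def adjusted_bpp(aligned_pairs, nominal_bpp=NOMINAL_BPP):
    """Estimate bases per pixel for one scan from its alignment to the reference.

    aligned_pairs holds (observed_bp, reference_bp) for each aligned interval,
    where observed_bp was computed using nominal_bpp.
    """
    observed = sum(obs for obs, _ in aligned_pairs)
    reference = sum(ref for _, ref in aligned_pairs)
    if observed <= 0:
        raise ValueError("no aligned length to estimate stretch from")
    return nominal_bpp * reference / observed


def rescale_positions(label_positions_bp, new_bpp, nominal_bpp=NOMINAL_BPP):
    """Rescale label positions of a scan's molecules to the adjusted stretch."""
    factor = new_bpp / nominal_bpp
    return [pos * factor for pos in label_positions_bp]


# Hypothetical scan in which molecules appear 2% shorter than the reference.
scan_pairs = [(100_000, 102_000), (250_000, 255_000)]
scan_bpp = adjusted_bpp(scan_pairs)
rescaled = rescale_positions([1_000.0, 2_000.0], scan_bpp)
```

Computing one factor per scan, rather than per flowcell, matches the observation that stretch changes between scans.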
In recent datasets, when SNR is low and alignment is good, we see a spike in bases per pixel (bpp) in the first scan, then a plateau, and then a lower plateau.
[Figure: bases per pixel by scan, with the first scan in a flow cell marked]
5) Use sequence reference to determine assembly noise parameters. Estimated genome size is used to set the p-value threshold.
6/7) Variants of the starting p-value and default minimum molecule length are explored in nine assemblies.
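One way to enumerate nine assemblies is to cross three p-value thresholds with three minimum molecule lengths. The sketch below is only an illustration: the rule that the default threshold is 1e-5 divided by the genome size in Mb, the factor of ten between relaxed, default, and strict thresholds, and the minimum lengths of 100, 150, and 180 kb are all assumptions that the slides do not state.

```python
from itertools import product


def default_p_value(genome_size_mb):
    # Assumed rule of thumb: 1e-5 divided by the genome size in Mb.
    return 1e-5 / genome_size_mb


def assembly_grid(genome_size_mb, min_lengths_kb=(100, 150, 180)):
    """Return one parameter set per assembly (3 thresholds x 3 lengths)."""
    base = default_p_value(genome_size_mb)
    # Assumed factor of ten between relaxed, default, and strict thresholds.
    thresholds = {"relaxed": base * 10, "default": base, "strict": base / 10}
    return [
        {"p_label": label, "p_value": p_value, "min_length_kb": min_length}
        for (label, p_value), min_length in product(thresholds.items(), min_lengths_kb)
    ]


# ~200 Mb is the estimated Tribolium genome size given in the slides.
grid = assembly_grid(200)
for params in grid:
    print(params["p_label"], params["p_value"], params["min_length_kb"])
```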
Current Tribolium sequence-based assembly
223 scaffolds from the sequence-based assembly were longer than 20 kb with more than 5 labels and were converted into in silico CMAPs.
Input file                  N50 (Mb)   Number of Scaffolds   Cumulative Length (Mb)
Genome FASTA                1.16       2240                  160.74
in silico CMAP from FASTA   1.20       223                   152.53
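The scaffold filter described above (longer than 20 kb with more than 5 labels) can be sketched as follows. The recognition motif GCTCTTC is an assumption chosen for illustration, since the slides do not name the nicking enzyme, and the function names are hypothetical.

```python
import re

MOTIF = "GCTCTTC"    # assumed recognition site; not stated in the slides
REVCOMP = "GAAGAGC"  # reverse complement of the assumed site


def label_positions(sequence, motif=MOTIF, revcomp=REVCOMP):
    """Return 0-based start positions of the motif on either strand."""
    pattern = re.compile(f"(?=({motif}|{revcomp}))")
    return [match.start() for match in pattern.finditer(sequence.upper())]


def passes_filter(sequence, min_length=20_000, min_labels=5):
    """Keep scaffolds longer than min_length with more than min_labels labels."""
    return len(sequence) > min_length and len(label_positions(sequence)) > min_labels


# Hypothetical scaffolds: one long with 7 labels, one long with a single label.
good_scaffold = ("A" * 3000 + MOTIF) * 7
sparse_scaffold = "A" * 25_000 + REVCOMP
kept = [name for name, seq in [("good", good_scaffold), ("sparse", sparse_scaffold)]
        if passes_filter(seq)]
```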
Assembly Results
BNG assembled molecules had a higher N50 and longer cumulative length than the sequence assembly.
The estimated size of the Tribolium genome is ~200 Mb.
Input file                                     N50 (Mb)   Number   Cumulative Length (Mb)
Genome FASTA                                   1.16       2240     160.74
in silico CMAP from FASTA                      1.20       223      152.53
CMAP from assembled BNG molecules (BNG CMAP)   1.35       216      200.47
Simplest XMAP alignment description
• Breadth of alignment coverage for in silico CMAP: 2.1 Mb
• Total alignment length for in silico CMAP: 2.1 Mb
• Breadth of alignment coverage for BNG CMAP: 2.4 Mb
• Total alignment length for BNG CMAP: 2.4 Mb
[Figure: two in silico CMAPs from the genome FASTA (1 Mb and 1.1 Mb) aligned to two CMAPs from assembled molecules (1.1 Mb and 1.3 Mb)]
Complex XMAP alignment description
• Breadth of alignment coverage for in silico CMAP: 1 Mb
• Total alignment length for in silico CMAP: 2 Mb
• Breadth of alignment coverage for BNG CMAP: 2.4 Mb
• Total alignment length for BNG CMAP: 2.4 Mb
[Figure: one in silico CMAP from the genome FASTA (1 Mb) aligned to two CMAPs from assembled molecules (1.1 Mb and 1.3 Mb)]
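The difference between breadth and total alignment length can be made concrete with a short sketch. It assumes that alignments are available as (start, end) intervals on each CMAP, grouped by map ID; that input layout and the function names are hypothetical.

```python
def total_alignment_length(intervals):
    """Sum of aligned interval lengths, counting overlapping regions repeatedly."""
    return sum(end - start for start, end in intervals)


def breadth_of_coverage(intervals):
    """Length covered by at least one alignment (overlaps counted once)."""
    breadth = 0
    current_start = current_end = None
    for start, end in sorted(intervals):
        if current_end is None or start > current_end:
            if current_end is not None:
                breadth += current_end - current_start
            current_start, current_end = start, end
        else:
            current_end = max(current_end, end)
    if current_end is not None:
        breadth += current_end - current_start
    return breadth


def summarize(alignments_by_map):
    """Return (breadth, total) summed over all maps."""
    breadth = sum(breadth_of_coverage(iv) for iv in alignments_by_map.values())
    total = sum(total_alignment_length(iv) for iv in alignments_by_map.values())
    return breadth, total


# Complex case: a 1 Mb in silico CMAP aligned along its full length to two BNG CMAPs.
complex_case = {"in_silico_1": [(0, 1_000_000), (0, 1_000_000)]}
breadth, total = summarize(complex_case)
```

Here the breadth is 1 Mb and the total is 2 Mb, as in the complex example for the in silico CMAP.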
Alignment of CMAPs
Breadth of alignment coverage compared to total aligned length can indicate relevant relationships between assemblies.
In this example, differences between "breadth" and "total" length could be due to:
• Genomic duplications in the sample that molecules were extracted from
• Assembly of alternate haplotypes
• Mis-assembly creating redundant contigs
• Collapsed repeat in sequence assembly
Alignment of BNG assembly to reference genome
Close to 4% of the alignment of the in silico CMAP appears to be redundant.
Overall, 81% of the in silico CMAP aligns to the BNG consensus map.
CMAP name                                      Breadth of alignment coverage (Mb)   Total alignment length (Mb)   Percent of CMAP aligned
in silico CMAP from FASTA                      124.04                               132.40                        81
CMAP from assembled BNG molecules (BNG CMAP)   131.64                               132.34                        67
Typically, where redundant alignments occur, two BNG consensus maps align to the same region, suggesting they represent haplotypes, although this has not been verified.
[Figure: BNG consensus maps aligned to ChLG 9 scaffolds (130, 131, 133, 134, 132, 129, 135, 127, 136-145) and to the ChLG 9 super-scaffold, min confidence 10. Notes on figure: scaffold_133 aligns but is not visible in IrysView? Why is Super_scaffold_65 backwards?]
Potential haplotypes occur where overlapping BNG CMAPs align.
Alignment of BNG assembly to reference genome was used to super-scaffold the Tribolium scaffolds
Stitch.pl estimates super-scaffolds using alignments, made with BNG RefAligner, of scaffolds and assembled BNG molecules:
1) The in silico CMAP is aligned as the reference.
2) The alignment is inverted and used as input for stitch.
3) Alignments are filtered based on alignment length relative to total possible alignment length, and on confidence.
[Figure: in silico CMAPs 1-4, each in + or - orientation, aligned to BNG CMAP 1 and BNG CMAP 2]
Stitch.pl checks alignment length against potential alignment length to find relevant global rather than local alignments
An alignment passes if the alignment length is greater than 30% of the potential alignment length, and fails if it is less.
Alignments checked in turn in the figure sequence:
1) + in silico CMAP 1 to BNG CMAP 1: passes
2) + in silico CMAP 2 to BNG CMAP 1: passes
3) - in silico CMAP 2 to BNG CMAP 2: passes
4) - in silico CMAP 2 to BNG CMAP 2: fails
5) + in silico CMAP 2 to BNG CMAP 2: fails
6) - in silico CMAP 3 to BNG CMAP 2: passes
7) - in silico CMAP 3 to BNG CMAP 2: fails
8) + in silico CMAP 4 to BNG CMAP 2: passes
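The length check can be sketched as below. The definition of potential alignment length used here (the aligned length plus, on each side, the shorter of the two unaligned overhangs) is an assumption about how a global overlap would be bounded, not a statement of how stitch.pl computes it. Only the 30% threshold comes from the slides, and the function names are hypothetical.

```python
MIN_FRACTION = 0.30  # threshold shown on the slides


def potential_alignment_length(q_len, q_start, q_end, r_len, r_start, r_end,
                               orientation="+"):
    """Return (aligned, potential) lengths for one alignment.

    Assumed definition: the aligned length plus, on each side, the shorter
    of the query and reference overhangs.
    """
    q_lo, q_hi = sorted((q_start, q_end))
    r_lo, r_hi = sorted((r_start, r_end))
    aligned = r_hi - r_lo
    q_left, q_right = q_lo, q_len - q_hi
    if orientation == "-":
        # A reversed query presents its overhangs on the opposite sides.
        q_left, q_right = q_right, q_left
    r_left, r_right = r_lo, r_len - r_hi
    potential = min(q_left, r_left) + aligned + min(q_right, r_right)
    return aligned, potential


def passes_length_check(*args, **kwargs):
    aligned, potential = potential_alignment_length(*args, **kwargs)
    return aligned > MIN_FRACTION * potential


# End overlap: the two maps could not overlap by more than they already do.
end_overlap = passes_length_check(1_000_000, 0, 200_000,
                                  2_000_000, 1_800_000, 2_000_000)
# Local hit: a short match in the middle of two long maps.
local_hit = passes_length_check(1_000_000, 400_000, 600_000,
                                2_400_000, 1_000_000, 1_200_000)
```

Under this definition a short overlap at the ends of two maps passes, because it is the whole potential overlap, while an equally short match in the middle of both maps fails.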
1) High quality scaffolding alignments are filtered for the longest and highest confidence alignment for each in silico CMAP.
2) Passing alignments are used to super-scaffold.
3) Stitch is iterated and additional super-scaffolding alignments are found, until all super-scaffolds are joined.
4) If a gap length is estimated to be negative, the gap is represented by a 100 bp spacer.
[Figure: in silico CMAPs 1-4 joined into a super-scaffold through their alignments to BNG CMAP 1 and BNG CMAP 2]
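Selecting the longest and highest confidence alignment for each in silico CMAP can be sketched as below. The record layout, the field names, and the choice to rank by aligned length first and by confidence second are assumptions for illustration; the slides do not say how the two criteria are combined.

```python
def best_alignment_per_map(alignments):
    """Keep one alignment per in silico CMAP.

    Assumed ranking: longest aligned length first, then highest confidence.
    """
    best = {}
    for aln in alignments:
        key = aln["in_silico_id"]
        rank = (aln["aligned_length"], aln["confidence"])
        current = best.get(key)
        if current is None or rank > (current["aligned_length"], current["confidence"]):
            best[key] = aln
    return best


# Hypothetical alignments that already passed the length and confidence filters.
passing = [
    {"in_silico_id": 2, "bng_id": 1, "aligned_length": 400_000, "confidence": 35.0},
    {"in_silico_id": 2, "bng_id": 2, "aligned_length": 650_000, "confidence": 28.0},
    {"in_silico_id": 3, "bng_id": 2, "aligned_length": 300_000, "confidence": 22.0},
]
chosen = best_alignment_per_map(passing)
```

Because each round keeps only one alignment per in silico CMAP, a scaffold that bridges two BNG maps is joined to one of them first, which is consistent with stitch being iterated until no further joins are found.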
Gap lengths
In the automated stitch.pl Tribolium super-scaffolds, 66 gaps had known lengths and 26 had negative lengths (set to 100 bp).
In the manually edited Tribolium super-scaffolds, 66 gaps had known lengths and 24 had negative lengths (set to 100 bp).
[Figure: distribution of gap lengths for automated output. x-axis: gap length (bp), from -1,500,000 to 1,000,000; y-axis: count, from 0 to 20; series: negative gap lengths and positive gap lengths]
Negative gap lengths
The longest negative gap length is from a BNG consensus map joining in silico 23 and 136.
Notes on figure: Is part of scaffold_23 connected to 136? I went with the second alignment (21-26 together and 136-137 together) because it is supported by genetic maps, but we should check these assemblies. In the bottom alignment of 136, a large section of BNG map 32 (which joins 23 to 136) may be a duplicate in the BNG assembly.
[Figure: alignments involving scaffolds 22, 23, 129, 136, and 137]
Because the same region of 136 aligns to another BNG consensus map that aligns to its chromosome linkage group, this alignment was rejected and stitch was re-run.
Two new super-scaffolds were created and the sequence similarity is being evaluated.
[Figure: BNG consensus maps aligned to the ChLG 9 and ChLG 2 scaffolds and super-scaffolds, min confidence 10. ChLG 2 scaffold labels: U, 18, 14, 16, 19-28, 30]
Gap lengths
This negative gap length also indicated a potential assembly issue.
[Figure: distribution of gap lengths for automated output]
Negative gap lengths
This negative gap length is from a BNG consensus map joining in silico 81, 102, and 103.
Half of scaffold_81 aligns with ChLG 7.
Because the other half of 81 aligns to another BNG consensus map that aligns to its chromosome linkage group, this alignment was rejected and stitch was re-run.
The BNG maps suggest a mis-assembly of in silico 81 at the sequence level.
[Figure: alignments involving scaffolds 79, 80, 81, 82, and 83]
Gap lengths
All extremely small negative gap lengths, < -20,000 bp (shaded), were independently flagged as potential sequence mis-assemblies to be checked at the sequence level.
[Figure: distribution of gap lengths for automated output, with the region below -20,000 bp shaded]
All gaps from the shaded regions were also manually rejected, and stitch.pl was rerun without them for the current super-scaffolded assembly.
We suspect extremely small negative gap sizes may be useful in locating sequence mis-assemblies.
stitch.pl version 1.4.5 rejects alignments with negative gap lengths < -20,000 bp but lists them in the data summary.
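The gap rules above can be summarized in a short sketch. The 100 bp spacer and the -20,000 bp rejection threshold come from the slides; estimating a gap as the distance between the projected end of one scaffold and the projected start of the next on a shared BNG map is an assumption, and the function names are hypothetical.

```python
SPACER_BP = 100            # spacer used for negative gaps (from the slides)
REJECT_BELOW_BP = -20_000  # rejection threshold in stitch.pl 1.4.5 (from the slides)


def estimated_gap(left_end_on_map, right_start_on_map):
    """Assumed estimate: distance between two scaffolds projected onto one BNG map."""
    return right_start_on_map - left_end_on_map


def resolve_gap(gap_bp):
    """Classify a gap and return (action, length to insert in bp)."""
    if gap_bp < REJECT_BELOW_BP:
        return ("reject", None)       # flagged as a potential mis-assembly
    if gap_bp < 0:
        return ("spacer", SPACER_BP)  # small overlap: insert a fixed spacer
    return ("gap", gap_bp)            # known positive gap length


# Hypothetical projected coordinates (bp) of neighboring scaffolds on a BNG map.
examples = [
    resolve_gap(estimated_gap(500_000, 512_000)),  # positive gap
    resolve_gap(estimated_gap(500_000, 495_000)),  # small negative gap
    resolve_gap(estimated_gap(500_000, 100_000)),  # extremely negative gap
]
```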
Tribolium super-scaffolds
N50 of the super-scaffolded genome was ~4 times greater than the original.
Super-scaffolds tend to agree with the Tribolium genetic map.
Input file             N50 (Mb)   Number of Scaffolds   Cumulative Length (Mb)
genome FASTA           1.16       2240                  160.74
super-scaffold FASTA   4.46       2150                  165.92
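For readers unfamiliar with the N50 metric reported in these tables, a minimal sketch of the usual definition follows: the length of the scaffold at which the running total, taken from longest to shortest, first reaches half of the cumulative length. The scaffold lengths in the example are made up.

```python
def n50(lengths):
    """Return the N50 of a collection of scaffold lengths."""
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("no lengths given")
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length


# Hypothetical scaffold lengths (Mb) before and after joining some scaffolds.
before = [2, 2, 2, 3, 3, 4, 8, 8]
after = [2 + 2 + 2, 3 + 3, 4 + 8, 8]
print(n50(before), n50(after))
```

Joining scaffolds leaves the cumulative length unchanged in this toy example but raises the N50, which is the effect super-scaffolding is meant to have.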
For Tribolium:
• first minimum percent aligned = 30%
• first minimum confidence = 13
• second minimum percent aligned = 90%
• second minimum confidence = 8
Lower quality alignments were manually selected if the genetic map also supported the order.
Complex scaffolds were broken manually for sequence-level evaluation.
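The two pairs of thresholds above can be read as two alternative filters. The sketch below assumes that an alignment is kept if it satisfies either pair and that the comparisons are inclusive; both are assumptions about how the settings combine, and the function name is hypothetical.

```python
FIRST = (30.0, 13.0)   # (minimum percent aligned, minimum confidence)
SECOND = (90.0, 8.0)   # a lower confidence is tolerated for near-complete alignments


def passes_filters(percent_aligned, confidence, first=FIRST, second=SECOND):
    """Assumed rule: keep the alignment if it satisfies either threshold pair."""
    meets_first = percent_aligned >= first[0] and confidence >= first[1]
    meets_second = percent_aligned >= second[0] and confidence >= second[1]
    return meets_first or meets_second


# Hypothetical alignments as (percent aligned, confidence).
candidates = [(45.0, 20.0), (95.0, 9.0), (45.0, 9.0), (25.0, 40.0)]
kept = [c for c in candidates if passes_filters(*c)]
```

Read this way, the first pair admits partial but confident alignments and the second admits nearly complete alignments of short maps that cannot reach a high confidence score.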
ChLG X was reduced from 13 scaffolds to 2, with one scaffold being moved to ChLG 3.
From ChLG X, 11 of the previous 13 scaffolds were joined with two unplaced scaffolds (U) into one super-scaffold.
[Figure: BNG consensus maps aligned to ChLG X scaffolds (U, 3, 4, 5, 6, 7, U, 8, 9, 10, 11, 12, 13) and to the ChLG X super-scaffold, min confidence 10]
The second scaffold from ChLG X aligned to scaffolds from a portion of ChLG 3.
[Figure: BNG consensus maps aligned to ChLG 3 scaffolds and to ChLG 3 super-scaffolds, min confidence 10. Scaffold labels by panel: 51, U, 43, 45, 44, 46; 47, U, U, 152, 48, 49, 50, 52, 53, 54, U, 57, 55; 32, 33, 34, 35, 36, 2, 37, 38, 39, 40, 41, 42]
Two unplaced scaffolds aligned to ChLG X.
The 4% redundancy in alignment may be from assembly of haplotypes (generally observed as two BNG consensus maps aligning to the same in silico map).
Potential haplotypes occur where overlapping BNG CMAPs align.
For ChLG 9, 21 scaffolds were reduced to 9.
[Figure: BNG consensus maps aligned to ChLG 9 scaffolds (128, 130, 131, 133, 134, 132, 129, 135, 127, 136-145) and to the ChLG 9 super-scaffold, min confidence 10]
For ChLG 5, 17 scaffolds were reduced to 4.
[Figure: BNG consensus maps aligned to ChLG 5 scaffolds (69, 68, 70-74, U, 75-83) and to the ChLG 5 super-scaffold, min confidence 10]
Future directions: Structural Variant (SV)
Use SV-detect pipelines to resize existing gaps in scaffolds and identify mis-assemblies
Acknowledgements
K-INBRE Bioinformatics Core
• Susan Brown - PI
• Nic Herndon - script development
• Nanyan Lu - manual evaluation
• Michelle Coleman - extractions and running the Irys
• Zachary Sliefert - metric summaries
Bionano Genomics
• Ernest Lam - assembly pipeline best practices assistance
• Weiping Wang - assistance with data formats
• Palak Sheth - collaboration to standardize analysis
Script availability
https://github.com/i5K-KINBRE-script-share/Irys-scaffolding
BNG scripts available by request from BNG
Slide availability
http://www.slideshare.net/kstatebioinformatics/using-bionano-maps-to-improve-an-insect-genome-assembly
This project was supported by grants from the National Center for Research Resources (5P20RR016475) and the National Institute of General Medical Sciences (8P20GM103418) from the National Institutes of Health.
Improving the Tribolium draft Assembly with Physical Maps Based on Imaging Ultra-Long Single DNA Molecules
Jennifer M. Shelton (Kansas State University), Michelle Coleman (Kansas State University), Nic Herndon (Kansas State University), Warren Andrews (BioNano Genomics), Weiping Wang (BioNano Genomics), Ernest Lam (BioNano Genomics), Susan J. Brown (Kansas State University)
We use the red flour beetle, a pest of stored grain, as a genetic model organism for developmental studies. As members of the i5k, we join scientists around the globe who are gearing up to sequence 5000 insect genomes to improve human welfare and understand key ecosystem services that insects provide. By investigating insect genomes, we can take a fresh look at how insects transmit some of the most devastating diseases of humans, livestock, and plants on one hand, yet also serve as medical models for cancer, obesity, alcoholism, and neurological disease on the other.
Genome sequencing is becoming very affordable, but genome assembly is still challenging. Most assemblies are basically drafts of the genome, but even heavily curated reference assemblies contain misassemblies and truncations or gaps in repetitive regions. We are using a form of optical mapping to validate and extend the contigs and scaffolds that constitute a genome assembly. The 7x draft assembly of the red flour beetle, Tribolium castaneum, genome is based on paired-end Sanger sequencing of 4-6 Kb insert plasmid libraries, scaffolded with paired-end reads from 40 Kb fosmid and ~130 Kb BAC clones (1). The total assembled length of ~156 Mb represents 75% of the estimated genome (200 Mb) and presumably lacks a significant portion of repetitive DNA. Superscaffolds or chromosome linkage group builds (ChLG 2-10 and X) were constructed by mapping molecular markers from the genetic recombination map to the assembly scaffolds, anchoring greater than 90% of the assembled sequence.
To improve this draft assembly, we constructed physical maps of the T. castaneum genome using the Irys system designed by BioNano Genomics (http://www.bionanogenomics.com/). Consensus maps of the imaged molecules were compared with in silico maps generated from the assembly sequence. Here we report our progress on using these comparisons to validate the assembly in regions where they agree and reanalyze the assembly in regions where they do not. Additional scaffolds have been anchored to the chromosomes, order and orientation of scaffolds have been corrected, and scaffolds have been extended by spanning repetitive regions.
1) Nature 2008 452:949-55.
2) Baylor College of Medicine Human Genome Sequencing Center 2012 https://www.hgsc.bcm.edu/software/
3) Genome Biology 2012 13:R56
4) BMC Bioinformatics 2013, 14(Suppl 7):S6
T. castaneum 5.1
Sequence scaffolds were aligned to BNG maps with IrysView. The alignments were filtered and used to create new scaffolds.
length (Mb): 164.60; scaffolds: 2157; scaffold N50 (Mb): 4.00
T. castaneum 5.0
Illumina long distance jump-libraries extended scaffolds using Atlas gap-link (2) and Base Clear gapfiller (3).
length (Mb): 160.745; scaffolds: 2240; scaffold N50 (Mb): 1.16
T. castaneum 3.0
Baylor Sanger 7x draft assembly and molecular genetic map
length (Mb): 160.466; scaffolds: 2321; scaffold N50 (Mb): 0.98
Figure 1 Generation of consensus physical maps
a) Ultra long molecules (Mb) were nicked on one strand and labeled with fluorescent nucleotides.
b) Individual molecules were imaged on a massively parallel scale in nanochannels printed on silicon chips.
c) Consensus maps were assembled from populations of overlapping molecules.
Figure 2 Genome refinements using consensus physical maps
Figure 3 Super scaffolding alignments
T. castaneum assembly before (5.0) and after (5.1) super scaffolding. Assembly contigs (top in green) aligned with BNG consensus maps (bottom in blue) with BNG map molecule coverage in dark blue. Arrows indicate newly placed scaffolds.
Table 1 Super scaffolding of T. castaneum 5.0 with BNG consensus maps
Assembly metrics for T. castaneum assembly 5.0 and for T. castaneum assembly 5.1 after BNG maps were used to super scaffold
Figure 4 Comparative Genomics
Comparison of T. castaneum 5.1 (top in green) with BNG consensus maps for T. freemani (bottom in blue with BNG map molecule coverage in dark blue)