Genome Assembly Forensics

DESCRIPTION

Automated assemblies are one thing, good assemblies are another! This presentation covers the basic concepts of using paired-end and mate pair read data to identify mis-assemblies. It also covers some of the tools for visualising and correcting mis-assemblies. An attempt is made to rate these tools on their feature set and scalability beyond small (

Page 1: Genome Assembly Forensics

Genome Assembly Forensics and Visualisation

Nathan S. Watson-Haigh

Fri 11th May 2012, ACPFG Journal Club

Schatz, M.C. et al., 2007. Hawkeye: an interactive visual analytics tool for genome assemblies. Genome Biology, 8(3), p.R34.

Phillippy, A.M., Schatz, M.C. & Pop, M., 2008. Genome assembly forensics: finding the elusive mis-assembly. Genome Biology, 9(3), p.R55.

Schatz, M.C. et al., 2011. Hawkeye and AMOS: Visualizing and Assessing the Quality of Genome Assemblies. Briefings in Bioinformatics. Available at: http://bib.oxfordjournals.org/content/early/2011/12/23/bib.bbr074.

Page 2: Genome Assembly Forensics

Overview

• Genome Assembly
• N50/N90/N95
• Paired-end and Matepair Reads
• Mis-assembly Signatures
• Assembly Validation and Manual Editing

Page 3: Genome Assembly Forensics

Genome Assembly – Shotgun Reads

aligned shotgun reads

DNA being sequenced

Page 4: Genome Assembly Forensics

Genome Assembly – Repeats

Page 5: Genome Assembly Forensics

Genome Assembly – Repeats

Page 6: Genome Assembly Forensics

Genome Assembly – Repeats

reads from different repeats can’t be resolved

double coverage

Page 7: Genome Assembly Forensics

Genome Assembly – Repeats

Page 8: Genome Assembly Forensics

Genome Assembly – Diploid

Page 9: Genome Assembly Forensics

Assembly Metrics – N50

• The N50 is the most widely reported metric for de novo assemblies
• It is a single measure of the contig length distribution of an assembly
– If contigs are sorted into descending length order, the N50 is the length of the contig at which the running total of contig lengths first reaches at least 50% of the total length of all the contigs
– Commonly reported with the N90 and N95
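
The definition above can be sketched in a few lines of Python. This is only an illustration: the function name `nx` and the contig lengths are made up for the example, and integer percentages are used to avoid floating-point rounding at the threshold.

```python
def nx(lengths, percent):
    """Return the Nx contig length, e.g. percent=50 gives the N50.

    Contigs are sorted longest first and summed until the running
    total reaches at least `percent` of the total assembly length.
    """
    if not lengths:
        raise ValueError("need at least one contig length")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 100 >= percent * total:
            return length


# Hypothetical contig lengths (bases), total = 300
contigs = [100, 80, 60, 40, 20]
print("N50:", nx(contigs, 50))
print("N90:", nx(contigs, 90))
print("N95:", nx(contigs, 95))
```

With these lengths the running totals are 100, 180, 240, 280 and 300, so the 50%, 90% and 95% thresholds (150, 270 and 285) are crossed at the contigs of length 80, 40 and 20 respectively.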

Page 10: Genome Assembly Forensics

Assembly Metrics – N50

[Figure: contig lengths summed to reach the N50, N90 and N95 thresholds]

Page 11: Genome Assembly Forensics

Assembly Metrics – N50

• The N50 is the most widely reported metric for de novo assemblies
• It is a single measure of the contig length distribution of an assembly
– If contigs are sorted into descending length order, the N50 is the length of the contig at which the running total of contig lengths first reaches at least 50% of the total length of all the contigs
– Commonly reported with the N90 and N95
• These stats DO NOT imply anything about assembly quality
– Could simply concatenate contigs together to get a better N50!!
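
To make the warning concrete, here is a small sketch with made-up contig lengths: joining two contigs end to end, with no evidence that they are adjacent, leaves the total assembly length unchanged but more than doubles the N50. The helper `n50` is defined here only for the illustration.

```python
def n50(lengths):
    """Length of the contig at which the running total (longest first)
    first reaches at least half of the total assembly length."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Hypothetical assembly: five contigs, 300 bases in total
original = [100, 80, 60, 40, 20]

# "Cheat": concatenate the two longest contigs without any supporting evidence
concatenated = [100 + 80, 60, 40, 20]

print("total length:", sum(original), "->", sum(concatenated))
print("N50:", n50(original), "->", n50(concatenated))
```

The sequence content is identical in both cases, so the larger N50 says nothing about whether the join is correct.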

Page 12: Genome Assembly Forensics

Paired-end Reads

Page 13: Genome Assembly Forensics

Matepair Reads

Page 14: Genome Assembly Forensics

Paired-end and Matepair Reads

Paired-end Matepair

reverse complement

Page 15: Genome Assembly Forensics

So, Why are Pairs so Useful?

Page 16: Genome Assembly Forensics

So, Why are Pairs so Useful?

Page 17: Genome Assembly Forensics

Pairs are Useful – Orientation and Separation

Page 18: Genome Assembly Forensics

Pairs are Useful – Orientation and Separation

Page 19: Genome Assembly Forensics

Pairs are Useful – Orientation and Separation

Page 20: Genome Assembly Forensics

Pairs are Useful – Orientation and Separation

Page 21: Genome Assembly Forensics

Pairs are Useful – Orientation and Separation

Incorrect orientation

Incorrect distance
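
The two checks named on this slide can be sketched as a small classifier. Everything in it is an assumption made for illustration: the `MappedPair` record, the expectation that a good pair maps facing inwards (forward read on the left, reverse read on the right, as for paired-end reads or matepair reads after reverse complementing), and the cut-off of 3 standard deviations around the library mean.

```python
from dataclasses import dataclass


@dataclass
class MappedPair:
    """Hypothetical record for a read pair mapped to a single contig."""
    left_start: int    # leftmost coordinate of the left-hand read
    right_end: int     # rightmost coordinate of the right-hand read
    left_strand: str   # '+' or '-'
    right_strand: str  # '+' or '-'


def classify(pair, library_mean, library_sd, tolerance=3.0):
    """Label a pair as consistent or as one of the two error classes."""
    # Assumed convention: a good pair faces inwards, '+' on the left, '-' on the right
    if (pair.left_strand, pair.right_strand) != ("+", "-"):
        return "incorrect orientation"
    # Separation is measured end to end across both reads
    span = pair.right_end - pair.left_start
    if abs(span - library_mean) > tolerance * library_sd:
        return "incorrect distance"
    return "consistent"


# Hypothetical 500bp library with a standard deviation of 50bp
print(classify(MappedPair(100, 600, "+", "-"), 500, 50))
print(classify(MappedPair(100, 600, "+", "+"), 500, 50))
print(classify(MappedPair(100, 1100, "+", "-"), 500, 50))
```

A single bad pair is weak evidence; it is clusters of pairs failing the same check in the same region that suggest a mis-assembly.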

Page 22: Genome Assembly Forensics

Mis-assembly Signatures – Collapsed Tandem Repeat

Correct alignment

Incorrect alignment

Page 23: Genome Assembly Forensics

Mis-assembly Signatures – Collapsed Tandem Repeat

Mis-assembly

Correct assembly

Page 24: Genome Assembly Forensics

Mis-assembly Signatures – Collapsed (small) Tandem Repeat

Mis-assembly

Correct assembly

Page 25: Genome Assembly Forensics

Mis-assembly Signatures – Collapsed Repeat

Mis-assembly

Correct assembly

Page 26: Genome Assembly Forensics

Mis-assembly Signatures – Rearrangement

Mis-assembly

Correct assembly

Page 27: Genome Assembly Forensics

Automated Assemblies Are One Thing, Good Assemblies Are Another

• Given the computer resources you can generate an automated assembly in a few weeks
– Not necessarily good
– Need to optimise assembly parameters
• For small genomes (< ~15Mbases)
– Commodity hardware
– Overlap-layout-consensus (OLC) assemblers
• For larger genomes
– More RAM (10s-100s of Gbytes) for OLC assemblers
– De Bruijn Graph assemblers
– Read Mapping step to generate contig read alignments

Page 28: Genome Assembly Forensics

Automated Assemblies Are One Thing, Good Assemblies Are Another

• Automated assemblies need to be checked for mis-assemblies
– Need paired-end/matepair reads
– Need viewers to visualise paired-end data
– Need editors to break/join/reassemble parts of the assembly deemed to be inconsistent with read pair info
– Need enough computer hardware to allow all this data to be loaded, especially with large volumes of Illumina paired-end data

Page 29: Genome Assembly Forensics

Automated Assemblies Are One Thing, Good Assemblies Are Another

• Very time consuming and laborious to check/edit
– Small assemblies (< ~15Mbases)
  • Several weeks/few months to move 1 scaffold/contig at a time
– Large assemblies need a team to do the same thing
  • Need enough RAM to load all the paired-end data
  • Need ways to identify regions requiring closer inspection
  • Identify possible mis-assemblies
• Major hurdles
– Software inadequacies
– Time
– File formats! Grrrr!

Page 30: Genome Assembly Forensics

Software Inadequacies

Software | Contig View | Scaffold View | Editing | Reassemble | Clipping Info | Other
SeqMan Pro | 9 | 9 | 6 | 6 | 6 | $$, buggy, not for large assemblies (32bit), 1 template size
Gap5 | 6 | NA | 9 | NA | 8 | Free, join editor, contig comparator, poor visual support for many contigs, shuffle pads, ACE, multiple template sizes
Consed | 6 | 6 | 5 | 9 | 6 | Free/US$2500/US$10k, poor visual support for many contigs, multiple template sizes
Hawkeye | 9 | 9 | NA | NA | 7 | Leverages AMOS, automated detection of mis-assemblies, large assemblies, modular

Page 31: Genome Assembly Forensics

SeqMan Pro – Strategy View

Page 32: Genome Assembly Forensics

SeqMan Pro

Page 33: Genome Assembly Forensics

Software Inadequacies

Page 34: Genome Assembly Forensics

Gap5 – Template View

Page 35: Genome Assembly Forensics

Gap5 – Contig Comparator

Page 36: Genome Assembly Forensics

Gap5 – Join Editor

Page 37: Genome Assembly Forensics

Gap5 – Contig Editor

Page 38: Genome Assembly Forensics

Software Inadequacies

Page 39: Genome Assembly Forensics

Consed – Assembly View

Page 40: Genome Assembly Forensics

Consed – Contig Viewer/Editor

Page 41: Genome Assembly Forensics

Software Inadequacies

Page 42: Genome Assembly Forensics
Page 43: Genome Assembly Forensics
Page 44: Genome Assembly Forensics
Page 45: Genome Assembly Forensics
Page 46: Genome Assembly Forensics
Page 47: Genome Assembly Forensics
Page 48: Genome Assembly Forensics

Scaffold/Contig Length Distribution

Page 49: Genome Assembly Forensics

Library Stats

Page 50: Genome Assembly Forensics

Compression-Expansion (CE) Statistic

• A measure of how far the local distribution of insert sizes deviates from the global distribution of insert sizes
– 0 indicates no deviation
– ≤ −3 indicates strong compression
– ≥ +3 indicates strong expansion
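
As a rough sketch, the CE statistic can be computed as a z-score: the mean insert size of the pairs spanning a position, minus the library-wide mean, divided by the standard error of that local mean. The formulation below is assumed to match the one used by AMOS, and the function name and numbers are hypothetical. Negative values point to compression (inserts shorter than expected) and positive values to expansion.

```python
import math


def ce_statistic(local_inserts, global_mean, global_sd):
    """z-score of the local mean insert size against the library-wide distribution."""
    n = len(local_inserts)
    if n == 0:
        raise ValueError("no inserts span this position")
    local_mean = sum(local_inserts) / n
    # Standard error of a mean of n inserts drawn from the global distribution
    standard_error = global_sd / math.sqrt(n)
    return (local_mean - global_mean) / standard_error


# Hypothetical 3kb library: mean 3000bp, standard deviation 300bp
compressed = [2700] * 9   # nine spanning inserts, each 300bp shorter than expected
expanded = [3300] * 9     # nine spanning inserts, each 300bp longer than expected
typical = [3000] * 9

print(ce_statistic(compressed, 3000, 300))
print(ce_statistic(expanded, 3000, 300))
print(ce_statistic(typical, 3000, 300))
```

Because the denominator shrinks with the square root of the number of spanning inserts, a modest but consistent shift across many pairs produces a large CE value, while a single odd pair does not.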

Page 51: Genome Assembly Forensics

Insert Coverage Read Coverage

Page 52: Genome Assembly Forensics

500bp inserts 3kb inserts

20kb inserts

Page 53: Genome Assembly Forensics

AMOSvalidate

• An assembly analysis pipeline to identify possible mis-assemblies
– Paired-end data
  • CE stats
  • Incorrect orientation
  • Missing mate
– Coverage
– SNP density
– Singletons
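
The coverage check can be illustrated with a short sketch. It is not AMOSvalidate's implementation, only an example of the idea that a collapsed repeat shows up as a run of unusually deep read coverage. The function names, the half-open `(start, end)` read intervals and the depth threshold are all assumptions made for the example.

```python
def depth_per_base(contig_length, reads):
    """Per-base read depth from half-open (start, end) read alignments."""
    delta = [0] * (contig_length + 1)
    for start, end in reads:
        delta[start] += 1   # a read begins covering at `start`
        delta[end] -= 1     # and stops covering at `end`
    depth, running = [], 0
    for change in delta[:contig_length]:
        running += change
        depth.append(running)
    return depth


def high_coverage_regions(depth, threshold):
    """Half-open (start, end) runs where depth is at least `threshold`."""
    regions, start = [], None
    for pos, d in enumerate(depth):
        if d >= threshold and start is None:
            start = pos
        elif d < threshold and start is not None:
            regions.append((start, pos))
            start = None
    if start is not None:
        regions.append((start, len(depth)))
    return regions


# Hypothetical 10-base contig with four aligned reads
reads = [(0, 6), (4, 10), (4, 8), (3, 7)]
depth = depth_per_base(10, reads)
print(depth)
print(high_coverage_regions(depth, threshold=3))
```

In practice the threshold would be set relative to the assembly-wide mean depth, and flagged regions would be cross-checked against the paired-end signatures before any editing.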

Page 54: Genome Assembly Forensics
Page 55: Genome Assembly Forensics

Hawkeye Cons

• Poor support for correcting mis-assemblies once detected

Page 56: Genome Assembly Forensics

Software Inadequacies

Page 57: Genome Assembly Forensics

Closing Remarks

• Software exists to allow manual editing of assemblies
– Time consuming
– Different tools have different features
– Most fall over with assemblies > ~15Mbases or with many contigs/scaffolds (10k-100k)

Page 58: Genome Assembly Forensics

Closing Remarks

• Ideal Tool
– Contig/scaffold viewer capable of displaying compressed/expanded mates and which contigs mates map to when they fall off the contig/scaffold (like SeqMan Pro and Hawkeye)

Page 59: Genome Assembly Forensics
Page 60: Genome Assembly Forensics
Page 61: Genome Assembly Forensics

Closing Remarks

• Ideal Tool
– Contig/scaffold viewer capable of displaying compressed/expanded mates and which contigs mates map to when they fall off the contig/scaffold (like SeqMan Pro and Hawkeye)
– Contig join editor for manual alignment and editing of contigs (like Gap5)

Page 62: Genome Assembly Forensics

Gap5 – Join Editor

Page 63: Genome Assembly Forensics

Closing Remarks

• Ideal Tool
– Contig/scaffold viewer capable of displaying compressed/expanded mates and which contigs mates map to when they fall off the contig/scaffold (like SeqMan Pro and Hawkeye)
– Contig join editor for manual alignment and editing of contigs (like Gap5)
– Visualise clipped regions with consensus mismatches (like Gap5)

Page 64: Genome Assembly Forensics

Gap5 – Contig Editor

Page 65: Genome Assembly Forensics

Closing Remarks

• Ideal Tool
– Contig/scaffold viewer capable of displaying compressed/expanded mates and which contigs mates map to when they fall off the contig/scaffold (like SeqMan Pro and Hawkeye)
– Contig join editor for manual alignment and editing of contigs (like Gap5)
– Visualise clipped regions with consensus mismatches (like Gap5)
– Automated analysis of assembly to identify regions requiring attention (like AMOSvalidate) and a way to navigate to those regions for editing
– Minimise mouse-clicks and keyboard presses!!

Page 66: Genome Assembly Forensics
Page 67: Genome Assembly Forensics

Newbler Plant Genome Assemblies

• Pretty conservative in contig construction
• Seems to split out repetitive regions into their own contigs pretty well
• Heterozygosity issues
– SNP alignment issues
– Indels break contigs
– Hidden in clipped regions
– Manual joining of neighbouring contigs can reduce scaffolded contig numbers by 60-70%
– Many unscaffolded contigs have high sequence similarity to scaffolded contigs; collapsing these could reduce the number of unscaffolded contigs by 50%

Page 68: Genome Assembly Forensics

Gap5 – Contig Editor