Using BOLD Data in Bioinformatics Workflows Dr. Justin Schonfeld Biodiversity Institute of Ontario

Dr Justin Schonfeld - Bioinformatics Applications

DESCRIPTION

Analysis of a typical informatics workflow: extracting data, aligning data, identifying problems, and uploading data to BOLD

Page 1: Dr Justin Schonfeld - Bioinformatics Applications

Using BOLD Data in Bioinformatics Workflows

Dr. Justin Schonfeld
Biodiversity Institute of Ontario

Page 2: Dr Justin Schonfeld - Bioinformatics Applications

DNA Barcodes

166 Full Eukaryotic Genomes
2,471 Metazoan Mitochondrial Genomes
1,444,076 Barcodes - ~118,000 species

DNA Barcodes represent an enormous resource for researchers of all types.

Page 3: Dr Justin Schonfeld - Bioinformatics Applications

Applications

• Species Identification
• Taxonomy
• Building the Reference Library
• Ecology
• Proteomics
• Comparative Genomics
• Teaching
• Music

Page 4: Dr Justin Schonfeld - Bioinformatics Applications

High level data flow

Museums, private collections, regulatory agencies, researchers
→ CCDB
→ BOLD
→ GenBank, mirrors, educators, researchers, regulatory agencies

Australian Museum

Page 5: Dr Justin Schonfeld - Bioinformatics Applications

Typical Informatics Workflow

BOLD
→ Extract Data → Local Copy
→ Filter Data → Filtered Data
→ Align Data → Aligned Data
→ Identify Problematic Sequences → Cleaned Data
→ Analyze Data

Page 6: Dr Justin Schonfeld - Bioinformatics Applications

Extracting Data: BOLD Public

• Easy to use
• Flexible search tool
– Search by taxonomic name, geographic region, collector, etc.
– Example searches: “Hymenoptera”, “Lepidoptera Canada”

Page 7: Dr Justin Schonfeld - Bioinformatics Applications

Extracting Data: BOLD Public

• Provides data in .tsv, fasta, and xml formats.

• Can select sequence data, trace files, specimen data, combined data.

Page 8: Dr Justin Schonfeld - Bioinformatics Applications

Extracting Data: web services

• Provides data in tsv (tab separated value) and xml formats

• Sequence data or full records

• Can be used to provide a complete dump of all public BOLD data

http://services.boldsystems.org/

Page 9: Dr Justin Schonfeld - Bioinformatics Applications

Extracting Data: web services

• Working with the raw data allows for custom queries

• Not all fields are available as search terms in BOLD Public

• Requires scripting knowledge, or a lot of patience with Excel

• Example: All plants above 2000 ft, etc.
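A custom query of this kind can be sketched in a few lines of Python over a downloaded TSV file. The column names `record_id`, `kingdom`, and `elev_m` are hypothetical placeholders, not the actual BOLD field names, and elevations are assumed to be stored in metres.

```python
import csv
import io

FEET_PER_METRE = 3.28084


def records_above(tsv_text, kingdom, min_elev_ft):
    """Return rows of the given kingdom collected above min_elev_ft."""
    min_elev_m = min_elev_ft / FEET_PER_METRE
    selected = []
    for row in csv.DictReader(io.StringIO(tsv_text), delimiter="\t"):
        if row.get("kingdom") != kingdom:
            continue
        try:
            elevation = float(row.get("elev_m") or "")
        except ValueError:
            continue  # no usable elevation recorded for this specimen
        if elevation > min_elev_m:
            selected.append(row)
    return selected


# Tiny in-memory stand-in for a downloaded TSV file (hypothetical columns).
SAMPLE = (
    "record_id\tkingdom\telev_m\n"
    "R1\tPlantae\t1200\n"
    "R2\tPlantae\t300\n"
    "R3\tAnimalia\t2500\n"
    "R4\tPlantae\t\n"
)

hits = records_above(SAMPLE, "Plantae", 2000)
print([row["record_id"] for row in hits])
```

Records with a missing elevation are skipped rather than treated as zero, so they neither pass nor distort the filter.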

Page 10: Dr Justin Schonfeld - Bioinformatics Applications

Filter Data

• The Barcode data is collected from a wide variety of independent investigations

• High degree of taxonomic bias
• Tentative Names
• Variable sequence quality
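One way to filter for tentative names and sequence quality is sketched below. The record layout (`id`, `species`, `seq`), the name pattern, and the thresholds (500 bases, 1% ambiguous characters) are illustrative assumptions rather than values given in the talk.

```python
import re

# Open-nomenclature markers (sp., cf., aff., nr.) or digits suggest a tentative name.
TENTATIVE = re.compile(r"\b(sp|cf|aff|nr)\.|\d")


def is_tentative(name):
    """True for empty names or names that look provisional."""
    return not name or bool(TENTATIVE.search(name))


def ambiguous_fraction(seq):
    """Fraction of non-gap characters that are not A, C, G, or T."""
    bases = seq.replace("-", "").upper()
    if not bases:
        return 1.0
    return sum(base not in "ACGT" for base in bases) / len(bases)


def keep(record, min_length=500, max_ambiguous=0.01):
    """Apply the name, length, and ambiguity filters to one record."""
    seq = record["seq"].replace("-", "")
    return (
        not is_tentative(record["species"])
        and len(seq) >= min_length
        and ambiguous_fraction(seq) <= max_ambiguous
    )


records = [
    {"id": "A", "species": "Apis mellifera", "seq": "ACGT" * 150},
    {"id": "B", "species": "Apis sp. 1", "seq": "ACGT" * 150},
    {"id": "C", "species": "Bombus terrestris", "seq": "ACGT" * 50},
    {"id": "D", "species": "Bombus impatiens", "seq": "ACGN" * 150},
]
kept = [rec["id"] for rec in records if keep(rec)]
print(kept)
```

Taxonomic bias is not addressed by a per-record filter like this; it needs a sampling step such as choosing one representative per species.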

Page 11: Dr Justin Schonfeld - Bioinformatics Applications

Impact of Alignment

Alignment feeds into:
• Building phylogenetic trees
• Nearest neighbor analysis
• Clustering
• Distance matrices
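The effect on distance matrices can be illustrated with a small sketch. It uses the uncorrected p-distance (the proportion of differing sites), which is chosen here only for simplicity and is not necessarily the measure used in the talk; the sequences are made up.

```python
def p_distance(a, b):
    """Proportion of differing sites among columns where both sequences have a base."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    compared = differing = 0
    for x, y in zip(a.upper(), b.upper()):
        if x == "-" or y == "-":
            continue  # skip columns with a gap in either sequence
        compared += 1
        differing += x != y
    return differing / compared if compared else float("nan")


def distance_matrix(seqs):
    """Full pairwise p-distance matrix for a list of aligned sequences."""
    n = len(seqs)
    return [[p_distance(seqs[i], seqs[j]) for j in range(n)] for i in range(n)]


# The same two sequences, once with the gap placed correctly and once misplaced.
well_aligned = ["-ATGGCATTCC", "GATGGCATTCC"]
misaligned = ["ATGGCATTCC-", "GATGGCATTCC"]

print(distance_matrix(well_aligned)[0][1])
print(distance_matrix(misaligned)[0][1])
```

A single misplaced gap turns an identical pair into an apparently divergent one, and that error propagates into any tree, clustering, or nearest neighbor result built on the matrix.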

Page 12: Dr Justin Schonfeld - Bioinformatics Applications

Impact of Alignment

Pairwise Sequence Alignment

Muscle Multiple Sequence Alignment

Page 13: Dr Justin Schonfeld - Bioinformatics Applications

Aligning Animal Barcode Data

Even a gene as straightforward as CO1 can provide alignment challenges.

[Figure: the full CO1 sequence, 5’ to 3’, with the barcode region marked; fragments shown against it are the CO1 barcode, short CO1, and 3’ CO1.]

Page 14: Dr Justin Schonfeld - Bioinformatics Applications

Aligning Barcode Data

• Multiple Sequence Alignment
– Accurate
– Slow (a thousand sequences can take hours)
– Trouble with variable sequences

• Pairwise Sequence Alignment
– Fast (thousands of sequences in minutes)
– Inconsistent placement of indels
– Highly dependent on choosing the right reference

• Parameters
– Amino Acid vs Nucleotide
– Gap Penalty
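As a minimal sketch of pairwise alignment against a reference, the following implements a plain Needleman-Wunsch global alignment on nucleotides with a linear gap penalty. The scoring values are arbitrary illustrations, and production tools use more refined schemes (affine gaps, amino acid scoring).

```python
def global_align(ref, query, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch global alignment; returns (aligned_ref, aligned_query, score)."""
    rows, cols = len(ref) + 1, len(query) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    def sub(i, j):
        return match if ref[i - 1] == query[j - 1] else mismatch

    # Fill the dynamic programming table.
    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),  # align two bases
                score[i - 1][j] + gap,            # gap in the query
                score[i][j - 1] + gap,            # gap in the reference
            )

    # Trace back from the bottom-right corner to recover one optimal alignment.
    aligned_ref, aligned_query = [], []
    i, j = len(ref), len(query)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            aligned_ref.append(ref[i - 1])
            aligned_query.append(query[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            aligned_ref.append(ref[i - 1])
            aligned_query.append("-")
            i -= 1
        else:
            aligned_ref.append("-")
            aligned_query.append(query[j - 1])
            j -= 1
    return "".join(reversed(aligned_ref)), "".join(reversed(aligned_query)), score[-1][-1]


reference = "ATGGCATTCC"
aligned_ref, aligned_query, total = global_align(reference, "ATGGCTTCC")
print(aligned_ref)
print(aligned_query)
print(total)
```

When several gap positions score equally (for example, a deletion inside a run of identical bases), the traceback simply takes the first option it finds. This is one source of the inconsistent indel placement noted above, and the result also changes with the gap penalty and the chosen reference.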

Page 15: Dr Justin Schonfeld - Bioinformatics Applications

Uploading your alignment to BOLD

• Upload in FASTA format
• Requires edit-sequence permission on the records
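Writing an alignment out as FASTA takes only a few lines. This is a generic sketch: the identifiers are hypothetical, and the exact header convention and line width expected by the BOLD upload are not specified here.

```python
import io


def write_fasta(records, handle, width=80):
    """Write (identifier, sequence) pairs to handle in FASTA format."""
    for identifier, sequence in records:
        handle.write(f">{identifier}\n")
        # Wrap long sequences to a fixed line width.
        for start in range(0, len(sequence), width):
            handle.write(sequence[start:start + width] + "\n")


# Hypothetical record identifiers; aligned sequences keep their gap characters.
alignment = [("ID001", "ACGT-ACGT"), ("ID002", "TTGCA")]

buffer = io.StringIO()
write_fasta(alignment, buffer, width=5)
fasta_text = buffer.getvalue()
print(fasta_text, end="")
```

For a real upload the same function can be given an open file handle instead of the in-memory buffer.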

Page 16: Dr Justin Schonfeld - Bioinformatics Applications

Identifying Problems

• Stop codons
– Automatically annotated for coding regions
– Even stop codons can be tricky

• Frame shifts
• Ambiguous characters
• Chimeric sequences
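One reason stop codons can be tricky is that the set of stop codons depends on the genetic code. The sketch below scans a single reading frame using the stop codons of NCBI translation tables 1 (standard), 2 (vertebrate mitochondrial), and 5 (invertebrate mitochondrial); the example sequence is made up.

```python
# Stop codons per NCBI translation table.
STOP_CODONS = {
    1: {"TAA", "TAG", "TGA"},         # standard code
    2: {"TAA", "TAG", "AGA", "AGG"},  # vertebrate mitochondrial
    5: {"TAA", "TAG"},                # invertebrate mitochondrial
}


def stop_codon_positions(seq, frame=0, table=5):
    """Return 0-based start positions of in-frame stop codons (gaps removed first)."""
    stops = STOP_CODONS[table]
    seq = seq.upper().replace("-", "")
    return [i for i in range(frame, len(seq) - 2, 3) if seq[i:i + 3] in stops]


example = "ATGAGACCATAA"  # codons: ATG AGA CCA TAA
for table in (1, 2, 5):
    print(table, stop_codon_positions(example, table=table))
```

The codon AGA is flagged only under table 2, so the same sequence can look clean or problematic depending on which code is applied. The reading frame matters as well, which is why the `frame` argument is explicit.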

Page 17: Dr Justin Schonfeld - Bioinformatics Applications

Identifying Problems: Frame Shifts

• Frame-shifts in the middle of the sequence are disruptive and easy to spot

• Frame-shifts at the ends of the sequence are more challenging
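A simple heuristic for the easy case, frame shifts in the middle of a sequence, is to look for interior gap runs whose length is not a multiple of three in a pairwise alignment against a reference. This is a sketch under that assumption, with made-up sequences, and is not necessarily the method used in the talk.

```python
import re


def frame_shift_indels(aligned_ref, aligned_query):
    """Return (start, length, kind) for interior gap runs not divisible by three."""
    if len(aligned_ref) != len(aligned_query):
        raise ValueError("sequences must be aligned to the same length")
    # Leading and trailing gaps in the query only mean partial coverage.
    first = len(aligned_query) - len(aligned_query.lstrip("-"))
    last = len(aligned_query.rstrip("-"))
    found = []
    for kind, seq in (("deletion", aligned_query), ("insertion", aligned_ref)):
        for run in re.finditer(r"-+", seq):
            if run.start() < first or run.end() > last:
                continue  # terminal gap, not an indel
            if len(run.group()) % 3:
                found.append((run.start(), len(run.group()), kind))
    return sorted(found)


ref = "ATGGCATTCCAA"
print(frame_shift_indels(ref, "ATGGC-TTCCAA"))  # single-base deletion
print(frame_shift_indels(ref, "ATG---TTCCAA"))  # whole codon missing
print(frame_shift_indels(ref, "---GCATTCC--"))  # short read, terminal gaps only
```

This check does not catch shifts near the ends of a sequence, where an aligner may absorb the indel as a few mismatches or a terminal gap instead of an interior one.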

Page 18: Dr Justin Schonfeld - Bioinformatics Applications

Identifying Problems: Chimeric Sequences

• Identify change points
• Split the sequence at the point of discontinuity
• BLAST each part

[Figure: a chimera made of a Hymenoptera segment joined to a Lepidoptera segment, shown between Hymenoptera and Lepidoptera sequences.]
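The change-point step can be sketched as follows, assuming the query and two candidate parent sequences are already aligned to the same length. The function picks the split that maximizes matches to the first parent on the left plus matches to the second parent on the right. The sequences are toy examples, and in practice each resulting part would then be submitted to BLAST rather than compared with known parents.

```python
def best_change_point(query, ref_a, ref_b):
    """Return (k, score): query[:k] best matches ref_a and query[k:] best matches ref_b."""
    n = len(query)
    if not (len(ref_a) == len(ref_b) == n):
        raise ValueError("all three sequences must be aligned to the same length")
    match_a = [q == a for q, a in zip(query, ref_a)]
    match_b = [q == b for q, b in zip(query, ref_b)]
    left = 0              # matches to ref_a in query[:k]
    right = sum(match_b)  # matches to ref_b in query[k:]
    best_k, best_score = 0, right
    for k in range(1, n + 1):
        left += match_a[k - 1]
        right -= match_b[k - 1]
        if left + right > best_score:
            best_k, best_score = k, left + right
    return best_k, best_score


parent_a = "ACGTACGTAC"
parent_b = "TGCATGCATG"
chimera = parent_a[:5] + parent_b[5:]

k, score = best_change_point(chimera, parent_a, parent_b)
print(k, score)
print(chimera[:k], chimera[k:])
```

A split at position 0 or at the full length means the query is explained by a single parent, so only an interior change point with a clearly higher score suggests a chimera.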

Page 19: Dr Justin Schonfeld - Bioinformatics Applications

Cleaning Data: Updating BOLD

• BOLD is curated by the community
– Re-upload sequences
– Delete sequences
– Annotate sequences
– Flag sequences

BOLD → GenBank, mirrors, educators, researchers, regulatory agencies

Page 20: Dr Justin Schonfeld - Bioinformatics Applications

Example Workflow: Occurrence of Indels

• Download public BOLD Hymenoptera records using web services
• Select sequences with full taxonomy
• Align sequences using MAFFT, Muscle, Transalign
• Select one representative per species
• Remove problematic sequences
• Build tree
• Map sequences onto phylogeny
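The step of selecting one representative per species could look like the sketch below. The selection rule (keep the record with the most unambiguous bases) and the record layout are assumptions for illustration; the talk does not say which criterion was used.

```python
def one_per_species(records):
    """Keep, for each species, the record with the most unambiguous bases."""
    best = {}
    for record in records:
        species = record["species"]
        quality = sum(base in "ACGT" for base in record["seq"].upper())
        # Ties keep the record seen first.
        if species not in best or quality > best[species][0]:
            best[species] = (quality, record)
    return [record for _, record in best.values()]


records = [
    {"id": "A1", "species": "Apis mellifera", "seq": "ACGTNNACGT"},
    {"id": "A2", "species": "Apis mellifera", "seq": "ACGTACACGT"},
    {"id": "B1", "species": "Bombus impatiens", "seq": "ACG-TACG"},
]
representatives = one_per_species(records)
print([record["id"] for record in representatives])
```

Reducing each species to a single sequence also dampens taxonomic sampling bias before the tree is built.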

Page 21: Dr Justin Schonfeld - Bioinformatics Applications

Example Workflow: Code shifts

• Download public BOLD Hymenoptera records using web services
• 80,000 sequences: align pairwise
• Scan sequences for code shifts
• Remove problematic sequences
• Analyze results

Page 22: Dr Justin Schonfeld - Bioinformatics Applications

Acknowledgements

• Paul Hebert
• Sujeevan Ratnasingham
• The BOLD Team