Analysis of typical informatics workflow, extracting data, aligning data, identifying problems and uploading data to BOLD
Using BOLD Data in Bioinformatics Workflows
Dr. Justin Schonfeld
Biodiversity Institute of Ontario
DNA Barcodes
166 Full Eukaryotic Genomes
2,471 Metazoan Mitochondrial Genomes
1,444,076 Barcodes - ~118,000 species
DNA Barcodes represent an enormous resource for researchers of all types.
Applications
• Species Identification
• Taxonomy
• Building the Reference Library
• Ecology
• Proteomics
• Comparative Genomics
• Teaching
• Music
High level data flow
Museums, Private collections, Regulatory Agencies, Researchers
CCDB
BOLD
GenBank, Mirrors, Educators, Researchers, Regulatory Agencies
Australian Museum
Typical Informatics Workflow
BOLD → Extract Data → Local Copy → Filter Data → Filtered Data
Filtered Data → Align Data → Aligned Data
Aligned Data → Identify Problematic Sequences → Cleaned Data
Cleaned Data → Analyze Data
Extracting Data: BOLD Public
• Easy to use
• Flexible search tool
– Search by taxonomic name, geographic region, collector, etc.
– Example Searches: “Hymenoptera”, “Lepidoptera Canada”
Extracting Data: BOLD Public
• Provides data in .tsv, fasta, and xml formats.
• Can select sequence data, trace files, specimen data, combined data.
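A downloaded FASTA file can be read with a few lines of Python. The sketch below is a minimal parser. The pipe-delimited header layout (process ID, species, marker) and the sample records are assumptions made for illustration, so check them against a real download before relying on them.

```python
import io

def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA file-like object."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)

# Hypothetical records; the header layout is assumed, not taken from BOLD.
SAMPLE = """>EXMPL001-10|Apis mellifera|COI-5P
AACATTATATTTTATTTTTGG
TATATGATCAGG
>EXMPL002-10|Bombus terrestris|COI-5P
TATATTATACTTTATTTTTGC
"""

records = list(read_fasta(io.StringIO(SAMPLE)))
for header, seq in records:
    process_id, species, marker = header.split("|")
    print(process_id, species, marker, len(seq))
```

To read a real file, pass an open file handle instead of the in-memory sample.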
Extracting Data: web services
• Provides data in tsv (tab separated value) and xml formats
• Sequence data or full records
• Can be used to provide a complete dump of all public BOLD data: http://services.boldsystems.org/
Extracting Data: web services
• Working with the raw data allows for custom queries
• Not all fields are available as search terms in BOLD Public
• Requires scripting knowledge, or a lot of patience with Excel
• Example: All plants above 2000 ft, etc.
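A custom query such as "all plants above 2000 ft" might look like the sketch below once the records are stored locally as a tab-separated file. The column names (processid, kingdom, species_name, elevation_m), the assumption that elevation is recorded in metres, and the sample rows are all hypothetical; a real download will have its own field names.

```python
import csv
import io

FEET_TO_METRES = 0.3048

# Hypothetical tab-separated records; column names are assumptions.
SAMPLE_TSV = (
    "processid\tkingdom\tspecies_name\televation_m\n"
    "EX001\tPlantae\tSaxifraga oppositifolia\t2450\n"
    "EX002\tPlantae\tAcer saccharum\t310\n"
    "EX003\tAnimalia\tBombus polaris\t900\n"
    "EX004\tPlantae\tPinus albicaulis\t\n"
)

def plants_above(handle, min_feet):
    """Return plant records collected above min_feet of elevation."""
    threshold_m = min_feet * FEET_TO_METRES
    hits = []
    for row in csv.DictReader(handle, delimiter="\t"):
        if row["kingdom"] != "Plantae":
            continue
        try:
            elevation = float(row["elevation_m"])
        except (TypeError, ValueError):
            continue  # skip records with no usable elevation
        if elevation > threshold_m:
            hits.append(row)
    return hits

high_plants = plants_above(io.StringIO(SAMPLE_TSV), 2000)
print([row["processid"] for row in high_plants])
```

Records with a missing elevation are skipped rather than treated as zero, which is a design choice worth revisiting for a real analysis.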
Filter Data
• The Barcode data is collected from a wide variety of independent investigations
• High degree of taxonomic bias
• Tentative Names
• Variable sequence quality
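One way to filter for tentative names and variable sequence quality is sketched below. The list of tentative-name markers and the default thresholds (minimum length 500, at most 1% ambiguous bases) are assumptions chosen for illustration, not rules stated in this talk.

```python
import re

# Markers that often signal a tentative or interim name; this list is an
# assumption for illustration.
TENTATIVE = re.compile(r"\b(sp|cf|aff|nr)\.|\d")

def is_full_binomial(name):
    """True for a two-word 'Genus species' name with no tentative markers."""
    parts = name.split()
    if len(parts) != 2 or TENTATIVE.search(name):
        return False
    return parts[0][:1].isupper() and parts[1].islower()

def ambiguous_fraction(seq):
    """Fraction of non-gap characters that are not A, C, G or T."""
    bases = [b for b in seq.upper() if b != "-"]
    if not bases:
        return 1.0
    return sum(b not in "ACGT" for b in bases) / len(bases)

def keep(name, seq, min_length=500, max_ambiguous=0.01):
    """Apply the name and quality filters to one record."""
    length = len(seq.replace("-", ""))
    return (is_full_binomial(name)
            and length >= min_length
            and ambiguous_fraction(seq) <= max_ambiguous)

# Hypothetical records: (name, sequence).
RECORDS = [
    ("Apis mellifera", "ACGT" * 150),
    ("Apis sp. 1", "ACGT" * 150),
    ("Bombus terrestris", "ACGT" * 50),
    ("Vespula vulgaris", "ACGT" * 140 + "N" * 40),
]

kept = [name for name, seq in RECORDS if keep(name, seq)]
print(kept)
```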
Impact of Alignment
Alignment
Build Phylogenetic Trees
Nearest Neighbor Analysis
Clustering Distance Matrices
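To illustrate why alignment matters for these downstream analyses, the sketch below computes uncorrected p-distances from aligned sequences and picks a nearest neighbor. The sequences are toy data; seq_d is a copy of seq_a deliberately shifted by one position to mimic a misalignment. The p-distance is used here only as a simple stand-in for whatever distance model an analysis actually calls for.

```python
def p_distance(a, b):
    """Proportion of differing sites, ignoring columns with a gap or N."""
    compared = differences = 0
    for x, y in zip(a.upper(), b.upper()):
        if x in "-N" or y in "-N":
            continue
        compared += 1
        differences += x != y
    return differences / compared if compared else float("nan")

def distance_matrix(aligned):
    """Return {(name1, name2): distance} for every pair of sequences."""
    names = list(aligned)
    return {(p, q): p_distance(aligned[p], aligned[q])
            for p in names for q in names}

def nearest_neighbor(name, aligned):
    """Name of the sequence with the smallest distance to `name`."""
    matrix = distance_matrix(aligned)
    others = [n for n in aligned if n != name]
    return min(others, key=lambda other: matrix[(name, other)])

# Toy alignment; seq_d is seq_a shifted right by one column.
ALIGNED = {
    "seq_a": "ACGTACGTAC",
    "seq_b": "ACGTACGTAT",
    "seq_c": "ACGAACTTAT",
    "seq_d": "-ACGTACGTA",
}

print(nearest_neighbor("seq_a", ALIGNED))
print(p_distance(ALIGNED["seq_a"], ALIGNED["seq_d"]))
```

Although seq_d carries the same bases as seq_a, the one-column shift makes every compared site differ, so it looks like the most distant sequence in the set.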
Impact of Alignment
Pairwise Sequence Alignment
Muscle Multiple Sequence Alignment
Aligning Animal Barcode Data
Even a gene as straightforward as CO1 can provide alignment challenges.
[Figure: full CO1 sequence, 5’ to 3’, showing the Barcode, Short CO1, and 3’ CO1 regions]
Aligning Barcode Data
• Multiple Sequence Alignment
– Accurate
– Slow (a thousand sequences can take hours)
– Trouble with variable sequences
• Pairwise Sequence Alignment
– Fast (thousands of sequences in minutes)
– Inconsistent placement of indels
– Highly dependent on choosing the right reference
• Parameters
– Amino Acid vs Nucleotide
– Gap Penalty
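The pairwise approach and the role of the gap penalty can be sketched with a textbook Needleman-Wunsch global alignment. This is a minimal illustration with a linear gap penalty and arbitrary scoring values chosen as assumptions; it is not the aligner used for the results in this talk, and production tools use more refined scoring.

```python
def global_align(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch global alignment with a linear gap penalty."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    def pair(i, j):
        # Score for aligning a[i-1] with b[j-1].
        return match if a[i - 1] == b[j - 1] else mismatch

    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(score[i - 1][j - 1] + pair(i, j),
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair(i, j):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i -= 1
            j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b)), score[-1][-1]

# Toy reference and a query missing one base.
ref = "ACGTACGT"
query = "ACGACGT"
aligned_ref, aligned_query, total = global_align(ref, query)
print(aligned_ref)
print(aligned_query)
print(total)
```

Changing the gap argument changes how readily the aligner opens an indel instead of accepting mismatches, which is one reason indel placement can differ between runs and references.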
Uploading your alignment to BOLD
• Upload in fasta format
• Edit sequence permission on the records
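Writing an alignment out as FASTA is straightforward. The sketch below keeps the gap characters and wraps lines at a fixed width; the identifiers are hypothetical, and which identifier BOLD expects in the header line is an assumption to verify against the upload instructions.

```python
import io

def write_fasta(records, handle, width=60):
    """Write (identifier, aligned_sequence) pairs in FASTA format."""
    for identifier, sequence in records:
        handle.write(f">{identifier}\n")
        # Slice rather than use textwrap, which would break lines on '-'.
        for start in range(0, len(sequence), width):
            handle.write(sequence[start:start + width] + "\n")

# Hypothetical identifiers and toy aligned sequences.
ALIGNMENT = [
    ("EXMPL001-10", "ACG-ACGT"),
    ("EXMPL002-10", "ACGTACGT"),
]

buffer = io.StringIO()
write_fasta(ALIGNMENT, buffer, width=4)
print(buffer.getvalue(), end="")
```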
Identifying Problems
• Stop codons
– Automatically annotated for coding regions
– Even stop codons can be tricky
• Frame shifts
• Ambiguous characters
• Chimeric sequences
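One reason stop codons can be tricky is that the set of stop codons depends on the genetic code. The sketch below scans a single reading frame using the stop codons of three NCBI translation tables (1: standard, 2: vertebrate mitochondrial, 5: invertebrate mitochondrial). The test sequence is a toy example, and the function assumes an ungapped nucleotide sequence.

```python
# Stop codons by NCBI translation table number.
STOP_CODONS = {
    1: {"TAA", "TAG", "TGA"},         # standard code
    2: {"TAA", "TAG", "AGA", "AGG"},  # vertebrate mitochondrial
    5: {"TAA", "TAG"},                # invertebrate mitochondrial
}

def find_stop_codons(seq, frame=0, table=5):
    """Return 0-based start positions of in-frame stop codons."""
    seq = seq.upper()
    stops = STOP_CODONS[table]
    return [i for i in range(frame, len(seq) - 2, 3) if seq[i:i + 3] in stops]

# Toy sequence with codons ATT TGA GGA AGA TAA in frame 0.
seq = "ATTTGAGGAAGATAA"
for table in (1, 2, 5):
    print(table, find_stop_codons(seq, frame=0, table=table))
```

The same sequence yields different stop positions under each table, so choosing the wrong code can either hide a real problem or flag a valid codon.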
Identifying Problems: Frame Shifts
• Frame-shifts in the middle of the sequence are disruptive and easy to spot
• Frame-shifts at the ends of the sequence are more challenging
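A simple screen for frame shifts, assuming each sequence has already been aligned to a trusted in-frame reference, is to look for internal gap runs whose length is not a multiple of three. The sketch below is a heuristic under that assumption: terminal gaps are ignored because they usually reflect sequence length rather than an indel, and the same function would need to be run on the aligned reference to catch insertions in the query.

```python
import re

def frame_shift_gaps(aligned):
    """Return (start, length) for internal gap runs not divisible by 3."""
    core = aligned.strip("-")
    # Number of leading gap columns removed by strip().
    offset = len(aligned) - len(aligned.lstrip("-"))
    return [(m.start() + offset, len(m.group()))
            for m in re.finditer(r"-+", core)
            if len(m.group()) % 3 != 0]

# Toy aligned sequences.
print(frame_shift_gaps("ATGGCA---TTTGGA"))  # in-frame 3-base deletion
print(frame_shift_gaps("ATGGCA-TTTGGACC"))  # single-base deletion
print(frame_shift_gaps("--ATGGCATTT-"))     # terminal gaps only
```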
Identifying Problems: Chimeric Sequences
• Identify change points
• Split the sequence at the point of discontinuity
• BLAST each part
[Figure: a Hymenoptera sequence, a Lepidoptera sequence, and a Hymenoptera/Lepidoptera chimera]
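Identifying a change point can be sketched as follows, assuming the query and two candidate parent sequences are already aligned to the same length. The function picks the split that maximises matches to the first parent on the left and the second parent on the right. The sequences are toy data, and this scoring is one simple option rather than the method used in the talk; each resulting part would then be searched separately.

```python
def best_breakpoint(query, ref_a, ref_b):
    """Return (k, matches) so query[:k] best fits ref_a and query[k:] ref_b."""
    if not (len(query) == len(ref_a) == len(ref_b)):
        raise ValueError("sequences must be aligned to the same length")
    n = len(query)
    # prefix_a[k]: matches to ref_a within query[:k].
    prefix_a = [0]
    for q, r in zip(query, ref_a):
        prefix_a.append(prefix_a[-1] + (q == r))
    # suffix_b[k]: matches to ref_b within query[k:].
    suffix_b = [0] * (n + 1)
    for k in range(n - 1, -1, -1):
        suffix_b[k] = suffix_b[k + 1] + (query[k] == ref_b[k])
    best_k = max(range(n + 1), key=lambda k: prefix_a[k] + suffix_b[k])
    return best_k, prefix_a[best_k] + suffix_b[best_k]

# Toy parents and a chimera built from the first 6 bases of one parent
# and the remaining bases of the other.
ref_a = "ACGTACGTAC"
ref_b = "TGCATGCATG"
chimera = ref_a[:6] + ref_b[6:]

k, matches = best_breakpoint(chimera, ref_a, ref_b)
left, right = chimera[:k], chimera[k:]
print(k, matches, left, right)
```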
Cleaning Data: Updating BOLD
• BOLD is curated by the community
– Re-upload sequences
– Delete sequences
– Annotate sequences
– Flag sequences
BOLD
GenBank, Mirrors, Educators, Researchers, Regulatory Agencies
Example Workflow: Occurrence of Indels
Download public BOLD Hymenoptera records using web services
Select sequences with full taxonomy
Align sequences using MAFFT, Muscle, TransAlign
Select one representative per species
Remove problematic sequences
Tree
Map sequences onto phylogeny
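The two selection steps in this workflow could be sketched as below. The record fields, the ranks that count as "full taxonomy", and the choice of the longest sequence as each species' representative are all assumptions for illustration; the talk does not say how the representative was chosen.

```python
# Ranks required for a record to count as having full taxonomy (assumed).
RANKS = ("order", "family", "genus", "species")

def has_full_taxonomy(record):
    """True when every required rank is present and non-empty."""
    return all((record.get(rank) or "").strip() for rank in RANKS)

def one_per_species(records):
    """Keep the longest sequence per species (ties keep the first seen)."""
    best = {}
    for rec in records:
        current = best.get(rec["species"])
        if current is None or len(rec["sequence"]) > len(current["sequence"]):
            best[rec["species"]] = rec
    return list(best.values())

# Hypothetical records.
RECORDS = [
    {"id": "EX1", "order": "Hymenoptera", "family": "Apidae",
     "genus": "Apis", "species": "Apis mellifera", "sequence": "ACGTACGT"},
    {"id": "EX2", "order": "Hymenoptera", "family": "Apidae",
     "genus": "Apis", "species": "Apis mellifera", "sequence": "ACGTACGTACGT"},
    {"id": "EX3", "order": "Hymenoptera", "family": "Apidae",
     "genus": "Bombus", "species": "", "sequence": "ACGTACGTACGTAC"},
    {"id": "EX4", "order": "Hymenoptera", "family": "Vespidae",
     "genus": "Vespula", "species": "Vespula vulgaris", "sequence": "ACGTAC"},
]

complete = [rec for rec in RECORDS if has_full_taxonomy(rec)]
representatives = one_per_species(complete)
print([rec["id"] for rec in representatives])
```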
Example Workflow: Code shifts
Download public BOLD Hymenoptera records using web services (80,000 sequences)
Align pairwise
Scan sequences for code shifts
Remove problematic sequences
Analyze results
Acknowledgements
• Paul Hebert
• Sujeevan Ratnasingham
• The BOLD Team