SUPERVISED DNA BARCODES SPECIES CLASSIFICATION: ANALYSIS, COMPARISON, AND RESULTS

Emanuel Weitschek, Giulia Fiscon, Giovanni Felici

Outline

• How we approach the DNA Barcodes species classification problem

• Supervised machine learning

• Supervised machine learning DNA Barcodes classification methods: BLOG 2.0

• Supervised machine learning DNA Barcodes classification methods: WEKA

• Consolidated DNA Barcodes classification methods

• Methods comparison: the data sets

• Methods comparison: Weka

• Methods comparison: Weka and consolidated DNA Barcodes classification methods

• Conclusions

How we approach the DNA Barcodes species classification problem

• Goal: assign an unknown specimen to a known species starting from its DNA Barcode sequence

• The classification problem may be formulated in the following way [Weitschek et al., 2013]:

− given a reference library composed of DNA Barcode specimen sequences of known species and a collection of unknown DNA Barcode sequences (query set),

− assign each query sequence to one of the species that are present in the library

• To obtain reliable results:

− the query set has to contain only specimens of species that are present in the reference library

− the reference set has to contain a sufficient number of specimen sequences for each species (at least 4 specimens per species)
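The two reliability conditions can be checked mechanically before any classifier is trained. The following is a minimal Python sketch, not part of BLOG or Weka: the function name `check_library` and its arguments are hypothetical, and species labels are assumed to be plain strings.

```python
from collections import Counter

MIN_SPECIMENS = 4  # threshold stated on the slide


def check_library(reference_labels, query_labels=None, min_specimens=MIN_SPECIMENS):
    """Return (undersampled species, query species missing from the library)."""
    counts = Counter(reference_labels)
    # Species with fewer reference specimens than the required minimum.
    undersampled = sorted(s for s, n in counts.items() if n < min_specimens)
    # Query species (when their labels are known) that the library does not cover.
    unknown = sorted(set(query_labels or []) - set(counts))
    return undersampled, unknown


if __name__ == "__main__":
    reference = ["Inga_alba"] * 4 + ["Inga_chartacea"] * 2
    print(check_library(reference, ["Inga_alba", "Inga_edulis"]))
```

Both returned lists being empty would mean the stated conditions hold for the given labels.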

Supervised machine learning

• The user has to provide as input a training set (reference library) containing specimens with a priori known species membership

• Based on this training set, the software computes the classification model

• Subsequently, the classification model can be applied to a test set (query set) which contains specimens that require classification

• The test set can contain query specimens with unknown species membership or, alternatively, specimens that also have a priori known species membership, allowing verification of the specimen classifications
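The train, apply, and verify steps described here can be sketched in a few lines of Python. A nearest-neighbour classifier under Hamming distance stands in for the learning method; that choice is an assumption made for brevity and is not the method used by the authors. Sequences are assumed to be pre-aligned and of equal length, and all function names are hypothetical.

```python
def hamming(a, b):
    """Number of positions at which two aligned sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def train(reference):
    """reference: list of (sequence, species). The 'model' is the stored library."""
    return list(reference)


def classify(model, sequence):
    """Assign the species of the closest reference sequence."""
    _, species = min(model, key=lambda item: hamming(item[0], sequence))
    return species


def accuracy(model, labelled_queries):
    """Fraction of query specimens with known species that are classified correctly."""
    hits = sum(classify(model, seq) == species for seq, species in labelled_queries)
    return hits / len(labelled_queries)


if __name__ == "__main__":
    model = train([("AAAA", "species_x"), ("CCCC", "species_y")])
    print(classify(model, "AAAC"))
    print(accuracy(model, [("AAAC", "species_x"), ("CCCA", "species_y")]))
```

When the query set carries known labels, `accuracy` plays the role of the verification step mentioned in the last bullet.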

Supervised machine learning DNA Barcodes classification methods: BLOG 2.0

BLOG 2.0: a software system for character-based species classification with DNA Barcode sequences. What it does, how to use it. E. Weitschek, R. van Velzen, G. Felici and P. Bertolazzi. Molecular Ecology Resources 13(6):1043-1046, 2013 (doi: 10.1111/1755-0998.12073)

Supervised machine learning DNA Barcodes classification methods: BLOG 2.0

• Input a reference library in FASTA format

− sequences have to be of the same region or pre-aligned to the same region

• BLOG computes for each species the distinctive nucleotide positions of the DNA Barcode sequences and the logic classification formulas (small rules in the form of “if-then” that are able to characterize a species in a compact way)

• The classification formulas can be applied to a query set

Example formulas:

If pos3 = A and pos458 = C then the specimen is a

IF BASE IN POSITION 466 IS C AND BASE IN POSITION 595 IS T THEN SPECIES IS … 1

IF BASE IN POSITION 340 IS G AND BASE IN POSITION 451 IS A AND BASE IN POSITION 493 IS C THEN SPECIES IS … 2

IF BASE IN POSITION 340 IS T AND BASE IN POSITION 466 IS A AND BASE IN POSITION 625 IS G THEN SPECIES IS … 3
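Formulas of this kind are straightforward to apply to a query sequence. The sketch below encodes the three example formulas as position/base conditions and returns the label of the first formula that holds. It is only an illustration: the species labels are placeholders (the slide elides the names), positions are assumed to be 1-based, and the first-match strategy is an assumption rather than BLOG's documented behaviour.

```python
# Each rule: ({position (1-based): required base}, label).
# Labels are placeholders; the slide does not give the species names.
RULES = [
    ({466: "C", 595: "T"}, "species_1"),
    ({340: "G", 451: "A", 493: "C"}, "species_2"),
    ({340: "T", 466: "A", 625: "G"}, "species_3"),
]


def matches(sequence, conditions):
    """True if every (position, base) condition holds for the sequence."""
    return all(
        pos <= len(sequence) and sequence[pos - 1] == base
        for pos, base in conditions.items()
    )


def classify(sequence, rules=RULES):
    """Return the label of the first matching rule, or None if no rule matches."""
    for conditions, label in rules:
        if matches(sequence, conditions):
            return label
    return None
```

A sequence that satisfies none of the formulas is left unclassified (`None`) instead of being forced into a species.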

Supervised machine learning DNA Barcodes classification methods: BLOG 2.0

http://bol.uvm.edu

http://dmb.iasi.cnr.it/blog.php

Supervised machine learning DNA Barcodes classification methods: WEKA

• WEKA (Waikato Environment for Knowledge Analysis) machine learning software is adopted for DNA Barcoding

• WEKA contains several methods to perform supervised classification of general problems

• Input a reference library in ARFF format

− FASTA sequences have to be converted

− sequences have to be of the same region or pre-aligned to the same region

• Weka computes the classification model

• The classification model can be applied to a query set

• To use Weka for DNA Barcodes classification, the reference and query sets have to be converted to ARFF format

Supervised machine learning DNA Barcodes classification methods: WEKA

• FASTA to WEKA

• Example

DNA barcode: FASTA format (header with specimen ID and species name, followed by the nucleotide sequence)

> CC_1c_ID115 | Inga_alba
ATT

> CC.MZ_9_ID316 | Inga_chartacea
AAC

Weka format: ARFF format (one attribute per nucleotide position, the species names as class values, one data row per nucleotide sequence)

@relation Inga_test

@attribute pos1 numeric
@attribute pos2 numeric
@attribute pos3 numeric
@attribute class {Inga_alba,Inga_chartacea}

@data
1,4,4 Inga_alba
1,1,2 Inga_chartacea

Available upon request ([email protected]), soon online on http://dmb.iasi.cnr.it
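The conversion shown in the example can be sketched in Python as follows. This is not the authors' converter, only an illustration under stated assumptions: the codes A=1, C=2 and T=4 are read off the example, G=3 is inferred and therefore an assumption, any other character is written as `?` (the ARFF missing-value marker), and data rows are written fully comma-separated. The function names are hypothetical.

```python
# Numeric codes: A, C and T follow the slide's example; G=3 is an assumption.
CODES = {"A": 1, "C": 2, "G": 3, "T": 4}


def parse_fasta(text):
    """Return a list of (specimen_id, species, sequence) from '> id | species' records."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append(header + ("".join(chunks),))
            specimen_id, species = (part.strip() for part in line[1:].split("|", 1))
            header, chunks = (specimen_id, species), []
        else:
            chunks.append(line.upper())
    if header is not None:
        records.append(header + ("".join(chunks),))
    return records


def to_arff(records, relation):
    """Build ARFF text; all sequences are assumed aligned to the same length."""
    length = len(records[0][2])
    species = sorted({sp for _, sp, _ in records})
    lines = [f"@relation {relation}", ""]
    lines += [f"@attribute pos{i} numeric" for i in range(1, length + 1)]
    lines += [f"@attribute class {{{','.join(species)}}}", "", "@data"]
    for _, sp, seq in records:
        values = [str(CODES.get(base, "?")) for base in seq]
        lines.append(",".join(values + [sp]))
    return "\n".join(lines)


if __name__ == "__main__":
    fasta = "> CC_1c_ID115 | Inga_alba\nATT\n> CC.MZ_9_ID316 | Inga_chartacea\nAAC\n"
    print(to_arff(parse_fasta(fasta), "Inga_test"))
```

The same function would be applied to both the reference library and the query set so that the two ARFF files share attribute names and class values.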

Supervised machine learning DNA Barcodes classification methods: WEKA

WEKA contains several supervised machine learning methods to perform classification, all of which can be used for DNA Barcoding

Methods    Description
Bayes      Bayesian networks (e.g., Naive Bayes)
Functions  Linear regression, Neural networks, Support Vector Machines
Lazy       Instance-based similarity (e.g., Nearest neighbor algorithm)
Meta       Bagging, Boosting, Stacking, Regression through classification, Classification through regression, Cost sensitive classification
Rules      Rule-based classifiers (e.g., JRip)
Trees      Tree classifiers (e.g., decision tree)
Mi         Algorithms that handle multi-instance data
Misc       Various classifiers that don't fit in any other category

Supervised machine learning DNA Barcodes classification methods: WEKA

• Functions – Support Vector Machines (SMO):

− Two-class distinction problem (one vs. all the others approach)

− Transforms the data into n-dimensional vectors and builds the best separating hyperplane between the two classes

− Performs well, but provides no human interpretable classification model

• Rule based – RIPPER (JRip):

− Extracts for every species in the reference library a characterizing “if-then” rule

− The classification model is compact and human interpretable

• Classification tree – C4.5 (J48):

− Mathematical structures composed of nodes (nucleotide assignments) and edges (decisions); the species labels are on the leaves of the tree

− A path from the root to a leaf is a set of decisions on the attribute values that leads to the classification of a specimen (it can be transformed into an “if-then” rule)

• Bayesian – Naïve Bayes:

− Joint probability distribution of a set of variables

− Bayesian networks are based on the state of the observable variables and on a priori probabilities, represented by the edges in the relations between variables; they evaluate the a posteriori probabilities of the unknown states
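To make the Bayesian entry more concrete, here is a small Naïve Bayes sketch over aligned nucleotide positions. It is not Weka's implementation: it treats each position as a categorical variable over the alphabet A, C, G, T and uses Laplace smoothing, both of which are assumptions made for this illustration, and the function names are hypothetical.

```python
import math
from collections import Counter, defaultdict

ALPHABET = "ACGT"


def train_naive_bayes(reference):
    """reference: list of (aligned_sequence, species) pairs of equal length."""
    class_counts = Counter(species for _, species in reference)
    # base_counts[species][position][base] -> number of observations
    base_counts = defaultdict(lambda: defaultdict(Counter))
    for sequence, species in reference:
        for position, base in enumerate(sequence):
            base_counts[species][position][base] += 1
    return class_counts, base_counts


def classify(model, sequence):
    """Return the species with the highest a posteriori log-probability."""
    class_counts, base_counts = model
    total = sum(class_counts.values())
    best_species, best_score = None, -math.inf
    for species, n in class_counts.items():
        score = math.log(n / total)  # a priori probability of the species
        for position, base in enumerate(sequence):
            seen = base_counts[species][position][base]
            # Laplace smoothing avoids zero probabilities for unseen bases.
            score += math.log((seen + 1) / (n + len(ALPHABET)))
        if score > best_score:
            best_species, best_score = species, score
    return best_species
```

The per-position probabilities are simply multiplied (added in log space), which is the "naïve" independence assumption; the resulting model is a table of counts, not a set of readable rules.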

DNA Barcodes classification methods

• Tree-based methods assign unidentified (query) barcodes to species based on their membership of clusters (or clades) in a DNA barcode tree

• Similarity-based methods assign query barcodes to species based on how many DNA barcode characters they have in common

• Diagnostic methods (character-based methods) rely on the presence/absence of particular characters in DNA barcode sequences for identification, instead of using all of them

DNA barcoding of recently diverged species: Relative Performance of Matching Methods. R. Van Velzen, E. Weitschek, G. Felici and F.T. Bakker. PLoS One 7(1):e30490, 2012

DNA Barcodes classification methods

• Tree based:

− Neighbour Joining [Saitou and Nei 1987]: the most used method in DNA Barcode data analysis; it is a bottom-up clustering method used for the construction of phylogenetic trees based on sequence distance

− Parsimony [Edwards et al. 1963]: the preferred tree is the tree that requires the least evolutionary change to explain the data; it outperformed other tree-based methods

• Similarity based:

− Nearest Neighbour [Meier et al. 2006]: a distance-based method, which gave very high recognition rates

− BLAST [Altschul et al. 1997]: the most commonly used method for classifying DNA sequences in practice; an algorithm for comparing query sequences with an unaligned reference database, calculating pairwise alignments in the process

• Diagnostic methods:

− DNA-BAR [DasGupta et al. 2005]: it showed higher levels of accurate species identification in previous studies; it first selects sequence substrings (distinguishers) differentiating the sequences in the reference data set, and then records presence/absence of these distinguishers; it does not require an alignment

− BLOG [Bertolazzi, Felici, Weitschek 2009]: character based method; the first time tested
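The presence/absence idea behind the diagnostic methods can be illustrated with a short Python sketch. It only covers the recording and matching step: the distinguishers are taken as given, so DNA-BAR's selection of distinguishers is not reproduced, and the matching by smallest number of fingerprint mismatches is an assumption chosen for this example. All names are hypothetical.

```python
def fingerprint(sequence, distinguishers):
    """Presence/absence vector of each distinguisher substring in the sequence."""
    return tuple(d in sequence for d in distinguishers)


def identify(query, reference, distinguishers):
    """Return the species whose reference fingerprint differs least from the query's.

    reference: list of (sequence, species); no alignment is required because
    only substring presence is tested.
    """
    target = fingerprint(query, distinguishers)

    def mismatches(item):
        sequence, _ = item
        return sum(a != b for a, b in zip(fingerprint(sequence, distinguishers), target))

    _, species = min(reference, key=mismatches)
    return species
```

Because the test is plain substring containment, query and reference sequences may have different lengths and start positions.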

Methods comparison: The data sets

• Publicly available data sets to perform a comparative analysis of the methods

• Empirical data sets [Weitschek et al., 2013; Van Velzen et al., 2012]

Dataset     #Sequences  Seq.length  #Species  Ref
Cypraeidae  2008        614         211       [Meier et al., 2006]
Drosophila  615         663         19        [Lou and Golding, 2010]
Inga        913         1838        56        [Dexter et al., 2010]
Bats        826         659         82        [Ratnasingham et al., 2007]
Fishes      626         419         82        [Bertolazzi et al., 2009]
Birds       1700        255         150       [Hebert et al., 2004]
Fungi       50          510         8         [Ratnasingham et al., 2007]
Algae       26          1128        5         [CBOL Plant Working Group 2009]

• Synthetic data sets [Van Velzen et al., 2012]

Dataset  Ne     #Individual  Seq.length  #Species  Ref
Ne1000   1000   20           650         50        [van Velzen et al., 2012]
Ne10000  10000  20           650         50        [van Velzen et al., 2012]
Ne50000  50000  20           650         50        [van Velzen et al., 2012]

Methods comparison: Weka

• Weka supervised machine learning methods (SVM, RIPPER, C4.5, and Naïve Bayes) were tested on the empirical and simulated data sets

• Reference and query sets were chosen as in the previous references (80% – 20% in empirical data; 80% – 20% replicated 100 fold in simulated data)

• The average accuracies reached on the query sets:

Empirical Datasets  Accuracy (average) [%]
Cypraeidae          91.55
Drosophila          95.26
Inga                89.41
Bats                99.54
Fishes              93.91
Birds               92.35
Fungi               65.00

Simulated Datasets  Accuracy (average) [%]
Ne1000              98.83
Ne10000             95.92
Ne50000             91.32
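An 80% – 20% reference/query split and the accuracy computed on the query set can be sketched as below. The exact splitting procedure of the cited studies is not given on the slide, so the plain random shuffle, the seed parameter, and the function names are assumptions for illustration only.

```python
import random


def split_reference_query(specimens, reference_fraction=0.8, seed=0):
    """Shuffle the specimens and split them into (reference, query) lists."""
    shuffled = list(specimens)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * reference_fraction)
    return shuffled[:cut], shuffled[cut:]


def accuracy_percent(predicted, actual):
    """Percentage of query specimens assigned to their true species."""
    hits = sum(p == a for p, a in zip(predicted, actual))
    return 100.0 * hits / len(actual)


def mean(values):
    """Average of per-replicate (or per-method) accuracies."""
    values = list(values)
    return sum(values) / len(values)
```

Replicating the split with different seeds and averaging the resulting accuracies with `mean` would correspond to the replicated evaluation mentioned for the simulated data.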

Methods comparison: Weka

• Weka supervised machine learning methods (SVM, RIPPER, C4.5, and Naïve Bayes) comparison

• SVM and Naïve Bayes have the highest correct classification rate (accuracy), but provide no human interpretable or compact model of the data set

• JRip and C4.5 have slightly inferior results, but provide a human interpretable classification model

Methods comparison: Weka and consolidated DNA Barcodes classification methods

• SVM and Naïve Bayes have the highest correct classification rate (accuracy)

• BLOG is at a comparable level and provides a classification model in terms of logic formulas

Methods comparison: Weka and consolidated DNA Barcodes classification methods

• Very high accuracy for the supervised machine learning methods in Weka

• Consolidated DNA Barcodes methods are challenged by these datasets

Conclusions

• The classification analysis shows that:

− supervised machine learning methods are promising candidates for successfully handling the DNA Barcode species classification problem

− all methods obtained very good classification performances

− SVM and Naïve Bayes: excellent accuracy, but no human interpretable model

− BLOG, C4.5, and RIPPER: very good results and a human interpretable classification model (if-then rules) that can be used outside the realm of DNA barcoding, for instance in species description or molecular detection assays

• Finally, the DNA Barcoding community is provided with a powerful tool to perform species classification

Main references

1. BLOG 2.0: a software system for character-based species classification with DNA Barcode sequences. What it does, how to use it. E. Weitschek, R. van Velzen, G. Felici and P. Bertolazzi. Molecular Ecology Resources 13(6):1043-1046, 2013 (doi: 10.1111/1755-0998.12073)

2. DNA barcoding of recently diverged species: Relative Performance of Matching Methods. R. Van Velzen, E. Weitschek, G. Felici and F.T.Bakker. Plos One 7(1):e30490, 2012. www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0030490

3. Learning to classify species with barcodes. P. Bertolazzi, G. Felici and E. Weitschek. BMC Bioinformatics 10(S-14):7, 2009. www.biomedcentral.com/1471-2105/10/S14/S7

4. Supervised DNA Barcodes species classification: analysis, comparison, and results. E. Weitschek, G. Fiscon and G. Felici. BMC BioData Mining (under review)

References

• Sarkar IN, Trizna M; The Barcode of Life Data Portal: Bridging the Biodiversity Informatics Divide for DNA Barcoding; PLoS One; 2011

• Saitou N, Nei M; The Neighbour-joining method: a new method for reconstructing phylogenetic trees; Mol Biol Evol; 1987, 4:406 - 425.

• Edwards AWF, Cavalli-Sforza LL; The reconstruction of evolution; Annals of Human Genetics; 1963, 27: 105–106

• Meier R, Shiyang K, Vaidya G, Ng PKL; DNA Barcoding and Taxonomy in Diptera: A Tale of High Intraspecific Variability and Low Identification Success; Systematic Biology; 2006, 55(5): 715-728

• Altschul SF, Madden TL, Schäffer AA, Zhang J, Zhang Z, et al.; Gapped BLAST and PSI-BLAST: a new generation of protein database search programs; Nucleic Acids Research; 1997, 25: 3389-3402

• DasGupta B, Konwar KM, Măndoiu II, Shvartsman AA; DNA-BAR: distinguisher selection for DNA barcoding; Bioinformatics; 2005, 21: 3424-3426

• Lou M, Golding GB; Assigning sequences to species in the absence of large interspecific differences; Molecular Phylogenetics and Evolution; 2010, 56: 187-194

• Dexter KG, Pennington TD, Cunningham CW; Using DNA to assess errors in tropical tree identifications: How often are ecologists wrong and when does it matter?; Ecological Monographs; 2010, 80: 267-286

• Meyer CP, Paulay G; DNA barcoding: error rates based on comprehensive sampling; PLoS Biology; 2005, 3: 2229–2238

• Felici G, Truemper K; The Lsquare System for Mining Logic Data; Encyclopedia of Data Warehousing and Mining; 2005

• Bertolazzi P, Felici G, Festa P, Lancia G; Logic Classification and Feature Selection for Biomedical Data; Computers & Mathematics with Applications, 2008

• Van Velzen R, Weitschek E, Felici G, Bakker FT; DNA barcoding of recently diverged species: Relative Performance of Matching Methods; PLoS One; 2012, 7(1): e30490

• Weitschek E, Van Velzen R, Felici G; Species classification using DNA Barcode sequences: A comparative analysis; IASI CNR Technical Report ; 2011

• Bertolazzi P, Felici G, Weitschek E; Learning to classify species with barcodes; BMC Bioinformatics; 2009

• Arisi, D'Onofrio, Brandi, Di Mambro, Felsani, Capsoni, Drovandi, Felici, Weitschek, Bertolazzi, Cattaneo; Gene expression biomarkers in the brain of a mouse model for Alzheimer's disease: mining of microarray data by logic classification and feature selection; Journal of Alzheimer’s Disease; 2011

• Bertolazzi, Felici, Weitschek, Drovandi, Ciccozzi, Ciotti, Lopresti; Polyomaviruses genome analysis by logic mining techniques; BMC Virology Journal; 2012

• Mark Hall, Eibe Frank, Geoffrey Holmes, Bernhard Pfahringer, Peter Reutemann, Ian H. Witten (2009); The WEKA Data Mining Software: An Update; SIGKDD Explorations, Volume 11, Issue 1.

• DMB project website: http://dmb.iasi.cnr.it

Contacts

• Emanuel Weitschek
  University Roma Tre
  Department of Computer Science and Automation
  Rome, [email protected]

• Giulia Fiscon
  University La Sapienza
  Department of Computer, Control and Management Engineering
  Rome, [email protected]

• Giovanni Felici
  National Research Council
  Institute of System Analysis and Computer Science A. Ruberti
  Rome, [email protected]

Thanks for your attention!