Data Cleaning and Data Publishing Workshop 2013 18-22 February, Nairobi, Kenya Javier Otegui @jotegui
TAXONOMIC ASSESSMENTS
¡ What is Taxonomy?
§ CBD – “Taxonomy is the science of naming, describing and classifying organisms and includes all plants, animals and microorganisms of the world”
§ Using morphological, behavioral, genetic and biochemical observations, taxonomists identify, describe and arrange species into classifications, including those that are new to science.
¡ Taxonomy is related to:
§ The identification of an organism
§ Placing the organism in context with the rest of living organisms
TAXONOMY – WHAT IS IT?
¡ Taxonomy is based on names
¡ Humans have always given names
¡ Binomial nomenclature
¡ Define individuals and groups
¡ Each name defines a taxon
TAXONOMY – TAXONOMIC NAMES
¡ Organization and classification of organisms
¡ According to common features
¡ Taxonomic classification
TAXONOMY - HIERARCHIES
http://wp.lps.org/jbenson2/blog/2012/01/18/january-18-taxonomy-chart-lab
¡ Taxonomy has a strong subjective component
¡ Classifications depend on the expertise and point of view of the specialist
¡ Lots of episodes of:
§ Name removals
§ Taxon splits
§ Taxon merges
§ Different organizations according to different features
¡ Some cases…
TAXONOMY – NAMES AND TAXONOMIES
¡ Two different names are applied to the same organism
¡ Expert argues that two originally different taxa are the same
¡ Generally one name remains, the other is considered a synonym and no longer valid
TAXONOMY - SYNONYMY
Photo: Arthur Chapman
Antilocapra americana Ord, 1815
Antilocapra anteflexa Gray, 1855
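Once a synonymy is accepted, resolving it in a dataset is a plain lookup. The sketch below is a minimal illustration, assuming a hand-made table (`SYNONYMS`, a hypothetical name) that holds only the pair shown above; a real table would come from an authoritative checklist.

```python
# Hypothetical synonym table: synonym -> accepted name.
# The single entry is the pair from the slide above.
SYNONYMS = {
    "Antilocapra anteflexa": "Antilocapra americana",
}


def accepted_name(name: str) -> str:
    """Return the accepted name for a known synonym, or the name unchanged."""
    return SYNONYMS.get(name, name)


for name in ["Antilocapra anteflexa", "Antilocapra americana"]:
    print(name, "->", accepted_name(name))
```

Names that are not in the table pass through unchanged, so the function never invents a correction.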
¡ The same name is applied to two different organisms
¡ New description using “already taken” name
¡ Generally, oldest name prevails and newest has to change
TAXONOMY - HOMONYMY
Echidna Cuvier, 1797
Echidna Forster, 1777
Photo: David R
Photo: Petr Baum
Echidna Cuvier, 1797 was replaced by Tachyglossus Illiger, 1811
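A homonym cannot be resolved from the name string alone, which is why the authorship matters. The following sketch assumes a small illustrative list (`NAME_USAGES`, a hypothetical name); the group labels are added here for context and are not part of the slides.

```python
from collections import defaultdict

# Illustrative usages: (name, authorship, group the name was applied to).
NAME_USAGES = [
    ("Echidna", "Forster, 1777", "moray eels"),
    ("Echidna", "Cuvier, 1797", "spiny anteaters"),
    ("Tachyglossus", "Illiger, 1811", "spiny anteaters"),
]

# Index usages by bare name so that homonyms end up in the same bucket.
by_name = defaultdict(list)
for name, authorship, group in NAME_USAGES:
    by_name[name].append((authorship, group))


def is_homonym(name):
    """True when the bare name maps to more than one usage."""
    return len(by_name.get(name, [])) > 1


def find_usage(name, authorship):
    """Return the group for a name plus authorship, or None if unknown."""
    for known_authorship, group in by_name.get(name, []):
        if known_authorship == authorship:
            return group
    return None


print(is_homonym("Echidna"))
print(find_usage("Echidna", "Forster, 1777"))
```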
¡ Taxonomic classifications are subjective
¡ Based on common features
¡ Different experts select different features
¡ Scientific names might remain the same
¡ Higher level taxa or groups might differ
¡ See example…
TAXONOMY – ALTERNATE CLASSIFICATIONS
¡ Because of these issues, taxonomic names alone are not effective
¡ New term: Taxon concept
¡ Name – Concatenation of characters
¡ Concept – Name + context
¡ Even if the name is the same, the concept is different since it applies to different organisms
TAXONOMY – NAME VS CONCEPT
¡ Taxonomic names: Scientific name and all higher taxa
¡ Taxon concept: taxonConceptID, nameAccordingTo, namePublishedIn…
§ Source in which the specific taxon concept circumscription is defined or implied
§ For taxa that result from identifications, a reference to the keys, monographs, experts and other sources should be given
TAXONOMY - STANDARDS
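To make the split between name fields and concept fields concrete, here is a minimal sketch of one record stored as a Python dictionary. The three concept terms are the ones listed above; the higher-taxon field names and all field values (in particular the `nameAccordingTo` source) are illustrative assumptions, not taken from a real dataset.

```python
# Terms describing the name and its hierarchy (assumed field names).
NAME_TERMS = ["scientificName", "kingdom", "phylum", "class", "order", "family", "genus"]
# Terms describing the taxon concept (as listed on the slide).
CONCEPT_TERMS = ["taxonConceptID", "nameAccordingTo", "namePublishedIn"]

record = {
    "scientificName": "Antilocapra americana Ord, 1815",
    "kingdom": "Animalia",
    "phylum": "Chordata",
    "class": "Mammalia",
    "order": "Artiodactyla",
    "family": "Antilocapridae",
    "genus": "Antilocapra",
    "taxonConceptID": "",  # no identifier available in this example
    "nameAccordingTo": "Hypothetical regional checklist, 2013",  # made-up source
    "namePublishedIn": "",
}


def split_record(rec):
    """Separate the filled-in name fields from the filled-in concept fields."""
    names = {t: rec[t] for t in NAME_TERMS if rec.get(t)}
    concept = {t: rec[t] for t in CONCEPT_TERMS if rec.get(t)}
    return names, concept


names, concept = split_record(record)
print("name fields:", len(names))
print("concept fields:", sorted(concept))
```

A record whose concept part comes back empty carries a name but no statement of whose taxonomy it follows.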
¡ One of the most common issues
¡ Random alteration of one or more characters in a name
¡ Possibilities:
§ Purely accidental
§ Due to lack of knowledge
¡ Tend to appear at the time of digitization
NOISE - MISSPELLINGS
Photo: Barracuda1983
Pipistrellus
Pipistrelus
Pippistrellus
Pipistrella
Pippistrela
…
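Variants like these can often be caught with approximate string matching against a list of trusted names. The sketch below uses `difflib.get_close_matches` from the Python standard library; the reference list (`REFERENCE_GENERA`) and the 0.75 cutoff are assumptions chosen for this example, and any suggestion should still be reviewed before it is applied.

```python
import difflib

# Hypothetical list of trusted genus names.
REFERENCE_GENERA = ["Pipistrellus", "Myotis", "Eptesicus"]


def suggest(name, reference=REFERENCE_GENERA, cutoff=0.75):
    """Return the name itself if trusted, the closest trusted name, or None."""
    if name in reference:
        return name
    matches = difflib.get_close_matches(name, reference, n=1, cutoff=cutoff)
    return matches[0] if matches else None


for variant in ["Pipistrelus", "Pippistrellus", "Pipistrella", "Pippistrela"]:
    print(variant, "->", suggest(variant))
```

A lower cutoff catches more distorted spellings but also raises the risk of matching a different, legitimately similar name.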
¡ Misidentification
§ A more obscure type of error
§ A taxon is wrongly identified
§ The only way of solving it is through close examination by an expert taxonomist
§ Might not be resolvable at all
¡ Emptiness
§ Seriousness depends on missing level/s
§ Importance decreases as taxonomic rank increases
§ Scientific name missing?
§ Special cases: homonymies, synonymies…
NOISE – MISIDENTIFICATIONS & EMPTINESS
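Emptiness is easy to detect automatically. As a rough sketch, assuming records are dictionaries keyed by rank (the field names in `RANKS` are an assumption), the function below lists the empty levels with the most serious gaps, the lowest ranks, first.

```python
# Ranks from highest to lowest; field names are assumed for this example.
RANKS = ["kingdom", "phylum", "class", "order", "family", "genus", "scientificName"]


def missing_ranks(record):
    """Return the empty ranks, lowest (most serious) rank first."""
    return [r for r in reversed(RANKS) if not (record.get(r) or "").strip()]


record = {
    "kingdom": "",
    "phylum": "Chordata",
    "class": "Mammalia",
    "order": "Chiroptera",
    "family": "   ",
    "genus": "Pipistrellus",
    "scientificName": "Pipistrellus pipistrellus",
}
print(missing_ranks(record))
```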
¡ Not stating the taxonomy used
§ Can have the same effect as having only the scientific name
§ We might complete the hierarchy, but how reliable would it be?
§ Provide the taxonomy employed (taxonomic concept)
§ Use identification qualifiers: “Sensu Otegui, 2013”, or “Sensu Biologia Centrali Americana”
¡ Synonymies and homonymies
§ Again, background information (metadata, taxonomic concept) needed
§ Use of identification qualifiers
¡ Instability of taxonomic identifications
¡ Background information greatly helps
¡ Keeping records of the source of changes also helps
NOISE – NATURE OF TAXONOMY
¡ Aims of taxonomic assessments
§ Correct issues
§ Reconcile taxonomies
§ Complete hierarchies
¡ Basic general process – controlled name list
§ Take a name
§ Check if it exists in a reliable list of names
§ Extract related information
§ Apply to our dataset
ASSESSMENTS
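The four steps of the basic process can be sketched as follows. This is only an illustration: `REFERENCE` stands in for a reliable list of names, its single entry is made up for the example, and the `assessment` field is a hypothetical flag, not a standard term.

```python
# Hypothetical controlled name list: name -> related higher taxa.
REFERENCE = {
    "Antilocapra americana": {
        "class": "Mammalia",
        "order": "Artiodactyla",
        "family": "Antilocapridae",
    },
}


def assess(record, reference=REFERENCE):
    """Check one record against the list and complete its hierarchy."""
    name = record.get("scientificName", "").strip()  # 1. take a name
    entry = reference.get(name)                      # 2. check the list
    updated = dict(record)
    if entry is None:
        updated["assessment"] = "not found"
        return updated
    for rank, value in entry.items():                # 3. extract information
        if not updated.get(rank):                    # 4. apply, filling gaps only
            updated[rank] = value
    updated["assessment"] = "matched"
    return updated


print(assess({"scientificName": "Antilocapra americana", "family": ""}))
print(assess({"scientificName": "Antilocapra sp."}))
```

Only empty fields are filled here, so values already present in the dataset are never overwritten silently; whether to overwrite conflicting values is a separate decision.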
¡ General Databases
§ Ideally, global high-quality information
§ Not complete
§ Rely on taxon-specific sources and their completeness
¡ Thematic databases and regional checklists
§ If our collection is taxon-specific or location-specific
§ Gather all available knowledge on their topic
§ Reliable authoritative sources
¡ Taxonomic Literature
§ Most specific source
§ Very high reliability
§ Hard to retrieve relevant literature
§ Some processing needed
ASSESSMENTS – SOURCES OF DATA
¡ Free of misspellings
§ Ab initio, or reduced to the minimum
§ Some tools (Refine, Excel processing…) help accomplish this
§ Taxonomic reconciliation depends on this requirement
¡ Completeness
§ At least to a certain point
§ The minimum is the scientific name
§ But the scientific name alone might not be enough
¡ Helpful metadata
§ Not related to the organism, but to the process of identification
§ The person who identified, taxonomic classification
ASSESSMENTS - REQUIREMENTS
¡ Manual
§ Removing inconsistencies, updating wrong information
§ Taxonomy is an interpretation of explicit and implicit knowledge
§ Explicit knowledge – records
§ Implicit knowledge – human deduction
§ Machines are not good at interpreting implicit knowledge
§ Prone to errors. Automated approach recommended
¡ Automatic
§ Large amounts of data
§ Repetitive tasks
§ Removal of misspellings, checking against source, update
§ Only explicit knowledge. Explicit metadata mandatory
ASSESSMENTS - METHODS
ASSESSMENTS - SEQUENCE
¡ After cleaning, validate output
¡ Check:
§ The data that has been corrected
§ The data that could not be corrected
§ The data that might have gotten worse
¡ Taxonomic validation:
§ Expertise
§ Mixture of explicit and implicit knowledge
§ Not completely automatable
¡ If assessments fail:
§ Our data – Document and report reliability
§ Distributed data – Flag and report
VALIDATION
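The three checks can be approximated by comparing each name before and after cleaning against a trusted list. The sketch below is an assumption-laden illustration: membership in `reference` is used as a stand-in for "valid", which covers only explicit knowledge, so expert review is still needed for the rest.

```python
from collections import Counter


def classify(original, cleaned, reference):
    """Label what cleaning did to one name, judged by list membership only."""
    was_valid = original in reference
    is_valid = cleaned in reference
    if not was_valid and is_valid:
        return "corrected"
    if not was_valid and not is_valid:
        return "not corrected"
    if was_valid and not is_valid:
        return "worsened"
    return "still valid"


reference = {"Pipistrellus", "Myotis"}  # hypothetical trusted list
pairs = [
    ("Pipistrelus", "Pipistrellus"),  # misspelling fixed
    ("Pipstrelo", "Pipstrelo"),       # could not be fixed
    ("Myotis", "Miotis"),             # cleaning damaged a good name
    ("Myotis", "Myotis"),             # untouched
]
summary = Counter(classify(o, c, reference) for o, c in pairs)
print(summary)
```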