BiOnym: A flexible workflow approach to taxon name matching
Edward Vanden Berghe (VUB), Nicolas Bailly (WorldFish), Caselyn Aldemita (FIN), Fabio Fiorellato (FAO), Gianpaolo Coro (CNR), Anton Ellenbroek (FAO), Pasquale Pagano (CNR)
TDWG Annual Conference 2013, Firenze, Italy 31st October 2013
Improving the current matchers
• Offer several Taxonomic Authority Files as references to match against
• Make control of the matching workflow flexible and customizable (e.g., selecting the sequence of matching methods)
• Give full control to advanced users, while keeping a set of default/standard workflows for basic users
BiOnym approach
• There is no one size that fits all!
• Some applications are ‘fault intolerant’
– E.g. compilation of authority lists
– Have to minimise ‘false positives’, at the expense of less automation
• Others are less sensitive to mistakes
– E.g. synonymy expansion in a biogeographic query: find distribution records of a single species under different names or spelling variations
• Will require different choices
A flexible workflow for taxon name matching: BiOnym
[Workflow diagram] Raw input string (e.g. Gadis morua Lineus 1759) → Parsing and pre-processing → Taxon Matcher 1 → Taxon Matcher 2 → … → Taxon Matcher n → Post-processing → Matching name (e.g. Gadus morhua (Linnaeus, 1758))
• Reference sources: ASFIS, FishBase, WoRMS, … (any in DwC-A)
• Matchers: GSAy (new); lexical distances (Levenshtein, Soundex, Trigrams)
• Workflows: BiOnym (new; user control), emulation of Taxamatch, YASMEEN (new)
Developed in iMarine infrastructure
• iMarine (D4Science): e-infrastructure
• VREs: Virtual Research Environments exploiting data and tools in the infrastructure …
[Diagram: Infrastructure, VREs, gCube]
iMarine … and outside: iMarine Data Bonanza
• OBIS, WoRMS, WoRDS, GBIF, CoL, ITIS, IRMNG, NCBI, MyOcean, WOA, EuroStat, Data.FAO, …
• iMarine Registries
• Validation, Enriching, Processing, Sharing
iMarine: Storage and Computing as Service
• Private Cloud, Commercial Cloud
• Virtual Workspace; Relational Databases; Large and Active data storage; Spatial Database
• Scalability and high availability
• Across sites
• ISO 19115/19139 Metadata
• Catalogue
• Open source RDBMS
• Up to 1 TB data
• Secure
• Fault-tolerant
• Replication
• 45 TB currently used; 330 CPU cores currently allocated
Statistical Manager: Resources and Sharing
BiOnym: Outline
• Components
– Taxonomic Authority Files
– Matchers
– Pre- and post-processing: parsers, synonym expansion, taxon resolution, performance statistics
• Development frameworks
– For Matchers
– For Workflows (= sequence of Matchers)
• Experiments
– Results
• Conclusions
Available Taxonomic Authority Files
• CoL: Catalogue of Life
• NCBI: National Center for Biotechnology Information
• IRMNG: Interim Register of Marine and Non-marine Genera
• ITIS: Integrated Taxonomic Information System
• WoRMS: World Register of Marine Species
• ASFIS: List of Species for Fishery Statistics Purposes; for commercial aquatic species
• FishBase (+info from CofF: Catalog of Fishes): for finfishes
Pre-Processing: name format standard
• Split names into atomic components (genus, species, authority, author, year) if necessary (Dima Mozzherin’s parser)
• Align variations in complementary words: var./v., aff., conf./cf., comma in authority, etc.
• Customize character/string substitutions
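The alignment and substitution steps above could be sketched as follows. This is only an illustration: the substitution table, the choice of canonical forms (`var.`, `cf.`) and the function name `normalize` are assumptions, not the BiOnym implementation, and the splitting into atomic components is left to a dedicated parser.

```python
import re

# Hypothetical, user-editable substitution table: (pattern, canonical form).
# Which variant is treated as canonical is an assumption made for this sketch.
SUBSTITUTIONS = [
    (r"\bv\.", "var."),    # v.    -> var.
    (r"\bconf\.", "cf."),  # conf. -> cf.
]

def normalize(raw: str) -> str:
    """Collapse whitespace, align complementary words, add the authority comma."""
    s = " ".join(raw.split())
    for pattern, replacement in SUBSTITUTIONS:
        s = re.sub(pattern, replacement, s)
    # "Linnaeus 1758" -> "Linnaeus, 1758"; strings that already have a comma are untouched.
    s = re.sub(r"(?<=[A-Za-z.])\s+(\d{4})\b", r", \1", s)
    return s

print(normalize("Gadus  morhua Linnaeus 1758"))
```

Keeping the substitutions in a plain list mirrors the idea that users can customize them without touching the matching code.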
Matchers: principle
• Input:
– Standard formatted file of names
– Customized parameters (e.g., thresholds for distances)
• Character substitutions
– E.g. dropping gender suffix
– E.g. fuzzy matching of Tony Rees
• A unique algorithm (e.g., one lexical distance):
– Using the customized parameters
• Output: A set of names with matching rate
– One subset being considered as matched
– One subset considered as non-matching
• The output of a matcher can be used as the input of another one
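A minimal sketch of this contract, assuming a matcher is a function that takes names, a reference list, a transformation, a similarity function and a threshold. The names `run_matcher` and `ratio` are hypothetical, and Python's `difflib` ratio stands in for whatever distance a real matcher would use.

```python
from difflib import SequenceMatcher

def ratio(a: str, b: str) -> float:
    """Stand-in similarity in [0, 1]; not one of the BiOnym distances."""
    return SequenceMatcher(None, a, b).ratio()

def run_matcher(names, reference, transform=str.lower, similarity=ratio, threshold=0.8):
    """Split `names` into matched (with best reference and rate) and non-matching."""
    matched, non_matching = {}, []
    for name in names:
        # Apply the character substitution/transformation to both sides first.
        scored = [(similarity(transform(name), transform(ref)), ref) for ref in reference]
        rate, best = max(scored)
        if rate >= threshold:
            matched[name] = (best, rate)
        else:
            non_matching.append(name)
    return matched, non_matching

matched, rest = run_matcher(["Gadis morua", "Xyz abc"], ["Gadus morhua", "Auxis rochei"])
print(matched, rest)
```

The non-matching list has the same shape as the input, which is what lets one matcher's output feed the next.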
Matchers: the built-in matchers
• Lexical dist.: the minimum number of single-character edits (insertion, deletion, substitution) required to change one word into the other
• Soundex-like dist.: an algorithm relying on an encoding of how phonemes are pronounced in English. Our variant does not compress phonetic information
• Trigrams / N-grams dist.: a similarity measure between sequences of letter triplets (a trigram representation) extracted from the input strings
• One domain-knowledge based matcher (GSAy) … to be applied first in the context of Systematics
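Two of these distances can be illustrated compactly. The edit distance below is the textbook Levenshtein recurrence; the trigram similarity is a sketch that assumes a set-based (Jaccard) measure with padding loosely modelled on PostgreSQL's pg_trgm, which may differ from the measure used in BiOnym.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution (free if equal)
        prev = cur
    return prev[-1]

def trigrams(s: str) -> set:
    """Set of letter triplets, with two leading and one trailing space as padding."""
    padded = f"  {s.lower()} "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}

def trigram_similarity(a: str, b: str) -> float:
    """Shared trigrams divided by all distinct trigrams (assumed Jaccard form)."""
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb)

print(levenshtein("morua", "morhua"), trigram_similarity("Gadus", "Gadis"))
```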
Matchers: GSAy process (1)
Step    Score   Issue
GSAy    100     Complete match
GSAY    97      Parentheses issue
GSrAy   94      Gender agreement issues
GSrAY   91      Gender agreement and parentheses
GSA     88      Year issues
GSrA    85      Year and gender agreement issues
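The first six GSAy steps can be sketched as a small classifier. This is an interpretation of the step codes (G = genus, S = species, Sr = species root, A = author, y/Y = year without/with a parentheses mismatch), not the authors' code. The gender-suffix list, the `root` helper and the `Name` structure are hypothetical.

```python
from typing import NamedTuple, Optional

class Name(NamedTuple):
    genus: str
    species: str
    author: str
    year: str
    parens: bool

def parse(genus: str, species: str, authority: str) -> Name:
    """Split an authority such as '(Linnaeus, 1758)' into author, year, parentheses flag."""
    parens = authority.startswith("(") and authority.endswith(")")
    author, _, year = authority.strip("()").rpartition(",")
    return Name(genus, species, author.strip(), year.strip(), parens)

GENDER_SUFFIXES = ("us", "um", "is", "a", "e")  # assumed list, for illustration only

def root(epithet: str) -> str:
    """Drop a gender suffix so that e.g. 'morhua' and 'morhuus' share a root."""
    for suffix in GENDER_SUFFIXES:
        if epithet.endswith(suffix) and len(epithet) > len(suffix) + 2:
            return epithet[:-len(suffix)]
    return epithet

SCORES = {"GSAy": 100, "GSAY": 97, "GSrAy": 94, "GSrAY": 91, "GSA": 88, "GSrA": 85}

def gsay_step(query: Name, ref: Name) -> Optional[str]:
    """Return the step code for the six steps above, or None if a later step is needed."""
    if query.genus != ref.genus or query.author != ref.author:
        return None
    if query.species == ref.species:
        s = "S"
    elif root(query.species) == root(ref.species):
        s = "Sr"
    else:
        return None
    if query.year != ref.year:
        return f"G{s}A"
    return f"G{s}A" + ("y" if query.parens == ref.parens else "Y")

ref = parse("Gadus", "morhua", "Linnaeus, 1758")
step = gsay_step(parse("Gadus", "morhuus", "(Linnaeus, 1758)"), ref)
print(step, SCORES[step])
```

Names that return `None` here would fall through to the later steps (author, genus and species issues) or to other matchers.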
Matchers: GSAy process (2)
• Steps: GSY, GSrY, GS, GSr, SAy, SrAY
• Issues covered: author issues (misspelled or wrong); author and year issues, homonyms; genus issues, other combinations
Step    Rate
GSAy    82
GSAY    79
GSrAy   76
GSrAY   73
GSA     70
GSrA    67
Matchers: GSAy process (3)
Step    Rate
SrAy    64
SrAY    61
GAy     35
GAY     32
• SrAy, SrAY: genus issues, other combinations
• GAy, GAY: species misspellings … but also … species described in same genus by same author in same paper
• Rates drop from 61 (SrAY) to 35 (GAy)
• Output: matched names; non-matching names go to other matchers
Matcher: GSAy examples
Genus   Species   Authority          Step    Rate
Gadus   morhua    Linnaeus, 1758     GSAy    100
Gadus   morhua    (Linnaeus, 1758)   GSAY    97
Gadus   morhuus   Linnaeus, 1758     GSrAy   94
Gadus   morhuus   (Linnaeus, 1758)   GSrAY   91
Gadus   morhua    Linnaeus, 1759     GSA     88
Gadus   morhuus   (Linnaeus, 1759)   GSrA    85
Gadus   morhua    Lineus, 1758       GSY     82
Gadus   morhuus   Lineus, 1758       GSrY    79
Gadus   morhua    Lineus, 1759       GS      76
Gadus   morhuus   Lineus, 1759       GSr     73
Gadis   morhua    Linnaeus, 1758     SAy     70
Gadis   morhuus   (Linnaeus, 1758)   SrAy    67
Gadis   morhua    Linnaeus, 1758     SAY     64
Gadis   morhuus   (Linnaeus, 1758)   SrAY    61
Gadus   morthua   Linnaeus, 1758     GAy     35
Gadus   morthua   (Linnaeus, 1758)   GAY     32
Workflow development framework
• Builds flexible workflows for name matching
• A Java framework based on the gCube system (http://www.gcube-system.org/)
• Makes it possible to exploit Cloud Computing facilities
• Provides Java interfaces for building string pre-processing, parsing and post-processing steps
• Allows character substitutions to be defined
• Allows new Matchers to be added as plug-ins
Workflow development framework
• A series of operators acting as switches:
– First apply ‘transformation’ (e.g. character substitution)
– Then calculate distance between all possible pairs of names
• Each switch decides whether a pair of names should be considered as ‘matches’, and splits the input list into:
– ‘matched’ names
– ‘non-matching’ names
• Parameters in each switch are customizable
Matcher framework: YASMEEN (FAO)
• Yet Another Species Matching Execution ENvironment
• Based on COMET – COncept Matching Engine and Tools
• YASMEEN: a set of data models, formats and tools to perform species matching identification
• Multiple matchlets, each dealing with a specific attribute of the species data model (genus, species, author etc.)
• New matchlets can be designed (just a few lines of code) and plugged in
• Reference data in DwC-A format
• Full support for distributed computation (split IN & REF data / join results)
Workflow builder: YASMEEN (FAO)
• Used as a matcher in BiOnym workflow
• But can work as a standalone specific workflow
• When used as standalone:
– Lexical matchlets' scores can be computed with a combination of different strategies (Levenshtein distance, Soundex similarity, N-grams similarity)
– Overall matching score for an input / reference data pair is a weighted combination of the triggered matchlets' scores
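One plausible reading of "weighted combination" is a weighted mean over the matchlets that were triggered. The sketch below assumes that form; the exact YASMEEN formula is not given in these slides, and the function name `combined_score` is hypothetical. The weights reuse the values quoted in the experiments (genus 75, species 100, author 50, year 25).

```python
# Weights as quoted in the YASMEEN experiment; the combination rule is an assumption.
WEIGHTS = {"genus": 75, "species": 100, "author": 50, "year": 25}

def combined_score(matchlet_scores: dict, weights: dict = WEIGHTS) -> float:
    """Weighted mean of the scores of the matchlets that were triggered (present)."""
    triggered = [name for name in matchlet_scores if name in weights]
    total_weight = sum(weights[name] for name in triggered)
    if total_weight == 0:
        return 0.0
    weighted_sum = sum(matchlet_scores[name] * weights[name] for name in triggered)
    return weighted_sum / total_weight

# Only the genus and species matchlets triggered for this pair.
print(combined_score({"genus": 0.5, "species": 1.0}))
```

Normalising by the weights of the triggered matchlets only means a missing author or year does not drag the score down; whether YASMEEN behaves this way is not stated in the source.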
Workflow: BiOnym
• Workflow consists of three parts
– Preprocessing (including possibly parsing)
– Chain of matchers; output from one is input to the next
– Postprocessing
  • E.g. present ‘ambiguous’ matches to end user
  • E.g. calculate performance statistics
• Chain of matchers
– Most restrictive first
– Those based on domain knowledge first
– Test names matched in one step are not passed on to next matcher
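The chaining rule (most restrictive first, matched names drop out) can be sketched as a loop over matchers. The two matchers here, exact equality and a `difflib` ratio, are placeholders chosen for the illustration, and `run_chain` is a hypothetical name.

```python
from difflib import SequenceMatcher

def exact(a: str, b: str) -> float:
    return 1.0 if a == b else 0.0

def fuzzy(a: str, b: str) -> float:
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def run_chain(names, reference, chain):
    """Apply (label, similarity, threshold) matchers in order; pass on only unmatched names."""
    results, remaining = {}, list(names)
    for label, similarity, threshold in chain:
        still_unmatched = []
        for name in remaining:
            best = max(reference, key=lambda ref: similarity(name, ref))
            if similarity(name, best) >= threshold:
                results[name] = (best, label)   # record which matcher resolved the name
            else:
                still_unmatched.append(name)
        remaining = still_unmatched
    return results, remaining

chain = [("exact", exact, 1.0), ("fuzzy", fuzzy, 0.85)]  # most restrictive first
results, unmatched = run_chain(
    ["Gadus morhua", "Auxis rochie", "Foo bar"],
    ["Gadus morhua", "Auxis rochei"],
    chain,
)
print(results, unmatched)
```

Names still unmatched at the end are what post-processing would present to the end user as unresolved.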
FAO use case
• Scientific names
– Austroglossus spp
– Austropotamobius pallipes
– Auxis rochei
– Auxis thazard
– Auxis thazard, A. rochei
– Bagrus spp
– Bothidae
– Corallium sp. nov.
– Ex Mollusca
– Ex Pinctada spp
• Common names
– Abalones nei
– Aesop shrimp
– African forktail snapper
– African lungfishes
– African moonfish
– African sicklefish
– Aka coral
– Akiami paste shrimp
– Alaska plaice
– Alaska pollock(=Walleye poll.)
Workflow: Emulation of Tony Rees’ Taxamatch
• Normalization of the species name to its root, disregarding gender endings in the taxon name
• Modified Damerau-Levenshtein Distance algorithm (MDLD): the number of character replacements, deletions or insertions needed to make the two strings identical
• Phonetic algorithm (e.g. Soundex)
• Authority matching, which detects similarity in substrings
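As a point of comparison with the plain edit distance, the sketch below implements the standard restricted Damerau-Levenshtein (optimal string alignment) distance, which also counts a swap of two adjacent characters as one edit. It is not Taxamatch's MDLD; it only illustrates the family of distances that MDLD modifies.

```python
def osa_distance(a: str, b: str) -> int:
    """Optimal string alignment distance: edits plus adjacent transpositions."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # adjacent transposition
    return d[-1][-1]

# "rochie" vs "rochei" is a single swap, so one edit instead of two.
print(osa_distance("rochie", "rochei"))
```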
Post-processing
• Modalities governing how the results of the matching process are used/presented to the end-user
• Will depend on the needs of the end user
• Examples:
– Synonymy expansion of queries in a biogeographical system
– Reconciliation of check lists from different sources, for same area and taxon
– Presenting end-user with ambiguous matches
Experiments: an R implementation
• Experimental system implemented in R and PostgreSQL
– R is a thin wrapper around PostgreSQL statements
– SQL used for the heavy lifting
  • Makes use of trigram indexes, for example
• Tool for communication and prototyping
• Developing tools to analyse performance
– Generate confusion matrix…
– For identical test sets, different workflows
  • Quantitatively compare sets of options and/or matchers
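The performance tally can be illustrated in a few lines. The experimental system itself is in R and PostgreSQL; this Python sketch only shows the counting logic, with the three categories taken from the effectiveness graph (true hits, false hits, non hits). The function name `tally` and the sample data are hypothetical.

```python
from collections import Counter

def tally(results: dict, gold: dict) -> Counter:
    """Count true hits, false hits and non hits of a workflow against a gold standard."""
    counts = Counter()
    for name, expected in gold.items():
        got = results.get(name)          # None means the workflow returned no match
        if got is None:
            counts["non_hit"] += 1
        elif got == expected:
            counts["true_hit"] += 1
        else:
            counts["false_hit"] += 1
    return counts

gold = {"Gadis morua": "Gadus morhua", "Auxis rochie": "Auxis rochei", "Bagrus sp": "Bagrus"}
workflow_a = {"Gadis morua": "Gadus morhua", "Auxis rochie": "Auxis thazard"}
print(tally(workflow_a, gold))
```

Running the same gold standard through several workflows and comparing the counters gives the quantitative comparison described above.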
Effectiveness
[Figure: example graph comparing performance of different settings (generated with R); series: true hits, false hits, non hits]
Results of experiments (YASMEEN)
• Genus / species misspellings:
– IN: Lacheepenseer Perthseecous → REF: Acipenser persicus (Borodin, 1897)
– Scientific name matchlet, using Levenshtein similarity → 61.5%
• No separation between genus / species, relevant misspelling:
– IN: acipnesreppeerseekoos → REF: Acipenser persicus (Borodin, 1897)
– Scientific name matchlet, using Levenshtein similarity → 47.6%
• Inverted genus / species:
– IN: Platorhynchus Scaphirhincus → REF: Scaphirhincus platorhynchus (Rafinesque, 1820)
– Scientific name matchlet, using n-grams similarity → Score: 100.0%
• Relevant misspellings (resolved with support from authorities data):
– IN: Casphinhi Platynchurs (Rafinesk, 1820) → REF: Scaphirhynchus platorynchus (Rafinesque, 1820)
– Genus matchlet (wgt: 75), Species matchlet (wgt: 100), Author name matchlet (wgt: 50), Year matchlet (wgt: 25), using Levenshtein similarity (wgt: 100) and Soundex similarity (wgt: 50) → 58.2%
Experiments: a simple interface
Efficiency
• Applied the first version of the BiOnym workflow (1000 species names)
• First run: only one Worker node (~ 1 CPU)
• Second run: Cloud Computing facilities assigned by iMarine e-Infrastructure (computation distributed over 19 Worker nodes)
• Result: Time reduction 76.7%
• This means that the workflow can also be used in interactive systems (no need for batch processing)
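The reported reduction can be turned into a speedup figure with a little arithmetic. This assumes "time reduction" means (T1 - T19) / T1, where T1 is the single-node run time and T19 the run time on 19 nodes; the slides do not state the definition.

```python
def speedup_from_reduction(reduction: float) -> float:
    """Speedup T1 / Tn implied by a relative time reduction (T1 - Tn) / T1."""
    return 1.0 / (1.0 - reduction)

speedup = speedup_from_reduction(0.767)   # about 4.3x faster than the single-node run
efficiency = speedup / 19                 # fraction of ideal 19x scaling, about 0.23
print(round(speedup, 2), round(efficiency, 2))
```

Under that assumption the distributed run is roughly 4.3 times faster, well short of linear scaling over 19 nodes, which is common when part of the workload is not parallelised.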
Results of experiments: Matchers
• Search Term:
– Rhincodon typu Linneaus, 1758
• Output/s: Using GSAy
– Rhincodon typus Smith, 1828 → Score is 73%
• Using taxamatch:
– Rhineodon typus Smith, 1828
– Rhiniodon typus (Smith, 1828)
– Rhinodon typicus Müller & Henle, 1839
– Rhinodon typicus Smith, 1845
Conclusions: Results
• Workflows
– Building of pre-set (default)
– On the fly setting
– Integrating taxonomic/nomenclature knowledge (GSAy)
• Making the best of previous matchers (Taxamatch and its various subsequent implementations) and other technologies (uBio/GNA/GBIF parser)
• Effectiveness and efficiency increased in the iMarine e-infrastructure
Conclusions
• Plans: interface (Nov.), tests (Dec.), open (Jan.)
• Other Taxonomic Authority Files
– FADA (BioFresh) / PESI / …
• Name reconciliation
• Beyond scientific names
– Common names / Vessels / …
• Integration of new matchers as matching methods are developed
Future added value? Storage of knowledge
• Make available the matches between raw and published names (and current valid names)
• Self-learning system
• Build a community of practice (CoP), not alone …
– GNA, BioVel
Collaborative development
Special Thanks
• Tony Rees (CSIRO)
• Dmitry (Dima) Mozzherin (GNA project)
Authors
• Edward Vanden Berghe, Vrije Universiteit Brussel (VUB), Brussels, Belgium
• Nicolas Bailly, WorldFish, and FishBase Information and Research Group (FIN), Los Baños, Philippines
• Caselyn Aldemita, FishBase Information and Research Group (FIN), Los Baños, Philippines
• Fabio Fiorellato & Anton Ellenbroek, Fisheries Statistics and Information (FIPS), FAO, Rome, Italy
• Gianpaolo Coro & Pasquale Pagano, Istituto di Scienza e Tecnologie dell'Informazione A. Faedo (ISTI), CNR, Pisa, Italy