5

Click here to load reader

Syst biol 2012-burguiere-sysbio sys069

Embed Size (px)

Citation preview

Page 1: Syst biol 2012-burguiere-sysbio sys069

Copyedited by: GS MANUSCRIPT CATEGORY: Article

[13:42 7/9/2012 Sysbio-sys069.tex] Page: 1 1–5

Software for Systematics and Evolution

Syst. Biol. 0(0):1–5, 2012© The Author(s) 2012. Published by Oxford University Press, on behalf of the Society of Systematic Biologists. All rights reserved.For Permissions, please email: [email protected]:10.1093/sysbio/sys069

IKey+: A New Single-Access Key Generation Web Service

THOMAS BURGUIERE∗, FLORIAN CAUSSE, VISOTHEARY UNG, AND RÉGINE VIGNES-LEBBE

Laboratoire Informatique et Systématique, Museum National d’Histoire Naturelle, Département Histoire de la Terre,UMR 7207 – C.N.R.S. - M.N.H.N. - U.P.M.C., Paris, France

∗Correspondence to be sent to: Thomas Burguiere, Laboratoire Informatique et Systématique, M.N.H.N Département Histoire de la Terre,Bâtiment de Géologie, CP48, 57 rue Cuvier, 75005 Paris, France,

E-mail: [email protected] Burguiere and Florian Causse contributed equally to this article.

Received 2 April 2012; reviews returned 29 June 2012; accepted 25 July 2012Associate Editor: David Posada

Abstract.—Single-access keys are a major tool for biologists who need to identify specimens. The construction process of thesekeys is particularly complex (especially if the input data set is large) so having an automatic single-access key generation toolis essential. As part of the European project ViBRANT, our aim was to develop such a tool as a web service, thus allowingend-users to integrate it directly into their workflow.

IKey+generates single-access keys on demand, for single users or research institutions. It receives user input data (usingthe standard SDD format), accepts several key-generation parameters (affecting the key topology and representation), andsupports several output formats.

IKey+ is freely available (sources and binary packages) at www.identificationkey.fr. Furthermore, it is deployed on ourserver and can be queried (for testing purposes) through a simple web client also available at www.identificationkey.fr(last accessed 13 August 2012). Finally, a client plugin will be integrated to the Scratchpads biodiversity networking tool(scratchpads.eu). [Systematics; taxonomy; single-access key; web service; biodiversity informatics.]

During the last decade, taxonomy has been goingthrough a revolution towards cybertaxonomy (Wheeler2008). This is particularly obvious when looking atthe numerous initiatives dedicated to the managementand monitoring of biodiversity data such as GlobalBiodiversity Information Facility (GBIF), EuropeanDistributed Institute of Taxonomy (EDIT), BIOTA,Catalogue of Life, Encyclopedia of Life (EOL) or KeyTo Nature, and now ViBRANT. A significant increaseof digitized information should be expected in a nearfuture, and the deep social and scientific impact of web2.0 stimulates sharing of digital data and proposalsof cybertaxonomy projects to study global biodiversityand climate changes. In the current context ofdistributed knowledge and long-distance collaborations,the European project ViBRANT (http://vbrant.eu, lastaccessed 13 August 2012) has emerged to bridge theexisting gap between tools and services, providing apowerful exemplar platform, the Scratchpads, for theentire community.

ViBRANT is about connecting the people, data and science ofbiodiversity (ViBRANT - Objectives)

As studying and monitoring biodiversity remainsa strong challenge for biologists, descriptive datamanagement and identification tools are needed to assistthem in their daily work. For instance, we developed acomplete descriptive data management tool: Xper2 (Unget al. 2010).

Our contribution to ViBRANT is the implementationof an identification key generation web service, becausewe also have a significant experience in developing keygeneration tools such as MyKey (Gérard and Vignes-Lebbe 2010). The most common identification keys(described in Hagedorn et al. 2010) are the printed

single-access keys published in almost all taxonomicrevisionary work. But these printed keys may be difficultto use while collecting samples in the field or whenconsulting incomplete specimens. Hence, electronicsingle-access keys, which can be quickly generatedwith different parameters in order to bypass potentiallyproblematic steps (i.e., steps with characters for whichthe specimen studied in the field cannot provideinformation) are a good alternative.

APPROACH

Web ServiceThere are currently several tools designed to generate

single-access keys such as, among others, the Deltasoftware suite (Dallwitz et al. 1993), Pankey (Pankhurst1988), and MyKey. These tools are well established andproduce valid identification keys, but are not withoutdrawbacks. For instance, both Delta and Pankey need tobe installed on the end-user’s machine and are not cross-platform compatible (Windows, Apple, and Linux).Mykey is a web application, which makes it platform-agnostic, but the end-users cannot submit their own dataset to generate a key (the data set has to be manuallyuploaded by someone in our laboratory). Furthermore,these 3 tools do not use the biodiversity descriptionstandard file format, Structured Descriptive Data (SDD)(Hagedorn 2006), for data input.

Our objective, within the ViBRANT project, was todesign a free and open-source key generation serviceusing the SDD file format that can be used byindependent users or through the ViBRANT OBOEservice (also available through Scratchpads). It wasthus decided to develop a web service (implementingboth the SOAP and REST web service communication

1

Systematic Biology Advance Access published September 16, 2012 at Freie U

niversitaet Berlin on Septem

ber 18, 2012http://sysbio.oxfordjournals.org/

Dow

nloaded from

Page 2: Syst biol 2012-burguiere-sysbio sys069

Copyedited by: GS MANUSCRIPT CATEGORY: Article

[13:42 7/9/2012 Sysbio-sys069.tex] Page: 2 1–5

2 SYSTEMATIC BIOLOGY

FIGURE 1. IKey+use cases.

protocols), which is particularly suitable for these twouse cases (cf. Fig. 1) and allows the client to build complexworkflows, with a high level of automation.

ArchitectureBecause the main purpose of ViBRANT is to provide

a platform that is both open source and easily reusable,we organized IKey+ in two parts:

• an Application Programming Interface (API)consisting of 3 distinct modules; the Data Model,that is, the computer representation of thedescriptive data, the Input-Output (IO) module,which parses the SDD input files and loads the datain the Model and generates the output files, andthe Algorithm, which uses the data contained inthe Model to generate a key and returns it throughthe IO module.

• a Simple Object Access Protocol (SOAP) or aREpresentational State Transfer (REST) ServiceLayer encapsulating the API, which manages thecommunication between the client and the API.

This allows potential developers to integrate the serviceinto their workflows and adapt it to their specific needs(e.g., integrating the API in a standalone software). Thewhole application was developed using the J2EE (Java 2,Enterprise Edition) programming environment.

Web Service Input ParametersIKey+ has several parameters available to the end-

user. These parameters allow the end-user to change thetopology of the key (e.g., pruning the key, promotingsome characters), change the visual aspect of the key, orthe output format. A complete list of these parameterscan be found in the user documentation, available here:http://www.identificationkey.fr/resources/docs/identificationKeyGeneratorWS_UserGuide.pdf (last accessed13 August 2012).

AlgorithmThe single-access key generation algorithm is a

recursive depth-first graph construction algorithm. Its

arguments are a list of taxa, taxaList and the listof character considered, charList (cf. SupplementaryMaterials, appendices 1, 2 and 3, available at http://datadryad.org, doi:10.5061/dryad.3ft19). The resultinggraph (cf. Fig. 2) is a directed acyclic graph, consistingof non-terminal nodes labelled with a character, edgeslabelled with the states of the character of the previousnode, and terminal nodes labelled with taxa. At eachstep of the algorithm, the best character among thoseavailable is selected by the BEST_CHAR function, whichiterates over the available characters and returns thecharacter with the greatest discriminant power.

Discriminant Power of a CharacterDuring the last 50 years, estimating a character’s

discriminant power has been the central point of key-construction algorithms. Many measurement methodshave been suggested, and a review of these methods canbe found in Gower and Payne (1975), Pankhurst (1991),and Delgado-Calvo-Flores et al. (2006). In a statisticalcontext, examples include the Bayesian probability andgeneralized entropy, such as the Shannon entropy usedin the ID3 algorithm developed by Quinlan (1986) or theGini index used in Breiman et al. (1984). In a contextwhere there is no probability associated with the statesof the characters, one can use the separation coefficientor the variance as measurement methods of a character’sdiscriminant power.

In IKey+, the discriminant power of a given characterC is calculated by the DPOWER function and isan estimate of C’s ability to differentiate the taxaof the current list. DPOWER is an extension of theGyllenberg separation factor (Gyllenberg 1963) andis a measurement of the number of pairs of taxadiscriminated by C, which is a generalization of thevariance. Indeed, the variance of a variable can becomputed by comparing all pairs of values (here,comparing all pairs of taxa description for C). Withdifferent comparison functions, each adapted to a certaintype of character (e.g., a categorical or a numericalcharacter), it is possible to obtain a generalized formulato estimate the discriminant power of a given character,even with polymorphic characters or characters withmissing data. For categorical characters, DPOWER canuse a binary function (boolean comparison) or other

at Freie Universitaet B

erlin on September 18, 2012

http://sysbio.oxfordjournals.org/D

ownloaded from

Page 3: Syst biol 2012-burguiere-sysbio sys069

Copyedited by: GS MANUSCRIPT CATEGORY: Article

[13:42 7/9/2012 Sysbio-sys069.tex] Page: 3 1–5

2012 BURGUIERE ET AL.—Ikey+ 3

Character I

Character II

State I-a

Taxon 1

State I-b

Character III

State II-a

Taxon 2

State II-b

Character IV

State II-c OR State II-d

Taxon 3

State III-a

Taxon 4

State III-b

Taxon 5

State IV-a

Taxon 6, Taxon 7

State IV-b

FIGURE 2. Formal identification key example.

functions such as the Sokal and Michener coefficient(Sokal and Michener 1958) or the Jaccard coefficient(Jaccard 1901). For numerical characters (e.g., a sizemeasured in milllimetre), the set of values (i.e., all thevalues entered for C for all taxa) is split in two intervals.In order to determine the threshold value separatingthese two intervals, we consider the list of min and maxvalue of C for each of the remaining taxa. We then choosefrom this list the value that separates the remainingtaxa into two groups of equal size (±1 taxon). Thesetwo intervals are considered as 2 discrete states for thecalculation of the discriminant power.

DPOWER iterates over the available taxa (i.e., thosethat are compatible with the current description), anddetermines, for each pair of taxa Ta and Tb), thedissimilarity (i.e., the possibility of discriminating Ta andTb with C) that is based on the number of common statesof C for Ta and Tb (n11), the number of states of C thatoccur only for Ta ( n10), the number of states of C thatoccur only for Tb (n01), and the number of states of C thatdo not occur for neither Ta or Tb (n00). The discriminantpower is then calculated as

DP=∑

a

b

SCORE(n11,n10,n01,n00).

Three different SCORE measurements are currentlyavailable in IKey+, the Xper coefficient (Ung et al. 2010;Vignes et al. 1989), the Jaccard coefficient, and the Sokaland Michener coefficient.

BENCHMARKS

TestsIn order to assess the performance of our web

service, we conducted a series of tests using thesame input file for every test. This file contains adata set representing the subfamily Cichorieae of theplant family Asteraceae with 303 observable characters,144 taxa, and their descriptions. This data set was

generated and curated as an exemplar group for theEDIT project (Hand et al. 2009) [the file is availablehere: http://www.infosyslab.fr/vibrant/project/test/Cichorieae-fullSDD.xml (last accessed 13 August 2012)].The aim of these tests was to evaluate the overallperformance of the algorithm, to assess the impactof the various parameters available to the end-useron performance, and to ensure that IKey+ would berobust in a production environment. We measured thetime necessary for the web service to respond to 100identification keys generation queries, while countingthe number of rejected queries due to CPU overload(IKey+ rejects any query received when the CPU loadof the host server is greater than 80%). We ran onereference test (described in Supplementary Materials,appendix 4, doi:10.5061/dryad.3ft19), and several testswith variations on the input parameters, the web servicecommunication protocol used or the parallelizationsetups. The complete list of performance test setupsis available in Supplementary Materials, appendix 5,doi:10.5061/dryad.3ft19. We also measured the timenecessary to generate a single key, using the reference testconfiguration described in Supplementary Materials,appendix 4, doi:10.5061/dryad.3ft19. Finally, we testedthe usability of IKey+ from an end-user perspective,when generating the Cichorieae identification key usingthe web interface. We asked 11 persons who were notinvolved in the development of IKey+ to test the webservice and the web interface. They were given theSDD formated Cichorieae data set and were asked touse the web interface to create the identification key.We measured the time needed by each test subject togenerate the identification key.

ResultsThe average length of the paths leading to taxa

that were identified by the algorithm is 4.67 steps,with the shortest path being 1 step long, and thelongest being 10 steps long. The generation of a single

at Freie Universitaet B

erlin on September 18, 2012

http://sysbio.oxfordjournals.org/D

ownloaded from

Page 4: Syst biol 2012-burguiere-sysbio sys069

Copyedited by: GS MANUSCRIPT CATEGORY: Article

[13:42 7/9/2012 Sysbio-sys069.tex] Page: 4 1–5

4 SYSTEMATIC BIOLOGY

key took roughly 1.8 s. The results of the other testsare shown in Supplementary Materials, appendix 6,doi:10.5061/dryad.3ft19. The reference test took roughly50 s to complete, that is, 500 ms per query, which isconsistent with the time measured for the generationof a single key, because the reference test uses 4simultaneous threads to generate the keys. In this test,few queries were rejected due to CPU overload. Amongthe parameters available to the end-user, only the scoremethod parameter had a significant influence: whenusing the Sokal and Michener score method or theJaccard score method, the tests took longer to finish(∼80 s instead of 50 s). The score method parameter hadno impact on the number of rejected queries. Our testsshowed that the communication protocol used (SOAP orREST) had no influence on the performance of IKey+.

Some taxa do not appear in the generated key, due toinsufficient data in the input file. Indeed, the Cichorieaedata set we used for our tests was created by anexternal team. This is because some errors were madeduring creation of the Cichorieae data set: some taxawith unknown data were not specifically marked as“unknown data”, but were left unspecified instead.Regardless, although these ambiguously coded taxa donot appear in the resulting identification key by default,the end-user can choose to have them appear in the key.

When modifying the parallelization of the queries,we observed significant performance variations. As canbe expected, when launching 100 queries sequentially(instead of using 4 simultaneous threads launching 25queries each), the test was 4 times longer, and no querieswere rejected. When we augmented the parallelismof the queries (e.g., 25 or 100 simultaneous queries),more queries were rejected (up to 50%). However, whenlaunching 100 simultaneous queries, with a randomdelay at the beginning of each thread, the results (both intime and number of rejected queries) were comparable tothe performance of the reference test. In the usability test,the average time needed to generate the identificationkey was slightly above 60 s (60.36 s), with the shortesttime measured at 26 s, and the longest time measuredat 121 s. The complete results of the usability testare available in Supplementary Materials, appendix 7,doi:10.5061/dryad.3ft19.

DISCUSSION

IKey+ is available and can be installed on any J2EEapplication server (e.g., Apache Tomcat). It can generatesingle-access keys using a tree or a flat representationin several output formats (HTML, Wiki, SDD, plain-text, etc.).

Our test showed that IKey+performs sufficiently wellto handle a large data set in a relatively short amount oftime and can generate well-optimized key files (averagenumber of steps to identify a taxon: 4.67). This, combinedwith the web service accessibility, makes it possible tointegrate IKey+ in many workflows that might requirea fast and automated key generation process (e.g., a batch

key generation script). Furthermore, an end-user can usethe web interface available at www.identificationkey.frto quickly generate a customized identification key,using the numerous parameters available (affecting thetopology, representation, file format, etc.).

Finally, our tests showed that IKey+ is likely tobe robust in a production environment (i.e., manysimultaneous queries) as it is able to withstandsimultaneous key generation queries (e.g., 4 threadslaunching 25 consecutive queries). It is also protectedfrom cryptic failure, because we implemented a CPU-load-watching mechanism that automatically rejects aquery (with an explicit error message) whenever theCPU load exceeds a given threshold (80%). This preventsa crash of the service, or the generation of incomplete orcorrupt key files.

CONCLUSION

IKey+ is the first key-generation tool available asa web service with standardized input and outputformats. Our test showed that IKey+ is able to generatekeys rapidly and that it can also be used by an end-userwith the web interface. Finally, the modular and open-source nature of IKey+ makes it possible for anyone toreuse its components. For instance, we plan to reuse somecomponents of the API to develop another web servicethat would provide free-access key identification.

LICENSING

As part of the ViBRANT project, IKey+’s sourcecode is freely available and is licensed under the GNUGeneral Public License version 2. It is already availableon our google code SVN repository: http://ikey-plus.googlecode.com/svn/trunk/ (last accessed 13 August2012). It will be actively maintained by our team for thenext 2 years.

SUPPLEMENTARY MATERIAL

Supplementary material, including Algorithms andappendices, can be found at http://www.sysbio.oxfordjournals.org.

FUNDING

This work was supported by the European Unionfunded FP7 ViBRANT Project (Contract number RI-261532, Period, December 2010 to November 2013).

ACKNOWLEDGEMENTS

We sincerely thank Gregor Hagedorn (Julius KühnInstitute, Berlin, Germany) and Andreas Müller(Botanical Garden and Botanical Museum, Berlin,

at Freie Universitaet B

erlin on September 18, 2012

http://sysbio.oxfordjournals.org/D

ownloaded from

Page 5: Syst biol 2012-burguiere-sysbio sys069

Copyedited by: GS MANUSCRIPT CATEGORY: Article

[13:42 7/9/2012 Sysbio-sys069.tex] Page: 5 1–5

2012 BURGUIERE ET AL.—Ikey+ 5

Germany) for sharing their knowledge on the SDDformat. We are also grateful to Dave Roberts (NaturalHistory Museum, London, UK) for reviewing an earlyversion of the article and providing style improvements.

REFERENCES

BIOTA. Available from: URL http://www.edinburgh.ceh.ac.uk/biota/ (last accessed 13 August 2012).

Breiman L., Friedman J.H., Olshen R.A. Stone C.J. 1984. Classificationand regression trees. Belmont, CA:Wadsworth International Group.

Catalogue of Life. Available from: URL http://www.catalogueoflife.org/ (last accessed 13 August 2012).

Dallwitz M.J., Paine T.A., Zurcher E.J. 1993. User’s guide to the deltasystem: a general system for coding taxonomic descriptions. 4thed. Available from: URL http://delta-intkey.com (last accessed13 August 2012).

Delgado-Calvo-Flores M., Fajardo-Contreras W., Gibaja-Galindo E.L.,Perez-Perez R. 2006. Xkey: a tool for the generation of identificationkeys. Expert Syst. Appl. 30:337–351.

EDIT. European Distributed Institute of Taxonomy. Available from:URL http://www.e-taxonomy.eu/ (last accessed 13 August 2012).

EOL. Encyclopaedia of Life. Available from: URL http://eol.org/.GBIF. Global Biodiversity Information Facility. Available from: URL

http://www.gbif.org/ (last accessed 13 August 2012).Gérard D., Vignes-Lebbe R. 2010. Mykey: a server-side software to

create customized decision trees. In: Nimis P.L., Vignes-Lebbe R.,editors. Tools for identifying biodiversity: progress and problems.Edizioni Università di Trieste, Trieste, Italy. p. 107–112.

Gower J.C., Payne R.W. 1975. A comparison of different criteriafor selecting binary tests in diagnostic keys. Biometrika62:665–672.

Gyllenberg H.G. 1963. A general method for deriving determinativeschemes for random collections of microbial isolates. Ann. Acad.Scient. Fenn. Ser. A IV. Biologica 1(69):1–23.

Hagedorn G. 2006. The structured descriptive data (SDD) w3c-xml-schema. Version 1.1 Available from: URL http://wiki.tdwg.org/twiki/bin/view/SDD/Version1dot1 (last accessed 13 August 2012).

Hagedorn G., Rambold G., Martellos S. 2010. Types of identificationkeys. In: Nimis P.L., Vignes-Lebbe R. editors. Tools for identifying

biodiversity: progress and problems. Edizioni Università di Trieste,Trieste, Italy. p. 59–64.

Hand R., Kilian N., Raab-Straube E. 2009. International cichorieaenetwork: Cichorieae portal. Available from: URL http://wp6-cichorieae.e-taxonomy.eu/portal/ (last accessed 13 August 2012).

Jaccard P. 1901. Étude comparative de la distribution florale dans uneportion des alpes et des jura. Bull. Soc. Vaud. Sci. Nat. 37:547–579.

J2EE. Java 2, Enterprise Edition. Available from: URL http://www.oracle.com/technetwork/java/javaee/overview/index.html (lastaccessed 13 August 2012).

KeyToNature. Available from: URL http://www.keytonature.eu/wiki/ (last accessed 13 August 2012).

Pankhurst R.J. 1988. Pankey programs. DELTA Newsletter 1:2.Pankhurst R.J. 1991. Practical taxonomic computing. Cambridge

University Press, Cambridge, UK.Quinlan J.R. 1986. Induction of decision trees. Mach. Learn. 1:81–

106. ISSN 0885-6125. Available from: URL http://dx.doi.org/10.1007/BF00116251 (last accessed 13 August 2012).

REST Architecture. Available from: URL http://www.oracle.com/technetwork/articles/javase/index-137171.html (last accessed13 August 2012).

Scratchpads. Biodiversity Online. Available from: URL http://scratchpads.eu/ (last accessed 13 August 2012).

SOAP. Simple Object Access Protocol, W3C Recommandation. Version1.2. Available from: URL http://www.w3.org/TR/soap/ (lastaccessed 13 August 2012).

Sokal R., Michener C. 1958. A statistical method for evaluatingsystematic relationships. Univ. Kansas Sci. Bull., (38):1409–1438.

Ung V., Dubus G., Zaragüeta-Bagils R., Vignes-Lebbe R. 2010. Xper2:introducing e-taxonomy. Bioinformatics 26(5):703–704.

ViBRANTa. Objectives. Available from: URL http://vbrant.eu/node/1 (last accessed 13 August 2012).

ViBRANTb. Virtual Biodiversity Research and Access Network forTaxonomy. Available from: URL http://vbrant.eu (last accessed13 August 2012).

Vignes R., Lebbe J., Darmoni S. 1989. Symbolic-numeric approachfor biological knowledge representation: a medical example withcreation of identification graphs. In E. Diday, editor, Proceedingsof the conference on Data analysis, learning symbolic and numericknowledge. Nova Science Publishers, Inc. Commack, NY, USA. ISBN0-941743-64-0. p. 389–398.

Wheeler, Q.D., editor 2008. The new taxonomy. CRC Press Inc. NewYork, USA.

at Freie Universitaet B

erlin on September 18, 2012

http://sysbio.oxfordjournals.org/D

ownloaded from