15
RESEARCH ARTICLE Inselect: Automating the Digitization of Natural History Collections Lawrence N. Hudson 1 *, Vladimir Blagoderov 1 , Alice Heaton 1 , Pieter Holtzhausen 2 , Laurence Livermore 1 , Benjamin W. Price 1 , Stéfan van der Walt 2,3 , Vincent S. Smith 1 1 Department of Life Sciences, Natural History Museum, Cromwell Road, London, SW7 5BD, United Kingdom, 2 Division of Applied Mathematics, Stellenbosch University, Stellenbosch 7600, South Africa, 3 Berkeley Institute for Data Science, University of California, Berkeley, CA, United States of America * [email protected] Abstract The worlds natural history collections constitute an enormous evidence base for scientific research on the natural world. To facilitate these studies and improve access to collections, many organisations are embarking on major programmes of digitization. This requires auto- mated approaches to mass-digitization that support rapid imaging of specimens and associ- ated data capture, in order to process the tens of millions of specimens common to most natural history collections. In this paper we present Inselecta modular, easy-to-use, cross-platform suite of open-source software tools that supports the semi-automated pro- cessing of specimen images generated by natural history digitization programmes. The software is made up of a Windows, Mac OS X, and Linux desktop application, together with command-line tools that are designed for unattended operation on batches of images. Blending image visualisation algorithms that automatically recognise specimens together with workflows to support post-processing tasks such as barcode reading, label transcrip- tion and metadata capture, Inselect fills a critical gap to increase the rate of specimen digitization. Introduction There are an estimated two billion specimens stored in natural history collections worldwide [1]. These botanical, zoological, anthropological, geological, mineralogical, and paleontological collections represent the largest and most significant part of the available scientific evidence base of the planets biosphere. Collectively these specimens form a global research infrastruc- ture for tackling major scientific challenges such as environmental change, biodiversity loss, human health, sustainable agriculture, and the exploration of scarce minerals [25]. Museum specimens have been used to estimate the regional species richness of tropical insects [6], to develop novel species-distribution models [7], to reveal the historical spread of a fungal patho- gen linked to declines of amphibians [8] and to examine historical responses of butterflies to climate change [9]. The public and private institutions that manage collections cover practi- cally all-geographic areas with increasing levels of sampling density and taxonomic coverage PLOS ONE | DOI:10.1371/journal.pone.0143402 November 23, 2015 1 / 15 OPEN ACCESS Citation: Hudson LN, Blagoderov V, Heaton A, Holtzhausen P, Livermore L, Price BW, et al. (2015) Inselect: Automating the Digitization of Natural History Collections. PLoS ONE 10(11): e0143402. doi:10.1371/journal.pone.0143402 Editor: Nico Cellinese, University of Florida, UNITED STATES Received: April 13, 2015 Accepted: November 4, 2015 Published: November 23, 2015 Copyright: © 2015 Hudson et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. Data Availability Statement: All 804 jpeg images are available, via the Natural History Museum's data portal, at http://dx.doi.org/10.5519/0018537. Funding: This research received support from the SYNTHESYS Project, http://www.synthesys.info/, which is financed by European Community Research Infrastructure Action under the FP7 Integrating Activities Programme (Grant agreement number 312253). Competing Interests: The authors have declared that no competing interests exist.

RESEARCHARTICLE … · RESEARCHARTICLE Inselect:AutomatingtheDigitizationof NaturalHistoryCollections ... Yes: 2D, 1D Yes: plugin ... angle, makingitanon-trivial

  • Upload
    lamtram

  • View
    218

  • Download
    0

Embed Size (px)

Citation preview

RESEARCH ARTICLE

Inselect Automating the Digitization ofNatural History CollectionsLawrence N Hudson1 Vladimir Blagoderov1 Alice Heaton1 Pieter Holtzhausen2Laurence Livermore1 BenjaminW Price1 Steacutefan van der Walt23 Vincent S Smith1

1 Department of Life Sciences Natural History Museum Cromwell Road London SW7 5BD UnitedKingdom 2 Division of Applied Mathematics Stellenbosch University Stellenbosch 7600 South Africa3 Berkeley Institute for Data Science University of California Berkeley CA United States of America

lhudsonnhmacuk

AbstractThe worldrsquos natural history collections constitute an enormous evidence base for scientific

research on the natural world To facilitate these studies and improve access to collections

many organisations are embarking on major programmes of digitization This requires auto-

mated approaches to mass-digitization that support rapid imaging of specimens and associ-

ated data capture in order to process the tens of millions of specimens common to most

natural history collections In this paper we present Inselectmdasha modular easy-to-use

cross-platform suite of open-source software tools that supports the semi-automated pro-

cessing of specimen images generated by natural history digitization programmes The

software is made up of a Windows Mac OS X and Linux desktop application together with

command-line tools that are designed for unattended operation on batches of images

Blending image visualisation algorithms that automatically recognise specimens together

with workflows to support post-processing tasks such as barcode reading label transcrip-

tion and metadata capture Inselect fills a critical gap to increase the rate of specimen

digitization

IntroductionThere are an estimated two billion specimens stored in natural history collections worldwide[1] These botanical zoological anthropological geological mineralogical and paleontologicalcollections represent the largest and most significant part of the available scientific evidencebase of the planetrsquos biosphere Collectively these specimens form a global research infrastruc-ture for tackling major scientific challenges such as environmental change biodiversity losshuman health sustainable agriculture and the exploration of scarce minerals [2ndash5] Museumspecimens have been used to estimate the regional species richness of tropical insects [6] todevelop novel species-distribution models [7] to reveal the historical spread of a fungal patho-gen linked to declines of amphibians [8] and to examine historical responses of butterflies toclimate change [9] The public and private institutions that manage collections cover practi-cally all-geographic areas with increasing levels of sampling density and taxonomic coverage

PLOSONE | DOI101371journalpone0143402 November 23 2015 1 15

OPEN ACCESS

Citation Hudson LN Blagoderov V Heaton AHoltzhausen P Livermore L Price BW et al (2015)Inselect Automating the Digitization of NaturalHistory Collections PLoS ONE 10(11) e0143402doi101371journalpone0143402

Editor Nico Cellinese University of Florida UNITEDSTATES

Received April 13 2015

Accepted November 4 2015

Published November 23 2015

Copyright copy 2015 Hudson et al This is an openaccess article distributed under the terms of theCreative Commons Attribution License which permitsunrestricted use distribution and reproduction in anymedium provided the original author and source arecredited

Data Availability Statement All 804 jpeg imagesare available via the Natural History Museums dataportal at httpdxdoiorg1055190018537

Funding This research received support from theSYNTHESYS Project httpwwwsynthesysinfowhich is financed by European Community ResearchInfrastructure Action under the FP7 IntegratingActivities Programme (Grant agreement number312253)

Competing Interests The authors have declaredthat no competing interests exist

over the last 500 years and together their global collections form an infrastructure that is usedannually by tens of thousands of scientific visitors The vast majority of these collections haveno digital records and are only accessible to a handful of specialists working within each insti-tution As a consequence these collections remains largely unknown to the majority of poten-tial users with access limited by the number of visitors that each institution can host

The sheer scale of natural history collections requires an unprecedented digitization effortto make these scientific specimens more widely accessible [1011] and many national digitiza-tion activities are underway such as the Digital Collections Programme at the Natural HistoryMuseum in the United Kingdom (henceforth NHM) which holds over 80 million specimensand has a target of digitizing 20 million of these within the next five years Similar initiativeshave been put in place by the National Science Foundation USA (Integrated Digitized Biocol-lections iDigBio httpswwwidigbioorg) the Naturalis Biodiveristy Center in Holland (atleast 37 million objects by mid-2015 httpssciencenaturalisnl) and the Atlas of Living Aus-tralia (httpwwwalaorgau)

Advances in digital imaging technology are central to these digitization efforts yet the col-lection of these images represents just one element of the digitization task [12] The compila-tion of metadata from the billions of labels associated with these specimens coupled with thetask of persistently linking the images and metadata to the physical specimens and the publica-tions in which they are described represents a much greater challenge Few collections can bemore challenging than those of pinned insectsmdashthe NHM alone has more than 33 millionpinned insect specimens constituting more than 40 of the museumrsquos entire collection It isneither practical nor cost-effective to digitize so many specimens individually As a result sev-eral whole-drawer scanning technologies have been developed [1314] that reduce the imagingtask by several orders of magnitude This approach can be applied to digitize other collectionsobjects such as microscope slides 3D dry-preserved specimens (eg fruits lichens and fungi)and fossils Drawer-level digitization has become the most practical way of unlocking theresearch potential for natural history collections For example at the NHM a single scanninginstrument (described further in Materials and Methods) can produce up to 70 high-resolutiondrawer images per day Files are between 100 and 800 megabytes (MB) in size and each cancontain images of well over a thousand individual specimens Whole-drawer images are usefulin their own right either for collections audits or for remote identification However specimen-level digitization ie creation and association of specimen metadata with images of individualspecimens remains a laborious and largely manual process Automatic segmentation of multi-specimen images would remove a major bottleneck in the digitization of natural history collec-tions by significantly reducing the time required for imaging and record creation

General-purpose image-processing tools such as the GNU Image Manipulation Program(GIMP httpwwwgimporg) and ImageJ (httpimagejnihgovij) have been proposed forthe task for automatic segmentation of images [1516] but such software is not optimised forprocessing the volume of large image files that are produced by mass-digitization programmesBlagoderov et al [11] presented a prototype for segmentation and data capturemdashMetadataCreatormdashthat allowed images of individual specimens to be cropped from multi-specimenimages but this software requires that the user manually draw a bounding box around eachspecimen This laborious process makes it unsuitable for mass-digitization activities Similarsolutions exist for related activities within the Atlas of Living Australia (ALA) project andGigaPan service but in both cases require manual drawing of rectangles to select subimagesand do not allow for metadata association beyond simple text comments (summarised inTable 1)

Inselect Automating the Digitization of Natural History Collections

PLOS ONE | DOI101371journalpone0143402 November 23 2015 2 15

The lack of software to support efficient post-processing workflows associated with whole-drawer scanning has hampered the take up of mass-digitization activities [1112] Such tasksinclude but are not limited to

bull automated segmentation (the detection and placement of a bounding box around each speci-men within multi-specimen images)

bull automated detection and reading of one- or two-dimensional (matrix) barcodes

bull manual refinement of bounding boxes

bull association of specimen images with corresponding metadata (the addition and editing ofdrawer- bulk- and specimen-level metadata such as catalog number taxonomic group geo-graphical data and physical location etc)

bull transcription of label data through manual or automated (optical character recognition) pro-cessing export of metadata to structured files of common formats

bull saving individual cropped specimen images at the full available resolution and

bull preserving the associations between cropped images of specimens and specimen metadata(eg for import into collections management software)

InselectWe present Inselectmdasha modular easy-to-use cross-platform suite of open-source softwaretools designed to address the image processing needs of large-scale digitization projects Thedesktop application implements automatic image segmentation manual editing of boundingboxes automated barcode recognition and association of metadata with images of individualspecimens The most important and time-consuming functions are also accessible throughcommand-line tools that operate on batches of images without human intervention for exam-ple being run as overnight processes Our goal was to make it straightforward to integrate Inse-lect into existing mass-digitization workflows such as those operated by major digitizationprogrammes

The software is written in Python (programming language httpwwwpythonorg)NumPy (Python scientific computing package httpwwwnumpyorg) OpenCV (computervision library httpopencvorg) and QT (application development framework httpqt-projectorg) All packages are mature portable open-source software projects with active usercommunities providing a degree of assurance that the project will remain sustainable Thesoftware runs on the three major desktop operating systemsmdashWindows Mac OS X and Linux

Table 1 Comparison of features in Inselectwith current multi-specimen image segmentation solutions

Solution Format SegmentationAlgorithm

Metadata capture Barcoderecognition

Modularity Opensource

ALA Online Manual Simple text No No Yes

Gigapan Online Manual Simple text No No No

ImageJ Desktop Automated + Manual No No Yes Yes

GIMP Desktop Automated + Manual No No Yes scripts Yes

Inselect Desktop Automated + Manual Structured fields flexible template lookupsverification

Yes 2D 1D Yes pluginsupport

Yes

doi101371journalpone0143402t001

Inselect Automating the Digitization of Natural History Collections

PLOS ONE | DOI101371journalpone0143402 November 23 2015 3 15

Source code installers and open issues are at httpsgithubcomNaturalHistoryMuseuminselect We describe Inselect assess its performance and shortcomings and make recommen-dations for future developments

Desktop applicationAn Inselect document is made up of original full-resolution scanned image (all commonlyencountered file formats are supported) a lower-resolution Joint Photographic Experts Group(JPEG) thumbnail (customizable dimensions default of 4096 pixels in width) and a list ofbounding boxes together with their associated metadata Inselect presents two views of thesedata each designed with different tasks in mind The lsquoBoxesrsquo view (Fig 1) shows the completeimage together with the bounding box around each individual specimen The lsquoSegmentrsquo com-mands runs an automatic segmentation algorithm which detects individual specimens andreplaces existing bounding boxes The user can then create delete move and resize boxes usingthe mouse andor keyboard making it a simple task to refine the results of the segmentationprocess The panel on the right contains metadata fields

The user has complete control over the list of fields and any associated validation In theEdit menu lsquoChoose templatersquo allows the user to select an lsquoinselect_templatersquo file that containsmetadata fields definitions Templates are written in YAML (YAML Ainrsquot a Markup Language-httpyamlorg)mdasha structured text format that is easy to learn and that can be edited using aplain-text editor Fig 1 shows a template called lsquoHymenopterarsquo with one numeric field (lsquoCatalognumberrsquo which can be populated by values of object barcodesndashsee below) and three fields with

Fig 1 The lsquoBoxesrsquo view This view shows the scanned image together with bounding boxes around specimens This document is a scan of 1160 individualsof Tetrastichinae (a subfamily of wasps) The boxes for one tray of specimens are selected shown outlined in light blue

doi101371journalpone0143402g001

Inselect Automating the Digitization of Natural History Collections

PLOS ONE | DOI101371journalpone0143402 November 23 2015 4 15

drop-down lists of values The metadata fields reflect the currently selected boxes making iteasy to enter metadata for a single specimen a group of specimens or to all the specimens inthe initial image (eg a taxon name or geographic location) The 131 selected boxes (Fig 1)have the same values for lsquoLocationrsquo lsquoFamilyrsquo and lsquoSubfamilyrsquo but different values of lsquoCatalognumberrsquo The template specifies that each of these four fields is mandatory Any boxes that failvalidation (eg missing mandatory values) are shown with a red backgroundmdashthe first of the131 selected boxes in Fig 1 is shown in red because it lacks a value of lsquoCatalog numberrsquo Inselecttemplates permit comprehensive field validation such as lsquoan integer value greater than zerorsquo lsquoalatitudersquo lsquoa longitudersquo and lsquoa date in the form YYYY-MM-DDrsquo For more complex cases fieldvalidation can be given as a regular expression For example the NHM templates use theregular expression ^[0ndash9]9$mdashexactly nine digits with no letters no punctuation and noleading or trailing whitespacendashfor the lsquoCatalog numberrsquo field The user can specify other prop-erties in the template such as the width of the low-resolution thumbnail image (default of4096 pixels) A complete description of the format along with example templates that are usedfor NHMrsquos digitization projects are available in the github repository httpsgithubcomNaturalHistoryMuseuminselect-templates The built-in Simple Darwin Core terms templatewhich contains all Simple Darwin Core terms (httprstdwgorgdwctermssimple [17]) canbe used by selecting the lsquoDefault templatelsquo command under the Edit menu Metadata can beexported to comma-separated values (CSV) files and included in the file name of segmentedimages

The lsquoObjectsrsquo view (Fig 2) shows individual images either in a grid or with a single imageexpanded The first box lacks a value of lsquoCatalog numberrsquo and so is shown with a red back-ground The user can rotate images individually or in groups making it easy to transcribe labelinformation into metadata fields Rotation is also applied to the cropped object images whenthese are saved

Inselect displays the low-resolution thumbnail image which is small in size quick to readand takes up relatively little space in-memory the full-resolution file (which might be manyhundreds of megabytes in size) is loaded only as required for example when saving the individ-ual cropped specimen images

The desktop application supports pluginsmdashcode modules that are able to examine and pos-sibly modify the list of bounding boxes and their associated metadata Plugins can access thelow-resolution thumbnail image and if necessary the full-resolution scanned image The soft-ware currently has plugins for automated segmentation of the entire image and for sub-seg-mentation of a single bounding box (see lsquoSegmentation algorithmsrsquo below)

Many institutions use barcodes to uniquely identify specimens Inselect therefore provides alsquoRead barcodesrsquo plugin which reads the values of any barcode(s) within each box and placesvalue(s) in the lsquoCatalog numberrsquometadata field Barcodes typically take up just a small fractionof the area of an image (eg S1 Fig) they can be smudged or damaged and can be placed at anangle making it a non-trivial task to quickly and reliably detect and decode barcodes Inselectincludes two open-source libraries zbar (httpzbarsourceforgenet) which reads one-dimen-sional barcodes and QR codes and libdmtx (httpwwwlibdmtxorg) which reads DataMatrix barcodes We found that commercial libraries were faster and more reliable than thetwo open-source decoders Inselectrsquos lsquoRead barcodesrsquo plugin therefore also supports the bestperforming of the commercial librariesmdashInlite Clearimage (purchase or download for evalua-tion from httpwwwinliteresearchcombarcode-recognition) The user can select which ofthese libraries to use by selecting the ldquoConfigure lsquoRead Barcodesrsquordquo command under the Editmenu

Inselect Automating the Digitization of Natural History Collections

PLOS ONE | DOI101371journalpone0143402 November 23 2015 5 15

Command-line toolsEach command-line tool makes available some of the functionality of the desktop applicationin a form that is convenient for unattended processing of images in batches

bull ingest reads each scanned image creates and saves an empty Inselect document along with athumbnail image

bull segment runs the segmentation algorithm for each Inselect document that does not alreadycontain bounding boxes

bull save_crops for each Inselect document writes specimen images cropped from the high-reso-lution image and

bull export_metadata for each Inselect document writes a CSV file containing metadata

Each of these tools corresponds to a shaded box in the typical Inselect workflow shown inFig 3

Segmentation algorithmsBoth of Inselectrsquos algorithms operate on thumbnail images The automatic segmentation algo-rithm converts the image to the CIELAB (Commission internationale de leacuteclairage Lab) col-our space and then adds Gaussian blur in order to remove noise It then applies Sobel filters inx and y directions and applies a threshold resulting in a binary image (ie pixels are either lsquooffrsquoor lsquoonrsquo) where lsquoonrsquo indicates that an edge in the source image The algorithm then detects

Fig 2 The lsquoObjectsrsquo view This view shows objects in a grid The selected object lacks a mandatory metadata field so is shown in red

doi101371journalpone0143402g002

Inselect Automating the Digitization of Natural History Collections

PLOS ONE | DOI101371journalpone0143402 November 23 2015 6 15

contours around each edge and computes the bounding box around each contour Contoursare processed recursively in order to detect edges-within-edges such as specimens within insecttrays The result of the algorithm is a list of bounding boxes

The sub-segmentation algorithm is applied by the user to a single bounding box that con-tains many specimensmdasha situation that can arise when the automatic segmentation algorithmwas unable to discriminate between specimens The user marks each individual specimenwithin a box using shift+left mouse click The sub-segmentation algorithm applies a watershedtechnique in which the image is considered to be a topographical surface with peaks and val-leys each lsquovalleyrsquo (indicated by a user-designated marker) is lsquofilledrsquo with a different colourlsquowaterrsquo until all lsquopeaksrsquo are submerged The resulting lsquolakesrsquo of different colours indicate theextent of each specimen The result of the algorithm is a list of bounding boxes

Based on an initial period of exploration with a variety of images from the NHMrsquos collec-tion all free parameters of both algorithms were hard-coded within the Inselect software

Materials and Methods

Test imagesWe evaluated the performance of the software using 804 multi-specimen Tagged Image FileFormat (TIFF) images of specimens from the NHMrsquos collections Images were captured usingthe SmartDrive SatScan (httpwwwsmartdrivecouk) collection scanner which is capable ofproducing high-resolution images of entire collection drawers A camera (UEye-SE USBCMOS model UI-1480SE-C-HQ 2560times1920 resolution) and an attached lens (Edmund Opticstelecentric TML lenses model 58428 03times or model 56675 016times) is moved in two dimensionsalong precision-engineered rails positioned above the objects that are to be imaged A combi-nation of hardware and software provides automated capture of high-resolution images ofsmall regions of interest which are then assembled (ldquostitchedrdquo) into a single panoramic imageby proprietary software (Analyse by SmartDrive) This method maximizes depth of field of thecaptured images and minimizes distortion and parallax artefacts

We used scanned images of pinned insects stored in collection drawers of between 400 x500 mm and 555 x 572 mm in size with or without unit trays Some of the scanned images con-tain in areas where no specimens are present paper with a printed Penrose tiles patternndashthesewere added in order to aid earlier versions of the stitching algorithm We also tested Inselectusing scans of standard-size microscope slides laid out for imaging in a rectangular grid con-taining 72 sockets arranged in six columns and twelve rows and large-sized microscope slidesarranged in a grid of six columns and eight rows some sockets were empty in some scans Wemake the thumbnail images (on which the segmentation algorithm operates) of our completetest dataset available at httpdxdoiorg1055190018537

PerformanceFor each image we computed or measured

Fig 3 Typical workflow All processes are offered by the desktop application Shaded boxes indicate processes that can be performed by the command-line tools which can work on batches of images and documents

doi101371journalpone0143402g003

Inselect Automating the Digitization of Natural History Collections

PLOS ONE | DOI101371journalpone0143402 November 23 2015 7 15

bull the dimensions of the scanned TIFF image (in pixels)

bull the size of the scanned image file (in MB)

bull the time to ingest (ie read the scanned image save a JPEG thumbnail image of 4096 pixelsin width and create an empty Inselect document)

bull the size of the thumbnail image file (in MB)

bull the time to segment and

bull the number of boxes found by segmentation

We picked 30 images at random and manually refined the bounding boxes that weredetected by the segmentation algorithm This involved correcting false positives (removingboxes where there was no specimen) false negatives (creating boxes where specimens did nothave one) and adjusting the size of boxes that did not encompass the entire specimen and asso-ciated labels We recorded the time taken to refine the bounding boxes and the actual numberof specimens on the image

All tests were carried out on an eight-CPU Dell Precision T3610 workstation with 32GBRAM running Ubuntu Linux 1404 (httpwwwubuntucom) Python 279 NumPy 191OpenCV 249 and QT 486 The version-control tags for these two experiments are lsquoperfor-mance-experiment-1rsquo and lsquoexperiment-1-refinementrsquo respectively

ResultsThe mean dimensions of the scanned images were 18131 x 15268 pixels The complete set ofscanned images took up 341GB on disk file sizes varied between 111MB and 796MB median429MB (Table 2) Examples of segmented images are shown in Figs 4ndash6 Median thumbnailimages were nearly two orders of magnitude smaller than the full-resolution scanned images(S2 Fig) The whole experiment took 2 hours 29 minutes to run Ingestion times are shown inS3 Fig Segmentation time is not explained by the number of bounding boxes detected (S4 Fig)

Table 2 Scanned images by group

Group Description N images File sizes (MB)

Min Median Max

Brahmaeidae A family of moths 30 7211 7345 7960

Cerambycidae The family of longhorn beetles 2 1109 1155 1201

Chalcidoidea A superfamily of wasps 2 4199 4232 4266

Chalcosiinae A subfamily of moths 271 3960 4310 4690

Charaxinae A subfamily of butterflies 67 3683 4283 5043

Coccinellidae Ladybirds 7 4015 4140 4152

Embioptera Webspinners 2 4295 4308 4321

Limacodidae A family of moths 4 4105 4170 4237

Lucanidae The family of stag beetles 240 2142 4354 4719

Lycaenidae A large family of butterflies 53 4068 4268 4369

Microscope slides Benthic Foraminifera in a rock thin section 105 3781 4045 4332

Microscope slides (large) Benthic Foraminifera in a rock thin section 13 3783 3878 4084

Mixed moths and butterflies Mixed moths and butterflies collected in Madagascar 2 4405 6134 7864

Mycalesina A group of satyrid butterflies 4 4004 4295 4553

Neuroptera Lacewings and their relatives 2 4301 4337 4373

doi101371journalpone0143402t002

Inselect Automating the Digitization of Natural History Collections

PLOS ONE | DOI101371journalpone0143402 November 23 2015 8 15

We refined the bounding boxes of 30 images picked at random The median time to refine was1085 s (minimum time 85s maximum time 4130s S5 Fig) and the number of boundingboxes detected by segmentation was not a good predictor of the actual number of specimens(S6 Fig)

Discussion

Ingestion and segmentation performanceIt took just 2 frac12 hours to ingest and segment just over 800 images which represents more thantwice the weekly output of a SatScan machine running at full capacity Some images containedoverlapping specimens (eg Fig 7) Not only are such images challenging for any segmentationalgorithm but the resulting cropped specimen images are of questionable use arguably thesedrawers should be re-curated and re-imaged As might be expected given the way that JPEGcompression works thumbnail size is a function of image complexity rather than size of the

Fig 4 Segmented image of moth specimens Example of a segmented image of Chalcosiinae (a subfamily of moths) specimens

doi101371journalpone0143402g004

Inselect Automating the Digitization of Natural History Collections

PLOS ONE | DOI101371journalpone0143402 November 23 2015 9 15

full-resolution scanned image (S2 Fig) We expected time to ingest (read full-resolution TIFFresize in memory and write JPEG thumbnail) to scale linearly with the size of the full-resolu-tion image file Variation in ingestion time (S3 Fig) could be due to blocked PC resources(CPU RAM hard-disk) but the PC used to carry out the tests has a high specification and theimage files were on their own physical hard disk that was not being used for other tasks JPEGcompression speed is more likely to explain the variation given that this is correlated withimage complexity and that this complexity is highly variable across each drawer

Fig 5 Segmented image of beetle specimens Example of a segmented image of Lucanidae (stag beetles) specimens

doi101371journalpone0143402g005

Inselect Automating the Digitization of Natural History Collections

PLOS ONE | DOI101371journalpone0143402 November 23 2015 10 15

Application in natural history collection digitization workflowsNelson et al [12] described three dominant digitization workflows for natural history collec-tions (1) data capture with occasional specimen imaging (2) parallel data and specimen imagecapture and (3) imaging of specimens and labels followed by data capture from the image Weconsider the third workflow as the most efficient process for mass digitization of very large col-lections This allows operators to perform simultaneous image and data capture for multiplespecimens thus significantly increasing throughput (S7 Fig)

Inselect was developed primarily to suit the needs of the mass digitization program withinthe Natural History Museum and can we hope also be used by the many organisations withcollections that share the following characteristics

bull extremely large size

bull reasonably complete taxonomic index (list of taxa represented in the collection)

bull complete record of collection lots (ie multi-specimen and mixed taxon collections) and

bull very low percentage of specimen-level records

Fig 6 Segmented image of microscope slides Example of a segmented image of microscope slides of benthic Foraminifera in a rock thin section

doi101371journalpone0143402g006

Inselect Automating the Digitization of Natural History Collections

PLOS ONE | DOI101371journalpone0143402 November 23 2015 11 15

Under these circumstances one of the most pressing priorities is ardquobroad-and-thinrdquoapproach to digitization the collection of essential specimen-level data allowing complete col-lection audit and providing a specimen level platform to add metadata in future At a minimumthis includes the specimens determination (ie taxon name) and its physical location withinthe collection Inselect has proven to be a very useful tool for the most challenging parts of theNHMrsquos collections such as pinned insects and it can be easily applied to other areas for exam-ple environmental studies and quantitative analysis of trap samples (eg ldquoinvertebrate soupsrdquoor sticky traps see S8 Fig)

In the course of the NHM Slide Digitization Pilot project which will digitize 100000microscope slides in eight months Inselect received extensive user acceptance testing Resultsto date show that throughput is as high as 5000 slides per day per person for processingmulti-slide images through Inselect This includes tasks associated with image segmentation

Fig 7 Segmented image containing overlapping specimens Example of segmented image of Mycalesina (a group of butterflies)

doi101371journalpone0143402g007

Inselect Automating the Digitization of Natural History Collections

PLOS ONE | DOI101371journalpone0143402 November 23 2015 12 15

and refinement barcode recognition and association with minimal metadata (taxon name andphysical location in the collection) Upon completion of the project the entire set of time andmotion studies alongside the associated workflow will be described in a separate publicationDrawers of curated pinned insects as a rule do not require additional preparation thereforethe imaging output can be up to 70 SatScan images per day resulting in 3500ndash70000 speci-mens per day available for Inselect

Future developmentsSegmentation algorithms The high throughput of mass-digitization activities makes it

important to minimize the amount of manual intervention required The NHMrsquos SatScaninstrument can generate up to 70 multi-specimen images per day The median user-timerequired to refine segmented images of 109 s (S5 Fig) means that 70 x 109 3600 = 21 person-hours could be required to refine bounding boxes for a dayrsquos worth of images from a SatScanmachine In the worst case (413 s) this refinement time increases to more than eight hoursTherefore segmentation algorithms should be as accurate as possible and we suggest that thereis a need for a formal method (and supporting software) that allows segmentation methodsand their associated parameter sets to be scored and ranked Such a score should consider per-formance false positives and false negatives The outputs of such an activity might be a libraryof algorithms andor parameter sets geared towards different specimen types The dataset of804 images used in the present work (available at httpdxdoiorg1055190018537) consti-tutes a benchmark dataset against which segmentation algorithms can be measured Inselectrsquosmodular architecture and its provision of plugins make it a suitable platform for such aninvestigation

Desktop application The desktop application lacks some of the polish that is expected ofmodern software such as lsquoundorsquo and localization Other desirable features include the ability tofilter and order bounding boxes by size andor area in order to aid refinement support forExchangeable Image File Format (EXIF) tags and integration with industry-standard imageprocessing tools such as Adobe Photoshop (httpwwwadobecomproductsphotoshophtml)The plugin architecture makes a possible range of developments such as additional segmenta-tion algorithms and optical character recognition of label text within bounding boxes

Inselect has been tested using specimens from the NHMrsquos entomological and micropalaeon-tological collections as well as a limited number of specimens from Continental European col-lections We would like to test the software against a greater diversity of museum specimensand institutions to ensure that it can accommodate variation in the storage and mounting ofthese specimens

Despite these limitations Inselect represents a substantial contribution to the tools availableto support mass-digitization of natural history collections The desktop application and itsassociated command-line tools have been designed to efficiently handle the high numbers oflarge image files produced by mass-digitization activities The combination of a modular archi-tecture desktop application and scriptable technology makes it a relatively simple task to inte-grate Inselect into existing and workflows Bug reports feature requests and ideas can beviewed and created at httpsgithubcomNaturalHistoryMuseuminselectissues We areactively developing Inselect and we greatly value all comments and suggestions

Supporting InformationS1 Fig A cropped specimen image containing a barcode A scan of a microscope slide thatcontains a Data Matrix barcode(TIFF)

Inselect Automating the Digitization of Natural History Collections

PLOS ONE | DOI101371journalpone0143402 November 23 2015 13 15

S2 Fig Thumbnail file size against file size Numbers in brackets in the legend are the num-ber of files in that group(EPS)

S3 Fig Ingestion time against file size Ingestion time is the time taken to read the scannedimage and save a JPEG thumbnail image of 4096 pixels in width Numbers in brackets in thelegend are the number of files in that group(EPS)

S4 Fig Segmentation time against number of boxes found Numbers in brackets in the leg-end are the number of files in that group(EPS)

S5 Fig Distribution of the time to refine 30 segmented images For 30 images picked at ran-dom(EPS)

S6 Fig The actual number of specimens against the number of boxes detected by segmenta-tion For 30 images picked at random(EPS)

S7 Fig Imaging digitization workflow a Original single-specimen imaging and data capture(after Nelson 2012 modified) b multi-specimen imaging and data capture(EPS)

S8 Fig Alternative use of Inselect for trap sample Count estimation (1064 specimens) usingInselect on an image of a yellow sticky trap used in an environmental assessment study(TIFF)

AcknowledgmentsWe thank Francois Malan for the modified ldquoDivide Scanned Imagesrdquo software (httpfrancoismalancom201301how-to-batch-separate-crop-multiple-scanned-photos) whichprovided the inspiration for developing Inselect Gabriel Brostow and Michael Terry for helpfuldiscussions about computer vision Ben Scott for advice on sustainable software developmentJoumlrg Holetschek and Gwenaeumll Le Bras for application testing and feedback NHM curatorsBlanca Huertas and Geoff Martin for access to images from the Lepidoptera collections volun-teers Robyn Crowther Alexander Esin James Fage Sophie Ledger and Gabriela Montejo fordevoting their time to digitize the test specimens Duncan Sivell for the yellow sticky trapimage used in S8 Fig

Author ContributionsConceived and designed the experiments LNH Performed the experiments LNH Analyzedthe data LNH Contributed reagentsmaterialsanalysis tools VB BWP Wrote the paper LNHVB AH PH LL BWP SVDWVSS Identified the need for the application VB LL BWP VSSWrote the segmentation algorithm and designed and wrote the prototype desktop applicationPH SVDW Worked on the prototype application AH Re-engineered the application LNHContributed towards the design of the desktop application and workflow tools LNH VB AHPH LL BWP SVDWVSS

Inselect Automating the Digitization of Natural History Collections

PLOS ONE | DOI101371journalpone0143402 November 23 2015 14 15

References1 Arintildeo AH Approaches to estimating the universe of natural history collections data Biodiversity Infor-

matics 2010 7 81ndash92 doi 1017161biv7i23991

2 GrahamCH Ferrier S Huettman F Moritz C Peterson AT New developments in museum-based infor-matics and applications in biodiversity analysis Trends Ecol Evol 2004 19 497ndash503 doi 101016jtree200407006 PMID 16701313

3 Suarez AV Tsutsui ND The value of museum collections for research and society BioScience 200454 66ndash74 doi 1016410006-3568(2004)054[0066TVOMCF]20CO2

4 Lister A M Brooks SJ Fenberg PB Glover AG James KE Johnson KG et al Natural history collec-tions as sources of long-term datasets Trends Ecol Evol 2011 26 153ndash154 doi 101016jtree201012009 PMID 21255862

5 Rocha LA Aleixo A Allen G Almeda F Baldwin CC Barclay MV et al Specimen collection an essen-tial tool Science 2014 344 814ndash5 doi 101126science3446186814 PMID 24855245

6 Beck J Kitching IJ Estimating regional species richness of tropical insects frommuseum data a com-parison of a geography-based and sample-based methods J Appl Ecol 2007 44 672ndash681 doi 101111j1365-2664200701291x

7 Newbold T Applications and limitations of museum data for conservation and ecology with particularattention to species distribution models Prog Phys Geog 2010 34 3ndash22 doi 1011770309133309355630

8 Cheng TL Rovito SM Wake DB Vredenburg VT Coincident mass extirpation of neotropical amphibi-ans with the emergence of the infectious fungal pathogen Batrachochytrium dendrobatidis Proc NatlAcad Sci USA 2011 108 9502ndash9507 doi 101073pnas1105538108 PMID 21543713

9 Brooks SJ Self A Toloni F Sparks T Natural history museum collections provide information on phe-nological change in British butterflies since the late-nineteenth century Int J Biometeorol 2014 581749ndash1758 doi 101007s00484-013-0780-6 PMID 24429705

10 Beaman RS Cellinese N Mass digitization of scientific collections new opportunities to transform theuse of biological specimens and underwrite biodiversity science In Blagoderov V Smith VS editorsNo specimen left behind mass digitization of natural history collections ZooKeys 2012 209 7ndash17doi 103897zookeys2093313

11 Blagoderov V Kitching IJ Livermore L Simonsen TJ Smith VS No specimen left behind industrialscale digitization of natural history collections In Blagoderov V Smith VS editors No specimen leftbehind mass digitization of natural history collections ZooKeys 2012 209 133ndash146 doi 103897zookeys2093178

12 Nelson G Paul D Riccardi G Mast AR Five task clusters that enable efficient and effective digitizationof biological collections In Blagoderov V Smith VS editors No specimen left behind mass digitizationof natural history collections ZooKeys 2012 209 19ndash45 doi 103897zookeys2093135

13 Holovachov O Zatushevsky A Shydlovsky I Whole-drawer imaging of entomological collections ben-efits limitations and alternative applications J of Conserv and Mus Stud 2014 12 1ndash13 doi 105334jcms1021218

14 Mantle BL La Salle J Fisher N Whole-drawer imaging for digital management and curation of a largeentomological collection In Blagoderov V Smith VS editors No specimen left behind mass digitiza-tion of natural history collections ZooKeys 2012 209 147ndash163 doi 103897zookeys2093169

15 Dietrich CH Hart J Raila D Ravaioli U Sobh N Sobh O et al InvertNet a new paradigm for digitalaccess to invertebrate collections In Blagoderov V Smith VS editors No specimen left behind massdigitization of natural history collections ZooKeys 2012 209 165ndash181 doi 103897zookeys2093571

16 Schmidt S Balke M Lafogler S DScanmdasha high-performance digital scanning system for entomologicalcollections In Blagoderov V Smith VS editors No specimen left behind mass digitization of naturalhistory collections ZooKeys 2012 209 183ndash191 doi 103897zookeys2093115

17 Wieczorek J Bloom D Guralnick R Blum S Doumlring M Giovanni R et al Darwin Core An EvolvingCommunity-Developed Biodiversity Data Standard PLoS ONE 2012 7(1) e29715 doi 101371journalpone0029715 PMID 22238640

Inselect Automating the Digitization of Natural History Collections

PLOS ONE | DOI101371journalpone0143402 November 23 2015 15 15

over the last 500 years and together their global collections form an infrastructure that is usedannually by tens of thousands of scientific visitors The vast majority of these collections haveno digital records and are only accessible to a handful of specialists working within each insti-tution As a consequence these collections remains largely unknown to the majority of poten-tial users with access limited by the number of visitors that each institution can host

The sheer scale of natural history collections requires an unprecedented digitization effortto make these scientific specimens more widely accessible [1011] and many national digitiza-tion activities are underway such as the Digital Collections Programme at the Natural HistoryMuseum in the United Kingdom (henceforth NHM) which holds over 80 million specimensand has a target of digitizing 20 million of these within the next five years Similar initiativeshave been put in place by the National Science Foundation USA (Integrated Digitized Biocol-lections iDigBio httpswwwidigbioorg) the Naturalis Biodiveristy Center in Holland (atleast 37 million objects by mid-2015 httpssciencenaturalisnl) and the Atlas of Living Aus-tralia (httpwwwalaorgau)

Advances in digital imaging technology are central to these digitization efforts yet the col-lection of these images represents just one element of the digitization task [12] The compila-tion of metadata from the billions of labels associated with these specimens coupled with thetask of persistently linking the images and metadata to the physical specimens and the publica-tions in which they are described represents a much greater challenge Few collections can bemore challenging than those of pinned insectsmdashthe NHM alone has more than 33 millionpinned insect specimens constituting more than 40 of the museumrsquos entire collection It isneither practical nor cost-effective to digitize so many specimens individually As a result sev-eral whole-drawer scanning technologies have been developed [1314] that reduce the imagingtask by several orders of magnitude This approach can be applied to digitize other collectionsobjects such as microscope slides 3D dry-preserved specimens (eg fruits lichens and fungi)and fossils Drawer-level digitization has become the most practical way of unlocking theresearch potential for natural history collections For example at the NHM a single scanninginstrument (described further in Materials and Methods) can produce up to 70 high-resolutiondrawer images per day Files are between 100 and 800 megabytes (MB) in size and each cancontain images of well over a thousand individual specimens Whole-drawer images are usefulin their own right either for collections audits or for remote identification However specimen-level digitization ie creation and association of specimen metadata with images of individualspecimens remains a laborious and largely manual process Automatic segmentation of multi-specimen images would remove a major bottleneck in the digitization of natural history collec-tions by significantly reducing the time required for imaging and record creation

General-purpose image-processing tools such as the GNU Image Manipulation Program(GIMP httpwwwgimporg) and ImageJ (httpimagejnihgovij) have been proposed forthe task for automatic segmentation of images [1516] but such software is not optimised forprocessing the volume of large image files that are produced by mass-digitization programmesBlagoderov et al [11] presented a prototype for segmentation and data capturemdashMetadataCreatormdashthat allowed images of individual specimens to be cropped from multi-specimenimages but this software requires that the user manually draw a bounding box around eachspecimen This laborious process makes it unsuitable for mass-digitization activities Similarsolutions exist for related activities within the Atlas of Living Australia (ALA) project andGigaPan service but in both cases require manual drawing of rectangles to select subimagesand do not allow for metadata association beyond simple text comments (summarised inTable 1)

Inselect Automating the Digitization of Natural History Collections

PLOS ONE | DOI101371journalpone0143402 November 23 2015 2 15

The lack of software to support efficient post-processing workflows associated with whole-drawer scanning has hampered the take up of mass-digitization activities [1112] Such tasksinclude but are not limited to

bull automated segmentation (the detection and placement of a bounding box around each speci-men within multi-specimen images)

bull automated detection and reading of one- or two-dimensional (matrix) barcodes

bull manual refinement of bounding boxes

bull association of specimen images with corresponding metadata (the addition and editing ofdrawer- bulk- and specimen-level metadata such as catalog number taxonomic group geo-graphical data and physical location etc)

bull transcription of label data through manual or automated (optical character recognition) pro-cessing export of metadata to structured files of common formats

bull saving individual cropped specimen images at the full available resolution and

bull preserving the associations between cropped images of specimens and specimen metadata(eg for import into collections management software)

InselectWe present Inselectmdasha modular easy-to-use cross-platform suite of open-source softwaretools designed to address the image processing needs of large-scale digitization projects Thedesktop application implements automatic image segmentation manual editing of boundingboxes automated barcode recognition and association of metadata with images of individualspecimens The most important and time-consuming functions are also accessible throughcommand-line tools that operate on batches of images without human intervention for exam-ple being run as overnight processes Our goal was to make it straightforward to integrate Inse-lect into existing mass-digitization workflows such as those operated by major digitizationprogrammes

The software is written in Python (programming language httpwwwpythonorg)NumPy (Python scientific computing package httpwwwnumpyorg) OpenCV (computervision library httpopencvorg) and QT (application development framework httpqt-projectorg) All packages are mature portable open-source software projects with active usercommunities providing a degree of assurance that the project will remain sustainable Thesoftware runs on the three major desktop operating systemsmdashWindows Mac OS X and Linux

Table 1 Comparison of features in Inselectwith current multi-specimen image segmentation solutions

Solution Format SegmentationAlgorithm

Metadata capture Barcoderecognition

Modularity Opensource

ALA Online Manual Simple text No No Yes

Gigapan Online Manual Simple text No No No

ImageJ Desktop Automated + Manual No No Yes Yes

GIMP Desktop Automated + Manual No No Yes scripts Yes

Inselect Desktop Automated + Manual Structured fields flexible template lookupsverification

Yes 2D 1D Yes pluginsupport

Yes

doi101371journalpone0143402t001

Inselect Automating the Digitization of Natural History Collections

PLOS ONE | DOI101371journalpone0143402 November 23 2015 3 15

Source code installers and open issues are at httpsgithubcomNaturalHistoryMuseuminselect We describe Inselect assess its performance and shortcomings and make recommen-dations for future developments

Desktop applicationAn Inselect document is made up of original full-resolution scanned image (all commonlyencountered file formats are supported) a lower-resolution Joint Photographic Experts Group(JPEG) thumbnail (customizable dimensions default of 4096 pixels in width) and a list ofbounding boxes together with their associated metadata Inselect presents two views of thesedata each designed with different tasks in mind The lsquoBoxesrsquo view (Fig 1) shows the completeimage together with the bounding box around each individual specimen The lsquoSegmentrsquo com-mands runs an automatic segmentation algorithm which detects individual specimens andreplaces existing bounding boxes The user can then create delete move and resize boxes usingthe mouse andor keyboard making it a simple task to refine the results of the segmentationprocess The panel on the right contains metadata fields

The user has complete control over the list of fields and any associated validation In theEdit menu lsquoChoose templatersquo allows the user to select an lsquoinselect_templatersquo file that containsmetadata fields definitions Templates are written in YAML (YAML Ainrsquot a Markup Language-httpyamlorg)mdasha structured text format that is easy to learn and that can be edited using aplain-text editor Fig 1 shows a template called lsquoHymenopterarsquo with one numeric field (lsquoCatalognumberrsquo which can be populated by values of object barcodesndashsee below) and three fields with

Fig 1 The lsquoBoxesrsquo view This view shows the scanned image together with bounding boxes around specimens This document is a scan of 1160 individualsof Tetrastichinae (a subfamily of wasps) The boxes for one tray of specimens are selected shown outlined in light blue

doi101371journalpone0143402g001

Inselect Automating the Digitization of Natural History Collections

PLOS ONE | DOI101371journalpone0143402 November 23 2015 4 15

drop-down lists of values The metadata fields reflect the currently selected boxes making iteasy to enter metadata for a single specimen a group of specimens or to all the specimens inthe initial image (eg a taxon name or geographic location) The 131 selected boxes (Fig 1)have the same values for lsquoLocationrsquo lsquoFamilyrsquo and lsquoSubfamilyrsquo but different values of lsquoCatalognumberrsquo The template specifies that each of these four fields is mandatory Any boxes that failvalidation (eg missing mandatory values) are shown with a red backgroundmdashthe first of the131 selected boxes in Fig 1 is shown in red because it lacks a value of lsquoCatalog numberrsquo Inselecttemplates permit comprehensive field validation such as lsquoan integer value greater than zerorsquo lsquoalatitudersquo lsquoa longitudersquo and lsquoa date in the form YYYY-MM-DDrsquo For more complex cases fieldvalidation can be given as a regular expression For example the NHM templates use theregular expression ^[0ndash9]9$mdashexactly nine digits with no letters no punctuation and noleading or trailing whitespacendashfor the lsquoCatalog numberrsquo field The user can specify other prop-erties in the template such as the width of the low-resolution thumbnail image (default of4096 pixels) A complete description of the format along with example templates that are usedfor NHMrsquos digitization projects are available in the github repository httpsgithubcomNaturalHistoryMuseuminselect-templates The built-in Simple Darwin Core terms templatewhich contains all Simple Darwin Core terms (httprstdwgorgdwctermssimple [17]) canbe used by selecting the lsquoDefault templatelsquo command under the Edit menu Metadata can beexported to comma-separated values (CSV) files and included in the file name of segmentedimages

The lsquoObjectsrsquo view (Fig 2) shows individual images either in a grid or with a single imageexpanded The first box lacks a value of lsquoCatalog numberrsquo and so is shown with a red back-ground The user can rotate images individually or in groups making it easy to transcribe labelinformation into metadata fields Rotation is also applied to the cropped object images whenthese are saved

Inselect displays the low-resolution thumbnail image which is small in size quick to readand takes up relatively little space in-memory the full-resolution file (which might be manyhundreds of megabytes in size) is loaded only as required for example when saving the individ-ual cropped specimen images

The desktop application supports pluginsmdashcode modules that are able to examine and pos-sibly modify the list of bounding boxes and their associated metadata Plugins can access thelow-resolution thumbnail image and if necessary the full-resolution scanned image The soft-ware currently has plugins for automated segmentation of the entire image and for sub-seg-mentation of a single bounding box (see lsquoSegmentation algorithmsrsquo below)

Many institutions use barcodes to uniquely identify specimens Inselect therefore provides alsquoRead barcodesrsquo plugin which reads the values of any barcode(s) within each box and placesvalue(s) in the lsquoCatalog numberrsquometadata field Barcodes typically take up just a small fractionof the area of an image (eg S1 Fig) they can be smudged or damaged and can be placed at anangle making it a non-trivial task to quickly and reliably detect and decode barcodes Inselectincludes two open-source libraries zbar (httpzbarsourceforgenet) which reads one-dimen-sional barcodes and QR codes and libdmtx (httpwwwlibdmtxorg) which reads DataMatrix barcodes We found that commercial libraries were faster and more reliable than thetwo open-source decoders Inselectrsquos lsquoRead barcodesrsquo plugin therefore also supports the bestperforming of the commercial librariesmdashInlite Clearimage (purchase or download for evalua-tion from httpwwwinliteresearchcombarcode-recognition) The user can select which ofthese libraries to use by selecting the ldquoConfigure lsquoRead Barcodesrsquordquo command under the Editmenu

Inselect Automating the Digitization of Natural History Collections

PLOS ONE | DOI101371journalpone0143402 November 23 2015 5 15

Command-line toolsEach command-line tool makes available some of the functionality of the desktop applicationin a form that is convenient for unattended processing of images in batches

bull ingest reads each scanned image creates and saves an empty Inselect document along with athumbnail image

bull segment runs the segmentation algorithm for each Inselect document that does not alreadycontain bounding boxes

bull save_crops for each Inselect document writes specimen images cropped from the high-reso-lution image and

bull export_metadata for each Inselect document writes a CSV file containing metadata

Each of these tools corresponds to a shaded box in the typical Inselect workflow shown inFig 3

Segmentation algorithmsBoth of Inselectrsquos algorithms operate on thumbnail images The automatic segmentation algo-rithm converts the image to the CIELAB (Commission internationale de leacuteclairage Lab) col-our space and then adds Gaussian blur in order to remove noise It then applies Sobel filters inx and y directions and applies a threshold resulting in a binary image (ie pixels are either lsquooffrsquoor lsquoonrsquo) where lsquoonrsquo indicates that an edge in the source image The algorithm then detects

Fig 2 The lsquoObjectsrsquo view This view shows objects in a grid The selected object lacks a mandatory metadata field so is shown in red

doi101371journalpone0143402g002

Inselect Automating the Digitization of Natural History Collections

PLOS ONE | DOI101371journalpone0143402 November 23 2015 6 15

contours around each edge and computes the bounding box around each contour Contoursare processed recursively in order to detect edges-within-edges such as specimens within insecttrays The result of the algorithm is a list of bounding boxes

The sub-segmentation algorithm is applied by the user to a single bounding box that con-tains many specimensmdasha situation that can arise when the automatic segmentation algorithmwas unable to discriminate between specimens The user marks each individual specimenwithin a box using shift+left mouse click The sub-segmentation algorithm applies a watershedtechnique in which the image is considered to be a topographical surface with peaks and val-leys each lsquovalleyrsquo (indicated by a user-designated marker) is lsquofilledrsquo with a different colourlsquowaterrsquo until all lsquopeaksrsquo are submerged The resulting lsquolakesrsquo of different colours indicate theextent of each specimen The result of the algorithm is a list of bounding boxes

Based on an initial period of exploration with a variety of images from the NHMrsquos collec-tion all free parameters of both algorithms were hard-coded within the Inselect software

Materials and Methods

Test imagesWe evaluated the performance of the software using 804 multi-specimen Tagged Image FileFormat (TIFF) images of specimens from the NHMrsquos collections Images were captured usingthe SmartDrive SatScan (httpwwwsmartdrivecouk) collection scanner which is capable ofproducing high-resolution images of entire collection drawers A camera (UEye-SE USBCMOS model UI-1480SE-C-HQ 2560times1920 resolution) and an attached lens (Edmund Opticstelecentric TML lenses model 58428 03times or model 56675 016times) is moved in two dimensionsalong precision-engineered rails positioned above the objects that are to be imaged A combi-nation of hardware and software provides automated capture of high-resolution images ofsmall regions of interest which are then assembled (ldquostitchedrdquo) into a single panoramic imageby proprietary software (Analyse by SmartDrive) This method maximizes depth of field of thecaptured images and minimizes distortion and parallax artefacts

We used scanned images of pinned insects stored in collection drawers of between 400 x500 mm and 555 x 572 mm in size with or without unit trays Some of the scanned images con-tain in areas where no specimens are present paper with a printed Penrose tiles patternndashthesewere added in order to aid earlier versions of the stitching algorithm We also tested Inselectusing scans of standard-size microscope slides laid out for imaging in a rectangular grid con-taining 72 sockets arranged in six columns and twelve rows and large-sized microscope slidesarranged in a grid of six columns and eight rows some sockets were empty in some scans Wemake the thumbnail images (on which the segmentation algorithm operates) of our completetest dataset available at httpdxdoiorg1055190018537

PerformanceFor each image we computed or measured

Fig 3 Typical workflow All processes are offered by the desktop application Shaded boxes indicate processes that can be performed by the command-line tools which can work on batches of images and documents

doi101371journalpone0143402g003

Inselect Automating the Digitization of Natural History Collections

PLOS ONE | DOI101371journalpone0143402 November 23 2015 7 15

bull the dimensions of the scanned TIFF image (in pixels)

bull the size of the scanned image file (in MB)

bull the time to ingest (ie read the scanned image save a JPEG thumbnail image of 4096 pixelsin width and create an empty Inselect document)

bull the size of the thumbnail image file (in MB)

bull the time to segment and

bull the number of boxes found by segmentation

We picked 30 images at random and manually refined the bounding boxes that weredetected by the segmentation algorithm This involved correcting false positives (removingboxes where there was no specimen) false negatives (creating boxes where specimens did nothave one) and adjusting the size of boxes that did not encompass the entire specimen and asso-ciated labels We recorded the time taken to refine the bounding boxes and the actual numberof specimens on the image

All tests were carried out on an eight-CPU Dell Precision T3610 workstation with 32GBRAM running Ubuntu Linux 1404 (httpwwwubuntucom) Python 279 NumPy 191OpenCV 249 and QT 486 The version-control tags for these two experiments are lsquoperfor-mance-experiment-1rsquo and lsquoexperiment-1-refinementrsquo respectively

ResultsThe mean dimensions of the scanned images were 18131 x 15268 pixels The complete set ofscanned images took up 341GB on disk file sizes varied between 111MB and 796MB median429MB (Table 2) Examples of segmented images are shown in Figs 4ndash6 Median thumbnailimages were nearly two orders of magnitude smaller than the full-resolution scanned images(S2 Fig) The whole experiment took 2 hours 29 minutes to run Ingestion times are shown inS3 Fig Segmentation time is not explained by the number of bounding boxes detected (S4 Fig)

Table 2 Scanned images by group

Group Description N images File sizes (MB)

Min Median Max

Brahmaeidae A family of moths 30 7211 7345 7960

Cerambycidae The family of longhorn beetles 2 1109 1155 1201

Chalcidoidea A superfamily of wasps 2 4199 4232 4266

Chalcosiinae A subfamily of moths 271 3960 4310 4690

Charaxinae A subfamily of butterflies 67 3683 4283 5043

Coccinellidae Ladybirds 7 4015 4140 4152

Embioptera Webspinners 2 4295 4308 4321

Limacodidae A family of moths 4 4105 4170 4237

Lucanidae The family of stag beetles 240 2142 4354 4719

Lycaenidae A large family of butterflies 53 4068 4268 4369

Microscope slides Benthic Foraminifera in a rock thin section 105 3781 4045 4332

Microscope slides (large) Benthic Foraminifera in a rock thin section 13 3783 3878 4084

Mixed moths and butterflies Mixed moths and butterflies collected in Madagascar 2 4405 6134 7864

Mycalesina A group of satyrid butterflies 4 4004 4295 4553

Neuroptera Lacewings and their relatives 2 4301 4337 4373

doi101371journalpone0143402t002

Inselect Automating the Digitization of Natural History Collections

PLOS ONE | DOI101371journalpone0143402 November 23 2015 8 15

We refined the bounding boxes of 30 images picked at random The median time to refine was1085 s (minimum time 85s maximum time 4130s S5 Fig) and the number of boundingboxes detected by segmentation was not a good predictor of the actual number of specimens(S6 Fig)

Discussion

Ingestion and segmentation performanceIt took just 2 frac12 hours to ingest and segment just over 800 images which represents more thantwice the weekly output of a SatScan machine running at full capacity Some images containedoverlapping specimens (eg Fig 7) Not only are such images challenging for any segmentationalgorithm but the resulting cropped specimen images are of questionable use arguably thesedrawers should be re-curated and re-imaged As might be expected given the way that JPEGcompression works thumbnail size is a function of image complexity rather than size of the

Fig 4 Segmented image of moth specimens Example of a segmented image of Chalcosiinae (a subfamily of moths) specimens

doi101371journalpone0143402g004

Inselect Automating the Digitization of Natural History Collections

PLOS ONE | DOI101371journalpone0143402 November 23 2015 9 15

full-resolution scanned image (S2 Fig) We expected time to ingest (read full-resolution TIFFresize in memory and write JPEG thumbnail) to scale linearly with the size of the full-resolu-tion image file Variation in ingestion time (S3 Fig) could be due to blocked PC resources(CPU RAM hard-disk) but the PC used to carry out the tests has a high specification and theimage files were on their own physical hard disk that was not being used for other tasks JPEGcompression speed is more likely to explain the variation given that this is correlated withimage complexity and that this complexity is highly variable across each drawer

Fig 5 Segmented image of beetle specimens Example of a segmented image of Lucanidae (stag beetles) specimens

doi101371journalpone0143402g005

Inselect Automating the Digitization of Natural History Collections

PLOS ONE | DOI101371journalpone0143402 November 23 2015 10 15

Application in natural history collection digitization workflowsNelson et al [12] described three dominant digitization workflows for natural history collec-tions (1) data capture with occasional specimen imaging (2) parallel data and specimen imagecapture and (3) imaging of specimens and labels followed by data capture from the image Weconsider the third workflow as the most efficient process for mass digitization of very large col-lections This allows operators to perform simultaneous image and data capture for multiplespecimens thus significantly increasing throughput (S7 Fig)

Inselect was developed primarily to suit the needs of the mass digitization program withinthe Natural History Museum and can we hope also be used by the many organisations withcollections that share the following characteristics

bull extremely large size

bull reasonably complete taxonomic index (list of taxa represented in the collection)

bull complete record of collection lots (ie multi-specimen and mixed taxon collections) and

bull very low percentage of specimen-level records

Fig 6 Segmented image of microscope slides Example of a segmented image of microscope slides of benthic Foraminifera in a rock thin section

doi101371journalpone0143402g006

Inselect Automating the Digitization of Natural History Collections

PLOS ONE | DOI101371journalpone0143402 November 23 2015 11 15

Under these circumstances one of the most pressing priorities is ardquobroad-and-thinrdquoapproach to digitization the collection of essential specimen-level data allowing complete col-lection audit and providing a specimen level platform to add metadata in future At a minimumthis includes the specimens determination (ie taxon name) and its physical location withinthe collection Inselect has proven to be a very useful tool for the most challenging parts of theNHMrsquos collections such as pinned insects and it can be easily applied to other areas for exam-ple environmental studies and quantitative analysis of trap samples (eg ldquoinvertebrate soupsrdquoor sticky traps see S8 Fig)

In the course of the NHM Slide Digitization Pilot project which will digitize 100000microscope slides in eight months Inselect received extensive user acceptance testing Resultsto date show that throughput is as high as 5000 slides per day per person for processingmulti-slide images through Inselect This includes tasks associated with image segmentation

Fig 7 Segmented image containing overlapping specimens Example of segmented image of Mycalesina (a group of butterflies)

doi101371journalpone0143402g007

Inselect Automating the Digitization of Natural History Collections

PLOS ONE | DOI101371journalpone0143402 November 23 2015 12 15

and refinement barcode recognition and association with minimal metadata (taxon name andphysical location in the collection) Upon completion of the project the entire set of time andmotion studies alongside the associated workflow will be described in a separate publicationDrawers of curated pinned insects as a rule do not require additional preparation thereforethe imaging output can be up to 70 SatScan images per day resulting in 3500ndash70000 speci-mens per day available for Inselect

Future developmentsSegmentation algorithms The high throughput of mass-digitization activities makes it

important to minimize the amount of manual intervention required The NHMrsquos SatScaninstrument can generate up to 70 multi-specimen images per day The median user-timerequired to refine segmented images of 109 s (S5 Fig) means that 70 x 109 3600 = 21 person-hours could be required to refine bounding boxes for a dayrsquos worth of images from a SatScanmachine In the worst case (413 s) this refinement time increases to more than eight hoursTherefore segmentation algorithms should be as accurate as possible and we suggest that thereis a need for a formal method (and supporting software) that allows segmentation methodsand their associated parameter sets to be scored and ranked Such a score should consider per-formance false positives and false negatives The outputs of such an activity might be a libraryof algorithms andor parameter sets geared towards different specimen types The dataset of804 images used in the present work (available at httpdxdoiorg1055190018537) consti-tutes a benchmark dataset against which segmentation algorithms can be measured Inselectrsquosmodular architecture and its provision of plugins make it a suitable platform for such aninvestigation

Desktop application The desktop application lacks some of the polish that is expected ofmodern software such as lsquoundorsquo and localization Other desirable features include the ability tofilter and order bounding boxes by size andor area in order to aid refinement support forExchangeable Image File Format (EXIF) tags and integration with industry-standard imageprocessing tools such as Adobe Photoshop (httpwwwadobecomproductsphotoshophtml)The plugin architecture makes a possible range of developments such as additional segmenta-tion algorithms and optical character recognition of label text within bounding boxes

Inselect has been tested using specimens from the NHMrsquos entomological and micropalaeon-tological collections as well as a limited number of specimens from Continental European col-lections We would like to test the software against a greater diversity of museum specimensand institutions to ensure that it can accommodate variation in the storage and mounting ofthese specimens

Despite these limitations Inselect represents a substantial contribution to the tools availableto support mass-digitization of natural history collections The desktop application and itsassociated command-line tools have been designed to efficiently handle the high numbers oflarge image files produced by mass-digitization activities The combination of a modular archi-tecture desktop application and scriptable technology makes it a relatively simple task to inte-grate Inselect into existing and workflows Bug reports feature requests and ideas can beviewed and created at httpsgithubcomNaturalHistoryMuseuminselectissues We areactively developing Inselect and we greatly value all comments and suggestions

Supporting InformationS1 Fig A cropped specimen image containing a barcode A scan of a microscope slide thatcontains a Data Matrix barcode(TIFF)

Inselect Automating the Digitization of Natural History Collections

PLOS ONE | DOI101371journalpone0143402 November 23 2015 13 15

S2 Fig Thumbnail file size against file size Numbers in brackets in the legend are the num-ber of files in that group(EPS)

S3 Fig Ingestion time against file size Ingestion time is the time taken to read the scannedimage and save a JPEG thumbnail image of 4096 pixels in width Numbers in brackets in thelegend are the number of files in that group(EPS)

S4 Fig Segmentation time against number of boxes found Numbers in brackets in the leg-end are the number of files in that group(EPS)

S5 Fig Distribution of the time to refine 30 segmented images For 30 images picked at ran-dom(EPS)

S6 Fig The actual number of specimens against the number of boxes detected by segmenta-tion For 30 images picked at random(EPS)

S7 Fig Imaging digitization workflow a Original single-specimen imaging and data capture(after Nelson 2012 modified) b multi-specimen imaging and data capture(EPS)

S8 Fig Alternative use of Inselect for trap sample Count estimation (1064 specimens) usingInselect on an image of a yellow sticky trap used in an environmental assessment study(TIFF)

AcknowledgmentsWe thank Francois Malan for the modified ldquoDivide Scanned Imagesrdquo software (httpfrancoismalancom201301how-to-batch-separate-crop-multiple-scanned-photos) whichprovided the inspiration for developing Inselect Gabriel Brostow and Michael Terry for helpfuldiscussions about computer vision Ben Scott for advice on sustainable software developmentJoumlrg Holetschek and Gwenaeumll Le Bras for application testing and feedback NHM curatorsBlanca Huertas and Geoff Martin for access to images from the Lepidoptera collections volun-teers Robyn Crowther Alexander Esin James Fage Sophie Ledger and Gabriela Montejo fordevoting their time to digitize the test specimens Duncan Sivell for the yellow sticky trapimage used in S8 Fig

Author ContributionsConceived and designed the experiments LNH Performed the experiments LNH Analyzedthe data LNH Contributed reagentsmaterialsanalysis tools VB BWP Wrote the paper LNHVB AH PH LL BWP SVDWVSS Identified the need for the application VB LL BWP VSSWrote the segmentation algorithm and designed and wrote the prototype desktop applicationPH SVDW Worked on the prototype application AH Re-engineered the application LNHContributed towards the design of the desktop application and workflow tools LNH VB AHPH LL BWP SVDWVSS

Inselect Automating the Digitization of Natural History Collections

PLOS ONE | DOI101371journalpone0143402 November 23 2015 14 15

References1 Arintildeo AH Approaches to estimating the universe of natural history collections data Biodiversity Infor-

matics 2010 7 81ndash92 doi 1017161biv7i23991

2 GrahamCH Ferrier S Huettman F Moritz C Peterson AT New developments in museum-based infor-matics and applications in biodiversity analysis Trends Ecol Evol 2004 19 497ndash503 doi 101016jtree200407006 PMID 16701313

3 Suarez AV Tsutsui ND The value of museum collections for research and society BioScience 200454 66ndash74 doi 1016410006-3568(2004)054[0066TVOMCF]20CO2

4 Lister A M Brooks SJ Fenberg PB Glover AG James KE Johnson KG et al Natural history collec-tions as sources of long-term datasets Trends Ecol Evol 2011 26 153ndash154 doi 101016jtree201012009 PMID 21255862

5 Rocha LA Aleixo A Allen G Almeda F Baldwin CC Barclay MV et al Specimen collection an essen-tial tool Science 2014 344 814ndash5 doi 101126science3446186814 PMID 24855245

6 Beck J Kitching IJ Estimating regional species richness of tropical insects frommuseum data a com-parison of a geography-based and sample-based methods J Appl Ecol 2007 44 672ndash681 doi 101111j1365-2664200701291x

7 Newbold T Applications and limitations of museum data for conservation and ecology with particularattention to species distribution models Prog Phys Geog 2010 34 3ndash22 doi 1011770309133309355630

8 Cheng TL Rovito SM Wake DB Vredenburg VT Coincident mass extirpation of neotropical amphibi-ans with the emergence of the infectious fungal pathogen Batrachochytrium dendrobatidis Proc NatlAcad Sci USA 2011 108 9502ndash9507 doi 101073pnas1105538108 PMID 21543713

9 Brooks SJ Self A Toloni F Sparks T Natural history museum collections provide information on phe-nological change in British butterflies since the late-nineteenth century Int J Biometeorol 2014 581749ndash1758 doi 101007s00484-013-0780-6 PMID 24429705

10 Beaman RS Cellinese N Mass digitization of scientific collections new opportunities to transform theuse of biological specimens and underwrite biodiversity science In Blagoderov V Smith VS editorsNo specimen left behind mass digitization of natural history collections ZooKeys 2012 209 7ndash17doi 103897zookeys2093313

11 Blagoderov V Kitching IJ Livermore L Simonsen TJ Smith VS No specimen left behind industrialscale digitization of natural history collections In Blagoderov V Smith VS editors No specimen leftbehind mass digitization of natural history collections ZooKeys 2012 209 133ndash146 doi 103897zookeys2093178

12 Nelson G Paul D Riccardi G Mast AR Five task clusters that enable efficient and effective digitizationof biological collections In Blagoderov V Smith VS editors No specimen left behind mass digitizationof natural history collections ZooKeys 2012 209 19ndash45 doi 103897zookeys2093135

13 Holovachov O Zatushevsky A Shydlovsky I Whole-drawer imaging of entomological collections ben-efits limitations and alternative applications J of Conserv and Mus Stud 2014 12 1ndash13 doi 105334jcms1021218

14 Mantle BL La Salle J Fisher N Whole-drawer imaging for digital management and curation of a largeentomological collection In Blagoderov V Smith VS editors No specimen left behind mass digitiza-tion of natural history collections ZooKeys 2012 209 147ndash163 doi 103897zookeys2093169

15 Dietrich CH Hart J Raila D Ravaioli U Sobh N Sobh O et al InvertNet a new paradigm for digitalaccess to invertebrate collections In Blagoderov V Smith VS editors No specimen left behind massdigitization of natural history collections ZooKeys 2012 209 165ndash181 doi 103897zookeys2093571

16 Schmidt S Balke M Lafogler S DScanmdasha high-performance digital scanning system for entomologicalcollections In Blagoderov V Smith VS editors No specimen left behind mass digitization of naturalhistory collections ZooKeys 2012 209 183ndash191 doi 103897zookeys2093115

17 Wieczorek J Bloom D Guralnick R Blum S Doumlring M Giovanni R et al Darwin Core An EvolvingCommunity-Developed Biodiversity Data Standard PLoS ONE 2012 7(1) e29715 doi 101371journalpone0029715 PMID 22238640

Inselect Automating the Digitization of Natural History Collections

PLOS ONE | DOI101371journalpone0143402 November 23 2015 15 15

The lack of software to support efficient post-processing workflows associated with whole-drawer scanning has hampered the take up of mass-digitization activities [1112] Such tasksinclude but are not limited to

bull automated segmentation (the detection and placement of a bounding box around each speci-men within multi-specimen images)

bull automated detection and reading of one- or two-dimensional (matrix) barcodes

bull manual refinement of bounding boxes

bull association of specimen images with corresponding metadata (the addition and editing ofdrawer- bulk- and specimen-level metadata such as catalog number taxonomic group geo-graphical data and physical location etc)

bull transcription of label data through manual or automated (optical character recognition) pro-cessing export of metadata to structured files of common formats

bull saving individual cropped specimen images at the full available resolution and

bull preserving the associations between cropped images of specimens and specimen metadata(eg for import into collections management software)

InselectWe present Inselectmdasha modular easy-to-use cross-platform suite of open-source softwaretools designed to address the image processing needs of large-scale digitization projects Thedesktop application implements automatic image segmentation manual editing of boundingboxes automated barcode recognition and association of metadata with images of individualspecimens The most important and time-consuming functions are also accessible throughcommand-line tools that operate on batches of images without human intervention for exam-ple being run as overnight processes Our goal was to make it straightforward to integrate Inse-lect into existing mass-digitization workflows such as those operated by major digitizationprogrammes

The software is written in Python (programming language httpwwwpythonorg)NumPy (Python scientific computing package httpwwwnumpyorg) OpenCV (computervision library httpopencvorg) and QT (application development framework httpqt-projectorg) All packages are mature portable open-source software projects with active usercommunities providing a degree of assurance that the project will remain sustainable Thesoftware runs on the three major desktop operating systemsmdashWindows Mac OS X and Linux

Table 1 Comparison of features in Inselectwith current multi-specimen image segmentation solutions

Solution Format SegmentationAlgorithm

Metadata capture Barcoderecognition

Modularity Opensource

ALA Online Manual Simple text No No Yes

Gigapan Online Manual Simple text No No No

ImageJ Desktop Automated + Manual No No Yes Yes

GIMP Desktop Automated + Manual No No Yes scripts Yes

Inselect Desktop Automated + Manual Structured fields flexible template lookupsverification

Yes 2D 1D Yes pluginsupport

Yes

doi101371journalpone0143402t001

Inselect Automating the Digitization of Natural History Collections

PLOS ONE | DOI101371journalpone0143402 November 23 2015 3 15

Source code installers and open issues are at httpsgithubcomNaturalHistoryMuseuminselect We describe Inselect assess its performance and shortcomings and make recommen-dations for future developments

Desktop applicationAn Inselect document is made up of original full-resolution scanned image (all commonlyencountered file formats are supported) a lower-resolution Joint Photographic Experts Group(JPEG) thumbnail (customizable dimensions default of 4096 pixels in width) and a list ofbounding boxes together with their associated metadata Inselect presents two views of thesedata each designed with different tasks in mind The lsquoBoxesrsquo view (Fig 1) shows the completeimage together with the bounding box around each individual specimen The lsquoSegmentrsquo com-mands runs an automatic segmentation algorithm which detects individual specimens andreplaces existing bounding boxes The user can then create delete move and resize boxes usingthe mouse andor keyboard making it a simple task to refine the results of the segmentationprocess The panel on the right contains metadata fields

The user has complete control over the list of fields and any associated validation In theEdit menu lsquoChoose templatersquo allows the user to select an lsquoinselect_templatersquo file that containsmetadata fields definitions Templates are written in YAML (YAML Ainrsquot a Markup Language-httpyamlorg)mdasha structured text format that is easy to learn and that can be edited using aplain-text editor Fig 1 shows a template called lsquoHymenopterarsquo with one numeric field (lsquoCatalognumberrsquo which can be populated by values of object barcodesndashsee below) and three fields with

Fig 1 The lsquoBoxesrsquo view This view shows the scanned image together with bounding boxes around specimens This document is a scan of 1160 individualsof Tetrastichinae (a subfamily of wasps) The boxes for one tray of specimens are selected shown outlined in light blue

doi101371journalpone0143402g001

Inselect Automating the Digitization of Natural History Collections

PLOS ONE | DOI101371journalpone0143402 November 23 2015 4 15

drop-down lists of values The metadata fields reflect the currently selected boxes making iteasy to enter metadata for a single specimen a group of specimens or to all the specimens inthe initial image (eg a taxon name or geographic location) The 131 selected boxes (Fig 1)have the same values for lsquoLocationrsquo lsquoFamilyrsquo and lsquoSubfamilyrsquo but different values of lsquoCatalognumberrsquo The template specifies that each of these four fields is mandatory Any boxes that failvalidation (eg missing mandatory values) are shown with a red backgroundmdashthe first of the131 selected boxes in Fig 1 is shown in red because it lacks a value of lsquoCatalog numberrsquo Inselecttemplates permit comprehensive field validation such as lsquoan integer value greater than zerorsquo lsquoalatitudersquo lsquoa longitudersquo and lsquoa date in the form YYYY-MM-DDrsquo For more complex cases fieldvalidation can be given as a regular expression For example the NHM templates use theregular expression ^[0ndash9]9$mdashexactly nine digits with no letters no punctuation and noleading or trailing whitespacendashfor the lsquoCatalog numberrsquo field The user can specify other prop-erties in the template such as the width of the low-resolution thumbnail image (default of4096 pixels) A complete description of the format along with example templates that are usedfor NHMrsquos digitization projects are available in the github repository httpsgithubcomNaturalHistoryMuseuminselect-templates The built-in Simple Darwin Core terms templatewhich contains all Simple Darwin Core terms (httprstdwgorgdwctermssimple [17]) canbe used by selecting the lsquoDefault templatelsquo command under the Edit menu Metadata can beexported to comma-separated values (CSV) files and included in the file name of segmentedimages

The lsquoObjectsrsquo view (Fig 2) shows individual images either in a grid or with a single imageexpanded The first box lacks a value of lsquoCatalog numberrsquo and so is shown with a red back-ground The user can rotate images individually or in groups making it easy to transcribe labelinformation into metadata fields Rotation is also applied to the cropped object images whenthese are saved

Inselect displays the low-resolution thumbnail image which is small in size quick to readand takes up relatively little space in-memory the full-resolution file (which might be manyhundreds of megabytes in size) is loaded only as required for example when saving the individ-ual cropped specimen images

The desktop application supports pluginsmdashcode modules that are able to examine and pos-sibly modify the list of bounding boxes and their associated metadata Plugins can access thelow-resolution thumbnail image and if necessary the full-resolution scanned image The soft-ware currently has plugins for automated segmentation of the entire image and for sub-seg-mentation of a single bounding box (see lsquoSegmentation algorithmsrsquo below)

Many institutions use barcodes to uniquely identify specimens Inselect therefore provides alsquoRead barcodesrsquo plugin which reads the values of any barcode(s) within each box and placesvalue(s) in the lsquoCatalog numberrsquometadata field Barcodes typically take up just a small fractionof the area of an image (eg S1 Fig) they can be smudged or damaged and can be placed at anangle making it a non-trivial task to quickly and reliably detect and decode barcodes Inselectincludes two open-source libraries zbar (httpzbarsourceforgenet) which reads one-dimen-sional barcodes and QR codes and libdmtx (httpwwwlibdmtxorg) which reads DataMatrix barcodes We found that commercial libraries were faster and more reliable than thetwo open-source decoders Inselectrsquos lsquoRead barcodesrsquo plugin therefore also supports the bestperforming of the commercial librariesmdashInlite Clearimage (purchase or download for evalua-tion from httpwwwinliteresearchcombarcode-recognition) The user can select which ofthese libraries to use by selecting the ldquoConfigure lsquoRead Barcodesrsquordquo command under the Editmenu

Inselect Automating the Digitization of Natural History Collections

PLOS ONE | DOI101371journalpone0143402 November 23 2015 5 15

Command-line toolsEach command-line tool makes available some of the functionality of the desktop applicationin a form that is convenient for unattended processing of images in batches

bull ingest reads each scanned image creates and saves an empty Inselect document along with athumbnail image

bull segment runs the segmentation algorithm for each Inselect document that does not alreadycontain bounding boxes

bull save_crops for each Inselect document writes specimen images cropped from the high-reso-lution image and

bull export_metadata for each Inselect document writes a CSV file containing metadata

Each of these tools corresponds to a shaded box in the typical Inselect workflow shown inFig 3

Segmentation algorithmsBoth of Inselectrsquos algorithms operate on thumbnail images The automatic segmentation algo-rithm converts the image to the CIELAB (Commission internationale de leacuteclairage Lab) col-our space and then adds Gaussian blur in order to remove noise It then applies Sobel filters inx and y directions and applies a threshold resulting in a binary image (ie pixels are either lsquooffrsquoor lsquoonrsquo) where lsquoonrsquo indicates that an edge in the source image The algorithm then detects

Fig 2 The lsquoObjectsrsquo view This view shows objects in a grid The selected object lacks a mandatory metadata field so is shown in red

doi101371journalpone0143402g002

Inselect Automating the Digitization of Natural History Collections

PLOS ONE | DOI101371journalpone0143402 November 23 2015 6 15

contours around each edge and computes the bounding box around each contour Contoursare processed recursively in order to detect edges-within-edges such as specimens within insecttrays The result of the algorithm is a list of bounding boxes

The sub-segmentation algorithm is applied by the user to a single bounding box that con-tains many specimensmdasha situation that can arise when the automatic segmentation algorithmwas unable to discriminate between specimens The user marks each individual specimenwithin a box using shift+left mouse click The sub-segmentation algorithm applies a watershedtechnique in which the image is considered to be a topographical surface with peaks and val-leys each lsquovalleyrsquo (indicated by a user-designated marker) is lsquofilledrsquo with a different colourlsquowaterrsquo until all lsquopeaksrsquo are submerged The resulting lsquolakesrsquo of different colours indicate theextent of each specimen The result of the algorithm is a list of bounding boxes

Based on an initial period of exploration with a variety of images from the NHMrsquos collec-tion all free parameters of both algorithms were hard-coded within the Inselect software

Materials and Methods

Test imagesWe evaluated the performance of the software using 804 multi-specimen Tagged Image FileFormat (TIFF) images of specimens from the NHMrsquos collections Images were captured usingthe SmartDrive SatScan (httpwwwsmartdrivecouk) collection scanner which is capable ofproducing high-resolution images of entire collection drawers A camera (UEye-SE USBCMOS model UI-1480SE-C-HQ 2560times1920 resolution) and an attached lens (Edmund Opticstelecentric TML lenses model 58428 03times or model 56675 016times) is moved in two dimensionsalong precision-engineered rails positioned above the objects that are to be imaged A combi-nation of hardware and software provides automated capture of high-resolution images ofsmall regions of interest which are then assembled (ldquostitchedrdquo) into a single panoramic imageby proprietary software (Analyse by SmartDrive) This method maximizes depth of field of thecaptured images and minimizes distortion and parallax artefacts

We used scanned images of pinned insects stored in collection drawers of between 400 x500 mm and 555 x 572 mm in size with or without unit trays Some of the scanned images con-tain in areas where no specimens are present paper with a printed Penrose tiles patternndashthesewere added in order to aid earlier versions of the stitching algorithm We also tested Inselectusing scans of standard-size microscope slides laid out for imaging in a rectangular grid con-taining 72 sockets arranged in six columns and twelve rows and large-sized microscope slidesarranged in a grid of six columns and eight rows some sockets were empty in some scans Wemake the thumbnail images (on which the segmentation algorithm operates) of our completetest dataset available at httpdxdoiorg1055190018537

PerformanceFor each image we computed or measured

Fig 3 Typical workflow All processes are offered by the desktop application Shaded boxes indicate processes that can be performed by the command-line tools which can work on batches of images and documents

doi101371journalpone0143402g003

Inselect Automating the Digitization of Natural History Collections

PLOS ONE | DOI101371journalpone0143402 November 23 2015 7 15

bull the dimensions of the scanned TIFF image (in pixels)

bull the size of the scanned image file (in MB)

bull the time to ingest (ie read the scanned image save a JPEG thumbnail image of 4096 pixelsin width and create an empty Inselect document)

bull the size of the thumbnail image file (in MB)

bull the time to segment and

bull the number of boxes found by segmentation

We picked 30 images at random and manually refined the bounding boxes that weredetected by the segmentation algorithm This involved correcting false positives (removingboxes where there was no specimen) false negatives (creating boxes where specimens did nothave one) and adjusting the size of boxes that did not encompass the entire specimen and asso-ciated labels We recorded the time taken to refine the bounding boxes and the actual numberof specimens on the image

All tests were carried out on an eight-CPU Dell Precision T3610 workstation with 32GBRAM running Ubuntu Linux 1404 (httpwwwubuntucom) Python 279 NumPy 191OpenCV 249 and QT 486 The version-control tags for these two experiments are lsquoperfor-mance-experiment-1rsquo and lsquoexperiment-1-refinementrsquo respectively

ResultsThe mean dimensions of the scanned images were 18131 x 15268 pixels The complete set ofscanned images took up 341GB on disk file sizes varied between 111MB and 796MB median429MB (Table 2) Examples of segmented images are shown in Figs 4ndash6 Median thumbnailimages were nearly two orders of magnitude smaller than the full-resolution scanned images(S2 Fig) The whole experiment took 2 hours 29 minutes to run Ingestion times are shown inS3 Fig Segmentation time is not explained by the number of bounding boxes detected (S4 Fig)

Table 2 Scanned images by group

Group Description N images File sizes (MB)

Min Median Max

Brahmaeidae A family of moths 30 7211 7345 7960

Cerambycidae The family of longhorn beetles 2 1109 1155 1201

Chalcidoidea A superfamily of wasps 2 4199 4232 4266

Chalcosiinae A subfamily of moths 271 3960 4310 4690

Charaxinae A subfamily of butterflies 67 3683 4283 5043

Coccinellidae Ladybirds 7 4015 4140 4152

Embioptera Webspinners 2 4295 4308 4321

Limacodidae A family of moths 4 4105 4170 4237

Lucanidae The family of stag beetles 240 2142 4354 4719

Lycaenidae A large family of butterflies 53 4068 4268 4369

Microscope slides Benthic Foraminifera in a rock thin section 105 3781 4045 4332

Microscope slides (large) Benthic Foraminifera in a rock thin section 13 3783 3878 4084

Mixed moths and butterflies Mixed moths and butterflies collected in Madagascar 2 4405 6134 7864

Mycalesina A group of satyrid butterflies 4 4004 4295 4553

Neuroptera Lacewings and their relatives 2 4301 4337 4373

doi101371journalpone0143402t002

Inselect Automating the Digitization of Natural History Collections

PLOS ONE | DOI101371journalpone0143402 November 23 2015 8 15

We refined the bounding boxes of 30 images picked at random The median time to refine was1085 s (minimum time 85s maximum time 4130s S5 Fig) and the number of boundingboxes detected by segmentation was not a good predictor of the actual number of specimens(S6 Fig)

Discussion

Ingestion and segmentation performanceIt took just 2 frac12 hours to ingest and segment just over 800 images which represents more thantwice the weekly output of a SatScan machine running at full capacity Some images containedoverlapping specimens (eg Fig 7) Not only are such images challenging for any segmentationalgorithm but the resulting cropped specimen images are of questionable use arguably thesedrawers should be re-curated and re-imaged As might be expected given the way that JPEGcompression works thumbnail size is a function of image complexity rather than size of the

Fig 4 Segmented image of moth specimens Example of a segmented image of Chalcosiinae (a subfamily of moths) specimens

doi101371journalpone0143402g004

Inselect Automating the Digitization of Natural History Collections

PLOS ONE | DOI101371journalpone0143402 November 23 2015 9 15

full-resolution scanned image (S2 Fig) We expected time to ingest (read full-resolution TIFFresize in memory and write JPEG thumbnail) to scale linearly with the size of the full-resolu-tion image file Variation in ingestion time (S3 Fig) could be due to blocked PC resources(CPU RAM hard-disk) but the PC used to carry out the tests has a high specification and theimage files were on their own physical hard disk that was not being used for other tasks JPEGcompression speed is more likely to explain the variation given that this is correlated withimage complexity and that this complexity is highly variable across each drawer

Fig 5 Segmented image of beetle specimens Example of a segmented image of Lucanidae (stag beetles) specimens

doi101371journalpone0143402g005

Inselect Automating the Digitization of Natural History Collections

PLOS ONE | DOI101371journalpone0143402 November 23 2015 10 15

Application in natural history collection digitization workflowsNelson et al [12] described three dominant digitization workflows for natural history collec-tions (1) data capture with occasional specimen imaging (2) parallel data and specimen imagecapture and (3) imaging of specimens and labels followed by data capture from the image Weconsider the third workflow as the most efficient process for mass digitization of very large col-lections This allows operators to perform simultaneous image and data capture for multiplespecimens thus significantly increasing throughput (S7 Fig)

Inselect was developed primarily to suit the needs of the mass digitization program withinthe Natural History Museum and can we hope also be used by the many organisations withcollections that share the following characteristics

bull extremely large size

bull reasonably complete taxonomic index (list of taxa represented in the collection)

bull complete record of collection lots (ie multi-specimen and mixed taxon collections) and

bull very low percentage of specimen-level records

Fig 6 Segmented image of microscope slides Example of a segmented image of microscope slides of benthic Foraminifera in a rock thin section

doi101371journalpone0143402g006

Inselect Automating the Digitization of Natural History Collections

PLOS ONE | DOI101371journalpone0143402 November 23 2015 11 15

Under these circumstances one of the most pressing priorities is ardquobroad-and-thinrdquoapproach to digitization the collection of essential specimen-level data allowing complete col-lection audit and providing a specimen level platform to add metadata in future At a minimumthis includes the specimens determination (ie taxon name) and its physical location withinthe collection Inselect has proven to be a very useful tool for the most challenging parts of theNHMrsquos collections such as pinned insects and it can be easily applied to other areas for exam-ple environmental studies and quantitative analysis of trap samples (eg ldquoinvertebrate soupsrdquoor sticky traps see S8 Fig)

In the course of the NHM Slide Digitization Pilot project which will digitize 100000microscope slides in eight months Inselect received extensive user acceptance testing Resultsto date show that throughput is as high as 5000 slides per day per person for processingmulti-slide images through Inselect This includes tasks associated with image segmentation

Fig 7 Segmented image containing overlapping specimens Example of segmented image of Mycalesina (a group of butterflies)

doi101371journalpone0143402g007

Inselect Automating the Digitization of Natural History Collections

PLOS ONE | DOI101371journalpone0143402 November 23 2015 12 15

and refinement barcode recognition and association with minimal metadata (taxon name andphysical location in the collection) Upon completion of the project the entire set of time andmotion studies alongside the associated workflow will be described in a separate publicationDrawers of curated pinned insects as a rule do not require additional preparation thereforethe imaging output can be up to 70 SatScan images per day resulting in 3500ndash70000 speci-mens per day available for Inselect

Future developmentsSegmentation algorithms The high throughput of mass-digitization activities makes it

important to minimize the amount of manual intervention required The NHMrsquos SatScaninstrument can generate up to 70 multi-specimen images per day The median user-timerequired to refine segmented images of 109 s (S5 Fig) means that 70 x 109 3600 = 21 person-hours could be required to refine bounding boxes for a dayrsquos worth of images from a SatScanmachine In the worst case (413 s) this refinement time increases to more than eight hoursTherefore segmentation algorithms should be as accurate as possible and we suggest that thereis a need for a formal method (and supporting software) that allows segmentation methodsand their associated parameter sets to be scored and ranked Such a score should consider per-formance false positives and false negatives The outputs of such an activity might be a libraryof algorithms andor parameter sets geared towards different specimen types The dataset of804 images used in the present work (available at httpdxdoiorg1055190018537) consti-tutes a benchmark dataset against which segmentation algorithms can be measured Inselectrsquosmodular architecture and its provision of plugins make it a suitable platform for such aninvestigation

Desktop application The desktop application lacks some of the polish that is expected ofmodern software such as lsquoundorsquo and localization Other desirable features include the ability tofilter and order bounding boxes by size andor area in order to aid refinement support forExchangeable Image File Format (EXIF) tags and integration with industry-standard imageprocessing tools such as Adobe Photoshop (httpwwwadobecomproductsphotoshophtml)The plugin architecture makes a possible range of developments such as additional segmenta-tion algorithms and optical character recognition of label text within bounding boxes

Inselect has been tested using specimens from the NHMrsquos entomological and micropalaeon-tological collections as well as a limited number of specimens from Continental European col-lections We would like to test the software against a greater diversity of museum specimensand institutions to ensure that it can accommodate variation in the storage and mounting ofthese specimens

Despite these limitations Inselect represents a substantial contribution to the tools availableto support mass-digitization of natural history collections The desktop application and itsassociated command-line tools have been designed to efficiently handle the high numbers oflarge image files produced by mass-digitization activities The combination of a modular archi-tecture desktop application and scriptable technology makes it a relatively simple task to inte-grate Inselect into existing and workflows Bug reports feature requests and ideas can beviewed and created at httpsgithubcomNaturalHistoryMuseuminselectissues We areactively developing Inselect and we greatly value all comments and suggestions

Supporting InformationS1 Fig A cropped specimen image containing a barcode A scan of a microscope slide thatcontains a Data Matrix barcode(TIFF)

Inselect Automating the Digitization of Natural History Collections

PLOS ONE | DOI101371journalpone0143402 November 23 2015 13 15

S2 Fig Thumbnail file size against file size Numbers in brackets in the legend are the num-ber of files in that group(EPS)

S3 Fig Ingestion time against file size Ingestion time is the time taken to read the scannedimage and save a JPEG thumbnail image of 4096 pixels in width Numbers in brackets in thelegend are the number of files in that group(EPS)

S4 Fig Segmentation time against number of boxes found Numbers in brackets in the leg-end are the number of files in that group(EPS)

S5 Fig Distribution of the time to refine 30 segmented images For 30 images picked at ran-dom(EPS)

S6 Fig The actual number of specimens against the number of boxes detected by segmenta-tion For 30 images picked at random(EPS)

S7 Fig Imaging digitization workflow a Original single-specimen imaging and data capture(after Nelson 2012 modified) b multi-specimen imaging and data capture(EPS)

S8 Fig Alternative use of Inselect for trap sample Count estimation (1064 specimens) usingInselect on an image of a yellow sticky trap used in an environmental assessment study(TIFF)

AcknowledgmentsWe thank Francois Malan for the modified ldquoDivide Scanned Imagesrdquo software (httpfrancoismalancom201301how-to-batch-separate-crop-multiple-scanned-photos) whichprovided the inspiration for developing Inselect Gabriel Brostow and Michael Terry for helpfuldiscussions about computer vision Ben Scott for advice on sustainable software developmentJoumlrg Holetschek and Gwenaeumll Le Bras for application testing and feedback NHM curatorsBlanca Huertas and Geoff Martin for access to images from the Lepidoptera collections volun-teers Robyn Crowther Alexander Esin James Fage Sophie Ledger and Gabriela Montejo fordevoting their time to digitize the test specimens Duncan Sivell for the yellow sticky trapimage used in S8 Fig

Author ContributionsConceived and designed the experiments LNH Performed the experiments LNH Analyzedthe data LNH Contributed reagentsmaterialsanalysis tools VB BWP Wrote the paper LNHVB AH PH LL BWP SVDWVSS Identified the need for the application VB LL BWP VSSWrote the segmentation algorithm and designed and wrote the prototype desktop applicationPH SVDW Worked on the prototype application AH Re-engineered the application LNHContributed towards the design of the desktop application and workflow tools LNH VB AHPH LL BWP SVDWVSS

Inselect Automating the Digitization of Natural History Collections

PLOS ONE | DOI101371journalpone0143402 November 23 2015 14 15

References1 Arintildeo AH Approaches to estimating the universe of natural history collections data Biodiversity Infor-

matics 2010 7 81ndash92 doi 1017161biv7i23991

2 GrahamCH Ferrier S Huettman F Moritz C Peterson AT New developments in museum-based infor-matics and applications in biodiversity analysis Trends Ecol Evol 2004 19 497ndash503 doi 101016jtree200407006 PMID 16701313

3 Suarez AV Tsutsui ND The value of museum collections for research and society BioScience 200454 66ndash74 doi 1016410006-3568(2004)054[0066TVOMCF]20CO2

4 Lister A M Brooks SJ Fenberg PB Glover AG James KE Johnson KG et al Natural history collec-tions as sources of long-term datasets Trends Ecol Evol 2011 26 153ndash154 doi 101016jtree201012009 PMID 21255862

5 Rocha LA Aleixo A Allen G Almeda F Baldwin CC Barclay MV et al Specimen collection an essen-tial tool Science 2014 344 814ndash5 doi 101126science3446186814 PMID 24855245

6 Beck J Kitching IJ Estimating regional species richness of tropical insects frommuseum data a com-parison of a geography-based and sample-based methods J Appl Ecol 2007 44 672ndash681 doi 101111j1365-2664200701291x

7 Newbold T Applications and limitations of museum data for conservation and ecology with particularattention to species distribution models Prog Phys Geog 2010 34 3ndash22 doi 1011770309133309355630

8 Cheng TL Rovito SM Wake DB Vredenburg VT Coincident mass extirpation of neotropical amphibi-ans with the emergence of the infectious fungal pathogen Batrachochytrium dendrobatidis Proc NatlAcad Sci USA 2011 108 9502ndash9507 doi 101073pnas1105538108 PMID 21543713

9 Brooks SJ Self A Toloni F Sparks T Natural history museum collections provide information on phe-nological change in British butterflies since the late-nineteenth century Int J Biometeorol 2014 581749ndash1758 doi 101007s00484-013-0780-6 PMID 24429705

10 Beaman RS Cellinese N Mass digitization of scientific collections new opportunities to transform theuse of biological specimens and underwrite biodiversity science In Blagoderov V Smith VS editorsNo specimen left behind mass digitization of natural history collections ZooKeys 2012 209 7ndash17doi 103897zookeys2093313

11 Blagoderov V Kitching IJ Livermore L Simonsen TJ Smith VS No specimen left behind industrialscale digitization of natural history collections In Blagoderov V Smith VS editors No specimen leftbehind mass digitization of natural history collections ZooKeys 2012 209 133ndash146 doi 103897zookeys2093178

12 Nelson G Paul D Riccardi G Mast AR Five task clusters that enable efficient and effective digitizationof biological collections In Blagoderov V Smith VS editors No specimen left behind mass digitizationof natural history collections ZooKeys 2012 209 19ndash45 doi 103897zookeys2093135

13 Holovachov O Zatushevsky A Shydlovsky I Whole-drawer imaging of entomological collections ben-efits limitations and alternative applications J of Conserv and Mus Stud 2014 12 1ndash13 doi 105334jcms1021218

14 Mantle BL La Salle J Fisher N Whole-drawer imaging for digital management and curation of a largeentomological collection In Blagoderov V Smith VS editors No specimen left behind mass digitiza-tion of natural history collections ZooKeys 2012 209 147ndash163 doi 103897zookeys2093169

15 Dietrich CH Hart J Raila D Ravaioli U Sobh N Sobh O et al InvertNet a new paradigm for digitalaccess to invertebrate collections In Blagoderov V Smith VS editors No specimen left behind massdigitization of natural history collections ZooKeys 2012 209 165ndash181 doi 103897zookeys2093571

16 Schmidt S Balke M Lafogler S DScanmdasha high-performance digital scanning system for entomologicalcollections In Blagoderov V Smith VS editors No specimen left behind mass digitization of naturalhistory collections ZooKeys 2012 209 183ndash191 doi 103897zookeys2093115

17 Wieczorek J Bloom D Guralnick R Blum S Doumlring M Giovanni R et al Darwin Core An EvolvingCommunity-Developed Biodiversity Data Standard PLoS ONE 2012 7(1) e29715 doi 101371journalpone0029715 PMID 22238640

Inselect Automating the Digitization of Natural History Collections

PLOS ONE | DOI101371journalpone0143402 November 23 2015 15 15

Source code installers and open issues are at httpsgithubcomNaturalHistoryMuseuminselect We describe Inselect assess its performance and shortcomings and make recommen-dations for future developments

Desktop applicationAn Inselect document is made up of original full-resolution scanned image (all commonlyencountered file formats are supported) a lower-resolution Joint Photographic Experts Group(JPEG) thumbnail (customizable dimensions default of 4096 pixels in width) and a list ofbounding boxes together with their associated metadata Inselect presents two views of thesedata each designed with different tasks in mind The lsquoBoxesrsquo view (Fig 1) shows the completeimage together with the bounding box around each individual specimen The lsquoSegmentrsquo com-mands runs an automatic segmentation algorithm which detects individual specimens andreplaces existing bounding boxes The user can then create delete move and resize boxes usingthe mouse andor keyboard making it a simple task to refine the results of the segmentationprocess The panel on the right contains metadata fields

The user has complete control over the list of fields and any associated validation In theEdit menu lsquoChoose templatersquo allows the user to select an lsquoinselect_templatersquo file that containsmetadata fields definitions Templates are written in YAML (YAML Ainrsquot a Markup Language-httpyamlorg)mdasha structured text format that is easy to learn and that can be edited using aplain-text editor Fig 1 shows a template called lsquoHymenopterarsquo with one numeric field (lsquoCatalognumberrsquo which can be populated by values of object barcodesndashsee below) and three fields with

Fig 1 The lsquoBoxesrsquo view This view shows the scanned image together with bounding boxes around specimens This document is a scan of 1160 individualsof Tetrastichinae (a subfamily of wasps) The boxes for one tray of specimens are selected shown outlined in light blue

doi101371journalpone0143402g001

Inselect Automating the Digitization of Natural History Collections

PLOS ONE | DOI101371journalpone0143402 November 23 2015 4 15

drop-down lists of values The metadata fields reflect the currently selected boxes making iteasy to enter metadata for a single specimen a group of specimens or to all the specimens inthe initial image (eg a taxon name or geographic location) The 131 selected boxes (Fig 1)have the same values for lsquoLocationrsquo lsquoFamilyrsquo and lsquoSubfamilyrsquo but different values of lsquoCatalognumberrsquo The template specifies that each of these four fields is mandatory Any boxes that failvalidation (eg missing mandatory values) are shown with a red backgroundmdashthe first of the131 selected boxes in Fig 1 is shown in red because it lacks a value of lsquoCatalog numberrsquo Inselecttemplates permit comprehensive field validation such as lsquoan integer value greater than zerorsquo lsquoalatitudersquo lsquoa longitudersquo and lsquoa date in the form YYYY-MM-DDrsquo For more complex cases fieldvalidation can be given as a regular expression For example the NHM templates use theregular expression ^[0ndash9]9$mdashexactly nine digits with no letters no punctuation and noleading or trailing whitespacendashfor the lsquoCatalog numberrsquo field The user can specify other prop-erties in the template such as the width of the low-resolution thumbnail image (default of4096 pixels) A complete description of the format along with example templates that are usedfor NHMrsquos digitization projects are available in the github repository httpsgithubcomNaturalHistoryMuseuminselect-templates The built-in Simple Darwin Core terms templatewhich contains all Simple Darwin Core terms (httprstdwgorgdwctermssimple [17]) canbe used by selecting the lsquoDefault templatelsquo command under the Edit menu Metadata can beexported to comma-separated values (CSV) files and included in the file name of segmentedimages

The lsquoObjectsrsquo view (Fig 2) shows individual images either in a grid or with a single imageexpanded The first box lacks a value of lsquoCatalog numberrsquo and so is shown with a red back-ground The user can rotate images individually or in groups making it easy to transcribe labelinformation into metadata fields Rotation is also applied to the cropped object images whenthese are saved

Inselect displays the low-resolution thumbnail image which is small in size quick to readand takes up relatively little space in-memory the full-resolution file (which might be manyhundreds of megabytes in size) is loaded only as required for example when saving the individ-ual cropped specimen images

The desktop application supports pluginsmdashcode modules that are able to examine and pos-sibly modify the list of bounding boxes and their associated metadata Plugins can access thelow-resolution thumbnail image and if necessary the full-resolution scanned image The soft-ware currently has plugins for automated segmentation of the entire image and for sub-seg-mentation of a single bounding box (see lsquoSegmentation algorithmsrsquo below)

Many institutions use barcodes to uniquely identify specimens Inselect therefore provides alsquoRead barcodesrsquo plugin which reads the values of any barcode(s) within each box and placesvalue(s) in the lsquoCatalog numberrsquometadata field Barcodes typically take up just a small fractionof the area of an image (eg S1 Fig) they can be smudged or damaged and can be placed at anangle making it a non-trivial task to quickly and reliably detect and decode barcodes Inselectincludes two open-source libraries zbar (httpzbarsourceforgenet) which reads one-dimen-sional barcodes and QR codes and libdmtx (httpwwwlibdmtxorg) which reads DataMatrix barcodes We found that commercial libraries were faster and more reliable than thetwo open-source decoders Inselectrsquos lsquoRead barcodesrsquo plugin therefore also supports the bestperforming of the commercial librariesmdashInlite Clearimage (purchase or download for evalua-tion from httpwwwinliteresearchcombarcode-recognition) The user can select which ofthese libraries to use by selecting the ldquoConfigure lsquoRead Barcodesrsquordquo command under the Editmenu

Inselect Automating the Digitization of Natural History Collections

PLOS ONE | DOI101371journalpone0143402 November 23 2015 5 15

Command-line toolsEach command-line tool makes available some of the functionality of the desktop applicationin a form that is convenient for unattended processing of images in batches

bull ingest reads each scanned image creates and saves an empty Inselect document along with athumbnail image

bull segment runs the segmentation algorithm for each Inselect document that does not alreadycontain bounding boxes

bull save_crops for each Inselect document writes specimen images cropped from the high-reso-lution image and

bull export_metadata for each Inselect document writes a CSV file containing metadata

Each of these tools corresponds to a shaded box in the typical Inselect workflow shown inFig 3

Segmentation algorithmsBoth of Inselectrsquos algorithms operate on thumbnail images The automatic segmentation algo-rithm converts the image to the CIELAB (Commission internationale de leacuteclairage Lab) col-our space and then adds Gaussian blur in order to remove noise It then applies Sobel filters inx and y directions and applies a threshold resulting in a binary image (ie pixels are either lsquooffrsquoor lsquoonrsquo) where lsquoonrsquo indicates that an edge in the source image The algorithm then detects

Fig 2 The lsquoObjectsrsquo view This view shows objects in a grid The selected object lacks a mandatory metadata field so is shown in red

doi101371journalpone0143402g002

Inselect Automating the Digitization of Natural History Collections

PLOS ONE | DOI101371journalpone0143402 November 23 2015 6 15

contours around each edge and computes the bounding box around each contour Contoursare processed recursively in order to detect edges-within-edges such as specimens within insecttrays The result of the algorithm is a list of bounding boxes

The sub-segmentation algorithm is applied by the user to a single bounding box that con-tains many specimensmdasha situation that can arise when the automatic segmentation algorithmwas unable to discriminate between specimens The user marks each individual specimenwithin a box using shift+left mouse click The sub-segmentation algorithm applies a watershedtechnique in which the image is considered to be a topographical surface with peaks and val-leys each lsquovalleyrsquo (indicated by a user-designated marker) is lsquofilledrsquo with a different colourlsquowaterrsquo until all lsquopeaksrsquo are submerged The resulting lsquolakesrsquo of different colours indicate theextent of each specimen The result of the algorithm is a list of bounding boxes

Based on an initial period of exploration with a variety of images from the NHMrsquos collec-tion all free parameters of both algorithms were hard-coded within the Inselect software

Materials and Methods

Test imagesWe evaluated the performance of the software using 804 multi-specimen Tagged Image FileFormat (TIFF) images of specimens from the NHMrsquos collections Images were captured usingthe SmartDrive SatScan (httpwwwsmartdrivecouk) collection scanner which is capable ofproducing high-resolution images of entire collection drawers A camera (UEye-SE USBCMOS model UI-1480SE-C-HQ 2560times1920 resolution) and an attached lens (Edmund Opticstelecentric TML lenses model 58428 03times or model 56675 016times) is moved in two dimensionsalong precision-engineered rails positioned above the objects that are to be imaged A combi-nation of hardware and software provides automated capture of high-resolution images ofsmall regions of interest which are then assembled (ldquostitchedrdquo) into a single panoramic imageby proprietary software (Analyse by SmartDrive) This method maximizes depth of field of thecaptured images and minimizes distortion and parallax artefacts

We used scanned images of pinned insects stored in collection drawers of between 400 x500 mm and 555 x 572 mm in size with or without unit trays Some of the scanned images con-tain in areas where no specimens are present paper with a printed Penrose tiles patternndashthesewere added in order to aid earlier versions of the stitching algorithm We also tested Inselectusing scans of standard-size microscope slides laid out for imaging in a rectangular grid con-taining 72 sockets arranged in six columns and twelve rows and large-sized microscope slidesarranged in a grid of six columns and eight rows some sockets were empty in some scans Wemake the thumbnail images (on which the segmentation algorithm operates) of our completetest dataset available at httpdxdoiorg1055190018537

PerformanceFor each image we computed or measured

Fig 3 Typical workflow All processes are offered by the desktop application Shaded boxes indicate processes that can be performed by the command-line tools which can work on batches of images and documents

doi101371journalpone0143402g003

Inselect Automating the Digitization of Natural History Collections

PLOS ONE | DOI101371journalpone0143402 November 23 2015 7 15

bull the dimensions of the scanned TIFF image (in pixels)

bull the size of the scanned image file (in MB)

bull the time to ingest (ie read the scanned image save a JPEG thumbnail image of 4096 pixelsin width and create an empty Inselect document)

bull the size of the thumbnail image file (in MB)

bull the time to segment and

bull the number of boxes found by segmentation

We picked 30 images at random and manually refined the bounding boxes that weredetected by the segmentation algorithm This involved correcting false positives (removingboxes where there was no specimen) false negatives (creating boxes where specimens did nothave one) and adjusting the size of boxes that did not encompass the entire specimen and asso-ciated labels We recorded the time taken to refine the bounding boxes and the actual numberof specimens on the image

All tests were carried out on an eight-CPU Dell Precision T3610 workstation with 32GBRAM running Ubuntu Linux 1404 (httpwwwubuntucom) Python 279 NumPy 191OpenCV 249 and QT 486 The version-control tags for these two experiments are lsquoperfor-mance-experiment-1rsquo and lsquoexperiment-1-refinementrsquo respectively

ResultsThe mean dimensions of the scanned images were 18131 x 15268 pixels The complete set ofscanned images took up 341GB on disk file sizes varied between 111MB and 796MB median429MB (Table 2) Examples of segmented images are shown in Figs 4ndash6 Median thumbnailimages were nearly two orders of magnitude smaller than the full-resolution scanned images(S2 Fig) The whole experiment took 2 hours 29 minutes to run Ingestion times are shown inS3 Fig Segmentation time is not explained by the number of bounding boxes detected (S4 Fig)

Table 2 Scanned images by group

Group Description N images File sizes (MB)

Min Median Max

Brahmaeidae A family of moths 30 7211 7345 7960

Cerambycidae The family of longhorn beetles 2 1109 1155 1201

Chalcidoidea A superfamily of wasps 2 4199 4232 4266

Chalcosiinae A subfamily of moths 271 3960 4310 4690

Charaxinae A subfamily of butterflies 67 3683 4283 5043

Coccinellidae Ladybirds 7 4015 4140 4152

Embioptera Webspinners 2 4295 4308 4321

Limacodidae A family of moths 4 4105 4170 4237

Lucanidae The family of stag beetles 240 2142 4354 4719

Lycaenidae A large family of butterflies 53 4068 4268 4369

Microscope slides Benthic Foraminifera in a rock thin section 105 3781 4045 4332

Microscope slides (large) Benthic Foraminifera in a rock thin section 13 3783 3878 4084

Mixed moths and butterflies Mixed moths and butterflies collected in Madagascar 2 4405 6134 7864

Mycalesina A group of satyrid butterflies 4 4004 4295 4553

Neuroptera Lacewings and their relatives 2 4301 4337 4373

doi101371journalpone0143402t002

Inselect Automating the Digitization of Natural History Collections

PLOS ONE | DOI101371journalpone0143402 November 23 2015 8 15

We refined the bounding boxes of 30 images picked at random The median time to refine was1085 s (minimum time 85s maximum time 4130s S5 Fig) and the number of boundingboxes detected by segmentation was not a good predictor of the actual number of specimens(S6 Fig)

Discussion

Ingestion and segmentation performanceIt took just 2 frac12 hours to ingest and segment just over 800 images which represents more thantwice the weekly output of a SatScan machine running at full capacity Some images containedoverlapping specimens (eg Fig 7) Not only are such images challenging for any segmentationalgorithm but the resulting cropped specimen images are of questionable use arguably thesedrawers should be re-curated and re-imaged As might be expected given the way that JPEGcompression works thumbnail size is a function of image complexity rather than size of the

Fig 4 Segmented image of moth specimens Example of a segmented image of Chalcosiinae (a subfamily of moths) specimens

doi101371journalpone0143402g004

Inselect Automating the Digitization of Natural History Collections

PLOS ONE | DOI101371journalpone0143402 November 23 2015 9 15

full-resolution scanned image (S2 Fig) We expected time to ingest (read full-resolution TIFFresize in memory and write JPEG thumbnail) to scale linearly with the size of the full-resolu-tion image file Variation in ingestion time (S3 Fig) could be due to blocked PC resources(CPU RAM hard-disk) but the PC used to carry out the tests has a high specification and theimage files were on their own physical hard disk that was not being used for other tasks JPEGcompression speed is more likely to explain the variation given that this is correlated withimage complexity and that this complexity is highly variable across each drawer

Fig 5 Segmented image of beetle specimens Example of a segmented image of Lucanidae (stag beetles) specimens

doi101371journalpone0143402g005

Inselect Automating the Digitization of Natural History Collections

PLOS ONE | DOI101371journalpone0143402 November 23 2015 10 15

Application in natural history collection digitization workflowsNelson et al [12] described three dominant digitization workflows for natural history collec-tions (1) data capture with occasional specimen imaging (2) parallel data and specimen imagecapture and (3) imaging of specimens and labels followed by data capture from the image Weconsider the third workflow as the most efficient process for mass digitization of very large col-lections This allows operators to perform simultaneous image and data capture for multiplespecimens thus significantly increasing throughput (S7 Fig)

Inselect was developed primarily to suit the needs of the mass digitization program withinthe Natural History Museum and can we hope also be used by the many organisations withcollections that share the following characteristics

bull extremely large size

bull reasonably complete taxonomic index (list of taxa represented in the collection)

bull complete record of collection lots (ie multi-specimen and mixed taxon collections) and

bull very low percentage of specimen-level records

Fig 6 Segmented image of microscope slides Example of a segmented image of microscope slides of benthic Foraminifera in a rock thin section

doi101371journalpone0143402g006

Inselect Automating the Digitization of Natural History Collections

PLOS ONE | DOI101371journalpone0143402 November 23 2015 11 15

Under these circumstances one of the most pressing priorities is ardquobroad-and-thinrdquoapproach to digitization the collection of essential specimen-level data allowing complete col-lection audit and providing a specimen level platform to add metadata in future At a minimumthis includes the specimens determination (ie taxon name) and its physical location withinthe collection Inselect has proven to be a very useful tool for the most challenging parts of theNHMrsquos collections such as pinned insects and it can be easily applied to other areas for exam-ple environmental studies and quantitative analysis of trap samples (eg ldquoinvertebrate soupsrdquoor sticky traps see S8 Fig)

In the course of the NHM Slide Digitization Pilot project which will digitize 100000microscope slides in eight months Inselect received extensive user acceptance testing Resultsto date show that throughput is as high as 5000 slides per day per person for processingmulti-slide images through Inselect This includes tasks associated with image segmentation

Fig 7 Segmented image containing overlapping specimens Example of segmented image of Mycalesina (a group of butterflies)

doi101371journalpone0143402g007

Inselect Automating the Digitization of Natural History Collections

PLOS ONE | DOI101371journalpone0143402 November 23 2015 12 15

and refinement barcode recognition and association with minimal metadata (taxon name andphysical location in the collection) Upon completion of the project the entire set of time andmotion studies alongside the associated workflow will be described in a separate publicationDrawers of curated pinned insects as a rule do not require additional preparation thereforethe imaging output can be up to 70 SatScan images per day resulting in 3500ndash70000 speci-mens per day available for Inselect

Future developmentsSegmentation algorithms The high throughput of mass-digitization activities makes it

important to minimize the amount of manual intervention required The NHMrsquos SatScaninstrument can generate up to 70 multi-specimen images per day The median user-timerequired to refine segmented images of 109 s (S5 Fig) means that 70 x 109 3600 = 21 person-hours could be required to refine bounding boxes for a dayrsquos worth of images from a SatScanmachine In the worst case (413 s) this refinement time increases to more than eight hoursTherefore segmentation algorithms should be as accurate as possible and we suggest that thereis a need for a formal method (and supporting software) that allows segmentation methodsand their associated parameter sets to be scored and ranked Such a score should consider per-formance false positives and false negatives The outputs of such an activity might be a libraryof algorithms andor parameter sets geared towards different specimen types The dataset of804 images used in the present work (available at httpdxdoiorg1055190018537) consti-tutes a benchmark dataset against which segmentation algorithms can be measured Inselectrsquosmodular architecture and its provision of plugins make it a suitable platform for such aninvestigation

Desktop application The desktop application lacks some of the polish that is expected ofmodern software such as lsquoundorsquo and localization Other desirable features include the ability tofilter and order bounding boxes by size andor area in order to aid refinement support forExchangeable Image File Format (EXIF) tags and integration with industry-standard imageprocessing tools such as Adobe Photoshop (httpwwwadobecomproductsphotoshophtml)The plugin architecture makes a possible range of developments such as additional segmenta-tion algorithms and optical character recognition of label text within bounding boxes

Inselect has been tested using specimens from the NHMrsquos entomological and micropalaeon-tological collections as well as a limited number of specimens from Continental European col-lections We would like to test the software against a greater diversity of museum specimensand institutions to ensure that it can accommodate variation in the storage and mounting ofthese specimens

Despite these limitations Inselect represents a substantial contribution to the tools availableto support mass-digitization of natural history collections The desktop application and itsassociated command-line tools have been designed to efficiently handle the high numbers oflarge image files produced by mass-digitization activities The combination of a modular archi-tecture desktop application and scriptable technology makes it a relatively simple task to inte-grate Inselect into existing and workflows Bug reports feature requests and ideas can beviewed and created at httpsgithubcomNaturalHistoryMuseuminselectissues We areactively developing Inselect and we greatly value all comments and suggestions

Supporting InformationS1 Fig A cropped specimen image containing a barcode A scan of a microscope slide thatcontains a Data Matrix barcode(TIFF)

Inselect Automating the Digitization of Natural History Collections

PLOS ONE | DOI101371journalpone0143402 November 23 2015 13 15

S2 Fig Thumbnail file size against file size Numbers in brackets in the legend are the num-ber of files in that group(EPS)

S3 Fig Ingestion time against file size Ingestion time is the time taken to read the scannedimage and save a JPEG thumbnail image of 4096 pixels in width Numbers in brackets in thelegend are the number of files in that group(EPS)

S4 Fig Segmentation time against number of boxes found Numbers in brackets in the leg-end are the number of files in that group(EPS)

S5 Fig Distribution of the time to refine 30 segmented images For 30 images picked at ran-dom(EPS)

S6 Fig The actual number of specimens against the number of boxes detected by segmenta-tion For 30 images picked at random(EPS)

S7 Fig Imaging digitization workflow a Original single-specimen imaging and data capture(after Nelson 2012 modified) b multi-specimen imaging and data capture(EPS)

S8 Fig Alternative use of Inselect for trap sample Count estimation (1064 specimens) usingInselect on an image of a yellow sticky trap used in an environmental assessment study(TIFF)

AcknowledgmentsWe thank Francois Malan for the modified ldquoDivide Scanned Imagesrdquo software (httpfrancoismalancom201301how-to-batch-separate-crop-multiple-scanned-photos) whichprovided the inspiration for developing Inselect Gabriel Brostow and Michael Terry for helpfuldiscussions about computer vision Ben Scott for advice on sustainable software developmentJoumlrg Holetschek and Gwenaeumll Le Bras for application testing and feedback NHM curatorsBlanca Huertas and Geoff Martin for access to images from the Lepidoptera collections volun-teers Robyn Crowther Alexander Esin James Fage Sophie Ledger and Gabriela Montejo fordevoting their time to digitize the test specimens Duncan Sivell for the yellow sticky trapimage used in S8 Fig

Author ContributionsConceived and designed the experiments LNH Performed the experiments LNH Analyzedthe data LNH Contributed reagentsmaterialsanalysis tools VB BWP Wrote the paper LNHVB AH PH LL BWP SVDWVSS Identified the need for the application VB LL BWP VSSWrote the segmentation algorithm and designed and wrote the prototype desktop applicationPH SVDW Worked on the prototype application AH Re-engineered the application LNHContributed towards the design of the desktop application and workflow tools LNH VB AHPH LL BWP SVDWVSS

Inselect Automating the Digitization of Natural History Collections

PLOS ONE | DOI101371journalpone0143402 November 23 2015 14 15

References1 Arintildeo AH Approaches to estimating the universe of natural history collections data Biodiversity Infor-

matics 2010 7 81ndash92 doi 1017161biv7i23991

2 GrahamCH Ferrier S Huettman F Moritz C Peterson AT New developments in museum-based infor-matics and applications in biodiversity analysis Trends Ecol Evol 2004 19 497ndash503 doi 101016jtree200407006 PMID 16701313

3 Suarez AV Tsutsui ND The value of museum collections for research and society BioScience 200454 66ndash74 doi 1016410006-3568(2004)054[0066TVOMCF]20CO2

4 Lister A M Brooks SJ Fenberg PB Glover AG James KE Johnson KG et al Natural history collec-tions as sources of long-term datasets Trends Ecol Evol 2011 26 153ndash154 doi 101016jtree201012009 PMID 21255862

5 Rocha LA Aleixo A Allen G Almeda F Baldwin CC Barclay MV et al Specimen collection an essen-tial tool Science 2014 344 814ndash5 doi 101126science3446186814 PMID 24855245

6 Beck J Kitching IJ Estimating regional species richness of tropical insects frommuseum data a com-parison of a geography-based and sample-based methods J Appl Ecol 2007 44 672ndash681 doi 101111j1365-2664200701291x

7 Newbold T Applications and limitations of museum data for conservation and ecology with particularattention to species distribution models Prog Phys Geog 2010 34 3ndash22 doi 1011770309133309355630

8 Cheng TL Rovito SM Wake DB Vredenburg VT Coincident mass extirpation of neotropical amphibi-ans with the emergence of the infectious fungal pathogen Batrachochytrium dendrobatidis Proc NatlAcad Sci USA 2011 108 9502ndash9507 doi 101073pnas1105538108 PMID 21543713

9 Brooks SJ Self A Toloni F Sparks T Natural history museum collections provide information on phe-nological change in British butterflies since the late-nineteenth century Int J Biometeorol 2014 581749ndash1758 doi 101007s00484-013-0780-6 PMID 24429705

10 Beaman RS Cellinese N Mass digitization of scientific collections new opportunities to transform theuse of biological specimens and underwrite biodiversity science In Blagoderov V Smith VS editorsNo specimen left behind mass digitization of natural history collections ZooKeys 2012 209 7ndash17doi 103897zookeys2093313

11 Blagoderov V Kitching IJ Livermore L Simonsen TJ Smith VS No specimen left behind industrialscale digitization of natural history collections In Blagoderov V Smith VS editors No specimen leftbehind mass digitization of natural history collections ZooKeys 2012 209 133ndash146 doi 103897zookeys2093178

12 Nelson G Paul D Riccardi G Mast AR Five task clusters that enable efficient and effective digitizationof biological collections In Blagoderov V Smith VS editors No specimen left behind mass digitizationof natural history collections ZooKeys 2012 209 19ndash45 doi 103897zookeys2093135

13 Holovachov O Zatushevsky A Shydlovsky I Whole-drawer imaging of entomological collections ben-efits limitations and alternative applications J of Conserv and Mus Stud 2014 12 1ndash13 doi 105334jcms1021218

14 Mantle BL La Salle J Fisher N Whole-drawer imaging for digital management and curation of a largeentomological collection In Blagoderov V Smith VS editors No specimen left behind mass digitiza-tion of natural history collections ZooKeys 2012 209 147ndash163 doi 103897zookeys2093169

15 Dietrich CH Hart J Raila D Ravaioli U Sobh N Sobh O et al InvertNet a new paradigm for digitalaccess to invertebrate collections In Blagoderov V Smith VS editors No specimen left behind massdigitization of natural history collections ZooKeys 2012 209 165ndash181 doi 103897zookeys2093571

16 Schmidt S Balke M Lafogler S DScanmdasha high-performance digital scanning system for entomologicalcollections In Blagoderov V Smith VS editors No specimen left behind mass digitization of naturalhistory collections ZooKeys 2012 209 183ndash191 doi 103897zookeys2093115

17 Wieczorek J Bloom D Guralnick R Blum S Doumlring M Giovanni R et al Darwin Core An EvolvingCommunity-Developed Biodiversity Data Standard PLoS ONE 2012 7(1) e29715 doi 101371journalpone0029715 PMID 22238640

Inselect Automating the Digitization of Natural History Collections

PLOS ONE | DOI101371journalpone0143402 November 23 2015 15 15

drop-down lists of values The metadata fields reflect the currently selected boxes making iteasy to enter metadata for a single specimen a group of specimens or to all the specimens inthe initial image (eg a taxon name or geographic location) The 131 selected boxes (Fig 1)have the same values for lsquoLocationrsquo lsquoFamilyrsquo and lsquoSubfamilyrsquo but different values of lsquoCatalognumberrsquo The template specifies that each of these four fields is mandatory Any boxes that failvalidation (eg missing mandatory values) are shown with a red backgroundmdashthe first of the131 selected boxes in Fig 1 is shown in red because it lacks a value of lsquoCatalog numberrsquo Inselecttemplates permit comprehensive field validation such as lsquoan integer value greater than zerorsquo lsquoalatitudersquo lsquoa longitudersquo and lsquoa date in the form YYYY-MM-DDrsquo For more complex cases fieldvalidation can be given as a regular expression For example the NHM templates use theregular expression ^[0ndash9]9$mdashexactly nine digits with no letters no punctuation and noleading or trailing whitespacendashfor the lsquoCatalog numberrsquo field The user can specify other prop-erties in the template such as the width of the low-resolution thumbnail image (default of4096 pixels) A complete description of the format along with example templates that are usedfor NHMrsquos digitization projects are available in the github repository httpsgithubcomNaturalHistoryMuseuminselect-templates The built-in Simple Darwin Core terms templatewhich contains all Simple Darwin Core terms (httprstdwgorgdwctermssimple [17]) canbe used by selecting the lsquoDefault templatelsquo command under the Edit menu Metadata can beexported to comma-separated values (CSV) files and included in the file name of segmentedimages

The lsquoObjectsrsquo view (Fig 2) shows individual images either in a grid or with a single imageexpanded The first box lacks a value of lsquoCatalog numberrsquo and so is shown with a red back-ground The user can rotate images individually or in groups making it easy to transcribe labelinformation into metadata fields Rotation is also applied to the cropped object images whenthese are saved

Inselect displays the low-resolution thumbnail image which is small in size quick to readand takes up relatively little space in-memory the full-resolution file (which might be manyhundreds of megabytes in size) is loaded only as required for example when saving the individ-ual cropped specimen images

The desktop application supports pluginsmdashcode modules that are able to examine and pos-sibly modify the list of bounding boxes and their associated metadata Plugins can access thelow-resolution thumbnail image and if necessary the full-resolution scanned image The soft-ware currently has plugins for automated segmentation of the entire image and for sub-seg-mentation of a single bounding box (see lsquoSegmentation algorithmsrsquo below)

Many institutions use barcodes to uniquely identify specimens Inselect therefore provides alsquoRead barcodesrsquo plugin which reads the values of any barcode(s) within each box and placesvalue(s) in the lsquoCatalog numberrsquometadata field Barcodes typically take up just a small fractionof the area of an image (eg S1 Fig) they can be smudged or damaged and can be placed at anangle making it a non-trivial task to quickly and reliably detect and decode barcodes Inselectincludes two open-source libraries zbar (httpzbarsourceforgenet) which reads one-dimen-sional barcodes and QR codes and libdmtx (httpwwwlibdmtxorg) which reads DataMatrix barcodes We found that commercial libraries were faster and more reliable than thetwo open-source decoders Inselectrsquos lsquoRead barcodesrsquo plugin therefore also supports the bestperforming of the commercial librariesmdashInlite Clearimage (purchase or download for evalua-tion from httpwwwinliteresearchcombarcode-recognition) The user can select which ofthese libraries to use by selecting the ldquoConfigure lsquoRead Barcodesrsquordquo command under the Editmenu

Inselect Automating the Digitization of Natural History Collections

PLOS ONE | DOI101371journalpone0143402 November 23 2015 5 15

Command-line toolsEach command-line tool makes available some of the functionality of the desktop applicationin a form that is convenient for unattended processing of images in batches

bull ingest reads each scanned image creates and saves an empty Inselect document along with athumbnail image

bull segment runs the segmentation algorithm for each Inselect document that does not alreadycontain bounding boxes

bull save_crops for each Inselect document writes specimen images cropped from the high-reso-lution image and

bull export_metadata for each Inselect document writes a CSV file containing metadata

Each of these tools corresponds to a shaded box in the typical Inselect workflow shown inFig 3

Segmentation algorithmsBoth of Inselectrsquos algorithms operate on thumbnail images The automatic segmentation algo-rithm converts the image to the CIELAB (Commission internationale de leacuteclairage Lab) col-our space and then adds Gaussian blur in order to remove noise It then applies Sobel filters inx and y directions and applies a threshold resulting in a binary image (ie pixels are either lsquooffrsquoor lsquoonrsquo) where lsquoonrsquo indicates that an edge in the source image The algorithm then detects

Fig 2 The lsquoObjectsrsquo view This view shows objects in a grid The selected object lacks a mandatory metadata field so is shown in red

doi101371journalpone0143402g002

Inselect Automating the Digitization of Natural History Collections

PLOS ONE | DOI101371journalpone0143402 November 23 2015 6 15

contours around each edge and computes the bounding box around each contour Contoursare processed recursively in order to detect edges-within-edges such as specimens within insecttrays The result of the algorithm is a list of bounding boxes

The sub-segmentation algorithm is applied by the user to a single bounding box that con-tains many specimensmdasha situation that can arise when the automatic segmentation algorithmwas unable to discriminate between specimens The user marks each individual specimenwithin a box using shift+left mouse click The sub-segmentation algorithm applies a watershedtechnique in which the image is considered to be a topographical surface with peaks and val-leys each lsquovalleyrsquo (indicated by a user-designated marker) is lsquofilledrsquo with a different colourlsquowaterrsquo until all lsquopeaksrsquo are submerged The resulting lsquolakesrsquo of different colours indicate theextent of each specimen The result of the algorithm is a list of bounding boxes

Based on an initial period of exploration with a variety of images from the NHMrsquos collec-tion all free parameters of both algorithms were hard-coded within the Inselect software

Materials and Methods

Test imagesWe evaluated the performance of the software using 804 multi-specimen Tagged Image FileFormat (TIFF) images of specimens from the NHMrsquos collections Images were captured usingthe SmartDrive SatScan (httpwwwsmartdrivecouk) collection scanner which is capable ofproducing high-resolution images of entire collection drawers A camera (UEye-SE USBCMOS model UI-1480SE-C-HQ 2560times1920 resolution) and an attached lens (Edmund Opticstelecentric TML lenses model 58428 03times or model 56675 016times) is moved in two dimensionsalong precision-engineered rails positioned above the objects that are to be imaged A combi-nation of hardware and software provides automated capture of high-resolution images ofsmall regions of interest which are then assembled (ldquostitchedrdquo) into a single panoramic imageby proprietary software (Analyse by SmartDrive) This method maximizes depth of field of thecaptured images and minimizes distortion and parallax artefacts

We used scanned images of pinned insects stored in collection drawers of between 400 x500 mm and 555 x 572 mm in size with or without unit trays Some of the scanned images con-tain in areas where no specimens are present paper with a printed Penrose tiles patternndashthesewere added in order to aid earlier versions of the stitching algorithm We also tested Inselectusing scans of standard-size microscope slides laid out for imaging in a rectangular grid con-taining 72 sockets arranged in six columns and twelve rows and large-sized microscope slidesarranged in a grid of six columns and eight rows some sockets were empty in some scans Wemake the thumbnail images (on which the segmentation algorithm operates) of our completetest dataset available at httpdxdoiorg1055190018537

PerformanceFor each image we computed or measured

Fig 3 Typical workflow All processes are offered by the desktop application Shaded boxes indicate processes that can be performed by the command-line tools which can work on batches of images and documents

doi101371journalpone0143402g003

Inselect Automating the Digitization of Natural History Collections

PLOS ONE | DOI101371journalpone0143402 November 23 2015 7 15

bull the dimensions of the scanned TIFF image (in pixels)

bull the size of the scanned image file (in MB)

bull the time to ingest (ie read the scanned image save a JPEG thumbnail image of 4096 pixelsin width and create an empty Inselect document)

bull the size of the thumbnail image file (in MB)

bull the time to segment and

bull the number of boxes found by segmentation

We picked 30 images at random and manually refined the bounding boxes that weredetected by the segmentation algorithm This involved correcting false positives (removingboxes where there was no specimen) false negatives (creating boxes where specimens did nothave one) and adjusting the size of boxes that did not encompass the entire specimen and asso-ciated labels We recorded the time taken to refine the bounding boxes and the actual numberof specimens on the image

All tests were carried out on an eight-CPU Dell Precision T3610 workstation with 32GBRAM running Ubuntu Linux 1404 (httpwwwubuntucom) Python 279 NumPy 191OpenCV 249 and QT 486 The version-control tags for these two experiments are lsquoperfor-mance-experiment-1rsquo and lsquoexperiment-1-refinementrsquo respectively

ResultsThe mean dimensions of the scanned images were 18131 x 15268 pixels The complete set ofscanned images took up 341GB on disk file sizes varied between 111MB and 796MB median429MB (Table 2) Examples of segmented images are shown in Figs 4ndash6 Median thumbnailimages were nearly two orders of magnitude smaller than the full-resolution scanned images(S2 Fig) The whole experiment took 2 hours 29 minutes to run Ingestion times are shown inS3 Fig Segmentation time is not explained by the number of bounding boxes detected (S4 Fig)

Table 2 Scanned images by group

Group Description N images File sizes (MB)

Min Median Max

Brahmaeidae A family of moths 30 7211 7345 7960

Cerambycidae The family of longhorn beetles 2 1109 1155 1201

Chalcidoidea A superfamily of wasps 2 4199 4232 4266

Chalcosiinae A subfamily of moths 271 3960 4310 4690

Charaxinae A subfamily of butterflies 67 3683 4283 5043

Coccinellidae Ladybirds 7 4015 4140 4152

Embioptera Webspinners 2 4295 4308 4321

Limacodidae A family of moths 4 4105 4170 4237

Lucanidae The family of stag beetles 240 2142 4354 4719

Lycaenidae A large family of butterflies 53 4068 4268 4369

Microscope slides Benthic Foraminifera in a rock thin section 105 3781 4045 4332

Microscope slides (large) Benthic Foraminifera in a rock thin section 13 3783 3878 4084

Mixed moths and butterflies Mixed moths and butterflies collected in Madagascar 2 4405 6134 7864

Mycalesina A group of satyrid butterflies 4 4004 4295 4553

Neuroptera Lacewings and their relatives 2 4301 4337 4373

doi101371journalpone0143402t002

Inselect Automating the Digitization of Natural History Collections

PLOS ONE | DOI101371journalpone0143402 November 23 2015 8 15

We refined the bounding boxes of 30 images picked at random The median time to refine was1085 s (minimum time 85s maximum time 4130s S5 Fig) and the number of boundingboxes detected by segmentation was not a good predictor of the actual number of specimens(S6 Fig)

Discussion

Ingestion and segmentation performanceIt took just 2 frac12 hours to ingest and segment just over 800 images which represents more thantwice the weekly output of a SatScan machine running at full capacity Some images containedoverlapping specimens (eg Fig 7) Not only are such images challenging for any segmentationalgorithm but the resulting cropped specimen images are of questionable use arguably thesedrawers should be re-curated and re-imaged As might be expected given the way that JPEGcompression works thumbnail size is a function of image complexity rather than size of the

Fig 4 Segmented image of moth specimens Example of a segmented image of Chalcosiinae (a subfamily of moths) specimens

doi101371journalpone0143402g004

Inselect Automating the Digitization of Natural History Collections

PLOS ONE | DOI101371journalpone0143402 November 23 2015 9 15

full-resolution scanned image (S2 Fig) We expected time to ingest (read full-resolution TIFFresize in memory and write JPEG thumbnail) to scale linearly with the size of the full-resolu-tion image file Variation in ingestion time (S3 Fig) could be due to blocked PC resources(CPU RAM hard-disk) but the PC used to carry out the tests has a high specification and theimage files were on their own physical hard disk that was not being used for other tasks JPEGcompression speed is more likely to explain the variation given that this is correlated withimage complexity and that this complexity is highly variable across each drawer

Fig 5 Segmented image of beetle specimens Example of a segmented image of Lucanidae (stag beetles) specimens

doi101371journalpone0143402g005

Inselect Automating the Digitization of Natural History Collections

PLOS ONE | DOI101371journalpone0143402 November 23 2015 10 15

Application in natural history collection digitization workflowsNelson et al [12] described three dominant digitization workflows for natural history collec-tions (1) data capture with occasional specimen imaging (2) parallel data and specimen imagecapture and (3) imaging of specimens and labels followed by data capture from the image Weconsider the third workflow as the most efficient process for mass digitization of very large col-lections This allows operators to perform simultaneous image and data capture for multiplespecimens thus significantly increasing throughput (S7 Fig)

Inselect was developed primarily to suit the needs of the mass digitization program withinthe Natural History Museum and can we hope also be used by the many organisations withcollections that share the following characteristics

bull extremely large size

bull reasonably complete taxonomic index (list of taxa represented in the collection)

bull complete record of collection lots (ie multi-specimen and mixed taxon collections) and

bull very low percentage of specimen-level records

Fig 6 Segmented image of microscope slides Example of a segmented image of microscope slides of benthic Foraminifera in a rock thin section

doi101371journalpone0143402g006

Inselect Automating the Digitization of Natural History Collections

PLOS ONE | DOI101371journalpone0143402 November 23 2015 11 15

Under these circumstances one of the most pressing priorities is ardquobroad-and-thinrdquoapproach to digitization the collection of essential specimen-level data allowing complete col-lection audit and providing a specimen level platform to add metadata in future At a minimumthis includes the specimens determination (ie taxon name) and its physical location withinthe collection Inselect has proven to be a very useful tool for the most challenging parts of theNHMrsquos collections such as pinned insects and it can be easily applied to other areas for exam-ple environmental studies and quantitative analysis of trap samples (eg ldquoinvertebrate soupsrdquoor sticky traps see S8 Fig)

In the course of the NHM Slide Digitization Pilot project which will digitize 100000microscope slides in eight months Inselect received extensive user acceptance testing Resultsto date show that throughput is as high as 5000 slides per day per person for processingmulti-slide images through Inselect This includes tasks associated with image segmentation

Fig 7 Segmented image containing overlapping specimens Example of segmented image of Mycalesina (a group of butterflies)

doi101371journalpone0143402g007

Inselect Automating the Digitization of Natural History Collections

PLOS ONE | DOI101371journalpone0143402 November 23 2015 12 15

and refinement barcode recognition and association with minimal metadata (taxon name andphysical location in the collection) Upon completion of the project the entire set of time andmotion studies alongside the associated workflow will be described in a separate publicationDrawers of curated pinned insects as a rule do not require additional preparation thereforethe imaging output can be up to 70 SatScan images per day resulting in 3500ndash70000 speci-mens per day available for Inselect

Future developmentsSegmentation algorithms The high throughput of mass-digitization activities makes it

important to minimize the amount of manual intervention required The NHMrsquos SatScaninstrument can generate up to 70 multi-specimen images per day The median user-timerequired to refine segmented images of 109 s (S5 Fig) means that 70 x 109 3600 = 21 person-hours could be required to refine bounding boxes for a dayrsquos worth of images from a SatScanmachine In the worst case (413 s) this refinement time increases to more than eight hoursTherefore segmentation algorithms should be as accurate as possible and we suggest that thereis a need for a formal method (and supporting software) that allows segmentation methodsand their associated parameter sets to be scored and ranked Such a score should consider per-formance false positives and false negatives The outputs of such an activity might be a libraryof algorithms andor parameter sets geared towards different specimen types The dataset of804 images used in the present work (available at httpdxdoiorg1055190018537) consti-tutes a benchmark dataset against which segmentation algorithms can be measured Inselectrsquosmodular architecture and its provision of plugins make it a suitable platform for such aninvestigation

Desktop application The desktop application lacks some of the polish that is expected ofmodern software such as lsquoundorsquo and localization Other desirable features include the ability tofilter and order bounding boxes by size andor area in order to aid refinement support forExchangeable Image File Format (EXIF) tags and integration with industry-standard imageprocessing tools such as Adobe Photoshop (httpwwwadobecomproductsphotoshophtml)The plugin architecture makes a possible range of developments such as additional segmenta-tion algorithms and optical character recognition of label text within bounding boxes

Inselect has been tested using specimens from the NHMrsquos entomological and micropalaeon-tological collections as well as a limited number of specimens from Continental European col-lections We would like to test the software against a greater diversity of museum specimensand institutions to ensure that it can accommodate variation in the storage and mounting ofthese specimens

Despite these limitations Inselect represents a substantial contribution to the tools availableto support mass-digitization of natural history collections The desktop application and itsassociated command-line tools have been designed to efficiently handle the high numbers oflarge image files produced by mass-digitization activities The combination of a modular archi-tecture desktop application and scriptable technology makes it a relatively simple task to inte-grate Inselect into existing and workflows Bug reports feature requests and ideas can beviewed and created at httpsgithubcomNaturalHistoryMuseuminselectissues We areactively developing Inselect and we greatly value all comments and suggestions

Supporting InformationS1 Fig A cropped specimen image containing a barcode A scan of a microscope slide thatcontains a Data Matrix barcode(TIFF)

Inselect Automating the Digitization of Natural History Collections

PLOS ONE | DOI101371journalpone0143402 November 23 2015 13 15

S2 Fig Thumbnail file size against file size Numbers in brackets in the legend are the num-ber of files in that group(EPS)

S3 Fig Ingestion time against file size Ingestion time is the time taken to read the scannedimage and save a JPEG thumbnail image of 4096 pixels in width Numbers in brackets in thelegend are the number of files in that group(EPS)

S4 Fig Segmentation time against number of boxes found Numbers in brackets in the leg-end are the number of files in that group(EPS)

S5 Fig Distribution of the time to refine 30 segmented images For 30 images picked at ran-dom(EPS)

S6 Fig The actual number of specimens against the number of boxes detected by segmenta-tion For 30 images picked at random(EPS)

S7 Fig Imaging digitization workflow a Original single-specimen imaging and data capture(after Nelson 2012 modified) b multi-specimen imaging and data capture(EPS)

S8 Fig Alternative use of Inselect for trap sample Count estimation (1064 specimens) usingInselect on an image of a yellow sticky trap used in an environmental assessment study(TIFF)

AcknowledgmentsWe thank Francois Malan for the modified ldquoDivide Scanned Imagesrdquo software (httpfrancoismalancom201301how-to-batch-separate-crop-multiple-scanned-photos) whichprovided the inspiration for developing Inselect Gabriel Brostow and Michael Terry for helpfuldiscussions about computer vision Ben Scott for advice on sustainable software developmentJoumlrg Holetschek and Gwenaeumll Le Bras for application testing and feedback NHM curatorsBlanca Huertas and Geoff Martin for access to images from the Lepidoptera collections volun-teers Robyn Crowther Alexander Esin James Fage Sophie Ledger and Gabriela Montejo fordevoting their time to digitize the test specimens Duncan Sivell for the yellow sticky trapimage used in S8 Fig

Author ContributionsConceived and designed the experiments LNH Performed the experiments LNH Analyzedthe data LNH Contributed reagentsmaterialsanalysis tools VB BWP Wrote the paper LNHVB AH PH LL BWP SVDWVSS Identified the need for the application VB LL BWP VSSWrote the segmentation algorithm and designed and wrote the prototype desktop applicationPH SVDW Worked on the prototype application AH Re-engineered the application LNHContributed towards the design of the desktop application and workflow tools LNH VB AHPH LL BWP SVDWVSS

Inselect Automating the Digitization of Natural History Collections

PLOS ONE | DOI101371journalpone0143402 November 23 2015 14 15

References1 Arintildeo AH Approaches to estimating the universe of natural history collections data Biodiversity Infor-

matics 2010 7 81ndash92 doi 1017161biv7i23991

2 GrahamCH Ferrier S Huettman F Moritz C Peterson AT New developments in museum-based infor-matics and applications in biodiversity analysis Trends Ecol Evol 2004 19 497ndash503 doi 101016jtree200407006 PMID 16701313

3 Suarez AV Tsutsui ND The value of museum collections for research and society BioScience 200454 66ndash74 doi 1016410006-3568(2004)054[0066TVOMCF]20CO2

4 Lister A M Brooks SJ Fenberg PB Glover AG James KE Johnson KG et al Natural history collec-tions as sources of long-term datasets Trends Ecol Evol 2011 26 153ndash154 doi 101016jtree201012009 PMID 21255862

5 Rocha LA Aleixo A Allen G Almeda F Baldwin CC Barclay MV et al Specimen collection an essen-tial tool Science 2014 344 814ndash5 doi 101126science3446186814 PMID 24855245

6 Beck J Kitching IJ Estimating regional species richness of tropical insects frommuseum data a com-parison of a geography-based and sample-based methods J Appl Ecol 2007 44 672ndash681 doi 101111j1365-2664200701291x

7 Newbold T Applications and limitations of museum data for conservation and ecology with particularattention to species distribution models Prog Phys Geog 2010 34 3ndash22 doi 1011770309133309355630

8 Cheng TL Rovito SM Wake DB Vredenburg VT Coincident mass extirpation of neotropical amphibi-ans with the emergence of the infectious fungal pathogen Batrachochytrium dendrobatidis Proc NatlAcad Sci USA 2011 108 9502ndash9507 doi 101073pnas1105538108 PMID 21543713

9 Brooks SJ Self A Toloni F Sparks T Natural history museum collections provide information on phe-nological change in British butterflies since the late-nineteenth century Int J Biometeorol 2014 581749ndash1758 doi 101007s00484-013-0780-6 PMID 24429705

10 Beaman RS Cellinese N Mass digitization of scientific collections new opportunities to transform theuse of biological specimens and underwrite biodiversity science In Blagoderov V Smith VS editorsNo specimen left behind mass digitization of natural history collections ZooKeys 2012 209 7ndash17doi 103897zookeys2093313

11 Blagoderov V Kitching IJ Livermore L Simonsen TJ Smith VS No specimen left behind industrialscale digitization of natural history collections In Blagoderov V Smith VS editors No specimen leftbehind mass digitization of natural history collections ZooKeys 2012 209 133ndash146 doi 103897zookeys2093178

12 Nelson G Paul D Riccardi G Mast AR Five task clusters that enable efficient and effective digitizationof biological collections In Blagoderov V Smith VS editors No specimen left behind mass digitizationof natural history collections ZooKeys 2012 209 19ndash45 doi 103897zookeys2093135

13 Holovachov O Zatushevsky A Shydlovsky I Whole-drawer imaging of entomological collections ben-efits limitations and alternative applications J of Conserv and Mus Stud 2014 12 1ndash13 doi 105334jcms1021218

14 Mantle BL La Salle J Fisher N Whole-drawer imaging for digital management and curation of a largeentomological collection In Blagoderov V Smith VS editors No specimen left behind mass digitiza-tion of natural history collections ZooKeys 2012 209 147ndash163 doi 103897zookeys2093169

15 Dietrich CH Hart J Raila D Ravaioli U Sobh N Sobh O et al InvertNet a new paradigm for digitalaccess to invertebrate collections In Blagoderov V Smith VS editors No specimen left behind massdigitization of natural history collections ZooKeys 2012 209 165ndash181 doi 103897zookeys2093571

16 Schmidt S Balke M Lafogler S DScanmdasha high-performance digital scanning system for entomologicalcollections In Blagoderov V Smith VS editors No specimen left behind mass digitization of naturalhistory collections ZooKeys 2012 209 183ndash191 doi 103897zookeys2093115

17 Wieczorek J Bloom D Guralnick R Blum S Doumlring M Giovanni R et al Darwin Core An EvolvingCommunity-Developed Biodiversity Data Standard PLoS ONE 2012 7(1) e29715 doi 101371journalpone0029715 PMID 22238640

Inselect Automating the Digitization of Natural History Collections

PLOS ONE | DOI101371journalpone0143402 November 23 2015 15 15

Command-line toolsEach command-line tool makes available some of the functionality of the desktop applicationin a form that is convenient for unattended processing of images in batches

bull ingest reads each scanned image creates and saves an empty Inselect document along with athumbnail image

bull segment runs the segmentation algorithm for each Inselect document that does not alreadycontain bounding boxes

bull save_crops for each Inselect document writes specimen images cropped from the high-reso-lution image and

bull export_metadata for each Inselect document writes a CSV file containing metadata

Each of these tools corresponds to a shaded box in the typical Inselect workflow shown inFig 3

Segmentation algorithmsBoth of Inselectrsquos algorithms operate on thumbnail images The automatic segmentation algo-rithm converts the image to the CIELAB (Commission internationale de leacuteclairage Lab) col-our space and then adds Gaussian blur in order to remove noise It then applies Sobel filters inx and y directions and applies a threshold resulting in a binary image (ie pixels are either lsquooffrsquoor lsquoonrsquo) where lsquoonrsquo indicates that an edge in the source image The algorithm then detects

Fig 2 The lsquoObjectsrsquo view This view shows objects in a grid The selected object lacks a mandatory metadata field so is shown in red

doi101371journalpone0143402g002

Inselect Automating the Digitization of Natural History Collections

PLOS ONE | DOI101371journalpone0143402 November 23 2015 6 15

contours around each edge and computes the bounding box around each contour Contoursare processed recursively in order to detect edges-within-edges such as specimens within insecttrays The result of the algorithm is a list of bounding boxes

The sub-segmentation algorithm is applied by the user to a single bounding box that con-tains many specimensmdasha situation that can arise when the automatic segmentation algorithmwas unable to discriminate between specimens The user marks each individual specimenwithin a box using shift+left mouse click The sub-segmentation algorithm applies a watershedtechnique in which the image is considered to be a topographical surface with peaks and val-leys each lsquovalleyrsquo (indicated by a user-designated marker) is lsquofilledrsquo with a different colourlsquowaterrsquo until all lsquopeaksrsquo are submerged The resulting lsquolakesrsquo of different colours indicate theextent of each specimen The result of the algorithm is a list of bounding boxes

Based on an initial period of exploration with a variety of images from the NHMrsquos collec-tion all free parameters of both algorithms were hard-coded within the Inselect software

Materials and Methods

Test imagesWe evaluated the performance of the software using 804 multi-specimen Tagged Image FileFormat (TIFF) images of specimens from the NHMrsquos collections Images were captured usingthe SmartDrive SatScan (httpwwwsmartdrivecouk) collection scanner which is capable ofproducing high-resolution images of entire collection drawers A camera (UEye-SE USBCMOS model UI-1480SE-C-HQ 2560times1920 resolution) and an attached lens (Edmund Opticstelecentric TML lenses model 58428 03times or model 56675 016times) is moved in two dimensionsalong precision-engineered rails positioned above the objects that are to be imaged A combi-nation of hardware and software provides automated capture of high-resolution images ofsmall regions of interest which are then assembled (ldquostitchedrdquo) into a single panoramic imageby proprietary software (Analyse by SmartDrive) This method maximizes depth of field of thecaptured images and minimizes distortion and parallax artefacts

We used scanned images of pinned insects stored in collection drawers of between 400 x500 mm and 555 x 572 mm in size with or without unit trays Some of the scanned images con-tain in areas where no specimens are present paper with a printed Penrose tiles patternndashthesewere added in order to aid earlier versions of the stitching algorithm We also tested Inselectusing scans of standard-size microscope slides laid out for imaging in a rectangular grid con-taining 72 sockets arranged in six columns and twelve rows and large-sized microscope slidesarranged in a grid of six columns and eight rows some sockets were empty in some scans Wemake the thumbnail images (on which the segmentation algorithm operates) of our completetest dataset available at httpdxdoiorg1055190018537

PerformanceFor each image we computed or measured

Fig 3 Typical workflow All processes are offered by the desktop application Shaded boxes indicate processes that can be performed by the command-line tools which can work on batches of images and documents

doi101371journalpone0143402g003

Inselect Automating the Digitization of Natural History Collections

PLOS ONE | DOI101371journalpone0143402 November 23 2015 7 15

bull the dimensions of the scanned TIFF image (in pixels)

bull the size of the scanned image file (in MB)

bull the time to ingest (ie read the scanned image save a JPEG thumbnail image of 4096 pixelsin width and create an empty Inselect document)

bull the size of the thumbnail image file (in MB)

bull the time to segment and

bull the number of boxes found by segmentation

We picked 30 images at random and manually refined the bounding boxes that weredetected by the segmentation algorithm This involved correcting false positives (removingboxes where there was no specimen) false negatives (creating boxes where specimens did nothave one) and adjusting the size of boxes that did not encompass the entire specimen and asso-ciated labels We recorded the time taken to refine the bounding boxes and the actual numberof specimens on the image

All tests were carried out on an eight-CPU Dell Precision T3610 workstation with 32GBRAM running Ubuntu Linux 1404 (httpwwwubuntucom) Python 279 NumPy 191OpenCV 249 and QT 486 The version-control tags for these two experiments are lsquoperfor-mance-experiment-1rsquo and lsquoexperiment-1-refinementrsquo respectively

ResultsThe mean dimensions of the scanned images were 18131 x 15268 pixels The complete set ofscanned images took up 341GB on disk file sizes varied between 111MB and 796MB median429MB (Table 2) Examples of segmented images are shown in Figs 4ndash6 Median thumbnailimages were nearly two orders of magnitude smaller than the full-resolution scanned images(S2 Fig) The whole experiment took 2 hours 29 minutes to run Ingestion times are shown inS3 Fig Segmentation time is not explained by the number of bounding boxes detected (S4 Fig)

Table 2 Scanned images by group

Group Description N images File sizes (MB)

Min Median Max

Brahmaeidae A family of moths 30 7211 7345 7960

Cerambycidae The family of longhorn beetles 2 1109 1155 1201

Chalcidoidea A superfamily of wasps 2 4199 4232 4266

Chalcosiinae A subfamily of moths 271 3960 4310 4690

Charaxinae A subfamily of butterflies 67 3683 4283 5043

Coccinellidae Ladybirds 7 4015 4140 4152

Embioptera Webspinners 2 4295 4308 4321

Limacodidae A family of moths 4 4105 4170 4237

Lucanidae The family of stag beetles 240 2142 4354 4719

Lycaenidae A large family of butterflies 53 4068 4268 4369

Microscope slides Benthic Foraminifera in a rock thin section 105 3781 4045 4332

Microscope slides (large) Benthic Foraminifera in a rock thin section 13 3783 3878 4084

Mixed moths and butterflies Mixed moths and butterflies collected in Madagascar 2 4405 6134 7864

Mycalesina A group of satyrid butterflies 4 4004 4295 4553

Neuroptera Lacewings and their relatives 2 4301 4337 4373

doi101371journalpone0143402t002

Inselect Automating the Digitization of Natural History Collections

PLOS ONE | DOI101371journalpone0143402 November 23 2015 8 15

We refined the bounding boxes of 30 images picked at random The median time to refine was1085 s (minimum time 85s maximum time 4130s S5 Fig) and the number of boundingboxes detected by segmentation was not a good predictor of the actual number of specimens(S6 Fig)

Discussion

Ingestion and segmentation performanceIt took just 2 frac12 hours to ingest and segment just over 800 images which represents more thantwice the weekly output of a SatScan machine running at full capacity Some images containedoverlapping specimens (eg Fig 7) Not only are such images challenging for any segmentationalgorithm but the resulting cropped specimen images are of questionable use arguably thesedrawers should be re-curated and re-imaged As might be expected given the way that JPEGcompression works thumbnail size is a function of image complexity rather than size of the

Fig 4 Segmented image of moth specimens Example of a segmented image of Chalcosiinae (a subfamily of moths) specimens

doi101371journalpone0143402g004

Inselect Automating the Digitization of Natural History Collections

PLOS ONE | DOI101371journalpone0143402 November 23 2015 9 15

full-resolution scanned image (S2 Fig) We expected time to ingest (read full-resolution TIFFresize in memory and write JPEG thumbnail) to scale linearly with the size of the full-resolu-tion image file Variation in ingestion time (S3 Fig) could be due to blocked PC resources(CPU RAM hard-disk) but the PC used to carry out the tests has a high specification and theimage files were on their own physical hard disk that was not being used for other tasks JPEGcompression speed is more likely to explain the variation given that this is correlated withimage complexity and that this complexity is highly variable across each drawer

Fig 5 Segmented image of beetle specimens Example of a segmented image of Lucanidae (stag beetles) specimens

doi101371journalpone0143402g005

Inselect Automating the Digitization of Natural History Collections

PLOS ONE | DOI101371journalpone0143402 November 23 2015 10 15

Application in natural history collection digitization workflowsNelson et al [12] described three dominant digitization workflows for natural history collec-tions (1) data capture with occasional specimen imaging (2) parallel data and specimen imagecapture and (3) imaging of specimens and labels followed by data capture from the image Weconsider the third workflow as the most efficient process for mass digitization of very large col-lections This allows operators to perform simultaneous image and data capture for multiplespecimens thus significantly increasing throughput (S7 Fig)

Inselect was developed primarily to suit the needs of the mass digitization program withinthe Natural History Museum and can we hope also be used by the many organisations withcollections that share the following characteristics

bull extremely large size

bull reasonably complete taxonomic index (list of taxa represented in the collection)

bull complete record of collection lots (ie multi-specimen and mixed taxon collections) and

bull very low percentage of specimen-level records

Fig 6 Segmented image of microscope slides Example of a segmented image of microscope slides of benthic Foraminifera in a rock thin section

doi101371journalpone0143402g006

Inselect Automating the Digitization of Natural History Collections

PLOS ONE | DOI101371journalpone0143402 November 23 2015 11 15

Under these circumstances one of the most pressing priorities is ardquobroad-and-thinrdquoapproach to digitization the collection of essential specimen-level data allowing complete col-lection audit and providing a specimen level platform to add metadata in future At a minimumthis includes the specimens determination (ie taxon name) and its physical location withinthe collection Inselect has proven to be a very useful tool for the most challenging parts of theNHMrsquos collections such as pinned insects and it can be easily applied to other areas for exam-ple environmental studies and quantitative analysis of trap samples (eg ldquoinvertebrate soupsrdquoor sticky traps see S8 Fig)

In the course of the NHM Slide Digitization Pilot project which will digitize 100000microscope slides in eight months Inselect received extensive user acceptance testing Resultsto date show that throughput is as high as 5000 slides per day per person for processingmulti-slide images through Inselect This includes tasks associated with image segmentation

Fig 7 Segmented image containing overlapping specimens Example of segmented image of Mycalesina (a group of butterflies)

doi101371journalpone0143402g007

Inselect Automating the Digitization of Natural History Collections

PLOS ONE | DOI101371journalpone0143402 November 23 2015 12 15

and refinement barcode recognition and association with minimal metadata (taxon name andphysical location in the collection) Upon completion of the project the entire set of time andmotion studies alongside the associated workflow will be described in a separate publicationDrawers of curated pinned insects as a rule do not require additional preparation thereforethe imaging output can be up to 70 SatScan images per day resulting in 3500ndash70000 speci-mens per day available for Inselect

Future developmentsSegmentation algorithms The high throughput of mass-digitization activities makes it

important to minimize the amount of manual intervention required The NHMrsquos SatScaninstrument can generate up to 70 multi-specimen images per day The median user-timerequired to refine segmented images of 109 s (S5 Fig) means that 70 x 109 3600 = 21 person-hours could be required to refine bounding boxes for a dayrsquos worth of images from a SatScanmachine In the worst case (413 s) this refinement time increases to more than eight hoursTherefore segmentation algorithms should be as accurate as possible and we suggest that thereis a need for a formal method (and supporting software) that allows segmentation methodsand their associated parameter sets to be scored and ranked Such a score should consider per-formance false positives and false negatives The outputs of such an activity might be a libraryof algorithms andor parameter sets geared towards different specimen types The dataset of804 images used in the present work (available at httpdxdoiorg1055190018537) consti-tutes a benchmark dataset against which segmentation algorithms can be measured Inselectrsquosmodular architecture and its provision of plugins make it a suitable platform for such aninvestigation

Desktop application The desktop application lacks some of the polish that is expected ofmodern software such as lsquoundorsquo and localization Other desirable features include the ability tofilter and order bounding boxes by size andor area in order to aid refinement support forExchangeable Image File Format (EXIF) tags and integration with industry-standard imageprocessing tools such as Adobe Photoshop (httpwwwadobecomproductsphotoshophtml)The plugin architecture makes a possible range of developments such as additional segmenta-tion algorithms and optical character recognition of label text within bounding boxes

Inselect has been tested using specimens from the NHMrsquos entomological and micropalaeon-tological collections as well as a limited number of specimens from Continental European col-lections We would like to test the software against a greater diversity of museum specimensand institutions to ensure that it can accommodate variation in the storage and mounting ofthese specimens

Despite these limitations Inselect represents a substantial contribution to the tools availableto support mass-digitization of natural history collections The desktop application and itsassociated command-line tools have been designed to efficiently handle the high numbers oflarge image files produced by mass-digitization activities The combination of a modular archi-tecture desktop application and scriptable technology makes it a relatively simple task to inte-grate Inselect into existing and workflows Bug reports feature requests and ideas can beviewed and created at httpsgithubcomNaturalHistoryMuseuminselectissues We areactively developing Inselect and we greatly value all comments and suggestions

Supporting InformationS1 Fig A cropped specimen image containing a barcode A scan of a microscope slide thatcontains a Data Matrix barcode(TIFF)

Inselect Automating the Digitization of Natural History Collections

PLOS ONE | DOI101371journalpone0143402 November 23 2015 13 15

S2 Fig Thumbnail file size against file size Numbers in brackets in the legend are the num-ber of files in that group(EPS)

S3 Fig Ingestion time against file size Ingestion time is the time taken to read the scannedimage and save a JPEG thumbnail image of 4096 pixels in width Numbers in brackets in thelegend are the number of files in that group(EPS)

S4 Fig Segmentation time against number of boxes found Numbers in brackets in the leg-end are the number of files in that group(EPS)

S5 Fig Distribution of the time to refine 30 segmented images For 30 images picked at ran-dom(EPS)

S6 Fig The actual number of specimens against the number of boxes detected by segmenta-tion For 30 images picked at random(EPS)

S7 Fig Imaging digitization workflow a Original single-specimen imaging and data capture(after Nelson 2012 modified) b multi-specimen imaging and data capture(EPS)

S8 Fig Alternative use of Inselect for trap sample Count estimation (1064 specimens) usingInselect on an image of a yellow sticky trap used in an environmental assessment study(TIFF)

AcknowledgmentsWe thank Francois Malan for the modified ldquoDivide Scanned Imagesrdquo software (httpfrancoismalancom201301how-to-batch-separate-crop-multiple-scanned-photos) whichprovided the inspiration for developing Inselect Gabriel Brostow and Michael Terry for helpfuldiscussions about computer vision Ben Scott for advice on sustainable software developmentJoumlrg Holetschek and Gwenaeumll Le Bras for application testing and feedback NHM curatorsBlanca Huertas and Geoff Martin for access to images from the Lepidoptera collections volun-teers Robyn Crowther Alexander Esin James Fage Sophie Ledger and Gabriela Montejo fordevoting their time to digitize the test specimens Duncan Sivell for the yellow sticky trapimage used in S8 Fig

Author ContributionsConceived and designed the experiments LNH Performed the experiments LNH Analyzedthe data LNH Contributed reagentsmaterialsanalysis tools VB BWP Wrote the paper LNHVB AH PH LL BWP SVDWVSS Identified the need for the application VB LL BWP VSSWrote the segmentation algorithm and designed and wrote the prototype desktop applicationPH SVDW Worked on the prototype application AH Re-engineered the application LNHContributed towards the design of the desktop application and workflow tools LNH VB AHPH LL BWP SVDWVSS

Inselect Automating the Digitization of Natural History Collections

PLOS ONE | DOI101371journalpone0143402 November 23 2015 14 15

References1 Arintildeo AH Approaches to estimating the universe of natural history collections data Biodiversity Infor-

matics 2010 7 81ndash92 doi 1017161biv7i23991

2 GrahamCH Ferrier S Huettman F Moritz C Peterson AT New developments in museum-based infor-matics and applications in biodiversity analysis Trends Ecol Evol 2004 19 497ndash503 doi 101016jtree200407006 PMID 16701313

3 Suarez AV Tsutsui ND The value of museum collections for research and society BioScience 200454 66ndash74 doi 1016410006-3568(2004)054[0066TVOMCF]20CO2

4 Lister A M Brooks SJ Fenberg PB Glover AG James KE Johnson KG et al Natural history collec-tions as sources of long-term datasets Trends Ecol Evol 2011 26 153ndash154 doi 101016jtree201012009 PMID 21255862

5 Rocha LA Aleixo A Allen G Almeda F Baldwin CC Barclay MV et al Specimen collection an essen-tial tool Science 2014 344 814ndash5 doi 101126science3446186814 PMID 24855245

6 Beck J Kitching IJ Estimating regional species richness of tropical insects frommuseum data a com-parison of a geography-based and sample-based methods J Appl Ecol 2007 44 672ndash681 doi 101111j1365-2664200701291x

7 Newbold T Applications and limitations of museum data for conservation and ecology with particularattention to species distribution models Prog Phys Geog 2010 34 3ndash22 doi 1011770309133309355630

8 Cheng TL Rovito SM Wake DB Vredenburg VT Coincident mass extirpation of neotropical amphibi-ans with the emergence of the infectious fungal pathogen Batrachochytrium dendrobatidis Proc NatlAcad Sci USA 2011 108 9502ndash9507 doi 101073pnas1105538108 PMID 21543713

9 Brooks SJ Self A Toloni F Sparks T Natural history museum collections provide information on phe-nological change in British butterflies since the late-nineteenth century Int J Biometeorol 2014 581749ndash1758 doi 101007s00484-013-0780-6 PMID 24429705

10 Beaman RS Cellinese N Mass digitization of scientific collections new opportunities to transform theuse of biological specimens and underwrite biodiversity science In Blagoderov V Smith VS editorsNo specimen left behind mass digitization of natural history collections ZooKeys 2012 209 7ndash17doi 103897zookeys2093313

11 Blagoderov V Kitching IJ Livermore L Simonsen TJ Smith VS No specimen left behind industrialscale digitization of natural history collections In Blagoderov V Smith VS editors No specimen leftbehind mass digitization of natural history collections ZooKeys 2012 209 133ndash146 doi 103897zookeys2093178

12 Nelson G Paul D Riccardi G Mast AR Five task clusters that enable efficient and effective digitizationof biological collections In Blagoderov V Smith VS editors No specimen left behind mass digitizationof natural history collections ZooKeys 2012 209 19ndash45 doi 103897zookeys2093135

13 Holovachov O Zatushevsky A Shydlovsky I Whole-drawer imaging of entomological collections ben-efits limitations and alternative applications J of Conserv and Mus Stud 2014 12 1ndash13 doi 105334jcms1021218

14 Mantle BL La Salle J Fisher N Whole-drawer imaging for digital management and curation of a largeentomological collection In Blagoderov V Smith VS editors No specimen left behind mass digitiza-tion of natural history collections ZooKeys 2012 209 147ndash163 doi 103897zookeys2093169

15 Dietrich CH Hart J Raila D Ravaioli U Sobh N Sobh O et al InvertNet a new paradigm for digitalaccess to invertebrate collections In Blagoderov V Smith VS editors No specimen left behind massdigitization of natural history collections ZooKeys 2012 209 165ndash181 doi 103897zookeys2093571

16 Schmidt S Balke M Lafogler S DScanmdasha high-performance digital scanning system for entomologicalcollections In Blagoderov V Smith VS editors No specimen left behind mass digitization of naturalhistory collections ZooKeys 2012 209 183ndash191 doi 103897zookeys2093115

17 Wieczorek J Bloom D Guralnick R Blum S Doumlring M Giovanni R et al Darwin Core An EvolvingCommunity-Developed Biodiversity Data Standard PLoS ONE 2012 7(1) e29715 doi 101371journalpone0029715 PMID 22238640

Inselect Automating the Digitization of Natural History Collections

PLOS ONE | DOI101371journalpone0143402 November 23 2015 15 15

contours around each edge and computes the bounding box around each contour Contoursare processed recursively in order to detect edges-within-edges such as specimens within insecttrays The result of the algorithm is a list of bounding boxes

The sub-segmentation algorithm is applied by the user to a single bounding box that con-tains many specimensmdasha situation that can arise when the automatic segmentation algorithmwas unable to discriminate between specimens The user marks each individual specimenwithin a box using shift+left mouse click The sub-segmentation algorithm applies a watershedtechnique in which the image is considered to be a topographical surface with peaks and val-leys each lsquovalleyrsquo (indicated by a user-designated marker) is lsquofilledrsquo with a different colourlsquowaterrsquo until all lsquopeaksrsquo are submerged The resulting lsquolakesrsquo of different colours indicate theextent of each specimen The result of the algorithm is a list of bounding boxes

Based on an initial period of exploration with a variety of images from the NHMrsquos collec-tion all free parameters of both algorithms were hard-coded within the Inselect software

Materials and Methods

Test imagesWe evaluated the performance of the software using 804 multi-specimen Tagged Image FileFormat (TIFF) images of specimens from the NHMrsquos collections Images were captured usingthe SmartDrive SatScan (httpwwwsmartdrivecouk) collection scanner which is capable ofproducing high-resolution images of entire collection drawers A camera (UEye-SE USBCMOS model UI-1480SE-C-HQ 2560times1920 resolution) and an attached lens (Edmund Opticstelecentric TML lenses model 58428 03times or model 56675 016times) is moved in two dimensionsalong precision-engineered rails positioned above the objects that are to be imaged A combi-nation of hardware and software provides automated capture of high-resolution images ofsmall regions of interest which are then assembled (ldquostitchedrdquo) into a single panoramic imageby proprietary software (Analyse by SmartDrive) This method maximizes depth of field of thecaptured images and minimizes distortion and parallax artefacts

We used scanned images of pinned insects stored in collection drawers of between 400 x500 mm and 555 x 572 mm in size with or without unit trays Some of the scanned images con-tain in areas where no specimens are present paper with a printed Penrose tiles patternndashthesewere added in order to aid earlier versions of the stitching algorithm We also tested Inselectusing scans of standard-size microscope slides laid out for imaging in a rectangular grid con-taining 72 sockets arranged in six columns and twelve rows and large-sized microscope slidesarranged in a grid of six columns and eight rows some sockets were empty in some scans Wemake the thumbnail images (on which the segmentation algorithm operates) of our completetest dataset available at httpdxdoiorg1055190018537

PerformanceFor each image we computed or measured

Fig 3 Typical workflow All processes are offered by the desktop application Shaded boxes indicate processes that can be performed by the command-line tools which can work on batches of images and documents

doi101371journalpone0143402g003

Inselect Automating the Digitization of Natural History Collections

PLOS ONE | DOI101371journalpone0143402 November 23 2015 7 15

bull the dimensions of the scanned TIFF image (in pixels)

bull the size of the scanned image file (in MB)

bull the time to ingest (ie read the scanned image save a JPEG thumbnail image of 4096 pixelsin width and create an empty Inselect document)

bull the size of the thumbnail image file (in MB)

bull the time to segment and

bull the number of boxes found by segmentation

We picked 30 images at random and manually refined the bounding boxes that weredetected by the segmentation algorithm This involved correcting false positives (removingboxes where there was no specimen) false negatives (creating boxes where specimens did nothave one) and adjusting the size of boxes that did not encompass the entire specimen and asso-ciated labels We recorded the time taken to refine the bounding boxes and the actual numberof specimens on the image

All tests were carried out on an eight-CPU Dell Precision T3610 workstation with 32GBRAM running Ubuntu Linux 1404 (httpwwwubuntucom) Python 279 NumPy 191OpenCV 249 and QT 486 The version-control tags for these two experiments are lsquoperfor-mance-experiment-1rsquo and lsquoexperiment-1-refinementrsquo respectively

ResultsThe mean dimensions of the scanned images were 18131 x 15268 pixels The complete set ofscanned images took up 341GB on disk file sizes varied between 111MB and 796MB median429MB (Table 2) Examples of segmented images are shown in Figs 4ndash6 Median thumbnailimages were nearly two orders of magnitude smaller than the full-resolution scanned images(S2 Fig) The whole experiment took 2 hours 29 minutes to run Ingestion times are shown inS3 Fig Segmentation time is not explained by the number of bounding boxes detected (S4 Fig)

Table 2 Scanned images by group

Group Description N images File sizes (MB)

Min Median Max

Brahmaeidae A family of moths 30 7211 7345 7960

Cerambycidae The family of longhorn beetles 2 1109 1155 1201

Chalcidoidea A superfamily of wasps 2 4199 4232 4266

Chalcosiinae A subfamily of moths 271 3960 4310 4690

Charaxinae A subfamily of butterflies 67 3683 4283 5043

Coccinellidae Ladybirds 7 4015 4140 4152

Embioptera Webspinners 2 4295 4308 4321

Limacodidae A family of moths 4 4105 4170 4237

Lucanidae The family of stag beetles 240 2142 4354 4719

Lycaenidae A large family of butterflies 53 4068 4268 4369

Microscope slides Benthic Foraminifera in a rock thin section 105 3781 4045 4332

Microscope slides (large) Benthic Foraminifera in a rock thin section 13 3783 3878 4084

Mixed moths and butterflies Mixed moths and butterflies collected in Madagascar 2 4405 6134 7864

Mycalesina A group of satyrid butterflies 4 4004 4295 4553

Neuroptera Lacewings and their relatives 2 4301 4337 4373

doi101371journalpone0143402t002

Inselect Automating the Digitization of Natural History Collections

PLOS ONE | DOI101371journalpone0143402 November 23 2015 8 15

We refined the bounding boxes of 30 images picked at random The median time to refine was1085 s (minimum time 85s maximum time 4130s S5 Fig) and the number of boundingboxes detected by segmentation was not a good predictor of the actual number of specimens(S6 Fig)

Discussion

Ingestion and segmentation performanceIt took just 2 frac12 hours to ingest and segment just over 800 images which represents more thantwice the weekly output of a SatScan machine running at full capacity Some images containedoverlapping specimens (eg Fig 7) Not only are such images challenging for any segmentationalgorithm but the resulting cropped specimen images are of questionable use arguably thesedrawers should be re-curated and re-imaged As might be expected given the way that JPEGcompression works thumbnail size is a function of image complexity rather than size of the

Fig 4 Segmented image of moth specimens Example of a segmented image of Chalcosiinae (a subfamily of moths) specimens

doi101371journalpone0143402g004

Inselect Automating the Digitization of Natural History Collections

PLOS ONE | DOI101371journalpone0143402 November 23 2015 9 15

full-resolution scanned image (S2 Fig) We expected time to ingest (read full-resolution TIFFresize in memory and write JPEG thumbnail) to scale linearly with the size of the full-resolu-tion image file Variation in ingestion time (S3 Fig) could be due to blocked PC resources(CPU RAM hard-disk) but the PC used to carry out the tests has a high specification and theimage files were on their own physical hard disk that was not being used for other tasks JPEGcompression speed is more likely to explain the variation given that this is correlated withimage complexity and that this complexity is highly variable across each drawer

Fig 5 Segmented image of beetle specimens Example of a segmented image of Lucanidae (stag beetles) specimens

doi101371journalpone0143402g005

Inselect Automating the Digitization of Natural History Collections

PLOS ONE | DOI101371journalpone0143402 November 23 2015 10 15

Application in natural history collection digitization workflowsNelson et al [12] described three dominant digitization workflows for natural history collec-tions (1) data capture with occasional specimen imaging (2) parallel data and specimen imagecapture and (3) imaging of specimens and labels followed by data capture from the image Weconsider the third workflow as the most efficient process for mass digitization of very large col-lections This allows operators to perform simultaneous image and data capture for multiplespecimens thus significantly increasing throughput (S7 Fig)

Inselect was developed primarily to suit the needs of the mass digitization program withinthe Natural History Museum and can we hope also be used by the many organisations withcollections that share the following characteristics

bull extremely large size

bull reasonably complete taxonomic index (list of taxa represented in the collection)

bull complete record of collection lots (ie multi-specimen and mixed taxon collections) and

bull very low percentage of specimen-level records

Fig 6 Segmented image of microscope slides Example of a segmented image of microscope slides of benthic Foraminifera in a rock thin section

doi101371journalpone0143402g006

Inselect Automating the Digitization of Natural History Collections

PLOS ONE | DOI101371journalpone0143402 November 23 2015 11 15

Under these circumstances one of the most pressing priorities is ardquobroad-and-thinrdquoapproach to digitization the collection of essential specimen-level data allowing complete col-lection audit and providing a specimen level platform to add metadata in future At a minimumthis includes the specimens determination (ie taxon name) and its physical location withinthe collection Inselect has proven to be a very useful tool for the most challenging parts of theNHMrsquos collections such as pinned insects and it can be easily applied to other areas for exam-ple environmental studies and quantitative analysis of trap samples (eg ldquoinvertebrate soupsrdquoor sticky traps see S8 Fig)

In the course of the NHM Slide Digitization Pilot project which will digitize 100000microscope slides in eight months Inselect received extensive user acceptance testing Resultsto date show that throughput is as high as 5000 slides per day per person for processingmulti-slide images through Inselect This includes tasks associated with image segmentation

Fig 7 Segmented image containing overlapping specimens Example of segmented image of Mycalesina (a group of butterflies)

doi101371journalpone0143402g007

Inselect Automating the Digitization of Natural History Collections

PLOS ONE | DOI101371journalpone0143402 November 23 2015 12 15

and refinement barcode recognition and association with minimal metadata (taxon name andphysical location in the collection) Upon completion of the project the entire set of time andmotion studies alongside the associated workflow will be described in a separate publicationDrawers of curated pinned insects as a rule do not require additional preparation thereforethe imaging output can be up to 70 SatScan images per day resulting in 3500ndash70000 speci-mens per day available for Inselect

Future developmentsSegmentation algorithms The high throughput of mass-digitization activities makes it

important to minimize the amount of manual intervention required The NHMrsquos SatScaninstrument can generate up to 70 multi-specimen images per day The median user-timerequired to refine segmented images of 109 s (S5 Fig) means that 70 x 109 3600 = 21 person-hours could be required to refine bounding boxes for a dayrsquos worth of images from a SatScanmachine In the worst case (413 s) this refinement time increases to more than eight hoursTherefore segmentation algorithms should be as accurate as possible and we suggest that thereis a need for a formal method (and supporting software) that allows segmentation methodsand their associated parameter sets to be scored and ranked Such a score should consider per-formance false positives and false negatives The outputs of such an activity might be a libraryof algorithms andor parameter sets geared towards different specimen types The dataset of804 images used in the present work (available at httpdxdoiorg1055190018537) consti-tutes a benchmark dataset against which segmentation algorithms can be measured Inselectrsquosmodular architecture and its provision of plugins make it a suitable platform for such aninvestigation

Desktop application The desktop application lacks some of the polish that is expected ofmodern software such as lsquoundorsquo and localization Other desirable features include the ability tofilter and order bounding boxes by size andor area in order to aid refinement support forExchangeable Image File Format (EXIF) tags and integration with industry-standard imageprocessing tools such as Adobe Photoshop (httpwwwadobecomproductsphotoshophtml)The plugin architecture makes a possible range of developments such as additional segmenta-tion algorithms and optical character recognition of label text within bounding boxes

Inselect has been tested using specimens from the NHMrsquos entomological and micropalaeon-tological collections as well as a limited number of specimens from Continental European col-lections We would like to test the software against a greater diversity of museum specimensand institutions to ensure that it can accommodate variation in the storage and mounting ofthese specimens

Despite these limitations Inselect represents a substantial contribution to the tools availableto support mass-digitization of natural history collections The desktop application and itsassociated command-line tools have been designed to efficiently handle the high numbers oflarge image files produced by mass-digitization activities The combination of a modular archi-tecture desktop application and scriptable technology makes it a relatively simple task to inte-grate Inselect into existing and workflows Bug reports feature requests and ideas can beviewed and created at httpsgithubcomNaturalHistoryMuseuminselectissues We areactively developing Inselect and we greatly value all comments and suggestions

Supporting InformationS1 Fig A cropped specimen image containing a barcode A scan of a microscope slide thatcontains a Data Matrix barcode(TIFF)

Inselect Automating the Digitization of Natural History Collections

PLOS ONE | DOI101371journalpone0143402 November 23 2015 13 15

S2 Fig Thumbnail file size against file size Numbers in brackets in the legend are the num-ber of files in that group(EPS)

S3 Fig Ingestion time against file size Ingestion time is the time taken to read the scannedimage and save a JPEG thumbnail image of 4096 pixels in width Numbers in brackets in thelegend are the number of files in that group(EPS)

S4 Fig Segmentation time against number of boxes found Numbers in brackets in the leg-end are the number of files in that group(EPS)

S5 Fig Distribution of the time to refine 30 segmented images For 30 images picked at ran-dom(EPS)

S6 Fig The actual number of specimens against the number of boxes detected by segmenta-tion For 30 images picked at random(EPS)

S7 Fig Imaging digitization workflow a Original single-specimen imaging and data capture(after Nelson 2012 modified) b multi-specimen imaging and data capture(EPS)

S8 Fig Alternative use of Inselect for trap sample Count estimation (1064 specimens) usingInselect on an image of a yellow sticky trap used in an environmental assessment study(TIFF)

AcknowledgmentsWe thank Francois Malan for the modified ldquoDivide Scanned Imagesrdquo software (httpfrancoismalancom201301how-to-batch-separate-crop-multiple-scanned-photos) whichprovided the inspiration for developing Inselect Gabriel Brostow and Michael Terry for helpfuldiscussions about computer vision Ben Scott for advice on sustainable software developmentJoumlrg Holetschek and Gwenaeumll Le Bras for application testing and feedback NHM curatorsBlanca Huertas and Geoff Martin for access to images from the Lepidoptera collections volun-teers Robyn Crowther Alexander Esin James Fage Sophie Ledger and Gabriela Montejo fordevoting their time to digitize the test specimens Duncan Sivell for the yellow sticky trapimage used in S8 Fig

Author ContributionsConceived and designed the experiments LNH Performed the experiments LNH Analyzedthe data LNH Contributed reagentsmaterialsanalysis tools VB BWP Wrote the paper LNHVB AH PH LL BWP SVDWVSS Identified the need for the application VB LL BWP VSSWrote the segmentation algorithm and designed and wrote the prototype desktop applicationPH SVDW Worked on the prototype application AH Re-engineered the application LNHContributed towards the design of the desktop application and workflow tools LNH VB AHPH LL BWP SVDWVSS

Inselect Automating the Digitization of Natural History Collections

PLOS ONE | DOI101371journalpone0143402 November 23 2015 14 15

References1 Arintildeo AH Approaches to estimating the universe of natural history collections data Biodiversity Infor-

matics 2010 7 81ndash92 doi 1017161biv7i23991

2 GrahamCH Ferrier S Huettman F Moritz C Peterson AT New developments in museum-based infor-matics and applications in biodiversity analysis Trends Ecol Evol 2004 19 497ndash503 doi 101016jtree200407006 PMID 16701313

3 Suarez AV Tsutsui ND The value of museum collections for research and society BioScience 200454 66ndash74 doi 1016410006-3568(2004)054[0066TVOMCF]20CO2

4 Lister A M Brooks SJ Fenberg PB Glover AG James KE Johnson KG et al Natural history collec-tions as sources of long-term datasets Trends Ecol Evol 2011 26 153ndash154 doi 101016jtree201012009 PMID 21255862

5 Rocha LA Aleixo A Allen G Almeda F Baldwin CC Barclay MV et al Specimen collection an essen-tial tool Science 2014 344 814ndash5 doi 101126science3446186814 PMID 24855245

6 Beck J Kitching IJ Estimating regional species richness of tropical insects frommuseum data a com-parison of a geography-based and sample-based methods J Appl Ecol 2007 44 672ndash681 doi 101111j1365-2664200701291x

7 Newbold T Applications and limitations of museum data for conservation and ecology with particularattention to species distribution models Prog Phys Geog 2010 34 3ndash22 doi 1011770309133309355630

8 Cheng TL Rovito SM Wake DB Vredenburg VT Coincident mass extirpation of neotropical amphibi-ans with the emergence of the infectious fungal pathogen Batrachochytrium dendrobatidis Proc NatlAcad Sci USA 2011 108 9502ndash9507 doi 101073pnas1105538108 PMID 21543713

9 Brooks SJ Self A Toloni F Sparks T Natural history museum collections provide information on phe-nological change in British butterflies since the late-nineteenth century Int J Biometeorol 2014 581749ndash1758 doi 101007s00484-013-0780-6 PMID 24429705

10 Beaman RS Cellinese N Mass digitization of scientific collections new opportunities to transform theuse of biological specimens and underwrite biodiversity science In Blagoderov V Smith VS editorsNo specimen left behind mass digitization of natural history collections ZooKeys 2012 209 7ndash17doi 103897zookeys2093313

11 Blagoderov V Kitching IJ Livermore L Simonsen TJ Smith VS No specimen left behind industrialscale digitization of natural history collections In Blagoderov V Smith VS editors No specimen leftbehind mass digitization of natural history collections ZooKeys 2012 209 133ndash146 doi 103897zookeys2093178

12 Nelson G Paul D Riccardi G Mast AR Five task clusters that enable efficient and effective digitizationof biological collections In Blagoderov V Smith VS editors No specimen left behind mass digitizationof natural history collections ZooKeys 2012 209 19ndash45 doi 103897zookeys2093135

13 Holovachov O Zatushevsky A Shydlovsky I Whole-drawer imaging of entomological collections ben-efits limitations and alternative applications J of Conserv and Mus Stud 2014 12 1ndash13 doi 105334jcms1021218

14 Mantle BL La Salle J Fisher N Whole-drawer imaging for digital management and curation of a largeentomological collection In Blagoderov V Smith VS editors No specimen left behind mass digitiza-tion of natural history collections ZooKeys 2012 209 147ndash163 doi 103897zookeys2093169

15 Dietrich CH Hart J Raila D Ravaioli U Sobh N Sobh O et al InvertNet a new paradigm for digitalaccess to invertebrate collections In Blagoderov V Smith VS editors No specimen left behind massdigitization of natural history collections ZooKeys 2012 209 165ndash181 doi 103897zookeys2093571

16 Schmidt S Balke M Lafogler S DScanmdasha high-performance digital scanning system for entomologicalcollections In Blagoderov V Smith VS editors No specimen left behind mass digitization of naturalhistory collections ZooKeys 2012 209 183ndash191 doi 103897zookeys2093115

17 Wieczorek J Bloom D Guralnick R Blum S Doumlring M Giovanni R et al Darwin Core An EvolvingCommunity-Developed Biodiversity Data Standard PLoS ONE 2012 7(1) e29715 doi 101371journalpone0029715 PMID 22238640

Inselect Automating the Digitization of Natural History Collections

PLOS ONE | DOI101371journalpone0143402 November 23 2015 15 15

bull the dimensions of the scanned TIFF image (in pixels)

bull the size of the scanned image file (in MB)

bull the time to ingest (ie read the scanned image save a JPEG thumbnail image of 4096 pixelsin width and create an empty Inselect document)

bull the size of the thumbnail image file (in MB)

bull the time to segment and

bull the number of boxes found by segmentation

We picked 30 images at random and manually refined the bounding boxes that weredetected by the segmentation algorithm This involved correcting false positives (removingboxes where there was no specimen) false negatives (creating boxes where specimens did nothave one) and adjusting the size of boxes that did not encompass the entire specimen and asso-ciated labels We recorded the time taken to refine the bounding boxes and the actual numberof specimens on the image

All tests were carried out on an eight-CPU Dell Precision T3610 workstation with 32GBRAM running Ubuntu Linux 1404 (httpwwwubuntucom) Python 279 NumPy 191OpenCV 249 and QT 486 The version-control tags for these two experiments are lsquoperfor-mance-experiment-1rsquo and lsquoexperiment-1-refinementrsquo respectively

ResultsThe mean dimensions of the scanned images were 18131 x 15268 pixels The complete set ofscanned images took up 341GB on disk file sizes varied between 111MB and 796MB median429MB (Table 2) Examples of segmented images are shown in Figs 4ndash6 Median thumbnailimages were nearly two orders of magnitude smaller than the full-resolution scanned images(S2 Fig) The whole experiment took 2 hours 29 minutes to run Ingestion times are shown inS3 Fig Segmentation time is not explained by the number of bounding boxes detected (S4 Fig)

Table 2 Scanned images by group

Group Description N images File sizes (MB)

Min Median Max

Brahmaeidae A family of moths 30 7211 7345 7960

Cerambycidae The family of longhorn beetles 2 1109 1155 1201

Chalcidoidea A superfamily of wasps 2 4199 4232 4266

Chalcosiinae A subfamily of moths 271 3960 4310 4690

Charaxinae A subfamily of butterflies 67 3683 4283 5043

Coccinellidae Ladybirds 7 4015 4140 4152

Embioptera Webspinners 2 4295 4308 4321

Limacodidae A family of moths 4 4105 4170 4237

Lucanidae The family of stag beetles 240 2142 4354 4719

Lycaenidae A large family of butterflies 53 4068 4268 4369

Microscope slides Benthic Foraminifera in a rock thin section 105 3781 4045 4332

Microscope slides (large) Benthic Foraminifera in a rock thin section 13 3783 3878 4084

Mixed moths and butterflies Mixed moths and butterflies collected in Madagascar 2 4405 6134 7864

Mycalesina A group of satyrid butterflies 4 4004 4295 4553

Neuroptera Lacewings and their relatives 2 4301 4337 4373

doi101371journalpone0143402t002

Inselect Automating the Digitization of Natural History Collections

PLOS ONE | DOI101371journalpone0143402 November 23 2015 8 15

We refined the bounding boxes of 30 images picked at random The median time to refine was1085 s (minimum time 85s maximum time 4130s S5 Fig) and the number of boundingboxes detected by segmentation was not a good predictor of the actual number of specimens(S6 Fig)

Discussion

Ingestion and segmentation performanceIt took just 2 frac12 hours to ingest and segment just over 800 images which represents more thantwice the weekly output of a SatScan machine running at full capacity Some images containedoverlapping specimens (eg Fig 7) Not only are such images challenging for any segmentationalgorithm but the resulting cropped specimen images are of questionable use arguably thesedrawers should be re-curated and re-imaged As might be expected given the way that JPEGcompression works thumbnail size is a function of image complexity rather than size of the

Fig 4 Segmented image of moth specimens Example of a segmented image of Chalcosiinae (a subfamily of moths) specimens

doi101371journalpone0143402g004

Inselect Automating the Digitization of Natural History Collections

PLOS ONE | DOI101371journalpone0143402 November 23 2015 9 15

full-resolution scanned image (S2 Fig) We expected time to ingest (read full-resolution TIFFresize in memory and write JPEG thumbnail) to scale linearly with the size of the full-resolu-tion image file Variation in ingestion time (S3 Fig) could be due to blocked PC resources(CPU RAM hard-disk) but the PC used to carry out the tests has a high specification and theimage files were on their own physical hard disk that was not being used for other tasks JPEGcompression speed is more likely to explain the variation given that this is correlated withimage complexity and that this complexity is highly variable across each drawer

Fig 5 Segmented image of beetle specimens Example of a segmented image of Lucanidae (stag beetles) specimens

doi101371journalpone0143402g005

Inselect Automating the Digitization of Natural History Collections

PLOS ONE | DOI101371journalpone0143402 November 23 2015 10 15

Application in natural history collection digitization workflowsNelson et al [12] described three dominant digitization workflows for natural history collec-tions (1) data capture with occasional specimen imaging (2) parallel data and specimen imagecapture and (3) imaging of specimens and labels followed by data capture from the image Weconsider the third workflow as the most efficient process for mass digitization of very large col-lections This allows operators to perform simultaneous image and data capture for multiplespecimens thus significantly increasing throughput (S7 Fig)

Inselect was developed primarily to suit the needs of the mass digitization program withinthe Natural History Museum and can we hope also be used by the many organisations withcollections that share the following characteristics

bull extremely large size

bull reasonably complete taxonomic index (list of taxa represented in the collection)

bull complete record of collection lots (ie multi-specimen and mixed taxon collections) and

bull very low percentage of specimen-level records

Fig 6 Segmented image of microscope slides Example of a segmented image of microscope slides of benthic Foraminifera in a rock thin section

doi101371journalpone0143402g006

Inselect Automating the Digitization of Natural History Collections

PLOS ONE | DOI101371journalpone0143402 November 23 2015 11 15

Under these circumstances one of the most pressing priorities is ardquobroad-and-thinrdquoapproach to digitization the collection of essential specimen-level data allowing complete col-lection audit and providing a specimen level platform to add metadata in future At a minimumthis includes the specimens determination (ie taxon name) and its physical location withinthe collection Inselect has proven to be a very useful tool for the most challenging parts of theNHMrsquos collections such as pinned insects and it can be easily applied to other areas for exam-ple environmental studies and quantitative analysis of trap samples (eg ldquoinvertebrate soupsrdquoor sticky traps see S8 Fig)

In the course of the NHM Slide Digitization Pilot project which will digitize 100000microscope slides in eight months Inselect received extensive user acceptance testing Resultsto date show that throughput is as high as 5000 slides per day per person for processingmulti-slide images through Inselect This includes tasks associated with image segmentation

Fig 7 Segmented image containing overlapping specimens Example of segmented image of Mycalesina (a group of butterflies)

doi101371journalpone0143402g007

Inselect Automating the Digitization of Natural History Collections

PLOS ONE | DOI101371journalpone0143402 November 23 2015 12 15

and refinement barcode recognition and association with minimal metadata (taxon name andphysical location in the collection) Upon completion of the project the entire set of time andmotion studies alongside the associated workflow will be described in a separate publicationDrawers of curated pinned insects as a rule do not require additional preparation thereforethe imaging output can be up to 70 SatScan images per day resulting in 3500ndash70000 speci-mens per day available for Inselect

Future developmentsSegmentation algorithms The high throughput of mass-digitization activities makes it

important to minimize the amount of manual intervention required The NHMrsquos SatScaninstrument can generate up to 70 multi-specimen images per day The median user-timerequired to refine segmented images of 109 s (S5 Fig) means that 70 x 109 3600 = 21 person-hours could be required to refine bounding boxes for a dayrsquos worth of images from a SatScanmachine In the worst case (413 s) this refinement time increases to more than eight hoursTherefore segmentation algorithms should be as accurate as possible and we suggest that thereis a need for a formal method (and supporting software) that allows segmentation methodsand their associated parameter sets to be scored and ranked Such a score should consider per-formance false positives and false negatives The outputs of such an activity might be a libraryof algorithms andor parameter sets geared towards different specimen types The dataset of804 images used in the present work (available at httpdxdoiorg1055190018537) consti-tutes a benchmark dataset against which segmentation algorithms can be measured Inselectrsquosmodular architecture and its provision of plugins make it a suitable platform for such aninvestigation

Desktop application The desktop application lacks some of the polish that is expected ofmodern software such as lsquoundorsquo and localization Other desirable features include the ability tofilter and order bounding boxes by size andor area in order to aid refinement support forExchangeable Image File Format (EXIF) tags and integration with industry-standard imageprocessing tools such as Adobe Photoshop (httpwwwadobecomproductsphotoshophtml)The plugin architecture makes a possible range of developments such as additional segmenta-tion algorithms and optical character recognition of label text within bounding boxes

Inselect has been tested using specimens from the NHMrsquos entomological and micropalaeon-tological collections as well as a limited number of specimens from Continental European col-lections We would like to test the software against a greater diversity of museum specimensand institutions to ensure that it can accommodate variation in the storage and mounting ofthese specimens

Despite these limitations Inselect represents a substantial contribution to the tools availableto support mass-digitization of natural history collections The desktop application and itsassociated command-line tools have been designed to efficiently handle the high numbers oflarge image files produced by mass-digitization activities The combination of a modular archi-tecture desktop application and scriptable technology makes it a relatively simple task to inte-grate Inselect into existing and workflows Bug reports feature requests and ideas can beviewed and created at httpsgithubcomNaturalHistoryMuseuminselectissues We areactively developing Inselect and we greatly value all comments and suggestions

Supporting InformationS1 Fig A cropped specimen image containing a barcode A scan of a microscope slide thatcontains a Data Matrix barcode(TIFF)

Inselect Automating the Digitization of Natural History Collections

PLOS ONE | DOI101371journalpone0143402 November 23 2015 13 15

S2 Fig Thumbnail file size against file size Numbers in brackets in the legend are the num-ber of files in that group(EPS)

S3 Fig Ingestion time against file size Ingestion time is the time taken to read the scannedimage and save a JPEG thumbnail image of 4096 pixels in width Numbers in brackets in thelegend are the number of files in that group(EPS)

S4 Fig Segmentation time against number of boxes found Numbers in brackets in the leg-end are the number of files in that group(EPS)

S5 Fig Distribution of the time to refine 30 segmented images For 30 images picked at ran-dom(EPS)

S6 Fig The actual number of specimens against the number of boxes detected by segmenta-tion For 30 images picked at random(EPS)

S7 Fig Imaging digitization workflow a Original single-specimen imaging and data capture(after Nelson 2012 modified) b multi-specimen imaging and data capture(EPS)

S8 Fig Alternative use of Inselect for trap sample Count estimation (1064 specimens) usingInselect on an image of a yellow sticky trap used in an environmental assessment study(TIFF)

AcknowledgmentsWe thank Francois Malan for the modified ldquoDivide Scanned Imagesrdquo software (httpfrancoismalancom201301how-to-batch-separate-crop-multiple-scanned-photos) whichprovided the inspiration for developing Inselect Gabriel Brostow and Michael Terry for helpfuldiscussions about computer vision Ben Scott for advice on sustainable software developmentJoumlrg Holetschek and Gwenaeumll Le Bras for application testing and feedback NHM curatorsBlanca Huertas and Geoff Martin for access to images from the Lepidoptera collections volun-teers Robyn Crowther Alexander Esin James Fage Sophie Ledger and Gabriela Montejo fordevoting their time to digitize the test specimens Duncan Sivell for the yellow sticky trapimage used in S8 Fig

Author ContributionsConceived and designed the experiments LNH Performed the experiments LNH Analyzedthe data LNH Contributed reagentsmaterialsanalysis tools VB BWP Wrote the paper LNHVB AH PH LL BWP SVDWVSS Identified the need for the application VB LL BWP VSSWrote the segmentation algorithm and designed and wrote the prototype desktop applicationPH SVDW Worked on the prototype application AH Re-engineered the application LNHContributed towards the design of the desktop application and workflow tools LNH VB AHPH LL BWP SVDWVSS

Inselect Automating the Digitization of Natural History Collections

PLOS ONE | DOI101371journalpone0143402 November 23 2015 14 15

References1 Arintildeo AH Approaches to estimating the universe of natural history collections data Biodiversity Infor-

matics 2010 7 81ndash92 doi 1017161biv7i23991

2 GrahamCH Ferrier S Huettman F Moritz C Peterson AT New developments in museum-based infor-matics and applications in biodiversity analysis Trends Ecol Evol 2004 19 497ndash503 doi 101016jtree200407006 PMID 16701313

3 Suarez AV Tsutsui ND The value of museum collections for research and society BioScience 200454 66ndash74 doi 1016410006-3568(2004)054[0066TVOMCF]20CO2

4 Lister A M Brooks SJ Fenberg PB Glover AG James KE Johnson KG et al Natural history collec-tions as sources of long-term datasets Trends Ecol Evol 2011 26 153ndash154 doi 101016jtree201012009 PMID 21255862

5 Rocha LA Aleixo A Allen G Almeda F Baldwin CC Barclay MV et al Specimen collection an essen-tial tool Science 2014 344 814ndash5 doi 101126science3446186814 PMID 24855245

6 Beck J Kitching IJ Estimating regional species richness of tropical insects frommuseum data a com-parison of a geography-based and sample-based methods J Appl Ecol 2007 44 672ndash681 doi 101111j1365-2664200701291x

7 Newbold T Applications and limitations of museum data for conservation and ecology with particularattention to species distribution models Prog Phys Geog 2010 34 3ndash22 doi 1011770309133309355630

8 Cheng TL Rovito SM Wake DB Vredenburg VT Coincident mass extirpation of neotropical amphibi-ans with the emergence of the infectious fungal pathogen Batrachochytrium dendrobatidis Proc NatlAcad Sci USA 2011 108 9502ndash9507 doi 101073pnas1105538108 PMID 21543713

9 Brooks SJ Self A Toloni F Sparks T Natural history museum collections provide information on phe-nological change in British butterflies since the late-nineteenth century Int J Biometeorol 2014 581749ndash1758 doi 101007s00484-013-0780-6 PMID 24429705

10 Beaman RS Cellinese N Mass digitization of scientific collections new opportunities to transform theuse of biological specimens and underwrite biodiversity science In Blagoderov V Smith VS editorsNo specimen left behind mass digitization of natural history collections ZooKeys 2012 209 7ndash17doi 103897zookeys2093313

11 Blagoderov V Kitching IJ Livermore L Simonsen TJ Smith VS No specimen left behind industrialscale digitization of natural history collections In Blagoderov V Smith VS editors No specimen leftbehind mass digitization of natural history collections ZooKeys 2012 209 133ndash146 doi 103897zookeys2093178

12 Nelson G Paul D Riccardi G Mast AR Five task clusters that enable efficient and effective digitizationof biological collections In Blagoderov V Smith VS editors No specimen left behind mass digitizationof natural history collections ZooKeys 2012 209 19ndash45 doi 103897zookeys2093135

13 Holovachov O Zatushevsky A Shydlovsky I Whole-drawer imaging of entomological collections ben-efits limitations and alternative applications J of Conserv and Mus Stud 2014 12 1ndash13 doi 105334jcms1021218

14 Mantle BL La Salle J Fisher N Whole-drawer imaging for digital management and curation of a largeentomological collection In Blagoderov V Smith VS editors No specimen left behind mass digitiza-tion of natural history collections ZooKeys 2012 209 147ndash163 doi 103897zookeys2093169

15 Dietrich CH Hart J Raila D Ravaioli U Sobh N Sobh O et al InvertNet a new paradigm for digitalaccess to invertebrate collections In Blagoderov V Smith VS editors No specimen left behind massdigitization of natural history collections ZooKeys 2012 209 165ndash181 doi 103897zookeys2093571

16 Schmidt S Balke M Lafogler S DScanmdasha high-performance digital scanning system for entomologicalcollections In Blagoderov V Smith VS editors No specimen left behind mass digitization of naturalhistory collections ZooKeys 2012 209 183ndash191 doi 103897zookeys2093115

17 Wieczorek J Bloom D Guralnick R Blum S Doumlring M Giovanni R et al Darwin Core An EvolvingCommunity-Developed Biodiversity Data Standard PLoS ONE 2012 7(1) e29715 doi 101371journalpone0029715 PMID 22238640

Inselect Automating the Digitization of Natural History Collections

PLOS ONE | DOI101371journalpone0143402 November 23 2015 15 15

We refined the bounding boxes of 30 images picked at random The median time to refine was1085 s (minimum time 85s maximum time 4130s S5 Fig) and the number of boundingboxes detected by segmentation was not a good predictor of the actual number of specimens(S6 Fig)

Discussion

Ingestion and segmentation performanceIt took just 2 frac12 hours to ingest and segment just over 800 images which represents more thantwice the weekly output of a SatScan machine running at full capacity Some images containedoverlapping specimens (eg Fig 7) Not only are such images challenging for any segmentationalgorithm but the resulting cropped specimen images are of questionable use arguably thesedrawers should be re-curated and re-imaged As might be expected given the way that JPEGcompression works thumbnail size is a function of image complexity rather than size of the

Fig 4 Segmented image of moth specimens Example of a segmented image of Chalcosiinae (a subfamily of moths) specimens

doi101371journalpone0143402g004

Inselect Automating the Digitization of Natural History Collections

PLOS ONE | DOI101371journalpone0143402 November 23 2015 9 15

full-resolution scanned image (S2 Fig) We expected time to ingest (read full-resolution TIFFresize in memory and write JPEG thumbnail) to scale linearly with the size of the full-resolu-tion image file Variation in ingestion time (S3 Fig) could be due to blocked PC resources(CPU RAM hard-disk) but the PC used to carry out the tests has a high specification and theimage files were on their own physical hard disk that was not being used for other tasks JPEGcompression speed is more likely to explain the variation given that this is correlated withimage complexity and that this complexity is highly variable across each drawer

Fig 5 Segmented image of beetle specimens Example of a segmented image of Lucanidae (stag beetles) specimens

doi101371journalpone0143402g005

Inselect Automating the Digitization of Natural History Collections

PLOS ONE | DOI101371journalpone0143402 November 23 2015 10 15

Application in natural history collection digitization workflowsNelson et al [12] described three dominant digitization workflows for natural history collec-tions (1) data capture with occasional specimen imaging (2) parallel data and specimen imagecapture and (3) imaging of specimens and labels followed by data capture from the image Weconsider the third workflow as the most efficient process for mass digitization of very large col-lections This allows operators to perform simultaneous image and data capture for multiplespecimens thus significantly increasing throughput (S7 Fig)

Inselect was developed primarily to suit the needs of the mass digitization program withinthe Natural History Museum and can we hope also be used by the many organisations withcollections that share the following characteristics

bull extremely large size

bull reasonably complete taxonomic index (list of taxa represented in the collection)

bull complete record of collection lots (ie multi-specimen and mixed taxon collections) and

bull very low percentage of specimen-level records

Fig 6 Segmented image of microscope slides Example of a segmented image of microscope slides of benthic Foraminifera in a rock thin section

doi101371journalpone0143402g006

Inselect Automating the Digitization of Natural History Collections

PLOS ONE | DOI101371journalpone0143402 November 23 2015 11 15

Under these circumstances one of the most pressing priorities is ardquobroad-and-thinrdquoapproach to digitization the collection of essential specimen-level data allowing complete col-lection audit and providing a specimen level platform to add metadata in future At a minimumthis includes the specimens determination (ie taxon name) and its physical location withinthe collection Inselect has proven to be a very useful tool for the most challenging parts of theNHMrsquos collections such as pinned insects and it can be easily applied to other areas for exam-ple environmental studies and quantitative analysis of trap samples (eg ldquoinvertebrate soupsrdquoor sticky traps see S8 Fig)

In the course of the NHM Slide Digitization Pilot project which will digitize 100000microscope slides in eight months Inselect received extensive user acceptance testing Resultsto date show that throughput is as high as 5000 slides per day per person for processingmulti-slide images through Inselect This includes tasks associated with image segmentation

Fig 7 Segmented image containing overlapping specimens Example of segmented image of Mycalesina (a group of butterflies)

doi101371journalpone0143402g007

Inselect Automating the Digitization of Natural History Collections

PLOS ONE | DOI101371journalpone0143402 November 23 2015 12 15

and refinement barcode recognition and association with minimal metadata (taxon name andphysical location in the collection) Upon completion of the project the entire set of time andmotion studies alongside the associated workflow will be described in a separate publicationDrawers of curated pinned insects as a rule do not require additional preparation thereforethe imaging output can be up to 70 SatScan images per day resulting in 3500ndash70000 speci-mens per day available for Inselect

Future developmentsSegmentation algorithms The high throughput of mass-digitization activities makes it

important to minimize the amount of manual intervention required The NHMrsquos SatScaninstrument can generate up to 70 multi-specimen images per day The median user-timerequired to refine segmented images of 109 s (S5 Fig) means that 70 x 109 3600 = 21 person-hours could be required to refine bounding boxes for a dayrsquos worth of images from a SatScanmachine In the worst case (413 s) this refinement time increases to more than eight hoursTherefore segmentation algorithms should be as accurate as possible and we suggest that thereis a need for a formal method (and supporting software) that allows segmentation methodsand their associated parameter sets to be scored and ranked Such a score should consider per-formance false positives and false negatives The outputs of such an activity might be a libraryof algorithms andor parameter sets geared towards different specimen types The dataset of804 images used in the present work (available at httpdxdoiorg1055190018537) consti-tutes a benchmark dataset against which segmentation algorithms can be measured Inselectrsquosmodular architecture and its provision of plugins make it a suitable platform for such aninvestigation

Desktop application The desktop application lacks some of the polish that is expected ofmodern software such as lsquoundorsquo and localization Other desirable features include the ability tofilter and order bounding boxes by size andor area in order to aid refinement support forExchangeable Image File Format (EXIF) tags and integration with industry-standard imageprocessing tools such as Adobe Photoshop (httpwwwadobecomproductsphotoshophtml)The plugin architecture makes a possible range of developments such as additional segmenta-tion algorithms and optical character recognition of label text within bounding boxes

Inselect has been tested using specimens from the NHMrsquos entomological and micropalaeon-tological collections as well as a limited number of specimens from Continental European col-lections We would like to test the software against a greater diversity of museum specimensand institutions to ensure that it can accommodate variation in the storage and mounting ofthese specimens

Despite these limitations Inselect represents a substantial contribution to the tools availableto support mass-digitization of natural history collections The desktop application and itsassociated command-line tools have been designed to efficiently handle the high numbers oflarge image files produced by mass-digitization activities The combination of a modular archi-tecture desktop application and scriptable technology makes it a relatively simple task to inte-grate Inselect into existing and workflows Bug reports feature requests and ideas can beviewed and created at httpsgithubcomNaturalHistoryMuseuminselectissues We areactively developing Inselect and we greatly value all comments and suggestions

Supporting InformationS1 Fig A cropped specimen image containing a barcode A scan of a microscope slide thatcontains a Data Matrix barcode(TIFF)

Inselect Automating the Digitization of Natural History Collections

PLOS ONE | DOI101371journalpone0143402 November 23 2015 13 15

S2 Fig Thumbnail file size against file size Numbers in brackets in the legend are the num-ber of files in that group(EPS)

S3 Fig Ingestion time against file size Ingestion time is the time taken to read the scannedimage and save a JPEG thumbnail image of 4096 pixels in width Numbers in brackets in thelegend are the number of files in that group(EPS)

S4 Fig Segmentation time against number of boxes found Numbers in brackets in the leg-end are the number of files in that group(EPS)

S5 Fig Distribution of the time to refine 30 segmented images For 30 images picked at ran-dom(EPS)

S6 Fig The actual number of specimens against the number of boxes detected by segmenta-tion For 30 images picked at random(EPS)

S7 Fig Imaging digitization workflow a Original single-specimen imaging and data capture(after Nelson 2012 modified) b multi-specimen imaging and data capture(EPS)

S8 Fig Alternative use of Inselect for trap sample Count estimation (1064 specimens) usingInselect on an image of a yellow sticky trap used in an environmental assessment study(TIFF)

AcknowledgmentsWe thank Francois Malan for the modified ldquoDivide Scanned Imagesrdquo software (httpfrancoismalancom201301how-to-batch-separate-crop-multiple-scanned-photos) whichprovided the inspiration for developing Inselect Gabriel Brostow and Michael Terry for helpfuldiscussions about computer vision Ben Scott for advice on sustainable software developmentJoumlrg Holetschek and Gwenaeumll Le Bras for application testing and feedback NHM curatorsBlanca Huertas and Geoff Martin for access to images from the Lepidoptera collections volun-teers Robyn Crowther Alexander Esin James Fage Sophie Ledger and Gabriela Montejo fordevoting their time to digitize the test specimens Duncan Sivell for the yellow sticky trapimage used in S8 Fig

Author ContributionsConceived and designed the experiments LNH Performed the experiments LNH Analyzedthe data LNH Contributed reagentsmaterialsanalysis tools VB BWP Wrote the paper LNHVB AH PH LL BWP SVDWVSS Identified the need for the application VB LL BWP VSSWrote the segmentation algorithm and designed and wrote the prototype desktop applicationPH SVDW Worked on the prototype application AH Re-engineered the application LNHContributed towards the design of the desktop application and workflow tools LNH VB AHPH LL BWP SVDWVSS

Inselect Automating the Digitization of Natural History Collections

PLOS ONE | DOI101371journalpone0143402 November 23 2015 14 15

References1 Arintildeo AH Approaches to estimating the universe of natural history collections data Biodiversity Infor-

matics 2010 7 81ndash92 doi 1017161biv7i23991

2 GrahamCH Ferrier S Huettman F Moritz C Peterson AT New developments in museum-based infor-matics and applications in biodiversity analysis Trends Ecol Evol 2004 19 497ndash503 doi 101016jtree200407006 PMID 16701313

3 Suarez AV Tsutsui ND The value of museum collections for research and society BioScience 200454 66ndash74 doi 1016410006-3568(2004)054[0066TVOMCF]20CO2

4 Lister A M Brooks SJ Fenberg PB Glover AG James KE Johnson KG et al Natural history collec-tions as sources of long-term datasets Trends Ecol Evol 2011 26 153ndash154 doi 101016jtree201012009 PMID 21255862

5 Rocha LA Aleixo A Allen G Almeda F Baldwin CC Barclay MV et al Specimen collection an essen-tial tool Science 2014 344 814ndash5 doi 101126science3446186814 PMID 24855245

6 Beck J Kitching IJ Estimating regional species richness of tropical insects frommuseum data a com-parison of a geography-based and sample-based methods J Appl Ecol 2007 44 672ndash681 doi 101111j1365-2664200701291x

7 Newbold T Applications and limitations of museum data for conservation and ecology with particularattention to species distribution models Prog Phys Geog 2010 34 3ndash22 doi 1011770309133309355630

8 Cheng TL Rovito SM Wake DB Vredenburg VT Coincident mass extirpation of neotropical amphibi-ans with the emergence of the infectious fungal pathogen Batrachochytrium dendrobatidis Proc NatlAcad Sci USA 2011 108 9502ndash9507 doi 101073pnas1105538108 PMID 21543713

9 Brooks SJ Self A Toloni F Sparks T Natural history museum collections provide information on phe-nological change in British butterflies since the late-nineteenth century Int J Biometeorol 2014 581749ndash1758 doi 101007s00484-013-0780-6 PMID 24429705

10 Beaman RS Cellinese N Mass digitization of scientific collections new opportunities to transform theuse of biological specimens and underwrite biodiversity science In Blagoderov V Smith VS editorsNo specimen left behind mass digitization of natural history collections ZooKeys 2012 209 7ndash17doi 103897zookeys2093313

11 Blagoderov V Kitching IJ Livermore L Simonsen TJ Smith VS No specimen left behind industrialscale digitization of natural history collections In Blagoderov V Smith VS editors No specimen leftbehind mass digitization of natural history collections ZooKeys 2012 209 133ndash146 doi 103897zookeys2093178

12 Nelson G Paul D Riccardi G Mast AR Five task clusters that enable efficient and effective digitizationof biological collections In Blagoderov V Smith VS editors No specimen left behind mass digitizationof natural history collections ZooKeys 2012 209 19ndash45 doi 103897zookeys2093135

13 Holovachov O Zatushevsky A Shydlovsky I Whole-drawer imaging of entomological collections ben-efits limitations and alternative applications J of Conserv and Mus Stud 2014 12 1ndash13 doi 105334jcms1021218

14 Mantle BL La Salle J Fisher N Whole-drawer imaging for digital management and curation of a largeentomological collection In Blagoderov V Smith VS editors No specimen left behind mass digitiza-tion of natural history collections ZooKeys 2012 209 147ndash163 doi 103897zookeys2093169

15 Dietrich CH Hart J Raila D Ravaioli U Sobh N Sobh O et al InvertNet a new paradigm for digitalaccess to invertebrate collections In Blagoderov V Smith VS editors No specimen left behind massdigitization of natural history collections ZooKeys 2012 209 165ndash181 doi 103897zookeys2093571

16 Schmidt S Balke M Lafogler S DScanmdasha high-performance digital scanning system for entomologicalcollections In Blagoderov V Smith VS editors No specimen left behind mass digitization of naturalhistory collections ZooKeys 2012 209 183ndash191 doi 103897zookeys2093115

17 Wieczorek J Bloom D Guralnick R Blum S Doumlring M Giovanni R et al Darwin Core An EvolvingCommunity-Developed Biodiversity Data Standard PLoS ONE 2012 7(1) e29715 doi 101371journalpone0029715 PMID 22238640

Inselect Automating the Digitization of Natural History Collections

PLOS ONE | DOI101371journalpone0143402 November 23 2015 15 15

full-resolution scanned image (S2 Fig) We expected time to ingest (read full-resolution TIFFresize in memory and write JPEG thumbnail) to scale linearly with the size of the full-resolu-tion image file Variation in ingestion time (S3 Fig) could be due to blocked PC resources(CPU RAM hard-disk) but the PC used to carry out the tests has a high specification and theimage files were on their own physical hard disk that was not being used for other tasks JPEGcompression speed is more likely to explain the variation given that this is correlated withimage complexity and that this complexity is highly variable across each drawer

Fig 5 Segmented image of beetle specimens Example of a segmented image of Lucanidae (stag beetles) specimens

doi101371journalpone0143402g005

Inselect Automating the Digitization of Natural History Collections

PLOS ONE | DOI101371journalpone0143402 November 23 2015 10 15

Application in natural history collection digitization workflowsNelson et al [12] described three dominant digitization workflows for natural history collec-tions (1) data capture with occasional specimen imaging (2) parallel data and specimen imagecapture and (3) imaging of specimens and labels followed by data capture from the image Weconsider the third workflow as the most efficient process for mass digitization of very large col-lections This allows operators to perform simultaneous image and data capture for multiplespecimens thus significantly increasing throughput (S7 Fig)

Inselect was developed primarily to suit the needs of the mass digitization program withinthe Natural History Museum and can we hope also be used by the many organisations withcollections that share the following characteristics

bull extremely large size

bull reasonably complete taxonomic index (list of taxa represented in the collection)

bull complete record of collection lots (ie multi-specimen and mixed taxon collections) and

bull very low percentage of specimen-level records

Fig 6 Segmented image of microscope slides Example of a segmented image of microscope slides of benthic Foraminifera in a rock thin section

doi101371journalpone0143402g006

Inselect Automating the Digitization of Natural History Collections

PLOS ONE | DOI101371journalpone0143402 November 23 2015 11 15

Under these circumstances one of the most pressing priorities is ardquobroad-and-thinrdquoapproach to digitization the collection of essential specimen-level data allowing complete col-lection audit and providing a specimen level platform to add metadata in future At a minimumthis includes the specimens determination (ie taxon name) and its physical location withinthe collection Inselect has proven to be a very useful tool for the most challenging parts of theNHMrsquos collections such as pinned insects and it can be easily applied to other areas for exam-ple environmental studies and quantitative analysis of trap samples (eg ldquoinvertebrate soupsrdquoor sticky traps see S8 Fig)

In the course of the NHM Slide Digitization Pilot project which will digitize 100000microscope slides in eight months Inselect received extensive user acceptance testing Resultsto date show that throughput is as high as 5000 slides per day per person for processingmulti-slide images through Inselect This includes tasks associated with image segmentation

Fig 7 Segmented image containing overlapping specimens Example of segmented image of Mycalesina (a group of butterflies)

doi101371journalpone0143402g007

Inselect Automating the Digitization of Natural History Collections

PLOS ONE | DOI101371journalpone0143402 November 23 2015 12 15

and refinement barcode recognition and association with minimal metadata (taxon name andphysical location in the collection) Upon completion of the project the entire set of time andmotion studies alongside the associated workflow will be described in a separate publicationDrawers of curated pinned insects as a rule do not require additional preparation thereforethe imaging output can be up to 70 SatScan images per day resulting in 3500ndash70000 speci-mens per day available for Inselect

Future developmentsSegmentation algorithms The high throughput of mass-digitization activities makes it

important to minimize the amount of manual intervention required The NHMrsquos SatScaninstrument can generate up to 70 multi-specimen images per day The median user-timerequired to refine segmented images of 109 s (S5 Fig) means that 70 x 109 3600 = 21 person-hours could be required to refine bounding boxes for a dayrsquos worth of images from a SatScanmachine In the worst case (413 s) this refinement time increases to more than eight hoursTherefore segmentation algorithms should be as accurate as possible and we suggest that thereis a need for a formal method (and supporting software) that allows segmentation methodsand their associated parameter sets to be scored and ranked Such a score should consider per-formance false positives and false negatives The outputs of such an activity might be a libraryof algorithms andor parameter sets geared towards different specimen types The dataset of804 images used in the present work (available at httpdxdoiorg1055190018537) consti-tutes a benchmark dataset against which segmentation algorithms can be measured Inselectrsquosmodular architecture and its provision of plugins make it a suitable platform for such aninvestigation

Desktop application The desktop application lacks some of the polish that is expected ofmodern software such as lsquoundorsquo and localization Other desirable features include the ability tofilter and order bounding boxes by size andor area in order to aid refinement support forExchangeable Image File Format (EXIF) tags and integration with industry-standard imageprocessing tools such as Adobe Photoshop (httpwwwadobecomproductsphotoshophtml)The plugin architecture makes a possible range of developments such as additional segmenta-tion algorithms and optical character recognition of label text within bounding boxes

Inselect has been tested using specimens from the NHMrsquos entomological and micropalaeon-tological collections as well as a limited number of specimens from Continental European col-lections We would like to test the software against a greater diversity of museum specimensand institutions to ensure that it can accommodate variation in the storage and mounting ofthese specimens

Despite these limitations Inselect represents a substantial contribution to the tools availableto support mass-digitization of natural history collections The desktop application and itsassociated command-line tools have been designed to efficiently handle the high numbers oflarge image files produced by mass-digitization activities The combination of a modular archi-tecture desktop application and scriptable technology makes it a relatively simple task to inte-grate Inselect into existing and workflows Bug reports feature requests and ideas can beviewed and created at httpsgithubcomNaturalHistoryMuseuminselectissues We areactively developing Inselect and we greatly value all comments and suggestions

Supporting InformationS1 Fig A cropped specimen image containing a barcode A scan of a microscope slide thatcontains a Data Matrix barcode(TIFF)

Inselect Automating the Digitization of Natural History Collections

PLOS ONE | DOI101371journalpone0143402 November 23 2015 13 15

S2 Fig Thumbnail file size against file size Numbers in brackets in the legend are the num-ber of files in that group(EPS)

S3 Fig Ingestion time against file size Ingestion time is the time taken to read the scannedimage and save a JPEG thumbnail image of 4096 pixels in width Numbers in brackets in thelegend are the number of files in that group(EPS)

S4 Fig Segmentation time against number of boxes found Numbers in brackets in the leg-end are the number of files in that group(EPS)

S5 Fig Distribution of the time to refine 30 segmented images For 30 images picked at ran-dom(EPS)

S6 Fig The actual number of specimens against the number of boxes detected by segmenta-tion For 30 images picked at random(EPS)

S7 Fig Imaging digitization workflow a Original single-specimen imaging and data capture(after Nelson 2012 modified) b multi-specimen imaging and data capture(EPS)

S8 Fig Alternative use of Inselect for trap sample Count estimation (1064 specimens) usingInselect on an image of a yellow sticky trap used in an environmental assessment study(TIFF)

AcknowledgmentsWe thank Francois Malan for the modified ldquoDivide Scanned Imagesrdquo software (httpfrancoismalancom201301how-to-batch-separate-crop-multiple-scanned-photos) whichprovided the inspiration for developing Inselect Gabriel Brostow and Michael Terry for helpfuldiscussions about computer vision Ben Scott for advice on sustainable software developmentJoumlrg Holetschek and Gwenaeumll Le Bras for application testing and feedback NHM curatorsBlanca Huertas and Geoff Martin for access to images from the Lepidoptera collections volun-teers Robyn Crowther Alexander Esin James Fage Sophie Ledger and Gabriela Montejo fordevoting their time to digitize the test specimens Duncan Sivell for the yellow sticky trapimage used in S8 Fig

Author ContributionsConceived and designed the experiments LNH Performed the experiments LNH Analyzedthe data LNH Contributed reagentsmaterialsanalysis tools VB BWP Wrote the paper LNHVB AH PH LL BWP SVDWVSS Identified the need for the application VB LL BWP VSSWrote the segmentation algorithm and designed and wrote the prototype desktop applicationPH SVDW Worked on the prototype application AH Re-engineered the application LNHContributed towards the design of the desktop application and workflow tools LNH VB AHPH LL BWP SVDWVSS

Inselect Automating the Digitization of Natural History Collections

PLOS ONE | DOI101371journalpone0143402 November 23 2015 14 15

References1 Arintildeo AH Approaches to estimating the universe of natural history collections data Biodiversity Infor-

matics 2010 7 81ndash92 doi 1017161biv7i23991

2 GrahamCH Ferrier S Huettman F Moritz C Peterson AT New developments in museum-based infor-matics and applications in biodiversity analysis Trends Ecol Evol 2004 19 497ndash503 doi 101016jtree200407006 PMID 16701313

3 Suarez AV Tsutsui ND The value of museum collections for research and society BioScience 200454 66ndash74 doi 1016410006-3568(2004)054[0066TVOMCF]20CO2

4 Lister A M Brooks SJ Fenberg PB Glover AG James KE Johnson KG et al Natural history collec-tions as sources of long-term datasets Trends Ecol Evol 2011 26 153ndash154 doi 101016jtree201012009 PMID 21255862

5 Rocha LA Aleixo A Allen G Almeda F Baldwin CC Barclay MV et al Specimen collection an essen-tial tool Science 2014 344 814ndash5 doi 101126science3446186814 PMID 24855245

6 Beck J Kitching IJ Estimating regional species richness of tropical insects frommuseum data a com-parison of a geography-based and sample-based methods J Appl Ecol 2007 44 672ndash681 doi 101111j1365-2664200701291x

7 Newbold T Applications and limitations of museum data for conservation and ecology with particularattention to species distribution models Prog Phys Geog 2010 34 3ndash22 doi 1011770309133309355630

8 Cheng TL Rovito SM Wake DB Vredenburg VT Coincident mass extirpation of neotropical amphibi-ans with the emergence of the infectious fungal pathogen Batrachochytrium dendrobatidis Proc NatlAcad Sci USA 2011 108 9502ndash9507 doi 101073pnas1105538108 PMID 21543713

9 Brooks SJ Self A Toloni F Sparks T Natural history museum collections provide information on phe-nological change in British butterflies since the late-nineteenth century Int J Biometeorol 2014 581749ndash1758 doi 101007s00484-013-0780-6 PMID 24429705

10 Beaman RS Cellinese N Mass digitization of scientific collections new opportunities to transform theuse of biological specimens and underwrite biodiversity science In Blagoderov V Smith VS editorsNo specimen left behind mass digitization of natural history collections ZooKeys 2012 209 7ndash17doi 103897zookeys2093313

11 Blagoderov V Kitching IJ Livermore L Simonsen TJ Smith VS No specimen left behind industrialscale digitization of natural history collections In Blagoderov V Smith VS editors No specimen leftbehind mass digitization of natural history collections ZooKeys 2012 209 133ndash146 doi 103897zookeys2093178

12 Nelson G Paul D Riccardi G Mast AR Five task clusters that enable efficient and effective digitizationof biological collections In Blagoderov V Smith VS editors No specimen left behind mass digitizationof natural history collections ZooKeys 2012 209 19ndash45 doi 103897zookeys2093135

13 Holovachov O Zatushevsky A Shydlovsky I Whole-drawer imaging of entomological collections ben-efits limitations and alternative applications J of Conserv and Mus Stud 2014 12 1ndash13 doi 105334jcms1021218

14 Mantle BL La Salle J Fisher N Whole-drawer imaging for digital management and curation of a largeentomological collection In Blagoderov V Smith VS editors No specimen left behind mass digitiza-tion of natural history collections ZooKeys 2012 209 147ndash163 doi 103897zookeys2093169

15 Dietrich CH Hart J Raila D Ravaioli U Sobh N Sobh O et al InvertNet a new paradigm for digitalaccess to invertebrate collections In Blagoderov V Smith VS editors No specimen left behind massdigitization of natural history collections ZooKeys 2012 209 165ndash181 doi 103897zookeys2093571

16 Schmidt S Balke M Lafogler S DScanmdasha high-performance digital scanning system for entomologicalcollections In Blagoderov V Smith VS editors No specimen left behind mass digitization of naturalhistory collections ZooKeys 2012 209 183ndash191 doi 103897zookeys2093115

17 Wieczorek J Bloom D Guralnick R Blum S Doumlring M Giovanni R et al Darwin Core An EvolvingCommunity-Developed Biodiversity Data Standard PLoS ONE 2012 7(1) e29715 doi 101371journalpone0029715 PMID 22238640

Inselect Automating the Digitization of Natural History Collections

PLOS ONE | DOI101371journalpone0143402 November 23 2015 15 15

Application in natural history collection digitization workflowsNelson et al [12] described three dominant digitization workflows for natural history collec-tions (1) data capture with occasional specimen imaging (2) parallel data and specimen imagecapture and (3) imaging of specimens and labels followed by data capture from the image Weconsider the third workflow as the most efficient process for mass digitization of very large col-lections This allows operators to perform simultaneous image and data capture for multiplespecimens thus significantly increasing throughput (S7 Fig)

Inselect was developed primarily to suit the needs of the mass digitization program withinthe Natural History Museum and can we hope also be used by the many organisations withcollections that share the following characteristics

bull extremely large size

bull reasonably complete taxonomic index (list of taxa represented in the collection)

bull complete record of collection lots (ie multi-specimen and mixed taxon collections) and

bull very low percentage of specimen-level records

Fig 6 Segmented image of microscope slides Example of a segmented image of microscope slides of benthic Foraminifera in a rock thin section

doi101371journalpone0143402g006

Inselect Automating the Digitization of Natural History Collections

PLOS ONE | DOI101371journalpone0143402 November 23 2015 11 15

Under these circumstances one of the most pressing priorities is ardquobroad-and-thinrdquoapproach to digitization the collection of essential specimen-level data allowing complete col-lection audit and providing a specimen level platform to add metadata in future At a minimumthis includes the specimens determination (ie taxon name) and its physical location withinthe collection Inselect has proven to be a very useful tool for the most challenging parts of theNHMrsquos collections such as pinned insects and it can be easily applied to other areas for exam-ple environmental studies and quantitative analysis of trap samples (eg ldquoinvertebrate soupsrdquoor sticky traps see S8 Fig)

In the course of the NHM Slide Digitization Pilot project which will digitize 100000microscope slides in eight months Inselect received extensive user acceptance testing Resultsto date show that throughput is as high as 5000 slides per day per person for processingmulti-slide images through Inselect This includes tasks associated with image segmentation

Fig 7 Segmented image containing overlapping specimens Example of segmented image of Mycalesina (a group of butterflies)

doi101371journalpone0143402g007

Inselect Automating the Digitization of Natural History Collections

PLOS ONE | DOI101371journalpone0143402 November 23 2015 12 15

and refinement barcode recognition and association with minimal metadata (taxon name andphysical location in the collection) Upon completion of the project the entire set of time andmotion studies alongside the associated workflow will be described in a separate publicationDrawers of curated pinned insects as a rule do not require additional preparation thereforethe imaging output can be up to 70 SatScan images per day resulting in 3500ndash70000 speci-mens per day available for Inselect

Future developmentsSegmentation algorithms The high throughput of mass-digitization activities makes it

important to minimize the amount of manual intervention required The NHMrsquos SatScaninstrument can generate up to 70 multi-specimen images per day The median user-timerequired to refine segmented images of 109 s (S5 Fig) means that 70 x 109 3600 = 21 person-hours could be required to refine bounding boxes for a dayrsquos worth of images from a SatScanmachine In the worst case (413 s) this refinement time increases to more than eight hoursTherefore segmentation algorithms should be as accurate as possible and we suggest that thereis a need for a formal method (and supporting software) that allows segmentation methodsand their associated parameter sets to be scored and ranked Such a score should consider per-formance false positives and false negatives The outputs of such an activity might be a libraryof algorithms andor parameter sets geared towards different specimen types The dataset of804 images used in the present work (available at httpdxdoiorg1055190018537) consti-tutes a benchmark dataset against which segmentation algorithms can be measured Inselectrsquosmodular architecture and its provision of plugins make it a suitable platform for such aninvestigation

Desktop application The desktop application lacks some of the polish that is expected ofmodern software such as lsquoundorsquo and localization Other desirable features include the ability tofilter and order bounding boxes by size andor area in order to aid refinement support forExchangeable Image File Format (EXIF) tags and integration with industry-standard imageprocessing tools such as Adobe Photoshop (httpwwwadobecomproductsphotoshophtml)The plugin architecture makes a possible range of developments such as additional segmenta-tion algorithms and optical character recognition of label text within bounding boxes

Inselect has been tested using specimens from the NHMrsquos entomological and micropalaeon-tological collections as well as a limited number of specimens from Continental European col-lections We would like to test the software against a greater diversity of museum specimensand institutions to ensure that it can accommodate variation in the storage and mounting ofthese specimens

Despite these limitations Inselect represents a substantial contribution to the tools availableto support mass-digitization of natural history collections The desktop application and itsassociated command-line tools have been designed to efficiently handle the high numbers oflarge image files produced by mass-digitization activities The combination of a modular archi-tecture desktop application and scriptable technology makes it a relatively simple task to inte-grate Inselect into existing and workflows Bug reports feature requests and ideas can beviewed and created at httpsgithubcomNaturalHistoryMuseuminselectissues We areactively developing Inselect and we greatly value all comments and suggestions

Supporting InformationS1 Fig A cropped specimen image containing a barcode A scan of a microscope slide thatcontains a Data Matrix barcode(TIFF)

Inselect Automating the Digitization of Natural History Collections

PLOS ONE | DOI101371journalpone0143402 November 23 2015 13 15

S2 Fig Thumbnail file size against file size Numbers in brackets in the legend are the num-ber of files in that group(EPS)

S3 Fig Ingestion time against file size Ingestion time is the time taken to read the scannedimage and save a JPEG thumbnail image of 4096 pixels in width Numbers in brackets in thelegend are the number of files in that group(EPS)

S4 Fig Segmentation time against number of boxes found Numbers in brackets in the leg-end are the number of files in that group(EPS)

S5 Fig Distribution of the time to refine 30 segmented images For 30 images picked at ran-dom(EPS)

S6 Fig The actual number of specimens against the number of boxes detected by segmenta-tion For 30 images picked at random(EPS)

S7 Fig Imaging digitization workflow a Original single-specimen imaging and data capture(after Nelson 2012 modified) b multi-specimen imaging and data capture(EPS)

S8 Fig Alternative use of Inselect for trap sample Count estimation (1064 specimens) usingInselect on an image of a yellow sticky trap used in an environmental assessment study(TIFF)

AcknowledgmentsWe thank Francois Malan for the modified ldquoDivide Scanned Imagesrdquo software (httpfrancoismalancom201301how-to-batch-separate-crop-multiple-scanned-photos) whichprovided the inspiration for developing Inselect Gabriel Brostow and Michael Terry for helpfuldiscussions about computer vision Ben Scott for advice on sustainable software developmentJoumlrg Holetschek and Gwenaeumll Le Bras for application testing and feedback NHM curatorsBlanca Huertas and Geoff Martin for access to images from the Lepidoptera collections volun-teers Robyn Crowther Alexander Esin James Fage Sophie Ledger and Gabriela Montejo fordevoting their time to digitize the test specimens Duncan Sivell for the yellow sticky trapimage used in S8 Fig

Author ContributionsConceived and designed the experiments LNH Performed the experiments LNH Analyzedthe data LNH Contributed reagentsmaterialsanalysis tools VB BWP Wrote the paper LNHVB AH PH LL BWP SVDWVSS Identified the need for the application VB LL BWP VSSWrote the segmentation algorithm and designed and wrote the prototype desktop applicationPH SVDW Worked on the prototype application AH Re-engineered the application LNHContributed towards the design of the desktop application and workflow tools LNH VB AHPH LL BWP SVDWVSS

Inselect Automating the Digitization of Natural History Collections

PLOS ONE | DOI101371journalpone0143402 November 23 2015 14 15

References1 Arintildeo AH Approaches to estimating the universe of natural history collections data Biodiversity Infor-

matics 2010 7 81ndash92 doi 1017161biv7i23991

2 GrahamCH Ferrier S Huettman F Moritz C Peterson AT New developments in museum-based infor-matics and applications in biodiversity analysis Trends Ecol Evol 2004 19 497ndash503 doi 101016jtree200407006 PMID 16701313

3 Suarez AV Tsutsui ND The value of museum collections for research and society BioScience 200454 66ndash74 doi 1016410006-3568(2004)054[0066TVOMCF]20CO2

4 Lister A M Brooks SJ Fenberg PB Glover AG James KE Johnson KG et al Natural history collec-tions as sources of long-term datasets Trends Ecol Evol 2011 26 153ndash154 doi 101016jtree201012009 PMID 21255862

5 Rocha LA Aleixo A Allen G Almeda F Baldwin CC Barclay MV et al Specimen collection an essen-tial tool Science 2014 344 814ndash5 doi 101126science3446186814 PMID 24855245

6 Beck J Kitching IJ Estimating regional species richness of tropical insects frommuseum data a com-parison of a geography-based and sample-based methods J Appl Ecol 2007 44 672ndash681 doi 101111j1365-2664200701291x

7 Newbold T Applications and limitations of museum data for conservation and ecology with particularattention to species distribution models Prog Phys Geog 2010 34 3ndash22 doi 1011770309133309355630

8 Cheng TL Rovito SM Wake DB Vredenburg VT Coincident mass extirpation of neotropical amphibi-ans with the emergence of the infectious fungal pathogen Batrachochytrium dendrobatidis Proc NatlAcad Sci USA 2011 108 9502ndash9507 doi 101073pnas1105538108 PMID 21543713

9 Brooks SJ Self A Toloni F Sparks T Natural history museum collections provide information on phe-nological change in British butterflies since the late-nineteenth century Int J Biometeorol 2014 581749ndash1758 doi 101007s00484-013-0780-6 PMID 24429705

10 Beaman RS Cellinese N Mass digitization of scientific collections new opportunities to transform theuse of biological specimens and underwrite biodiversity science In Blagoderov V Smith VS editorsNo specimen left behind mass digitization of natural history collections ZooKeys 2012 209 7ndash17doi 103897zookeys2093313

11 Blagoderov V Kitching IJ Livermore L Simonsen TJ Smith VS No specimen left behind industrialscale digitization of natural history collections In Blagoderov V Smith VS editors No specimen leftbehind mass digitization of natural history collections ZooKeys 2012 209 133ndash146 doi 103897zookeys2093178

12 Nelson G Paul D Riccardi G Mast AR Five task clusters that enable efficient and effective digitizationof biological collections In Blagoderov V Smith VS editors No specimen left behind mass digitizationof natural history collections ZooKeys 2012 209 19ndash45 doi 103897zookeys2093135

13 Holovachov O Zatushevsky A Shydlovsky I Whole-drawer imaging of entomological collections ben-efits limitations and alternative applications J of Conserv and Mus Stud 2014 12 1ndash13 doi 105334jcms1021218

14 Mantle BL La Salle J Fisher N Whole-drawer imaging for digital management and curation of a largeentomological collection In Blagoderov V Smith VS editors No specimen left behind mass digitiza-tion of natural history collections ZooKeys 2012 209 147ndash163 doi 103897zookeys2093169

15 Dietrich CH Hart J Raila D Ravaioli U Sobh N Sobh O et al InvertNet a new paradigm for digitalaccess to invertebrate collections In Blagoderov V Smith VS editors No specimen left behind massdigitization of natural history collections ZooKeys 2012 209 165ndash181 doi 103897zookeys2093571

16 Schmidt S Balke M Lafogler S DScanmdasha high-performance digital scanning system for entomologicalcollections In Blagoderov V Smith VS editors No specimen left behind mass digitization of naturalhistory collections ZooKeys 2012 209 183ndash191 doi 103897zookeys2093115

17 Wieczorek J Bloom D Guralnick R Blum S Doumlring M Giovanni R et al Darwin Core An EvolvingCommunity-Developed Biodiversity Data Standard PLoS ONE 2012 7(1) e29715 doi 101371journalpone0029715 PMID 22238640

Inselect Automating the Digitization of Natural History Collections

PLOS ONE | DOI101371journalpone0143402 November 23 2015 15 15

Under these circumstances one of the most pressing priorities is ardquobroad-and-thinrdquoapproach to digitization the collection of essential specimen-level data allowing complete col-lection audit and providing a specimen level platform to add metadata in future At a minimumthis includes the specimens determination (ie taxon name) and its physical location withinthe collection Inselect has proven to be a very useful tool for the most challenging parts of theNHMrsquos collections such as pinned insects and it can be easily applied to other areas for exam-ple environmental studies and quantitative analysis of trap samples (eg ldquoinvertebrate soupsrdquoor sticky traps see S8 Fig)

In the course of the NHM Slide Digitization Pilot project which will digitize 100000microscope slides in eight months Inselect received extensive user acceptance testing Resultsto date show that throughput is as high as 5000 slides per day per person for processingmulti-slide images through Inselect This includes tasks associated with image segmentation

Fig 7 Segmented image containing overlapping specimens Example of segmented image of Mycalesina (a group of butterflies)

doi101371journalpone0143402g007

Inselect Automating the Digitization of Natural History Collections

PLOS ONE | DOI101371journalpone0143402 November 23 2015 12 15

and refinement barcode recognition and association with minimal metadata (taxon name andphysical location in the collection) Upon completion of the project the entire set of time andmotion studies alongside the associated workflow will be described in a separate publicationDrawers of curated pinned insects as a rule do not require additional preparation thereforethe imaging output can be up to 70 SatScan images per day resulting in 3500ndash70000 speci-mens per day available for Inselect

Future developmentsSegmentation algorithms The high throughput of mass-digitization activities makes it

important to minimize the amount of manual intervention required The NHMrsquos SatScaninstrument can generate up to 70 multi-specimen images per day The median user-timerequired to refine segmented images of 109 s (S5 Fig) means that 70 x 109 3600 = 21 person-hours could be required to refine bounding boxes for a dayrsquos worth of images from a SatScanmachine In the worst case (413 s) this refinement time increases to more than eight hoursTherefore segmentation algorithms should be as accurate as possible and we suggest that thereis a need for a formal method (and supporting software) that allows segmentation methodsand their associated parameter sets to be scored and ranked Such a score should consider per-formance false positives and false negatives The outputs of such an activity might be a libraryof algorithms andor parameter sets geared towards different specimen types The dataset of804 images used in the present work (available at httpdxdoiorg1055190018537) consti-tutes a benchmark dataset against which segmentation algorithms can be measured Inselectrsquosmodular architecture and its provision of plugins make it a suitable platform for such aninvestigation

Desktop application The desktop application lacks some of the polish that is expected ofmodern software such as lsquoundorsquo and localization Other desirable features include the ability tofilter and order bounding boxes by size andor area in order to aid refinement support forExchangeable Image File Format (EXIF) tags and integration with industry-standard imageprocessing tools such as Adobe Photoshop (httpwwwadobecomproductsphotoshophtml)The plugin architecture makes a possible range of developments such as additional segmenta-tion algorithms and optical character recognition of label text within bounding boxes

Inselect has been tested using specimens from the NHMrsquos entomological and micropalaeon-tological collections as well as a limited number of specimens from Continental European col-lections We would like to test the software against a greater diversity of museum specimensand institutions to ensure that it can accommodate variation in the storage and mounting ofthese specimens

Despite these limitations Inselect represents a substantial contribution to the tools availableto support mass-digitization of natural history collections The desktop application and itsassociated command-line tools have been designed to efficiently handle the high numbers oflarge image files produced by mass-digitization activities The combination of a modular archi-tecture desktop application and scriptable technology makes it a relatively simple task to inte-grate Inselect into existing and workflows Bug reports feature requests and ideas can beviewed and created at httpsgithubcomNaturalHistoryMuseuminselectissues We areactively developing Inselect and we greatly value all comments and suggestions

Supporting InformationS1 Fig A cropped specimen image containing a barcode A scan of a microscope slide thatcontains a Data Matrix barcode(TIFF)

Inselect Automating the Digitization of Natural History Collections

PLOS ONE | DOI101371journalpone0143402 November 23 2015 13 15

S2 Fig Thumbnail file size against file size Numbers in brackets in the legend are the num-ber of files in that group(EPS)

S3 Fig Ingestion time against file size Ingestion time is the time taken to read the scannedimage and save a JPEG thumbnail image of 4096 pixels in width Numbers in brackets in thelegend are the number of files in that group(EPS)

S4 Fig Segmentation time against number of boxes found Numbers in brackets in the leg-end are the number of files in that group(EPS)

S5 Fig Distribution of the time to refine 30 segmented images For 30 images picked at ran-dom(EPS)

S6 Fig The actual number of specimens against the number of boxes detected by segmenta-tion For 30 images picked at random(EPS)

S7 Fig Imaging digitization workflow a Original single-specimen imaging and data capture(after Nelson 2012 modified) b multi-specimen imaging and data capture(EPS)

S8 Fig Alternative use of Inselect for trap sample Count estimation (1064 specimens) usingInselect on an image of a yellow sticky trap used in an environmental assessment study(TIFF)

AcknowledgmentsWe thank Francois Malan for the modified ldquoDivide Scanned Imagesrdquo software (httpfrancoismalancom201301how-to-batch-separate-crop-multiple-scanned-photos) whichprovided the inspiration for developing Inselect Gabriel Brostow and Michael Terry for helpfuldiscussions about computer vision Ben Scott for advice on sustainable software developmentJoumlrg Holetschek and Gwenaeumll Le Bras for application testing and feedback NHM curatorsBlanca Huertas and Geoff Martin for access to images from the Lepidoptera collections volun-teers Robyn Crowther Alexander Esin James Fage Sophie Ledger and Gabriela Montejo fordevoting their time to digitize the test specimens Duncan Sivell for the yellow sticky trapimage used in S8 Fig

Author ContributionsConceived and designed the experiments LNH Performed the experiments LNH Analyzedthe data LNH Contributed reagentsmaterialsanalysis tools VB BWP Wrote the paper LNHVB AH PH LL BWP SVDWVSS Identified the need for the application VB LL BWP VSSWrote the segmentation algorithm and designed and wrote the prototype desktop applicationPH SVDW Worked on the prototype application AH Re-engineered the application LNHContributed towards the design of the desktop application and workflow tools LNH VB AHPH LL BWP SVDWVSS

Inselect Automating the Digitization of Natural History Collections

PLOS ONE | DOI101371journalpone0143402 November 23 2015 14 15

References1 Arintildeo AH Approaches to estimating the universe of natural history collections data Biodiversity Infor-

matics 2010 7 81ndash92 doi 1017161biv7i23991

2 GrahamCH Ferrier S Huettman F Moritz C Peterson AT New developments in museum-based infor-matics and applications in biodiversity analysis Trends Ecol Evol 2004 19 497ndash503 doi 101016jtree200407006 PMID 16701313

3 Suarez AV Tsutsui ND The value of museum collections for research and society BioScience 200454 66ndash74 doi 1016410006-3568(2004)054[0066TVOMCF]20CO2

4 Lister A M Brooks SJ Fenberg PB Glover AG James KE Johnson KG et al Natural history collec-tions as sources of long-term datasets Trends Ecol Evol 2011 26 153ndash154 doi 101016jtree201012009 PMID 21255862

5 Rocha LA Aleixo A Allen G Almeda F Baldwin CC Barclay MV et al Specimen collection an essen-tial tool Science 2014 344 814ndash5 doi 101126science3446186814 PMID 24855245

6 Beck J Kitching IJ Estimating regional species richness of tropical insects frommuseum data a com-parison of a geography-based and sample-based methods J Appl Ecol 2007 44 672ndash681 doi 101111j1365-2664200701291x

7 Newbold T Applications and limitations of museum data for conservation and ecology with particularattention to species distribution models Prog Phys Geog 2010 34 3ndash22 doi 1011770309133309355630

8 Cheng TL Rovito SM Wake DB Vredenburg VT Coincident mass extirpation of neotropical amphibi-ans with the emergence of the infectious fungal pathogen Batrachochytrium dendrobatidis Proc NatlAcad Sci USA 2011 108 9502ndash9507 doi 101073pnas1105538108 PMID 21543713

9 Brooks SJ Self A Toloni F Sparks T Natural history museum collections provide information on phe-nological change in British butterflies since the late-nineteenth century Int J Biometeorol 2014 581749ndash1758 doi 101007s00484-013-0780-6 PMID 24429705

10 Beaman RS Cellinese N Mass digitization of scientific collections new opportunities to transform theuse of biological specimens and underwrite biodiversity science In Blagoderov V Smith VS editorsNo specimen left behind mass digitization of natural history collections ZooKeys 2012 209 7ndash17doi 103897zookeys2093313

11 Blagoderov V Kitching IJ Livermore L Simonsen TJ Smith VS No specimen left behind industrialscale digitization of natural history collections In Blagoderov V Smith VS editors No specimen leftbehind mass digitization of natural history collections ZooKeys 2012 209 133ndash146 doi 103897zookeys2093178

12 Nelson G Paul D Riccardi G Mast AR Five task clusters that enable efficient and effective digitizationof biological collections In Blagoderov V Smith VS editors No specimen left behind mass digitizationof natural history collections ZooKeys 2012 209 19ndash45 doi 103897zookeys2093135

13 Holovachov O Zatushevsky A Shydlovsky I Whole-drawer imaging of entomological collections ben-efits limitations and alternative applications J of Conserv and Mus Stud 2014 12 1ndash13 doi 105334jcms1021218

14 Mantle BL La Salle J Fisher N Whole-drawer imaging for digital management and curation of a largeentomological collection In Blagoderov V Smith VS editors No specimen left behind mass digitiza-tion of natural history collections ZooKeys 2012 209 147ndash163 doi 103897zookeys2093169

15 Dietrich CH Hart J Raila D Ravaioli U Sobh N Sobh O et al InvertNet a new paradigm for digitalaccess to invertebrate collections In Blagoderov V Smith VS editors No specimen left behind massdigitization of natural history collections ZooKeys 2012 209 165ndash181 doi 103897zookeys2093571

16 Schmidt S Balke M Lafogler S DScanmdasha high-performance digital scanning system for entomologicalcollections In Blagoderov V Smith VS editors No specimen left behind mass digitization of naturalhistory collections ZooKeys 2012 209 183ndash191 doi 103897zookeys2093115

17 Wieczorek J Bloom D Guralnick R Blum S Doumlring M Giovanni R et al Darwin Core An EvolvingCommunity-Developed Biodiversity Data Standard PLoS ONE 2012 7(1) e29715 doi 101371journalpone0029715 PMID 22238640

Inselect Automating the Digitization of Natural History Collections

PLOS ONE | DOI101371journalpone0143402 November 23 2015 15 15

and refinement barcode recognition and association with minimal metadata (taxon name andphysical location in the collection) Upon completion of the project the entire set of time andmotion studies alongside the associated workflow will be described in a separate publicationDrawers of curated pinned insects as a rule do not require additional preparation thereforethe imaging output can be up to 70 SatScan images per day resulting in 3500ndash70000 speci-mens per day available for Inselect

Future developmentsSegmentation algorithms The high throughput of mass-digitization activities makes it

important to minimize the amount of manual intervention required The NHMrsquos SatScaninstrument can generate up to 70 multi-specimen images per day The median user-timerequired to refine segmented images of 109 s (S5 Fig) means that 70 x 109 3600 = 21 person-hours could be required to refine bounding boxes for a dayrsquos worth of images from a SatScanmachine In the worst case (413 s) this refinement time increases to more than eight hoursTherefore segmentation algorithms should be as accurate as possible and we suggest that thereis a need for a formal method (and supporting software) that allows segmentation methodsand their associated parameter sets to be scored and ranked Such a score should consider per-formance false positives and false negatives The outputs of such an activity might be a libraryof algorithms andor parameter sets geared towards different specimen types The dataset of804 images used in the present work (available at httpdxdoiorg1055190018537) consti-tutes a benchmark dataset against which segmentation algorithms can be measured Inselectrsquosmodular architecture and its provision of plugins make it a suitable platform for such aninvestigation

Desktop application The desktop application lacks some of the polish that is expected ofmodern software such as lsquoundorsquo and localization Other desirable features include the ability tofilter and order bounding boxes by size andor area in order to aid refinement support forExchangeable Image File Format (EXIF) tags and integration with industry-standard imageprocessing tools such as Adobe Photoshop (httpwwwadobecomproductsphotoshophtml)The plugin architecture makes a possible range of developments such as additional segmenta-tion algorithms and optical character recognition of label text within bounding boxes

Inselect has been tested using specimens from the NHMrsquos entomological and micropalaeon-tological collections as well as a limited number of specimens from Continental European col-lections We would like to test the software against a greater diversity of museum specimensand institutions to ensure that it can accommodate variation in the storage and mounting ofthese specimens

Despite these limitations Inselect represents a substantial contribution to the tools availableto support mass-digitization of natural history collections The desktop application and itsassociated command-line tools have been designed to efficiently handle the high numbers oflarge image files produced by mass-digitization activities The combination of a modular archi-tecture desktop application and scriptable technology makes it a relatively simple task to inte-grate Inselect into existing and workflows Bug reports feature requests and ideas can beviewed and created at httpsgithubcomNaturalHistoryMuseuminselectissues We areactively developing Inselect and we greatly value all comments and suggestions

Supporting InformationS1 Fig A cropped specimen image containing a barcode A scan of a microscope slide thatcontains a Data Matrix barcode(TIFF)

Inselect Automating the Digitization of Natural History Collections

PLOS ONE | DOI101371journalpone0143402 November 23 2015 13 15

S2 Fig Thumbnail file size against file size Numbers in brackets in the legend are the num-ber of files in that group(EPS)

S3 Fig Ingestion time against file size Ingestion time is the time taken to read the scannedimage and save a JPEG thumbnail image of 4096 pixels in width Numbers in brackets in thelegend are the number of files in that group(EPS)

S4 Fig Segmentation time against number of boxes found Numbers in brackets in the leg-end are the number of files in that group(EPS)

S5 Fig Distribution of the time to refine 30 segmented images For 30 images picked at ran-dom(EPS)

S6 Fig The actual number of specimens against the number of boxes detected by segmenta-tion For 30 images picked at random(EPS)

S7 Fig Imaging digitization workflow a Original single-specimen imaging and data capture(after Nelson 2012 modified) b multi-specimen imaging and data capture(EPS)

S8 Fig Alternative use of Inselect for trap sample Count estimation (1064 specimens) usingInselect on an image of a yellow sticky trap used in an environmental assessment study(TIFF)

AcknowledgmentsWe thank Francois Malan for the modified ldquoDivide Scanned Imagesrdquo software (httpfrancoismalancom201301how-to-batch-separate-crop-multiple-scanned-photos) whichprovided the inspiration for developing Inselect Gabriel Brostow and Michael Terry for helpfuldiscussions about computer vision Ben Scott for advice on sustainable software developmentJoumlrg Holetschek and Gwenaeumll Le Bras for application testing and feedback NHM curatorsBlanca Huertas and Geoff Martin for access to images from the Lepidoptera collections volun-teers Robyn Crowther Alexander Esin James Fage Sophie Ledger and Gabriela Montejo fordevoting their time to digitize the test specimens Duncan Sivell for the yellow sticky trapimage used in S8 Fig

Author ContributionsConceived and designed the experiments LNH Performed the experiments LNH Analyzedthe data LNH Contributed reagentsmaterialsanalysis tools VB BWP Wrote the paper LNHVB AH PH LL BWP SVDWVSS Identified the need for the application VB LL BWP VSSWrote the segmentation algorithm and designed and wrote the prototype desktop applicationPH SVDW Worked on the prototype application AH Re-engineered the application LNHContributed towards the design of the desktop application and workflow tools LNH VB AHPH LL BWP SVDWVSS

Inselect Automating the Digitization of Natural History Collections

PLOS ONE | DOI101371journalpone0143402 November 23 2015 14 15

References1 Arintildeo AH Approaches to estimating the universe of natural history collections data Biodiversity Infor-

matics 2010 7 81ndash92 doi 1017161biv7i23991

2 GrahamCH Ferrier S Huettman F Moritz C Peterson AT New developments in museum-based infor-matics and applications in biodiversity analysis Trends Ecol Evol 2004 19 497ndash503 doi 101016jtree200407006 PMID 16701313

3 Suarez AV Tsutsui ND The value of museum collections for research and society BioScience 200454 66ndash74 doi 1016410006-3568(2004)054[0066TVOMCF]20CO2

4 Lister A M Brooks SJ Fenberg PB Glover AG James KE Johnson KG et al Natural history collec-tions as sources of long-term datasets Trends Ecol Evol 2011 26 153ndash154 doi 101016jtree201012009 PMID 21255862

5 Rocha LA Aleixo A Allen G Almeda F Baldwin CC Barclay MV et al Specimen collection an essen-tial tool Science 2014 344 814ndash5 doi 101126science3446186814 PMID 24855245

6 Beck J Kitching IJ Estimating regional species richness of tropical insects frommuseum data a com-parison of a geography-based and sample-based methods J Appl Ecol 2007 44 672ndash681 doi 101111j1365-2664200701291x

7 Newbold T Applications and limitations of museum data for conservation and ecology with particularattention to species distribution models Prog Phys Geog 2010 34 3ndash22 doi 1011770309133309355630

8 Cheng TL Rovito SM Wake DB Vredenburg VT Coincident mass extirpation of neotropical amphibi-ans with the emergence of the infectious fungal pathogen Batrachochytrium dendrobatidis Proc NatlAcad Sci USA 2011 108 9502ndash9507 doi 101073pnas1105538108 PMID 21543713

9 Brooks SJ Self A Toloni F Sparks T Natural history museum collections provide information on phe-nological change in British butterflies since the late-nineteenth century Int J Biometeorol 2014 581749ndash1758 doi 101007s00484-013-0780-6 PMID 24429705

10 Beaman RS Cellinese N Mass digitization of scientific collections new opportunities to transform theuse of biological specimens and underwrite biodiversity science In Blagoderov V Smith VS editorsNo specimen left behind mass digitization of natural history collections ZooKeys 2012 209 7ndash17doi 103897zookeys2093313

11 Blagoderov V Kitching IJ Livermore L Simonsen TJ Smith VS No specimen left behind industrialscale digitization of natural history collections In Blagoderov V Smith VS editors No specimen leftbehind mass digitization of natural history collections ZooKeys 2012 209 133ndash146 doi 103897zookeys2093178

12 Nelson G Paul D Riccardi G Mast AR Five task clusters that enable efficient and effective digitizationof biological collections In Blagoderov V Smith VS editors No specimen left behind mass digitizationof natural history collections ZooKeys 2012 209 19ndash45 doi 103897zookeys2093135

13 Holovachov O Zatushevsky A Shydlovsky I Whole-drawer imaging of entomological collections ben-efits limitations and alternative applications J of Conserv and Mus Stud 2014 12 1ndash13 doi 105334jcms1021218

14 Mantle BL La Salle J Fisher N Whole-drawer imaging for digital management and curation of a largeentomological collection In Blagoderov V Smith VS editors No specimen left behind mass digitiza-tion of natural history collections ZooKeys 2012 209 147ndash163 doi 103897zookeys2093169

15 Dietrich CH Hart J Raila D Ravaioli U Sobh N Sobh O et al InvertNet a new paradigm for digitalaccess to invertebrate collections In Blagoderov V Smith VS editors No specimen left behind massdigitization of natural history collections ZooKeys 2012 209 165ndash181 doi 103897zookeys2093571

16 Schmidt S Balke M Lafogler S DScanmdasha high-performance digital scanning system for entomologicalcollections In Blagoderov V Smith VS editors No specimen left behind mass digitization of naturalhistory collections ZooKeys 2012 209 183ndash191 doi 103897zookeys2093115

17 Wieczorek J Bloom D Guralnick R Blum S Doumlring M Giovanni R et al Darwin Core An EvolvingCommunity-Developed Biodiversity Data Standard PLoS ONE 2012 7(1) e29715 doi 101371journalpone0029715 PMID 22238640

Inselect Automating the Digitization of Natural History Collections

PLOS ONE | DOI101371journalpone0143402 November 23 2015 15 15

S2 Fig Thumbnail file size against file size Numbers in brackets in the legend are the num-ber of files in that group(EPS)

S3 Fig Ingestion time against file size Ingestion time is the time taken to read the scannedimage and save a JPEG thumbnail image of 4096 pixels in width Numbers in brackets in thelegend are the number of files in that group(EPS)

S4 Fig Segmentation time against number of boxes found Numbers in brackets in the leg-end are the number of files in that group(EPS)

S5 Fig Distribution of the time to refine 30 segmented images For 30 images picked at ran-dom(EPS)

S6 Fig The actual number of specimens against the number of boxes detected by segmenta-tion For 30 images picked at random(EPS)

S7 Fig Imaging digitization workflow a Original single-specimen imaging and data capture(after Nelson 2012 modified) b multi-specimen imaging and data capture(EPS)

S8 Fig Alternative use of Inselect for trap sample Count estimation (1064 specimens) usingInselect on an image of a yellow sticky trap used in an environmental assessment study(TIFF)

AcknowledgmentsWe thank Francois Malan for the modified ldquoDivide Scanned Imagesrdquo software (httpfrancoismalancom201301how-to-batch-separate-crop-multiple-scanned-photos) whichprovided the inspiration for developing Inselect Gabriel Brostow and Michael Terry for helpfuldiscussions about computer vision Ben Scott for advice on sustainable software developmentJoumlrg Holetschek and Gwenaeumll Le Bras for application testing and feedback NHM curatorsBlanca Huertas and Geoff Martin for access to images from the Lepidoptera collections volun-teers Robyn Crowther Alexander Esin James Fage Sophie Ledger and Gabriela Montejo fordevoting their time to digitize the test specimens Duncan Sivell for the yellow sticky trapimage used in S8 Fig

Author ContributionsConceived and designed the experiments LNH Performed the experiments LNH Analyzedthe data LNH Contributed reagentsmaterialsanalysis tools VB BWP Wrote the paper LNHVB AH PH LL BWP SVDWVSS Identified the need for the application VB LL BWP VSSWrote the segmentation algorithm and designed and wrote the prototype desktop applicationPH SVDW Worked on the prototype application AH Re-engineered the application LNHContributed towards the design of the desktop application and workflow tools LNH VB AHPH LL BWP SVDWVSS

Inselect Automating the Digitization of Natural History Collections

PLOS ONE | DOI101371journalpone0143402 November 23 2015 14 15

References1 Arintildeo AH Approaches to estimating the universe of natural history collections data Biodiversity Infor-

matics 2010 7 81ndash92 doi 1017161biv7i23991

2 GrahamCH Ferrier S Huettman F Moritz C Peterson AT New developments in museum-based infor-matics and applications in biodiversity analysis Trends Ecol Evol 2004 19 497ndash503 doi 101016jtree200407006 PMID 16701313

3 Suarez AV Tsutsui ND The value of museum collections for research and society BioScience 200454 66ndash74 doi 1016410006-3568(2004)054[0066TVOMCF]20CO2

4 Lister A M Brooks SJ Fenberg PB Glover AG James KE Johnson KG et al Natural history collec-tions as sources of long-term datasets Trends Ecol Evol 2011 26 153ndash154 doi 101016jtree201012009 PMID 21255862

5 Rocha LA Aleixo A Allen G Almeda F Baldwin CC Barclay MV et al Specimen collection an essen-tial tool Science 2014 344 814ndash5 doi 101126science3446186814 PMID 24855245

6 Beck J Kitching IJ Estimating regional species richness of tropical insects frommuseum data a com-parison of a geography-based and sample-based methods J Appl Ecol 2007 44 672ndash681 doi 101111j1365-2664200701291x

7 Newbold T Applications and limitations of museum data for conservation and ecology with particularattention to species distribution models Prog Phys Geog 2010 34 3ndash22 doi 1011770309133309355630

8 Cheng TL Rovito SM Wake DB Vredenburg VT Coincident mass extirpation of neotropical amphibi-ans with the emergence of the infectious fungal pathogen Batrachochytrium dendrobatidis Proc NatlAcad Sci USA 2011 108 9502ndash9507 doi 101073pnas1105538108 PMID 21543713

9 Brooks SJ Self A Toloni F Sparks T Natural history museum collections provide information on phe-nological change in British butterflies since the late-nineteenth century Int J Biometeorol 2014 581749ndash1758 doi 101007s00484-013-0780-6 PMID 24429705

10 Beaman RS Cellinese N Mass digitization of scientific collections new opportunities to transform theuse of biological specimens and underwrite biodiversity science In Blagoderov V Smith VS editorsNo specimen left behind mass digitization of natural history collections ZooKeys 2012 209 7ndash17doi 103897zookeys2093313

11 Blagoderov V Kitching IJ Livermore L Simonsen TJ Smith VS No specimen left behind industrialscale digitization of natural history collections In Blagoderov V Smith VS editors No specimen leftbehind mass digitization of natural history collections ZooKeys 2012 209 133ndash146 doi 103897zookeys2093178

12 Nelson G Paul D Riccardi G Mast AR Five task clusters that enable efficient and effective digitizationof biological collections In Blagoderov V Smith VS editors No specimen left behind mass digitizationof natural history collections ZooKeys 2012 209 19ndash45 doi 103897zookeys2093135

13 Holovachov O Zatushevsky A Shydlovsky I Whole-drawer imaging of entomological collections ben-efits limitations and alternative applications J of Conserv and Mus Stud 2014 12 1ndash13 doi 105334jcms1021218

14 Mantle BL La Salle J Fisher N Whole-drawer imaging for digital management and curation of a largeentomological collection In Blagoderov V Smith VS editors No specimen left behind mass digitiza-tion of natural history collections ZooKeys 2012 209 147ndash163 doi 103897zookeys2093169

15 Dietrich CH Hart J Raila D Ravaioli U Sobh N Sobh O et al InvertNet a new paradigm for digitalaccess to invertebrate collections In Blagoderov V Smith VS editors No specimen left behind massdigitization of natural history collections ZooKeys 2012 209 165ndash181 doi 103897zookeys2093571

16 Schmidt S Balke M Lafogler S DScanmdasha high-performance digital scanning system for entomologicalcollections In Blagoderov V Smith VS editors No specimen left behind mass digitization of naturalhistory collections ZooKeys 2012 209 183ndash191 doi 103897zookeys2093115

17 Wieczorek J Bloom D Guralnick R Blum S Doumlring M Giovanni R et al Darwin Core An EvolvingCommunity-Developed Biodiversity Data Standard PLoS ONE 2012 7(1) e29715 doi 101371journalpone0029715 PMID 22238640

Inselect Automating the Digitization of Natural History Collections

PLOS ONE | DOI101371journalpone0143402 November 23 2015 15 15

References1 Arintildeo AH Approaches to estimating the universe of natural history collections data Biodiversity Infor-

matics 2010 7 81ndash92 doi 1017161biv7i23991

2 GrahamCH Ferrier S Huettman F Moritz C Peterson AT New developments in museum-based infor-matics and applications in biodiversity analysis Trends Ecol Evol 2004 19 497ndash503 doi 101016jtree200407006 PMID 16701313

3 Suarez AV Tsutsui ND The value of museum collections for research and society BioScience 200454 66ndash74 doi 1016410006-3568(2004)054[0066TVOMCF]20CO2

4 Lister A M Brooks SJ Fenberg PB Glover AG James KE Johnson KG et al Natural history collec-tions as sources of long-term datasets Trends Ecol Evol 2011 26 153ndash154 doi 101016jtree201012009 PMID 21255862

5 Rocha LA Aleixo A Allen G Almeda F Baldwin CC Barclay MV et al Specimen collection an essen-tial tool Science 2014 344 814ndash5 doi 101126science3446186814 PMID 24855245

6 Beck J Kitching IJ Estimating regional species richness of tropical insects frommuseum data a com-parison of a geography-based and sample-based methods J Appl Ecol 2007 44 672ndash681 doi 101111j1365-2664200701291x

7 Newbold T Applications and limitations of museum data for conservation and ecology with particularattention to species distribution models Prog Phys Geog 2010 34 3ndash22 doi 1011770309133309355630

8 Cheng TL Rovito SM Wake DB Vredenburg VT Coincident mass extirpation of neotropical amphibi-ans with the emergence of the infectious fungal pathogen Batrachochytrium dendrobatidis Proc NatlAcad Sci USA 2011 108 9502ndash9507 doi 101073pnas1105538108 PMID 21543713

9 Brooks SJ Self A Toloni F Sparks T Natural history museum collections provide information on phe-nological change in British butterflies since the late-nineteenth century Int J Biometeorol 2014 581749ndash1758 doi 101007s00484-013-0780-6 PMID 24429705

10 Beaman RS Cellinese N Mass digitization of scientific collections new opportunities to transform theuse of biological specimens and underwrite biodiversity science In Blagoderov V Smith VS editorsNo specimen left behind mass digitization of natural history collections ZooKeys 2012 209 7ndash17doi 103897zookeys2093313

11 Blagoderov V Kitching IJ Livermore L Simonsen TJ Smith VS No specimen left behind industrialscale digitization of natural history collections In Blagoderov V Smith VS editors No specimen leftbehind mass digitization of natural history collections ZooKeys 2012 209 133ndash146 doi 103897zookeys2093178

12 Nelson G Paul D Riccardi G Mast AR Five task clusters that enable efficient and effective digitizationof biological collections In Blagoderov V Smith VS editors No specimen left behind mass digitizationof natural history collections ZooKeys 2012 209 19ndash45 doi 103897zookeys2093135

13 Holovachov O Zatushevsky A Shydlovsky I Whole-drawer imaging of entomological collections ben-efits limitations and alternative applications J of Conserv and Mus Stud 2014 12 1ndash13 doi 105334jcms1021218

14 Mantle BL La Salle J Fisher N Whole-drawer imaging for digital management and curation of a largeentomological collection In Blagoderov V Smith VS editors No specimen left behind mass digitiza-tion of natural history collections ZooKeys 2012 209 147ndash163 doi 103897zookeys2093169

15 Dietrich CH Hart J Raila D Ravaioli U Sobh N Sobh O et al InvertNet a new paradigm for digitalaccess to invertebrate collections In Blagoderov V Smith VS editors No specimen left behind massdigitization of natural history collections ZooKeys 2012 209 165ndash181 doi 103897zookeys2093571

16 Schmidt S Balke M Lafogler S DScanmdasha high-performance digital scanning system for entomologicalcollections In Blagoderov V Smith VS editors No specimen left behind mass digitization of naturalhistory collections ZooKeys 2012 209 183ndash191 doi 103897zookeys2093115

17 Wieczorek J Bloom D Guralnick R Blum S Doumlring M Giovanni R et al Darwin Core An EvolvingCommunity-Developed Biodiversity Data Standard PLoS ONE 2012 7(1) e29715 doi 101371journalpone0029715 PMID 22238640

Inselect Automating the Digitization of Natural History Collections

PLOS ONE | DOI101371journalpone0143402 November 23 2015 15 15