iCollections, Mass Digitisation of British & Irish Lepidoptera

Adrian Hine, Natural History Museum, London

iCollections Background

• iCollections began March 2013 for 3 years, using 8 full time digitisers plus existing staff.

• Digitise the British Lepidoptera (Butterflies & Moths) ca. ½ million specimens (5000 drawers).

• Pilot project for mass digitisation of pinned insects.

• The main aim of digitisation is to capture the label data, not the specimen image per se.

• Workflow for the Digital Collections Programme (DCP) – a Digital Museum.

Digitisation Benefits

• Three top-level themes:

– Research

– Collections

– Public engagement

• Have to choose carefully to maximise limited budget. British Lepidoptera ticks all these boxes!

Research

• Large powerful dataset (50% usable), temporal & spatial.

• Climate change, distributional changes, migration, morphometrics.

• Occurrence records to National Biodiversity Network.

Better Collections

• Better curation & preservation, access

Public Engagement

• Lepidoptera are a charismatic group, with a lot of public interest.

• Explain our science: Science Uncovered, Nature Live, TV, radio.

Data Workflow

• Data quality is at the heart of the digitisation process. We wish to control the quality of data going into EMu.

• Didn’t want to simply be pushing large quantities of unqualified data into EMu to have to deal with at a later stage.

• Consistent, systematic approach to data capture.

• Every stage of the digitisation process followed written protocols.

• Each specimen given a unique specimen number (Data Matrix barcode & human readable).

Data Workflow

• Opted for data capture outside EMu

– poor quality data in EMu makes databasing directly into EMu difficult (sites, taxonomy, parties).

– build a highly streamlined data entry interface for transcription phase.

– build harmonisation tools to control data going into EMu (reduce duplication).

• Developing an RDA for the future.

• Biggest challenge is harmonisation with existing data within EMu (taxonomy, sites, parties, specimens).

Digitisation Workflow

Workflow stages and the role responsible for each:

1) Specimen Preparation (Digitiser)

2) Imaging (Digitiser)

3) Transcription (Digitiser)

4) Taxonomy Harmonisation (Taxonomist)

5) Georeferencing (Georeferencer)

6) Import into EMu (Data Manager)

Specimen Preparation

Imaging

Ingestion into Transcription Database

• A script uses the application Barcodefiler to search the image for a barcode. If one is found, the script renames the image file with the specimen number.

• It then creates a stub record in the rapid data capture system (SQL backend) with three core data fields:

– specimen number (from barcode)

– drawer number (from folder name)

– taxon name (from folder name)

• Using ImageMagick libraries, it creates a cropped label derivative image.
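
The ingestion steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the project's script: the folder naming pattern ("drawer_taxon"), the table and column names, and the crop geometry are all hypothetical, and the barcode reader is passed in as a function because the slides do not describe how Barcodefiler is invoked.

```python
import sqlite3
from pathlib import Path
from typing import Callable, Optional

# Hypothetical stub table holding the three core fields named on the slide.
SCHEMA = """
CREATE TABLE IF NOT EXISTS stub_record (
    specimen_number TEXT PRIMARY KEY,
    drawer_number   TEXT NOT NULL,
    taxon_name      TEXT NOT NULL
)
"""


def parse_folder_name(folder: str) -> tuple[str, str]:
    """Split a folder name into (drawer, taxon).

    Assumes folders are named like 'D0123_Pieris brassicae'; the real
    naming convention is not given in the slides.
    """
    drawer, _, taxon = folder.partition("_")
    return drawer.strip(), taxon.strip()


def crop_command(src, dest, geometry: str) -> list[str]:
    """Build an ImageMagick command line for a cropped label derivative.

    'convert' is the classic ImageMagick entry point ('magick' in v7);
    the geometry (e.g. '800x300+0+0') is a placeholder.
    """
    return ["convert", str(src), "-crop", geometry, "+repage", str(dest)]


def ingest_image(
    image: Path,
    read_barcode: Callable[[Path], Optional[str]],
    db: sqlite3.Connection,
) -> Optional[Path]:
    """Rename the image after its barcode and create a stub record.

    Returns the new path, or None when no barcode could be read.
    """
    specimen_number = read_barcode(image)
    if specimen_number is None:
        return None  # barcode no-read: leave the file for manual handling

    drawer, taxon = parse_folder_name(image.parent.name)
    renamed = image.with_name(specimen_number + image.suffix)
    image.rename(renamed)

    db.execute(SCHEMA)
    db.execute(
        "INSERT INTO stub_record VALUES (?, ?, ?)",
        (specimen_number, drawer, taxon),
    )
    db.commit()
    return renamed
```

Passing the barcode reader in as a function keeps the sketch independent of any particular decoding tool, and returning None for a no-read mirrors the "barcode no reads" issue the talk lists later.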

Transcription

Data Harmonisation

• Biggest challenge is how to harmonise data with existing EMu data.

• Wish to use appropriate records where they exist in EMu and not to create additional duplicates.

• Data concepts we wish to harmonise with EMu records:

– Taxonomy (determination)

– Parties (collectors)

– Locations (drawers)

• Data concepts to create as new:

– Sites

Taxonomy Harmonisation

• EMu Taxonomy is still a mess! For UK butterflies there are 1000s of names: duplicates, erroneous names, different combinations.

• Did not have the time to clean Taxonomy for UK Lepidoptera. We have to live with the mess!

• Need taxonomic expertise to validate the iCollections name with the correct concept in EMu.

• Typos and errors when digitisers enter names.

• Can’t rely on the EMu import algorithms, as matching taxon names is too complex. Need human intervention.

• Built mapping tool to map taxon name with existing EMu name.
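
As an illustration of how such a mapping tool might propose candidates for a taxonomist to confirm, here is a minimal sketch using fuzzy string matching from Python's standard library. It is not the project's tool: the function name, the cutoff value, and the sample names are assumptions, and the final choice is deliberately left to a person, as the slides require.

```python
from difflib import get_close_matches


def suggest_emu_names(
    entered: str,
    emu_names: list[str],
    limit: int = 3,
    cutoff: float = 0.8,
) -> list[str]:
    """Return up to `limit` existing EMu names that resemble `entered`.

    Matching is case-insensitive. The result is only a shortlist, best
    match first; a taxonomist still picks (or rejects) the right concept.
    """
    lookup = {name.lower(): name for name in emu_names}
    hits = get_close_matches(
        entered.strip().lower(), list(lookup), n=limit, cutoff=cutoff
    )
    return [lookup[hit] for hit in hits]


# Hypothetical usage: a digitiser's typo is matched against existing names.
emu_names = ["Pieris brassicae", "Pieris rapae", "Vanessa atalanta"]
candidates = suggest_emu_names("Pieris brasicae", emu_names)
```

A similarity shortlist cannot resolve synonyms or changed combinations, which is one reason human validation remains necessary.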

Taxonomy Harmonisation Tool

Sites Harmonisation

• Messy data make databasing directly into EMu difficult. Existing site records are of poor quality: very few are usable, and the data have been captured very inconsistently (diverse data sources).

• Mapping site variants to a site master record, for example:

– Variants: "Box Hill", "Box Hill; Surrey", "Box Hill; Kent", "Box Hill; near Dorking", "Box Hill, Dorking"

– Master record: "Box Hill; Surrey; UK; 51.254 N, -0.308 W"

• Out of 181,000 specimens, just 9,681 unique site variants.
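
The idea of collapsing many transcribed localities into a much smaller set of unique variants, each mapped once to a master record, can be sketched as below. This is an assumption-laden illustration: the normalisation rules (case folding, treating commas and semicolons alike) and the function names are hypothetical, and which variant belongs to which master record is a human decision that the code only stores.

```python
from collections import Counter
from typing import Iterable, Optional


def normalise(site: str) -> str:
    """Reduce a transcribed locality string to a comparable variant key.

    Intended for verbatim locality text only (not coordinates), since
    commas are treated as separators.
    """
    parts = [p.strip().lower() for p in site.replace(",", ";").split(";")]
    return "; ".join(p for p in parts if p)


def unique_variants(transcribed: Iterable[str]) -> Counter:
    """Count how many specimens share each normalised site variant."""
    return Counter(normalise(site) for site in transcribed)


def resolve(site: str, variant_to_master: dict[str, str]) -> Optional[str]:
    """Look up the master record chosen for a variant, if one exists."""
    return variant_to_master.get(normalise(site))


# Hypothetical usage with the Box Hill example.
transcribed = ["Box Hill; Surrey", "box hill;surrey;", "Box Hill, Dorking"]
variants = unique_variants(transcribed)
masters = {"box hill; surrey": "Box Hill; Surrey; UK"}
```

Georeferencing each unique variant once, rather than each specimen, is what makes the reduction from 181,000 specimens to 9,681 variants valuable.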

Sites Harmonisation & Georeferencing

Sites Georeferencing

Import into EMu

• Import is a phased approach:

1) Images. KE have built a backend script to ingest multimedia server side. It reports out a CSV with the EMu IRN and file name identifier.

2) Specimen record (taxonomy, drawer location & multimedia).

3) Georeferenced collection event data.
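
One way the CSV report from phase 1 could feed phase 2 is sketched below: the IRN reported for each image is attached to the matching specimen record before import. The column names ("irn", "identifier") and the record fields are assumptions for illustration; the actual report layout is not given in the slides.

```python
import csv
import io
from typing import Iterable, TextIO


def load_irn_lookup(report: TextIO) -> dict[str, str]:
    """Read the multimedia report into {file name identifier: EMu IRN}."""
    return {row["identifier"]: row["irn"] for row in csv.DictReader(report)}


def attach_multimedia(
    specimens: Iterable[dict], irn_by_file: dict[str, str]
) -> tuple[list[dict], list[str]]:
    """Add the multimedia IRN to each specimen record.

    Returns (linked records, specimen numbers with no ingested image) so
    that unmatched specimens can be checked rather than silently dropped.
    """
    linked, missing = [], []
    for spec in specimens:
        irn = irn_by_file.get(spec["image_file"])
        if irn is None:
            missing.append(spec["specimen_number"])
        else:
            linked.append({**spec, "multimedia_irn": irn})
    return linked, missing


# Hypothetical usage with an in-memory report.
report = io.StringIO("irn,identifier\n1001,SPEC000123.jpg\n")
lookup = load_irn_lookup(report)
```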

Issues

• Barcode no-reads or misreads.

• Printing quality of barcodes.

• Multiple specimens on one pin.

• Conflicting data.

• Data difficult to interpret.

• Specimens with old style specimen number labels (non barcode).

• Specimen records exist already in EMu.

Digitisation Progress

iCollections Team

The success is due to the project having a strong team ethic, pulling together museum staff from a wide variety of different disciplines.

• Chair: Gordon Paterson

• Project manager: Victoria Carter

• Quality assurance: Darrell Siebert

• Digitisers: Peter Wing, Elisa Cane, Flavia Toloni, Jo Durant, Lyndsey Douglas, Sara Albuquerque, Jasmin Perera, Sophie Ledger, Gerrardo Mazzetta

• Collections management: Geoff Martin, Martin Honey, Blanca Huertas, Theresa Howard

• Research: Steve Brooks, Angela Self, Ian Kitching

• Georeferencing: Malcolm Penn, Liz Duffell, Caitlin McLaughlin

• Database & interface designer: Mike Sadka

• Data workflow: Adrian Hine

• Database: Chris Sleep

• Image workflow: Vladimir Blagoderov, Steve Cafferty

Questions?