data.nhm.ac.uk
NHM data portal update
Part of the informatics initiative (2013-15)
Vince Smith & Ben Scott
The problem – research data
Hard to find, access, cite and integrate
49 NHM science group papers in last 4 weeks (data via Carolyn Lowry e-mail, 13th Feb. 2013)
• 45 available online (4 print only or behind pay walls)
• 9 had supplementary data files
• 39 papers with tables, charts & other data
  o >1000 sequences
  o 826 figures
  o 76 tables
  o 1 genome
• No collective view of these data (37 journals)
• No consistent way of citing NHM data
• No mechanism to integrate or version
• No way to repurpose data (retyping?)
The problem – collections data
Hard to find, access, cite and integrate
Initial problems
• Don’t know / can’t find the website
• 6 different collections
  o Botany http://www.nhm.ac.uk/research-curation/collections/search/results.jsp?mode=collections&department=32
  o Entomology http://www.nhm.ac.uk/research-curation/collections/search/results.jsp?mode=collections&department=40
  o Library http://www.nhm.ac.uk/research-curation/collections/search/results.jsp?mode=collections&department=36
  o Mineralogy http://www.nhm.ac.uk/research-curation/collections/search/results.jsp?mode=collections&department=55
  o Palaeontology http://www.nhm.ac.uk/research-curation/collections/search/results.jsp?mode=collections&department=34
  o Zoology http://www.nhm.ac.uk/research-curation/collections/search/results.jsp?mode=collections&department=38
• 23 interfaces & datasets of varying importance
• No priority to collection datasets (from 119 specimens up to 28,000,000 specimens)
• Entomology collections don’t exist (404)
• Library doesn’t have any online collections!
Bigger issues
• Idiosyncratic browse or search
• No maps, few images & very slow
• No summary or statistics
• No download, export or custom views
• No integration with other data
• No author info or update info
• No means of specimen citation
• No exports to GBIF or associated projects
The data portal must correct these issues
The solution – data.nhm.ac.uk portal
High level issues
Functional requirements
• A central access point for NHM research & collections data
• The capacity to store, link and describe datasets
• Integrated search & browse of datasets
• The ability to cite datasets and specimen records in datasets
• The ability to integrate collections data
• Custom functions for sub-sections of data (e.g. initiatives, Virtual Herbarium)
• The capacity to download, export & analyse data
Principles
• Open-by-default, with capacity for embargoed and private data
• Sustainable: self-populated by NHM staff (except collections data)
Exclusions
• Not a replacement for DAMS or KE EMu (a Web interface for these systems)
• Publications out of scope (focused on datasets)
• All annotations on data link back to the source (e.g. KE EMu)
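The open-by-default principle, with its exceptions for embargoed and private data, could be expressed as a simple visibility rule. The sketch below is only an illustration: the `Dataset` class, its field names and the `is_public` function are hypothetical and are not taken from the portal specification.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Dataset:
    """Hypothetical record; field names are illustrative, not the portal's schema."""
    name: str
    private: bool = False                  # open by default
    embargo_until: Optional[date] = None   # None means no embargo


def is_public(dataset: Dataset, today: date) -> bool:
    """Return True if the dataset should be visible to anonymous users."""
    if dataset.private:
        return False
    # An embargoed dataset becomes public on the embargo date itself.
    if dataset.embargo_until is not None and today < dataset.embargo_until:
        return False
    return True
```

A dataset created with no extra arguments is public, so staff only need to act when they want to restrict access.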
The solution – data.nhm.ac.uk portal
System Overview
Scope (source data)
• KE EMu (NHM)
• HerbCat (Kew)
• Other datasets: Species dictionary, initiatives, Scratchpads etc.
• User contributed datasets
File types (formats)
• DwC-A, PhyloXML, neXML, Nexus, Excel, CSV etc.
Explorer
• Map view, Table view, Statistics view, Analytic view
Registry (discovery & download)
• NHM specimens
• Kew specimens
• Other
• Private
Subportals (branded slices of data)
• Subportal 1, e.g. Disease initiative
• Subportal 2, e.g. Kew / NHM Virtual Herbarium
Portal overview – adding data sets
Quick & easy, semi-automated workflow
1. Name the dataset
2. Upload / link the data file
3. Describe the data file
4. Theme & tag
5. Add additional resources
6. Temporal coverage
7. Geographic coverage
8. Save & finish
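The eight steps above map naturally onto a single submission record that is checked before saving. The following is a minimal sketch under stated assumptions: the `DatasetSubmission` class, its field names, and the choice of which steps are mandatory (name, data file, description) are all hypothetical and not taken from the portal design.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class DatasetSubmission:
    """Hypothetical record mirroring workflow steps 1 to 7."""
    name: str = ""                                            # 1. name the dataset
    data_file: str = ""                                       # 2. uploaded path or linked URL
    description: str = ""                                     # 3. describe the data file
    tags: List[str] = field(default_factory=list)             # 4. theme & tag
    extra_resources: List[str] = field(default_factory=list)  # 5. additional resources
    temporal_coverage: Optional[Tuple[int, int]] = None       # 6. (start year, end year)
    bounding_box: Optional[Tuple[float, float, float, float]] = None  # 7. (west, south, east, north)


def missing_steps(submission: DatasetSubmission) -> List[str]:
    """List the steps assumed to be mandatory that are still empty before step 8."""
    missing = []
    if not submission.name.strip():
        missing.append("name")
    if not submission.data_file.strip():
        missing.append("data file")
    if not submission.description.strip():
        missing.append("description")
    return missing
```

Step 8 (save & finish) would then only proceed when `missing_steps` returns an empty list; the optional fields keep the workflow quick for contributors who have little metadata to hand.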
Portal overview – search interface
Discovering research data sets
[Figure: search interface, annotated with: Search; Browse & search criteria; Advanced display options; Results; Datasets matching criteria; Individual dataset]
Portal overview – data set display
Exploring research data sets
[Figure: dataset page showing metadata about the dataset, annotated with: Name; Geographic scope; Tags; “Social”; Authors; License; Download; Developer tools; Technical info. (extracted from data file)]
Portal overview – collections data
Main interface
[Figure: main collections interface, annotated with: Zoomable map; Applied filters; Toggle map, table & stats views; Search, download & display options; No. records; No. georef. records]
Additional interfaces
[Figure: additional interfaces, annotated with: Collections views; Statistical summary; Specimen record views; Data field mappings; Summary preview; Full record; Tables; Download]
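The record and georeferenced-record counters shown alongside the applied filters could be computed along the following lines. This is a rough sketch, not the portal's implementation: the `summarise` function is hypothetical, and it assumes each specimen record is a dictionary using the Darwin Core field names `decimalLatitude` and `decimalLongitude`.

```python
from typing import Callable, Dict, Iterable, List, Optional

Record = Dict[str, Optional[object]]
Filter = Callable[[Record], bool]


def summarise(records: Iterable[Record], filters: List[Filter]) -> Dict[str, int]:
    """Count the records passing every filter, and how many of those have coordinates."""
    matched = [r for r in records if all(f(r) for f in filters)]
    georeferenced = [
        r for r in matched
        if r.get("decimalLatitude") is not None
        and r.get("decimalLongitude") is not None
    ]
    return {"records": len(matched), "georeferenced": len(georeferenced)}
```

Only the georeferenced subset can be drawn on the map, so reporting both numbers tells the user how much of the result set the map view is hiding.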
Portal overview
Some example data portals & software
Data.gov.uk & CKAN
• UK government data portal
• Uses CKAN, an open-source data portal platform
• Used by national & regional governments
• Links into Drupal, DataCite & NHM systems
• http://data.gov.uk & http://ckan.org/
Canadensys & CartoDB
• Canadian network of biodiversity collections
• Almost 1 million specimens, 18 datasets
• Uses CartoDB mapping solution
• Create dynamic maps, analyze and build location aware and geospatial applications
• Widely used, cloud data storage, PostGIS
• http://data.canadensys.net & http://cartodb.com/
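CKAN exposes dataset search through its Action API (`package_search`, with `q`, `rows` and `start` parameters), so a portal built on it could be queried programmatically as well as through the web interface. The sketch below only builds the request URL and makes no network call; the helper name `dataset_search_url` is hypothetical, and whether data.nhm.ac.uk will expose the API at this path is an assumption.

```python
from urllib.parse import urlencode


def dataset_search_url(base_url: str, query: str, rows: int = 10, start: int = 0) -> str:
    """Build a CKAN Action API search URL for a free-text dataset query."""
    params = urlencode({"q": query, "rows": rows, "start": start})
    return f"{base_url.rstrip('/')}/api/3/action/package_search?{params}"


# Assumed base URL, for illustration only.
print(dataset_search_url("http://data.nhm.ac.uk", "herbarium specimens", rows=5))
```

Paging through results then amounts to increasing `start` in steps of `rows`.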
Portal development
Timeline & resources
Year 1 – Dataset discovery
• Technical & functional specification (Vizzuality subcontract)
• Data workflows (KE EMu & research datasets)
• Functional alpha prototype (CKAN)
Year 2 – Visualisation
• Mapping & statistical functionality (CartoDB)
• Social and annotation functions
• Stable beta release at http://data.nhm.ac.uk
Year 3 – Citation & analysis
• DataCite DOIs on datasets & specimens
• Initial Web analytical functions (R)
• Initiative sub-portals including Virtual Herbarium
Resources
• 1x Developer (Ben Scott) for 3 years
• Vizzuality subcontract (circa £xxk - TBC)
• ICT capital, travel & software (circa £25k)
Portal consultation
Feedback & next steps
Initial stakeholder meetings (Feb. – May)
• ICT Group (David Thomas, Chris Sleep & Gavin Malarky)
• Darrell Siebert and the KE EMu user group
• NHM Collections Committee & Initiative leaders
• Kew Gardens & Virtual Herbarium Reps.
• GBIF, NBN, UK DataCite team at BL, NERC
• Digital Facility Team
• Vizzuality
Wider consultation
• Example data types / sets
• Specialist search options & vocabularies
• Specialist Earth Science needs
Documentation
• Overview specification - http://goo.gl/qjioh
• Project Initiation Document - http://goo.gl/oRr2j
FEEDBACK & LINKS
Slides:
Feedback: [email protected]
Spec: http://goo.gl/qjioh
PID: http://goo.gl/oRr2j
Two more things
Wikipedian in Residence
• Four month post with Science Museum
• Starting March / April
• Work with NHM staff to improve Wikipedia
• Run events with NHM staff & volunteers
• Work with the GLAM group at Imperial College
• Focus on NHM science themes & specimens
• Not about promotion of “The NHM”
Biodiversity Informatics Workshop – May 2013
• One full day - date TBC
• Outputs from ViBRANT & e-Monocot
• Includes Scratchpads & the Biodiversity Data Journal
• What we do, how it’s used and where we are going
• Includes links to NHM informatics & digitisation initiatives
Portal overview – data citation
Unique identifiers for datasets & specimen records
Why cite data
• URLs are not persistent
  o e.g. Wren JD: URL decay in MEDLINE - a 4-year follow-up study. Bioinformatics. 2008 Jun 1;24(11):1381-5 (circa 40% decay)
• Measure our digital footprint
• Puts research data on par with articles
• Facilitates data mining
How to cite data
• Digital Object Identifiers (DOIs)
• Widely used & understood on articles
• Operates in collaboration with DataCite
• Part of an International consortium
• Mixes NHM data with other domains
What gets an identifier
• NHM specimen records (suffix of NHM IDs)
• NHM research datasets (files)
• Insert into publications
http://dx.doi.org/BMNH_PBI_00388325
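A DOI has the form prefix/suffix, where the prefix is assigned to the registrant and the suffix is chosen locally, which is why an NHM ID can serve as the suffix. The example URL above shows only the suffix; a registered DOI would also carry a prefix. The sketch below illustrates the composition under stated assumptions: `10.5555` is a placeholder prefix rather than one assigned to the NHM, and the function names `specimen_doi` and `resolver_url` are hypothetical.

```python
import re
from urllib.parse import quote

PLACEHOLDER_PREFIX = "10.5555"  # assumption: stand-in, not a real NHM prefix


def specimen_doi(nhm_id: str, prefix: str = PLACEHOLDER_PREFIX) -> str:
    """Compose a DOI from a registrant prefix and an NHM ID used as the suffix."""
    suffix = nhm_id.strip()
    if not suffix or re.search(r"\s", suffix):
        raise ValueError("NHM ID must be non-empty and contain no whitespace")
    return f"{prefix}/{suffix}"


def resolver_url(doi: str) -> str:
    """Return the resolver URL for a DOI, percent-encoding any unsafe characters."""
    return "http://dx.doi.org/" + quote(doi, safe="/")


print(resolver_url(specimen_doi("BMNH_PBI_00388325")))
```

Keeping the existing NHM ID as the suffix means no second identifier scheme has to be maintained for specimen records.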