Experiences with Repositories and Blogs in Laboratories or ‘R4L: The Repository for the Laboratory’ http://r4l.eprints.org Leslie Carr, Simon Coles & Jeremy Frey University of Southampton, U.K. [email protected] This work is licensed under a Creative Commons Licence Attribution-ShareAlike 3.0 http://creativecommons.org/licenses/by-sa/3.0

Experiences with Repositories and Blogs in Laboratories

or

‘R4L: The Repository for the Laboratory’

http://r4l.eprints.org

Leslie Carr, Simon Coles & Jeremy Frey

University of Southampton, U.K.

[email protected]

This work is licensed under a Creative Commons Licence Attribution-ShareAlike 3.0

http://creativecommons.org/licenses/by-sa/3.0/

The Problem: Data Generation

Synthesis Characterisation

The Problem: Data Management

“Data from experiments conducted as recently as six months ago might be suddenly deemed important, but those researchers may never find those numbers – or if they did might not know what those numbers meant”

“Lost in some research assistant’s computer, the data are often irretrievable or an undecipherable string of digits”

“To vet experiments, correct errors, or find new breakthroughs, scientists desperately need better ways to store and retrieve research data”

“Data from Big Science is … easier to handle, understand and archive. Small Science is horribly heterogeneous and far more vast. In time Small Science will generate 2-3 times more data than Big Science.”

‘Lost in a Sea of Science Data’ S.Carlson, The Chronicle of Higher Education (23/06/2006)

The Problem: Data Deluge

• There are approx. 30 million known chemical compounds
• Approx. 2 million crystal structures have been determined
• There are fewer than 0.5 million published crystal structures residing in (licensed) curated databases
• There are just a few thousand ‘open’ crystal structures

• The primary cause of this is the current data publication process, which is tied to journal articles and peer review

• 40 years ago a PhD student would determine about 3 crystal structures during the course of their study – this can now be easily achieved in a day

The Problem: Publishing Data

Spectroscopic analysis is often performed to ensure a reaction is proceeding according to plan – as a result <5% are published (via a process with heavy information loss)

The Problem: Reproducible Experiments

• Poor availability and description of experiments and the data arising from them in the current literature
• How can we validate this data?
• Open data will need to be self-explanatory and prove its own ‘correctness’
• Requirement for an ‘experiment audit trail’

• (Published) Science should be reproducible
• Requirement to provide sufficient data and metadata to back up an experiment description

The Solution

Underlying data

(Institutional data repository)

Intellect & Interpretation

(Journal article, report, etc.)

Fitting into the Information Environment

Institutional Data Sources

High Level Relationships

Repository Design

• Scenarios – assist design team in understanding each other

• Feedback from SPECTRa-based questionnaire

• First design: build one to throw away (out of the box EPrints)

• Population of disposable repository informed design of actual repository

• Population informed workflow capture and analysis

• Manufacturer discussions

• Requirements capture with publishing community

Questionnaire Results #1

• The 110 respondents comprised PhD students (55%), postdoctoral workers (18%) and faculty staff (19%).
• Primary use of computers and the internet is for researching information, writing papers and reports, working up data and instrument control.
• Computers are used regularly for everyday work, but much less so for social networking and other ‘modern’ uses.
• Well-established, community-standard applications, software and file formats dominate, with less use of modern data sources.
• Printed copies of PDF files are still used extensively; the PDF files themselves are generally stored on personal computers without any structure or use of reference software.
• A researcher typically has 100-1000 PDF files on their computer and prefers to share them by sending PDFs to collaborators.

Questionnaire Results #2

• About 66% have had to generate electronic supplementary information for their journal articles.
• Software use is predominantly self-taught rather than professionally taught.
• Supplementary information is mainly generated and stored in proprietary formats, although there is considerable use of ‘popular’ formats (e.g. Microsoft Office).
• There is a preference, or requirement, to keep a hardcopy of data as well as an electronic one.
• Experimental and analysed data are generally kept on a group or instrument-controlling computer; however, there is often a need to keep a hardcopy (e.g. in a lab notebook).
• About 66% have not heard of ‘InChI’, ‘metadata’ or the ‘JCAMP’ format, whilst around 50% have not heard of ‘DOI’, ‘Open Access’, ‘Semantic Web’ or ‘RDF’.

Questionnaire Results #3

• There is a considerable lack of awareness surrounding repositories and their function.
• There is a requirement for search and discovery to be based predominantly on structure, formula, author or keyword.
• The most attractive purpose of a repository would be for the storage of a ‘permanent record’.
• Most chemists would comply if deposit in a repository were a mandatory requirement of funding or publication; however, virtually all are unaware of the position of these organisations on open access and deposit in a repository.

The Plan

Workflow Analysis

UV-Vis; Powder XRD; NMR; Mass Spec

Sufficient similarity to design a generic deposit / ingest process

The R4L Repository Deposit / Ingest

Create new compound (parent record)

Add new experiment type

Add metadata and upload data files
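
The three deposit steps above can be sketched as a small data model. This is only an illustration of the parent-record idea, not the R4L or EPrints implementation: the class names, field names and sample values below are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Experiment:
    """One experiment attached to a compound (hypothetical structure)."""
    experiment_type: str  # e.g. "NMR", "UV-Vis", "Powder XRD", "Mass Spec"
    metadata: Dict[str, str] = field(default_factory=dict)
    data_files: List[str] = field(default_factory=list)


@dataclass
class CompoundRecord:
    """Parent record: every experiment hangs off one compound."""
    compound_name: str
    experiments: List[Experiment] = field(default_factory=list)

    def add_experiment(self, experiment_type: str) -> Experiment:
        """Create an empty experiment of the given type under this compound."""
        experiment = Experiment(experiment_type)
        self.experiments.append(experiment)
        return experiment


# Step 1: create new compound (parent record). The name is invented.
record = CompoundRecord("example-compound-1")
# Step 2: add new experiment type.
nmr = record.add_experiment("NMR")
# Step 3: add metadata and upload data files (file name is a placeholder).
nmr.metadata.update({"researcher": "A. Chemist", "instrument": "NMR-1"})
nmr.data_files.append("spectrum.jdx")
```

Because every experiment type goes through the same `add_experiment` call, the same generic deposit path serves UV-Vis, powder XRD, NMR and mass spectrometry alike.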

The Probity Service

• Process to assert originality of a piece of work / repository record

• Incorporate into EPrints core software?
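
The slides do not say how the Probity Service makes its assertion. One common way to make an originality claim checkable later is to record a cryptographic digest of the work together with a timestamp. The sketch below assumes that approach; the function names and the claim layout are hypothetical and are not taken from R4L.

```python
import hashlib
from datetime import datetime, timezone


def make_claim(content: bytes, claimant: str) -> dict:
    """Build a claim record: a digest of the content, who claims it, and when.

    Assumption: a SHA-256 digest plus a UTC timestamp is enough for the sketch.
    """
    return {
        "sha256": hashlib.sha256(content).hexdigest(),
        "claimant": claimant,
        "claimed_at": datetime.now(timezone.utc).isoformat(),
    }


def matches_claim(content: bytes, claim: dict) -> bool:
    """Return True if the content is byte-for-byte what was originally claimed."""
    return hashlib.sha256(content).hexdigest() == claim["sha256"]


# Invented example data standing in for a deposited file.
claim = make_claim(b"raw spectrum bytes", "A. Chemist")
```

Only the digest needs to be lodged with a third party, so the data itself can stay private until the researcher chooses to release it.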

Repository Search / Browse

Search / Browse

Crucial record metadata for Data Management, Search/Browse and Discovery: Date; Instrument; Location; Compound Name; Experiment Type; Researcher
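
As a rough sketch, the six metadata fields listed above are enough to drive both exact-match search and browse-by-field views. The record class, function names and sample values here are hypothetical and do not reflect the EPrints schema.

```python
import datetime
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class RecordMetadata:
    """The six fields named on the slide (attribute names are assumptions)."""
    date: datetime.date
    instrument: str
    location: str
    compound_name: str
    experiment_type: str
    researcher: str


def search(records: Iterable[RecordMetadata], **criteria: object) -> List[RecordMetadata]:
    """Return records whose fields equal every given criterion."""
    return [
        r for r in records
        if all(getattr(r, name) == value for name, value in criteria.items())
    ]


def browse(records: Iterable[RecordMetadata], field_name: str) -> Dict[str, List[RecordMetadata]]:
    """Group records by one field, as a browse-by view would."""
    groups: Dict[str, List[RecordMetadata]] = defaultdict(list)
    for r in records:
        groups[str(getattr(r, field_name))].append(r)
    return dict(groups)


# Invented sample records, for illustration only.
records = [
    RecordMetadata(datetime.date(2007, 3, 1), "NMR-1", "Lab A", "compound-1", "NMR", "A. Chemist"),
    RecordMetadata(datetime.date(2007, 3, 2), "XRD-1", "Lab B", "compound-1", "Powder XRD", "B. Chemist"),
    RecordMetadata(datetime.date(2007, 3, 2), "UV-1", "Lab A", "compound-2", "UV-Vis", "A. Chemist"),
]

nmr_hits = search(records, experiment_type="NMR")
by_researcher = browse(records, "researcher")
```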

Report Generation

• Too cumbersome and inflexible (revisit?)
• Requirement for ‘familiar’ software
• Suitable for informatics, but not routine reporting

Report Generation

• Ability to import repository data into software and easily edit
• Need to bring the repository to the researcher’s ‘desktop’
• Demonstrator employed SharePoint to store templates
• (functionality to be incorporated into repository software?)
• Does this really bring anything new to reporting research?

Analysis & Discussion: Blogging Experiments

A repository can…

• Allow one to put, store and get digital objects

• Provide minimal search and browse functions

• NOT provide the presentation and discussion functions essential to working up a scientific study

• Social networking tools and approaches can provide a way…

Getting data into Blogs

• Developing relationship between Blog and R4L repository
• Repository back-end, Blog for sharing data with collaborators and developing ideas / conclusions based on data
• ‘Live copy’ application only – <SWORD /> development?

Enabling Research

• Enables ‘geographically distributed collaborative research’

• Useful approach for sharing ‘failed’ experiments?

Open Notebook Science

Automatic Blogging by Machines

Automatic Logging of Sensor Data

• Timeline visualisation – instant detection of erroneous event
• Assists in analysing inconsistencies in datasets
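
The slides do not describe how erroneous events are picked out of the sensor log. A minimal sketch, assuming a simple acceptable range per sensor, shows how timestamped readings could be flagged for a timeline view. The class name, sensor name, thresholds and readings are all invented for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Iterable, List


@dataclass
class Reading:
    """One automatically logged sensor value (hypothetical structure)."""
    timestamp: datetime
    sensor: str
    value: float


def flag_out_of_range(readings: Iterable[Reading], low: float, high: float) -> List[Reading]:
    """Return the readings whose value falls outside the closed range [low, high]."""
    return [r for r in readings if not (low <= r.value <= high)]


# Invented log: one temperature reading per minute, with one obvious spike.
start = datetime(2007, 1, 1, 9, 0)
readings = [
    Reading(start + timedelta(minutes=i), "lab-temperature", value)
    for i, value in enumerate([20.1, 20.3, 35.0, 20.2])
]

# Assumed acceptable range of 15 to 25 degrees Celsius.
flagged = flag_out_of_range(readings, 15.0, 25.0)
```

Each flagged reading keeps its timestamp, so it can be lined up against experimental data recorded at the same moment when an inconsistency needs explaining.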

Comments and Annotation

• Chemists need to scribble!
• A picture says a thousand words!
• Need for more advanced Blog tools / technology

R4L End-to-End Overview

Usability

• Low barrier to use; familiarity; flexibility; quick gain
• A specification / requirements for data-repository-based software?

Problems Encountered

• Over-ambitious in trying to influence the attitudes of instrument manufacturers at such an early stage

• Attitudes of chemists towards changes in laboratory working procedure

• Attitudes of chemists towards change in the publication system

• Input from journal publishers before demonstrator / prototype available

• Blog software restrictive

• Extreme diversity and number of file formats employed for analytical chemistry

Future Directions

• A useful demonstrator to the practising scientist of the value of a laboratory repository…

• Further advocacy required in preservation and data management areas

• Feasibility of a departmental data repository?

• An exemplar for:
    • The institution – towards an institutional data preservation policy?
    • Publishers – improved handling of supplementary information
    • Instrument manufacturers – will respond to the demands of their customer base

• Follow-on funding: eChemistry; myExperiment

• Best practice in validation and reproducibility of experiments

• Develop relationship between data repository & ‘Blog approach’