www.dans.knaw.nl
DANS is an institute of KNAW and NWO
FAIR Data in Trustworthy Data Repositories:
Everybody wants to play FAIR, but how do we put the principles into practice?
Peter Doorn, Director DANS
Ingrid Dillo, Deputy Director DANS
EUDAT/OpenAIRE webinar, 12 and 13 December 2016
Who we are
Watch our videos on YouTube: https://www.youtube.com/user/DANSDataArchiving
What you can expect to learn
• General understanding of core requirements for trustworthy data repositories
• General understanding of the FAIR principles
• Introduction to a possible way of operationalizing the FAIR principles
DANS and DSA
• 2005: DANS founded to promote and provide permanent access to digital research information
• Need to formulate quality guidelines for digital repositories, including DANS (cf. TRAC, Nestor)
• 2006: 5 basic principles as basis for 16 DSA guidelines
• 2009: international DSA Board
• Over 60 seals acquired around the globe, but with a focus on Europe
The certification landscape
DSA and WDS: look-alikes
Commonalities:
• Lightweight, community review
Complementarity:
• Geographical spread
• Disciplinary spread
Partnership
Goals:
• Realizing efficiencies
• Simplifying assessment options
• Stimulating more certifications
• Increasing impact on the community
Outcomes:
• Common catalogue of requirements for core repository assessment
• Common procedures for assessment
• Shared testbed for assessment
New common requirements
• Context (1)
• Organizational infrastructure (6)
• Digital object management (8)
• Technology (2)
• Additional information and applicant feedback (2)
1. Organisational infrastructure
R1. The repository has an explicit mission to provide access to and preserve data in its domain.
R2. The repository maintains all applicable licenses covering data access and use and monitors compliance.
R3. The repository has a continuity plan to ensure ongoing access to and preservation of its holdings.
R4. The repository ensures, to the extent possible, that data are created, curated, accessed, and used in compliance with disciplinary and ethical norms.
R5. The repository has adequate funding and sufficient numbers of qualified staff managed through a clear system of governance to effectively carry out the mission.
R6. The repository adopts mechanism(s) to secure ongoing expert guidance and feedback (either in-house, or external, including scientific guidance, if relevant).
2. Digital object management (1)
R7. The repository guarantees the integrity and authenticity of the data.
R8. The repository accepts data and metadata based on defined criteria to ensure relevance and understandability for data users.
R9. The repository applies documented processes and procedures in managing archival storage of the data.
R10. The repository assumes responsibility for long-term preservation and manages this function in a planned and documented way.
2. Digital object management (2)
R11. The repository has appropriate expertise to address technical data and metadata quality and ensures that sufficient information is available for end users to make quality-related evaluations.
R12. Archiving takes place according to defined workflows from ingest to dissemination.
R13. The repository enables users to discover the data and refer to them in a persistent way through proper citation.
R14. The repository enables reuse of the data over time, ensuring that appropriate metadata are available to support the understanding and use of the data.
3. Technical infrastructure
R15. The repository functions on well-supported operating systems and other core infrastructural software and is using hardware and software technologies appropriate to the services it provides to its Designated Community.
R16. The technical infrastructure of the repository provides for protection of the facility and its data, products, services, and users.
New requirements are out now!
http://www.datasealofapproval.org/en/news-and-events/news/2016/11/25/wds-and-dsa-announce-uni-ed-requirements-core-cert/
https://www.icsu-wds.org/news/news-archive/wds-dsa-unified-requirements-for-core-certification-of-trustworthy-data-repositories
Back to 2005: DSA principles
The DSA is intended to ensure that:
• The data can be found on the internet
• The data are accessible (clear rights and licenses)
• The data are in a usable format
• The data are reliable
• The data are identified in a unique and persistent way so that they can be referred to
FAIR Data Principles
Workshop Leiden 2014: minimal set of community agreed guiding principles to make data more easily discoverable, accessible, appropriately integrated and re-usable, and adequately citable.
FAIR principles:
• Findable
• Accessible
• Interoperable
• Reusable
(both for machines and for people)
FAIR principles
In the FAIR approach, data should be:
1. Findable – Easy to find by both humans and computer systems and based on mandatory description of the metadata that allow the discovery of interesting datasets;
2. Accessible – Stored for long term such that they can be easily accessed and/or downloaded with well-defined license and access conditions (Open Access when possible), whether at the level of metadata, or at the level of the actual data content;
3. Interoperable – Ready to be combined with other datasets by humans as well as computer systems;
4. Reusable – Ready to be used for future research and to be processed further using computational methods.
Source: http://www.dtls.nl/fair-data/
Paper in: http://www.nature.com/articles/sdata201618
Everybody loves FAIR!
Everybody wants to be FAIR… But what does that mean? How to put the principles into practice?
Implementing the FAIR Principles?
See: http://datafairport.org/fair-principles-living-document-menu and https://www.force11.org/group/fairgroup/fairprinciples
Different implementations for different aims?
1. FAIR data management: posing requirements for new data creation
2. FAIR data assessment: establishing the profile of existing data
3. FAIR data technologies: transformation tools to make data FAIR
Creation Assessment Transformation
Creation: New Horizon 2020 guidelines on FAIR Data Management
Section 2. FAIR data
1. Making data findable, including provisions for metadata (5 questions)
2. Making data openly accessible (10 questions)
3. Making data interoperable (4 questions)
4. Increase data re-use (through clarifying licenses) (4 questions)

Additional sections:
1. Data summary (6 questions, 5 of which also cover aspects of FAIRness)
3. Allocation of resources (4 questions)
4. Data security (2 questions)
5. Ethical aspects (2 questions)
6. Other issues (2 questions)

Total of 23 + 16 = 39 questions!
This may result in:
FAIRDMP
Assess: Resemblance DSA – FAIR principles
DSA Principles (for data repositories) → FAIR Principles (for datasets)
• data can be found on the internet → Findable
• data are accessible → Accessible
• data are in a usable format → Interoperable
• data are reliable → Reusable
• data can be referred to (citable) → —

The resemblance is not perfect:
• usable format (DSA) is an aspect of interoperability (FAIR)
• reliability (DSA) is a condition for reuse (FAIR)
• FAIR explicitly addresses machine readability
• citability is in FAIR an aspect of findability
Combine and operationalize: DSA & FAIR
• Growing demand for quality criteria for research datasets and a way to assess their fitness for use
• Combine the principles of core repository certification and FAIR
• Use the principles as quality criteria:• Core certification – digital repositories• FAIR – research data (sets)
• Operationalize the principles to make them easily implementable in any trustworthy digital repository
How could it work?
• Consider F, A and I as separate dimensions of data quality
• Interaction effects among dimensions complicate scoring, and so do elements that occur under different FAIR dimensions
• Score datasets on each dimension (from 1 to 5)
• Consider Reusability as the resultant of the other three
• Consider R, the average FAIRness, as an indicator of data quality: (F + A + I) / 3 = R
• Make scoring as automatic as possible, although not all principles can be established objectively:
  • scoring at ingest by data archivists of the TDR
  • after reuse by data users (community review)
Each dataset a FAIR profile
Examples:
• Dataset X has FAIR profile F4-A3-I2 → R = 3
  • PID with limited metadata (well findable; 4)
  • Accessible with some restrictions (3)
  • Fairly low interoperability (2)
• Dataset Y has FAIR profile F5-A2-I3 → R = 3.3
  • PID and extensive metadata (very easy to find; 5)
  • Only metadata accessible (2)
  • Average rating on interoperability (3)
Numbers of assessments, reviews and downloads indicated as well
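The proposed scoring rule is simple enough to sketch in code. This is a minimal illustration of the (F + A + I) / 3 = R average applied to the two example profiles; the function name is ours, not part of the DANS proposal:

```python
# Minimal sketch of the proposed FAIR profile scoring: F, A and I are
# each scored on a 1-5 scale, and R (Reusability) is their average.

def fair_r_score(f: int, a: int, i: int) -> float:
    """Return R, the average FAIRness of the F, A and I scores."""
    for name, score in (("F", f), ("A", a), ("I", i)):
        if not 1 <= score <= 5:
            raise ValueError(f"{name} must be between 1 and 5, got {score}")
    return round((f + a + i) / 3, 1)

# Dataset X, profile F4-A3-I2:
print(fair_r_score(4, 3, 2))  # -> 3.0
# Dataset Y, profile F5-A2-I3:
print(fair_r_score(5, 2, 3))  # -> 3.3
```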
By the way, alternative visualisation ideas…
FAIR dice, suggested by Herbert van de Sompel
FAIR letters, suggested by Ian Duncan
FAIR leaves, suggested by Wouter Haak
FAIR clover 1
FAIR clover 2
FAIR sunflower
Operationalising F - Findable
Findable – defined by metadata, documentation (and identifier for citation):
1. No PID and no metadata/documentation
2. PID without or with insufficient* metadata
3. Sufficient* metadata without PID
4. PID with sufficient* metadata
   – Information on data provenance
5. PID, rich metadata and additional documentation
   – Additional explanation of how data can be used

* Sufficient = enough metadata to understand what the data is about
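The F scale above can be expressed as a simple decision rule, assuming the repository records a PID flag and a (necessarily subjective) judgement of the metadata level. The field names and categories here are our own illustration, not a published DANS schema:

```python
# Sketch of the Findability (F1-F5) scale. `metadata` is one of
# "none", "insufficient", "sufficient" or "rich" -- a human judgement,
# since "sufficient" metadata is inherently subjective.

def score_findable(has_pid: bool, metadata: str,
                   has_documentation: bool = False) -> int:
    """Map PID/metadata state to the F1-F5 scale."""
    if not has_pid and metadata in ("none", "insufficient"):
        return 1  # F1: no PID and no (usable) metadata/documentation
    if has_pid and metadata in ("none", "insufficient"):
        return 2  # F2: PID without or with insufficient metadata
    if not has_pid:
        return 3  # F3: sufficient metadata without PID
    if metadata == "rich" and has_documentation:
        return 5  # F5: PID, rich metadata and additional documentation
    return 4      # F4: PID with sufficient metadata
```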
Operationalising A - Accessible
Accessible – defined by presence of a user license [metadata retrievable by identifier: already included under F]:
1. No user license / unclear conditions of reuse / neither metadata nor data are accessible
2. Metadata are accessible (even when the data are not or no longer available)
3. User restrictions apply (of any kind, including privacy, commercial interests, embargo period, etc.)
4. Public access (after registration)
5. Open access (unrestricted, CC0 – perhaps also CC BY?)

Note 1: Some people want “Openness” to be separate from Accessible
Note 2: A3 could be seen as an acceptable threshold (e.g. by funders)
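Because the A levels are mutually exclusive, the scale reduces to a lookup. The access-level labels below are hypothetical stand-ins for whatever licence/access metadata a repository actually records:

```python
# Sketch of the Accessibility (A1-A5) scale; the keys are our own
# stand-ins for a repository's licence/access metadata.
ACCESS_SCORES = {
    "no_license": 1,     # A1: no user licence / unclear conditions of reuse
    "metadata_only": 2,  # A2: only the metadata are accessible
    "restricted": 3,     # A3: user restrictions apply (privacy, embargo, ...)
    "public": 4,         # A4: public access after registration
    "open": 5,           # A5: open access (unrestricted, e.g. CC0)
}

def score_accessible(access_level: str) -> int:
    """Map a recorded access level to the A1-A5 scale."""
    return ACCESS_SCORES[access_level]
```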
Operationalising I - Interoperable
Interoperable – defined by the data format:
1. Proprietary, non-open format data
2. Proprietary format, accepted by a DSA-certified Trusted Data Repository
3. Non-proprietary, open format (= “preferred” or “archival” format)
4. Data is additionally harmonized/standardized, using standard vocabularies
5. Data is additionally linked to other data to provide context
Note: this is an adaptation of Tim Berners-Lee’s 5-star open data plan!
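The I scale can be sketched as a format check plus two escalation flags. The format sets below are illustrative placeholders only, not the actual DSA or archival format registries:

```python
# Sketch of the Interoperability (I1-I5) scale. The three format sets
# are hypothetical examples; a real implementation would consult the
# repository's maintained format lists.
PROPRIETARY_ACCEPTED = {"xlsx", "docx"}   # accepted by a certified TDR
OPEN_ARCHIVAL = {"csv", "xml", "txt"}     # "preferred"/"archival" formats

def score_interoperable(fmt: str, harmonized: bool = False,
                        linked: bool = False) -> int:
    """Map a data format (plus harmonization/linking flags) to I1-I5."""
    if fmt in OPEN_ARCHIVAL:
        if linked:
            return 5   # I5: additionally linked to other data for context
        if harmonized:
            return 4   # I4: additionally uses standard vocabularies
        return 3       # I3: non-proprietary, open format
    if fmt in PROPRIETARY_ACCEPTED:
        return 2       # I2: proprietary, but accepted by a certified TDR
    return 1           # I1: proprietary, non-open format
```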
First we attempted to operationalise R – Reusable as well… but we changed our minds

Reusable – is it a separate dimension? Partly subjective: it depends on what you want to use the data for!

Idea for operationalisation | Why we rejected it
Clear provenance of data (to facilitate both replication and reuse) | Aspect of F4
Data is in a TDR – unsustained data will not remain usable | Aspect of the repository → Data Seal of Approval
Explication of how data was or can be used is available | Aspect of F5
Data automatically usable by machines | Aspect of I5
Data is reliable (replicable) | Can only be known after re-analysis
What it would look like in the DANS EASY archive
Or in Zenodo, Dataverse, Mendeley Data, figshare, B2SAFE, … (if they comply with the DSA)
Towards a FAIR Data Assessment Tool
• Independent website like the DSA website
• Repositories will link to the FAIR assessment website
• The website will provide:
  Ø Assessment tool (“questionnaire” with explanation and examples for each criterion)
  Ø Online database containing:
    o Repository holding the dataset
    o PID (+ basic metadata such as name) of dataset
    o Reviewer info (ID can be withheld – anonymous reviews should be possible)
    o FAIR profile and scores
  Ø Analytics of FAIR profiles
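A possible shape for one record in that online database, with field names inferred from the bullets above (they are assumptions, not a published schema; the PID in the usage example is a made-up placeholder):

```python
# Sketch of one FAIR assessment record; field names are our own
# reading of the proposed database contents.
from dataclasses import dataclass
from typing import Optional

@dataclass
class FairAssessment:
    repository: str              # repository holding the dataset
    pid: str                     # persistent identifier of the dataset
    dataset_name: str            # basic metadata
    reviewer_id: Optional[str]   # None permits an anonymous review
    f_score: int
    a_score: int
    i_score: int

    @property
    def r_score(self) -> float:
        """R as the average of the F, A and I scores."""
        return round((self.f_score + self.a_score + self.i_score) / 3, 1)

    @property
    def profile(self) -> str:
        return f"F{self.f_score}-A{self.a_score}-I{self.i_score}"

# Usage (hypothetical PID):
rec = FairAssessment("DANS EASY", "urn:example:pid-1", "Dataset X",
                     None, 4, 3, 2)
print(rec.profile, rec.r_score)  # -> F4-A3-I2 3.0
```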
Can FAIR Data Assessment be automatic?

Criterion | Automatic? (Y/N/Semi) | Subjective? (Y/N/Semi) | Comments
F1 No PID / no metadata | Y | N |
F2 PID / insufficient metadata | S | S | “Insufficient metadata” is subjective
F3 No PID / sufficient metadata | S | S | “Sufficient metadata” is subjective
F4 PID / sufficient metadata | S | S | “Sufficient metadata” is subjective
F5 PID / rich metadata | S | S | “Rich metadata” is subjective
A1 No license / no access | Y | N |
A2 Metadata accessible | Y | N |
A3 User restrictions | Y | N |
A4 Public access | Y | N |
A5 Open access | Y | N |
I1 Proprietary format | S | N | Depends on list of proprietary formats
I2 Accepted format | S | S | Depends on list of accepted formats
I3 Archival format | S | S | Depends on list of archival formats
I4 + Harmonized | N | S | Depends on domain vocabularies
I5 + Linked | S | N | Depends on semantic methods used
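The table lends itself to being stored as data, so a tool could select the fully automatic, objective criteria for machine checking and route the rest to human reviewers. A sketch ("S" = semi):

```python
# The automatability table as data: criterion -> (automatic, subjective).
CRITERIA = {
    "F1": ("Y", "N"), "F2": ("S", "S"), "F3": ("S", "S"),
    "F4": ("S", "S"), "F5": ("S", "S"),
    "A1": ("Y", "N"), "A2": ("Y", "N"), "A3": ("Y", "N"),
    "A4": ("Y", "N"), "A5": ("Y", "N"),
    "I1": ("S", "N"), "I2": ("S", "S"), "I3": ("S", "S"),
    "I4": ("N", "S"), "I5": ("S", "N"),
}

# Criteria that can be checked fully automatically and objectively:
fully_automatic = [c for c, (auto, subj) in CRITERIA.items()
                   if auto == "Y" and subj == "N"]
print(fully_automatic)  # -> ['F1', 'A1', 'A2', 'A3', 'A4', 'A5']
```

Note that every fully automatic criterion sits on the F or A dimension; all of I needs at least a curated format list.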
Optional: qualitative assessment / data review
Main takeaways
• The core certification requirements and the FAIR principles form a perfect pair for quality assessment of research data and trustworthy data repositories
• DANS is developing an operationalization of these principles:
  • Data archive staff will assess FAIRness of data upon ingest
  • Data users will assess FAIRness upon reuse
  • Scoring mechanism as automatic as possible
• Ideally: all certified trustworthy repositories should contain FAIR data
• But: the FAIR scoring mechanism is applicable in any repository
Hands-on FAIR data management: IDCC workshops
12th International Digital Curation Conference, Edinburgh, 20-23 February 2017
How EUDAT Services could support FAIR data (workshop 4)… In this workshop we relate the FAIR principles to EUDAT’s Service Suite. You will experiment with community metadata and with an annotation tool for curating and retrieving data. Community representatives will talk about their experiences...
OpenAIRE services and tools for Open Research Data in H2020 (workshop 6)… The workshop will focus on practical issues of storage and data sharing aspects, on how to select an appropriate data repository, on guidance for efficient data management plans, on the monitoring of contextualized (linked) research…
Essentials 4 Data Support: the Train-the-Trainer version (workshop 9)… You learn how Research Data Netherlands made the popular ‘Essentials 4 Data Support’ training with rich online content, weekly assignments and expert speakers… to practice hands-on writing DMPs and making a data management stakeholder analysis…
For more information and registration see: http://www.dcc.ac.uk/events/idcc17
Thank you for listening
[email protected]@dans.knaw.nl
www.dans.knaw.nl