Introduction to Data Management Planning Aaike De Wever @aaik

Introduction to Data Management Planning at Alien Challenge COST workshop


Page 1: Introduction to Data Management Planning at Alien Challenge COST workshop

Introduction to Data Management Planning

Aaike De Wever @aaik

Page 2: Introduction to Data Management Planning at Alien Challenge COST workshop

Landscape with the Fall of Icarus, Royal Museums of Fine Arts of Belgium, now seen as a good early copy of Bruegel's original - ca. 1558 - By Pieter Brueghel the Elder (1526/1530–1569) - 1., Public Domain, https://commons.wikimedia.org/w/index.php?curid=11974918

Page 3: Introduction to Data Management Planning at Alien Challenge COST workshop

Source: JISC http://webarchive.nationalarchives.gov.uk/20140702233839/http:/www.jisc.ac.uk/whatwedo/campaigns/res3/jischelp.aspx

Research Life Cycle and the Data Life Cycle

Page 4: Introduction to Data Management Planning at Alien Challenge COST workshop

Source: http://www.data-archive.ac.uk/create-manage/life-cycle

Page 5: Introduction to Data Management Planning at Alien Challenge COST workshop

Data decay

Vines et al. (2014) The Availability of Research Data Declines Rapidly with Article Age. Current Biology 24, 1-4.

Page 6: Introduction to Data Management Planning at Alien Challenge COST workshop

“Data often have a longer lifespan than the research project that creates them. Researchers may continue to work on data after funding has ceased, follow-up projects may analyse or add to the data, and data may be re-used by other researchers.”

“Well organised, well documented, preserved and shared data are invaluable to advance scientific inquiry and to increase opportunities for learning and innovation.”

Source: http://www.data-archive.ac.uk/create-manage/life-cycle

Page 7: Introduction to Data Management Planning at Alien Challenge COST workshop

Data management plan: definition

A data management plan or DMP is a formal document that outlines how you will handle your data both during your research, and after the project is completed.

The goal of a data management plan is to consider the many aspects of data management, metadata generation, data preservation, and analysis before the project begins; this ensures that data are well-managed in the present, and prepared for preservation in the future.

Page 8: Introduction to Data Management Planning at Alien Challenge COST workshop

Motivation to construct a DMP

• Intrinsic motivation
• Obligation from:
  – institute
  – funder

Page 9: Introduction to Data Management Planning at Alien Challenge COST workshop

What should motivate you (1/2)

• Planning facilitates later work, especially with regard to:
  – data storage
  – publication
  – archival & retrieval
• Opportunity to consider how you handle data, including:
  – recording new data
  – efficient organisation of data
  – workflows to ensure its quality

Page 10: Introduction to Data Management Planning at Alien Challenge COST workshop

What should motivate you (2/2)

• Ensuring data is understandable and re-usable, thus:
  – maximising visibility & impact
  – considering data publication, re-use and dissemination of data and results at an early stage
• Opportunity to “talk data” with project partners, covering e.g.:
  – data workflows
  – data exchange
  – common data management policies and practices

Page 11: Introduction to Data Management Planning at Alien Challenge COST workshop

Approach for constructing a DMP depends on

• Purpose
  – Project proposal
  – Improving internal data management
  – Overall open data policy
• Addressee
  – Funding agency: NSF, EU,…
  – Internal / Website
• Type(s) of data and volume
  – Big data such as next generation sequencing output or automatic sensor data
  – Remote sensing imagery
  – Biodiversity monitoring campaign

Page 12: Introduction to Data Management Planning at Alien Challenge COST workshop

Typical components of a DMP

• What
• Info
• Standards
• Sharing
• Storage

Page 13: Introduction to Data Management Planning at Alien Challenge COST workshop

Typical components of a DMP

What data and analyses are expected

Info about data (metadata)

Standards for data storage and exchange

Sharing, publication, public archiving

Storage, preservation and access to data

Page 14: Introduction to Data Management Planning at Alien Challenge COST workshop

DMP components: Simple example

Data collected during runs: distance, duration, speed, cadence, heart rate

Page 15: Introduction to Data Management Planning at Alien Challenge COST workshop

DMP components: Simple example

What:
• Data collected during runs: distance, duration, speed, cadence, heart rate

Info:
• Personal info on runner entered when configuring device/account
• Metadata for run automatically recorded (date, time) or associated during data upload (weather)
• Additional metadata for run (shoes, remarks) added through web interface

Standards:
• Raw data in .fit format
• Uploaded data available through web services (JSON format) for exchange with other services
• Summary data downloadable as CSV

Storage:
• Data on device (as long as storage permits) and pending uploads on computer
• Uploaded data accessible through on-line service
• Ad-hoc download of summary data on computer (end of year)

Sharing:
• Data accessible to authorised “connections”
• Privacy settings can be set for individual activities
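The step from JSON web-service data to a downloadable CSV summary can be illustrated with a minimal Python sketch. The JSON payload and its field names (`distance_km`, `duration_min`, `avg_heart_rate`) are made up for illustration; real fitness services each use their own schema.

```python
import csv
import io
import json

# Made-up JSON payload; field names are assumptions, not any real service's API.
activities_json = """
[
  {"date": "2016-03-12", "distance_km": 10.2, "duration_min": 55.0, "avg_heart_rate": 148},
  {"date": "2016-03-14", "distance_km": 5.0, "duration_min": 26.5, "avg_heart_rate": 152}
]
"""

def summarise(json_text):
    """Turn a JSON list of runs into a CSV summary with a derived average speed."""
    activities = json.loads(json_text)
    buffer = io.StringIO()
    fields = ["date", "distance_km", "duration_min", "avg_speed_kmh"]
    writer = csv.DictWriter(buffer, fieldnames=fields, lineterminator="\n")
    writer.writeheader()
    for activity in activities:
        hours = activity["duration_min"] / 60
        writer.writerow({
            "date": activity["date"],
            "distance_km": activity["distance_km"],
            "duration_min": activity["duration_min"],
            "avg_speed_kmh": round(activity["distance_km"] / hours, 2),
        })
    return buffer.getvalue()

summary_csv = summarise(activities_json)
print(summary_csv)
```

Keeping the raw export next to such a derived summary makes it possible to recompute the summary later if the calculation changes.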

Page 16: Introduction to Data Management Planning at Alien Challenge COST workshop

DMP components: Sequence data

What:
• Data from four 454-sequencing runs
• Expected maximum file size: 4 Gb raw data/run
• Output format: Standard Flowgram Format (SFF)

Info:
Information in ReadMe-file:
• Project details & background
• Information on sample origin & selection
• Sequencing methodology

Standards:
• Raw data in Standard Flowgram Format (SFF)
• FASTA format for analysis and storage
• Metadata included in FASTA format

Storage:
• Local storage on removable HD
• Back-up through cloud storage

Sharing:
• Raw data in NCBI Sequence Read Archive
• Annotated FASTA data in EMBL/GenBank
• Data associated with publications in Datadryad.org
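One way to carry metadata inside a FASTA file is to put it on the header line of each record. The sketch below is only an illustration: FASTA defines the `>` header line but no standard for structured metadata, so the `key=value` convention, the record identifier and the metadata keys are all assumptions.

```python
def format_fasta(record_id, sequence, metadata, width=60):
    """Return one FASTA record with key=value metadata on the header line."""
    description = " ".join(f"{key}={value}" for key, value in metadata.items())
    header = f">{record_id} {description}".rstrip()
    # Wrap the sequence at a fixed width, as is customary for FASTA files.
    body = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return "\n".join([header, *body]) + "\n"

record = format_fasta(
    "sample01_read0001",          # hypothetical identifier
    "ACGT" * 20,                  # made-up 80-base sequence
    {"site": "pond_A", "collected": "2016-03-14", "platform": "454"},
)
print(record)
```

Whatever convention is chosen, it is worth describing it in the ReadMe file so that others can parse the headers.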

Page 17: Introduction to Data Management Planning at Alien Challenge COST workshop

DMP components: Monitoring data

What:
• Data from monthly water quality monitoring at 100 stations, 15 parameters recorded in spreadsheet (data volume 194 kB in xlsx, 62 kB in csv)

Info:
• Details of individual sampling events recorded in spreadsheet
• Sampling & analysis protocol recorded in field manual

Standards:
Spreadsheet data imported into SQL database with export queries for:
• Occurrence data in csv using fields from Darwin Core standard
• Metadata in Ecological Metadata Language (EML)

Storage:
• Local storage: Raw spreadsheet data on field operators’ device
• Database on institute servers + local back-ups

Sharing:
• Occurrence data through GBIF on national IPT node
• Data associated with publications in Datadryad.org
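An export query of the kind mentioned above can be sketched in Python with an in-memory SQLite database. The table layout, station codes and records are hypothetical (the slides do not describe the actual schema); the column aliases are existing Darwin Core term names.

```python
import csv
import io
import sqlite3

# Hypothetical table layout and sample rows, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE observation (
           id INTEGER PRIMARY KEY,
           station TEXT, sampled_on TEXT,
           lat REAL, lon REAL, taxon TEXT
       )"""
)
conn.executemany(
    "INSERT INTO observation (station, sampled_on, lat, lon, taxon) "
    "VALUES (?, ?, ?, ?, ?)",
    [
        ("ST001", "2016-03-14", 50.85, 4.35, "Dreissena polymorpha"),
        ("ST002", "2016-03-14", 51.05, 3.72, "Corbicula fluminea"),
    ],
)

# Export query: each column is aliased to a Darwin Core term.
EXPORT_QUERY = """
    SELECT 'obs-' || id       AS occurrenceID,
           'HumanObservation' AS basisOfRecord,
           sampled_on         AS eventDate,
           station            AS locationID,
           lat                AS decimalLatitude,
           lon                AS decimalLongitude,
           taxon              AS scientificName
    FROM observation
    ORDER BY id
"""

def export_occurrences(connection):
    """Run the export query and return the result as CSV text."""
    cursor = connection.execute(EXPORT_QUERY)
    header = [column[0] for column in cursor.description]
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(cursor)
    return buffer.getvalue()

occurrence_csv = export_occurrences(conn)
print(occurrence_csv)
```

Doing the mapping in the query keeps the internal column names free to follow local habits while the exported file stays standard.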

Page 18: Introduction to Data Management Planning at Alien Challenge COST workshop

Funders

Page 19: Introduction to Data Management Planning at Alien Challenge COST workshop

Typical components of a DMP for NSF-Bio

What, Info, Standards, Sharing (media and methods; policies and public access), Storage & Who

1. Describe data, metadata, formats, standards
2. Physical/cyber resources and facilities
3. Media and dissemination methods
4. Policies for data sharing and public access
5. Roles and responsibilities

Page 20: Introduction to Data Management Planning at Alien Challenge COST workshop

H2020 – Pilot action

Pilot action on open access to research data – Research Data generated by the project:
• deposit in a research data repository and take measures to make it possible for third parties to access, mine, exploit, reproduce and disseminate, free of charge for any user, the following:
  – the data, including associated metadata, needed to validate the results presented in scientific publications as soon as possible;
  – other data, including associated metadata, as specified and within the deadlines laid down in the data management plan (see Annex I);

Page 21: Introduction to Data Management Planning at Alien Challenge COST workshop

Typical components of a DMP for H2020

What, Info, Standards, Sharing, Storage

1. What types of data will the project generate/collect?
2. What standards will be used?
3. How will this data be exploited and/or shared/made accessible for verification and re-use? If data cannot be made available, explain why.
4. How will this data be curated and preserved?

Page 22: Introduction to Data Management Planning at Alien Challenge COST workshop

H2020 DMP example: AQUACROSS

Knowledge, Assessment, and Management for AQUAtic Biodiversity and Ecosystem Services aCROSS EU policies - aquacross.eu:
• 16 partners
• 8 case studies (over 4 WPs)
• dataset survey
• … case studies to be worked out, datasets to be used only partially known

Organisation of Data Management Plan
• Overall DMP distinguishes categories of data (but has to remain general)
• Allows for more specific DMP at case study level
• Living document, updating at regular intervals

Page 23: Introduction to Data Management Planning at Alien Challenge COST workshop

Open data institutes in biodiversity monitoring?

See https://www.inbo.be/en/norms-for-data-use

Page 24: Introduction to Data Management Planning at Alien Challenge COST workshop

Open data institutes in biodiversity monitoring?

Key actions:
- All data owned by the institute are covered by open data policy
- Data are released after an embargo period of 12 months
- Focus on raw data (biodiversity, sequence and map data)
- Data associated with scientific papers are opened up
- Data are released under the CC0 license waiver
- Data norms apply
- Data are sufficiently documented through metadata
- All research projects are required to prepare a DMP
- Researchers are required to apply these policies
- Researchers are supported by the institute's data unit
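A 12-month embargo rule like the one listed above is easy to check automatically. The sketch below is an illustration only: it assumes the embargo is counted from the collection date, which the policy summary does not state, and the function names are hypothetical.

```python
import calendar
from datetime import date

EMBARGO_MONTHS = 12  # embargo length taken from the key actions above

def add_months(start, months):
    """Return the date `months` later, clamping the day to the month's length."""
    month_index = start.month - 1 + months
    year = start.year + month_index // 12
    month = month_index % 12 + 1
    day = min(start.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)

def is_released(collected_on, today, embargo_months=EMBARGO_MONTHS):
    """Assumption: the embargo period starts on the collection date."""
    return today >= add_months(collected_on, embargo_months)

print(add_months(date(2016, 3, 14), EMBARGO_MONTHS))
print(is_released(date(2016, 3, 14), date(2017, 3, 13)))
```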

Page 25: Introduction to Data Management Planning at Alien Challenge COST workshop

Features specific for Alien Invasive Species data?

• Need to integrate initiatives in regional and national networks (Crall et al. 2010 doi:10.1007/s10530-010-9740-9)
• Need for faster sharing of information (for early warning and rapid response)
• More thorough data validation measures (before taking management action)
• Link observation data with data on management response (common standard for recording this?)
• Concerns about privacy

Page 26: Introduction to Data Management Planning at Alien Challenge COST workshop

The Tower of Babel (1563), Kunsthistorisches Museum, Vienna, oil on board - By Pieter Brueghel the Elder (1526/1530–1569) - 1., Public Domain, https://commons.wikimedia.org/w/index.php?curid=11974918

Page 27: Introduction to Data Management Planning at Alien Challenge COST workshop

Questions

WISSS ?

Page 29: Introduction to Data Management Planning at Alien Challenge COST workshop

Tools

DMP Online - https://dmponline.dcc.ac.uk

DMPTool - https://dmptool.org

Others:
• DMP builder - https://dmp.library.ualberta.ca
• DMP editor - http://www.openmetadata.org/site/?page_id=373
• IEDA Data Management Plan tool - http://www.iedadata.org/compliance/plan

Page 30: Introduction to Data Management Planning at Alien Challenge COST workshop

What data and analyses are expected?

Describe the data to be collected (actual observations) during your research including amount (if known). Name the type of data, the instrument or collection approach, and how the data will be sampled. If actual data are interpreted, note the interpretation. Describe any quality control measures. Also describe the final derivative products (datasets and software or computer code) and the analysis used including analytical software packages that are required for replication, etc. Describe both data (digital and analog) and physical materials (samples and collections) gathered or generated during the time of the award.

Consider these questions:
• What data will be generated in the research?
• What data types will you be creating or capturing? (e.g. experimental measures, qualitative, raw, processed)
• How will you capture or create the data? (This should cover content selection, instrumentation, technologies and approaches chosen, methods for naming, versioning, meeting user needs, etc, and should be sensitive to the location in which data capture is taking place.)
• If you will be using existing data, state that fact and include where you got it. What is the relationship between the data you are collecting and the existing data?

Source: https://dmptool.org

Page 31: Introduction to Data Management Planning at Alien Challenge COST workshop

Info/Standards – Standards, Formats and Metadata

Describe the format of your data; think about what details (metadata) someone else would need to be able to use these files. Describe the structural standards that you will apply in making data and metadata available. For example, for most ecological data, documentation should be structured in Ecological Metadata Language (EML). An example of metadata could also be as simple as a "readme file" to explain variables, structure of the files, etc.

Consider these questions:
• Which file formats will you use for your data and why?
• What form will the metadata describing/documenting your data take?
• How will you create or capture these details?
• Which metadata standards will you use and why have you chosen them? (e.g. accepted domain-local standards, widespread usage)
• What contextual details (metadata) are needed to make the data you capture or collect meaningful?

Source: https://dmptool.org
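The "readme file" form of metadata mentioned above can be generated from a short list of variable descriptions. The following Python sketch is only an illustration: the dataset title, file name and variables are hypothetical, and the layout is one possible convention rather than a standard.

```python
# Hypothetical variable descriptions for a small monitoring dataset.
VARIABLES = [
    ("station", "text", "Identifier of the sampling station"),
    ("date", "YYYY-MM-DD", "Date of the sampling event"),
    ("temperature", "degrees Celsius", "Water temperature at 0.5 m depth"),
]

def build_readme(title, data_file, variables):
    """Return plain-text readme content listing each variable, its unit and meaning."""
    lines = [title, "=" * len(title), "", f"Data file: {data_file}", "", "Variables:"]
    for name, unit, description in variables:
        lines.append(f"- {name} ({unit}): {description}")
    return "\n".join(lines) + "\n"

readme_text = build_readme("Water quality monitoring", "monitoring.csv", VARIABLES)
print(readme_text)
```

Generating the readme from the same variable list used in the analysis code helps keep documentation and data in step.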

Page 32: Introduction to Data Management Planning at Alien Challenge COST workshop

Who – Roles and responsibilities

Explain how the responsibilities regarding the management of your data will be delegated. This should include time allocations, project management of technical aspects, training requirements, and contributions of non-project staff - individuals should be named where possible. Remember that those responsible for long-term decisions about your data will likely be the custodians of the repository/archive you choose to store your data. While the costs associated with your research (and the results of your research) must be specified in the Budget Justification portion of the proposal, you may want to reiterate who will be responsible for funding the management of your data.

Consider the following:
• Outline the staff/organizational roles and responsibilities for implementing this data management plan.
• Who will be responsible for data management and for monitoring the data management plan?
• How will adherence to this data management plan be checked or demonstrated?
• What process is in place for transferring responsibility for the data?
• Who will have responsibility over time for decisions about the data once the original personnel are no longer available?

Source: https://dmptool.org

Page 33: Introduction to Data Management Planning at Alien Challenge COST workshop

Sharing – Dissemination methods

Describe how and where you will make these data and metadata available to the community. Remember BIO is committed to timely and rapid data distribution; make sure you address how soon your data will be available. Indicate what data will be made available and preserved. Will data be accessible on a web page, by email request, via open-access repository, etc.?

Consider these questions:
• What data will be made available from the study and preserved for the long-term?
• How and when will you make the data available? (Include resources needed to make the data available: equipment, systems, expertise, etc.)
• What transformations will be necessary to prepare data for preservation / data sharing?
• What metadata / documentation will be submitted alongside the data or created on deposit/transformation in order to make the data reusable?
• What related information will be deposited?
• What is the process for gaining access to the data?
• How long will the original data collector/creator/principal investigator retain the right to use the data before opening it up to wider use?
• Explain details of any embargo periods for political/commercial/patent or publisher reasons.

Source: https://dmptool.org

Page 34: Introduction to Data Management Planning at Alien Challenge COST workshop

Sharing – Policies for Data Sharing and Public Access

Describe the policies under which these data will be made available. It is very important, the reason a DMP is required, that you specify how you will share your data with non-group members after the project is completed. If the data is of a sensitive nature (privacy or ecological endangerment concerns, for instance) and public access is inappropriate, address here the means by which granular control and access will be achieved (e.g. formal consent agreements, anonymized data, only available within a secure network, etc.).

Consider these questions:
• Will any permission restrictions need to be placed on the data?
• Are there ethical and privacy issues? If so, how will these be resolved?
• What have you done to comply with your obligations in your IRB Protocol?
• Who will hold the intellectual property rights to the data and how might this affect data access?
• What and who are the intended or foreseeable uses/users of the data?
• Do you plan on publishing findings which rely on the data? If so, do your prospective publishers place any restrictions on other avenues of publication?

Source: https://dmptool.org

Page 35: Introduction to Data Management Planning at Alien Challenge COST workshop

Storage – Archiving, Storage and Preservation

Consider which data (or research products) will be deposited for long-term access and where. (What physical and/or cyber resources and facilities (including third party resources) will be used to store and preserve the data after the grant ends?) Describe your long-term strategy for storing, archiving and preserving the data you will generate or use.

Consider the following:
• What is the long-term strategy for maintaining, curating and archiving the data?
• Which archive/repository/database have you identified as a place to deposit data?
• What procedures does your intended long-term data storage facility have in place for preservation and backup?
• How long will/should data be kept beyond the life of the project?
• What data will be preserved for the long-term?
• On what basis will data be selected for long-term preservation?
• What metadata/documentation will be submitted alongside the data or created on deposit/transformation in order to make the data reusable?
• What related information will be deposited?

Source: https://dmptool.org