Introduction to Data Management Planning
Aaike De Wever @aaik
Landscape with the Fall of Icarus, Royal Museums of Fine Arts of Belgium, now seen as a good early copy of Bruegel's original - ca. 1558 - By Pieter Brueghel the Elder (1526/1530–1569) - 1., Public Domain, https://commons.wikimedia.org/w/index.php?curid=11974918
Source: JISC http://webarchive.nationalarchives.gov.uk/20140702233839/http:/www.jisc.ac.uk/whatwedo/campaigns/res3/jischelp.aspx
Research Life Cycle and the Data Life Cycle
Source: http://www.data-archive.ac.uk/create-manage/life-cycle
Data decay
Vines et al. (2014) The Availability of Research Data Declines Rapidly with Article Age. Current Biology 24, 1-4.
“Data often have a longer lifespan than the research project that creates them. Researchers may continue to work on data after funding has ceased, follow-up projects may analyse or add to the data, and data may be re-used by other researchers.”
“Well organised, well documented, preserved and shared data are invaluable to advance scientific inquiry and to increase opportunities for learning and innovation.”
Source: http://www.data-archive.ac.uk/create-manage/life-cycle
Data management plan: definition
A data management plan or DMP is a formal document that outlines how you will handle your data both during your research, and after the project is completed.
The goal of a data management plan is to consider the many aspects of data management, metadata generation, data preservation, and analysis before the project begins; this ensures that data are well-managed in the present, and prepared for preservation in the future.
Motivation to construct a DMP
• Intrinsic motivation
• Obligation from:
– institute
– funder
What should motivate you (1/2)
• Planning facilitates later work, esp. with regard to:
– data storage
– publication
– archival & retrieval
• Opportunity to consider how you handle data, incl.:
– recording new data
– efficient organisation of data
– workflows to ensure its quality
What should motivate you (2/2)
• Ensuring data is understandable and re-usable, thus:
– maximising the visibility & impact
– considering data publication, re-use and dissemination of data and results at an early stage
• Opportunity to “talk data” with project partners, covering e.g.:
– data workflows
– data exchange
– common data management policies and practices
Approach for constructing DMP depends on
• Purpose
– Project proposal
– Improving internal data management
– Overall open data policy
• Addressee
– Funding agency: NSF, EU,…
– Internal / Website
• Type(s) of data and volume
– Big data such as next generation sequencing output or automatic sensor data
– Remote sensing imagery
– Biodiversity monitoring campaign
Typical components of a DMP
• What: what data and analyses are expected
• Info: info about data (metadata)
• Standards: standards for data storage and exchange
• Sharing: sharing, publication, public archiving
• Storage: storage, preservation and access to data
DMP components: Simple example
What:
• Data collected during runs: distance, duration, speed, cadence, heartrate
Info:
• Personal info on runner entered when configuring device/account
• Metadata for run automatically recorded (date, time) or associated during data upload (weather)
• Additional metadata for run (shoes, remarks) added through web interface
Standards:
• Raw data in .fit format
• Uploaded data available through web services (json format) for exchange with other services
• Summary data downloadable as csv-data
Storage:
• Data on device (as long as storage permits) and pending uploads on computer
• Uploaded data accessible through on-line service
• Ad-hoc download of summary data on computer (end of year)
Sharing:
• Data accessible to authorised “connections”
• Privacy settings can be set for individual activities
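The step from uploaded JSON data to a downloadable CSV summary can be sketched roughly as below. This is only an illustration: the field names (`distance_km`, `duration_min`, `avg_heartrate`) and the sample values are assumptions, not the format of any particular running service.

```python
import csv
import io
import json

# Hypothetical JSON payload, standing in for what a web service might return.
activities_json = """
[
  {"date": "2016-03-01", "distance_km": 10.2, "duration_min": 52.0, "avg_heartrate": 148},
  {"date": "2016-03-04", "distance_km": 5.0, "duration_min": 24.5, "avg_heartrate": 155}
]
"""

def summarise(activities):
    """Return one summary row per run, adding the average speed in km/h."""
    rows = []
    for run in activities:
        speed = run["distance_km"] / (run["duration_min"] / 60.0)
        rows.append({
            "date": run["date"],
            "distance_km": run["distance_km"],
            "duration_min": run["duration_min"],
            "avg_speed_kmh": round(speed, 2),
            "avg_heartrate": run["avg_heartrate"],
        })
    return rows

def to_csv(rows):
    """Serialise the summary rows to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()

summary_rows = summarise(json.loads(activities_json))
summary_csv = to_csv(summary_rows)
print(summary_csv)
```

Keeping the raw export and the derived summary as separate files makes it possible to regenerate the summary later if the calculation changes.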
DMP components: Sequence data
What:
• Data from four 454-sequencing runs
• Expected maximum file size: 4 Gb raw data/run
• Output format: Standard Flowgram Format (SFF)
Info:
• Information in ReadMe-file:
– Project details & background
– Information on sample origin & selection
– Sequencing methodology
Standards:
• Raw data in: Standard Flowgram Format (SFF)
• FASTA format for analysis and storage
• Metadata included in FASTA format
Storage:
• Local storage on removable HD
• Back-up through cloud storage
Sharing:
• Raw data in NCBI Sequence Read Archive
• Annotated FASTA data in EMBL/GenBank
• Data associated with publications in Datadryad.org
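One way to read "metadata included in FASTA format" is to carry a few key=value pairs in the header line of each record. The sketch below assumes that convention. The identifier, the metadata keys and the placeholder sequence are all hypothetical, and values are assumed to contain no spaces.

```python
def fasta_record(seq_id, sequence, metadata, width=60):
    """Format one FASTA record with key=value metadata in the header line."""
    description = " ".join(f"{key}={value}" for key, value in metadata.items())
    header = f">{seq_id} {description}".rstrip()
    # Wrap the sequence into fixed-width lines, as is customary for FASTA.
    chunks = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return header + "\n" + "\n".join(chunks) + "\n"

record = fasta_record(
    "read_0001",                                   # hypothetical read identifier
    "ACGT" * 20,                                   # placeholder sequence, 80 bases
    {"sample": "lake_A", "run": "1", "method": "454"},
)
print(record)
```

Repositories tend to have their own requirements for header content, so a convention like this is best treated as project-internal documentation.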
DMP components: Monitoring data
What:
• Data from monthly water quality monitoring at 100 stations, 15 parameters recorded in spreadsheet (data volume 194 kb in xlsx, 62 kb in csv)
Info:
• Details of individual sampling events recorded in spreadsheet
• Sampling & analysis protocol recorded in field manual
Standards:
• Spreadsheet data imported into SQL database with export queries for:
– Occurrence data in csv using fields from Darwin Core standard
– Metadata in Ecological Metadata Language (EML)
Storage:
• Local storage: Raw spreadsheet data on field operators’ device
• Database on institute servers + local back-ups
Sharing:
• Occurrence data through GBIF on national IPT node
• Data associated with publications in Datadryad.org
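An export query of the kind mentioned above could look roughly like the following sketch, which aliases local column names to Darwin Core term names and writes the result as CSV. The table layout, column names and sample rows are assumptions made for illustration, and an in-memory SQLite database stands in for the institute's SQL database.

```python
import csv
import io
import sqlite3

# Hypothetical local schema; the real database layout is not given in the slides.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE observation ("
    "obs_id TEXT, station TEXT, sampled_on TEXT, lat REAL, lon REAL, taxon TEXT)"
)
conn.executemany(
    "INSERT INTO observation VALUES (?, ?, ?, ?, ?, ?)",
    [
        ("obs-1", "ST001", "2016-03-01", 50.85, 4.35, "Daphnia magna"),
        ("obs-2", "ST002", "2016-03-01", 51.05, 3.72, "Asellus aquaticus"),
    ],
)

# Export query: alias local columns to Darwin Core term names.
EXPORT_QUERY = """
SELECT obs_id             AS occurrenceID,
       'HumanObservation' AS basisOfRecord,
       sampled_on         AS eventDate,
       station            AS locationID,
       lat                AS decimalLatitude,
       lon                AS decimalLongitude,
       taxon              AS scientificName
FROM observation
ORDER BY obs_id
"""

def export_csv(connection, query):
    """Run the query and return its result as CSV text with a header row."""
    cursor = connection.execute(query)
    header = [column[0] for column in cursor.description]
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(header)
    writer.writerows(cursor.fetchall())
    return buffer.getvalue()

occurrence_csv = export_csv(conn, EXPORT_QUERY)
print(occurrence_csv)
```

Keeping the mapping inside the export query leaves the working database free to use whatever column names suit the field operators.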
Funders
Typical components of a DMP for NSF-Bio
Components: What, Info, Standards, Sharing (media and methods; policies and public access), Storage & Who
1. Describe data, metadata, formats, standards
2. Physical/cyber resources and facilities
3. Media and dissemination methods
4. Policies for data sharing and public access
5. Roles and responsibilities
H2020 – Pilot action
Pilot action on open access to research data – Research Data generated by the project:
• deposit in a research data repository and take measures to make it possible for third parties to access, mine, exploit, reproduce and disseminate, free of charge for any user, the following:
– the data, including associated metadata, needed to validate the results presented in scientific publications as soon as possible;
– other data, including associated metadata, as specified and within the deadlines laid down in the data management plan (see Annex I);
Typical components of a DMP for H2020
What, Info, Standards, Sharing, Storage
1. What types of data will the project generate/collect?
2. What standards will be used?
3. How will this data be exploited and/or shared/made accessible for verification and re-use? If data cannot be made available, explain why.
4. How will this data be curated and preserved?
H2020 DMP example: AQUACROSS
Knowledge, Assessment, and Management for AQUAtic Biodiversity and Ecosystem Services aCROSS EU policies - aquacross.eu:
• 16 partners
• 8 case studies (over 4 WPs)
• dataset survey
• … case studies to be worked out, datasets to be used only partially known
Organisation of Data Management Plan
• Overall DMP distinguishes categories of data (but has to remain general)
• Allows for more specific DMP at case study level
• Living document, updating at regular intervals
Open data institutes in biodiversity monitoring?
See https://www.inbo.be/en/norms-for-data-use
Key actions:
- All data owned by the institute are covered by open data policy
- Data are released after an embargo period of 12 months
- Focus on raw data (biodiversity, sequence and map data)
- Data associated with scientific papers are opened up
- Data are released under the CC0 license waiver
- Data norms apply
- Data are sufficiently documented through metadata
- All research projects are required to prepare a DMP
- Researchers are required to apply these policies
- Researchers are supported by the institute's data unit
Features specific for Alien Invasive Species data?
• Need to integrate initiatives in regional and national networks (Crall et al. 2010 doi:10.1007/s10530-010-9740-9)
• Need for faster sharing of information (for early warning and rapid response)
• More thorough data validation measures (before taking management action)
• Link observation data with data on management response (common standard for recording this?)
• Concerns about privacy
The Tower of Babel (1563), Kunsthistorisches Museum, Vienna, oil on board - By Pieter Brueghel the Elder (1526/1530–1569) - 1., Public Domain, https://commons.wikimedia.org/w/index.php?curid=11974918
Questions
WISSS ?
Background material
Digital Curation Centre
• http://www.dcc.ac.uk/resources/how-guides/develop-data-plan
Library University of Michigan
• http://www.lib.umich.edu/research-data-services/data-management-planning (and http://www.lib.umich.edu/research-data-services/nsf-data-management-plans#examples_proposals for sample plans)
DataONE
• https://www.dataone.org/best-practices

Tools
DMP Online - https://dmponline.dcc.ac.uk
DMPTool - https://dmptool.org
Others:
• DMP builder - https://dmp.library.ualberta.ca
• DMP editor - http://www.openmetadata.org/site/?page_id=373
• IEDA Data Management Plan tool - http://www.iedadata.org/compliance/plan
What data and analyses are expected?
Describe the data to be collected (actual observations) during your research including amount (if known). Name the type of data, the instrument or collection approach, and how the data will be sampled. If actual data are interpreted, note the interpretation. Describe any quality control measures. Also describe the final derivative products (datasets and software or computer code) and the analysis used including analytical software packages that are required for replication, etc. Describe both data (digital and analog) and physical materials (samples and collections) gathered or generated during the time of the award.
Consider these questions:
• What data will be generated in the research?
• What data types will you be creating or capturing? (e.g. experimental measures, qualitative, raw, processed)
• How will you capture or create the data? (This should cover content selection, instrumentation, technologies and approaches chosen, methods for naming, versioning, meeting user needs, etc, and should be sensitive to the location in which data capture is taking place.)
• If you will be using existing data, state that fact and include where you got it. What is the relationship between the data you are collecting and the existing data?
Source: https://dmptool.org
Info/Standards – Standards, Formats and Metadata
Describe the format of your data; think about what details (metadata) someone else would need to be able to use these files. Describe the structural standards that you will apply in making data and metadata available. For example, for most ecological data, documentation should be structured in Ecological Metadata Language (EML). An example of metadata could also be as simple as a "readme file" to explain variables, structure of the files, etc.
Consider these questions:
• Which file formats will you use for your data and why?
• What form will the metadata describing/documenting your data take?
• How will you create or capture these details?
• Which metadata standards will you use and why have you chosen them? (e.g. accepted domain-local standards, widespread usage)
• What contextual details (metadata) are needed to make the data you capture or collect meaningful?
Source: https://dmptool.org
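As an illustration of the "readme file" idea, the sketch below generates a plain-text README that lists each variable of a data file with its type, unit and meaning. The file name, variable names and layout are assumptions chosen for the example, not a prescribed standard.

```python
# Hypothetical variable descriptions for a small tabular dataset:
# (name, type, unit, description); the unit may be left empty.
VARIABLES = [
    ("station", "text", "", "Identifier of the sampling station"),
    ("sampled_on", "date", "", "Date of the sampling event (YYYY-MM-DD)"),
    ("temperature", "number", "degrees Celsius", "Water temperature"),
]

def build_readme(title, data_file, variables):
    """Return plain-text README content describing the file and its variables."""
    lines = [title, "=" * len(title), "", f"Data file: {data_file}", "", "Variables:"]
    for name, kind, unit, description in variables:
        unit_part = f", unit: {unit}" if unit else ""
        lines.append(f"- {name} ({kind}{unit_part}): {description}")
    return "\n".join(lines) + "\n"

readme_text = build_readme("Water quality monitoring", "monitoring.csv", VARIABLES)
print(readme_text)
```

Generating the README from a single list of variables keeps the documentation in step with the data file when columns are added or renamed.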
Who – Roles and responsibilities
Explain how the responsibilities regarding the management of your data will be delegated. This should include time allocations, project management of technical aspects, training requirements, and contributions of non-project staff - individuals should be named where possible. Remember that those responsible for long-term decisions about your data will likely be the custodians of the repository/archive you choose to store your data. While the costs associated with your research (and the results of your research) must be specified in the Budget Justification portion of the proposal, you may want to reiterate who will be responsible for funding the management of your data.
Consider the following:
• Outline the staff/organizational roles and responsibilities for implementing this data management plan.
• Who will be responsible for data management and for monitoring the data management plan?
• How will adherence to this data management plan be checked or demonstrated?
• What process is in place for transferring responsibility for the data?
• Who will have responsibility over time for decisions about the data once the original personnel are no longer available?
Source: https://dmptool.org
Sharing – Dissemination methods
Describe how and where you will make these data and metadata available to the community. Remember BIO is committed to timely and rapid data distribution; make sure you address how soon your data will be available. Indicate what data will be made available and preserved. Will data be accessible on a web page, by email request, via open-access repository, etc.?
Consider these questions:
• What data will be made available from the study and preserved for the long-term?
• How and when will you make the data available? (Include resources needed to make the data available: equipment, systems, expertise, etc.)
• What transformations will be necessary to prepare data for preservation / data sharing?
• What metadata / documentation will be submitted alongside the data or created on deposit/transformation in order to make the data reusable?
• What related information will be deposited?
• What is the process for gaining access to the data?
• How long will the original data collector/creator/principal investigator retain the right to use the data before opening it up to wider use?
• Explain details of any embargo periods for political/commercial/patent or publisher reasons.
Source: https://dmptool.org
Sharing – Policies for Data Sharing and Public Access
Describe the policies under which these data will be made available. It is very important (it is the reason a DMP is required) that you specify how you will share your data with non-group members after the project is completed. If the data is of a sensitive nature (privacy or ecological endangerment concerns, for instance) and public access is inappropriate, address here the means by which granular control and access will be achieved (e.g. formal consent agreements, anonymized data, only available within a secure network, etc.).
Consider these questions:
• Will any permission restrictions need to be placed on the data?
• Are there ethical and privacy issues? If so, how will these be resolved?
• What have you done to comply with your obligations in your IRB Protocol?
• Who will hold the intellectual property rights to the data and how might this affect data access?
• What and who are the intended or foreseeable uses/users of the data?
• Do you plan on publishing findings which rely on the data? If so, do your prospective publishers place any restrictions on other avenues of publication?
Source: https://dmptool.org
Storage – Archiving, Storage and Preservation
Consider which data (or research products) will be deposited for long-term access and where. (What physical and/or cyber resources and facilities (including third party resources) will be used to store and preserve the data after the grant ends?) Describe your long-term strategy for storing, archiving and preserving the data you will generate or use.
Consider the following:
• What is the long-term strategy for maintaining, curating and archiving the data?
• Which archive/repository/database have you identified as a place to deposit data?
• What procedures does your intended long-term data storage facility have in place for preservation and backup?
• How long will/should data be kept beyond the life of the project?
• What data will be preserved for the long-term?
• On what basis will data be selected for long-term preservation?
• What metadata/documentation will be submitted alongside the data or created on deposit/transformation in order to make the data reusable?
• What related information will be deposited?
Source: https://dmptool.org
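One simple procedure that can support preservation and backup is a checksum manifest: record a checksum per file when the data are archived, then recompute later to detect files that have changed or gone missing. The sketch below is an illustration under that assumption, not something the slides prescribe. The file name and contents are hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path

def sha256_of(path, chunk_size=65536):
    """Compute the SHA-256 checksum of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def build_manifest(directory):
    """Map each file's path (relative to the directory) to its checksum."""
    root = Path(directory)
    return {
        str(path.relative_to(root)): sha256_of(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }

def changed_files(old_manifest, new_manifest):
    """Return paths from the old manifest that are missing or differ now."""
    return sorted(
        name for name, checksum in old_manifest.items()
        if new_manifest.get(name) != checksum
    )

with tempfile.TemporaryDirectory() as tmp:
    data_file = Path(tmp) / "monitoring.csv"        # hypothetical data file
    data_file.write_text("station,value\nST001,7.2\n")
    before = build_manifest(tmp)
    data_file.write_text("station,value\nST001,7.3\n")  # simulate an unnoticed change
    after = build_manifest(tmp)
    damaged = changed_files(before, after)
    print(damaged)
```

Storing the manifest alongside the deposited data lets anyone who later retrieves a copy check that it matches what was archived.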