Digital Curation 101 - Taster

DESCRIPTION

This presentation introduced participants to the DC 101 course and was given at the Digital Curation and Preservation Outreach and Capacity Building Workshop in Belfast on September 14-15 2009. http://www.dcc.ac.uk/events/workshops/digital-curation-and-preservation-outreach-and-capacity-building-workshop

Page 1: Digital Curation 101 - Taster

Because good research needs good data

Digital Curation 101 “Taster”: Belfast 14-15 September 2009

Digital Curation 101 “Taster”

Joy Davidson, Associate Director, DCC: [email protected]
Sarah Higgins, Standards Advisor, DCC: [email protected]

Funded by:

This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 2.5 UK: Scotland License. To view a copy of this license, (a) visit http://creativecommons.org/licenses/by-nc-sa/2.5/scotland/; or (b) send a letter to Creative Commons, 543 Howard Street, 5th Floor, San Francisco, California, 94105, USA.

Page 2: Digital Curation 101 - Taster

DC 101 aims and objectives

Data management and curation are becoming increasingly integral to successful research or digitisation bids. Using the context of beginning a new research bid, this short course aims to introduce participants to the DCC Curation Lifecycle Model as a means of contextualising the range and nature of roles and activities required to maintain access to data over time. While the DCC Curation Lifecycle Model is sequential, it is flexible: users can start at any point in the model, or begin to address issues that have had lower priority, depending on their current needs.

Ultimately, tools and approaches will evolve over time, but if participants understand the bigger picture they will be in a better position to make critical decisions that best reflect their individual needs. The course will introduce participants to some of the tools and approaches and provide them with pointers to further information and support.

The course is aimed at researchers, content creators and those who support them. We hope that participants leave the course equipped to explain why data curation is important and what roles they have to play in the process.

Page 3: Digital Curation 101 - Taster

What is curation?

Data have importance as the evidential base for scholarly conclusions, and for the validation of those conclusions, a basic tenet of which is reproducibility.

Curation is the active management and appraisal of data over the lifecycle of scholarly and scientific interest; it is the key to reproducibility and reuse. This adds value through the provision of context and linkage: placing emphasis on 'publishing' data in ways that ease reuse, with implications for metadata and interoperability.

Data curation is part of good research and content management practice.

Page 4: Digital Curation 101 - Taster

Why Curate?

Curation brings immediate and longer-term benefits:

• Access to reliable, working data – both for the creator and users

• Compliance with funding body and research council mandates on data sharing, management and access

• Independent validation of research findings

• Reliable lab and field electronic notebooks through trustworthy capture

• Large amounts of data can be developed and analysed across different locations by maintaining consistency in working practices and interpretations

• Relationship management between different versions of dynamic or evolving datasets is easier

• Facilitated linkage with related research and between primary, secondary and tertiary data

• Knowledge and data originating from short-term research projects do not become obsolete or inaccessible when funding expires

• Innovative combinations of data sets become possible, e.g. historic biodiversity data combined with GIS data can be used to investigate trends in ecosystem development.

Page 5: Digital Curation 101 - Taster

Lifecycle approach to curation

• digital materials are fragile and susceptible to change from technological advances from creation onwards

• activities (or lack of) at each lifecycle stage influence ability to manage and preserve materials in subsequent stages

• reliable re-use of digital materials is only possible if materials are curated in such a way that their authenticity and integrity are retained

• requires significant input and buy-in from the range of stakeholders – creators, curators, IT staff, management

• helps maximise initial investment made in creating or gathering data

• supports verification of provenance

• facilitates continuity of service

From: Pennock, Maureen, Digital Curation: A Life-Cycle Approach to Managing and Preserving Usable Digital Information (2007)

Page 6: Digital Curation 101 - Taster

The DCC Curation Lifecycle Model

Provides a graphical high level overview of the stages required for successful curation and preservation of data.

It can be used to plan activities within an organisation or consortium to ensure all necessary stages are undertaken, each in the correct sequence.

1. Full Lifecycle Actions
2. Sequential Actions
3. Occasional Actions

http://www.dcc.ac.uk/lifecycle-model/

Page 7: Digital Curation 101 - Taster

Researchers and content creators tend to focus on:

• conceptualise

• create or receive

• ingest

• store

• access, use and reuse

• data

• description

• community watch and participation

Page 8: Digital Curation 101 - Taster

Researchers and content creators tend to focus less on:

• appraise and select

• dispose

• preservation action

• transform

• representation information

• preservation planning

• curate and preserve

• migrate

• reappraise

Page 9: Digital Curation 101 - Taster

Conceptualise

“Conceive and plan the creation of data, including capture method and storage options.”

Researchers:

• define a research question

• begin to design the experiment

• seek funding

• conceive and plan the creation of data

• consider capture methods and storage options

• identify research collaborators

• identify potential subjects

Roles: researcher, funding bodies, publishers, IT department, ethics panel

Plan with digital curation in mind! Decisions made at the Conceptualise stage impact on every other stage of the lifecycle.

Page 10: Digital Curation 101 - Taster

Specific issues to consider for the Conceptualise stage:

• Research design and workflows – what do you want to do?

• What storage needs do you anticipate? Does your institution have the capacity for this? Will you keep raw or derived data or both?

• Will you make use of any existing data? Will you need to obtain rights to use it?

• Do you want your data to interoperate with other datasets? If so, how will you ensure that this is possible?

• What are the funder’s requirements regarding curation and preservation? Will they pay for curation activity?

• Will the research involve any legal restrictions on the use of and access to the data?

• Are there any data protection issues that will require data cleaning before the data can be accessed and used?

• Do you require ethical approval from your institution or funder? Will this have any impact on the data’s potential use and reuse?

• Do you need to calibrate data capture devices? Will this need to occur at multiple sites?

• Will the data be released under Creative Commons or Science Commons licenses?

• Are there likely to be any embargoes on data publication?

Page 11: Digital Curation 101 - Taster

Create or Receive

“Create data including administrative, descriptive, structural and technical metadata. Preservation metadata may also be added at the time of creation.” OR “Receive data, in accordance with documented collecting policies, from data creators, other archives, repositories or data centres, and if required assign appropriate metadata.”

Roles: researchers, information specialists, technical support

Ensure data are curation ready!

• Be careful - data may be irreplaceable.

• Capture context for long-term reuse and comprehensibility.

• Clearly identify IPR at an early stage. This can become murky later in the process.

Page 12: Digital Curation 101 - Taster

Specific issues to consider for the Create or Receive stage:

• What do you want people to be able to do with the data you are generating?

• What do you not want people to be able to do with the data?

• Are there any variations between data capture tools located at different sites? How will you ensure that these are recorded/addressed? Consistency of testing and data acquisition is crucial.

• Will you be adhering to any content, syntax, and structure standards? Are these easily available for use by everyone on the project team?

• Who will have rights over any collaboratively generated data (e.g., databases)?

• Who will record contextual metadata, and how?

• What level of data quality do you need to achieve? How will you ensure this level is achieved across all partners?

• Will you make use of any ontologies to facilitate data integration?

• Will you make use of any data collection policies?

• How will you handle file naming and version control?

• Do you have access to training and support for any/all of the above?
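One way to approach the file naming and version control question is to agree a naming pattern across the project and check names against it automatically. The sketch below is only an illustration: the pattern `<project>_<description>_<YYYYMMDD>_v<NN>.<ext>` and the function names are assumptions made for this example, not a DCC recommendation.

```python
import re
from datetime import date

# Hypothetical convention: <project>_<description>_<YYYYMMDD>_v<NN>.<ext>
PATTERN = re.compile(
    r"^(?P<project>[a-z0-9]+)_(?P<description>[a-z0-9-]+)_"
    r"(?P<date>\d{8})_v(?P<version>\d{2})\.(?P<ext>[a-z0-9]+)$"
)


def build_name(project, description, day, version, ext):
    """Build a file name that follows the hypothetical convention."""
    # Lower-case the description and collapse anything else into hyphens.
    slug = re.sub(r"[^a-z0-9]+", "-", description.lower()).strip("-")
    return f"{project.lower()}_{slug}_{day:%Y%m%d}_v{version:02d}.{ext.lower()}"


def parse_name(name):
    """Return the parts of a conforming name, or None if it does not conform."""
    match = PATTERN.match(name)
    return match.groupdict() if match else None


def next_version(name):
    """Return the same name with the version number incremented by one."""
    parts = parse_name(name)
    if parts is None:
        raise ValueError(f"not a conforming file name: {name}")
    version = int(parts["version"]) + 1
    return (f"{parts['project']}_{parts['description']}_"
            f"{parts['date']}_v{version:02d}.{parts['ext']}")
```

A project team could run `parse_name` over a shared folder to flag files that drift from the agreed pattern. Whichever convention is chosen matters less than documenting it and applying it consistently across all partners.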

Page 13: Digital Curation 101 - Taster

Ingest and Store

“Transfer data to an archive, repository, data centre or other custodian. Adhere to documented guidance, policies or legal requirements. Store the data in a secure manner adhering to relevant standards.”

Data is transferred to a curation environment such as an institutional repository or a subject-based repository.

Roles: information specialists, repository managers, researchers

Prepare data for long-term storage, access and continuity! Storage may be a dedicated data repository or a folder on a shared drive, but must be considered, secure and adhere to relevant standards.

Page 14: Digital Curation 101 - Taster

Specific issues to consider for the Ingest and Store stages:

• Does the data have sufficient metadata? If more is required, who will be responsible for providing it?

• Will the data require additional cleaning before it can be ingested into the repository?

• Will frequent access to the data be required? If so, this could affect the storage choices.

• What level of responsibility does the repository indicate it will take on with regards to stewardship?

• Does the repository accept your data formats? If not, will there be any normalisation processes that may occur with the deposit of non-preferred formats?

• Does the repository outsource any of its activity? Could this have an impact on your data?

• Does the repository have sufficient resources and policies in place?

• Once ingest is complete, is there a formal acknowledgement that the transfer of custody has occurred?

Page 15: Digital Curation 101 - Taster

Access, Use and Reuse

“Ensure that data is accessible to both designated users and reusers, on a day-to-day basis. This may be in the form of publicly available published information. Robust access controls and authentication procedures may be applicable.”

Roles: repository managers, researchers

Ensure access and continuity!

Page 16: Digital Curation 101 - Taster

Specific issues to consider for the Access, Use and Reuse stage:

• Are the intended users of the data able to access it and make use of it? i.e., are they able to use the data in the way that you originally intended them to use it? What about non-intended users?

• Are there any restrictions on access and reuse? Ensure that these are communicated to the repository staff.

• Researchers should work with repository managers to develop suitable access policies and terms for use of the data

• If you are planning on making your data freely accessible for reuse, have you supplied enough context to enable its reliable reuse?

• Are there adequate finding aids to help locate and retrieve your data within the repository?

• Is the data practically interoperable with other datasets? Does it need to be?

Page 17: Digital Curation 101 - Taster

Appraise and Select

“Evaluate data and select for long-term curation and preservation. Adhere to documented guidance, policies or legal requirements.”

Researchers and content creators, along with information specialists, use quality checks to identify and evaluate data for long-term curation. Selected data:

• must be legal, appropriate, and valuable

• may include data objects, metadata, and contextual information.

Roles: researchers, information specialists, funding bodies

Develop robust policies! The ‘keep everything’ approach quickly becomes unviable. As the volume of curated data increases, efficient search and retrieval becomes more difficult.

Page 18: Digital Curation 101 - Taster

Specific issues to consider for the Appraise and Select stage:

• Does the data meet the data quality metrics identified by both the researchers and the archive? Who will be responsible for the final decision? Can errors in the data remain undetected at this stage, and cause problems at later stages?

• Has enough contextual information been collected to make an informed decision about which data to keep?

• What is the minimum you need to keep for your data findings and publications to be supported over time?

• Are there any data that you, by law, are not allowed to keep? How will it be destroyed and what evidence will you be able to provide to support this if necessary?

• Do you have any schedule for re-appraisal over time?

• Do you have access to expertise in your project staff or at your institution to assist with selection and appraisal?

• Your initial bid is a good place to start as you’ll have clearly indicated what outputs you planned to produce.

• Does your selection and appraisal fit in with your funding body requirements? What do they expect you to keep and where does it need to be kept?

Page 19: Digital Curation 101 - Taster

Preservation action

“Undertake actions to ensure long-term preservation and retention of the authoritative nature of data. Preservation actions should ensure that data remains authentic, reliable and usable while maintaining its integrity. Actions include data cleaning, validation, assigning preservation metadata, assigning representation information and ensuring acceptable data structures or file formats.”

Roles: information specialists, preservation practitioners, repository managers

Community Watch activities can be very helpful at this stage to identify imminent risks to data.
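The integrity requirement in the definition above is often checked in practice by recording a checksum for each file and re-computing it later (a "fixity check"). The sketch below is a minimal illustration under stated assumptions: the choice of SHA-256, the manifest layout and the function names are assumptions for this example and are not prescribed by the DCC Curation Lifecycle Model.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def make_manifest(folder):
    """Map each file's path (relative to folder) to its checksum."""
    root = Path(folder)
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def check_fixity(folder, manifest):
    """Compare the folder's current state with a previously stored manifest."""
    current = make_manifest(folder)
    return {
        # Files whose content no longer matches the recorded checksum.
        "changed": sorted(
            n for n in manifest if n in current and current[n] != manifest[n]
        ),
        # Files recorded in the manifest that are no longer present.
        "missing": sorted(n for n in manifest if n not in current),
        # Files present now that were never recorded.
        "unexpected": sorted(n for n in current if n not in manifest),
    }
```

A manifest would typically be created at ingest and stored separately from the data, then re-checked on a schedule. An empty report only shows that the bits are unchanged. It says nothing about whether the file format is still usable, which is where the other preservation actions come in.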

Page 20: Digital Curation 101 - Taster

Specific issues to consider for the Preservation Action stage:

• Does the repository participate in community watch and ongoing preservation planning activity?

• Does the repository manager know what the significant properties of your data are? If not, some preservation actions can alter the significant properties.

• Are any preservation actions undertaken transparent and documented?

• Does the repository have legal rights to undertake preservation actions at all?

• Does the researcher require notification of any preservation actions that may affect the intended use of the data? If so, have mechanisms been set in place to facilitate this?

• If certain actions are recommended, are they suitable for your data? If not, are repository staff aware of any restrictions?

Page 21: Digital Curation 101 - Taster

Transform

“Create new data from the original, for example: by migration into a different format; or by creating a subset, by selection or query, to create newly derived results, perhaps for publication.”

New data may be generated from the original:

• by format migration;

• through integration with other data;

• by new analyses and techniques applied within or across disciplines

Roles: researchers

New uses for curated data! Derivative data, new visualisations or enhancements feed back into the Conceptualise and Create stages of the lifecycle which then starts anew.

Page 22: Digital Curation 101 - Taster

Specific issues to consider for the Transform stage:

• Metadata aggregation is used to join up with other datasets; this integration of data drives new curation requirements.

• Image normalisation and automated analysis create a variety of new contextual and provenance information.

• If transformations or derivatives are produced (e.g. noise reduction), they must be accompanied by appropriate metadata.

• Use community standards for recording provenance to safeguard against fast changing techniques.

• Does the community have sufficient support in transformation actions?

• Is more value gained from producing new data or from transforming old data in new ways?
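As a rough illustration of the metadata that might accompany a derivative, the sketch below builds a small provenance record for a transformation such as noise reduction. The field names, identifiers and function name are hypothetical and chosen for this example only. In line with the advice above, a real project would map this information onto a community standard (PREMIS is one example) instead of an ad hoc structure.

```python
import json
from datetime import datetime, timezone


def provenance_record(source_id, derived_id, action, agent, parameters=None):
    """Describe how a derived dataset was produced from its source.

    All field names are illustrative; they do not follow a formal schema.
    """
    return {
        "source": source_id,        # identifier of the original data
        "derived": derived_id,      # identifier of the new, derived data
        "action": action,           # what was done, e.g. "noise reduction"
        "agent": agent,             # person or software that did it
        "parameters": dict(parameters or {}),
        "recorded_at": datetime.now(timezone.utc).isoformat(),
    }


if __name__ == "__main__":
    record = provenance_record(
        source_id="scan-0001-raw",
        derived_id="scan-0001-denoised",
        action="noise reduction",
        agent="hypothetical-image-tool 1.0",
        parameters={"filter": "median", "kernel_size": 3},
    )
    # Store the record alongside the derived file so the two travel together.
    print(json.dumps(record, indent=2))
```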

Page 23: Digital Curation 101 - Taster

More information on all these stages is in the workshop packs!

[email protected]

Page 24: Digital Curation 101 - Taster

Tools and resources to help with the DCC Curation Lifecycle stages:

Conceptualise

DCC Policy Pages: http://www.dcc.ac.uk/resource/curation-policies
Check our handy table as a starting point to make sure you are aware of any curation related requirements for your particular funding body. If your funding body is not in our table, please get in touch with us so that we can add their policy details.

DCC Helpdesk: http://www.dcc.ac.uk/helpdesk
If you need further assistance at this stage, please don’t hesitate to drop us a line via our helpdesk and we’ll make every effort to support your curation activity.

Page 25: Digital Curation 101 - Taster

Tools and resources to help with the DCC Curation Lifecycle stages:

Create or Receive

DCC DIFFUSE
You might wish to consult the DCC DIFFUSE database for standards frameworks related to your area of research. We strongly encourage the contribution of standards frameworks for specific domains from our user community to help ensure that this is a community-driven resource. http://www.dcc.ac.uk/diffuse/

DCC Technology and Standards Watch papers: http://www.dcc.ac.uk/resource

AHDS advice on creating digital resources: http://www.ahds.ac.uk/creating/index.htm/

Page 26: Digital Curation 101 - Taster

Tools and resources to help with the DCC Curation Lifecycle stages:

Ingest and Store

AHDS recommended stable formats for different types of data: http://www.ahds.ac.uk/depositing/deposit-formats.htm

Access, Use, Re-use

DCC Resource Centre: http://www.dcc.ac.uk/resource

DCC Helpdesk: http://www.dcc.ac.uk/helpdesk

DCC Legal Blog: http://dccblawg.blogspot.com/

DCC Briefing Papers (particularly Data Protection): http://www.dcc.ac.uk/resource/briefing-papers/

Page 27: Digital Curation 101 - Taster

Tools and resources to help with the DCC Curation Lifecycle stages:

Appraise and Select

Data Audit Framework tool: http://www.data-audit.eu/

DCC Briefing Paper and Curation Manual chapter on Appraisal and Selection: http://www.dcc.ac.uk/resource/briefing-papers/; http://www.dcc.ac.uk/resource/curation-manual/chapters/

US Geological Survey selection and appraisal toolkit

Page 28: Digital Curation 101 - Taster

Tools and resources to help with the DCC Curation Lifecycle stages:

Preservation Action

Dr. Manfred Thaller’s Fileshooter tool: Good for assessing file format robustness using your own success metrics. http://github.com/mcarden/shotgun/blob/39761fdd190faa47e9be09901782cda6d9f4f687/shotGun.h

PLANETS Testbed and Methodology: http://www.planets-project.eu/

DCC Curation Manual: http://www.dcc.ac.uk/resource/curation-manual/chapters/

Transform

DCC Briefing Papers (particularly Interoperability): http://www.dcc.ac.uk/resource/briefing-papers/

Page 29: Digital Curation 101 - Taster

CHECKLISTS: Conceptualise

Get into the habit of equating data curation with good research.

Know what your funding body expects you to do with your data and for how long. Assess your ability to meet these expectations (i.e., do you need additional funding or staff?).

Determine intellectual property rights from the outset and ensure they are documented.

Identify any anticipated publication requirements (embargoes, restrictions on publishing over multiple sites).

Identify and document specific roles and responsibilities as early as possible.

Page 30: Digital Curation 101 - Taster

CHECKLISTS: Create and/or Receive

Know who you are creating your data for and what you want them to be able to do (and not do) with it. Communicate this with others on the project

Identify any data protection requirements that you need to address in the course of your research and ensure that these are communicated to all staff.

Agree from an early stage any standards you will be making use of for content, syntax, and structure. Once these have been agreed, make sure they are communicated - both to other researchers on the project and to the data/information managers you will be working with. Provide training if necessary.

Identify data quality metrics as soon as possible and ensure that these are communicated and monitored.

Work together - researchers and information managers need to communicate regularly. Neither can do their job in isolation.

Be realistic – strike a balance between what is sufficient and what is ideal based on your practical realities.

Page 31: Digital Curation 101 - Taster

CHECKLISTS: Ingest and Store

Archival standards such as ISAD(G) can be useful for hierarchical data description, so talk to information managers at your institution for advice.

Make sure you know about any repository policies that might affect your deposit for long-term storage (i.e., what will they accept, are there preferred formats or normalisation processes that may affect the usability of your data, do they only accept cleaned data vs. original data?)

Remember - ingest does not necessarily need to mean deposit in a data centre or repository but rather moving to a ‘curated’ environment – could be as simple as a specific folder on a shared drive.

Make the ‘ingest’ process as straight-forward as possible and provide support and guidance wherever you can; automate processes if you can.

Decide who is responsible for final aspects of data quality assurance at the point of deposit (researcher, archive, information manager, etc.). Ensure that this final point of QA is communicated to all stakeholders.

Data quality is not absolute. Quality must be assessed by fitness for purpose. So, ‘high quality’ data for one user group may be completely unsuitable for another user group.

Part of data quality is data cleaning. But, data cleaning can produce less true data so you’ll need to decide what level of cleaning you require.

If you are a repository manager, ensure that you provide a receipt (if possible) or at least acknowledge receipt for closure and ‘official’ transfer of stewardship

Page 32: Digital Curation 101 - Taster

CHECKLISTS: Access and Reuse

Know what you want users to be able to do with your data and for how long.

Pin down and communicate the significant properties of your data.

Ensure that any restrictions on access and use are communicated and respected.

Ensure that you provide enough context to ensure that your data can be located and used – either by the originally designated user community or new users over time.

Ensure you clearly articulate any citation requirements and usage statistics that you require at the point of ingest so that repository managers know how your data should be cited if it is reused.

Page 33: Digital Curation 101 - Taster

CHECKLISTS: Appraise and Select (1)

Make a start on selection and appraisal from as early a point as possible (e.g., apply the new NERC criteria for identifying valuable data sets at the project plan stage).

Plan for what you think you’ll need to keep to support your research findings. What is the minimum you’ll need to support your findings over time?

Know who you are keeping the data for and what you want them to be able to do with it. This may affect the way you keep it and what you keep.

Conversely, know what you need to dispose of. Destruction is often vital to ensure compliance with legal requirements.

Ensure that your data meets minimum quality assurance metrics (based on intended use).

Re-appraisal can take place before ingest so review what you have and what you need to keep before depositing it to long-term storage.

Work with researchers and information managers to develop policies and to identify realistic and implementable workflows

Page 34: Digital Curation 101 - Taster

CHECKLISTS: Appraise and Select (2)

Appraise for the here and now but with an eye to the future.

Think about costs (financial, staff, and environmental). These will affect your selection and appraisal decisions.

Look for relationships with other data sets in your archive/repository as part of the appraisal process (i.e., does the dataset augment another collection significantly?). Some funders stipulate that you must identify whether the data exists already. This process might highlight additional datasets that your new research might augment significantly.

Page 35: Digital Curation 101 - Taster

CHECKLISTS: Preservation Action

Know what you want people to be able to do with your data – this will impact many aspects (formats selected for long term storage, compression, etc…)

Pin down the significant properties of your data and communicate them – make sure that the people carrying out preservation actions know what they are. This might be through metadata or other means.

Don’t be afraid to be critical when reviewing ‘best practice’ and recommended approaches. They might work for the specific scenario for which they were created but not for you. Do you know the criteria used to rate things like ‘preferred’ formats?

Document preservation actions so that people know what has been done to the data over time.

Once you’ve gone through the exercise of producing a sound data management plan, you’ll be able to reuse many aspects of it – so each project data management plan will not need the same level of effort to complete.