Workflows for Digital Curation and Preservation Stacy Kowalczyk PASIG Dublin 2012 October 17, 2012


Page 1: Workflows for Digital Curation and Preservation

Workflows for Digital Curation and Preservation

Stacy Kowalczyk
PASIG Dublin 2012
October 17, 2012

Page 2: Workflows for Digital Curation and Preservation

Topics

• Goals
• A Very Brief Introduction to Workflow Systems
• Components for Curation
• Workflow Scenarios
• Future Work

Page 3: Workflows for Digital Curation and Preservation

Workflows for Curation

• Goals
  – Increase capacity and scalability of curation efforts
  – Develop distributed curation processes
  – Lower costs of curation activities
  – Improve quality with systematic and repeatable processes
  – Reduce human errors

Page 4: Workflows for Digital Curation and Preservation

Why Workflow Systems

• Simplifies repetitive and mundane activities
• Facilitates and enforces best practices
• Enables efficient scheduling
• Provides machinery for coordinating the execution of services and linking together resources
• Facilitates outreach to researchers for direct deposit and automatic curation

Page 5: Workflows for Digital Curation and Preservation

Types of Workflow Systems

• Kepler
• BPEL
• Ptolemy II
• Triana
• Taverna

Page 6: Workflows for Digital Curation and Preservation

Trident

• Open source project
• Based on Microsoft's Windows Workflow Foundation classes
• Supported by Microsoft Research and academic researchers
• Integrates with myExperiment
• Well accepted in the research community
  – Well over 100 peer-reviewed papers and white papers were discovered from one scholarly aggregation service
• Graphical workflow design and execution interface

Page 7: Workflows for Digital Curation and Preservation

Trident Workflow Components

• Fixity
• Data Integrity
• Metadata Creation
• Format Normalization and Derivative Generation
• Persistent Identification
• Repository Integration

Page 8: Workflows for Digital Curation and Preservation

Fixity Components

• MD5 checksum generator

• MD5 checksum validator
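
The two fixity components can be illustrated with a minimal Python sketch. This is not the Trident component code, which the slides do not show; the function names `md5_checksum` and `validate_checksum` are hypothetical, and only the standard `hashlib` module is used.

```python
import hashlib


def md5_checksum(path, chunk_size=1 << 20):
    """Generator step: return the hex MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large master files are never held in memory whole.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def validate_checksum(path, expected_hex):
    """Validator step: True when the file still matches the recorded digest."""
    return md5_checksum(path) == expected_hex.strip().lower()
```

A generator run at ingest would store the digest next to the file; the validator can then be rerun at any later point to detect silent change.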

Page 9: Workflows for Digital Curation and Preservation

Data Integrity Components

• JHOVE for format verification and validation

• Group validation (for object integrity)
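
The slides do not describe how group validation is implemented. One plausible reading, sketched below as an assumption, is a completeness check: compare the files actually present for an object against the list of files the object is expected to contain. The function name `validate_group` and the shape of its report are hypothetical.

```python
from pathlib import Path


def validate_group(directory, expected_names):
    """Compare the files present in a directory against an expected file list."""
    present = {p.name for p in Path(directory).iterdir() if p.is_file()}
    expected = set(expected_names)
    return {
        "missing": sorted(expected - present),      # listed but not found
        "unexpected": sorted(present - expected),   # found but not listed
        "complete": expected <= present,            # every listed file exists
    }
```

An object would pass the integrity step only when `complete` is true; the `missing` and `unexpected` lists give a curator something concrete to act on.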

Page 10: Workflows for Digital Curation and Preservation

Metadata Creation Components

• MIX data generator and validator
• METS data generator and validator

Page 11: Workflows for Digital Curation and Preservation

Format Components

• Format conversions for normalization and derivative generation
  – .xlsx to .csv
  – .docx to .pdf
  – .ppt to .pdf
  – .tif to .jpg
  – Zipping on demand
  – Image (.tif or .jpg) to .pdf (single document and multipage)
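
Of the conversions listed, zipping on demand is the simplest to show with the standard library alone. The sketch below is an illustration, not the Trident component; `zip_directory` is a hypothetical name, and the archive is assumed to be written outside the directory being compressed.

```python
import zipfile
from pathlib import Path


def zip_directory(source_dir, zip_path):
    """Write every file under source_dir into a compressed zip archive."""
    source = Path(source_dir)
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        # Sorted traversal keeps the entry order stable from run to run.
        for path in sorted(source.rglob("*")):
            if path.is_file():
                # Store paths relative to the source so the archive is relocatable.
                archive.write(path, arcname=path.relative_to(source).as_posix())
    return zip_path
```

The other conversions (office formats to .pdf, .tif to .jpg) depend on external tools or libraries that the slides do not name, so they are not sketched here.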

Page 12: Workflows for Digital Curation and Preservation

Repository Component

• Ingest to DSpace via SWORD

• DOI generator
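
The slides do not say which SWORD version or packaging the ingest component uses. As a rough sketch only, the code below assumes a SWORD 1.3 style deposit of a zipped METS package into DSpace: an HTTP POST to a collection's deposit URL with the `X-Packaging` and `Content-MD5` headers. The function name, URL, and credentials are hypothetical, and the request is built but not sent.

```python
import base64
import hashlib
import urllib.request

# Packaging identifier DSpace uses for zipped METS submission packages (SWORD v1).
DSPACE_METS_SIP = "http://purl.org/net/sword-types/METSDSpaceSIP"


def build_sword_deposit(collection_url, package_bytes, filename, username, password):
    """Build (but do not send) a SWORD v1 style deposit request."""
    token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    headers = {
        "Content-Type": "application/zip",
        "Content-Disposition": f"filename={filename}",
        "X-Packaging": DSPACE_METS_SIP,
        # Assumption: the server expects the hex MD5 digest of the package body.
        "Content-MD5": hashlib.md5(package_bytes).hexdigest(),
        "Authorization": f"Basic {token}",
    }
    return urllib.request.Request(
        collection_url, data=package_bytes, headers=headers, method="POST"
    )
```

Passing the returned object to `urllib.request.urlopen` would perform the deposit; the server's response is an Atom entry describing the new item.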

Page 13: Workflows for Digital Curation and Preservation

Data Ingest Workflows

• Scenarios
  – Single part objects (individual images)
  – Multi-part objects (a book)
  – Multiple instantiations of a logical object (Word, PDF, and PPT of a research paper)
  – Multiple multi-part objects (a group of letters)
  – Research data products (multiple files of various types)

Page 14: Workflows for Digital Curation and Preservation

Single Part Objects

Page 15: Workflows for Digital Curation and Preservation

Single Part Objects Workflow

[Workflow diagram. Steps shown: Derivative Generation; Format Validation and Verification; Fixity Check; Create Tech Metadata; Create Intellectual Metadata; Create Object Metadata; Persistent Identification; Deposit in Repository; Image Quality Checks]

Page 16: Workflows for Digital Curation and Preservation

Single Part Objects Workflow

• For each original image
  – MD5 checksum
  – JHOVE validation and verification report
  – ImageMagick report
  – MIX file
• For each derivative file
  – MD5 checksum
  – DOI
• For each logical object
  – DC record
  – METS record
  – SWORD package
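
The DC record produced for each logical object could look like the output of the following sketch, which serializes simple Dublin Core elements with the standard `xml.etree.ElementTree` module. The wrapper element `metadata` and the function name `dublin_core_record` are assumptions for illustration; the slides do not show the record layout the workflow emits.

```python
import xml.etree.ElementTree as ET

# Namespace of the 15 simple Dublin Core elements.
DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)


def dublin_core_record(fields):
    """Serialize (element, value) pairs, e.g. ("title", "..."), as simple DC XML."""
    root = ET.Element("metadata")  # hypothetical wrapper element
    for name, value in fields:
        child = ET.SubElement(root, f"{{{DC_NS}}}{name}")
        child.text = value
    return ET.tostring(root, encoding="unicode")
```

A list of pairs rather than a dictionary is used so that repeatable elements such as `creator` or `subject` can appear more than once.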

Page 17: Workflows for Digital Curation and Preservation

Multi-part Object Workflow

Page 18: Workflows for Digital Curation and Preservation

Multi-part Object Workflow

• Comic Book
  – RIS
  – Set of .tif files

[Workflow diagram. Steps shown: Create Tech Metadata; Derivative Generation; Format Validation and Verification; Fixity Check; Object Integrity; Create Intellectual Metadata; Create Object Metadata; Persistent Identification; Deposit in Repository; Image Quality Checks]

Page 19: Workflows for Digital Curation and Preservation

Multi-part Object Workflow

• For each individual image file
  – MD5 checksum
  – JHOVE validation and verification report
  – ImageMagick report
  – MIX file
• For each derivative file
  – MD5 checksum
• For the whole object
  – DOI
  – DC record
  – METS record
• SWORD package
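
For a multi-part object, the METS record is what ties the individual page files and their checksums into one ordered whole. The sketch below builds a minimal METS document with a file section and a physical structural map. It is an assumption-laden illustration, not the workflow's generator: `mets_for_pages` is a hypothetical name, descriptive and technical metadata sections are omitted, and the output has not been validated against the METS schema.

```python
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
XLINK_NS = "http://www.w3.org/1999/xlink"
ET.register_namespace("mets", METS_NS)
ET.register_namespace("xlink", XLINK_NS)


def mets_for_pages(object_id, pages):
    """pages: ordered list of (relative_path, md5_hex), one entry per page image."""
    def q(tag):
        return f"{{{METS_NS}}}{tag}"

    root = ET.Element(q("mets"), {"OBJID": object_id})
    file_sec = ET.SubElement(root, q("fileSec"))
    file_grp = ET.SubElement(file_sec, q("fileGrp"), {"USE": "master"})
    struct_map = ET.SubElement(root, q("structMap"), {"TYPE": "physical"})
    top_div = ET.SubElement(struct_map, q("div"), {"TYPE": "object"})
    for order, (href, md5_hex) in enumerate(pages, start=1):
        file_id = f"FILE{order:04d}"
        # File section: one entry per file, carrying its fixity value.
        file_el = ET.SubElement(
            file_grp,
            q("file"),
            {"ID": file_id, "CHECKSUM": md5_hex, "CHECKSUMTYPE": "MD5"},
        )
        ET.SubElement(
            file_el, q("FLocat"), {"LOCTYPE": "URL", f"{{{XLINK_NS}}}href": href}
        )
        # Structural map: one div per page, in reading order, pointing at the file.
        page_div = ET.SubElement(top_div, q("div"), {"TYPE": "page", "ORDER": str(order)})
        ET.SubElement(page_div, q("fptr"), {"FILEID": file_id})
    return ET.tostring(root, encoding="unicode")
```

Recording the page order in the structural map is what lets a repository reassemble the book even if the file names carry no sequence information.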

Page 20: Workflows for Digital Curation and Preservation

Multiple Instantiations of a Logical Object Workflow

Page 21: Workflows for Digital Curation and Preservation

Multiple Instantiations of a Logical Object Workflow

• Papers
  – One logical object per subdirectory
  – RIS, Word file, and (perhaps) supplemental file

[Workflow diagram. Steps shown: Format Normalization; Format Validation and Verification; Fixity Check; Create Intellectual Metadata; Create Object Metadata; Persistent Identification; Deposit in Repository; Derivative Generation]

Page 22: Workflows for Digital Curation and Preservation

Multiple Instantiations of a Logical Object Workflow

• For each original object
  – MD5 checksum
  – JHOVE report
• For each derivative object
  – MD5 checksum
  – Output from normalization process
  – DOI for delivery object
• For the whole package
  – METS file
  – DC record
  – SWORD package

Page 23: Workflows for Digital Curation and Preservation

Multiple Multi-part Object Workflow

Page 24: Workflows for Digital Curation and Preservation

Multiple Multi-part Object Workflow

• Ball collection
  – RIS for collection and inventory spreadsheet
  – Each logical object in separate subdirectory

[Workflow diagram. Steps shown: Create Tech Metadata; Derivative Generation; Format Validation and Verification; Fixity Check; Create Intellectual Metadata; Create Object Metadata; Persistent Identification; Deposit in Repository; Image Quality Checks; Collection Integrity; Create Collection Metadata]
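
The convention of one logical object per subdirectory means a workflow can discover the objects of a collection by walking its top-level folders. The following is a minimal sketch of that discovery step under this assumption; `logical_objects` is a hypothetical name, and files sitting directly in the collection folder (such as the RIS file or inventory spreadsheet) are treated as collection-level and skipped.

```python
from pathlib import Path


def logical_objects(collection_dir):
    """Map each immediate subdirectory name to the sorted relative paths of its files."""
    root = Path(collection_dir)
    objects = {}
    for sub in sorted(p for p in root.iterdir() if p.is_dir()):
        # Everything below a subdirectory belongs to that one logical object.
        objects[sub.name] = sorted(
            f.relative_to(sub).as_posix() for f in sub.rglob("*") if f.is_file()
        )
    return objects
```

Each entry of the returned mapping can then be passed through the per-object steps, with the collection-level steps run once over the whole mapping.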

Page 25: Workflows for Digital Curation and Preservation

Multiple Multi-part Object Workflow

• For each file
  – MD5 checksum
  – JHOVE report
  – MIX file
  – Scanning specifications
  – Derivative files
• For each logical object
  – Derivative object
  – DC record
  – METS file
  – DOIs
• For the whole collection
  – METS file
  – DC record

Page 26: Workflows for Digital Curation and Preservation

Research Data Products

Page 27: Workflows for Digital Curation and Preservation

Research Data Products

• Vortex
  – A subdirectory for each experiment

[Workflow diagram. Steps shown: Compress Data; Fixity Check; Create Intellectual Metadata; Create Object Metadata; Persistent Identification; Deposit in Repository]

Page 28: Workflows for Digital Curation and Preservation

Research Data Products

• Outputs
  – Zipped data file
  – MD5 checksum
  – FGDC metadata record
  – Dublin Core record
  – METS record
  – SWORD package

Page 29: Workflows for Digital Curation and Preservation

Post Deposit Curation Workflow

• Scenarios
  – Fixity verification
  – Format normalization
  – New or additional derivative generation
  – Media migration
  – Persistent identifier updates
  – Metadata updates
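
Post-deposit fixity verification amounts to recomputing checksums and comparing them with the values recorded at ingest. A minimal sketch, assuming the recorded values are available as a mapping from relative path to MD5 hex digest (the manifest format and the name `audit_fixity` are hypothetical):

```python
import hashlib
from pathlib import Path


def audit_fixity(base_dir, manifest):
    """manifest maps relative path -> MD5 hex digest recorded at ingest.

    Returns a mapping of relative path -> "ok", "changed", or "missing".
    """
    results = {}
    for rel_path, recorded in manifest.items():
        target = Path(base_dir) / rel_path
        if not target.is_file():
            results[rel_path] = "missing"
            continue
        # read_bytes keeps the sketch short; very large files would be hashed in chunks.
        actual = hashlib.md5(target.read_bytes()).hexdigest()
        results[rel_path] = "ok" if actual == recorded.strip().lower() else "changed"
    return results
```

Run on a schedule, a check like this reports exactly which files need to be restored from another copy.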

Page 30: Workflows for Digital Curation and Preservation

Future Work

• Adding components
  – EAD from spreadsheet
  – MARC record support
  – PREMIS support
• Testing in the lab
  – Digital library scanning labs
  – Research labs
  – Integrating with a production repository

Page 31: Workflows for Digital Curation and Preservation

Acknowledgements

• This research was made possible through a generous grant by Microsoft Research

• And by the Data to Insight Center of Indiana University’s Pervasive Technology Institute

• Thanks to Kavitha Chandrashankar and Quan Zhou for their help with developing components, workflows, and documentation

Page 32: Workflows for Digital Curation and Preservation

Thank you

[email protected]
d2i.indiana.edu