Workflows for Digital Preservation and Curation Workshop. Open Repositories 2012. Stacy Kowalczyk, Beth Plale, Kavitha Chandrasekar, Yiming Sun.
<p>Workflows for Digital Preservation and Curation Workshop. Open Repositories 2012. Stacy Kowalczyk, Beth Plale, Kavitha Chandrasekar, Yiming Sun. 7/10/12
Slide 2: Agenda. Introduction to Digital Curation; Workflow Systems Overview; Workflows for Digital Curation; Break; Implementing Workflows in Trident; Modifying a Workflow; Create a new Workflow; Creating Components; Wrap up.
Slide 3: Acknowledgements. This workshop was made possible through a generous grant by Microsoft Research and by the Data to Insight Center of Indiana University's Pervasive Technology Institute. Thanks to Quan Zhou, Ph.D. student and developer, for his help with developing components, workflows, and documentation.
Slide 4: Introduction to Digital Curation. Defining curation; Infrastructure for curation; Curating the files; Curating the object.
Slide 5: Defining Curation. "Digital curation involves maintaining, preserving and adding value to digital research data throughout its lifecycle. The active management of research data reduces threats to their long-term research value and mitigates the risk of digital obsolescence. Meanwhile, curated data in trusted digital repositories may be shared among the wider research community. As well as reducing duplication of effort in research data creation, curation enhances the long-term value of existing data by making it available for further high quality research."
Digital Curation Center
Slide 6: Curation Infrastructure. Repository; Public access; Policies; Processes; Institutional support.
Slide 7: Curating the Files. Bitstream Integrity; Fixity; Duplicate copies; File integrity; Format verification; Format validation.
Slide 8: File Formats. Durability; Transparency; Documentation; Ubiquity; Renderability; Longevity.
Slide 9: Format Choices. Master files for preservation: highest quality, highest fidelity, lossless. Derivative files for active use and delivery: smallest possible for user needs, fast delivery, easy-to-use format.
Slide 10: Curating the Object. Context; Relationships between files; Technical metadata; Intellectual metadata; To Metadata; Implicit/explicit context.
Slide 11: Curation Activities. Ongoing verification: file integrity, object integrity. Metadata management. Management of obsolescence: hardware, software, formats. Documentation.
Slide 12: Workflow Systems. Purpose of workflow systems; Types of workflow systems; Trident Workflow Workbench.
Slide 13: Why Workflow Systems. Repetitive and mundane activities simplified; Facilitates and enforces best practices; Enables efficient scheduling; Machinery for coordinating the execution of services and linking together resources; Facilitates outreach to researchers for direct deposit and automatic curation.
Slide 14: Types of Workflow Systems. Kepler; BPEL; Ptolemy II; Triana; Taverna.
Slide 15: Trident. Open source project; Based on Windows Workflow Foundation classes; Supported by Microsoft Research and academic researchers; Integrates with myExperiment; Well accepted in the research community: well over 100 peer-reviewed and white papers were discovered through one scholarly aggregation service.
Slide 16: Trident Components. Trident Management Studio; Trident Workflow Composer; Trident Workflow Application; Microsoft SQL Server; Trident Silverlight client for web execution of workflows; Microsoft Visual Studio C# development
environment.
Slide 17: Design (architecture diagram, labels only). Visual Workflow Composer; Trident Registry; Workflow Packages (domain specific); Trident Runtime Services; Windows Workflow Foundation, .NET 4.0; Provenance; Monitoring; Workflow Scheduling Service; Admin; Admin Console; Workflow Monitor; Community Web Portal; Search; Launch; Monitor; Workflow Launcher; Results Repository; Workflow Repository (myExperiment); Data Access Layer; Data Object Model (data source abstraction layer); Data Storage Providers: SQL Server, Local XML store,
Slide 18: Workflows for Curation. Goals: systematic and repeatable processes; helps remove human errors. Data ingest: integrity checks; format normalization/derivative generation; metadata creation. Curation activities: integrity checks; format migration; media migration.
Slide 19: Data Ingest Workflows. Scenarios: single-part objects (individual images); multi-part objects (a book); multiple instantiations of a logical object (Word, PDF, and PPT of a research paper); multiple multi-part objects (a group of letters); research data products (multiple files of various types); scientific workflow process.
Slide 20: Single Part Objects Workflow. Magic Lantern Slides: individual files, spreadsheet. Workflow components: Derivative Generation; Format Validation and Verification; Fixity Check; Create Tech Metadata; Create Intellectual Metadata; Create Object Metadata; Persistent Identification; Deposit in Repository; Image Quality Checks.
Slide 21: Multi-part Object Workflow. Comic Book: RIS, set of .tif files. Workflow components: Create Tech Metadata; Derivative Generation; Format Validation and Verification; Fixity Check; Object Integrity; Create Intellectual Metadata; Create Object Metadata; Persistent Identification; Deposit in Repository; Image Quality Checks.
Slide 22: Multiple Instantiations of a Logical Object Workflow. Papers: each logical object per subdirectory; RIS, Word file, and (perhaps) supplemental file. Workflow components: Format Normalization; Format Validation and Verification; Fixity Check; Create Tech Metadata; Create
Intellectual Metadata; Create Object Metadata; Persistent Identification; Deposit in Repository; Derivative Generation.
Slide 23: Multiple Multi-part Object Workflow. Ball collection: RIS for collection and inventory spreadsheet; each logical object in a separate subdirectory. Workflow components: Create Tech Metadata; Derivative Generation; Format Validation and Verification; Fixity Check; Object Integrity; Create Intellectual Metadata; Create Object Metadata; Persistent Identification; Deposit in Repository; Image Quality Checks; Collection Integrity; Create Collection Metadata.
Slide 24: Research Data Products. Vortex: each subdirectory is an experiment with FGDC metadata. Workflow components: Compress Data; Fixity Check; Create Intellectual Metadata; Create Object Metadata; Persistent Identification; Deposit in Repository.
Slide 25: Workflow Components. Format conversions (for normalization and derivative generation): .xlsx to .csv; .docx to .pdf; .ppt to .pdf; .tif to .jpg. Zipping on demand. Image (.tif or .jpg) to .pdf.
Slide 26: Workflow Components 2. Context creation: MIX data generator and validator; METS data generator and validator. Data integrity: MD5 checksum generator; MD5 checksum validator; JHOVE for format verification and validation; group validation (for object integrity).
Slide 27: Post Deposit Curation Workflow. Scenarios: fixity verification; format normalization; new or additional derivative generation; media migration; persistent identifier updates; metadata updates.
Slide 28: Workflows in Trident.
Slide 29: Executing Workflows. Individual object ingest; Multipart object ingest; Multiple multipart object ingest; Multiple instantiations of a single logical object; Research data ingest; Scientific workflow; Fixity check curation workflow.
Slide 30: Implementing Workflows in Trident. Launch the Remote Desktop application. User: AMAZONA- JJOAL14\oruser PWD: TridentOR12!! Computer IP addresses are on the slip of paper being passed out now.
Slide 31: Trident Workflow Composer.
Slide 32: Participant Exercises.
Slide 33: Modifying Workflows. Add components to existing workflows. First exercise: select the Individual Ingest Workflow; add the DOI component before the METS generator component; make the connections. Second exercise: select the Group Ingest Workflow Comic; add the METS generation component after the last component in the main line; make the connections.
Slide 34: Simple Curation Workflow Creation. Create a workflow for a simple curation process: validate MD5 checksums. Define a directory of image files; define a METS file; define an output location; link the MD5 checksum validation component; link the MD5 checksum report component; save and execute the workflow.
Slide 35: Creating Components. Exercise: create a new Trident workflow component. Implement the MARCXML to MODS stylesheet: http://www.loc.gov/standards/mods/v3/MARC21slim2MODS3-4.xsl Kavitha Chandrasekar will demonstrate the process.
Slide 36: Wrap Up. Thumb drives; Trident CodePlex site; Trident listserv; Contributing to Trident; Workshop Evaluation Form; Ongoing conversation.
Slide 37: Contacts for Further Discussion. Trident CodePlex site: http://tridentworkflow.codeplex.com/ Trident Listserv: trident-wf- email@example.com- firstname.lastname@example.org Stacy Kowalczyk: email@example.com@indiana.edu Kavitha Chandrasekar: firstname.lastname@example.org@imail.iu.edu Yiming Sun: email@example.com@umail.iu.edu Quan Zhou: firstname.lastname@example.org@indiana.edu </p>
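The simple curation exercise on Slide 34 (validate MD5 checksums for a directory of image files against a METS file, then write a report) can be sketched outside Trident. The following is a minimal Python illustration, not the Trident components used in the workshop, which are built in C#. It assumes the METS file records each checksum in the `CHECKSUM` and `CHECKSUMTYPE` attributes of a `file` element and each location as a relative path in `FLocat`'s `xlink:href`. The function names `md5_of_file`, `validate_mets_checksums`, and `write_report` are hypothetical, chosen for this sketch.

```python
import hashlib
import xml.etree.ElementTree as ET
from pathlib import Path

# Standard METS and XLink namespace URIs.
METS_NS = "http://www.loc.gov/METS/"
XLINK_NS = "http://www.w3.org/1999/xlink"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def validate_mets_checksums(mets_path, base_dir):
    """Compare each MD5 recorded in a METS file with the file on disk.

    Assumes FLocat hrefs are paths relative to base_dir (an assumption
    of this sketch). Returns a list of (href, status) tuples, where
    status is "ok", "mismatch", or "missing".
    """
    base = Path(base_dir)
    report = []
    root = ET.parse(mets_path).getroot()
    for file_el in root.iter(f"{{{METS_NS}}}file"):
        # Skip entries that carry no MD5 checksum.
        if file_el.get("CHECKSUMTYPE", "").upper() != "MD5":
            continue
        expected = file_el.get("CHECKSUM", "").lower()
        flocat = file_el.find(f"{{{METS_NS}}}FLocat")
        href = None if flocat is None else flocat.get(f"{{{XLINK_NS}}}href")
        if href is None:
            continue
        target = base / href
        if not target.is_file():
            status = "missing"
        elif md5_of_file(target) == expected:
            status = "ok"
        else:
            status = "mismatch"
        report.append((href, status))
    return report


def write_report(report, out_path):
    """Write one tab-separated line per file to the output location."""
    with open(out_path, "w", encoding="utf-8") as out:
        for href, status in report:
            out.write(f"{href}\t{status}\n")
```

Calling `validate_mets_checksums("mets.xml", "images/")` and passing the result to `write_report` mirrors the exercise's three inputs (image directory, METS file, output location). Keeping validation and reporting in separate functions follows the slide's split between a checksum validation component and a checksum report component.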