PAWN: A Novel Ingestion Workflow Technology for Scientific Data

Mike Smorul, Joseph JaJa, Yang Wang, Mike McGann, and Fritz McCall

Overall Principles

Distributed, secure ingestion

Use of web/grid technologies – platform independent

Minimal client-side requirements

Ease of integration with data grid systems

Designed to satisfy data integrity requirements of scientific collections and digital preservation

[Diagram: Producer. Labeled components: Producer Management Interface, Producer data suppliers, Management Server, Data Grid Gateway.]

Producer

Provides data to a data grid based on a prior agreement.

Consists of a management/metadata server and an ingestion client.

Provides initial arrangement, context, and metadata.

[Diagram: Data Grid – receiving. Labeled components: Producer 1, Producer 2, ..., Producer n, Scheduler, Bitstream Validation Service, Data Grid.]

Data Grid – receiving

Receives data from a Producer.

Validates bitstreams and metadata, and sends acknowledgement to Producer.

Arranges into collections and specifies optional publishing and preservation policy.

Publishes bitstreams into data grid.

Data Grid – Long term Stewardship

Implemented using grid technologies.

Use the existing prototype NARA/UMD/SDSC site.

Automated replication and integrity checking.

Enforces access control and preservation policy

Ingestion Workflow

1. Negotiate Submission Agreement.

2. Workflow Initialization and Submission Information Packet (SIP) creation.

3. Transfer of SIPs to Data Grid site.

4. Validation of SIP transfer

5. Organization of data into collections and transfer into Data Grid.
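
The five steps above can be read as an ordered sequence that a tracking component records per submission. The following is a minimal sketch of that idea, not PAWN's implementation: the `Step` and `IngestionTracker` names are hypothetical, and strict in-order enforcement is an assumption made for illustration.

```python
from enum import IntEnum


class Step(IntEnum):
    """The five workflow steps, numbered as on the slide (names are hypothetical)."""
    NEGOTIATE_AGREEMENT = 1
    CREATE_SIP = 2
    TRANSFER_SIP = 3
    VALIDATE_TRANSFER = 4
    ORGANIZE_AND_PUBLISH = 5


class IngestionTracker:
    """Records completed steps and rejects out-of-order completion (assumed policy)."""

    def __init__(self):
        self.completed = []

    @property
    def done(self):
        return len(self.completed) == len(Step)

    def complete(self, step):
        if self.done:
            raise ValueError("workflow already finished")
        expected = Step(len(self.completed) + 1)
        if step is not expected:
            raise ValueError(f"expected {expected.name}, got {step.name}")
        self.completed.append(step)


tracker = IngestionTracker()
tracker.complete(Step.NEGOTIATE_AGREEMENT)
tracker.complete(Step.CREATE_SIP)
```

Calling `tracker.complete(Step.VALIDATE_TRANSFER)` at this point would raise, because the transfer step has not been recorded yet.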

Submission Agreement

Create machine actionable set of rules describing items.

Final Submission Agreement is composed of:

METS document for application defaults

METS Constraint document to limit METS form to submission parameters

METS Overview

Provides a framework for linking structural organization of objects with metadata.

Using XML namespaces, metadata from various XML schemas (e.g., Dublin Core, FGDC) can be attached to objects.

Extensible for more complex metadata.

http://www.loc.gov/standards/mets/

Sample METS Document

<?xml version="1.0" encoding="utf-8" standalone="no"?>
<mets xmlns="http://www.loc.gov/METS/"
      xmlns:xlink="http://www.w3.org/TR/xlink"
      xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
      xsi:schemaLocation="http://www.loc.gov/METS/ http://www.loc.gov/standards/mets/mets.xsd">
  <metsHdr>
    <agent ROLE="CREATOR">
      <name>toaster@hostname</name>
    </agent>
  </metsHdr>
  <fileSec>
    <fileGrp>
      <file ID="5" MIMETYPE="application/octet-stream" SIZE="67624"
            CREATED="2002-08-21T15:36:05"
            CHECKSUM="2CE7D79E40BD6C6A65A6684B6FD3D08C" CHECKSUMTYPE="MD5">
        <FLocat LOCTYPE="URL" xlink:type="simple"
                xlink:href="/nfshomes/toaster/iscsi/GFS-contrib-5.1.tar.gz"/>
      </file>
    </fileGrp>
    <fileGrp>
      <file ID="7" MIMETYPE="application/octet-stream" SIZE="2517"
            CREATED="2002-09-06T17:06:07"
            CHECKSUM="767185AA022180E701324C592E1C36E3" CHECKSUMTYPE="MD5">
        <FLocat LOCTYPE="URL" xlink:type="simple"
                xlink:href="/nfshomes/toaster/iscsi/gfs.out"/>
      </file>
    </fileGrp>
  </fileSec>
  <structMap>
    <div ID="3" LABEL="iscsi">
      <fptr FILEID="5"/>
      <fptr FILEID="7"/>
    </div>
  </structMap>
</mets>

[Slide callouts on the sample document: Metadata Linking, Structural Organization]
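
To make the link between the structural map and the file section concrete, here is a minimal sketch that resolves each `fptr` in a `div` to its `file` entry. It uses only Python's standard `xml.etree.ElementTree`; the `files_by_div` helper is hypothetical, and the embedded document is a trimmed copy of the sample (header and `CREATED` attributes omitted).

```python
import xml.etree.ElementTree as ET

# Namespace URIs copied from the sample document.
NS = {"m": "http://www.loc.gov/METS/"}
XLINK_HREF = "{http://www.w3.org/TR/xlink}href"

SAMPLE = """<mets xmlns="http://www.loc.gov/METS/" xmlns:xlink="http://www.w3.org/TR/xlink">
  <fileSec>
    <fileGrp>
      <file ID="5" MIMETYPE="application/octet-stream" SIZE="67624"
            CHECKSUM="2CE7D79E40BD6C6A65A6684B6FD3D08C" CHECKSUMTYPE="MD5">
        <FLocat LOCTYPE="URL" xlink:type="simple"
                xlink:href="/nfshomes/toaster/iscsi/GFS-contrib-5.1.tar.gz"/>
      </file>
    </fileGrp>
    <fileGrp>
      <file ID="7" MIMETYPE="application/octet-stream" SIZE="2517"
            CHECKSUM="767185AA022180E701324C592E1C36E3" CHECKSUMTYPE="MD5">
        <FLocat LOCTYPE="URL" xlink:type="simple"
                xlink:href="/nfshomes/toaster/iscsi/gfs.out"/>
      </file>
    </fileGrp>
  </fileSec>
  <structMap>
    <div ID="3" LABEL="iscsi"><fptr FILEID="5"/><fptr FILEID="7"/></div>
  </structMap>
</mets>"""


def files_by_div(mets_xml):
    """Map each div LABEL to the file records its fptr elements point at."""
    root = ET.fromstring(mets_xml)
    files = {}
    for f in root.findall(".//m:file", NS):
        loc = f.find("m:FLocat", NS)
        files[f.get("ID")] = {
            "href": loc.get(XLINK_HREF) if loc is not None else None,
            "size": int(f.get("SIZE")),
            "checksum": f.get("CHECKSUM"),
        }
    return {
        div.get("LABEL"): [files[p.get("FILEID")] for p in div.findall("m:fptr", NS)]
        for div in root.findall(".//m:div", NS)
    }


for label, entries in files_by_div(SAMPLE).items():
    for entry in entries:
        print(label, entry["href"], entry["size"])
```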

Why METS Constraints?

METS doesn't provide a way to create machine-interpretable rules describing a collection, e.g., allow only TIFF files in certain structural areas.

METS profiles allow for developer-interpretable rules, not machine-interpretable ones.

METS Constraints

Allows structural, metadata, and file constraints.

Structural Constraints: Restrict child divs and restrict pointers to div, file, and other METS documents.

File Constraints: Restrict files by MIME type or validation tests.

Metadata Constraints: Restrict allowed metadata schema.

METS Constraints - Template

<?xml version="1.0" encoding="UTF-8"?>
<mets …. >
  <!-- validation test section, referenced in the constraints document -->
  <amdSec>
    <techMD ID="xmltest">
      <mdWrap MDTYPE="OTHER">
        <xmlData>
          <val:validation NAME="xmltext" DESCRIPTION="Test for valid xml documents" MIMETYPE="text/xml">
            <val:valgrp required="true">
              <val:valtest name="gif" required="true">
                <val:description>generic gif test for any file</val:description>
              </val:valtest>
            </val:valgrp>
          </val:validation>
        </xmlData>
      </mdWrap>
    </techMD>
  </amdSec>
  <!-- base div structure to use for all clients -->
  <structMap>
    <div ID="ID1" LABEL="Research &amp; Development Records">
      <div ID="ID1.1" LABEL="Research &amp; Development Project Records">
        <div ID="ID1.1.1" LABEL="R&amp;D Project Case Files"/>
        <div ID="ID1.1.2" LABEL="R&amp;D Record Series"/>
      </div>
    </div>
  </structMap>
</mets>

METS Constraints - Rules

<?xml version="1.0" encoding="UTF-8"?>
<metsconstraint …>
  <filegrp ID="FILE1" NAME="Text Document">
    <!-- Files can be identified either by MIMETYPE, or TESTID in skeleton METS document or both -->
    <file NAME="html document" MIMETYPE="text/html"/>
    <file TESTID="xmltext" NAME="xml document" MIMETYPE="text/xml"/>
  </filegrp>
  <!-- Apply rules to predefined div's and link to required file/metadata tests above -->
  <divrule DIVID="ID1" RESTRICTDIV="true" RESTRICTFTPR="true" RESTRICTMPTR="true"/>
  <divrule DIVID="ID1.1" RESTRICTDIV="true" RESTRICTFTPR="true" RESTRICTMPTR="true"/>
  <divrule DIVID="ID1.1.1" RESTRICTMPTR="true">
    <filetype FILEGROUPID="FILE1"/>
  </divrule>
  <divrule DIVID="ID1.1.2" RESTRICTMPTR="true"/>
</metsconstraint>
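
As an illustration of how such rules could be applied mechanically, the sketch below loads a rules document and answers whether a file of a given MIME type may be attached to a div. The slides do not define the attribute semantics, so the reading used here is an assumption: `RESTRICTFTPR="true"` (spelled as on the slide) is taken to forbid file pointers on that div, a `filetype` child is taken to limit files to the referenced `filegrp`, and a div with no rule is taken to be unconstrained. The `load_rules` and `file_allowed` helpers are hypothetical, and the root element is written without the attributes the slide elides.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the rules above; the elided root attributes are simply left out.
RULES = """<metsconstraint>
  <filegrp ID="FILE1" NAME="Text Document">
    <file NAME="html document" MIMETYPE="text/html"/>
    <file TESTID="xmltext" NAME="xml document" MIMETYPE="text/xml"/>
  </filegrp>
  <divrule DIVID="ID1" RESTRICTDIV="true" RESTRICTFTPR="true" RESTRICTMPTR="true"/>
  <divrule DIVID="ID1.1.1" RESTRICTMPTR="true">
    <filetype FILEGROUPID="FILE1"/>
  </divrule>
  <divrule DIVID="ID1.1.2" RESTRICTMPTR="true"/>
</metsconstraint>"""


def load_rules(xml_text):
    """Return {div id: {"no_files": bool, "allowed_mime": set of MIME types}}."""
    root = ET.fromstring(xml_text)
    groups = {
        g.get("ID"): {f.get("MIMETYPE") for f in g.findall("file")}
        for g in root.findall("filegrp")
    }
    rules = {}
    for rule in root.findall("divrule"):
        allowed = set()
        for ft in rule.findall("filetype"):
            allowed |= groups[ft.get("FILEGROUPID")]
        rules[rule.get("DIVID")] = {
            # Assumed meaning: no file pointers may be attached to this div.
            "no_files": rule.get("RESTRICTFTPR") == "true",
            "allowed_mime": allowed,
        }
    return rules


def file_allowed(rules, div_id, mime_type):
    rule = rules.get(div_id)
    if rule is None:
        return True  # assumption: divs without a rule are unconstrained
    if rule["no_files"]:
        return False
    # Assumption: an empty filetype list places no limit on MIME type.
    return not rule["allowed_mime"] or mime_type in rule["allowed_mime"]


rules = load_rules(RULES)
print(file_allowed(rules, "ID1.1.1", "text/html"))
print(file_allowed(rules, "ID1.1.1", "image/tiff"))
```

Under these assumptions the first call accepts the HTML file and the second rejects the TIFF, since `ID1.1.1` is limited to the `FILE1` group.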

Initialize Ingestion workflow

Instantiate Producer management server to track registered objects

Establish a working trust relationship with the Data Grid

Issue clients.

Create SIP

Each client registers objects stored locally with the producer management server.

Register file types, validation tests, etc.

Client follows rules in Submission Agreement.

Producer-wide agents can arrange registered objects to give a broader context.
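
A minimal sketch of what a client might compute when registering a local object: size, MIME type, and an MD5 digest, packaged as a METS `file` entry shaped like the sample document. This is illustrative only; the `file_entry` helper is hypothetical, MIME detection by file extension is an assumption, and the actual registration call to the producer management server is not shown because the slides do not describe it.

```python
import hashlib
import mimetypes
import xml.etree.ElementTree as ET

# Namespace URIs as used in the sample METS document.
METS = "http://www.loc.gov/METS/"
XLINK = "http://www.w3.org/TR/xlink"


def file_entry(file_id, path, data):
    """Build a METS <file> element describing one object held by the client."""
    # Guess the MIME type from the extension; fall back to a generic type.
    mime = mimetypes.guess_type(path)[0] or "application/octet-stream"
    entry = ET.Element(
        f"{{{METS}}}file",
        {
            "ID": file_id,
            "MIMETYPE": mime,
            "SIZE": str(len(data)),
            "CHECKSUM": hashlib.md5(data).hexdigest().upper(),
            "CHECKSUMTYPE": "MD5",
        },
    )
    ET.SubElement(
        entry,
        f"{{{METS}}}FLocat",
        {"LOCTYPE": "URL", f"{{{XLINK}}}type": "simple", f"{{{XLINK}}}href": path},
    )
    return entry


entry = file_entry("7", "/tmp/page.html", b"<p/>")
print(ET.tostring(entry, encoding="unicode"))
```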

SIP Example

Submission packet is designed to contain a self describing set of metadata that is self-validating

OAIS Information Packet

Packaging Information

Descriptive Information

Content Information: Physical Object, Representation Information

Preservation Description Information: Provenance, Fixity, Reference, Context

Client Interface

Transfer SIP to Data Grid

Retrieve previously registered SIP from producer management server.

Authenticate to data grid.

Update tracking information with new location of files in data grid.

Data Grid acknowledges transfer completion to producer management server.

Validation of SIP transfer

Check incoming SIP against constraints documents.

Ensure object integrity by verifying checksums/cryptographic digest

Validate bitstreams against necessary tests

Record validation results
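
The integrity check above amounts to recomputing each object's digest on the receiving side and comparing it with the `CHECKSUM` recorded in the SIP. Below is a minimal sketch using Python's standard `hashlib`; the `digest_stream` and `verify_fixity` helpers are hypothetical, and mapping `CHECKSUMTYPE` values such as `MD5` or `SHA-1` to hashlib names by dropping the hyphen is an assumption.

```python
import hashlib
import io


def digest_stream(stream, checksum_type="MD5", chunk_size=1 << 20):
    """Hash a binary stream in chunks so large objects need not fit in memory."""
    hasher = hashlib.new(checksum_type.replace("-", "").lower())
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        hasher.update(chunk)
    return hasher.hexdigest().upper()


def verify_fixity(stream, expected_checksum, checksum_type="MD5"):
    """Return True when the recomputed digest matches the one recorded in the SIP."""
    return digest_stream(stream, checksum_type) == expected_checksum.upper()


payload = b"example bitstream"
recorded = hashlib.md5(payload).hexdigest()  # stands in for the SIP's CHECKSUM value
print(verify_fixity(io.BytesIO(payload), recorded))
print(verify_fixity(io.BytesIO(payload + b"!"), recorded))
```

The first call matches and the second does not, since a single changed byte alters the digest.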

Final transfer to Data Grid

Transfer objects to Data Grid.

Update tracking information with new location in Data Grid.

Transfer log of data activity into data grid.

Return accept/reject messages to producer metadata server.

Component Overview

[Diagram. Components: Producer Management Interface, Data Grid Management Interface, Producer data suppliers, Bitstream Validation Service, Data Grid. Labeled flows: CRL check, Success/Failure notification of ingestion, Metadata registration/retrieval, SIP transfer.]

Producer Components

Database to track registered objects

Certificate Authority management

Web service for receiving side security callback

Management server supplies web service interfaces to ingestion clients and management operations.

Clients are designed to be standalone, with security certificates issued by producer

Receiving Components

Receiving servers validate connecting clients and validate SIPs

Validation services are simple web service calls.

Abstract I/O layer into data grid.

Recap

Implemented using web technologies

Architecture independent

XML based metadata

METS based SIPs

Add-on constraints describing Submission Agreement

Target release dates: Beta: April; Release: June/July

More Information

ADAPT website: http://www.umiacs.umd.edu/research/adapt

Papers:

Scalable, Reliable Marshalling and Organization of Distributed Large Scale Data Onto Enterprise Storage Environments

PAWN: Producer - Archive Workflow Network in Support of Digital Preservation
