
UC3 Curation Micro-Services
Simplified Repository Ingest

UC Curation Center
California Digital Library

May 20, 2010


Agenda

Introduction
– Welcome and review of objectives
– UC3 and digital curation
– Landscape, assumptions, and imperatives

Curation micro-services
– The Merritt project
– Design goals
– The future of the DPR

Simplified repository ingest
– Concepts
– Implementation
– Demonstration

Discussion


Objectives

By the end of this discussion we hope that you will understand
– Digital curation and the UC3 mission
– The emergent, micro-services approach to curation infrastructure
– The Merritt curation environment and the future of the DPR
– The Merritt Ingest service and its interactions with the Identity, Storage, and Inventory services
– How to incorporate the Ingest service into your workflows


University of California Curation Center (UC3)

We’ve changed our name, but not our commitment
– Ensuring that the information resources supporting, and resulting from, the University’s research, teaching, and learning mission remain authentic, available, and usable

UC3 is a Center of Excellence
– A creative partnership bringing together the expertise and resources of the CDL, the ten UC campuses, and the broader international curation community


Digital curation

The set of policies and practices focused on managing and adding value to a body of trusted digital content

– Preservation ensures access over time
– Access depends upon preservation up to a point in time

It can also be seen as facilitating the alignment of the scholarly and information lifecycles

[Figure: the information lifecycle and the scholarly lifecycle shown side by side. Stage labels: Publish, Preserve, Access, Collect, Discover, Gather, Create, Share, Manage. Activity labels: Research, Teaching, Learning]


Landscape

Ever increasing number, size, and diversity of content
– More stuff, less resources

Ever increasing diversity of partners, stakeholders, and expectations
– Producers / consumers → prosumers / conducers

Inevitability of disruptive change
– Technology
– User expectation
– Institutional mission and resources

Problem or opportunity?

[Figure: chart labeled $, Work, and Time]


Assumptions

Curated content gains
– Safety through redundancy
    “Lots of copies keeps stuff safe”
– Meaning through context
    “Lots of description keeps stuff meaningful”
– Utility through service
    “Lots of services keeps stuff useful”
– Value through use
    “Lots of uses keeps stuff valuable”

Curation is an outcome, not a place
– Decentralized curation can be as effective as centralized

Curation stewardship is a relay


Imperatives

Provide innovative, effective, and efficient services

Plan for change
– Focus on content, not the systems in which that content is managed
    – Systems come and go (but not our system ;-)
– Occam’s Razor and Murphy’s Law suggest
    – Favor the small and simple over the large and complex
    – Favor the minimally sufficient over the feature laden
    – Favor the configurable over the prescribed
    – Favor the proven over the (merely) novel

Enable curation at the point of use

Do more with less


Curation micro-services

Devolve curation function into a granular set of independent, but interoperable micro-services
– Since each is small and self-contained, they are collectively easier to develop, maintain, and enhance
– Since the level of investment in, and therefore commitment to, any given service is small, they are easier to replace when they have outlived their usefulness
– The scope of each service is limited, but complex behavior emerges from the strategic composition of individual atomistic services


Merritt curation micro-services

Curation
– Value
    – Annotation of content by consumers
    – Notification of new content availability
    – Transformation to create derivatives
– Utility
    – Search of content and metadata
    – Index to enable fast search
    – Ingest of content for curation

Preservation
– Context
    – Characterization to extract content properties
    – Inventory of curated content
    – Replication for safety
– State
    – Fixity to verify bit-level integrity
    – Storage for long-term retention
    – Identity for long-term reference

UC3 Merritt


What is the future of the DPR?

The DPR will continue to be operated as a core UC3 service

– However, the components of the underlying system will be gradually replaced with their new Merritt-based equivalents

– All content currently managed in the DPR will be automatically migrated to the new environment

Micro-services also can be used to deploy locally-hosted repositories to meet specialized local needs


What is the future of the DPR?

Continuing stewardship commitment by UC3 regarding managed content
– Safety, persistence, efficiency, economy

Streamlined workflows for submission, access, and collection management
– Easy in, easy out

Accept any content

Great flexibility in deploying customized repository solutions


Design goals

Policy neutral, protocol and platform independent
– We know we can’t foresee all of the contexts in which these services can be usefully deployed

Principle of least surprise
– Extensive options, but meaningful default behavior

Linked data
– All entities exist within a web of semantic relations
    http://linkeddata.org/

The file system is the database
– All content and metadata are expressed in the file system
– Some subset of this information may be replicated in databases as an optimization for fast query


Design goals

Code to interfaces
– Underlying implementations should and will evolve over time without invalidating the public interface “contract”

Exploit agile methods
– Early prototyping, frequent refactoring
– Stakeholder engagement

The appropriate benchmark for submission user experience is Flickr


Storage concepts

Node
– A sub-domain of the Storage service established to meet specific policy, administrative, or technical needs

Object
– Encapsulation in digital form of an abstract intellectual or aesthetic work

Version
– A set of files representing a discrete state of the object
– Any change to object state constitutes a new version

File
– A formatted bit stream
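
The four storage concepts nest inside one another, which a small data model can make concrete. The following is only an illustrative sketch, assuming an in-memory model: the class names (`StorageNode`, `StoredObject`, `Version`, `File`), the `put` and `add_version` methods, and version numbering that starts at 1 are all hypothetical and are not taken from the Merritt specifications.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class File:
    """A formatted bit stream, identified by name within a version."""
    name: str
    data: bytes


@dataclass(frozen=True)
class Version:
    """A set of files representing one discrete state of an object."""
    number: int
    files: tuple  # tuple of File instances


@dataclass
class StoredObject:
    """An object is an ordered history of versions."""
    object_id: str
    versions: list = field(default_factory=list)

    def add_version(self, files):
        # Any change to object state yields a new version; earlier
        # versions are never modified in place.
        version = Version(number=len(self.versions) + 1, files=tuple(files))
        self.versions.append(version)
        return version


@dataclass
class StorageNode:
    """A sub-domain of the Storage service holding objects."""
    node_id: str
    objects: dict = field(default_factory=dict)

    def put(self, object_id, files):
        obj = self.objects.setdefault(object_id, StoredObject(object_id))
        return obj.add_version(files)


node = StorageNode("abc")
node.put("1234", [File("xyz", b"first draft")])
latest = node.put("1234", [File("xyz", b"second draft")])
print(latest.number)  # 2
```

Making `Version` and `File` immutable mirrors the idea that a version is a fixed state: a change is recorded by appending, never by editing.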


Storage concepts

Stable reference
– All objects (and their versions, and their files) managed in the Storage service have stable URLs that can be used to retrieve entities or metadata about entities, subject to appropriate access control

http://example-store.edu/content/abc/1234
http://example-store.edu/content/abc/1234/3
http://example-store.edu/state/abc/1234/3/xyz

[Figure: URL components labeled Storage service, Request type, Storage node, Object, Version, and File]
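
The three example URLs can be generated mechanically. The sketch below assumes that the path segments map, in order, to request type, storage node, object, version, and file, which is how the figure labels appear to line up with the examples. The function name `storage_url` and the percent-encoding of each segment are illustrative assumptions, not part of a published Merritt API.

```python
from urllib.parse import quote


def storage_url(base, request, node, object_id, version=None, file_name=None):
    """Build a stable URL for an object, one of its versions, or a file."""
    if file_name is not None and version is None:
        raise ValueError("a file reference needs a version number")
    segments = [request, node, object_id]
    if version is not None:
        segments.append(str(version))
    if file_name is not None:
        segments.append(file_name)
    # Assumption: percent-encode each segment so that an identifier
    # containing "/" cannot change the shape of the path.
    path = "/".join(quote(segment, safe="") for segment in segments)
    return f"{base.rstrip('/')}/{path}"


base = "http://example-store.edu"
print(storage_url(base, "content", "abc", "1234"))
print(storage_url(base, "content", "abc", "1234", version=3))
print(storage_url(base, "state", "abc", "1234", version=3, file_name="xyz"))
```

The three calls reproduce the three URLs shown on the slide.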


Ingest concepts

Queue
– Asynchronous processing of submitted material

Batch
– A set of digital objects submitted together
– The unit of notification and reporting

Job
– The processing of a single digital object

Handler
– A specific processing stage
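
One way to picture how batches, jobs, and handlers relate is a job passing through an ordered list of handlers, with results reported per batch. This is a sketch under stated assumptions: the handler names `verify` and `store` are hypothetical, as is running handlers strictly in sequence; only a characterization handler is actually named in these slides.

```python
from dataclasses import dataclass, field


@dataclass
class Job:
    """The processing of a single digital object."""
    object_name: str
    log: list = field(default_factory=list)


@dataclass
class Batch:
    """A set of objects submitted together; the unit of reporting."""
    batch_id: str
    jobs: list


# Hypothetical handlers: each one is a single processing stage.
def verify(job):
    job.log.append("verified")


def characterize(job):
    job.log.append("characterized")


def store(job):
    job.log.append("stored")


HANDLERS = [verify, characterize, store]


def process(batch):
    """Run every job through each handler in order, then report per batch."""
    for job in batch.jobs:
        for handler in HANDLERS:
            handler(job)
    return {job.object_name: job.log for job in batch.jobs}


report = process(Batch("b1", [Job("thesis.pdf"), Job("dataset.zip")]))
print(report["thesis.pdf"])  # ['verified', 'characterized', 'stored']
```

Keeping handlers as plain functions in a list makes the pipeline configurable: a profile could select or reorder stages without touching the loop.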


Ingest concepts

Profile
– A user-specific set of processing choices
– Negotiated as part of the submission agreement

Notification
– At the time of ingest submission and completion
– Our stewardship obligation begins at the time of ingest completion

Submit by-value (a file) or by-reference (a URL)


Ingest process flow

[Figure: sequence diagram connecting the Submitting library, Ingest, Identity, Storage (with three Nodes), and Inventory. Message labels: Submit, Create identifier, Identifier, Add version, Get version metadata, Version metadata, Notification]
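
The message labels in the process flow suggest a simple orchestration by the Ingest service. The sketch below is one plausible reading of that sequence, not a description of the actual Merritt interfaces: every class and method name is a hypothetical in-memory stand-in, and the exact order in which Inventory and the submitter are notified is an assumption.

```python
import itertools


class Identity:
    """Stand-in Identity service that mints sequential identifiers."""
    def __init__(self):
        self._counter = itertools.count(1)

    def create_identifier(self):
        return f"id-{next(self._counter)}"


class Storage:
    """Stand-in Storage service keeping versions in memory."""
    def __init__(self):
        self._versions = {}

    def add_version(self, identifier, files):
        self._versions.setdefault(identifier, []).append(dict(files))
        return len(self._versions[identifier])

    def get_version_metadata(self, identifier, version):
        files = self._versions[identifier][version - 1]
        return {"id": identifier, "version": version, "files": sorted(files)}


class Inventory:
    """Stand-in Inventory service recording what has been stored."""
    def __init__(self):
        self.records = []

    def notify(self, metadata):
        self.records.append(metadata)


def ingest(files, identity, storage, inventory):
    """Submit -> create identifier -> add version -> metadata -> notify."""
    identifier = identity.create_identifier()
    version = storage.add_version(identifier, files)
    metadata = storage.get_version_metadata(identifier, version)
    inventory.notify(metadata)
    return metadata  # doubles as the notification to the submitter


inventory = Inventory()
result = ingest({"xyz": b"payload"}, Identity(), Storage(), inventory)
print(result)
```

Because `ingest` receives the three services as arguments, each stand-in can be swapped for a real client without changing the orchestration.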


Ingest implementation

[Figure: pipeline diagram. Components: Submitting library, Submitter, Queue, Consumer, Ingester, Storage. Component annotations: HTML form; Servlet, implicitly multi-threaded (shown twice); Dæmon, explicitly multi-threaded; ZooKeeper dæmon. Flow labels: Batch or single object, Job metadata, Job payload, Submission notification, Ingest notification]
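
The queue-and-consumer arrangement decouples submission from processing: the submitter returns as soon as jobs are queued, and worker threads drain the queue later. The sketch below illustrates that pattern only. It uses Python's in-process `queue.Queue` as a stand-in for the ZooKeeper-backed queue, and the function names, the three-worker pool, and the `None` shutdown sentinel are all assumptions made for the example.

```python
import queue
import threading

job_queue = queue.Queue()   # stand-in for the ZooKeeper-backed queue
results = []
results_lock = threading.Lock()


def submit(batch):
    """Submitter: enqueue one job per object and return immediately."""
    for name in batch:
        job_queue.put(name)
    return f"received {len(batch)} object(s)"  # submission notification


def ingest(name):
    """Hypothetical Ingester call; real work would hand off to Storage."""
    return f"ingested {name}"


def consume():
    """Consumer worker: take jobs until a None sentinel arrives."""
    while True:
        name = job_queue.get()
        if name is None:
            break
        outcome = ingest(name)
        with results_lock:
            results.append(outcome)


workers = [threading.Thread(target=consume) for _ in range(3)]
for worker in workers:
    worker.start()

print(submit(["a.tif", "b.tif", "c.tif", "d.tif"]))
for _ in workers:
    job_queue.put(None)  # one sentinel per worker
for worker in workers:
    worker.join()

print(sorted(results))
```

Jobs may complete in any order across workers, which is why the results are sorted before printing.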


Demonstration

A few caveats…
– Still a work in progress!
– The final interface style sheets are not yet applied
– Inventory and authentication/authorization services still under development
– Full error reporting is not complete


Development roadmap

First wave   Second wave   Third wave   Fourth wave   Fifth wave         Sixth wave
Identity     Inventory     Index        Search        Notification       Annotation
Storage      Ingest        Fixity       Replication   Characterization   Transformation

Object / collection modeling Metadata standards

Authentication / authorization Semantic interoperability

Policy / business model development


Early community reaction

Collaborative development and integration projects with UC3 partners

Independent implementation of key Merritt specifications

Presentation/BOF at Open Repositories 2010

Digital curation group and Barcamp
http://groups.google.com/group/digital-curation
http://groups.google.com/group/digital-curation/web/curation-technology-sig


Discussion

Will existing workflows continue to work?
– Yes, we have a crosswalk from the existing METS-based feeder submission

What are the minimal requirements for an acceptable digital object?
– A per-object METS file is no longer required
– The DPR will accept any content in any form
    – However, the long-term curation service level may vary depending on the object’s formal characteristics, the presence (or absence) of accompanying metadata, the general state of curation understanding, and the availability of appropriate tools


Discussion

How do I include metadata in my submission?
– The Ingest submission form provides an opportunity to specify descriptive Dublin Kernel metadata
– Administrative metadata is implied by the user’s profile
    – Name, affiliation, contact information, collection, …
– Technical (and, potentially, descriptive) metadata is automatically extracted by the characterization handler
– Additional metadata can be expressed in recognized schemas and stored in files with well-known names
    – mrt-dublin-core.txt
    – mrt-mods.xml
    – mrt-creative-commons.rdf
    – …
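
Recognizing metadata by well-known file name can be as simple as a lookup against the names listed above. The sketch below assumes a flat submission directory. The function name `find_metadata` and the schema labels attached to each file name are illustrative guesses, and since the slide's list is open-ended, the table here is not complete.

```python
import tempfile
from pathlib import Path

# File names taken from the slide; the schema labels are descriptive guesses.
WELL_KNOWN = {
    "mrt-dublin-core.txt": "Dublin Core",
    "mrt-mods.xml": "MODS",
    "mrt-creative-commons.rdf": "Creative Commons",
}


def find_metadata(submission_dir):
    """Map each well-known metadata file present to its schema label."""
    found = {}
    for path in Path(submission_dir).iterdir():
        if path.is_file() and path.name in WELL_KNOWN:
            found[path.name] = WELL_KNOWN[path.name]
    return found


with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "mrt-mods.xml").write_text("<mods/>")
    Path(tmp, "report.pdf").write_bytes(b"%PDF-")
    detected = find_metadata(tmp)

print(detected)  # {'mrt-mods.xml': 'MODS'}
```

Files whose names are not in the table, such as `report.pdf` here, are simply treated as content.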


Discussion

Isn’t an enterprise storage solution or RDBMS (e.g. Oracle) better than just relying on the file system?
– No, we believe that there are a number of important advantages to directly exploiting the file system
    – No vendor lock-in; proprietary systems are difficult to debug
    – Modern file systems have excellent scaling characteristics
    – The ability to re-instantiate the system by walking the file system is significant
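
Re-instantiating a system by walking the file system means that any index can be thrown away and rebuilt from the directory tree alone. The sketch below illustrates the idea with a hypothetical `<object>/<version>/<file>` layout; the layout, the numeric version directory names, and the function name `rebuild_index` are assumptions made for the example and do not describe Merritt's actual on-disk format.

```python
import os
import tempfile


def rebuild_index(root):
    """Recreate an {object: {version: [files]}} index from the tree alone."""
    index = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        parts = os.path.relpath(dirpath, root).split(os.sep)
        if len(parts) != 2 or not filenames:
            continue  # only <object>/<version> directories hold files
        object_id, version = parts
        index.setdefault(object_id, {})[int(version)] = sorted(filenames)
    return index


with tempfile.TemporaryDirectory() as root:
    for version, name in [(1, "a.txt"), (2, "a.txt"), (2, "b.txt")]:
        folder = os.path.join(root, "1234", str(version))
        os.makedirs(folder, exist_ok=True)
        with open(os.path.join(folder, name), "w") as handle:
            handle.write("data")
    index = rebuild_index(root)

print(index)
```

A database holding the same information then becomes a disposable cache for fast query rather than the authoritative record.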


Discussion

Why is there a separate Ingest service? Why can’t I just submit directly to the Storage service?
– Merritt embraces the “separation of concerns” principle
    http://en.wikipedia.org/wiki/Separation_of_concerns
    – The Storage service only “knows” about storage and has strict requirements for the allowable form of submissions
    – The Ingest service was explicitly designed for user-facing operation and imposes minimal constraints on submission forms


Discussion (questions for you)

What constitutes a “collection”?
– Does it have hierarchically-arranged sub-components?

What tools do you need to manage your collections effectively?

How do you expect to retrieve content from the repository?
– Following a saved link?
– Search query? If so, what would be the query terms?


Discussion (questions for you)

What level of access control is necessary?
– Bright vs. dark policy
– Embargo periods
– Redaction

Who are the subject populations?
– UC affiliates
– Non-UC

How fine-grained must this control be?
– Collection or object
– Campus, research group, user


Discussion (questions for you)

Are there other repository tools or protocols that we should investigate?

Please respond to the DPR survey at
http://vovici.com/wsb.dll/s/aaeg44ec2


For more information

UC Curation Center
http://www.cdlib.org/services/uc3

Curation micro-services
https://confluence.ucop.edu/display/Curation

DPR survey
http://vovici.com/wsb.dll/s/aaeg44ec2

Digital curation group and Barcamp
http://groups.google.com/group/digital-curation
http://groups.google.com/group/digital-curation/web/curation-technology-sig

UC3
Stephen Abrams, Erik Hetzner, Margaret Low, Mark Reyes, Perry Willett, Patricia Cruse, Greg Janée, John Kunze, Tracy Seneca, Scott Fisher, David Loy, Isaac Rabinovitch, Marisa Strong
