UC3 Curation Micro-Services: Simplified Repository Ingest. UC Curation Center, California Digital Library, May 20, 2010.




  • Agenda
      • Introduction
          • Welcome and review of objectives
      • UC3 and digital curation
          • Landscape, assumptions, and imperatives
      • Curation micro-services
          • The Merritt project
          • Design goals
          • The future of the DPR
      • Simplified repository ingest
          • Concepts
          • Implementation
          • Demonstration
      • Discussion

  • Objectives
      By the end of this discussion we hope that you will understand:
      • Digital curation and the UC3 mission
      • The emergent, micro-services approach to curation infrastructure
      • The Merritt curation environment and the future of the DPR
      • The Merritt Ingest service and its interactions with the Identity, Storage, and Inventory services
      • How to incorporate the Ingest service into your workflows

  • University of California Curation Center (UC3)
      • We've changed our name, but not our commitment
          • Ensuring that the information resources supporting, and resulting from, the University's research, teaching, and learning mission remain authentic, available, and usable
      • UC3 is a Center of Excellence
          • A creative partnership bringing together the expertise and resources of the CDL, the ten UC campuses, and the broader international curation community

  • Digital curation
      • The set of policies and practices focused on managing and adding value to a body of trusted digital content
          • Preservation ensures access over time
          • Access depends upon preservation up to a point in time
      • It can also be seen as facilitating the alignment of the scholarly and information lifecycles

  • Landscape
      • Ever increasing number, size, and diversity of content
          • More stuff, fewer resources
      • Ever increasing diversity of partners, stakeholders, and expectations
          • Producers / consumers → prosumers / conducers
      • Inevitability of disruptive change
          • Technology
          • User expectation
          • Institutional mission and resources
      • Problem or opportunity?
      [Figure: labels "$", "Work", and "Time"]

  • Assumptions
      • Curated content gains
          • Safety through redundancy: lots of copies keeps stuff safe
          • Meaning through context: lots of description keeps stuff meaningful
          • Utility through service: lots of services keeps stuff useful
          • Value through use: lots of uses keeps stuff valuable
      • Curation is an outcome, not a place
          • Decentralized curation can be as effective as centralized
      • Curation stewardship is a relay

  • Imperatives
      • Provide innovative, effective, and efficient services
      • Plan for change
          • Focus on content, not the systems in which that content is managed
          • Systems come and go (but not our system ;-)
      • Occam's Razor and Murphy's Law suggest
          • Favor the small and simple over the large and complex
          • Favor the minimally sufficient over the feature laden
          • Favor the configurable over the prescribed
          • Favor the proven over the (merely) novel
      • Enable curation at the point of use
      • Do more with less

  • Curation micro-services
      • Devolve curation function into a granular set of independent, but interoperable micro-services
          • Since each is small and self-contained, they are collectively easier to develop, maintain, and enhance
          • Since the level of investment in, and therefore commitment to, any given service is small, they are easier to replace when they have outlived their usefulness
          • The scope of each service is limited, but complex behavior emerges from the strategic composition of individual atomistic services

  • Merritt curation micro-services
      • Curation
          • Value
              • Annotation of content by consumers
              • Notification of new content availability
              • Transformation to create derivatives
          • Utility
              • Search of content and metadata
              • Index to enable fast search
              • Ingest of content for curation
      • Preservation
          • Context
              • Characterization to extract content properties
              • Inventory of curated content
              • Replication for safety
          • State
              • Fixity to verify bit-level integrity
              • Storage for long-term retention
              • Identity for long-term reference

  • What is the future of the DPR?
      • The DPR will continue to be operated as a core UC3 service
      • However, the components of the underlying system will be gradually replaced with their new Merritt-based equivalents
      • All content currently managed in the DPR will be automatically migrated to the new environment
      • Micro-services also can be used to deploy locally-hosted repositories to meet specialized local needs

  • What is the future of the DPR?
      • Continuing stewardship commitment by UC3 regarding managed content
          • Safety, persistence, efficiency, economy
      • Streamlined workflows for submission, access, and collection management
          • Easy in, easy out
          • Minimal technical requirements for contribution
      • Great flexibility in deploying customized repository solutions

  • Design goals
      • Policy neutral, protocol and platform independent
          • We know we can't foresee all of the contexts in which these services can be usefully deployed
      • Principle of least surprise
          • Extensive options, but meaningful default behavior
      • Linked data
          • All entities exist within a web of semantic relations
          • http://linkeddata.org/
      • The file system is the database
          • All content and metadata are expressed in the file system
          • Some subset of this information may be replicated in databases as an optimization for fast query

  • Design goals
      • Code to interfaces
          • Underlying implementations should and will evolve over time without invalidating the public interface contract
      • Exploit agile methods
          • Early prototyping, frequent refactoring
          • Stakeholder engagement
      • The appropriate benchmark for submission user experience is Flickr

  • Storage concepts
      • Node
          • A sub-domain of the Storage service established to meet specific policy, administrative, or technical needs
      • Object
          • Encapsulation in digital form of an abstract intellectual or aesthetic work
      • Version
          • A set of files representing a discrete state of the object
          • Any change to object state constitutes a new version
      • File
          • A formatted bit stream

  • Storage concepts
      • Stable reference
          • All objects (and their versions, and their files) managed in the Storage service have stable URLs that can be used to retrieve entities or metadata about entities, subject to appropriate access control
              http://example-store.edu/content/abc/1234
              http://example-store.edu/content/abc/1234/3
              http://example-store.edu/state/abc/1234/3/xyz
      [Figure: URL components labeled Storage service, Request type, Storage node, Object, Version, File]
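A minimal sketch of how the stable URLs on this slide might be composed, assuming the path order suggested by the figure labels: Storage service, request type, node, object, then optional version and file. The function name `storage_url` and its parameters are hypothetical and are not part of any Merritt specification.

```python
from typing import Optional
from urllib.parse import quote


def storage_url(service: str, request: str, node: str, obj: str,
                version: Optional[int] = None,
                file: Optional[str] = None) -> str:
    """Compose service/request/node/object[/version[/file]] (assumed layout)."""
    if file is not None and version is None:
        raise ValueError("a file reference requires a version")
    parts = [request, node, obj]
    if version is not None:
        parts.append(str(version))
    if file is not None:
        parts.append(file)
    # Percent-encode each path segment so reserved characters stay inside it.
    return service.rstrip("/") + "/" + "/".join(quote(p, safe="") for p in parts)
```

For example, `storage_url("http://example-store.edu", "state", "abc", "1234", 3, "xyz")` reproduces the third URL on the slide.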

  • Ingest concepts
      • Queue
          • Asynchronous processing of submitted material
      • Batch
          • A set of digital objects submitted together
          • The unit of notification and reporting
      • Job
          • The processing of a single digital object
      • Handler
          • A specific processing stage

  • Ingest concepts
      • Profile
          • A user-specific set of processing choices
          • Negotiated as part of the submission agreement
      • Notification
          • At the time of ingest submission and completion
          • Our stewardship obligation begins at the time of ingest completion
      • Submit by-value (a file) or by-reference (a URL)
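The relationship between queue, batch, and job can be sketched as follows. This is an illustration under stated assumptions, not the Merritt implementation: an in-process `queue.Queue` and a single worker thread stand in for the real queue, and every class and function name here (`Job`, `Batch`, `submit`, `consume`, `run_batch`) is hypothetical.

```python
import queue
import threading
from dataclasses import dataclass, field
from typing import List


@dataclass
class Job:
    """The processing of a single digital object."""
    batch_id: str
    object_ref: str          # a file path (by-value) or a URL (by-reference)
    done: bool = False


@dataclass
class Batch:
    """A set of objects submitted together; the unit of notification."""
    batch_id: str
    jobs: List[Job] = field(default_factory=list)

    def complete(self) -> bool:
        return all(job.done for job in self.jobs)


def submit(batch_id: str, object_refs: List[str], work: queue.Queue) -> Batch:
    """Split a submission into one queued job per object."""
    batch = Batch(batch_id)
    for ref in object_refs:
        job = Job(batch_id, ref)
        batch.jobs.append(job)
        work.put(job)
    return batch


def consume(work: queue.Queue) -> None:
    """Worker loop: take jobs off the queue until a None sentinel arrives."""
    while True:
        job = work.get()
        if job is None:
            break
        job.done = True      # placeholder for running the handler chain


def run_batch(batch_id: str, object_refs: List[str]) -> Batch:
    """Submit a batch, process it asynchronously, and return it when finished."""
    work: queue.Queue = queue.Queue()
    worker = threading.Thread(target=consume, args=(work,))
    worker.start()
    batch = submit(batch_id, object_refs, work)
    work.put(None)           # tell the worker that no more jobs are coming
    worker.join()
    return batch
```

A completion notification would be sent once `batch.complete()` returns true, which matches the slide's point that the batch, not the job, is the unit of notification.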

  • Ingest process flow
      [Sequence diagram. Participants: Submitting library, Ingest, Inventory, Storage (three nodes), Identity. Messages: Submit; Create identifier / Identifier; Add version; Get version metadata / Version metadata; Notification]

  • Ingest implementation
      [Architecture diagram. Components: Submitting library, Submitter, Consumer, Ingester, Storage, Queue. Annotations: HTML form; Servlet, implicitly multi-threaded (twice); Daemon, explicitly multi-threaded; ZooKeeper daemon. Flows: Batch or single object; Job metadata; Job payload; Submission notification; Ingest notification]

  • Demonstration
      • A few caveats
          • Still a work in progress!
          • The final interface style sheets are not yet applied
          • Inventory and authentication/authorization services still under development
          • Full error reporting is not complete

  • Development roadmap
      • First wave: Identity, Storage
      • Second wave: Inventory, Ingest
      • Third wave: Index, Fixity
      • Fourth wave: Search, Replication
      • Fifth wave: Notification, Characterization
      • Sixth wave: Annotation, Transformation
      • Also shown: Object / collection modeling; Metadata standards; Authentication / authorization; Semantic interoperability; Policy / business model development

  • Early community reaction
      • Collaborative development and integration projects with UC3 partners
      • Independent implementation of key Merritt specifications
      • Presentation/BOF at Open Repositories 2010
      • Digital curation group and Barcamp
          http://groups.google.com/group/digital-curation
          http://groups.google.com/group/digital-curation/web/curation-technology-sig

  • Discussion
      • Will existing workflows continue to work?
          • Yes, we have a crosswalk from the existing METS-based feeder submission
      • What are the minimal requirements for an acceptable digital object?
          • A per-object METS file is no longer required
          • The DPR will accept any content in any form
          • However, the long-term curation service level may vary depending on the object's formal characteristics, the presence (or absence) of accompanying metadata, the general state of curation understanding, and the availability of appropriate tools

  • Discussion
      • How do I include metadata in my submission?
          • The Ingest submission form provides an opportunity to specify descriptive Dublin Kernel metadata
          • Administrative metadata is implied by the user's profile
              • Name, affiliation, contact information, collection, ...
          • Technical (and, potentially, descriptive) metadata is automatically extracted by the characterization handler
          • Additional metadata can be expressed in recognized schemas and stored in files with well-known names
              • mrt-dublin-core.txt
              • mrt-mods.xml
              • mrt-creative-commons.rdf
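One way a submission could be scanned for the well-known metadata file names listed on this slide is sketched below. The three file names come from the slide; the schema labels are inferred from those names, and the function `find_metadata_files` and the choice to look only at the top level of the submission directory are assumptions for illustration.

```python
from pathlib import Path
from typing import Dict

# File names are from the slide; the schema labels are inferred from them.
WELL_KNOWN_METADATA = {
    "mrt-dublin-core.txt": "Dublin Core",
    "mrt-mods.xml": "MODS",
    "mrt-creative-commons.rdf": "Creative Commons",
}


def find_metadata_files(submission_dir: str) -> Dict[str, str]:
    """Map each well-known metadata file found in the directory to its schema."""
    found = {}
    for path in Path(submission_dir).iterdir():
        if path.is_file() and path.name in WELL_KNOWN_METADATA:
            found[path.name] = WELL_KNOWN_METADATA[path.name]
    return found
```

Files that do not match a well-known name are simply ignored by this check and would be treated as ordinary content.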

  • Discussion
      • Isn't an enterprise storage solution or RDBMS (e.g. Oracle) better than just relying on the file system?
          • No, we believe that there are a number of important advantages to directly exploiting the file system
              • No vendor lock-in; proprietary systems are difficult to debug
              • Modern file systems have excellent scaling characteristics
              • The ability to re-instantiate the system by walking the file system is significant
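The idea of re-instantiating the system by walking the file system can be illustrated with a small sketch that rebuilds a query index from the directory tree alone. The on-disk layout assumed here (`root/<object>/<version>/<files>`), the table schema, and the function name `rebuild_index` are all hypothetical; an in-memory SQLite database stands in for whatever database might be used as a query optimization.

```python
import os
import sqlite3


def rebuild_index(root: str) -> sqlite3.Connection:
    """Rebuild a file index by walking root/<object>/<version>/... (assumed layout)."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE files (object TEXT, version INTEGER, path TEXT, size INTEGER)"
    )
    for dirpath, _dirnames, filenames in os.walk(root):
        parts = os.path.relpath(dirpath, root).split(os.sep)
        # Only directories at or below <object>/<version> hold indexed files.
        if len(parts) < 2 or not parts[1].isdigit():
            continue
        obj, version = parts[0], int(parts[1])
        version_dir = os.path.join(root, parts[0], parts[1])
        for name in filenames:
            full = os.path.join(dirpath, name)
            db.execute(
                "INSERT INTO files VALUES (?, ?, ?, ?)",
                (obj, version, os.path.relpath(full, version_dir), os.path.getsize(full)),
            )
    db.commit()
    return db
```

Because the index is derived entirely from the files, it can be dropped and rebuilt at any time without losing information.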

  • Discussion
      • Why is there a separate Ingest service? Why can't I just submit directly to the Storage service?
          • Merritt embraces the separation of concerns principle
              http://en.wikipedia.org/wiki/Separation_of_concerns
          • The Storage service only knows about storage and has strict requirements for the allowable form of submissions
          • The Ingest service was explicitly designed for user-facing operation and imposes minimal constraints on submission forms

  • Discussion (questions for you)
      • What constitutes a collection?
          • Does it have hierarchically-arranged sub-components?
      • What tools do you need to manage your collections effectively?
      • How do you expect to retrieve content from the repository?
          • Following a saved link?
          • Search query? If so, what would be the query terms?

  • Discussion (questions for you)
      • What level of access control is necessary?
          • Bright vs. dark policy
          • Embargo periods
          • Redaction
      • Who are the subject populations?
          • UC affiliates
          • Non-UC
      • How fine-grained must this control be?
          • Collection or object
          • Campus, research group, user

  • Discussion (questions for you)
      • Are there other repository tools or protocols that we should investigate?

    Please respond to the DPR survey at http://vovici.com/wsb.dll/s/aaeg44ec2

  • For more information
      • UC Curation Center
          http://www.cdlib.org/services/uc3
      • Curation micro-services
          https://confluence.ucop.edu/display/Curation
      • DPR survey
          http://vovici.com/wsb.dll/s/aaeg44ec2
      • Digital curation group and Barcamp
          http://groups.google.com/group/digital-curation
          http://groups.google.com/group/digital-curation/web/curation-technology-sig
      • UC3: Stephen Abrams, Erik Hetzner, Margaret Low, Mark Reyes, Perry Willett, Patricia Cruse, Greg Janée, John Kunze, Tracy Seneca, Scott Fisher, David Loy, Isaac Rabinovitch, Marisa Strong

  • Handlers
      • Initialize
      • Accept the submission package
      • Verify the package transfer
      • Disaggregate / Retrieve the package's individual components
      • Corroborate the component completeness and integrity
      • Characterize the components
      • Mint the object's primary identifier
      • Describe the object
      • Document the ingest process
      • Digest the ingest package
      • Transfer the ingest package to the Storage service
      • Notify the submitting user agent
      • Cleanup
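The handler list reads as an ordered chain of processing stages applied to each job. A minimal sketch of that pattern follows, with only two stages filled in. The job state dictionary, the use of SHA-256 for the integrity check, the identifier format, and all function names are assumptions for illustration and are not taken from the Merritt code.

```python
import hashlib
from typing import Any, Callable, Dict, List

JobState = Dict[str, Any]
Handler = Callable[[JobState], JobState]


def corroborate(state: JobState) -> JobState:
    """Check each component against the digest declared in the manifest."""
    for name, data in state["components"].items():
        expected = state["manifest"].get(name)
        if expected is None:
            raise ValueError(f"component not in manifest: {name}")
        if hashlib.sha256(data).hexdigest() != expected:
            raise ValueError(f"digest mismatch: {name}")
    return state


def mint(state: JobState) -> JobState:
    """Assign a primary identifier (placeholder scheme, not a real minter)."""
    state["identifier"] = "hypothetical:" + state["job_id"]
    return state


def run_handlers(handlers: List[Handler], state: JobState) -> JobState:
    """Run the stages in order, recording each completed stage in the job log."""
    state.setdefault("log", [])
    for handler in handlers:
        state = handler(state)
        state["log"].append(handler.__name__)
    return state
```

Keeping each stage as an independent function with the same signature means stages can be reordered, replaced, or configured per profile without touching the others.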

  • Profile choices
      • Queue priority
      • Target Storage service and node
      • Primary identifier scheme, namespace, and minter
      • Content model
      • Notification email
      • Handler options


