Preservation Born-again and Born Digital CS 502 – 20030407 Carl Lagoze – Cornell University Acknowledgements: Anne Kenney Nancy McGovern Vicky Reich

  • Preservation: Born-again and Born Digital

    CS 502, 20030407
    Carl Lagoze, Cornell University
    Acknowledgements: Anne Kenney, Nancy McGovern, Vicky Reich

  • Preservation of physical artifacts
    Environmental control
    Brittle books
        Acidification is a byproduct of paper production from the 1850s to the 1980s
            Bleach for whitening
            Alum for sizing (fixity of ink)
            Tannin for leather tanning
        35-75% of paper-based artifacts from this period are in danger
        Newspapers and paperbacks are especially vulnerable
    ANSI standard Z39.48-1992 for permanent paper

  • Deacidification of Brittle Books
    Raise the pH of treated paper to the acceptable range of 6.8 to 10.4
    Extends the useful life of paper (measured by fold endurance after accelerated aging) by over 300%
    Environmental treatment using magnesium oxide (MgO)
    Expense requires a careful selection process

  • Failures of Microfilm
    Popular preservation approach before digitization
    Severe problems
        Quality of filming
        Color to bi-tonal
        Usability issues
        Self-destruction of film

    Double Fold, Nicholson Baker

  • Digitization through Scanning
    Alternative to deacidification
    Advantages
        Universal access
        Reduction in shelf costs
        OCR (full-text access)
    Disadvantages
        Quality reduction
        Cost
        "Not original" syndrome
        Destruction of source (debinding)
        A new preservation problem

  • What are Digital Images?
    Electronic snapshots taken of a scene or scanned from documents
    Sampled and mapped as a grid of dots or picture elements (pixels)
    Each pixel is assigned a tonal value (black, white, grays, colors), represented in binary code
    Code is stored or reduced (compressed)
    Read and interpreted to create an analog version

  • Why Rich Digital Masters?
    Preservation
        Original may only withstand one scan
        Maintenance of digital files
    Cost
        One scan may be all that is affordable
        Conversion costs dwarfed by other costs
    Access
        Many from one
        The richer the file, the better the derivative in terms of quality and processibility

  • How to determine what's good enough?
    Connoisseurship of document attributes
        Identify key information content
        Objectively characterize or measure attributes: size, detail, tone, and color
    Appreciate imaging factors affecting quality and cost
    Translate between analog and digital
        Equate measurements to digital equivalencies and corresponding metrics, e.g., detail size, resolution, MTF

  • Digital Image Quality is Governed By:
    resolution and threshold
    bit depth
    color management
    image enhancement
    compression and file format
    system performance

  • Resolution
    Determined by the number of pixels used to represent the image
    Increasing resolution increases the level of detail captured and geometrically increases file size
    [image: zoom in]
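
The geometric growth in file size can be checked with a little arithmetic. The sketch below assumes an uncompressed raster and a hypothetical 8.5 x 11 inch page in 8-bit grayscale (neither the page size nor the function name comes from the slides); it shows that doubling the dpi quadruples the pixel count and therefore the size.

```python
def uncompressed_size_bytes(width_in, height_in, dpi, bit_depth):
    """Uncompressed raster size: pixel count times bits per pixel, in bytes."""
    pixels = round(width_in * dpi) * round(height_in * dpi)
    return pixels * bit_depth // 8


# Hypothetical page dimensions, chosen only for illustration.
PAGE_WIDTH_IN, PAGE_HEIGHT_IN = 8.5, 11

for dpi in (200, 300, 600):
    size = uncompressed_size_bytes(PAGE_WIDTH_IN, PAGE_HEIGHT_IN, dpi, 8)
    print(f"{dpi} dpi: {size:,} bytes")
```

Going from 300 dpi to 600 dpi multiplies the size by four, not two, because resolution applies to both dimensions.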

  • Effects of Resolution

    600 dpi

    300 dpi

    200 dpi

  • Threshold Setting in Bitonal Scanning

    defines the point on a scale from 0 to 255 at which gray values will be interpreted either as black or white

  • Effects of Threshold

    threshold = 100

    threshold = 60
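
A minimal sketch of bitonal thresholding, using the two threshold values shown on the slide. The sample gray values are made up, and the convention that values at or above the threshold become white is an assumption; scanners may differ.

```python
def to_bitonal(gray_rows, threshold):
    """Map 8-bit gray values (0-255) to 1-bit pixels: 1 = white, 0 = black.

    Assumed convention: values at or above the threshold become white.
    """
    return [[1 if value >= threshold else 0 for value in row] for row in gray_rows]


# One row of hypothetical gray values, dark to light.
sample = [[30, 70, 90, 120, 200]]

print(to_bitonal(sample, 100))  # mid-grays 70 and 90 turn black
print(to_bitonal(sample, 60))   # the same mid-grays turn white
```

Lowering the threshold from 100 to 60 pushes more mid-gray pixels to white, which is why the same page can look heavier or lighter depending on the setting.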

  • Bit Depth
    Determined by the number of binary digits (bits) used to represent each pixel
        1-bit
        8-bit
        24-bit

  • Bit Depth
    Increasing bit depth increases the level of gray or color information that can be represented and arithmetically increases file size
    Bit depth, dynamic range, and color appearance
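
The two effects of bit depth can be put in numbers. This sketch assumes an uncompressed image with a hypothetical one million pixels; the function names are illustrative. The number of representable tones doubles with every added bit, while file size grows only in proportion to the bit count.

```python
def tonal_levels(bit_depth):
    """Number of distinct tones one pixel can take at this bit depth."""
    return 2 ** bit_depth


def raster_bytes(pixel_count, bit_depth):
    """Uncompressed size in bytes: pixel count times bits per pixel, over 8."""
    return pixel_count * bit_depth // 8


PIXELS = 1_000_000  # hypothetical image size, for illustration only

for bits in (1, 8, 24):
    print(f"{bits}-bit: {tonal_levels(bits):,} tones, {raster_bytes(PIXELS, bits):,} bytes")
```

A 24-bit image is three times the size of an 8-bit one, yet it can represent 65,536 times as many tones.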

  • Utilizing Sufficient Bit Depth

    3-bit gray

    8-bit gray

  • Utilizing Sufficient Bit Depth

    8-bit color

    24-bit color

  • Bit Depth vs. Dynamic Range
    Dynamic range: the range of tonal difference between the lightest light and the darkest dark

  • Mapping Tones Correctly: Use of Histograms
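
As a rough illustration of what a histogram reports, the sketch below counts 8-bit gray values in equal-width bins. The bin count and the sample values are assumptions made for the example, not taken from the slides.

```python
def histogram(gray_values, bins=4, max_value=255):
    """Count gray values (0..max_value) in equal-width bins, darkest bin first."""
    width = (max_value + 1) / bins
    counts = [0] * bins
    for value in gray_values:
        counts[int(value // width)] += 1
    return counts


# Hypothetical pixel values from a scanned page.
pixels = [0, 10, 20, 64, 130, 250, 255]
print(histogram(pixels))
```

A histogram piled up in the first or last bin suggests that shadows or highlights may have been clipped during capture.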

  • Aligning Document Attributes with Digital Requirements
    Identify key document attributes
        Tone, color, and detail
    Characterize them, if possible through objective measurements
    Determine quality requirements and tolerance levels
    Translate between analog and digital and between scanning requirements and scanning performance

  • Aligning Document Attributes with Digital Requirements
    Calibrate scanner with targets and software
    Calibrate the rest of the system
    Control lighting and environment
    Scan appropriate targets with documents
    Evaluate images against originals

  • Aligning Document Attributes with Digital Requirements
    Minimize post-processing in the master image
    Save in TIFF; avoid lossy compression
    Maintain scanning metadata
    Monitor emerging image quality metrics

  • One Size Does Not Fit All!
    Different document types will require different scanning equipment and processes
    The more complex the document, the higher the conversion/access requirements
    Scan the original whenever possible
    No standards for image conversion: guidance rather than guidelines
    Notion of long-term utility and cross-institutional resources gaining ground

  • Trusted Repository
    Attributes of a Trusted Digital Repository (RLG-OCLC): http://www.rlg.org/longterm/attributes01.pdf
    OAIS Reference Model (CCSDS): http://www.ccsds.org/documents/pdf/CCSDS-650.0-R-2.pdf
    National Archives and Records Administration (NARA)
    Attributes
        Administrative responsibility
        Organization viability
        Financial sustainability
        Technical suitability
        System security
        Procedural accountability

  • Digital Preservation Strategies
    Disclaimer: monolithic, homogeneous solutions are likely to fail; many digital preservation approaches are required

  • Emulation
    Preserve original look and feel and functionality of digital artifact
    Enable obsolete systems to be run on future unknown systems
    Notion of universal virtual machine
    Jeff Rothenberg, Raymond Lorie
    CAMiLEON Project: http://www.si.umich.edu/CAMILEON/about/aboutcam.html

  • Migration
    File formats change over time and become extinct
    Issues of proprietary vs. open source formats
    Lossiness of formats
    Risk Management of Digital Information: A File Format Investigation, http://www.clir.org/pubs/reports/pub93/contents.html
    CAMiLEON Project

  • Canonicalization
    Canonicalization: A Fundamental Tool to Facilitate Preservation and Management of Digital Information, http://www.dlib.org/dlib/september99/09lynch.html
    Tie to XML standards: http://www.w3.org/TR/2002/REC-xml-exc-c14n-20020718/
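
To make the idea concrete: two XML files can differ byte for byte yet carry the same information, and canonicalization reduces both to one form so that digests can be compared. The sketch below uses Python's standard-library `xml.etree.ElementTree.canonicalize` (Python 3.8+), which implements C14N 2.0 rather than the Exclusive XML Canonicalization 1.0 recommendation linked on the slide, so treat it as an illustration of the principle only. The sample documents are made up.

```python
import hashlib
from xml.etree.ElementTree import canonicalize


def canonical_digest(xml_text):
    """SHA-256 digest of the canonical form of an XML document."""
    canonical = canonicalize(xml_text)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Same information, different serialization: attribute order, quoting,
# whitespace inside the tag, and empty-element syntax all differ.
doc_a = '<doc b="2" a="1"><item/></doc>'
doc_b = "<doc a='1'   b='2'><item></item></doc>"

print(doc_a == doc_b)                                      # raw text differs
print(canonical_digest(doc_a) == canonical_digest(doc_b))  # canonical forms match
```

Comparing digests of canonical forms lets a repository confirm that a migrated or re-serialized file still carries the same content.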

  • LOCKSS
    Lots of Copies Keep Stuff Safe
    http://lockss.stanford.edu/

  • LOCKSS Mission
    Build tools and provide support to
        Libraries, so they can easily and affordably build, preserve, and archive local e-collections
            Own rather than lease electronic information
            Retain traditional custodial role of scholarly information
        Publishers, so they can easily and affordably provide content for preservation and archiving
            With minimal risk to their business model or to their publishing platforms
            Relinquish responsibility to provide perpetual access
            Fulfill librarians' requirements that publishers guarantee long-term access to content sold

  • Paper Library System
    Libraries act for their institution to
        Acquire copies of important stuff
        Keep copies on shelves
        Give access to local readers
    Libraries cooperate to
        Supply copies to other libraries
    A reader can easily find a copy
    A bad guy has trouble finding and destroying all copies
    Libraries ensure content persists simply by supporting their local communities
    A cooperative, affordable, decentralized archive system with LOTS OF COPIES

  • LOCKSS Library System
    Libraries act for their institution to
        Acquire copies of important stuff
        Keep copies in transparent web caches
        Give access to local readers
    Libraries cooperate to
        Detect and repair damage
    A reader can easily find a copy
    A bad guy has trouble finding and destroying all copies
    Libraries ensure content persists simply by supporting their local communities
    A cooperative, affordable, decentralized archive system with LOTS OF COPIES

  • LOCKSS Technology
    LOCKSS web caches
        Collect HTTP-delivered content
            All file formats (PDF, HTML, JPEG, TIF, Audio, Video)
            Collect presentation files of content as published
            Must have authorized access to publisher's site
        Preserve and audit content integrity
            Independent content collection
            Cooperate to resolve content differences
            Continuously validate against other caches
            Repair gaps from publisher and other caches
        Provide access
            Readers access content via desktop web browser
            Content is never dark
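
The "validate against other caches and repair" step can be pictured with a toy majority vote over content digests. This is not the LOCKSS polling protocol, which the slides do not specify; the cache names, the use of SHA-256, and the simple majority rule are all assumptions made for illustration.

```python
import hashlib
from collections import Counter


def digest(content):
    """SHA-256 digest of one cache's copy."""
    return hashlib.sha256(content).hexdigest()


def audit_and_repair(caches):
    """Overwrite copies that disagree with the most common digest.

    `caches` maps a cache name to its bytes; returns the names repaired.
    """
    votes = Counter(digest(copy) for copy in caches.values())
    winning_digest, _ = votes.most_common(1)[0]
    good_copy = next(c for c in caches.values() if digest(c) == winning_digest)
    repaired = []
    for name, copy in caches.items():
        if digest(copy) != winning_digest:
            caches[name] = good_copy  # repair from an agreeing cache
            repaired.append(name)
    return repaired


# Three hypothetical library caches; one copy has been damaged.
caches = {"lib_a": b"article text", "lib_b": b"article text", "lib_c": b"artXcle text"}
repaired_names = audit_and_repair(caches)
print(repaired_names)
```

The more independent copies take part, the harder it is for damage to a few of them to outvote the intact majority.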

  • Data flows: an approximation

    [Diagram: Publishers, LOCKSS caches, Readers]

  • Hardware Costs
    http://www.almaden.ibm.com/sst/html/leadership/g05.htm
    HDD prices decline by 50% a year
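
A 50% annual decline compounds quickly. The sketch below applies the slide's rate to a hypothetical starting price of 1000 (arbitrary currency units, not a figure from the source) and assumes the rate holds steady year over year.

```python
def projected_price(initial_price, years, annual_decline=0.5):
    """Price after `years` of compounding decline at the given annual rate."""
    return initial_price * (1 - annual_decline) ** years


# Hypothetical starting price for a fixed amount of disk storage.
START_PRICE = 1000

for year in range(0, 6):
    print(year, projected_price(START_PRICE, year))
```

After five years the same capacity costs about 3% of its starting price, which is part of why keeping many copies becomes affordable.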

  • Preservation Risk Management/Automated Monitoring

  • Levels of Context
    Web page
        as a stand-alone object, ignoring its hyperlinks
        in local context, considering the links into it and out from it
    Web site
        as a semantically coherent set of linked Web pages
        as an entity in a broader technical and organizational context

  • Page-level Monitoring
    Formatting: TIDY
        Standards compliance
        Document structure
    Metadata
        HTTP headers
        HTML headers
    Changes
        Content
        Location
    Links
        Out-link structure
        In-link structure
        Intra-site
        Hub
        Volatility
    Page provenance
        URL parsing
        Log analysis
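
One small piece of page-level monitoring, the out-link structure, can be sketched with the Python standard library alone. The class and function names and the sample page are hypothetical, and treating "same host" as "intra-site" is a simplifying assumption, since a site's boundary is not always its hostname.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collect absolute URLs from the href attributes of <a> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(urljoin(self.base_url, href))


def classify_out_links(base_url, html):
    """Split a page's out-links into intra-site and external lists."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    host = urlparse(base_url).netloc
    intra = [u for u in collector.links if urlparse(u).netloc == host]
    external = [u for u in collector.links if urlparse(u).netloc != host]
    return intra, external


# Hypothetical page with one relative and one external link.
page = '<a href="b.html">B</a> <a href="http://example.com/x">X</a>'
intra, external = classify_out_links("http://example.org/a/index.html", page)
print(intra, external)
```

Running the same extraction on successive crawls would give a simple measure of link volatility for a page.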

  • Site-level Monitoring
    Graph analysis
    Static site analysis and longitudinal study
    Aggregate page analyses
    Site maintenance indicators
    Backup and archiving policies and procedures
    Hardware and software environment
    Network configuration and maintenance

  • Facilitating/Monitoring Longevity of Distributed Content
    Preservation Service

  • Preservation Spaces

  • Object vs. Information

  • Preserving information rather than bits

    Add a slide on binary arithmetic in the real presentation, with reality check.
    In classic risk management models, classification may be identified as a step in identification, or as part of assessment. Prism lists it as a separate stage.
    Characterization of Web sites: a Web crawler to enable risk assessment is a key tool for Prism work. Seeking additional funding for tool development to get to Prism stage 4.
    Bullet 1a: HTTP header fields used; indicators of a well-managed page, e.g., error messages. Well-formed page? Type of content? Indicators of a dynamic page: forms, Java, etc.
    Bullet 1b: Significance of links to and links from a page? Lots of links to = good thing; lots of links from = indicator of scope of interest. How many levels should link checking extend?
    Bullet 2a: How to identify the boundary of a Web site? The hierarchical and user-defined nature of URLs and the absence of uniform indicators of ownership (e.g., webmaster and affiliation fields) make it difficult to automatically track Web sites: map and track boundaries, apply policies within boundaries, etc.
    Bullet 2b: How to monitor server health, changes in organizational structure that impact Web content, etc.
    Tools: TIDY, crawler, link checkers, log analyzers
    Trends: absence or presence of metadata as risk indicators; more or less risk based upon top-level domain, MIME types, nature of content, etc.
    Determining the health of a server