Digital preservation and institutional repositories


Delivered at the Summer Institute for Data Curation at the University of Illinois, 21 May 2009.


<ul><li> 1. Institutional repositories for the digital arts and humanities. Dorothea Salo, University of Wisconsin </li>
<li> 2. Preservation for the digital arts and humanities. Dorothea Salo, University of Wisconsin </li>
<li> 3. Preservation and institutional repositories for the digital arts and humanities. Dorothea Salo, University of Wisconsin </li>
<li> 4. And I said... "... you're giving me how much time for this?" </li>
<li> 5. Threat model: "Preservation" means nothing unmodified. This is why it becomes such a bogeyman! Two things you need to know first: why you're preserving what you're preserving, and what you're preserving it against. Your collection-development policy should inform the first question. Your coll-dev policy doesn't include local born-digital or digitized materials? This is a problem. Fix it. The second question is your threat model. </li>
<li> 6. What is your threat model for print? </li>
<li> 7. Homelessness </li>
<li> 8. Water </li>
<li> 9. Bad materials </li>
<li> 10. Flora and fauna </li>
<li> 11. Physical damage </li>
<li> 12. Loss or destruction </li>
<li> 13. Armageddon </li>
<li> 14. Why did I just make you do that? I'm weird. I'm trying to destroy the myth that any given medium preserves itself. Media do not preserve themselves. People preserve media or media get bizarrely lucky. We need not panic over digital preservation any more than we panic about print. Approach digital preservation the same way you approach print preservation. Strategically: this approach helps your colleagues get a grip, too. Your colleagues may well be the biggest barrier to digital preservation in your library! </li>
<li> 15. In your groups... List important threats to digital data. </li>
<li> 16. Physical medium failure </li>
<li> 17. Bitrot </li>
<li> 18. File format obsolescence </li>
<li> 19. Forgetting what you have </li>
<li> 20. Forgetting what the stuff you have means </li>
<li> 21. Rights and DRM </li>
<li> 22.
Lack (or disappearance) of organizational commitment </li>
<li> 23. One word: Geocities. </li>
<li> 24. Ignorance: "It's in Google, so it's preserved." (Not even Google Books!) "I make backups, so I'm fine." "I have a graduate student who takes care of these things." "Metadata? What's that? I have to have it?" "Digital preservation is an unsolvable problem, so why even try?" (I've heard this one from librarians. I bet you have too.) </li>
<li> 25. Apathy </li>
<li> 26. Armageddon </li>
<li> 27. Salo's needs pyramid, from top (less immediate, less tractable) to bottom (more immediate, more tractable): Fidelity to original; Usability; Format viability; Bitrot; Physical medium issues; Acquisition issues </li>
<li> 28. Mitigating the risks </li>
<li> 29. But first, a word about failure: "We can't save everything digital!" Well, no, we can't. We can't save everything printed either. That's no excuse, in either medium. Why do we let it be one for digital materials? Yes, we will lose some stuff. That's life in the big city. Dive in anyway. </li>
<li> 30. And a word about scale: Many of those currently panicking about digital preservation are thinking about huge scales. At some repository size, bitrot happens faster than you can detect and fix it. Last I heard, this was somewhere in the exabyte range. We're not. So let's relax about some of this stuff. At our scale, many problems are solvable. Unless your problem is digital video. Good luck with that. Our scale problems happen on the front end, as we've been learning this week. </li>
<li> 31. Physical medium failure: Gold CDs are not the panacea we thought. They're not bad; they're just hard to audit, so they fail (when they fail) silently. Silent failure is DEADLY. How long will hardware be able to read them? ALL such physical media are risky, for the same reasons! Current state of the art: get it on spinning disk. Back up often. Distribute your backups geographically. Test them now and then. Consider a LOCKSS cooperative agreement. Others have. Any physical medium WILL FAIL.
Have a plan for when it does. </li>
<li> 32. Bitrot: Sometimes used for "file format obsolescence." I use it for "the bits flipped unexpectedly." Checking a file bit-by-bit against a backup copy is computationally impractical for every day. Though on ingest it's a good idea to verify bit-by-bit! Checksums: A file is, fundamentally, a great big number. Do math on the number (the file). Store the result as metadata. To check for bitrot, redo the math and check the answer against the stored result. If they're different, scream. Several checksum algorithms; for our purposes, which one you use doesn't matter much. </li>
<li> 33. File format obsolescence: When possible, prefer file formats that are: Open/non-proprietary. (If a software vendor goes out of business, does their format?) Documented. Standardized, non-patent-encumbered. In widespread use. (If the format dies, lots of people have incentive to solve the problem.) For text, non-binary. For everything else, lossless rather than lossy. For compound objects, compound documents rather than embedded. Realistically? We often have to take what we're given. </li>
<li> 34. Lossless? Lossy? What? Essential tradeoff: quality and fidelity vs. file size. Clipping information out makes the file size smaller! But once it's gone, it's gone. Tremendous problem with video. Lossless video formats are HUGE. Lossy image formats: JPEG, JPEG2000 (much less so). (More or less) lossless: TIFF, PNG, GIF. Compression may be lossless or lossy. Find out! </li>
<li> 35. Example: JPG </li>
<li> 36. Audio formats: I am NOT going to talk about codecs vs. container formats. Consider it homework. No ideal choice here; lossless formats are patent-encumbered and/or proprietary. WAV and AIFF are okay. Ogg Vorbis is ideal, but nobody supports it. mp3: if you must; it's lossy. </li>
<li> 37. Migration vs. emulation. Migration: move the file to a new format. Don't throw away your original! You may have made the wrong migration decision. Not necessarily a lossless process. (Fonts!)
Emulation: create a modern hardware/software environment that can deal with the old format. For some cultural artifacts such as games, this is the only reasonable option. Emulation advocates make big claims that I'm not sure they can back up. Proceed with caution. </li>
<li> 38. Normalization: Migration of a dataset toward a well-defined target. Treat "the same thing" the same way. E.g. census data... define a set of data tables, move all data into them. Great for interoperability and preservation! Pitfall: "the same thing"? Humanities: TEI is a de facto normalizer for humanities textual data. (Other XML formats in other fields: e.g. ChemML, NLM DTD.) </li>
<li> 39. Problem: BEHAVIOR. Migration can preserve information content and (often but not always) appearance. Preserving interaction patterns is much harder! E.g. a web page containing JavaScript. Or a database with a query engine. Or an applet or Flash object. Or a collection whose interactions are based on an obsolete software system. (DynaText anyone?) Hard problem. No obvious solutions; certainly no easy ones. </li>
<li> 40. When is a PDF not a PDF? When it's a .doc with the wrong file extension. When there's no file extension on it at all. When it's so old it doesn't follow the standardized PDF conventions. When it's otherwise malformed, made by a bad piece of software. How do you know whether you have a good PDF? (Or .doc, or .jpg, or .xml, or anything else.) </li>
<li> 41. File format registries and testing tools. JHOVE: JSTOR/Harvard Object Validation Environment. Java software intended to be pluggable into other software environments. Answers "What format is this thing?" and "Is this thing a good example of the format?" Limited repertoire of formats. PRONOM/DROID + GDFR = Unified Digital Formats Registry. </li>
<li> 42. Forgetting what you have: Absolutely pernicious problem. We don't know what we have to begin with! Do you know how much Faculty Stuff is scattered throughout your institution's .edu domain? Me neither. But I know it's a lot.
How much of that is irreplaceable? We're also bad at labelling and tracking what we have. No easy answer to this one; the solution lies in a complete praxis reinvention. Yeah. Good luck with that. </li>
<li> 43. ... but I thought you meant in libraries, Dorothea! Come on, we've solved that one: Metadata! Once it's in the library, it's probably fine. The real problem is all that Other Stuff Out There. This is a collection-development problem and should be treated as one. Don't dump it on some poor digital preservation librarian! That flat out doesn't scale. Don't make the mistake of drawing thick lines around...</li></ul>
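
The checksum routine on the Bitrot slide (slide 32) can be sketched in a few lines of Python. This is only an illustration, not anything shown in the talk: the function names `file_checksum` and `audit`, the choice of SHA-256, and the idea of keeping stored checksums in a plain path-to-digest dictionary are all assumptions made for the example.

```python
import hashlib
from pathlib import Path


def file_checksum(path, algorithm="sha256", chunk_size=1 << 20):
    """Do the math on the file: return its hex digest, reading in chunks."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large files never sit fully in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def audit(stored_checksums, algorithm="sha256"):
    """Redo the math and return the paths whose answer no longer matches.

    stored_checksums is a hypothetical mapping of path -> digest recorded
    at ingest. A missing file is reported as a failure as well.
    """
    failures = []
    for path, expected in stored_checksums.items():
        if not Path(path).is_file() or file_checksum(path, algorithm) != expected:
            failures.append(path)
    return failures
```

At ingest, one would record `file_checksum(path)` for each file as metadata; a periodic job could then call `audit(...)` and "scream" (alert a person) whenever the returned list is not empty.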
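
The first two cases on the "When is a PDF not a PDF?" slide (slide 40), a wrong extension or no extension at all, can be caught by comparing a file's extension with its leading bytes. The sketch below is not JHOVE or DROID and does not use their APIs; `sniff_format`, `check_extension`, and the small signature table are hypothetical names chosen for illustration. It only answers "What format is this thing?" for a handful of formats and says nothing about whether the file is well formed.

```python
from pathlib import Path

# Leading-byte signatures for a few common formats (deliberately incomplete).
SIGNATURES = [
    (b"%PDF-", "pdf"),
    (b"\xff\xd8\xff", "jpeg"),
    (b"\x89PNG\r\n\x1a\n", "png"),
    (b"GIF87a", "gif"),
    (b"GIF89a", "gif"),
    (b"II*\x00", "tiff"),   # little-endian TIFF
    (b"MM\x00*", "tiff"),   # big-endian TIFF
    (b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1", "ole2"),  # container used by legacy .doc
]

# What each extension claims the file is.
EXTENSIONS = {
    ".pdf": "pdf", ".jpg": "jpeg", ".jpeg": "jpeg", ".png": "png",
    ".gif": "gif", ".tif": "tiff", ".tiff": "tiff", ".doc": "ole2",
}


def sniff_format(header):
    """Return a format label for the leading bytes, or None if unrecognized."""
    for magic, label in SIGNATURES:
        if header.startswith(magic):
            return label
    return None


def check_extension(path):
    """Return (claimed, detected): what the extension says vs. what the bytes say."""
    with open(path, "rb") as handle:
        header = handle.read(16)
    claimed = EXTENSIONS.get(Path(path).suffix.lower())
    return claimed, sniff_format(header)
```

A result where the two values differ, or where `claimed` is `None`, flags a file for closer inspection with a real validation tool.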

