Transcript
Page 1: Digital preservation and institutional repositories

Institutional repositories for the digital arts and humanities

Dorothea Salo, University of Wisconsin

Page 2: Digital preservation and institutional repositories

Preservation for the digital arts and humanities

Dorothea Salo, University of Wisconsin

Page 3: Digital preservation and institutional repositories

Dorothea Salo, University of Wisconsin

Preservation and institutional repositories for the digital arts and humanities

Page 4: Digital preservation and institutional repositories

And I said...

... you’re giving me how much time for this?

Page 5: Digital preservation and institutional repositories

Threat model

•“Preservation” means nothing unmodified.

• This is why it becomes such a bogeyman!

•Two things you need to know first:

• why you’re preserving what you’re preserving, and

• what you’re preserving it against.

•Your collection-development policy should inform the first question.

• Your coll-dev policy doesn’t include local born-digital or digitized materials? This is a problem. Fix it.

•The second question is your “threat model.”

Page 6: Digital preservation and institutional repositories

What is your threat model for print?

Page 7: Digital preservation and institutional repositories

Homelessness

Page 8: Digital preservation and institutional repositories

Water

Page 9: Digital preservation and institutional repositories

Bad materials

Page 10: Digital preservation and institutional repositories

Flora and fauna

Page 11: Digital preservation and institutional repositories

Physical damage

Page 12: Digital preservation and institutional repositories

Loss or destruction

Page 13: Digital preservation and institutional repositories

Armageddon

Page 14: Digital preservation and institutional repositories

Why did I just make you do that?

•I’m weird.

• I’m trying to destroy the myth that any given medium “preserves itself.”

• Media do not preserve themselves. People preserve media, or media get bizarrely lucky.

•We need not panic over digital preservation any more than we panic about print.

• Approach digital preservation the same way you approach print preservation.

• Strategically: this approach helps your colleagues get a grip, too. Your colleagues may well be the biggest barrier to digital preservation in your library!

Page 15: Digital preservation and institutional repositories

In your groups...

List important threats to digital data.

Page 16: Digital preservation and institutional repositories

Physical medium failure

Page 17: Digital preservation and institutional repositories

“Bitrot”

Page 18: Digital preservation and institutional repositories

File format obsolescence

Page 19: Digital preservation and institutional repositories

Forgetting what you have

Page 20: Digital preservation and institutional repositories

Forgetting what the stuff you have means

Page 21: Digital preservation and institutional repositories

Rights and DRM

Page 22: Digital preservation and institutional repositories

Lack (or disappearance) of organizational commitment

Page 23: Digital preservation and institutional repositories

One word: Geocities.

Page 24: Digital preservation and institutional repositories

Ignorance

•“It’s in Google, so it’s preserved.” (Not even “Google Books!”)

•“I make backups, so I’m fine.”

•“I have a graduate student who takes care of these things.”

•“Metadata? What’s that? I have to have it?”

•“Digital preservation is an unsolvable problem, so why even try?” (I’ve heard this one from librarians. I bet you have too.)

Page 25: Digital preservation and institutional repositories

Apathy

Page 26: Digital preservation and institutional repositories

Armageddon

Page 27: Digital preservation and institutional repositories

Salo’s needs pyramid

[Pyramid diagram. Axis labels: “More tractable” to “Less tractable”; “More immediate” to “Less immediate.” Layer labels, in extracted order: Physical medium issues; Bitrot; Format viability; Fidelity to original; Usability; Acquisition issues.]

Page 28: Digital preservation and institutional repositories

Mitigating the risks

Page 29: Digital preservation and institutional repositories

But first, a word about failure

•“We can’t save everything digital!”

•Well, no, we can’t.

•We can’t save everything printed either.

•That’s no excuse, in either medium. Why do we let it be one for digital materials?

•Yes, we will lose some stuff. That’s life in the big city. Dive in anyway.

Page 30: Digital preservation and institutional repositories

And a word about scale

•Many of those currently panicking about digital preservation are thinking about huge scales.

• At some repository size, bitrot happens faster than you can detect and fix it.

• Last I heard, this was somewhere in the exabyte range.

•We’re not. So let’s relax about some of this stuff. At our scale, many problems are solvable.

• Unless your problem is digital video. Good luck with that.

•Our scale problems happen on the front end, as we’ve been learning this week.

Page 31: Digital preservation and institutional repositories

Physical medium failure

•Gold CDs are not the panacea we thought.

• They’re not bad; they’re just hard to audit, so they fail (when they fail) silently. Silent failure is DEADLY.

• How long will hardware be able to read them?

• ALL such physical media are risky, for the same reasons!

•Current state of the art: get it on spinning disk.

•Back up often. Distribute your backups geographically. Test them now and then.

• Consider a LOCKSS cooperative agreement. Others have.

•Any physical medium WILL FAIL. Have a plan for when it does.

Page 32: Digital preservation and institutional repositories

Bitrot

•Sometimes used for “file format obsolescence.”

• I use it for “the bits flipped unexpectedly.”

•Checking a file bit-by-bit against a backup copy is computationally impractical for every day.

• Though on ingest it’s a good idea to verify bit-by-bit!

•Checksums

• A file is, fundamentally, a great big number.

• Do math on the number. Store the result as metadata.

• To check for bitrot, redo the math and check the answer against the stored result. If they’re different, scream.

• Several checksum algorithms; for our purposes, which one you use doesn’t matter much.
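The checksum routine described on this slide can be sketched in a few lines of Python. This is only an illustration, not any repository's actual implementation: it assumes SHA-256 from the standard hashlib module (any of the common algorithms would do), and the function names `file_checksum`, `has_bitrot`, and `copies_match` are made up for the example. In practice the stored checksum would live in the item's metadata.

```python
import filecmp
import hashlib


def file_checksum(path, algorithm="sha256", chunk_size=1 << 20):
    """Do the math on the 'great big number': hash the file's bytes in chunks."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def has_bitrot(path, stored_checksum, algorithm="sha256"):
    """Redo the math; True means the answers differ (time to scream)."""
    return file_checksum(path, algorithm) != stored_checksum


def copies_match(original_path, copy_path):
    """Bit-by-bit comparison of two copies: slow, but worth doing once on ingest."""
    return filecmp.cmp(original_path, copy_path, shallow=False)
```

A scheduled job could call `has_bitrot` on every stored file and report any that return True; `copies_match` is the expensive check reserved for the moment a file enters the repository.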

Page 33: Digital preservation and institutional repositories

File format obsolescence

•When possible, prefer file formats that are:

• Open/non-proprietary. (If a software vendor goes out of business, does their format?)

• Documented

• Standardized, non-patent-encumbered

• In widespread use. (If the format dies, lots of people have incentive to solve the problem.)

• For text, non-binary

• For everything else, lossless rather than lossy

• For compound objects, compound documents rather than embedded

•Realistically? We often have to take what we’re given.

Page 34: Digital preservation and institutional repositories

Lossless? Lossy? What?

•Essential tradeoff: quality and fidelity vs. file size

•Clipping information out makes the file size smaller! But once it’s gone, it’s gone.

•Tremendous problem with video. Lossless video formats are HUGE.

•Lossy image formats: JPEG, JPEG2000 (much less so)

• (more or less) Lossless: TIFF, PNG, GIF

•Compression may be lossless or lossy. Find out!
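A toy illustration of the tradeoff, in Python. It assumes nothing about real image codecs: zlib stands in for a lossless compressor, and "lossy" here simply means discarding the low four bits of every byte, which is far cruder than what JPEG does. The sample data is invented for the example.

```python
import zlib

# Made-up sample data standing in for pixel or audio sample values.
original = bytes(range(256)) * 4

# Lossless: compress, decompress, and every byte comes back.
restored = zlib.decompress(zlib.compress(original))
print("lossless round trip identical:", restored == original)

# Lossy (toy version): throw away the low 4 bits of every byte.
lossy = bytes(value & 0xF0 for value in original)
print("lossy copy identical:", lossy == original)
print("distinct values before:", len(set(original)), "after:", len(set(lossy)))

# The clipped data compresses smaller, but the discarded bits are gone for good.
print("compressed sizes:", len(zlib.compress(original)), "vs", len(zlib.compress(lossy)))
```

The point of the sketch is the asymmetry: the lossless path can always be reversed, while nothing in the lossy copy records which of the sixteen possible low-bit values each byte once had.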

Page 35: Digital preservation and institutional repositories

Example: JPG

Page 36: Digital preservation and institutional repositories

Audio formats

•I am NOT going to talk about codecs vs. container formats. Consider it homework.

•No ideal choice here; lossless formats are patent-encumbered and/or proprietary

•WAV and AIFF are okay. Ogg Vorbis is ideal, but nobody supports it.

•MP3: only if you must; it’s lossy.

Page 37: Digital preservation and institutional repositories

Migration vs. emulation

•Migration: move the file to a new format

• Don’t throw away your original! You may have made the wrong migration decision.

• Not necessarily a lossless process. (Fonts!)

•Emulation: create a modern hardware/software environment that can deal with the old format

• For some cultural artifacts such as games, this is the only reasonable option.

• Emulation advocates make big claims that I’m not sure they can back up. Proceed with caution.

Page 38: Digital preservation and institutional repositories

Normalization

•Migration of a dataset toward a well-defined target.

• “Treat the same thing the same way.”

• E.g. census data... define a set of data tables, move all data into them.

• Great for interoperability and preservation!

•Pitfall: “the same thing”?

•Humanities: TEI is a de facto normalizer for humanities textual data.

• (Other XML formats in other fields: e.g. ChemML, NLM DTD.)
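A minimal sketch of normalization in Python, loosely following the census example. Everything specific here is hypothetical: the target fields, the alias table, and the sample records are invented for illustration and do not come from any real census schema.

```python
# Hypothetical target table: every record ends up with exactly these columns.
TARGET_FIELDS = ("name", "birth_year", "county")

# Hypothetical mapping from field names seen in source files to target names.
FIELD_ALIASES = {
    "full_name": "name",
    "yob": "birth_year",
    "born": "birth_year",
    "shire": "county",  # a judgment call: is a shire "the same thing" as a county?
}


def normalize(record):
    """Move one source record into the target layout, dropping unknown fields."""
    normalized = dict.fromkeys(TARGET_FIELDS)
    for field, value in record.items():
        target = FIELD_ALIASES.get(field.lower(), field.lower())
        if target in normalized:
            normalized[target] = value
    if normalized["birth_year"] is not None:
        normalized["birth_year"] = int(normalized["birth_year"])
    return normalized


# Two made-up records from differently structured sources.
sources = [
    {"Full_Name": "Ada Example", "YOB": "1851", "Shire": "Dane"},
    {"name": "Ben Example", "born": 1849, "county": "Dane", "notes": "illegible"},
]
for row in sources:
    print(normalize(row))
```

Note that the second record's "notes" field is silently dropped, and that the alias table quietly decides what counts as "the same thing." Both are reasons to keep the un-normalized originals.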

Page 39: Digital preservation and institutional repositories

Problem: BEHAVIOR.

•Migration can preserve information content and (often but not always) appearance.

•Preserving interaction patterns is much harder!

• E.g. a web page containing JavaScript

• Or a database with a query engine

• Or an applet or Flash object

• Or a collection whose interactions are based on an obsolete software system. (DynaText anyone?)

•Hard problem. No obvious solutions; certainly no easy ones.

Page 40: Digital preservation and institutional repositories

When is a PDF not a PDF?

•When it’s a .doc with the wrong file extension

•When there’s no file extension on it at all

•When it’s so old it doesn’t follow the standardized PDF conventions

•When it’s otherwise malformed, made by a bad piece of software.

•How do you know whether you have a good PDF? (Or .doc, or .jpg, or .xml, or anything else.)
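One partial answer is to ignore the file extension and look at the first few bytes of the file, where many formats put a recognizable signature. The Python sketch below assumes a tiny hand-picked signature table and a made-up function name, `sniff_format`; it only guesses the format and says nothing about whether the file is well-formed, which is the harder half of the question.

```python
# A small, hand-picked table of leading-byte signatures ("magic numbers").
MAGIC_NUMBERS = {
    b"%PDF-": "PDF",
    b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1": "OLE2 container (e.g. legacy .doc)",
    b"\xff\xd8\xff": "JPEG",
    b"\x89PNG\r\n\x1a\n": "PNG",
}


def sniff_format(path):
    """Guess a file's format from its leading bytes, ignoring its extension."""
    with open(path, "rb") as handle:
        header = handle.read(16)
    for magic, name in MAGIC_NUMBERS.items():
        if header.startswith(magic):
            return name
    return "unknown"
```

A file named thesis.pdf that begins with the OLE2 signature would be reported as an OLE2 container, catching the first case on this slide; an ancient or malformed PDF would still pass, because its first bytes look fine.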

Page 41: Digital preservation and institutional repositories

File format registries and testing tools

•JHOVE: JSTOR/Harvard Object Validation Environment

• Java software intended to be pluggable into other software environments

• Answers “What format is this thing?” and “Is this thing a good example of the format?”

• Limited repertoire of formats

•PRONOM/DROID + GDFR = Unified Digital Formats Registry

Page 42: Digital preservation and institutional repositories

Forgetting what you have

•Absolutely pernicious problem. We don’t know what we have to begin with!

• Do you know how much Faculty Stuff is scattered throughout your institution’s .edu domain? Me neither. But I know it’s a lot. How much of that is irreplaceable?

•We’re also bad at labelling and tracking what we have.

•No easy answer to this one; the solution lies in a complete praxis reinvention.

• Yeah. Good luck with that.

Page 43: Digital preservation and institutional repositories

... but I thought you meant in libraries, Dorothea!

•Come on, we’ve solved that one: Metadata!

•Once it’s in the library, it’s probably fine. The real problem is all that Other Stuff Out There.

•This is a collection-development problem and should be treated as one.

• Don’t dump it on some poor “digital preservation librarian!” That flat out doesn’t scale.

• Don’t make the mistake of drawing thick lines around “our stuff” and “their stuff.” Like it or not, our coll-dev universe has moved beyond what’s published and what’s canonically “library.”

Page 44: Digital preservation and institutional repositories

What the stuff you have means

•Collect whatever it takes to answer this question:

• If the owner of this material were hit by a bus tomorrow, what would be needed for others to use it?

•Nasty discipline-specific problem.

• This is what the NARA/RLG Trusted Digital Repository checklist is aiming at with “designated community.”

• Where NARA/RLG goes off the rails is assuming you have to go through this exercise with EVERYTHING YOU HAVE.

• Data-dictionaries, algorithms, specifications, tech metadata, whatever it takes. Use common sense!

Page 45: Digital preservation and institutional repositories

Rights and DRM

•Not having IP rights to something may mean you can’t preserve it.

• Brian Lavoie writes well about this problem.

• Copyright law and its exceptions haven’t caught up to the digital age!

• Third-party services (e.g. blogs, iTunes U, Slideshare) are a headache here.

•DRM means that no matter the rights situation, you’re stuck.

• PDFs: Users turn on “security” features. This is DRM. Tell them not to do that!

• Huge headache with third-party services, again.

Page 46: Digital preservation and institutional repositories

... and other hassles

•Privacy, confidentiality, and human-subject research issues

• Think “we’re the humanities; IRBs don’t happen to us”? Think again. One word: FOLKLORE.

•Third-party copyright

• Campus musical or dramatic performances

• Issues of cultural sensitivity, heritage, repatriation

•You need a dark (or at least dim) archive if you’re serious about digital preservation. There is no way around this. Sorry.

Page 47: Digital preservation and institutional repositories

Organizational commitment

•There is only one answer: POLICY.

•Unfortunately, it’s not a quick, easy, or uncomplicated answer.

• Digital preservation costs money.

• People in high places are scared of it.

• It requires process and staff change.

•You have to make the case. And then make it again. And again. Until they get it!

• Where I am, Somebody Else’s Problem fields are everywhere around this issue.

Page 48: Digital preservation and institutional repositories

You are probably the preservation option of last resort.

Be prepared for anything excluded from your policy to disappear.

Page 49: Digital preservation and institutional repositories

When organizations fail

•Remember Geocities? We’re worse.

• Mellon: Can’t make a list of its funded on-the-web projects, because most of them are GONE. G-O-N-E.

•We do not, as a profession, have a safety net for each other’s projects and materials.

•This is, frankly, unconscionable.

• I don’t know how to fix it; I am just warning you that project rescues are and will continue to be necessary.

• Institutional boundaries are a barrier here.

Page 50: Digital preservation and institutional repositories

Great policy guidance

•Policy-making for research data in repositories: a guide

• http://www.disc-uk.org/docs/guide.pdf

•Practical data management: a legal and policy guide

• http://eprints.qut.edu.au/archive/00014923/01/Microsoft_Word_-_Practical_Data_Management_-_A_Legal_and_Policy_Guide_doc.pdf

• Australian, so take “legal” with a grain of salt

•Guide to social science data preparation and archiving

• http://www.icpsr.umich.edu/ICPSR/access/dataprep.pdf

Page 51: Digital preservation and institutional repositories

Summary: the OAIS model

•“Reference model” for archival systems

• All theory, no praxis, by design. (Because praxis changes!)

•Four parts

• Vocabulary

• Data (and interaction) model

• Required responsibilities of an archive

• Recommended functions (in the computer-programming sense) for carrying out those responsibilities

•My favorite distillation: Ockerbloom

• http://everybodyslibraries.com/2008/10/13/what-repositories-do-the-oais-model/

Page 52: Digital preservation and institutional repositories

Institutional repositories

Page 53: Digital preservation and institutional repositories

For our purposes...

•We’re talking about the software.

• I’m not going to rant (much) about what IRs are for or how they’re run.

• If you want that, read Roach Motel. Better yet, read Palmer et al. 2009.

•We’re interested in the application (or lack thereof) of IRs to data curation in the arts and humanities. Right? Right.

• I’m not afraid of the technical, and you shouldn’t be either.

Page 54: Digital preservation and institutional repositories

IR software

•Open source

• Fedora Commons: http://fedora-commons.info/

• DSpace: http://dspace.org/

• EPrints: http://eprints.org/

•Commercial

• ContentDM: http://contentdm.com/

• VTLS/Vital: http://www.vtls.com/products/vital

•Hosted

• ContentDM: http://contentdm.com/

• BePress: http://bepress.com/

• Open Repository (based on DSpace): http://www.openrepository.com/

• Digitool: http://www.exlibrisgroup.com/category/DigiToolOverview

Page 55: Digital preservation and institutional repositories

In your groups...

Please brainstorm common examples of A&H digital content requiring preservation.

Page 56: Digital preservation and institutional repositories

Common A&H use-cases

•Image collections

•Page-scanned books (with or without OCR)

•Marked-up books

•Theses and dissertations

•Website preservation

•Audio and video

•Complex multimedia

•Database (linguistic, geographic...)

•Software

Page 57: Digital preservation and institutional repositories

In your groups...

Please brainstorm how you and your patrons expect to use and interact with these genres of data.

Make a list of verbs.

Page 58: Digital preservation and institutional repositories

What they’ll tell you

“We have an institutional repository. You can put everything there!”

Page 59: Digital preservation and institutional repositories

How you must not respond

Page 60: Digital preservation and institutional repositories

The IR content use-case

•A research paper

• In a single file; possibly more than one format available

• Is not related to any other item in the history of ever

•The user can download it, and... um... just download it, really.

Page 61: Digital preservation and institutional repositories

How much of our stuff does that work for?

•Image collections

•Page-scanned books (with or without OCR)

•Marked-up books

•Theses and dissertations

•Website preservation

•Audio and video

•Complex multimedia

•Database (linguistic, geographic...)

•Software

Page 62: Digital preservation and institutional repositories

One user interface does not fit all

Page 63: Digital preservation and institutional repositories

One metadata standard does not fit all

•EAD

•METS

•VRA Core

•MODS

•TEI Header

•Dublin Core

•MARC

• ... the beat goes on.

•The simple fact is that EPrints and DSpace do Dublin Core, METS, and nothing else natively. This is simply inadequate for humanities data curation.

Page 64: Digital preservation and institutional repositories

One file format does not fit all

•Yes, we have to take what we get.

•With discrete files, most IR software is fine.

•Forget about streaming audio/video.

•DSpace is good with static websites.

•For other composite objects, you’re in trouble.

•For anything like a database, you’re in trouble.

Page 65: Digital preservation and institutional repositories

The DSpace/EPrints view of the universe

•Communities and collections

•“EPeople”

• must be given explicit permission to add or edit materials

•Metadata entry forms

• DSpace: fields configurable by collection

• EPrints: auto-configures fields based on content type

•Files/bitstreams

• Many permitted per item; must upload one by one in DSpace!

• Get friendly with the DSpace batch importer. You’ll need it.
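For a sense of what feeding the batch importer involves, here is a rough Python sketch that lays out one item the way DSpace's Simple Archive Format is generally documented: a directory per item holding a dublin_core.xml file, a contents file listing the bitstreams, and the files themselves. The function name `write_saf_item` and its arguments are invented for this example, and the layout should be checked against the documentation for your DSpace version before relying on it.

```python
import shutil
import xml.etree.ElementTree as ET
from pathlib import Path


def write_saf_item(archive_dir, item_name, metadata, files):
    """Write one item directory: dublin_core.xml, a contents file, and the files.

    metadata is a list of (element, qualifier, value) tuples;
    files is a list of paths to copy into the item directory.
    """
    item_dir = Path(archive_dir) / item_name
    item_dir.mkdir(parents=True, exist_ok=True)

    # One <dcvalue> per metadata statement, as the importer expects.
    root = ET.Element("dublin_core")
    for element, qualifier, value in metadata:
        dcvalue = ET.SubElement(root, "dcvalue", element=element, qualifier=qualifier)
        dcvalue.text = value
    ET.ElementTree(root).write(
        str(item_dir / "dublin_core.xml"), encoding="utf-8", xml_declaration=True
    )

    # Copy each file in and list its name, one per line, in "contents".
    names = []
    for source in files:
        source = Path(source)
        shutil.copy(source, item_dir / source.name)
        names.append(source.name)
    (item_dir / "contents").write_text("\n".join(names) + "\n", encoding="utf-8")
    return item_dir
```

Calling this once per item, with a list such as `[("title", "none", "Some title"), ("date", "issued", "2009")]`, builds a directory tree that can then be handed to the importer, which beats uploading bitstreams one by one through the web form.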

Page 66: Digital preservation and institutional repositories

The Fedora view of the universe

•You can do anything at all with anything at all as long as you’re willing to tell Fedora how to do it. Infinite flexibility! But also infinite responsibility.

•“Content model:” what’s in this thing?

•“Service:” what should the user-interface do with what’s in this thing?

•Metadata, relationships, stuff

Page 67: Digital preservation and institutional repositories

Can you use Fedora for an IR?

•Yes, but not alone; you need all the Content Models and Services bolted on top.

•Try Islandora or Muradora. Fez is a last resort; it acts like DSpace, and this is not a good thing.

•Even if you can’t build a real Fedora digital library now, you may not be able to do so in future if you stick with DSpace...

• ... but the Fedora/DSpace merger may change things.

Page 68: Digital preservation and institutional repositories

What is this FOXML stuff anyway?

•Think of it as the Fedora batch-import format.

• It’s complex! But it can absorb any amount or type of XML metadata or data, which is really quite nice.

Page 69: Digital preservation and institutional repositories

Summing up

•Out-of-the-box IR software will handle some A&H data-curation jobs adequately...

• ... but by no means all of them.

• If you need sophisticated UI, bite the bullet and go with Fedora. Islandora and Muradora make Fedora simpler for simple things than it once was.

• If you don’t need sophisticated user-facing UI, go with EPrints.

•DSpace is a loser choice.

Page 70: Digital preservation and institutional repositories

Credits

• Watch: http://www.flickr.com/photos/fdecomite/406635986/

• Wet book: http://www.flickr.com/photos/dno1967/2979040762/

• “Bookworm and Bug Juice”: http://www.flickr.com/photos/modestospeed/576659116/

• Moldy books: http://www.flickr.com/photos/umjanedoan/496656416/

• Damaged book: http://www.flickr.com/photos/donabelandewen/3375108358/

• Carnegie library: http://www.flickr.com/photos/jhoweaa/436923541/

• Floppy box: http://www.flickr.com/photos/rintakumpu/2684989757/

• Floppy art: http://www.flickr.com/photos/bludgeoner86/2507833950/

• Bitrot: http://www.flickr.com/photos/raver_mikey/2865543940/

• Escape the ring: http://www.flickr.com/photos/hydropeek/2611071166/

• Obsolete grownups: http://www.flickr.com/photos/nietsdoener/1091201075/

• Confusion: http://www.flickr.com/photos/flavinsky/3411791256/

• Confusion II: http://www.flickr.com/photos/demibrooke/2550349404/

• Axeman: http://www.flickr.com/photos/27888428@N00/3163030403/

• Lazy dazy: http://www.flickr.com/photos/hmk/2742398590/

• DRM/Orwell: http://www.flickr.com/photos/jbonnain/523672080/

• Mushroom cloud: http://www.flickr.com/photos/nicholas_t/543334336/

• Pollock: http://www.flickr.com/photos/redneck/215447253/

Page 71: Digital preservation and institutional repositories

Thank you!

•This presentation is available under a Creative Commons Attribution 3.0 United States license.

•Please remember to credit images if you reuse individual slides. Thank you!

