Digital Preservation at ODU

The presentation given for the RRAC meeting on 10-20-2010. This is a summary of the research efforts in Digital Preservation at Old Dominion University.

Page 1: Digital Preservation at ODU

Digital Preservation Research at Old Dominion University

Justin F. Brunelle

The MITRE Corporation

Old Dominion University

(And hopefully MITRE, soon)

Page 2: Digital Preservation at ODU

Why are we listening?

• Overview of the problem

• BRIEF introduction to ODU WSDL group research

• Memento

• I’ll be skipping around, so don’t hesitate to interrupt me

Page 3: Digital Preservation at ODU

Digital Preservation

• Using the past Web

– Focus of our research

• Temporal Browsing

– Sessions in the past

• Recovering Lost Pages

– Is it really gone?

• 404s

– How to fix broken links?

Page 4: Digital Preservation at ODU

Change on the Web

1. Same URI maps to same or very similar content at a later time

Time A: U1 → C1; time B: U1 → C1

2. Same URI maps to different content at a later time

Time A: U1 → C1; time B: U1 → C2

3. Different URI maps to same or very similar content at the same or at a later time

Time A: U1 → C1; time B: U2 → C1, U1 → 404

4. The content cannot be found at any URI

Time A: U1 → C1; time B: U1 → ??
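The four kinds of change on this slide can be told apart mechanically once content at time A and a snapshot of the Web at time B are available. The sketch below is only an illustration: `classify_change`, the 0.9 similarity threshold, and the use of `difflib` for "very similar" are assumptions, not something the slides specify.

```python
from difflib import SequenceMatcher
from typing import Dict, Optional

SIMILARITY_THRESHOLD = 0.9  # assumed cut-off for "same or very similar"


def similar(a: str, b: str) -> bool:
    """Treat two texts as the same content if their similarity ratio is high."""
    return SequenceMatcher(None, a, b).ratio() >= SIMILARITY_THRESHOLD


def classify_change(uri: str, content_a: str,
                    web_b: Dict[str, Optional[str]]) -> int:
    """Return the slide's case number (1 to 4).

    web_b maps each URI to its content at time B; None stands for a 404.
    """
    content_b = web_b.get(uri)
    if content_b is not None and similar(content_a, content_b):
        return 1  # same URI, same or very similar content
    for other_uri, other_content in web_b.items():
        if (other_uri != uri and other_content is not None
                and similar(content_a, other_content)):
            return 3  # the content now lives at a different URI
    if content_b is not None:
        return 2  # same URI, different content
    return 4  # URI is gone and the content was not found anywhere
```

When the old URI serves new content and the old content has also moved, this sketch reports case 3; that tie-break is a simplification chosen for the example.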

Page 5: Digital Preservation at ODU

Time to Talk About Saving Everything?

Dinner for one or two costs more than 1TB disk

Wikis have popularized versioning

Cool URIs (http://www.w3.org/Provider/Style/URI.html) are widely adopted, e.g.:

http://news.yahoo.com/s/ap/20100920/ap_on_el_se/us_alaska_senate

http://d.yimg.com/a/p/ap/20100918/capt.67567dbc0a874b689f0b4a5c392f379c-67567dbc0a874b689f0b4a5c392f379c-0.jpg

http://d.yimg.com/a/p/afp/20100918/thumb.photo_1284846332993-1-0.jpg

Also related projects with cool URI / permalink focus: http://www.citability.org/ http://data.gov/ http://data.gov.uk/

Page 6: Digital Preservation at ODU

Fortress Model

• Get a lot of money

• Buy lots of storage

• Hire lots of people

• “Look upon my archive ye Mighty, and despair!”

Page 7: Digital Preservation at ODU

Alternate Methods

• Lazy Preservation (McCown)

– “How much preservation do I get if I do absolutely nothing?”

• Just-In-Time Preservation (Klein)

– Wait for it to disappear, then find a “good ‘nuff” version

• Shared Infrastructure Preservation

– Push content to sites that might preserve it

• arXiv.org, IA, WebCite…

• Server Enhanced Preservation

– Create archival-ready resources

Page 8: Digital Preservation at ODU

And Soon…

• Social Preservation

– Preserving resources using 3rd party Web Services

– Repository for OAI-ORE ReMs

– Social network feel

– Lazy-esque, server-side reconstruction

Page 9: Digital Preservation at ODU

But I digress…

• Few years away…

• Preliminary research

• And now back to the prior research…

Page 10: Digital Preservation at ODU

Web Infrastructure (McCown, 2007)

Page 11: Digital Preservation at ODU

WayBack Machine

http://web.archive.org/web/*/http://www.thecribs.com/

http://mementoproxy.cs.odu.edu/aggr/timemap/link/http://www.thecribs.com/

From these we can create time-based:

• indexes

• IDF values

• PageRank
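As a rough illustration of a time-based IDF value, the sketch below groups archived page texts by capture year and computes a plain log(N / df) per term within each year. The function name `idf_by_year`, the whitespace tokenizer, and the yearly granularity are all assumptions made for the example.

```python
import math
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple


def idf_by_year(mementos: Iterable[Tuple[int, str]]) -> Dict[int, Dict[str, float]]:
    """Compute IDF per capture year from (year, page text) pairs."""
    docs_per_year: Dict[int, List[Set[str]]] = defaultdict(list)
    for year, text in mementos:
        # Naive tokenizer: lowercase and split on whitespace.
        docs_per_year[year].append(set(text.lower().split()))

    result: Dict[int, Dict[str, float]] = {}
    for year, docs in docs_per_year.items():
        doc_freq: Dict[str, int] = defaultdict(int)
        for terms in docs:
            for term in terms:
                doc_freq[term] += 1
        n_docs = len(docs)
        # Plain IDF: log(number of documents / documents containing the term).
        result[year] = {t: math.log(n_docs / df) for t, df in doc_freq.items()}
    return result
```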

Page 12: Digital Preservation at ODU

Batch Recovery For Sites

http://warrick.cs.odu.edu/

Free limo rides for life?!

Page 13: Digital Preservation at ODU

Reconstruction Diagram

added 20%

identical 50%

changed 33%

missing 17%
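The percentages in such a diagram can be derived by comparing the original site with the reconstructed one. The sketch below assumes both are available as URI-to-content-hash maps. Measuring "added" against the resources recovered from the original site is an assumption, picked because it reproduces the slide's numbers for a six-resource example; the actual tool may define the denominator differently.

```python
from typing import Dict


def reconstruction_stats(original: Dict[str, str],
                         recovered: Dict[str, str]) -> Dict[str, float]:
    """Compare two URI -> content-hash maps and return percentages."""
    identical = sum(1 for u, h in original.items() if recovered.get(u) == h)
    changed = sum(1 for u, h in original.items()
                  if u in recovered and recovered[u] != h)
    missing = sum(1 for u in original if u not in recovered)
    added = sum(1 for u in recovered if u not in original)
    found = identical + changed  # original resources that were recovered
    total = len(original)
    return {
        "identical": 100 * identical / total,
        "changed": 100 * changed / total,
        "missing": 100 * missing / total,
        # Assumed denominator: resources recovered from the original site.
        "added": 100 * added / found if found else 0.0,
    }
```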

Page 14: Digital Preservation at ODU

Real-Time Recovery for URIs

Synchronicity - www.cs.odu.edu/~mklein/

Page 15: Digital Preservation at ODU

Memento wants to make navigating the Web’s Past Easy

http://www.mementoweb.org

http://groups.google.com/group/memento-dev

Page 16: Digital Preservation at ODU

What are you talking about?

• Uniform Resource Identifier (URI) ~= URL

• Resource:

– <HTML>

• Representation

Page 17: Digital Preservation at ODU

W3C Web Architecture: Resource – URI - Representation

[Diagram: a URI identifies a Resource; dereferencing the URI returns a Representation, which represents the Resource.]

Page 18: Digital Preservation at ODU

W3C Web Architecture: Resource – URI - Representation

[Diagram: a URI identifies a Resource; dereferencing the URI with content negotiation returns Representation 1 or Representation 2, each of which represents the Resource.]
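Content negotiation means one URI can yield different representations depending on what the client asks for. A minimal server-side sketch follows; `negotiate` is a hypothetical helper that ignores q-values and wildcard subtypes, so it is far simpler than real HTTP negotiation.

```python
from typing import Dict, Optional


def negotiate(representations: Dict[str, bytes], accept: str) -> Optional[str]:
    """Return the media type of the first acceptable representation.

    representations maps media types to bodies for ONE resource.
    Preference is taken from the order of the Accept header only.
    """
    for item in accept.split(","):
        media_type = item.split(";")[0].strip()
        if media_type in representations:
            return media_type
        if media_type == "*/*" and representations:
            # Client accepts anything: fall back to the first one we have.
            return next(iter(representations))
    return None  # nothing acceptable (a server would answer 406)
```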

Page 19: Digital Preservation at ODU

Resources

Page 20: Digital Preservation at ODU

Resources have Representations

Page 21: Digital Preservation at ODU

Resources have Representations that Change over Time

Page 22: Digital Preservation at ODU

Only the Current Representation is Available from a Resource

Page 23: Digital Preservation at ODU

Old Representations are Lost Forever

Page 24: Digital Preservation at ODU

Finding Archived Resources

Go to http://www.archive.org/ and search http://cnn.com

On http://web.archive.org/web/*/http://cnn.com, select desired datetime

Page 25: Digital Preservation at ODU

Archived Resources

http://web.archive.org/web/20010911203610/http://www.cnn.com/ is an archived resource for http://cnn.com (Sep 11 2001, 20:36:10 UTC)

http://en.wikipedia.org/w/index.php?title=September_11_attacks&oldid=282333 is an archived resource for http://en.wikipedia.org/wiki/September_11_attacks (Dec 20 2001, 4:51:00 UTC)
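The Wayback Machine URI above carries its capture time as a 14-digit timestamp, which matches the UTC datetime shown on the slide. A small sketch of pulling the two parts apart; `parse_wayback_uri` is a name invented for this example, and it only handles the URI shape shown here.

```python
import re
from datetime import datetime, timezone
from typing import Optional, Tuple

# Matches http://web.archive.org/web/<14 digits>/<original URI>
WAYBACK_RE = re.compile(r"^https?://web\.archive\.org/web/(\d{14})/(.+)$")


def parse_wayback_uri(uri: str) -> Optional[Tuple[datetime, str]]:
    """Return (capture datetime in UTC, original URI), or None if no match."""
    match = WAYBACK_RE.match(uri)
    if match is None:
        return None
    captured = datetime.strptime(match.group(1), "%Y%m%d%H%M%S")
    return captured.replace(tzinfo=timezone.utc), match.group(2)
```

The Wikipedia `oldid` URI has no such timestamp, so the function returns `None` for it; its datetime has to come from elsewhere.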

Page 26: Digital Preservation at ODU

Navigating Archived Resources

http://en.wikipedia.org/w/index.php?title=September_11_attacks&oldid=282333 is an archived resource for http://en.wikipedia.org/wiki/September_11_attacks (Dec 20 2001, 4:51:00 UTC)

Its "Pentagon" link leads to http://en.wikipedia.org/wiki/The_Pentagon, the current version of that page.

Page 27: Digital Preservation at ODU

Current and Past Web are Not Integrated

• Current and Past Web based on same technology.

• But, going from Current to Past Web is a matter of (manual) discovery.

• Memento wants to make going from Current to Past Web a (HTTP) protocol matter.

• Memento wants to integrate Current And Past Web.

Page 28: Digital Preservation at ODU

One Memento HTTP Navigation

Page 29: Digital Preservation at ODU

Memento HTTP Flow

(R = original resource, G = TimeGate, M = Memento, B = TimeBundle)

1. HEAD R, Accept-Datetime → Link G

2. GET G, Accept-Datetime → 302 M, Vary, TCN, Link R,B,M

3. GET M, Accept-Datetime → 200, Content-Datetime, Link R,B,M
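The three request/response pairs of this flow can be sketched as a small client. To keep the example runnable without network access, the HTTP call is passed in as `fetch`, a hypothetical function taking (method, URI, headers) and returning (status, headers). The header names follow these slides; `resolve_memento` and `find_rel` are illustrative names only.

```python
import re
from typing import Callable, Dict, Tuple

Response = Tuple[int, Dict[str, str]]                    # (status, headers)
Fetch = Callable[[str, str, Dict[str, str]], Response]   # (method, uri, headers)


def find_rel(link_header: str, rel: str) -> str:
    """Return the URI of the first Link entry with the given rel value."""
    match = re.search(r'<([^>]*)>\s*;\s*rel="%s"' % re.escape(rel), link_header)
    if match is None:
        raise LookupError("no Link entry with rel=%r" % rel)
    return match.group(1)


def resolve_memento(uri_r: str, accept_datetime: str, fetch: Fetch) -> Tuple[str, str]:
    """Follow R -> G -> M and return (URI-M, Content-Datetime)."""
    headers = {"Accept-Datetime": accept_datetime}

    # Step 1: ask the original resource where its TimeGate is.
    _, r_headers = fetch("HEAD", uri_r, headers)
    uri_g = find_rel(r_headers["Link"], "timegate")

    # Step 2: the TimeGate redirects to the best-matching Memento.
    status, g_headers = fetch("GET", uri_g, headers)
    if status != 302:
        raise RuntimeError("expected 302 from TimeGate, got %d" % status)
    uri_m = g_headers["Location"]

    # Step 3: fetch the Memento itself.
    status, m_headers = fetch("GET", uri_m, headers)
    if status != 200:
        raise RuntimeError("expected 200 from Memento, got %d" % status)
    return uri_m, m_headers["Content-Datetime"]
```

In real use `fetch` would wrap an HTTP library with redirects disabled, so that the 302 from the TimeGate stays visible to the client.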

Page 30: Digital Preservation at ODU

One Memento HTTP Navigation

Scenario

• cnn.com includes Link to TimeGate at Internet Archive

• URI-R on one server, URI-G & URI-M on another

Page 32: Digital Preservation at ODU

Memento HTTP Flow: URI-R

HEAD R, Accept-Datetime

HEAD http://cnn.com/ HTTP/1.1
Host: cnn.com
Accept-Datetime: Tue, 11 Sep 2001 20:35:00 GMT
Connection: close
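The Accept-Datetime value uses the usual HTTP date format, which Python's standard library can produce. A minimal sketch, where `accept_datetime_headers` is a name made up for the example:

```python
from datetime import datetime, timezone
from email.utils import format_datetime
from typing import Dict


def accept_datetime_headers(when: datetime) -> Dict[str, str]:
    """Build the request headers from the slide for a timezone-aware datetime."""
    # format_datetime(usegmt=True) requires a datetime whose tzinfo is UTC.
    when_utc = when.astimezone(timezone.utc)
    return {
        "Accept-Datetime": format_datetime(when_utc, usegmt=True),
        "Connection": "close",
    }
```

Passing `datetime(2001, 9, 11, 20, 35, tzinfo=timezone.utc)` yields the header value shown in the request above.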

Page 34: Digital Preservation at ODU

Memento HTTP Flow: Success – URI-R

Link G

HTTP/1.1 200 OK
Date: Thu, 21 Jan 2010 00:02:12 GMT
Server: Apache
Link: <http://web.archive.org/web/timegate/http://cnn.com>; rel="timegate"
Content-Length: 255
Connection: close
Content-Type: text/html; charset=iso-8859-1

Page 36: Digital Preservation at ODU

Memento HTTP Flow: URI-G

GET G, Accept-Datetime

GET http://web.archive.org/web/timegate/http://cnn.com HTTP/1.1
Host: web.archive.org
Accept-Datetime: Tue, 11 Sep 2001 20:35:00 GMT
Connection: close

Page 38: Digital Preservation at ODU

Memento HTTP Flow: Success – URI-G

302 M, Vary, Link R,B,M

HTTP/1.1 302 Found
Date: Thu, 21 Jan 2010 00:06:50 GMT
Server: Apache
TCN: choice
Vary: negotiate, accept-datetime
Location: http://web.archive.org/web/20010911203610/http://www.cnn.com
Link: <http://cnn.com/>; rel="original",
  <http://web.archive.org/web/timebundle/http://cnn.com/>; rel="timebundle",
  <http://web.archive.org/web/20000915112826/http://www.cnn.com>; rel="first-memento"; datetime="Tue, 15 Sep 2000 11:28:26 GMT",
  <http://web.archive.org/web/20080708093433/http://www.cnn.com>; rel="last-memento"; datetime="Tue, 08 Jul 2008 09:34:33 GMT",
  <http://web.archive.org/web/20010911203610/http://www.cnn.com>; rel="prev-memento"; datetime="Tue, 11 Sep 2001 20:30:51 GMT",
  <http://web.archive.org/web/20010911203610/http://www.cnn.com>; rel="next-memento"; datetime="Tue, 11 Sep 2001 20:47:33 GMT"
Content-Length: 0
Connection: close
Content-Type: text/plain; charset=UTF-8
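A client that wants the first, last, previous, or next Memento has to split this Link header into its entries. Splitting on commas alone would break, because the datetime values contain commas. The regex-based sketch below handles only the quoted-parameter form seen on this slide; `parse_link_header` is an illustrative helper, not a complete Link header parser.

```python
import re
from typing import Dict, List

# One entry: <uri> followed by any number of ; name="value" parameters.
LINK_RE = re.compile(r'<([^>]*)>((?:\s*;\s*[\w-]+="[^"]*")*)')
PARAM_RE = re.compile(r'([\w-]+)="([^"]*)"')


def parse_link_header(value: str) -> List[Dict[str, str]]:
    """Return one dict per entry, with 'uri' plus its parameters."""
    links: List[Dict[str, str]] = []
    for uri, params in LINK_RE.findall(value):
        entry = {"uri": uri}
        entry.update(PARAM_RE.findall(params))  # adds e.g. rel and datetime
        links.append(entry)
    return links
```

Selecting a relation is then a simple filter, for example `[l for l in links if l.get("rel") == "last-memento"]`.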

Page 40: Digital Preservation at ODU

Memento HTTP Flow: URI-M

GET M, Accept-Datetime

GET http://web.archive.org/web/20010911203610/http://www.cnn.com HTTP/1.1
Host: web.archive.org
Accept-Datetime: Tue, 11 Sep 2001 20:35:00 GMT
Connection: close

Page 42: Digital Preservation at ODU

Memento HTTP Flow: Success – URI-M

200, Content-Datetime, Link R,B,M

HTTP/1.1 200 OK
Server: Apache-Coyote/1.1
X-Archive-Orig-Accept-Ranges: bytes
…
Content-Type: text/html;charset=utf-8
Content-Length: 23364
Date: Thu, 21 Jan 2010 00:09:40 GMT
Content-Datetime: Tue, 11 Sep 2001 20:36:10 GMT
Link: <http://cnn.com/>; rel="original",
  <http://web.archive.org/web/timebundle/http://cnn.com/>; rel="timebundle",
  <http://web.archive.org/web/20000915112826/http://www.cnn.com>; rel="first-memento"; datetime="Tue, 15 Sep 2000 11:28:26 GMT",
  <http://web.archive.org/web/20080708093433/http://www.cnn.com>; rel="last-memento"; datetime="Tue, 08 Jul 2008 09:34:33 GMT",
  <http://web.archive.org/web/20010911203610/http://www.cnn.com>; rel="prev-memento"; datetime="Tue, 11 Sep 2001 20:30:51 GMT",
  <http://web.archive.org/web/20010911203610/http://www.cnn.com>; rel="next-memento"; datetime="Tue, 11 Sep 2001 20:47:33 GMT"
Connection: close
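The Memento that comes back rarely matches the requested datetime exactly, so a client may want to know how far off it is. A minimal sketch comparing the Accept-Datetime that was sent with the Content-Datetime that was returned; `memento_offset` is a hypothetical helper name.

```python
from datetime import timedelta
from email.utils import parsedate_to_datetime


def memento_offset(accept_datetime: str, content_datetime: str) -> timedelta:
    """Return how much later (positive) or earlier (negative) the Memento is."""
    requested = parsedate_to_datetime(accept_datetime)
    archived = parsedate_to_datetime(content_datetime)
    return archived - requested
```

For the exchange on these slides, the request asked for 20:35:00 and the Memento is dated 20:36:10, an offset of 70 seconds.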

Page 43: Digital Preservation at ODU

What does it all mean?

• Cutting edge technology

• Existing Infrastructure

• Redefining Web surfing

• MAJOR “real world” implications

Page 44: Digital Preservation at ODU

Closing Thoughts

Preservation not for privileged priesthood

http://doi.acm.org/10.1145/1592761.1592794

http://booktwo.org/notebook/wikipedia-historiography/

No more hoary stories about format obsolescence:

http://blog.dshr.org/2010/09/reinforcing-my-point.html

Don't desiccate resources; leave them on the web

Endless metadata is not preservation…

Archiving as branded service, not infrastructure

http://blog.dshr.org/2010/06/jcdl-2010-keynote.html

Page 45: Digital Preservation at ODU

Acknowledgements

• Slides borrowed from:

• Dr. Michael L. Nelson:

– http://www.slideshare.net/phonedude/my-point-of-view-michael-l-nelson-web-archiving-cooperative

– http://www.slideshare.net/phonedude/review-of-web-archiving

– http://www.slideshare.net/phonedude/memento-time-travel-for-the-web

• Martin Klein:

– http://www.slideshare.net/phonedude/synchronicity-justintime-discovery-of-lost-web-pages