The presentation given for the RRAC meeting on 10-20-2010. This is a summary of the research efforts in Digital Preservation at Old Dominion University.
Digital Preservation Research at Old Dominion University
Justin F. Brunelle
The MITRE Corporation
Old Dominion University
(And hopefully MITRE, soon)
Why are we listening?
• Overview of the problem
• BRIEF introduction to ODU WSDL group research
• Memento
• I’ll be skipping around, so don’t hesitate to interrupt me
Digital Preservation
• Using the past Web
– Focus of our research
• Temporal Browsing
– Sessions in the past
• Recovering Lost Pages
– Is it really gone?
• 404s
– How to fix broken links?
1. same URI maps to same or very similar content at a later time (time A: U1 → C1; time B: U1 → C1)
2. same URI maps to different content at a later time (time A: U1 → C1; time B: U1 → C2)
3. different URI maps to same or very similar content at the same or at a later time (time A: U1 → C1; time B: U1 → 404, U2 → C1)
4. the content can not be found at any URI (time A: U1 → C1; time B: U1 → ??)
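The four cases above can be sketched as a small classifier. This is an illustration only: the function name `classify_change` and the dict-based "snapshot of the Web" are hypothetical, and exact string equality stands in for "same or very similar content", which real work would replace with a similarity measure.

```python
def classify_change(uri, content_at_a, web_at_b):
    """Return the case number (1-4) for a URI seen at time A.

    web_at_b is a hypothetical snapshot of the Web at time B:
    a dict mapping each URI that still resolves to its content.
    """
    if web_at_b.get(uri) == content_at_a:
        return 1  # same URI, same content
    if uri in web_at_b:
        return 2  # same URI, different content
    if content_at_a in web_at_b.values():
        return 3  # content moved to a different URI
    return 4      # content not found at any URI


print(classify_change("U1", "C1", {"U2": "C1"}))
```

The checks run in order, so a URI that still resolves is always reported as case 1 or 2 even if a copy of the old content also lives elsewhere.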
Change on the Web
Time to Talk About Saving Everything?
Dinner for one or two costs more than a 1TB disk
Wikis have popularized versioning
Cool URIs (http://www.w3.org/Provider/Style/URI.html) are widely adopted, e.g.:
http://news.yahoo.com/s/ap/20100920/ap_on_el_se/us_alaska_senate
http://d.yimg.com/a/p/ap/20100918/capt.67567dbc0a874b689f0b4a5c392f379c-67567dbc0a874b689f0b4a5c392f379c-0.jpg
http://d.yimg.com/a/p/afp/20100918/thumb.photo_1284846332993-1-0.jpg
Also related projects with cool URI / permalink focus:
http://www.citability.org/
http://data.gov/
http://data.gov.uk/
Fortress Model
• Get a lot of money
• Buy lots of storage
• Hire lots of people
• “Look upon my archive ye Mighty, and despair!”
Alternate Methods
• Lazy Preservation (McCown)
– “How much preservation do I get if I do absolutely nothing?”
• Just-In-Time Preservation (Klein)
– Wait for it to disappear, then find a “good ‘nuff” version
• Shared Infrastructure Preservation
– Push content to sites that might preserve it
• arXiv.org, IA, WebCite…
• Server Enhanced Preservation
– Create archival-ready resources
And Soon…
• Social Preservation
– Preserving resources using 3rd party Web Services
– Repository for OAI-ORE ReMs
– Social network feel
– Lazy-esque, server-side reconstruction
But I digress…
• Few years away…
• Preliminary research
• And now back to the prior research…
Web Infrastructure (McCown, 2007)
WayBack Machine
http://web.archive.org/web/*/http://www.thecribs.com/
http://mementoproxy.cs.odu.edu/aggr/timemap/link/http://www.thecribs.com/
from these we can create time-based:
• indexes
• IDF values
• PageRank
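Both listing URIs above follow the same pattern: a fixed prefix followed by the original URI. A minimal sketch of that pattern, with the prefixes copied from this slide and a hypothetical function name `listing_uris`; it only builds strings and makes no network request, and the services may no longer answer at these addresses.

```python
# Prefixes as they appear on the slide.
WAYBACK_LISTING = "http://web.archive.org/web/*/"
MEMENTO_TIMEMAP = "http://mementoproxy.cs.odu.edu/aggr/timemap/link/"


def listing_uris(original_uri):
    """Return the two URIs that list archived copies of original_uri."""
    return {
        "wayback": WAYBACK_LISTING + original_uri,
        "timemap": MEMENTO_TIMEMAP + original_uri,
    }


for name, uri in listing_uris("http://www.thecribs.com/").items():
    print(name, uri)
```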
Batch Recovery For Sites
http://warrick.cs.odu.edu/
Free limo rides for life?!
Reconstruction Diagram
added 20%
identical 50%
changed 33%
missing 17%
Real-Time Recovery for URIs
Synchronicity - www.cs.odu.edu/~mklein/
Memento wants to make navigating the Web’s Past Easy
http://www.mementoweb.org
http://groups.google.com/group/memento-dev
What are you talking about?
• Uniform Resource Identifier (URI) ~= URL
• Resource:
– <HTML>
• Representation
W3C Web Architecture: Resource – URI - Representation
[Diagram: a URI identifies a Resource; dereferencing the URI yields a Representation, which represents the Resource]
W3C Web Architecture: Resource – URI - Representation
[Diagram: a URI identifies a Resource; dereferencing the URI with content negotiation yields Representation 1 or Representation 2, each of which represents the Resource]
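The relationship in the diagram can be sketched in a few lines: one URI, one resource, several representations, with the choice made at dereference time. The `Resource` class, the example URI, and the media types are hypothetical, and the Accept handling is deliberately simplified (it ignores q-values and wildcards such as `text/*`).

```python
class Resource:
    """A resource identified by one URI, with one body per media type."""

    def __init__(self, uri, representations):
        self.uri = uri
        self.representations = representations  # media type -> body

    def dereference(self, accept="*/*"):
        """Return (media_type, body) for the first acceptable type, or None."""
        for part in accept.split(","):
            media_type = part.split(";")[0].strip()
            if media_type == "*/*" and self.representations:
                # Any type is acceptable: fall back to the first one stored.
                chosen = next(iter(self.representations))
                return chosen, self.representations[chosen]
            if media_type in self.representations:
                return media_type, self.representations[media_type]
        return None


report = Resource(
    "http://example.org/report",
    {"text/html": "<h1>Report</h1>", "text/plain": "Report"},
)
print(report.dereference("text/plain"))
```

Memento applies the same idea along a different axis: the client negotiates on datetime rather than on media type.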
Resources
Resources have Representations
Resources have Representations that Change over Time
Only the Current Representation is Available from a Resource
Old Representations are Lost Forever
Finding Archived Resources
Go to http://www.archive.org/ and search http://cnn.com
On http://web.archive.org/web/*/http://cnn.com, select desired datetime
Archived Resources
http://web.archive.org/web/20010911203610/http://www.cnn.com/ archived resource for http://cnn.com (Sep 11 2001, 20:36:10 UTC)
http://en.wikipedia.org/w/index.php?title=September_11_attacks&oldid=282333 archived resource for http://en.wikipedia.org/wiki/September_11_attacks (Dec 20 2001, 4:51:00 UTC)
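The Wayback Machine URI above carries its capture time as a 14-digit `YYYYMMDDhhmmss` field, which is how 20010911203610 maps to Sep 11 2001, 20:36:10 UTC. A minimal sketch of reading that field; `parse_wayback_uri` is a hypothetical helper, and the pattern only covers URIs shaped like the one on this slide.

```python
import re
from datetime import datetime, timezone

# Matches http://web.archive.org/web/<14 digits>/<original URI>
WAYBACK_RE = re.compile(r"^http://web\.archive\.org/web/(\d{14})/(.+)$")


def parse_wayback_uri(uri):
    """Return (capture datetime in UTC, original URI), or None if no match."""
    match = WAYBACK_RE.match(uri)
    if match is None:
        return None
    captured = datetime.strptime(match.group(1), "%Y%m%d%H%M%S")
    return captured.replace(tzinfo=timezone.utc), match.group(2)


when, original = parse_wayback_uri(
    "http://web.archive.org/web/20010911203610/http://www.cnn.com/"
)
print(when.isoformat(), original)
```

The Wikipedia URI has no such field: its `oldid` is a revision number, so the datetime has to come from the wiki itself.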
Navigating Archived Resources
http://en.wikipedia.org/w/index.php?title=September_11_attacks&oldid=282333 archived resource for http://en.wikipedia.org/wiki/September_11_attacks (Dec 20 2001, 4:51:00 UTC)
Following the "Pentagon" link from the archived page leads to the current http://en.wikipedia.org/wiki/The_Pentagon
Current and Past Web are Not Integrated
• Current and Past Web based on same technology.
• But, going from Current to Past Web is a matter of (manual) discovery.
• Memento wants to make going from Current to Past Web a (HTTP) protocol matter.
• Memento wants to integrate Current And Past Web.
One Memento HTTP Navigation
Memento HTTP Flow
• Request: HEAD R, Accept-Datetime
• Response: Link G
• Request: GET G, Accept-Datetime
• Response: 302 M, Vary, TCN, Link R,B,M
• Request: GET M, Accept-Datetime
• Response: 200, Content-Datetime, Link R,B,M
Scenario
• cnn.com includes Link to TimeGate at Internet Archive
• URI-R on one server, URI-G & URI-M on another
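The scenario can be walked through end to end with a stub in place of the two servers. This is a sketch, not a Memento client: `fake_server` and `find_memento` are hypothetical names, no network is used, the canned responses are trimmed to the headers the flow depends on, and the Link parsing assumes a single link in the header.

```python
R = "http://cnn.com/"
G = "http://web.archive.org/web/timegate/http://cnn.com"
M = "http://web.archive.org/web/20010911203610/http://www.cnn.com"


def fake_server(method, uri, headers):
    """Stub standing in for cnn.com and web.archive.org."""
    if uri == R:
        return 200, {"Link": f'<{G}>; rel="timegate"'}
    if uri == G:
        return 302, {"Location": M, "Vary": "negotiate, accept-datetime"}
    if uri == M:
        return 200, {"Content-Datetime": "Tue, 11 Sep 2001 20:36:10 GMT"}
    return 404, {}


def find_memento(uri_r, accept_datetime, send=fake_server):
    """Follow URI-R -> URI-G -> URI-M; return (URI-M, Content-Datetime)."""
    headers = {"Accept-Datetime": accept_datetime}

    # Step 1: HEAD the original resource to discover its TimeGate.
    status, response = send("HEAD", uri_r, headers)
    link = response.get("Link", "")
    if 'rel="timegate"' not in link:
        return None
    uri_g = link[link.index("<") + 1:link.index(">")]

    # Step 2: GET the TimeGate, which redirects to the chosen Memento.
    status, response = send("GET", uri_g, headers)
    if status != 302:
        return None
    uri_m = response["Location"]

    # Step 3: GET the Memento itself.
    status, response = send("GET", uri_m, headers)
    if status != 200:
        return None
    return uri_m, response.get("Content-Datetime")


print(find_memento(R, "Tue, 11 Sep 2001 20:35:00 GMT"))
```

The same Accept-Datetime value travels with all three requests; only the target URI changes.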
Memento HTTP Flow: URI-R
HEAD R, Accept-Datetime
HEAD http://cnn.com/ HTTP/1.1
Host: cnn.com
Accept-Datetime: Tue, 11 Sep 2001 20:35:00 GMT
Connection: close
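The Accept-Datetime value in the request above uses the standard HTTP date format, which Python's standard library can produce. A minimal sketch; `accept_datetime_headers` is a hypothetical helper, and it assumes the caller passes a timezone-aware datetime.

```python
from datetime import datetime, timezone
from email.utils import format_datetime


def accept_datetime_headers(host, when):
    """Build the request headers for asking how a resource looked at `when`."""
    # usegmt=True renders the zone as "GMT"; it requires a UTC datetime,
    # so the value is converted to UTC first.
    stamp = format_datetime(when.astimezone(timezone.utc), usegmt=True)
    return {"Host": host, "Accept-Datetime": stamp, "Connection": "close"}


wanted = datetime(2001, 9, 11, 20, 35, 0, tzinfo=timezone.utc)
for name, value in accept_datetime_headers("cnn.com", wanted).items():
    print(f"{name}: {value}")
```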
Memento HTTP Flow: Success – URI-R
Link G
HTTP/1.1 200 OK
Date: Thu, 21 Jan 2010 00:02:12 GMT
Server: Apache
Link: <http://web.archive.org/web/timegate/http://cnn.com>; rel="timegate"
Content-Length: 255
Connection: close
Content-Type: text/html; charset=iso-8859-1
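A client has to pull the TimeGate URI out of the Link header shown above. The sketch below uses hypothetical helpers `parse_link_header` and `find_rel`; it handles only the shape seen in these slides (a URI in angle brackets followed by quoted parameters) and is not a complete Link header parser.

```python
import re

# One link: <uri> followed by zero or more ;name="value" parameters.
LINK_RE = re.compile(r'<([^>]*)>((?:\s*;\s*[\w-]+\s*=\s*"[^"]*")*)')
PARAM_RE = re.compile(r'([\w-]+)\s*=\s*"([^"]*)"')


def parse_link_header(value):
    """Return a list of dicts, one per link, with 'uri' plus its parameters."""
    links = []
    for uri, params in LINK_RE.findall(value):
        links.append({"uri": uri, **dict(PARAM_RE.findall(params))})
    return links


def find_rel(value, rel):
    """Return the URI of the first link with the given rel, or None."""
    for link in parse_link_header(value):
        if link.get("rel") == rel:
            return link["uri"]
    return None


header = '<http://web.archive.org/web/timegate/http://cnn.com>; rel="timegate"'
print(find_rel(header, "timegate"))
```

Because parameter values are matched inside their quotes, commas within a `datetime="..."` parameter do not split a link in two.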
Memento HTTP Flow: URI-G
GET G, Accept-Datetime
GET http://web.archive.org/web/timegate/http://cnn.com HTTP/1.1
Host: web.archive.org
Accept-Datetime: Tue, 11 Sep 2001 20:35:00 GMT
Connection: close
Memento HTTP Flow: Success – URI-G
302 M, Vary, Link R,B,M
HTTP/1.1 302 Found
Date: Thu, 21 Jan 2010 00:06:50 GMT
Server: Apache
TCN: choice
Vary: negotiate, accept-datetime
Location: http://web.archive.org/web/20010911203610/http://www.cnn.com
Link: <http://cnn.com/>; rel="original",
 <http://web.archive.org/web/timebundle/http://cnn.com/>; rel="timebundle",
 <http://web.archive.org/web/20000915112826/http://www.cnn.com>; rel="first-memento"; datetime="Fri, 15 Sep 2000 11:28:26 GMT",
 <http://web.archive.org/web/20080708093433/http://www.cnn.com>; rel="last-memento"; datetime="Tue, 08 Jul 2008 09:34:33 GMT",
 <http://web.archive.org/web/20010911203610/http://www.cnn.com>; rel="prev-memento"; datetime="Tue, 11 Sep 2001 20:30:51 GMT",
 <http://web.archive.org/web/20010911203610/http://www.cnn.com>; rel="next-memento"; datetime="Tue, 11 Sep 2001 20:47:33 GMT"
Content-Length: 0
Connection: close
Content-Type: text/plain; charset=UTF-8
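The redirect above points at the capture from 20:36:10 for a request asking about 20:35:00. One plausible selection rule, assumed here rather than taken from the slides, is to pick the capture nearest in time to the requested datetime. The function name `closest_memento` and the labels in the sample data are hypothetical; the datetimes are the ones shown in the response.

```python
from email.utils import parsedate_to_datetime


def closest_memento(accept_datetime, mementos):
    """Pick the (label, http_date) pair nearest in time to accept_datetime.

    mementos is a list of (label, HTTP date string) pairs.
    """
    target = parsedate_to_datetime(accept_datetime)
    return min(
        mementos,
        key=lambda item: abs(parsedate_to_datetime(item[1]) - target),
    )


captures = [
    ("prev-memento", "Tue, 11 Sep 2001 20:30:51 GMT"),
    ("selected", "Tue, 11 Sep 2001 20:36:10 GMT"),
    ("next-memento", "Tue, 11 Sep 2001 20:47:33 GMT"),
]
print(closest_memento("Tue, 11 Sep 2001 20:35:00 GMT", captures))
```

With these values the 20:36:10 capture is 70 seconds from the request and the 20:30:51 capture is 249 seconds away, which matches the redirect target. A real TimeGate may apply a different policy.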
Memento HTTP Flow: URI-M
GET M, Accept-Datetime
GET http://web.archive.org/web/20010911203610/http://www.cnn.com HTTP/1.1
Host: web.archive.org
Accept-Datetime: Tue, 11 Sep 2001 20:35:00 GMT
Connection: close
Memento HTTP Flow: Success – URI-M
200, Content-Datetime, Link R,B,M
HTTP/1.1 200 OK
Server: Apache-Coyote/1.1
X-Archive-Orig-Accept-Ranges: bytes
…
Content-Type: text/html;charset=utf-8
Content-Length: 23364
Date: Thu, 21 Jan 2010 00:09:40 GMT
Content-Datetime: Tue, 11 Sep 2001 20:36:10 GMT
Link: <http://cnn.com/>; rel="original",
 <http://web.archive.org/web/timebundle/http://cnn.com/>; rel="timebundle",
 <http://web.archive.org/web/20000915112826/http://www.cnn.com>; rel="first-memento"; datetime="Fri, 15 Sep 2000 11:28:26 GMT",
 <http://web.archive.org/web/20080708093433/http://www.cnn.com>; rel="last-memento"; datetime="Tue, 08 Jul 2008 09:34:33 GMT",
 <http://web.archive.org/web/20010911203610/http://www.cnn.com>; rel="prev-memento"; datetime="Tue, 11 Sep 2001 20:30:51 GMT",
 <http://web.archive.org/web/20010911203610/http://www.cnn.com>; rel="next-memento"; datetime="Tue, 11 Sep 2001 20:47:33 GMT"
Connection: close
What does it all mean?
• Cutting edge technology
• Existing Infrastructure
• Redefining Web surfing
• MAJOR “real world” implications
Closing Thoughts
Preservation not for privileged priesthood
http://doi.acm.org/10.1145/1592761.1592794
http://booktwo.org/notebook/wikipedia-historiography/
no more hoary stories about format obsolescence:
http://blog.dshr.org/2010/09/reinforcing-my-point.html
Don't desiccate resources; leave them on the web
Endless metadata is not preservation…
archiving as branded service, not infrastructure
http://blog.dshr.org/2010/06/jcdl-2010-keynote.html
Acknowledgements
• Slides borrowed from:
• Dr. Michael L. Nelson:
– http://www.slideshare.net/phonedude/my-point-of-view-michael-l-nelson-web-archiving-cooperative
– http://www.slideshare.net/phonedude/review-of-web-archiving
– http://www.slideshare.net/phonedude/memento-time-travel-for-the-web
• Martin Klein:
– http://www.slideshare.net/phonedude/synchronicity-justintime-discovery-of-lost-web-pages