Tools for Managing the Past Web
Dr. Michele C. Weigle
Web Sciences and Digital Libraries (WS-DL) Group
Department of Computer Science
Old Dominion University
ODU - ECE Seminar
February 20, 2015
But webpages can disappear
• Average lifespan of a webpage: 50-100 days
• A year after publication, about 11% of content
shared on social media will be gone.
SalahEldeen and Nelson, "Losing My Revolution: How Many Resources Shared on Social Media Have Been Lost?", TPDL 2012
http://ws-dl.blogspot.com/2012/02/2012-02-11-losing-my-revolution-year.html
Why archives matter
• Malaysia Airlines Flight 17 (MH17)
• Ukrainian separatists originally took credit for downing a transport plane in that location
• Later deleted the post
• Internet Archive had archived the post before deletion
http://www.csmonitor.com/World/Europe/2014/0717/Web-evidence-points-to-pro-Russia-rebels-in-downing-of-MH17-video
Web archiving in the news - 2015
http://www.newyorker.com/magazine/2015/01/26/cobweb
But Wayback is not Google
• Wayback Machine has no full-text search
– too big to be indexed
– 452 billion web pages, 9 petabytes of data
– growing at 20 TB/week
• Enter URL and pick a date
"It’s more like a phone book than like an archive."
-Jill Lepore, The New Yorker
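Entering a URL and picking a date corresponds to the Wayback Machine's URL pattern, in which a 14-digit timestamp precedes the original URL. A minimal sketch of building such a link follows; the helper name `wayback_url` is made up for this example, and the Wayback Machine serves the capture closest to the requested timestamp rather than requiring an exact match.

```python
from datetime import datetime


def wayback_url(uri: str, when: datetime) -> str:
    """Build a Wayback Machine link for `uri` near the moment `when`.

    The timestamp is formatted as YYYYMMDDhhmmss, the form used in
    web.archive.org capture URLs.
    """
    return f"http://web.archive.org/web/{when:%Y%m%d%H%M%S}/{uri}"


if __name__ == "__main__":
    # Example: ask for usps.com as it looked around the date of this talk.
    print(wayback_url("http://www.usps.com/", datetime(2015, 2, 20)))
```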
How can I access the archives?
• MementoFox: http://ws-dl.blogspot.com/2010/03/2010-03-19-mementofox-add-on-released.html
• Memento for Chrome: http://ws-dl.blogspot.com/2013/10/2013-10-14-right-click-to-past-memento.html
• Mink: http://ws-dl.blogspot.com/2014/10/2014-10-03-integrating-live-and.html
http://www.mementoweb.org
ODU WS-DL Projects
Tools for Managing the Past Web
• Archive Quality
• Tweet Intention
• TimeMap Summaries
• Archive What I See Now
• Storytelling for Archives
The State of Web Archiving
current: "Hooray! It's in the archive!"
vs.
future: "How well was it archived?"
How damaged are these mementos?
M = 0.17, D = 0.09 (live web)
M = 0.24, D = 0.41 (missing main)
M = 0.29, D = 0.36 (missing logo + navigation)
Brunelle, Kelly, SalahEldeen, Weigle, and Nelson, "Not All Mementos Are Created Equal: Measuring the Impact of Missing Resources", JCDL 2014, Best Student Paper
Good News:
Although M is steady/increasing, D is decreasing
M = percentage missing
D = our damage metric
Sampled 45,000 mementos
- one memento/year of ~1850 webpages
- webpages from Bitly URIs shared over Twitter and Archive-It collections
Brunelle et al., JCDL 2014
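As a rough illustration of why M and D can diverge, the sketch below computes M as the fraction of embedded resources that are missing, alongside a stand-in damage score that weights each missing resource by its importance. The `Resource` class, the weights, and `weighted_damage` are assumptions made for this example only; the actual damage metric is defined in Brunelle et al., JCDL 2014, and is not reproduced here.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Resource:
    uri: str
    missing: bool
    weight: float = 1.0  # hypothetical importance weight, not from the paper


def fraction_missing(resources: List[Resource]) -> float:
    """M: share of embedded resources that failed to load."""
    if not resources:
        return 0.0
    return sum(r.missing for r in resources) / len(resources)


def weighted_damage(resources: List[Resource]) -> float:
    """Stand-in for D: missing weight divided by total weight (an assumption)."""
    total = sum(r.weight for r in resources)
    if total == 0:
        return 0.0
    return sum(r.weight for r in resources if r.missing) / total


if __name__ == "__main__":
    page = [
        Resource("main-image.png", missing=True, weight=5.0),
        Resource("logo.png", missing=False),
        Resource("style.css", missing=False),
        Resource("tracker.js", missing=False),
    ]
    print(fraction_missing(page), weighted_damage(page))
```

In this made-up page only one of four resources is missing, so M is low, but because the missing resource is the heavily weighted main image, the stand-in damage score is much higher.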
Using JavaScript can result in damaged mementos
JavaScript is responsible for an increasing proportion of missing embedded resources over time.
Brunelle, Kelly, Weigle and Nelson, "The Impact of JavaScript on Archivability," International Journal of Digital Libraries (IJDL), 2015
Sometimes the live web "leaks" into the archive
http://ws-dl.blogspot.com/2012/10/2012-10-10-zombies-in-archives.html
Sept 3, 2008
2012
Different parts of a page can be crawled at different times
Ainsworth and Nelson, "Evaluating Sliding and Sticky Target Policies by Measuring Temporal Drift in Acyclic Walks Through a Web Archive", JCDL 2013
Which page did Chris Hayes mean to tweet?
Tweet on Oct 3, 2014
Likely target (captured Oct 1, 2014)
What you see depends on when you click
Oct 9, 2014
Oct 10, 2014
Nov 19-Dec 15, 2014
Today (Feb 2015) – now fergusonaction.com
Mapping Tweet Relevance
SalahEldeen and Nelson, "Reading the Correct History? Modeling Temporal Intention in Resource Sharing", JCDL 2013
What did usps.com look like?
http://whatdiditlooklike.mementoweb.org/
Animated GIF
1st memento of each year
Submit a URL via Twitter: “#whatdiditlooklike URL”
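Choosing the first memento of each year can be sketched as below. This is only an illustration of the selection rule named on the slide, not the service's code; the function name and the `(capture time, memento URI)` input shape are assumptions.

```python
from datetime import datetime
from typing import Iterable, List, Tuple

Memento = Tuple[datetime, str]  # (capture time, memento URI), an assumed shape


def first_memento_per_year(mementos: Iterable[Memento]) -> List[Memento]:
    """Return the earliest capture from each calendar year, oldest year first."""
    earliest = {}
    for captured, uri in mementos:
        year = captured.year
        if year not in earliest or captured < earliest[year][0]:
            earliest[year] = (captured, uri)
    return [earliest[year] for year in sorted(earliest)]


if __name__ == "__main__":
    # Placeholder memento names; a real list would come from a TimeMap.
    captures = [
        (datetime(2001, 6, 1), "memento-b"),
        (datetime(2001, 2, 3), "memento-a"),
        (datetime(2003, 1, 15), "memento-c"),
    ]
    for captured, uri in first_memento_per_year(captures):
        print(captured.date(), uri)
```

The selected frames, one per year, would then be rendered and stitched into the animated GIF.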
Which tells you more about the past of www.apple.com?
700 thumbnails (not even all of them!)
32 sampled thumbnails
AlSum and Nelson, "Thumbnail Summarization Techniques for Web Archives", ECIR 2014
TimeMap Thumbnail Summaries
• Compare HTML, not images
• Compute SimHash of HTML
– result is a string representing the content of the page
• Calculate Hamming distance between SimHashes of consecutive mementos
• Generate thumbnails of mementos that have at least a 4 character difference in SimHash
– threshold too low -> near duplicate images
– threshold too high -> miss important changes
3 lines of difference
AlSum and Nelson, "Thumbnail Summarization Techniques for Web Archives", ECIR 2014
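The selection steps on this slide can be sketched as follows. This is a minimal sketch, not the authors' implementation: the tokenizer, the use of MD5 for per-token hashes, the 64-bit fingerprint rendered as 16 hex characters, and all function names are assumptions. Only the rule itself (thumbnail a memento when its SimHash differs from the previous memento's by at least 4 characters) comes from the slide.

```python
import hashlib
import re
from typing import List, Sequence

BITS = 64  # assumed fingerprint width


def simhash(text: str) -> str:
    """Return a 64-bit SimHash of `text` as a 16-character hex string."""
    weights = [0] * BITS
    for token in re.findall(r"\w+", text.lower()):
        digest = hashlib.md5(token.encode("utf-8")).digest()
        token_hash = int.from_bytes(digest[:8], "big")
        for bit in range(BITS):
            # Each token votes +1 or -1 on every bit position.
            weights[bit] += 1 if (token_hash >> bit) & 1 else -1
    fingerprint = 0
    for bit, weight in enumerate(weights):
        if weight > 0:
            fingerprint |= 1 << bit
    return f"{fingerprint:016x}"


def char_distance(a: str, b: str) -> int:
    """Hamming distance between two equal-length strings, counted per character."""
    if len(a) != len(b):
        raise ValueError("strings must have equal length")
    return sum(x != y for x, y in zip(a, b))


def select_for_thumbnails(html_versions: Sequence[str], threshold: int = 4) -> List[int]:
    """Return indices of mementos whose SimHash differs from the previous
    memento's by at least `threshold` characters (the first is always kept)."""
    selected = []
    previous = None
    for index, html in enumerate(html_versions):
        current = simhash(html)
        if previous is None or char_distance(previous, current) >= threshold:
            selected.append(index)
        previous = current
    return selected
```

Raising `threshold` yields fewer thumbnails and risks missing important changes; lowering it yields more thumbnails, including near duplicates, which is the trade-off noted on the slide.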
Archive What I See Now
• Humanities researchers know they should archive web resources
• Standard web archiving tools are difficult for non-IT experts
"Archive What I See Now", NEH Digital Humanities Implementation Grant, 2014-2017, http://bit.ly/odu-dhig-2014
Why not just take a screenshot or “save as”?
Can't interact with a screenshot
"Save Page As..." output is difficult to keep organized, especially with multiple captures over time
What about archiving pages behind authentication or that change quickly?
Facebook - requires login
Twitter - changes faster than typical crawling rate
How we're addressing the problem
WARCreate
• Google Chrome extension
• Archive the current state of the page in standard Web Archive (WARC) format
• Compatible with Wayback
Kelly and Weigle, "WARCreate - Create Wayback-Consumable WARC Files from Any Webpage", JCDL 2012
Kelly, Weigle, and Nelson. "WARCreate - Create Wayback-Consumable WARC Files from Any Webpage," Digital Preservation 2012, Tools Demo Session
WARCreate - Work in Progress
• New modes of operation
– record mode
• while activated, add capture of each page visited to the WARC
– countdown mode
• every interval, refresh and add new capture of page
– event mode
• add new capture of page every time it dynamically reloads or refreshes
What to do with created WARCs?
WAIL
• Load created WARCs into a Wayback instance on your local computer
• Single-click install of Wayback (and other archiving tools)
• Available for Windows, OS X
Kelly, Weigle, and Nelson. "Making Enterprise-Level Archive Tools Accessible for Personal Web Archiving," Personal Digital Archiving 2013, Poster Session
Kelly, Nelson, and Weigle. "WARCreate and WAIL: WARC, Wayback and Heritrix Made Easy," Digital Preservation 2013
Bridging the gap between the past web and the live web
Mink
• Google Chrome extension
• For each page you visit, displays the number of archived versions available
• Provides access by date
• Allows for submission to public archiving services
Kelly, Nelson, and Weigle, "Mink: Integrating the Live and Archived Web Viewing Experience Using Web Browsers and Memento," poster, ACM/IEEE Digital Libraries (DL), September 2014.
Storytelling For Archives
Archived collections
Storytelling services
Archived enriched stories
AlNoamany, "Using Web Archives to Enrich the Live Web Experience Through Storytelling", TCDL Bulletin, December 2013.
Tools for Storytelling
• Tools for Users
– use existing tools like Storify to view the stories of a collection
• Tools for Curators
– use existing stories to augment your collections
– create stories from your collections
• candidate mementos automatically selected
Story Types
• Fixed Page – Fixed Time (same URI, same time): differences in GeoIP, mobile, etc.
• Fixed Page – Sliding Time (same URI, different time): evolution of a single page (or domain) through time
• Sliding Page – Fixed Time (different URI, same time): different perspectives on a point in time
• Sliding Page – Sliding Time (different URI, different time): broadest possible coverage of a collection
Issues: topic modeling, eliminating duplicates, maximizing novelty, structural & content quality
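The four story types form a two-by-two grid over URI and time, which the sketch below makes explicit by labelling a set of mementos. The function name, the `(URI, capture date)` input shape, and the choice to treat "same time" as "same calendar day" are assumptions made for this example; the labels use a plain hyphen.

```python
from datetime import date
from typing import List, Tuple


def story_type(mementos: List[Tuple[str, date]]) -> str:
    """Label a set of (URI, capture date) pairs with one of the four story types."""
    if not mementos:
        raise ValueError("at least one memento is required")
    uris = {uri for uri, _ in mementos}
    days = {day for _, day in mementos}
    page = "Fixed Page" if len(uris) == 1 else "Sliding Page"
    time = "Fixed Time" if len(days) == 1 else "Sliding Time"
    return f"{page} - {time}"


if __name__ == "__main__":
    # One page followed across two years: evolution of a single page.
    print(story_type([
        ("http://example.com/", date(2013, 1, 1)),
        ("http://example.com/", date(2014, 1, 1)),
    ]))
```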
Web Sciences and Digital Libraries Group (WS-DL)
Faculty
• Dr. Michael L. Nelson
• Dr. Michele C. Weigle
PhD Students
• Scott Ainsworth
• Sawood Alam
• Lulwah Alkwai
• Yasmin AlNoamany
• Mohamed Aturban
• Justin Brunelle
• Mat Kelly
• Corren McCoy
• Shawn Jones
• Amara Naas
• Louis Nguyen
• Alexander Nwala
• Hany SalahEldeen
@WebSciDL
http://ws-dl.cs.odu.edu/
http://ws-dl.blogspot.com/
Dr. Michele C. Weigle
@weiglemc
http://www.cs.odu.edu/~mweigle/