Tools for Managing the Past Web Dr. Michele C. Weigle Web Sciences and Digital Libraries (WS-DL) Group Department of Computer Science Old Dominion University ODU - ECE Seminar February 20, 2015


Tools for Managing the Past Web

Dr. Michele C. Weigle

Web Sciences and Digital Libraries (WS-DL) Group

Department of Computer Science

Old Dominion University

ODU - ECE Seminar

February 20, 2015

What is the past web?

Why should I care about the past web and web archives?

The Web holds our stories


But webpages can disappear

• Average lifespan of a webpage: 50-100 days

• A year after publication, about 11% of content shared on social media will be gone.

SalahEldeen and Nelson, "Losing My Revolution: How Many Resources Shared on Social Media Have Been Lost?", TPDL 2012

http://ws-dl.blogspot.com/2012/02/2012-02-11-losing-my-revolution-year.html

Maybe it's archived?


archive.org/web

Why archives matter

• Malaysia Airlines Flight 17 (MH17)

• Ukrainian separatists originally took credit for downing a transport plane in that location

• Later deleted the post

• Internet Archive had archived the post before deletion

http://www.csmonitor.com/World/Europe/2014/0717/Web-evidence-points-to-pro-Russia-rebels-in-downing-of-MH17-video

Web archiving in the news - 2015


http://www.newyorker.com/magazine/2015/01/26/cobweb

But Wayback is not Google

• Wayback Machine has no full-text search

– too big to be indexed

– 452 billion web pages, 9 petabytes of data

– growing at 20 TB/week

• Enter URL and pick a date
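"Enter URL and pick a date" maps onto the Wayback Machine's replay URL pattern: a 14-digit timestamp followed by the original URL. A minimal sketch, assuming the public web.archive.org/web/ prefix; the helper name `wayback_url` is made up for illustration. The archive redirects to the capture closest to the requested timestamp.

```python
from datetime import datetime


def wayback_url(uri: str, when: datetime) -> str:
    """Build a Wayback Machine replay URL for `uri` near the datetime `when`."""
    # Wayback timestamps are 14 digits: YYYYMMDDhhmmss
    timestamp = when.strftime("%Y%m%d%H%M%S")
    return f"http://web.archive.org/web/{timestamp}/{uri}"


print(wayback_url("http://www.cs.odu.edu/", datetime(2015, 2, 20)))
# http://web.archive.org/web/20150220000000/http://www.cs.odu.edu/
```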


"It’s more like a phone book than like an archive."

-Jill Lepore, The New Yorker

The Internet Archive isn't the only archive in town

[Chart; y-axis: # of archived pages]

How can I access the archives?

• MementoFox: http://ws-dl.blogspot.com/2010/03/2010-03-19-mementofox-add-on-released.html

• Memento for Chrome: http://ws-dl.blogspot.com/2013/10/2013-10-14-right-click-to-past-memento.html

• Mink: http://ws-dl.blogspot.com/2014/10/2014-10-03-integrating-live-and.html

http://www.mementoweb.org

TimeTravel


http://timetravel.mementoweb.org
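Memento tools like these rely on datetime negotiation: the client sends the original URL to a TimeGate along with an Accept-Datetime header, and the TimeGate redirects to the archived version closest to that time. A minimal sketch of building such a request; the TimeGate prefix below is an assumption about the Time Travel service, and `timegate_request` is a made-up helper name. Nothing is sent over the network here.

```python
from datetime import datetime, timezone
from email.utils import format_datetime

# Assumed TimeGate endpoint of the Time Travel service; check the service docs.
TIMEGATE_PREFIX = "http://timetravel.mementoweb.org/timegate/"


def timegate_request(uri: str, when: datetime):
    """Return (url, headers) for a Memento datetime-negotiation request.

    `when` must be a timezone-aware datetime.
    """
    when_utc = when.astimezone(timezone.utc)
    # Accept-Datetime uses the HTTP date format, always in GMT.
    headers = {"Accept-Datetime": format_datetime(when_utc, usegmt=True)}
    return TIMEGATE_PREFIX + uri, headers


url, headers = timegate_request(
    "http://www.usps.com/", datetime(2015, 2, 20, 12, 0, tzinfo=timezone.utc)
)
print(url)
print(headers["Accept-Datetime"])  # Fri, 20 Feb 2015 12:00:00 GMT
```

The returned pair can be handed to any HTTP client; following the redirect leads to the selected memento.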

ODU WS-DL Projects

Tools for Managing the Past Web

• Archive Quality

• Tweet Intention

• TimeMap Summaries

• Archive What I See Now

• Storytelling for Archives


The State of Web Archiving

current: "Hooray! It's in the archive!"

vs.

future: "How well was it archived?"

Damaged Memento


How damaged are these mementos?

• live web: M = 0.17, D = 0.09

• missing main: M = 0.24, D = 0.41

• missing logo + navigation: M = 0.29, D = 0.36

Brunelle, Kelly, SalahEldeen, Weigle, and Nelson, "Not All Mementos Are Created Equal: Measuring the Impact of Missing Resources", JCDL 2014, Best Student Paper

How to detect damage?


vs.

Brunelle et al., JCDL 2014


Good News: Although M is steady/increasing, D is decreasing

• M = percentage of embedded resources missing

• D = our damage metric

Sampled 45,000 mementos

- one memento/year of ~1850 webpages

- webpages from Bitly URIs shared over Twitter and Archive-It collections

Brunelle et al., JCDL 2014
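To see why M and D can move in different directions, compare a plain count of missing resources with a count weighted by how much each resource matters to the page. The sketch below is illustrative only: the `weight` field and the weighted ratio are assumptions for this example, not the damage formula from Brunelle et al.

```python
from dataclasses import dataclass


@dataclass
class Resource:
    uri: str
    missing: bool
    weight: float = 1.0  # hypothetical importance (e.g. size or prominence)


def fraction_missing(resources) -> float:
    """M-style measure: share of embedded resources that are missing."""
    if not resources:
        return 0.0
    return sum(r.missing for r in resources) / len(resources)


def weighted_damage(resources) -> float:
    """D-style measure (illustrative): missing weight over total weight."""
    total = sum(r.weight for r in resources)
    if total == 0:
        return 0.0
    return sum(r.weight for r in resources if r.missing) / total


# Same number of missing resources, very different impact on the page.
lost_main = [Resource("main.jpg", True, 5.0)] + [Resource(f"icon{i}.png", False) for i in range(3)]
lost_icon = [Resource("main.jpg", False, 5.0), Resource("icon0.png", True)] + [
    Resource(f"icon{i}.png", False) for i in (1, 2)
]
print(fraction_missing(lost_main), weighted_damage(lost_main))  # 0.25 0.625
print(fraction_missing(lost_icon), weighted_damage(lost_icon))  # 0.25 0.125
```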

Using JavaScript can result in damaged mementos

JavaScript is responsible for an increasing proportion of missing embedded resources over time.

Brunelle, Kelly, Weigle and Nelson, "The Impact of JavaScript on Archivability," International Journal of Digital Libraries (IJDL), 2015

http://ws-dl.blogspot.com/2012/10/2012-10-10-zombies-in-archives.html

Sept 3, 2008

2012

Sometimes the live web "leaks" into the archive

Different parts of a page can be crawled at different times

Ainsworth and Nelson, "Evaluating Sliding and Sticky Target Policies by Measuring Temporal Drift in Acyclic Walks Through a Web Archive", JCDL 2013


Which page did Chris Hayes mean to tweet?

Tweet on Oct 3, 2014

Likely target (captured Oct 1, 2014)

What you see depends on when you click

Oct 9, 2014

Oct 10, 2014

Nov 19-Dec 15, 2014

Today (Feb 2015) – now fergusonaction.com

Mapping Tweet Relevance

SalahEldeen and Nelson, "Reading the Correct History? Modeling Temporal Intention in Resource Sharing", JCDL 2013

Let the reader choose live or archived


Browsing TimeMaps

How were these 4 thumbnails chosen?

What did usps.com look like?

http://whatdiditlooklike.mementoweb.org/

Animated GIF

1st memento of each year

Submit a URL via Twitter: “#whatdiditlooklike URL”
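Picking the "1st memento of each year" is a simple grouping step over a list of captures. A minimal sketch, assuming the captures are already available as (datetime, memento URI) pairs; the function name and the sample URIs are made up for illustration and this is not the service's actual code.

```python
from datetime import datetime


def first_memento_per_year(mementos):
    """Return the earliest (datetime, uri) pair for each calendar year, in year order."""
    first = {}
    for when, uri in mementos:
        if when.year not in first or when < first[when.year][0]:
            first[when.year] = (when, uri)
    return [first[year] for year in sorted(first)]


captures = [
    (datetime(2013, 6, 1), "memento-c"),
    (datetime(2012, 3, 5), "memento-a"),
    (datetime(2012, 1, 9), "memento-b"),
    (datetime(2013, 2, 2), "memento-d"),
]
for when, uri in first_memento_per_year(captures):
    print(when.year, uri)
```

Each selected capture would then become one frame of the animated GIF.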

Which tells you more about the past of www.apple.com?

700 thumbnails (not even all of them!)

32 sampled thumbnails

AlSum and Nelson, "Thumbnail Summarization Techniques for Web Archives", ECIR 2014

TimeMap Thumbnail Summaries

• Compare HTML, not images

• Compute SimHash of HTML

– result is a string representing the content of the page

• Calculate Hamming distance between SimHashes of consecutive mementos

• Generate thumbnails of mementos that have at least a 4 character difference in SimHash

– threshold too low -> near duplicate images

– threshold too high -> miss important changes
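The steps above can be sketched in a few lines. This is a simplified illustration, not the authors' implementation: the tokenization, the MD5-based token hash, the 64-character bit-string fingerprint, and the rule of always keeping the first memento are all assumptions made for the example.

```python
import hashlib
import re


def simhash(html: str, bits: int = 64) -> str:
    """Return a SimHash fingerprint of the page text as a string of '0'/'1'."""
    counts = [0] * bits
    for token in re.findall(r"\w+", html.lower()):
        digest = hashlib.md5(token.encode("utf-8")).digest()
        value = int.from_bytes(digest[: bits // 8], "big")
        for i in range(bits):
            # Each token votes +1 or -1 on every bit position.
            counts[i] += 1 if (value >> i) & 1 else -1
    return "".join("1" if c > 0 else "0" for c in counts)


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("fingerprints must have the same length")
    return sum(x != y for x, y in zip(a, b))


def select_for_thumbnails(pages, threshold: int = 4):
    """Return indices of mementos whose SimHash differs from the previous
    memento's by at least `threshold` characters (the first is always kept)."""
    selected, previous = [], None
    for index, html in enumerate(pages):
        fingerprint = simhash(html)
        if previous is None or hamming(previous, fingerprint) >= threshold:
            selected.append(index)
        previous = fingerprint
    return selected
```

Called on a time-ordered list of HTML strings, `select_for_thumbnails` skips consecutive near-duplicates; raising `threshold` yields fewer thumbnails, lowering it yields more.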


3 lines of difference

AlSum and Nelson, "Thumbnail Summarization Techniques for Web Archives", ECIR 2014

Grid View


Cover Flow View


Embed in Wayback



Archive What I See Now

• Humanities researchers know they should archive web resources

• Standard web archiving tools are difficult for non-IT experts

"Archive What I See Now", NEH Digital Humanities Implementation Grant, 2014-2017, http://bit.ly/odu-dhig-2014

Why not just take a screenshot or “save as”?

Can't interact with a screenshot

"Save Page As..." output is difficult to keep organized, especially with multiple captures over time

What about archiving pages behind authentication or that change quickly?

Facebook - requires login

Twitter - changes faster than typical crawling rate

How we're addressing the problem

• Google Chrome extension

• Archive the current state of the page in standard Web Archive (WARC) format

• Compatible with Wayback
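For readers unfamiliar with WARC: a WARC file is a sequence of records, each with a small block of named header fields followed by the captured bytes. The sketch below serializes a single WARC/1.0 "response" record. It is a simplified illustration, not WARCreate's code; the function name is made up, and real captures typically also carry warcinfo and request records.

```python
import uuid
from datetime import datetime, timezone


def warc_response_record(target_uri: str, http_bytes: bytes, when: datetime) -> bytes:
    """Serialize one WARC/1.0 'response' record wrapping a raw HTTP response.

    `when` must be a timezone-aware capture time.
    """
    warc_date = when.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    header_lines = [
        "WARC/1.0",
        "WARC-Type: response",
        f"WARC-Record-ID: <urn:uuid:{uuid.uuid4()}>",
        f"WARC-Date: {warc_date}",
        f"WARC-Target-URI: {target_uri}",
        "Content-Type: application/http; msgtype=response",
        f"Content-Length: {len(http_bytes)}",
    ]
    # Headers end with a blank line; the record ends with two CRLFs.
    head = "\r\n".join(header_lines).encode("utf-8") + b"\r\n\r\n"
    return head + http_bytes + b"\r\n\r\n"


http_response = b"HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n<html>hello</html>"
record = warc_response_record(
    "http://example.com/", http_response, datetime(2015, 2, 20, tzinfo=timezone.utc)
)
print(record.decode("utf-8"))
```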

Kelly and Weigle, "WARCreate - Create Wayback-Consumable WARC Files from Any Webpage", JCDL 2012

Kelly, Weigle, and Nelson. "WARCreate - Create Wayback-Consumable WARC Files from Any Webpage," Digital Preservation 2012, Tools Demo Session

WARCreate

WARCreate - Work in Progress

• New modes of operation

– record mode

• while activated, add capture of each page visited to the WARC

– countdown mode

• every interval, refresh and add new capture of page

– event mode

• add new capture of page every time it dynamically reloads or refreshes

What to do with created WARCs?

Kelly, Weigle, and Nelson. "Making Enterprise-Level Archive Tools Accessible for Personal Web Archiving," Personal Digital Archiving 2013, Poster Session

Kelly, Nelson, and Weigle. "WARCreate and WAIL: WARC, Wayback and Heritrix Made Easy," Digital Preservation 2013

WAIL

• Load created WARCs into a Wayback instance on your local computer

• Single-click install of Wayback (and other archiving tools)

• Available for Windows, OS X

Bridging the gap between the past web and the live web

Mink

Kelly, Nelson, and Weigle, "Mink: Integrating the Live and Archived Web Viewing Experience Using Web Browsers and Memento," poster, ACM/IEEE Digital Libraries (DL), September 2014.

• Google Chrome extension

• For each page you visit, displays the number of archived versions available

• Provides access by date

• Allows for submission to public archiving services

Tools


WARCreate

Mink

WAIL

https://ws-dl.cs.odu.edu/Software


Storify


https://storify.com/nzherald/mu


Bookmarking is not preserving


Archive-It Collections


https://archive-it.org/collections/2358

Storytelling For Archives

Archived collections

Storytelling services

Archived enriched stories

AlNoamany, "Using Web Archives to Enrich the Live Web Experience Through Storytelling", TCDL Bulletin, December 2013.

Tools for Storytelling

• Tools for Users

– use existing tools like Storify to view the stories of a collection

• Tools for Curators

– use existing stories to augment your collections

– create stories from your collections

• candidate mementos automatically selected

Story Types

• Fixed Page – Fixed Time (same URI, same time): differences in GeoIP, mobile, etc.

• Fixed Page – Sliding Time (same URI, different time): evolution of a single page (or domain) through time

• Sliding Page – Fixed Time (different URI, same time): different perspectives on a point in time

• Sliding Page – Sliding Time (different URI, different time): broadest possible coverage of a collection

Issues: topic modeling, eliminating duplicates, maximizing novelty, structural & content quality
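The four story types come down to two yes/no questions about a set of mementos: do they share one URI, and do they share one point in time? A minimal sketch of that classification, assuming each memento is given as a (URI, date) pair and that "same time" means an identical date; both choices are simplifications for illustration.

```python
from datetime import date


def story_type(mementos):
    """Classify a list of (uri, date) pairs into one of the four story types."""
    uris = {uri for uri, _ in mementos}
    times = {when for _, when in mementos}
    page = "Fixed Page" if len(uris) == 1 else "Sliding Page"
    time = "Fixed Time" if len(times) == 1 else "Sliding Time"
    return page, time


one_page_over_time = [
    ("http://www.apple.com/", date(2005, 1, 1)),
    ("http://www.apple.com/", date(2015, 1, 1)),
]
many_pages_one_day = [
    ("http://example.com/a", date(2014, 10, 3)),
    ("http://example.org/b", date(2014, 10, 3)),
]
print(story_type(one_page_over_time))  # ('Fixed Page', 'Sliding Time')
print(story_type(many_pages_one_day))  # ('Sliding Page', 'Fixed Time')
```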



Web Sciences and Digital Libraries Group (WS-DL)

Faculty

• Dr. Michael L. Nelson

• Dr. Michele C. Weigle

PhD Students

• Scott Ainsworth

• Sawood Alam

• Lulwah Alkwai

• Yasmin AlNoamany

• Mohamed Aturban

• Justin Brunelle

• Mat Kelly

• Corren McCoy

• Shawn Jones

• Amara Naas

• Louis Nguyen

• Alexander Nwala

• Hany SalahEldeen

@WebSciDL

http://ws-dl.cs.odu.edu/

http://ws-dl.blogspot.com/

Dr. Michele C. Weigle

[email protected]

@weiglemc

http://www.cs.odu.edu/~mweigle/