Sn@tch: An Archiving and Analysis Service for Global News Todd Grappone @liber8er Sharon Farb @farbthink Martin Klein @mart1nkle1n Peter Broadwell @peterbroadwell

Sn@tch CNI Fall 2014



Digital ephemera collections

• Collected by researchers

• Donated by activists

• Include images, audio, video, scanned documents, social media, server logs

International Collecting

• 829 digitally recorded Iranian dissident news programs

• 9,166 other videos from the Iranian Green Movement

• 29,441 digital photographs from the Green Movement

• 543 documents from Tahrir Square

News and Perspectives

The UCLA NewsScape:

• >228,000 hours of TV news

• Recorded 2005-present

• 13 countries, 9 languages

• 38 networks

• Searchable by captions, on-screen text, named entities

• How to incorporate social media into this variety of perspectives?

Social Local Global

A Brief History of Timeliness

• Twitter archive at the Library of Congress [1]

• Last public update from January 4th 2013

• ~170 billion tweets, > 130 TB compressed (late 2012)

• Single search against 2006-2010 data may take up to 24 hours

• Twitter data access at Massachusetts Institute of Technology, Laboratory for Social Machines [2]

• Public announcement from October 1st 2014

[1] http://blogs.loc.gov/loc/2013/01/update-on-the-twitter-archive-at-the-library-of-congress/

[2] https://blog.twitter.com/2014/investing-in-mit-s-new-laboratory-for-social-machines

A Brief History of Timeliness

In case you missed it:

• Twitter makes full archive of tweets available, indexed

• Great, problem solved?

• How about deleted tweets?

• Real-time capture of embedded resources?

https://blog.twitter.com/2014/building-a-complete-tweet-index

A Brief History of Timeliness

• Many initiatives to capture Twitter data

• Live, after an event, both

• Mostly ad-hoc efforts, rarely institutionalized

• Operation often requires programming or sys admin skills

• Deen Freelon’s (American University) incomplete list of tools: https://docs.google.com/document/d/1UaERzROI986HqcwrBDLaqGG8X_lYwctj6ek6ryqDOiQ/

A Brief History of Timeliness

Social Feed Manager (Dan Chudnov, GWU); as presented at #cni13f

http://social-feed-manager.readthedocs.org/

A Brief History of Timeliness

twarc (Ed Summers, MITH); used for Ferguson data

http://inkdroid.org/journal/2014/08/30/a-ferguson-twitter-archive/

http://files.archivists.org/conference/nola2013/twitter/twarc-saa13.htm

We Can Remember It for You Wholesale

I. Real-time capture of tweets plus pro-active archiving of embedded resources

II. Rapid analysis, real-time opportunities

III. Collection-agnostic linking

Remembrance of Tweets/Links Past

• Utilize GWU’s Social Feed Manager

• Filter by keywords, user handles, location, time, etc

• Store raw tweets

• Extract and archive embedded URIs

• Utilize pro-active archiving solutions: Internet Archive, archive.today
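The "extract and archive embedded URIs" step could be sketched roughly as follows. This is a minimal illustration, not the Sn@tch or Social Feed Manager code: it assumes raw tweets are stored as JSON in the Twitter REST API v1.1 shape (an `entities.urls` list with `url` and `expanded_url` fields) and that pro-active archiving is triggered through the Internet Archive's Save Page Now address (`https://web.archive.org/save/<uri>`). The function names are hypothetical.

```python
def extract_uris(tweet: dict) -> list:
    """Return the embedded URIs of one raw tweet, preferring expanded forms."""
    uris = []
    for entry in tweet.get("entities", {}).get("urls", []):
        # Assumption: v1.1-style entities; fall back to the shortened t.co URI.
        uri = entry.get("expanded_url") or entry.get("url")
        if uri and uri not in uris:
            uris.append(uri)
    return uris


def save_page_now_uri(uri: str) -> str:
    """Build the Save Page Now address that asks the Internet Archive to capture `uri`."""
    return "https://web.archive.org/save/" + uri


if __name__ == "__main__":
    tweet = {"entities": {"urls": [{"url": "http://t.co/x",
                                    "expanded_url": "http://bit.ly/dTjCUd"}]}}
    for uri in extract_uris(tweet):
        # A real collector would issue an HTTP GET against this address.
        print(save_page_now_uri(uri))
```

Requesting the capture at collection time, rather than after an event, is what distinguishes pro-active archiving from a later crawl of URIs that may already be gone.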

Remembrance of Tweets/Links Past

• UCLA’s dataset about Egyptian revolution

• More than 400k tweets

• Approx. 50k unique users

• Tweets originated from within 200 miles around Cairo
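A geographic filter like the 200-mile radius above can be approximated with a great-circle distance check. The slides do not say how the dataset was actually filtered, so this is only a sketch: it assumes tweets carry a GeoJSON-style `coordinates` field in `[longitude, latitude]` order (as in Twitter API v1.1), and the Cairo coordinates and function names are supplied here for illustration.

```python
import math

CAIRO = (30.0444, 31.2357)   # (latitude, longitude) in degrees; illustrative center
EARTH_RADIUS_MILES = 3958.8  # mean Earth radius


def haversine_miles(a, b):
    """Great-circle distance in miles between two (lat, lon) points in degrees."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(h))


def within_radius(tweet, center=CAIRO, radius_miles=200.0):
    """True if the tweet is geotagged within `radius_miles` of `center`."""
    point = (tweet.get("coordinates") or {}).get("coordinates")
    if not point:
        return False  # tweets without a geotag cannot be placed
    lon, lat = point  # GeoJSON order is [longitude, latitude]
    return haversine_miles((lat, lon), center) <= radius_miles
```

Only a fraction of tweets carry exact coordinates, so a filter of this kind undercounts activity in the region.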

Remembrance of Tweets/Links Past

• UCLA’s dataset about Egyptian revolution

• 25% of tweets contain references to external resources (web pages, images, videos, etc)

Remembrance of Tweets/Links Past

http://bit.ly/dTjCUd

HTTP 200 OK

Remembrance of Tweets/Links Past

• UCLA’s dataset about Egyptian revolution

• 20% of references are dead, after less than 4 years (!!!)

Remembrance of Tweets/Links Past

http://yfrog.com/h02gvclj

HTTP GET

200 OK

HTTP HEAD

204 No Content
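The yfrog example shows why a link checker should not rely on HEAD alone: the same URI answers 200 to GET but 204 to HEAD. One hedged way to handle this, using only the Python standard library, is to accept a HEAD 200 and otherwise let a follow-up GET decide. The function names and the "200 means alive" rule are assumptions of this sketch, not the method used for the numbers in the slides.

```python
import urllib.error
import urllib.request
from typing import Optional


def fetch_status(uri: str, method: str, timeout: float = 10.0) -> Optional[int]:
    """Return the final HTTP status for one request, or None on network failure."""
    request = urllib.request.Request(uri, method=method)
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        return error.code  # 4xx/5xx responses still carry a status code
    except OSError:
        return None        # DNS failure, refused connection, timeout


def is_live(head_status: Optional[int], get_status: Optional[int] = None) -> bool:
    """Accept a HEAD 200; for anything else, the GET status decides."""
    if head_status == 200:
        return True
    return get_status == 200


def check_link(uri: str) -> bool:
    """HEAD first (cheap); fall back to GET when HEAD is not a plain 200."""
    head = fetch_status(uri, "HEAD")
    if head == 200:
        return True
    return is_live(head, fetch_status(uri, "GET"))
```

A 200 response can still be a "soft 404" (an error page served with a success code), so status codes alone give a lower bound on the share of dead references.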

Remembrance of Tweets/Links Past

• UCLA’s dataset about Egyptian revolution

• 20% of references are dead AND

• 60% of these are not archived

http://wayback.archive-it.org/all/20110203083908/http://yfrog.com/h02gvclj

This one is!

discovered via #memento
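The archived copy above follows the usual Wayback-style URI pattern, in which a 14-digit UTC timestamp precedes the original URI. A small sketch of how such a memento URI could be taken apart, and how the capture time could be rendered in the date format that the Memento protocol (RFC 7089) uses for its `Accept-Datetime` header, is shown below. The function names are hypothetical, and the regular expression assumes the Wayback URI layout seen in the example.

```python
import re
from datetime import datetime, timezone
from email.utils import format_datetime
from typing import Optional, Tuple

# Wayback-style memento URIs: .../<14-digit timestamp>/<original URI>
MEMENTO_PATTERN = re.compile(r"/(\d{14})/(https?://.+)$")


def parse_memento_uri(uri_m: str) -> Optional[Tuple[datetime, str]]:
    """Split a Wayback-style memento URI into (capture time, original URI)."""
    match = MEMENTO_PATTERN.search(uri_m)
    if match is None:
        return None
    captured = datetime.strptime(match.group(1), "%Y%m%d%H%M%S")
    return captured.replace(tzinfo=timezone.utc), match.group(2)


def accept_datetime(moment: datetime) -> str:
    """Format a UTC datetime as an RFC 1123 date for an Accept-Datetime header."""
    return format_datetime(moment.astimezone(timezone.utc), usegmt=True)


if __name__ == "__main__":
    uri_m = ("http://wayback.archive-it.org/all/20110203083908/"
             "http://yfrog.com/h02gvclj")
    captured, original = parse_memento_uri(uri_m)
    print(original, accept_datetime(captured))
```

Sending the tweet's own creation time as `Accept-Datetime` to a Memento TimeGate would request the archived copy closest to the moment the link was shared.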

Remembrance of Tweets/Links Past

URIs from Ed Summers’s Ferguson dataset

https://edsu.github.io/ferguson-urls/

pink == not archived (Internet Archive): 28%

Remembrance of Tweets/Links Past

http://babylon.library.ucla.edu/mklein/archived.html

Part 2: Rapid, Adaptive Analysis

https://srogers.cartodb.com/viz/64f6c0f4-745d-11e4-b4e1-0e4fddd5de28/public_map

Part 2: Rapid, Adaptive Analysis

Part 3: Collection-Agnostic Linking

On TV news: Egypt, Tahrir, Cairo

On Twitter: #jan25, #tahrir, #egypt
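One simple way to picture collection-agnostic linking is a shared topic vocabulary that maps the terms used in TV captions and the hashtags used on Twitter to the same topic label. The slides do not describe the actual linking method, so the vocabulary, topic label, and function names below are purely hypothetical; only the six terms come from the slide.

```python
import re

# Hypothetical vocabulary: one topic label covering both collections' terms.
TOPIC_TERMS = {
    "egypt-2011": {"egypt", "tahrir", "cairo", "#jan25", "#tahrir", "#egypt"},
}


def tokens(text):
    """Lowercased words and hashtags found in a caption or tweet."""
    return set(re.findall(r"#?\w+", text.lower()))


def topics_for(text, vocabulary=TOPIC_TERMS):
    """Topic labels whose terms appear in the text."""
    found = tokens(text)
    return {topic for topic, terms in vocabulary.items() if found & terms}


def link(captions, tweets, vocabulary=TOPIC_TERMS):
    """Pair every caption with every tweet that shares at least one topic."""
    return [(caption, tweet)
            for caption in captions
            for tweet in tweets
            if topics_for(caption, vocabulary) & topics_for(tweet, vocabulary)]
```

Because the matching runs on plain text and a topic table, the same function would work for any pair of collections, which is the sense of "collection-agnostic" assumed here.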

Part 3: Collection-Agnostic Linking

Raiders of the Lost Links

Challenges and opportunities:

• Legal frameworks for sharing and preserving tweets and linked resources

• Collaborations and partnerships to ensure momentum, sustainability

• Expansion to other forms of (social) media

Lazy Digital Archivists: Your Time is Up

Todd Grappone

Sharon Farb

Martin Klein

Peter Broadwell