Recent approaches to capture web content, which Heritrix can’t harvest

Capturing Social Media
Screen filming of Rich Media
Project: Event crawl of The Eurovision Song Contest in Copenhagen 2014
Cooperation with researchers

NAS workshop, Paris 2014/Sabine Schostag



Why focus on social media?

Nowadays social media are the primary communication platforms during cultural and political events

Politicians, artists, musicians, and even traditional news media such as TV use social media more than traditional web pages

The entries on social media pages are ephemeral, so we need to capture them at a very high frequency


Which social media did we crawl?

Twitter.com: comments
Youtube.com: video and comments
Facebook.com: comments
Live blogs

Excluded for technical reasons …

instagram.com: video and image
tumblr.com: multimedia blog
flickr.com: images
vimeo.com: video


Which tools did we use?

Harvesting with NetarchiveSuite using Heritrix 1.4*, weekly, daily and hourly
“Crontab”-based screen dumping of static URLs using PhantomJS to searchable PDFs
Manual LAP (Live Archive Program) browsing
XML extracts from APIs using our own tools and/or Digitalfootprints.dk
Harvesting YouTube videos by extracting the video URLs from the “watch-url” pages with our own tool
Screen recording using CamStudio.org and a Netlab.dk Linux tool wrapping “ffmpeg”
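The crontab-based screen dumping listed above could look roughly like the following sketch. It is not the team's actual script. It assumes the `rasterize.js` example that ships with PhantomJS (invoked as `phantomjs rasterize.js URL output.pdf [paperformat]`), and the function names, the file-naming scheme and the example URL are all hypothetical.

```python
import re
import shlex
from datetime import datetime, timezone
from urllib.parse import urlparse


def dump_filename(url, when):
    """Build a timestamped PDF name so repeated dumps of one URL never collide."""
    parsed = urlparse(url)
    # Collapse every run of non-alphanumeric characters into a single dash.
    slug = re.sub(r"[^A-Za-z0-9]+", "-", parsed.netloc + parsed.path).strip("-")
    return "{}_{}.pdf".format(slug, when.strftime("%Y%m%dT%H%M%SZ"))


def phantomjs_command(url, when, script="rasterize.js"):
    """Return the argument list for one PhantomJS dump (assumes rasterize.js is available)."""
    return ["phantomjs", script, url, dump_filename(url, when), "A4"]


if __name__ == "__main__":
    # Hypothetical static URL; a cron entry such as "0 * * * *" would run this hourly.
    now = datetime.now(timezone.utc)
    cmd = phantomjs_command("https://twitter.com/eurovision", now)
    # Only print the command here; pass `cmd` to subprocess.run to execute it.
    print(" ".join(shlex.quote(part) for part in cmd))
```

Putting the capture time in the file name keeps each dump as a separate file, which matters for the provenance documentation that has to be produced outside the archive.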


…more about the automated screen filming tool

The tool:
was developed as part of a research project by a curator/researcher and is now implemented as a tool
allows scheduled capturing
is well suited to capturing pre-planned streamed content
is well suited to capturing frequently updated content which refreshes automatically (no mouse clicks)
is not a replacement for existing collection methods, but a supplement
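Scheduled capturing of a pre-planned stream can be sketched as a thin wrapper around ffmpeg. This is an illustration only, not the Netlab.dk tool. It assumes a Linux X11 desktop and ffmpeg's `x11grab` input device, and the function names, display, frame size and start time are hypothetical defaults.

```python
import shlex
from datetime import datetime


def seconds_until(start, now):
    """Seconds to wait before a scheduled recording starts (0 if the start has passed)."""
    return max(0.0, (start - now).total_seconds())


def ffmpeg_capture_command(output, seconds, display=":0.0", size="1280x720", fps=25):
    """Argument list that records the X11 display for a fixed number of seconds."""
    return [
        "ffmpeg",
        "-f", "x11grab",         # grab the screen of an X11 display
        "-video_size", size,     # input option: size of the captured area
        "-framerate", str(fps),  # input option: frames per second
        "-i", display,           # which display to record
        "-t", str(seconds),      # output option: stop after this many seconds
        output,
    ]


if __name__ == "__main__":
    # Illustrative schedule: wait until the start time, then record for two hours.
    start = datetime(2014, 5, 10, 21, 0, 0)
    wait = seconds_until(start, datetime.now())
    cmd = ffmpeg_capture_command("event_stream.mkv", 2 * 60 * 60)
    print("wait {:.0f} s, then run:".format(wait))
    print(" ".join(shlex.quote(part) for part in cmd))
```

A fixed duration works for pre-planned broadcasts because the start and end times are known in advance; content that needs mouse clicks would require scripted interaction on top of this.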


…more about the automated screen filming tool

The tool lets the user program every mouse click and every other interaction on the web page


…some screenshots from the filming tool


ESC 2014 and the European Parliament Elections 2014


Lessons learned

NetarchiveSuite using Heritrix 1.4* can’t harvest JavaScript with AJAX, nor keep up with high-frequency feeds, e.g. 47,000 tweets/minute.
You can record the “look and feel” with screen recording and dumping, but producing the files and provenance documentation outside the archive is a HUGE manual effort.
The LAP tool is not very useful, as it doesn’t support https (most social media use https today).
“Digitalfootprints.dk” can archive almost all XML content for Twitter, and its output could be harvested afterwards by NetarchiveSuite/Heritrix.
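To give a sense of scale for the 47,000 tweets/minute figure, the short calculation below estimates how many tweets are published between two visits of a scheduled crawl. It assumes the quoted rate holds for the whole interval, so the results are upper bounds for a peak period rather than measurements. The function names are illustrative.

```python
TWEETS_PER_MINUTE = 47000  # rate quoted in the lessons learned


def tweets_per_second(rate_per_minute=TWEETS_PER_MINUTE):
    """Convert the per-minute rate to tweets per second."""
    return rate_per_minute / 60


def tweets_between_visits(interval_minutes, rate_per_minute=TWEETS_PER_MINUTE):
    """Tweets published between two crawl visits if the rate stayed constant."""
    return interval_minutes * rate_per_minute


if __name__ == "__main__":
    print("per second: {:.0f}".format(tweets_per_second()))
    # Hourly and daily are two of the harvest frequencies named in the slides.
    for label, minutes in [("hourly", 60), ("daily", 24 * 60)]:
        print("{}: {:,} tweets between visits".format(label, tweets_between_visits(minutes)))
```

At that rate even an hourly harvest sees only a snapshot of a feed that has moved on by millions of entries, which is why API extracts and screen recording were used as supplements.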


Current issues

wider access
better access (free text search)
inclusion of older net collections
collection of websites with restricted access
advanced web content, i.e. with sound/video/live interaction (chat, virtual worlds …)

electronic communication networks ≠ the web

long-term preservation documentation


… and from the technical point of view

More stable and operational screen recording and dumping tools for huge social media events.
Build social media API extract plugins into Heritrix, with better support for WARC linking of e.g. YouTube watch and video download URLs.
Build scripting and https support into the LAP tool.
Upgrade NetarchiveSuite to Heritrix 3.* to better support JavaScript with AJAX (using the Umbra plugin) and continuous crawling.


Epilogue

For the first time in Netarchive’s history the whole team met for two days
