Internet Content as Research Data
Australian National University August 2012, Canberra
Monica Omodei
Research Examples
• Social networking
• Lexicography
• Linguistics
• Network science
• Political science
• Media studies
• Contemporary history
Data-driven science is migrating from the natural sciences to the humanities and social sciences
Talk Structure
• Existing web archives
• Web archive use cases
• Bringing archives together
• Creating your own archive
• It's getting harder – challenges
• Web data mining & analysis
Existing web archives
• Internet Archive
• Common Crawl
• Pandora Archive
• Internet Memory Foundation Archive
• Other national archives
• Research and university library archives
Common Collection Strategies
• Crawl scope & focus:
  1) Thematic/topical (elections, events, global warming…)
  2) Resource-specific (video, PDF, etc.)
  3) Broad survey (domain-wide for .com/.net/.org/.edu/.gov)
  4) Exhaustive (end-of-life and closure crawls, national domains)
  5) Frequency-based
• Key inputs: nominations from subject matter experts, prior crawl data, registry data, trusted directories, Wikipedia, Twitter
Internet Archive’s Web Archive
Positives
– Very broad: 175+ billion web instances
– Historic: started 1996
– Publicly accessible
– Time-based URL search
– API access
– Not constrained by legislation: covered by fair use and a fast take-down response
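The time-based URL search and API access above can be scripted. A minimal Python sketch using the Wayback Machine's public availability endpoint, assuming its documented query parameters and JSON shape; the sample response below is illustrative, not live data:

```python
import json
from urllib.parse import urlencode

def availability_query(url, timestamp=None):
    """Build a Wayback availability-API query URL (timestamp: YYYYMMDDhhmmss)."""
    params = {"url": url}
    if timestamp:
        params["timestamp"] = timestamp
    return "https://archive.org/wayback/available?" + urlencode(params)

def closest_snapshot(response_text):
    """Extract the closest archived capture from an API JSON response."""
    return json.loads(response_text).get("archived_snapshots", {}).get("closest")

query = availability_query("nla.gov.au", "20080101")

# Sample response in the API's documented shape (illustrative values):
sample = ('{"archived_snapshots": {"closest": {"available": true, '
          '"status": "200", "timestamp": "20080105032211", '
          '"url": "http://web.archive.org/web/20080105032211/http://www.nla.gov.au/"}}}')
snap = closest_snapshot(sample)
print(query)
print(snap["timestamp"])
```

In a real client you would fetch `query` over HTTP and pass the body to `closest_snapshot`.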
Internet Archive's Web Archive: Negatives
– Because of its size, the archive cannot be searched by keyword
– Because of its size, crawling is fully automated, so QA is not possible
Common Crawl
• Non-profit foundation building an open crawl of the web to seed research and innovation
• Currently 5 billion pages
• Stored on Amazon's S3
• Accessible via MapReduce processing in Amazon's EC2 compute cloud
• Makes wholesale extraction, transformation, and analysis of web data cheap and easy
Common Crawl
Negatives
• Not designed for human browsing but for machine access
• The objective is to support large-scale analysis and text mining/indexing – not long-term preservation
• Some costs are involved for direct extraction of data from S3 storage via the Requester-Pays API
Pandora Archive
• Positives
  – Quality checked
  – Targeted Australian content with a selection policy
  – Historical: started 1996
  – Bibliocentric approach: web sites/publications selected for archiving are catalogued (see Trove)
  – Keyword search
  – Publicly accessible
  – You can nominate Australian web sites for inclusion: pandora.nla.gov.au/registration_form.html
Pandora Archive
• Negatives
  – Labour intensive, thus quite small
  – Significant content is missed because permission to copy is refused
• The situation will improve markedly if Legal Deposit provisions are extended to digital publications
• Broader coverage will be achieved when the infrastructure is upgraded, reducing the labour costs of checking and fixing crawls
Pandora Archive Stats
• Size: 6.32 TB
• Number of files: > 140 million
• Number of 'titles': > 30.5K
• Number of title instances: > 73.5K
Which archived sites are popular?
• Measure: filtered, aggregated web access log data which counts accesses to "titles"
• Examined the top 30 archived titles (by number of accesses) for each year from 2009 to 2012
• Selected some to examine, and speculated as to why they might be popular
• Selected those with consistently high rankings, and ones that were very variable between years
Reasons for popularity of archived versions
• Sites that were once popular and are now decommissioned, particularly if the domain name continues to exist and redirects to the archive
• Sites that may not be that popular live, but whose live site links prominently to Pandora as an archive for their content
• Popular referencing sources cite the archive as well as the live site (if it still exists)
Improving visibility and usage of the Pandora archive
• Articles about interesting content on the Australia's Web Archives blog – http://blogs.nla.gov.au/australias-web-archives/
• More effort to identify archived sites that are no longer 'live'
• Market automatic redirect services to web site owners/managers
• Allow Google to index archive content for 'non-live' sites (problematic)
• Install Twittervane, which draws site nominations for archiving from trending Twitter topics
.au Domain Annual Snapshots
• Annual crawls since 2005, commissioned from the Internet Archive
• Includes sites on servers located in Australia as well as the .au domain
• robots.txt respected, except for inline images and stylesheets
• No public access – researcher access protocols are being developed
• Full-text search – suited to searching archives
• A separate .gov crawl will be publicly accessible soon
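Respecting robots.txt, as these crawls do, is something any self-built crawler must also handle. Python's standard library can evaluate the rules directly; the robots.txt body and user-agent name below are made up for illustration:

```python
import urllib.robotparser

# A sample robots.txt of the kind a polite crawler must honour:
robots_txt = """User-agent: *
Disallow: /private/
Allow: /
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(robots_txt.splitlines())

# First matching rule wins: /private/ is blocked, everything else allowed.
allowed = rp.can_fetch("MyCrawler", "http://example.com/page.html")
blocked = rp.can_fetch("MyCrawler", "http://example.com/private/secret.html")
print(allowed, blocked)  # True False
```

A real crawler would instead call `rp.set_url(...)` and `rp.read()` to fetch each host's live robots.txt.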
Australian web domain crawls

| Year | Files       | Hosts crawled | Size (TB) |
|------|-------------|---------------|-----------|
| 2005 | 185 million | 811,523       | 6.69      |
| 2006 | 596 million | 1,046,038     | 19.04     |
| 2007 | 516 million | 1,247,614     | 18.47     |
| 2008 | 1 billion   | 3,038,658     | 34.55     |
| 2009 | 765 million | 1,074,645     | 24.29     |
| 2011 | 660 million | 1,346,549     | 30.71     |
Internet Memory Foundation
• A number of European partners
• LiWA – Living Web Archives: next-generation web archiving methods and tools
• LAWA – Longitudinal Analytics of Web Archive Data: an experimental testbed for large-scale data analytics
• ARCOMEM (Collect-All ARchives to COmmunity MEMories): leveraging social media for intelligent preservation
• SCAPE – Scalable Preservation Environments
Other National Archives
• List of International Internet Preservation Consortium member archives – netpreserve.org/about/archiveList.php
• Some are whole-domain archives, some are selective archives, many are both
• Some have public access; for others you will need to negotiate access for research
• Most archives have been collected using the Heritrix open-source crawler and thus use the standard WARC (ISO) format
Research Archives
• California Digital Library
• Harvard University Libraries
• Columbia University Libraries
• University of North Texas … and many more
• WebCite – webcitation.org (a citation service archive)
Example: Columbia University
• Member of the IIPC
• Uses the Archive-It service
• A research library that sees web archiving as fundamental to its collecting
• Complements and coordinates with other web archives
• Collecting focus is thematic – e.g. human rights, historic preservation, NY religious institutions
• Also archives web content as part of personal and organisational archives (c.f. manuscript collections)
• Archives its own web site regularly
Bringing Archives Together
• Common standards and APIs
• Memento project – adding time to the web
  – Aggregates CDX files (URL indexes) from multiple archives
  – Has a Firefox plug-in which allows time-based browsing
  – An initiative of Los Alamos National Laboratory
  – See http://www.mementoweb.org/demo/
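A Memento client asks a TimeGate for a page "as of" some datetime via the Accept-Datetime header (RFC 7089), and reads memento URLs back out of the response's Link header. A sketch of both halves, offline; the Link header below is illustrative:

```python
import re
from datetime import datetime, timezone
from email.utils import format_datetime

def accept_datetime_header(when):
    """RFC 7089 datetime-negotiation header for a TimeGate request."""
    return {"Accept-Datetime": format_datetime(when, usegmt=True)}

def mementos_from_link(link_header):
    """Pull (url, datetime) pairs for rel="memento" entries out of a Link header."""
    return re.findall(
        r'<([^>]+)>;\s*rel="[^"]*memento[^"]*";\s*datetime="([^"]+)"', link_header)

headers = accept_datetime_header(datetime(2008, 1, 1, tzinfo=timezone.utc))

# Illustrative Link header of the kind a TimeGate returns:
link = ('<http://web.archive.org/web/20080105032211/http://www.nla.gov.au/>; '
        'rel="memento"; datetime="Sat, 05 Jan 2008 03:22:11 GMT"')
print(headers["Accept-Datetime"])
print(mementos_from_link(link))
```

A full client sends `headers` in a GET to a TimeGate URL and follows the negotiated redirect; this shows only the header construction and Link parsing.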
Common Use Cases for a Web Archive
• Content discovery
• Nostalgia queries
• Web site restoration and file recovery
• Domain name valuation
• Fall-back for link rot
• Prior-art analysis and patent/copyright infringement research
• Legal cases
• Topic analysis, web trends analysis, popularity analysis, network analysis, linguistic analysis
Create Your Own Archive
• Use a subscription service
• Build your own web archiving infrastructure with open-source software (i.e. Heritrix and Wayback)
• Use web citation services that create archive copies as you bookmark pages
Subscription Services
• archive-it.org (operated by the non-profit Internet Archive since 2006)
• archivethe.net (operated by the non-profit Internet Memory Foundation)
• California Digital Library Web Archiving Service – cdlib.org/services/uc3/was.html
• OCLC Harvester Service – oclc.org/webharvester/overview/default.htm
Install a Web Archiving System Locally
• An easy-to-deploy web archiving toolkit is not yet available
• Institutional web archiving infrastructure is feasible and has been established at a number of universities for use by researchers – though it needs IT systems engineers to set up
• Archives can be deposited with the NLA for long-term preservation
Personal Web Archiving
• WARCreate – a recently released free tool which creates Wayback-consumable WARC files from any web page
• A Google Chrome extension
• Enables preservation by users from their desktop
• Can target content unreachable by crawlers
• Brings WARC to personal digital archiving
• What you do with the WARC files is up to you
• An install suite is provided to set up a local Wayback instance and Memento TimeGate
Current challenges
• Database-driven features and functions
• Complex and varying URI formats and non-standard link implementations, e.g. Twitter
• Dynamically generated, ever-changing URIs for serving the same resources
• Rich media – e.g. streamed media with custom apps and anti-collection measures
• Scripted incremental display and page loading
… more…
• Scripted HTML forms
• Multi-sourced embedded material
• Dynamic authentication, e.g. CAPTCHAs, cross-site authentication, user-sensitive embeds
• Alternate display based on browser, device, or other parameters
• Site architecture designed to inhibit crawling and indexing – if poorly done, even 'polite' harvesters like Heritrix may crash the server
.. but wait, there’s more …
• Server-side scripts and remote procedure calls – the full variety of paths through a site is now often hidden in remote/opaque server-side code; not a new problem, but it now affects 80+% of online resources
• HTML5 WebSockets – effectively codify incremental updates without page reloads
• Mobile publishing
Transactional Web Archiving
• Useful for institutional archiving
  – Best for record-keeping purposes, e.g. when challenged in court about content on a web site
  – Can be used to ensure URL persistence, e.g. when a site has a make-over – can intercept 404s
  – No 'gaps' (c.f. the crawl approach) – every change in accessed content is archived
  – However, requires a code snippet to be installed on the web server
  – Open-source software is being developed by Los Alamos
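The 404-interception idea above can be sketched as a tiny server-side fallback: serve live content when it exists, otherwise redirect the client to an archived copy. The archive URL pattern here is hypothetical, not any real service's scheme:

```python
def archive_fallback(path, live_exists, archive_base="http://archive.example.org/"):
    """Return (status, location): serve live content, else redirect to the archive."""
    if live_exists(path):
        return 200, path
    # Hypothetical archive redirect pattern – real services define their own.
    return 302, archive_base + "redirect?url=" + path

# A page that no longer exists on the live site gets a 302 to the archive:
status, location = archive_fallback("/old-page", lambda p: False)
print(status, location)
```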
Innovation is increasingly driven by large-scale data analysis
• Need fast iteration to understand the right questions to ask
• More minds able to contribute = more value (perceived and real) placed on the importance of the data
• Increased demand for/value of the data = more funding to support it
• Need to surface the information amongst all that data…
Web Data Mining & Analysis – What is it? Why Do It?
Platform & Toolkit: Overview
• Software – Apache Hadoop – Apache Pig
• Data/File format – WARC – CDX – WAT (new!)
Apache Hadoop
• HDFS – Distributed storage – Durable, default 3x replication – Scalable: Yahoo! 60+PB HDFS
• MapReduce – Distributed computation – You write Java functions – Hadoop distributes work across cluster – Tolerates failures
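The map/reduce shape described above (and Hadoop streaming, which lets the functions be written in languages other than Java) can be sketched in plain Python; the toy URL list stands in for real WARC input:

```python
from collections import Counter
from urllib.parse import urlparse

def mapper(urls):
    """Map step: emit a (host, 1) pair for every crawled URL."""
    for url in urls:
        host = urlparse(url.strip()).netloc.lower()
        if host:
            yield host, 1

def reducer(pairs):
    """Reduce step: sum the emitted counts per host."""
    totals = Counter()
    for host, count in pairs:
        totals[host] += count
    return totals

crawl = ["http://nla.gov.au/pandora", "http://nla.gov.au/", "http://example.com/"]
totals = reducer(mapper(crawl))
print(totals)  # Counter({'nla.gov.au': 2, 'example.com': 1})
```

On a cluster, Hadoop shuffles the mapper output so that all pairs for one host reach the same reducer, and re-runs failed tasks transparently.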
File formats and data: WARC
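A WARC (ISO 28500) record is a version line, named headers, a blank line, then the payload. The parser below is a minimal sketch for one uncompressed record; real files concatenate gzipped records, and a response record's payload is the full captured HTTP response:

```python
def parse_warc_record(raw):
    """Split one uncompressed WARC record into version, named headers, payload."""
    head, _, payload = raw.partition("\r\n\r\n")
    lines = head.split("\r\n")
    version = lines[0]  # e.g. "WARC/1.0"
    headers = dict(line.split(": ", 1) for line in lines[1:] if ": " in line)
    return version, headers, payload

# A minimal hand-written record for illustration:
sample = ("WARC/1.0\r\n"
          "WARC-Type: response\r\n"
          "WARC-Target-URI: http://example.com/\r\n"
          "WARC-Date: 2012-08-01T00:00:00Z\r\n"
          "Content-Length: 14\r\n"
          "\r\n"
          "Hello, archive")

version, headers, payload = parse_warc_record(sample)
print(version, headers["WARC-Type"], payload)
```

For production work, use the WARC readers that ship with the Heritrix/Wayback ecosystem rather than hand-rolled parsing.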
File formats and data: CDX
• Index used to browse a WARC-based archive
• Space-delimited text file
• Only the essential metadata needed by Wayback:
  – URL
  – Content digest
  – Capture timestamp
  – Content-Type
  – HTTP response code
  – etc.
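Because CDX lines are space-delimited, one parses with plain string splitting. The field order is declared by each CDX file's own header line; the layout assumed below (urlkey, timestamp, url, mimetype, status, digest) and the digest value are illustrative:

```python
# Assumed field order – check the CDX file's own " CDX ..." header line.
FIELDS = ["urlkey", "timestamp", "url", "mimetype", "statuscode", "digest"]

line = ("au,gov,nla)/ 20080105032211 http://www.nla.gov.au/ "
        "text/html 200 AAAA1234BBBB5678")

record = dict(zip(FIELDS, line.split()))
print(record["url"], record["statuscode"])
```

The 14-digit timestamp (YYYYMMDDhhmmss) is what powers Wayback's time-based URL lookup.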
File formats and data: WAT
• Yet Another Metadata Format! ☺ ☹
• Not a preservation format
• For data exchange and analysis
• Less than a full WARC, more than CDX
• Essential metadata for many types of analysis
• Avoids barriers to data exchange: copyright, privacy
• Work in progress: we want your feedback
File formats and data: WAT
• WAT is WARC ☺
  – WAT records are WARC metadata records
  – The WARC-Refers-To header identifies the original WARC record
• WAT payload is JSON
  – Compact
  – Hierarchical
  – Supported by every programming environment
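Since the WAT payload is JSON, standard tooling reads it directly. The nested key names in this sample are illustrative of the envelope/metadata layering; check real WAT output for the authoritative schema:

```python
import json

# Sample WAT-style payload (key names illustrative, not authoritative):
wat = json.loads('''{
  "Envelope": {
    "WARC-Header-Metadata": {"WARC-Target-URI": "http://example.com/"},
    "Payload-Metadata": {
      "HTTP-Response-Metadata": {
        "HTML-Metadata": {"Links": [{"url": "http://nla.gov.au/"}]}
      }
    }
  }
}''')

# Walk the envelope down to the extracted outlinks of the captured page:
links = (wat["Envelope"]["Payload-Metadata"]
            ["HTTP-Response-Metadata"]["HTML-Metadata"]["Links"])
urls = [link["url"] for link in links]
print(urls)
```

Extracted link lists like this are exactly what network and trend analyses need, without shipping the full (copyright-encumbered) page content.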
File formats & data: relative sizes
• CDX: 53 MB
• WAT: 443 MB
• WARC: 8,651 MB
Some References
• http://en.wikipedia.org/wiki/Web_archiving
• http://netpreserve.org/about/archiveList.php
• Web Archives: The Future(s) – http://www.netpreserve.org/publications/2011_06_IIPC_WebArchives-TheFutures.pdf
• http://matkelly.com/warcreate/
• Common Crawl: http://commoncrawl.org/data/accessing-the-data/
Contacts
• Webarchive @ nla.gov.au
• Secretariat @ internetmemory.org
• Queries about the Internet Archive's web archive: http://iawebarchiving.wordpress.com/
• Queries about the Archive-It service: http://www.archive-it.org/contact-us
• momodei @ nla.gov.au (until 31 Aug 2012) or monica.omodei @ gmail.com