Internet content as research data

Digital Humanities Australia March 2012, Canberra

Monica Omodei & Gordon Mohr

Research Examples

•  Social networking
•  Lexicography
•  Linguistics
•  Network Science
•  Political Science
•  Media Studies
•  Contemporary history

Common Collection Strategies

•  Crawl Scope & Focus
   1)  Thematic/Topical (elections, events, global warming…)
   2)  Resource-specific (video, pdf, etc.)
   3)  Broad survey (domain wide for .com/.net/.org/.edu/.gov)
   4)  Exhaustive (end of life, closure crawls, natl domains)
   5)  Frequency-Based
•  Key Inputs: nominations from subject matter experts, prior crawl data, registry data, trusted directories, Wikipedia

Existing web archives

•  Internet Archive
•  Common Crawl
•  Pandora Archive
•  Internet Memory Foundation Archive
•  Other national archives
•  Research, University Library archives

Internet Archive’s Web Archive

Positives
– Very broad – 175+ billion web instances
– Historic – started 1996
– Publicly accessible
– Time-based URL search
– API access
– Not constrained by legislation – covered by fair use and fast take-down response
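The time-based URL search and API access listed above can be sketched as follows. This is a minimal illustration, not an official client: the two endpoint prefixes are the publicly visible Wayback Machine replay and availability addresses as I understand them (they are not given on the slides), and the function names are hypothetical. Nothing is fetched; the code only builds the URLs.

```python
from datetime import datetime
from urllib.parse import urlencode

# Assumed endpoints (not taken from the slides).
WAYBACK_REPLAY = "http://web.archive.org/web"
WAYBACK_AVAILABLE = "http://archive.org/wayback/available"


def wayback_timestamp(when):
    """Format a datetime as the 14-digit YYYYMMDDhhmmss capture timestamp."""
    return when.strftime("%Y%m%d%H%M%S")


def wayback_replay_url(url, when):
    """Build a replay URL asking for the capture of `url` closest to `when`."""
    return f"{WAYBACK_REPLAY}/{wayback_timestamp(when)}/{url}"


def wayback_availability_query(url, when):
    """Build an availability query for `url` near `when` (response is JSON)."""
    params = urlencode({"url": url, "timestamp": wayback_timestamp(when)})
    return f"{WAYBACK_AVAILABLE}?{params}"


if __name__ == "__main__":
    when = datetime(2012, 3, 1)
    print(wayback_replay_url("http://www.nla.gov.au/", when))
    print(wayback_availability_query("http://www.nla.gov.au/", when))
```

Keeping URL construction separate from fetching makes it easy to check the requests a script would issue before sending any traffic to the archive.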

Internet Archive’s Web Archive Negatives

– Because of size, can’t search by keyword
– Because of size, fully automated – QA not possible

Common Use Cases for IA’s web archive

•  Content discovery
•  Nostalgia queries
•  Web site restoration and file recovery
•  Domain name valuation
•  Collaborative R&D
•  Prior art analysis and patent/copyright infringement research
•  Legal cases
•  Topic analysis, web trends analysis, popularity analysis

Common Crawl

•  Non-profit foundation building an open crawl of the web to seed research and innovation
•  Currently 5 billion pages
•  Stored on Amazon’s S3
•  Accessible via MapReduce processing in Amazon’s EC2 compute cloud
•  Wholesale extraction, transformation, and analysis of web data cheap and easy
•  commoncrawl.org/data/accessing-the-data/

Common Crawl

Negatives
•  Not designed for human browsing but for machine access
•  Objective is to support large-scale analysis and text mining/indexing – not long-term preservation
•  Some costs are involved for direct extraction of data from S3 storage using Requester-Pays API

Pandora Archive

•  Positives
– Quality checked
– Targeted Australian content with selection policy
– Historical – started 1996
– Bibliocentric approach – web sites/publications selected for archiving are catalogued (see Trove)
– Keyword search
– Publicly accessible
– You can nominate Australian web sites for inclusion - pandora.nla.gov.au/registration_form.html

Pandora Archive

•  Negatives
– Labour intensive, so small
– Significant content missed because permission to copy refused
•  Situation will improve markedly if Legal Deposit provisions extended to digital publications
•  Broader coverage will be achieved when infrastructure is upgraded, hence reducing labour costs for checking/fixing crawls

Pandora Archive Stats

•  Size – 6.32 TB
•  Number of Files > 140 million
•  Number of ‘titles’ > 30.5K
•  Number of title instances > 73.5K

.au Domain Annual Snapshots

•  Annual crawls since 2005 commissioned from Internet Archive
•  Includes sites on servers located in Australia as well as .au domain
•  Robots.txt respected except for inline images and stylesheets
•  No public access – researcher access protocols are being developed
•  Full text search – tailored to archive search
•  Separate .gov crawl publicly accessible soon

Australian web domain crawls

Year           2005         2006         2007         2008         2009         2011
Files          185 million  596 million  516 million  1 billion    765 million  660 million
Hosts crawled  811,523      1,046,038    1,247,614    3,038,658    1,074,645    1,346,549
Size (TBs)     6.69         19.04        18.47        34.55        24.29        30.71

Internet Memory Foundation Archive

•  internetmemory.org/en/
•  No keyword search yet – only URL
•  Number of European partners

Other National Archives

•  List of International Internet Preservation Consortium member archives – netpreserve.org/about/archiveList.php
•  Some are whole domain archives, some are selective archives, many are both
•  Some have public access; for others you will need to negotiate access for research
•  Most archives have been collected using the open-source Heritrix crawler and thus use the standard format (WARC, an ISO format)

Research Archives

•  California Digital Library
•  Harvard University Libraries
•  Columbia University Libraries
•  University of North Texas
…. and many more
•  WebCITE - webcitation.org (citation service archive)

Bringing Archives Together

•  Common standard and APIs
•  Memento project

Create your own Archive

•  Use a subscription service
•  Build your own archive using the open-source crawler Heritrix and standard file format .warc
•  Use web citation services that create archive copies as you bookmark pages

Subscription Services

•  archive-it.org (service operated by non-profit Internet Archive since 2006)
•  archivethe.net (service operated by non-profit Internet Memory Foundation)
•  California Digital Library Web Archiving Service - cdlib.org/services/uc3/was.html
•  OCLC Harvester Service - oclc.org/webharvester/overview/default.htm

Install web archiving system locally

•  Easy-to-deploy web archiving toolkit not yet available (that meets web archive standards)
•  Institutional web archiving infrastructure is feasible and has been established at a number of universities for use by researchers – needs IT systems engineers to set up though
•  Archives can be deposited with the NLA for long-term preservation

'Memento': adding time to the web

Protocol and browser add-on (MementoFox)
•  Aids discovery, aggregation of page histories
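The core idea of the Memento protocol, as I understand it, is datetime negotiation: a client asks for a resource as it existed at a given time by sending an Accept-Datetime request header. The sketch below only builds that header; the function name is hypothetical, and the server it would be sent to (a "TimeGate" in Memento terms) is not specified on the slides.

```python
from datetime import datetime, timezone
from email.utils import format_datetime


def memento_request_headers(when):
    """Build HTTP headers asking for the capture nearest to `when`.

    The header value uses the HTTP date format in GMT,
    e.g. "Thu, 01 Mar 2012 12:00:00 GMT".
    """
    # format_datetime(usegmt=True) requires an aware datetime in UTC.
    when_utc = when.astimezone(timezone.utc)
    return {"Accept-Datetime": format_datetime(when_utc, usegmt=True)}


if __name__ == "__main__":
    wanted = datetime(2012, 3, 1, 12, 0, tzinfo=timezone.utc)
    print(memento_request_headers(wanted))
```

Because the negotiation rides on an ordinary HTTP header, the same request can in principle be sent to any archive that implements the protocol, which is what makes aggregation across archives possible.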

Innovation is increasingly driven by large-scale data analysis

•  Need fast iteration to understand the right questions to ask
•  More minds able to contribute = more value (perceived and real) placed on the importance of the data
•  Increased demand for/value of the data = more funding to support it
•  Need to surface the information amongst all that data…

Web Data Mining & Analysis – What is it? Why Do It?

Platform & Toolkit: Overview

•  Software
– Apache Hadoop
– Apache Pig
•  Data/File format
– WARC
– CDX
– WAT (new!)

Apache Hadoop

•  HDFS
– Distributed storage
– Durable, default 3x replication
– Scalable: Yahoo! 60+PB HDFS
•  MapReduce
– Distributed computation
– You write Java functions
– Hadoop distributes work across cluster
– Tolerates failures
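To make the MapReduce model concrete, here is a single-process sketch of its three phases (map, group by key, reduce), counting captured URLs per host. It is written in Python for brevity and does not use Hadoop; on a real cluster the map and reduce functions would be written in Java, as the slide says, and Hadoop would run them across machines. All function names and sample URLs are illustrative.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def map_capture(url):
    """Map step: emit one (host, 1) pair for a captured URL."""
    yield urlsplit(url).hostname, 1


def reduce_counts(host, counts):
    """Reduce step: sum all the counts emitted for one host."""
    return host, sum(counts)


def run_job(urls):
    """Run map, group-by-key (the 'shuffle'), then reduce, in one process."""
    groups = defaultdict(list)
    for url in urls:
        for key, value in map_capture(url):
            groups[key].append(value)
    return dict(reduce_counts(host, counts) for host, counts in groups.items())


if __name__ == "__main__":
    sample_urls = [
        "http://example.org/a",
        "http://example.org/b",
        "http://example.com/",
    ]
    print(run_job(sample_urls))
```

The map and reduce functions never share state, which is the property that lets a framework distribute them and rerun failed pieces independently.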

File formats and data: WARC

File formats and data: CDX

•  Index for Wayback Machine: used to browse WARC-based archive
•  Space-delimited text file
•  Only essential metadata needed by Wayback
– URL
– Content Digest
– Capture Timestamp
– Content-Type
– HTTP response code
– etc.
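Since a CDX file is space-delimited text, one line can be split into named fields in a few lines of code. The sketch below assumes an 11-column layout that I believe is common in Wayback deployments; the actual column order is declared in each CDX file's header line and may differ, the field names are my own labels, and the sample line is made up for illustration.

```python
# Assumed column order; check the header line of the CDX file in use.
CDX_FIELDS = (
    "urlkey", "timestamp", "original", "mimetype", "statuscode",
    "digest", "redirect", "metatags", "length", "offset", "filename",
)


def parse_cdx_line(line):
    """Split one space-delimited CDX line into a dict keyed by field name."""
    values = line.split()
    if len(values) != len(CDX_FIELDS):
        raise ValueError(f"expected {len(CDX_FIELDS)} fields, got {len(values)}")
    return dict(zip(CDX_FIELDS, values))


if __name__ == "__main__":
    # Invented example line, not taken from a real index.
    sample = (
        "au,gov,nla)/ 20120301000000 http://www.nla.gov.au/ text/html 200 "
        "AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA - - 1234 5678 example.warc.gz"
    )
    record = parse_cdx_line(sample)
    print(record["timestamp"], record["statuscode"], record["filename"])
```

The offset and filename columns are what let a replay tool jump straight to the matching record inside a WARC file without scanning it.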

File formats and data: WAT

•  Yet Another Metadata Format! ☺ ☹
•  Not preservation format
•  Data exchange and analysis
•  Less than full WARC, more than CDX
•  Essential metadata for many types of analysis
•  Avoids barriers to data exchange: copyright, privacy
•  Work-in-progress: we want your feedback

File formats and data: WAT

•  WAT is WARC ☺
– WAT records are WARC metadata records
– WARC-Refers-To header identifies original WARC record
•  WAT payload is JSON
– Compact
– Hierarchical
– Supported by every programming environment
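The structure described above (a WARC metadata record whose payload is JSON and whose WARC-Refers-To header points at the original record) can be sketched with a hand-rolled writer and reader. This is a simplified illustration, not a conforming WARC implementation: it omits headers a real record requires, such as WARC-Record-ID and WARC-Date, and the record ID and JSON keys shown are hypothetical rather than the actual WAT schema.

```python
import json


def build_wat_record(refers_to, metadata):
    """Serialise `metadata` as the JSON payload of a WARC metadata record."""
    payload = json.dumps(metadata).encode("utf-8")
    header_lines = [
        "WARC/1.0",
        "WARC-Type: metadata",
        f"WARC-Refers-To: {refers_to}",
        "Content-Type: application/json",
        f"Content-Length: {len(payload)}",
    ]
    head = "\r\n".join(header_lines).encode("utf-8")
    # A blank line separates headers from payload; two CRLFs end the record.
    return head + b"\r\n\r\n" + payload + b"\r\n\r\n"


def parse_wat_record(raw):
    """Return (headers, decoded JSON payload) for one record built as above."""
    head, _, rest = raw.partition(b"\r\n\r\n")
    _version, *header_lines = head.decode("utf-8").split("\r\n")
    headers = dict(line.split(": ", 1) for line in header_lines)
    payload = rest[: int(headers["Content-Length"])]
    return headers, json.loads(payload)


if __name__ == "__main__":
    # Hypothetical record ID and metadata, for illustration only.
    record = build_wat_record(
        "<urn:uuid:00000000-0000-0000-0000-000000000000>",
        {"url": "http://example.org/", "links": ["http://example.org/about"]},
    )
    headers, metadata = parse_wat_record(record)
    print(headers["WARC-Refers-To"], metadata["links"])
```

Because the payload is plain JSON, an analysis job can read link or header metadata without touching the archived content itself.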

File formats & data:

•  CDX: 53 MB
•  WAT: 443 MB
•  WARC: 8,651 MB
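Assuming the three figures above describe the same collection in each format (the slide does not say so explicitly), the relative sizes can be worked out directly. The short calculation below uses only the numbers on the slide; the function name is illustrative.

```python
# Sizes in MB, copied from the slide.
SIZES_MB = {"CDX": 53, "WAT": 443, "WARC": 8651}


def relative_size(name, baseline="WARC"):
    """Return the size of `name` as a fraction of the baseline format."""
    return SIZES_MB[name] / SIZES_MB[baseline]


if __name__ == "__main__":
    for name in ("CDX", "WAT"):
        print(f"{name}: {relative_size(name):.2%} of WARC")
```

On these figures the WAT data is roughly 5% of the full WARC size and the CDX index well under 1%, which is the practical argument for exchanging WAT files when full content is not needed.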

Some References

•  http://en.wikipedia.org/wiki/Web_archiving
•  http://netpreserve.org/about/archiveList.php
•  Web Archives: The Future(s) - http://www.netpreserve.org/publications/2011_06_IIPC_WebArchives-TheFutures.pdf

Contacts

•  Webarchive @ nla.gov.au
•  Secretariat @ internetmemory.org
•  Queries about the Internet Archive web archive http://iawebarchiving.wordpress.com/
•  Queries about Archive-It service http://www.archive-it.org/contact-us
•  momodei @ nla.gov.au
•  gojomo @ xavvy.com
