Archiving the Web – The Bibliothèque nationale de France’s « L’archivage du Web » Bert Wendland Bibliothèque nationale de France


Page 1: Archiving the Web –  The Bibliothèque nationale de France’s  « L’archivage du Web »

Archiving the Web – The Bibliothèque nationale de France’s « L’archivage du Web »

Bert Wendland
Bibliothèque nationale de France

Page 2: Archiving the Web –  The Bibliothèque nationale de France’s  « L’archivage du Web »

Who I am / who we are

> Bert Wendland
  > Crawl engineer in the IT department of BnF

> Semi-joint working group
  > Legal Deposit department
    > 1 head of group
    > 4 librarians
  > IT department
    > 1 project coordinator
    > 1 developer
    > 2 crawl engineers

> A network of 80 digital curators

23 May 2013 Archiving the Web 2

Page 3: Archiving the Web –  The Bibliothèque nationale de France’s  « L’archivage du Web »

27th November 2012 Session 4 - Web archiving for decision-makers 3


Page 16: Archiving the Web –  The Bibliothèque nationale de France’s  « L’archivage du Web »

28th November 2012 Session 5 - Integrating web archiving in IT operations 16

Agenda

> Context: I will present the BnF and web archiving as part of its legal mission.

> Concepts: I will describe how we operationalise the task of collecting and preserving the French web in terms of data, and how this relates to the general web archive at www.archive.org.

> Infrastructure: I will give an overview of the infrastructure that supports this task.

> Data acquisition: I will describe our mixed model of web harvesting that combines broad crawls and selective crawls to achieve a good trade-off between breadth and depth in coverage and temporal granularity.

> Data storage and access: I will describe the indexing structures that allow users to query this web archive.

Page 17: Archiving the Web –  The Bibliothèque nationale de France’s  « L’archivage du Web »

Context

The BnF and web archiving as part of its legal mission

Page 18: Archiving the Web –  The Bibliothèque nationale de France’s  « L’archivage du Web »

The BnF

> Bibliothèque nationale de France
  > About 30 million books, periodicals and others
    > 10 million at the new site
    > Yearly 60,000 new books
  > 400 TB of data in the web archive
    > 100 TB of new data every year

> Two sites
  > Old site « Richelieu » in the centre of Paris
  > New site « François-Mitterrand » since 1996

> Two levels at the new site
  > Study library (« Haut-de-jardin »): open stacks
  > Research library (« Rez-de-jardin »): access to all collections, including web archives

Page 19: Archiving the Web –  The Bibliothèque nationale de France’s  « L’archivage du Web »

The legal deposit

1368  Royal manuscripts of King Charles V in the Louvre
1537  Legal deposit by King Francis I: all editors should send copies of their productions to the royal library
1648  Legal deposit extended to maps and plans
1793  Musical scores
1925  Photographs and gramophone records
1975  Video recordings
1992  CD-ROMs and electronic documents
2002  Websites (experimentally)
2006  Websites (in production)

Page 20: Archiving the Web –  The Bibliothèque nationale de France’s  « L’archivage du Web »

Extension of the Legal deposit Act in 2006

> Coverage (article 39)
  “Signs, signals, writings, images, sounds or messages of any kind that are communicated to the public by electronic means are also subject to legal deposit.”

> Conditions (article 41 II)
  “The depository institutions collect the signs, signals, writings, images, sounds or messages of any kind made available to the public or to categories of the public, … They may carry out this collection themselves using automatic procedures, or determine its terms in agreement with these persons.”

> Responsibilities (article 50)
  > INA (Institut national de l'audiovisuel) for radio and TV websites
  > BnF for anything else

> No permission is required to collect, but access to the archive is restricted to in-house use

> The goal is not to gather all or the “best of the Web”, but to preserve a representative collection of the Web at a certain date

Page 21: Archiving the Web –  The Bibliothèque nationale de France’s  « L’archivage du Web »

Concepts

How we collect and preserve the French web

Page 22: Archiving the Web –  The Bibliothèque nationale de France’s  « L’archivage du Web »


Page 23: Archiving the Web –  The Bibliothèque nationale de France’s  « L’archivage du Web »

The Internet Archive

> Non-profit organisation, founded 1996 by Brewster Kahle in San Francisco

> Stated mission of “universal access to all knowledge”
  > Websites, but also other media like scanned books, movies, audio collections, …
  > Web archiving from the beginning, only 4 years after the start of the WWW

> Main technologies for web archiving:
  > Heritrix: the crawler
  > Wayback Machine: access the archive

Page 24: Archiving the Web –  The Bibliothèque nationale de France’s  « L’archivage du Web »

Partnership BnF – IA

> A five-year partnership between 2004 and 2008

> Data
  > 2 focused crawls and 5 broad crawls on behalf of BnF
  > Extraction of historical Alexa data concerning .fr back to 1996

> Technology
  > Heritrix
  > Wayback Machine
  > 5 Petaboxes

> Know-how
  > Installation of Petaboxes by engineers of IA
  > Presence of an IA crawl engineer one day a week for 6 months

Page 25: Archiving the Web –  The Bibliothèque nationale de France’s  « L’archivage du Web »

How search engines work

Source : www.brightplanet.com

Archiving the Web means archiving the files, the links and some metadata.

Page 26: Archiving the Web –  The Bibliothèque nationale de France’s  « L’archivage du Web »

How the web crawler works

[Diagram: the crawl loop]

> “Seeds” (http://www.site-untel.fr, http://www.monblog.fr) are placed in the queue of URLs

> Web crawler (“Heritrix”), for each URL in the queue:
  > Connection to the page
  > Storing the data (storage)
  > Extraction of links

> Discovered URLs (http://www.unautre-site.fr, http://www.autre-blog.fr) are checked against the verification parameters:
  > YES: added to the queue of URLs
  > NO: URL rejected
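The loop in this diagram can be sketched in a few lines of Python. This is only an illustration of the principle, not how Heritrix is implemented: the `FAKE_WEB` dictionary is a hypothetical stand-in for the network, and the `in_scope` rule (accept only hosts under .fr) is an assumption standing in for the real verification parameters.

```python
from collections import deque
from urllib.parse import urlparse

# Hypothetical stand-in for the live web: URL -> links found on that page.
FAKE_WEB = {
    "http://www.site-untel.fr": ["http://www.unautre-site.fr", "http://www.example.com"],
    "http://www.monblog.fr": ["http://www.autre-blog.fr", "http://www.site-untel.fr"],
    "http://www.unautre-site.fr": [],
    "http://www.autre-blog.fr": [],
}


def in_scope(url):
    """Assumed verification rule: keep only hosts under .fr."""
    host = urlparse(url).hostname or ""
    return host.endswith(".fr")


def crawl(seeds):
    queue = deque(seeds)   # queue of URLs, initialised with the seeds
    seen = set(seeds)      # every URL already examined, to avoid loops
    storage = {}           # URL -> stored data
    rejected = []          # discovered URLs that failed verification
    while queue:
        url = queue.popleft()
        links = FAKE_WEB.get(url, [])  # "connection to the page"
        storage[url] = links           # "storing the data"
        for link in links:             # "extraction of links"
            if link in seen:
                continue
            seen.add(link)
            if in_scope(link):
                queue.append(link)     # YES: back into the queue
            else:
                rejected.append(link)  # NO: URL rejected
    return storage, rejected


storage, rejected = crawl(["http://www.site-untel.fr", "http://www.monblog.fr"])
print(len(storage), rejected)
```

The `seen` set is what keeps the crawl finite: a page that links back to a seed does not put that seed in the queue a second time.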

Page 27: Archiving the Web –  The Bibliothèque nationale de France’s  « L’archivage du Web »

Current production workflow

[Diagram. Workflow steps shown: Planning, Monitoring, Access, Indexing, Validation, Crawling, Experience, Quality Assurance, Preservation, Selection. Applications shown: BCWeb, NetarchiveSuite, Heritrix, Wayback Machine, SPAR, Indexing Process, VMware, NAS_qual, NAS_preload.]

Page 28: Archiving the Web –  The Bibliothèque nationale de France’s  « L’archivage du Web »


Page 29: Archiving the Web –  The Bibliothèque nationale de France’s  « L’archivage du Web »

Applications

> BCWeb (“BnF Collecte du Web”)
  > BnF in-house development
  > Selection tool for librarians: proposition of URLs to collect for selective crawls
  > Technical validation of URLs by digital curators
  > Definition of collection packages
  > Transfer to NetarchiveSuite

> NAS_preload (“NetarchiveSuite Pre-Load”)
  > BnF in-house development
  > Preparation of broad crawls, based on a list of officially registered domains by AFNIC

Page 30: Archiving the Web –  The Bibliothèque nationale de France’s  « L’archivage du Web »

Applications

> NetarchiveSuite
  > Open source application
  > Collaborative work of:
    > BnF
    > The two national deposit libraries in Denmark (the Royal Library in Copenhagen and the State and University Library in Aarhus)
    > Austrian National Library (ÖNB)
  > Central and main application of the archiving process
    > Planning the crawls
    > Creating and launching jobs
    > Monitoring
    > Quality Assurance
    > Experience evaluation

Page 31: Archiving the Web –  The Bibliothèque nationale de France’s  « L’archivage du Web »

Applications

> Heritrix
  > Open source application by Internet Archive
  > Its name is an archaic English word for heiress (woman who inherits)
  > A crawl is configured as a job in Heritrix, which consists mainly of:
    > a list of URLs to start from (the seeds)
    > a scope (collect all URLs in the domain of a seed, stay on the same host, only a particular web page, etc.)
    > a set of filters to exclude unwanted URLs from the crawl
    > a list of extractors (to extract URLs from HTML, CSS, JavaScript)
    > many other technical parameters, for instance to define the “politeness” of a crawl or whether or not to obey a website’s robots.txt file
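To make the parts of a job concrete, here is a minimal Python sketch of seeds, scope and filters working together. It is not Heritrix's real configuration format: the `CrawlJob` class, its field names, the three scope values and the default delay are all hypothetical, chosen only to mirror the list above.

```python
import re
from dataclasses import dataclass, field
from urllib.parse import urlparse


@dataclass
class CrawlJob:
    """Hypothetical model of a crawl job; not Heritrix's real format."""
    seeds: list
    scope: str = "domain"  # assumed values: "domain", "host" or "page"
    exclude_patterns: list = field(default_factory=list)  # filters (regexes)
    delay_seconds: float = 2.0   # assumed "politeness" parameter
    obey_robots_txt: bool = True

    def accepts(self, url):
        """True if the URL passes the filters and is in scope of a seed."""
        if any(re.search(p, url) for p in self.exclude_patterns):
            return False
        host = urlparse(url).hostname or ""
        for seed in self.seeds:
            seed_host = urlparse(seed).hostname or ""
            if self.scope == "page" and url == seed:
                return True
            if self.scope == "host" and host == seed_host:
                return True
            if self.scope == "domain":
                # Crude domain rule: compare the last two labels of the hosts.
                if host.split(".")[-2:] == seed_host.split(".")[-2:]:
                    return True
        return False


job = CrawlJob(
    seeds=["http://www.monblog.fr"],
    scope="domain",
    exclude_patterns=[r"/calendar/"],
)
print(job.accepts("http://images.monblog.fr/a.png"))
print(job.accepts("http://www.monblog.fr/calendar/2013"))
```

With the "domain" scope the image host is accepted because it shares the seed's domain, while the calendar URL is rejected by the filter before the scope is even checked.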

Page 32: Archiving the Web –  The Bibliothèque nationale de France’s  « L’archivage du Web »

Applications

> The Wayback Machine
  > Open source application by Internet Archive
  > Gives access to the archived data

> SPAR (“Système de Préservation et d’Archive Réparti”)
  > Not really an application, it is the BnF’s digital repository
  > Long-term preservation system for digital objects, compliant with the OAIS (Open Archival Information System) standard, ISO 14721

Page 33: Archiving the Web –  The Bibliothèque nationale de France’s  « L’archivage du Web »

Applications

> NAS_qual (“NetarchiveSuite Quality Assurance”)
  > BnF in-house development
  > Indicators and statistics about the crawls

> The Indexing Process
  > Chain of shell scripts, developed in-house by BnF

Page 34: Archiving the Web –  The Bibliothèque nationale de France’s  « L’archivage du Web »

Data and process model


Page 35: Archiving the Web –  The Bibliothèque nationale de France’s  « L’archivage du Web »

Daily operations: same steps, different actions

Curators

> Monitoring: dashboard in NetarchiveSuite, filters in Heritrix, answers to webmasters' requests

> Quality assurance: analysis of indicators, visual check in the Wayback Machine

> Experience: reports on harvests concerning contents and website descriptions

Engineers

> Monitoring: dashboard in Nagios, operations on virtual machines, information to give to webmasters

> Quality assurance: production of indicators

> Experience: reports on harvests concerning IT operations

Page 36: Archiving the Web –  The Bibliothèque nationale de France’s  « L’archivage du Web »

Challenges

> What is the French web?
  > Not only .fr, also .com or .org

> Some data remain difficult to harvest
  > Streaming, databases, videos, JavaScript
  > Dynamic web pages
  > Contents protected by passwords
  > Complex instructions for Dailymotion, paid contents for newspapers

Page 37: Archiving the Web –  The Bibliothèque nationale de France’s  « L’archivage du Web »

Infrastructure

The machines that support the task

Page 38: Archiving the Web –  The Bibliothèque nationale de France’s  « L’archivage du Web »

Platforms

[Diagram: the Operational Platform. Labels: Pilot (Database: PostgreSQL; Application: NAS), Indexer master, three Indexers; machines with Linux]

Page 39: Archiving the Web –  The Bibliothèque nationale de France’s  « L’archivage du Web »

Platforms

> Operational Platform: PFO
  > 1 pilot, 1 indexer master, 2 to 10 indexers, 20 to 70 crawlers
  > Variable and scalable number of computers

> Trial Run Platform: MAB (Marche À Blanc, Trial Run)
  > Identical in setup to the PFO, the MAB aims to simulate and test harvests in real conditions for our curator team
  > Its size is also variable and subject to change

> Pre-production Platform: PFP
  > A technical test platform for the use of our engineering team

Page 40: Archiving the Web –  The Bibliothèque nationale de France’s  « L’archivage du Web »

Platforms

Our needs:

> Flexibility regarding the number of crawlers allocated to a platform

> Hardware resources sharing and optimisation

> All classical needs of production environments such as robustness and reliability

Solution: Virtualisation!

> Virtual computers

> Configuration « templates »

> Resource pool grouping of the computers

> Automatic management of all shared resources

[Diagram: a hypervisor on top of physical machines 1 to 9]

Page 41: Archiving the Web –  The Bibliothèque nationale de France’s  « L’archivage du Web »

The DL-WEB cluster

[Diagram: the DL-WEB cluster, shared resources on top of physical machines 1 to 9]

Page 42: Archiving the Web –  The Bibliothèque nationale de France’s  « L’archivage du Web »

Dive into the hardware

> 2 x 9 RAM modules of 4 GB = 72 GB RAM per machine

> 2 sockets; on every socket, 1 CPU; 2 cores; 4 threads

> Total of 16 logical CPUs per machine

Page 43: Archiving the Web –  The Bibliothèque nationale de France’s  « L’archivage du Web »

Physical Machines

Per machine:

> RAM: 2 x 9 x 4 GB = 72 GB
> CPU: 2 x 2 x 4 = 16 logical CPUs

Cluster of 9 machines:

> RAM: 9 x 72 GB = 648 GB
> CPU: 9 x 16 = 144 logical CPUs
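The cluster totals follow directly from the per-machine figures. The short Python check below reproduces the arithmetic; the figures come from the slide, while the variable names are ours.

```python
# Figures taken from the slide; the variable names are illustrative.
MACHINES = 9
RAM_MODULES_PER_MACHINE = 2 * 9       # 2 x 9 modules
GB_PER_MODULE = 4
LOGICAL_CPUS_PER_MACHINE = 2 * 2 * 4  # the slide's 2 x 2 x 4

ram_per_machine_gb = RAM_MODULES_PER_MACHINE * GB_PER_MODULE
cluster_ram_gb = MACHINES * ram_per_machine_gb
cluster_cpus = MACHINES * LOGICAL_CPUS_PER_MACHINE

print(ram_per_machine_gb, cluster_ram_gb, cluster_cpus)  # 72 648 144
```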

Page 44: Archiving the Web –  The Bibliothèque nationale de France’s  « L’archivage du Web »

Park of virtual machines

               PFO  MAB  PFP
pilot            1    1    1
index-server     1    1    1
index-master     1    -    -
crawler         70   70   10
indexer         10    -    -
heritrix         5    5    5
free             5    5    5
total           93   82   22   (197 virtual machines in all)

Page 45: Archiving the Web –  The Bibliothèque nationale de France’s  « L’archivage du Web »

Distributed Resource Scheduler (DRS) and vMotion

> A virtual machine is hosted on a single physical server at a given time.

> If the load of the VMs hosted on one of the servers becomes too heavy, some of the VMs are moved onto another host dynamically and without interruption.

> If one of the hosts fails, all the VMs hosted on this server are moved to other hosts and rebooted.

Page 46: Archiving the Web –  The Bibliothèque nationale de France’s  « L’archivage du Web »

Fault tolerance (FT)

> An active copy of the FT VM runs on another server
> If the server where the master VM is hosted fails, the ghost VM instantly takes control without interruption
> A copy is then created on a third server
> The other VMs are moved and restarted

Fault Tolerance can be quite greedy regarding resources, especially network bandwidth. That is why we have activated this functionality only for the pilot machine.

Page 47: Archiving the Web –  The Bibliothèque nationale de France’s  « L’archivage du Web »

Data acquisition

Our mixed model of web harvesting

Page 48: Archiving the Web –  The Bibliothèque nationale de France’s  « L’archivage du Web »

BnF “mixed model” of harvesting

[Chart: number of websites over the calendar year]

> Broad crawls: once a year; .fr domains and beyond

> Ongoing crawls: running throughout the year; news or reference websites

> Project crawls: one-shots; related to an event or a theme

Page 49: Archiving the Web –  The Bibliothèque nationale de France’s  « L’archivage du Web »

Aggregation of a large number of sources

> In 2012:
  > 2.4 million domains in .fr and .re, provided by AFNIC (Association française pour le nommage Internet en coopération – the French domain name allocation authority)
  > 3,000 domains in .nc, provided by OPT-NC (Office des postes et télécommunications de Nouvelle-Calédonie – the office of telecommunications of New Caledonia)
  > 2.6 million domains already present in NetarchiveSuite database
  > 13,000 domains from the selection of URLs by BnF librarians (in BCWeb)
  > 6,000 domains from other workflows of the Library that contain URLs as part of the metadata: publishers’ declarations for books and periodicals, the BnF catalogue, identification of new periodicals by librarians, print periodicals that move to online publishing, and others

> After de-duplication, this generated a list of 3.3 million unique domains
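The figures above add up to roughly 5 million entries before de-duplication, against 3.3 million unique domains afterwards. The Python sketch below shows one way such a merge could work on toy data; the `normalise` rule (trim, lower-case, drop a trailing dot) and the function names are assumptions, not a description of NAS_preload.

```python
def normalise(domain):
    """Assumed normalisation: trim whitespace, lower-case, drop a trailing dot."""
    return domain.strip().lower().rstrip(".")


def merge_sources(sources):
    """Merge named lists of domains into one de-duplicated mapping:
    domain -> set of the sources that contributed it."""
    origin = {}
    for name, domains in sources.items():
        for domain in domains:
            origin.setdefault(normalise(domain), set()).add(name)
    return origin


# Toy data; the real lists hold millions of entries.
sources = {
    "AFNIC": ["exemple.fr", "monblog.fr"],
    "NetarchiveSuite": ["MonBlog.fr", "ancien-site.fr"],
    "BCWeb": ["monblog.fr ", "journal.com"],
}
merged = merge_sources(sources)
print(len(merged))  # 4 unique domains out of 6 entries
```

Keeping the set of contributing sources for each domain, rather than a bare set of domains, makes it possible to tell afterwards whether a domain came from a registry, from an earlier crawl or from a librarian's selection.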

Page 50: Archiving the Web –  The Bibliothèque nationale de France’s  « L’archivage du Web »

Volume of collections

> Seven broad crawls since 2004
> 1996-2005 collections thanks to Internet Archive
> Tens of thousands of focus-crawled websites since 2002
> Total size
  > 20 billion URLs
  > 400 Terabytes

Page 51: Archiving the Web –  The Bibliothèque nationale de France’s  « L’archivage du Web »

Volume of collections

[Charts: number of collected URLs and number of collected TB per year]

> Selective crawls, 2007 to 2012
> Broad crawls, 2010 to 2012
> All crawls, 2010 to 2012

Page 52: Archiving the Web –  The Bibliothèque nationale de France’s  « L’archivage du Web »

Data storage and access

The indexing structures and how users query the web archive

Page 53: Archiving the Web –  The Bibliothèque nationale de France’s  « L’archivage du Web »

Data access: the Wayback Machine

[Diagram: request flow in 14 numbered steps between the client (browser), the web server (web interface, URL server), the CDX machines (CDX server; CDX and PATH files) and the data storage machines (data server; ARC files)]

Page 54: Archiving the Web –  The Bibliothèque nationale de France’s  « L’archivage du Web »

The ARC files

File description

For every collected URL: URL, IP-address, Archive-date, Content-type, Archive-length, HTTP headers and HTML code

filedesc://IA-001102.arc 0 19960923142103 text/plain 76
1 0 Alexa Internet
http://www.dryswamp.edu:80/index.html 127.10.100.2 19961104142103 text/html 202
HTTP/1.0 200 Document follows
Date: Mon, 04 Nov 1996 14:21:06 GMT
Server: NCSA/1.4.1
Content-type: text/html
Last-modified: Sat,10 Aug 1996 22:33:11 GMT
Content-length: 30
<HTML>Hello World!!!</HTML>
http://www. …
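The line that introduces each collected URL holds the five space-separated fields named above. As a minimal sketch, it can be parsed in Python as follows; the `UrlRecord` type and the `parse_url_record` function are illustrative names, and the code handles only this one header line, not a whole ARC file.

```python
from datetime import datetime
from typing import NamedTuple


class UrlRecord(NamedTuple):
    url: str
    ip_address: str
    archive_date: datetime
    content_type: str
    archive_length: int


def parse_url_record(line):
    """Parse one URL-record line: five space-separated fields."""
    url, ip, date, content_type, length = line.split(" ")
    return UrlRecord(
        url,
        ip,
        datetime.strptime(date, "%Y%m%d%H%M%S"),  # 14-digit timestamp
        content_type,
        int(length),
    )


rec = parse_url_record(
    "http://www.dryswamp.edu:80/index.html 127.10.100.2 19961104142103 text/html 202"
)
print(rec.archive_date.isoformat(), rec.archive_length)  # 1996-11-04T14:21:03 202
```

The Archive-length field (202 here) tells a reader how many bytes to consume after the header line before the next record begins.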

Page 55: Archiving the Web –  The Bibliothèque nationale de France’s  « L’archivage du Web »

ARC file format

[Diagram: annotated ARC file. Labels: ARC record, URL-record, network_doc, protocol response, object, version block header, filedesc, URL-record-definition, version-1-block]

filedesc://IA-001102.arc 0 19960923142103 text/plain 76
1 0 Alexa Internet
URL IP-address Archive-date Content-type Archive-length
http://www.dryswamp.edu:80/index.html 127.10.100.2 19961104142103 text/html 202
HTTP/1.0 200 Document follows
Date: Mon, 04 Nov 1996 14:21:06 GMT
Server: NCSA/1.4.1
Content-type: text/html
Last-modified: Sat,10 Aug 1996 22:33:11 GMT
Content-length: 30
<HTML>
Hello World!!!
</HTML>

Page 56: Archiving the Web –  The Bibliothèque nationale de France’s  « L’archivage du Web »

The CDX files

Indexation of the ARC files

CDX A b e m s c V v D d g n
0-0-0checkmate.com/Bugs/Bug_Investigators.html 20010424210551 209.52.183.152 text/html 200 58670fbe7432c5bed6f3dcd7ea32b221 17130110 59129865 1927657 6501523 DE_crawl6.20010424210458 5750

> A = canonized URL
> b = date
> e = IP
> m = mime type
> s = response code
> c = checksum
> V = compressed arc file offset
> v = uncompressed arc file offset
> D = compressed dat file offset
> d = uncompressed dat file offset
> g = file name
> n = arc document length
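Because the header line names the fields, a CDX line can be decoded generically. The Python sketch below maps the letters of the header to the meanings given in the legend; the dictionary keys and the `parse_cdx` function are illustrative names, not part of any Wayback Machine API.

```python
# Field letters and their meaning, as given in the legend on the slide.
CDX_FIELDS = {
    "A": "canonized_url", "b": "date", "e": "ip", "m": "mime_type",
    "s": "response_code", "c": "checksum",
    "V": "compressed_arc_offset", "v": "uncompressed_arc_offset",
    "D": "compressed_dat_offset", "d": "uncompressed_dat_offset",
    "g": "file_name", "n": "arc_document_length",
}


def parse_cdx(header, line):
    """Map the fields of one CDX line to names, using the header's letters."""
    letters = header.split()[1:]  # drop the leading "CDX" marker
    values = line.split()
    if len(letters) != len(values):
        raise ValueError("CDX line does not match its header")
    return {CDX_FIELDS[letter]: value for letter, value in zip(letters, values)}


header = "CDX A b e m s c V v D d g n"
line = ("0-0-0checkmate.com/Bugs/Bug_Investigators.html 20010424210551 "
        "209.52.183.152 text/html 200 58670fbe7432c5bed6f3dcd7ea32b221 "
        "17130110 59129865 1927657 6501523 DE_crawl6.20010424210458 5750")
entry = parse_cdx(header, line)
print(entry["file_name"], entry["compressed_arc_offset"])
```

The two values printed are the ones needed to fetch the document: which ARC file it is in, and at which offset in the compressed file its record starts.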

Page 57: Archiving the Web –  The Bibliothèque nationale de France’s  « L’archivage du Web »

The PATH files

Location of the ARC files

ARC file name, location

DE_crawl6.20010424210458 /dlwebdata/01002/DE_crawl6.20010424210458.arc.gz
IA-001102.arc /dlwebdata/01003/IA-001102.arc
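The PATH file is the second lookup after the CDX index: the file name found in a CDX line (field g) is resolved to a location on disk. A minimal Python sketch, assuming each line holds a name, whitespace, then a location; `load_path_index` is an illustrative name.

```python
def load_path_index(lines):
    """Build a dict: ARC file name -> location.
    Assumed line format: name, whitespace, location."""
    index = {}
    for line in lines:
        name, location = line.split(maxsplit=1)
        index[name] = location.strip()
    return index


path_lines = [
    "DE_crawl6.20010424210458 /dlwebdata/01002/DE_crawl6.20010424210458.arc.gz",
    "IA-001102.arc /dlwebdata/01003/IA-001102.arc",
]
paths = load_path_index(path_lines)
print(paths["DE_crawl6.20010424210458"])
```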

Page 58: Archiving the Web –  The Bibliothèque nationale de France’s  « L’archivage du Web »

Indexing the data

Page 59: Archiving the Web –  The Bibliothèque nationale de France’s  « L’archivage du Web »

Binary Search

> Sorted list of data
> O(log n)
  > a maximum of 35 search operations for 20 billion lines!
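The figure of 35 comes from the base-2 logarithm: each comparison halves the remaining range, and log2 of 20 billion is a little over 34. The Python sketch below checks that bound and shows a binary search on a small sorted list; the sample index lines are invented, and `bisect_left` from the standard library stands in for whatever search the index servers actually use.

```python
import math
from bisect import bisect_left


def worst_case_steps(n):
    """Upper bound on the halvings needed to search n sorted lines."""
    return math.ceil(math.log2(n))


def find(sorted_lines, key):
    """Return the first line starting with key, or None; O(log n) comparisons."""
    i = bisect_left(sorted_lines, key)
    if i < len(sorted_lines) and sorted_lines[i].startswith(key):
        return sorted_lines[i]
    return None


print(worst_case_steps(20_000_000_000))  # 35

# Invented sample lines; binary search requires the list to be sorted.
index = sorted([
    "monblog.fr/ 20120101000000",
    "autre-blog.fr/ 20110101000000",
    "site-untel.fr/ 20100101000000",
])
print(find(index, "monblog.fr/"))  # monblog.fr/ 20120101000000
```

This is why the index files have to be kept sorted: the lookup cost grows only by one step each time the archive doubles in size.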

Page 60: Archiving the Web –  The Bibliothèque nationale de France’s  « L’archivage du Web »

SPAR (Système de Préservation et d’Archive Réparti)

Long-term preservation system for digital objects, compliant with the OAIS (Open Archival Information System) standard, ISO 14721

[Diagram: three channels feed SPAR, each through its own Pre Ingest stage: digitized books, digitized audiovisual documents, web archiving]

Page 61: Archiving the Web –  The Bibliothèque nationale de France’s  « L’archivage du Web »

SPAR: a generic repository solution at BnF

Page 62: Archiving the Web –  The Bibliothèque nationale de France’s  « L’archivage du Web »

Public access to the collections

> Customised version of open-source Wayback Machine

> Three access points:
  > URL search
  > Experimental full-text search using NutchWAX (only covers about 10% of collections…)
  > Guided tours

Page 63: Archiving the Web –  The Bibliothèque nationale de France’s  « L’archivage du Web »

“Guided tours”

> Selections in the web archives, created by BnF subject librarians and external partners

> Provide a user-friendly way of discovering the contents of the archives

> Provide visibility for project collections

Page 64: Archiving the Web –  The Bibliothèque nationale de France’s  « L’archivage du Web »

Thank you for your attention

Questions?