Archiving the French Web: the BnF web archiving workflow Sara Aubry Web Archiving Project Manager, IT department Bibliothèque nationale de France International Conference on Web archives and e-LD Biblioteca Nacional de España, Madrid, July 9th 2013

DESCRIPTION

Presented at the International Conference on Web Archives and Electronic Legal Deposit, held at the Biblioteca Nacional de España (BNE) on 9 July 2013.

Page 1: Archiving the French Web: the BnF web archiving workflow. Sara Aubry

Archiving the French Web: the BnF web archiving workflow

Sara Aubry Web Archiving Project Manager, IT department Bibliothèque nationale de France

International Conference on Web archives and e-LD

Biblioteca Nacional de España, Madrid, July 9th 2013

Page 2: Archiving the French Web: the BnF web archiving workflow. Sara Aubry

Let’s start with some figures

• Programme started in 2000; industrialisation in 2008-2012

• Collections:
– 1996 to now
– 20 000 websites for focused crawls, 2.5 million .fr domains for broad crawls
– 18.8 billion URLs, 370 TB, growing at +100 TB / year

• Resources:
– 9 full-time employees (5 librarians, 4 engineers)
– many partners inside and outside the Library, at both the national and international level
– 70 robots (648 GB RAM, 144 CPUs at 2.4 GHz)

Page 3: Archiving the French Web: the BnF web archiving workflow. Sara Aubry

Digital curation is not different!

• « Actions, tools and practices defined and applied to collect, identify, select, organize and preserve digital contents (…) in order to use them and make them available (…) »

Definition of Digital Archiving in Wikipedia

Page 4: Archiving the French Web: the BnF web archiving workflow. Sara Aubry

BnF workflow overview

Selecting

Collecting

Indexing

Accessing

Preserving

nas_preload

Page 5: Archiving the French Web: the BnF web archiving workflow. Sara Aubry

Selecting with BCWeb

Page 6: Archiving the French Web: the BnF web archiving workflow. Sara Aubry

Selecting with BCWeb

• A form-based application, commonly called a « curator tool »
– for content curators and researchers to nominate websites to harvest
– giving basic information about them (content policies, trends watch)

• Most important information for each website:
– Internet address/URL
– frequency (daily, monthly, yearly, once…)
– size/budget (small, medium, big)
– depth (entire domain, part of it)

Content curators
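As an illustration only, the four pieces of information listed above could be modelled as a small record. This is a minimal sketch, not BCWeb's actual data model: the class names, field names and the `is_valid` check are assumptions made for the example.

```python
from dataclasses import dataclass
from enum import Enum
from urllib.parse import urlparse


class Frequency(Enum):
    DAILY = "daily"
    MONTHLY = "monthly"
    YEARLY = "yearly"
    ONCE = "once"


class Budget(Enum):
    SMALL = "small"
    MEDIUM = "medium"
    BIG = "big"


class Depth(Enum):
    ENTIRE_DOMAIN = "entire domain"
    PART_OF_DOMAIN = "part of it"


@dataclass(frozen=True)
class Nomination:
    """One nominated website (hypothetical structure, not BCWeb's schema)."""
    url: str
    frequency: Frequency
    budget: Budget
    depth: Depth

    def is_valid(self) -> bool:
        # A nomination needs at least an absolute http(s) address with a host.
        parts = urlparse(self.url)
        return parts.scheme in ("http", "https") and bool(parts.netloc)


nomination = Nomination(
    url="http://www.bnf.fr",
    frequency=Frequency.MONTHLY,
    budget=Budget.MEDIUM,
    depth=Depth.ENTIRE_DOMAIN,
)
```

Using enumerations rather than free text keeps the frequency, budget and depth values limited to the choices shown on the slide.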

Page 7: Archiving the French Web: the BnF web archiving workflow. Sara Aubry

The Web is made of HTML pages

1 HTML page, 48 URLs
• 1 HTML
• 1 text/css
• 4 javascript
• 17 image/png
• 5 image/jpeg
• 21 image/gif

All links and inclusions are URL references
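To make the idea of "URL references" concrete, here is a minimal sketch using Python's standard `html.parser` module. It collects the links and inclusions found in one HTML page and resolves them against the page address. The set of tag/attribute pairs is an illustrative subset chosen for this example, and `ReferenceExtractor` and `extract_references` are hypothetical names; a production harvester looks at many more places (CSS, JavaScript, other attributes).

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ReferenceExtractor(HTMLParser):
    """Collects URL references from a page (illustrative subset of tags)."""

    # (tag, attribute) pairs assumed to carry a URL reference
    URL_ATTRIBUTES = {
        ("a", "href"),       # links
        ("link", "href"),    # stylesheets
        ("script", "src"),   # javascript
        ("img", "src"),      # images
    }

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.references = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if value and (tag, name) in self.URL_ATTRIBUTES:
                # Resolve relative references against the page address.
                self.references.append(urljoin(self.base_url, value))


def extract_references(html, base_url):
    parser = ReferenceExtractor(base_url)
    parser.feed(html)
    parser.close()
    return parser.references


page = (
    '<html><head><link rel="stylesheet" href="/style.css">'
    '<script src="app.js"></script></head>'
    '<body><img src="logo.png">'
    '<a href="http://example.org/next">next</a></body></html>'
)
references = extract_references(page, "http://example.org/index.html")
```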

Page 8: Archiving the French Web: the BnF web archiving workflow. Sara Aubry

Harvesting with Heritrix

• A harvester is a piece of software (crawler, spider, robot)

• Simulates what a person would do with a browser but repeatedly and very fast

• Follows a looping process

• Repeated as long as new, in-scope URLs are found and limits (budget and time) are not reached

WARC

Pick a location

Make a Request

Receive a Response

Examine for references

Save the content
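The looping process can be sketched in a few lines. This is a simplified illustration, not Heritrix code: `crawl`, `same_host_scope` and the `fetch` / `extract` callables are hypothetical, the budget is assumed to be a maximum number of saved URLs, and the time limit is left out. The content is kept in a dictionary here, whereas the real harvester writes it to WARC files.

```python
from collections import deque
from urllib.parse import urlparse


def same_host_scope(host):
    """Returns a scope test accepting only URLs on the given host (assumption)."""
    return lambda url: urlparse(url).hostname == host


def crawl(seeds, fetch, extract, in_scope, budget):
    """Breadth-first harvesting loop; stops when no new in-scope URL is left
    or when the budget (number of saved URLs) is reached."""
    frontier = deque(seeds)
    seen = set(seeds)
    saved = {}
    while frontier and len(saved) < budget:
        url = frontier.popleft()               # pick a location
        content = fetch(url)                   # make a request, receive a response
        if content is None:                    # nothing came back: skip this URL
            continue
        for ref in extract(content, url):      # examine for references
            if ref not in seen and in_scope(ref):
                seen.add(ref)
                frontier.append(ref)
        saved[url] = content                   # save the content
    return saved


# A tiny in-memory "web": each URL maps to the list of URLs it references.
fake_web = {
    "http://a.fr/": ["http://a.fr/1", "http://b.com/x"],
    "http://a.fr/1": ["http://a.fr/", "http://a.fr/2"],
    "http://a.fr/2": [],
}
harvested = crawl(
    seeds=["http://a.fr/"],
    fetch=fake_web.get,
    extract=lambda content, url: content,
    in_scope=same_host_scope("a.fr"),
    budget=10,
)
```

Passing `fetch` and `extract` as parameters keeps the loop itself independent of the network, so it can be tried on a small in-memory example such as `fake_web`.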

Page 9: Archiving the French Web: the BnF web archiving workflow. Sara Aubry

Assets:
– open source
– small and large scale
– textual or all-media formats
– data structures

Page 10: Archiving the French Web: the BnF web archiving workflow. Sara Aubry

Digital curators: legal deposit department

Page 11: Archiving the French Web: the BnF web archiving workflow. Sara Aubry

Engineers: IT department

Challenges:
• rich media and ever-changing environment
• social networks
• content behind paywalls (news sites, ebooks)

Page 12: Archiving the French Web: the BnF web archiving workflow. Sara Aubry

Piloting the crawls with NetarchiveSuite

• Prepare, schedule, run and monitor harvests of websites, perform QA

Digital curators: legal deposit department

Engineers: IT department

Page 13: Archiving the French Web: the BnF web archiving workflow. Sara Aubry

Offering access with Wayback

• Give readers the ability to browse the web “as it was” with:
– a regular web browser
– search and redisplay software

• An application called “Web archives”
– Wayback: for URL search, display and browsing
– Nutch prototype for keyword search
– Guided paths for collection highlights
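Browsing a URL “as it was” implies choosing, among all the captures of that URL, the one nearest to the date the reader asks for. The sketch below illustrates that idea with a binary search over a sorted list of capture dates. It is an assumption-laden example, not Wayback's implementation: `closest_capture` is a hypothetical helper and the capture dates are invented.

```python
from bisect import bisect_left
from datetime import datetime


def closest_capture(captures, requested):
    """Returns the capture date nearest to `requested`.

    `captures` must be sorted in ascending order; returns None if it is empty.
    When two captures are equally distant, the earlier one is returned.
    """
    if not captures:
        return None
    i = bisect_left(captures, requested)
    # Only the neighbours on each side of the insertion point can be nearest.
    candidates = captures[max(i - 1, 0):i + 1]
    return min(candidates, key=lambda capture: abs(capture - requested))


# Invented capture dates for one URL
captures = [datetime(2010, 1, 1), datetime(2011, 6, 1), datetime(2013, 7, 9)]
nearest = closest_capture(captures, datetime(2011, 1, 1))
```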

Page 14: Archiving the French Web: the BnF web archiving workflow. Sara Aubry
Page 15: Archiving the French Web: the BnF web archiving workflow. Sara Aubry
Page 16: Archiving the French Web: the BnF web archiving workflow. Sara Aubry
Page 17: Archiving the French Web: the BnF web archiving workflow. Sara Aubry

Challenges:
• links with our main Catalogue and open data repository
• “smart” URL search
• full text search and indexing
• small-scale data mining projects with researchers

Page 18: Archiving the French Web: the BnF web archiving workflow. Sara Aubry

Questions?

E-mail: [email protected]
Web site: http://www.bnf.fr
Twitter: http://twitter.com/DLWebBnF