Harvesting digital newspapers at the Bibliothèque nationale de France
Géraldine Camile, Bibliothèque nationale de France
Tallinn, 2015-01-30
Summary
- Context and objectives of the “subscription-based press project”
- Harvesting news websites with robots
- Results and lessons learnt
- The future of the project – and its alternatives
Collecting digital news at the BnF
Harvesting of news websites since 2010
- Use of crawlers
- 100 news websites harvested every day
- Only freely accessible content
Using robots to collect digital equivalents of newspapers
“Subscription-based” press project
- Obtain passwords from publishers and crawl protected content
- Focus on the PDF versions to ensure collection continuity
- As microfilming budgets for local editions of regional newspapers are decreasing
The subscription-based press project
Various actors within the Library
- Law, Economy and Politics department
- Legal deposit department: printed periodicals service
- Legal deposit department: digital legal deposit service
- IT department
Different skills and approaches for printed and digital periodicals
Calendar
- A one-year experiment
- Started end 2012; assessment end 2013
- Now in production mode
The harvesting workflow
Steps shown in the diagram:
- Selection
- Contact with publisher
- Technical instruction
- Web harvest
- Quality assurance
- Cataloguing
- Description on access UI
- Preservation
Actors shown in the diagram: curators, library assistants, cataloguers, engineers
August 20th 2014 – Harvesting digital newspapers at the BnF – Clément Oury – IFLA WLIC conference
Format
Cataloguing…
Link with the printed edition record
Link to the archives
Type: digital document
Local editions
And access in the archives…
A guided tour of the news collection
Long term preservation in SPAR, BnF’s digital repository
22 titles, 192 local editions

Title                             Local editions  Start of harvest
Ouest-France                      53              July 19, 2012
Le Républicain lorrain            8               December 12, 2012
Le Progrès                        18              April 16, 2013
Midi libre                        14              May 2, 2013
L’Indépendant                     3               May 2, 2013
Centre Presse                     1               May 2, 2013
La Tribune                        1               May 22, 2013
Mediapart                         1               July 16, 2013
La Montagne                       14              October 10, 2013
Le Populaire du Centre            3               October 10, 2013
La République du Centre           2               October 10, 2013
Le Berry Républicain              1               October 10, 2013
L’Écho Républicain                1               October 10, 2013
Le Journal du Centre              1               October 10, 2013
Le Dauphiné libéré                20              April 7, 2014
Les Dernières Nouvelles d'Alsace  18              April 7, 2014
L'Est Républicain                 10              April 7, 2014
L'Alsace                          8               April 7, 2014
Le Journal de Saône-et-Loire      7               April 7, 2014
Le Bien Public                    4               April 7, 2014
Vosges Matin                      2               April 7, 2014
The collections
(n° 1, oct./nov. 2012, p. 60-61)
Harvested titles
Map of the daily regional newspapers
Vosges Matin / La Liberté de l’Est
Main achievements
- The collections!
- Technical experiments in harvesting protected content
- Creation of links between the General Catalogue and web archives
- Raising awareness among wider library staff about collecting digital publications
- Even library assistants are now managing digital documents
The dark side of the crawl
- News websites’ architecture may change very quickly
- Requires high reactivity and dedicated time from technical staff
- Difficulty recovering non-harvested collections
- Press collections disappear very rapidly from the publisher’s website
- Some websites are technically NOT possible to harvest with crawling robots
The next steps of the project
- Extend the harvest to new titles
- Improve access to collections
  - A dedicated interface?
  - Full-text index of the press corpus?
- Promote the service towards:
  - Librarians at reference desks
  - Researchers and other users
- Open remote access
  - From the researchers’ desktops
  - From regional libraries entitled to receive access to web legal deposit collections