Danish Legal Deposit on the Internet: Current Solutions and Approaches for the Future ECDL, September 2001 by Birgit N. Henriksen Head of Digitization and Web Department The Royal Library, Denmark [email protected]


Danish Legal Deposit on the Internet:

Current Solutions and Approaches for the Future

ECDL, September 2001

by Birgit N. Henriksen

Head of Digitization and Web Department

The Royal Library, Denmark

[email protected]

Presentation outline

Three different initiatives:

•Selection-based archiving (in production since 1998)

•netarchive.dk (new project, multiple archiving strategies, 2001-2002)

•Nordic Web Archive (project 2000-2001, access to web archives)

The Danish Legal Deposit Law

•1697: All printers in royal and ducal lands must deposit

•1703: Only printers in Copenhagen have to deposit

•1781: All printers in royal and ducal lands must deposit

•1902: All printed materials to be deposited

•1927: Posters and some types of ephemera excluded

•1997: All published works to be deposited

The law from 1997 covers

any work published in Denmark regardless of medium

“work”: a delimited quantity of information which must be considered a final and independent unit

“published”: when … copies of the work have been placed on sale or otherwise distributed to the public

Types of Net Publications

Static: included (only periodically updated)

•Monographs

•Periodicals

Dynamic: excluded (continuously updated)

•Databases

•Homepages

www.pligtaflevering.dk

How do we get the material?

•Download based on notification

NOT

•Harvesting the Danish domain

•Delivery of works (a collection of files) from the individual publishers

Registration

•WHO: the person in charge of the technical completion of the digital copy

•HOW: by filling out a form at http://www.pligtaflevering.dk

Registration Form - Monographs

Download - workflow

The staff at the Danish Department, The Royal Library

•determines whether a publication is covered by the law

•if yes, downloads all files belonging to the work

•checks the downloaded work

•catalogues and classifies the work in the OPAC (only periodicals)

•transfers work to archival server

(server mirrored nightly to State and University Library, Århus)
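The workflow above can be read as a small pipeline. The following is a minimal sketch under stated assumptions, not the Royal Library's actual system: the `Notification` and `ArchivedWork` records, the `is_covered` rule, the in-memory `archive` dictionary and the `fake_download` stand-in are all hypothetical, and the downloader is passed in as a function so the sketch runs without network access.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class Notification:
    """A registration submitted by a publisher (hypothetical fields)."""
    url: str
    kind: str       # "monograph" or "periodical"
    dynamic: bool   # True for continuously updated material


@dataclass
class ArchivedWork:
    url: str
    files: Dict[str, bytes]
    catalogued: bool


def is_covered(n: Notification) -> bool:
    """Assumed rule: static monographs and periodicals are covered."""
    return not n.dynamic and n.kind in {"monograph", "periodical"}


def process(
    n: Notification,
    download: Callable[[str], Dict[str, bytes]],
    archive: Dict[str, ArchivedWork],
) -> Optional[ArchivedWork]:
    """Run one notification through the workflow; return None if not covered."""
    if not is_covered(n):
        return None
    files = download(n.url)                      # all files belonging to the work
    empty = [name for name, body in files.items() if not body]
    if not files or empty:                       # basic check of the download
        raise ValueError(f"incomplete download for {n.url}: {empty}")
    # Only periodicals are catalogued and classified in this sketch.
    work = ArchivedWork(n.url, files, catalogued=(n.kind == "periodical"))
    archive[n.url] = work                        # transfer to the archival store
    return work


def fake_download(url: str) -> Dict[str, bytes]:
    """Stand-in for a real downloader, so the sketch needs no network."""
    return {"index.html": b"<html></html>", "logo.gif": b"GIF89a"}


archive: Dict[str, ArchivedWork] = {}
process(Notification("http://example.org/report", "monograph", False), fake_download, archive)
process(Notification("http://example.org/db", "monograph", True), fake_download, archive)
print(sorted(archive))
```

The nightly mirror to the second library is deliberately left out; in this sketch it would amount to copying the `archive` store to another location.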

Plug-ins

System Environment

[Diagram: system environment at the two libraries, connected via the Internet]

KB (Royal Library):

•Server with archive

•Web server with registration system

•Firewall/router

•Collecting system

•PCs with restricted access to archive, on KB's LAN

•REX: free access

•Controlled access through firewall

SB (State and University Library):

•Server with archive

•Firewall/router

•PCs with restricted access to the legally deposited archive, on SB's LAN

•SOL: controlled access through firewall

Daily mirror from KB to SB

Also labelled: web server, free access

Domain names in .dk domain

                                          # of sub-domains
Registered in .dk, 12 May 1999            96,371
Registered in .dk, 12 June 2001           301,730
Represented in archive, 12 June 2001      < 1,000

Volume in archived material

                       June 1999     June 2000     June 2001
# net publications     958           5,424         9,175
# representations      1,306         6,619         11,607
Repr./net pub.         1.36          1.22          1.26
# files, total         87,886        346,685       569,150
Files/net pub.         67.3          52.4          49.0
# bytes, total         1.66 Gbyte    12.0 Gbyte    18.2 Gbyte
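The derived rows of this table can be recomputed from the totals. The sketch below assumes the totals are read as 958 / 5,424 / 9,175 net publications, 1,306 / 6,619 / 11,607 representations and 87,886 / 346,685 / 569,150 files. Under that reading, dividing files by representations appears to reproduce the 67.3, 52.4 and 49.0 figures, whereas dividing files by net publications gives larger values.

```python
# Totals as read from the slide (June 1999, June 2000, June 2001).
# The grouping of digits into thousands is an assumption about the original layout.
years = ["June 1999", "June 2000", "June 2001"]
publications = [958, 5424, 9175]
representations = [1306, 6619, 11607]
files = [87_886, 346_685, 569_150]

# Representations per net publication.
repr_per_pub = [r / p for r, p in zip(representations, publications)]
# Files per representation and files per net publication, for comparison.
files_per_repr = [f / r for f, r in zip(files, representations)]
files_per_pub = [f / p for f, p in zip(files, publications)]

for year, rp, fr, fp in zip(years, repr_per_pub, files_per_repr, files_per_pub):
    print(f"{year}: {rp:.2f} repr./pub., {fr:.1f} files/repr., {fp:.1f} files/pub.")
```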

Monographs vs Periodicals

                       Before 1 July 1999    Before 1 July 2000    Before 1 July 2001
                       #       %             #       %             #       %
Monographs             642     67            1,594   29            2,850   31
Periodicals (issues)   316     33            3,830   71            6,325   69

Public vs. Private Publishers

            June 1999       June 2000       June 2001
            #      %        #      %        #      %
Public      648    68       3,985  71       6,200  67.5
Private     304    32       1,430  26.4     2,975  32.5

Staff resources

Year   Man-years   Paid hours per publication   Comments
1998   2.3         12.75                        System being developed and set up
1999   1.9         1.2                          Downloading, cataloguing and classifying all publications
2000   1.3         0.6                          Downloading all, cataloguing and classifying periodicals

MimeType Statistics – % of collected files

                         June 1999   June 2000   June 2001
TEXT/HTML                56.0 %      58.6 %      59.3 %
Image (GIF, JPEG, PNG)   41.8 %      38.4 %      37.9 %
PDF                      1.3 %       1.6 %       1.7 %
Other formats            0.9 %       1.4 %       1.1 %
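A breakdown like this can be produced by classifying each collected file and counting. The sketch below is only an illustration: it guesses the MIME type from the file extension with Python's standard `mimetypes` module, which is an assumption (the archive may well have recorded the type reported by the web server instead), and the `category` and `shares` helpers and the sample file names are hypothetical.

```python
import mimetypes
from collections import Counter
from typing import Dict, Iterable

IMAGE_TYPES = {"image/gif", "image/jpeg", "image/png"}


def category(filename: str) -> str:
    """Map a file name to one of the four categories used in the table."""
    mime, _encoding = mimetypes.guess_type(filename)
    if mime == "text/html":
        return "TEXT/HTML"
    if mime in IMAGE_TYPES:
        return "Image (GIF, JPEG, PNG)"
    if mime == "application/pdf":
        return "PDF"
    return "Other formats"   # includes files whose type cannot be guessed


def shares(filenames: Iterable[str]) -> Dict[str, float]:
    """Return each category's share of the collected files, in percent."""
    counts = Counter(category(name) for name in filenames)
    total = sum(counts.values())
    return {cat: 100.0 * n / total for cat, n in counts.items()}


sample = ["index.html", "logo.gif", "photo.jpg", "report.pdf"]
for cat, pct in sorted(shares(sample).items()):
    print(f"{cat}: {pct:.1f} %")
```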

Three generations using the internet

Generations: 1st (age 74), 2nd (age 40), 3rd (age 10-15)

Professional life (work/school related)

•1st: Professional online periodicals/portals

•2nd: Professional online periodicals/portals; product information; institutions and organisations; newsgroups

•3rd: Uncritically, all available material

Entertainment

•1st: Just surfing around

•2nd: Auctions; game services; bizarre websites; newsgroups

•3rd: Events; game services; gimmicks; chat services

Searching for information

•1st: Search engines; news; municipal sites

•2nd: Search engines (including cached web pages); news and media/portals; state and municipal sites; product databases

•3rd: Search engines

Special interests

•1st: Home banking; stock exchange

•2nd: Home banking and information related to family economy; e-commerce; organisations; seasonal interests

•3rd: Sports clubs (results); live role play

The modifications from 1902

•Brochures and advertisements

•Catalogues

•Election campaign material

•Club/organisation magazines

•Songs

•Scouting magazines, church newsletters

•Maps

•Portraits

•Art prints

Brochures

Online services like krak.dk

Organisation websites

Newsletters/minutes on websites

Product databases/portals

Net Art

Problems related to the notification concept

•Lack of notification of multiple representations of a publication

•Lack of notification of new versions

Problems related to technical issues

•Errors or inconsistencies in the published files

•Java applets: no solution at the moment

•Solutions found for earlier problems:

- Documents with JavaScript

- Data behind forms

- Data behind username/password logins

- Cookie-based session handling

- SSL encryption

Gains if harvesting is used

•Better coverage of Denmark outside the public sphere

•Updated versions – also for static publications

• New trends on the net as soon as they appear

Why not only harvesting?

•Programs and plug-ins are difficult to keep track of

•Harvesting is not always possible (e.g. streamed and webcast material)

•Harvesting may not give a useful result

- technical problems (Java, interactive sites)

- personalised sites

•Harvesting may produce a collection of documents that have never existed on the net

•Harvesting may not always give the best format for long-term preservation

Net Art

Home banking

Searching the catalogue

Collections made by harvesting

•Are not complete (see previous slides)

•No robot will ever be able to make a 'true' snapshot: the snapshot contains a mix of documents that have never been published together at the same time, a 'fake'
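The 'fake' snapshot effect can be shown with a toy simulation. This is a minimal sketch with invented data: the two-page `history`, the integer time units and the crawl schedule are all hypothetical. A robot that fetches one page before a site-wide update and another page after it ends up with a combination of versions that was never online at any single moment.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical site history: page -> list of (publication time, version label).
History = Dict[str, List[Tuple[int, str]]]

history: History = {
    "/index.html": [(0, "v1"), (5, "v2")],
    "/news.html": [(0, "v1"), (5, "v2")],
}


def version_at(history: History, page: str, t: int) -> Optional[str]:
    """Return the version of `page` that was online at time `t`."""
    current = None
    for published, label in sorted(history[page]):
        if published <= t:
            current = label
    return current


def crawl(history: History, schedule: Dict[str, int]) -> Dict[str, Optional[str]]:
    """Fetch each page at its scheduled time, as a slow robot would."""
    return {page: version_at(history, page, t) for page, t in schedule.items()}


def existed_together(history: History, snapshot: Dict[str, Optional[str]]) -> bool:
    """True if, at some instant, exactly these versions were online at once.

    The site only changes at publication times, so checking those is enough.
    """
    times = sorted({t for versions in history.values() for t, _ in versions})
    return any(
        all(version_at(history, page, t) == v for page, v in snapshot.items())
        for t in times
    )


# The robot fetches the front page at t=4 and the news page at t=6,
# while the whole site is updated at t=5.
snapshot = crawl(history, {"/index.html": 4, "/news.html": 6})
print(snapshot, existed_together(history, snapshot))
```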

Archive for Danish Literature

•www.adl.dk from 1 October 2001

•All full texts are structured in XML on work level

•The XML is loaded into a database

•The database performs the web publishing in well-formed HTML on a page level
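The pipeline described above (work-level XML, loaded into a database, published as page-level HTML) can be sketched as follows. This is an illustration only: the slides do not describe the actual schema, so the `<work>`, `<title>` and `<page>` elements, the `pages` table and the `load_work` and `render_page` helpers are all hypothetical, and an in-memory SQLite database stands in for whatever database is really used.

```python
import sqlite3
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical work-level XML; the real markup is not given in the slides.
WORK_XML = """<work id="w1">
  <title>Example work</title>
  <page n="1">First page &amp; its text.</page>
  <page n="2">Second page.</page>
</work>"""


def load_work(conn: sqlite3.Connection, xml_text: str) -> None:
    """Parse one work and store it in the database, one row per page."""
    work = ET.fromstring(xml_text)
    title = work.findtext("title", default="")
    for page in work.iter("page"):
        conn.execute(
            "INSERT INTO pages (work_id, n, title, body) VALUES (?, ?, ?, ?)",
            (work.get("id"), int(page.get("n")), title, page.text or ""),
        )


def render_page(conn: sqlite3.Connection, work_id: str, n: int) -> str:
    """Publish a single page of a work as a well-formed HTML document."""
    row = conn.execute(
        "SELECT title, body FROM pages WHERE work_id = ? AND n = ?",
        (work_id, n),
    ).fetchone()
    if row is None:
        raise KeyError((work_id, n))
    title, body = row
    # Escape text so that the generated markup stays well-formed.
    return (
        "<html><head><title>{t}</title></head>"
        "<body><h1>{t}, page {n}</h1><p>{b}</p></body></html>"
    ).format(t=escape(title), n=n, b=escape(body))


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pages (work_id TEXT, n INTEGER, title TEXT, body TEXT)")
load_work(conn, WORK_XML)
html_page = render_page(conn, "w1", 1)
print(html_page)
```

Keeping the text at work level in XML and generating pages on request means the archived source stays independent of how it is presented on the web.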

What do we prefer to archive and for what purpose?

Birte Christensen-Dalsgaard: Archive Experience, not Data

[Diagram: database publishing in three layers]

•User Interface

•Service Layer: library system, XML parser, chat, etc.

•Data Layer: databases (catalogue, papers and articles, financial information, etc.)

Web Archiving Conference, CPH June 2001

•Focus: user expectations of web archiving in DK

•Brought together:

- members of the user community, scholars as well as scientists

- members from the organisations traditionally in charge of preserving oral and written material

- members with technical knowledge

•Proceedings (UK) – netarchive.dk

Web Archiving Conference, CPH June 2001

•Scholars & scientists:

- Archive the dynamic part of the web

- Focus on archiving the content, the context and the evidence of use

•Archivists:

- Use different archiving approaches

- New methods for archiving dynamic material

•Budgets for making snapshots and making selective collections are comparable

Birte Christensen-Dalsgaard: 3 dimensions - duration

[Figure: "Signal lifetime" axis, with the labels "Real time dialog", "Hourly update" and "Published, static"]

•Book-like publications

•Scientific journals

•News sites

•Chat

Birte Christensen-Dalsgaard: 3 dimensions - Permanent value

[Figure: "Permanent value" axis, running between "Transient" and "Persistent"]

•What is worth preserving?

•Quality vs. representative

Birte Christensen-Dalsgaard: Background - Nature of Information

[Figure: three axes. "Interactivity" (static to dynamic); "Permanent value" (transient to persistent); "Signal lifetime" (labels "Real time dialog", "Hourly update" and "Published, static")]

Birte Christensen-Dalsgaard: Domain of different harvesting methods

[Figure: three axes. "Interactivity" (static to dynamic); "Permanent value" (transient to persistent); "Signal lifetime" (labels "Real time dialog", "Hourly update" and "Published, static"). Regions marked: "Legal Deposit, DK", "Accumulative harvesting" and "Snapshot"]

Birte Christensen-Dalsgaard: What is missing?

[Figure: three axes. "Interactivity" (static to dynamic); "Permanent value" (transient to persistent); "Signal lifetime" (labels "Real time dialog", "Hourly update" and "Published, static"). Regions marked: "Legal Deposit, DK", "Accumulative harvesting" and "Snapshot"]

netarchive.dk (1)

[Figure: three axes. "Interactivity" (static to dynamic); "Permanent value" (transient to persistent); "Signal lifetime" (labels "Real time dialog" and "Published, static"). Region marked: "Process"]

Test different archival approaches and the subsequent usability of the archived material for research

netarchive.dk (2)

•Pilot project testing different archival approaches and the subsequent usability of the archived material for research

•Project partners:

- State and University Library, Aarhus

- Centre for Internet Research

- The Royal Library

•With economic support from the Danish Electronic Research Library (DEF)

•Period: August 2001 to July 2002

•Case: Danish municipal elections, November 2001

netarchive.dk (3)

•Which materials, and with:

- What frequency?

- Which collection method?

- Which software?

•How should the collection of materials be organized and how should it be stored?

•How should obsolescence of data formats be dealt with?

•How should access be given?

•Budgets for collecting and storing

netarchive.dk (4)

Net material covered by netarchive.dk

•net activities from existing news media (newspapers, radio and TV; national, regional and local media)

•political parties' official pages, national and local

•individual politicians' personal pages

•official (county) municipal pages

•voters' personal pages

•"local themes" pages

•special interest organisations

•portals in the broadest sense

•opinion polling firms

•public emails / press releases

•newsgroups / Usenet

•net conferences and chat

How do we catch the missing part?

Process rather than material – ‘Filming’ the net through a browser

Goal:

Capture chronological series of displayed web pages

Tools to take into consideration:

•Business intelligence tools

•Tools used in usability laboratories

Nordic Web Archive (NWA)

•Establish a Danish test archive in order to participate in NWA

•Software: NEDLIB robot

•Status 1 September 2001:

- Archiving started 20 August 2001

- 1.9 million documents

- 43 GB uncompressed data

Questions?