ECDL 2005, September 18th - 23rd 2005, Vienna, Austria
File-based storage of Digital Objects: XMLtapes & Internet Archive ARC files
Xiaoming Liu, Luda Balakireva, Herbert Van de Sompel
Research Library
File-based storage of Digital Objects and constituent datastreams: XMLtapes and Internet Archive ARC files
Xiaoming Liu (1), Luda Balakireva (1), Patrick Hochstenbach (2) and Herbert Van de Sompel (1)
(1) Digital Library Research & Prototyping Team, Research Library, Los Alamos National Laboratory
(2) University Library, Ghent University
Disclaimer
• The term Digital Object (DO) will be used as in Kahn/Wilensky:
  o Compound object
  o Multiple datastreams of different MIME types
  o Secondary information pertaining to object and datastreams
  o Identifiers for object (and datastreams)
• This roughly corresponds to OAIS Content Information
                          Type             MIME             Identifier
Digital Object            scholarly paper  N/A              DOI
Constituent Datastream 1  metadata record  application/xml  PMID
Constituent Datastream 2  fulltext file    application/pdf  –
XML-based representation of DOs
• Growing interest in XML-based representation of DOs in Digital Library architectures:
  o Platform independence
  o Industry support
  o Longevity, potential migration paths
  o Processing tools, validation capabilities
• XML-based Compound Object formats:
  o ISO/IEC 21000-2 MPEG-21 DID & DIDL
  o METS
  o IMS/CP
  o CCSDS XFDU
• Typical functionality:
  o By-Value (base64) and/or By-Reference provision of constituent datastreams
  o By-Value and/or By-Reference provision of secondary information
  o Provision of identifiers
Storing XML-based representations of DOs
• Existing approaches:
  o storage of the XML-representations as individual files in a file system:
    - Poor access performance
    - Poor backup performance
  o storage of the XML-representations in (SQL, XML, object) databases:
    - Long term? Data are dependent on the underlying system
  o storage of the XML-representations by concatenating many such documents into a single file such as tar or zip:
    - Not XML aware, hence no use of off-the-shelf XML tools
    - Increasing storage space (base64-encoding of the constituent datastreams)
aDORe XMLtape/ARCfile solution
• Part of LANL aDORe repository effort:
  o Standards-based, modular repository architecture
    - Distributed architecture
    - Protocol-based interactions between modules
    - Usable to create interoperable federations of heterogeneous repositories
  o Actual implementation of the architecture at LANL
  o Components of aDORe software will be released
• Inspired by Internet Archive ARC file approach:
  o File-based mechanism to store datastreams resulting from Web-crawling
  o Concatenation of multiple datastreams into a single file
  o Metadata as separators between datastreams
  o But not suitable for storing XML-based representations of DOs:
    - Metadata capabilities very limited & crawling related
    - Lose power of XML processing tools
aDORe XMLtape/ARCfile solution
• Two interconnected file-based storage mechanisms:
  o XMLtapes: File storage of XML-based representations of Digital Objects
  o ARCfiles: File storage of constituent datastreams of Digital Objects
• The ARC files are interconnected with one or more XMLtapes during the ingestion process
• A protocol-based access mechanism is introduced:
  o XMLtape is exposed as an autonomous OAI-PMH repository
  o ARCfile is exposed as an OpenURL Resolver
• Write once - Read many:
  o Files remain stable
  o Protocol-based access mechanism remains stable
  o Indexing mechanisms can change as technologies evolve
• Storage approach is independent from the compound object format used to represent DOs as XML
  o aDORe uses MPEG-21 DIDL
ISO/IEC 21000-2: MPEG-21 DID & DIDL
[Figure: a Digital Item has a declaration (the Digital Item Declaration), which has an XML serialization (the DIDL document). The MPEG-21 Abstract Model has an XML serialization (MPEG-21 DIDL). The Digital Item Declaration and the DIDL document are each marked "based on" these.]
Representing DOs using MPEG-21 DID
[Figure: sample DIDL document (DIDLDocumentid="info:lanl-repo/i/58f202ac") shown side by side from the OAIS package perspective and the OAIS content perspective. The Digital Object corresponds to an <Item> and the package to the <DIDL> element, with nested <Item> and <Component> elements carrying XML IDs (uuid-0000a01c, uuid-00004a42, uuid-00005e90, uuid-888b135e). Identifiers shown on the content side: info:doi/10.123/44455 and info:pmid/2225887 for item and component, plus info:lanl-repo/ds/380b1f5c and info:lanl-repo/ds/f1ec7e32.]
aDORe XMLtape
• An XML file that concatenates the XML-based representations of multiple DOs
• Structure is defined by an XML Schema
  o http://purl.lanl.gov/aDORe/schemas/2005-08/XMLtape.xsd
  o tape-level administrative section:
    - Open-ended content
    - Plug-in for processing-related information, indication of related ARCfiles:
      - http://purl.lanl.gov/aDORe/schemas/2005-08/XMLtapeBasics.xsd
  o concatenation of records, each of which consists of:
    - record-level administrative section
      - identifier and datestamp of the contained record
      - other record-level administrative information
    - a record (can be from any XML Namespace). DIDL in case of aDORe:
      - http://purl.lanl.gov/aDORe/schemas/2005-08/DIDL.xsd
• An XMLtape is a valid and well-formed XML file
• Independent from chosen XML-based Compound Object Format
aDORe XMLtape
<?xml version="1.0" encoding="UTF-8"?>
<ta:tape xmlns:ta="http://library.lanl.gov/2005-08/aDORe/XMLtape/">
  <ta:tapeAdmin> ... </ta:tapeAdmin>
  <ta:tapeRecord>
    <ta:tapeRecordAdmin>
      <ta:identifier>oai:aps.org:PhysRevA.71.040101</ta:identifier>
      <ta:date>2005-03-29T04:31:22Z</ta:date>
      <ta:recordAdmin> ... </ta:recordAdmin>
    </ta:tapeRecordAdmin>
    <ta:record>
      <didl:DIDL>...</didl:DIDL>
    </ta:record>
  </ta:tapeRecord>
</ta:tape>
Sample XMLtape (aDORe ta:tape)
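As a rough illustration of how off-the-shelf XML tools can walk an XMLtape, the sketch below uses Python's standard xml.etree.ElementTree to list the identifier and datestamp of each tape record. The tape in the sketch is an abbreviated, hypothetical stand-in for the sample: the elided sections are left out, and a placeholder <payload/> element takes the place of the DIDL document.

```python
import io
import xml.etree.ElementTree as ET

# Namespace of the ta: prefix, as declared in the sample XMLtape.
TA = "http://library.lanl.gov/2005-08/aDORe/XMLtape/"

# Abbreviated, hypothetical tape; <payload/> stands in for a DIDL document.
SAMPLE_TAPE = f"""<?xml version="1.0" encoding="UTF-8"?>
<ta:tape xmlns:ta="{TA}">
  <ta:tapeAdmin/>
  <ta:tapeRecord>
    <ta:tapeRecordAdmin>
      <ta:identifier>oai:aps.org:PhysRevA.71.040101</ta:identifier>
      <ta:date>2005-03-29T04:31:22Z</ta:date>
    </ta:tapeRecordAdmin>
    <ta:record><payload/></ta:record>
  </ta:tapeRecord>
</ta:tape>
""".encode("utf-8")


def list_records(tape_bytes):
    """Yield (identifier, datestamp) for every ta:tapeRecord, streaming."""
    for _event, elem in ET.iterparse(io.BytesIO(tape_bytes), events=("end",)):
        if elem.tag == f"{{{TA}}}tapeRecord":
            admin = elem.find(f"{{{TA}}}tapeRecordAdmin")
            yield (admin.findtext(f"{{{TA}}}identifier"),
                   admin.findtext(f"{{{TA}}}date"))
            elem.clear()  # drop the parsed record; real tapes can be large


records = list(list_records(SAMPLE_TAPE))
print(records)
```

Streaming with iterparse is a design choice of this sketch, made because a tape concatenates many records into one file.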
aDORe XMLtape index
[Figure: an XMLtape shown as a sequence of records next to its index. Each index entry holds the identifier and the datestamp of ingestion of one record, as found in that record's <ta:tapeRecordAdmin>.]
Indexing:
• Can be achieved with a variety of technologies
• Current implementation: Berkeley DB Java Edition
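One possible shape for such an index is sketched below in Python. It is not the Berkeley DB implementation: it assumes a plain dictionary, assumes the ta: prefix is used literally in the file, and additionally stores a byte offset and length per record (an assumption of this sketch) so that a record can be read back without parsing the whole tape. The identifiers and payloads are hypothetical.

```python
import re

# Hypothetical two-record tape; identifiers and payloads are made up.
SAMPLE_TAPE = (
    b'<?xml version="1.0" encoding="UTF-8"?>\n'
    b'<ta:tape xmlns:ta="http://library.lanl.gov/2005-08/aDORe/XMLtape/">\n'
    b'<ta:tapeRecord><ta:tapeRecordAdmin>'
    b'<ta:identifier>example:record-1</ta:identifier>'
    b'<ta:date>2005-03-29T04:31:22Z</ta:date>'
    b'</ta:tapeRecordAdmin><ta:record><doc>one</doc></ta:record></ta:tapeRecord>\n'
    b'<ta:tapeRecord><ta:tapeRecordAdmin>'
    b'<ta:identifier>example:record-2</ta:identifier>'
    b'<ta:date>2005-03-30T10:00:00Z</ta:date>'
    b'</ta:tapeRecordAdmin><ta:record><doc>two</doc></ta:record></ta:tapeRecord>\n'
    b'</ta:tape>\n'
)

RECORD = re.compile(rb"<ta:tapeRecord>.*?</ta:tapeRecord>", re.DOTALL)
IDENT = re.compile(rb"<ta:identifier>(.*?)</ta:identifier>", re.DOTALL)
DATE = re.compile(rb"<ta:date>(.*?)</ta:date>", re.DOTALL)


def build_index(tape):
    """Map identifier -> (datestamp, byte offset, byte length) of its record."""
    index = {}
    for match in RECORD.finditer(tape):
        chunk = match.group(0)
        identifier = IDENT.search(chunk).group(1).decode("utf-8")
        datestamp = DATE.search(chunk).group(1).decode("utf-8")
        index[identifier] = (datestamp, match.start(), len(chunk))
    return index


def read_record(tape, index, identifier):
    """Return the raw bytes of one tape record using only the index."""
    _datestamp, offset, length = index[identifier]
    return tape[offset:offset + length]


index = build_index(SAMPLE_TAPE)
print(index["example:record-2"][0])
print(read_record(SAMPLE_TAPE, index, "example:record-1"))
```

Because the index is derived entirely from the tape, it can be discarded and rebuilt with a different technology while the tape itself stays untouched.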
aDORe XMLtape as OAI-PMH repository
[Figure: an OAI-PMH request is resolved through the identifier/datestamp index to a record in the XMLtape, and the corresponding DIDL document is returned.]
OAI-PMH identifier = identifier from <ta:tapeRecordAdmin>
OAI-PMH datestamp = datetime from <ta:tapeRecordAdmin>
OAI-PMH response = content of <ta:record>
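A client-side view of this mapping can be sketched as follows, using only Python's standard library to build OAI-PMH request URLs. The base URL and the metadataPrefix value are hypothetical placeholders; the verbs and argument names (GetRecord, ListRecords, identifier, from, metadataPrefix) are those of OAI-PMH 2.0.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

BASE_URL = "http://example.org/xmltape/OAIHandler"  # hypothetical endpoint
METADATA_PREFIX = "didl"                            # hypothetical prefix


def get_record_url(identifier):
    """URL asking for the record whose ta:identifier equals `identifier`."""
    query = urlencode({
        "verb": "GetRecord",
        "identifier": identifier,
        "metadataPrefix": METADATA_PREFIX,
    })
    return f"{BASE_URL}?{query}"


def list_records_url(from_datestamp):
    """URL asking for all records with a datestamp at or after the given one."""
    query = urlencode({
        "verb": "ListRecords",
        "from": from_datestamp,
        "metadataPrefix": METADATA_PREFIX,
    })
    return f"{BASE_URL}?{query}"


one = get_record_url("oai:aps.org:PhysRevA.71.040101")
many = list_records_url("2005-03-29T04:31:22Z")
print(one)
print(many)
```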
Internet Archive ARCfile
• Concatenation of binary files
• Designed and used by the Internet Archive (Wayback machine)
  o > 400 TB web data
• Under revision by the International Internet Preservation Consortium (IIPC): WARC file format
  o Input from LANL to facilitate non-Web-crawling use case
• The ARC file format is structured as follows:
  o file header that provides administrative information about the ARC file itself
  o a sequence of document records, consisting of:
    - a header line containing some, mainly crawl-related, metadata:
      - URI of the crawled document
      - timestamp of acquisition of the data
      - size of the data block
    - a response to a protocol request such as an HTTP GET
Internet Archive ARC file
filedesc://IA-001102.arc 0 19960923142103 text/plain 76
1 0 Alexa Internet
URL IP-address Archive-date Content-type Archive-length

http://www.dryswamp.edu:80/index.html 127.10.100.2 19961104142103 text/html 202
HTTP/1.0 200 Document follows
Date: Mon, 04 Nov 1996 14:21:06 GMT
Server: NCSA/1.4.1
Content-type: text/html
Last-modified: Sat,10 Aug 1996 22:33:11 GMT
Content-length: 30

<HTML>
Hello World!!!
</HTML>
Sample ARC file
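The record header line in the sample follows the field order "URL IP-address Archive-date Content-type Archive-length". A minimal Python sketch of parsing such a line is given below. It assumes the five fields are separated by single spaces and that none of them contains a space, which holds for the sample but is not guaranteed for arbitrary crawl data; the ArcHeader name is introduced here for illustration only.

```python
from collections import namedtuple
from datetime import datetime

# Field names follow the header declaration line of the sample ARC file.
ArcHeader = namedtuple(
    "ArcHeader", "url ip_address archive_date content_type archive_length"
)


def parse_arc_header(line):
    """Split one ARC record header line into typed fields."""
    url, ip_address, date, content_type, length = line.strip().split(" ")
    return ArcHeader(
        url=url,
        ip_address=ip_address,
        archive_date=datetime.strptime(date, "%Y%m%d%H%M%S"),
        content_type=content_type,
        archive_length=int(length),  # size in bytes of the data block that follows
    )


header = parse_arc_header(
    "http://www.dryswamp.edu:80/index.html 127.10.100.2 "
    "19961104142103 text/html 202"
)
print(header)
```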
Internet Archive ARC file in aDORe
filedesc://singletape.arc 0.0.0.0 20050922142103 text/plain 76
1 0 Internet Archive
URL IP-address Archive-date Content-type Archive-length

info:lanl-repo/ds/39c2fa93-fa22-4c19-90af-b5f58b9b989a 0.0.0.0 20050907221344 application/pdf 415025
%PDF-1.3 %âãÏÓ290 0 obj << /Linearized 1 /O 295 /H [ 3642 1057 ] /L 415025…
Sample aDORe ARC file
Internet Archive ARC file
[Figure: an ARC file shown as a sequence of datastreams next to its index. Each index entry holds the URL taken from one ARC record header (URL IP-address Archive-date Content-type Archive-length).]
Indexing:
• Can be achieved with a variety of technologies
• Current implementation in aDORe: Heritrix toolkit
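To make the idea of a URL index concrete, the Python sketch below walks an uncompressed, in-memory ARC-like byte string and records, for each URL, where its data block starts and how long it is. This is not the Heritrix implementation. The sample bytes are built by a hypothetical make_record helper, the leading filedesc record is omitted, and the sketch assumes each data block is followed by a single newline separator.

```python
import io


def make_record(url, content_type, payload):
    """Build one simplified ARC record: header line, data block, newline."""
    header = f"{url} 0.0.0.0 20050907221344 {content_type} {len(payload)}\n"
    return header.encode("ascii") + payload + b"\n"


# Hypothetical datastream identifiers and payloads.
ARC_BYTES = (
    make_record("info:lanl-repo/ds/example-1", "text/plain", b"first datastream")
    + make_record("info:lanl-repo/ds/example-2", "application/octet-stream",
                  b"\x00\x01\x02")
)


def build_url_index(arc_bytes):
    """Map the URL of each record to (offset, length) of its data block."""
    index = {}
    stream = io.BytesIO(arc_bytes)
    while True:
        line = stream.readline()
        if not line:
            break                      # end of file
        if not line.strip():
            continue                   # newline separating two records
        fields = line.decode("ascii").split()
        url, length = fields[0], int(fields[-1])
        index[url] = (stream.tell(), length)
        stream.seek(length, io.SEEK_CUR)  # skip the data block itself
    return index


url_index = build_url_index(ARC_BYTES)
offset, length = url_index["info:lanl-repo/ds/example-2"]
print(ARC_BYTES[offset:offset + length])
```

Only header lines are decoded as text; the data blocks are skipped by length, so binary datastreams never need to be interpreted while indexing.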
ARC file as OpenURL Resolver
[Figure: an OpenURL request is resolved through the URL index to a datastream in the ARC file, and that datastream is returned.]
Referent Identifier = datastream identifier = URL from ARC record header
Resolver Identifier = identifier of ARC file
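The resolution step can be sketched in Python as below. This is a toy, in-memory model and not the OCLC OpenURL software: REGISTRY, the identifiers and the payload are hypothetical, and the index is assumed to map a datastream identifier to the offset and length of its data block inside the ARC file.

```python
from urllib.parse import parse_qs

# One hypothetical ARC file holding a single datastream.
PAYLOAD = b"%PDF-1.3 hypothetical bytes"
HEADER = (b"info:lanl-repo/ds/example 0.0.0.0 20050907221344 "
          b"application/pdf %d\n" % len(PAYLOAD))
ARC = HEADER + PAYLOAD + b"\n"

# ARCfile identifier -> (file bytes, index of datastream id -> (offset, length)).
REGISTRY = {
    "info:lanl-repo/arc/example": (
        ARC,
        {"info:lanl-repo/ds/example": (len(HEADER), len(PAYLOAD))},
    ),
}


def resolve(query_string):
    """Return the datastream named by rft_id from the ARC file named by res_id."""
    params = parse_qs(query_string)
    if params.get("url_ver") != ["Z39.88-2004"]:
        raise ValueError("not a Z39.88-2004 OpenURL")
    res_id = params["res_id"][0]   # Resolver Identifier: which ARC file
    rft_id = params["rft_id"][0]   # Referent Identifier: which datastream
    arc_bytes, index = REGISTRY[res_id]
    offset, length = index[rft_id]
    return arc_bytes[offset:offset + length]


datastream = resolve(
    "url_ver=Z39.88-2004"
    "&rft_id=info:lanl-repo/ds/example"
    "&res_id=info:lanl-repo/arc/example"
)
print(datastream)
```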
Associating an XMLtape with ARC Files (1)
• A Digital Object is represented using an XML-based Complex Object format (e.g. MPEG-21 DID)
• The resulting package (e.g. DIDL document) is stored in an XMLtape
• Constituent datastreams of the Digital Object are provided By-Reference:
  o Using the ref attribute of the Resource element in MPEG-21 DID
  o The value of the network location of the constituent datastream is compliant with the NISO OpenURL Framework:
    baseURL(ARCfile OpenURL Resolver)?
      url_ver = Z39.88-2004 &
      rft_id = Datastream Identifier &
      res_id = ARCfile identifier
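The template above can be assembled with Python's standard library, as in the sketch below. The function name arcfile_openurl is introduced here for illustration; the base URL and identifiers are the ones that appear in the DIDL extract of this presentation. Note that urlencode percent-encodes the ':' and '/' characters in the identifiers, which a resolver is expected to decode.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def arcfile_openurl(base_url, datastream_id, arcfile_id):
    """Build the By-Reference location of a datastream stored in an ARC file."""
    query = urlencode({
        "url_ver": "Z39.88-2004",
        "rft_id": datastream_id,   # Referent Identifier: the datastream
        "res_id": arcfile_id,      # Resolver Identifier: the ARC file
    })
    return f"{base_url}?{query}"


url = arcfile_openurl(
    "http://purl.lanl.gov/aDORe/demo/adore-arcfile-resolver/resolver",
    "info:lanl-repo/ds/ba0797d3-9414-42d0-90e8-f5397e74892b",
    "info:lanl-repo/arc/2001_4acb6e28-1ef9-11da-9e1e-d8ccd1d6c8f2",
)
print(url)

# Decoding the query recovers the original identifiers.
decoded = parse_qs(urlsplit(url).query)
print(decoded["rft_id"][0])
```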
Associating an XMLtape with ARC Files (1)
<?xml version="1.0" encoding="UTF-8"?>
<didl:DIDL>
  ……
  <didl:Component id="uuid-ddec9dbb-90e5-4b8a-93f3-dd1c8b781547">
    <didl:Descriptor>
      <didl:Statement mimeType="application/xml; charset=utf-8">
        <dii:Identifier … >
          info:lanl-repo/ds/ba0797d3-9414-42d0-90e8-f5397e74892b
        </dii:Identifier>
      </didl:Statement>
    </didl:Descriptor>
    <didl:Resource mimeType="application/pdf"
      ref="http://purl.lanl.gov/aDORe/demo/adore-arcfile-resolver/resolver?url_ver=Z39.88-2004&amp;res_id=info:lanl-repo/arc/2001_4acb6e28-1ef9-11da-9e1e-d8ccd1d6c8f2&amp;rft_id=info:lanl-repo/ds/ba0797d3-9414-42d0-90e8-f5397e74892b"/>
  </didl:Component>
  ……
</didl:DIDL>
Extract from DIDL
Associating an XMLtape with ARC Files (2)
• An XMLtape is associated with its corresponding ARCfiles through a plug-in for the XMLtape-level administrative section.
Associating an XMLtape with ARC Files (2)
<?xml version="1.0" encoding="UTF-8"?>
<ta:tape xmlns:ta="http://library.lanl.gov/2005-08/aDORe/XMLtape/">
  <ta:tapeAdmin>
    <tb:XMLtapeBasics xmlns:tb="http://library.lanl.gov/2005-08/aDORe/XMLtapeBasics/">
      <tb:XMLtapeId>info:lanl-repo/xmltape/singlescitape</tb:XMLtapeId>
      <tb:ARCfileId>info:lanl-repo/arc/singlescitape</tb:ARCfileId>
      <tb:processSoftware>gov.lanl.xmltape.SingleTapeWriter</tb:processSoftware>
      <tb:processTime>2005-09-07T22:13:39Z</tb:processTime>
    </tb:XMLtapeBasics>
  </ta:tapeAdmin>
  <ta:tapeRecord>
    <ta:tapeRecordAdmin>
    …
</ta:tape>
XMLtape header
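Reading this association back out of a tape might look like the Python sketch below, which uses xml.etree.ElementTree with the two namespaces from the header. The document in the sketch is a closed, abbreviated copy of the header (the elided tape records are left out so that it parses), and the function name related_arcfiles is an assumption of this sketch.

```python
import xml.etree.ElementTree as ET

NS = {
    "ta": "http://library.lanl.gov/2005-08/aDORe/XMLtape/",
    "tb": "http://library.lanl.gov/2005-08/aDORe/XMLtapeBasics/",
}

# Abbreviated copy of the XMLtape header, without the elided tape records.
TAPE_HEADER = b"""<?xml version="1.0" encoding="UTF-8"?>
<ta:tape xmlns:ta="http://library.lanl.gov/2005-08/aDORe/XMLtape/">
  <ta:tapeAdmin>
    <tb:XMLtapeBasics xmlns:tb="http://library.lanl.gov/2005-08/aDORe/XMLtapeBasics/">
      <tb:XMLtapeId>info:lanl-repo/xmltape/singlescitape</tb:XMLtapeId>
      <tb:ARCfileId>info:lanl-repo/arc/singlescitape</tb:ARCfileId>
      <tb:processSoftware>gov.lanl.xmltape.SingleTapeWriter</tb:processSoftware>
      <tb:processTime>2005-09-07T22:13:39Z</tb:processTime>
    </tb:XMLtapeBasics>
  </ta:tapeAdmin>
</ta:tape>
"""


def related_arcfiles(tape_xml):
    """Return (XMLtape identifier, list of associated ARCfile identifiers)."""
    root = ET.fromstring(tape_xml)
    basics = root.find("ta:tapeAdmin/tb:XMLtapeBasics", NS)
    tape_id = basics.findtext("tb:XMLtapeId", namespaces=NS)
    arc_ids = [elem.text for elem in basics.findall("tb:ARCfileId", NS)]
    return tape_id, arc_ids


tape_id, arc_ids = related_arcfiles(TAPE_HEADER)
print(tape_id, arc_ids)
```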
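[Figure: an AGENT interacts with an Identifier Locator, an XMLtape and an ARC file. The agent sends a DIDLDocument-id or content-id to the Identifier Locator and receives a list of (baseURL, DIDLDocument-id). It then sends a DIDLDocument-id or content-id to the XMLtape, which has a DIDLDocument-id index and a creation datetime index, and receives a DIDL document containing ref values. Each ref is an OpenURL carrying a datastream-id, which the ARC file resolves through its datastream-id index to return the datastream.]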
[Figure: aDORe XMLtape/ARCfile environment. One XMLtape (A.xml, with index A.idx) contains an XMLtape basics block and tape records. Two ARCfiles (1.arc with index 1.cdx, 2.arc with index 2.cdx) each contain a version block and arc records. Tape records point to arc records through OpenURL. Service endpoints shown: http://barracuda.lanl.gov/moai2/ and http://barracuda.lanl.gov/openurl. An XMLtape registry (http://cox.lanl.gov/taperegistry/OAIHandler) holds tape records and an ARCfile registry (http://cox.lanl.gov/arcregistry/OAIHandler) holds arc records.]
Implementation
• XMLtapes:
  o Berkeley DB Java Edition
  o OCLC OAICat
• ARCfiles:
  o Heritrix
  o OCLC OpenURL software
• XMLtape Registry:
  o MySQL db
  o OCLC OAICat
• ARCfile Registry:
  o MySQL db
  o OCLC OAICat
Performance indicators
• System:
  o Model: Dell 2650 2U rack-mount server
  o CPU: dual 2.8 GHz Intel Xeon processors
  o RAM: 5 GB
  o Disks: 10k RPM SCSI disks
• XMLtape:
  o 1786 MB, 201872 DIDL records
  o download 100 consecutive DIDL records (787 KB) => 0.18 second
  o download static file of same size => 0.09 second
• ARCfile:
  o 272 MB, 4910 files
  o download a sample PDF file (312 KB) => 0.24 second
  o download static file of same size => 0.036 second
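The figures above can be turned into overhead factors and effective transfer rates with a few lines of Python. The helper names are introduced here for illustration; the inputs are the reported sizes and timings. On these numbers, protocol-based access costs roughly 2 times the static-file download for the XMLtape and roughly 6.7 times for the ARCfile.

```python
def rate_kb_per_s(size_kb, seconds):
    """Effective transfer rate for one measured download."""
    return size_kb / seconds


def overhead_factor(measured_s, static_s):
    """How many times slower than serving a static file of the same size."""
    return measured_s / static_s


# XMLtape: 100 consecutive DIDL records, 787 KB in total.
xmltape_rate = rate_kb_per_s(787, 0.18)
xmltape_overhead = overhead_factor(0.18, 0.09)

# ARCfile: one sample PDF file of 312 KB.
arcfile_rate = rate_kb_per_s(312, 0.24)
arcfile_overhead = overhead_factor(0.24, 0.036)

print(f"XMLtape: {xmltape_rate:.0f} KB/s, {xmltape_overhead:.1f}x static file")
print(f"ARCfile: {arcfile_rate:.0f} KB/s, {arcfile_overhead:.1f}x static file")
```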
Software
• ARC files:
  o Heritrix: the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project. http://crawler.archive.org/
  o NetArchive.dk: a project that plans for the preservation of Denmark's cultural heritage on the internet for future generations. http://www.netarchive.dk/
  o Many other tools: http://archive-access.sourceforge.net
• XMLtapes:
  o Perl tool, XML::Tape (LANL & Ghent University), http://search.cpan.org/~hochsten/XML-Tape/
• Combined aDORe XMLtape/ARCfile environment:
  o Java tool (LANL), soon to be released on SourceForge
Conclusion
• The file-based approach is inherently simple and reduces dependency on database systems.
• The autonomy of the indexes allows retaining the files over time, while the indexes can be created using other techniques as technologies evolve.
• The protocol-based nature of the access increases the flexibility in light of evolving technologies as it introduces another layer of abstraction.
• The XMLtape approach is inspired by the ARC file format, but provides several additional attractive features:
  o Off-the-shelf XML tools can be used to parse/validate an XMLtape
  o All DO metadata can be stored in XML-based compound object format
Presentation available via http://public.lanl.gov/herbertv/
Install TSCC codec for avi movies