Wrangling DigiTool Data For LOCKSS Brian Meuse - Digital Collections Systems Analyst University Libraries Boston College MetaArchive Cooperative Annual Meeting October 23, 2009

Wrangling DigiTool Data For LOCKSS

Brian Meuse - Digital Collections Systems Analyst

University Libraries

Boston College

MetaArchive Cooperative Annual Meeting

October 23, 2009

eTD@BC

• Electronic Theses and Dissertations
– Undergraduate Honors Theses
– Graduate Level Theses and Dissertations

• Archive and distribute

• Provide global Open Access to content
– Embargoes when needed
– No mandate to publish

What happens next?

• ProQuest processes the student's submission

• ProQuest FTPs back:
– Thesis (PDF)
– Any additional files (3rd party permissions)
– Descriptive metadata

• Once a student uploads to ProQuest, we get the files back within a day.

LOCKSS

• MetaArchive Cooperative
– LOCKSS-based dark archive
– Long-term digital preservation

DigiTool

• Oracle backend
– Maintains object relationships
– Stores all associated metadata (XML)
– Original filenames

• File storage
– Simple directories on filesystem
– Renamed to unique identifier (PID)

DigiTool To LOCKSS

• Export ETD files from DigiTool
– Export function
– Duplicate data
– Current ETD collection is ~1GB
– Bobbie Hanvey, ~30,000 photo-negatives, ~600GB

DigiTool To LOCKSS

• Direct URL links
– Metadata
– Objects (viewers for different formats)

• Direct links not persistent
– Redirected to URL with session ID
– Every node is different
– Not good for polling

DigiTool To LOCKSS

• DigiTool API
– SOAP web service
– Can query database
– Retrieve XML
  • Metadata
  • Links to objects

DigiTool To LOCKSS

• Wrangling the data
– Perl
– Web Services
– XSLT

DigiTool To LOCKSS

• createMetaArchiveAU.pl

#!/usr/bin/perl -w
use strict;
use SOAP::Lite;
use FileHandle;
use Getopt::Long;
use LWP::Simple;
use Time::localtime;
use XML::LibXSLT;
use XML::LibXML;
….

DigiTool To LOCKSS

• Query DigiTool …

<x_condition>
  <type>contains</type>
  <element>type</element>
  <value>electronic thesis dissertation</value>
</x_condition>
<x_condition>
  <type>after</type>
  <element>createDate</element>
  <value>FROM</value>
</x_condition>

• XML response is a list of PIDs
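The two conditions above can be assembled programmatically. This is a minimal sketch in Python rather than the Perl of createMetaArchiveAU.pl; the helper names and the date format passed in place of FROM are assumptions for illustration, and the SOAP envelope that carries the conditions is not shown on the slide, so it is left out here.

```python
import xml.etree.ElementTree as ET


def build_condition(cond_type, element, value):
    """Build one <x_condition> element shaped like those on the slide."""
    cond = ET.Element("x_condition")
    ET.SubElement(cond, "type").text = cond_type
    ET.SubElement(cond, "element").text = element
    ET.SubElement(cond, "value").text = value
    return cond


def build_conditions(from_date):
    """Serialize both conditions: ETD type, and created after from_date.

    from_date stands in for the FROM placeholder; its format is an assumption.
    """
    conditions = [
        build_condition("contains", "type", "electronic thesis dissertation"),
        build_condition("after", "createDate", from_date),
    ]
    return "".join(ET.tostring(c, encoding="unicode") for c in conditions)


fragment = build_conditions("2009-09-01")
print(fragment)
```

Building the elements through an XML library rather than string concatenation keeps any special characters in the values properly escaped.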

DigiTool To LOCKSS

• Retrieve digital entity for each PID

• XML contains
– All metadata for the object
– PIDs of related objects
– Filename and path of the file on the server

DigiTool To LOCKSS

• Metadata

<md shared="false">
  <name>descriptive</name>
  <type>etd-ms</type>
  <value><![CDATA[<thesis>
    <title>The Impact of Pension Policy on Older Adults&apos; Life Satisfaction: an Analysis of Longitudinal Mulitlevel Data</title>
    <creator>Calvo, Esteban</creator>
    <subject>aging</subject>
    <subject>individualization</subject>
    <subject>life satisfaction</subject>
    <subject>pension policy</subject>
    <subject>redistribution</subject>
    <subject>subjective well-being</subject>
    <publisher>Boston College</publisher>
    <contributor role="advisor">Williamson, John B.</contributor>
    <date>2009</date>
    <type>Electronic Thesis or Dissertation</type>
    <type>text</type>
    <format>application/pdf</format>
    <identifier>http://hdl.handle.net/2345/752</identifier>
    <language>English</language>
    <rights>I hereby allow Boston College to include and preserve my dissertation/thesis in electronic form in the Boston College Institutional Repository, which shall include the right to publicly post my dissertation/thesis on the World Wide Web. I will retain copyright ownership, but I grant to Boston College the non-exclusive right to copy, distribute, and publicly display my dissertation/thesis in any form as may be necessary or convenient in the future as file formats, storage media, and distribution mechanisms evolve.</rights>
    <degree>
      <name>PhD</name>
      <level>Doctoral</level>
      <discipline>Sociology</discipline>
      <grantor>Boston College. Graduate School of Arts &amp; Sciences.</grantor>
    </degree>
  </thesis>]]></value>
</md>
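Because the descriptive record sits inside a CDATA section, it has to be parsed twice: once for the outer md element and once for the thesis document held as text in value. A minimal Python sketch of that two-step parse follows, using a shortened copy of the record above; the function name and the choice of fields are assumptions, not taken from the original Perl script.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the record shown on the slide.
MD_XML = """<md shared="false">
  <name>descriptive</name>
  <type>etd-ms</type>
  <value><![CDATA[<thesis><creator>Calvo, Esteban</creator><subject>aging</subject><subject>pension policy</subject><date>2009</date><identifier>http://hdl.handle.net/2345/752</identifier></thesis>]]></value>
</md>"""


def parse_descriptive_md(md_xml):
    """Parse the outer <md>, then the thesis XML carried as CDATA text."""
    md = ET.fromstring(md_xml)
    # The CDATA payload arrives as plain text, so it needs its own parse.
    thesis = ET.fromstring(md.findtext("value"))
    return {
        "md_type": md.findtext("type"),
        "creator": thesis.findtext("creator"),
        "date": thesis.findtext("date"),
        "identifier": thesis.findtext("identifier"),
        "subjects": [s.text for s in thesis.findall("subject")],
    }


record = parse_descriptive_md(MD_XML)
print(record["creator"], record["date"])
```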

DigiTool To LOCKSS

• Related objects

<relations>
  <relation>
    <type>manifestation</type>
    <pid>106483</pid>
  </relation>
  <relation>
    <type>manifestation</type>
    <pid>108561</pid>
  </relation>
  <relation>
    <type>manifestation</type>
    <pid>108562</pid>
  </relation>
</relations>

DigiTool To LOCKSS

• Filename and path

<stream_ref>
  <file_name>Calvo-Esteban.pdf</file_name>
  <file_extension>pdf</file_extension>
  <mime_type>application/pdf</mime_type>
  <directory_path>/exlibris1/bcd03storage/2009/08/27/file_1/106484</directory_path>
  <file_id>1</file_id>
  <storage_id>1005</storage_id>
  <external_type>-1</external_type>
  <file_size_bytes>349524</file_size_bytes>
</stream_ref>
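The two fields that matter for linking are file_name (the original name) and directory_path (where the renamed file lives on the server). A minimal Python sketch of pulling them out is below. It assumes stream_ref is handed over as a standalone fragment; in a full digital entity response it would be nested inside a larger document, and the function name is hypothetical.

```python
import xml.etree.ElementTree as ET

STREAM_REF_XML = """<stream_ref>
  <file_name>Calvo-Esteban.pdf</file_name>
  <file_extension>pdf</file_extension>
  <mime_type>application/pdf</mime_type>
  <directory_path>/exlibris1/bcd03storage/2009/08/27/file_1/106484</directory_path>
  <file_size_bytes>349524</file_size_bytes>
</stream_ref>"""


def parse_stream_ref(xml_text):
    """Return the original filename, storage path, type and size."""
    ref = ET.fromstring(xml_text)
    return {
        "file_name": ref.findtext("file_name"),
        "directory_path": ref.findtext("directory_path"),
        "mime_type": ref.findtext("mime_type"),
        "size_bytes": int(ref.findtext("file_size_bytes")),
    }


stream = parse_stream_ref(STREAM_REF_XML)
print(stream["file_name"], stream["directory_path"])
```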

DigiTool To LOCKSS

• Retrieve each related item to get filename and path for those items

<relations>
  <relation>
    <type>manifestation</type>
    <pid>106483</pid>
  </relation>
  <relation>
    <type>manifestation</type>
    <pid>108561</pid>
  </relation>
  <relation>
    <type>manifestation</type>
    <pid>108562</pid>
  </relation>
</relations>
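Collecting the PIDs to fetch next amounts to walking the relation elements. A minimal Python sketch follows, with a hypothetical function name; the call that then retrieves each related entity goes through the DigiTool web service and is not shown on the slides, so it is omitted.

```python
import xml.etree.ElementTree as ET

RELATIONS_XML = """<relations>
  <relation><type>manifestation</type><pid>106483</pid></relation>
  <relation><type>manifestation</type><pid>108561</pid></relation>
  <relation><type>manifestation</type><pid>108562</pid></relation>
</relations>"""


def related_pids(relations_xml, relation_type="manifestation"):
    """List the PIDs of related objects with the given relation type."""
    root = ET.fromstring(relations_xml)
    return [
        rel.findtext("pid")
        for rel in root.findall("relation")
        if rel.findtext("type") == relation_type
    ]


pids = related_pids(RELATIONS_XML)
print(pids)
```

PIDs are kept as strings because they are used later as directory and file name components rather than as numbers.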

DigiTool To LOCKSS

• Generate a script that creates the links
– Symbolic link for the AU
– From manifest web directory to object

ln -s /exlibris1/bcd03storage/2009/08/27/file_1/106484 18640905-20090930/106484/Calvo-Esteban.pdf

• When the file is harvested, it is given its original filename.
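The ln command can be generated from the values recovered from stream_ref. This is a minimal Python sketch, not the original Perl; the function name and argument layout are assumptions, and it presumes the parent directory of the link already exists when the generated script is run.

```python
import shlex
from pathlib import PurePosixPath


def link_command(au_dir, pid, directory_path, file_name):
    """Build an `ln -s` line linking the stored file to its original name.

    au_dir is the AU's web directory, pid the object's identifier,
    directory_path the renamed file on the DigiTool server, and
    file_name the original filename the harvester should see.
    """
    link = PurePosixPath(au_dir) / pid / file_name
    # shlex.quote protects names containing spaces or shell metacharacters.
    return f"ln -s {shlex.quote(directory_path)} {shlex.quote(str(link))}"


command = link_command(
    "18640905-20090930",
    "106484",
    "/exlibris1/bcd03storage/2009/08/27/file_1/106484",
    "Calvo-Esteban.pdf",
)
print(command)
```

Quoting matters here because original filenames come from student uploads and may contain characters the shell would otherwise interpret.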

DigiTool To LOCKSS

• Manifest Pages
– Transform XML to HTML
– XSLT

DigiTool To LOCKSS

<relations>
  <relation>
    <type>manifestation</type>
    <pid>106483</pid>
    <!-- OA Permission -->
  </relation>
  <relation>
    <type>manifestation</type>
    <pid>108561</pid>
    <!-- Fulltext Index -->
  </relation>
  <relation>
    <type>manifestation</type>
    <pid>108562</pid>
    <!-- Thumbnail -->
  </relation>
</relations>

DigiTool To LOCKSS

<html xmlns:xb="http://com/exlibris/digitool/repository/api/xmlbeans">
<head>
  <title>Manifest for Calvo, Esteban 2009</title>
</head>
<body>
  <h2>Electronic Theses and Dissertations at Boston College</h2>
  <h3>Manifest for Calvo, Esteban 2009</h3>
  <ul>
    <li><a href="http://dcollections.bc.edu/webclient/DeliveryManager?metadata_request=true&amp;GET_XML=1&amp;pid=106484">Metadata and Relationships</a></li>
    <li><a href="Calvo-Esteban.pdf">ETD PDF</a></li>
    <li><a href="Calvo-Esteban-permission.txt">Permissions/Suppressed file</a></li>
    <li><a href="_106484_pdf_thumbnail.jpg">Thumbnail</a></li>
  </ul>
</body>
</html>
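The original workflow produces this page with an XSLT transform. As a rough illustration of the same output shape, the sketch below builds the page directly with Python's standard library instead of XSLT; that is a swapped-in technique, and the function name and arguments are hypothetical.

```python
import xml.etree.ElementTree as ET


def build_manifest(creator, year, metadata_url, files):
    """Build a manifest page listing the metadata URL and linked files.

    files is a list of (href, label) pairs for the symbolic links
    created in the AU directory.
    """
    title = f"Manifest for {creator} {year}"
    html = ET.Element("html")
    head = ET.SubElement(html, "head")
    ET.SubElement(head, "title").text = title
    body = ET.SubElement(html, "body")
    ET.SubElement(body, "h2").text = (
        "Electronic Theses and Dissertations at Boston College"
    )
    ET.SubElement(body, "h3").text = title
    ul = ET.SubElement(body, "ul")
    links = [(metadata_url, "Metadata and Relationships")] + list(files)
    for href, label in links:
        li = ET.SubElement(ul, "li")
        ET.SubElement(li, "a", href=href).text = label
    # method="html" escapes "&" in attribute values as "&amp;".
    return ET.tostring(html, encoding="unicode", method="html")


page = build_manifest(
    "Calvo, Esteban",
    "2009",
    "http://dcollections.bc.edu/webclient/DeliveryManager"
    "?metadata_request=true&GET_XML=1&pid=106484",
    [
        ("Calvo-Esteban.pdf", "ETD PDF"),
        ("Calvo-Esteban-permission.txt", "Permissions/Suppressed file"),
        ("_106484_pdf_thumbnail.jpg", "Thumbnail"),
    ],
)
print(page)
```

Every href other than the metadata URL is relative, so the manifest only points at files that sit beside it in the AU directory.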

Thank you!