
ISSN 1747-1524

DCC | Digital Curation Manual

Instalment on

"Open Source for Digital Curation"

http://www.dcc.ac.uk/resource/curation-manual/chapters/open-source/

Andrew McHugh

Humanities Advanced Technology and Information Institute (HATII)

University of Glasgow, Glasgow G12 8QJ

http://www.hatii.arts.gla.ac.uk

July 2005

ver 1.6

DCC - Digital Curation Centre http://www.dcc.ac.uk

Legal Notices

The Digital Curation Manual is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 2.5 UK: Scotland License.

© in the collective work - Digital Curation Centre (which in the context of these notices shall mean one or more of the University of Edinburgh, the University of Glasgow, the University of Bath, the Council for the Central Laboratory of the Research Councils and the staff and agents of these parties involved in the work of the Digital Curation Centre), 2005.

© in the individual instalments – the author of the instalment or their employer where relevant (as indicated in catalogue entry below).

The Digital Curation Centre confirms that the owners of copyright in the individual instalments have given permission for their work to be licensed under the Creative Commons license.

Catalogue Entry

Title DCC Digital Curation Manual Instalment on Open Source for Digital Curation

Creator Andrew McHugh (author)

Subject Information Technology; Science; Technology--Philosophy; Computer Science; Digital Preservation; Digital Records; Science and the humanities.

Description Instalment on the role of open source software within the digital curation life-cycle. Describes a range of explicit digital curation application areas for open source, some examples of existing uses of open source software, a selection of open source applications of possible interest to the digital curator, some quantifiable statistics illustrating the value of open source software and some advice and pointers for institutions planning on introducing these technologies into their own information infrastructures.

Publisher HATII, University of Glasgow; University of Edinburgh; UKOLN, University of Bath; Council for the Central Laboratory of the Research Councils.

Contributor Seamus Ross (editor)

Contributor Michael Day (editor)

Date 1 August 2005 (creation)

Type Text

Format Adobe Portable Document Format v.1.2

Resource Identifier ISSN 1747-1524

Language English

Rights © HATII, University of Glasgow

Citation Guidelines

McHugh A, (July 2005), "Open Source for Digital Curation", DCC Digital Curation Manual, S.Ross and M.Day (eds), Retrieved <date>, from http://www.dcc.ac.uk/resource/curation-manual/chapters/open-source/

About the DCC

The JISC-funded Digital Curation Centre (DCC) provides a focus on research into digital curation expertise and best practice for the storage, management and preservation of digital information to enable its use and re-use over time. The project represents a collaboration between the University of Edinburgh, the University of Glasgow through HATII, UKOLN at the University of Bath, and the Council of the Central Laboratory of the Research Councils (CCLRC). The DCC relies heavily on active participation and feedback from all stakeholder communities. For more information, please visit www.dcc.ac.uk. The DCC is not itself a data repository, nor does it attempt to impose policies and practices of one branch of scholarship upon another. Rather, based on insight from a vibrant research programme that addresses wider issues of data curation and long-term preservation, it will develop and offer programmes of outreach and practical services to assist those who face digital curation challenges. It also seeks to complement and contribute towards the efforts of related organisations, rather than duplicate services.

DCC - Digital Curation Manual

Editors

Seamus Ross, Director, HATII, University of Glasgow (UK)

Michael Day, Research Officer, UKOLN, University of Bath (UK)

Peer Review Board

Neil Beagrie, JISC/British Library Partnership Manager (UK)

Georg Büechler, Digital Preservation Specialist, Coordination Agency for the Long-term Preservation of Digital Files (Switzerland)

Filip Boudrez, Researcher DAVID, City Archives of Antwerp (Belgium)

Andrew Charlesworth, Senior Research Fellow in IT and Law, University of Bristol (UK)

Robin L. Dale, Program Manager, RLG Member Programs and Initiatives, Research Libraries Group (USA)

Wendy Duff, Associate Professor, Faculty of Information Studies, University of Toronto (Canada)

Peter Dukes, Strategy and Liaison Manager, Infections & Immunity Section, Research Management Group, Medical Research Council (UK)

Terry Eastwood, Professor, School of Library, Archival and Information Studies, University of British Columbia (Canada)

Julie Esanu, Program Officer, U.S. National Committee for CODATA, National Academy of Sciences (USA)

Paul Fiander, Head of BBC Information and Archives, BBC (UK)

Luigi Fusco, Senior Advisor for Earth Observation Department, European Space Agency (Italy)

Hans Hofman, Director, Erpanet; Senior Advisor, Nationaal Archief van Nederland (Netherlands)

Max Kaiser, Coordinator of Research and Development, Austrian National Library (Austria)

Carl Lagoze, Senior Research Associate, Cornell University (USA)

Nancy McGovern, Associate Director, IRIS Research Department, Cornell University (USA)

Reagan Moore, Associate Director, Data-Intensive Computing, San Diego Supercomputer Center (USA)

Alan Murdock, Head of Records Management Centre, European Investment Bank (Luxembourg)

Julian Richards, Director, Archaeology Data Service, University of York (UK)

Donald Sawyer, Interim Head, National Space Science Data Center, NASA/GSFC (USA)

Jean-Pierre Teil, Head of Constance Program, Archives nationales de France (France)

Mark Thorley, NERC Data Management Coordinator, Natural Environment Research Council (UK)

Helen Tibbo, Professor, School of Information and Library Science, University of North Carolina (USA)

Malcolm Todd, Head of Standards, Digital Records Management, The National Archives (UK)

Preface

The Digital Curation Centre (DCC) develops and shares expertise in digital curation and makes accessible best practices in the creation, management, and preservation of digital information to enable its use and re-use over time. Among its key objectives is the development and maintenance of a world-class digital curation manual. The DCC Digital Curation Manual is a community-driven resource, from the selection of topics for inclusion through to peer review. The Manual is accessible from the DCC web site (http://www.dcc.ac.uk/resource/curation-manual).

Each of the sections of the DCC Digital Curation Manual has been designed for use in conjunction with DCC Briefing Papers. The briefing papers offer a high-level introduction to a specific topic; they are intended for use by senior managers. The DCC Digital Curation Manual instalments provide detailed and practical information aimed at digital curation practitioners. They are designed to assist data creators, curators and re-users to better understand and address the challenges they face and to fulfil the roles they play in creating, managing, and preserving digital information over time. Each instalment will place the topic on which it is focused in the context of digital curation by providing an introduction to the subject, case studies, and guidelines for best practice(s). A full list of areas that the curation manual aims to cover can be found at the DCC web site (http://www.dcc.ac.uk/resource/curation-manual/chapters). To ensure that this manual reflects new developments, discoveries, and emerging practices authors will have a chance to update their contributions annually. Initially, we anticipate that the manual will be composed of forty instalments, but as new topics emerge and older topics require more detailed coverage more might be added to the work.

To ensure that the Manual is of the highest quality, the DCC has assembled a peer review panel including a wide range of international experts in the field of digital curation to review each of its instalments and to identify newer areas that should be covered. The current membership of the Peer Review Panel is provided at the beginning of this document.

The DCC actively seeks suggestions for new topics and suggestions or feedback on completed Curation Manual instalments. Both may be sent to the editors of the DCC Digital Curation Manual at [email protected].

Seamus Ross & Michael Day.
18 April 2005

Table of Contents

1 Executive Summary
2 Introduction and Scope
  2.1 Advantages of Open Source for Digital Curators
  2.2 Proprietary Software Development and Distribution
  2.3 Commercial Retention of Control
  2.4 The Philosophy of Open Source and Free Software
  2.5 Facilitating Preservation Through Transparency
  2.6 Open Source Within the Digital Curation Life-cycle
3 Background and Developments to Date
  3.1 The Origins of Free and Open Source Software
  3.2 Open Source – Free Software with Different Emphases
  3.3 Licensing
  3.4 Generic and Specialist Benefits of Open Source
4 How does Open Source Apply to Digital Curation?
  4.1 Life-cycle as User Perspectives
    4.2.1 Cost of Software
    4.2.2 Availability of Assistance
    4.2.3 Developer Advantages
    4.3.1 Customisable Functionality
    4.3.2 Peer Reviewed Software Integrity
    4.3.3 Users Assume a Strong Legal Position
    4.3.4 Increased Security of Digital Resources
    4.4.1 Longevity of Digital Information
    4.4.2 The Relationship Between Open Source and Open Standards
    4.4.3 Portability of Information
    4.4.4 Preservation Through Transparency
    4.4.5 Legal Issues for Long-term Access
    4.4.6 Later Stages
5 Open Source and Free Software In Action
    5.1.1 Government and Public Sector
    5.1.2 Humanities Institutions
    5.1.3 Science
    5.1.4 HE/FE Institutions
    5.1.5 Commercial Organisations
    5.2.1 The GNU/Linux Operating System
    5.2.2 Emulation Applications for Open Source
    5.2.3 Server and Development
    5.2.4 The Apache Web Server
    5.2.5 Databases
    5.2.6 The GRID
    5.2.7 Programming Languages
    5.2.8 Others
    5.2.9 Desktop and Productivity
    5.2.10 OpenOffice.org
    5.2.11 The Mozilla Project
    5.2.12 Specific Open Source Applications for Digital Curation
      (a) Fedora Digital Object Repository Management System
      (b) DSpace
      (c) FreeBXML
      (d) JHOVE
      (e) LOCKSS
      (f) Xena
    5.2.13 Other Institutional Repository Implementations
6 Quantitative Issues
  6.1 Financial Costs of Open Source Software
  6.2 Software Acquisition and Upgrade Costs
  6.3 License Management and Litigation Costs
  6.4 Hardware Costs
  6.5 Support and Training
  6.6 Total Cost of Ownership
  6.7 Longer-term Considerations
  6.8 Performance and Reliability
  6.9 Market Share
7 Future Developments
8 Conclusion
Bibliography
Glossary of Terms
Acronyms and Abbreviations
About the Author

1 Executive Summary

Throughout this Curation Manual a number of individual practices, principles, techniques and technologies are suggested as being particularly appropriate throughout the digital curation life-cycle. Some are uniquely associated with issues of use and longevity, while others are more generic in their application areas, but with identifiable and significant benefits for those charged with the creation, curation and re-use of digital materials. With a range of ubiquitous advantages, open source and free software can offer tangible benefits throughout the digital curation life-cycle. By its nature the adoption of open source represents a broadly affecting cultural measure, which underpins and influences the outcome of numerous other decisions within any digital curation endeavour. Open source software is frequently available without cost, which from a management perspective facilitates the creation of digital content, and its legal status frees data creators from the onerous licensing restrictions associated with proprietary software. In terms of active use, there are rarely any ongoing upgrade costs, so there are fewer concerns associated with ensuring that software is maintained to the latest version. Furthermore, a range of quantifiable evidence suggests that open source applications can match and often better the performance, security and reliability of commercial alternatives. It is in the areas of long-term access and preservation that the open source approach offers some of its most relevant benefits to the digital curator. With its intrinsic transparency it is more straightforward to ensure future accessibility through migration or emulation, bereft of the legal entanglements that with proprietary commercial software may make such activities problematic. Re-use is also facilitated, with open source licenses[1] explicitly permitting the integration, alteration and redistribution of program code. Closely associated with (although by no means synonymous with) open source are open standards and open formats; both are frequently embraced by the open source community. Encoding digital information in a manner which is documented, commonly understood and not linked to an individual commercial product or intended to help pursue a corporate goal is a high priority for many open source developers and distributors.

This Curation Manual instalment[2] discusses at some length the relevant strengths of open source software from a digital curation perspective, as well as detailing some of its more general advantages, which must be understood in order to accept the viability of an institutional or cultural shift to open software products. Through a series of sections it describes a range of explicit digital curation application areas for open source, some examples of existing uses of open source software, a selection of open source applications of possible interest to the digital curator, and some quantitative statistics illustrating the value of open source software.

[1] For convenience and consistency, the American English spelling of the noun “license” is used throughout, since this spelling is most commonly used within discussions of this topic.

[2] This instalment adapts and builds upon materials originally published as part of "Digicult Technology Watch Report 3", 2005, Seamus Ross, Martin Donnelly, Milena Dobreva, Daisy Abbott, Andrew McHugh and Adam Rusbridge, http://www.digicult.info/pages/techwatch.php [Accessed: 7 April 2005, 11:30].

2 Introduction and Scope

2.1 Advantages of Open Source for Digital Curators

The problems implicit in digital curation can be mitigated at every stage of a digital object’s life-cycle by adopting appropriate strategies and exploiting particular technologies. The open source software movement represents and characterises a thirty-year-old software development and distribution philosophy, and offers several valuable advantages to the digital curator. In recent years, the open source ethos has come to fruition within a range of commercial and public sector environments. Core beliefs in the principle of free software availability, and of a community-based approach to software development have increasingly established free and open source software within the mainstream, where its numerous applications now reside as realistic and competitive alternatives to proprietary commercial software that has been produced within a more traditional ‘behind closed doors’ development model. While advantages can be identified with open source in a range of application areas, there are several intrinsic qualities that lend themselves particularly well to digital curation activities, and that make the use of open source tools an excellent starting point for data creators, curators and re-users seeking to facilitate the use and preservation of digital materials.

2.2 Proprietary Software Development and Distribution

The traditional, commercial software development model has a number of key characteristics. When a software application is created, it is written in a programming language, a human-readable syntax that broadly corresponds to the way in which a computer understands and processes information. However, for a computer to make sense of a program it has to be offered in a much ‘lower level’ format - ultimately the 1s and 0s of binary. In order to transform a program from human-readable form to binary, many languages require the code to undergo a process called ‘compiling’. The original, or ‘source’ code is passed through an intermediate program and translated into computer-readable syntax, which to human eyes bears little relation to the original. Within a proprietary model, developers will typically perform the compiling process behind closed doors before distributing the binary results to customers, who can run the program and enjoy its benefits without establishing a sense of how the program works, and without any means of finding out. Users are unable to change the way the program runs, other than by using the program’s inbuilt tools. Often, such utilities offer significant scope for modification – an example is the macro functionality incorporated within Microsoft’s Office suite of applications. However, complete control over functionality is withheld, and ultimately, changes can only be made at the publisher or distributor’s behest.
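The gap between source and compiled form described above can be illustrated with a minimal sketch. It is an analogy under stated assumptions rather than a depiction of any vendor's build process: Python's built-in `compile` produces interpreter bytecode, not native machine binary, and the `area` function and `SOURCE` string are hypothetical names introduced only for this example.

```python
import dis

# Hypothetical one-function "program" in human-readable source form.
SOURCE = "def area(width, height):\n    return width * height\n"

# Compile the source text into a code object. In CPython this yields
# bytecode for the interpreter, not native machine code, so it is only
# an analogy for the compile-to-binary step described in the text.
module_code = compile(SOURCE, "<example>", "exec")

# Execute the compiled module so the function it defines becomes available.
namespace = {}
exec(module_code, namespace)
area = namespace["area"]

# The compiled form of the function is an opaque sequence of bytes.
compiled_bytes = area.__code__.co_code

if __name__ == "__main__":
    print("Source form:")
    print(SOURCE)
    print("Compiled form (hex):")
    print(compiled_bytes.hex())
    print("Disassembly for comparison:")
    dis.dis(area)
```

A recipient given only `compiled_bytes` can still run the logic, but would need a disassembler and considerable effort to recover the intent that is obvious in `SOURCE`. This roughly mirrors the position of a customer who receives only binaries.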

2.3 Commercial Retention of Control

Proprietary software companies are motivated by commercial concerns, and are naturally keen to strengthen their own position, often at the expense of consumer freedom. By limiting access to binary files the vendors control the functionality of their applications, and can impose limitations based on their own distribution or upgrade policies and plans. If problems occur with software, new features are sought or versions are required for alternative hardware/software platforms, they must all be negotiated with the software vendor. Similarly, the customer is quite powerless to fix bugs that are identified within the system, since to do so will generally require some familiarity or interaction with the software at source code level. Because source code access is likely to be limited to a small group of developers, changes take time to implement, and the addition of specialist functionality may be overlooked or considered commercially non-viable. From a digital curation perspective this model means that end-users are unable to identify the characteristics of the software and formats they use, and subsequently are limited in the ways in which they can inject additional functionality or preservation qualities into their digital information. Preservation strategies are likely to be hampered by legal and technical barriers related to restrictive license terms and the software’s closed nature.

2.4 The Philosophy of Open Source and Free Software

In contrast, open source software is developed and released in a more transparent fashion. Instead of concentrating on the financial advantages of limiting access to source code and tightly guarding knowledge, open source software is motivated by community concerns. Source code is openly shared, contributions are welcome from competent users anywhere in the world and software is distributed free from the onerous end-user agreements that characterise a great deal of proprietary software. By empowering users with access to source code the open source methodology encourages and rewards modification, re-use, redistribution and understanding. Institutions and organisations are empowered to choose appropriate tools to achieve their intended outcomes. This helps to limit the dangers posed by relying upon specific commercial proprietary software solutions. The most obvious is the surrendering of IT infrastructure control to the commercially motivated technology vendors – often at significant costs in terms of the ‘curatability’ of one’s digital information.

2.5 Facilitating Preservation Through Transparency

Open source technologies are no longer the marginalised preserve of bedroom hobbyists, with several open source applications among the most proven and reliable of all digital solutions. By regularly embracing the concept of open standards, these technologies further remove the mystery from information storage and use over the longer term. As with source code availability, open standards aim to excise the opaque veneer that threatens and disrupts digital preservation, limits and curtails access to long-term stored documents, and hampers the straightforward exchange and interchange of digital content. Understanding of the structures that underlie the software and formats we use and the legal rights to recreate, modify and re-distribute these structures are great facilitators to everyone: from desktop users seeking an application specifically tailored to their needs to large-scale memory institutions that need to ensure that the software format they select to encode their digital archive will not become obsolete, unsupported and impenetrable within a few years’ time.

2.6 Open Source Within the Digital Curation Life-cycle

The adoption of open source software provides several diverse benefits throughout the entire scope of the digital curation life-cycle. When determining the ‘curatability’ of an application or software format several important criteria must be considered. These include its longevity; the ease of its re-creation or emulation; its adherence to and use of open standards; the level of legal freedom associated with its use; its associated costs; its ubiquity; its support for metadata; and its stability. From the very conception of digital information open source presents some immediate advantages. Software acquisition costs are certainly lower than those associated with equivalent proprietary products, and although other hidden costs are involved in introducing and maintaining an open source infrastructure, several studies agree that total costs of ownership are also significantly lower.[3] Furthermore, transparency through source code availability and the frequent association between open source and open standards facilitates long-term comprehension and re-use, enabling creators, curators and re-users to effectively and explicitly present their digital materials alongside their underlying descriptive infrastructures. In addition, the lack of onerous licensing restrictions that accompany proprietary products and stipulate acceptable conditions for use, redistribution, transfer and reverse engineering removes many of the problems often associated with the management and redeployment of software.

[3] David A. Wheeler, 2005, "Why Open Source Software/Free Software (OSS/FS, FLOSS, or FOSS)? Look at the Numbers!", http://www.dwheeler.com/oss_fs_why.html [Accessed: 7 April 2005, 11:30].

3 Background and Developments to Date

3.1 The Origins of Free and Open Source Software

The open source and free software movements share a common goal, but differ subtly in their emphases. While both oppose proprietary closed-source software development and distribution, their motivations for doing so are contrasting. Nonetheless, both movements believe in the same general ethos: that software should be made universally available in its entirety, with everyone afforded the opportunity to understand, change and re-distribute it.

The free software movement, spearheaded by the Free Software Foundation (FSF) and characterised by the writings of Richard Stallman, has at its core a predominantly political, social, and moral agenda.[4] From its origins in the late 1970s and early 1980s, the free software school grew out of frustration with the barriers imposed by the secrets and non-disclosure agreements surrounding proprietary software. Uncompromising in its philosophy, the movement argues that a number of fundamental human freedoms depend on the ability to access software without obstruction. The definition of free software is displayed prominently on the FSF Web pages, and can be broken down into four parts, each relating to one of four essential freedoms:[5]

1. The freedom to run a program, for any purpose;

2. The freedom to study how a program works, and adapt it to individual needs, implying access to the underlying source code;

3. The freedom to re-distribute copies;

4. The freedom to improve the program and release improvements to the public so that the whole community benefits, again implying source code access.

These simple definitions offer a comprehensive insight into the priorities and motivations of the free software movement. In the absence of a suitably unambiguous word in the English language, the classic definition is free as in free speech, not as in free beer. While there is no stipulation that free software should be made available without cost, it must be possible to re-distribute bought software at no cost if it is to qualify. Stallman rejects the traditional legal ownership arguments for proprietary software. He claims that traditional property law concepts are irrelevant since they relate to the problems caused by taking away someone else’s property, not simply making a copy. According to Stallman, since programs are not consumed in the same way as other types of property, they should not be subject to the same values. The free software movement is committed to a culture whereby users can benefit from the time already spent by others solving problems, negating the need to ‘reinvent the wheel’ themselves.

[4] http://www.fsf.org [Accessed: 7 April 2005, 11:42]; http://www.stallman.org [Accessed: 7 April 2005, 11:30].

[5] “The Free Software Definition,” http://www.fsf.org/licensing/essays/free-sw.html [Accessed: 7 April 2005, 11:40].

3.2 Open Source – Free Software with Different Emphases

Some have cited the uncompromising idealism of the free software movement as one of the main reasons for its continued marginalisation within the computing community, despite its indisputably impressive track record in terms of software development. In the mid 1990s a new movement evolved in reaction to the unease often provoked by Stallman’s favoured socio-political arguments. Seeking to characterise and promote the more ‘sellable’ aspects of free software (giving much less emphasis to the arguments favoured by the FSF), this movement was dubbed ‘open source,’ and is represented by the Open Source Initiative (OSI) with the programmer and writer Eric S. Raymond at its helm.6 The OSI’s foremost argument is that the unique development model underpinning free software leads to better software than that developed behind closed doors by the paid employees of commercial companies. For the purposes of this Curation Manual instalment the most notable way in which this superiority manifests itself is in terms of the increased ‘curatability’ of open source software. Significantly, the open source definition is not structured in terms of individual ‘human freedoms’, instead bearing more relation to a legal document of the kind familiar to users of commercial software products. Among its ten individual requirements are that open source software must be freely distributed; that source code must be available along with any compiled binaries; and that modifications and derived works must be permitted and re-distributable under the same license as the original software.7

6 http://www.opensource.org [Accessed: 7 April 2005, 11:42]; http://www.catb.org/~esr/ [Accessed: 7 April 2005, 11:43].

3.3 Licensing

The most common software license under which free and open source software is distributed is called the GNU General Public License (GPL).8 Originally conceived to describe the legal status of the GNU operating system, this has become the generic standard free software license. It is a copyleft license, and establishes and seeks to protect the freedom of its associated software quite strictly.

Copyleft is a concept of the Free Software Foundation and serves as an alternative to traditional copyright restrictions. The coining of the term came in the light of concerns that without some kind of protection, free and open source software could be taken by proprietary software developers, changed and then re-distributed under a proprietary non-free software license. The copyleft requirement makes it impossible to “strip off the freedom” in this fashion. Copyleft says that anyone who re-distributes the software, with or without changes, must pass along the freedom to further copy and change it. Some opponents of free software have dubbed this “a viral clause” because it means that any new software that incorporates existing copyleft code automatically inherits the same license. Thus, the code and the freedoms of the license become legally inseparable. From a digital curation perspective, copyleft offers a level of assurance that any measures taken to limit software obsolescence and to facilitate use are likely to persist within an application irrespective of any subsequent revisions or redevelopment that takes place.

7 Needless to say, these requirements have a great deal in common with those outlined for Free Software. However, the Free Software Foundation generally places higher ethical demands on software licenses, so while most, if not all, Free Software approved licenses will be open source, the opposite is not necessarily true.

8 GNU is a recursive acronym for ‘GNU’s Not Unix’.

One criticism that is often levelled at the General Public License is that its terms and content were conceived mainly in accordance with the United States legal system. Several commentators have suggested that the GPL is weaker or even completely inapplicable within alternative, non-US jurisdictions.9 Consequently various regionally specific licenses have been developed. A good example is the CeCILL license drafted by the French scientific research community in response to inadequacies of the GPL in the French legal context.10 However, controversy still looms to an extent, since the Open Source Initiative has yet to approve CeCILL as a conforming license. Although the intentions behind it are good, European legal reaction to it has remained somewhat wary.

Several different open source and free software licenses exist. Open source licenses are those explicitly acknowledged by the OSI, and these currently number around fifty individual licenses, each with its own profile. The Free Software Foundation is responsible for identifying those licenses that qualify as free software.

9 Alex Thurgood, January 2005, "The GPL and non-U.S. law", Open Source Law Blog, http://www.oslawblog.com/2005/01/gpl-and-non-us-law.html [Accessed: 7 April 2005, 11:43].

10 http://www.cecill.info/licences/Licence_CeCILL_V1.1-US.html [Accessed: 7 April 2005, 11:43].

3.4 Generic and Specialist Benefits of Open Source

Notwithstanding the moral and ethical arguments in favour of free and open source software, most readers will expect some insights into their more pragmatic merits before being tempted to use them, or to develop applications under their terms. There are several persuasive arguments in favour of using open source software, and of releasing under an open source license. Many of these are generic, and equally applicable to computer users in any field, activity or industry, but there are several advantages of particular relevance to the digital curation community. Clearly, a software infrastructure that facilitates digital curation can only be viable if it also offers functionality, value and reliability on a par with or in excess of that offered by alternative proprietary tools. The remainder of this Curation Manual instalment will therefore concentrate on both the general qualities of open source software and those aspects that make an open source software environment extremely useful from the specific perspective of digital curators.


4 How does Open Source Apply to Digital Curation?

4.1 Life-cycle as User Perspectives

In considering the value of open source for digital curation it is convenient and worthwhile to consider its merits through every stage of the data life-cycle model. Across a life-cycle encompassing creation, active use, archiving, preservation, access and re-use, and disposal or transfer, one can identify three main user roles: data creator, data curator and data re-user. The following sections will consider the advantages offered by open source in the context of each of these roles.

4.2 Perspective 1 - Data Creator

4.2.1 Cost of Software

The first, and perhaps most often cited benefit of open source software relates to the issue of acquisition cost. Open source and free software need not be distributed without charge, but the definitions ensure that while a vendor could sell an open source product for a fee, its purchaser could then re-distribute it for free. Total cost of ownership is a more complex issue, incorporating a number of often more hidden costs across the entire data life-cycle, and this is explored in more depth below. Nonetheless, it can be indisputably stated that from a financial perspective, open source software empowers users to begin working with digital materials and create digital content quickly and with few onerous responsibilities.

4.2.2 Availability of Assistance

Help for data creators is also widely available within the open source community. With vast documentation projects often coexisting alongside software development, comprehensive and useful information for using and understanding open source software is increasingly available. In addition to formal help materials, an Internet-wide community of expertise contributes to an ever expanding knowledge pool consisting of discussion fora, FAQs and “how to” guides.

4.2.3 Developer Advantages

Similarly, there are numerous additional advantages in favour of creating new digital information within the open source community. From a practical perspective, a great deal of software is released under an open source license because it is the only way to legally integrate existing free software code or libraries. Because of the vast range of well-written, standards compliant and commonly understood software that is currently only available under copyleft licenses like the GPL it may be a more attractive prospect to build on the work of others than start from scratch, replicating functionality that already exists and is freely available. In addition, basing work on mainstream ‘accepted’ code automatically expands the pool of developers and users with an interest in ensuring its longevity. An open source approach also offers the opportunity to consult and collaborate with a wide Internet-based developer community to facilitate testing and improvements. This is particularly useful for solo developers, and small groups for whom outside intervention, assistance and feedback are beneficial and otherwise unavailable. In addition, several popular web sites exist to promote and distribute open source software materials. For applications that work well, success, prominence and large-scale adoption usually follow with word spreading around the community, fuelled by exposure on sites like Sourceforge.net.11 Umbrella resources like the Open Source Technology Group offer a platform for shared ideas and knowledge interchange, and at the same time provide mechanisms for the promotion and distribution of open source applications.12

4.3 Perspective 2 – Data Curator

4.3.1 Customisable Functionality

The overall depth of functionality offered by a particular program is an immediate and obvious indicator of its value from an active use perspective. Since open source involves potential users at every stage of the development process, the functionality requirements and expectations that users have can be identified and implemented effectively. Unique or marginalised functionality can be incorporated into existing applications straightforwardly, due to the availability of source code. Features that would not be worthwhile for a commercial company to implement due to the lack of overall user demand can be introduced, and new projects can be started when it seems that a particular functional requirement is unlikely to be met. The open source model empowers users to either develop their own specifically required applications or to add their own functionality to those that already exist. For example, it is possible for developers to incorporate additional digital curation functionality such as metadata support or compatibility with additional file formats into existing open source software. The culture of customisation ensures that software and digital content can be altered to suit specific requirements and expectations. The commercial software world cannot match this level of flexibility. It will often charge a fee to add a particular feature at the request of an individual client or, more likely, assure the customer that the functionality will be integrated when they pay to upgrade to the next version of the software.

11 http://sourceforge.net/ [Accessed: 7 April 2005, 11:44].

12 http://www.ostg.com [Accessed: 7 April 2005, 11:44].

Needless to say, many users of open source software are ill-equipped in terms of expertise or time to individually implement every change they require, or to effect the modifications in house. However, by facilitating development on a global basis, the open source model enables organisations to outsource freely, or to motivate others within the community who are equipped to modify or build upon code to do so. It is likely that open source users will continue to seek assurances that software developers are committed to their products long-term and that ongoing maintenance will be undertaken, new features added and bugs fixed. However, although the developer-customer relationship can continue to exist in this fashion, it is not the only one that will ensure these ends are satisfied. If the developer reneges on a commitment that he or she has made then the unique status of open source software ensures that another individual or organisation can intervene and ensure the product’s sustainability.

4.3.2 Peer Reviewed Software Integrity

The identification and correction of bugs in programs is another strength of the open source development model. All but the most straightforward of software programs will contain bugs: they are a regrettable, but inevitable part of software development. With traditional software development, it is common for a team of programmers to complete an application and, through a period of evaluation, to identify and fix errors. However, software companies face great pressure to get their products on the shelves to begin generating income, and therefore bug-fixing schedules are often necessarily limited. Users will often find flaws, but without access to source code it is impossible for them to personally remedy these; the only recourse is to notify the relevant software publisher. If a bug is sufficiently serious then the company will probably issue a software patch to repair the flaw. However, since they must rely on error reports from users with no access to source code it can be difficult to trace bugs; this represents a failure to exploit the application users’ programming and debugging abilities and dramatically lengthens the process. In addition, there is usually nothing but goodwill to guarantee that companies make any corrections available free of charge and they might simply refuse to address the flaw at all, leaving users with no option but to learn to accept deficiencies in their applications.

Relying on the philosophy of releasing software early and often, open source projects are likely to receive users’ bug reports before a program is even close to the level of maturity that a commercial company would deem acceptable for release. With access to source code, collaborators can fix problems themselves, or offer detailed accounts in programming terminology of where and in what circumstances bugs manifest themselves. “Treating your users as co-developers is your least-hassle route to rapid code improvement and effective debugging,” writes Eric Raymond, whose mantra is: “Given enough eyeballs, all bugs are shallow.”13 In addition, open source projects are becoming increasingly well documented, with the rapid growth and mainstream proliferation of open source leading to the generation and prioritisation of good quality documentation. As well as facilitating debugging the peer review system can also be used to ensure that digital information or content is sufficiently functionally rich throughout its development and maintenance. For instance, from a digital curation perspective, collaboration can take place to ensure that the information infrastructures are optimised for longevity, continued accessibility and re-use. While poorly written code is by no means the exclusive preserve of the proprietary software development world, the chances of a bad open source program continuing to be developed badly are mitigated somewhat by the community’s watchful eyes.

13 Eric S. Raymond, "The Cathedral and the Bazaar", http://ot.op.org/cathedral-bazaar.html [Accessed: 7 April 2005, 16:28]

4.3.3 Users Assume a Strong Legal Position

An argument often raised by those opposed to open source is that if something goes wrong with the software there is no one to direct the blame towards. From the digital curator’s point of view this might provoke concerns: if one relies upon a particular software package or data format to ensure the curatability of digital resources then there are certainly grounds for dissatisfaction and an expectation of recompense if this is not achieved. While this is true, the majority of proprietary licenses will include terms absolving the vendor of responsibility for problems caused by flaws or shortcomings in software. For instance, no one has successfully sued Microsoft for downtime or information lost as a direct consequence of security loopholes in their Windows operating system.14 Due to the financial models underpinning a variety of open source software it is likely that vendors will offer support contracts and software guarantees as a commercial service quite distinct from the distribution of software itself, which may incorporate rights to compensation in the event of a failure to meet their commitments. In addition, as discussed above, the transparency of open source enables individual developers or administrators to independently implement solutions to overcome the shortcomings in the programs that they use.

14 In fact, Microsoft’s Windows XP license includes the clause “In no event shall Microsoft…be liable for any…damages whatsoever…even in the event of fault…(including negligence).” Todd Bishop, September 2003, "Should Microsoft Be Liable For Bugs?" Seattlepi.com, http://seattlepi.nwsource.com/business/139286_msftliability12.html [Accessed: 7 April 2005, 11:45].

4.3.4 Increased Security of Digital Resources

The Internet remains a dangerous place, with the potential for virus infection, denial of service attacks and interception of personal details presenting serious concerns. Security is therefore something that must be taken into consideration during any digital curation workflow. For instance, where materials are stored in remote repositories, security must be assured in order to be confident that retrieved information has not been compromised, or altered from its initially deposited form. It is often argued that by making source code available it will be easier for malicious individuals to identify and exploit security vulnerabilities in open source software. This is dismissed by open source advocates as a false argument, and typical of the “Fear, Uncertainty and Doubt” strategies frequently employed by the proprietary software world. The sense of ‘security through obscurity’ promoted by proprietary software companies is considered to be rather dangerous: not only does it create a false sense of safety, but it also limits opportunities for identifying existing security loopholes. Malicious ‘crackers’ are motivated and determined, and will uncover any vulnerabilities that exist whether code is freely available or not. Opening the source ensures that those who are interested in finding problems to fix and secure (rather than to exploit) can do so as straightforwardly as possible. Also, while it should certainly not lead to complacency, open source users are less likely to find themselves the target of malicious attacks than those using proprietary tools, since a great deal of destructive code is motivated by distrust and resentment directed towards large corporations. This situation could change if open source continues to establish itself within the mainstream, since the intention of malicious crackers may be to simply attack the biggest targets, irrespective of political factors.
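The requirement stated earlier in this section, that information retrieved from a repository must not have been altered from its initially deposited form, is commonly addressed by recording a checksum at deposit and recomputing it at retrieval. This instalment does not prescribe any particular mechanism, so the following Python fragment is only a minimal sketch under stated assumptions: the choice of SHA-256, the function names sha256_of, is_unaltered and demo, and the temporary file used as a stand-in for a deposited object are all illustrative and hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_unaltered(path, recorded_digest):
    """Compare the file's current digest with the one recorded at deposit."""
    return sha256_of(path) == recorded_digest


def demo():
    """Deposit a hypothetical object, then check it before and after tampering."""
    with tempfile.TemporaryDirectory() as tmp:
        deposited = Path(tmp) / "deposit.txt"  # stand-in for a deposited object
        deposited.write_bytes(b"original deposited content")
        recorded = sha256_of(deposited)        # digest stored at deposit time
        before = is_unaltered(deposited, recorded)
        deposited.write_bytes(b"content altered after deposit")
        after = is_unaltered(deposited, recorded)
        return before, after
```

Because the check relies only on a published hash algorithm and an openly available implementation, any party can repeat it independently, which is in keeping with the transparency argument made above. A checksum only detects alteration; it does not by itself prevent it or identify who was responsible.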

4.4 Perspective 3 – Data Re-users

4.4.1 Longevity of Digital Information

The expected lifetime of open source software compares favourably with proprietary alternatives, although arguments can be presented to suggest that either is assured of greater longevity. Ubiquity is an appealing characteristic, and it is frequently maintained that mainstream commercial applications with large distributions are more likely to be accessible in the future due to the sheer number of people who have a vested interest in ensuring that this is the case. Few would question the fact that increasing the number of stakeholders is likely to increase the demand for an application or file format’s sustained accessibility (the so-called ‘follow the crowd’ approach). Particular concerns could be levelled at open source projects that are marginalised within the overall digital community, and formats that although open are less well supported and less frequently used than commercial alternatives, such as the OpenOffice 1.1 document or Ogg Vorbis digital audio formats. However, one must make a clear distinction between the size of the user community that is interested in ensuring an application or data-set’s longevity and the straightforwardness with which this can be achieved. Digital curators face both sociological and technical challenges. It is suggested that the latter are better addressed by the use of open source. Transparency is at the very foundation of open source and free software, promoting understanding and facilitating its curation. In addition, such software is far more likely to embrace open formats and standards in favour of proprietary alternatives. Therefore, although there may be more voices demanding the curation of commercially distributed proprietary software, it is likely that a comparatively modest number of open source users can achieve the same goal with less effort and with significantly less expense incurred. People power can make it easier to overcome the barriers to digital curation, but although currently less well used, open source software itself overcomes a number of the most problematic obstructions that the digital curator is likely to face.15 Furthermore, while it lacks comparable user numbers in many disciplines and application areas (particularly the desktop domain), there are several areas in which open source software, such as the Apache Web Server, the Bind DNS server and prominent institutional repository implementations like DSPACE, Fedora and GNU EPrints, holds dominant positions, even over commercial alternatives.

15 See Figure 1

Figure 1 © Digital Curation Centre

4.4.2 The Relationship Between Open Source and Open Standards

The philosophies underpinning open source software have close associations with the concepts of open standards that are vital for successful exchange, re-use and preservation of documents and data. Open standards are those that, by virtue of their transparency and accepted nature, offer a degree of protection against obsolescence and inaccessibility. Technologist Bruce Perens suggests a definition of the principles and practices surrounding open standards, and offers detailed insights into what significant details elevate a common specification to the status of open standard.16 According to Perens’ criteria an open standard offers the freedom to view and implement it, prevents customers from being ‘locked in’ to a particular vendor or group, and ensures that there is no associated royalty or fee and no favouring of one implementer over another. Although it should be possible to extend open standards or offer them in subset form, controls must exist to prevent dominant vendors from implementing the standard with extensions that are incompatible with other systems. Close parallels can be drawn between these principles and those expressed within the open source definition, particularly in terms of the concepts of freeness and unencumbered total access that each promotes. Examples of open standards for file formats include the OASIS Open Document Format, the World Wide Web Consortium's (X)HTML and Adobe’s PDF format. These represent reasonably safe starting points for storing content for future retrieval since all are understood, documented and published. There are no associated licensing costs and no charge can be levied for their use and distribution.

16 Bruce Perens, "Open Standards: Principles and Practice" http://perens.com/OpenStandards/Definition.html [Accessed: 7 April 2005, 11:46].

Commercial software companies tend to assume that if they can propagate their own particular file formats widely enough, people will soon become reliant upon them. The most obvious example is Microsoft’s Office suite, which uses the core file formats .doc, .xls, .mdb and .ppt to encode word-processed documents, spreadsheets, database files and slide show presentations. None of these are open standards, and therefore it is impossible to gain a thorough understanding of how they work. Consequently there is no way to confidently read and write to these formats with programs other than Microsoft’s own.17 Recent reports from Microsoft suggest that future versions of their document formats will be defined in XML, which should theoretically introduce a greater degree of transparency in their structure. However, within a community suspicious of Microsoft (following for instance their extremely limited ‘Shared Source’ scheme18) few expect these plans to lessen the opacity intrinsic to Microsoft’s products to any significant degree. These expectations are reinforced by Microsoft’s failure to offer confirmation that future versions of their software will support OASIS's Open Document Format.

17 Numerous projects, such as OpenOffice.org have tried to remedy this, with some success. See also summary of CAMiLEON working papers in: The DigiCULT Report Full Report “Technological Landscapes for tomorrow’s cultural economy: Unlocking the value of cultural heritage”, (January 2002), p. 212. Available online at http://www.digicult.info/pages/report.php, [Accessed: 7 April 2005, 11:46]

18 ‘Shared Source’ is a Microsoft initiative under which enterprise users, academics and others can get controlled access to select parts of Microsoft’s source code. Heavily criticised for its toe-in-the-water conservatism, The UK Register web site described it as nothing more than a "worthless PR exercise", Andrew Orlowski, 2004, "Why Microsoft ‘Shared Source’ Can Never Be Trusted", http://www.theregister.co.uk/2004/03/17/why_microsoft_shared_source_can/, [Accessed: 6 July 2005, 15:30]

David Rosenthal argues that the incompatibility of data formats within the Office suite is a deliberate and quite integral part of Microsoft’s business model. By distributing its software at low cost with new computers pressure is placed on users to pay for expensive ‘essential’ upgrades that are subsequently introduced. The case for upgrading can be persuasive: ongoing support may depend on it and poor backwards compatibility in many products may render collaboration impossible if one’s peers are running a newer version. The implications that this has for accessibility need little further explanation. By its nature open source software can be understood and if necessary replicated at a later date, and open formats boast similar long-term lucidity. Developers and users can continue to access open digital resources in the future more straightforwardly than proprietary assets with missing digital jigsaw pieces.
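As a rough illustration of what the ‘long-term lucidity’ of an open format can mean in practice, the sketch below reads paragraph text from an OpenDocument-style package using nothing but general-purpose tools. It relies on two facts about the published format: a document is a ZIP package, and its body is stored as XML in a member named content.xml. Everything else is an assumption made for the example. The function names are hypothetical, and the sample package built in memory is a deliberately stripped-down stand-in (it omits the mimetype and manifest entries that a conforming file requires), not a valid document.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Namespace URIs published in the OpenDocument specification.
OFFICE_NS = "urn:oasis:names:tc:opendocument:xmlns:office:1.0"
TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"


def build_sample_package():
    """Build a tiny, non-conforming stand-in for an OpenDocument package."""
    content = (
        f'<office:document-content xmlns:office="{OFFICE_NS}" '
        f'xmlns:text="{TEXT_NS}">'
        "<office:body><office:text>"
        "<text:p>First paragraph.</text:p>"
        "<text:p>Second paragraph.</text:p>"
        "</office:text></office:body>"
        "</office:document-content>"
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as package:
        package.writestr("content.xml", content)
    return buffer.getvalue()


def extract_paragraphs(package_bytes):
    """Return the text of each text:p element found in content.xml."""
    with zipfile.ZipFile(io.BytesIO(package_bytes)) as package:
        root = ET.fromstring(package.read("content.xml"))
    return ["".join(p.itertext()) for p in root.iter(f"{{{TEXT_NS}}}p")]
```

Calling extract_paragraphs(build_sample_package()) returns the two sample paragraphs. The point is not the code itself but that no vendor-specific library or reverse engineering is needed: a ZIP reader and an XML parser, both widely and openly implemented, are sufficient to recover the text.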

4.4.3 Portability of Information

In terms of ease of emulation and potential portability open source carries a significant advantage. Without having to painstakingly reverse engineer existing applications, software environments and data formats it is theoretically straightforward to take information in a particular form or structure and recreate or repackage it as required.19 Binary-only, non-documented and non-standard software is shrouded in mystery, with even sophisticated decompiler software unable to offer reliable and definitive insights into fundamental underlying qualities. The efforts of OpenOffice.org to create a standards compliant productivity suite supporting a range of both proprietary and open formats offer a somewhat trite, but insightful example of the kinds of barriers faced when dealing with commercially encoded digital assets. The contemporary problems associated with these are likely to be amplified many times in the future. Without an intimate understanding of the Microsoft Word format for instance, it is impossible to adequately and confidently render all the information contained within a .doc file in any non-Microsoft endorsed environment.20 As long as one has to rely upon an individual private corporate organisation to access and use one’s digital content, it can never be effectively curated, and its longevity can never be assured. Technology journalist David Berlind pulls no punches: “Putting the vendor in control of your IT costs is not a good position to be in. Unfortunately, that’s where a lot of us are.”21 These costs are likely to be more than simply financial. The combination of proprietary software and formats puts the software distributor in an unhealthily powerful position, and exposes the customer to the cost of ‘essential’ upgrades and even greater problems should the technology be discontinued or the developer become insolvent. Every time that a user records data in a closed format it tightens the grip held by its proprietary developer. The chain becomes increasingly more difficult and more expensive to break away from. It should be clear that when preservation issues and future content access are considered, the problem is only exacerbated. The OAIS reference model speaks of the importance of the availability of Representation Information to ensure that the informational value of our preserved bit-streams remains available even into the future. The most frequently cited items of Representation Information include software and format specifications, which with an open approach to software management, acquisition and distribution are likely to be much more readily available than within a proprietary commercial model.

19 S. Ross and A. Gow, 1999, "Digital archaeology? Rescuing Neglected or Damaged Data Resources", (London & Bristol: British Library and Joint Information Systems Committee), ISBN 1900508516, http://www.ukoln.ac.uk/services/elib/papers/supporting/pdf/p2.pdf [Accessed: 7 April 2005, 11:47].

20 Maria Guercio and Cinzia Cappiello, 2004, "File Formats Typology and Registries for digital preservation", (DELOS, WP6 D6.3.1), http://www.dpc.delos.info [Accessed 7 April 2005, 11:48].

21 David Berlind, July 2002, "Who Gave Microsoft Control of Your IT Costs? You did", http://techupdate.zdnet.com/techupdate/stories/main/0,14179,2875958,00.html [Accessed 7 April 2005, 11:48].

4.4.4 Preservation Through Transparency

The digital curation community is no stranger to costly projects that have failed because of technological choices that were overly proprietary or marginalised. The BBC’s Domesday project is perhaps the most frequently cited example.22 In 1986, to celebrate the 900th anniversary of the Domesday Book, a project was undertaken to incorporate a diverse range of materials contributed by UK schoolchildren within a multimedia resource. Unfortunately, the project’s technological choices led to certain subsequent problems. For storage media, the project team chose to use Philips’ proprietary LaserVision LVROM disc, which could only be played on the associated LVROM player. The multimedia application itself was written in a language called BCPL, a precursor to C, and ran on the BBC Model B platform, which had to be modified to interface with the proprietary discs, increasing costs and limiting the chances of viability and uptake. Regrettably the system soon became obsolete and, less than twenty years later, very few players or discs remain. Only the sustained efforts of the CAMiLEON project to rescue the application and implement an emulation strategy have ensured that future generations can access this valuable resource.23

22 http://www.atsf.co.uk/dottext/domesday.html [Accessed 7 April 2005, 11:48].

23 http://www.si.umich.edu/CAMILEON/domesday/domesday.html [Accessed 7 April 2005, 11:49]. See Daisy Abbott, "Overcoming the Dangers of Technological Obsolescence: Rescuing the BBC Domesday Project", DigiCULT.Info 4, Page 4, http://www.digicult.info/pages/newsletter.php [Accessed 7 April 2005, 11:49].

Domesday need not have come up against such problems if a more future-conscious series of decisions had been made at its conception. One of the biggest single issues for the project (and for digital curation more generally) was its inability to ensure that at an unknown time in the future users would still be able to access the stored digital materials.24 An analogy with the original Domesday Book is offered on the CAMiLEON project web site. If the Latin language in which the original Domesday Book was written became somehow incomprehensible, accessing the information it holds would be impossible.25 A Latin dictionary could be used to overcome this problem and the OAIS reference model would call this Representation Information. What must be remembered is that the remit of the digital curator is not limited to simply maintaining the physical materials themselves, something which is conceptually quite straightforward; it is also necessary to ensure that a method of understanding and exploiting their full usefulness continues to exist in perpetuity. Instead of dwelling on the curation or preservation of data one must strive for long-term access to information. By using open, standardised formats, one can more feasibly limit the problems caused by the passage of time. If the Domesday project had used an open standardised structure to describe and encode its multimedia components, together with open formats for incorporated sound and video, it is likely that a clearer understanding of the data structures could be established in the future, with less guesswork or painstaking reverse engineering procedures. Furthermore, if the source code of the application had been made freely available it too could continue to be understood and broken down into more easily migrated algorithmic chunks.

24 For an example of an open source tool which ‘normalises’ file formats for preservation see Adam Rusbridge, April 2004, “XENA: Electronic Normalising Tool”, DigiCULT.Info, Issue 7, page 32, http://www.digicult.info/pages/newsletter.php [Accessed 7 April 2005, 11:49].

25 http://www.si.umich.edu/CAMILEON/domesday/faq.html [Accessed 7 April 2005, 11:50].
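The Latin dictionary analogy above can be made a little more concrete with a small sketch. The Python below is illustrative only: the names `RepresentationInfo`, `ArchivedObject` and `at_risk`, along with the sample identifiers and URLs, are hypothetical and are not taken from the OAIS reference model, which defines Representation Information conceptually rather than as code. The sketch simply pairs each stored bit-stream with pointers to a format specification and to source-available software, and flags holdings for which neither is on record.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class RepresentationInfo:
    """Hypothetical record of what is needed to interpret a bit-stream."""
    format_name: str
    format_spec_url: Optional[str] = None  # open, published specification (if any)
    reader_source_urls: List[str] = field(default_factory=list)  # source-available software

    def is_documented(self) -> bool:
        # Assumption for this sketch: a specification or at least one
        # source-available reader counts as usable Representation Information.
        return bool(self.format_spec_url) or bool(self.reader_source_urls)


@dataclass
class ArchivedObject:
    identifier: str
    bitstream: bytes
    rep_info: RepresentationInfo


def at_risk(holdings: List[ArchivedObject]) -> List[str]:
    """Return identifiers of objects with no recorded means of interpretation."""
    return [obj.identifier for obj in holdings if not obj.rep_info.is_documented()]


# Illustrative holdings; identifiers and URLs are placeholders, not real records.
holdings = [
    ArchivedObject(
        "survey-text",
        b"<survey/>",
        RepresentationInfo("XML", format_spec_url="https://example.org/xml-spec"),
    ),
    ArchivedObject(
        "video-disc-image",
        b"\x00\x01",
        RepresentationInfo("undocumented proprietary format"),
    ),
]

print(at_risk(holdings))
```

Run as written, the sketch reports only the second object, since it is the only one with neither a specification nor a reader recorded. A real repository would hold far richer Representation Information, but the point carries over: with open formats and open source software the two fields are easy to populate, whereas with closed ones they often cannot be filled at all.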

New digital hardware will inevitably be introduced, and it is likely that the machines we use ten years from now will operate quite differently from those we are familiar with today. But the hardware level is just one of several potential areas where problems can occur for the digital curator. Assuming that data are encoded in an open file format, and that the programs that read, access and write to these formats are open source, any hardware-related preservation problems can be more straightforwardly overcome. The knowledge conferred by the use of open source applications and open standards empowers future users, enabling resources to be more straightforwardly manipulated within a future hardware configuration. The INFORM methodology proposed by Andreas Stanescu26 suggests several classes of risk, including those originating from the digital object’s format, its associated software, and organisations and communities related to the preservation plans for the object. Open source software and open, standardised formats are likely to fare very well in each of these categories. In a closed source environment, future preservationists will face the onerous tasks of reverse engineering software to run on new platforms, developing emulators without a clear understanding of systems that need to be replicated and continuing to maintain otherwise obsolete hardware upon which resources are known to operate.

26 Andreas Stanescu, 2005, "Assessing the durability of formats in a digital preservation environment: The INFORM methodology", (OCLC Systems and Services, International Digital Library Perspectives, Vol 21, Number 1, 2005, pp. 61-81)

4.4.5 Legal Issues for Long-term Access

The practical problems inherent in dealing with proprietary software represent only part of the problem. In addition there are likely to be legal obstacles to the emulation or porting of commercially distributed, proprietary software. Terms of use, usually strictly defined in software license agreements, generally operate on a can-do basis, with an implicit assumption that if a particular type of use is not mentioned it is forbidden. It is therefore common for commercial software licenses to prohibit the emulation, porting, migration and reverse engineering of application information or datasets. Similarly, restrictive terms of use may infringe upon one’s ability to collect and maintain appropriate Representation Information to prolong the useful life of a digital object. For software to qualify as open source or free, however, there are assumptions to the contrary, in favour of freedom of use, re-use and redistribution. License agreements associated with OSS make it easier to take preservation measures without fear of violating the intellectual property claims of the original developers. While proprietary vendors may have little interest in continuing to support their software indefinitely and seldom offer the means to ensure its longevity, they often tend to legally challenge anyone else who attempts to do so. This places a further burden on the user to administer licensing and terms of use documentation to ensure that the correct infrastructure is in place and that no infringements can take place. Open source has no such problems – in contrast, only steps taken to limit free access to open source software are likely to fall foul of licensing agreements.

4.4.6 Later Stages

Later stages of the digital curation lifecycle, such as disposal or transfer of stewardship, are further facilitated and simplified within an open source infrastructure. With none of the legal barriers to redistribution that proprietary licenses often impose, open source materials are comparatively straightforward to disseminate and transfer.


5 Open Source and Free Software In Action

5.1 Areas of Use

From its origins in laboratories, technology centres and student dormitories, open source has exploded in popularity over the last few years, and now performs a key role in the IT policy and infrastructure of many organisations, institutions and companies. Various reasons are cited for the adoption of open source, and these tend to vary across the sectors in which it enjoys exposure. Many users are attracted to the traditional stability and reliability of the software, others to its straightforward integration with heterogeneous system environments, and others, notably the digital curation community, to the empowering transparency and freedom that are both intrinsic to open source. Many more may be convinced by the financial savings that might be achieved from using these technologies.

5.1.1 Government and Public Sector

The increasing success of open source in the public and government sector has been one of the more significant developments of recent times in terms of technology take-up. Public sparring between proprietary software companies and the open source movement for lucrative governmental IT contracts emphasises the significance of this market, particularly for the subsequent dissemination of technologies throughout the public and social hierarchy, within the new era of e-government and digital public administration. Numerous reasons can be identified to explain the enthusiasm with which open source has been embraced by many public bodies. Perhaps the most obvious are related to the financial savings it affords, which in governmental terms may be a vote-winner. However, this is only part of the story. There can be little doubt that government bodies and agencies are to some extent wary about the potential consequences of trusting their entire IT infrastructure to one or two private (usually foreign) companies who are likely to guard their software secrets closely. Open source software is able to nullify this problem. In addition, governments are invariably charged with the responsibility of ensuring that provisions are in place to preserve a wide range of information for future generations. Open source facilitates this in a way that proprietary infrastructures cannot. Governments are formally charged by their electorate with the responsibility to maintain public records long-term, and legislation such as the Freedom of Information Act in the United Kingdom offers persuasive arguments for a move towards a more open data environment.

There are numerous examples of large-scale public sector migrations to open source within Europe and further afield. The French Government has decided that central administration should terminate most of its agreements with proprietary vendors for the supply and use of software, meaning that national and local authorities are to use open source software as far as possible. The Agency for Information and Communication Technologies in Administration (ATICA) was set up in August 2001 to support this decision, and to coordinate the various governmental agencies and bodies towards the intended outcome. Similarly, the German central administration in June 2002 entered into a framework agreement with IBM and SUSE on the supply of open source products based on Linux, making it possible for German public administration to acquire Linux-based systems at a reduced price from IBM.27 The agreement incorporates the supply of servers and workstations, as well as ongoing support from IBM. While this promotion of open source is not a law, it represents a tempting incentive for decision-makers within the German public sector to adopt open source. The United Kingdom has shown a clear commitment too, and the British Office of the e-Envoy issued an open source policy at the end of October 2004 which states that British Government and authorities will in future consider open source, declaring in particular a concern about being ‘locked-in’ to the products of single private commercial companies.28

27 http://www.ibm.com [Accessed 7 April 2005, 11:50]; http://www.suse.com [Accessed 7 April 2005, 11:51].

28 See Danish Board of Technology, October 2002, “Open Source Software in e-government”, http://www.tekno.dk/pdf/projekter/p03_opensource_paper_english.pdf [Accessed: 7 April 2005, 11:51] and UK Government, October 2004, "Open Source", http://www.govtalk.gov.uk/policydocs/policydocs_document.asp?docnum=905 [Accessed: 8 July 2005, 16:27]

The International Institute of Infonomics report entitled ‘Free/Libre and Open Source Software: Survey and Study’ (2002) recommended and reported a widespread deployment of open source tools throughout European government.29 This document contains several accounts of the public sector embracing open source systems. The French Ministry of Culture migrated 400 servers from Unix and Windows NT to Linux and intends to have comprehensive Linux server solutions by 2005. The Ministry of Justice and national crime register use a combination of open source tools such as the Apache Web Server, Perl, Samba, and fetchmail, with an imminent migration envisaged from proprietary Unix to Linux, PHP, and MySQL. Finally, the Ministry of Defence have FreeBSD, an open source operating system comparable to GNU/Linux, installed within their infrastructure.

29 Except where otherwise stated, accounts are from International Institute of Infonomics, 2002, "Free/Libre and Open Source Software: Survey and Study", http://www.infonomics.nl/FLOSS/report/ [Accessed: 7 April 2005, 11:51].

In what is regarded as one of the most significant developments in the lifetime of open source software, the City of Munich in Germany officially confirmed in June 2004 that it would be transferring 14,000 municipal desktop computers from Microsoft Windows to open source, combining Linux server software, desktop software, and virtual machine technology from VMware to provide interoperability among heterogeneous systems.30 Bloomberg News described this as Microsoft’s “biggest PC loss yet,” and the decision has led analysts to predict that Linux-powered PCs will grow 25%-30% in 2004, and that Linux will account for 6% of desktop operating system shipments by 2007.31 Bergen, Norway’s second city, has followed in the footsteps of Munich in choosing Linux to underpin its technology infrastructure, moving away from proprietary UNIX and Microsoft Windows platforms and applications.

5.1.2 Humanities Institutions

Like public sector institutions, the cultural heritage sector has a vested interest in both the financial cost and the openness and accessibility of the software it uses. Many institutions have discovered that open source solutions offer advantages in meeting their requirements. A prominent example is the National Library of Australia, which now deploys a range of open source applications across its server space, reflecting a willingness to invest in the skills of the library and a commitment to standardisation in general.32 The Library’s Director of IT Business Systems, Mark Corbould, described how the institution has hesitated to replace its 700 proprietary workstations with open source, however, since “Windows is so entrenched in the desktop space that it would take a nuclear war to remove it.”33 This seems to be an argument in favour of change sooner rather than later, and could be read as a firm assertion of the difficulties posed to effective digital curation by the current proprietary configuration.

30 http://www.vmware.com/ [Accessed: 7 April 2005, 11:53]

31 June 2004, "Munich Linux Decision Final", DesktopLinux.com, http://www.desktoplinux.com/news/NS7137390752.html [Accessed: 7 April 2005, 11:55].

32 http://www.nla.gov.au/ [Accessed 7 April 2005, 11:56].

5.1.3 Science

With bleeding edge innovation evident throughout every scientific discipline, it is unsurprising that software developed within the open source model has been embraced wholeheartedly by the science community. Significant institutions and organisations such as NASA have displayed a commitment to distributing their endeavours under open source licenses, with the NASA Open Source Agreement34 conceived as a license determining legal usage for a range of applications. Relevant projects include artificial intelligence software systems (Livingstone2), dynamic 3-D world environments (World Wind), a simulation toolkit for planetary exploration vehicles (the Mission Simulation Toolkit) and an evolutionary simulation (JavaGenes). NASA cites four main motivators for their adoption of open source technologies and release habits. Increasing software quality via community peer review, accelerating development via community contributions, maximising the awareness and impact of NASA research and increasing dissemination of NASA software in support of the education mission are together identified as sufficiently worthwhile ends to justify utilising the open source development model.

33 Nadia Cameron, September 2003, "Open Source Bookmarks Australian Heritage", http://www.computerworld.com.au/index.php?id=522130461&fp=16&fpid=0 [Accessed 7 April 2005, 11:56].

34 http://www.opensource.org/licenses/nasa1.3.php [Accessed 7 April 2005, 11:56].

A range of scientific applications is made available under open source licenses by projects like OpenScience, which gathers a diverse selection of applications intended to facilitate the work of various communities. Examples include tools for conducting studies of forensics, acoustics, astronomy, life sciences, nanotechnology and chemistry. In addition, several initiatives exist to promote the open source ethos more generally throughout the sciences. BIOS (Biological Innovation for Open Society) aims to "extend the metaphor and concepts of open source and distributive innovation to biotechnology and other forms of innovation in biology" and to facilitate "the cooperative invention, improvement and sharing of biological technologies".35 Taking a sceptical view about the wide proliferation of restrictive patents within biological sciences, the initiative has identified a requirement for more transparency to encourage knowledge dissemination and the progression of long-term communities of understanding. Other work like Science Commons,36 an off-shoot of Creative Commons,37 carries similar emphases, with its intention to promote innovation through knowledge sharing. In addition to developing open source software and promoting its ideals, the science community has exhibited a consistent enthusiasm for using existing open source tools. A good example is the NASA Acquisition Internet Service, which in 2000 was moved without a hitch to the open source MySQL database, which has continued to provide a robust foundation to this service ever since.38

35 http://www.bios.net/daisy/bios/15 [Accessed: 7 April 2005, 11:57].

36 http://science.creativecommons.org [Accessed: 7 April 2005, 11:57].

37 http://creativecommons.org/ [Accessed: 7 April 2005, 11:57].

38 Paula Shaka Trimble, December 2000, "Open Minds on Open Source", FCW.com, http://www.fcw.com/fcw/articles/2000/1204/pol-nasa-12-04-00.asp [Accessed: 7 April 2005, 11:57].

5.1.4 HE/FE Institutions

Quite apart from the benefits of open source from a digital curation perspective, Richard Stallman expresses a passionate belief that all educational institutions should use free software for several additional reasons. He cites the financial savings, moral influence, and additional learning opportunities that source code availability affords as major motivators.39 Many schools and universities have responded to these and other justifications by implementing open source solutions within their IT environments. Projects such as OSS Watch, funded by the UK’s Joint Information Systems Committee (JISC), offer advice, support, and expertise to higher and further education institutions interested in deploying open source solutions.40 An Insight Special Report entitled ‘Why Europe Needs Free and Open Source Software and Content in Schools’ indicates areas in education in which open source can be deployed. The report concludes that OSS provides “a beneficial way to transfer knowledge and best practice.”41

39 http://www.gnu.org/philosophy/schools.html [Accessed: 7 April 2005, 11:57].

40 http://www.jisc.ac.uk/ [Accessed: 7 April 2005, 11:57]; http://www.oss-watch.ac.uk/ [Accessed: 7 April 2005, 11:57].

41 http://www.eun.org/insight-pdf/special_reports/Why_Europe_needs_foss_Insight_2004.pdf [Accessed: 7 April 2005, 11:57].

A recent study among IT specialists in thirty-seven tertiary education institutions in the UK and Antipodes showed that free and open source software is already in place in 94% of surveyed institutions.42 A number of commonly used Virtual Learning Environment packages (VLEs) are open source, including the popular Moodle,43 designed to facilitate the creation of online courses.44 Teaching and learning materials, which represent some of the most valuable resources generated in Higher and Further Education institutions, can be more effectively managed and maintained within such open source infrastructures.

42 David G. Glance, Jeremy Kerr and Alex Reid, January 2004, "Factors Affecting the Use of Open Source Software in Tertiary Education Institutions", http://www.firstmonday.org/issues/issue9_2/glance/ [Accessed: 7 April 2005, 11:57].

43 http://moodle.org [Accessed: 7 April 2005, 11:57]

44 Open source can also be used for content management in educational institutions. See Paul Conway, December 2003, “Zope at Duke University: Open Source Content Management in a Higher Education Context”, DigiCULT.Info, Issue 6, p. 10, http://www.digicult.info/pages/newsletter.php [Accessed: 7 April 2005, 11:57].

Developers face a range of difficulties in Higher and Further Education institutions where the intellectual property arising from employees’ research activities is often retained by the institution itself. In such cases decisions about redistributing the IPR rest with the owner, and it is vital that employees and researchers familiarise themselves with the terms of their employment to ensure that their participation in open source development is legitimate and acceptable. The copyleft requirement within many open source licenses compels those who adapt, build upon or integrate the licensed code, and then distribute the result, to release it under the same license. Employees should therefore be aware of the implications within their own institution of utilising copylefted open source products.

5.1.5 Commercial Organisations

Having identified the quality of the software and the financial savings available, some corporations have embraced open source with indisputable enthusiasm.45 Prevalent now in a range of often mission-critical applications, open source performs a range of roles, from the generic to the highly specialist, within both small to medium enterprises and multi-national mega-corporations. Notable examples of companies with open source deployments or interests include IBM, Novell, Hewlett Packard and Yahoo! These deployments range from web servers and database infrastructures to large-scale distributed computing projects. Facilitating commercial interactions and the management of corporate information long-term, open source and open standards are likely to continue to perform a vital role within the commercial sector.

45 IBM is a good example. See http://www-136.ibm.com/developerworks/opensource/ [Accessed: 7 April 2005, 11:57] for more details.

5.2 Open Source Applications

Since the early development of the GNU/Linux operating system, the open source software library has grown at an impressive rate. From an initial emphasis on server and software infrastructure code, an increasing number of open source projects now commit their resources and efforts to the development of desktop applications in a range of areas, including general office productivity, multimedia development, sound and video editing and manipulation, scientific analysis and desktop publishing. As wide-ranging as open source software is in terms of functionality, it also varies greatly in terms of maturity, stability and performance. Given the vast array of projects currently at different stages of development, identifying the most valuable, useful or technologically worthwhile applications can be difficult. Similarly, with such a broad range of software, locating a particular application can be intimidating, despite the range of excellent web and repository search tools currently available.

These problems can be addressed using a number of open source resources. Two of the most prominent web-based examples – SourceForge and Freshmeat – serve distinct but similar functions.46 SourceForge provides free hosting and Web space for thousands of individual open source projects, offering centralised search tools, distribution across several worldwide mirrors and a large community of users offering advice, feedback and impressions of software projects. Freshmeat essentially comprises a massive index of “preferably” open source applications and tools for a range of platforms, together with links to each project’s own pages where the software itself can normally be downloaded. Popularity details, ratings and vitality statistics are maintained and presented, offering novice users clear insights into the success and level of use of individual applications. With an application's ubiquity offering one insight into its curatability, it is important to be able to identify just which are the most used applications and software formats.

Simply browsing these impressive resources offers insights into the range of open source tools that exist, as well as the diversity of the application areas that are covered. In the realms of infrastructure, server and development software, several programs are commonly held to be as good as or better than their proprietary alternatives. Open source software like the Linux Operating System, Apache Web Server and BIND DNS Server, combined with a range of open standards, are all integral parts of the Internet, and one cannot overstate the role that the concept of openness played during the Web’s conception. Recent times have seen an increasing number of more mainstream desktop applications move towards this level of excellence. In addition to continuing to refine the usability and functionality of general desktop applications, it is also a priority for the open source world to expand to meet the functional requirements of more specialist users. To this end, a number of excellent applications specifically aimed at the field of digital curation are now available under open source licenses.

46 http://sourceforge.net [Accessed: 7 April 2005, 11:57]; http://freshmeat.net [Accessed: 7 April 2005, 11:57]

A concern that is often voiced is that the open source development model, based as it is upon the concept of collaboration, is likely to result in monolithic software infrastructures, with single choices that represent a compromised community consensus. In many cases there is some truth in this. ‘Forking’ is a term that describes the process of branching the development of source code over two or more separate, perhaps incompatible paths: this is strongly discouraged within the open source community. Many open source advocates will argue that when several individual alternative projects condense into a single unified effort it is a good thing, since although competition within the software industry is good for business, it doesn’t necessarily lead to the development of the best software. Since the open source model pools expertise and doesn’t set programmers the task of usurping one another it can achieve a great deal quite quickly. However, this argument is not quite sufficient to quash these concerns. Diversity is welcome within software to overcome systemic failures when they arise. As in many disciplines, mistakes are made, and spreading the intellectual effort more widely is likely to ensure that any problems encountered are not fatal. Nonetheless, although a great deal of open source work finds itself expressed in just one or two applications within each domain area, there still exists some welcome diversity. For instance, as the examples below illustrate, there are a number of individual and distinct institutional repository implementations currently being developed and released under open source licenses.

5.2.1 The GNU/Linux Operating System

The operating system is the central software program within any computer system, communicating at a low level with the microprocessor and other hardware, and organising the execution and run-time of each installed program. Among the most common and familiar examples of proprietary operating systems within the personal computer market are Microsoft’s Windows and Apple’s OS X. The free software movement would have had no foundations if it had had to rely upon a central proprietary program, and this realisation led to the conception of the GNU (GNU’s Not Unix) project in 1984, to develop a suite of applications that would represent a free operating system. Identifying early on that multi-platform support was a desirable characteristic, Stallman chose to base his system on the widely published core concepts that are shared by Unix computer systems, the only platform at the time to offer a degree of portability. Following success with a number of applications, eventually GNU lacked only a kernel to make it into a complete, fully functional operating system. The kernel represents the heart of the operating system and manages system memory, the file system and disk operations. In a timely coincidence, a young Finnish programmer, Linus Torvalds, was concurrently building his own Unix compatible kernel, Linux, and this was swiftly integrated into the existing GNU code, resulting in what is now known as GNU/Linux.47

Since its initial release in the early 1990s, GNU/Linux has undergone almost constant refinement, and now represents a mature, stable and usable platform, incorporating most of the features of expensive proprietary Unices, and representing a viable solution for both server and desktop deployment.48 A number of companies distribute the system with straightforward installation packages and a variety of fully integrated applications. Some of the most popular ‘distributions’ include Red Hat, SUSE, Mandrake and Debian.49 Each of these can be downloaded for free from its associated web site or purchased on CD or DVD for a small sum, fully packaged and documented. Most distributions also offer corporate packages, with full support structures more akin to commercial proprietary systems.

47 Although the entire operating system is often referred to as “Linux,” Torvalds’s contribution represents only a part (albeit a very significant one) of the overall system. Open source advocates tend to favour the abbreviated terminology, probably because it is more concise and catchy, which helps in its promotion.

48 Unices is the plural of Unix.

Along with hardware, the operating system represents one of the most significant environmental factors in determining the operability of digital objects. Establishing an understanding of the systems that are required to interpret the information encoded within our data streams is essential to facilitate our digital curation endeavours, and GNU/Linux offers the opportunity for anyone to do so.

5.2.2 Emulation Applications for Open Source

Many information tasks that can be undertaken using proprietary tools can also be achieved with open source. If an appropriate application is not available however, the WINE package (Wine Is Not an Emulator) is a “Windows Compatibility Layer” for Linux/Unix which can be used to install and run many Windows applications.50 However, the shortcomings of this project offer some insights into the problems posed when attempting to recreate proprietary, unpublished software infrastructures. Despite the WINE project’s vintage51 it still suffers from instability problems and lacks support for the full range of Windows applications. Legal and technological impediments associated with Windows’ proprietary nature have been significant obstacles to the WINE project’s success. In the event of WINE offering insufficient levels of performance or reliability, commercial Linux applications like VMware allow an alternative operating system to be installed within Linux, and run as an internal application.52 This means that full software support and performance is retained, although VMware is not available under a free license, and any installed operating systems must also be licensed. Another proprietary alternative is to purchase Codeweaver’s Crossover Office application, which builds upon WINE technology and offers full, robust and supported Linux compatibility for a very small range of Windows applications, including Microsoft Office, Adobe Photoshop and Lotus Notes, negating the need to purchase an additional Windows license.

For those wishing to run Linux or other Unix applications within a Windows environment Cygwin offers a "Linux-like environment for Windows", consisting of a Linux API layer and a selection of tools to provide Linux look and feel.53

49 http://www.redhat.com/ [Accessed: 7 April 2005, 11:57]; http://www.SUSE.com/, [Accessed: 7 April 2005, 11:57]; http://www.mandrakesoft.com/ [Accessed: 7 April 2005, 11:57]; http://www.debian.org/ [Accessed: 7 April 2005, 11:57].

50 http://www.winehq.org/ [Accessed: 7 April 2005, 12:10].

51 The WINE project’s origins can be traced to June 1993.

52 http://www.vmware.com/support/linux/ [Accessed: 7 April 2005, 12:10].

5.2.3 Server and Development

The success enjoyed by the open source software movement can be directly attributed to a number of infrastructure, server and development technologies that have been wholeheartedly embraced by the technological community. It is in this area that open source has traditionally been most prominent, and within this domain, open source products are well established. Since most of these tools are offered at no cost and offer levels of performance, reliability and security comparable with proprietary alternatives, they appeal to many enterprises, organisations and institutions. Market share is considered in more depth in the quantitative section below.

5.2.4 The Apache Web Server

Alongside GNU/Linux, the Apache Web Server project represents one of the most prominent success stories of the open source movement.54 A web server is an application used to make World Wide Web resources available. In April 2005, Apache had a 69% market share of all those on the Web.55 Straightforwardly configurable, well-documented, secure and available for a wide range of platforms, Apache pushes into second place its closest rival, Microsoft’s Internet Information Server. Apache can be modified to suit particular deployments, allowing system administrators to customise the services they offer, effectively creating new web servers based on the Apache model. For digital curators the transparency offered by Apache is welcome, given the vast numbers of web services and web-deployed applications currently in use that are closely integrated with the web server. An understanding of these applications can only be obtained in many circumstances by understanding the software infrastructure that facilitates their delivery.

53 http://cygwin.com [Accessed: 13 July 2005, 11:49]

54 http://www.apache.org [Accessed: 7 April 2005, 12:10].

55 http://news.netcraft.com/archives/web_server_survey.html [Accessed: 7 April 2005, 12:10].

5.2.5 Databases

Databases are central to most of our interactions with digital technologies, offering storage opportunities as structured as individual applications require. Several open source packages are available. Three particular examples enjoy great prominence, with their own respective individual strengths, each offering a maturity and depth of functionality elevating them above many proprietary packages. MySQL is perhaps the best known, and compares extremely favourably with most proprietary equivalents, particularly in terms of speed and stability.56 It is particularly valuable when deployed on the Web due to its quick handling of multiple connections. PostgreSQL and Firebird are other notable examples, and tend to be regarded as more functionally complete than MySQL.57 None matches the heavyweight functionality offered by the leading proprietary database (Oracle), but unless an application has special requirements PostgreSQL in particular is likely to incorporate most if not all of the necessary features.58 All three packages run natively on a range of platforms including Linux and Windows. MySQL’s significantly larger user base accounts for its more comprehensive documentation and help structures, as well as its increased stability.59 Prominent MySQL users include Google, Cisco, Sabre Holdings, Hewlett Packard, NASA and Yahoo!60

56 http://www.mysql.com [Accessed: 7 April 2005, 12:10].

57 http://www.postgresql.org/ [Accessed: 7 April 2005, 12:10]; http://firebird.sourceforge.net/ [Accessed: 7 April 2005, 12:10].

58 http://www.oracle.com [Accessed: 7 April 2005, 12:10].

59 A fuller comparison of the relative merits of each can be found in "PostGreSQL or MySQL?", http://www-css.fnal.gov/dsg/external/freeware/pgsql-vs-mysql.html [Accessed: 7 April 2005, 12:10] and Ian Gilfillan, December 2003, "PostgreSQL vs MySQL: Which is better?", DatabaseJournal.com http://www.databasejournal.com/features/postgresql/article.php/3288951 [Accessed: 7 April 2005, 12:10].

60 http://www.mysql.com/customers [Accessed: 7 April 2005, 12:10].

5.2.6 The GRID

Of interest to many working in the scientific data disciplines is the use of GRID computing, which uses the processing power of several computers connected via a network to solve complex and large-scale computational problems. CERN61 describes the GRID as "a service for sharing computer power and data storage capacity over the Internet". The responsibility for defining specifications for grid computing is held by the Global Grid Forum (GGF),62 and these are implemented through the Globus Toolkit by the Globus Alliance,63 a group of individuals and organisations developing fundamental technologies behind the GRID. The Globus Toolkit is an open source toolkit used for building Grid systems and applications. A growing number of projects and companies are using this package to unlock the potential of grids for their own specific purposes. It has become the de facto standard for grid middleware and provides a standard platform for services to build upon. As they did during the development of TCP/IP, open source tools are playing a fundamental role in the development within this area, facilitating its growth and the creation of new tools and application possibilities. Similarly, open standards and collaboration are intrinsic characteristics of and fundamental requirements for the GRID.

61 Conseil Européen pour la Recherche Nucléaire (European Laboratory for Particle Physics)

62 http://www.gridforum.org, [Accessed: 11 July 2005, 15:43]

63 http://www.globus.org, [Accessed: 11 July 2005, 15:48]

5.2.7 Programming Languages

Part of the GNU/Linux operating system, the GNU C Compiler (GCC) is an open source implementation of a C language compiler. Modules are also available to add support for a range of additional languages such as C++. Furthermore, a number of other languages operate under open source licenses, and several have been relied upon consistently in a range of computing areas. Three of the most prominent are the scripting languages PHP (PHP Hypertext Preprocessor), Perl (Practical Extraction and Report Language), and Python.64

Because all three are traditionally interpreted languages (that is, they need not be compiled prior to execution), when it is made available their code is usually in a human-readable form. This also facilitates multi-platform interoperability. PHP is most widely used in the development of dynamic Web pages. Popular among Web developers due to its fast parsing and flexibility, PHP is also versatile and comes with many built-in and modular interfaces. Database connectivity is straightforward; while PHP is most commonly associated with MySQL it can connect to any ODBC-enabled database. Perl offers similar functionality to PHP, but with more general deployments traditionally. Perl is frequently used to add dynamic functionality to Web pages, but it is also used to handle a range of other tasks involved in system administration and data processing. Perl’s rich support for regular expressions also makes it useful as a text manipulation language. Like both Perl and PHP, Python is frequently used in a Web environment. Combining powerful capabilities with a simple syntax, Python also has interfaces to numerous system calls and libraries. Additional modules can be developed using C or C++, extending the functionality to suit individual requirements. All three programming languages are portable, supporting a range of platforms including Linux, Windows, Mac, and OS/2. By developing in an open source environment, even if languages should fall into decline or obsolescence, existing code will be able to be re-purposed for a future system configuration and executed with mitigated risk of loss.65

64 http://www.php.net [Accessed: 7 April 2005, 12:10]; http://www.perl.com [Accessed: 7 April 2005, 12:10]; http://www.python.org/ [Accessed: 7 April 2005, 12:10].

5.2.8 Others

To further describe the various open source software packages that have established major footholds within the Internet infrastructure would take a great deal of space; suffice to say that OSS applications exist for almost every subject area, from eGovernment to gaming. Further examples of popular and successful tools include Sendmail, the world’s most used email server, Bind, the most commonly used domain name system (DNS) server, and Apache Jakarta Tomcat, one of the most popular Java Servlet and Java Server Pages containers in use on the Web, providing an infrastructure for the delivery of Web-based Java programs.66

65 Sun Microsystems are currently involved in a public debate over the merits of opening up their currently closed-source Java programming language. Developments here will be well worth watching, given Java’s prominence as a platform-independent, web-friendly language.

5.2.9 Desktop and Productivity

While its traditional arena of dominance since the early 1990s has been in servers and network infrastructure, the recent upsurge in popularity of open source has led to the development of a number of mature and functionally rich desktop applications that measure up well against their proprietary peers. Although notable gulfs continue to exist in some areas, several key open source desktop applications have introduced innovative and practically useful features that have been subsequently adopted into commercial proprietary applications.

5.2.10 OpenOffice.org

With the numerous problems associated with the proprietary and hidden nature of Microsoft’s Office formats, institutions should be extremely wary of regarding Office-encoded data as curated. OpenOffice is a large-scale project, backed by Sun Microsystems, to develop a comprehensive and transparent suite of tools, including word processor, spreadsheet and presentation software, that writes to open standard formats defined in XML.67 With its current version (1.1.4), the project has enjoyed success. ‘Read’ and ‘write’ support for Microsoft Office formats is included, but since these are not publicly documented, errors are occasionally encountered, particularly when dealing with complex structures such as tables. Objects such as images, plugins, videos and charts can be embedded as straightforwardly as with Microsoft tools. There is no support for Visual Basic for Applications (VBA) macros, since this is a proprietary Microsoft technology, but scripting is possible using the integrated StarBasic syntax. The imminent new release of OpenOffice will include support for the OASIS68 OpenDocument format, which is likely to be adopted by the European Commission as the recommended format for document interchange within the European public sector.

An eWeek survey comparing OpenOffice 1.1 and Microsoft’s Office 2003 illustrates the relative merits of the two application suites.69 The general consensus is that while OpenOffice represents a good free package, with several unique features such as built-in PDF-writing support and a user interface that integrates each of the individual applications, it lacks the polish and some of the more advanced functionality of the latest version of Office. OpenOffice 1.1 is regarded as functionally comparable to Microsoft’s Office 97, although the open source product is thought to offer several additional features and a greater level of reliability. Many ordinary users are unlikely to have requirements that extend beyond the features that are included. However, when Jack Wallen Jr. of ZDNet Australia writes “if you can do it in Microsoft Office, you can do it in OpenOffice.org… for free,” it should be borne in mind that OpenOffice still has some way to go before matching all of the functionality of Microsoft’s flagship application.70 That it exceeds Microsoft’s efforts in terms of implementing a system for the creation, editing and rendering of preservable documents is, however, unquestionable.

66 http://www.sendmail.org/ [Accessed: 7 April 2005, 12:10]; http://www.isc.org/index.pl?/sw/Bind/ [Accessed: 7 April 2005, 12:10]; http://jakarta.apache.org/tomcat/ [Accessed: 7 April 2005, 12:10].

67 http://www.openoffice.org [Accessed: 7 April 2005, 12:13].

68 Organisation for the Advancement of Structured Information Standards, http://www.oasis-open.org/home/index.php, [Accessed: 7 July 2005, 14:39]

69 Jason Brooks, April 2004, "Office 2003 vs. Openoffice.org", Eweek.com, http://www.eweek.com/article2/0,1759,1571626,00.asp [Accessed: 7 April 2005, 12:10].

5.2.11 The Mozilla Project

Comprising a number of individual programs, the Mozilla project represents one of the most successful open source desktop application projects, offering a level of maturity, functionality and innovation that matches and surpasses much equivalent proprietary software.71 At its forefront is the Mozilla package, which offers Web browsing, email, newsgroup and IRC access and a Web development application within a single, customisable interface. It is particularly in the area of security that Mozilla outshines Microsoft’s frequently vulnerable Internet Explorer and Outlook Express, but it also offers greater compliance to web standards and improved functionality, with a built-in pop-up blocker and integrated search facility that, unless patched with Microsoft’s Windows XP Service Pack 2, Internet Explorer does not offer.

The Mozilla project has also developed a range of sister products. Mozilla Firefox is a stripped down, lightweight Web browser with support for the addition of separate modules of functionality, or extensions.72 It aims to be fully customisable, with optional features that introduce exciting navigational and development possibilities. Thunderbird is another project, essentially promising the same things for email as Firefox offers for the Web.73 Lightweight and secure, it supports all major protocols and can be fully customised to suit environment or user preferences. All of these Mozilla tools are available for Windows, Linux, and Mac OS X, emphasising the interoperability of these solutions.

The Mozilla project has inspired and enabled the development of a number of other applications, including the defect-tracking system Bugzilla, which facilitates the reporting of errors from a wide number of applications, ensuring their continued development and refinement.74

70 http://www.zdnet.com.au/insight/0,39023731,20270300,00.htm [Accessed: 7 April 2005, 12:10].

71 http://www.mozilla.org/ [Accessed: 7 April 2005, 12:10].

72 http://www.mozilla.org/products/firefox/ [Accessed: 7 April 2005, 12:10].

73 http://www.mozilla.org/projects/thunderbird/ [Accessed: 7 April 2005, 12:15].

5.2.12 Specific Open Source Applications for Digital Curation

The increasing maturity of open source software has led to the development of a range of tools designed for the achievement of specific, specialist goals. A number of factors make the open source model particularly suitable for the digital curation community, and this has led to concentrated development in this area. The requirements for openness and ‘future-proofing’ within software and the often physically distributed nature of organisations and projects are issues in which open source can be profoundly beneficial.

The following sections detail a number of prominent open source applications aimed specifically and indirectly at meeting digital curation requirements.

(a) Fedora Digital Object Repository Management System

The Fedora project (not to be confused with Red Hat’s Fedora Linux distribution), originally developed by the Digital Library Research Group at Cornell University, is one of several digital object repository architectures that have been proposed in recent years. The Fedora structure is based on object models that each form the template for individual units of content, called data objects. These can contain various digital content, associated metadata and references to representation information. Behaviour objects describe tools and services that can be used by the repository to provide access to the data objects. The system has a three-layered architecture. The Web Services Exposure layer defines interfaces for administration and access, the Core Subsystem layer implements their subsystems and the Storage Layer implements the storage subsystem that handles reading, writing, and removal of data from the repository. Digital Objects are stored as XML files corresponding to an extension of the Metadata Encoding and Transmission Standard (METS). Among Fedora’s “noteworthy” features identified by D-Lib Magazine75 are its XML submission and storage, its access control and authentication, searching, OAI-PMH76 Metadata harvesting and a batch utility supporting the mass creation and loading of data objects. A recent paper entitled “Fedora: An Architecture for Complex Objects and Their Relationships”77 describes further notable qualities, in particular the fact that the software is implemented as a set of Web Services and that its full functionality is exposed through a series of well defined web service APIs. The D-Lib article describes four use case scenarios for Fedora, illustrating its usefulness. From the first “low barrier to entry” scenario to a full fledged digital library or repository for distributed objects, Fedora is sufficiently flexible to meet the requirements of many institutions. Undoubtedly innovative, and functionally rich, Fedora offers a good illustration of what open source development can achieve.

74 http://bugzilla.mozilla.org/ [Accessed: 7 April 2005, 12:15].

75 Thornton Staples, April 2003, "The Fedora Project: An Open-source Digital Object Repository Management System", D-Lib Magazine, Volume 9, Number 4, http://www.dlib.org/dlib/april03/staples/04staples.html [Accessed: 7 April 2005, 12:15].

76 Open Archives Initiative Protocol for Metadata Harvesting, http://www.openarchives.org/OAI/openarchivesprotocol.html [Accessed: 7 July 2005, 14:53]

77 Carl Lagoze, Sandy Payette, Edwin Shin, Chris Wilper, (rev v.4, March 2005), "Fedora: An Architecture for Complex Objects and Their Relationships", http://www.arxiv.org/pdf/cs.DL/0501012 [Accessed: 7 April 2005, 12:15].

78 Clifford A. Lynch, February 2003, "Institutional Repositories: Essential Infrastructure for Scholarship in the Digital Age" ARL, no. 226: 1-7, http://www.arl.org/newsltr/226/ir.html [Accessed: 7 April 2005, 12:15].

(b) DSpace

The DSpace Institutional Repository System, developed jointly by MIT and Hewlett Packard, offers capture, storage, indexing, preservation and redistribution functionality. It aims to satisfy the definition of an Institutional Repository offered by Clifford Lynch, as “an organisational commitment to the stewardship of digital materials, including long-term preservation where appropriate, as well as organisation and access or distribution”.78 DSpace can be configured to accept a diversity of digital content ranging from documents, books and theses to data sets, computer programs, visual simulations and models. These are organised according to the groups that contribute content, called ‘communities’, and ‘collections’, which house individual content items and files. The primary goal of DSpace is digital preservation, to provide long-term physical storage and management of materials in a secure and effectively administered environment. Persistent identifiers are allocated to each stored item in the interests of ensuring their longevity, with preservation conducted in both bit and functional terms. The latter is achieved using emulation or migration strategies for supported open formats and by relying on the third party tools that are expected to emerge for popular proprietary formats. It is conceded that for unknown or one-off proprietary data functional preservation is difficult, but by preserving the bit-stream too it is hoped that future digital archaeologists will at least have the opportunity to retrieve and reproduce information. Ancillary DSpace functionality allows the implementation of access controls, versioning and search and retrieval based on Dublin Core metadata which can be applied to each submitted object. DSpace is promoted as a flexible solution, equipped to effectively handle the diversity of materials and expectations implicit within a multi-disciplinary archive, mainly through its use of communities in the organisation of its information. Built-in Java APIs allow the interoperation of stored content with other systems that an institution may maintain.
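As a rough illustration of search and retrieval based on Dublin Core metadata, the following Python sketch attaches a few Dublin Core elements to a submitted item and tests them against a search term. It is not DSpace code and does not use DSpace's Java APIs: the `record` wrapper element, the function names and the sample values are assumptions made for the example. Only the Dublin Core element namespace is taken from the published standard.

```python
import xml.etree.ElementTree as ET

# Namespace of the Dublin Core Metadata Element Set, version 1.1.
DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)


def dublin_core_record(fields: dict[str, str]) -> ET.Element:
    """Build a metadata record from Dublin Core element names and values."""
    record = ET.Element("record")  # hypothetical wrapper element
    for name, value in fields.items():
        ET.SubElement(record, f"{{{DC_NS}}}{name}").text = value
    return record


def matches(record: ET.Element, element: str, term: str) -> bool:
    """Case-insensitive substring search within one Dublin Core element."""
    return any(
        term.lower() in (node.text or "").lower()
        for node in record.iter(f"{{{DC_NS}}}{element}")
    )


# Sample values are invented for illustration.
item = dublin_core_record(
    {"title": "Survey of Coastal Erosion", "creator": "A. Researcher", "date": "2005"}
)
print(ET.tostring(item, encoding="unicode"))
print(matches(item, "title", "coastal"))
```

Because every item carries the same small set of element names, the same search function can be applied across a multi-disciplinary archive regardless of the format of the stored content.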

(c) FreeBXML

ebXML (Electronic Business using eXtensible Markup Language)79 is a suite of specifications that facilitate the exchange of business information over the Internet by organisations of any size. Using these specifications it is possible for organisations to exchange messages, trade, communicate in common terminology and define and register processes relevant to their business. Started in 1999 by OASIS and the UN/ECE agency CEFACT it was based upon five layers of substantive data specification, which are realised in XML standards for business processes, core data components, collaboration protocol agreements, messaging and registries and repositories.80 freebXML is an initiative aiming to promote the ebXML specifications through the sharing of software, expertise and experience. Its web site (http://www.freebxml.org) offers centralised access to relevant code and applications as well as a forum for discussion about developments and deployments using ebXML. Among the most relevant programs available from the web site under open source licenses are freebXML CC, a set of tools developed to facilitate the management of data dictionaries and freebXML Registry, built upon an extensible information model, allowing the storage of any kinds of data in its repository, with arbitrary associations created between registry entries.

79 http://www.ebxml.org/.

80 For further information see Brian Gibb, Suresh Damodaran, 2002, "ebXML : Concepts and Application", Wiley, ISBN: 076454960X, or Alan Kotok and David R.R. Webber, 2001, "ebXML: The New Global Standard", New Riders, ISBN: 0735711178.

(d) JHOVE

The result of collaboration between JSTOR81 and Harvard University Library, the JSTOR/Harvard Object Validation Environment82 aims to provide functions to perform format-specific identification, validation and characterisation of digital objects. In essence, this offers the opportunity to automatically determine a particular object’s format if unknown, to confirm or deny the validity of a purported example of a particular format by assessing whether it meets syntactic and semantic requirements of that format and to identify the intrinsic properties of a particular object based on its format. Standard format modules are distributed with the system and include AIFF, ASCII, GIF, HTML, JPEG, PDF, TIFF and XML.
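The difference between identification and validation can be shown with a deliberately small Python sketch. It is not JHOVE, which is a Java application with its own format modules; the signature table, function names and the very shallow GIF check below are assumptions chosen for the example, and real validation examines far more of a file's structure.

```python
# Toy signature table: leading bytes that commonly open files of each format.
# This is an illustrative subset, not JHOVE's module list.
SIGNATURES = [
    ("GIF", (b"GIF87a", b"GIF89a")),
    ("JPEG", (b"\xff\xd8\xff",)),
    ("PDF", (b"%PDF-",)),
    ("TIFF", (b"II*\x00", b"MM\x00*")),
]


def identify(data: bytes) -> str:
    """Identification: guess the format from the opening bytes."""
    for name, prefixes in SIGNATURES:
        if data.startswith(prefixes):
            return name
    return "unknown"


def is_plausible_gif(data: bytes) -> bool:
    """Validation (shallow): a GIF needs its 13-byte header block and
    should end with the trailer byte 0x3B."""
    return identify(data) == "GIF" and len(data) >= 13 and data.endswith(b"\x3b")


sample = b"GIF89a" + b"\x00" * 7 + b"\x3b"
print(identify(sample), is_plausible_gif(sample))
print(identify(b"GIF89a"), is_plausible_gif(b"GIF89a"))
```

The second call shows why the two steps are kept separate: a truncated file can still be identified as a GIF by its signature while failing even a minimal check of the format's requirements.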

81 The Scholarly Journal Archive, http://www.jstor.org [Accessed: 25 July 2005, 11:43]

82 http://hul.harvard.edu/jhove/ [Accessed: 25 July 2005, 11:43]

83 http://lockss.stanford.edu/ [Accessed: 25 July 2005, 11:43]

(e) LOCKSS

LOCKSS stands for “Lots of Copies Keep Stuff Safe”83 and is an open source peer-to-peer application with its core raison d’être to offer persistent access to preserved digital materials. Initiated by Stanford University Libraries, LOCKSS runs on standard desktop workstations and offers librarians and information managers the opportunity to create low-cost, persistent and accessible copies of digital content as it is published. A secure peer-to-peer polling and reputation system ensures that the integrity and accuracy of LOCKSS materials are maintained.
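The idea behind polling can be sketched in a few lines of Python, on the assumption that each peer holds a copy of the same item and that agreement is judged by comparing content digests. This is a simplification for illustration only: it is not the LOCKSS protocol, it ignores the reputation system and the security measures, and the peer names and function names are invented.

```python
import hashlib
from collections import Counter


def digest(content: bytes) -> str:
    """SHA-256 digest used as a compact fingerprint of a copy."""
    return hashlib.sha256(content).hexdigest()


def poll(copies: dict[str, bytes]) -> tuple[str, list[str]]:
    """Return the digest held by most peers and the peers that disagree.

    Ties are resolved in favour of the digest seen first, which a real
    system would need to handle far more carefully.
    """
    votes = {peer: digest(content) for peer, content in copies.items()}
    winner, _ = Counter(votes.values()).most_common(1)[0]
    dissenters = sorted(peer for peer, d in votes.items() if d != winner)
    return winner, dissenters


# Hypothetical peers; library_c holds a damaged copy.
copies = {
    "library_a": b"article text",
    "library_b": b"article text",
    "library_c": b"article t3xt",
}
agreed, damaged = poll(copies)
print(damaged)
```

A peer listed as dissenting could then repair its copy from one of the peers in the majority, which is how holding many copies helps to keep content intact.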

(f) Xena

The National Archives of Australia originally developed the XML Electronic Normalising of Archives84 project to meet the challenges posed by preserving electronic records into the future in a constantly changing hardware and software culture. The application aims to resolve these concerns by converting electronic records in proprietary formats to a standardised XML format that can be read by future technology. Xena’s current version supports a range of formats that can be converted with no information loss to the standard XML. These include Microsoft’s Word, Excel and Powerpoint, the OpenOffice.org suite of formats, RTF, Relational database files, JPG, GIF, TIFF, PNG and BMP image files, HTML and plain text. Furthermore, with its plug-in based architecture, Xena can conceivably be extended to support any other formats.85
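One simple way to see how a binary record can be carried in XML with no information loss is to encode its bytes as base64 text inside an XML element and then decode them again. The Python sketch below does only that. It is not Xena's normalisation process or output schema, neither of which is described here; the element and attribute names (`record`, `content`, `source-name`) are hypothetical.

```python
import base64
import xml.etree.ElementTree as ET


def normalise(name: str, data: bytes) -> str:
    """Wrap arbitrary bytes in a small XML document (hypothetical schema)."""
    root = ET.Element("record", {"source-name": name})
    payload = ET.SubElement(root, "content", {"encoding": "base64"})
    payload.text = base64.b64encode(data).decode("ascii")
    return ET.tostring(root, encoding="unicode")


def recover(xml_text: str) -> bytes:
    """Recover the original bytes from the XML wrapper."""
    root = ET.fromstring(xml_text)
    return base64.b64decode(root.find("content").text or "")


original = bytes(range(256))
wrapped = normalise("report.doc", original)
print(recover(wrapped) == original)
```

The round trip returning exactly the original bytes is what "no information loss" means at the bit level; making the content itself readable by future technology additionally requires format-specific conversion, which this sketch does not attempt.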

84 http://xena.sourceforge.net/ [Accessed: 25 July 2005, 11:43]

85 For an example of an open source tool which ‘normalises’ file formats for preservation see Adam Rusbridge, April 2004, “XENA: Electronic Normalising Tool”, DigiCULT.Info, Issue 7, page 32, http://www.digicult.info/pages/newsletter.php [Accessed: 7 April 2005, 11:49].

86 Paul Wheatley, 2004, “Institutional Repositories in the context of Digital Preservation” DPC Technology Watch Series Report 04-02.

5.2.13 Other Institutional Repository Implementations

Developed at the University of Southampton, GNU EPrints facilitates the creation of online archives, with the default configuration a repository of the research output of an academic institution. EPrints servers are designed to help dissemination of research publications by sharing associated metadata using Open Archives Initiative (OAI) standards. Further open source alternatives include MyCoRe, developed in Germany, the Dutch ARNO and also CERN Document Server Software. According to a recent DPC Technology Watch Report by Paul Wheatley into “Institutional Repositories in the Context of Digital Preservation”86 none of these four solutions cites digital preservation as a key aim.

With the range of repository solutions available there is an increasing interest in how multiple repositories might cooperate within a global curation network. The transparent nature of open source facilitates the implementation of systemic connections offering a range of potential benefits. Cooperation on the selection of content and optimisation of technical infrastructures can take place, and duplication of effort can be minimised.
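The OAI standards referred to above include the Protocol for Metadata Harvesting (OAI-PMH), in which a harvester sends an HTTP request containing a `verb` and a small number of arguments to a repository's base URL. The Python sketch below only constructs such a request URL. The `verb`, `metadataPrefix`, `from` and `set` arguments and the `oai_dc` prefix come from the protocol specification; the base URL and function name are placeholders assumed for the example, and no request is actually sent.

```python
from urllib.parse import urlencode


def list_records_url(base_url, metadata_prefix="oai_dc", from_date=None, set_spec=None):
    """Build an OAI-PMH ListRecords request URL for a repository."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if from_date:
        params["from"] = from_date  # harvest only records changed since this date
    if set_spec:
        params["set"] = set_spec  # restrict the harvest to one set
    return base_url + "?" + urlencode(params)


# The base URL is a placeholder, not a real repository.
print(list_records_url("http://repository.example.org/oai", from_date="2005-01-01"))
```

Because every compliant repository answers the same requests, a single harvester can collect metadata from EPrints, DSpace or Fedora installations alike, which is one practical basis for the cooperation between repositories discussed above.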

Open source carries other advantages in the context of digital resource registries and repositories. For instance, Representation information registries can benefit from its flexible legal status by offering direct access to rendering, management and conversion applications that are distributed under open source licensing agreements.


6 Quantitative Issues

6.1 Financial Costs of Open Source Software

It is important to consider financial implications in terms of total cost of ownership (TCO), particularly when software is distributed free of charge. However, since a definitive list of the cost factors that should be taken into account in a TCO study has yet to be settled upon, a number of conflicting accounts exist. It is no doubt possible to identify a persuasive TCO study in favour of most software configurations, but the actual figures will always depend on a specific combination of environment and requirements. An accurate picture can only be drawn following consideration of all the relevant individual cost elements, from software and hardware purchases to administration, and from technical support to staff training.
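The point that a comparison is only meaningful once every cost element has been counted can be made concrete with a small Python sketch. The five elements are those named in the paragraph above; the class and function names and all of the figures are hypothetical and are not drawn from any study.

```python
from dataclasses import dataclass, fields


@dataclass
class CostElements:
    """Cost elements named in the text; all amounts in one currency."""
    software: float
    hardware: float
    administration: float
    technical_support: float
    staff_training: float

    def total(self) -> float:
        """Total cost of ownership: the sum of every element."""
        return sum(getattr(self, f.name) for f in fields(self))


def cheaper_option(options: dict[str, CostElements]) -> str:
    """Name of the option with the lowest total cost."""
    return min(options, key=lambda name: options[name].total())


# Hypothetical figures, chosen only to show that a zero purchase price
# does not settle the comparison on its own.
options = {
    "proprietary": CostElements(20000, 10000, 15000, 5000, 2000),
    "open source": CostElements(0, 10000, 18000, 8000, 6000),
}
for name, costs in options.items():
    print(name, costs.total())
print(cheaper_option(options))
```

Changing any one of the assumed figures can reverse the outcome, which is why studies that weight the elements differently reach conflicting conclusions.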

As well as ambiguities across surveys, it is also clear that no one has yet conducted any formal study of the relative cost implications of conducting a digital curation strategy with open source or proprietary tools. This is true both when considering the relative costs of curating digital resources that are open source or proprietary and when considering the use of open source tools to conduct our digital curation activities. While the use of open tools and software formats seems likely to prolong the life of our digital assets, it is hard to present the cost implications of this hypothesis in quantitative terms. Instead we must rely on the figures that do exist, which describe the relative cost implications of using open source and proprietary tools in more general ways. Inevitably there will be unique implications relating to the cost of digital curation, but these are difficult to assess at this time.

6.2 Software Acquisition and Upgrade Costs

The initial acquisition cost of open source software will usually be less than that of any proprietary alternative. Of course, it need not be free (i.e. gratis) under the terms of its license, and additional costs may be incurred for documentation, storage media, and support contracts. Taking these factors into account, a 2001 study by Cybersource Consulting found the following acquisition cost results, illustrating the scalability of an open source solution over three increasingly large installation environments:87

87 Cybersource, 2004, "Linux vs. Windows Total Cost of Ownership Comparison", http://www.cyber.com.au/cyber/about/linux_vs_windows_tco_comparison.pdf [Accessed: 7 April 2005, 12:18].


             Microsoft              GNU/Linux OS    Savings
50 users     $69,987 (€56,615)      $80 (€65)       $69,907 (€56,550)
100 users    $136,734 (€110,609)    $80 (€65)       $136,654 (€110,544)
250 users    $282,974 (€228,907)    $80 (€65)       $282,894 (€228,842)

The reason why open source software scales so well is that it is only necessary to purchase or obtain a single license, which covers unlimited subsequent installations. Proprietary software, on the other hand, is typically licensed on a per-installation or per-user basis. This is worth bearing in mind if the intention is to deploy a large number of workstations. Network World Fusion News reported in 2001 that a major part of the reason for an increase in Linux’s deployment in finance, healthcare, banking and retail was its scalability in cost and technical terms when large numbers of identical sites and servers are needed. The journal calculated that for a 2,000-site deployment SCO UnixWare would cost $9 million (€7.3m), Windows $8m (€6.5m), and Red Hat Linux just $180 (€146).88
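The scaling effect can be checked with simple arithmetic. The following Python sketch is an illustration only: the dollar figures are those quoted from the Cybersource comparison, while the function and variable names are hypothetical and introduced here for convenience. It recomputes the savings for each installation size and the acquisition cost per user under each model.

```python
# Acquisition costs in US dollars, as quoted from the Cybersource comparison.
# Keys are the number of users; the dictionary layout is an assumption of this sketch.
ACQUISITION_COSTS = {
    50: {"microsoft": 69_987, "gnu_linux": 80},
    100: {"microsoft": 136_734, "gnu_linux": 80},
    250: {"microsoft": 282_974, "gnu_linux": 80},
}


def savings(proprietary_cost: int, open_source_cost: int) -> int:
    """Return the difference between the two acquisition costs."""
    return proprietary_cost - open_source_cost


def cost_per_user(total_cost: float, users: int) -> float:
    """Return the acquisition cost spread evenly over all users."""
    if users <= 0:
        raise ValueError("users must be positive")
    return total_cost / users


def summarise(costs=ACQUISITION_COSTS):
    """Return (users, savings, proprietary per user, open source per user) rows."""
    rows = []
    for users in sorted(costs):
        ms = costs[users]["microsoft"]
        linux = costs[users]["gnu_linux"]
        rows.append((users, savings(ms, linux),
                     cost_per_user(ms, users), cost_per_user(linux, users)))
    return rows


if __name__ == "__main__":
    for users, saved, ms_each, linux_each in summarise():
        print(f"{users:>4} users: savings ${saved:,}, "
              f"per user ${ms_each:,.2f} vs ${linux_each:,.2f}")
```

Because the single-license cost stays fixed, its per-user share shrinks as the installation grows, whereas the per-user proprietary cost remains in the same order of magnitude throughout.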

88 Deni Connor, March 2001, "Linux Slips Slowly into the Enterprise Realm", http://www.nwfusion.com/news/2001/0319specialfocus.html [Accessed: 7 April 2005, 12:15].

Upgrade costs also compare favourably for open source applications. Proprietary upgrades will typically cost around half the amount of the original application. Users can subsequently find themselves at the mercy of the proprietary companies, who have a monopoly on the distribution of their software. To upgrade an open source application one simply has to download the latest version, or pay the original cost once again, and redeploy across as many machines as required.

6.3 License Management and Litigation Costs

Needless to say, open source software is again favourable in the context of license management and litigation. For users of proprietary software, failure to adhere to the strict terms of software licenses can lead to extremely heavy fines and even custodial sentences. It is therefore in users' interests to manage licenses effectively, undertaking regular software audits and even installing license-tracking software. Under an open source license such procedures, costly in terms of both time and money, are rendered unnecessary. Similarly, the costs involved in migrating to and from open source formats, emulating existing open source software architectures and supplying open source tools to render or convert particular file formats are likely to be dramatically lower than with proprietary alternatives.

6.4 Hardware Costs

As far as hardware is concerned, it is generally acknowledged that open source software, such as the GNU/Linux operating system, can run effectively on a lower specification machine than its Windows equivalent. The latest version of Windows, XP Professional, recommends a 300MHz Intel-compatible processor, 128MB of RAM and a minimum of 1.5GB of hard disk space. A comparable version of Mandrake Linux (version 9.1) requires only a Pentium-class processor, 128MB of RAM and 150MB of disk space. As Linux desktops and user interfaces have become more graphically complex the margin between the two has narrowed. But since a Linux system tends to be much more configurable, it is more straightforward to install only those parts of a system that are really required, saving processor power and disk space.89 Furthermore, unlike existing consumer Windows systems, most Linux distributions are available in both 32-bit and 64-bit Intel-based versions, offering users the opportunity to exploit the advantages of newer, more sophisticated hardware. Further emphasising the interoperability intrinsic to open source, many GNU/Linux distributions are available for non-Intel hardware, such as PowerPC, Sun SPARC and Alpha. The transparency at Linux’s core facilitates porting to a wide range of alternative hardware environments, and efforts are constantly under way to run Linux on devices as varied as Apple’s iPod digital music device90 and Microsoft’s X-Box gaming console.91 With the uncertainty surrounding tomorrow’s computer hardware environments it is comforting to know that such migration remains possible, irrespective of the nature of hardware products and their original intended purposes. Suffice it to say, from a cost perspective the flexibility of Linux mitigates the likelihood of hardware obsolescence, offering users more opportunity to make the most of their existing resources. Furthermore, rumours suggest that future versions of Microsoft Windows will incorporate hardware-tied security, essentially limiting the user's ability to configure the hardware environment on which the software operates.92 This may have problematic consequences for future migration and reuse.

89 This also means that hardware can be used for longer, without the need to upgrade so frequently.

90 http://neuron.com/~jason/ipod.html [Accessed: 7 April 2005, 14:35].

91 http://www.xbox-linux.org/ [Accessed: 7 April 2005, 14:57].

92 http://www.theregister.com/2003/11/03/ms_to_intro_hardwarelinked_security/ [Accessed: 25 July 2005, 11:55].

6.5 Support and Training

For other, less explicit up-front costs it becomes more difficult to find consensus. Technical support and administration is one such area. Microsoft claims that it is more straightforward to find trained administrators and technicians for its platforms, and that they therefore cost less. However, the open source community rebuts this, arguing that with GNU/Linux fewer administrators are required, because it is possible to automate a great deal and the systems are more reliable. Support for open source products may be less available in a formal commercial capacity, but many problems encountered in one’s digital curation efforts can be mitigated by turning to a vast web-based community of users and experts that has demonstrated a regular and enthusiastic commitment to offer assistance.

A further question is that of training. Anecdotal evidence suggests that the costs involved are fairly modest, thanks to the proliferation of modern GUI desktops within Linux systems. It remains to be seen whether this is demonstrably true across the board, although it could be argued that training costs should be no more than those incurred for Windows training. However, retraining experienced Windows users will inevitably be more challenging, and will involve higher associated costs.

6.6 Total Cost of Ownership

The Robert Frances Group’s July 2002 study found that the TCO of GNU/Linux is roughly 40% of that of Windows, and 14% of that of Sun Microsystems’ Solaris.93 The group used actual costs of production deployments of web servers at fourteen Global 2000 enterprises, basing its analysis on software and hardware purchases and on maintenance, upgrade and administrative costs. This study also found that although Windows administrators cost less individually, each Linux or Solaris administrator could cover many more machines, making Windows administration more expensive. It was also revealed that Windows administrators spent twice as much time patching systems and dealing with security issues as the others.

93 David A. Wheeler, 2005, "Why Open Source Software/Free Software (OSS/FS, FLOSS, or FOSS)? Look at the Numbers!", http://www.dwheeler.com/oss_fs_why.html [Accessed: 7 April 2005, 11:30].
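The point that a cheaper administrator can still mean more expensive administration follows from dividing salary by the number of machines each administrator covers. The Python sketch below illustrates this with entirely hypothetical salaries and coverage ratios; the study's own figures are not reproduced here, and the function name is an assumption of this example.

```python
def admin_cost_per_machine(annual_salary: float, machines_per_admin: int) -> float:
    """Return the yearly administration cost attributable to one machine."""
    if machines_per_admin <= 0:
        raise ValueError("machines_per_admin must be positive")
    return annual_salary / machines_per_admin


if __name__ == "__main__":
    # Hypothetical inputs chosen only to show the effect; not taken from any study.
    lower_salary_platform = admin_cost_per_machine(60_000.0, 10)
    higher_salary_platform = admin_cost_per_machine(70_000.0, 40)
    print(f"Lower salary, 10 machines per administrator: "
          f"${lower_salary_platform:,.2f} per machine")
    print(f"Higher salary, 40 machines per administrator: "
          f"${higher_salary_platform:,.2f} per machine")
```

Under these assumed inputs the better-paid administrator is the cheaper option per machine, which is the shape of the effect the study describes.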

There is also a great deal of persuasive testimonial evidence from a range of companies and public institutions that have used open source successfully and saved money. For instance, Amazon.com was able to cut US$17m (€13.8m) in technology expenses in a single quarter by moving to Linux. The city of Largo in Florida saved $1m (€811,000) by using GNU/Linux and ‘thin clients,’ and Intel Vice President Doug Busch reported savings of $200m (€162.4m) by replacing proprietary UNIX servers with GNU/Linux alternatives.94

94 Lynn Haber, April 2002, "City Saves with Linux, Thin Clients", ZDNet, http://techupdate.zdnet.com/techupdate/stories/main/0,14179,2860180,00.html [Accessed: 7 April 2005, 12:15]; David A. Wheeler, 2005, "Why Open Source Software/Free Software (OSS/FS, FLOSS, or FOSS)? Look at the Numbers!", http://www.dwheeler.com/oss_fs_why.html [Accessed: 7 April 2005, 11:30]; Stephen Shankland, Margaret Kane and Robert Lemos, October 2001, "How Linux Saved Amazon Millions", News.com, http://news.com.com/2100-1001-275155.html?legacy=cnet&tag=owv [Accessed: 7 April 2005, 12:15].

While TCO studies are of interest, it is important to note who commissioned them. At least one (albeit Microsoft-sponsored) study has suggested that Windows is cheaper than Linux,95 although technology writer Joe Barr has discussed and criticised some of the problems inherent in the report, such as assuming no upgrades over a five-year period, costing for an older operating system, and not using the current Microsoft Enterprise license. Barr concludes his report by stating that “TCO is like fine wine: it doesn’t travel well. What may be true in one situation is reversed in another. What gets trumpeted as a universal truth (‘Windows is cheaper than Linux’) may or may not be true in a specific case, but it is most certainly false when claimed universally.”96 Conversely, it is very unlikely that Linux represents the more cost-effective solution for every configuration.

95 IDC, "Windows 2000 Versus Linux in Enterprise Computing", http://www.microsoft.com/windows2000/docs/TCO.pdf [Accessed: 7 April 2005, 12:15].

96 David A. Wheeler, 2005, "Why Open Source Software/Free Software (OSS/FS, FLOSS, or FOSS)? Look at the Numbers!", http://www.dwheeler.com/oss_fs_why.html [Accessed: 7 April 2005, 11:30].

6.7 Longer-term Considerations

Of particular interest to the digital curator are the long-term cost implications of using open source software. As suggested above, there are few (if any) conclusive sources offering quantitative savings information, but the reusability and accessibility implicit in open source will inevitably result in cost savings when long-term access to digital materials is required. The costs of recovery of information are becoming increasingly well known. Several insightful examples come from the US legal system, where information discovery legislation compels litigants to produce electronic materials prior to trial, at the defendant’s expense.97 Proprietary systems can cause problems in the face of these kinds of requirements, hampering straightforward access to digital materials and bottlenecking the legal process. In the case of Zubulake v. UBS Warburg,98 the retrieval and presentation of content from a single email stored on backup tapes was priced at $175,000. In a similar example, in the case of Murphy Oil Corporation v Fluor Daniel99 it was stated that the recovery of email content into a presentable form would cost some $6.2 million and take more than six months, excluding attorney time. UK Freedom of Information legislation creates similar obligations to provide information, which once again can be problematic within a proprietary or opaque system. These sums represent the cost of inaccessible digital resources. By introducing transparency at every level of one’s information infrastructure, in terms of both technology and comprehension, such costs can be mitigated. Irrespective of legal obligations to present information, long-term use inevitably requires the repackaging and migration of digital materials to accommodate the technological environments and standards of the day. Only by establishing and effectively documenting an understanding of our digital resources can such activities be undertaken more straightforwardly and more cost-effectively.

97 National Electronic Commerce Coordinating Council, 2004, "Effectively Managing the Discovery of Electronic Records", http://www.ec3.org/Downloads/2004/Effectively_Man_Discovery_of_El_Records.pdf [Accessed: 7 April 2005, 12:15].

98 Zubulake v. UBS Warburg LLC 217 F.R.D. 309 (S.D.N.Y. 2003) [Note: Zubulake I, Opinion of 13 May 2003], Zubulake v. UBS Warburg LLC, 216 F.R.D. 280 (S.D.N.Y. 2003) [Note: Zubulake II, Opinion of 24 July 2003]

99 Murphy Oil v. Fluor Daniel, Inc., 2002 WL 246439 (E.D. La. 19 Feb 2002)

6.8 Performance and Reliability

Performance and reliability are important factors in determining the contemporary and long-term usefulness of our digital assets, and further empirical evidence suggests that open source technologies compare favourably in these terms with their proprietary peers. Again, there are few formal quantitative studies based explicitly around the performance of digital curation applications or curation processes, but one may regard the more generic examples that do exist as a useful barometer. According to a study undertaken at the University of Wisconsin in 2000, 21% of Windows 2000 applications crashed when presented with random testing using valid keyboard and mouse input.100 An additional 24% of applications hung when presented with valid keyboard and mouse input. When the same test was undertaken five years earlier using a then current Linux distribution, the failure rate was just 9%; and since then the reliability of open source software has improved. Comparable studies by IBM and Bloor Research have had similar results.101

An eWeek survey in 2002 found that MySQL was comparable to the proprietary market leader, Oracle, and offered better performance than a number of other proprietary applications, including Sybase Inc’s ASE, IBM’s DB2 and Microsoft’s SQL Server 2000 Enterprise Edition.102

As far as performance is concerned, the results for open source are also promising. As with TCO, performance benchmarks are often dependent on environment, as well as whatever assumptions the tester has made; the only real benchmark that can be of value to an individual user is the one that most closely mirrors the work actually being done. PC Magazine found in November 2001 that Linux with SAMBA significantly outperformed Windows 2000. At one stage in the test, using a 1GHz Pentium 3 with 512MB of RAM and handling thirty client connections, the Linux software was 78% faster than Microsoft’s.103 In February 2003, a team of physicists broke the Internet2 Land Speed Record using GNU/Linux, sending 6.7GB of uncompressed data from Sunnyvale, California to Amsterdam in the Netherlands in just 58 seconds.104

100 Justin E. Forrester and Barton P. Miller, 2000, "An Empirical Study of the Robustness of Windows NT Applications Using Random Testing", ftp://ftp.cs.wisc.edu/paradyn/technical_papers/fuzz-nt.pdf [Accessed: 7 April 2005, 12:15].

101 Li Ge, Linda Scott and Mark VanderWiele, 2003, "Putting Linux Reliability to the Test", http://www-106.ibm.com/developerworks/linux/library/l-rel/ [Accessed: 7 April 2005, 12:15]; http://gnet.dhs.org/stories/bloor.php3 [Accessed: 7 April 2005, 12:15].

102 Timothy Dyck, February 2002, "Server Databases Clash", eWeek, http://www.eweek.com/article2/0,3959,293,00.asp [Accessed: 7 April 2005, 12:15].

103 Oliver Kaven, November 2001, "Performance Tests: File Server Throughput and Response Times", Pcmag.com, http://www.pcmag.com/article2/0,1759,16227,00.asp [Accessed: 7 April 2005, 12:15].

104 Katie Dean, February 2003, "Data Flood Feeds Need for Speed", Wired, http://www.wired.com/news/infostructure/0,1377,57625,00.html [Accessed: 7 April 2005, 12:15].

6.9 Market Share

The market share enjoyed by open source software is significant from a digital curation perspective since it is likely to determine to a great extent the demand from within the community for preservation of these digital assets and their associated information. While not the only important factor, it is often advantageous in the interests of longevity to go with what most people are using. Generally speaking, it is marginalised or minimally adopted hardware, software and standards that are more likely to become irretrievably lost. Several examples exist where open source software leads the field, or enjoys prominence on a commensurate level with its proprietary equivalents.

Serving Web pages is one of several areas where open source software is dominant. According to Netcraft’s statistics on web servers, the Apache Web Server was responsible for some 70% of all Web pages in July 2005, with the closest rival, Microsoft’s Internet Information Server, responsible for just 23%. GNU/Linux is the second most prevalent operating system for web servers with 29%, behind Windows, which has just under half of the entire market share. Other open source operating systems (such as FreeBSD) comprise around 6% of all those that serve Web pages.105

Open source software enjoys prominence in other areas of the Internet too. Sendmail is the leading email server, with 42% of the market share.106 The DNS server BIND, an application that translates human-readable host names into the numeric addresses used by computers, had a 95% market share in 2000.107 Further emphasising the web-based prominence of open source, PHP is the most commonly used Web programming language in the world, running on over eighteen million sites during January 2005, outstripping its primary rivals ASP.NET, Java Server Pages, and ColdFusion.108

105 http://news.netcraft.com/ [Accessed: 7 April 2005, 12:25].

106 http://cr.yp.to/surveys/smtpsoftware6.txt [Accessed: 7 April 2005, 12:25].

107 Bill Manning, "in-addr version distribution", http://www.isi.edu/~bmanning/in-addr-versions.html [Accessed: 7 April 2005, 12:25].

108 http://www.php.net/usage.php [Accessed: 7 April 2005, 12:25].


At a more general level, the Free/Libre and Open Source Software (FLOSS): Survey and Study, published in June 2002, found that 43.7% of German establishments, 31.5% of British establishments, and 17.7% of Swedish establishments reported using open source or free software.109 Open source is well established within the infrastructure of the Internet, but it is only in relatively recent times that appropriate software has become available to make Linux a viable choice for desktop computer users. Improvements in graphical user interfaces (GUIs) and the potential to use Linux without recourse to unfamiliar command line instructions have made it a more appealing prospect for casual or non-expert users. A number of companies and organisations are planning or beginning migration. Public sector institutions, which require autonomy over their software infrastructures and straightforward digital preservation, are understandably among the most enthusiastic.

109 http://www.infonomics.nl/FLOSS/ [Accessed: 7 April 2005, 12:25].


7 Future Developments

It seems likely that the future of open source is assured, as projects continue to gather momentum, established applications mature and more and more specialist tools become available. There is a growing realisation within the software development world that openness is a desirable quality and that relying too much on the commercially driven solutions distributed by proprietary developers may threaten the longevity of one’s digital assets. An open source software infrastructure, dealing in open and standardised formats, ensures that transparency is maintained from the conception of information until its long-term storage. Of course, a range of factors, both financial and behavioural, mean that the digital realm is unlikely to be homogenised in the near future, and it is certain that digital curators will have to continue to find imaginative ways to successfully manipulate, migrate and re-use digital information that cannot be so straightforwardly comprehended. While in principle the adoption of open standards is of great value to all kinds of organisations, it is unrealistic to expect all important business decisions to be made based solely on archival considerations. The open source community must continue to develop solutions that offer immediate benefits in every sense, not just in terms of digital curation, but for all levels of digital creators and users, from IT administrators, managers, research scientists, and digital librarians to students, teachers and home computer enthusiasts. Continued innovation, constant refinement and an emphasis on the financial savings and legal freedom associated with open source will help to convince users of the intrinsic benefits. It is unlikely that any battle will be won simply by emphasising the superior archival characteristics of open source software and digital assets encoded with open formats. In many cases it will require even more than just functional superiority to convince users of the benefits of open source, due to the high-impact marketing that accompanies and champions many commercial products. A consumer-oriented example is the case of portable digital audio. Despite many experts agreeing that the open Ogg Vorbis format offers a better encoding algorithm in terms of sound quality per byte, most users are still encoding millions of hours of their music collections in the proprietary and patented MP3 format, buoyed in their endeavours by television and web publicity, to the extent that “digital audio” is now almost synonymous in the popular consciousness with MP3. Only by continuing to offer innovative, usable and functionally rich solutions can open source and its inherent digital curation qualities expect to be fully exploited.


8 Conclusion

The open source and free software development and distribution philosophies are now well established, and offer several benefits throughout the digital curation workflow. The conceptual key to understanding the value of these technologies and their associated standards is to understand the significance of transparency. By understanding the precise nature of our own digital assets and those we seek to integrate with and exploit elsewhere we are able to take more effective steps towards ensuring their continued and long-term accessibility. Removing restrictive legal barriers to this endeavour facilitates the digital curation process further still. Open source solutions arrive free from commercially motivated opacity and represent a consensus in favour of continued accessibility, comprehension and reusability. By their nature they facilitate integration and interfacing with existing infrastructures. Standardisation of formats is an unprofitable concept for those within a competitive market economy who are mainly interested in promoting their own unique product path. Yet it is also a great facilitator for straightforward long-term digital curation, its promotion and presence rendering the process significantly less irksome. Open source technology is founded on an unwillingness to reinvent or monopolise the wheel; a sentiment that through active collaboration our software and digital assets can be more effectively structured, injected with greater functionality and made more sustainable over the long term. There is money to be made through open source software, but it is a consequential thing, and seldom do economic concerns drive or direct the development and release agenda. By embracing open source tools where they are available and functionally sufficient, our digital materials will be more easily comprehensible in the future. Many creators, custodians and re-users of digital information have a responsibility to ensure that the materials they create offer maximum usability and accessibility in the present and are guarded against the problems of obsolescence in the future. Using open source software and open standards to this end is a worthwhile starting point and, combined with other digital curation strategies, can be effective.


Bibliography

Print

Dibona, C., S. Ockman, M. Stone, 1999, Open Sources: Voices from the Open Source Revolution, Sebastopol, CA, O'Reilly

Dubois, P., 2003, MySQL, Indiana: Sams Publishing

Gibb, B., S. Damodaran, 2002, ebXML: Concepts and Application, Wiley, ISBN: 076454960X

Kotok, A., D.R.R. Webber, 2001, ebXML: The New Global Standard, New Riders, ISBN: 0735711178

Moody, G., 2002, Rebel Code: Linux and the Open Source Revolution, Penguin

Netproject, 2003, The IDA Open Source Migration Guidelines, European Communities

Raymond, E. S., 2001, The Cathedral and the Bazaar - Musings on Linux and Open Source by an Accidental Revolutionary (Revised edition), Sebastopol, CA, O'Reilly

Ross, S., A. Gow, 1999, Digital archaeology? Rescuing Neglected or Damaged Data Resources, London & Bristol: British Library and Joint Information Systems Committee, ISBN 1900508516

Ross, S., M. Donnelly, M. Dobreva, D. Abbott, A. McHugh, A. Rusbridge, 2005, DigiCULT Technology Watch Report 3

Stallman, R. M., J. Gay, L. Lessig, 2002, Free Software, Free Society: Selected Essays of Richard M. Stallman, Free Software Foundation

Stanescu, A., 2005, Assessing the durability of formats in a digital preservation environment: The INFORM methodology, OCLC Systems and Services, International Digital Library Perspectives, Vol 21, Number 1, 2005, pp 61-81

Weber, S., 2004, The Success of Open Source, Harvard, MA, Harvard University Press

Welsh, M., M. K. Dalheimer, T. Dawson, L. Kaufman, 2003, Running Linux, Sebastopol, CA, O'Reilly

Williams, S., 2002, Free as in Freedom: Richard Stallman’s Crusade for Free Software, Sebastopol, CA, O'Reilly

Online

Abbott, D., Overcoming the Dangers of Technological Obsolescence: Rescuing the BBC Domesday Project, DigiCULT.Info 4, Page 4, http://www.digicult.info/pages/newsletter.php [Accessed: 7 April 2005, 11:49]

Berlind, D., 30 July 2002, Who Gave Microsoft Control of your IT Costs? You Did, ZDNET.com, http://techupdate.zdnet.com/techupdate/stories/main/0,14179,2875958,00.html [Accessed: 7 April 2005, 12:30]

Bernstein, D. J., 2001, Internet Host SMTP Server Survey, http://cr.yp.to/surveys/smtpsoftware6.txt [Accessed: 7 April 2005, 12:30]

Bishop, T., 12 September 2003, Should Microsoft Be Liable For Bugs?, Seattle Post Online, http://seattlepi.nwsource.com/business/139286_msftliability12.html [Accessed: 7 April 2005, 12:30]

Books from the Past, http://www.booksfromthepast.org [Accessed: 7 April 2005, 13:31]

Brooks, J., 26 April 2004, Office 2003 vs. OpenOffice.org, eWeek, http://www.eweek.com/article2/0,1759,1571626,00.asp [Accessed: 7 April 2005, 13:31]

Cameron, N., September 2003, Open Source Bookmarks Australian Heritage, http://www.computerworld.com.au/index.php?id=522130461&fp=16&fpid=0 [Accessed: 7 April 2005, 11:56]

CAMiLEON BBC Domesday Rescue Project: http://www.si.umich.edu/CAMILEON/domesday/domesday.html [Accessed: 7 April 2005, 13:31]

CECILL License, http://www.cecill.info/licences/Licence_CeCILL_V1.1-US.html [Accessed: 7 April 2005, 11:43]

Center of Open Source and Government (eGovOS), http://www.egovos.org/ [Accessed: 7 April 2005, 13:31]

Connell, C., Open Source Projects Manage Themselves? Dream On, IBM Lotus Developer Network Archives, http://www-10.lotus.com/ldd/devbase.nsf/articles/doc2000091200 [Accessed: 7 April 2005, 13:31]

Connor, D., 2001, Linux Slips Slowly into the Enterprise Realm, Network World Fusion, http://www.nwfusion.com/news/2001/0319specialfocus.html [Accessed: 7 April 2005, 13:31]

Conway, P., December 2003, Zope at Duke University: Open Source Content Management in a Higher Education Context, DigiCULT.Info, Issue 6, p. 10, http://www.digicult.info/pages/newsletter.php [Accessed: 7 April 2005, 11:57]

Creative Commons, http://creativecommons.org/ [Accessed: 7 April 2005, 11:57]

Dahdah, H., February 2003, Open Source Library System a Welcome Gift, ComputerWorld, http://www.computerworld.com.au/index.php/id;534895878;relcomp;1 [Accessed: 7 April 2005, 13:31]

Danish Board of Technology, Open Source Software in e-government, http://www.tekno.dk/pdf/projekter/p03_opensource_paper_english.pdf [Accessed: 7 April 2005, 13:31]

Dean, K., 2003, Data Flood Feeds Need For Speed, Wired.com, http://www.wired.com/news/infostructure/0,1377,57625,00.html [Accessed: 7 April 2005, 13:35]

The DigiCULT Report, Full Report, Technological Landscapes for tomorrow’s cultural economy: Unlocking the value of cultural heritage, pp.212. Available online at http://www.digicult.info/pages/report.php [Accessed: 7 April 2005, 11:46]

Digitale Bibliothek, http://digbib.iuk.hdm-stuttgart.de/gsdl/cgi-bin/library [Accessed: 7 April 2005, 13:31]

Dravis, P., Open Source Software: Perspectives for Development, http://www.infodev.org/symp2003/publications/OpenSourceSoftware.pdf [Accessed: 7 April 2005, 13:31]

Dyck, T., February 2002, Server Databases Clash, eWeek, http://www.eweek.com/article2/0,3959,293,00.asp [Accessed: 7 April 2005, 13:31]

enCore Open Source MOO project, http://lingua.utdallas.edu/encore/ [Accessed: 7 April 2005, 13:31]

EROS: An Open Source Multilingual Research System for Image Content Retrieval dedicated to Conservation-Restoration exchange between Cultural Institutions, http://www.c2rmf.fr/documents/c2r_eros.pdf [Accessed: 7 April 2005, 13:31]

European Schoolnet Virtual School: Software, Freeware and Shareware: www.eun.org/goto.cfm?sid=220 [Accessed: 7 April 2005, 13:31]

Forrester, J.E., B.P. Miller, 2000, An Empirical Study of the Robustness of Windows NT Applications Using Random Testing, ftp://ftp.cs.wisc.edu/paradyn/technical_papers/fuzz-nt.pdf [Accessed: 7 April 2005, 12:15].

Freshmeat.net, http://freshmeat.net [Accessed: 7 April 2005, 13:31]

Guercio, M., C. Cappiello, File Formats Typology and Registries for digital preservation, (DELOS, WP6 D6.3.1), http://www.dpc.delos.info [Accessed: 7 April 2005, 11:48].

Ge, L., L. Scott, M. Vanderwiele, 17 December 2003, Putting Linux Reliability to the Test, IBM DeveloperWorks, http://www-106.ibm.com/developerworks/linux/library/l-rel/ [Accessed: 7 April 2005, 13:31]

Gilfillan, I., 16 December 2003, PostgreSQL vs. MySQL: Which is Better?, Database Journal, http://www.databasejournal.com/features/postgresql/article.php/3288951 [Accessed: 7 April 2005, 13:31]

Glance, D. G., J. Kerr, A. Reid, February 2004, Factors Affecting the Use of Open Source Software in Tertiary Education Institutions, First Monday, Volume 9, Number 2, http://www.firstmonday.org/issues/issue9_2/glance/ [Accessed: 7 April 2005, 13:31]

GNET, January 2000, How Do Linux and Windows NT Measure Up in Real Life?, ID-side, http://gnet.dhs.org/stories/bloor.php3 [Accessed: 7 April 2005, 13:31]

Haber, L., April 2002, City Saves With Linux, Thin Clients, ZDNet.com, http://techupdate.zdnet.com/techupdate/stories/main/0,14179,2860180,00.html [Accessed: 7 April 2005, 13:37]

The Halloween Documents, http://opensource.org/halloween/ [Accessed: 7 April 2005, 13:31]

INFONOMICS Free/Libre and Open Source Software: Survey and Study, http://www.infonomics.nl/FLOSS/report/ [Accessed: 7 April 2005, 13:31]

Insight Special Report: Why Europe Needs Free and Open Source Software and Content in Schools, 2004, http://www.eun.org/insight-pdf/special_reports/Why_Europe_needs_foss_Insight_2004.pdf [Accessed: 7 April 2005, 13:31]

Kaven, O., November 2001, Performance Tests: File Server Throughput and Response Times, PC Magazine, http://www.pcmag.com/article2/0,1759,16227,00.asp [Accessed: 7 April 2005, 13:31]

Lee, C., October 2001, Open Source: A Promising Piece of the Digital Preservation Puzzle. A slightly different version appears as Open-Source Software: A Promising Piece of the Digital Preservation Puzzle, Electronic Currents, Midwest Archives Conference (MAC) Newsletter, Volume 29, Number 2 (113), 26-28, http://www-personal.si.umich.edu/~calz/oss_preservation.htm [Accessed: 7 April 2005, 13:31]

Linux Migration.com, http://www.linuxmigration.com/ [Accessed: 7 April 2005, 13:31]

The Linux Weekly News, http://lwn.net/ [Accessed: 7 April 2005, 13:31]

Manning, B. Bind, 2000, Internet Usage Statistics, http://www.isi.edu/~bmanning/in-addr-versions.html [Accessed: 7 April 2005, 13:31]

Mantarov, B., 1999, Open Source Software as a New Business Model: The Entry of Red Hat Software, Inc. on the Operating System Market with Linux, Dissertation submitted in partial fulfilment of the degree of MSc in International Management at the University of Reading, http://bmantarov.free.fr/bojidar/academic/Dissertation_-_Open_source_software_as_a_new_business_model.pdf [Accessed: 7 April 2005, 13:31]

McMillan, R., 26 March 2004, SCO Linux Licensee Has Second Thoughts on Deal, Computer World, http://www.computerworld.com/governmenttopics/government/legalissues/story/0,10801,91671,00.html [Accessed: 7 April 2005, 13:31]

Munich Linux Decision Final, 14 June 2004, DesktopLinux.com, http://www.desktoplinux.com/news/NS7137390752.html [Accessed: 7 April 2005, 13:31]

NASA Open Source License, http://www.opensource.org/licenses/nasa1.3.php [Accessed: 7 April 2005, 11:56].

Netcraft Web Server Survey, http://news.netcraft.com/ [Accessed: 7 April 2005, 13:31]

Norwegian Board of Technology, Global CountryWatch on Open Source Policy, http://www.teknologiradet.no/html/592.htm [Accessed: 7 April 2005, 13:31]

OpenSector.org, Public Sector Related Content, http://opensector.org/ [Accessed: 7 April 2005, 13:31]

Open Source and Industry Alliance, http://www.osaia.org/ [Accessed: 7 April 2005, 13:31]

The Open Source Initiative, http://www.opensource.org [Accessed: 7 April 2005, 13:31]

Open Source Software Watch (OSS Watch), http://www.oss-watch.ac.uk/ [Accessed: 7 April 2005, 13:31]

Open Source Technology Group, http://www.ostg.com [Accessed: 7 April 2005, 13:31]

O’Reilly, T., May 2004, The Open Source Paradigm Shift, Tim.Oreilly.com, http://tim.oreilly.com/opensource/paradigmshift_0504.html [Accessed: 7 April 2005, 13:31]

Perens, B. et al., Free Software Leaders Stand Together, http://perens.com/Articles/StandTogether.html [Accessed: 7 April 2005, 13:31]

Perens, B., Open Standards Principles and Practice, http://perens.com/OpenStandards/Definition.html [Accessed: 7 April 2005, 13:31]

PHP Usage Statistics, http://www.php.net/usage.php [Accessed: 7 April 2005, 13:31]

PostgreSQL or MySQL http://www-css.fnal.gov/dsg/external/freeware/pgsql-vs-mysql.html [Accessed: 7 April 2005, 13:31]

Raymond, E. S., 1999-2004, The Cathedral and the Bazaar, http://www.catb.org/~esr/writings/cathedral-bazaar/ [Accessed: 7 April 2005, 13:31]

Raymond, E. S., version 4.4.7, 2003, The Jargon File, http://www.catb.org/~esr/jargon/ [Accessed: 7 April 2005, 13:31]

Raymond, E. S., 2000, The Software Release Practice How-To, http://www.tldp.org/HOWTO/Software-Release-Practice-HOWTO/index.html [Accessed: 7 April 2005, 13:31]

Ross, S., M. Donnelly, M. Dobreva, D. Abbott, A. McHugh, A. Rusbridge, 2005, DigiCULT Technology Watch Report 3, http://www.digicult.info/pages/techwatch.php [Accessed: 7 April 2005, 11:30]

Ross, S., A. Gow, 1999, Digital archaeology? Rescuing Neglected or Damaged Data Resources, (London & Bristol: British Library and Joint Information Systems Committee), ISBN 1900508516, http://www.ukoln.ac.uk/services/elib/papers/supporting/pdf/p2.pdf [Accessed: 7 April 2005, 11:47]

Rusbridge, A., April 2004, XENA: Electronic Normalising Tool, DigiCULT.Info, Issue 7, page 32, http://www.digicult.info/pages/newsletter.php [Accessed: 7 April 2005, 11:49]

Science Commons, http://science.creativecommons.org [Accessed: 7 April 2005, 11:57].

Shankland, S., M. Kane, R. Lemos, 30 October 2001, How Linux Saved Amazon Millions, CNET news.com, http://news.com.com/2100-1001-275155.html?legacy=cnet&tag=owv

Sourceforge.net, http://sourceforge.net [Accessed: 7 April 2005, 13:31]

Stallman, R. M. et al., The Free Software Foundation, http://www.fsf.org [Accessed: 7 April 2005, 13:31]

Stallman, R. M., Stallman.org, http://www.stallman.org [Accessed: 7 April 2005, 13:31]

Staples, T., April 2003, The Fedora Project: An Open-Source Digital Object Repository Management System, D-Lib Magazine, Volume 9, Number 4, http://www.dlib.org/dlib/april03/staples/04staples.html [Accessed: 7 April 2005, 12:15].

Stone, B., 2004, The Linux Killer, http://www.wired.com/wired/archive/12.07/linux.html?pg=4&topic=linux&topic_set=(none) [Accessed: 7 April 2005, 13:31]

Thurgood, A., January 2005, The GPL and non-U.S. law, Open Source Law Blog, http://www.oslawblog.com/2005/01/gpl-and-non-us-law.html [Accessed: 7 April 2005, 11:43].

Trimble, P.S., December 2000, Open Minds on Open Source, FCW.com, http://www.fcw.com/fcw/articles/2000/1204/pol-nasa-12-04-00.asp [Accessed: 7 April 2005, 11:57].

Wallen Jr, J., 29 November 2002, OpenOffice.org versus Microsoft Office, ZDNet.com Australia, http://www.zdnet.com.au/insight/0,39023731,20270300,00.htm [Accessed: 7 April 2005, 13:31]

Wheatley, P., 2004, Institutional Repositories in the context of Digital Preservation, DPC Technology Watch Series Report 04-02, http://www.dpconline.org/docs/DPCTWf4word.pdf [Accessed: 22 July 2005, 14:35]

Wheeler, D.A., Rev. 2005, Why Open Source Software / Free Software (OSS/FS, FLOSS or FOSS)? Look at the Numbers!, http://www.dwheeler.com/oss_fs_why.html [Accessed: 7 April 2005, 13:31]

Windows 2000 Versus Linux in Enterprise Computing: An Assessment of Business Value for Selected Workloads, http://www.microsoft.com/windows2000/docs/TCO.pdf [Accessed: 7 April 2005, 13:31]

Witten, I. H., D. Bainbridge, S. J. Boddie, October 2001, Greenstone Open Source Digital Library Software, D-Lib Magazine, Volume 7, Number 10, http://www.dlib.org/dlib/october01/witten/10witten.html [Accessed: 7 April 2005, 13:31]

Fora

Association of C and C++ Users (ACCU) Open Source Forum 2004, 14-17 April 2004, Oxford, England, http://www.reportlab.com/conferences/accu2004/index.html

Glossary of Terms

Creative Commons - A non-profit organisation devoted to expanding the range of creative work available for others to legally build upon and share.

Curatability - A measure of the ease with which a digital resource can be curated.

Distribution - A software release representing a packaging up of several individual programs; most commonly a version of GNU/Linux with associated applications.

Emulation - The process of recreating existing hardware or software environments with software.

Free Software - Software released under terms conforming to the four fundamental freedoms established by the Free Software Foundation, which demand transparency and the legal and practical freedoms to change, re-use and re-distribute code.

Freedom of Information - UK legislation compelling public bodies to release information on request.

Institutional Repository - A software infrastructure designed to store digital resources for the facilitation of their management and/or preservation.

Migration - The process of moving digital resources to alternative hardware or software environments to facilitate their use in the face of obsolescence and ensure their longevity.

Proprietary Software - Software whose functionality the user cannot control and whose code the user cannot study or edit.

Representation Information - The information that maps a Data Object into more meaningful concepts. An example is the ASCII definition that describes how a sequence of bits (i.e., a Data Object) is mapped into a symbol. In order to keep things manageable, Representation Information can be factored into distinct types, such as structure, semantics and others. The latter can include software and standards, among other things. This normalisation allows one, for example, to describe two sets of information which are identical, but which are held in different structures (formats), by combining the same Semantic description with different Structure descriptions.

Software License - An agreement distributed alongside computer software determining acceptable legal use for that software.

Source Code - Human-readable program code, as written before compilation.

Standard - An accepted practice, technology or specification.

Total Cost of Ownership - The financial costs associated with a particular activity or policy, incorporating all costs incurred, including acquisition, maintenance, staff and retraining costs.

Acronyms and Abbreviations

BBC - British Broadcasting Corporation.

BIOS - Biological Innovation for Open Source.

BSD - Berkeley Software Distribution.

CVS - Concurrent Versions System.

ebXML - Electronic Business Using XML.

FAQ - Frequently Asked Questions.

FLOSS - Free, Libre or Open Source Software.

FS - Free Software.

FSF - Free Software Foundation.

FUD - Fear, Uncertainty and Doubt.

GNU - GNU’s Not Unix.

GPL - General Public License.

HTML - Hypertext Markup Language.

IBM - International Business Machines.

LOCKSS - Lots of Copies Keeps Stuff Safe.

METS - Metadata Encoding and Transmission Standard.

OSI - Open Source Initiative.

OSS - Open Source Software.

PDF - Portable Document Format.

PHP - PHP: Hypertext Preprocessor.

Perl - Practical Extraction and Report Language.

SQL - Structured Query Language.

TCO - Total Cost of Ownership.

VLE - Virtual Learning Environment.

WINE - Wine is Not an Emulator.

XENA - XML Electronic Normalising of Archives.

XML - eXtensible Markup Language.

About the Author

Since graduating with a Scots Law Degree with Honours from Glasgow University in 2000, Andrew McHugh has concentrated on developing a wide range of skills mainly in the digital realm. He collected his Masters in Information Technology, again from Glasgow University, in 2001 and since then has been employed within HATII (the Humanities Advanced Technology and Information Institute) at this University in various capacities. Within the institution’s Department of Music he revolutionised the information infrastructure, applying and honing several skills and performing a diverse range of roles, with responsibilities ranging from database and server administration to web programming, application development and desktop cluster design and management. In late 2004 he joined the Digital Curation Centre in the position of Advisory Services Manager, leading a world-class team of digital curation practitioners in offering leading-edge expertise and insight in a range of issues to a primarily HE and FE audience. In his spare time Andrew maintains an aptitude and enthusiasm for software development and continues to develop web-based solutions for a range of customers including commercial and heritage clients. He is a keen user of open source technologies with several GNU/Linux distributions including Fedora Core, Gentoo and SUSE distributed across the hard disk partitions of his various systems, including a Microsoft X-Box games console.