Email Archiving Systems Interoperability The Harvard ... · PDF fileThe Harvard Library Report...

Preview:

Citation preview

Email Archiving Systems Interoperability

(Article begins on next page)

The Harvard community has made this article openly available.Please share how this access benefits you. Your story matters.

Citation Simpson, Joel. 2016. Email Archiving Systems Interoperability.Harvard Library Report.

Accessed May 13, 2018 2:54:24 PM EDT

Citable Link http://nrs.harvard.edu/urn-3:HUL.InstRepos:28682572

Terms of Use This article was downloaded from Harvard University's DASHrepository, and is made available under the terms and conditionsapplicable to Other Posted Material, as set forth athttp://nrs.harvard.edu/urn-3:HUL.InstRepos:dash.current.terms-of-use#LAA

Harvard Library ReportJuly 2016

Prepared by Joel Simpson

Email ArchivingSystemsInteroperability

TheHarvardLibraryReportEmailArchivingStewardshipToolsWorkshopislicensedunderaCreativeCommonsAttribution4.0InternationalLicense(CCBY4.0)<https://creativecommons.org/licenses/by/4.0/>

PreparedbyJoelSimpson,ArtefactualSystems,Inc.

ReviewedbyWendyMarcusGogel,HarvardLibraryandGrainneReilly,LibraryTechnologyServices,HarvardUniversity

Citation:Simpson,Joel.2016.EmailArchivingSystemsInteroperability.HarvardLibraryReport.http://nrs.harvard.edu/urn-3:HUL.InstRepos:28682572.

Table of Contents

ExecutiveSummary..........................................................................................................................3

BackgroundandContext..................................................................................................................4

ProjectObjectives............................................................................................................................4

ProjectApproach..............................................................................................................................4

ProjectResults..................................................................................................................................5

1.AssessmentoftheEmailToolsDataSharingFramework....................................................5

2.AnalysisFramework:RequirementsforInteroperability.....................................................6

3.AnalysisofToolsusingtheRequirementsforInteroperabilityFramework.........................9

4.KeyFindings:AnalysisofToolsandEmailToolsDataSharingFramework........................18

5.OpportunitiestoImprovetheInteroperabilityofEmailTools...........................................20

Acknowledgements........................................................................................................................22

Executive Summary

Earlierthisyear,HarvardLibraryconvenedtheHarvardEAST(EmailArchivingStewardshipTools)workshoptofostertheexpandingemailarchivingcommunity,sharebestpracticesandidentifydirectionsforfuturework.

Oneofthemainconclusionsoftheworkshopwasthatthereisnostandardworkflowthatcanbeuniformlyappliedineverysituation,butthatallarchiveshavesimilarfunctionalneedsforemailarchiving,andthatgiventheneedforflexibility,currentprocessescouldbeimprovedbyusingtheuniquestrengthsofdifferenttoolstogether.

HarvardLibraryengagedArtefactualSystemsInc.tobetterunderstandhowthetoolscanexchangedatatodayandcarryoutanalysistoidentifyopportunitiesforthecommunitytofurthersupportcomprehensivepreservationworkflowsforemail.

CommunitymembershavebeeninvitedtocontributetoanEmailToolsDataSharingFramework.Theintentionistoprovideahighlevelviewofhowemailcontentormetadatacanbeinputoroutputtoeachofthedifferenttools,usingacommonframeworktosupportcomparisonandanalysis.Thisworkisongoing,butenoughdetailhasbeencollectedtoenableanalysisandidentificationofsomeclearopportunitiesforimprovingtheinteroperabilityofthesetools.

Asetof“requirementsforinteroperability”wereidentifiedtosetoutthedifferentaspectsorconcernsinvolvedinusingmultipletoolsinanemailarchiving,processingorpreservationworkflow.Analysiswascarriedouttounderstandhoweachofthetoolssupportsthesedifferentrequirements.Keyfindingswerethenidentifiedineachoftheseareas.

Finally,asetof7draftrecommendationshasbeenproposedforthewidercommunitytoconsider.Thesearehighlevelrecommendationswithoutdetailednextsstepsoranysuggestionforpriority.Wefeeltheyareusefulindecomposingthiscomplexproblemspaceintodiscreteandwell-definedopportunitiesthatwillbeeasiertotackleinafastchangingenvironment.

Page 3 of 22

Background and Context

Earlierthisyear,HarvardLibraryconvenedtheHarvardEAST(EmailArchivingStewardshipTools)workshoptofostertheexpandingemailarchivingcommunity,sharebestpracticesandidentifydirectionsforfuturework.Theworkshopinvolvedstakeholdersfromdifferentinstitutions,includingsubjectmatterexperts,usersanddevelopersofseveralemailarchivingorpreservationtools.

Theworkshopconcludedthatthecommunityisveryinterestedinworkingtogethertosolvesharedproblems.Severaldirectionsforfutureworkwereidentified,including“theneedforanexchangestandardthatenablesinteroperablewaystoextract,packageandtransferdatabetweentools”.Thisconclusionwasbasedontheconsensusthatthereisnooneuniformworkflowforemailarchiving,butthatcurrentprocessescouldbeimprovedifarchiveswereabletoharnesstheuniquestrengthsofeachtoolselectively(usingonlythefunctionalityneededinwhateverorderisneeded).

HarvardLibraryPreservationServicesengagedArtefactualSystemsInc.tocarryoutashortconsultingprojecttobuildonthesefindingsandidentifyopportunitiesforthecommunitytofurthersupportcomprehensivepreservationworkflowsforemail.

Project Objectives

Thegoalsofthisconsultingprojectareto:

1. identifygapsoropportunitiestoimprovetheinteroperabilityofthenumerousemailtoolsbyshowingthetype,formatandstructureofdatawhichcanbeinputoroutputfromeachtool

2. informemailstewardsabouttheoptionsandconsiderationsinvolvedindefiningemailarchivingworkflowsusingmultipletools

Thisprojecthasnotattemptedtoprovideafunctionaldescriptionorcomparisonofthevarioustoolsunderconsideration.Averybriefoverviewofthetools,withlinksforfurtherdetailedinformationavailablefromtheproviders,isprovidedbelowinsection3.AusefulcomparisonofEmailArchivingtools(includingmanynotconsideredinthisproject)canbefoundattheLifecycleToolsforArchivalEmailChart:https://docs.google.com/spreadsheets/d/1V1N22xnr5e0EbDlZWx58bjYO6rkrMrYH9wGX9-CK8c4/edit#gid=986222267.

Project Approach

Thisprojectisproducingtwodeliverablestomeettheobjectivesdefinedabove.

ThefirstdeliverableisanEmailToolsDataSharingFrameworkthatsetsoutthecontentobjects(i.e.email)andmetadatathateachemailorpreservationtoolcaninputoroutput.Representativesfromeachtoolproviderwereaskedtocompletethedescriptionsoftheseinputsandoutputsusingagenericframework(withassociatedglossary)toenablecommonunderstandingoftermsandmakecomparisonbetweentoolseasier.

Amoredetaileddescriptionandassessmentofthetoolisprovidedbelowinsection2.

Page 4 of 22

TheseconddeliverableofthisprojectisthisConsultingReportwhich

1. assessesthecompletionandusefulnessoftheEmailToolsDataSharingFramework2. proposesagenericsetofrequirementsforinteroperabilitytouseasananalysisframework3. analyzes/summarizeshoweachtoolsatisfiesthoserequirementsforinteroperability4. setsoutseveralrecommendationsforimprovinginteroperabilityofthetoolsandfurther

establishingbestpracticesforthecommunityPleasenotethatthroughoutthisreportwhenwereferto‘digitalobjects’wemeananytypeofdigitalobjects,includingemailsthemselves,relatedcontentlikeattachments,oranyassociatedmetadata.Weuse‘data’interchangeablywith‘digitalobjects’simplybecauseitisshorter.(Wehavenotseentheneedtodistinguishtheseconceptswithmoreprecisedefinitions.)

Project Results

1. Assessment of the Email Tools Data Sharing Framework

1.1. About the Email Tools Data Sharing Framework

Theemailtoolsdatasharingframeworkincludesinformationon6differentemailorpreservationtools.Theintentionistoprovideahighlevelviewofhowemailcontentormetadatacanbeinputoroutputtoeachofthedifferenttools.

Theframeworkissetoutinaspreadsheet,withonesheettodescribeinputsandanothertodescribeoutputs.Eachsheetisorganizedtofirstdescribetheactual(or"physical")dataobjects(orinput/outputmechanisms,asinsomecasestheyareprogrammatic),followedbyadescriptionofthekindsofdataormetadatafoundinthoseobjects.

Separaterowsdistinguishbetweenthelevelofobligationdemandedtobeabletouseeachtool:

● mandatorycontentordata(systemwillnotacceptorworkproperlywithoutthis)● usefulcontentordata(isoptional,butenablesfunctionalitywithinthesystem-e.g.asensitivity

flagthatcanbeusedwhenfiltering)● additionalcontentordata(canbeconsumed,butisnotusedinanywaybyconsumingsystem--

e.g.attachmentsareincludedinMBOX,buttheparticularsystemmaynotallowuserstodoanythingwiththem)

Thegoalistodescribeineachofthesecolumns:

● thetypeorextentofdataprovided(e.g.specificfieldsusedasreferenceIDs,oramoregeneraldescriptionsuchas'preservationevents')

● formatofdata(isa'local'schemadefined,orisastandardschemaused,suchasPREMIS)● location/structureofdata(whereintheinput/outputisthisinformation--e.g.PREMISevents

arerecordedinMETS.xmlfile;folderinformationstoredinpathnameinMBOXetc.)Insomecasesthisinformationneedstobebrokendownintodifferentlevelsofgranularity,forinstancetoindicateinformationstoredatindividualemaillevelvs.collectionlevel.

Page 5 of 22

1.2. Assessment of the Email Tools Data Sharing Framework

Atthetimeofthiswriting,completionofthespreadsheetisinprogress.Weinvitecommentsorthoughtsfromallparticipantson:

● abilitytocompletethespreadsheetconsistently(orkeydifferencesininterpretation)● anythinglearnedwhilefillingitin● whetheritiscompleteenough,orneedsfurtherwork;wishlistadditions/amendments(e.g.

suggestionsforaddingmoredetail)● initialviewsonvalueoftheexercise● intenttousethetoolmovingforward

Datagatheringworkisongoingandwillberefinedasneededbythecommunitytosupporttheircollaborativeeffortstoimprovethesetoolsandestablishbestpracticesforemailarchivingandpreservation.

InitialfeedbackandobservationsfromArtefactual:

● Itisinterestingtoseethisparticularperspectivefromthedifferenttools,andenablesinterestinganalysisofsimilaritiesanddifferences(whichwillbeexploredfurtherintherestofthisreport).

● Thespreadsheetemphasizestwodimensions(datatypesincolumnsandsystemsinrows),butthereareinfactnumerousdimensionsofinterest(includinggranularityofgroupingofdata,levelsofobligation,typeofdatavs.formatsorstandardsemployed,etc.).Thismakesfittinginalloftherelevantinformationachallenge.

● Giventhespace,itdoesnotseempossibletoincludeenoughdetailedinformationforthistobeaveryhandson‘howto’tool--butitmaywellbeausefulanalyticordecisionsupporttool,todetermineifthereisenoughcompatibilitybetweenaparticularselectionoftoolsforadesiredworkflow.

2. Analysis Framework: Requirements for Interoperability

Thedatasharingframeworkisprimarilyfocusedontheinputsandoutputsofeachofthetoolsunderconsideration.Giventhebroaderintenttoenableemailstewardstodeterminewhetherandhowtheymightcraftworkflowsusingmultipletools,thisreportproposesasetofgeneric‘requirementsforinteroperability’.Thisprovidesamoreholisticviewofthedifferentaspectsofusingmultipletoolsthatoperatetogethertoenableacomprehensiveworkflowforemailprocessingorpreservation.

Theserequirementsaremoreananalyticalframeworkthanaconcretesetofrequirements.Theyarefocusedonthelevelofbusinessprocessesandworkflows,anddonotrepresentaparticularefforttoelicitrequirementsfromendusers.

Therequirementsandtheirrationalearedescribedbelow.Inthefollowingsection,eachofthe6toolsisassessedagainsteachrequirement.Thisallowsustocomparesimilaritiesanddifferencesinspecificareasofconcernandusethisasthebasisforrecommendationsforfutureworklaterinthereport.

Page 6 of 22

2.1. Support for data transmission

Themostbasicrequirementforaworkflowthatusesmultipletoolsworkingonacommonsetofdataistoenablethosetoolstoaccessthatdata.

Thisfunctionalitycanbeprovidedinmanyforms;userinterfacesforselectionofdataforingestfromaparticularlocation;automatedjobsthatingestdata;directsystemtosystemconnectivity;orpublishedAPIs.Thegoalhereistosimplyarticulatehoweachsystemsupportsthis,ratherthantojudgeonemethodoveranother.Thiswillallowustoseewhichtoolscansharedata(andhow),ataphysicallevel,withothertools.

2.2. Support for standard data formats

Oncewehavedeterminedaparticulartoolcanaccessasetofdataphysically,weneedtoensureitcaninterpretandprocessthatdata.Ataminimum,thedataformatmustbe‘standard’betweenthetoolsbeingconsidered.

Itiswellestablishedinthepreservationcommunitythatopen,non-proprietaryandwidelyusedstandardsarepreferableforpreservationformats.Whilenotalldatatobeexchangedneedstobe(orevencanbe)inapreservationformat,thesameprincipleswillimprovetheoddsthatanyparticulartoolwillbeinteroperablewithothers.

Supportforstandarddataformatsappliestoemailcontent,metadataandthepackagingofbothemailandmetadata.

2.3. Support for appropriate scope of exchangeable data

Emailcontentandmetadatacanexistorbegroupedatvariouslevelsofgranularity.Differentprocessingtoolsmayacceptdatawithanentirelyarbitrarydefinitionofscope(usingagenerictermsuchasa‘transfer’or‘packet’),ortheymayrequiredataormetadatatoconformtoaspecificdefinition(suchasclearlygroupingdataby‘account’).

Scopeofdataalsoreferstothetypeandextentofdatainanyparticulardataset.Forexample,Archivematicahasfunctionalitytoverifyhashes/checksums;ifchecksumshavebeencreatedinanothertool(e.g.BitCurator),thenideallyArchivematicashouldallowchecksumstobeimportedsothatverificationcanoccuronthosechecksums,notjustonchecksumscreatedbyArchivematica.Thisconceptisclearlytiedcloselywiththelevelofgranularity-achecksummaybemadeforafolderorcollectionofemails,oritmaybecreatedattheindividualemaillevel.

Emailstewardswillneedtounderstandwhatscopeofdataisrequiredorpossibleusinganyparticulartool.Similarlyanydecisiontouseaparticulardatastandardneedstoconsiderthescopeofdatathatformatallowsfororrequires.

Page 7 of 22

2.4. Ability to track processing history and provenance

Theabilitytoestablishandmaintaintheprovenance(includingprocessinghistory)ofcontentisawellunderstoodrequirementinthearchivalandpreservationcommunities.Whilethismaynotbearequirementforeveryonelookingtoprocessemails,itisafundamentalrequirementforthecoreusergroupsofmanyofthe6toolsweareevaluating.

Emailstewardswhodoneedtorecordandcaptureprovenancewillgenerallyneedamechanismtodothiswhenevertheyareprocessing,creatingorchangingdata.Thismeansthateitherthetoolstheyuseforprocessingneedtocaptureprocessinghistorydirectly,ortheyneedsomeabilitytotrackprocessinghistorymanuallyandstoreitappropriately.

2.5. Support for maintaining the identity and integrity of data

Asdataismoved,migratedorprocessedbydifferenttools,emailstewardsneedtobeabletoensurethattheidentityandintegrityofthedatatheyareprocessingisnotcompromised.

Maintainingtheidentityofthedatasetdependsinlargepartuponusingidentifierstolinkittoitsdescriptiveandadministrativemetadata,andensuringthatthislinkcannotbebroken.Mosttoolsgenerateuniqueidentifiers,buttheseareusuallylocal(assigned,storedandmaintainedwithinthetoolitself).Externalidentifiersmaybesupported,eitherinformally(e.g.byrecordinganaccessionnumberaspartofadirectorystructureorfilename)ormoreformally(asinhavingafieldwithadeclareddatatypethatalignstotheidentifierusedbyanothersystem).Somesystemsalsosupportidentifiersthatreferexplicitlytoexternalresourcesorauthorities(aconceptunderpinninglinkeddata).

Maintainingtheintegrityofdigitalobjectsisoftenachievedusinghashesorchecksums,withregularverification,toensurethatthecontentoftheingesteddatahasnotbeenalteredovertime.Thehashesorchecksumscanbeassignedtoboththeoriginalingestedcontentandtoanynormalizedorotherwisemodifiedversionsthatmaybegeneratedfromthatcontent.Hashesorchecksumsmayalsobeassignedtoassociatedmetadata.

Anothercommonpracticetosafeguardtheintegrityofdataistopackagecontentandmetadata‘together’fortransfer,reducingtheriskofcorruptionorloss(i.e.linksbetweenthetwobreakingatsomepoint).

2.6. System access and documentation to support interoperability

Abasicrequirementistheabilitytoaccessandusethesoftware,bothtechnicallyandwithappropriatepermissionsorlicensing.

Allofthecapabilitiesmentionedabovearelessusefulinpracticeifknowledgetousethemisnotcapturedwell.Technicalanduserdocumentation,trainingmaterialsandtrainingresources(i.e.trainersforhire)alladdtotheabilitytousethetoolaspartofanintegratedworkflow.Thestartingminimumisdocumentationonhowtousethetoolatall.Ideallyaknowledgebasewouldaddresstheexchangeof

Page 8 of 22

data,interoperabilitywithothersystemsandanylicenserequirements.

3. Analysis of Tools using the Requirements for Interoperability Framework

3.1. Archivematica

Archivematicaisanintegratedsuiteofopen-sourcesoftwaretoolsthatallowsuserstoprocessdigitalobjectsfromingesttoaccessandtoimplementpreservationplans.Usersmonitorandcontrolingestandpreservationmicro-servicesviaaweb-baseddashboard.ArchivematicausesMETS,PREMIS,DublinCore,theLibraryofCongressBagItspecificationandotherrecognizedstandardstogenerateArchivalInformationPackages(AIPs)forstorageinexternalrepositories.

Requirement SupportingFunctionality Observations

Supportfordatatransmission

Digitalobjectsneedtoresideinalocallyaccessiblefilesystemforingest.ArchivematicaisprovidedwithanaccompanyingapplicationcalledStorageServicesthatcanbeusedtoconfigureaccesstosourcesofdataforingest.ThereisanAPItoassignaccessionnumbers,butnodirectsupportformovingdataacrosshardware,networksetc.

Therearenumerousexternaltoolsavailableformovingdata.

Supportforstandardformats

Anydigitalobjectcanbeingested,soanyemailformatcanbeprocessedwithcorefunctionality.EmailinputinMBOXformatcanbeprocessedusingadditionalfunctionality(extractingattachmentsandmetadata).EmailinputinmaildircanbenormalizedandoutputasMBOX.TheBagItfilepackagingstandardissupportedforinputandoutput.Metadatainputincsvorjsonformatscanbeprocessed.Additionalmetadata(inotherformats)canbeincludedbutnotprocessed.Metadataoutputsarewellsupportedbywidelyadoptedstandards(METS,DublinCore,PREMIS,Bag)

NosupporttonormalizetoEMLformat(widelyusedemailformat).

Supportforappropriatescopeofdata

Transfer,Submission,ArchivalandDisseminationpackagescanbestructuredanddescribedusinganydefinitiontheuserchooses.Forexample,anemailaccountoraccountscanbeingestedasoneormoreSIPs,andmultipleSIPscanbecombinedintooneormoreAIPs.Somekeymetadata,suchasrightsmetadata,canonlybeinputorassignedduringprocessingatthepackagelevel.

Providescompleteflexibilitybutnonativesupportforcommonemailgroupings(e.g.account,folderetc.)Rightsmetadatacan’tbeassignedtoindividualemails,souserswouldhavetomanuallystructureinputsandoutputstoreflectdifferentrights(e.g.createoneAIPorDIPforrestrictedemails,andonefor

Page 9 of 22

non-restrictedemails).

Abilitytotrackprocessinghistoryandprovenance

ProvidesextensivefunctionalitytotrackprocessinghistoryandrecordusingPREMISProcessinghistoryfromexternalsourcescould“travelwith”anydatasets,butcurrentlynoabilitytomergeorconsolidateprocessinghistoryfrommultiplesystems.

Emailstewardscouldcreatemanualprocessestomaintainmultipleprocessinghistoryfiles.

Supportformaintainingtheidentityandintegrityofdata

ArchivematicaassignsUUIDstoallingestedobjectsandusestheUUIDsandIDattributesintheMETSfilestomaintainlinksbetweendigitalobjectsandtheirmetadata.Archivematicaalsosupportsawiderangeofexternalmetadata,sothereareseveralwaysexternalidentifiers(i.e.fromothertools)canbemaintained.Howeverthereisnodirectsupportfortyped/declaredexternalidentifiers(e.g.automaticallyaddingidentifierswhenimportingfromanexternalsystem).Fixityverificationissupportedusingbothinternallyorexternallycreatedhashes.

Emailstewardscouldcreatemanualprocessesforaligningandmaintainingreferentialintegrityacrosssystems(butmayneedtoplanthis-e.g.aligningpackagestructuretoexternalidentificationsystems)

SystemAccessandDocumentation

Documentationavailable,communitysupportwebsite/groups,aswellasforhireservicesforconsultancy,trainingetc.SourcecodeandtechnicalinfoavailableonGitHub.Documentationcanbequitetechnical.

3.2. ArchivesSpace

ArchivesSpaceisanopensource,webapplicationformanagingarchivesinformation.Theapplicationisdesignedtosupportcorefunctionsinarchivesadministrationsuchasaccessioning;descriptionandarrangementofprocessedmaterialsincludinganalog,hybrid,andborn-digitalcontent;managementofauthorities(agentsandsubjects)andrights;andreferenceservice.Theapplicationsupportscollectionmanagementthroughcollectionmanagementrecords,trackingofevents,andagrowingnumberofadministrativereports.Theapplicationalsofunctionsasametadataauthoringtool,enablingthegenerationofEAD,MARCXML,MODS,DublinCore,andMETSformatteddata.

(summary taken from: https://archivesspace.atlassian.net/wiki/display/ADC/ArchivesSpace)

ArchivesSpaceisnotadigitalassetordocumentmanagementsystemandcannotmanagedigitalfilesordigitizationworkflows.Thedigitalobjectsmodulecanbeusedtodescribedigitalobjectsandlinktodigitalfilesstoredelsewhere.ThemetadatacreatedcanbeexportedtoothersystemsasMODS,METS,orDublinCoreormadepubliclyaccessiblethroughthebuilt-inpublicinterface,thoughtheviewersin

Page 10 of 22

thepublicinterfacearemorelimitedintheirfunctionalitythanthoseofadigitalassetmanagementsystemordigitalrepository.

(detailondigitalobjectstakenfromFAQ:http://www.archivesspace.org/faq)

Requirement SupportingFunctionality Observations

Supportfordatatransmission

ArchivesSpacedoesnotprovideameansofmovingorstoringemailcontent.MetadatacanbeexchangedasfilesorthroughasetofAPIs.

Supportforstandardformats

ArchivesSpacesupportsarangeofwellestablishedstandardsfordescribingarchivalrecords-EAD,MARCXML,MODS,DublinCore,andMETSformatteddata.ArchivesSpacedoesnotsupportfunctionalityorprocessingofemailcontent(i.e.normalisation,searchoridentificationofauthoritiesetc.)

Supportforappropriatescopeofdata

ArchivesSpaceprovidesfunctionalityfordescribingthearrangementandrelationshipsofdigitalobjects.Itdoesnotsupportemailspecificconceptsdirectly(e.g.thenotionofanemailaccount)

Itcouldbeusefultoestablishconventionsorbestpracticesfordescribingemailaccountsandtheirpotentialrelationshipstocollections,agentsetc.

Abilitytotrackprocessinghistoryandprovenance

Supportformaintainingtheidentityandintegrityofdata

Supportforidentifiersandintegrityinternallywithinarepository.Thesystemsupportsstructuredcaptureofagentsandsubjectswhichwillimproveconsistencyandaccuracyofdescription

SystemAccessandDocumentation

ArchivesSpaceisanopensourceprojectwithconsiderabledocumentationavailable.ItissupportedbytheLyrasisorganisationwithfulltimestaffwhoaredevelopersandsubjectmatterexperts.

3.3. BitCurator

TheBitCuratorEnvironmentisbuiltonastackoffreeandopensourcedigitalforensicstoolsandassociatedsoftwarelibraries,modifiedandpackagedforincreasedaccessibilityandfunctionalityfor

Page 11 of 22

collectinginstitutions.TheBitCuratorsoftwareisfreelydistributedunderanopensourcelicense.ItcanbeinstalledasaLinuxenvironment;runasavirtualmachineontopofmostcontemporaryoperatingsystems;orrunasindividualsoftwaretools,packages,supportscripts,anddocumentation.

KeyfeaturesofBitCuratorinclude:

● Pre-imagingdatatriage● Forensicdiskimaging● Filesystemanalysisandreporting● Identificationofprivateandindividuallyidentifyinginformation● Exportoftechnicalandothermetadata

(summarytakenfrom:http://www.bitcurator.net/bitcurator/)

Requirement SupportingFunctionality Observations

Supportfordatatransmission

BitCuratordoesprovidesupportformigratingdatawithoutalteringitinanyway,startingwiththeconceptofcreatingforensicimagesbeforefurthertransmittingorprocessingdata.Uniquelyamongthetoolsconsideredhere,BitCuratorprovidessoftwarewrite-blockingfunctionalitytoensuretheintegrityofsourceobjects.

Asthisisanareanotwellsupportedbyothertools,itcouldusesomeelaboration/detail.

Supportforstandardformats

SupportsDFXML(DigitalForensicsXML)thatenablestheexchangeofstructuredforensicinformation.BitCuratorgeneratesPREMISmetadatawhentheuserrunsseveralofitscoredataforensicstools,providingarecordofkeyprocessingevents.Providessomeprocessingsupportforemail-e.g.usingreadpsttoconvertPSTemailobjectsintoMBOX.AlsosupportsBAGformatforoutput.

Supportforappropriatescopeofdata

TheBitCuratorenvironmentincludesnumerousapplicationstobeusedfordifferentpurposes,toberunagainstindividualitemsorcollectionsofterms.Oneofthemostcommonlyusedtoolsisbulk_extractor,whichcanbeusedtoidentifypotentiallysensitiveinformationondisks,diskimagesordirectories.Othercoretools,includingfiwalkandotherspecializedreportingtools,aredesignedtoberunagainstentirediskimages.Whenrunagainstadiskordiskimage,bulk_extractorreportsonthelocationofpatternsbasedabyteoff-setontothedisk.Otherreportingtools,includingfiwak,generatemetadatabasedonthefilesystem(filesandfolders).Inthecaseofemail,thefileswouldbelikelyinformatssuchas.pstormbox.Thosewishingto

Page 12 of 22

generatemetadataassociatedwithspecificmessageswithinthosecontainerfilescouldusereadpstandpipeitsoutputtoothercommand-linetools.BitCuratorisprimarilyconcernedwithidentificationanddescriptionofdigitalobjectsratherthanarrangement.

Abilitytotrackprocessinghistoryandprovenance

BitCuratorgeneratesPREMISmetadatawhentheuserrunsseveralofitscoredataforensicstools,providingarecordofkeyprocessingevents.

Emailstewardscouldcreatemanualprocessestomaintainmultipleprocessinghistoryfiles.

Supportformaintainingtheidentityandintegrityofdata

BitCuratorprovidessupportforindexing,characterizinganduniquelyidentifyingallcontentonadiskordiskimage.Bitcuratorsupportscreationandvalidationofhashes/checksums.

SystemAccessandDocumentation

BitCuratorisanopensourceprojectwithconsiderabledocumentationavailable.

3.4. DArcMail

DArcMail(forDigitalArchiveMailSystem)wascreatedbytheSmithsonianInstitutionArchives.DArcMailprovidesnormalization,itemlevelandbulkprocessing,intellectualarrangement,searchcapability,packagingandaccessfunctionalityforemail.

Requirement SupportingFunctionality Observations

Supportfordatatransmission

Digitalobjectsneedtoresideinanaccessiblefilesystemforingest.

Supportforstandardformats

EmailinputrequiresMBOXastheoriginalformatorasaninterimnormalizationformat.EmailinputinMBOXformatcanbeprocessedwithallcorefunctionalityincludingexportingpreservedemails,emailcollectionsoremailaccountsintheEMailAccountXML(EMA).EMAisacomprehensiveXMLschemadesignedforRFC5322compliantpreservationpurposesappliedtothefullrangeofemailobjects,i.e.,singlemessagetowholeemailaccount.AllelementsoftheoriginalemailisretainedinthepreservationEMAXMLoutput.User-definedsubsetsofemailmessagescanbecreatedandexportedinMBOXorEMAXMLformats.

NosupporttonormalizetoEML.TheEMAXMLschemaisnotwidelyadopted.Itisfullyimplementedintwootheremailarchivingtools,orinlimitedfashioninacoupleotherapplications.

Page 13 of 22

Supportforappropriatescopeofdata

DArcMailallowsuserstointeractwithemailsonanindividual,grouporaccountbasis.Complexsearching,filteringandmessagethreadtracking.Attachmentscanbesearched,viewedandseparatedfromemail.

Abilitytotrackprocessinghistoryandprovenance

TheDArcMailtoolisdesignedtobeusedforinitialappraisalandthenforpreservation(AIP)andaccess(DIP).ItnativelyretainsthelogicalarrangementoftheoriginalaccountinboththeAIPandDIPpackages.ItsflexibilityallowsforcreationofcustomsubsetsofemailforcreationofspecializedAIPsandDIPs.

TransferandaccessioningofemaildigitalobjectsoccuroutsideoftheDArcMailworkflow.Non-technicalmetadatasuchasrightsmetadatamustbecapturedandmaintainedinaseparatesystemormanually.

Supportformaintainingtheidentityandintegrityofdata

DArcMailmaintainsallUIDspresentintheoriginalemails.ItgeneratesSHA-1checksumsforeachmessageandforemailaccountsasawholewhichareembeddedintheEMApreservationformat.DArcMailalsoproducesexternalmetadataincludingthechecksumforeachmessagepreserved.

Theinternalmessageandaccountchecksumsareretainedevenifthepreservedemailaccountismovedtofromonerepositorytoanother.

SystemAccessandDocumentation

DArcMailisnotcurrentlyavailableoutsideoftheSmithsonian.Limiteddocumentationispubliclyavailable.TheSmithsonianintendstoreleaseitasopensourcewhentime/effortallows.

Makingthetoolpubliclyavailableisapreconditionforanyothercommunityusers.

3.5. Electronic Archiving System (EAS)

HarvarddevelopedtheEAStooltoenablearchivalprocessingofemailmessagesandattachmentsandautomatetheprocessofmakingdepositstoHarvard'spreservationrepository.Keyfeaturesinclude:

● NormalizationtoEML--anopenstandardforpreservation(anextensionofIMFRFC5322)--forlongtermpreservation.

● Summaryviewsofthemetadataassociatedwithemailorattachmentswithinaresultset.

● Batchanditemlevelprocessingoptionsforarchivists.

● Longtermpreservationofemailandattachmentsinasecureenvironmentapprovedforsensitivedataissupportedbyautomatedpackagingandtransfertothepreservationrepository–DigitalRepositoryService(DRS).

● CaptureofessentialrightsmanagementinformationusingPREMIS.

Page 14 of 22

● CaptureofsignificanteventstrackingtodocumentdeletionsofemailandattachmentsandformattransformationssuchastheconversionofthenativemailformattoEML.

(featurelisttakenfrom:http://hul.harvard.edu/ois/systems/eas/)

Requirement SupportingFunctionality Observations

Supportfordatatransmission

Dataneedtobemovedtoa‘dropbox’(directoryspaceinHarvardsystems).EASdocumentationdescribeshowtouseasecureFTPclienttomovethedatabutthisisnotpartoftheEASsolution.

Therearenumerousexternaltoolsavailableformovingdata.

Supportforstandardformats

EmailcontentcanbeinputinMBOXorPSTformat(whichcoversthemajorityofemailclientstandardsforoutputofemail).Attachmentobjectsofanytype(e.g..ppt,.doc)canbeembeddedintheemailsorprovidedseparately.Itisnotpossibletoinputmetadata(beyondthatcontaineddirectlyinMBOX/PSTorattachmentformats).EmailisoutputtoEMLformat,withattachmentsextracted.Overallmetadataiscapturedandoutputusingwellestablishedstandardformats(e.g.METSandMODS)andbothrightsandprocessinghistoryarecapturedinPREMIS.SomereferencemetadataisinlocalformatdefinedbyHarvard(forpackets,collectionsetc.),asismetadatarelatingtosecurity(access)andsensitivity(usinglocallydefined‘flags’).

Emailcontentformatswellsupported.WhileEMLformatforoutputisawellestablishedstandarditisnotacceptedbyallothertoolsforinput.Securityandsensitivitymetadatacouldpotentiallybecapturedusingmorewidelyusedstandard.ReferencingmetadatagearedtowardsHarvardintegrationwithDRSsystem.Maynotbeanyneedtostandardizethis,butsupportforexternalIDswouldenablebetterinteroperabilitywithothertools.

Supportforappropriatescopeofdata

Submissionpacketscanbestructuredanddescribedusinganydefinitiontheuserchooses.Itisnotpossibletoinputadditionalmetadataorcontentbeyondemail/attachments.Processingworkcanbecompletedatindividualitemlevel(emailorattachment)oratvariouslevelsofgrouping(folder,collectionetc.).Additionalgroupingscanbeadded(collectionsorseries).Outputswillalwayscontainthesamepacketstructureastheassociatedinput.Outputcontainsnormalized/processedcontent;doesNOTcontainoriginalinputfiles(i.e.inMBOXorPSTformat)

Providessupportforgrouping(incollectionsetc.)Inabilitytoinputadditionalmetadataorcontentsuggeststhistoolmayworkbestat‘start’ofaworkflow.Stewardswillneedtothinkthroughmanualprocessesformanagingmetadatacreatedusingothertools.

Abilitytotrackprocessinghistoryandprovenance

ProvidesfunctionalitytotrackprocessinghistoryandrecordusingPREMIS.Noabilitytomergeprocessinghistorywiththatfromothertools.

Emailstewardscouldcreatemanualprocessestomaintainmultipleprocessinghistoryfiles.

Page 15 of 22

Supportformaintainingtheidentityandintegrityofdata

Identifiersareinternal(e.g.EASmessageID)orlocaltoHarvard(e.g.DRScodesareforHarvardrepository).Integrationwith‘Wordshack’applicationensuressomedescriptiveoridentificationinformationisbasedoncontrolledvocabulariesusedinHarvard(i.e.alsointegratedwithHarvardDRSrepository).Thisimprovesconsistencyinuseofadmincategoriesandtopics,andimprovesidentificationqualityforpersonsororganisations.

Supportforexternalreferencingsystemswouldbetterenablemulti-toolworkflows.UseofcontrolledvocabularieslimitedtoHarvardcurrently-couldbeseveralapproachestoextendthis-e.g.publishingthosevocabulariesasopendata,orenablinguse/integrationofother(e.g.linkedopendata)vocabulariesasalternatives

SystemAccessandDocumentation

UserdocumentationavailableandsupportforHarvardusers.SystemisnotcurrentlyavailablebeyondHarvardusers.

AprojecthasbeenproposedtoreleasesystemasOpenSourceproject;butsometechnicalworkrequiredtomakereadyformoregenericuse.

3.6. ePADD

ePADDisasoftwarepackagedevelopedbyStanfordUniversity'sSpecialCollectionsandUniversityArchivesthatsupportsarchivalprocessesaroundtheappraisal,ingest,processing,discovery,anddeliveryofemailarchives.Theuserguide(https://docs.google.com/document/d/1joUmI8yZEOnFzuWaVN1A5gAEA8UawC-UnKycdcuG5Xc/edit#)providesthefollowingdescriptionofthemajormodulesinthesystem:

Appraisal:Allowsdonors,dealers,andcuratorstoeasilygatherandreviewemailarchivespriortotransferringthosefilestoanarchivalrepository.

Processing:Providesarchivistswiththemeanstoarrangeanddescribeemailarchives.

Discovery:Providesthetoolsforrepositoriestoremotelysharearedactedviewofemailarchiveswithusersthroughawebserverdiscoveryenvironment.

Delivery:Enablesarchivalrepositoriestoprovidemoderatedfull-textaccesstounrestrictedemailarchiveswithinareadingroomenvironment.

Requirement SupportingFunctionality Observations

Supportfordatatransmission

Theappraisalmodulewillacceptemailfilesdirectly(fromalocalfilesystem)andalsohastheabilityconnectdirectlytoemailserverstodownloademailusingIMAP.Othermodulesrelyonoutputs(files/directories)fromotherePADDmodules(i.e.appraisaloutputis

Therearenumerousexternaltoolsavailableformovingdata.Theabilitytoconnectdirectlytoemailserverisuniqueandsimpleifonlytransportingemailcontent(i.e.noadditional

Page 16 of 22

inputtoprocessingmodule,processingmoduleoutputisinputtodiscoverymoduleetc.)

content/metadata).

Supportforstandardformats

EmailcontentcanbeinputinMBOXorbydirectlyconnectingtoemailserver(thereforeexcellentsupportifonlyinterestinginingestingemailcontent).Itisnotpossibletoinputothercontent(attachments)orMetadata(beyondthatcontaineddirectlyinMBOXformat).EmailisoutputtoMBOXformat.AttachmentsareNOTextractedseparately.Metadatathatlinkscorrespondents,people,organisationsorlocationstoexternalauthorities(e.g.LCSubjectHeadings)canbeoutputwithURIsthatrepresenttheentitybytheexternalauthority.

Whiletheformatforwrappingmetadataappearstobenon-standard,theprocessforassigningthemetadataformanydescriptiveelements(correspondent,locationetc.)usesexternalauthorities(linkeddata)whicharewellestablishedstandardsforthosespecificvocabularies.

Supportforappropriatescopeofdata

ePADDingestsmaterialstructuredaroundaparticularpersonwhomayhavemorethanoneemailaccount.Itdoesnotappeartoofferthewiderflexibilityofallowinguserstoentertheirownarbitrarilydefined‘packets’.Itisnotpossibletoinputadditionalmetadataorcontentbeyondemail/attachments.Processingworkcanbecompletedatindividualitemlevel(emailorattachment)oratvariouslevelsofgrouping(folder,collectionetc.).Additionalgroupings,suchascollectionsorseries,canbeadded.Scopeofoutputscanvaryasuserscanselectindividualemailstoincludeorexclude.Onlydescriptivemetadatacanbeoutput(butnothingforrights,sensitivity,processinghistoryetc.)ePADDallowsforthere-useorsharingoflexiconfilesforentityanalysis.Lexiconfilesenablefulltextsearchingonarangeofdifferentterms,enablingstewardstoconductcomplextieredsearches.

Metadatacan’tbeinputwithemailcontent.Metadatacan’tbeoutputexplicitly,butisusedinprocessingsostewardscoulddefineworkflowsthatenablethemtoaligntothesemanually.forexample,thecartfunctionalitycanbeusedtoselectonlyemailswithacertainrightsvalueforoutput;thenrepeatforothervalues,creatinganMBOXoutputfileforeachmetadatavalue.

Abilitytotrackprocessinghistoryandprovenance

Notavailablecurrently.

Asnotedabove,couldbesomescopeformanuallyoutputtingdatathatisgroupedaroundaparticularprocessing‘event’-butnodirectsupportformaintaining,muchlessmerging,processinghistory.

Page 17 of 22

Supportformaintainingtheidentityandintegrityofdata

Identifiersareinternal(e.g.ePaddmessageID)IntegrationwithexternalauthoritiessuchasLCSubjectHeadings(FAST)ensuresconsistencyandimprovesaccuracyinapplyingdescriptivemetadata.

Supportforexternalreferencingsystemswouldbetterenablemulti-toolworkflows.LinkedopendataapproachfordescriptivemetadataisuniquetoePADDbutcouldbehelpfulifadoptedbyothertools.

SystemAccessandDocumentation

Userdocumentationavailable;technicaldocumentationandcodeavailableonGitHub.

4. Key Findings: Analysis of Tools and Email Tools Data Sharing Framework

Thissectionsetsoutanalysisandfindingsforeachofthe‘requirementsforinteroperability’basedonourunderstandingofthecapabilitiesavailableacrossallofthetoolstoday.Withtheexceptionofsomespecificintegrations(e.g.ArchivematicaandArchiveSpace),thesetoolswerenotdesignedtointeroperatewitheachother,andsotherearenaturallyanumberofchallengesorrisksintryingtodothatasthetoolsstandtoday.

4.1. Current state of data transmission

● Datatransmissionis,ingeneral,consideredoutofscopebythesetools.● Thereisarisktothechainofcustodyinherentinanyattempttochaintoolstogether.The

primaryriskistometadatathatispartofthedigitalobjectitself(e.g.createdon,createdby,modifiedon,modifiedbyetc.)whichcaneasilybechangedorlostaspartof‘moving’datafromonefilesystemtoanother.

● Manyofthesetoolsattempttominimizethisriskinternally,e.g.,Archivematica,Bitcurator,DArcMail,EAS,allbundleseveraltoolsinternallyandmanagedatatransmissionbetweenprocessingsteps.

4.2. Use of standard formats

● Emailcontentformostsystemsisbasedonwell-establishedformats,particularlyMBOXandEML.SofarallsystemscaninputMBOX.

○ EASoutputsonlyEMLandnotalltoolssupportthisasaninput.● Somesystemssupportonlyverylimitedemail-specificprocessing(e.g.Archivematica)andsome

donotatall(ArchiveSpace)-butasthesesystemsaredesignedtotakeinvirtuallyanydigitalobjectsthisisnotabarrierfortheirmoregenericprocessingcapabilities

● Identificationorreferencingmetadataisoftenexpectedina‘format’thatisnonstandardinseveralcases.MessageIDs,repositoryID,collectionIDareoftentiedtospecificexternalsystems(EASwithDRS,DArcMailwithCMS).

● PREMISisthestandardusedtocaptureprovenanceorprocessinghistorymetadataandrightsmetadata(forthosesystemsthatrecordthismetadata).

Page 18 of 22

● TheLibraryofCongressBagItstandardisafilepackagingformatusedbyatatleasttwoofthetools(ArchivematicaandBitCurator).

4.3. Scope of email data or metadata exchange

● Therearenosignificantbarrierstoexchanginganyparticularscopeofemailcontent,withtheexceptionthatsomesystems(e.g.ePADD)assumethatemailisdealtwithormanagedonanaccountbasis,whereanaccountistheemailassociatedwithonlyoneindividual.Inotherwords,theusercouldnotinputallemailsforanentireorganisationandprocessthemtogetheratonce(whilemaintainingallindividualaccountlevelmetadata).

● Severaltoolshavelimitationsonthescopeofmetadatathatcanbeinputoraccepted:

○ EAS,ePadd,DArcMaildonotacceptanymetadataasaninput

● Severaltoolshavelimitationsonthescopeofmetadatathatcanbeoutput:

○ ePADDdoesnotallowformanytypesofmetadatatobeoutput

4.4. Capabilities for recording provenance and/or processing history

● Ifmaintainingafullprocessinghistoryisnecessary,thenitmaynotbefeasibletousesystemsthatdon’tsupportthis(ePADD,DArcMail).

4.5. Capabilities for maintaining identity and integrity of data

Use of unique identifiers:

● Mosttoolsgenerateuniqueidentifiersfordataatvariouslevelsofgranularity(someforindividualemail,virtuallyallforaggregationsofsometypesuchasfolder,account,collectionetc.).

● Mosttoolsdonotacceptorstore‘external’identifiers(i.e.uniqueIDscreatedbyothersystems).Thismaypresentchallengeswhenusingmultipletoolsbecausetherearelimitedwaysofensuringthataparticulardataitemorgroupofdataiscorrectlyidentified(forinstance,iflookingataparticularemailinonetool,isthereawayofconfidentlyfindingandprocessingthesameexactemailinanothertool).

● Sometoolsdoprovidesomemeansofcapturingexternalidentifiers(e.g.inArchivematicabyprovidingIDswithinametadatacsvfileatthepointoftransfer).Howevernoneofthetoolsappeartosupportthisatthelevelofindividualemails.

Definition of key elements and aggregations:

● Manyofthetoolsallowuserstodefinetheelementsoraggregationsthatsuitthembest.Thisflexibilityisastrengthbutcouldleadtosomeconfusionifelementsoraggregationsarenotdefinedconsistentlybetweensystems.

Page 19 of 22

● ThedefinitionofanEmailAccountisprobablythemostsignificantconcernasitappearstobedefineddifferentlyindifferentsystems.Anemailaccountinonetoolmayappeartobesameemailaccountwhenviewedorprocessedinanothertool,buttheriskisthatitisn’tbecausethedefinitionsarenotconsistent.Thereisalsotheriskthatthedatamodelsarenotcompatible-forinstanceifonesystemonlyallowsoneemailaddressperaccountwhereanotherallowsmultipleaddresses.

4.6. System access and documentation

● Alloftheopensourcesystemshavepubliclyavailabledocumentationorknowledgeresources,howeveraccesstodevelopersorsubjectmatterexpertsmaynotbepubliclyavailable.

● NeitherEASnorDArcMailarecurrentlyavailablebeyondtheirinstitutions.Bothprojectteamsintendtoreleasethemwithopensourcelicenses,butworkisrequiredtodothisandmakethesoftwareavailabletothecommunity.

5. Opportunities to Improve the Interoperability of Email Tools

Severaldraftrecommendationsaresuggestedbelowfordiscussion.Atthisstagenoefforthasbeenmadetoprioritizetheseorsetoutconcretenextsteps.Wehavekeptthescopeofthesetoareasthatwefeeladdresstheinteroperabilityofthespecifictoolsassessedinthisreport.

Wehavenotmadeanyspecificrecommendationsregardingthechallengesoftransmittingdatabetweensystems.Whiletherearesomeclearrisks,asdescribedinthefirstpartofsection4.1(suchaschainofcustodyandfileintegrity),wefeelthata)theseareverybroadandapplytoallformsofpreservationusingmultipletoolsandb)theextentoftheproblemisnotwelldefinedoragreedon;forexample,someinstitutionsmaynotseeanyproblemswithdatatransmissionprotocolsthathappenbeforeformalaccession.Whilewefeelthisareawarrantsfurtherconsideration,thatmaybeoutsidethescopeofconcernforthisreport.

5.1. Enhance tools to support external reference identifiers

Attheveryleast,toolsneedtobeabletoacceptandmaintainexternalidentifierssothatemailstewardscankeeptrack(atmultiplelevelsofgranularity)whatdataisbeingprocessedthroughoutaworkflow.

Ingeneral,emailstewardsshouldbeabletousetheidentifiersforindividualitems,foldersorothergroupingsfromonesystemwhenexportingdataandcarryingoutfurtherprocessinginanothersystem.

Ideallyexternalidentifierswouldalsobecapturedwhencapturingprocessinghistorysothatitispossibletoclearlytrackthechainofcustody(forexamplebyassociatingtheidentifierwiththePREMISagentinvolvedinprocessing).

Page 20 of 22

5.2. Adopt standard approaches to capturing and respecting rights and sensitivity metadata

Giventhatemailcollectionsoftencontaincontentwithavarietyofdifferentrights,andthatthereisawidespectrumofprivacyandconfidentialityissuesthatcanbeinvolved,emailarchivingtoolsshouldsupportstandardwaysofcapturingrightsorsensitivitymetadata.

Manysystemsalreadyusestandardsforrights(forinstanceusingPREMISrightsentities);however,theredoesn’tappeartobeanequivalentapproachforrecordingsensitivityorprivacyinformation.

5.3. Establish MBOX as minimum standard for input and output of email content

MBOXisthemostwidelyusedstandardamongstthetoolsconsideredhere.EMLisalsoawidelyusedstandardandsupportedbyamajorityofemailclients.TheEAXSstandardusedinDArcMailmaybemorecomprehensivebuthassofarnotbeenwidelyadoptedandtherearenotoolsfordiscoveryandaccessinthatformat.

WethereforerecommendthattoolprovidersconsideraddingMBOX--complyingwithRFC4155(ApplicationMBOXMediaType)andRFC5322(InternetMessageFormat)--asastandardforbothinputandoutput(wherethatdoesn’talreadyexist).Thisdoesn’tnecessarilymeanobsoletinguseofEMLorEAXS,butsimplyprovidingadditionalsupporttoenablemaximuminteroperabilitybetweentools.

5.4. Establish a common exchange standard for packaging email with metadata

Astandardforpackagingdigitalcontent,describingthecontentsofthepackageandensuringintegrityofthepackageusinghasheswillgreatlyimprovetheabilitytotransferdatasafelybetweensystems.TheLibraryofCongressBagitstandardiswell-establishedandisalreadyusedbyatleasttwoofthetoolshere(ArchivematicaandBitCurator).

TheBagItstandardmaynotbeenoughinitselfhowever.Whilerecommendation5.3wouldensurethatemailcontentcanbetransferredusingtheMBOXstandard,additionalstructuralandmetadatastandardsmaybeneededtodefineminimumexpectationsforwhatcontentormetadataisrequired,optionaloracceptable.Forexample,toclarifywhetheritisacceptabletopackagemultipleemailaccountstogether.

5.5. Support capture of processing history

SeveraltoolsrecordprocessinghistoryusingthewellestablishedPREMISstandard.

Ideallyalltoolswouldprovidethiscapabilitysothatcomprehensiveprocessinghistorycanbecapturedthroughoutaworkflowusingmultipletools.

Page 21 of 22

Further consideration should be given to the consolidation of processing history files from differentsystems, or the ability tomanually addprocessing history (to fill any gapswhere a tool does not yetrecorditautomatically).

5.6. Establish standard definition and description of email collections

Itisn’tclearthatthedefinitionofwhatconstitutesanemailaccount(includingtherelationshipwithemailaddresses,orpeople)isconsistentbetweentools.Establishingacommondefinitionwillenablealignmentofdifferentdatamodelsusedandreducetheriskofconfusionormis-identificationofemailcollectionsatthisfundamentallevel.

Withaconsistentandstandarddefinition,itwillthenbepossibletodevelopacommonstandardfordescribingemailaccounts.Thiswouldhelpimprovetheprecisionofsearchanddiscoveryandbetterenabletheexchangeofdescriptivemetadatabetweentools.

5.7. Make local tools publicly available with an open source license

Toolsthatareonlyusablebyoneinstitutionarenotusefultothewideremailarchivingcommunity.Whilethereareclearlycoststomakingatoolmorewidelyavailableandtryingtocreateandmaintainanactivecommunityaroundit,wefeeltherearemanybenefitsthatcanoffsetthosecostsinthelongrun,includingopeninguptheprojecttoawiderbaseofdevelopers,testersandpotentialfunders.

Acknowledgements

ThisprojectbuiltonthegreatworkstartedattheHarvardEmailArchivingStewardshipTools(EAST)workshopinMarch2016.Wewouldliketothanktheoriginalparticipantsandacknowledgethemanycontributionsreceivedsince.

InparticularwewouldliketothankthecontributorstotheEmailDataSharingFramework;GlynnEdwards,JoshSchneiderandPeterChan(StanfordUniversity),AndreaGoethals,GrainneReillyandSkipKendall(HarvardUniversity),SarahRomkeyandJustinSimpson(ArtefactualSystemsInc.)andCalLee(UniversityofNorthCarolinaChapelHill).

Numerousreviewersprovidedhelpfulcontributionsandsuggestionsforthisreport.WewouldliketothankEvelynMcLellan,JustinSimpsonandSarahRomkey(ArtefactualSystemsInc.),AnthonyMoulen,AndreaGoethalsandGrainneReilly(HarvardUniversity),ChrisProm(UniversityofIllinoisatUrbana-Champaign),CalLee(UniversityofNorthCarolinaChapelHill)andRiccardoFerrante(SmithsonianInstitutionArchives).

WewouldliketothankHarvardLibraryfortheopportunitytoengageinthisworkandprovidingsupportanddirectionthroughout.

FinallythankstoWendyGogel(HarvardUniversity)forcontributionsonmanyfrontsandprovidingleadershipfortheproject.

Page 22 of 22

Recommended