Department of Internal Affairs
Preservation in Practice: Archives New Zealand
Ross Spencer - @beet_keeper
Archives New Zealand
Open Preservation Foundation,
Thursday April 16 2015
Sun image, R24685027, E4, Archway,
Archives New Zealand.
http://www.archway.archives.govt.nz/ViewFullItem.do?code=24685027&digital=yes
Background
Born Digital and Cultural Heritage Conference, Melbourne*: http://bit.ly/1utAqz0
Spencer, Braden, Hutar, Masters, Crouch, Mosely: “Fly Away Home: Pilot Transfer of Born-digital Records at Archives New Zealand”
Collected our experiences from late 2013 through to early
2014. Royal Commission work through to GDAP Closure
and beginning of eAccessions.
* http://playitagainproject.org/conference-report/
Putting it into practice…
• Our first two ingests… code names E1 and E4.
• Began to evaluate process; document lessons
learned.
• Started to work on our next two ingests, E2, E3.
• Approach taken to disseminate knowledge
amongst two additional teams of archivists. One
team per transfer.
More background: http://www.slideshare.net/RossSpencer/the-reality-of-digital-transfer-archives-new-zealand
eAccession One [e1]
Legacy accessions where we had the opportunity to apply lessons learned from the initial digital transfers…
175 Files (166.5 MB)
10 Directories
0 Unidentified Objects
0 Unidentified Extensions
7 Known Formats
0 Duplicates (content)
eAccession Four [e4]
eAccessions were seen to be the least complex and allowed
us to focus, primarily, on the challenge of ingest…
1295 Files (565.0 MB)
6 Directories
2 Unidentified Objects
1 Unidentified Extensions
12 Known Formats
2 Duplicates (content)
Note: the original statistics obscured an issue… 2x false positives! Thumbs.db files were identified as the more generic OLE2 format.
Technical Challenges in e1 and e4
• [Tools] Ability to handle multi-byte character encodings. Maori macrons
‘Ā’.
• [Tools] Unidentified files and false positives.
• [Tools] Recording of pre-conditioning actions on ingest into digital
preservation system.
• [Tools] Implementing CSV ingest mechanism; configuration, code, and
workflow.
• [Pre-conditioning / Tools] Digital preservation system’s ability (Rosetta)
to handle contiguous spaces in filenames.
• [Pre-conditioning] One invalid JPEG. Required rearrangement of
application marker segments.
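Two of the filename challenges above, non-ASCII characters and contiguous spaces, can be checked before ingest. The following is a minimal sketch only, not the team's actual tooling; the function name and the sample filenames are hypothetical.

```python
import re


def filename_issues(name):
    """Return a list of pre-conditioning flags for a single filename."""
    issues = []
    # Any character outside 7-bit ASCII takes more than one byte in UTF-8,
    # e.g. the macron in U+0100 (capital A with macron).
    if any(ord(ch) > 127 for ch in name):
        issues.append("non-ascii")
    # Two or more spaces in a row, which some systems handle poorly.
    if re.search(r" {2,}", name):
        issues.append("contiguous-spaces")
    return issues


# Hypothetical filenames, for illustration only.
for example in ["report.doc", "\u0100 record  copy.jpg"]:
    print(repr(example), filename_issues(example))
```

Flagging rather than renaming keeps the original names intact, so any pre-conditioning action can still be recorded and reversed later.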
Questions already forming…
• How do we make reporting consistent?
• How do we ensure reliability?
• Approach our work with the scientific method?
• How do we socialise amongst those inexperienced in
digital preservation?
• How do we disseminate a skillset*?
*And quickly… i.e. not seven years…
DROID SQLite Analysis
• Some answers are already appearing: the stats report is now generated by a Python script written in response to these issues: https://github.com/exponential-decay/droid-sqlite-analysis
• Relies only on The National Archives’ DROID tool: a file listing, format identification, and checksumming utility.
• Consistent reporting.
• Repeatable.
• Fast.
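The general idea of loading a DROID export into SQLite and querying it for summary statistics can be sketched as follows. This is not the droid-sqlite-analysis script itself. The column names (NAME, TYPE, EXT, PUID, HASH) and the TYPE values are assumed to mirror a DROID CSV export, and the sample rows are invented for illustration.

```python
import csv
import io
import sqlite3

# Hypothetical sample rows; column names are assumed to mirror a DROID CSV export.
SAMPLE_CSV = """NAME,TYPE,EXT,PUID,HASH
a.doc,File,doc,fmt/40,aaa
b.doc,File,doc,fmt/40,aaa
c.xyz,File,xyz,,bbb
d.pdf,File,pdf,fmt/18,ccc
docs,Folder,,,
"""


def load(csv_text):
    """Load CSV text into an in-memory SQLite table called 'droid'."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE droid (NAME, TYPE, EXT, PUID, HASH)")
    rows = csv.DictReader(io.StringIO(csv_text))
    conn.executemany(
        "INSERT INTO droid VALUES (:NAME, :TYPE, :EXT, :PUID, :HASH)", rows
    )
    return conn


def summarise(conn):
    """Run the same fixed queries every time, so reports stay consistent."""
    def one(sql):
        return conn.execute(sql).fetchone()[0]

    return {
        "files": one("SELECT COUNT(*) FROM droid WHERE TYPE = 'File'"),
        "directories": one("SELECT COUNT(*) FROM droid WHERE TYPE = 'Folder'"),
        "unidentified": one(
            "SELECT COUNT(*) FROM droid WHERE TYPE = 'File' AND PUID = ''"
        ),
        "known_formats": one(
            "SELECT COUNT(DISTINCT PUID) FROM droid WHERE PUID != ''"
        ),
        "duplicate_content": one(
            "SELECT COUNT(*) FROM droid WHERE TYPE = 'File' AND HASH IN "
            "(SELECT HASH FROM droid WHERE TYPE = 'File' "
            "GROUP BY HASH HAVING COUNT(*) > 1)"
        ),
    }


print(summarise(load(SAMPLE_CSV)))
```

Because every figure comes from a fixed query, rerunning the report on the same export gives the same numbers, which is what makes the reporting repeatable.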
DROID SQLite Analysis
• And then:
GitHub Example: https://github.com/exponential-decay/droid-sqlite-analysis/blob/master/opf-test-corpus-test-output/opf-test-corpus-sqlite-analysis.txt
But it’s still not enough…
• How do we make it user friendly?
• “Everything should be as simple as can be, but not simpler.” –
Albert Einstein.
• Even when we talk about character encoding!
• Ā? 0x100? Non-Ascii? Unicode? UTF-8? Maori Capital Letter A,
with Macron?
• Thank you Cooper Hewitt!
Cooper Hewitt, and Unicode:
GitHub: https://github.com/cooperhewitt/py-cooperhewitt-unicode
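As an illustration of turning a character into something a non-specialist can read, the sketch below uses Python's standard unicodedata module as a stand-in. It does not use or represent the py-cooperhewitt-unicode API, and the describe function is a hypothetical helper.

```python
import unicodedata


def describe(ch):
    """Describe one character in several of the ways people refer to it."""
    return {
        "character": ch,
        "code_point": "U+{:04X}".format(ord(ch)),
        "utf8_bytes": ch.encode("utf-8").hex(),
        "ascii": ord(ch) < 128,
        # Falls back to "UNKNOWN" for characters without a Unicode name.
        "name": unicodedata.name(ch, "UNKNOWN"),
    }


# U+0100 is the capital A with macron used in the slides.
print(describe("\u0100"))
```

For U+0100 the official Unicode name is LATIN CAPITAL LETTER A WITH MACRON, and the character occupies two bytes in UTF-8, which is why tools limited to single-byte encodings struggle with it.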
So, HTML5 output is better, but…
• Need to improve definitions. Internationalization would be nice.
• Need to improve formatting
• Python difficult to distribute
• Remove some statistics: Duplicate file names as an example.
• Create new statistics e.g. blacklists created from continuous
evolution of appraisal
• How to promote the research of digital objects?
• Multiple copies of ~2000 files across network, or something else?
Collection breakdown before further
archival and technical appraisal…
• E2 (~291MB)
• 2519 Files
• 177 Directories
• 5 Unidentified Objects
• 4 Extension Only ID
• 1 Unidentified Extensions
• 22 Known Formats
• 25 Extension Mismatches
• 24 Duplicate Content
• 2 Multiple Identification
• E3 (~198MB)
• 1748 Files
• 144 Directories
• 8 Unidentified Objects
• 7 Extension Only ID
• 1 Unidentified Extensions
• 12 Known Formats
• 37 Extension Mismatches
• 17 Duplicate Content
• 3 Multiple Identification
We created the Rogues Gallery…
• Implemented by me and @AndreaKByrne
• Example Rogues: Duplicates,
Unidentified Files, Odd Character-
encodings, multiple-ids…
• ‘Art of the archivist’
• Enables copying of files, without
modification – and preserving context!
• Creates potential for split ingests
• Potential to spot format patterns.
• Repeatable, reliable, fast! We won’t get
it right first time. We will experiment!
• But still need to test, test, test!!!
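Selecting rogues from identification results might look something like the sketch below. It is an assumption about the approach, not the actual Rogues Gallery code: the record fields (path, puid, hash), the reason labels, and the sample records are all hypothetical.

```python
from collections import Counter


def find_rogues(records):
    """Map each rogue file path to the list of reasons it was flagged."""
    hash_counts = Counter(r["hash"] for r in records)
    rogues = {}
    for r in records:
        reasons = []
        if not r["puid"]:
            reasons.append("unidentified")
        if hash_counts[r["hash"]] > 1:
            reasons.append("duplicate")
        if any(ord(ch) > 127 for ch in r["path"]):
            reasons.append("non-ascii-name")
        if reasons:
            rogues[r["path"]] = reasons
    return rogues


# Hypothetical records, for illustration only.
records = [
    {"path": "docs/a.doc", "puid": "fmt/40", "hash": "aaa"},
    {"path": "docs/copy of a.doc", "puid": "fmt/40", "hash": "aaa"},
    {"path": "docs/c.xyz", "puid": "", "hash": "bbb"},
    {"path": "docs/d.pdf", "puid": "fmt/18", "hash": "ccc"},
]

rogues = find_rogues(records)
# One path per line: a plain list that a copying tool can consume.
print("\n".join(sorted(rogues)))
```

Keeping the reasons alongside each path leaves room for the archivist's judgement: the list says why a file was flagged, not what to do with it.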
Utilises Rsync…
• Rsync Manpage:
http://manpages.ubuntu.com/manpages/trusty/man1/rsync.1.html
• rsync -avr --files-from=opf-rogues-list.txt "/cygdrive/c/" "/cygdrive/c/working/rogues-gallery"
• -a Archive Mode, (preserves permissions, modification times, plus
more.)
• -v Verbose, -r Recursive
• Rsync being utilised to maintain filesystem metadata integrity within
our processes already.
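A script could assemble the same rsync invocation as an argument list, which sidesteps shell quoting problems with spaces in paths. This is a minimal sketch: rsync_command is a hypothetical helper, and writing the destination as an absolute path is an assumption about the intent of the command above. Note that -r is given explicitly because, per the rsync manual, the recursion normally implied by -a does not apply when --files-from is used.

```python
import shlex


def rsync_command(list_file, source, destination):
    """Build the rsync argument list without executing anything."""
    return ["rsync", "-avr", "--files-from=" + list_file, source, destination]


cmd = rsync_command(
    "opf-rogues-list.txt",
    "/cygdrive/c/",
    "/cygdrive/c/working/rogues-gallery",  # assumed absolute destination
)

# Print a shell-safe rendering for review before anything is copied.
print(" ".join(shlex.quote(part) for part in cmd))

# To actually run it (requires rsync to be installed):
# import subprocess
# subprocess.run(cmd, check=True)
```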
What other approaches were there?
• DROID Itself – Even contains an Apache Derby Database!
• Better support for SQLite, more flexibility to produce our own reports as
required.
• Rosetta – Technical Analyst Workbench – Even isolates files!
• Joint NLNZ/Archives NZ, Digital Preservation Policy – Pre-conditioning –
Forensic recording and reversibility of pre-conditioning actions taken.
• These tools are focused on doing other jobs better. DROID’s CSV, Rosetta’s
Long Term Preservation Capability, and support for access.
• C3P0 – promising, but! Too difficult to install/distribute…
What next..?
• Python is difficult to distribute. Re-write? – Go (Golang)? – The future
programming language of digital preservation?
• Also, create unit-tests. Trial Rogues Gallery output more.
• Incorporate metadata extraction tool JHOVE into process following
experience with e1 and e4, possibly via FITS
• Ingest E2 and E3!
• Tools can’t replace wisdom; there are issues the tools can’t yet pick up.
• Ideal: Archivists’ knowledge (processes, analysis, diagnosis) becomes actuated.
What next..?
Thank you!
More ingests!**
**Iterative development of processes…