
OPF Webinar: Preservation in Practice - Archives New Zealand


Department of Internal Affairs

Preservation in Practice: Archives New Zealand

Ross Spencer - @beet_keeper

Archives New Zealand

Open Preservation Foundation, Thursday April 16 2015


Sun image, R24685027, E4, Archway, Archives New Zealand.

http://www.archway.archives.govt.nz/ViewFullItem.do?code=24685027&digital=yes


Background

Born Digital and Cultural Heritage Conference, Melbourne*: http://bit.ly/1utAqz0

Spencer, Braden, Hutar, Masters, Crouch, Mosely, "Fly Away Home: Pilot Transfer of Born-digital Records at Archives New Zealand"

Collected our experiences from late 2013 through to early 2014. Royal Commission work through to GDAP Closure and beginning of eAccessions.

* http://playitagainproject.org/conference-report/


Putting it into practice…

• Our first two ingests… Code names, E1, E4.

• Began to evaluate process; document lessons learned.

• Started to work on our next two ingests, E2, E3.

• Approach taken to disseminate knowledge amongst two additional teams of archivists. One team per transfer.

More background: http://www.slideshare.net/RossSpencer/the-reality-of-digital-transfer-archives-new-zealand


eAccession One [e1]

Legacy accessions where we have the opportunity to apply lessons learned from the Initial Digital Transfers…

175 Files (166.5 MB)

10 Directories

0 Unidentified Objects

0 Unidentified Extensions

7 Known Formats

0 Duplicates (content)


eAccession Four [e4]

eAccessions were seen to be the least complex and allowed us to focus, primarily, on the challenge of ingest…

1295 Files (565.0 MB)

6 Directories

2 Unidentified Objects

1 Unidentified Extensions

12 Known Formats

2 Duplicates (content)

Note: an issue obscured in the original statistics… 2x false positives! Thumbs.db identified as the more generic OLE2.


Technical Challenges in e1 and e4

• [Tools] Ability to handle multi-byte character encodings, e.g. Maori macrons: ‘Ā’.

• [Tools] Unidentified files and false positives.

• [Tools] Recording of pre-conditioning actions on ingest into the digital preservation system.

• [Tools] Implementing a CSV ingest mechanism: configuration, code, and workflow.

• [Pre-conditioning / Tools] Ability of the digital preservation system (Rosetta) to handle contiguous spaces in filenames.

• [Pre-conditioning] One invalid JPEG. Required rearrangement of application marker segments.
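The invalid JPEG mentioned above needed its application marker segments rearranged. The slides do not say which segments were out of order or which tool was used, so the following is only a minimal sketch of how the segment order of a JPEG could be inspected before deciding on a pre-conditioning action. The function name, the sample bytes, and the choice to stop at start-of-scan are all assumptions made for illustration.

```python
import struct

# Markers that stand alone and carry no length field: TEM, RST0-RST7, SOI, EOI.
STANDALONE = {0x01, 0xD8, 0xD9} | set(range(0xD0, 0xD8))
NAMES = {0xD8: "SOI", 0xD9: "EOI", 0xDA: "SOS", 0xDB: "DQT", 0xC4: "DHT", 0xFE: "COM"}


def list_segments(data):
    """Return (name, offset, length) for each segment up to start-of-scan."""
    if data[:2] != b"\xff\xd8":
        raise ValueError("missing SOI marker: not a JPEG stream")
    segments = [("SOI", 0, 0)]
    pos = 2
    while pos + 1 < len(data):
        if data[pos] != 0xFF:
            raise ValueError(f"expected a marker at offset {pos}")
        marker = data[pos + 1]
        if marker == 0xFF:  # fill byte in front of a marker; skip it
            pos += 1
            continue
        if 0xE0 <= marker <= 0xEF:
            name = f"APP{marker - 0xE0}"  # application marker segments
        else:
            name = NAMES.get(marker, f"0x{marker:02X}")
        if marker in STANDALONE:
            segments.append((name, pos, 0))
            pos += 2
            continue
        # The two-byte big-endian length counts itself but not the marker.
        (length,) = struct.unpack(">H", data[pos + 2:pos + 4])
        segments.append((name, pos, length))
        if marker == 0xDA:  # entropy-coded image data follows; stop here
            break
        pos += 2 + length
    return segments


# Hypothetical header fragment in which APP1 appears before APP0.
SAMPLE = (
    b"\xff\xd8"
    b"\xff\xe1\x00\x04ab"
    b"\xff\xe0\x00\x04cd"
    b"\xff\xda\x00\x02"
)

for name, offset, length in list_segments(SAMPLE):
    print(f"{offset:>4}  {name:<5} length={length}")
```

A listing like this only shows the order of the segments. Whether a given order is invalid depends on the format the file claims to follow, which is left to the analyst.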


Questions already forming…

• How do we make reporting consistent?

• How do we ensure reliability?

• Approach our work with the scientific method?

• How do we socialise amongst those inexperienced in digital preservation?

• How do we disseminate a skillset*?

*And quickly… i.e. not seven years…


DROID SQLite Analysis

• Some answers already appearing: the stats report is now generated by a Python script in response to these issues: https://github.com/exponential-decay/droid-sqlite-analysis

• Relies only on The National Archives' DROID tool: file listing, format ID, and checksumming utility.

• Consistent reporting.

• Repeatable.

• Fast.
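To illustrate the general idea of loading a DROID export into SQLite and counting from there, here is a minimal sketch. It is not the code from the repository above. The column names are assumed to match those of a DROID CSV export, the sample rows are invented, and counting every file in a duplicate group as a duplicate is an assumption about how the statistic is defined.

```python
import csv
import io
import sqlite3

# Invented sample rows; column names are assumed to follow a DROID CSV export.
SAMPLE_CSV = """ID,FILE_PATH,NAME,METHOD,TYPE,EXT,MD5_HASH,PUID
1,/transfer,transfer,,Folder,,,
2,/transfer/a.pdf,a.pdf,Signature,File,pdf,aaa,fmt/18
3,/transfer/b.pdf,b.pdf,Signature,File,pdf,aaa,fmt/18
4,/transfer/c.xyz,c.xyz,,File,xyz,bbb,
"""


def load(csv_text):
    """Load CSV text into an in-memory SQLite table called droid."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE droid "
        "(id, file_path, name, method, type, ext, md5_hash, puid)"
    )
    conn.executemany(
        "INSERT INTO droid VALUES "
        "(:ID, :FILE_PATH, :NAME, :METHOD, :TYPE, :EXT, :MD5_HASH, :PUID)",
        csv.DictReader(io.StringIO(csv_text)),
    )
    return conn


def report(conn):
    """Return a handful of collection statistics as a dictionary."""
    def scalar(sql):
        return conn.execute(sql).fetchone()[0]

    return {
        "files": scalar("SELECT COUNT(*) FROM droid WHERE type = 'File'"),
        "directories": scalar("SELECT COUNT(*) FROM droid WHERE type = 'Folder'"),
        "unidentified": scalar(
            "SELECT COUNT(*) FROM droid WHERE type = 'File' AND puid = ''"
        ),
        "known_formats": scalar(
            "SELECT COUNT(DISTINCT puid) FROM droid WHERE puid != ''"
        ),
        # Every file whose checksum is shared with at least one other file.
        "duplicates": scalar(
            "SELECT COALESCE(SUM(n), 0) FROM ("
            "SELECT COUNT(*) AS n FROM droid WHERE type = 'File' "
            "GROUP BY md5_hash HAVING COUNT(*) > 1)"
        ),
    }


for label, value in report(load(SAMPLE_CSV)).items():
    print(f"{label}: {value}")
```

Because each figure is a single query, the same report can be re-run against any export and should give the same numbers each time, which is the consistency and repeatability the slide asks for.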


DROID SQLite Analysis

• Previously:


DROID SQLite Analysis

• And then:

GitHub Example: https://github.com/exponential-decay/droid-sqlite-analysis/blob/master/opf-test-corpus-test-output/opf-test-corpus-sqlite-analysis.txt


But it’s still not enough…

• How do we make it user friendly?

• “Everything should be as simple as can be, but not simpler.” – Albert Einstein.

• Even when we talk about character encoding!

• Ā? 0x100? Non-ASCII? Unicode? UTF-8? Maori Capital Letter A, with Macron?

• Thank you Cooper Hewitt!


Cooper Hewitt, and Unicode:

GitHub: https://github.com/cooperhewitt/py-cooperhewitt-unicode
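The different ways of describing ‘Ā’ can all be derived from the character itself. The sketch below uses only Python's standard `unicodedata` module, not the Cooper Hewitt library, whose interface is not shown in these slides. The function names are made up for illustration.

```python
import unicodedata


def describe(char):
    """Describe one character in several equivalent ways."""
    code_point = ord(char)
    return {
        "character": char,
        "code_point": f"U+{code_point:04X}",
        "hex": hex(code_point),
        "name": unicodedata.name(char, "<unnamed>"),
        "ascii": char.isascii(),
        "utf8": char.encode("utf-8").hex(" "),
    }


def non_ascii_in(filename):
    """Describe every non-ASCII character found in a filename."""
    return [describe(c) for c in filename if not c.isascii()]


for label, value in describe("Ā").items():
    print(f"{label}: {value}")
```

Reporting the Unicode name next to the raw code point gives an archivist something readable, while the hexadecimal and UTF-8 forms stay available for anyone debugging a tool.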


Next step, HTML5

** All output to stdout
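As a rough sketch of what writing an HTML5 report to stdout could look like, the following renders a dictionary of statistics as a small standalone page. The figures are those from the eAccession One [e1] slide. The function name and page layout are assumptions and do not reproduce the actual report.

```python
import html
import sys

# Figures taken from the eAccession One [e1] slide.
E1_STATS = {
    "Files": 175,
    "Directories": 10,
    "Unidentified Objects": 0,
    "Unidentified Extensions": 0,
    "Known Formats": 7,
    "Duplicates (content)": 0,
}


def render_html(title, stats):
    """Build a standalone HTML5 page with one table row per statistic."""
    rows = "\n".join(
        f"      <tr><th>{html.escape(str(label))}</th>"
        f"<td>{html.escape(str(value))}</td></tr>"
        for label, value in stats.items()
    )
    return (
        "<!DOCTYPE html>\n"
        '<html lang="en">\n'
        "  <head>\n"
        '    <meta charset="utf-8">\n'
        f"    <title>{html.escape(title)}</title>\n"
        "  </head>\n"
        "  <body>\n"
        f"    <h1>{html.escape(title)}</h1>\n"
        "    <table>\n"
        f"{rows}\n"
        "    </table>\n"
        "  </body>\n"
        "</html>\n"
    )


# Writing to stdout leaves the choice of output file to the caller.
sys.stdout.write(render_html("eAccession One [e1]", E1_STATS))
```

Writing to stdout means the report can be redirected to a file or piped into another tool without the script needing to know where it ends up.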


Next step, HTML5


So, HTML5 output is better, but…

• Need to improve definitions. Internationalization would be nice.

• Need to improve formatting

• Python difficult to distribute

• Remove some statistics: Duplicate file names as an example.

• Create new statistics, e.g. blacklists created from continuous evolution of appraisal

• How to promote the research of digital objects?

• Multiple copies of ~2000 files across network, or something else?


Collection breakdown before further archival and technical appraisal…

• E2 (~291MB)

• 2519 Files

• 177 Directories

• 5 Unidentified Objects

• 4 Extension Only ID

• 1 Unidentified Extensions

• 22 Known Formats

• 25 Extension Mismatches

• 24 Duplicate Content

• 2 Multiple Identification

• E3 (~198MB)

• 1748 Files

• 144 Directories

• 8 Unidentified Objects

• 7 Extension Only ID

• 1 Unidentified Extensions

• 12 Known Formats

• 37 Extension Mismatches

• 17 Duplicate Content

• 3 Multiple Identification


We created the Rogues Gallery…

• Implemented by myself, and @AndreaKByrne


We created the Rogues Gallery…

• Example Rogues: Duplicates, Unidentified Files, Odd Character-encodings, multiple-ids…

• ‘Art of the archivist’

• Enables copying of files, without modification – and preserving context!

• Creates potential for split ingests

• Potential to spot format patterns.

• Repeatable, reliable, fast! We won’t get it right first time. We will experiment!

• But still need to test, test, test!!!
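A minimal sketch of how a rogues list might be assembled from identification results follows. The record layout, category names, and sample paths are all hypothetical, and the PUID values are placeholders. Treating any non-ASCII path as an "odd character encoding" is a deliberate simplification, not necessarily the rule the Rogues Gallery applies.

```python
from collections import defaultdict

# Hypothetical records: one per file, with checksum and identification results.
SAMPLE_RECORDS = [
    {"path": "reports/a.doc", "hash": "h1", "puids": ["fmt/40"]},
    {"path": "reports/copy of a.doc", "hash": "h1", "puids": ["fmt/40"]},
    {"path": "images/Thumbs.db", "hash": "h2", "puids": []},
    {"path": "minutes/hui-ā-tau.txt", "hash": "h3", "puids": ["x-fmt/111"]},
    {"path": "data/b.xls", "hash": "h4", "puids": ["fmt/61", "fmt/62"]},
    {"path": "ok/c.pdf", "hash": "h5", "puids": ["fmt/18"]},
]


def find_rogues(records):
    """Map each rogue category to the sorted file paths that fall into it."""
    rogues = defaultdict(set)
    by_hash = defaultdict(list)
    for rec in records:
        path = rec["path"]
        if not rec["puids"]:
            rogues["unidentified"].add(path)
        if len(rec["puids"]) > 1:
            rogues["multiple-ids"].add(path)
        if not path.isascii():
            rogues["non-ascii-name"].add(path)
        by_hash[rec["hash"]].append(path)
    # Files sharing a checksum are treated as duplicates of one another.
    for paths in by_hash.values():
        if len(paths) > 1:
            rogues["duplicates"].update(paths)
    return {category: sorted(paths) for category, paths in rogues.items()}


def rogues_list(rogues):
    """Flatten the categories into one unique relative path per line."""
    unique = sorted({p for paths in rogues.values() for p in paths})
    return "\n".join(unique) + "\n"


print(rogues_list(find_rogues(SAMPLE_RECORDS)), end="")
```

Keeping the paths relative to the transfer root means the list can be handed to a copy tool that recreates the original directory structure, which is how context is preserved.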


Utilises Rsync…

• Rsync Manpage: http://manpages.ubuntu.com/manpages/trusty/man1/rsync.1.html

• rsync -avr --files-from=opf-rogues-list.txt "/cygdrive/c/" "/cygdrive/c/working/rogues-gallery"

• -a Archive Mode (preserves permissions, modification times, plus more).

• -v Verbose, -r Recursive

• Rsync is already being utilised to maintain filesystem metadata integrity within our processes.
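The rsync invocation above can be assembled from a script so that the same copy is repeatable. The sketch below only builds and prints the command. The function name and the dry-run default are assumptions, and the paths are the ones from the slide. According to the rsync man page, `-a` does not imply `-r` when `--files-from` is used, which is presumably why the slide spells out `-r`.

```python
import shlex


def build_rsync_command(files_from, source_root, destination, dry_run=True):
    """Build an rsync argument list that copies only the listed files."""
    cmd = ["rsync", "-avr"]
    if dry_run:
        # --dry-run reports what would be copied without copying anything.
        cmd.append("--dry-run")
    cmd.append(f"--files-from={files_from}")
    cmd += [source_root, destination]
    return cmd


command = build_rsync_command(
    "opf-rogues-list.txt",
    "/cygdrive/c/",
    "/cygdrive/c/working/rogues-gallery",
)
print(shlex.join(command))
```

Passing the resulting list to `subprocess.run(command, check=True)` would execute it. Starting with a dry run gives a chance to review the file list before anything is copied.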


What other approaches were there?

• DROID Itself – Even contains an Apache Derby Database!

• Better support for SQLite, more flexibility to produce our own reports as required.

• Rosetta – Technical Analyst Workbench – Even isolates files!

• Joint NLNZ/Archives NZ Digital Preservation Policy – Pre-conditioning – Forensic recording and reversibility of pre-conditioning actions taken.

• These tools are focused on doing other jobs better: DROID’s CSV, Rosetta’s Long Term Preservation Capability, and support for access.

• C3P0 – promising, but too difficult to install/distribute…


What next..?

• Python is difficult to distribute. Re-write? – Go (Golang)? – The future programming language of digital preservation?

• Also, create unit-tests. Trial Rogues Gallery output more.

• Incorporate metadata extraction tool JHOVE into process following experience with e1 and e4, possibly via FITS.

• Ingest E2 and E3!

• Tools can’t replace wisdom; there are issues the tools can’t yet pick up.

• Ideal: Archivists’ knowledge (processes, analysis, diagnosis) becomes actuated.


What next..?

Thank you!

More ingests!**

**Iterative development of processes…
