1
Archiving
LingDy
16 Feb 2012
TUFS, Tokyo
David Nathan
Endangered Languages Archive
Hans Rausing Endangered Languages Project
SOAS, University of London
4
What is a digital language archive?
a trusted repository created and maintained by an institution with a commitment to the long-term preservation of archived material
has policies and processes for materials acquisition, cataloguing, preservation, dissemination, migration to new digital formats
a platform for building and conducting relationships between data providers and data users
5
Why is language archiving different?
what is a language?
the data is not conventionalised (like $, age, year of publication etc) – what and how to code?
varying and competing expectations
6
And endangered languages archiving?
extremely diverse context – languages, cultures, communities, individuals, projects
typical source - fieldworkers
typical materials - documentation
difficult for archive staff to manage
sensitivities and restrictions
extremely high priority
7
What can a language archive offer?
Security - keep your electronic materials safe
Preservation - store your materials for the long term
Discovery - help others to find out about your materials, and you to find out about users
Protocols - respect and implement sensitivities, restrictions
Sharing - share results of your work, if appropriate
Acknowledgement - create citable acknowledgement
Mobilisation - create usable language materials
Quality and standards - advice for assuring your materials are of the highest quality and robust standards
8
Different kinds of language archives
different contexts, systems, methods, collection policies
you should consider placing your materials in more than one …
9
Why digital?
preservation: digitisation is the only way that audio and video (non-symbolic material) can be preserved for the future … because it can be copied and transmitted with zero loss
cataloguing, sharing, dissemination, repurposing
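The "zero loss" claim can be checked in practice by comparing checksums before and after a copy. The sketch below is only an illustration: the file names are hypothetical, and SHA-256 is an assumed choice of checksum rather than anything the slides prescribe.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copy_and_verify(source: Path, destination: Path) -> bool:
    """Copy a file and report whether the copy is bit-identical to the source."""
    shutil.copyfile(source, destination)
    return sha256_of(source) == sha256_of(destination)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as folder:
        original = Path(folder) / "recording.wav"  # hypothetical file name
        original.write_bytes(b"\x00\x01" * 1000)   # stand-in for real audio data
        print(copy_and_verify(original, Path(folder) / "copy.wav"))
```

If the two digests match, the copy carries exactly the same bytes as the original, which is what makes digital copies suitable for preservation.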
10
Digital disadvantages
digital data is fragile and ephemeral
cost (human, equipment, maintenance)
requires strategy and luck to get right
preservation depends on file and data formats
depends on tools and software
depends on formats (prefer standard, open, explicit, long-lasting)
materials may have to be converted and migrated
some formats require particular software (can we archive the software?)
11
What is archiving of language materials?
preparing materials
  selecting
  structuring
  suitable encodings and formats
  well-documented
depositing them in a suitable archive(s)
curation and accession by the archive
ongoing management, dissemination
new focus on form, presentation and user interaction/feedback
12
Users and potential users
depositors – deposit, access or update materials
speakers and their descendants (“majority of users of Berkeley Language Center archive are community members”)
other researchers - comparative/historical linguists, typologists, theoreticians, anthropologists, historians, musicologists etc etc
other “stakeholders”, eg educationalists
journalists and the wider public
13
Archives networks and bodies
foundation concepts and technologies from library initiatives, eg.
  D-LIB http://www.dlib.org/
  OAI (Open Archives Initiative)
  OAIS (Open Archival Information System) (NASA and space agencies incl JAXA)
Open Language Archives Community (OLAC)
Digital Endangered Languages and Musics Archives Network (DELAMAN)
  ELAR, DOBES, ANLC, Paradisec, EMELD, LACITO, AIATSIS, AMPM (Maori)
14
Citation examples
from Heidi Johnson of AILLA
Collection: Sherzer, Joel. "Kuna Collection." The Archive of the Indigenous Languages of Latin America: www.ailla.utexas.org. Media: audio, text, image. Access: 0% restricted.
File/resource: Sherzer, Joel (Researcher). (1970). "Report of a curing specialist." Kuna Collection. Archive of the Indigenous Languages of Latin America: www.ailla.utexas.org. Type: transcription & translation. Media: text. Access: public. Resource ID: CUK001R001.
15
Endangered Languages ARchive (ELAR)
one of 3 programs of the Hans Rausing Endangered Languages Project
develop policies, preservation infrastructure, cataloguing and dissemination, facilities, training, advice, materials development and publishing
16
ELAR facts and figures
archived collections: 110
online (published) collections: 50
average collection size: about 60 GB
online data bundles: 9523
total number of files held: around 200,000
total volume of files held: around 10 TB
online data bundles unrestricted access: 5298
registered users: >500
annual downloads: >1,000
annual number of website "hits": 230,000
17
ELAR facts and figures – user accounts
increasing number of community members, including Aleut (Canada), Tai-Ahom, Wadar (India), Burushaski (Pakistan), Serrano, Cahuilla, Arapaho (USA), Iraqi Jewish (Iraq), Saami (Finland), Wabena (Tanzania), Torwali (Pakistan), Hani, Bai (China), Irish
comments: “I found your site while looking up my grandmother, and i found her on your site speaking our language. and i would love for my children her great grandchildren to hear our language coming from her”.
many interdisciplinary researchers, particularly archivists and anthropologists
18
Archiving and data management
most data-related issues are really part of linguistic data/corpus management
there are now few data-related issues that are archive-specific
  metadata formats
  video
  presentation/exhibition of material
19
What can you archive (at ELAR)?
media - sound, video
graphics - images, scans
texts - fieldnotes, grammars, description, analysis
structured data - aligned and annotated transcriptions, databases, lexica
metadata - contextual information about the materials, structured and unstructured
20
Archive objects
an “object” could be a file, a set of files, a directory, or a set of files with their relationships explicitly defined
these are often called “sessions” or “bundles”
they should be made explicit through metadata
our future catalogue system will provide the ability for depositors to directly create, label and update bundles
See bundles at ELAR
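One way to picture an archive object of the last kind (a set of files with their relationships explicitly defined) is as a small data structure. This is a sketch only: the class name, field names and file names are hypothetical and do not describe ELAR's actual catalogue system.

```python
from dataclasses import dataclass, field


@dataclass
class Bundle:
    """A group of files plus explicit relationships between them."""

    bundle_id: str
    title: str
    files: list[str] = field(default_factory=list)
    # Each relation is a (source file, relationship, target file) triple.
    relations: list[tuple[str, str, str]] = field(default_factory=list)

    def relate(self, source: str, relationship: str, target: str) -> None:
        """Record a relationship; both files must already belong to the bundle."""
        for name in (source, target):
            if name not in self.files:
                raise ValueError(f"{name} is not part of bundle {self.bundle_id}")
        self.relations.append((source, relationship, target))

    def related_to(self, target: str) -> list[tuple[str, str]]:
        """Return (source file, relationship) pairs that point at `target`."""
        return [(s, r) for s, r, t in self.relations if t == target]


# Hypothetical example: a recording and its transcription in one bundle.
story = Bundle("b001", "Story session", ["story.wav", "story.eaf"])
story.relate("story.eaf", "transcription of", "story.wav")
```

Writing the relationships down as data, rather than leaving them implicit in file names or folders, is what "made explicit through metadata" amounts to here.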
21
Archive material should be selected
example: Depositor’s question: How much video can I archive?
answer: ...
22
What is required to make a deposit?
resource(s) for an endangered language
  it could be just one file
inventory / metadata
deposit form
  view
existing deposits can also be updated, added to, and metadata added/modified
23
How can I deliver data?
hard disks
  we return them
  we send them out
email
  good for samples for evaluation
  OK for most text materials
Dropbox etc
flash cards and USB sticks
a web upload facility may be provided one day
we download from your server
24
What about CDs and DVDs?
we have found CDs, and especially DVDs, to be very unreliable
  DVD fail rate > 10%
cause confusion as files are allocated to fit on disks, not according to corpus structure
create a lot of work for depositors and for ELAR
25
Protocol
the sensitivities and access restrictions associated with EL resources
need to be discussed, collected and recorded in the field
global protocol (the overall, typical value) is entered into the deposit form
specific protocol (for files, bundles) is entered via metadata (or any other explicit way)
26
Protocol and access control
principles:
  granularity – file, bundle or collection
  access is a relation between object and user
  protocol values can be changed over time
ELAR’s URCS system
  User
  Researcher
  Community member
  Subscriber
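The three principles can be sketched in a few lines. The four role names come from the URCS list; everything else (the data layout, the "most specific setting wins" rule, the example objects) is an assumption made for illustration and not a description of how ELAR's system is implemented.

```python
from enum import Enum


class Role(Enum):
    USER = "U"
    RESEARCHER = "R"
    COMMUNITY_MEMBER = "C"
    SUBSCRIBER = "S"


# Protocol values keyed by object path (hypothetical example data).
# A path is (collection,), (collection, bundle) or (collection, bundle, file).
protocols: dict[tuple[str, ...], set[Role]] = {
    ("coll1",): {Role.USER, Role.RESEARCHER, Role.COMMUNITY_MEMBER, Role.SUBSCRIBER},
    ("coll1", "bundle7"): {Role.RESEARCHER, Role.COMMUNITY_MEMBER},
    ("coll1", "bundle7", "song.wav"): {Role.COMMUNITY_MEMBER},
}


def allowed_roles(path: tuple[str, ...]) -> set[Role]:
    """Return the most specific protocol recorded for an object.

    Assumed rule: a file setting overrides a bundle setting, which
    overrides the collection-wide (global) setting.
    """
    for depth in range(len(path), 0, -1):
        value = protocols.get(path[:depth])
        if value is not None:
            return value
    return set()  # nothing recorded: assume no access


def can_access(user_roles: set[Role], path: tuple[str, ...]) -> bool:
    """Access is a relation between an object and a user's roles."""
    return bool(user_roles & allowed_roles(path))
```

Because the protocol values live in data rather than in code, changing them over time is just an update to the `protocols` table.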
27
“I have images”
what kinds of images?
what are their sources?
what is their documentation value?
what role do they play in the collection?
…
these should be reflected in the data structures/metadata
28
Metadata for images
at least captions
what else?
  … … … …
in what form?
  narrative
  tabular
  fields
  keywords
29
Integrating images into metadata
get a list of image files
  command (DOS) window in directory
  type “dir > list.txt”
  open text file (in Notepad++ or MS Word)
  change font to Courier
  get a “vertical selection”
  (or use a file listing utility!)
paste into spreadsheet
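The manual listing steps on this slide can also be scripted, in the spirit of "use a file listing utility". The sketch below writes the image file names to a CSV file that a spreadsheet can open; the set of image extensions and the output file name are assumptions.

```python
import csv
import tempfile
from pathlib import Path

# Assumed set of image extensions; adjust to the collection at hand.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".tif", ".tiff", ".gif", ".bmp"}


def list_images(directory: Path) -> list[str]:
    """Return the sorted names of image files directly inside `directory`."""
    return sorted(
        entry.name
        for entry in directory.iterdir()
        if entry.is_file() and entry.suffix.lower() in IMAGE_SUFFIXES
    )


def write_listing(directory: Path, output: Path) -> int:
    """Write one file name per row to a CSV file and return the row count."""
    names = list_images(directory)
    with output.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["filename"])
        writer.writerows([name] for name in names)
    return len(names)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as folder:
        root = Path(folder)
        (root / "img001.jpg").write_bytes(b"")  # hypothetical empty stand-in
        print(write_listing(root, root / "list.csv"))
```

Unlike the output of `dir`, the result contains only file names, so no "vertical selection" step is needed before pasting.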
30
Integrating images into metadata
make a new sheet for images
paste in image file list (see previous)
add an ID column
  type “1” in first cell
  select from first to last cell in ID column
  Edit>Fill>Series>OK
add other columns
now you can refer to your images anywhere!
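The ID column step can be reproduced outside the spreadsheet as well. A minimal sketch, assuming the image file names are already available as a list; the column names `id` and `filename` are hypothetical.

```python
import csv
import io


def add_id_column(filenames, start=1):
    """Pair each file name with a sequential ID, like Edit>Fill>Series."""
    return [(number, name) for number, name in enumerate(filenames, start=start)]


def to_csv(rows):
    """Render (id, filename) rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["id", "filename"])
    writer.writerows(rows)
    return buffer.getvalue()


print(to_csv(add_id_column(["img001.jpg", "img002.jpg"])))
```

Once every image has a stable ID, other sheets can refer to the image by that ID instead of repeating the file name.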
31
Using spreadsheet to access data
you can turn a filename into a link to access files directly from a spreadsheet
have the filename in cells
use the formula =HYPERLINK(file, "Message")
examples
=HYPERLINK("E:\archiving\images\"&A2, "click here")
=HYPERLINK(A1&A2, "click here")
=HYPERLINK(A1&A2, A2)
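If the file list is produced by a script, the link formulas can be generated at the same time. This sketch only builds the formula text. The folder path is the one from the example, the file name is hypothetical, and the comma argument separator is an assumption (some spreadsheet locales use a semicolon instead).

```python
def hyperlink_formula(folder: str, filename: str, label: str = "") -> str:
    """Build the text of a spreadsheet HYPERLINK formula for one file."""

    def quoted(text: str) -> str:
        # Spreadsheet string literals escape a double quote by doubling it.
        return '"' + text.replace('"', '""') + '"'

    target = folder + filename
    # Fall back to the file name as the visible label when none is given.
    return f"=HYPERLINK({quoted(target)}, {quoted(label or filename)})"


print(hyperlink_formula("E:\\archiving\\images\\", "photo01.jpg", "click here"))
```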
32
My cells have multiple values!
example: keywords
this is probably OK, as keywords are atomic
just consistently use a suitable delimiter
  e.g. use comma - if data values cannot have commas
  ELAR recommends double pipe “||”
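A consistent delimiter is what makes multi-value cells machine-readable. A minimal sketch using the double pipe; the function names and the keyword values are hypothetical.

```python
DELIMITER = "||"


def split_cell(cell: str) -> list[str]:
    """Split a multi-value cell into its values, dropping stray whitespace."""
    return [part.strip() for part in cell.split(DELIMITER) if part.strip()]


def join_values(values: list[str]) -> str:
    """Join values into one cell, refusing values that contain the delimiter."""
    for value in values:
        if DELIMITER in value:
            raise ValueError(f"value contains the delimiter: {value!r}")
    return DELIMITER.join(values)


print(split_cell("hunting || fishing||song"))
```

A comma inside a value is harmless here, which is the practical advantage of the double pipe over a comma delimiter.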
33
My cells have multiple values!
example: speakers in a recording
speakers are probably not atomic – they have other attributes
create a separate “speakers” sheet
give each speaker an ID (number or initials)
use the IDs in the original sheet, with delimiter (implements one to many)
(advanced) or make another sheet to associate recordings with speakers (implements many to many)
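Both options can be sketched with plain Python data standing in for the sheets. All names, IDs and attribute values below are made up for illustration.

```python
DELIMITER = "||"

# "speakers" sheet: one row per speaker, keyed by ID (hypothetical data).
speakers = {
    "S1": {"name": "Speaker One", "year_of_birth": "1950"},
    "S2": {"name": "Speaker Two", "year_of_birth": "1978"},
}

# Original sheet: recordings refer to speakers by ID, with a delimiter.
recordings = [
    {"file": "rec01.wav", "speaker_ids": "S1||S2"},
    {"file": "rec02.wav", "speaker_ids": "S2"},
]


def speakers_in(recording: dict) -> list[dict]:
    """Look up the full speaker rows for one recording (one to many)."""
    ids = [part.strip() for part in recording["speaker_ids"].split(DELIMITER)]
    return [speakers[speaker_id] for speaker_id in ids if speaker_id]


def link_sheet(rows: list[dict]) -> list[tuple[str, str]]:
    """Expand delimited cells into one (recording, speaker ID) pair per row."""
    return [
        (row["file"], speaker_id.strip())
        for row in rows
        for speaker_id in row["speaker_ids"].split(DELIMITER)
        if speaker_id.strip()
    ]


def recordings_with(speaker_id: str, links: list[tuple[str, str]]) -> list[str]:
    """Use the link sheet in the other direction (many to many)."""
    return [file for file, linked in links if linked == speaker_id]
```

The link sheet is the "another sheet" option: it holds one row per recording-speaker pair, so it can be read from either side.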
34
Expressing “Relation” in spreadsheets
one column is usually insufficient
“relationship” has 2 parts
  the target of the relationship
  description of the relationship
how would this work for images?
35
How can I tell if it’s Unicode?
use a browser or Notepad++
paste text in
examine the encoding (before and after)
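A file can also be checked programmatically. The sketch below gives a rough verdict only: it looks for a byte order mark and otherwise tests whether the bytes decode as UTF-8. It cannot identify which legacy encoding a non-Unicode file uses, and the wording of the verdicts is my own.

```python
import codecs


def describe_encoding(data: bytes) -> str:
    """Give a rough verdict on whether raw bytes look like Unicode text."""
    if data.startswith(codecs.BOM_UTF8):
        return "UTF-8 with BOM"
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "UTF-16 with BOM"
    try:
        text = data.decode("utf-8")
    except UnicodeDecodeError:
        return "not valid UTF-8 (probably a legacy encoding)"
    if text.isascii():
        return "plain ASCII (also valid UTF-8)"
    return "valid UTF-8"


# Example: the velar nasal encoded as UTF-8.
print(describe_encoding("\u014ba".encode("utf-8")))
```

To check a file, pass it the bytes from `open(path, "rb").read()`.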
36
Can I still use MS Word?
ELAR no longer accepts MS Word files
but Word is still useful
  quicker to type up
  useful tables, functions, macros etc
solutions
  think “text only”
  tables as spreadsheets (are they bad too?)
  (advanced) complex materials formatted as styles, then export as marked up PDF/A – but not a perfect solution