1 Archiving LingDy 16 Feb 2012 TUFS, Tokyo David Nathan Endangered Languages Archive Hans Rausing Endangered Languages Project SOAS, University of London



1

Archiving

LingDy

16 Feb 2012

TUFS, Tokyo

David Nathan

Endangered Languages Archive

Hans Rausing Endangered Languages Project

SOAS, University of London

2

What is an archive?

3

4

What is a digital language archive?

a trusted repository created and maintained by an institution with a commitment to the long-term preservation of archived material

has policies and processes for materials acquisition, cataloguing, preservation, dissemination, migration to new digital formats

a platform for building and conducting relationships between data providers and data users

5

Why is language archiving different?

what is a language?

the data is not conventionalised (like $, age, year of publication etc) – what and how to code?

varying and competing expectations

6

And endangered languages archiving?

extremely diverse context – languages, cultures, communities, individuals, projects

typical source - fieldworkers

typical materials - documentation

difficult for archive staff to manage

sensitivities and restrictions

extremely high priority

7

What can a language archive offer?

Security - keep your electronic materials safe

Preservation - store your materials for the long term

Discovery - help others to find out about your materials, and you to find out about users

Protocols - respect and implement sensitivities, restrictions

Sharing - share results of your work, if appropriate

Acknowledgement - create citable acknowledgement

Mobilisation - create usable language materials

Quality and standards - advice for assuring your materials are of the highest quality and robust standards

8

Different kinds of language archives

different contexts, systems, methods, collection policies

you should consider placing your materials in more than one …

9

Why digital?

preservation: digitisation is the only way that audio and video (non-symbolic material) can be preserved for the future … because it can be copied and transmitted with zero loss

cataloguing, sharing, dissemination, repurposing

10

Digital disadvantages

digital data is fragile and ephemeral

cost (human, equipment, maintenance)

requires strategy and luck to get right

preservation depends on file and data formats

depends on tools and software

depends on formats (prefer standard, open, explicit, long-lasting)

materials may have to be converted and migrated

some formats require particular software (can we archive the software?)

11

What is archiving of language materials?

preparing materials: selecting, structuring, suitable encodings and formats, well-documented

depositing them in a suitable archive(s)

curation and accession by the archive

ongoing management, dissemination

new focus on form, presentation and user interaction/feedback

12

Users and potential users

depositors – deposit, access or update materials

speakers and their descendants (“majority of users of Berkeley Language Center archive are community members”)

other researchers - comparative/historical linguists, typologists, theoreticians, anthropologists, historians, musicologists etc etc

other “stakeholders”, e.g. educationalists

journalists and the wider public

13

Archives networks and bodies

foundation concepts and technologies from library initiatives, e.g.

D-LIB http://www.dlib.org/

OAI (Open Archives Initiative)

OAIS (Open Archival Information System) (NASA and space agencies incl JAXA)

Open Language Archives Community (OLAC)

Digital Endangered Languages and Musics Archives Network (DELAMAN)

ELAR, DOBES, ANLC, Paradisec, EMELD, LACITO, AIATSIS, AMPM (Maori)

14

Citation examples

from Heidi Johnson of AILLA

Collection: Sherzer, Joel. "Kuna Collection." The Archive of the Indigenous Languages of Latin America: www.ailla.utexas.org. Media: audio, text, image. Access: 0% restricted.

File/resource: Sherzer, Joel (Researcher). (1970). "Report of a curing specialist." Kuna Collection. Archive of the Indigenous Languages of Latin America: www.ailla.utexas.org. Type: transcription&translation. Media: text. Access: public. Resource ID: CUK001R001.

15

Endangered Languages ARchive (ELAR)

one of 3 programs of the Hans Rausing Endangered Languages Project

develop policies, preservation infrastructure, cataloguing and dissemination, facilities, training, advice, materials development and publishing

16

ELAR facts and figures

archived collections: 110

online (published) collections: 50

average collection size: about 60 GB

online data bundles: 9523

total number of files held: around 200,000

total volume of files held: around 10 TB

online data bundles unrestricted access: 5298

registered users: >500

annual downloads: >1,000

annual number of website "hits": 230,000

17

ELAR facts and figures – user accounts

increasing number of community members, including Aleut (Canada), Tai-Ahom, Wadar (India), Burushaski (Pakistan), Serrano, Cahuilla, Arapaho (USA), Iraqi Jewish (Iraq), Saami (Finland), Wabena (Tanzania), Torwali (Pakistan), Hani, Bai (China), Irish

comments: “I found your site while looking up my grandmother, and i found her on your site speaking our language. and i would love for my children her great grandchildren to hear our language coming from her”.

many interdisciplinary researchers, particularly archivists and anthropologists

18

Archiving and data management

most data-related issues are really part of linguistic data/corpus management

there are now few data-related issues that are archive-specific:

metadata formats

video

presentation/exhibition of material

19

What can you archive (at ELAR)?

media - sound, video

graphics - images, scans

texts - fieldnotes, grammars, description, analysis

structured data - aligned and annotated transcriptions, databases, lexica

metadata - contextual information about the materials, structured and unstructured

20

Archive objects

an “object” could be a file, a set of files, a directory, or a set of files with their relationships explicitly defined

these are often called “sessions” or “bundles”

they should be made explicit through metadata

our future catalogue system will provide the ability for depositors to directly create, label and update bundles

See bundles at ELAR

21

Archive material should be selected

example: Depositor’s question: How much video can I archive?

answer: ...

22

What is required to make a deposit?

resource(s) for an endangered language

it could be just one file

inventory / metadata deposit form view

existing deposits can also be updated, added to, and metadata added/modified

23

How can I deliver data?

hard disks: we return them; we send them out

email: good for samples for evaluation; OK for most text materials

Dropbox etc

flash cards and USB sticks

a web upload facility may be provided one day

we download from your server

24

What about CDs and DVDs?

we have found CDs, and especially DVDs, to be very unreliable

DVD fail rate > 10%

cause confusion as files are allocated to fit on disks, not according to corpus structure

create a lot of work for depositors and for ELAR

25

Protocol

the sensitivities and access restrictions associated with EL resources

need to be discussed, collected and recorded in the field

global protocol (the overall, typical value) is entered into the deposit form

specific protocol (for files, bundles) is entered via metadata (or any other explicit way)

26

Protocol and access control

principles:

granularity – file, bundle or collection

access is a relation between object and user

protocol values can be changed over time

ELAR’s URCS system: User, Researcher, Community member, Subscriber
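The principles on this slide can be illustrated with a small Python sketch. It is not ELAR's implementation: the role letters come from the URCS list, but the sample protocol values, the path layout ("bundle/file") and the rule that the most specific protocol wins are all assumptions made for illustration.

```python
# Role letters follow the URCS list (User, Researcher, Community member, Subscriber).
# All values below are invented examples, not real ELAR settings.
collection_protocol = {"U", "R", "C", "S"}        # global value, e.g. from the deposit form
bundle_protocols = {"bundle01": {"R", "C"}}       # specific protocol for a whole bundle
file_protocols = {"bundle01/song.wav": {"C"}}     # specific protocol for a single file


def allowed_roles(file_path):
    """Return the roles allowed to open a file, using the most specific protocol set."""
    if file_path in file_protocols:
        return file_protocols[file_path]
    bundle = file_path.split("/", 1)[0]
    if bundle in bundle_protocols:
        return bundle_protocols[bundle]
    return collection_protocol


def can_access(user_roles, file_path):
    """Access is a relation between object and user: any shared role grants access."""
    return bool(set(user_roles) & allowed_roles(file_path))
```

Because the protocol values live in ordinary tables, changing one over time is a single update, for example `file_protocols["bundle01/song.wav"] = {"R", "C"}`.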

27

“I have images”

what kinds of images?

what are their sources?

what is their documentation value?

what role do they play in the collection?

… these should be reflected in the data structures/metadata

28

Metadata for images

at least captions

what else?

… … … …

in what form? narrative tabular fields keywords

29

Integrating images into metadata

get a list of image files:

open a command (DOS) window in the directory

type “dir > list.txt”

open the text file (in Notepad++ or MS Word)

change font to Courier

get a “vertical selection”

(or use a file listing utility!)

paste into spreadsheet
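As an alternative to the command window, a short Python script can act as the "file listing utility". This is a minimal sketch: the set of image extensions and the output name `list.txt` are assumptions, and it lists only files directly inside the given directory.

```python
from pathlib import Path

# Assumed set of image extensions; adjust to match your own collection.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".tif", ".tiff", ".gif", ".bmp"}


def list_image_files(directory):
    """Return the sorted names of image files found directly in `directory`."""
    return sorted(
        p.name
        for p in Path(directory).iterdir()
        if p.is_file() and p.suffix.lower() in IMAGE_EXTENSIONS
    )


def write_listing(directory, out_name="list.txt"):
    """Write one file name per line, ready to paste into a spreadsheet column."""
    names = list_image_files(directory)
    out_path = Path(directory) / out_name
    out_path.write_text("\n".join(names) + "\n", encoding="utf-8")
    return out_path
```

Since the output holds bare file names only, no "vertical selection" is needed before pasting.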

30

Integrating images into metadata

make a new sheet for images

paste in image file list (see previous)

add an ID column: type “1” in first cell; select from first to last cell in ID column; Edit>Fill>Series>OK

add other columns

now you can refer to your images anywhere!
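The same images sheet can also be produced outside the spreadsheet program. The sketch below assumes a list of file names is already at hand and writes CSV text with a running ID; the column names `id`, `filename`, `caption` and `keywords` are illustrative choices, not an ELAR requirement.

```python
import csv
import io


def build_image_sheet(file_names, extra_columns=("caption", "keywords")):
    """Return rows for an images sheet: a header, then one row per file with a running ID."""
    rows = [["id", "filename", *extra_columns]]
    for image_id, name in enumerate(file_names, start=1):
        # Extra columns start empty and are filled in later by hand.
        rows.append([image_id, name] + [""] * len(extra_columns))
    return rows


def rows_to_csv(rows):
    """Serialise rows as CSV text that a spreadsheet program can open."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()
```

The ID column plays the same role as Edit>Fill>Series: every image gets a stable number that other sheets can refer to.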

31

Using spreadsheet to access data

you can turn a filename into a link to access files directly from a spreadsheet

have the filename in cells

use the formula =HYPERLINK(file, "Message")

examples

=HYPERLINK("E:\archiving\images\"&A2, "click here")

=HYPERLINK(A1&A2, "click here")

=HYPERLINK(A1&A2, A2)
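When a file list is generated by script, the link formulas can be generated as text in the same pass. This is a sketch under assumptions: the helper name `hyperlink_formula` is hypothetical, and it uses a comma as the argument separator as in the examples above (some spreadsheet locale settings expect a semicolon instead).

```python
def hyperlink_formula(folder, file_name, label="click here"):
    """Build an =HYPERLINK(...) formula string for one file."""

    def quote(text):
        # Spreadsheet string literals escape a double quote by doubling it.
        return '"' + text.replace('"', '""') + '"'

    return f"=HYPERLINK({quote(folder + file_name)}, {quote(label)})"


# The folder path and file name below are examples only.
example = hyperlink_formula("E:\\archiving\\images\\", "photo1.jpg")
```

Writing the resulting strings into a CSV column gives clickable links once the file is opened in a spreadsheet program, provided the program evaluates formulas on import.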

32

My cells have multiple values!

example: keywords

this is probably OK, as keywords are atomic

just consistently use a suitable delimiter

e.g. use comma - if data values cannot have commas

ELAR recommends double pipe “||”
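A delimiter only helps if it is applied consistently, which is easy to check by script. The following sketch splits and joins multi-valued cells using the double pipe; the function names are hypothetical, and trimming surrounding spaces is an added convenience rather than part of the recommendation.

```python
DELIMITER = "||"


def split_values(cell):
    """Split a multi-valued cell into a list, dropping empty parts and stray spaces."""
    return [part.strip() for part in cell.split(DELIMITER) if part.strip()]


def join_values(values):
    """Join values into one cell, refusing values that contain the delimiter."""
    for value in values:
        if DELIMITER in value:
            raise ValueError(f"value contains the delimiter: {value!r}")
    return DELIMITER.join(values)
```

Unlike a comma, the double pipe can safely separate values that themselves contain commas, such as "fishing, traditional".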

33

My cells have multiple values!

example: speakers in a recording

speakers are probably not atomic – they have other attributes

create a separate “speakers” sheet

give each speaker an ID (number or initials)

use the IDs in the original sheet, with delimiter (implements one to many)

(advanced) or make another sheet to associate recordings with speakers (implements many to many)
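The two options can be sketched in Python, with dictionaries standing in for sheets. All names, IDs and file names here are invented for illustration; the first function reads delimited IDs from the recordings sheet, and the second derives the separate association sheet.

```python
DELIMITER = "||"

# Stand-in for the "speakers" sheet: one entry per speaker, keyed by ID.
speakers = {
    "S1": {"name": "Speaker One", "gender": "f"},
    "S2": {"name": "Speaker Two", "gender": "m"},
}

# Stand-in for the original sheet: speaker IDs stored in one delimited cell.
recordings = [
    {"file": "rec001.wav", "speaker_ids": "S1||S2"},
    {"file": "rec002.wav", "speaker_ids": "S2"},
]


def speakers_in(recording):
    """Look up the full speaker records for one recording."""
    ids = [i for i in recording["speaker_ids"].split(DELIMITER) if i]
    return [speakers[i] for i in ids]


def association_rows(recordings):
    """Build the association sheet: one (file, speaker ID) row per pairing."""
    rows = []
    for rec in recordings:
        for speaker_id in rec["speaker_ids"].split(DELIMITER):
            if speaker_id:
                rows.append((rec["file"], speaker_id))
    return rows


def recordings_with(speaker_id, rows):
    """Use the association sheet to find every recording a speaker appears in."""
    return [file for file, sid in rows if sid == speaker_id]
```

The association sheet makes the reverse question (which recordings feature a given speaker?) as easy to answer as the forward one.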

34

Expressing “Relation” in spreadsheets

one column is usually insufficient

“relationship” has 2 parts: the target of the relationship; description of the relationship

how would this work for images?
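One possible answer for images, sketched in Python: each row names the item it belongs to, plus the two parts of the relationship. The field names and sample rows are hypothetical and only show the shape of the data.

```python
from typing import NamedTuple


class Relation(NamedTuple):
    source: str       # the item this row belongs to
    target: str       # the target of the relationship
    description: str  # description of the relationship


# Invented sample rows linking recordings to images.
relations = [
    Relation("rec001.wav", "img0001.jpg", "photo of the speaker during recording"),
    Relation("rec001.wav", "img0002.jpg", "scan of the fieldnotes page"),
    Relation("rec002.wav", "img0001.jpg", "same speaker, later session"),
]


def related_to(source, relations):
    """Return (target, description) pairs for one source item."""
    return [(r.target, r.description) for r in relations if r.source == source]
```

In a spreadsheet this corresponds to two columns (target and description) on the source's row, or to a separate relations sheet with three columns.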

35

How can I tell if it’s Unicode?

use a browser or Notepad++

paste text in

examine the encoding (before and after)
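The same check can be scripted. The sketch below inspects raw bytes (for example the result of `open(path, "rb").read()`) and reports whether they decode as UTF-8; the function names and category labels are assumptions. A successful decode is strong evidence of UTF-8 but not proof, since some legacy-encoded files happen to be valid UTF-8 as well.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def describe_encoding(data: bytes) -> str:
    """Give a rough, human-readable guess about a file's encoding."""
    if data.startswith(b"\xef\xbb\xbf"):
        return "UTF-8 with byte order mark"
    if data.startswith((b"\xff\xfe", b"\xfe\xff")):
        return "UTF-16 or UTF-32 (byte order mark found)"
    if data.isascii():
        return "plain ASCII (also valid UTF-8)"
    if looks_like_utf8(data):
        return "UTF-8"
    return "not UTF-8 (possibly a legacy encoding)"
```

Running this before and after a conversion mirrors the "before and after" check on the slide.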

36

Can I still use MS Word?

ELAR no longer accepts MS Word files

but Word is still useful: quicker to type up; useful tables, functions, macros etc

solutions:

think “text only”

tables as spreadsheets (are they bad too?)

(advanced) complex materials formatted as styles, then export as marked up PDF/A – but not a perfect solution

37

End