LAMUS & LAT Archiving software - CLARIN · 2020. 11. 2.


The Language Archive – Max Planck Institute for Psycholinguistics Nijmegen, The Netherlands

LAMUS & LAT Archiving software

Daan Broeder

Max-Planck Institute for Psycholinguistics

•  MPI for Psycholinguistics research corpora: child language, bilingualism, gesture, sign language, Corpus Spoken Dutch, second-language learner corpora, etc.

•  Archive for the DOBES project
•  Hosting (and inviting) corpora for other projects in need (UNESCO study: 80% of all material is endangered)
    –  DBD, NGT, Leiden Univ. language documentation corpora
    –  Donated endangered language corpora
    –  Eibl-Eibesfeldt human ethology collection
•  Maintain a metadata catalog for properly described resources from other institutes
    –  BAS, C-ORAL-ROM (Univ. Florence), …
    –  LR from Lund Univ, INL, other archive partners
•  Copy of CHILDES and Talkbank corpora from CMU

Mainly annotated audio/video recordings

50 TB: 200k MD records, 250k AV resources, 200k annotation files, lexicons, sketch grammars, etc.

The Language Archive - 2011

History

•  Started in 2000 to try to solve the mounting data chaos at the MPI for Psycholinguistics
•  First needed proper data descriptions
•  Archive software development linked to the IMDI metadata set for Language Resources
•  First archive was basically a file system with metadata descriptions and resource files
•  Tools operating directly on the files
•  A researcher’s notebook disk was just as sophisticated

IMDI – ISLE Metadata Initiative

•  Metadata schema for Language Resources
•  Developed from 2000, also in several EU projects: ISLE, ECHO, INTERA
•  Especially multi-media/multi-modal recordings
•  3 XML metadata schemas + special profiles for specific communities: Sign-Language, SL-acquisition, …


•  Archiving formats only

•  Metadata in XML files

•  Relations represented by links

•  DBs only as helpers

•  Data safety through HSM, pushing data to TLs

TLA ARCHIVE

[Figure: TLA archive tree. The C and S nodes are IMDI metadata; the M and T nodes are resources.]

TLA Archive Organization

[Figure: example corpus tree. Node labels: language, expedition, age group, genre, sessionX, media file, annot. file]

[Figure: Archive Access overview. Local side: local tools (ARBIL, ELAN), WWW browser, local data (media files, metadata, annotations). Archive side: IMDI-Browser, HTTP server, LAMUS, AMS, LARI, TROVE. Functions labeled: resource download, browsing/search/visualization, upload data.]

All resources accessible by HTTP if authorized

PID service

All web-apps can be configured to use either Shibboleth or a local LDAP for authentication

[Figure: Archive Administration overview. Components labeled: imdidb (corpus structure), amsdb, annexdb, lamusdb, LAMUS, AMS, crawler, archive, archive manager, content search, IMDI lucene idx, IMDI search, IMDI browser; the components are connected through APIs.]

Archive Administration

Why ‘user managed’ deposition?

•  Increasing costs
    –  New, cheaper technologies for recording, digitization and storage cause a huge increase in data quantities.
•  Using depositor knowledge
    –  Researcher/depositor knows where to put the data in the logical structure (catalogue) of the archive.
    –  Communication with archive managers is overhead.
•  Offer remote archiving services
    –  Support distributed projects
•  Stricter checking
    –  Make checks explicit
    –  Archive managers have short contracts, knowledge seems to get lost.
•  Maximizing deposition
    –  80 percent of all recordings are in danger (UNESCO report)
    –  We want to open our archive for external depositors
    –  But cannot afford extra workload for archive managers

LAMUS is a web-application that allows
•  Uploading and naming individual resources (media, annotations, information files)
•  Specifying ‘limited’ metadata and mutual relations for and between resources
•  Creating relevant linguistic groupings for the data (sub-corpora)

LAMUS will:
•  Carry out checks for consistency and coherence: check for accepted formats etc. (configurable list)
•  Update databases and indexes
•  Issue PIDs for the new resources and metadata records
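The format check against a configurable list could be sketched roughly as follows. This is only an illustration: the table of accepted extensions, the resource type names, and the function `check_upload` are assumptions made for the example, not the actual LAMUS configuration or API.

```python
from pathlib import PurePosixPath

# Hypothetical configurable list of accepted formats per resource type.
# A real installation would define its own list.
ACCEPTED_FORMATS = {
    "media": {".wav", ".mpg", ".mp4"},
    "annotation": {".eaf", ".xml", ".txt"},
    "info": {".pdf", ".txt"},
}


def check_upload(filename: str, resource_type: str) -> list[str]:
    """Return the problems found for one uploaded file (empty list if it passes)."""
    accepted = ACCEPTED_FORMATS.get(resource_type)
    if accepted is None:
        return [f"unknown resource type: {resource_type}"]
    # Compare extensions case-insensitively
    suffix = PurePosixPath(filename).suffix.lower()
    if suffix not in accepted:
        shown = suffix or "(none)"
        return [f"{filename}: format {shown} not accepted for {resource_type}"]
    return []
```

Returning a list of problems instead of raising on the first one lets a depositor see every rejected file in a single pass.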

LAMUS

[Figure: check-out/check-in cycle between the archive, the LAMUS workspace, and the local disk]

•  Steps: check out, modify/add/.., check in
•  Workspace content is added to the original after a consistency check and versioning
•  Local tools: Arbil, ELAN, Shoebox, …
•  Arrows labeled: using Arbil, using LAMUS

Corpus check-out check-in cycle

TLA – Versioning of resources

TLA versioning policy
•  Nothing actually gets deleted
•  Users can delete resources, which are then removed from the visible collection (corpus tree) but remain in the archive
•  Users can update (replace) existing resources
    –  The new version will get a new PID
    –  Old versions will be shelved but keep their PID
•  Access to old versions is managed by the owner
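A toy in-memory model may help make the policy concrete. It is a sketch under stated assumptions, not the TLA implementation: the class `ToyArchive`, its method names, and the `pid:N` identifier scheme are all invented for illustration.

```python
import itertools


class ToyArchive:
    """Illustrative model: versions are never removed, only hidden or shelved."""

    def __init__(self):
        self.versions = {}   # pid -> {"content": ..., "shelved": bool}; never shrinks
        self.tree = {}       # visible corpus tree: path -> pid of current version
        self._ids = itertools.count(1)

    def _new_pid(self):
        # Placeholder PID scheme, for illustration only
        return f"pid:{next(self._ids)}"

    def add(self, path, content):
        pid = self._new_pid()
        self.versions[pid] = {"content": content, "shelved": False}
        self.tree[path] = pid
        return pid

    def update(self, path, content):
        # The old version is shelved but keeps its PID; the new one gets a new PID
        self.versions[self.tree[path]]["shelved"] = True
        return self.add(path, content)

    def delete(self, path):
        # Removes the resource from the visible tree only; it stays in the archive
        return self.tree.pop(path)
```

The point of keeping `versions` and `tree` separate is that a PID handed out once keeps resolving to the same content, whatever happens later in the visible tree.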


•  User role administration: archive manager, domain curator, domain manager, domain editor
•  Set a required license
•  Set access rules per media type: annotations, images, audio, video, info
•  A rule sets access/denial to user/group for type of data
•  Special groups: ‘all’, ‘registered user’
•  Rules have priority
•  Inheritance of rules by descendant nodes
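How prioritized, inherited rules might be evaluated can be sketched as below. The slides do not specify the AMS resolution algorithm, so several choices here are assumptions: a larger priority number wins, a rule nearer to the node wins a tie, and access is denied when no rule matches. The names `Rule` and `is_allowed` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    group: str        # e.g. "all", "registered user", or a named group
    media_type: str   # "annotations", "images", "audio", "video" or "info"
    allow: bool       # True grants access, False denies it
    priority: int     # assumption: the larger number wins


def is_allowed(node_path, user_groups, media_type, rules_by_node):
    """Decide access for a node using its own rules and those of its ancestors."""
    parts = node_path.strip("/").split("/")
    best = None
    # Walk from the node up to the root; with a strict ">" comparison the
    # nearest rule wins when priorities are equal (assumption).
    for depth in range(len(parts), 0, -1):
        ancestor = "/".join(parts[:depth])
        for rule in rules_by_node.get(ancestor, []):
            if rule.media_type != media_type or rule.group not in user_groups:
                continue
            if best is None or rule.priority > best.priority:
                best = rule
    # Assumption: no matching rule means no access
    return best.allow if best is not None else False
```

In this sketch the caller passes every group the user belongs to, including the special group ‘all’, so a rule for ‘all’ applies to anonymous and registered users alike.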

[Figure: corpus tree (C, S and M nodes) with Rule 1, Rule 2 and Rule 3 attached at different nodes]

AMS – Access Management System

Sign academic license

IMDI-Browser & Metadata Search

•  Browse the hierarchy of corpora
•  Inspect metadata records
•  Create bookmarks
    –  resources
    –  IMDI-Browser showing resources
•  Show PIDs, URLs for resources and metadata
•  Make resource access requests
•  Search the metadata:
    –  simple keyword,
    –  complex queries

IMDI-Browser as a jump board

http://corpus1.mpi.nl/ds/imdi_browser?openpath=MPI541199%23
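In the link above, the `openpath` parameter carries a node identifier whose trailing `%23` is a percent-encoded `#`. A small sketch of building such a link is shown here; the helper name `openpath_url` is hypothetical, and the identifier `MPI541199#` is simply what the URL on the slide decodes to.

```python
from urllib.parse import quote, unquote

BASE = "http://corpus1.mpi.nl/ds/imdi_browser"


def openpath_url(node_id: str) -> str:
    """Build a browser link that opens the corpus tree at the given node."""
    # safe="" forces "#" to be encoded as %23; a raw "#" would start a URL
    # fragment and never reach the server.
    return f"{BASE}?openpath={quote(node_id, safe='')}"


print(openpath_url("MPI541199#"))
print(unquote("MPI541199%23"))
```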

Publishing resources

Regional Archives Initiative: cooperation of TLA/MPI-PL with other organizations interested in endangered language (EL) archiving. They use the TLA LAT archiving software.
•  Encourage local resource collecting & archiving
•  Network of South American archives has been established and contacts with CLARA were made

Regional Archives Initiative

Synchronization physical structure
•  Use “rsync” software
•  Complete replication
•  No special conditions possible
•  Use for backup to computing centers

Synchronization logical structure
•  Special software needed
•  Per corpus copy to a selected target
•  Owner can make special exceptions
•  Use to synchronize between archives

[Figure: logical synchronization between two corpus trees]

Data Synchronization I

[Figure: COSIX set-up. Labels: LAMUS archive, API, HTTP server, COSIX, with a corpus tree on each side]

COSIX: complex logic to compare corpus trees and determine
•  what is new
•  what to replace
•  what to add
•  what to delete
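The basic idea of comparing two corpus trees can be sketched as a set comparison. This is not the COSIX logic, which the slide describes as complex; flattening each tree to a `{path: checksum}` map and the function name `compare_trees` are assumptions made for the example.

```python
def compare_trees(source: dict, target: dict) -> dict:
    """Compare two corpus trees given as {path: checksum} maps.

    Returns the paths to add to, replace in, and delete from the target
    so that it matches the source.
    """
    src_paths, tgt_paths = set(source), set(target)
    return {
        # Present in the source only: new material to copy
        "add": sorted(src_paths - tgt_paths),
        # Present in both but with different content
        "replace": sorted(p for p in src_paths & tgt_paths if source[p] != target[p]),
        # Present in the target only: no longer in the source
        "delete": sorted(tgt_paths - src_paths),
    }
```

A real synchronizer would also have to respect the exceptions set by the corpus owner and the archive's policy of never actually deleting data, both of which this sketch ignores.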

Data Synchronization II

In cooperation with CMU, COSIX is used to copy CHILDES and Talkbank corpora into our archive. CMU generates IMDI records on the fly from their DBs.

Technical Info

•  Java web-applications running inside Tomcat servlet container
•  PostgreSQL DBMS
•  Platform: Linux
•  Web-app frameworks: JSP, Applets, JSF, FLEX, Wicket, …
•  Works with most web browsers (Internet Explorer, Firefox, Opera, Safari)

LAMUS & LAT Future

•  TLA is part of CLARIN and is promoting CMDI, so …
•  We are planning the transition from LAMUS-IMDI to LAMUS-CMDI
•  We analyzed our set-up and still like the LAT fundamentals, e.g. file based, modularity, …
•  But we will also address some current problems and inconveniences:
    –  Limited metadata editing in LAMUS
    –  Insufficient provenance tracking of resources
    –  Handling of the download/modify/upload cycle
    –  Integration with other (LAT) archives and infrastructures

THANK YOU FOR YOUR ATTENTION


CLARIN has received funding from the European Community's Seventh Framework Programme under grant agreement n° 212230
