The Language Archive – Max Planck Institute for Psycholinguistics Nijmegen, The Netherlands
LAMUS & LAT Archiving software
Daan Broeder
Max-Planck Institute for Psycholinguistics
• MPI for Psycholinguistics research corpora: child language, bilingualism, gesture, sign language, Corpus Spoken Dutch, second language learner corpora, etc.
• Archive for the DOBES project
• Hosting (and inviting) corpora for other projects in need (UNESCO study: 80% of all material is endangered)
  – DBD, NGT, Leiden Univ. language documentation corpora
  – Donated endangered language corpora
  – Eibl-Eibesfeldt human ethology collection
• Maintain a metadata catalog for properly described resources from other institutes
  – BAS, C-ORAL-ROM (Univ. Florence), …
  – LR from Lund Univ., INL, other archive partners
• Copy of CHILDES and TalkBank corpora from CMU
Mainly annotated audio/video recordings
50 TB: 200k MD records, 250k AV resources, 200k annotation files, lexicons, sketch grammars, etc.
The Language Archive - 2011
History
• Started in 2000 to try to solve the mounting data chaos at the MPI for Psycholinguistics
• First needed proper data descriptions
• Archive software development linked to the IMDI metadata set for Language Resources
• First archive was basically a file system with metadata descriptions and resource files
• Tools operating directly on the files
• A researcher’s notebook disk was just as sophisticated
IMDI – ISLE Metadata Initiative
• Metadata schema for Language Resources
• Developed from 2000, also in several EU projects: ISLE, ECHO, INTERA
• Especially multi-media/multi-modal recordings
• 3 XML metadata schemas + special profiles for specific communities: Sign-Language, SL-acquisition, …
[Figure: tree diagram with nodes labeled C, S, M and T]
• Archiving formats only
• Metadata in XML files
• Relations represented by links
• DBs only as helpers
• Data safety through HSM, pushing data to TLs
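The idea of metadata in XML files with relations represented by links can be illustrated with a minimal sketch. The element and attribute names (`corpus`, `session`, `link`, `resource`, `href`) and the file names are hypothetical simplifications chosen for this example; they are not the IMDI schema, and the files are held in memory rather than read from an archive.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified metadata files keyed by file name.
# Element and attribute names are illustrative only, not the IMDI schema.
FILES = {
    "root.xml": '<corpus name="root"><link href="kids.xml"/></corpus>',
    "kids.xml": '<corpus name="child-language">'
                '<link href="s1.xml"/><link href="s2.xml"/></corpus>',
    "s1.xml": '<session name="session1">'
              '<resource href="s1.wav"/><resource href="s1.eaf"/></session>',
    "s2.xml": '<session name="session2"><resource href="s2.mpg"/></session>',
}


def walk(file_name, depth=0):
    """Yield (depth, kind, name) for every node reachable through links."""
    node = ET.fromstring(FILES[file_name])
    yield depth, node.tag, node.get("name")
    # Resources are leaves that hang directly off a metadata record.
    for res in node.findall("resource"):
        yield depth + 1, "resource", res.get("href")
    # Links point to other metadata files; following them builds the tree.
    for link in node.findall("link"):
        yield from walk(link.get("href"), depth + 1)


tree = list(walk("root.xml"))
for depth, kind, name in tree:
    print("  " * depth + f"{kind}: {name}")
```

In this sketch the tree exists only as links between files, so a database is needed only as a helper for faster lookup.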
TLA ARCHIVE
[Figure: archive tree in which the C and S nodes are IMDI metadata and the M and T nodes are resources]
TLA Archive Organization
Example corpus tree levels (from the figure): language, expedition, age group, genre, session X, media file, annotation file
[Figure: archive access. Local data side: local tools (ARBIL, ELAN) and a WWW browser working with media files, metadata and annotations. Archive side: IMDI-Browser, HTTP server for resource download, browsing/search/visualization, LAMUS and AMS.]
Archive Access
Upload data
LARI TROVE
All resources accessible by HTTP if authorized
PID service
All web-apps can be configured to use either Shibboleth or a local LDAP for authentication
[Figure: archive administration. Shown are the corpus tree, the databases imdidb (corpus structure), amsdb, annexdb and lamusdb, an IMDI Lucene index, and the components LAMUS, crawler, archive manager, content search, IMDI search, IMDI browser and AMS, connected through APIs.]
Archive Administration
Why ‘user managed’ deposition?
• Increasing costs
  – New cheaper technologies for recording, digitization and storage cause a huge increase in data quantities.
• Using depositor knowledge
  – Researcher/depositor knows where to put the data in the logical structure (catalogue) of the archive.
  – Communication with archive managers is overhead.
• Offer remote archiving services
  – Support distributed projects
• Stricter checking
  – Make checks explicit
  – Archive managers have short contracts, so knowledge seems to get lost.
• Maximizing deposition
  – 80 percent of all recordings are in danger (UNESCO report)
  – We want to open our archive to external depositors
  – But cannot afford extra workload for archive managers
LAMUS is a web application that allows:
• Uploading and naming individual resources (media, annotations, information files)
• Specifying ‘limited’ metadata and mutual relations for and between resources
• Creating relevant linguistic groupings for the data (sub-corpora)
LAMUS will:
• Carry out checks for consistency and coherence: check for accepted formats etc. (configurable list)
• Update databases and indexes
• Issue PIDs for the new resources and metadata records
LAMUS
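A check against a configurable list of accepted formats might look roughly like the sketch below. The accepted-format table and the function name `check_upload` are assumptions made for illustration; the slides do not say which formats LAMUS accepts or how it inspects files, and a real check would likely look at file content, not just the extension.

```python
from pathlib import PurePosixPath

# Hypothetical accepted-format list; an archive would configure its own.
ACCEPTED = {
    ".wav": "audio",
    ".mpg": "video",
    ".eaf": "annotation",
    ".imdi": "metadata",
    ".pdf": "info",
}


def check_upload(file_names):
    """Split uploaded file names into accepted (name, kind) pairs and rejects."""
    accepted, rejected = [], []
    for name in file_names:
        suffix = PurePosixPath(name).suffix.lower()
        kind = ACCEPTED.get(suffix)
        if kind is None:
            rejected.append(name)
        else:
            accepted.append((name, kind))
    return accepted, rejected


ok, bad = check_upload(["a.WAV", "b.eaf", "c.doc"])
print(ok)   # files that pass the format check, with their resource kind
print(bad)  # files the depositor must convert or remove
```

Keeping the list in configuration rather than in code lets archive managers tighten or relax the policy without changing the application.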
[Figure: the archive, a workspace and the local disk, connected by the steps check out, modify/add/.., check in; labels "Using Arbil" and "using LAMUS"]
Add to original after:
• consistency check
• versioning
Local tools:
• Arbil
• ELAN
• Shoebox
• …
Corpus check-out check-in cycle
TLA – Versioning of resources
TLA versioning policy
• Nothing actually gets deleted
• Users can delete resources, which are then removed from the visible collection (corpus tree) but remain in the archive
• Users can update (replace) existing resources
  – The new version will get a new PID
  – Old versions will be shelved but keep their PID
• Access to old versions is managed by the owner
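The versioning policy can be sketched as a small in-memory model. The class name `Archive`, its methods and the `pid:N` identifier format are hypothetical; the sketch only mirrors the stated rules (nothing is deleted, a replacement gets a new PID, old versions keep theirs) and does not model owner-managed access to old versions.

```python
import itertools


class Archive:
    """Toy archive in which nothing is ever physically deleted."""

    def __init__(self):
        self._counter = itertools.count(1)
        self.store = {}    # pid -> content; entries are never removed
        self.visible = {}  # name -> current pid (the visible corpus tree)
        self.shelved = {}  # name -> pids of older or deleted versions

    def _new_pid(self):
        return f"pid:{next(self._counter)}"  # hypothetical PID format

    def put(self, name, content):
        """Add a resource, or replace it; a replacement gets a new PID."""
        pid = self._new_pid()
        self.store[pid] = content
        if name in self.visible:
            # The old version is shelved but keeps its PID.
            self.shelved.setdefault(name, []).append(self.visible[name])
        self.visible[name] = pid
        return pid

    def delete(self, name):
        """Remove a resource from the visible tree only."""
        pid = self.visible.pop(name)
        self.shelved.setdefault(name, []).append(pid)
        return pid


archive = Archive()
first = archive.put("session1.eaf", "version 1")
second = archive.put("session1.eaf", "version 2")
archive.delete("session1.eaf")
print(archive.visible, archive.shelved, len(archive.store))
```

Because every PID keeps resolving to the same content, citations of an older version remain valid after an update or a delete.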
• User role administration: archive manager, domain curator, domain manager, domain editor
• Set a required license
• Set access rules per media type: annotations, images, audio, video, info
• A rule sets access/denial to user/group for type of data
• Special groups: ‘all’, ‘registered user’
• Rules have priority
• Inheritance of rules by descendant nodes
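How prioritized rules and inheritance could combine is sketched below. The slides state only that rules have priority and are inherited by descendant nodes, so everything else is an assumption of this sketch: a higher number means higher priority, ties go to the rule nearest the node, access is denied when no rule matches, and the names `Rule`, `Node` and `is_allowed` are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Rule:
    priority: int     # assumption: higher number wins
    group: str        # e.g. "all" or "registered user"
    media_type: str   # annotations, images, audio, video, info
    allow: bool


@dataclass
class Node:
    name: str
    parent: Optional["Node"] = None
    rules: list = field(default_factory=list)


def is_allowed(node, user_groups, media_type):
    """Resolve access using the rules on the node and all its ancestors."""
    groups = set(user_groups) | {"all"}  # everyone belongs to 'all'
    candidates = []
    distance = 0
    current = node
    while current is not None:
        for rule in current.rules:
            if rule.media_type == media_type and rule.group in groups:
                # Sort key: priority first, then nearness to the node.
                candidates.append((rule.priority, -distance, rule.allow))
        current = current.parent
        distance += 1
    if not candidates:
        return False  # assumption: deny when no rule applies
    return max(candidates)[2]


root = Node("corpus", rules=[
    Rule(1, "all", "annotations", True),
    Rule(1, "all", "video", False),
])
sub = Node("subcorpus", parent=root,
           rules=[Rule(2, "registered user", "video", True)])
session = Node("session", parent=sub)

print(is_allowed(session, [], "annotations"))
print(is_allowed(session, [], "video"))
print(is_allowed(session, ["registered user"], "video"))
```

Attaching rules high in the tree and overriding them lower down keeps the number of rules small even for large corpora.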
[Figure: corpus tree with Rule 1, Rule 2 and Rule 3 attached at different nodes and applying to the resources below them]
AMS – Access Management System
Sign academic license
IMDI-Browser & Metadata Search
• Browse the hierarchy of corpora
• Inspect metadata records
• Create bookmarks
  – resources
  – IMDI-Browser showing resources
• Show PIDs, URLs for resources and metadata
• Make resource access requests
• Search the metadata:
  – simple keyword
  – complex queries
IMDI-Browser as a jump board
http://corpus1.mpi.nl/ds/imdi_browser?openpath=MPI541199%23
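A link of this form can be generated programmatically. The sketch below assumes that the trailing `%23` is a percent-encoded `#` that belongs to the node identifier; the helper name `browser_link` is hypothetical, and only the base URL and the `openpath` parameter come from the slide.

```python
from urllib.parse import urlencode

BASE = "http://corpus1.mpi.nl/ds/imdi_browser"


def browser_link(node_id):
    """Build a link that opens the IMDI-Browser at the given node."""
    # urlencode percent-encodes reserved characters such as '#'.
    return f"{BASE}?{urlencode({'openpath': node_id})}"


link = browser_link("MPI541199#")
print(link)
```

Encoding matters here because a raw `#` would be read by the web browser as the start of a fragment and never reach the server.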
Publishing resources
Regional Archives Initiative: cooperation of TLA/MPI-PL with other organizations interested in EL archiving. They use TLA LAT archiving software.
• Encourage local resource collecting & archiving
• Network of South American archives has been established and contacts with CLARA were made
Regional Archives Initiative
Synchronization physical structure
• Use “rsync” software
• Complete replication
• No special conditions possible
• Use for backup to computing centers
Synchronization logical structure
• Special software needed
• Per corpus copy to a selected target
• Owner can make special exceptions
• Use to synchronize between archives
[Figure: logical synchronization of a corpus sub-tree between two corpus trees]
Data Synchronization I
[Figure: COSIX between a corpus tree served by an HTTP server and the API of the LAMUS archive]
COSIX: complex logic to compare corpus trees and determine
• what is new
• what to replace
• what to add
• what to delete
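The kind of comparison listed above can be sketched as a diff of two trees. This is not COSIX's actual logic, which the slides do not describe. The sketch assumes each tree is flattened to a mapping from path to content, detects changes by checksum, and folds "what is new" and "what to add" into a single category; the function names are hypothetical.

```python
import hashlib


def digest(data: bytes) -> str:
    """Checksum used to decide whether two copies differ."""
    return hashlib.sha256(data).hexdigest()


def compare_trees(source, target):
    """Return a sync plan for making target match source.

    Both arguments map a path in the corpus tree to the file content.
    """
    plan = {"add": [], "replace": [], "delete": [], "unchanged": []}
    for path in sorted(source.keys() | target.keys()):
        if path not in target:
            plan["add"].append(path)
        elif path not in source:
            plan["delete"].append(path)
        elif digest(source[path]) != digest(target[path]):
            plan["replace"].append(path)
        else:
            plan["unchanged"].append(path)
    return plan


remote = {"c/s1.imdi": b"v2", "c/s1.wav": b"audio", "c/s3.imdi": b"new"}
local = {"c/s1.imdi": b"v1", "c/s1.wav": b"audio", "c/s2.imdi": b"old"}
plan = compare_trees(remote, local)
print(plan)
```

Under the archive's versioning policy, a "delete" in such a plan would remove a node from the visible tree only, and a "replace" would create a new version.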
Data Synchronization II
In cooperation with CMU, COSIX is used to copy CHILDES and TalkBank corpora into our archive. CMU generates IMDI records on the fly from their DBs.
Technical Info
• Java web applications running inside the Tomcat servlet container
• PostgreSQL DBMS
• Platform: Linux
• Web-app frameworks: JSP, Applets, JSF, FLEX, Wicket, …
• Works with most web browsers (Explorer, Firefox, Opera, Safari)
LAMUS & LAT Future
• TLA is part of CLARIN and is promoting CMDI, so …
• We are planning the transition from LAMUS-IMDI to LAMUS-CMDI
• We analyzed our set-up and still like the LAT foundations, e.g. file based, modularity, …
• But we will also alleviate some current problems and inconveniences:
  – Limited metadata editing in LAMUS
  – Insufficient provenance tracking of resources
  – Better handling of download/modify/upload cycle
  – Better integration with other (LAT) archives and infrastructures
THANK YOU FOR YOUR ATTENTION
CLARIN has received funding from the European Community's Seventh Framework Programme under grant agreement n° 212230