The Language Archive – Max Planck Institute for Psycholinguistics Nijmegen, The Netherlands LAMUS & LAT Archiving software Daan Broeder Max-Planck Institute for Psycholinguistics

LAMUS & LAT Archiving software - CLARIN · 2020. 11. 2. · LAMUS is a web-application that allows • Uploading and naming individual resources (media, annotations, information files)


The Language Archive – Max Planck Institute for Psycholinguistics Nijmegen, The Netherlands

LAMUS & LAT Archiving software

Daan Broeder

Max-Planck Institute for Psycholinguistics


The Language Archive - 2011

•  MPI for Psycholinguistics research corpora: child language, bilingualism, gesture, sign language, Corpus Spoken Dutch, second-language learner corpora, etc.
•  Archive for the DOBES project
•  Hosting (and inviting) corpora for other projects in need (UNESCO study: 80% of all material is endangered)
   –  DBD, NGT, Leiden Univ. language documentation corpora
   –  Donated endangered language corpora
   –  Eibl-Eibesfeldt human ethology collection
•  Maintain a metadata catalog for properly described resources from other institutes
   –  BAS, C-ORAL-ROM (Univ. Florence), …
   –  LR from Lund Univ., INL, other archive partners
•  Copy of CHILDES and Talkbank corpora from CMU

Mainly annotated audio/video recordings

50 TB: 200k MD records, 250k AV resources, 200k annotation files, lexicons, sketch grammars, etc.


History

•  Started in 2000 to try to solve the mounting data chaos at the MPI for Psycholinguistics
•  First needed proper data descriptions
•  Archive software development linked to the IMDI metadata set for Language Resources
•  First archive was basically a file-system with metadata descriptions and resource files
•  Tools operating directly on the files
•  A researcher’s notebook disk was just as sophisticated


IMDI – ISLE Metadata Initiative

•  Metadata schema for Language Resources
•  Developed from 2000, also in several EU projects: ISLE, ECHO, INTERA
•  Especially multi-media/multi-modal recordings
•  3 XML metadata schemas + special profiles for specific communities: Sign-Language, SL-acquisition, …


•  Archiving formats only

•  Metadata in XML files

•  Relations represented by links

•  DBs only as helpers

•  Data safety through HSM, pushing data to TLs

TLA Archive Organization

[Figure: tree diagram of the TLA archive, with IMDI metadata nodes above the resource files. Labels in the figure: language, expedition, age group, genre, sessionX, media file, annot. file]


Archive Access

[Figure: archive access architecture. Labels in the figure: Local tools (ARBIL, ELAN); LOCAL DATA (media files, metadata, annotations); WWW browser; ARCHIVE; IMDI-Browser; HTTP server; resource download; Browsing/Search/Visualization; LAMUS; AMS; Upload data; LARI; TROVE; PID service]

All resources accessible by HTTP if authorized

All web-apps can be configured to use either Shibboleth or a local LDAP for authentication


Archive Administration

[Figure: archive administration components around the corpus tree. Labels in the figure: imdidb, corpus structure, amsdb, annexdb, lamusdb, LAMUS, AMS, crawler, archive, archive manager, content search, IMDI lucene idx, IMDI search, IMDI browser, API]


Why ‘user managed’ deposition?

•  Increasing costs
   –  New cheaper technologies for recording, digitization and storage cause a huge increase in data quantities.
•  Using depositor knowledge
   –  Researcher/depositor knows where to put the data in the logical structure (catalogue) of the archive.
   –  Communication with archive managers is overhead.
•  Offer remote archiving services
   –  Support distributed projects
•  Stricter checking
   –  Make checks explicit
   –  Archive managers have short contracts, knowledge seems to get lost.
•  Maximizing deposition
   –  80 percent of all recordings are in danger (UNESCO report)
   –  We want to open our archive for external depositors
   –  But cannot afford extra workload for archive managers


LAMUS

LAMUS is a web-application that allows
•  Uploading and naming individual resources (media, annotations, information files)
•  Specifying ‘limited’ metadata and mutual relations for and between resources
•  Creating relevant linguistic groupings for the data (sub-corpora)

LAMUS will:
•  Carry out checks for consistency and coherence: check for accepted formats etc. (configurable list)
•  Update databases and indexes
•  Issue PIDs for the new resources and metadata records
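The accepted-format check against a configurable list could look roughly like the sketch below. This is not LAMUS code: the function name, the resource kinds and the extensions in the list are all hypothetical and only illustrate the idea of a check driven by configuration.

```python
from pathlib import PurePosixPath

# Hypothetical configurable list: resource kind -> accepted file extensions.
# The entries are illustrative, not the archive's real configuration.
ACCEPTED_FORMATS = {
    "media": {".wav", ".mpg", ".mp4"},
    "annotation": {".eaf", ".txt"},
    "info": {".pdf", ".txt"},
}


def check_upload(filename, kind, accepted=ACCEPTED_FORMATS):
    """Return a list of problems; an empty list means the file passes."""
    problems = []
    suffix = PurePosixPath(filename).suffix.lower()
    if kind not in accepted:
        problems.append(f"unknown resource kind: {kind}")
    elif suffix not in accepted[kind]:
        problems.append(f"format {suffix or '(none)'} not accepted for {kind}")
    return problems


if __name__ == "__main__":
    print(check_upload("session1.eaf", "annotation"))
    print(check_upload("clip.exe", "media"))
```

Returning a list of problems instead of a single boolean keeps every check explicit, so the depositor can be told exactly what to fix before the upload is accepted.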


[Figure: diagram with the labels ARCHIVE, WORKSPACE and local disk]


Corpus check-out check-in cycle

[Figure: cycle between The Archive and a workspace: check out, modify/add/.., check in. Labels in the figure: Using Arbil, using LAMUS]

Add to original after
•  consistency check
•  versioning

Local tools:
•  Arbil
•  ELAN
•  Shoebox
•  …


TLA – Versioning of resources

TLA versioning policy
•  Nothing actually gets deleted
•  Users can delete resources, which are then removed from the visible collection (corpus tree) but remain in the archive
•  Users can update (replace) existing resources
   –  The new version will get a new PID
   –  Old versions will be shelved but keep their PIDs
•  Access to old versions is managed by the owner
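The versioning policy above can be sketched as a toy in-memory model. Everything here is an assumption made for illustration: the class and attribute names, and the `pid-N` strings, which stand in for real persistent identifiers issued by a PID service. Owner-managed access to old versions is not modelled.

```python
import itertools
from dataclasses import dataclass, field


@dataclass
class Archive:
    """Toy model of the policy; PIDs are made-up strings."""
    stored: dict = field(default_factory=dict)   # pid -> content, never shrinks
    tree: dict = field(default_factory=dict)     # visible name -> current pid
    shelved: dict = field(default_factory=dict)  # name -> pids no longer visible
    _counter: itertools.count = field(default_factory=lambda: itertools.count(1))

    def _new_pid(self):
        return f"pid-{next(self._counter)}"

    def add(self, name, content):
        pid = self._new_pid()
        self.stored[pid] = content
        self.tree[name] = pid
        return pid

    def update(self, name, content):
        # The old version is shelved and keeps its PID; the new one gets a new PID.
        self.shelved.setdefault(name, []).append(self.tree[name])
        return self.add(name, content)

    def delete(self, name):
        # Removed from the visible corpus tree only; the content stays stored.
        self.shelved.setdefault(name, []).append(self.tree.pop(name))


if __name__ == "__main__":
    archive = Archive()
    first = archive.add("rec.wav", b"v1")
    second = archive.update("rec.wav", b"v2")
    archive.delete("rec.wav")
    print(first, second, sorted(archive.stored), archive.tree)
```

Because `stored` is only ever added to, a PID that was handed out once keeps resolving to the same content, which is the point of shelving rather than deleting.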


AMS – Access Management System

•  User role administration: archive manager, domain curator, domain manager, domain editor
•  Set a required license
•  Set access rules per media type: annotations, images, audio, video, info
•  A rule sets access/denial to user/group for type of data
•  Special groups: ‘all’, ‘registered user’
•  Rules have priority
•  Inheritance of rules by descendant nodes

[Figure: corpus tree with Rule 1, Rule 2 and Rule 3 attached to nodes. Further label in the figure: Sign academic license]
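How prioritised rules and inheritance by descendant nodes might combine can be sketched as follows. The slides do not say how AMS resolves conflicts, so this sketch makes three assumptions that are marked in the code: the highest priority number wins, a nearer node wins a tie, and access is denied when no rule applies. All class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    principal: str  # a user name, a group name, "all" or "registered user"
    data_type: str  # "annotations", "images", "audio", "video" or "info"
    allow: bool
    priority: int   # assumption: the higher number wins


class Node:
    def __init__(self, name, parent=None):
        self.name, self.parent, self.rules = name, parent, []

    def inherited_rules(self):
        """Rules set on this node, followed by those of its ancestors."""
        node = self
        while node is not None:
            yield from node.rules
            node = node.parent


def may_access(node, user, groups, data_type):
    # user=None stands for an anonymous visitor.
    principals = {"all"} | set(groups)
    if user is not None:
        principals |= {user, "registered user"}
    matching = [r for r in node.inherited_rules()
                if r.data_type == data_type and r.principal in principals]
    if not matching:
        return False  # assumption: deny when no rule applies
    # max() keeps the first of equal priorities, i.e. the nearest node's rule.
    return max(matching, key=lambda r: r.priority).allow


if __name__ == "__main__":
    corpus = Node("corpus")
    session = Node("session", parent=corpus)
    corpus.rules.append(Rule("all", "video", False, 1))
    session.rules.append(Rule("registered user", "video", True, 2))
    print(may_access(session, None, [], "video"))
    print(may_access(session, "anna", [], "video"))
```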


IMDI-Browser & Metadata Search

•  Browse the hierarchy of corpora
•  Inspect metadata records
•  Create bookmarks
   –  resources
   –  IMDI-Browser showing resources
•  Show PIDs, URLs for resources and metadata
•  Make resource access requests
•  Search the metadata:
   –  simple keyword
   –  complex queries


IMDI-Browser as a jump board


http://corpus1.mpi.nl/ds/imdi_browser?openpath=MPI541199%23


Publishing resources


Regional Archives Initiative

Cooperation of TLA/MPI-PL with other organizations interested in EL archiving. They use TLA LAT archiving software.
•  Encourage local resource collecting & archiving
•  Network of South American archives has been established and contacts with CLARA were made


Data Synchronization I

Synchronization physical structure
•  Use “rsync” software
•  Complete replication
•  No special conditions possible
•  Use for backup to computing centers

Synchronization logical structure
•  Special software needed
•  Per corpus copy to a selected target
•  Owner can make special exceptions
•  Use to synchronize between archives

[Figure: Logical synchronization, shown with two corpus trees]


Data Synchronization II

[Figure: two corpus trees with the labels LAMUS archive, API, HTTP server and COSIX]

COSIX: complex logic to compare corpus trees and determine
•  what is new
•  what to replace
•  what to add
•  what to delete

In a cooperation with CMU, COSIX is used to copy CHILDES and Talkbank corpora into our archive. CMU generates IMDI records on the fly from their DBs.
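The kind of comparison COSIX performs can be illustrated with a much simplified sketch. It is not COSIX's actual logic: it assumes each corpus tree has been flattened into a mapping from node path to a content checksum, and it folds "what is new" and "what to add" into a single category. The function names and the example paths are hypothetical.

```python
import hashlib


def checksum(data: bytes) -> str:
    """Content fingerprint used to decide whether a node has changed."""
    return hashlib.sha256(data).hexdigest()


def compare_trees(source: dict, target: dict) -> dict:
    """Both arguments map a node path to the checksum of its content."""
    src, tgt = set(source), set(target)
    return {
        "add": sorted(src - tgt),                                   # only in source
        "replace": sorted(p for p in src & tgt if source[p] != target[p]),
        "delete": sorted(tgt - src),                                # only in target
    }


if __name__ == "__main__":
    remote = {"corpus/a.imdi": checksum(b"1"), "corpus/b.imdi": checksum(b"2")}
    local = {"corpus/b.imdi": checksum(b"old"), "corpus/c.imdi": checksum(b"3")}
    print(compare_trees(remote, local))
```

Comparing checksums instead of timestamps is a design choice of this sketch; it avoids copying files whose modification time changed but whose content did not.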


Technical Info

•  Java web-applications running inside Tomcat servlet container
•  PostgreSQL DBMS
•  Platform: Linux
•  Web-app frameworks: JSP, Applets, JSF, FLEX, Wicket, …
•  Works with most web browsers (Explorer, Firefox, Opera, Safari)


LAMUS & LAT Future

•  TLA is part of CLARIN and is promoting CMDI, so …
•  We are planning the transition from LAMUS-IMDI to LAMUS-CMDI
•  We analyzed our set-up and still like the LAT foundations, e.g. file based, modularity, …
•  But we will also alleviate some current problems and inconveniences:
   –  Limited metadata editing in LAMUS
   –  Insufficient provenance tracking of resources
   –  Better handling of download/modify/upload cycle
   –  Better integration with other (LAT) archives and infrastructures.


THANK YOU FOR YOUR ATTENTION


CLARIN has received funding from the European Community's Seventh Framework Programme under grant agreement n° 212230