Depositors’ usage of IMDI metadata Daan Broeder & Alex Klassmann MPI Institute for Psycholinguistics DELAMAN meeting London 2006




IMDI metadata

• Forms with ~150 possible descriptors
  – Describes bundles of related resources
  – Extensive set compared with DC/OLAC
  – But only “name” descriptor is compulsory
• Archive holds
  – ~40000 IMDI sessions or resource bundles (+15000 non-local but available in our DB)
  – Describing ~150000 resources


IMDI Metadata

• The descriptors are hierarchically ordered entries, which concern
  – the event (recording location, date, etc.),
  – the project,
  – the languages involved,
  – the participants,
  – the type and nature of speech,
  – technical information about the resources,
  – access rights
• Values of descriptors can be closed or open vocabularies or free text.
• Users can add prose descriptions at each of these levels, plus project/user-defined keys
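The hierarchy above can be pictured as a nested record. The following is only a minimal sketch: the class and attribute names are hypothetical and loosely modelled on descriptor names such as Actor.Name, not on the actual IMDI schema, and "Unspecified" is assumed as the default for unfilled vocabulary fields.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Actor:
    # One participant; attribute names are illustrative, not the IMDI schema.
    name: str
    role: str = "Unspecified"
    description: str = ""                               # free prose
    keys: Dict[str, str] = field(default_factory=dict)  # user-defined keys


@dataclass
class Session:
    # Only the session name is compulsory; everything else has a default.
    name: str
    country: str = "Unspecified"
    date: str = "Unspecified"
    project_name: str = ""
    content_languages: List[str] = field(default_factory=list)
    genre: str = "Unspecified"                          # vocabulary value
    actors: List[Actor] = field(default_factory=list)
    resources: List[str] = field(default_factory=list)  # media/written files
    access: str = "Unspecified"
    description: str = ""                               # free prose
    keys: Dict[str, str] = field(default_factory=dict)  # project/user keys


# A depositor who fills in only a few descriptors still gets a valid session.
session = Session(
    name="example_session",
    country="Netherlands",
    actors=[Actor(name="spk1", role="Speaker")],
    keys={"Elicitation.Round": "2"},
)
print(session.name, len(session.actors), session.genre)
```

Because everything except the name has a default, a sparsely filled session is still well-formed, which is the situation the later statistics measure.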


Metadata Use

• Documentation of the resources
• Retrieval and reuse: the archive offers tools for:
  – Browsing the archive’s corpora
  – Structured metadata search
    • High precision, low recall
  – Unstructured Google-like metadata search
    • High recall, low precision
• Large set -> not all elements are always relevant
  – Sparsely populated metadata space
  – Search tool shows frequency counts for metadata values, which avoids fruitless searches.
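To illustrate the difference between the two search styles and the value of frequency counts, here is a small sketch over toy data. The sessions, the flat dictionary layout and the function names are assumptions made for illustration; this is not the archive's search tool.

```python
from collections import Counter

# Hypothetical toy sessions: flat dicts mapping descriptor -> value.
sessions = [
    {"Name": "s1", "Country": "Netherlands", "Genre": "Narrative"},
    {"Name": "s2", "Country": "Netherlands", "Genre": "Unspecified"},
    {"Name": "s3", "Country": "Brazil", "Description": "narrative about a hunt"},
]


def value_counts(sessions, descriptor):
    """Count how often each value of one descriptor occurs."""
    return Counter(s[descriptor] for s in sessions if descriptor in s)


def structured_search(sessions, descriptor, value):
    """Exact match on a single descriptor (precise, but misses s3 below)."""
    return [s["Name"] for s in sessions if s.get(descriptor) == value]


def unstructured_search(sessions, term):
    """Case-insensitive substring match over all values of a session."""
    term = term.lower()
    return [
        s["Name"]
        for s in sessions
        if any(term in str(v).lower() for v in s.values())
    ]


print(value_counts(sessions, "Country"))
print(structured_search(sessions, "Genre", "Narrative"))   # ['s1']
print(unstructured_search(sessions, "narrative"))          # ['s1', 's3']
```

Showing the counts before the user searches reveals at a glance which values actually occur, so a query on a value with zero hits can be avoided.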


Depositor Guidance

• In general, depositors are urged to be as complete as possible for documentation purposes

• Some projects have an obligatory set of descriptors to fill in. (CGN, DBD, …)

• Provide training to get familiar with the set and tools

• Provide documentation

• Support by student-assistants and corpus managers
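A project-specific obligatory set could be checked mechanically before deposit. The sketch below assumes hypothetical required sets: the slides name the projects (CGN, DBD) but do not say which descriptors each one requires, so the sets and the helper name are invented for illustration.

```python
# Hypothetical obligatory sets; the real project requirements are not given.
REQUIRED = {
    "CGN": {"Name", "Country", "Content.Language.Id", "Actor.Sex"},
    "DBD": {"Name", "Actor.Age", "Actor.Language.Name"},
}

# Values treated as "not filled in" (empty or a default placeholder).
EMPTY_VALUES = {"", "unknown", "unspecified"}


def missing_descriptors(session, project):
    """Return the obligatory descriptors that are absent or left at a default."""
    required = REQUIRED.get(project, {"Name"})  # only the name is always needed
    return sorted(
        d for d in required
        if str(session.get(d, "")).strip().lower() in EMPTY_VALUES
    )


session = {
    "Name": "example_session",
    "Country": "Netherlands",
    "Actor.Sex": "Unspecified",
}
print(missing_descriptors(session, "CGN"))
# ['Actor.Sex', 'Content.Language.Id']
```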


Observations I

• Often researchers do not fill in all the relevant data at their disposal.
• Some tendency to avoid this time-consuming work, which is oriented to re-usage by others.
• The sheer size of the set may discourage people from starting to fill in data at all.
• Training helps.
• Best results in projects that decided beforehand which descriptors needed to be filled in.
• Of course there are also very committed individuals!
• Corpus managers/student assistants may clean things up.
  – but of limited use, since only the researcher has the specific knowledge
  – can serve as intermediaries.


Observations II

• Analysed only the part of the archive where metadata was specified manually (e.g. CGN was excluded, as were sessions outside the MPI)

• Statistics on the basis of ~25000 remaining sessions

• The data give an impression of how often fields are actually filled in (i.e. not empty and not the default “unknown” or “unspecified”).

• Cannot exclude “repairs” where obvious omissions were repaired by corpus management
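The "actually filled in" criterion can be expressed as a small computation. This is a sketch under stated assumptions: the toy sessions and function names are hypothetical, and the comparison with the defaults is assumed to be case-insensitive.

```python
# Values that do not count as filled in: empty, or a default placeholder.
DEFAULTS = {"", "unknown", "unspecified"}


def is_filled(value):
    """True if the value is present and not an empty/default placeholder."""
    return value is not None and str(value).strip().lower() not in DEFAULTS


def fill_rates(sessions, descriptors):
    """Percentage of sessions (rounded) in which each descriptor is filled in."""
    total = len(sessions)
    rates = {}
    for d in descriptors:
        filled = sum(1 for s in sessions if is_filled(s.get(d)))
        rates[d] = round(100 * filled / total) if total else 0
    return rates


# Hypothetical toy data, not taken from the archive.
sessions = [
    {"Country": "Netherlands", "Region": "", "Genre": "Unspecified"},
    {"Country": "Brazil", "Region": "Unknown", "Genre": "Narrative"},
    {"Country": "Ghana", "Genre": "Unspecified"},
    {"Country": "Unknown", "Region": "Volta", "Genre": "Discourse"},
]
print(fill_rates(sessions, ["Country", "Region", "Genre"]))
# {'Country': 75, 'Region': 25, 'Genre': 50}
```

A count like this cannot tell whether a value was entered by the depositor or repaired later by corpus management, which is the caveat noted above.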


Descriptor name                     total-25000  fl-12000  acqui-10000
Country                                      93        93           99
Address                                      15        21           15
Region                                        7        10           11
Description                                  48        30           77
Key                                          33        17           58
Project.Name                                 90        91           87
Content.Description                          93        95           97
Genre                                        29        44           15
SubGenre                                     23        34           13
Task                                         43        49           34
Modalities                                   80        80           82
Subject                                       3         6            2
Interactivity                                73        72           81
PlanningType                                 53        51           73
Involvement                                  70        71           72
SocialContext                                 6        10            9
EventStructure                                7         9            9
Channel                                       8        10           11
Content.Language.Description                 43        25           67
Content.Language.Id                          91        90           91
Content.Language.Name                        91        90           94
Actor.Language.Description                   33        14           61
Actor.Language.Id                            25        20           53
Actor.Language.Name                          47        37           83
Actor.Role                                   94        97           99
Actor.Name                                   94        95           99
Actor.FullName                               90        93           97
Actor.Code                                   70        68           84
Actor.FamilySocialRole                       24        31           18
Actor.EthnicGroup                            14        20           13
Actor.BirthDate                               5         8            8
Actor.Age                                    44        47           50
Actor.Sex                                    70        69           92
Actor.Education                              13        16           11
Actor.Description                            65        78           56
Actor.Key                                    52        44           68
MediaFile.Type                               85        83           85
MediaFile.Format                             85        83           85
MediaFile.Quality                            18         8           31
WrittenResource.Type                         67        57           71
WrittenResource.SubType                      30        19           35
WrittenResource.Format                       56        42           70
WrittenResource.ContentEncoding               3         7            0
WrittenResource.CharacterEncoding             3        12            0
WrittenResource.LanguageId                    4         1            1


Conclusions

• As can be seen, the sets are far from complete.
• But every field of the schema has been used in some sessions, so it seems that no field in the schema is obsolete
• People find use for the description fields that are available at different levels (~50%)
• Also the user/project defined keys are used (~50%) -> IMDI set is not big enough
• Some keys are not much used
  – Remove?
  – But where then to put this information if it’s available?
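The question of which descriptors are "not much used" can be made concrete with a threshold. In this sketch the figures are copied from the "total" column of the usage table for a handful of descriptors and are assumed to be percentages of sessions; the 10% cut-off and the function name are arbitrary choices for illustration, not something the slides propose.

```python
# Figures from the "total" column of the usage table (subset only),
# assumed here to be percentages of sessions.
usage_total = {
    "Country": 93,
    "Region": 7,
    "Subject": 3,
    "SocialContext": 6,
    "Actor.BirthDate": 5,
    "Actor.Age": 44,
    "WrittenResource.LanguageId": 4,
}


def rarely_used(usage, threshold=10):
    """Descriptors below the threshold, least used first."""
    return sorted(
        (name for name, pct in usage.items() if pct < threshold),
        key=lambda name: usage[name],
    )


print(rarely_used(usage_total))
# ['Subject', 'WrittenResource.LanguageId', 'Actor.BirthDate',
#  'SocialContext', 'Region']
```

A list like this only identifies candidates; as the last bullet notes, removing a descriptor leaves open where the information should go when it is available.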