Mapping domain thesauri to the CRM to assist the semantic interoperability of data archives Doug Tudhope Hypermedia Research Unit University of Glamorgan CIDOC CRM SIG Workshop, Imperial College, 2006

Page 1

Mapping domain thesauri to the CRM to assist the semantic interoperability of data archives

Doug Tudhope

Hypermedia Research Unit

University of Glamorgan

CIDOC CRM SIG Workshop, Imperial College, 2006

Page 2

Presentation

• FACET Project with Science Museum
– Thesaurus-based query expansion with NMSI Collections database
– Semantic expansion
– Web Demonstrator

Extend to heterogeneous datasets and terminology systems

• DELOS pilot project demonstrator
– English Heritage upper ontology based on CRM
– Mapping English Heritage thesaurus and database to CRM
– Current work

Page 3

FACET - Faceted Access to Cultural hEritage Terminology

Aims:
• Integration of thesaurus into the interface
• Semantic expansion taking advantage of facet structure

http://www.comp.glam.ac.uk/~FACET/

Page 4

FACET Collaborators

• Research Council Funding: EPSRC 3 years

• National Museum of Science and Industry (NMSI):

National Railway Museum and Science Museum Collections Database

• J. Paul Getty Trust

Art and Architecture Thesaurus (AAT)

• Museum Documentation Association (MDA)

Railway Thesaurus

• Canadian Heritage Information Network (CHIN)

Advisors

Page 5

NRM Collection examples of free text object descriptor fields

• Chair, London Midland & Scottish Railway, straight wooden back initials carved on back, green leatherette seat.

• Chair, Railway Clearing House, Curved back with blue leather inset & blue leather seat. R. C.H. carved on back

• Chair, M.S. & L.R., Straight back, blue leather seat with M.S. & L.R. carved across back

• Armchair, Pullman, green plush, fringed from Pullman section.

• Carver chair, Oak with oval brocade seat. Prince of Wales crest on back from Royal Saloon of 1876

• Armchair, Upholstered in blue maquette with curved, buttoned back & scroll arms. Wooden legs

• Occasional table, Oak with drawer, ornately carved. From Royal Saloon of 1876

• Set of 4 chairs, High-backed carver chairs upholstered in floral maquette

• Clock, made by Jno Walker, 250 Regent Street. Metal face/Roman numerals. Carved wooden square case. 20"x18"x10"

Page 6

Indexed Example from NRM Collection

ID 1975-7309

Description Armchair, Upholstered in blue moquette with curved, buttoned back & scroll arms. Wooden legs

Item name(s) armchairs (AAT Hierarchy: Furnishings)

Part      Aspect     Term            AAT Hierarchy
overall   physical   upholstering    Processes & techniques
overall   material   moquette        Materials
overall   colour     blue            Color
legs      material   wood            Materials
back      shape      curved          Physical attributes
back      physical   buttoning       Processes & techniques
arms      shape      scrolled arms   Components
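The indexed record above can be pictured as a small data structure: one item name plus a list of (part, aspect, term, hierarchy) index entries. The sketch below is illustrative only. The class and field names are assumptions made for this example and are not taken from the FACET or NMSI database schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(frozen=True)
class IndexEntry:
    """One indexing row: part of the object, aspect described, AAT term."""
    part: str
    aspect: str
    term: str
    hierarchy: str  # AAT hierarchy the term belongs to


@dataclass
class IndexedObject:
    """A collection record with faceted index entries (hypothetical layout)."""
    object_id: str
    description: str
    item_names: List[str]
    entries: List[IndexEntry] = field(default_factory=list)

    def terms_by_hierarchy(self) -> Dict[str, List[str]]:
        """Group index terms under their AAT hierarchy, in entry order."""
        grouped: Dict[str, List[str]] = {}
        for entry in self.entries:
            grouped.setdefault(entry.hierarchy, []).append(entry.term)
        return grouped


# The armchair record from the slide, transcribed row by row.
armchair = IndexedObject(
    object_id="1975-7309",
    description=(
        "Armchair, Upholstered in blue moquette with curved, "
        "buttoned back & scroll arms. Wooden legs"
    ),
    item_names=["armchairs"],
    entries=[
        IndexEntry("overall", "physical", "upholstering", "Processes & techniques"),
        IndexEntry("overall", "material", "moquette", "Materials"),
        IndexEntry("overall", "colour", "blue", "Color"),
        IndexEntry("legs", "material", "wood", "Materials"),
        IndexEntry("back", "shape", "curved", "Physical attributes"),
        IndexEntry("back", "physical", "buttoning", "Processes & techniques"),
        IndexEntry("arms", "shape", "scrolled arms", "Components"),
    ],
)
```

Grouping terms by hierarchy, as `terms_by_hierarchy` does, gives the facet-style view that later matching can work against.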

Page 7

Types of Knowledge Organisation System (KOS)

adapted from Zeng & Salaba: FRBR Workshop, OCLC 2005

[Figure: KOS types arranged along two axes, from natural language to controlled language and from weakly-structured to strongly-structured]

Term Lists: Pick lists, Synonym Rings, Authority Files, Glossaries/Dictionaries, Gazetteers

Classification & Categorization: Subject Headings, Classification schemes, Taxonomies, Categorization schemes

Relationship Groups: Thesauri, Semantic networks, Ontologies

Page 8

Semantic Expansion

Expanding over thesaurus semantic relationships allows the system to play an active role

• Ranking of matching results by semantic closeness
• Query Expansion (automatic/interactive)
• Augmented Browsing tools

Underpinning technologies:
• Measures of distance over the semantic index space
• Multi-concept Matching Function
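A minimal sketch of semantic expansion, assuming a tiny hand-made thesaurus fragment and made-up traversal costs. Neither the relationships nor the cost values come from FACET or the AAT. Each relationship type carries a cost, and expansion collects every term reachable within a distance threshold, so that results can later be ranked by semantic closeness.

```python
import heapq
from typing import Dict, List, Tuple

# Hypothetical traversal cost per relationship type (assumed values).
COSTS = {"BT": 1.0, "NT": 1.0, "RT": 1.5}

# Toy thesaurus fragment: term -> list of (relationship, related term).
THESAURUS: Dict[str, List[Tuple[str, str]]] = {
    "chairs": [("NT", "armchairs"), ("NT", "carver chairs"), ("BT", "seating")],
    "armchairs": [("BT", "chairs"), ("RT", "upholstery")],
    "carver chairs": [("BT", "chairs")],
    "seating": [("NT", "chairs"), ("NT", "benches")],
    "benches": [("BT", "seating")],
}


def expand(term: str, max_distance: float) -> Dict[str, float]:
    """Return every term within max_distance of `term`, with its distance.

    Uses a shortest-path search so each term keeps its cheapest route.
    """
    best: Dict[str, float] = {term: 0.0}
    queue: List[Tuple[float, str]] = [(0.0, term)]
    while queue:
        dist, current = heapq.heappop(queue)
        if dist > best.get(current, float("inf")):
            continue  # stale queue entry, a cheaper route was already found
        for rel, neighbour in THESAURUS.get(current, []):
            new_dist = dist + COSTS[rel]
            if new_dist <= max_distance and new_dist < best.get(neighbour, float("inf")):
                best[neighbour] = new_dist
                heapq.heappush(queue, (new_dist, neighbour))
    return best


def closeness(distance: float) -> float:
    """Map a distance to a score in (0, 1]; an exact match scores 1.0."""
    return 1.0 / (1.0 + distance)


expanded = expand("armchairs", max_distance=2.0)
ranked = sorted(expanded.items(), key=lambda item: item[1])
```

Raising `max_distance` widens the expansion, which suits interactive use where the searcher decides how far to relax the query.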

Page 9

Faceted Knowledge Organisation Systems

Faceted classifications based on primary division into fundamental, high-level categories (facets)

Compound descriptors (multi-concept headings) are synthesised by combination of terms from limited number of fundamental facets

In constructing AAT, adjectival noun phrases very common:

e.g. painted oak furniture

“Rather than enumerate the nearly infinite number of object and subject descriptions needed by thesaurus users, the AAT decided to pursue the building blocks of these descriptors in the form of a faceted vocabulary”

(Guide to Indexing and Cataloging with the Art & Architecture Thesaurus)

Page 10

Matching Problem

“The major problem lies in developing a system whereby individual parts of subject headings containing multiple AAT terms are broken apart, individually exploded hierarchically, and then reintegrated to answer a query with relevance”

(Toni Petersen, AAT Director)

Query: mahogany, dark yellow, brocading, Edwardian, armchair

Descriptor: oak, light yellow, crests, ovals, brocade, Victorian, Carver chair

Potentially extra / missing / partially and non-matching terms
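One way to picture a multi-concept matching function is sketched below, using the query and descriptor from this slide. It assumes that term-to-term closeness scores are already available (for instance from thesaurus expansion). The scores in the table are invented for illustration, and the averaging scheme is an assumption rather than FACET's actual matching function.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical pairwise closeness scores in [0, 1]; unlisted pairs score 0.
CLOSENESS: Dict[Tuple[str, str], float] = {
    ("mahogany", "oak"): 0.6,
    ("dark yellow", "light yellow"): 0.8,
    ("brocading", "brocade"): 0.7,
    ("Edwardian", "Victorian"): 0.5,
    ("armchair", "Carver chair"): 0.75,
}


def term_closeness(query_term: str, index_term: str) -> float:
    """Exact matches score 1.0; otherwise look up the assumed table."""
    if query_term == index_term:
        return 1.0
    return CLOSENESS.get((query_term, index_term), 0.0)


def match(
    query: List[str], descriptor: List[str]
) -> Tuple[float, Dict[str, Tuple[Optional[str], float]]]:
    """Score a descriptor against a multi-concept query.

    Each query term is paired with its closest descriptor term. The overall
    score is the mean of those best scores, so missing or partial matches
    pull the score down while extra descriptor terms are ignored.
    """
    pairs: Dict[str, Tuple[Optional[str], float]] = {}
    for q in query:
        best_term: Optional[str] = None
        best_score = 0.0
        for d in descriptor:
            score = term_closeness(q, d)
            if score > best_score:
                best_term, best_score = d, score
        pairs[q] = (best_term, best_score)
    overall = sum(s for _, s in pairs.values()) / len(query) if query else 0.0
    return overall, pairs


query = ["mahogany", "dark yellow", "brocading", "Edwardian", "armchair"]
descriptor = ["oak", "light yellow", "crests", "ovals", "brocade", "Victorian", "Carver chair"]
overall_score, best_pairs = match(query, descriptor)
```

Here "crests" and "ovals" are extra descriptor terms with no counterpart in the query, so they do not affect the score. A different design could penalise them.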

Page 11

System Architecture

[Architecture diagram; components listed from interface down to database]

• Application interfaces: Compiled VB client interface and web browser interface
• Application data objects: Expansion engine (and data structure); Query and matching functions
• Persistent XML data: Queries, parameters etc.
• Database interaction module: Active-X Data Objects (ADO) Data access components
• Database: SQL Server Databases (collections & thesaurus); Transact SQL Stored Procedures

Page 12

FACET standalone system

http://www.comp.glam.ac.uk/~facet/webdemo/


Page 13

FACET Web Demonstrator

• Illustrates thesaurus-based expansion and faceted search

• Intended as an exploration of FACET research outcomes via dynamically generated Web components rather than a complete final interface

• Based on custom API for thesaurus programmatic access

• Browser-based interface (ASP application), using a combination of server-side scripting and compiled components

http://www.comp.glam.ac.uk/~FACET/webdemo/

http://jodi.tamu.edu/Articles/v04/i04/Binding/

Page 14

FACET Web Demonstrator

Page 15

Semantic Query Expansion

Page 16

Some lessons learned

• Results show potential of faceted KOS for
– Query expansion with semantically ranked results
– Real-time implementation of multi-concept matching function
– Semantic expansion as a browsing tool
– Potential to combine with statistical and linguistic techniques

How to generalise?

Need for
• Common KOS representations and APIs
• Semantic mapping between different databases and KOS

Page 17

Semantic Interoperability

• NMSI’s different museums and collections held in a single collections database

• Easy to express connections between thesaurus hierarchies and DB fields

What if the search runs across different DBs and KOS?

• E.g. English Heritage (EH): a single organisation but with a wide range of unconnected DBs and vocabularies

Page 18

Mapping domain thesauri to the CRM to assist the semantic interoperability of data archives

• DELOS NoE mini-project on Ontology-driven interoperability for Cluster on Knowledge Extraction & Semantic Interoperability

• Proof of concept demonstrator for exploring retrieval potential of mapping domain KOS to upper ontology (CIDOC CRM)

• In collaboration particularly with FORTH, University of Lund and English Heritage (Keith and Sarah May)

• Investigate integration of datasets - for assisting archaeological search and information extraction

Page 19

Background

• Current EH situation one of fragmented datasets and applications, with different terminology systems

• Interpretation may not consist of same terms as context

• Searchers from different scientific perspectives may not use same terminology

• Need for integrative metadata framework: EH have designed an upper ontology based on the CRM standard

• Work to date focused on modelling

Page 20

Databases not meaningfully connected

• Even simply expressed queries currently difficult to answer, due to lack of tools for cross database searching

"Specialists could only talk to [field] archaeologists and not talk to each other". (from discussion with a palaeoenvironmental archaeologist)

Wider questions arising from science analysis by finds specialists often referred back to field archaeologist since databases documenting different scientific aspects not meaningfully connected

Page 21

DELOS pilot project datasets

• English Heritage (and EH Data Services Unit) supplying various databases and controlled vocabularies.

• Starting by connecting the new Environmental Archaeology Thesaurus and (part of) the Environmental Archaeology Bibliography to EH-CRM

Page 22

Environmental Archaeology Thesaurus
Scope Notes Extract (i)

• Altered by Animals
– SN: Modification or damage by an animal
– RT: Worked (use where modification is by humans in ASPECT)

• Anoxic
– SN: Material preserved by exclusion of oxygen usually due to saturation with water which inhibits decay by micro-organisms
– Non-preferred term: Waterlogged

• Burnt
– SN: Use for material that has been burnt

• Calcined
– SN: Material burnt at a high temperature (above 700 degrees centigrade) leaving only the mineral component.
– Non-preferred term: cremated
– BT: Burnt
– RT: Cremation

• Charred
– SN: Material that has been burnt and at least in part reduced to carbon as a result of burning in a reducing atmosphere below 500 degrees C.
– Non-preferred term: Carbonised
– BT: Burnt

• Silicified
– SN: Use for material that has been burnt at high temperatures in a good air supply such that only silica component remains
– BT: Burnt

• ……

• Mineral Replaced
– SN: Replacement of organic material by minerals, including calcium carbonate and calcium phosphate
– Non-preferred term: Mineralised, Fossilised

• Mineral Preserved
– SN: Preservation of material by the toxic effect of corrosion products in the immediate vicinity, or within, the material
– Non-preferred term: Mineralised

• Plant damage
– SN: Material that has been penetrated or disrupted by the roots or rhizomes of plants.
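Entries like these map naturally onto a small record per preferred term. The sketch below transcribes a few entries from the extract and builds an entry vocabulary that leads from a searcher's word (preferred or non-preferred) to preferred terms, then adds one level of narrower terms. The dictionary layout, the `UF` ("used for") key and the function names are assumptions for illustration. Only relationships shown in the extract are recorded.

```python
from collections import defaultdict
from typing import Dict, List, Set

# Preferred term -> broader terms (BT) and non-preferred terms (UF).
# Transcribed from the extract; empty lists mean "none shown in the extract".
ENTRIES: Dict[str, Dict[str, List[str]]] = {
    "Anoxic": {"BT": [], "UF": ["Waterlogged"]},
    "Burnt": {"BT": [], "UF": []},
    "Calcined": {"BT": ["Burnt"], "UF": ["cremated"]},
    "Charred": {"BT": ["Burnt"], "UF": ["Carbonised"]},
    "Silicified": {"BT": ["Burnt"], "UF": []},
    "Mineral Replaced": {"BT": [], "UF": ["Mineralised", "Fossilised"]},
    "Mineral Preserved": {"BT": [], "UF": ["Mineralised"]},
}


def build_entry_vocabulary(entries: Dict[str, Dict[str, List[str]]]) -> Dict[str, Set[str]]:
    """Map each lower-cased preferred or non-preferred word to preferred terms."""
    lookup: Dict[str, Set[str]] = defaultdict(set)
    for preferred, record in entries.items():
        lookup[preferred.lower()].add(preferred)
        for alternative in record["UF"]:
            lookup[alternative.lower()].add(preferred)
    return dict(lookup)


def narrower_terms(entries: Dict[str, Dict[str, List[str]]], term: str) -> List[str]:
    """Derive narrower terms by inverting the BT links."""
    return sorted(t for t, record in entries.items() if term in record["BT"])


def resolve(entries: Dict[str, Dict[str, List[str]]], user_word: str) -> List[str]:
    """Resolve a searcher's word to preferred terms plus one level of narrower terms."""
    preferred = build_entry_vocabulary(entries).get(user_word.lower(), set())
    result = set(preferred)
    for term in preferred:
        result.update(narrower_terms(entries, term))
    return sorted(result)
```

Note that "Mineralised" is listed as a non-preferred term for both Mineral Replaced and Mineral Preserved, so `resolve(ENTRIES, "Mineralised")` returns both and leaves the choice to the searcher or to later ranking.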

Page 23

Environmental Archaeology Thesaurus
Scope Notes Extract (ii)

• Arthropods
– SN: Use for remains of arthropods in general, including woodlice, spiders, insects etc. Please note crustaceans have been included under this category.
– BT: Invertebrates
– NT: Cladocerans, Crustaceans (Decapods), Insects, Mites, Ostracods

• Cladocerans
– SN: Group of fresh water crustaceans which include the water fleas (Daphnia ssp.) the egg cases (ephippia) of which are found in archaeological deposits (EH Guidelines for Environmental Archaeology)
– BT: Arthropods

• Crustaceans (Decapods)
– SN: Use for the remains of shrimps, prawns, crabs and lobsters
– BT: Arthropods

• Insects
– SN: Use for the remains of any part of an insect (MDA Object Thesaurus)
– Non-preferred term: Beetles, Coleoptera

• Mites
– SN: Related to spiders. Use for ticks and true mites. Mites are widely present in archaeological deposits but are rarely studied in detail as they are difficult to identify (Kenward, forthcoming)
– BT: Arthropods

• Ostracods
– SN: Small crustaceans ranging in size from 0.2mm to 30mm and possessing a bivalve carapace or ‘shell’. They live in salt-water, brackish and freshwater and are used to help to reconstruct aquatic conditions e.g. pollution, degree of salinity
– BT: Arthropods

Page 24

EH extension to CRM

• Currently in pdf file
• Need to represent in machine-readable format

Page 25

Example of CRM - Thesaurus connection (by EH collaborators)

• FlotationSampleResidueType – EH_E0067

CRM entity E55: Type

Classification of flot and/or residue contents

• Mapping:

Use Arch Science Thesaurus Terms:

Object type, Material type, Modification state, Aspect

Page 26

Example CRM - Thesaurus connection 2

• ContextSampleType – EHE0053

• CRM entity E55: Type

• Derived from the Environmental guidelines list

Samples taken will be of a particular type depending upon the technique that will be used to analyse them.

• For Specialist Scientific Sampling it would be appropriate to use Archaeological Science Thesaurus terms for “Investigative Techniques”, but for samples taken by non-specialists the investigative technique may not be known at the point of sampling.

Page 27

Current Work - Proof of concept demonstrator

• Express EH-CRM in machine-readable form
• Add connections for databases and thesauri to EH-CRM

Demonstrator – first steps
• Express user information need in terms of EH-CRM
• Identify database and thesaurus entities (if any) from extended EH-CRM
• Drive search from this information
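The first steps listed here can be sketched in plain Python: a registry holds EH-CRM entities with their thesaurus and database connections, an information need is expressed as an entity plus terms, and the registry lookup yields the searches to run. The entity names, identifiers and thesaurus facets are taken from the two connection examples earlier in the talk. The database and field names are hypothetical placeholders, and the structure itself is an assumption rather than the demonstrator's design.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class EntityMapping:
    """Connections attached to one EH-CRM entity (illustrative structure)."""
    eh_id: str
    crm_class: str
    thesaurus_facets: List[str] = field(default_factory=list)
    db_fields: List[Tuple[str, str]] = field(default_factory=list)  # (database, field)


REGISTRY: Dict[str, EntityMapping] = {
    "FlotationSampleResidueType": EntityMapping(
        eh_id="EH_E0067",
        crm_class="E55 Type",
        thesaurus_facets=["Object type", "Material type", "Modification state", "Aspect"],
        db_fields=[("env_arch_db", "residue_type")],  # hypothetical names
    ),
    "ContextSampleType": EntityMapping(
        eh_id="EHE0053",
        crm_class="E55 Type",
        thesaurus_facets=["Investigative Techniques"],
        db_fields=[("samples_db", "sample_type")],  # hypothetical names
    ),
}


def plan_search(entity_name: str, terms: List[str]) -> List[Dict[str, object]]:
    """Turn an information need (EH-CRM entity + terms) into search requests.

    Returns one request per connected database field, or an empty list when
    the entity has no database or thesaurus connections in the registry.
    """
    mapping = REGISTRY.get(entity_name)
    if mapping is None:
        return []
    return [
        {
            "database": database,
            "field": field_name,
            "terms": list(terms),
            "facets": list(mapping.thesaurus_facets),
        }
        for database, field_name in mapping.db_fields
    ]


requests = plan_search("FlotationSampleResidueType", ["Charred"])
```

Keeping the connections in data rather than in code means further databases and vocabularies can be added to the registry without changing the search logic.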

Page 28

Next steps

• Involve other EH databases and vocabularies

• Connect very different datasets, for example species taxonomies via plant names

• Extend to associated grey literature (and FRBR indexed documents)

Page 29

Contact Information

Doug Tudhope

School of Computing

University of Glamorgan

Pontypridd CF37 1DL

Wales, UK

[email protected]

http://www.comp.glam.ac.uk/pages/staff/dstudhope

Page 30

References

Binding C., Tudhope D. 2004. KOS at your Service: Programmatic Access to Knowledge Organisation Systems. JoDI 4(4), http://jodi.tamu.edu/Articles/v04/i04/Binding/

CIDOC CRM http://cidoc.ics.forth.gr/

DELOS Network of Excellence http://www.delos.info/

DELOS Knowledge Extraction & Semantic Interoperability http://delos-wp5.ukoln.ac.uk/

FACET Case Study, DigiCult Thematic Issue 6: Resource Discovery Technologies for the Heritage Sector, http://www.digicult.info/pages/Themiss.php [pdf]

FACET Web demonstrator http://www.comp.glam.ac.uk/~FACET/webdemo/