Page 1: Tony Rees Divisional Data Centre CSIRO Marine Research, Australia (Tony.Rees@csiro.au)

Tony Rees

Divisional Data Centre

CSIRO Marine Research, Australia

(Tony.Rees@csiro.au)

Metadata concepts, issues and experiences – lessons from 8 years of metadata management at CMR

- for CSE Metadata Workshop, Canberra, May 2005

Overview

• Some definitions / concepts

• Who are the clients for metadata? (what is our target audience)

• How do people find metadata? (discovery / search mechanisms)

• The national metadata infrastructure context (ASDD etc.)

• Search methods – free text vs. structured searches, and the CMR (MarLIN) approach

• What metadata to collect?

• Space and time “footprints” in metadata records (storage and search implications)

• How do we populate the system...

• Selected implementation aspects (when actually building a system).

Metadata is …

• Structured, summary information regarding a dataset or similar resource

• Conforms to some standard – e.g. ANZLIC (for our region), ISO 19115, can have agency-specific extensions

• Provides both descriptions of resources (cataloguing / documentation function) and potentially, previews of / access point to the data

• Definition of “Dataset” – in the eye of the beholder – a logical set of data sharing common attributes e.g. data type, collection method, survey / expt ... – size of data “chunks” (granularity of the metadata) determined by agency practices and preferences

• Probably good to distinguish dataset-level metadata from item level descriptions (keep in separate, tailored systems).

Some example metadata systems …

• GCMD (NASA)

Some example metadata systems (cont’d)…

• NERC Metadata Gateway (UK)

Some example metadata systems (cont’d) …

• Australian Spatial Data Directory (another gateway)

Some example metadata systems (cont’d) …

• MarLIN (CMR metadata system)

What are we trying to do here?

• Describe our data holdings – to the inside and outside world

• Bring together relevant dataset documentation (or pointers to it) in a single, www-accessible location

• Provide a good (i.e. tailored) set of search tools which suit our data holdings and “target” users

• Facilitate access to our data – on a self serve basis (where possible) **

• Connect our entered information to the wider world for “discovery” purposes, e.g. to metadata gateways and internet search engines

• Re-use metadata as a “building block” in broader Divisional systems (capture once, use many times) **

(** = value adding)

Who are the clients for our metadata?

(hopefully not...)

Who are the clients for our metadata?

(hopefully yes...)

Who are the clients for our metadata?

• CSIRO researchers and their internal / external collaborators (e.g. for data discovery)

• Divisional management

• External parties – schools, public, scientific community, policy makers, consultants

• Ourselves – if an extensive data custodian (use for internal cataloguing / data access purposes)

• Recipients of CSIRO data – can supply metadata along with data products (also, may be a project deliverable)

• Future users (v. important) – “corporate memory”

How do people find metadata?

• Agency-level systems (own access points)

• Metadata gateways – e.g. ASDD (Australian Spatial Data Directory) for Australia, NERC metadata gateway for UK

• Future one-CSIRO system (??)

• Internet search engines e.g. Google (if mechanism for crawling is enabled)

• Standalone metadata files (e.g. supplied with data).

NB: all have their place, e.g. agency-level systems may support richer or better targeted search facilities than those available via gateways.

National Metadata Infrastructure

[Diagram: agency metadata systems (CMR MarLIN, DEH EDD, BoM, GA, etc., plus a “future agency system”) describe / point to the corresponding agency data (CMR data, DEH data, BoM data, GA data, etc.)]

ASDD: Australian Spatial Data Directory – national cross-agency metadata gateway

• search via ASDD – search across multiple agencies, basic functionality

• search via MarLIN – search only CMR holdings, but extra functionality (also view “CMR internal” records not visible to external users)

ASDD search – across multiple agency systems

(etc.)

Limitations of text-based searching...

• Basically a “hit and miss” method – no “browse” capability, or method to broaden / focus the search

• Relies on searcher and metadata creator using same words for same concepts (does not happen in practice, with free text entry across multiple systems)

• ... e.g. “whales” vs. “cetaceans” vs. “marine mammals” vs. species scientific names (multiple wordings covering potentially the same concept)

• Also, converse applies – one word, multiple uses, e.g. shark (fish), shark cat (type of boat), Shark Bay (place)...

• Variant spellings also a problem (e.g. sea lion vs. sea-lion vs. sealion; fishery vs. fisheries; organization vs. organisation; Mt. vs. Mount...)

• Typographical errors may render document invisible to a free text search (can be at either end, e.g. searcher or stored data).
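The free-text problems listed on this slide can be reproduced with a very small sketch. The record titles and the `free_text_search` helper below are invented for illustration only; they are not MarLIN content or code, and a real search engine would add stemming or synonym handling that this naive substring match deliberately omits.

```python
# Hypothetical record titles, invented for this illustration.
titles = [
    "Shark tagging data, 1990s",
    "Shark Bay seagrass survey",
    "Sea-lion pup counts",
    "Sealion diet samples",
    "Cetacean sightings from ships of opportunity",
]


def free_text_search(term, docs):
    """Naive case-insensitive substring match, standing in for a free-text search."""
    term = term.lower()
    return [d for d in docs if term in d.lower()]


# One word, multiple uses: the fish and the place are both returned.
print(free_text_search("shark", titles))

# Variant spellings: neither "Sea-lion" nor "Sealion" matches "sea lion".
print(free_text_search("sea lion", titles))

# Different wording for the same concept: the cetacean record is missed.
print(free_text_search("whales", titles))
```

The first search returns two unrelated records, and the other two return nothing even though relevant records exist.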

cf – Advantages of picklists (“controlled vocabularies”)...

• Steers users to use “one concept, one descriptor” approach; no spelling variants / errors

• Can organise thematically / hierarchically, i.e. “shark” under zoology, “Shark Bay” under localities... (less confusion); also can have explicit relationships (broader / narrower, related categories, etc.)

• Supports structured information retrieval and browsing

• Good prompt for terms that the searcher (or content creator) may not otherwise think to enter

• Amenable to global updates (hold list item IDs in the record, actual values in a look-up table, change in one place only)

• Can be access point to more extensive stored additional information (e.g. via project, voyage, organisation, publication ID) – content creator picks a value from the list, system automatically adds the rest

Main difficulties: getting agreement on list content; anticipating all user needs; loss of flexibility / fine detail of expression (i.e., still a need for free text as optional supplement). Also, list maintenance is an overhead.
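The “IDs in the record, values in a look-up table” idea can be sketched as below. This is a minimal illustration, assuming in-memory Python dictionaries rather than database tables; the table contents, field names and helper functions are all hypothetical and do not describe the actual MarLIN schema.

```python
# Hypothetical look-up table: picklist item ID -> displayed term.
keywords = {1: "sharks", 2: "Shark Bay", 3: "marine mammals"}

# Metadata records hold only the item IDs, never the term text itself.
records = [
    {"title": "Record A", "keyword_ids": [1]},
    {"title": "Record B", "keyword_ids": [2, 3]},
]


def search_by_keyword(keyword_id):
    """Return titles of records tagged with the given picklist item."""
    return [r["title"] for r in records if keyword_id in r["keyword_ids"]]


def display_keywords(record):
    """Resolve stored IDs to their current terms at display time."""
    return [keywords[k] for k in record["keyword_ids"]]


# A global update is a single change in the look-up table;
# no metadata record needs to be edited.
keywords[3] = "cetaceans and other marine mammals"

print(search_by_keyword(2))
print(display_keywords(records[1]))
```

Because the searcher picks an ID from the list rather than typing a term, the fish and the place cannot be confused, and the renamed term appears in every record that references it.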

e.g. MarLIN approach... (example: search by taxonomic group)

(etc.)

NB:

(1) this method (in principle) maximises both “recall” (getting records that you do want) and “precision” (not getting records that you don’t want)

(2) fewer “0 records returned” messages (user cannot search on terms not actually used)
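For reference, “recall” and “precision” can be computed as below. This is a generic sketch of the standard information-retrieval measures, not something taken from MarLIN; the record IDs are made up, and returning 0.0 for an empty set is just a convention chosen here to avoid dividing by zero.

```python
def precision_recall(returned, relevant):
    """Return (precision, recall) for one search.

    returned: record IDs the search gave back
    relevant: record IDs the searcher actually wanted
    """
    returned, relevant = set(returned), set(relevant)
    hits = returned & relevant
    # Precision: share of returned records that were wanted.
    precision = len(hits) / len(returned) if returned else 0.0
    # Recall: share of wanted records that were returned.
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical search: 4 records returned, 3 records wanted, 2 in common.
p, r = precision_recall({"r1", "r2", "r3", "r4"}, {"r1", "r2", "r5"})
print(p, r)
```

Here precision is 2/4 and recall is 2/3: two unwanted records came back, and one wanted record was missed.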

What metadata to collect? – 1

• Core ANZLIC fields – title, abstract, space and time ranges, data quality, data contact point, ANZLIC search words... (c. 40 fields)

What metadata to collect? – 2

• Other fields of value to the agency – e.g...

– project codes + associated info.

– more specialised keywords or search terms

– controlled defined regions list

– links - data documentation, graphics links, data access

– stored data volume, stored data location

– references, contributors, acknowledgements (e.g. funding) ...

• Some of the above correspond to elements in the ISO standard (c. 400 fields), some will be new

• Tension between simple metadata set (few elements, but easy to collect) and more extensive dataset information (more effort to collect, but increased future value and / or structured search options).

CMR Metadata search page (portion)

... in order to be useful for structured searches, relevant information must be captured at metadata entry time, in a consistent way (e.g. via picklists and supporting tables).

Also need to consider space, time “footprints”, i.e. how to support these at search time

Example for a CMR dataset (“Lira” catch dataset from 1973):

Storage of relevant temporal and spatial search info (default):

Machine-readable temporal search:

Dataset time range (as start, end dates) vs. search time range (as start, end dates): overlap = “hit”

• Tend to not worry about temporal patchiness (maybe just add text comment in “completeness” field)

Machine-readable spatial search:

Dataset bounding box (as start, end lat & lon) vs. search bounding box (as start, end lat & lon): overlap = “hit”

• Spatial patchiness (or irregular polygon shapes) can be a more serious problem – CMR solution on next slide
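The “overlap = hit” tests for time ranges and bounding boxes can be sketched as follows. This is an illustration under stated assumptions, not MarLIN code: boxes are taken to be (south, west, north, east) tuples in decimal degrees, boxes crossing the 180° meridian are not handled, and the dataset extent used in the example is invented (only the 1973 date comes from the slides).

```python
from datetime import date


def ranges_overlap(start_a, end_a, start_b, end_b):
    """True if the closed ranges [start_a, end_a] and [start_b, end_b] overlap."""
    return start_a <= end_b and start_b <= end_a


def boxes_overlap(a, b):
    """True if two (south, west, north, east) boxes overlap in both lat and lon."""
    return (ranges_overlap(a[0], a[2], b[0], b[2])
            and ranges_overlap(a[1], a[3], b[1], b[3]))


# Temporal search: a 1973 dataset against a 1970-1975 search window.
print(ranges_overlap(date(1973, 1, 1), date(1973, 12, 31),
                     date(1970, 1, 1), date(1975, 12, 31)))

# Spatial search: a hypothetical bounding box for a dataset collected around
# the Australian coast, against a small search box near Alice Springs.
dataset_box = (-45.0, 110.0, -10.0, 155.0)
alice_springs_box = (-24.0, 133.5, -23.5, 134.0)
print(boxes_overlap(dataset_box, alice_springs_box))
```

Both tests report a hit. The second is the “false positive” that a single bounding box can produce when the data are patchy or follow an irregular shape: the box covers the inland search area even though no data were collected there.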

Spatial footprints – improved method

CMR has implemented a grid squares-based system for improved spatial “footprint” representation and querying (without requirement for a full GIS back end):

Dataset spatial extent – stored as list of squares intersected

Search by grid square (or set of squares): in list = “hit”, not in list = “miss”

• We use 0.5° x 0.5° squares – same resolution as 1:100 000 mapsheet series (approx. 50 x 50 km)

• Global “c-squares” notation covers marine as well as land areas.
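The grid-square idea can be sketched with plain set operations. Note the assumptions: the cell IDs below are simple (row, column) index pairs invented for this example and are NOT the c-squares notation; the footprint is built from sample points rather than from every square a track or polygon intersects; and the coordinates are hypothetical.

```python
import math

CELL = 0.5  # cell size in degrees, matching the 0.5 x 0.5 degree squares above


def cell_id(lat, lon):
    """Hypothetical cell ID: (row, col) index of the square containing the point."""
    return (math.floor(lat / CELL), math.floor(lon / CELL))


def footprint(points):
    """Set of cells touched by a list of (lat, lon) sample positions."""
    return {cell_id(lat, lon) for lat, lon in points}


def is_hit(dataset_cells, search_cells):
    """Hit if any searched square is in the dataset's stored list of squares."""
    return not dataset_cells.isdisjoint(search_cells)


# Hypothetical sampling positions off south-eastern Australia.
dataset_cells = footprint([(-34.2, 151.6), (-34.7, 151.3), (-38.1, 149.2)])

# A search square that shares a cell with the first sample position.
print(is_hit(dataset_cells, footprint([(-34.3, 151.9)])))

# A search square near Alice Springs, far from any stored cell.
print(is_hit(dataset_cells, footprint([(-23.7, 133.9)])))
```

The first search is a hit and the second a miss, with no GIS back end involved: the stored footprint is just a list of IDs, so the test is a set (or SQL `IN`) lookup.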

Related functionality on Museum Victoria “Bioinformatics” site (search interface shown):

• Searcher can use this approach to define a non-rectangular region of interest (green highlighted cells)

(NB, this uses a different [non global] notation for the cells, however the basic principle is the same)

Result for the relevant “Lira” CMR metadata record...

• Red squares (as square IDs) are what is actually stored, can then be superimposed on any user-selected base map for display purposes

• Now will not get “false positives” – e.g. from searching at Alice Springs

Remainder is “standard” metadata (ANZLIC + CMR extensions), e.g...

(etc.)

How do we populate the system (get people to describe their data)?

• Non-trivial problem

• Education – value of metadata, responsibility of data custodians to describe their data in designated system/s

• Prescriptive approach – build into project planning, sign-off, APA’s

• Facilitation – dedicated personnel assist scientists, knock on doors

• Making records on researchers’ behalf – resource intensive, also not ideal since person making the metadata does not have the best understanding of the data

• Incrementally – e.g. as data is migrated into corporate systems, require the metadata to go with it (robust linkage) – NB, will probably always be “data islands” that this approach misses.

How far have we got...?

• Currently there are some 2,100 records in the MarLIN system

(etc.)

How far have we got...? – cont’d

• 90-95% of “Data Centre” holdings described – after 8 yr process! (<1000 records, mostly ships’ data, by voyage and data type)

• a few “data islands” have made concerted attempts to describe their data (e.g. 10-20+ records each)

• some major data acquisition exercises have generated 50-100+ records, mostly for third party data (generally not visible on extranet) – e.g. where metadata is a specified project deliverable along with the data (good!)

• remainder is pretty patchy (maybe 10% compliance) – hope to kickstart with project-based “skeleton records”, also more rigid directives / follow up from Divisional management.

Project data template (example):

(etc.)

What information model to use?

Ideal world (probably unattainable):

[Diagram components: Metadata system; Stored data; Item-level catalogues; Ancillary information; Projects database; Persons database; Library pubs. list]

... all information would be entered / maintained in one place only; updates would propagate automatically through the system; all resources would be electronic and seamlessly accessible

Best we can do for now...

[Diagram components, as labelled on the slide:]

• Stored data; Item-level catalogues; Ancillary information (the label “digital + non-digital” appears three times)

• Metadata system – main “datasets” table

• MarLIN “persons” table

• MarLIN “projects” table

• MarLIN “references” table? (or text descriptions)

• MarLIN “data” links (URLs) in table (also text descriptions)

• MarLIN “doc” links (URLs) in table (also text descriptions)

• MarLIN “doc” + “graphic” links (URLs) in table (also text descriptions)

plus some other tables (not shown) for voyages, organisations, keywords...

Functionality / Processes to be supported (... list probably incomplete!)

• User interfaces – create, edit, search metadata records

• Administrator functions – user identities and privileges, “super-user”-level record modification, deletion, list maintenance

• Moderator function – approve / edit content to be published

• Security / authentication – who can access “internal” records (e.g. by specified IP domains or other mechanism)

• Access logging – including what search terms used, how many “hits”, etc. (plus applications to review user log and access stats)

• Application maintenance, tech. support, user training

• Automated connections to remote systems, plus on-demand import / export features (e.g. via XML)

• Ongoing development / modification to functionality or database structure – process, resources...
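As one illustration of the import / export item above, a record could be round-tripped through XML roughly as follows. This is only a sketch using Python's standard `xml.etree.ElementTree` module; the element names (`metadataRecord`, `title`, `startDate`, `endDate`) and the record content are made up for the example and do not follow the ANZLIC or ISO 19115 schemas or MarLIN's actual export format.

```python
import xml.etree.ElementTree as ET


def record_to_xml(record):
    """Serialise a flat dict of field name -> value to an XML string."""
    root = ET.Element("metadataRecord")  # hypothetical root element name
    for field, value in record.items():
        child = ET.SubElement(root, field)
        child.text = str(value)
    return ET.tostring(root, encoding="unicode")


def xml_to_record(xml_text):
    """Parse the XML produced above back into a flat dict of strings."""
    root = ET.fromstring(xml_text)
    return {child.tag: child.text for child in root}


# Hypothetical record with invented field names.
record = {
    "title": "Example catch dataset",
    "startDate": "1973-01-01",
    "endDate": "1973-12-31",
}

xml_text = record_to_xml(record)
print(xml_text)
print(xml_to_record(xml_text) == record)
```

A production exchange format would need an agreed schema, nested elements and validation; the point here is only that export and import are symmetric operations over the same stored fields.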

Metadata integration / remote calls (examples)

• Project work space (HTML page)

Metadata integration / remote calls (examples)

• Custom MarLIN search via web call (from different database)

Metadata integration / remote calls (examples)

• Re-use of MarLIN supporting tables content (in other contexts)

Concluding remarks

• Simple in theory, not so simple in practice, to design and implement a good system (especially in a research, rather than basic “products set” environment) – no “off the shelf” solution (or even key components) available

• Designing a system gives the opportunity to incorporate new / improved concepts (scope for innovation, design challenges)

• Should be benefits in sharing code, approaches, experiences across Divisions or other groups

• Populating the system is as important as building it!

• Connection to external gateways is not too hard, once system plus some publishable content exists

• CMR is a lonely trailblazer within CSIRO ... still considered an example of “best practice” (a bit of a worry, seeing how far we still have to go)...

• Thanks!

• To visit MarLIN: go to www.marine.csiro.au

>> Data Centre (www.marine.csiro.au/datacentre/)

>> MarLIN (www.marine.csiro.au/marlin/)

• MarLIN “Edit” interface – currently requires access privileges to visit (will look at online in tomorrow’s session).