35
1 eXtended Metadata Registries (XMDR) NKOS Workshop June 11, 2005 Bruce Bargmeyer [email protected] Chair: ISO/IEC JTC1/SC32-Data Mgmt & Interchange PI: XMDR Project Lawrence Berkley National Laboratory University of California WWW.XMDR.ORG

1 eXtended Metadata Registries (XMDR) NKOS Workshop June 11, 2005 Bruce Bargmeyer [email protected] Chair: ISO/IEC JTC1/SC32-Data Mgmt & Interchange

  • View
    221

  • Download
    2

Embed Size (px)

Citation preview

1

eXtended Metadata Registries (XMDR)

NKOS WorkshopJune 11, 2005

Bruce [email protected]: ISO/IEC JTC1/SC32-Data Mgmt & InterchangePI: XMDR ProjectLawrence Berkley National LaboratoryUniversity of California

WWW.XMDR.ORG

2

XMDR Project Draws Together

UsersUsers

ISO/IEC 11179MetadataRegistries

Terminology

CONCEPT

Referent

Refers To Symbolizes

Stands For

“Rose”,“ClipArt”

Metadata Registry

Terminology Thesaurus Taxonomy

DataStandards

Ontology

StructuredMetadata

ISO/IEC JTC 1/SC 32ISO TC 37 & …

3

Align, Coordinate, Integrate Standards/Recommendations

ISO/IEC JTC 1/SC 32

UsUs

ersers

ISO/IEC 11179MetadataRegistries Terminology

CONCEPT

Referent

Refers To Symbolizes

Stands For

“Rose”,“ClipArt”

Metadata Registry

Terminology Thesaurus Taxonomy

DataStandards

Ontology

StructuredMetadata

ISO TC 37 & …

SemanticWeb

W3C

4

XMDR Metadata RegistriesExtensions

Register (and manage) any semantics that are useful in managing data. Enable users to find and register correspondences between the content

(concepts & relationships) of multiple KOSs Enable registration of links between concepts and data in databases

Provide Semantic Services -- lay foundation for semantics based computing: Semantics Service Oriented Architecture, Semantic Grids, Semantics based workflows, Semantic Web …

Prepare standards proposals, especially for ISO/IEC 11179 Parts 2 & 3

Develop a reference implementation

5

Data Elements

DZ

BE

CN

DK

EG

FR

. . .

ZW

ISO 3166English Name

ISO 31663-Numeric Code

012

056

156

208

818

250

. . .

716

ISO 31662-Alpha Code

Algeria

Belgium

China

Denmark

Egypt

France

. . .

Zimbabwe

Name:Context:Definition:Unique ID: 4572Value Domain:Maintenance Org.Steward:Classification:Registration Authority:Others

ISO 3166French Name

L`Algérie

Belgique

Chine

Danemark

Egypte

La France

. . .

Zimbabwe

DZA

BEL

CHN

DNK

EGY

FRA

. . .

ZWE

ISO 31663-Alpha Code

Current 11179 Metadata Registry ContentE.g., Country Identifier

Algeria

Belgium

China

Denmark

Egypt

France

. . .

Zimbabwe

Name: Country IdentifiersContext:Definition:Unique ID: 5769Conceptual Domain:Maintenance Org.:Steward:Classification:Registration Authority:Others

DataElementConcept

6

Data Element List – Address Group

<?xml version="1.0"?> <shipTo > <name>Alice Wilson</name> <street>161 North Street</street> <city>Happy Valley</city> <state>MO</state> <zip>63105</zip> <country code>USA</country code>

</shipTo>

Use ISO/IEC 11179 MDR to CreateXML Schemas, DBMS Schemas

33c

NameStreet AddressCity, State Postal CodeCountry

7

XMDR: Register Ontologies

Concept Concept

ConceptConcept Geographic Area

Geographic Sub-Area

Country

Country Identifier

Country Name Country Code

Short Name ISO 31662-Character

Code

ISO 31663- Character

Code

Long Name

DistributorCountry Name

Mailing AddressCountry Name ISO 3166

3-Numeric CodeFIPS Code

8

XMDR: Register AnyKnowledge Organization System (KOS)

Keywords Glossaries Gazetteers Thesauri Taxonomies Concept system (ISO TC 37) Ontologies Axiomititized Ontologies

9

XMDR: Register Graphs

Directed Graph

Directed Acyclic Graph

Graph

Undirected Graph

Bipartite Graph

Partial Order Graph

Faceted Classification

Clique

Partial Order Tree

Tree

Lattice

Ordered Tree

Note: not all bipartite graphsare undirected.

Graph Taxonomy:

10

Samples of Eco & Bio Graph Data

Nutrient cycles in microbial ecologies These are bipartite graphs, with two sets of nodes, microbes and reactants (nutrients), and directed edges indicating input and output relationships. Such nutrient cycle graphs are used to model the flow of nutrients in microbial ecologies, e.g., subsurface microbial ecologies for bioremediation.

Chemical structure graphs: Here atoms are nodes, and chemical bonds are represented by undirected edges. Multi-electron bonds are often represented by multiple edges between nodes (atoms), hence these are multigraphs. Common queries include subgraph isomorphism. Chemical structure graphs are commonly used in chemoinformatics systems, such as Chem Abstracts, MDL Systems, etc.

Sequence data and multiple sequence alignments . DNA/RNA/Protein sequences can be modeled as linear graphs

Topological adjacency relationships also arise in anatomy. These relationships differ from partonomies in that adjacency relationships are undirected and not generally transitive.

11

Eco & Bio Graph Data (Continued)

Taxonomies of proteins, chemical compounds, and organisms, ... These taxonomies (classification systems) are usually represented as directed acyclic graphs (partial orders or lattices). They are used when querying the pathways databases. Common queries are subsumption testing between two terms/concepts, i.e., is one concept a subset or instance of another. Note that some phylogenetic tree computations generate unrooted, i.e., undirected. trees.

Metabolic pathways: chemical reactions used for energy production, synthesis of proteins, carbohydrates, etc. Note that these graphs are usually cyclic.

Signaling pathways: chemical reactions for information transmission and processing. Often these reactions involve small numbers of molecules. Graph structure is similar to metabolic pathways.

Partonomies are used in biological settings most often to represent common topological relationships of gross anatomy in multi-cellular organisms. They are also useful in sub-cellular anatomy, and possibly in describing protein complexes. They are comprised of part-of relationships (in contrast to is-a relationships of taxonomies). Part-of relationships are represented by directed edges and are transitive. Partonomies are directed acyclic graphs.

Data Provenance relationships are used to record the source and derivation of data. Here, some nodes are used to represent either individual "facts" or "datasets" and other nodes represent "data sources" (either labs or individuals). Edges between "datasets" and "data sources" indicate "contributed by". Other edges (between datasets (or facts)) indicate derived from (e.g., via inference or computation). Data provenance graphs are usually directed acyclic graphs.

12

A graph theoretic characterization

Readily comprehensible characterization of metadata structures

Graph structure has implications for: Integrity Constraint Enforcement Data structures Query languages Combining metadata sets Algorithms for query processing

14

Example: Tree

Oakland Berkeley

Alameda County

California

part-of

part-of part-of

part-of

part-of

Santa Clara San Jose

Santa Clara County

part-of

19

Example: Ordered Tree

part-ofpart-of

part-of

Paper

Title page Section Bibliography

Note: implicit ordering relation among parts of paper.

21

Example: Faceted Classification

Wheeled Vehicle Facet

2 wheeled

4 wheeled

3 wheeled

is-a is-a is-a

Internal Combustion

Human Powered

Vehicle Propulsion Facet

Bicycle Tricycle MotorcycleAuto

is-a is-a

is-a

is-ais-a

is-ais-a

is-a

is-a

is-a is-a

24

Example: Directed Acyclic Graph

Wheeled Vehicle

2 Wheeled Vehicle

4 WheeledVehicle

3 WheeledVehicle

is-a is-a is-a

Internal Combustion

Vehicle

Human PoweredVehicle

PropelledVehicle

Bicycle Tricycle MotorcycleAuto

is-a is-a

is-a

is-ais-a

is-ais-a

is-a

is-a

is-a is-a

is-a is-a

Vehicle

26

Example: Partial Order Graph

Wheeled Vehicle

2 Wheeled Vehicle

4 WheeledVehicle

3 WheeledVehicle

is-a is-a is-a

Internal Combustion

Vehicle

Human PoweredVehicle

PropelledVehicle

Bicycle Tricycle MotorcycleAuto

is-a is-a

is-a

is-ais-a

is-ais-a

is-a

is-a

is-a is-a

is-a is-a

Vehicle

Dashed line = inferred is-a (transitive closure)

29

Example Lattice: Powerset of 3 element set

{a,b,c}

{c}{a} {b}

{b,c}{a,b} {a,c}

Denotes subset

Empty Set

32

Example Bipartite Graph

California

Massachusetts

Oregon

CA

MA

OR

States Two-letter state codes

34

Example of Clique

California

CA

Calif.

CAL

Here edges denote synonymy.

36

Example Compound Graph

Colin Powell claimed

Iraq had WMDs

38

Challenges

How to register & manage the various graph structures? DBMS, File systems ….

How to query the graph structures? XQuery for XML Poor to non-existent graph query languages

How to get adequate performance, even in high performance computing environment

User interface complexity How to manage semantic drift Versions How to interrelate graphs with other graphs and with data Granularity at which to register metadata (then point to

greater detail elsewhere?)

39

ISO/IEC 11179 Metadata Registries+ XMDR

Register and manage semantics that are or can be harmonized and vetted by some Community of Interest (COI)

Provide Semantic Services E.g., the semantics can be referenced by RDF

statements (subjects, predicates, objects) The semantics can be used to bootstrap the

Semantic Web and Semantic Computing A “vocabulary” that is grounded for some COI

Enable registration using formal logic as well as natural language

40

Terminolocgical & Formal(Axiomatized) Ontologies

The difference between a terminological ontology and a formal ontology is one of degree: as more axioms are added to a terminological ontology, it may evolve into a formal or axiomatized ontology.

Cyc has the most detailed axioms and definitions; it is an example of an axiomatized or formal ontology. WordNet is usually considered a terminological ontology.

Building, Sharing, and Merging OntologiesJohn F. Sowa

41

An Axiom for an Axiomatized Ontology

Definition: The resource_cost_point predicate, cpr, specifies the cost_value, c, (monetary units) of a resource, r, required by an activity, a, upto a certain time point, t.

If a resource of the terminal use or consume states, s, for an activity, a, are enabled at time point, t, there must exist a cost_value, c, at time point, t, for the activity, a,that uses or consumes the resource, r. The time interval, ti = [ts, te], during which a resource is used or consumed byan activity is specified in the use or consume specifications as use_spec(r, a, ts, te, q) or consume_spec(r, a, ts, te, q) where activity, a, uses or consumes quantity, q, of resource, r, during the time interval [ts, te]. Hence,

Axiom: a, s, r, q, ts, te, (use_spec(r, a, ts, te, q) enabled(s, a, t)) ∀ ∧ ∨(consume_spec(r, a, ts, te, q) enabled(s, a, t))≡ c, cpr(a,c,t,r)∧ ∃

Cost Ontology for TOronto Virtual Enterprise (TOVE)

42

XMDR: Vocabularies – Vetted with a Community of Interest to

Boostrap the Semantic Web and Semantic Computing

dbA:ma344

“AB”^^ai:StateCode

ai: StateUSPSCode

@prefix dbA: “http:/www.epa.gov/databaseA”@prefix ai: “http://www.epa/gov/edr/sw/AdministeredItem#”

dbA:e0139

ai: MailingAddress

43

ISO/IEC 11179 Part 3 Metamodel (Metadata Registry Schema)

Expressed in: UML model

XMDR: Also express in: OWL Ontology Common Logic

44

Ontology EditorProtege11179 OWL Ontology

XMDR Prototype: Modular Architecture-- Initial Implemented Modules

MetadataValidator

AuthenticationService

MappingEngine

RegistryExternalInterface

Generalization Composition (tight ownership) Aggregation (loose ownership)

Jena, Xerces

Java

RetrievalIndex

FullTextIndex

Lucene

LogicBasedIndex

Jena, OWI KSRacer

RegistryStore

WritableRegistryStore

Subversion

45

XMDR Content Priority ListPhase 1(V.A) National Drug File Reference Terminology

DTIC Thesaurus (Defense Technology Info. Center Thesaurus)

NCI Thesaurus National Cancer Institute Thesaurus

NCI Data Elements (National Cancer Institute Data Standards Registry

UMLS (non-proprietary portions)

GEMET (General Multilingual Environmental Thesaurus)

EDR Data Elements (Environmental Data Registry)

ISO 3166 Country Codes – from EPA EDR

USGS Geographic Names Information System (GNIS)

46

XMDR Content Priority ListPhase 2LOINC Logical Observation Identifiers Names and Codes

ITIS Integrated Taxonomic Information System

Getty Thesaurus of Geographic Names (TGN)

SIC (Standard Industrial Classification System)

NAICS (North American Industrial Classification System)

NAIC-SIC mappings

UNSPSC (United Nations Standard Products and Services Codes)

EPA Chemical Substance Registry System

EPA Terminology Reference System

ISO Language Identifiers ISO 639-3 Part 3

IETF Language Identifiers RFC 1766

Units Ontology

47

XMDR Content Priority ListPhase 3HL7 Terminology

HL7 Data Elements

GO (Gene Ontology)

NBII Biocomplexity Thesaurus

EPA Web Registry Controlled Vocabulary

BioPAX Ontology

NASA SWEET Ontologies

NDRTF

48

Project Status

This year: Building XMDR Core (content & System)

Next year: Extend XMDR-Core (content & system) R&D Semantic Services in a Semantic Service

Oriented Architecture

Expected to be five-year project

49

XMDR Project Participants

Collaborative, interagency effort US Environmental Protection Agency United States Geological Survey National Cancer Institute Mayo Clinic US Department of Defense Lawrence Berkeley National Laboratory & others

Interagency/International Cooperation on Ecoinformatics EPA, European Environment Agency, UNEP Ecoterm

50

Job Posting

Vacancy Announcement--Research position to be posted at LBNL—looking for person with education and experience in semantics, terminology, system development

WWW.XMDR.ORG

51

Acknowledgementsand References

Frank Olken, LBNLKevin Keck, LBNLJohn McCarthy, LBNL