Data and ontology integration issues in the biosciences
Marijke Keet, Napier University, 10 Colinton Road, Edinburgh EH10 5DT
m.keet@napier.ac.uk / marijke@meteck.org
Presentation at the Micro-Array Department, University of Amsterdam, 23-8-2004

Overview presentation

Data integration -> ontology
Ontologies: kinds, formalisation, bias & bioscience [after the break]
Ontology integration: categorisation, some challenges

Data heterogeneity

Schematic: data type, labelling, aggregation, generalisation

Semantic: naming, scaling and units, confounding

Intensional: domain, integrity constraints

Based on Goh (1996)

Integrating data

E.g. DB1 has an attribute named colour with the value green, and DB2 has an attribute named color with the value 2DE60E.

The data differ, but the conceptualisation is the same. Capture this agreement in an ontology.

Shorthand: “specification of a shared conceptualisation” (Gruber), but better: “An ontology is a logical theory accounting for the intended meaning of a formal vocabulary, i.e. its ontological commitment to a particular conceptualisation of the world. The intended models of a logical language using such a vocabulary are constrained by its ontological commitment. An ontology indirectly reflects this commitment (and the underlying conceptualisation) by approximating these intended models.” (Guarino, 1998).
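The colour example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the attribute-to-concept map, the named-colour lookup and the hue thresholds are all hypothetical and not part of the talk. The idea is that both records are rewritten in terms of one shared concept before they are compared.

```python
import colorsys

# Hypothetical mapping from each database's attribute name to one shared concept.
ATTRIBUTE_TO_CONCEPT = {"colour": "Colour", "color": "Colour"}

# Assumed lookup for named colours (the CSS hex value for "green").
NAMED_COLOURS = {"green": "008000"}


def to_rgb(value):
    """Normalise a colour name or a 6-digit hex string to an (r, g, b) tuple."""
    hex_code = NAMED_COLOURS.get(value.lower(), value)
    return tuple(int(hex_code[i:i + 2], 16) for i in (0, 2, 4))


def hue_class(rgb):
    """Coarse classification by hue angle in degrees (thresholds are assumed)."""
    r, g, b = (c / 255 for c in rgb)
    hue = colorsys.rgb_to_hsv(r, g, b)[0] * 360
    if 75 <= hue < 165:
        return "green"
    return "other"


def integrate(record):
    """Rewrite one (attribute, value) pair in terms of the shared concept."""
    attribute, value = record
    return ATTRIBUTE_TO_CONCEPT[attribute], hue_class(to_rgb(value))


db1_record = ("colour", "green")
db2_record = ("color", "2DE60E")
same_meaning = integrate(db1_record) == integrate(db2_record)
```

Here the two dictionaries play the role of the agreement that an ontology would capture explicitly; in a real setting that agreement would be specified formally rather than hard-coded.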

Kinds of ontologies

Representation ontologies: conceptualisations that underlie knowledge representation formalisms.

Top-level ontologies: generic and intermediate ontology concepts. This can be on top of a domain ontology or as a stand-alone effort; the main aspect is domain independence.
Generic ontologies consist of the general, foundational aspects of a conceptualisation (a lower branch in a top-level).
Intermediate ontologies are slightly more tailored towards a conceptualisation of a specific domain. There may not be references to generic ontologies.

Domain ontologies specialise in a subset of generic ontologies in a domain or sub-domain.

Application ontologies (…): the UoD (universe of discourse) is even narrower than a domain ontology.

Levels of formalisation (1-2)

Catalogue of normalised terms: a simple list without inclusion order, axioms or glosses.
Glossed catalogue: a catalogue with natural language glossary entries, e.g. a dictionary of medicine.
Prototype-based ontology: types and subtypes are distinguished by prototypes rather than by definitions and axioms in a formal language.
Taxonomy: a collection of concepts having a partial order induced by inclusion.
Axiomatised taxonomy: as taxonomy, but with axioms and stated in a formal language.
Context library / axiomatised ontology: a set of axiomatised taxonomies with relations among them, like the inclusion of one context into another one, or the use of a concept from one in the other one.

[Figure labels: lightweight ontologies, heavyweight ontologies; informal ontology, semi-formal ontology?, formal ontology]

Formalisations (2-2)

Ontological precision

Levels shown in the figure: catalog, glossary, taxonomy, thesaurus, DB/OO scheme, axiomatized theory.

Catalog: tennis, football, game, field game, court game, athletic game, outdoor game

Taxonomy: game > athletic game > court game > tennis; outdoor game > field game > football

Thesaurus: game; NT athletic game; NT court game; RT court; NT tennis; RT double fault

Axiomatized theory:
game(x) → activity(x)
athletic game(x) → game(x)
court game(x) ↔ athletic game(x) ∧ ∃y. played_in(x,y) ∧ court(y)
tennis(x) → court game(x)
double fault(x) → fault(x) ∧ ∃y. part_of(x,y) ∧ tennis(y)

Precision: the ability to catch all and only the intended meaning (for a logical theory, to be satisfied by intended models). Gangemi (2004)
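To illustrate what a taxonomy adds over a plain catalog, here is a minimal Python sketch that stores the direct is-a links of the game example and answers subsumption questions by following them. The dictionary layout and function names are assumptions made for this illustration. The parent of "outdoor game" is not recoverable from the text, so it is deliberately left out.

```python
# Direct is-a links from the game example (child -> parent).
# The parent of "outdoor game" is not given in the text, so it is left out.
IS_A = {
    "athletic game": "game",
    "court game": "athletic game",
    "tennis": "court game",
    "field game": "outdoor game",
    "football": "field game",
}


def ancestors(term):
    """Return all terms that subsume `term`, nearest first."""
    result = []
    while term in IS_A:
        term = IS_A[term]
        result.append(term)
    return result


def subsumed_by(child, parent):
    """True if `child` is `parent` or lies below it in the partial order."""
    return child == parent or parent in ancestors(child)
```

A catalog can only say that "tennis" and "game" are both terms; with the is-a links, `subsumed_by("tennis", "game")` can be derived. The axiomatized theory goes further still, e.g. by requiring that a court game is played in a court, which this sketch does not capture.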

Ontology integration (1-4)

Combining different conceptualisations (‘views on reality’)… somehow.

Integration at the system, language/syntax, structure and semantic levels. The latter is the most difficult.

Structure and/versus semantic integration example

Anarchy of terminology, definitions and methodologies (now at least 24 terms and 48 definitions & methodologies)

Organise into levels of integration. Develop taxonomy of ontology integration?

Ontology integration (2-4)

Example structure/semantics

Ontology integration (3-4)

Use in/for applications

Increase in level of integration

Unification, total compatibility, merging [similar subject domains]

Merging [different subject domains], partial compatibility

Mapping, approximations, helper model, alignment, intersection ontology

Queried ontologies, hybrid ontologies

Extending, incremental loading

Increase in (perceived) difficulty of operation

Initial categorisation

Ontology integration (4-4)

Some challenges

(In)formal ontologies
(In)consistencies in ontology design decisions during development (relationships)
Top-down versus/combined with bottom-up
Using foundational aspects in ontology development decreases the chance of design inconsistencies and facilitates integration
Subject domain heterogeneity
Conflicting goals
More conflicts and mismatches

(In)consistencies in ontology design decisions (1-2)

Subsumption versus instantiation: if A isA B, then all instances of A are also instances of B. Instantiation says a instanceOf A, i.e. a is an individual (particular, instance) and not a subtype of A.

Desiderata for creating the hierarchy, such as keeping function, structure and process separate.

E.g. the OBO phenotype ontology does not:
%attribute\:excretory_function ; PATO:00300204
%attribute\:urination ; PATO:00305204
%attribute\:urine_composition ; PATO:00301204
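The difference between isA and instanceOf can be mirrored, loosely, with Python classes and objects. This is an analogy rather than a formal semantics: the classes A and B and the individual a are placeholders taken from the wording above.

```python
class B:
    """A type (universal)."""


class A(B):
    """A subtype of B: A isA B."""


# An individual (particular): a instanceOf A.
a = A()

is_subtype = issubclass(A, B)    # isA relates two types
is_instance = isinstance(a, A)   # instanceOf relates an individual to a type
inherited = isinstance(a, B)     # follows from A isA B

# An individual is not a type, so asking whether it is a subtype is a category error.
try:
    issubclass(a, A)
    a_is_type = True
except TypeError:
    a_is_type = False
```

An ontology that records an individual as a subtype, or the other way round, loses exactly the distinction that the last check makes visible.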

(In)consistencies in ontology design decisions (2-2)

E.g. aseptate hypha isa hypha [aseptate hypha = hypha without cross walls] and hypha in mycelium isa hypha. The former is about a special kind of hypha; the latter takes topology as the distinction for subtyping. These are distinct criteria, yet they are treated as the same kind of isA: in the FAO, hypha subsumes both.

Allow multiple inheritance, or not?

partOf: such as parthood, proper parthood, connection, external connection, tangential parthood, interior parthood, partial coincidence and located-in (see e.g. Smith and Rosse, 2004; Donnelly, 2004)

Properties and meta properties (see Guarino and Welty (2000) for details)
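Two of the partOf variants listed above, parthood and proper parthood, can be told apart with a small Python sketch. In classical mereology parthood is reflexive and transitive, while proper parthood excludes the case where part and whole are the same entity. The direct-part assertions below are hypothetical examples chosen for illustration, not taken from any ontology discussed in the talk.

```python
# Hypothetical direct-part assertions (part -> whole), for illustration only.
DIRECT_PART_OF = {
    "cell wall": "hypha",
    "hypha": "mycelium",
}


def part_of(x, y):
    """Parthood: reflexive and transitive over the direct-part links."""
    while x != y and x in DIRECT_PART_OF:
        x = DIRECT_PART_OF[x]
    return x == y


def proper_part_of(x, y):
    """Proper parthood: x is part of y and x is not y itself."""
    return x != y and part_of(x, y)
```

Two ontologies that both use the label partOf, one meaning `part_of` and the other `proper_part_of`, disagree on whether a hypha is part of itself, which is the kind of design inconsistency that complicates integration.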

Conflicts and mismatches

Factors affecting ontology combination tasks

Practical problems: finding matchings, diagnosis repeatability, software usability, social factors of cooperation, goals

Mismatches between ontologies
- language level: syntax, logical representation, semantics of primitives, language expressivity, precision
- ontology level
  - conceptualisation: content/UoD, concept scope, relationship scope, context, aggregation, accuracy
  - explication
    terminological: hyper-/hyponyms (generalization), homonyms, synonyms
    modelling style: paradigm, entity/concept description
    encoding

Versioning: identification, traceability, translation
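As an illustration of the terminological mismatches named above, the following Python sketch flags synonyms and homonyms between two toy ontologies. Everything here is an assumption made for the example: the terms, the concept identifiers, and above all the premise that both ontologies already refer to a common set of concept identifiers, which in practice is the hard part.

```python
# Two hypothetical ontologies, each mapping a term to an identifier of the
# concept it denotes. Terms and identifiers are invented for illustration.
ONTOLOGY_A = {"colour": "C1", "court": "C2"}
ONTOLOGY_B = {"color": "C1", "court": "C3"}


def synonyms(a, b):
    """Pairs of different terms that denote the same concept."""
    return sorted(
        (term_a, term_b)
        for term_a, concept_a in a.items()
        for term_b, concept_b in b.items()
        if concept_a == concept_b and term_a != term_b
    )


def homonyms(a, b):
    """Terms shared by both ontologies that denote different concepts."""
    return sorted(term for term in a.keys() & b.keys() if a[term] != b[term])
```

With these inputs, "colour" and "color" come out as synonyms, and "court" as a homonym (for instance a tennis court in one ontology and a court of law in the other).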

Ontologies for bioscience (1-3)

[Figure: diagram of the standard view, with steps numbered (1) to (4). Labels as they appear: Theory; New empirical axioms/laws (universal); Facts with an empirical basis; Empirical axioms/laws (universal); Formation of a theory; Explanation; Confirmation; Formation of hypothesis; Induction, confirmation; Prediction, confirmation; Prediction.]

Ontologies for bioscience (2-3)

Ontologies as engineering artefacts
- Facilitate knowledge reuse, interoperability
  Modelling practice
  Another item in the problem-solver’s toolbox
  Part of a new/improved software system
- SW tools for ontology development, maintenance, integration

Ontologies embedded in science
- Top-level ontologies
  Attempt to understand, what/why
- W.r.t. bioscience
  ‘Co-defining’ concepts?
  Part of falsification paradigm and steps 2, 3 of standard view

-> synergy, mutually beneficial process, but…

Ontologies for bioscience (3-3)

• The very essence of scientific progress is change, redefinition and creation of new concepts.

-> ontology subject to (extensive) modification. Complicates integration

• Concepts underspecified, hypotheses and theories exist simultaneously.

-> accommodate this in an ontology? E.g. a library of ‘alternative views ontologies’, with loose coupling instead of integration?

-> capture what is, what can be, (and what might be?)

• Biological data is more complicated than technological and practice data.

• Systems Thinking, integrative concepts, holism and process-orientation conflict with ‘objectifying’ knowledge in ontologies

-> interdisciplinary work of ontologists with scientists

• Empiricism and the theoretical methodology in life sciences.

-> bottom-up and top-down procedures, respectively, for ontology development

Formalising biological knowledge

Challenging biological data characteristics

Are these aspects real challenges, or due to the limited expressiveness of non-formal approaches and software modelling paradigms (ER, OO, …), or maybe due to limited knowledge of both the domain expert and ontologist?

Applied sciences within ‘bio’ (medicine, ecology, environmental sciences), contexts

Main biological data characteristics

• No legacy, no full knowledge of UoD.

-> Former might be alleviated over time; double curriculum, but still difference in ‘science’ and ‘engineering’ approaches

• Gradations/non-discrete data, occasional relationships, conditionality.

-> Separate layer of sw, or semantics intricate part of bio data?

• Uncertainties, ‘postulations’, importance of parameters, properties.

-> characteristic of conducting scientific research; lack vs abundance of data can be argued as design decision, not characteristic of the data/concepts; ‘upgrading’ of concepts

• Definitional problems and lack of standardisation in nomenclature.

-> Is the surface layer of next point; overabundance of ‘semi-standards’; can be in itself interdisciplinary within bioscience

• Disagreements between and within research groups, ‘alternative’ hypotheses and theories coexist.

-> There is not one ‘what is’; development of multiple theories, concepts before agreement is part of doing scientific research; library of models, aliases

Applied bioscience

Emphasis core sciences: ‘All-inclusive’ comprehensive models

Emphasis applied bioscience: Conceptually representing the integration of various core disciplines, Only what is relevant in limited context

Example applied bioscience

References and more info (1-2)

Donnelly, M. (2004). On parts and holes: the spatial structure of the human body. MEDINFO 2004, San Francisco, USA.

Gangemi, A. (2004). Some design patterns for domain ontology building and analysis. Manchester 15-16 January. www.loa-cnr.it/Tutorials.html

Goh, C.H. (1996). Representing and reasoning about semantic conflicts in heterogeneous information sources. PhD, MIT.

Guarino, N. (1998). Formal Ontology and Information Systems. In: Formal Ontology in Information Systems, Proceedings of FOIS'98, Trento, Italy, Amsterdam: IOS Press.

Guarino, N. and Welty, C. (2000). A formal ontology of properties. Proceedings of 12th Int. Conf. on Knowledge Engineering and Knowledge Management, Lecture Notes in Computer Science, Springer Verlag.

Keet, C.M. (2004). Ontology development and integration for the biosciences. Technical Report, Napier University, Edinburgh, UK. www.meteck.org/research.html

Smith, B. and Rosse, C. (2004). The role of foundational relations in the alignment of biomedical ontologies. Proceedings of MEDINFO, San Francisco, USA.

References and more info (2-2)

Some websites with different perspectives/aims/information on ontologies

LOA www.loa-cnr.it

IFOMIS www.ifomis.de

Ontology www.ontology.org

Formal Ontology www.formalontology.it

RE Kent www.ontologos.org

WonderWeb project http://wonderweb.semanticweb.org

JF Sowa www.jfsowa.com/ontology/index.htm

SUMO http://ontology.teknowledge.com/

AAAI page http://www.aaai.org/AITopics/html/ontol.html

Links to a few groups developing tools

KAON http://km.aifb.uni-karlsruhe.de/kaon2

Protégé http://protege.stanford.edu

VU http://www.cs.vu.nl/

STARLab www.starlab.vub.ac.be/default.htm

Thank you!