1

How to build an ontology 2

Barry Smith

http://ontology.buffalo.edu/smith

2

The 3-level Distinction

Level 1: everything that exists (things, processes, data …);

Level 2: ideas in people’s minds (diagnoses, thoughts, images in your head, expectations, beliefs, fears …)

Level 3: publicly available (published, written down, drawn, recorded, saved) versions of level 2 entities (ontologies, databases, journal articles, newspaper reports, diaries …)

The 3-level Distinction

Level 1: #120: an incident that happened;

Level 2: #213: the interpretation by some cognitive agent that #120 is a security breach; #31: the expectation by some cognitive agent that similar incidents might happen in the future;

Level 3: #402: an entry in an information system concerning #120; #1503: an entry in some other information system about #31 for mitigation or prevention purposes.

5

How do we know which general terms designate universals?

Roughly: terms used by scientists to designate entities about which we have a plurality of different kinds of testable proposition

(cell, electron ...)

More precisely: terms which designate universals are:

1. General

2. Used in current scientific textbooks to express laws of nature

3. Logically non-compound (‘non-rabbit’, ‘rabbit or violin’ do not designate universals)

4. Contain no parts designating particulars (‘cat in Leipzig’, ‘Finnish spy’ do not designate universals)

6

7

Class =def. a maximal collection of particulars determined by a general term (‘cell’, ‘electron’, but also: ‘restaurant in Palo Alto’, ‘Italian’)

the class A = the collection of all particulars x for which ‘x is A’ is true
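The definition of a class can be sketched in a few lines of Python. This is only an illustration: the particulars, their attribute names and the helper `class_of` are hypothetical, and a general term is modelled as a predicate over particulars.

```python
# Hypothetical particulars; the attribute names are illustrative only.
particulars = [
    {"id": "p1", "kind": "restaurant", "city": "Palo Alto"},
    {"id": "p2", "kind": "restaurant", "city": "Buffalo"},
    {"id": "p3", "kind": "cell", "city": None},
]


def class_of(term_applies, domain):
    """Return the collection of all particulars x for which 'x is A' is true."""
    return {x["id"] for x in domain if term_applies(x)}


def is_restaurant_in_palo_alto(x):
    # A general term that determines a class but contains a part
    # designating a particular ('Palo Alto').
    return x["kind"] == "restaurant" and x["city"] == "Palo Alto"


restaurants_in_palo_alto = class_of(is_restaurant_in_palo_alto, particulars)
cells = class_of(lambda x: x["kind"] == "cell", particulars)
```

Both ‘cell’ and ‘restaurant in Palo Alto’ determine a class in this sense; nothing in the code distinguishes the term that designates a universal from the one that does not.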

8

universals vs. their extensions

universals

{a,b,c,...} collections of particulars

9

Extension =def

The extension of a universal A is the class: instance of the universal A

(it is the class of A’s instances)

(the class of all entities to which the term ‘A’ applies)

10

Problem

The same general term can be used to refer both to universals and to collections of particulars. Consider:

HIV is an infectious retrovirus

HIV is spreading very rapidly through Asia

11

universals vs. classes

universals

{c,d,e,...} classes

12

universals vs. classes

universals

defined classes

13

universals vs. classes

universals

populations, ...

14

Defined class =def

a class defined by a general term which does not designate a universal

the class of all diabetic patients in Leipzig on 4 June 1952

15

OWL is a good representation of defined classes

• sibling of Finnish spy

• member of Abba aged > 50 years

16

Terminology =def.

a representational artifact whose representational units are natural language terms (with IDs, synonyms, comments, etc.) which are intended to designate universals together with defined classes.

17

universals, classes, concepts

universals

defined classes

‘concepts’ ?

18

universals < defined classes < ‘concepts’

‘concepts’ which do not correspond to defined classes:

‘Surgical or other procedure not carried out because of patient's decision’

‘Congenital absent nipple’

because they do not correspond to anything

19

(Scientific) Ontology =def.

a representational artifact whose representational units (which may be drawn from a natural or from some formalized language) are intended to represent

1. universals in reality

2. those relations between these universals which obtain universally (= for all instances)

lung is_a anatomical structure

lobe of lung part_of lung

20

Part II: How to Build an Ontology

21

How to build an ontology

work with scientists to create an initial top-level classification

find ~50 most commonly used terms corresponding to universals in reality

arrange these terms into an informal is_a hierarchy according to this Universality principle

A is_a B =def. every instance of A is an instance of B

fill in missing terms to give a complete hierarchy

(leave it to domain scientists to populate the lower levels of the hierarchy)
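The Universality principle for is_a can be checked against known instances with a subset test. The sketch below assumes a hypothetical table mapping each universal to the identifiers of its recorded instances; passing the check on recorded data is a necessary condition only, not a proof that the assertion holds for all instances.

```python
# Hypothetical instance data: universal -> identifiers of known instances.
instances = {
    "anatomical structure": {"lung#1", "lung#2", "lobe#1"},
    "lung": {"lung#1", "lung#2"},
    "lobe of lung": {"lobe#1"},
}


def is_a_holds(child, parent, instances):
    """A is_a B holds only if every instance of A is an instance of B."""
    return instances[child] <= instances[parent]


print(is_a_holds("lung", "anatomical structure", instances))
print(is_a_holds("anatomical structure", "lung", instances))
```

The first call succeeds and the second fails, which matches the direction of the is_a arrow in the informal hierarchy.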

22

Principle of Low Hanging Fruit

Include even absolutely trivial assertions (assertions you know to be universally true)

pneumococcal virus is_a virus

Computers need to be led by the hand

23

Goal: Each term in an ontology represents exactly one universal

there are universals also of collectivities:

population

complex of cells

24

the use-mention confusion

swimming is healthy and has eight letters

25

Principle

Avoid confusing words with things

Avoid confusing concepts in our minds with entities in reality

Recommendation: avoid the word ‘concept’ entirely

26

Principle

For the sake of interoperability with other ontologies, do not give special meanings to terms with established general meanings

(Don’t use ‘cell’ when you mean ‘plant cell’)

27

Principle

Supply definitions wherever possible

(both human-understandable natural language definitions, and equivalent formal definitions)

28

Principle

Each term should have at most one definition

which may have both natural-language and formal versions

29

The Problem of Circularity

A Person = def. A person with an identity document

cell = def. plant cell, consisting of protoplast and cell wall; ...

30

Principle

Avoid circular definitions

(The term defined should not appear in its own definition)
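A simple lint for this rule looks for the defined term inside its own definition. The function name `is_circular` is hypothetical, and whole-word matching is an assumption; it will not catch circles that run through several definitions.

```python
import re


def is_circular(term, definition):
    """True if the defined term occurs as a whole word or phrase in its own definition."""
    pattern = r"\b" + re.escape(term.lower()) + r"\b"
    return re.search(pattern, definition.lower()) is not None


print(is_circular("person", "A person with an identity document"))
print(is_circular("cell", "plant cell, consisting of protoplast and cell wall"))
print(is_circular("human being", "an animal which is rational"))
```

The first two definitions are flagged and the third is not.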

31

Principle

A definition should use terms which are easier to understand than the term defined

32

Principle

Use Aristotelian definitions

An A is a B which C’s.

A human being is an animal which is rational
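The pattern "An A is a B which C’s" can be held as a small record with three fields. The class name, field names and the naive article rule below are assumptions made for illustration.

```python
from dataclasses import dataclass


def _article(noun):
    # Naive rule, assumed for illustration: 'an' before a vowel letter.
    return "an" if noun[0].lower() in "aeiou" else "a"


@dataclass(frozen=True)
class AristotelianDefinition:
    term: str         # the A being defined
    genus: str        # the parent B
    differentia: str  # the C that marks out the As among the Bs

    def render(self):
        return (
            f"{_article(self.term).capitalize()} {self.term} is "
            f"{_article(self.genus)} {self.genus} which {self.differentia}."
        )


human = AristotelianDefinition("human being", "animal", "is rational")
print(human.render())
```

Because each definition names exactly one genus, a hierarchy built this way gives every defined term a single is_a parent.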

33

Principle

Do not seek to define everything

34

In every ontology

some terms and some relations are primitive = they cannot be defined (on pain of infinite regress)

Examples of primitive relations:

identity

instance_of

35

Rules for formatting terms

• Avoid abbreviations even when it is clear in context what they mean (‘breast’ for ‘breast tumor’)

• Avoid acronyms

• Avoid mass terms (‘tissue’, ‘brain mapping’, ‘clinical research’ ...)

• Treat each term ‘A’ in an ontology as shorthand for a term of the form ‘the universal A’

36

Univocity

Terms should have the same meanings on every occasion of use.

(= They should refer to the same universals)

Basic ontological relations such as is_a and part_of should be used in the same way by all ontologies

37

Universality

Ontologies are made of relational assertions

They should include only those which hold universally

pneumococcal virus causes pneumonia

38

Universality

Often, order will matter:

We can assert

adult transformation_of child

but not

child transforms_into adult
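Why order matters can be seen by reading "A R B" as: every instance of A stands in R to some instance of B. That reading is assumed here, and the instance data below are hypothetical.

```python
# Hypothetical instance-level facts: (subject, relation, object).
facts = {
    ("adult#1", "transformation_of", "child#1"),
    ("adult#2", "transformation_of", "child#2"),
    ("child#1", "transforms_into", "adult#1"),
    ("child#2", "transforms_into", "adult#2"),
    # child#3 never became an adult, so it has no transforms_into fact.
}
instances = {
    "adult": {"adult#1", "adult#2"},
    "child": {"child#1", "child#2", "child#3"},
}


def holds_universally(a, relation, b, instances, facts):
    """True if every instance of a stands in `relation` to some instance of b."""
    return all(
        any((x, relation, y) in facts for y in instances[b])
        for x in instances[a]
    )


print(holds_universally("adult", "transformation_of", "child", instances, facts))
print(holds_universally("child", "transforms_into", "adult", instances, facts))
```

Every adult was once a child, so the first assertion holds; one child without an adult stage is enough to make the second fail.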

39

Universality

viral pneumonia caused by virus

but not

virus causes pneumonia

pneumococcal virus causes pneumonia

40

Universality

results analysis later_than protocol-design

BUT NOT

protocol-design earlier_than results analysis

41

Positivity

Complements of universals are not themselves universals.

Terms such as

non-mammal

non-membrane

other metalworker in New Zealand

do not designate universals in reality

42

Positivity

What about non-smoker?

43

Objectivity

Which universals exist in reality is not a function of our knowledge.

Terms such as

unknown

unclassified

unlocalized

arthropathies not otherwise specified

do not designate universals in reality.
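Term labels that break the Positivity or Objectivity principles often share surface patterns, so a rough lint can flag candidates for review. The patterns and the function name `flag_term` are assumptions made for illustration; a flagged label (‘non-smoker’, for example) still needs human judgment.

```python
import re

# Labels that begin with a negation or an 'other' catch-all.
NEGATIVE = re.compile(r"^(non-|other\b)", re.IGNORECASE)
# Labels that encode the state of our knowledge rather than reality.
EPISTEMIC = re.compile(
    r"\b(unknown|unclassified|unlocalized|not otherwise specified)\b",
    re.IGNORECASE,
)


def flag_term(label):
    """Return the principles a term label appears to violate."""
    problems = []
    if NEGATIVE.search(label):
        problems.append("positivity")
    if EPISTEMIC.search(label):
        problems.append("objectivity")
    return problems


for label in ["non-mammal", "arthropathies not otherwise specified", "mammal"]:
    print(label, flag_term(label))
```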

44

Keep Epistemology Separate from Ontology

If you want to say that

We do not know where A’s are located

do not invent a new class of

A’s with unknown locations

(A well-constructed ontology should grow linearly; it should not need to delete classes or relations because of increases in knowledge)

45

If you want to say

I surmise that this is a case of pneumonia

do not invent a new class of surmised pneumonias

Confusion of ‘findings’ in medical terminologies

Keep Sentences Separate from Terms

46

Single Inheritance

No kind in a classificatory hierarchy should have more than one is_a parent on the immediate higher level
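Violations of single inheritance are easy to detect mechanically once the is_a assertions are listed as pairs. The edge list below is a hypothetical example and `multiple_parents` is an illustrative helper name.

```python
from collections import defaultdict

# Hypothetical is_a assertions as (child, parent) pairs.
is_a = [
    ("car", "thing"),
    ("blue thing", "thing"),
    ("blue car", "car"),
    ("blue car", "blue thing"),
]


def multiple_parents(edges):
    """Return each term that has more than one is_a parent, with its parents."""
    parents = defaultdict(set)
    for child, parent in edges:
        parents[child].add(parent)
    return {c: sorted(p) for c, p in parents.items() if len(p) > 1}


print(multiple_parents(is_a))
```

Here only ‘blue car’ is reported, since it is asserted to be both a car and a blue thing.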

47

Multiple Inheritance

thing

car    blue thing

blue car

is_a is_a

48

Multiple Inheritance

is a source of errors

encourages laziness

serves as obstacle to integration with neighboring ontologies

hampers use of Aristotelian methodology for defining terms

hampers use of statistical search tools

49

Multiple Inheritance

thing

car    blue thing

blue car

is_a1 is_a2

50

is_a Overloading

The success of ontology alignment demands that ontological relations (is_a, part_of, ...) have the same meanings in the different ontologies to be aligned.

51

Multiple Inheritance

thing

car    blue thing

blue car

is_a1 is_a2

52

How to solve this problem

Create two ontologies:

of cars

of colors

Link the two together via cross-products

(= factoring, normalization, modularization)
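A cross-product can be sketched as composing terms from two separate single-inheritance ontologies. Each compound term keeps one is_a parent from the first ontology and points to a term in the second through a separate relation. The relation name `has_quality` and the sample terms ‘truck’ and ‘red’ are assumptions made for illustration.

```python
from itertools import product


def cross_product_terms(entity_terms, quality_terms):
    """Compose terms such as 'blue car' from two independent ontologies."""
    return {
        f"{quality} {entity}": {"is_a": entity, "has_quality": quality}
        for entity, quality in product(entity_terms, quality_terms)
    }


compound = cross_product_terms(["car", "truck"], ["blue", "red"])
print(compound["blue car"])
```

‘blue car’ now has the single is_a parent ‘car’; its link to ‘blue’ is carried by a different relation, so is_a is not overloaded.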

53

Compositionality

The meanings of compound terms should be determined

1. by the meanings of component terms

together with

2. the rules governing syntax

54

Why do we need rules/standards for good ontology?

Ontologies must be intelligible both to humans (for annotation and curation) and to machines (for reasoning and error-checking): the lack of rules for classification leads to human error and blocks automatic reasoning and error-checking

Intuitive rules facilitate training of curators and annotators

Common rules allow alignment with other ontologies

think of ontologies as legends for cartoons

56

cartoons, like maps, always have a certain threshold of granularity

but they can be veridical representations of reality nonetheless

Goal: use logically well-structured ontologies to create algorithmic, dynamic cartoons

57

Randomized controlled trials

http://rctbank.ucsf.edu/ontology/outline/index.htm

58

Basic Formal Ontology

What the top level should look like

59

Two kinds of entities

occurrents (processes, events, happenings)

continuants (objects, qualities, states...)

60

Continuants (aka endurants)

• have continuous existence in time

• preserve their identity through change

• exist in toto whenever they exist at all

Occurrents (aka processes)

• have temporal parts

• unfold themselves in successive phases

• exist only in their phases

61

You are a continuant

Your life is an occurrent

You are 3-dimensional

Your life is 4-dimensional

62

Dependent entities

require independent continuants as their bearers

There is no run without a runner

There is no grin without a cat

63

Dependent vs. independent continuants

Independent continuants (organisms, buildings, environments)

Dependent continuants (quality, shape, role, propensity, function, status, power, right)

64

All occurrents are dependent entities

They are dependent on those independent continuants which are their participants (agents, patients, media ...)

65

BFO Top-Level Ontology

Continuant

Independent Continuant

Dependent Continuant

Occurrent (always dependent on one or more independent continuants)
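The top level can be written down as a table that gives each term its single is_a parent. This is a minimal sketch: the root ‘entity’ and the helper `ancestors` are assumptions, and real BFO contains more categories than are shown on this slide.

```python
# term -> single is_a parent (None marks the assumed root).
TOP_LEVEL = {
    "entity": None,
    "continuant": "entity",
    "occurrent": "entity",
    "independent continuant": "continuant",
    "dependent continuant": "continuant",
}


def ancestors(term, parents):
    """Walk the is_a chain from a term up to the root."""
    chain = []
    while parents[term] is not None:
        term = parents[term]
        chain.append(term)
    return chain


print(ancestors("dependent continuant", TOP_LEVEL))
```

Storing exactly one parent per term makes single inheritance a property of the data structure itself.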

66

= A representation of top-level types

Continuant Occurrent

Independent Continuant

Dependent Continuant

cell component

biological process

molecular function

67

Top-Level Ontology

Continuant Occurrent

Independent Continuant

Dependent Continuant

Functioning

Side-Effect, Stochastic Process, ...

Function

68

Top-Level Ontology

Continuant Occurrent

Independent Continuant

Dependent Continuant

Functioning Side-Effect, Stochastic Process, ...

Function

69

Top-Level Ontology

Continuant Occurrent

Independent Continuant

Dependent Continuant

Quality Function Spatial Region

Functioning Side-Effect, Stochastic Process, ...

instances (in space and time)

70

71

72

Towards a Clinical Trial Ontology

To serve merger of data schemas

To serve flexibility of collaborative clinical trial research

To serve management of clinical trial research

To serve data access and reuse

73

CTO will be part of OBI

Ontology of Biomedical Investigations

http://obi.sourceforge.net

which is in turn part of the OBO Foundry

http://obofoundry.org

74

Overview of the Ontology of Biomedical Investigations

with thanks to Trish Whetzel on behalf of the FuGO Working Group

75

OBI

Purpose

Provide a resource for the unambiguous description of the components of biomedical investigations such as the design, protocols and instrumentation, material, data and types of analysis on the data

NOT designed to model biology

Enables

Allow consistent annotation of data across different technological and biological domains

Enable powerful queries

Facilitate semantically-driven data integration

76

 

Motivation for OBI

Standardization efforts in biological and technological domains

Standard syntax - Data exchange formats

To provide a mechanism for software interoperability, e.g. FuGE Object Model

Standard semantics - Controlled vocabularies or ontology

Centralize commonalities for annotation term needs across domains to describe an investigation/study/experiment, e.g. FuGO

77

Biomedical Investigation Components

Computational/Higher Level Analysis

Data Pre-Processing

Instrumental Analysis

Sample Analysis Preparation

Treatments

Material and Its Characteristics

Investigation Design

Describe the material and characteristics.

Describe the manipulations or perturbations or observations performed on the material to meet the general aim of the investigation.

Describe how the material was prepared for analysis - e.g. labeling, protein digest, etc.

Describe the instrument and settings that were used.

Describe the results from the instrument, e.g. what units are represented.

Describe the type of analysis performed to confirm/deny the hypothesis, e.g. clustering.

Describe the design and purpose or general aim of the investigation.

78

FuGO Development Strategy Decisions

Unified Development

Pros

Overlap of terms is identified early in development

Universal/Common terms are defined by all those collaborating

Additional technological or biological terms can be added as needed by collaborators

Cons

Time needed to develop the ontology

Independent Development

Pros

Develop ‘Ontology’ in a time frame limited only by the community

Cons

Development of different working policies?

Use of different top level classes?

Overlap of terms at lower levels of the ontology tree

79

FuGO Development Process

Collect Use Cases - within community activity

Collect examples of investigations as performed within a community and present Use Cases to developers group

Bottom up approach - within community activity

Identify concepts to describe using controlled terms

Collect terms and their definitions

Bin terms in the top level ontology structure

Top down approach - collaborative activity

Build a top level ontology structure, is_a (vertical) relationships

Make a list of other foreseen (horizontal) relationships

Review how Top Level Nodes fit in with the Upper Level Ontologies

80

FuGO - Top Level Classes

Continuant: an entity that endures/remains the same through time

Dependent Continuant: depends on another entity

E.g. Environment (depends on the set of ranges of conditions, e.g. geographic location)

E.g. Characteristics (entity that can be measured, e.g. temperature, unit)

- Realizable: an entity that is realizable through a process (executed/run)

E.g. Software (a set of machine instructions)

E.g. Design (the plan that can be realized in a process)

E.g. Role (the part played by an entity within the context of a process)

Independent Continuant: stands on its own

E.g. All physical entities (instrument, technology platform, document etc.)

E.g. Biological material (organism, population etc.)

Occurrent: an entity that occurs/unfolds in time

E.g. Temporal Regions, Spatio-Temporal Regions (single actions or Event)

Process

E.g. Investigation (the entire ‘experimental’ process)

E.g. Study (process of acquiring and treating the biological material)

E.g. Assay (process of performing some tests and recording the results)

81

Emerging FuGO Design Principles

OBO Foundry ontology, utilize ontology best practices

Inherit top level classes from an Upper Level ontology

Use of the Relation Ontology

Follow additional OBO Foundry principles

Facilitates interoperability with other OBO Foundry ontologies

Develop recommendations for naming conventions and metadata

Format for term names, e.g. underscore vs. camel case, no plurals

Use of alphanumeric identifier for terms, i.e. something that does not have semantic meaning

Mechanisms for adding synonyms, etc.

Open source approach

Protégé/OWL

Weekly conference calls

Shared environment using Sourceforge (SF) and SF mailing lists

82

Future Plans

Binning process - ongoing

Reconciliations into one canonical version

Iterative process

Common working practices - established

Each class consists of: unique alphanumeric identifier, human readable string name, definition and comments

Sourceforge tracker in place to collect comments on terms, definitions, relationships

Review ontology so that top level classes meet the needs of all involved ‘communities’
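The working practice that each class carries a unique alphanumeric identifier, a human readable name, a definition and comments can be sketched as a small record type with two checks. The `EX_` identifier format, the class name `TermRecord` and the helper names are hypothetical and are not taken from FuGO or OBI.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class TermRecord:
    identifier: str  # e.g. "EX_0000001" (hypothetical format, no semantic meaning)
    name: str
    definition: str
    comments: list = field(default_factory=list)


def record_problems(record):
    """List the ways a single record falls short of the working practice."""
    problems = []
    if not record.identifier.replace("_", "").isalnum():
        problems.append("identifier is not alphanumeric")
    if not record.name.strip():
        problems.append("missing name")
    if not record.definition.strip():
        problems.append("missing definition")
    return problems


def duplicate_identifiers(records):
    """Return identifiers that are used by more than one record."""
    counts = Counter(r.identifier for r in records)
    return sorted(i for i, n in counts.items() if n > 1)


records = [
    TermRecord("EX_0000001", "assay",
               "A process of performing some tests and recording the results."),
    TermRecord("EX_0000002", "study", ""),
    TermRecord("EX_0000002", "investigation", "The entire experimental process."),
]
print(record_problems(records[1]), duplicate_identifiers(records))
```

Checks of this kind could run whenever terms are added, so that missing definitions and reused identifiers are caught before the reconciliation step.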

83

OBI Collaborating Communities

Crop sciences Generation Challenge Programme (GCP), www.generationcp.org

Environmental genomics MGED RSBI Group, www.mged.org/Workgroups/rsbi

Genomic Standards Consortium (GSC), www.genomics.ceh.ac.uk/genomecatalogue

HUPO Proteomics Standards Initiative (PSI), psidev.sourceforge.net

Immunology Database and Analysis Portal, www.immport.org

Immune Epitope Database and Analysis Resource (IEDB), http://www.immuneepitope.org/home.do

International Society for Analytical Cytology, http://www.isac-net.org/

Metabolomics Standards Initiative (MSI), msi.workgroups.sourceforge.net

Neurogenetics, Biomedical Informatics Research Network (BIRN), www.nbirn.net

Nutrigenomics MGED RSBI Group, www.mged.org/Workgroups/rsbi

Polymorphism

Toxicogenomics MGED RSBI Group, www.mged.org/Workgroups/rsbi

Transcriptomics MGED Ontology Group, mged.sourceforge.net/ontologies

84

http://fugo.sourceforge.net

85

http://obi.sourceforge.net

93

Top-Level Class Hierarchy for RCT

Root

Secondary-study

Trial-details

Trial

Concept
• Generic-concept
• Population-concept
• Protocol-concept
• Design-concept
• Outcome-concept
• Administrative-concept
• Intervention-concept

94

Amended Top-Level Class Hierarchy for RCT

Entity

Continuant
• Population
• Protocol
• Design

Occurrent
• Trial
• Secondary-study
• Intervention

?? Trial-details

?? Outcome-concept

?? Administrative-concept

95

Concept
• Generic-concept
  – Term-information
  – Time-entity
  – Rule-concept
    » Clinical-rule
      Exclusion-rule
      Inclusion-rule
    » Rule-entity
      Recursive-rule
      Base-rule
    » Ethnicity-language-rule
    » Age-gender-rule
    » Situation

96

97

98

Concept
• Protocol-concept
  – Follow-up-compliance
  – Follow-up-activity
  – Follow-up
  – Protocol-change
  – Treatment-assignment
  – Protocol
  – Reason
  – Outcomes-followup
  – Secondary-study-protocol

99

Amended Top-Level Class Hierarchy for RCT

Entity

Continuant

Protocol
• Secondary-study-protocol

Reason

Occurrent
• Treatment-assignment
• Follow-up
  – Follow-up-activity
  – Outcomes-follow-up
• Protocol-change

100

Concept
• Population-concept
  – Subgroup
  – Recruitment-flowchart
  – Population
  – Recruitment
  – Site-enrollment

101

Amended Top-Level Class Hierarchy for RCT

Entity

Continuant

Protocol
• Secondary-study-protocol

Recruitment-flowchart

Reason

Population
• Subgroup

Occurrent
• Priors
  – Recruitment
  – Site-enrollment
  – Treatment-assignment
• Follow-up
  – Follow-up-activity
  – Outcomes-follow-up
• Protocol-change

102

Concept
• Administrative-concept
  – Publication-concept
  – Study-site
  – Person
  – Ethics
  – Study-committee
  – Funder
  – Institution
  – Registry-ID

103

Continuant
• Information object
  – Publication
  – Registry-ID
• Study-site
• Person
• Institution
  – Study-committee
  – Funder

??? Ethics

104

Concept
• Intervention-concept
  – Blinding-concept
  – Compliance-details
  – Intervention-step
  – Intervention-arm
  – Co-intervention
  – Intervention
  – Compliance-result
  – Intervention-logic

105

Occurrent
• Intervention
  – Blinding
  – Intervention-step
  – Intervention-arm
  – Co-intervention
• ??? Intervention-logic
• ??? Compliance-result
• ??? Compliance-details

106

107

Test Case: Clinical Trial Ontology

primary outcome

secondary outcome

timepoint

clinical trial

intervention group

control group

assignment of populations to groups

complex experimental design

randomization

placebo

response

efficacy

control

protocol

null hypothesis

confidence interval

108

FuGE idea: use OBI to design data tables

how to solve this problem of converting the ontology to a database schema

what are ‘instances’

annotating images (image repositories)

annotation = shared understanding of a body of knowledge

I run a trial, I stick my data in Excel and create a dataset

design database, design tables – that’s it – no more annotations

metadata is added regarding provenance: this data was added by A and corrected by McB

do rare disease people share their data: here’s my data, here’s my data key, 1 is for males, 0 is for females

sharing is local

but UCSF (Clinical Data Repository) neurodegenerative people, MS talk to Alzheimer’s: they can’t because (a) of HIPAA, (b) data schemas are so different, (c) response to NIH: they put their Excel spreadsheet out there, well gee whiz, (d) PharmGKB faced problems because of this, (e) the more obtuse the better: I can get another paper out of this data

no possibility of meta-analysis – opposite of biologists’ view

109