1 Ontologies: “What are they?” and “How do they work?” Michael Grobe (work supported in part by Research Technologies UITS Indiana University)


2

Table of Contents

- Panorama of definitions
- Explication of the Big Definition
- The Gene Ontology as an example
- Processing queries on data annotated with ontology classifications
- Merging and building ontologies

Table of Non-contents

- Automated annotation
- The role of ontologies in the Semantic Web
- Using ontologies in bioinformatics research

3

Panorama of definitions of “Ontology”

In standard use, ontology “is a study of conceptions of reality and the nature of being. … It is the science of what is, of the kinds and structures of the objects, properties and relations in every area of reality.” (Wikipedia, 2008)

The term was hijacked for use within information science (sic) where it has many applications, but…

“People use the word ontology to mean different things, e.g.

- glossaries and data dictionaries,
- thesauri and taxonomies,
- schema and data models, and
- formal ontologies and inference.” (Pidcock, 2003)

4

More definitions

Here’s a definition from Uschold, et al. quoted by Stevens, et al.:

“An ontology may take a variety of forms, but necessarily it will include a vocabulary of terms, and some specification of their meaning.

This includes definitions and an indication of how concepts are inter-related which collectively impose a structure on the domain and constrain the possible interpretations of terms.”

5

More definitions

And a definition from Pidcock, 2003:

“A formal ontology is a controlled vocabulary expressed in an ontology representation language.

This language has a grammar for using vocabulary terms to express something meaningful within a specified domain of interest.

The grammar contains formal constraints … on how terms in the ontology’s vocabulary can be used together.”

6

A definition and a “clarification”

And another definition by Gruber, 1993:

“I use the term ontology to mean the specification of a conceptualization. …A conceptualization is an abstract, simplified view of the world that we wish to represent for some purpose.”

Stevens, et al., 2000 clarify (?):

“The conceptualisation is the couching of knowledge about the world in terms of entities (things, the relationships they hold and the constraints between them). The specification is the representation of this conceptualisation in a concrete form.”

7

Yet more definitions

And Gruber defines one purpose for ontologies:

“Ontologies provide controlled, consistent vocabularies to describe concepts and relationships, thereby enabling knowledge sharing” (Gruber, 1993)

But then, one also finds descriptions like:

“Shallow ontologies comprise relatively few unchanging terms that organize very large amounts of data—for example, terms such as customer, account number, and overdraft…” (Shadbolt, 2006)

8

And Wikipedia continues in this vein by describing 2 types of (IT-related) ontology:

“A domain ontology (or domain-specific ontology) models a specific domain, or part of the world. It represents the particular meanings of terms as they apply to that domain.” (Wikipedia, 2008)

“An upper ontology (or foundation ontology) is a model of the common objects that are generally applicable across a wide range of domain ontologies. It contains a core glossary in whose terms objects in a set of domains can be described.” There are several standardized upper ontologies available for use, including Dublin Core . . . (Wikipedia, 2008)

9

As an aside (because we are actually interested in domain ontologies), let’s take a quick look at the Dublin Core, thanks again to Wikipedia:

“The Simple Dublin Core Metadata Element Set (DCMES) consists of 15 metadata elements:

Title, Creator, Subject, Description, Publisher, Contributor, Date, Type, Format, Identifier, Source, Language, Relation, Coverage, Rights.”

Surprisingly, these are described as “metadata elements”.

10

But wait, there’s still more…

In particular, an ontology, or some ontologies, provide some ability to “reason”:

“…an ontology is a representation of a set of concepts within a domain and the relationships between those concepts. It is used to reason about the properties of that domain, and may be used to define the domain.” (Wikipedia, 2008)

“. . . a document or file that formally defines the relations among terms. The most typical kind of ontology for the Web has a taxonomy and a set of inference rules.” (TimBL, 2001)

11

Note that this “reasoning” is performed using terms representing concepts rather than the concepts themselves. (Which is to say, text strings are being shuffled around; there is no “thought” involved.)

“The computer doesn’t truly “understand” any of this information, but it can now manipulate the terms much more effectively in ways that are meaningful to the human users.” (TimBL, 2001)

12

Note also that this is the first definition that refers to a “taxonomy”, and the term comes up a lot in the field, so let’s look at an example taxonomy.

Consider (a simplified version of) the Linnaean classification of living organisms based on these categories:

Dominion, Domain, Kingdom, Phylum, Subphylum, Class, Cohort, Order, Family, Genus, Species

Each individual organism is assigned a set of values, one for each category. The result is a large table with 11 (or more) columns.

This taxonomy defines a hierarchy of sets and subsets, and . . .

. . . the series of values in each column of an individual species record represents the path from the root of the tree to that species (leaf node), and much of this path information is redundant.

13

There may be better ways to store this redundant data, and . . .

. . . there may be other “ways” to think about what the data mean.

In particular, the set-subset relationships may be thought of as “inference rules” that can be applied to answer queries. For example:

If an organism is a member of a genus

and that genus is a member of a family

then that organism is also a member of that family
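The rule above can be sketched in Python. This is only an illustration, assuming a hypothetical `parent` table that stores one link per taxon instead of the full 11-column record; the names `parent`, `lineage`, and `is_member_of` are invented for this sketch, and only a few ranks are shown.

```python
# Hypothetical parent links: each taxon points to the next larger set.
parent = {
    "Homo sapiens": "Homo",      # species -> genus
    "Homo": "Hominidae",         # genus -> family
    "Hominidae": "Primates",     # family -> order
    "Primates": "Mammalia",      # order -> class
}


def lineage(taxon):
    """Return the path from a taxon up to the highest stored rank."""
    path = [taxon]
    while path[-1] in parent:
        path.append(parent[path[-1]])
    return path


def is_member_of(taxon, group):
    """Apply the membership rule repeatedly by following parent links upward."""
    return group in lineage(taxon)[1:]


print(lineage("Homo sapiens"))
print(is_member_of("Homo sapiens", "Hominidae"))
```

Storing only the parent links removes the redundancy of the wide table, at the cost of a short walk up the tree for each membership question.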

14

Finally, here is a “big,” formal definition:

"An ontology O is a six-tuple C, HC, HR, L, FC, FR, where

C is the set of concepts, HC a taxonomy induced on the concepts, HR the set of non-taxonomic relations, L the set of terms (lexicals) which refer to concepts and relations, and FC, FR are relations that map the terms in L to the corresponding concepts and relations.

If the ontology is dynamic all these structures are likely to change over time." (Niepert, et al., 2008)

Note that this definition includes no mention of “inference,” but inference may be hidden within.

15

This is clearly a complicated description, but we can break it into parts, at least some of which are understandable:

First, a couple of definitions:

A “tuple” is a sequence of objects in a specified order.

An “N-tuple” is a tuple that contains exactly N items.

A “relation” is a set of “tuples” of the same “arity”, but may be thought of as a “table”, which is how Relational Databases came to be named (even though there are differences).

Now note that the primary set, C, is a “set of concepts,” not a “set of terms”.

“Terms” are used to “refer to” the “concepts”, and both terms and relations are likely to change over time.

FC maps L to C, but “terms” also refer to “relations”?

16

As a very simple example, here’s a set of concepts (C) represented as strings of English text:

{ “Vehicle”, “Car”, “Truck”, “2-wheel drive car”, “4-wheel drive car”,
  “front-wheel drive car”, “rear-wheel drive car” }

Here’s a “taxonomy” (known as HC, perhaps called “is_a”?) “induced” on the set of concepts:

{ ( “Car”, “Vehicle” ),
  ( “Truck”, “Vehicle” ),
  ( “2-wheel drive car”, “Car” ),
  ( “4-wheel drive car”, “Car” ),
  ( “front-wheel drive car”, “2-wheel drive car” ),
  ( “rear-wheel drive car”, “2-wheel drive car” ) }

17

Here’s a set of terms, L = { 0, 1, 2, 3, 4, 5, 6 }, and a relation (FC) mapping terms from the term set to concepts:

{ ( 0, “Vehicle” ),
  ( 1, “Car” ),
  ( 2, “Truck” ),
  ( 3, “2-wheel drive car” ),
  ( 4, “4-wheel drive car” ),
  ( 5, “front-wheel drive car” ),
  ( 6, “rear-wheel drive car” ) }

Note that it is a good idea to use “meaningless” terms: Identifiers like “GO:000056”.

Here’s a representation of the taxonomy (HC) using terms:

{ ( 1, 0 ), ( 2, 0 ), ( 3, 1 ), ( 4, 1 ), ( 5, 3 ), ( 6, 3 ) }
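As a rough sketch, the pieces of the example (C, HC, L, and FC) can be written down as Python sets of tuples, and the term-based taxonomy can then be derived mechanically. The variable names `term_for` and `HC_terms` are invented here, and the sketch assumes that terms 5 and 6 refer to the front-wheel and rear-wheel drive concepts, as the term-based taxonomy implies.

```python
# C: the concepts, written as English strings.
C = {
    "Vehicle", "Car", "Truck", "2-wheel drive car", "4-wheel drive car",
    "front-wheel drive car", "rear-wheel drive car",
}

# HC: the is_a taxonomy as (child concept, parent concept) pairs.
HC = {
    ("Car", "Vehicle"),
    ("Truck", "Vehicle"),
    ("2-wheel drive car", "Car"),
    ("4-wheel drive car", "Car"),
    ("front-wheel drive car", "2-wheel drive car"),
    ("rear-wheel drive car", "2-wheel drive car"),
}

# L: the terms, and FC: the relation mapping each term to its concept.
# Assumption: terms 5 and 6 name the front- and rear-wheel drive concepts.
L = {0, 1, 2, 3, 4, 5, 6}
FC = {
    (0, "Vehicle"), (1, "Car"), (2, "Truck"),
    (3, "2-wheel drive car"), (4, "4-wheel drive car"),
    (5, "front-wheel drive car"), (6, "rear-wheel drive car"),
}

# Invert FC so each concept can be replaced by the term that refers to it.
term_for = {concept: term for term, concept in FC}

# Re-express the taxonomy using terms instead of concept strings.
HC_terms = {(term_for[child], term_for[parent]) for child, parent in HC}

print(sorted(HC_terms))
```

Because the terms are arbitrary identifiers, the concept strings can be reworded later without touching the taxonomy stored in terms.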

18

Here’s a relation (call it “is_transitively_a” or “is_a_descendent_of” or a “transitive closure”) derived from the taxonomy assuming “transitivity”:

{ ( 1, 0 ), ( 2, 0 ),
  ( 3, 1 ), ( 3, 0 ),
  ( 4, 1 ), ( 4, 0 ),
  ( 5, 3 ), ( 5, 1 ), ( 5, 0 ),
  ( 6, 3 ), ( 6, 1 ), ( 6, 0 ) }

The pairs ( 3, 0 ), ( 4, 0 ), ( 5, 1 ), ( 5, 0 ), ( 6, 1 ), and ( 6, 0 ) were added “by transitivity”.

This seems to be one way of “sneaking” inference into the definition.
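One way to derive such a relation mechanically is sketched below. It is a simple fixed-point loop rather than an efficient algorithm, and the function name `transitive_closure` is chosen for this illustration only.

```python
def transitive_closure(pairs):
    """Return all (descendant, ancestor) pairs implied by transitivity."""
    closure = set(pairs)
    while True:
        # If (a, b) and (b, c) are present, then (a, c) is implied.
        implied = {
            (a, c)
            for (a, b) in closure
            for (b2, c) in closure
            if b == b2
        }
        if implied <= closure:
            return closure
        closure |= implied


HC_terms = {(1, 0), (2, 0), (3, 1), (4, 1), (5, 3), (6, 3)}
closure = transitive_closure(HC_terms)

print(sorted(closure))
print(sorted(closure - HC_terms))  # the pairs added by transitivity
```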

19

This complete table, …er…relation, contains the same information that could be inferred by applying the transitivity inference rule:

If Item_A is_a Item_B

and Item_B is_a Item_C

then Item_A is_a Item_C

….and in some cases the relation derived by transitivity would be prohibitively large, so inference rules are frequently used to determine the relationship between 2 items ad hoc.

Another way to “sneak” inference into this definition would be to consider the taxonomy as a set of inference rules, as will be considered later.
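When the full derived relation would be too large to store, the relationship between 2 items can be checked on demand. The sketch below walks upward from one item through the direct is_a links only; the names `parents` and `is_a` are invented for this example, and the data are the vehicle terms used earlier.

```python
from collections import defaultdict

HC_terms = {(1, 0), (2, 0), (3, 1), (4, 1), (5, 3), (6, 3)}

# Index the direct is_a links by child so parents can be looked up quickly.
parents = defaultdict(set)
for child, parent in HC_terms:
    parents[child].add(parent)


def is_a(item, target):
    """Decide whether item is_a target by walking up from item on demand."""
    to_visit = [item]
    seen = set()
    while to_visit:
        current = to_visit.pop()
        for parent in parents.get(current, ()):
            if parent == target:
                return True
            if parent not in seen:
                seen.add(parent)
                to_visit.append(parent)
    return False


print(is_a(5, 0))
print(is_a(4, 3))
```

The `seen` set matters for DAGs such as GO, where a node can be reached through more than one parent.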

20

The Gene “Ontology”

One of the best known “ontologies” is the Gene Ontology, which is actually 3 separate “ontologies” (with different “namespaces”):

- molecular function (cell biochemistry?)

What biochemical reactions do gene products perform?

- biological process (cell physiology?)

What cellular processes do the gene products participate in?

- cellular component (cell anatomy?)

In which cellular compartments or locations are those gene products expressed?

21

Here is a portion of the GO is_a DAG (Blake, 2004) for molecular function (example: “chromatin binding” is_a “DNA binding”):

(It is easy to confuse a gene product name with its molecular function, and for that reason many GO molecular functions are appended with the word "activity". www.geneontology.org, 2008)

22

Here is a subset (C) of the Gene Ontology molecular function concepts

binding
enzyme activity
helicase activity
DNA binding
nucleic acid binding
chromatin binding
lamin/chromatin binding
DNA helicase activity
ATP-dependent helicase activity
adenosine triphosphatase activity
ATP-dependent DNA helicase activity
DNA-dependent adenosine triphosphatase activity


23

The set (L) of Gene Ontology molecular function terms

GO:0005488
GO:0008047
GO:0004386
GO:0003677
GO:0003676
GO:0003682
GO:0003683
GO:0003679
GO:0008026
GO:0016887
GO:0004003
GO:0008094

24

The relation (FC) mapping GO terms to concepts

GO:0005488  binding
GO:0008047  enzyme activity
GO:0004386  helicase activity
GO:0003677  DNA binding
GO:0003676  nucleic acid binding
GO:0003682  chromatin binding
GO:0003683  lamin/chromatin binding
GO:0003679  DNA helicase activity
GO:0008026  ATP-dependent helicase activity
GO:0016887  adenosine triphosphatase activity
GO:0004003  ATP-dependent DNA helicase activity
GO:0008094  DNA-dependent adenosine triphosphatase activity

25

Here is the is_a relation (HC) defining relationships among concepts (nucleic_acid binding activity “is a kind of” binding activity):

Sub-function                                      Function
molecular_function                                root
binding                                           molecular function
nucleic acid binding                              binding
enzyme activity                                   molecular function
helicase activity                                 enzyme activity
DNA binding                                       nucleic acid binding
chromatin binding                                 DNA binding
lamin/chromatin binding                           chromatin binding
DNA helicase activity                             DNA binding
DNA helicase activity                             helicase activity
ATP-dependent helicase activity                   helicase activity
adenosine triphosphatase activity                 enzyme activity
ATP-dependent helicase activity                   adenosine triphosphatase activity
DNA-dependent adenosine triphosphatase activity   adenosine triphosphatase activity
ATP-dependent DNA helicase activity               DNA helicase activity
ATP-dependent DNA helicase activity               ATP-dependent helicase activity
ATP-dependent DNA helicase activity               DNA-dependent adenosine triphosphatase activity

(Note that some Sub-functions have multiple parent Functions.)



27

Here’s the first entry (of the ~26K) in the GO text version (with all three parts intermixed):

[Term]
id: GO:0000001
name: mitochondrion inheritance
namespace: biological_process
def: "The distribution of mitochondria, including the mitochondrial genome, into daughter cells after mitosis or meiosis, mediated by interactions between mitochondria and the cytoskeleton." [GOC:mcc, PMID:10873824, PMID:11389764]
synonym: "mitochondrial inheritance" EXACT []
is_a: GO:0048308 ! organelle inheritance
is_a: GO:0048311 ! mitochondrion distribution

You can also get the GO as RDF XML, or as a MySQL database.
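A text entry like the one above is easy to read mechanically, since each line is a tag followed by a value. The sketch below is not a full parser for the GO text format; it handles a single stanza, the `def:` line is left out for brevity, and the names `STANZA` and `parse_term` are invented for this illustration.

```python
STANZA = """\
[Term]
id: GO:0000001
name: mitochondrion inheritance
namespace: biological_process
synonym: "mitochondrial inheritance" EXACT []
is_a: GO:0048308 ! organelle inheritance
is_a: GO:0048311 ! mitochondrion distribution
"""


def parse_term(stanza):
    """Collect the tag: value lines of one [Term] stanza into a dict of lists."""
    term = {}
    for line in stanza.splitlines():
        line = line.strip()
        if not line or line.startswith("["):
            continue  # skip blank lines and the [Term] header
        tag, _, value = line.partition(": ")
        # Drop the trailing "! comment" that repeats the parent's name.
        value = value.split(" ! ")[0]
        term.setdefault(tag, []).append(value)
    return term


term = parse_term(STANZA)
print(term["id"][0], term["name"][0])
print(term["is_a"])
```

Values are kept in lists because tags such as `is_a` can appear more than once in a stanza.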

28

In the example, a GO concept (“name”) is being mapped to:

- a GO ID,
- a root “namespace”,
- a “def,” and also to a
- set of “synonyms”.

And, in addition, the concept may be mapped to “parent” or “child” concepts through the

- “is_a” (subsumption) and/or
- “part_of” (meronymy/partonomy),
- “regulates” (gene transcription),
- “positively_regulates”,
- “negatively_regulates” links

as exemplified in the next slide.

Remember that “is_a” is really more like “is a kind of”, and the last 3 in the list above are “non-taxonomic relations” (HR).

These links define edges of the GO DAGs.

29

[Term]
id: GO:0003677
name: DNA binding
namespace: molecular_function
def: "Interacting selectively with DNA (deoxyribonucleic acid)." [GOC:jl]
subset: goslim_candida
subset: goslim_generic
subset: goslim_plant
subset: goslim_yeast
subset: gosubset_prok
related_synonym: "microtubule/chromatin interaction" []
narrow_synonym: "plasmid binding" []
is_a: GO:0003676 ! nucleic acid binding

[Term]
id: GO:0003682
name: chromatin binding
namespace: molecular_function
def: "Interacting selectively with chromatin, the network of fibers of DNA and protein that make up the chromosomes of the eukaryotic nucleus during interphase." [GOC:jl, ISBN:0198506732 "Oxford Dictionary of Biochemistry and Molecular Biology"]
subset: goslim_generic
subset: goslim_pir
subset: goslim_plant
related_synonym: "microtubule/chromatin interaction" []
narrow_synonym: "nuclear membrane vesicle binding to chromatin" []
broad_synonym: "lamin/chromatin binding" []
is_a: GO:0003677 ! DNA binding

(This was changed since 2004.)

30

Note that the Genes listed in the previous DAG graphic are NOT part of the ontology.

In fact, there is NO “DATA” in the ontology.

Blake (2004) emphasizes some important features of GO as:

“Not a way to unify biological database[s]

Not a dictated standard

Not a database of gene products, protein domains, or motifs

Does not define evolutionary relationships”

31

In fact, GO may not even BE an ontology:

“GO ontology, which is more a nomenclature and a taxonomy, than a formal ontology, is highly successful and widely used.” (Sheth, 2003)

In fact, that wide usage may be due directly to the fact that it is NOT a formal ontology:

“Semi-formal ontologies that may be based on limited expressive power are most practical and useful. Formal or semi-formal ontologies represented in very expressive languages…have, in practice, yielded little value in real world applications.” (Sheth, 2003)

“Our object in touting the value of semi-formal ontologies is to prevent research in the Semantic Web field from leading straight into the very problems that AI found itself in.” (Sheth, 2003)

32

So where is the data?

Here is a 2-column table that uses GO to “annotate” the products of the genes shown in the graphic above from the Mouse Genome Informatics database:

Gene Name   Molecular Function
Mcmd2       GO:0003682
Mcmd4       GO:0003682,GO:0004003
Mcmd6       GO:0003682,GO:0004003
Mcmd7       GO:0003682,GO:0004003

Note that only the lowest level GO ID terms are used here to identify functions.

Note also that a gene product may perform multiple functions and that multiple function entries in this table are separated by commas.
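As an illustration, the annotation table can be loaded into a simple lookup structure. The sketch assumes whitespace-separated columns and comma-separated GO IDs, as in the table above; the names `TABLE`, `annotations`, and `genes_by_function` are invented here.

```python
TABLE = """\
Mcmd2 GO:0003682
Mcmd4 GO:0003682,GO:0004003
Mcmd6 GO:0003682,GO:0004003
Mcmd7 GO:0003682,GO:0004003
"""

annotations = {}
for row in TABLE.splitlines():
    gene, go_ids = row.split()
    # Multiple functions are separated by commas in the second column.
    annotations[gene] = set(go_ids.split(","))

# Invert the table: which genes are annotated with each GO ID?
genes_by_function = {}
for gene, go_ids in annotations.items():
    for go_id in go_ids:
        genes_by_function.setdefault(go_id, set()).add(gene)

print(annotations["Mcmd4"])
print(sorted(genes_by_function["GO:0004003"]))
```

Note that the data live entirely outside the ontology: the GO IDs are the only link between the two.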

33

Scale of the genome annotation

As of August of 2004, the Mouse Genome had been annotated using the Gene Ontology as:

- Function: 12K genes annotated with 30K annotations

- Process: 11K genes annotated with 21K annotations

- Location: 11K genes annotated with 20K annotations

34

Data may be presented via a tree representation:

(Click an entry to see data annotated with that entry.)

binding
  nucleic acid binding
    DNA binding
      chromatin binding
        lamin/chromatin binding
      DNA helicase activity
        ATP-dependent DNA helicase activity
enzyme activity
  helicase activity
    DNA helicase activity
      ATP-dependent DNA helicase activity
    ATP-dependent helicase activity
  adenosine triphosphatase activity
    ATP-dependent helicase activity
      ATP-dependent DNA helicase activity
    DNA-dependent adenosine triphosphatase activity
      ATP-dependent DNA helicase activity
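A tree view like this can be generated from the is_a pairs. The sketch below uses only a subset of the pairs shown earlier, and the names `IS_A`, `children`, and `render` are invented for the example. Because the structure is a DAG rather than a tree, a concept with 2 parents appears once under each of them.

```python
from collections import defaultdict

# (sub-function, function) pairs: a subset of the is_a relation shown earlier.
IS_A = [
    ("nucleic acid binding", "binding"),
    ("DNA binding", "nucleic acid binding"),
    ("chromatin binding", "DNA binding"),
    ("lamin/chromatin binding", "chromatin binding"),
    ("DNA helicase activity", "DNA binding"),
    ("helicase activity", "enzyme activity"),
    ("DNA helicase activity", "helicase activity"),
    ("ATP-dependent DNA helicase activity", "DNA helicase activity"),
]

# Index the pairs by parent so each node's children can be listed.
children = defaultdict(list)
for child, parent in IS_A:
    children[parent].append(child)


def render(concept, depth=0):
    """Return indented lines for concept and everything below it."""
    lines = ["  " * depth + concept]
    for child in children.get(concept, []):
        lines.extend(render(child, depth + 1))
    return lines


for root in ("binding", "enzyme activity"):
    print("\n".join(render(root)))
```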

35

When data is annotated using the most specific GO category, membership in parent categories (supersets) must be determined by “inference”, that is, by moving up the is_a DAG or up the part_of path (if available), applying the transitivity rule.

We “know” that transitivity holds, and we use it “intuitively” as we inspect the DAG:

If X is_a molecular_function_1

and molecular_function_1 is_a molecular_function_2

then X is_a molecular_function_2

36

Or we might think of each entry in the DAG relation as being an inference rule itself, and apply these rules whenever possible.

So an entry in the is_a DAG like:

( Function_1, Function_2 )

might be interpreted as the inference rule:

If gene_product_X is annotated with Function_1

and Function_1 is_a Function_2

then gene_product_X could be annotated with Function_2

(Aside: In some cases the is_a relation could be interpreted in reverse order as an “includes” relation?)

37

One way or another:

If you have the function, process, and location GO IDs for a collection of genes (which will never be in the GO itself)

and

you have the GO,

and

you have an appropriate inference capability

then you should be able to answer questions that relate to the membership of any annotated item in any GO class.

38

How might this process actually work with questions like:

“Tell me whether mouse Mcmd4 is a helicase.”

which should be roughly equivalent to:

“Is Mcmd4 annotated with “helicase activity” (GO:0004386) or some child thereof?”

Answer: Yes

“Which mouse genes are involved in DNA binding, but are not DNA helicases?”

which should be roughly equivalent to:

“Which mouse genes are annotated with “DNA binding” (GO:0003677) or some child thereof, but are not annotated with “helicase activity” (GO:0004386) or some child thereof?”

Answer: Mcmd, Mcmd2?

39

We could answer these questions by “inspection” because we know what the is_a relation “means”, and how to manipulate the relation “meaningfully”.

However, how can we answer these questions “mechanically” using a program?

In particular, if we interpret the is_a relation entries as inference rules, how can we process these queries?

First we will think of the queries as “assertions”, like:

“Mcmd4 is a helicase.”

and

“Gene_product_X displays “helicase activity”.”

and try to prove (or “satisfy”) them by using the inference rules to derive a list of facts provable from the given data.

40

Suppose you want to answer the question:

“Does mouse Mcmd4 display helicase activity?”

Start with a “fact base” composed of the set of known “facts” from your annotation database:

{ ( Mcmd4, chromatin binding ),
  ( Mcmd4, ATP-dependent DNA helicase activity ) }

Then repeatedly apply the inference rules to add facts to the collection of facts in the “fact base”, and . . .

Stop when the target assertion appears in the fact base, or when no new facts have been added during a step.

At that point, if the assertion is in the fact base, it has been “proved” to be true; otherwise it is false.

(This is a “forward-chaining” inference process.)
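The procedure just described can be sketched as follows. This is a minimal illustration, not a general inference engine: the names `IS_A` and `prove` are invented, the links to the molecular function root are omitted, and each step applies every rule to the facts known at the start of that step.

```python
# is_a entries as (sub-function, function) pairs, in the order given earlier.
IS_A = [
    ("nucleic acid binding", "binding"),
    ("helicase activity", "enzyme activity"),
    ("DNA binding", "nucleic acid binding"),
    ("chromatin binding", "DNA binding"),
    ("lamin/chromatin binding", "chromatin binding"),
    ("DNA helicase activity", "DNA binding"),
    ("DNA helicase activity", "helicase activity"),
    ("ATP-dependent helicase activity", "helicase activity"),
    ("adenosine triphosphatase activity", "enzyme activity"),
    ("ATP-dependent helicase activity", "adenosine triphosphatase activity"),
    ("DNA-dependent adenosine triphosphatase activity",
     "adenosine triphosphatase activity"),
    ("ATP-dependent DNA helicase activity", "DNA helicase activity"),
    ("ATP-dependent DNA helicase activity", "ATP-dependent helicase activity"),
    ("ATP-dependent DNA helicase activity",
     "DNA-dependent adenosine triphosphatase activity"),
]


def prove(facts, target):
    """Forward-chain over IS_A; return (proved?, number of steps taken)."""
    facts = set(facts)
    steps = 0
    while target not in facts:
        # One step: apply every rule to the facts known at the start of it.
        new = {
            (gene, parent)
            for gene, function in facts
            for child, parent in IS_A
            if function == child
        } - facts
        if not new:
            return False, steps  # nothing more can be derived
        facts |= new
        steps += 1
    return True, steps


fact_base = {
    ("Mcmd4", "chromatin binding"),
    ("Mcmd4", "ATP-dependent DNA helicase activity"),
}
print(prove(fact_base, ("Mcmd4", "helicase activity")))
print(prove(fact_base, ("Mcmd4", "lamin/chromatin binding")))
```

The second call shows the other stopping condition: the loop ends without a proof once a step adds no new facts.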

41

After step one, the “fact base” will contain (assuming the entries in the DAG relation are processed in the order presented earlier):

{ ( Mcmd4, chromatin binding ),
  ( Mcmd4, ATP-dependent DNA helicase activity ),
  ( Mcmd4, DNA helicase activity ),
  ( Mcmd4, DNA binding ),
  ( Mcmd4, ATP-dependent helicase activity ),
  ( Mcmd4, DNA-dependent adenosine triphosphatase activity ) }

42

After step two, the fact base will contain:

{ ( Mcmd4, chromatin binding ),
  ( Mcmd4, ATP-dependent DNA helicase activity ),
  ( Mcmd4, DNA helicase activity ),
  ( Mcmd4, DNA binding ),
  ( Mcmd4, ATP-dependent helicase activity ),
  ( Mcmd4, DNA-dependent adenosine triphosphatase activity ),
  ( Mcmd4, helicase activity ),
  ( Mcmd4, adenosine triphosphatase activity ),
  ( Mcmd4, nucleic acid binding ) }

at which point we can stop, because the assertion has been “proved”.

To resolve the second query we initialize the fact base with the entire annotation database, infer new facts until no facts can be added and then list the facts that include gene-products with “helicase activity”.

(Aside: How would this be done using SQL?)

43

How would this be done using SQL?

It might be possible to use a series of self-joins to get records that include the full path from each concept to root.

On the other hand, it would probably be better to compute the paths using some external tool and store the result as a table of concept-ancestor pairs (for each DAG) like:

record_count, namespace, concept ID, ancestor ID

where the GO IDs are foreign keys. The record_count might be useful to identify the order of discovery during traversal.

It might also be useful to include a "generation offset" from the concept to each ancestor.

Query resolution would then require simple SQL requests for concept-ancestor pairs.
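A minimal sketch of this approach, using Python's built-in sqlite3 module, is shown below. The table and column names (`ancestor`, `annotation`, `offset_n`) are invented for the example, the record_count and namespace columns are left out, and only the ancestors needed for one query are stored. Adding a row that pairs each concept with itself (offset 0) is a design choice made here so that “X or some child thereof” becomes a single lookup.

```python
import sqlite3

# Precomputed (concept ID, ancestor ID, generation offset) rows. Offset 0
# rows pair each concept with itself so "X or some child of X" is one lookup.
ANCESTORS = [
    ("GO:0003682", "GO:0003682", 0),  # chromatin binding
    ("GO:0003682", "GO:0003677", 1),  # ... is_a DNA binding
    ("GO:0003682", "GO:0003676", 2),  # ... is_a nucleic acid binding
    ("GO:0004003", "GO:0004003", 0),  # ATP-dependent DNA helicase activity
    ("GO:0004003", "GO:0003677", 2),  # ... is_a DNA binding
    ("GO:0004003", "GO:0004386", 2),  # ... is_a helicase activity
    ("GO:0004003", "GO:0003676", 3),  # ... is_a nucleic acid binding
]
ANNOTATIONS = [
    ("Mcmd2", "GO:0003682"),
    ("Mcmd4", "GO:0003682"), ("Mcmd4", "GO:0004003"),
    ("Mcmd6", "GO:0003682"), ("Mcmd6", "GO:0004003"),
    ("Mcmd7", "GO:0003682"), ("Mcmd7", "GO:0004003"),
]

db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE ancestor (concept_id TEXT, ancestor_id TEXT, offset_n INTEGER)"
)
db.execute("CREATE TABLE annotation (gene TEXT, go_id TEXT)")
db.executemany("INSERT INTO ancestor VALUES (?, ?, ?)", ANCESTORS)
db.executemany("INSERT INTO annotation VALUES (?, ?)", ANNOTATIONS)

# Genes annotated under the first GO ID but not under the second.
QUERY = """
SELECT DISTINCT a.gene
FROM annotation AS a
JOIN ancestor AS x ON x.concept_id = a.go_id
WHERE x.ancestor_id = ?
  AND a.gene NOT IN (
      SELECT a2.gene
      FROM annotation AS a2
      JOIN ancestor AS y ON y.concept_id = a2.go_id
      WHERE y.ancestor_id = ?)
ORDER BY a.gene
"""

# DNA binding (GO:0003677) or a child, but not helicase activity (GO:0004386).
binders_not_helicases = [
    row[0] for row in db.execute(QUERY, ("GO:0003677", "GO:0004386"))
]
print(binders_not_helicases)
```

With the ancestor pairs stored, no inference is needed at query time; the cost is paid once, when the table is built.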

44

Merging and building ontologies

If you confront 2 databases each with its own ontology, you MIGHT be able to map one to the other if you want to combine them or query both using the ontology of just 1.

There has been a lot of research in this area, and apparently a handful (or 2) of tools have been developed to help, but . . .

“There are multiple tools to merge or map ontologies, but they are quite difficult to use and require some user editing in order to obtain reliable results.” (Pasquier, 2008)

In fact, there are also “no standardized methods for building ontologies” (Stevens, et al., 2000), and even though there exist multiple toolsets to help, building ontologies remains difficult.

45

Shirky (2008) argues that ontologies are not necessarily the best way to annotate all kinds of data.

He provides a list of domain and user characteristics that bode well for success:

“Domain has a small corpus, formal categories, stable entities, restricted entities, and clear edges.”

The “participants” are expert catalogers and include authoritative sources of judgment, and the users are organized, and expert in their use of the ontology.

He cites the psychiatric Diagnostic and Statistical Manual (DSM-IV) and the Periodic Table as examples.

One might add that the domain categories are stable, but not too stable; the categorization structure is irregular; and there are storage space constraints.

46

Summary

There are many (not entirely consistent) definitions of ontology.

The “Big” definition provides a concrete toehold that helps clarify the other definitions, and can be used to structure further work.

The Gene Ontology is, and can be, used to annotate life-sciences data.

Programs can be written to use the Gene Ontology and annotated data to answer queries (prove assertions).

Building ontologies may be difficult, but should be worth the effort in many circumstances.

47

References

Aktas, Mehmet, and Marlon Pierce, Semantic Web and RDF Ontologies. http://grids.ucs.indiana.edu/ptliupages/presentations/SemanticWeb&RDFOntology.ppt

Berners-Lee, Tim, James Hendler and Ora Lassila, The Semantic Web, Scientific American, May 2001. http://www.sciam.com/article.cfm?id=the-semantic-web

Blake, Judith, “Using the Gene Ontology for Data Analysis”. http://www.geneontology.org/teaching_resources/presentations/2004-11_dataanalysis_jblake.ppt

Feigenbaum, Lee, Ivan Herman, Tonya Hongsermeier, Eric Neumann and Susie Stephens, The Semantic Web in Action, Scientific American, 2007. Ignore this article. http://thefigtrees.net/lee/sw/sciam/semantic-web-in-action#single-page

Gruber, Tom, “What is an Ontology?”, Personal web site. http://www.ksl.stanford.edu/kst/what-is-an-ontology.html

Jonquet, Clement, Mark A. Musen, Nigam H. Shah, Help will be provided for this task: Ontology-Based Annotator Web Service, International Semantic Web Conference (ISWC08), Karlsruhe, Germany. May 2008. http://bmir.stanford.edu/file_asset/index.php/1321/ISWC08_Jonquet_Musen_Shah_final.pdf

Niepert, Mathias, Cameron Buckner, and Colin Allen, Answer Set Programming on Expert Feedback to Populate and Extend Dynamic Ontologies, Association for the Advancement of Artificial Intelligence, 2008. http://inpho.cogs.indiana.edu/Papers/2008-InPhO-flairs.pdf

48

Pidcock, Woody, What are the differences between a vocabulary, a taxonomy, a thesaurus, an ontology, and a meta-model?, Web article, 2003. http://www.metamodel.com/article.php?story=20030115211223271

Rubin, Daniel L., Dilvan A. Moreira, Pradip P. Kanjamala, and Mark A. Musen, BioPortal: A Web Portal to Biomedical Ontologies, AAAI Spring Symposium Series, Symbiotic Relationships between Semantic Web and Knowledge Engineering, Stanford University, (in press). Published 2008. http://bmir.stanford.edu/file_asset/index.php/1298/AAAI-BioPortal-2008.pdf

Shadbolt, Nigel, Wendy Hall and Tim Berners-Lee, The Semantic Web Revisited, IEEE Intelligent Systems, 2006. http://eprints.ecs.soton.ac.uk/12614/1/Semantic_Web_Revisted.pdf

Sheth, Amit, Cartic Ramakrishnan, Semantic (Web) Technology In Action: Ontology Driven Information Systems for Search, Integration and Analysis, IEEE Data Engineering Bulletin, Special issue on Making the Semantic Web Real, U. Dayal, H. Kuno, and K. Wilkinson, Eds., December 2003. http://lsdis.cs.uga.edu/library/download/SR03-BW.pdf

Shirky, Clay, “Ontology is overrated”, Personal website. http://www.shirky.com/writings/ontology_overrated.html

Stevens, Robert, Carole A. Goble, and Sean Bechhofer, Ontology-based knowledge representation for bioinformatics, Briefings in Bioinformatics, November 2000. http://bib.oxfordjournals.org/cgi/reprint/1/4/398?ck=nck

Wikipedia, Ontology (information science), June 2008. http://en.wikipedia.org/wiki/Ontology_(computer_science)
