Evolutionary Biology || Knowledge Standardization in Evolutionary Biology: The Comparative Data Analysis Ontology

Chapter 12

Knowledge Standardization in Evolutionary

Biology: The Comparative Data Analysis

Ontology

Francisco Prosdocimi, Brandon Chisham, Enrico Pontelli,

Arlin Stoltzfus, and Julie D. Thompson

Abstract In this chapter we describe the development of a new biomedical onto-

logy in the context of the modern knowledge representation research field. We also

present the modeled concepts and their relevance in the light of the history of

evolutionary biology. CDAO stands for “Comparative Data Analysis Ontology”

and allows the representation of data produced in evolutionary biology studies in

the form of a set of well-defined concepts and the relationships among them. CDAO

is not intended to be a glossary or a simple taxonomy of evolution-related termi-

nology. Since evolutionary theory provides a broad framework for almost all fields

of biology, the concepts in CDAO reflect a rich history of controversies stressed by

academics in philosophical analyses of the whole field of biology. The concept of

an evolutionary tree to represent relationships between organisms is credited to

Darwin. However, the nature of species and the operational taxonomic units (OTU)

used in evolutionary analysis are still a matter of controversy among scholars. The

same can be said for a number of other concepts modeled in CDAO. For instance,

the choice of a methodological basis for evolutionary analysis is still a matter of

debate: should researchers use simple “non-theory” based comparative approaches

to analyze their data? Should they assume parsimony to adapt phylogeny towards a

popperian concept of science? Should they consider likelihood or Bayesian meth-

ods as more appropriate to their endeavor? In the first part of this chapter we try to

understand the role of a knowledge representation task in the context of modern

F. Prosdocimi and J.D. Thompson

Institut de Genetique et de Biologie Moleculaire et Cellulaire (IGBMC), Department of Structural

Biology and Genomics, Strasbourg, France. 1 rue Laurent Fries/BP 10142/67404 Illkirch Cedex,

France

email: [email protected]

B. Chisham and E. Pontelli

Department of Computer Science, New Mexico State University, P.O. Box 30001, MSC CS Las

Cruces, NM 88003, USA

A. Stoltzfus

Center for Advanced Research in Biotechnology, University of Maryland Biotechnology Institute,

9600 Gudelsky Drive, Rockville, MD 20850, USA

P. Pontarotti (ed.), Evolutionary Biology: Concept, Modeling, and Application,DOI: 10.1007/978-3-642-00952-5_12, # Springer‐Verlag Berlin Heidelberg 2009

195

research in biology. The second part is devoted to the presentation of the concepts

modeled in CDAO and their specification using standard ontology descriptors.

Finally, the third part of the chapter deals with historical discussions in evolu-

tionary biology that influenced the genesis and development of CDAO’s forma-

lized concepts. This approach will be extended to show how evolutionary data can

be represented in ontologies in order to cope with the multiplicity of approaches

and philosophical backgrounds used in this endeavor. CDAO will prove to be a

very pluralistic ontology, allowing the representation of evolutionary data in a

number of different theoretical backgrounds normally assumed by evolutionary

biologists. Although further discussions and new software will be needed, we

believe that CDAO will become the standard way for describing evolutionary

biology concepts in the near future. This statement is based on the fact that (1)

CDAO is highly theoretical and describes the most relevant evolutionary biology

concepts; and (2) it is flexible enough to be able to represent, inter-relate and

allow further reasoning over the flood of data from the modern petabyte era of

biological research.

12.1 Introduction

12.1.1 Knowledge Representation as a Positive Heuristicin Biomedicine

At the intersection of information, computation and biological sciences, there exists

an interdisciplinary research field that has been growing steadily over the last

decade. This field is related to new techniques of ontology production for biomedi-

cal science. The term ontology is inherited from its original application in philoso-

phy and is frequently said to be related to the study of the nature of being, existenceand reality. From the point of view of information science (or knowledge represen-

tation), this new task can be understood as the formalization of the structure of

knowledge in some area of research. This formalization is frequently performed

through the definition of fixed words linked to rigorous concepts that researchers

consider to be relevant in some specific area (Gruber 1995). Furthermore, by

providing a common representation of interesting research topics, ontologies permit

data and knowledge to be integrated, reused and shared easily by both researchers

and computers (Harris 2008). If everybody uses the same terms and relations to

describe their data, they can understand each other better, exchange results, inte-

grate them in a common perspective and write algorithms to analyze the organized

information in a large-scale fashion.

One of the advantages of these ontologies inspired by information science, lies in

the fact that they allow the construction of structures similar to phrases by which

concepts link to each other using relations. The semantics appear and defined

entities are related to each other using explicit verbs or verbal-derived forms,

196 F. Prosdocimi et al.

such as is_a, has, contains. The creation of an ontology may be compared to the

description of a new language – in the sense that symbols (nouns or sets-of-nouns)

are defined to describe events and entities in the real world and other symbols

(actions or verbs) are created to relate these entities based on some theoretical

background. The relationships between the terms or symbols are generally built

using some kind of pre-defined formal rules. In a classical paper called Empiricism,semantics and ontology, the German philosopher Rudolf Carnap considered math-

ematics as a sort of ontology and suggests that mathematicians speak about

“symbols and formulas manipulated according to given formal rules” (Carnap

1950). From a modern knowledge representation point of view, the symbols

described by Carnap can be understood as the ontology concepts and the formal

rules are represented by the relations between the terms and some restrictions to

these relations. An example from molecular biology is the Gene Ontology in which

the relationships allow the hierarchical description of gene functions, such as:

“glycolysis” (GO:0006096) is part_of “hexose metabolic process” (GO:0019320)

(Ashburner et al. 2000). This concept-and-relations schema allows ontologies to

formalize and create meaning through such semantic sentences.

Although long-standing classical studies in epistemology and philosophy of

science have clearly demonstrated that this task of knowledge conceptualization

can never be consensual among all researchers in some area, well-designed formal

representation of natural, real-world entities can advance human knowledge and

generate new research fields. The Hungarian philosopher of science Imre Lakatos in

his classical book Methodology of scientific research programs argues that a

research (scientific) program is growing when it generates what he calls positiveheuristics, i.e., when it generates even more research and becomes more studied and

specialized over time (Lakatos 1978). Figure 12.1 demonstrates the exponential

growth in the study of biomedical ontologies, represented by the number of papers

with the word ontology in their title or abstract.

The data suggest that ontologies are deemed useful by the scientific community

and that the consideration of conceptualization issues in biological sciences brings

positive heuristics to this research field. In other words, the production of biomedi-

cal ontologies is a fertile research program that should promote the development of

the field.

12.1.2 Ontologies in the Petabyte Era of Biological Research

When initially describing some conceptual system, all the terms and relations must

be explicitly described as clearly as possible in order to solicit criticisms from the

community. These criticisms should generate discussions about the formalized

concepts in order to make them as clear, broad and consensual as possible. This

agreement among researchers is of critical importance to the general acceptance of

the ontology and thus, most scientific ontologies in use today are the result of

12 Knowledge Standardization in Evolutionary Biology 197

collaborative scientific research efforts (Rodrigues et al. 2006; Leontis et al. 2006;

Gene Ontology Consortium 2001).

As a consequence, once a system of concepts is artificially created to represent

some area of human knowledge, empirical data should be described using this

prototype ontology. This in turn, allows the use of computational techniques, such

as the vast array of expert algorithms (Sirin et al. 2007; Tsarkov and Horrocks 2006)

developed to read large amounts of empirical data and to inter-relate and interpret

them under some rational perspective. These algorithms, which extract knowledge

from ontology-based resources, are frequently called reasoners. Computer

reasoning allows the production of higher-level information from large volumes

of raw data – for example, derived from hard-science molecular experiments – and

facilitates subsequent analyses and interpretation by humans. The extraction of

knowledge from high-throughput data clearly cannot be performed by a single

individual, or even by a trained group of experts. In order to develop modern

science, new computer-readable standards are urgently needed in a number of

research areas, including evolutionary biology.

In a recent set of reports published in Nature (Issue 7209 dated 4 September 2008)

about “big data,” the editors argue that biology will arrive soon in what they have

called the “Petabyte era” (the petabyte is a value corresponding to 1,024 terabytes

that, in turn, represents about 1,000,000 megabytes or 1015 bits of information). An

entirely new generation of super-fast DNA sequencers will soon be up and running

all over the world (Rothberg and Leamon 2008; Shendure and Ji 2008) and modern

DNA sequencers such as Illumina’s Solexa, the Applied Biosystems’ Sequencing by

Oligonucleotide Ligation and Detection (SOLiD) technology and the GS FLX

800

Gene Ontology (Title)

Gene Ontology (Abstract)

Ontology (Title)

Ontology (Abstract)

700

600

500

400

300

200

100

02000 2001 2002 2003 2004 2005 2006 2007 2008*

Fig. 12.1 Growth in ontology usage in biomedical science research measured by Pubmed searches

for terms “ontology” and “gene ontology” in abstracts and titles of papers since 2000


instruments from Roche/454 Life Sciences will soon revolutionize the study of life

sciences (Marguerat et al. 2008). Some of these sequencers are capable of producing

the information necessary to build an entire human genome (3Gb) in about a week

(Dohm et al. 2008) but we still lack the informational and computational infrastruc-

ture to actually analyze all these sequences, in order to understand the biological

and evolutionary perspectives. Bioinformatics resources have begun to respond to

the challenges posed and are now addressing the issues of storage, availability,

reasoning and interpretation of this explosion of information (Cochrane et al. 2009).

The next decade will certainly see the rise of new ways of working in the so-called

“big science” field and it is probably not too risky to suppose that the graph shown in

Fig. 12.1 represents only the start of a more accentuated growth in the usage of

ontologies and ontology-based studies over the next decade.

12.1.3 The Central Role of Evolutionary Biology

The modern version of Darwinian evolutionary theory could be said to be the most

unifying conceptual system in the whole field of biology. “Nothing in biologymakes sense except in the light of evolution,” says one of most notorious biologists

from the 20th century and one of the architects of synthetic theory of evolution, orthe modern synthesis, the Russian geneticist Theodosius Dobzhansky (Dobzhansky

1973). If we assume that evolution is the explicative point of convergence of the

biological sciences, it follows that researchers in this crucial field must adapt their

methodology to today’s petabyte era. However, the long tradition of evolutionary

studies has highlighted a number of conceptual and philosophical discussions on its

main problems, concepts and data analysis paradigms. Recurrent academic discus-

sions in the evolutionary field concern, among others, (1) the nature of species, (2)

the strength of natural selection and the relevance of random processes in evolution

(selectionism versus neutralism), (3) the origin of life, (4) the rate of genomic and

anatomic modification over time (gradualism, punctuated equilibrium and salt-

ationism) and (5) the different empirical approaches to study and comprehend

evolutionary data (phenetics, cladistics, parsimony, likelihood, distance metrics,

etc.). Thus, in order to formalize the basic knowledge in the evolutionary biology

research field, most of these ideas, discussions and paradigms must be taken into

account. As we will see, most of these discussions are particularly relevant when

choosing and naming fundamental concepts in the evolutionary biology research

field and defining their scope and the relations between them. The terminology must

be exact and precise and must allow the multiplicity of approaches to be repre-

sented, so that every evolutionary biologist will be able to describe his data using

his preferred methodology. The goal of comparative data analysis ontology

(CDAO) is exactly this: to model a limited number of important concepts in the

field in such a way that researchers will feel free to describe their data from their

own perspective.


12.1.4 Current Biomedical Ontologies

Many ontologies have been developed recently in the biomedical field to represent

specific aspects of the real world, such as anatomical parts (Gaudet et al. 2008;

Maglia et al. 2007; Larson et al. 2007; Trelease 2006; Bard 2005; Lee and Sternberg

2003), development, disease, or DNA sequences (Segerdell et al. 2008; Schulz et al.

2007, 2008; Chabalier et al. 2007; Bard et al. 2005; Jaiswal et al. 2005). A

significant number of these ontologies are housed on the Open Biomedical Ontol-

ogies (OBO) web site, created by Ashburner and Lewis in 2001 as an umbrella body

for the developers of life science ontologies (Smith et al. 2007). Probably, the most

well-known and widely used ontology in the biomedical field is the Gene Ontology

(GO), which consists in three separate parts, describing different cellular structures,

molecular functions and biological processes of genes. GO has been very successful

as the first set of concepts to be actually applied in a large number of genome

projects, e.g., FlyBase (Drosophila), the Saccharomyces Genome Database (SGD),

the Mouse Genome Database (MGD) and modern scientific studies all over the

world.

Although GO allows a clear and concise representation of the human knowl-

edge associated with gene functions, it contains very simple semantic relation-

ships among the concepts described therein. The concepts are structured in a

hierarchy ranging from the most general to the most specific (Ashburner et al.

2000) and specific sub-concepts, such as “DNA recombination” (GO:0006310) or

“DNA repair” (GO:0006281), are related to their parents, such as “DNA metabo-

lism” (GO:0006259), using the is_a or part_of semantic relationships (http://wiki.

geneontology.org/index.php/Relation_composition). Most of the other accepted

ontologies in the biomedical field are also conservative in the use of new relations

among entities, illustrated by the fact that the Relation Ontology (Smith et al.

2005) contained only about a dozen relations in its latest release. However, it is

clear that some fields of knowledge cannot be represented by such a narrow set of

relations and a dedicated discussion list on the subject has recently been set up

(http://www.obofoundry.org/ro/) with proposals for a number of new relations

from scientific research organizations, such as the UCDHSC (University of

Colorado at Denver and Health Sciences Center) and the OBI (Open Biomedical

Investigations).

As we argued before, modern high-throughput methods will require that data is

described in terms of standard vocabularies and today’s substantial efforts to

produce consistent ontologies will be valuable for twenty-first-century biology.

Nevertheless, it is clear that ontology concepts that are clearly correlated to

the physical world (such as anatomical parts or even DNA-based descriptors)

are less interpretative in fashion than more general or abstract concepts, such as

the OTU or character in evolutionary theory. This makes the production of broad

theory ontologies a different task, requiring distinct formalisms, structures and

relations.


http://wiki.geneontology.org/index.php/Relation_composition

http://wiki.geneontology.org/index.php/Relation_composition

http://www.obofoundry.org/ro/

12.2 The Comparative Data Analysis Ontology

12.2.1 History of CDAO Development

Taking into account the necessity to analyze modern high-throughput biology data

under an evolutionary perspective, the production of an ontology was envisioned by

the members of the Evolutionary Informatics group at NESCent (http://evoinfo.

nescent.org). The Evolutionary Informatics group includes many of the most

prominent world-leaders in software development for evolutionary analysis. The

members of the group have been meeting at Durham, NC biannually since early

2007 to discuss future infrastructure requirements in this research area. The idea of

developing an ontology to formalize the knowledge in the field and allow automatic

reasoning came in the early days. Although everyone had agreed that concepts in

evolutionary biology could be interpreted differently depending on the various

perspectives – as we shall discuss – a number of these experts decided to launch

this ambitious project of creating a broad-range ontology to describe this central

notion in biology: the evolutionary theory. The comparative data analysis ontology

(CDAO) represents the first step in this process.

CDAO was developed using the OWL language using the Stanford University

ontology editor Protege. OWL (Web-Ontology Language) was chosen as the most

modern, flexible and robust language in the field, being the standard recommended

by theWorldWideWeb consortium (W3C) for ontology implementation. Although

recent and still under development, the ontology editor Protege has been shown to

be a user-friendly, robust and powerful tool to build OWL ontologies. With more

than 100,000 registered users, this software is supported by numerous academic,

government and corporate users (http://protege.stanford.edu/).

The overall structure and the main concepts described in CDAO are shown in

Fig. 12.2.

12.2.2 CDAO: Some Evaluation Considerations

The main purpose of the first prototype release of CDAO is to describe the most

basic structure of knowledge underlying the key entities in evolutionary analysis.

Thus, it does not provide an extensive terminological coverage (e.g., covering

the complete set of concepts identified in the controlled vocabulary at https://

www.nescent.org/wg_evoinfo/ConceptGlossary). This distinguishes CDAO from

ontologies like the Gene Ontology – where there is a more limited emphasis on

structuring knowledge (e.g., by using only a very basic set of relations between

terms) but an extensive informational coverage, mostly expressed in terms of a deep

taxonomy.

Several schemes have been proposed to provide a classification of ontologies.

According to such schemes, CDAO can be classified as follows:


http://evoinfo.nescent.org

http://evoinfo.nescent.org

http://protege.stanford.edu/

http://https://www.nescent.org/wg_evoinfo/ConceptGlossary

http://https://www.nescent.org/wg_evoinfo/ConceptGlossary

l Following the richness classification (Lassila and McGuiness 2001), CDAO is at

one of the highest levels of complexity, as it includes general logical constraints

and disjoint classes.l According to the subject-based classification (Gomes-Perez et al. 2004), CDAO

can be viewed as a domain-task ontology, as it describes concepts and relations

valid across one specific (although broad) domain: evolutionary analysis, with

a focus on specific tasks within the domain (e.g., phylogenetic inference and

description).l Van Heijst et al. (1997) provides two dimensions for the classification of

ontologies. The first dimension measures the amount of structure present in the

ontology; in this regard, CDAO can be viewed as a knowledge modeling

ontology, having less of a focus on extensive terminological coverage and

character state data matrix

character statedata matrix

character statedatum

state

state

unrootedtree

state

has

has

has

has

tree

haschild

hashas

haschild_node

hasright_node

hasright_state

hasleft_node

hasleft_state

hasparent_node

hasparent

hasdescendant

hasancestor

node

node

node

node

hasTU

represents_TU is_transformation_of

transformation

transformation

topology

part_of

part_of

part_of

is_a

is_a

is_a

part_of

belongs_tobelongs_tocharacter

Annotation:AlignmentProcedure

Annotation:

rootedtree

directededge

edge

connects_to

Annotation:

Annotation:taxonomic_link...

TreeProcedureModel...

Length...

Fig. 12.2 Main evolutionary biology concepts represented in CDAO. The ontology is divided in

three inter-related parts: (1) the character-state data matrix, representing data conceptualization

and description; (2) the tree topology, representing the ancestral relationships among the groups

analyzed; and (3) the transformation, representing the step-wise modification in characters along

evolutionary time. *2008 data collected on November 18, 2008


deep hierarchical classification and a greater emphasis instead on the basic

conceptual structure of knowledge in the domain. The second dimension classi-

fies ontologies based on the subject of the conceptualization; in this regard,

CDAO is a domain ontology (like the majority of ontologies in the OBO

foundry).l Mizoguchi and Vanwelkenhuysen (1995) classify ontologies based on their use.

The four top levels of the classification distinguish between Content, Communi-

cation, Indexing and Meta ontologies. CDAO is clearly a Content ontology; its

aim is to enable the reuse of knowledge related to evolutionary analysis across

agents and applications (e.g., stages of an analysis pipeline). It is important to

observe that CDAO is not a Communication ontology, as the purpose is to

emphasize the structure of knowledge. Content ontologies are further classifiedas Workplace, Task, Domain and General ontologies. CDAO is a Domain

ontology, which is Task-independent (not being tied to one specific task),

Activity-related (since CDAO emphasizes, at this time, the representation of

knowledge related to phylogenetic analysis) and it represents an Object ontology

(since it describes structure, behavior and function of entities).

12.2.3 CDAO Version 2.0

Ontology development is a continuous, iterative process and work on version 2 of

CDAO has already begun. This version aims to provide a more complete structure

or classification of the concepts defined in CDAO 1.0, in order to facilitate more

complex querying of CDAO data sets. As an example, we have added a richer set of

classes to describe the topology of trees. Using this richer annotation set, higher-

order concepts such as “Fully Resolved Tree” or “Polytomy” can be easily deduced,

either by a researcher or by an automatic reasoning software system. Previously,

these concepts could be inferred from the topology of a tree, if the reasoning system

had information concerning the definition of a polytomy. Our goals in including

these definitions in CDAO are (1) to standardize the definition of these concepts

and (2) to allow the use of general reasoners that have no specific evolutionary

knowledge. The terminological/informational component of CDAO is also being

expanded to provide a more extensive coverage of evolutionary concepts. For

instance, CDAO 2.0 includes imports for MyGrid (http://www.mygrid.org.uk/)

terms, in order to facilitate its use within web services. The additional terms specify

a large number of phylogenetic file formats, repositories and web services in

common use. This change will hopefully contribute to early efforts to develop a

MIAPA (Minimal Information About a Phylogenetic Analysis) standard (Leebens-

Mack et al. 2006).

All the documents published and other relevant information can be found on our

web site (http://www.evolutionaryontology.org). Please refer to these documents

when searching for the development of CDAO-based applications.


http://www.mygrid.org.uk/

http://www.evolutionaryontology.org

12.3 Conceptual Revolutions in Evolutionary Biology

and Historical Analysis of CDAO Concepts

In this section our aim is to make a very brief recapitulation of the history of

evolutionary biology regarding some conceptual modifications in this discipline

since the work of Charles Darwin. We will discuss how these concepts have

evolved and how they are currently represented in CDAO. We hope to provide a

theoretical analysis of our conceptualization work, illustrating the theoretical plas-

ticity of our ontology and how it facilitates more practical tasks related to data

annotation and representation of trees and character matrices, for example.

The concepts formalized in CDAO represent a vast and rich history of contro-

versies. The theory of evolution, as the broadest conceptual model in biology, has

been the target of a number of theoretical revisions mainly during the twentieth

century, resulting in today’s rich version that encompasses and links together many

diverse research fields, including biochemistry, genetics, genomics, ecology, zool-

ogy, botany, microbiology, development, physiology and medicine. Every single

data collection made in biology can be viewed from an evolutionary perspective

and our goal is to represent them in a modern and scalable way.

12.3.1 Conceptual Reformulations in EvolutionaryBiology Since Darwin

The years that followed Darwin’s publication The Origin of Species (1859) revealeda number of misunderstandings about the meaning of his global theory. Although

evolution, as opposed to creationism and common ancestry were soon accepted by

most evolutionists in the nineteenth century, the other parts of the theory were still

seen skeptically. The first point of theoretical disagreement was resolved in the

nineteenth century by the German biologist August Weissmann in his book OnHeredity, finally bringing to an end the Lamarckian idea of the inheritance of

acquired characteristics and the use and disuse of characters (Lamarck 1809).

Weissmann advocated the “germ plasm” theory, in which inheritance could

only take place by modifications in germ cells, since the other cells of the body

do not influence future generations and therefore could not act as agents of

heredity (Weissmann 1889). The term Neodarwinism is frequently used to refer

to Weissmann’s view of evolutionary theory.

During the first decade of the twentieth century, the works of Mendel were

rediscovered, although the interpretation of his works, made mainly by the Dutch

botanist Hugo De Vries, turned this theory into a mutation or saltationistic view of

evolution as opposed to Darwinian gradualism (Jacob 1970). Further development

in this field in Britain, made mainly by William Bateson, led to the birth of genetics

as a scientific discipline and the proliferation of saltationism. Nevertheless, it was

only during the 1930s and 1940s that a new conceptual revolution in evolutionary


biology would take place, based on the work of Thomas Morgan in New York. His

studies on heredity and evolution were mainly developed in fruit-flies of the genus

Drosophila, making this insect one of the most studied model organisms in the

whole field of biology. According to Mayr (1991), the studies of Morgan and his

students indicated that small mutations could allow a gradual modification in

populations; the sudden leaps predicted by saltation theory were no longer needed

to explain evolution. The subsequent revolution, integrating gradualism, Mendel-

ism and the recent population genetics from a Darwinistic perspective was named

the Fisherian synthesis (Mayr 2004). The next conceptual integration brought a

common perspective to population genetics, biodiversity, spatial-temporal patterns

and evolutionary theory. While population geneticists were interested in explaining

the evolution and allelic changes inside a population group (anagenesis), naturalists

were interested in how a population could differentiate into two groups, leading to a

new species (cladogenesis). The publication of Genetics and the Origin of Speciesby Dobzhansky in 1937 opened the doors for the integration of these two previously

inconsistent points-of-view into a single evolutionary framework. Julian Huxley

named this the synthetic theory of evolution, or new synthesis and it was better

understood and elaborated after the publication of books byMayr, Simpson, Huxley

and Stebbins.

Finally, the last great conceptual revolution in evolutionary biology resulted

from the development of molecular biology. Crick’s central dogma, predicting that

the flux of genetic information was unidirectional, finally explained the molecular

basis for why the information from the environment could not directly influence

DNA, corroborating Weissmann’s studies on heredity (Mayr 1991). The overall

similarity of the genetic code among all life forms could be seen as a final proof

corroborating all evolutionary theory. Moreover, the astonishing similarities

amongst the sequences of genes and proteins involved in the most basic metabolic

pathways between bacteria, archaeabacteria, protists, fungi, plants and animals may

be seen as the ultimate attestation that evolution occurs. The advent of molecular

biology, together with previous theoretical synthesis, has led to our modern evolu-

tionary theory, that Mayr suggests should be simply called Darwinism.

CDAO can only be envisioned today thanks to all of these theorists, who have

provided us with a solid theoretical basis. In the following sections, we will discuss

the history of some specific concepts formalized in CDAO.

12.3.2 History of CDAO Concepts: The Taxonomic Unit

A widely used term in the evolutionary analysis field is “OTU,” which stands for

“Operational Taxonomic Unit.” OTU generally refers to the organisms or group of

organisms the researcher is working on when he begins his studies. The problem of

the OTU concept is clearly related to the so-called “species problem.” For a long

time, species were considered to be biological organisms sharing some arbitrarily-

definedmorphological characteristics. However, with the development of biology in


the nineteenth and twentieth centuries, a number of obstacles were identified, such as

(1) the description of a number of cryptic species, where organisms very similar in

form have completely different evolutionary origins; and (2) the enormous amount

of variation observed in organisms from the same species, linked to sex, age, or

particular environments. In the early twentieth century, the concept of reproductiveisolation was proposed and species came to be considered as closed genetic groups

changing gene alleles inside their reproductive group. But this concept has rapidly

presented problems too. How should this be applied to asexual species? And how

does this explain the formation of hybrids? No general consensus about the species

problem has been achieved yet and pragmatically speaking, taxonomists still use

morphological and physiological characters to describe new species.

Recently, molecules have been widely used as units for comparison. Although

molecules can be representative of species or organism groups, researchers must

take into account the fact that the evolution of a gene or protein may be biased by a

number of factors. The evolutionary history of a gene family can be compared to the

evolutionary history of the organisms harboring these genes but the evaluation of a

number of different genes would be necessary to actually infer a phylogeny of

species based on molecular data.

12.3.2.1 TU@CDAO

In the context of an ontology representation of evolutionary data, all these philo-

sophical problems concerning the definition of biological groups have been, of

course, inherited. It is clear that the definition of any entity, or “taxonomic unit,” to

be compared is entirely under the responsibility of the researcher and CDAO should

allow the representation of entities at the level of molecules, gene/protein families,

organisms, populations, species, or any other higher taxonomic level.

One particular problem that we encountered in the formalization of the taxo-

nomic unit concept in CDAO is that the term “OTU” is often used in phylogenetic

studies to indicate present-day entities (species, populations, individuals, mole-

cules, etc.) under study, while a different term “HTU,” or hypothetical taxonomic

unit, is used to refer to the ancestral entities. In some specialized phylogenetic

studies, for example in viral studies, the distinction between present-day entities

and ancestors is less clear, since it is possible that the ancestors of these rapidly

evolving organisms still exist. As a consequence, we decided to define a single

concept for both present-day and ancestral entities: the TU, or taxonomic unit.

12.3.3 History of CDAO Concepts: The Characterand the Character-State Data Matrix

Once the taxonomic units to be studied have been clearly defined, the next step in an

evolutionary analysis is the choice of the specific traits or characters to be used to


classify these TUs. Until the theory of descent was actually accepted in biology and

used as a criterion for classification, morphological traits were employed as char-

acters to distinguish species by naturalists. David Hull credits to Hennig the first

serious attempt to claim that systematic classifications should be made using

evolutionary homologies (Hull 1988) and this has become one of the most relevant

concepts in evolution: the concept of character homology. Two characters are said

to be homologous if they descend from the same ancestral character present in an

ancestral organism. This argument was so strong that it was subsequently

incorporated by most evolutionary biologists and it could be said that systematics

today cannot be considered as separate from phylogenetics.

Maureen Kearney however points out that the identification and definition of

characters in species have led to the formation of two different schools in evolu-

tionary analysis and taxonomy (Kearney 2007). The first school was formed by the

pheneticists who claimed that both evolutionary analysis and systematics should be

done in a “theory-free” context. They claimed that a homology assumption would

create some sort of circularity in the methodology of evolution: if evolutionists

want to use phylogenetic trees to test hypotheses about evolution they should not

use evolutionary homology assumptions to build the trees. The second school of

cladists argues that both the evolutionary analysis of organism groups and the

biological nomenclature of them should be based on Darwinism, i.e., that characters

should be analyzed under an evolutionary context if and only if they can be assumed

to be homologous. In response to the tautology accusation made by pheneticists,

cladists advocate the philosophical principle of parsimony and the general scientific

principle known as the Occam’s razor normally understood as “Pluralitas non estponenda sine necessitate” (“plurality should not be posited without necessity”). Thecladists claim that their theory is closer to the popperian philosophy of science in

the sense that the most parsimonious tree is the representation of the evolutionary

history that could not be falsified by the available data in the light of modern

evolutionary theory.

12.3.3.1 Character@CDAO

In the face of this plurality, CDAO allows the user to define their characters

independently of any supposed homology among them. Although it has now

become almost a consensus that classification should be done using homologous

characters, whole-genome and multiple alignment comparisons are capable of

producing characters that seem more phenetic than cladistic. Once the user has

defined the characters, they can be annotated as being homologous or not.

Depending on the particular entities (TUs) under study, widely different

characters may be used to classify them, including morphological traits, molecular

characters, such as nucleotide residues or amino acids, gene functions, cellular

localizations or expression levels. The diverse types of data are taken into account

in CDAO by defining sub-classes of the more general character concept,

namely discrete_character, continuous_character, categorical_character and


compound_character. The compound_character provides a mechanism for combin-

ing individual characters into a single character, for example, to allow the definition

of different parts of a molecule as special characters in an evolutionary study.

For a given set of TUs, the state of each of the selected characters is entered into

a matrix, known as a “character-state data matrix,” where the rows of the matrix

correspond to TUs and the columns represent the characters. The “character-state”

data model is generic and can be applied to most data types, e.g., in a protein

sequence alignment, the TU represents a protein, the character represents a column

in the alignment and the character state is either a residue or a gap. The states of

high-level biological characters (morphology, development, anatomy, behavior)

are also typically encoded as discrete states. Many of these character states are

defined in existing bio-medical ontologies, often in OWL format and can be used in

conjunction with CDAO to annotate specific characters and character states.

Missing data and absent features, common in biological data, may be treated as

an extra state. Thus, a feature that is found only in some TUs, such as an intron, can

be represented by a binary character with “presence” and “absence” states. This

concept will become increasingly important with the more widespread use of

automatic phylogenetic inference approaches, as defined by (Eisen 1998).

12.3.4 History of CDAO Concepts: The Tree

The fact that biological organisms relate to each other following a tree-like schema

was originally described by Darwin in The Origin of Species (1859). One of the

biggest pieces of scientific work in all the history of science, this paradigm-shift

book – in the words of the philosopher of science Thomas Kuhn (Kuhn 1962) – was

published exactly 150 years ago and contained only a single illustration. This

picture is the ancestor of all modern studies in evolution and opened the way for

a field mainly developed during the 1950s by Willi Hennig: cladistics (or phyloge-

netics). Figure 12.3 shows the original picture in Darwin’s publication and a

modern phylogenetic tree of life.

According to Mayr (2004), none of the Darwinian sub-theories was accepted so

enthusiastically as the so-called theory of common ancestry. This common ancestry

insight gave birth to the idea of building a tree to relate modern organisms back to

ancestors in the past. In his book What makes biology unique?, Mayr states: well-

known “similarities, such as the cord in tunicates and the brachial arcs in fishesand terrestrial vertebrates were completely disconcerting until they were inter-preted as vestiges of a common past.”

Since then, the common ancestry idea has been referred to as the theory ofdescent (Hennig 1950; Weissmann 1889) and formed the basis for the initial

formalization of methods in evolutionary biology. The methods for the description

of an evolutionary tree evolved relatively slowly during the twentieth century, until

recently when it converged on the graph theory originally developed in the com-

puter sciences. The origin of graph theory is frequently attributed to a paper of the


Swiss mathematician Leonhard Euler, more than one century before Darwin. With

the multidisciplinary research fields growing at the end of the twentieth century,

information theorists started to work with evolutionary biology data and standard

methods to treat and represent tree-structured data in computer and information

sciences were introduced into evolutionary biology. The most widely used format

existing today to represent an evolutionary tree is the so-called Newick standard –

which makes use of the correspondence between nodes in a tree and nested

parentheses. This format was adopted in evolutionary biology in 1986 during an

informal meeting of the Society for the Study of Evolution in Durham, New

Hampshire.

12.3.4.1 Tree@CDAO

The traditional binary tree, where extant species evolved from a common ancestor,

is now being replaced by a more general representation, a phylogenetic network, for

example, in the case where horizontal gene transfer events produce complex trees

with criss-crossing branches. CDAO uses a semantic-based structure to represent

both networks and trees, with nodes and branches defined as separate concepts. The

trees can be either rooted or unrooted and branches in a rooted tree can be assigned

a “direction,” indicating the parent-child relationships between the associated

nodes.

The structure defined in CDAO allows researchers to annotate both branches and

nodes with specific information and in this way, character states can be associated

with specific tree lineages, present-day or ancestral TUs. Similarly, genetic events

(duplication, lateral transfer, inversion, transposition, deletion, insertion, etc.),

leading to modifications or transformations in the state of a character, can be

localized at specific branches of the tree. The existence of dedicated concepts

associated with ancestral TUs in an evolutionary tree allows the representation of

Fig. 12.3 The original and modern stages in phylogenetic tree building. (a) The very first

phylogenetic tree of organisms contained in the Origin of Species (Darwin 1859); (b) a modern

phylogenetic tree of life


the most likely status of characters in putatively ancestral organisms and can

help researchers to formulate better hypotheses about both living and non-living

organisms.

12.4 Conclusions

Most of the discussions in evolutionary biology made during the twentieth century

have produced what we now call modern Darwinism or, simply, Darwinism. Many

ancient controversies have been solved and the ones that have not been solved do

not seem to pose any practical problem to the modern evolutionary biologist.

Darwinism has been completely integrated theoretically with other broad fields in

biology, such as genetics and molecular biology. If an ontology can now be

envisaged to represent and annotate data in evolutionary biology, it is certainly

based on the work of numerous biologists and philosophers towards developing

unifying concepts in evolution. Even inside Darwinism, the controversies are now

well-understood and evolutionary biologists can work together, even though they

base their premises in different traditional schools. Modern evolutionists can switch

between theories, such as cladistics or phenetics, choosing the most pertinent

framework, without being accused of philosophical betray. In the twenty-first

century era of integrative biology, researchers have understood that no method is

superior in all cases and there will always be a number of variables or conditions

that will make one or another methodology more suitable for a particular case. And

although a number of controversies still exist, such as the nature of species (or

TUs), the best method to analyze data or the fact that characters should or should

not be homologies, evolutionary biologists can understand the limitations of their

methods and concepts in order to produce high-quality scientific understanding.

As a consequence, any modern method that aspires to unite evolutionary biol-

ogists must be able to cope with this diversity of definitions. CDAO was developed

with these discussions in mind and it should permit the representation of data in any

theoretical background currently available. Although general enough to allow the

representation of different paradigmatic views in evolution, CDAO represents a

well-defined structure of the knowledge in evolutionary biology that will facilitate

automatic reasoning by algorithms.

In conclusion, CDAO is the result of initial efforts made by members of the

NESCent Evolutionary Informatics consortium to conceptualize and define

the most pertinent concepts used in a comparative data analysis that focuses on

the evolution of biological organisms. Therefore CDAO allows a formal represen-

tation of evolutionary data, where annotations of any sort can be linked to the TUs,

characters, character-state data matrix and the branches of a tree. Thanks to the

formalisms defined in CDAO, most of this information can be subsequently parsed

and analyzed by automatic reasoning algorithms, with the goal of producing

new and high-level information. CDAO is intended to provide the basis for a

standardized file format for evolutionary data, promoting data reuse and data


interoperability. More importantly, it is hoped that it will eventually offer a

complete framework for both computer and human expert analyses.

12.5 Future Developments

CDAO is an ongoing project and we plan to add more specific concepts as new

versions of the ontology are built in the near future. Next steps on CDAO develop-

ment will include the addition of still missing standard evolutionary concepts, such

as (1) the ones used in systematics; (2) sister group relations; (3) standard homology

relationships, including molecular-based ones; and (4) homoplasy and evolutionary

convergence annotation. Additionally, since organism can be better understood in

ecological contexts, we also intend to associate CDAO with some environmental

ontology, such as the one provided at http://www.environmentontology.org/ has

been envisioned. Initially developed behavioral ontologies (Midford 2004) shall

also be used in association with CDAO instances to allow better description of

evolutionary features of complex organisms.

User-friendly algorithms to transform standard format of character-state data

matrices in CDAO formatted instances will soon be produced to facilitate the

representation of evolutionary data in the CDAO format. The production of these

algorithms for data representation is strikingly relevant to approximate the evolu-

tionary biologist in his lab to this new technology of data representation.

CDAO also aims to be integrated into a workflow-pipeline of evolutionary

software on which the users will be able to input their data, click in some buttons

to activate the methods they want to use and then run these methods on a grid

machine. The results shall be stored in the tree-structure part of CDAO and they

will be able to be analyzed automatically by a reasoner, returning the analyzed data

to the user for interpretation. Moreover, once CDAO stores the raw data, there is the

possibility of associating a number of different tree-topologies (and the methods

associated to them) for the same data set. Moreover, the statistical methods of re-

sampling will also be able to be annotated and values of confidence will be

associated on tree-branches. These developments however will depend on the

development of MIAPA and further submission of projects aiming to provide the

informatics structure for workflow development.

Furthermore, new digital representation of data matrices on evolutionary biology

has been produced and Ramırez et al. (2007) describes a data base of digital images

and character data matrices information that could be automatically transformed

into CDAO instances. The usage of CDAO as the standard file format for storing,

annotating and sharing evolutionary information may ease communication in the

near future, although standard algorithms and converting tools must be made

available as soon as possible.

At last, considering that we envision the possibility of getting together a number

of CDAO-annotated data sets produced by different research teams all over

the world, we strongly encourage evolutionary biology researchers to provide a


http://www.environmentontology.org/

complete and accurate description of their represented data. The possibility of

joining a number of evolutionary data produced everywhere in the globe opens a

wonderful possibility of producing complete and powerful evolutionary informa-

tion using thousands of diverse characters, from the molecular to the morphologi-

cal, behavioral and ecological. To the best of our knowledge and belief that sort of

global, heterogeneous, joined data set information could never be achieved without

making use of a standard well-defined knowledge representation vocabulary, as the

one described on CDAO.

Acknowledgments We thank NESCent for financing our group meetings and the members of the

Evolutionary Informatics team for their stimulus and support. The French ANR (EvolHHuPro:

BLAN07-1_198915) are gratefully acknowledged for financial support. Francisco Prosdocimi and

Julie D. Thompson are supported by institute funds from the Institut National de la Sante et de la

Recherche Medicale, the Centre National de la Recherche Scientifique and the Universite Louis

Pasteur de Strasbourg. Enrico Pontelli has been supported by National Science Foundation grants

HRD-0420407 and CNS-0220590. Brandon Chisham is supported by an IGERT fellowship from

NSF grant DGE-0504304. The identification of specific commercial software products in this

paper is for the purpose of specifying a protocol and does not imply a recommendation or

endorsement by the National Institute of Standards and Technology.

References

Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K,

Dwight SS, Eppig JT, Harris MA, Hill DP, Issel-Tarver L, Kasarskis A, Lewis S, Matese JC,

Richardson JE, Ringwald M, Rubin GM, Sherlock G (2000) Gene ontology: tool for the

unification of biology. The Gene Ontology Consortium. Nat Genet 25(1):25–29

Bard J, Rhee SY, Ashburner M (2005) An ontology for cell types. Genome Biol 6(2):R21

Bard JB (2005) Anatomics: the intersection of anatomy and bioinformatics. J Anat 206(1):1–16

Carnap R (1950) Empiricism, Semantics, and Ontology. Revue Internationale de Philosophie

4:20–40. On the web: http://www.ditext.com/carnap/carnap.html

Chabalier J, Mosser J, Burgun A (2007) Integrating biological pathways in disease ontologies.

Stud Health Technol Inform 129(Pt 1):791–795

Cochrane G, Akhtar R, Bonfield J, Bower L, Demiralp F, Faruque N, Gibson R, Hoad G, Hubbard T,

Hunter C, Jang M, Juhos S, Leinonen R, Leonard S, Lin Q, Lopez R, Lorenc D, McWilliam H,

Mukherjee G, Plaister S, Radhakrishnan R, Robinson S, Sobhany S, Hoopen PT, Vaughan R,

Zalunin V, Birney E (2009) Petabyte-scale innovations at the European Nucleotide Archive.

Nucleic Acids Res 37(Database issue):D19–25

Darwin CR (1859) On the origin of species by means of natural selection, or the preservation of

favoured races in the struggle for life, 1st edn. John Murray, London. On the web: http://

darwin-online.org.uk/content/frameset?itemID = F373& viewtype = text&pageseq = 1

Dobzhansky T (1973) Nothing in biology makes sense except in the light of evolution. Am Biol

Teacher 35:125–129

Dohm JC, Lottaz C, Borodina T, Himmelbauer H (2008) Substantial biases in ultra-short read data

sets from high-throughput DNA sequencing. Nucleic Acids Res 36(16):e105

Eisen JA (1998) Phylogenomics: improving functional predictions for uncharacterized genes by

evolutionary analysis. Genome Res 8:163–167

Gaudet P, Williams JG, Fey P, Chisholm RL (2008) An anatomy ontology to represent biological

knowledge in Dictyostelium discoideum. BMC Genom 18(9):130


Gene Ontology Consortium (2001) Creating the gene ontology resource: design and implementa-

tion. Genome Res 11(8):1425–1433

Gomes-Perez A, Fernando-Lopez M, Corcho O (2004) Ontological engineering: theoretical

foundations of ontologies. Springer Verlag, New York

Gruber TR (1995) Toward principles for the design of ontologies used for knowledge sharing.

International Journal of Human-Computer Studies 43(4–5):907–928

Harris MA (2008) Chapter 5: developing an ontology. In: Jonathan M, Keith (ed) Bioinformatics

data, sequence analysis and evolution. Humana Press, New York, pp 111–124

HennigW (1950) Phylogenetic systematics. (trans. D. Davis and R. Zangerl). University of Illinois

Press, Urbana 1966, reprinted 1979

Hull D (1988) Science as Process. University of Chicago Press, Chicago, IL

Jacob F (1970) La logique du vivant: une histoire de l’heredite. Gallimard, Paris

Jaiswal P, Avraham S, Ilic K, Kellogg EA, McCouch S, Pujar A, Reiser L, Rhee SY, Sachs MM,

Schaeffer M, Stein L, Stevens P, Vincent L, Ware D, Zapata F (2005) Plant ontology (PO): a

controlled vocabulary of plant structures and growth stages. CompFunct Genom6(7–8): 388–397

Kearney M (2007) Phylosophy and phylogenetics: historical and current connections. In: David L

Hull, Michael Ruse (eds) The Cambridge companion to the philosophy of biology. Cambridge

University Press, Cambridge, pp 211–232

Kuhn TS (1962) The structure of scientific revolutions, 1st edn. University of Chicago Press,

Chicago, IL, p 168

Lakatos I (1978) The methodology of scientific research programmes: philosophical papers, vol 1.

Cambridge University Press, Cambridge

Lamarck chevalier de (Jean-Baptiste Pierre Antoine de Monet) (1809) Philosophie Zoologique.

On the web: http://fr.wikisource.org/wiki/Philosophie_zoologique

Larson SD, Fong LL, Gupta A, Condit C, Bug WJ, Martone ME (2007) A formal ontology of

subcellular neuroanatomy. Front Neuroinform 1:3

Lassila O, McGuinness DL (2001) The role of frame-based representation on the semantic web.

Linkoping Electron Art Computer Inform Sci 6:5

Leebens-Mack J, Vision T, Brenner E, Bowers JE, Cannon S, Clement MJ, Cunningham CW,

dePamphilis C, deSalle R, Doyle JJ, Eisen JA, Gu X, Harshman J, Jansen RK, Kellogg EA,

Koonin EV, Mishler BD, Philippe H, Pires JC, Qiu YL, Rhee SY, Sjolander K, Soltis DE,

Soltis PS, Stevenson DW, Wall K, Warnow T, Zmasek C (2006) Taking the first steps towards

a standard for reporting on phylogenies: minimum information about a phylogenetic analysis

(MIAPA). OMICS 10(2):231–237

Lee RY, Sternberg PW (2003) Building a cell and anatomy ontology of caenorhabditis elegans.

Comp Funct Genom 4(1):121–126

Leontis NB, Altman RB, Berman HM, Brenner SE, Brown JW, Engelke DR, Harvey SC,

Holbrook SR, Jossinet F, Lewis SE, Major F, Mathews DH, Richardson JS, Williamson JR,

Westhof E (2006) The RNA Ontology Consortium: an open invitation to the RNA community.

RNA 12(4):533–541

Maglia AM, Leopold JL, Pugener LA, Gauch S (2007) An anatomical ontology for amphibians.

Pac Symp Biocomput 367–378

Marguerat S, Wilhelm BT, Bahler J (2008) Next-generation sequencing: applications beyond

genomes. Biochem Soc Trans 36(Pt 5):1091–1096

Mayr E (1991) One long argument: Charles Darwin and the genesis of modern evolutionary

thought (questions of science). Harvard University Press, Cambridge

Mayr E (2004) What makes biology unique?: considerations on the autonomy of a scientific

discipline. Cambridge University Press, Cambridge

Midford PE (2004) Ontologies for behavior. Bioinformatics 20:3700–3701

Mizoguchi R, Vanwelkenhuysen J (1995) Task ontology for reuse of problem solving knowledge.

Proceedings of KB & KS, pp 46–59

Ramırez MJ, Coddington JA, Maddison WP, Midford PE, Prendini L, Miller J, Griswold CE,

Hormiga G, Sierwald P, Scharff N, Benjamin SP, Wheeler WC (2007) Linking of


http://fr.wikisource.org/wiki/Philosophie_zoologique

digital images to phylogenetic data matrices using a morphological ontology. Syst Biol

56:283–294

Rodrigues JM, Rector A, Zanstra P, Baud R, Innes K, Rogers J, Rassinoux AM, Schulz S,

Trombert Paviot B, ten Napel H, Clavel L, van der Haring E, Mateus C (2006) An ontology

driven collaborative development for biomedical terminologies: from the French CCAM to the

Australian ICHI coding system. Stud Health Technol Inform 124:863–868

Rothberg JM, Leamon JH (2008) The development and impact of 454 sequencing. Nat Biotechnol

26(10):1117–1124

Schulz S, Marko K, Hahn U (2007) Spatial location and its relevance for terminological inferences

in bio-ontologies. BMC Bioinform 8:134

Schulz S, Stenzhorn H, Boeker M (2008) The ontology of biological taxa. Bioinformatics 24(13):

i313–321

Segerdell E, Bowes JB, Pollet N, Vize PD (2008) An ontology for Xenopus anatomy and

development. BMC Dev Biol 8:92

Shendure J, Ji H (2008) Next-generation DNA sequencing. Nat Biotechnol 26(10):1135–1145

Sirin S, Parsia B, Grau BC, Kalyanpur A, Katz Y (2007) Pellet: a practical OWL-DL reasoner.

Web Semantics 5:51–53

Smith B, Ceusters W, Klagges B, Kohler J, Kumar A, Lomax J, Mungall C, Neuhaus F, Rector AL,

Rosse C (2005) Relations in biomedical ontologies. Genome Biol 6(5):R46

Smith B, Ashburner M, Rosse C, Bard J, Bug W, Ceusters W, Goldberg LJ, Eilbeck K, Ireland A,

Mungall CJ; OBI Consortium, Leontis N, Rocca-Serra P, Ruttenberg A, Sansone SA, Scheuer-

mann RH, Shah N, Whetzel PL, Lewis S (2007) The OBO Foundry: coordinated evolution of

ontologies to support biomedical data integration. Nat Biotechnol 25(11):1251–1255

Trelease RB (2006) Anatomical reasoning in the informatics age: Principles, ontologies, and

agendas. Anat Rec B New Anat 289(2):72–84

Tsarkov D, Horrocks I (2006) FaCT + + description logic reasoner: system description. In:

Furbach U, Shankar N (eds) IJCAR 2006, LNAI 4130, pp 292–297

van Heijst G, Schreiber AT, Wielinga BJ (1997) Roles are not classes: a reply to Nicola Guarino.

Int J Hum-Comput Stud 46(2):311–318

Weissmann A (1889) Essays upon heredity, vols 1 and 2. Clarendon, Oxford. On the web: http://

www.esp.org/books/weismann/essays/facsimile/


Documents

Evolutionary Biology || Knowledge Standardization in Evolutionary Biology: The Comparative Data Analysis Ontology