
Chemical Vocabularies and Ontologies for Bioinformatics

Kirill Degtyarenko

European Bioinformatics Institute, Wellcome Trust Genome Campus,

Hinxton, Cambs CB10 1SD, United Kingdom.

Proceedings of the 2003 International Chemical Information Conference, Nîmes,

France, 19-22 October 2003.

The diversity of objects and concepts in biological chemistry can be reflected in the number

of ways used to describe an ‘elementary’ biochemical event such as an enzymatic reaction. The

terminology used in publications or biological databases is often a mixture of terms borrowed

from widely different or even contradictory classifications. The ever-growing knowledge

cannot be processed meaningfully (e.g. efficiently and correctly referenced in biological

databases) without organisation, from controlled vocabularies to dictionaries and thesauri to

taxonomies and formal ontologies. An ontology of a domain of knowledge is a controlled

vocabulary of terms with defined logical relationships to each other. The unique types of

relationships between terms have to be included in biochemical ontologies. The relevance of

chemical thesauri and ontologies to bioinformatics is illustrated by current resources and

projects at the European Bioinformatics Institute, such as IntEnz (Enzyme Nomenclature),

COMe (the bioinorganic motif database) and the IUPHAR Receptor Database.

Introduction

‘Ontology’ is a formal definition of concepts (such as entities and relationships) of a given

area of knowledge, described in a standardised form [ 1]. It can be organised as a structured

vocabulary in the form of a directed acyclic graph or a network in which each term may be a

‘child’ of one or more ‘parents’ [ 2].

Naturally, sequences form the core data of biological sequence databases such as EMBL [ 3]

and Swiss-Prot [ 4], while 3-D coordinates form the core data of structural databases such as

PDB [ 5]. The other data typically found in a database entry, such as the name of the gene or

protein, organism, or literature references, are called annotation. The quality of annotation

varies from entry to entry and from database to database.


As a rule, the annotation is present as free text. Since proteins and nucleic acids are

biochemical entities, chemical and biochemical terminology makes up a large proportion of this

free text. I will try to present some challenges and achievements in the standardisation of

chemical language in macromolecular databases.

Vocabularies and Ontologies for Biological Databases

The free text annotation is easy to read but difficult to standardise. The bioinformatics

community is spending much effort towards making it ‘less free’ with the help of controlled

vocabularies. In contrast to sequence and structural databases, the core data of vocabularies

consist of terms. One of the first controlled vocabularies for biology was the NCBI

Taxonomy [ 6]. Every natural sequence deposited to biosequence databases is supposed to

originate from an organism with a known Linnaean name:

“The NCBI Taxonomy database only contains the names for the organisms whose

sequences have been made public by the collaborating sequence database EMBL,

DDBJ, and NCBI/GenBank or by one of the other public databases that are indexed in

Entrez (including the Swiss-Prot, PIR and PRF protein sequence databases and the

PDB structure database). Currently sequence data are available for only about

100,000 of the about 2–10 million species supposed to exist on earth” [ 6]

Taxonomy is a strict hierarchy of parent–child relationships known as IsA (‘is a kind of’). For

example, the abbreviated NCBI taxonomy of Homo sapiens can be represented as

root
 %Eukaryota
  %Metazoa
   %Chordata
    %Craniata
     %Vertebrata
      %Euteleostomi
       %Mammalia
        %Eutheria
         %Primates
          %Catarrhini
           %Hominidae
            %Homo
             %Homo sapiens

The Gene Ontology Consortium [ 2] develops controlled vocabularies which are in wide use

by the bioinformatics community. The Gene Ontology (GO) comprises three domains:



‘molecular function’, ‘biological process’ and ‘cellular component’. In every domain,

the GO terms are organised as a directed acyclic graph (DAG), which differs from a taxonomy

in that a child term can have many parent terms. GO uses two generic parent–child

relationships, IsA and IsPartOf. Throughout the text, I will use the symbols for these

relationships as specified in the GO File Format Guide [ 7]:

% = IsA

< = IsPartOf

GO is a part of the wider initiative known as OBO (Open Biology Ontologies). A list of

freely available ontologies that are relevant to genomics and proteomics and are structured

similarly to GO can be found at the OBO website [ 8].
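As a rough sketch of how an indented listing written with the two relationship symbols can be read by a program, the Python fragment below turns each line into a (child, relationship, parent) triple. It assumes a simplified layout (one term per line, one extra leading space per level, a single relationship symbol per term) and does not reproduce the full GO file format; the function name parse_flat and the sample terms are illustrative only.

```python
RELATIONS = {"%": "IsA", "<": "IsPartOf"}

def parse_flat(text):
    """Return (child, relation, parent) triples from an indented listing."""
    triples = []
    stack = []  # stack[d] holds the most recent term seen at depth d
    for line in text.splitlines():
        if not line.strip():
            continue
        depth = len(line) - len(line.lstrip(" "))
        body = line.strip()
        symbol, name = body[0], body[1:].strip()
        if symbol not in RELATIONS:
            # a top-level line written without a relationship symbol
            symbol, name = None, body
        del stack[depth:]  # forget terms that are deeper than this line
        if stack and symbol:
            triples.append((name, RELATIONS[symbol], stack[-1]))
        stack.append(name)
    return triples

# Hypothetical sample listing, used only to exercise the parser.
sample = "%atom\n <electron\n <nucleus\n  <proton\n  <neutron\n %atomic ion\n"
triples = parse_flat(sample)
for triple in triples:
    print(triple)
```

The parent of a term is simply the nearest preceding term with smaller indentation, which is why a stack indexed by depth is sufficient here.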

Vocabularies and Ontologies for Biochemical Compounds

What are the ‘biochemical compounds’?

Any chemical compound naturally occurring in living organisms can be called a

‘biochemical compound’. Biochemical compounds can be classified according to their

structure, physico-chemical properties or biological function. Most biologists conveniently

divide all biochemical compounds into ‘biopolymers’, which consist of macromolecules, and

the ‘other compounds’, which consist of ‘small’ molecules (see, for instance, BioCyc

Taxonomy of Compounds [ 9]). Alternatively, biochemical compounds can be defined as

consisting of “molecules not directly encoded by the genome (thus excluding nucleic acids,

proteins and peptides derived from proteins by cleavage), that are either the products of

nature or are synthetic products used (either purposively or accidentally) to intervene in the

processes of living organisms” [M. Ashburner]. This second definition reflects the ‘traditional’

bioinformatics view in the sense that information-rich macromolecules live in their databases

(EMBL, Swiss-Prot) relatively independently from all the other molecules, whether small or

large. As we will see, these two worlds significantly overlap.

Structure

The intuitive classification of all molecules relevant to biochemistry into macromolecules

and ‘small’ molecules is not as straightforward as it looks. E.g. globular proteins and tRNAs

are very compact, discrete molecules. In eukaryotes, genomic DNA exists as a

supramolecular DNA–protein complex. A gene is a part of a large DNA molecule, while the


corresponding mRNA is a discrete molecule. Pyrroloquinoline quinone (coenzyme PQQ), a

typical ‘small’ molecule, is synthesised in vivo from a 24-amino-acid polypeptide precursor [ 10].

Polypeptides and nucleic acids are often referred to as ‘biopolymers’. It has to be noted that at

least in two aspects they are fundamentally different from other biopolymers (such as

polysaccharides, isoprenoids, lignins) as well as other natural and synthetic polymers. First,

while most polymers consist of chains of variable length, it is perfectly possible to purify a

chemically homogeneous protein or nucleic acid. This feature makes it possible to store the

amino acid and nucleotide sequences as ultimate identifiers in the databases. Second, while most

polymers consist of low complexity (low information content) macromolecules, natural

proteins and nucleic acids are, as a rule, high complexity (high information content)

macromolecules. (Genes often occupy only a small fraction of eukaryotic DNA while the rest

is low complexity non-coding regions.) This feature makes it necessary to store the amino

acid and nucleotide sequences in the databases. In contrast, for most polymers it is sufficient to

store a constitutional unit [ 11].

Bioinformatics routinely deals with one-dimensional (1-D) objects like protein and nucleic

acid sequences, and three-dimensional (3-D) objects such as crystal and solution structures in

PDB. Note that here we talk about mathematical objects, not real proteins or nucleic acids

which exist in three dimensions. For example, almost everybody is ‘familiar’ with the double

helix model of DNA. Nucleotides of one strand are supposed to form Watson-Crick pairs

with nucleotides of the complementary strand, so the sequence of only one strand is stored in

the databases. Since sequence is a very convenient way of representing a biomacromolecule,

it is easy to forget that other dimensionalities exist at all!
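Because the pairing rules fix the second strand, it can be recomputed on demand rather than stored. A minimal sketch, assuming plain upper-case A, C, G and T input with no ambiguity codes (the function name is illustrative):

```python
# Watson-Crick pairing: A with T, C with G.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the complementary strand, written in its own 5'->3' direction."""
    return seq.translate(COMPLEMENT)[::-1]

stored = "ATGCCG"
derived = reverse_complement(stored)
print(stored, derived)
```

Applying the function twice returns the stored strand, which is one way to see that no information is lost by keeping only one of the two.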

By definition, a macromolecule consists of many monomers. In 1-D representation, each

monomer is ‘collapsed’ to a symbol (dimensionality is close to 0). Dimensionality of cyclic

DNA, such as bacterial chromosomes and plasmids, genomes of mitochondria and

chloroplasts, is (negligibly) less than 1. In the biological databases, the corresponding

sequences are stored as 1-D with an arbitrarily set start nucleotide (often it is the DNA

replication origin site). In case of cyclic polypeptides such as cyclotides [ 12], the

dimensionality is less than 1 but the precursor polypeptides are linear, so the starting amino

acid residue is not arbitrarily chosen. Dimensionality of branched macromolecules such as

starch is more than 1. Similarly, dimensionality of a protein containing covalent links



between two polypeptides (each 1-D) also should be considered >1, while macromolecules

such as lignin form networks and require at least 2-D representation.
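One practical consequence of storing a cyclic molecule as a linear string with an arbitrary start is that two database entries can describe the same circle and still differ as strings. The sketch below shows one way to compare them; it assumes both strings are read in the same direction along the same strand, and both function names are hypothetical.

```python
def same_circular_sequence(a, b):
    """True if b is a rotation of a, i.e. the same circle with another start."""
    return len(a) == len(b) and b in a + a

def canonical_rotation(seq):
    """Pick the lexicographically smallest rotation as a start-independent key."""
    if not seq:
        return seq
    return min(seq[i:] + seq[:i] for i in range(len(seq)))

entry_1 = "GATTACA"
entry_2 = "TACAGAT"  # the same circle, opened at a different position
print(same_circular_sequence(entry_1, entry_2))
print(canonical_rotation(entry_1) == canonical_rotation(entry_2))
```

The canonical rotation is quadratic in the sequence length as written, so it is meant for illustration rather than for whole chromosomes.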

Although ‘small molecules’ appear to be less complex entities than macromolecules, their

naming, citation and representation in databases is not a trivial task. Most genetically

encoded biomacromolecules are easily represented as 1-D strings, while the 2-D sketch

remains the most adequate portrait of a ‘small molecule’. Several algorithms of linear

notation have been developed, e.g. SMILES [ 13]. However, linear notation, as any other

structural core data, cannot be really used in speech (and should not be used in free text). The

good annotation practice for biological databases is to use either consistent and widely

recognised terminology or unique identifiers (to look up the molecule of interest from a

dedicated database). Ideally, scientists should use terminology that is both pronounceable

and meaningful. IUPAC systematic names are meaningful but often not pronounceable.

Therefore, biologists and chemists alike prefer to use common names. The use of common

names is not a problem as long as there is no confusion regarding the exact meaning of a

term. Thus, a viable solution for the bioinformatician is to use a definitive controlled

vocabulary of biochemical compounds, which contains both systematic and common names.
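To make the idea concrete, a vocabulary entry of this kind can be pictured as a record that ties one identifier to all of its names. The sketch below is only an illustration: the record layout, the field names and the build_index helper are assumptions, and the single entry (benzyl alcohol, with its systematic name phenylmethanol and the KEGG COMPOUND identifier C00556) is there purely as sample data.

```python
from dataclasses import dataclass, field

@dataclass
class CompoundEntry:
    identifier: str
    common_name: str
    systematic_name: str
    synonyms: list = field(default_factory=list)
    smiles: str = ""  # linear notation kept as core data, not as a name

    def names(self):
        return [self.common_name, self.systematic_name, *self.synonyms]

def build_index(entries):
    """Map every lower-cased name to the identifier of its entry."""
    index = {}
    for entry in entries:
        for name in entry.names():
            index[name.lower()] = entry.identifier
    return index

entries = [
    CompoundEntry("C00556", "benzyl alcohol", "phenylmethanol",
                  synonyms=["benzenemethanol"], smiles="OCc1ccccc1"),
]
index = build_index(entries)
print(index["phenylmethanol"])
```

Whichever name an annotator prefers, the lookup resolves to the same identifier, so common names stay usable without becoming ambiguous.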

Interestingly, although Enzyme Nomenclature [ 14] contains terminology for enzymatic

reactions approved by the Nomenclature Committee of the International Union of Biochemistry

and Molecular Biology (NC-IUBMB), no definitive terminology for the very compounds

involved in these reactions was published by NC-IUBMB, with the Glossary of Chemical

Names being the only exception [ 15]. Whatever the reason, the nontrivial task to derive these

terms from Enzyme Nomenclature was left to others!

There are several (bio)chemical compound databases in the public domain. COMPOUND is a

part of the LIGAND database [ 16]. COMPOUND includes all the compounds (i.e. substrates,

products, cofactors, inhibitors and activators) derived from Enzyme Nomenclature. Every

COMPOUND entry minimally has a unique identifier and a name, while many ‘small

molecule’ entries also have 2-D diagrams. The compounds are fairly heterogeneous:


COMPOUND

1. ‘Small molecule’

• individual compound, e.g. C00556 Benzyl alcohol

• a class of compounds, e.g. C00069 Alcohol

2. Macromolecule

• As a whole individual molecule, e.g. C02396 Cytochrome b-562

• As a class of compounds, e.g. C00420 Polysaccharide

• As a site, e.g.

C02959 Apurinic site in DNA

C04764 C-terminal glycine residue of the polypeptide ubiquitin

3. A class of ‘molecules’ classified by chemical function

• C00030 Reduced acceptor

• C11349 Amino group donor

Glossary of Chemical Names [ 15] originally appeared in Enzyme Nomenclature [ 14].

Additional Glossary entries have been added from subsequent Supplements. In contrast to

COMPOUND, Glossary represents a small subset of (less common) chemical terms.

Contents-wise, it is also very heterogeneous as examples show:

IUBMB Glossary of Chemical Names

1. Thesaurus (gives a broader term):

12-dehydrotetracycline = an antibiotic

2. Common–(semi)systematic (bilingual) dictionary:

cis-aconitate = (Z)-prop-1-ene-1,2,3-tricarboxylate

3. Common–formula (bilingual) dictionary:

0-D: superoxide = O2•-

1-D: amastatin = Leu[1ψ2,CHOHCONH]ValValAsp

2-D: quinine = an alkaloid (structure)

4. Common name dictionary (with definitions):

fusarinine C = a cyclic trihydroxamic acid formed by

esterification of 3 molecules of fusarinine



NIST Chemistry WebBook [ 17], developed at the National Institute of Standards and

Technology, contains data on small organic and some inorganic compounds. Apart from

name(s), formulae, CAS registry numbers, and structure, NIST Chemistry WebBook

contains additional data on physico-chemical properties of the species.

All the resources mentioned so far have no hierarchical structure in terms of searching for

classes of compounds, e.g. ‘alcohol’. In COMPOUND, the terms for compound classes are

available but there are no links to individual compounds from that class. However, for most

biochemical compounds, whether ‘small’ or macromolecules, the structural classification

can be based on existing chemical nomenclature systems, such as organic, inorganic and

macromolecular, or even created automatically on the basis of substructure search.

The promisingly named Chemical Ontology, developed by M. Ashburner and P. Jaiswal [ 18],

is an interesting prototype. Structurally it is similar to GO and organised as a DAG with

IsA the only kind of relationship used. Data sources include the chemical names as currently

used in GO and the external sources such as BioCyc [ 9], COMPOUND, ENZYME [ 19] and

UM-BBD [ 20].

All terms for compounds are classified according to either chemical nature

(grouped_by_chemistry) or biological function (grouped_by_functions), or yet to be

classified (unclassifieds). A classified compound may belong to more than one structural

and more than one functional class, therefore the different classification approaches may be

reconciled. An alternative ontology for molecular matter was suggested by the author in an

e-mail exchange with Michael Ashburner:


%molecular matter
 %grouped_by_state_of_matter
  %plasma
  %gas
  %liquid
  %solid
  %heterogeneous mixture
 %grouped_by_composition
  %compound ; synonym:chemical substance
   <formula unit
   <molecular entity
    %atom
     <electron
     <nucleus
      <proton
      <neutron
     %element
     %atomic ion
     %atomic radical
    %molecule
     <group
     %molecular ion
     %molecular radical
     %crystal molecule
      %ionic crystal molecule
      %metallic crystal molecule
     %covalent molecule
      %discrete covalent molecule
      %coordination molecule
      %giant covalent molecule
    %ion
     %atomic ion
     %molecular ion
    %radical
     %atomic radical
     %molecular radical
  %mixture
   <compound
   %heterogeneous mixture
    %colloidal suspension
     %liquid aerosol
     %solid aerosol
     %foam
     %emulsion
     %sol
     %solid foam
     %gel
     %solid sol
   %homogeneous mixture
    %solution
     <solute
     <solvent
     %solid solution



However, the relationships between chemical entities go beyond IsA. Importantly, the

distinction has to be made between groups and molecules. For example, a constitutional unit

(i.e. a group) IsPartOf a macromolecule but a monomer molecule is not, although it may be

viewed as a precursor of a macromolecule. In organic chemistry, it is usual to consider a

‘parent hydride’ of a specific compound for naming purposes [ 21]. This parent hydride,

being a specific compound itself, can be considered to be both a member and a parent of the

class of compounds.

Physico-chemical methods and properties

It is not just chemical compound terminology that lacks standardisation in biological

databases. For example, Swiss-Prot entries include literature citations dealing with

characterisation of proteins, where the RP field may include the methods used. However, the

terminology is not consistent, as the example for circular dichroism shows:

CD STUDIES
CIRCULAR DICHROISM
CIRCULAR DICHROISM ANALYSIS
CIRCULAR DICHROISM SPECTROSCOPY
MAGNETIC CIRCULAR DICHROISM
STRUCTURE BY CIRCULAR DICHROISM

The information that “magnetic circular dichroism” is a variation of a more general method

(IsA “circular dichroism”) and “CD” is an abbreviation (i.e. synonym) of “circular

dichroism” is just not there.
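The missing information can be supplied by a small controlled vocabulary. The following sketch maps the six free-text variants onto preferred terms and records the one IsA link between them. The tables, the list of filler words and the function names are all assumptions made for illustration, and the sketch deliberately treats ‘circular dichroism spectroscopy’ and ‘circular dichroism’ as the same method.

```python
PREFERRED = {
    "cd": "circular dichroism",  # abbreviation treated as a synonym
    "circular dichroism": "circular dichroism",
    "magnetic circular dichroism": "magnetic circular dichroism",
}
IS_A = {"magnetic circular dichroism": "circular dichroism"}
FILLER = ("studies", "analysis", "spectroscopy", "structure by")

def normalise(annotation):
    """Strip filler words and return the preferred term, or None if unknown."""
    text = annotation.lower()
    for word in FILLER:
        text = text.replace(word, " ")
    return PREFERRED.get(" ".join(text.split()))

def is_kind_of(term, ancestor):
    """Follow IsA links upwards from term until ancestor is found."""
    while term is not None:
        if term == ancestor:
            return True
        term = IS_A.get(term)
    return False

raw = ["CD STUDIES", "CIRCULAR DICHROISM", "CIRCULAR DICHROISM ANALYSIS",
       "CIRCULAR DICHROISM SPECTROSCOPY", "MAGNETIC CIRCULAR DICHROISM",
       "STRUCTURE BY CIRCULAR DICHROISM"]
terms = [normalise(r) for r in raw]
print(terms)
```

With the IsA link in place, a search for the general method can also retrieve entries annotated with the magnetic variant.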

The alpha-release of FIX (physico-chemical ontology for biology) is available at the OBO

website [ 22]. FIX includes two components: physico-chemical property and

physico-chemical method. In addition to IsA and IsPartOf, the relationship called

inferred_by between ‘property’ and ‘method’ entities is introduced. Of course

inferred_by can be considered merely a shortcut, for methods usually do not yield any

properties directly. Instead, one can design much more complex ontologies, e.g.

method (e.g. circular dichroism spectroscopy) based_on phenomenon (e.g.

circular dichroism) applied_to object (e.g. protein) yields data (e.g.

spectrum) contains feature (e.g. peak) corresponding_to (value_of)

property (e.g. “30% of alpha-helix”)


Currently, no verbal definitions are provided for FIX terms. However, the method can be

defined already via its place in the ontology. For instance, “electron-nuclear double resonance

spectroscopy” (ENDOR; FIX:0000024) IsA “combined electron and nuclear magnetic

resonance spectroscopy” (FIX:0000165) which, in turn, IsA both “nuclear magnetic

resonance spectroscopy” (NMR; FIX:0000022) and “electron paramagnetic resonance

spectroscopy” (EPR; FIX:0000023).
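Read as a graph, this example shows why a DAG is needed: one term has two parents. The sketch below stores only the four FIX terms quoted above and collects every ancestor of a term by following IsA links. Any further parents that NMR and EPR have in the full ontology are left out, and the function name is illustrative.

```python
# child -> list of IsA parents, restricted to the four terms quoted in the text
IS_A = {
    "FIX:0000024": ["FIX:0000165"],                 # ENDOR
    "FIX:0000165": ["FIX:0000022", "FIX:0000023"],  # combined electron and nuclear MR
    "FIX:0000022": [],                              # NMR
    "FIX:0000023": [],                              # EPR
}

def ancestors(term, is_a=IS_A):
    """Return every term reachable from `term` through IsA links."""
    seen = set()
    todo = list(is_a.get(term, []))
    while todo:
        parent = todo.pop()
        if parent not in seen:
            seen.add(parent)
            todo.extend(is_a.get(parent, []))
    return seen

endor_ancestors = ancestors("FIX:0000024")
print(sorted(endor_ancestors))
```

An entry annotated with ENDOR is therefore found by a query for either NMR or EPR, which a strict single-parent hierarchy could not express.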

FIX terms for physico-chemical properties may be used for annotation of both molecules (at

molecular level) and compounds (at molar level).

Biological function

If physico-chemical properties are not easily derived from chemical structure, then what

about biochemical properties? Biological function is not an immanent feature of the molecule

but a result of a specific interaction, e.g. with proteins or nucleic acids. The same compound

will behave remarkably differently in different organisms, or different cells, or different

metabolic pathways within the same cell. Therefore, the host of possible functions and

relationships one can think of, e.g. ‘a precursor of’, ‘an inhibitor of’, are only meaningful

when the biological context is specified.

Since all ‘functions’ of a molecule could be broadly divided into structural (to form a part of)

and chemical (participate in a reaction), they could be formed as cross-products of chemical

ontology with corresponding components of biological ontologies. At least, it seems that an

extension of Gene Ontology towards ‘small molecules’ is only logical: it is not only gene

products that can have molecular function, participate in biological process and form part

of cellular component! Of course, other ontologies (e.g. for toxicology or ecology) will

result in a different set of biological functions.



Vocabularies and Ontologies for Biochemical Reactions

Ontology of biochemical reactions

%biochemical reaction
 %binding reaction
 %biotransformation reaction
  %non-catalytic reaction
   %photoinduced reaction
   %spontaneous reaction
  %catalytic reaction
   %enzymatic reaction
   %deoxyribozymatic reaction
   %ribozymatic reaction
   %abzymatic reaction
   %intramolecular catalysis reaction
 %conformation change reaction
 %molecular transport reaction
 %electron transfer reaction
 %excitation-energy transfer reaction

Enzymatic reactions

The Enzyme Nomenclature [ 14], published by NC-IUBMB, provides the oldest controlled

vocabulary for biochemical function. Not surprisingly, EC numbers are often (mis)used for

annotation of gene products. It is important to remember that the basis of the Enzyme

Nomenclature is the overall reaction catalysed [ 23] (cf. overall transformation classification

in organic chemistry [ 24]), but not reaction mechanism or any other specific property of an

enzyme. Nevertheless, other biological catalysts such as ribozymes, deoxyribozymes or

catalytic antibodies (abzymes) do not form a part of Enzyme Nomenclature.

EC numbers form a strict hierarchy of IsA relationships. That means, any one EC number

belongs to one and only one sub-subclass, which belongs to one and only one subclass, which

belongs to one and only one class. Historically, the EC number served as both a unique

identifier (ID) and a descriptor of the enzyme’s place in the hierarchy. This dual function of EC

numbers is fairly limiting because it requires every enzyme to occupy a single place in the hierarchy.
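The dual role is easy to see in code: the position in the hierarchy can be read directly off the identifier. A minimal sketch, assuming a complete four-field numeric EC number (the function name is illustrative):

```python
def ec_lineage(ec):
    """Return class, subclass, sub-subclass and entry for an EC number."""
    parts = ec.replace("EC", "").strip().split(".")
    if len(parts) != 4 or not all(p.isdigit() for p in parts):
        raise ValueError(f"not a complete EC number: {ec!r}")
    return ["EC " + ".".join(parts[:i]) for i in range(1, 5)]

lineage = ec_lineage("EC 1.1.1.2")
print(lineage)
```

Since each parent is just a prefix of the identifier, moving an enzyme elsewhere in the hierarchy necessarily changes its identifier, and a second parent cannot be expressed at all.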


Subclasses in EC 1

%EC 1 oxidoreductases
 %EC 1.1 acting on the CH-OH group of donors
  %EC 1.1.1 with NAD+ or NADP+ as acceptor
  %EC 1.1.2 with a cytochrome as acceptor
  %EC 1.1.3 with oxygen as acceptor
  %EC 1.1.4 with a disulfide as acceptor
  %EC 1.1.5 with a quinone or similar compound as acceptor
  %EC 1.1.99 with other acceptors
 %EC 1.2 acting on the aldehyde or oxo group of donors
  %EC 1.2.1 with NAD+ or NADP+ as acceptor
  ...
 %EC 1.3 acting on the CH-CH group of donors
 %EC 1.4 acting on the CH-NH2 group of donors
 %EC 1.5 acting on the CH-NH group of donors
 %EC 1.6 acting on NADH or NADPH
 %EC 1.7 acting on other nitrogenous compounds as donors
 %EC 1.8 acting on a sulfur group of donors
 %EC 1.9 acting on a heme group of donors
 ...

However, an enzyme can be correctly classified in more than one way. E.g. Intramolecular

Oxidoreductases (EC 5.3) are as much oxidoreductases (EC 1) as isomerases (EC 5). In every

subclass of oxidoreductases, the acceptors form repeating series, therefore the alternative

grouping is feasible, e.g. EC 1.1.1, EC 1.2.1, … EC 1.18.1 can be classified in an ‘EC 1.x.1’

subclass of oxidoreductases with NAD+ or NADP+ as acceptor:

Alternative subclasses in EC 1

%EC 1 oxidoreductases
 %EC 1.x.1 with NAD+ or NADP+ as acceptor
  %EC 1.1.1 acting on the CH-OH group of donors
  %EC 1.2.1 acting on the aldehyde or oxo group of donors
  %EC 1.3.1 acting on the CH-CH group of donors
  %EC 1.4.1 acting on the CH-NH2 group of donors
  ...
  %EC 1.18.1 acting on iron-sulfur proteins as donors
 %EC 1.x.2 with a heme protein as acceptor
 %EC 1.x.3 with oxygen as acceptor
 %EC 1.x.4 with a disulfide as acceptor
 %EC 1.x.5 with a quinone as acceptor
 %EC 1.x.7 with an iron–sulfur protein as acceptor
 %EC 1.x.6 with a nitrogenous group as acceptor
 %EC 1.x.8 with a flavin as acceptor
 %EC 1.x.99 with other acceptors
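The two groupings are simply two ways of indexing the same sub-subclass codes, by their second field (donor) or by their third field (acceptor). The sketch below illustrates this for a handful of EC 1 codes; the ‘EC 1.x.n’ keys follow the notation used above, while the function name and the choice of sample codes are assumptions.

```python
from collections import defaultdict

SUB_SUBCLASSES = ["1.1.1", "1.1.2", "1.1.3", "1.2.1", "1.3.1", "1.4.1", "1.18.1"]

def group_sub_subclasses(codes):
    """Index sub-subclass codes both by donor and by acceptor."""
    by_donor = defaultdict(list)
    by_acceptor = defaultdict(list)
    for code in codes:
        cls, donor, acceptor = code.split(".")
        by_donor[f"EC {cls}.{donor}"].append(code)
        by_acceptor[f"EC {cls}.x.{acceptor}"].append(code)
    return dict(by_donor), dict(by_acceptor)

by_donor, by_acceptor = group_sub_subclasses(SUB_SUBCLASSES)
print(by_donor["EC 1.1"])
print(by_acceptor["EC 1.x.1"])
```

In a DAG both parents can be kept at once, so neither view has to be privileged by the identifier.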



The limit of four levels does not allow additional hierarchical IsA relationships which

otherwise can be introduced on the basis of natural hierarchy of chemical compound classes,

for example:

%EC 1.1.1.2 alcohol dehydrogenase (NADP+)
 %EC 1.1.1.91 aryl-alcohol dehydrogenase (NADP+)
  %EC 1.1.1.97 3-hydroxybenzyl-alcohol dehydrogenase

The extension of Enzyme Nomenclature beyond the traditional six classes of overall

transformations will include e.g. reactions affecting non-covalent bonds and transport

phenomena [ 25]. Further modification of Enzyme Nomenclature is required to accommodate

reaction mechanisms and enable multiple ancestry for enzymatic reactions. Classification of

reaction mechanisms consists of two orthogonal components: (i) fundamental reaction

mechanism classes, and (ii) catalytic mechanism classes. The catalytic mechanism, substrate

and allosteric effector specificities are examples of orthogonal features relevant to the

enzyme structure and can be inherited independently.

Reversibility

The biochemical reactions appear in most of Enzyme Nomenclature entries as if they were

reversible. (In the other entries, the verbal description of the reaction often does convey the

direction, e.g. EC 3.1.6.7 “Hydrolysis of the 2- and 3-sulfate groups of the polysulfates of

cellulose and charonin”.) This is in contrast with both experimental evidence (it is difficult to

make a peptidase synthesise peptide bonds) and with the higher levels of Enzyme

Nomenclature itself. Both class names (EC 3, Hydrolases; EC 6, Ligases) and subclass names

(e.g. EC 6.4 “Forming Carbon–Carbon Bonds”) imply the direction of the reaction. This

poses little problem for irreversible reactions or when the reaction can be catalysed by the

same enzyme in both directions. However, under physiological conditions (far from

equilibrium) the opposite reactions are often catalysed by different enzymes which are

nevertheless given the same EC number!

Succinate dehydrogenase (EC 1.3.5.1): succinate + Q → fumarate + QH2

Fumarate reductase (EC 1.3.5.1): fumarate + QH2 → succinate + Q
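One way to keep the two activities apart is to store the direction explicitly alongside the EC number. The sketch below is an illustration under that assumption; the Reaction record and its field names are hypothetical and are not part of Enzyme Nomenclature.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Reaction:
    ec: str
    substrates: frozenset
    products: frozenset

    def reversed(self):
        """The same overall transformation written in the opposite direction."""
        return Reaction(self.ec, self.products, self.substrates)

succinate_dehydrogenase = Reaction(
    "1.3.5.1", frozenset({"succinate", "Q"}), frozenset({"fumarate", "QH2"}))
fumarate_reductase = Reaction(
    "1.3.5.1", frozenset({"fumarate", "QH2"}), frozenset({"succinate", "Q"}))

print(succinate_dehydrogenase.ec == fumarate_reductase.ec)
print(succinate_dehydrogenase == fumarate_reductase)
```

The EC numbers compare equal while the directed records do not, which is exactly the distinction that the EC number alone cannot carry.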

Electron transfer and excitation-energy transfer reactions

The term “pure electron transferase” was originally introduced as a name for a class of

flavoproteins (exemplified by flavodoxins) where the flavin is reduced and reoxidised in


one-electron steps [ 26, 27]. The meaning can be naturally extended to cover all the proteins

that catalyse electron transfer reactions only, such as cytochromes, ferredoxins and

cupredoxins. Although proteins involved in electron transfer are usually classified as

oxidoreductases, none of the ‘pure electron transferases’ is assigned an EC number.

Similarly, the excitation-energy transfer processes as in the antenna systems of

photosynthetic organisms are not covered by Enzyme Nomenclature.

Analogous to ‘traditionally understood’ metabolic pathways that consist of separate

enzymatic reactions, electron/exciton transfer reactions form electron/exciton transfer

pathways that form an integral part of metabolic pathways.

Transmembrane transport

A great number of fundamental biochemical reactions can be represented as

X(compartment A) → X(compartment B)

Most of these are not spontaneous and thus have to be facilitated by specific carriers or

transporters [ 28, 29]. Importantly, a distinct class (Energases) has been proposed to cover the

enzymes that catalyse the conversion of chemical energy into mechanical energy [ 25]. Energases include

primary active transporters (directly utilising covalent bond energy to transport solutes

against a concentration gradient) and rotational molecular motors such as ATP synthase.

However, there is no reason to deny the other transporters their place in the enzyme

classification. Some electron transferases are also transmembrane transporters (TC 5 in [ 28]).

Non-enzymatic reactions

Finally, non-enzymatic biochemical reactions occur in vivo. In addition to other naturally

occurring ‘zymes’, the examples include Fenton chemistry [ 30], photoinduced

transformation of ergosterol to previtamin D3 and its subsequent thermal isomerisation to

vitamin D3 [ 31], etc. Again, these reactions form part of metabolic pathways.

Yet other reactions

Some biochemical reactions are neither catalytic nor spontaneous. For example, the comment for

methylated-DNA—[protein]-cysteine S-methyltransferase (EC 2.1.1.63) reads: “Since the

acceptor protein is the ‘enzyme’ itself and the S-methyl-L-cysteine derivative formed is

relatively stable, the reaction is not catalytic.” The reaction proceeds through suicidal alkyl



transfer from guanine O6 to the cysteine residue of the enzyme; therefore the enzyme should

be present in stoichiometric, not catalytic, amounts. The reaction catalysed by EC 2.1.1.63 fits

the definition of intramolecular catalysis [ 32]. On the one hand, the intramolecular catalyst is

a kind of catalyst, since “the catalyst is both a reactant and product of the reaction” [ 33]. But

if the direct result of the reaction is an inactivated catalyst, it makes the whole process

non-catalytic (according to the Gold Book).

The term “autocatalytic reaction” is often used in a meaning not consistent with the Gold

Book definition [ 34], e.g., “autocatalytic quinone-methide mechanism of protein

flavinylation” [ 35] or “autocatalytic formation of a thioether cross-link between the

active-site residues” in galactose oxidase [ 36]. These are in fact intramolecular catalysis

events.

(Bio)chemical Resources at the European Bioinformatics Institute

IntEnz

At the EBI, enzyme classification is collected in the Integrated relational Enzyme database

(IntEnz) [ 37], a joint project with Trinity College Dublin (TCD), the Swiss Institute of

Bioinformatics (SIB) and the University of Cologne, supported by the NC-IUBMB.

Currently, IntEnz contains enzyme data curated and approved by the members of

NC-IUBMB. The goal is to create a single relational database containing all the relevant

enzyme data, including those from the ENZYME [ 19] and BRENDA [ 38] databases.

chemPDB

The chemPDB service [ 39] provides access to the ligands and small molecule dictionary of

the Macromolecular Structure Database (MSD) developed at the EBI [ 40]. chemPDB is

described as “consistent and enriched library of ligands, small molecules and monomers that

are referred as residues and ‘HET groups’ in any PDB entry”. Each entry includes the

standard three-letter code, one or more molecule names, RCSB classification of molecules,

formula, stereo and non-stereo SMILES, fingerprint, 2-D diagram, idealised 3-D coordinates

(including calculated hydrogen atom positions) or 3-D coordinates from a pre-selected PDB

entry. In addition, many entries contain automatically generated IUPAC systematic names.

The search facility provides functionality for queries based on chemical equivalence,

similarity, substructure and superstructure.


RESID

The RESID Database of Protein Modifications is created and supported by John Garavelli [ 41].

The RESID Database is a comprehensive collection of annotations and structures for

protein pre-, co- and post-translational modifications including amino-terminal,

carboxyl-terminal and peptide chain cross-links. RESID includes: systematic and alternate

names, atomic formulae and masses, enzyme activities generating the modifications (with

corresponding cross-references to GO), keywords, literature citations, protein sequence

database feature table annotations, 2-D structure diagrams and 3-D molecular models.

Release 34.01 (15 August 2003) contains 339 entries.

IUPHAR Receptor Database

The IUPHAR Receptor Database [ 42] is created at the EBI and is edited by the members of the

International Union of Pharmacology Committee on Receptor Nomenclature and Drug

Classification (NC-IUPHAR). It is implemented as a relational database containing official

NC-IUPHAR recommendations for receptor nomenclature and classification [ 43]. Future

developments will aim to expand the current database to include non-sensory G

protein-coupled receptors, nuclear receptors, ligand-gated ion channels, and voltage-gated

ion channels [ 44]. Although receptors and ion channels are macromolecular structures, their

classification according to ligand provides a basis for a reciprocal NC-IUPHAR classification

(ontology) of ligands according to their receptors. This classification is orthogonal to

biochemical ontologies mentioned before. It has to be noted that compounds of

pharmacological interest, apart from ‘small compounds’, include polypeptide-derived

hormones, toxins and other polypeptides with known pharmacological effects, which can be

cross-referenced to protein sequence databases.



COMe

COMe (Co-Ordination of Metals, etc.) represents the ontology for bioinorganic centres in

complex proteins [ 45]. COMe consists of three types of entities: ‘bioinorganic motif’ (BIM),

‘molecule’ (MOL), and ‘complex proteins’ (PRX); each entity is assigned a unique

identifier. A BIM consists of at least one centre (metal atom, inorganic cluster, organic

molecule) and two or more endogenous and/or exogenous ligands. BIMs are represented as

one-dimensional (1-D) strings and 2-D diagrams. The MOL entity represents a ‘small molecule’

which, in complex with polypeptide(s), forms a functional protein. The PRX entity refers to

the functional protein as well as separate protein domains and subunits. The main groups of

complex proteins in COMe are (i) metalloproteins, (ii) organic prosthetic group proteins and

(iii) modified amino acid proteins. In addition to IsA and IsPartOf relationships, the

IsBoundTo relationship is introduced. It can occur only between MOL (child) and PRX

(parent). It is used because a molecule which IsBoundTo a protein can be changed

chemically and, strictly speaking, become a different entity. The data are currently stored in

both XML format and a relational database and are available via the Web [ 46].
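
The restriction on the IsBoundTo relationship can be sketched as a small data model. The Python fragment below is a minimal illustration under stated assumptions: the class names `Entity` and `Ontology`, the method `add_relation` and the identifiers used are hypothetical and do not reflect the actual COMe schema or accession format.

```python
from dataclasses import dataclass, field

# Entity and relationship types as named in the text.
ENTITY_TYPES = {"BIM", "MOL", "PRX"}
RELATION_TYPES = {"IsA", "IsPartOf", "IsBoundTo"}


@dataclass(frozen=True)
class Entity:
    identifier: str  # hypothetical identifier, not a real COMe accession
    kind: str        # "BIM", "MOL" or "PRX"
    name: str

    def __post_init__(self):
        if self.kind not in ENTITY_TYPES:
            raise ValueError(f"unknown entity type: {self.kind!r}")


@dataclass
class Ontology:
    # Each relation is stored as a (child id, relation, parent id) triple.
    relations: list = field(default_factory=list)

    def add_relation(self, child, relation, parent):
        if relation not in RELATION_TYPES:
            raise ValueError(f"unknown relationship: {relation!r}")
        # IsBoundTo may only link a MOL (child) to a PRX (parent).
        if relation == "IsBoundTo" and (child.kind, parent.kind) != ("MOL", "PRX"):
            raise ValueError("IsBoundTo requires a MOL child and a PRX parent")
        self.relations.append((child.identifier, relation, parent.identifier))


# Illustrative entities (placeholder names and identifiers).
cofactor = Entity("MOL:example", "MOL", "example cofactor")
protein = Entity("PRX:example", "PRX", "example complex protein")

ontology = Ontology()
ontology.add_relation(cofactor, "IsBoundTo", protein)
print(ontology.relations)
```

In this sketch an attempt to add IsBoundTo in the opposite direction, or between any other pair of entity types, raises a `ValueError`, while IsA and IsPartOf are left unrestricted because the text does not state constraints for them.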

Towards the unified dictionary of biochemical compounds

There is no authoritative database of biochemical compounds in the public domain. This is a

serious lack, as many biomedical databases need to refer to, or use data attached to,

biochemical compounds. Within the EBI alone these include MSD, Swiss-Prot, IntEnz,

IUPHAR Receptor Database and GO, as well as some databases developed in the EBI

research groups. More broadly, many other public biomedical databases have the same need.

As mentioned in the previous sections, several groups at the EBI build their own in-house

controlled vocabularies and/or chemical databases. To avoid unnecessary multiplication of

efforts, a project was initiated to create a definitive, freely available dictionary of Chemical

compounds of Biological Interest (ChEBI). Data in ChEBI should be definitive in the sense

that terminology would be explicitly endorsed, where applicable, by IUPAC (systematic

names), NC-IUBMB (biochemical nomenclature) or NC-IUPHAR (drug classification).

More specifically, the most immediate goal of ChEBI is to provide a public reference for

biochemical compounds consisting of

• Substrates, products, cofactors, activators and inhibitors of enzymes (IntEnz)


• Ligands of receptors (IUPHAR Receptor Database)

• ‘Small molecules’ bound to macromolecules (chemPDB)

• Amino acid residues and their post-translational modifications (RESID)

• Metals and organic molecules bound to proteins as prosthetic groups (COMe)

• Molecules interacting with proteins (Swiss-Prot)

• Molecules involved in basic biological processes (GO)

Such a collection can be classified as a ‘small’ database. Our most optimistic estimate is that

in the next five years ChEBI will have no more than 50,000 curated entries. Therefore, it will

include only a small fraction of data provided by comprehensive commercial chemical

databases such as Beilstein [ 47] or CAS Registry [ 48] (cf. 4,519 compounds in IntEnz and

~5,000 in chemPDB vs more than 8 million substances in the Beilstein Database, with over

500,000 classified as ‘bioactive’). The same is true of the depth of the database (number of

properties likely to be documented).

In creating such a database the following principles should be observed.

• Nothing held in the database must be proprietary or derived from a proprietary source

that would limit its free distribution/availability to anyone.

• Every data item in the database should be fully traceable and explicitly referenced to

the original source/version.

• Although the EBI will provide a web interface, the entirety of the data should be

available to all without constraint as, for example, PostgreSQL or MySQL table

dumps, mmCIF and XML (e.g. DAML+OIL or CML).
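
The traceability principle and the XML distribution format can be illustrated together. The following Python sketch is not the ChEBI data model: the class `DataItem`, the function `to_xml`, the element and attribute names, and the identifier `EX:1` are all assumptions introduced for this example. It shows only that a data item lacking a source or a source version can be rejected before export.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass(frozen=True)
class DataItem:
    value: str           # the datum itself, e.g. a compound name
    item_type: str       # what kind of datum this is, e.g. "name"
    source: str          # resource the datum was taken from
    source_version: str  # release or version of that resource


def to_xml(compound_id, items):
    """Serialise data items to an XML string, refusing untraceable ones."""
    root = ET.Element("compound", {"id": compound_id})
    for item in items:
        # Every item must be referenced to its original source and version.
        if not item.source or not item.source_version:
            raise ValueError(f"untraceable data item: {item.value!r}")
        element = ET.SubElement(root, "item", {
            "type": item.item_type,
            "source": item.source,
            "version": item.source_version,
        })
        element.text = item.value
    return ET.tostring(root, encoding="unicode")


# Hypothetical record: the identifier and version label are placeholders.
record = [DataItem("water", "name", "IntEnz", "example-release")]
print(to_xml("EX:1", record))
```

Keeping the source and version on each item, rather than on the record as a whole, allows names drawn from different resources to coexist in one entry without losing their provenance.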

At this early stage our primary objective is to standardise biochemical terminology, but the

next step will necessarily concern structures. In accordance with the principles outlined

above, we plan to adopt open standards for chemical structure representation, such as the IUPAC

Chemical Identifier (IChI) [ 49] for 2-D structures and CIF [ 50] for 3-D structures. The

connectivity and stereochemistry (2-D structure) for the majority of small organic molecules in

ChEBI (including isotope-labelled ones) could be stored as IChI. Other molecules will be linked

to corresponding 1-D or 3-D databases.


Acknowledgements

I wish to thank my colleagues at the EBI, Sergio Contrino, Michael Darsow and Paula de

Matos. I thank Gillian Adams for her helpful comments and suggestions on the manuscript. I

am indebted to Michael Ashburner (University of Cambridge) and Steve Stein (NIST), and

this paper is the ultimate result of our e-mail exchanges.

References

1. Carugo, O. and Pongor, S. (2002) The evolution of structural databases. Trends

Biotechnol. 20, 498–501.

2. The Gene Ontology Consortium, http://www.geneontology.org/

3. The EMBL Nucleotide Sequence Database, http://www.ebi.ac.uk/embl/

4. The Swiss-Prot Protein Knowledgebase, http://www.ebi.ac.uk/swissprot/

5. The Protein Data Bank, http://www.pdb.org/

6. NCBI Taxonomy, http://www.ncbi.nlm.nih.gov/Taxonomy/

7. GO File Format Guide, http://www.geneontology.org/doc/GO.format.html

8. Open Biology Ontologies, http://obo.sourceforge.net/

9. The BioCyc Knowledge Library, http://BioCyc.org/

10. http://srs.ebi.ac.uk/srsbin/cgi-bin/wgetz?-e+[RESID:AA0283]

11. Metanomski, W.V., Ed. (1991) Compendium of Macromolecular Nomenclature (“The

Purple Book”). Blackwell Scientific Publications, Oxford.

12. The Cyclotide Webpage, http://www.cyclotide.com/

13. James, C.A., Weininger, D., Delany, J. (2003) Daylight Theory Manual,

http://www.daylight.com/dayhtml/doc/theory/theory.toc.html

14. Enzyme Nomenclature: Recommendations (1992) of the Nomenclature Committee of

the International Union of Biochemistry and Molecular Biology. Academic Press, San

Diego.


15. IUBMB Glossary of Chemical Names,

http://www.chem.qmul.ac.uk/iubmb/enzyme/glossary.html

16. LIGAND database of chemical compounds and reactions in biological pathways,

http://www.genome.ad.jp/ligand/

17. NIST Chemistry WebBook, http://webbook.nist.gov/chemistry/

18. Chemical Ontology,

http://cvs.sourceforge.net/cgi-bin/viewcvs.cgi/obo/obo/ontology/biochemical/

19. The ENZYME database, http://www.expasy.org/enzyme/

20. The University of Minnesota Biocatalysis/Biodegradation Database (UM-BBD),

http://umbbd.ahc.umn.edu/

21. Panico, R., Powell, W.H. and Richer, J.C., Eds. (1993) A Guide to IUPAC

Nomenclature of Organic Compounds, Recommendations 1993 (“The Blue Book”).

Blackwell Scientific Publications, Oxford.

22. Physico-Chemical Ontology (FIX),

http://cvs.sourceforge.net/cgi-bin/viewcvs.cgi/obo/obo/ontology/physicochemical/

23. Tipton, K. and Boyce, S. (2000) History of the enzyme nomenclature system.

Bioinformatics 16, 34–40.

24. Grossman, R.B. (1999) The Art of Writing Reasonable Organic Reaction Mechanisms.

Springer-Verlag, New York.

25. Purich, D.L. (2001) Enzyme catalysis: a new definition accounting for noncovalent

substrate- and product-like states. Trends Biochem Sci. 26, 417–421.

26. Hemmerich, P., Massey, V. and Fenner, H. (1977) Flavin and 5-deazaflavin: A

chemical evaluation of ‘modified’ flavoproteins with respect to the mechanisms of

redox biocatalysis. FEBS Lett. 84, 5–21.

27. Nomenclature Committee of the International Union of Biochemistry (NC-IUB)

(1991) Nomenclature of electron-transfer proteins. Recommendations 1989. Eur. J.

Biochem. 200, 599–611; http://www.chem.qmul.ac.uk/iubmb/etp/


28. Transport Protein Database, http://tcdb.ucsd.edu/tcdb/

29. Nomenclature Committee of the International Union of Biochemistry and Molecular

Biology (NC-IUBMB) (2002) Membrane transport proteins. Recommendations 2002.

http://www.chem.qmul.ac.uk/iubmb/mtp/

30. Liochev, S.I. (1999) The mechanism of “Fenton-like” reactions and their importance

for biological systems. A biologist’s view. Metal Ions Biol. Syst. 36, 1–39.

31. Holick, M.F. (1995) Defects in the synthesis and metabolism of vitamin D. Exp. Clin.

Endocrinol. Diabetes 103, 219–227.

32. McNaught, A.D. and Wilkinson, A., Eds. (1997) Compendium of Chemical

Terminology (“The Gold Book”), 2nd Edition. Blackwell Scientific Publications,

Oxford, p. 206.

33. Ibid., p. 58.

34. Ibid., p. 34.

35. Edmondson, D.E. and Newton-Vinson, P. (2001) The covalent FAD of monoamine

oxidase: structural and functional role and mechanism of the flavinylation reaction.

Antioxid. Redox Signal. 3, 789–806.

36. Firbank, S.J., Rogers, M., Hurtado-Guerrero, R., Dooley, D.M., Halcrow, M.A.,

Phillips, S.E.V., Knowles, P.F. and McPherson, M.J. (2003) Cofactor processing in

galactose oxidase. Biochem. Soc. Trans. 31, 506–509.

37. IntEnz: Integrated relational Enzyme database, http://www.ebi.ac.uk/IntEnz/

38. BRENDA: The Comprehensive Enzyme Information System,

http://www.brenda.uni-koeln.de/

39. MSD Ligand Chemistry, http://www.ebi.ac.uk/msd-srv/chempdb/

40. E-MSD: the European Bioinformatics Institute Macromolecular Structure Database,

http://www.ebi.ac.uk/msd/

41. The RESID Database of Protein Modifications,

ftp://ftp.ebi.ac.uk/pub/databases/RESID/


42. IUPHAR Receptor Database, http://www.ebi.ac.uk/iuphar-rd/

43. International Union of Pharmacology (2000) The IUPHAR Compendium of Receptor

Characterization and Classification, 2nd edition. IUPHAR Media, London.

44. Catterall, W.A., Chandy, K.G. and Gutman, G.A., Eds. (2002) The IUPHAR

Compendium of Voltage-gated Ion Channels. IUPHAR Media, Leeds.

45. Degtyarenko, K. and Contrino, S. (2003) COMe: the ontology of bioinorganic proteins.

The Chemistry Preprint Server (CPS: biochem/0307002).

46. COMe, http://www.ebi.ac.uk/come/

47. CrossFire Beilstein, http://www.mdl.com/products/knowledge/crossfire_beilstein/

48. CAS Registry, http://www.cas.org/EO/regsys.html

49. IUPAC Chemical Identifier (IChI) Project,

http://www.iupac.org/projects/2000/2000-025-1-800.html

50. IUCr Crystallographic Information File, http://www.iucr.org/iucr-top/cif/
