ChEBI, text mining and ontological best practice

Colin Batchelor, Royal Society of Chemistry

2008-05-19, batchelorc@rsc.org

What is text mining?

Marti Hearst, Berkeley: “Text Mining is the discovery by computer of new, previously unknown information, by automatically extracting information from different written resources.”

Can ChEBI help?

Overview

Reasoning

ChEBI as dictionary

Regular polysemy in chemistry

Some possible solutions

Reasoning

Reasoning is using the logical structure of an ontology to automatically infer facts about the world which have not been explicitly added by a human being.

Computers have no real-world knowledge beyond what we tell them.

Logical structure: properties of relations

We only have time to look at transitivity and is_a.

Smith et al., “Relations in Biomedical Ontologies”, Genome Biol., 2005, 6, R46.

Relation   Transitive   Symmetric   Reflexive   Anti-symmetric
is_a       Yes          No          Yes         Yes
part_of    Yes          No          Yes         Yes

ChEBI’s is_a is not transitive (1)

If a relation R is transitive, then:

If a R b and b R c, then a R c.

glutathione is_a cofactor

cofactor is_a biological role

therefore glutathione is_a biological role
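A reasoner that trusts the transitivity of is_a will derive this conclusion mechanically, whether or not it makes chemical sense. The following is a minimal sketch, assuming a toy edge list that holds only the two assertions above; the names `IS_A` and `ancestors` are illustrative and do not come from ChEBI or from any particular reasoner.

```python
from collections import defaultdict

# Toy is_a assertions copied from the slide; not a real ChEBI extract.
IS_A = [
    ("glutathione", "cofactor"),
    ("cofactor", "biological role"),
]


def ancestors(term, edges):
    """Return every term reachable from `term` by following is_a edges."""
    parents = defaultdict(set)
    for child, parent in edges:
        parents[child].add(parent)

    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for parent in parents[current]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


# Transitivity yields "glutathione is_a biological role" without any
# curator having asserted it.
print(sorted(ancestors("glutathione", IS_A)))
```

The program has no way of knowing that a molecule is not a role; it only follows the edges it is given.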

ChEBI’s is_a is not transitive (2)

water is_a amphiprotic solvent

amphiprotic solvent is_a protophilic solvent (*)

protophilic solvent is_a Brønsted base (*)

Brønsted base is_a base

base is_a biological role

therefore water is_a base

therefore water is_a biological role

* Why do “protophilic solvent” and “Brønsted base” have only one child each?

ChEBI’s is_a is not transitive (3)

N-hydroxy-L-aspartic acid is_a hydroxamic acids

hydroxamic acids is_a organic functional classes

therefore N-hydroxy-L-aspartic acid is_a organic functional classes

is_a has many meanings!

1. An amount of a compound has a biological role: tris is_a buffer.*

2. An amount of a compound has an application: sodium dodecyl sulfate is_a detergent.*

3. A less-abstract type is an example of a more abstract type: propane is_a alkanes.

4. ?!: metals is_a atoms.*

* Not a property of a lone atom or molecule!

Computers need facts about the world, not about ChEBI curation

ChEBI as dictionary

Evaluating name–structure conversion with ChEBI

ChEBI release 37 (26 September 2007) contains 12688 annotated entities, of which 8486 have InChI strings.

We use OSCAR3 (oscar3-chem.sourceforge.net) for name–structure conversion.

We convert chebi.obo to an XML file, each paragraph containing either a ChEBI name or an IUPAC name.

The layered structure of the InChI lets us give partial credit for incomplete matches.
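As a rough illustration of partial credit, the sketch below compares two InChI strings exactly, with the fixed-hydrogen layer dropped, and with the stereo layers dropped. It assumes that the fixed-hydrogen part is everything from `/f` onwards and that stereo layers are those prefixed `b`, `t`, `m` or `s`. The helper names and the alanine strings are illustrative, and this is not how OSCAR3 or the evaluation itself did the comparison.

```python
STEREO_PREFIXES = ("b", "t", "m", "s")


def split_layers(inchi):
    """Split an InChI into (main layers, fixed-hydrogen layers)."""
    body = inchi.split("=", 1)[1]
    main, sep, fixed = body.partition("/f")
    main_layers = main.split("/")
    fixed_layers = ("f" + fixed).split("/") if sep else []
    return main_layers, fixed_layers


def drop_stereo(layers):
    """Remove the stereo layers (b, t, m, s) from a list of layers."""
    return [layer for layer in layers if not layer.startswith(STEREO_PREFIXES)]


def compare(a, b):
    """Report at which levels of strictness two InChI strings agree."""
    main_a, fixed_a = split_layers(a)
    main_b, fixed_b = split_layers(b)
    return {
        "exact": a == b,
        "ignoring_fixed_h": main_a == main_b,
        "ignoring_stereo": (
            drop_stereo(main_a) == drop_stereo(main_b)
            and drop_stereo(fixed_a) == drop_stereo(fixed_b)
        ),
    }


# Illustrative strings: alanine with and without a stereo assignment.
l_alanine = "InChI=1/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m0/s1"
alanine = "InChI=1/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)"
print(compare(l_alanine, alanine))
```

Here a name resolved to alanine without stereochemistry would miss the exact match against L-alanine but still earn credit once stereo is disregarded.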

Results: IUPAC names

Total                                                 8447
Identified as chemical                                8255   (97.73%)
With InChI (upper bound)                              1810   (21.43%)
Matching InChI, disregarding fixed hydrogen layer     1734   (20.53%)
Matching InChI, disregarding stereo                   1176
Matching InChI, exact (lower bound)                   1174   (13.90%)
Not all of name matched                               1024
Name identified as two or more separate names          974   (11.53%)

Results: ChEBI names

Total                                                 8146
Identified as chemical                                7173   (88.06%)
With InChI (upper bound)                              1036   (12.72%)
Matching InChI, disregarding fixed hydrogen layer      953   (11.70%)
Matching InChI, disregarding stereo                    637
Matching InChI, exact (lower bound)                    628   (7.71%)
Not all of name matched                                764
Name identified as two or more separate names          373   (4.58%)

Regular polysemy

… where words stand for multiple things in a consistent way.

Examples:

Brand names

Grinding

Figure–ground

Exact–class–part polysemy in chemistry

Peter Corbett, Colin Batchelor and Ann Copestake (2008), “Pyridines, pyridine and pyridine rings”, Proc. BERBMTM08 at LREC 2008, Marrakech, Morocco.

Brand names: “Learning to buy a Renault and talk to BMW”

Grinding: “The squirrel scampered down the path and kept stopping and looking at the officers to check they were behind”

vs. “[…] the trick was to serve squirrel fresh and not to leave it hanging like other game”

Figure–ground:

Audrey Hepburn painted the door (figure)

Audrey Hepburn walked through the door (ground)

The Incredible Hulk walked through the door (ambiguous)

Methyl, the radical (exact)

Methyl, the group (part)

Can ChEBI handle methyl?

methyl group (CHEBI:32875) YES

methyl radical (CHEBI:29309) YES

Imidazole (exact)

An imidazole (class)

imidazole side-chain/group/ring (part)

Can ChEBI handle imidazole?

imidazoles (CHEBI:24780) YES

imidazole (CHEBI:16069) YES

imidazole ring not yet

imidazolyl group not yet

Mapping exact, class and part to entries in ChEBI

Tests:

1. Has InChI: exact

2. Name is plural: class

3. Ends in –yl, “group” or “residue”: part

Test 2 doesn’t work for applications or roles.

Test 3 is brittle.

I would much rather use the logical structure of the ontology.
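The three tests above can be sketched as a few string checks, which also shows how easily they break. This is a rough illustration only: the function name `classify`, the naive "ends in s" plural check and the `has_inchi` flag passed in by hand are assumptions, not the code used for the work described here.

```python
def classify(name, has_inchi):
    """Guess whether a ChEBI entry denotes an exact compound, a class or a part."""
    # Test 1: an entry with an InChI is taken to be an exact compound.
    if has_inchi:
        return "exact"
    # Test 2: a plural name is taken to be a class (naive plural check).
    if name.endswith("s"):
        return "class"
    # Test 3: names ending in -yl, "group" or "residue" are taken to be parts.
    if name.endswith(("yl", " group", " residue")):
        return "part"
    return "unknown"


print(classify("imidazole", has_inchi=True))        # exact
print(classify("imidazoles", has_inchi=False))      # class
print(classify("methyl group", has_inchi=False))    # part
print(classify("detergent", has_inchi=False))       # unknown: an application, not plural
print(classify("imidazole ring", has_inchi=False))  # unknown: a part that test 3 misses
```

The last two calls show the weaknesses noted above: a role or application name is not plural, and a part named with “ring” slips past the suffix list.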

Some possible solutions

Some possible solutions (1)

ChEBI must represent facts about the world rather than about itself.

Examples:

If unclassified compounds have a structure, they should be in the molecular structure tree rather than the unclassifieds tree.

“organic functional classes” is a tool for assigning nomenclature. No chemical compound is an “organic functional class”.

Some possible solutions (2)

ChEBI must distinguish between what is always true and what is only sometimes true.

Example: Replace some is_a relationships with has_biological_role and has_application.
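A minimal sketch of what this buys a reasoner, assuming a hypothetical remodelling of a few assertions from the earlier slides as typed triples (the triple list and both helper functions are illustrative, not ChEBI's actual content): is_a is followed transitively only between types, and roles are looked up through their own relation.

```python
# Hypothetical remodelling: roles and applications get their own relations.
TRIPLES = [
    ("propane", "is_a", "alkanes"),
    ("glutathione", "has_biological_role", "cofactor"),
    ("cofactor", "is_a", "biological role"),
    ("sodium dodecyl sulfate", "has_application", "detergent"),
]


def is_a_closure(term, triples):
    """Follow only is_a edges transitively from `term`."""
    found = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for subject, relation, obj in triples:
            if subject == current and relation == "is_a" and obj not in found:
                found.add(obj)
                stack.append(obj)
    return found


def roles(term, triples, relation="has_biological_role"):
    """Return the directly asserted roles of `term` plus their is_a ancestors."""
    direct = {o for s, r, o in triples if s == term and r == relation}
    inherited = set()
    for role in direct:
        inherited |= is_a_closure(role, triples)
    return direct | inherited


print(is_a_closure("glutathione", TRIPLES))   # no longer "is_a biological role"
print(sorted(roles("glutathione", TRIPLES)))  # the role is still recoverable
```

With this split, the program can still find that glutathione acts as a cofactor, but it no longer concludes that glutathione is a biological role.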

We need ChEBI to represent parts of molecules that aren’t substituents. They should all be descendants of molecular part (a new term), as should amino acid residues and nucleoside residues.

Questions?