Hazard Estimation and Method Comparison with OWL-Encoded Toxicity Decision Trees

Leonid L. Chepelev, Dana Klassen, and Michel Dumontier

Department of Biology, Institute of Biochemistry, School of Computer Science, Carleton University, Ottawa, Canada

An OWLED 2011 Paper


Motivation

• Machine learning approaches such as decision trees are commonly used in toxicity prediction

• However, complex trees can be difficult to interpret, and they give no explanation for the category obtained.

• Moreover, many variant decision trees are being published, and they are difficult to compare.

• Can we use OWL ontologies to formally represent and compare decision trees?

A simple toxicity decision tree: at each branching point, a rule is evaluated, and based on the outcome of this rule, either a final activity decision is made, or judgment is deferred to another node.
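The evaluation procedure described in this caption can be sketched in a few lines of Python. This is only an illustration: the feature names `has_alert_a` and `has_alert_b`, the tree shape, and the class names `Node` and `Leaf` are hypothetical and do not come from the presentation.

```python
from dataclasses import dataclass
from typing import Callable, Union


@dataclass
class Leaf:
    decision: str  # final activity decision, e.g. "toxic"


@dataclass
class Node:
    rule: Callable[[dict], bool]    # rule evaluated at this branching point
    if_true: Union["Node", Leaf]    # followed when the rule holds
    if_false: Union["Node", Leaf]   # followed otherwise


def classify(tree, sample):
    """Walk from the root, deferring to child nodes until a leaf is reached."""
    current = tree
    while isinstance(current, Node):
        current = current.if_true if current.rule(sample) else current.if_false
    return current.decision


# Hypothetical two-rule tree over made-up boolean features.
tree = Node(
    rule=lambda s: s["has_alert_a"],
    if_true=Leaf("toxic"),
    if_false=Node(
        rule=lambda s: s["has_alert_b"],
        if_true=Leaf("toxic"),
        if_false=Leaf("non-toxic"),
    ),
)

print(classify(tree, {"has_alert_a": False, "has_alert_b": False}))
```

The function returns only the final label, which is the limitation raised in the bullets above: nothing in the result says which rules led to it.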


Druglikeness: Lipinski’s Rule of Five

• Rule of thumb for druglikeness (orally active in humans): 4 rules with multiples of 5
– mass of 500 Daltons or less
– 5 hydrogen bond donors or fewer
– 10 hydrogen bond acceptors or fewer
– a partition coefficient (logP) value between -5 and 5

• Multiple conditions that must be satisfied to be considered druglike.

• A molecule failing any of these would not be druglike.


Chemical Data

Molecule   Mass (Da)   HBDC   HBAC   LogP    Active
1          335.22      2      5      1.25    true
2          445.12      3      4      2.35    true
3          674.43      8      6      6.55    false
4          1882.25     4      12     -5.22   false
…          …           …      …      …       …

HBDC: hydrogen bond donor count. HBAC: hydrogen bond acceptor count.

Lipinski druglikeness dataset comprising 7000 compounds from the Human Metabolome Database (HMDB).

Attributes were computed using the Chemistry Development Kit. The tree was built with Weka, an open-source collection of machine learning algorithms for data mining, with tools for data pre-processing, classification, regression, clustering, association rules, and visualization.
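The Active column in the sample rows follows directly from the four Rule of Five conditions. A minimal sketch, assuming the thresholds exactly as stated on the previous slide; the function name `is_druglike` is an illustrative choice, not part of the original pipeline, which used the Chemistry Development Kit and Weka:

```python
def is_druglike(mass, hbd, hba, logp):
    """Apply the four Rule of Five conditions as stated in the slides."""
    return mass <= 500 and hbd <= 5 and hba <= 10 and -5 <= logp <= 5


# The four sample rows from the Chemical Data table:
# (molecule, mass, donor count, acceptor count, logP)
rows = [
    (1, 335.22, 2, 5, 1.25),
    (2, 445.12, 3, 4, 2.35),
    (3, 674.43, 8, 6, 6.55),
    (4, 1882.25, 4, 12, -5.22),
]

labels = {mol: is_druglike(mass, hbd, hba, logp) for mol, mass, hbd, hba, logp in rows}
print(labels)  # {1: True, 2: True, 3: False, 4: False}
```

Molecules 3 and 4 each fail more than one condition, so any one of the failed checks is enough to label them inactive.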


Rule of Five Decision Tree

Correctly classified molecule counts are given in brackets. 100% accuracy in ten-fold cross validation.


Formalization

Substance I is something that has a molecular weight.

Substance II is a kind of substance I that has a molecular weight <= 500.

Substance III is a kind of substance I that has a molecular weight > 500.

[Diagram: Substance II and Substance III are each linked to Substance I by subClassOf]


Formalization

Every node in the decision tree represents an entity having an attribute or feature, whose value may be specified.

Substance I is something that has a molecular weight:
‘substance I’ equivalentClass ‘has attribute’ some ‘molecular weight’

Substance II is a kind of substance I with a specified value:
‘substance II’ equivalentClass ‘substance I’ and ‘has attribute’ some (‘molecular weight’ and ‘has value’ double[<= 499.296759])

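[Diagram: a subclass node (label extracted as "Substance I I") linked by ‘has attribute’ to Molecular Weight, which is linked by ‘has value’ to the label ">499.296759"; this node is subClassOf Substance I, which is linked by ‘has attribute’ to Molecular Weight]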
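As a rough illustration of how such node definitions could be generated mechanically, the sketch below assembles the class expression as a string. It mirrors the notation used on this slide rather than strict OWL Manchester syntax, and the helper names are hypothetical. The original work built the axioms with the OWL API.

```python
def attribute_restriction(attribute, datatype, op, value):
    """Restriction on one attribute value, in the slide's notation."""
    return f"'has attribute' some ('{attribute}' and 'has value' {datatype}[{op} {value}])"


def child_class(name, parent, attribute, datatype, op, value):
    """Define a tree node as its parent class narrowed by one attribute restriction."""
    restriction = attribute_restriction(attribute, datatype, op, value)
    return f"'{name}' equivalentClass '{parent}' and {restriction}"


axiom = child_class("substance II", "substance I", "molecular weight", "double", "<=", 499.296759)
print(axiom)
```

Because each child is defined as its parent plus one extra restriction, a reasoner can infer the subClassOf link from the definition alone.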


The Chemical Information Ontology (CHEMINF)

• 100+ chemical descriptors

• 50+ chemical qualities

• Relates descriptors to their specifications, the software that generated them (along with the running parameters), and the algorithms that they implement

• Contributors: Nico Adams, Leonid Chepelev, Michel Dumontier, Janna Hastings, Egon Willighagen, Peter Murray-Rust, Christoph Steinbeck


http://semanticchemistry.googlecode.com


A simple decision tree can be represented as a set of subsuming OWL classes

Methods: A Weka tree was trained and serialized into DOT format. The Weka API was used to read the document, and the ontology was created using the OWL API.
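The sketch below gives a feel for the intermediate step of reading a serialized tree. The authors used the Weka API and the OWL API, both Java libraries, so this Python regex parser is a substitute for illustration only. The DOT text is a hypothetical sample written in the style of Weka's J48 graph output, not data from the study.

```python
import re

# Hypothetical sample in the style of a Weka J48 DOT serialization.
SAMPLE_DOT = """digraph J48Tree {
N0 [label="mass" ]
N0->N1 [label="<= 500" ]
N1 [label="true (2.0)" shape=box style=filled ]
N0->N2 [label="> 500" ]
N2 [label="false (2.0)" shape=box style=filled ]
}"""

NODE_RE = re.compile(r'^(N\d+)\s*\[label="([^"]*)"')
EDGE_RE = re.compile(r'^(N\d+)->(N\d+)\s*\[label="([^"]*)"')


def parse_dot(text):
    """Return ({node_id: label}, [(parent_id, child_id, condition)])."""
    nodes, edges = {}, []
    for raw in text.splitlines():
        line = raw.strip()
        edge = EDGE_RE.match(line)
        if edge:
            edges.append(edge.groups())
            continue
        node = NODE_RE.match(line)
        if node:
            nodes[node.group(1)] = node.group(2)
    return nodes, edges


nodes, edges = parse_dot(SAMPLE_DOT)

# Each edge becomes one subsumption: the child class sits under its parent class.
subclass_pairs = [(child, parent) for parent, child, _ in edges]
print(subclass_pairs)
```

The edge labels hold the threshold conditions, which is the information needed to write the attribute restriction for each child class.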


Each outcome may also be formalized in terms of the set of all attributes as obtained by drawing a path to the root

‘druglike molecule’ equivalentClass ‘molecule’
and ‘has attribute’ some (‘molecular weight’ that ‘has value’ double[<= 500.0])
and ‘has attribute’ some (‘hydrogen bond donor count’ that ‘has value’ int[<= 5])
and ‘has attribute’ some (‘hydrogen bond acceptor count’ that ‘has value’ int[<= 10])
and ‘has attribute’ some (‘partition coefficient’ that ‘has value’ double[<= 5.0, >= -5.0])
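Drawing a path to the root amounts to walking parent links from the outcome node and collecting the condition on each edge. A minimal sketch, assuming a hypothetical chain-shaped tree stored as a child-to-parent map; the node identifiers `n1` to `n3` and the helper names are illustrative only.

```python
# child -> (parent, attribute tested, datatype, facet on the edge); hypothetical tree
PARENT = {
    "n1": ("root", "molecular weight", "double", "<= 500.0"),
    "n2": ("n1", "hydrogen bond donor count", "int", "<= 5"),
    "n3": ("n2", "hydrogen bond acceptor count", "int", "<= 10"),
    "druglike": ("n3", "partition coefficient", "double", "<= 5.0, >= -5.0"),
}


def path_conditions(leaf, parent_of):
    """Collect edge conditions from the leaf up to the root, returned root-first."""
    conditions = []
    node = leaf
    while node in parent_of:
        parent, attribute, datatype, facet = parent_of[node]
        conditions.append((attribute, datatype, facet))
        node = parent
    conditions.reverse()
    return conditions


def outcome_definition(name, base, conditions):
    """Join all conditions on the path into one class definition string."""
    clauses = [f"'{base}'"]
    for attribute, datatype, facet in conditions:
        clauses.append(
            f"'has attribute' some ('{attribute}' that 'has value' {datatype}[{facet}])"
        )
    return f"'{name}' equivalentClass " + " and ".join(clauses)


conditions = path_conditions("druglike", PARENT)
definition = outcome_definition("druglike molecule", "molecule", conditions)
print(definition)
```

The resulting definition lists every attribute used on the way to the outcome, so the class expression also serves as the explanation for the decision.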


Large scale decision trees

• The Lipinski example is fairly trivial

• Can we create a new decision tree capable of classifying linked data?

• Obtained 1400 chemicals from an EPA ToxCast carcinogenic toxicity dataset, each labelled either toxic or non-toxic

• Computed 318 boolean features using the ToxTree API. http://toxtree.sourceforge.net/

• Generated the decision tree using Weka

• Generated the OWL ontology using the OWL API

• Generated individuals using the CHESS specification and used descriptors specified in the CHEMINF ontology.

• Classification using OWL API + Pellet; Protégé 4 and HermiT.


A decision tree to predict carcinogenic toxicity


Decision Tree to OWL Ontology


Is acetaminophen toxic?


From data to automated reasoning

[Diagram: data, then linked data, then automated reasoning (realization) over the OWL-encoded toxicity tree]


Path through Decision Tree kindly provided by reasoning about the OWL ontology


Comparison of toxicity trees

• Along with the standard Lipinski Rule of Five ontology, we generated a variant where MW <= 250.

• Reasoning over the two ontologies, we see that the active compound class based on MW <= 250 is subsumed by the active compound class based on MW <= 500.
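For numeric thresholds like these, the subsumption found by the reasoner corresponds to interval containment: every molecule that satisfies MW <= 250 also satisfies MW <= 500. The sketch below checks that containment directly. It is a stand-in for intuition only, not how Pellet or HermiT compute subsumption, and the attribute keys are illustrative.

```python
import math


def subsumed_by(sub, sup):
    """True if each interval in `sup` contains the interval for the same attribute in `sub`."""
    for attribute, (sup_lo, sup_hi) in sup.items():
        if attribute not in sub:
            return False
        sub_lo, sub_hi = sub[attribute]
        if sub_lo < sup_lo or sub_hi > sup_hi:
            return False
    return True


# Closed intervals (low, high) per attribute; -inf means no lower bound is stated.
standard = {
    "mass": (-math.inf, 500.0),
    "hbd": (-math.inf, 5),
    "hba": (-math.inf, 10),
    "logp": (-5.0, 5.0),
}
variant = dict(standard, mass=(-math.inf, 250.0))

print(subsumed_by(variant, standard))  # True
print(subsumed_by(standard, variant))  # False
```

The asymmetry matches the slide: the MW <= 250 class sits under the MW <= 500 class, and not the other way around.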


Conclusion

• Decision trees can be faithfully represented as OWL ontologies

• Because the trees are formalized as ontologies, we can automatically reason over them and use them to classify new chemicals (and hence predict toxicity)

• If we maintain the structure of the decision tree, explanations give the set of attributes used in the decision making (unlike black-box counterparts).

• We expect that trees generated with different but aligned vocabularies may now be comparable


Acknowledgements

CHEMINF Group
Leo Chepelev
Janna Hastings
Egon Willighagen
Nico Adams

Toxicity Group
Leo Chepelev
Dana Klassen


[email protected]

Presentations: http://slideshare.com/micheldumontier