Automated Theory Formation: First Steps in Bioinformatics Simon Colton Computational Bioinformatics Laboratory

Automated Theory Formation:First Steps in Bioinformatics

Simon ColtonComputational Bioinformatics Laboratory

Machine Learning (ML)Questions

Given some background informationConcepts, hypotheses (axioms)

Given some positive examplesAnd some negative examples

Find me an explanationWhy the positives are positive And the negatives are negative

Example: Predictive ToxicologyGiven some theory from chemistry

Structure of molecules, well known substructures

Given some examples of toxic drugsAnd some examples of non-toxic drugs

Question: Why are the toxic drugs toxic?

Automated Theory Formation (ATF) Questions

Given some background informationConcepts, hypotheses (axioms)

And some objects of interestNumbers, Molecules, etc.

Find something interestingInteresting things could be:

Concepts, examples, hypotheses, explanations

ATF OverviewScientific theories contain (at least):

Concepts: salt, acid, baseHypotheses: acid + base => salt + waterExplanations: transfer of electrons, dissolving

So, ATF should do (at least):Concept formation, Conjecture makingHypothesis proving and disproving.

Also needs to:Measure interestingness, present results, etc.

HR Theory Formation SystemDeveloped in maths

Designed to be general purpose systemConcept-based theory formation

Tries to make conceptMakes conjecture when it can’t make a conceptTries to explain conjectures

Conjecture-based theory formationFix faulty conjectures with concept formationPhD work of Alison Pease, based on Lakatos

Concept Formation in HR

10 General Production RulesTake in old concepts, produce new concepts

Split

Negate

Size

SplitCompose

[a,b] : b|a

[a,n]:n = |{b:b|a}|

[a]:2=|{b:b|a}|

[a] : 2|a

[a] : not 2|a

[a]:2=|{b:b|a}| & not 2|a (Odd Prime Numbers)

Conjecture MakingEmpirical checks are performed

After each attempt to invent a new conceptIf the concept has no examples

Makes non-existence conjectureIf concept has same examples as previous

Makes an equivalence conjectureIf another concept subsumes the concept

Makes an implication conjecture

Conjecture Extraction

Suppose HR makes equivalence conjecture:P(a) & Q(a) R(a) & S(a)

Extracts:P(a) & Q(a) => R(a), P(a) & Q(a) => S(a)R(a) & S(a) => P(a), R(a) & S(a) => Q(a)

Tries to Extract: P(a) => R(a), Q(a) => R(a), etc.Prime implicates (require proving, though)

Important: gets Horn ClausesCan be expressed in Prolog…..

Explanation GenerationIn mathematical domains

HR relies on automated theorem proversAnd Model generators

To find counterexamples

E.g., group theory: a*a=a a=id (prove easily)

In biological/chemistry domainsPossibly: visualisation tools, reaction pathways

Greatest HitsPlease ask me over coffee about:

Pre-processing constraint problemsLearning properties of quadratic residuesInventing integer sequencesPuzzle generationAdding to the TPTP librarySetting mathematical tutorial questions…

Long term aim in Bioinformatics

Develop an ATF system similar to HOMERBut working in biological domains

Biologist provides little background infoIn a format they are happy with

Program provides resultsIntelligent, interesting, not too much,And very little rubbish

Automated assistant for biology

Short term aim in BioinformaticsHR can work with biological data

Takes input similar to Muggleton’s Progol

Use HR to solve ML problemsSee how bad an idea that is

Use theory formation to improve MLIntegrate HR and Progol somehow

Naïve Approach to ML TasksGive HR the same input as Progol

Get it to form a theory

Look at the theoryExtract concepts which do well on the taski.e., they look similar to target concept

Not a goal-based approachBad idea (slow)

Less Naïve ApproachImprove search using “forward look-ahead”

ICML Paper

This has evolved to “reactive search”Uses HR’s own Java interpreterHR reacts to certain events in theory formation

Scripts supplied by the user

HR also makes “near-conjectures”Faster approach, but still fairly slow

Example – Mutagenesis42 DataMutagenesis similar to carcinogenisis42 drugs supplied with atom-bond details

Atom type, number & charge, bond type (1-8)

13 are mutagenic (active), 29 are not activeProgol learned this concept (88% accurate)

active(A) :- bond(A,B,C,2), bond(A,D,B,1),atm(A,D,c,21,E)

c,21 ? ?1 2

HR’s ResultsUsing reactive search, four PRs, 30K stepsHR learned this concept:

active(A) :- bond(A,B,C,1), atm(B,F,21), bond(A,C,D,E)Also 88% accurateBut, Progol’s answer “better”Because higher information content (fewer ?s)Biologists sometimes want more information

Is this really a simpler answer?

?,21 ? ?1 ?

But…..

HR also made these equivalence conjecturesAnd extracted them (+100 more) for us

atm(B,X,21) atm(B,c,21)atm(B,X,38) atm(B,n,38)bond(A,B,C,X1) & atm(C,X2,38) bond(A,B,C,1) & atm(C,X3,38)bond(A,X1,B,X2) & atm(B,X3,38) bond(A,B,X4,2), atm(B,X5,38)

We used these to re-write HR’s answerBy hand, but hope to automate

Giving us this answer:

Remember that Progol’s Answer was:

c,21 ? ?1 2

c,21 n,38 ?1 2

So, we filled in one of the blanks!

Are we making a meal of this?Yes, possibly for the mutagenesis data

I was worried about the difficulty of this problem

In the last week I’ve written a200-line Prolog program which runs quite fastAnd can be distributed over multiple processorsAnd can be easily understood by biologists

And gets these results….

Template search – ResultsNice result one (88% accurate, lots of info)

c,21 n,38 o,401 2

o,402

Nice result two (95% accurate)

c,21 n,38 o,401 2

c,? c,22 ?

-0.132

c,195 c,22 h,3

0.145

17 7 1

Template Search - Assumptions

Connected substructures Are interesting answersProgol’s answers are all substructures

More specific substructures are not so badBiologists may even want lots of informationDon’t forget that they want to do science

Each learned concept will be true ofAt least one active (positive) molecule

Template Search - OverviewUser chooses template for substructures

?,? ?,? ?,?

User specifies how many ?s are allowedE.g., 3 out of 8 in the above template

Algorithm starts with the first positiveExtracts all substructures in the template

Then takes the next positive, for each substructure in the set

Add the LGG so that it fits both positives Don’t go under the IC limit

? ?

Template Search – Final PartFor all the substructures

Take a disjunction Which achieves the best accuracy

Distribution of this algorithm possibleWe’re getting a big Linux farmPPP – Processor Per Positive

finds substructures true of one positive combine answers at the end

Conclusions & Future WorkAutomated Theory Formation

May be useful to bioinformaticsUse HR’s theory to improve Progol’s results

Possibly by pre-processing Progol’s input Or by post-processing the learned concept

Template search Maybe a good idea? Possibly not new….Not bad results for the Mutagenesis42 dataset

Documents

Automated Theory Formation: First Steps in Bioinformatics Simon Colton Computational Bioinformatics Laboratory