Upload
angela-daniel
View
218
Download
1
Tags:
Embed Size (px)
Citation preview
Automated Theory Formation:First Steps in Bioinformatics
Simon ColtonComputational Bioinformatics Laboratory
Machine Learning (ML)Questions
Given some background informationConcepts, hypotheses (axioms)
Given some positive examplesAnd some negative examples
Find me an explanationWhy the positives are positive And the negatives are negative
Example: Predictive ToxicologyGiven some theory from chemistry
Structure of molecules, well known substructures
Given some examples of toxic drugsAnd some examples of non-toxic drugs
Question: Why are the toxic drugs toxic?
Automated Theory Formation (ATF) Questions
Given some background informationConcepts, hypotheses (axioms)
And some objects of interestNumbers, Molecules, etc.
Find something interestingInteresting things could be:
Concepts, examples, hypotheses, explanations
ATF OverviewScientific theories contain (at least):
Concepts: salt, acid, baseHypotheses: acid + base => salt + waterExplanations: transfer of electrons, dissolving
So, ATF should do (at least):Concept formation, Conjecture makingHypothesis proving and disproving.
Also needs to:Measure interestingness, present results, etc.
HR Theory Formation SystemDeveloped in maths
Designed to be general purpose systemConcept-based theory formation
Tries to make conceptMakes conjecture when it can’t make a conceptTries to explain conjectures
Conjecture-based theory formationFix faulty conjectures with concept formationPhD work of Alison Pease, based on Lakatos
Concept Formation in HR
10 General Production RulesTake in old concepts, produce new concepts
Split
Negate
Size
SplitCompose
[a,b] : b|a
[a,n]:n = |{b:b|a}|
[a]:2=|{b:b|a}|
[a] : 2|a
[a] : not 2|a
[a]:2=|{b:b|a}| & not 2|a (Odd Prime Numbers)
Conjecture MakingEmpirical checks are performed
After each attempt to invent a new conceptIf the concept has no examples
Makes non-existence conjectureIf concept has same examples as previous
Makes an equivalence conjectureIf another concept subsumes the concept
Makes an implication conjecture
Conjecture Extraction
Suppose HR makes equivalence conjecture:P(a) & Q(a) R(a) & S(a)
Extracts:P(a) & Q(a) => R(a), P(a) & Q(a) => S(a)R(a) & S(a) => P(a), R(a) & S(a) => Q(a)
Tries to Extract: P(a) => R(a), Q(a) => R(a), etc.Prime implicates (require proving, though)
Important: gets Horn ClausesCan be expressed in Prolog…..
Explanation GenerationIn mathematical domains
HR relies on automated theorem proversAnd Model generators
To find counterexamples
E.g., group theory: a*a=a a=id (prove easily)
In biological/chemistry domainsPossibly: visualisation tools, reaction pathways
Greatest HitsPlease ask me over coffee about:
Pre-processing constraint problemsLearning properties of quadratic residuesInventing integer sequencesPuzzle generationAdding to the TPTP librarySetting mathematical tutorial questions…
Long term aim in Bioinformatics
Develop an ATF system similar to HOMERBut working in biological domains
Biologist provides little background infoIn a format they are happy with
Program provides resultsIntelligent, interesting, not too much,And very little rubbish
Automated assistant for biology
Short term aim in BioinformaticsHR can work with biological data
Takes input similar to Muggleton’s Progol
Use HR to solve ML problemsSee how bad an idea that is
Use theory formation to improve MLIntegrate HR and Progol somehow
Naïve Approach to ML TasksGive HR the same input as Progol
Get it to form a theory
Look at the theoryExtract concepts which do well on the taski.e., they look similar to target concept
Not a goal-based approachBad idea (slow)
Less Naïve ApproachImprove search using “forward look-ahead”
ICML Paper
This has evolved to “reactive search”Uses HR’s own Java interpreterHR reacts to certain events in theory formation
Scripts supplied by the user
HR also makes “near-conjectures”Faster approach, but still fairly slow
Example – Mutagenesis42 DataMutagenesis similar to carcinogenisis42 drugs supplied with atom-bond details
Atom type, number & charge, bond type (1-8)
13 are mutagenic (active), 29 are not activeProgol learned this concept (88% accurate)
active(A) :- bond(A,B,C,2), bond(A,D,B,1),atm(A,D,c,21,E)
c,21 ? ?1 2
HR’s ResultsUsing reactive search, four PRs, 30K stepsHR learned this concept:
active(A) :- bond(A,B,C,1), atm(B,F,21), bond(A,C,D,E)Also 88% accurateBut, Progol’s answer “better”Because higher information content (fewer ?s)Biologists sometimes want more information
Is this really a simpler answer?
?,21 ? ?1 ?
But…..
HR also made these equivalence conjecturesAnd extracted them (+100 more) for us
atm(B,X,21) atm(B,c,21)atm(B,X,38) atm(B,n,38)bond(A,B,C,X1) & atm(C,X2,38) bond(A,B,C,1) & atm(C,X3,38)bond(A,X1,B,X2) & atm(B,X3,38) bond(A,B,X4,2), atm(B,X5,38)
We used these to re-write HR’s answerBy hand, but hope to automate
Giving us this answer:
Remember that Progol’s Answer was:
c,21 ? ?1 2
c,21 n,38 ?1 2
So, we filled in one of the blanks!
Are we making a meal of this?Yes, possibly for the mutagenesis data
I was worried about the difficulty of this problem
In the last week I’ve written a200-line Prolog program which runs quite fastAnd can be distributed over multiple processorsAnd can be easily understood by biologists
And gets these results….
Template search – ResultsNice result one (88% accurate, lots of info)
c,21 n,38 o,401 2
o,402
Nice result two (95% accurate)
c,21 n,38 o,401 2
c,? c,22 ?
-0.132
c,195 c,22 h,3
0.145
17 7 1
Template Search - Assumptions
Connected substructures Are interesting answersProgol’s answers are all substructures
More specific substructures are not so badBiologists may even want lots of informationDon’t forget that they want to do science
Each learned concept will be true ofAt least one active (positive) molecule
Template Search - OverviewUser chooses template for substructures
?,? ?,? ?,?
User specifies how many ?s are allowedE.g., 3 out of 8 in the above template
Algorithm starts with the first positiveExtracts all substructures in the template
Then takes the next positive, for each substructure in the set
Add the LGG so that it fits both positives Don’t go under the IC limit
? ?
Template Search – Final PartFor all the substructures
Take a disjunction Which achieves the best accuracy
Distribution of this algorithm possibleWe’re getting a big Linux farmPPP – Processor Per Positive
finds substructures true of one positive combine answers at the end
Conclusions & Future WorkAutomated Theory Formation
May be useful to bioinformaticsUse HR’s theory to improve Progol’s results
Possibly by pre-processing Progol’s input Or by post-processing the learned concept
Template search Maybe a good idea? Possibly not new….Not bad results for the Mutagenesis42 dataset