Learning and Reasoning for AI
Luc De Raedt [email protected]
Roadmap
• Prob. Programming - Modeling
• Inference
• Learning
• Dynamics
• KBMC & Markov Logic
• DeepProbLog
• From StarAI to Nesy
... with some detours on the way
Part V: KBMC, Markov Logic
A key question in AI: dealing with uncertainty, reasoning with relational data, and learning.
Statistical relational learning & probabilistic programming combine:
• logic, databases, programming, ...
• probability theory, graphical models, ...
• learning of parameters and structure
(so far: modeling, inference, learning and dynamics in probabilistic programming; next: KBMC & Markov Logic)
De Raedt, Kersting, Natarajan, Poole: Statistical Relational AI
Flexible and Compact Relational Model for Predicting Grades
“Program” Abstraction:
▪ S, C: logical variables representing students and courses
▪ the set of individuals of a type is called a population
▪ Int(S), Grade(S, C), D(C) are parametrized random variables
Grounding:
• for every student s, there is a random variable Int(s)
• for every course c, there is a random variable D(c)
• for every (s, c) pair there is a random variable Grade(s, c)
• all instances share the same structure and parameters
ProbLog by example: Grading
Shows relational structure
grounded model: replace variables by constants
Works for any number of students / classes (for 1000 students and 100 classes, you get 101,100 random variables); still only a few parameters
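To make the grounding concrete, here is a minimal ProbLog-style sketch of such a grading model (all predicate names, constants and probabilities are illustrative, not taken from the slides):

% Population: two students and one course; adding individuals adds
% only facts, never parameters.
student(s1). student(s2). course(c1).

% Int(S) and D(C) become one random variable per individual.
0.4 :: int(S) :- student(S).     % Int(S)
0.5 :: diff(C) :- course(C).     % D(C)

% Grade(S,C): the four parameters below are shared (tied) across all
% (student, course) pairs.
0.8 :: grade(S,C,a) :- student(S), course(C), int(S), \+diff(C).
0.5 :: grade(S,C,a) :- student(S), course(C), int(S), diff(C).
0.3 :: grade(S,C,a) :- student(S), course(C), \+int(S), \+diff(C).
0.1 :: grade(S,C,a) :- student(S), course(C), \+int(S), diff(C).

query(grade(s1,c1,a)).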
With SRL / PP:
• build and learn compact models,
• transfer from one set of individuals to other sets,
• reason also about exchangeability,
• build even more complex models,
• incorporate background knowledge
Lots of proposals in the literature, e.g.
• relational Markov networks (RMNs) [Taskar et al 2002]
• Markov logic networks (MLNs) [Richardson & Domingos 2006]
• probabilistic soft logic (PSL) [Broecheler et al 2010]
• FACTORIE [McCallum et al 2009]
• Bayesian logic programs (BLPs) [Kersting & De Raedt 2001]
• relational Bayesian networks (RBNs) [Jaeger 2002]
• logical Bayesian networks (LBNs) [Fierens et al 2005]
• probabilistic relational models (PRMs) [Koller & Pfeffer 1998]
• Bayesian logic (BLOG) [Milch et al 2005]
• CLP(BN) [Santos Costa et al 2008]
• and many more ...
Probabilistic Relational Models (PRMs) [Getoor, Koller, Pfeffer]
[Figure: relational schema with class Person, attributes Bloodtype, M-chromosome and P-chromosome, plus (Father) and (Mother) slots and an associated CPD table]
View:
bt(Person) = BT.
pc(Person) = PC.
mc(Person) = MC.
father(Father, Person). mother(Mother, Person).

Dependencies (CPDs associated with):
bt(Person)=BT | pc(Person)=PC, mc(Person)=MC.
pc(Person)=PC | pc_father(Person)=PCf, mc_father(Person)=MCf.
pc_father(Person)=PCf | father(Father,Person), pc(Father)=PCf.
...
Probabilistic Relational Models (PRMs) as Bayesian Logic Programs (BLPs)
Extension:
father(rex,fred). mother(ann,fred). father(brian,doro). mother(utta,doro). father(fred,henry). mother(doro,henry).

Intension:
bt(Person)=BT | pc(Person)=PC, mc(Person)=MC.
pc(Person)=PC | pc_father(Person)=PCf, mc_father(Person)=MCf.
mc(Person)=MC | pc_mother(Person)=PCm, mc_mother(Person)=MCm.
pc_father(Person)=PCf | father(Father,Person), pc(Father)=PCf.
...

[Figure: the resulting ground Bayesian network, with mc, pc and bt random variables (and their states) for rex, ann, fred, brian, utta, doro and henry]
Answering Queries
P(bt(ann)) ?
[Figure: the ground network with the support network for bt(ann) highlighted; only the part of the network on which the query depends has to be considered]
Answering Queries
P(bt(ann), bt(fred)) ?
[Figure: the ground network with the support network for the joint query highlighted]
By Bayes' rule:
P(bt(ann) | bt(fred)) = P(bt(ann), bt(fred)) / P(bt(fred))
Combining Rules
• A student reads two books: the CPDs P(A|B) and P(A|C) must be combined into a single P(A|B,C)
• Typical combining rules: noisy-or, noisy-max, ...

prepared(Student,Topic) | read(Student,Book), discusses(Book,Topic).
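In ProbLog-style languages the noisy-or arises automatically: each ground instance of a probabilistic clause acts as an independent cause. A small sketch (the constants and the 0.7 parameter are illustrative):

0.7 :: prepared(S,T) :- read(S,B), discusses(B,T).

read(ann, bk1). read(ann, bk2).
discusses(bk1, srl). discusses(bk2, srl).

% Two ground instances support prepared(ann, srl), so
% P(prepared(ann, srl)) = 1 - (1 - 0.7) * (1 - 0.7) = 0.91   (noisy-or)
query(prepared(ann, srl)).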
Knowledge Based Model Construction
Extension + Intension => Probabilistic Model
Advantages:
• the same intension can be used for multiple extensions
• parameters are shared / tied together
• unification is essential
• learning becomes feasible: max. likelihood parameter estimation & structure learning
Bayesian Logic Programs

Markov chain:
% apriori nodes
nat(0).
% aposteriori nodes
nat(s(X)) | nat(X).
Ground network: nat(0) → nat(s(0)) → nat(s(s(0))) → ...

HMM:
% apriori nodes
state(0).
% aposteriori nodes
state(s(Time)) | state(Time).
output(Time) | state(Time).
Ground network: state(0) → state(s(0)) → ..., each state(Time) emitting output(Time)

DBN:
% apriori nodes
n1(0).
% aposteriori nodes
n1(s(TimeSlice)) | n2(TimeSlice).
n2(TimeSlice) | n1(TimeSlice).
n3(TimeSlice) | n1(TimeSlice), n2(TimeSlice).
Ground network: n1, n2, n3 replicated per time slice

Pure Prolog and Bayesian nets are special cases.
Learning BLPs
RVs + States = (partial) Herbrand interpretation
Probabilistic learning from interpretations:
Family(1): pc(brian)=b, bt(ann)=a, bt(brian)=?, bt(dorothy)=a
Family(2): bt(cecily)=ab, pc(henry)=a, mc(fred)=?, bt(kim)=a, pc(bob)=b
Family(3): pc(rex)=b, bt(doro)=a, bt(brian)=?
Background: m(ann,dorothy), f(brian,dorothy), m(cecily,fred), f(henry,fred), f(fred,bob), m(kim,bob), ...
Parameter Estimation
The data (the interpretations above) plus the clauses

bt(Person,BT) | pc(Person,PC), mc(Person,MC).
pc(Person,PC) | pc_father(Person,PCf), mc_father(Person,MCf).
mc(Person,MC) | pc_mother(Person,PCm), mc_mother(Person,MCm).

yields maximum-likelihood parameter estimates.
Parameter tying: all ground instances of a clause share the same CPD parameters.
Expectation Maximization
Given a logic program L and initial parameters θ0, the EM algorithm iterates until convergence:
• E-step: with the current model (M, θk), compute the expected counts of a clause, summing over data cases DC and ground instances GI:
  Σ_DC Σ_GI P(head(GI), body(GI) | DC)
• M-step: update the parameters (ML, MAP); for a CPD entry, divide the expected counts above by
  Σ_DC Σ_GI P(body(GI) | DC)
Markov Logic: Intuition
▪ Undirected graphical model
▪ A logical KB is a set of hard constraints on the set of possible worlds
▪ Let's make them soft constraints: when a world violates a formula, it becomes less probable, not impossible
▪ Give each formula a weight (higher weight ⇒ stronger constraint)

P(world) ∝ exp( Σ weights of formulas it satisfies )
A possible worlds view
Say we have two domain elements Anna and Bob as well as two predicates Friends and Happy.
[Figure: the four possible worlds over Friends(Anna,Bob) and Happy(Bob)]
slides by Pedro Domingos
Logical formulas such as ¬Friends(Anna,Bob) ∨ Happy(Bob) exclude possible worlds.
[Figure: the world with Friends(Anna,Bob) and ¬Happy(Bob) is excluded]
Instead of excluding worlds, assign them potentials:
Φ(¬Friends(Anna,Bob) ∨ Happy(Bob)) = 1
Φ(Friends(Anna,Bob) ∧ ¬Happy(Bob)) = 0.75
[Figure: the three worlds satisfying the rule get potential 1, the violating world 0.75]
It is four times as likely that the rule holds (total weight 3 · 1 versus 0.75).
Or, written as a log-linear model:
w(¬Friends(Anna,Bob) ∨ Happy(Bob)) = log(1/0.75) ≈ 0.29
This can also be viewed as building a graphical model.
Suppose we have two constants: Anna (A) and Bob (B).
[Figure: ground Markov network with nodes Smokes(A), Smokes(B), Cancer(A), Cancer(B)]
Markov Logic
[Figure: adding the Friends/2 predicate grounds to the network over Cancer(A), Cancer(B), Smokes(A), Smokes(B), Friends(A,A), Friends(A,B), Friends(B,A), Friends(B,B)]
Represented as a factor graph over 𝑪(𝑨), 𝑺(𝑨), 𝑭(𝑨,𝑨), 𝑭(𝑨,𝑩), 𝑭(𝑩,𝑨), 𝑭(𝑩,𝑩), 𝑺(𝑩), 𝑪(𝑩), with factors F1(A), F1(B) and F2(A,A), F2(A,B), F2(B,A), F2(B,B):

P(Interpretation) ∝ ∏_{i,θ} Fi(X,Y)θ = ∏_{i,θ} exp( wi · 𝕀(Interpretation ⊧ Fi(X,Y)θ) )
Markov Logic
▪ A Markov Logic Network (MLN) is a set of pairs (F, w) where
  ▪ F is a formula in first-order logic
  ▪ w is a real number
▪ An MLN defines a Markov network with
  ▪ one node for each grounding of each predicate in the MLN
  ▪ one feature for each grounding of each formula F in the MLN, with the corresponding weight w
▪ Probability of a world x:

  P(x) = (1/Z) exp( Σ_i wi ni(x) )

  where wi is the weight of formula i and ni(x) the number of true groundings of formula i in x.
Possible Worlds
A vocabulary determines the possible worlds (the logical interpretations).
[Figure: all 16 worlds over Smokes(Alice), Smokes(Bob), Friends(Alice,Bob), Friends(Bob,Alice)]
Slides adapted from Guy Van den Broeck
A logical theory
∀x,y: Smokes(x) ∧ Friends(x,y) ⇒ Smokes(y)
Models = the interpretations that satisfy the theory.
[Figure: the subset of the possible worlds that satisfy the theory]
First-Order Model Counting
The first-order model count (~ #SAT) of the theory ∀x,y: Smokes(x) ∧ Friends(x,y) ⇒ Smokes(y) is the number of worlds over Smokes(Alice), Smokes(Bob), Friends(Alice,Bob), Friends(Bob,Alice) that satisfy it.
Markov Logic
1.5  ∀x,y: Smokes(x) ∧ Friends(x,y) ⇒ Smokes(y)
Counting only substitutions for which x ≠ y (here x=Alice, y=Bob and x=Bob, y=Alice), example worlds get weights such as
(1/Z) exp(1.5 · 2), (1/Z) exp(1.5 · 2) and (1/Z) exp(1.5 · 1).
A Markov Logic theory
Z is the partition function: it sums the exponentiated weights over all possible worlds.
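As a worked sketch for this example (two constants, the single formula with weight 1.5, and n(x) counting its true groundings over the two x ≠ y substitutions):

\[
P(x) \;=\; \frac{\exp\bigl(1.5\, n(x)\bigr)}{Z},
\qquad
Z \;=\; \sum_{x'} \exp\bigl(1.5\, n(x')\bigr),
\]

where the sum ranges over all 16 truth assignments to the four ground atoms.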
Weighted First-Order Model Counting
Take a logical theory and a weight function for predicates, e.g.
Smokes → 1, ¬Smokes → 2, Friends → 4, ¬Friends → 1.
The weighted first-order model count sums, over the models of the theory, the product of the weights of the literals true in each model.
Related to ProbLog inference!
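In symbols (a standard formulation, with Δ the theory and w the literal weight function above):

\[
\mathrm{WFOMC}(\Delta, w) \;=\; \sum_{\omega \,\models\, \Delta} \;\prod_{\ell \in \omega} w(\ell).
\]

ProbLog inference reduces to this setting: a probabilistic fact p :: f yields the weights w(f) = p and w(¬f) = 1 − p.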
Parameter Learning

∂/∂wi log Pw(x) = ni(x) − Ew[ni(x)]

where ni(x) is the number of times clause i is true in the data and Ew[ni(x)] the expected number of times clause i is true according to the MLN.
Has been used for generative learning (pseudo-likelihood); many variations (also discriminative); applications in networks, NLP, bioinformatics, ...
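A minimal sketch of the corresponding update rule (assuming plain gradient ascent with learning rate η; practical MLN learners optimize the pseudo-likelihood or use more elaborate schemes):

\[
w_i \;\leftarrow\; w_i + \eta \,\bigl( n_i(x) - \mathbb{E}_w[n_i(x)] \bigr).
\]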
Applications
▪ Natural language processing, Collective Classification, Social Networks, Activity Recognition, …
Information Extraction
Running example (the same two papers, cited in different, noisy formats):
Parag Singla and Pedro Domingos, “Memory-EfficientInference in Relational Domains” (AAAI-06).
Singla, P., & Domingos, P. (2006). Memory-efficentinference in relatonal domains. In Proceedings of theTwenty-First National Conference on Artificial Intelligence(pp. 500-505). Boston, MA: AAAI Press.
H. Poon & P. Domingos, Sound and Efficient Inferencewith Probabilistic and Deterministic Dependencies”, inProc. AAAI-06, Boston, MA, 2006.
P. Hoifung (2006). Efficent inference. In Proceedings of theTwenty-First National Conference on Artificial Intelligence.
Segmentation
[The same citation text, segmented into Author / Title / Venue fields]
Entity Resolution
[The same citation text; entity resolution determines which mentions refer to the same paper and author]
Roadmap
• Prob. Programming - Modeling
• Inference
• Learning
• Dynamics
• KBMC & Markov Logic
• DeepProbLog
• From StarAI to Nesy
... with some detours on the way
Part VI: DeepProbLog
Learning
Three different paradigms for learning: probability, logic, and neural.
Integrate deep learning and (probabilistic) logics?
[Figure: the alarm Bayesian network (earthquake, burglary, alarm, hears_alarm, calls) next to a CLEVR-style visual question: “Are there an equal number of large things and metal spheres?”]
Deep Learning + Logic = ?
Cf. the Visual Genome and CLEVR datasets.
Neural-symbolic learning and reasoning: A survey and interpretation. [Besold et al.]
NeSy state-of-the-art
• The integration of perception and reasoning is still an open problem.
• Main idea: inject/encode logic into neural networks (and let the NN do the rest)
  • encoding logic in the weights of neural networks
  • learning embeddings for logical entities
  • logical constraints as a regularizer during training
  • templating neural networks
    • building neural networks from functional programs
    • building neural networks from backwards proving
  • differentiable neural computers / program interpreters
State-of-the-art
• Encoding logic in the weights of neural networks
• Logic Tensor Networks (Serafini et al.)
• A Semantic Loss Function for Deep Learning with Symbolic Knowledge (Xu et al.)
• Ontology Reasoning with Deep Neural Networks (Hohenecker et al.)
• Semantic Based Regularization (Diligenti et al.)
State-of-the-art
• Templates for neural networks (a kind of Knowledge Base Model Construction)
• Lifted Relational Neural Networks (Šourek et al.)
• Neural Theorem Prover (Rocktäschel et al.)
• Neural Module Networks (Andreas et al.)
State-of-the-art
• Differentiable neural computers / program interpreters
• Differentiable Neural Computer (Graves et al.)
• Neural Programmer-Interpreters (Reed et al.)
• Differentiable Forth Interpreter (Bošnjak et al.)
DeepProbLog
Idea: inject neural networks into logic by extending an existing PLP language.
DeepProbLog = ProbLog + neural predicate
The neural predicate makes neural networks a first-class citizen.
Related work vs. DeepProbLog:
• logic is made less expressive vs. full expressivity is retained
• logic is pushed into the neural network vs. clean separation
• fuzzy logic vs. probabilistic logic
• language semantics unclear vs. clear semantics
NeurIPS 2018
Neural predicate
• Neural networks have uncertainty in their predictions
• A normalized output can be interpreted as a probability distribution
• Neural predicate models the output as probabilistic facts
• No changes needed in the probabilistic host language
The neural predicate
The output of the neural network becomes probabilistic facts in DeepProbLog.
Example:
nn(mnist_net, [X], Y, [0 ... 9] ) :: digit(X,Y).
Instantiated into a (neural) Annotated Disjunction:
0.04::digit( ,0) ; 0.35::digit( ,1) ; ... ; 0.53::digit( ,7) ; ... ; 0.014::digit( ,9).
DeepProbLog exemplified: MNIST addition
Task: Classify pairs of MNIST digits with their sum
Benefit of DeepProbLog:
• Encode addition in logic
• Separate addition from digit classification
nn(mnist_net, [X], Y, [0 ... 9] ) :: digit(X,Y).
addition(X,Y,Z) :- digit(X,N1), digit(Y,N2), Z is N1+N2.
Examples: addition( , ,8), addition( , ,4), addition( , ,11), …
Grounded for a specific example (the blanks stand for MNIST images):
addition( , ,8) :- digit( ,N1), digit( ,N2), 8 is N1 + N2.
Example
Learn to classify the sum of pairs of MNIST digits; the individual digits are not labeled!
E.g. ( , , 8)
Could be done by a CNN: classify the concatenation of both images into 19 classes. However:
MNIST Addition
• Pairs of MNIST images, labeled with their sum
• Baseline: a CNN that classifies the concatenation of both images into classes 0 ... 18
• DeepProbLog: a CNN that classifies single images into 0 ... 9, plus two lines of DeepProbLog code
Multi-digit MNIST addition
Result:
number([], Result, Result).
number([H|T], Acc, Result) :-
    digit(H, Nr), Acc2 is Nr + 10*Acc, number(T, Acc2, Result).
number(X, Y) :- number(X, 0, Y).

multiaddition(X, Y, Z) :-
    number(X, X2), number(Y, Y2), Z is X2 + Y2.
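As a usage sketch, here is how the accumulator reads the list [img8, img4] as the number 84 (the image constants are placeholders for MNIST inputs, so this is a hand trace, not a runnable query):

% ?- number([img8, img4], N).
%    number([img8, img4], 0, N):  digit(img8, 8), Acc2 is 8 + 10*0    % Acc2 = 8
%    number([img4], 8, N):        digit(img4, 4), Acc2 is 4 + 10*8    % Acc2 = 84
%    number([], 84, N):           N = 84.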
(Deep)ProbLog : Inference
Inference / Reasoning
• Most of the work in PP and StarAI is on inference
• It is hard (complexity-wise)
• Many inference methods: exact, approximate, sampling, lifted, ...
• Inference is the key to learning
ProbLog Inference
Answering a query in a ProbLog program happens in four steps:
1. Ground the program w.r.t. the query
2. Rewrite the ground logic program into a propositional logic formula
3. Compile the formula into an arithmetic circuit
4. Evaluate the arithmetic circuit

0.1 :: burglary.        0.5 :: hears_alarm(mary).
0.2 :: earthquake.      0.4 :: hears_alarm(john).
alarm :- earthquake.
alarm :- burglary.
calls(X) :- alarm, hears_alarm(X).

Query: ?- P(calls(mary))
Step 1: grounding w.r.t. the query (only the relevant part!):

0.1 :: burglary.        0.5 :: hears_alarm(mary).
0.2 :: earthquake.      0.4 :: hears_alarm(john).
alarm :- earthquake.
alarm :- burglary.
calls(mary) :- alarm, hears_alarm(mary).
calls(john) :- alarm, hears_alarm(john).
Step 2: rewrite the ground logic program into a propositional logic formula:

calls(mary) ↔ hears_alarm(mary) ∧ (burglary ∨ earthquake)
Steps 3 and 4: compile the formula into an arithmetic circuit (knowledge compilation) and evaluate it:

calls(mary) ↔ hears_alarm(mary) ∧ (burglary ∨ earthquake)

[Figure: arithmetic circuit with leaves earthquake = 0.2, ¬earthquake = 0.8, burglary = 0.1, hears_alarm(mary) = 0.5; AND(0.8, 0.1) = 0.08, AND(0.08, 0.5) = 0.04, AND(0.2, 0.5) = 0.1, and the OR at the root calls(mary) gives 0.04 + 0.1 = 0.14]
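Reading the circuit bottom-up (the two AND branches split on earthquake, so they are mutually exclusive and the OR may simply add):

\[
P(\mathit{calls}(\mathit{mary})) \;=\; 0.5 \cdot \bigl(0.2 + 0.8 \cdot 0.1\bigr) \;=\; 0.5 \cdot 0.28 \;=\; 0.14.
\]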
Optimization
PLP usually considers the inference setting.
DeepProbLog focuses on optimization:
• we have a set of tuples (q, p)
• q is a query and p its desired success probability
DeepProbLog
• We use algebraic ProbLog (aProbLog) with the gradient semiring
• What is aProbLog?
  • a version of ProbLog where the probability semiring is replaced by an arbitrary semiring structure
  • labels on facts are elements of the semiring
  • cf. the different semirings for WMC, #SAT, ...
More examples:
[Table: further semirings and their label functions]
Implementing DeepProbLog
1. Evaluating the neural networks:
   • instantiate the neural annotated disjunction
   • happens during grounding
   • ProbLog already had support for external functions
2. Performing backpropagation in the neural networks:
   • no direct loss for the neural networks
   • the loss is defined on the logic level
   • derive the gradient in the logic
   • start backpropagation with the derived gradient
Deriving the gradient
• The outputs of the neural network are probabilistic facts
• Probabilistic facts are leaves in the AC
• The AC is a differentiable structure
• We can compute the gradient in the forward pass, along with the probability
• aProbLog + gradient semiring
Gradient semiring

t(0.2) :: earthquake.
t(0.1) :: burglary.
0.5 :: hears_alarm.
alarm :- earthquake.
alarm :- burglary.
calls :- alarm, hears_alarm.

(t(p) marks a learnable parameter initialized at p.)
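For reference, the gradient semiring pairs each probability with a gradient vector and combines them with dual-number style operations (this is the formulation behind aProbLog's gradient semiring):

\[
(p, \vec{g}_p) \oplus (q, \vec{g}_q) = \bigl(p + q,\; \vec{g}_p + \vec{g}_q\bigr),
\qquad
(p, \vec{g}_p) \otimes (q, \vec{g}_q) = \bigl(p\,q,\; p\,\vec{g}_q + q\,\vec{g}_p\bigr).
\]

Evaluating the AC once in this semiring therefore yields the query probability and all partial derivatives in a single forward pass.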
The DeepProbLog pipeline
EXPERIMENTS
Program Induction
• Approach similar to that of ‘Programming with a Differentiable Forth Interpreter’ (∂4) [1]
  • partially defined Forth program with slots / holes
  • slots are filled by a neural network (encoder / decoder)
  • fully differentiable interpreter: NNs are trained with input / output examples
• DeepProbLog program with switches
  • switches are controlled by neural networks

[1] Matko Bošnjak, Tim Rocktäschel, Jason Naradowsky, Sebastian Riedel: Programming with a Differentiable Forth Interpreter. ICML 2017: 547-556
Tasks [1]
● Sorting
  ○ sort lists of numbers using bubble sort
  ○ hole: swap or don't swap when comparing two numbers
● Addition
  ○ add two numbers and a carry
  ○ hole: what is the resulting digit and carry on each step
  ○ (note: not MNIST digits, but actual numbers)
● Word Algebra Problems
  ○ e.g. “Ann has 8 apples. She buys 4 more. She distributes them equally among her 3 kids. How many apples does each child receive?”
  ○ hole: the sequence of permuting, swapping and performing operations on the three numbers
Bubble sort implementation, with the holes defined by a neural predicate:

hole(X,Y,X,Y) :- swap(X,Y,0).
hole(X,Y,Y,X) :- swap(X,Y,1).

bubble([X],[],X).
bubble([H1,H2|T],[X1|T1],X) :-
    hole(H1,H2,X1,X2), bubble([X2|T],T1,X).

bubblesort([],L,L).
bubblesort(L,L3,Sorted) :-
    bubble(L,L2,X), bubblesort(L2,[X|L3],Sorted).

sort(L,L2) :- bubblesort(L,[],L2).
Example DeepProbLog solution
Result
Noisy Addition

nn(classifier, [X], Y, [0 .. 9]) :: digit(X,Y).
t(0.2) :: noisy.
1/19 :: uniform(X,Y,0) ; ... ; 1/19 :: uniform(X,Y,18).
addition(X,Y,Z) :- noisy, uniform(X,Y,Z).
addition(X,Y,Z) :- \+noisy, digit(X,N1), digit(Y,N2), Z is N1+N2.

(a) The DeepProbLog program.

nn(classifier,[a],0) :: digit(a,0); nn(classifier,[a],1) :: digit(a,1).
nn(classifier,[b],0) :: digit(b,0); nn(classifier,[b],1) :: digit(b,1).
t(0.2)::noisy.
1/19::uniform(a,b,1).
addition(a,b,1) :- noisy, uniform(a,b,1).
addition(a,b,1) :- \+noisy, digit(a,0), digit(b,1).
addition(a,b,1) :- \+noisy, digit(a,1), digit(b,0).

(b) The ground DeepProbLog program for the query addition(a,b,1).
Noisy Addition
[Figure: the arithmetic circuit for the query addition(a,b,1), evaluated in the gradient semiring. Every node carries a pair (p, ∇p), the probability plus its gradient w.r.t. p_noisy, p_digit(a,0..9) and p_digit(b,0..9); e.g. the leaf noisy = (0.2, [1, 0, 0, ...]), digit(a,0) = (0.8, [0, 1, 0, ...]), and the root evaluates to (0.411, [-0.447, 0.48, 0.16, ..., 0.08, 0.64, ...])]
Noisy Addition

Table 3: The accuracy on the test set for T4.

Fraction of noise                 0.0     0.2     0.4     0.6     0.8     1.0
Baseline                         93.46   87.85   82.49   52.67    8.79    5.87
DeepProbLog                      97.20   95.78   94.50   92.90   46.42    0.88
DeepProbLog w/ explicit noise    96.64   95.96   95.58   94.12   73.22    2.92
Learned fraction of noise         0.000   0.212   0.415   0.618   0.803   0.985

DeepProbLog is noise tolerant, even retaining an accuracy of 73.2% with 80% noisy labels. As shown in the last row, it is also able to learn the fraction of noisy labels in the data. This shows that the model is able to recognize which examples have noisy labels.

6.2. Program Induction

The second set of problems demonstrates that DeepProbLog can perform program induction. We follow the program sketching [25] setting of differentiable Forth (∂4) [8], where holes in given programs need to be filled by neural networks trained on input-output examples for the entire program. As in their work, we consider three tasks: addition, sorting [26] and word algebra problems (WAPs) [27].

T5: forth_addition([4], [8], 1, [1, 3])
The input consists of two numbers, represented as lists of digits, and a carry. The output is the sum of the numbers and the carry. The program specifies the basic addition algorithm in which we go from right to left over all digits, calculating the sum of two digits and taking the carry over to the next pair. The hole in this program corresponds to calculating the resulting digit (result/4) and carry (carry/4), given two digits and the previous carry.
Addition of images only
• Examples of the form addition( , , ), where all three arguments are images
• What will happen?
Addition of images only
• Examples of the form addition( , , )
• What will happen?
  • the usual loss function will map all images onto 0 (0 + 0 = 0)
  • one can compensate for this by adding a regularisation term based on maximum entropy
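A sketch of such a regularized objective (the exact form here is an assumption, not necessarily the one used in the paper; H is the entropy of the neural predicate's output distribution and λ a hyperparameter):

\[
\mathcal{L} \;=\; \mathcal{L}_{\text{query}} \;-\; \lambda\, H\bigl(p_{\text{digit}}\bigr),
\qquad
H(p) \;=\; -\sum_{c=0}^{9} p_c \log p_c .
\]

Rewarding high entropy discourages the degenerate solution that maps every image to the digit 0.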
Simplified Poker
• dealing with uncertainty
• ignore suits and use just A, J, Q and K
• two players, two cards, and one community card
• train the neural network to recognize the four cards
• reason probabilistically about the non-observed card
• learn the distribution of the unlabeled community card
nn(m_swap, [X,Y]) :: swap(X,Y).

hole(X,Y,X,Y) :- \+swap(X,Y).
hole(X,Y,Y,X) :- swap(X,Y).

bubble([X],[],X).
bubble([H1,H2|T],[X1|T1],X) :-
    hole(H1,H2,X1,X2), bubble([X2|T],T1,X).

bubblesort([],L,L).
bubblesort(L,L3,Sorted) :-
    bubble(L,L2,X), bubblesort(L2,[X|L3],Sorted).

forth_sort(L,L2) :- bubblesort(L,[],L2).
Listing 6: Forth sorting sketch (T6)
Figure A.10: Examples of cards used as input for the Poker without perturbations (T9) experiment.

... calculate the result.

In Listing 8, there are two neural predicates: coin1/2 and coin2/2. Their input is the image of the two coins (e.g. Figure 9). The output is heads or tails. The coins/2 predicate classifies both coins using these two predicates and then performs the comparison of the classes with the compare/3 predicate.

In Listing 9, there is a single neural predicate rank/2 that takes as input the image of a card and classifies it as either a jack, queen, king or ace. There is also an AD with learnable parameters that represents the distribution of the unseen community card (house_rank/1). The hand/2 predicate's first argument is a list of 3 cards. It unifies the output with any of the valid hands that these cards contain. The valid hands are: high card, pair (two cards have the same rank), three of a kind (three cards have the same rank), low straight (jack, queen, king) and high straight (queen, king, ace). Each hand is assigned a rank with the ...
Table 8: The results for the Poker experiment (T9).

Distribution    Jack           Queen          King           Ace
Actual          0.2            0.4            0.15           0.25
Learned         0.203 ± 0.002  0.396 ± 0.002  0.155 ± 0.003  0.246 ± 0.002

... two cards and the community card. For simplicity, we only use the jack, queen, king and ace. We also do not consider the suits of the cards. The input consists of 4 images that show the cards dealt to the two players. Additionally, every example is labeled with the chance that the game is won, lost or ended in a draw, e.g.:

0.8 :: poker([Q♥, Q♦, A♦, K♣], loss)

We expect DeepProbLog to:
• train the neural network to recognize the four cards
• reason probabilistically about the non-observed card
• learn the distribution of the unlabeled community card

To make DeepProbLog converge more reliably, we add some examples with additional supervision. Namely, in 10% of the examples we additionally specify the community card, i.e.

poker([Q♥, Q♦, A♦, K♣], A♦, loss).

This also showcases one of the strengths of DeepProbLog: it can make use of examples that have different levels of observability. The loss function used in this experiment is the MSE between the predicted and target probabilities.

Results. We ran the experiment 10 times. Out of these 10 runs, 4 didn't converge on the correct solution. The average values of the learned parameters for the remaining 6 runs are shown in Table 8. As can be seen, DeepProbLog is able to correctly learn the probabilistic parameters. In these 6 runs, the neural network also correctly learned to classify all card types, achieving a 100% accuracy. The other runs did not converge because some of the classes were permuted (i.e., queens predicted as aces and vice versa) or multiple classes mapped onto the same one (queens and kings were both predicted as kings).
in 6/10 experiments
Challenges
• The data needs to provide a signal (cf. Addition of images only, and Poker): use a curriculum plus regularization
• Scaling up:
  • still using the exact inference of ProbLog
  • circuits can be very large
  • work on approximate inference is ongoing
Further Reading
• One book
• Three websites to start:
  • http://probmods.org/ Probabilistic Models of Cognition (Church)
  • http://dtai.cs.kuleuven.be/problog/ (check also [De Raedt & Kimmig, MLJ 15])
  • http://alchemy.cs.washington.edu/ Markov Logic (check also [Domingos & Lowd], Markov Logic, Morgan & Claypool)
Thanks!
http://dtai.cs.kuleuven.be/problog
Maurice Bruynooghe, Bart Demoen, Anton Dries, Daan Fierens, Jason Filippou, Bernd Gutmann, Manfred Jaeger, Gerda Janssens, Kristian Kersting, Angelika Kimmig, Theofrastos Mantadelis, Wannes Meert, Bogdan Moldovan, Siegfried Nijssen, Davide Nitti, Joris Renkens, Kate Revoredo, Ricardo Rocha, Vitor Santos Costa, Dimitar Shterionov, Ingo Thon, Hannu Toivonen, Guy Van den Broeck, Mathias Verbeke, Jonas Vlasselaer
PLP Systems
• PRISM http://sato-www.cs.titech.ac.jp/prism/
• ProbLog2 http://dtai.cs.kuleuven.be/problog/
• Yap Prolog http://www.dcc.fc.up.pt/~vsc/Yap/ includes
  • ProbLog1
  • cplint https://sites.google.com/a/unife.it/ml/cplint
  • CLP(BN)
  • LP2
• PITA in XSB Prolog http://xsb.sourceforge.net/
• AILog2 http://artint.info/code/ailog/ailog2.html
• SLPs http://stoics.org.uk/~nicos/sware/pepl
• contdist http://www.cs.sunysb.edu/~cram/contdist/
• DC https://code.google.com/p/distributional-clauses
• WFOMC http://dtai.cs.kuleuven.be/ml/systems/wfomc