Learning and Reasoning for AI
Luc De Raedt [email protected]
Roadmap
• Prob. Programming - Modeling
• Inference
• Learning
• Dynamics
• KBMC & Markov Logic
• DeepProbLog
• From StarAI to Nesy
... with some detours on the way
Part V: KBMC, Markov Logic
A key question in AI: dealing with uncertainty, reasoning with relational data, and learning.
Statistical relational learning & probabilistic programming combine:
• logic, databases, programming, ...
• probability theory, graphical models, ...
• learning of parameters and structure
(so far: modeling, inference, learning and dynamics in probabilistic programming; next: KBMC & Markov Logic)
De Raedt, Kersting, Natarajan, Poole: Statistical Relational AI
Flexible and Compact Relational Model for Predicting Grades
“Program” Abstraction:
▪ S, C: logical variables representing students and courses
▪ the set of individuals of a type is called a population
▪ Int(S), Grade(S, C), D(C) are parametrized random variables
Grounding:
• for every student s, there is a random variable Int(s)
• for every course c, there is a random variable D(c)
• for every (s, c) pair there is a random variable Grade(s, c)
• all instances share the same structure and parameters
ProbLog by example: Grading
Shows relational structure
grounded model: replace variables by constants
Works for any number of students / classes (for 1000 students and 100 classes, you get 101,100 random variables); still only a few parameters
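To make the grounding concrete, here is a minimal ProbLog-style sketch of such a grading model (all predicate names, constants and probabilities are illustrative, not taken from the slides):

% Population: two students and one course; adding individuals adds
% only facts, never parameters.
student(s1). student(s2). course(c1).

% Int(S) and D(C) become one random variable per individual.
0.4 :: int(S) :- student(S).     % Int(S)
0.5 :: diff(C) :- course(C).     % D(C)

% Grade(S,C): the four parameters below are shared (tied) across all
% (student, course) pairs.
0.8 :: grade(S,C,a) :- student(S), course(C), int(S), \+diff(C).
0.5 :: grade(S,C,a) :- student(S), course(C), int(S), diff(C).
0.3 :: grade(S,C,a) :- student(S), course(C), \+int(S), \+diff(C).
0.1 :: grade(S,C,a) :- student(S), course(C), \+int(S), diff(C).

query(grade(s1,c1,a)).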
With SRL / PP:
• build and learn compact models,
• transfer from one set of individuals to other sets,
• reason also about exchangeability,
• build even more complex models,
• incorporate background knowledge
Lots of proposals in the literature, e.g.
• relational Markov networks (RMNs) [Taskar et al 2002]
• Markov logic networks (MLNs) [Richardson & Domingos 2006]
• probabilistic soft logic (PSL) [Broecheler et al 2010]
• FACTORIE [McCallum et al 2009]
• Bayesian logic programs (BLPs) [Kersting & De Raedt 2001]
• relational Bayesian networks (RBNs) [Jaeger 2002]
• logical Bayesian networks (LBNs) [Fierens et al 2005]
• probabilistic relational models (PRMs) [Koller & Pfeffer 1998]
• Bayesian logic (BLOG) [Milch et al 2005]
• CLP(BN) [Santos Costa et al 2008]
• and many more ...
Probabilistic Relational Models (PRMs) [Getoor, Koller, Pfeffer]
[Figure: relational schema with class Person, attributes Bloodtype, M-chromosome and P-chromosome, plus (Father) and (Mother) slots and an associated CPD table]
View:
bt(Person) = BT.
pc(Person) = PC.
mc(Person) = MC.
father(Father, Person). mother(Mother, Person).

Dependencies (CPDs associated with):
bt(Person)=BT | pc(Person)=PC, mc(Person)=MC.
pc(Person)=PC | pc_father(Person)=PCf, mc_father(Person)=MCf.
pc_father(Person)=PCf | father(Father,Person), pc(Father)=PCf.
...
Probabilistic Relational Models (PRMs) as Bayesian Logic Programs (BLPs)
Extension:
father(rex,fred). mother(ann,fred). father(brian,doro). mother(utta,doro). father(fred,henry). mother(doro,henry).

Intension:
bt(Person)=BT | pc(Person)=PC, mc(Person)=MC.
pc(Person)=PC | pc_father(Person)=PCf, mc_father(Person)=MCf.
mc(Person)=MC | pc_mother(Person)=PCm, mc_mother(Person)=MCm.
pc_father(Person)=PCf | father(Father,Person), pc(Father)=PCf.
...

[Figure: the resulting ground Bayesian network, with mc, pc and bt random variables (and their states) for rex, ann, fred, brian, utta, doro and henry]
Answering Queries
P(bt(ann)) ?
[Figure: the ground network with the support network for bt(ann) highlighted; only the part of the network on which the query depends has to be considered]
Answering Queries
P(bt(ann), bt(fred)) ?
[Figure: the ground network with the support network for the joint query highlighted]
By Bayes' rule:
P(bt(ann) | bt(fred)) = P(bt(ann), bt(fred)) / P(bt(fred))
Combining Rules
• A student reads two books: the CPDs P(A|B) and P(A|C) must be combined into a single P(A|B,C)
• Typical combining rules: noisy-or, noisy-max, ...

prepared(Student,Topic) | read(Student,Book), discusses(Book,Topic).
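In ProbLog-style languages the noisy-or arises automatically: each ground instance of a probabilistic clause acts as an independent cause. A small sketch (the constants and the 0.7 parameter are illustrative):

0.7 :: prepared(S,T) :- read(S,B), discusses(B,T).

read(ann, bk1). read(ann, bk2).
discusses(bk1, srl). discusses(bk2, srl).

% Two ground instances support prepared(ann, srl), so
% P(prepared(ann, srl)) = 1 - (1 - 0.7) * (1 - 0.7) = 0.91   (noisy-or)
query(prepared(ann, srl)).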
Knowledge Based Model Construction
Extension + Intension => Probabilistic Model
Advantages:
• the same intension can be used for multiple extensions
• parameters are shared / tied together
• unification is essential
• learning becomes feasible: max. likelihood parameter estimation & structure learning
Bayesian Logic Programs

Markov chain:
% apriori nodes
nat(0).
% aposteriori nodes
nat(s(X)) | nat(X).
Ground network: nat(0) → nat(s(0)) → nat(s(s(0))) → ...

HMM:
% apriori nodes
state(0).
% aposteriori nodes
state(s(Time)) | state(Time).
output(Time) | state(Time).
Ground network: state(0) → state(s(0)) → ..., each state(Time) emitting output(Time)

DBN:
% apriori nodes
n1(0).
% aposteriori nodes
n1(s(TimeSlice)) | n2(TimeSlice).
n2(TimeSlice) | n1(TimeSlice).
n3(TimeSlice) | n1(TimeSlice), n2(TimeSlice).
Ground network: n1, n2, n3 replicated per time slice

Pure Prolog and Bayesian nets are special cases.
Learning BLPs
RVs + States = (partial) Herbrand interpretation
Probabilistic learning from interpretations:
Family(1): pc(brian)=b, bt(ann)=a, bt(brian)=?, bt(dorothy)=a
Family(2): bt(cecily)=ab, pc(henry)=a, mc(fred)=?, bt(kim)=a, pc(bob)=b
Family(3): pc(rex)=b, bt(doro)=a, bt(brian)=?
Background: m(ann,dorothy), f(brian,dorothy), m(cecily,fred), f(henry,fred), f(fred,bob), m(kim,bob), ...
Parameter Estimation
The data (the interpretations above) plus the clauses

bt(Person,BT) | pc(Person,PC), mc(Person,MC).
pc(Person,PC) | pc_father(Person,PCf), mc_father(Person,MCf).
mc(Person,MC) | pc_mother(Person,PCm), mc_mother(Person,MCm).

yields maximum-likelihood parameter estimates.
Parameter tying: all ground instances of a clause share the same CPD parameters.
Expectation Maximization
Given a logic program L and initial parameters θ0, the EM algorithm iterates until convergence:
• E-step: with the current model (M, θk), compute the expected counts of a clause, summing over data cases DC and ground instances GI:
  Σ_DC Σ_GI P(head(GI), body(GI) | DC)
• M-step: update the parameters (ML, MAP); for a CPD entry, divide the expected counts above by
  Σ_DC Σ_GI P(body(GI) | DC)
Markov Logic: Intuition
▪ Undirected graphical model
▪ A logical KB is a set of hard constraints on the set of possible worlds
▪ Let's make them soft constraints: when a world violates a formula, it becomes less probable, not impossible
▪ Give each formula a weight (higher weight ⇒ stronger constraint)

P(world) ∝ exp( Σ weights of formulas it satisfies )
A possible worlds view
Say we have two domain elements Anna and Bob as well as two predicates Friends and Happy.
[Figure: the four possible worlds over Friends(Anna,Bob) and Happy(Bob)]
slides by Pedro Domingos
Logical formulas such as ¬Friends(Anna,Bob) ∨ Happy(Bob) exclude possible worlds.
[Figure: the world with Friends(Anna,Bob) and ¬Happy(Bob) is excluded]
Instead of excluding worlds, assign them potentials:
Φ(¬Friends(Anna,Bob) ∨ Happy(Bob)) = 1
Φ(Friends(Anna,Bob) ∧ ¬Happy(Bob)) = 0.75
[Figure: the three worlds satisfying the rule get potential 1, the violating world 0.75]
It is four times as likely that the rule holds (total weight 3 · 1 versus 0.75).
Or, written as a log-linear model:
w(¬Friends(Anna,Bob) ∨ Happy(Bob)) = log(1/0.75) ≈ 0.29
This can also be viewed as building a graphical model.
Suppose we have two constants: Anna (A) and Bob (B).
[Figure: ground Markov network with nodes Smokes(A), Smokes(B), Cancer(A), Cancer(B)]
Markov Logic
[Figure: adding the Friends/2 predicate grounds to the network over Cancer(A), Cancer(B), Smokes(A), Smokes(B), Friends(A,A), Friends(A,B), Friends(B,A), Friends(B,B)]
Represented as a factor graph over 𝑪(𝑨), 𝑺(𝑨), 𝑭(𝑨,𝑨), 𝑭(𝑨,𝑩), 𝑭(𝑩,𝑨), 𝑭(𝑩,𝑩), 𝑺(𝑩), 𝑪(𝑩), with factors F1(A), F1(B) and F2(A,A), F2(A,B), F2(B,A), F2(B,B):

P(Interpretation) ∝ ∏_{i,θ} Fi(X,Y)θ = ∏_{i,θ} exp( wi · 𝕀(Interpretation ⊧ Fi(X,Y)θ) )
Markov Logic
▪ A Markov Logic Network (MLN) is a set of pairs (F, w) where
  ▪ F is a formula in first-order logic
  ▪ w is a real number
▪ An MLN defines a Markov network with
  ▪ one node for each grounding of each predicate in the MLN
  ▪ one feature for each grounding of each formula F in the MLN, with the corresponding weight w
▪ Probability of a world x:

  P(x) = (1/Z) exp( Σ_i wi ni(x) )

  where wi is the weight of formula i and ni(x) the number of true groundings of formula i in x.
Possible Worlds
A vocabulary determines the possible worlds (the logical interpretations).
[Figure: all 16 worlds over Smokes(Alice), Smokes(Bob), Friends(Alice,Bob), Friends(Bob,Alice)]
Slides adapted from Guy Van den Broeck
A logical theory
∀x,y: Smokes(x) ∧ Friends(x,y) ⇒ Smokes(y)
Models = the interpretations that satisfy the theory.
[Figure: the subset of the possible worlds that satisfy the theory]
First-Order Model Counting
The first-order model count (~ #SAT) of the theory ∀x,y: Smokes(x) ∧ Friends(x,y) ⇒ Smokes(y) is the number of worlds over Smokes(Alice), Smokes(Bob), Friends(Alice,Bob), Friends(Bob,Alice) that satisfy it.
Markov Logic
1.5  ∀x,y: Smokes(x) ∧ Friends(x,y) ⇒ Smokes(y)
Counting only substitutions for which x ≠ y (here x=Alice, y=Bob and x=Bob, y=Alice), example worlds get weights such as
(1/Z) exp(1.5 · 2), (1/Z) exp(1.5 · 2) and (1/Z) exp(1.5 · 1).
A Markov Logic theory
Z is the partition function: it sums the exponentiated weights over all possible worlds.
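As a worked sketch for this example (two constants, the single formula with weight 1.5, and n(x) counting its true groundings over the two x ≠ y substitutions):

\[
P(x) \;=\; \frac{\exp\bigl(1.5\, n(x)\bigr)}{Z},
\qquad
Z \;=\; \sum_{x'} \exp\bigl(1.5\, n(x')\bigr),
\]

where the sum ranges over all 16 truth assignments to the four ground atoms.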
Weighted First-Order Model Counting
Take a logical theory and a weight function for predicates, e.g.
Smokes → 1, ¬Smokes → 2, Friends → 4, ¬Friends → 1.
The weighted first-order model count sums, over the models of the theory, the product of the weights of the literals true in each model.
Related to ProbLog inference!
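In symbols (a standard formulation, with Δ the theory and w the literal weight function above):

\[
\mathrm{WFOMC}(\Delta, w) \;=\; \sum_{\omega \,\models\, \Delta} \;\prod_{\ell \in \omega} w(\ell).
\]

ProbLog inference reduces to this setting: a probabilistic fact p :: f yields the weights w(f) = p and w(¬f) = 1 − p.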
Parameter Learning

∂/∂wi log Pw(x) = ni(x) − Ew[ni(x)]

where ni(x) is the number of times clause i is true in the data and Ew[ni(x)] the expected number of times clause i is true according to the MLN.
Has been used for generative learning (pseudo-likelihood); many variations (also discriminative); applications in networks, NLP, bioinformatics, ...
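A minimal sketch of the corresponding update rule (assuming plain gradient ascent with learning rate η; practical MLN learners optimize the pseudo-likelihood or use more elaborate schemes):

\[
w_i \;\leftarrow\; w_i + \eta \,\bigl( n_i(x) - \mathbb{E}_w[n_i(x)] \bigr).
\]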
Applications
▪ Natural language processing, Collective Classification, Social Networks, Activity Recognition, …
Information Extraction
Running example (the same two papers, cited in different, noisy formats):
Parag Singla and Pedro Domingos, “Memory-EfficientInference in Relational Domains” (AAAI-06).
Singla, P., & Domingos, P. (2006). Memory-efficentinference in relatonal domains. In Proceedings of theTwenty-First National Conference on Artificial Intelligence(pp. 500-505). Boston, MA: AAAI Press.
H. Poon & P. Domingos, Sound and Efficient Inferencewith Probabilistic and Deterministic Dependencies”, inProc. AAAI-06, Boston, MA, 2006.
P. Hoifung (2006). Efficent inference. In Proceedings of theTwenty-First National Conference on Artificial Intelligence.
Segmentation
[The same citation text, segmented into Author / Title / Venue fields]
Entity Resolution
[The same citation text; entity resolution determines which mentions refer to the same paper and author]
Roadmap
• Prob. Programming - Modeling
• Inference
• Learning
• Dynamics
• KBMC & Markov Logic
• DeepProbLog
• From StarAI to Nesy
... with some detours on the way
Part VI: DeepProbLog
Learning
Three different paradigms for learning: probability, logic, and neural.
Integrate deep learning and (probabilistic) logics?
[Figure: the alarm Bayesian network (earthquake, burglary, alarm, hears_alarm, calls) next to a CLEVR-style visual question: “Are there an equal number of large things and metal spheres?”]
Deep Learning + Logic = ?
Cf. the Visual Genome and CLEVR datasets.
Neural-symbolic learning and reasoning: A survey and interpretation. [Besold et al.]
NeSy state-of-the-art
• The integration of perception and reasoning is still an open problem.
• Main idea: inject/encode logic into neural networks (and let the NN do the rest)
  • encoding logic in the weights of neural networks
  • learning embeddings for logical entities
  • logical constraints as a regularizer during training
  • templating neural networks
    • building neural networks from functional programs
    • building neural networks from backwards proving
  • differentiable neural computers / program interpreters
State-of-the-art
• Encoding logic in the weights of neural networks
• Logic Tensor Networks (Serafini et al.)
• A Semantic Loss Function for Deep Learning with Symbolic Knowledge (Xu et al.)
• Ontology Reasoning with Deep Neural Networks (Hohenecker et al.)
• Semantic Based Regularization (Diligenti et al.)
State-of-the-art
• Templates for neural networks (a kind of Knowledge Base Model Construction)
• Lifted Relational Neural Networks (Šourek et al.)
• Neural Theorem Prover (Rocktäschel et al.)
• Neural Module Networks (Andreas et al.)
State-of-the-art
• Differentiable neural computers / program interpreters
• Differentiable Neural Computer (Graves et al.)
• Neural Programmer-Interpreters (Reed et al.)
• Differentiable Forth Interpreter (Bošnjak et al.)
DeepProbLog
Idea: inject neural networks into logic by extending an existing PLP language.
DeepProbLog = ProbLog + neural predicate
The neural predicate makes neural networks a first-class citizen.
Related work vs. DeepProbLog:
• logic is made less expressive vs. full expressivity is retained
• logic is pushed into the neural network vs. clean separation
• fuzzy logic vs. probabilistic logic
• language semantics unclear vs. clear semantics
NeurIPS 2018
Neural predicate
• Neural networks have uncertainty in their predictions
• A normalized output can be interpreted as a probability distribution
• Neural predicate models the output as probabilistic facts
• No changes needed in the probabilistic host language
The neural predicate
The output of the neural network becomes probabilistic facts in DeepProbLog.
Example:
nn(mnist_net, [X], Y, [0 ... 9] ) :: digit(X,Y).
Instantiated into a (neural) Annotated Disjunction:
0.04::digit( ,0) ; 0.35::digit( ,1) ; ... ; 0.53::digit( ,7) ; ... ; 0.014::digit( ,9).
DeepProbLog exemplified: MNIST addition
Task: Classify pairs of MNIST digits with their sum
Benefit of DeepProbLog:
• Encode addition in logic
• Separate addition from digit classification
nn(mnist_net, [X], Y, [0 ... 9] ) :: digit(X,Y).
addition(X,Y,Z) :- digit(X,N1), digit(Y,N2), Z is N1+N2.
Examples: addition( , ,8), addition( , ,4), addition( , ,11), …
Grounded for a specific example (the blanks stand for MNIST images):
addition( , ,8) :- digit( ,N1), digit( ,N2), 8 is N1 + N2.
Example
Learn to classify the sum of pairs of MNIST digits; the individual digits are not labeled!
E.g. ( , , 8)
Could be done by a CNN: classify the concatenation of both images into 19 classes. However:
MNIST Addition
• Pairs of MNIST images, labeled with their sum
• Baseline: a CNN that classifies the concatenation of both images into classes 0 ... 18
• DeepProbLog: a CNN that classifies single images into 0 ... 9, plus two lines of DeepProbLog code
Multi-digit MNIST addition
Result:
number([], Result, Result).
number([H|T], Acc, Result) :-
    digit(H, Nr), Acc2 is Nr + 10*Acc, number(T, Acc2, Result).
number(X, Y) :- number(X, 0, Y).

multiaddition(X, Y, Z) :-
    number(X, X2), number(Y, Y2), Z is X2 + Y2.
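As a usage sketch, here is how the accumulator reads the list [img8, img4] as the number 84 (the image constants are placeholders for MNIST inputs, so this is a hand trace, not a runnable query):

% ?- number([img8, img4], N).
%    number([img8, img4], 0, N):  digit(img8, 8), Acc2 is 8 + 10*0    % Acc2 = 8
%    number([img4], 8, N):        digit(img4, 4), Acc2 is 4 + 10*8    % Acc2 = 84
%    number([], 84, N):           N = 84.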
(Deep)ProbLog : Inference
Inference / Reasoning
• Most of the work in PP and StarAI is on inference
• It is hard (complexity-wise)
• Many inference methods: exact, approximate, sampling, lifted, ...
• Inference is the key to learning
ProbLog Inference
Answering a query in a ProbLog program happens in four steps:
1. Ground the program w.r.t. the query
2. Rewrite the ground logic program into a propositional logic formula
3. Compile the formula into an arithmetic circuit
4. Evaluate the arithmetic circuit

0.1 :: burglary.        0.5 :: hears_alarm(mary).
0.2 :: earthquake.      0.4 :: hears_alarm(john).
alarm :- earthquake.
alarm :- burglary.
calls(X) :- alarm, hears_alarm(X).

Query: ?- P(calls(mary))
Step 1: grounding w.r.t. the query (only the relevant part!):

0.1 :: burglary.        0.5 :: hears_alarm(mary).
0.2 :: earthquake.      0.4 :: hears_alarm(john).
alarm :- earthquake.
alarm :- burglary.
calls(mary) :- alarm, hears_alarm(mary).
calls(john) :- alarm, hears_alarm(john).
Step 2: rewrite the ground logic program into a propositional logic formula:

calls(mary) ↔ hears_alarm(mary) ∧ (burglary ∨ earthquake)
Steps 3 and 4: compile the formula into an arithmetic circuit (knowledge compilation) and evaluate it:

calls(mary) ↔ hears_alarm(mary) ∧ (burglary ∨ earthquake)

[Figure: arithmetic circuit with leaves earthquake = 0.2, ¬earthquake = 0.8, burglary = 0.1, hears_alarm(mary) = 0.5; AND(0.8, 0.1) = 0.08, AND(0.08, 0.5) = 0.04, AND(0.2, 0.5) = 0.1, and the OR at the root calls(mary) gives 0.04 + 0.1 = 0.14]
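Reading the circuit bottom-up (the two AND branches split on earthquake, so they are mutually exclusive and the OR may simply add):

\[
P(\mathit{calls}(\mathit{mary})) \;=\; 0.5 \cdot \bigl(0.2 + 0.8 \cdot 0.1\bigr) \;=\; 0.5 \cdot 0.28 \;=\; 0.14.
\]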
Optimization
PLP usually considers the inference setting.
DeepProbLog focuses on optimization:
• we have a set of tuples (q, p)
• q is a query and p its desired success probability
DeepProbLog
• We use algebraic ProbLog (aProbLog) with the gradient semiring
• What is aProbLog?
  • a version of ProbLog where the probability semiring is replaced by an arbitrary semiring structure
  • labels on facts are elements of the semiring
  • cf. the different semirings for WMC, #SAT, ...
More examples:
[Table: further semirings and their label functions]
Implementing DeepProbLog
1. Evaluating the neural networks:
   • instantiate the neural annotated disjunction
   • happens during grounding
   • ProbLog already had support for external functions
2. Performing backpropagation in the neural networks:
   • no direct loss for the neural networks
   • the loss is defined on the logic level
   • derive the gradient in the logic
   • start backpropagation with the derived gradient
Deriving the gradient
• The outputs of the neural network are probabilistic facts
• Probabilistic facts are leaves in the AC
• The AC is a differentiable structure
• We can compute the gradient in the forward pass, along with the probability
• aProbLog + gradient semiring
Gradient semiring

t(0.2) :: earthquake.
t(0.1) :: burglary.
0.5 :: hears_alarm.
alarm :- earthquake.
alarm :- burglary.
calls :- alarm, hears_alarm.

(t(p) marks a learnable parameter initialized at p.)
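For reference, the gradient semiring pairs each probability with a gradient vector and combines them with dual-number style operations (this is the formulation behind aProbLog's gradient semiring):

\[
(p, \vec{g}_p) \oplus (q, \vec{g}_q) = \bigl(p + q,\; \vec{g}_p + \vec{g}_q\bigr),
\qquad
(p, \vec{g}_p) \otimes (q, \vec{g}_q) = \bigl(p\,q,\; p\,\vec{g}_q + q\,\vec{g}_p\bigr).
\]

Evaluating the AC once in this semiring therefore yields the query probability and all partial derivatives in a single forward pass.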
The DeepProbLog pipeline
EXPERIMENTS
Program Induction
• Approach similar to that of ‘Programming with a Differentiable Forth Interpreter’ (∂4) [1]
  • partially defined Forth program with slots / holes
  • slots are filled by a neural network (encoder / decoder)
  • fully differentiable interpreter: NNs are trained with input / output examples
• DeepProbLog program with switches
  • switches are controlled by neural networks

[1] Matko Bošnjak, Tim Rocktäschel, Jason Naradowsky, Sebastian Riedel: Programming with a Differentiable Forth Interpreter. ICML 2017: 547-556
Tasks [1]
● Sorting
  ○ sort lists of numbers using bubble sort
  ○ hole: swap or don't swap when comparing two numbers
● Addition
  ○ add two numbers and a carry
  ○ hole: what is the resulting digit and carry on each step
  ○ (note: not MNIST digits, but actual numbers)
● Word Algebra Problems
  ○ e.g. “Ann has 8 apples. She buys 4 more. She distributes them equally among her 3 kids. How many apples does each child receive?”
  ○ hole: the sequence of permuting, swapping and performing operations on the three numbers
Bubble sort implementation, with the holes defined by a neural predicate:

hole(X,Y,X,Y) :- swap(X,Y,0).
hole(X,Y,Y,X) :- swap(X,Y,1).

bubble([X],[],X).
bubble([H1,H2|T],[X1|T1],X) :-
    hole(H1,H2,X1,X2), bubble([X2|T],T1,X).

bubblesort([],L,L).
bubblesort(L,L3,Sorted) :-
    bubble(L,L2,X), bubblesort(L2,[X|L3],Sorted).

sort(L,L2) :- bubblesort(L,[],L2).
Example DeepProbLog solution
Result
Noisy Addition

nn(classifier, [X], Y, [0 .. 9]) :: digit(X,Y).
t(0.2) :: noisy.
1/19 :: uniform(X,Y,0) ; ... ; 1/19 :: uniform(X,Y,18).
addition(X,Y,Z) :- noisy, uniform(X,Y,Z).
addition(X,Y,Z) :- \+noisy, digit(X,N1), digit(Y,N2), Z is N1+N2.

(a) The DeepProbLog program.

nn(classifier,[a],0) :: digit(a,0); nn(classifier,[a],1) :: digit(a,1).
nn(classifier,[b],0) :: digit(b,0); nn(classifier,[b],1) :: digit(b,1).
t(0.2)::noisy.
1/19::uniform(a,b,1).
addition(a,b,1) :- noisy, uniform(a,b,1).
addition(a,b,1) :- \+noisy, digit(a,0), digit(b,1).
addition(a,b,1) :- \+noisy, digit(a,1), digit(b,0).

(b) The ground DeepProbLog program for the query addition(a,b,1).
Noisy Addition
[Figure: the arithmetic circuit for the query addition(a,b,1), evaluated in the gradient semiring. Every node carries a pair (p, ∇p), the probability plus its gradient w.r.t. p_noisy, p_digit(a,0..9) and p_digit(b,0..9); e.g. the leaf noisy = (0.2, [1, 0, 0, ...]), digit(a,0) = (0.8, [0, 1, 0, ...]), and the root evaluates to (0.411, [-0.447, 0.48, 0.16, ..., 0.08, 0.64, ...])]
Noisy Addition

Table 3: The accuracy on the test set for T4.

Fraction of noise                 0.0     0.2     0.4     0.6     0.8     1.0
Baseline                         93.46   87.85   82.49   52.67    8.79    5.87
DeepProbLog                      97.20   95.78   94.50   92.90   46.42    0.88
DeepProbLog w/ explicit noise    96.64   95.96   95.58   94.12   73.22    2.92
Learned fraction of noise         0.000   0.212   0.415   0.618   0.803   0.985

DeepProbLog is noise tolerant, even retaining an accuracy of 73.2% with 80% noisy labels. As shown in the last row, it is also able to learn the fraction of noisy labels in the data. This shows that the model is able to recognize which examples have noisy labels.

6.2. Program Induction

The second set of problems demonstrates that DeepProbLog can perform program induction. We follow the program sketching [25] setting of differentiable Forth (∂4) [8], where holes in given programs need to be filled by neural networks trained on input-output examples for the entire program. As in their work, we consider three tasks: addition, sorting [26] and word algebra problems (WAPs) [27].

T5: forth_addition([4], [8], 1, [1, 3])
The input consists of two numbers, represented as lists of digits, and a carry. The output is the sum of the numbers and the carry. The program specifies the basic addition algorithm in which we go from right to left over all digits, calculating the sum of two digits and taking the carry over to the next pair. The hole in this program corresponds to calculating the resulting digit (result/4) and carry (carry/4), given two digits and the previous carry.
Addition of images only
• Examples of the form addition( , , ), where all three arguments are images
• What will happen?
Addition of images only
• Examples of the form addition( , , )
• What will happen?
  • the usual loss function will map all images onto 0 (0 + 0 = 0)
  • one can compensate for this by adding a regularisation term based on maximum entropy
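A sketch of such a regularized objective (the exact form here is an assumption, not necessarily the one used in the paper; H is the entropy of the neural predicate's output distribution and λ a hyperparameter):

\[
\mathcal{L} \;=\; \mathcal{L}_{\text{query}} \;-\; \lambda\, H\bigl(p_{\text{digit}}\bigr),
\qquad
H(p) \;=\; -\sum_{c=0}^{9} p_c \log p_c .
\]

Rewarding high entropy discourages the degenerate solution that maps every image to the digit 0.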
Simplified Poker
• dealing with uncertainty
• ignore suits and use just A, J, Q and K
• two players, two cards, and one community card
• train the neural network to recognize the four cards
• reason probabilistically about the non-observed card
• learn the distribution of the unlabeled community card
nn(m_swap, [X,Y]) :: swap(X,Y).

hole(X,Y,X,Y) :- \+swap(X,Y).
hole(X,Y,Y,X) :- swap(X,Y).

bubble([X],[],X).
bubble([H1,H2|T],[X1|T1],X) :-
    hole(H1,H2,X1,X2), bubble([X2|T],T1,X).

bubblesort([],L,L).
bubblesort(L,L3,Sorted) :-
    bubble(L,L2,X), bubblesort(L2,[X|L3],Sorted).

forth_sort(L,L2) :- bubblesort(L,[],L2).
Listing 6: Forth sorting sketch (T6)
Figure A.10: Examples of cards used as input for the Poker without perturbations (T9) experiment.

... calculate the result.

In Listing 8, there are two neural predicates: coin1/2 and coin2/2. Their input is the image of the two coins (e.g. Figure 9). The output is heads or tails. The coins/2 predicate classifies both coins using these two predicates and then performs the comparison of the classes with the compare/3 predicate.

In Listing 9, there is a single neural predicate rank/2 that takes as input the image of a card and classifies it as either a jack, queen, king or ace. There is also an AD with learnable parameters that represents the distribution of the unseen community card (house_rank/1). The hand/2 predicate's first argument is a list of 3 cards. It unifies the output with any of the valid hands that these cards contain. The valid hands are: high card, pair (two cards have the same rank), three of a kind (three cards have the same rank), low straight (jack, queen, king) and high straight (queen, king, ace). Each hand is assigned a rank with the ...
Table 8: The results for the Poker experiment (T9).

Distribution    Jack           Queen          King           Ace
Actual          0.2            0.4            0.15           0.25
Learned         0.203 ± 0.002  0.396 ± 0.002  0.155 ± 0.003  0.246 ± 0.002

... two cards and the community card. For simplicity, we only use the jack, queen, king and ace. We also do not consider the suits of the cards. The input consists of 4 images that show the cards dealt to the two players. Additionally, every example is labeled with the chance that the game is won, lost or ended in a draw, e.g.:

0.8 :: poker([Q♥, Q♦, A♦, K♣], loss)

We expect DeepProbLog to:
• train the neural network to recognize the four cards
• reason probabilistically about the non-observed card
• learn the distribution of the unlabeled community card

To make DeepProbLog converge more reliably, we add some examples with additional supervision. Namely, in 10% of the examples we additionally specify the community card, i.e.

poker([Q♥, Q♦, A♦, K♣], A♦, loss).

This also showcases one of the strengths of DeepProbLog: it can make use of examples that have different levels of observability. The loss function used in this experiment is the MSE between the predicted and target probabilities.

Results. We ran the experiment 10 times. Out of these 10 runs, 4 didn't converge on the correct solution. The average values of the learned parameters for the remaining 6 runs are shown in Table 8. As can be seen, DeepProbLog is able to correctly learn the probabilistic parameters. In these 6 runs, the neural network also correctly learned to classify all card types, achieving a 100% accuracy. The other runs did not converge because some of the classes were permuted (i.e., queens predicted as aces and vice versa) or multiple classes mapped onto the same one (queens and kings were both predicted as kings).
in 6/10 experiments
Challenges
• The data needs to provide a signal (cf. Addition of images only, and Poker): use a curriculum plus regularization
• Scaling up:
  • still using the exact inference of ProbLog
  • circuits can be very large
  • work on approximate inference is ongoing
Further Reading
• One book
• Three websites to start:
  • http://probmods.org/ Probabilistic Models of Cognition (Church)
  • http://dtai.cs.kuleuven.be/problog/ (check also [De Raedt & Kimmig, MLJ 15])
  • http://alchemy.cs.washington.edu/ Markov Logic (check also [Domingos & Lowd], Markov Logic, Morgan & Claypool)
Thanks!
http://dtai.cs.kuleuven.be/problog
Maurice Bruynooghe, Bart Demoen, Anton Dries, Daan Fierens, Jason Filippou, Bernd Gutmann, Manfred Jaeger, Gerda Janssens, Kristian Kersting, Angelika Kimmig, Theofrastos Mantadelis, Wannes Meert, Bogdan Moldovan, Siegfried Nijssen, Davide Nitti, Joris Renkens, Kate Revoredo, Ricardo Rocha, Vitor Santos Costa, Dimitar Shterionov, Ingo Thon, Hannu Toivonen, Guy Van den Broeck, Mathias Verbeke, Jonas Vlasselaer
PLP Systems
• PRISM http://sato-www.cs.titech.ac.jp/prism/
• ProbLog2 http://dtai.cs.kuleuven.be/problog/
• Yap Prolog http://www.dcc.fc.up.pt/~vsc/Yap/ includes
  • ProbLog1
  • cplint https://sites.google.com/a/unife.it/ml/cplint
  • CLP(BN)
  • LP2
• PITA in XSB Prolog http://xsb.sourceforge.net/
• AILog2 http://artint.info/code/ailog/ailog2.html
• SLPs http://stoics.org.uk/~nicos/sware/pepl
• contdist http://www.cs.sunysb.edu/~cram/contdist/
• DC https://code.google.com/p/distributional-clauses
• WFOMC http://dtai.cs.kuleuven.be/ml/systems/wfomc