
[IEEE 2010 International Conference on Complex, Intelligent and Software Intensive Systems (CISIS) - Krakow, Poland (2010.02.15-2010.02.18)]


A Novel Structure Refining Algorithm for Statistical-Logical Models

Marenglen Biba∗†, Elton Ballhysa†, Narasimha Rao Vajjala†, Vijay Raju Mullagiri†
∗Department of Computer Science, University of Bari, Via E. Orabona, 4, 70125, Bari, Italy
†Department of Computer Science, University of New York Tirana, Rr. Komuna e Parisit, Tirana, Albania
Email: [email protected], [email protected], [email protected], [email protected]

Abstract—Statistical Relational Learning (SRL) is a growing field in Machine Learning that aims at the integration of logic-based learning approaches with probabilistic graphical models. Markov Logic Networks (MLNs) are one of the state-of-the-art SRL models; they combine first-order logic and Markov networks (MNs) by attaching weights to first-order formulas and viewing these as templates for features of MNs. Learning models in SRL consists in learning the structure (logical clauses in MLNs) and the parameters (weights for each clause in MLNs). Structure learning of MLNs is performed by maximizing a likelihood function over relational databases, and MLNs have been successfully applied to problems in relational and uncertain domains. Theory revision is the process of refining an existing theory by generalizing or specializing it depending on the nature of the new evidence: if positive evidence is not explained, the theory must be generalized, whereas if negative evidence is explained, the theory must be specialized in order to exclude the negative example. Current SRL systems do not revise an existing model but learn structure and parameters from scratch. In this paper we propose a novel refining algorithm for theory revision under the statistical-logical framework of MLNs. The novelty of the proposed approach consists in a tight integration of structure and parameter learning of an SRL model in a single step, inside which a specialization or generalization step is performed for theory refinement.

Keywords-Statistical Relational Learning; Theory revision; Markov Logic Networks

I. INTRODUCTION

Traditionally, Machine Learning research has fallen into two separate subfields: one that has focused on logical representations, and one on statistical ones. Logical Machine Learning approaches based on logic programming, description logics, classical planning, rule induction, etc., tend to emphasize handling complexity. Statistical Machine Learning approaches like Bayesian networks, hidden Markov models, statistical parsing, neural networks, etc., tend to emphasize handling uncertainty. However, learning systems must be able to handle both for real-world applications. The first attempts to integrate logic and probability were made in Artificial Intelligence and date back to the works in [1], [2], [3]. Later, several authors began using logic programs to compactly specify Bayesian networks, an approach known as knowledge-based model construction [4].

A central problem in Machine Learning has always been learning in rich representations that enable dealing with structure and relations. Much progress has been achieved in the field of relational learning, also known as Inductive Logic Programming [5]. On the other hand, successful statistical machine learning models, with their roots in statistics and pattern recognition, have made it possible to deal with noisy and uncertain domains in a robust manner. Powerful models such as Probabilistic Graphical Models [6] and related algorithms have the power to handle uncertainty but lack the capability of dealing with structured domains.

Recently, in the burgeoning field of Statistical Relational Learning (SRL) [7], or Probabilistic Inductive Logic Programming [8], several approaches for combining logic and probability have been proposed. A growing amount of work has been dedicated to integrating subsets of first-order logic with probabilistic graphical models, to extending logic programs with a probabilistic semantics, or to integrating other formalisms with probability. Some of the logic-based approaches are: Knowledge-based Model Construction [4], Bayesian Logic Programs [9], Stochastic Logic Programs [10], [11], Probabilistic Horn Abduction [12], Queries for Probabilistic Knowledge Bases [13], PRISM [14], CLP(BN) [15]. Other approaches include frame-based systems such as Probabilistic Relational Models [16] or the PRM extensions defined in [17], description-logics-based approaches such as those in [18] and P-CLASSIC of [19], database query languages [20], [21], etc.

All these approaches combine probabilistic graphical models with subsets of first-order logic (e.g., Horn clauses). One of the state-of-the-art SRL approaches is Markov logic [22], a powerful representation that has finite first-order logic and probabilistic graphical models as special cases. It extends first-order logic by attaching weights to formulas, providing the full expressiveness of graphical models and first-order logic in finite domains while remaining well defined in many infinite domains [22], [23]. Weighted formulas

2010 International Conference on Complex, Intelligent and Software Intensive Systems

978-0-7695-3967-6/10 $26.00 © 2010 IEEE

DOI 10.1109/CISIS.2010.86


are viewed as templates for constructing Markov Networks (MNs), and in the infinite-weight limit, Markov logic reduces to standard first-order logic. Markov logic avoids the assumption of i.i.d. (independent and identically distributed) data made by most statistical learners by using the power of first-order logic to compactly represent dependencies among objects and relations. In this paper we will focus on this SRL model.

The representation power and the robustness of SRL models in dealing with uncertainty do not solve all the problems present in complex domains. Dealing with unknown or partially observed data is an important problem in Machine Learning. Most SRL models face this problem only from the parameter-setting point of view, by following approaches similar to those developed in the statistical machine learning field. The most used approach is Expectation-Maximization (EM) [24]. In many cases of partially observed or missing data, when more evidence becomes available over time, it is necessary to revise the structure of the model. In this paper we propose a novel refining algorithm that performs theory revision based on Markov logic. The novelty of the proposed approach stands in the tight integration of structure and parameter learning of an SRL model in a single step, inside which a specialization or generalization step is performed. During structure search guided by conditional likelihood, structure evaluation is performed by first trying to logically explain the existing evidence. If this fails, the refinement operator generalizes or specializes the guilty clause. Learning then proceeds with optimal pseudo-likelihood parameter learning using a logically sound theory.

II. MARKOV NETWORKS AND MARKOV LOGIC NETWORKS

A Markov network (also known as a Markov random field) is a model for the joint distribution of a set of variables X = (X1, X2, ..., Xn) ∈ χ [25]. It is composed of an undirected graph G and a set of potential functions. The graph has a node for each variable, and the model has a potential function φk for each clique in the graph. A potential function is a non-negative real-valued function of the state of the corresponding clique. The joint distribution represented by a Markov network is given by:

$$P(X = x) = \frac{1}{Z} \prod_k \phi_k(x_{\{k\}}) \qquad (1)$$

where x_{k} is the state of the k-th clique (i.e., the state of the variables that appear in that clique). Z, known as the partition function, is given by:

$$Z = \sum_{x \in \chi} \prod_k \phi_k(x_{\{k\}}) \qquad (2)$$
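To make Equations (1) and (2) concrete, the following sketch (our own illustration, not from the paper; the clique structure and potential values are invented) computes the joint distribution of a tiny Markov network by brute-force enumeration:

```python
from itertools import product

# Toy Markov network over three binary variables X1, X2, X3 with two
# cliques: {X1, X2} and {X2, X3}. Each potential maps a clique state
# to a non-negative real number.
cliques = [(0, 1), (1, 2)]
potentials = [
    {(0, 0): 3.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 3.0},  # phi_1 over (X1, X2)
    {(0, 0): 2.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 2.0},  # phi_2 over (X2, X3)
]

def unnormalized(x):
    """Product of clique potentials for a full state x (Eq. 1 numerator)."""
    p = 1.0
    for clique, phi in zip(cliques, potentials):
        p *= phi[tuple(x[i] for i in clique)]
    return p

# Partition function Z: sum over all 2^3 states (Eq. 2).
states = list(product([0, 1], repeat=3))
Z = sum(unnormalized(x) for x in states)

def prob(x):
    return unnormalized(x) / Z

# Sanity check: the probabilities sum to one.
assert abs(sum(prob(x) for x in states) - 1.0) < 1e-12
```

The exponential cost of enumerating all states is exactly why the inference and learning methods discussed later resort to approximations.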

Markov networks are often conveniently represented as log-linear models, with each clique potential replaced by an exponentiated weighted sum of features of the state, leading to:

$$P(X = x) = \frac{1}{Z} \exp\Big(\sum_j w_j f_j(x)\Big) \qquad (3)$$

A feature may be any real-valued function of the state. We will focus on binary features, fj ∈ {0, 1}. In the most direct translation from the potential-function form, there is one feature corresponding to each possible state x_{k} of each clique, with its weight being log φk(x_{k}). This representation is exponential in the size of the cliques. However, a much smaller number of features (e.g., logical functions of the state of the clique) can be specified, allowing for a more compact representation than the potential-function form, particularly when large cliques are present. MLNs take advantage of this.
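The direct translation just described can be checked numerically (again a toy example of our own, with invented potential values): one binary feature per clique state, weighted by log φ, reproduces exactly the distribution of the potential form:

```python
import math
from itertools import product

# One binary feature per state of a single clique over (X1, X2);
# the feature weight is log(phi) of that state (Eq. 3 vs. Eq. 1).
phi = {(0, 0): 3.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 3.0}
features = [(state, math.log(v)) for state, v in phi.items()]

def score(x):
    # sum_j w_j f_j(x), where f_j(x) = 1 iff x matches the j-th state
    return sum(w for state, w in features if x == state)

states = list(product([0, 1], repeat=2))
Z_potential = sum(phi[x] for x in states)
Z_loglinear = sum(math.exp(score(x)) for x in states)

# Both parameterizations yield the same distribution.
for x in states:
    p1 = phi[x] / Z_potential                 # potential form (Eq. 1)
    p2 = math.exp(score(x)) / Z_loglinear     # log-linear form (Eq. 3)
    assert abs(p1 - p2) < 1e-12
```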

A first-order Knowledge Base (KB) can be seen as a set of hard constraints on the set of possible worlds: if a world violates even one formula, it has zero probability. The basic idea in Markov logic is to soften these constraints: when a world violates one formula in the KB it is less probable, but not impossible. The fewer formulas a world violates, the more probable it is. Each formula has an associated weight that reflects how strong a constraint it is: the higher the weight, the greater the difference in log probability between a world that satisfies the formula and one that does not, other things being equal.

A Markov logic network [22] L is a set of pairs (Fi, wi), where Fi is a formula in first-order logic and wi is a real number. Together with a finite set of constants C = {c1, c2, ..., cp} it defines a Markov network M_{L,C} as follows:

1. M_{L,C} contains one binary node for each possible grounding of each predicate appearing in L. The value of the node is 1 if the ground predicate is true, and 0 otherwise.

2. M_{L,C} contains one feature for each possible grounding of each formula Fi in L. The value of this feature is 1 if the ground formula is true, and 0 otherwise. The weight of the feature is the wi associated with Fi in L. Thus there is an edge between two nodes of M_{L,C} iff the corresponding ground predicates appear together in at least one grounding of one formula in L. An MLN can be viewed as a template for constructing Markov networks. The probability distribution over possible worlds x specified by the ground Markov network M_{L,C} is given by
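Steps 1 and 2 above can be sketched on a tiny example (the Smokes/Friends domain is a common illustration in the Markov logic literature, not this paper's; predicate and constant names here are our assumptions):

```python
# Ground a tiny MLN: one weighted clause
#   Smokes(x) ^ Friends(x, y) => Smokes(y)
# over constants {Anna, Bob}. Step 1: one binary node per ground
# predicate. Step 2: one feature per grounding of the formula.
constants = ["Anna", "Bob"]

# Nodes of M_{L,C}: every grounding of Smokes/1 and Friends/2.
nodes = [("Smokes", (c,)) for c in constants]
nodes += [("Friends", (a, b)) for a in constants for b in constants]

# Features: one per grounding (x, y) of the formula.
groundings = [(a, b) for a in constants for b in constants]

def formula_true(world, x, y):
    """Smokes(x) ^ Friends(x, y) => Smokes(y) in the given world."""
    return (not (world[("Smokes", (x,))] and world[("Friends", (x, y))])) \
        or world[("Smokes", (y,))]

# Example world: Anna smokes, Anna is a friend of Bob, Bob does not smoke.
world = {node: False for node in nodes}
world[("Smokes", ("Anna",))] = True
world[("Friends", ("Anna", "Bob"))] = True

# n_i(x): number of true groundings of the formula in this world,
# the count that appears in Eq. 4.
n = sum(formula_true(world, x, y) for x, y in groundings)
```

Only the grounding (Anna, Bob) is violated here, so three of the four groundings are true.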


$$P(X = x) = \frac{1}{Z} \exp\Big(\sum_{i=1}^{F} w_i n_i(x)\Big) = \frac{1}{Z} \prod_i \phi_i(x_{\{i\}})^{n_i(x)} \qquad (4)$$

where F is the number of formulas in the MLN and ni(x) is the number of true groundings of Fi in x. As formula weights increase, an MLN increasingly resembles a purely logical KB, becoming equivalent to one in the limit of all infinite weights.

In this paper we focus on MLNs whose formulas are function-free clauses and assume domain closure (it has been proven that no expressiveness is lost), ensuring that the Markov networks generated are finite. In this case, the groundings of a formula are formed simply by replacing its variables with constants in all possible ways.

III. STRUCTURE AND PARAMETER LEARNING OF MLNS

A. Generative Structure Learning of MLNs

One of the approaches for learning MN weights is iterative scaling [25]. However, maximizing the likelihood (or posterior) using a quasi-Newton optimization method like L-BFGS has recently been found to be much faster [26]. Regarding structure learning, the authors in [25] induce conjunctive features by starting with a set of atomic features (the original variables), conjoining each current feature with each atomic feature, adding to the network the conjunction that most increases likelihood, and repeating. The work in [27] extends this to the case of conditional random fields, which are Markov networks trained to maximize the conditional likelihood of a set of outputs given a set of inputs.

The first attempt to learn MLNs was [22], where the authors used CLAUDIEN [28] to learn the clauses of MLNs and then learned the weights by maximizing pseudo-likelihood. In [29] another method was proposed that combines ideas from ILP and feature induction of Markov networks. This algorithm, which performs a beam or shortest-first search in the space of clauses guided by a weighted pseudo-log-likelihood (WPLL) measure [30], outperformed that of [22]. Recently, in [31] a bottom-up approach was proposed in order to reduce the search space. This algorithm uses a propositional Markov network learning method to construct template networks that guide the construction of candidate clauses. In this way, it generates fewer candidates for evaluation. In [32], the authors proposed an algorithm based on the iterated local search metaheuristic and showed that, using parallel computation, it is possible to improve over the previous algorithms. In all these algorithms, for every candidate structure, the parameters that optimize the WPLL are set through L-BFGS, which approximates the second derivative of the WPLL by keeping a running finite-sized window of previous first derivatives.

B. Discriminative Structure and Parameter Learning of MLNs

Learning MLNs in a discriminative fashion has produced much better results for predictive tasks than generative approaches, as the results in [33] show. In that work the voted-perceptron algorithm was generalized to arbitrary MLNs by replacing the Viterbi algorithm with a weighted satisfiability solver. The new algorithm is essentially gradient descent with an MPE approximation to the expected sufficient statistics (true clause counts); these can vary widely between clauses, causing the learning problem to be highly ill-conditioned and making gradient descent very slow. In [34] a preconditioned scaled conjugate gradient approach is shown to outperform the algorithm in [33] in terms of learning time and prediction accuracy. This algorithm is based on the scaled conjugate gradient method, and very good results are obtained with a simple approach: per-weight learning rates, with each weight's learning rate being the global one divided by the corresponding clause's empirical number of true groundings.

However, for both these algorithms the structure is supposed to be given by an expert or learned previously; they focus only on the parameter learning task. This can lead to suboptimal results if the clauses given by an expert do not capture the essential dependencies in the domain needed to improve classification accuracy. On the other side, since to the best of our knowledge no attempt has been made to learn the structure of MLNs discriminatively, the clauses learned by generative structure learning algorithms tend to optimize the joint distribution of all the variables. Applying discriminative weight learning after the structure has been learned generatively may therefore lead to suboptimal results, since the goal of the learned structure was never to discriminate query predicates.

Recently, different attempts at discriminative structure learning of MLNs have been proposed. In [35] MLNs were restricted to non-recursive definite clauses, and the ILP system ALEPH [36] was used to generate a large number of potentially good candidates that are then scored using exact inference methods. In [37] the authors proposed another approach: they set parameters by maximizing likelihood and choose structures by conditional likelihood. Inference for each candidate clause is performed using the lazy version of the MC-SAT algorithm [38]. The authors propose some simple heuristics to make the problem tractable and show improvements in terms of predictive accuracy over generative structure learning approaches and discriminative weight learning algorithms.

IV. THEORY REVISION IN FIRST-ORDER LOGIC

A. First-Order Logic and Inductive Logic Programming

Relational learning is mostly related to first-order logic or more restricted formalisms. A first-order KB is a set


of sentences or formulas in first-order logic (FOL) [39]. Formulas in FOL are constructed using four types of symbols: constants, variables, functions, and predicates. Constant symbols represent objects in the domain of interest. Variable symbols range over the objects in the domain. Function symbols represent mappings from tuples of objects to objects. Predicate symbols represent relations among objects in the domain or attributes of objects. A term is any expression representing an object in the domain. It can be a constant, a variable, or a function applied to a tuple of terms. An atomic formula or atom is a predicate symbol applied to a tuple of terms. A ground term is a term containing no variables. A ground atom or ground predicate is an atomic formula all of whose arguments are ground terms. Formulas are recursively constructed from atomic formulas using logical connectives and quantifiers. A positive literal is an atomic formula; a negative literal is a negated atomic formula. A KB in clausal form is a conjunction of clauses, a clause being a disjunction of literals. A definite clause is a clause with exactly one positive literal (the head, with the negative literals constituting the body). A possible world or Herbrand interpretation assigns a truth value to each possible ground predicate.

Because of the computational complexity, KBs are generally constructed using a restricted subset of FOL where inference and learning are more tractable. The most widely used restriction is to Horn clauses, which are clauses containing at most one positive literal. In other words, a Horn clause is an implication with all positive antecedents and only one (positive) literal in the consequent. A program in the Prolog language is a set of Horn clauses. Prolog programs can be learned from examples (often relational databases) by searching for Horn clauses that hold in the data. The field of inductive logic programming (ILP) [5] deals exactly with this problem. The main task in ILP is finding a hypothesis H (a logic program, i.e., a definite clause program) from a set of positive and negative examples P and N. In particular, it is required that the hypothesis H covers all positive examples in P and none of the negative examples in N. The representation language for the examples, together with the covers relation, determines the ILP setting [40].

Learning from entailment is probably the most popular ILP setting, and many well-known ILP systems such as FOIL [41], Progol [42], or ALEPH [36] follow it. In this setting examples are definite clauses, and an example e is covered by a hypothesis H, w.r.t. the background theory B, if and only if B ∪ H |= e. Most ILP systems in this setting require ground facts as examples. They typically proceed following a separate-and-conquer rule-learning approach [43]. This means that in the outer loop they repeatedly search for a rule covering many positive examples and none of the negatives (the set-covering approach [44]). In the inner loop, ILP systems generally perform a general-to-specific heuristic search using refinement operators [45], [46] based on θ-subsumption [47]. These operators perform the steps in the search space by making small modifications to a hypothesis. From a logical perspective, these refinement operators typically realize elementary generalization and specialization steps (usually under θ-subsumption). More sophisticated systems like Progol or ALEPH employ a search bias to reduce the search space of hypotheses.
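As a concrete illustration of such elementary refinement steps (our own invented sketch; real ILP operators work on clauses with variables under θ-subsumption, while here the literals are ground for simplicity), a clause represented as a set of body literals can be specialized by adding a literal and generalized by deleting one:

```python
# A clause as (head, frozenset of body literals); literals are strings.
# Deleting a body literal generalizes the clause (it covers more cases);
# adding one specializes it. These are the elementary refinement steps.
def specialize(clause, literal):
    head, body = clause
    return (head, body | {literal})

def generalize(clause, literal):
    head, body = clause
    return (head, body - {literal})

def covers(clause, example_facts, head_fact):
    """Naive ground coverage test: the clause derives head_fact when all
    of its body literals are among the example's facts (no variables)."""
    head, body = clause
    return head == head_fact and body <= example_facts

c = ("grandparent(a,c)", frozenset({"parent(a,b)", "parent(b,c)"}))
facts = frozenset({"parent(a,b)", "parent(b,c)"})

assert covers(c, facts, "grandparent(a,c)")
# Specializing with a literal not satisfied by the example removes coverage...
c_spec = specialize(c, "male(a)")
assert not covers(c_spec, facts, "grandparent(a,c)")
# ...and generalizing back restores it.
assert covers(generalize(c_spec, "male(a)"), facts, "grandparent(a,c)")
```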

In the ILP setting of learning from interpretations, examples are Herbrand interpretations, and an example e is covered by a hypothesis H, w.r.t. the background theory B, if and only if e is a model of B ∪ H. A possible world is described through a set of true ground facts, i.e., a Herbrand interpretation. Learning from interpretations is generally easier and computationally more tractable than learning from entailment [40]. This is due to the fact that interpretations carry much more information than the examples in learning from entailment: in learning from entailment, examples consist of a single fact, while in interpretations all the facts that hold in the example are known. The approach followed by ILP systems learning from interpretations is similar to that of systems learning from entailment. The most important difference stands in the generality relationship: in learning from entailment a hypothesis H1 is more general than H2 if and only if H1 |= H2, while in learning from interpretations this holds when H2 |= H1. A hypothesis H1 is more general than a hypothesis H2 if all examples covered by H2 are also covered by H1. ILP systems that learn from interpretations are also well suited for learning from positive examples only [28].

B. Theory Revision in the ILP Setting

Algorithm 1 sketches theory revision in an incremental inductive learning framework. Here, M represents the set of all positive and negative processed examples, E is the example currently examined, and T is the theory generated so far according to M. Generalize and Specialize are the inductive operators used by the system to refine an incorrect theory. When a new observation is available, the procedure is started, parameterized on the current theory, the example, and the current set of past examples. If the procedure succeeds in covering the positive example (or not covering the negative example), then the next example is considered. Otherwise the usual refinement procedure (generalization/specialization) is performed.

The most important aspect of the strategy adopted in Algorithm 1 that is useful for our purpose of learning and refining the structure of an SRL model is the exploitation of the refinement operator, which can modify a theory so that it accounts for a new example on which it previously generated an omission/commission error. In our case, the operator can be exploited to guide the move from a theory to one of its refinements, instead of randomly trying to apply all possible refinements as many SRL systems do.


Algorithm 1 Theory Revision in an incremental inductive learning framework

Revise(T, E, M)
{input: T: theory, E: example, M: historical memory; output: T: revised theory}
M ← M ∪ E
if E is a positive example then Generalize(T, E, M)
else Specialize(T, E, M)
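A minimal executable rendering of Algorithm 1 (our sketch only; the Generalize and Specialize operators are stubbed out with an invented set-of-labels theory, since the paper treats them abstractly) could look like:

```python
def revise(theory, example, memory, generalize, specialize):
    """One revision step of Algorithm 1: record the example, then
    generalize on a positive example or specialize on a negative one."""
    memory.append(example)
    if example["positive"]:
        return generalize(theory, example, memory)
    return specialize(theory, example, memory)

# Stub operators over a theory represented as a set of accepted labels.
def generalize(theory, example, memory):
    return theory | {example["label"]}     # cover the positive example

def specialize(theory, example, memory):
    return theory - {example["label"]}     # exclude the negative example

theory, memory = {"bird"}, []
theory = revise(theory, {"positive": True, "label": "bat"}, memory,
                generalize, specialize)
theory = revise(theory, {"positive": False, "label": "bird"}, memory,
                generalize, specialize)
assert theory == {"bat"} and len(memory) == 2
```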

V. SINGLE STEP STRUCTURE REFINING

In this section we describe how structure refining of MLNs can be performed in a single step by combining a refinement operator with parameter learning. The novel algorithm interleaves generalization or specialization steps with optimal parameter learning for pseudo-likelihood. The algorithms we propose here are built upon the ideas that we presented in [37]. The parameters are set through maximum weighted pseudo-log-likelihood (WPLL), and the structures are scored through conditional likelihood. The main difference regards the integration of a novel refinement operator which generalizes or specializes the theory according to the new evidence. The first difference between the full framework of MLNs proposed by [22] and the framework that we propose here is that, in order to use ideal operators during structure search, we need to restrict the clauses of our MLN model to Horn clauses. Most relational learning is performed at this level of expressiveness, and the successes of ILP have shown that for many problems Horn logic is sufficient to deal with structured domains. Thus the structure learning algorithms that we propose here extend those proposed in [37], [32] in that we perform theory revision in the structure learning process, and the language we adopt is based on Horn logic instead of full FOL. The second difference is that the algorithms proposed in [22] try to apply all possible refinements, while here we use an ILP refinement operator to properly explore the search space by moving to more general or more specific candidates as needed, according to the new evidence available.

A. Pseudo-likelihood

MLN weights can be learned by maximizing the likelihood of a relational database. As in ILP, a closed-world assumption [39] is made; thus all ground atoms not in the database are assumed false. If there are n possible ground atoms, then we can represent a database as a vector x = (x1, ..., xi, ..., xn), where xi is the truth value of the i-th ground atom: xi = 1 if the atom appears in the database, and xi = 0 otherwise. Standard methods can be used to learn MLN weights following Equation 4. If the j-th formula has nj(x) true groundings, by Equation 4 we get the derivative of the log-likelihood with respect to its weight:

$$\frac{\partial}{\partial w_j} \log P_w(X = x) = n_j(x) - \sum_{x'} P_w(X = x')\, n_j(x') \qquad (5)$$

where the sum ranges over all possible databases x′, and Pw(X = x′) is P(X = x′) computed using the current weight vector w = (w1, ..., wj, ...). Thus, the j-th component of the gradient is the difference between the number of true groundings of the j-th formula in the data and its expectation according to the model. Counting the number of true groundings of a first-order formula, unfortunately, is a #P-complete problem.
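On a domain small enough to enumerate, the gradient component of Equation (5) can be computed exactly. The following toy sketch (our own; the single feature and its counting function are invented) does so by summing over all worlds:

```python
import math
from itertools import product

# One formula/feature over two binary ground atoms: n(x) counts how many
# of the two atoms are true. Model: P_w(x) proportional to exp(w * n(x)).
def n(x):
    return sum(x)

def gradient(w, data_world):
    """d/dw log P_w(X = x): true count in the data minus its expectation
    under the current model (Eq. 5), by full enumeration of worlds."""
    worlds = list(product([0, 1], repeat=2))
    Z = sum(math.exp(w * n(x)) for x in worlds)
    expected = sum(math.exp(w * n(x)) / Z * n(x) for x in worlds)
    return n(data_world) - expected

# At w = 0 every world is equally likely, so the expected count is 1.0
# and the gradient for a data world with both atoms true is 2 - 1 = 1.
assert abs(gradient(0.0, (1, 1)) - 1.0) < 1e-12
```

The full enumeration inside `gradient` is exactly the step that becomes intractable in real domains, motivating the pseudo-likelihood alternative below.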

The problem with Equation 5 is that not only is the first component intractable to compute, but the expected number of true groundings is intractable as well, requiring inference over the model. Further, efficient optimization methods also require computing the log-likelihood itself (Equation 4), and thus the partition function Z. This can be done approximately using a Monte Carlo maximum likelihood estimator (MC-MLE) [48]. However, the authors in [22] found in their experiments that the Gibbs sampling used to compute the MC-MLEs and gradients did not converge in reasonable time, and using the samples from the unconverged chains yielded poor results.

In many other fields, such as spatial statistics, social network modeling, and language processing, a more efficient alternative has been followed: optimizing pseudo-likelihood [30] instead of likelihood. If x is a possible world (a database or truth assignment) and xl is the l-th ground atom's truth value, the pseudo-likelihood of x is given by the following equation (we follow the same notation as the authors in [22]):

$$P^{*}_{w}(X = x) = \prod_{l=1}^{n} P_w(X_l = x_l \mid MB_x(X_l)) \qquad (6)$$

where MBx(Xl) is the state of the Markov blanket of Xl in the data (i.e., the truth values of the ground atoms that appear in some ground formula with it). From Equation 4, Pw(Xl = xl | MBx(Xl)) is equal to:

\[
P_w\bigl(X_l = x_l \mid MB_x(X_l)\bigr) \;=\;
\frac{\exp\bigl(\sum_{i=1}^{F} w_i\, n_i(x)\bigr)}
     {\exp\bigl(\sum_{i=1}^{F} w_i\, n_i(x_{[X_l=0]})\bigr) + \exp\bigl(\sum_{i=1}^{F} w_i\, n_i(x_{[X_l=1]})\bigr)} \qquad (7)
\]

When computing n_i(x_{[X_l=1]}) and n_i(x_{[X_l=0]}), the usual approach is the closed-world assumption [39], i.e., all ground atoms not in the database are assumed false. Using logical abduction we can potentially infer the truth value of these atoms, and thus obtain more accurate counts that reflect the current data. Since the optimization of the weights by L-BFGS is performed on the estimates of the counts n_i(x_{[X_l=1]}) and n_i(x_{[X_l=0]}), improved accuracy on these counts would also result in a more accurate parameter learning task. The use of logical abduction is thus motivated by the fact that parameter estimation in statistical relational learning can benefit from data completed through logical procedures. To the best of our knowledge, this is the first approach to integrate a pure logical procedure for abductive inference with a statistical parameter estimation algorithm.

[2010 International Conference on Complex, Intelligent and Software Intensive Systems (CISIS), Krakow, Poland, February 15-18, 2010]

Algorithm 2 MLNs Structure Learning and Refinement
Input: P: set of predicates, MLN: Markov Logic Network, RDB: Relational Database, NEV: Novel Evidence
  CLS = all clauses in MLN
  LearnWPLLWeights(MLN, RDB)
  BestScore = f(MLN, RDB)
  BestModel = MLN
  repeat
    CurrentModel = FindBestModel(P, MLN, BestScore, CLS, RDB, NEV)
    if f(CurrentModel) ≥ f(BestModel) then
      BestModel = CurrentModel
      BestScore = f(CurrentModel, RDB)
    end if
  until BestScore does not improve for two consecutive steps
  return BestModel
(f = CLL, the conditional log-likelihood)

Algorithm 3 FindBestModel
Input: P: set of predicates, MLN: Markov Logic Network, BestScore: current best score, CLS: list of clauses, RDB: Relational Database, NEV: Novel Evidence
  CL_C = randomly pick a clause in CLS
  MLN_S = LocalSearchII(CL_C, MLN, BestScore, NEV)
  BestModel = MLN_S
  repeat
    CL'_C = randomly pick a guilty clause in MLN_S after performing a logical test of the current theory on NEV
    MLN'_S = LocalSearchII(CL'_C, MLN_S, BestScore, NEV)
    if f(MLN'_S, RDB) ≥ f(BestModel, RDB) then
      BestModel = MLN'_S
      BestScore = f(MLN'_S, RDB)
    end if
    MLN_S = accept(MLN_S, MLN'_S)
  until two consecutive steps have not produced improvement
  return BestModel
(f = CLL, the conditional log-likelihood)
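To make Equations 6 and 7 concrete, the following sketch (a hypothetical two-feature model, not the authors' implementation) evaluates the conditional probability of each ground atom given its Markov blanket by comparing the worlds with the atom forced to 0 and forced to 1:

```python
import math

# Hypothetical model: worlds are tuples of n binary ground atoms;
# each feature is a (weight w_i, counting function n_i) pair.
features = [
    (1.5, lambda x: int((not x[0]) or x[1])),  # x0 => x1
    (0.7, lambda x: int(x[1] == x[2])),        # x1 <=> x2
]

def sum_wn(x):
    return sum(w * n(x) for w, n in features)

def cond_prob(x, l):
    """P_w(X_l = x_l | MB_x(X_l)) as in Equation 7: compare the worlds
    with atom l forced to 0 and forced to 1 (all other atoms fixed)."""
    x0, x1 = list(x), list(x)
    x0[l], x1[l] = 0, 1
    e0, e1 = math.exp(sum_wn(x0)), math.exp(sum_wn(x1))
    return (e1 if x[l] == 1 else e0) / (e0 + e1)

def log_pseudo_likelihood(x):
    """log P*_w(X = x) = sum_l log P_w(X_l = x_l | MB_x(X_l)) (Eq. 6)."""
    return sum(math.log(cond_prob(x, l)) for l in range(len(x)))
```

Unlike the likelihood gradient of Equation 5, each factor here touches only the worlds that differ in a single atom, so no partition function over all 2^n worlds is needed.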

B. Refinement Algorithm

Structure learning can start from an empty network or from an existing KB. Algorithm 2 iteratively generates refinements of the current structure and scores them by conditional likelihood. These refinements are generated using normal ILP refinement operators: every neighbor of the current structure is obtained by a small generalization or specialization of a randomly chosen clause in the structure. Algorithm 3 performs Iterated Local Search [49], [50] for the best model that fits the data. It starts by randomly choosing a guilty clause CL_C in the current MLN structure. Then it performs a greedy local search to efficiently reach a local optimum MLN_S. At this point, a restart method is applied by randomly choosing another guilty clause CL'_C from the clauses of MLN_S. Then again, a greedy local search is applied to MLN_S to reach another local optimum MLN'_S. The accept function decides whether the search must continue from the previous local optimum MLN_S or from the last local optimum MLN'_S; it always accepts the best solution found so far. For every candidate structure, the parameters that optimize the WPLL are set through L-BFGS.
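The interplay of Algorithms 2 and 3 can be sketched as a generic iterated-local-search skeleton; `score`, `neighbors`, and `guilty_clause` below are hypothetical stand-ins for CLL scoring, ILP refinement, and the logical test on the new evidence:

```python
# Schematic Iterated Local Search over MLN structures (Algorithms 2-3).

def local_search(model, clause, score, neighbors):
    """Greedy ascent from `model`, refining only `clause` (LocalSearchII)."""
    best, best_s = model, score(model)
    improved = True
    while improved:
        improved = False
        for cand in neighbors(best, clause):
            s = score(cand)
            if s > best_s:
                best, best_s, improved = cand, s, True
    return best

def iterated_local_search(model, score, neighbors, guilty_clause, restarts=10):
    best = local_search(model, guilty_clause(model), score, neighbors)
    for _ in range(restarts):
        # restart from another guilty clause; accept() keeps the best so far
        cand = local_search(best, guilty_clause(best), score, neighbors)
        if score(cand) >= score(best):
            best = cand
    return best
```

On a toy one-dimensional "structure space" with score -(m - 7)^2 and neighbors m-1 and m+1, the search climbs to the optimum m = 7.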

Algorithm 4 LocalSearchII
Input: CL_C: clause chosen for refinement, MLN_C: current model, BestScore: current best score, NEV: Novel Evidence
wp: walk probability, the probability of performing an improvement step or a random step
  repeat
    NBHD = neighborhood of MLN_C constructed using ILP refinement operators, by generalizing or specializing the guilty clause CL_C based on MLN_C and NEV
    for each candidate structure MLN in NBHD do
      if MLN satisfies the ILP coverage threshold then
        LearnWPLLWeights(MLN, RDB)
      end if
    end for
    for each scored structure MLN do
      score = f(MLN, RDB)
      if score ≥ BestScore then
        BestScore = score; MLN_S = MLN
      end if
    end for
  until two consecutive steps do not produce improvement
  return MLN_S

In Algorithm 4, we generate NBHD, the neighborhood of MLN_C, using ILP refinement operators. All structures in NBHD differ from MLN_C by only one clause, which is a generalization or specialization of the clause CL_C. Two modifications can be applied here with respect to the traditional setting. First, the structure refinement is not carried out randomly, but can be guided by the examples themselves, since they were purposely provided by an expert. Hence, each example that is not correctly classified by the current theory can be exploited to perform a generalization (if positive) or a specialization (if negative) according to classical ILP operators. Applying such an operator yields one or more alternative refinements of the original structure (depending on the operator and on the generalization model adopted), each consisting of a new structure obtained by refining a single clause of the original one. Second, pruning criteria can be set in order to avoid working on refinements that are not regarded as promising or acceptable. For instance, one could require that each candidate structure fulfils a minimum coverage threshold in the logical sense, i.e., that its accuracy from the ILP point of view (how many positive examples are covered and how many negatives are not) is greater than a given minimum. We believe this heuristic can help exclude candidates that have a very low logical accuracy. Although there is a mismatch between the coverage criterion used by most ILP systems and the likelihood (or a function thereof) used by most statistical learners, a logical theory that does not explain any example from a logical interpretation would be less useful, and would contradict the idea that examples are purposely labelled by an expert and hence deserve some level of trust. Therefore, we decided to pose a threshold on the accuracy of candidate structures and learn weights only for those candidates that satisfy this threshold.
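The pruning heuristic can be sketched as a simple filter; `covers` below is a hypothetical stand-in for logical entailment of an example by a candidate theory:

```python
# Hypothetical coverage filter: learn weights only for candidate
# structures whose logical accuracy on the labelled examples meets a
# minimum threshold.

def logical_accuracy(covers, clauses, positives, negatives):
    """Fraction of positives covered plus negatives excluded,
    over all examples (the ILP notion of accuracy)."""
    tp = sum(covers(clauses, ex) for ex in positives)
    tn = sum(not covers(clauses, ex) for ex in negatives)
    return (tp + tn) / (len(positives) + len(negatives))

def filter_candidates(candidates, covers, positives, negatives,
                      threshold=0.5):
    return [c for c in candidates
            if logical_accuracy(covers, c, positives, negatives) >= threshold]
```

Candidates failing the threshold are discarded before the (expensive) WPLL weight-learning step, which is the point of the heuristic.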

After setting weights with WPLL, in order to score each MLN structure in terms of conditional likelihood (CLL), we need to perform inference over the network. A very fast algorithm for inference in MLNs is MC-SAT [51].



Since probabilistic inference methods like MCMC or belief propagation tend to give poor results when deterministic or near-deterministic dependencies are present, and logical ones like satisfiability testing are inapplicable to probabilistic dependencies, MC-SAT combines ideas from both MCMC and satisfiability to handle the probabilistic, deterministic and near-deterministic dependencies that are typical of statistical relational learning. MC-SAT was shown to greatly outperform Gibbs sampling and simulated tempering on two real-world datasets regarding entity resolution and collective classification. In order to make the execution of MC-SAT tractable for every candidate structure, we follow the same heuristics that were proposed in [37]: (i) we score through MC-SAT only those candidate structures that show an improvement in WPLL; (ii) we use the lazy version of MC-SAT, known as Lazy-MC-SAT [38], which reduces memory and time by orders of magnitude compared to MC-SAT; and (iii) we pose a memory and time limit on the inference process through Lazy-MC-SAT. As the experiments in [37] showed, these heuristics proved quite successful in two real-world domains.
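A high-level sketch of the MC-SAT loop [51] on a tiny propositional model follows, with brute-force enumeration standing in for SampleSAT (so it only runs on a handful of atoms); the clause representation is hypothetical:

```python
import itertools, math, random

# clauses: list of (weight, test) pairs over worlds of n binary atoms;
# an infinite weight marks a deterministic (hard) clause.
def satisfying_assignments(tests, n):
    return [x for x in itertools.product([0, 1], repeat=n)
            if all(t(x) for t in tests)]

def mc_sat(clauses, n, steps, rng):
    # start from a state satisfying all hard clauses
    hard = [t for w, t in clauses if math.isinf(w)]
    x = rng.choice(satisfying_assignments(hard, n))
    samples = []
    for _ in range(steps):
        # slice step: each clause currently satisfied is kept as a
        # constraint with probability 1 - e^{-w}
        kept = [t for w, t in clauses
                if t(x) and rng.random() < 1 - math.exp(-w)]
        # sample (near-)uniformly from the solutions of the kept clauses;
        # MC-SAT proper uses SampleSAT here instead of enumeration
        x = rng.choice(satisfying_assignments(hard + kept, n))
        samples.append(x)
    return samples
```

With a hard clause forcing x0 = 1 and a soft clause (weight 2.0) favoring x1, every sample satisfies the hard clause, while x1 should be true in most samples.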

VI. CONCLUSION AND FUTURE WORK

We have proposed a novel algorithm for refining the structure of an SRL model which interleaves, in a single step, structure and parameter learning with theory generalization and specialization. This makes it possible to properly revise an SRL model without learning the parameters and the whole structure from scratch, as current SRL systems do. To the best of our knowledge, this is the first attempt to provide an algorithm that refines the structure of an SRL model through a tight integration between logical refinement operators and statistical parameter learning.

We intend to experimentally evaluate the proposed framework on complex relational domains with missing data that becomes available over time. In order to evaluate the advantages of our approach, we intend to compare accuracy and learning time against: a pure statistical learner that starts learning from scratch when new evidence becomes available; a pure logical approach, such as an ILP system, that does not revise the structure of the model but learns in a batch fashion; and another state-of-the-art SRL system that only learns from scratch.

REFERENCES

[1] F. Bacchus, Representing and Reasoning with Probabilistic Knowledge. Cambridge, MA: MIT Press, 1990.

[2] J. Halpern, "An analysis of first-order logics of probability," Artificial Intelligence, vol. 46, pp. 311–350, 1990.

[3] N. Nilsson, "Probabilistic logic," Artificial Intelligence, vol. 28, pp. 71–87, 1986.

[4] M. P. Wellman, J. S. Breese, and R. P. Goldman, "From knowledge bases to decision models," Knowledge Engineering Review, vol. 7, 1992.

[5] N. Lavrac and S. Dzeroski, Inductive Logic Programming: Techniques and Applications. Chichester, UK: Ellis Horwood, 1994.

[6] J. Pearl, Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. San Francisco, CA: Morgan Kaufmann, 1988.

[7] L. Getoor and B. Taskar, Introduction to Statistical Relational Learning. MIT Press, 2007.

[8] L. De Raedt, P. Frasconi, K. Kersting, and S. Muggleton, Eds., Probabilistic Inductive Logic Programming - Theory and Applications. Springer, 2008.

[9] K. Kersting and L. De Raedt, "Towards combining inductive logic programming with Bayesian networks," in Proc. 11th Int'l Conf. on Inductive Logic Programming. Springer, 2001, pp. 118–131.

[10] S. Muggleton, "Stochastic logic programs," in L. De Raedt (Ed.), Advances in Inductive Logic Programming. Amsterdam: IOS Press, 1996.

[11] J. Cussens, "Parameter estimation in stochastic logic programs," Machine Learning, vol. 44, no. 3, pp. 245–271, 2001.

[12] D. Poole, "Probabilistic Horn abduction and Bayesian networks," Artificial Intelligence, vol. 64, pp. 81–129, 1993.

[13] L. Ngo and P. Haddawy, "Answering queries from context-sensitive probabilistic knowledge bases," Theoretical Computer Science, vol. 171, pp. 147–177, 1997.

[14] T. Sato and Y. Kameya, "PRISM: A symbolic-statistical modeling language," in Proceedings of the Fifteenth International Joint Conference on Artificial Intelligence. Nagoya, Japan: Morgan Kaufmann, 1997, pp. 1330–1335.

[15] V. Santos Costa, D. Page, M. Qazi, and J. Cussens, "CLP(BN): Constraint logic programming for probabilistic knowledge," in Proceedings of the Nineteenth Conference on Uncertainty in Artificial Intelligence. Acapulco, Mexico: Morgan Kaufmann, 2003, pp. 517–524.

[16] N. Friedman, L. Getoor, D. Koller, and A. Pfeffer, "Learning probabilistic relational models," in Proc. 16th Int'l Joint Conf. on AI (IJCAI). Morgan Kaufmann, 1999, pp. 1300–1307.

[17] H. Pasula and S. Russell, "Approximate inference for first-order probabilistic languages," in Proceedings of the Seventeenth International Joint Conference on Artificial Intelligence. Seattle, WA: Morgan Kaufmann, 2001, pp. 741–748.

[18] C. Cumby and D. Roth, "Feature extraction languages for propositionalized relational learning," in Proceedings of the IJCAI-2003 Workshop on Learning Statistical Models from Relational Data. Acapulco, Mexico: IJCAI, 2003, pp. 24–31.



[19] D. Koller, A. Levy, and A. Pfeffer, "P-CLASSIC: A tractable probabilistic description logic," in Proc. 14th Nat'l Conf. on AI (AAAI-97), 1997, pp. 390–397.

[20] B. Taskar, P. Abbeel, and D. Koller, "Discriminative probabilistic models for relational data," in Proceedings of the Eighteenth Conference on Uncertainty in Artificial Intelligence. Edmonton, Canada: Morgan Kaufmann, 2002, pp. 485–492.

[21] A. Popescul and L. H. Ungar, "Structural logistic regression for link analysis," in Proceedings of the Second International Workshop on Multi-Relational Data Mining. Washington, DC: ACM Press, 2003, pp. 92–106.

[22] M. Richardson and P. Domingos, "Markov logic networks," Machine Learning, vol. 62, pp. 107–136, 2006.

[23] P. Singla and P. Domingos, "Markov logic in infinite domains," in Proc. 23rd UAI. AUAI Press, 2007, pp. 368–375.

[24] A. P. Dempster, N. M. Laird, and D. B. Rubin, "Maximum likelihood from incomplete data via the EM algorithm," Journal of the Royal Statistical Society, Series B, vol. 39, pp. 1–38, 1977.

[25] S. Della Pietra, V. Della Pietra, and J. Lafferty, "Inducing features of random fields," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 19, pp. 380–392, 1997.

[26] F. Sha and F. Pereira, "Shallow parsing with conditional random fields," in Proc. HLT-NAACL-03, 2003, pp. 134–141.

[27] A. McCallum, "Efficiently inducing features of conditional random fields," in Proc. UAI-03, 2003, pp. 403–410.

[28] L. De Raedt and L. Dehaspe, "Clausal discovery," Machine Learning, vol. 26, pp. 99–146, 1997.

[29] S. Kok and P. Domingos, "Learning the structure of Markov logic networks," in Proc. 22nd Int'l Conf. on Machine Learning, 2005, pp. 441–448.

[30] J. Besag, "Statistical analysis of non-lattice data," The Statistician, vol. 24, pp. 179–195, 1975.

[31] L. Mihalkova and R. J. Mooney, "Bottom-up learning of Markov logic network structure," in Proc. 24th Int'l Conf. on Machine Learning, 2007, pp. 625–632.

[32] M. Biba, S. Ferilli, and F. Esposito, "Structure learning of Markov logic networks through iterated local search," in Frontiers in Artificial Intelligence and Applications, Proceedings of the 18th European Conference on Artificial Intelligence (ECAI), vol. 178, 2008, pp. 361–365.

[33] P. Singla and P. Domingos, "Discriminative training of Markov logic networks," in Proc. 20th Nat'l Conf. on AI (AAAI). AAAI Press, 2005, pp. 868–873.

[34] D. Lowd and P. Domingos, "Efficient weight learning for Markov logic networks," in Proc. of the 11th PKDD. Springer Verlag, 2007, pp. 200–211.

[35] T. N. Huynh and R. J. Mooney, "Discriminative structure and parameter learning for Markov logic networks," in Proc. of the 25th International Conference on Machine Learning (ICML), 2008.

[36] A. Srinivasan, The Aleph Manual, available at http://www.comlab.ox.ac.uk/oucl/research/areas/machlearn/Aleph/.

[37] M. Biba, S. Ferilli, and F. Esposito, "Discriminative structure learning of Markov logic networks," in Proceedings of the 18th International Conference on Inductive Logic Programming (ILP 2008), LNCS 5194. Springer, 2008, pp. 59–76.

[38] H. Poon, P. Domingos, and M. Sumner, "A general method for reducing the complexity of relational inference and its application to MCMC," in Proc. 23rd Nat'l Conf. on Artificial Intelligence. Chicago, IL: AAAI Press, 2008.

[39] M. R. Genesereth and N. J. Nilsson, Logical Foundations of Artificial Intelligence. San Mateo, CA: Morgan Kaufmann, 1987.

[40] L. De Raedt, "Logical settings for concept-learning," Artificial Intelligence, vol. 95, no. 1, pp. 197–201, 1997.

[41] J. R. Quinlan, "Learning logical definitions from relations," Machine Learning, vol. 5, pp. 239–266, 1990.

[42] S. H. Muggleton, "Inverse entailment and Progol," New Generation Computing Journal, pp. 245–286, 1995.

[43] J. Fürnkranz, "Separate-and-conquer rule learning," Artificial Intelligence Review, vol. 13, no. 1, pp. 3–54, 1999.

[44] T. M. Mitchell, Machine Learning. The McGraw-Hill Companies, Inc., 1997.

[45] S.-H. Nienhuys-Cheng and R. de Wolf, Foundations of Inductive Logic Programming. Springer-Verlag, 1997.

[46] E. Shapiro, Algorithmic Program Debugging. MIT Press, 1983.

[47] G. D. Plotkin, "A note on inductive generalization," in Machine Intelligence, vol. 5. Edinburgh University Press, 1970, pp. 153–163.

[48] C. J. Geyer and E. A. Thompson, "Constrained Monte Carlo maximum likelihood for dependent data," Journal of the Royal Statistical Society, Series B, vol. 54, pp. 657–699, 1992.

[49] H. H. Hoos and T. Stützle, Stochastic Local Search: Foundations and Applications. San Francisco: Morgan Kaufmann, 2005.

[50] H. Lourenço, O. Martin, and T. Stützle, "Iterated local search," in Handbook of Metaheuristics, F. Glover and G. Kochenberger, Eds. Norwell, MA: Kluwer Academic Publishers, 2002, pp. 321–353.

[51] H. Poon and P. Domingos, "Sound and efficient inference with probabilistic and deterministic dependencies," in Proc. 21st Nat'l Conf. on AI (AAAI). AAAI Press, 2006, pp. 458–463.
