1
Learning the Structure of Markov Logic
Networks
Stanley Kok
2
Overview
  Introduction
  CLAUDIEN, CRFs
  Algorithm
    Evaluation Measure
    Clause Construction
    Search Strategies
  Speedup Techniques
  Experiments
3
Introduction
  Richardson & Domingos (2004) learned MLN structure in two disjoint steps:
    Learn first-order clauses with an off-the-shelf ILP system (CLAUDIEN)
    Learn clause weights by optimizing pseudo-likelihood
  We develop an algorithm that:
    Learns first-order clauses by directly optimizing pseudo-likelihood
    Is fast enough to be practical
    Learns better structure than R&D, pure ILP, and purely probabilistic and purely KB approaches
4
CLAUDIEN (CLAUsal DIscovery ENgine)
  Starts with the trivially false clause
  Repeatedly refines current clauses by adding literals
  Adds clauses that satisfy minimum accuracy and coverage to the KB

  true => false
  m => false    f => false    h => false
  m^f => false    m => h    m => f    m^h => false
  f => h    f => m    f^h => false    h => f    h => m
  h => m v f
5
CLAUDIEN
  Language bias: clause templates
  Can refine a handcrafted KB. Example:
    Professor(P) <= AdvisedBy(S,P) in KB
    dlab_template('1-2:[Professor(P),Student(S)] <- AdvisedBy(S,P)')
    yields Professor(P) v Student(S) <= AdvisedBy(S,P)
6
Conditional Random Fields
  Markov networks used to compute P(y|x) (McCallum, 2003)
  Model: P(y|x) = (1/Z_x) exp(sum_k lambda_k f_k(x, y))
  Features f_k, e.g. "current word is capitalized and next word is Inc"

  [Figure: linear chain of labels y1, y2, ..., yn over the observed sequence x1, x2, ..., xn; e.g. "IBM hired Alice ..." labeled Org, Misc, Person, Misc]
7
CRF – Feature Induction
  Set of atomic features (word=the, capitalized, etc.)
  Starts from an empty CRF
  While convergence criterion is not met:
    Create a list of new features consisting of:
      Atomic features
      Binary conjunctions of atomic features
      Conjunctions of atomic features with features already in the model
    Evaluate the gain in P(y|x) of adding each feature to the model
    Add the best K features to the model (100s-1000s of features)
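The induction loop above can be sketched as follows, with features represented as frozensets of atomic feature names read as conjunctions. The gain function is supplied by the caller; this is an illustrative sketch, not McCallum's implementation.

```python
def induce_features(atomic, gain, rounds=2, top_k=2):
    """Greedy feature-induction sketch: features are frozensets of
    atomic feature names; candidates each round are atoms, pairs of
    atoms, and conjunctions of an atom with an existing model feature."""
    model = []
    for _ in range(rounds):
        # Candidate features for this round.
        cands = {frozenset([a]) for a in atomic}
        cands |= {frozenset([a, b]) for a in atomic for b in atomic if a != b}
        cands |= {f | {a} for f in model for a in atomic}
        cands -= set(model)
        # Keep the top_k candidates with positive gain.
        best = [f for f in sorted(cands, key=gain, reverse=True)[:top_k]
                if gain(f) > 0]
        if not best:
            break
        model.extend(best)
    return model
```

With `gain=len` (conjunction size as a toy gain), the model grows from pairs of atoms toward their full conjunction.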
8
Algorithm
  High-level algorithm:

  Repeat
    Clauses <- FindBestClauses(MLN)
    Add Clauses to MLN
  Until Clauses = {}

  FindBestClauses(MLN)
    Search for, and create, candidate clauses
    For each candidate clause c
      Compute gain in evaluation measure of adding c to MLN
    Return k clauses with highest gain
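The high-level loop above can be sketched in Python; the MLN is just a list of clause strings here, and `toy_find_best` is a hypothetical stand-in for FindBestClauses (the real one scores candidates by gain in the evaluation measure).

```python
def learn_structure(mln, find_best_clauses, max_iters=100):
    """Top-level loop: repeatedly add the best candidate clauses
    until no clause improves the evaluation measure."""
    for _ in range(max_iters):
        clauses = find_best_clauses(mln)
        if not clauses:          # Clauses = {} : stop
            break
        mln.extend(clauses)
    return mln

def toy_find_best(mln):
    """Stub: proposes one illustrative clause per round, then stops."""
    proposals = ["!Student(S) v AdvBy(S,P)", "!Prof(P) v AdvBy(S,P)"]
    return [c for c in proposals if c not in mln][:1]
```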
9
Evaluation Measure
  Ideally use log-likelihood, but it is slow to compute
  Recall:
    Value:    log P_w(X=x) = sum_i w_i n_i(x) - log Z_w
    Gradient: d(log P_w(X=x))/dw_i = n_i(x) - E_w[n_i(x)]
  Both the partition function Z_w and the expected counts E_w[n_i(x)] require inference over the whole model
10
Evaluation Measure
  Use pseudo-log-likelihood (R&D, 2004):
    log P*_w(X=x) = sum_l log P_w(X_l = x_l | MB_x(X_l))
  But the sum runs over every ground predicate l, so it gives undue weight to predicates with a large number of groundings
11
Evaluation Measure
  Use weighted pseudo-log-likelihood (WPLL):
    log P*_w(X=x) = sum_r c_r sum_{g in G_r} log P_w(X_g = x_g | MB_x(X_g))
  where r ranges over first-order predicates, G_r is the set of groundings of r, and c_r = 1/|G_r|, so each predicate contributes equally regardless of its number of groundings
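A minimal sketch of the WPLL computation, assuming the per-ground-atom conditional log-likelihoods have already been computed. The `cll_by_ground_atom` dictionary and its `(predicate, args)` key format are illustrative, not from the original system.

```python
from collections import defaultdict

def wpll(cll_by_ground_atom):
    """Weighted pseudo-log-likelihood: each first-order predicate r
    gets weight c_r = 1/|G_r|, i.e. the CLLs of its groundings are
    averaged, so no predicate dominates by sheer grounding count."""
    by_pred = defaultdict(list)
    for (pred, _args), cll in cll_by_ground_atom.items():
        by_pred[pred].append(cll)
    # Sum of per-predicate averages.
    return sum(sum(clls) / len(clls) for clls in by_pred.values())
```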
12
Algorithm
  High-level algorithm:

  Repeat
    Clauses <- FindBestClauses(MLN)
    Add Clauses to MLN
  Until Clauses = {}

  FindBestClauses(MLN)
    Search for, and create, candidate clauses
    For each candidate clause c
      Compute gain in evaluation measure of adding c to MLN
    Return k clauses with highest gain
13
Clause Construction
  Add a literal (negative/positive)
    All possible ways the new literal's variables can be shared with those of the clause
    e.g. !Student(S) v AdvBy(S,P)
  Remove a literal (when refining an MLN)
    Removes spurious conditions from rules
    e.g. !Student(S) v !YrInPgm(S,5) v TA(S,C) v TmpAdvBy(S,P)
14
Clause Construction
  Flip signs of literals (when refining an MLN)
    Moves literals on the wrong side of an implication
    e.g. !CseQtr(C1,Q1) v !CseQtr(C2,Q2) v !SameCse(C1,C2) v !SameQtr(Q1,Q2)
    Done at the beginning of the algorithm; expensive, optional
  Limit # of distinct variables to restrict the search space
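The "add a literal" operator can be sketched as follows. `literal_additions` is a hypothetical helper that only handles variable sharing and the distinct-variable cap; the real system also respects predicate typing and other constraints.

```python
from itertools import product

def literal_additions(clause_vars, pred, arity, max_vars=4):
    """Enumerate every way a new literal's arguments can share
    variables with the clause; allow at most one fresh variable,
    and cap the number of distinct variables (max_vars) to
    restrict the search space."""
    pool = list(clause_vars)
    if len(clause_vars) < max_vars:
        pool.append("V%d" % len(clause_vars))  # one fresh variable
    lits = []
    for args in product(pool, repeat=arity):
        atom = "%s(%s)" % (pred, ",".join(args))
        lits.append(atom)        # positive literal
        lits.append("!" + atom)  # negative literal
    return lits
```

For a clause over variables S and P, adding a binary predicate yields 3^2 argument tuples, each in positive and negated form.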
15
Algorithm
  High-level algorithm:

  Repeat
    Clauses <- FindBestClauses(MLN)
    Add Clauses to MLN
  Until Clauses = {}

  FindBestClauses(MLN)
    Search for, and create, candidate clauses
    For each candidate clause c
      Compute gain in evaluation measure of adding c to MLN
    Return k clauses with highest gain
16
Search Strategies
  Shortest-first search (SFS)
    1. Find the gain of each clause in the candidate set
    2. Sort the clauses by gain
    3. Return the top 5 with positive gain (e.g. !AdvBy(S,P) v Stu(S))
    4. Add the 5 clauses to the MLN (wt1, !AdvBy(S,P); wt2, clause2; ...)
    5. Retrain the weights of the MLN
  (Yikes! All length-2 clauses have gains <= 0)
17
Shortest-First Search
  a. Extend the 20 length-2 clauses with the highest gains
     e.g. !AdvBy(S,P) v Stu(S) becomes !AdvBy(S,P) v Stu(S) v Prof(P)
  b. Form a new candidate set
  c. Keep the 1000 clauses with the highest gains
18
Shortest-First Search
  Shortest-first search (SFS)
    Repeat the process
    Extend all length-2 clauses before length-3 ones
  How do you refine a non-empty MLN?
19
SFS – MLN Refinement
  a. Extend the 20 length-2 clauses with the highest gains
  b. Extend the length-2 clauses in the MLN
  c. Remove a predicate from length-4 clauses in the MLN
  d. Flip signs of length-3 clauses in the MLN (optional)
  e. Clauses produced by b, c, d replace the original clause in the MLN
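One SFS round can be sketched under the simplifying assumption that all current candidates have the same length. `sfs_round`, the gain function, and the extend operator are illustrative stand-ins, not the original implementation.

```python
def sfs_round(candidates, gain, extend, top_k=5, extend_top=20, beam=1000):
    """One round of shortest-first search over same-length candidates:
    add the top_k positive-gain clauses, then extend the extend_top
    best clauses to seed the next (longer) candidate set, keeping
    only `beam` of the extensions."""
    ranked = sorted(candidates, key=gain, reverse=True)
    added = [c for c in ranked[:top_k] if gain(c) > 0]
    next_cands = [e for c in ranked[:extend_top] for e in extend(c)]
    next_cands.sort(key=gain, reverse=True)
    return added, next_cands[:beam]
```

A toy run, with clauses as tuples of integers and gain = sum: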
20
Search Strategies
  Beam search
    1. Keep a beam of the 5 clauses with the highest gains
    2. Track the best clause
    3. Stop when the best clause does not change after two consecutive iterations
  How do you refine a non-empty MLN?
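A minimal beam-search sketch with the stopping rule from the slide; `gain` and `extend` are supplied by the caller and are assumptions, not the original implementation.

```python
def beam_search(initial, gain, extend, beam_width=5, patience=2):
    """Keep a beam of the best refinements; track the single best
    clause; stop when it is unchanged for `patience` consecutive
    iterations (the slide uses two)."""
    beam = sorted(initial, key=gain, reverse=True)[:beam_width]
    best = beam[0]
    stale = 0
    while stale < patience:
        refinements = [r for c in beam for r in extend(c)]
        if not refinements:      # nothing left to refine
            break
        beam = sorted(refinements, key=gain, reverse=True)[:beam_width]
        if gain(beam[0]) > gain(best):
            best, stale = beam[0], 0
        else:
            stale += 1
    return best
```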
21
Algorithm
  High-level algorithm:

  Repeat
    Clauses <- FindBestClauses(MLN)
    Add Clauses to MLN
  Until Clauses = {}

  FindBestClauses(MLN)
    Search for, and create, candidate clauses
    For each candidate clause c
      Compute gain in evaluation measure of adding c to MLN
    Return k clauses with highest gain
22
Difference from CRF – Feature Induction
  CRF feature induction (recap):
    Set of atomic features (word=the, capitalized, etc.)
    Start from an empty CRF
    While convergence criterion is not met:
      Create a list of new features (atomic features, binary conjunctions of atomic features, conjunctions of atomic features with features already in the model)
      Evaluate the gain in P(y|x) of adding each feature to the model
      Add the best K features to the model (100s-1000s of features)
  Differences:
    We can refine a non-empty MLN
    We use pseudo-likelihood; different optimizations
    Applicable to arbitrary Markov networks (not only linear chains)
    We maintain a separate candidate set
    We add only the best ~10s of clauses to the model
    Flexible enough to fit different search algorithms
23
Overview
  Introduction
  CLAUDIEN, CRFs
  Algorithm
    Evaluation Measure
    Clause Construction
    Search Strategies
  Speedup Techniques
  Experiments
24
Speedup Techniques
  Recall:
    FindBestClauses(MLN)
      Search for, and create, candidate clauses
      For each candidate clause c
        Compute gain in WPLL of adding c to MLN
      Return k clauses with highest gain

  LearnWeights(MLN+c) optimizes the WPLL with L-BFGS
  L-BFGS computes the value and gradient of the WPLL
  There are many candidate clauses, so it is important to compute the WPLL and its gradient efficiently
25
Speedup Techniques
  The WPLL is a sum of conditional log-likelihoods (CLLs), one per ground predicate
  When computing a ground predicate's CLL, ignore clauses in which the predicate does not appear
    e.g. if predicate l does not appear in clause 1, clause 1 cannot affect l's CLL
26
Speedup Techniques
  A ground predicate's CLL is affected only by the clauses that contain it
  Most clause weights do not change significantly between iterations
    So most CLLs do not change much
    So we don't have to recompute all CLLs
  Store the WPLL and the CLLs
    Recompute a CLL only if the weights affecting it change beyond some threshold
    Subtract the old CLLs from, and add the new CLLs to, the WPLL
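The caching idea can be sketched as follows. This is a simplified model: one CLL per ground atom, the per-predicate 1/|G_r| weighting is omitted for brevity, and `compute_cll` is a hypothetical callback.

```python
class CLLCache:
    """Cache ground-atom CLLs; recompute one only when the change in
    the weights of clauses containing that atom crosses a threshold,
    and patch the stored WPLL by the difference."""
    def __init__(self, compute_cll, threshold=1e-4):
        self.compute_cll = compute_cll
        self.threshold = threshold
        self.cll = {}
        self.wpll = 0.0

    def refresh(self, atom, weight_delta):
        if atom in self.cll and abs(weight_delta) < self.threshold:
            return self.cll[atom]  # weights barely moved: reuse old CLL
        new = self.compute_cll(atom)
        # Subtract the old CLL, add the new one.
        self.wpll += new - self.cll.get(atom, 0.0)
        self.cll[atom] = new
        return new
```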
27
Speedup Techniques
  The WPLL is a sum over all ground predicates
  Estimate the WPLL by uniformly sampling the groundings of each first-order predicate
    Sample x% of the groundings, subject to a minimum and maximum
    Extrapolate from the sample average
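A sketch of the subsampling estimate for one predicate, assuming its groundings' CLLs are given as a list; the function and parameter names are illustrative.

```python
import random

def estimated_predicate_cll(clls, frac=0.05, min_n=10, max_n=1000, seed=0):
    """Estimate a predicate's total CLL contribution by uniformly
    sampling a fraction of its groundings' CLLs (clamped to
    [min_n, max_n]) and extrapolating the sample average."""
    rng = random.Random(seed)
    n = int(frac * len(clls))
    n = max(min_n, min(max_n, n))   # clamp to [min_n, max_n]
    n = min(n, len(clls))           # can't sample more than we have
    sample = rng.sample(clls, n)
    return sum(sample) / n * len(clls)  # mean * total groundings
```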
28
Speedup Techniques
  The WPLL and its gradient require computing the number of true groundings of a clause
    a #P-complete problem
  Karp & Luby (1983)'s Monte-Carlo algorithm
    Gives an estimate that is within epsilon of the true value with probability 1 - delta
    Draws random samples of the clause's groundings
  We found that the estimate converges faster than the algorithm specifies
    So we apply a convergence test (DeGroot & Schervish, 2002) after every 100 samples
    Earlier termination
29
Speedup Techniques
  L-BFGS is used to learn clause weights to optimize the WPLL
  Two parameters:
    Maximum number of iterations
    Convergence threshold
  Use a smaller maximum number of iterations and a looser convergence threshold when evaluating a candidate clause's gain
    Faster termination
30
Speedup Techniques
  Impose a lexicographic ordering on clauses
    Avoids redundant computations for clauses that are syntactically the same
    Does not detect semantically identical but syntactically different clauses (an NP-complete problem)
  Cache new clauses
    Avoids recomputation
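The syntactic-duplicate check can be sketched by canonicalizing each clause before caching it; the clause representation (a list of literal strings) is an assumption.

```python
def canonical(clause):
    """Lexicographic canonical form of a clause (a collection of
    literal strings): sorting makes syntactically identical clauses
    compare equal regardless of literal order.  Semantically identical
    but syntactically different clauses are NOT detected."""
    return tuple(sorted(clause))

def is_new(clause, seen):
    """True (and records the clause) only the first time this
    syntactic clause is proposed, avoiding recomputing its gain."""
    key = canonical(clause)
    if key in seen:
        return False
    seen.add(key)
    return True
```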
31
Speedup Techniques
  Also used R&D (2004)'s techniques for the WPLL gradient:
    Ignore predicates that don't appear in the ith formula
    Ignore ground formulas whose truth value is unaffected by changing the truth value of any single literal
    Compute the number of true groundings of a clause once and cache it
32
Overview
  Introduction
  CLAUDIEN, CRFs
  Algorithm
    Evaluation Measure
    Clause Construction
    Search Strategies
  Speedup Techniques
  Experiments
33
Experiments
  UW-CSE domain
    22 predicates, e.g. AdvisedBy, Professor, etc.
    10 types, e.g. Person, Course, Quarter, etc.
    Total # ground predicates: about 4 million
    # true ground predicates (in DB): 3212
    Handcrafted KB with 94 formulas, e.g.:
      Each student has at most one advisor
      If a student is an author of a paper, so is her advisor
34
Experiments
  Cora domain
    1295 citations to 112 CS research papers
    Author, Venue, Title, Year fields
    5 predicates, viz. SameCitation, SameAuthor, SameVenue, SameTitle, SameYear
    Evidence predicates, e.g. WordsInCommonInTitle20%(title1, title2)
    Total # ground predicates: about 5 million
    # true ground predicates (in DB): 378,589
    Handcrafted KB with 26 clauses, e.g.:
      If two citations are the same, then they have the same authors, titles, etc., and vice versa
      If two titles have many words in common, then they are the same, etc.
35
Systems
  MLN(KB): weight-learning applied to the handcrafted KB
  MLN(CL): structure-learning with CLAUDIEN; then weight-learning
  MLN(KB+CL): structure-learning with CLAUDIEN, using the handcrafted KB as its language bias; then weight-learning
  MLN(SLB): structure-learning with beam search, starting from an empty MLN
  MLN(KB+SLB): ditto, starting from the handcrafted KB
  MLN(SLB+KB): structure-learning with beam search, starting from an empty MLN, allowing handcrafted clauses to be added in a first search step
  MLN(SLS): structure-learning with SFS, starting from an empty MLN
36
Systems
  CL: CLAUDIEN alone
  KB: handcrafted KB alone
  KB+CL: CLAUDIEN with the KB as its language bias
  NB: naive Bayes
  BN: Bayesian networks
37
Methodology
  UW-CSE domain
    DB divided into 5 areas: AI, graphics, languages, systems, theory
    Leave-one-out testing by area
  Cora domain
    5 different train-test splits
  Measured:
    Average CLL of the predicates
    Average area under the precision-recall curve of the predicates (AUC)
38
Results
  MLN(SLS), MLN(SLB) better than MLN(CL), MLN(KB), CL, KB, NB, BN

  [Bar charts: AUC and CLL (negated) per system on UW-CSE]
39
Results
  MLN(SLS), MLN(SLB) better than MLN(CL), MLN(KB), CL, KB, NB, BN

  [Bar charts: AUC and CLL (negated) per system on Cora]
40
Results
  MLN(SLB+KB) better than MLN(KB+CL), KB+CL

  [Bar charts: AUC and CLL (negated) per system on UW-CSE]
41
Results
  MLN(SLB+KB) better than MLN(KB+CL), KB+CL

  [Bar charts: AUC and CLL (negated) per system on Cora]
42
Results
  MLN(<system>) does better than the corresponding <system>

  [Bar charts: AUC and CLL (negated) per system on UW-CSE]
43
Results
  MLN(<system>) does better than the corresponding <system>

  [Bar charts: AUC and CLL (negated) per system on Cora]
44
Results
  MLN(SLS) on UW-CSE; cluster of 15 dual-CPU 2.8 GHz Pentium 4 machines
    With speedups: 5.3 hrs
    Without speedups: didn't finish running in 24 hrs
  MLN(SLB) on UW-CSE; single 2.8 GHz Pentium 4 machine
    With speedups: 8.8 hrs
    Without speedups: 13.7 hrs
45
Future Work
  Speeding up the counting of the # of true groundings of a clause
  Probabilistically bounding the loss in accuracy due to subsampling
  Probabilistic predicate discovery
46
Conclusion
  We developed an algorithm that:
    Learns first-order clauses by directly optimizing pseudo-likelihood
    Is fast enough to be practical
    Learns better structure than R&D, pure ILP, and purely probabilistic and purely KB approaches