Approximation Techniques for Automated Reasoning
Irina Rish, IBM T.J. Watson Research Center
Rina Dechter, University of California, Irvine
Outline
Introduction
Reasoning tasks
Reasoning approaches: elimination and conditioning
CSPs: exact inference and approximations
Belief networks: exact inference and approximations
MDPs: decision-theoretic planning
Conclusions
Automated reasoning tasks: propositional satisfiability, constraint satisfaction, planning and scheduling, probabilistic inference, decision-theoretic planning, etc.
Reasoning is NP-hard => approximations
Graphical Frameworks
Our focus: graphical frameworks (constraint and belief networks)
Nodes = variables
Edges = dependencies (constraints, probabilities, utilities)
Reasoning = graph transformations
Propositional Satisfiability
Example: party problem
If Alex goes, then Becky goes: A -> B (or, ¬A ∨ B)
If Chris goes, then Alex goes: C -> A (or, ¬C ∨ A)
Query: Is it possible that Chris goes to the party but Becky does not?
Is the propositional theory {A -> B, C -> A, C, ¬B} satisfiable?
Constraint Satisfaction
Example: map coloring
Variables: countries (A, B, C, etc.)
Values: colors (e.g., red, green, yellow)
Constraints: A ≠ B, A ≠ D, D ≠ E, etc.
Constrained Optimization
Example: power plant scheduling
Variables: X_1, ..., X_N, each with domain {ON, OFF}
Constraints: min-up and min-down time; power demand Power(X_1) + ... + Power(X_N) ≥ Demand
Objective: minimize TotalFuelCost(X_1, ..., X_N)
Probabilistic Inference
Example: medical diagnosis (the "Asia" network)
[Belief network with nodes: visit to Asia (V), smoking (S), tuberculosis (T), lung cancer, bronchitis (B), abnormality in lungs (A), X-ray (X), dyspnoea (D, shortness of breath)]
Query: P(T = yes | S = no, D = yes) = ?
Decision-Theoretic Planning State = {X, Y, Battery_Level} Actions = {Go_North, Go_South, Go_West, Go_East} Probability of success = P Task: reach the goal location ASAP
Example: robot navigation
Reasoning Methods
Our focus: conditioning and elimination
Conditioning ("guessing" assignments, reasoning by assumptions):
• Branch-and-bound (optimization)
• Backtracking search (CSPs)
• Cycle-cutset (CSPs, belief nets)
Variable elimination (inference, "propagation" of constraints, probabilities, cost functions):
• Dynamic programming (optimization)
• Adaptive consistency (CSPs)
• Join-tree propagation (CSPs, belief nets)
Conditioning: Backtracking Search
Complexity: O(exp(n))
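To make the conditioning idea concrete, here is a minimal backtracking sketch for a binary CSP; the `consistent` predicate and the map-coloring usage example are illustrative assumptions, not from the slides:

```python
def backtracking(variables, domains, consistent, assignment=None):
    """Depth-first search with chronological backtracking."""
    assignment = assignment or {}
    if len(assignment) == len(variables):
        return assignment                      # every variable assigned: a solution
    var = variables[len(assignment)]           # static variable ordering
    for value in domains[var]:
        assignment[var] = value
        if consistent(assignment):             # check constraints over assigned vars
            result = backtracking(variables, domains, consistent, assignment)
            if result is not None:
                return result
        del assignment[var]                    # dead-end: undo and try the next value
    return None                                # domain exhausted: backtrack

# Example: 2-color a path A - B - C (constraints A != B, B != C)
ok = lambda a: all(a[x] != a[y] for x in a for y in a
                   if (x, y) in {("A", "B"), ("B", "C")})
print(backtracking(["A", "B", "C"], {v: ["red", "green"] for v in "ABC"}, ok))
```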
Bucket Elimination: Adaptive Consistency (Dechter and Pearl, 1987)
Bucket E: E ≠ D, E ≠ C
Bucket D: D ≠ A
Bucket C: C ≠ B
Bucket B: B ≠ A
Bucket A:
Processing the buckets derives new constraints (e.g., D = C, B = A) and may expose a contradiction in bucket A.
Complexity: O(n exp(w*)), where w* is the induced width
Bucket-elimination and conditioning: a uniform framework
Unifying approach to different reasoning tasks
Understanding: commonality and differences
"Technology transfer"
Ease of implementation
Extensions to hybrids: conditioning + elimination
Approximations
Exact CSP techniques: complexity
Approximations Exact approaches can be intractable Approximate conditioning
• Local search, gradient descent (optimization, CSPs, SAT)
• Stochastic simulations (belief nets) Approximate elimination
• Local consistency enforcing (CSPs), local probability propagation (belief nets)
• Bounded resolution (SAT)• Mini-bucket approach (belief nets)
Hybrids (conditioning+elimination) Other approximations (e.g., variational)
“Road map” CSPs: complete algorithms
Variable Elimination Conditioning (Search)
CSPs: approximations Belief nets: complete algorithms Belief nets: approximations MDPs
Constraint Satisfaction
Applications:
Planning and scheduling
Configuration and design problems
Circuit diagnosis
Scene labeling
Temporal reasoning
Natural language processing
Constraint Satisfaction
Example: map coloring
Variables: countries (A, B, C, etc.)
Values: colors (e.g., red, green, yellow)
Constraints: A ≠ B, A ≠ D, D ≠ E, etc.
Allowed pairs for A ≠ B: (red, green), (red, yellow), (green, red), (green, yellow), (yellow, green), (yellow, red)
[Constraint graph over countries A, B, C, D, E, F, G]
Constraint Networks
Constraint network: a triple (X, D, C)
Variables: X = {X_1, ..., X_n}
Domains: D = {D_1, ..., D_n}, where D_i = {v_1, ..., v_k}
Constraints: C = {C_1, ..., C_l}
Constraint graph: a node per variable, an edge between each pair of constrained variables
A solution: an assignment of a value to each variable that satisfies all the constraints
Constraint Satisfaction Problem (CSP): find a solution
The Idea of Elimination
Eliminate variable E: join all relations mentioning E, then project E out, recording a new relation R_DBC over E's neighbors D, B, C
[Figure: eliminating E from the constraint graph connects D, B, C and records R_DBC]
Variable Elimination
Eliminate variables one by one: "constraint propagation"
Solution generation after elimination is backtrack-free
Elimination Operation: join followed by projection
Join operation over A finds all solutions satisfying constraints that involve A
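A minimal sketch of the join-followed-by-projection step, with relations represented as sets of tuples over named scopes (this encoding is an illustrative assumption, not the tutorial's):

```python
from itertools import product

def join(r1, scope1, r2, scope2):
    """Natural join of two relations given as sets of tuples with named scopes."""
    shared = [v for v in scope1 if v in scope2]
    scope = scope1 + [v for v in scope2 if v not in scope1]
    result = set()
    for t1, t2 in product(r1, r2):
        a1, a2 = dict(zip(scope1, t1)), dict(zip(scope2, t2))
        if all(a1[v] == a2[v] for v in shared):    # tuples agree on shared variables
            result.add(tuple({**a1, **a2}[v] for v in scope))
    return result, scope

def eliminate(rel, scope, var):
    """Project `var` out: drop its column from every tuple."""
    keep = [i for i, v in enumerate(scope) if v != var]
    return ({tuple(t[i] for i in keep) for t in rel},
            [v for v in scope if v != var])
```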
Bucket Elimination: Adaptive Consistency (Dechter and Pearl, 1987)
Complexity: O(n exp(w*(d))), where w*(d) is the induced width along ordering d

Example (graph coloring; domains {1,2}, one variable with domain {1,2,3}), two orderings:

Ordering d1 = (A, B, C, D, E):
Bucket E: E ≠ D, E ≠ C, E ≠ B
Bucket D: D ≠ A
Bucket C: C ≠ B
Bucket B: B ≠ A
Bucket A:
Processing the buckets records new relations such as R_DCB, R_ACB, R_AB, and R_A.

Ordering d2 = (E, B, C, D, A):
Bucket A: A ≠ D, A ≠ B
Bucket D: D ≠ E
Bucket C: C ≠ B, C ≠ E
Bucket B: B ≠ E
Bucket E:
Here the recorded relations include R_DBE, R_DB, and R_E: a different ordering yields a different induced width and different recorded constraints.
Induced Width
Width along ordering d: the maximum number of earlier neighbors ("parents") over all variables X_i
Induced width w*(d): the width of the ordered induced graph, obtained by recursively connecting the "parents" of each node, from i = n down to 1
Induced width (continued)
Finding a minimum-w* ordering is NP-complete (Arnborg, 1985)
Greedy ordering heuristics: min-width, min-degree, max-cardinality (Bertele and Brioschi, 1972; Freuder 1982)
Tractable classes: trees have w* = 1
w* of an ordering is computed in O(n) time, i.e., the complexity of elimination is easy to predict
Example: crossword puzzle
Crossword Puzzle: Adaptive consistency
Adaptive Consistency as "bucket elimination"
Initialize: partition the constraints into bucket_1, ..., bucket_n
For i = n down to 1:  // process buckets in reverse order
  For all relations R_1, ..., R_m in bucket_i:  // join all relations and "project out" X_i
    R_new = π_{scope − X_i}(R_1 ⋈ ... ⋈ R_m)
  If R_new is not empty, add it to bucket_k, where k is the largest variable index in R_new
  Else the problem is unsatisfiable
Return the set of all relations (old and new) in the buckets
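Putting the pieces together, a sketch of the bucket-processing loop, reusing the join/eliminate helpers from the earlier sketch (bucket placement follows the pseudocode above; the constraint encoding is my own):

```python
def adaptive_consistency(constraints, ordering):
    """Bucket elimination for CSPs. `constraints`: list of (relation, scope) pairs;
    `ordering`: list of variables. Returns buckets, or None if unsatisfiable."""
    index = {v: i for i, v in enumerate(ordering)}
    buckets = {v: [] for v in ordering}
    for rel, scope in constraints:                 # each constraint goes into the
        buckets[max(scope, key=index.get)].append((rel, scope))  # bucket of its latest var
    for var in reversed(ordering):                 # process buckets last to first
        if not buckets[var]:
            continue
        rel, scope = buckets[var][0]
        for r2, s2 in buckets[var][1:]:            # join all relations in the bucket
            rel, scope = join(rel, scope, r2, s2)
        new_rel, new_scope = eliminate(rel, scope, var)   # project out the bucket's var
        if not new_rel:
            return None                            # empty relation: unsatisfiable
        if new_scope:
            buckets[max(new_scope, key=index.get)].append((new_rel, new_scope))
    return buckets                                 # now backtrack-free along `ordering`
```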
Solving Trees (Mackworth and Freuder, 1985)
Adaptive consistency is linear for trees and equivalent to enforcing directional arc-consistency (recording only unary constraints)
Properties of bucket elimination (adaptive consistency)
Adaptive consistency generates a constraint network that is backtrack-free (can be solved without dead-ends).
The time and space complexity of adaptive consistency along ordering d is O(n exp(w*(d))).
Therefore, problems having bounded induced width are tractable (solved in polynomial time).
Examples of tractable problem classes: trees (w* = 1), series-parallel networks (w* ≤ 2), and in general k-trees (w* ≤ k).
“Road map” CSPs: complete algorithms
Variable Elimination Conditioning (Search)
CSPs: approximations Belief nets: complete algorithms Belief nets: approximations MDPs
The Idea of Conditioning
Conditioning is used by search algorithms. Complexity: exponential time, linear space.
Backtracking Search + Heuristics
Look-ahead schemes:
  Forward checking (Haralick and Elliott, 1980)
  MAC (full arc-consistency at each node) (Gaschnig 1977)
Look-back schemes:
  Backjumping (Gaschnig 1977, Dechter 1990, Prosser 1993)
  Backmarking (Gaschnig 1977)
  BJ+DVO (Frost and Dechter, 1994)
  Constraint learning (Dechter 1990, Frost and Dechter 1994, Bayardo and Miranker 1996)
"Vanilla" backtracking + variable/value ordering
Heuristics + constraint propagation + learning + ...
Search complexity distributions
Complexity histograms (dead-ends, time) => continuous distributions (Frost, Rish, and Vila 1997; Selman and Gomes 1997; Hoos 1998)
[Plot: frequency (probability) versus nodes explored in the search space]
Constraint Programming
Constraint solving embedded in programming languages
Allows flexible modeling combined with solving algorithms
Logic programs + forward checking
Eclipse, Ilog, OPL
Using only look-ahead schemes
Complete CSP algorithms: summary
Bucket elimination: adaptive consistency (CSP), directional resolution (SAT)
  Elimination operation: join-project (CSP), resolution (SAT)
  Time and space exponential in the induced width (given a variable ordering)
Conditioning: backtracking search + heuristics
  Time complexity: worst-case O(exp(n)), but average case is often much better
  Space complexity: linear
“Road map” CSPs: complete algorithms CSPs: approximations
Approximating elimination Approximating conditioning
Belief nets: complete algorithms Belief nets: approximations MDPs
Approximating Elimination: Local Constraint Propagation
Problem: bucket-elimination algorithms are intractable when the induced width is large
Approximation: bound the size of recorded dependencies, i.e., perform local constraint propagation (local inference)
Advantages: efficiency; may discover inconsistencies by deducing new constraints
Disadvantages: does not guarantee that a solution exists
From Global to Local Consistency
Constraint Propagation
• Arc-consistency, unit resolution, i-consistency
Example: X, Y, Z, T ∈ {1, 2, 3}, with constraints X < Y, Y = Z, T < Z, X < T
[Constraint graph over X, Y, T, Z]
Constraint Propagation
• Arc-consistency, unit resolution, i-consistency
Example: X, Y, Z, T ∈ {1, 2, 3}, with X < Y, Y = Z, T < Z, X < T
Enforcing arc-consistency prunes the domains to X = 1, T = 2, Y = Z = 3
• Incorporated into backtracking search
• Constraint programming languages: a powerful approach for modeling and solving combinatorial optimization problems
Arc-consistency
Only domain constraints are recorded:
R_A ← π_A(R_AB ⋈ D_B)
Example: D_X = {1, 2, 3}, D_Y = {1, 2, 3}; the constraint X < Y reduces the domain of X to {1, 2}.
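A standard arc-consistency (AC-3 style) sketch matching the X < Y example; this is my own formulation, not the tutorial's pseudocode:

```python
from collections import deque

def revise(domains, constraint, x, y):
    """Remove values of x that have no support in y under constraint(vx, vy)."""
    removed = False
    for vx in list(domains[x]):
        if not any(constraint(vx, vy) for vy in domains[y]):
            domains[x].remove(vx)
            removed = True
    return removed

def ac3(domains, constraints):
    """`constraints`: dict mapping a directed arc (x, y) to a test over value pairs."""
    queue = deque(constraints)
    while queue:
        x, y = queue.popleft()
        if revise(domains, constraints[(x, y)], x, y):
            if not domains[x]:
                return False                    # empty domain: inconsistent network
            queue.extend((z, x) for (z, w) in constraints if w == x and z != y)
    return True

# Example: X < Y over {1,2,3} prunes X to {1,2} and Y to {2,3}
doms = {"X": {1, 2, 3}, "Y": {1, 2, 3}}
cons = {("X", "Y"): lambda a, b: a < b, ("Y", "X"): lambda a, b: b < a}
ac3(doms, cons)
print(doms)
```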
Local consistency: i-consistency
i-consistency: any consistent assignment to any i-1 variables is consistent with at least one value of any i-th variable
strong i-consistency: k-consistency for every k ≤ i
directional i-consistency: given an ordering, each variable is i-consistent with any i-1 preceding variables
strong directional i-consistency: given an ordering, each variable is strongly i-consistent with any i-1 preceding variables
Directional i-consistency
Example (ordering A, B, C, D, E):
Bucket E: E ≠ B, E ≠ C, E ≠ D
Bucket D: D ≠ A, D ≠ C
Bucket C: C ≠ B
Bucket B: B ≠ A
Bucket A:
Adaptive consistency records R_DCB; directional path-consistency (d-path) records binary relations such as R_DC and R_DB; directional arc-consistency (d-arc) records only unary domain constraints such as R_D and R_C.
Enforcing Directional i-consistency
Directional i-consistency bounds the size of recorded constraints by i.
i = 1: arc-consistency
i = 2: path-consistency
For i ≥ w*, directional i-consistency is equivalent to adaptive consistency
Example: SAT
Elimination operation: resolution
Directional Resolution: adaptive consistency (Davis and Putnam, 1960; Dechter and Rish, 1994)
Bounded resolution: bounds the resolvent size
  BDR(i): directional i-consistency (Dechter and Rish, 1994)
  k-closure: full k-consistency (van Gelder and Tsuji, 1996)
In general: bounded induced-width resolution
  DCDR(b): generalizes the cycle-cutset idea; limits induced width by conditioning on cutset variables (Rish and Dechter 1996, Rish and Dechter 2000)
Directional Resolution Adaptive Consistency
DR complexity
|bucket_i| = O(exp(w*))
DR time and space: O(n exp(w*))
History 1960 – resolution-based Davis-Putnam algorithm
1962 – resolution step replaced by conditioning (Davis, Logemann and Loveland, 1962) to avoid memory explosion, resulting in a backtracking search algorithm known as Davis-Putnam (DP), or the DPLL procedure.
The dependency on induced width was not known in 1960.
1994 – Directional Resolution (DR), a rediscovery of the original Davis-Putnam, identification of tractable classes (Dechter and Rish, 1994).
DR versus DPLL: complementary properties
Uniform random 3-CNFs (large induced width)
(k,m)-tree 3-CNFs (bounded induced width)
Complementary properties => hybrids
BDR-DP(i): bounded resolution + backtracking
Complete algorithm: run BDR(i) as preprocessing before the Davis-Putnam backtracking algorithm.
Empirical results: random versus structured (low-w*) problems:
DCDR(b): Conditioning + DR
Idea: conditioning reduces w*; resolution guarantees tractability for low w*
If w*(X_i) ≤ b, resolve; otherwise condition
DCDR(b): empirical results
Adjustable trade-off:
b = 0: pure DPLL; b = ∞: pure DR; 0 < b < ∞: hybrid
Time: exp(|cond(b)| + w*(b)); space: exp(b)
Approximating Elimination: Summary
Key idea: local propagation, restricting the number of variables involved in recorded constraints
Examples: arc-, path-, and i-consistency (CSPs); bounded resolution, k-closure (SAT)
For SAT:
  bucket elimination = directional resolution (the original resolution-based Davis-Putnam)
  conditioning = DPLL (backtracking search)
  hybrids: bounded resolution + search = complete algorithms (BDR-DP(i), DCDR(b))
“Road map” CSPs: complete algorithms CSPs: approximations
Approximating elimination Approximating conditioning
Belief nets: complete algorithms Belief nets: approximations
MDPs
Approximating Conditioning: Local Search
Problem: complete (systematic, exhaustive) search can be intractable (O(exp(n)) worst-case)
Approximation idea: explore only parts of search space
Advantages: anytime answer; may “run into” a solution quicker than systematic approaches
Disadvantages: may not find an exact solution even if there is one; cannot detect that a problem is unsatisfiable
Simple "greedy" search
1. Generate a random assignment to all variables
2. Repeat until no improvement is made or a solution is found: // hill-climbing step
3.   Flip a variable (change its value) that increases the number of satisfied constraints

Easily gets stuck at local maxima
GSAT – local search for SAT (Selman, Levesque and Mitchell, 1992)
1. For i = 1 to MaxTries
2.   Select a random assignment A
3.   For j = 1 to MaxFlips
4.     if A satisfies all constraints, return A
5.     else flip a variable to maximize the score
6.       (number of satisfied constraints); if no variable
7.       assignment increases the score, flip at random
8.   end
9. end

Greatly improves hill-climbing by adding restarts and sideways moves
WalkSAT (Selman, Kautz and Cohen, 1994)
Adds random walk to GSAT:
  With probability p: random walk – flip a variable in some unsatisfied constraint
  With probability 1-p: perform a hill-climbing step
Randomized hill-climbing often solves large and hard satisfiable problems
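A minimal WalkSAT-style sketch over CNF clauses; the clause encoding and parameter values are illustrative assumptions, and the greedy branch is the GSAT-style hill-climbing step:

```python
import random

def walksat(clauses, n_vars, max_flips=10_000, p=0.5):
    """Clauses are lists of nonzero ints: literal v is true iff var |v| == (v > 0)."""
    assign = {v: random.choice([True, False]) for v in range(1, n_vars + 1)}
    sat = lambda lit: assign[abs(lit)] == (lit > 0)
    for _ in range(max_flips):
        unsat = [c for c in clauses if not any(sat(l) for l in c)]
        if not unsat:
            return assign                      # all clauses satisfied
        clause = random.choice(unsat)          # pick a violated clause
        if random.random() < p:                # random-walk step
            var = abs(random.choice(clause))
        else:                                  # greedy step: flip the variable that
            def score(v):                      # maximizes #satisfied clauses
                assign[v] = not assign[v]
                s = sum(any(sat(l) for l in c) for c in clauses)
                assign[v] = not assign[v]
                return s
            var = max((abs(l) for l in clause), key=score)
        assign[var] = not assign[var]
    return None                                # no solution found within max_flips

# Example: (x1 ∨ ¬x2) ∧ (x2 ∨ x3) ∧ (¬x1 ∨ ¬x3)
print(walksat([[1, -2], [2, 3], [-1, -3]], 3))
```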
Other approaches
Different flavors of GSAT with randomization (GenSAT by Gent and Walsh, 1993; Novelty by McAllester, Kautz and Selman, 1997)
Simulated annealing
Tabu search
Genetic algorithms
Hybrid approximations: elimination + conditioning
Approximating conditioning with elimination
Energy minimization in neural networks (Pinkas and Dechter, 1995)
For cycle-cutset nodes, use the greedy update function (relative to neighbors). For the rest of the nodes, run the arc-consistency algorithm followed by value assignment.
[Network of binary nodes X_i, X_j ∈ {0, 1}; the shaded nodes form the cycle-cutset]
GSAT with Cycle-Cutset (Kask and Dechter, 1996)
Input: a CSP, a partition of the variables into cycle-cutset and tree variables
Output: an assignment to all the variables
Within each try:
Generate a random initial assignment, then alternate between two steps:
1. Run the Tree algorithm (arc-consistency + assignment) on the problem with fixed values of cutset variables.
2. Run GSAT on the problem with fixed values of tree variables.
Results: GSAT with Cycle-Cutset (Kask and Dechter, 1996)
[Bar chart, GSAT versus GSAT+CC: number of problems solved (0 to 70) for cycle-cutset sizes 14, 22, 36, 43]
Results: GSAT with Cycle-Cutset (Kask and Dechter, 1996)
“Road map” CSPs: complete algorithms CSPs: approximations Bayesian belief nets: complete algorithms
Bucket-elimination Relation to: join-tree, Pearl’s poly-tree
algorithm, conditioning Belief nets: approximations MDPs
Belief Networks
BN = (G, Θ): conditional independencies give an efficient representation
P(S, C, B, X, D) = P(S) P(C|S) P(B|S) P(X|C,S) P(D|C,B)
[Network: Smoking -> lung Cancer, Bronchitis; X-ray depends on lung Cancer and Smoking; Dyspnoea depends on lung Cancer and Bronchitis]
Example CPD, P(D | C, B):
C B | D=0 D=1
0 0 | 0.1 0.9
0 1 | 0.7 0.3
1 0 | 0.8 0.2
1 1 | 0.9 0.1
Example: Printer Troubleshooting
Example: Car Diagnosis
What are they good for?
Diagnosis: P(cause | symptom) = ?
Prediction: P(symptom | cause) = ?
Classification: argmax over classes of P(class | data)
Decision-making (given a cost function)
Applications: medicine, bio-informatics, computer troubleshooting, stock market, text classification, speech recognition
Probabilistic Inference Tasks
Belief updating: BEL(X_i) = P(X_i = x_i | evidence)
Finding the most probable explanation (MPE): x* = argmax_x P(x, e)
Finding the maximum a-posteriori hypothesis (A ⊆ X: hypothesis variables):
  (a*_1, ..., a*_k) = argmax_a Σ_{X \ A} P(x, e)
Finding the maximum-expected-utility (MEU) decision (D ⊆ X: decision variables, U(x): utility function):
  (d*_1, ..., d*_k) = argmax_d Σ_{X \ D} P(x, e) U(x)
Belief Updating
[Network: Smoking, lung Cancer, Bronchitis, X-ray, Dyspnoea]
P(lung cancer = yes | smoking = no, dyspnoea = yes) = ?
"Moral" Graph
P(X_1, ..., X_n) = Π_{i=1..n} P(X_i | parents(X_i))
Each conditional probability distribution (CPD) corresponds to a clique ("family") in the moral graph
Belief updating: P(X | evidence) = ?
Example ("moral" graph over A, B, C, D, E):
P(a | e = 0) ∝ P(a, e = 0) = Σ_{b,c,d,e=0} P(a) P(b|a) P(c|a) P(d|b,a) P(e|b,c)
Variable elimination: push the summations in,
= P(a) Σ_c P(c|a) Σ_d Σ_{e=0} h_B(a, d, c, e),
where h_B(a, d, c, e) = Σ_b P(b|a) P(d|b,a) P(e|b,c)
Bucket elimination: Algorithm elim-bel (Dechter 1996)
bucket B: P(b|a), P(d|b,a), P(e|b,c)
bucket C: P(c|a), h_B(a, d, c, e)
bucket D: h_C(a, d, e)
bucket E: e = 0, h_D(a, e)
bucket A: P(a), h_E(a)
Elimination operator: summation (e.g., Σ_b in bucket B)
Result: P(a | e = 0)
W* = 4: "induced width" (max clique size)
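A small table-based sketch of the elim-bel computation; the factor encoding, helper names, and the toy example are my own, and evidence such as e = 0 can be absorbed by zeroing inconsistent table entries beforehand:

```python
from itertools import product

def eliminate_var(factors, var, domains):
    """One bucket step: multiply all factors mentioning `var`, then sum `var` out."""
    bucket = [f for f in factors if var in f["scope"]]
    rest = [f for f in factors if var not in f["scope"]]
    scope = sorted({v for f in bucket for v in f["scope"]} - {var})
    table = {}
    for vals in product(*(domains[v] for v in scope)):
        asn = dict(zip(scope, vals))
        total = 0.0
        for x in domains[var]:                  # summation: the elimination operator
            asn[var] = x
            p = 1.0
            for f in bucket:                    # product of the bucket's functions
                p *= f["table"][tuple(asn[v] for v in f["scope"])]
            total += p
        table[vals] = total
    return rest + [{"scope": scope, "table": table}]   # the new h joins a later bucket

def elim_bel(factors, domains, query, ordering):
    """Eliminate every variable except `query`, then combine and normalize."""
    for var in ordering:
        if var != query:
            factors = eliminate_var(factors, var, domains)
    bel = {x: 1.0 for x in domains[query]}
    for f in factors:                           # remaining scopes are {query} or {}
        for x in domains[query]:
            bel[x] *= f["table"][(x,) if f["scope"] else ()]
    z = sum(bel.values())
    return {x: p / z for x, p in bel.items()}

# Toy example: P(A) and P(B|A); query BEL(A) with no evidence
fA = {"scope": ["A"], "table": {(0,): 0.6, (1,): 0.4}}
fB = {"scope": ["A", "B"], "table": {(0, 0): 0.9, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.8}}
print(elim_bel([fA, fB], {"A": [0, 1], "B": [0, 1]}, "A", ["B", "A"]))
```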
Finding the MPE: Algorithm elim-mpe (Dechter 1996)
MPE = max_x P(x)
The summation of elim-bel is replaced by maximization:
MPE = max_{a,e,d,c,b} P(a) P(c|a) P(b|a) P(d|b,a) P(e|b,c)
bucket B: P(b|a), P(d|b,a), P(e|b,c)
bucket C: P(c|a), h_B(a, d, c, e)
bucket D: h_C(a, d, e)
bucket E: e = 0, h_D(a, e)
bucket A: P(a), h_E(a)
Elimination operator: max_b (etc.)
Result: MPE
W* = 4: "induced width" (max clique size)
Generating the MPE tuple
Buckets after elimination:
B: P(b|a), P(d|b,a), P(e|b,c)
C: P(c|a), h_B(a, d, c, e)
D: h_C(a, d, e)
E: e = 0, h_D(a, e)
A: P(a), h_E(a)
Assign values in reverse order:
1. a' = argmax_a P(a) h_E(a)
2. e' = 0
3. d' = argmax_d h_C(a', d, e')
4. c' = argmax_c P(c|a') h_B(a', d', c, e')
5. b' = argmax_b P(b|a') P(d'|b, a') P(e'|b, c')
Return (a', b', c', d', e')
Complexity of elimination: O(n exp(w*(d)))
w*(d): the induced width of the moral graph along ordering d
The effect of the ordering: for the example "moral" graph over A, B, C, D, E, one ordering gives w*(d_1) = 4 while another gives w*(d_2) = 2
Other tasks and algorithms
MAP and MEU tasks:
  Similar bucket-elimination algorithms: elim-map, elim-meu (Dechter 1996)
  Elimination operation: either summation or maximization
  Restriction on variable ordering: summation must precede maximization (i.e., hypothesis or decision variables are eliminated last)
Other inference algorithms:
  Join-tree clustering
  Pearl's poly-tree propagation
  Conditioning, etc.
Relationship with join-tree clustering
Ordering: A, B, C, D, E
bucket(E): P(e|b,c)
bucket(D): P(d|a,b)
bucket(C): P(c|a), h(a, b)
bucket(B): P(b|a), h(a, b)
bucket(A): P(a), h(a)
A cluster of the join tree is a set of buckets (a "super-bucket"); here the clusters correspond to the cliques ADB, ABC, and BCE
Relationship with Pearl's belief propagation in poly-trees
Pearl's belief propagation for a single-root query = elim-bel using a topological ordering and super-buckets for families
[Poly-tree: parents Z_1, Z_2, Z_3 of nodes U_1, U_2, U_3, which are parents of X_1, with child Y_1; π messages such as π_{Z_1}(u_1) carry "causal support", λ messages such as λ_{Y_1}(x_1) carry "diagnostic support"]
elim-bel, elim-mpe, and elim-map are linear for poly-trees.
Conditioning generates the probability tree
P(a, e = 0) = Σ_{b,c,d,e=0} P(a) P(b|a) P(c|a) P(d|b,a) P(e|b,c), computed by branching on the values of the conditioned variables
Complexity of conditioning: exponential time, linear space
Conditioning + Elimination
Idea: condition on variables until the w* of each remaining (sub)problem gets small, then switch to elimination
Super-bucket elimination (Dechter and El Fattah, 1996)
Eliminating several variables "at once"
Conditioning is done only in super-buckets
The idea of super-buckets
Larger super-buckets (cliques) => more time but less space
Complexity:
1. Time: exponential in clique (super-bucket) size
2. Space: exponential in separator size
Application: circuit diagnosis
Problem: given a circuit and its unexpected output, identify faulty components. The problem can be modeled as a constraint optimization problem and solved by bucket elimination.
Time-Space Tradeoff
“Road map” CSPs: complete algorithms CSPs: approximations Belief nets: complete algorithms Belief nets: approximations
Local inference: mini-buckets Stochastic simulations Variational techniques
MDPs
Mini-buckets: “local inference”
The idea is similar to i-consistency: bound the size of recorded dependencies
Computation in a bucket is time and space exponential in the number of variables involved
Therefore, partition the functions in a bucket into "mini-buckets" defined over smaller numbers of variables
Mini-bucket approximation: MPE task
Split a bucket into mini-buckets => bound the complexity
Computing one function on all n bucket variables is replaced by computing separate functions h and g on subsets: an exponential complexity decrease, O(e^n) → O(e^r) + O(e^{n−r})
Approx-mpe(i)
Input: i – the max number of variables allowed in a mini-bucket
Output: [lower bound (P of a suboptimal solution), upper bound]
Example: approx-mpe(3) versus elim-mpe (the effective induced width drops from w* = 4 to w* = 2)
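A sketch of the partitioning step behind approx-mpe(i), using a greedy first-fit strategy; the factor encoding matches the elim-bel sketch above and is my own, not the authors' code. Maximizing each mini-bucket separately and multiplying the results yields an upper bound, since the max of a product is at most the product of the maxes:

```python
def partition_bucket(bucket, i):
    """Split a bucket's functions into mini-buckets whose joint scope
    contains at most i variables (greedy first-fit)."""
    minibuckets = []                            # entries: [functions, joint scope]
    for f in bucket:
        for mb in minibuckets:
            joint = mb[1] | set(f["scope"])
            if len(joint) <= i:                 # fits within the bound: reuse it
                mb[0].append(f)
                mb[1] = joint
                break
        else:
            minibuckets.append([[f], set(f["scope"])])   # open a new mini-bucket
    return minibuckets
```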
Properties of approx-mpe(i)
Complexity: O(exp(2i)) time and O(exp(i)) space.
Accuracy: determined by upper/lower (U/L) bound.
As i increases, both accuracy and complexity increase.
Possible use of mini-bucket approximations: As anytime algorithms (Dechter and Rish, 1997) As heuristics in best-first search (Kask and Dechter,
1999)
Other tasks: similar mini-bucket approximations for: belief updating, MAP and MEU (Dechter and Rish, 1997)
Anytime Approximation: anytime-mpe(ε)
Initialize: i = 0
While time and space resources are available:
  i = i + 1
  U = upper bound computed by approx-mpe(i)
  L = lower bound computed by approx-mpe(i)
  Keep the best solution found so far
  If U/L ≤ 1 + ε, return the solution
Return the largest L and the smallest U found so far
Empirical Evaluation (Dechter and Rish, 1997; Rish, 1999)
Randomly generated networks:
  Uniform random probabilities
  Random noisy-OR
CPCS networks
Probabilistic decoding
Comparing approx-mpe and anytime-mpe versus elim-mpe
Random networks
Uniform random: 60 nodes, 90 edges (200 instances)
In 80% of cases, 10-100 times speed-up while U/L < 2
Noisy-OR, P(x = 0 | y_1, ..., y_n) = Π_{y_i = 1} q_i (q_i: a random noise parameter): even better results
Exact elim-mpe was infeasible; approx-mpe took 0.1 to 80 sec.
CPCS networks – medical diagnosis (noisy-OR model)
Test case: no evidence
[Plot: anytime-mpe(0.0001) U/L error versus time for cpcs360b and cpcs422b, i = 1 to 21]

Algorithm            Time (sec), cpcs360   Time (sec), cpcs422
elim-mpe             115.8                 1697.6
anytime-mpe(10^-4)   70.3                  505.2
anytime-mpe(10^-1)   70.3                  110.5
The effect of evidence
[Histograms of log(U/L) for i = 10 on 1000 instances each of random (unlikely) evidence and likely evidence]
More likely evidence => higher MPE => higher accuracy (why?)
Probabilistic decoding
Error-correcting linear block code
State-of-the-art: approximate algorithm – iterative belief propagation (IBP) (Pearl's poly-tree algorithm applied to loopy networks)
Iterative Belief Propagation (IBP)
Belief propagation is exact for poly-trees
IBP: applying BP iteratively to cyclic networks
No guarantees for convergence
Works well for many coding networks
[Figure: one update step for BEL(U_1), combining π and λ messages in a two-layer network over U_1, U_2, U_3 and X_1, X_2]
approx-mpe vs. IBP
approx-mpe is better on low-w* codes; IBP is better on randomly generated (high-w*) codes
Bit error rate (BER) as a function of noise (sigma):
Mini-buckets: summary
Mini-buckets: a local inference approximation
Idea: bound the size of recorded functions
Approx-mpe(i): the mini-bucket algorithm for MPE
Better results for noisy-OR than for random problems
Accuracy increases with decreasing noise
Accuracy increases for likely evidence
Sparser graphs -> higher accuracy
Coding networks: approx-mpe outperforms IBP on low-induced-width codes
Heuristic search
Mini-buckets record upper-bound heuristics
The evaluation function over a partial assignment x^p = (x_1, ..., x_p):
f(x^p) = g(x^p) · h(x^p), where
g(x^p) = Π_i P(x_i | pa_i), the product of the instantiated CPDs, and
h(x^p) = Π_j h_j, the product of the functions h_j recorded in bucket p
Best-first: expand the node with the maximal evaluation function
Branch-and-bound: prune when the upper bound f does not exceed the best solution found so far
Properties:
  An exact algorithm
  Better heuristics lead to more pruning
Heuristic Function
Given a cost function:
P(a,b,c,d,e) = P(a) · P(b|a) · P(c|a) · P(e|b,c) · P(d|b,a)
Define an evaluation function over a partial assignment as the probability of its best extension:
f*(a,e,d) = max_{b,c} P(a,b,c,d,e)
          = P(a) · max_{b,c} P(b|a) · P(c|a) · P(e|b,c) · P(d|a,b)
          = g(a,e,d) · H*(a,e,d)
[Figure: the search tree over the ordering A, E, D, B with values 0/1]
Heuristic Function
H*(a,e,d) = max_{b,c} P(b|a) · P(c|a) · P(e|b,c) · P(d|a,b)
          = max_c P(c|a) · max_b P(e|b,c) · P(b|a) · P(d|a,b)
          ≤ max_c P(c|a) · max_b P(e|b,c) · max_b P(b|a) · P(d|a,b)
          = H(a,e,d)
f(a,e,d) = g(a,e,d) · H(a,e,d) ≥ f*(a,e,d)
The heuristic function H is compiled during the preprocessing stage of the mini-bucket algorithm.
Heuristic Function
The evaluation function f(x^p) can be computed using the functions recorded by the mini-bucket scheme, and estimates the probability of the best extension of the partial assignment x^p = {x_1, ..., x_p}: f(x^p) = g(x^p) · H(x^p)
Buckets:
max_B: P(e|b,c), P(d|a,b), P(b|a)
max_C: P(c|a), h^B(e,c)
max_D: h^B(d,a)
max_E: h^C(e,a)
max_A: P(a), h^E(a), h^D(a)
For example, H(a,e,d) = h^B(d,a) · h^C(e,a) and g(a,e,d) = P(a)
Properties
The heuristic is monotone
The heuristic is admissible
The heuristic is computed in linear time
IMPORTANT: mini-buckets generate heuristics of varying strength using the control parameter (bound) i
Higher bound -> more preprocessing -> stronger heuristics -> less search
Allows a controlled trade-off between preprocessing and search
Empirical Evaluation of mini-bucket heuristics
[Plots: % solved exactly versus time (sec) for BBMB and BFMB with i = 2, 6, 10, 14; Random Coding, K = 100, noise 0.32]
“Road map” CSPs: complete algorithms CSPs: approximations Belief nets: complete algorithms Belief nets: approximations
Local inference: mini-buckets Stochastic simulations Variational techniques
MDPs
Stochastic Simulation
Forward sampling (logic sampling)
Likelihood weighting
Markov Chain Monte Carlo (MCMC): Gibbs sampling
Approximation via Sampling
1. Generate samples s^1, ..., s^N from P(X), where s^i = (x^i_1, ..., x^i_n)
2. Estimate probabilities by frequencies:
   P(Y = y) ≈ #{samples with Y = y} / N
3. How to handle evidence E?
   - acceptance-rejection (e.g., forward sampling)
   - "clamping" evidence nodes to their values:
     * likelihood weighting
     * Gibbs sampling (MCMC)
Forward Sampling (logic sampling; Henrion, 1988)
Input: evidence E, number of samples N, an ancestral ordering (X_1, ..., X_n) of the nodes
Output: samples consistent with E
1. For sample = 1 to N:
2.   For i = 1 to n:
3.     Sample x_i from P(x_i | pa_i)
4.     If X_i ∈ E and x_i disagrees with the evidence, reject the sample:
5.       goto steps 1 and 2
Forward sampling (example)
Network: X_1 -> X_2, X_1 -> X_3, (X_2, X_3) -> X_4, with CPDs P(x_1), P(x_2|x_1), P(x_3|x_1), P(x_4|x_2,x_3)
Evidence: X_3 = 0
// generate sample k
1. Sample x_1 from P(x_1)
2. Sample x_2 from P(x_2 | x_1)
3. Sample x_3 from P(x_3 | x_1)
4. If x_3 disagrees with the evidence (x_3 ≠ 0), reject the sample and start from step 1; otherwise
5. Sample x_4 from P(x_4 | x_2, x_3)

Drawback: high rejection rate!
Likelihood Weighting (Fung and Chang, 1990; Shachter and Peot, 1990)
"Clamping" evidence + forward sampling + weighting samples by the evidence likelihood
1. For each X_i ∈ E, assign x_i = e_i
2. Find an ancestral ordering (X_1, ..., X_n) of the nodes
3. For sample = 1 to N:
4.   For each X_i ∉ E:
5.     Sample x_i from P(x_i | pa_i)
6.   score(sample) = Π_{X_i ∈ E} P(e_i | pa_i)
7. Normalize the scores; then P(Y = y | e) ≈ Σ_{samples where Y = y} score(sample)

Works well for likely evidence!
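A matching likelihood-weighting sketch for the same hypothetical network, reusing sample_from from the forward-sampling sketch above; the evidence node is clamped and the sample is weighted by P(e_i | pa_i):

```python
def likelihood_weighting(evidence, n_samples):
    """Weighted samples for the X1 -> {X2, X3} -> X4 network above."""
    weighted = []
    for _ in range(n_samples):
        w = 1.0
        x1 = sample_from({0: 0.6, 1: 0.4})
        x2 = sample_from({0: 0.7, 1: 0.3} if x1 == 0 else {0: 0.2, 1: 0.8})
        p3 = {0: 0.5, 1: 0.5} if x1 == 0 else {0: 0.1, 1: 0.9}
        if "X3" in evidence:
            x3 = evidence["X3"]                # clamp the evidence node
            w *= p3[x3]                        # weight by its likelihood
        else:
            x3 = sample_from(p3)
        x4 = sample_from({0: 0.9, 1: 0.1} if x2 == x3 else {0: 0.3, 1: 0.7})
        weighted.append(({"X1": x1, "X2": x2, "X3": x3, "X4": x4}, w))
    return weighted

# P(X4 = 1 | X3 = 0) as a weighted frequency
ws = likelihood_weighting({"X3": 0}, 1000)
print(sum(w for t, w in ws if t["X4"] == 1) / sum(w for _, w in ws))
```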
Gibbs Sampling (Geman and Geman, 1984)
Markov Chain Monte Carlo (MCMC): create a Markov chain of samples
1. For each X_i ∈ E, assign x_i = e_i
2. For each X_i ∉ E, assign a random value x_i
3. For sample = 1 to N:
4.   For each X_i ∉ E:
5.     Sample x_i from P(x_i | x \ {x_i})
Advantage: guaranteed to converge to P(X)
Disadvantage: convergence may be slow
Gibbs Sampling (cont'd) (Pearl, 1988)
Important: P(x_i | x \ {x_i}) is computed locally:
P(x_i | x \ {x_i}) ∝ P(x_i | pa_i) · Π_{X_j ∈ ch(X_i)} P(x_j | pa_j)
Markov blanket: M(X_i) = pa_i ∪ ch(X_i) ∪ {parents of ch(X_i)}
Given its Markov blanket (parents, children, and their parents), X_i is independent of all other nodes
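A generic Gibbs-sampling sketch over a factored model; the factor-as-callable encoding is an assumption for illustration, and note that only factors touching the resampled variable (its Markov blanket) enter the update:

```python
import random

def gibbs(variables, domains, factors, evidence, n_samples):
    """factors: list of (scope, fn) where fn(assignment_dict) -> probability."""
    state = dict(evidence)
    for v in variables:                        # random init for non-evidence vars
        if v not in state:
            state[v] = random.choice(domains[v])
    samples = []
    for _ in range(n_samples):
        for v in variables:
            if v in evidence:
                continue
            weights = []
            for val in domains[v]:             # P(v | rest), up to normalization
                state[v] = val
                w = 1.0
                for scope, fn in factors:
                    if v in scope:             # only the Markov blanket matters
                        w *= fn(state)
                weights.append(w)
            r = random.uniform(0, sum(weights))
            for val, w in zip(domains[v], weights):
                r -= w
                if r <= 0:
                    state[v] = val             # resample v from its conditional
                    break
        samples.append(dict(state))
    return samples
```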
“Road map” CSPs: complete algorithms CSPs: approximations Belief nets: complete algorithms Belief nets: approximations
Local inference: mini-buckets Stochastic simulations Variational techniques
MDPs
Variational Approximations
Idea: a variational transformation of the CPDs simplifies inference
Advantages:
  Compute upper and lower bounds on P(Y)
  Usually faster than sampling techniques
Disadvantages:
  More complex and less general: must be re-derived for each particular form of CPD functions
Variational bounds: example
log(x) ≤ λx − log λ − 1 for any λ > 0, and
log(x) = min_λ {λx − log λ − 1}
λ: the variational parameter
This approach can be generalized to any concave (convex) function in order to compute its upper (lower) bounds
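A quick sanity check of this bound (my own verification, not from the slides): minimizing the right-hand side over λ recovers log(x) exactly.

```latex
g(\lambda) = \lambda x - \log\lambda - 1, \qquad
g'(\lambda) = x - \tfrac{1}{\lambda} = 0 \;\Rightarrow\; \lambda^{*} = \tfrac{1}{x}, \qquad
g(\lambda^{*}) = 1 - \log\tfrac{1}{x} - 1 = \log x .
```

So every fixed λ > 0 gives an upper bound, with equality at λ = 1/x.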
Convex duality approach (Jaakkola and Jordan, 1997)
1. If f(x) is concave, it has a concave dual f*(λ) such that
   f(x) = min_λ {λ^T x − f*(λ)},  f*(λ) = min_x {λ^T x − f(x)},
   which yields upper bounds: f(x) ≤ λ^T x − f*(λ)
2. For convex f(x), we get lower bounds
Example: QMR-DT network (Quick Medical Reference – Decision-Theoretic; Shwe et al., 1991)
[Two-layer network: ~600 disease nodes d_1, ..., d_k; ~4000 finding nodes f_1, ..., f_n]
Noisy-OR model:
P(f_i = 0 | d) = (1 − q_{i0}) Π_{j ∈ pa_i} (1 − q_{ij})^{d_j} = exp(−θ_{i0} − Σ_{j ∈ pa_i} θ_{ij} d_j),
where θ_{ij} = −log(1 − q_{ij})
Inference in QMR-DT
P(d | f) ∝ P(f | d) P(d) = [Π_i P(f_i | d)] · [Π_j P(d_j)]
Negative findings (f_i = 0) are factorized: exp(−θ_{i0} − Σ_{j ∈ pa_i} θ_{ij} d_j) decomposes into per-disease factors
Positive findings, P(f_i = 1 | d) = 1 − exp(−θ_{i0} − Σ_{j ∈ pa_i} θ_{ij} d_j), "couple" the disease nodes
Inference: P(d_1 | f) = Σ_{d_2, ..., d_k} P(d | f)
Inference complexity: O(exp(min{p, k})), where p = # of positive findings, k = max family size (Heckerman, 1989 ("Quickscore"); Rish and Dechter, 1998)
Variational approach to QMR-DT (Jaakkola and Jordan, 1997)
f(x) = ln(1 − e^{−x}) is concave and has the concave dual f*(λ) = −λ ln λ + (λ + 1) ln(λ + 1)
Then P(f_i = 1 | d) = 1 − exp(−θ_{i0} − Σ_{j ∈ pa_i} θ_{ij} d_j) can be bounded by
P(f_i = 1 | d) ≤ exp(λ_i (θ_{i0} + Σ_{j ∈ pa_i} θ_{ij} d_j) − f*(λ_i))
The effect of positive evidence is now factorized (the diseases are "decoupled")
Variational approach (cont.)
Bounds on the local CPDs yield a bound on the posterior
Two approaches: sequential and block
Sequential: applies the variational transformation to (a subset of) nodes sequentially during inference, using a heuristic node ordering, and then optimizes over the variational parameters
Block: selects the nodes to be transformed in advance, then selects variational parameters minimizing the KL-distance between the true and approximate posteriors
Block approach
P(Y | E): the exact posterior given evidence E
Q(Y | E, λ): the approximation after replacing some CPDs with their variational bounds
Find λ* = argmin_λ D(Q || P), where D(Q || P) is the Kullback-Leibler (KL) distance:
D(Q || P) = Σ_S Q(S) log(Q(S) / P(S))
Variational approach: summary
Variational approximations have been successfully applied to inference in QMR-DT and in neural networks (logistic functions), and to learning (the approximate E-step in the EM algorithm)
For more details, see: Saul, Jaakkola, and Jordan, 1996; Jaakkola and Jordan, 1997; Neal and Hinton, 1998; Jordan, 1999
"Road map"
CSPs: complete algorithms
CSPs: approximations
Belief nets: complete algorithms
Belief nets: approximations
MDPs: elimination and conditioning
Decision-Theoretic Planning State = {X, Y, Battery_Level} Actions = {Go_North, Go_South, Go_West, Go_East} Probability of success = P Task: reach the goal location ASAP
Example: robot navigation
Dynamic Belief Networks (DBNs)
Two-stage influence diagram; interaction graph
Markov Decision Process
1. State x = {x_1, ..., x_n}, domain D_x, state space Ω_x = D_x^n
2. Action a = {a_1, ..., a_m}, domain D_a, action space Ω_a = D_a^m
3. P – transition probabilities P(y | x, a)
4. r(x, a) – reward for taking action a in state x
5. N – number of time slices
6. Problem: find an optimal policy π
   Finite-horizon MDP (N < ∞): π = (d_1, ..., d_N), d_t: Ω_x → Ω_a
   Infinite-horizon MDP (N = ∞): π: Ω_x → Ω_a, with value V^π(x) = r(x, π(x)) + Σ_y P(y | x, π(x)) V^π(y)
7. Criterion: maximum expected total (discounted) reward
Dynamic Programming: Elimination
Optimality equation:
V_t(x_t) = max_{a_t} { r(x_t, a_t) + Σ_{x_{t+1}} P(x_{t+1} | x_t, a_t) V_{t+1}(x_{t+1}) },  V_N(x_N) = r(x_N)
Complexity of dynamic programming: O(N |Ω_A| |Ω_X|^2) = O(N |D_A|^m |D_X|^{2n})
Decomposable utilities and probabilities:
r(x^t, a^t) = Σ_i r_i(x_i^t, a^t),  P(x^{t+1} | x^t, a^t) = Π_i P(x_i^{t+1} | pa(X_i^{t+1}))
Bucket Elimination
Complexity: O(exp(w*))
MDPs: Elimination and Conditioning
Finite-horizon MDPs: dynamic programming = elimination along the temporal ordering (N slices)
Infinite-horizon MDPs:
  Value Iteration (VI) = elimination along the temporal ordering (iterative)
  Policy Iteration (PI) = conditioning on A_j, elimination on X_j (iterative)
Bucket elimination allows "non-temporal" orderings
Complexity: O(exp(w*)), w* ≤ 2n
MDPs: approximations Open directions for further research:
Applying probabilistic inference approximations to DBNs
Handling actions (rewards)
Approximating elimination, heuristic search, etc.
Conclusions
Common reasoning approaches: elimination and conditioning
Exact reasoning is often intractable => need approximations
Approximation principles:
  Approximating elimination: local inference, bounding the size of dependencies among variables (cliques in the problem's graph); e.g., mini-buckets, IBP, i-consistency enforcing
  Approximating conditioning: local search, stochastic simulations
  Other approximations: variational techniques, etc.
Further research:
  Combining "orthogonal" approximation approaches
  Better understanding of "what works well where": which approximation suits which problem structure
  Other approximation paradigms (e.g., other ways of approximating probabilities, constraints, cost functions)