Carnegie Mellon
Evidence-Specific Structures for Rich Tractable CRFs
Anton Chechetka, Carlos Guestrin
Motivation

We want P(structured Query | Evidence):
• Collaborative filtering
• Webpage classification: P(page types | webpage text + links), with labels such as professor, student, project
• Face recognition in image collections: P(face labels | collection of images + face similarities)
Conditional Random Fields

$$P(Q \mid E) = \frac{1}{Z(E)} \exp\Big(\sum_{i,j} w_{ij}\, f_{ij}(Q_i, Q_j, E)\Big)$$

with features $f_{ij}$, weights $w_{ij}$, and normalization $Z(E)$.
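As a concrete illustration, here is a minimal sketch (a toy model with illustrative names and values, not the paper's code) that evaluates this distribution by brute-force enumeration over two binary query variables:

```python
# Toy pairwise CRF over two binary query variables Q1, Q2.
import itertools
import math

def f_12(q1, q2, e):
    """Toy pairwise feature: fires when Q1 and Q2 agree and the evidence is 1."""
    return float(q1 == q2 and e == 1)

w_12 = 1.5  # feature weight (arbitrary toy value)

def unnormalized(q, e):
    q1, q2 = q
    return math.exp(w_12 * f_12(q1, q2, e))

def crf_probability(q, e):
    """P(Q=q | E=e): exponentiated weighted features divided by Z(e)."""
    Z = sum(unnormalized(q_hat, e)
            for q_hat in itertools.product([0, 1], repeat=2))
    return unnormalized(q, e) / Z

print(crf_probability((1, 1), e=1))  # agreement is up-weighted when E=1
print(crf_probability((1, 0), e=1))
```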
Dense models

[Figure: evidence E connected to query variables Q through pairwise features f12, f34, …; the features induce a structure over Q.]
Features can be arbitrarily correlated, the objective is convex with a unique global optimum, and the gradient is intuitive (sketched in code after the list below):

$$\frac{\partial \log P(q \mid e)}{\partial w_{ij}} = \underbrace{f_{ij}(q, e)}_{\text{feature}} - \underbrace{\mathbb{E}_{\hat q \sim P(Q \mid e)}\big[f_{ij}(\hat q, e)\big]}_{\text{expected feature}}$$

Computing the expected feature needs inference in the induced model:
• Exact inference: #P-complete
• Approximate inference: NP-complete
• Hopeless in large dense models
• Easy for tree-structured models
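Continuing the toy model above, a sketch (again illustrative, with enumeration standing in for real inference) of the feature-minus-expected-feature gradient:

```python
# Gradient of the toy CRF log-likelihood w.r.t. w_12, by enumeration; in a
# real model the expected feature would come from (tree) inference.
import itertools
import math

def f_12(q1, q2, e):
    return float(q1 == q2 and e == 1)

def grad_loglik_w12(q_obs, e, w_12):
    """d log P(q|e) / d w_12 = f_12(q, e) - E_{q^ ~ P(Q|e)}[f_12(q^, e)]."""
    states = list(itertools.product([0, 1], repeat=2))
    scores = [math.exp(w_12 * f_12(q1, q2, e)) for q1, q2 in states]
    Z = sum(scores)
    expected = sum(s / Z * f_12(q1, q2, e)
                   for s, (q1, q2) in zip(scores, states))
    return f_12(*q_obs, e) - expected

print(grad_loglik_w12((1, 1), e=1, w_12=0.0))  # positive: increase w_12
```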
Model Structure Tradeoffs

Dense models:
+ Capture complex dependencies
+ Natural extensions to relational settings
− Arbitrarily bad inference quality
− Arbitrarily bad parameter quality

Tree models:
+ Efficient exact inference
+ Efficient learning of optimal parameters
− Simple dependencies only
− Relational settings are not tree-structured
This work: keep efficient exact inference and optimal parameter learning, while enabling rich dependencies and relational extensions.
CRF with Evidence-Specific Structure (formalism):

$$P(Q \mid E) = \frac{1}{Z(E)} \exp\Big(\sum_{i,j} \mathbb{I}\big(ij \in T(E, u)\big)\, w_{ij}\, f_{ij}(Q_i, Q_j, E)\Big)$$

The indicator $\mathbb{I}(ij \in T(E, u))$ encodes the evidence-specific structure; $w_{ij} f_{ij}(\cdot)$ are the standard weighted features. $T(E, u)$ is the output of a structure selection algorithm with structure selection parameters $u$.
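A minimal sketch of this gated form, with a hard-coded stand-in for the structure selection algorithm T(E, u) (everything here is illustrative):

```python
# ESS-CRF on three binary variables: the log-linear form above, with each
# edge term gated by the indicator I(ij in T(e, u)).
import itertools
import math

EDGES = [(0, 1), (1, 2), (0, 2)]                 # dense candidate edge set
W = {(0, 1): 1.0, (1, 2): 0.8, (0, 2): 0.5}      # toy weights, one per edge

def feature(qi, qj, e):
    return float(qi == qj)                        # toy agreement feature

def T_of(e, u=None):
    """Stand-in structure selector: different evidence -> different tree."""
    return {(0, 1), (1, 2)} if e == 0 else {(0, 1), (0, 2)}

def ess_crf_prob(q, e):
    tree = T_of(e)
    def score(q_hat):
        return math.exp(sum(W[ij] * feature(q_hat[ij[0]], q_hat[ij[1]], e)
                            for ij in EDGES if ij in tree))
    Z = sum(score(q_hat) for q_hat in itertools.product([0, 1], repeat=3))
    return score(q) / Z

print(ess_crf_prob((1, 1, 1), e=0), ess_crf_prob((1, 1, 1), e=1))
```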
Example: "battery is good" and "engine starts" are dependent in general, but given the specific evidence E = {gas tank is empty} there is no dependence: the engine will not start regardless of the battery.
Our Approach: Evidence-Specific Structures

Intuition: edge importance depends on evidence.

[Figure: fixed dense model × evidence-specific tree "mask" = evidence-specific model, shown for evidence values E=e1, E=e2, E=e3.]

• Capture all potential dependencies in the fixed dense model
• Select the most important tree specific to the evidence value

Select tree structures, based on evidence, to capture the most important dependencies:
• T(E, u) encodes the output of a structure selection algorithm
• Global perspective on structure selection
• Easy to guarantee tree structure (by selecting an appropriate algorithm); looking at one edge at a time is not enough, because being a tree is a global property
Properties:
• Objective still convex in w (but not in u)
• Efficient exact inference
• Efficient learning of optimal parameters w
• Much richer class of models than fixed trees (potential for capturing complex correlations)
• Structure selection decoupled from feature design and weights (can use an arbitrarily dense model as the basis)
Learning an ESS-CRF model:
1. Choose features f
2. Choose a tree learning algorithm T(E, ·)
3. Learn u
4. Select evidence-specific trees T(ei, u) for every datapoint (E=ei, Q=qi) [u is fixed at this stage]
5. Given u and the trees T(ei, u), learn w [L-BFGS, etc.]
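The five stages, as a high-level skeleton; each argument is a hypothetical stand-in for the corresponding component (a sketch of the pipeline shape, not the authors' implementation):

```python
def learn_ess_crf(data, features, fit_pairwise, chow_liu, fit_weights):
    """data: list of (e_i, q_i) datapoints; returns (u, w)."""
    # Stages 1-2: the features f and the tree-learning algorithm T(E, .)
    # are chosen up front and passed in as `features` and `chow_liu`.
    # Stage 3: learn structure-selection parameters u.
    u = fit_pairwise(data)
    # Stage 4: with u fixed, select an evidence-specific tree per datapoint.
    trees = [chow_liu(e_i, u) for (e_i, q_i) in data]
    # Stage 5: given u and the trees, learn weights w on the exact convex
    # objective (L-BFGS, conjugate gradient, ...).
    w = fit_weights(data, features, trees)
    return u, w
```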
Fewer Sources of Errors

Algorithm stage          Dense CRFs                     ESS-CRFs (this work)
Structure selection      Approximate                    Approximate
Feature weight learning  Approximate (no guarantees)    Exact
Test-time inference      Approximate (no guarantees)    Exact
Learning Good Evidence-Specific Trees

Directly generalize existing algorithms for the no-evidence case:
• No evidence: pairwise marginals P(Qi, Qj) + Chow-Liu algorithm = optimal tree for P(Q)
• With evidence: pairwise conditionals P(Qi, Qj | E=e) + Chow-Liu algorithm = good tree for P(Q | E=e)

Train stage:
Decompose the original high-dimensional problem P(Q | E) into low-dimensional pairwise problems P(Q1, Q2 | E), P(Q1, Q3 | E), P(Q3, Q4 | E), …, and learn pairwise conditional estimators with parameters u:

$$\hat P_{e,u}(Q_i, Q_j) \equiv \hat P(Q_i, Q_j \mid E = e, u_{ij}),$$

e.g. $\hat P(Q_1, Q_2 \mid E, u_{12})$, $\hat P(Q_1, Q_3 \mid E, u_{13})$, $\hat P(Q_3, Q_4 \mid E, u_{34})$.
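For instance, a deliberately simple count-based estimator for one edge, conditioned on a binary evidence feature (a stand-in for whatever conditional estimator family is actually used; all names are illustrative):

```python
# Fit P^(Qi, Qj | E, u_ij) for a single candidate edge from samples of
# (evidence feature, qi, qj); here u_ij is just a smoothed conditional
# probability table.
from collections import defaultdict

def fit_pairwise_estimator(samples):
    """samples: iterable of (e, qi, qj) with binary values.
    Returns estimate(e) -> 2x2 joint table over (Qi, Qj)."""
    counts = defaultdict(lambda: [[1.0, 1.0], [1.0, 1.0]])  # Laplace smoothing
    for e, qi, qj in samples:
        counts[e][qi][qj] += 1.0
    def estimate(e):
        table = counts[e]
        total = sum(sum(row) for row in table)
        return [[c / total for c in row] for row in table]
    return estimate
```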
Test stage (evidence-specific Chow-Liu algorithm):
1. Instantiate the evidence in the pairwise estimators
2. Compute mutual information values $\hat I_{ij}(e, u)$ as edge weights over Q1, …, Q4
3. Return the maximum spanning tree
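A sketch of the test stage under the assumptions above: `pairwise_estimators` is a hypothetical dict mapping a candidate edge (i, j) to an estimator est(e) returning a 2x2 joint table over binary (Qi, Qj), e.g. the output of fit_pairwise_estimator:

```python
# Evidence-specific Chow-Liu: mutual information edge weights from the
# instantiated pairwise estimators, then a maximum spanning tree (Kruskal).
import math

def mutual_information(joint):
    """I(Qi; Qj) from a 2x2 joint probability table."""
    pi = [joint[0][0] + joint[0][1], joint[1][0] + joint[1][1]]
    pj = [joint[0][0] + joint[1][0], joint[0][1] + joint[1][1]]
    mi = 0.0
    for a in (0, 1):
        for b in (0, 1):
            if joint[a][b] > 0:
                mi += joint[a][b] * math.log(joint[a][b] / (pi[a] * pj[b]))
    return mi

def evidence_specific_chow_liu(n_vars, pairwise_estimators, e):
    # Steps 1-2: instantiate the evidence, compute MI edge weights.
    weights = {edge: mutual_information(est(e))
               for edge, est in pairwise_estimators.items()}
    # Step 3: maximum spanning tree via Kruskal with union-find.
    parent = list(range(n_vars))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    tree = []
    for (i, j) in sorted(weights, key=weights.get, reverse=True):
        ri, rj = find(i), find(j)
        if ri != rj:             # keeping acyclicity guarantees a tree
            parent[ri] = rj
            tree.append((i, j))
    return tree
```

Because edges are only added when they join two different components, the returned structure is guaranteed to be a tree, which is exactly the global property that per-edge decisions cannot enforce.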
Learning Optimal Feature Weights

Can exactly compute the convex objective and its gradient; use L-BFGS or conjugate gradient to find the unique global optimum w.r.t. w exactly (sketched below). The gradient is similar to standard CRFs:
$$\frac{\partial \log \hat P(q \mid e, u)}{\partial w_{ij}} = \mathbb{I}\big(ij \in T(e, u)\big)\Big(f_{ij}(q, e) - \mathbb{E}_{\hat q \sim \hat P(Q \mid e, u)}\big[f_{ij}(\hat q, e)\big]\Big)$$

• Sparsity conforms to the evidence-specific structure
• Structure-related parameters u are fixed from the tree-learning step
• Efficient exact computation, because T(e, u) is a tree
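A sketch of the optimization step with SciPy's L-BFGS; `neg_loglik_and_grad` is a placeholder for the exact negative log-likelihood and gradient above (here a trivial convex stand-in, so the example runs as-is):

```python
import numpy as np
from scipy.optimize import minimize

def neg_loglik_and_grad(w):
    # Stand-in convex objective: ||w||^2 takes the place of the exact
    # ESS-CRF negative log-likelihood; its gradient is 2w.
    return float(np.dot(w, w)), 2.0 * w

w0 = np.zeros(3)                      # one weight per candidate edge
result = minimize(neg_loglik_and_grad, w0, jac=True, method="L-BFGS-B")
print(result.x)                       # unique global optimum w.r.t. w
```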
[Figure: for individual datapoints E=e1, e2, e3 the gradients are tree-sparse, with different evidence-dependent sparsity patterns; summed over the dataset, the overall gradient is dense but still tractable.]
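A tiny illustration of that sparsity pattern (toy numbers, illustrative names): each datapoint's gradient is masked by its own tree, yet the accumulated gradient touches every candidate edge:

```python
EDGES = [(0, 1), (1, 2), (0, 2)]

def masked_gradient(tree, dense_grad):
    """Zero out gradient entries for edges outside the datapoint's tree."""
    return {ij: (g if ij in tree else 0.0) for ij, g in dense_grad.items()}

trees = [{(0, 1), (1, 2)}, {(0, 2), (1, 2)}]      # differ with the evidence value
per_edge_grads = [{ij: 1.0 for ij in EDGES}] * 2  # stand-in per-edge gradients

total = {ij: 0.0 for ij in EDGES}
for tree, g in zip(trees, per_edge_grads):
    for ij, gv in masked_gradient(tree, g).items():
        total[ij] += gv
print(total)  # dense overall: every candidate edge gets signal from some datapoint
```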
Relational Extensions

General approach:
1. Ground the model / features
2. Use standard ESS-CRFs + parameter sharing

Parameter sharing for both w and u, with one weight per relation rather than per grounding (see the sketch below):
• Parameter dimensionality independent of model size
• Reduces overfitting
• Structure selection happens only after grounding
• No need to worry about the structure being a tree at the relational level
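A toy sketch of the sharing scheme (relation names, weights, and groundings are illustrative): all groundings of a relation reuse that relation's single weight, so the number of free parameters does not grow with the grounded model:

```python
relation_weights = {"links_to": 0.7, "same_face": 1.2}  # one weight per relation

groundings = {
    "links_to": [(0, 1), (1, 2), (0, 3)],   # grounded webpage-link edges
    "same_face": [(4, 5)],                  # grounded face-similarity edges
}

# Every grounded edge is assigned its relation's shared weight.
w_grounded = {edge: relation_weights[rel]
              for rel, edges in groundings.items()
              for edge in edges}
print(w_grounded)   # 4 grounded edges, still only 2 free parameters
```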
Results

Face recognition [w/ Denver Dash, Matthai Philipose]:
• Exploit face similarities to propagate labels in collections of images
• Semi-supervised relational model
• 250…1700 images, 4…24 unique people
• Compare against dense discriminative models
• Equal or better accuracy, about 100 times faster

WebKB [data + features thanks to Ben Taskar]:
• Webpage text + links → page type (student, project, …)
• Same accuracy as dense models, ~10 times faster
Acknowledgements: this work has been supported by NSF Career award IIS-0644225 and by ARO MURI W911NF0710287 and W911NF0810242