
Carnegie Mellon

Evidence-Specific Structures for Rich Tractable CRFs

Anton Chechetka, Carlos Guestrin

General approach:
1. Ground the model / features
2. Use standard ESS-CRFs + parameter sharing

want P(structured Query | Evidence):

Collaborative filtering

Webpage classification: P(page types | webpage text + links), with page types such as professor, student, project

Face recognition in image collections: P(face labels | collection of images + face similarities)

P(Q \mid E) = \frac{1}{Z(E)} \exp\left( \sum_{i,j} w_{ij}\, f_{ij}(Q_i, Q_j, E) \right)

where the f_{ij} are features, the w_{ij} are weights, and Z(E) is the normalization.

Dense models

[Figure: a dense model over evidence E and query Q; features f_{12}, f_{34} induce structure over Q]

\frac{\partial \log P(q \mid e)}{\partial w_{ij}} = f_{ij}(q, e) - \mathbb{E}_{P(Q \mid e)}\left[ f_{ij}(\hat{q}, e) \right]

(feature minus expected feature); evaluating the expectation needs inference in the induced model.
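To make the formula and its gradient concrete, here is a minimal brute-force sketch (a toy setup of our own, not part of the poster): it enumerates all query configurations of a tiny pairwise CRF, which is exactly what becomes hopeless in large dense models.

```python
# Toy brute-force pairwise CRF (assumed setup, for illustration only).
import itertools
import numpy as np

def log_score(q, w, feats, edges):
    # sum_ij w_ij * f_ij(q_i, q_j, e); each feats[k] is a hypothetical feature
    # already instantiated with the evidence e
    return sum(w[k] * feats[k](q[i], q[j]) for k, (i, j) in enumerate(edges))

def loglik_and_grad(q_obs, w, feats, edges, n_vars):
    configs = list(itertools.product([0, 1], repeat=n_vars))
    scores = np.array([log_score(q, w, feats, edges) for q in configs])
    log_z = np.logaddexp.reduce(scores)            # normalization Z(e)
    probs = np.exp(scores - log_z)                 # P(Q | e) over all configurations
    loglik = log_score(q_obs, w, feats, edges) - log_z
    # d log P(q_obs | e) / d w_k  =  f_k(q_obs, e) - E_{P(Q|e)}[ f_k(Q, e) ]
    grad = np.array([
        feats[k](q_obs[i], q_obs[j])
        - sum(p * feats[k](q[i], q[j]) for p, q in zip(probs, configs))
        for k, (i, j) in enumerate(edges)
    ])
    return loglik, grad

# 4 binary query variables, two "agreement" features on edges (0,1) and (2,3)
edges = [(0, 1), (2, 3)]
feats = [lambda a, b: float(a == b)] * len(edges)
loglik, grad = loglik_and_grad((1, 1, 0, 1), np.array([1.0, -0.5]), feats, edges, n_vars=4)
```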

Exact inference: #P-complete

Approximate inference: NP-complete

Hopeless in large dense models

Easy for tree-structured models

Features can be arbitrarily correlated

Convex objective

Unique global optimum

Intuitive gradient:

Dense models vs. tree models:

Dense models: capture complex dependencies; natural extensions to relational settings; but arbitrarily bad inference quality and arbitrarily bad parameter quality.

Tree models: efficient exact inference; efficient learning of optimal parameters; but simple dependencies only, and relational settings are not tree-structured.

This work: keep efficient exact inference + optimal parameter learning, and enable rich dependencies and relational extensions.

P(Q \mid E) = \frac{1}{Z(E)} \exp\left( \sum_{i,j} \mathbb{I}\big(ij \in T(E, u)\big)\, w_{ij}\, f_{ij}(Q_i, Q_j, E) \right)

The indicator \mathbb{I}(ij \in T(E, u)) encodes the evidence-specific structure; w_{ij} f_{ij} are the standard weighted features; u are the structure selection parameters and T is the structure selection algorithm.
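A sketch of the scoring rule this formula implies (reusing the toy setup from the sketch above; the tree shown is hypothetical): the indicator simply masks out every feature whose edge is not in the evidence-specific tree T(e, u).

```python
def ess_log_score(q, w, feats, edges, tree_edges):
    # sum over ij of I(ij in T(e,u)) * w_ij * f_ij(q_i, q_j, e):
    # only edges selected for this particular evidence value contribute
    return sum(w[k] * feats[k](q[i], q[j])
               for k, (i, j) in enumerate(edges)
               if (i, j) in tree_edges)

# e.g. the structure selector might return this tree for one evidence value
tree_for_e1 = {(0, 1), (1, 2), (2, 3)}
```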

Example: "Battery is good" and "Engine starts" are dependent in general, but there is no dependence given the specific evidence E = {gas tank is empty}.

CRF with Evidence-Specific Structure

Formalism:

Motivation

Conditional Random Fields

Model Structure Tradeoffs

Intuition: Edge importance depends on evidence

Fixed dense model

Evidence-specific tree “mask”

Evidence-specific model: for each evidence value E = e_1, e_2, e_3, the fixed dense model × the evidence-specific tree "mask" = an evidence-specific tree model.

Capture all potential dependencies

Select the most important tree specific to the evidence value

Select tree structures, based on evidence, to capture the most important dependencies:

T(E,u) encodes the output of a structure selection algorithm

Global perspective on structure selection

Easy to guarantee tree structure (by selecting an appropriate algorithm)

Looking at one edge at a time is not enough to guarantee tree structure: being a tree is a global property

Objective still convex in w (but not u)

Efficient exact inference

Efficient learning of optimal parameters w

Much richer class of models than fixed trees (potential for capturing complex correlations)

Structure selection decoupled from feature design and weights (can use an arbitrarily dense model as the basis)

Learning an ESS-CRF model:
1. choose features f
2. choose tree learning algorithm T(E, ·)
3. learn u
4. select evidence-specific trees T(e_i, u) for every datapoint (E = e_i, Q = q_i) [u is fixed at this stage]
5. given u and the trees T(e_i, u), learn w [L-BFGS, etc.]
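A control-flow sketch of these five stages (the callables are hypothetical plug-ins; only the staging itself is taken from the poster):

```python
def learn_ess_crf(dataset, fit_structure_params, select_tree, fit_weights):
    """dataset: list of (evidence, query) pairs.
    fit_structure_params, select_tree, fit_weights are hypothetical plug-ins
    standing in for steps 3-5 above (the features f and the tree-learning
    algorithm T(E, .) are assumed to be baked into them)."""
    u = fit_structure_params(dataset)                     # step 3: learn u
    trees = [select_tree(e, u) for (e, _q) in dataset]    # step 4: u is fixed here
    w = fit_weights(dataset, trees)                       # step 5: learn w (L-BFGS, etc.)
    return u, w
```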

Algorithm stage           Dense CRFs                     ESS-CRFs (this work)
Structure selection       Approximate                    Approximate
Feature weight learning   Approximate (no guarantees)    Exact
Test-time inference       Approximate (no guarantees)    Exact

Parameter sharing for both w and u:

one weight per relation, not per grounding
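A minimal sketch of what this sharing means (the relation names and the representation of ground features are our own illustration): every grounding of a relation reads and writes the same weight entry, so the number of parameters does not grow with the grounded model.

```python
import numpy as np

relations = ["links-to", "same-person"]            # hypothetical relation names
rel_index = {r: k for k, r in enumerate(relations)}
w = np.zeros(len(relations))                       # one weight per relation, not per grounding

def accumulate_shared_gradient(ground_features):
    """ground_features: list of (relation_name, feature_gradient) pairs, one per
    grounding; gradients of all groundings of a relation add into one shared entry."""
    grad = np.zeros_like(w)
    for rel, g in ground_features:
        grad[rel_index[rel]] += g
    return grad
```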

Directly generalize existing algorithms for the no-evidence case (learning P(Q) from pairwise marginals) to the conditional case:

P(Q_i, Q_j \mid E = e) (pairwise conditionals) + Chow-Liu algorithm = a good tree for P(Q \mid E = e)

Learning Good Evidence-Specific Trees

Train stage:

Decompose the original high-dimensional problem P(Q \mid E) into low-dimensional pairwise problems, and learn pairwise conditional estimators with parameters u:

\hat{P}_{e,u}(Q_i, Q_j) = \hat{P}(Q_i, Q_j \mid E, u_{ij}), \quad \text{e.g. } \hat{P}(Q_1, Q_2 \mid E, u_{12}),\ \hat{P}(Q_1, Q_3 \mid E, u_{13}),\ \hat{P}(Q_3, Q_4 \mid E, u_{34})
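One possible form of a per-edge estimator (an assumption for illustration; the poster does not fix the estimator family): a multinomial logistic regression over the four joint states of a binary pair (Q_i, Q_j) given an evidence feature vector, whose fitted coefficients play the role of u_ij.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_pairwise_estimator(evidence_feats, qi, qj):
    """evidence_feats: (n, d) array of evidence features; qi, qj: length-n 0/1 labels."""
    joint_state = 2 * np.asarray(qi) + np.asarray(qj)      # encode (Q_i, Q_j) as 0..3
    return LogisticRegression(max_iter=1000).fit(evidence_feats, joint_state)

def pairwise_conditional(model, phi_e):
    """Return the 2x2 table P_hat(Q_i, Q_j | e) for one evidence feature vector."""
    probs = model.predict_proba(phi_e.reshape(1, -1))[0]
    table = np.zeros((2, 2))
    for cls, p in zip(model.classes_, probs):
        table[int(cls) // 2, int(cls) % 2] = p
    return table
```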

Test stage (evidence-specific Chow-Liu alg.):

Instantiate evidence in pairwise estimators:

Compute mutual information values \hat{I}_{ij}(e, u) as edge weights over the query variables (Q_1, ..., Q_4 in the figure), then return the maximum spanning tree.
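A sketch of this test stage, assuming the pairwise tables produced by the estimator sketch above (networkx is used here purely for illustration):

```python
import numpy as np
import networkx as nx

def mutual_information(table):
    """table[a, b] = P_hat(Q_i = a, Q_j = b | e); returns I_hat_ij(e, u)."""
    pi, pj = table.sum(axis=1), table.sum(axis=0)
    mi = 0.0
    for a in range(table.shape[0]):
        for b in range(table.shape[1]):
            if table[a, b] > 0:
                mi += table[a, b] * np.log(table[a, b] / (pi[a] * pj[b]))
    return mi

def evidence_specific_tree(pairwise_tables):
    """pairwise_tables: {(i, j): joint table of (Q_i, Q_j) conditioned on this evidence}."""
    g = nx.Graph()
    for (i, j), table in pairwise_tables.items():
        g.add_edge(i, j, weight=mutual_information(table))   # MI values as edge weights
    return set(nx.maximum_spanning_tree(g).edges())          # T(e, u)
```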

Fewer Sources of Errors


Learning Optimal Feature Weights

Our Approach: Evidence-Specific Structures

\frac{\partial \log \hat{P}(q \mid e, u)}{\partial w_{ij}} = \mathbb{I}\big(ij \in T(e, u)\big) \left( f_{ij}(q, e) - \mathbb{E}_{\hat{P}(Q \mid e, u)}\left[ f_{ij}(\hat{q}, e) \right] \right)

sparsity conforms to the evidence-specific structure

structure-related parameters u are fixed from the tree-learning step

efficient exact computation because T(e,u) is a tree

Summed over individual datapoints E = e_1, e_2, e_3, ...: each datapoint contributes a tree-sparse gradient with its own evidence-dependent sparsity pattern; over the whole dataset the gradient is dense, but still tractable.

Relational Extensions

Results

Face recognition [w/ Denver Dash, Matthai Philipose]

• Exploit face similarities to propagate labels in collections of images

• Semi-supervised relational model

• 250…1700 images, 4…24 unique people

• Compare against dense discriminative models


Equal or better accuracy

100 times faster

WebKB [data + features thanks to Ben Taskar]

• webpage text + links → page type (student, project, …)

same accuracy as dense models

~10 times faster

Can exactly compute the convex objective and its gradient (the gradient is similar to standard CRFs). Use L-BFGS or conjugate gradient to find the unique global optimum w.r.t. w exactly.
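A sketch of this optimization step, assuming a hypothetical function neg_loglik_and_grad(w, dataset, trees) that returns the exact negative conditional log-likelihood and its gradient (both computable exactly because every selected T(e_i, u) is a tree):

```python
import numpy as np
from scipy.optimize import minimize

def learn_weights(neg_loglik_and_grad, dataset, trees, n_weights):
    result = minimize(
        fun=lambda w: neg_loglik_and_grad(w, dataset, trees),
        x0=np.zeros(n_weights),
        jac=True,                  # fun returns (objective, gradient)
        method="L-BFGS-B",
    )
    return result.x                # unique global optimum of the convex objective in w
```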


Parameter dimensionality is independent of model size

Reduces overfitting

Structure selection only after grounding

No worries about structure being a tree on the relational level

No-evidence case: P(Q_i, Q_j) (pairwise marginals) + Chow-Liu algorithm = optimal tree for P(Q).

Acknowledgements: this work has been supported by NSF CAREER award IIS-0644225 and by ARO MURI W911NF0710287 and W911NF0810242.
