
Carnegie Mellon

Thesis Defense

Joseph K. Bradley

Learning Large-Scale Conditional Random Fields

Committee:
Carlos Guestrin (U. of Washington, Chair)
Tom Mitchell
John Lafferty (U. of Chicago)
Andrew McCallum (U. of Massachusetts at Amherst)

1 / 18 / 2013


Modeling Distributions

2

Goal: Model distribution P(X) over random variables X

E.g.: Model life of a grad student.

X2: deadline?

X1: losing sleep?

X3: sick?

X4: losing hair?

X5: overeating?

X6: loud roommate?

X7: taking classes?

X8: cold weather?
X9: exercising?

X11: single?
X10: gaining weight?


Modeling Distributions

3

X2: deadline?

X1: losing sleep?

X5: overeating?

X7: taking classes?

P(X1, X5 | X2, X7) = P( losing sleep, overeating | deadline, taking classes )

Goal: Model distribution P(X) over random variables X

E.g.: Model life of a grad student.


Markov Random Fields (MRFs)

4

X2: deadline?

X1: losing sleep?

X3: sick?

X4: losing hair?

X5: overeating?

X6: loud roommate?

X7: taking classes?

X8: cold weather?
X9: exercising?

X11: single?
X10: gaining weight?

Goal: Model distribution P(X) over random variables X

E.g.: Model life of a grad student.


Markov Random Fields (MRFs)

5

X2

X1

X3

X4

X5

X6

X7

X8

X9

X10
X11

graphical structure

factor (parameters)

Goal: Model distribution P(X) over random variables X
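The factorization shown on this slide is not reproduced in this transcript; as a sketch, an MRF with factors ψ_C over cliques C of the graph and partition function Z has the form

P(X) = (1/Z) \prod_C \psi_C(X_C),   where   Z = \sum_X \prod_C \psi_C(X_C).

The graphical structure records which variables appear together in a factor; the factors carry the parameters.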


Conditional Random Fields (CRFs)

6

X2

Y1

Y3

Y4

Y5

X1

X3

X4

X5

X6
Y2

MRFs: P(X). CRFs: P(Y|X) (Lafferty et al., 2001).

Do not model P(X). Simpler structure (over Y only).
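The slide's equation is omitted here; as a sketch, a CRF with factors ψ_C over cliques of Y (and associated evidence X) defines the conditional

P(Y \mid X) = (1/Z(X)) \prod_C \psi_C(Y_C, X_C),   where   Z(X) = \sum_Y \prod_C \psi_C(Y_C, X_C).

Note that Z(X) sums only over Y, which is why the simpler structure over Y matters.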


MRFs & CRFs

7

Benefits:
• Principled statistical and computational framework
• Large body of literature

Applications:
• Natural language processing (e.g., Lafferty et al., 2001)
• Vision (e.g., Tappen et al., 2007)
• Activity recognition (e.g., Vail et al., 2007)
• Medical applications (e.g., Schmidt et al., 2008)
• ...


Challenges

8

Goal: Given data, learn CRF structure and parameters.

X2

Y1

Y3

Y4

Y5

X1

X5

X6
Y2

Many learning methods require inference, i.e., answering queries P(A|B)

NP-hard in general (Srebro, 2003)

Big structured optimization problem

NP-hard to approximate (Roth, 1996)

Approximations often lack strong guarantees.


Thesis Statement

CRFs offer statistical and computational advantages, but traditional learning methods are often impractical for large problems.

We can scale learning by using decompositions of learning problems which trade off sample complexity, computation, and parallelization.

9


Outline

10

Scaling core methods:
Parameter Learning: learning without intractable inference
Structure Learning: learning tractable structures

Parallel scaling:
Parallel Regression: multicore sparse regression

(Parameter and structure learning solve via parallel regression.)


Outline

Scaling core methods:
Parameter Learning: learning without intractable inference

11


Log-linear MRFs

12

X2

X1

X3

X4

X5

X6

X7

X8

X9

X10
X11

Goal: Model distribution P(X) over random variables X

Parameters and features:
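The equation itself is not in this transcript; a sketch of the standard log-linear parameterization, assuming a parameter vector θ and feature functions φ_C over the factors:

P_\theta(X) = (1/Z(\theta)) \exp( \sum_C \theta_C \cdot \phi_C(X_C) )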

All results generalize to CRFs.


Parameter Learning: MLE

13

Traditional method: max-likelihood estimation (MLE)

Minimize objective:

Loss
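The loss formula is omitted in this transcript; a sketch of the MLE objective, assuming the log-linear form above and training samples x^(1), ..., x^(n):

\min_\theta \; -(1/n) \sum_m \log P_\theta(x^{(m)}) \;=\; \min_\theta \; \log Z(\theta) - (1/n) \sum_m \sum_C \theta_C \cdot \phi_C(x^{(m)}_C)

The log Z(θ) term is what requires inference over the full model.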

Gold Standard: MLE is (optimally) statistically efficient.

Parameter Learning: Given structure Φ and samples from Pθ*(X), learn parameters θ.


Parameter Learning: MLE

14


Parameter Learning: MLE

15

MLE requires inference. Provably hard for general MRFs (Roth, 1996).

Inference makes learning hard.

Can we learn without intractable inference?


Parameter Learning: MLE

16

Inference makes learning hard.

Can we learn without intractable inference?

Approximate inference & objectives

• Many works: Hinton (2002), Sutton & McCallum (2005), Wainwright (2006), ...

• Many lack strong theory.
• Almost no guarantees for general MRFs or CRFs.


Our Solution

17

Max Likelihood Estimation (MLE): sample complexity: optimal; computational complexity: high; parallel optimization: difficult.

Max Pseudolikelihood Estimation (MPLE): sample complexity: high; computational complexity: low; parallel optimization: easy.

PAC learnability for many MRFs!

Bradley, Guestrin (2012)


Our Solution

18

Max Likelihood Estimation (MLE): sample complexity: optimal; computational complexity: high; parallel optimization: difficult.

Max Pseudolikelihood Estimation (MPLE): sample complexity: high; computational complexity: low; parallel optimization: easy.

PAC learnability for many MRFs!

Bradley, Guestrin (2012)


Our Solution

19

Max Likelihood Estimation (MLE): sample complexity: optimal; computational complexity: high; parallel optimization: difficult.

Max Pseudolikelihood Estimation (MPLE): sample complexity: high; computational complexity: low; parallel optimization: easy.

Max Composite Likelihood Estimation (MCLE): sample complexity: low; computational complexity: low; parallel optimization: easy.

Choose MCLE structure to optimize trade-offs.

Bradley, Guestrin (2012)


Deriving Pseudolikelihood (MPLE)

20

X2

X1

X3

X4

X5

MLE:

Hard to compute. So replace it!


Deriving Pseudolikelihood (MPLE)

21

X1

MLE:

Estimate via regression:

MPLE:

(Besag, 1975)

Tractable inference!
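The MPLE objective is not reproduced in this transcript; it replaces the joint likelihood with a product of per-variable conditionals. A sketch, for training samples x^(1), ..., x^(n):

\min_\theta \; -(1/n) \sum_m \sum_i \log P_\theta( x^{(m)}_i \mid x^{(m)}_{\setminus i} )

Each conditional is normalized over a single variable X_i, so no intractable partition function appears.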


Pseudolikelihood (MPLE)

22

Pros:
• No intractable inference!
• Consistent estimator

Cons:
• Less statistically efficient than MLE (Liang & Jordan, 2008)
• No PAC bounds

PAC = Probably Approximately Correct (Valiant, 1984)

MPLE:

(Besag, 1975)


Sample Complexity: MLE

23

Our Theorem: Bound on n (# training examples needed), in terms of:
• # parameters (length of θ)
• Λmin: min eigenvalue of Hessian of loss at θ*
• parameter error (L1)
• probability of failure

Recall: Requires intractable inference.


Sample Complexity: MPLE

24

Our Theorem: Bound on n (# training examples needed), in terms of:
• # parameters (length of θ)
• Λmin: mini [ min eigenvalue of Hessian of component i at θ* ]
• parameter error (L1)
• probability of failure

Recall: Tractable inference.

PAC learnabilityfor many MRFs!


Sample Complexity: MPLE

25

Our Theorem: Bound on n (# training examples needed)

PAC learnabilityfor many MRFs!

Related Work
Ravikumar et al. (2010)
• Regression Yi ~ X with Ising models
• Basis of our theory
Liang & Jordan (2008)
• Asymptotic analysis of MLE, MPLE
• Our bounds match theirs
Abbeel et al. (2006)
• Only previous method with PAC bounds for high-treewidth MRFs
• We extend their work: extension to CRFs, algorithmic improvements, analysis
• Their method is very similar to MPLE.


Trade-offs: MLE & MPLE

26

Our Theorem: Bound on n (# training examples needed)

Trade-off between sample complexity and computational complexity:

MLE: larger Λmin => lower sample complexity, but higher computational complexity.

MPLE: smaller Λmin => higher sample complexity, but lower computational complexity.


Trade-offs: MPLE

27

Joint optimization for MPLE: lower sample complexity.

Disjoint optimization for MPLE: 2 estimates of each shared parameter; average the estimates. Data-parallel.

Trade-off between sample complexity and parallelism.
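A sketch of the two variants (a reconstruction, not the slide's exact notation; \hat{E} denotes the empirical average over training samples):

Joint MPLE:    \hat\theta = argmin_\theta \sum_i \hat{E}[ -\log P_\theta(X_i \mid X_{\setminus i}) ]   (one optimization over the shared θ)

Disjoint MPLE: for each i,  \hat\theta^{(i)} = argmin_\theta \hat{E}[ -\log P_\theta(X_i \mid X_{\setminus i}) ],  then average the estimates of any parameter shared by several regressions.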


Synthetic CRFs

28

Factor types: random, associative.
Structures: chains, stars, grids.

Factor strength = strength of variable interactions


Predictive Power of Bounds

29

Errors should be ordered: MLE < MPLE < MPLE-disjoint

Plot: L1 parameter error ε (lower is better) vs. # training examples, for MLE, MPLE, and MPLE-disjoint. Factors: random, fixed strength. Length-4 chains.


Predictive Power of Bounds

30

MLE & MPLE sample complexity (bound formulas omitted in this transcript). Plot: actual ε (lower is better) as the problem gets harder; MLE shown. Factors: random. Length-6 chains. 10,000 training examples.


Failure Modes of MPLE

31

How do Λmin(MLE) and Λmin(MPLE) vary for different models?

Sample complexity depends on Λmin. We examine three model properties:
• Model diameter
• Factor strength
• Node degree


Λmin: Model Diameter

32

Plot: Λmin ratio (MLE/MPLE; higher = MLE better) vs. model diameter. Factors: associative, fixed strength. Chains.

Relative MPLE performance is independent of diameter in chains. (Same for random factors.)


Λmin: Factor Strength

33

Plot: Λmin ratio (MLE/MPLE; higher = MLE better) vs. factor strength. Factors: associative. Length-8 chains.

MPLE performs poorly with strong factors. (Same for random factors, and star & grid models.)


Λmin: Node Degree

34

Plot: Λmin ratio (MLE/MPLE; higher = MLE better) vs. node degree. Factors: associative, fixed strength. Stars.

MPLE performs poorly with high-degree nodes. (Same for random factors.)


Failure Modes of MPLE

35

How do Λmin(MLE) and Λmin(MPLE) vary for different models?

Sample complexity depends on Λmin. Problem properties examined:
• Model diameter
• Factor strength
• Node degree

We can often fix this!


Composite Likelihood (MCLE)

36

MLE: Estimate P(Y) all at once


Composite Likelihood (MCLE)

37

MLE: Estimate P(Y) all at once

MPLE: Estimate P(Yi|Y\i) separately

Yi


Composite Likelihood (MCLE)

38

MLE: Estimate P(Y) all at once

MPLE: Estimate P(Yi|Y\i) separately

YAi

Something in between?

Composite Likelihood (MCLE):

Estimate P(YAi | Y\Ai) separately. (Lindsay, 1988)
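A sketch of the composite likelihood objective (a reconstruction, assuming node-disjoint blocks A_1, ..., A_K and training samples y^(1), ..., y^(n)):

\min_\theta \; -(1/n) \sum_m \sum_k \log P_\theta( y^{(m)}_{A_k} \mid y^{(m)}_{\setminus A_k} )

Taking each block A_k = {i} recovers MPLE; taking a single block containing all of Y recovers MLE.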


Generalizes MLE, MPLE; analogous: objective, sample complexity, joint & disjoint optimization.

Composite Likelihood (MCLE)

39

MCLE Class: node-disjoint subgraphs which cover the graph.


Composite Likelihood (MCLE)

40

MCLE Class: node-disjoint subgraphs which cover the graph.

• Trees (tractable inference)
• Follow structure of P(X)
• Cover star structures
• Cover strong factors
• Choose large components

Combs

Generalizes MLE, MPLE; analogous: objective, sample complexity, joint & disjoint optimization.


Structured MCLE on a Grid

41

Plots: (1) log-loss ratio (other/MLE; lower is better) vs. grid size |X|, for MCLE (combs) and MPLE; (2) training time (sec) vs. grid size |X|, for MCLE (combs), MPLE, and MLE. Grid, associative factors, 10,000 training examples, Gibbs sampling.

MCLE (combs) lowers sample complexity... without increasing computation!

MCLE tailored to model structure. Also in thesis: tailoring to correlations in data.


Summary: Parameter Learning

42

Likelihood (MLE): sample complexity: optimal; computational complexity: high; parallel optimization: difficult.

Pseudolikelihood (MPLE): sample complexity: high; computational complexity: low; parallel optimization: easy.

Composite Likelihood (MCLE): sample complexity: low; computational complexity: low; parallel optimization: easy.

• Finite sample complexity bounds for general MRFs, CRFs
• PAC learnability for certain classes
• Empirical analysis
• Guidelines for choosing MCLE structures: tailor to model, data


Outline

43

Scaling core methods:
Structure Learning: learning tractable structures


CRF Structure Learning

44

X3: deadline?

Y1: losing sleep?

Y3: sick? Y2: losing hair?

X1: loud roommate?

X2: taking classes?

Structure learning: Choose YC

I.e., learn conditional independence

Evidence selection: Choose XD

I.e., select X relevant to each YC


Related Work

Previous Work:

Torralba et al. (2004), Boosted Random Fields: structure learning: yes; tractable inference: no; evidence selection: yes.

Schmidt et al. (2008), block-L1 regularized pseudolikelihood: structure learning: yes; tractable inference: no; evidence selection: no.

Shahaf et al. (2009), edge weights + low-treewidth model: structure learning: yes; tractable inference: yes; evidence selection: no.

Most similar to our work: Shahaf et al. (2009). They focus on selecting treewidth-k structures; we focus on the choice of edge weight.

45


Tree CRFs with Local Evidence

Goal: Given data and local evidence, learn tree CRF structure via a scalable method.

Bradley, Guestrin (2010)

46

(Local evidence: Xi relevant to each Yi. Tree structure: fast inference at test time.)


Chow-Liu for MRFs

47

Chow & Liu (1968)

Y1, Y2, Y3

Algorithm: Weight edges with mutual information:
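The mutual information weight itself is omitted here; the standard estimate from data is

w(i,j) = I(Y_i; Y_j) = \sum_{y_i, y_j} \hat{P}(y_i, y_j) \log [ \hat{P}(y_i, y_j) / ( \hat{P}(y_i) \hat{P}(y_j) ) ].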


Chow-Liu for MRFs

48

Chow & Liu (1968). Algorithm:
Weight edges with mutual information.
Choose max-weight spanning tree.

Y1, Y2, Y3

Chow-Liu finds a max-likelihood structure.
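A minimal, illustrative Python sketch of the Chow-Liu procedure for discrete MRFs (empirical mutual information as edge weights, plus a max-weight spanning tree via Kruskal's algorithm); the function names are mine, not from the thesis:

import numpy as np
from itertools import combinations

def empirical_mi(a, b):
    # Empirical mutual information (in nats) between two discrete sample vectors.
    mi = 0.0
    for va in np.unique(a):
        pa = np.mean(a == va)
        for vb in np.unique(b):
            pb = np.mean(b == vb)
            pab = np.mean((a == va) & (b == vb))
            if pab > 0:
                mi += pab * np.log(pab / (pa * pb))
    return mi

def chow_liu_tree(samples):
    # samples: (n, d) array of discrete values; returns a list of tree edges (i, j).
    d = samples.shape[1]
    edges = sorted(
        ((empirical_mi(samples[:, i], samples[:, j]), i, j)
         for i, j in combinations(range(d), 2)),
        reverse=True)                     # descending by mutual information
    parent = list(range(d))               # union-find for Kruskal's algorithm
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    tree = []
    for _, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:                      # add the edge only if it joins two components
            parent[ri] = rj
            tree.append((i, j))
    return tree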


Chow-Liu for CRFs?

What edge weight? It must be efficient to compute.

Global Conditional Mutual Information (CMI)

Pro: Finds max-likelihood structure (with enough data)

Con: Intractable for large |X|

49

Algorithm: Weight each possible edge:

Choose max-weight spanning tree.
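As a sketch, the global CMI edge weight conditions on the full evidence X, which is what makes it intractable for large |X|:

w(i,j) = I(Y_i; Y_j \mid X) = E_X[ \sum_{y_i, y_j} P(y_i, y_j \mid X) \log ( P(y_i, y_j \mid X) / ( P(y_i \mid X) P(y_j \mid X) ) ) ].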


Generalized Edge Weights

Global CMI

50

Local Linear Entropy Scores (LLES): w(i,j) = a linear combination of entropies over Yi, Yj, Xi, Xj.

Theorem: No LLES can recover all tree CRFs (even with non-trivial parameters and exact entropies).


Heuristic Edge Weights

Method: Global CMI. Guarantees: recovers true tree. Compute w(i,j) tractably: no. Comments: Shahaf et al. (2009).

Method: Local CMI. Guarantees: lower-bounds likelihood gain. Compute w(i,j) tractably: yes. Comments: fails with strong Yi-Xi potentials.

Method: Decomposable Conditional Influence (DCI). Guarantees: exact likelihood gain for some edges. Compute w(i,j) tractably: yes. Comments: best empirically.

51


Synthetic Tests. Trees w/ associative factors. |Y|=40. 1000 test samples. Error bars: 2 std. errors.

Plot: fraction of edges recovered vs. # training examples (0-500), for DCI, Global CMI, Local CMI, Schmidt et al., and the true CRF.

52


Synthetic Tests. Trees w/ associative factors. |Y|=40. 1000 test samples. Error bars: 2 std. errors.

Plot: training time in seconds (lower is better) vs. # training examples (0-500), for Global CMI, DCI, Local CMI, and Schmidt et al.

53



fMRI Tests

X: fMRI voxels (500)

Y: semantic features (218)

predict (Application & data from Palatucci et al., 2009)

Image from http://en.wikipedia.org/wiki/File:FMRI.jpg

Plot: E[log P(Y | X)] on test data (higher is better) for Disconnected (Palatucci et al., 2009), DCI 1, and DCI 2.

54


Summary: Structure Learning

55

• Analyzed generalizing Chow-Liu to CRFs
• Proposed class of edge weights: Local Linear Entropy Scores
  • Negative result: insufficient for recovering trees
• Discovered useful heuristic edge weights: Local CMI, DCI
  • Promising empirical results on synthetic & fMRI data

Generalized Chow-Liu: compute edge weights; choose max-weight spanning tree.


Outline

56

Scaling core methods:
Parameter Learning: pseudolikelihood, canonical parameterization
Structure Learning: generalized Chow-Liu

Parallel scaling:
Parallel Regression: multicore sparse regression

Solve via regression:
• Structure learning: compute edge weights via P(Yi, Yj | Xij)
• Parameter learning: regress each variable on its neighbors, P(Xi | X\i)


Sparse (L1) Regression (Bradley, Kyrola, Bickson, Guestrin, 2011)

Bias towards sparse solutions

Lasso (Tibshirani, 1996)

Objective:
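The objective itself is not reproduced in this transcript; the standard Lasso objective, with design matrix X, response vector y, and regularization parameter λ, is

\min_w \; (1/2) \| X w - y \|_2^2 + \lambda \| w \|_1.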

Goal: Predict y from x, given samples (x, y).

Useful in the high-dimensional setting (# features >> # examples). We consider Lasso and sparse logistic regression.

57


Parallelizing LASSO

Many LASSO optimization algorithms:
• Gradient descent, interior point, stochastic gradient, shrinkage, hard/soft thresholding
• Coordinate descent (a.k.a. Shooting; Fu, 1998)

One of the fastest algorithms (Yuan et al., 2010)

Parallel optimization:
• Matrix-vector ops (e.g., interior point): not great empirically
• Stochastic gradient (e.g., Zinkevich et al., 2010): best for many samples, not large d
• Shooting: inherently sequential?

Shotgun: parallel coordinate descent for L1 regression. Simple algorithm, elegant analysis.

58


Shooting: Sequential SCD

Stochastic Coordinate Descent (SCD):
While not converged:
  Choose random coordinate j
  Update wj (closed-form minimization)
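A minimal Python sketch of Shooting for the Lasso objective above; the closed-form coordinate update is a soft-threshold, and the variable names are mine, not from the thesis:

import numpy as np

def shooting_lasso(X, y, lam, iters=10000, seed=0):
    # Sequential stochastic coordinate descent for min_w 0.5*||Xw - y||^2 + lam*||w||_1.
    # Assumes no all-zero feature columns.
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    residual = y - X @ w                  # maintained so each update costs O(n)
    col_sqnorms = (X ** 2).sum(axis=0)
    for _ in range(iters):
        j = rng.integers(d)               # choose a random coordinate
        xj = X[:, j]
        # c is the optimal unregularized value of w_j times ||x_j||^2.
        c = xj @ residual + col_sqnorms[j] * w[j]
        new_wj = np.sign(c) * max(abs(c) - lam, 0.0) / col_sqnorms[j]  # soft-threshold
        residual += xj * (w[j] - new_wj)  # update residual for the change in w_j
        w[j] = new_wj
    return w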

59


Shotgun: Parallel SCD

Shotgun Algorithm (Parallel SCD):
While not converged:
  On each of P processors:
    Choose random coordinate j
    Update wj (same as for Shooting)

Nice case: uncorrelated features. Bad case: correlated features.
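A sketch that simulates one round of Shotgun's P simultaneous updates: all P coordinates are updated from the same stale copy of the residual, which is the behavior the analysis models. It reuses the soft-threshold update from the Shooting sketch above; again the names are mine:

import numpy as np

def shotgun_lasso(X, y, lam, P=8, rounds=2000, seed=0):
    # Simulated parallel SCD: each round, P distinct coordinates are updated
    # from the same stale residual, mimicking P concurrent updates.
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    col_sqnorms = (X ** 2).sum(axis=0)
    for _ in range(rounds):
        residual = y - X @ w                       # stale state shared by the P updates
        coords = rng.choice(d, size=min(P, d), replace=False)
        for j in coords:                           # in real Shotgun these run concurrently
            c = X[:, j] @ residual + col_sqnorms[j] * w[j]
            w[j] = np.sign(c) * max(abs(c) - lam, 0.0) / col_sqnorms[j]
        # With correlated features and large P, these stale updates can conflict,
        # which is why the theory limits the useful amount of parallelism.
    return w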

Is SCD inherently sequential?

60


Shotgun: Theory

Convergence Theorem (statement omitted in this transcript): assuming the number of parallel updates P is small enough (the limit involves ρ, the spectral radius of X^T X), the theorem bounds the gap between the final and optimal objectives after a given number of iterations.

Generalizes bounds for Shooting (Shalev-Shwartz & Tewari, 2009).

61


Shotgun: Theory

Convergence Theorem (statement omitted in this transcript): bounds the final-minus-optimal objective in terms of the number of iterations and the number of parallel updates P, where ρ = the spectral radius of X^T X.

Nice case: uncorrelated features. Bad case: correlated features.

62


Shotgun: Theory. Convergence Theorem: up to a threshold on the number of parallel updates P, linear speedups are predicted.

Experiments match our theory!

63

Plots: T (iterations) vs. P (parallel updates) for Mug32_singlepixcam (Pmax = 79) and SparcoProblem7 (Pmax = 284).


Lasso Experiments

Compared many algorithms:
• Interior point (L1_LS)
• Shrinkage (FPC_AS, SpaRSA)
• Projected gradient (GPSR_BB)
• Iterative hard thresholding (Hard_IO)
• Also ran: GLMNET, LARS, SMIDAS

35 datasets; λ = 0.5, 10.
Shooting; Shotgun with P = 8 (multicore).

Dataset groups: Single-Pixel Camera, Sparco (van den Berg et al., 2009), Sparse Compressed Imaging, Large Sparse Datasets.

64

Shotgun proves most scalable & robust.


Shotgun: Speedup. Aggregated results from all tests.

Plot: speedup vs. # cores, for the optimal (linear) speedup, Lasso iteration speedup, Lasso time speedup, and logistic regression time speedup.

Lasso time speedup: not so great. But we are doing fewer iterations!

Explanation: the memory wall (Wulf & McKee, 1995). The memory bus gets flooded.

Logistic regression uses more FLOPS/datum. Extra computation hides memory latency. Better speedups on average!

65


Summary: Parallel Regression

66

• Shotgun: parallel coordinate descent on multicore
• Analysis: near-linear speedups, up to a problem-dependent limit
• Extensive experiments (37 datasets, 7 other methods)
  • Our theory predicts empirical behavior well.
  • Shotgun is one of the most scalable methods.

Shotgun: decompose computation by coordinate updates. Trade a little extra computation for a lot of parallelism.


Recall: Thesis Statement
We can scale learning by using decompositions of learning problems which trade off sample complexity, computation, and parallelization.

67

Parameter Learning: structured composite likelihood (MLE, MCLE, MPLE)

Structure Learning: generalized Chow-Liu

Parallel Regression: Shotgun, parallel coordinate descent

Decompositions use model structure & locality. Trade-offs use model- and data-specific methods.


Future Work: Unified System

68

Parameter Learning Structure Learning

Parallel Regression

Structured MCLE. Automatically:
• choose MCLE structure & parallelization strategy
• to optimize trade-offs,
• tailored to model & data.

Shotgun (multicore) to distributed:

Limited communication in distributed setting.

Handle complex objectives (e.g., MCLE).

L1 Structure Learning

Learning Trees

Use structured MCLE?

Learn trees for parameter estimators?


Summary

Parameter learning: structured composite likelihood
• Finite sample complexity bounds
• Empirical analysis
• Guidelines for choosing MCLE structures: tailor to model, data
• Analyzed canonical parameterization of Abbeel et al. (2006)

69

We can scale learning by using decompositions of learning problems which trade off sample complexity, computation, and parallelization.

Structure learning: generalizing Chow-Liu to CRFs
• Proposed class of edge weights: Local Linear Entropy Scores
  • Insufficient for recovering trees
• Discovered useful heuristic edge weights: Local CMI, DCI
• Promising empirical results on synthetic & fMRI data

Parallel regression: Shotgun, parallel coordinate descent on multicore
• Analysis: near-linear speedups, up to a problem-dependent limit
• Extensive experiments (37 datasets, 7 other methods)
  • Our theory predicts empirical behavior well.
  • Shotgun is one of the most scalable methods.

Thank you!