
Carnegie Mellon

Thesis Defense

Joseph K. Bradley

Learning Large-Scale Conditional Random Fields

Committee:
Carlos Guestrin (U. of Washington, Chair)
Tom Mitchell
John Lafferty (U. of Chicago)
Andrew McCallum (U. of Massachusetts at Amherst)

1 / 18 / 2013


Modeling Distributions

2

Goal: Model distribution P(X) over random variables X

E.g.: Model life of a grad student.

X2: deadline?

X1: losing sleep?

X3: sick?

X4: losing hair?

X5: overeating?

X6: loud roommate?

X7: taking classes?

X8: cold weather?
X9: exercising?

X11: single?
X10: gaining weight?


Modeling Distributions

3

X2: deadline?

X1: losing sleep?

X5: overeating?

X7: taking classes?

P(X1, X5 | X2, X7) = P( losing sleep, overeating | deadline, taking classes )

Goal: Model distribution P(X) over random variables X

E.g.: Model life of a grad student.


Markov Random Fields (MRFs)

4

X2: deadline?

X1: losing sleep?

X3: sick?

X4: losing hair?

X5: overeating?

X6: loud roommate?

X7: taking classes?

X8: cold weather?
X9: exercising?

X11: single?
X10: gaining weight?

Goal: Model distribution P(X) over random variables X

E.g.: Model life of a grad student.


Markov Random Fields (MRFs)

5

X2

X1

X3

X4

X5

X6

X7

X8

X9

X10
X11

graphical structure

factor (parameters)

Goal: Model distribution P(X) over random variables X
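The factorization shown on this slide is not reproduced in this transcript; as a sketch, an MRF with factors ψ_C over cliques C of the graph and partition function Z has the form

P(X) = (1/Z) \prod_C \psi_C(X_C),   where   Z = \sum_X \prod_C \psi_C(X_C).

The graphical structure records which variables appear together in a factor; the factors carry the parameters.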


Conditional Random Fields (CRFs)

6

X2

Y1

Y3

Y4

Y5

X1

X3

X4

X5

X6
Y2

MRFs: P(X). CRFs: P(Y|X) (Lafferty et al., 2001).

Do not model P(X). Simpler structure (over Y only).
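The slide's equation is omitted here; as a sketch, a CRF with factors ψ_C over cliques of Y (and associated evidence X) defines the conditional

P(Y \mid X) = (1/Z(X)) \prod_C \psi_C(Y_C, X_C),   where   Z(X) = \sum_Y \prod_C \psi_C(Y_C, X_C).

Note that Z(X) sums only over Y, which is why the simpler structure over Y matters.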


MRFs & CRFs

7

Benefits:
• Principled statistical and computational framework
• Large body of literature

Applications:
• Natural language processing (e.g., Lafferty et al., 2001)
• Vision (e.g., Tappen et al., 2007)
• Activity recognition (e.g., Vail et al., 2007)
• Medical applications (e.g., Schmidt et al., 2008)
• ...


Challenges

8

Goal: Given data, learn CRF structure and parameters.

X2

Y1

Y3

Y4

Y5

X1

X5

X6
Y2

Many learning methods require inference, i.e., answering queries P(A|B)

NP-hard in general (Srebro, 2003)

Big structured optimization problem

NP-hard to approximate (Roth, 1996)

Approximations often lack strong guarantees.


Thesis Statement

CRFs offer statistical and computational advantages, but traditional learning methods are often impractical for large problems.

We can scale learning by using decompositions of learning problems which trade off sample complexity, computation, and parallelization.

9


Outline

10

Scaling core methods:
Parameter Learning: learning without intractable inference
Structure Learning: learning tractable structures

Parallel scaling:
Parallel Regression: multicore sparse regression

(Parameter and structure learning solve via parallel regression.)


Outline

Scaling core methods:
Parameter Learning: learning without intractable inference

11


Log-linear MRFs

12

X2

X1

X3

X4

X5

X6

X7

X8

X9

X10
X11

Goal: Model distribution P(X) over random variables X

Parameters and features:
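The equation itself is not in this transcript; a sketch of the standard log-linear parameterization, assuming a parameter vector θ and feature functions φ_C over the factors:

P_\theta(X) = (1/Z(\theta)) \exp( \sum_C \theta_C \cdot \phi_C(X_C) )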

All results generalize to CRFs.


Parameter Learning: MLE

13

Traditional method: max-likelihood estimation (MLE)

Minimize objective:

Loss
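The loss formula is omitted in this transcript; a sketch of the MLE objective, assuming the log-linear form above and training samples x^(1), ..., x^(n):

\min_\theta \; -(1/n) \sum_m \log P_\theta(x^{(m)}) \;=\; \min_\theta \; \log Z(\theta) - (1/n) \sum_m \sum_C \theta_C \cdot \phi_C(x^{(m)}_C)

The log Z(θ) term is what requires inference over the full model.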

Gold Standard: MLE is (optimally) statistically efficient.

Parameter Learning: Given structure Φ and samples from Pθ*(X), learn parameters θ.


Parameter Learning: MLE

14


Parameter Learning: MLE

15

MLE requires inference. Provably hard for general MRFs (Roth, 1996).

Inference makes learning hard.

Can we learn without intractable inference?


Parameter Learning: MLE

16

Inference makes learning hard.

Can we learn without intractable inference?

Approximate inference & objectives

• Many works: Hinton (2002), Sutton & McCallum (2005), Wainwright (2006), ...

• Many lack strong theory.
• Almost no guarantees for general MRFs or CRFs.


Our Solution

17

Max Likelihood Estimation (MLE): sample complexity: optimal; computational complexity: high; parallel optimization: difficult.

Max Pseudolikelihood Estimation (MPLE): sample complexity: high; computational complexity: low; parallel optimization: easy.

PAC learnability for many MRFs!

Bradley, Guestrin (2012)


Our Solution

18

Max Likelihood Estimation (MLE): sample complexity: optimal; computational complexity: high; parallel optimization: difficult.

Max Pseudolikelihood Estimation (MPLE): sample complexity: high; computational complexity: low; parallel optimization: easy.

PAC learnability for many MRFs!

Bradley, Guestrin (2012)


Our Solution

19

Max Likelihood Estimation (MLE): sample complexity: optimal; computational complexity: high; parallel optimization: difficult.

Max Pseudolikelihood Estimation (MPLE): sample complexity: high; computational complexity: low; parallel optimization: easy.

Max Composite Likelihood Estimation (MCLE): sample complexity: low; computational complexity: low; parallel optimization: easy.

Choose MCLE structure to optimize trade-offs.

Bradley, Guestrin (2012)


Deriving Pseudolikelihood (MPLE)

20

X2

X1

X3

X4

X5

MLE:

Hard to compute. So replace it!


Deriving Pseudolikelihood (MPLE)

21

X1

MLE:

Estimate via regression:

MPLE:

(Besag, 1975)

Tractable inference!
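The MPLE objective is not reproduced in this transcript; it replaces the joint likelihood with a product of per-variable conditionals. A sketch, for training samples x^(1), ..., x^(n):

\min_\theta \; -(1/n) \sum_m \sum_i \log P_\theta( x^{(m)}_i \mid x^{(m)}_{\setminus i} )

Each conditional is normalized over a single variable X_i, so no intractable partition function appears.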


Pseudolikelihood (MPLE)

22

Pros:
• No intractable inference!
• Consistent estimator

Cons:
• Less statistically efficient than MLE (Liang & Jordan, 2008)
• No PAC bounds

PAC = Probably Approximately Correct (Valiant, 1984)

MPLE:

(Besag, 1975)


Sample Complexity: MLE

23

Our Theorem: Bound on n (# training examples needed), in terms of:
• # parameters (length of θ)
• Λmin: min eigenvalue of Hessian of loss at θ*
• parameter error (L1)
• probability of failure

Recall: Requires intractable inference.


Sample Complexity: MPLE

24

Our Theorem: Bound on n (# training examples needed), in terms of:
• # parameters (length of θ)
• Λmin: mini [ min eigenvalue of Hessian of component i at θ* ]
• parameter error (L1)
• probability of failure

Recall: Tractable inference.

PAC learnabilityfor many MRFs!


Sample Complexity: MPLE

25

Our Theorem: Bound on n (# training examples needed)

PAC learnabilityfor many MRFs!

Related Work
Ravikumar et al. (2010)
• Regression Yi ~ X with Ising models
• Basis of our theory
Liang & Jordan (2008)
• Asymptotic analysis of MLE, MPLE
• Our bounds match theirs
Abbeel et al. (2006)
• Only previous method with PAC bounds for high-treewidth MRFs
• We extend their work: extension to CRFs, algorithmic improvements, analysis
• Their method is very similar to MPLE.


Trade-offs: MLE & MPLE

26

Our Theorem: Bound on n (# training examples needed)

Trade-off between sample complexity and computational complexity:

MLE: larger Λmin => lower sample complexity, but higher computational complexity.

MPLE: smaller Λmin => higher sample complexity, but lower computational complexity.


Trade-offs: MPLE

27

Joint optimization for MPLE: lower sample complexity.

Disjoint optimization for MPLE: 2 estimates of each shared parameter; average the estimates. Data-parallel.

Trade-off between sample complexity and parallelism.
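A sketch of the two variants (a reconstruction, not the slide's exact notation; \hat{E} denotes the empirical average over training samples):

Joint MPLE:    \hat\theta = argmin_\theta \sum_i \hat{E}[ -\log P_\theta(X_i \mid X_{\setminus i}) ]   (one optimization over the shared θ)

Disjoint MPLE: for each i,  \hat\theta^{(i)} = argmin_\theta \hat{E}[ -\log P_\theta(X_i \mid X_{\setminus i}) ],  then average the estimates of any parameter shared by several regressions.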


Synthetic CRFs

28

Factor types: random, associative.
Structures: chains, stars, grids.

Factor strength = strength of variable interactions


Predictive Power of Bounds

29

Errors should be ordered: MLE < MPLE < MPLE-disjoint

Plot: L1 parameter error ε (lower is better) vs. # training examples, for MLE, MPLE, and MPLE-disjoint. Factors: random, fixed strength. Length-4 chains.


Predictive Power of Bounds

30

MLE & MPLE sample complexity (bound formulas omitted in this transcript). Plot: actual ε (lower is better) as the problem gets harder; MLE shown. Factors: random. Length-6 chains. 10,000 training examples.


Failure Modes of MPLE

31

How do Λmin(MLE) and Λmin(MPLE) vary for different models?

Sample complexity depends on Λmin. We examine three model properties:
• Model diameter
• Factor strength
• Node degree


Λmin: Model Diameter

32

Plot: Λmin ratio (MLE/MPLE; higher = MLE better) vs. model diameter. Factors: associative, fixed strength. Chains.

Relative MPLE performance is independent of diameter in chains. (Same for random factors.)


Λmin: Factor Strength

33

Plot: Λmin ratio (MLE/MPLE; higher = MLE better) vs. factor strength. Factors: associative. Length-8 chains.

MPLE performs poorly with strong factors. (Same for random factors, and star & grid models.)


Λmin: Node Degree

34

Plot: Λmin ratio (MLE/MPLE; higher = MLE better) vs. node degree. Factors: associative, fixed strength. Stars.

MPLE performs poorly with high-degree nodes. (Same for random factors.)


Failure Modes of MPLE

35

How do Λmin(MLE) and Λmin(MPLE) vary for different models?

Sample complexity depends on Λmin. Problem properties examined:
• Model diameter
• Factor strength
• Node degree

We can often fix this!


Composite Likelihood (MCLE)

36

MLE: Estimate P(Y) all at once


Composite Likelihood (MCLE)

37

MLE: Estimate P(Y) all at once

MPLE: Estimate P(Yi|Y\i) separately

Yi


Composite Likelihood (MCLE)

38

MLE: Estimate P(Y) all at once

MPLE: Estimate P(Yi|Y\i) separately

YAi

Something in between?

Composite Likelihood (MCLE):

Estimate P(YAi | Y\Ai) separately. (Lindsay, 1988)
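A sketch of the composite likelihood objective (a reconstruction, assuming node-disjoint blocks A_1, ..., A_K and training samples y^(1), ..., y^(n)):

\min_\theta \; -(1/n) \sum_m \sum_k \log P_\theta( y^{(m)}_{A_k} \mid y^{(m)}_{\setminus A_k} )

Taking each block A_k = {i} recovers MPLE; taking a single block containing all of Y recovers MLE.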


Generalizes MLE, MPLE; analogous: objective, sample complexity, joint & disjoint optimization.

Composite Likelihood (MCLE)

39

MCLE Class: node-disjoint subgraphs which cover the graph.


Composite Likelihood (MCLE)

40

MCLE Class: node-disjoint subgraphs which cover the graph.

• Trees (tractable inference)
• Follow structure of P(X)
• Cover star structures
• Cover strong factors
• Choose large components

Combs

Generalizes MLE, MPLE; analogous: objective, sample complexity, joint & disjoint optimization.


Structured MCLE on a Grid

41

Plots: (1) log-loss ratio (other/MLE; lower is better) vs. grid size |X|, for MCLE (combs) and MPLE; (2) training time (sec) vs. grid size |X|, for MCLE (combs), MPLE, and MLE. Grid, associative factors, 10,000 training examples, Gibbs sampling.

MCLE (combs) lowers sample complexity... without increasing computation!

MCLE tailored to model structure. Also in thesis: tailoring to correlations in data.


Summary: Parameter Learning

42

Likelihood (MLE): sample complexity: optimal; computational complexity: high; parallel optimization: difficult.

Pseudolikelihood (MPLE): sample complexity: high; computational complexity: low; parallel optimization: easy.

Composite Likelihood (MCLE): sample complexity: low; computational complexity: low; parallel optimization: easy.

• Finite sample complexity bounds for general MRFs, CRFs
• PAC learnability for certain classes
• Empirical analysis
• Guidelines for choosing MCLE structures: tailor to model, data


Outline

43

Scaling core methods:
Structure Learning: learning tractable structures


CRF Structure Learning

44

X3: deadline?

Y1: losing sleep?

Y3: sick? Y2: losing hair?

X1: loud roommate?

X2: taking classes?

Structure learning: Choose YC

I.e., learn conditional independence

Evidence selection: Choose XD

I.e., select X relevant to each YC


Related Work

Previous Work:

Torralba et al. (2004), Boosted Random Fields: structure learning: yes; tractable inference: no; evidence selection: yes.

Schmidt et al. (2008), block-L1 regularized pseudolikelihood: structure learning: yes; tractable inference: no; evidence selection: no.

Shahaf et al. (2009), edge weights + low-treewidth model: structure learning: yes; tractable inference: yes; evidence selection: no.

Most similar to our work: Shahaf et al. (2009). They focus on selecting treewidth-k structures; we focus on the choice of edge weight.

45


Tree CRFs with Local Evidence

Goal: Given data and local evidence, learn tree CRF structure via a scalable method.

Bradley, Guestrin (2010)

46

(Local evidence: Xi relevant to each Yi. Tree structure: fast inference at test time.)


Chow-Liu for MRFs

47

Chow & Liu (1968)

Y1, Y2, Y3

Algorithm: Weight edges with mutual information:
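The mutual information weight itself is omitted here; the standard estimate from data is

w(i,j) = I(Y_i; Y_j) = \sum_{y_i, y_j} \hat{P}(y_i, y_j) \log [ \hat{P}(y_i, y_j) / ( \hat{P}(y_i) \hat{P}(y_j) ) ].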


Chow-Liu for MRFs

48

Chow & Liu (1968). Algorithm:
Weight edges with mutual information.
Choose max-weight spanning tree.

Y1, Y2, Y3

Chow-Liu finds a max-likelihood structure.
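A minimal, illustrative Python sketch of the Chow-Liu procedure for discrete MRFs (empirical mutual information as edge weights, plus a max-weight spanning tree via Kruskal's algorithm); the function names are mine, not from the thesis:

import numpy as np
from itertools import combinations

def empirical_mi(a, b):
    # Empirical mutual information (in nats) between two discrete sample vectors.
    mi = 0.0
    for va in np.unique(a):
        pa = np.mean(a == va)
        for vb in np.unique(b):
            pb = np.mean(b == vb)
            pab = np.mean((a == va) & (b == vb))
            if pab > 0:
                mi += pab * np.log(pab / (pa * pb))
    return mi

def chow_liu_tree(samples):
    # samples: (n, d) array of discrete values; returns a list of tree edges (i, j).
    d = samples.shape[1]
    edges = sorted(
        ((empirical_mi(samples[:, i], samples[:, j]), i, j)
         for i, j in combinations(range(d), 2)),
        reverse=True)                     # descending by mutual information
    parent = list(range(d))               # union-find for Kruskal's algorithm
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    tree = []
    for _, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:                      # add the edge only if it joins two components
            parent[ri] = rj
            tree.append((i, j))
    return tree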


Chow-Liu for CRFs?

What edge weight? It must be efficient to compute.

Global Conditional Mutual Information (CMI)

Pro: Finds max-likelihood structure (with enough data)

Con: Intractable for large |X|

49

Algorithm: Weight each possible edge:

Choose max-weight spanning tree.
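As a sketch, the global CMI edge weight conditions on the full evidence X, which is what makes it intractable for large |X|:

w(i,j) = I(Y_i; Y_j \mid X) = E_X[ \sum_{y_i, y_j} P(y_i, y_j \mid X) \log ( P(y_i, y_j \mid X) / ( P(y_i \mid X) P(y_j \mid X) ) ) ].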


Generalized Edge Weights

Global CMI

50

Local Linear Entropy Scores (LLES): w(i,j) = a linear combination of entropies over Yi, Yj, Xi, Xj.

Theorem: No LLES can recover all tree CRFs (even with non-trivial parameters and exact entropies).


Heuristic Edge Weights

Method: Global CMI. Guarantees: recovers true tree. Compute w(i,j) tractably: no. Comments: Shahaf et al. (2009).

Method: Local CMI. Guarantees: lower-bounds likelihood gain. Compute w(i,j) tractably: yes. Comments: fails with strong Yi-Xi potentials.

Method: Decomposable Conditional Influence (DCI). Guarantees: exact likelihood gain for some edges. Compute w(i,j) tractably: yes. Comments: best empirically.

51


Synthetic Tests. Trees w/ associative factors. |Y|=40. 1000 test samples. Error bars: 2 std. errors.

Plot: fraction of edges recovered vs. # training examples (0-500), for DCI, Global CMI, Local CMI, Schmidt et al., and the true CRF.

52


Synthetic Tests. Trees w/ associative factors. |Y|=40. 1000 test samples. Error bars: 2 std. errors.

Plot: training time in seconds (lower is better) vs. # training examples (0-500), for Global CMI, DCI, Local CMI, and Schmidt et al.

53



fMRI Tests

X: fMRI voxels (500)

Y: semantic features (218)

predict (Application & data from Palatucci et al., 2009)

Image from http://en.wikipedia.org/wiki/File:FMRI.jpg

Plot: E[log P(Y | X)] on test data (higher is better) for Disconnected (Palatucci et al., 2009), DCI 1, and DCI 2.

54


Summary: Structure Learning

55

• Analyzed generalizing Chow-Liu to CRFs
• Proposed class of edge weights: Local Linear Entropy Scores
  • Negative result: insufficient for recovering trees
• Discovered useful heuristic edge weights: Local CMI, DCI
  • Promising empirical results on synthetic & fMRI data

Generalized Chow-Liu: compute edge weights; choose max-weight spanning tree.


Outline

56

Scaling core methods:
Parameter Learning: pseudolikelihood, canonical parameterization
Structure Learning: generalized Chow-Liu

Parallel scaling:
Parallel Regression: multicore sparse regression

Solve via regression:
• Structure learning: compute edge weights via P(Yi, Yj | Xij)
• Parameter learning: regress each variable on its neighbors, P(Xi | X\i)


Sparse (L1) Regression (Bradley, Kyrola, Bickson, Guestrin, 2011)

Bias towards sparse solutions

Lasso (Tibshirani, 1996)

Objective:
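The objective itself is not reproduced in this transcript; the standard Lasso objective, with design matrix X, response vector y, and regularization parameter λ, is

\min_w \; (1/2) \| X w - y \|_2^2 + \lambda \| w \|_1.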

Goal: Predict y from x, given samples (x, y).

Useful in the high-dimensional setting (# features >> # examples). We consider Lasso and sparse logistic regression.

57


Parallelizing LASSO

Many LASSO optimization algorithms:
• Gradient descent, interior point, stochastic gradient, shrinkage, hard/soft thresholding
• Coordinate descent (a.k.a. Shooting; Fu, 1998)

One of the fastest algorithms (Yuan et al., 2010)

Parallel optimization:
• Matrix-vector ops (e.g., interior point): not great empirically
• Stochastic gradient (e.g., Zinkevich et al., 2010): best for many samples, not large d
• Shooting: inherently sequential?

Shotgun: parallel coordinate descent for L1 regression. Simple algorithm, elegant analysis.

58


Shooting: Sequential SCD

Stochastic Coordinate Descent (SCD):
While not converged:
  Choose random coordinate j
  Update wj (closed-form minimization)
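A minimal Python sketch of Shooting for the Lasso objective above; the closed-form coordinate update is a soft-threshold, and the variable names are mine, not from the thesis:

import numpy as np

def shooting_lasso(X, y, lam, iters=10000, seed=0):
    # Sequential stochastic coordinate descent for min_w 0.5*||Xw - y||^2 + lam*||w||_1.
    # Assumes no all-zero feature columns.
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    residual = y - X @ w                  # maintained so each update costs O(n)
    col_sqnorms = (X ** 2).sum(axis=0)
    for _ in range(iters):
        j = rng.integers(d)               # choose a random coordinate
        xj = X[:, j]
        # c is the optimal unregularized value of w_j times ||x_j||^2.
        c = xj @ residual + col_sqnorms[j] * w[j]
        new_wj = np.sign(c) * max(abs(c) - lam, 0.0) / col_sqnorms[j]  # soft-threshold
        residual += xj * (w[j] - new_wj)  # update residual for the change in w_j
        w[j] = new_wj
    return w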

59


Shotgun: Parallel SCD

Shotgun Algorithm (Parallel SCD):
While not converged:
  On each of P processors:
    Choose random coordinate j
    Update wj (same as for Shooting)

Nice case: uncorrelated features. Bad case: correlated features.
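A sketch that simulates one round of Shotgun's P simultaneous updates: all P coordinates are updated from the same stale copy of the residual, which is the behavior the analysis models. It reuses the soft-threshold update from the Shooting sketch above; again the names are mine:

import numpy as np

def shotgun_lasso(X, y, lam, P=8, rounds=2000, seed=0):
    # Simulated parallel SCD: each round, P distinct coordinates are updated
    # from the same stale residual, mimicking P concurrent updates.
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    col_sqnorms = (X ** 2).sum(axis=0)
    for _ in range(rounds):
        residual = y - X @ w                       # stale state shared by the P updates
        coords = rng.choice(d, size=min(P, d), replace=False)
        for j in coords:                           # in real Shotgun these run concurrently
            c = X[:, j] @ residual + col_sqnorms[j] * w[j]
            w[j] = np.sign(c) * max(abs(c) - lam, 0.0) / col_sqnorms[j]
        # With correlated features and large P, these stale updates can conflict,
        # which is why the theory limits the useful amount of parallelism.
    return w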

Is SCD inherently sequential?

60


Shotgun: Theory

Convergence Theorem (statement omitted in this transcript): assuming the number of parallel updates P is small enough (the limit involves ρ, the spectral radius of X^T X), the theorem bounds the gap between the final and optimal objectives after a given number of iterations.

Generalizes bounds for Shooting (Shalev-Shwartz & Tewari, 2009).

61


Shotgun: Theory

Convergence Theorem (statement omitted in this transcript): bounds the final-minus-optimal objective in terms of the number of iterations and the number of parallel updates P, where ρ = the spectral radius of X^T X.

Nice case: uncorrelated features. Bad case: correlated features.

62


Shotgun: Theory. Convergence Theorem: up to a threshold on the number of parallel updates P, linear speedups are predicted.

Experiments match our theory!

63

Plots: T (iterations) vs. P (parallel updates) for Mug32_singlepixcam (Pmax = 79) and SparcoProblem7 (Pmax = 284).


Lasso Experiments

Compared many algorithms:
• Interior point (L1_LS)
• Shrinkage (FPC_AS, SpaRSA)
• Projected gradient (GPSR_BB)
• Iterative hard thresholding (Hard_IO)
• Also ran: GLMNET, LARS, SMIDAS

35 datasets; λ = 0.5, 10.
Shooting; Shotgun with P = 8 (multicore).

Dataset groups: Single-Pixel Camera, Sparco (van den Berg et al., 2009), Sparse Compressed Imaging, Large Sparse Datasets.

64

Shotgun proves most scalable & robust.


Shotgun: Speedup. Aggregated results from all tests.

Plot: speedup vs. # cores, for the optimal (linear) speedup, Lasso iteration speedup, Lasso time speedup, and logistic regression time speedup.

Lasso time speedup: not so great. But we are doing fewer iterations!

Explanation: the memory wall (Wulf & McKee, 1995). The memory bus gets flooded.

Logistic regression uses more FLOPS/datum. Extra computation hides memory latency. Better speedups on average!

65


Summary: Parallel Regression

66

• Shotgun: parallel coordinate descent on multicore
• Analysis: near-linear speedups, up to a problem-dependent limit
• Extensive experiments (37 datasets, 7 other methods)
  • Our theory predicts empirical behavior well.
  • Shotgun is one of the most scalable methods.

Shotgun: decompose computation by coordinate updates. Trade a little extra computation for a lot of parallelism.


Recall: Thesis Statement
We can scale learning by using decompositions of learning problems which trade off sample complexity, computation, and parallelization.

67

Parameter Learning: structured composite likelihood (MLE, MCLE, MPLE)

Structure Learning: generalized Chow-Liu

Parallel Regression: Shotgun, parallel coordinate descent

Decompositions use model structure & locality. Trade-offs use model- and data-specific methods.


Future Work: Unified System

68

Parameter Learning Structure Learning

Parallel Regression

Structured MCLE. Automatically:
• choose MCLE structure & parallelization strategy
• to optimize trade-offs,
• tailored to model & data.

Shotgun (multicore) to distributed:

Limited communication in distributed setting.

Handle complex objectives (e.g., MCLE).

L1 Structure Learning

Learning Trees

Use structured MCLE?

Learn trees for parameter estimators?


Summary

Parameter learning: structured composite likelihood
• Finite sample complexity bounds
• Empirical analysis
• Guidelines for choosing MCLE structures: tailor to model, data
• Analyzed canonical parameterization of Abbeel et al. (2006)

69

We can scale learning by using decompositions of learning problems which trade off sample complexity, computation, and parallelization.

Structure learning: generalizing Chow-Liu to CRFs
• Proposed class of edge weights: Local Linear Entropy Scores
  • Insufficient for recovering trees
• Discovered useful heuristic edge weights: Local CMI, DCI
• Promising empirical results on synthetic & fMRI data

Parallel regression: Shotgun, parallel coordinate descent on multicore
• Analysis: near-linear speedups, up to a problem-dependent limit
• Extensive experiments (37 datasets, 7 other methods)
  • Our theory predicts empirical behavior well.
  • Shotgun is one of the most scalable methods.

Thank you!