
  • ECE 6504: Advanced Topics in Machine Learning

    Probabilistic Graphical Models and Large-Scale Learning

    Dhruv Batra, Virginia Tech

    Topics
    –  Markov Random Fields
       •  (Finish) MLE
       •  Structured SVMs

    Readings: KF 20.1-3, Barber 9.6

  • Administrativia

    •  HW3
       –  Extra credit

    •  Project Presentations
       –  When: April 22, 24
       –  Where: in class
       –  5 min talk
          •  Main results
          •  Semester completion is 2 weeks out from that point, so nearly finished results are expected
          •  Slides due: April 21, 11:55pm

    (C) Dhruv Batra 2

  • Recap of Last Time

    (C) Dhruv Batra 3

  • Main Issues in PGMs

    •  Representation
       –  How do we store P(X1, X2, …, Xn)?
       –  What does my model mean/imply/assume? (Semantics)

    •  Inference
       –  How do I answer questions/queries with my model? For example:
          •  Marginal Estimation: P(X5 | X1, X4)
          •  Most Probable Explanation: argmax P(X1, X2, …, Xn)

    •  Learning
       –  How do we learn the parameters and structure of P(X1, X2, …, Xn) from data?
       –  Which model is right for my data?

    (C) Dhruv Batra 4

  • Recall -- Learning Bayes Nets

    (C) Dhruv Batra 5

    [Figure: data samples x(1), …, x(m) drawn from a true distribution P* (which may or may not correspond to a BN G*), combined with domain experts, yield the CPTs P(Xi | PaXi)]

  • Learning Bayes Nets

                             Known structure         Unknown structure
    Fully observable data    Very easy               Hard
    Missing data             Somewhat easy (EM)      Very very hard

    [Figure: data x(1), …, x(m) are used to learn the structure and parameters (CPTs P(Xi | PaXi))]

    (C) Dhruv Batra 6 Slide Credit: Carlos Guestrin

  • Learning the CPTs

    [Figure: data x(1), …, x(m)]

    For each discrete variable Xi:

    \hat{P}_{\text{MLE}}(X_i = a \mid \text{Pa}_{X_i} = b) \;=\; \frac{\text{Count}(X_i = a,\, \text{Pa}_{X_i} = b)}{\text{Count}(\text{Pa}_{X_i} = b)}

    (C) Dhruv Batra 7 Slide Credit: Carlos Guestrin
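    A minimal sketch (not from the slides) of this counting estimator in Python, assuming fully observed data stored as dicts; variable names and values are illustrative:

        from collections import Counter

        def mle_cpt(data, child, parents):
            """MLE estimate of P(child | parents) from fully observed data.
            data: list of dicts mapping variable name -> observed value.
            Returns a dict: (child_value, parent_value_tuple) -> probability."""
            joint = Counter()   # Count(Xi = a, PaXi = b)
            margin = Counter()  # Count(PaXi = b)
            for sample in data:
                pa = tuple(sample[p] for p in parents)
                joint[(sample[child], pa)] += 1
                margin[pa] += 1
            return {(a, pa): c / margin[pa] for (a, pa), c in joint.items()}

        # Toy usage with hypothetical binary variables
        data = [{"Flu": 1, "Sinus": 1}, {"Flu": 1, "Sinus": 0}, {"Flu": 0, "Sinus": 0}]
        print(mle_cpt(data, child="Sinus", parents=["Flu"]))
        # {(1, (1,)): 0.5, (0, (1,)): 0.5, (0, (0,)): 1.0}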

  • Learning Markov Nets

                             Known structure         Unknown structure
    Fully observable data    NP-Hard (but doable)    Harder
    Missing data             Harder (EM)             Don’t try this at home

    [Figure: data x(1), …, x(m) are used to learn the structure and parameters (factors Ψc(xc))]

    (C) Dhruv Batra 8 Slide Credit: Carlos Guestrin

  • Learning Parameters of a BN •  Log likelihood decomposes: (reconstructed below)

    •  Learn each CPT independently

    •  Use counts
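    A hedged reconstruction of the decomposition referenced above (the slide's equation is not in the transcript), in standard BN notation:

    \ell(\theta; \mathcal{D}) \;=\; \sum_{m=1}^{M} \log P(\mathbf{x}^{(m)}) \;=\; \sum_{i} \left[ \sum_{m=1}^{M} \log P\!\left(x_i^{(m)} \mid \mathrm{pa}_{X_i}^{(m)}\right) \right]

    Each bracketed term depends only on the CPT of Xi, so each CPT can be fit independently from counts.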

    [Figure: example BN with Flu and Allergy as parents of Sinus, and Sinus as parent of Nose]

    (C) Dhruv Batra 9 Slide Credit: Carlos Guestrin

  • Log Likelihood for MN

    •  Log likelihood:

    •  Doesn’t decompose!
       –  log Z couples all parameters together (see the reconstruction below)
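    A hedged reconstruction of the MN log-likelihood for the log-linear case (the slide's equation is not in the transcript):

    \ell(\mathbf{w}; \mathcal{D}) \;=\; \sum_{m=1}^{M} \left[ \mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x}^{(m)}) - \log Z(\mathbf{w}) \right], \qquad Z(\mathbf{w}) \;=\; \sum_{\mathbf{x}} e^{\mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x})}

    The log Z(w) term sums over all joint assignments, which is what couples the parameters together.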

    [Figure: the same example network with nodes Flu, Allergy, Sinus, Nose]

    (C) Dhruv Batra 10 Slide Credit: Carlos Guestrin

  • Log-linear Markov network (most common representation)

    •  Feature (or Sufficient Statistic) is some function φ[D] for some subset of variables D
       –  e.g., an indicator function

    •  Log-linear model over a Markov network H:
       –  a set of features φ1[D1], …, φk[Dk]
          •  each Di is a subset of a clique in H
          •  two φ’s can be over the same variables
       –  a set of weights w1, …, wk
          •  usually learned from data
       –  P(\mathbf{X}) \;=\; \frac{1}{Z(\mathbf{w})} \exp\left( \sum_{i=1}^{k} w_i\, \phi_i[D_i] \right)

    (C) Dhruv Batra 11 Slide Credit: Carlos Guestrin
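    To make the definition concrete, a tiny sketch (not from the slides) of a log-linear model over two binary variables, with Z computed by brute-force enumeration; the features and weights are made up:

        import itertools, math

        # Hypothetical features over two binary variables (x1, x2):
        # phi_1 = 1[x1 == x2] (agreement), phi_2 = 1[x1 == 1] (bias on x1)
        def features(x1, x2):
            return [float(x1 == x2), float(x1 == 1)]

        w = [1.5, -0.5]  # made-up weights

        def score(x1, x2):
            return sum(wi * fi for wi, fi in zip(w, features(x1, x2)))

        # Partition function by brute force: only feasible for tiny models
        Z = sum(math.exp(score(a, b)) for a, b in itertools.product([0, 1], repeat=2))

        def prob(x1, x2):
            return math.exp(score(x1, x2)) / Z

        print({(a, b): round(prob(a, b), 3) for a, b in itertools.product([0, 1], repeat=2)})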

  • Learning params for log-linear models – Gradient Ascent

    •  Log-likelihood of data: (same form as the reconstruction above)

    •  Compute derivative & optimize
       –  usually with gradient ascent or L-BFGS

    (C) Dhruv Batra 12 Slide Credit: Carlos Guestrin

  • Learning log-linear models with gradient ascent

    •  Gradient:

       \frac{\partial \ell(\mathbf{w})}{\partial w_i} \;=\; \sum_{m=1}^{M} \phi_i(\mathbf{x}^{(m)}) \;-\; M\, \mathbb{E}_{\mathbf{w}}\!\left[\phi_i(\mathbf{X})\right]

    •  Requires one inference computation per gradient step

    •  Theorem: w is the maximum likelihood solution iff the empirical and expected feature counts match:
       –  \frac{1}{M} \sum_{m} \phi_i(\mathbf{x}^{(m)}) \;=\; \mathbb{E}_{\mathbf{w}}\!\left[\phi_i(\mathbf{X})\right] \quad \forall i

    •  Usually, must regularize
       –  E.g., L2 regularization on parameters

    (C) Dhruv Batra 13 Slide Credit: Carlos Guestrin
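    A minimal sketch (not from the slides) of this loop on a toy two-binary-variable model, computing the expected feature counts by exact enumeration (real models need approximate inference here); all names and values are illustrative:

        import itertools
        from math import exp

        STATES = list(itertools.product([0, 1], repeat=2))

        def features(x):
            # Made-up features: agreement between the two variables, and a bias on x[0]
            return [float(x[0] == x[1]), float(x[0] == 1)]

        def expected_features(w):
            # Exact inference by enumeration: E_w[phi] under the current model
            scores = [exp(sum(wi * fi for wi, fi in zip(w, features(x)))) for x in STATES]
            Z = sum(scores)
            return [sum(s * features(x)[i] for s, x in zip(scores, STATES)) / Z
                    for i in range(len(w))]

        def grad_ascent(data, w, lr=0.1, steps=200):
            m = len(data)
            # Empirical feature counts (fixed throughout)
            emp = [sum(features(x)[i] for x in data) / m for i in range(len(w))]
            for _ in range(steps):
                exp_f = expected_features(w)  # one inference per gradient step
                w = [wi + lr * (e - ef) for wi, e, ef in zip(w, emp, exp_f)]
            return w

        data = [(0, 0), (1, 1), (1, 1), (0, 1)]  # toy dataset
        print(grad_ascent(data, w=[0.0, 0.0]))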

  • Plan for today

    •  MRF Parameter Learning
       –  MLE
          •  Conditional Random Fields
          •  Feature example
       –  Max-Margin
          •  Structured SVMs
          •  Cutting-Plane Algorithm
          •  (Stochastic) Subgradient Descent

    (C) Dhruv Batra 14

  • Semantic Segmentation

    •  Setup
       –  20 categories + background
       –  Dataset: Pascal Segmentation Challenge (VOC 2012)
       –  1500 train/val/test images

    (C) Dhruv Batra 15

  • Conditional Random Fields

    (C) Dhruv Batra 16

    [Figure: CRF with connected output nodes Yi, Yj and corresponding observed inputs Xi, Xj]

  • Conditional Random Fields •  Log-Potentials / Scores

    •  Express as a Log-Linear Model –  On board

    (C) Dhruv Batra 17

    S(\mathbf{y}) \;=\; \sum_{i \in V} \theta_i(y_i) \;+\; \sum_{(i,j) \in E} \theta_{ij}(y_i, y_j)

    \theta_i(y_i) = \mathbf{w}_i \cdot \boldsymbol{\phi}(\mathbf{x}, y_i), \qquad \theta_{ij}(y_i, y_j) = \mathbf{w}_{ij} \cdot \boldsymbol{\phi}(\mathbf{x}, y_i, y_j)

    P(\mathbf{y}) \;=\; \frac{1}{Z}\, e^{S(\mathbf{y})}, \qquad \mathbf{y}_{\text{MAP}} \;=\; \underset{\mathbf{y}}{\arg\max}\; P(\mathbf{y})
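    A small sketch (not from the slides) of the score S(y) on a 3-node chain, with the MAP assignment found by brute-force enumeration; the log-potential values are made up:

        import itertools

        # Chain over 3 binary outputs: V = {0, 1, 2}, E = {(0,1), (1,2)}
        EDGES = [(0, 1), (1, 2)]
        theta_i = [[0.2, -0.1], [0.0, 0.3], [0.5, -0.5]]            # unary log-potentials theta_i(y_i)
        theta_ij = {e: [[1.0, -1.0], [-1.0, 1.0]] for e in EDGES}   # pairwise: rewards agreement

        def S(y):
            s = sum(theta_i[i][yi] for i, yi in enumerate(y))
            s += sum(theta_ij[(i, j)][y[i]][y[j]] for (i, j) in EDGES)
            return s

        # MAP by enumeration; a real chain would use max-product / Viterbi
        y_map = max(itertools.product([0, 1], repeat=3), key=S)
        print(y_map, S(y_map))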

  • MLE for CRFs •  Model

    •  Log-Likelihood:
       –  On board (reconstructed below)

    •  Derivative:
       –  On board (reconstructed below)

    (C) Dhruv Batra 18

    P(\mathbf{y} \mid \mathbf{x}) \;=\; \frac{1}{Z_{\mathbf{x}}}\, e^{S(\mathbf{y}; \mathbf{x})} \;=\; \frac{1}{Z_{\mathbf{x}}}\, e^{\mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x}, \mathbf{y})}
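    A hedged reconstruction of the “on board” steps for the model above (standard CRF results, not transcribed from the lecture):

    \ell(\mathbf{w}) \;=\; \sum_{j} \left[ \mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x}_j, \mathbf{y}_j) - \log Z_{\mathbf{x}_j}(\mathbf{w}) \right]

    \nabla_{\mathbf{w}} \ell \;=\; \sum_{j} \left[ \boldsymbol{\phi}(\mathbf{x}_j, \mathbf{y}_j) - \mathbb{E}_{P(\mathbf{y} \mid \mathbf{x}_j)}\!\left[ \boldsymbol{\phi}(\mathbf{x}_j, \mathbf{y}) \right] \right]

    Note that Z is per-example (Z_x), so each gradient step needs one inference computation per training example.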

  • New Topic: Structured SVMs

    (C) Dhruv Batra 19

  • Recall: Generative vs. Discriminative

    •  Generative Approach (Naïve Bayes)
       –  Estimate p(X|Y) and p(Y)
       –  Use Bayes Rule to predict y

    •  Discriminative Approach
       –  Estimate p(Y|X) directly (Logistic Regression)
       –  Learn “discriminant” function h(x) (Support Vector Machine)

    (C) Dhruv Batra 20

  • Recall: Generative vs. Discriminative

    •  Generative Approach (Markov Random Fields)
       –  Estimate p(X,Y)
       –  At test time, use P(X=x, Y) to predict y

    •  Discriminative Approach
       –  Estimate p(Y|X) directly (Conditional Random Fields)
       –  Learn “discriminant” function h(x) (Structured SVMs)

    (C) Dhruv Batra 21

    h(\mathbf{x}) \;=\; \underset{\mathbf{y} \in \mathcal{Y}}{\arg\max}\; \mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x}, \mathbf{y})

  • Structured SVM

    •  Joint features φ(x, y) describe the match between x and y
    •  Learn weights w so that w⊤φ(x, y) is max for the correct y

    [Figure: score landscapes w⊤φ(x1, ·), …, w⊤φ(xj, ·), …, w⊤φ(xm, ·) over outputs y, one per training example (xj, yj)]

    (C) Dhruv Batra 22

  • Structured SVM •  Hard Margin

    –  On board (a standard reconstruction follows below)

    (C) Dhruv Batra 23
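    A hedged reconstruction of the hard-margin formulation sketched on the board (standard form, not transcribed from the lecture):

    \min_{\mathbf{w}} \;\; \frac{1}{2} \|\mathbf{w}\|^2
    \quad \text{s.t.} \quad
    \mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x}_j, \mathbf{y}_j) \;\ge\; \mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x}_j, \mathbf{y}) + 1
    \qquad \forall j,\; \forall \mathbf{y} \in \mathcal{Y} \setminus \{\mathbf{y}_j\}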

  • Soft-Margin Structured SVM

    •  Two ideas –  Add slack

    [Figure: score landscapes w⊤φ(xj, ·) per training example (xj, yj), with slack variables absorbing margin violations]

    (C) Dhruv Batra 24

  • Soft-Margin Structured SVM

    •  Two ideas –  Add slack –  Re-scale the margin with a loss function

    •  Margin-Rescaled SSVMs

    [Figure: score landscapes w⊤φ(xj, ·) per training example (xj, yj), with the margin rescaled by the loss and slack absorbing violations]

    Lemma: the training loss is upper bounded by the average slack:

    \text{Err}(h) \;=\; \frac{1}{m} \sum_{j=1}^{m} \Delta(\mathbf{y}_j, h(\mathbf{x}_j)) \;\le\; \frac{1}{m} \sum_{j=1}^{m} \xi_j

    (C) Dhruv Batra 25
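    A one-line proof sketch (not on the slide): taking the margin-rescaled constraint at y = h(x_j) and using the fact that h(x_j) maximizes the score w⊤φ(x_j, ·),

    \xi_j \;\ge\; \Delta(\mathbf{y}_j, h(\mathbf{x}_j)) - \left[ \mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x}_j, \mathbf{y}_j) - \mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x}_j, h(\mathbf{x}_j)) \right] \;\ge\; \Delta(\mathbf{y}_j, h(\mathbf{x}_j))

    since the bracketed difference is non-positive; averaging over j gives the bound.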

  • Soft-Margin Structured SVM

    •  Minimize

       \min_{\mathbf{w}, \boldsymbol{\xi}} \;\; \frac{1}{2} \|\mathbf{w}\|^2 + \frac{C}{N} \sum_{j} \xi_j

       subject to

       \mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x}_j, \mathbf{y}_j) \;\ge\; \mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x}_j, \mathbf{y}) + \Delta(\mathbf{y}_j, \mathbf{y}) - \xi_j \qquad \forall j,\; \forall \mathbf{y}

    •  Too many constraints!

    (C) Dhruv Batra 26

    Slide Credit: Thorsten Joachims

  • Cutting-Plane Method

    •  Cutting Plane
       –  Suppose we only solve the SVM objective over a small subset of constraints (working set).
       –  Some constraints from the global set might be violated.

    \min_{\mathbf{w}, \boldsymbol{\xi}} \;\; \frac{1}{2} \|\mathbf{w}\|^2 + \frac{C}{N} \sum_{j} \xi_j
    \quad \text{s.t.} \quad
    \mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x}_j, \mathbf{y}_j) \;\ge\; \mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x}_j, \mathbf{y}) + \Delta(\mathbf{y}_j, \mathbf{y}) - \xi_j

    (C) Dhruv Batra 27

    Slide Credit: Thorsten Joachims

  • Cutting-Plane Method

    (C) Dhruv Batra 28

  • Cutting-Plane Method

    Original SVM Problem
    •  Exponentially many constraints
    •  Most are dominated by a small set of “important” constraints

    Structural SVM Approach
    •  Repeatedly finds the next most violated constraint…
    •  …until the set of constraints is a good approximation.

    *This is known as a “cutting plane” method.

    (C) Dhruv Batra 29 Slide Credit: Yisong Yue


  • Cutting-Plane Method

    •  Cutting Plane
       –  Suppose we only solve the SVM objective over a small subset of constraints (working set).
       –  Some constraints from the global set might be violated.
       –  Degree of violation?

    \min_{\mathbf{w}, \boldsymbol{\xi}} \;\; \frac{1}{2} \|\mathbf{w}\|^2 + \frac{C}{N} \sum_{j} \xi_j
    \quad \text{s.t.} \quad
    \mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x}_j, \mathbf{y}_j) \;\ge\; \mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x}_j, \mathbf{y}) + \Delta(\mathbf{y}_j, \mathbf{y}) - \xi_j

    \text{violation} \;=\; \mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x}_j, \mathbf{y}) + \Delta(\mathbf{y}_j, \mathbf{y}) - \xi_j - \mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x}_j, \mathbf{y}_j)

    (C) Dhruv Batra 33

    Slide Credit: Thorsten Joachims

  • Finding Most Violated Constraint

    •  Finding the most violated constraint is equivalent to maximizing the RHS w/o slack:

       \text{Violation} \;=\; \mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x}, \mathbf{y}) + \Delta(\mathbf{y}_j, \mathbf{y})

    •  Requires solving:

       \underset{\mathbf{y}}{\arg\max} \;\left[ \mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x}, \mathbf{y}) + \Delta(\mathbf{y}_j, \mathbf{y}) \right]

    •  Highly related to inference:

       h(\mathbf{x}; \mathbf{w}) \;=\; \underset{\mathbf{y} \in \mathcal{Y}}{\arg\max}\; \left[ \mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x}, \mathbf{y}) \right]

    Slide Credit: Thorsten Joachims (C) Dhruv Batra 34
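    A hedged sketch (not from the slides) of the overall cutting-plane training loop, assuming the caller supplies a loss-augmented inference oracle and a QP solver; all function names are illustrative placeholders, not a real API:

        def dot(u, v):
            return sum(a * b for a, b in zip(u, v))

        def cutting_plane_ssvm(data, phi, delta, loss_aug_inference, solve_qp,
                               C=1.0, eps=1e-3, max_iters=100):
            """Margin-rescaled SSVM training over a growing working set.

            Assumed caller-supplied pieces:
              phi(x, y)                     -> joint feature vector (list of floats)
              delta(y_true, y)              -> task loss
              loss_aug_inference(w, x, yt)  -> argmax_y  w.phi(x, y) + delta(yt, y)
              solve_qp(working_set, C)      -> (w, xi): solves the QP restricted to the
                                               working set; xi has one slack per example
            """
            working_set = []
            w, xi = solve_qp(working_set, C)
            for _ in range(max_iters):
                added = False
                for j, (x, y_true) in enumerate(data):
                    y_hat = loss_aug_inference(w, x, y_true)        # most violated output
                    violation = (dot(w, phi(x, y_hat)) + delta(y_true, y_hat)
                                 - xi[j] - dot(w, phi(x, y_true)))  # degree of violation (slide 33)
                    if violation > eps:
                        working_set.append((j, y_hat))              # add the violated constraint
                        added = True
                if not added:                       # working set is a good approximation: done
                    break
                w, xi = solve_qp(working_set, C)    # re-solve the QP over the enlarged set
            return w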

  • Side note: What’s the difference between SVMs and logistic regression?

    •  SVM: Hinge Loss
    •  Logistic regression (LR): Log Loss

    (C) Dhruv Batra 35 Slide Credit: Carlos Guestrin
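    A hedged reconstruction of the two losses being contrasted (the slide's equations are not in the transcript), for the binary case with label y ∈ {−1, +1} and score f(x) = w⊤x:

    \ell_{\text{hinge}}(y, f(\mathbf{x})) \;=\; \max\!\left(0,\; 1 - y\, f(\mathbf{x})\right), \qquad \ell_{\text{log}}(y, f(\mathbf{x})) \;=\; \log\!\left(1 + e^{-y\, f(\mathbf{x})}\right)

    Both upper bound the 0/1 loss; the hinge is exactly zero beyond the margin, while the logistic loss is never zero.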