
Page 1: Chapter 10: Random Fields

LEARNING AND INFERENCE IN GRAPHICAL MODELS

Chapter 10: Random Fields

Dr. Martin Lauer

University of Freiburg, Machine Learning Lab

Karlsruhe Institute of Technology, Institute of Measurement and Control Systems


Page 2: Chapter 10: Random Fields

References for this chapter

◮ Christopher M. Bishop, Pattern Recognition and Machine Learning, ch. 8, Springer, 2006

◮ Michael Ying Yang and Wolfgang Förstner, A hierarchical conditional random field model for labeling and classifying images of man-made scenes. In: IEEE International Conference on Computer Vision Workshops (ICCV Workshops), pp. 196-203, 2011


Page 3: Chapter 10: Random Fields

Motivation

Bayesian networks model clear dependencies, often causal dependencies. Bayesian networks are acyclic.

How can we model mutual and cyclic dependencies?

Example (economy):

◮ demand and supply determine the price

◮ high price fosters supply

◮ low price fosters demand


Page 4: Chapter 10: Random Fields

Motivation

Example (physics): modeling ferromagnetism in statistical mechanics

◮ a grid of magnetic dipoles in a volume

◮ every dipole exerts a force on its neighbors

◮ every dipole is subject to the forces of its neighbors

The dipoles might change their orientation. Every configuration of the magnetic dipole field can be characterized by its energy. The probability of a certain configuration depends on its energy: high-energy configurations are less probable, low-energy configurations are more probable. → Ising model (Ernst Ising, 1924)


Page 5: Chapter 10: Random Fields

Markov random fields

◮ a Markov random field (MRF) is an undirected, connected graph

◮ each node represents a random variable

• open circles indicate non-observed random variables

• filled circles indicate observed random variables

• dots indicate given constants

◮ links indicate an explicitly modeled stochastic dependence

[Figure: example MRF with four nodes A, B, C, D]


Page 6: Chapter 10: Random Fields

Markov random fields

The joint probability distribution of an MRF is defined over cliques in the graph

Definition: A clique of size k is a subset C of k nodes of the MRF such that every pair X, Y ∈ C with X ≠ Y is connected by an edge.

Example: The MRF on the right has

◮ one clique of size 3: {X2, X3, X4}

◮ four cliques of size 2: {X1, X2}, {X2, X3}, {X2, X4}, {X3, X4}

◮ four cliques of size 1: {X1}, {X2}, {X3}, {X4}

[Figure: MRF with nodes X1, X2, X3, X4 and edges {X1, X2}, {X2, X3}, {X2, X4}, {X3, X4}]
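
As a quick check of this definition, here is a minimal Python sketch (not part of the slides) that enumerates the cliques of the example graph above by brute force over all node subsets:

from itertools import combinations

# Example MRF from the figure above: four nodes and four edges
nodes = ["X1", "X2", "X3", "X4"]
edges = {("X1", "X2"), ("X2", "X3"), ("X2", "X4"), ("X3", "X4")}

def connected(a, b):
    return (a, b) in edges or (b, a) in edges

def is_clique(subset):
    # a subset is a clique if every pair of distinct nodes is connected by an edge
    return all(connected(a, b) for a, b in combinations(subset, 2))

for k in (1, 2, 3, 4):
    cliques = [set(c) for c in combinations(nodes, k) if is_clique(c)]
    print(f"cliques of size {k}:", cliques)
# prints exactly the cliques listed above: four of size 1, four of size 2,
# one of size 3, and none of size 4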


Page 7: Chapter 10: Random Fields

Markov random fields

For every clique C in the MRF we specify a potential function

ψC : C → R>0

◮ large values of ψC indicate that a certain configuration of the random variables in the clique is more probable

◮ small values of ψC indicate that a certain configuration of the random variables in the clique is less probable

The joint distribution of the MRF is defined as the product of the potential functions for all cliques

p(X_1, \dots, X_n) = \frac{1}{Z} \prod_{C \in \text{Cliques}} \psi_C(C)

with Z = \int \prod_{C \in \text{Cliques}} \psi_C(C) \, d(X_1, \dots, X_n) the partition function

Remark: calculating Z might be very hard in practice
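
To make the role of the partition function concrete, here is a small Python sketch (not from the slides; the potential values are made up) that computes Z for a toy MRF with two binary variables by exhaustive summation:

from itertools import product

# hypothetical potentials for a toy MRF with two binary variables X1, X2
psi_1 = {0: 1.0, 1: 2.0}                       # clique {X1}
psi_2 = {0: 1.5, 1: 0.5}                       # clique {X2}
psi_12 = {(0, 0): 3.0, (0, 1): 1.0,
          (1, 0): 1.0, (1, 1): 3.0}            # clique {X1, X2}, favors equal values

def unnormalized(x1, x2):
    # product of all clique potentials for one configuration
    return psi_1[x1] * psi_2[x2] * psi_12[(x1, x2)]

# partition function: for discrete variables the integral becomes a sum
Z = sum(unnormalized(x1, x2) for x1, x2 in product((0, 1), repeat=2))

for x1, x2 in product((0, 1), repeat=2):
    print(f"p(X1={x1}, X2={x2}) =", unnormalized(x1, x2) / Z)

The printed probabilities sum to 1; for the image-sized models later in this chapter, exactly this exhaustive summation becomes infeasible.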


Page 8: Chapter 10: Random Fields

Markov random fields

Potential functions are usually given in terms of Gibbs/Boltzmann distributions

\psi_C(C) = e^{-E_C(C)}

with E_C : C → R an “energy function”

◮ large energy means low probability

◮ small energy means large probability

Hence, the overall probability distribution of an MRF is

p(X_1, \dots, X_n) = \frac{1}{Z} e^{-\sum_{C \in \text{Cliques}} E_C(C)}


Page 9: Chapter 10: Random Fields

Markov random fields

Example: let us model the food preferences of a group of four persons: Antonia, Ben, Charles, and Damaris. They might choose between pasta, fish, and meat

◮ Ben likes meat and pasta but hates fish

◮ Antonia, Ben, and Charles prefer to choose the same

◮ Charles is vegetarian

◮ Damaris prefers to choose something different from all the others

→ create an MRF on the blackboard that models the food preferences of the four persons and assign potential functions to the cliques.


Page 10: Chapter 10: Random Fields

Markov random fields

One way to model the food preference task

Random variables A, B, C, D model Antonia’s, Ben’s, Charles’, and Damaris’ choices. Discrete variables with values 1=pasta, 2=fish, 3=meat

Energy functions which are relevant (all others are constant):

[Figure: MRF over A, B, C, D with edges A–B, A–C, B–C, A–D, B–D, C–D]

E_{\{B\}}(b) = \begin{cases} 0 & \text{if } b \in \{1, 3\} \\ 100 & \text{if } b = 2 \end{cases}

E_{\{A,B,C\}}(a, b, c) = \begin{cases} 0 & \text{if } a = b = c \\ 30 & \text{otherwise} \end{cases}

E_{\{C\}}(c) = \begin{cases} 0 & \text{if } c = 1 \\ 50 & \text{if } c = 2 \\ 200 & \text{if } c = 3 \end{cases}

E_{\{A,D\}}(a, d) = \begin{cases} 0 & \text{if } a \neq d \\ 10 & \text{if } a = d \end{cases}

E_{\{B,D\}}(b, d) = \begin{cases} 0 & \text{if } b \neq d \\ 10 & \text{if } b = d \end{cases}

E_{\{C,D\}}(c, d) = \begin{cases} 0 & \text{if } c \neq d \\ 10 & \text{if } c = d \end{cases}
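
To see how these energies translate into relative probabilities, here is a small Python sketch (not part of the slides; the encoding 1=pasta, 2=fish, 3=meat is taken from above). Because the energies enter through e^{-E}, a probability ratio between two configurations does not need the partition function:

import math

# total energy of a configuration (a, b, c, d) with 1=pasta, 2=fish, 3=meat,
# summing the relevant clique energies defined above
def total_energy(a, b, c, d):
    E = 0
    E += 0 if b in (1, 3) else 100      # E_{B}: Ben hates fish
    E += 0 if a == b == c else 30       # E_{A,B,C}: prefer the same meal
    E += {1: 0, 2: 50, 3: 200}[c]       # E_{C}: Charles is vegetarian
    E += 10 if a == d else 0            # E_{A,D}
    E += 10 if b == d else 0            # E_{B,D}
    E += 10 if c == d else 0            # E_{C,D}
    return E

e1 = total_energy(1, 1, 1, 1)           # everybody pasta             -> energy 30
e2 = total_energy(1, 1, 1, 3)           # as before, but Damaris meat -> energy 0
# since p is proportional to exp(-E), the ratio needs no partition function
print("p(all pasta) / p(pasta, pasta, pasta, meat) =", math.exp(-(e1 - e2)))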


Page 11: Chapter 10: Random Fields

Factor graphs

Like for Bayesian networks we can define factor graphs over MRFs.

A factor graph is a bipartite graph with two kinds of nodes:

◮ variable nodes that model random variables

◮ factor nodes that model a probabilistic relationship between variable nodes. Each factor node is assigned a potential function

Variable nodes and factor nodes are connected by undirected links. For each MRF we can create a factor graph as follows:

◮ the set of variable nodes is taken from the nodes of the MRF

◮ for each non-constant potential function ψC

• we create a new factor node f

• we connect f with all variable nodes in clique C

• we assign the potential function ψC to f

Hence, the joint probability of the MRF equals the product of all factor potentials, i.e. the Gibbs distribution over the sum of all factor energies


Page 12: Chapter 10: Random Fields

Factor graphs

The factor graph of the food preference task looks like this:

[Figure: factor graph with variable nodes A, B, C, D and factor nodes for E{B}, E{C}, E{A,B,C}, E{A,D}, E{B,D}, E{C,D}]


Page 13: Chapter 10: Random Fields

Stochastic inference in Markov random fields

How can we calculate p(U = u|O = o) and argmax_u p(U = u|O = o)?

◮ if the factor graph related to an MRF is a tree, we can use the sum-product and max-sum algorithms introduced in chapter 4.

◮ in the general case there are no efficient exact algorithms

◮ we can build variational approximations (chapter 6) for approximate inference

◮ we can use MCMC samplers (chapter 7) for numerical inference

◮ we can use local optimization (chapter 8)

Example: in the food preference task,

◮ what is the overall best choice of food?

◮ what is the best choice of food if Antonia eats fish?
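
For this small model both questions can be answered by brute force; the following Python sketch (not part of the lecture material) enumerates all 3^4 = 81 configurations, minimizes the total energy, and conditions on Antonia eating fish simply by clamping a = 2:

from itertools import product

FOOD = {1: "pasta", 2: "fish", 3: "meat"}

def total_energy(a, b, c, d):
    # clique energies of the food MRF (repeated here to keep the snippet self-contained)
    return ((0 if b in (1, 3) else 100)
            + (0 if a == b == c else 30)
            + {1: 0, 2: 50, 3: 200}[c]
            + (10 if a == d else 0)
            + (10 if b == d else 0)
            + (10 if c == d else 0))

def most_probable(clamp_a=None):
    # minimizing the energy maximizes the probability; clamping A conditions on it
    configs = [cfg for cfg in product((1, 2, 3), repeat=4)
               if clamp_a is None or cfg[0] == clamp_a]
    return min(configs, key=lambda cfg: total_energy(*cfg))

for label, cfg in (("overall best", most_probable()),
                   ("best given Antonia eats fish", most_probable(clamp_a=2))):
    print(label, "->", {p: FOOD[v] for p, v in zip("ABCD", cfg)},
          "energy:", total_energy(*cfg))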


Page 14: Chapter 10: Random Fields

Special types of MRFs

MRFs are very general and can be used for many purposes. Some models have been shown to be very useful. In this lecture, we introduce

◮ the Potts model. Useful for image segmentation and noise removal

◮ Conditional random fields. Useful for image segmentation

◮ the Boltzmann machine. Useful for unsupervised and supervised learning

◮ Markov logic networks. Useful for logic inference on noisy data (chapter 11)


Page 15: Chapter 10: Random Fields

Potts Model


Page 16: Chapter 10: Random Fields

Potts model

The Potts model can be used for segmentation and noise removal in images and other sensor data. We discuss it for the image segmentation case

Assume,

◮ an image is composed of several areas (e.g. foreground/background, object A/object B/background)

◮ each area has a characteristic color or gray value

◮ pixels in the image are corrupted by noise

◮ neighboring pixels are very likely to belong to the same area

How can we model these assumptions with an MRF?


Page 17: Chapter 10: Random Fields

Potts model

◮ every pixel belongs to a certain area. We model it with a discrete random variable Xi,j. The true class label is unobserved.

◮ the color/gray value of each pixel is described by a random variable Yi,j. The color value is observed.

◮ Xi,j and Yi,j are stochastically dependent. This dependency can be described by an energy function

◮ the class labels of neighboring pixels are stochastically dependent. This can be described by an energy function.

◮ we can provide priors for the class label as an energy function on individual Xi,j

[Figure: grid of unobserved label nodes Xi,j linked to their 4-neighbors, each connected to its observed color node Yi,j]


Page 18: Chapter 10: Random Fields

Potts model

energy functions on cliques:

◮ similarity of neighboring nodes

E_{\{X_{i,j}, X_{i+1,j}\}}(x_{i,j}, x_{i+1,j}) = \begin{cases} 0 & \text{if } x_{i,j} = x_{i+1,j} \\ 1 & \text{if } x_{i,j} \neq x_{i+1,j} \end{cases}

E_{\{X_{i,j}, X_{i,j+1}\}}(x_{i,j}, x_{i,j+1}) = \begin{cases} 0 & \text{if } x_{i,j} = x_{i,j+1} \\ 1 & \text{if } x_{i,j} \neq x_{i,j+1} \end{cases}

◮ dependency between observed color/gray value and class label. Assume each class k can be characterized by a typical color/gray value c_k

E_{\{X_{i,j}, Y_{i,j}\}}(x_{i,j}, y_{i,j}) = \| y_{i,j} - c_{x_{i,j}} \|

◮ overall preference for certain classes. Assume a prior distribution p over the classes:

E_{\{X_{i,j}\}}(x_{i,j}) = -\log p(x_{i,j})


Page 19: Chapter 10: Random Fields

Potts model

energy function for the whole Potts model:

E = \kappa \sum_{i,j} E_{\{X_{i,j}, Y_{i,j}\}}(x_{i,j}, y_{i,j}) + \lambda \sum_{i,j} E_{\{X_{i,j}, X_{i+1,j}\}}(x_{i,j}, x_{i+1,j}) + \lambda \sum_{i,j} E_{\{X_{i,j}, X_{i,j+1}\}}(x_{i,j}, x_{i,j+1}) + \mu \sum_{i,j} E_{\{X_{i,j}\}}(x_{i,j})

with weighting factors κ, λ, µ ≥ 0
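
A minimal Python/NumPy sketch of this overall energy (not from the slides; the label image x, observed image y, class colors and weights are placeholder assumptions) — it is exactly the objective that the optimization techniques on the following slide try to minimize:

import numpy as np

def potts_energy(x, y, class_colors, log_prior, kappa=1.0, lam=1.0, mu=1.0):
    # x: (H, W) integer class labels, y: (H, W, 3) observed colors,
    # class_colors: (K, 3) typical color c_k per class, log_prior: (K,) log p(class)
    # data term: distance between observed color and the class's typical color
    data = np.linalg.norm(y - class_colors[x], axis=-1).sum()
    # smoothness term: count label disagreements of vertical and horizontal neighbors
    smooth = (x[1:, :] != x[:-1, :]).sum() + (x[:, 1:] != x[:, :-1]).sum()
    # prior term: -log p(class) for every pixel
    prior = -log_prior[x].sum()
    return kappa * data + lam * smooth + mu * prior

# toy usage with a 2-class problem (values are made up)
rng = np.random.default_rng(0)
colors = np.array([[0.0, 0.0, 0.0], [1.0, 1.0, 1.0]])   # dark vs. bright segments
y = rng.random((8, 8, 3))
x = (y.mean(axis=-1) > 0.5).astype(int)                  # crude initial labeling
print(potts_energy(x, y, colors, np.log(np.array([0.5, 0.5]))))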


Page 20: Chapter 10: Random Fields

Potts model for image segmentation

Let us apply the Potts model to image segmentation as described before

Determining a segmentation is done by maximizing the conditional probability p(. . . , Xi,j, . . . | . . . , Yi,j, . . . ) where Yi,j are the color/gray values of a given picture. This is equivalent to minimizing the overall energy while keeping the Yi,j values fixed.

Solution techniques:

◮ finding an exact solution is NP-hard in general; in the two-class case it can be found in O(n^3) where n is the number of pixels (solution using graph cuts)

◮ local optimization

◮ MCMC sampling → Matlab-demo

Think about extensions of the Potts model that can cope with cases in which the reference colors of the segments are a priori vague or unknown → homework
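
To illustrate the local optimization route (a sketch, not the lecture's Matlab demo): iterated conditional modes (ICM) greedily re-labels one pixel at a time to the class with the locally smallest energy, using the same data and smoothness terms as above; all data below is made up and the prior term is omitted:

import numpy as np

def icm_segment(y, class_colors, kappa=1.0, lam=1.0, n_sweeps=5):
    # greedy local optimization (ICM) of the Potts energy
    H, W = y.shape[:2]
    K = len(class_colors)
    # unary energies kappa * ||y_ij - c_k|| for every pixel and class, shape (H, W, K)
    unary = kappa * np.linalg.norm(y[:, :, None, :] - class_colors[None, None], axis=-1)
    x = unary.argmin(axis=-1)                      # start from the pure data term
    for _ in range(n_sweeps):
        for i in range(H):
            for j in range(W):
                neighbors = [x[i2, j2] for i2, j2 in ((i-1, j), (i+1, j), (i, j-1), (i, j+1))
                             if 0 <= i2 < H and 0 <= j2 < W]
                # energy of assigning class k to pixel (i, j), neighbors held fixed
                costs = [unary[i, j, k] + lam * sum(k != m for m in neighbors)
                         for k in range(K)]
                x[i, j] = int(np.argmin(costs))
    return x

# toy usage: noisy two-color image (made-up data)
rng = np.random.default_rng(1)
truth = np.zeros((16, 16), dtype=int); truth[:, 8:] = 1
colors = np.array([[0.2, 0.2, 0.2], [0.8, 0.8, 0.8]])
y = colors[truth] + 0.15 * rng.standard_normal((16, 16, 3))
print((icm_segment(y, colors, lam=0.5) == truth).mean())  # fraction correctly labeled

ICM is one instance of the local optimization bullet above; it converges quickly but only to a local minimum of the energy.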


Page 21: Chapter 10: Random Fields

Conditional Random Fields


Page 22: Chapter 10: Random Fields

Segmentation with Potts model revisited

Using a Potts model for segmentation requires adequate energy functions E{Xi,j, Yi,j}

◮ easy for a color segmentation task with pre-specified segment colors

◮ possible for a color segmentation task with roughly pre-specified segment colors

◮ almost impossible for texture-based segmentation

Task: segment the picture into areas of road, buildings, vegetation, sky, cars.

Idea: combine random field based segmentation with traditional classifiers (e.g. neural networks, support vector machines, decision trees, etc.)

◮ apply classifier on small patches of the image

◮ use a random field to integrate neighborhood relationships


Page 23: Chapter 10: Random Fields

Combination of random fields and classifiers

A classifier is

◮ a mapping from a vector of observations (features) to class labels

◮ a mapping from a vector of observations (features) to class probabilities

With the second definition, the classifier provides a distribution p(X|Y) with X the class label and Y the observation vector. A classifier provides a distribution neither on Y nor on X alone.


Page 24: Chapter 10: Random Fields

Combination of random fields and classifiers

Let us try to build a Potts model integrating the classifiers to model p(X|Y)

◮ we can model the prior on the class labels as before using a potential function

◮ we can model the relationship between neighboring X nodes by a potential function as before

◮ we can model p(Xi,j|Yi,j) with the classifier

[Figure: grid of label nodes Xi,j with neighborhood links, each connected to its observed node Yi,j]

What does the joint distribution p({Xi,j, Yi,j}) over all (i, j) look like?

The joint distribution is not fully specified since we do not know p({Yi,j})


Page 25: Chapter 10: Random Fields

Conditional random fields

Conditional random fields (CRF) overcome the problem of missing p({Yi,j}) by modeling only p({Xi,j}|{Yi,j}). This is sufficient if we do not want to make inference on {Yi,j} but only on {Xi,j}

A conditional random field consists of

◮ a set of observed nodes O

◮ a set of unobserved random variables U

◮ edges between pairs of unobserved nodes

◮ edges between observed and unobserved nodes

Note that cliques in a conditional random field contain at most one observed node.

[Figure: example CRF with nodes A, B, C, D, E]


Page 26: Chapter 10: Random Fields

Conditional random fields

For every clique that contains at least one unobserved node we specify a potential function ψC : C → R>0

A CRF specifies the conditional distribution p(U|O) as

p(U \mid O) = \frac{1}{Z} \prod_{C \in \text{Cliques}} \psi_C(C)

[Figure: example CRF with nodes A, B, C, D, E]
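
A minimal sketch of the combination described before (assumptions: a tiny chain of three image patches with made-up classifier outputs): the classifier probabilities act as unary potentials, a pairwise potential prefers equal neighboring labels, and p(U|O) is obtained by brute-force normalization:

from itertools import product
import numpy as np

LABELS = ("building", "vegetation", "sky")

# hypothetical classifier outputs p(X_i | Y_i) for three neighboring image patches
clf = np.array([[0.6, 0.3, 0.1],
                [0.4, 0.4, 0.2],
                [0.2, 0.2, 0.6]])

def potential(labels):
    # unary potentials: the classifier probabilities for the observed patches
    psi = float(np.prod([clf[i, k] for i, k in enumerate(labels)]))
    # pairwise potentials: neighboring patches prefer to carry the same label
    for k1, k2 in zip(labels, labels[1:]):
        psi *= 2.0 if k1 == k2 else 1.0
    return psi

configs = list(product(range(len(LABELS)), repeat=3))
Z = sum(potential(c) for c in configs)                # normalizes p(U | O)
best = max(configs, key=potential)
print("MAP labeling:", [LABELS[k] for k in best], "p =", potential(best) / Z)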


Page 27: Chapter 10: Random Fields

Example: facade segmentation

Segmentation of pictures into categories building/car/door/pavement/road/sky/vegetation/window. Work of Michael Ying Yang

Approach: Hierarchical CRF combined with random decision forest.

Result: cf. Yang and Förstner, 2011


Page 28: Chapter 10: Random Fields

Boltzmann Machines


Page 29: Chapter 10: Random Fields

Boltzmann machines

Definition: A Boltzmann machine is a fully connected MRF with binary random variables. Its energy function is defined over 1-cliques and 2-cliques by:

E_X(x) = -\theta_X \cdot x

E_{X,Y}(x, y) = -w_{X,Y} \cdot x \cdot y

with θX, wX,Y non-negative real weight factors.

Hence, if we enumerate all random variables with X1, . . . , Xn

p(x_1, \dots, x_n) = \frac{1}{Z} e^{\sum_{i=1}^{n} \sum_{j=1}^{i-1} (w_{X_i,X_j} \cdot x_i \cdot x_j) + \sum_{i=1}^{n} (\theta_{X_i} \cdot x_i)}

Note, that wX,X = 0 and wX,Y = wY,X .
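
A small Python sketch (weights and biases are made up) that evaluates this unnormalized probability for a three-unit machine and normalizes it by enumerating all configurations:

from itertools import product
import numpy as np

# hypothetical Boltzmann machine with n = 3 binary units (symmetric W, zero diagonal)
W = np.array([[0.0, 1.2, 0.5],
              [1.2, 0.0, 0.8],
              [0.5, 0.8, 0.0]])
theta = np.array([0.1, 0.2, 0.3])

def unnormalized_p(x):
    x = np.asarray(x, dtype=float)
    # 0.5 * x W x equals the sum over i > j of w_ij x_i x_j because W is symmetric
    return np.exp(0.5 * x @ W @ x + theta @ x)

Z = sum(unnormalized_p(x) for x in product((0, 1), repeat=3))   # 2^3 configurations
print("p(x = (1, 1, 0)) =", unnormalized_p((1, 1, 0)) / Z)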


Page 30: Chapter 10: Random Fields

Boltzmann machines

What is a Boltzmann machine good for?

Two tasks:

◮ pattern classification

◮ denoising of patterns


Page 31: Chapter 10: Random Fields

Boltzmann machines for pattern classification

Goal: we assume some patterns (data) which belong to different categories. Applying a pattern to the Boltzmann machine, we want the Boltzmann machine to return the appropriate class label.

Structure of a Boltzmann machine for classification: there are three different types of nodes:

◮ observed nodes O. We apply a pattern to the observed nodes by setting their value to the respective value of the pattern and never change it afterwards

◮ label nodes L. These serve as output of the Boltzmann machine. We have one label node for each class. Finally, the label nodes indicate the class probabilities for each class

◮ hidden nodes H. These nodes are unobserved and used for stochastic inference on the pattern


Page 32: Chapter 10: Random Fields

Boltzmann machines for pattern classification

Process of class prediction:

1. we apply a pattern to the observed nodes, i.e. the value of the i-th observed node is set to the i-th value of the pattern. Afterwards, we do not change the observed nodes any more

2. we use Gibbs sampling to update the values of all hidden nodes H and label nodes L, i.e. we try to determine the most probable configurations of p(L,H|O). If we are only interested in the most probable configuration we might also use simulated annealing to find it.

3. after a while we interpret the label nodes. We might assume that the value of the i-th label node is proportional to the posterior probability of the i-th class


Page 33: Chapter 10: Random Fields

Gibbs sampling for Boltzmann machines

To implement Gibbs sampling we need to know p(X_i | X_1, \dots, X_{i-1}, X_{i+1}, \dots, X_n). W.l.o.g. we get

p(X_n \mid X_1, \dots, X_{n-1}) \propto p(X_n, X_1, \dots, X_{n-1})

\propto e^{\sum_{i=1}^{n} \sum_{j=1}^{i-1} (w_{X_i,X_j} \cdot x_i \cdot x_j) + \sum_{i=1}^{n} (\theta_{X_i} \cdot x_i)}

= e^{x_n \cdot \sum_{j=1}^{n-1} (w_{X_n,X_j} \cdot x_j) + \theta_{X_n} \cdot x_n + \sum_{i=1}^{n-1} \sum_{j=1}^{i-1} (w_{X_i,X_j} \cdot x_i \cdot x_j) + \sum_{i=1}^{n-1} (\theta_{X_i} \cdot x_i)}

= e^{x_n \cdot \sum_{j=1}^{n-1} (w_{X_n,X_j} \cdot x_j) + \theta_{X_n} \cdot x_n} \cdot e^{\sum_{i=1}^{n-1} \sum_{j=1}^{i-1} (w_{X_i,X_j} \cdot x_i \cdot x_j) + \sum_{i=1}^{n-1} (\theta_{X_i} \cdot x_i)}

\propto e^{x_n \cdot \sum_{j=1}^{n-1} (w_{X_n,X_j} \cdot x_j) + \theta_{X_n} \cdot x_n}

Hence,

p(X_n = 0 \mid X_1, \dots, X_{n-1}) = \frac{1}{Z} \cdot e^{0}

p(X_n = 1 \mid X_1, \dots, X_{n-1}) = \frac{1}{Z} \cdot e^{\sum_{j=1}^{n-1} (w_{X_n,X_j} \cdot x_j) + \theta_{X_n}}

From p(X_n = 0 \mid X_1, \dots, X_{n-1}) + p(X_n = 1 \mid X_1, \dots, X_{n-1}) = 1 follows

Z = 1 + e^{\sum_{j=1}^{n-1} (w_{X_n,X_j} \cdot x_j) + \theta_{X_n}}
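
Since e^0 = 1, the conditional above is the logistic (sigmoid) function of the unit's total input. The following Python sketch of one Gibbs sweep builds on that formula (weights and pattern are made up, not the lecture's implementation); clamped indices play the role of the observed nodes O:

import numpy as np

rng = np.random.default_rng(0)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gibbs_sweep(x, W, theta, clamped=()):
    # one Gibbs sampling sweep over all non-clamped units of a Boltzmann machine;
    # x: binary state vector (updated in place), W: symmetric weights with zero
    # diagonal, theta: biases, clamped: indices of observed units that stay fixed
    for i in range(len(x)):
        if i in clamped:
            continue
        # p(X_i = 1 | rest) = sigmoid(sum_j w_ij x_j + theta_i), cf. the derivation above
        x[i] = 1 if rng.random() < sigmoid(W[i] @ x + theta[i]) else 0
    return x

# toy usage: 5 units, units 0 and 1 clamped to an input pattern (made-up parameters)
n = 5
W = rng.uniform(0.0, 0.5, size=(n, n))
W = (W + W.T) / 2.0
np.fill_diagonal(W, 0.0)
theta = rng.uniform(0.0, 0.2, size=n)
x = np.array([1, 0, 1, 1, 0])
for _ in range(100):
    gibbs_sweep(x, W, theta, clamped=(0, 1))
print(x)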


Page 34: Chapter 10: Random Fields

Boltzmann machines for denoising

Goal: we assume that all patterns have a typical structure. Applying a pattern, we want the Boltzmann machine to return a typical pattern that is most similar to the pattern applied.

Structure of a Boltzmann machine for denoising: there are two different types of nodes:

◮ observed nodes O. We apply a pattern to the observed nodes by setting their value to the respective value of the pattern and never change it afterwards

◮ hidden nodes H. These nodes are unobserved and used for stochastic inference on the pattern


Page 35: Chapter 10: Random Fields

Boltzmann machines for denoising

Process of denoising:

1. we apply a pattern to the observed nodes, i.e. the value of the i-th observed node is set to the i-th value of the pattern.

2. we use Gibbs sampling (or simulated annealing) to update the values of all hidden nodes H and observed nodes O, i.e. we try to determine most probable configurations of p(H,O).

3. after a while we consider the values of the observed nodes as the pattern after denoising


Page 36: Chapter 10: Random Fields

Training of Boltzmann machines

For both tasks, we need to train a Boltzmann machine before we can use it, i.e. determine appropriate parameters wX,Y and θX

Assume we are given T training examples (patterns and labels for the classification task, only patterns for the denoising task). Now, we want to maximize the likelihood w.r.t. w_{X,Y} and \theta_X:

\prod_{t=1}^{T} p(O^{(t)}, L^{(t)} \mid \{w_{X,Y} \mid X, Y \in O \cup H \cup L\}, \{\theta_X \mid X \in O \cup H \cup L\})

→ gradient ascent (calculating the gradient is not trivial)
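
Not spelled out on the slide, but for orientation: the classical Boltzmann machine learning rule writes this gradient as a difference of correlations, one estimated with the training example clamped to the visible nodes and one with the machine running freely (both expectations are usually approximated by Gibbs sampling, which is the main reason training is slow):

\frac{\partial}{\partial w_{X,Y}} \sum_{t=1}^{T} \log p(O^{(t)}, L^{(t)}) = \sum_{t=1}^{T} \left( \mathbb{E}_{p(H \mid O^{(t)}, L^{(t)})}[x \cdot y] - \mathbb{E}_{p(O, H, L)}[x \cdot y] \right)

\frac{\partial}{\partial \theta_{X}} \sum_{t=1}^{T} \log p(O^{(t)}, L^{(t)}) = \sum_{t=1}^{T} \left( \mathbb{E}_{p(H \mid O^{(t)}, L^{(t)})}[x] - \mathbb{E}_{p(O, H, L)}[x] \right)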


Page 37: Chapter 10: Random Fields

Boltzmann machines

Some remarks on Boltzmann machines:

◮ training Boltzmann machines is very time-consuming

◮ however, there are more efficient variants (restricted Boltzmann machines, deep belief networks) which are the subject of recent research and which are better suited for pattern recognition and machine learning

◮ we do not want to discuss Boltzmann machines in depth in this lecture since they have been discussed in Prof. Sperschneider’s machine learning lecture already


Page 38: Chapter 10: Random Fields

Summary

◮ definition of Markov random fields

• joint probability distribution

• factor graph

◮ Potts model

• image segmentation example

◮ Conditional random fields

• image segmentation example of Michael Ying Yang

◮ Boltzmann machines
