
Page 1: Chapter 10: Random Fields

LEARNING AND INFERENCE IN GRAPHICAL MODELS

Chapter 10: Random Fields

Dr. Martin Lauer

University of Freiburg, Machine Learning Lab

Karlsruhe Institute of Technology, Institute of Measurement and Control Systems


Page 2: Chapter 10: Random Fields

References for this chapter

◮ Christopher M. Bishop, Pattern Recognition and Machine Learning, ch. 8, Springer, 2006

◮ Michael Ying Yang and Wolfgang Förstner, A hierarchical conditional random field model for labeling and classifying images of man-made scenes. In: IEEE International Conference on Computer Vision Workshops (ICCV Workshops), pp. 196-203, 2011


Page 3: Chapter 10: Random Fields

Motivation

Bayesian networks model clear dependencies, often causal dependencies. Bayesian networks are acyclic.

How can we model mutual and cyclic dependencies?

Example (economy):

◮ demand and supply determine the price

◮ high price fosters supply

◮ low price fosters demand


Page 4: Chapter 10: Random Fields

Motivation

Example (physics): modeling ferromagnetism in statistical mechanics

◮ a grid of magnetic dipoles in a volume

◮ every dipole exerts a force on its neighbors

◮ every dipole is subject to the forces of its neighbors

The dipoles might change their orientation. Every configuration of the magnetic dipole field can be characterized by its energy. The probability of a certain configuration depends on its energy: high-energy configurations are less probable, low-energy configurations are more probable. → Ising model (Ernst Ising, 1924)


Page 5: Chapter 10: Random Fields

Markov random fields

◮ a Markov random field (MRF) is an undirected, connected graph

◮ each node represents a random variable

• open circles indicate non-observed random variables

• filled circles indicate observed random variables

• dots indicate given constants

◮ links indicate an explicitly modeled stochastic dependence

[Figure: example MRF with four nodes A, B, C, D]


Page 6: Chapter 10: Random Fields

Markov random fields

The joint probability distribution of an MRF is defined over cliques in the graph

Definition: A clique of size k is a subset C of k nodes of the MRF such that every pair X, Y ∈ C with X ≠ Y is connected by an edge.

Example: The MRF on the right has

◮ one clique of size 3: {X2, X3, X4}

◮ four cliques of size 2: {X1, X2}, {X2, X3}, {X2, X4}, {X3, X4}

◮ four cliques of size 1: {X1}, {X2}, {X3}, {X4}

[Figure: MRF with nodes X1, X2, X3, X4 and edges {X1, X2}, {X2, X3}, {X2, X4}, {X3, X4}]
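
As a quick check of this definition, here is a minimal Python sketch (not part of the slides) that enumerates the cliques of the example graph above by brute force over all node subsets:

from itertools import combinations

# Example MRF from the figure above: four nodes and four edges
nodes = ["X1", "X2", "X3", "X4"]
edges = {("X1", "X2"), ("X2", "X3"), ("X2", "X4"), ("X3", "X4")}

def connected(a, b):
    return (a, b) in edges or (b, a) in edges

def is_clique(subset):
    # a subset is a clique if every pair of distinct nodes is connected by an edge
    return all(connected(a, b) for a, b in combinations(subset, 2))

for k in (1, 2, 3, 4):
    cliques = [set(c) for c in combinations(nodes, k) if is_clique(c)]
    print(f"cliques of size {k}:", cliques)
# prints exactly the cliques listed above: four of size 1, four of size 2,
# one of size 3, and none of size 4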


Page 7: Chapter 10: Random Fields

Markov random fields

For every clique C in the MRF we specify a potential function

ψC : C → R>0

◮ large values of ψC indicate that a certain configuration of the random variables in the clique is more probable

◮ small values of ψC indicate that a certain configuration of the random variables in the clique is less probable

The joint distribution of the MRF is defined as the product of the potential functions for all cliques

p(X_1, \dots, X_n) = \frac{1}{Z} \prod_{C \in \text{Cliques}} \psi_C(C)

with Z = \int \prod_{C \in \text{Cliques}} \psi_C(C) \, d(X_1, \dots, X_n) the partition function

Remark: calculating Z might be very hard in practice
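
To make the role of the partition function concrete, here is a small Python sketch (not from the slides; the potential values are made up) that computes Z for a toy MRF with two binary variables by exhaustive summation:

from itertools import product

# hypothetical potentials for a toy MRF with two binary variables X1, X2
psi_1 = {0: 1.0, 1: 2.0}                       # clique {X1}
psi_2 = {0: 1.5, 1: 0.5}                       # clique {X2}
psi_12 = {(0, 0): 3.0, (0, 1): 1.0,
          (1, 0): 1.0, (1, 1): 3.0}            # clique {X1, X2}, favors equal values

def unnormalized(x1, x2):
    # product of all clique potentials for one configuration
    return psi_1[x1] * psi_2[x2] * psi_12[(x1, x2)]

# partition function: for discrete variables the integral becomes a sum
Z = sum(unnormalized(x1, x2) for x1, x2 in product((0, 1), repeat=2))

for x1, x2 in product((0, 1), repeat=2):
    print(f"p(X1={x1}, X2={x2}) =", unnormalized(x1, x2) / Z)

The printed probabilities sum to 1; for the image-sized models later in this chapter, exactly this exhaustive summation becomes infeasible.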


Page 8: Chapter 10: Random Fields

Markov random fields

Potential functions are usually given in terms of Gibbs/Boltzmann distributions

\psi_C(C) = e^{-E_C(C)}

with E_C : C → R an “energy function”

◮ large energy means low probability

◮ small energy means large probability

Hence, the overall probability distribution of an MRF is

p(X_1, \dots, X_n) = \frac{1}{Z} e^{-\sum_{C \in \text{Cliques}} E_C(C)}


Page 9: Chapter 10: Random Fields

Markov random fields

Example: let us model the food preferences of a group of four persons: Antonia, Ben, Charles, and Damaris. They might choose between pasta, fish, and meat

◮ Ben likes meat and pasta but hates fish

◮ Antonia, Ben, and Charles prefer to choose the same

◮ Charles is vegetarian

◮ Damaris prefers to choose something different from all the others

→ create an MRF on the blackboard that models the food preferences of the four persons and assign potential functions to the cliques.


Page 10: Chapter 10: Random Fields

Markov random fields

One way to model the food preference task

Random variables A, B, C, D model Antonia’s, Ben’s, Charles’, and Damaris’ choices. Discrete variables with values 1=pasta, 2=fish, 3=meat

Energy functions which are relevant (all others are constant):

[Figure: MRF over A, B, C, D with edges A–B, A–C, B–C, A–D, B–D, C–D]

E_{\{B\}}(b) = \begin{cases} 0 & \text{if } b \in \{1, 3\} \\ 100 & \text{if } b = 2 \end{cases}

E_{\{A,B,C\}}(a, b, c) = \begin{cases} 0 & \text{if } a = b = c \\ 30 & \text{otherwise} \end{cases}

E_{\{C\}}(c) = \begin{cases} 0 & \text{if } c = 1 \\ 50 & \text{if } c = 2 \\ 200 & \text{if } c = 3 \end{cases}

E_{\{A,D\}}(a, d) = \begin{cases} 0 & \text{if } a \neq d \\ 10 & \text{if } a = d \end{cases}

E_{\{B,D\}}(b, d) = \begin{cases} 0 & \text{if } b \neq d \\ 10 & \text{if } b = d \end{cases}

E_{\{C,D\}}(c, d) = \begin{cases} 0 & \text{if } c \neq d \\ 10 & \text{if } c = d \end{cases}
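
To see how these energies translate into relative probabilities, here is a small Python sketch (not part of the slides; the encoding 1=pasta, 2=fish, 3=meat is taken from above). Because the energies enter through e^{-E}, a probability ratio between two configurations does not need the partition function:

import math

# total energy of a configuration (a, b, c, d) with 1=pasta, 2=fish, 3=meat,
# summing the relevant clique energies defined above
def total_energy(a, b, c, d):
    E = 0
    E += 0 if b in (1, 3) else 100      # E_{B}: Ben hates fish
    E += 0 if a == b == c else 30       # E_{A,B,C}: prefer the same meal
    E += {1: 0, 2: 50, 3: 200}[c]       # E_{C}: Charles is vegetarian
    E += 10 if a == d else 0            # E_{A,D}
    E += 10 if b == d else 0            # E_{B,D}
    E += 10 if c == d else 0            # E_{C,D}
    return E

e1 = total_energy(1, 1, 1, 1)           # everybody pasta             -> energy 30
e2 = total_energy(1, 1, 1, 3)           # as before, but Damaris meat -> energy 0
# since p is proportional to exp(-E), the ratio needs no partition function
print("p(all pasta) / p(pasta, pasta, pasta, meat) =", math.exp(-(e1 - e2)))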


Page 11: Chapter 10: Random Fields

Factor graphs

Like for Bayesian networks we can define factor graphs over MRFs.

A factor graph is a bipartite graph with two kinds of nodes:

◮ variable nodes that model random variables

◮ factor nodes that model a probabilistic relationship between variable nodes. Each factor node is assigned a potential function

Variable nodes and factor nodes are connected by undirected links. For each MRF we can create a factor graph as follows:

◮ the set of variable nodes is taken from the nodes of the MRF

◮ for each non-constant potential function ψC

• we create a new factor node f

• we connect f with all variable nodes in clique C

• we assign the potential function ψC to f

Hence, the joint probability of the MRF equals the product of all factor potentials, i.e. the Gibbs distribution over the sum of all factor energies


Page 12: Chapter 10: Random Fields

Factor graphs

The factor graph of the food preference task looks like this:

[Figure: factor graph with variable nodes A, B, C, D and factor nodes for E{B}, E{C}, E{A,B,C}, E{A,D}, E{B,D}, E{C,D}]


Page 13: Chapter 10: Random Fields

Stochastic inference in Markov random fields

How can we calculate p(U = u|O = o) and argmax_u p(U = u|O = o)?

◮ if the factor graph related to an MRF is a tree, we can use the sum-product and max-sum algorithms introduced in chapter 4.

◮ in the general case there are no efficient exact algorithms

◮ we can build variational approximations (chapter 6) for approximate inference

◮ we can use MCMC samplers (chapter 7) for numerical inference

◮ we can use local optimization (chapter 8)

Example: in the food preference task,

◮ what is the overall best choice of food?

◮ what is the best choice of food if Antonia eats fish?
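
For this small model both questions can be answered by brute force; the following Python sketch (not part of the lecture material) enumerates all 3^4 = 81 configurations, minimizes the total energy, and conditions on Antonia eating fish simply by clamping a = 2:

from itertools import product

FOOD = {1: "pasta", 2: "fish", 3: "meat"}

def total_energy(a, b, c, d):
    # clique energies of the food MRF (repeated here to keep the snippet self-contained)
    return ((0 if b in (1, 3) else 100)
            + (0 if a == b == c else 30)
            + {1: 0, 2: 50, 3: 200}[c]
            + (10 if a == d else 0)
            + (10 if b == d else 0)
            + (10 if c == d else 0))

def most_probable(clamp_a=None):
    # minimizing the energy maximizes the probability; clamping A conditions on it
    configs = [cfg for cfg in product((1, 2, 3), repeat=4)
               if clamp_a is None or cfg[0] == clamp_a]
    return min(configs, key=lambda cfg: total_energy(*cfg))

for label, cfg in (("overall best", most_probable()),
                   ("best given Antonia eats fish", most_probable(clamp_a=2))):
    print(label, "->", {p: FOOD[v] for p, v in zip("ABCD", cfg)},
          "energy:", total_energy(*cfg))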


Page 14: Chapter 10: Random Fields

Special types of MRFs

MRFs are very general and can be used for many purposes. Some models have been shown to be very useful. In this lecture, we introduce

◮ the Potts model. Useful for image segmentation and noise removal

◮ Conditional random fields. Useful for image segmentation

◮ the Boltzmann machine. Useful for unsupervised and supervised learning

◮ Markov logic networks. Useful for logic inference on noisy data (chapter 11)


Page 15: Chapter 10: Random Fields

Potts Model


Page 16: Chapter 10: Random Fields

Potts model

The Potts model can be used for segmentation and noise removal in images and other sensor data. We discuss it for the image segmentation case

Assume,

◮ an image is composed of several areas (e.g. foreground/background, object A/object B/background)

◮ each area has a characteristic color or gray value

◮ pixels in the image are corrupted by noise

◮ neighboring pixels are very likely to belong to the same area

How can we model these assumptions with an MRF?


Page 17: Chapter 10: Random Fields

Potts model

◮ every pixel belongs to a certain area. We model it with a discrete random variable Xi,j. The true class label is unobserved.

◮ the color/gray value of each pixel is described by a random variable Yi,j. The color value is observed.

◮ Xi,j and Yi,j are stochastically dependent. This dependency can be described by an energy function

◮ the class labels of neighboring pixels are stochastically dependent. This can be described by an energy function.

◮ we can provide priors for the class label as an energy function on individual Xi,j

[Figure: grid of unobserved label nodes Xi,j linked to their 4-neighbors, each connected to its observed color node Yi,j]


Page 18: Chapter 10: Random Fields

Potts model

energy functions on cliques:

◮ similarity of neighboring nodes

E_{\{X_{i,j}, X_{i+1,j}\}}(x_{i,j}, x_{i+1,j}) = \begin{cases} 0 & \text{if } x_{i,j} = x_{i+1,j} \\ 1 & \text{if } x_{i,j} \neq x_{i+1,j} \end{cases}

E_{\{X_{i,j}, X_{i,j+1}\}}(x_{i,j}, x_{i,j+1}) = \begin{cases} 0 & \text{if } x_{i,j} = x_{i,j+1} \\ 1 & \text{if } x_{i,j} \neq x_{i,j+1} \end{cases}

◮ dependency between observed color/gray value and class label. Assume each class k can be characterized by a typical color/gray value c_k

E_{\{X_{i,j}, Y_{i,j}\}}(x_{i,j}, y_{i,j}) = \| y_{i,j} - c_{x_{i,j}} \|

◮ overall preference for certain classes. Assume a prior distribution p over the classes:

E_{\{X_{i,j}\}}(x_{i,j}) = -\log p(x_{i,j})


Page 19: Chapter 10: Random Fields

Potts model

energy function for the whole Potts model:

E = \kappa \sum_{i,j} E_{\{X_{i,j}, Y_{i,j}\}}(x_{i,j}, y_{i,j}) + \lambda \sum_{i,j} E_{\{X_{i,j}, X_{i+1,j}\}}(x_{i,j}, x_{i+1,j}) + \lambda \sum_{i,j} E_{\{X_{i,j}, X_{i,j+1}\}}(x_{i,j}, x_{i,j+1}) + \mu \sum_{i,j} E_{\{X_{i,j}\}}(x_{i,j})

with weighting factors κ, λ, µ ≥ 0
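
A minimal Python/NumPy sketch of this overall energy (not from the slides; the label image x, observed image y, class colors and weights are placeholder assumptions) — it is exactly the objective that the optimization techniques on the following slide try to minimize:

import numpy as np

def potts_energy(x, y, class_colors, log_prior, kappa=1.0, lam=1.0, mu=1.0):
    # x: (H, W) integer class labels, y: (H, W, 3) observed colors,
    # class_colors: (K, 3) typical color c_k per class, log_prior: (K,) log p(class)
    # data term: distance between observed color and the class's typical color
    data = np.linalg.norm(y - class_colors[x], axis=-1).sum()
    # smoothness term: count label disagreements of vertical and horizontal neighbors
    smooth = (x[1:, :] != x[:-1, :]).sum() + (x[:, 1:] != x[:, :-1]).sum()
    # prior term: -log p(class) for every pixel
    prior = -log_prior[x].sum()
    return kappa * data + lam * smooth + mu * prior

# toy usage with a 2-class problem (values are made up)
rng = np.random.default_rng(0)
colors = np.array([[0.0, 0.0, 0.0], [1.0, 1.0, 1.0]])   # dark vs. bright segments
y = rng.random((8, 8, 3))
x = (y.mean(axis=-1) > 0.5).astype(int)                  # crude initial labeling
print(potts_energy(x, y, colors, np.log(np.array([0.5, 0.5]))))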


Page 20: Chapter 10: Random Fields

Potts model for image segmentation

Let us apply the Potts model to image segmentation as described before

Determining a segmentation is done by maximizing the conditional probability p(. . . , Xi,j, . . . | . . . , Yi,j, . . . ) where Yi,j are the color/gray values of a given picture. This is equivalent to minimizing the overall energy while keeping the Yi,j values fixed.

Solution techniques:

◮ finding an exact solution is NP-hard in general; in the two-class case it can be found in O(n^3) where n is the number of pixels (solution using graph cuts)

◮ local optimization

◮ MCMC sampling → Matlab-demo

Think about extensions of the Potts model that can cope with cases in which the reference colors of the segments are a priori vague or unknown → homework
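
To illustrate the local optimization route (a sketch, not the lecture's Matlab demo): iterated conditional modes (ICM) greedily re-labels one pixel at a time to the class with the locally smallest energy, using the same data and smoothness terms as above; all data below is made up and the prior term is omitted:

import numpy as np

def icm_segment(y, class_colors, kappa=1.0, lam=1.0, n_sweeps=5):
    # greedy local optimization (ICM) of the Potts energy
    H, W = y.shape[:2]
    K = len(class_colors)
    # unary energies kappa * ||y_ij - c_k|| for every pixel and class, shape (H, W, K)
    unary = kappa * np.linalg.norm(y[:, :, None, :] - class_colors[None, None], axis=-1)
    x = unary.argmin(axis=-1)                      # start from the pure data term
    for _ in range(n_sweeps):
        for i in range(H):
            for j in range(W):
                neighbors = [x[i2, j2] for i2, j2 in ((i-1, j), (i+1, j), (i, j-1), (i, j+1))
                             if 0 <= i2 < H and 0 <= j2 < W]
                # energy of assigning class k to pixel (i, j), neighbors held fixed
                costs = [unary[i, j, k] + lam * sum(k != m for m in neighbors)
                         for k in range(K)]
                x[i, j] = int(np.argmin(costs))
    return x

# toy usage: noisy two-color image (made-up data)
rng = np.random.default_rng(1)
truth = np.zeros((16, 16), dtype=int); truth[:, 8:] = 1
colors = np.array([[0.2, 0.2, 0.2], [0.8, 0.8, 0.8]])
y = colors[truth] + 0.15 * rng.standard_normal((16, 16, 3))
print((icm_segment(y, colors, lam=0.5) == truth).mean())  # fraction correctly labeled

ICM is one instance of the local optimization bullet above; it converges quickly but only to a local minimum of the energy.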


Page 21: Chapter 10: Random Fields

Conditional Random Fields


Page 22: Chapter 10: Random Fields

Segmentation with Potts model revisited

Using a Potts model for segmentation requires adequate energy functions E{Xi,j, Yi,j}

◮ easy for a color segmentation task with pre-specified segment colors

◮ possible for a color segmentation task with roughly pre-specified segment colors

◮ almost impossible for texture-based segmentation

Task: segment the picture into areas of road, buildings, vegetation, sky, cars.

Idea: combine random field based segmentation with traditional classifiers (e.g. neural networks, support vector machines, decision trees, etc.)

◮ apply classifier on small patches of the image

◮ use a random field to integrate neighborhood relationships


Page 23: Chapter 10: Random Fields

Combination of random fields and classifiers

A classifier is

◮ a mapping from a vector of observations (features) to class labels

◮ a mapping from a vector of observations (features) to class probabilities

With the second definition, the classifier provides a distribution p(X|Y) with X the class label and Y the observation vector. A classifier provides a distribution neither on Y nor on X alone.


Page 24: Chapter 10: Random Fields

Combination of random fields and classifiers

Let us try to build a Potts model integrating the classifiers to model p(X|Y)

◮ we can model the prior on the class labels as before using a potential function

◮ we can model the relationship between neighboring X nodes by a potential function as before

◮ we can model p(Xi,j|Yi,j) with the classifier

[Figure: grid of label nodes Xi,j with neighborhood links, each connected to its observed node Yi,j]

What does the joint distribution p({Xi,j, Yi,j}) over all (i, j) look like?

The joint distribution is not fully specified since we do not know p({Yi,j})


Page 25: Chapter 10: Random Fields

Conditional random fields

Conditional random fields (CRF) overcome the problem of missing p({Yi,j}) by modeling only p({Xi,j}|{Yi,j}). This is sufficient if we do not want to make inference on {Yi,j} but only on {Xi,j}

A conditional random field consists of

◮ a set of observed nodes O

◮ a set of unobserved random variables U

◮ edges between pairs of unobserved nodes

◮ edges between observed and unobserved nodes

Note that cliques in a conditional random field contain at most one observed node.

[Figure: example CRF with nodes A, B, C, D, E]


Page 26: Chapter 10: Random Fields

Conditional random fields

For every clique that contains at least one unobserved node we specify a potential function ψC : C → R>0

A CRF specifies the conditional distribution p(U|O) as

p(U \mid O) = \frac{1}{Z} \prod_{C \in \text{Cliques}} \psi_C(C)

[Figure: example CRF with nodes A, B, C, D, E]
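
A minimal sketch of the combination described before (assumptions: a tiny chain of three image patches with made-up classifier outputs): the classifier probabilities act as unary potentials, a pairwise potential prefers equal neighboring labels, and p(U|O) is obtained by brute-force normalization:

from itertools import product
import numpy as np

LABELS = ("building", "vegetation", "sky")

# hypothetical classifier outputs p(X_i | Y_i) for three neighboring image patches
clf = np.array([[0.6, 0.3, 0.1],
                [0.4, 0.4, 0.2],
                [0.2, 0.2, 0.6]])

def potential(labels):
    # unary potentials: the classifier probabilities for the observed patches
    psi = float(np.prod([clf[i, k] for i, k in enumerate(labels)]))
    # pairwise potentials: neighboring patches prefer to carry the same label
    for k1, k2 in zip(labels, labels[1:]):
        psi *= 2.0 if k1 == k2 else 1.0
    return psi

configs = list(product(range(len(LABELS)), repeat=3))
Z = sum(potential(c) for c in configs)                # normalizes p(U | O)
best = max(configs, key=potential)
print("MAP labeling:", [LABELS[k] for k in best], "p =", potential(best) / Z)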


Page 27: Chapter 10: Random Fields

Example: facade segmentation

Segmentation of pictures into categories building/car/door/pavement/road/sky/vegetation/window. Work of Michael Ying Yang

Approach: Hierarchical CRF combined with random decision forest.

Result: cf. Yang and Förstner, 2011


Page 28: Chapter 10: Random Fields

Boltzmann Machines


Page 29: Chapter 10: Random Fields

Boltzmann machines

Definition: A Boltzmann machine is a fully connected MRF with binary random variables. Its energy function is defined over 1-cliques and 2-cliques by:

E_X(x) = -\theta_X \cdot x

E_{X,Y}(x, y) = -w_{X,Y} \cdot x \cdot y

with θX, wX,Y non-negative real weight factors.

Hence, if we enumerate all random variables with X1, . . . , Xn

p(x_1, \dots, x_n) = \frac{1}{Z} e^{\sum_{i=1}^{n} \sum_{j=1}^{i-1} (w_{X_i,X_j} \cdot x_i \cdot x_j) + \sum_{i=1}^{n} (\theta_{X_i} \cdot x_i)}

Note, that wX,X = 0 and wX,Y = wY,X .
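
A small Python sketch (weights and biases are made up) that evaluates this unnormalized probability for a three-unit machine and normalizes it by enumerating all configurations:

from itertools import product
import numpy as np

# hypothetical Boltzmann machine with n = 3 binary units (symmetric W, zero diagonal)
W = np.array([[0.0, 1.2, 0.5],
              [1.2, 0.0, 0.8],
              [0.5, 0.8, 0.0]])
theta = np.array([0.1, 0.2, 0.3])

def unnormalized_p(x):
    x = np.asarray(x, dtype=float)
    # 0.5 * x W x equals the sum over i > j of w_ij x_i x_j because W is symmetric
    return np.exp(0.5 * x @ W @ x + theta @ x)

Z = sum(unnormalized_p(x) for x in product((0, 1), repeat=3))   # 2^3 configurations
print("p(x = (1, 1, 0)) =", unnormalized_p((1, 1, 0)) / Z)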


Page 30: Chapter 10: Random Fields

Boltzmann machines

What is a Boltzmann machine good for?

Two tasks:

◮ pattern classification

◮ denoising of patterns


Page 31: Chapter 10: Random Fields

Boltzmann machines for pattern classification

Goal: we assume some patterns (data) which belong to different categories. Applying a pattern to the Boltzmann machine, we want the Boltzmann machine to return the appropriate class label.

Structure of a Boltzmann machine for classification: there are three different types of nodes:

◮ observed nodes O. We apply a pattern to the observed nodes by setting their value to the respective value of the pattern and never change it afterwards

◮ label nodes L. These serve as output of the Boltzmann machine. We have one label node for each class. Finally, the label nodes indicate the class probabilities for each class

◮ hidden nodes H. These nodes are unobserved and used for stochastic inference on the pattern


Page 32: Chapter 10: Random Fields

Boltzmann machines for pattern classification

Process of class prediction:

1. we apply a pattern to the observed nodes, i.e. the value of the i-th observed node is set to the i-th value of the pattern. Afterwards, we do not change the observed nodes any more

2. we use Gibbs sampling to update the values of all hidden nodes H and label nodes L, i.e. we try to determine the most probable configurations of p(L,H|O). If we are only interested in the most probable configuration we might also use simulated annealing to find it.

3. after a while we interpret the label nodes. We might assume that the value of the i-th label node is proportional to the posterior probability of the i-th class


Page 33: Chapter 10: Random Fields

Gibbs sampling for Boltzmann machines

To implement Gibbs sampling we need to know p(X_i | X_1, \dots, X_{i-1}, X_{i+1}, \dots, X_n). W.l.o.g. we get

p(X_n \mid X_1, \dots, X_{n-1}) \propto p(X_n, X_1, \dots, X_{n-1})

\propto e^{\sum_{i=1}^{n} \sum_{j=1}^{i-1} (w_{X_i,X_j} \cdot x_i \cdot x_j) + \sum_{i=1}^{n} (\theta_{X_i} \cdot x_i)}

= e^{x_n \cdot \sum_{j=1}^{n-1} (w_{X_n,X_j} \cdot x_j) + \theta_{X_n} \cdot x_n + \sum_{i=1}^{n-1} \sum_{j=1}^{i-1} (w_{X_i,X_j} \cdot x_i \cdot x_j) + \sum_{i=1}^{n-1} (\theta_{X_i} \cdot x_i)}

= e^{x_n \cdot \sum_{j=1}^{n-1} (w_{X_n,X_j} \cdot x_j) + \theta_{X_n} \cdot x_n} \cdot e^{\sum_{i=1}^{n-1} \sum_{j=1}^{i-1} (w_{X_i,X_j} \cdot x_i \cdot x_j) + \sum_{i=1}^{n-1} (\theta_{X_i} \cdot x_i)}

\propto e^{x_n \cdot \sum_{j=1}^{n-1} (w_{X_n,X_j} \cdot x_j) + \theta_{X_n} \cdot x_n}

Hence,

p(X_n = 0 \mid X_1, \dots, X_{n-1}) = \frac{1}{Z} \cdot e^{0}

p(X_n = 1 \mid X_1, \dots, X_{n-1}) = \frac{1}{Z} \cdot e^{\sum_{j=1}^{n-1} (w_{X_n,X_j} \cdot x_j) + \theta_{X_n}}

From p(X_n = 0 \mid X_1, \dots, X_{n-1}) + p(X_n = 1 \mid X_1, \dots, X_{n-1}) = 1 follows

Z = 1 + e^{\sum_{j=1}^{n-1} (w_{X_n,X_j} \cdot x_j) + \theta_{X_n}}
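
Since e^0 = 1, the conditional above is the logistic (sigmoid) function of the unit's total input. The following Python sketch of one Gibbs sweep builds on that formula (weights and pattern are made up, not the lecture's implementation); clamped indices play the role of the observed nodes O:

import numpy as np

rng = np.random.default_rng(0)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gibbs_sweep(x, W, theta, clamped=()):
    # one Gibbs sampling sweep over all non-clamped units of a Boltzmann machine;
    # x: binary state vector (updated in place), W: symmetric weights with zero
    # diagonal, theta: biases, clamped: indices of observed units that stay fixed
    for i in range(len(x)):
        if i in clamped:
            continue
        # p(X_i = 1 | rest) = sigmoid(sum_j w_ij x_j + theta_i), cf. the derivation above
        x[i] = 1 if rng.random() < sigmoid(W[i] @ x + theta[i]) else 0
    return x

# toy usage: 5 units, units 0 and 1 clamped to an input pattern (made-up parameters)
n = 5
W = rng.uniform(0.0, 0.5, size=(n, n))
W = (W + W.T) / 2.0
np.fill_diagonal(W, 0.0)
theta = rng.uniform(0.0, 0.2, size=n)
x = np.array([1, 0, 1, 1, 0])
for _ in range(100):
    gibbs_sweep(x, W, theta, clamped=(0, 1))
print(x)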


Page 34: Chapter 10: Random Fields

Boltzmann machines for denoising

Goal: we assume that all patterns have a typical structure. Applying a pattern, we want the Boltzmann machine to return a typical pattern that is most similar to the pattern applied.

Structure of a Boltzmann machine for denoising: there are two different types of nodes:

◮ observed nodes O. We apply a pattern to the observed nodes by setting their value to the respective value of the pattern and never change it afterwards

◮ hidden nodes H. These nodes are unobserved and used for stochastic inference on the pattern


Page 35: Chapter 10: Random Fields

Boltzmann machines for denoising

Process of denoising:

1. we apply a pattern to the observed nodes, i.e. the value of the i-th observed node is set to the i-th value of the pattern.

2. we use Gibbs sampling (or simulated annealing) to update the values of all hidden nodes H and observed nodes O, i.e. we try to determine most probable configurations of p(H,O).

3. after a while we consider the values of the observed nodes as the pattern after denoising


Page 36: Chapter 10: Random Fields

Training of Boltzmann machines

For both tasks, we need to train a Boltzmann machine before we can use it, i.e. determine appropriate parameters wX,Y and θX

Assume we are given T training examples (patterns and labels for the classification task, only patterns for the denoising task). Now, we want to maximize the likelihood w.r.t. w_{X,Y} and \theta_X:

\prod_{t=1}^{T} p(O^{(t)}, L^{(t)} \mid \{w_{X,Y} \mid X, Y \in O \cup H \cup L\}, \{\theta_X \mid X \in O \cup H \cup L\})

→ gradient ascent (calculating the gradient is not trivial)
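
Not spelled out on the slide, but for orientation: the classical Boltzmann machine learning rule writes this gradient as a difference of correlations, one estimated with the training example clamped to the visible nodes and one with the machine running freely (both expectations are usually approximated by Gibbs sampling, which is the main reason training is slow):

\frac{\partial}{\partial w_{X,Y}} \sum_{t=1}^{T} \log p(O^{(t)}, L^{(t)}) = \sum_{t=1}^{T} \left( \mathbb{E}_{p(H \mid O^{(t)}, L^{(t)})}[x \cdot y] - \mathbb{E}_{p(O, H, L)}[x \cdot y] \right)

\frac{\partial}{\partial \theta_{X}} \sum_{t=1}^{T} \log p(O^{(t)}, L^{(t)}) = \sum_{t=1}^{T} \left( \mathbb{E}_{p(H \mid O^{(t)}, L^{(t)})}[x] - \mathbb{E}_{p(O, H, L)}[x] \right)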


Page 37: Chapter 10: Random Fields

Boltzmann machines

Some remarks on Boltzmann machines:

◮ training Boltzmann machines is very time-consuming

◮ however, there are more efficient variants (restricted Boltzmann machines, deep belief networks) which are the subject of recent research and which are better suited for pattern recognition and machine learning

◮ we do not want to discuss Boltzmann machines in depth in this lecture since they have been discussed in Prof. Sperschneider’s machine learning lecture already


Page 38: Chapter 10: Random Fields

Summary

◮ definition of Markov random fields

• joint probability distribution

• factor graph

◮ Potts model

• image segmentation example

◮ Conditional random fields

• image segmentation example of Michael Ying Yang

◮ Boltzmann machines
