Crash Course on Machine Learning, Part V
Several slides from Derek Hoiem, Ben Taskar, and Andreas Krause
Structured Prediction
• Use local information
• Exploit correlations
[Figure: handwriting recognition example, the letters of the word "brace"]
Min-max Formulation
• Before: exponentially many constraints/variables
• After: apply LP duality to the inference problem
• By QP duality, the dual inherits structure from the problem-specific inference LP
  – Variables correspond to a decomposition of the variables of the flat case
The Connection
[Figure: per-letter and pairwise marginals over candidate letter sequences (bcare, brore, broce, brace)]
Duals and Kernels
The kernel trick works in the factored dual: local functions (log-potentials) can use kernels.
3D Mapping
Sensors: laser range finder, GPS, IMU
Data provided by: Michael Montemerlo & Sebastian Thrun
Labels: ground, building, tree, shrub
Training: 30 thousand points
Testing: 3 million points
Alternatives: Perceptron
• Simple iterative method
• Unstable for structured output: fewer instances, big updates
  – May not converge if non-separable
  – Noisy
• Voted / averaged perceptron [Freund & Schapire 99, Collins 02]
  – Regularize / reduce variance by aggregating over iterations (a sketch follows below)
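A minimal sketch (not from the slides) of the averaged structured perceptron in Python; the feature map features(x, y) and the inference routine argmax_y(w, x) are assumed to be supplied by the caller.

import numpy as np

def averaged_structured_perceptron(data, features, argmax_y, dim, epochs=5):
    # data: list of (x, y_true) pairs; returns the averaged weight vector
    w = np.zeros(dim)        # current weights
    w_sum = np.zeros(dim)    # running sum of weights, used for averaging
    steps = 0
    for _ in range(epochs):
        for x, y_true in data:
            y_hat = argmax_y(w, x)                               # inference: current best structure
            if y_hat != y_true:
                w += features(x, y_true) - features(x, y_hat)    # standard perceptron update
            w_sum += w                                           # aggregate over iterations
            steps += 1
    return w_sum / max(steps, 1)                                 # averaging reduces variance (Collins 02)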
Alternatives: Constraint Generation [Collins 02; Altun et al, 03]
• Add most violated constraint
• Handles several more general loss functions
• Need to re-solve QP many times
• Theorem: only a polynomial number of constraints is needed to achieve ε-error [Tsochantaridis et al, 04]
• Worst-case number of constraints larger than factored
Integration
• Feature Passing
• Margin Based
  – Max-margin Structure Learning
• Probabilistic
  – Graphical Models
Graphical Models
• Joint distribution
  – Factoring using independent variables
• Representation
• Inference
• Learning
Big Picture
• Two problems with using full joint distribution tables as our probabilistic models:
  – Unless there are only a few variables, the joint is WAY too big to represent explicitly
  – Hard to learn (estimate) anything empirically about more than a few variables at a time
• Describe complex joint distributions (models) using simple, local distributions
  – We describe how variables locally interact
  – Local interactions chain together to give global, indirect interactions
Joint Distribution
• For n variables with domain sizes d
  – joint distribution table with d^n - 1 free parameters
• Size of representation if we use the chain rule
  – Concretely, counting the number of free parameters and accounting for the fact that probabilities sum to one:
    (d-1) + d(d-1) + d^2(d-1) + ... + d^(n-1)(d-1) = ((d^n - 1)/(d-1)) (d-1) = d^n - 1
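A quick numeric check (illustrative) that the chain-rule count telescopes to d^n - 1:

def chain_rule_params(d, n):
    # k-th variable: d**k parent configurations, each needing d-1 free probabilities
    return sum(d**k * (d - 1) for k in range(n))

d, n = 3, 5
assert chain_rule_params(d, n) == d**n - 1   # 242 free parameters either way
print(chain_rule_params(d, n))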
Conditional Independence
• Two variables are conditionally independent given a third: P(X, Y | Z) = P(X | Z) P(Y | Z), equivalently P(X | Y, Z) = P(X | Z)
• What about this domain?
  – Traffic
  – Umbrella
  – Raining
Representation
Explicitly model uncertainty and dependency structure
[Figure: the same variables a, b, c, d represented as a directed graph, an undirected graph, and a factor graph]
Key concept: Markov blanket
Bayes Net: Notation
• Nodes: variables
  – Can be assigned (observed) or unassigned (unobserved)
• Arcs: interactions
  – Indicate "direct influence" between variables
  – Formally: encode conditional independence
[Figure: example network over Weather, Cavity, Toothache, Catch]
Example: Coin Flips
• N independent coin flips
• No interactions between variables
  – Absolute independence
X1   X2   ...   Xn
Example: Traffic
• Variables:
  – Traffic
  – Rain
• Model 1: absolute independence
• Model 2: rain causes traffic (Rain → Traffic)
• Which makes more sense?
Semantics
• A set of nodes, one per variable X
• A directed, acyclic graph
• A conditional distribution for each node
  – A collection of distributions over X, one for each combination of parents' values
  – Conditional Probability Table (CPT)
[Figure: node X with parents A1, A2, ..., An]
A Bayes net = Topology (graph) + Local Conditional Probabilities
Example: Alarm
• Variables:
  – Alarm
  – Burglary
  – Earthquake
  – Radio
  – Calls John
[Figure: network over Earthquake, Radio, Burglary, Alarm, Call]
Example: Alarm
[Figure: the alarm network annotated with its CPTs P(E), P(B), P(R|E), P(A|E,B), P(C|A)]
P(E,B,R,A,C) = P(E) P(B) P(R|E) P(A|B,E) P(C|A)
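A small sketch of how the factored joint is evaluated; the CPT numbers below are placeholders, not from the slides.

P_E = 0.002                                               # P(Earthquake = true), made up
P_B = 0.001                                               # P(Burglary = true), made up
P_R_given_E = {True: 0.9, False: 0.01}                    # P(Radio = true | E)
P_A_given_BE = {(True, True): 0.95, (True, False): 0.94,  # P(Alarm = true | B, E)
                (False, True): 0.30, (False, False): 0.001}
P_C_given_A = {True: 0.9, False: 0.05}                    # P(Call = true | A)

def bern(p_true, value):
    # probability that a boolean variable takes `value`, given P(X = true)
    return p_true if value else 1.0 - p_true

def joint(e, b, r, a, c):
    # P(E,B,R,A,C) = P(E) P(B) P(R|E) P(A|B,E) P(C|A)
    return (bern(P_E, e) * bern(P_B, b) * bern(P_R_given_E[e], r) *
            bern(P_A_given_BE[(b, e)], a) * bern(P_C_given_A[a], c))

print(joint(e=False, b=True, r=False, a=True, c=True))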
Bayes Net Size
• How big is a joint distribution over n Boolean variables?  2^n
• How big is a CPT with k parents?  2^(k+1)
• How big is a BN with n nodes if nodes have up to k parents?  n · 2^(k+1)
• BNs:
  – Compact representation
  – Use local properties to define CPTs
  – Answer queries more easily
Independence in BN
• BNs present a compact representation for joint distributions
  – Take advantage of conditional independence
• Given a BN, let's answer independence questions:
  – Are two nodes independent given certain evidence?
Causal Chains
X → Y → Z
• X: low pressure
• Y: Rain
• Z: Traffic
• What can we say about X, Z?
• Question: Is Z independent of X given Y?
Common Cause
• Are X, Z independent?
X ← Y → Z
• Y: low pressure
• X: Rain
• Z: Cold
• Are X, Z independent given Y?
• Observing Y blocks the influence between X and Z
Common Effect
• Are X, Z independent?
X → Y ← Z
• X: Rain
• Y: Traffic
• Z: Ball Game
• Are X, Z independent given Y?
• Observing Y activates influence between X and Z
Independence in BNs
• Any complex BN structure can be analyzed using these three cases (a code sketch follows below)
[Figure: the alarm network (Earthquake, Radio, Burglary, Alarm, Call) as a running example]
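The three cases generalize to a d-separation (reachability) test; the sketch below is illustrative, with the graph given as a node-to-parents map.

from collections import defaultdict

def independent(parents, x, y, evidence):
    # True iff x and y are d-separated given the evidence set
    children = defaultdict(set)
    for node, ps in parents.items():
        for p in ps:
            children[p].add(node)
    # evidence nodes plus all of their ancestors (these activate colliders)
    anc, frontier = set(), set(evidence)
    while frontier:
        v = frontier.pop()
        if v not in anc:
            anc.add(v)
            frontier |= set(parents.get(v, ()))
    # search over (node, how-we-arrived) pairs; 'child' means we arrived from a child
    visited, stack = set(), [(x, 'child')]
    while stack:
        node, came_from = stack.pop()
        if (node, came_from) in visited:
            continue
        visited.add((node, came_from))
        if node not in evidence and node == y:
            return False                                              # found an active path to y
        if came_from == 'child' and node not in evidence:
            stack += [(p, 'child') for p in parents.get(node, ())]    # causal chain, upward
            stack += [(c, 'parent') for c in children[node]]          # common cause
        elif came_from == 'parent':
            if node not in evidence:
                stack += [(c, 'parent') for c in children[node]]      # causal chain, downward
            if node in anc:
                stack += [(p, 'child') for p in parents.get(node, ())]  # active common effect
    return True

alarm = {'B': set(), 'E': set(), 'R': {'E'}, 'A': {'B', 'E'}, 'C': {'A'}}
print(independent(alarm, 'B', 'E', set()))    # True: the collider at Alarm blocks the path
print(independent(alarm, 'B', 'E', {'A'}))    # False: observing Alarm activates it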
Directed acyclic graph (Bayes net)
[Figure: graph with a → b, b → c, b → d]
P(a,b,c,d) = P(c|b) P(d|b) P(b|a) P(a)
• Can model causality
• Parameter learning
  – Decomposes: learn each term separately (ML)
• Inference
  – Simple exact inference if tree-shaped (belief propagation)
Directed acyclic graph (Bayes net)
[Figure: graph with a → b, b → c, and both a and b pointing to d (a loop)]
P(a,b,c,d) = P(c|b) P(d|a,b) P(b|a) P(a)
• Can model causality
• Parameter learning
  – Decomposes: learn each term separately (ML)
• Inference
  – Simple exact inference if tree-shaped (belief propagation)
  – Loops require approximation
    • Loopy BP
    • Tree-reweighted BP
    • Sampling
• Example: Places and scenes (directed graph)
  – Place: office, kitchen, street, etc.
  – Objects present: Car, Person, Toaster, Microwave, Fire Hydrant
P(place, car, person, toaster, micro, hydrant) = P(place) P(car | place) P(person | place) … P(hydrant | place)
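A short sketch of using this factorization for recognition: with made-up numbers (placeholders), the posterior over the place given which objects are present is just a product per place.

prior = {'office': 0.4, 'kitchen': 0.3, 'street': 0.3}         # P(place), made up
p_obj = {'toaster': {'office': 0.05, 'kitchen': 0.60, 'street': 0.01},
         'car':     {'office': 0.02, 'kitchen': 0.01, 'street': 0.70},
         'person':  {'office': 0.80, 'kitchen': 0.70, 'street': 0.90}}

observed = {'toaster': True, 'car': False, 'person': True}      # which objects we saw

scores = {}
for place, p in prior.items():
    for obj, present in observed.items():
        q = p_obj[obj][place]
        p *= q if present else (1 - q)                          # each P(object | place) factor
    scores[place] = p
Z = sum(scores.values())
print({place: s / Z for place, s in scores.items()})            # kitchen should dominate here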
Undirected graph (Markov Networks)
• Does not model causality
• Often pairwise
• Parameter learning difficult
• Inference usually approximate

[Figure: pairwise network over x1, x2, x3, x4]

P(x1, x2, x3, x4 ; data) = (1/Z) ∏_{i=1..4} φ(x_i ; data) ∏_{(i,j) ∈ edges} ψ(x_i, x_j ; data)
Markov Networks
• Example: "label smoothing" grid
• Binary nodes
• Pairwise potential:

            x_j = 0   x_j = 1
  x_i = 0      0         K
  x_i = 1      K         0
Image De-Noising
[Figures: original image, noisy image, and restored image (ICM)]
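A rough ICM sketch for binary de-noising on a 4-connected grid MRF; the toy image, noise level, and weights below are made up.

import numpy as np

def icm_denoise(noisy, unary_weight=2.0, pairwise_K=1.0, sweeps=10):
    # noisy: 2D array with values in {-1, +1}; returns a restored labeling
    labels = noisy.copy()
    H, W = labels.shape
    for _ in range(sweeps):
        for i in range(H):
            for j in range(W):
                best_label, best_energy = labels[i, j], np.inf
                for candidate in (-1, 1):
                    # unary: disagreeing with the observed pixel costs unary_weight
                    e = unary_weight * (candidate != noisy[i, j])
                    # pairwise ("label smoothing"): disagreeing with each neighbor costs K
                    for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                        ni, nj = i + di, j + dj
                        if 0 <= ni < H and 0 <= nj < W:
                            e += pairwise_K * (candidate != labels[ni, nj])
                    if e < best_energy:
                        best_label, best_energy = candidate, e
                labels[i, j] = best_label          # greedy local update (ICM)
    return labels

rng = np.random.default_rng(0)
clean = np.ones((20, 20), dtype=int); clean[5:15, 5:15] = -1      # toy image
noisy = np.where(rng.random(clean.shape) < 0.1, -clean, clean)    # flip 10% of pixels
print((icm_denoise(noisy) == clean).mean())                       # fraction of pixels recovered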
Factor graphs
• A general representation
[Figure: a Bayes net over a, b, c, d and the equivalent factor graph]
Factor graphs
• A general representation
[Figure: a Markov net over a, b, c, d and the equivalent factor graph]
Factor graphs
P(a,b,c,d) = f1(a,b,c) f2(d) f3(a,d)
Write as a factor graph
Inference in Graphical Models
• Joint
• Marginal
• Max
• Exact inference is HARD
Approximate Inference
Approximation
Sampling a Multinomial Distribution
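A minimal sketch of sampling a multinomial (categorical) distribution by inverting the CDF with a single uniform draw:

import random

def sample_categorical(probs):
    # probs: probabilities summing to 1; returns a sampled index
    u, cum = random.random(), 0.0
    for i, p in enumerate(probs):
        cum += p
        if u < cum:
            return i
    return len(probs) - 1        # guard against floating-point round-off

counts = [0, 0, 0]
for _ in range(10000):
    counts[sample_categorical([0.2, 0.5, 0.3])] += 1
print([c / 10000 for c in counts])   # roughly [0.2, 0.5, 0.3]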
Sampling from a BN
– Compute Marginals
– Compute Conditionals
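A sketch of ancestral sampling from a tiny two-node BN (Rain → Traffic) to estimate a marginal and a conditional by counting; the probabilities are placeholders.

import random

def sample_rain_traffic():
    rain = random.random() < 0.3                 # P(Rain) = 0.3, made up
    p_traffic = 0.8 if rain else 0.2             # P(Traffic | Rain), made up
    return rain, random.random() < p_traffic

samples = [sample_rain_traffic() for _ in range(100000)]
p_traffic = sum(t for _, t in samples) / len(samples)       # marginal P(Traffic)
rainy = [t for r, t in samples if r]
p_traffic_given_rain = sum(rainy) / len(rainy)               # conditional, by rejection
print(p_traffic, p_traffic_given_rain)                       # roughly 0.38 and 0.8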
Belief Propagation
• Very general • Approximate, except for tree-shaped graphs
– Generalized variants of BP can have better convergence for graphs with many loops or high potentials
• Standard packages available (BNT toolbox)
• To learn more:
  – Yedidia, J.S.; Freeman, W.T.; Weiss, Y., "Understanding Belief Propagation and Its Generalizations", Technical Report, 2001: http://www.merl.com/publications/TR2001-022/
Belief Propagation
"Beliefs" are built from "messages":
  b_i(x_i) ∝ ∏_{a ∈ N(i)} m_{a→i}(x_i)
  b_a(X_a) ∝ f_a(X_a) ∏_{i ∈ N(a)} ∏_{b ∈ N(i)\a} m_{b→i}(x_i)
The “belief” is the BP approximation of the marginal probability.
BP Message-update Rules
Using b_i(x_i) ∝ ∑_{X_a \ x_i} b_a(X_a), we get the message from factor a to variable i:
  m_{a→i}(x_i) ∝ ∑_{X_a \ x_i} f_a(X_a) ∏_{j ∈ N(a)\i} ∏_{b ∈ N(j)\a} m_{b→j}(x_j)
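As an illustrative sketch (not from the slides), here are the sum-product updates on a three-variable chain x1 - x2 - x3, where BP is exact; the unary and pairwise potentials are made up.

import numpy as np

unary = [np.array([0.7, 0.3]),        # phi_1(x1)
         np.array([0.5, 0.5]),        # phi_2(x2)
         np.array([0.2, 0.8])]        # phi_3(x3)
pair = np.array([[1.0, 0.5],          # psi(x_i, x_{i+1}): agreement preferred
                 [0.5, 1.0]])

# forward messages: m_{i->i+1}(x_{i+1}) = sum_{x_i} phi_i(x_i) psi(x_i, x_{i+1}) m_{i-1->i}(x_i)
fwd = [np.ones(2)]
for i in range(2):
    fwd.append(pair.T @ (unary[i] * fwd[i]))

# backward messages: m_{i+1->i}(x_i) = sum_{x_{i+1}} phi_{i+1}(x_{i+1}) psi(x_i, x_{i+1}) m_{i+2->i+1}(x_{i+1})
bwd = [np.ones(2)]
for i in (2, 1):
    bwd.insert(0, pair @ (unary[i] * bwd[0]))

for i in range(3):
    belief = unary[i] * fwd[i] * bwd[i]           # belief = local potential times incoming messages
    print(f"marginal of x{i+1}:", belief / belief.sum())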
Inference: Graph Cuts
• Associative: edge potentials penalize different labels
• Associative binary networks can be solved optimally (and quickly) using graph cuts
• Multilabel associative networks can be handled by alpha-expansion or alpha-beta swaps
• To learn more:
  – http://www.cs.cornell.edu/~rdz/graphcuts.html
  – Classic paper: What Energy Functions can be Minimized via Graph Cuts? (Kolmogorov and Zabih, ECCV '02/PAMI '04)
Graph Cuts: Binary MRF
Unary terms (compatibility of data with label y)
Pairwise terms (compatibility of neighboring labels)
Graph cuts are used to optimize this cost function:
  E(y ; data) = ∑_i unary_i(y_i ; data) + ∑_{(i,j) ∈ edges} pairwise_{ij}(y_i, y_j)
Summary of approach
• Associate each possible solution with a minimum cut on a graph
• Set capacities on graph, so cost of cut matches the cost function
• Use augmenting paths to find minimum cut
• This minimizes the cost function and finds the MAP solution
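A hedged sketch of this construction for a tiny binary associative MRF, using networkx's minimum_cut as the max-flow/min-cut solver; the unary costs and pairwise penalty are made up.

import networkx as nx

unary = {'p1': (0.2, 1.0),             # unary[pixel] = (cost of label 0, cost of label 1)
         'p2': (0.9, 0.4),
         'p3': (0.8, 0.3)}
edges = [('p1', 'p2'), ('p2', 'p3')]   # neighboring pixels
lam = 0.5                              # penalty when neighbors take different labels

G = nx.DiGraph()
for pix, (c0, c1) in unary.items():
    G.add_edge('s', pix, capacity=c1)  # this edge is cut iff pix ends up with label 1
    G.add_edge(pix, 't', capacity=c0)  # this edge is cut iff pix ends up with label 0
for i, j in edges:
    G.add_edge(i, j, capacity=lam)     # exactly one of these is cut iff i and j disagree
    G.add_edge(j, i, capacity=lam)

cut_value, (source_side, sink_side) = nx.minimum_cut(G, 's', 't')
labels = {pix: (0 if pix in source_side else 1) for pix in unary}
print(cut_value, labels)               # minimum energy and the corresponding MAP labeling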
Denoising Results
[Figures: original image, then results with pairwise costs increasing]