Carnegie Mellon
Parallel Splash Belief Propagation
Joseph E. Gonzalez, Yucheng Low,
Carlos Guestrin, David O’Hallaron
Computers that worked on this project:
BigBro1, BigBro2, BigBro3, BigBro4, BigBro5, BigBro6, BiggerBro, BigBroFS, Tashi01, Tashi02, Tashi03, Tashi04, Tashi05, Tashi06, …, Tashi30,
parallel, gs6167, koobcam (helped with writing)
Why talk about parallelism now?
Change in the Foundation of ML
[Figure: log(speed in GHz) vs. release date. Past sequential performance rose steadily; future sequential performance flattens out, while future parallel performance keeps climbing.]
Why is this a Problem?
[Figure: parallelism vs. sophistication. Nearest Neighbor [Google et al.] is highly parallel but simple; Basic Regression [Cheng et al.], Support Vector Machines [Graf et al.], and Graphical Models [Mendiburu et al.] are increasingly sophisticated but less parallel. We want to be in the high-sophistication, high-parallelism corner.]
Why is it hard?
Algorithmic Efficiency: eliminate wasted computation
Parallel Efficiency: expose independent computation
Implementation Efficiency: map computation to real hardware
The Key Insight
The Result
[Figure: parallelism vs. sophistication, revisited. Splash Belief Propagation moves Graphical Models [Gonzalez et al.] into the goal region, past Nearest Neighbor [Google et al.], Basic Regression [Cheng et al.], Support Vector Machines [Graf et al.], and prior work on Graphical Models [Mendiburu et al.].]
Outline
- Overview
- Graphical Models: Statistical Structure
- Inference: Computational Structure
- τε-Approximate Messages: Statistical Structure
- Parallel Splash: Dynamic Scheduling, Partitioning
- Experimental Results
- Conclusions
Graphical Models and Parallelism
Graphical models provide a common language for general-purpose parallel algorithms in machine learning.
A parallel inference algorithm would improve:
- Protein Structure Prediction
- Computer Vision
- Movie Recommendation
Inference is also a key step in learning graphical models.
Overview of Graphical Models
Graphical representation of local statistical dependencies:
[Figure: image denoising example. Observed random variables (the noisy picture) connect to latent pixel variables (the "true" pixel values); local dependencies between neighboring pixels encode continuity assumptions. Inference asks: what is the probability that this pixel is black?]
Synthetic Noisy Image Problem
- Overlapping Gaussian noise
- Assess convergence and accuracy
[Figure: noisy image (left) and predicted image (right).]
Protein Side-Chain Prediction
Model side-chain interactions as a graphical model
[Figure: a protein backbone with attached side chains. Inference asks: what is the most likely orientation of each side chain?]
Protein Side-Chain Prediction
276 protein networks, each with approximately:
- 700 variables
- 1600 factors
- 70 discrete orientations
Strong Factors
Markov Logic Networks
Represent logic as a graphical model (A: Alice, B: Bob):
[Figure: factor graph over binary variables Friends(A,B), Smokes(A), Smokes(B), Cancer(A), and Cancer(B), with factors such as Friends(A,B) ∧ Smokes(A) ⇒ Smokes(B).]
Inference asks: Pr(Cancer(B) = True | Smokes(A) = True, Friends(A,B) = True) = ?
Markov Logic Networks
[Figure: the same Smokes/Cancer/Friends factor graph.]
UW-Systems Model:
- 8K binary variables
- 406K factors
- Irregular degree distribution: some vertices with high degree
Outline
- Overview
- Graphical Models: Statistical Structure
- Inference: Computational Structure
- τε-Approximate Messages: Statistical Structure
- Parallel Splash: Dynamic Scheduling, Partitioning
- Experimental Results
- Conclusions
The Inference Problem
Inference is NP-hard in general, so we turn to approximate inference: Belief Propagation.
[Figure: the Smokes/Cancer/Friends factor graph and the protein backbone.]
Example queries:
- What is the probability that Bob smokes, given that Alice smokes?
- What is the best configuration of the protein side-chains?
- What is the probability that each pixel is black?
Belief Propagation (BP)
An iterative message-passing algorithm, and a naturally parallel one.
Parallel Synchronous BP
Given the old messages, all new messages can be computed in parallel:
[Figure: CPU 1 through CPU n each read the old messages and compute a block of new messages.]
Map-Reduce Ready!
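To make the update concrete, here is a minimal sketch (our illustration, not the authors' code) of one synchronous round, simplified to a pairwise model; `edges`, `node_pot`, `edge_pot`, and `old_msgs` are hypothetical names. Because every new message depends only on old messages, the loop is embarrassingly parallel, e.g. one map task per directed edge.

```python
import numpy as np

def synchronous_bp_round(edges, node_pot, edge_pot, old_msgs):
    """edges: directed (u, v) pairs; node_pot[u]: unary potential vector;
    edge_pot[(u, v)]: pairwise potential matrix indexed [x_u, x_v];
    old_msgs[(u, v)]: the previous message vector along edge (u, v)."""
    new_msgs = {}
    for (u, v) in edges:  # each iteration is independent: parallelize freely
        # Combine u's unary potential with all inbound messages except v's.
        pre_msg = node_pot[u].copy()
        for (w, x) in edges:
            if x == u and w != v:
                pre_msg *= old_msgs[(w, u)]
        # Sum out x_u through the edge potential, then normalize.
        msg = edge_pot[(u, v)].T @ pre_msg
        new_msgs[(u, v)] = msg / msg.sum()
    return new_msgs
```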
Sequential Computational Structure
Hidden Sequential Structure
[Figure: a chain graphical model with evidence at both ends; information must propagate sequentially along the chain, one step per synchronous iteration.]
Running time = (time for a single parallel iteration) × (number of iterations)
Optimal Sequential Algorithm: Forward-Backward
[Figure: running time on a chain of n vertices. The naturally parallel synchronous schedule runs in 2n²/p for p ≤ 2n; the optimal sequential Forward-Backward algorithm (p = 1) runs in 2n; the optimal parallel schedule (p = 2) runs in n. A gap separates the naturally parallel schedule from both.]
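For reference, a minimal sketch (hypothetical names, not the authors' code) of the Forward-Backward schedule on a chain: one forward sweep and one backward sweep compute all 2(n-1) messages exactly, in O(n) message updates.

```python
import numpy as np

def forward_backward(node_pot, edge_pot):
    """node_pot: list of n unary potential vectors; edge_pot[i]: potential
    matrix between vertices i and i+1, indexed [x_i, x_{i+1}].
    Returns the normalized marginal belief at every vertex."""
    n = len(node_pot)
    fwd = [None] * n  # fwd[i]: message arriving at vertex i from the left
    bwd = [None] * n  # bwd[i]: message arriving at vertex i from the right
    fwd[0] = np.ones_like(node_pot[0])
    bwd[-1] = np.ones_like(node_pot[-1])
    for i in range(1, n):            # forward sweep
        m = edge_pot[i - 1].T @ (node_pot[i - 1] * fwd[i - 1])
        fwd[i] = m / m.sum()
    for i in range(n - 2, -1, -1):   # backward sweep
        m = edge_pot[i] @ (node_pot[i + 1] * bwd[i + 1])
        bwd[i] = m / m.sum()
    beliefs = [node_pot[i] * fwd[i] * bwd[i] for i in range(n)]
    return [b / b.sum() for b in beliefs]
```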
Key Computational Structure
[Figure: running time again: naturally parallel BP at 2n²/p for p ≤ 2n vs. the optimal parallel schedule at n with p = 2.]
The gap reflects inherent sequential structure, and closing it requires efficient scheduling.
Outline
- Overview
- Graphical Models: Statistical Structure
- Inference: Computational Structure
- τε-Approximate Messages: Statistical Structure
- Parallel Splash: Dynamic Scheduling, Partitioning
- Experimental Results
- Conclusions
Parallelism by Approximation
τε represents the minimal sequential structure
[Figure: true messages vs. a τε-approximation on a chain of vertices 1 through 10; messages from vertices farther than τε away change the result by less than ε.]
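A hedged reconstruction of the definition (notation ours; the exact message norm in the paper may differ): letting m_{i→i+1} be the true chain message and m^{(τ)}_{i→i+1} the message computed from only the τ preceding vertices, τε is the smallest horizon that keeps every message within ε:

```latex
\[
  \tau_\epsilon \;=\; \min\Bigl\{\, \tau \;:\;
    \max_i \bigl\| m^{(\tau)}_{i \to i+1} - m_{i \to i+1} \bigr\| \le \epsilon
  \Bigr\}
\]
```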
Tau-Epsilon Structure
Often τε decreases quickly:
[Figure: message approximation error (log scale) vs. distance for Markov Logic Networks and Protein Networks; the error falls off rapidly.]
Running Time Lower Bound
Theorem: Using p processors it is not possible to obtain a τε-approximation in time less than
    2(n - τε)/p + (τε - 1),
a parallel component plus a sequential component.

Proof: Running Time Lower Bound
Consider one direction using p/2 processors (p ≥ 2):
[Figure: a chain 1 through n divided into segments of width τε.]
- We must make n - τε vertices τε left-aware.
- A single processor can only make k - τε + 1 vertices left-aware in k iterations.
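Filling in the algebra implied by those two counting facts (our reconstruction; the published statement may differ in constants):

```latex
\[
  \underbrace{\tfrac{p}{2}}_{\text{processors}}
  \cdot
  \underbrace{(k - \tau_\epsilon + 1)}_{\text{left-aware per processor}}
  \;\ge\;
  \underbrace{n - \tau_\epsilon}_{\text{vertices required}}
  \quad\Longrightarrow\quad
  k \;\ge\; \frac{2(n - \tau_\epsilon)}{p} + \tau_\epsilon - 1
\]
```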
Optimal Parallel Scheduling
[Figure: the chain split into contiguous blocks, one per processor (Processor 1, Processor 2, Processor 3); each processor runs Forward-Backward on its block and exchanges boundary messages between iterations.]
Theorem: Using p processors this algorithm achieves a τε-approximation in time
    O(n/p + τε).

Proof: Optimal Parallel Scheduling
- After one iteration, all vertices are left-aware of the left-most vertex on their processor.
- After exchanging boundary messages, the next iteration extends that awareness across another block.
- After k parallel iterations each vertex is (k - 1)(n/p) left-aware.
- All vertices must be made τε left-aware, and each iteration takes O(n/p) time.
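The remaining algebra, which follows directly from the stated steps:

```latex
\[
  (k - 1)\frac{n}{p} \;\ge\; \tau_\epsilon
  \;\Longrightarrow\;
  k \;\ge\; \frac{\tau_\epsilon\, p}{n} + 1,
  \qquad
  \text{total time} \;=\; k \cdot O\!\left(\frac{n}{p}\right)
  \;=\; O\!\left(\tau_\epsilon + \frac{n}{p}\right)
\]
```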
Comparing with Synchronous BP
[Figure: message schedules across Processor 1, Processor 2, and Processor 3; the synchronous schedule vs. the optimal schedule, with a visible running-time gap.]
Outline
- Overview
- Graphical Models: Statistical Structure
- Inference: Computational Structure
- τε-Approximate Messages: Statistical Structure
- Parallel Splash: Dynamic Scheduling, Partitioning
- Experimental Results
- Conclusions
The Splash Operation
Generalize the optimal chain algorithm to arbitrary cyclic graphs (a minimal sketch follows the steps below):
1) Grow a BFS spanning tree with fixed size
2) Forward pass computing all messages at each vertex
3) Backward pass computing all messages at each vertex
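A minimal sketch of one Splash under hypothetical data structures (`graph`, `send_messages`); this is our illustration of the three steps, not the authors' implementation.

```python
from collections import deque

def splash(graph, root, max_size, send_messages):
    """graph[v]: iterable of v's neighbors; send_messages(v): recompute and
    send all of v's outbound messages; max_size bounds the tree size."""
    # 1) Grow a BFS spanning tree of bounded size rooted at `root`.
    order, visited, frontier = [], {root}, deque([root])
    while frontier and len(order) < max_size:
        v = frontier.popleft()
        order.append(v)
        for u in graph[v]:
            if u not in visited:
                visited.add(u)
                frontier.append(u)
    # 2) Forward pass: update messages from the leaves toward the root.
    for v in reversed(order):
        send_messages(v)
    # 3) Backward pass: update messages from the root back to the leaves.
    for v in order:
        send_messages(v)
```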
Running Parallel Splashes
- Partition the graph
- Schedule Splashes locally
- Transmit the messages along the boundary of the partition
[Figure: CPU 1, CPU 2, and CPU 3, each holding local state and running its own Splash.]
Key challenges:
1) How do we schedule Splashes?
2) How do we partition the graph?
Where do we Splash?
Assign priorities and use a scheduling queue to select roots:
[Figure: CPU 1 pops Splash roots from its local scheduling queue and runs Splashes around them.]
How do we assign priorities?
Message Scheduling
Residual Belief Propagation [Elidan et al., UAI 06]: assign priorities based on the change in inbound messages.
[Figure: two vertices receiving inbound messages, one with large changes and one with small changes.]
- Small change: expensive no-op
- Large change: informative update
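A hedged sketch of message-residual priorities in the spirit of residual BP; the class and names are hypothetical. Each edge's priority is the distance between its newest message and the last message actually consumed, served from a max-priority queue with lazy deletion.

```python
import heapq
import numpy as np

class MessageResidualQueue:
    def __init__(self):
        self.heap = []       # entries (-residual, edge): max-queue via negation
        self.priority = {}   # current residual per edge

    def update(self, edge, new_msg, old_msg):
        # Max-norm change of the message; any message metric works here.
        r = float(np.max(np.abs(new_msg - old_msg)))
        self.priority[edge] = r
        heapq.heappush(self.heap, (-r, edge))

    def pop(self):
        # Lazy deletion: discard heap entries whose priority is stale.
        while self.heap:
            neg_r, edge = heapq.heappop(self.heap)
            if self.priority.get(edge) == -neg_r:
                self.priority[edge] = 0.0  # residual consumed by the update
                return edge
        return None
```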
Problem with Message Scheduling
Small changes in messages do not imply small changes in belief:
[Figure: every inbound message changes only slightly, yet the resulting belief changes substantially.]
Problem with Message Scheduling
Large changes in a single message do not imply large changes in belief:
[Figure: a single inbound message changes substantially, yet the resulting belief changes only slightly.]
Belief Residual Scheduling
Assign priorities based on the cumulative change in belief:
    r_v ← r_v + (change in inbound message),
accumulated over every inbound message change since v was last updated.
A vertex whose belief has changed substantially since last being updated will likely produce informative new messages.
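A hedged sketch of belief-residual scheduling with hypothetical names: every inbound message change is accumulated into the receiving vertex's residual, and Splash roots are popped by largest residual, which resets once the vertex is updated.

```python
import heapq
import numpy as np

class BeliefResidualQueue:
    def __init__(self, vertices):
        # Start residuals at infinity so every vertex is visited at least once.
        self.residual = {v: float("inf") for v in vertices}
        self.heap = [(-float("inf"), v) for v in vertices]
        heapq.heapify(self.heap)

    def on_message_change(self, v, new_msg, old_msg):
        # Accumulate: r_v += |message change|  (cumulative change in belief).
        self.residual[v] += float(np.max(np.abs(new_msg - old_msg)))
        heapq.heappush(self.heap, (-self.residual[v], v))

    def pop_root(self):
        # Lazy deletion: skip entries that no longer match the live residual.
        while self.heap:
            neg_r, v = heapq.heappop(self.heap)
            if self.residual[v] == -neg_r:
                self.residual[v] = 0.0  # reset after v is updated
                return v
        return None
```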
Message vs. Belief Scheduling
Belief Scheduling improves accuracy and convergence
[Figure: convergence and accuracy comparison; belief scheduling does better on both.]
Splash Pruning
Belief residuals can be used to dynamically reshape and resize Splashes:
[Figure: vertices with low belief residuals are pruned from the Splash tree.]
Splash Size
Using Splash pruning, our algorithm is able to dynamically select the optimal Splash size.
[Figure: performance as a function of Splash size, with and without pruning.]
Example
[Figure: the synthetic noisy image, its factor graph, and a map of vertex updates, with many updates in some regions and few in others.]
The algorithm identifies and focuses on the hidden sequential structure.
Parallel Splash Algorithm
- Partition the factor graph over processors
- Schedule Splashes locally using belief residuals
- Transmit messages on the boundary
[Figure: CPU 1, CPU 2, and CPU 3, each with local state and a scheduling queue, running Splashes and exchanging boundary messages over a fast reliable network.]
Theorem: Given a uniform partitioning of the chain graphical model, Parallel Splash will run in time O(n/p + τε), retaining optimality.
Partitioning Objective
The partitioning of the factor graph determines storage, computation, and communication.
Goal: balance computation and minimize communication.
[Figure: a two-processor cut of the factor graph; ensure balance, minimize communication cost.]
The Partitioning Problem
Objective: minimize communication (the messages that cross the cut) while ensuring balance (roughly equal work per processor). The objective depends on:
- Work: the update counts of the vertices assigned to each processor
- Comm.: the cost of the edges cut between processors
The problem is NP-hard; METIS provides a fast partitioning heuristic.
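A hedged reconstruction of the balanced-partitioning objective the slide alludes to (notation ours): assign vertices to blocks B_1, …, B_p so as to minimize the cut cost subject to a work-balance constraint,

```latex
\[
  \min_{B_1,\dots,B_p}
  \sum_{(u,v)\,:\,B(u)\neq B(v)} c(u,v)
  \qquad \text{s.t.} \qquad
  \forall k:\;\; \sum_{v \in B_k} w(v) \;\le\; \frac{1+\gamma}{p} \sum_{v} w(v)
\]
```

where w(v) is the (update-count-weighted) work of vertex v, c(u,v) the communication cost of edge (u,v), and γ a small imbalance tolerance.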
Update counts are not known!
Unknown Update Counts
- Determined by belief scheduling
- Depends on graph structure, factors, …
- Little correlation between past and future update counts
[Figure: update counts for the noisy-image model.]
Simple solution: the uninformed cut.

Uninformed Cuts
Ignoring update counts yields greater imbalance but lower communication cost:
[Figure: update counts, the uninformed cut, and the optimal cut; the uninformed cut gives some processors too much work and others too little.]
Over-Partitioning
Over-cut the graph into k*p partitions and randomly assign partitions to CPUs:
- Increases balance
- Increases communication cost (more boundary)
[Figure: without over-partitioning, each CPU owns one large block; with k = 6, each CPU owns several small scattered blocks.]
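A minimal sketch of the assignment step with hypothetical names: given any k*p-way cut (e.g., from METIS), deal the pieces to CPUs at random so each processor receives k pieces.

```python
import random

def over_partition_assign(num_cpus, k, piece_of):
    """piece_of: dict mapping vertex -> piece id in [0, k * num_cpus).
    Returns a dict mapping vertex -> cpu id in [0, num_cpus)."""
    pieces = list(range(k * num_cpus))
    random.shuffle(pieces)
    # Deal the shuffled pieces round-robin: each CPU gets exactly k pieces.
    cpu_of_piece = {piece: i % num_cpus for i, piece in enumerate(pieces)}
    return {v: cpu_of_piece[p] for v, p in piece_of.items()}
```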
Over-Partitioning Results
Over-partitioning provides a simple method to trade between work balance and communication cost.
[Figure: balance and communication cost as functions of the over-partitioning factor k.]
CPU Utilization
Over-partitioning improves CPU utilization:
[Figure: CPU utilization over time, with and without over-partitioning.]
Parallel Splash Algorithm
- Over-partition the factor graph and randomly assign pieces to processors
- Schedule Splashes locally using belief residuals
- Transmit messages on the boundary
[Figure: CPU 1, CPU 2, and CPU 3, each with local state and a scheduling queue, exchanging boundary messages over a fast reliable network.]
Outline
- Overview
- Graphical Models: Statistical Structure
- Inference: Computational Structure
- τε-Approximate Messages: Statistical Structure
- Parallel Splash: Dynamic Scheduling, Partitioning
- Experimental Results
- Conclusions
Experiments
- Implemented in C++ using MPICH2 as the message-passing API
- Ran on the Intel OpenCirrus cluster: 120 processors (15 nodes, each with 2 quad-core Intel Xeon processors, connected by a gigabit Ethernet switch)
- Tested on Markov Logic Networks obtained from Alchemy [Domingos et al., SSPR 08]
- Results presented on the largest (UW-Systems) and smallest (UW-Languages) MLNs
Parallel Performance (Large Graph)
UW-Systems: 8K variables, 406K factors
Single-processor running time: 1 hour
Linear to super-linear speedup up to 120 CPUs; the super-linearity comes from cache efficiency.
[Figure: speedup vs. number of CPUs, plotted against the linear-speedup line.]
Parallel Performance (Small Graph)
UW-Languages: 1K variables, 27K factors
Single-processor running time: 1.5 minutes
Linear to super-linear speedup up to 30 CPUs; beyond that, network costs quickly dominate the short running time.
[Figure: speedup vs. number of CPUs, plotted against the linear-speedup line.]
Outline
- Overview
- Graphical Models: Statistical Structure
- Inference: Computational Structure
- τε-Approximate Messages: Statistical Structure
- Parallel Splash: Dynamic Scheduling, Partitioning
- Experimental Results
- Conclusions
Summary
- Algorithmic efficiency: Splash structure + belief residual scheduling
- Parallel efficiency: independent parallel Splashes
- Implementation efficiency: distributed queues, asynchronous communication, over-partitioning
Experimental results on large factor graphs: linear to super-linear speed-up using up to 120 processors.
[Figure: parallelism vs. sophistication. Parallel Splash Belief Propagation moves graphical model inference from "we were here" to "we are here".]
Conclusion
Questions
Protein Results
[Figure: relative speedup vs. number of cores (1 to 8) for ResidualSplash BP, Residual BP, and MapReduce BP, plotted against the linear-speedup line.]
3D Video Task
[Figure: relative speedup vs. number of cores (1 to 8) for ResidualSplash BP, Residual BP, and MapReduce BP, plotted against the linear-speedup line.]
Distributed Parallel Setting
Opportunities:
- Access to larger systems: 8 CPUs → 1000 CPUs
- Linear increase in RAM, cache capacity, and memory bandwidth
Challenges:
- Distributed state, communication, and load balancing
[Figure: nodes, each with CPUs, bus, memory, and cache, connected by a fast reliable network.]