Carnegie Mellon University
GraphLab Tutorial
Yucheng Low
GraphLab Team
Yucheng Low
Aapo Kyrola
Jay Gu
Joseph Gonzalez
Danny Bickson
Carlos Guestrin
Development History
• GraphLab 0.5 (2010): Internal Experimental Code. Insanely Templatized.
• GraphLab 1 (2011): First Open Source Release (< June 2011: LGPL; >= June 2011: APL). Nearly Everything is Templatized.
• GraphLab 2 (2012): Shared Memory: Jan 2012. Distributed: May 2012. Many Things are Templatized.
GraphLab 2 Technical Design Goals
• Improved usability
• Decreased compile time
• As good or better performance than GraphLab 1
• Improved distributed scalability
… other abstraction changes … (come to the talk!)
Development History
Ever since GraphLab 1.0, all active development has been open source (APL):
code.google.com/p/graphlabapi/
(Even current experimental code, activated with the --experimental flag on ./configure.)
Guaranteed Target Platforms
• Any x86 Linux system with gcc >= 4.2
• Any x86 Mac system with gcc 4.2.1 (OS X 10.5 ??)
• Other platforms? … We welcome contributors.
Tutorial Outline
• GraphLab in a few slides + PageRank
• Checking out GraphLab v2
• Implementing PageRank in GraphLab v2
• Overview of different GraphLab schedulers
• Preview of Distributed GraphLab v2 (may not work in your checkout!)
• Ongoing work… (as much as time allows)
Warning
A preview of code still in intensive development!
• Things may or may not work for you!
• Interface may still change!
• Compared with GraphLab 1, GraphLab 2 still has a number of performance regressions that we are ironing out.
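PageRank Example
Iterate:
$$R[i] = \alpha + (1 - \alpha) \sum_{j \text{ links to } i} \frac{1}{L[j]} R[j]$$
Where:
• α is the random reset probability
• L[j] is the number of links on page j
[Figure: example link graph with six pages, numbered 1 to 6]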
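As a rough illustration of the iteration on this slide, the sketch below repeats the update for every page a fixed number of times. It is plain C++ and does not use GraphLab; the function name pagerank_sweeps, the adjacency-list layout, and the starting rank of 1.0 are assumptions made for this example only.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper, not part of GraphLab. Repeats
//   R[i] = alpha + (1 - alpha) * sum over pages j linking to i of R[j] / L[j]
// for every page, `iterations` times.
// out_links[j] lists the pages that page j links to, so L[j] = out_links[j].size().
std::vector<double> pagerank_sweeps(
    const std::vector<std::vector<std::size_t>>& out_links,
    double alpha, int iterations) {
  assert(alpha >= 0.0 && alpha <= 1.0);
  const std::size_t n = out_links.size();
  std::vector<double> rank(n, 1.0);  // assumed starting value
  for (int it = 0; it < iterations; ++it) {
    std::vector<double> next(n, alpha);  // the random reset term
    for (std::size_t j = 0; j < n; ++j) {
      const double links = static_cast<double>(out_links[j].size());
      for (std::size_t i : out_links[j]) {
        next[i] += (1.0 - alpha) * rank[j] / links;  // page j's share for page i
      }
    }
    rank.swap(next);
  }
  return rank;
}
```

Every page is recomputed in every sweep here, whether or not its inputs changed. The update functions and schedulers in the rest of the tutorial exist to avoid that wasted work.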
The GraphLab Framework
• Graph Based Data Representation
• Update Functions (User Computation)
• Scheduler
• Consistency Model
Data Graph
A graph with arbitrary data (C++ objects) associated with each vertex and edge.
• Graph: Link graph
• Vertex Data: Webpage, Webpage Features
• Edge Data: Link weight
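To make the data graph idea concrete, here is a minimal sketch of a graph that stores a user-defined C++ object on every vertex and edge. These types are illustrative stand-ins written for this example; they are not the GraphLab graph classes, and all member names are assumptions.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Illustrative types only; not the GraphLab graph classes.
struct vertex_data {
  std::string url;  // the webpage
  double rank;      // one webpage feature
};

struct edge_data {
  double weight;    // link weight
};

struct edge {
  std::size_t source, target;
  edge_data data;
};

struct data_graph {
  std::vector<vertex_data> vertices;
  std::vector<edge> edges;

  // Stores the vertex data and returns the new vertex id.
  std::size_t add_vertex(const vertex_data& v) {
    vertices.push_back(v);
    return vertices.size() - 1;
  }

  // Stores a directed link together with its edge data.
  void add_edge(std::size_t source, std::size_t target, const edge_data& e) {
    assert(source < vertices.size() && target < vertices.size());
    edges.push_back(edge{source, target, e});
  }
};
```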
Update Functions
An update function is a user-defined program which, when applied to a vertex, transforms the data in the scope of the vertex.
pagerank(i, scope) {
  // Get neighborhood data (R[i], W_ji, R[j]) from scope
  // Update the vertex data
  $R[i] \leftarrow \alpha + (1 - \alpha) \sum_{j \in N[i]} W_{ji} R[j]$;
  // Reschedule neighbors if needed
  if R[i] changes then reschedule_neighbors_of(i);
}
Dynamic Schedule
[Figure: a graph with vertices a to k; the scheduler hands queued vertices (a, h, b, i, …) to CPU 1 and CPU 2]
The process repeats until the scheduler is empty.
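The loop described here can be sketched in a few lines. This is a single-threaded stand-in with a FIFO queue, written for this example; it does not use the GraphLab scheduler classes, and the names tiny_graph and run_dynamic_schedule are hypothetical. Each queued vertex is updated with the PageRank rule, and its out-neighbors are put back on the queue only when its rank changed by more than epsilon.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <deque>
#include <vector>

// Hypothetical types for this sketch; not the GraphLab scheduler API.
struct tiny_graph {
  std::vector<std::vector<std::size_t>> in_nbrs;   // pages linking to i
  std::vector<std::vector<std::size_t>> out_nbrs;  // pages that i links to
  std::vector<double> rank;
};

// Runs until the queue is empty and returns how many updates were executed.
std::size_t run_dynamic_schedule(tiny_graph& g, double alpha, double epsilon) {
  const std::size_t n = g.rank.size();
  assert(g.in_nbrs.size() == n && g.out_nbrs.size() == n);
  std::deque<std::size_t> queue;
  std::vector<bool> queued(n, true);
  for (std::size_t i = 0; i < n; ++i) queue.push_back(i);  // initial schedule

  std::size_t updates = 0;
  while (!queue.empty()) {
    const std::size_t i = queue.front();
    queue.pop_front();
    queued[i] = false;
    ++updates;

    // The update function: recompute R[i] from the in-neighbors.
    double sum = 0.0;
    for (std::size_t j : g.in_nbrs[i]) {
      assert(!g.out_nbrs[j].empty());
      sum += g.rank[j] / static_cast<double>(g.out_nbrs[j].size());
    }
    const double old_rank = g.rank[i];
    g.rank[i] = alpha + (1.0 - alpha) * sum;

    // Reschedule the neighbors only if R[i] changed noticeably.
    if (std::fabs(g.rank[i] - old_rank) > epsilon) {
      for (std::size_t k : g.out_nbrs[i]) {
        if (!queued[k]) {
          queued[k] = true;
          queue.push_back(k);
        }
      }
    }
  }
  return updates;
}
```

In a parallel setting several CPUs would pull vertices from the queue at once, which is where the consistency models discussed next come in.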
Source Code Interjection 1
Graph, update functions, and schedulers
--scope=vertex
--scope=edge
Consistency
Trade-off: Consistency vs. “Throughput” (# “iterations” per second)
Goal of ML algorithm: Converge
False Trade-off
Ensuring Race-Free Code
How much can computation overlap?
Importance of Consistency
Fast ML algorithm development cycle: Build → Test → Debug → Tweak Model
The framework must behave predictably and consistently, and avoid problems caused by non-determinism.
Is the execution wrong? Or is the model wrong?
Full Consistency
Guaranteed safety for all update functions.
Parallel updates are only allowed two vertices apart, which reduces the opportunities for parallelism.
Obtaining More Parallelism
Not all update functions will modify the entire scope!
Edge Consistency
• Belief Propagation: Only uses edge data
• Gibbs Sampling: Only needs to read adjacent vertices
Vertex Consistency
• “Map” operations, e.g. feature extraction on vertex data
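One way to picture the three models is as a test of whether two vertices may be updated at the same time. The sketch below encodes a simplified reading of the slides: full consistency requires that the two scopes share nothing, edge consistency only forbids adjacent vertices, and vertex consistency only forbids updating the same vertex twice. It is not GraphLab's locking code, and the function names and the symmetric adjacency-list layout are assumptions.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

enum class consistency { vertex, edge, full };

// adj[v] lists the neighbors of v. The graph is treated as undirected,
// so adj must be symmetric.
using adjacency = std::vector<std::vector<std::size_t>>;

bool adjacent(const adjacency& adj, std::size_t u, std::size_t v) {
  return std::find(adj[u].begin(), adj[u].end(), v) != adj[u].end();
}

bool share_neighbor(const adjacency& adj, std::size_t u, std::size_t v) {
  for (std::size_t w : adj[u]) {
    if (adjacent(adj, v, w)) return true;
  }
  return false;
}

// Simplified rule for whether updates on u and v may overlap in time.
bool can_update_in_parallel(const adjacency& adj, std::size_t u, std::size_t v,
                            consistency model) {
  assert(u < adj.size() && v < adj.size());
  if (u == v) return false;  // never run two updates on one vertex at once
  switch (model) {
    case consistency::vertex:
      return true;  // each update writes only its own vertex
    case consistency::edge:
      return !adjacent(adj, u, v);  // adjacent vertices share an edge
    case consistency::full:
      // The scopes (vertex, its edges, its neighbors) must not overlap.
      return !adjacent(adj, u, v) && !share_neighbor(adj, u, v);
  }
  return false;
}
```

On a path 0, 1, 2, 3, vertices 0 and 2 may run together under edge consistency but not under full consistency, while 0 and 3 may run together under all three models.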
Shared Variables
Global aggregation through the Sync Operation
• A global parallel reduction over the graph data
• Synced variables are recomputed at defined intervals while update functions are running
Examples:
• Sync: Highest PageRank
• Sync: Loglikelihood
Source Code Interjection 2
Shared variables
What can we do with these primitives?
…many many things…
Matrix Factorization
Netflix Collaborative Filtering
Alternating Least Squares Matrix Factorization
Model: 0.5 million nodes, 99 million edges
[Figure: Netflix ratings matrix (Users × Movies) with factor dimension d]
[Figure: Netflix speedup with increasing size of the matrix factorization]
Video Co-Segmentation
Discover “coherent” segment types across a video (extends Batra et al. ‘10)
1. Form super-voxels from the video
2. EM & inference in Markov random field
Large model: 23 million nodes, 390 million edges
[Figure: scaling plot with curves labeled GraphLab and Ideal]
Many More
• Tensor Factorization
• Bayesian Matrix Factorization
• Graphical Model Inference/Learning
• Linear SVM
• EM clustering
• Linear Solvers using GaBP
• SVD
• Etc.
Distributed Preview
GraphLab 2 Abstraction Changes
(an overview of a couple of them)
(Come to the talk for the rest!)
Exploiting Update Functors (for the greater good)
1. Update functors store state.
2. The scheduler schedules update functor instances.
3. We can use update functors as a controlled form of asynchronous message passing to communicate between vertices!
Delta Based Update Functors
#include <cmath>  // std::fabs

struct pagerank : public iupdate_functor<graph, pagerank> {
  double delta;  // rank change carried by this functor instance
  pagerank(double d) : delta(d) { }
  // Merge another functor's delta into this one
  void operator+=(const pagerank& other) { delta += other.delta; }
  void operator()(icontext_type& context) {
    vertex_data& vdata = context.vertex_data();
    vdata.rank += delta;
    if (std::fabs(delta) > EPSILON) {
      // Pass a damped share of the change on to the out-neighbors
      double out_delta = delta * (1 - RESET_PROB) / context.num_out_edges();
      context.schedule_out_neighbors(pagerank(out_delta));
    }
  }
};
// Initial Rank: R[i] = 0;
// Initial Schedule: pagerank(RESET_PROB);
Asynchronous Message Passing
Obviously not all computation can be written this way. But when it can, it can be extremely fast.
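The delta-passing idea can be tried without GraphLab. In the sketch below, a pending[] array plays the role of the functor instances waiting in the scheduler, and adding into pending[k] stands in for operator+=. The function name delta_pagerank, the FIFO queue, and the guard for pages without out-links are assumptions made for this example.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <deque>
#include <vector>

// Hypothetical stand-alone version of delta-based PageRank.
std::vector<double> delta_pagerank(
    const std::vector<std::vector<std::size_t>>& out_nbrs,
    double reset_prob, double epsilon) {
  assert(reset_prob > 0.0 && reset_prob <= 1.0 && epsilon > 0.0);
  const std::size_t n = out_nbrs.size();
  std::vector<double> rank(n, 0.0);            // Initial Rank: R[i] = 0
  std::vector<double> pending(n, reset_prob);  // Initial Schedule: pagerank(RESET_PROB)
  std::vector<bool> scheduled(n, true);
  std::deque<std::size_t> queue;
  for (std::size_t i = 0; i < n; ++i) queue.push_back(i);

  while (!queue.empty()) {
    const std::size_t i = queue.front();
    queue.pop_front();
    const double delta = pending[i];  // the merged message for vertex i
    pending[i] = 0.0;
    scheduled[i] = false;

    rank[i] += delta;
    if (std::fabs(delta) > epsilon && !out_nbrs[i].empty()) {
      const double out_delta = delta * (1.0 - reset_prob) /
                               static_cast<double>(out_nbrs[i].size());
      for (std::size_t k : out_nbrs[i]) {
        pending[k] += out_delta;  // merge with any delta already waiting
        if (!scheduled[k]) {
          scheduled[k] = true;
          queue.push_back(k);
        }
      }
    }
  }
  return rank;
}
```

Each update touches only its own vertex and never reads its in-neighbors, which is why this style can be fast when the computation allows it.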
Factorized Updates
PageRank in GraphLab
#include <cmath>  // std::fabs

struct pagerank : public iupdate_functor<graph, pagerank> {
  void operator()(icontext_type& context) {
    vertex_data& vdata = context.vertex_data();
    // Weighted sum of the ranks of all in-neighbors
    double sum = 0;
    foreach (edge_type edge, context.in_edges())
      sum += context.const_edge_data(edge).weight *
             context.const_vertex_data(edge.source()).rank;
    // Update the rank of this vertex
    double old_rank = vdata.rank;
    vdata.rank = RESET_PROB + (1 - RESET_PROB) * sum;
    // Reschedule the out-neighbors only if the change is large enough
    double residual = std::fabs(vdata.rank - old_rank) /
                      context.num_out_edges();
    if (residual > EPSILON)
      context.reschedule_out_neighbors(pagerank());
  }
};
PageRank in GraphLab (decomposed)
The same update functor splits into three parts:
• Parallel “Sum” Gather: the loop over context.in_edges() that accumulates sum
• Atomic Single Vertex Apply: the assignment to vdata.rank and the residual computation
• Parallel Scatter [Reschedule]: the call to context.reschedule_out_neighbors(pagerank())
Decomposable Update Functors
Decompose update functions into 3 phases:
• Gather (user defined): Gather(…) → Δ, evaluated over the scope of vertex Y. Partial results are combined with a user-defined sum, Δ1 + Δ2 → Δ3, so the gather runs as a parallel sum Δ + Δ + … + Δ.
• Apply (user defined): Apply(Y, Δ) → Y. Apply the accumulated value to the center vertex.
• Scatter (user defined): Scatter(…). Update adjacent edges and vertices.
Factorized PageRank
#include <cmath>  // std::fabs

struct pagerank : public iupdate_functor<graph, pagerank> {
  double accum = 0, residual = 0;
  // Gather: add the contribution of one in-edge
  void gather(icontext_type& context, const edge_type& edge) {
    accum += context.const_edge_data(edge).weight *
             context.const_vertex_data(edge.source()).rank;
  }
  // Sum two partial gather results
  void merge(const pagerank& other) { accum += other.accum; }
  // Apply: write the new rank to the center vertex
  void apply(icontext_type& context) {
    vertex_data& vdata = context.vertex_data();
    double old_value = vdata.rank;
    vdata.rank = RESET_PROB + (1 - RESET_PROB) * accum;
    residual = std::fabs(vdata.rank - old_value) /
               context.num_out_edges();
  }
  // Scatter: reschedule the target of one out-edge if needed
  void scatter(icontext_type& context, const edge_type& edge) {
    if (residual > EPSILON)
      context.schedule(edge.target(), pagerank());
  }
};
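To see the three phases in isolation, the sketch below runs gather, merge, apply and scatter once on a single vertex of a small in-memory graph. It mirrors the shape of the factorized functor but is plain C++ with hypothetical types (gas_graph, gas_pagerank, run_phases). It also assumes an edge weight of 1/L[j] and leaves out the division of the residual by the out-degree. The in-edges are split into two halves only to show that partial gathers can be merged.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical stand-ins for the graph and context; not GraphLab types.
struct gas_graph {
  std::vector<std::vector<std::size_t>> in_nbrs, out_nbrs;
  std::vector<double> rank;
};

struct gas_pagerank {
  double accum = 0.0, residual = 0.0;

  // Gather: contribution of one in-edge, assuming weight 1 / L[source].
  void gather(const gas_graph& g, std::size_t source) {
    accum += g.rank[source] / static_cast<double>(g.out_nbrs[source].size());
  }
  // Merge: sum two partial gather results.
  void merge(const gas_pagerank& other) { accum += other.accum; }
  // Apply: write the accumulated value to the center vertex.
  void apply(gas_graph& g, std::size_t v, double reset_prob) {
    const double old_value = g.rank[v];
    g.rank[v] = reset_prob + (1.0 - reset_prob) * accum;
    residual = std::fabs(g.rank[v] - old_value);
  }
  // Scatter decision: should the out-neighbors be rescheduled?
  bool scatter(double epsilon) const { return residual > epsilon; }
};

// Runs the three phases once on vertex v and reports whether its
// out-neighbors would be rescheduled.
bool run_phases(gas_graph& g, std::size_t v, double reset_prob, double epsilon) {
  assert(v < g.rank.size());
  const std::vector<std::size_t>& in = g.in_nbrs[v];
  const std::size_t half = in.size() / 2;
  gas_pagerank left, right;  // two partial gathers, e.g. on two threads
  for (std::size_t i = 0; i < half; ++i) left.gather(g, in[i]);
  for (std::size_t i = half; i < in.size(); ++i) right.gather(g, in[i]);
  left.merge(right);
  left.apply(g, v, reset_prob);
  return left.scatter(epsilon);
}
```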
Demo of *everything*
PageRank
Ongoing Work
Extensions to improve performance on large graphs. (See the GraphLab talk later!!)
• Better distributed graph representation methods
• Possibly better graph partitioning
• Off-core graph storage
• Continually changing graphs
All-new rewrite of distributed GraphLab (come back in May!)