Grappa: A latency tolerant runtime for large-scale irregular applications
Jacob Nelson, Brandon Holt, Brandon Myers, Preston Briggs, Luis Ceze, Simon Kahan, Mark Oskin
University of Washington
January 21, 2014
A tale of two programmers
(photo: blupics@flickr)
Pat’s problem: traverse an unbalanced tree
• Tree is embedded in a graph
• ~1T edges in graph, ~1B edges in tree
• Power law, low diameter
How about a big shared-memory machine?
How about special purpose hardware?
How about a commodity cluster?
Grappa
• Goal: Provide irregular application programmers what they want
• Global view programming model
• Good small-message performance
• Tasks, threads, latency tolerance
• Fine-grained synchronization
• Load balancing
Where is Grappa in the stack?
Application
Compiler
Library
Runtime
Hardware
Grappa’s system view
[diagram: global tasks and global data spread across cores, each with local DRAM, connected by the network]
Each word of memory has a designated home core. All accesses to that word run on that core.
Main idea: tolerate latency with other work

[animation: Task 1 issues a read() to remote DRAM; while it waits, Task 2 through Task ~1000 run on the same core]
Outline
• Motivation
• Programming Grappa
• Key components
• Performance
• Other projects
Pat’s problem: traverse an unbalanced tree
• Tree is embedded in a graph
• ~1T edges in graph, ~1B edges in tree
• Power law, low diameter
A single node, serial starting point

struct Vertex {
  index_t id;
  Vertex * first_child;
  size_t num_children;
};
A single node, serial starting point

// struct Vertex as before

void search(Vertex * vertex_addr) {
  Vertex v = *vertex_addr;
  Vertex * children = v.first_child;
  for (int i = 0; i < v.num_children; ++i) {
    search(children + i);
  }
}
A single node, serial starting point

// struct Vertex and search() as before

int main(int argc, char * argv[]) {
  Vertex * root = create_tree();
  search(root);
  return 0;
}
The standard boilerplate (not quite right)

// struct Vertex and search() as before

int main(int argc, char * argv[]) {
  Grappa::init(&argc, &argv);

  Vertex * root = create_tree();
  search(root);

  Grappa::finalize();
  return 0;
}
Back to serial in Grappa’s global view

// struct Vertex and search() as before

int main(int argc, char * argv[]) {
  Grappa::init(&argc, &argv);
  Grappa::run([]{
    Vertex * root = create_tree();
    search(root);
  });
  Grappa::finalize();
  return 0;
}
Back to serial in Grappa’s global view

// struct Vertex and search() as before

int main(int argc, char * argv[]) {
  init(&argc, &argv);
  run([]{
    Vertex * root = create_tree();
    search(root);
  });
  finalize();
  return 0;
}
Addressing global memory

struct Vertex {
  index_t id;
  GlobalAddress<Vertex> first_child;
  size_t num_children;
};

void search(GlobalAddress<Vertex> vertex_addr) {
  Vertex v = *vertex_addr;
  GlobalAddress<Vertex> children = v.first_child;
  for (int i = 0; i < v.num_children; ++i) {
    search(children + i);
  }
}

int main(int argc, char * argv[]) {
  init(&argc, &argv);
  run([]{
    GlobalAddress<Vertex> root = create_tree();
    search(root);
  });
  finalize();
  return 0;
}
Accessing global memory

// struct Vertex as before

void search(GlobalAddress<Vertex> vertex_addr) {
  Vertex v = delegate::read(vertex_addr);
  GlobalAddress<Vertex> children = v.first_child;
  for (int i = 0; i < v.num_children; ++i) {
    search(children + i);
  }
}

// main() as before
Global memory with delegates

// struct Vertex as before

void search(GlobalAddress<Vertex> vertex_addr) {
  Vertex v = delegate::call(vertex_addr, [=]{ return *vertex_addr; });
  GlobalAddress<Vertex> children = v.first_child;
  for (int i = 0; i < v.num_children; ++i) {
    search(children + i);
  }
}

// main() as before
Global memory with delegates

void search(GlobalAddress<Vertex> vertex_addr) {
  Vertex v = delegate::read(vertex_addr);
  GlobalAddress<Vertex> children = v.first_child;
  for (int i = 0; i < v.num_children; ++i) {
    search(children + i);
  }
}

int main(int argc, char * argv[]) {
  init(&argc, &argv);
  run([]{
    GlobalAddress<Vertex> root = create_tree();
    search(root);
  });
  finalize();
  return 0;
}
Exposing some parallelism

void search(GlobalAddress<Vertex> vertex_addr) {
  Vertex v = delegate::read(vertex_addr);
  GlobalAddress<Vertex> children = v.first_child;
  for (int i = 0; i < v.num_children; ++i) {
    spawn([=]{ search(children + i); });
  }
}

int main(int argc, char * argv[]) {
  init(&argc, &argv);
  run([]{
    finish([]{
      GlobalAddress<Vertex> root = create_tree();
      search(root);
    });
  });
  finalize();
  return 0;
}
Exposing more parallelism

void search(GlobalAddress<Vertex> vertex_addr) {
  Vertex v = delegate::read(vertex_addr);
  GlobalAddress<Vertex> children = v.first_child;
  forall<unbound,async>(0, v.num_children, [children](int64_t i) {
    search(children + i);
  });
}

int main(int argc, char * argv[]) {
  init(&argc, &argv);
  run([]{
    finish([]{
      GlobalAddress<Vertex> root = create_tree();
      search(root);
    });
  });
  finalize();
  return 0;
}
Delegation is more than just RDMA

struct Vertex {
  index_t id;
  GlobalAddress<Vertex> first_child;
  size_t num_children;
  color_t color;
};

GlobalAddress<int> color_counts = global_alloc<int>(NUM_COLORS);

void search(GlobalAddress<Vertex> vertex_addr) {
  Vertex v = delegate::read(vertex_addr);
  color_t c = v.color;
  bool done = delegate::call(color_counts + c, [c](int & count) {
    if (count == MAX) {
      return true;
    } else {
      count++;
      return false;
    }
  });

  if (done) return;

  GlobalAddress<Vertex> children = v.first_child;
  forall<unbound,async>(0, v.num_children, [children](int64_t i) {
    search(children + i);
  });
}
Outline
• Motivation
• Programming Grappa
• Key components
• Performance
• Other projects
Grappa design
[diagram: on each cluster node (0 through n), the application runs on Grappa; Grappa's tasking and DSM layers sit on an active message/aggregator layer, then GASNet, then the Infiniband network interface; the nodes' memories together form a global address space]
User level context switching

[diagram: each core keeps a task queue and a ready queue of suspended workers; each worker is 1 cacheline of status plus 3 cachelines of stack, small enough to prefetch into the L1 cache]
Main innovation: we keep state small and prefetch to cover DRAM latency
Accessing memory through delegates

[animation: tasks on three different cores each issue var+1; the increments are shipped to var's home core and applied there]

Each word of memory has a designated home core. All accesses to that word run on that core. The requestor blocks until the operation is complete. Since var is private to its home core, updates can be applied directly.
Mitigating low injection rate with aggregation
[diagram, Node 0 to Node n:
1. Queue messages in a linked list.
2. Serialize messages using prefetching.
3. Send over the network as one wire message.
4. Deserialize and execute messages.]
[plot: maximum bandwidth (bytes/second) vs. message size (bytes, 1 to ~1M) for raw MPI, Grappa over raw GASNet, and Grappa with aggregation]
Outline
• Motivation
• Programming Grappa
• Key components
• Performance
• Other projects
A snapshot of current performance
• Current implementation: 15K lines. Usable, but many optimizations still outstanding
• Three questions: Do Grappa’s components work individually? Do Grappa’s components work together? How do we compare with other systems?
• Ran on an AMD Interlagos cluster: 32 2.1 GHz cores, 64 GB, 40 Gb Infiniband. Compared with a 128-processor Cray XMT1 (500 MHz, 128 streams each)
Measuring context switch performance
• Simple: N tasks yield in a loop on a single core. We vary N to see how context switch time changes
• Context switching is isolated here: the system is doing nothing else
of system design on data-intensive workloads, particularly large-scale graph analysis problems, that are important among cybersecurity, informatics, and network-understanding workloads. The BFS benchmark builds a search tree containing parent nodes for each traversed vertex during the search. While this is a relatively simple problem to solve, it exercises the random-access and fine-grained synchronization capabilities of a system as well as being a primitive in many other graph algorithms. Performance is measured in traversed edges per second (TEPS), where the number of edges is the edges making up the generated BFS tree. With some modifications to the XMT reference version of Graph500 BFS, the XMT compiler can be made to recognize and apply a Manhattan loop collapse, exposing enough parallelism to allow it to scale out to 64 nodes for the problem scales we show. In order to make comparison easier, we do not employ algorithmic improvements for any of these versions, though there are many [11, 57]; this makes our results difficult to compare with published Graph500 results. Grappa can be expected to benefit the same as MPI due to decreased communication.
IntSort This sorting benchmark is taken from the NAS Parallel Benchmark Suite [9, 44] and is one on which the Cray XMT’s early predecessor once held the world speed record [2]. The largest problem size, class D, ranks two billion uniformly distributed random integers using either a bucket or a counting sort algorithm, depending on the strengths of the system. Bucket sort executes a greater number of loops, but is able to leverage locality and avoid communication completely in the final phase, ranking within buckets. For these reasons, the MPI reference version and our Grappa implementation use bucket sort. On the other hand, the Cray XMT cannot take advantage of locality, but has an efficient compiler-supported parallel prefix sum, so it performs best using the counting sort algorithm. The performance metrics for NAS Parallel Benchmarks, including IntSort, are “millions of operations per second” (MOPS). For IntSort, this “operation” is ranking a single key, so it is roughly comparable to “GUPS” or “TEPS.”
PageRank This is a common centrality metric for graphs. PageRank is an iterative algorithm with a common pattern of gather, apply, and scatter on the rank of each vertex. The algorithm is often implemented by sparse linear algebra libraries, with the main kernel being the sparse matrix dense vector multiply. For the multiply step, Grappa parallelizes over the rows and parallelizes each dot product. PageRank has the fortunate property that the accumulation function over the in-edges is associative and commutative, so they can be processed in any order or in parallel. Rather than the programmer writing the parallel dot product as local accumulations with a final all-reduce step, we simply send streaming increments to each element of the final vector. We compare PageRank to published results for the Trilinos linear algebra library implemented in MPI [48], and multithreaded PageRank for the XMT [10]. For Grappa, we run on a scale 29 graph using the Graph500 generator.

The metric we use is algorithmic time, which means startup and loading of the data structure (from disk) is not included in the measurement. Grappa collects statistics about application behavior (packets sent, context switches, etc.) and these are discussed where appropriate.
7. Evaluation

The goal of our evaluation is to understand whether the core pieces of the Grappa runtime system, namely our tasking system and the global memory/communication layer, work as expected and whether together they are able to efficiently run irregular applications. We evaluate Grappa in three basic steps:

• We present results that show that Grappa can support large amounts of concurrency, sufficient for remote memory access and aggregation. The communication layer is able to sustain a very high rate of global memory operations. We also show the performance of a graph kernel that stresses communication and concurrency together.

• We characterize system behavior, including profiling where execution time goes, and how aggregation affects message size and rates.

• Finally, we show how some more realistic irregular workloads on Grappa compare to the Cray XMT and hand-tuned MPI code.
7.1. Basic Grappa Performance
User-level context switching Fast context switching is at the heart of Grappa’s latency tolerance abilities. We assess context switch overheads using a simple microbenchmark that runs a configurable number of workers on a single core, where each worker increments values in a large array.
[Figure 5: Average context switch latency (ns) vs. number of workers (0 to 4e5), with and without prefetching]
Figure 5 shows the average context switch time as the number of workers grows. At our standard operating point (≈1K
Context switching is fast

At the 1K-thread operating point: ~50 ns
At 500K threads: 75 ns!
This is switching at the bandwidth limit to DRAM!
Pthreads: 450-800 ns
Measuring random access bandwidth
• The giga-updates-per-second (GUPS) benchmark measures cluster-wide random access bandwidth
• Only one task per core sends messages, so aggregator is essentially isolated
[plot: GUPS vs. number of nodes (8 to 64), for blocking and delegated update types]
Random access BW is good
Theoretical peak at 64 nodes is 6.4 GUPS
Minimal context switching
Unbalanced tree search in memory
• The original UTS benchmark was designed to exercise work stealing. We modified it to include memory access as well
• Creates unbalanced tree in memory and times traversal
• Visiting a vertex requires remote access, so we are context switching, aggregating delegate messages, and work stealing all at the same time
• Metric is vertex throughput
Grappa is able to exploit parallelism in UTS
[plot: MVerts/s vs. number of nodes (8 to 64) on the T1XL tree, for Grappa and the Cray XMT1]
Grappa tolerates latency in UTS
Summary Overall, Grappa provides a general programming model at a moderate performance cost. While Grappa’s core functionality performs well, applications can specialize similar functionality for their problems and obtain better performance than Grappa on the same hardware. This is, however, not the end of the story for Grappa’s performance; Grappa is a young library and, as discussed in the next section, is limited by its implementation rather than the hardware on which it runs. We believe there are opportunities for optimization that will improve its performance.
7.3. Characterization
Table 2: Internal runtime metrics, for 64-node, 16-core-per-node benchmark runs

                                  GUPS    BFS   IntSort    UTS   Pagerank
App. message rate/core (K/s)       984    983      732     411        474
Avg. app. message bytes           31.8   33.9     30.6    47.3       45.5
Network BW per node (MB/s)         478    511      343     298        332
Avg. network message bytes       23.2K   4.3K    12.8K    4.2K       3.0K
Avg. active tasks/core             0.9   58.2      0.9   326.2      429.3
Max. active tasks/core               1    128        1     507       1024
Avg. ready queue length/core       2.3    6.1      1.3   186.1      300.4
Avg. ctx switch rate/core (K/s)   34.2    539      127     543        336
Steal attempts/core/s                0      0        0    54.4          0
Steal successes/core/s               0      0        0    17.0          0
Runtime metrics Table 2 shows a number of internal runtime metrics collected while executing the benchmarks from the previous section. These are per-core averages computed over all 1024 cores of the 64-node, 16-core-per-node jobs.

The first group of four metrics relates to the communication layer. Application messages refer to those issued by the user code; network messages refer to the aggregated packets sent over the wire. The data show that most application messages are only a handful of bytes, but our aggregator is able to turn them into packets of many kilobytes.

The next group of five metrics relates to the scheduler. We show the average and maximum number of concurrently executing tasks, along with the average length of the ready queue and the average context switch rate. The last two lines show the rate of work-stealing from other cores. Only UTS depends on work-stealing for performance; the other applications exploit locality by binding tasks to specific cores. Nevertheless, even in UTS, steals are an infrequent occurrence and account for a small fraction of the execution time.
How much concurrency does Grappa require? How much latency does it add? Grappa depends on concurrency to cover the latency of aggregation and remote communication. How much is required for good performance?

Figure 8 shows a 48-node, 16-core-per-node run of UTS, varying the number of concurrently executing tasks on each
[Figure 8: Throughput (MVerts/s), average delegate latency (ms), and % idle, as the number of concurrent workers (256 to 1024) is varied]
core. The top pane shows the overall throughput of the tree search. The middle pane shows average blocking delegate operation latency in milliseconds. The bottom pane shows idle time; that is, the fraction of the time the scheduler could not find a ready task to run.

We can observe three things from this figure. First, above 512 concurrently executing tasks per core, idle time is practically zero: these tasks generate requests fast enough to cover the latency of aggregation and communication. This matches the results seen in the throughput plot; throughput peaks at 512 workers and gradually decreases after that due to the overhead of unnecessary context switches. Finally, we see that with 512 workers, the average per-request latency is 1.8 ms.
Does Grappa scale? Figures 9, 6, and 7 show scaling out to 64 nodes. Grappa scales best on BFS and worst on IntSort, but is in general able to make use of more nodes. Unfortunately, memory limitations in our current network library keep us from exploring scaling beyond 64 nodes; this will be addressed in future work.

In the limit, aggregation does not scale: the time it takes to aggregate enough random requests to build a buffer of reasonable size scales with the size of the cluster. However, we believe our current approach will work for clusters with hundreds of nodes. In the future, we will explore hierarchical, collective techniques to aggregate requests from multiple nodes as well; we believe this can apply to clusters with thousands of nodes.

What limits Grappa’s performance? The most common operation in Grappa is sending a message. Three key operations occur in a message’s lifetime: creating and enqueuing a message, serializing a message in the aggregator, and deserializing a message at the destination. We benchmarked each of these steps individually to shed light on the mechanism.

Minimum-sized messages can be created and enqueued at a rate of 16M/s, serialized into an aggregation buffer at
With 512 active tasks per core, throughput peaks and idle time is practically zero. (48 nodes, T1XL tree)
Comparing application performance
• Additional benchmarks:
  Breadth-first search: simple version of the Graph500 benchmark
  Integer sort: NAS Parallel Benchmark bucket/counting sort
  PageRank: Google’s web graph centrality metric
• For these three, MPI and XMT currently beat Grappa, but
• Price/performance is still better than XMT
• Grappa code is shorter and simpler than MPI
• There is a cost to Grappa’s generality (but we’re working on reducing that cost!)
Comparing GUPS performance
          Grappa    XMT    MPI
GUPS           1   2.23   0.11
• XMT version: basic GUPS with hardware fetch-and-add
• MPI version: HPCC RandomAccess
• Contains specialized implementation of aggregation, but
• not optimized for out-of-cache
• limited support for concurrent communication
Performance normalized to Grappa, 64 nodes
Comparing UTS-in-memory performance
          Grappa    XMT    MPI
GUPS           1   2.23   0.11
UTS (T1)       1   0.38
• XMT implementation uses fine-grained synchronization at each vertex, which is unnecessary
• Difficult to avoid; baked into compiler, OS, hardware, etc.
• Grappa supports this too, but also allows coarse-grained synchronization
• Grappa also takes advantage of locality in spawns and edge lists
Performance normalized to Grappa, 64 nodes
Comparing BFS performance
          Grappa    XMT    MPI
GUPS           1   2.23   0.11
UTS (T1)       1   0.38
BFS            1   1.63   3.52
• BFS_Simple from Graph500 (no Beamer-like optimizations)
• MPI version includes specialized implementation of aggregation, moving only two 8-byte vertex IDs
• Grappa runs additional code, requires more space to support general aggregation
• BFS messages include an additional 16 bytes of deserialization/synchronization information
Performance normalized to Grappa, 64 nodes
Comparing IntSort performance
          Grappa    XMT    MPI
GUPS           1   2.23   0.11
UTS (T1)       1   0.38
BFS            1   1.63   3.52
IntSort        1   3.59   5.36
• NAS Parallel Benchmarks, class D
• Grappa writes keys directly to destination with delegates
• MPI version implements specialized aggregation with local sort plus collective communication with Alltoallv
• Can we implement aggregation with collectives in Grappa?
Performance normalized to Grappa, 64 nodes
Comparing Pagerank performance
          Grappa    XMT    MPI
GUPS           1   2.23   0.11
UTS (T1)       1   0.38
BFS            1   1.63   3.52
IntSort        1   3.59   5.36
Pagerank       1   4.35   4.87
• Grappa version is a straightforward nested loop
• MPI version uses optimized Trilinos sparse matrix library
Performance normalized to Grappa, 64 nodes
Where are we?
• Fundamentals are strong: context switching is fast, aggregation is fast, and UTS composes them and gets good performance
• There is currently a cost to our generality
• MPI is mostly beating us for now, often replicating what we’re doing in a specialized way, as well as using tricks we may be able to take advantage of
• We are working on our next-generation networking layer now
Outline
• Motivation
• Programming Grappa
• Key components
• Performance
• Other projects
Brandon Holt – Quals – 7 Nov 2013
Compiler support: automatic delegates
For where your data is, there your code will be also.
Best performance comes from executing a task where the data is
– delegate operations execute some small region of code atomically on the core that owns the memory
– the generic delegate::call() executes the enclosed region of the task (expressed in a lambda) remotely

Delegated regions can be inferred automatically
– inspect uses of Grappa global pointers, find regions that use global pointers on a particular core, and extract them into a delegate
– implementing with a custom LLVM pass

Additional ideas
– pass a continuation to allow delegates to hop between multiple cores before returning to the original caller
– specialize multiple versions of delegate regions and select the best one dynamically based on runtime values
int main(int argc, char* argv[]) {
  init(&argc, &argv);
  run([]{
    long global* A = global_alloc<long>(Asize);
    long global* B = global_alloc<long>(Bsize);

    // Manual delegate
    forall(B, Bsize, [=](long& b){
      delegate::call((A+b).core(), [A,b]{
        long * Ab = (A+b).pointer();
        (*Ab) %= b;
      });
    });

    // Automatic delegate
    forall(B, Bsize, [=](long& b){
      A[b] %= b;
    });
  });
  finalize();
}
“Schrödinger” Consistency
Until observed, operations can be committed and not

Delay synchronization as long as possible
– commit when an operation would be able to observe order
– example: pushes are kept local; pops search for an available push

Abstract data structure semantics
– express how operations affect and observe abstract state
– abstract locks allow commutative ops to proceed in parallel, and block conflicting ops to run late
– inverse operations annihilate locally, needing no synchronization
– similar in spirit to “Transactional Boosting”

Maurice Herlihy & Eric Koskinen. Transactional Boosting: A Methodology for Highly-Concurrent Transactional Objects. PPoPP 2008.
Compiling queries for parallel shared memory platforms

Q1(yr) :- R(journal, “type”, “Journal”), R(journal, “title”, “Nature”), R(journal, “issued”, yr)

[diagram: the Datalog query compiles to Grappa parallel shared memory code]
Conclusion
• Extreme latency tolerance helps us scale irregular applications
• Grappa’s runtime system provides a task library, a distributed shared memory system, and a network aggregator
• Context switches are fast, even with many threads
• Random access bandwidth is good
• Aggregation is effective
Questions?