SALSA: Scalable and Low-synchronization NUMA-aware Algorithm for Producer-Consumer Pools
Elad Gidron, Idit Keidar, Dmitri Perelman, Yonathan Perez
New Architectures – New Software Development Challenges
- Increasing number of computing elements → need scalability
- Memory latency more pronounced → need cache-friendliness
- Asymmetric memory access in NUMA multi-CPU systems → need local memory accesses for reduced contention
- Large systems are less predictable → need robustness to unexpected thread stalls and load fluctuations
Producer-Consumer Task Pools
Ubiquitous programming pattern for parallel programs
[Diagram: producers call Put(Task) into a task pool; consumers call Get()]
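The Put/Get pattern above can be sketched as a minimal, single-threaded stand-in (names hypothetical; a real task pool would be concurrent):

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Single-threaded stand-in for the producer-consumer pool interface.
public class TaskPool {
    private final Queue<Runnable> tasks = new ArrayDeque<>();

    public void put(Runnable task) { tasks.add(task); }  // producer: Put(Task)
    public Runnable get() { return tasks.poll(); }       // consumer: Get(); null when empty
}
```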
Typical Implementations I/II
FIFO queue shared by all producers and consumers
[Diagram: producers enqueue Tn .. T1; Consumer 1 does Get(T1), Exec(T1); Consumer 2 does Get(T2), Exec(T2)]
- Inherently not scalable due to contention
- Note: FIFO order applies to task retrieval, not execution
Typical Implementations II/II
Multiple queues with work-stealing
[Diagram: per-consumer queues fed by producers; idle consumers steal from other queues]
- Consumers always pay the overhead of synchronizing with potential stealers
- Load balancing is not trivial
And Now, to Our Approach
- Single-consumer pools as a building block
- Framework for multiple pools with stealing
- SALSA – a novel single-consumer pool
- Evaluation
Building Block: Single Consumer Pool
Possible implementations: FIFO queues, SALSA (coming soon)
[Diagram: producers call Produce(); the SCPool owner calls Consume(); other consumers call Steal()]
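A hypothetical rendering of this building block: one owner consumes, producers insert, and other consumers may steal. A FIFO-backed variant (one of the possible implementations named on the slide) is sketched for illustration:

```java
import java.util.concurrent.ConcurrentLinkedQueue;

// FIFO-backed single-consumer pool sketch (method names are assumptions).
public class FifoSCPool<T> {
    private final ConcurrentLinkedQueue<T> queue = new ConcurrentLinkedQueue<>();

    public boolean produce(T task) { return queue.add(task); }  // producers
    public T consume() { return queue.poll(); }                 // the owner consumer
    public T steal(FifoSCPool<T> victim) {                      // other consumers
        return victim.queue.poll();
    }
}
```

Note that with a plain FIFO queue, the owner's consume() and a steal() contend on the same queue head, which is exactly the synchronization cost SALSA's fast path avoids.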
System Overview
[Diagram: producers call Produce() into per-consumer SCPools; each consumer calls Consume() on its own SCPool and Steal() on others]
Our Management Policy
NUMA-aware policy:
- Producers and consumers are paired
- A producer tries to insert into the closest SCPool
- A consumer tries to steal from the closest SCPool
[Diagram: two NUMA nodes over an interconnect. Memory 1 / CPU 1 hosts cons 1, cons 2, prod 1, prod 2 with SCPools 1-2; Memory 2 / CPU 2 hosts cons 3, cons 4, prod 3, prod 4 with SCPools 3-4.
Prod 2 access list: cons2, cons1, cons3, cons4. Cons 4 access list: cons3, cons1, cons2]
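The access-list policy above amounts to trying pools in order of NUMA distance, as in prod 2's list "cons2, cons1, cons3, cons4". A small illustrative helper (not from the paper) captures the idea:

```java
import java.util.List;
import java.util.function.Predicate;

// Try each pool in proximity order; return the first one where the
// operation (insert or steal) succeeds, or null if all fail.
public class AccessList {
    static <P> P firstAvailable(List<P> poolsByDistance, Predicate<P> tryOp) {
        for (P pool : poolsByDistance)
            if (tryOp.test(pool)) return pool;
        return null;  // every pool failed (e.g. all empty)
    }
}
```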
SCPool Implementation Goals
- Goal 1: "Fast path" – use synchronization and atomic operations only when stealing
- Goal 2: Minimize stealing
- Goal 3: Locality – cache friendliness, low contention
- Goal 4: Load balancing – robustness to stalls and load fluctuations
SALSA – Scalable and Low Synchronization Algorithm
SCPool implementation:
- Synchronization-free when no stealing occurs → low contention
- Tasks held in page-size chunks → cache-friendly
- Consumers steal entire chunks of tasks → reduces the number of steals
- Producer-based load balancing → robust to stalls and load fluctuations
SALSA Overview
[Diagram: per-producer chunk lists (prod 0 .. prod n-1) plus a steal list; each chunk has an owner field (owner=c1), slots 0-4 holding a task, ⊥, or TAKEN, and an idx marker (e.g. idx=-1, idx=2, idx=4)]
- Tasks are kept in chunks, organized in per-producer chunk lists
- Each chunk is owned by one consumer – the only one taking tasks from it
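A sketch of the chunk described above: a fixed-size task array plus an owner field. Only the owner consumer takes tasks from a chunk, which is what makes the fast path synchronization-free. Field names and the size are assumptions, not the paper's exact code:

```java
import java.util.concurrent.atomic.AtomicReference;

// One SALSA chunk: slots hold a task, ⊥ (null here), or TAKEN.
public class Chunk {
    static final int CHUNK_SIZE = 1000;             // roughly page-sized, per the deck
    final Object[] tasks = new Object[CHUNK_SIZE];
    final AtomicReference<String> owner;            // CASed by stealing consumers
    int idx = -1;                                   // owner's progress; -1 = nothing taken yet

    Chunk(String ownerId) { owner = new AtomicReference<>(ownerId); }
}
```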
SALSA Fast Path (No Stealing)
- Producer: put the new value, then increment its local index
- Consumer: increment idx, verify ownership, change the chunk entry to TAKEN
- No strong atomic operations; cache-friendly; extremely lightweight
[Diagram: chunk owned by c1; idx advances from 0 to 1 as a Task slot becomes TAKEN; the producer keeps its own local index]
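The three consumer steps above can be sketched as follows. This is a minimal, single-threaded illustration (assumption: the owner runs alone and no stealer is active, so plain reads and writes suffice); real SALSA adds the CAS-based steal path omitted here:

```java
// Fast-path sketch: no CAS, no fences when no stealing occurs.
public class FastPath {
    static final Object TAKEN = new Object();
    final Object[] slots = new Object[1000];  // task, null (⊥), or TAKEN
    String owner = "c1";                      // changed only by stealers (not shown)
    int prodIdx = 0;                          // producer's local index
    int idx = -1;                             // owner's last-taken slot

    // Producer: put the new value, then advance the local index.
    void produce(Object task) { slots[prodIdx++] = task; }

    // Owner consumer: advance idx, verify ownership, mark the slot TAKEN.
    Object consume(String me) {
        Object t = slots[idx + 1];
        if (t == null) return null;           // next slot not yet filled
        idx++;                                // step 1: increment idx
        if (!owner.equals(me)) return null;   // step 2: chunk stolen; fall back to steal path
        slots[idx] = TAKEN;                   // step 3: plain write – no CAS
        return t;
    }
}
```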
Chunk Stealing
- Steal a whole chunk of tasks – reduces the number of steal operations
- The stealing consumer changes the chunk's owner field
- When a consumer sees its chunk has been stolen, it takes one task using CAS and leaves the chunk
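The two CAS operations involved can be sketched as follows (illustrative names): the stealer takes the whole chunk by CASing the owner field, and the contested slot is then claimed with a CAS by whichever consumer wins:

```java
import java.util.concurrent.atomic.AtomicReference;
import java.util.concurrent.atomic.AtomicReferenceArray;

// Chunk-stealing sketch: one CAS transfers every remaining task at once.
public class ChunkSteal {
    static final Object TAKEN = new Object();
    final AtomicReference<String> owner = new AtomicReference<>("c1");
    final AtomicReferenceArray<Object> slots = new AtomicReferenceArray<>(1000);

    // Stealing consumer: take over the chunk by CASing the owner field.
    boolean stealChunk(String expectedOwner, String thief) {
        return owner.compareAndSet(expectedOwner, thief);
    }

    // Either consumer racing on the contested slot claims the task via CAS.
    Object takeWithCas(int i) {
        Object t = slots.get(i);
        if (t == null || t == TAKEN) return null;
        return slots.compareAndSet(i, t, TAKEN) ? t : null;
    }
}
```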
Stealing
Stealing is complicated: data races, liveness issues, and the fast path means no memory fences
See details in the paper
Chunk Pools & Load Balancing
Where do chunks come from?
- A pool of free chunks per consumer
- Stolen chunks move to the stealer's pool
- If a consumer's chunk pool is empty, producers go elsewhere; the same holds for slow consumers
[Diagram: a fast consumer accumulates a large chunk pool, so producers can keep inserting tasks; chunk stealing yields automatic load balancing]
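The per-consumer chunk pool above can be sketched as a single-threaded stand-in: producers draw free chunks from the consumer they want to feed, and a dry pool steers producers (and therefore new tasks) toward other consumers, which is the load-balancing effect described:

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Per-consumer pool of free chunks (illustrative, single-threaded).
public class ConsumerChunkPool {
    private final Deque<Object> freeChunks = new ArrayDeque<>();

    void putFreeChunk(Object chunk) { freeChunks.push(chunk); }

    // Called by a producer that filled its current chunk; null means
    // "this consumer has no free chunks – insert elsewhere".
    Object takeFreeChunk() { return freeChunks.poll(); }
}
```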
Getting It Right
- Liveness: operations are lock-free – we ensure progress whenever operations fail due to steals
- Safety: linearizability mandates that Get() return Null only if all pools were simultaneously empty – tricky
Evaluation - Compared Algorithms
- SALSA
- SALSA+CAS: every consume operation uses CAS (no fast-path optimization)
- ConcBag: Concurrent Bags algorithm [Sundell et al. 2011] – per-producer chunk list, but requires CAS for consume, and stealing granularity is a single task
- WS-MSQ: work-stealing based on Michael-Scott queues [M. M. Michael and M. L. Scott 1996]
- WS-LIFO: work-stealing based on Michael's LIFO stacks [M. M. Michael 2004]
System Throughput
Balanced workload: N producers, N consumers
- Linearly scalable
- x20 faster than work-stealing with MSQ
- x5 faster than state-of-the-art Concurrent Bags
[Chart: throughput (1000 tasks/msec) vs. number of consumers (1-31) for SALSA, SALSA+CAS, ConcBag, WS-MSQ, WS-LIFO]
Highly Contended Workloads: 1 Producer, N Consumers
- Effective load balancing
- High contention among stealers
- Other algorithms suffer throughput degradation
[Charts: throughput, and CAS operations per task retrieval (0-4), vs. number of consumers (1-31)]
Producer-Based Balancing in Highly Contended Workload
- 50% faster with balancing
[Chart: throughput (1000 tasks/msec) vs. number of consumers (1-31) for WS-SALSA, WS-SALSAwCAS, WS-SALSA no migration, WS-SALSAwCAS no migration]
NUMA effects
- Performance degradation is small as long as the interconnect / memory controller is not saturated
- Affinity hardly matters as long as you are cache-effective
- Memory allocations should be decentralized
Conclusions
Techniques for improving performance:
- Lightweight, synchronization-free fast path
- NUMA-aware memory management (most data accesses are inside NUMA nodes)
- Chunk-based stealing amortizes stealing costs
- Elegant load balancing using per-consumer chunk pools
Great performance:
- Linear scalability
- x20 faster than other work-stealing techniques, x5 faster than state-of-the-art non-FIFO pools
- Highly robust to imbalances and unexpected thread stalls
Backup
Chunk size
- The optimal chunk size for SALSA is 1000 tasks, which is about the size of a page
- This may allow chunks to migrate from one node to another when stealing
[Chart: throughput (1000 tasks/msec) vs. number of tasks in a chunk (16 to 2000) for SALSA, SALSA+CAS, ConcBag]
Chunk Stealing - Overview
Steps taken by the stealer:
1. Point to the chunk from the special "steal list"
2. Update the ownership via CAS
3. Remove the chunk from the original list
4. CAS the entry at idx + 1 from Task to TAKEN
[Diagram: consumer c2 steals a chunk from c1's prod0 list via its steal list; the owner field changes from c1 to c2]
Chunk Stealing – Case 1
Stealing consumer (c2): 1. change ownership with CAS; 2. i ← original idx; 3. take the task at i+1 with CAS
Original consumer (c1): 1. idx++; 2. verify ownership
- If still the owner, take the task at idx without a CAS
- Otherwise, take the task at idx with a CAS and leave the chunk
[Diagram: c1 takes task 1 at idx=1 while c2, reading i=1, takes task 2]
Chunk Stealing – Case 2
Same steps as Case 1, but here both consumers race for the same task
[Diagram: c1 advances idx to 1 and tries to take task 1; c2, reading i=0, also tries task 1 – the CAS on the slot resolves the race]
Chunk Lists
- The lists are managed by the producers; empty nodes are lazily removed
- When a producer fills a chunk, it takes a new chunk from the chunk pool and adds it to the list
- List nodes are not stolen, so a node's idx is updated by the owner only
- Chunks must be stolen from the owner's list, to make sure the correct idx field is read
[Diagram: prod0's chunk list; node idx values (idx=4, idx=2, idx=-1) track progress through chunks whose slots hold a task, ⊥, or TAKEN]
NUMA – Non Uniform Memory Access
Systems with a large number of processors may have high contention on the memory bus
In NUMA systems, every processor has its own memory controller connected to a memory bank; accessing remote memory is more expensive
[Diagram: four CPUs, each attached to its own memory bank, connected by an interconnect]