SALSA: Scalable and Low-synchronization NUMA-aware Algorithm for Producer-Consumer Pools
Elad Gidron, Idit Keidar, Dmitri Perelman, Yonathan Perez
New Architectures – New Software Development Challenges
- Increasing number of computing elements → need scalability
- Memory latency more pronounced → need cache-friendliness
- Asymmetric memory access in NUMA multi-CPU systems → need local memory accesses for reduced contention
- Large systems are less predictable → need robustness to unexpected thread stalls and load fluctuations
Producer-Consumer Task Pools
Ubiquitous programming pattern for parallel programs
[Diagram: producers call Put(Task) into a task pool; consumers call Get()]
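The Put/Get pattern above can be sketched as a minimal, single-threaded stand-in (names hypothetical; a real task pool would be concurrent):

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Single-threaded stand-in for the producer-consumer pool interface.
public class TaskPool {
    private final Queue<Runnable> tasks = new ArrayDeque<>();

    public void put(Runnable task) { tasks.add(task); }  // producer: Put(Task)
    public Runnable get() { return tasks.poll(); }       // consumer: Get(); null when empty
}
```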
Typical Implementations I/II
FIFO queue shared by all producers and consumers
[Diagram: producers enqueue Tn .. T1; Consumer 1 does Get(T1), Exec(T1); Consumer 2 does Get(T2), Exec(T2)]
- Inherently not scalable due to contention
- Note: FIFO order applies to task retrieval, not execution
Typical Implementations II/II
Multiple queues with work-stealing
[Diagram: per-consumer queues fed by producers; idle consumers steal from other queues]
- Consumers always pay the overhead of synchronizing with potential stealers
- Load balancing is not trivial
And Now, to Our Approach
- Single-consumer pools as a building block
- Framework for multiple pools with stealing
- SALSA – a novel single-consumer pool
- Evaluation
Building Block: Single Consumer Pool
Possible implementations: FIFO queues, SALSA (coming soon)
[Diagram: producers call Produce(); the SCPool owner calls Consume(); other consumers call Steal()]
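A hypothetical rendering of this building block: one owner consumes, producers insert, and other consumers may steal. A FIFO-backed variant (one of the possible implementations named on the slide) is sketched for illustration:

```java
import java.util.concurrent.ConcurrentLinkedQueue;

// FIFO-backed single-consumer pool sketch (method names are assumptions).
public class FifoSCPool<T> {
    private final ConcurrentLinkedQueue<T> queue = new ConcurrentLinkedQueue<>();

    public boolean produce(T task) { return queue.add(task); }  // producers
    public T consume() { return queue.poll(); }                 // the owner consumer
    public T steal(FifoSCPool<T> victim) {                      // other consumers
        return victim.queue.poll();
    }
}
```

Note that with a plain FIFO queue, the owner's consume() and a steal() contend on the same queue head, which is exactly the synchronization cost SALSA's fast path avoids.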
System Overview
[Diagram: producers call Produce() into per-consumer SCPools; each consumer calls Consume() on its own SCPool and Steal() on others]
Our Management Policy
NUMA-aware policy:
- Producers and consumers are paired
- A producer tries to insert into the closest SCPool
- A consumer tries to steal from the closest SCPool
[Diagram: two NUMA nodes over an interconnect. Memory 1 / CPU 1 hosts cons 1, cons 2, prod 1, prod 2 with SCPools 1-2; Memory 2 / CPU 2 hosts cons 3, cons 4, prod 3, prod 4 with SCPools 3-4.
Prod 2 access list: cons2, cons1, cons3, cons4. Cons 4 access list: cons3, cons1, cons2]
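The access-list policy above amounts to trying pools in order of NUMA distance, as in prod 2's list "cons2, cons1, cons3, cons4". A small illustrative helper (not from the paper) captures the idea:

```java
import java.util.List;
import java.util.function.Predicate;

// Try each pool in proximity order; return the first one where the
// operation (insert or steal) succeeds, or null if all fail.
public class AccessList {
    static <P> P firstAvailable(List<P> poolsByDistance, Predicate<P> tryOp) {
        for (P pool : poolsByDistance)
            if (tryOp.test(pool)) return pool;
        return null;  // every pool failed (e.g. all empty)
    }
}
```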
SCPool Implementation Goals
- Goal 1: "Fast path" – use synchronization and atomic operations only when stealing
- Goal 2: Minimize stealing
- Goal 3: Locality – cache friendliness, low contention
- Goal 4: Load balancing – robustness to stalls and load fluctuations
SALSA – Scalable and Low Synchronization Algorithm
SCPool implementation:
- Synchronization-free when no stealing occurs → low contention
- Tasks held in page-size chunks → cache-friendly
- Consumers steal entire chunks of tasks → reduces the number of steals
- Producer-based load balancing → robust to stalls and load fluctuations
SALSA Overview
[Diagram: per-producer chunk lists (prod 0 .. prod n-1) plus a steal list; each chunk has an owner field (owner=c1), slots 0-4 holding a task, ⊥, or TAKEN, and an idx marker (e.g. idx=-1, idx=2, idx=4)]
- Tasks are kept in chunks, organized in per-producer chunk lists
- Each chunk is owned by one consumer – the only one taking tasks from it
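A sketch of the chunk described above: a fixed-size task array plus an owner field. Only the owner consumer takes tasks from a chunk, which is what makes the fast path synchronization-free. Field names and the size are assumptions, not the paper's exact code:

```java
import java.util.concurrent.atomic.AtomicReference;

// One SALSA chunk: slots hold a task, ⊥ (null here), or TAKEN.
public class Chunk {
    static final int CHUNK_SIZE = 1000;             // roughly page-sized, per the deck
    final Object[] tasks = new Object[CHUNK_SIZE];
    final AtomicReference<String> owner;            // CASed by stealing consumers
    int idx = -1;                                   // owner's progress; -1 = nothing taken yet

    Chunk(String ownerId) { owner = new AtomicReference<>(ownerId); }
}
```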
SALSA Fast Path (No Stealing)
- Producer: put the new value, then increment its local index
- Consumer: increment idx, verify ownership, change the chunk entry to TAKEN
- No strong atomic operations; cache-friendly; extremely lightweight
[Diagram: chunk owned by c1; idx advances from 0 to 1 as a Task slot becomes TAKEN; the producer keeps its own local index]
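The three consumer steps above can be sketched as follows. This is a minimal, single-threaded illustration (assumption: the owner runs alone and no stealer is active, so plain reads and writes suffice); real SALSA adds the CAS-based steal path omitted here:

```java
// Fast-path sketch: no CAS, no fences when no stealing occurs.
public class FastPath {
    static final Object TAKEN = new Object();
    final Object[] slots = new Object[1000];  // task, null (⊥), or TAKEN
    String owner = "c1";                      // changed only by stealers (not shown)
    int prodIdx = 0;                          // producer's local index
    int idx = -1;                             // owner's last-taken slot

    // Producer: put the new value, then advance the local index.
    void produce(Object task) { slots[prodIdx++] = task; }

    // Owner consumer: advance idx, verify ownership, mark the slot TAKEN.
    Object consume(String me) {
        Object t = slots[idx + 1];
        if (t == null) return null;           // next slot not yet filled
        idx++;                                // step 1: increment idx
        if (!owner.equals(me)) return null;   // step 2: chunk stolen; fall back to steal path
        slots[idx] = TAKEN;                   // step 3: plain write – no CAS
        return t;
    }
}
```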
Chunk Stealing
- Steal a whole chunk of tasks – reduces the number of steal operations
- The stealing consumer changes the chunk's owner field
- When a consumer sees its chunk has been stolen, it takes one task using CAS and leaves the chunk
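The two CAS operations involved can be sketched as follows (illustrative names): the stealer takes the whole chunk by CASing the owner field, and the contested slot is then claimed with a CAS by whichever consumer wins:

```java
import java.util.concurrent.atomic.AtomicReference;
import java.util.concurrent.atomic.AtomicReferenceArray;

// Chunk-stealing sketch: one CAS transfers every remaining task at once.
public class ChunkSteal {
    static final Object TAKEN = new Object();
    final AtomicReference<String> owner = new AtomicReference<>("c1");
    final AtomicReferenceArray<Object> slots = new AtomicReferenceArray<>(1000);

    // Stealing consumer: take over the chunk by CASing the owner field.
    boolean stealChunk(String expectedOwner, String thief) {
        return owner.compareAndSet(expectedOwner, thief);
    }

    // Either consumer racing on the contested slot claims the task via CAS.
    Object takeWithCas(int i) {
        Object t = slots.get(i);
        if (t == null || t == TAKEN) return null;
        return slots.compareAndSet(i, t, TAKEN) ? t : null;
    }
}
```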
Stealing
Stealing is complicated: data races, liveness issues, and the fast path means no memory fences
See details in the paper
Chunk Pools & Load Balancing
Where do chunks come from?
- A pool of free chunks per consumer
- Stolen chunks move to the stealer's pool
- If a consumer's chunk pool is empty, producers go elsewhere; the same holds for slow consumers
[Diagram: a fast consumer accumulates a large chunk pool, so producers can keep inserting tasks; chunk stealing yields automatic load balancing]
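The per-consumer chunk pool above can be sketched as a single-threaded stand-in: producers draw free chunks from the consumer they want to feed, and a dry pool steers producers (and therefore new tasks) toward other consumers, which is the load-balancing effect described:

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Per-consumer pool of free chunks (illustrative, single-threaded).
public class ConsumerChunkPool {
    private final Deque<Object> freeChunks = new ArrayDeque<>();

    void putFreeChunk(Object chunk) { freeChunks.push(chunk); }

    // Called by a producer that filled its current chunk; null means
    // "this consumer has no free chunks – insert elsewhere".
    Object takeFreeChunk() { return freeChunks.poll(); }
}
```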
Getting It Right
- Liveness: operations are lock-free – we ensure progress whenever operations fail due to steals
- Safety: linearizability mandates that Get() return Null only if all pools were simultaneously empty – tricky
Evaluation - Compared Algorithms
- SALSA
- SALSA+CAS: every consume operation uses CAS (no fast-path optimization)
- ConcBag: Concurrent Bags algorithm [Sundell et al. 2011] – per-producer chunk list, but requires CAS for consume, and stealing granularity is a single task
- WS-MSQ: work-stealing based on Michael-Scott queues [M. M. Michael and M. L. Scott 1996]
- WS-LIFO: work-stealing based on Michael's LIFO stacks [M. M. Michael 2004]
System Throughput
Balanced workload: N producers, N consumers
- Linearly scalable
- x20 faster than work-stealing with MSQ
- x5 faster than state-of-the-art Concurrent Bags
[Chart: throughput (1000 tasks/msec) vs. number of consumers (1-31) for SALSA, SALSA+CAS, ConcBag, WS-MSQ, WS-LIFO]
Highly Contended Workloads: 1 Producer, N Consumers
- Effective load balancing
- High contention among stealers
- Other algorithms suffer throughput degradation
[Charts: throughput, and CAS operations per task retrieval (0-4), vs. number of consumers (1-31)]
Producer-Based Balancing in Highly Contended Workload
- 50% faster with balancing
[Chart: throughput (1000 tasks/msec) vs. number of consumers (1-31) for WS-SALSA, WS-SALSAwCAS, WS-SALSA no migration, WS-SALSAwCAS no migration]
NUMA effects
- Performance degradation is small as long as the interconnect / memory controller is not saturated
- Affinity hardly matters as long as you are cache-effective
- Memory allocations should be decentralized
Conclusions
Techniques for improving performance:
- Lightweight, synchronization-free fast path
- NUMA-aware memory management (most data accesses are inside NUMA nodes)
- Chunk-based stealing amortizes stealing costs
- Elegant load balancing using per-consumer chunk pools
Great performance:
- Linear scalability
- x20 faster than other work-stealing techniques, x5 faster than state-of-the-art non-FIFO pools
- Highly robust to imbalances and unexpected thread stalls
Backup
Chunk size
- The optimal chunk size for SALSA is 1000 tasks, which is about the size of a page
- This may allow chunks to migrate from one node to another when stealing
[Chart: throughput (1000 tasks/msec) vs. number of tasks in a chunk (16 to 2000) for SALSA, SALSA+CAS, ConcBag]
Chunk Stealing - Overview
Steps taken by the stealer:
1. Point to the chunk from the special "steal list"
2. Update the ownership via CAS
3. Remove the chunk from the original list
4. CAS the entry at idx + 1 from Task to TAKEN
[Diagram: consumer c2 steals a chunk from c1's prod0 list via its steal list; the owner field changes from c1 to c2]
Chunk Stealing – Case 1
Stealing consumer (c2): 1. change ownership with CAS; 2. i ← original idx; 3. take the task at i+1 with CAS
Original consumer (c1): 1. idx++; 2. verify ownership
- If still the owner, take the task at idx without a CAS
- Otherwise, take the task at idx with a CAS and leave the chunk
[Diagram: c1 takes task 1 at idx=1 while c2, reading i=1, takes task 2]
Chunk Stealing – Case 2
Same steps as Case 1, but here both consumers race for the same task
[Diagram: c1 advances idx to 1 and tries to take task 1; c2, reading i=0, also tries task 1 – the CAS on the slot resolves the race]
Chunk Lists
- The lists are managed by the producers; empty nodes are lazily removed
- When a producer fills a chunk, it takes a new chunk from the chunk pool and adds it to the list
- List nodes are not stolen, so a node's idx is updated by the owner only
- Chunks must be stolen from the owner's list, to make sure the correct idx field is read
[Diagram: prod0's chunk list; node idx values (idx=4, idx=2, idx=-1) track progress through chunks whose slots hold a task, ⊥, or TAKEN]
NUMA – Non Uniform Memory Access
Systems with a large number of processors may have high contention on the memory bus
In NUMA systems, every processor has its own memory controller connected to a memory bank; accessing remote memory is more expensive
[Diagram: four CPUs, each attached to its own memory bank, connected by an interconnect]