NUMA-aware algorithms: the case of data shuffling

Page 1: NUMA-aware algorithms: the case of data shuffling


NUMA-aware algorithms: the case of data shuffling

Yinan Li* Ippokratis Pandis Rene Mueller Vijayshankar Raman Guy Lohman

*University of Wisconsin - Madison IBM Almaden Research Center

Page 2: NUMA-aware algorithms: the case of data shuffling


Hardware is a moving target

• Different degrees of parallelism, # sockets and memory hierarchies
• Different types of CPUs (SSE, out-of-order vs in-order, 2- vs 4- vs 8-way SMT, …), storage technologies …

Very difficult to optimize & maintain data management code for every HW platform

[Figure: example platforms — 2-socket, 4-socket (a), 4-socket (b), 8-socket; Intel-based, POWER-based, Cloud]

Page 3: NUMA-aware algorithms: the case of data shuffling

NUMA effects => underutilize RAM bandwidth

[Figure: 4-socket system — Socket 0–3, each with its own local memory, interconnected by QPI links; hop counts annotated per link]

                                            Bandwidth, seq. access    Latency, data-dependent random
                                            (12 threads)              access (1 thread)
Local memory access                         24.7 GB/s                 340 cycles/access (~150 ns/access)
Remote memory, 1 hop                        10.9 GB/s                 420 cycles/access (~185 ns/access)
Remote memory, 2 hops                       10.9 GB/s                 520 cycles/access (~230 ns/access)
Remote memory, 2 hops with cross traffic     5.3 GB/s                 530 cycles/access (~235 ns/access)

Sequential accesses are not the final solution
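To make the local-vs-remote gap above concrete, the following is a minimal C sketch (not part of the original slides) of how such a measurement can be taken with libnuma: place a buffer on one socket, pin the reading thread to another, and time a sequential scan. The node ids, buffer size, and timing harness are illustrative; compile with something like gcc -O2 ... -lnuma.

    #include <numa.h>      /* numa_available, numa_alloc_onnode, numa_run_on_node, numa_free */
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>
    #include <time.h>

    #define BUF_BYTES (1UL << 30)   /* 1 GiB read stream (illustrative size) */

    int main(void) {
        if (numa_available() < 0) { fprintf(stderr, "no NUMA support\n"); return 1; }

        int mem_node = 0;   /* socket that owns the data */
        int cpu_node = 1;   /* socket doing the reading; set equal to mem_node for the local case */

        uint64_t *buf = numa_alloc_onnode(BUF_BYTES, mem_node);
        if (!buf) { fprintf(stderr, "allocation failed\n"); return 1; }
        memset(buf, 1, BUF_BYTES);            /* touch the pages so they are really placed on mem_node */

        numa_run_on_node(cpu_node);           /* pin the reader to the chosen socket */

        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        uint64_t sum = 0;
        for (size_t i = 0; i < BUF_BYTES / sizeof(uint64_t); i++)
            sum += buf[i];                    /* sequential read stream */
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double sec = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
        printf("%.2f GB/s (checksum %llu)\n", BUF_BYTES / sec / 1e9, (unsigned long long)sum);

        numa_free(buf, BUF_BYTES);
        return 0;
    }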

Page 4: NUMA-aware algorithms: the case of data shuffling

Use case: data shuffling

• Each of the N threads needs to send data to the N-1 other threads
• Common operation:
  • Sort-merge join
  • Partitioned aggregation
  • MapReduce
    • Both Map and Reduce shuffle data
  • Scatter/gather

Ignoring NUMA leaves perf. on the table

[Figure: bandwidth on a 4-socket machine — sequential but naïve vs. coordinated NUMA-aware shuffling]

Page 5: NUMA-aware algorithms: the case of data shuffling

NUMA-aware data mgmt. operations

• Tons of work on SMPs & NUMA 1.0
• Sort-merge join [Albutiu et al. VLDB 2012]
  • Favor sequential accesses over random probes
• OLTP on HW Islands [Porobic et al. VLDB 2012]
  • Should we treat multisocket multicores as a cluster?

There are many different data operations that need similar optimizations

Page 6: NUMA-aware algorithms: the case of data shuffling

Need for primitives

• Kernels used frequently in data management operations
  • E.g. sorting, hashing, data shuffling, …
• Highly optimized software solutions
  • Similar to BLAS
  • Optimized by skilled devs per new HW platform
• Hardware-based solutions
  • Database machines 2.0 (see Bionic DBMSs talk this afternoon)
  • If a kernel is very important, it can be burnt into HW
  • Expensive, but orders of magnitude more efficient (perf., energy)
  • Companies like IBM and Oracle can do vertical engineering

Page 7: NUMA-aware algorithms: the case of data shuffling

Outline

• Introduction
• NUMA 2.0 and related work
• Data shuffling
  • Ring shuffling
  • Thread migration
• Evaluation
• Conclusions

Page 8: NUMA-aware algorithms: the case of data shuffling

Data shuffling & naïve implementation

• N threads produce N-1 partitions for all other threads
• Each thread needs to read its partitions
• N * (N-1) transfers
• Assume uniform sizes of partitions

[Figure: partition layout before and after the shuffle]

• Naïve implementation
  • Each thread acting autonomously:

      for (thread = 0; thread < N; thread++)
          readMyPartitionFrom(thread);

  How bad can that be?
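As a rough illustration of "each thread acting autonomously", here is a minimal, self-contained C sketch (not the authors' code). N, PART_WORDS, and the single static partitions array are illustrative; a real implementation would allocate each producer's partitions on its own socket (e.g., with numa_alloc_onnode) and carry real payloads. Compile with -pthread.

    #include <pthread.h>
    #include <stdint.h>
    #include <stdio.h>

    #define N 8                          /* threads; each produces one partition per consumer */
    #define PART_WORDS 1024              /* illustrative partition size */

    /* partitions[src][dst]: data produced by thread src for thread dst */
    static uint64_t partitions[N][N][PART_WORDS];
    static uint64_t checksums[N];

    /* Naïve, uncoordinated shuffle: every consumer walks the producers in the same
     * order 0..N-1, so most threads hit the same source socket (and its memory
     * channel) at the same time while the other channels and QPI links sit idle. */
    static void *naive_shuffle(void *arg) {
        int self = (int)(intptr_t)arg;
        uint64_t sum = 0;
        for (int src = 0; src < N; src++)            /* everyone starts at src = 0 ... */
            for (int i = 0; i < PART_WORDS; i++)
                sum += partitions[src][self][i];     /* "read my partition from src" */
        checksums[self] = sum;
        return NULL;
    }

    int main(void) {
        pthread_t t[N];
        for (long i = 0; i < N; i++) pthread_create(&t[i], NULL, naive_shuffle, (void *)i);
        for (int i = 0; i < N; i++)  pthread_join(t[i], NULL);
        printf("done, checksum[0] = %llu\n", (unsigned long long)checksums[0]);
        return 0;
    }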

Page 9: NUMA-aware algorithms: the case of data shuffling

Shuffling naively in a NUMA system

[Figure: naïve uncoordinated shuffling — usage of QPI and memory paths across steps 1, 2, 3, … for threads T0–T7]

[Chart: bandwidth (GB/s) vs. # threads (1–32) for the naïve scheme, with reference lines for the max memory BW of one channel and the aggregate BW of all channels]

BUT we bought 4 memory channels and 6 QPIs
Need to orchestrate threads/transfers to utilize the rest

Page 10: NUMA-aware algorithms: the case of data shuffling

Ring shuffling

• Devise a global schedule and all threads follow it (see the sketch below)
  • Inner ring: partitions ordered by thread number, socket; stationary
  • Outer ring: threads ordered by socket, thread number; rotates
• Can be executed in lock-step or loosely
• Needs:
  • Thread binding & synchronization
  • Control over the location of mem. allocations

[Figure: ring diagram — an outer ring of threads (s0.t0, s0.t1, s1.t0, s1.t1, …) rotating around an inner ring of partitions (s0.p0, s1.p0, s2.p0, s3.p0, s0.p1, …)]
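One way such a coordinated schedule can look in code is sketched below, reusing N, PART_WORDS, partitions, and checksums from the naïve sketch above. It is a drop-in replacement for naive_shuffle(), not the authors' implementation, and the slides' exact inner/outer ring ordering may differ; SOCKETS, TPS (socket-major thread numbering), and the pthread barrier are assumptions for illustration, and memory placement / thread binding are omitted.

    /* Coordinated ring-style shuffle: in every step, the threads of each socket read
     * from a different source socket than the threads of every other socket, and the
     * source rotates over time, so all memory channels and QPI links stay busy. */
    #define SOCKETS 4
    #define TPS (N / SOCKETS)            /* threads per socket; threads numbered socket-major */

    static pthread_barrier_t step_barrier;   /* init once: pthread_barrier_init(&step_barrier, NULL, N); */

    static int source_of(int self, int step) {
        int my_socket = self / TPS, my_lane = self % TPS;
        int src_socket = (my_socket + step / TPS) % SOCKETS;  /* which socket to read from now */
        int src_lane   = (my_lane + step) % TPS;              /* which thread on that socket */
        return src_socket * TPS + src_lane;
    }

    static void *ring_shuffle(void *arg) {
        int self = (int)(intptr_t)arg;
        uint64_t sum = 0;
        for (int step = 0; step < N; step++) {
            int src = source_of(self, step);             /* every partition visited exactly once */
            for (int i = 0; i < PART_WORDS; i++)
                sum += partitions[src][self][i];
            pthread_barrier_wait(&step_barrier);         /* lock-step; drop this for loose execution */
        }
        checksums[self] = sum;
        return NULL;
    }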

Page 11: NUMA-aware algorithms: the case of data shuffling

Ring shuffling in action

[Figure: ring shuffling — usage of QPI and memory paths across steps 1, 2, 3, … for threads T0–T7]

[Chart: bandwidth (GB/s) vs. # threads (1–32), ring shuffling vs. naïve, with a reference line for the aggregate BW of all channels]

Orchestrated traffic utilizes underlying QPI network

Page 12: NUMA-aware algorithms: the case of data shuffling

Thread migration instead of shuffling

• Move computation to the data instead of shuffling the data (see the sketch below)
  • Convert accesses into local memory reads
• Choice of migrating only the thread, or the thread + its state
  • But both are very sensitive to the amount of thread state

[Chart: bandwidth (GB/s) vs. # threads (1–32) — thread migration vs. ring shuffling vs. naïve, with a reference line for the aggregate BW of all channels]
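A minimal sketch of the migration idea, again as a variant of the earlier shuffle sketches rather than the authors' code: the consumer re-binds itself to the socket that owns each partition before reading it, turning remote transfers into local reads. numa_run_on_node() is the libnuma call for re-binding; node_of_partition(), the socket-major thread numbering, and the premise that each producer allocated its partitions locally are assumptions. Link with -lnuma.

    #include <numa.h>    /* provides numa_run_on_node() */

    /* Socket that holds the partitions produced by thread src; assumes threads
     * are numbered socket-major and each producer allocated its output locally. */
    static int node_of_partition(int src) { return src / TPS; }

    static void *migrating_consumer(void *arg) {
        int self = (int)(intptr_t)arg;
        uint64_t sum = 0;                                /* tiny thread state carried along */
        for (int src = 0; src < N; src++) {
            numa_run_on_node(node_of_partition(src));    /* move the computation to the data */
            for (int i = 0; i < PART_WORDS; i++)
                sum += partitions[src][self][i];         /* now (ideally) a local memory read */
        }
        /* If the state were a large hash table instead of one counter, migrating
         * (or copying) it on every hop would quickly eat up the benefit. */
        checksums[self] = sum;
        return NULL;
    }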

Page 13: NUMA-aware algorithms: the case of data shuffling

Outline

• Introduction
• NUMA 2.0 and related work
• Data shuffling
• Evaluation
• Conclusions

Page 14: NUMA-aware algorithms: the case of data shuffling

Shuffling benchmark – peak bandwidth

[Charts: peak shuffle bandwidth (GB/s, 0–120) on the 4-socket and the 8-socket machine for Naive, Coord. random (tight), Ring (loose), Ring (tight), Thread migration (loose), and Thread migration (tight); annotated speedups of 3x and ~4x over the naïve scheme]

4 socket: IBM x3850, 4 sockets x 8 cores, Intel X7650 Nehalem-EX, fully connected QPI
8 socket: 2x IBM x3850, 8 sockets x 8 cores, Intel X7650 Nehalem-EX

Page 15: NUMA-aware algorithms: the case of data shuffling

Exploiting ring shuffling in joins

• Implemented the algorithm of Albutiu et al.
• Sort-merge-based join implementation (phase skeleton sketched below)

[Chart: scalability vs. 1 thread for 1–32 threads — Total and Merge Phase, each with Ring vs. Naive shuffling]

[Charts: time breakdown (%) for 1–32 threads with Naive vs. Ring shuffling, split into Partition, Sort Dim., Sort Fact, and Merge phases]

Small overall perf. improvement because dominated by sort
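For orientation, here is a hypothetical phase skeleton (not the paper's code) showing where the ring order plugs into such a sort-merge join: the merge phase reads sorted runs from all threads, which is exactly the data-shuffling pattern. All phase functions and join_ctx_t are placeholders, source_of() is the rotation helper from the ring-shuffling sketch, and barriers between phases are omitted.

    typedef struct join_ctx join_ctx_t;   /* tables, per-thread sorted runs, output (assumed) */

    extern void partition_inputs(join_ctx_t *ctx, int self);         /* range-partition both inputs */
    extern void sort_dimension_runs(join_ctx_t *ctx, int self);      /* local sort, dimension side */
    extern void sort_fact_runs(join_ctx_t *ctx, int self);           /* local sort, fact side */
    extern void merge_join_runs(join_ctx_t *ctx, int self, int src); /* join my range against src's run */

    void sort_merge_join_worker(join_ctx_t *ctx, int self) {
        partition_inputs(ctx, self);
        sort_dimension_runs(ctx, self);
        sort_fact_runs(ctx, self);
        /* Merge phase = an all-to-all read of sorted runs: this is the shuffling
         * pattern, so visiting sources in ring order instead of 0..N-1 keeps every
         * socket's memory channels and QPI links busy. */
        for (int step = 0; step < N; step++)
            merge_join_runs(ctx, self, source_of(self, step));
    }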

Page 16: NUMA-aware algorithms: the case of data shuffling

Shuffling vs migration for aggregation

• Partitioning-based aggregation (see the sketch below)

[Charts: bandwidth (GB/s) vs. # distinct keys / partition (1K–8M) on the 4-socket and the 8-socket machine, comparing Naive, Ring, Migration, and Migration & copy]

Potential of thread migration when thread state small
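To connect this to code, the sketch below adapts the earlier migration sketch to partitioning-based aggregation, where the thread state is a hash table of partial aggregates. hash_table_t, ht_create(), and ht_update() are placeholders, and treating each input value directly as a group key is purely illustrative; the point is that the hash table travels (or is copied) with the thread, which is cheap only while the number of distinct groups is small.

    /* Per-thread aggregation state: a hash table of partial aggregates (placeholder API). */
    typedef struct hash_table hash_table_t;
    extern hash_table_t *ht_create(size_t expected_groups);
    extern void ht_update(hash_table_t *ht, uint64_t group, uint64_t value);

    static void *migrating_aggregator(void *arg) {
        int self = (int)(intptr_t)arg;
        hash_table_t *ht = ht_create(1 << 16);           /* the thread's state */
        for (int src = 0; src < N; src++) {
            numa_run_on_node(node_of_partition(src));    /* hop to the data ... */
            for (int i = 0; i < PART_WORDS; i++) {
                uint64_t v = partitions[src][self][i];
                ht_update(ht, v /* group key */, 1 /* e.g. COUNT */);  /* ... but ht may now be remote */
            }
            /* "Migration & copy" would additionally copy/reallocate ht on the new
             * node; cheap for a few thousand groups, prohibitive for millions. */
        }
        return NULL;
    }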

Page 17: NUMA-aware algorithms: the case of data shuffling

Conclusions

• Hardware is a moving target
• Need for primitives for data management operations
  • Highly optimized SW or HW implementations
  • BLAS for DBMSs
• Data shuffling can be up to 3x faster if NUMA-aware
  • Needs binding of memory allocations, thread scheduling, …
  • Potential of thread migration
• Improved overall performance of optimized joins and aggregations
• Continue investigating primitives, their implementation and exploitation
• Looking for motivated summer interns! [email to [email protected]]

Questions???

Page 18: NUMA-aware algorithms: the case of data shuffling


Backup slides

Page 19: NUMA-aware algorithms: the case of data shuffling

Shuffling data - scalability

[Chart: bandwidth (GB/s) vs. # threads (1–32) — thread migration vs. ring shuffling vs. naïve]

IBM x3850, 4 sockets x 8 cores, fully connected QPI

Page 20: NUMA-aware algorithms: the case of data shuffling

Shuffling vs migration for aggregation - breakdown

• Partitioning-based aggregation

[Charts: time (sec) vs. # distinct groups (64K–4M) for Naive, Ring, Migration, and Migration & copy, broken down into Read partitions, Upd. hash table, and Copy state]

Page 21: NUMA-aware algorithms: the case of data shuffling

Naïve vs ring shuffling

[Figure: usage of QPI and memory paths across iterations 1, 2, 3, … for threads T0–T7 — naïve uncoordinated shuffling vs. coordinated shuffling]