1
NUMA-aware algorithms: the case of data shuffling
Yinan Li*, Ippokratis Pandis, Rene Mueller, Vijayshankar Raman, Guy Lohman
*University of Wisconsin-Madison; IBM Almaden Research Center
2
Hardware is a moving target
• Different degrees of parallelism, numbers of sockets, and memory hierarchies
• Different types of CPUs (SSE, out-of-order vs. in-order, 2- vs. 4- vs. 8-way SMT, …), storage technologies, …
Very difficult to optimize & maintain data management code for every HW platform
[Figure: example server topologies (2-socket, 4-socket (a), 4-socket (b), 8-socket) across Intel-based, POWER-based, and cloud platforms]
3
NUMA effects => underutilize RAM bandwidth
[Figure: four sockets (Socket 0 to Socket 3), each with its own local memory, connected by QPI links]

Access type                               Bandwidth, seq. mem access (12 threads)   Latency, data-dependent random access (1 thread)
Local memory access                       24.7 GB/s                                 340 cycles/access (~150 ns/access)
Remote memory, 1 hop                      10.9 GB/s                                 420 cycles/access (~185 ns/access)
Remote memory, 2 hops                     10.9 GB/s                                 520 cycles/access (~230 ns/access)
Remote memory, 2 hops with cross traffic  5.3 GB/s                                  530 cycles/access (~235 ns/access)

Sequential accesses are not the final solution
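To make the gap concrete, the measurements above can be put in relation to local access. The following is only a small arithmetic sketch: the numbers are copied from the slide, while the dictionary layout and the function names are made up for illustration.

```python
# Measurements copied from the slide: (bandwidth in GB/s, latency in cycles, latency in ns).
MEASUREMENTS = {
    "local": (24.7, 340, 150),
    "remote_1_hop": (10.9, 420, 185),
    "remote_2_hops": (10.9, 520, 230),
    "remote_2_hops_cross_traffic": (5.3, 530, 235),
}


def relative_to_local(measurements):
    """Return {access type: (bandwidth as a fraction of local, latency as a multiple of local)}."""
    local_bw, local_cycles, _ = measurements["local"]
    return {
        name: (bw / local_bw, cycles / local_cycles)
        for name, (bw, cycles, _) in measurements.items()
    }


def implied_clock_ghz(measurements):
    """Cycles per nanosecond implied by each row (the ns figures are approximate)."""
    return {name: cycles / ns for name, (_, cycles, ns) in measurements.items()}


if __name__ == "__main__":
    for name, (bw_frac, lat_mult) in relative_to_local(MEASUREMENTS).items():
        print(f"{name:30s} bandwidth {bw_frac:4.0%} of local, latency {lat_mult:.2f}x local")
```

With these numbers, sequential bandwidth falls to roughly 44% of local for remote memory and to roughly 21% with cross traffic, while latency grows by only about 1.2x to 1.6x. That is consistent with the slide's point that switching to sequential access does not by itself avoid the NUMA penalty.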
4
Use case: data shuffling
• Each of the N threads needs to send data to the N-1 other threads
• Common operation:
  • Sort-merge join
  • Partitioned aggregation
  • MapReduce
    • Both Map and Reduce shuffle data
  • Scatter/gather
Ignoring NUMA leaves perf. on the table
[Figure: shuffling bandwidth on a 4-socket system: sequential but naïve vs. coordinated, NUMA-aware]
5
NUMA-aware data mgmt. operations
• Tons of work on SMPs & NUMA 1.0
• Sort-merge join [Albutiu et al. VLDB 2012]
  • Favor sequential accesses over random probes
• OLTP on HW Islands [Porobic et al. VLDB 2012]
  • Should we treat multisocket multicores as a cluster?
There are many different data operations that need similar optimizations
6
Need for primitives
• Kernels used frequently in data management operations
  • E.g. sorting, hashing, data shuffling, …
• Highly optimized software solutions
  • Similar to BLAS
  • Optimized by skilled devs per new HW platform
• Hardware-based solutions
  • Database machines 2.0 (see Bionic DBMSs talk this afternoon)
  • If very important kernel, can be burnt into HW
  • Expensive, but orders of magnitude more efficient (perf., energy)
  • Companies like IBM and Oracle can do vertical engineering
7
Outline
• Introduction
• NUMA 2.0 and related work
• Data shuffling
  • Ring shuffling
  • Thread migration
• Evaluation
• Conclusions
8
Data shuffling & naïve implementation
• Each of the N threads produces N-1 partitions, one for each of the other threads
• Each thread needs to read its partitions
  • N * (N-1) transfers
• Assume uniform sizes of partitions
[Figure: partition layout before and after the shuffle]
• Naïve implementation
  • Each thread acting autonomously:
      for (thread = 0; thread < N; thread++)
          readMyPartitionFrom(thread);
How bad can that be?
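One way to see how bad it can be is to write out which source each thread reads at each step of the naive loop above. The sketch below is a simplified model, not the authors' code: it assumes all threads start together and advance at the same pace, and the function names are made up for illustration.

```python
from collections import Counter


def naive_schedule(n):
    """Step k of the naive loop: every reader fetches its partition from thread k.

    Returns a list of steps; each step is a list of (reader, source) pairs.
    The pair with reader == source is the thread's own, local partition.
    """
    return [[(reader, source) for reader in range(n)] for source in range(n)]


def max_readers_per_source(schedule):
    """Largest number of readers that hit the same source thread in any single step."""
    return max(
        max(Counter(source for _, source in step).values())
        for step in schedule
    )


if __name__ == "__main__":
    schedule = naive_schedule(8)
    transfers = sum(1 for step in schedule for reader, source in step if reader != source)
    print("transfers between distinct threads:", transfers)
    print("max readers on one source per step:", max_readers_per_source(schedule))
```

Under this lock-step assumption all N readers target the same source thread in every step, so a single memory controller and the links into one socket carry the whole load while the others sit idle.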
9
Shuffling naively in a NUMA system
[Figure: naïve uncoordinated shuffling, steps 1, 2, 3, …, threads T0 to T7; usage of QPI and memory paths]
[Figure: bandwidth (GB/s, 0 to 100) vs. # threads (1 to 32) for naive shuffling, with reference lines for the max memory bandwidth of 1 channel and the aggregate bandwidth of all channels]
BUT we bought 4 memory channels and 6 QPIs
Need to orchestrate threads/transfers to utilize the rest
10
Ring shuffling
• Devise a global schedule and all threads follow it
  • Inner ring: partitions ordered by thread number, socket; stationary
  • Outer ring: threads ordered by socket, thread number; rotates
• Can be executed in lock-step or loosely
• Needs:
  • Thread binding & synchronization
  • Control location of mem. allocations
[Figure: two concentric rings; outer ring of threads s0.t0, s0.t1, s1.t0, s1.t1, s2.t0, s2.t1, s3.t0, s3.t1; inner ring of partitions s0.p0, s1.p0, s2.p0, s3.p0, s0.p1, s1.p1, …]
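The two rings can be sketched as two differently ordered lists plus a rotation. This is a minimal illustration of the idea as described on the slide, not the authors' implementation: the rotation direction, the pairing rule (outer slot i with inner slot (i + step) mod N), and all names are assumptions.

```python
from collections import Counter


def ring_schedule(num_sockets, threads_per_socket):
    """Return a list of steps; each step is a list of (thread, partition) pairs.

    Threads and partitions are (socket, index) tuples.
    """
    n = num_sockets * threads_per_socket
    # Outer ring (rotates): threads ordered by socket, then thread number.
    outer = [(s, t) for s in range(num_sockets) for t in range(threads_per_socket)]
    # Inner ring (stationary): partitions ordered by thread number, then socket.
    inner = [(s, p) for p in range(threads_per_socket) for s in range(num_sockets)]
    schedule = []
    for step in range(n):
        # Rotating the outer ring by `step` pairs outer slot i with inner slot (i + step) % n.
        schedule.append([(outer[i], inner[(i + step) % n]) for i in range(n)])
    return schedule


def readers_per_source_socket(step_pairs):
    """Count how many threads read from each socket's memory in one step."""
    return Counter(source[0] for _, source in step_pairs)


if __name__ == "__main__":
    for step, pairs in enumerate(ring_schedule(4, 2)):
        print(step, dict(readers_per_source_socket(pairs)))
```

In this model every step is a one-to-one pairing of threads and partitions, so each socket's memory serves the same number of readers in every step, and after N steps every thread has visited every partition once (including its own local one). Executing it in lock-step would need a barrier between steps; the loose variant would let threads drift.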
11
Ring shuffling in action
[Figure: ring shuffling, steps 1, 2, 3, …, threads T0 to T7; usage of QPI and memory paths]
[Figure: bandwidth (GB/s, 0 to 100) vs. # threads (1 to 32), ring shuffling vs. naive, with a reference line for the aggregate bandwidth of all channels]
Orchestrated traffic utilizes underlying QPI network
12
Thread migration instead of shuffling
• Move computation to data instead of shuffling the data
  • Convert accesses to local memory reads
• Choice of migrating only thread or thread + state
  • But, both very sensitive to amount of thread state
[Figure: bandwidth (GB/s, 0 to 100) vs. # threads (1 to 32): thread migration, ring shuffling, naive, with a reference line for the aggregate bandwidth of all channels]
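The sensitivity to thread state can be illustrated with a rough traffic model. Everything below is an assumption made for illustration: threads rotate over the sockets in a fixed order, each thread carries its full state on every hop, partitions are uniform in size, and the function names are invented. It is a back-of-the-envelope sketch, not a model taken from the talk.

```python
from collections import Counter


def migration_schedule(num_sockets, threads_per_socket):
    """Step k: each thread runs on socket (home + k) % num_sockets and reads only local data.

    Returns a list of steps; each step maps a thread (home socket, index) to its current socket.
    """
    return [
        {
            (home, t): (home + step) % num_sockets
            for home in range(num_sockets)
            for t in range(threads_per_socket)
        }
        for step in range(num_sockets)
    ]


def bytes_moved(num_sockets, threads_per_socket, partition_bytes, state_bytes):
    """Rough cross-socket traffic: (shuffling, migrating thread + state)."""
    n = num_sockets * threads_per_socket
    # Shuffling: each thread pulls one partition from every thread on another socket.
    shuffle = n * (n - threads_per_socket) * partition_bytes
    # Migration with state: each thread carries its state across num_sockets - 1 hops.
    migrate = n * (num_sockets - 1) * state_bytes
    return shuffle, migrate


if __name__ == "__main__":
    for state_bytes in (100, 10_000):
        print(state_bytes, bytes_moved(4, 2, partition_bytes=1_000, state_bytes=state_bytes))
```

In this model every socket hosts the same number of threads at every step, and migration moves less data than shuffling only while the per-thread state stays small relative to the partitions.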
13
Outline
• Introduction
• NUMA 2.0 and related work
• Data shuffling
• Evaluation
• Conclusions
14
Shuffling benchmark – peak bandwidth
[Figure: peak shuffling bandwidth (GB/s, 0 to 120) on 4-socket and 8-socket systems for six methods: naive, coordinated random (tight), ring (loose), ring (tight), thread migration (loose), thread migration (tight). Annotations: 3x, ~4x]
4 socket: IBM x3850, 4 sockets x 8 cores, Intel X7560 Nehalem-EX, fully connected QPI
8 socket: 2x IBM x3850, 8 sockets x 8 cores, Intel X7560 Nehalem-EX
15
Exploiting ring shuffling in joins
• Implemented the algorithm of Albutiu et al.
  • Sort-merge-based join implementation
[Figure: scalability (vs. 1 thread, 0 to 30) vs. # threads (1 to 32): Total (Ring), Total (Naive), Merge Phase (Ring), Merge Phase (Naive)]
[Figure: time breakdown (0% to 100%) vs. # threads (1 to 32), with naive shuffling and with ring shuffling: Merge, Sort Fact, Sort Dim., Partition]
Small overall perf. improvement because dominated by sort
16
Shuffling vs migration for aggregation
• Partitioning-based aggregation
[Figure: bandwidth (GB/s, 0 to 100) vs. # distinct keys / partition (1K to 8M) on 4-socket and 8-socket systems: Migration & copy, Migration, Ring, Naive]
Potential of thread migration when thread state small
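For readers unfamiliar with partitioning-based aggregation, the sketch below shows where the shuffle sits in it. It is a generic, single-threaded illustration under stated assumptions (a SUM aggregate, hash partitioning on the key, invented function names), not the benchmark code behind the figure.

```python
from collections import defaultdict


def partition_phase(rows, num_threads):
    """One producer thread: split (key, value) rows into one bucket per consumer thread."""
    buckets = [[] for _ in range(num_threads)]
    for key, value in rows:
        buckets[hash(key) % num_threads].append((key, value))
    return buckets


def aggregate_phase(consumer, all_buckets):
    """One consumer thread: read its bucket from every producer (the shuffle) and sum per key."""
    table = defaultdict(int)  # this hash table is the consumer's thread state
    for producer_buckets in all_buckets:
        for key, value in producer_buckets[consumer]:
            table[key] += value
    return dict(table)


def partitioned_sum(inputs):
    """inputs[i] holds the rows of thread i; returns the per-key sums over all inputs."""
    n = len(inputs)
    all_buckets = [partition_phase(rows, n) for rows in inputs]
    result = {}
    for consumer in range(n):
        # Keys are disjoint across consumers, so the partial results never collide.
        result.update(aggregate_phase(consumer, all_buckets))
    return result


if __name__ == "__main__":
    print(partitioned_sum([[(1, 10), (2, 5)], [(1, 1), (3, 7)], [(2, 2)]]))
```

The hash table built in `aggregate_phase` grows with the number of distinct keys per partition, which is the quantity on the figure's x-axis and the state a migrating thread would have to carry or reach remotely.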
17
Conclusions
• Hardware is a moving target
• Need for primitives for data management operations
  • Highly optimized SW or HW implementations
  • BLAS for DBMSs
• Data shuffling can be up to 3x faster if NUMA-aware
  • Needs binding of memory allocations, thread scheduling …
  • Potential of thread migration
• Improved overall performance of optimized joins and aggregations
• Continue investigating primitives, their implementation and exploitation
• Looking for motivated summer interns! [email to [email protected]]
Questions???
18
Backup slides
19
Shuffling data - scalability
[Figure: bandwidth (GB/s, 0 to 100) vs. # threads (1 to 32): thread migration, ring shuffling, naive]
IBM x3850, 4 sockets x 8 cores, fully connected QPI
20
Shuffling vs migration for aggregation - breakdown
• Partitioning-based aggregation
[Figure: time (sec, 0 to 12) vs. # distinct groups (64K to 4M) for Naive, Ring, Migration, and Migration & copy, broken down into: Copy state, Upd. hash table, Read partitions]
21
Naïve vs ring shuffling
[Figure: naïve uncoordinated shuffling vs. coordinated shuffling, iterations 1, 2, 3, …, threads T0 to T7; usage of QPI and memory paths]