EXPLOITING COARSE-GRAINED PARALLELISM IN B+ TREE SEARCHES ON APUS
MAYANK DAGA, MARK NUTTER, AMD RESEARCH


Presentation, HC-4019, "Exploiting Coarse-grained Parallelism in B+ Tree Searches on an APU," by Mayank Daga and Mark Nutter at the AMD Developer Summit (APU13) Nov. 11-13.


Page 1: HC-4019, "Exploiting Coarse-grained Parallelism in B+ Tree Searches on an APU," by Mayank Daga and Mark Nutter

EXPLOITING COARSE-GRAINED PARALLELISM IN B+ TREE SEARCHES ON APUS

MAYANK DAGA, MARK NUTTER AMD RESEARCH

Page 2

| APU ’13 AMD Developer Summit, San Jose, California, USA | November 14, 2013 | Public 2

RELEVANCE OF B+ TREE SEARCHES

B+ Trees are a special case of B Trees
‒ A fundamental data structure used in several popular database management systems, e.g., MySQL, SQLite, MongoDB, CouchDB

High-throughput, read-only index searches are gaining traction in
‒ Audio search
‒ Video-copy detection
‒ Online Transaction Processing (OLTP) benchmarks

Increase in memory capacity allows many database tables to reside in memory
‒ Brings computational performance to the forefront

[Figure: B Tree vs. B+ Tree, with logos of databases that use each]

Page 3

A REAL-WORLD USE-CASE OF READ-ONLY SEARCHES

Mobile client:
Step 1. Record audio
Step 2. Generate audio fingerprint
Step 3. Send search request to server

Server:
Step 1. Receive search requests
Step 2. Query database
Step 3. Return search results to client

Thousands of clients send requests; the music library holds millions of songs.

[Figure: app on a smartphone querying a server-side B+ Tree index over the database]

Page 4

DATABASE PRIMITIVES ON ACCELERATORS

Discrete graphics processing units (dGPUs) provide a compelling mix of
‒ Performance per Watt
‒ Performance per Dollar

dGPUs have been used to accelerate critical database primitives
‒ scan
‒ sort
‒ join
‒ aggregation
‒ B+ Tree searches?

Page 5

B+ TREE SEARCHES ON ACCELERATORS

B+ Tree searches present significant challenges
‒ Irregular representation in memory
  ‒ An artifact of malloc() and new
‒ Today's dGPUs do not have a direct mapping to the CPU virtual address space
  ‒ Indirect links need to be converted to relative offsets
‒ Requirement to copy the tree to the dGPU, which entails
  ‒ Being bound by the amount of GPU device memory

Page 6

OUR SOLUTION

Accelerated B+ Tree searches on a fused CPU+GPU processor (or APU[1])
‒ Eliminates data-copies by combining x86 CPU and vector GPU cores on the same silicon die

Developed a memory allocator to form a regular representation of the tree in memory
‒ The fundamental data structure is not altered
‒ Merely parts of its layout are changed

[1] www.hsafoundation.com

Page 7

OUTLINE Motivation and Contribution

Background ‒ AMD APU Architecture ‒ B+ Trees

Approach ‒ Transforming the Memory Layout ‒ Eliminating the Divergence

Results ‒ Performance ‒ Analysis

Summary and Next Steps

Page 8

AMD APU ARCHITECTURE

[Block diagram: AMD 2nd-Gen. A-Series APU. x86 cores and GPU vector cores share a System Request Interface (SRI), crossbar, link controller, MCT, and DRAM controllers; UNB, RMB, and FCL connect the GPU frame-buffer and system memory]

UNB - Unified Northbridge, MCT - Memory Controller

The APU contains dedicated IOMMUv2 hardware, which
- Provides direct mapping between GPU and CPU virtual address (VA) space
- Enables GPUs to access the system memory
- Enables GPUs to track whether pages are resident in memory

Today, GPU cores can access the VA space at the granularity of contiguous chunks

Page 9

B+ TREES

A B+ Tree…
‒ is a dynamic, multi-level index
‒ is efficient for retrieval of data stored in a block-oriented context
‒ has a high fan-out to reduce disk I/O operations

The order (b) of a B+ Tree measures the capacity of its nodes

The number of children (m) in an internal node satisfies
‒ ⌈b/2⌉ ≤ m ≤ b
‒ The root node can have as few as two children

Number of keys in an internal node = (m − 1)

[Figure: example B+ Tree with internal nodes (keys 2 4; 3 5; 6 7; 7 8) and leaves with keys 1-8 pointing to records d1-d8]
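The node-capacity rules above can be written down as a conventional pointer-based node (a minimal C sketch; the struct layout and names are illustrative, not taken from the deck):

```c
#include <stdbool.h>

#define ORDER 16   /* b: maximum number of children per node */

/* A B+ Tree node of order b holds at most b-1 keys and b child pointers.
 * Internal nodes keep ceil(b/2) <= m <= b children (the root may have
 * as few as 2); leaves point to records instead of children. */
typedef struct node {
    bool is_leaf;
    int  num_keys;                /* current key count, at most ORDER-1 */
    int  keys[ORDER - 1];
    struct node *pointers[ORDER]; /* children, or records in a leaf */
} node;
```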

Page 10

APPROACH FOR PARALLELIZATION

Fine-grained (accelerate a single query)
‒ Replace binary search in each node with K-ary search
‒ Maximum performance improvement = log(k)/log(2)
‒ Results in poor occupancy of the GPU cores

Coarse-grained (perform many queries in parallel)
‒ Enables data-parallelism
‒ Increases memory bandwidth with parallel reads
‒ Increases throughput (transactions per second for OLTP)
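A host-side C analogue of the coarse-grained scheme (an illustrative sketch: `node`, `search_one`, and `search_batch` are made-up names; on the GPU, each loop iteration becomes one work-item):

```c
#include <stddef.h>

/* Minimal pointer-based node; illustrative layout, not the deck's. */
typedef struct node {
    int is_leaf;
    int num_keys;
    int keys[15];
    void *pointers[16];  /* child nodes, or record pointers in a leaf */
} node;

/* Search one key; returns the record pointer, or NULL if absent. */
static void *search_one(const node *root, int key) {
    const node *c = root;
    while (!c->is_leaf) {
        int i = 0;
        while (i < c->num_keys && key >= c->keys[i])
            i++;
        c = (const node *)c->pointers[i];
    }
    for (int i = 0; i < c->num_keys; i++)
        if (c->keys[i] == key)
            return c->pointers[i];
    return NULL;
}

void search_batch(const node *root, const int *queries,
                  void **results, size_t n) {
    /* Every query is independent: on the APU this loop maps to one
     * work-item per query; divergence comes from differing paths. */
    for (size_t q = 0; q < n; q++)
        results[q] = search_one(root, queries[q]);
}
```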

Page 11

TRANSFORMING THE MEMORY LAYOUT

Metadata per node
‒ Number of keys in the node
‒ Offset to keys/values in the buffer
‒ Offset to the first child node
‒ Whether the node is a leaf

Pass a pointer to this memory buffer to the accelerator

[Figure: one contiguous buffer laid out as nodes with metadata, then keys, then values]
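A minimal C sketch of such an allocator's output (field and function names are illustrative): nodes carry byte offsets rather than pointers, so the single buffer remains valid when handed to the GPU.

```c
#include <stdint.h>
#include <stdlib.h>

/* Per-node metadata stored inside one contiguous buffer; all links are
 * byte offsets from the buffer base, never raw pointers. */
typedef struct {
    int32_t num_keys;   /* keys in this node */
    int32_t keys;       /* offset of this node's keys in the buffer */
    int32_t values;     /* offset of this node's values (leaves only) */
    int32_t ptr;        /* offset of the first child node */
    int32_t is_leaf;    /* nonzero for leaves */
} flat_node;

/* Follow a link: offsets stay valid in any address space. */
static inline flat_node *child(void *base, const flat_node *n, int i) {
    return (flat_node *)((char *)base + n->ptr + (size_t)i * sizeof(flat_node));
}

static inline int32_t *node_keys(void *base, const flat_node *n) {
    return (int32_t *)((char *)base + n->keys);
}
```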

Page 12


ELIMINATING THE DIVERGENCE

Each work-item/thread executes a single query

This may increase divergence within a wave-front
‒ Every query may follow a different path in the B+ Tree

Sort the keys to be searched
‒ Increases the chance that work-items within a wave-front follow similar paths in the B+ Tree
‒ We use radix sort[1] to sort the keys on the GPU

[Figure: work-items WI-1 and WI-2 following neighboring root-to-leaf paths in the example B+ Tree]

[1] D. G. Merrill and A. S. Grimshaw, “Revisiting sorting for gpgpu stream architectures,” in Proceedings of the 19th intl. conf. on Parallel architectures and compilation techniques, ser. PACT ’10. New York, NY, USA: ACM, 2010.
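The deck sorts on the GPU with Merrill and Grimshaw's radix sort; as an illustrative host-side stand-in, a plain qsort over the key batch shows the idea (`presort_queries` is a sketch name, not from the deck):

```c
#include <stdlib.h>

static int cmp_keys(const void *a, const void *b) {
    unsigned int ka = *(const unsigned int *)a;
    unsigned int kb = *(const unsigned int *)b;
    return (ka > kb) - (ka < kb);   /* overflow-safe three-way compare */
}

/* Sort the batch of search keys before launching the kernel, so that
 * adjacent work-items in a wave-front tend to walk the same
 * root-to-leaf path and their memory accesses converge. */
void presort_queries(unsigned int *keys, size_t n) {
    qsort(keys, n, sizeof(keys[0]), cmp_keys);
}
```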

Page 13

IMPACT OF DIVERGENCE IN B+ TREE SEARCHES

[Chart: impact of divergence vs. number of queries (16K, 32K, 64K, 128K) and B+ Tree order (8-128), on the AMD Radeon HD 7660 and AMD Phenom II X6 1090T]

Impact of divergence on GPU: 3.7-fold (average)
Impact of divergence on CPU: 1.8-fold (average)

Page 14

OUTLINE Motivation and Contribution

Background ‒ AMD APU Architecture ‒ B+ Trees

Approach ‒ Transforming the Memory Layout ‒ Eliminating the Divergence

Results ‒ Performance ‒ Analysis

Summary and Next Steps

Page 15

EXPERIMENTAL SETUP

Software
‒ A B+ Tree with 4M records is used
‒ Search queries are created using normal_distribution() (a C++11 feature)
‒ The queries have been sorted
‒ CPU implementation from http://www.amittai.com/prose/bplustree.html
‒ Driver: AMD Catalyst v12.8
‒ Programming model: OpenCL

Hardware
‒ AMD Radeon HD 7660 APU (Trinity)
  ‒ 4 CPU cores w/ 6GB DDR3, 6 CUs w/ 2GB DDR3
‒ AMD Phenom II X6 1090T + AMD Radeon HD 7970 (Tahiti)
  ‒ 6 CPU cores w/ 8GB DDR3, 32 CUs w/ 3GB GDDR5
‒ "Device memory" results do not include data-copy time

[Table: sample records with columns E ID (PKey) and Age; 4 million entries, from ID 0000001 (Age 34) to ID 4194304 (Age 50)]

[Histogram: frequency of search keys per bin, showing the normal distribution of queries over the key range]

Page 16

RESULTS – QUERIES PER SECOND

[Chart: queries/second (millions, log scale 1-1000) vs. number of queries (16K-128K) and B+ Tree order (4-128), for the AMD Phenom II X6 1090T (6 threads + SSE), AMD Radeon HD 7970 (device memory), and AMD Radeon HD 7970 (pinned memory); peak of 700M queries/sec]

dGPU (device memory): ~350M queries/sec (avg.)
dGPU (pinned memory): ~9M queries/sec (avg.)
AMD Phenom CPU: ~18M queries/sec (avg.)

Page 17

RESULTS – QUERIES PER SECOND

[Chart: queries/second (millions, 0-140) vs. number of queries (16K-128K) and B+ Tree order (4-128), for the AMD Phenom II X6 1090T (6 threads + SSE), AMD Radeon HD 7660 (device memory), and AMD Radeon HD 7660 (pinned memory)]

The APU (pinned memory) is faster than the CPU implementation
APU (device memory): ~66M queries/sec (avg.)
APU (pinned memory): ~40M queries/sec (avg.)

Page 18

RESULTS - SPEEDUP

[Chart: speedup (0-12) vs. number of queries (16K-128K) and B+ Tree order (4-128), for the AMD Radeon HD 7660 (device memory) and AMD Radeon HD 7660 (pinned memory); peak of 4.9-fold]

Average speedup: 4.3-fold (device memory), 2.5-fold (pinned memory)

Baseline: six-threaded, hand-tuned, SSE-optimized CPU implementation.

• Efficacy of IOMMUv2 + HSA on the APU

Size of the B+ Tree supported:

Platform                         | < 1.5GB | 1.5GB - 2.7GB | > 2.7GB
Discrete GPU (memory size = 3GB) | ✓       | ✓             | ✗
APU (prototype software)         | ✓       | ✓             | ✓

Page 19

ANALYSIS

The accelerators and the CPU yield best performance at different orders of the B+ Tree
‒ CPU: order = 64
  ‒ The CPUs' ability to prefetch data is beneficial at higher orders
‒ APU and dGPU: order = 16
  ‒ GPUs do not have a prefetcher, so each cache line should be utilized as efficiently as possible
  ‒ GPUs have a cache-line size of 64 bytes
  ‒ Order = 16 is most beneficial (16 keys × 4 bytes = 64 bytes)

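The order-16 choice follows from simple arithmetic (a sketch assuming 4-byte keys and the 64-byte cache line quoted above; `best_order` is an illustrative name):

```c
#include <stdint.h>

#define GPU_CACHE_LINE 64u   /* bytes, per the slide above */

/* Order at which one node's keys exactly fill a cache line, so every
 * loaded line is fully used even without a hardware prefetcher. */
static inline unsigned best_order(unsigned key_size_bytes) {
    return GPU_CACHE_LINE / key_size_bytes;
}
```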

Page 20

ANALYSIS

Minimum batch size to match the CPU performance:

Platform             | Order = 64  | Order = 16
dGPU (device memory) | 4K queries  | 2K queries
dGPU (pinned memory) | N.A.        | N.A.
APU (device memory)  | 10K queries | 4K queries
APU (pinned memory)  | 20K queries | 16K queries

reuse_factor (amortizing the cost of data-copies to the GPU):

     | 90% Queries | 100% Queries
dGPU | 15          | 54
APU  | 100         | N.A.

Page 21

PROGRAMMABILITY

CPU-SSE:

    int i = 0, j;
    node *c = root;
    __m128i vkey = _mm_set1_epi32(key);
    __m128i vnodekey, *vptr;
    short int mask;

    /* find the leaf node */
    while (!c->is_leaf) {
        /* SIMD scan: compare four keys at a time */
        for (i = 0; i < (c->num_keys - 3); i += 4) {
            vptr = (__m128i *)&(c->keys[i]);
            vnodekey = _mm_load_si128(vptr);
            mask = _mm_movemask_ps(_mm_cvtepi32_ps(
                       _mm_cmplt_epi32(vkey, vnodekey)));
            if (mask & 8)
                break;
        }
        /* finish the scan scalar from where the SIMD loop stopped */
        for (j = i; j < c->num_keys; j++) {
            if (key < c->keys[j])
                break;
        }
        c = (node *)c->pointers[j];
    }

    /* match the key in the leaf node */
    for (i = 0; i < c->num_keys; i++)
        if (c->keys[i] == key)
            break;

    /* retrieve the record */
    if (i != c->num_keys)
        return (record *)c->pointers[i];

GPU (OpenCL):

    typedef global unsigned int g_uint;
    typedef global mynode g_mynode;

    int tid = get_global_id(0);
    int i = 0;
    g_mynode *c = (g_mynode *)root;

    /* find the leaf node */
    while (!c->is_leaf) {
        i = 0;   /* restart the key scan at each level */
        while (i < c->num_keys) {
            if (keys[tid] >= ((g_uint *)((intptr_t)root + c->keys))[i])
                i++;
            else
                break;
        }
        /* links are byte offsets from the buffer base, not pointers */
        c = (g_mynode *)((intptr_t)root + c->ptr + i * sizeof(mynode));
    }

    /* match the key in the leaf node */
    for (i = 0; i < c->num_keys; i++) {
        if (((g_uint *)((intptr_t)root + c->keys))[i] == keys[tid])
            break;
    }

    /* retrieve the record */
    if (i != c->num_keys)
        records[tid] = ((g_uint *)((intptr_t)root + c->is_leaf + i * sizeof(g_uint)))[0];

Page 22

RELATED WORK

J. Fix, A. Wilkes, and K. Skadron, "Accelerating Braided B+ Tree Searches on a GPU with CUDA." In Proceedings of the 2nd Workshop on Applications for Multi and Many Core Processors: Analysis, Implementation, and Performance, in conjunction with ISCA, 2011

‒ Authors report ~10-fold speedup over a single-threaded, non-SSE CPU implementation, using a discrete NVIDIA GTX 480 GPU (data-copies not taken into account)

C. Kim, J. Chhugani, N. Satish, E. Sedlar, A. D. Nguyen, T. Kaldewey, V. W. Lee, S. A. Brandt, P. Dubey, “FAST: fast architecture sensitive tree search on modern CPUs and GPUs”, SIGMOD Conference, 2010

‒ Authors report ~100M queries per second using a discrete NVIDIA GTX 280 GPU (do not take data-copies into account)

J. Sewall, J. Chhugani, C. Kim, N. Satish, P. Dubey, “PALM: Parallel, Architecture-Friendly, Latch-Free Modifications to B+ Trees on Multi-Core Processors”, Proceedings of VLDB Endowment, (VLDB 2011)

‒ Applicable for B+ Tree modifications on the GPU

Page 23

POSSIBILITIES WITH HSA

[Figure: B+ Tree in memory shared between x86 cores (serial modifications) and GPU vector cores (parallel reads for searches)]

Graphics Core Next, User-mode Queuing, Shared Virtual Memory

Page 24

SUMMARY

B+ Tree is the fundamental data structure in many RDBMSs
‒ Accelerating B+ Tree searches is critical
‒ Doing so presents significant challenges on discrete GPUs

We have accelerated B+ Tree searches by exploiting coarse-grained parallelism on an APU
‒ 2.5-fold (avg.) speedup over a 6-thread + SSE CPU implementation

Possible next steps
‒ HSA + IOMMUv2 would reduce the issue of modifying the B+ Tree representation
‒ Investigate CPU-GPU co-scheduling
‒ Investigate modifications on the B+ Tree

Page 25

DISCLAIMER & ATTRIBUTION

The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors.

The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes.

AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION.

AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.

ATTRIBUTION

© 2013 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo, Phenom, Radeon, Catalyst and combinations thereof are trademarks of Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. Other names are for informational purposes only and may be trademarks of their respective owners.