24
1 @sqreamtech SIL7138 How to run SQL queries on TBs of data using GPUs Jake Wheat Lead Architect, SQream Technologies

How to run SQL queries on TBs of data … using GPUson-demand.gputechconf.com/gtc-il/...how-to-run-sql-queries-on-tbs-of-data---using-gpus.pdfOutput of remove_if, join, reduce/reduce_by_key,

  • Upload
    others

  • View
    3

  • Download
    0

Embed Size (px)

Citation preview

Page 1: How to run SQL queries on TBs of data … using GPUson-demand.gputechconf.com/gtc-il/...how-to-run-sql-queries-on-tbs-of-data---using-gpus.pdfOutput of remove_if, join, reduce/reduce_by_key,

1@sqreamtech SIL7138

How to run SQL queries on TBs of data using GPUs

Jake WheatLead Architect, SQream Technologies

Page 2: How to run SQL queries on TBs of data … using GPUson-demand.gputechconf.com/gtc-il/...how-to-run-sql-queries-on-tbs-of-data---using-gpus.pdfOutput of remove_if, join, reduce/reduce_by_key,

2@sqreamtech SIL7138

How to run SQL queries on TBs of data using GPUs1. A toy SQL query engine

2. Support wide range of SQL queries

3. Support larger data

4. Optimise

Page 3: How to run SQL queries on TBs of data … using GPUson-demand.gputechconf.com/gtc-il/...how-to-run-sql-queries-on-tbs-of-data---using-gpus.pdfOutput of remove_if, join, reduce/reduce_by_key,

3@sqreamtech SIL7138

Hypothetical set of physical operator primitives

Example SQL Physical Operator Implementationselect a+b, c * 5 from t select

(a.k.a project/extend/rename)thrust::transform

select a, count(*), sum(b), avg(b) from t group by a

stream aggregate thrust::reduce_by_key

select a, b from t where a > 0.5

filter thrust::remove_if

select distinct a from t stream distinct thrust::unique

select a, b, c, d from t order by a,b

sort thrust::sort

select * from t union allselect * from u

union all -

select * from t inner join u using (a)

sort merge join (smj) simple implementation:thrust::upper_bounds, lower_bounds, unnest, gather

Page 4: How to run SQL queries on TBs of data … using GPUson-demand.gputechconf.com/gtc-il/...how-to-run-sql-queries-on-tbs-of-data---using-gpus.pdfOutput of remove_if, join, reduce/reduce_by_key,

4@sqreamtech SIL7138

A more complete engine• non indexed nested loop join (ninlj)

• not matching

• outer join

• stream union distinct

• distinct aggregates

• grouping sets

• window functions

Page 5: How to run SQL queries on TBs of data … using GPUson-demand.gputechconf.com/gtc-il/...how-to-run-sql-queries-on-tbs-of-data---using-gpus.pdfOutput of remove_if, join, reduce/reduce_by_key,

5@sqreamtech SIL7138

Running on bigger data

Page 6: How to run SQL queries on TBs of data … using GPUson-demand.gputechconf.com/gtc-il/...how-to-run-sql-queries-on-tbs-of-data---using-gpus.pdfOutput of remove_if, join, reduce/reduce_by_key,

6@sqreamtech SIL7138

4 MB

at a time

GPU

4 MBResult

1 TB

Didn’t fit

1 TB

Result

Queries on data which doesn't fit in the GPU memoryChunk the data

Page 7: How to run SQL queries on TBs of data … using GPUson-demand.gputechconf.com/gtc-il/...how-to-run-sql-queries-on-tbs-of-data---using-gpus.pdfOutput of remove_if, join, reduce/reduce_by_key,

7@sqreamtech SIL7138

Queries on data which doesn't fit in the GPU memoryUse an external sorting algorithm

Extend the idea to other operators too, for instance:

• NINLJ

• SMJ

• window functions

• distinct aggregates (select a, count(distinct b), count(b) from t group by a)

• outer joins

Page 8: How to run SQL queries on TBs of data … using GPUson-demand.gputechconf.com/gtc-il/...how-to-run-sql-queries-on-tbs-of-data---using-gpus.pdfOutput of remove_if, join, reduce/reduce_by_key,

8@sqreamtech SIL7138

GPU

1 TB

GPU / CPU spool + external sort

Sort

4 MB at a time

Queries on data which doesn't fit in the GPU memoryUse external sorting algorithms and a variety of spools

Page 9: How to run SQL queries on TBs of data … using GPUson-demand.gputechconf.com/gtc-il/...how-to-run-sql-queries-on-tbs-of-data---using-gpus.pdfOutput of remove_if, join, reduce/reduce_by_key,

9@sqreamtech SIL7138

Optimising: some non-GPU specific ideas

• Use good benchmarking and profiling tools

• Take advantage of columnar - don't read columns which aren't needed

• Compressed data – helps with both disk I/O and PCI bus I/O

• Make sure we use good I/O patterns: big reads and writes, work with the Linux FS cache effectively

• Use large host memory buffers

• Use AST optimisations (e.g. push predicates closer to the tablescans)

• Use a cost based optimizer

Page 10: How to run SQL queries on TBs of data … using GPUson-demand.gputechconf.com/gtc-il/...how-to-run-sql-queries-on-tbs-of-data---using-gpus.pdfOutput of remove_if, join, reduce/reduce_by_key,

10@sqreamtech SIL7138

Some GPU-specific ideas

• Use CUDA specific profiling tools in addition to your regular C++ ones(or at least have a way to enable cudaDeviceSynchronize calls all over your code)

• GPU task/queue

• Combine tasks on the GPU

• Use large chunks

• Rechunking

• Reduce or avoid host ↔ device transfers

• Optimise GPU code

Page 11: How to run SQL queries on TBs of data … using GPUson-demand.gputechconf.com/gtc-il/...how-to-run-sql-queries-on-tbs-of-data---using-gpus.pdfOutput of remove_if, join, reduce/reduce_by_key,

11@sqreamtech SIL7138

GPU task/queue1. Allocate all the GPU memory we will need

2. Upload the data to the GPU

3. Run some kernels on this data

4. Download the results to host

5. Release the allocated GPU memory

Page 12: How to run SQL queries on TBs of data … using GPUson-demand.gputechconf.com/gtc-il/...how-to-run-sql-queries-on-tbs-of-data---using-gpus.pdfOutput of remove_if, join, reduce/reduce_by_key,

12@sqreamtech SIL7138

GPU task/queueBenefits

• Can help support concurrency in a single query (to hide PCI transfer latency)

• Can help support concurrent queries on a single GPU with some basic fairness property (long running query won't starve out a short running query)

• GPU memory usage is very predictable

• No inter-task co-ordination needed

• No chance of deadlock on GPU resources

• Can be simply extended to distribute tasks over multiple GPUs

Page 13: How to run SQL queries on TBs of data … using GPUson-demand.gputechconf.com/gtc-il/...how-to-run-sql-queries-on-tbs-of-data---using-gpus.pdfOutput of remove_if, join, reduce/reduce_by_key,

13@sqreamtech SIL7138

Host Worker

Host Worker

Host Worker

GPU task queue

GPU 1

GPU workerGPU worker….

GPU 2

GPU workerGPU worker….

Host Worker

Host Worker

Short GPU task concept, concurrency

Page 14: How to run SQL queries on TBs of data … using GPUson-demand.gputechconf.com/gtc-il/...how-to-run-sql-queries-on-tbs-of-data---using-gpus.pdfOutput of remove_if, join, reduce/reduce_by_key,

14@sqreamtech SIL7138

Combine tasks on the GPU where possibleselect a, b+c as d from t where b > 5 order by a

Logical

Direct

Combined

TableScana,b,c

Transformd:=b+c

Remove Ifb>5

Sort bya Sort Merge

TableScana,b,c To device Transform

d:=b+c To host To device Remove Ifb>5 To host To device Sort by

a To host Sort Merge

TableScana,b,c To device Transform

d:=b+cRemove If

b>5Sort by

a To host Sort Merge

Page 15: How to run SQL queries on TBs of data … using GPUson-demand.gputechconf.com/gtc-il/...how-to-run-sql-queries-on-tbs-of-data---using-gpus.pdfOutput of remove_if, join, reduce/reduce_by_key,

15@sqreamtech SIL7138

Use larger chunks of data1. transfer 10,000 rows to device2. transform d := b + c3. remove_if b > 54. sort by a5. transfer results to host

1. transfer 10,000,000 rows to device2. transform d := b + c3. remove_if b > 54. sort by a5. transfer results to host

Page 16: How to run SQL queries on TBs of data … using GPUson-demand.gputechconf.com/gtc-il/...how-to-run-sql-queries-on-tbs-of-data---using-gpus.pdfOutput of remove_if, join, reduce/reduce_by_key,

16@sqreamtech SIL7138

RechunkingTable scansOptimised disk IO sizes often much bigger than the GPU can handle

Small

Big

Big

IO reads GPU process

slow good

fast too big to fit

fast good

split

Chunk size

Page 17: How to run SQL queries on TBs of data … using GPUson-demand.gputechconf.com/gtc-il/...how-to-run-sql-queries-on-tbs-of-data---using-gpus.pdfOutput of remove_if, join, reduce/reduce_by_key,

17@sqreamtech SIL7138

RechunkingOutput of remove_if, join, reduce/reduce_by_key, etc.If the output of these is very small amounts of rows, collecting these rows back into big chunks can improve the performance of later operations (e.g. sort).

Page 18: How to run SQL queries on TBs of data … using GPUson-demand.gputechconf.com/gtc-il/...how-to-run-sql-queries-on-tbs-of-data---using-gpus.pdfOutput of remove_if, join, reduce/reduce_by_key,

18@sqreamtech SIL7138

High selectivity

Low selectivity

input remove if sort

fast

input remove if sort

slow

input remove if rechunk sort

fast

Rechunking

Page 19: How to run SQL queries on TBs of data … using GPUson-demand.gputechconf.com/gtc-il/...how-to-run-sql-queries-on-tbs-of-data---using-gpus.pdfOutput of remove_if, join, reduce/reduce_by_key,

19@sqreamtech SIL7138

RechunkingReducing PCI transfer amounts in NINLJfor each chunk in table A:for each chunk in table B:load A,B to the GPU, join them

Table A size: 10,000,000,000 rowsTable B size: 10,000,000 rows

At a chunk size of 1 million rows you will upload 110 billion rows to the device. If you increase it to 20 million rows, you will upload only 15 billion rows to the device

110B / 15B = Potentially more than 7x faster

Page 20: How to run SQL queries on TBs of data … using GPUson-demand.gputechconf.com/gtc-il/...how-to-run-sql-queries-on-tbs-of-data---using-gpus.pdfOutput of remove_if, join, reduce/reduce_by_key,

20@sqreamtech SIL7138

9 chunks Loop and load9 times

3 chunks Loop and load3 times

RechunkingReducing PCI transfer amounts in NINLJ

Page 21: How to run SQL queries on TBs of data … using GPUson-demand.gputechconf.com/gtc-il/...how-to-run-sql-queries-on-tbs-of-data---using-gpus.pdfOutput of remove_if, join, reduce/reduce_by_key,

21@sqreamtech SIL7138

Reduce / avoid HOST ↔ DEVICE transfers• NINLJ: keep the small table on the GPU for the whole join

• Chunk skipping -data is inserted to the database ordered by a record timestamp. The storage layer preserves this order (or inserts the data in sorted order)

Typical use-case is 24 months of data in the table, but queries only want to summarize a particular day, week or month.

Page 22: How to run SQL queries on TBs of data … using GPUson-demand.gputechconf.com/gtc-il/...how-to-run-sql-queries-on-tbs-of-data---using-gpus.pdfOutput of remove_if, join, reduce/reduce_by_key,

22@sqreamtech SIL7138

Optimising KernelsThree example bottlenecks from SQream:

• reduce_by_key

• multikey sort

• internals of the join

Page 23: How to run SQL queries on TBs of data … using GPUson-demand.gputechconf.com/gtc-il/...how-to-run-sql-queries-on-tbs-of-data---using-gpus.pdfOutput of remove_if, join, reduce/reduce_by_key,

23@sqreamtech SIL7138

Strings are very difficult to deal with• compression

• can be large size per record

non big data specific GPU issues:

• variable length data

• collations

• unicode, ICU doesn't run on GPUs

Good solutions: we're still looking

Page 24: How to run SQL queries on TBs of data … using GPUson-demand.gputechconf.com/gtc-il/...how-to-run-sql-queries-on-tbs-of-data---using-gpus.pdfOutput of remove_if, join, reduce/reduce_by_key,

24@sqreamtech SIL7138

Find out more about SQream’s high performance GPU-driven database software

sqream.com