April 19, 2023 IDEAS 2011

Extend Core UDF Framework for GPU-Enabled Analytical Query Evaluation

Qiming Chen, Ren Wu, Meichun Hsu, Bin Zhang

HP Labs

Palo Alto, CA, USA

Problems

• Motivation: push analytics down to the DB layer for fast data access and reduced data movement

−This requires integrating analytic computation into the query pipeline using UDFs

• Existing UDFs cannot act as block operators with chunk-wise input; therefore they are

−unable to deal with application semantics definable on a set of incoming tuples (e.g. representing an object)

−unable to leverage external computation engines (e.g. GPUs) for efficient batch processing

Why We Need Block UDFs

• From a semantic point of view, many applications are definable on a set of tuples

−Minimum Spanning Tree (MST) computation is defined on a tuple-set representing a graph and returns a tuple-set representing the MST

• From a performance point of view, data should be handed to an external engine in batches rather than copied back and forth on a per-tuple basis

[Figure labels: Graph relation, a graph tuple-set, UDF, computation node (GPU, SAS server), an MST tuple-set, MST relation]

Solution: Set-In Set-Out (SISO) UDF

• Introduce a new kind of UDF, called Set-In Set-Out (SISO), as a block operator that processes the input tuples from the query processing pipeline chunk by chunk

−pools a chunk of input tuples

−dispatches them to GPUs or an analytic engine for batch computation

−materializes the computation results and then streams them out tuple by tuple to the query processing pipeline

[Figure labels: SISO UDF, pipelined input, pooling, GPU, materialize result, pipelined output]
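The pool, dispatch, and stream-out steps above can be sketched as a Python generator. This is only an illustration of the block-operator idea, not the authors' engine extension: `siso_operator`, `scale_by_chunk_max`, and the sample data are hypothetical names, the batch function stands in for a GPU kernel, and flushing a trailing partial chunk is an assumption the slides do not state.

```python
from typing import Callable, Iterable, Iterator, List, Tuple

Row = Tuple[float, ...]


def siso_operator(
    rows: Iterable[Row],
    chunk_size: int,
    batch_compute: Callable[[List[Row]], List[Row]],
) -> Iterator[Row]:
    """Pool rows into chunks, compute per chunk, stream results row by row."""
    pool: List[Row] = []
    for row in rows:
        pool.append(row)                    # pooling: buffer the pipelined input
        if len(pool) == chunk_size:
            # Dispatch the whole chunk; the returned list is the materialized
            # result, which is then streamed out one row at a time.
            yield from batch_compute(pool)
            pool = []
    if pool:                                # assumption: flush a trailing partial chunk
        yield from batch_compute(pool)


def scale_by_chunk_max(chunk: List[Row]) -> List[Row]:
    """Stand-in for a GPU kernel: it needs the whole chunk before emitting anything."""
    peak = max(max(abs(v) for v in row) for row in chunk) or 1.0
    return [tuple(v / peak for v in row) for row in chunk]


points = [(1.0, 2.0), (3.0, 4.0), (5.0, 10.0), (2.0, 8.0), (6.0, 3.0)]
result = list(siso_operator(points, chunk_size=2, batch_compute=scale_by_chunk_max))
```

The downstream consumer still sees one row at a time, which is what lets a batch computation sit inside a tuple-by-tuple pipeline.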

SISO Example: select vectorize(x,y,10) from point_table

• Build phase: on t1, …, t9, do “ETL” but return NULL; act like a scalar function

• Compute phase: on t10, act like a table function; FIRST CALL: run the computation on the 10 tuples

• Stream-out phase: Normal CALLs: return 1 result tuple per call, so the output is pipelined tuple by tuple again
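The three phases can be traced call by call with a small stateful object. The slides do not say what `vectorize` computes, so the class below is a hypothetical stand-in: `VectorizeSketch`, `on_tuple`, and `next_result` are invented names, and centering the chunk on its mean is an assumed placeholder for the real batch computation.

```python
from typing import List, Optional, Tuple

Point = Tuple[float, float]


class VectorizeSketch:
    """Hypothetical stand-in for vectorize(x, y, n); not the authors' implementation."""

    def __init__(self, n: int) -> None:
        self.n = n
        self.pool: List[Point] = []      # per-chunk state
        self.results: List[Point] = []   # materialized output of the compute phase

    def on_tuple(self, x: float, y: float) -> Optional[int]:
        """Build phase: pool the tuple; once the chunk is full, compute and
        return the number of result tuples that are ready to stream out."""
        self.pool.append((x, y))
        if len(self.pool) < self.n:
            return None                  # t1..t(n-1): scalar-like call, NULL result
        # Compute phase, triggered by tuple n. Assumed computation: center the chunk.
        mx = sum(p[0] for p in self.pool) / self.n
        my = sum(p[1] for p in self.pool) / self.n
        self.results = [(px - mx, py - my) for px, py in self.pool]
        self.pool = []
        return len(self.results)

    def next_result(self) -> Point:
        """Stream-out phase: one result tuple per normal call."""
        return self.results.pop(0)


udf = VectorizeSketch(10)
returns = [udf.on_tuple(float(i), float(2 * i)) for i in range(1, 11)]
streamed = [udf.next_result() for _ in range(returns[-1])]
```

Here `returns` holds nine `None` values followed by the result count, mirroring "return NULL on t1 to t9, act like a table function on t10".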

Comparison with Scalar and Table UDFs

• Scalar UDF

−1 tuple in, 1 value/tuple out (a tuple being a composite value)

−Access to per-function state and per-tuple state

• Table UDF

−1 tuple in, N tuples out

−Access to per-tuple (input) state and per-return state

• SISO UDF

−N tuples in, M values/tuples out

−Access to 4 levels of state: per-function, per-chunk, per-tuple (input), per-return

−Runs chunk by chunk; each chunk contains N tuples; returns nothing for tuples 1 through N-1 and returns a result set for the Nth tuple

Comparison with UDA

• Aggregate operator or user-defined aggregate (UDA)

−No general form of set output (except group-by)

−No chunk-wise semantics

• SISO

−Flexible forms of set output

−Chunk-wise semantics

Comparison with RVF

• Relation-valued function (RVF)

−Input relation initially as static data

−Input relation is loaded entirely rather than by chunks

• SISO

−Input tuple-set chunk by chunk along query processing

−Input tuple-set as dynamic data

Extending the Query Engine to Support SISO UDFs

• Support SISO as a block operator along the tuple-by-tuple query processing pipeline

−With hybrid behavior in processing a chunk of N tuples

  • For input tuples 1, …, N-1: like a scalar function, 1 call per input tuple, returning nothing

  • For tuple N: like a table function, multiple calls corresponding to that input tuple, returning a set

• Need to extend the UDF accessible states

• Need to extend the invocation pattern

UDF Memory Context

• A UDF is called multiple times in query processing

−In the FIRST_CALL a buffer can be initiated

−Then each NORMAL_CALL references and updates the buffer, which keeps state across multiple calls

−After the FINAL_CALL, the buffer is discarded

• The multi-call context differs between scalar and table UDFs

−For a scalar UDF, 1 call per input

−For a table UDF, N calls per input

• Therefore their memory contexts are different
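The buffer lifecycle across FIRST_CALL, NORMAL_CALL, and FINAL_CALL can be illustrated with a toy table UDF. Everything here is a sketch under assumptions: the `Call` enum, `table_udf_call`, and `run_table_udf` are hypothetical names, the dictionary plays the role of the memory context, and the driver function stands in for the query engine.

```python
from enum import Enum, auto
from typing import Dict, List, Optional


class Call(Enum):
    FIRST = auto()
    NORMAL = auto()
    FINAL = auto()


def table_udf_call(kind: Call, ctx: Dict[str, List[int]], n: int = 0) -> Optional[int]:
    """One invocation of a hypothetical table UDF that returns 0..n-1 for input n."""
    if kind is Call.FIRST:
        ctx["buffer"] = list(range(n))   # buffer initiated in the first call
        return None
    if kind is Call.NORMAL:
        return ctx["buffer"].pop(0)      # each normal call reads and updates the buffer
    ctx.pop("buffer")                    # after the final call the buffer is discarded
    return None


def run_table_udf(n: int) -> List[Optional[int]]:
    """Driver standing in for the query engine: n normal calls for one input tuple."""
    ctx: Dict[str, List[int]] = {}
    table_udf_call(Call.FIRST, ctx, n)
    out = [table_udf_call(Call.NORMAL, ctx) for _ in range(n)]
    table_udf_call(Call.FINAL, ctx)
    assert "buffer" not in ctx           # nothing survives past the final call
    return out


rows = run_table_udf(3)
```

A scalar UDF would make a single call per input and has no per-return buffer of this kind, which is why the two memory contexts differ.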

Extend UDF Accessible States

• SISO UDF: per-function state, per-chunk state, per-tuple state, per-return state

• Scalar UDF: per-function state, per-tuple state

• Table UDF: per-tuple state, per-return state

Extend Call Skeleton

SISO UDF

  Global First Call
    Per-chunk First Call
      Per-tuple single Call (no return)
      Per-tuple single Call (no return)
      Per-tuple single Call (no return)
      :
      Last-tuple First Call
        Normal Call (1 return)
        Normal Call (1 return)
        :
      Last-tuple Last Call
    Per-chunk Last Call
    :
    Per-chunk First Call
    Per-chunk Final Call
    Per-chunk First Call
    Per-chunk Final Call
    :

Scalar UDF

  Global First Call
    Per-tuple Normal Call (1 return)
    Per-tuple Normal Call (1 return)
    :
  Final call optional (system specific)

Table UDF

  Per-tuple First Call
    Normal Call (1 return)
    Normal Call (1 return)
    :
  Per-tuple Final Call
  :
  Per-tuple First Call
    Normal Call (1 return)
    Normal Call (1 return)
    :
  Per-tuple Final Call

SISO Call Skeleton Explained

SISO UDF

  Global First Call: set up function call global context for chunk-wise invocation (extend from fun-call node
    Per-chunk First Call: set up chunk-based buffer for pooling data
      Per-tuple single Call (no return): pool tuples (vectorizing), return null
      Per-tuple single Call (no return)
      Per-tuple single Call (no return)
      :
      Per-tuple First Call: pool the last tuple in the chunk, make the batch analytic computation
        Normal Call (1 return): return materialized results one tuple at a time
        Normal Call (1 return)
        :
      Per-tuple Final Call: advance the chunk-oriented tuple index, return null
    Per-chunk Final Call: rewind the chunk-oriented tuple index; clean up the buffer
    :
    Per-chunk First Call
    Per-chunk Final Call
    Per-chunk First Call
    Per-chunk Final Call
    :
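The nesting of the SISO call skeleton can be written out as a generator that emits the calls in order. This is a minimal sketch of the invocation pattern only: `siso_call_trace` and its parameters are hypothetical, and the number of result tuples per chunk is passed in rather than computed.

```python
from typing import Iterator


def siso_call_trace(num_chunks: int, chunk_size: int, results_per_chunk: int) -> Iterator[str]:
    """Yield the sequence of calls a SISO UDF would receive, outermost level first."""
    yield "global first call"                       # per-function context set up once
    for _ in range(num_chunks):
        yield "per-chunk first call"                # allocate the pooling buffer
        for _ in range(chunk_size - 1):
            yield "per-tuple single call (no return)"   # pool tuple, return null
        yield "per-tuple first call"                # pool last tuple, run batch computation
        for _ in range(results_per_chunk):
            yield "normal call (1 return)"          # stream one materialized result
        yield "per-tuple final call"
        yield "per-chunk final call"                # rewind index, clean up buffer


trace = list(siso_call_trace(num_chunks=2, chunk_size=3, results_per_chunk=2))
```

For two chunks of three tuples with two results each, the trace contains one global call followed by eight calls per chunk.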

Integrate Query Processing with GPU Computation Using SISO UDFs

• Using General-Purpose GPUs (GPGPU) to accelerate analytic query processing lets us leverage both SQL’s analysis power and the GPU’s computation power

• However, their operational patterns are different

−GPU computation is batch processing with data parallelism

−Query processing is pipelined tuple by tuple

• We solve this problem by using SISO UDFs in queries

−To handle batch GPU computation in the query dataflow pipeline

Experiment on Accelerating K-Means Clustering of Very Large Data Sets

• K-Means clustering is an iterative process; in each iteration

−each point is assigned to the nearest cluster center as a member of that cluster

−then, for each center, its coordinates are recalculated as the “mean” of the coordinates of its member points

• The process is repeated until convergence is achieved

[Figure labels: Init. Centers, Assign Center, Calc Centers, Convergence Check, Done]
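The assign, recalculate, and convergence-check loop described above can be sketched in plain Python for 2-D points. This is a small CPU illustration, not the experimental code: the function names, the `tol` threshold, the iteration cap, and the rule that an empty cluster keeps its old center are all assumptions.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def nearest_center(p: Point, centers: List[Point]) -> int:
    """Index of the center with the smallest squared Euclidean distance to p."""
    return min(
        range(len(centers)),
        key=lambda c: (p[0] - centers[c][0]) ** 2 + (p[1] - centers[c][1]) ** 2,
    )


def kmeans(points: List[Point], centers: List[Point],
           max_iter: int = 100, tol: float = 1e-9) -> List[Point]:
    for _ in range(max_iter):
        # Assign step: accumulate coordinate sums and member counts per center.
        sums = [[0.0, 0.0, 0] for _ in centers]
        for p in points:
            cid = nearest_center(p, centers)
            sums[cid][0] += p[0]
            sums[cid][1] += p[1]
            sums[cid][2] += 1
        # Update step: a center with no members keeps its old position (assumption).
        updated = [
            (sx / n, sy / n) if n else centers[i]
            for i, (sx, sy, n) in enumerate(sums)
        ]
        # Convergence check: largest coordinate change across all centers.
        shift = max(
            max(abs(u[0] - c[0]), abs(u[1] - c[1]))
            for u, c in zip(updated, centers)
        )
        centers = updated
        if shift <= tol:
            break
    return centers


final_centers = kmeans(
    [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)],
    [(1.0, 1.0), (9.0, 1.0)],
)
```

The assign step touches every point independently, which is the data-parallel part that suits batch execution on a GPU.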

Single Iteration of K-Means by SQL and SISO UDF

    SELECT (p).cid, AVG((p).x) AS cx, AVG((p).y) AS cy FROM (
        SELECT assign_center_siso(x, y, 'SELECT * FROM Centers', N) AS p
        FROM Points ) r
    GROUP BY (p).cid;

[Figure: Points (xp,yp) are fed chunk-wise into the assign_center() SISO UDF, which loads Centers (cid,xc,yc) initially and outputs (cid,xp,yp); AVG with GROUP BY then yields the new (cid,xc,yc)]
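The dataflow of this query can be mimicked in Python: the inner SELECT becomes a chunk-wise generator and the outer SELECT becomes a group-by average. This is a sketch under assumptions, not the UDF itself: `assign_chunk` and `kmeans_iteration` are hypothetical names, the centers are passed as a dictionary instead of being loaded by a query string, and a trailing partial chunk is assumed to be processed.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

Point = Tuple[float, float]
Assigned = Tuple[int, float, float]   # (cid, x, y), the composite value p in the query


def assign_chunk(chunk: List[Point], centers: Dict[int, Point]) -> List[Assigned]:
    """Batch step for one chunk; this loop is what a GPU kernel would parallelize."""
    out: List[Assigned] = []
    for x, y in chunk:
        cid = min(
            centers,
            key=lambda c: (x - centers[c][0]) ** 2 + (y - centers[c][1]) ** 2,
        )
        out.append((cid, x, y))
    return out


def assign_center_siso(points: Iterable[Point], centers: Dict[int, Point],
                       n: int) -> Iterator[Assigned]:
    """Inner SELECT: pool n points, assign them in one batch, stream the rows out."""
    chunk: List[Point] = []
    for p in points:
        chunk.append(p)
        if len(chunk) == n:
            yield from assign_chunk(chunk, centers)
            chunk = []
    if chunk:                          # assumption: a trailing partial chunk is processed
        yield from assign_chunk(chunk, centers)


def kmeans_iteration(points: Iterable[Point], centers: Dict[int, Point],
                     n: int) -> Dict[int, Point]:
    """Outer SELECT: AVG(x), AVG(y) grouped by cid."""
    acc = defaultdict(lambda: [0.0, 0.0, 0])
    for cid, x, y in assign_center_siso(points, centers, n):
        acc[cid][0] += x
        acc[cid][1] += y
        acc[cid][2] += 1
    return {cid: (sx / k, sy / k) for cid, (sx, sy, k) in acc.items()}


new_centers = kmeans_iteration(
    [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)],
    {0: (1.0, 1.0), 1: (9.0, 1.0)},
    n=3,
)
```

The aggregation consumes the assigned rows one at a time, so the chunk size only affects how the assignment is batched, not the result.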

Experiment Results Comparison

• We compare the performance of

−a scalar UDF-wrapped, CPU-based implementation

−a SISO UDF-wrapped, CPU-based implementation

−a SISO UDF-wrapped, GPU-accelerated implementation

Overall end-to-end query performance: Scalar UDF/CPU vs. SISO/CPU vs. SISO/GPU

Query                                          10M points (s)   100M points (s)
Q1: generalized scalar UDF (tuple by tuple)            155.45           1845.66
Q2: SISO UDF in 1M chunks computed by CPU              145.02           1541.41
Q3: SISO UDF in 1M chunks computed by GPGPU             27.41            345.01
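The relative gains implied by these timings can be worked out directly. The snippet below only divides the reported numbers by each other; the dictionary layout and variable names are illustrative.

```python
# End-to-end times in seconds, as reported: (10M points, 100M points).
timings = {
    "Q1 scalar UDF / CPU": (155.45, 1845.66),
    "Q2 SISO UDF / CPU": (145.02, 1541.41),
    "Q3 SISO UDF / GPGPU": (27.41, 345.01),
}

baseline = timings["Q1 scalar UDF / CPU"]

# Speedup of each query relative to the tuple-by-tuple scalar UDF.
speedups = {
    name: tuple(round(b / t, 2) for b, t in zip(baseline, secs))
    for name, secs in timings.items()
}

for name, (s10, s100) in speedups.items():
    print(f"{name}: {s10}x at 10M points, {s100}x at 100M points")
```

On these figures, chunking alone on the CPU gives a modest gain over the scalar UDF, while the GPU-backed SISO UDF is roughly five to six times faster end to end.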

Scalar UDF vs. SISO UDF

• The number of clusters is set to 1000

• The number of data points ranges from 1M to 100M

• The chunk size is fixed at 1M

−Beyond 1M (1000K), the performance gain gradually diminishes with further increases in chunk size

Conclusions

• In-DB analytics has been extensively investigated but has not yet become a scalable approach

• An important reason is the lack of block UDFs that can deal with application semantics definable on a set of tuples and leverage external computation units such as GPUs for efficient batch processing

• To solve this problem, we developed SISO as a new kind of UDF

• Integrating SISO with parallel DBs is under further investigation