PL/CUDA ~Fusion of HPC Grade Power with In-Database Analytics~

The PG-Strom Project / NEC OSS Promotion Center

KaiGai Kohei <[email protected]>

about myself...

PGconf.SV 2016 - PL/CUDA / Fusion of HPC Grade Power with In-Database Analytics 2

▌KaiGai Kohei

tw: @kkaigai

https://github.com/kaigai

▌PostgreSQL

SELinux, FDW, CustomScan, ...

▌PG-Strom

GPU acceleration for PostgreSQL

▌Works

NEC OSS Promotion Center

Development of the software and its business opportunity

PG-Strom Overview (1/2) – Architecture

[Architecture diagram: Application; SQL Parser; Query Optimizer; Query Executor; Storage Manager; Storage; PG-Strom Extension; GPU. Callouts: no storage changes, no query changes.]

Features

• automatic GPU code generation from the supplied SQL

• asynchronous & massive parallel execution on GPUs

• WHERE-clause, JOIN, GROUP BY, and projection are supported

Advantages

• transparent acceleration by the power of thousands of cores

• low-cost solution for analytic processing

Characteristics of GPU (Graphics Processing Unit)

                     GPU                     CPU
Model                NVIDIA Tesla P100       Intel Xeon E5-2699v4
Architecture         Pascal                  Broadwell
Launch               Q2-2016                 Q1-2016
# of transistors     15 billion              7.2 billion
# of cores           3584 (simple)           22 (functional)
Core clock           1.126GHz ~ 1.303GHz     2.20GHz ~ 3.60GHz
Peak FLOPS (FP32)    9.3 TFLOPS              1.2 TFLOPS (with AVX2)
DRAM size            16GB (HBM2)             max 1.5TB (DDR4)
Memory bandwidth     732GB/s                 76.8GB/s
Power consumption    250W                    145W

PG-Strom Overview (2/2) – GPU binary generation on the fly

QUERY: SELECT cat, count(*), avg(x) FROM t0 WHERE x between y and y + 20.0 GROUP BY cat;

E.g.) Arithmetic operations in the WHERE clause are transformed into a CUDA program:

    :
  STATIC_FUNCTION(bool)
  gpupreagg_qual_eval(kern_context *kcxt,
                      kern_data_store *kds,
                      size_t kds_index)
  {
    pg_float8_t KPARAM_1 = pg_float8_param(kcxt,1);
    pg_float8_t KVAR_3 = pg_float8_vref(kds,kcxt,2,kds_index);
    pg_float8_t KVAR_4 = pg_float8_vref(kds,kcxt,3,kds_index);

    return EVAL((pgfn_float8ge(kcxt, KVAR_3, KVAR_4) &&
                 pgfn_float8le(kcxt, KVAR_3, pgfn_float8pl(kcxt, KVAR_4, KPARAM_1))));
  }
    :

Callouts: reference to input data; SQL expression in CUDA source code.

Flow: run-time compiler (nvrtc) → just-in-time compile → parallel execution

GPU accelerates SQL performance

▌Test Query: SELECT cat, count(*), avg(x) FROM t0 NATURAL JOIN t1 [NATURAL JOIN t2 ...] GROUP BY cat;

t0 contains 100M rows; t1...t8 each contain 100K rows (like a star schema)

[Chart: PG-Strom microbenchmark with JOIN/GROUP BY. X-axis: number of joined tables. Y-axis: query response time [sec].]

Number of joined tables   2      3      4      5       6       7       8       9
PostgreSQL v9.5           40.44  62.79  79.82  100.50  119.51  144.55  201.12  248.95
PG-Strom v1.0             9.96   9.93   9.96   9.99    10.02   10.03   9.98    10.00

CPU: Xeon E5-2670v3, GPU: GTX1080, RAM: 384GB, OS: CentOS 7.2, DB: PostgreSQL 9.5 + PG-Strom v1.0

Feedback from users during v1.0 development

[Architecture diagram: Application; SQL Parser; Query Optimizer; Query Executor; Storage Manager; Storage; PG-Strom Extension; GPU]

Compute-intensive workloads
• In-database analytics
• Scientific R&D, marketing, ...
⇒ addressed by PL/CUDA + Matrix-Array

I/O-intensive workloads
• Generic large OLAP
• ETL, reporting, ...
⇒ addressed by SSD-to-GPU P2P DMA

Introduction of PL/CUDA

Our Failure (1/3) – Write an algorithm logic in SQL

Apr-2016

Our Failure (2/3) – Performance benefit (?)

Apr-2016

Our Failure (3/3) – Problems

▌Problem.1 – Who writes algorithms in SQL?

Most mathematical algorithms are designed in the style of procedural programming languages.

Users are spared from writing the algorithm in CUDA, but they still have to express it as an SQL puzzle.

▌Problem.2 – Performance Benefit

Yes, PG-Strom executes the core logic of the min-max method much faster than PostgreSQL. It is an excellent result, but nobody knows how many people actually run that algorithm on PostgreSQL.

The performance of GpuProjection is almost equivalent to a CPU implementation designed for the particular problem. Why?

• Inefficient code due to SQL compatibility

• Inefficient data format due to PostgreSQL’s row format

Our Answer – PL/CUDA + Matrix-like Array

CREATE FUNCTION my_logic(matrix, matrix)
RETURNS vector
AS $$
$$ LANGUAGE 'plcuda';

The user-defined CUDA code block goes between the $$ markers.

[Diagram: Storage → load the function’s arguments → GPU kernel running the user-defined CUDA code block → write back the result set → post SQL process (tables JOIN, window function, ORDER BY, GROUP BY, etc.)]

PL/CUDA method for manual optimization

Matrix-like Array: a 2D array without NULL, used to represent a matrix

Memory layout: ArrayType header | a1 a2 … aN | b1 b2 … bN | c1 c2 … cN | d1 d2 … dN

\[
\begin{pmatrix}
a_1 & \cdots & d_1 \\
\vdots & \ddots & \vdots \\
a_N & \cdots & d_N
\end{pmatrix}
\]

Matrix of 4 cols x N rows

Example of PL/CUDA function definition

CREATE OR REPLACE FUNCTION
knn_gpu_similarity(int, int[], int[])
RETURNS float4[]
AS $$
#plcuda_begin
  cl_int k = arg1.value;
  MatrixType *Q = (MatrixType *) arg2.value;
  MatrixType *D = (MatrixType *) arg3.value;
  MatrixType *R = (MatrixType *) results;
    :
  nloops = (ARRAY_MATRIX_HEIGHT(Q) + (part_sz - k - 1)) / (part_sz - k);
  for (loop=0; loop < nloops; loop++) {
    /* 1. calculation of the similarity */
    for (i = get_local_id(); i < part_sz * part_nums; i += get_local_size()) {
      j = i % part_sz;   /* index within partition */
      /* index of database matrix (D) */
      dindex = part_nums * get_global_index() + (i / part_sz);
      /* index of query matrix (Q) */
      qindex = loop * (part_sz - k) + (j - k);
      values[i] = knn_similarity_compute(D, dindex, Q, qindex);
    }
  }
    :
#plcuda_end
$$ LANGUAGE 'plcuda';

The text between #plcuda_begin and #plcuda_end is the CUDA code block.

How GPU Kernels are built

CREATE OR REPLACE FUNCTION my_cuda_func(float[])
RETURNS int[]
AS $$
#plcuda_sanity_check func_sc
#plcuda_begin
#plcuda_end
#plcuda_working_bufsz func_wb
$$ LANGUAGE 'plcuda';

The user-defined CUDA code block goes between #plcuda_begin and #plcuda_end.

• bool func_sc(float[]): helper function for sanity check
• bigint func_wb(float[]): helper function for buffer size estimation

[Diagram: the source program of the GPU code consists of the user-defined CUDA code block plus common library routines (including input/output of SQL data); the run-time compiler turns it into a GPU binary; the GPU kernel loads the arguments and runs with the input arguments and a working buffer.]

Why automatically generated code was not the best for performance

• NULL checks for each variable reference
• Overflow checks for each primitive operator
• Function calls instead of primitive operators

STATIC_FUNCTION(pg_float4_t)
pgfn_float4mul(kern_context *kcxt, pg_float4_t arg1, pg_float4_t arg2)
{
  pg_float4_t result;

  result.isnull = arg1.isnull | arg2.isnull;
  if (!result.isnull)
  {
    result.value = arg1.value * arg2.value;
    CHECKFLOATVAL(&kcxt->e, result,
                  isinf(arg1.value) || isinf(arg2.value),
                  arg1.value == 0.0 || arg2.value == 0.0);
  }
  return result;
}

select x*y from c_test;
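To make the overhead concrete, here is a minimal CPU-side sketch in Python of what an SQL-compatible multiply has to do compared with a bare `x * y`. The names `NullableFloat`, `checked_mul` and `primitive_mul` are hypothetical stand-ins invented for this illustration; they are not PG-Strom APIs, and the real device code records an error code in the kernel context rather than raising an exception.

```python
import math
from dataclasses import dataclass


@dataclass
class NullableFloat:
    """Hypothetical stand-in for pg_float4_t: a value plus a NULL flag."""
    value: float = 0.0
    isnull: bool = False


def checked_mul(arg1, arg2):
    """SQL-compatible multiply: NULL propagation plus overflow/underflow checks."""
    result = NullableFloat(isnull=arg1.isnull or arg2.isnull)
    if not result.isnull:
        result.value = arg1.value * arg2.value
        # An infinite result is only valid if an input was already infinite.
        if math.isinf(result.value) and not (
            math.isinf(arg1.value) or math.isinf(arg2.value)
        ):
            raise OverflowError("value out of range: overflow")
        # A zero result is only valid if an input was already zero.
        if result.value == 0.0 and not (arg1.value == 0.0 or arg2.value == 0.0):
            raise ArithmeticError("value out of range: underflow")
    return result


def primitive_mul(x, y):
    """What hand-written code can do when the data is known to be NULL-free."""
    return x * y
```

Every row pays for the extra branches in `checked_mul`, which is the cost a hand-written kernel can skip when it knows its input is a NULL-free matrix.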

Disadvantage because of data format (1/2)

▌Row-oriented data

× includes unreferenced values

× many steps for data references

〇 common data format with PostgreSQL

▌Column-oriented data

〇 load only referenced variables

〇 data reference by 1 step

× needs data format exchange

[Diagram: with row-oriented data, each GPU core walks a whole row (a, b, c, d, e, f) to reach the referenced values; with column-oriented data, the GPU cores read only the referenced columns (here b and e).]

An ordinary SQL workload cannot justify the cost of data transformation, but advanced algorithm processing benefits from the columnar format.

Disadvantage because of data format (2/2)

▌Case of random memory access

More memory transactions, but a lower usage rate of the data bus.

Only 32bit x 1 = 32bit is valid data in a 256bit-wide memory transaction (bus usage ratio: 12.5%)

▌Case of coalesced memory access

The fewest memory transactions, and the maximum usage rate of the data bus.

32bit x 8 = 256bit is valid data in a 256bit-wide memory transaction (bus usage ratio: 100.0%)
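The two bus usage ratios quoted on this slide follow from simple arithmetic. The sketch below reproduces them in Python; the function name `bus_usage_ratio` and its parameters are made up for illustration, with the 32bit word size and 256bit transaction width taken from the slide.

```python
def bus_usage_ratio(valid_words, word_bits=32, transaction_bits=256):
    """Percentage of one memory transaction that carries requested data."""
    valid_bits = valid_words * word_bits
    if valid_bits > transaction_bits:
        raise ValueError("more valid data than fits in one transaction")
    return 100.0 * valid_bits / transaction_bits


# Random access: only one 32bit word per transaction is useful.
random_access = bus_usage_ratio(1)
# Coalesced access: eight neighboring 32bit words fill the transaction.
coalesced_access = bus_usage_ratio(8)
```

Under these assumptions, random access uses one eighth of the bus, which is why a layout that keeps neighboring threads on neighboring addresses matters so much.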

2D-Array as Matrix

▌datatype[] array_matrix(variadic datatype[])

An aggregate function that constructs a 2D-array from the input stream.

datatype is one of int2, int4, int8, float4 or float8.

The 2D-array does not contain NULL.

▌SETOF record matrix_unnest(datatype[])

Deforms a 2D-array into a stream of multiple records.

▌Downside

Unable to handle variable-length data types

1GB limit of the varlena data type in PostgreSQL

Matrix-like Array: a 2D array without NULL, used to represent a matrix

Memory layout: ArrayType header | a1 a2 … aN | b1 b2 … bN | c1 c2 … cN | d1 d2 … dN

\[
\begin{pmatrix}
a_1 & \cdots & d_1 \\
\vdots & \ddots & \vdots \\
a_N & \cdots & d_N
\end{pmatrix}
\]

Matrix of 4 cols x N rows
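The layout above stores each column contiguously (all a values, then all b values, and so on). As a rough illustration only, the Python sketch below packs rows into such a column-major buffer and unpacks them again. The helper names `to_column_major` and `from_column_major` are hypothetical; they loosely mirror the roles of array_matrix and matrix_unnest but say nothing about how PG-Strom implements them.

```python
def to_column_major(rows):
    """Pack equal-length rows into (nrows, ncols, flat), column by column."""
    rows = [tuple(r) for r in rows]
    nrows = len(rows)
    ncols = len(rows[0]) if rows else 0
    flat = []
    for col in range(ncols):
        for row in rows:
            if len(row) != ncols:
                raise ValueError("all rows must have the same width")
            if row[col] is None:
                raise ValueError("a matrix-like array cannot contain NULL")
            flat.append(row[col])
    return nrows, ncols, flat


def from_column_major(nrows, ncols, flat):
    """Rebuild the row tuples; element (row, col) lives at col * nrows + row."""
    return [
        tuple(flat[col * nrows + row] for col in range(ncols))
        for row in range(nrows)
    ]
```

With this indexing, consecutive rows of the same column sit at consecutive addresses, which is the access pattern coalesced memory transactions reward.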

Example to call PL/CUDA function

SELECT row_number() OVER (),
       float4_as_int4(R.key_id) key_id,
       R.score
  FROM matrix_unnest(
         (SELECT my_plcuda_function(A.matrix, B.matrix)
            FROM (SELECT cbind(array_matrix(id),
                               array_matrix(x, y, z)) matrix
                    FROM normal_table
                   WHERE tag LIKE '%abc%') A,
                 (SELECT matrix
                    FROM matrix_table) B
         )
       ) AS R(key_id real, score real)
 ORDER BY score DESC
 LIMIT 1000;

Callouts:
• Construct, or load a pre-built, array-matrix
• Invocation of the PL/CUDA function with two matrix arguments
• Deform the array-matrix into generic records
• Post-process by SQL (JOIN, window function)

Case Study

similarity search on drug discovery

Background – relationship of disease and chemical compounds

[Diagram: target disease; relevant protein; chemical compounds (= drug candidates) labeled "inactive", "active", and "active, but toxic"; academic papers.]

Discovery of chemical compounds which are “active” against the target protein

k-NN Similarity Search on Chemical Compounds (1/2)

Database chemical compounds set (D; 10M records scale)

Query chemical compounds set (Q; ~1000 records scale)

The query set picks up, from academic papers, chemical compounds that are active against the target protein.

Search by similarity: “similar compounds” are more likely to be active as well.

k-NN Similarity Search on Chemical Compounds (2/2)

Similarity is the definition of distance.

Data structure of the chemical compounds:

ID  NAME          Fingerprint (1024bit)
1   CHEMBL153534  000000000001000000100000000000000100000000000001000000...
2   CHEMBL405398  000000000000000100100000000000000000000000000000100000...
3   CHEMBL503634  000001000000000000000000001000000100000000000000000000...
:   :             :

Similarity by Jaccard index:

\[
\mathrm{Similarity}(A, B) = \frac{|A_{FP} \cap B_{FP}|}{|A_{FP} \cup B_{FP}|}
\]
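As a small worked example of the Jaccard index on bit fingerprints, the Python sketch below treats each fingerprint as an integer whose set bits are the features. The function names are chosen for this illustration, and returning 0.0 for two empty fingerprints is an assumption; the slides do not say how that case is handled.

```python
def popcount(x):
    """Number of set bits in a non-negative integer."""
    return bin(x).count("1")


def jaccard_similarity(fp_a, fp_b):
    """|A AND B| / |A OR B| over the set bits of two fingerprints."""
    union = popcount(fp_a | fp_b)
    if union == 0:
        # Assumption: two empty fingerprints are treated as dissimilar.
        return 0.0
    return popcount(fp_a & fp_b) / union
```

For instance, the fingerprints 1100 and 0110 share one set bit out of three distinct set bits, giving a similarity of 1/3.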

Scale of the computing

Database chemical compounds set (D; 10M records scale)

Q: Query chemical compounds set

For each compound d_i (d_j, ...) of the D-compounds: compute the distance between the Q-set and d_i, then take the average of the top-3.

Order of calculation:

\[
\underbrace{O(Q \times D)}_{\text{make distance}} + \underbrace{O(D \times Q \log Q)}_{\text{sorting + average}}
\]
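The scoring scheme and its cost can be sketched on the CPU as follows. This is a plain Python reference under stated assumptions, not the PL/CUDA implementation: fingerprints are integers, the database is a dict from a hypothetical compound id to its fingerprint, and `knn_scores` is a name invented here.

```python
def jaccard_similarity(fp_a, fp_b):
    """Jaccard index over the set bits of two integer fingerprints."""
    union = bin(fp_a | fp_b).count("1")
    return bin(fp_a & fp_b).count("1") / union if union else 0.0


def knn_scores(query_fps, database, k=3):
    """Score each D-compound by the average of its top-k similarities to Q."""
    if not query_fps:
        raise ValueError("the query set must not be empty")
    scores = []
    for compound_id, d_fp in database.items():
        # "make distance": one similarity per (Q, D) pair, O(Q x D) overall
        sims = [jaccard_similarity(q_fp, d_fp) for q_fp in query_fps]
        # "sorting + average": O(Q log Q) per D-compound
        top = sorted(sims, reverse=True)[:k]
        scores.append((compound_id, sum(top) / len(top)))
    # Most similar D-compounds first.
    return sorted(scores, key=lambda item: item[1], reverse=True)
```

The two commented steps correspond to the two terms of the order of calculation above.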

Implementation of PL/CUDA function (1/3)

Step-1: Split all the logical combinations of Q and D into multiple partitions, then assign them to the SMMs (the execution units of the GPU).

Step-2: Each GPU core calculates a similarity score between a Q-compound and a D-compound, then stores the score in “shared memory”, which is fast and close to the SMM.

Implementation of PL/CUDA function (2/3)

Step-3: Bitonic-sort by the similarity score, reordering the Q-compounds.

Step-4: Repeat from Step-2 if the number of Q-compounds is larger than the shared memory size. The top-k items are kept.

Step-5: Average the top-k values, then store the result in the result buffer.
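For readers unfamiliar with the bitonic sorting used in Step-3, here is a sequential Python sketch of the sorting network. It is an illustration of the general technique, not the deck's `knn_similarity_sorting` routine; on a GPU, each pass of the inner loop over `i` would be executed by the threads in parallel with a barrier between passes. The helper `top_k_average` is likewise a hypothetical name for Step-5.

```python
def bitonic_sort(values):
    """Ascending bitonic sort; the input length must be a power of two."""
    a = list(values)
    n = len(a)
    if n == 0 or n & (n - 1) != 0:
        raise ValueError("length must be a power of two")
    k = 2
    while k <= n:              # size of the bitonic sequences being merged
        j = k // 2
        while j >= 1:          # distance between compared elements
            for i in range(n):     # one GPU thread per i in a real kernel
                partner = i ^ j
                if partner > i:
                    ascending = (i & k) == 0
                    if (ascending and a[i] > a[partner]) or (
                        not ascending and a[i] < a[partner]
                    ):
                        a[i], a[partner] = a[partner], a[i]
            j //= 2
        k *= 2
    return a


def top_k_average(values, k):
    """Average of the k largest values, using the bitonic sort above."""
    top = bitonic_sort(values)[::-1][:k]
    return sum(top) / len(top)
```

The fixed, data-independent comparison pattern is what makes this sort a good fit for many threads working on shared memory.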

Implementation of PL/CUDA function (3/3)

CREATE OR REPLACE FUNCTION
knn_gpu_similarity(int,      -- k-value
                   int[],    -- ID+bitmap of Q
                   int[])    -- ID+bitmap of D
RETURNS float4[]             -- result: ID+similarity
AS $$
#plcuda_decl
    :
#plcuda_begin
#plcuda_kernel_blocksz \
    knn_gpu_similarity_main_block_size
#plcuda_num_threads \
    knn_gpu_similarity_main_num_threads
#plcuda_shmem_blocksz 8192
  cl_int k = arg1.value;
  MatrixType *Q = (MatrixType *) arg2.value;
  MatrixType *D = (MatrixType *) arg3.value;
  MatrixType *R = (MatrixType *) results;
    :
  for (loop=0; loop < nloops; loop++)
  {
    /* 1. calculation of the similarity */
    for (i = get_local_id(); i < part_sz * part_nums; i += get_local_size())
    {
      j = i % part_sz;   /* index within partition */
      dindex = part_nums * get_global_index() + (i / part_sz);
      qindex = loop * (part_sz - k) + (j - k);
      if (dindex < ARRAY_MATRIX_HEIGHT(D) &&
          qindex < ARRAY_MATRIX_HEIGHT(Q))
      {
        values[i] = knn_similarity_compute(D, dindex, Q, qindex);
      }
    }
    __syncthreads();

    /* 2. sorting by the similarity for each partition */
    knn_similarity_sorting(values, part_sz, part_nums);
    __syncthreads();
      :
  }
#plcuda_end
#plcuda_sanity_check knn_gpu_similarity_sanity_check
#plcuda_working_bufsz 0
#plcuda_results_bufsz knn_gpu_similarity_results_bufsz
$$ LANGUAGE 'plcuda';

Signature:

real[]                        -- ID+Similarity of D-compounds (2xN)
knn_gpu_similarity(int k,     -- k-value
                   int[] Q,   -- ID+Fingerprint of Q-compounds (33xM)
                   int[] D);  -- ID+Fingerprint of D-compounds (33xN)

Invocation of PL/CUDA function

PREPARE knn_sim_rand_10m_gpu_v2(int)   -- arg1:@k-value
AS
SELECT row_number() OVER (),
       fp.name,
       similarity
  FROM (SELECT float4_as_int4(key_id) key_id, similarity
          FROM matrix_unnest(
                 (SELECT rbind(knn_gpu_similarity($1, Q.matrix, D.matrix))
                    FROM (SELECT cbind(array_matrix(id),
                                       array_matrix(bitmap)) matrix
                            FROM finger_print_query) Q,
                         (SELECT matrix
                            FROM finger_print_10m_matrix) D
                 )
               ) AS sim(key_id real, similarity real)
         ORDER BY similarity DESC) sim,
       finger_print_10m fp
 WHERE fp.id = sim.key_id
 LIMIT 1000;

Callouts:
• Transform the records read from tables into the Array-Matrix type. (Pre-building is also possible.)
• Execute the PL/CUDA function with the Q-/D-matrix as arguments.
• Transform the Array-Matrix (2xN) returned by the PL/CUDA function into N ordinary records.
• Post-process by SQL, such as looking up compound names by compound id (tables JOIN) and ranking by similarity with a window function.

Performance results

• For comparison with the CPU case, we implemented an equivalent SQL function in C.
• # of D-compounds is 10M records; # of Q-compounds is 10, 50, 100, 500 and 1000.
• Up to 10B combinations are searched, almost the same scale as real drug discovery research.

HW) CPU: Xeon E5-2670v3, GPU: GTX980 / GTX1080, RAM: 384GB
SW) CentOS7, CUDA8.0, PostgreSQL v9.5 + PG-Strom v1.0

[Chart: Similarity search of chemical compounds by k-NN method (k=3, D=10M). X-axis: number of query compounds [Q]. Y-axis: query response time [sec].]

Number of query compounds [Q]   10     50      100     500      1000
CPU (E5-2670v3)                 30.25  145.29  295.95  1503.31  3034.94
GTX980                          12.97  13.46   13.90   18.16    24.65
GTX1080                         13.00  13.23   13.59   16.01    19.13

About 150 times faster!
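The headline speedup can be checked with a few lines of arithmetic. The Python sketch below assumes the chart's three series map, in legend order, to the CPU, the GTX980 and the GTX1080; the dictionaries simply restate the CPU and GTX1080 response times read from the chart.

```python
# Response times in seconds, keyed by the number of query compounds (Q).
cpu_seconds = {10: 30.25, 50: 145.29, 100: 295.95, 500: 1503.31, 1000: 3034.94}
gtx1080_seconds = {10: 13.00, 50: 13.23, 100: 13.59, 500: 16.01, 1000: 19.13}

# Speedup of the GTX1080 run over the CPU run for each query-set size.
speedup = {q: cpu_seconds[q] / gtx1080_seconds[q] for q in cpu_seconds}
```

Under that reading, the ratio grows with the size of the query set, from a little over 2x at Q=10 to the roughly 150x figure at Q=1000, since the GPU time is dominated by a nearly constant overhead.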

Another Usage

k-means clustering in database system

Clustering Analysis

k-means clustering algorithm

1. Assign clusters randomly

2. Make a centroid for each cluster

3. Choose the nearest cluster from the centroids

k-means clustering algorithm

4. Repeat until convergence

5. Make a centroid for each cluster again

6. All the clusters are fully converged
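The steps of the algorithm can be sketched in a few lines of plain Python. This is a CPU-side illustration only, not the gpu_kmeans implementation. One deliberate deviation is labeled in the code: the initial assignment is round-robin rather than random, so that the example is deterministic. The sketch also assumes there are at least k points.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, max_iter=10):
    """Return (labels, centroids) after at most max_iter iterations."""
    if len(points) < k:
        raise ValueError("need at least k points")
    dims = len(points[0])
    # Step 1 (assumption: round-robin instead of random, for determinism).
    labels = [i % k for i in range(len(points))]
    centroids = [None] * k
    for _ in range(max_iter):
        # Steps 2 and 5: make a centroid for each cluster.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # an emptied cluster keeps its previous centroid
                centroids[c] = tuple(
                    sum(p[d] for p in members) / len(members) for d in range(dims)
                )
        # Step 3: move each point to the cluster with the nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Steps 4 and 6: stop once no point changes its cluster.
        if new_labels == labels:
            break
        labels = new_labels
    return labels, centroids
```

On two well-separated groups of points, this converges after two iterations with each group in its own cluster.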

k-means clustering on PL/CUDA

CREATE OR REPLACE FUNCTION
gpu_kmeans(real[],    -- ID + Data Matrix
           int,       -- k-value (number of clusters)
           int = 10,  -- max number of iteration
           int = 1)   -- seed of initial randomness
RETURNS int[]
AS $$
#plcuda_decl
    :
KERNEL_FUNCTION_MAXTHREADS(void)
update_centroid(MatrixType *D,
                MatrixType *R,
                MatrixType *C)
{
    :
  /* accumulate the local centroid */
  for (did = get_global_id(); did < nitems; did += get_global_size())
  {
    /* pick up the target cluster */
    cid = r_values[nitems + did];
    atomicAdd(&l_cent[cid], 1.0);
    for (index=1; index < width; index++)
      atomicAdd(&l_cent[index * k_value + cid],
                d_values[index * nitems + did]);
  }
  __syncthreads();
  /* write back to the global C-matrix */
  for (index = get_local_id(); index < width * k_value; index += get_local_size())
    atomicAdd(&c_values[index], l_cent[index]);
}
    :
#plcuda_begin
    :
  status = pgstromLaunchDynamicKernel4((void *) setup_initial_cluster,
                                       (kern_arg_t)(D),
                                       (kern_arg_t)(R),
                                       (kern_arg_t)(C),
                                       (kern_arg_t)(r_seed),
                                       nitems, 0, 0);
  if (status != cudaSuccess)
    PLCUDA_RUNTIME_ERROR_RETURN(status);
  for (loop=0; loop < nloops; loop++)
  {
      :
    status = pgstromLaunchDynamicKernelMaxThreads3(
                 (void *)kmeans_update_cluster,
                 (kern_arg_t)(D),
                 (kern_arg_t)(R),
                 (kern_arg_t)(C),
                 (kern_arg_t)k_value,
                 nitems, 0,
                 sizeof(cl_int) + sizeof(cl_float));
    if (status != cudaSuccess)
      PLCUDA_RUNTIME_ERROR_RETURN(status);
      :
  }
#plcuda_sanity_check gpu_kmeans_sanity_check
#plcuda_working_bufsz gpu_kmeans_working_bufsz
#plcuda_results_bufsz gpu_kmeans_results_bufsz
#plcuda_end
$$ LANGUAGE 'plcuda';

Test data for k-means clustering

▌Dataset overview

A collection of datasets of vehicle traffic, observed between two points for a set duration of time over a period of 6 months

449 observation points in Aarhus, Denmark

13.5M records from Feb to June of 2014

▌Data contains

• average speed
• average measured time
• number of vehicles
• latitude/longitude of the observation points
• etc...

▌What we did

Categorized the road zones into 5 classes according to the characteristics of the vehicles’ running style.

Invocation of GPU k-means function

SELECT report_id, k, c
  FROM (SELECT report_id, k, c,
               row_number() OVER (PARTITION BY report_id
                                  ORDER BY c DESC) rank
          FROM (SELECT report_id, k, count(*) c
                  FROM matrix_unnest(
                         (SELECT gpu_kmeans(array_matrix(
                                              int4_as_float4(report_id),
                                              avg_measured_time,
                                              avg_speed,
                                              vehicle_count),
                                            5)
                            FROM tr_rawdata)
                       ) R(report_id int, k int)
                 GROUP BY report_id, k
               ) __summary_1
       ) __summary_2
 WHERE rank = 1;

Callouts:
• Make a matrix from the raw data
• Run the k-means clustering logic
• Pick up the most frequent cluster

GPU k-means (1/3) – clustering by all the data

$ wget -O map.png "`psql traffic -At -f ~/traffic.sql`"

bypass highway?

Road towards downtown?

Beltway?

GPU k-means (2/3) – Daytime and Nighttime

daytime (8-17) nighttime (18-7)

GPU k-means (3/3) – Weekdays and Weekend

weekdays weekend

Invocation of GPU k-means function

SELECT report_id, k, c
  FROM (SELECT report_id, k, c,
               row_number() OVER (PARTITION BY report_id
                                  ORDER BY c DESC) rank
          FROM (SELECT report_id, k, count(*) c
                  FROM matrix_unnest(
                         (SELECT gpu_kmeans(array_matrix(
                                              int4_as_float4(report_id),
                                              avg_measured_time,
                                              avg_speed,
                                              vehicle_count),
                                            5)
                            FROM tr_rawdata
                           WHERE extract('hour' from timestamp) between 7 and 17
                         )
                       ) R(report_id int, k int)
                 GROUP BY report_id, k
               ) __summary_1
       ) __summary_2
 WHERE rank = 1;

Just add one line (the WHERE clause) to select a different input data set. Flexibility of SQL!

Summary

▌What is PL/CUDA

The original concept of PG-Strom is automatic optimization.

PL/CUDA draws out the full capability of the GPU, in exchange for manual optimization.

Most likely, nobody writes advanced algorithms in SQL :-)

▌Advantages

TFLOPS grade computing engine for analytics in-database

No need to export entire dataset for analytics by external applications

Lets you use the flexibility of SQL for pre-/post-processing around the core analytics algorithms

▌Future Challenges

Data size larger than 1GB, because of the varlena restriction in PostgreSQL

Asynchronous execution and CPU parallelism

Time to construct the matrix; if the data is mostly static, the matrix can be constructed in advance.

Resources

▌Repository

https://github.com/pg-strom/devel

▌Today’s Slides

http://www.slideshare.net/kaigai/pgconfsv2016-plcuda

▌Contact

[email protected]

Tw: @kkaigai

Questions?