Xilinx WebP and Database Solutions

XILINX TECHNOLOGY DAY

Ryan Wang (王瑞), SDx Foundation Libraries, March 2019

© Copyright 2019 Xilinx

Changes, Opportunities and Challenges

Personal computing

Embedded computing

Mobile computing

Cloud computing

Opportunities for data-center (DC) computing:
– New computing platform: more opportunities for developers

Services, Applications and Algorithms + New Platform = New opportunities

Challenges for DC computing:
– Performance requirements: processing, power and cost
– Flexibility requirements: application and algorithm
– Ease of use: friendly for developers

The FPGA is an excellent, well-balanced contender for the new mainstream computing platform

SDx Libraries can help users cope with these new challenges



SDx Foundation Libraries: Based on C/C++

Fundamental libraries:
• Basic data types
• Math library, stream library, HLS MAL, XHPP…
• …

Algorithm-specific libraries:
• Image compression: WebP (from an open-source project to FaaS on the cloud), Lepton (TBD)
• Security, data compression, DSP, OpenCV…

Domain-specific libraries (for different application scenarios):
• Xilinx Database Acceleration Library (TBR: To Be Released)
• Quantitative Financial Library
• High Performance Computing Library

Today’s topics


Xilinx WebP Solution: From an Open-Source Project to FaaS on the Cloud

˃ Background & Value Proposition

What is WebP
‒ A new image format developed by Google to enable faster and smaller images on the Web

Why WebP
‒ Files are 25%~34% smaller than JPEG
‒ Royalty-free

Why on FPGA
‒ Encoding is time-consuming on a CPU: about 8 times the encoding time of JPEG
‒ An RTL development flow guarantees the performance

Item                            CPU (E3-1230 v2 @ 3.3 GHz)       FPGA (KU115)
Throughput (853x640, pic/sec)   1.X (20 pic/sec <=> 16.4 MB/s)   6.7 X (133 pic/sec <=> 109 MB/s)
Cost performance                1.X                              3.X
Power performance               1.X                              3.X

Tab 1. A WebP encoder based on the RTL flow
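The MB/s figures in Tab 1 appear to follow from the picture rate if each 853x640 picture is counted as raw input at 1.5 bytes per pixel (as in YUV 4:2:0). That pixel format is an assumption, not something the slide states, and the helper names below are hypothetical. A minimal sketch of the arithmetic:

```cpp
#include <cstdint>

// Assumption (not stated in the source): pictures are counted as raw
// YUV 4:2:0 input, i.e. 1.5 bytes per pixel, and 1 MB = 10^6 bytes.
constexpr double kBytesPerPixel = 1.5;

// Raw input size of one picture in bytes.
double raw_picture_bytes(std::uint32_t width, std::uint32_t height) {
    return static_cast<double>(width) * height * kBytesPerPixel;
}

// Converts a picture rate (pictures per second) into raw input bandwidth in MB/s.
double throughput_mb_per_s(std::uint32_t width, std::uint32_t height,
                           double pics_per_s) {
    return raw_picture_bytes(width, height) * pics_per_s / 1e6;
}
```

Under this assumption, 20 pic/sec works out to about 16.4 MB/s and 133 pic/sec to about 109 MB/s. The same assumption also reproduces the 1.416 GByte total quoted later for 3600 pictures of 512x512.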

About 80% of web capacity is used for pictures.
Benefits for Google from using WebP:
‒ 50 TB of storage space per day
‒ Several TB of bandwidth per day
‒ 1/3 speedup in webpage loading



˃ Why use the SDx flow

RTL: x
SDx & HLS: √

    for (k = 0; k < 10; k++) {
        int bd = (k > 1 && k < 8) ? 20 : 10;
        for (mode = 0; mode < bd; mode++) {
    #pragma HLS PIPELINE
        }
    }

‒ Fast design implementation: C/C++ vs. HDL
‒ Simple verification: self-verification
‒ Short time to market: fast design iteration
‒ Long life cycle: retiming by tools


Key factors in the SDx flow

Define kernel(s)
• Offload most of the computation (>90%)
• Suited to a three-block framework: Reading, Computing and Writing

Choose FPGA-friendly algorithms
• Some optional algorithm branches are not hardware-friendly because they have:
  • more data dependencies
  • unpredictable dataflow
  • unpredictable control flow

Estimate performance requirements
• Coarsely estimate the concurrency levels for the top modules
• Find potential bottlenecks inside modules
• Refine the estimates down to lower-level modules

Optimize bottlenecks
• Especially loops with data dependencies, which limit the benefit of concurrency and pipelining




[Figure: WebP encoder block diagram. Blocks: Original, Prediction, Residual, T, Q, iQ, iT, Recon, Cost, Mode Decision (Best Rcn, BestCoeff_Q), Context, Arithmetic Coding.]

4x4 prediction in WebP

˃ Estimate performance requirements
4x4 prediction
‒ Very short input-interval requirement: 738 cycles / 16 sub-blocks = 46 cycles per sub-block

˃ Optimize bottlenecks
Use the 4x4 block as the minimum processing unit
‒ Reduce latency simply by using the 'unroll' pragma in the source code
‒ Latency dropped from hundreds of cycles to 35 cycles (goal: 46)

˃ FPGA-friendly algorithms
Develop a new algorithm for calculating the 'cost' of a mode
‒ The original takes hundreds of cycles and is difficult to parallelize
‒ The new algorithm takes only 3~4 cycles
‒ About 0.1~0.2 dB loss of PSNR at very low compression ratios

[Figure: Comparison of RD curves for the original and new 'cost' functions. PSNR (35.00 to 43.00) vs. BPP (bits per pixel, 0.50 to 2.50); Lena 256x256, m=4 kernel, '-q' = 80; curves: original, with new 'cost'.]

The new 'cost' function reduces latency from ~19 to only 3~4 cycles



Exploring 4x4 intra-prediction concurrencies

˃ Fast design iteration by using HLS
In 4x4 intra-prediction, you can explore concurrency by modifying the loops' scanning order.
The HLS flow makes it very convenient to try different approaches and check the results.

Normal raster scan order of the 4x4 sub-blocks (two consecutive sub-blocks have a dependency):

12 13 14 15
 8  9 10 11
 4  5  6  7
 0  1  2  3

    for (sb = 0; sb < 16; sb++) {
        for (mode = 0; mode < 10; mode++) {
        }
    }

MB Latency_0 = 16 * 35 = 560

Diagonal scan order of the 4x4 sub-blocks (sub-blocks that share an index are independent):

6 7 8 9
4 5 6 7
2 3 4 5
0 1 2 3

Combine two independent sub-blocks into one pipelined loop:

    for (k = 0; k < 10; k++) {
        for (mode = 0; mode < 20; mode++) {
        }
    }

MB Latency_1 = 10 * 48 = 480

Use a variable loop bound to reduce latency further:

    for (k = 0; k < 10; k++) {
        int bd = (k > 1 && k < 8) ? 20 : 10;
        for (mode = 0; mode < bd; mode++) {
        }
    }

MB Latency_2 = 6 * 48 + 4 * 38 + 10 = 450

Easy to develop and test different algorithms by using HLS!

About 20% throughput improvement
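The diagonal numbering and the three macroblock latency figures can be checked with a small model. The sketch below assumes the step index of a sub-block is col + 2*row, which matches the grid shown, and that 4x4 intra-prediction reads only the left, top, top-left and top-right neighbours, so every dependency has a smaller step index. The cycle counts (35, 48, 38 and the extra 10) are taken from the slide; the function names are hypothetical.

```cpp
#include <array>

// Diagonal (wavefront) step of the sub-block at (col, row) in the 4x4 grid.
// Assumed formula: it reproduces the 0..9 labels shown in the grid.
constexpr int wavefront_step(int col, int row) { return col + 2 * row; }

// Counts how many sub-blocks fall on each of the 10 wavefront steps.
std::array<int, 10> blocks_per_step() {
    std::array<int, 10> count{};  // zero-initialised
    for (int row = 0; row < 4; ++row)
        for (int col = 0; col < 4; ++col)
            ++count[wavefront_step(col, row)];
    return count;
}

// Raster order: 16 sub-blocks processed one after another, 35 cycles each.
int latency_raster() { return 16 * 35; }

// Fixed bound: every step runs the 20-mode loop (48 cycles per step).
int latency_paired_fixed() { return 10 * 48; }

// Variable bound: steps holding two sub-blocks take 48 cycles, steps holding
// one take 38 cycles, plus the 10 extra cycles quoted on the slide.
int latency_paired_variable() {
    int total = 10;
    for (int n : blocks_per_step()) total += (n == 2) ? 48 : 38;
    return total;
}
```

Steps 2 to 7 each hold two sub-blocks and the remaining four steps hold one, which is where the 6 * 48 + 4 * 38 split comes from.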



FaaS on NIMBIX Cloud

WebP lossy encoder for 853x640 with -q 80: 60ms used by the CPU alone (1 thread)

Workload partitioning:
– HW (≈95%)
  » Kernel-1: Intra-prediction (≈70%)
  » Kernel-2: Arithmetic Coding (≈25%)
– SW (≈5%)
  » pre-processing
  » post-processing

[Figure: Execution timeline of cpu, K1 and K2 for Pic 1 and Pic 2, comparing CPU alone (1 thread) with CPU + 1 instance (1 thread + 6%). Labels: 45.2ms, 9.0ms, 1.4ms, 4.4ms, 5.2ms, 3.3ms (real testing).]

In one CU, different kernels and the CPU can run simultaneously
Resource utilization: < 6% for 1 instance

Throughput comparison with CPU and RTL

Items                          CPU              FPGA (RTL @ KU115)   AWS (Single)     NIMBIX (Batch)
Throughput (853x640 pic/sec)   20 (16.4 MB/s)   133 (109 MB/s)       169 (138 MB/s)   202 (166 MB/s)

– AWS: SDx 17.1 (Single-Input)
– Nimbix: SDx 17.4 (Batch-Input)



˃ 6-CU @ 233 MHz (goal: 250 MHz)

Conditions:
• Platform: Nimbix
• Picture size: 512x512
• -q: 80 (≈87 of original WebP)
• Instances: 6
• Total pictures: 3600
• Total size: 1.416 GByte

Results:
• Average interval
  • Batch: 159.5 ms
  • Pic: 2.66 ms
• CPU end-to-end time: 1636 ms
• Interval throughput: 2257 pic/sec
• End-to-end throughput: 2200 pic/sec
• 53X vs. 1 CPU thread

[Figure: Timeline. End-to-end time: 1636 ms as printed by the CPU; 1595 ms from the timeline file; fixed overhead.]
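The two throughput figures follow directly from the picture count and the two time measurements. The sketch below reproduces that arithmetic. The helper names are hypothetical, and the batch size of about 60 pictures is inferred from the two average intervals rather than stated on the slide.

```cpp
// Pictures per second for a given picture count and elapsed time in milliseconds.
double pics_per_second(double pictures, double elapsed_ms) {
    return pictures * 1000.0 / elapsed_ms;
}

// Number of pictures in one batch, inferred from the average batch interval
// and the average per-picture interval (both in milliseconds).
double pictures_per_batch(double batch_interval_ms, double pic_interval_ms) {
    return batch_interval_ms / pic_interval_ms;
}
```

With 3600 pictures, the 1636 ms end-to-end time gives about 2200 pic/sec and the 1595 ms timeline figure gives about 2257 pic/sec. Dividing 159.5 ms by 2.66 ms suggests roughly 60 pictures per batch.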


Xilinx Database Acceleration Solution: XFDL

Xilinx Database Acceleration Library

Today's 2nd Topic



XFDL
• Xilinx Database Acceleration Library (TBR)
• General Query Engine (GQE)
• GQE-based acceleration

[Figure: Roadmap with status labels Done, Working-on and Planning. Milestones: TPC-H POC; XF DB primitives; XF DB 2.0 + GQE; GQE-based acceleration of open-source DB.]

‒ Accelerates database applications at various scopes
‒ Provides kernel-side primitives in HLS C++
‒ Database developers are our audience
‒ Will provide host-side routine wrappers in the future
‒ Planned to be open-sourced



˃ Multi-Level Acceleration

Primitive Acceleration
‒ The library provides optimized key primitives
‒ Database developers develop hardware shells and software wrappers for offloading key primitives in the execution plan to the FPGA
‒ Acceleration is transparent to SQL users

Application Acceleration
‒ The library provides all primitives and glue logic for kernel assembly
‒ Database developers provide a mechanism to execute pre-compiled hardware queries
‒ Power SQL users write the key queries of their applications as FPGA kernels

General Database Acceleration (Working-on)
‒ The library provides a post-bitstream programmable General Query Engine and software APIs to use its functions

[Figure: Two query pipelines, each with the stages Scan, Filter, Join, Agg.]

Fully Documented and Open-Source



Primitive Set

• LOOKUP3-HASH
• MURMUR3-HASH
• SHA224/SHA256-HASH
• HASH-JOIN (Multi-PU)
• SCAN-COL
• FILTER
• AGGR (MIN/MAX/SUM/CNT/…)
• BLOOM-FILTER (Generator)
• DIRECT-GROUP-BY-AGGR
• HASH-GROUP-BY-AGGR
• SORTED-GROUP-AGGR
• BITONIC-SORT
• INSERT-SORT
• MERGE-SORT
• MERGE-JOIN
• MERGE-LEFT-JOIN
• NESTED-LOOP-JOIN
• HASH-SEMI-JOIN
• ALU-BLOCK
• DYNAMIC-EVAL
• DYNAMIC-FILTER
• COMBINE/SPLIT

Complete Implementation and Performance Descriptions



˃ Database Library – Primitive Interface
‒ Data is sent to and received from primitives as hls::stream<T>
‒ Enables parallelism (#pragma HLS dataflow)
‒ Uniform semantics for easy integration

Example query:

    select sum(o_revenue)
    from order
    where o_date >= "20180101"

Implemented with primitives (scan, filter, sum, write):

    void queryKernel(uint64_t* buf_in, uint64_t* buf_out) {
    #pragma HLS dataflow
        hls::stream<ap_uint<DW>> date_strm1, revn_strm1, revn_strm2, revn_strm3;
        hls::stream<bool> e_strm1, e_strm2, e_strm3;

        // scan: read the date and revenue columns from the input buffer
        hls_alg::scan_col(buf_in, date_strm1, revn_strm1, e_strm1);
        // filter: keep rows with o_date >= 20180101
        hls_alg::filter<FOP_DC, 0, FOP_GE, 20180101>(
            date_strm1, revn_strm1, e_strm1,
            revn_strm2, e_strm2);
        // sum: aggregate the remaining revenue values
        hls_alg::aggregate<SUM>(revn_strm2, e_strm2, revn_strm3, e_strm3);
        // write: store the result to DDR
        write_ddr(buf_out, revn_strm3, e_strm3);
    }
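To show the scan, filter and sum chain without an HLS toolchain, the sketch below emulates it in plain C++. It is not the library's API: `Stream` is a `std::queue` standing in for `hls::stream`, the `Row` layout and all function names are hypothetical, and the end-flag convention (one `false` per data item, then a single `true`) is an assumption about how the `e_strm` streams behave.

```cpp
#include <cstdint>
#include <queue>
#include <vector>

// Software stand-in for hls::stream<T>; an assumption for illustration only.
template <typename T>
using Stream = std::queue<T>;

// Hypothetical row layout: a date encoded as YYYYMMDD and a revenue value.
struct Row { std::uint32_t date; std::uint64_t revenue; };

// scan: pushes each row's columns plus an end flag (false = data, true = end).
void scan_col(const std::vector<Row>& table, Stream<std::uint32_t>& date_strm,
              Stream<std::uint64_t>& revn_strm, Stream<bool>& e_strm) {
    for (const Row& r : table) {
        date_strm.push(r.date);
        revn_strm.push(r.revenue);
        e_strm.push(false);
    }
    e_strm.push(true);
}

// filter: forwards the revenue of rows whose date is >= min_date.
void filter_ge(std::uint32_t min_date, Stream<std::uint32_t>& date_in,
               Stream<std::uint64_t>& revn_in, Stream<bool>& e_in,
               Stream<std::uint64_t>& revn_out, Stream<bool>& e_out) {
    while (true) {
        bool end = e_in.front(); e_in.pop();
        if (end) break;
        std::uint32_t d = date_in.front(); date_in.pop();
        std::uint64_t v = revn_in.front(); revn_in.pop();
        if (d >= min_date) { revn_out.push(v); e_out.push(false); }
    }
    e_out.push(true);
}

// sum: consumes the stream until the end flag and returns the total.
std::uint64_t aggregate_sum(Stream<std::uint64_t>& revn_in, Stream<bool>& e_in) {
    std::uint64_t sum = 0;
    while (true) {
        bool end = e_in.front(); e_in.pop();
        if (end) break;
        sum += revn_in.front(); revn_in.pop();
    }
    return sum;
}

// Chains the three stages, mirroring the structure of queryKernel.
std::uint64_t query(const std::vector<Row>& table, std::uint32_t min_date) {
    Stream<std::uint32_t> date1;
    Stream<std::uint64_t> revn1, revn2;
    Stream<bool> e1, e2;
    scan_col(table, date1, revn1, e1);
    filter_ge(min_date, date1, revn1, e1, revn2, e2);
    return aggregate_sum(revn2, e2);
}
```

Here the stages run one after another. In hardware, the dataflow pragma lets them run concurrently, with each stage consuming items as the previous one produces them.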



˃ Example: Implement Hash-Join in SDx Library

˃ Key to performance: JOIN

˃ To do a hash-join
‒ Create a hash table
‒ Store the small table in the hash table, or store references to it in the hash table

˃ Make the best use of storage
Store hash-to-buffer-offset in URAM
‒ On-chip lookup
‒ Small table up to 8M
Store the small table in HBM
‒ Much more capacity than on-chip memory
‒ Lower latency than DDR
‒ More banks than DDR, so less contention

˃ Scan by column
‒ Less data to the FPGA
‒ Fewer execution stalls

    SELECT SUM(L_EXTENDEDPRICE * (1 - L_DISCOUNT)) AS REVENUE
    FROM LINEITEM, ORDERS
    WHERE L_ORDERKEY = O_ORDERKEY
      AND O_ORDERDATE >= '1994-01-01'
      AND O_ORDERDATE < '1995-01-01';
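As a software reference for what the query above computes, here is a minimal hash-join sketch in plain C++. It is not the FPGA implementation: the struct layouts and the function name are hypothetical, and it assumes O_ORDERKEY is unique in ORDERS, so a hash set of qualifying keys is enough for the build side.

```cpp
#include <cstdint>
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical row layouts holding only the columns the query touches.
// Dates are ISO strings (YYYY-MM-DD), which compare correctly as text.
struct Order { std::int64_t orderkey; std::string orderdate; };
struct LineItem { std::int64_t orderkey; double extendedprice; double discount; };

double revenue(const std::vector<Order>& orders,
               const std::vector<LineItem>& lineitem,
               const std::string& from, const std::string& to) {
    // Build phase: filter the smaller table (ORDERS) by date and hash its keys.
    std::unordered_set<std::int64_t> keys;
    for (const Order& o : orders)
        if (o.orderdate >= from && o.orderdate < to) keys.insert(o.orderkey);

    // Probe phase: stream the larger table (LINEITEM) and aggregate matches.
    double sum = 0.0;
    for (const LineItem& l : lineitem)
        if (keys.count(l.orderkey)) sum += l.extendedprice * (1.0 - l.discount);
    return sum;
}
```

Filtering before the build keeps the hash table small, which is the same reason the slide stores the small table on the build side.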




˃ Define the kernel function set:
‒ Join two tables with different numbers of payload columns
‒ Filter the orders table according to o_orderdate
‒ Store/fetch the number of result rows to/from DDR/HBM for consecutive kernel calls

˃ Pick primitives
‒ Hash-Join
‒ Filter
‒ Scan-col (with #row from column-buffer)
‒ Combine and Split (for payload columns)

˃ Assemble, write a bypass-able wrapper

[Figure: SDx kernel inside a by-passable wrapper. Elements: input columns (col) with six scan-col primitives, filter, combine (key, payload), hash-join on a 4-row-wide stream with HBM buffers (buf), split, and four write-col primitives to output columns (col).]



˃ Hash-Join-MPU Primitive (multi-channel multi-PU hash-join)

‒ The switcher distributes rows to the PUs according to the MSB of the hash
‒ 8 PUs, as each has a dedicated bank to avoid conflicts
‒ 4 input channels supply enough data for 8 PUs

[Figure: Multi-channel multi-PU hash-join. Four input channels (CH-0 to CH-3, 4ch wide, carrying key and payload), each with a Dispatch unit; a switcher; 8 PUs (with 8 HBM temp buffers), each with bitmap, build and probe stages; a collector.]
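The switcher's routing rule can be sketched as a bit operation. With 8 PUs, the top 3 bits of the hash select the PU and the remaining bits are left for addressing inside that PU. The 32-bit hash width and the function names are assumptions for illustration; the slide only says that rows are routed by the MSB of the hash.

```cpp
#include <cstdint>

// 8 PUs = 2^3, as on the slide. The 32-bit hash width is an assumption.
constexpr unsigned kHashBits = 32;
constexpr unsigned kPuBits = 3;

// Selects the PU from the most significant bits of the hash.
constexpr std::uint32_t select_pu(std::uint32_t hash) {
    return hash >> (kHashBits - kPuBits);
}

// Remaining low bits, available for addressing inside the selected PU.
constexpr std::uint32_t pu_local_hash(std::uint32_t hash) {
    return hash & ((1u << (kHashBits - kPuBits)) - 1u);
}
```

Because rows with equal keys hash to the same value, build and probe rows for a key always land in the same PU, so each PU can work on its own bank independently.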




˃ Performance:
‒ 20ms data transfer time + init overhead of the SDx runtime
‒ 10ms kernel execution time
‒ 18ms per query when data transfer is overlapped with the previous query's kernel execution in continuous multiple runs

[Figure: Pipelined continuous run. Timeline of writes W0 to W5, kernels K0 to K5 and reads R; labels: 30ms, 20ms, 10ms, and 18ms for the pipelined interval.]

[Figure: TPC-H Query 5 Acceleration. Throughput (ms/query): Commercial DB 140, FPGA 18; 7.8x speedup.]

VU37P, SDx 2018.2

Resource Utilization
LUT        63362
REG        84313
BRAM       124
URAM       64
DSP        3
HBM Bank   8

Clock frequency: 274.5 MHz
Kernel execution time (1 GB data): 11.2ms
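The speedup figure is a ratio of the per-query times quoted above. The sketch below reproduces that arithmetic using only the reported numbers; the helper names are hypothetical.

```cpp
// Time per query when transfer and kernel execution do not overlap.
double serial_ms(double transfer_ms, double kernel_ms) {
    return transfer_ms + kernel_ms;
}

// Ratio of a baseline time to an accelerated time.
double speedup(double baseline_ms, double accelerated_ms) {
    return baseline_ms / accelerated_ms;
}
```

A single run takes 20 + 10 = 30 ms, and 140 ms per query against the pipelined 18 ms gives the quoted 7.8x.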



Summary of today's topics

˃ The SDx/HLS flow is helpful for innovation on FPGAs
‒ Easy to think in and develop with
‒ Fast verification and iteration

˃ SDx Libraries can help developers greatly
‒ More productive
‒ Better performance

Fundamental libraries:
• Basic data types
• Math library, stream library, HLS MAL, XHPP…
• …

Algorithm-specific libraries:
• Image compression: WebP, Lepton (TBD)
• Security, data compression, DSP, OpenCV…

Domain-specific libraries:
• Xilinx Database Acceleration Library (TBR)
• Quantitative Financial Library
• High Performance Computing Library

Adaptable. Intelligent.
