Xilinx Technology Day
Ryan Wang (王瑞), SDx Foundation Libraries, March 2019
Xilinx WebP and Database Solutions
© Copyright 2019 Xilinx
Changes, Opportunities and Challenges
Personal computing, Embedded computing, Mobile computing, Cloud computing
Opportunities for DC computing:
– New computing platform: more opportunities for developers
– Services, applications and algorithms + new platform = new opportunities
Challenges for DC computing:
– Performance requirements: processing, power and cost
– Flexibility requirements: application and algorithm
– Ease of use: friendly for developers
FPGA is an excellent and balanced competitor for the new mainstream computing platform.
SDx Libraries can help users cope with these new challenges.
SDx Foundation Libraries: Based on C/C++
Fundamental libraries:
• Basic data type
• Math library, stream library, HLS MAL, XHPP…
• …
Algorithm-specific libraries:
• Image compression: WebP, Lepton (TBD)
• Security, data compression, DSP, Open-CV…
Domain-specific libraries:
• Xilinx Database Acceleration Library (TBR)
• Quantitative Financial Library
• High Performance Computing Library
Today's topics:
• WebP: from an open-source project to FaaS on the cloud
• Xilinx Database Acceleration Library (to be released): for different application scenarios
Xilinx WebP Solution: From Open-Source Project to Cloud FaaS
˃ Background & Value Proposition
What is WebP
‒ A new image format developed by Google to enable faster and smaller images on the Web
Why WebP
‒ Files are 25%~34% smaller than JPEG
‒ Royalty free
Why on FPGA
‒ Encoding is time consuming on a CPU: about 8 times as long as JPEG
‒ An RTL development flow can guarantee the performance
Tab 1. A WebP encoder based on the RTL flow
Item | CPU (E3-1230 v2 @ 3.3 GHz) | FPGA (KU115)
Throughput (853x640 pic/sec) | 1.X (20 pic/sec <=> 16.4 MB/s) | 6.7X (133 pic/sec <=> 109 MB/s)
Cost performance | 1.X | 3.X
Power performance | 1.X | 3.X
80% of web capacity is used for pictures. Benefits of WebP for Google:
‒ 50T of storage space per day
‒ Several TB of bandwidth per day
‒ 1/3 speedup for webpage loading
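˃ Why use the SDx flow
RTL (x) vs. SDx & HLS (√)
• Fast design implementation: C/C++ vs. HDL
• Simple verification: self-verification
• Short time to market: fast design iteration
• Long life cycle: retiming by tools
Example of an HLS loop nest:
for (k = 0; k < 10; k++) {
    // k = 2..7 get the longer bound of 20, the others 10
    int bd = (k > 1 && k < 8) ? 20 : 10;
    for (mode = 0; mode < bd; mode++) {
#pragma HLS PIPELINE
    }
}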
Key factors in the SDx flow
Define kernel(s)
• Offload most of the computation (>90%)
• Suit the three-block framework: Reading, Computing and Writing
Choose FPGA-friendly algorithms
• Some optional algorithm branches are not hardware friendly:
  • more data dependencies
  • unpredictable dataflow
  • unpredictable control flow
Estimate performance requirements
• Coarsely estimate the concurrency levels for top modules
• Find potential bottlenecks inside modules
• Refine the estimates for lower-level modules
Optimize bottlenecks
• Especially loops with data dependencies, which limit the effect of concurrency and pipelining
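The Reading, Computing and Writing three-block framework can be illustrated with a small software model. This is only a sketch under stated assumptions: `Fifo`, `read_stage`, `compute_stage`, `write_stage` and `kernel_model` are hypothetical names, `std::queue` stands in for a hardware FIFO, and the doubling step is a placeholder workload. In an HLS kernel the three stages would typically run concurrently under a dataflow pragma; here they simply run in sequence.

```cpp
#include <cassert>
#include <cstdint>
#include <queue>
#include <vector>

// Software stand-in for a hardware FIFO between stages (assumption: an HLS
// kernel would use a stream type here instead of std::queue).
using Fifo = std::queue<uint32_t>;

// Stage 1 (Reading): move input memory into a FIFO.
inline void read_stage(const std::vector<uint32_t>& mem, Fifo& out) {
    for (uint32_t v : mem) out.push(v);
}

// Stage 2 (Computing): process each element; doubling is a placeholder.
inline void compute_stage(Fifo& in, Fifo& out) {
    while (!in.empty()) {
        out.push(in.front() * 2u);
        in.pop();
    }
}

// Stage 3 (Writing): drain the FIFO into output memory.
inline void write_stage(Fifo& in, std::vector<uint32_t>& mem) {
    while (!in.empty()) {
        mem.push_back(in.front());
        in.pop();
    }
}

// Top level: the three blocks connected by FIFOs.
inline std::vector<uint32_t> kernel_model(const std::vector<uint32_t>& input) {
    Fifo a, b;
    std::vector<uint32_t> output;
    read_stage(input, a);
    compute_stage(a, b);
    write_stage(b, output);
    return output;
}
```

Keeping memory access in the first and last stages leaves the middle stage free of address logic, which is what makes it a natural candidate for pipelining.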
[Figure: WebP encoder block diagram. Blocks: Prediction, T, Q, iQ, iT, Recon, Cost, Mode Decision, Arithmetic Coding. Signals: Original, Context, Residual, Best Rcn, BestCoeff_Q.]
˃ Estimate performance requirements
4x4 prediction
‒ Very short input interval requirement: 738 cycles / 16 sub-blocks ≈ 46 cycles per sub-block
˃ Optimize bottlenecks
Use the 4x4 block as the minimum processing unit
‒ Reduce the latency by simply using the 'unroll' pragma in the source code
‒ Latency dropped from hundreds of cycles to 35 cycles (goal: 46)
˃ FPGA-friendly algorithms
Develop a new algorithm for calculating the 'cost' of a mode
‒ The original takes hundreds of cycles and is difficult to parallelize
‒ The new algorithm takes only 3~4 cycles
‒ About 0.1~0.2 dB loss of PSNR at very low compression ratios
‒ New 'cost' reduces latency from ~19 to only 3~4 cycles
[Figure: Comparison of RD curves for the original and new 'cost' functions. X axis: BPP (bits per pixel), 0.50 to 2.50. Y axis: PSNR, 35.00 to 43.00. Lena 256x256, m=4 kernel, '-q' = 80. Series: original; with new 'cost'.]
4x4 prediction in WebP
˃ Fast design iteration by using HLS
Exploring 4x4 intra-prediction concurrencies
In 4x4 intra-prediction, you can explore concurrency by modifying the loops' scanning order. The HLS flow makes it very convenient to try different approaches and check the results.
Normal raster scan order of the 4x4 sub-blocks (consecutive sub-blocks have a dependency):
 0  1  2  3
 4  5  6  7
 8  9 10 11
12 13 14 15
for (sb = 0; sb < 16; sb++) {
    for (mode = 0; mode < 10; mode++) {
#pragma HLS PIPELINE
    }
}
MB Latency_0 = 16 * 35 = 560
Combine two independent sub-blocks into one pipelined loop (sub-blocks with the same index are processed together):
0 1 2 3
2 3 4 5
4 5 6 7
6 7 8 9
for (k = 0; k < 10; k++) {
    for (mode = 0; mode < 20; mode++) {
#pragma HLS PIPELINE
    }
}
MB Latency_1 = 10 * 48 = 480
Use a variable loop bound to reduce latency further:
for (k = 0; k < 10; k++) {
    // k = 2..7 cover two sub-blocks (20 modes); k = 0, 1, 8, 9 cover one (10 modes)
    int bd = (k > 1 && k < 8) ? 20 : 10;
    for (mode = 0; mode < bd; mode++) {
#pragma HLS PIPELINE
    }
}
MB Latency_2 = 6 * 48 + 4 * 38 + 10 = 450
About 20% throughput improvement.
Easy to develop and test different algorithms by using HLS!
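The diagonal numbering on this slide is consistent with assigning the sub-block in column x, row y (row 0 at the top) the index k = x + 2y. Its left, top, top-left and top-right neighbours then have indices k-1, k-2, k-3 and k-1, so sub-blocks that share an index do not depend on each other. The sketch below, with hypothetical function names, counts the sub-blocks per index and reproduces the three macroblock latencies. The per-iteration cycle counts (35, 48, 38, and the extra 10) are taken from the slide, not derived.

```cpp
#include <array>
#include <cassert>

// Wavefront index of the 4x4 sub-block at column x, row y (both 0..3).
constexpr int wave_index(int x, int y) { return x + 2 * y; }

// Number of sub-blocks in each of the 10 wavefronts of a 16x16 macroblock.
inline std::array<int, 10> wave_sizes() {
    std::array<int, 10> n{};
    for (int y = 0; y < 4; ++y)
        for (int x = 0; x < 4; ++x)
            ++n[wave_index(x, y)];
    return n;
}

// Raster order: 16 sub-blocks, 35 cycles each (figure from the slide).
inline int latency_raster() { return 16 * 35; }

// Fixed bound: 10 wavefronts, 48 cycles each (figure from the slide).
inline int latency_fixed_bound() { return 10 * 48; }

// Variable bound: 48 cycles for two-sub-block wavefronts, 38 for
// single-sub-block ones, plus the fixed 10 cycles quoted on the slide.
inline int latency_variable_bound() {
    int cycles = 10;
    for (int n : wave_sizes())
        cycles += (n == 2) ? 48 : 38;
    return cycles;
}
```

Wavefronts 2 to 7 hold two sub-blocks and the other four hold one, which is why the variable bound selects 20 modes only for k = 2..7.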
FaaS on NIMBIX Cloud
WebP lossy encoder for 853x640 with -q 80: 60 ms used by the CPU
– HW (≈95%)
  » Kernel-1: intra-prediction (≈70%)
  » Kernel-2: arithmetic coding (≈25%)
– SW (≈5%)
  » pre-processing
  » post-processing
In one CU, different kernels and the CPU can run simultaneously.
Resource utilization: < 6% for 1 instance
[Figure: execution timelines for Pic 1 and Pic 2 showing the cpu, K1 and K2 stages, for CPU alone (1 thread) and for CPU + 1 instance (1 thread + 6%). Labeled times: 45.2 ms, 9.0 ms, 1.4 ms, 4.4 ms, 5.2 ms, 3.3 ms (real testing).]
Throughput comparison with CPU and RTL
Items | CPU | FPGA (RTL @ KU115) | AWS (Single) | NIMBIX (Batch)
Throughput (853x640 pic/sec) | 20 (16.4 MB/s) | 133 (109 MB/s) | 169 (138 MB/s) | 202 (166 MB/s)
– AWS: SDx 17.1 (single-input)
– Nimbix: SDx 17.4 (batch-input)
˃ 6-CU @ 233 MHz (goal: 250 MHz)
Conditions:
• Platform: Nimbix
• Picture size: 512x512
• -q: 80 (≈87 of original WebP)
• Instances: 6
• Total pictures: 3600
• Total size: 1.416 GByte
Results:
• Average interval: 159.5 ms per batch, 2.66 ms per picture
• End-to-end time: 1636 ms as printed by the CPU, 1595 ms from the timeline file; the remainder is labeled fixed overhead
• Interval throughput: 2257 pic/sec
• End-to-end throughput: 2200 pic/sec
• 53X vs. 1 CPU thread
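The throughput figures on this slide appear to follow from the two elapsed times: 3600 pictures over the 1595 ms taken from the timeline file, and over the 1636 ms end-to-end time. A small sketch of that arithmetic, with hypothetical helper names, is:

```cpp
#include <cassert>

// Pictures per second for a run that processed `pictures` in `elapsed_ms`.
inline double pictures_per_second(int pictures, double elapsed_ms) {
    return pictures * 1000.0 / elapsed_ms;
}

// Average time one instance spends per picture when `instances` run in parallel.
inline double interval_per_picture_ms(double elapsed_ms, int pictures, int instances) {
    return elapsed_ms * instances / pictures;
}
```

With these inputs the first function gives roughly 2257 and 2200 pic/sec, and the second gives about 2.66 ms per picture for 6 instances, in line with the slide.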
Xilinx Database Acceleration Solution: XFDL
Xilinx Database Acceleration Library
Today's 2nd topic
Roadmap stages: TPC-H POC; XF DB primitives; XF DB 2.0 + GQE; GQE-based acceleration of open-source DB
Status labels: Done, Working-on, Planning
XFDL
• Xilinx Database Acceleration Library (TBR)
• General Query Engine
• GQE-based acceleration
Key points:
• Accelerates database applications at various scopes
• Provides kernel-side primitives in HLS C++
• Database developers are our audience
• Will provide host-side routine wrappers in the future
• Planned to be open-sourced
˃ Multi-Level Acceleration
Primitive Acceleration
‒ Library provides optimized key primitives.
‒ Database developers develop hardware shells and software wrappers for offloading key primitives in the execution plan to the FPGA.
‒ Acceleration is transparent to SQL users.
Application Acceleration
‒ Library provides all primitives and glue logic for kernel assembling.
‒ Database developers provide a mechanism to execute pre-compiled hardware queries.
‒ Power SQL users write key queries in their application into FPGA kernels.
General Database Acceleration (Working-on)
‒ Library provides a post-bitstream programmable General Query Engine and software APIs to use its functions.
[Figure: two query pipelines, each with the stages Scan, Filter, Join, Agg.]
Fully Documented and Open-Source
Primitive Set
• LOOKUP3-HASH, MURMUR3-HASH, SHA224/SHA256-HASH
• HASH-JOIN (Multi-PU), MERGE-JOIN, MERGE-LEFT-JOIN, NESTED-LOOP-JOIN, HASH-SEMI-JOIN
• SCAN-COL, FILTER, BLOOM-FILTER (Generator)
• AGGR (MIN/MAX/SUM/CNT/…)
• DIRECT-GROUP-BY-AGGR, HASH-GROUP-BY-AGGR, SORTED-GROUP-AGGR
• BITONIC-SORT, INSERT-SORT, MERGE-SORT
• ALU-BLOCK, DYNAMIC-EVAL, DYNAMIC-FILTER
• COMBINE/SPLIT
Complete Implementation and Performance Descriptions
˃ Database Library – Primitive Interface
• Data is sent to and received from primitives as hls::stream<T>
• Enables parallelism (#pragma HLS dataflow)
• Uniform semantics for easy integration
Example query:
select sum(o_revenue)
from order
where o_date >= "20180101"
Implemented with primitives (scan -> filter -> sum -> write):
void queryKernel(uint64_t* buf_in, uint64_t* buf_out) {
#pragma HLS dataflow
    hls::stream<ap_uint<DW>> date_strm1, revn_strm1, revn_strm2, revn_strm3;
    hls::stream<bool> e_strm1, e_strm2, e_strm3;
    hls_alg::scan_col(buf_in, date_strm1, revn_strm1, e_strm1);
    hls_alg::filter<FOP_DC, 0, FOP_GE, 20180101>(
        date_strm1, revn_strm1, e_strm1, revn_strm2, e_strm2);
    hls_alg::aggregate<SUM>(revn_strm2, e_strm2, revn_strm3, e_strm3);
    write_ddr(buf_out, revn_strm3, e_strm3);
}
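To make the stream-based interface concrete, here is a plain C++ software model of the scan, filter and sum chain. It is a sketch, not the library's code: `Stream` is a `std::queue` stand-in for `hls::stream`, the function names and signatures are hypothetical, and the end-flag convention (one `false` per row, then a final `true`) is an assumption about how the `e_strm` streams mark the end of data.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <queue>
#include <vector>

template <typename T>
using Stream = std::queue<T>;  // stand-in for hls::stream<T>

// Scan: emit one (date, revenue) pair per row, each with end flag false,
// then a single end flag true (assumed convention).
inline void scan_col(const std::vector<uint64_t>& date, const std::vector<uint64_t>& revn,
                     Stream<uint64_t>& date_s, Stream<uint64_t>& revn_s, Stream<bool>& e_s) {
    for (std::size_t i = 0; i < date.size(); ++i) {
        date_s.push(date[i]);
        revn_s.push(revn[i]);
        e_s.push(false);
    }
    e_s.push(true);
}

// Filter: forward the revenue of rows whose date is >= bound.
inline void filter_ge(uint64_t bound, Stream<uint64_t>& date_s, Stream<uint64_t>& revn_in,
                      Stream<bool>& e_in, Stream<uint64_t>& revn_out, Stream<bool>& e_out) {
    bool end = e_in.front(); e_in.pop();
    while (!end) {
        uint64_t d = date_s.front(); date_s.pop();
        uint64_t r = revn_in.front(); revn_in.pop();
        if (d >= bound) { revn_out.push(r); e_out.push(false); }
        end = e_in.front(); e_in.pop();
    }
    e_out.push(true);
}

// Aggregate: sum every value until the end flag arrives.
inline uint64_t aggregate_sum(Stream<uint64_t>& revn_s, Stream<bool>& e_s) {
    uint64_t sum = 0;
    bool end = e_s.front(); e_s.pop();
    while (!end) {
        sum += revn_s.front(); revn_s.pop();
        end = e_s.front(); e_s.pop();
    }
    return sum;
}

// Model of the example query: sum of revenue where date >= 20180101.
inline uint64_t query_model(const std::vector<uint64_t>& date, const std::vector<uint64_t>& revn) {
    Stream<uint64_t> d1, r1, r2;
    Stream<bool> e1, e2;
    scan_col(date, revn, d1, r1, e1);
    filter_ge(20180101, d1, r1, e1, r2, e2);
    return aggregate_sum(r2, e2);
}
```

Because every stage only reads from and writes to streams, any stage can be swapped or extended without touching its neighbours, which is the integration benefit the uniform interface aims for.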
˃ Example: Implement Hash-Join in SDx Library
˃ Key to performance: JOIN
˃ To do a hash-join
• Create a hash table
• Store the small table in the hash table, or store a reference to it in the hash table
˃ Make best use of storage
• Store hash-to-buffer-offset in URAM
  ‒ On-chip lookup
  ‒ Small table up to 8M
• Store the small table in HBM
  ‒ Much more capacity than on-chip memory
  ‒ Lower latency than DDR
  ‒ More banks than DDR, less contention
˃ Scan by column
• Less data to the FPGA
• Fewer execution stalls
Example query:
SELECT SUM(L_EXTENDEDPRICE * (1 - L_DISCOUNT)) AS REVENUE
FROM LINEITEM, ORDERS
WHERE L_ORDERKEY = O_ORDERKEY
  AND O_ORDERDATE >= '1994-01-01'
  AND O_ORDERDATE < '1995-01-01'
;
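The build and probe steps of a hash-join for this query can be sketched in software as follows. This is an illustration under assumptions, not the library's implementation: the struct layouts, the integer encodings (dates as YYYYMMDD, prices in cents, discounts in whole percent) and the name `join_revenue` are all hypothetical, and `std::unordered_set` replaces the URAM/HBM hash table.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_set>
#include <vector>

struct Order {
    uint32_t orderkey;
    uint32_t orderdate;  // encoded as YYYYMMDD
};

struct LineItem {
    uint32_t orderkey;
    int64_t price_cents;   // L_EXTENDEDPRICE in cents
    int64_t discount_pct;  // L_DISCOUNT in whole percent
};

// Returns SUM(price * (1 - discount)) in units of 1/100 cent.
inline int64_t join_revenue(const std::vector<Order>& orders,
                            const std::vector<LineItem>& items,
                            uint32_t date_lo, uint32_t date_hi) {
    // Build: hash the keys of the small table that pass the date filter.
    std::unordered_set<uint32_t> keys;
    for (const Order& o : orders)
        if (o.orderdate >= date_lo && o.orderdate < date_hi)
            keys.insert(o.orderkey);

    // Probe: stream the large table through the hash table and aggregate.
    int64_t revenue = 0;
    for (const LineItem& l : items)
        if (keys.count(l.orderkey))
            revenue += l.price_cents * (100 - l.discount_pct);
    return revenue;
}
```

Filtering ORDERS before the build keeps the hash table small, which matters when the table must fit in on-chip or HBM storage.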
˃ Example: Implement Hash-Join in SDx Library
˃ Define the kernel function set:
• Join two tables with different numbers of payload columns
• Filter the orders table according to o_orderdate
• Store/fetch the number of result rows to/from DDR/HBM for consecutive kernel calls
˃ Pick primitives
• Hash-Join
• Filter
• Scan-col (with #row from column-buffer)
• Combine and Split (for payload columns)
˃ Assemble, write a bypass-able wrapper
[Figure: SDx kernel inside a by-passable wrapper. Labels: HBM buffers, col scan-col (x6), filter, combine (key, payload), 4-row-wide stream, hash-join, split, col write-col (x4).]
˃ Example: Implement Hash-Join in SDx Library
˃ Hash-Join-MPU Primitive: multi-channel multi-PU hash-join
• The switcher distributes rows to PUs according to the MSBs of the hash.
• 8 PUs, each with a dedicated bank to avoid conflicts.
• 4 input channels supply enough data for 8 PUs.
[Figure: multi-channel multi-PU hash-join. Labels: key and payload inputs, 4ch wide; Dispatch units CH-0 to CH-3; switcher; 8 PU (w/ 8 HBM temp buffers), each with bitmap, build and probe; HBM; collector.]
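Distributing rows by the most significant bits of the hash can be sketched as below. The names are hypothetical and the mixing function is only a stand-in (a MurmurHash3-style 32-bit finalizer) for whichever hash primitive the design actually uses; the point is that the top 3 bits of a 32-bit hash select one of 8 PUs, so equal keys always land on the same PU.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr int kNumPU = 8;  // from the slide: 8 PUs

// Stand-in 32-bit mixer (assumption: the real design uses a library hash).
inline uint32_t mix32(uint32_t k) {
    k ^= k >> 16;
    k *= 0x85ebca6bu;
    k ^= k >> 13;
    k *= 0xc2b2ae35u;
    k ^= k >> 16;
    return k;
}

// PU index = top 3 bits of the 32-bit hash.
inline int pu_of(uint32_t hash) { return static_cast<int>(hash >> 29); }

// Switcher model: route each key to the PU chosen by its hash.
inline std::array<std::vector<uint32_t>, kNumPU> dispatch(const std::vector<uint32_t>& keys) {
    std::array<std::vector<uint32_t>, kNumPU> pu{};
    for (uint32_t k : keys)
        pu[pu_of(mix32(k))].push_back(k);
    return pu;
}
```

Since build and probe rows with the same key hash to the same PU, each PU can keep its part of the hash table in its own bank without coordinating with the others.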
˃ Performance:
• 20 ms data transfer time + init overhead of the SDx runtime
• 10 ms kernel execution time
• 18 ms per query when data transfer is overlapped with the previous query's kernel execution in continuous multiple runs
[Figure: timeline of a pipelined continuous run with write (W0 to W5), kernel (K0 to K5) and read (R) stages. A single run takes 30 ms (20 ms + 10 ms); the pipelined interval is 18 ms.]
TPC-H Query 5 Acceleration (VU37P, SDx 2018.2)
Throughput (ms/query): Commercial DB 140, FPGA 18 (7.8x speedup)
Resource Utilization
LUT 63362 | REG 84313 | BRAM 124 | URAM 64 | DSP 3 | HBM Bank 8
Clock frequency: 274.5 MHz
Kernel execution time (1 GB data): 11.2 ms
Today's topics: Summary
˃ The SDx/HLS flow is helpful for innovation on FPGA
• Easy for thinking and developing
• Fast for verification and iteration
˃ SDx Libraries can help developers greatly
• More productive
• Better performance
Fundamental libraries:
• Basic data type
• Math library, stream library, HLS MAL, XHPP…
• …
Algorithm-specific libraries:
• Image compression: WebP, Lepton (TBD)
• Security, data compression, DSP, Open-CV…
Domain-specific libraries:
• Xilinx Database Acceleration Library (TBR)
• Quantitative Financial Library
• High Performance Computing Library
Adaptable. Intelligent.
Xilinx Technology Day