21
1 Refine and Recycle: A Method to Increase Decompression Parallelism Authors: Jian Fang 1 , Jianyu Chen 1 , Jinho Lee 2 , Zaid Al-Ars 1 , H. Peter Hofstee 1,2 ASAP 2019, New York, US Speaker: H.Peter Hofstee July 17 th , 2019

Refine and Recycle: A Method to Increase Decompression

  • Upload
    others

  • View
    1

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Refine and Recycle: A Method to Increase Decompression

1

Refine and Recycle: A Method to Increase Decompression Parallelism

Authors: Jian Fang1, Jianyu Chen1, Jinho Lee2,Zaid Al-Ars1, H. Peter Hofstee1,2

ASAP 2019, New York, USSpeaker: H.Peter HofsteeJuly 17th, 2019

Page 2: Refine and Recycle: A Method to Increase Decompression

2

Outline

• Motivation• Snappy (De)compression• Design Challenges• Refine and Recycle Method• Proposed Decompressor Architecture• Experiments• Summary

Page 3: Refine and Recycle: A Method to Increase Decompression

3

Snappy (De)compression Background

• LZ77-based, byte-level • In Hadoop ecosystem• Support Parquet, ORC, etc.• Low compression ratio• Fast compression and decompression

Page 4: Refine and Recycle: A Method to Increase Decompression

4

Motivation

• Snappy is widely used but not fast enough• Widely used in databses and big data processing• Optimized CPU throughput about 1.4GB/s in a CPU core

(only 1% out of the main memory bandwidth)

• New and high-bandwidth connections for storage• PCIe Gen4 or OpenCAPI attached NVMe

• 24 NVMe drives have around 75GB/s read bandwidth• Difficult for prior design to keep up with this high bandwidth• FPGAs can be used to decompress them

Page 5: Refine and Recycle: A Method to Increase Decompression

5

Example Motivating System: OpenPOWER MiHawk

• 24 NVMe• Each >3GB/s read

• 4x OpenCAPI• Each ~22GB/s

• 4x Gen4 x16• e.g. 200Gb/s E-net

• Would like each FPGA-based adapter to support 6-8 NVMe ( 20-25GB/s )

Source: Wistron

Page 6: Refine and Recycle: A Method to Increase Decompression

Snappy Compression

• Two kinds of token: Literal token and copy token• Find match from previous data• Example:

6

ABCDEFGHABCD

offset 8, length4

Use (8,4) to replace “ABCD”

Page 7: Refine and Recycle: A Method to Increase Decompression

Snappy Compression

7

… ABCDEFGHABCD

Literal

copy : offset 8length 4

No Match!

… ABCDEFGHABCD

Match!

… (Literal, “ABCDEFGH”) (Copy, 8, 4)Output

Input

Generate a literal token

Find next match

Generate a copy token

Page 8: Refine and Recycle: A Method to Increase Decompression

Snappy (De)compression

8

… ABCDEFGHABCD

Copy

(Literal, “ABCDEFGH”)

… ABCDEFGH

(Copy, 8, 4)

Page 9: Refine and Recycle: A Method to Increase Decompression

9

Design Challenge

• Various token size• Meta data: 1-3 bytes• Token size: 2-64K bytes

• Parallelize token execution• BRAM bank conflict

Page 10: Refine and Recycle: A Method to Increase Decompression

10

BRAM Bank Conflict

16X

• Write – Write• Same block, same line

BRAMs

Page 11: Refine and Recycle: A Method to Increase Decompression

11

BRAM Bank Conflict

16X

• Write – Write• Same block, different lines

BRAMs

Page 12: Refine and Recycle: A Method to Increase Decompression

12

BRAM Bank Conflict

16X

• Read – Read

BRAMs

Page 13: Refine and Recycle: A Method to Increase Decompression

13

Design Challenge

• Various token size• Meta data: 1-3 bytes• Token size: 2-64K bytes

• Parallelize token execution• BRAM bank conflict

• Read-after-write dependency• Stall the pipeline

Page 14: Refine and Recycle: A Method to Increase Decompression

14

Refine Method

• Refine from token-level to BRAM bank-level• Convert tokens into independent BRAM read/write commands

16X

16X

Page 15: Refine and Recycle: A Method to Increase Decompression

15

Recycle Method

• Stall? • -> throughput decrease

• Tomasulo/Scoreboarding?• Too complex

• Recycle and wait for next round execution

Page 16: Refine and Recycle: A Method to Increase Decompression

16

Architecture Overview

Page 17: Refine and Recycle: A Method to Increase Decompression

17

Slice Parser – find token boundary

Page 18: Refine and Recycle: A Method to Increase Decompression

18

Experiment

• Platform• ADM 9v3 (Xilinx VU3P FPGA)

• CAPI 2.0 Interface (effective data rate 11GB/s)

• Power 9 22-core CPU in Ubuntu 18.04.1 LTS

• Benchmark• TPC-H (two different column in “lineitem” table and the whole table)

• XML from Wiki

• Matrix data

Page 19: Refine and Recycle: A Method to Increase Decompression

19

Experiment - Results

• End-to-end Throughput

• 250MHz• 14% LUTs• Up to 7.2GB/s• 10x faster than CPU• 10x more power

efficient

Page 20: Refine and Recycle: A Method to Increase Decompression

20

Summary

• A Refine method to releif BRAM bank conflicts• A Recycle method to reduce impact of RAW dependency• Assumption Bit Map to handle byte split in high speed• 7.2 GB/s with 14.2% logic resource and 7% BRAM resource• 5 engines can saturate OpenCAPI bandwidth • 10x faster and 10x more power efficient

Page 21: Refine and Recycle: A Method to Increase Decompression

21

Open Source

https://github.com/ChenJianyunp/FPGA-Snappy-Decompressor.git