
High-Throughput Lossless Compression on Tightly Coupled CPU-FPGA Platforms
Weikang Qiao∗, Jieqiong Du∗, Zhenman Fang∗†, Michael Lo∗, Mau-Chung Frank Chang∗, Jason Cong∗

∗Center for Domain-Specific Computing, UCLA    †Xilinx
{wkqiao2015, du.jieqiong, milo168}@ucla.edu, {zhenman, cong}@cs.ucla.edu, [email protected]

Abstract—Data compression techniques have been widely used to reduce data storage and movement overhead, especially in the big data era. While FPGAs are well suited to accelerate the computation-intensive lossless compression algorithms, big data compression with parallel requests intrinsically poses two challenges to the overall system throughput. First, scaling existing single-engine FPGA compression accelerator designs already encounters bottlenecks which will result in lower clock frequency, saturated throughput and lower area efficiency. Second, when such FPGA compression accelerators are integrated with the processors, the overall system throughput is typically limited by the communication between a CPU and an FPGA.

We propose a novel multi-way parallel and fully pipelined architecture to achieve high-throughput lossless compression on modern Intel-Altera HARPv2 platforms. To compensate for the compression ratio loss in a multi-way design, we implement novel techniques, such as a better data feeding method and a hash chain to increase the hash dictionary history. Our accelerator kernel itself can achieve a compression throughput of 12.8 GB/s (2.3x better than the current record throughput) and a comparable compression ratio of 2.03 for standard benchmark data. Our approach enables design scalability without a reduction in clock frequency and also improves the performance per area efficiency (up to 1.5x). Moreover, we exploit the high CPU-FPGA communication bandwidth of HARPv2 platforms to improve the compression throughput of the overall system, which can achieve an average practical end-to-end throughput of 10.0 GB/s (up to 12 GB/s for larger input files) on HARPv2.

I. INTRODUCTION

Data compression techniques have been widely used in datacenters to reduce data storage and network transmission overhead. Recent studies show FPGAs are promising candidates for accelerating the Deflate algorithm, which is the core of many lossless compression standards such as ZLIB [1] and GZIP [2]. One study from Altera [3] implements a Deflate accelerator using OpenCL, which processes 15 bytes per cycle at a clock frequency of 193 MHz and achieves a compression throughput of 2.84 GB/s. Another study from Microsoft [4] using a hardware description language (HDL) scales the design to process 32 bytes per cycle at a maximum clock frequency of 175 MHz. It achieves a record compression throughput of 5.6 GB/s on FPGAs.

However, most prior studies [3], [4] only report the theoretical compression throughput for the FPGA kernel itself. In fact, an FPGA needs to read its input data from a CPU's DRAM, perform the compression, and then write the output data back to the CPU's DRAM. This CPU-FPGA communication can introduce significant overhead, especially in the mainstream PCIe-based loosely coupled CPU-FPGA platforms [5]. Two recent studies [6], [7] do observe a significant degradation of the compression throughput for the overall PCIe-based CPU-FPGA platform. They only achieve marginal improvement compared to a multicore CPU implementation for big data workloads that usually perform compression on different data partitions concurrently.

Even given higher CPU-FPGA communication bandwidth, scaling current single-engine FPGA accelerator designs also encounters a bottleneck. As the Microsoft design [4] presented, each time the number of bytes processed per cycle (BPC) is scaled by 2x, the resource utilization increases by around 3x. Moreover, even provided with more area resources in larger FPGAs, further scaling BPC (e.g., from 32 to 64) for a single compression engine will increase the critical path and degrade the clock frequency, saturating the total throughput. As a result, there will be a large degradation of performance per area efficiency when further scaling BPC in a single-engine design.

In this paper we present a novel multi-way parallel Deflate compression accelerator design where each way is a well-optimized and fully pipelined Deflate engine. It improves the overall design scalability, as the clock frequency of the design is not affected by the number of Deflate engines. It also improves the performance-per-area efficiency, since the resource utilization grows almost linearly with the throughput. However, the multi-way parallel design comes at the cost of a degraded compression ratio, because compression opportunities in one engine may disappear when the matching records reside in another engine. To maintain a compression ratio comparable to prior studies, we provide novel optimizations within each Deflate accelerator engine, including 1) a better data feeding method to reduce the loss of dictionary records, and 2) a hash chain to increase the hash dictionary history. Moreover, we break the clock frequency bottleneck in current designs by introducing a double bank design, instead of the double clock design used in [4], and reverse multiplexer designs in the hash memory update.

By parallelizing up to four Deflate engines on HARPv2, we can compress up to 64 bytes of data per cycle at a fixed clock frequency of 200 MHz, with a compression ratio of 2.03. That is, our Deflate accelerator can achieve a peak compression throughput of 12.8 GB/s, which is the best published result to the best of our knowledge. Compared to the record design by Microsoft [4] with 5.6 GB/s throughput, we achieve 2.3x higher throughput, 1.5x better performance-per-area efficiency, better scalability, and a comparable compression ratio (96%). In addition, on the HARP platform that has the same Stratix V FPGA as the Microsoft design [4], our accelerator design can achieve a throughput of 9.6 GB/s, with a 1.4x better performance-per-area efficiency.

Finally, we also explore the impact of CPU-FPGA communication bandwidth on system-level compression throughput. We wrap our FPGA accelerator with the CPU software invocation and abstract it as a software library on the modern Intel-Altera HARP and HARPv2 platforms. With the CPU-FPGA communication bandwidth significantly increased, we achieve an average practical end-to-end compression throughput of 3.9 GB/s on HARP and 10.0 GB/s (up to more than 12 GB/s for large input files) on HARPv2. This shows that the compression design is powerful enough for real-world applications. We will open source our design at https://github.com/WeikangQiao/HT-Deflate-FPGA.


II. DEFLATE ALGORITHM AND REVIEW

A. Prior Studies and Their Limitations

In addition to the two recent designs discussed in Section I, there are several other studies. IBM proposed a multi-way parallel compression engine design based on the 842B algorithm [8]. They implemented a single compression engine that processes 8 bytes per cycle at a clock frequency of 125 MHz and achieves a compression ratio of 1.96. By applying four engines they can get a throughput of 4 GB/s, but the compression ratio will be further sacrificed based on our study. Furthermore, there are no system interface and data feeding methods to support the multi-way parallel compression kernel, and thus no in-depth analysis or solid implementation when the compression engine scales to process a larger equivalent data window size, as will be done in our study. Later on, IBM implemented another Deflate accelerator [9] which achieved a throughput of 16 bytes/cycle at 250 MHz, i.e., 4 GB/s. However, its scalability is limited due to certain architectural choices like a 256-port hash table. Another Xpress9 compressor [10], targeting high compression ratio, integrated seven engines to support heavily multi-threaded environments. However, its throughput is limited to 200 to 300 MB/s.

A quantitative comparison of our work to the recent IBM [9], Altera [3], and Microsoft [4] implementations will be presented in Section V-B.

B. Algorithm Overview

The lossless Deflate algorithm [11] mainly includes two stages: first, it performs the dictionary-based LZ77 [12] compression; second, it performs Huffman encoding [13] to compress at the bit level.

LZ77 [12] scans the incoming byte stream and compares the new input with the entries in a dictionary which is populated by past input data. After finding a common string of length L, the repeated section of the incoming string is replaced with an (L, D) pair, where D is the distance between the history and the incoming string. Then the incoming string is recorded in the dictionary for future references. The Deflate format limits distances to 32K bytes and lengths to 258 bytes, with a minimum length of 3.
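As a concrete illustration of the (L, D) emission described above, the following is a minimal software reference sketch of LZ77-style matching (a hypothetical brute-force model for clarity, not the paper's hardware pipeline), using the Deflate limits of a 32K-byte history, a minimum match length of 3, and a maximum of 258.

```c
#include <stdio.h>
#include <string.h>

#define WINDOW  32768   /* Deflate history distance limit */
#define MIN_LEN 3
#define MAX_LEN 258

/* Emit (L, D) pairs for buf[0..n): a brute-force LZ77 reference model. */
static void lz77_reference(const unsigned char *buf, size_t n) {
    size_t pos = 0;
    while (pos < n) {
        size_t best_len = 0, best_dist = 0;
        size_t start = pos > WINDOW ? pos - WINDOW : 0;
        for (size_t cand = start; cand < pos; cand++) {   /* scan the history */
            size_t len = 0;
            while (len < MAX_LEN && pos + len < n &&
                   buf[cand + len] == buf[pos + len])
                len++;
            if (len >= MIN_LEN && len > best_len) {
                best_len = len;
                best_dist = pos - cand;
            }
        }
        if (best_len >= MIN_LEN) {
            printf("(L=%zu, D=%zu)\n", best_len, best_dist);   /* match found */
            pos += best_len;
        } else {
            printf("literal '%c'\n", buf[pos]);                /* no match */
            pos++;
        }
    }
}

int main(void) {
    const char *s = "abcabcabcabcXYZ";
    lz77_reference((const unsigned char *)s, strlen(s));
    return 0;
}
```

Running it on the short repetitive string prints three literals, a single (L=9, D=3) pair, and three more literals; a hardware encoder replaces the linear history scan with the hashed dictionary described in Section II-C.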

Huffman encoding is known to produce the minimum length encoding given the alphabet and the relative frequency of all characters. The Deflate algorithm has a dynamic Huffman encoding option, where it creates a Huffman code with the relative frequency of the current input, resulting in the minimum length encoding for the given input's LZ77 result. However, it requires frequency tracking and must be done after the LZ77 phase. Dynamic Huffman encoding is typically used in the higher levels of the ZLIB and GZIP standards, where a high compression ratio is favored over throughput. Meanwhile, the Deflate algorithm also allows a static Huffman encoding option, where the Huffman code is generated from a golden frequency and is statically available. We use the static Huffman encoding option in this paper in order to enable the fully pipelined accelerator design.
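To make the "static Huffman encoding is just a table lookup" point concrete, the sketch below returns the fixed literal/length codes defined in RFC 1951; this is one statically available code, and the "golden frequency" code used in the actual design may differ (an assumption for illustration only).

```c
#include <stdio.h>

/* RFC 1951 fixed Huffman code for literal/length symbols 0..287.
 * Returns the code value and sets *nbits; in hardware this is a ROM lookup. */
static unsigned fixed_lit_code(unsigned sym, unsigned *nbits) {
    if (sym <= 143)      { *nbits = 8; return 0x030 + sym; }          /* 00110000.. */
    else if (sym <= 255) { *nbits = 9; return 0x190 + (sym - 144); }  /* 110010000.. */
    else if (sym <= 279) { *nbits = 7; return 0x000 + (sym - 256); }  /* 0000000..  */
    else                 { *nbits = 8; return 0x0C0 + (sym - 280); }  /* 11000000.. */
}

int main(void) {
    unsigned nbits;
    unsigned code = fixed_lit_code('A', &nbits);   /* literal 'A' = symbol 65 */
    printf("symbol 65 -> code 0x%X (%u bits)\n", code, nbits);
    return 0;
}
```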

C. Review of Existing Single-Engine Implementation

Our initial single-engine accelerator architecture is shown in Figure 1, which is similar to [3] and [4]. We summarize the implementation of the six major stages in this subsection and refer the audience to [3] and [4] for more details. Let us denote the bytes we are currently processing as "the current window," and the number of bytes processed in each clock cycle as "VEC," which represents the compression throughput.

Fig. 1: Design of single-engine fully-pipelined Deflate accelerator (pipeline stages: hash calculation, hash memory update, string match, match selection, Huffman translate, bit packing)

Fig. 2: Current and next window of input string to be compressed

Stage 1: Hash calculation. In each cycle the pipeline extracts all the substrings of length VEC starting from every byte in the current window, and indexes each substring to its corresponding history memory using hashing. For the example of VEC=4 in Figure 2, it extracts four substrings (each of length 4) for the current window. VEC-1 more bytes from the next window are required for extracting the substrings that start from the later bytes of the current window. These substrings are then hashed and later matched with history data to implement the LZ77 algorithm. VEC also determines the substring lengths to be compared, and therefore the maximum match length. It might seem that the larger the maximum match length, the better the compression ratio. However, the study in [4] shows no compression ratio improvement when increasing VEC from 24 to 32. This is because in most of the standard benchmarks there is no match whose length is larger than 24, or such a match cannot be detected by the proposed algorithm. This observation is important, as it enables us to start from a relatively small VEC design with negligible compression ratio loss.
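A minimal software sketch of Stage 1 is shown below; the hash function is an assumed placeholder (the paper does not specify one), and only the substring extraction and per-substring bank indexing are modeled.

```c
#include <stdio.h>

#define VEC       16      /* bytes processed per cycle in one engine */
#define NUM_BANKS 32      /* 2*VEC banks in the double bank design   */

/* Hash the first 4 bytes of a substring into a bank index.
 * The multiplicative hash below is only an assumed placeholder. */
static unsigned hash_index(const unsigned char *s) {
    unsigned v = (unsigned)s[0] | ((unsigned)s[1] << 8) |
                 ((unsigned)s[2] << 16) | ((unsigned)s[3] << 24);
    return (v * 2654435761u) >> 27;   /* top 5 bits -> 32 banks */
}

/* Stage 1 for one window: win points at the VEC bytes of the current
 * window followed by at least VEC-1 bytes of the next window. */
static void hash_calc(const unsigned char *win, unsigned idx[VEC]) {
    for (int i = 0; i < VEC; i++)
        idx[i] = hash_index(win + i);   /* substring i starts at byte i */
}

int main(void) {
    unsigned char data[2 * VEC] = "abcdabcdabcdabcdabcdabcdabcdabc";
    unsigned idx[VEC];
    hash_calc(data, idx);
    for (int i = 0; i < VEC; i++)
        printf("substring %2d -> bank %2u\n", i, idx[i]);
    return 0;
}
```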

Stage 2: Hash memory update. The hardware compares these VEC input substrings to the records in the hash history memory, and replaces the old records with these input strings. The hash history memory is a dual-ported RAM that enables one read and one write per clock cycle. To help reduce bank access conflicts, [4] suggested that the hash memory run at twice the clock rate of the rest of the system, so each bank can handle two read requests in one system clock cycle. Figure 3 shows an example of a bank conflict. Assuming strings "AABA," "ABAA," and "AABC" are all mapped to bank 2, in this case only the first two strings are allowed to access the bank. The switch that connects each substring to its corresponding memory bank becomes the critical path when we scale VEC to a higher number. It limits the clock frequency of the design, since each time it needs to reorder VEC inputs, where each input is a string of VEC bytes. The situation is even worse when using a doubled-rate clock, since the switch also needs to operate at that same clock frequency.

Fig. 3: Hash (bank) conflict solution 1: double clock [4]
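In software terms, the bank-conflict rule can be sketched as follows (a simplified one-access-per-bank model with hypothetical names; the double clock and double bank designs relax exactly this constraint).

```c
#include <stdbool.h>
#include <string.h>

#define VEC       16
#define NUM_BANKS 32

/* Resolve bank conflicts for one window: grant[i] is true if substring i
 * may read/update its bank this cycle (first-come-first-served by index). */
static void resolve_bank_conflicts(const unsigned bank[VEC], bool grant[VEC]) {
    bool taken[NUM_BANKS];
    memset(taken, 0, sizeof(taken));
    for (int i = 0; i < VEC; i++) {
        grant[i] = !taken[bank[i]];   /* later requesters to a claimed bank lose */
        taken[bank[i]] = true;
    }
}
```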

Stage 3: String match. The VEC records from the hash history memory are matched with their corresponding input strings in parallel, and any match with a match length smaller than 3 is rejected. Another switch is required to remap the VEC records to their counterparts, and it faces the same timing challenges as the one in Stage 2. According to the LZ77 algorithm [12], an (L, D) pair is emitted for each input position. If there is a match, then L represents the match length and D represents the distance between the history record and the current input string. If the match is rejected, then L is the byte value and D is 0. A special case that needs to be addressed: if the match length L is larger than D, L needs to be truncated to D to avoid overlapping.
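The per-substring match rules of Stage 3 can be sketched as below (hypothetical helper; it mirrors the reject-below-3 and truncate-L-to-D rules above for a single dictionary candidate).

```c
#define VEC 16

typedef struct { unsigned len; unsigned dist; } match_t;   /* (L, D) pair */

/* Compare an input substring with one dictionary candidate.
 * dist is the distance between the candidate and the current position. */
static match_t string_match(const unsigned char in[VEC],
                            const unsigned char cand[VEC], unsigned dist) {
    unsigned len = 0;
    while (len < VEC && in[len] == cand[len])
        len++;
    if (len > dist)
        len = dist;                                   /* truncate L to D */
    if (len < 3 || dist == 0)
        return (match_t){ .len = in[0], .dist = 0 };  /* rejected: literal, D=0 */
    return (match_t){ .len = len, .dist = dist };
}
```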

Stage 4: Match selection. The matches in the current window could stretch into the next window and need to be pruned to avoid match overlapping between windows. This is implemented in two phases. First, in each cycle the VEC matches from each byte in the current window are compared to select a best match that extends farthest; the best match is taken, and any byte included in the best match is covered. This comparison can be done in parallel. Then the bytes in between the current best match and the best match extended from the previous window are examined one by one to resolve conflicts between adjacent matches; this is referred to as lazy evaluation [4]. This phase is iterative and can be implemented as a series of VEC pipeline stages.

Stage 5: Huffman translation. The following stage is Huffman encoding. The method of counting symbol frequency and using dynamic Huffman encoding no longer works, because the Huffman packing pipeline must run simultaneously with the LZ77 stages in a pipelined fashion. Therefore, static Huffman encoding is used in this design. Static Huffman encoding is nothing but a dictionary lookup, which is implemented as a ROM. The VEC (L, D) pairs can be looked up within VEC ROMs in parallel. After each (L, D) pair gets translated, we get a four-code tuple (Lcode, Dcode, Lextra, Dextra). Lcode and Dcode are the codes for the literal and the distance; Lextra and Dextra are the extra bits to encode the literal and the distance.

Stage 6: Bit packing. The last step involves packing the binary codes of different lengths together and aligning them to byte boundaries. This is because the length of the four-code tuple output ranges from 8 bits to 26 bits, while the data we finally store are in byte format. Packing can be easily achieved by a series of shift-OR operations [4].
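The shift-OR packing idea can be illustrated with a small software bit writer (a sketch assuming LSB-first bit order as in Deflate bit streams; not the actual pipelined hardware).

```c
#include <stdio.h>
#include <stdint.h>

/* Accumulate variable-length codes with shift-OR and flush whole bytes. */
typedef struct {
    uint64_t acc;     /* bit accumulator                  */
    unsigned nbits;   /* number of valid bits held in acc */
    FILE    *out;
} bitpacker_t;

/* code must hold its value in the low 'len' bits (8..26 bits per tuple). */
static void pack_bits(bitpacker_t *bp, uint32_t code, unsigned len) {
    bp->acc |= (uint64_t)code << bp->nbits;   /* shift-OR into the accumulator */
    bp->nbits += len;
    while (bp->nbits >= 8) {                  /* emit byte-aligned output */
        fputc((int)(bp->acc & 0xFF), bp->out);
        bp->acc >>= 8;
        bp->nbits -= 8;
    }
}

static void pack_flush(bitpacker_t *bp) {
    if (bp->nbits)                            /* pad the final partial byte */
        fputc((int)(bp->acc & 0xFF), bp->out);
    bp->acc = 0;
    bp->nbits = 0;
}
```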

Fig. 4: Overview of our multi-way parallel and fully-pipelined Deflate accelerator design (parallel Deflate cores with read/write arbiters, reorder buffers, FIFOs, read/write queues and I/O control in the Accelerator Function Unit, connected to the processor through the FPGA Interface Unit over CCI-P and QPI+PCIe)

III. ACCELERATOR OPTIMIZATION

To improve the compression throughput and overall system efficiency of the FPGA Deflate accelerator design, we exploit a multi-way parallel design, where each accelerator engine compresses a relatively small amount of data concurrently. A system-level overview of our multi-way accelerator design is shown in Figure 4. The details of the CPU-FPGA interface will be introduced in Section IV-A. Our initial accelerator engine pipeline is similar to that in [3] and [4]. Note that for a single accelerator engine, as the data window size VEC increases from 16 to 32 bytes (per cycle), the ALM resource usage on FPGAs increases nonlinearly by roughly 2.6x, while the compression ratio improves by only 3%, as observed both by us and by the study in [4]; so a data window size of 16 is a good trade-off between resource usage and compression ratio. We will evaluate the design trade-offs in Section V.

To compensate for the compression ratio loss in our multi-way design and improve the overall efficiency, in this section we mainly present the novel optimizations implemented in our accelerator. First, we propose a better method of feeding data to multiple engines, which reduces the loss of dictionary records and thus improves the compression ratio. Second, we propose single-engine optimizations, including hash chains to increase the hash dictionary length and improve the compression ratio, and a double bank design with switch optimizations to improve the clock frequency and resource utilization.

A. Multi-Engine Data Feeding Method

Since we propose a multi-way parallel FPGA Deflate accelerator design, shown above in Figure 4, we need to divide the input file into multiple (four in our example) segments to feed each Deflate engine (core). There are two ways to fetch data for the FPGA accelerator. The first is cyclic data feeding, shown in Figure 5(a). Each time it fetches four small consecutive blocks (e.g., VEC bytes of data, or cachelines on the HARP and HARPv2 platforms) and feeds them to the four parallel engines. The second is block data feeding, shown in Figure 5(b); it is also employed by multi-core software compression solutions such as pigz [14]. It segments the entire input file into four large consecutive parts. Each time it fetches one (or multiple) cachelines from one of the four parts and feeds them into each compression engine.

Fig. 5: Data feeding from CPU to FPGA Deflate accelerator: (a) cyclic data feeding; (b) block data feeding

Due to the fact that similar pieces of data are usually located in nearby regions of a file, the compression ratio (measured as input file size divided by output file size) of block data feeding is much higher than that of cyclic data feeding. In fact, the block feeding method suits the Deflate algorithm perfectly for compressing large files, since the Deflate format limits the history distance to 32K bytes. For files much larger than 32K bytes, strings in the latter part will not be compared to a previous part at a distance longer than 32K bytes. Therefore, segmenting large files for block data feeding has only a minor impact on the compression ratio, because only the compression of data strings at block boundaries might be affected. Compared to cyclic data feeding, the degradation in compression ratio due to block segmentation is negligible. On the other hand, cyclic data feeding offers better streaming capability. However, block data feeding can work for streaming applications as well by processing a collection of streaming data each time, for example, on the order of a few megabytes. Considering these factors, we use block data feeding as our default data feeding method.
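A minimal sketch of the host-side segmentation implied by block data feeding is shown below (hypothetical helper; the real host code on HARP/HARPv2 would also align segment sizes to 64-byte cache lines).

```c
#include <stddef.h>

#define NUM_ENGINES 4

typedef struct { size_t base; size_t size; } segment_t;

/* Split n bytes into NUM_ENGINES consecutive segments; the last segment
 * absorbs the remainder. Cache-line alignment (an assumption) is omitted. */
static void make_segments(size_t n, segment_t seg[NUM_ENGINES]) {
    size_t chunk = n / NUM_ENGINES;
    for (int i = 0; i < NUM_ENGINES; i++) {
        seg[i].base = (size_t)i * chunk;
        seg[i].size = (i == NUM_ENGINES - 1) ? n - seg[i].base : chunk;
    }
}
```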

B. Single Engine Optimization

In this subsection we present our optimizations for the single compression engine to improve the compression ratio, clock frequency, and resource utilization efficiency. These optimizations mainly focus on the major bottleneck stages 2 and 3 described in Section II-C, which 1) map the input strings to be stored to the memory banks; 2) obtain and map the read results of previous strings from the memory banks for the inputs; and 3) compare the input strings and previous strings to get the match length and distance results. With all the optimizations, our single compression engine is fully pipelined, with a pipeline initiation interval of one.

Fig. 6: Hash memory chain for each hash history (chain depths 1-3 each return a candidate to compare against the input substring)

Fig. 7: Hash (bank) conflict solution 2: double bank in this paper

1) Hash Chain Implementation: As presented in Section III-A, even given the block data feeding method, the compression ratio will drop to some extent, which means we need to compensate for the potential compression ratio loss in a single engine. Therefore, we increase the history dictionary size to find better matches. To achieve this, a hash memory chain, as shown in Figure 6, is implemented in the hash memory update stage. It behaves like a shift register, but in more of a register file format. Every cycle, the different depths of the chained memory all return a candidate string as their read output, and the current depth's read output is the input written into the next depth's memory. The substring in the current window is stored into the memory at the first depth of the chain. In the example, candidates 1, 2 and 3 are the outputs of the memories at different depths of the chain, and all of them will be compared to the substring to find the best match. After that, the current substring is stored into chain depth 1, candidate 1 is stored into chain depth 2, and so on.
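In software terms, the hash memory chain behaves like a shift register of dictionaries: each update returns one candidate per depth and then pushes entries one depth deeper. A minimal sketch follows (hypothetical data layout and sizes; the real design stores VEC-byte strings in BRAM banks).

```c
#include <string.h>

#define CHAIN_DEPTH 3
#define NUM_BANKS   32
#define BANK_ROWS   512      /* rows per bank: an assumed dictionary size   */
#define STR_BYTES   16       /* VEC-byte strings stored as dictionary entries */

/* One hash bank array per chain depth; entry[d][bank][row] is a history string. */
static unsigned char entry[CHAIN_DEPTH][NUM_BANKS][BANK_ROWS][STR_BYTES];

/* Read the candidates for (bank, row), then shift the chain and insert the
 * current substring at depth 0, pushing older entries one depth deeper. */
static void chain_update(unsigned bank, unsigned row,
                         const unsigned char cur[STR_BYTES],
                         unsigned char cand[CHAIN_DEPTH][STR_BYTES]) {
    for (int d = 0; d < CHAIN_DEPTH; d++)
        memcpy(cand[d], entry[d][bank][row], STR_BYTES);      /* read all depths */
    for (int d = CHAIN_DEPTH - 1; d > 0; d--)
        memcpy(entry[d][bank][row], entry[d - 1][bank][row], STR_BYTES);
    memcpy(entry[0][bank][row], cur, STR_BYTES);              /* newest at depth 0 */
}
```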

2) Double Clock Design vs. Double Bank Design: To reduce the compression ratio drop caused by bank conflicts, Microsoft [4] uses a second clock, whose frequency is twice the global clock, to read and update the 16 (VEC) hash memory banks (dictionaries); we refer to this as design 1. As will be evaluated in Section V, the doubled clock frequency of the memory part becomes the bottleneck of the whole design and significantly limits the system performance, especially when we integrate the idea of the hash memory chain. We propose a second design that uses a single clock while doubling the number of banks to 32 (2*VEC), as shown in Figure 7. As a result, we increase the length of the hash dictionary. This approach avoids the doubled clock frequency bottleneck and enables us to integrate more depths of the hash chain.

3) Switch Optimization: As presented in Section II-C, we implement the switches with multiplexers whose inputs are strings and whose select signals are the hash values. For the first set of switches, which map the input strings to be stored to the memory banks, we can use one 16-to-1 (VEC=16) multiplexer (MUX) for each hash chain, as shown in Figure 8. To avoid a timing bottleneck, we pipeline the 16-to-1 MUXes into two stages of 4-to-1 MUXes.

Fig. 8: Input mapping to hash memory chain, i.e., memory banks

Fig. 9: Memory bank (hash table) mapping to input option 1: string matching after 128-to-1 MUX

For the second set of switches, which map the read results of previous strings from the memory banks back to the inputs, the situation is more complex. A straightforward way is to use a 32-to-1, 128-bit-wide MUX to select the record string from the 32 output ports of the memory banks (hash tables) for each depth of the chain, and then compare it with the input string, as shown in Figure 9. This way, strings that cannot access the memory (e.g., due to a bank conflict because their mapped banks have already been accessed) are still able to check whether they match the previous strings. Another way is to leave those strings as mismatched literals and simply compare each memory bank's output port data with the data delayed by one cycle from its input port, as shown in Figure 10. If a bank has not been accessed (e.g., Bank 3), then its match calculation unit generates a pair (L=0, D=0) to indicate the case. The comparison result is simply a 5-bit length value, which is much shorter than the original 128-bit data string. So we can use 5-bit-wide MUXes to select the corresponding hash match results for each input string. In this design we employ the second option to avoid the overhead of wide multiplexers and improve timing and resource consumption. The compression ratio only degrades by 3%, since we have actually mapped the 16 input strings to 32 memory banks (design 2), or conceptually the same via the doubled clock frequency (design 1). The second option eliminates the timing bottleneck even when we scale further up. For example, when we scale to process 32 bytes per cycle, we only need to use 32 6-to-1 MUXes, which introduces much less overhead and does no harm to the clock frequency of the design.

Fig. 10: Memory bank (hash table) mapping to input option 2: 5-to-1 MUX after string matching

Fig. 11: Intel-Altera HARP and HARPv2 CPU-FPGA architecture

IV. SYSTEM INTEGRATION

In this section we first introduce some of the background on the recent Intel-Altera HARP [15] and HARPv2 [16] CPU-FPGA platforms that we leverage to achieve high end-to-end compression throughput. Then we present the CPU-FPGA communication flow of our multi-way accelerator architecture in the tightly coupled Intel-Altera HARPv2 CPU-FPGA platform.

A. CPU-FPGA Platform

The mainstream PCIe-based CPU-FPGA platforms use direct memory access (DMA) for an FPGA to access the data from a CPU. First, the FPGA needs a memory controller IP to read the data from the CPU's DRAM into its own DRAM through PCIe. Then the FPGA performs the specified acceleration in its accelerator function units (AFUs). In fact, this DRAM-to-DRAM communication through PCIe can have a limited bandwidth in practice; for example, the practical PCIe bandwidth can be only around 1.6 GB/s (although the advertised bandwidth is 8 GB/s) according to the study in [5]. This makes it impractical to achieve full-speed acceleration even if we have a high-throughput compression accelerator on the FPGA side.


Fig. 12: Flow chart of CPU-FPGA system execution (CPU: service allocation; data loading, segmentation and pointer storing; wait and poll; service free. FPGA: preparation, data loading, compression execution, done)

The recent Intel-Altera HARP platform uses the QuickPath Interconnect (QPI) protocol. HARPv2 uses one QPI channel and two PCIe channels within a single package to connect the CPU cores and FPGA accelerators, as shown in Figure 11. The CPU and FPGA can communicate using the shared memory, which significantly increases the communication bandwidth. The AFU can read/write data directly from/to the system virtual memory through the core-cache interface (CCI). This gives the FPGA first-class access to the system memory and thus achieves high-bandwidth, low-latency CPU-FPGA communication. According to the study in [5], HARP can provide 7.0 GB/s FPGA read bandwidth and 4.9 GB/s FPGA write bandwidth. Using the same benchmarking method as [5], we find that HARPv2 improves further and can provide more than 15 GB/s FPGA read and write bandwidth. Such high CPU-FPGA communication bandwidth provides opportunities to achieve a much higher end-to-end compression throughput in practice.

Moreover, this shared system memory architecture eliminates explicit memory copies and increases the performance of fine-grained memory accesses. This proves to be rather useful in our multi-way parallel design, where we need to read/write to multiple random memory locations in consecutive cycles.

B. CPU-FPGA Communication Flow and Interface Design

Figure 12 presents the flow of compression execution. At initialization, the CPU first allocates the memory workspace which is shared by the CPU and the FPGA. This includes the device status memory (DSM) workspace to keep track of the status of the FPGA accelerator, a source buffer for storing the source data, and a destination buffer to store the processed output data. After loading the source file into the source buffer, the CPU segments the source file equally into N blocks (four in our example) and then writes the segmented base addresses and block sizes to the FPGA. Once the initial setup is done, the CPU enters a wait state and polls the status bit in the DSM until the FPGA sends an ending signal. On the other side, the FPGA accelerator begins execution after loading the addresses and source data block sizes. Each engine is allowed an equal share of time to send read requests for source data from the shared memory using a round-robin arbiter. Since read responses are out of order on the HARPv2 platform, reorder buffers are used to restore the source data order before compressing. The processed results are kept temporarily in first-in-first-out (FIFO) buffers before being written back to the destination buffer in the shared memory. Another arbiter is used to control write access among the Deflate engines. After all the source data are processed, the FPGA signals the CPU that all work has been done.

In HARPv2, the data transfer size for each read/write request can be 1, 2, or 4 consecutive cache lines (64 bytes per cache line); for maximum CPU-FPGA data bandwidth, our design uses 4 cache lines per request. The three physical links (one QPI and two PCIe) are configured as one virtual channel in our accelerator design, and the FPGA Interface Unit (FIU) optimizes the communication bandwidth by steering requests among the three physical links.

V. EVALUATION

A. Experimental Setup

We test our FPGA Deflate accelerators on the HARPv2 [16] platform (our major platform) and use the Calgary Corpus [17] datasets as the benchmarks to measure the average compression throughput and compression ratio. To show how CPU-FPGA communication bandwidth may limit end-to-end throughput, and to rule out differences due to FPGA hardware generations, we also test some of the designs on the HARP platform as a comparison, since HARP still has a very limited communication bandwidth and uses the same Stratix V FPGA as previous studies [3], [4].

HARP Platform. The HARP platform integrates an Intel Xeon E5-26xx v2 processor and an Altera Stratix V GX FPGA.

HARPv2 Platform. The HARPv2 platform integrates a 14-core Broadwell EP CPU and an Altera Arria 10 GX1150 FPGA.

B. Comparison of Deflate Accelerators

Since most prior studies only measure the performance of the FPGA Deflate accelerator, we first compare our accelerator design to state-of-the-art studies in Table I. By using the novel multi-way parallel and fully pipelined accelerator architecture, where each way features a single clock, 32 banks, and a memory chain depth of 3, and processes 16 bytes per cycle, we can compress 64 bytes/cycle (4-way parallel) at a clock frequency of 200 MHz. This is more scalable than merely scaling up the data window size of a single Deflate engine, since the area increases nonlinearly with the data window size and the maximum frequency the engine can achieve drops as a result of routing problems. In summary, our FPGA Deflate accelerator achieves a record compression throughput of 12.8 GB/s, which is 2.3x faster than the prior record in [4]. In addition, our design is also much more resource-efficient in terms of compressed MB/s per kilo ALMs, which is 1.5x better than the prior record in [4]. Please also note that we measure our designs at a fixed clock frequency of 200 MHz, since the platform provides this fixed clock. In fact, due to our optimizations, our FPGA kernel can work at an even higher clock frequency.
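For reference, the peak kernel throughput follows directly from the per-cycle window size and the clock frequency:

\[ 4~\text{engines} \times 16~\mathrm{B/cycle} \times 200~\mathrm{MHz} = 64~\mathrm{B/cycle} \times 2\times10^{8}~\mathrm{cycles/s} = 12.8~\mathrm{GB/s}. \]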

The compression ratio drops by 4.7% compared to Microsoft's work [4], but it is still acceptable. This drop is due to the fact that we divide the input file into four parts, which are fed into four different compression engines. If data in one part would have matched better with data in another part, then in our case the compression ratio drops.

To be fair about the comparison, we also break down the compression accelerator kernel speedup into two parts:


Table I: FPGA Deflate accelerator comparison
Design | Frequency | Throughput | Compression ratio | Area (ALMs) | Efficiency (MB/s per kilo ALMs)
IBM [9] | 250 MHz | 4 GB/s | 2.17 | 110,000 | 36
Altera [3] (Stratix V) | 193 MHz | 2.84 GB/s | 2.17 | 123,000 | 23
Microsoft [4] (Stratix V) | 175 MHz | 5.6 GB/s | 2.09 | 108,350 | 52
This work on HARP (Stratix V) | 200 MHz | 9.6 GB/s | 2.05 | 134,664 | 71.3
This work on HARPv2 (Arria 10) | 200 MHz | 12.8 GB/s | 2.03 | 162,828 | 78.6

Fig. 13: Compression ratio when scaling from 1 engine to 4 engines (per Calgary Corpus benchmark; bars: 1 engine vs. 4 engines)

1) Improvement from Accelerator Design: Since previous studies [3], [4] both implement their designs on a Stratix V FPGA, we also list the experimental results on HARP, which also uses a Stratix V. Due to the total resource limitation, we can only fit three Deflate engines on it. Still, we can achieve a compression throughput of 9.6 GB/s, which is 1.7x faster and 1.4x more efficient in performance per area than the record design [4].

2) Improvement from Technology Advancement: HARPv2 uses a more advanced Arria 10 FPGA, which provides more area and enables us to integrate one more Deflate engine than the HARP Stratix V FPGA. As a result, we can achieve 12.8 GB/s compression throughput on HARPv2. Table II lists the FPGA resource utilization on the HARPv2 platform, where ALMs are the main contributor. Note that the HARPv2 interface itself occupies an additional 20% of the FPGA resources.

Table II: FPGA resource utilization on HARPv2
Resources | Amount
ALMs | 162,828 (38%)
Registers | 248,188 (15%)
Block Memory Bits | 16,177,152 (29%)

C. Scaling Effect

Figure 13 shows the compression ratio on the Calgary Corpus benchmarks as we change the number of parallel engines from 1 to 4. The single engine (VEC=16) achieves an average compression ratio of 2.11. When we increase the throughput to 12.8 GB/s with 4 engines, the average compression ratio drops by 3.7%. Note that the compression ratio of the large files (e.g., book1, book2, news and pic) degrades by less than 1%. This result supports the analysis presented in Section III-A that the multi-way parallel Deflate compressor is particularly well suited to applications where large files need to be compressed.

Table III: Scaling results on HARPv2
Parallel Engine No. | Throughput | Compression ratio | Area (ALMs)
1 | 3.2 GB/s | 2.11 | 38,297
2 | 6.4 GB/s | 2.07 | 78,604
3 | 9.6 GB/s | 2.05 | 118,891
4 | 12.8 GB/s | 2.03 | 162,828

Table III lists the area usage (and compression ratio) when scaling up the number of engines; the total area usage increases roughly linearly. In contrast, directly increasing the data window size of a single compression engine from 16 to 32 bytes increases the ALM usage by 2.6x, and it is also difficult to keep the frequency high. This becomes the bottleneck for scaling in cases where the system needs to run at a fixed clock frequency; for example, on HARP the system needs to run at 200 MHz. Exploiting parallel engines avoids this scaling problem, as each engine is designed and placed separately.

D. Design Tradeoff Evaluation

The single compression engine design is important, as it determines the baseline performance for the whole design. Thus, we explore the single-engine design space to choose the best configuration.

1) Memory Chain Depth: We first compare the performance of different hash chain depths for a single engine in Table IV. Increasing the depth by one adds only about 3,500 ALMs, thanks to the MUX optimization, and there is more than a 9% improvement in compression ratio when increasing the depth from 1 to 3. Note that further increasing the memory chain depth will not benefit the system performance much, and the compression ratio gain becomes marginal (less than a 1.5% improvement measured). In fact, the dictionary strings read from each depth need to be compared with the same input substring simultaneously, so the matching comparison between the input string and the output dictionary string from the deepest memory bank will become the critical path.

Table IV: Memory chain depth on HARPv2
Depth | Compression ratio | Area (ALMs)
1 | 1.92 | 31,360
2 | 2.04 | 34,761
3 | 2.10 | 38,297

2) Double Clock vs. Double Bank: As presented in Section III-B2, a double clock design (design 1) is also implemented, whose memory part uses a clock frequency that is twice the global clock. We use the same number of banks as VEC (i.e., 16 banks), set the memory chain depth to 3, and integrate 4 parallel engines. Table V summarizes the two designs. For the double clock design, since BRAM blocks are placed as arrays and the interconnect between two adjacent arrays consumes more time, the memory chain which occupies more than one array can only run at a frequency of 340 MHz under the fastest configuration, thus limiting the system performance. Since HARPv2 provides another clock group of 150/300 MHz, we take 150 MHz as the system clock for this design and achieve an equivalent throughput of 9.6 GB/s, which is only 75% of the performance of the single clock design with double banks.

However, the area efficiency of the double clock design is slightly better than that of the double bank design, since it reduces the number of memory banks and corresponding multiplexers. Another interesting benefit of the double clock design is that it gives a slightly better compression ratio. This is because raising the clock frequency while reducing the number of memory banks gives similar strings more chances of being mapped to the same bank.

Table V: Double clock vs. double bank on HARPv2
Design | Frequency | Throughput | Compression ratio | Area (ALMs) | Efficiency (MB/s per kilo ALMs)
Double clock design | 150 MHz | 9.6 GB/s | 2.10 | 115,436 | 83
Double bank design | 200 MHz | 12.8 GB/s | 2.03 | 162,828 | 78.6

Fig. 14: Compression throughput on HARPv2 under different file sizes of the Calgary Corpus datasets (series: file size in kB; practical end-to-end throughput in GB/s)

E. End-to-End Compression Throughput

We measure the end-to-end compression throughput by inserting a counter on the FPGA side. The counter counts the total clock cycles from when the FPGA sends the first request to read the memory to the time it sets the data valid bit in the memory, which indicates that the FPGA work is done and all of the data has been written back to the memory. Note that on HARP and HARPv2 the CPU and FPGA share the same memory, so there is no additional memory copy overhead as on conventional PCIe-based platforms. Denote the total clock cycles as tcycle and the total amount of data fed to the compression engine as Adata bytes; the end-to-end compression throughput is then Adata/tcycle bytes per cycle, i.e., Adata/tcycle multiplied by the 200 MHz system clock frequency to obtain bytes per second.
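As a purely illustrative example of this measurement (the file size and cycle count below are hypothetical, not measured numbers): if a 600 KB input finishes in tcycle = 30,000 cycles at the 200 MHz clock (5 ns per cycle), then

\[ \text{throughput} = \frac{600{,}000~\text{bytes}}{30{,}000 \times 5~\text{ns}} = 4~\mathrm{GB/s}. \]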

The end-to-end overall compression throughput results are shown in Table VI. To show the overhead of CPU-FPGA communication, we first test our design on HARP, where the QPI bandwidth is limited (7 GB/s for read and 4.9 GB/s for write). We integrate 3-way parallel engines on HARP, since the Stratix V FPGA module on HARP provides fewer resources and cannot accommodate more engines. The FPGA kernel throughput is 9.6 GB/s, while we can only achieve an average of 3.9 GB/s practical throughput on the Calgary Corpus benchmarks.

Table VI: End-to-end compression throughput on HARP and HARPv2
Platform | FPGA Throughput | Practical Throughput
HARP | 9.6 GB/s | 3.9 GB/s
HARPv2 | 12.8 GB/s | 10.0 GB/s

On the other hand, even considering the read and write latency, we can see that HARPv2 achieves an average end-to-end compression throughput of 10.0 GB/s, since it has over 15 GB/s CPU-FPGA communication bandwidth. We also list each benchmark's size and its corresponding practical throughput in Figure 14. If the file size is larger than 500 kB (e.g., book1, book2 and pic), then the overhead of communication latency is negligible and the end-to-end throughput can be up to 12 GB/s.

VI. CONCLUSION

In this work we designed an FPGA Deflate accelerator that can be easily scaled to achieve a record compression throughput of 12.8 GB/s while maintaining a relatively high compression ratio of 2.03. We presented methods for efficiently feeding data into the parallel compression engines, improving resource utilization, increasing the compression ratio, and improving the clock frequency of the single fully pipelined compression engine. This is the first public work to integrate the compression accelerator with the Intel-Altera HARP and HARPv2 platforms, where we leverage the high CPU-FPGA communication bandwidth to achieve high end-to-end system compression throughput. The end-to-end compression throughput we achieve is 3.92 GB/s on HARP and 10 GB/s on average (up to 12 GB/s) on HARPv2. This shows that our compression accelerator design is a great fit for the new HARP and HARPv2 platforms.

ACKNOWLEDGMENTS

This work is partially supported by Mentor Graphics under the Center for Domain-Specific Computing (CDSC) Industrial Partnership Program. We would like to thank Intel for donating the Heterogeneous Architecture Research Platform (HARP). We also thank Peng Wei from UCLA and Michael Adler from Intel for their valuable support.

REFERENCES

[1] "Zlib Compression Library," http://www.zlib.net/.
[2] "Gzip File Format Specification Version 4.3," https://tools.ietf.org/html/rfc1952.
[3] M. S. Abdelfattah et al., "Gzip on a Chip: High Performance Lossless Data Compression on FPGAs Using OpenCL," in IWOCL, 2014.
[4] J. Fowers et al., "A Scalable High-Bandwidth Architecture for Lossless Compression on FPGAs," in FCCM, 2015, pp. 161-164.
[5] Y.-k. Choi et al., "A Quantitative Analysis on Microarchitectures of Modern CPU-FPGA Platforms," in DAC, 2016.
[6] M. Huang et al., "Programming and Runtime Support to Blaze FPGA Accelerator Deployment at Datacenter Scale," in SoCC, 2016.
[7] J. Cong et al., "CPU-FPGA Co-Scheduling for Big Data Applications," vol. 35, no. 1, pp. 16-22, 2018.
[8] B. Sukhwani et al., "High-Throughput, Lossless Data Compression on FPGAs," in FCCM, 2011.
[9] A. Martin et al., "FPGA-Based Application Acceleration: Case Study with GZIP Compression/Decompression Streaming Engine," in ICCAD Special Session 7C, 2013.
[10] J.-Y. Kim et al., "A Scalable Multi-Engine Xpress9 Compressor with Asynchronous Data Transfer," in FCCM, 2014.
[11] P. Deutsch, "DEFLATE Compressed Data Format Specification Version 1.3," http://www.rfc-base.org/txt/rfc-1951.txt, 1996.
[12] J. Ziv et al., "A Universal Algorithm for Sequential Data Compression," IEEE Transactions on Information Theory, vol. 23, no. 3, pp. 337-343, May 1977.
[13] D. A. Huffman, "A Method for the Construction of Minimum-Redundancy Codes," Proceedings of the IRE, vol. 40, no. 9, pp. 1098-1101, Sept. 1952.
[14] "Parallel Gzip Compression Library," https://zlib.net/pigz/.
[15] "Xeon+FPGA Platform for the Data Center," https://www.ece.cmu.edu/~calcm/carl/lib/exe/fetch.php?media=carl15-gupta.pdf.
[16] "Accelerating Datacenter Workloads," http://fpl2016.org/slides/Gupta%20--%20Accelerating%20Datacenter%20Workloads.pdf.
[17] "The Calgary Corpus," http://corpus.canterbury.ac.nz/descriptions/#calgary.