LZ77 Compression Using Altera OpenCL

Mohamed Abdelfattah

LZ77 Compression in OpenCL

Goal: demonstrate that a high-performance compression algorithm (2 GB/s) can be implemented efficiently using the OpenCL compiler

Outline:

1. OpenCL single-threaded flow

2. LZ77 overview

3. Implementation details

4. Optimizations & results

OpenCL Single-threaded Code

- Basically C code
- OpenCL compiler extracts parallelism automatically
- Pipeline parallelism

FPGA

One or more custom kernels

Kernels can communicate directly through “channels”

kernel void simple(global const int *input, int size, global int *output)
{
    for (int i = 1; i <= size; i++) {   // i = 1..size
        int x = input[i];
        int y = input[i + 1];
        int z = x + y;
        output[i] = z;
    }
}


[Figure: the simple kernel mapped to a pipeline on the FPGA: Load x and Load y, the addition, then Store z. Successive frames show iterations 1 to 5 entering the pipeline on consecutive cycles.]

Can start new loop iteration every cycle! Initiation interval II = 1

No loop-carried dependencies




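kernel void complex(global const int *input, int size, global int *output)
{
    int loop_carried = 1;   // initial value assumed; not given on the slide
    for (int i = 1; i <= size; i++) {
        int x = input[i];
        int y = input[i + 1];
        int z;
        if (loop_carried / 2 == 1)
            z = x + y;
        else
            z = x - y;
        loop_carried *= z;   // needed by the next iteration
        output[i] = z;
    }
}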


Loop-carried computation

Need data from iteration x for iteration x+1


[Figure: side-by-side animation of the Simple and Complex pipelines (Load x, Load y, Store z). The simple kernel accepts a new iteration every cycle. In the complex kernel the loop-carried computation takes 2 cycles, so the pipeline alternates between "Stall!" (a pipeline bubble is inserted) and "Continue", and a new iteration enters only every second cycle.]

Simple: II = 1. Complex: II = 2.

A new iteration of the loop starts every “II” cycles, so the simple kernel has double the throughput of the complex one.

To close the gap: optimize the loop-carried computation.
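The effect of II on run time can be sketched with the usual pipeline cycle count. This is a minimal model, not something taken from the slides: the function name `pipeline_cycles` and the pipeline depth of 3 used in the usage note are assumptions for illustration.

```c
#include <stdint.h>

/* Cycles needed to push n loop iterations through a pipeline.
 * depth: pipeline latency in cycles (assumed value, chosen by the caller)
 * ii:    initiation interval, i.e. cycles between consecutive iteration starts */
uint64_t pipeline_cycles(uint64_t n, uint64_t depth, uint64_t ii)
{
    if (n == 0) {
        return 0;
    }
    /* The first iteration needs `depth` cycles; every later iteration
     * finishes `ii` cycles after its predecessor. */
    return depth + ii * (n - 1);
}
```

With a depth of 3, five iterations take 7 cycles at II = 1 and 11 cycles at II = 2. For large n the ratio approaches 2, which is the "double the throughput" figure.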


2. LZ77 overview



LZ77 Compression Example

This sentence is an easy sentence to compress.

1. Scan file byte by byte
2. Look for matches
   • Match length = 8
   • Match offset = 20 bytes
3. Replace with a reference to previous occurrence
   • Marker, length, offset

This sentence is an easy @(8,20) to compress.

Saved 5 bytes!
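To make the reference concrete, here is a small sketch of the decoding direction: expanding the (length, offset) pair back into the original bytes. The function names are hypothetical and the "@(8,20)" text marker is replaced by a direct call; the real byte format is described later in the deck.

```c
#include <stddef.h>
#include <string.h>

/* Expand one back-reference: copy `length` bytes that start `offset`
 * bytes before the current end of `out`. Returns the new length.
 * The caller must ensure offset <= out_len and that `out` has room. */
size_t lz77_copy_match(char *out, size_t out_len, size_t length, size_t offset)
{
    for (size_t i = 0; i < length; i++) {
        /* Byte by byte, so overlapping matches (offset < length) also work. */
        out[out_len + i] = out[out_len + i - offset];
    }
    return out_len + length;
}

/* Rebuild the example sentence from its literals and the (8, 20) match. */
size_t lz77_example(char *out)
{
    const char *head = "This sentence is an easy ";
    const char *tail = " to compress.";
    size_t n = strlen(head);

    memcpy(out, head, n);
    n = lz77_copy_match(out, n, 8, 20);        /* copies "sentence" */
    memcpy(out + n, tail, strlen(tail) + 1);   /* include the terminator */
    return n + strlen(tail);
}
```

Calling `lz77_example` on a buffer of at least 47 bytes reproduces the uncompressed sentence. The 8 matched bytes are replaced by a 3-byte reference, which is where the 5 saved bytes come from.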


3. Implementation details
- Single-threaded OpenCL flow
- Single kernel: fully pipelined, II = 1
- Throughput estimate = 16 bytes/cycle * 200 MHz = 3.2 * 10^9 bytes/s, or about 3051 MB/s (1 MB = 2^20 bytes)

Overview

1. Shift In New Data

2. Dictionary Lookup/Update

3. Match Search & Filtering

4. Write to output

Comparison against CPU/Verilog

CPU:
• Best implementation of Gzip on CPU
• By Intel Corporation
• On Intel Core i5 (32 nm) processor
• 2013
• Compression speed: 338 MB/s
• Compression ratio: 2.18X

ASIC:
• Best implementation on ASICs
• AHA products group
• Coming up Q2 2014
• Compression speed: 2.5 GB/s

FPGA (Verilog):
• Best implementation on FPGAs
• Verilog
• IBM Corporation
• Nov. 2013 ICCAD
• Altera Stratix-V A7
• Compression speed: 3 GB/s

FPGA (OpenCL):
• OpenCL design example
• Altera Stratix-V A7
• Developed in 1 month
• Compression speed: ?
• Compression ratio: ?

[Chart: compression speed. CPU: 0.3 GB/s; ASIC: 2.5 GB/s; FPGA (Verilog): 3 GB/s; FPGA (OpenCL): 2.7 GB/s]

Comparison against CPU


Same compression ratio

12X better performance/Watt

Comparison against Verilog


12% more resources

Much lower design effort and design time

10% Slower

Implementation Overview

1. Shift In New Data

[Figure: the current window is filled with input from DDR memory, VEC bytes per cycle (VEC = 4 in this example). With "sample_text" waiting at the input, the window goes from "o l d _ t e x t" to "t e x t s a m p" across one cycle boundary, leaving "le_text" still to be read. The example uses text, but the data can be anything.]
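The shift-in step can be sketched in plain C as follows. This is an illustration only: the window size of 2 * VEC is inferred from the 8-byte window in the example, and the function names `shift_in` and `substring` are hypothetical.

```c
#include <string.h>

#define VEC 4               /* bytes consumed per cycle (4 in the example) */
#define WINDOW (2 * VEC)    /* assumed window size: two cycles of data */

/* One "cycle": drop the oldest VEC bytes and append VEC new ones. */
void shift_in(char window[WINDOW], const char *input)
{
    memmove(window, window + VEC, VEC);   /* keep the newest VEC bytes */
    memcpy(window + VEC, input, VEC);     /* shift in VEC fresh bytes  */
}

/* Substring i starts at byte i of the window and is VEC bytes long.
 * These are the strings handed to the dictionary lookup step. */
void substring(const char window[WINDOW], int i, char out[VEC + 1])
{
    memcpy(out, window + i, VEC);
    out[VEC] = '\0';
}
```

Starting from the window "old_text" and the input "sample_text", one call to `shift_in` gives "textsamp", and `substring` with i = 0..3 yields "text", "exts", "xtsa" and "tsam".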


2. Dictionary Lookup/Update

Current window: t e x t s a m p

Substrings (one per starting byte): t e x t, e x t s, x t s a, t s a m

1. Compute hash
2. Look for match in 4 dictionaries
3. Update dictionaries

Dictionary0

Dictionary1

Dictionary2

Dictionary3

Dictionaries buffer the text that we have already processed, e.g.:


[Figure: each substring is hashed and looked up in all 4 dictionaries, which return one candidate each:]

t e x t: t a n _, t e x t, t e x l, t e e n
e x t s: e a t e, e a r s, e e p s, e n t e
x t s a: x a n t, x y l o, x e l y, x i r t
t s a m: t e e n, t e a l, t a n _, t a m e

Possible matches from history (dictionaries)


[Figure: each dictionary has one write port (W0 to W3) and four read ports (RDi0 to RDi3 for Dictionary i).]


Generate exactly the number of read/write ports that we need

3. Match Search & Filtering

Current windows (the substrings): t e x t, e x t s, x t s a, t s a m

Comparison windows (a set of candidate matches for each incoming substring):

t e x t: t a n _, t e x t, t e x l, t e e n
e x t s: e a t e, e a r s, e e p s, e n t e
x t s a: x a n t, x y l o, x e l y, x i r t
t s a m: t e e n, t e a l, t a n _, t a m e

Compare each current window against each of its 4 comparison windows


[Figure: comparators compare the current window "t e x t" byte by byte against its 4 comparison windows (t a n _, t e x t, t e x l, t e e n), giving match lengths 1, 4, 3, 2. There are another 3 sets of comparators, one per current window. Match reduction then picks the best length: 4.]

Typical C-code

Fixed loop bounds – compiler can unroll loop


One bestlength associated with each current_window
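The code shown on the original slides was not recovered, so the following is only a sketch of what "typical C code with fixed loop bounds" could look like for this step. The names `match_length` and `best_length` and the constants are assumptions.

```c
#define VEC 4        /* window length in the slides' example */
#define NUM_DICT 4   /* number of comparison windows per current window */

/* Number of leading bytes that agree. The loop bound is fixed and there
 * is no early exit, so a compiler is free to unroll it completely. */
int match_length(const char *current, const char *compare)
{
    int length = 0;
    int matching = 1;
    for (int i = 0; i < VEC; i++) {
        matching = matching && (current[i] == compare[i]);
        length += matching;   /* stops growing after the first mismatch */
    }
    return length;
}

/* Match reduction: the longest match over all comparison windows. */
int best_length(const char *current, const char *const compare[NUM_DICT])
{
    int best = 0;
    for (int d = 0; d < NUM_DICT; d++) {
        int len = match_length(current, compare[d]);
        if (len > best) {
            best = len;
        }
    }
    return best;
}
```

For the current window "text" and the comparison windows "tan_", "text", "texl" and "teen", the lengths are 1, 4, 3 and 2, and the best length is 4.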

Select the best combination of matches from the set of candidate matches:
1. Remove matches that are longer when encoded than original
2. Remove matches covered by previous step
3. From the remaining set, select the best ones
   • (heuristic for bin-packing) last-fit
4. Compute “first valid position” for next step

[Figure: worked example on the window "t e x t s a m p" with best lengths 3, 1, 3, 4 at positions 0 to 3. The match at position 1 is too short, the match at position 2 overlaps, and the matches at positions 0 and 3 are selected by last-fit. First valid position in the next cycle: 3.]


1. Remove matches that are longer when encoded than original

2. Remove matches covered by previous step

[Figure: examples of steps 1 and 2 on the window "s a m p" with a first valid position of 3. Best lengths shown: 3 1 3 4 and -1 -1 -1 2.]


3. From the remaining set, select the best ones: last-fit bin-packing

e.g. best lengths 3 0 3 4 become 3 -1 -1 4

4. Compute “first valid position” for next step

e.g. best lengths 3 -1 -1 4 at positions 0 1 2 3 give First_valid_pos = 3 3 3 7. Position 7 of "t e x t s a m p" is position 3 of the next cycle, so the next first valid position is 3.
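Step 4 can be sketched as follows, under the assumption (read off the example above, not stated explicitly in the slides) that the first valid position is the furthest byte reached by any selected match, expressed relative to the next cycle. The function name is hypothetical.

```c
#define VEC 4   /* bytes per cycle in the slides' example */

/* best_len[i] is the selected match length at position i, or -1 if none.
 * Returns the first position of the NEXT cycle that is not already
 * covered by a match selected in this cycle. */
int next_first_valid_pos(const int best_len[VEC])
{
    int reach = 0;
    for (int i = 0; i < VEC; i++) {
        if (best_len[i] > 0 && i + best_len[i] > reach) {
            reach = i + best_len[i];   /* one past the last covered byte */
        }
    }
    return reach > VEC ? reach - VEC : 0;
}
```

For the selected lengths 3 -1 -1 4, the match at position 3 reaches position 7, so the function returns 7 - 4 = 3.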

4. Writing to Output

Marker, length, offset
- Length is limited by VEC (= 16 in our case): fits in 4 bits
- Offset is limited by 0x40000 (doesn’t make sense to be more): fits in 21 bits

Use either 3 or 4 bytes for this:
- Offset < 2048: MARKER, LENGTH, OFFSET, OFFSET
- Offset = 2048 .. 262144: MARKER, LENGTH, OFFSET, OFFSET, OFFSET
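The size rule above can be written down directly. This sketch covers only the record size, not the bit layout, which the slides do not spell out; the function names are hypothetical.

```c
/* Bytes used by one (marker, length, offset) record, following the two
 * cases listed above. Returns 0 for offsets outside the stated range. */
int encoded_size(unsigned offset)
{
    if (offset < 2048u) {
        return 3;
    }
    if (offset <= 262144u) {
        return 4;
    }
    return 0;
}

/* Bytes saved by replacing `length` literal bytes with one record.
 * Assumes the offset is within range. */
int bytes_saved(int length, unsigned offset)
{
    return length - encoded_size(offset);
}
```

For the earlier example, `bytes_saved(8, 20)` is 5, matching the "Saved 5 bytes!" slide.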


4. Optimizations & results
- Area optimizations
- Compression ratio
- Results

Area Optimizations

By choosing the right (hardware) architecture, you are already most of the way there

The last ~5% (of area optimizations) requires some tinkering and advanced knowledge

Example: Match Search & Filtering

Generates a long vine of logic: a chain of conditional "Compute length" stages
- Causes longer latency in the pipeline
- Increases area

Balance the computation:
- Balanced tree has shallower pipeline depth and less area
- Get rid of the dependency on “length”

Modified Code

Instead of having a length variable (= 2, 3, 4), we have an array of bits (= 0011, 0111, 1111)
- OR operator is cheaper than adder
- OR operator creates a balanced tree (no condition)
- 4% smaller area
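The modified code itself was not recovered, so here is a minimal sketch of the idea: lengths become nested bit patterns, and the best length is found by OR-ing them. The function names and the fixed count of 4 candidates are assumptions.

```c
#define NUM_DICT 4   /* candidates per current window, as in the example */

/* Thermometer code: length 2 -> 0011, 3 -> 0111, 4 -> 1111.
 * Valid for length < 32. */
unsigned length_to_bits(unsigned length)
{
    return (1u << length) - 1u;
}

/* Because the codes are nested, OR-ing them yields the code of the
 * longest match: no comparisons, no adders, no condition. */
unsigned best_bits(const unsigned bits[NUM_DICT])
{
    return (bits[0] | bits[1]) | (bits[2] | bits[3]);   /* balanced tree */
}
```

For candidate lengths 1, 4, 3, 2 the bit arrays are 0001, 1111, 0111, 0011, and their OR is 1111, the code for length 4.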

Compression Ratio

Evaluate compression ratio on widely-used compression benchmarks:
- Calgary, Canterbury, Large and Silesia corpora
- Text, images, binary, databases: mix of everything

Geomean results over all benchmarks:
- Initial results: 78.3% or 1.28X

Want to improve results!
1. Bin-packing heuristic
2. Hash function

1. Bin-packing heuristic

We use the “last-fit” heuristic
- Reason: we have a loop-carried variable “first_valid_position”

1. Filter bestlength (length): remove matches that are longer when encoded than original
2. Filter bestlength (covered): remove matches covered by previous step
3. Filter bestlength (bin-pack): from the remaining set, select the best ones (heuristic for bin-packing)
4. Compute first_valid_pos: compute “first valid position” for next step

The dependency causes a stall in the kernel pipeline: cannot start a new iteration each cycle, II = 6

Optimization Report in 14.0










Last-fit bin-packing doesn’t affect “first_valid_position”, because we always use the last match (which determines first_valid_position)

Tighter computation for the loop-carried variable: start a new iteration each cycle, II = 1

Constraint: the match selection heuristic cannot change “first_valid_position”

But: Last-fit is very inefficient

[Figure: window "t e x t s a m p" with best lengths 4 3 2 0 at positions 0 to 3. Selections shown: 0 0 2 -1 versus 4 -1 -1 -1 (much better!).]

Add a step to eliminate matches that have the same reach but smaller value
- Doesn’t affect first_valid_position
- 8% better ratio
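One way to read the added step is sketched below, assuming "reach" means position plus length and that values of 0 or less mean "no match". The function name is hypothetical and the slides do not show the actual code.

```c
#define VEC 4   /* positions per cycle in the slides' example */

/* best_len[i]: match length at position i, values <= 0 mean "no match".
 * A match reaches position i + best_len[i]. Among matches with the same
 * reach, keep only the longest and mark the others -1. */
void drop_same_reach(int best_len[VEC])
{
    int keep[VEC];
    for (int i = 0; i < VEC; i++) {
        keep[i] = best_len[i] > 0;
        for (int j = 0; j < VEC; j++) {
            if (j != i && best_len[j] > best_len[i] &&
                j + best_len[j] == i + best_len[i]) {
                keep[i] = 0;   /* a longer match ends at the same byte */
            }
        }
    }
    for (int i = 0; i < VEC; i++) {
        if (!keep[i]) {
            best_len[i] = -1;
        }
    }
}
```

With best lengths 4 3 2 0, the matches at positions 0, 1 and 2 all reach position 4, so only the length-4 match survives and the result is 4 -1 -1 -1. The maximum reach is unchanged, which is why first_valid_position is not affected.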

2. Hash Function

Original:
- Hash[i] = curr_window[i]
- E.g. Hash[text] = ‘t’

XOR2:
- Hash[i] = curr_window[i] xor curr_window[i+1]
- E.g. Hash[text] = ‘t’ xor ‘e’
- Aliasing: ‘t’ xor ‘e’ = ‘e’ xor ‘t’
- Not utilizing depth efficiently (256 words, but BRAMs go up to 1024)
- 3.1% better ratio

XOR3:
- Hash[i] = (curr_window[i] << 2) xor (curr_window[i+1] << 1) xor curr_window[i+2]
- Match contains information about first 3 bytes + sense of their ordering
- More likely that our compare windows will have a match
- Hash (BRAM address) is 10 bits: utilizes BRAM depth = 1024
- 7.1% better ratio

Compared to Verilog, it is much easier to try and verify new algorithms. It is exactly like trying out new C code.

Emulator in 13.1
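The three hash variants can be compared side by side in plain C. The function names are hypothetical; the formulas follow the slide.

```c
#include <stdint.h>

/* Original: first byte only (8-bit hash). */
uint16_t hash_first(const uint8_t *w)
{
    return w[0];
}

/* XOR2: symmetric in the two bytes, so "te" and "et" collide. */
uint16_t hash_xor2(const uint8_t *w)
{
    return (uint16_t)(w[0] ^ w[1]);
}

/* XOR3: the shifts make the result order-sensitive and 10 bits wide. */
uint16_t hash_xor3(const uint8_t *w)
{
    return (uint16_t)(((unsigned)w[0] << 2) ^ ((unsigned)w[1] << 1) ^ w[2]);
}
```

For "text", XOR3 gives (116 << 2) xor (101 << 1) xor 120 = 354, while "etxt" gives 260, so swapping the first two bytes no longer aliases. The largest possible value is below 1024, which matches the 10-bit BRAM address.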

Compression Ratio

Evaluate compression ratio on widely-used compression benchmarks:
- Calgary, Canterbury, Large and Silesia corpora
- Text, images, binary, databases: mix of everything

Geomean results over all benchmarks:
- Initial results: 78.3% or 1.28X
- After optimizations: 60.2% or 1.67X
- With (simple) Huffman encoding (currently on the host; work in progress): 47.8% or 2.10X

Huffman portion of Gzip

16-way parallel variable-bit-width encoding/alignment

Huffman encoding

- Huffman symbols are defined at runtime
- Variable number of bits (≤ 16)
- Concatenate codes to form a contiguous output stream
- Separate offset computation from the actual assembly

3 compute phases:
- Compute code bit-offsets and start offset of next iteration
- Assembly of the codes in the current iteration
- Build fixed-length segments across multiple iterations

[Figure: the three phases: offset computation (∑ len_i), shifts (<<), STORE]

Compute offsets

Tight dependency on offset carried across iterations
- Careful about the order of the additions: the compiler does not consider dependencies when it redistributes associative operations
- Decision whether to write to memory is based on accumulating a full segment

[Figure: offset computation, with labels ∑ len_i, basepos, pos[0], pos[1], ..., pos[n]]

Bit-level shift

Each code shifts to an arbitrary bit-offset within the entire range

2 shift stages:
- 16-bit barrel shifters
- OR reduction tree for final assembly
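The first two compute phases can be sketched as a prefix sum followed by shift-and-OR. Several things here are assumptions rather than details from the slides: the function names, the LSB-first bit order, the use of a single 64-bit word as the output, and the reduced count of 3 codes (the slides process 16 per iteration). The third phase, building fixed-length segments across iterations, is not shown.

```c
#include <stdint.h>

#define NCODES 3   /* small example; the slides use 16 codes per iteration */

/* Phase 1: bit offset of each code = basepos + sum of the lengths before
 * it. Returns the start offset for the next iteration. */
unsigned compute_offsets(const unsigned len[NCODES], unsigned basepos,
                         unsigned pos[NCODES])
{
    unsigned offset = basepos;
    for (int i = 0; i < NCODES; i++) {
        pos[i] = offset;
        offset += len[i];
    }
    return offset;
}

/* Phase 2: shift every code to its offset and OR the results together.
 * Every pos[i] + code length must stay within 64 bits. */
uint64_t assemble(const uint16_t code[NCODES], const unsigned pos[NCODES])
{
    uint64_t word = 0;
    for (int i = 0; i < NCODES; i++) {
        word |= (uint64_t)code[i] << pos[i];
    }
    return word;
}
```

With code lengths 1, 2, 3 and basepos 0, the offsets are 0, 1, 3 and the next iteration starts at bit 6. Packing the codes 1, 10 and 111 (binary) at those offsets gives the word 111101 in binary, i.e. 61.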

Thank You
