LZ77 Compression Using Altera OpenCL
Mohamed Abdelfattah
LZ77 Compression in OpenCL
Goal: Demonstrate that a compression algorithm can be implemented efficiently, and with high performance (2 GB/s), using the OpenCL compiler
Outline:
1. OpenCL single-threaded flow
2. LZ77 overview
3. Implementation details
4. Optimizations & results
OpenCL Single-threaded Code
- Basically C code
- OpenCL compiler extracts parallelism automatically
- Pipeline parallelism

FPGA: one or more custom kernels. Kernels can communicate directly through "channels".

kernel void simple(global int *input, int size, global int *output)
{
    // i runs over 1..size; input must hold at least size + 2 elements
    for (int i = 1; i <= size; i++) {
        int x = input[i];      // Load x
        int y = input[i + 1];  // Load y
        int z = x + y;
        output[i] = z;         // Store z
    }
}
[Animation: loop iterations 1, 2, 3, ... enter the Load x / Load y, add, Store z pipeline one after another, so that successive iterations occupy successive stages.]
Can start a new loop iteration every cycle: initiation interval II = 1.
No loop-carried dependencies.
kernel void complex(global int *input, int size, global int *output)
{
    int loop_carried = 1;  // initial value is not shown on the slide; 1 is assumed
    for (int i = 1; i <= size; i++) {
        int x = input[i];      // Load x
        int y = input[i + 1];  // Load y
        int z;
        if (loop_carried / 2 == 1)
            z = x + y;
        else
            z = x - y;
        loop_carried *= z;     // loop-carried computation
        output[i] = z;         // Store z
    }
}

Loop-carried computation: iteration i + 1 needs data from iteration i.
[Animation: the Simple and Complex pipelines run side by side. In the Complex pipeline the loop-carried computation takes 2 cycles to compute, so every other cycle inserts a pipeline bubble and the load stages stall.]
Simple: II = 1. Complex: II = 2.
A new iteration of the loop starts every "II" cycles.
The Simple kernel has double the throughput of the Complex kernel.
Optimize the loop-carried computation.
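The effect of II on run time can be estimated with the usual pipeline formula. This is a minimal sketch, not taken from the slides: the function name `pipeline_cycles` and the 3-stage latency used in the usage note are assumptions for illustration.

```c
#include <assert.h>

/* Cycles needed to push `iterations` loop iterations through a pipeline
 * with the given latency (in cycles) when a new iteration starts every
 * `ii` cycles: the first result appears after `latency` cycles, and each
 * further iteration adds `ii` cycles. */
unsigned long pipeline_cycles(unsigned long iterations, unsigned latency, unsigned ii)
{
    assert(latency >= 1 && ii >= 1);
    if (iterations == 0)
        return 0;
    return latency + (unsigned long)ii * (iterations - 1);
}
```

With an assumed latency of 3 cycles, 1000 iterations take 1002 cycles at II = 1 and 2001 cycles at II = 2, which is the factor of two in throughput noted above.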
LZ77 Compression Example
1. Scan file byte by byte
2. Look for matches
   • Match length = 8
   • Match offset = 20 bytes
3. Replace with a reference to previous occurrence
   • Marker, length, offset

Before: This sentence is an easy sentence to compress.
After:  This sentence is an easy @(8,20) to compress.
Saved 5 bytes!
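To make the (length, offset) reference concrete, here is a small decoder-side sketch that expands the @(8,20) reference from the example. The helper names `lz77_literal` and `lz77_copy` are hypothetical and the buffer handling is simplified.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Append a literal string at position pos; returns the new position. */
size_t lz77_literal(char *buf, size_t pos, const char *literal)
{
    size_t n = strlen(literal);
    memcpy(buf + pos, literal, n);
    return pos + n;
}

/* Expand a (length, offset) reference: copy `length` bytes starting
 * `offset` bytes back in the already decoded output. The copy runs byte
 * by byte so that a reference may overlap the bytes it produces. */
size_t lz77_copy(char *buf, size_t pos, size_t length, size_t offset)
{
    assert(offset >= 1 && offset <= pos);
    for (size_t k = 0; k < length; k++)
        buf[pos + k] = buf[pos + k - offset];
    return pos + length;
}
```

Decoding the literal "This sentence is an easy ", then the reference (8, 20), then the literal " to compress." reproduces the original sentence.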
Overview
Single-threaded OpenCL flow
Single kernel: fully pipelined, II = 1
Throughput estimate = 16 bytes/cycle * 200 MHz = 3.2 * 10^9 bytes/s, or about 3051 MB/s (with 1 MB = 2^20 bytes)
1. Shift In New Data
2. Dictionary Lookup/Update
3. Match Search & Filtering
4. Write to output
Comparison against CPU/Verilog
• CPU: best implementation of Gzip on a CPU, by Intel Corporation, on an Intel Core i5 (32 nm) processor, 2013. Compression speed: 338 MB/s (about 0.3 GB/s). Compression ratio: 2.18X.
• ASIC: best implementation on ASICs, by AHA Products Group, coming up Q2 2014. Compression speed: 2.5 GB/s.
• FPGA (Verilog): best implementation on FPGAs, by IBM Corporation, ICCAD Nov. 2013, on an Altera Stratix-V A7. Compression speed: 3 GB/s.
• OpenCL design example: Altera Stratix-V A7, developed in 1 month. Compression speed: 2.7 GB/s.
Comparison against CPU
• Same compression ratio
• 12X better performance/Watt
Comparison against Verilog
• 10% slower
• 12% more resources
• Much lower design effort and design time
Implementation Overview
1. Shift In New Data
2. Dictionary Lookup/Update
3. Match Search & Filtering
4. Write to output
1. Shift In New Data
Input from DDR memory is shifted into the current window, VEC bytes per cycle (VEC = 4 in this example).
We use text in our example, but the data can be anything.
e.g. the window holds "old_text" and the next input is "sample_text":
Before the cycle boundary: current window = o l d _ t e x t, remaining input = sample_text
After the cycle boundary:  current window = t e x t s a m p, remaining input = le_text
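The shift step can be sketched in plain C as follows. The window size of 2 * VEC bytes is read off the example above (an 8-byte window with VEC = 4) and is an assumption; the function name `shift_in` is hypothetical.

```c
#include <assert.h>
#include <string.h>

#define VEC 4               /* bytes consumed per cycle in the example */
#define WINDOW (2 * VEC)    /* assumed window size, as in the example */

/* Drop the oldest VEC bytes of the window and append VEC new bytes. */
void shift_in(char window[WINDOW], const char incoming[VEC])
{
    assert(window != 0 && incoming != 0);
    memmove(window, window + VEC, WINDOW - VEC);
    memcpy(window + WINDOW - VEC, incoming, VEC);
}
```

Starting from "old_text" and shifting in "samp" leaves "textsamp" in the window, matching the example.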
2. Dictionary Lookup/Update
Current window: t e x t s a m p
1. Compute hash
2. Look for match in 4 dictionaries
3. Update dictionaries

Dictionaries buffer the text that we have already processed.
Each of the 4 substrings of the current window is hashed and looked up in all 4 dictionaries (Dictionary0 to Dictionary3), which returns 4 possible matches from history per substring, e.g.:

Substring   Possible matches from history (dictionaries)
t e x t     t a n _ , t e x t , t e x l , t e e n
e x t s     e a t e , e a r s , e e p s , e n t e
x t s a     x a n t , x y l o , x e l y , x i r t
t s a m     t e e n , t e a l , t a n _ , t a m e

Each dictionary d has one write port (Wd) and four read ports (RDd0 to RDd3).
Generate exactly the number of read/write ports that we need.
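A software model of the lookup/update step might look like the sketch below. It uses the original hash from the slides (the first byte of the substring); the type `dictionary_t`, the depth of 256 entries and the rule that substring i is written to dictionary i are assumptions made for illustration.

```c
#include <assert.h>
#include <string.h>

#define VEC 4      /* substrings per cycle, and number of dictionaries */
#define LEN 4      /* bytes stored per dictionary entry */
#define DEPTH 256  /* one entry per value of a one-byte hash (assumed) */

typedef struct { char entry[DEPTH][LEN]; } dictionary_t;

static unsigned hash_first_byte(const char *s) { return (unsigned char)s[0]; }

/* window must hold at least VEC + LEN - 1 bytes.
 * candidates[i][d] receives what dictionary d stores for substring i. */
void lookup_and_update(dictionary_t dict[VEC], const char *window,
                       char candidates[VEC][VEC][LEN])
{
    assert(dict != 0 && window != 0);
    /* Lookup: every substring reads from every dictionary (4 read ports each). */
    for (int i = 0; i < VEC; i++) {
        unsigned h = hash_first_byte(window + i);
        for (int d = 0; d < VEC; d++)
            memcpy(candidates[i][d], dict[d].entry[h], LEN);
    }
    /* Update: substring i is written to dictionary i (1 write port each). */
    for (int i = 0; i < VEC; i++)
        memcpy(dict[i].entry[hash_first_byte(window + i)], window + i, LEN);
}
```

In hardware all of these reads and writes happen in the same cycle, which is why each dictionary needs exactly four read ports and one write port.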
3. Match Search & Filtering
Current windows (the substrings):
t e x t
e x t s
x t s a
t s a m

Comparison windows (a set of candidate matches for each incoming substring):
t e x t:  t a n _ , t e x t , t e x l , t e e n
e x t s:  e a t e , e a r s , e e p s , e n t e
x t s a:  x a n t , x y l o , x e l y , x i r t
t s a m:  t e e n , t e a l , t a n _ , t a m e

Compare each current window against each of its 4 comparison windows.

Comparators: compare each byte. For the current window "t e x t", the match lengths against t a n _ , t e x t , t e x l , t e e n are 1, 4, 3, 2. We have another 3 sets of these comparators, one for each remaining substring.
Match reduction: select the best length among the 4 candidates (here, best length = 4).

This is typical C code. The loop bounds are fixed, so the compiler can unroll the loops.
One bestlength is associated with each current_window.
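The comparator and match-reduction stages can be written as ordinary C loops with fixed bounds. The sketch below is not the code from the slides (which is not preserved in this transcript); the names `match_length` and `best_length` are hypothetical.

```c
#include <assert.h>

#define LEN 4             /* bytes compared per window */
#define NUM_CANDIDATES 4  /* comparison windows per substring */

/* Comparators: number of leading bytes on which cur and cand agree. */
int match_length(const char *cur, const char *cand)
{
    int length = 0;
    for (int k = 0; k < LEN; k++) {   /* fixed bound: can be fully unrolled */
        if (cur[k] != cand[k])
            break;
        length++;
    }
    return length;
}

/* Match reduction: best length over all comparison windows. */
int best_length(const char *cur, const char cands[NUM_CANDIDATES][LEN])
{
    assert(cur != 0);
    int best = 0;
    for (int c = 0; c < NUM_CANDIDATES; c++) {
        int len = match_length(cur, cands[c]);
        if (len > best)
            best = len;
    }
    return best;
}
```

For the current window "text" and the candidates tan_, text, texl, teen this gives lengths 1, 4, 3, 2 and a best length of 4, as in the example.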
Select the best combination of matches from the set of candidate matches:
1. Remove matches that are longer when encoded than the original
2. Remove matches covered by the previous step
3. From the remaining set, select the best ones
   • last-fit (heuristic for bin-packing)
4. Compute "first valid position" for the next step

e.g. current window t e x t s a m p, with best lengths 3 1 3 4 at positions 0 1 2 3:
Step 1: the length-1 match at position 1 is too short, leaving 3 0 3 4.
Step 2: matches that start before the first valid position are already covered by a match from the previous cycle. For example, with first valid position = 3, best lengths 3 4 4 2 become -1 -1 -1 2.
Step 3: last-fit keeps the last match (position 3, length 4) and removes the match at position 2, which overlaps it. The match at position 0 fits in front of it, leaving 3 -1 -1 4.
Step 4: First_valid_pos = 3 3 3 7 (the furthest position reached so far, scanning left to right). The last match reaches position 7, which is position 3 of the next cycle's window, so the first valid position in the next cycle is 3.
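The four filtering steps can be modelled in C roughly as below. This is a sketch under stated assumptions: the minimum useful length of 3 (the size of the short encoding), the use of -1 for removed matches and the function name `filter_matches` are chosen to agree with the worked example, not copied from the original kernel.

```c
#include <assert.h>

#define VEC 4
#define MIN_LENGTH 3   /* assumed: shorter matches do not pay for a 3-byte code */
#define NO_MATCH (-1)

/* Filters best[] in place and returns the first valid position for the
 * next cycle. first_valid_pos is the value carried in from the previous cycle. */
int filter_matches(int best[VEC], int first_valid_pos)
{
    assert(first_valid_pos >= 0 && first_valid_pos < VEC);

    /* Steps 1 and 2: drop matches that are too short or already covered. */
    for (int i = 0; i < VEC; i++) {
        assert(best[i] <= VEC);
        if (best[i] < MIN_LENGTH || i < first_valid_pos)
            best[i] = NO_MATCH;
    }

    /* Step 3: last-fit. Scan from the last position; keep a match only if
     * it ends at or before the start of the match kept most recently. */
    int limit = 2 * VEC;          /* nothing kept yet, so the last match always fits */
    int reach = first_valid_pos;  /* furthest position covered so far */
    for (int i = VEC - 1; i >= 0; i--) {
        if (best[i] == NO_MATCH)
            continue;
        if (i + best[i] <= limit) {
            limit = i;
            if (i + best[i] > reach)
                reach = i + best[i];
        } else {
            best[i] = NO_MATCH;
        }
    }

    /* Step 4: express the reach relative to the next cycle's window. */
    return reach > VEC ? reach - VEC : 0;
}
```

Calling it with best lengths 3 1 3 4 and a first valid position of 0 leaves 3 -1 -1 4 and returns 3, the values shown in the example.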
4. Writing to Output
Marker, length, offset
- Length is limited by VEC (= 16 in our case): fits in 4 bits
- Offset is limited by 0x40000 (doesn't make sense to be more): fits in 21 bits

Use either 3 or 4 bytes for this:
- Offset < 2048: 3 bytes
  MARKER | LENGTH OFFSET | OFFSET
- Offset = 2048 .. 262144: 4 bytes
  MARKER | LENGTH OFFSET | OFFSET | OFFSET
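One way to pack such a reference is sketched below. The slides give the field order and the 3-byte/4-byte split but not the exact bit layout, so the layout here (length stored as length - 1 in the upper nibble, one flag bit for the 4-byte form, the marker value 0xFF) is entirely an assumption, as is the function name `encode_match`.

```c
#include <assert.h>
#include <stdint.h>

#define MARKER 0xFFu  /* hypothetical marker byte */

/* Writes the encoded match to out and returns the number of bytes used.
 * Assumed layout of byte 1: bits 7..4 = length - 1, bit 3 = long-offset
 * flag, bits 2..0 = top bits of the offset. */
int encode_match(unsigned length, uint32_t offset, uint8_t out[4])
{
    assert(length >= 1 && length <= 16);
    assert(offset >= 1 && offset <= 0x40000u);

    out[0] = (uint8_t)MARKER;
    if (offset < 2048u) {                       /* 11-bit offset, 3 bytes */
        out[1] = (uint8_t)(((length - 1u) << 4) | (offset >> 8));
        out[2] = (uint8_t)(offset & 0xFFu);
        return 3;
    }
    /* up to 19-bit offset, 4 bytes */
    out[1] = (uint8_t)(((length - 1u) << 4) | 0x08u | (offset >> 16));
    out[2] = (uint8_t)((offset >> 8) & 0xFFu);
    out[3] = (uint8_t)(offset & 0xFFu);
    return 4;
}
```

Under this layout the match (8, 20) from the earlier example takes 3 bytes, which is where the saving of 5 bytes on an 8-byte match comes from.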
Outline: 4. Optimizations & results
- Area optimizations
- Compression ratio
- Results
Area Optimizations
By choosing the right (hardware) architecture, you are already most of the way there
The last ~5% (of area optimizations) requires some tinkering and advanced knowledge
Example: Match Search & Filtering
The original code generates a long vine of logic: a chain of "compute length" stages, each guarded by a condition on the previous one.
This causes longer latency in the pipeline, which increases area.
Balance the computation: a balanced tree has a shallower pipeline depth and therefore less area.
Get rid of the dependency on "length".
Modified Code
Instead of having a length variable (= 2, 3, 4), we have an array of bits (= 0011, 0111, 1111).
- OR operator is cheaper than an adder
- OR operator creates a balanced tree (no condition)
- 4% smaller area
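The bit-array idea can be illustrated as follows. This is one interpretation of the slide rather than the original code: each match is kept as a "thermometer" bit pattern (length 3 becomes 0111), so that taking the best of several matches is a bitwise OR instead of a compare or add. All function names are hypothetical.

```c
#include <assert.h>

#define LEN 4
#define NUM_CANDIDATES 4

/* Bit k is set when bytes 0..k of cur and cand all agree,
 * so a match of length 2 is 0011 and a match of length 4 is 1111. */
unsigned match_bits(const char *cur, const char *cand)
{
    assert(cur != 0 && cand != 0);
    unsigned bits = 0, equal_so_far = 1;
    for (int k = 0; k < LEN; k++) {
        equal_so_far &= (unsigned)(cur[k] == cand[k]);
        bits |= equal_so_far << k;
    }
    return bits;
}

/* Because the patterns are nested, OR-ing them yields the longest match. */
unsigned best_bits(const char *cur, const char cands[NUM_CANDIDATES][LEN])
{
    unsigned best = 0;
    for (int c = 0; c < NUM_CANDIDATES; c++)
        best |= match_bits(cur, cands[c]);
    return best;
}

/* Convert a thermometer pattern back to a length when it is needed. */
int bits_to_length(unsigned bits)
{
    int n = 0;
    while (bits & 1u) {
        n++;
        bits >>= 1;
    }
    return n;
}
```

The OR reduction has no data-dependent condition, so the compiler is free to arrange it as a balanced tree.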
Compression Ratio
Evaluate compression ratio on widely used compression benchmarks:
- Calgary, Canterbury, Large and Silesia corpora
- Text, images, binary, databases: a mix of everything
- Geomean results over all benchmarks
- Initial results: 78.3% or 1.28X

Want to improve results!
1. Bin-packing heuristic
2. Hash function
1. Bin-packing heuristic
We use the "last-fit" heuristic.
- Reason: we have a loop-carried variable, "first_valid_position"

Original order of the filtering steps:
1. Filter bestlength (length)
2. Filter bestlength (covered)
3. Filter bestlength (bin-pack)
4. Compute first_valid_pos
The dependency causes a stall in the kernel pipeline. We cannot start a new iteration each cycle: II = 6 (see the optimization report in 14.0).

Last-fit bin-packing doesn't affect "first_valid_position", because we always use the last match (which determines first_valid_position). The steps can therefore be reordered:
1. Filter bestlength (length)
2. Filter bestlength (covered)
3. Compute first_valid_pos
4. Filter bestlength (bin-pack)
Tighter computation for the loop-carried variable: start a new iteration each cycle, II = 1.

Constraint: the match selection heuristic cannot change "first_valid_position".
But: last-fit is very inefficient.
e.g. current window t e x t s a m p, best lengths 4 3 2 0 at positions 0 1 2 3:
- Last-fit selects 0 0 2 -1 (only the 2-byte match at position 2)
- Selecting 4 -1 -1 -1 is much better, and doesn't affect first_valid_position
Add a step to eliminate matches that have the same reach but a smaller value.
8% better ratio
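The extra elimination step could be modelled as below. This is a sketch under assumptions: "reach" is taken to mean position + length, removed matches are marked with -1, and the function name `drop_same_reach` is hypothetical.

```c
#include <assert.h>

#define VEC 4
#define NO_MATCH (-1)

/* Among matches that end at the same position (same reach), keep only the
 * earliest one. It is necessarily the longest, and dropping the others
 * leaves the furthest reach, and so first_valid_position, unchanged. */
void drop_same_reach(int best[VEC])
{
    assert(best != 0);
    for (int i = 0; i < VEC; i++)
        if (best[i] <= 0)
            best[i] = NO_MATCH;      /* a length of 0 means no match */

    for (int i = 0; i < VEC; i++) {
        if (best[i] == NO_MATCH)
            continue;
        for (int j = i + 1; j < VEC; j++)
            if (best[j] != NO_MATCH && j + best[j] == i + best[i])
                best[j] = NO_MATCH;  /* same reach, but shorter */
    }
}
```

For best lengths 4 3 2 0 this leaves 4 -1 -1 -1, so the subsequent last-fit step picks the 4-byte match instead of the 2-byte one.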
2. Hash Function
Original:
- Hash[i] = curr_window[i]
- E.g. Hash[text] = 't'

XOR2 (3.1% better ratio):
- Hash[i] = curr_window[i] xor curr_window[i+1]
- E.g. Hash[text] = 't' xor 'e'
- Aliasing: 't' xor 'e' = 'e' xor 't'
- Not utilizing depth efficiently (256 words, but BRAMs go up to 1024)

XOR3 (7.1% better ratio):
- Hash[i] = (curr_window[i] << 2) xor (curr_window[i+1] << 1) xor curr_window[i+2]
- Hash contains information about the first 3 bytes plus a sense of their ordering
- More likely that our compare windows will have a match
- Hash (BRAM address) is 10 bits, which utilizes the full BRAM depth of 1024

Compared to Verilog, it is much easier to try and verify new algorithms. It is exactly like trying out new C code (emulator in 13.1).
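The three hash functions translate directly into C. The function names are hypothetical; the formulas are the ones listed above.

```c
#include <assert.h>

/* Original: the first byte only (8-bit hash). */
unsigned hash_original(const unsigned char *w) { return w[0]; }

/* XOR2: symmetric in its two bytes, so "te" and "et" alias (8-bit hash). */
unsigned hash_xor2(const unsigned char *w) { return (unsigned)(w[0] ^ w[1]); }

/* XOR3: the shifts make the hash order-sensitive and widen it to 10 bits. */
unsigned hash_xor3(const unsigned char *w)
{
    unsigned h = ((unsigned)w[0] << 2) ^ ((unsigned)w[1] << 1) ^ (unsigned)w[2];
    assert(h < 1024u);  /* always fits a 1024-entry BRAM */
    return h;
}
```

Swapping two bytes leaves `hash_xor2` unchanged but changes `hash_xor3`, which is the aliasing improvement described above.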
Compression Ratio
Geomean results over all benchmarks:
- Initial results: 78.3% or 1.28X
- After optimizations: 60.2% or 1.67X
- With (simple) Huffman encoding (currently on the host, work in progress): 47.8% or 2.10X

Huffman portion of Gzip
16-way parallel variable-bit-width encoding/alignment
Huffman encoding
- Huffman symbols are defined at runtime
- Variable number of bits (≤16)
- Concatenate codes to form a contiguous output stream
- Separate offset computation from the actual assembly

3 compute phases:
- Compute code bit-offsets and the start offset of the next iteration
- Assembly of the codes in the current iteration
- Build fixed-length segments across multiple iterations

[Figure: the sum of the code lengths len_i drives a row of shifters (<<), whose outputs are combined and passed to STORE.]
Compute offsets
Tight dependency on the offset carried across iterations:
- Be careful about the order of the additions: the compiler does not consider dependencies when it redistributes associative operations
- The decision whether to write to memory is based on accumulating a full segment

[Figure: pos[0] = basepos, and each following pos[1] .. pos[n] adds the lengths of the preceding codes; the sum of all len_i gives the base position for the next iteration.]
Bit-level shift
Each code shifts to an arbitrary bit-offset within the entire range.
2 shift stages:
- 16-bit barrel shifters
- OR reduction tree for final assembly
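The offset computation and the shift-and-OR assembly can be sketched together as below. This is a simplified model, not the original design: it packs 4 codes instead of 16 into a single 64-bit word, places the first code in the least significant bits, and omits the fixed-length segment buffering. All names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define NCODES 4  /* the slides encode 16 codes per iteration */

/* Phase 1: pos[k] = basepos + len[0] + ... + len[k-1].
 * Returns the base position for the next iteration (the loop-carried value). */
unsigned compute_offsets(const unsigned len[NCODES], unsigned basepos,
                         unsigned pos[NCODES])
{
    unsigned p = basepos;
    for (int k = 0; k < NCODES; k++) {
        assert(len[k] <= 16);
        pos[k] = p;
        p += len[k];
    }
    return p;
}

/* Phase 2: shift every code to its bit offset and OR the results together. */
uint64_t assemble(const uint16_t code[NCODES], const unsigned len[NCODES],
                  const unsigned pos[NCODES])
{
    uint64_t word = 0;
    for (int k = 0; k < NCODES; k++) {
        assert(pos[k] + len[k] <= 64);
        uint64_t bits = code[k] & ((1u << len[k]) - 1u);  /* keep len[k] bits */
        word |= bits << pos[k];
    }
    return word;
}
```

Only `compute_offsets` carries state from one iteration to the next, which is why the slides separate it from the assembly.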
Thank You