Lifting Scheme Cores for Wavelet Transform
David Barina (supervised by Pavel Zemcik)
DWT in image processing
The DWT can be found in many image-processing tasks:
- analysis (edge detection, feature extraction, multiscale representation),
- compression (JPEG 2000, Dirac),
- watermarking, edge sharpening, contrast enhancement, tone mapping, denoising, fusion, etc.
Filter bank
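S. Mallat, "A theory for multiresolution signal decomposition: The wavelet representation" (1989)
[Figure: two-channel filter bank. Analysis: the input is filtered by H̃(z^-1) and G̃(z^-1), each followed by downsampling (↓2), producing a and d. Synthesis: a and d are upsampled (↑2), filtered by H(z) and G(z), and summed.]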
decomposition: two complementary filters, high number of operations
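The decomposition side of the filter bank can be sketched in a few lines. This is only an illustration: the CDF 5/3 filter taps, the mirrored borders, and the names `LOW`, `HIGH`, `mirror`, and `analyze` are assumptions that do not come from the slides.

```python
# Sketch: one level of a two-channel analysis filter bank.
# The CDF 5/3 filters and the mirrored borders are assumptions for illustration.
LOW = {-2: -0.125, -1: 0.25, 0: 0.75, 1: 0.25, 2: -0.125}   # low-pass taps
HIGH = {-1: -0.5, 0: 1.0, 1: -0.5}                           # high-pass taps


def mirror(i, n):
    """Reflect index i into 0..n-1 (whole-sample symmetric extension)."""
    if i < 0:
        i = -i
    if i >= n:
        i = 2 * (n - 1) - i
    return i


def analyze(x):
    """Filter x with both filters and keep every second output (downsampling by 2)."""
    n = len(x)
    a = [sum(c * x[mirror(2 * k + j, n)] for j, c in LOW.items())
         for k in range((n + 1) // 2)]
    d = [sum(c * x[mirror(2 * k + 1 + j, n)] for j, c in HIGH.items())
         for k in range(n // 2)]
    return a, d


a, d = analyze([0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0])
print(a)  # approximation (low-pass) coefficients
print(d)  # detail (high-pass) coefficients
```

In this direct form, each pair of output coefficients costs 5 + 3 = 8 multiplications, which illustrates the "high number of operations" remark.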
Lifting scheme
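I. Daubechies, W. Sweldens, "Factoring wavelet transforms into lifting steps" (1998)
[Figure: lifting scheme. The input is split, the analysis applies \tilde{P}(z^{-1})^T to produce a and d, and the synthesis applies P(z) followed by a merge.]
\[ P(z) = \prod_{i=0}^{I-1} \left\{ \begin{bmatrix} 1 & S_i(z) \\ 0 & 1 \end{bmatrix} \begin{bmatrix} 1 & 0 \\ T_i(z) & 1 \end{bmatrix} \right\} \begin{bmatrix} K & 0 \\ 0 & 1/K \end{bmatrix} \]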
decomposition: sequence of simple filtering steps, reduces the number of operations; split: even, odd
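A minimal sketch of the split, lifting steps, and merge follows. It assumes the CDF 5/3 wavelet (one predict and one update step, no scaling), an even-length signal, and mirrored borders; the function names `forward` and `inverse` are hypothetical and not taken from the cited paper.

```python
def forward(x):
    """One lifting level for an even-length signal (CDF 5/3 assumed as the example)."""
    a, d = x[0::2], x[1::2]                      # split: even, odd
    n = len(d)
    # predict: subtract the mean of the two neighbouring even samples
    d = [d[k] - 0.5 * (a[k] + a[min(k + 1, n - 1)]) for k in range(n)]
    # update: add a quarter of the two neighbouring detail coefficients
    a = [a[k] + 0.25 * (d[max(k - 1, 0)] + d[k]) for k in range(n)]
    return a, d


def inverse(a, d):
    """Undo the steps in reverse order with opposite signs, then merge."""
    n = len(d)
    a = [a[k] - 0.25 * (d[max(k - 1, 0)] + d[k]) for k in range(n)]
    d = [d[k] + 0.5 * (a[k] + a[min(k + 1, n - 1)]) for k in range(n)]
    x = [0.0] * (2 * n)
    x[0::2], x[1::2] = a, d                      # merge: interleave even and odd
    return x


x = [3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0]
a, d = forward(x)
print(a, d)
print(inverse(a, d) == x)
```

Each pair of outputs costs two multiplications here, and every step is undone simply by reversing its sign, which is why the inverse needs no separate filter design.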
CDF 9/7 wavelet
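I. Daubechies, W. Sweldens, "Factoring wavelet transforms into lifting steps" (1998)
[Figure: CDF 9/7 lifting diagram with input and output rows of even and odd samples connected by four lifting steps α, β, γ, δ.]
\[ \tilde{P}(z) = \begin{bmatrix} 1 & \alpha\,(1 + z^{-1}) \\ 0 & 1 \end{bmatrix} \begin{bmatrix} 1 & 0 \\ \beta\,(1 + z) & 1 \end{bmatrix} \begin{bmatrix} 1 & \gamma\,(1 + z^{-1}) \\ 0 & 1 \end{bmatrix} \begin{bmatrix} 1 & 0 \\ \delta\,(1 + z) & 1 \end{bmatrix} \begin{bmatrix} \zeta & 0 \\ 0 & 1/\zeta \end{bmatrix} \]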
four two-tap symmetric filters
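The four steps can be sketched as follows. The numeric constants are the commonly quoted values for this factorization and are not given on the slide; the even-length input, the mirrored borders, the choice of which channel is multiplied by ζ, and all function names are assumptions made for illustration.

```python
# Commonly quoted CDF 9/7 lifting constants (assumed, not taken from the slide).
ALPHA = -1.586134342
BETA = -0.05298011854
GAMMA = 0.8829110762
DELTA = 0.4435068522
ZETA = 1.149604398


def lift_odd(a, d, c):
    """d[k] += c * (a[k] + a[k+1]), mirroring at the right border."""
    n = len(d)
    return [d[k] + c * (a[k] + a[min(k + 1, n - 1)]) for k in range(n)]


def lift_even(a, d, c):
    """a[k] += c * (d[k-1] + d[k]), mirroring at the left border."""
    return [a[k] + c * (d[max(k - 1, 0)] + d[k]) for k in range(len(a))]


def cdf97_forward(x):
    a, d = list(x[0::2]), list(x[1::2])          # split: even, odd
    d = lift_odd(a, d, ALPHA)
    a = lift_even(a, d, BETA)
    d = lift_odd(a, d, GAMMA)
    a = lift_even(a, d, DELTA)
    return [ZETA * v for v in a], [v / ZETA for v in d]   # scaling


def cdf97_inverse(a, d):
    a = [v / ZETA for v in a]
    d = [v * ZETA for v in d]
    a = lift_even(a, d, -DELTA)                  # same steps, reversed and negated
    d = lift_odd(a, d, -GAMMA)
    a = lift_even(a, d, -BETA)
    d = lift_odd(a, d, -ALPHA)
    x = [0.0] * (len(a) + len(d))
    x[0::2], x[1::2] = a, d                      # merge
    return x


x = [3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0]
a, d = cdf97_forward(x)
y = cdf97_inverse(a, d)
print(max(abs(u - v) for u, v in zip(x, y)))     # reconstruction error
```

Each of the four steps is a two-tap symmetric filter: one multiplication applied to the sum of two neighbours.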
2-D decomposition
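S. Mallat, "A theory for multiresolution signal decomposition: The wavelet representation" (1989)
[Figure: 2-D decomposition. A horizontal pass followed by a vertical pass yields the subbands a, h, v, d; a second panel shows the multi-scale layout, in which a is decomposed again.]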
image: 2-D signal, transformed by a series of 1-D transforms; four subbands; multi-scale decomposition
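The row-column idea can be sketched with any 1-D transform. The Haar step below is an assumption chosen only to keep the example short, and the names `haar`, `rows_pass`, `dwt2`, and `multiscale` are hypothetical.

```python
def haar(x):
    """1-D Haar step on an even-length sequence: (averages, differences)."""
    pairs = list(zip(x[0::2], x[1::2]))
    return [(e + o) / 2 for e, o in pairs], [o - e for e, o in pairs]


def transpose(m):
    return [list(col) for col in zip(*m)]


def rows_pass(m):
    """Transform every row; return the low-pass and high-pass halves."""
    out = [haar(row) for row in m]
    return [lo for lo, _ in out], [hi for _, hi in out]


def dwt2(image):
    """One decomposition level: horizontal pass, then vertical pass."""
    low, high = rows_pass(image)                                  # horizontal
    a, v = (transpose(b) for b in rows_pass(transpose(low)))     # vertical
    h, d = (transpose(b) for b in rows_pass(transpose(high)))    # vertical
    return a, h, v, d


def multiscale(image, levels):
    """Repeat the decomposition on the approximation subband a."""
    bands, a = [], image
    for _ in range(levels):
        a, h, v, d = dwt2(a)
        bands.append((h, v, d))
    return a, bands


image = [[0, 8, 8, 8] for _ in range(4)]
a, h, v, d = dwt2(image)
print(a, h, v, d)
```

Note that this separable form reads every sample twice (once per pass), which is the repeated memory access the later slides aim to avoid.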
Lenna
how to calculate this as efficiently as possible?
Strategies and issues
R. Kutil, "A single-loop approach to SIMD parallelization of 2-D wavelet lifting" (2006)
[Figure: horizontal and vertical passes producing the subbands a, h, v, d.]
strategies: row-column, block-based, and line-based
cache issues: cache line, limited size, set associativity, prefetching
techniques: padding, aggregation, memory layouts, interleaved loops, parallelization
the approaches have to repeatedly visit samples, memory access is expensive ⇒ CPU cache, limitations, existing techniques, single-loop approach
Unsolved issues
[Figure: prolog, core, and epilog phases of the processing with a 2 × 2 core.]
- complicated border treatment (prolog/epilog phases)
- suspend/resume processing
- arbitrary processing order (scan order)
- interleave the transform and a subsequent processing
- multi-scale decomposition
- reorganization of underlying scheme
Objectives of the thesis
Aims: improve image transform performance and resource consumption
Objectives: eliminate the shortcomings of existing methods (previous slide)
Evaluation: prove experimentally (performance, memory requirements)
Lifting core
D. Barina, P. Zemcik, "Vectorization and parallelization of 2-D wavelet lifting" (in press)
solution: a processing unit
- which continuously consumes an input and produces an output
- which visits every image sample only once (cache friendly)
- which is aware of image coordinates (can handle the borders)
- whose configuration (state) can be saved/restored
- which can be run in any direction
- which can be SIMD vectorized
- which can run in parallel (on independent parts of the image)
\[ y = C\,x, \qquad x \overset{\mathrm{def}}{=} I_n \parallel B, \qquad y \overset{\mathrm{def}}{=} O_n \parallel B \]
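A much simplified 1-D sketch of such a unit is given below. It is not the authors' 2-D core: it assumes the CDF 5/3 wavelet, an even-length signal, and mirrored borders, and the class `Core53` with its methods is hypothetical. Each call consumes one (even, odd) input pair together with a small saved state and returns one output pair together with the updated state.

```python
import copy


class Core53:
    """1-D streaming core for CDF 5/3 lifting (illustrative, hypothetical class)."""

    def __init__(self):
        self.even = None     # last even sample seen
        self.odd = None      # last odd sample seen
        self.detail = None   # last detail coefficient produced

    def step(self, even, odd):
        """Consume one (even, odd) pair; return the (a, d) pair it completes, if any."""
        out = self._emit(even) if self.even is not None else None
        self.even, self.odd = even, odd
        return out

    def finish(self):
        """Right border: the missing even sample is mirrored."""
        return self._emit(self.even)

    def _emit(self, next_even):
        d = self.odd - 0.5 * (self.even + next_even)
        prev = d if self.detail is None else self.detail   # left border: mirrored
        a = self.even + 0.25 * (prev + d)
        self.detail = d
        return a, d


def transform(x):
    """Single pass over x: every sample is read exactly once."""
    core = Core53()
    out = [core.step(e, o) for e, o in zip(x[0::2], x[1::2])]
    return [p for p in out if p is not None] + [core.finish()]


x = [3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0]
core = Core53()
first = [core.step(3.0, 1.0), core.step(4.0, 1.0)]
saved = copy.copy(core)                   # suspend: the state is three numbers
rest = [saved.step(5.0, 9.0), saved.step(2.0, 6.0), saved.finish()]
print(first + rest)
```

The output lags the input by one pair, and copying the three state fields is enough to suspend the processing and resume it later.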
Core examples
D. Barina, P. Zemcik, "Vectorization and parallelization of 2-D wavelet lifting" (in press)
[Figure: examples of cores over the lifting steps α, β, γ, δ.]
core inputs, outputs
Processing orders
D. Barina, P. Zemcik, "Vectorization and parallelization of 2-D wavelet lifting" (in press)
[Figure: six processing orders: horizontal, horizontal strips, horizontal blocks, vertical, vertical strips, vertical blocks.]
Borders treatment
D. Barina, P. Zemcik, "Vectorization and parallelization of 2-D wavelet lifting" (in press)
[Figure: border treatment on a signal of alternating a and d samples, with the core position n running from 0 to N.]
\( y = C_n\,x \)
the cores treat the boundaries gracefully
Parallel cores and reorganization
M. Kula, D. Barina, et al., "Block-based Approach to 2-D Wavelet Transform on GPUs" (2016)
[Figure: schemes with numbered steps: Sweldens1995 (1 2 3 4), Iwahashi2007 (1 2 3), proposed (1 2).]
3-D core
D. Barina, P. Zemcik, "Real-Time 3-D Wavelet Lifting" (2015)
[Figure: 3-D core along the axes x, y, z with buffer x, buffer y, and buffer z on its sides.]
extended into more dimensions, buffers on the sides
CPU implementation
D. Barina, P. Zemcik, "Vectorization and parallelization of 2-D wavelet lifting" (in press)
[Figure: time per pixel (0 to 50 ns) versus number of pixels (1 k to 100 M) for the separable approach and the core approach.]
an evaluation of approaches: the separable, single-loop, and core approaches were implemented
3-D CPU implementation
D. Barina, P. Zemcik, "Real-Time 3-D Wavelet Lifting" (2015)
[Figure: 3-D core along the axes x, y, z with buffer x, buffer y, and buffer z.]
[Figure: time per voxel (0 to 160 ns) versus number of voxels (0 to 250 M) for naive horizontal, naive vertical, core 4^2, core 2^3, and core 4^3.]
performance of 3-D transform: separable, 2-D core, 3-D core
GPU implementation
M. Kula, D. Barina, et al., "Block-based Approach to 2-D Wavelet Transform on GPUs" (2016)
[Figure: throughput in GB/s (80 to 260) versus number of pixels (0 to 70 M) for Kucis2014, Separable Block, and Non-Separable Block.]
[Figure: throughput in GB/s (0 to 60) versus resolution (100 kpel to 100 Mpel) for the Sweldens, Iwahashi*, Explosive*, Monolithic*, and Polyphase* schemes.]
[Figure: the Monolithic* scheme.]
left: SotA is in red, block methods in blue/green, reorganization
right: block methods, separable in black, ours in blue/green
FPGA implementation
D. Barina, et al., "Single-Loop Approach to 2-D Wavelet Lifting with JPEG 2000 Compatibility" (2015)
[Figure: block diagram with Input, Transform (H and V stages), and BRAM.]

core       FF             LUT            BRAM
latency 4  441 (0.1 %)    399 (0.18 %)   6 (1.1 %)
latency 2  391 (< 0.1 %)  592 (0.27 %)   6 (1.1 %)

architecture   device                  BRAM [bits]  clocks/pel  time [ms]
Dillen2003     VirtexE1000-8           50K          0.50        1.20
Descampe2004   Virtex-II XC2V6000      N/A          0.60        1.75
Seo2007        Altera Stratix          128K         2.64        6.02
Zhang2012      Virtex-II Pro XC2VP30   6 × 18K      0.50        0.97
the cores      Zynq XC7Z045            1 × 36K      0.26        0.27
JPEG 2000 implementation
D. Barina, O. Klima, P. Zemcik, "Single-Loop Architecture for JPEG 2000" (2016)
[Figure: core and codeblock diagram with the labels 2 × 2 c_n, 2 × 2 c_m, a_j, a_{j+1}, h, v, d.]
[Figure: time [ns] (0 to 140) versus resolution [pel] (100 k to 1 G) for proposed, OpenJPEG, JasPer, and FFmpeg.]
Contributions of the thesis
Aims: improved image transform performance and resource consumption
Objectives: eliminated the shortcomings of existing methods
Evaluation: assessed experimentally (performance, memory requirements)
evaluation performed: 2-D on CPU, 3-D on CPU, 2-D on GPU, 2-D on FPGA, JPEG 2000 on CPU
Selected papers
- Barina, D.; Klima, O.; Zemcik, P.: Single-Loop Software Architecture for JPEG 2000. In Data Compression Conference (DCC), 2016
- Barina, D.; Musil, M.; Musil, P.; et al.: Single-Loop Approach to 2-D Wavelet Lifting with JPEG 2000 Compatibility. In Workshop on Applications for Multi-Core Architectures (WAMCA), 2015
- Barina, D.; Zemcik, P.: Minimum Memory Vectorisation of Wavelet Lifting. In Advanced Concepts for Intelligent Vision Systems (ACIVS), 2013
- Barina, D.; Zemcik, P.: Wavelet Lifting on Application Specific Vector Processor. In GraphiCon, 2013
- Barina, D.; Zemcik, P.: Diagonal Vectorisation of 2-D Wavelet Lifting. In IEEE International Conference on Image Processing (ICIP), 2014
- Barina, D.; Zemcik, P.: Real-Time 3-D Wavelet Lifting. In International Conference in Central Europe on Computer Graphics, Visualization and Computer Vision (WSCG), 2015
- Barina, D.; Zemcik, P.: Vectorization and parallelization of 2-D wavelet lifting. Journal of Real-Time Image Processing (JRTIP), in press
- Barina, D.; Klima, O.; Zemcik, P.: Single-Loop Architecture for JPEG 2000. In Image and Signal Processing (ICISP), 2016
- Kula, M.; Barina, D.; Zemcik, P.: Block-based Approach to 2-D Wavelet Transform on GPUs. In International Conference on Information Technology – New Generations (ITNG), 2016
- Kucis, M.; Barina, D.; Kula, M.; et al.: 2-D Discrete Wavelet Transform Using GPU. In Workshop on Applications for Multi-Core Architectures (WAMCA), 2014
Summary
the core
- is a computing unit which processes the data in a single pass,
- can suspend/resume execution,
- can process the data in many different orders,
- can handle signal boundaries (is aware of coordinates),
- can be easily SIMD vectorized and parallelized,
- and its underlying scheme can be reorganized.