Lifting Scheme Cores for Wavelet Transform

David Barina (supervised by Pavel Zemcik)

DWT in image processing

the discrete wavelet transform (DWT) can be found in many image-processing tasks:

- analysis (edge detection, feature extraction, multiscale representation),
- compression (JPEG 2000, Dirac),
- watermarking, edge sharpening, contrast enhancement, tone mapping, denoising, fusion, etc.

Filter bank

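S. Mallat, "A theory for multiresolution signal decomposition: The wavelet representation" (1989)

[Figure: two-channel filter bank. Analysis side: filters \tilde{H}(z^{-1}) and \tilde{G}(z^{-1}), each followed by ↓2, producing a and d. Synthesis side: ↑2 followed by H(z) and G(z), with the two branches summed.]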

decomposition: two complementary filters, high number of operations

Lifting scheme

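I. Daubechies, W. Sweldens, "Factoring wavelet transforms into lifting steps" (1998)

[Figure: lifting scheme. The signal is split, transformed by \tilde{P}(z^{-1})^T into a and d, then transformed by P(z) and merged.]

\[
P(z) = \prod_{i=0}^{I-1}
\left\{
\begin{bmatrix} 1 & S_i(z) \\ 0 & 1 \end{bmatrix}
\begin{bmatrix} 1 & 0 \\ T_i(z) & 1 \end{bmatrix}
\right\}
\begin{bmatrix} K & 0 \\ 0 & 1/K \end{bmatrix}
\]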

decomposition: a sequence of simple filtering steps, which reduces the number of operations; split: into even and odd samples
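The split / lifting steps / merge structure can be sketched as follows. This is a minimal illustration, not the implementation behind the slides: the predict and update weights (-0.5 and 0.25) are a simple CDF 5/3-style pair chosen for the example, the function names are hypothetical, the signal length is assumed to be even, and out-of-range neighbours are clamped to the nearest valid index.

```python
def split(x):
    """Split a signal into its even- and odd-indexed samples."""
    return list(x[0::2]), list(x[1::2])


def merge(even, odd):
    """Interleave even and odd samples back into one signal."""
    x = [0.0] * (len(even) + len(odd))
    x[0::2] = even
    x[1::2] = odd
    return x


def lift(target, source, coeff, shift):
    """One lifting step: target[n] += coeff * (source[n] + source[n + shift]).

    An out-of-range neighbour index is clamped to the nearest valid one.
    """
    last = len(source) - 1
    return [t + coeff * (source[n] + source[min(max(n + shift, 0), last)])
            for n, t in enumerate(target)]


def forward(x):
    """Split, then apply one predict and one update step."""
    even, odd = split(x)
    odd = lift(odd, even, -0.5, +1)    # predict: odd minus mean of even neighbours
    even = lift(even, odd, 0.25, -1)   # update: even plus a quarter of detail neighbours
    return even, odd                   # (a, d)


def inverse(a, d):
    """Undo the steps in reverse order with flipped signs, then merge."""
    even = lift(a, d, -0.25, -1)
    odd = lift(d, even, 0.5, +1)
    return merge(even, odd)
```

Each step modifies only one of the two channels using the other, so it can be undone exactly by subtracting the same quantity, whatever the filter or border rule is.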

CDF 9/7 wavelet

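I. Daubechies, W. Sweldens, "Factoring wavelet transforms into lifting steps" (1998)

[Figure: CDF 9/7 lifting diagram. Even and odd input samples pass through the four steps α, β, γ, δ to the output.]

\[
\tilde{P}(z) =
\begin{bmatrix} 1 & \alpha\,(1 + z^{-1}) \\ 0 & 1 \end{bmatrix}
\begin{bmatrix} 1 & 0 \\ \beta\,(1 + z) & 1 \end{bmatrix}
\begin{bmatrix} 1 & \gamma\,(1 + z^{-1}) \\ 0 & 1 \end{bmatrix}
\begin{bmatrix} 1 & 0 \\ \delta\,(1 + z) & 1 \end{bmatrix}
\begin{bmatrix} \zeta & 0 \\ 0 & 1/\zeta \end{bmatrix}
\]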

four two-tap symmetric filters
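A hedged sketch of the four two-tap steps in code. The numeric coefficients are the commonly published CDF 9/7 lifting values rounded to nine decimals; they do not appear on the slides. The function names, the even signal length, and the choice of scaling a by ζ and d by 1/ζ are assumptions of this example. For an even-length signal, clamping the neighbour index as done here coincides with whole-sample symmetric extension of the original signal.

```python
# Commonly published CDF 9/7 lifting coefficients (rounded; assumption of this sketch).
ALPHA = -1.586134342
BETA = -0.052980118
GAMMA = 0.882911075
DELTA = 0.443506852
ZETA = 1.149604398


def _step(target, source, coeff, shift):
    """target[n] += coeff * (source[n] + source[n + shift]), index clamped at borders."""
    last = len(source) - 1
    return [t + coeff * (source[n] + source[min(max(n + shift, 0), last)])
            for n, t in enumerate(target)]


def cdf97_forward(x):
    """One level of the 1-D CDF 9/7 transform of an even-length signal."""
    a, d = list(x[0::2]), list(x[1::2])
    d = _step(d, a, ALPHA, +1)
    a = _step(a, d, BETA, -1)
    d = _step(d, a, GAMMA, +1)
    a = _step(a, d, DELTA, -1)
    return [v * ZETA for v in a], [v / ZETA for v in d]


def cdf97_inverse(a, d):
    """Undo the scaling, then the four steps in reverse order with flipped signs."""
    a = [v / ZETA for v in a]
    d = [v * ZETA for v in d]
    a = _step(a, d, -DELTA, -1)
    d = _step(d, a, -GAMMA, +1)
    a = _step(a, d, -BETA, -1)
    d = _step(d, a, -ALPHA, +1)
    x = [0.0] * (len(a) + len(d))
    x[0::2] = a
    x[1::2] = d
    return x
```

Each step costs two additions and one multiplication per coefficient, which is where the saving over direct filtering with the 9-tap and 7-tap filters comes from.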

2-D decomposition

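S. Mallat, "A theory for multiresolution signal decomposition: The wavelet representation" (1989)

[Figure: 2-D decomposition. A horizontal pass and a vertical pass split the image into subbands a, h (top row) and v, d (bottom row); the decomposition is repeated on a at the next scale.]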

image: a 2-D signal, transformed by a series of 1-D transforms into four subbands; multi-scale decomposition
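The separable row-then-column construction can be sketched as below. To keep the example short, a Haar lifting pair stands in for the 1-D transform (an assumption; any 1-D lifting transform could be plugged in), the image is a list of rows with even width and height, and the subband names follow the quadrant layout of the figure (a top-left, h top-right, v bottom-left, d bottom-right). All function names are hypothetical.

```python
def haar_1d(x):
    """One level of 1-D Haar lifting: returns (approximation, detail)."""
    even, odd = list(x[0::2]), list(x[1::2])
    d = [o - e for e, o in zip(even, odd)]          # predict
    a = [e + 0.5 * dd for e, dd in zip(even, d)]    # update (pair average)
    return a, d


def _vertical(block):
    """Apply the 1-D transform to every column of a block (list of rows)."""
    cols = [haar_1d(list(c)) for c in zip(*block)]
    low = [list(r) for r in zip(*(c[0] for c in cols))]
    high = [list(r) for r in zip(*(c[1] for c in cols))]
    return low, high


def dwt2_level(image):
    """One decomposition level: horizontal pass, then vertical pass."""
    rows = [haar_1d(r) for r in image]
    low = [r[0] for r in rows]     # left half after the horizontal pass
    high = [r[1] for r in rows]    # right half after the horizontal pass
    a, v = _vertical(low)
    h, d = _vertical(high)
    return a, h, v, d
```

A multi-scale decomposition would call `dwt2_level` again on the returned `a`. Note that this row-column form visits every sample twice per level, which is the cost the later slides set out to remove.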

Lenna

how to calculate the transform of this image as efficiently as possible?

Strategies and issues

R. Kutil, "A single-loop approach to SIMD parallelization of 2-D wavelet lifting" (2006)

[Figure: horizontal and vertical passes producing subbands a, h, v, d.]

strategies: row-column, block-based, and line-based

cache issues: cache line, limited size, set associativity, prefetching

techniques: padding, aggregation, memory layouts, interleaved loops, parallelization

the approaches have to visit the samples repeatedly, and memory access is expensive ⇒ CPU cache and its limitations, existing techniques, single-loop approach

Unsolved issues

[Figure: image processed by a 2 × 2 core, with separate prolog and epilog phases along the borders in both directions.]

- complicated border treatment (prolog/epilog phases)
- suspend/resume processing
- arbitrary processing order (scan order)
- interleave the transform and a subsequent processing
- multi-scale decomposition
- reorganization of underlying scheme

Objectives of the thesis

Aims: improve image transform performance and resource consumption

Objectives: eliminate the shortcomings of existing methods (see the previous slide)

Evaluation: prove experimentally (performance, memory requirements)

Lifting core

D. Barina, P. Zemcik, "Vectorization and parallelization of 2-D wavelet lifting" (in press)

solution: a processing unit

- which continuously consumes an input and produces an output
- which visits every image sample only once (cache friendly)
- which is aware of image coordinates (can handle the borders)
- whose configuration (state) can be saved/restored
- which can be run in any direction
- which can be SIMD vectorized
- which can run in parallel (on independent parts of the image)

\[
y = C\,x, \qquad
x \overset{\mathrm{def}}{=} I_n \parallel B, \qquad
y \overset{\mathrm{def}}{=} O_n \parallel B
\]
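A minimal 1-D sketch of such a processing unit, reading the relation above as "input samples joined with a buffer go in, output samples joined with the updated buffer come out" (an interpretation, not a statement from the slides). The class name, its methods, and the two-step CDF 5/3-style weights are hypothetical choices for illustration; the cores in the cited papers are 2-D and use the CDF 9/7 steps. The borders are handled by mirroring.

```python
class LiftingCore:
    """Streaming 1-D core for a two-step (predict, update) lifting scheme.

    It consumes one (even, odd) pair per call and emits one (a, d) pair with
    a lag of one pair. The buffer holds the pending pair and the last detail.
    """

    def __init__(self, state=None):
        # state = (even, odd, previous detail) or None before the first pair
        self.state = state

    def _emit(self, next_even):
        even, odd, prev_d = self.state
        d = odd - 0.5 * (even + next_even)      # predict
        if prev_d is None:                      # left border: mirror the detail
            prev_d = d
        a = even + 0.25 * (prev_d + d)          # update
        return a, d

    def push(self, even, odd):
        """Consume one input pair; return an output pair or None (first call)."""
        if self.state is None:
            self.state = (even, odd, None)
            return None
        a, d = self._emit(next_even=even)
        self.state = (even, odd, d)
        return a, d

    def flush(self):
        """Right border: mirror the last even sample and emit the pending pair."""
        a, d = self._emit(next_even=self.state[0])
        self.state = None
        return a, d


def run(core, pairs):
    """Feed pairs through the core and collect the emitted outputs."""
    outputs = (core.push(even, odd) for even, odd in pairs)
    return [out for out in outputs if out is not None]
```

Every input sample is touched once, and because the whole configuration lives in `state`, processing can be suspended and resumed by handing that tuple to another core instance, for example `LiftingCore(state=first.state)`.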

Core examples


[Figure: core examples built from the lifting steps α, β, γ, δ, showing the core inputs and outputs.]

Processing orders


[Figure: processing orders. Panels: horizontal, horizontal strips, horizontal blocks; vertical, vertical strips, vertical blocks.]

Borders treatment


[Figure: border treatment on interleaved rows of a and d coefficients, with positions running from 0 to N. The core applied at position n depends on that position: y = C_n x.]

the cores treat the boundaries gracefully

Parallel cores and reorganization

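M. Kula, D. Barina, et al., "Block-based Approach to 2-D Wavelet Transform on GPUs" (2016)

[Figure: reorganization of the underlying scheme. Steps per scheme: Sweldens1995 uses steps 1 to 4, Iwahashi2007 steps 1 to 3, the proposed scheme steps 1 and 2.]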

3-D core

D. Barina, P. Zemcik, "Real-Time 3-D Wavelet Lifting" (2015)

[Figure: 3-D core with axes x, y, z and the buffers x, y, z.]

the core can be extended into more dimensions, with buffers on the sides

CPU implementation


[Figure: time per pixel (0 to 50 ns) against image size (1k to 100M pixels); series: separable approach, core approach.]

an evaluation of the implemented approaches: separable, single-loop, and core

3-D CPU implementation

D. Barina, P. Zemcik, "Real-Time 3-D Wavelet Lifting" (2015)

[Figure: 3-D core with axes x, y, z and the buffers x, y, z.]

[Figure: time per voxel (0 to 160 ns) against volume size (0 to 250M voxels); series: naive horizontal, naive vertical, core 4², core 2³, core 4³.]

performance of 3-D transform: separable, 2-D core, 3-D core

GPU implementation

M. Kula, D. Barina, et al., "Block-based Approach to 2-D Wavelet Transform on GPUs" (2016)

[Figure, left: GB/s (80 to 260) against image size (0 to 70M pixels); series: Kucis2014, Separable Block, Non-Separable Block.]

[Figure, right: GB/s (0 to 60) against image size (100 kpel to 100 Mpel); series: Sweldens, Iwahashi*, Explosive*, Monolithic*, Polyphase*.]

[Figure: the Monolithic* scheme.]

left: SotA is in red, block methods in blue/green, reorganization
right: block methods, separable in black, ours in blue/green

FPGA implementation

D. Barina, et al., "Single-Loop Approach to 2-D Wavelet Lifting with JPEG 2000 Compatibility" (2015)

[Figure: FPGA architecture with the blocks Input, Transform (H and V), and BRAM.]

core        FF              LUT             BRAM
latency 4   441 (0.1 %)     399 (0.18 %)    6 (1.1 %)
latency 2   391 (< 0.1 %)   592 (0.27 %)    6 (1.1 %)

architecture   device                  BRAM [bits]   clocks/pel   time [ms]
Dillen2003     VirtexE1000-8           50K           0.50         1.20
Descampe2004   Virtex-II XC2V6000      N/A           0.60         1.75
Seo2007        Altera Stratix          128K          2.64         6.02
Zhang2012      Virtex-II Pro XC2VP30   6 × 18K       0.50         0.97
the cores      Zynq XC7Z045            1 × 36K       0.26         0.27

JPEG 2000 implementation

D. Barina, O. Klima, P. Zemcik, "Single-Loop Architecture for JPEG 2000" (2016)

[Figure: single-loop JPEG 2000 architecture; labels: core, codeblock, 2 × 2cn, 2 × 2cm, a_j, a_{j+1}, h, v, d.]

[Figure: time in ns (0 to 140) against resolution (100k to 1G pel); series: proposed, OpenJPEG, JasPer, FFmpeg.]

Contributions of the thesis

Aims: improved image transform performance and resource consumption

Objectives: eliminated the shortcomings of existing methods

Evaluation: assessed experimentally (performance, memory requirements)

evaluation performed: 2-D on CPU, 3-D on CPU, 2-D on GPU, 2-D on FPGA, JPEG 2000 on CPU

Selected papers

- Barina, D.; Klima, O.; Zemcik, P.: Single-Loop Software Architecture for JPEG 2000. In Data Compression Conference (DCC), 2016
- Barina, D.; Musil, M.; Musil, P.; et al.: Single-Loop Approach to 2-D Wavelet Lifting with JPEG 2000 Compatibility. In Workshop on Applications for Multi-Core Architectures (WAMCA), 2015
- Barina, D.; Zemcik, P.: Minimum Memory Vectorisation of Wavelet Lifting. In Advanced Concepts for Intelligent Vision Systems (ACIVS), 2013
- Barina, D.; Zemcik, P.: Wavelet Lifting on Application Specific Vector Processor. In GraphiCon, 2013
- Barina, D.; Zemcik, P.: Diagonal Vectorisation of 2-D Wavelet Lifting. In IEEE International Conference on Image Processing (ICIP), 2014
- Barina, D.; Zemcik, P.: Real-Time 3-D Wavelet Lifting. In International Conference in Central Europe on Computer Graphics, Visualization and Computer Vision (WSCG), 2015
- Barina, D.; Zemcik, P.: Vectorization and parallelization of 2-D wavelet lifting. Journal of Real-Time Image Processing (JRTIP), in press
- Barina, D.; Klima, O.; Zemcik, P.: Single-Loop Architecture for JPEG 2000. In Image and Signal Processing (ICISP), 2016
- Kula, M.; Barina, D.; Zemcik, P.: Block-based Approach to 2-D Wavelet Transform on GPUs. In International Conference on Information Technology – New Generations (ITNG), 2016
- Kucis, M.; Barina, D.; Kula, M.; et al.: 2-D Discrete Wavelet Transform Using GPU. In Workshop on Application for Multi-Core Architectures (WAMCA), 2014

Summary

the core

- is a computing unit which processes the data in a single pass,
- can suspend/resume execution,
- can process the data in many different orders,
- can handle signal boundaries (is aware of coordinates),
- can be easily SIMD vectorized and parallelized,
- and has an underlying scheme that can be reorganized.
