High Performance Edge-Preserving Filter on GPU · 2014-04-18


HIGH PERFORMANCE EDGE-PRESERVING FILTER

ON GPU

Jonas Li (jonasl@nvidia.com)

Compute Architect, NVIDIA

AGENDA

Algorithm basis

Optimization on GPU

Performance data

Demo

ALGORITHM BASIS

Based on the state-of-the-art “domain transform” [1]

— From R5(X,Y,R,G,B) to R2(XY,RGB), distances preserved

— (X,RGB)->ct(x)

— (Y,RGB)->ct(y)

— 1D smoothing on ct(x) and ct(y) ≡ edge preserving on 2D color image

ct(u) = ∫₀ᵘ (1 + |I′(x)|) dx

[1]: http://www.inf.ufrgs.br/~eslgastal/DomainTransform/

ct(x) is the accumulation of the L1 distances between neighboring samples of (x, I(x))

ct maps the original domain Ω to the transformed domain Ωw
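The slides give no code for this step; as a rough illustration of the formula (not the talk's kernel, which builds ct during the parallel integral-image phase described later), a sequential per-row pass can accumulate 1 plus the L1 color difference of neighboring pixels. The kernel name and one-thread-per-row layout below are assumptions:

// Illustrative sketch: one thread walks one row and accumulates ct(x)
// as the running sum of 1 + |I'(x)|, with |I'(x)| taken as the L1
// (per-channel) difference between neighboring BGR pixels.
__global__ void domainTransformRows(const uchar3* img, float* ct,
                                    int width, int height)
{
    int y = blockIdx.x * blockDim.x + threadIdx.x;   // one thread per row
    if (y >= height) return;

    const uchar3* row = img + y * width;
    float* ctRow = ct + y * width;

    float acc = 0.0f;
    ctRow[0] = 0.0f;
    for (int x = 1; x < width; ++x) {
        float d = fabsf((float)row[x].x - (float)row[x - 1].x)
                + fabsf((float)row[x].y - (float)row[x - 1].y)
                + fabsf((float)row[x].z - (float)row[x - 1].z);
        acc += 1.0f + d;                             // discretized 1 + |I'(x)|
        ctRow[x] = acc;
    }
}

ct(y) is built the same way along columns, which is where the matrix transposition of phase 2 comes in.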


ALGORITHM BASIS

GPU implementation

— Normalized convolution is GPU friendly

— Implemented on Tegra K1

— 2 main phases

Integral image (calculate ct(x), ct(y))

Normalized convolution (binary search, matrix transposition)

1ST PHASE: INTEGRAL IMAGE

Naïve implementation

— Large thread block

— Warp-wide shuffle

— Shared memory

Issues

— Warp synchronization

partial_sum = integral(thread_value);   // per-warp prefix sum
SMEM <- partial_sum;                    // store per-warp partial sums
sync();
if (warp_id == 0)
    warp_sum = integral(partial_sum);   // scan the per-warp sums
SMEM <- warp_sum;
sync();
if (warp_id != 0)
    thread_value += warp_sum;           // add offset of preceding warps
sync();
output();
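A concrete CUDA rendering of this naïve scheme could look like the sketch below (an illustration, not the talk's kernel; __shfl_up is the pre-CUDA-9 shuffle intrinsic available on Kepler-class GPUs such as Tegra K1). The barriers between the warp-level steps are the synchronization cost the next slide removes.

// Illustrative sketch of the naïve block-wide inclusive scan: each warp
// scans its own values with shuffles, warp totals go through shared
// memory, and __syncthreads() serializes the steps.
__device__ float blockInclusiveScanNaive(float v)
{
    __shared__ float warp_sums[32];
    const int lane    = threadIdx.x & 31;
    const int warp_id = threadIdx.x >> 5;

    // partial_sum = integral(thread_value): per-warp inclusive scan
    for (int offset = 1; offset < 32; offset <<= 1) {
        float n = __shfl_up(v, offset);
        if (lane >= offset) v += n;
    }
    if (lane == 31) warp_sums[warp_id] = v;          // SMEM <- partial_sum
    __syncthreads();                                 // sync()

    // warp_sum = integral(partial_sum): warp 0 scans the per-warp totals
    if (warp_id == 0) {
        float s = (lane < (blockDim.x >> 5)) ? warp_sums[lane] : 0.0f;
        for (int offset = 1; offset < 32; offset <<= 1) {
            float n = __shfl_up(s, offset);
            if (lane >= offset) s += n;
        }
        warp_sums[lane] = s;                         // SMEM <- warp_sum
    }
    __syncthreads();                                 // sync()

    // every warp but the first adds the total of the warps before it
    if (warp_id != 0) v += warp_sums[warp_id - 1];
    return v;
}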

1ST PHASE: INTEGRAL IMAGE

Optimized implementation

— Eliminate warp synchronization

— Data prefetching

— 2.7x speedup vs. naïve code

— 95% of peak performance


2ND PHASE: NORMALIZED CONVOLUTION

Per-thread binary search

— Partially overlapped range

— Still highly divergent

— Data dependency

Matrix transposition

— Bank conflicts

while (right > left) {
    idx = (right + left) / 2;
    v = input(idx);
    if (v > val) {
        right = idx;
    } else {
        left = idx + 1;
    }
}

BINARY SEARCH

Texture

— Seems a natural fit for divergent accesses

— Longer latency

— 2D cache locality

Shared memory

— Data copy/refresh overhead

— Limited size per thread

L1 Cache

— As fast as shared memory

— Hardware managed data refresh
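The slides do not show the L1-cache variant's code; a minimal sketch, assuming the ct() array simply stays in global memory and the hardware cache absorbs the divergent, data-dependent reads (on Kepler-class chips this typically means opting global loads into L1, e.g. compiling with -Xptxas -dlcm=ca). The function name and signature are illustrative:

// Illustrative per-thread lower-bound search over the ct() array.
// No texture binding and no shared-memory staging: the array is read
// straight from global memory and reuse is handled by the cache.
__device__ int searchDomain(const float* ct, int n, float val)
{
    int left = 0, right = n;
    while (right > left) {
        int idx = (right + left) / 2;
        float v = ct[idx];
        if (v > val) right = idx;
        else         left  = idx + 1;
    }
    return left;
}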

[Figure: Binary search: Texture/Smem/L1 (lower is better); kernel time (ms) vs. radius from 100 to 1000, curves for tex, L1, and Smem]

MATRIX TRANSPOSITION

Previous implementation[1]

— Transpose in shared memory

— Tile padding to avoid bank conflict

Our implementation

— Transpose in shared memory

— No bank conflict

— No tile padding

[1]: http://docs.nvidia.com/cuda/cuda-c-best-practices-guide/#shared-memory-in-matrix-multiplication-c-aa
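For reference, the padded scheme along the lines of [1] looks roughly like the sketch below (tile size, names, and launch shape are illustrative, not the talk's exact configuration). The extra column offsets each tile row so the column-wise reads fall into different banks:

#define TILE_DIM   32
#define BLOCK_ROWS 4    // 32x4 threads = 4 warps per block, as on the next slides

// Padded shared-memory transpose: coalesced load, coalesced store,
// and a +1 padding column to avoid shared-memory bank conflicts.
__global__ void transposePadded(float* out, const float* in, int width, int height)
{
    __shared__ float tile[TILE_DIM][TILE_DIM + 1];

    int x = blockIdx.x * TILE_DIM + threadIdx.x;
    int y = blockIdx.y * TILE_DIM + threadIdx.y;
    for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS)
        if (x < width && y + j < height)
            tile[threadIdx.y + j][threadIdx.x] = in[(y + j) * width + x];

    __syncthreads();

    x = blockIdx.y * TILE_DIM + threadIdx.x;         // swap block coordinates
    y = blockIdx.x * TILE_DIM + threadIdx.y;
    for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS)
        if (x < height && y + j < width)
            out[(y + j) * height + x] = tile[threadIdx.x][threadIdx.y + j];
}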

MATRIX TRANSPOSITION

4 warps per block

4 bytes per pixel

Naïve implementation:

4-way bank conflict in shared memory

MATRIX TRANSPOSITION

4 warps per block

4 bytes per pixel

Our implementation:

Store: No bank conflict

Load: No bank conflict
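The slides do not spell out how both the store and the load stay conflict-free without padding. One well-known way is to XOR-swizzle the shared-memory column index with the row index; the sketch below illustrates that idea only and is not necessarily the talk's exact scheme:

#define TILE_DIM   32
#define BLOCK_ROWS 4    // 32x4 threads = 4 warps per block

// Hypothetical padding-free transpose: XOR-swizzling the column index
// spreads both the row-wise stores and the column-wise loads across all
// 32 shared-memory banks, so no padding column is needed.
__global__ void transposeSwizzled(float* out, const float* in, int width, int height)
{
    __shared__ float tile[TILE_DIM][TILE_DIM];        // no padding

    int x = blockIdx.x * TILE_DIM + threadIdx.x;
    int y = blockIdx.y * TILE_DIM + threadIdx.y;
    for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS) {
        int r = threadIdx.y + j;
        if (x < width && y + j < height)
            tile[r][threadIdx.x ^ r] = in[(y + j) * width + x];
    }

    __syncthreads();

    x = blockIdx.y * TILE_DIM + threadIdx.x;
    y = blockIdx.x * TILE_DIM + threadIdx.y;
    for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS) {
        int r = threadIdx.y + j;
        if (x < height && y + j < width)
            out[(y + j) * height + x] = tile[threadIdx.x][r ^ threadIdx.x];
    }
}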


PERFORMANCE DATA

Performance comparison with naïve version

Tegra K1, 1920×1080, BGR color video

— Naïve version: 15.52fps

— Optimized version: 32.59fps

[Chart: relative speedup of the optimized version over the naïve version for Integral image, Normalized convolution, and Total]

Total speedup: ~2.1x

SUMMARY

A real-time edge-preserving filter on GPU is presented

Fast integral without warp synchronization

L1-based in-place binary search

Efficient matrix transpose scheme

DEMO
