An Implementation of a FIR Filter on a GPU
Alexey Smirnov and Tzi-cker Chiueh
ECSL Research Seminar, 9/13/05
Outline
Introduction
GPU Computing Overview
Related Work
FIR Filter Definition
FIR Filter Implementation on GPU
Performance Evaluation
Conclusion
Introduction
Numerical algorithms often perform repeated computations on vectors of elements.
Parallel computation improves performance.
x86 SIMD extensions: MMX, SSE, SSE2, SSE3.
Video cards are now programmable.
Computation and Bandwidth Rates
Video cards have a higher GFLOPS rate and memory bandwidth than CPUs.
However, data copying between main memory and video memory can reduce performance.
GPU Computing Background
Rendering pipeline:
User program defines vertex and texture coordinates.
The vertex processor converts vertex attributes from the world coordinate system into the screen coordinate system.
Interpolation defines the coordinates and color for each pixel.
The fragment processor computes the color of each output pixel using textures and the interpolated color.
Vertex and fragment processors are programmable, for example in Cg, a C-like language.
Rendering APIs
OpenGL (Linux, Windows, MacOS) and DirectX (Windows).
OpenGL extensions give access to advanced features of a video card.
NV_float_buffer supports floating-point textures.
ARB_render_texture allows rendering to a texture instead of the screen.
GPU Program Architecture
1. Create floating-point textures that contain the input data and load them into video memory;
2. Load the fragment program and enable multi-texturing;
3. Define vertex and texture coordinates;
4. Draw the figure to an off-screen buffer;
5. If the results were rendered to an off-screen buffer, copy the image to a texture using glCopyTexSubImage2D();
6. Go to step 3 if more iterations are needed;
7. Use glGetTexImage() to copy the data from video memory to main memory.
Input Data Representation
Matrices map naturally onto textures.
Four elements per pixel (R, G, B, A).
Vectors are wrapped into matrices.
Textures have maximum dimensions.
Related Work
Four papers describing matrix multiplication;
Linear algebra operations;
Array sorting;
FFT;
Earlier papers concluded that the CPU is more efficient than the GPU.
Recent video cards, e.g. GeForce 7800 and ATI X800 XT, do better than the CPU.
FIR Filter Definition
A Finite Impulse Response (FIR) filter is used in audio processing.
We modified GNU Radio, open-source software that implements Software Defined Radio.
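For reference, the textbook direct-form FIR filter computes y[n] as the sum over k of h[k] * x[n - k], where h holds the filter taps. A minimal CPU sketch is given below. It is not GNU Radio's code; the function name fir_filter and the treatment of samples before the start of the input as zero are assumptions.

```python
def fir_filter(taps, samples):
    """Direct-form FIR: y[n] = sum over k of taps[k] * samples[n - k]."""
    out = []
    for n in range(len(samples)):
        acc = 0.0
        for k, h in enumerate(taps):
            if n - k >= 0:              # samples before the start count as zero
                acc += h * samples[n - k]
        out.append(acc)
    return out
```

Feeding a unit impulse through the filter returns the taps themselves, which is a quick sanity check for any implementation.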
Other Relevant Transformations
Hilbert transformation:
Frequency translation FIR filter:
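Both transformations can be sketched in terms of textbook definitions. The code below is an assumption-laden illustration, not GNU Radio's implementation: hilbert_taps returns ideal, unwindowed Hilbert transformer taps (a practical filter would apply a window), and freq_xlating_fir mixes the input down by a complex exponential before filtering and decimating. All function and parameter names are hypothetical.

```python
import cmath
import math


def hilbert_taps(ntaps):
    """Ideal (unwindowed) Hilbert transformer taps; ntaps must be odd."""
    if ntaps % 2 == 0:
        raise ValueError("ntaps must be odd")
    mid = ntaps // 2
    taps = []
    for i in range(ntaps):
        n = i - mid                     # tap index centered on zero
        taps.append(2.0 / (math.pi * n) if n % 2 else 0.0)
    return taps


def freq_xlating_fir(taps, samples, center_freq, sample_rate, decimation=1):
    """Shift center_freq down to 0 Hz, then FIR-filter and decimate."""
    step = -2.0 * math.pi * center_freq / sample_rate
    mixed = [s * cmath.exp(1j * step * n) for n, s in enumerate(samples)]
    out = []
    for n in range(0, len(mixed), decimation):
        acc = 0j
        for k, h in enumerate(taps):
            if n - k >= 0:              # samples before the start count as zero
                acc += h * mixed[n - k]
        out.append(acc)
    return out
```

For example, a complex tone at 0.125 of the sample rate, translated by the same frequency through a single unit tap, comes out as a constant (DC) signal.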
FIR Filter on a GPU
FIR Filter’s Loop
Initialization:
Loop iteration:
FIR Filter’s Loop
O(j+1)=O(j)+MI
Final output value is computed as
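One plausible reading of the recurrence O(j+1)=O(j)+MI is sketched below on the CPU. It assumes that O is a four-element output pixel, I is a four-sample input pixel, and M is a 4x4 matrix of filter taps that depends on the distance between the two pixels. None of these definitions survive in the text, so the names tap_matrix and fir_blocked, the zero initialization, and the matrix layout are all assumptions.

```python
def tap_matrix(taps, d):
    """4x4 block of taps linking input pixel p - d to output pixel p."""
    m = []
    for r in range(4):
        row = []
        for c in range(4):
            k = 4 * d + r - c           # tap index for output lane r, input lane c
            row.append(taps[k] if 0 <= k < len(taps) else 0.0)
        m.append(row)
    return m


def fir_blocked(taps, samples):
    """FIR over 4-sample pixels; len(samples) must be a multiple of 4."""
    if len(samples) % 4:
        raise ValueError("pad samples to a multiple of 4")
    n_pixels = len(samples) // 4
    pixels = [samples[4 * p:4 * p + 4] for p in range(n_pixels)]
    n_iter = (len(taps) + 2) // 4 + 1   # number of pixel offsets with nonzero taps
    out = []
    for p in range(n_pixels):           # on a GPU, one fragment per output pixel
        acc = [0.0] * 4                 # assumed initialization: O(0) = 0
        for j in range(n_iter):         # loop iteration: O(j+1) = O(j) + M I
            if p - j < 0:
                break                   # pixels before the start count as zero
            m = tap_matrix(taps, j)
            pix = pixels[p - j]
            for r in range(4):
                acc[r] += sum(m[r][c] * pix[c] for c in range(4))
        out.extend(acc)
    return out
```

Under these assumptions the accumulated result equals the direct-form FIR output up to floating-point rounding, which makes the sketch easy to check against a plain CPU loop.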
Fragment Program
Optimizations
Break the loop into two to get rid of the conditional expression;
Unroll the loop body with and without the conditional expression;
Process two rows of input and textures;
Use different texture units in unrolled loops;
None of the above improved performance.
Performance Evaluation: FIR Filter
Performance of FreqXlating FIR Filter
Performance of Hilbert Transformation
Conclusion
Not everything benefits from GPU optimization.
CPU optimization tricks do not work on the GPU.
Texture upload/download takes up to 60% of total time.
GPU computation can take several seconds, compared to the milliseconds it takes to render a frame in a game.
Future Work
QoS for GPU: can an application specify a maximum latency or a share of GPU resources?
Work offload from CPU to GPU: is it possible to build a compiler that automatically decides what is worth optimizing for the GPU?
Debugging support: a lot of tools for Windows, none for Linux.