synergy.cs.vt.edu

Accelerating Fast Fourier Transform for Wideband Channelization

Carlo del Mundo*, Vignesh Adhinarayanan§, Wu-chun Feng*§

* Department of Electrical and Computer Engineering, § Department of Computer Science, Virginia Tech

Forecast

• Goal: Accelerate the Fast Fourier Transform (FFT) using graphics processing units (GPUs)
– Replace fixed hardware ASICs with programmable

Carlo del Mundo, cdel@vt.edu, carlodelmundo.com

Image sources:
http://upload.wikimedia.org/wikipedia/commons/7/79/SSDTR-ASIC_technology.jpg
http://images.bit-tech.net/content_images/2011/12/amd-radeon-hd-7970-3gb-review/amd-radeon-hd7970-e.jpg

Motivation

• FFT is a critical building block across many disciplines

Image sources:
http://www.ajnr.org/content/27/6/1230/F1.large.jpg
http://www.elektrodaily.com/wp-content/uploads/2013/02/shazam-app.png
http://www.wireless.vt.edu/symposium/2012/tutorials/sessionA2.html

Introduction (Channelization)

• Wideband Channelization
– Purpose: To isolate channels within a wideband signal
• Algorithm: Polyphase filter bank (PFB) channelizer
– Problem: FFT stage grows fastest in channelization

Figure: Stages in a PFB Channelizer
http://www.wireless.vt.edu/symposium/2012/tutorials/sessionA2.html

Choosing the Right Processor

• Criteria: Programmability & Performance

Image sources:
http://upload.wikimedia.org/wikipedia/commons/7/79/SSDTR-ASIC_technology.jpg
http://www.maximumpc.com/files/u154082/intel_cpu_socket3.jpg
http://fr.academic.ru/pictures/frwiki/70/Fpga_xilinx_spartan.jpg
http://images.bit-tech.net/content_images/2011/12/amd-radeon-hd-7970-3gb-review/amd-radeon-hd7970-e.jpg

Outline

• Motivation
• Introduction
• Background
• Approach
– System-level optimizations
– Algorithm-level optimizations
• Results
– Optimizations in isolation
– Optimizations in concert
• Conclusion

Background (GPUs)

• GPU Memory Hierarchy
– Global Memory
– Image Memory
– Constant Memory
– Local Memory
– Registers

Table: Memory Read Bandwidth for Radeon HD 6970

Memory Unit    Read Bandwidth (TB/s)
Registers      16.2
Constant       5.4
Local          2.7
L1/L2 Cache    1.35 / 0.45
Global         0.17

Approach

• Act as the “human compiler”
1. Derive a candidate set of optimizations for FFT on GPUs
2. Apply optimizations in isolation
3. Apply optimizations in concert

Figure: Candidate Optimizations, Optimizations in Isolation, Optimizations in Concert

Approach

• System-level Optimizations (applicable to any application)
1. Register Preloading
2. Vector Access/{Vector,Scalar} Arithmetic
3. Constant Memory Usage
4. Dynamic Instruction Reduction
5. Memory Coalescing
6. Image Memory

• Algorithm-level Optimizations
1. Naïve Transpose (LM-CM)
2. Compute/Transpose via LM (LM-CC)
3. Compute/No Transpose via LM (LM-CT)

C. del Mundo et al., “Accelerating Fast Fourier Transform for Wideband Channelization,” IEEE ICC, Budapest, Hungary, June 2013.

System-level Optimizations

1. Register Preloading (RP)
– Load to registers first

Without Register Preloading

    __kernel void unoptimized(__global float2 *buffer)
    {
        int index = …;
        buffer += index;

        // Operands are passed straight from global memory.
        FFT4_in_order_output(&buffer[0], &buffer[4], &buffer[8], &buffer[12]);
        // …
    }

With Register Preloading

    __kernel void optimized(__global float2 *buffer)
    {
        int index = …;
        buffer += index;

        __private float2 r0, r1, r2, r3; // Register Declaration
        // Explicit Loads
        r0 = buffer[0]; r1 = buffer[1]; r2 = buffer[2]; r3 = buffer[3];
        FFT4_in_order_output(&r0, &r1, &r2, &r3);
        // …
    }

2. Vector Access (float{2, 4, 8, 16})
– Scalar Math (VASM)
• float + float
– Vector Math (VAVM)
• float4 + float4

Figure: vector elements a[0] a[1] a[2] a[3]

Algorithm-level optimizations

• Transpose – elements across the diagonal are exchanged

Figure: 4x4 matrix (Original) and its transposed matrix (Transposed)

1. Naïve Transpose (LM-CM)

Figure: Original and Transposed matrices, with threads t0 t1 t2 t3, the Register File, and Local Memory
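
The transpose-based flow can be modeled on the host as a minimal Python sketch. It assumes the 4x4 matrix holds the 16 points of one FFT in row-major order and that thread t keeps column t in its registers; neither assumption is stated on this slide. The names `dft`, `fft4`, and `fft16_with_transpose` are illustrative and not taken from the authors' kernels, and plain Python lists stand in for the register file and local memory.

```python
import cmath

def dft(x):
    """Direct O(N^2) DFT, used only as a reference."""
    n = len(x)
    return [sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
            for k in range(n)]

def fft4(v):
    """4-point DFT of four complex values (one radix-4 butterfly)."""
    a, b = v[0] + v[2], v[0] - v[2]
    c, d = v[1] + v[3], -1j * (v[1] - v[3])
    return [a + c, b + d, a - c, b - d]

def fft16_with_transpose(x):
    # Registers: thread n2 holds column n2 of the row-major 4x4 matrix.
    regs = [[x[4 * n1 + n2] for n1 in range(4)] for n2 in range(4)]
    # Stage 1: each thread runs a 4-point FFT on its own column.
    regs = [fft4(col) for col in regs]
    # Twiddle: element k1 of column n2 is scaled by W16^(n2*k1).
    regs = [[regs[n2][k1] * cmath.exp(-2j * cmath.pi * n2 * k1 / 16)
             for k1 in range(4)] for n2 in range(4)]
    # Transpose through "local memory": every thread writes its column,
    # then reads back a row, so thread k1 now owns the values for all n2.
    local = [[regs[n2][k1] for n2 in range(4)] for k1 in range(4)]
    regs = [row[:] for row in local]
    # Stage 2: second 4-point FFT, again entirely in registers.
    regs = [fft4(row) for row in regs]
    # regs[k1][k2] is output bin k1 + 4*k2.
    out = [0j] * 16
    for k1 in range(4):
        for k2 in range(4):
            out[k1 + 4 * k2] = regs[k1][k2]
    return out
```

In this model local memory is used only to move data between threads; all arithmetic happens on the per-thread lists, which is the "communication only" role that LM-CM names.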

3. The pseudo transpose (LM-CT)
– Idea:
• Load data to local memory
• Perform computation on columns, then rows.
– Advantage:
• Skips the transpose step
– Disadvantage:
• Local memory has lower throughput than registers.

Figure: Original and Transposed matrices held in Local Memory
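
The columns-then-rows idea can be sketched in Python as follows. This assumes a 16-point FFT stored as a row-major 4x4 tile, with a flat list standing in for local memory. The function names `fft4`, `fft16_no_transpose`, and `natural_order` are hypothetical, and the twiddle multiplication is folded into the column pass purely for brevity.

```python
import cmath

def fft4(v):
    """4-point DFT of four complex values (one radix-4 butterfly)."""
    a, b = v[0] + v[2], v[0] - v[2]
    c, d = v[1] + v[3], -1j * (v[1] - v[3])
    return [a + c, b + d, a - c, b - d]

def fft16_no_transpose(x):
    lm = list(x)  # stand-in for the 16-element local-memory tile (row-major 4x4)
    # Pass 1: columns. Thread t works in place on elements t, t+4, t+8, t+12.
    for t in range(4):
        col = fft4([lm[4 * r + t] for r in range(4)])
        for r in range(4):
            # Apply the twiddle W16^(r*t) while writing back.
            lm[4 * r + t] = col[r] * cmath.exp(-2j * cmath.pi * r * t / 16)
    # Pass 2: rows. Thread t works in place on elements 4t .. 4t+3.
    for t in range(4):
        lm[4 * t:4 * t + 4] = fft4(lm[4 * t:4 * t + 4])
    # No transpose was performed, so bin k1 + 4*k2 sits at lm[4*k1 + k2].
    return lm

def natural_order(lm):
    """Read the bins out in natural order k = 0..15."""
    return [lm[4 * (k % 4) + k // 4] for k in range(16)]
```

The second pass simply changes which elements each thread indexes, so no data is exchanged between the two passes. The cost, as the slide notes, is that both passes read and write local memory rather than registers.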

Results (Experimental Testbed)

GPU Testbed

Device (AMD Radeon)    Cores    Peak Performance (GFLOPS)    Peak Bandwidth (GB/s)
HD 7970                2048     3788                         264
HD 6970 (VLIW)         1536     2703                         176
HD 5870 (VLIW)         1600     2720                         154

• Algorithm:
– 1D FFT (batched), N = 16 pts
– Cooley-Tukey Decomposition

Results (in isolation)

Figure panels: AMD Radeon HD 7970 (Scalar, non-VLIW); AMD Radeon HD 5870/6970 (VLIW); reference line: 0% (No Change)

IM: Image memory; RP: Register Preloading; LM-{CM, CT, CC}: Local Memory-{Communication Only; Compute, No Transpose; Computation and Communication}; VASM{n}: Vectorized Access & Scalar Math{floatn}; VAVM{n}: Vectorized Access & Vector Math{floatn}; CM: Constant Memory Usage; CGAP: Coalesced Access Pattern; LU: Loop unrolling; CSE: Common subexpression elimination; IL: Function inlining; Baseline: VASM2.

*Baseline refers to a functionally correct GPU implementation with VASM2 and no optimizations.

Improvements to Baseline (Max. % Increase)
1. 160% - Minimize bus traffic via on-chip optimizations (RP, LM-CC, LM-CT)
2. 40% - Coalesce memory accesses (CGAP)
3. 20% - Use scalar math (VASM2/VASM4)

Neutral/Detrimental to Baseline (Min. % Decrease)
1. 20% - Naïve transpose (LM-CM), 40% - Constant Memory (CM-K, CM-L)
2. 0% - Dynamic instruction reduction (LU, CSE, IL)
3. 18% - Avoid large vectors & vector math (VASM16, VAVM8/16)

Results (in concert)

• Improvements (Max. % Increase)
– {RP + LM-CM} best on-chip optimization
– Use Constant Memory (CM) for twiddle calculations
– Use global memory (instead of image memory)
– Optimal set for AMD GPUs
• RP – Register Preloading
• LM-CM – Transpose via local memory
• CM – Constant memory usage
• CGAP – Coalesced Global Access Pattern
• VASM2 – Vector Access, Scalar Math (float2)

Figure: speedup annotations from the graph: 2.9x 2.4, 2.1x 1.5, 1.8x 1.5, 5.6x 5.6

*Baseline refers to a functionally correct GPU implementation with VASM2 and no optimizations.
2 All implementations are coalesced (CGAP) and use VASM2.
3 The speedups listed in the graph apply only to the Radeon HD 7970 (for brevity).

Results (1D FFT 16-pts, GPU versions)

• Optimized GPU faster by factors of 14.5 over baseline GPU

Conclusions

• Contributions:
– A portable building block for FFT towards GPU-based radios
– Architecture-aware insights for mapping and optimizing FFT across three generations of AMD GPUs
• Optimal set for AMD GPUs
– RP – Register Preloading
– LM-CM – Transpose via local memory
– CM – Constant memory usage
– CGAP – Coalesced Global Access Pattern
– VASM2 – Vector Access, Scalar Math (float2)
• Contact:
– Carlo del Mundo
– cdel@vt.edu

Appendix Slides

Introduction (FFT)

• Fast Fourier Transform (FFT)
– A spectral method
• Key computational idiom for present and future applications (dwarf)§

List of Dwarfs
1. Finite State Machine
2. Circuits
3. Graph Algorithms
4. Structured Grid
5. Dense Matrix
6. Sparse Matrix
7. Spectral Methods
8. Dynamic Prog.
9. Particle Methods
10. Backtrack/B&B
11. Graphical Models
12. Unstructured Grids
13. Map Reduce

§ Asanovic et al. A View of the Parallel Computing Landscape. CACM, 2009.

Background (Optimizing on GPUs)

1. RP (Register Preloading) - All data elements are first preloaded onto the register file of the respective GPU. Computation is performed solely in registers.
2. CGAP (Coalesced Global Access Pattern) - Threads access memory contiguously (the kth thread accesses memory element k).
3. VASM2/4 (Vector Access, Scalar Math, float{2/4}) - Data elements are loaded as the listed vector type. Arithmetic operations are scalar (float x float).
4. LM-CM (Local Memory, Communication Only) - Data elements are loaded into local memory only for communication. Threads swap data elements solely in local memory.
5. LM-CT (Local Memory, Computation, No Transpose) - Data elements are loaded into local memory for computation. The communication step is avoided by algorithm reorganization.
6. LM-CC (Local Memory, Computation and Communication) - All data elements are preloaded into local memory. Computation is performed in local memory, while registers are used for scratchpad communication.
7. CM-K (Constant Memory - Kernel Argument) - The twiddle factors for the FFT's twiddle multiplication stage are precomputed on the CPU and stored in the GPU constant memory for fast lookup.
8. CSE (Common Subexpression Elimination) - A traditional optimization that collapses identical expressions in order to save computation. This optimization may increase register live time, thereby increasing register pressure.
9. IL (Function Inlining) - A function's code body is inserted in place of a function call. It is used primarily for functions that are frequently called.
10. IM (Image Memory) - The use of a texture image replaces the use of global memory.
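
The CGAP access pattern in item 2 can be illustrated with a small Python sketch that counts how many memory segments a group of threads touches. The 16-element segment size and the function names are assumptions chosen for illustration, not hardware parameters from the slides.

```python
def addresses(num_threads, stride):
    """Element index read by each thread; stride 1 is the coalesced pattern."""
    return [k * stride for k in range(num_threads)]

def segments_touched(indices, elems_per_segment):
    """Number of distinct fixed-size memory segments covered by the accesses."""
    return len({i // elems_per_segment for i in indices})

# Hypothetical segment size, in elements, chosen only to make the contrast visible.
SEGMENT = 16

coalesced = addresses(16, 1)   # thread k reads element k
strided = addresses(16, 16)    # thread k reads the first element of its own 16-point FFT

print(segments_touched(coalesced, SEGMENT), segments_touched(strided, SEGMENT))
```

Under these assumptions the coalesced pattern touches a single segment, while the strided pattern touches one segment per thread.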

Motivation (GPU FFT vs. CPU FFT)

• GPU FFT outperforms CPU FFT by factors as high as 6.5*
– 1D batched FFT, N = 16 pts

* Device-Host Data Transfer Not Included

Introduction (Channelizer Architecture)

• Channelizer Architecture
– FIR Filtering, FFT, and Channel Mapping.
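
The three stages can be sketched in Python for a single output frame, assuming a critically sampled channelizer in which the prototype filter is split into one branch per channel. The function name `pfb_channelize`, the identity channel mapping, and the all-ones prototype filter in the usage lines are illustrative assumptions; a real design would use a low-pass prototype, and the slides do not specify these details.

```python
import cmath

def pfb_channelize(x, taps, num_channels):
    """One output frame: polyphase FIR filtering, then a DFT across branches."""
    m = num_channels
    p = len(taps) // m  # taps per polyphase branch
    assert len(taps) == m * p and len(x) >= m * p
    # FIR stage: branch b sums the tap-weighted samples b, b+M, b+2M, ...
    branches = [sum(x[b + q * m] * taps[b + q * m] for q in range(p))
                for b in range(m)]
    # FFT stage (written as a direct DFT for clarity).
    spectrum = [sum(branches[b] * cmath.exp(-2j * cmath.pi * k * b / m)
                    for b in range(m)) for k in range(m)]
    # Channel mapping: assumed to be a pass-through, so output k is channel k.
    return spectrum

# Usage: a tone centred on channel 5 of 16, with 4 taps per branch.
tone = [cmath.exp(2j * cmath.pi * 5 * n / 16) for n in range(64)]
frame = pfb_channelize(tone, [1.0] * 64, 16)
strongest = max(range(16), key=lambda k: abs(frame[k]))
```

In this sketch the FFT size equals the number of channels, so a design with more channels needs a larger transform for every output frame.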

S3: Constant Memory

• Fast cached lookup for frequently used data

    __constant float2 twiddles[16] = { (float2)(1.0f,0.0f), (float2)(1.0f,0.0f), (float2)(1.0f,0.0f), (float2)(1.0f,0.0f), ... more sin/cos values };

Without Constant Memory

    for (int j = 1; j < 4; ++j)
    {
        // Recompute the twiddle factor on every iteration.
        float theta = -2.0f * M_PI_F * tid * j / 16.0f;
        float2 twid = (float2)(cos(theta), sin(theta));
        // A complex multiply is intended; '*' on float2 is component-wise in OpenCL C.
        result[j] = buffer[j*4] * twid;
    }

With Constant Memory

    for (int j = 1; j < 4; ++j)
        // Look up the precomputed twiddle factor instead.
        result[j] = buffer[j*4] * twiddles[4*j+tid];
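
A host-side table matching the `twiddles[4*j+tid]` indexing could be generated as in the Python sketch below. It assumes `tid` ranges over 0..3, which the slide does not state. The names `twiddle`, `TWIDDLES`, and `lookup` are illustrative, and the values elided on the slide are computed here from the slide's formula rather than copied from the authors' table.

```python
import math

N = 16  # FFT size used throughout the slides

def twiddle(tid, j):
    """Twiddle computed on the fly, from theta = -2*pi*tid*j/16."""
    theta = -2.0 * math.pi * tid * j / N
    return complex(math.cos(theta), math.sin(theta))

# Precomputation on the CPU: entry 4*j + tid holds the twiddle for (tid, j).
TWIDDLES = [twiddle(tid, j) for j in range(4) for tid in range(4)]

def lookup(tid, j):
    """Table lookup that replaces the cos/sin evaluation in the kernel loop."""
    return TWIDDLES[4 * j + tid]
```

With this layout the first four entries (j = 0) are all 1 + 0i, which agrees with the leading `(1.0f,0.0f)` values shown in the slide's table.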
