Simulation Case Studies Update
David Bueno
June 19, 2006
HCS Research Laboratory, ECE Department
University of Florida
Overview
- Set of case studies examining the performance of realistic GMTI kernels, with detailed modeling of the processing and memory subsystems
- Models developed during the Summer '05 internship serve as the basis for improvements
  - Models modified to change/cover sensitive information and, in some cases, calibrated to more closely resemble Chris' testbed setup (e.g., 250 MHz RIO)
- Modeling environment allows us to change parameters, scale, and network architecture with a much greater degree of freedom than the testbed
- Three main points of emphasis for the case studies:
  - Gain insight into tradeoffs in the shared interface between processor, network, and memory
  - Develop a greater understanding of SBR processing/network requirements and optimal system configuration
  - Quantify RapidIO latency and examine methods of improving latency for critical data
- Also intend to analytically examine the previously studied FT RIO architectures and discuss how they may apply to the case study scenarios
System Architecture
- Two network architectures similar to previous GMTI/SAR experiments
- Small-scale system assumes FPGA-level TMR ("current" technology)
  - 3 FPGAs (1 logical PE) per card
  - Maximum of 7 processor boards (7 logical PEs), 1 global-memory board
  - 4-switch network backplane card (8-port switches)
    - Clos-like backplane with first- and second-stage switches on the same card
  - 1 system controller and 1 spacecraft interface card, each connected to the RIO network via a single dedicated link
  - High-level models act as the source of latency-sensitive data
- Large-scale system (shown) assumes fully radiation-hardened FPGAs or additional management software (not modeled) capable of handling SEUs in the FPGAs (e.g., DM-like system, "near future" technology)
  - 7 processor cards, with 4 FPGAs (4 logical PEs) and 1 RapidIO switch per card; 28 physical and logical FPGAs total
  - 4-switch network backplane card (9-port switches)
    - Clos-like backplane with second stage only
  - 1 system controller and 1 spacecraft interface card
Major Improvements
- Several changes from previous SBR experiments:
  - Computation time based on a working preliminary RIO testbed implementation of the GMTI kernels
  - Detailed memory access model for SDRAM
    - Models contention between the processing elements and the RIO interface
    - SRAM access is deterministic and considered part of the measured computation time
  - Support for measurement of latency-sensitive traffic, with latency values based on the Honeywell RapidIO implementation
  - New scaling method reduces simulation runtimes from hours (overnight) to ~30 minutes (see the sketch after this list)
    - Shrink the data cube (and CPI) along the pulses dimension for simulation, then scale the reported results to the full CPI size
    - Verified accurate to <<1% error for all cases where the system is able to meet real-time deadlines
    - Super-linear speedups in simulation runtime observed due to reduced memory/disk access by the MLD simulator
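A minimal sketch of the cube-shrinking idea described above, assuming the simulated quantities scale linearly with the number of pulses. The reduced pulse count and function names are illustrative only and are not taken from the actual MLD models.

```python
# Hypothetical sketch: shrink the data cube along the pulses dimension for
# simulation, then scale the reported per-CPI results back to full size.

FULL_PULSES = 256          # full CPI size along the pulses dimension
SIM_PULSES = 32            # reduced size actually simulated (assumed value)

def shrink_cube(pulses, beams, ranges, sim_pulses=SIM_PULSES):
    """Return the reduced cube dimensions used for simulation."""
    return (sim_pulses, beams, ranges)

def scale_result(sim_value, full_pulses=FULL_PULSES, sim_pulses=SIM_PULSES):
    """Scale a simulated per-CPI quantity (e.g. processing or transfer time)
    back to the full CPI size, assuming it grows linearly with pulses."""
    return sim_value * (full_pulses / sim_pulses)

# Example: 40 ms of simulated CPI latency on the shrunken cube maps to roughly
# 320 ms at full size under the linear-scaling assumption.
print(scale_result(40.0))
```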
GMTI Case Study
- CPI data arrives at the global memory board every 256 ms
  - Store-and-forward pre-processing (e.g., ECCM beamforming) performed outside the scope of the case study
  - 32 bits per element (16 bits real, 16 bits imaginary) to match the testbed
- Four kernels compose the processing stages, matching the testbed implementation: Pulse Compression, Doppler Processing, Beamforming, CFAR
- Each input data cube is 256 pulses by 6 beams, with the number of ranges varied (see the cube-size sketch after this list)
  - Cube size shrinks by a factor of 0.6 along the range dimension after Pulse Compression
  - Cube size shrinks by a factor of 0.5 along the beams dimension after Beamforming (3 beams formed from 6 input)
  - 1 KB of detection results reported to the system controller at the conclusion of CFAR
- Latency-sensitive control data arrives from the spacecraft for delivery to a randomly selected processing node at Poisson-distributed intervals
- Latency-sensitive health/timer/status data is sent from the system controller to each processing node at regular intervals
- Processing nodes respond to this data with RapidIO responses that are also sensitive to latency and jitter
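A small sketch of how the data cube dimensions and size evolve through the four stages, using the element size and shrink factors stated above. The helper names are illustrative, not part of the testbed or simulator code.

```python
# Cube-size arithmetic for the GMTI case study: 32-bit (4-byte) elements,
# range dimension shrinks by 0.6 after Pulse Compression, beams shrink by 0.5
# after Beamforming. (~1 KB of CFAR detections is reported separately to the
# system controller and is not part of the cube.)

def cube_bytes(pulses, beams, ranges, bytes_per_elem=4):
    return pulses * beams * ranges * bytes_per_elem

def gmti_stage_sizes(pulses=256, beams=6, ranges=32 * 1024):
    sizes = {"input": cube_bytes(pulses, beams, ranges)}
    ranges = int(ranges * 0.6)            # Pulse Compression shrinks ranges
    sizes["after_pulse_compression"] = cube_bytes(pulses, beams, ranges)
    # Doppler Processing leaves the cube dimensions unchanged
    beams = beams // 2                    # Beamforming: 3 beams formed from 6
    sizes["after_beamforming"] = cube_bytes(pulses, beams, ranges)
    return sizes

for stage, size in gmti_stage_sizes().items():
    print(f"{stage}: {size / 2**20:.1f} MiB")
```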
Single-Engine Results
- One processing engine in each FPGA dedicated to each GMTI kernel (Pulse Compression, Doppler Processing, Beamforming, CFAR; 4 engines total per FPGA)
- Most significant jump in performance lies between 8 Gbps (125 MHz interface) and 16 Gbps (250 MHz interface) in all cases
  - Between these values, the memory interface goes from under-provisioned to adequately provisioned for the given network and processing traffic patterns
- 64k-range cube experiences a non-linear performance penalty compared to the 32k- and 48k-range cases
  - Penalty due to double buffering: sharing of the SDRAM interface between incoming RIO data and PE accesses/corner-turn RIO data
  - Smaller cubes require limited double buffering (except in the 8 Gbps, 48k case), since the data can be delivered and processed in under 256 ms, leaving potential for overlap
[Chart: GMTI, 256 Pulses, 6 Beams, 1 Engine per Task per FPGA — Avg. CPI Latency (ms) vs. Max SDRAM Interface Bandwidth (8-32 Gbps), for 32k, 48k, and 64k Ranges]
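A quick check of the SDRAM interface speeds on the chart's x-axis: a 32-bit DDR bus transfers two words per clock, so peak bandwidth scales directly with the interface clock, matching the 125 MHz = 8 Gbps and 250 MHz = 16 Gbps figures above.

```python
# Peak bandwidth of a 32-bit DDR bus: two transfers per clock cycle.
def ddr_peak_gbps(clock_mhz, bus_bits=32):
    return clock_mhz * 1e6 * 2 * bus_bits / 1e9

for mhz in (125, 250, 375, 500):
    print(f"{mhz} MHz -> {ddr_peak_gbps(mhz):.0f} Gbps")   # 8, 16, 24, 32 Gbps
```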
Multiple-Engine Results
- Number of processing engines for each GMTI task in each FPGA varied from 1 to 4 (up to 16 total engines per FPGA)
- As processing engines are added, performance becomes increasingly memory bound and the PE requires more memory bandwidth to benefit from the increased processing capability
  - Each successive increment in memory interface clock allows the effective addition of 1 more engine per task
  - However, diminishing returns overall for each speed increase beyond 16 Gbps, due to the additional dependence on network throughput for data delivery and corner turns
- RIO interface requires ~4 Gbps (theoretical max) of memory bandwidth for each direction of traffic
  - The 8 Gbps required by full-duplex corner-turn communication maxes out a 125 MHz, 32-bit DDR bus (8 Gbps) without any additional PE traffic
- Double buffering of processing at the PE level requires memory bandwidth to be 2x the "processing bandwidth" for one engine
  - For all experiments, the FPGA can process one 32-bit element per cycle at 125 MHz (4 Gbps)
  - The 125 MHz bus therefore leaves zero margin for any RapidIO traffic alongside processing
- Most stressful memory access period is actually not the corner-turn network traffic, since no PE access occurs during that time
  - Instead, double-buffered processing and network traffic may require PE reads, PE writes, AND reception of the next data cube (see the bandwidth-budget sketch after the chart below)
- Note: it is important to distinguish double buffering of processing data (e.g., performing an FFT while loading the next chunk of data to FFT into SRAM) from double buffering of network data (i.e., processing one data cube while receiving the next from the sensors)
[Chart: GMTI, 256 Pulses, 64k Ranges, 6 Beams — Avg. CPI Latency (ms) vs. Number of Engines per Task (1-4), for 8, 16, 24, and 32 Gbps SDRAM interfaces]
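A hedged sketch of the memory-bandwidth budget described above, adding up demands on the shared SDRAM interface during the worst-case period when double-buffered processing overlaps with reception of the next cube. The 4 Gbps per-engine and per-RIO-direction figures come from the slide; the simple additive model is an assumption, not the simulator's contention model.

```python
# Worst-case SDRAM demand: double-buffered processing needs read + write
# bandwidth per engine (2 x 4 Gbps), and receiving the next data cube adds
# one more direction of RapidIO traffic (~4 Gbps).

PROC_GBPS_PER_ENGINE = 4.0   # one 32-bit element per cycle at 125 MHz
RIO_GBPS_PER_DIR = 4.0       # theoretical max per direction of RapidIO traffic

def worst_case_demand_gbps(engines_per_task, receiving_next_cube=True):
    processing = engines_per_task * 2 * PROC_GBPS_PER_ENGINE
    network = RIO_GBPS_PER_DIR if receiving_next_cube else 0.0
    return processing + network

for engines in range(1, 5):
    demand = worst_case_demand_gbps(engines)
    print(f"{engines} engine(s) per task: ~{demand:.0f} Gbps worst-case SDRAM demand")
```

Under this simple model, each additional engine per task adds roughly 8 Gbps of demand, which is consistent with the observation that each memory-interface speed increment buys about one more engine's worth of benefit.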
Conclusions
- Testbed and simulation case studies are providing valuable insight into the implementation of real-world RapidIO systems for SBR processing
- Performance of the DDR SDRAM interface is the major determining factor in the performance of the GMTI kernels
  - Heavily influences performance of both the PE and the RapidIO network
- Double buffering of both network and processing data greatly taxes memory, even more than network corner turns
  - Even if receiving a data cube while performing a corner turn, RIO network flow control ensures the memory interface is not taxed at more than 8 Gbps for 250 MHz RapidIO
  - However, if receiving a data cube while performing double-buffered processor access, the RIO network will require 4 Gbps while each engine will also require 8 Gbps of throughput
- With the current testbed configuration, there is no point in implementing support for multiple engines per task on a single FPGA
  - Simulation results save Chris some pain and suffering
Future Work
- Addition of charts showing memory bandwidth utilization over time
- Addition of a chart showing the execution-time components of a CPI
- Inclusion of small-scale system results
  - Mainly as proof-of-concept for current technology
  - Preliminary results show the same trends at smaller scale
- Quantification of FPGA resources required for each configuration used in the experiments
  - Mostly done; just need to calculate estimates with Chris for stages not yet implemented in the testbed (i.e., addition of magnituding to CFAR, Beamforming)
  - Baseline configuration (1 engine per task per FPGA) estimated to nearly fit in our current testbed FPGAs
- Study of latency values and latency-improvement tactics
  - Cut-through routing, preemption, dedicated paths, reduction of packet size, direct access from the RapidIO network to SRAM
- SAR global-memory-based FFT case studies
  - Suite of experiments similar to the GMTI case studies
  - Also examine performance with processing performed directly out of global memory over RIO
Brief Testbed Update
Chris Conger
June 19, 2006
HCS Research Laboratory, ECE Department
University of Florida
Review of Testbed Node Architecture
- As requested, will briefly review details of the testbed node architecture
- Still awaiting arrival of the full-featured DDR SDRAM controller core, needed for maximum performance of main memory
  - Currently measured sustained throughput is 2.5 Gbps with the restricted controller
  - Burst size fixed at 2 (minimum for DDR), no bank management (see the rough efficiency sketch below)
- High current draw from the new DDR modules is causing noise on the overall power supply
  - Causing reliability issues with the network link, as well as data-integrity issues through the SDRAM
  - Chris' top priority to resolve; will discuss after the node architecture review
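An illustrative, assumption-laden estimate of why a fixed burst size of 2 with no bank management caps sustained throughput well below the bus peak. The overhead-cycle count is a guess for illustration only, not a measured controller parameter.

```python
# With burst length 2, each access moves only one clock's worth of DDR data
# but still pays row activate/precharge overhead on every access (no bank
# management keeps rows open).

def sustained_gbps(clock_mhz, bus_bits=32, burst_len=2, overhead_cycles=2):
    data_cycles = burst_len / 2                    # DDR: two transfers per clock
    efficiency = data_cycles / (data_cycles + overhead_cycles)
    peak = clock_mhz * 1e6 * 2 * bus_bits / 1e9
    return peak * efficiency

# With an assumed ~2 cycles of overhead per access, a 125 MHz, 32-bit DDR bus
# (8 Gbps peak) drops to roughly 2.7 Gbps, near the ~2.5 Gbps measured.
print(f"{sustained_gbps(125):.1f} Gbps")
```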
HCS-CNF Top-level Architecture

[Diagram: Network Fabric / Node Architecture Conceptual Diagram — Command Controller; DMA controller; External Memory Controller (3rd-party DDR SDRAM controller core); Network Interface Controller (3rd-party RapidIO core) with incoming and outgoing links; On-Chip Memory (OCM) Controller; PowerPC; HW modules 1 through N; reset and clock generator (oscillators, hw_reset, misc. I/O); local and remote request ports]

[Diagram: Processor Engine — 32 kB Processing Memory (SRAM) connected to the OCM Controller through a 16 kB write-only port (upper half of the address space) and a 16 kB read-only port (lower half of the address space)]

Note: the entire SRAM is composed of many small SRAM blocks; the OCM side is presented as one contiguous address space, while the processor side can be arranged in various ways (example shown in the diagram)
Notes regarding previous slide
- Each HCS-CNF node includes:
  - Processing elements with 32 KB internal SRAM (each) @ 100 MHz
  - External 128 MB DDR SDRAM storage device @ 100/125 MHz
  - RapidIO endpoint @ 125/250 MHz
  - 64-bit internal data path, except at the processing memory interfaces (32-bit width)
  - DMA-style memory transfers, with transparent remote memory access
  - Arbitrated access to SDRAM storage, with no "starvation" allowed
    - Equally prioritized, 256 bytes per burst (see the arbitration sketch after this list)
  - Command queues and localized control for each major section
- Each colored circle in the diagram indicates an independent clock domain
  - Internal clock frequency of the Network Interface Controller depends on the RapidIO link speed
  - Processors (PowerPC and HW co-processors) are fixed at 100 MHz
  - All other parts of the design operate at the SDRAM clock frequency
- Not shown in the diagram is the control path between the PowerPC and the co-processor engines
  - A simple R/W control register on each co-processor provides the necessary control
- Baseline node design is SDRAM-centric (all transfers involve main memory)
  - Modular design allows architectural flexibility for enhancement
  - An SRAM-to-RapidIO direct path can be achieved by adding one more FIFO between the OCM module and the NIC module and adjusting the control logic at each end
  - Direct data transfer between processing engines only requires a redesign of the OCM module
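A minimal sketch (not the actual HDL) of the starvation-free arbitration policy described above: requesters are equally prioritized and serviced round-robin, with each grant limited to a 256-byte burst before the next requester gets a turn. Requester names and queue contents are illustrative.

```python
from collections import deque

BURST_BYTES = 256

def arbitrate(queues):
    """queues: dict mapping requester name -> deque of pending transfer sizes (bytes).
    Yields (requester, bytes_granted) in round-robin order, 256 bytes per grant."""
    names = list(queues)
    while any(queues[n] for n in names):
        for name in names:
            q = queues[name]
            if not q:
                continue
            grant = min(q[0], BURST_BYTES)   # cap each turn at one 256-byte burst
            q[0] -= grant
            if q[0] == 0:
                q.popleft()
            yield name, grant

# Example: a PE write, a RapidIO receive, and a DMA read contending for SDRAM.
pending = {"pe_write": deque([1024]), "rio_rx": deque([512]), "dma_rd": deque([256])}
for requester, nbytes in arbitrate(pending):
    print(requester, nbytes)
```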
Additional Slides
David Bueno
June 20, 2006
HCS Research Laboratory, ECE Department
University of Florida
Addendum: CPI Execution Time
- Figure shows the breakdown of CPI execution time
- 32k cube size chosen because it does not require network double buffering
  - Double buffering makes it difficult to determine the exact breakdown of execution time due to the overlap of communication and computation
- Data cube receive occupies a significant portion of execution time, explaining the diminishing returns as SDRAM interface speed is increased
  - RIO network speed would then need to be increased as well to maintain high levels of speedup
- Time per stage shrinks as the cube size shrinks
[Chart: CPI Execution Time Components — Percentage of Total CPI Time for 256 Pulses, 32k Ranges, 6 Beams, 16 Gbps SDRAM Bandwidth; stacked components: Cube Receive, Pulse Compression, Local Data Reorganization 1, Network Corner Turn 1, Doppler Processing, Local Data Reorganization 2, Network Corner Turn 2, Beamforming, CFAR]
Addendum: Memory Utilization over Time
- Figures depict utilization of the 16 Gbps DDR SDRAM interface over the execution of 8 CPIs for the 32k- and 64k-range cases
  - Utilization poll interval is 64 us, so "instantaneous" utilization is calculated over each successive 64 us interval (a sketch of this calculation follows the charts)
  - 32k (left): no double buffering required; periods of inactivity, since the cube can be received and processed within 256 ms
  - 64k (right): requires double buffering; no periods of inactivity
- "Medium" peaks (40-60%) are due to processing or corner-turn activity (in these cases bounded by the processor or RIO)
- Maximum peaks (100%) are due to local data distribution (considered to be completely memory bound)
- Note that most of the first and last CPIs of the 64k-range case are "setup" and "wrap-up" CPIs, so double buffering is limited in those CPIs
  - Focus on the middle 6 CPIs of the 64k-range chart to see the full effects of double buffering; essentially loading and emptying a "pipeline"
- Network double buffering slightly raises utilization during periods of data cube distribution (by ~4%)
  - Incoming data slows memory performance during local data redistribution and slows network performance during corner turns
[Chart: 256 Pulses, 6 Beams, 1 Engine per Task per FPGA, 32k Ranges — SDRAM Utilization (%) vs. Time (0-2048 ms)]

[Chart: 256 Pulses, 6 Beams, 1 Engine per Task per FPGA, 64k Ranges — SDRAM Utilization (%) vs. Time (0-2048 ms)]
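A sketch of how the "instantaneous" utilization in the charts above can be computed: busy time on the SDRAM interface is accumulated over successive 64 us polling windows and reported as a percentage of each window. The busy intervals below are invented for illustration; the simulator's own bookkeeping is not shown.

```python
POLL_US = 64.0

def utilization_trace(busy_intervals, total_us):
    """busy_intervals: list of (start_us, end_us) spans where the SDRAM bus was busy.
    Returns a list of per-window utilization percentages over total_us microseconds."""
    n_windows = int(total_us // POLL_US)
    busy_per_window = [0.0] * n_windows
    for start, end in busy_intervals:
        for w in range(n_windows):
            w_start, w_end = w * POLL_US, (w + 1) * POLL_US
            overlap = min(end, w_end) - max(start, w_start)
            if overlap > 0:
                busy_per_window[w] += overlap
    return [100.0 * busy / POLL_US for busy in busy_per_window]

# Example: two busy spans within a 256 us trace (four 64 us windows).
print(utilization_trace([(10, 100), (150, 160)], 256))
```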