Simulation Case Studies Update
David Bueno
June 19, 2006
HCS Research Laboratory, ECE Department
University of Florida
Overview
- Set of case studies examining the performance of realistic GMTI kernels, with detailed modeling of the processing and memory subsystems
- Models developed during the Summer '05 internship serve as the basis for improvements
  - Models modified to change/cover sensitive information and, in some cases, calibrated to more closely resemble Chris' testbed setup (e.g., 250 MHz RIO)
- Modeling environment allows us to change parameters, scale, and network architecture with a much greater degree of freedom than the testbed
- Three main points of emphasis for the case studies:
  - Gain insight into tradeoffs in the shared interface between processor, network, and memory
  - Develop a greater understanding of SBR processing/network requirements and optimal system configuration
  - Quantify RapidIO latency and examine methods of improving latency for critical data
- Also intend to analytically examine the previously studied FT RIO architectures and discuss how they may apply to the case study scenarios
System Architecture
- Two network architectures similar to previous GMTI/SAR experiments
- Small-scale system assumes FPGA-level TMR ("current" technology)
  - 3 FPGAs (1 logical PE) per card
  - Maximum of 7 processor boards (7 logical PEs), 1 global-memory board
  - 4-switch network backplane card (8-port switches)
    - Clos-like backplane with first- and second-stage switches on the same card
  - 1 system controller and 1 spacecraft interface card, each connected to the RIO network via a single dedicated link
  - High-level models act as the source of latency-sensitive data
- Large-scale system (shown) assumes fully radiation-hardened FPGAs or additional management software (not modeled) capable of handling SEUs in the FPGAs (e.g., DM-like system, "near future" technology)
  - 7 processor cards, with 4 FPGAs (4 logical PEs) and 1 RapidIO switch per card; 28 physical and logical FPGAs total
  - 4-switch network backplane card (9-port switches)
    - Clos-like backplane with second stage only
  - 1 system controller and 1 spacecraft interface card
Major Improvements
- Several changes from previous SBR experiments:
  - Computation time based on a working preliminary RIO testbed implementation of the GMTI kernels
  - Detailed memory access model for SDRAM
    - Models contention between the processing elements and the RIO interface
    - SRAM access is deterministic and considered part of the measured computation time
  - Support for measurement of latency-sensitive traffic, with latency values based on the Honeywell RapidIO implementation
  - New scaling method reduces simulation runtimes from hours (overnight) to ~30 minutes (see the sketch after this list)
    - Shrink the data cube (and CPI) along the pulses dimension for simulation, then scale the reported results to the full CPI size
    - Verified accurate to <<1% error for all cases where the system is able to meet real-time deadlines
    - Super-linear speedups in simulation runtime observed due to reduced memory/disk access by the MLD simulator
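A minimal sketch of the cube-shrinking idea described above, assuming the simulated quantities scale linearly with the number of pulses. The reduced pulse count and function names are illustrative only and are not taken from the actual MLD models.

```python
# Hypothetical sketch: shrink the data cube along the pulses dimension for
# simulation, then scale the reported per-CPI results back to full size.

FULL_PULSES = 256          # full CPI size along the pulses dimension
SIM_PULSES = 32            # reduced size actually simulated (assumed value)

def shrink_cube(pulses, beams, ranges, sim_pulses=SIM_PULSES):
    """Return the reduced cube dimensions used for simulation."""
    return (sim_pulses, beams, ranges)

def scale_result(sim_value, full_pulses=FULL_PULSES, sim_pulses=SIM_PULSES):
    """Scale a simulated per-CPI quantity (e.g. processing or transfer time)
    back to the full CPI size, assuming it grows linearly with pulses."""
    return sim_value * (full_pulses / sim_pulses)

# Example: 40 ms of simulated CPI latency on the shrunken cube maps to roughly
# 320 ms at full size under the linear-scaling assumption.
print(scale_result(40.0))
```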
GMTI Case Study
- CPI data arrives at the global memory board every 256 ms
  - Store-and-forward pre-processing (e.g., ECCM beamforming) performed outside the scope of the case study
  - 32 bits per element (16 bits real, 16 bits imaginary) to match the testbed
- Four kernels compose the processing stages, matching the testbed implementation: Pulse Compression, Doppler Processing, Beamforming, CFAR
- Each input data cube is 256 pulses by 6 beams, with the number of ranges varied (see the cube-size sketch after this list)
  - Cube size shrinks by a factor of 0.6 along the range dimension after Pulse Compression
  - Cube size shrinks by a factor of 0.5 along the beams dimension after Beamforming (3 beams formed from 6 input)
  - 1 KB of detection results reported to the system controller at the conclusion of CFAR
- Latency-sensitive control data arrives from the spacecraft for delivery to a randomly selected processing node at Poisson-distributed intervals
- Latency-sensitive health/timer/status data is sent from the system controller to each processing node at regular intervals
- Processing nodes respond to this data with RapidIO responses that are also sensitive to latency and jitter
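A small sketch of how the data cube dimensions and size evolve through the four stages, using the element size and shrink factors stated above. The helper names are illustrative, not part of the testbed or simulator code.

```python
# Cube-size arithmetic for the GMTI case study: 32-bit (4-byte) elements,
# range dimension shrinks by 0.6 after Pulse Compression, beams shrink by 0.5
# after Beamforming. (~1 KB of CFAR detections is reported separately to the
# system controller and is not part of the cube.)

def cube_bytes(pulses, beams, ranges, bytes_per_elem=4):
    return pulses * beams * ranges * bytes_per_elem

def gmti_stage_sizes(pulses=256, beams=6, ranges=32 * 1024):
    sizes = {"input": cube_bytes(pulses, beams, ranges)}
    ranges = int(ranges * 0.6)            # Pulse Compression shrinks ranges
    sizes["after_pulse_compression"] = cube_bytes(pulses, beams, ranges)
    # Doppler Processing leaves the cube dimensions unchanged
    beams = beams // 2                    # Beamforming: 3 beams formed from 6
    sizes["after_beamforming"] = cube_bytes(pulses, beams, ranges)
    return sizes

for stage, size in gmti_stage_sizes().items():
    print(f"{stage}: {size / 2**20:.1f} MiB")
```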
Single-Engine Results
- One processing engine in each FPGA dedicated to each GMTI kernel (Pulse Compression, Doppler Processing, Beamforming, CFAR; 4 engines total per FPGA)
- Most significant jump in performance lies between 8 Gbps (125 MHz interface) and 16 Gbps (250 MHz interface) in all cases
  - Between these values, the memory interface goes from under-provisioned to adequately provisioned for the given network and processing traffic patterns
- 64k-range cube experiences a non-linear performance penalty compared to the 32k- and 48k-range cases
  - Penalty due to double buffering: sharing of the SDRAM interface between incoming RIO data and PE accesses/corner-turn RIO data
  - Smaller cubes require limited double buffering (except in the 8 Gbps, 48k case), since the data can be delivered and processed in under 256 ms, leaving potential for overlap
[Chart: GMTI, 256 Pulses, 6 Beams, 1 Engine per Task per FPGA — Avg. CPI Latency (ms) vs. Max SDRAM Interface Bandwidth (8-32 Gbps), for 32k, 48k, and 64k Ranges]
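A quick check of the SDRAM interface speeds on the chart's x-axis: a 32-bit DDR bus transfers two words per clock, so peak bandwidth scales directly with the interface clock, matching the 125 MHz = 8 Gbps and 250 MHz = 16 Gbps figures above.

```python
# Peak bandwidth of a 32-bit DDR bus: two transfers per clock cycle.
def ddr_peak_gbps(clock_mhz, bus_bits=32):
    return clock_mhz * 1e6 * 2 * bus_bits / 1e9

for mhz in (125, 250, 375, 500):
    print(f"{mhz} MHz -> {ddr_peak_gbps(mhz):.0f} Gbps")   # 8, 16, 24, 32 Gbps
```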
Multiple-Engine Results
- Number of processing engines for each GMTI task in each FPGA varied from 1 to 4 (up to 16 total engines per FPGA)
- As processing engines are added, performance becomes increasingly memory bound and the PE requires more memory bandwidth to benefit from the increased processing capability
  - Each successive increment in memory interface clock allows the effective addition of 1 more engine per task
  - However, diminishing returns overall for each speed increase beyond 16 Gbps, due to the additional dependence on network throughput for data delivery and corner turns
- RIO interface requires ~4 Gbps (theoretical max) of memory bandwidth for each direction of traffic
  - The 8 Gbps required by full-duplex corner-turn communication maxes out a 125 MHz, 32-bit DDR bus (8 Gbps) without any additional PE traffic
- Double buffering of processing at the PE level requires memory bandwidth to be 2x the "processing bandwidth" for one engine
  - For all experiments, the FPGA can process one 32-bit element per cycle at 125 MHz (4 Gbps)
  - The 125 MHz bus therefore leaves zero margin for any RapidIO traffic alongside processing
- Most stressful memory access period is actually not the corner-turn network traffic, since no PE access occurs during that time
  - Instead, double-buffered processing and network traffic may require PE reads, PE writes, AND reception of the next data cube (see the bandwidth-budget sketch after the chart below)
- Note: it is important to distinguish double buffering of processing data (e.g., performing an FFT while loading the next chunk of data to FFT into SRAM) from double buffering of network data (i.e., processing one data cube while receiving the next from the sensors)
[Chart: GMTI, 256 Pulses, 64k Ranges, 6 Beams — Avg. CPI Latency (ms) vs. Number of Engines per Task (1-4), for 8, 16, 24, and 32 Gbps SDRAM interfaces]
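A hedged sketch of the memory-bandwidth budget described above, adding up demands on the shared SDRAM interface during the worst-case period when double-buffered processing overlaps with reception of the next cube. The 4 Gbps per-engine and per-RIO-direction figures come from the slide; the simple additive model is an assumption, not the simulator's contention model.

```python
# Worst-case SDRAM demand: double-buffered processing needs read + write
# bandwidth per engine (2 x 4 Gbps), and receiving the next data cube adds
# one more direction of RapidIO traffic (~4 Gbps).

PROC_GBPS_PER_ENGINE = 4.0   # one 32-bit element per cycle at 125 MHz
RIO_GBPS_PER_DIR = 4.0       # theoretical max per direction of RapidIO traffic

def worst_case_demand_gbps(engines_per_task, receiving_next_cube=True):
    processing = engines_per_task * 2 * PROC_GBPS_PER_ENGINE
    network = RIO_GBPS_PER_DIR if receiving_next_cube else 0.0
    return processing + network

for engines in range(1, 5):
    demand = worst_case_demand_gbps(engines)
    print(f"{engines} engine(s) per task: ~{demand:.0f} Gbps worst-case SDRAM demand")
```

Under this simple model, each additional engine per task adds roughly 8 Gbps of demand, which is consistent with the observation that each memory-interface speed increment buys about one more engine's worth of benefit.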
Conclusions
- Testbed and simulation case studies are providing valuable insight into the implementation of real-world RapidIO systems for SBR processing
- Performance of the DDR SDRAM interface is the major determining factor in the performance of the GMTI kernels
  - Heavily influences performance of both the PE and the RapidIO network
- Double buffering of both network and processing data greatly taxes memory, even more than network corner turns
  - Even if receiving a data cube while performing a corner turn, RIO network flow control ensures the memory interface is not taxed at more than 8 Gbps for 250 MHz RapidIO
  - However, if receiving a data cube while performing double-buffered processor access, the RIO network will require 4 Gbps while each engine will also require 8 Gbps of throughput
- With the current testbed configuration, there is no point in implementing support for multiple engines per task on a single FPGA
  - Simulation results save Chris some pain and suffering
Future Work
- Addition of charts showing memory bandwidth utilization over time
- Addition of a chart showing the execution-time components of a CPI
- Inclusion of small-scale system results
  - Mainly as proof-of-concept for current technology
  - Preliminary results show the same trends at smaller scale
- Quantification of FPGA resources required for each configuration used in the experiments
  - Mostly done; just need to calculate estimates with Chris for stages not yet implemented in the testbed (i.e., addition of magnituding to CFAR, Beamforming)
  - Baseline configuration (1 engine per task per FPGA) estimated to nearly fit in our current testbed FPGAs
- Study of latency values and latency-improvement tactics
  - Cut-through routing, preemption, dedicated paths, reduction of packet size, direct access from the RapidIO network to SRAM
- SAR global-memory-based FFT case studies
  - Suite of experiments similar to the GMTI case studies
  - Also examine performance with processing performed directly out of global memory over RIO
Brief Testbed Update
Chris Conger
June 19, 2006
HCS Research Laboratory, ECE Department
University of Florida
Review of Testbed Node Architecture
- As requested, will briefly review details of the testbed node architecture
- Still awaiting arrival of the full-featured DDR SDRAM controller core, needed for maximum performance of main memory
  - Currently measured sustained throughput is 2.5 Gbps with the restricted controller
  - Burst size fixed at 2 (minimum for DDR), no bank management (see the rough efficiency sketch below)
- High current draw from the new DDR modules is causing noise on the overall power supply
  - Causing reliability issues with the network link, as well as data-integrity issues through the SDRAM
  - Chris' top priority to resolve; will discuss after the node architecture review
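An illustrative, assumption-laden estimate of why a fixed burst size of 2 with no bank management caps sustained throughput well below the bus peak. The overhead-cycle count is a guess for illustration only, not a measured controller parameter.

```python
# With burst length 2, each access moves only one clock's worth of DDR data
# but still pays row activate/precharge overhead on every access (no bank
# management keeps rows open).

def sustained_gbps(clock_mhz, bus_bits=32, burst_len=2, overhead_cycles=2):
    data_cycles = burst_len / 2                    # DDR: two transfers per clock
    efficiency = data_cycles / (data_cycles + overhead_cycles)
    peak = clock_mhz * 1e6 * 2 * bus_bits / 1e9
    return peak * efficiency

# With an assumed ~2 cycles of overhead per access, a 125 MHz, 32-bit DDR bus
# (8 Gbps peak) drops to roughly 2.7 Gbps, near the ~2.5 Gbps measured.
print(f"{sustained_gbps(125):.1f} Gbps")
```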
HCS-CNF Top-level Architecture

[Diagram: Network Fabric / Node Architecture Conceptual Diagram — Command Controller; DMA controller; External Memory Controller (3rd-party DDR SDRAM controller core); Network Interface Controller (3rd-party RapidIO core) with incoming and outgoing links; On-Chip Memory (OCM) Controller; PowerPC; HW modules 1 through N; reset and clock generator (oscillators, hw_reset, misc. I/O); local and remote request ports]

[Diagram: Processor Engine — 32 kB Processing Memory (SRAM) connected to the OCM Controller through a 16 kB write-only port (upper half of the address space) and a 16 kB read-only port (lower half of the address space)]

Note: the entire SRAM is composed of many small SRAM blocks; the OCM side is presented as one contiguous address space, while the processor side can be arranged in various ways (example shown in the diagram)
Notes regarding previous slide
- Each HCS-CNF node includes:
  - Processing elements with 32 KB internal SRAM (each) @ 100 MHz
  - External 128 MB DDR SDRAM storage device @ 100/125 MHz
  - RapidIO endpoint @ 125/250 MHz
  - 64-bit internal data path, except at the processing memory interfaces (32-bit width)
  - DMA-style memory transfers, with transparent remote memory access
  - Arbitrated access to SDRAM storage, with no "starvation" allowed
    - Equally prioritized, 256 bytes per burst (see the arbitration sketch after this list)
  - Command queues and localized control for each major section
- Each colored circle in the diagram indicates an independent clock domain
  - Internal clock frequency of the Network Interface Controller depends on the RapidIO link speed
  - Processors (PowerPC and HW co-processors) are fixed at 100 MHz
  - All other parts of the design operate at the SDRAM clock frequency
- Not shown in the diagram is the control path between the PowerPC and the co-processor engines
  - A simple R/W control register on each co-processor provides the necessary control
- Baseline node design is SDRAM-centric (all transfers involve main memory)
  - Modular design allows architectural flexibility for enhancement
  - An SRAM-to-RapidIO direct path can be achieved by adding one more FIFO between the OCM module and the NIC module and adjusting the control logic at each end
  - Direct data transfer between processing engines only requires a redesign of the OCM module
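A minimal sketch (not the actual HDL) of the starvation-free arbitration policy described above: requesters are equally prioritized and serviced round-robin, with each grant limited to a 256-byte burst before the next requester gets a turn. Requester names and queue contents are illustrative.

```python
from collections import deque

BURST_BYTES = 256

def arbitrate(queues):
    """queues: dict mapping requester name -> deque of pending transfer sizes (bytes).
    Yields (requester, bytes_granted) in round-robin order, 256 bytes per grant."""
    names = list(queues)
    while any(queues[n] for n in names):
        for name in names:
            q = queues[name]
            if not q:
                continue
            grant = min(q[0], BURST_BYTES)   # cap each turn at one 256-byte burst
            q[0] -= grant
            if q[0] == 0:
                q.popleft()
            yield name, grant

# Example: a PE write, a RapidIO receive, and a DMA read contending for SDRAM.
pending = {"pe_write": deque([1024]), "rio_rx": deque([512]), "dma_rd": deque([256])}
for requester, nbytes in arbitrate(pending):
    print(requester, nbytes)
```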
Additional Slides
David Bueno
June 20, 2006
HCS Research Laboratory, ECE Department
University of Florida
Addendum: CPI Execution Time
- Figure shows the breakdown of CPI execution time
- 32k cube size chosen because it does not require network double buffering
  - Double buffering makes it difficult to determine the exact breakdown of execution time due to the overlap of communication and computation
- Data cube receive occupies a significant portion of execution time, explaining the diminishing returns as SDRAM interface speed is increased
  - RIO network speed would then need to be increased as well to maintain high levels of speedup
- Time per stage shrinks as the cube size shrinks
[Chart: CPI Execution Time Components — Percentage of Total CPI Time for 256 Pulses, 32k Ranges, 6 Beams, 16 Gbps SDRAM Bandwidth; stacked components: Cube Receive, Pulse Compression, Local Data Reorganization 1, Network Corner Turn 1, Doppler Processing, Local Data Reorganization 2, Network Corner Turn 2, Beamforming, CFAR]
Addendum: Memory Utilization over Time
- Figures depict utilization of the 16 Gbps DDR SDRAM interface over the execution of 8 CPIs for the 32k- and 64k-range cases
  - Utilization poll interval is 64 us, so "instantaneous" utilization is calculated over each successive 64 us interval (a sketch of this calculation follows the charts)
  - 32k (left): no double buffering required; periods of inactivity, since the cube can be received and processed within 256 ms
  - 64k (right): requires double buffering; no periods of inactivity
- "Medium" peaks (40-60%) are due to processing or corner-turn activity (in these cases bounded by the processor or RIO)
- Maximum peaks (100%) are due to local data distribution (considered to be completely memory bound)
- Note that most of the first and last CPIs of the 64k-range case are "setup" and "wrap-up" CPIs, so double buffering is limited in those CPIs
  - Focus on the middle 6 CPIs of the 64k-range chart to see the full effects of double buffering; essentially loading and emptying a "pipeline"
- Network double buffering slightly raises utilization during periods of data cube distribution (by ~4%)
  - Incoming data slows memory performance during local data redistribution and slows network performance during corner turns
[Chart: 256 Pulses, 6 Beams, 1 Engine per Task per FPGA, 32k Ranges — SDRAM Utilization (%) vs. Time (0-2048 ms)]

[Chart: 256 Pulses, 6 Beams, 1 Engine per Task per FPGA, 64k Ranges — SDRAM Utilization (%) vs. Time (0-2048 ms)]
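A sketch of how the "instantaneous" utilization in the charts above can be computed: busy time on the SDRAM interface is accumulated over successive 64 us polling windows and reported as a percentage of each window. The busy intervals below are invented for illustration; the simulator's own bookkeeping is not shown.

```python
POLL_US = 64.0

def utilization_trace(busy_intervals, total_us):
    """busy_intervals: list of (start_us, end_us) spans where the SDRAM bus was busy.
    Returns a list of per-window utilization percentages over total_us microseconds."""
    n_windows = int(total_us // POLL_US)
    busy_per_window = [0.0] * n_windows
    for start, end in busy_intervals:
        for w in range(n_windows):
            w_start, w_end = w * POLL_US, (w + 1) * POLL_US
            overlap = min(end, w_end) - max(start, w_start)
            if overlap > 0:
                busy_per_window[w] += overlap
    return [100.0 * busy / POLL_US for busy in busy_per_window]

# Example: two busy spans within a 256 us trace (four 64 us windows).
print(utilization_trace([(10, 100), (150, 160)], 256))
```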