08/06/04
Virtual Prototyping of Advanced Space System Architectures Based on RapidIO: Phase I Report
Sponsor:
Honeywell Space Systems, Clearwater, FL
Principal Investigator:
Dr. Alan D. George
OPS Graduate Assistants:
David Bueno, Ian Troxel
RA Graduate Assistants:
Chris Conger, Adam Leko
HCS Research Laboratory, ECE Department
University of Florida
Presentation Outline
- Project Motivation and Goals
- Project Tasks
- Improvements/Additions to RapidIO Models
- GMTI System Designs
- Experiments and Results
- Conclusions
- Future Work
Project Motivation and Goals
- Determine optimal means by which to develop RapidIO for space systems running GMTI
- Perform RIO switch, board, and system tradeoff studies
- Identify limitations of space-based RIO design
- Determine design feasibility of space-based GMTI systems
- Discover optimal architecture for space-based GMTI systems
- Provide assistance for Honeywell proposal efforts
- Lay groundwork for future Honeywell system prototyping
Project Tasks
- Literature review: RIO spec, RIO components, SBR, misc.
- RIO component and system modeling: layers, endpoints, switches, processors, etc.; GMTI traffic models, memory boards, backplanes; script-based processing and algorithm modeling
- Simulation experiments: successful systems, and unsuccessful systems with lessons learned
- Data analysis and report
Items in red indicate additions since May 7, 2004 report
New Model Features
- Script-based processing/traffic flow: allows us to easily model arbitrary applications with information about their computation and communication patterns
- Detailed TDM model for the central memory switch, based on information from Honeywell; cycle-accurate, but does not require cycle-by-cycle simulation
- "Trunking" implemented in the switch model: a probable feature of Honeywell's switch; allows routing of packets with the same destination ID to multiple ports (e.g., the routing table may allow a packet destined for node 4 to exit out ports 4, 5, or 6; ports 4, 5, and 6 are then said to form a "trunk"); port selected via round-robin scheme
- Switch deadlock avoidance features based on feedback from Honeywell: see the previously posted presentation (July 20, 2004) for information on the deadlock problem we experienced; our primary solution is identical to Honeywell's switch features (each port speculatively grabs one buffer, which prevents complete starvation and deadlock)
- Switch memory profiling: at simulation completion, outputs the distribution of time the switch spends with free memory in discrete ranges (e.g., the distribution could indicate that, 25% of the time, the switch has between 1000 and 2000 bytes of free memory); selectable number of ranges
- Link utilization monitoring: can be used to measure throughput at certain points in the system
- ~35% increase in simulation speed, mostly due to optimization of code and learning more about the MLD simulation tool
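The trunking behavior described above can be sketched as follows. This is a minimal illustration, not Honeywell's actual switch logic: a routing table maps a destination ID to a trunk (a list of candidate output ports), and the output port is selected round-robin.

```python
# Hedged sketch of trunked routing with round-robin port selection.
# Port and destination numbers are illustrative only.
from itertools import cycle

class TrunkedRoutingTable:
    def __init__(self, table):
        # table: dest_id -> list of candidate output ports (a "trunk")
        self._cycles = {dest: cycle(ports) for dest, ports in table.items()}

    def next_port(self, dest_id):
        """Pick the next output port for this destination, round-robin."""
        return next(self._cycles[dest_id])

# e.g. packets for node 4 may exit ports 4, 5, or 6 (ports 4-6 form a trunk)
rt = TrunkedRoutingTable({4: [4, 5, 6]})
print([rt.next_port(4) for _ in range(4)])  # [4, 5, 6, 4]
```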
Basic System Description
- Systems created have a variable number of boards, all connected via a RapidIO switched backplane
- Each board has 4 processing elements (PEs) and an 8-port RapidIO switch: one RIO link to each PE, and a total of four links to the backplane
- Each processing element has 8 FPUs
- Backplane design is crucial to good system performance
Base Algorithm Model (1)
- Based on the reported method preferred by Honeywell's customers
- Multi-port data source: required to handle massive GMTI traffic requirements at sub-GHz line rates; could be data coming from global memory, or streaming directly from sensors
- GMTI algorithm breakdown: corner turns optimize data distribution to ensure no communication is needed within any of the four tasks (only in between tasks)
[Figure: GMTI task pipeline: Receive Cube → Pulse Compression → corner turn → Doppler Processing → corner turn → Space-Time Adaptive Processing (STAP) → Constant False Alarm Rate (CFAR) → Send Results; the corner turns re-partition the data between the range and pulse dimensions]
Base Algorithm Model (2)
- Staggered partitioning: data cubes are sent out to groups of processing elements in round-robin fashion
- The amount of time each PG has to receive and handle its data cube is N × CPI, where N = number of processing groups and CPI = amount of time between generated data cubes (in ms)
- If overlapping the computation of the previous CPI with reception of the next CPI is possible, up to (N + 1) × CPI may be allowed; possible through DMA capability on end nodes
[Figure: staggered partitioning timeline: the data source sends data cubes 0-5 round-robin to processing groups PG0, PG1, and PG2 over CPIs 0-4]
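The deadline rule above can be written out directly. A minimal sketch, with illustrative values for N and CPI (not tied to any particular system in this study):

```python
# Sketch of the staggered-partitioning deadline rule:
# each processing group gets N * CPI, or (N+1) * CPI with DMA overlap.
def pg_deadline_ms(n_groups: int, cpi_ms: float, overlap: bool = False) -> float:
    """Time each processing group has to receive and handle one data cube."""
    factor = n_groups + 1 if overlap else n_groups
    return factor * cpi_ms

# e.g. 3 processing groups at a 256 ms CPI:
print(pg_deadline_ms(3, 256.0))        # 768.0  (N * CPI)
print(pg_deadline_ms(3, 256.0, True))  # 1024.0 ((N+1) * CPI with overlap)
```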
Early System Designs & Lessons Learned
- Getting data to the nodes within the CPI timeframe is the most important constraint
- The systems pictured are examples of systems without sufficient bandwidth to function as a real-time GMTI system
- Backplane A: supports 7 sources, 4 boards; Backplane B: supports 2 sources, 9 boards
Fundamental System Design Constraints
- The system must have non-blocking connectivity from data source to processing nodes; otherwise it is impossible for the system to keep up with real-time requirements (the system must be able to source data to endpoints as fast as it receives it)
- The system must use reasonable algorithm parameters for its system size: found using a combination of trial and error and the initial Honeywell GMTI spreadsheet; new equations (explained later) were formulated to predict system feasibility based on the results obtained
- Baseline algorithm parameters for 5-, 6-, and 7-board systems = 2k × 32 × 6 × 256: 2,000 ranges with 32 sub-bands, 6 beams, 256 pulses (CPI = 256 ms)
- All systems consist of 4-node boards with an 8-port RIO switch, connected via a backplane of RIO switches
- Backplanes may not use an excessive number of switches: non-blocking systems are easily constructible for up to 7 boards with 4 data sources, but once the system grows larger than 7 boards, the number of switches needed grows astronomically
- 250 MHz RIO used as baseline (assumes clock rates will double by the time SBR is ready to fly); 125 MHz RIO also studied to show what we can do with current technology
- Store-and-forward routing used in switches
- See Appendix slide for complete simulation parameters
Baseline Five-, Six-, and Seven-Board Systems
- Same backplane design for each system: can conserve power/cost by connecting fewer boards to the backplane
- Could use 6-port switches on the backplane if interested in a 5-board system, or 7-port switches for a 6-board system, to further reduce power/cost
- Ports are organized in groups of 4; each group is connected to all 4 switches, so each board can get to all other boards through any switch
4-Switch Non-blocking Backplane
7-Board System
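A minimal sketch of the connectivity property described above: because each board has one backplane port per switch, every backplane switch attaches to every board, and any single switch connects any pair of boards in one hop. Board/switch counts follow the 7-board, 4-switch example; the representation is ours, not Honeywell's design data.

```python
# Sketch: each board's 4 backplane ports go one-to-each backplane switch.
N_BOARDS, N_SWITCHES = 7, 4

# links[switch] = set of boards attached to that switch
links = {sw: set(range(N_BOARDS)) for sw in range(N_SWITCHES)}

def one_hop_nonblocking(links, n_boards):
    """True if every switch attaches all boards, so any switch joins any pair."""
    return all(len(boards) == n_boards for boards in links.values())

print(one_hop_nonblocking(links, N_BOARDS))  # True
```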
Three-Board System
- Smaller system, with a non-blocking backplane using only two switches
- Requires tuned-down GMTI algorithm parameters (reduced number of ranges from 2000 to 1000, but still with 32 sub-bands)
2-Switch, 3-Board Non-blocking Backplane
125 MHz Six-Board System
- For RapidIO systems running near 125 MHz clock rates
- Requires 8 data-source ports, sending data to 8 nodes at a time
- Requires 2 boards at a time to work on a single CPI
- Requires corner turns to be performed across boards
6-Board GMTI System
Overview of Simulations (1)
- General simulation procedure:
  - Create a script to feed into our processor models, using a program that takes in high-level parameters (CPI, number of processors, data cube size, data reduction parameters) and generates the script file
  - Run the simulation (very long: 18-47 hours)
  - Take the output from the simulation and feed it through a post-simulation script, which analyzes the data and generates graphs and statistics, enabling quick analysis
- Data traffic pattern: receive data from source (lots!); compute for a specified interval (pulse compression); corner turn; compute (Doppler processing); another corner turn; compute (STAP + CFAR); send the result back to the data source board
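The data traffic pattern above can be sketched as a tiny script generator. This is illustrative only: the step names and interface are our invention, not the actual MLD script format used in the study.

```python
# Hedged sketch: emit the ordered step list a processor model would execute
# for one CPI of the GMTI traffic pattern described above.
def gmti_script() -> list:
    compute_stages = ["pulse_compression", "doppler", "stap_cfar"]
    steps = ["receive_cube"]
    for i, stage in enumerate(compute_stages):
        steps.append("compute:" + stage)
        if i < len(compute_stages) - 1:
            steps.append("corner_turn")   # corner turn between compute stages
    steps.append("send_results")
    return steps

print(gmti_script())
# ['receive_cube', 'compute:pulse_compression', 'corner_turn',
#  'compute:doppler', 'corner_turn', 'compute:stap_cfar', 'send_results']
```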
Overview of Simulations (2)
Trade studies performed:
- 250 MHz systems, 4-switch backplane:
  - Using 5, 6, and 7 processing boards (4 compute-node processors per board, 8 FPUs per node processor)
  - Traffic at different priority levels vs. same priority (5, 6, and 7 boards): prioritized = outgoing data highest, corner turn medium, incoming lowest; unprioritized = all data at the same priority, with switch and endpoint thresholds set accordingly; in every case, message-passing (MP) responses are upgraded by 1 class (MP requests are the basic RapidIO message-passing unit)
  - Adjustment of available switch memory (6-board system memory capacity halved, prioritized vs. unprioritized)
- 250 MHz systems, 2-switch backplane (data size ~20% smaller than the nominal setup): prioritized vs. unprioritized with 3 processing boards
- 125 MHz system, 4-switch backplane: nominal data set with no trunking; same hardware set as Honeywell has right now
250 MHz Systems
- 5 vs. 6 vs. 7 processing boards:
  - The system scales well with the number of processors due to the traffic patterns of "processor board per CPI" partitioning; the entire data cube is sent to a single processing board
  - The corner turn only traverses one hop and thus does not burden the backplane
  - The 5-board system has a small overlap of the computation of the previous CPI with the receive of the next CPI, which prevents the system from having 2 processors send to the data source board; as a result, the 5-board system gets data 2 ms faster, and its overall CPI latency is actually 2 ms faster
- Prioritized vs. unprioritized:
  - In general, not much measurable difference between them
  - However, prioritized gets incoming data faster (incoming data has higher priority), while unprioritized gets results slightly faster (all data at the same priority); less than 0.5% difference in overall CPI time
- Reducing total memory in switches by 50%: no effect on overall CPI latency, though overall throughput differs
Results - Explanation of Graphs
- These sample graphs are for explanation purposes, indicating what each type of graph represents
- Memory utilization histogram: x-axis is the amount of free memory (bytes); y-axis is the percent of time spent with that amount of free memory
  - Shows how much time is being spent with a lot or very little memory; charts that show a significant amount of time spent with little free memory imply a congestion problem
  - All memory utilization histograms show switch memory usage on processor boards; backplane switch buffers are generally near-empty for valid systems (no contention due to the "clear path" from data sources to processor nodes)
Results - Explanation of Graphs
- Throughput profile (left): shows the throughput achieved for each received message vs. time; x-axis is elapsed time (ms), y-axis is throughput (GBytes/s)
- Processor utilization profile (right): shows the amount of time spent on each stage of the GMTI algorithm, including communication events; x-axis is CPI, y-axis is elapsed time (ms) relative to the start of the CPI
Results - No Barrier Syncs (250 MHz, 6 boards, prioritized)
- No synchronization between corner-turn steps: one processor gets ahead and monopolizes the switch
- Throughput is seen to suffer; the switch spends almost 1/3 of its time with very low memory
Results - With Barrier Syncs (250 MHz, 6 boards, prioritized)
- Synchronizing between interprocessor communications resolves the issue
- Full throughput is achieved for all communications in the corner turn
- The switch exhibits much better memory utilization characteristics
Results - 5-Board System (250 MHz, 5 boards, not prioritized)
- CPI iterations are exceeding real-time deadlines (predicted: just over 5 boards needed to meet deadlines)
- The processor profile closely matches predictions, within 2 ms for all values; corner turns have lower throughput because of simultaneous sends and receives
- Computation that continues past the real-time deadline can be acceptable in some cases, assuming some computation can overlap with communication (e.g., overlap computation at the end of one CPI with communication at the start of the next CPI)
Results - 50% Reduction in Switch Memory (250 MHz, 6 boards, not prioritized)
- All processors get full bandwidth
- Overall CPI latency is not affected: the full-memory system did not use all of the switch buffer space
Results - 50% Reduction in Switch Memory (250 MHz, 6 boards, prioritized)
- Throughput and CPI latency are about the same as before, but switch memory usage is different
- The first processor to send a message monopolizes switch memory (e.g., processor 2 on the second phase of the first corner turn, shown below in green)
- Prioritized systems give less buffer space to corner-turn traffic: corner turns have second-highest priority, incoming data has highest priority; processor 2 gets full throughput on the corner turn, the other processors do not
Results - 3-Board Systems (250 MHz, 3 boards, prioritized)
- Lower switch memory utilization on processor-board switches due to smaller corner turns
- Prioritized vs. unprioritized made very little difference: less contention for switch memory negates the prioritized advantage
- All else behaved as predicted; the system can sustain performance requirements for GMTI
Results - 125 MHz 6-Board System (125 MHz, 6 boards, not prioritized)
- The 125 MHz system originally starved on corner turns: a corner turn across 2 boards stresses the backplane, which is also delivering data to other boards
- The 250 MHz pipelined system starved as well
- This initial configuration allowed boards to send cross-board corner-turn data out any backplane port (since each switch can get data to any board)
Results - Trunking "Solution" (125 MHz, 6 boards, not prioritized)
- Disable trunking for corner turns; instead, use static load balancing when creating routing tables
  - For example, make processors on board 0 use port 0 to talk to node 5, port 1 for node 6, port 2 for node 7, and port 3 for node 8 (if 5, 6, 7, and 8 are the processor IDs of the processors paired with board 0 on this CPI)
- Each stage of the corner turn needs a clear path from processor to processor; there are no such guarantees with trunking, and trunking can add more dependency loops in the system
- This solution works!
Results - 125 MHz (without trunking)
- The system fails to meet the deadline, but we assume that the initial receive can be overlapped with computation from the previous CPI (DMA)
- Memory utilization falls within the desirable range (much worse behavior was observed before, when the switch was dipping into reserved buffers)
Performance Prediction Equations: Data Cube Latency
Overall time to compute a single data cube:

    Latency = S / (A × B) + C / (p × N × NP × NCB)

Parameters:
- A: maximum bandwidth available to a data stream (fraction of network bandwidth available)
- B: bandwidth for that data stream, including packet/protocol overhead
- S: data stream size
- p: parallel efficiency
- C: single-node compute time
- N: FPUs per ASIC
- NP: number of processors per board
- NCB: number of compute boards used per CPI of data

Notes:
- Computation time and parallel efficiency must be provided by Honeywell; we use values from Honeywell's GMTI spreadsheet and assume parallel efficiency close enough to 1.0 to approximate with 1.0
- Communication parameters are provided by the RapidIO models and can be statically predicted for systems providing enough throughput: in general, systems with non-blocking connectivity between N data sources and any N nodes that will process a CPI
- Simulations either succeed or fail miserably (no in-between), so simulations are needed to verify the feasibility of each system before using the equations; the equations are not valid for systems that fail
- For this phase, we assumed A = 1.0 (no other management traffic on the network)
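The latency equation can be written directly in code. A minimal sketch, with the example values below being illustrative (C taken as the serial compute time spread over all FPUs working the CPI), not a definitive implementation:

```python
# Sketch of the data cube latency equation.
# Units: S in bytes, B in bytes/s, C in seconds.
def cube_latency(S, B, C, A=1.0, p=1.0, N=8, NP=4, NCB=1):
    comm = S / (A * B)              # time to move the data stream
    comp = C / (p * N * NP * NCB)   # serial compute time spread over all FPUs
    return comm + comp

# Illustrative: a ~196.6 MB stream at ~0.955 GB/s effective bandwidth, plus
# 29.294 s of serial compute over 8 FPUs x 4 processors x 1 board
t = cube_latency(S=196_608_000, B=0.955e9, C=29.294)
print(round(t, 3))  # ~1.121 s for this receive + compute portion
```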
Performance Prediction Equations: Communication
Decomposition: each processor board computes an entire CPI

Parameters:
- BGM: bandwidth from data source board nodes to processing board nodes
- SDC: size of the incoming data cube
- BCT: bandwidth between nodes doing the corner turn (including MP response overhead)
- SCT1, SCT2: sizes of the two corner turns (one between Pulse + Doppler nodes, one between Doppler + STAP/CFAR nodes)
- SFIN: size of the final data set after all processing

Experimentally derived parameters (250 MHz RapidIO, packet headers are 12 bytes in length):
- BGM = 0.955 (same as 256 bytes / 268 bytes)
- BCT = 0.914 (same as 256 bytes / (268 bytes + 12 bytes [MP response]))

Corner-turn modeling:
- Each processor sends N - 1 messages of size ([(N - 1) / N] × total data size) / (N - 1) = N - 1 messages of size (total data size / N)
- Explanation: each processor receives 1/N of the total data from N processors, excluding itself
- See appendix for a performance prediction example
    Communication time = SDC / BGM + (SCT1 + SCT2) / BCT + SFIN / BGM
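The overhead fractions and the corner-turn message sizing described above can be sketched as follows, using the report's 256-byte payloads and 12-byte header/response sizes (the helper names are ours):

```python
# Effective bandwidth fractions from packet overheads (slide values).
PAYLOAD, HEADER, MP_RESPONSE = 256, 12, 12  # bytes

B_GM = PAYLOAD / (PAYLOAD + HEADER)                # 256/268 ~= 0.955
B_CT = PAYLOAD / (PAYLOAD + HEADER + MP_RESPONSE)  # 256/280 ~= 0.914

def corner_turn_messages(total_bytes: int, n: int):
    """Each of n processors sends n - 1 messages of size total_bytes / n."""
    return n - 1, total_bytes // n

print(round(B_GM, 3), round(B_CT, 3))  # 0.955 0.914
print(corner_turn_messages(1000, 4))   # (3, 250)
```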
Performance Equations vs. Simulation

                       Equation       Simulation
Cube receive time      205.824 ms     205.917 ms
Corner turn 1 time      96.768 ms      96.770 ms
Corner turn 2 time      64.512 ms      64.513 ms
Data send time           2.058 ms       2.152 ms
Overall CPI latency   1284.523 ms    1284.690 ms
Conclusions
- GMTI is not very sensitive to latency: large data set, low synchronization frequency (but much data to be synchronized)
- The dependence on throughput requires traffic to be carefully mapped through the switch network beforehand; non-blocking backplane architectures are very important to performance
- A combination of analytical approach and computer simulation provides effective performance prediction
  - For the partitionings under study, spreadsheet formulas can be used to accurately predict the simulation's behavior for systems potentially capable of handling the workload
  - Corner turns across one processor board are predictable; however, barrier syncs are necessary to prevent nodes from getting ahead of each other and monopolizing the network
  - The spreadsheet for performance prediction is posted on the project website
- Dynamic behavior (trunking) can help in some cases and hurt in others; GMTI's traffic patterns are fairly predictable, making static load balancing via routing tables the best overall solution considered
- It is possible to do GMTI with a data cube of 2k ranges × 32 sub-bands × 256 pulses × 6 beams using Honeywell's current or emerging technologies; there is not much elbow room, but the cube size can always be decreased as needed
Future Work
- Explore pipelined GMTI configurations: will be included in results for the HPEC '04 submission due August 30
- Explore switch memory management policies: may help with problems of nodes "taking over" a switch port during corner turns (seen during experiments with synchronization disabled); also will be included in the HPEC submission
- The HPEC submission will of course be shared and cleared with Honeywell; any additional data gathered will be provided to Honeywell in an addendum to this report
- Perform similar case studies for SAR: determine the optimal system configuration and algorithm partitioning; will require literature search, script development, and system-level modeling; the fundamental model components are all in place and can be modified as necessary to reflect new developments from Honeywell
- Construct a simple experimental testbed for RapidIO: useful for model validation and future projects; start with two endpoint link partners, add a small switch later as resources permit
- Explore methods to improve simulation speed: packet-level simulations of SBR/RIO with gigabytes of data can be very slow; several methods may be useful (e.g., parallel and distributed simulation, model refinements)
Appendix: Baseline Simulation Parameters
- Store-and-forward routing
- 250 MHz DDR RIO links
- 16-bit RIO links
- Endpoint input/output queue length = 8 RIO packets
- Endpoint priority 0 threshold = 5 packets (endpoint retries priority 0 packets if it has more than 5 packets in its buffer)
- Other endpoint priority thresholds = 7
- Maximum payload size = 256 bytes
- Packet disassembly delay = 14 ns
- Response creation delay = 12 ns
- Responses upgraded 1 level of priority
- Switch priority 0 threshold = 1000 bytes (switch retries priority 0 packets if it has less than 1000 bytes of free memory)
- Other switch priority thresholds = 0 bytes (always accepts a priority 1 or greater packet if it has room)
- TDM window size = 64 ns
- TDM data copied per window = 64 bytes
- TDM minimum delay = 16 ns
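The switch admission rule implied by these thresholds can be sketched as follows. This is our hedged reading of the parameters above (priority 0 retried below a free-memory threshold, priority 1+ accepted whenever it fits), not the actual model code:

```python
# Hedged sketch of the switch buffer-admission rule from the parameters above.
PRIO0_THRESHOLD = 1000  # bytes of free memory required to accept priority 0

def switch_accepts(priority: int, packet_bytes: int, free_bytes: int) -> bool:
    if packet_bytes > free_bytes:
        return False  # no room at all: packet is retried
    if priority == 0:
        return free_bytes >= PRIO0_THRESHOLD  # priority 0 threshold = 1000 B
    return True       # priority 1+ accepted if it fits (threshold = 0 bytes)

print(switch_accepts(0, 268, 900))  # False: below the priority-0 threshold
print(switch_accepts(1, 268, 900))  # True: higher priority, and it fits
```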
Appendix: Performance Prediction Example
System (NCB = 1, NP = 4):
- 250 MHz RapidIO link rate
- Five compute boards @ 250 MHz (one board per CPI), eight FPUs per ASIC
- Data cube size: 64,000 ranges (2,000 × 32 sub-bands), 256 pulses, 6 beams

Based on the spreadsheet, total processing time per processor board:
- Pulse compression: 13.829 s
- Doppler: 3.330 s
- STAP: 1.835 s
- CFAR: 10.300 s

Each processor board's incoming data size (bytes):
- Pulse receive: 786,432,000
- Doppler receive (CT1): 471,859,200
- STAP receive (CT2): 314,572,800
- Final data send: 1% of original incoming data = 7,864,320

Corresponding equation parameters (assume A = 1.0, p = 1.0):
- SDC = 786,432,000 / 4 = 196,608,000
- SCT1 = (471,859,200 / 4) × (3 / 4) = 88,473,600
- SCT2 = (314,572,800 / 4) × (3 / 4) = 58,982,400
- SFIN = 7,864,320 / 4 = 1,966,080
- SDC / BGM = 205.824 ms
- SCT1 / BCT = 96.768 ms
- SCT2 / BCT = 64.512 ms
- SFIN / BGM = 2.058 ms
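As a check, the four communication terms above can be reproduced numerically. We assume a 16-bit DDR RapidIO link at 250 MHz yields 1 GB/s of raw bandwidth (this link-rate interpretation is ours), with the report's 256-byte payloads, 12-byte headers, and 12-byte MP responses:

```python
# Reproduce the communication terms of the performance prediction example.
LINK_BPS = 250e6 * 2 * 2       # 250 MHz x DDR x 2 bytes/beat = 1e9 bytes/s (assumed)
B_GM = LINK_BPS * 256 / 268    # data-source traffic efficiency
B_CT = LINK_BPS * 256 / 280    # corner-turn traffic (includes MP response)

S_DC, S_CT1, S_CT2, S_FIN = 196_608_000, 88_473_600, 58_982_400, 1_966_080

for label, size, bw in [("cube receive", S_DC, B_GM),
                        ("corner turn 1", S_CT1, B_CT),
                        ("corner turn 2", S_CT2, B_CT),
                        ("data send", S_FIN, B_GM)]:
    print(f"{label}: {1000 * size / bw:.3f} ms")
# cube receive: 205.824 ms
# corner turn 1: 96.768 ms
# corner turn 2: 64.512 ms
# data send: 2.058 ms
```

These match the "Equation" column of the comparison table earlier in the report.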