12/10/04 1 Virtual Prototyping of Advanced Space System Architectures Based on RapidIO: Phase II Report Sponsor: Honeywell Space Systems, Clearwater, FL

12/10/04

Virtual Prototyping of Advanced Space System Architectures Based on RapidIO: Phase II Report

Sponsor: Honeywell Space Systems, Clearwater, FL

Principal Investigator: Dr. Alan D. George

Funded Graduate Assistants: David Bueno, Ian Troxel, Chris Conger

Additional Graduate Assistants: Adam Leko

HCS Research Laboratory, ECE Department

University of Florida

Presentation Outline
  Project Motivation and Goals
  Project Tasks Overview
  SAR Algorithm Flow
  RapidIO Logical I/O Layer Models
  MLD/RapidIO/Altia GUI
  SAR System Designs
  Previous Work Summary
  Experiments and Results
  GUI Demo
  Conclusions
  Future Work
  Collaboration Possibilities

Project Motivation and Goals
  Simulative analysis of Space-Based Radar (SBR) systems using RapidIO interconnection networks
    RapidIO is a high-performance, switched interconnect for embedded systems
    Good scalability with better bisection bandwidth than bus-based designs
  Build upon work from previous semesters
    RapidIO simulation model construction set
    GMTI partitioning, system design, and performance evaluation
    Investigation of RapidIO switch and routing issues
  Study optimal method of constructing scalable RIO-based systems for Synthetic Aperture Radar (SAR)
    Identify system-level tradeoffs in system designs
    Discrete-event simulation of RapidIO network, processing elements, and SAR algorithm
    Identify limitations of RIO design for SBR
    Determine effectiveness of various SAR algorithm partitionings over RIO network
  Image courtesy http://www.afa.org/magazine/aug2002/0802radar.asp

Project Tasks Overview
  RIO and SAR Modeling
    Updated GMTI results
    Developed RIO Logical I/O Layer model
    SAR-specific models developed
      Created standard and double-buffered versions of SAR
    Numerous experiments performed
  Demonstration GUI
    Altia tool acquired and linked to MLD
    Generic RIO demo interface created
    SAR-RIO demo interface created
  RapidIO Testbed
    RIO physical and logical layer cores acquired from Xilinx
    Two Virtex-II Pro development boards
    Acquiring test equipment
    Examining other RIO core options including GDA

RapidIO Logical I/O Layer Model
  RapidIO three-layer architecture
    Logical (end-to-end), transport, physical
  Previously developed RIO models use RapidIO Message Passing (MP) Logical Layer
  New I/O Logical Layer model provides memory-mapped reads and writes
    Well suited to our global memory-based SAR approach
    Provides potentially increased performance through responseless writes
  Allows global memory (GM) board to function without algorithm knowledge
    With MP logical layer, GM board must “send” data to processors; with I/O, processors just “read” memory
  Easy “plug-and-play” compatibility with existing RIO physical layer models
  [Figure: MLD Logical I/O Model]

Merged Logical Layers
  Packets coming into the logical layer from the physical layer or application layers are routed to the appropriate logical layer component (Logical I/O or Message Passing)
  [Figure: MLD Merged Logical Layers Model; callouts: "Select appropriate logical layer block", "Logical layer blocks"]
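The routing step on this slide could be sketched roughly as follows. This is only an illustration: the `Packet` class, its `layer` tag, and the `MergedLogicalLayer` class are hypothetical names, and the report does not say how the MLD model actually identifies the target logical layer.

```python
from dataclasses import dataclass


@dataclass
class Packet:
    layer: str      # hypothetical tag: "io" (Logical I/O) or "mp" (Message Passing)
    payload: bytes


class MergedLogicalLayer:
    """Toy dispatcher that hands each packet to one of two logical layer blocks."""

    def __init__(self):
        # Each list stands in for a logical layer block's input queue.
        self.blocks = {"io": [], "mp": []}

    def dispatch(self, packet):
        # Select the appropriate logical layer block for this packet.
        if packet.layer not in self.blocks:
            raise ValueError(f"unknown logical layer: {packet.layer}")
        self.blocks[packet.layer].append(packet)
        return packet.layer


merged = MergedLogicalLayer()
merged.dispatch(Packet("io", b"read request"))
merged.dispatch(Packet("mp", b"message"))
print({name: len(queue) for name, queue in merged.blocks.items()})
```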

Altia GUI Interface to MLD
  Ported graphical interface tool Altia to MLD
    Altia designed to integrate with arbitrary C-based applications
    Created MLD library of Altia components for fast integration
    Use Altia to create custom, useful GUI to control/monitor simulations
  Total of 3 components used to interface Altia
    Initialization module: launch GUI, register connections
    Input module: receive data from Altia, pass to MLD simulation
    Output module: send data from MLD simulation to Altia
  [Figure: Altia-MLD interface components]

Altia GUI Interface to MLD
  Designed GUI demo systems to illustrate potential
    Two systems constructed, designed using our RapidIO construction set
    SBR-demo system provides real-time performance visualization for existing SAR and GMTI simulations
    Input-demo system allows user to control system behavior through GUI controls and observe system reaction in real time
  [Figures: Altia interface module as seen in MLD; SBR-demo GUI to visualize simulations]

SAR Algorithm Flow
  SAR composed of 7 sub-tasks
    2-dimensional data set (image), processed iteratively
    Due to large image size, compute nodes must process portions of the total data set, looping until entire image is processed for each sub-task
    Processed out of global memory, cyclic read-compute-write
    Each sub-task’s optimal data partitioning varies
  Data size stays constant throughout algorithm**
    As opposed to the monotonically-decreasing data size of GMTI algorithm
  Extensive data gathering and image processing time
    16-second Coherent Processing Interval (CPI)
    Data image potentially as large as 8 GB
  [Figure: SAR sub-task flow: Range-Pulse Compression, Polar Reformatting, Pulse FFT, Range FFT, Pulse FFT, Auto-focus, Magnitude Function; image axes: range dimension, pulse dimension; data handled in range/pulse blocks]
  ** Final sub-task reduces data size by ½; otherwise data size remains constant until the final step

Partitioning Methods
  Chose straightforward partitioning for SAR due to latency considerations
    Each chunk split across all processors
    Pipelined or staggered methods would incur extremely high latencies for full-size images due to 16s CPI
      Pipelined latency ~= Number of stages * CPI
      Staggered latency ~= Number of groups * CPI
      Straightforward latency ~= CPI
  Possible other acceptable partitionings
    Staggered-by-chunk
      Split each chunk across each 4-processor board instead of across all nodes
      Possibly increase efficiency without a latency penalty
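The three latency approximations above can be evaluated with a small helper. This is a sketch only: the function name is made up for illustration, and the stage and group counts in the example are assumed values, not figures from the report.

```python
def estimated_latency_s(method, cpi_s=16.0, stages=1, groups=1):
    """Approximate result latency in seconds for one partitioning method."""
    if method == "straightforward":
        return cpi_s                 # latency ~= CPI
    if method == "pipelined":
        return stages * cpi_s        # latency ~= number of stages * CPI
    if method == "staggered":
        return groups * cpi_s        # latency ~= number of groups * CPI
    raise ValueError(f"unknown partitioning method: {method}")


# Assumed example: 7 pipeline stages (one per SAR sub-task) and 4 staggered
# groups. Neither count comes from the report.
for name in ("straightforward", "pipelined", "staggered"):
    print(name, estimated_latency_s(name, stages=7, groups=4), "s")
```

With a 16 s CPI, any multiplier above 1 pushes the latency to tens of seconds, which is the reason given for choosing the straightforward method.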

SAR Backplane and System Models
  High bandwidth requirements for GMTI algorithm dictate architecture for SAR systems
    All systems must eventually support SAR and GMTI
  Same backplane efficiently supports four-, five-, six-, and seven-board configurations
    Can use smaller switches on backplane to conserve power if fewer than seven boards needed
    Three-board system possible using similar configuration with only two backplane switches
  [Figures: 7-Board System; 4-Switch Non-blocking Backplane; Backplane-to-Board 0, 1, 2, 3 Connections; Backplane-to-Board 4, 5, 6, and Data Source/GM Connections]

Previous Work Summary
  Studied RapidIO system designs for space-based GMTI
  Important conclusions:
    Non-blocking backplane extremely important for GMTI
    GMTI not sensitive to latency of individual packets
      Cut-through routing unnecessary in switches
    RapidIO transmitter- and receiver-controlled flow control perform nearly equally
    Straightforward partitioning method provides lowest latency, but least efficient use of resources
    Staggered partitioning (by board) method very efficient, but has very high latency
    Pipelined method a compromise
  [Figure: CPI Latencies. Latency (ms, axis 0 to 1536) vs. number of ranges (32000 to 64000); series: Straightforward, 7 boards; Staggered, 5 boards; Pipelined, 6 boards; Pipelined, 7 boards]

Experimental Baseline Setup
  Simulation parameters
    Systems use 250 MHz, 16-bit RapidIO links
    Central-memory, store-and-forward switches
    For other parameters see Appendix
  SAR image sizes
    Most simulations run using much smaller images to keep simulation runtime manageable (a 16s CPI is a LOT to simulate)
    Standard image size simulated is 2048 x 2048
    Other sizes run include 4096 x 4096, as well as 16384 x 16384
  Full-sized simulation runs verify that performance scales linearly because image is broken into “chunks”
    Doesn’t matter if you simulate 500 chunks or 500000 chunks as long as chunk size is consistent, because simulation is doing same thing over and over!
  To determine performance of “real” system, simply multiply simulated CPI result latency by (desired image size/simulated image size)
    CPI Latency (actual) = CPI Latency (simulated) × (desired image size ÷ simulated image size)
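The scaling rule above is a single multiplication. The helper below is a sketch with a hypothetical function name; it takes "image size" to mean total pixel count (rows × columns), and the 200 ms input is an illustrative value rather than a simulation result.

```python
def scale_cpi_latency(simulated_latency, simulated_dims, desired_dims):
    """Scale a simulated CPI latency to a larger image, assuming linear scaling.

    simulated_dims and desired_dims are (rows, columns) tuples; image size is
    taken to be the total number of pixels.
    """
    simulated_pixels = simulated_dims[0] * simulated_dims[1]
    desired_pixels = desired_dims[0] * desired_dims[1]
    return simulated_latency * desired_pixels / simulated_pixels


# Illustrative input: 200 ms simulated for a 2048 x 2048 image.
# A 16384 x 16384 image has 64 times as many pixels.
print(scale_cpi_latency(200, (2048, 2048), (16384, 16384)), "ms")
```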

Scalability of SAR Results
  Table below displays computation of predicted performance for 16k x 16k image size based on 2k x 2k image size
    Numbers shown are a sample of metrics reported from simulation
    Error in approximation basically negligible
    Most important metric is CPI completion latency
  Enormous time savings by simulating scaled-down processing loads

Message Passing, 16 nodes, 128KB chunks

Metric                    2k x 2k actual   16k x 16k actual    16k x 16k predicted  Error (fraction of actual)
CPI completion latency    213,671,662      13,767,812,846      13,674,986,368       0.006742282
Request Stats
  Sum of Delay            173,389,298,614  11,106,406,501,800  11,096,915,111,296   0.000854587
  Sum of Bytes            333,710,512      21,357,397,168      21,357,472,768       -0.000003540
  Number of Transactions  1,245,244        79,691,839          79,695,616           -0.000047433
Response Stats
  Sum of Delay            361,428,208      22,798,782,846      23,131,405,312       -0.014589483
  Sum of Bytes            14,942,928       956,302,032         956,347,392          -0.000047433
  Number of Transactions  1,245,244        79,691,836          79,695,616           -0.000047433
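The predicted and error columns can be reproduced from the two "actual" columns. The sketch below assumes the prediction is the 2k x 2k value times 64 (the pixel-count ratio) and that the error is (actual − predicted) ÷ actual, expressed as a fraction; both assumptions agree with the Message Passing figures shown. The function name is illustrative.

```python
# Ratio of pixel counts between the 16k x 16k and 2k x 2k images.
SCALE = (16384 * 16384) // (2048 * 2048)   # 64

# (2k x 2k actual, 16k x 16k actual) taken from the Message Passing table.
ROWS = {
    "CPI completion latency": (213_671_662, 13_767_812_846),
    "Request sum of bytes": (333_710_512, 21_357_397_168),
    "Request transactions": (1_245_244, 79_691_839),
}


def predict_and_error(small_actual, large_actual, scale=SCALE):
    """Return (predicted large-image value, fractional error vs. actual)."""
    predicted = small_actual * scale
    error = (large_actual - predicted) / large_actual
    return predicted, error


for metric, (small, large) in ROWS.items():
    predicted, error = predict_and_error(small, large)
    print(f"{metric}: predicted={predicted:,} error={error:.9f}")
```

For the CPI completion latency the error fraction is about 0.0067, that is, under 1%.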

Scalability of SAR Results
  Both Logical I/O and MP layers produce predictable, linearly-scalable results
  Request/response statistic predictions slightly less accurate for Logical I/O systems
    It will be shown that Logical I/O is more susceptible to network contention, resulting in marginal reduction in prediction accuracy

Logical I/O, 16 nodes, 128KB chunks

Metric                    2k x 2k actual   16k x 16k actual    16k x 16k predicted  Error (fraction of actual)
CPI completion latency    187,221,141      11,923,995,014      11,982,153,024       -0.004877393
Request Stats
  Sum of Delay            193,415,004,394  12,425,814,223,000  12,378,560,281,216   0.003802885
  Sum of Bytes            170,917,888      10,938,744,832      10,938,744,832       0.000000000
  Number of Transactions  1,245,184        79,691,776          79,691,776           0.000000000
Response Stats
  Sum of Delay            54,130,881,314   2,739,210,856,070   3,464,376,404,096    -0.264735205
  Sum of Bytes            175,636,480      11,240,734,720      11,240,734,720       0.000000000
  Number of Transactions  655,360          41,943,040          41,943,040           0.000000000

SAR Results: Logical I/O vs. Message Passing Logical Layer Performance
  Message passing logical layer has much higher overhead
    All operations have responses
    CPI latency constant as chunk size increases
    Low levels of contention in network for all cases
  Logical I/O layer inherently better for current mapping
    Approximately ½ of operations are writes, which require no response
    Contention increases as chunk size increases (see next slide)
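The difference in response traffic can be illustrated with a simple count. The model below is an assumption made for illustration: under message passing every operation produces a response, while under Logical I/O only reads do. The read count is the Logical I/O response count reported for the 2k x 2k run (655,360), and the write count is derived as the reported requests minus that figure (1,245,184 − 655,360), which assumes every non-read request is a responseless write.

```python
def response_count(reads, writes, layer):
    """Number of response packets under a simplified model of each layer."""
    if layer == "mp":
        return reads + writes    # all operations have responses
    if layer == "io":
        return reads             # writes require no response
    raise ValueError(f"unknown logical layer: {layer}")


reads = 655_360                  # Logical I/O responses, 2k x 2k run
writes = 1_245_184 - reads       # derived under the stated assumption

for layer in ("mp", "io"):
    print(layer, response_count(reads, writes, layer))
```

Under these assumptions Logical I/O carries roughly half as many responses, in line with the "approximately ½ of operations are writes" observation.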

[Figure: CPI Completion Latency vs. Chunk Size, optimized. Latency (ms, axis 170 to 220) vs. per-processor chunk size (64, 128, 256, 512, 1024 KB); series: MP, IO]

[Figure: Parallel Efficiency vs. Chunk Size. Efficiency (axis 0.65 to 0.9) vs. chunk size (64, 128, 256, 512, 1024); series: MP, IO]

Logical I/O Performance
  With message-passing model, “smart” GM board handles arbitration for data to processors
    GM sends to processors when it is ready (~16 proc nodes, 4 GM RIO ports -> GM is the bottleneck and controls the traffic)
  With I/O model, all processors start issuing “reads” to the GM when they want data
    Floods the network with read requests
    Contention increases as chunk size increases (see switch memory histogram)
    Histogram shows switch spending most of its time with low free memory
  Potential solution?
    Add some synchronization elements to processors to avoid having everyone ask for huge chunks of data at once
    Let N processors ask at a time, where N = number of GM nodes
    Working on implementation and will provide results in addendum to this report
  [Figure: Board 0 Switch Memory Histogram. Vertical axis 0 to 50; horizontal axis: free switch memory; series: 64K, 1M]
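One way to picture "N processors ask at a time" is a fixed schedule that groups the processors into rounds of N. This is only a sketch of the idea, not the implementation referred to above; the function name is hypothetical, and a real system might use tokens or semaphores instead of a static schedule.

```python
def read_rounds(processor_ids, n_gm_nodes):
    """Group processors into rounds so that at most n_gm_nodes read at once."""
    if n_gm_nodes < 1:
        raise ValueError("n_gm_nodes must be at least 1")
    ids = list(processor_ids)
    return [ids[i:i + n_gm_nodes] for i in range(0, len(ids), n_gm_nodes)]


# Example sized like the system on this slide: 16 processors, N = 4.
for round_number, group in enumerate(read_rounds(range(16), 4)):
    print(f"round {round_number}: processors {group} issue reads")
```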

SAR Results: Cut-through vs. Store-and-Forward Routing
  Similar to GMTI, adding cut-through routing capabilities to switches does not improve overall system performance
  Efficiency chart below shows no major benefit for using cut-through routing
  Chart shows Message Passing and Logical I/O layers, as well as cut-through (CT) vs. store-and-forward (SNF) routing
  [Figure: Parallel Efficiency. Efficiency (axis 0 to 0.9); horizontal axis ticks 64, 128, 256, 512, 1024; series: MP CT, MP SNF, IO CT, IO SNF]

SAR Results: Comparison with GMTI
  SAR simulations all run on same backplane as GMTI
    Consider comparable metrics between the two to gauge system fitness for both algorithms
    Side note: Logical I/O scales slightly better for SAR!
  Parallel efficiency graphs below reveal that SAR runs more efficiently than GMTI on the same system
    SAR algorithm is very memory and compute intensive, but does not stress the network as much as GMTI
  [Figure: SAR Parallel Efficiency vs. System Size. Efficiency (axis 0 to 1) vs. system size (12, 16, 20, 24, 28 nodes); series: MP, IO]
  [Figure: GMTI Parallel Efficiency vs. Data Size. Efficiency (axis 0 to 1) vs. data cube size (32000 to 64000 ranges); series: Straightforward, 5 Board]

SAR Results: Double-Buffering
  Double-buffering allows reception of one “chunk” while processing is being performed on the previous chunk (using Logical I/O layer)
  Depending on board architecture, requires 2-3x more memory available on-board
  Early double-buffering experiments show CPI latency improvements for smaller chunk sizes (due to communication/computation overlap)
  However, double-buffering increases system contention as chunk sizes grow, even more than the standard Logical I/O SAR application
    Possible that certain phases of the algorithm (more compute-heavy) will benefit more from double-buffering, while others should be left single-buffered (will explore this possibility further)
  Once algorithm scaling to 16k x 16k is taken into account, almost 0.5 s in latency can be saved by double-buffering
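The benefit of communication/computation overlap can be seen in an idealized timing model. The sketch below assumes every chunk has the same receive and compute time and ignores write-back and network contention, so it cannot reproduce the contention effects noted above. The function names and example times are illustrative only.

```python
def single_buffered_time(n_chunks, t_receive, t_compute):
    """Each chunk is received and then processed, with no overlap."""
    return n_chunks * (t_receive + t_compute)


def double_buffered_time(n_chunks, t_receive, t_compute):
    """Receive of the next chunk overlaps processing of the current one."""
    if n_chunks == 0:
        return 0
    # First receive and last compute cannot overlap with anything; in between,
    # each step is bounded by the slower of the two activities.
    return t_receive + (n_chunks - 1) * max(t_receive, t_compute) + t_compute


# Illustrative numbers: 4 chunks, 2 time units to receive, 3 to compute.
print(single_buffered_time(4, 2, 3))   # 20
print(double_buffered_time(4, 2, 3))   # 14
```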

[Figure: CPI Completion Latency vs. Chunk Size. Result latency (ms, axis 165 to 210) vs. chunk size (64, 128, 256, 512, 1024); series: 2xBuffer, 1xBuffer]


SAR/RIO Demo

DEMO

Conclusions
  RapidIO simulation and modeling capabilities extended
    SAR algorithm simulated and analyzed, compared with GMTI
    Logical I/O layer added, GMTI model and results updated
  Altia-MLD interface developed, documented, and demonstrated
    Small, flexible set of generic MLD primitives to easily interface with Altia from any simulation
    Minimal components required to interface MLD with Altia; seamless integration and flexible designs demonstrated using our Altia component library
  SAR’s network requirements easily handled by networks designed for GMTI
    Biggest challenge of SAR is memory requirement
    Could benefit from further in-depth study on processor/memory architecture issues
  RapidIO Logical I/O layer found to benefit SAR CPI completion latency through response-less write operations
    Well-suited to distributed-memory approach
  Double-buffering significantly improves performance of SAR application
    Increases local memory requirements by 2x-3x
    Smart use of double-buffering only on tasks with heavy computation may optimize benefit
  Cut-through routing does not greatly benefit SAR (similar to GMTI)
    Much of the overall delay is found in either port contention or simply “waiting your turn,” for example in an orchestrated many-to-one send
  RIO testbed facilities in development
    Xilinx cores and hardware recently acquired
    Looking into other options (comments welcome)

Future Work and Project Options (1)
  Follow on to current RapidIO work
    Explore additional options for double-buffering of SAR
    Explore synchronization options for Logical I/O Layer to improve SAR performance
    Produce a “Day in the life of SBR” simulation (with SAR and GMTI)
    Develop a GMTI-specific Altia GUI (the current one is SAR-specific)
    Develop a “Day in the life of SBR” GUI
    Explore Logical I/O simulations for GMTI
    Develop additional partitionings of SAR
      Pipelined?
      Staggered chunks?

Future Work and Project Options (2)
  Additional RapidIO Projects
    Examine fault tolerance aspects of Honeywell's RapidIO, including failures within chips (both endpoints and switches) and links
      Completely redundant network vs. graceful degradation with redundant links
    Include a higher level of fidelity in the system boards, memory, processors, etc., especially regarding other system software such as operating systems
    Model other applications, possibly including compression
    Study of RIO multicast spec and alternatives
  Other Projects
    FPGA interconnectivity, control, configuration management, and fault detection and correction in satellite systems
    Investigation of architecture tradeoffs of the next generation SBC
    High-level networking issues in the Wireless Reconfigurable Interconnects project
    ST-9 project: constellations of satellites flying in formation
      Study algorithms for communication/cooperative processing across satellites
    Others?

Future Collaboration Possibilities
  I/UCRC
  Air Force Research Lab
    Munitions directorate (Eglin)
    Space vehicles directorate (?)
  Internships
    David interested in a summer internship
  Other Options?

RACEway and RACE++
  RACEway: open standard
  RACE++: Mercury Computer Systems’ second generation of RACEway technology
  “Legacy” switched interconnect option
  Nodes connected via RACE/RACE++ crossbar switches
    RACEway: 6-port
    RACE++: 8-port
  Scalability
    RACEway: up to 1000 nodes
    RACE++: over 4000 nodes
  Adaptive routing
    RACEway: can be implemented on 2 of 6 crossbar ports
    RACE++: can be implemented on all 8 crossbar ports
  Active backplane
    Failure of a single crossbar will often result in the failure of an entire

                  RACEway    RACE++
Port-to-Port BW   160 MB/s   267 MB/s
Crossbar BW       480 MB/s   1 GB/s

Appendix: Baseline Simulation Parameters
  Store-and-forward routing
  250 MHz DDR RIO links
  16-bit RIO links
  Endpoint input/output queue length = 8 RIO packets
  Endpoint priority 0 threshold = 4 packets
    Endpoint retries priority 0 packets if it has greater than 4 packets in its buffer
    Other endpoint priority thresholds = 5, 6, 7 (for prio 1, 2, 3 respectively)
  Maximum payload size = 256 bytes
  Packet disassembly delay = 14 ns
  Response creation delay = 12 ns
  Responses upgraded 1 level of priority
  Switch priority 0 threshold = 3000 bytes
    Switch retries priority 0 packets if it has less than 3000 bytes of free memory
    Other switch priority thresholds = 2000, 1000, 0 (for prio 1, 2, 3 respectively)
  TDM window size = 64 ns
  TDM data copied per window = 64 bytes
  TDM minimum delay = 16 ns
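The retry thresholds listed above can be written as two small predicates. The slide states the comparison only for priority 0; the sketch assumes priorities 1 to 3 use the same comparison with their own thresholds. The function names are illustrative, and other acceptance conditions (for example, whether the incoming packet itself fits) are not modeled.

```python
# Thresholds indexed by packet priority 0..3, taken from the appendix.
ENDPOINT_THRESHOLDS_PACKETS = (4, 5, 6, 7)        # packets already buffered
SWITCH_THRESHOLDS_BYTES = (3000, 2000, 1000, 0)   # free switch memory


def endpoint_retries(priority, packets_in_buffer):
    """True if the endpoint would retry a packet of this priority."""
    return packets_in_buffer > ENDPOINT_THRESHOLDS_PACKETS[priority]


def switch_retries(priority, free_bytes):
    """True if the switch would retry a packet of this priority."""
    return free_bytes < SWITCH_THRESHOLDS_BYTES[priority]


# With 5 packets buffered, priority 0 is retried but priority 1 is accepted.
print(endpoint_retries(0, 5), endpoint_retries(1, 5))
# With 1500 bytes free, priorities 0 and 1 are retried; 2 and 3 are accepted.
print([switch_retries(p, 1500) for p in range(4)])
```

Higher-priority packets are accepted deeper into a full buffer, which fits with responses being upgraded one priority level.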
