08/06/04
Virtual Prototyping of Advanced Space System Architectures Based on RapidIO: Phase I Report
Sponsor:
Honeywell Space Systems, Clearwater, FL
Principal Investigator:
Dr. Alan D. George
OPS Graduate Assistants:
David Bueno, Ian Troxel
RA Graduate Assistants:
Chris Conger, Adam Leko
HCS Research Laboratory, ECE Department
University of Florida
Presentation Outline
- Project Motivation and Goals
- Project Tasks
- Improvements/Additions to RapidIO Models
- GMTI System Designs
- Experiments and Results
- Conclusions
- Future Work
Project Motivation and Goals
- Determine optimal means by which to develop RapidIO for space systems running GMTI
- Perform RIO switch, board, and system tradeoff studies
- Identify limitations of space-based RIO design
- Determine design feasibility of space-based GMTI systems
- Discover optimal architecture for space-based GMTI systems
- Provide assistance for Honeywell proposal efforts
- Lay groundwork for future Honeywell system prototyping
Project Tasks
- Literature review: RIO spec, RIO components, SBR, misc.
- RIO component and system modeling: layers, endpoints, switches, processors, etc.; GMTI traffic models, memory boards, backplanes; script-based processing and algorithm modeling
- Simulation experiments: successful systems, and unsuccessful systems with lessons learned
- Data analysis and report
Items in red indicate additions since May 7, 2004 report
New Model Features
- Script-based processing/traffic flow: allows us to easily model arbitrary applications with information about their computation and communication patterns
- Detailed TDM model for the central memory switch, based on information from Honeywell; cycle-accurate, but does not require cycle-by-cycle simulation
- "Trunking" implemented in the switch model: a probable feature of Honeywell's switch; allows routing of packets with the same destination ID to multiple ports (e.g., the routing table may allow a packet destined for node 4 to exit out ports 4, 5, or 6; ports 4, 5, and 6 are then said to form a "trunk"); port selected via round-robin scheme
- Switch deadlock avoidance features based on feedback from Honeywell: see the previously posted presentation (July 20, 2004) for information on the deadlock problem we experienced; our primary solution is identical to Honeywell's switch features (each port speculatively grabs one buffer, which prevents complete starvation and deadlock)
- Switch memory profiling: at simulation completion, outputs the distribution of time the switch spends with free memory in discrete ranges (e.g., the distribution could indicate that, 25% of the time, the switch has between 1000 and 2000 bytes of free memory); selectable number of ranges
- Link utilization monitoring: can be used to measure throughput at certain points in the system
- ~35% increase in simulation speed, mostly due to optimization of code and learning more about the MLD simulation tool
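The trunking behavior described above can be sketched as follows. This is a minimal illustration, not Honeywell's actual switch logic: a routing table maps a destination ID to a trunk (a list of candidate output ports), and the output port is selected round-robin.

```python
# Hedged sketch of trunked routing with round-robin port selection.
# Port and destination numbers are illustrative only.
from itertools import cycle

class TrunkedRoutingTable:
    def __init__(self, table):
        # table: dest_id -> list of candidate output ports (a "trunk")
        self._cycles = {dest: cycle(ports) for dest, ports in table.items()}

    def next_port(self, dest_id):
        """Pick the next output port for this destination, round-robin."""
        return next(self._cycles[dest_id])

# e.g. packets for node 4 may exit ports 4, 5, or 6 (ports 4-6 form a trunk)
rt = TrunkedRoutingTable({4: [4, 5, 6]})
print([rt.next_port(4) for _ in range(4)])  # [4, 5, 6, 4]
```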
Basic System Description
- Systems created have a variable number of boards, all connected via a RapidIO switched backplane
- Each board has 4 processing elements (PEs) and an 8-port RapidIO switch: one RIO link to each PE, and a total of four links to the backplane
- Each processing element has 8 FPUs
- Backplane design is crucial to good system performance
Base Algorithm Model (1)
- Based on the reported method preferred by Honeywell's customers
- Multi-port data source: required to handle massive GMTI traffic requirements at sub-GHz line rates; could be data coming from global memory, or streaming directly from sensors
- GMTI algorithm breakdown: corner turns optimize data distribution to ensure no communication is needed within any of the four tasks (only in between tasks)
[Figure: GMTI task pipeline: Receive Cube → Pulse Compression → corner turn → Doppler Processing → corner turn → Space-Time Adaptive Processing (STAP) → Constant False Alarm Rate (CFAR) → Send Results; the corner turns re-partition the data between the range and pulse dimensions]
Base Algorithm Model (2)
- Staggered partitioning: data cubes are sent out to groups of processing elements in round-robin fashion
- The amount of time each PG has to receive and handle its data cube is N × CPI, where N = number of processing groups and CPI = amount of time between generated data cubes (in ms)
- If overlapping the computation of the previous CPI with reception of the next CPI is possible, up to (N + 1) × CPI may be allowed; possible through DMA capability on end nodes
[Figure: staggered partitioning timeline: the data source sends data cubes 0-5 round-robin to processing groups PG0, PG1, and PG2 over CPIs 0-4]
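The deadline rule above can be written out directly. A minimal sketch, with illustrative values for N and CPI (not tied to any particular system in this study):

```python
# Sketch of the staggered-partitioning deadline rule:
# each processing group gets N * CPI, or (N+1) * CPI with DMA overlap.
def pg_deadline_ms(n_groups: int, cpi_ms: float, overlap: bool = False) -> float:
    """Time each processing group has to receive and handle one data cube."""
    factor = n_groups + 1 if overlap else n_groups
    return factor * cpi_ms

# e.g. 3 processing groups at a 256 ms CPI:
print(pg_deadline_ms(3, 256.0))        # 768.0  (N * CPI)
print(pg_deadline_ms(3, 256.0, True))  # 1024.0 ((N+1) * CPI with overlap)
```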
Early System Designs & Lessons Learned
- Getting data to the nodes within the CPI timeframe is the most important constraint
- The systems pictured are examples of systems without sufficient bandwidth to function as a real-time GMTI system
- Backplane A: supports 7 sources, 4 boards; Backplane B: supports 2 sources, 9 boards
Fundamental System Design Constraints
- The system must have non-blocking connectivity from data source to processing nodes; otherwise it is impossible for the system to keep up with real-time requirements (the system must be able to source data to endpoints as fast as it receives it)
- The system must use reasonable algorithm parameters for its system size: found using a combination of trial and error and the initial Honeywell GMTI spreadsheet; new equations (explained later) were formulated to predict system feasibility based on the results obtained
- Baseline algorithm parameters for 5-, 6-, and 7-board systems = 2k × 32 × 6 × 256: 2,000 ranges with 32 sub-bands, 6 beams, 256 pulses (CPI = 256 ms)
- All systems consist of 4-node boards with an 8-port RIO switch, connected via a backplane of RIO switches
- Backplanes may not use an excessive number of switches: non-blocking systems are easily constructible for up to 7 boards with 4 data sources, but once the system grows larger than 7 boards, the number of switches needed grows astronomically
- 250 MHz RIO used as baseline (assumes clock rates will double by the time SBR is ready to fly); 125 MHz RIO also studied to show what we can do with current technology
- Store-and-forward routing used in switches
- See Appendix slide for complete simulation parameters
Baseline Five-, Six-, and Seven-Board Systems
- Same backplane design for each system: can conserve power/cost by connecting fewer boards to the backplane
- Could use 6-port switches on the backplane if interested in a 5-board system, or 7-port switches for a 6-board system, to further reduce power/cost
- Ports are organized in groups of 4; each group is connected to all 4 switches, so each board can get to all other boards through any switch
4-Switch Non-blocking Backplane
7-Board System
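A minimal sketch of the connectivity property described above: because each board has one backplane port per switch, every backplane switch attaches to every board, and any single switch connects any pair of boards in one hop. Board/switch counts follow the 7-board, 4-switch example; the representation is ours, not Honeywell's design data.

```python
# Sketch: each board's 4 backplane ports go one-to-each backplane switch.
N_BOARDS, N_SWITCHES = 7, 4

# links[switch] = set of boards attached to that switch
links = {sw: set(range(N_BOARDS)) for sw in range(N_SWITCHES)}

def one_hop_nonblocking(links, n_boards):
    """True if every switch attaches all boards, so any switch joins any pair."""
    return all(len(boards) == n_boards for boards in links.values())

print(one_hop_nonblocking(links, N_BOARDS))  # True
```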
Three-Board System
- Smaller system, with a non-blocking backplane using only two switches
- Requires tuned-down GMTI algorithm parameters (reduced number of ranges from 2000 to 1000, but still with 32 sub-bands)
2-Switch, 3-Board Non-blocking Backplane
125 MHz Six-Board System
- For RapidIO systems running near 125 MHz clock rates
- Requires 8 data-source ports, sending data to 8 nodes at a time
- Requires 2 boards at a time to work on a single CPI
- Requires corner turns to be performed across boards
6-Board GMTI System
Overview of Simulations (1)
- General simulation procedure:
  - Create a script to feed into our processor models, using a program that takes in high-level parameters (CPI, number of processors, data cube size, data reduction parameters) and generates the script file
  - Run the simulation (very long: 18-47 hours)
  - Take the output from the simulation and feed it through a post-simulation script, which analyzes the data and generates graphs and statistics, enabling quick analysis
- Data traffic pattern: receive data from source (lots!); compute for a specified interval (pulse compression); corner turn; compute (Doppler processing); another corner turn; compute (STAP + CFAR); send the result back to the data source board
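The data traffic pattern above can be sketched as a tiny script generator. This is illustrative only: the step names and interface are our invention, not the actual MLD script format used in the study.

```python
# Hedged sketch: emit the ordered step list a processor model would execute
# for one CPI of the GMTI traffic pattern described above.
def gmti_script() -> list:
    compute_stages = ["pulse_compression", "doppler", "stap_cfar"]
    steps = ["receive_cube"]
    for i, stage in enumerate(compute_stages):
        steps.append("compute:" + stage)
        if i < len(compute_stages) - 1:
            steps.append("corner_turn")   # corner turn between compute stages
    steps.append("send_results")
    return steps

print(gmti_script())
# ['receive_cube', 'compute:pulse_compression', 'corner_turn',
#  'compute:doppler', 'corner_turn', 'compute:stap_cfar', 'send_results']
```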
Overview of Simulations (2)
Trade studies performed:
- 250 MHz systems, 4-switch backplane:
  - Using 5, 6, and 7 processing boards (4 compute-node processors per board, 8 FPUs per node processor)
  - Traffic at different priority levels vs. same priority (5, 6, and 7 boards): prioritized = outgoing data highest, corner turn medium, incoming lowest; unprioritized = all data at the same priority, with switch and endpoint thresholds set accordingly; in every case, message-passing (MP) responses are upgraded by 1 class (MP requests are the basic RapidIO message-passing unit)
  - Adjustment of available switch memory (6-board system memory capacity halved, prioritized vs. unprioritized)
- 250 MHz systems, 2-switch backplane (data size ~20% smaller than the nominal setup): prioritized vs. unprioritized with 3 processing boards
- 125 MHz system, 4-switch backplane: nominal data set with no trunking; same hardware set as Honeywell has right now
250 MHz Systems
- 5 vs. 6 vs. 7 processing boards:
  - The system scales well with the number of processors due to the traffic patterns of "processor board per CPI" partitioning; the entire data cube is sent to a single processing board
  - The corner turn only traverses one hop and thus does not burden the backplane
  - The 5-board system has a small overlap of the computation of the previous CPI with the receive of the next CPI, which prevents the system from having 2 processors send to the data source board; as a result, the 5-board system gets data 2 ms faster, and its overall CPI latency is actually 2 ms faster
- Prioritized vs. unprioritized:
  - In general, not much measurable difference between them
  - However, prioritized gets incoming data faster (incoming data has higher priority), while unprioritized gets results slightly faster (all data at the same priority); less than 0.5% difference in overall CPI time
- Reducing total memory in switches by 50%: no effect on overall CPI latency, though overall throughput differs
Results - Explanation of Graphs
- These sample graphs are for explanation purposes, indicating what each type of graph represents
- Memory utilization histogram: x-axis is the amount of free memory (bytes); y-axis is the percent of time spent with that amount of free memory
  - Shows how much time is being spent with a lot or very little memory; charts that show a significant amount of time spent with little free memory imply a congestion problem
  - All memory utilization histograms show switch memory usage on processor boards; backplane switch buffers are generally near-empty for valid systems (no contention due to the "clear path" from data sources to processor nodes)
Results - Explanation of Graphs
- Throughput profile (left): shows the throughput achieved for each received message vs. time; x-axis is elapsed time (ms), y-axis is throughput (GBytes/s)
- Processor utilization profile (right): shows the amount of time spent on each stage of the GMTI algorithm, including communication events; x-axis is CPI, y-axis is elapsed time (ms) relative to the start of the CPI
Results - No Barrier Syncs (250 MHz, 6 boards, prioritized)
- No synchronization between corner-turn steps: one processor gets ahead and monopolizes the switch
- Throughput is seen to suffer; the switch spends almost 1/3 of its time with very low memory
Results - With Barrier Syncs (250 MHz, 6 boards, prioritized)
- Synchronizing between interprocessor communications resolves the issue
- Full throughput is achieved for all communications in the corner turn
- The switch exhibits much better memory utilization characteristics
Results - 5-Board System (250 MHz, 5 boards, not prioritized)
- CPI iterations are exceeding real-time deadlines (predicted: just over 5 boards needed to meet deadlines)
- The processor profile closely matches predictions, within 2 ms for all values; corner turns have lower throughput because of simultaneous sends and receives
- Computation that continues past the real-time deadline can be acceptable in some cases, assuming some computation can overlap with communication (e.g., overlap computation at the end of one CPI with communication at the start of the next CPI)
Results - 50% Reduction in Switch Memory (250 MHz, 6 boards, not prioritized)
- All processors get full bandwidth
- Overall CPI latency is not affected: the full-memory system did not use all of the switch buffer space
Results - 50% Reduction in Switch Memory (250 MHz, 6 boards, prioritized)
- Throughput and CPI latency are about the same as before, but switch memory usage is different
- The first processor to send a message monopolizes switch memory (e.g., processor 2 on the second phase of the first corner turn, shown below in green)
- Prioritized systems give less buffer space to corner-turn traffic: corner turns have second-highest priority, incoming data has highest priority; processor 2 gets full throughput on the corner turn, the other processors do not
Results - 3-Board Systems (250 MHz, 3 boards, prioritized)
- Lower switch memory utilization on processor-board switches due to smaller corner turns
- Prioritized vs. unprioritized made very little difference: less contention for switch memory negates the prioritized advantage
- All else behaved as predicted; the system can sustain performance requirements for GMTI
Results - 125 MHz 6-Board System (125 MHz, 6 boards, not prioritized)
- The 125 MHz system originally starved on corner turns: a corner turn across 2 boards stresses the backplane, which is also delivering data to other boards
- The 250 MHz pipelined system starved as well
- This initial configuration allowed boards to send cross-board corner-turn data out any backplane port (since each switch can get data to any board)
Results - Trunking "Solution" (125 MHz, 6 boards, not prioritized)
- Disable trunking for corner turns; instead, use static load balancing when creating routing tables
  - For example, make processors on board 0 use port 0 to talk to node 5, port 1 for node 6, port 2 for node 7, and port 3 for node 8 (if 5, 6, 7, and 8 are the processor IDs of the processors paired with board 0 on this CPI)
- Each stage of the corner turn needs a clear path from processor to processor; there are no such guarantees with trunking, and trunking can add more dependency loops in the system
- This solution works!
Results - 125 MHz (without trunking)
- The system fails to meet the deadline, but we assume that the initial receive can be overlapped with computation from the previous CPI (DMA)
- Memory utilization falls within the desirable range (much worse behavior was observed before, when the switch was dipping into reserved buffers)
Performance Prediction Equations: Data Cube Latency
Overall time to compute a single data cube:

    Latency = S / (A × B) + C / (p × N × NP × NCB)

Parameters:
- A: maximum bandwidth available to a data stream (fraction of network bandwidth available)
- B: bandwidth for that data stream, including packet/protocol overhead
- S: data stream size
- p: parallel efficiency
- C: single-node compute time
- N: FPUs per ASIC
- NP: number of processors per board
- NCB: number of compute boards used per CPI of data

Notes:
- Computation time and parallel efficiency must be provided by Honeywell; we use values from Honeywell's GMTI spreadsheet and assume parallel efficiency close enough to 1.0 to approximate with 1.0
- Communication parameters are provided by the RapidIO models and can be statically predicted for systems providing enough throughput: in general, systems with non-blocking connectivity between N data sources and any N nodes that will process a CPI
- Simulations either succeed or fail miserably (no in-between), so simulations are needed to verify the feasibility of each system before using the equations; the equations are not valid for systems that fail
- For this phase, we assumed A = 1.0 (no other management traffic on the network)
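The latency equation can be written directly in code. A minimal sketch, with the example values below being illustrative (C taken as the serial compute time spread over all FPUs working the CPI), not a definitive implementation:

```python
# Sketch of the data cube latency equation.
# Units: S in bytes, B in bytes/s, C in seconds.
def cube_latency(S, B, C, A=1.0, p=1.0, N=8, NP=4, NCB=1):
    comm = S / (A * B)              # time to move the data stream
    comp = C / (p * N * NP * NCB)   # serial compute time spread over all FPUs
    return comm + comp

# Illustrative: a ~196.6 MB stream at ~0.955 GB/s effective bandwidth, plus
# 29.294 s of serial compute over 8 FPUs x 4 processors x 1 board
t = cube_latency(S=196_608_000, B=0.955e9, C=29.294)
print(round(t, 3))  # ~1.121 s for this receive + compute portion
```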
Performance Prediction Equations: Communication
Decomposition: each processor board computes an entire CPI

Parameters:
- BGM: bandwidth from data source board nodes to processing board nodes
- SDC: size of the incoming data cube
- BCT: bandwidth between nodes doing the corner turn (including MP response overhead)
- SCT1, SCT2: sizes of the two corner turns (one between Pulse + Doppler nodes, one between Doppler + STAP/CFAR nodes)
- SFIN: size of the final data set after all processing

Experimentally derived parameters (250 MHz RapidIO, packet headers are 12 bytes in length):
- BGM = 0.955 (same as 256 bytes / 268 bytes)
- BCT = 0.914 (same as 256 bytes / (268 bytes + 12 bytes [MP response]))

Corner-turn modeling:
- Each processor sends N - 1 messages of size ([(N - 1) / N] × total data size) / (N - 1) = N - 1 messages of size (total data size / N)
- Explanation: each processor receives 1/N of the total data from N processors, excluding itself
- See appendix for a performance prediction example
    Communication time = SDC / BGM + (SCT1 + SCT2) / BCT + SFIN / BGM
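The overhead fractions and the corner-turn message sizing described above can be sketched as follows, using the report's 256-byte payloads and 12-byte header/response sizes (the helper names are ours):

```python
# Effective bandwidth fractions from packet overheads (slide values).
PAYLOAD, HEADER, MP_RESPONSE = 256, 12, 12  # bytes

B_GM = PAYLOAD / (PAYLOAD + HEADER)                # 256/268 ~= 0.955
B_CT = PAYLOAD / (PAYLOAD + HEADER + MP_RESPONSE)  # 256/280 ~= 0.914

def corner_turn_messages(total_bytes: int, n: int):
    """Each of n processors sends n - 1 messages of size total_bytes / n."""
    return n - 1, total_bytes // n

print(round(B_GM, 3), round(B_CT, 3))  # 0.955 0.914
print(corner_turn_messages(1000, 4))   # (3, 250)
```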
Performance Equations vs. Simulation

                       Equation       Simulation
Cube receive time      205.824 ms     205.917 ms
Corner turn 1 time      96.768 ms      96.770 ms
Corner turn 2 time      64.512 ms      64.513 ms
Data send time           2.058 ms       2.152 ms
Overall CPI latency   1284.523 ms    1284.690 ms
Conclusions
- GMTI is not very sensitive to latency: large data set, low synchronization frequency (but much data to be synchronized)
- The dependence on throughput requires traffic to be carefully mapped through the switch network beforehand; non-blocking backplane architectures are very important to performance
- A combination of analytical approach and computer simulation provides effective performance prediction
  - For the partitionings under study, spreadsheet formulas can be used to accurately predict the simulation's behavior for systems potentially capable of handling the workload
  - Corner turns across one processor board are predictable; however, barrier syncs are necessary to prevent nodes from getting ahead of each other and monopolizing the network
  - The spreadsheet for performance prediction is posted on the project website
- Dynamic behavior (trunking) can help in some cases and hurt in others; GMTI's traffic patterns are fairly predictable, making static load balancing via routing tables the best overall solution considered
- It is possible to do GMTI with a data cube of 2k ranges × 32 sub-bands × 256 pulses × 6 beams using Honeywell's current or emerging technologies; there is not much elbow room, but the cube size can always be decreased as needed
Future Work
- Explore pipelined GMTI configurations: will be included in results for the HPEC '04 submission due August 30
- Explore switch memory management policies: may help with problems of nodes "taking over" a switch port during corner turns (seen during experiments with synchronization disabled); also will be included in the HPEC submission
- The HPEC submission will of course be shared and cleared with Honeywell; any additional data gathered will be provided to Honeywell in an addendum to this report
- Perform similar case studies for SAR: determine the optimal system configuration and algorithm partitioning; will require literature search, script development, and system-level modeling; the fundamental model components are all in place and can be modified as necessary to reflect new developments from Honeywell
- Construct a simple experimental testbed for RapidIO: useful for model validation and future projects; start with two endpoint link partners, add a small switch later as resources permit
- Explore methods to improve simulation speed: packet-level simulations of SBR/RIO with gigabytes of data can be very slow; several methods may be useful (e.g., parallel and distributed simulation, model refinements)
Appendix: Baseline Simulation Parameters
- Store-and-forward routing
- 250 MHz DDR RIO links
- 16-bit RIO links
- Endpoint input/output queue length = 8 RIO packets
- Endpoint priority 0 threshold = 5 packets (endpoint retries priority 0 packets if it has more than 5 packets in its buffer)
- Other endpoint priority thresholds = 7
- Maximum payload size = 256 bytes
- Packet disassembly delay = 14 ns
- Response creation delay = 12 ns
- Responses upgraded 1 level of priority
- Switch priority 0 threshold = 1000 bytes (switch retries priority 0 packets if it has less than 1000 bytes of free memory)
- Other switch priority thresholds = 0 bytes (always accepts a priority 1 or greater packet if it has room)
- TDM window size = 64 ns
- TDM data copied per window = 64 bytes
- TDM minimum delay = 16 ns
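The switch admission rule implied by these thresholds can be sketched as follows. This is our hedged reading of the parameters above (priority 0 retried below a free-memory threshold, priority 1+ accepted whenever it fits), not the actual model code:

```python
# Hedged sketch of the switch buffer-admission rule from the parameters above.
PRIO0_THRESHOLD = 1000  # bytes of free memory required to accept priority 0

def switch_accepts(priority: int, packet_bytes: int, free_bytes: int) -> bool:
    if packet_bytes > free_bytes:
        return False  # no room at all: packet is retried
    if priority == 0:
        return free_bytes >= PRIO0_THRESHOLD  # priority 0 threshold = 1000 B
    return True       # priority 1+ accepted if it fits (threshold = 0 bytes)

print(switch_accepts(0, 268, 900))  # False: below the priority-0 threshold
print(switch_accepts(1, 268, 900))  # True: higher priority, and it fits
```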
Appendix: Performance Prediction Example
System (NCB = 1, NP = 4):
- 250 MHz RapidIO link rate
- Five compute boards @ 250 MHz (one board per CPI), eight FPUs per ASIC
- Data cube size: 64,000 ranges (2,000 × 32 sub-bands), 256 pulses, 6 beams

Based on the spreadsheet, total processing time per processor board:
- Pulse compression: 13.829 s
- Doppler: 3.330 s
- STAP: 1.835 s
- CFAR: 10.300 s

Each processor board's incoming data size (bytes):
- Pulse receive: 786,432,000
- Doppler receive (CT1): 471,859,200
- STAP receive (CT2): 314,572,800
- Final data send: 1% of original incoming data = 7,864,320

Corresponding equation parameters (assume A = 1.0, p = 1.0):
- SDC = 786,432,000 / 4 = 196,608,000
- SCT1 = (471,859,200 / 4) × (3 / 4) = 88,473,600
- SCT2 = (314,572,800 / 4) × (3 / 4) = 58,982,400
- SFIN = 7,864,320 / 4 = 1,966,080
- SDC / BGM = 205.824 ms
- SCT1 / BCT = 96.768 ms
- SCT2 / BCT = 64.512 ms
- SFIN / BGM = 2.058 ms
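As a check, the four communication terms above can be reproduced numerically. We assume a 16-bit DDR RapidIO link at 250 MHz yields 1 GB/s of raw bandwidth (this link-rate interpretation is ours), with the report's 256-byte payloads, 12-byte headers, and 12-byte MP responses:

```python
# Reproduce the communication terms of the performance prediction example.
LINK_BPS = 250e6 * 2 * 2       # 250 MHz x DDR x 2 bytes/beat = 1e9 bytes/s (assumed)
B_GM = LINK_BPS * 256 / 268    # data-source traffic efficiency
B_CT = LINK_BPS * 256 / 280    # corner-turn traffic (includes MP response)

S_DC, S_CT1, S_CT2, S_FIN = 196_608_000, 88_473_600, 58_982_400, 1_966_080

for label, size, bw in [("cube receive", S_DC, B_GM),
                        ("corner turn 1", S_CT1, B_CT),
                        ("corner turn 2", S_CT2, B_CT),
                        ("data send", S_FIN, B_GM)]:
    print(f"{label}: {1000 * size / bw:.3f} ms")
# cube receive: 205.824 ms
# corner turn 1: 96.768 ms
# corner turn 2: 64.512 ms
# data send: 2.058 ms
```

These match the "Equation" column of the comparison table earlier in the report.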