11/6/2015
SYSTEM ENGINEERING
©2015 Steve Kirsch- All Rights Reserved 1
Engineering 180: Systems Engineering
Embedded Processing Case Study
Lecture 1
May 21, 2015
Steve Kirsch
Outline for Lectures on Real-Time Embedded Processing
Lecture 1
• Overview of Embedded Subsystem Design
• Case Study: Problem statement
Lecture 2
• Conceptual Design
Lecture 3
• Preliminary Design
Lecture 4
• Detailed Design / Integration and Test
Lecture 1: Agenda
Overview
• What is an embedded processing system?
• Characteristics
• Examples of embedded computers
• Summary of design challenges
Case Study: The road to SDR: applying the systems engineering process
• Problem statement
• Identify stakeholders
• Top-level requirements
• Key Performance Parameters (KPPs)
Homework
Overview: What is an embedded processing system
Wikipedia:
• An embedded system is a computer system with a dedicated function within a larger mechanical or electrical system, often with real-time computing constraints
• It is embedded as part of a complete device, often including hardware and mechanical parts. Embedded systems control many devices in common use today
Overview: continued
An embedded processing subsystem
• Consists of one or more digital processors integrated with other parts of a complex system (sensors, actuators, user interface, etc.)
• Is arranged in a tightly coupled architecture, designed to perform a specific set of functions
[Figure: Generic System with Embedded Digital Processor. Blocks shown: Sensor, Amp, ADC, Embedded Digital Processor, DAC, Amp, Load]
Overview: continued
System design, as previously discussed, is hierarchical
• A system is broken down into subsystems, which in turn are broken down into more subsystems
• The embedded computer subsystem is itself a complex system, designed using the same principles employed at the system level
Overview: continued
Embedded processing subsystems are unique because their requirements are implemented in both hardware and software components
• An early task of the embedded processing subsystem engineer is to flow down high-level requirements and allocate them to hardware and software components
• An embedded processing subsystem engineer therefore needs broad knowledge of both hardware and software systems
Overview: continued
Embedded subsystems today are perhaps the most critical and complex part of the system to get right, and to get right early in the system design process
• Functionality that was classically implemented in hardware is today implemented in both hardware and software components (trending toward more and more software)
• Human and environmental interfaces are sensed and controlled by the interaction of hardware and software components
• Real-time operation is a function of hardware and software interaction; deterministic behavior is critical
Overview: The embedded system engineer's job is very challenging
The embedded subsystem engineer is required to have broad expertise
• Hardware and software development management processes and tools
• Mechanical / structural (enclosure designs)
• Thermodynamics (cooling of electronics is critical to system design)
• Materials (enclosures, backplanes, modules, connectors, etc.)
• E&M (electromagnetic radiation and protection)
• Computer hardware architecture (interface standards, networks, memory architecture, processor architecture, communication protocols, and electronic components such as GPGPU, FPGA, and ASIC technology)
• Computer software architecture (interface standards, communication protocols, operating systems, development tools, computer languages, and programming models such as parallel processing, streaming, object oriented, and scripting)
• System operational theory (e.g., communication or radar theory, with an understanding of the processing algorithms)
• System simulation tools (e.g., MATLAB)
Overview: Hardware Properties
Key flow-down system requirements of an embedded processing system
• Size
• Weight
• Power
• Life cycle costs
– Non-recurring development cost
– Recurring cost
• Rugged operating environment
• Durability / Reliability
• Maintainability
• Supportability
• Development schedule
• Development test and integration environment
• Functional requirements (many application specific)
Overview: Software Properties
Key flow-down system requirements of an embedded processing system
• Life cycle costs
– Non-recurring development cost
– Recurring cost
• Durability / Reliability
• Maintainability
• Supportability
• Development schedule
• Development test and integration environment
• Infrastructure requirements (many application specific)
– Build-time (drivers, libraries, interfaces)
– Run-time (services, clients, servers, etc.)
• Functional requirements (application specific)
Overview: Software Properties
Key Embedded Processing Application Characteristics
Complex algorithms
• Environment sensing and filtering
• Visualization
• Tracking
User interfaces
• Human to computer
• Computer to computer
Real-time operation
• Hard real-time sec - msec response times
• Testability / observability
Multirate (asynchronous events)
Overview: Balanced design (CRISP)
Cost of ownership (life cycle cost)
• Development cost, production cost, support cost (training, spares, repair, system upgrade, …)
Risks (technology risk, production risk, obsolescence, …)
Installation (weight, size, power, style, transportation, …)
Supportability (reliability, maintainability, …)
Performance (functionality, ease of use, throughput, …)
Overview: Real Design is a Compromise Among Conflicting Needs and Desires
Design has to balance multiple desires and constraints
• Mission environment
– Offboard info
– Communication requirements
– Weapon characteristics
• Physical environment
– Operating temperatures
– Storage temperatures
– Coolant characteristics
– Vibration levels
– Shock
– Prime power characteristics
– EMI/C requirements
• Physical characteristics
– Weight
– Size (O&M)
– Prime power utilization
– Cooling required
– Dissipation
– EMI/C characteristics
• Programmatic characteristics
– Development plan
– Production plan
– Risks
– Technology maturity
• Cost
– Recurring cost
– Development cost
– Life-cycle cost
• Functional performance
– Detection performance
– Tracking accuracies
– ID capabilities
– Weapon support
– Map characteristics
• Support
– Maintenance concept
– Reliability
– Maintainability measures
– Built-in test capability
• Add your need here
Source: Raytheon
Overview: Embedded computing examples
Whirlwind I – Lessons learned from the 40’s!
Core Memory Controller
Overview: Embedded computing examples
ENIAC – First fully electronic Turing machine
Overview: Embedded examples
Intel 4004 – First commercial single-chip microprocessor
Overview: Embedded examples
HP-35 (Wikipedia)
• The HP-35 was Hewlett-Packard's first pocket calculator and the world's first scientific pocket calculator (a calculator with trigonometric and exponential functions). Like some of HP's desktop calculators, it used reverse Polish notation. Introduced at US$395, the HP-35 was available from 1972 to 1975.
Overview: Embedded examples
PlayStation 3 – based on the IBM Cell processor; in 2007 it was way ahead of its time
Overview: Embedded Examples
Tianhe-2 – World's fastest computer as of 11/2014
Tianhe-2 is built from Intel Xeon Phi coprocessors
• Xeon Phi family: Knights Ferry / Knights Landing
• 14 nm process (Knights Landing)
Overview: Embedded examples
Qualcomm SoC -- Snapdragon 800 processors
Overview: Single Chip Computer
The single-chip computer, or processor, is the foundation of embedded computing today
• Embedded computational systems today are constructed with
– A single processor chip
– An array of processor chips
– An SoC (system on a chip) that contains processor cores
Therefore, understanding the key aspects of a processor is fundamental for an embedded system engineer
Overview: Physics of Software
Computing is a physical act.
• Computers abstract information but in fact do their work by moving electrons
• This is fundamentally why it takes time and energy to compute
Software performance and energy consumption are where we connect embedded computing to the real world
Embedded engineers make high-level decisions about the structure of their programs to greatly improve real-time performance and power consumption
Overview: Challenges in Embedded Computing System Design
How much hardware do we need?
How do we meet deadlines?
How do we minimize power consumption?
How do we design for upgradability?
Does it really work and meet the requirements?
How can you get the job done within the budget and schedule constraints?
Case Study: The Need Phase – Proposal Phase
The proposal phase usually consists of
• Defining the problem
– Identify customers and stakeholders
– Understand their needs
– Understand and develop the operational concept
– Identify the constraints
• Defining the system (or product) to be procured or built
– Building the system specification (procurement spec)
• Making sure the problem is solvable
– Identify risks and risk mitigation plans
– Investigate potential system designs
– Preliminary system modeling and performance assessment
– Preliminary program plan and schedule
– Development cost projection
• Developing product testing and evaluation strategies
• Writing a proposal
• Winning the contract
It is marketing, it is management, it is a lot of engineering, and it is about managing risk
Case Study: Problem Statement
The government, in conjunction with a prime contractor, intends to design and build a surveillance and reconnaissance radar system that will be compatible with existing UAVs as well as new, more advanced UAVs of the future
The radar's primary function is an air-to-ground capability to locate and disarm ground-moving troops, equipment, and air defense systems
[Figure: Current system: Predator. Future system: joint US Navy / US Air Force unmanned combat air vehicle]
Case Study: Nested Design Process (Radar Processing Embedded Subsystem)
[Figure: Nested Design Process for Complex Systems. At the system level, Conceptual Design and Preliminary Design produce the System Specification, System Architecture, and Subsystem Specifications for level #1 subsystems A, B, C, and D. The process repeats one level down: the conceptual design for subsystem B and Preliminary Design B produce the Subsystem Architecture, with level #2 subsystems B1, B2, and B3.]
Case Study: Problem Statement
Radar system engineers are responsible for consuming the system-level specifications and decomposing the requirements into level 1 requirements so that they can be allocated to the major subsystem components
• Antenna / Beam steering computer
• Receiver / Exciter
• Processor subsystem (this is our task)
Many additional requirements not explicitly specified in the level 1 spec, referred to as “derived requirements”, are also flowed down to the next level of the major subsystem components
Case Study: Problem Statement
[Figure: Radar System Block Diagram. Subsystems: Antenna Subsystem, Receiver/Exciter Subsystem, Processing Subsystem, System Interfaces. Blocks: Antenna; Stable Microwave Source; Frequency Synthesis / WF Gen; Power Amplifier; Low Noise Amplifier; Down-Conversion; A/D Conversion and Timing Generator; Digital Signal Processing; Control, Interface, and Data Processing. Labeled signals: Control; Image Information; Detected Objects; Commands, Motion Data; Radar Results, Health Info.]
System Design Process – (One view)
[Figure: Need + Desires → Potential Solutions → Baseline Solution → Detailed, Documented Baseline. The Conceptual Design and Preliminary Design phases are marked along this flow.]
Includes:
• Elicitation of need and requirements
• Design through insight, invention, and successive refinement
• Management of complexity through partitioning and creating well-posed lower-level design problems
Case Study: Why is systems engineering so challenging?
[Figure: Acquisition Framework and System Life Cycle, with milestones A, B, and C. Phases: Concept Refinement, Technology Development, System Development & Demonstration, Production & Deployment, Operations & Support. The ability to influence life cycle cost (LCC) falls as the program advances: high ability (70-75% of cost decisions made), less ability (85% of cost decisions made), little ability (90-95% of cost decisions made), minimum ability (95% of cost decisions made). Other figure labels: (10%-15%), (5%-10%), 28% Life Cycle Cost, 72% Life Cycle Cost. Roles: Combat Developer (TRADOC); Materiel Developer (PM, Total Life Cycle System Manager; Army Materiel Command).]
The most important decisions are made early in the design cycle, with the least amount of detailed information
Case Study: Problem Statement
Customer system-level specification describes
• Mission scenarios
• Threats
• Operational environment
• Platform resource allocation for the radar system
– Space
– Weight
– Cooling capacity
• Operator’s interfaces
• Mission stability (system shall run continuously for N hours)
• Plus many more “-ilities” requirements
Case Study: Problem Statement
The system-level specification and requirements allocation is a complex task
• Results of this work are documented and reviewed at the SDR
The major subsystem responsible engineers (REA) are part of the radar system team that does the allocation
• Involvement of stakeholders is required
Program manager (also part of the radar system engineering team)
• Establishes the Work Breakdown Structure (WBS)
• Allocates budget to each of the WBS line items
• Creates an integrated master plan (IMP)
• Creates an integrated master schedule (IMS)
Case Study: System Conceptual Design Phase Products
Baseline design
• Performance
• Risk
• Cost
• Schedule
• Other high-level attributes
Characterized by top-level budgets and supporting analysis
• Supported by enough lower-level design to give confidence in the numbers
Hardware, algorithms, signal processing sizing, software sizing
Top-level program plan
• Schedule
• Headcounts vs time
• Critical item development plans
• Top-level understanding of programmatic issues
Subsystem spec
Case Study: System Conceptual Design Review (SDR)
Review of the concept and supporting documents
Concept Analysis Review
• System modeling and simulation results
• Compare and contrast conceptual designs and review justifications for selected baseline
• Risk mitigation plan going forward
Establish the functional baseline
Approve the system specification
Case Study: Post-SDR flow-down requirements
Processor subsystem Level 1 requirements specification
• Space
• Weight
• Power
• Cooling
• “-ilities”
• Transmit waveform specifications (PRF, num coherent pulses transmitted/collected, sample rates, number of receive channels, phase coding, etc.)
• Processing algorithms (preliminary)
• Interfaces (Sensor data, Sensor command, Nav, mission computer, instrumentation system)
Case Study: Definition of terms
CPI: Coherent Processing Interval
• Series of pulses with a phase relationship, transmitted and collected, that can be coherently processed
PDI: Post Detection Integration
• Multiple CPIs are non-coherently integrated
Dwell time
• Time to radiate a single beam position on the ground
Bars
• Number of beam positions to radiate a swath
PRF: Pulse repetition frequency
• Rate at which pulses are transmitted
PRI: Pulse repetition interval = 1/PRF
• Time from the start of one pulse to the next
Pulse modulation
• Amplitude and phase superimposed on the pulse during the duration of a pulse
Receive channels
• Radar antennas are typically partitioned into subarrays that have physically offset phase centers, each connected to a separate receiver and A/D
[Figure: Ground area of interest divided into Swath 1, Swath 2, and Swath 3; axes: Azimuth and Range]
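The timing terms above are related by simple arithmetic. The sketch below is illustrative only: the function names and the numeric values (2 kHz PRF, 64 pulses per CPI, 3 CPIs per dwell) are assumptions, not case-study parameters, and it assumes CPIs follow one another with no gaps.

```python
def pri_seconds(prf_hz: float) -> float:
    """Pulse repetition interval: time from the start of one pulse to the next."""
    return 1.0 / prf_hz


def cpi_seconds(prf_hz: float, pulses_per_cpi: int) -> float:
    """Duration of one coherent processing interval (a burst of pulses)."""
    return pulses_per_cpi * pri_seconds(prf_hz)


def dwell_seconds(prf_hz: float, pulses_per_cpi: int, cpis_per_dwell: int) -> float:
    """Time spent on one beam position, assuming back-to-back CPIs."""
    return cpis_per_dwell * cpi_seconds(prf_hz, pulses_per_cpi)


# Hypothetical waveform values, chosen only to make the arithmetic easy to follow.
PRF_HZ = 2000.0
PULSES_PER_CPI = 64
CPIS_PER_DWELL = 3

print(f"PRI   = {pri_seconds(PRF_HZ) * 1e3:.3f} ms")
print(f"CPI   = {cpi_seconds(PRF_HZ, PULSES_PER_CPI) * 1e3:.1f} ms")
print(f"Dwell = {dwell_seconds(PRF_HZ, PULSES_PER_CPI, CPIS_PER_DWELL) * 1e3:.1f} ms")
```

Multiplying the dwell time by the number of beam positions gives a first estimate of the time to cover a swath, which is the quantity the scan-time requirement constrains.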
Case study: GMTI (Ground Moving Target Indicator) waveform
Key parameters for embedded subsystem design
• Number of CPIs/dwell
• Number of pulses/CPI
• Pulse modulation: LFM (linear frequency modulation)
• Number of receive channels
• PRF
• Number of swaths/scan area
• Scan area rate
[Figure: A dwell made up of CPI 1 through CPI M, each containing pulses 0 to N]
Case study: GMTI Processing Algorithm (CPI processing)
[Figure: CPI processing chain: I/Q Formation → Pulse Compression → Motion Compensation → Doppler Filtering → Clutter Cancellation → Noise Estimation → Target Detection → PDI Processing]
Case study: GMTI Processing Algorithm (PDI processing)
[Figure: PDI processing chain: CPI Processing → Ambiguity Resolving → Noise Estimation → Sidelobe Detection Rejection → False Alarm Control (M of N Processing) → Angle Estimation → Target Parameter Estimation → Hit List]
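The "M of N" false alarm control step in the chain above can be illustrated with a small sketch. This is an assumed, simplified form of the rule (a cell is confirmed only if it is detected in at least M of the N CPIs in a dwell); the function name, the (range bin, Doppler bin) cell representation, and the sample data are hypothetical, and the slides do not give the actual rule used in the case study.

```python
from collections import Counter


def m_of_n(detections_per_cpi, m):
    """Return the cells detected in at least m of the CPIs.

    detections_per_cpi: one collection of detected cells per CPI, where a
    cell is any hashable key such as (range_bin, doppler_bin).
    """
    counts = Counter()
    for cells in detections_per_cpi:
        counts.update(set(cells))  # count each cell at most once per CPI
    return {cell for cell, hits in counts.items() if hits >= m}


# Hypothetical detections from N = 3 CPIs of one dwell.
looks = [
    {(10, 3), (42, 7)},   # CPI 1: one persistent cell, one isolated hit
    {(10, 3)},            # CPI 2
    {(10, 3), (99, 1)},   # CPI 3: another isolated hit
]

confirmed = m_of_n(looks, m=2)
print(confirmed)
```

Raising M suppresses isolated noise hits at the cost of missing weak targets, which is how this step trades false alarm rate against probability of detection.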
Homework
Text: Computers as Components: Principles of Embedded Computing System Design, by Professor Wayne Wolf
• Text link: available in the CCLE 15S-ENGR180-1 Information Folder
• http://ceng2.ktu.edu.tr/~ulutas/Courses/EmbeddedSystems/0123743974.pdf
Read: Chapter 1, Embedded Computing
• Introduction
• 1.1 Complex Systems and Microprocessors
Write up to a 1-page discussion answering
• Why are microprocessors used in complex system designs?
Engineering 180: Systems Engineering
Embedded Processing Case Study
Lecture 2
May 26, 2015
Steve Kirsch
Outline for Lectures on Real-Time Embedded Processing
Lecture 2
• Conceptual Design
Lecture 2: Agenda: Conceptual Design Process
Review Homework
Starting point for conceptual design of the embedded processing
Feasibility and Requirements Analysis
Embedded processing design synthesis
Subsystem concept design review process
Homework
Homework 1: review
Why are microprocessors used in embedded computing systems?
Many examples of processors used in embedded computing
• Perhaps experience tells us that something about this approach is fundamental
Large variety of processors to choose from
• High potential that there is a best fit
The alternative to processors is custom hardwired logic. Advantages over this alternative:
• Easier to design and debug
• Allows for the possibility of upgrades and adding new functionality
More efficient than custom logic
• A custom design will have some logic dedicated to sub-functions that aren’t active all the time. A microprocessor’s logic is reused for all sub-functions
• Microprocessors are application agnostic, so we can leverage huge investments made by others. Application-specific logic can be implemented in software
Microprocessors can be faster than custom logic (seems almost counterintuitive!)
• They utilize the latest manufacturing processes
• Resources are available for access to the best experts and large design teams
• They can overcome the overhead of interpreting instructions with clever utilization of parallelism
Homework 1: review
What differentiates embedded computing from other forms of computing?
• Program must meet deadlines
• Must be fast enough
• Needs deterministic behavior to guarantee it will be fast enough
To understand the real-time behavior of an embedded computing system, one needs to understand its components from the lowest level to the highest level of the system.
What are the 5 components, from the lowest to the highest?
• CPU: processors plus memory
• Platform (CPU scaffolding): components supporting the CPU (e.g., buses, I/O devices)
• Program: Programs can be very large, and the CPU sees only a very small window of the program at any one time. We must consider the structure of the program to determine the overall behavior of the system
• Tasks: We generally run several programs simultaneously on a CPU, creating a multitasking system. Tasks interact with each other in ways that have profound implications for performance
• Multiprocessors: A system can have many microprocessors, all interacting with each other and potentially with accelerators. The interaction can be very complex to analyze, and it determines the overall system performance.
Concept Design Phase: Embedded Processing
A BAA (Broad Agency Announcement) typically precedes the RFP (request for proposal) when contracting with the US government
• This is a head start on preparing for the RFP
RFP released
• Procurement specs reviewed and analyzed
• Enormous effort applied at this stage to develop a system design concept or concepts
• Proposal written and submitted, often leveraging years of IR&D
• Contract won!
System engineers decomposed and allocated the level 1 requirements
• Produced preliminary subsystem specifications
Subsystem development team leads identified
• Program manager
• Subsystem architect (head technical subsystem engineer)
• Development team leads (tech leads)
– Hardware unit lead
– Mode software lead
– Infrastructure software lead
This is our starting point for the Embedded Processing Concept design phase
The Conceptual Design Process
Embedded Processing Concept Design: Stakeholders Requirements C1
Who are the stakeholders?
• Our subsystem team
– Subsystem Program Manager
– Subsystem Architect
– Tech Leads and their development teams
• Customers
– System Team
– System Program Manager
– Contracting organization
• Vendors and Suppliers
• Test and Integration team
Requirement sources
• Procurement spec
• Subsystem specs (Generated by tier 1 system team)
• KPPs (Key Performance Parameters identified by customer or system team)
• System TPMs (Technical Performance Measures)
• SRD (system requirements document)
• SDD (system design description)
• SRR (system requirements review material)
• Vendor components specifications
• Legacy systems components *
• Standards *
• Laws of Physics
• Company development procedures, ethics, rules
• Laws of the land and point of deployment
• Common sense
* Potential requirement source
Embedded Processing Concept Design: Feasibility Analysis C2
Identify the possible processing solutions
Study the viability of these solutions according to the flow-down requirements
• Performance, cost, schedule, risk, supportability, …
Key questions:
• Can we design the embedded processing to run in real time while meeting the SWaP-C (Space, Weight, and Power - Cost) requirements?
• What are the key risks?
• How can the risks be reduced?
• Are the preliminary subsystem specifications reasonable? Could they be modified to reduce the risks and still meet the main system objectives?
Embedded Processing Concept Design: Feasibility & Req Analysis C2 & C3 -- Step 1:
Understand the requirements, focusing first on the primary requirements that will likely drive the top-level design
For our case study, the real-time signal processing requirement is key
The system requirement is to scan an area of interest in N seconds, process the real-time data, and produce a hit report of all ground movers within the AoI with a false alarm rate of R and a probability of detection P.
The system flow-down requirements have specified the waveforms and the signal processing algorithms that can achieve this system performance
• As one begins to drill down to the next level of detail, some requirements might not be achievable within the scope of other requirements
• Requirements can be modified at this stage to help achieve the primary system goals
It is up to you to accept only requirements that can be achieved
[Figure: Area of Interest (AoI) divided into Swath 1, Swath 2, and Swath 3]
Embedded Processing Concept Design: Feasibility & Req Analysis C2 & C3 -- Step 2:
Derive the key performance parameters for the embedded processing subsystem
• Data rates
• Memory requirements
• Processing throughput requirements based on the required processing algorithms
[Figure: Coherent GMTI Processing Algorithm: I/Q Formation → Pulse Compression → Motion Compensation → Doppler Filtering → Clutter Cancellation → Noise Estimation → Target Detection → PDI Processing]
Deep dive into data rates:
Data rates can tell the embedded processing subsystem engineer a lot about the problem
In our case study, the sensor produces a very high-rate input data stream (10s of Gsamples/sec)
• A/D rates
• A/D sample word size (often a function of data rate)
• Number of input data channels
• REX (receiver/exciter) to processor network bandwidth and protocol
– How is the data packaged and shipped?
– How much extra bandwidth is needed for the protocol (e.g., error correction coding)?
– What is the receive duty? (How much of the total time is data streaming?)
• Synchronous or asynchronous data flow
– Flow control
– How much rate buffering is required?
– How is data synchronization achieved?
Data rates will drive memory and processing requirements
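The factors listed above combine into a first-order estimate of the average input data rate. The sketch below is a back-of-the-envelope model under stated assumptions: the function name and every numeric value are hypothetical, and protocol overhead is modeled as a simple fractional increase over the payload.

```python
def input_data_rate_bytes_per_s(
    adc_rate_sps: float,      # A/D samples per second, per channel
    bits_per_sample: int,     # A/D sample word size
    channels: int,            # number of input data channels
    receive_duty: float,      # fraction of time data is actually streaming (0..1)
    protocol_overhead: float  # extra bandwidth for packaging/ECC, as a fraction of payload
) -> float:
    """Average bytes per second arriving at the processor."""
    raw_bits_per_s = adc_rate_sps * bits_per_sample * channels
    return raw_bits_per_s * receive_duty * (1.0 + protocol_overhead) / 8.0


# Hypothetical values, not taken from the case study.
rate = input_data_rate_bytes_per_s(
    adc_rate_sps=1.0e9,
    bits_per_sample=16,
    channels=4,
    receive_duty=0.5,
    protocol_overhead=0.25,
)
print(f"Average input rate: {rate / 1e9:.1f} GB/s")
```

This is an average. During a receive window the instantaneous rate is higher by a factor of 1/receive_duty, which is what the rate buffering question above is about.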
Deep dive into data rates: Receive Duty
[Figure: Typical Radar Processing – One Beam Position (Dwell). N pulses, each with a receive window of A/D samples (range bins); the collected samples feed a Doppler filter bank.]
Deep dive into data rates: System Front End Duty
System requires multimode operation
• Tight interleaving of front-end resources is desired for best system performance
[Figure: Data collection timeline interleaving Mode 1, Mode 2, Mode 1]
Deep dive into data rates:
Pipeline processing of an AoI
• A complete dwell of data is collected prior to processing
• Memory is required to hold a complete set of data (how big would this be?)
[Figure: Timelines for dwells D0, D1, D2. Data collection: Mode 1 collection, Mode 2 collection, Mode 1 collection. Data processing: Mode 1 D0 processing, Mode 2 D1 processing.]
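One way to approach the "how big would this be?" question is to multiply out the dimensions of a dwell. The sketch below assumes a dwell is stored as complex samples indexed by channel, CPI, pulse, and range bin; the function name and all numbers are hypothetical, and the factor of two for double buffering (one dwell being processed while the next is collected) is an assumption about the pipelined design, not something the slide states.

```python
def dwell_memory_bytes(
    channels: int,
    cpis_per_dwell: int,
    pulses_per_cpi: int,
    range_bins: int,
    bytes_per_sample: int,
) -> int:
    """Bytes needed to hold every sample of one dwell."""
    return channels * cpis_per_dwell * pulses_per_cpi * range_bins * bytes_per_sample


# Hypothetical sizing: 4 channels, 3 CPIs, 64 pulses, 4096 range bins,
# 4 bytes per complex sample (16-bit I + 16-bit Q).
one_dwell = dwell_memory_bytes(4, 3, 64, 4096, 4)
double_buffered = 2 * one_dwell  # assumed: process one dwell while collecting the next

print(f"One dwell:       {one_dwell / 2**20:.0f} MiB")
print(f"Double buffered: {double_buffered / 2**20:.0f} MiB")
```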
Deep dive into data rates:
Pipeline with overlap processing
• Data is processed while it is being collected
• Required memory size is reduced
[Figure: Timelines for dwells D0, D1, D2. Data collection: Mode 1 collection, Mode 2 collection, Mode 1 collection. Data processing: Mode 1 D0 processing, Mode 2 D1 processing, Mode 1 D2 processing.]
Deep dive into data rates:
Parallel Processing
• Required memory size could be larger than in the fully pipelined architecture
• Required processing performance per processor is reduced
• Notice that the time to get the results from processing dwell D0 (latency) is longer in this case
[Figure: Timelines for dwells D0, D1, D2. Data collection: Mode 1 collection, Mode 2 collection, Mode 1 collection. Processor 0: Mode 1 D0 processing. Processor 1: Mode 2 D1 processing. Processor 2: Mode 1 D2 processing.]
Deep dive into data rates: What have we learned?
Keeping up with the real-time input data rate presents architecture trade-offs that can be used to balance the requirements for memory and processor performance
Parallel processing approaches that use an array of potentially slower processors could add processing latency but provide more overall throughput
• Latency is more important for some applications than others
– Air-to-air alert-confirm modes need very short latency
– SAR ground maps typically have very loose latency requirements
There are many opportunities to exploit processing parallelism once system requirements are fully understood
• For highly computation-intensive signal processing, exploiting parallelism is typically required to meet system requirements
• Exploiting parallelism can be a cost-effective approach for many less computation-intensive applications as well
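The throughput versus latency trade-off can be made concrete with a simplified model. The sketch below assumes dwells arrive back to back, each dwell is handed to one processor in round-robin order, and processing starts only after the dwell is fully collected. The function names and timings are hypothetical.

```python
def processors_needed(collect_ms: int, process_ms: int) -> int:
    """Smallest number of round-robin processors that keeps up with the input.

    A processor must be free again by the time its next dwell has been
    collected, so we need ceil(process_ms / collect_ms) of them.
    """
    return -(-process_ms // collect_ms)  # integer ceiling division


def result_latency_ms(collect_ms: int, process_ms: int) -> int:
    """Time from the start of collecting a dwell until its results are ready."""
    return collect_ms + process_ms


COLLECT_MS = 100  # hypothetical time to collect one dwell

for name, process_ms in [("one fast processor", 100), ("slower processors", 300)]:
    n = processors_needed(COLLECT_MS, process_ms)
    latency = result_latency_ms(COLLECT_MS, process_ms)
    print(f"{name}: {n} needed, latency {latency} ms")
```

Under these assumptions, three processors that are each three times slower sustain the same dwell rate as one fast processor, but each result arrives 200 ms later.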
Deep dive into processor performance:
Software performance drivers
• Algorithm design: A clever way of doing a computation can sometimes provide dramatic improvement in system processing performance
• Compiler efficiency: An optimizing compiler can generate efficient code (fewer hardware instructions needed per high-level language instruction)
• OS responsiveness: A real-time operating system (RTOS) can be designed to require minimal resources for the OS itself
Hardware performance drivers
• Processor execution speed: An optimized processor can execute more hardware instructions per second (or per watt)
• Memory effective bandwidth: Memory and memory bus can provide sufficient data and instruction access to keep up with the processor
• I/O system effective bandwidth: External interfaces/network provide sufficient input and output bandwidth to keep the processor busy
Balanced design requires each of these factors to be considered in allocating system resources
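A quick way to check whether a design is balanced is to estimate how long each hardware resource would need for one unit of work and see which one dominates. The sketch below is a rough first-order model, assuming compute, memory traffic, and I/O overlap perfectly so that the slowest resource sets the pace; the function names and all workload and hardware numbers are hypothetical.

```python
def resource_times_s(
    flops_needed: float, bytes_from_memory: float, bytes_io: float,
    peak_flops: float, memory_bw: float, io_bw: float,
) -> dict:
    """Time each resource would need for the workload if it were the only limit."""
    return {
        "compute": flops_needed / peak_flops,
        "memory": bytes_from_memory / memory_bw,
        "io": bytes_io / io_bw,
    }


def bottleneck(times: dict) -> str:
    """Name of the resource with the longest time (the limiting one)."""
    return max(times, key=times.get)


# Hypothetical workload for one dwell and hypothetical hardware limits.
times = resource_times_s(
    flops_needed=2e9, bytes_from_memory=8e9, bytes_io=1e9,
    peak_flops=10e9, memory_bw=16e9, io_bw=5e9,
)
print(times)
print("Limiting resource:", bottleneck(times))
```

In this made-up case a faster processor would not help, because memory bandwidth is the limit; that is the sense in which each factor has to be considered when allocating resources.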
Deep dive into processor performance: Requirement Allocation can greatly affect performance
A critical part of the embedded processing concept design is allocating functional requirements to hardware vs. software
In many cases this allocation is quite simple
• Memory storage (hardware)
• ALU functions, the lowest level of a computation engine (hardware)
• Basic operating system functions such as mutexes, semaphores, and thread scheduling (software)
Many functions will have both hardware and software components
• Ethernet interfaces
• DMA controllers (discussed later in detail)
It is very important to define the boundary between hardware and software
• ISAs (Instruction Set Architectures) are commonly used to define the boundary between hardware and software for a programmable device
• To the software, the ISA is an abstraction of the hardware
• To the hardware, it is a specification of what the hardware is required to do
Hardware Performance Drivers
Key performance drivers of an embedded processor
• Memory bandwidth
• Network bandwidth
• I/O bandwidth
• CPU OPS (operations per second)
– Signal processing performance is usually expressed in FLOPS (floating-point operations per second)
How can we optimize for performance?
• Exploit parallelism
• Utilize a CPU that best fits the job
[Figure: Classic Harvard Architecture. Blocks: CPU, Clock, Program Memory, Data Memory, I/O Ports]
Exploiting Parallelism To Improve Performance
Parallelism is present in multiple forms
• Thread or Task Level Parallelism (TLP)
Wikipedia: Task parallelism (also known as function parallelism and control parallelism) is a form of parallelization of computer code across multiple processors in parallel computing environments. Task parallelism focuses on distributing execution processes (threads) across different parallel computing nodes. It contrasts to data parallelism as another form of parallelism.
Thread Level Parallelism: cont.
Independent execution streams can execute in parallel, all working toward a single goal
• Example: the multiple processor example shown earlier, processing multiple AoIs in parallel
Simultaneous multithreaded operation is commonly supported within modern processors
• Multiple cores running independent threads
• Multiple hardware threads within a single core (SMT, simultaneous multithreading, or hyper-threading)
Most modern operating systems support simultaneous multithreading
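A minimal sketch of thread level parallelism, assuming a hypothetical per-AoI workload (`process_aoi`, here just a sum of squares) and made-up sample data. It is meant only to show independent execution streams working toward one goal, not the case study's actual processing.

```python
from concurrent.futures import ThreadPoolExecutor


def process_aoi(samples):
    """Hypothetical per-AoI workload: sum of squared samples."""
    return sum(s * s for s in samples)


def process_all_aois(aois, max_workers=4):
    """Submit one independent task per AoI; results keep the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(process_aoi, aois))


if __name__ == "__main__":
    # Illustrative data only: three AoIs of different sizes.
    print(process_all_aois([[1, 2, 3], [4, 5], [6]]))
```

In CPython, threads running pure Python code share one interpreter lock, so this shows the structure of the decomposition rather than a measured speedup.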
Exploiting Parallelism To Improve Performance
Data Level Parallelism (DLP)
Wikipedia: Data parallelism is a form of parallelization of computing across multiple processors in parallel computing environments. Data parallelism focuses on distributing the data across different parallel computing nodes. It contrasts to task parallelism as another form of parallelism.
Data level parallelism: cont.
Data is organized so that the same operation can be performed on every element of the data set at the same time.
This form of parallelism is abundant in Radar signal processing (will be discussed later when we return to the GMTI algorithm)
Can be exploited classically in two typical ways
• Multi-core processor
• SIMD (single instruction multiple data) processing cores
[Figure: multi-core µP, with input data distributed to Core 1 through Core N and collected as output data; µP core, with one instruction driving ALU 0 through ALU N from a shared register file]
Other Hardware Architecture Parallelism
DMA (Direct Memory Access) is a form of this parallelism
Without DMA capability, CPU cycles are required to move data from memory to an I/O device
With DMA, a few CPU cycles are used to set up the transfer, and the CPU can then do other work in parallel with the data movement
[Figure: transfers without DMA vs. transfers with DMA]
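The DMA trade-off can be sketched as simple cycle bookkeeping. The function names, the cycles-per-word cost, and the setup cost below are hypothetical placeholders, not figures for any particular processor.

```python
def cpu_cycles_without_dma(n_words, cycles_per_word):
    """The CPU copies every word itself, so its cost scales with transfer size."""
    return n_words * cycles_per_word


def cpu_cycles_with_dma(setup_cycles):
    """The CPU only programs the DMA engine; the engine moves the data."""
    return setup_cycles


def cpu_cycles_freed(n_words, cycles_per_word, setup_cycles):
    """CPU cycles made available for other work by offloading the transfer."""
    return cpu_cycles_without_dma(n_words, cycles_per_word) - cpu_cycles_with_dma(setup_cycles)


if __name__ == "__main__":
    # Assumed numbers: 4096 words, 2 CPU cycles per word, 50 cycles of DMA setup.
    print(cpu_cycles_freed(4096, 2, 50))
```

The model ignores contention between the DMA engine and the CPU for the memory bus, which a real analysis would need to include.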
Optimizing Performance by: Utilizing the Best Fit CPU Type
Some multicore µPs contain SIMD engines
• Freescale, Intel
DSP Chips (Digital Signal Processing)
• Texas Instruments
GPGPUs (General Purpose Graphics Processing Units)
• Nvidia, Intel, AMD/ATI
FPGAs (Field Programmable Gate Arrays)
• Altera, Xilinx
ASICs (Application Specific Integrated Circuits)
• VLSI, Softchip, Micronix Integrated Systems
Trade-offs between GPPs, DSPs, and FPGAs
Property | Field Programmable Gate Arrays (FPGAs) | Digital Signal Processors (DSPs) | General Purpose Processors (GPPs)
Throughput per Watt | Highest (fixed point arithmetic) | More than GPPs | Lowest
Ease of Application Programming | VHDL required | Limited support of HOL programming | Full-featured support of HOL programming
Support for Floating Point Arithmetic | Only with large performance penalty (products with optimized floating point starting to appear) | Limited number of products available with floating point | Yes. Some high-end GPPs have SIMD floating point vector units
Recurring Cost per Component | Highest | Lowest | High performance GPPs more expensive than DSPs
When to Consider Using | When (fixed point) throughput per watt is most important | When recurring cost is most important | When time to market is the dominant issue or when high performance floating point arithmetic is essential
Comparing ASICs and FPGAs
An Application-Specific Integrated Circuit (ASIC) is an integrated circuit (IC) customized for a particular use, rather than intended for general-purpose use.
A Field-Programmable Gate Array (FPGA) contains programmable logic blocks and programmable interconnects that allow the same FPGA to be used in many different applications.
Property | ASIC | FPGA
Requires foundry run(s) for each application | Yes | No
Typical Development Cost | High | Moderate
Typical Development Schedule | Lengthy | Moderate
Recurring Cost | Moderate | High
Functional Density | High | Moderate
Power Consumption per Function | Lower | Higher
Maximum Clock Frequency | Higher | Lower
The Conceptual Design Process
System Concept High Level Embedded Processing System Architecture
[Figure: block diagram with blocks labeled REX, General Purpose Processing, Mission Computer, INS/GPS, Signal Processing, High Speed Instrumentation System, and Embedded Processor]
Subsystem Level Synthesis
This is where the rubber meets the road
Multiple options are created and assessed
A baseline architecture is selected and performance is estimated (usually roughly at this stage)
• Performance of relevant, similar systems can be a useful guide
• Scaling validated performance from an earlier fielded system or IR&D efforts can greatly reduce risk
Performance assessment at this development stage
• Difficult for revolutionary designs
• Easier for evolutionary designs
The full set of flowdown requirements will be too much to evaluate in detail at this stage
• Focus on the KPPs (Key Performance Parameters)
• Use SMEs (subject matter experts) to help focus on the highest risk, largest impact requirements
• Identify the key items to evaluate
For our case study we know from experience that the signal processing will utilize the bulk of the SWAP-C and drive system performance
Evaluating CPU Performance
Best approach would seem to be to run the actual application software workload on each candidate processor and measure required CPU Time for each
Possible complications with this approach
• Application software may not be available early in development process when CPU must be selected
• Application software may include dependencies on Operating System and on external interfaces
• Multiple versions of the application software may need to be created
• May be difficult to remove effects of CPU idle time due to waiting for external events and I/O transfers
• Candidate processor may not exist yet. Evaluation may have to be done on a slowly-executing simulation
Often the best way to obtain a comparison is to use one or more software benchmarks that adequately represent the application
Using Benchmarks to Evaluate CPU Performance
Benchmarking strategy is to obtain (or create if necessary) relatively simple sequences of code that together represent the most computationally-intensive algorithms of the application
• Requires some insight on the part of the subsystem engineer to be able to identify these a priori
Use of multiple benchmarks creates an understanding of how well each candidate does on each algorithm
• The best CPU on one algorithm may not be the best on other algorithms
CPU selection will need to be based on balanced design principles, considering best overall benchmark performance as well as many other factors
• In particular, power consumption will be important for most embedded applications
Benchmark results may suggest compiler optimizations or even CPU architectural enhancements that will dramatically improve performance
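One way to combine several benchmark results into an overall comparison is sketched below. The candidate names, benchmark names, and scores are invented for illustration, and the geometric mean of ratios to a baseline is only one common aggregation choice; power and the other balanced-design factors are deliberately left out.

```python
import math


def geometric_mean(values):
    """Geometric mean, suited to averaging performance ratios."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


def rank_candidates(scores, baseline):
    """scores maps cpu -> {benchmark: runs per second}.

    Returns (cpu, overall ratio to baseline) pairs, best first.
    """
    base = scores[baseline]
    overall = {
        cpu: geometric_mean([results[b] / base[b] for b in base])
        for cpu, results in scores.items()
    }
    return sorted(overall.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical data: cpu_b wins on FFT but loses on Viterbi.
    scores = {
        "cpu_a": {"fft": 100.0, "viterbi": 50.0},
        "cpu_b": {"fft": 200.0, "viterbi": 25.0},
        "cpu_c": {"fft": 150.0, "viterbi": 75.0},
    }
    for cpu, ratio in rank_candidates(scores, baseline="cpu_a"):
        print(cpu, round(ratio, 3))
```

In this made-up data set the candidate that is fastest on one benchmark does not come out best overall, which is the situation the slide warns about.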
SPEC: Industry Standard Benchmarks
Standard Performance Evaluation Corporation (SPEC) established in 1988 by a consortium of computer vendors to create standard benchmarks for computer systems (www.spec.org)
• Originally intended to benchmark performance of servers and workstations, using CPU-intensive benchmarks
Has since expanded to include benchmarks for graphics, Java applications, client-server models, mail systems, file systems, and Web servers
CPU vendors normally execute benchmark suite and provide documented results
Serious effort made to produce benchmarks that avoid misleading comparisons, with strictly specified execution rules and reporting requirements
SPEC Benchmark Suites
SPEC CPU2006 contains two benchmark suites:
CINT2006 for measuring and comparing compute-intensive integer performance
CFP2006 for measuring and comparing compute-intensive floating point performance
Performance is expressed as the number of times the benchmark algorithm can be executed per unit time by the CPU being evaluated
Note: SPEC benchmarks measure the combined performance of the CPU and its compiler code generation capability
Besides SPEC, many other benchmarks are available, and it’s usually feasible to create application specific benchmarks when needed
• SPEC provides a good model of how to construct and use benchmarks to make fair “apples-to-apples” comparisons between candidate CPUs
EDN Embedded Microprocessor Benchmark Consortium (EEMBC)
Less well-known than SPEC, but more relevant to most embedded systems
Non-profit consortium supported by member dues and license fees
Real world benchmark software helps designers select the right embedded processors for their systems
Standard benchmarks and methodology ensure fair and reasonable comparisons
EEMBC Technology Center manages development of new benchmark software and certifies benchmark test results
Originally started under the sponsorship of Electrical Design News (EDN)
• Formed in 1997 to develop meaningful performance benchmarks for the hardware and software used in embedded systems
EEMBC Benchmarks (Partial List)
Digital Entertainment
• AES
• DES
• High-Pass Gray-Scale Filter
• Huffman Decoding
• MP3 Decode
• MPEG-2 Decode
• MPEG-2 Encode
• MPEG-4 Decode
• MPEG-4 Encode
• RGB to CMYK Conversion
• RGB to YIQ Conversion
• RSA
Telecom Version 1.1
• Autocorrelation
• Bit Allocation
• Convolutional Encoder
• Fast Fourier Transform (FFT)
• Viterbi Decoder
Signal processing performance evaluation
We will need to exploit data level parallelism
SIMD engines in our CPU are a good candidate for exploiting this type of parallelism
• SIMD performance can’t be evaluated by target agnostic benchmarks
• Vectorized libraries are required to utilize SIMD engines
• Use processor vendor characterized library timing to estimate the clock cycles needed to process a dataset of a particular size. Examples:
VSIPL standard signal processing library
Mercury Computer Systems SAL
Intel MKL
Memory bandwidth to feed the CPU is critical when using SIMD engines
• Evaluate data rates from memory system and CPU memory interfaces
• Determine number of processor clocks to move data
Determine the performance driver
• Compute cycles
• Memory bandwidth
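The comparison of compute cycles against data movement cycles can be sketched as follows. All parameter values are hypothetical stand-ins for vendor library timings and memory interface data rates, and the model assumes computation and data movement overlap fully, so the larger of the two counts is treated as the driver.

```python
def estimate_cycles(n_elements, compute_cycles_per_element,
                    bytes_per_element, memory_bytes_per_cycle):
    """Estimate processor clocks spent computing vs. moving data."""
    compute = n_elements * compute_cycles_per_element
    memory = n_elements * bytes_per_element / memory_bytes_per_cycle
    driver = "compute" if compute >= memory else "memory bandwidth"
    return {"compute_cycles": compute, "memory_cycles": memory, "driver": driver}


if __name__ == "__main__":
    # Assumed: 1024 complex samples of 8 bytes, 0.5 cycles per sample with SIMD,
    # and 4 bytes per clock from the memory system.
    print(estimate_cycles(1024, 0.5, 8, 4))
```

With these assumed numbers the SIMD engine finishes its arithmetic well before the data can be delivered, so memory bandwidth rather than compute would be the performance driver.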
Signal Processing performance evaluation
Which library functions are the important ones?
We will explore this in more detail in the detailed design phase
[Figure: processing chain with blocks labeled I/Q Formation, Pulse Compression, Motion Compensation, Doppler Filtering, Clutter Cancellation, Noise Estimation, Target Detection, and PDI Processing]
Subsystem Level Synthesis (C4): Summary
Purpose: To select a subsystem level functional design.
Many trade studies are required
Modeling and simulations can be effective tools
Use of SMEs critical to help focus on key requirements
Conceptual design is an iterative process.
The Subsystem requirements are often revised based on the lessons learned during the design synthesis process
Requirement flow down and traceability are key to this process.
The Conceptual Design Process
Subsystem Design Review
Typically done after the system level SRR
• Major program milestone
Subsystem Concept Review is often part of the system level PDR
Objectives are very similar to those at the system level
• “Determine what needs to be done”
• “Establish the baseline for the next design phase”
• “Show how the baseline will meet the requirements”
Subsystem Design Review: Specific objectives
Show an understanding of the complete set of flow down requirements
Specify derived requirements that constrain the design
• Show traceability back to the system requirement
Present the baseline architecture
• Where options are still under consideration, show the multiple approaches that will be selected from in the PDR stage
Document and review the analysis that led to the baseline architecture
Identify risks
Create a risk mitigation plan
Generate a preliminary requirements compliance matrix
Identify the subsystem TPMs (Technical Performance Measures)
Subsystem Design Review: TPMs Risk mitigation, Requirements Compliance
TPMs
• Identified at this early stage in the design
• Will be monitored and tracked at each subsequent design phase (PDR, CDR)
• They are an important tool to manage technical risk
• TPMs are dropped and added depending on the uncertainty and risk factors
Risk management plans should be in place for high priority TPMs
• Examples: SWAP, processing margin
• Plans should list tasks that will be done to mitigate the risks with the highest probability and highest impact
Requirements Compliance Matrix
• Shows flow down requirements
• Shows derived requirements and linkage back to the higher level spec
• Shows test method for each requirement
Validation by analysis
Validation at unit test level
Validation in the system integration lab
Validation in a deployed environment
Subsystem Concept Design Phase: Summary
This early design phase is extremely important
• Misinterpretation of requirements can result in a system that doesn’t meet customer expectations
• Overlooked design alternatives can result in a sub-optimal, non-competitive system
• Risks that are missed and discovered later in the design phase can be very costly
• Flawed analyses can result in a system that just doesn’t work
Chances of success will be greatly enhanced by following a sound system design process
Homework
Read: Computers as Components by Wayne Wolf
• Section 4.5 Designing with Microprocessors
• Section 4.7 System-level Performance
Write a short discussion answering
• What are a few important processing subsystem performance drivers? Discuss how you would analyze these performance drivers for our Radar embedded processor case study.
Engineering 180 Systems Engineering
Embedded Processing Case Study
Lecture 3
May 28, 2015
Steve Kirsch
Outline for Lectures on Real-Time Embedded Processing
Lecture 3
• Preliminary Design
Lecture 3: Agenda: Preliminary Design Process
Starting point for Preliminary design of the embedded processing
Hardware Architecture Design
Software Architecture Design
Architecture Performance Analysis
Homework
The Goal of Preliminary Design
Thoroughly understand the system level requirement
Allocate the top level subsystem architecture requirements
• Identify the next level down subsystems and interfaces
• Flow down subsystem level requirements to these lower level subsystems
The main output from preliminary design is the allocated baseline (hardware and software baselines)
• Design description and analysis
• Requirements flow-down traceability
• Draft of a test compliance matrix
• PDR -- Preliminary Design Review
Nested Design Process for Complex Systems
System design will have completed preliminary design
• Major subsystem interfaces defined
• Behavioral functionality of the subsystems defined
• Allocation of SWAP to subsystems defined
• Allocation of ilities to subsystems defined
• Master Development Plan updated with more detail
[Figure: nested design process diagram with callouts “This step completed” and “We are at this step”]
Preliminary Design Process
P1. Subsystem Requirements Analysis
Preliminary Subsystem Architecture
P2. Requirements Allocation
P3. Interface identification/design
P4. Subsystem-level synthesis
P5. Preliminary design review
To detailed design
Preliminary Design Phase: Embedded Processing
System design is in the detailed design phase
• Specs revised as system design evolves
• System design risks in process of being mitigated
Analysis results becoming available
New unexpected problems being discovered
• Customer requirements potentially changing
• SOW changes due to cost and schedule updates
• System test and integration details developing
Impacts on subsystem design
New requirements
This is our starting point for the Embedded Processing Preliminary design phase
Embedded subsystem concept design
• Rough idea of interface requirements
• Rough idea of processing algorithms
• Baseline architecture
• Coarse performance analysis
• Risks identified and mitigation plan defined
Embedded Processing Preliminary Design: Requirements Analysis
Drill down into processing algorithms (focus on the stressing pieces)
Algorithm laydown on target architecture required to get a more precise estimate on performance
• Programming model selected, initial target processor selected
Interface specification detailed
• All radar waveforms finalized by system CDR (not quite there yet)
• Explore the full range of variability on interfaces
Functional behavioral descriptions detailed
• Functional capabilities assessed and renegotiated with system team
[Figure: Coherent GMTI Processing Algorithm]
Processor Block Diagram: From Concept Design Phase
[Figure: processor block diagram with blocks and links labeled REX Data I/F, REX Control I/F, Ethernet Controllers, Signal Processing Modules, Control Processing Module, System I/O, PCIe x8, sFPDP x8, 10 Gb Ethernet, Custom I/F, and high speed point to point mesh network]
Signal Processor: IBM Cell Processor
The Cell multi-core processor was a joint development by Sony, Toshiba and IBM
The first application was Sony’s PlayStation 3
First chips (90 nm version) available in 2005
65 nm version in 2007 and 45 nm version in 2009 (first chip used in Sony PlayStation)
Chip performance was way ahead of its time in 2005
• This attracted the attention of the Radar embedded processing team!
Theoretical Peak Performance in Ops/sec
IBM Cell Processor: Chip Spec
IBM Cell Features
IBM Cell 8 SPE: High Performance Engine Ideal for this Radar Application
Why is Cell Processor So Fast
Basics of Parallel Programming Models: SIMD Single Instruction Multiple Data Model
Wikipedia: Single instruction, multiple data (SIMD), is a class of parallel computers in Flynn's taxonomy. It describes computers with multiple processing elements that perform the same operation on multiple data points simultaneously. Thus, such machines exploit data level parallelism, but not concurrency: there are simultaneous (parallel) computations, but only a single process (instruction) at a given moment.
[Figure: SIMD core with an Instruction Register and Decoder driving four ALUs that share a Register File]
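The SIMD idea can be emulated in a few lines. This is only a software illustration, assuming a hypothetical 4-lane engine; each loop iteration stands for one instruction applied to all lanes at once, and real SIMD hardware would be reached through a vectorized library or intrinsics instead.

```python
def simd_add(a, b, lanes=4):
    """Emulate a SIMD add and count the instructions a `lanes`-wide engine would issue."""
    if len(a) != len(b):
        raise ValueError("operands must have the same length")
    result = []
    instructions = 0
    for start in range(0, len(a), lanes):
        # One emulated instruction: the same add applied to every lane.
        result.extend(x + y for x, y in zip(a[start:start + lanes], b[start:start + lanes]))
        instructions += 1
    return result, instructions


if __name__ == "__main__":
    print(simd_add([1, 2, 3, 4, 5, 6, 7, 8], [10] * 8))
```

A scalar loop over the same eight elements would issue eight adds, while the emulated 4-lane engine issues two.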
Fundamentals of utilizing the multi-processor capability of the embedded processing system
TLP (Thread Level Parallelism) provides the opportunity to achieve high levels of parallelism in the sensor processing domain
• However, data movement between threads, if not done correctly, could be the kiss of death
• System performance can be brought to a halt waiting for data to be moved or reorganized across a parallel processing architecture
When utilizing TLP, concurrency, data synchronization, and data reorganization are key to performance!
So what is concurrency, data synchronization and data reorganization?
Concurrency: Does it Imply Parallelism?
Sequential program
• A single thread of control that executes one instruction at a time
• Next instruction isn’t executed until the prior one has completed
Concurrent program
• A collection of autonomous sequential threads executing logically in parallel
Concurrency is not necessarily parallelism
• Interleaved Concurrency
Logically simultaneous processing
Interleaved execution on a single processor
• Parallelism
Physically simultaneous processing
Requires a multi-processor not just a multi-threaded single processor
Data Synchronization
Not every possible interleaving of threads leads to a correct program!
Concurrent programs require synchronization so that data produced by one processing step won’t be consumed until a complete “coherent” set of data is stored.
Synchronization serves two purposes
• Thread safety for access to shared resources
Avoids race conditions
• Coordinates actions of threads
Parallel computation
Event notification
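Both purposes of synchronization can be sketched with Python's standard `threading` module. The function names and data below are hypothetical: the first function uses an event so the consumer never sees a partial data set, and the second uses a lock to protect a shared counter from a race condition.

```python
import threading


def run_producer_consumer(n_samples=1000):
    """Event notification: consume only after the complete data set is stored."""
    buffer = []
    ready = threading.Event()
    result = {}

    def producer():
        for i in range(n_samples):
            buffer.append(i)
        ready.set()  # signal once the data set is complete

    def consumer():
        ready.wait()  # block until the producer signals
        result["total"] = sum(buffer)

    threads = [threading.Thread(target=consumer), threading.Thread(target=producer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return result["total"]


def locked_count(n_threads=4, increments=1000):
    """Thread safety: a lock serializes updates to the shared counter."""
    counter = {"value": 0}
    lock = threading.Lock()

    def worker():
        for _ in range(increments):
            with lock:
                counter["value"] += 1

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter["value"]


if __name__ == "__main__":
    print(run_producer_consumer(), locked_count())
```

The consumer thread is started first on purpose, to show that the result does not depend on which thread happens to run first.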
Data Organization
To efficiently process very large datasets when utilizing thread level parallelism the data must be organized in distributed memories so that it can be accessed at the highest possible rate
Coherent GMTI Processing Algorithm
Data Organization: Inter-process Communication Fundamentals
Parallel programs need to share data and results processed by different processors. There are two typical ways to pass data
• Shared memory architecture
• Message passing architecture
[Figure: shared memory architecture, with multiple processors attached to one global memory; message passing architecture, with processor + memory nodes connected by an interconnection network]
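A small sketch of the message passing style, with threads standing in for processors and a `queue.Queue` standing in for the interconnection network. Both are stand-ins chosen for illustration; on real hardware each node would have physically separate memory. Each worker touches only its own chunk and sends back only a small result message.

```python
import queue
import threading


def message_passing_sum(chunks):
    """Each worker owns its chunk and communicates only by sending a message."""
    mailbox = queue.Queue()

    def worker(rank, local_data):
        # The partial result is the only thing that crosses the "network".
        mailbox.put((rank, sum(local_data)))

    threads = [threading.Thread(target=worker, args=(rank, chunk))
               for rank, chunk in enumerate(chunks)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    partials = dict(mailbox.get() for _ in chunks)
    return sum(partials.values())


if __name__ == "__main__":
    print(message_passing_sum([[1, 2], [3, 4], [5]]))
```

In the shared memory style the workers would instead read and update one common data structure, which is faster to reach but needs the locking discussed under data synchronization.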
Processing Domains
Processing Domains refers to the organization of data in memory
• Example shows data organized for processing in Fast Time Dimension
[Figure: data cube with axes Fast Time, Slow Time, and Channel, a “Data in Sequential Memory Locations” label, and the cube partitioned among threads T0 through T7]
Data Corner Turn
The data cube has been rotated in the slow time / fast time plane
Each thread now requires different data within its virtual address space
Data must be moved between these address spaces
[Figure: data cube after the corner turn, with axes Slow Time, Fast Time, and Channel, a “Data in Sequential Memory Locations” label, and the cube partitioned among threads T0 through T7]
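A corner turn on a flat buffer can be sketched as an index remapping. The layout below is an assumption made for illustration: the cube starts with fast time samples adjacent in memory and ends with slow time samples adjacent, and the function names are hypothetical.

```python
def flat_index(channel, slow, fast, n_slow, n_fast):
    """Offset in the fast-time-contiguous layout: adjacent fast time samples are adjacent in memory."""
    return (channel * n_slow + slow) * n_fast + fast


def corner_turn(data, n_channels, n_slow, n_fast):
    """Return a copy in which adjacent slow time samples are adjacent in memory."""
    out = [None] * len(data)
    for c in range(n_channels):
        for s in range(n_slow):
            for f in range(n_fast):
                src = flat_index(c, s, f, n_slow, n_fast)
                dst = (c * n_fast + f) * n_slow + s
                out[dst] = data[src]
    return out


if __name__ == "__main__":
    # One channel, 2 pulses (slow time) of 3 range samples (fast time).
    print(corner_turn([0, 1, 2, 3, 4, 5], n_channels=1, n_slow=2, n_fast=3))
```

Every element changes position, which is why the slide treats the corner turn as a data movement cost between thread address spaces rather than a free relabeling.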
Identify the Processing Domains
Best functional performance achieved by processing all signal processing steps within a single domain prior to redistributing data
Embedded Processing: Software Architecture
Signal Processing Programming Model
Baseline target hardware attributes (based on performance analysis so far in the design process)
• SIMD engines (Data Level Parallelization)
• DMA engines (Data movement) (Recall discussion from last lecture)
• Multi processors (Thread level Parallelization)
• High bandwidth main memory
• High bandwidth network interfaces (Point to Point simultaneous data flow)
Desired Programming Model Attributes
• Can explicitly express an algorithm’s available parallelism
• Can exploit the hardware attributes
• Can hide or isolate low level programming details so that the application programmer doesn’t need to be concerned with things that can be automated
• Can express when to utilize shared memory (fast memory) communication
• Can express when to utilize message passing (slower memory) communication
Parallel Programming Approaches
Auto-vectorization for data level parallelism (DLP) extraction has been difficult to automate
• Many attempts (Intel C++ compiler, GCC, Green Hills Multi tools)
• Experience shows these tools aren’t particularly good
Source to source compilation for thread level parallelism (TLP) extraction
• Still a big research area (too risky for our case study)
A plethora of programming languages and parallelism abstracting compilers have been developed
• Most focus on a particular form of parallelism or architecture
Shared memory -- Data Level Parallelism
Message passing -- Thread Level Parallelism
GPU specific architecture
Graph Programming Model for parallel programming has proven to be particularly good for the sensor signal processing domain
Programming Model: Directed Acyclic Graphs (DAG)
Wikipedia Definition
• In mathematics and computer science, a directed acyclic graph (commonly abbreviated to DAG), is a directed graph with no directed cycles. That is, it is formed by a collection of vertices and directed edges, each edge connecting one vertex to another, such that there is no way to start at some vertex v and follow a sequence of edges that eventually loops back to v again.
• Example: [Figure: DAG with vertices 2, 3, 5, 7, 8, 9, 10, and 11]
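Programming Model: Directed Acyclic Graphs (DAG)
Placing vertices with no inputs on the left, vertices with no outputs on the right, and vertices with both inputs and outputs in the middle provides a more intuitive data flow diagram
• Acyclic nature becomes obvious
[Figure: the example DAG (vertices 2, 3, 5, 7, 8, 9, 10, and 11) drawn twice, in its original layout and rearranged left to right as a data flow diagram]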
Signal Processing Programming Model Is Derived From DAGs
Directed Acyclic Graph (DAG) methodology is a perfect match for signal processing abstraction
• DAGs are a good method for expressing parallelism and data flow relationships
• The signal processing programming model is focused on exposing parallelism and processing precedence relationships
A vertex represents a signal processing function(s) and directed edges are the data flow path from one processing step to the next
The acyclic nature of a DAG is key to achieving an efficient processing structure
• The invocation of the processing at a vertex is only dependent on the input data availability
• Once processing at a vertex has been invoked it will run to completion uninterrupted
• Data flows through the processing steps at a rate solely determined by the latency of the processing at each vertex
An efficient DAG will perform as much processing as possible in a single vertex
• Data should only be passed to a downstream vertex if new dependencies exist
[Figure: poor design, with steps 1 and 2 in separate vertices; good design, with steps 1 and 2 combined in a single vertex]
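The firing rule described above (a vertex is invoked only when all of its input data is available, then runs to completion) can be sketched as a small scheduler. This is a generic ready-queue traversal, not the graph programming model's actual runtime, and the edge list in the example is an assumption, since only the vertex numbers of the example graph survive in the slide text.

```python
from collections import deque


def run_graph(vertices, edges):
    """Return one valid firing order; a vertex fires once all its inputs are available."""
    pending_inputs = {v: 0 for v in vertices}
    downstream = {v: [] for v in vertices}
    for src, dst in edges:
        pending_inputs[dst] += 1
        downstream[src].append(dst)

    ready = deque(v for v in vertices if pending_inputs[v] == 0)
    order = []
    while ready:
        vertex = ready.popleft()
        order.append(vertex)  # the vertex runs to completion here
        for nxt in downstream[vertex]:
            pending_inputs[nxt] -= 1
            if pending_inputs[nxt] == 0:
                ready.append(nxt)

    if len(order) != len(vertices):
        raise ValueError("graph contains a cycle, so some vertices can never fire")
    return order


if __name__ == "__main__":
    # Assumed edges for illustration only.
    vertices = [2, 3, 5, 7, 8, 9, 10, 11]
    edges = [(5, 11), (7, 11), (7, 8), (3, 8), (3, 10),
             (11, 2), (11, 9), (11, 10), (8, 9)]
    print(run_graph(vertices, edges))
```

Vertices that sit in the ready queue at the same time have no data dependency on each other, so they are the ones that could be dispatched to different processors in parallel.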
Graph Programming Model: Derived From Directed Acyclic Graphs (DAG)
DAGs form the basis for the Graph Programming Model
Used to express
– DLP (data level parallelism)
– TLP (thread level parallelism)
– Precedence relationships
– Data Reorganization
Design the signal processing graph: Analyze Performance
Utilize the graph programming model to express DLP
• Group all the processing within a single processing domain into a jobClass
• A jobClass will have multiple instances, called jobs, that can consume DLP
• Jobs will utilize the multi-core capability of a single processor
Utilize the graph programming model to express TLP
• Group multiple data-independent jobClasses into a subgraph
• Subgraphs will be allocated to groups of processors and will run in parallel utilizing multiple processors
GMT Functional Requirements Allocation: JobClasses and Subgraphs
Graph design for GMT mode
• Allocation of processing functions to jobClasses based on corner turn boundaries and subgraphs based on TLP opportunities
Mode laydown in Graph Programming Model
Processing of a swath requires 1 dwell of radiation and collection
Multiple dwells are collected back to back with no gaps in collection time
[Figure: CPI graph with one subgraph per receive channel (Ch 0 Subgraph 0 through Ch 3 Subgraph 3), and a second graph with a single Subgraph 0]
Processing of 1 dwell of data requires 2 graphs
• CPI Graph
Coherent processing
1 subgraph per receive channel
• PDI Graph
Post Detection Integration processing
1 subgraph per graph
Subgraph Laydown on Target Hardware
Real-time constraints identified
• subgraph 0-3 processing time < Dwell time
• subgraph 4 processing time < 2*Dwell time
[Figure: timeline over dwells 0 to 3 showing CPI subgraphs 0 to 3 and PDI subgraph 0 laid down on processors P0,0 through P2,1, with the processing for dwells 0, 1, and 2 marked; in PX,Y, X is the module number and Y is the processor number]
Preliminary design review
Objectives: Make sure the functional baseline requirements have been adequately addressed by the preliminary design
• Physical architecture
• Interfaces
• Subsystem functional requirements
• Real-time constraints
• SWAP
• ilities
Key documents:
• Subsystem description
• Interface control documents (ICDs)
• Preliminary Timing Analysis
• Requirements traceability
• Draft Requirement Compliance Matrix
• Design review package
Oh No! Houston, we have a problem! We aren’t meeting the real-time requirement
Updated analysis just prior to PDR found
• subgraph 0-3 processing time > Dwell time
• Each dwell’s processing is falling further and further behind!!
Next lecture will address this problem
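Why a processing time longer than the dwell time makes each dwell fall further behind can be seen with a small timeline model. The model is an assumption made for illustration: one processing resource handles back-to-back dwells in order, each dwell's data becomes available at the end of its collection, and the time values are arbitrary units.

```python
def dwell_latencies(dwell_time, processing_time, n_dwells):
    """Latency from data availability to processing completion, per dwell."""
    latencies = []
    finish = 0
    for k in range(n_dwells):
        available = (k + 1) * dwell_time   # collection of dwell k ends
        start = max(available, finish)     # wait for data and for the processor
        finish = start + processing_time
        latencies.append(finish - available)
    return latencies


if __name__ == "__main__":
    print(dwell_latencies(dwell_time=10, processing_time=8, n_dwells=4))   # constraint met
    print(dwell_latencies(dwell_time=10, processing_time=12, n_dwells=4))  # constraint violated
```

When the constraint is met the latency stays constant; when it is violated the latency grows by the difference between processing time and dwell time on every dwell, so no amount of buffering fixes it.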
Homework
Read Paper: Hybrid processor architectures meet demands for SWaP, by John Keller
• Available in CCLE 15S-ENGR180-1 Information Folder
Write a 1 page discussion answering
• What are the pros and cons of using a hybrid processor architecture for our case study of a Radar embedded processor?
• Is a hybrid architecture a good potential solution to resolve our processing timeline issue?
Read Paper: HPEC2012 – Kirsch.pdf, Graph Programming Model: An Efficient Approach for Sensor Signal Processing, by Steve Kirsch
• Available in CCLE 15S-ENGR180-1 Information Folder
Backup Slides
Embedded Processing Use Case: Preliminary design
Hardware Architecture
Signal Processing Software Architecture
Performance of the Architecture
• Good performance requires a system approach
Let’s drill down into the architecture
IBM Cell Processor Component 1 PPE
Signal Processing Stressing Algorithm: Understand Behavioral Requirements
Expanded view of the GMT algorithm defined in the conceptual design phase
Functional Behavioral Requirements: Parameterizing the variability
Drilling down into the functional behavior of the processing steps
• Parameterize the functionality based on the system waveform definition
Signal Processing Libraries and Performance
SIMD architecture can be utilized via two methods
1. Standard programming languages (e.g., C++) if compiler technology supports automated vectorization of code
2. Predesigned signal processing libraries
Today’s compiler technology is very poor at automated vectorization of code
Best choice today is the use of signal processing libraries
• Signal processing libraries are target-dependent code written utilizing SIMD instruction sets
SIMD instruction sets
• Are basically assembly-level code that can access the ISA (instruction set architecture) of the target processor
• Examples of SIMD instruction sets are:
  AltiVec – PowerPC architecture
  SSE – x86 architecture
  SPE intrinsics – IBM Cell SPE
Signal processing libraries are implemented with a SIMD instruction set
• Examples of signal processing libraries
  Mercury SAL
  VSIPL (http://www.omgwiki.org/hpec/vsipl)
  LAPACK
  BLAS
  FFTW
Engineering 180: Systems Engineering
Embedded Processing Case Study
Lecture 4
June 2, 2015, Steve Kirsch
Outline for Lectures on Real-Time Embedded Processing
Lecture 4
• Detailed Design / Integration and Test
Lecture 4: Agenda: Detailed Design Process
Starting point for detailed design of the embedded processing
Hardware Architecture Design Improvement
Software Architecture Design Improvement
Detailed Performance Analysis
Detail Design and CDR
Test and Integration
The Goal of Detail Design
Synthesize the detail design
• Fully define all interfaces
• Fully define the functional behavior of all subcomponents
• Detailed analysis of performance
• Update of TPMs
• Detailed analysis of SWAP and -ilities
• Define test and integration approach
Refine recurring cost estimate
Refine non-recurring cost estimate and development schedule
The main output from detailed design is the baseline design (hardware and software designs)
• Design description and analysis
• Requirements flow-down traceability updated
• Test compliance matrix and test procedure documents
• CDR -- Critical Design Review
GMT Functional Requirements Allocation: JobClasses and Subgraphs From Preliminary Design Phase
Graph design for GMT mode
• Allocation of processing functions to jobClass and subgraphs
PDR: Performance was identified as a big risk!
Reported at PDR
• Subgraph 0-3 processing time > Dwell time
Processing time will be longer than the collection time, thus not keeping up with real-time
[Figure: timeline of data collection versus data processing for successive dwells]
Processor Block Diagram: Preliminary Design Had Margin For Growth
Processor Enclosure
• 5 module slots available: 4 used in baseline + 1 spare
• Sufficient spare prime power
• Sufficient total power dissipation margin
• Weight limit can accommodate a module in the spare slot
Signal Processing Module
• Sufficient board real estate for additional components
• Power regulation could accommodate additional components
System Design
• Has insufficient SWAP for an additional processor enclosure
• Processing subsystem firm requirement
Performance Growth: Source of real-time performance issue
Doppler Tune preliminary assessment accounted only for the application of the tuning parameters
• Computation to generate the tuning parameters was initially ignored, resulting in a big unaccounted processing load
Pulse compression estimate greatly increased
• Performance was dominated by data movement, not computation cycles
• Analysis focused on computation cycles
Main memory bandwidth became a bottleneck for many processing steps
• Initial analysis didn’t account for simultaneous data flow of REX data to main memory and data produced between processing steps
Deeper Dive into Radar Processing
Pulse compression basics
• Pulse energy is transmitted as a long pulse due to the limitation on transmitter total instantaneous output power
• Signal processing compresses the return signal for better range resolution
• Total energy in long pulse = total energy in compressed pulse
Signal processing consists of passing the signal through a matched filter
Pulses are phase coded for better compression
• LFM (Linear Frequency Modulation)
• Barker codes (Discrete Phase Codes)
• Arbitrary phase and amplitude codes
[Figures: Linear Frequency Modulation; pulse compression of a phase coded pulse]
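As a small illustration of why phase coding helps compression, the sketch below autocorrelates the 13-element Barker code. The code sequence is the standard Barker-13 code; the function name is an illustrative choice, and the slides do not say which code the case-study radar uses.

```python
# Standard 13-element Barker code (+1 / -1 phase states).
BARKER_13 = [1, 1, 1, 1, 1, -1, -1, 1, 1, -1, 1, -1, 1]


def autocorrelation(code):
    """Aperiodic autocorrelation of a real code for lags 0 .. len(code)-1."""
    n = len(code)
    return [sum(code[i] * code[i + lag] for i in range(n - lag))
            for lag in range(n)]


acf = autocorrelation(BARKER_13)
peak = acf[0]                                  # energy of the whole long pulse
max_sidelobe = max(abs(v) for v in acf[1:])    # worst off-peak response
print(acf, peak, max_sidelobe)
```

The full pulse energy collapses into the zero-lag sample while every other lag stays small, which is the "long pulse in, narrow pulse out" behavior described above.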
Intuitive Approach to Pulse Compression: Matched Filter Utilizing AutoCorrelation
Transmitted pulse Tx: T0 T1 T2 T3
Received energy Rx: R0 R1 R2 ... R15
Convolution function of Tx(sn) with Rx: S0 S1 S2 ... S15
S0 = T0*R0 + T1*R1 + T2*R2 + T3*R3
S1 = T0*R1 + T1*R2 + T2*R3 + T3*R4
S2 = T0*R2 + T1*R3 + T2*R4 + T3*R5 and so on
[Figure: time-shifted replicas of Tx (T0 T1 T2 T3), each aligned with a different offset in Rx, producing S0, S1, S2, ...]
Convolution / Correlation = dot product of each time-shifted replica of Tx(sn) with Rx
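The sliding dot product can be sketched in a few lines. This is a minimal illustration with real-valued samples; the 4-sample code, the 5-sample echo delay, and the function name are hypothetical. For complex samples the replica would normally be conjugated.

```python
def matched_filter(tx, rx):
    """S[k] = sum_i T[i] * R[k + i] for every shift k where the
    replica of tx fits entirely inside rx."""
    n_out = len(rx) - len(tx) + 1
    return [sum(t * rx[k + i] for i, t in enumerate(tx))
            for k in range(n_out)]


tx = [1, 1, -1, 1]               # hypothetical 4-sample phase code (T0..T3)
rx = [0] * 5 + tx + [0] * 7      # 16 received samples, echo delayed by 5 samples
s = matched_filter(tx, rx)
peak_index = max(range(len(s)), key=lambda k: s[k])
print(s, peak_index)
```

The output peaks at the shift where the replica lines up with the echo, so the index of the peak gives the echo delay (range).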
Pulse compression math
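Pulse compression is achieved by performing continuous time domain convolution
$$(f * g)(t) = \int_{-\infty}^{\infty} f(\tau)\, g(t - \tau)\, d\tau$$
Discrete form of the convolution
$$(f * g)[n] = \sum_{m=-\infty}^{\infty} f[m]\, g[n - m]$$
Definition of the convolution theorem
$$\mathcal{F}\{f * g\} = \mathcal{F}\{f\} \cdot \mathcal{F}\{g\}$$
where $\mathcal{F}\{f\}$ denotes the Fourier transform of $f$
Therefore one can do a “Fast Convolution”
$$f * g = \mathcal{F}^{-1}\{\mathcal{F}\{f\} \cdot \mathcal{F}\{g\}\}$$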
Discrete Fourier Transform (DFT)
DFT is Computationally Intensive
• Given a set of N complex input samples x_n, where n = 0, ..., N-1, the DFT filters are:
$$F_m = \sum_{n=0}^{N-1} W'_{mn}\, x_n, \quad m = 0, 1, \ldots, N-1$$
$$W'_{mn} = \exp\!\left(-j\,\frac{2\pi m n}{N}\right) = W^{mn}, \quad \text{where } W = \exp\!\left(-j\,\frac{2\pi}{N}\right)$$
• Assuming the W' factors can be pre-computed, an N-point DFT requires:
  N^2 complex multiplies + N^2 complex adds
  complex mult = 6 real ops (4 multiplies + 2 adds)
  complex add = 2 real ops
• For example, a single 1024-point DFT takes: (1024^2) * 8 ≈ 8.4e6 ops
• Straightforward computation of an N-point DFT requires ~N^2 complex multiplications and ~N^2 complex additions, for a total of ~8N^2 real arithmetic operations
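A direct evaluation of the DFT sum makes the N^2 cost visible as a nested loop. The sketch below assumes the common forward-transform sign convention, exp(-j*2*pi*m*n/N), and uses the slide's cost model of 8 real operations per complex multiply-add; the function names are illustrative.

```python
import cmath


def dft(x):
    """Direct N-point DFT: one complex multiply-add per (m, n) pair."""
    n_pts = len(x)
    # Pre-computed twiddle factors W^k, k = 0 .. N-1 (W^(m*n) repeats modulo N).
    w = [cmath.exp(-2j * cmath.pi * k / n_pts) for k in range(n_pts)]
    return [sum(w[(m * n) % n_pts] * x[n] for n in range(n_pts))
            for m in range(n_pts)]


def dft_real_ops(n_pts):
    """Slide's cost model: N^2 complex multiplies (6 ops) + N^2 complex adds (2 ops)."""
    return 8 * n_pts ** 2


print(dft([1, 0, 0, 0]))      # a unit impulse transforms to a flat spectrum
print(dft_real_ops(1024))     # operation count for a 1024-point DFT
```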
An Algorithm for Rapidly Computing DFTs: The Fast Fourier Transform (FFT)
FFT is a clever algorithm for computing DFTs, published by Cooley and Tukey in 1965
• It takes advantage of the symmetry in the computation, greatly reducing the number of operations
N-point FFT ops = (N/2) * log2(N) * 10 ops
• 1024-point FFT = 51,200 ops
FFT results are numerically identical to those of the corresponding DFT (not an approximation)
Advantage of FFT grows with increasing DFT size
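A textbook recursive radix-2 decimation-in-time FFT shows where the (N/2)*log2(N) butterflies come from. This is a sketch for power-of-two lengths only, using the exp(-j*2*pi*k/N) sign convention as an assumption; it is not the optimized library code a real embedded design would use.

```python
import cmath


def fft(x):
    """Recursive radix-2 FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2])            # N/2-point FFT of even-indexed samples
    odd = fft(x[1::2])             # N/2-point FFT of odd-indexed samples
    out = [0j] * n
    for k in range(n // 2):        # N/2 butterflies per stage
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def fft_real_ops(n):
    """Slide's cost model: (N/2) * log2(N) butterflies at 10 real ops each."""
    stages = n.bit_length() - 1    # log2(n) when n is a power of two
    return (n // 2) * stages * 10


print(fft_real_ops(1024))
print(fft([1, 0, 0, 0, 0, 0, 0, 0]))
```

Under the slide's cost models, the 1024-point DFT (about 8.4e6 ops) costs roughly 164 times as much as the 1024-point FFT (51,200 ops).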
Pulse Compression by Fast Convolution
Time domain convolution
• Assume 1024 complex floating point collected samples
• Assume pulse width of 256 samples
• Time domain convolution FLOPs = 256 (complex multiply-adds) * 1024 collected samples = 256 * 8 FLOPs * 1024 = 2,097,152 FLOPs
Fast Convolution, N = 1024
• FLOPs = ((N/2) * log2(N) * 10 FLOPs) * 2 (forward and inverse) = 51,200 * 2 = 102,400 FLOPs
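The two operation counts can be checked by evaluating the slide's formulas directly. The sketch below uses only the slide's cost model (8 FLOPs per complex multiply-add, 10 FLOPs per FFT butterfly, one forward and one inverse FFT); the function names are illustrative. With N = 1024, the fast convolution formula evaluates to 2 * 51,200 = 102,400 FLOPs.

```python
import math


def time_domain_flops(n_samples, pulse_len, flops_per_cmac=8):
    """One complex multiply-add per pulse sample for every collected sample."""
    return pulse_len * flops_per_cmac * n_samples


def fast_convolution_flops(n, flops_per_butterfly=10):
    """Forward FFT plus inverse FFT, each (N/2) * log2(N) butterflies."""
    one_fft = (n // 2) * int(math.log2(n)) * flops_per_butterfly
    return 2 * one_fft


td = time_domain_flops(1024, 256)
fc = fast_convolution_flops(1024)
print(td, fc, td / fc)
```

This cost model ignores the per-sample complex multiply by the reference spectrum between the two FFTs, as the slide does.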
Class Group Discussion
Using your newly acquired embedded system engineering skills, how would you attack the processing performance risk we have identified in our case study?
Divide into 4 teams
• 15 minutes to discuss
• Nominate a spokesman for your group
Create an approach for resolving the performance issue
• What are the trades to consider
Address the root causes of the performance issues
Utilize your knowledge from the last homework assignment
Root causes of real-time performance issues
Doppler Tune preliminary assessment accounted only for the application of the tuning parameters
• Computation to generate the tuning parameters was initially ignored, resulting in a big unaccounted processing load
Pulse compression estimate greatly increased
• Performance was dominated by data movement, not computation cycles
Main memory bandwidth became a bottleneck for many processing steps
• Initial analysis didn’t account for simultaneous data flow of REX data to main memory and data produced between processing steps
Attacking the Real-time Performance Issue
Was performance analysis correct? Is a more thorough analysis needed?
Understand real-time system requirements
• How tightly spaced must the spots be?
• Can the processing fall behind and then catch up?
Could a system requirement be modified in some way, without major system performance impact, to resolve the embedded processing limitation?
Could system requirements allocation be modified?
• Could pulse compression processing be done in the REX prior to sending data to the processor?
Once system solutions appear to be a dead end, then focus on subsystem solutions
• Can margin that was planned to reduce risk later in the program be used now to solve this performance problem?
• If the spare slot is used for an additional signal processing card, will it solve the performance issues?
• What are the options for increasing throughput and memory bandwidth on the signal processing card?
Increased development cost and NRE might be a big driver for the solution
• Performance trade studies and risk analysis affect cost assessments
Next let’s look at the trades and results
Performance Trades: Second Look at Requirements and Assumptions
Modifying system requirements and allocations wasn’t acceptable
• Tailoring system requirements to address only the specific known Radar mode was deemed a poor choice
• Design requirement to accommodate new “undefined” applications is very important
Adding additional processing units to the system
• Though this approach could meet the SWAP requirements of the first application of the system, it was deemed too expensive and would exceed the SWAP for other potential applications
• Partitioning a mode across multiple units, given limited box-to-box bandwidth, potentially wouldn’t solve all the performance issues
Utilizing the spare slot for the additional performance would violate the processing margin requirement
• Intent of the spare is for future programs and risk reduction during the test and integration phase
Best option was to increase signal processing module performance within the module SWAP allocation
• Program resources could be reallocated (i.e., $$, schedule, and engineering talent)
• Module SWAP margin was a lower risk and margin could be used earlier in the program
Next step is trade studies for the best way to improve module performance
Review of Performance Analysis
Error in performance analysis discovered!
• Programming model not well understood by the engineer doing performance modeling
Key aspect of programming model utilizes DMA and double buffering to parallelize data movement with computation cycles
Use of DMA requires target-specific software design
[Figure: ping-pong buffer. A dataset is moved by DMA alternately into PingBuf and PongBuf while processing alternates between the two buffers over time steps t1 through t8; labels mark the data dependent and data independent processing domains.]
Processing is fully parallelized with data movement if compute cycles take the same amount of time as data movement
This technique of overlapping data movement with processing is called tiling
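The benefit of the ping-pong scheme can be estimated with a simple timing model. The sketch below is an idealized assumption-laden model (one DMA engine, one processing unit, two buffers, fixed per-tile times); the function names and the example times are hypothetical and not taken from the case study.

```python
def serial_time(n_tiles, t_dma, t_proc):
    """No overlap: each tile is moved, then processed."""
    return n_tiles * (t_dma + t_proc)


def double_buffered_time(n_tiles, t_dma, t_proc):
    """Ping-pong buffering: DMA of tile i overlaps processing of tile i-1."""
    dma_done = [0.0] * n_tiles
    proc_done = [0.0] * n_tiles
    for i in range(n_tiles):
        prev_dma = dma_done[i - 1] if i >= 1 else 0.0
        # Tile i reuses the buffer of tile i-2, which must be fully processed.
        buffer_free = proc_done[i - 2] if i >= 2 else 0.0
        dma_done[i] = max(prev_dma, buffer_free) + t_dma
        prev_proc = proc_done[i - 1] if i >= 1 else 0.0
        proc_done[i] = max(dma_done[i], prev_proc) + t_proc
    return proc_done[-1]


# Hypothetical case: 4 tiles, data movement and compute each take 1 time unit.
print(serial_time(4, 1.0, 1.0), double_buffered_time(4, 1.0, 1.0))
# Hypothetical case: data movement takes twice as long as compute.
print(serial_time(4, 2.0, 1.0), double_buffered_time(4, 2.0, 1.0))
```

When the two times are equal, only the first transfer and the last compute step are exposed; when they differ, the slower of the two sets the pace.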
Data Movement vs Throughput In Determining Performance
FFT example
• N = 1024 complex floating point samples
• Total FLOPs to perform pulse compression via fast convolution = 102,400 FLOPs
• Assume CPU executes 1 FLOP/ns
• Fast convolution time = 102,400 FLOPs / (1 FLOP/ns) = 102.4 μs
• Assume memory bandwidth = 100 MB/sec
• Complex floating point sample = 8 bytes
• Data movement time = 1024 * 8 bytes * 2 (in and out) / 100 MB/sec = 163.84 μs
Data movement time is longer than computation time
Overall processing time driven by data movement time
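The comparison can be reproduced from the stated assumptions. The sketch below takes 1 MB = 10^6 bytes (an assumption) and evaluates the slide's fast convolution formula, 2 * (N/2) * log2(N) * 10 with N = 1024, which gives 102,400 FLOPs; the function names are illustrative.

```python
def compute_time_us(flops, flops_per_ns=1.0):
    """Compute time in microseconds at the given sustained FLOP rate."""
    return flops / flops_per_ns / 1000.0


def data_move_time_us(n_samples, bytes_per_sample=8, passes=2,
                      bandwidth_mb_per_s=100.0):
    """Time to move the samples in and out of memory, in microseconds."""
    total_bytes = n_samples * bytes_per_sample * passes
    return total_bytes / (bandwidth_mb_per_s * 1e6) * 1e6


fast_conv_flops = 2 * (1024 // 2) * 10 * 10   # forward + inverse 1024-point FFT
compute_us = compute_time_us(fast_conv_flops)
move_us = data_move_time_us(1024)
print(compute_us, move_us, move_us > compute_us)
```

With these assumptions the data movement time exceeds the compute time, so memory bandwidth rather than FLOP rate sets the processing time.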
Trade-study results
Analysis error accounted for only a small fraction of the performance issue
Re-allocation of processing requirements from the Cell to a new front-end processor looks promising
• Front-end high data rate processing characteristics
  Very few processing functions require > 50% of processing performance
    FIR (Finite Impulse Response) filter for IQ formation or IQ calibration
    Phase ramp generator and complex multiply
    Large FFTs
  Large data rate reduction after front-end processing (reducing processing load on following stages)
  Application-specific design tends to have the highest performance per SWAP
Trades Conclusion
• Additional investment to develop an “application specific” solution for front-end processing functions
  FPGA (Field Programmable Gate Array) solution is the best choice (other contenders: GPGPUs and DSP-specific COTS chips)
  Biggest bang for the buck!
  Front-end processing is fairly consistent between different mode applications
  Greatly reduces load on IBM Cell
• Add more on-module memory bandwidth
  Decouple REX data ingest from the rest of IBM Cell processing
Processor Block Diagram: Update from Preliminary Design Phase
[Block diagram. Blocks shown: REX data I/F, REX control I/F, sFPDP x8, Ethernet controllers, system I/O (10 Gb Ethernet), custom I/F, Control Processing Module, high speed point-to-point mesh network, and three Signal Processing Modules, each with an IBM Cell CPU, main memory, network interface controller, front-end processor, and distributed global bulk memory.]
New Features (Distributed GBM, Front-end Processor)
Solution is a Hybrid Architecture
Front-end Processor functions
• FIR filter (I/Q formation / calibration)
• Phase ramp generator and complex multiplier
• Large efficient FFT
Front-end Processor implementation
• Application-specific FPGA (Field Programmable Gate Array) based design
• High bandwidth memory interface
• Designed as an offload engine
  Very large computationally intensive functions
GBM functions
• REX data stored in GBM instead of main memory
  Decouples high bandwidth REX interface from impacting Cell computations
• Front-end processor accesses data directly from GBM
  Reduces competition for main memory bandwidth between processor types
Hybrid design addresses all three of the key performance issues in available SWAP
1) Doppler tuning parameter generation
2) Large FFT computational speed
3) Memory bandwidth limitations
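To make the split between generating and applying the Doppler tuning parameters concrete, here is one common reading of "phase ramp generator and complex multiplier". The slides do not give the actual Doppler tune math, so the formula, function names, and the 100 Hz / 1 kHz values are assumptions for illustration only.

```python
import cmath


def phase_ramp(n_samples, freq_hz, sample_rate_hz):
    """Generation step: one complex exponential per sample (the costly part)."""
    return [cmath.exp(-2j * cmath.pi * freq_hz * n / sample_rate_hz)
            for n in range(n_samples)]


def doppler_tune(samples, freq_hz, sample_rate_hz):
    """Application step: one complex multiply per sample."""
    ramp = phase_ramp(len(samples), freq_hz, sample_rate_hz)
    return [s * r for s, r in zip(samples, ramp)]


# Hypothetical input: a 100 Hz complex tone sampled at 1 kHz.
tone = [cmath.exp(2j * cmath.pi * 100.0 * n / 1000.0) for n in range(16)]
tuned = doppler_tune(tone, 100.0, 1000.0)
print(tuned[:3])   # the tone is shifted to zero frequency (constant samples)
```

Counting only the multiplies in `doppler_tune` while ignoring the exponentials in `phase_ramp` mirrors the kind of omission identified as a root cause in the preliminary estimate.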
Lessons Learned
Fully understand all requirements as thoroughly as possible and as early in the design process as possible
• Hardware requirements
• Software requirements
• Interaction between hardware and software
Perform as thorough a performance analysis as early as practical
• Problems discovered later in the design process are much more costly (e.g., if performance issues were found in integration, the fix would have been very expensive)
Explore higher level requirements as well as lower level allocations when resolving issues
• Though in this case we weren’t able to change the system requirements, it was worth exploring
Use risk analysis when performing performance trades
• A lower cost solution might have been to give up design margin, but the consequences were too high and the probability of an occurrence wasn’t low enough
Often application-specific designs are general enough to have wide applicability if scope is limited
• Application-specific designs can be more SWAP efficient than general solutions, but are in general more costly
Integration and Test
Requirements flowdown and allocation to subsystems includes requirement validation documentation
• Requirements Compliance Matrix -- specifies the test method
  Deployed system field test
  System Integration Lab (SIL) test
  Unit level test
  Analysis
  Inspection
• Test Description Document
  Detailed description of tests and support equipment required to do the test
• Test Procedure Document
  Specifies how to do the test and expected results
[Figure label: increasing complexity and cost of validation]
Integration and Test
Key concepts to keep in mind when planning for integration and test
• Sufficient visibility for unit testing and system integration lab testing
  Is there support for inspecting memory?
  Is there support for monitoring system state while in operation?
  Is there support for monitoring bus activity?
  Is there support for monitoring operation of application-specific implementations (e.g., inside of an FPGA)?
• Real-time debug tools for unit test and system integration lab
  Does the IDE (Integrated Development Environment) support non-intrusive monitoring of OS and application software? (example next slide)
• System Level Instrumentation (support for both SIL and field testing)
  At the full system level, are there sufficient interfaces and capability provided for non-intrusive real-time access?
  Is there sufficient support for data reduction tools?
    Sorting and understanding of the data of interest
Example of an IDE Real-time Non-Intrusive Debug Tool
Intel VTune – Performance Profiler
Hotspot (statistical call tree), call counts (statistical)
Thread profiling with lock and waits analysis
Cache miss, bandwidth analysis
OpenCL kernel tracing & GPU offload on Windows*
Example of an IDE Real-time Non-Intrusive Debug Tool
Green Hills IDE Event Analyzer
• EventAnalyzer displays the length and frequency of RTOS and user events, making it quickly apparent what operations take the most time and where optimization efforts should be focused
Critical Design Review (CDR)
CDR purpose
• Final design review prior to the official acceptance of the design
• Opportunity for all stakeholders to assess the design's compliance with requirements
• Opportunity to review risk assessment and mitigation results
  All risks should be well understood and accepted at this time
• To force detailed design documentation effort
• To refine non-recurring and recurring costs
Goal of a CDR
- Demonstrate the design meets the functional and performance requirements
- Assure the test and evaluation strategies, procedures, and support are in place for the next development phase
- Establish the Product Baseline
Successful completion of CDR is the green light for the next development phases
- Building Hardware
- Writing of Application Software
- Unit test
- System Test
Embedded Processing Case Study Summary
Last 4 lectures stepped through the design development process for embedded processing design
• Concept development
• Preliminary design
• Detailed design
• Integration and Test (briefly)
Case study utilized a “real” application for real-time high performance embedded processing in a highly SWAP constrained environment
Goal was to provide insight into the system engineering process, the myriad complexities the embedded system engineer needs to be aware of, and the skill set required
Final takeaway on the role of an Embedded Processing Engineer
1) Requires broad technical knowledge of both hardware and software technologies
2) Requires excellent team skills
3) There is no system design process that can replace experience!
4) High demand for engineers with this skill set!
Embedded Subsystem Engineer’s Job is Very Challenging and Very Rewarding