ENGR180 Embedded Computing.pdf

11/6/2015


SYSTEM ENGINEERING

©2015 Steve Kirsch - All Rights Reserved

Engineering 180 Systems Engineering

Embedded Processing Case Study

Lecture 1

May 21, 2015
Steve Kirsch

Outline for Lectures on Real-Time Embedded Processing

Lecture 1
• Overview of Embedded Subsystem Design
• Case Study: Problem statement

Lecture 2
• Conceptual Design

Lecture 3
• Preliminary Design

Lecture 4
• Detailed Design / Integration and Test

Lecture 1: Agenda

Overview
• What is an Embedded Processing System
• Characteristics
• Examples of Embedded Computers
• Summary of Design Challenges

Case Study: The Road to SDR: Applying the System Engineering Process
• Problem statement
• Identify stakeholders
• Top-level requirements
• Key Performance Parameters (KPPs)

Homework

Overview: What is an embedded processing system?

Wikipedia:
• An embedded system is a computer system with a dedicated function within a larger mechanical or electrical system, often with real-time computing constraints
• It is embedded as part of a complete device, often including hardware and mechanical parts. Embedded systems control many devices in common use today

Overview: continued

Embedded processing subsystem
• Consists of one or more digital processors integrated with other parts of a complex system (sensors, actuators, user interface, etc.)
• Arranged in a tightly coupled architecture, designed to perform a specific set of functions

[Figure: Generic System with Embedded Digital Processor. Blocks shown: Sensor, Amp, ADC, Embedded Digital Processor, DAC, Amp, Load]

Overview: continued

System design, as previously discussed, is hierarchical
• A system is broken down into subsystems, which in turn are broken down into further subsystems
• The embedded computer subsystem is itself a complex system, designed using the same principles employed at the system level

Overview: continued

Embedded processing subsystems are unique because their requirements are implemented in both hardware and software components
• An early task of the embedded processing subsystem engineer is to flow down high-level requirements and allocate them to hardware and software components
• An embedded processing subsystem engineer therefore needs broad knowledge of both hardware and software systems

Overview: continued

Embedded subsystems today are perhaps the most critical and complex part of the system to get right, and to get right early in the system design process
• Functionality that was classically implemented in hardware is today implemented as a mix of hardware and software components (trending toward more and more software)
• Human and environmental interfaces are sensed and controlled by the interaction of hardware and software components
• Real-time operation is a function of hardware and software interaction
  – Deterministic behavior is critical

Overview: The embedded system engineer's job is very challenging

The embedded subsystem engineer is required to have broad expertise
• Hardware and software development management processes and tools
• Mechanical / structural (enclosure design)
• Thermodynamics (cooling of electronics is critical to system design)
• Materials (enclosures, backplanes, modules, connectors, etc.)
• E&M (electromagnetic radiation and protection)
• Computer hardware architecture (interface standards, networks, memory architecture, processor architecture, communication protocols, electronic components such as GPGPU, FPGA, and ASIC technology, etc.)
• Computer software architecture (interface standards, communication protocols, operating systems, development tools, computer languages and programming models such as parallel processing, streaming, object-oriented, scripting, etc.)
• System operational theory (e.g. communication or radar theory, with an understanding of the processing algorithms)
• System simulation tools (e.g. MATLAB)

Overview: Hardware Properties

Key flow-down system requirements of an embedded processing system
• Size
• Weight
• Power
• Life-cycle costs
  – Non-recurring development cost
  – Recurring cost
• Rugged operating environment
• Durability / Reliability
• Maintainability
• Supportability
• Development schedule
• Development test and integration environment
• Functional requirements (many application specific)

Overview: Software Properties

Key flow-down system requirements of an embedded processing system
• Life-cycle costs
  – Non-recurring development cost
  – Recurring cost
• Durability / Reliability
• Maintainability
• Supportability
• Development schedule
• Development test and integration environment
• Infrastructure requirements (many application specific)
  – Build-time (drivers, libraries, interfaces)
  – Run-time (services, clients, servers, etc.)
• Functional requirements (application specific)

Overview: Software Properties

Key Embedded Processing Application Characteristics

Complex algorithms
• Environment sensing and filtering
• Visualization
• Tracking

User interfaces
• Human to computer
• Computer to computer

Real-time operation
• Hard real-time: sec - msec response times
• Testability / observability

Multirate (asynchronous events)

Overview: Balanced design (CRISP)

Cost of ownership (life-cycle cost)
• Development cost, production cost, support cost (training, spares, repair, system upgrade, …)

Risks (technology risk, production risk, obsolescence, …)

Installation (weight, size, power, style, transportation, …)

Supportability (reliability, maintainability, …)

Performance (functionality, ease of use, throughput, …)

Overview: Real Design is a Compromise Among Conflicting Needs and Desires

Design has to balance multiple desires and constraints

• Mission environment
  – Offboard info
  – Communication requirements
  – Weapon characteristics
• Physical environment
  – Operating temperatures
  – Storage temperatures
  – Coolant characteristics
  – Vibration levels
  – Shock
  – Prime power characteristics
  – EMI/C requirements
• Physical characteristics
  – Weight
  – Size (O&M)
  – Prime power utilization
  – Cooling required
  – Dissipation
  – EMI/C characteristics
• Programmatic characteristics
  – Development plan
  – Production plan
  – Risks
  – Technology maturity
• Cost
  – Recurring cost
  – Development cost
  – Life-cycle cost
• Functional performance
  – Detection performance
  – Tracking accuracies
  – ID capabilities
  – Weapon support
  – Map characteristics
• Support
  – Maintenance concept
  – Reliability
  – Maintainability measures
  – Built-in test capability
• Add your need here

Source: Raytheon



Overview: Embedded computing examples

Whirlwind I – Lessons learned from the 40’s!

Core Memory Controller

Overview: Embedded computing examples

ENIAC – First fully electronic, Turing-complete computer

Overview: Embedded examples

Intel 4004 – First commercially available single-chip microprocessor

HP-35

Wikipedia:
• The HP-35 was Hewlett-Packard's first pocket calculator and the world's first scientific pocket calculator (a calculator with trigonometric and exponential functions). Like some of HP's desktop calculators, it used reverse Polish notation. Introduced at US$395, the HP-35 was available from 1972 to 1975.

PlayStation 3 – Based on the IBM Cell processor; in 2007 it was way ahead of its time

Overview: Embedded Examples

Tianhe-2 – World's fastest computer as of 11/2014

Tianhe-2 is built from Intel Xeon Phi coprocessors
• Knights Ferry / Knights Landing
• 14 nm process




Qualcomm SoC -- Snapdragon 800 processors

Overview: Single-Chip Compute

The single-chip computer or processor is the foundation of embedded computing today
• Embedded computational systems today are constructed with
  – A single processor chip
  – An array of processor chips
  – An SoC (system on a chip) that contains processor cores

Therefore, understanding the key aspects of a processor is fundamental for an embedded system engineer

Overview: Physics of Software

Computing is a physical act
• Computers abstract information, but in fact they do their work by moving electrons
• This is fundamentally why it takes time and energy to compute

Software performance and energy consumption are where we connect embedded computing to the real world

Embedded engineers make high-level decisions about the structure of their programs to greatly improve real-time performance and power consumption



Overview: Challenges in Embedded Computing System Design

How much hardware do we need?

How do we meet deadlines?

How do we minimize power consumption?

How do we design for upgradability?

Does it really work and meet the requirements?

How can you get the job done within the budget and schedule constraints?

Case Study: The Need Phase – Proposal Phase

The proposal phase usually consists of:
• Defining the problem
  – Identify customers and stakeholders
  – Understand their needs
  – Understand and develop the operational concept
  – Identify the constraints
• Defining the system (or product) to be procured or built
  – Building the system specification (procurement spec)
• Making sure the problem is solvable
  – Identify risks and risk mitigation plans
  – Investigate potential system designs
  – Preliminary system modeling and performance assessment
  – Preliminary program plan and schedule
  – Development cost projection
• Developing product testing and evaluation strategies
• Writing a proposal
• Winning the contract

It is marketing, it is management, it is a lot of engineering, and it is about managing risk

Case Study: Problem Statement

The government, in conjunction with a prime contractor, intends to design and build a surveillance and reconnaissance radar system that will be compatible with existing UAVs as well as new, more advanced UAVs of the future

The radar's primary function is an air-to-ground capability to locate and disarm ground-moving troops, equipment, and air defense systems

[Figure: Current system: Predator. Future system: Joint US Navy / US Air Force unmanned combat air vehicle]

Case Study: Nested Design Process
Radar Processing Embedded Subsystem

[Figure: Nested Design Process for Complex Systems. Labels: Conceptual Design (System Specification); Preliminary Design (System Architecture, Subsystem Specifications); System Level decomposed into Level #1 Subsystems A, B, C, D; subsystem B decomposed into Level #2 Subsystems B1, B2, B3; Conceptual design for subsystem B; Preliminary Design B (Subsystem Architecture)]

Radar system engineers are responsible for taking the system-level specifications and decomposing the requirements into level 1 requirements so that they can be allocated to the major subsystem components
• Antenna / Beam steering computer
• Receiver / Exciter
• Processor subsystem (this is our task)

Many additional requirements not explicitly specified in the level 1 spec, referred to as "derived requirements", are also flowed down to the next level of the major subsystem components

Radar System Block Diagram

[Figure: Radar system block diagram.
Blocks: Antenna; Stable Microwave Source; Frequency Synthesis / WF Gen; Power Amplifier; Low Noise Amplifier; Down-Conversion; A/D Conversion and Timing Generator; Digital Signal Processing; Control, Interface, and Data Processing.
Subsystem groupings: Antenna Subsystem; Receiver Exciter Subsystem; Processing Subsystem; System Interfaces.
Signal labels: Image Information; Detected Objects; Control; Commands, Motion Data; Radar Results, Health Info.]

System Design Process (One View)

[Figure: Need + Desires → Potential Solutions → Baseline Solution → Detailed, Documented Baseline, spanning Conceptual Design and Preliminary Design]

Includes:
• Elicitation of need and requirements
• Design through insight, invention, and successive refinement
• Management of complexity through partitioning and creating well-posed lower-level design problems

Case Study: Why is system engineering so challenging?

[Figure: Ability to influence life-cycle cost (LCC) across the System Life Cycle, within the Acquisition Framework.
Phases: Concept Refinement; Technology Development; System Development & Demonstration; Production & Deployment; Operations & Support. Milestones: A, B, C.
Annotations: High Ability to Influence LCC (70-75% of Cost Decisions Made); Less Ability to Influence LCC (85% of Cost Decisions Made); Little Ability to Influence LCC (90-95% of Cost Decisions Made); Minimum Ability to Influence LCC (95% of Cost Decisions Made); (10%-15%); (5%-10%); 72% Life Cycle Cost; 28% Life Cycle Cost.
Roles: Combat Developer (TRADOC); Materiel Developer (PM, Total Life Cycle System Manager, Army Materiel Command).]

The most important decisions are made early in the design cycle, with the least amount of detailed information

Customer system-level specification describes
• Mission scenarios
• Threats
• Operational environment
• Platform resource allocation for the radar system
  – Space
  – Weight
  – Cooling capacity
• Operator interfaces
• Mission stability (system shall run continuously for N hours)
• Plus many more "ilities" requirements

The system-level specification and requirements allocation is a complex task
• Results of this work are documented and reviewed at the SDR

The major subsystem responsible engineers (REAs) are part of the radar system team that does the allocation
• Involvement of stakeholders is required

Program manager (also part of the radar system engineering team)
• Establishes the Work Breakdown Structure (WBS)
• Allocates budget to each of the WBS line items
• Creates an integrated master plan (IMP)
• Creates an integrated master schedule (IMS)

Case Study: System Conceptual Design Phase Products

Products

Baseline design
• Performance
• Risk
• Cost
• Schedule
• Other high-level attributes

Characterized by top-level budgets and supporting analysis
• Supported by enough lower-level design to give confidence in the numbers
  – Hardware, algorithms, signal processing sizing, software sizing

Top-level program plan
• Schedule
• Headcount vs. time
• Critical item development plans
• Top-level understanding of programmatic issues

Subsystem spec

Case Study: System Conceptual Design Review (SDR)

Review of the concept and supporting documents

Concept Analysis Review
• System modeling and simulation results
• Compare and contrast conceptual designs and review justifications for the selected baseline
• Risk mitigation plan going forward

Establish the functional baseline

Approve the system specification

Case Study: Post-SDR flow-down requirements

Processor subsystem Level 1 requirements specification
• Space
• Weight
• Power
• Cooling
• Ilities
• Transmit waveform specifications (PRF, number of coherent pulses transmitted/collected, sample rates, number of receive channels, phase coding, etc.)
• Processing algorithms (preliminary)
• Interfaces (sensor data, sensor command, nav, mission computer, instrumentation system)

Case Study: Definition of terms

CPI: Coherent Processing Interval
• A series of pulses with a known phase relationship, transmitted and collected so that they can be coherently processed

PDI: Post Detection Integration
• Multiple CPIs are non-coherently integrated

Dwell time
• Time to radiate a single beam position on the ground

Bars
• Number of beam positions needed to radiate a swath

PRF: Pulse repetition frequency
• Rate at which pulses are transmitted

PRI: Pulse repetition interval = 1/PRF
• Time from the start of one pulse to the start of the next

Pulse modulation
• Amplitude and phase modulation superimposed on the pulse during its duration

Receive channels
• Radar antennas are typically partitioned into subarrays with physically offset phase centers, each connected to a separate receiver and A/D

[Figure: Ground area of interest divided into Swath 1, Swath 2, and Swath 3, with range and azimuth axes]
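The timing terms defined on this slide are tied together by simple arithmetic. The following is a minimal sketch, assuming a dwell is made up of back-to-back CPIs with no gaps between them; the function names and the example waveform numbers are hypothetical and are not taken from the case study.

```python
def pri_seconds(prf_hz: float) -> float:
    """PRI = 1 / PRF: time from the start of one pulse to the start of the next."""
    if prf_hz <= 0:
        raise ValueError("PRF must be positive")
    return 1.0 / prf_hz


def cpi_seconds(prf_hz: float, pulses_per_cpi: int) -> float:
    """Duration of one coherent processing interval (pulses times PRI)."""
    return pulses_per_cpi * pri_seconds(prf_hz)


def dwell_seconds(prf_hz: float, pulses_per_cpi: int, cpis_per_dwell: int) -> float:
    """Time on one beam position, assuming the CPIs run back to back."""
    return cpis_per_dwell * cpi_seconds(prf_hz, pulses_per_cpi)


if __name__ == "__main__":
    # Hypothetical waveform: 2 kHz PRF, 128 pulses per CPI, 4 CPIs per dwell.
    print("PRI   (s):", pri_seconds(2000.0))
    print("CPI   (s):", cpi_seconds(2000.0, 128))
    print("Dwell (s):", dwell_seconds(2000.0, 128, 4))
```

These durations set the real-time budget: the processor has roughly one dwell time to finish each dwell's worth of data if it is to keep up with collection.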

Case Study: GMTI (Ground Moving Target Indicator) waveform

Key parameters for embedded subsystem design
• Number of CPIs/dwell
• Number of pulses/CPI
• Pulse modulation: LFM (linear frequency modulation)
• Number of receive channels
• PRF
• Number of swaths/scan area
• Scan area rate

[Figure: Waveform timeline of CPI 1 through CPI M, each containing pulses 0 to N]

Case Study: GMTI Processing Algorithm (CPI processing)

[Figure: CPI processing chain: I/Q Formation → Pulse Compression → Motion Compensation → Doppler Filtering → Clutter Cancellation → Noise Estimation → Target Detection → PDI Processing]

Case Study: GMTI Processing Algorithm (PDI processing)

[Figure: PDI processing chain: CPI Processing → Ambiguity Resolving → Noise Estimation → Sidelobe Detection Rejection → False Alarm Control (M of N Processing) → Angle Estimation → Target Parameter Estimation → Hit List]

Homework

Text: Computers as Components: Principles of Embedded Computing System Design, by Professor Wayne Wolf
• Text link: available in the CCLE 15S-ENGR180-1 Information Folder
• http://ceng2.ktu.edu.tr/~ulutas/Courses/EmbeddedSystems/0123743974.pdf

Read: Chapter 1, Embedded Computing
• Introduction
• 1.1 Complex Systems and Microprocessors

Write up to a 1-page discussion answering
• Why are microprocessors used in complex system designs?

Lecture 2
• Conceptual Design

Lecture 2: Agenda: Conceptual Design Process

Review Homework

Starting point for conceptual design of the embedded processing

Feasibility and Requirements Analysis

Embedded processing design synthesis

Subsystem concept design review process

Homework

Homework 1: review

Why are microprocessors used in embedded computing systems?

Many examples of processor use in embedded computing
• Perhaps experience tells us that something about this approach is fundamental

Large variety of processors to choose from
• High potential that there is a best fit

The alternative to processors is custom hardwired logic. Advantages of processors over this alternative:
• Easier to design and debug
• Allows for the possibility of upgrades and adding new functionality

More efficient than custom logic
• A custom design will have some logic dedicated to sub-functions that aren't active all the time. A microprocessor's logic is reused for all sub-functions
• Microprocessors are application agnostic, so we can leverage huge investments made by others. Application-specific logic can be implemented in software

Microprocessors can be faster than custom logic (seems almost counterintuitive!)
• They utilize the latest manufacturing processes
• Resources are available for access to the best experts and large design teams
• They can overcome the overhead of interpreting instructions with clever utilization of parallelism

Homework 1: review

What differentiates embedded computing from other forms of computing?
• Program must meet deadlines
• Must be fast enough
  – Needs deterministic behavior to guarantee it will be fast enough

To understand the real-time behavior of an embedded computing system, one needs to understand its components from the lowest level to the highest level of the system.

What are the 5 components, from the lowest to the highest?
• CPU (processor plus memory)
• Platform (CPU scaffolding): components supporting the CPU (e.g. buses, I/O devices)
• Program: programs can be very large, and the CPU sees only a very small window of the program at any one time. We must consider the structure of the program to determine the overall behavior of the system
• Tasks: we generally run several programs simultaneously on a CPU, creating a multitasking system. Tasks interact with each other in ways that have profound implications for performance
• Multiprocessors: a system can have many microprocessors all interacting with each other, as well as potentially interacting with accelerators. The interactions can be very complex to analyze, and they determine the overall system performance.

Concept Design Phase: Embedded Processing

A BAA (Broad Agency Announcement) typically precedes the RFP (Request for Proposal) when contracting with the US government
• This is a head start on preparing for the RFP

RFP let
• Procurement specs reviewed and analyzed
• Enormous effort applied at this stage to develop a system design concept or concepts
• Proposal written and submitted
  – Often leveraging years of IR&D
• Contract won!

System engineers decomposed and allocated the level 1 requirements
• Produced preliminary subsystem specifications

Subsystem development team leads identified
• Program manager
• Subsystem architect (head technical subsystem engineer)
• Development team leads (tech leads)
  – Hardware unit lead
  – Mode software lead
  – Infrastructure software lead

This is our starting point for the Embedded Processing Concept design phase



The Conceptual Design Process

Embedded Processing Concept Design: Stakeholders Requirements (C1)

Who are the stakeholders?
• Our subsystem team
  – Subsystem Program Manager
  – Subsystem Architect
  – Tech Leads and their development teams
• Customers
  – System Team
  – System Program Manager
  – Contracting organization
• Vendors and Suppliers
• Test and Integration team

Requirement sources
• Procurement spec
• Subsystem specs (generated by tier 1 system team)
• KPPs (Key Performance Parameters identified by customer or system team)
• System TPMs (Technical Performance Measures)
• SRD (system requirements document)
• SDD (system design description)
• SRR (system requirements review material)
• Vendor component specifications
• Legacy system components *
• Standards *
• Laws of physics
• Company development procedures, ethics, rules
• Laws of the land and of the point of deployment
• Common sense

* Potential requirement source

Embedded Processing Concept Design: Feasibility Analysis (C2)

Identify the possible processing solutions

Study the viability of these solutions according to the flow-down requirements
• Performance, cost, schedule, risk, supportability, …

Key questions:
• Can we design the embedded processing to run in real time while meeting the SWaP-C (Space, Weight, and Power - Cost) requirements?
• What are the key risks?
• How can we reduce the risks?
• Are the preliminary subsystem specifications reasonable? Could they be modified to reduce the risks and still meet the main system objectives?

Embedded Processing Concept Design: Feasibility & Req Analysis (C2 & C3), Step 1

Understand the requirements and focus first on the primary requirements that will likely drive the top-level design

For our case study, the real-time signal processing requirement is key

The system requirement is to scan an area of interest (AoI) in N seconds, process the real-time data, and produce a hit report of all ground movers within the AoI with a false alarm rate of R and a probability of detection P.

The system flow-down requirements have specified the waveforms and the signal processing algorithms that can achieve this system performance
• As one begins to drill down to the next level of detail, some requirements might not be achievable within the scope of other requirements
• Requirements can be modified at this stage to help achieve the primary system goals

It is up to you to accept only requirements that can be achieved

[Figure: Area of Interest (AoI) divided into Swath 1, Swath 2, and Swath 3]

Embedded Processing Concept Design: Feasibility & Req Analysis (C2 & C3), Step 2

Derive the key performance parameters for the embedded processing subsystem
• Data rates
• Memory requirements
• Processing throughput requirements based on the required processing algorithms

[Figure: Coherent GMTI Processing Algorithm: I/Q Formation → Pulse Compression → Motion Compensation → Doppler Filtering → Clutter Cancellation → Noise Estimation → Target Detection → PDI Processing]
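As an illustration of deriving a processing throughput requirement from an algorithm, the sketch below sizes only the Doppler filtering stage. It assumes that stage is implemented as one complex FFT across the pulses of a CPI for every range bin and receive channel, and it uses the common 5·N·log2(N) rule of thumb for the floating-point operations in a radix-2 complex FFT. Both assumptions, the function names, and the example numbers are hypothetical and are not specified by the case study.

```python
import math


def fft_flops(n_points: int) -> float:
    """Rule-of-thumb FLOP count for one complex radix-2 FFT: 5 * N * log2(N)."""
    if n_points < 2:
        raise ValueError("FFT length must be at least 2")
    return 5.0 * n_points * math.log2(n_points)


def doppler_flops_per_second(pulses_per_cpi: int, range_bins: int,
                             channels: int, cpi_seconds: float) -> float:
    """Sustained FLOPS needed to Doppler-filter one CPI in one CPI time.

    One FFT of length pulses_per_cpi is assumed per range bin per channel.
    """
    ffts_per_cpi = range_bins * channels
    return ffts_per_cpi * fft_flops(pulses_per_cpi) / cpi_seconds


if __name__ == "__main__":
    # Hypothetical sizing: 128 pulses, 4096 range bins, 4 channels, 64 ms CPI.
    rate = doppler_flops_per_second(128, 4096, 4, 0.064)
    print(f"Doppler filtering: {rate / 1e9:.3f} GFLOPS sustained")
```

Repeating this kind of estimate for each stage in the chain and summing the results gives a first-cut throughput requirement for the whole subsystem.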

Deep dive into data rates:

Data rates can help the embedded processing subsystem engineer understand a lot about the problem

In our case study the sensor produces a very high-rate input data stream (tens of Gsamples/sec)
• A/D rates
• A/D sample word size (often a function of data rate)
• Number of input data channels
• REX-Processor network bandwidth and protocol
  – How is the data packaged and shipped?
  – How much extra bandwidth is needed for the protocol (e.g. error correction coding)?
  – What is the receive duty? (How much of the total time is data streaming?)
• Synchronous or asynchronous data flow
  – Flow control
  – How much rate buffering is required?
  – How is data synchronization achieved?

Data rates will drive memory and processing requirements
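The factors listed above combine into a first-order input data rate estimate. This is a rough sketch under simple assumptions: the peak rate is A/D rate times word size times channel count, the average rate scales with the receive duty, and the protocol adds a fixed fractional overhead. The class name, field names, and example values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class SensorLink:
    sample_rate_hz: float           # A/D samples per second, per channel
    bits_per_sample: int            # A/D sample word size
    channels: int                   # number of input data channels
    receive_duty: float = 1.0       # fraction of time data is streaming
    protocol_overhead: float = 0.0  # extra bandwidth fraction (e.g. coding)

    def peak_payload_bps(self) -> float:
        """Raw sample bits per second while a receive window is open."""
        return self.sample_rate_hz * self.bits_per_sample * self.channels

    def average_payload_bps(self) -> float:
        """Sample bits per second averaged over open and closed windows."""
        return self.peak_payload_bps() * self.receive_duty

    def average_link_bps(self) -> float:
        """Average bits per second on the link, including protocol overhead."""
        return self.average_payload_bps() * (1.0 + self.protocol_overhead)


if __name__ == "__main__":
    # Hypothetical sensor: 1 Gsample/s, 12-bit words, 8 channels,
    # 25% receive duty, 25% protocol overhead.
    link = SensorLink(1.0e9, 12, 8, receive_duty=0.25, protocol_overhead=0.25)
    print(f"peak payload : {link.peak_payload_bps() / 1e9:.1f} Gbit/s")
    print(f"avg payload  : {link.average_payload_bps() / 1e9:.1f} Gbit/s")
    print(f"avg on link  : {link.average_link_bps() / 1e9:.1f} Gbit/s")
```

The gap between peak and average rate is what rate buffering has to absorb: the link and memory must accept the peak rate during a receive window, while the processor only needs to sustain the average.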

Deep dive into data rates: Receive Duty

[Figure: Typical Radar Processing, One Beam Position (Dwell). N pulses, each followed by a receive window; the A/D samples (range bins) from the receive windows feed a Doppler filter bank]

Deep dive into data rates: System Front End Duty

System requires multimode operation
• Tight interleaving of front-end resources is desired for best system performance

[Figure: Data collection timeline alternating Mode 1, Mode 2, Mode 1]

Pipeline processing of an AoI
• A complete dwell of data is collected prior to processing
• Memory is required to hold a complete set of data (how big would this be?)

[Figure: Data collection timeline for dwells D0, D1, D2 (Mode 1 collection, Mode 2 collection, Mode 1 collection). Data processing timeline: Mode 1 D0 processing, then Mode 2 D1 processing]
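The question of how much memory a complete dwell needs can be answered with a back-of-the-envelope count of samples. The sketch below assumes every range bin of every pulse, CPI, and channel is stored, and that two buffers are needed so the next dwell can be collected while the current one is processed. The function names and the example sizes are hypothetical.

```python
def dwell_bytes(range_bins: int, pulses_per_cpi: int, cpis_per_dwell: int,
                channels: int, bytes_per_sample: int) -> int:
    """Bytes needed to hold every sample collected during one dwell."""
    return (range_bins * pulses_per_cpi * cpis_per_dwell
            * channels * bytes_per_sample)


def buffered_bytes(dwell_size_bytes: int, buffers: int = 2) -> int:
    """Total memory when several whole-dwell buffers are kept (2 = ping-pong)."""
    return dwell_size_bytes * buffers


if __name__ == "__main__":
    # Hypothetical dwell: 4096 range bins, 128 pulses/CPI, 4 CPIs/dwell,
    # 4 channels, 4 bytes per complex sample (16-bit I + 16-bit Q).
    one_dwell = dwell_bytes(4096, 128, 4, 4, 4)
    print("one dwell     :", one_dwell // 2**20, "MiB")
    print("double buffer :", buffered_bytes(one_dwell) // 2**20, "MiB")
```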

Pipeline with overlap processing
• Data is processed while it is being collected
• Required memory size is reduced

[Figure: Data collection timeline for dwells D0, D1, D2 (Mode 1 collection, Mode 2 collection, Mode 1 collection). Data processing timeline overlapping collection: Mode 1 D0 processing, Mode 2 D1 processing, Mode 2 D2 processing]

Parallel Processing
• Memory size required could be larger than for the fully pipelined architecture
• Processing performance required per processor is reduced
• Notice that the time to get the results from processing dwell D0 (latency) is longer in this case

[Figure: Data collection timeline for dwells D0, D1, D2 (Mode 1 collection, Mode 2 collection, Mode 1 collection). Processor 0: Mode 1 D0 processing. Processor 1: Mode 2 D1 processing. Processor 2: Mode 1 D2 processing]

Deep dive into data rates: What have we learned?

Keeping up with the real-time input data rate presents architecture trade-offs that can be used to balance the requirements for memory and processor performance

Parallel processing approaches that utilize an array of potentially slower processors could add processing latency but deliver more throughput overall
• Latency is more important for some applications than others
  – Air-to-air alert/confirm modes need very short latency
  – SAR ground maps typically have very loose latency requirements

There exist many opportunities to exploit processing parallelism once system requirements are fully understood
• For highly computation-intensive signal processing requirements, exploiting parallelism is typically required to achieve system requirements
• Exploiting parallelism can be a cost-effective approach for many less computation-intensive applications as well
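The latency versus throughput trade-off can be made concrete with a toy timing model. This sketch assumes dwells finish collecting at a fixed interval, each dwell is handed whole to one processor in round-robin order, and every processor takes the same fixed time per dwell. The function names and the timing numbers are hypothetical.

```python
import math


def processors_needed(proc_time_s: float, collect_time_s: float) -> int:
    """Fewest round-robin processors that keep up with the dwell rate."""
    return math.ceil(proc_time_s / collect_time_s)


def dwell_latencies(proc_time_s: float, collect_time_s: float,
                    n_processors: int, n_dwells: int) -> list:
    """Latency (result time minus end of collection) for each dwell."""
    free_at = [0.0] * n_processors      # time each processor becomes idle
    latencies = []
    for k in range(n_dwells):
        ready = (k + 1) * collect_time_s  # dwell k is fully collected
        p = k % n_processors              # round-robin assignment
        start = max(ready, free_at[p])    # wait if the processor is busy
        free_at[p] = start + proc_time_s
        latencies.append(free_at[p] - ready)
    return latencies


if __name__ == "__main__":
    # Hypothetical timings: a dwell is collected every 0.25 s.
    fast = dwell_latencies(0.2, 0.25, 1, 12)  # one fast processor
    slow = dwell_latencies(0.7, 0.25, 3, 12)  # three slower processors
    short = dwell_latencies(0.7, 0.25, 2, 12) # too few slow processors
    print("1 fast processor  : worst latency", round(max(fast), 3), "s")
    print("3 slow processors : worst latency", round(max(slow), 3), "s")
    print("2 slow processors : latency grows to", round(short[-1], 3), "s")
```

In this model both the single fast processor and the three slower ones keep up with collection, but the slower array reports each dwell later, which is the latency cost noted above. With only two slow processors the backlog grows without bound, so the real-time requirement is not met.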

Deep dive into processor performance:

Software Performance Drivers
• Algorithm design: a clever way of doing a computation can sometimes provide dramatic improvement in system processing performance
• Compiler efficiency: an optimizing compiler can generate efficient code (fewer hardware instructions needed per high-order language instruction)
• OS responsiveness: a real-time operating system (RTOS) can be designed to require minimal resources for the OS itself

Hardware Performance Drivers
• Processor execution speed: an optimized processor can execute more hardware instructions per second (or per watt)
• Memory effective bandwidth: memory and the memory bus must provide sufficient data and instruction access to keep up with the processor
• I/O system effective bandwidth: external interfaces and networks must provide sufficient input and output bandwidth to keep the processor busy

Balanced design requires each of these factors to be considered in allocating system resources

Deep dive into processor performance: Requirement allocation can greatly affect performance

A critical part of the embedded processing concept design is the allocation of functional requirements to hardware vs. software

In many cases this allocation is quite simple
• Memory storage (hardware)
• ALU functions (the lowest level of a computation engine) (hardware)
• Basic operating system functions (mutexes, semaphores, thread scheduling) (software)

Many functions will have both hardware and software components
• Ethernet interfaces
• DMA controllers (discussed later in detail)

It is very important to define the boundary between hardware and software
• ISAs (Instruction Set Architectures) are commonly used to define the boundary between hardware and software for a programmable device
• To the software, the ISA is an abstraction of the hardware
• To the hardware, it is a specification of what the hardware is required to do

Hardware Performance Drivers

Key performance drivers of an embedded processor
• Memory bandwidth
• Network bandwidth
• I/O bandwidth
• CPU OPS (operations per second)
  – Signal processing performance is usually expressed in FLOPS (floating-point operations per second)

How can we optimize for performance?
• Exploit parallelism
• Utilize a CPU that best fits the job

[Figure: Classic Harvard Architecture: CPU with clock, separate program memory and data memory, and I/O ports]

Exploiting Parallelism To Improve Performance

Parallelism is present in multiple forms
• Thread or Task Level Parallelism (TLP)

Wikipedia: Task parallelism (also known as function parallelism and control parallelism) is a form of parallelization of computer code across multiple processors in parallel computing environments. Task parallelism focuses on distributing execution processes (threads) across different parallel computing nodes. It contrasts with data parallelism as another form of parallelism.

Thread Level Parallelism: cont.

Independent execution streams can execute in parallel, all working toward a single goal
• Example: the multiple-processor example shown earlier, processing multiple AoIs in parallel

Simultaneous multithreaded operation is commonly supported within modern processors
• Multiple cores running independent threads
• Multiple hardware threads within a single core (SMT, simultaneous multithreading, or hyper-threading)

Most modern operating systems support simultaneous multithreading
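A minimal sketch of task-level parallelism using threads from the Python standard library: each dwell is handed to its own independent task, and the results are gathered at the end. The `process_dwell` function is a hypothetical stand-in for the real processing chain (here it only sums sample power), and the sample data is made up. In CPython, threads illustrate the structure rather than a true speed-up for compute-bound work.

```python
from concurrent.futures import ThreadPoolExecutor


def process_dwell(dwell_id, samples):
    """Stand-in for per-dwell processing: returns (id, total sample power)."""
    return dwell_id, sum(s * s for s in samples)


def process_all(dwells, max_workers=3):
    """Run one independent task per dwell and collect results by dwell id."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(process_dwell, dwell_id, samples)
                   for dwell_id, samples in dwells.items()]
        return dict(f.result() for f in futures)


if __name__ == "__main__":
    # Hypothetical data: three dwells with a few samples each.
    dwells = {0: [1, 2, 3], 1: [4, 5, 6], 2: [7, 8, 9]}
    print(process_all(dwells))
```

The tasks share no data, so they need no locking; that independence is what makes this form of parallelism easy to exploit.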

Exploiting Parallelism To Improve Performance

Data Level Parallelism (DLP)

Wikipedia: Data parallelism is a form of parallelization of computing across multiple processors in parallel computing environments. Data parallelism focuses on distributing the data across different parallel computing nodes. It contrasts with task parallelism as another form of parallelism.

Data level parallelism: cont.

Data is organized so that the same operation can be performed on every element of the data set at the same time.

This form of parallelism is abundant in radar signal processing (discussed later when we return to the GMTI algorithm)

It can be exploited in two typical ways
• Multi-core processor
• SIMD (single instruction, multiple data) processing cores

[Figure: Multi-core µP: input data distributed across Core 1 through Core N, producing output data]

[Figure: µP core: one instruction driving ALU0 through ALUN, with a register file]
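The multi-core form of data-level parallelism amounts to splitting one data set into chunks and applying the same operation to every chunk. The sketch below shows that decomposition with standard-library threads; the helper names, the gain operation, and the data are hypothetical. In CPython the threads demonstrate the partitioning rather than a real speed-up, since compute-bound threads share one interpreter lock.

```python
from concurrent.futures import ThreadPoolExecutor


def split_evenly(data, n_parts):
    """Split a list into n_parts contiguous chunks of near-equal length."""
    size, extra = divmod(len(data), n_parts)
    parts, start = [], 0
    for i in range(n_parts):
        end = start + size + (1 if i < extra else 0)
        parts.append(data[start:end])
        start = end
    return parts


def scale_chunk(chunk, gain):
    """The single operation that every core applies to its own chunk."""
    return [gain * x for x in chunk]


def data_parallel_scale(data, gain, n_cores):
    """Apply the same gain to all samples, one chunk per worker."""
    chunks = split_evenly(data, n_cores)
    with ThreadPoolExecutor(max_workers=n_cores) as pool:
        results = pool.map(scale_chunk, chunks, [gain] * len(chunks))
    # pool.map preserves chunk order, so concatenation restores the layout.
    return [y for chunk in results for y in chunk]


if __name__ == "__main__":
    samples = list(range(10))  # hypothetical range-bin samples
    print(data_parallel_scale(samples, 2, 4))
```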

SYSTEM ENGINEERING


Other Hardware Architecture Parallelism

DMA (Direct Memory Access) is a form of this parallelism

Without DMA, CPU cycles are required to move every piece of data from memory to an I/O device

With DMA, a few CPU cycles are used to set up the transfer, and the CPU can then do other work in parallel with the data movement

[Figure: transfers without DMA vs. transfers with DMA]
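The trade described above can be put into a rough cycle-count model. This is only a sketch: the word count, cycles per word, and setup cost below are assumed, illustrative numbers, not figures from any datasheet.

```python
def cpu_cycles_without_dma(n_words, cycles_per_word):
    """CPU copies every word itself (programmed I/O)."""
    return n_words * cycles_per_word

def cpu_cycles_with_dma(setup_cycles):
    """CPU only programs the DMA controller; the transfer runs on its own."""
    return setup_cycles

# Assumed, illustrative numbers (not taken from any datasheet).
n_words = 4096
cycles_per_word = 4
setup_cycles = 200

busy_no_dma = cpu_cycles_without_dma(n_words, cycles_per_word)
busy_dma = cpu_cycles_with_dma(setup_cycles)
freed = busy_no_dma - busy_dma   # cycles available for other work
```

The model ignores bus contention between the CPU and the DMA engine, which a real analysis would need to include.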



Optimizing Performance by: Utilizing the Best Fit CPU Type

Some multicore processors contain SIMD engines
• Freescale, Intel

DSP chips (Digital Signal Processing)
• Texas Instruments

GPGPUs (General Purpose Graphics Processing Units)
• Nvidia, Intel, AMD/ATI

FPGAs (Field Programmable Gate Arrays)
• Altera, Xilinx

ASICs (Application Specific Integrated Circuits)
• VLSI, Softchip, Micronix Integrated Systems



Trade-offs between GPPs, DSPs, and FPGAs

Property | General Purpose Processors (GPPs) | Digital Signal Processors (DSPs) | Field Programmable Gate Arrays (FPGAs)
Throughput per Watt | Lowest | More than GPPs | Highest (fixed point arithmetic)
Ease of Application Programming | Full-featured support of HOL programming | Limited support of HOL programming | VHDL required
Support for Floating Point Arithmetic | Yes. Some high-end GPPs have SIMD floating point vector units | Limited number of products available with floating point | Only with large performance penalty (products with optimized floating point starting to appear)
Recurring Cost per Component | High performance GPPs more expensive than DSPs | Lowest | Highest
When to Consider Using | When time to market is the dominant issue or when high performance floating point arithmetic is essential | When recurring cost is most important | When (fixed point) throughput per watt is most important



Comparing ASICs and FPGAs

An Application-Specific Integrated Circuit (ASIC) is an integrated circuit (IC) customized for a particular use, rather than intended for general-purpose use.

A Field-Programmable Gate Array (FPGA) contains programmable logic blocks and programmable interconnects that allow the same FPGA to be used in many different applications.

Property | ASIC | FPGA
Requires foundry run(s) for each application | Yes | No
Typical Development Cost | High | Moderate
Typical Development Schedule | Lengthy | Moderate
Recurring Cost | Moderate | High
Functional Density | High | Moderate
Power Consumption per Function | Lower | Higher
Maximum Clock Frequency | Higher | Lower



System Concept: High Level Embedded Processing System Architecture

[Figure: system block diagram. Labels: REX, General Purpose Processing, Mission Computer, INS, GPS, Signal Processing, High Speed Instrumentation, System, Embedded Processor]



Subsystem Level Synthesis

This is where the rubber meets the road

Multiple options are created and assessed

A baseline architecture is selected and its performance is estimated (usually roughly at this stage)
• Performance of relevant, similar systems can be a useful guide
• Scaling validated performance from an earlier fielded system or IR&D effort can greatly reduce risk

Performance assessment at this development stage
• Difficult for revolutionary designs
• Easier for evolutionary designs

The full set of flow-down requirements will be too much to evaluate in detail at this stage
• Focus on the KPPs (Key Performance Parameters)
• Use SMEs (subject matter experts) to help focus on the highest-risk, largest-impact requirements
• Identify the key items to evaluate

For our case study we know from experience that the signal processing will consume the bulk of the SWAP-C and drive system performance



Evaluating CPU Performance

Best approach would seem to be to run the actual application software workload on each candidate processor and measure the CPU time required on each

Possible complications with this approach
• Application software may not be available early in the development process, when the CPU must be selected
• Application software may include dependencies on the operating system and on external interfaces
   – Multiple versions of the application software may need to be created
• It may be difficult to remove the effects of CPU idle time spent waiting for external events and I/O transfers
• Candidate processor may not exist yet; evaluation may have to be done on a slowly executing simulation

Often the best way to obtain a comparison is to use one or more software benchmarks that adequately represent the application



Using Benchmarks to Evaluate CPU Performance

Benchmarking strategy is to obtain (or create if necessary) relatively simple sequences of code that together represent the most computationally intensive algorithms of the application
• Requires some insight on the part of the subsystem engineer to be able to identify these a priori

Use of multiple benchmarks creates an understanding of how well each candidate does on each algorithm
• The best CPU on one algorithm may not be the best on other algorithms

CPU selection will need to be based on balanced design principles, considering best overall benchmark performance as well as many other factors
• In particular, power consumption will be important for most embedded applications

Benchmark results may suggest compiler optimizations or even CPU architectural enhancements that will dramatically improve performance
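A minimal sketch of a multi-kernel benchmark harness in the spirit of the strategy above. The two kernels (`fir_kernel`, `sort_kernel`) are hypothetical stand-ins for an application's compute-intensive algorithms; a real study would substitute representative code and run it on each candidate processor.

```python
import time

def fir_kernel(n=2000, taps=16):
    """Stand-in for a filtering-style kernel (multiply-accumulate loop)."""
    x = [float(i % 7) for i in range(n)]
    h = [1.0 / taps] * taps
    return [sum(h[k] * x[i + k] for k in range(taps)) for i in range(n - taps)]

def sort_kernel(n=20000):
    """Stand-in for a control/data-dependent kernel."""
    return sorted((i * 7919) % 10007 for i in range(n))

def best_time(kernel, repeats=5):
    """Run a kernel several times and keep the fastest wall-clock time."""
    times = []
    for _ in range(repeats):
        start = time.perf_counter()
        kernel()
        times.append(time.perf_counter() - start)
    return min(times)

benchmarks = {"fir": fir_kernel, "sort": sort_kernel}
results = {name: best_time(fn) for name, fn in benchmarks.items()}
for name, seconds in results.items():
    print(f"{name}: {seconds * 1e3:.2f} ms")
```

Keeping the per-kernel times separate, rather than summing them, preserves the information that one candidate may win on one algorithm and lose on another.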



SPEC: Industry Standard Benchmarks

Standard Performance Evaluation Corporation (SPEC): established in 1988 by a consortium of computer vendors to create standard benchmarks for computer systems (www.spec.org)
• Originally intended to benchmark the performance of servers and workstations, using CPU-intensive benchmarks
• Has since expanded to include benchmarks for graphics, Java applications, client-server models, mail systems, file systems, and Web servers

CPU vendors normally execute the benchmark suite and provide documented results

Serious effort is made to produce benchmarks that avoid misleading comparisons, with strictly specified execution rules and reporting requirements



SPEC Benchmark Suites

SPEC CPU2006 contains two benchmark suites:

CINT2006 for measuring and comparing compute-intensive integer performance

CFP2006 for measuring and comparing compute-intensive floating point performance

Performance is expressed relative to a reference machine: the speed metrics report the ratio of reference run time to measured run time, and the rate metrics report throughput (how many copies of a benchmark complete per unit time)

Note: SPEC benchmarks measure the combined performance of the CPU and its compiler's code generation capability

Besides SPEC, many other benchmarks are available, and it’s usually feasible to create application-specific benchmarks when needed
• SPEC provides a good model of how to construct and use benchmarks to make fair “apples-to-apples” comparisons between candidate CPUs
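SPEC CPU2006 speed results are reported as ratios of a reference machine's run time to the measured run time, combined with a geometric mean. The sketch below shows that style of scoring; the benchmark names and timings are hypothetical and are not real SPEC data.

```python
import math

def spec_style_score(ref_seconds, measured_seconds):
    """Geometric mean of per-benchmark ratios (reference / measured)."""
    ratios = [ref_seconds[name] / measured_seconds[name] for name in ref_seconds]
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))

# Hypothetical timings in seconds; not real SPEC data.
reference = {"bench_a": 1000.0, "bench_b": 2000.0, "bench_c": 4000.0}
candidate = {"bench_a": 100.0, "bench_b": 100.0, "bench_c": 100.0}

score = spec_style_score(reference, candidate)
```

The geometric mean keeps one unusually fast benchmark from dominating the overall score, which is part of what makes the comparison fair.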



EDN Embedded Microprocessor Benchmark Consortium (EEMBC)

Less well-known than SPEC, but more relevant to most embedded systems

Non-profit consortium supported by member dues and license fees

Real world benchmark software helps designers select the right embedded processors for their systems

Standard benchmarks and methodology ensure fair and reasonable comparisons

EEMBC Technology Center manages development of new benchmark software and certifies benchmark test results

Originally started under the sponsorship of Electrical Design News (EDN)
• Formed in 1997 to develop meaningful performance benchmarks for the hardware and software used in embedded systems



EEMBC Benchmarks (Partial List)

Digital Entertainment: AES, DES, High-Pass Gray-Scale Filter, Huffman Decoding, MP3 Decode, MPEG-2 Decode, MPEG-2 Encode, MPEG-4 Decode, MPEG-4 Encode, RGB to CMYK Conversion, RGB to YIQ Conversion, RSA

Telecom Version 1.1: Autocorrelation, Bit Allocation, Convolutional Encoder, Fast Fourier Transform (FFT), Viterbi Decoder



Signal processing performance evaluation

We will need to exploit data level parallelism

SIMD engines in our CPU are a good candidate for exploiting this type of parallelism
• SIMD performance can’t be evaluated by target-agnostic benchmarks
• Vectorized libraries are required to utilize SIMD engines
• Use library timings characterized by the processor vendor to estimate the clock cycles needed to process a dataset of a particular size. Examples:
   – VSIPL standard signal processing library
   – Mercury Computer Systems SAL
   – Intel MKL

Memory bandwidth to feed the CPU is critical when using SIMD engines
• Evaluate data rates from the memory system and CPU memory interfaces
• Determine the number of processor clocks needed to move the data

Determine the performance driver
• Compute cycles
• Memory bandwidth
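One way to decide which of the two drivers dominates is to compare the compute cycles from a vendor library timing table with the clocks needed to move the data. The sketch below assumes computation and data movement overlap, so the larger of the two sets the pace; all numbers are assumed for illustration and the function name is hypothetical.

```python
def performance_driver(compute_cycles, bytes_moved, bytes_per_cycle):
    """Return the limiting resource and the cycle count it implies."""
    memory_cycles = bytes_moved / bytes_per_cycle
    if compute_cycles >= memory_cycles:
        return "compute", compute_cycles
    return "memory", memory_cycles

# Assumed, illustrative numbers for a 1024-point complex FFT:
# a vendor-table cycle count, 8 bytes per complex sample read and written once,
# and a memory interface sustaining 2 bytes per processor clock.
driver, cycles = performance_driver(
    compute_cycles=6000,
    bytes_moved=2 * 1024 * 8,
    bytes_per_cycle=2.0,
)
```

With these assumed numbers the kernel is memory bound: moving the data takes more clocks than computing on it.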



Signal Processing performance evaluation

Which library functions are the important ones?

We will explore this in more detail in the detailed design phase

[Figure: processing chain with blocks I/Q Formation, Pulse Compression, Motion Compensation, Doppler Filtering, Clutter Cancellation, Noise Estimation, Target Detection, PDI Processing]



Subsystem Level Synthesis (C4): Summary

Purpose: To select a subsystem level functional design.

Many trade studies are required

Modeling and simulation can be effective tools

Use of SMEs is critical to help focus on key requirements

Conceptual design is an iterative process.

The subsystem requirements are often revised based on the lessons learned during the design synthesis process

Requirement flow-down and traceability are key to this process.



Subsystem Design Review

Typically done after the system level SRR
• Major program milestone

Subsystem Concept Review is often part of the system level PDR

Objectives are very similar to those at the system level
• “Determine what needs to be done”
• “Establish the baseline for the next design phase”
• “Show how the baseline will meet the requirements”



Subsystem Design Review: Specific objectives

Show an understanding of the complete set of flow-down requirements

Specify derived requirements that constrain the design
• Show traceability back to the system requirements

Present the baseline architecture
• Where options are still under consideration, show the multiple approaches that will be selected from in the PDR stage

Document and review the analysis that led to the baseline architecture

Identify risks

Create a risk mitigation plan

Generate a preliminary requirements compliance matrix

Identify the subsystem TPMs (Technical Performance Measures)



Subsystem Design Review: TPMs, Risk Mitigation, Requirements Compliance

TPMs
• Identified at this early stage in the design
• Monitored and tracked at each subsequent design phase (PDR, CDR)
• An important tool for managing technical risk
• TPMs are dropped and added depending on the uncertainty and risk factors

Risk management plans should be in place for high priority TPMs
• Examples: SWAP, processing margin
• Plans should list the tasks that will be done to mitigate the risks with the highest probability and highest impact

Requirements Compliance Matrix
• Shows flow-down requirements
• Shows derived requirements and their linkage back to the higher level spec
• Shows the test method for each requirement
   – Validation by analysis
   – Validation at unit test level
   – Validation in the system integration lab
   – Validation in a deployed environment



Subsystem Concept Design Phase: Summary

This early design phase is extremely important
• Misinterpretation of requirements can result in a system that doesn’t meet customer expectations
• Overlooked design alternatives can result in a sub-optimal, non-competitive system
• Risks that are missed and then discovered later in the design phase can be very costly
• Flawed analyses can result in a system that just doesn’t work

Chances of success are greatly enhanced by following a sound system design process



Homework

Read: Computers as Components by Wayne Wolf
• Section 4.5 Designing with Microprocessors
• Section 4.7 System-level Performance

Write a short discussion answering
• What are a few important processing subsystem performance drivers? Discuss how you would analyze these performance drivers for our radar embedded processor case study.





Lecture 3





Lecture 3
• Preliminary Design



Lecture 3: Agenda: Preliminary Design Process

Starting point for Preliminary design of the embedded processing

Hardware Architecture Design

Software Architecture Design

Architecture Performance Analysis

Homework



The Goal of Preliminary Design

Thoroughly understand the system level requirements

Allocate the top level subsystem architecture requirements

• Identify the next level down subsystems and interfaces

• Flow down subsystem level requirements to these lower level subsystems

The main output from preliminary design is the allocated baseline (hardware and software baselines)

• Design description and analysis

• Requirements flow-down traceability

• Draft of a test compliance matrix

• PDR -- Preliminary Design Review





System design will have completed preliminary design
• Major subsystem interfaces defined
• Behavioral functionality of the subsystems defined
• Allocation of SWAP to subsystems defined
• Allocation of -ilities to subsystems defined
• Master Development Plan updated with more detail

[Figure annotations: "We are at this step", "This step completed"]



Preliminary Design Process

P1. Subsystem Requirements Analysis

Preliminary Subsystem Architecture

P2. Requirements Allocation

P3. Interface identification/design

P4. Subsystem-level synthesis

P5. Preliminary design review

To detailed design



Preliminary Design Phase: Embedded Processing

System design is in the detailed design phase
• Specs revised as the system design evolves
• System design risks in the process of being mitigated
   – Analysis results becoming available
   – New, unexpected problems being discovered
• Customer requirements potentially changing
• SOW changes
   – Due to cost and schedule updates
• System test and integration details developing
   – Impacts on subsystem design
   – New requirements

This is our starting point for the Embedded Processing preliminary design phase

Embedded subsystem concept design
• Rough idea of interface requirements
• Rough idea of processing algorithms
• Baseline architecture
• Coarse performance analysis
• Risks identified and mitigation plan defined



Embedded Processing Preliminary Design: Requirements Analysis

Drill down into processing algorithms (focus on the stressing pieces)

Algorithm laydown on the target architecture is required to get a more precise estimate of performance
• Programming model selected, initial target processor selected

Interface specification detailed
• All radar waveforms finalized by system CDR (not quite there yet)
• Explore the full range of variability on interfaces

Functional behavioral descriptions detailed
• Functional capabilities assessed and renegotiated with system team




Processor Block Diagram: From Concept Design Phase

[Figure: processor block diagram. Block labels: REX Data I/F, REX Cntrl I/F, Ethernet Controllers, Signal Processing Modules, Control Processing Module, System I/O. Link labels: PCIe x8, sFPDP x8, 10 Gb Ethernet, Custom I/F, high speed point-to-point mesh network]



Signal Processor: IBM Cell Processor

The Cell multi-core processor was a joint development between Sony, Toshiba, and IBM

First application was Sony’s PlayStation 3

First chips (90 nm version) available in 2005 (first chip used in the Sony PlayStation)

65 nm version in 2007 and 45 nm version in 2009

Chip performance was way ahead of its time in 2005
• This attracted the attention of the radar embedded processing team!



Theoretical Peak Performance in Ops/sec



IBM Cell Processor: Chip Spec



IBM Cell Features



IBM Cell 8 SPE: High Performance Engine Ideal for this Radar Application



Why is Cell Processor So Fast



Basics of Parallel Programming Models: SIMD (Single Instruction Multiple Data) Model

Wikipedia: Single instruction, multiple data (SIMD), is a class of parallel computers in Flynn's taxonomy. It describes computers with multiple processing elements that perform the same operation on multiple data points simultaneously. Thus, such machines exploit data level parallelism, but not concurrency: there are simultaneous (parallel) computations, but only a single process (instruction) at a given moment.

[Figure: SIMD core with an instruction register and decoder driving four ALUs that share a register file]
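A small sketch of the SIMD idea: one "instruction" is applied to every lane of a vector register at once. The lane count of four mirrors the four ALUs in the diagram; the function names are hypothetical. Python executes the lanes one after another, so this models only the structure, not the hardware speedup.

```python
LANES = 4  # matches the four ALUs in the diagram

def simd_add(a, b):
    """Apply one 'add' instruction to all lanes of two vector registers."""
    assert len(a) == len(b) == LANES
    return [x + y for x, y in zip(a, b)]

def vector_add(a, b):
    """Add two arrays LANES elements at a time (length must be a multiple of LANES)."""
    out = []
    for i in range(0, len(a), LANES):
        out.extend(simd_add(a[i:i + LANES], b[i:i + LANES]))
    return out

result = vector_add([1, 2, 3, 4, 5, 6, 7, 8], [10, 20, 30, 40, 50, 60, 70, 80])
```

Eight additions are issued as two vector instructions instead of eight scalar ones, which is where the throughput gain comes from on real SIMD hardware.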



Fundamentals of utilizing the multi-processor capability of the embedded processing system

TLP (Thread Level Parallelism) provides the opportunity to achieve high levels of parallelism in the sensor processing domain
• However, data movement between threads, if not done correctly, could be the kiss of death
• System performance can be brought to a halt waiting for data to be moved or reorganized across a parallel processing architecture

When utilizing TLP, concurrency, data synchronization, and data reorganization are key to performance!

So what are concurrency, data synchronization, and data reorganization?



Concurrency: Does it Imply Parallelism?

Sequential program
• A single thread of control that executes one instruction at a time
• Next instruction isn’t executed until the prior one has completed

Concurrent program
• A collection of autonomous sequential threads executing logically in parallel

Concurrency is not necessarily parallelism
• Interleaved concurrency
   – Logically simultaneous processing
   – Interleaved execution on a single processor
• Parallelism
   – Physically simultaneous processing
   – Requires a multi-processor, not just a multi-threaded single processor



Data Synchronization

Not every possible interleaving of threads leads to a correct program!

Concurrent programs require synchronization so that data produced by one processing step isn’t consumed until a complete, coherent set of data has been stored.

Synchronization serves two purposes
• Thread safety for access to shared resources
   – Avoids race conditions
• Coordinates actions of threads
   – Parallel computation
   – Event notification
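Both purposes of synchronization can be sketched with Python's standard threading primitives: a lock protects the shared buffer, and an event tells the consumer that the complete data set has been stored. The buffer contents and function names are hypothetical.

```python
import threading

buffer = []                      # shared resource
lock = threading.Lock()          # thread safety for the shared buffer
data_ready = threading.Event()   # event notification: the set is complete
result = {}

def producer(n):
    for i in range(n):
        with lock:
            buffer.append(i * i)
    data_ready.set()             # signal only after the whole set is stored

def consumer():
    data_ready.wait()            # do not consume a partial data set
    with lock:
        result["total"] = sum(buffer)

threads = [threading.Thread(target=consumer),
           threading.Thread(target=producer, args=(100,))]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

The consumer is deliberately started first: without the event it could sum a partially filled buffer, which is exactly the incorrect interleaving that synchronization rules out.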



Data Organization

To efficiently process very large datasets when utilizing thread level parallelism the data must be organized in distributed memories so that it can be accessed at the highest possible rate




Data Organization: Inter-process Communication Fundamentals

Parallel programs need to share data and results processed by different processors. There are two typical ways to pass data
• Shared memory architecture
• Message passing architecture

[Figure: shared memory architecture, with six processors attached to a global memory; message passing architecture, with five processor + memory nodes attached to an interconnection network]



Processing Domains

Processing Domains refers to the organization of data in memory
• Example shows data organized for processing in the Fast Time dimension

[Figure: data cube with axes Fast Time, Slow Time, and Channel, an arrow labeled "Data in Sequential Memory Locations", and blocks T0-T7 assigned to Thread 0 and Thread 1]



Data Corner Turn

Data cube was rotated in the slow time / fast time plane

Each thread now requires different data within its virtual address space

Data must be moved between these address spaces

[Figure: data cube after the corner turn, with axes Fast Time, Slow Time, and Channel, an arrow labeled "Data in Sequential Memory Locations", and blocks T0-T7 assigned to Thread 0 and Thread 1]
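On a single channel, a corner turn amounts to a transpose of the slow time by fast time matrix. The sketch below uses a hypothetical block of 3 pulses by 4 range samples; it shows only the reordering, not the movement of data between address spaces.

```python
def corner_turn(matrix):
    """Transpose a [slow_time][fast_time] block into [fast_time][slow_time]."""
    return [list(column) for column in zip(*matrix)]

# Hypothetical 3 pulses (slow time) x 4 range samples (fast time).
# Each inner list is contiguous along fast time.
pulses = [
    [0, 1, 2, 3],
    [10, 11, 12, 13],
    [20, 21, 22, 23],
]

range_gates = corner_turn(pulses)   # each inner list is now contiguous along slow time
```

When the rows are spread over several threads or processors, every output row needs one element from every input row, which is why a corner turn becomes an all-to-all data exchange.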



Identify the Processing Domains

Best functional performance achieved by processing all signal processing steps within a single domain prior to redistributing data



Embedded Processing: Software Architecture
Signal Processing Programming Model

Baseline target hardware attributes (based on performance analysis so far in the design process)
• SIMD engines (data level parallelization)
• DMA engines (data movement) (recall discussion from last lecture)
• Multiple processors (thread level parallelization)
• High bandwidth main memory
• High bandwidth network interfaces (point-to-point simultaneous data flow)

Desired programming model attributes
• Can explicitly express an algorithm's available parallelism
• Can exploit the hardware attributes
• Can hide or isolate low level programming details so that the application programmer doesn’t need to be concerned with things that can be automated
• Can express when to utilize shared memory (fast memory) communication
• Can express when to utilize message passing (slower memory) communication



Parallel Programming Approaches

Auto-vectorization for data level parallelism (DLP) extraction has been difficult to automate
• Many attempts (Intel C++ compiler, GCC, Green Hills Multi tools)
• Experience shows these tools aren’t particularly good

Source-to-source compilation for thread level parallelism (TLP) extraction
• Still a big research area (too risky for our case study)

A plethora of programming languages and parallelism-abstracting compilers have been developed
• Most focus on a particular form of parallelism or architecture
   – Shared memory -- Data Level Parallelism
   – Message passing -- Thread Level Parallelism
   – GPU specific architecture

Graph Programming Model for parallel programming has proven to be particularly good for the sensor signal processing domain



Programming Model: Directed Acyclic Graphs (DAG)

Wikipedia Definition
• In mathematics and computer science, a directed acyclic graph (commonly abbreviated to DAG), is a directed graph with no directed cycles. That is, it is formed by a collection of vertices and directed edges, each edge connecting one vertex to another, such that there is no way to start at some vertex v and follow a sequence of edges that eventually loops back to v again.

• Example:

[Figure: example DAG with vertices 2, 3, 5, 7, 8, 9, 10, 11]



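Programming Model: Directed Acyclic Graphs (DAG)

Placing vertices with no inputs on the left, vertices with no outputs on the right, and vertices with both inputs and outputs in the middle gives a more intuitive data flow diagram
• Acyclic nature becomes obvious

[Figure: the same DAG (vertices 2, 3, 5, 7, 8, 9, 10, 11) drawn twice, in its original layout and rearranged left to right as a data flow diagram]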



Signal Processing Programming Model Is Derived From DAGs

Directed Acyclic Graph (DAG) methodology is a perfect match for signal processing abstraction
• DAGs are a good method for expressing parallelism and data flow relationships
• The signal processing programming model is focused on exposing parallelism and processing precedence relationships

A vertex represents one or more signal processing functions, and directed edges are the data flow paths from one processing step to the next

The acyclic nature of a DAG is key to achieving an efficient processing structure
• The invocation of the processing at a vertex depends only on the availability of its input data
• Once processing at a vertex has been invoked, it runs to completion uninterrupted
• Data flows through the processing steps at a rate determined solely by the latency of the processing at each vertex

An efficient DAG performs as much processing as possible in a single vertex
• Data should only be passed to a downstream vertex if new dependencies exist

[Figure: "Poor Design", two separate vertices 1 and 2; "Good Design", a single merged vertex 1+2]
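The data-driven firing rule (a vertex runs as soon as all of its inputs are available, then runs to completion) can be sketched with the standard library's `graphlib` module (Python 3.9 or later). The vertex labels follow the example DAG in these slides; the edges are an assumption, since the figure itself is not reproduced here.

```python
from graphlib import TopologicalSorter

# Vertex -> set of upstream vertices it depends on.
# The vertex labels come from the example figure; the edges are assumed.
predecessors = {
    11: {5, 7},
    8: {7, 3},
    2: {11},
    9: {11, 8},
    10: {11, 3},
}

sorter = TopologicalSorter(predecessors)
sorter.prepare()                 # raises CycleError if the graph is not acyclic

firing_order = []
while sorter.is_active():
    ready = sorted(sorter.get_ready())   # vertices whose inputs are all available
    firing_order.append(ready)
    for vertex in ready:                 # each vertex runs to completion
        sorter.done(vertex)
```

Each inner list in `firing_order` holds vertices that have no dependencies on one another, so they are candidates for running in parallel.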



Graph Programming Model: Derived From Directed Acyclic Graphs (DAG)

DAGs form the basis for the Graph Programming Model

Used to express
– DLP (data level parallelism)
– TLP (thread level parallelism)
– Precedence relationships
– Data reorganization



Design the signal processing graph: Analyze Performance

Utilize the graph programming model to express DLP
• Group all the processing within a single processing domain into a jobClass
• A jobClass will have multiple instances, called jobs, that can consume DLP
• Jobs will utilize the multi-core capability of a single processor

Utilize the graph programming model to express TLP
• Group multiple data-independent jobClasses into a subgraph
• Subgraphs will be allocated to groups of processors and will run in parallel utilizing multiple processors



GMT Functional Requirements Allocation: JobClasses and Subgraphs

Graph design for GMT mode
• Allocation of processing functions to jobClasses based on corner turn boundaries, and to subgraphs based on TLP opportunities



Mode laydown in Graph Programming Model

Processing of a swath requires 1 dwell of radiation and collection

Multiple dwells are collected back to back with no gaps in collection time

Processing of 1 dwell of data requires 2 graphs
• CPI Graph
   – Coherent processing
   – 1 subgraph per receive channel
• PDI Graph
   – Post Detection Integration processing
   – 1 subgraph per graph

[Figure: CPI Graph with Subgraph 0 (Ch 0) through Subgraph 3 (Ch 3), followed by a graph with a single Subgraph 0]



Subgraph Laydown on Target Hardware

Real-time constraints identified
• Subgraph 0-3 processing time < dwell time
• Subgraph 4 processing time < 2 * dwell time

[Figure: timeline over dwells 0-3 showing, for each dwell, CPI subgraphs 0-3 and PDI subgraph 0 laid down on processors P0,0, P0,1, P1,0, P1,1, P2,0, and P2,1, where PX,Y denotes module number X and processor number Y]



Preliminary design review

Objectives: make sure the functional baseline requirements have been adequately addressed by the preliminary design
• Physical architecture
• Interfaces
• Subsystem functional requirements
• Real-time constraints
• SWAP
• -ilities

Key documents
• Subsystem description
• Interface control documents (ICDs)
• Preliminary timing analysis
• Requirements traceability
• Draft Requirement Compliance Matrix
• Design review package



Oh No! Houston, we have a problem! We aren’t meeting the real-time requirement

Updated analysis just prior to PDR found
• Subgraph 0-3 processing time > dwell time
• Each dwell's processing is falling further and further behind!!

Next lecture will address this problem
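Why a per-dwell overrun is fatal rather than merely slow can be seen from a simple backlog model. Dwells arrive back to back, so any excess processing time accumulates. The timings below are assumed for illustration and are not the case study's actual numbers.

```python
def backlog_after(n_dwells, dwell_time, processing_time):
    """Seconds by which processing lags collection after n back-to-back dwells."""
    lag = 0.0
    for _ in range(n_dwells):
        # Each dwell adds processing_time of work while only dwell_time elapses.
        lag = max(0.0, lag + processing_time - dwell_time)
    return lag

# Assumed, illustrative timings in seconds.
meets_deadline = backlog_after(10, dwell_time=1.0, processing_time=0.5)
falls_behind = backlog_after(10, dwell_time=1.0, processing_time=1.25)
```

With processing time below the dwell time the lag stays at zero; with a 25% overrun it grows linearly with the number of dwells and never recovers.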



Homework

Read Paper: Hybrid processor architectures meet demands for SWaP, by John Keller
• Available in CCLE 15S-ENGR180-1 Information Folder

Write a 1 page discussion answering
• What are the pros and cons of using a hybrid processor architecture for our case study of a radar embedded processor?
• Is a hybrid architecture a good potential solution to resolve our processing timeline issue?

Read Paper: HPEC2012 – Kirsch.pdf, Graph Programming Model: An Efficient Approach for Sensor Signal Processing, by Steve Kirsch
• Available in CCLE 15S-ENGR180-1 Information Folder



Backup Slides



Embedded Processing Use Case: Preliminary design

Hardware Architecture

Signal Processing Software Architecture

Performance of the Architecture
• Getting good performance requires a system approach

Let’s drill down into the architecture



IBM Cell Processor Component 1 PPE



Signal Processing Stressing Algorithm: Understand Behavioral Requirements

Expanded view of the GMT algorithm defined in the conceptual design phase



Functional Behavioral Requirements: Parameterizing the variability

Drilling down into the functional behavior of the processing steps
• Parameterize the functionality based on the system waveform definition



Signal Processing Libraries and Performance

SIMD architecture can be utilized via two methods
1. Standard programming languages (e.g. C++), if compiler technology supports automated vectorization of code
2. Predesigned signal processing libraries

Today’s compiler technology is very poor at automated vectorization of code

Best choice today is the use of signal processing libraries
• Signal processing libraries are target-dependent code written using SIMD instruction sets

SIMD instruction sets
• Are basically assembly-level code that can access the ISA (instruction set architecture) of the target processor
• Examples of SIMD instruction sets:
   – AltiVec – PowerPC architecture
   – SSE – x86 architecture
   – SPE intrinsics – IBM Cell SPE

Signal processing libraries are implemented with a SIMD instruction set
• Examples of signal processing libraries:
   – Mercury SAL
   – VSIPL (http://www.omgwiki.org/hpec/vsipl)
   – LAPACK
   – BLAS
   – FFTW





Lecture 4

June 2, 2015 Steve Kirsch




Lecture 4
• Detailed Design / Integration and Test



Lecture 4: Agenda: Detailed Design Process

Starting point for detailed design of the embedded processing

Hardware Architecture Design Improvement

Software Architecture Design Improvement

Detailed Performance Analysis

Detail Design and CDR

Test and Integration



The Goal of Detail Design

Synthesize the detail design

• Fully define all interfaces

• Fully define the functional behavior of all subcomponents

• Detailed analysis of performance

• Update of TPMs

• Detailed analysis of SWAP and -ilities

• Define test and integration approach

Refine recurring cost estimate

Refine non-recurring cost estimate and development schedule

The main output from detailed design is the baseline design (hardware and software designs)

• Design description and analysis

• Requirements flow-down traceability updated

• Test compliance matrix and test procedure documents

• CDR -- Critical Design Review



GMT Functional Requirements Allocation: JobClasses and Subgraphs From Preliminary Design Phase

Graph design for GMT mode
• Allocation of processing functions to jobClasses and subgraphs



PDR: Performance was identified as a big risk!

Reported at PDR
• Subgraph 0-3 processing time > dwell time

Processing time will be longer than the collection time, thus not keeping up with real time

[Figure: timelines labeled Data Collection and Data Processing, numbered by dwell]



Processor Block Diagram: Preliminary Design Had Margin For Growth

Processor Enclosure
• 5 module slots available: 4 used in baseline + 1 spare
• Sufficient spare prime power
• Sufficient total power dissipation margin
• Weight limit can accommodate a module in the spare slot

Signal Processing Module
• Sufficient board real estate for additional components
• Power regulation could accommodate additional components

System Design
• Has insufficient SWAP for an additional processor enclosure
• Processing subsystem firm requirement



Performance Growth: Source of real-time performance issue

Doppler Tune preliminary assessment accounted only for the application of the tuning parameters
• The computation that generates the tuning parameters was initially ignored, which resulted in a large unaccounted processing load

Pulse compression estimate greatly increased
• Performance was dominated by data movement, not computation cycles
• Analysis had focused on computation cycles

Main memory bandwidth became a bottleneck for many processing steps
• Initial analysis didn’t account for the simultaneous flow of REX data to main memory and of data produced between processing steps



Deeper Dive into Radar Processing

Pulse compression basics
• Pulse energy is transmitted as a long pulse due to the limit on the transmitter's total instantaneous output power
• Signal processing compresses the return signal for better range resolution
• Total energy in the long pulse = total energy in the compressed pulse

Signal processing consists of passing the signal through a matched filter

Pulses are phase coded for better compression
• LFM (Linear Frequency Modulation)
• Barker codes (discrete phase codes)
• Arbitrary phase and amplitude codes

[Figure: linear frequency modulation; pulse compression of a phase coded pulse]



Intuitive Approach to Pulse Compression: Matched Filter Utilizing Autocorrelation

[Figure: the transmitted pulse Tx (samples T0 to T3) is slid along the received energy Rx (samples R0 to R15); each time-shifted replica of Tx(sn) produces one output sample S0, S1, S2, ...]

Convolution function of Tx(sn) with Rx

S0 = T0*R0 + T1*R1 + T2*R2 + T3*R3

S1 = T0*R1 + T1*R2 + T2*R3 + T3*R4

S2 = T0*R2 + T1*R3 + T2*R4 + T3*R5, and so on

Convolution / Correlation = each time-shifted replica of Tx(sn) multiplied sample by sample with Rx and summed = Dot Product
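The sliding dot product on this slide can be sketched in a few lines. The example is illustrative: the function name, the 4-chip Barker code used as the transmitted pulse, and the echo delay of 5 samples are assumptions chosen to mirror the T0..T3 / R0..R15 picture, and the samples are real-valued (complex samples would need the conjugate of Tx).

```python
def sliding_correlation(tx, rx):
    """S[n] = sum over k of tx[k] * rx[n + k], for every full overlap."""
    m = len(tx)
    return [sum(tx[k] * rx[n + k] for k in range(m))
            for n in range(len(rx) - m + 1)]


# Hypothetical data: a 4-chip Barker code echoed back with a delay of 5 samples.
tx = [1, 1, -1, 1]
rx = [0] * 5 + tx + [0] * 7            # 16 received samples, R0..R15
s = sliding_correlation(tx, rx)

peak = max(range(len(s)), key=lambda n: s[n])
print(peak, s[peak])                   # 5 4: the compressed peak sits at the echo delay
```

The 4-sample pulse collapses to a single peak of height 4 at the echo delay, with sidelobes no larger than 1 in magnitude.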



Pulse compression math

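Definition of the convolution theorem

$$\mathcal{F}\{f * g\} = \mathcal{F}\{f\} \cdot \mathcal{F}\{g\}$$

where $\mathcal{F}\{f\}$ denotes the Fourier transform of $f$

Therefore one can do a “Fast Convolution”

$$f * g = \mathcal{F}^{-1}\{\mathcal{F}\{f\} \cdot \mathcal{F}\{g\}\}$$

Pulse compression is achieved by performing continuous time-domain convolution

$$(f * g)(t) = \int_{-\infty}^{\infty} f(\tau)\, g(t - \tau)\, d\tau$$

Discrete form of the convolution

$$(f * g)[n] = \sum_{m=-\infty}^{\infty} f[m]\, g[n - m]$$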
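For sampled data, the product of two N-point DFTs corresponds to circular convolution, so fast convolution needs zero-padding to reproduce the linear convolution. The relation below is the standard result. The symbols x (L received samples), h (M matched-filter taps) and N (transform size) are notation introduced here, not taken from the slides.

```latex
\begin{aligned}
y[n] &= \sum_{m=0}^{M-1} h[m]\, x[n-m], \qquad 0 \le n \le L + M - 2,
\quad x[k] = 0 \text{ for } k \notin [0, L-1] \\
y &= \mathrm{IDFT}_N\!\left\{ \mathrm{DFT}_N\{x\} \cdot \mathrm{DFT}_N\{h\} \right\},
\qquad N \ge L + M - 1
\end{aligned}
```

If N is smaller than L + M - 1, the tail of the result wraps around onto its beginning, which is why block methods either pad or discard the wrapped samples.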



Discrete Fourier Transform (DFT)



DFT is Computationally Intensive

• Given a set of N complex input samples, x_n where n = 0, ..., N-1, the DFT filters are:

$$F_m = \sum_{n=0}^{N-1} W'_{nm}\, x_n, \quad \text{where } m = 0, \ldots, N-1$$

$$\text{and } W'_{nm} = \exp\!\left(-j\,\frac{2\pi m n}{N}\right) = W^{mn}, \quad \text{where } W = \exp\!\left(-j\,\frac{2\pi}{N}\right)$$

• Assuming the W’ factors can be pre-computed, a straightforward N-point DFT requires ~N^2 complex multiplies + ~N^2 complex adds, for a total of ~8N^2 real arithmetic operations

complex multiply = 6 real ops (4 multiplies + 2 adds)

complex add = 2 real ops

• For example, a single 1024-point DFT takes (1024^2) * 8 = 8,388,608 ops, or about 8.4e6 ops
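A direct implementation makes the N^2 cost visible: the double loop below performs one complex multiply-add per (m, n) pair. This is a minimal sketch using only the standard library; the function names are illustrative, and the negative exponent follows the usual forward-DFT convention.

```python
import cmath


def dft(x):
    """Direct N-point DFT: F[m] = sum over n of x[n] * exp(-2j*pi*m*n/N)."""
    n_pts = len(x)
    # Pre-compute the N distinct twiddle factors W^k, k = 0..N-1.
    w = [cmath.exp(-2j * cmath.pi * k / n_pts) for k in range(n_pts)]
    return [sum(x[n] * w[(m * n) % n_pts] for n in range(n_pts))
            for m in range(n_pts)]


def dft_real_ops(n_pts):
    """N^2 complex multiplies (6 real ops) plus N^2 complex adds (2 real ops)."""
    return n_pts * n_pts * (6 + 2)


print(dft_real_ops(1024))   # 8388608
```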



An Algorithm for Rapidly Computing DFTs: The Fast Fourier Transform (FFT)

FFT is a clever algorithm for computing DFTs, published by Cooley and Tukey in 1965

• It exploits symmetry in the computation, which greatly reduces the number of operations

N-point FFT ops = (N/2) * log2(N) * 10 ops

• 1024-point FFT = 51,200 ops

FFT results are numerically identical to those of the corresponding DFT (not an approximation)

Advantage of FFT grows with increasing DFT size
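The operation count above follows from the butterfly structure: each of the log2(N) stages has N/2 butterflies, and one butterfly is 1 complex multiply plus 2 complex adds, or 10 real ops. The recursive radix-2 sketch below is one common textbook form, not necessarily the variant used in the case study, and the function names are illustrative.

```python
import cmath
import math


def fft(x):
    """Recursive radix-2 decimation-in-time FFT; len(x) must be a power of 2."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2])
    odd = fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]   # twiddle times odd term
        out[k] = even[k] + t             # one butterfly: 1 complex multiply
        out[k + n // 2] = even[k] - t    # and 2 complex adds = 10 real ops
    return out


def fft_real_ops(n):
    """Slide estimate: N/2 * log2(N) butterflies at 10 real ops each."""
    return (n // 2) * int(math.log2(n)) * 10


n = 1024
print(fft_real_ops(n))                  # 51200
print(8 * n * n // fft_real_ops(n))     # 163: direct DFT needs about 164x more ops
```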



Pulse Compression by Fast Convolution

Time domain convolution

• Assume 1024 complex floating-point collected samples

• Assume pulse width of 256 samples

• Time domain convolution FLOPs = 256 complex multiply-adds (8 FLOPs each) * 1024 collected samples = 256 * 8 FLOPs * 1024 = 2,097,152 FLOPs

Fast Convolution

• N = 1024

• FLOPs = ((N/2) * log2(N) * 10 FLOPs) * 2 (forward and inverse) = 51,200 * 2 = 102,400 FLOPs
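Evaluating the two formulas on this slide directly gives the comparison below. It is a sketch under the slide's own simplifications: the point-by-point spectral multiply and the transform of the reference pulse are not counted, and the function names are illustrative.

```python
import math


def time_domain_flops(n_samples, pulse_len):
    """pulse_len complex multiply-adds (8 real FLOPs each) per collected sample."""
    return pulse_len * 8 * n_samples


def fast_convolution_flops(n):
    """Forward FFT plus inverse FFT, each N/2 * log2(N) butterflies of 10 FLOPs."""
    one_fft = (n // 2) * int(math.log2(n)) * 10
    return 2 * one_fft


td = time_domain_flops(1024, 256)        # 2097152
fc = fast_convolution_flops(1024)        # 102400
break_even = fc / (8 * 1024)             # pulse length at which both cost the same

print(td, fc, td / fc, break_even)       # 2097152 102400 20.48 12.5
```

Under these assumptions fast convolution is about 20 times cheaper for a 256-sample pulse, and it wins for any pulse longer than roughly 12 samples.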



Class Group Discussion

Using your newly acquired embedded system engineering skills, how would you attack the processing performance risk we have identified in our case study?

Divide into 4 teams

• 15 minutes to discuss

• Nominate a spokesman for your group

Create an approach for resolving the performance issue

• What are the trades to consider?

Address the root causes of the performance issues

Utilize your knowledge from the last homework assignment

Root causes of real-time performance issues

Doppler Tune preliminary assessment accounted only for the application of the tuning parameters

• Generation of the tuning parameters was initially ignored, which resulted in a big unaccounted processing load

Pulse compression estimate greatly increased

• Performance was dominated by data movement, not computation cycles

Main memory bandwidth became a bottleneck for many processing steps

• Initial analysis didn’t account for simultaneous data flow of REX data to main memory and data produced between processing steps



Attacking the Real-time Performance Issue

Was the performance analysis correct? Is a more thorough analysis needed?

Understand real-time system requirements

• How tightly spaced do the spots need to be?

• Can the processing fall behind and then catch up?

Could a system requirement be modified in some way, without major system performance impact, to resolve the embedded processing limitation?

Could the system requirements allocation be modified?

• Could pulse compression processing be done in the REX prior to sending data to the processor?

Once system solutions appear to be a dead end, focus on subsystem solutions

• Can margin that was planned to reduce risk later in the program be used now to solve this performance problem?

• If the spare slot is used for an additional signal processing card, will it solve the performance issues?

• What are the options for increasing throughput and memory bandwidth on the signal processing card?

Increased development cost and NRE might be a big driver for the solution

• Performance trade studies and risk analysis affect cost assessments

Next let’s look at the trades and results



Performance Trades: Second Look at Requirements and Assumptions

Modifying system requirements and allocations wasn’t acceptable

• Tailoring system requirements to address only the specific known radar mode was deemed a poor choice

• Design requirement to accommodate new “undefined” applications is very important

Adding additional processing units to the system

• Though this approach could meet the SWAP requirements of the first application of the system, it was deemed too expensive and would exceed the SWAP for other potential applications

• Partitioning a mode across multiple units, given limited box-to-box bandwidth, potentially wouldn’t solve all the performance issues

Utilizing the spare slot for the additional performance would violate the processing margin requirement

• Intent of the spare is for future programs and risk reduction during the test and integration phase

Best option was to increase signal processing module performance within the module SWAP allocation

• Program resources could be reallocated (i.e., $$, schedule, and engineering talent)

• Module SWAP margin was a lower risk and margin could be used earlier in the program

Next step is trade studies for the best way to improve module performance



Review of Performance Analysis

Error in performance analysis discovered!

• Programming model not well understood by the engineer doing performance modeling

Key aspect of programming model utilizes DMA and double buffering to parallelize data movement with computation cycles

Use of DMA requires target-specific software design

[Figure: Ping Pong Buffer. Labels: data dependent processing domain, data independent processing domain, Dataset, PingBuf, PongBuf. Timeline over t1 to t8 with rows DMA to Ping, DMA to Pong, and Processing.]

Processing is fully parallelized with data movement if compute cycles take the same amount of time as data movement

This technique of overlapping data movement with processing is called tiling
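The benefit of ping-pong buffering can be estimated with a simple timing model. This is a sketch under stated assumptions: one DMA engine, two buffers, fixed per-block times, and illustrative function names. It does not model the Cell's actual DMA hardware.

```python
def serial_time(n_blocks, t_dma, t_proc):
    """No overlap: each block is moved, then processed."""
    return n_blocks * (t_dma + t_proc)


def ping_pong_time(n_blocks, t_dma, t_proc):
    """Double buffering: DMA fills one buffer while the CPU processes the other.

    The first transfer and the last processing step cannot be hidden; every
    step in between costs the slower of the two activities.
    """
    return t_dma + (n_blocks - 1) * max(t_dma, t_proc) + t_proc


print(serial_time(4, 1, 1))      # 8
print(ping_pong_time(4, 1, 1))   # 5  balanced: data movement is fully hidden
print(ping_pong_time(4, 2, 1))   # 9  DMA-bound: the CPU idles between blocks
```

When the two times are equal the pipeline approaches a 2x speedup for long runs; when they differ, the slower activity sets the pace.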



Data Movement vs Throughput In Determining Performance

FFT example

• N = 1024 complex floating-point samples

• Total FLOPs to perform pulse compression via fast convolution = 2 FFTs * 51,200 FLOPs = 102,400 FLOPs

• Assume CPU executes 1 FLOP/ns

• Fast convolution time = 102,400 FLOPs / (1 FLOP/ns) = 102.4 µs

• Assume memory bandwidth = 100 MB/sec; complex floating-point sample = 8 bytes

• Data movement time = 1024 * 8 bytes * 2 (in and out) / (100 MB/sec) = 163.84 µs

Data movement time is longer than computation time

Overall processing time driven by data movement time
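The comparison on this slide can be checked by evaluating the stated assumptions (1 FLOP/ns, 100 MB/sec, 8-byte samples, two 1024-point FFTs at 51,200 FLOPs each). The helper names are illustrative, and 1 MB is taken as 10^6 bytes.

```python
def compute_time_us(flops, flops_per_ns):
    """Computation time in microseconds."""
    return flops / flops_per_ns / 1000.0


def transfer_time_us(n_samples, bytes_per_sample, bandwidth_mb_per_s):
    """Time to move the data in and the result out, in microseconds.

    1 MB/s equals 1 byte per microsecond, so bytes / (MB/s) is already in us.
    """
    total_bytes = n_samples * bytes_per_sample * 2
    return total_bytes / bandwidth_mb_per_s


t_cpu = compute_time_us(2 * 51200, 1.0)      # about 102.4 us
t_mem = transfer_time_us(1024, 8, 100)       # about 163.84 us

# With DMA overlap, the slower of the two activities sets the step time.
bound = "memory" if t_mem > t_cpu else "compute"
print(round(t_cpu, 2), round(t_mem, 2), bound)   # 102.4 163.84 memory
```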



Trade-study results

Analysis error accounted for only a small fraction of the performance issue

Re-allocation of processing requirements from the Cell to a new front-end processor looks promising

• Front-end high data rate processing characteristics

   - A very small number of processing functions require > 50% of processing performance:

      FIR (Finite Impulse Response) filter for IQ formation or IQ calibration

      Phase ramp generator and complex multiply

      Large FFTs

   - Large data rate reduction after front-end processing (reducing processing load on following stages)

   - Application-specific design tends to have the highest performance per SWAP

Trades Conclusion

• Additional investment to develop an “application specific” solution for front-end processing functions

   - FPGA (Field Programmable Gate Array) solution is the best choice (other contenders: GPGPUs and DSP-specific COTS chips)

   - Biggest bang for the buck!

   - Front-end processing is fairly consistent between different mode applications

   - Greatly reduces load on IBM Cell

• Add more on-module memory bandwidth

   - Decouple REX data ingest from the rest of IBM Cell processing



Processor Block Diagram: Update from Preliminary Design Phase

[Figure: updated processor block diagram. Labels: REX Data I/F, REX Cntrl I/F, Ethernet Controllers, Control Processing Module, System I/O (10 Gb Ethernet, Custom I/F), high-speed point-to-point mesh network, Signal Processing Module, Main Memory, CPU (IBM Cell), Front-end Processor, Network Interface Controller, Distributed Global Bulk Memory, sFPDP x8.]

New Features (Distributed GBM, Front-end Processor)



Solution is a Hybrid Architecture

Front-end Processor functions

• FIR filter (IQ formation / calibration)

• Phase ramp generator and complex multiplier

• Large efficient FFT

Front-end Processor implementation

• Application-specific FPGA (Field Programmable Gate Array) based design

• High-bandwidth memory interface

• Designed as an offload engine

   - Very large, computationally intensive functions

GBM functions

• REX data stored in GBM instead of main memory

   - Decouples the high-bandwidth REX interface from Cell computations

• Front-end processor accesses data directly from GBM

   - Reduces competition for main memory bandwidth between processor types

Hybrid design addresses all three of the key performance issues within the available SWAP

1) Doppler tuning parameter generation

2) Large FFT computational speed

3) Memory bandwidth limitations
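The split between generating and applying Doppler tuning parameters can be illustrated with the phase ramp generator and complex multiplier listed among the front-end functions. This sketch assumes that tuning amounts to multiplying each sample by a unit-magnitude complex phasor; the function names, the 1 kHz offset and the 8 kHz sample rate are hypothetical.

```python
import cmath


def phase_ramp(n_samples, doppler_hz, sample_rate_hz):
    """Generate tuning parameters: one complex exponential per sample."""
    step = -2.0 * cmath.pi * doppler_hz / sample_rate_hz
    return [cmath.exp(1j * step * n) for n in range(n_samples)]


def apply_tuning(samples, ramp):
    """Apply tuning parameters: one complex multiply per sample."""
    return [s * r for s, r in zip(samples, ramp)]


# Hypothetical input: a pure tone at the Doppler offset to be removed.
fs, fd, n = 8000.0, 1000.0, 16
tone = [cmath.exp(2j * cmath.pi * fd * k / fs) for k in range(n)]
tuned = apply_tuning(tone, phase_ramp(n, fd, fs))

print(max(abs(v - 1) for v in tuned) < 1e-9)   # True: the tone is shifted to DC
```

Applying the ramp costs one complex multiply per sample, while generating it costs a sine and cosine evaluation per sample, which is the kind of load an estimate can miss if it counts only the application step.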



Lessons Learned

Fully understand all requirements as thoroughly as possible and as early in the design process as possible

• Hardware requirements

• Software requirements

• Interaction between hardware and software

Perform as thorough a performance analysis as early as practical

• Problems discovered later in the design process are much more costly (e.g., if performance issues had been found in integration, the fix would have been very expensive)

Explore higher-level requirements as well as lower-level allocations when resolving issues

• Though in this case we weren’t able to change the system requirements, it was worth exploring

Use risk analysis when performing performance trades

• A lower-cost solution might have been to give up design margin, but the consequences were too high and the probability of an occurrence wasn’t low enough

Often application-specific designs are general enough to have wide applicability if scope is limited

• Application-specific designs can be more SWAP efficient than general solutions, but are in general more costly



Integration and Test

Requirements flowdown and allocation to subsystems includes requirement validation documentation

• Requirements Compliance Matrix: specifies the test method

   - Deployed system field test

   - System Integration Lab (SIL) test

   - Unit level test

   - Analysis

   - Inspection

   [Arrow alongside the list of test methods: increasing complexity and cost of validation]

• Test Description Document

   - Detailed description of tests and support equipment required to do the test

• Test Procedure Document

   - Specifies how to do the test and expected results



Integration and Test

Key concepts to keep in mind when planning for integration and test

• Sufficient visibility for unit testing and system integration lab testing

   - Is there support for inspecting memory?

   - Is there support for monitoring system state while in operation?

   - Is there support for monitoring bus activity?

   - Is there support for monitoring operation of application-specific implementations (e.g., inside of an FPGA)?

• Real-time debug tools for unit test and system integration lab

   - Does the IDE (Integrated Development Environment) support non-intrusive monitoring of OS and application software? (example next slide)

• System-level instrumentation (support for both SIL and field testing)

   - At the full system level, are there sufficient interfaces and capability provided for non-intrusive real-time access?

   - Is there sufficient support for data reduction tools (sorting and understanding of the data of interest)?



Example of an IDE Real-time Non-Intrusive Debug Tool

Intel VTune: Performance Profiler

Hotspot (statistical call tree), call counts (statistical)

Thread profiling with lock and waits analysis

Cache miss, bandwidth analysis

OpenCL kernel tracing & GPU offload on Windows*



Example of an IDE Real-time Non-Intrusive Debug Tool

Green Hills IDE Event Analyzer

• EventAnalyzer displays the length and frequency of RTOS and user events, making it quickly apparent what operations take the most time and where optimization efforts should be focused



Critical Design Review (CDR)

CDR purpose

• Final design review prior to the official acceptance of the design

• Opportunity for all stakeholders to assess the design’s compliance with requirements

• Opportunity to review risk assessment and mitigation results

   - All risks should be well understood and accepted at this time

• To force the detailed design documentation effort

• To refine non-recurring and recurring costs

Goal of a CDR

- Demonstrate the design meets the functional and performance requirements

- Assure the test and evaluation strategies, procedures and support are in place for the next development phase

- Establish the Product Baseline

Successful completion of CDR is the green light for the next development phases

- Building hardware

- Writing of application software

- Unit test

- System test



Embedded Processing Case Study Summary

Last 4 lectures stepped through the design development process for embedded processing design

• Concept development

• Preliminary design

• Detailed design

• Integration and Test (briefly)

Case study utilized a “real” application for real-time, high-performance embedded processing in a highly SWAP-constrained environment

Goal was to provide insight into the system engineering process, the myriad complexities that the embedded system engineer needs to be aware of, and the skill set required

Final takeaway on the role of an Embedded Processing Engineer

1) Requires broad technical knowledge of both hardware and software technologies

2) Requires excellent team skills

3) There is no system design process that can replace experience!

4) High demand for engineers with this skill set!

Embedded Subsystem Engineer’s Job is Very Challenging and Very Rewarding
