embedded system architecture by Ralf Niemann


Hardware/Software Codesign

of Embedded Systems

Petru Eles and Zebo Peng

Embedded Systems Laboratory (ESLAB), Linköping University

Embedded Tutorial

Lecture Contents

= Introduction and basic issues.

= Architectures and platforms.

= Analysis, co-simulation, and design space exploration.

Prof. Z. Peng, ESLAB/LiTH

Introduction

= Codesign of embedded systems

= Definition and motivation

= The design flows

= System level design issues

Traditional Design Flow

Informal System Specification

Early, Manual Partitioning

HW Specification | SW Specification


Design Time

[Figure: design timelines along a time axis. Traditional design: Specification & Partitioning, HW Design & Simulation, SW Design & Simulation, Integration & Test. HW/SW codesign: the same phases plus Co-simulation & Co-verification, giving a reduced time to market (TTM).]

HW/SW Codesign

= The concurrent design of hardware and software elements, supporting explicit hardware/software trade-off.

0 Co-specification to create a common specification that describes both hardware and software elements


Why Codesign?

= Reduce time-to-market.

= Achieve better designs:

0 More design alternatives can be explored.

0 Better solutions can be found by advanced optimization techniques.

= To meet strict design constraints, such as:

0 Timing or performance constraints.

0 Power dissipation.

0 Physical constraints, e.g., size, weight, etc.

0 Safety and reliability constraints.

0 Cost constraints.

= Codesign is also made possible by the advances in design methodologies and tools.

Vertical Codesign

= Instruction set processor design, for both general-purpose systems and ASIPs (Application Specific Instruction set Processors).

To determine how big the

hardware engine you need to

Specification


Codesign of Processors

= General-Purpose Processors

0 Architectural support for operating systems.

0 Cache design and tuning (e.g., selection of cache size and control schemes).

0 Pipeline control design (control mechanisms, compiler design).

= ASIPs

0 Customization of instruction sets and specific resources (e.g., accelerator and coprocessor).

0 Design of register files, busses and interconnections.

0 Development of specific compiler.

Horizontal Codesign

= Some of the system functionality is implemented in software running on programmable CPUs, while other functions are implemented in hardware.

= Typical for design of embedded systems.

SpecificationCodesign of

Specialized processor


What is an Embedded System?

= There are many different definitions!

0 A special-purpose computer system that is used for a particular task.

0 A computer-based system embedded in real-life machines. Though computer based, it does not have the usual keyboard and monitor. The processor and related circuitry are configured to do a specific task.

= Some highlight what it is (not) used for:

0 Any device which includes a programmable component but itself is not intended to be a general purpose computer.

= Some focus on what it is built from:

0 A collection of programmable parts surrounded by ASICs and other standard components, that interact continuously with an environment through sensors and actuators.

Characteristics of an Embedded System

= Dedicated (not general purpose).

0 One or several applications known at design-time.

= Contains a programmable component.

0 But usually not programmable by the end-user.

= Interacts (continuously) with the environment:

0 Real-time behavior.


[Figure: microprocessor market shares in 1999: general purpose systems 1%, embedded systems 99%.]

Embedded Controllers

[Figure: an embedded controller made of CPU, RAM, ROM, ASIC, I/O interface, and network interface, connected to sensors and actuators.]

Distributed Embedded Systems

[Figure: a distributed system of ECUs and gateways.]

Time and Power Constraints

= Time constraints:

0 They have to perform in real-time: if data are not ready by a certain deadline, the system fails to perform correctly.

0 Hard deadline: failure to meet it leads to major hazards.

0 Soft deadline: failure to meet it can be tolerated but quality of service is reduced.

= Power constraints


Safety Critical Requirements

= Embedded systems are often used in life-critical applications.

0 Avionics, automotive electronics, nuclear plants, medical applications, military applications, etc.

= Reliability and safety are major requirements.

= To guarantee correctness during design:

0 Formal verification: Mathematics-based methods to verify certain properties of the designed system.

0 Automatic synthesis: Certain design steps are automatically performed by design tools. Correctness by construction.

Short Time to Market

= In highly competitive markets it is critical to catch the market window:

0 A short delay with the product on the market can have catastrophic financial consequences (even if the quality of the product is excellent).

= Design time has to be reduced!


The ES Design Challenges

= Increasing application complexity (e.g., automotive).

= Heterogeneous architecture (HW, SW, network, mechatronics, etc.).

= Stringent time and power constraints.

= Low cost requirement.

= Short time to market.

= Safety and reliability (e.g., very long life-time).

= In order to achieve all these requirements, systems have to be highly optimized.

= Both hardware and software aspects have to be considered simultaneously!

Current Design Practice

1. Start from some informal specification and a set of constraints (time, power, and cost constraints).

2. Generate a more formal specification, based on some modeling concept (FSM, data-flow, etc.), using Matlab, Statecharts, SystemC, C, UML, or VHDL.

3. Simulate the model in order to check its functionality. The model is modified, if needed.


The Consequences

= Delays in the design process:

0 Increased design cost.

0 Delays in time to market: missed market window.

= High cost due to many iterations with implementation and prototyping.

= Bad design decisions taken under time pressure:

0 Low quality.

0 High cost.

= The lesson: We need to explore more design alternatives in an efficient manner.

0 At the system level!

System-Level Design

[Figure: system-level design flow. Informal specification and constraints; Modeling; System model (with functional simulation and formal verification); Arch. selection; System architecture; Mapping.]


The Improved Design Flow

= Several design alternatives are evaluated before going down to the lower-level design.

0 This is performed as part of the design space exploration process.

0 Different architectures, mappings and schedules are explored, before the actual implementation and prototyping.

= We get highly optimized solutions in short time.

0 There is a good chance that design iterations at the lower-level, including prototyping, can be avoided.

Additional Improvements

= Formal verification

0 It is impossible to do an exhaustive simulation.

0 Especially for safety critical systems, formal verification is needed.

= Simulation

0 Used not only for functional validation.

0 Should also be used after mapping and scheduling in order to check, for example, timing properties.


The Lower-Level Issues

= Software generation:

0 Encoding in an implementation language (C, C++, assembler).

0 Compiling (this can include particular optimizations for application specific processors, DSPs, etc.).

0 Generation of a real-time kernel or adapting to an existing operating system.

= Hardware synthesis:

0 Encoding in an HDL (VHDL or Verilog).

0 Successive synthesis steps: high-level, register-transfer level, logic-level synthesis.

= Hardware/software integration:

0 The software is run together with the hardware model (co-simulation).

= Prototyping:

0 A prototype of the hardware is constructed and the software is executed on the target architecture.

Lower-Level Design

There are established CAD tools on the market which automatically perform many of the low level tasks:

= Code generators (software model -> C, hardware model -> VHDL)

= Compilers.

= Hardware synthesis tools


Focus on System-Level Design

= Has a huge influence on the quality of the final implementation.

= Very few commercial tools are available.

= Mostly experimental and academic tools available.

= Huge efforts and investments are currently made in order to develop tools and methodologies for system level design.

= Ad-hoc solutions are less and less acceptable.

= It is the system level we are mainly interested in, in this course!

Concluding Remarks

= Codesign provides the capability to make explicit and efficient hardware/software trade-off.

= Codesign of embedded systems has many advantages and challenges.

Analysis, Co-Simulation

and Design Space Exploration

Zebo Peng

Embedded Systems Laboratory (ESLAB), Linköping University

Outline

= Static analysis techniques

= Design space exploration


The Design Space

= Very large due to many solution parameters:

0 architectures and components

0 hardware/software partitioning

0 mapping and scheduling

0 operating systems and global control

0 communication synthesis

[Figure: an SoC spanning hardware and software, combining microprocessor, DSP, embedded memory, ASIC, analog circuit, sensor, high-speed electronics, and network. Sources: S3, Stratus Computers.]

Design Space Exploration

What is needed in order to explore the complex design space to find a good solution:

= Exploration at the higher levels of abstraction.

= Development of high-level analysis and estimation techniques.

= Employment of very fast exploration algorithms


The Optimization Problem

The majority of design space exploration tasks can be viewed as optimization problems:

To find

0 the architecture (type and number of processors, memory modules, and communication blocks, as well as their interconnections),

0 the mapping of functionality onto the architecture components, and

0 the schedules of basic functions and communications,

such that a cost function (in terms of implementation cost, performance, power, etc.) is minimized and a set of constraints is satisfied.
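The statement above can be written compactly. This is only a sketch in notation that does not appear in the slides: $a$ stands for the architecture, $m$ for the mapping, $s$ for the schedules, $C$ for the cost function, and $g_1, \dots, g_K$ for the constraints.

```latex
\min_{a,\,m,\,s} \; C(a, m, s)
\quad \text{subject to} \quad
g_j(a, m, s) \le 0, \qquad j = 1, \dots, K
```

A timing constraint, for example, could take the form $g_j = \text{finish time} - \text{deadline}$, so that $g_j \le 0$ means the deadline is met.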

The System Partitioning Problem

[Figure: a graph with weighted nodes and edges, divided by a two-way partitioning.]


Hardware/Software Partitioning

Input: Implementation independent system specification consisting of interacting processes (e.g., VHDL).

Output: Two sets of processes, assigned for hardware and software implementation respectively.

Target architecture:

- Microprocessors

- ASICs

- Shared memories

Assumptions:

= Microprocessor and ASIC working in parallel;

= Reducing the amount of communication between the microprocessor and hardware improves the overall performance.

Objectives:


= Quantitative values can be derived via simulation, profiling, or static analysis of the specification.

Ex.

0 computation load (CL): number of operations executed by a basic region or process of the specification.

0 communication intensity (CI): total number of communication operations on a channel between two processes.

= Performance improvement based on:

0 Placing computation intensive processes into hardware.

0 Increasing parallelism.

0 Reducing inter-domain communication.

Process Graph Formulation

= nodes correspond to processes, which could be processes or basic blocks in the original specification (e.g., VHDL).

= node weights reflect the degree of suitability for hardware implementation of the corresponding process:

the computation load of the process;


Process Graph Formulation

= The Graph Partitioning Problem:
To partition the process graph into two groups such that the sum of the weights of the cut edges will be minimal, subject to a set of constraints:

Ex.

$\sum_{i \in Hw} H\_cost_i \le Max^H$   (physical limitation of silicon area)

$W1^N_i \ge Lim1 \Rightarrow i \in Hw$   (implement a node in HW, when it is appropriate)
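The objective and the area constraint of the graph partitioning problem can be sketched in a few lines. This is an illustration only: the four-process graph, its weights, the hardware costs, and the function names are hypothetical and do not come from the slides.

```python
def cut_weight(edges, hw):
    """Total weight of edges that cross between the hardware and software groups."""
    return sum(w for (u, v), w in edges.items() if (u in hw) != (v in hw))


def fits_in_hardware(hw, hw_cost, max_hw_cost):
    """Area constraint: the summed hardware cost of the HW nodes stays within the limit."""
    return sum(hw_cost[node] for node in hw) <= max_hw_cost


# Hypothetical process graph: edge weights model communication intensity.
EDGES = {("p1", "p2"): 5, ("p2", "p3"): 2, ("p3", "p4"): 8, ("p1", "p4"): 3}
# Hypothetical hardware cost (e.g. silicon area) per process.
HW_COST = {"p1": 40, "p2": 25, "p3": 30, "p4": 35}

hw_group = {"p3", "p4"}
print(cut_weight(EDGES, hw_group))               # 5: edges p2-p3 and p1-p4 are cut
print(fits_in_hardware(hw_group, HW_COST, 70))   # True: 30 + 35 <= 70
```

Keeping the heavily communicating pair p3 and p4 in the same group removes the weight-8 edge from the cut, which is the effect the objective rewards.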

Features of CO Problems

= Most combinatorial optimization (CO) problems, e.g., system partitioning with constraints, for digital system designs are NP-complete.

= With all known algorithms, the time needed to solve an NP-complete problem grows exponentially with respect to the problem size n in the worst case.
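To get a feel for the growth, one can count the candidate solutions that an exhaustive search of two-way partitioning would have to visit. This is a small sketch with a hypothetical function name; it counts the search space and says nothing about the running time of any particular algorithm.

```python
def two_way_partitions(n):
    """Number of ways to split n labelled nodes into two non-empty groups."""
    if n < 2:
        return 0
    # Each node picks one of two groups (2**n); swapping the group names gives
    # the same partition (divide by 2); the split with one empty group is excluded.
    return 2 ** (n - 1) - 1


for n in (4, 10, 20, 50):
    print(n, two_way_partitions(n))
```

Each additional node roughly doubles the count, which is why exhaustive enumeration stops being practical already for moderately sized graphs.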


Features of CO Problems

= Many CO problems can be formulated as an Integer Linear Programming (ILP) problem, and solved by an ILP solver.

= It is inherently more difficult to solve an ILP problem than the corresponding Linear Programming problem.

= The size of problem that can be solved successfully by ILP algorithms is an order of magnitude smaller than the size of LP problems that can be easily solved.

Heuristics

= A heuristic seeks near-optimal solutions at a reasonable computational cost without being able to guarantee either optimality or feasibility.

= Motivations:

0 Many exact algorithms involve a huge amount of computation effort.

0 The decision variables have frequently complicated


Heuristic Approaches to CO

                              Problem specific          Generic methods

Constructive                  Clustering                Branch and bound
                              List scheduling           Divide and conquer
                              Left-edge algorithm

Transformational              Kernighan-Lin algorithm   Neighborhood search
(iterative improvement)                                 Simulated annealing
                                                        Tabu search
                                                        Genetic algorithms
                                                        (meta-heuristics)

Clustering for System Partitioning

= Each node initially belongs to its own cluster, and clusters are then gradually merged until the desired partitioning is found.

= The merge operation is selected based on local information (closeness metrics), rather than global view of the whole system.


The Kernighan-Lin Algorithm (KL)

= A graph is partitioned into two clusters of arbitrary size, by minimizing a given objective function.

= KL is based on an iterative partitioning strategy:

0 The algorithm starts with two arbitrary clusters C1 and C2.

0 The partitioning is then iteratively improved by moving nodes between the clusters.

0 At each iteration, the node which produces the minimal value of the cost function is moved; this value can, however, be greater than the value before moving the node.
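One pass of the strategy described above can be sketched as follows. This follows the slide's single-node-move description, not the original Kernighan-Lin formulation, which swaps pairs of nodes and keeps the cluster sizes fixed. The example graph, the rule that each node is moved at most once per pass, and the rule that no cluster may become empty are assumptions made for this sketch.

```python
def cut_weight(edges, c1):
    """Sum of the weights of edges with exactly one endpoint in cluster c1."""
    return sum(w for (u, v), w in edges.items() if (u in c1) != (v in c1))


def kl_pass(nodes, edges, c1):
    """Move each node once, always taking the move with the lowest resulting cost
    (even if that is an uphill move), and return the best partition seen."""
    current = set(c1)
    best, best_cost = set(current), cut_weight(edges, current)
    unlocked = set(nodes)
    while unlocked:
        # Assumption: a move must leave both clusters non-empty.
        candidates = [n for n in sorted(unlocked)
                      if 0 < len(current ^ {n}) < len(nodes)]
        if not candidates:
            break
        node = min(candidates, key=lambda n: cut_weight(edges, current ^ {n}))
        current = current ^ {node}      # move the node to the other cluster
        unlocked.remove(node)           # lock it for the rest of the pass
        cost = cut_weight(edges, current)
        if cost < best_cost:
            best, best_cost = set(current), cost
    return best, best_cost


# Hypothetical graph: two tightly coupled pairs joined by two light edges.
NODES = ["a", "b", "c", "d"]
EDGES = {("a", "b"): 10, ("c", "d"): 10, ("a", "c"): 1, ("b", "d"): 1}

best_c1, best_cost = kl_pass(NODES, EDGES, {"a", "c"})
print(sorted(best_c1), best_cost)
```

Because the pass records the best partition it has seen rather than the last one, accepting uphill moves cannot make the returned result worse than the starting point.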

Branch-and-Bound

= Traverse an implicit tree to find the best leaf (solution).

4-City TSP

Distance matrix (symmetric, upper triangle shown):

      0    1    2    3
0     0    3    6   41
1          0   40    5
2               0    4
3                    0

Branch-and-Bound Ex

{0}  L ≥ 0
  {0,1}  L ≥ 3
    {0,1,2}  L ≥ 43
      {0,1,2,3}  L = 88
    {0,1,3}  L ≥ 8
      {0,1,3,2}  L = 18
  {0,2}  L ≥ 6
    {0,2,1}  L ≥ 46
      {0,2,1,3}  L = 92
    {0,2,3}  L ≥ 10
      {0,2,3,1}  L = 18
  {0,3}  L ≥ 41
    {0,3,1}  L ≥ 46
      {0,3,1,2}  L = 92
    {0,3,2}  L ≥ 45
      {0,3,2,1}  L = 88

= Lower bound on the cost function.

= Search strategy
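The 4-city example can be reproduced with a short depth-first branch-and-bound. The distance values below are read from the slide's figure (they are consistent with every tour length L shown in the tree); the function name, the depth-first search strategy, and the use of the partial path length as the lower bound are choices made for this sketch.

```python
import math

# Distances of the 4-city example, as read from the slide's figure.
DIST = [
    [0, 3, 6, 41],
    [3, 0, 40, 5],
    [6, 40, 0, 4],
    [41, 5, 4, 0],
]


def branch_and_bound_tsp(dist, start=0):
    """Depth-first search over partial tours, pruning with a simple lower bound."""
    n = len(dist)
    best = {"length": math.inf, "tour": None}

    def expand(path, length):
        # Bound: distances are non-negative, so the length of the partial path
        # is a lower bound on the length of every tour that extends it.
        if length >= best["length"]:
            return
        if len(path) == n:
            total = length + dist[path[-1]][start]   # close the tour
            if total < best["length"]:
                best["length"], best["tour"] = total, list(path)
            return
        for city in range(n):
            if city not in path:
                expand(path + [city], length + dist[path[-1]][city])

    expand([start], 0)
    return best["tour"], best["length"]


tour, length = branch_and_bound_tsp(DIST)
print(tour, length)   # [0, 1, 3, 2] 18
```

Once the tour of length 18 is known, the subtrees under {0,2,1} (bound 46) and {0,3} (bound 41) are cut without being expanded. A tighter lower bound or a different search strategy would change how much of the tree is pruned, which is what the two closing bullets refer to.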

Neighborhood Search Method

= Step 1 (Initialization)
(A) Select a starting solution x_now ∈ X.
(B) x_best = x_now, best_cost = c(x_best).

= Step 2 (Choice and termination)
Choose a solution x_next ∈ N(x_now).
If no solution can be selected or the terminating criteria apply, then the method stops.


Neighborhood Search Method

= The neighborhood search method is very attractive for many CO problems as they have a natural neighborhood structure, which can be easily defined and evaluated.

0 Ex. Graph partitioning: swapping two nodes.

[Figure: the weighted example graph shown twice, before and after swapping two nodes between the partitions.]

The Descent Method

= Step 1 (Initialization)

= Step 2 (Choice and termination)

Choose x_next ∈ N(x_now) such that c(x_next) < c(x_now), and terminate if no such x_next can be found.

= Step 3 (Update)

The descent process can easily get stuck at a local optimum.
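A minimal sketch of the descent method, applied to two-way graph partitioning with the "swap two nodes" neighborhood mentioned earlier. The example graph and all names are hypothetical, and taking the first improving neighbor (rather than the best one) is a choice made for this sketch.

```python
from itertools import product


def cut_weight(edges, c1):
    """Sum of the weights of edges with exactly one endpoint in cluster c1."""
    return sum(w for (u, v), w in edges.items() if (u in c1) != (v in c1))


def swap_neighbors(c1, c2):
    """N(x_now): every partition reachable by swapping one node pair."""
    for a, b in product(sorted(c1), sorted(c2)):
        yield (c1 - {a}) | {b}, (c2 - {b}) | {a}


def descent(edges, c1, c2):
    """Accept only strictly improving neighbors; stop when none exists."""
    c1, c2 = set(c1), set(c2)
    cost = cut_weight(edges, c1)
    improved = True
    while improved:
        improved = False
        for n1, n2 in swap_neighbors(c1, c2):
            new_cost = cut_weight(edges, n1)
            if new_cost < cost:               # c(x_next) < c(x_now)
                c1, c2, cost = n1, n2, new_cost
                improved = True
                break                         # restart from the new solution
    return c1, c2, cost


# Hypothetical graph: two tightly coupled pairs joined by two light edges.
EDGES = {("a", "b"): 10, ("c", "d"): 10, ("a", "c"): 1, ("b", "d"): 1}

c1, c2, cost = descent(EDGES, {"a", "c"}, {"b", "d"})
print(sorted(c1), sorted(c2), cost)
```

On this small graph the descent reaches the best balanced partition, but nothing in the method guarantees that in general: it stops at the first solution without an improving neighbor.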


Dealing with Local Optimality

= Enlarge the neighborhood.

= Start with different initial solutions.

= To allow uphill moves:

0 Simulated annealing

0 Tabu search

[Figure: cost plotted over the solution space X.]

The SA Algorithm

Select an initial solution x_now ∈ X;
Select an initial temperature t > 0;
Select a temperature reduction function α;
Repeat
  Repeat
    Randomly select x_next ∈ N(x_now);
    δ = cost(x_next) - cost(x_now);
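The pseudocode above breaks off after the cost difference is computed. The sketch below completes it with the usual textbook choices, which are assumptions here and not taken from the slides: a downhill move is always accepted, an uphill move is accepted with probability exp(-δ/t), the temperature is reduced geometrically (t := α·t), and the search stops below a minimum temperature. The graph, parameter values, and names are hypothetical.

```python
import math
import random


def cut_weight(edges, c1):
    """Sum of the weights of edges with exactly one endpoint in cluster c1."""
    return sum(w for (u, v), w in edges.items() if (u in c1) != (v in c1))


def random_swap(c1, c2, rng):
    """Pick a random neighbor: swap one node of c1 with one node of c2."""
    a, b = rng.choice(sorted(c1)), rng.choice(sorted(c2))
    return (c1 - {a}) | {b}, (c2 - {b}) | {a}


def simulated_annealing(edges, c1, c2, t=10.0, alpha=0.9, t_min=0.1,
                        inner=10, seed=0):
    rng = random.Random(seed)
    now = (set(c1), set(c2))
    now_cost = cut_weight(edges, now[0])
    best, best_cost = now, now_cost
    while t > t_min:                          # outer Repeat (assumed stop rule)
        for _ in range(inner):                # inner Repeat at fixed temperature
            nxt = random_swap(now[0], now[1], rng)
            delta = cut_weight(edges, nxt[0]) - now_cost
            # Assumed acceptance rule: always downhill, sometimes uphill.
            if delta <= 0 or rng.random() < math.exp(-delta / t):
                now, now_cost = nxt, now_cost + delta
                if now_cost < best_cost:
                    best, best_cost = now, now_cost
        t *= alpha                            # temperature reduction function
    return best, best_cost


# Hypothetical graph: two tightly coupled pairs joined by two light edges.
EDGES = {("a", "b"): 10, ("c", "d"): 10, ("a", "c"): 1, ("b", "d"): 1}

(best_c1, best_c2), best_cost = simulated_annealing(EDGES, {"a", "c"}, {"b", "d"})
print(sorted(best_c1), sorted(best_c2), best_cost)
```

At high temperature almost every uphill move is accepted, so the search can leave a local optimum; as t falls the behaviour approaches that of the descent method.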


A HW/SW Partitioning Example

[Figure: cost function value (axis from 35000 to 75000) versus number of iterations (0 to 1400); optimum at iteration 1006.]

Analysis Techniques

= Analysis and simulation techniques are essential for hardware/software codesign:

0 To guide the design space exploration.

0 To provide feedback to the human designers.

0 To support design validation.

S l ti f l i / i l ti t h i i


Performance Metrics

= Extreme case performance

0 Worst-case execution time

0 Best-case execution time

= Average case performance

= Probabilistic performance

0 Used in soft real-time applications

0 To accurately handle the variable execution time of tasks, which may be due to

Application characteristics (e.g., data dependent loops);

Architectural factors (e.g., cache misses);

External factors (e.g., network load); or

Insufficient knowledge.

0 To guarantee a high probability of meeting timing constraints.

Simulation-based Techniques

= Software: running the compiled program on the simulated target architecture.

= Hardware: building a simulation model of the hardware and executing it to collect information.

A very large number of inputs should be used


Program Path Analysis

= To determine what sequence of instructions will be executed in the worst case scenario.

A basic block is composed of instructions in a straight line

= Let us first assume that each instruction takes a fixed time to execute

Program Path Analysis

= Infeasible paths can be eliminated by data flow analysis and path information provided by the programmer.

= The number of feasible paths is typically exponential in the program size.

Efficient methods are needed to avoid


ILP Formulation

Let
x_i be the number of times a basic block B_i is executed;
c_i be the execution time of the basic block B_i, which is assumed to be a constant.

The total execution time of the program for a particular execution is:

$$\sum_{i=1}^{N} c_i x_i$$

[Figure: control flow graph with basic blocks of execution times C1 to C7, annotated with execution counts; the execution shown takes C1 + C2 + C4 + 11 C5 + 10 C6 + C7.]

ILP Formulation (Cont'd)

The estimated WCET of the program is:

$$\max \sum_{i=1}^{N} c_i x_i$$

subject to a set of constraints $Ax \le b$.


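An Example

/* k >= 0 */
s = k;
while (k < 10) {
    if (ok)
        j++;
    else {
        j = 0;
        ok = true;
    }
    k++;
}
r = j;

[Figure: control flow graph of the program. Block B1 (s = k;) has execution count x1, the while test has count x2, and the edges carry counts d1, d2, ...]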


Constraints II

= Functionality constraints:

Loop bound information

$0 \cdot x_1 \le x_3 \le 10 \cdot x_1$

Path information

$x_5 \le 1 \cdot x_1$

/* k >= 0 */
s = k;                /* x1 */
while (k < 10) {      /* x2 */
    if (ok)           /* x3 */
        j++;          /* x4 */
    else {
        j = 0;        /* x5 */
        ok = true;
    }
    k++;              /* x6 */
}
r = j;                /* x7 */
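For a program this small the maximisation can be checked by plain enumeration instead of an ILP solver. The sketch below makes several assumptions that are not in the slides: the block execution times are invented, the program runs once (x1 = x7 = 1), and the structural constraints are taken to be x2 = x3 + 1 (the loop test runs once more than the body), x3 = x4 + x5 (every iteration takes one branch), and x6 = x3.

```python
# Hypothetical execution times c_i (in cycles) for the basic blocks B1..B7.
COSTS = {1: 2, 2: 3, 3: 2, 4: 1, 5: 4, 6: 1, 7: 1}


def block_counts(x3, x5):
    """Execution counts x_1..x_7 for one run, from the assumed structural constraints."""
    return {1: 1, 2: x3 + 1, 3: x3, 4: x3 - x5, 5: x5, 6: x3, 7: 1}


def estimate_wcet(costs):
    """Maximise sum(c_i * x_i) over all counts allowed by the constraints."""
    best_time, best_counts = -1, None
    for x3 in range(0, 11):                     # loop bound: 0*x1 <= x3 <= 10*x1
        for x5 in range(0, min(1, x3) + 1):     # path info: x5 <= 1*x1 (and x5 <= x3)
            counts = block_counts(x3, x5)
            time = sum(costs[i] * counts[i] for i in counts)
            if time > best_time:
                best_time, best_counts = time, counts
    return best_time, best_counts


wcet, counts = estimate_wcet(COSTS)
print(wcet, counts)   # 79 cycles with x3 = 10 and x5 = 1
```

With these costs the worst case takes the maximum of 10 iterations and goes through the more expensive else branch once. An ILP solver reaches the same answer without enumerating, which matters once the number of blocks and constraints grows.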

Remarks on Performance Analysis

= One of the main issues of hardware/software codesign is estimation and analysis.

= Analysis of average and probabilistic performance can be done by simulation.

= Worst case execution time analysis can only be

ffi i l d b i l i h i


Simulation

= Applied usually directly to the design descriptions, e.g. VHDL.

= Can be used at different levels of abstraction:

0 System

0 Algorithmic

0 Register-transfer

0 Logic

0 Gate

0 Switch and circuit

Co-Simulation

= How are the hardware and software components simulated at the same time?

Problems:

= Different simulation platforms are used;

= Software runs fast while hardware simulation is


Approaches to Co-Simulation 1

= Gate-level model of the processor

0 Gate level simulation of the processor is very slow (tens of clock cycles/sec).

Ex. At 10 cycles/sec, simulating a 1 GHz processor takes 100 million seconds (about 3.2 years) for one second of real time.

0 This provides a very accurate solution and is very simple from the co-simulation point of view.
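The figure in the example follows from a one-line calculation, sketched here with hypothetical names:

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def simulation_seconds(clock_hz, simulated_cycles_per_sec, real_seconds=1.0):
    """Wall-clock time needed to simulate `real_seconds` of processor execution."""
    cycles_to_simulate = clock_hz * real_seconds
    return cycles_to_simulate / simulated_cycles_per_sec


seconds = simulation_seconds(clock_hz=1e9, simulated_cycles_per_sec=10)
years = seconds / SECONDS_PER_YEAR
print(f"{seconds:.0f} s, about {years:.1f} years")
```

A 1 GHz processor executes 10^9 cycles per second of real time, and at 10 simulated cycles per second that takes 10^8 seconds, a little over three years.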

[Figure: co-simulation framework. The SW runs on a gate-level model of the processor (VHDL) under VHDL simulation; the ASIC model (VHDL) also runs under VHDL simulation.]

Approaches to Co-Simulation 2

= Instruction-set architecture models

[Figure: the SW runs on an ISA model (C program), which is a program running on the host; the ASIC model (VHDL) runs under VHDL simulation.]

Hardware/Software Codesign Arch & Platf - 1



Petru Eles, IDA, LiTH

Architectures and Platforms

1. Architecture Selection: The Basic Trade-Offs

2. General Purpose vs. Application-Specific Processors

3. Processor Specialisation

4. ASIP Design Flow

5. Specialisation of a VLIW ASIP

6. Tool Support for Processor Specialisation

7. Application Specific Platforms

8. IP-Based Design (Design Reuse)

9. Reconfigurable Systems

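Remember the Design Flow

[Figure: the system-level design flow. Modeling produces the system model (checked by functional simulation and formal verification). Arch. selection gives the system architecture. Mapping, estimation, and scheduling produce the mapped and scheduled model (checked by simulation and formal verification); "not OK" results loop back. When OK, the softw. model and hardw. model go on to softw. generation and hardw. synthesis, followed by simulation.]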


Architecture Selection and Mapping

Select the underlying hardware structure on which to run the modelled system.

Map the functionality captured by the system over the components of the selected architecture.

Functionality includes processing and communication.

Architecture Selection

General purpose vs. application specific:

- Build a customised architecture strictly optimised for the particular application.

- Use a general purpose, existing platform and map the application on it.

- or something in-between

Software vs. hardware:

- Use programmable processors running software.

- Use dedicated electronics (fixed or reconfigurable).

- or both


Architecture Selection (cont'd)

The trade-offs:

Performance (high speed, low power consumption)

Flexibility (how easy it is to upgrade or modify)

[Figure: two scales from low to high, one for performance and one for flexibility, each positioning application specific, general purpose, hardware, software, and reconfigurable hardware solutions.]


Architecture Selection (cont'd)

[Figure: energy consumed (low, med., high) versus flexibility (low, med., high) for ASIC, FPGA, ASIP, and GP proc.; the gaps are marked as an order of magnitude.]


General Purpose vs. Application Specific Processors

Both GP processors and ASIPs (application specific instruction set processors) can be RISCs, CISCs, DSPs, microcontrollers, etc.

- One could look at DSPs and microcontrollers as being specific for DSP and simple control applications respectively.

- An application specific DSP or microcontroller is, however, more specialised than just for DSP or control applications.

GP processors

- Neither instruction set nor microarchitecture or memory system are customised for a particular application or family of applications.

ASIPs

- Instruction set, microarchitecture and/or memory system are customised for an application or family of applications.

- What results is better performance and reduced power consumption.


What Makes an ASIP Specific?

What can we specialize in a processor?

Instruction set (IS) specialisation

Exclude instructions which are not used

- reduces instruction word length (fewer bits needed for encoding);

- keeps controller and data path simple.

Introduce instructions, even exotic ones, which are specific to the application: combinations of arithmetic instructions (multiply-accumulate), small algorithms (encoding/decoding, filter), vector operations, string manipulation or string matching, pixel operations, etc.

- reduces code size -> reduced memory size, memory bandwidth, power consumption, execution time.



Function unit and data path specialisation

Once an application specific IS is defined, this IS can be implemented using a more or less specific data path and more or less specific function units.

Adaptation of word length.

Adaptation of register number.

Adaptation of functional units

- Highly specialised functional units can be introduced for string matching and manipulation, pixel operation, arithmetic, and even complex units to perform certain sequences of computations (co-processors).



Interconnect specialization

Interconnect of functional modules and registers.

Interconnect to memory and cache.

- How many internal buses?

- What kind of protocol?

- Additional connections increase the potential of parallelism.

Control specialisation

Centralised control or distributed (globally asynchronous)?

Pipelining?

Out of order execution?

Hardwired or microprogrammed?





ASIP Design Flow

(It can be seen as a part of the big design flow - slide 2)

[Figure: the algorithm(s) and the processor architecture feed the compiler; the simulator then produces performance numbers.]


A SOC for Multimedia Applications

[Figure: SoC with glue logic, A/D and D/A, Controller (ASIP), on-chip memory, DSP (GP), and VLIW processor (ASIP).]

The application specific Controller performs master control of the system and memory access control.

The off-the-shelf (GP) DSP performs less computation intensive modem and sound codec functions.

The VLIW ASIP performs computation intensive functions: discrete cosine and inverse discrete cosine transforms, motion estimation, etc.

This is a typical application specific platform. Its structure has been adapted for a family of applications.

Besides GP processor cores, the platform also consists of ASIP cores which themselves are specialised.


Specialization of a VLIW ASIP (cont'd)

This is what an instruction word looks like:

op1 op2 op3 op4 op5 op6 op7 op8 op9 op10 op11

Cluster 1 Cluster 2 Cluster 3



Traditionally the datapath is organised as a single register file shared by all functional units.

Problem: Such a centralised structure does not scale!

We increase the number of functional units in order to increase parallelism

We have to increase the number of registers in the register file

Internal storage and communication between functional units and registers becomes dominant in terms of area, delay, and power.

High performance VLIW processors are limited not by arithmetic capacity but by internal bandwidth.



A solution: clustering.

Restrict the connectivity between functional units and registers, so that each functional unit can read/write from/to a subset of registers.

Organise the datapath as clusters of functional units and local register files.

Nothing is for free!!!

Moving data between registers belonging to different clusters takes much time and power!

You have to drastically minimise the number of such moves by:

- Carefully adapting the structure of clusters to the application.

- Using very clever compilers.



Instruction set specialisation: nothing special.

Function unit and data path specialisation

- Determine the number of clusters.

- For each cluster determine

- the number and type of functional units;

- the dimension of the register file.

Memory specialisation is extremely important because we need to stream large amounts of data to the clusters at high rate; one has to adapt the memory structure to the access characteristics of the application.

- determine the number and size of memory banks



Interconnect specialization

- Determine the interconnect structure between clusters and from clusters to memory:

- one or several buses,

- crossbar interconnection

- etc.

Control specialisation:

That is more or less done, as we have decided on a VLIW processor.


Tool Support for Processor Specialisation

Look at the design flow on slide 12!

In order to be able to generate a specialised architecture you need:

Retargetable compiler

Configurable simulator




Retargetable Compiler

[Figure: the retargetable compiler takes the algorithm and the processor architecture and produces object code.]


Retargetable Compiler (cont'd)

An automatically retargetable compiler can be used for a range of different target architectures.

The actual code optimization and code generation is done by the compiler, based on a description of the target processor architecture. This description is formulated in a so-called architecture description language.

Having a good compiler is not only important for the processor specialisation process!

Once you have got your specialised ASIP you need a good compiler in order to efficiently make use of it!


Configurable Simulator

[Figure: the simulator takes the object code and the processor architecture and produces performance numbers.]

Such a simulator can be configured for a particular architecture (based on an architecture description).

In this context, the most important output produced by the simulator is performance numbers:

- throughput

- delay

- power/energy consumption


Application Specific Platforms

Not only processors but also hardware platforms can be specialised for classes of applications.

The platform will define a certain communication infrastructure (buses and protocols), certain processor cores, peripherals, accelerators commonly used in the particular application area, and basic memory structure.


Application Specific Platforms (cont'd)

[Figure: an example platform with processor cores 1 to 3, cache, DMA, memory, bridge, peripherals, and reconfigurable logic, connected by a system bus and a peripheral bus.]


Application Specific Platforms (cont'd)

Design space exploration for platform definition:

[Figure: the applications and the platform architecture go through mapping/compiling and the simulator, which produces performance numbers.]


Instantiating a Platform

Once we have an application, the chip to implement on will not be designed as a collection of independently developed blocks, but will be an instance of an application specific platform.

The hardware platform will be refined by

- determining memory and cache size

- identifying the particular cores, peripherals to be used

- adding specific ASICs, accelerators

- determining the amount of reconfigurable logic (if needed)


Instantiating a Platform (cont'd)

[Figure: the platform architecture is refined into a platform instance; the application and the platform instance go through mapping/compiling and the simulator, which produces performance numbers.]


IP-Based Design (Design Reuse)

The key concept in order to increase designers' productivity is reuse.

In order to manage the complexity of current large designs we do not start from scratch but reuse as much as possible from previous designs, or use commercially available pre-designed IP blocks.

IP: intellectual property.

Some people call this IP-based design, core-based design, reuse techniques, etc.:

Core-based design is the process of composing a new system design by reusing existing components.

IP-Based Design (cont'd)

What are the blocks (cores) we reuse?

interfaces, encoders/decoders, filters, memories, timers,microcontroller-cores, DSP-cores, RISC-cores, GP processor-cores.

Possible(!) definition

A coreis a design block which is larger than a typical RTLcomponent.

Of course:We also reuse software components!


IP-Based Design (contd)

[Figure: cores 1 to 3 and a processor core (core 4), taken from the libraries of vendors A, B, and C, are attached through glue logic to an interconnection bus/switch, together with an I/O interface.]

What we have designed here can be:

- An application-specific SoC.

- A platform to be further instantiated for a particular application.


Types of Cores


Hard cores: fully designed, placed, and routed by the supplier. A completely validated layout with definite timing.

Advantage: rapid integration. Drawback: low flexibility.

Firm cores: technology-mapped gate-level netlists.

Advantage: flexibility during place and route. Drawback: less predictability.


Types of Cores (contd)


Soft cores: synthesizable RTL or behavioral descriptions.

Advantage: maximal flexibility. Flexibility can provide opportunities such as adding application-specific instructions to a processor core by modifying the behavioral description.

Drawback: much work with integration and verification.


Reconfigurable Systems


Programmable hardware circuits:

They implement arbitrary combinational or sequential circuits and can be configured by loading a local memory that determines the interconnection among logic blocks.

Reconfiguration can be applied an unlimited number of times.

Main applications:

- Software acceleration

- Prototyping


Reconfigurable Systems (contd)


Dynamic reconfiguration: spatial and temporal partitioning

[Figure: a processor and memory coupled with an FPGA accelerator; the blocks att1 to att4 are temporally partitioned on the FPGA.]


Reconfigurable Systems (contd)


System on Chip with dynamically reconfigurable datapath

[Figure: a chip with CPU, on-chip memory, and reconfigurable datapath, and the associated design flow: C code, profiling and kernel extraction, HW/SW partitioning into kernels and C code, datapath synthesis.]


Summary


Architecture selection is about making trade-offs along the dimensions of speed, cost, flexibility, and power consumption.

ASIPs are programmable processors, specialised for a particular application or for a family of applications.

Specialisation of an ASIP concerns instruction set, function units and data path, memory system, interconnect, and control.

Two design tools are of great importance in order to perform processor specialisation: the retargetable compiler and the configurable simulator.

Not only processors can be specialised but also platforms. A platform is specialised to execute a certain family of applications. The particular hardware to be used for a given application is a specialised instantiation of the platform.


Summary (contd)


Reuse is a key technique for achieving high design productivity. Cores to be reused range from interfaces and decoders to filters and processors.

The three types of cores (hard, firm, and soft) differ in their flexibility, predictability, and the effort needed for integration.

Reconfigurable systems can provide good flexibility and, at the same time, many of the advantages of classical hardware implementation. They are mainly used for software acceleration and prototyping.

Hardware/Software Codesign Low Power/Energy - 1

System-Level Power/Energy Optimization




1. Sources of Power Dissipation

2. Reducing Power Consumption

3. System Level Power Optimization

4. Dynamic Power Management

5. Mapping and Scheduling for Low Energy

6. Real-Time Scheduling with Dynamic Voltage Scaling


Remember the Design Flow


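[Figure: the design flow. Modeling produces the system model (functional simulation, formal verification); architecture selection yields the system architecture; mapping, scheduling, and estimation produce the mapped and scheduled model (simulation, formal verification). "Not OK" results loop back; "OK" leads to the software and hardware models, followed by software generation and hardware synthesis, each with simulation.]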


Why is Power Consumption an Issue?


Portable systems: battery life time!

Systems with a very limited power budget: Mars Pathfinder, autonomous helicopter, ...

Desktops and servers: high power consumption

- raises temperature and deteriorates performance and reliability

- increases the need for expensive cooling mechanisms

One of the main difficulties with developing high-performance chips is heat extraction.

High power consumption has economical and ecological consequences.


Sources of Power Dissipation in CMOS Devices


$$P = \underbrace{\tfrac{1}{2}\, C\, V_{DD}^2\, f\, N_{SW} + Q_{SC}\, V_{DD}\, f\, N_{SW}}_{\text{dynamic}} + \underbrace{I_{leak}\, V_{DD}}_{\text{static}}$$

Switching power (first term): power required to charge/discharge circuit nodes.

Short-circuit power (second term): dissipation due to short-circuit current.

Leakage power (third term): dissipation due to leakage current.

C = node capacitances

NSW = switching activities (number of gate transitions per clock cycle)

f = frequency of operation

VDD = supply voltage

QSC = charge carried by short-circuit current per transition

Ileak = leakage current


Sources of Power Dissipation in CMOS Devices (contd)


[Figure: CMOS transistor (N-type), shown as a cross-section and as a circuit symbol, with gate, drain, source, and body terminals and the body bias voltage Vbs.]

Vbs = body bias voltage

Vth = threshold voltage

Threshold voltage:

- The minimal voltage required at the gate to turn on the transistor.




[Figure: CMOS inverter supplied by Vdd and driving the output load capacitance CL, next to the N-type transistor with gate, drain, source, and body terminals.]

Vbs = body bias voltage; Vth = threshold voltage; Vdd = supply voltage; CL = output load capacitance.

Dynamic power

- Charging and discharging the output load capacitance

- Momentary short circuits at a gate's output




[Figure: CMOS inverter supplied by Vdd and driving the output load capacitance CL, next to the N-type transistor with gate, drain, source, and body terminals.]

Vbs = body bias voltage; Vth = threshold voltage; Vdd = supply voltage; CL = output load capacitance.

Static power

- Subthreshold leakage conduction: it flows even when the voltage at the gate is below Vth.

- Junction leakage (drain and source to body)




For a long time:

Leakage power was considered negligible compared to dynamic power.

Today:

Total dissipation from leakage is approaching the total from dynamic power.

As technology drops below 65 nm: leakage power is exceeding dynamic power.




Leakage power is consumed even if the circuit is idle (standby). The only way to avoid it is to decouple the circuit from the power supply.

Short-circuit power can be around 10% of the total.

Switching power is still the main source of power consumption.

For the rest of the discussion, we consider mainly switching power. At the end we come back to leakage.


Power and Energy Consumption


$$P = \tfrac{1}{2}\, C\, V_{DD}^2\, f\, N_{SW}$$

$$E = P \cdot t = \tfrac{1}{2}\, C\, V_{DD}^2\, N_{CY}\, N_{SW}$$

NCY = number of cycles needed for the particular task (the execution time is t = NCY / f).

In certain situations we are concerned about power consumption:

- heat dissipation, cooling;

- physical deterioration due to temperature.

Sometimes we want to reduce the total energy consumed:

- battery life.
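
The difference between the two formulas can be checked numerically. The following is a minimal sketch; all parameter values are hypothetical and chosen only for illustration. It shows that doubling the clock frequency doubles the power, while the energy for a fixed number of cycles stays the same.

```python
def switching_power(c, vdd, f, n_sw):
    """Switching power P = 1/2 * C * VDD^2 * f * N_SW, in watts."""
    return 0.5 * c * vdd**2 * f * n_sw


def switching_energy(c, vdd, n_cy, n_sw):
    """Switching energy E = 1/2 * C * VDD^2 * N_CY * N_SW, in joules."""
    return 0.5 * c * vdd**2 * n_cy * n_sw


# Hypothetical values, not taken from the lecture.
C = 1e-9          # node capacitance in farads
VDD = 2.0         # supply voltage in volts
N_SW = 0.5        # gate transitions per clock cycle
N_CY = 1_000_000  # cycles needed by the task

for f in (50e6, 100e6):
    p = switching_power(C, VDD, f, N_SW)
    t = N_CY / f  # execution time of the task at frequency f
    print(f"f = {f / 1e6:.0f} MHz: P = {p:.3f} W, t = {t:.3f} s, P*t = {p * t:.6f} J")

print(f"E = {switching_energy(C, VDD, N_CY, N_SW):.6f} J")
```

A faster clock therefore matters for heat and cooling, but by itself it does not change the battery energy spent on the task.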


Reducing Power/Energy Consumption


The main ways to reduce power/energy consumption:

- Reduce supply voltage

- Reduce switching activity

- Reduce capacitance

- Reduce number of cycles


Reducing Power/Energy Consumption (contd)

Circuit level

- Ordering of transistors in a gate (influences capacitance).

- Transistor sizing.

Logic level

- Don't-care optimization to reduce switching activity.

- Reduce spurious switching activity by balancing the delays of paths that converge at each gate.

- Technology mapping.

- State encoding such that switching activity is minimised: if state s has a large number of transitions to state q, they should be given uni-distant codes (codes that differ in a single bit).

- Encoding to minimise switching activity in arithmetic units or on the bus.

- Gated clocks: gate the clocks of circuits (registers, gates, arithmetic units) when they are in idle time periods.
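
The state encoding rule can be illustrated with a small sketch. The four-state machine, its transition counts, and both encodings are hypothetical; the cost function simply weights the Hamming distance between two state codes by how often that transition is taken.

```python
def hamming(a, b):
    """Number of bit positions in which the codes a and b differ."""
    return bin(a ^ b).count("1")


def switching_cost(encoding, transitions):
    """Total state-register bit flips: transition count times Hamming distance."""
    return sum(count * hamming(encoding[s], encoding[q])
               for (s, q), count in transitions.items())


# Hypothetical transition counts of a four-state machine.
transitions = {("A", "B"): 90, ("B", "C"): 5, ("C", "D"): 80, ("D", "A"): 5}

# Frequent transitions A->B and C->D flip two bits each.
naive = {"A": 0b00, "B": 0b11, "C": 0b01, "D": 0b10}
# Every transition flips exactly one bit (uni-distant codes).
better = {"A": 0b00, "B": 0b01, "C": 0b11, "D": 0b10}

print(switching_cost(naive, transitions))   # 350
print(switching_cost(better, transitions))  # 180
```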




Behavioral level

- Schedule and map operations so that the number of cycles is minimised (with an increased number of switchings per clock cycle). You can then run at a slower clock rate, and therefore reduce the supply voltage.

- Allocate and share modules so that power consumption is reduced (for example, by reducing switching activity).



Architecture level

Specialise instruction set, datapath, and register structure to the particular architecture, with power consumption as an optimization goal.

- You have on the chip, and you switch, only those resources (gates) you really need.

Reduce power consumption on the bus.

- Lower switching activity: clever encoding; reduce switching activity on the address bus by exploiting correlations.

- Minimise the bus length (capacitance) by optimal module placement.

- Bus segmentation: transform a long, heavily loaded global bus into a partitioned set of local bus segments.
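
As one concrete instance of exploiting address correlations, consecutive addresses can be sent in Gray code, so that each increment flips a single bus line. Gray coding is used here as a well-known example; the lecture does not say which encoding it has in mind. A minimal sketch:

```python
def to_gray(n):
    """Binary-reflected Gray code of n."""
    return n ^ (n >> 1)


def bus_transitions(words):
    """Total number of bus lines that toggle over a sequence of words."""
    return sum(bin(a ^ b).count("1") for a, b in zip(words, words[1:]))


# Sequential access to 16 consecutive addresses (a strongly correlated stream).
addresses = list(range(16))

binary_flips = bus_transitions(addresses)
gray_flips = bus_transitions([to_gray(a) for a in addresses])

print(binary_flips)  # 26
print(gray_flips)    # 15
```

The gain only holds for mostly sequential streams; for random addresses Gray code offers no advantage.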



Optimise the memory structure.

- Memory transfers are extremely power hungry: a memory transfer takes 33 times more energy than an addition! Reducing the number of memory accesses is a very efficient way to save power!

- Adapt the number of caches, their size and associativity, and the length of the cache line to the application, in order to reduce the number of memory transfers.

- Interesting trade-off: larger caches consume more power but reduce the number of memory transfers. Find the right balance!




Provide instruction support for power management:

- Instructions that allow certain parts of the system to be put in stand-by or shut down.

- Instructions that allow the supply voltage to be set dynamically (dynamic voltage scaling).




System level

Static techniques are applied at design time.

- Compilation for low power: instruction selection considering their power profile, data placement in memory, register allocation.

- Algorithm design: find the algorithm which is the most power-efficient.

- Task mapping and scheduling.

Dynamic techniques are applied at run time.

- They reduce power consumption by exploiting idle or low-workload periods.


System Level Power Optimization




Three techniques will be discussed:

1. Dynamic power management: a dynamic technique.

2. Task mapping: a static technique.

3. Task scheduling with dynamic power scaling: static & dynamic.


Dynamic Power Management (DPM)

[Figure: layered system with application, power-aware OS, and hardware.]

Decisions:

- Switching among multiple power states: idle, sleep, run.

- Switching among multiple frequencies and voltage levels.

Goal:

- Energy optimization

- QoS constraints satisfied


Dynamic Power Management (contd)

Hardware Support (e.g. Intel XScale Processor)

RUN operating points:

Supply voltage   Frequency   Power
0.75 V           150 MHz     60 mW
1.3 V            600 MHz     450 mW
1.6 V            800 MHz     900 mW

IDLE: 40 mW. SLEEP: 160 µW.

[Figure: state diagram of the RUN, IDLE, and SLEEP states; the transitions are labelled with latencies of 10 µs, 10 µs, 90 µs, 160 µs, 1.5 ms, and 140 ms.]

RUN: operational.

IDLE: clocks to the CPU are disabled; recovery is through interrupt.

SLEEP: mainly powered off; recovery is through a wake-up event.

Other intermediate states: DEEP IDLE, STANDBY, DEEP SLEEP.


Dynamic Power Management (contd)


DPM techniques are used in laptops, personal digital assistants (PDAs), and other portable appliances in order to shut down or place in stand-by unused devices. The goal is power saving.

DPM techniques are implemented in the operating system (including Windows 2000 running on laptops).

The power breakdown for a laptop computer:

- 36% of total power consumed by the display

- 18% by hard disk

- 18% by wireless LAN interface

- 7% by keyboard, mouse, etc.

- 21% by digital VLSI circuits.

Don't forget these!


The Basic Concept of DPM

When there are requests for a device, the device is busy; otherwise it is idle.

When the device is idle, it can be shut down to enter a low-power sleeping state.

[Figure: timeline of workload (requests), device state (busy, idle, busy), and power state (working, sleeping, working), with time points T1 to T4, shut-down time Tsd, and wake-up time Twu.]


Power Management Policies

Power management policies are concerned with predictions related to idle periods:

- For shut-down: try to predict how long the idle period will be, in order to decide if a shut-down should be performed.

- For wake-up: try to predict when the idle period ends, in order to avoid user delays due to Twu. This is quite difficult, and often the wake-up is started simply when a request has arrived.

Typical Policies:

1. Time-out

2. Predictive

3. Stochastic
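
Why the predicted idle length matters can be seen from a simple energy comparison. The sketch below is an illustration under stated assumptions, not a policy from the lecture: the function name, the single transition power, and all numeric values are hypothetical.

```python
def shutdown_saves_energy(t_idle, p_idle, p_sleep, t_sd, t_wu, p_transition):
    """True if sleeping through an idle period costs less energy than idling.

    Times are in seconds, powers in watts. The device is assumed to draw
    p_transition during both shut-down (t_sd) and wake-up (t_wu).
    """
    if t_idle < t_sd + t_wu:
        return False  # not even enough time to go down and come back
    stay_idle = t_idle * p_idle
    sleep = (t_sd + t_wu) * p_transition + (t_idle - t_sd - t_wu) * p_sleep
    return sleep < stay_idle


# Hypothetical device parameters.
params = dict(p_idle=0.04, p_sleep=0.00016, t_sd=0.001, t_wu=0.1, p_transition=0.4)

print(shutdown_saves_energy(0.5, **params))  # False: transitions cost more than idling
print(shutdown_saves_energy(2.0, **params))  # True: the idle period is long enough
```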


Predictive Policy

The length of an idle period is predicted. If the prediction is for an idle period long enough, the shut-down is performed immediately (no time interval T1-T2 on slide 16).

Policy

- L-shaped distribution of the idle period as a function of the previous busy period.

[Figure: scatter plot with the busy period on the horizontal axis and the idle period on the vertical axis, showing an L shape.]

Short busy periods are followed by long idle periods. Busy periods longer than a threshold are followed by short idle periods.

Shut down after a short busy period!
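
The rule can be written down in a few lines. This is a minimal sketch; the trace, the busy-period threshold, and the idle length considered worth a shut-down are all hypothetical values.

```python
def predictive_shutdown(busy_period, busy_threshold):
    """Predict a long idle period, and shut down at once, after a short busy period."""
    return busy_period < busy_threshold


# Hypothetical trace of (busy period, following idle period), in milliseconds.
trace = [(2, 40), (15, 3), (1, 55), (20, 2), (3, 35)]
BUSY_THRESHOLD = 10    # assumed knee of the L-shaped distribution
MIN_USEFUL_IDLE = 20   # assumed idle length above which a shut-down pays off

decisions = [predictive_shutdown(busy, BUSY_THRESHOLD) for busy, _ in trace]
correct = sum(decision == (idle >= MIN_USEFUL_IDLE)
              for decision, (_, idle) in zip(decisions, trace))

print(decisions)                            # [True, False, True, False, True]
print(f"{correct}/{len(trace)} correct")    # 5/5 correct
```

On a real workload the distribution is only approximately L-shaped, so some predictions will be wrong.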


Stochastic Policy

Predictions are based on Markov models: requests and power state transitions of the device are modelled as probabilistic state machines.

The power manager observes the arriving requests, the request queue, and the device, and generates shut-down commands.

[Figure: the environment or user (Markov model: request generator) generates requests that enter the request queue; the device (Markov model: device) provides service; the power manager observes the request generator, the queue, and the device, and sends commands to the device.]


Mapping and Scheduling for Low Energy

For many embedded systems, DPM techniques like those presented before cannot be applied:

- They have no devices like a hard disk, and no (or only a small) display, so VLSI is a main source of power dissipation.

- They have time constraints: we have to keep deadlines (usually we cannot afford shut-down and wake-up times).

- The operating system is small: no sophisticated techniques at run-time.

- The application is known at design time: we know a lot about the application already at design time.

Static techniques can be used (applied at design time). Mapping and scheduling for low energy are important!


Mapping for Low Energy

[Figure: task graph with tasks 1 to 8, and an architecture with two processors, p3 and p4, connected by a bus.]

Task   WCET          Energy
       p3     p4     p3     p4
1      5      6      5      3
2      7      9      8      4
3      5      6      5      3
4      8      10     6      4
5      10     11     8      6
6      17     21     15     10
7      10     14     8      7
8      15     19     14     9


Mapping for Low Energy (contd)

Consider a mapping:

p3: 1, 3, 6, 7, 8.

p4: 2, 4, 5.

Communication times and energy:

C1-2: t = 1; E = 3.   C3-5: t = 2; E = 5.

C4-8: t = 1; E = 3.   C5-7: t = 1; E = 3.

Execution time: 52; Energy consumed: 75.

[Figure: Gantt chart over the time axis 0 to 64, showing tasks 1, 3, 4, 6, 7, 8 and 2, 5 on the rows p3 and p4, and the messages C1-2, C3-5, C4-8, C5-7 on the bus.]





Consider a second mapping, with task 8 moved to p4:

p3: 1, 3, 6, 7.

p4: 2, 4, 5, 8.

Communication times and energy:

C1-2: t = 1; E = 3.   C3-5: t = 2; E = 5.

C7-8: t = 1; E = 3.   C5-7: t = 1; E = 3.

Execution time: 57; Energy consumed: 70.

[Figure: Gantt chart over the time axis 0 to 64, showing the tasks on the rows p3 and p4, with task 8 now on p4, and the messages C1-2, C3-5, C5-7, C7-8 on the bus.]




The second mapping, with task 8 on p4, consumes less energy.

Assume that we have a maximum allowed delay of 60.

The second mapping is then preferable, even though it is slower!
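
The energy figures of the two mappings can be recomputed from the task table and the communication costs. A minimal sketch follows; the helper names are illustrative, and the execution times (52 and 57) are copied from the slides because the task graph needed to derive them is not reproduced here.

```python
# Energy per task as (on p3, on p4), from the table.
ENERGY = {
    1: (5, 3), 2: (8, 4), 3: (5, 3), 4: (6, 4),
    5: (8, 6), 6: (15, 10), 7: (8, 7), 8: (14, 9),
}
# Energy per bus message, from the two mapping slides.
COMM_ENERGY = {"C1-2": 3, "C3-5": 5, "C4-8": 3, "C5-7": 3, "C7-8": 3}


def mapping_energy(on_p3, on_p4, messages):
    """Total energy: tasks on p3, tasks on p4, and messages sent over the bus."""
    tasks = sum(ENERGY[t][0] for t in on_p3) + sum(ENERGY[t][1] for t in on_p4)
    return tasks + sum(COMM_ENERGY[m] for m in messages)


first = mapping_energy([1, 3, 6, 7, 8], [2, 4, 5], ["C1-2", "C3-5", "C4-8", "C5-7"])
second = mapping_energy([1, 3, 6, 7], [2, 4, 5, 8], ["C1-2", "C3-5", "C7-8", "C5-7"])
print(first, second)  # 75 70

# Pick the lowest-energy mapping among those that meet the maximum delay.
MAX_DELAY = 60
candidates = [("first", 52, first), ("second", 57, second)]  # (name, time, energy)
feasible = [c for c in candidates if c[1] <= MAX_DELAY]
best = min(feasible, key=lambda c: c[2])
print(best[0])  # second
```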


Real-Time Scheduling with Dynamic Voltage Scaling

The scheduling problem:

Which task to execute at a certain moment on a certain processor so that time constraints are fulfilled?

The scheduling problem with voltage scaling:

Which task to execute at a certain moment on a certain processor, and at which voltage level, so that time constraints are fulfilled and energy consumption is minimised?

The problem: reducing supply voltage extends execution time!


Variable Voltage Processors


Several supply voltage levels are available.

Supply voltage can be set by the application (operating system) through execution of particular instructions.

Frequency is automatically adjusted to the current supply voltage.

Several processors with variable voltage levels are already available. There will be more and more in the near future.


The Basic Principle

We consider a single task:

- total computation: 10^9 execution cycles.

- deadline: 25 seconds.

- processor nominal (maximum) voltage: 5 V.

- energy: 40 nJ/cycle at nominal voltage.

- processor speed: 50 MHz (50×10^6 cycles/sec) at nominal voltage.

[Figure: V^2 over time (0 to 25 sec). The 10^9 cycles run at 5 V and finish at 20 sec, leaving 5 sec of slack before the deadline.]

texe = 20 sec; Etotal = 40 J.


The Basic Principle (contd)

Let's make it slower!

VDD = 2.5 V:

- energy: 40 × 2.5^2/5^2 = 10 nJ/cycle.

- speed: 50 × 2.5/5 = 25 MHz.

[Figure: V^2 over time (0 to 25 sec). 750×10^6 cycles run at 5 V and 250×10^6 cycles at 2.5 V, finishing exactly at the deadline.]

texe = 25 sec; Etotal = 32.5 J.



VDD = 4 V:

- energy: 40 × 4^2/5^2 = 25.6 nJ/cycle, rounded to 25 nJ/cycle in the following.

- speed: 50 × 4/5 = 40 MHz.

[Figure: V^2 over time (0 to 25 sec). All 10^9 cycles run at 4 V, finishing exactly at the deadline.]

texe = 25 sec; Etotal = 25 J.

If a processor uses a single supply voltage and completes a program just on deadline, the energy consumption is minimised.
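
The three schedules of this example can be recomputed with a short sketch. It assumes, as the slides do, that energy per cycle scales with the square of the supply voltage and that frequency scales linearly with it; the function names are illustrative. Without the rounding used on the slide, the 4 V schedule costs 25.6 J rather than 25 J.

```python
NOMINAL_V = 5.0    # volts
NOMINAL_F = 50e6   # cycles per second at nominal voltage
NOMINAL_E = 40e-9  # joules per cycle at nominal voltage


def energy_per_cycle(v):
    """Energy per cycle, assumed proportional to VDD^2."""
    return NOMINAL_E * (v / NOMINAL_V) ** 2


def frequency(v):
    """Clock frequency, assumed proportional to VDD."""
    return NOMINAL_F * v / NOMINAL_V


def run(schedule):
    """schedule: list of (cycles, voltage). Returns (time in s, energy in J)."""
    time = sum(n / frequency(v) for n, v in schedule)
    energy = sum(n * energy_per_cycle(v) for n, v in schedule)
    return time, energy


print(run([(1e9, 5.0)]))                  # nominal voltage: about (20 s, 40 J)
print(run([(750e6, 5.0), (250e6, 2.5)]))  # two voltages: about (25 s, 32.5 J)
print(run([(1e9, 4.0)]))                  # single voltage: about (25 s, 25.6 J)
```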


Consider two tasks, task 1 and task 2:

Computation:

- task 1: 250×10^6 execution cycles; task 2: 750×10^6 execution cycles.

Deadline: 25 seconds.

Processor nominal (maximum) voltage: 5 V.

Energy:

- 40 nJ/cycle at nominal voltage.

- 25 nJ/cycle at VDD = 4 V.

Processor speed:

- 50 MHz (50×10^6 cycles/sec) at nominal voltage.

- 40 MHz at VDD = 4 V.

[Figure: the two tasks, 1 and 2.]



Find the voltage at which the tasks just meet their deadline, and you have minimised energy consumption!

[Figure: V^2 over time (0 to 25 sec). Task 1 (250×10^6 cycles) and task 2 (750×10^6 cycles) both run at 4 V and finish exactly at the deadline.]

Etotal = 25 J.


Considering Task Particularities

Energy consumed by a task:

$$E = \tfrac{1}{2}\, C\, V_{DD}^2\, N_{CY}\, N_{SW}$$

Average energy consumed by a task per cycle:

$$E_{CY} = \tfrac{1}{2}\, C\, V_{DD}^2\, N_{SW}$$

NSW = number of gate transitions per clock cycle.

C = switched capacitance per clock cycle.

Often tasks differ from each other in terms of executed operations: NSW and C differ from one task to the other.

The average energy consumed per cycle differs from task to task.


Considering Task Particularities (contd)

Consider two tasks, task 1 and task 2:

Computation:

- task 1: 250×10^6 execution cycles; task 2: 750×10^6 execution cycles.

Deadline: 25 seconds.

Processor nominal (maximum) voltage: 5 V.

Processor speed:

- 50 MHz (50×10^6 cycles/sec) at nominal voltage.

- 40 MHz at VDD = 4 V.

- 25 MHz at VDD = 2.5 V.

Energy, task 1:

- 50 nJ/cycle at VDD = 5 V.

- 32 nJ/cycle at VDD = 4 V.

- 12.5 nJ/cycle at VDD = 2.5 V.

Energy, task 2:

- 12.5 nJ/cycle at VDD = 5 V.

- 8 nJ/cycle at VDD = 4 V.

- 3 nJ/cycle at VDD = 2.5 V.

[Figure: the two tasks, 1 and 2.]



Here we have a solution with VDD = 4 V, and the deadline just fulfilled:

Etotal = 32 nJ/cycle × 250×10^6 cycles + 8 nJ/cycle × 750×10^6 cycles = 14 J

[Figure: V^2 over time (0 to 25 sec). Task 1 (250×10^6 cycles) and task 2 (750×10^6 cycles) both run at 4 V and finish exactly at the deadline.]



Here we run task 1 at VDD = 2.5 V and task 2 at VDD = 5 V; the tasks finish just on deadline.

Etotal = 12.5 nJ/cycle × 250×10^6 cycles + 12.5 nJ/cycle × 750×10^6 cycles = 12.5 J

[Figure: V^2 over time (0 to 25 sec). Task 1 (250×10^6 cycles) runs at 2.5 V and task 2 (750×10^6 cycles) at 5 V, finishing exactly at the deadline.]




If the energy consumption per cycle is not constant (but differs from task to task), the rule on slide 33 is no longer true.

Voltage levels have to be reduced with priority for those tasks which have a larger energy consumption per cycle.

One particular voltage level has to be established for each task, so that deadlines are just satisfied.
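
With only three voltage levels and two tasks, the best assignment for this example can be found by trying all combinations. The sketch below uses the speed and energy figures of the example; the exhaustive search is just an illustration, not the optimisation method of the lecture.

```python
from itertools import product

SPEED = {5.0: 50e6, 4.0: 40e6, 2.5: 25e6}  # cycles per second at each VDD
ENERGY = {                                  # joules per cycle, per task and VDD
    1: {5.0: 50e-9, 4.0: 32e-9, 2.5: 12.5e-9},
    2: {5.0: 12.5e-9, 4.0: 8e-9, 2.5: 3e-9},
}
CYCLES = {1: 250e6, 2: 750e6}
DEADLINE = 25.0  # seconds


def evaluate(voltages):
    """voltages: {task: VDD}. Returns (total time in s, total energy in J)."""
    time = sum(CYCLES[k] / SPEED[v] for k, v in voltages.items())
    energy = sum(CYCLES[k] * ENERGY[k][v] for k, v in voltages.items())
    return time, energy


print(evaluate({1: 4.0, 2: 4.0}))  # single voltage: about (25 s, 14 J)

# Try every voltage pair and keep those that meet the deadline.
feasible = []
for v1, v2 in product(SPEED, repeat=2):
    time, energy = evaluate({1: v1, 2: v2})
    if time <= DEADLINE + 1e-9:
        feasible.append((energy, v1, v2))

best_energy, best_v1, best_v2 = min(feasible)
print(best_v1, best_v2, best_energy)  # task 1 at 2.5 V, task 2 at 5 V, about 12.5 J
```

The task with the larger energy per cycle (task 1) gets the lower voltage, as the rule above states.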


Discrete Voltage Levels

Practical microprocessors can work only at a finite number of discrete voltage levels.

The ideal voltage Videal determined for a certain task may not be available on the processor.

A task is supposed to run for time texe at the voltage Videal.

On the particular processor, the two closest available neighbours to Videal are V1 and V2, with V1 < Videal < V2.

You have minimised the energy if you run the task for time t1 at voltage V1 and for time t2 at voltage V2, so that t1 + t2 = texe.
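
The split between the two voltages follows from two conditions: the times add up to texe, and the cycles executed at both speeds add up to the task's cycle count. The sketch below assumes, for illustration, the single-task example with Videal = 4 V unavailable and only 2.5 V (25 MHz) and 5 V (50 MHz) offered; the function name is hypothetical.

```python
def split_time(cycles, t_exe, f1, f2):
    """Solve t1 + t2 = t_exe and f1*t1 + f2*t2 = cycles for (t1, t2).

    f1 and f2 are the clock frequencies at the lower voltage V1 and the
    higher voltage V2.
    """
    t2 = (cycles - f1 * t_exe) / (f2 - f1)
    return t_exe - t2, t2


# 10^9 cycles, texe = 25 s, neighbours 2.5 V (25 MHz) and 5 V (50 MHz).
t1, t2 = split_time(1e9, 25.0, 25e6, 50e6)
print(t1, t2)  # 10.0 15.0

# Energy with 10 nJ/cycle at 2.5 V and 40 nJ/cycle at 5 V.
energy = 25e6 * t1 * 10e-9 + 50e6 * t2 * 40e-9
print(energy)  # about 32.5 J
```

The result is the schedule shown earlier with 750×10^6 cycles at 5 V and 250×10^6 cycles at 2.5 V.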


The Pitfalls with Ignoring Leakage

$$E = N_C\, C_{eff}\, V_{dd}^2 + L_g \left( V_{dd}\, K_3\, e^{K_4 V_{dd}}\, e^{K_5 V_{bs}} + V_{bs}\, I_{ju} \right) t$$

The first term is the dynamic energy; the second term is the leakage energy. So far we have minimised the first term and ignored the rest!

1. We don't optimize global energy but only a part of it!

2. We can even get it very wrong and increase energy consumption!

Dynamic energy decreases with Vdd, regardless of the increased time.

Leakage decreases with Vdd, but grows with time!


$$E = N_C\, C_{eff}\, V_{dd}^2 + L_g \left( V_{dd}\, K_3\, e^{K_4 V_{dd}}\, e^{K_5 V_{bs}} + V_{bs}\, I_{ju} \right) t$$

[Figure: energy per cycle (from 1e-10 to 8e-10) as a function of Vdd (from 0.5 to 1) for a 70nm technology, Crusoe processor (Jejurikar et al., DAC'04). Three curves are built up over successive slides: dynamic energy, leakage energy, and dynamic + leakage.]

Critical point! If you lower Vdd beyond this point, the total energy grows.
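
The existence of a critical point can be reproduced with a toy model. In the sketch below every constant is hypothetical and serves only to shape the curves; these are not the parameters of the 70nm Crusoe model. The body-bias term is omitted, and the cycle time is assumed proportional to 1/(Vdd - Vth), so that leakage energy per cycle grows as the processor slows down.

```python
import math

# Hypothetical constants, chosen only to obtain curves of the right shape.
C_EFF = 4e-10    # effective switched capacitance per cycle (F)
V_TH = 0.3       # threshold voltage (V)
A_LEAK = 2e-11   # lumped leakage and cycle-time constant
K4 = 1.8         # exponent coefficient of the subthreshold term


def dynamic_energy(vdd):
    """Dynamic energy per cycle: C_eff * Vdd^2."""
    return C_EFF * vdd ** 2


def leakage_energy(vdd):
    """Leakage energy per cycle: leakage power times a cycle time ~ 1/(Vdd - Vth)."""
    return A_LEAK * vdd * math.exp(K4 * vdd) / (vdd - V_TH)


def total_energy(vdd):
    return dynamic_energy(vdd) + leakage_energy(vdd)


# Scan Vdd from 0.35 V to 1.00 V in 10 mV steps and locate the minimum.
grid = [0.35 + 0.01 * i for i in range(66)]
v_crit = min(grid, key=total_energy)

print(f"critical point near Vdd = {v_crit:.2f} V")
print(f"total energy at 0.35 V: {total_energy(0.35):.3e} J")
print(f"total energy at critical point: {total_energy(v_crit):.3e} J")
```

Below the critical point the dynamic term keeps shrinking, but the leakage term grows faster, so the total rises again.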


Summary

Power consumption becomes a central issue for embedded systems design.

Power/energy consumption can be reduced by reducing supply voltage, switching activity, switched capacitance, and the number of executed cycles.

There are means at all levels of the design to reduce power consumption: circuit, logic, behavioral, architecture, and system level.

At system level we distinguish dynamic techniques (applied during run-time) and static techniques (applied at design time).


Summary (contd)

Dynamic power management is implemented by the operating system, and is mainly used in portable appliances to shut down or place in stand-by unused devices.

Typical policies for power management are: time-out, predictive, and stochastic.

Both at task mapping and at scheduling, design decisions can be made which have a huge impact on power/energy consumption.

Real-time scheduling in the context of processors with voltage scaling is extremely interesting. The main trade-off is voltage level vs. execution time. One has to find the optimal voltage levels such that energy consumption is reduced and deadlines are still fulfilled.
